Unless you are doing machine learning or using numpy, I do not recommend anyone use Python for anything performance sensitive. The problem is not just the GIL. Because multithreading is not common in Python, it's really hard to know whether an external library is thread-safe. Python also supports async, but a lot of libraries have no asyncio compatibility, so you end up mixing threads with asyncio, which leads to a big mess.
> I do not recommend anyone use python for anything performance sensitive
My default philosophy is to use python _until_ you find something that is performance sensitive, and then make a C/C++ extension for the slow bits. Pybind works great for a hybrid Python/C++ codebase (https://pybind11.readthedocs.io/en/stable/).
Then you can develop and prototype much quicker w/ Python but re-write the slow parts in C++.
Definitely more of a judgement call when threading and function call overhead enter the equation, but I've found this hybrid "99% of the time Python, 1% C++ when needed" setup works great. And it's typically easier for me to eventually port mature code to C++/Go/etc once I've fleshed it out in Python and hit all the design snags.
If you've never used Pybind before these pybind tests[1] and this repo[2] have good examples you can crib to get started (in addition to the docs). Once you handle passing/returning/creating the main data types (list, tuple, dict, set, numpy array) the first time, then it's mostly smooth sailing.
Pybind offers a lot of functionality, but the core "good parts" I've found useful are (a) use a numpy array in Python and pass it to a C++ method to work on, (b) pass your Python data structure to pybind and then do work on it in C++ (some copy overhead), and (c) make a class/struct in C++ and expose it to Python (no copying overhead, and you can create nice cache-aware structs, etc.).
You're still kinda stuck with the concurrency of the Python code itself, though. It sure would be nice to be able to just throw cores at problems for a while.
Sure, and if you can't get stuff in and out of Python objects with concurrency, it doesn't help you much a lot of the time. Plus, again, computing is cheap: it'd be nice to use all my cores before I spend a lot of effort optimizing and rewriting things in native code.
If your data is fragmented across a bunch of small containers/classes, passing it around will be expensive whichever method you use (either passing to C++, or just in terms of cache efficiency).
If you just pass an array of data back and forth it's cheap.
> If you just pass an array of data back and forth it's cheap.
Yes, and numpy is great, and all. Python works great as glue to marshal things to and from native code and do inexpensive (but possibly complicated) bits of control logic.
But if I'm trying to deal with large numbers of client requests, say... the lack of concurrency in python itself really hurts. Sure, I can punt almost everything to native code, but what's the point in having Python at all, then?
Not all problems have state that can be shared well across multiprocessing or completely externalized to large lumps that travel to native code in a few calls -- I'd actually say these are special-case exceptions rather than the rule.
> Note that having a solution setup where the end result is "a ton of small, individual API calls" could possibly indicate a bad system architecture.
Or just a lot of clients with a fair bit of shared state which is best kept resident, which is a pretty common use case.
It's a bummer to write Python code that works well, and then maxes out at 130% CPU load when you grow your usage... and to have no obvious path to scale upwards despite having 32 hardware threads available. Then you can rewrite some of the more expensive things in native code to squeeze out a little more performance, or add indirection to store the data somewhere else so multiprocessing works.
Other languages that have more finely grained locks scale 3-4x higher with minimal thought, and much, much higher with a bit of thought about how to handle locking and data model.
> At that point you'd look to Go or another language
Well, yah... this is us complaining about Python's concurrency problems.
I think the question is how much more cost does it take to move the code from python to C/C++/Rust/whatever? That's a human problem until ChatGPT can solve that problem for you.
And if you are using for example Numpy, you aren’t using Python for anything performance sensitive of course, because Numpy is almost certainly calling the system’s tuned BLAS implementation. Which should handle the parallelism I guess. If anything I’d expect parallel Python calls to Numpy to result in oversubscription…
I don't think numpy functions are necessarily multithreaded, and many are probably inherently sequential by nature, so there are definitely cases where multiprocessing can speed up the overall program.
Someone once said that python + numpy is probably going to be faster than writing it using basic C++, since numpy is using highly tuned libraries underneath.
I don't know for certain this is the case, but I'd like to see some benchmarks about it.
You would almost never use raw C++ when working with linear algebra stuff. You use a library like Eigen that interfaces with BLAS, LAPACK, etc., so you definitely get all the advantages of those highly tuned libraries, plus the speed of C++ and potential flexibility of not having to make multiple array copies and so on.
They aren’t necessarily threaded, but if you care about Numpy performance on an Intel chip at least you are already using MKL for Numpy’s BLAS, and MKL’s gemm is threaded.
Multiplying a large enough matrix in Python using MKL for Numpy, I can watch the cpu usage go to 400% in top. You may need to run it in a loop or make the matrices quite large, a surprisingly large amount of computation has to happen before it’ll show up in top.
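A minimal sketch of the experiment described above, assuming NumPy is linked against a threaded BLAS (MKL or OpenBLAS); the matrix size and the `OMP_NUM_THREADS` cap are illustrative values, not anything from the original comment, and the env var only takes effect if set before NumPy is imported:

```python
# Sketch: trigger BLAS threading with a large matmul. Thread count can be
# capped via OMP_NUM_THREADS / MKL_NUM_THREADS, set BEFORE importing numpy.
import os
os.environ.setdefault("OMP_NUM_THREADS", "4")  # illustrative cap

import time
import numpy as np

n = 1500  # large enough that the gemm dominates interpreter overhead
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b  # dispatched to the BLAS gemm; may fan out across cores
elapsed = time.perf_counter() - start

print(f"{n}x{n} matmul took {elapsed:.3f}s")
```

Watching `top` while this runs in a loop is the easiest way to confirm whether your NumPy build actually threads the multiply.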
I was working on a backup system, the usual. Walk a directory tree, track new/changed files, then queue them for sha256 and encryption, then upload to a server with gRPC.
I hit the GIL and switched to multiprocessing, which helped. It was still about half as fast as I expected. I switched to Go, used channels, and got the performance I expected. I was still debating until I got deeper into Python's crypto package. I ended up really happy with Go.
Yes. I've used multiple threads in Python. It doesn't work well. Some packages, including cPickle, don't work right with multiple threads because they have static variables internally. It can work; I've had multi-threaded Python code running for years. But it's not a good approach for new work.
Python does things the CPython naive interpreter can do easily, such as letting anything modify anything else. Any code anywhere can go find something far away in another thread in another module and mess with it.
Everything is a dict, so that works. This makes other things hard. Pre-compilation is hard. Optimizing is hard. JIT is hard. Threading is hard. You can't nail down stuff that probably won't change, but might.
Possible. I had a fair amount of CPU work going on, trying to keep 8 cores busy with SHA256(plaintext), encrypt, and SHA256(encrypted_blob). I was just trying to be straightforward: keep files queued so that when a core went idle it could grab another.
The Go version was similarly very straightforward: walktree -> CheckIfNewOrchanged -> channel -> sha256/encrypt/sha256. Channels made it really easy/clear and performed quite well. I was getting near-linear scaling: CPU time consumed was 8x the wall clock, and speed increased 7.9x or so. With Python I was getting significantly less performance per core and worse scaling.
With 8 cores, 10 Gbit/sec is 1.25 GByte/sec, or about 160 MB/sec per core, which is ok, not great; it depends on how much computation you are doing. My goal is keeping 100 Gbit saturated, but I am adding cores as well.
I do hope to compare Go vs Rust as well.
Channels are just threadsafe queues with language support and an N:M threading model.
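To make that comparison concrete, here is a rough Python analogue of a channel using `queue.Queue` and a small thread pool; the sentinel-`None` shutdown convention is just one common idiom, not the only way to close the "channel":

```python
# Sketch: queue.Queue + threads as a poor man's channel.
import queue
import threading

tasks: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()

def worker() -> None:
    while True:
        item = tasks.get()
        if item is None:  # sentinel: the "channel" is closed
            break
        results.put(item * item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)  # one sentinel per worker
for t in threads:
    t.join()

total = sum(results.get() for _ in range(10))
print(total)
```

The missing piece relative to Go is the N:M scheduler: these are OS threads serialized by the GIL for pure-Python work, so this pattern only pays off when the work releases the GIL (I/O, hashing, numpy).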
Don't get me wrong; I agree it's easier to build performant applications in go, and to get the performance I want, I have to set my AWS boto3 S3 settings to have massive queues.
Leaving aside for a moment the unexpected threaded-fork issue mentioned, I'm thinking that putting all the data to be shared among the processes into Redis would at least be a lot better than the clumsiness of pickling and unpickling.
I found it curious the author mentioned using multiprocessing with pickle but not Pipe(). Pickling streams entire objects, while Pipe() can be used to send data between processes directly. Maybe the latter is faster, especially if the data are short strings and the like?
I think they mentioned pickling because that's what the multiprocessing queue uses by default.
I think Pipe isn't necessarily a drop-in replacement, depending on the complexity of the object you want to share, but I have found it significantly faster for simple things.
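A quick sketch of the distinction being made here: `Pipe()` connections have both a pickling path (`send`/`recv`, for arbitrary objects) and a raw-bytes path (`send_bytes`/`recv_bytes`) that skips serialization entirely. Both ends are used in one process below purely for illustration:

```python
# Sketch: multiprocessing.Pipe's two transfer modes.
from multiprocessing import Pipe

parent, child = Pipe()  # duplex connection pair

# Object path: the dict is pickled on send and unpickled on recv.
parent.send({"id": 1, "msg": "hello"})
obj = child.recv()

# Raw path: bytes go through as-is, no pickle overhead.
parent.send_bytes(b"hello")
raw = child.recv_bytes()

print(obj, raw)
```

For short strings or pre-encoded payloads the raw path is where the speedup over a `multiprocessing.Queue` comes from; for complex objects you end up pickling either way.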
Yup, if you need to serialize a complex object manually before piping it, then you end up paying the price again (like with the multiprocessing Queue).
I'm guessing that's part of the reason the article didn't mention it (it looks like they're talking about a Pandas DataFrame which I would say is non-trivial--compared to a primitive type)
I'd think Pipe + Parquet should beat filesystem though. That really depends on storage I guess
IIRC I played a bit with msgpack and orjson to see if there was anything to gain over pickle, but I don't think it made much difference. You'd probably need to deal with structs.
Looking at the CPython source (3.10): on Windows you always get a named pipe. On other platforms you get an OS pipe when duplex=False, otherwise a socket (socket.socketpair).
This has the advantage of allowing work to be done inside a nested function, letting large initial datasets be shared rather than passed through pickle.
See the article for an explanation of why this is a bad idea. Note that on macOS, Python disabled fork()-based multiprocessing as the default long ago (Python 3.8?) because it's so broken, and it will stop being the default for multiprocessing on Linux in 3.14 (deprecated since 3.12).
Yes, and that's a good thing. Let numpy do its threading unless you know exactly why you don't want it to parallelize your matrix multiplications and what have you. Numpy releases the GIL, so that works exactly as it should.
Threads aren't bad: threads that cause resource contention with the GIL are (possibly) bad. That is almost always done explicitly by the developer and almost never without your knowledge.
Unfortunately NumPy's thread pool is only for BLAS, the underlying library for numpy.linalg functions, mostly. Other operations are single-threaded. So you need your own thread pool (or process pool) if you want to parallelize anything else.
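Since NumPy releases the GIL inside most large array operations, that "own thread pool" can be an ordinary `ThreadPoolExecutor` rather than a process pool. A sketch, with illustrative sizes; note the caveat that tiny per-call workloads still serialize on interpreter overhead:

```python
# Sketch: thread pool over non-BLAS numpy work (element-wise ops run in C
# with the GIL released, so threads can overlap).
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def row_work(row: np.ndarray) -> float:
    # Stand-in per-row computation; large enough chunks are what make
    # threading pay off here.
    return float(np.sqrt(row).sum())

data = np.arange(1_000_000, dtype=np.float64).reshape(100, 10_000)

with ThreadPoolExecutor(max_workers=4) as pool:
    sums = list(pool.map(row_work, data))

print(len(sums))
```

Unlike a process pool, nothing is pickled: each thread reads the shared array in place.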
Yes, I know. But your statement makes it sound like numpy using threads is somehow bad or undesired (it can be; but if it is, you'll know how to tell numpy not to thread.)
Replaced print() with sys.stderr.write() to make it compatible with Python 2.
The program still deadlocks on Python 3 but works perfectly on Python 2. Does anyone know what changed between implementations that could be triggering this issue?
I don't believe OP's deadlock is related to the rare event of a dead thread holding a lock. The problem is that any calls to multiprocessing objects, if not guarded by if __name__ == '__main__', will result in an infinite-loop / fork-bomb kind of situation.
For some reason, on OP's computer that probably makes the program appear to hang, but really it will just crash after a while, once it exhausts some resource.
Well, it's hard for me to tell from this stack trace what's happening here. I don't recognize PyObject_VectorcallMethod. Is this a new thing? Last time I wrote any C code for Python I don't recall seeing / using this.
Anyway, another common pitfall in multiprocessing is attempting to serialize multithreading/multiprocessing primitives such as locks, variables, or mutexes. My memory may fail me, but I think it may result in deadlock too. I think the multiprocessing code tries to guard against it, but there are some weird rules for when it's OK for serialized objects to contain those primitives (initialization in __init__ is fine, but not so much otherwise, or something like that), and the check isn't very good / just a heuristic... But really, I don't remember this part well.
2. Part of the multiprocessing code (perhaps not present in Python 2) also grabs this lock.
3. If you fork at the right moment (which is quite likely with the loop) the lock is held by a thread that is now dead, and so now you're waiting for a lock to release that will never be released.
It's interesting that you decided to repeat this diatribe for the millionth time expecting... what effect exactly?
Stroustrup licked every boot and inserted himself into every possible committee to promote his language until the network effect picked up. And later he post-rationalized about his language's popularity, attributing it to qualities the language never had.
And you proceeded to conclude, based on the diatribe you repeated, that nobody should want good things, because everyone has to be satisfied with popular things, because you chose to eat garbage and will be too jealous to see others getting better treats?
----
Yes. Python is garbage, and programmers should be actively discouraged from using it. Python is today in the hands of people both incapable and unwilling to improve the language in the aspects that matter, and that's why programmers need to try to defeat the network effect it created instead of encouraging more complacency. It is pretty much in the same situation as C++, so, unintentionally, you sort of guessed the direction. Both languages started small and rode the popularity wave without actually filling the gaps in the original design with worthwhile content. Kind of like a TV show that keeps capitalizing on its pilot while never creating a more engaging experience.
Exactly. That's why OP suggested Erlang. I switched from Python to Erlang and never had issues with GIL or scaling or operability. Elixir shares the same BEAM VM so that's another option as well.
I hate this quote. People still use things like floppy drives and fax machines in certain applications not because those devices are better but because of institutional inertia and bureaucracy. And the stuff we use isn't always the best thing out there, better products can still lose. Just look at Oracle.
> And the stuff we use isn't always the best thing out there, better products can still lose.
Better products will lose most of the time, unless they're significantly better as a product, to the point where they can hurt and displace the incumbents. It's what people found out with Plan 9 vs UNIX.
I think you wanted to write "niche programming language". I don't know what "niche programming" is. So, I'll answer the question as if it was about the language.
1. In my experience (I have both had to learn a language for a job and had to teach a language to newcomers), for an experienced programmer, learning a new language well enough to be productive takes a couple of months. In the kind of projects I work with, it's typical that a programmer won't be productive for several months anyway, due to having to learn the project's structure.
2. I would prefer to work with people willing to learn something new, or those who already expanded their horizons enough to have experience with better-than-average language. It's a natural filter against people I don't like to work with.
3. Infrastructure created around languages is a double-edged sword. On one hand you get free stuff in the form of community-provided libraries; on the other hand, the quality varies a lot and you often have to make uneasy compromises, held hostage by third-party bad programming practices. In particular, when it comes to Python, since I've often been responsible for auditing dependencies used in our projects, without a trace of remorse I can confidently tell you: all Python packages are of poor quality. You are forced to choose the best of the worst, and it hurts a lot to take on another dependency. Erlang has less infrastructure, and in many cases you'll be the master of your own libraries. It takes longer to build, but it has the same effect as living in your own house vs renting.
4. I don't like working in huge programming shops where it's important to take into account the dynamics of the job market. You may hope to have a company with five or ten good programmers. There's no hope of having a company with a thousand good programmers. In the latter scenario, you want to rely on simplified processes and practices to produce something of quality. In the former case you have a chance of just being good. It's similar to SOF army units. You cannot have the entire army be SOF, but SOF will have very different arms, tactics, etc.
----
Why not Go?
I'm not necessarily opposed to it. But my point was to show something that has existed for a long time, and Go isn't a good example of that. Conceptually, Erlang is in some ways closer to Python (being based on a VM rather than compiling to different targets depending on platform). Today, Python drifts more and more towards becoming Java, so it also becomes more similar to Go, but in its origins it favored a short, interactive development cycle, which is also how Erlang is.
So, I was just looking for an example where a programmer would be able to keep similar workflow, but get a net benefit of using a different environment.
> Erlang has less infrastructure, and in many cases you'll be the master of your own libraries
Which, funny enough, are also going to be of "poor quality" when judged by external observer.
At least with Python packages their pitfalls are known and documented. But I get it, NIH can be very enjoyable.
> In the kind of projects I work with, it's typical that a programmer won't be productive for several months anyways due to having to learn about the project's structure.
I can't help but notice that you're shifting the blame to the type of project to protect your precious tech stack.
I actually read the article. And it's the thousand-and-first such article about the same problem. And the "solution" the article suggests is a non-solution. It's a ridiculous crutch. But people who are used to Python are accustomed to living with crutches and patches where none are really necessary.
Python's multiprocessing doesn't hold a candle to Erlang's processes. It's a subprocess.Popen('python') with extra steps and lots of pitfalls. Erlang went through several iterations of process schedulers, several iterations of interfacing with foreign code from Erlang processes... Python developers don't even know yet they will have to solve these problems once multiprocessing becomes more mature (but it never will, so, who cares?)
Yes, who cares. You apparently. But mostly from an identity politics perspective stomping on Python developers.
Yet you know nothing about the particular context in which people would use Python multiprocessing. And I bet that, although not ideal from a purity standpoint, it is likely to be "good enough" for a lot of use-cases.
You seem so bitter and superior about those Python developers, yet here you are, complaining like a bitter old person and adding nothing of value to the discussion. I really feel sorry for you.
Python, notwithstanding certain limitations, will remain a mainstay language beyond 2030. It is a practical option for writing glue code and scripts, and almost all of data engineering and related fields depend on Python.
Java is still a mainstay language, despite its popularity slowly dropping for more than a decade.
Python will be around for many years; however, now that we have hit the end of Moore's law, the easiest way to speed up code is to multi-thread.
I code exclusively in Python, so I can't judge if languages like Julia are serious contenders. I also agree that building up the same ecosystem as in Python will take some time, and therefore I do not predict a rapid decline. However, many of the things I use Python for can be easily done in another scripting language.
if other languages don't have equivalents for the devex/productivity enhancements of:
* Django
* FastAPI/Flask
* Numpy/Pandas
* PySpark
* Jupyter Notebooks (gross I know but this is what "data analysts" and "ML/data engineers" use at many places)
Then Python will stick around for forever.
I would love it if the Go community would quit the "you don't need a framework, DUH, it's GO" attitude. Make a Django/Rails for Go and there would be 10x the Go jobs.
Ironically, you mention PySpark. It is a Python wrapper around a Java/Scala code base. Jupyter Notebook supports many different scripting languages, such as Julia and R.
I agree that Numpy/Pandas will introduce more migration friction, and that is why I mentioned a slow death, similar to the trajectory Java is currently on. Java is still in the top three. However, its popularity has dropped in recent years. It is worth noting that both Cobol and Fortran are still in the top 30 programming languages.
it is funny that PySpark is just a Java/Scala wrapper, but the combo of PySpark + NumPy + Pandas + Jupyter notebooks means that Python is option 1, 2, and 3 for any company that's just like "hmm, let's start doing some data analysis/ML/whatever"!
Also R is what's most often taught in school in my experience and boy does Python feel like a breath of fresh air when you've been trained in R. When you're coming out of college trained in data analysis but not software engineering per se, you've got no idea about the larger world of what other languages could offer.
I've gotten into the "debate." People always suggest using anything but python but then provide no alternatives to the frameworks mentioned above. You _could_ certainly write it in Go. I write a lot of stuff in Go and I love it. I just don't want to recreate Django for the 100th time though, so I don't and I just use Python.
By all means, if someone wants to write these frameworks, people will use them.
agreed. I write Go, Rust, and Python (now only when I have to), but if I had to stand up a small business with a CRUD app in a couple of weeks I'd go with Django, no doubt - mainly for the no-worries auth and user management within the same monolith as everything else. I don't care about the ORM, I can quickly map out a relational schema. It's the user stuff that's killer.
With more time, yeah, I might choose Go or Rust and set up a couple of different nodes: an API gateway, a user-auth node like KeyCloak, Ory, or SuperTokens, then a Go backend.
But it would be so nice to have that all ready in one.
I do enjoy building things but I learn more and more how important it is not to focus on things that are already "solved" if they aren't part of your core business.
What's with the scare quotes on "data analysts"? Just because you read on HN about how some such tool is bad, doesn't actually make it bad. Hating on notebooks is like hating on Excel: it's trendy in some circles, but generally only espoused by the ignorant.
I've held jobs where we just productionalized data analyst code from notebooks into data pipelines.
There's absolutely nothing wrong with notebooks, but they are a serious PITA to take from being basically the musings of a data intern to a production pipeline.
I used scare quotes because unfortunately in many places there are essentially zero qualifications to start running around doing those jobs.
All of this exists in, basically, any mainstream language in some capacity. A lot of them offer superior alternatives.
There's nothing about Python's quality that's worth keeping. The reason for Python's popularity is its popularity. The reason why it won't die is inertia.
But if, say, someone creates a "killer app" that runs on hardware different enough from anything we have today, and that someone hates Python (smartphones are the most recent example of such a shift), then there'd be a chance to dislodge it. But I struggle to see how Python would "organically" die.
This is just false (spoken from the perspective of an infra/systems/automation person who has been in the field for many years). But this is very typical of religious people: to claim anything that suits their agenda without even trying to verify their claims, not even bothered by whether the claim makes sense.
First of all, there's no such thing as the "development speed of a language," just like there isn't a development speed of ice cream. It's a kind of Jabberwocky: it feels like English, but it doesn't really mean anything.
Development speed differs by the kind of project (e.g. web site vs filesystem), quality requirements, size of the team working on the project, expertise of the team working on the project... Needless to say, there are areas where Python is entirely inapplicable, so development speed wouldn't even be a factor. But even in areas where it's commonly used, there are often languages that will compete on this metric, and there are certainly teams using other languages that will beat teams using Python.
But overall, Python projects tend to be easy to start and hard to develop further and refine. Python projects tend to fare worse in large teams. Python also doesn't attract high-caliber programmers, while it is often the first language a programmer learns, so it tends to be populated by mediocre-to-bad programmers (a similar problem used to exist in Java before it was replaced by Python in intro CS courses).
Finally, a huge portion of development speed rests on the company's infrastructure: how quickly and reliably developers can test their code plays a tremendous role in productivity. Ironically, Python tooling is so bad that sometimes it's faster to compile a C++ program of equivalent size than to install a bunch of Python packages implementing the same thing (god forbid you are using Anaconda, which in the worst case can take hours or days to install a handful of packages).