We went with this approach. Pandas hit GIL limits, which made it too slow. Then we moved to Dask and hit GIL limits on the scheduler process. Then we moved to Spark and hit JVM GC slowdowns from the amount of allocated memory. Then we burned it all down and became hermits in the woods.
I have decided that all solutions to questions of scale fall into one of two general categories. Either you can spend all your money on computers, or you can spend all your money on C/C++/Rust/Cython/Fortran/whatever developers.
There's one severely under-appreciated factor that favors the first option: computers are commodities that can be acquired very quickly. Almost instantaneously if you're in the cloud. Skilled lower-level programmers are very definitely not commodities, and growing your pool of them can easily take months or years.
Buying hardware won't give you the same performance benefits as a better implementation/architecture.
And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
>And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
And yet, people still regularly choose to go down a path that leads there. Because business decisions are about satisficing, not optimizing. So "I'm 90% sure I will be able to cope with problems of this type, but it might cost as much as $10,000,000" is often favored over "I am 75% sure I might be able to solve problems of this type for no more than $500,000," when the hypothetical downside of not solving it is "We might go out of business."
On the other hand, you can't argue with a computer. That may explain why some of my coworkers seem to behave as if they wish they were computers...
It's too difficult to renegotiate with computers, too easy to renegotiate with people. When you don't actually know what you need to do, you need people. When you think you know what you need to do, but you're wrong, then you really need people. Most of us are in the latter category, most of the time.
At some point, you just mmap shared system memory as read-only, giving every process direct access to your dataset, like the good old days (or any embedded system).
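A minimal sketch of that idea in Python, using only the stdlib `mmap` module (the file name and contents are made up for illustration). Because the mapping is shared and read-only, every process mapping the same file sees one physical copy via the page cache:

```python
import mmap
import os

# Hypothetical dataset file; any large binary file works the same way.
path = "dataset.bin"
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)  # stand-in for a real dataset

with open(path, "rb") as f:
    # A read-only mapping: pages are backed by the OS page cache, so
    # multiple processes mapping this file share one copy in memory.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:16]  # random access without reading the whole file in
    mm.close()

os.remove(path)
```

`ACCESS_READ` is the portable flag; on Unix you could equivalently pass `prot=mmap.PROT_READ`.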
We were trying to keep everything on one machine in (mostly) memory for simplicity. Once you open up the Pandora's box of distributed compute there are a lot of options, including other ways of running Spark. But yes, in retrospect, we should have opened that box first.
I have solved a similar problem in a similar way, and I've found Polars <https://www.pola.rs/> solves this quite well without needing ClickHouse. It has a Python library but does most processing in Rust, across multiple cores. I've used it for data sets up to about 20GB no worries, but my computer's RAM became the issue, not Polars itself.
We were using 500+ GB of memory at peak and were expecting that to grow. If I remember correctly, we didn't go with Polars because we needed to run custom apply functions on DataFrames. Polars had them, but the function took a tuple (not a DataFrame or dict), which when you've got 20+ columns makes for really error-prone code. Dask and Spark both supported a batch transform operation, so the function took a Pandas DataFrame as input and output.
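The batch-transform shape described above can be sketched with plain pandas (the column names and `enrich` function are hypothetical); with Dask the same function would be handed to `map_partitions`, and with Spark to `mapInPandas`:

```python
import pandas as pd

def enrich(pdf: pd.DataFrame) -> pd.DataFrame:
    # Operates on named columns, so 20+ columns stay readable,
    # unlike a positional tuple per row.
    out = pdf.copy()
    out["ratio"] = out["numerator"] / out["denominator"]
    return out

batch = pd.DataFrame({"numerator": [1.0, 4.0], "denominator": [2.0, 8.0]})
result = enrich(batch)

# With Dask (assumed wiring, not run here): ddf.map_partitions(enrich)
# With Spark: df.mapInPandas(lambda it: map(enrich, it), schema=...)
```

The point is that the transform sees a whole named-column batch rather than one row at a time, which is both safer to write and cheaper to execute.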
Have you tried Vaex? It is a pandas-like Python library that uses C++ underneath, with memory mapping and optimized memory-access patterns. Allegedly it's very fast up to at least 1 TB; I've used it for 10-15 GB.