We went with this approach. Pandas hit GIL limits, which made it too slow. Then we moved to Dask and hit GIL limits on the scheduler process. Then we moved to Spark and hit JVM GC slowdowns from the amount of allocated memory. Then we burned it all down and became hermits in the woods.
I have decided that all solutions to questions of scale fall into one of two general categories. Either you can spend all your money on computers, or you can spend all your money on C/C++/Rust/Cython/Fortran/whatever developers.
There's one severely under-appreciated factor that favors the first option: computers are commodities that can be acquired very quickly. Almost instantaneously if you're in the cloud. Skilled lower-level programmers are very definitely not commodities, and growing your pool of them can easily take months or years.
Buying hardware won't give you the same performance benefits as a better implementation/architecture.
And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
>And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
And yet, people still regularly choose to go down a path that leads there. Because business decisions are about satisficing, not optimizing. So "I'm 90% sure I will be able to cope with problems of this type, but it might cost as much as $10,000,000" is often favored over "I am 75% sure I might be able to solve problems of this type for no more than $500,000," when the hypothetical downside of not solving it is "We might go out of business."
On the other hand, you can't argue with a computer. That may explain why some of my coworkers seem to behave as if they wish they were computers...
It's too difficult to renegotiate with computers, too easy to renegotiate with people. When you don't actually know what you need to do, you need people. When you think you know what you need to do, but you're wrong, then you really need people. Most of us are in the latter category, most of the time.
At some point, you just mmap shared system memory as read-only, giving every process direct access to your dataset, like the good old days (or any embedded system).
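A minimal sketch of that idea in Python, using only the stdlib `mmap` module (the file name and contents are made up for illustration). Because the mapping is shared and read-only, every process mapping the same file sees one physical copy via the page cache:

```python
import mmap
import os

# Hypothetical dataset file; any large binary file works the same way.
path = "dataset.bin"
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)  # stand-in for a real dataset

with open(path, "rb") as f:
    # A read-only mapping: pages are backed by the OS page cache, so
    # multiple processes mapping this file share one copy in memory.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:16]  # random access without reading the whole file in
    mm.close()

os.remove(path)
```

`ACCESS_READ` is the portable flag; on Unix you could equivalently pass `prot=mmap.PROT_READ`.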
We were trying to keep everything on one machine in (mostly) memory for simplicity. Once you open up the Pandora's box of distributed compute there are a lot of options, including other ways of running Spark. But yes, in retrospect, we should have opened that box first.
I have solved a similar problem in a similar way, and I've found Polars <https://www.pola.rs/> solves this quite well without needing ClickHouse. It has a Python library but does most processing in Rust, across multiple cores. I've used it for data sets up to about 20GB no worries, but my computer's RAM became the issue, not Polars itself.
We were using 500+ GB of memory at peak and were expecting that to grow. If I remember correctly, we didn't go with Polars because we needed to run custom apply functions on DataFrames. Polars had them, but the function took a tuple (not a DataFrame or dict), which when you've got 20+ columns makes for really error-prone code. Dask and Spark both supported a batch transform operation, so the function took a Pandas DataFrame as input and output.
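The batch-transform shape described above can be sketched with plain pandas (the column names and `enrich` function are hypothetical); with Dask the same function would be handed to `map_partitions`, and with Spark to `mapInPandas`:

```python
import pandas as pd

def enrich(pdf: pd.DataFrame) -> pd.DataFrame:
    # Operates on named columns, so 20+ columns stay readable,
    # unlike a positional tuple per row.
    out = pdf.copy()
    out["ratio"] = out["numerator"] / out["denominator"]
    return out

batch = pd.DataFrame({"numerator": [1.0, 4.0], "denominator": [2.0, 8.0]})
result = enrich(batch)

# With Dask (assumed wiring, not run here): ddf.map_partitions(enrich)
# With Spark: df.mapInPandas(lambda it: map(enrich, it), schema=...)
```

The point is that the transform sees a whole named-column batch rather than one row at a time, which is both safer to write and cheaper to execute.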
Have you tried Vaex? It is a pandas-like Python library that uses C++ underneath, with memory mapping and optimized memory-access patterns. Allegedly it's very fast up to at least 1 TB; I've used it for 10-15 GB.