
Ok now I have 100s of columns. I should do this for every single one in every single dataset I have?


It takes like 5 minutes, and once you are in the habit it's something you do automatically as you write the code and so it doesn't actually cost you extra time.

Efficient representation should be something you build into your data model; it will save you time in the long run.

(Also, if you have 100s of columns you're hopefully already benefiting from something like NumPy or Arrow or whatever, so you're already doing better than you otherwise would be...)
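As a minimal pandas sketch of "building it into the code as you write it": declaring compact dtypes at read time is one line per column. The column names and CSV contents here are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for a real file on disk
csv = io.StringIO("user_id,status,score\n1,active,0.5\n2,inactive,0.25\n")

df = pd.read_csv(
    csv,
    dtype={
        "user_id": "int32",    # default would be int64
        "status": "category",  # repeated strings stored as small integer codes
        "score": "float32",    # default would be float64
    },
)
```

The dtype dict is exactly the kind of thing that takes a few minutes to write once and then travels with the loading code forever.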


> It takes like 5 minutes, and once you are in the habit it's something you do automatically as you write the code and so it doesn't actually cost you extra time.

This is the argument I've been having my whole career with people who claim the better way is "too hard and too slow".

I'm like "gee, funny how the thing you do the most often you're fastest at... could it be that you'd be just as fast at a better thing if you did it more than never?"


Hey, programmer time is expensive. It is our duty to always do the easiest, most wasteful thing. /s


Future me's time is free to today me. :wink:


As an individual contributor you have an incentive to approach a problem in the way that teaches you the most for your career, then pretend it's the approach with the best effort-to-risk ratio.


It doesn't necessarily have to teach you the most. E.g. at Google, if it gets you promoted, that's also good.

Promotable projects at Google even have (or at least used to have) a complexity requirement. You can guess where the incentives lead.


But premature optimization is the root of all evil! I'm a better programmer for actively ignoring these optimizations! /s


But if I change this code, I have to change them all!

Good thing the status quo requires no evidence, but any change we want to propose? Impossibly high standards.


Hah, I'd love to work with the datasets you work with if it takes five minutes to do this. Or maybe you're just suggesting it takes five minutes to write out "TEXT" for each column type?

The data I work with is messy: from handwritten notes, multiple sources, millions of rows, etc. etc. A single data point written as "one" instead of 1 makes your whole idea fall on its face.


For pile-of-strings data, there are still things you can do. E.g. in Pandas, if there are a small number of different values, switch to categoricals (https://pythonspeed.com/articles/pandas-load-less-data/ item 3). And there's a new column type for strings that uses less memory (https://pythonspeed.com/articles/pandas-string-dtype-memory/).
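To illustrate the categorical point with a small sketch (made-up data, not the linked articles' benchmarks):

```python
import pandas as pd

# A pile-of-strings column with only a few distinct values
s = pd.Series(["north", "south", "east", "west"] * 250_000)
cat = s.astype("category")

plain_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
# The categorical stores each cell as a small integer code plus one copy
# of each distinct label, so cat_bytes comes out far below plain_bytes
```

The savings scale with how repetitive the column is; a column of mostly-unique strings gains little from this.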


Tried that in the past, but it's really slow. Pandas is effectively removed from my workflows because of issues like this.

But I have workarounds for these issues: load everything into Postgres as TEXT columns in a "raw" schema, then run typecast tests down a descending list of types to find the smallest possible type, and transfer the result to a new table in a "prod" schema. It's read-only data, so it's not a big deal to run it once, and it builds out a chain of changes from CSV -> SQL.
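The descending-typecast idea can be sketched in plain Python (a hypothetical helper, not the commenter's actual SQL; the Postgres type names are just the return labels):

```python
def smallest_type(values):
    """Return the narrowest Postgres-ish type that fits every string value.

    Tries candidate types from strictest to loosest, mirroring the
    TEXT-first, cast-down-later approach described above.
    """
    def all_parse(parse):
        try:
            return [parse(v) for v in values]
        except ValueError:
            return None

    ints = all_parse(int)
    if ints is not None:
        if all(-32768 <= i <= 32767 for i in ints):
            return "SMALLINT"
        if all(-2**31 <= i < 2**31 for i in ints):
            return "INTEGER"
        return "BIGINT"
    if all_parse(float) is not None:
        return "DOUBLE PRECISION"
    return "TEXT"  # a stray "one" instead of 1 falls through to here
```

A column of small integers lands on SMALLINT; a single "one" knocks the whole column back to TEXT, which is exactly the messy-data failure mode mentioned upthread.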

Something like this could be done with pickling to avoid having to re-type every time I run the code (and I've done that for some past projects, but it's... ehhh).


Perhaps do the data-cleaning step before loading into a data frame? (Dataframes are, after all, for canonicalized+normalized data, just like RDBMS tables are.)


No way. The initial load into a dataframe takes way too long to be useful for exploratory work. Loading it into a database is done once, and then you forget about it. In the long run, the time wasted loading things into dataframes over and over and over again just isn't worth it. Keep in mind that we're talking about large datasets that may or may not fit into memory.

"Oops, messed up my import slightly. Gotta run it again and wait ages.... again"

"Oops, loaded the dataframe twice by accident and hit an OOM. Gotta restart from the beginning... again"

"Oops, forgot to .head(5) my dataframe and jupyter's crashed... again..."

Doing everything in SQL solves so many problems. And OOM is practically a non-issue.


For exploratory work, perhaps you should randomly sample part of the dataset (say 1k rows) and work on that? Once you're getting good results, you then switch to dealing with the whole dataset.
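One stdlib-only way to pull that 1k exploratory sample without loading the whole file (a sketch: reservoir sampling over any row iterator, e.g. a csv.reader):

```python
import random

def reservoir_sample(rows, k=1000):
    """Uniformly sample k rows from an iterator of unknown length,
    holding at most k rows in memory (Vitter's Algorithm R)."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Keep each later row with probability k / (i + 1)
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample
```

Since it's a single streaming pass, it sidesteps the OOM problems described above entirely.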


No need in my workflows since unix tools solve that exploratory starting point. Just about everything else is SQL.


Is enough data generated from handwritten notes that the memory cost is a serious problem? I was under the impression that hundreds of books worth of text fit in a gigabyte.


It can be! I have a 50M-row dataset of handwritten notes that's already large on disk, but when loaded into pandas it blooms way larger than its underlying files.
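The bloom is mostly per-string object overhead. A stdlib-only illustration (the note text is made up):

```python
import sys

note = "patient reports mild headache"  # hypothetical handwritten-note text
raw_bytes = len(note.encode("utf-8"))   # what the bytes cost on disk
obj_bytes = sys.getsizeof(note)         # what one Python str object costs

# A default pandas object column pays roughly obj_bytes (plus an 8-byte
# pointer) per cell, so short strings can take several times their
# on-disk size in RAM
```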


5 minutes per column or 5 minutes per dataset?

If per column, then that is hopelessly slow: 500+ minutes per dataset, and I may have dozens of datasets.


You'll need to decide on a case-by-case basis. Many datasets I work with are generated by machines, come from network cards, etc. - these are quite consistent. Occasionally I deal with datasets prepared by humans, and those are mediocre at best, so in those cases I spend a lot of time cleaning them up. Once that's done, I can clearly see whether some columns can be stored in a more efficient way or not. If the dataset is large, I do it, because it gives me extra freedom if I can fit everything in RAM. If it's small, I don't bother; my time is more expensive than the potential gains.


Assuming your data is not ephemeral and you have some way to re-ingest it from a full-precision data store, why not?

Store at full precision, process at reduced precision: a story as old as time.


Yes?



