
I also work with foreign CSVs regularly. I'll have to try the Python Way next time I have a weird file to work with.

I typically use PowerShell to convert files from an unknown CSV format to a known one so they're easier to work with, and I've found it easy to iterate with.
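For comparison, the same "unknown format in, known format out" step could look roughly like this in Python using only the standard library. This is a minimal sketch, not a translation of any particular PowerShell script: the function name `normalize_csv`, the candidate delimiter list, and the sample size are all assumptions made for illustration, and `csv.Sniffer` is a heuristic that can guess wrong on unusual files.

```python
import csv
import io


def normalize_csv(raw_text, sample_size=4096):
    """Re-emit CSV text of unknown dialect as plain comma-delimited CSV.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Guess delimiter and quoting from a sample of the text. Restricting the
    # candidate delimiters (an assumption) makes bad guesses less likely.
    dialect = csv.Sniffer().sniff(raw_text[:sample_size], delimiters=",;\t|")

    # Parse with the guessed dialect.
    reader = csv.reader(io.StringIO(raw_text, newline=""), dialect)

    # Write back out in one known format: commas, minimal quoting, "\n" endings.
    out = io.StringIO(newline="")
    writer = csv.writer(
        out, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerows(reader)
    return out.getvalue()
```

Calling `normalize_csv` on semicolon-delimited text returns the same rows with commas as the delimiter, and fields are quoted only where the new format requires it.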



Oh yeah that does sound challenging. If you’re interested, here’s my take on the three libraries I mentioned.

1. Pandas is more mature, with much better batch reading of larger-than-memory CSV files than Polars. But it’s slower and the syntax is worse.

2. Polars is my go-to for one-off analysis of CSV files that fit in memory. When max performance isn’t a concern, sometimes I’ll iterate through the CSV using Pandas to get it in batches, then immediately convert each batch to Polars to do any analysis. ChatGPT has been poisoned by Polars’ early syntax changes so it often makes mistakes, but Polars’ syntax is so clean and consistent that this often doesn’t matter much, as the mistakes are easy to fix.

3. DuckDB is a different beast, obviously, as it’s a full database, not just a single dataframe. It’s slightly more setup, but it has a CSV sniffer, handles larger-than-memory processing really well (no need to batch iterate), and lets you use SQL. I’m not too experienced at SQL yet, and it’s nice that ChatGPT is really pretty good at creating complex SQL queries. I am now gravitating to DuckDB for any larger-than-memory processing that can be handled in SQL. If line-by-line streaming is needed for the algorithm I’m implementing, then I still use Pandas or the Pandas+Polars approach.


This is awesome, thanks for the advice! I'll definitely give these tools a shot for my next import job.



