I also work with foreign CSVs regularly. I'll have to try the Python Way next ti...

directevolve · on Nov 10, 2024

Oh yeah that does sound challenging. If you’re interested, here’s my take on the three libraries I mentioned.

1. Pandas is more mature and handles batch reading of larger-than-memory CSV files much better than Polars, but it’s slower and the syntax is worse.

2. Polars is my go-to for one-off analysis of CSV files that fit in memory. When max performance isn’t a concern, I’ll sometimes iterate through the CSV in batches using Pandas, then immediately convert each batch to Polars for the analysis. ChatGPT has been poisoned by Polars’ early syntax changes, so it often makes mistakes, but Polars’ syntax is so clean and consistent that they’re usually easy to fix.

3. DuckDB is obviously a different beast, as it’s a full database, not just a single dataframe. It takes slightly more setup, but it has a CSV sniffer, handles larger-than-memory processing really well (no need to batch iterate), and lets you use SQL. I’m not too experienced with SQL yet, so it’s nice that ChatGPT is pretty good at writing complex SQL queries. I’m now gravitating to DuckDB for any larger-than-memory processing that can be handled in SQL. If the algorithm I’m implementing needs line-by-line streaming, I still use Pandas or the Pandas+Polars approach.

accrual · on Nov 10, 2024

This is awesome, thanks for the advice! I'll definitely give these tools a shot for my next import job.