
Me too

I'm using DuckDB WASM on GitHub Pages. The site takes about 10 seconds to load [1] and shows business trends in my county (Spokane County). It is built with data-explorer [2], which uses many other open-source projects, including Malloy and malloy-explorer. One cool thing: if you use the UI to make a query on the data, you can share the URL with someone and they will see the same query and result (it's all embedded in the URL).

[1] - https://mrtimo.github.io/spokane-co-biz/#/model/businesses/e... [2] - https://github.com/aszenz/data-explorer


DuckDB can read JSON, so you can query JSON with normal SQL. [1] I prefer the Malloy data language for querying, as it is 10x simpler than SQL. [2]

[1] - https://duckdb.org/docs/stable/data/json/overview [2] - https://www.malloydata.dev/
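
In Python this is only a few lines with the duckdb package. A minimal sketch, assuming a local newline-delimited JSON file called events.json with customer and amount fields (those names are made up):

    import duckdb

    # DuckDB infers the schema from the JSON and exposes it as a table
    result = duckdb.sql("""
        SELECT customer, sum(amount) AS total
        FROM read_json_auto('events.json')
        GROUP BY customer
        ORDER BY total DESC
    """).df()  # materialize the result as a pandas DataFrame
    print(result)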


So can Postgres. I tend to just use PG, since I have instances running basically everywhere, even locally, but DuckDB works well too.
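
For comparison, a rough sketch of the same kind of JSON query in Postgres, assuming a table events(payload jsonb) and placeholder connection details:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
    with conn, conn.cursor() as cur:
        # jsonb fields are pulled out with ->> and cast for aggregation
        cur.execute("""
            SELECT payload->>'customer' AS customer,
                   sum((payload->>'amount')::numeric) AS total
            FROM events
            GROUP BY 1
            ORDER BY total DESC
        """)
        for row in cur.fetchall():
            print(row)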


I have experience with DuckDB but not Databricks... From the perspective of a company, is a tool like Databricks more "secure" than DuckDB? If my company adopts DuckDB as a data lake, how do we secure it?


DuckDB can run as a local instance that points to Parquet files in an S3 bucket, so your "auth" can live at the layer that grants permission to access that bucket.
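
A minimal sketch of that pattern with the duckdb Python package and the httpfs extension; the bucket path and credentials are placeholders, and in practice the keys would come from whatever IAM role is allowed to read the bucket:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # The only "auth" DuckDB knows about is whatever the bucket layer grants.
    con.execute("""
        CREATE SECRET (
            TYPE s3,
            KEY_ID 'YOUR_ACCESS_KEY',
            SECRET 'YOUR_SECRET_KEY',
            REGION 'us-west-2'
        )
    """)
    df = con.sql("""
        SELECT region, count(*) AS orders
        FROM read_parquet('s3://my-data-lake/orders/*.parquet')
        GROUP BY region
    """).df()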


Love this! Here is a similar product: https://sql-workbench.com/


Based on this comment, you might enjoy the Malloy data language. It compiles to SQL and also has an open-source explorer that makes filters like the ones you describe easy.


Thanks for the tip. I am checking it out right now.


It’s 2025. Let’s separate storage from processing. SQLite showed how elegant embedded databases can be, but the real win is formats like Parquet: boring, durable storage you can read with any engine. Storage stays simple, compute stays swappable. That’s the future.
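
To make that concrete, a toy sketch (file and column names invented, and pandas' to_parquet assumes pyarrow is installed): one engine writes the Parquet file, a completely different one reads it.

    import pandas as pd
    import duckdb

    # Write with pandas...
    pd.DataFrame({"city": ["Spokane", "Seattle"], "sales": [100, 250]}) \
        .to_parquet("sales.parquet")

    # ...read with DuckDB. Any engine that speaks Parquet could do this step.
    print(duckdb.sql(
        "SELECT city, sum(sales) AS sales FROM 'sales.parquet' GROUP BY city"
    ).df())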


Counterpoint: "The two versions of Parquet" https://news.ycombinator.com/item?id=44970769 (17 days ago, 50 comments)


As I understand from the short description, Parquet is a column-oriented format that is made for selecting data and is difficult to use for updating (like ClickHouse).


I agree with this 100%. The creator of DuckDB argues, in the first 5 minutes of his talk here [1], that people using pandas are missing out on 50 years of progress in database research.

I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1000-line SQL script, it's only 18 lines of Malloy.

I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.

[1] https://www.youtube.com/watch?v=PFUZlNQIndo [2] https://www.malloydata.dev/
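
Not that blog post, but here is a toy flavor of the comparison, with made-up column names (DuckDB can query the pandas frame directly from Python):

    import pandas as pd
    import duckdb

    raw = pd.DataFrame({"name": [" Acme ", "acme", None],
                        "revenue": ["10", "x", "30"]})

    # pandas version of the cleaning step
    cleaned_pd = (
        raw.dropna(subset=["name"])
           .assign(name=lambda d: d["name"].str.strip().str.lower(),
                   revenue=lambda d: pd.to_numeric(d["revenue"], errors="coerce"))
    )

    # SQL version over the same in-memory frame
    cleaned_sql = duckdb.sql("""
        SELECT lower(trim(name)) AS name,
               try_cast(revenue AS DOUBLE) AS revenue
        FROM raw
        WHERE name IS NOT NULL
    """).df()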


> The creator of DuckDB argues, in the first 5 minutes of his talk here, that people using pandas are missing out on 50 years of progress in database research.

That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.
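
For anyone who hasn't seen it, a minimal sketch of that lazy API (file and column names are placeholders); the plan is optimized before anything is read, so the filter is a candidate for pushdown into the Parquet scan:

    import polars as pl

    lazy = (
        pl.scan_parquet("events.parquet")         # nothing is read yet
          .filter(pl.col("country") == "US")      # can be pushed into the scan
          .group_by("customer_id")
          .agg(pl.col("amount").sum().alias("total"))
    )
    result = lazy.collect()                       # optimize the plan, then execute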

Disclaimer: I work for Polars on said query execution.


The DataFrame interface itself is the problem. It's incredibly hard to read, write, debug, and test. Too much work has gone into reducing keystrokes rather than developing a better tool.


Not sure what you mean by this. The table concept is as old as computers. "Here is a table, do something with it" -> that is the high-level df API. All the functions make sense; what is hard to read, write, or debug here?

I have used Polars to process 600M of XML files (with a bit of a hack), and the Polars part of the code is readable with minimal comments.

Polars has a better API than pandas; at least the intent is easier to understand. (Laziness, yay.)


The problem with the dataframe API is that whenever you want to change a small part of your logic, you usually have to rethink and rewrite the whole solution. It is too difficult to write reusable code. Too many functions that try to do too many things with a million kwargs that each have their own nuances. This is because these libraries tend to favor fewer keystrokes over composable design. So the easy stuff is easy and makes for pretty docs, but the hard stuff is obnoxious to reason through.

This article explains it pretty well: https://dynomight.net/numpy/


With all due respect, have you actually used the Polars expression API? We strive for composability of simple functions over dedicated methods with tons of options, where possible.

The original comment I responded to was confusing pandas with Polars, and now the blog post you link refers to NumPy, but Polars takes a completely different approach to dataframes/data processing than either of those tools.
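
A small illustration of that composability (the data and column names are made up): expressions are plain values, so a building block can be written once and reused across columns.

    import polars as pl

    def zscore(col: str) -> pl.Expr:
        # reusable expression: (x - mean) / std for any numeric column
        c = pl.col(col)
        return ((c - c.mean()) / c.std()).alias(f"{col}_z")

    df = pl.DataFrame({"height": [1.6, 1.8, 1.7], "weight": [60.0, 80.0, 70.0]})
    print(df.with_columns(zscore("height"), zscore("weight")))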


I have used NumPy, but I don't understand what it has to do with dataframe APIs.

Take two examples of dataframe APIs, dplyr and Ibis. Both can run on a range of SQL backends because dataframe APIs are very similar to SQL DML APIs.

Moreover, the SQL translations of tools like pivot_longer in R are a good illustration of the complex, dynamic operations dataframe APIs can support, which you would otherwise use something like dbt to implement in your SQL models. DuckDB allows dynamic column selection in UNPIVOT, but in some SQL dialects this is impossible; dataframe-API-to-SQL tools (or dbt) make it possible in those dialects.
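
A sketch of that dynamic column selection in DuckDB's dialect (table and column names invented), run from Python:

    import duckdb

    duckdb.sql("""
        CREATE TABLE monthly_sales AS
        SELECT 1 AS store, 10 AS jan, 12 AS feb, 9 AS mar
    """)
    long_form = duckdb.sql("""
        UNPIVOT monthly_sales
        ON COLUMNS(* EXCLUDE (store))   -- columns chosen dynamically
        INTO NAME month VALUE sales
    """).df()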


Assuming you’re comparing Polars/dataframes to SQL… SQL has literally the worst debugging experience imaginable.


Just wanted to say I'm a huge fan of your work. Been using Polars for my team's main project for years and it just keeps getting better.


In the same talk, Mark acknowledges that "for data science workflows, database systems are frustrating and slow." Granted, DuckDB is an attempt to fix that, but most data scientists don't get to choose which database the data is stored in.


(I use DuckDB to query data stored in Parquet files.)


Same. But I use Malloy, which uses DuckDB to query data stored in hundreds of Parquet files (as if they were one big file).


I haven't looked at Malloy, but I do regularly scan lots of Parquet files using wildcards etc. from DuckDB. It's a neat built-in DuckDB feature.
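
The globbing in question, sketched with a made-up directory layout; many Parquet files are queried as one table:

    import duckdb

    trends = duckdb.sql("""
        SELECT year, count(*) AS businesses
        FROM read_parquet('data/businesses/*.parquet')
        GROUP BY year
        ORDER BY year
    """).df()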


Have you used Malloy in a pipeline, e.g., with Airflow? If so, how was the experience?


Very cool to see Malloy mentioned here. Great stuff. There is an MCP server built into Malloy Publisher[1]. Perhaps useful to the author or others trying to do something similar to what the author describes. Directions on how to use the MCP server are here [2]. [1] https://github.com/malloydata/publisher [2] https://github.com/malloydata/publisher/blob/main/docs/ai-ag...


One big problem right now is that LLMs are not great at writing Malloy, so it is important to have an intermediate DSL. In the future, as language models evolve or someone creates a fine-tuned model that can write Malloy well, we will be able to have more autonomous agents.


I'm a business professor who teaches Python and more. I'd like to develop some simple projects to help my students fine-tune this for a business purpose. If you have ideas (or datasets for fine-tuning), let me know!

