Pulling from and into production databases is one of the early favourites from our dlt user base. Some of the reasons are explained in this MongoDB example (https://dlthub.com/docs/blog/MongoDB-dlt-Holistics).
This is a really cool project, congrats! A somewhat related project that I worked on at MongoDB is PyMongoArrow, which does some of the same transformations to take unstructured MongoDB data and convert it to tabular formats like Arrow data frames. I’m curious what the support looks like for BSON types that do not map cleanly to JSON types. One example I can think of off the top of my head is Decimal128.
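To illustrate why Decimal128 is awkward: it carries 34 significant decimal digits, while most JSON parsers read numbers as 64-bit floats. Here is a minimal stdlib-only sketch, assuming Python's `decimal.Decimal` as a stand-in for a BSON Decimal128 value. It does not use pymongo, PyMongoArrow, or dlt, and it says nothing about how any of them actually handle the type.

```python
import json
from decimal import Decimal

# Stand-in for a BSON Decimal128 value (34 significant digits).
value = Decimal("1234567890.123456789012345678901234")

# Option 1: cast to float so it fits a JSON number. Precision is lost.
as_float = json.loads(json.dumps(float(value)))
lossy = Decimal(repr(as_float)) != value

# Option 2: carry the value as a JSON string. The round trip is exact,
# but the consumer has to know the string is really a number.
as_text = json.loads(json.dumps(str(value)))
restored = Decimal(as_text)

print("float round trip lost precision:", lossy)
print("string round trip is exact:", restored == value)
```

So the choice is roughly between a lossy number and an exact string plus a type hint kept somewhere in the schema.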
We took at least one immediately practical piece of advice out of this: we should release a conda package and make sure that dlt works in it.
I wouldn't make it a high priority. If there's one thing I know about conda users it's that "no conda package available" has never stopped them. In fact they prefer to pip install inside their conda environment, and the only conda packages they use are the ones that touch Nvidia drivers (e.g. pytorch).
> "no conda package available" has never stopped them.
Yes and no. They won't stop because they want to get things done, and the things usually don't involve honing the infrastructure. But installing packages with pip usually breaks the conda installation itself, not just a particular virtual environment. (Usually pip nukes the setuptools that comes with conda, and then once you want to install / upgrade anything in the base environment, you discover that it's toast because conda itself depends on setuptools, but it's now broken and cannot be reinstalled.)
So, in practice, if you give up and use pip to install stuff, it means that for the next project you will be reinstalling conda (and you will probably lose all your previous virtual environments). Kinda sucks.
We have a PR (https://github.com/dlt-hub/dlt/pull/594) that is about to merge that makes the above highly configurable, between evolution and hard stopping:
- you will be able to totally freeze schema and reject bad rows
- or accept the data for existing columns but not new columns
- or accept some fields based on rules
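
To make the three options concrete, here is a minimal plain-Python sketch. It is not dlt's API and is not taken from the PR: the function `apply_contract`, the mode names (`evolve`, `freeze`, `discard_new_columns`, `allow_by_rule`), and the `rule` callback are all hypothetical, invented for illustration.

```python
from typing import Any, Callable, Dict, Optional, Set

Row = Dict[str, Any]

def apply_contract(
    row: Row,
    known_columns: Set[str],
    mode: str,
    rule: Optional[Callable[[str, Any], bool]] = None,
) -> Optional[Row]:
    """Return the row to load, or None if the row is rejected (hypothetical helper)."""
    new_columns = [name for name in row if name not in known_columns]
    if mode == "evolve":
        # Schema evolution: every field is accepted.
        return dict(row)
    if mode == "freeze":
        # Frozen schema: any unknown column makes the whole row bad.
        return None if new_columns else dict(row)
    if mode == "discard_new_columns":
        # Keep data for existing columns, drop anything new.
        return {k: v for k, v in row.items() if k in known_columns}
    if mode == "allow_by_rule":
        # Keep existing columns plus new fields that pass the rule.
        if rule is None:
            raise ValueError("allow_by_rule needs a rule")
        return {k: v for k, v in row.items() if k in known_columns or rule(k, v)}
    raise ValueError(f"unknown mode: {mode}")

known = {"id", "name"}
row = {"id": 1, "name": "a", "extra_note": "x", "debug": True}

print(apply_contract(row, known, "freeze"))
print(apply_contract(row, known, "discard_new_columns"))
print(apply_contract(row, known, "allow_by_rule", lambda k, v: k.startswith("extra_")))
```

With this sample row, `freeze` rejects the row outright, `discard_new_columns` keeps only `id` and `name`, and the rule-based mode additionally lets `extra_note` through.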