I've had exactly the same experience, it's a nice language but using it for things it's not suited for like data exploration makes no sense to me.
Production data pipelines, on the other hand, make sense, but only after exercising them well and, as you say, making sure there's good test coverage if you're implementing things like numerical routines.
Do you have experience implementing ETL pipelines in Go? I think it'd be a better fit for us over our current language, but I'm curious to hear from people who've actually done it.
Yes. It works fairly well. With that said, I've got a feeling that things would be a lot easier to change around if we weren't using it; we end up writing a lot of code to do relatively simple things.
I do this at my job. Disclaimer: I’m a web dev (“architect”) who does some lightweight data engineering tasks to facilitate views in some of my apps.
My pipelines are very simple (no DAG-like dependencies across pipelines). I could just have separate scripts, but instead I have a monorepo of pipelines that implement an interface with Extract, Transform, Load methods. I run this as a single process that runs pipelines on a schedule and has an HTTP API for manually triggering pipelines.
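For anyone curious what that looks like, here's a minimal sketch of the pattern; the interface and the demo pipeline are hypothetical stand-ins, not the actual code, but the shape (each pipeline implements Extract, Transform, Load, and a runner drives them in order) is the same:

```go
package main

import (
	"fmt"
	"strings"
)

// Pipeline is a hypothetical sketch of the interface described above:
// every pipeline in the monorepo implements these three methods.
type Pipeline interface {
	Name() string
	Extract() ([]string, error)
	Transform(rows []string) ([]string, error)
	Load(rows []string) error
}

// Run drives one pipeline end to end, wrapping any error with the
// pipeline name and the stage that failed.
func Run(p Pipeline) error {
	raw, err := p.Extract()
	if err != nil {
		return fmt.Errorf("%s: extract: %w", p.Name(), err)
	}
	rows, err := p.Transform(raw)
	if err != nil {
		return fmt.Errorf("%s: transform: %w", p.Name(), err)
	}
	if err := p.Load(rows); err != nil {
		return fmt.Errorf("%s: load: %w", p.Name(), err)
	}
	return nil
}

// demoPipeline is a toy implementation for illustration only.
type demoPipeline struct{ loaded []string }

func (d *demoPipeline) Name() string               { return "demo" }
func (d *demoPipeline) Extract() ([]string, error) { return []string{"a", "b"}, nil }
func (d *demoPipeline) Transform(rows []string) ([]string, error) {
	out := make([]string, len(rows))
	for i, r := range rows {
		out[i] = strings.ToUpper(r)
	}
	return out, nil
}
func (d *demoPipeline) Load(rows []string) error { d.loaded = rows; return nil }

func main() {
	p := &demoPipeline{}
	if err := Run(p); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(p.loaded, ","))
}
```

In the real setup a single process would hold a slice of these and invoke Run on a ticker, plus an HTTP handler that triggers a named pipeline on demand.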
At some point I felt guilty that I was doing something nobody else seems to do, and that I had rolled my own poor-man’s orchestrator. I played around with Dagster and it was pretty nice, but I decided it was overkill for my needs (however, I definitely think the actual data analysis team at my company should switch from Jenkins to Dagster heh…)
On a separate note, all of my pipelines Load into Elasticsearch, which I’m using as a data warehouse. I’ve realized this is another unconventional decision I’ve made, but it also seems to work well for my use-cases.
It depends on what you're doing, right? The commenter here replied to me, and we're processing really large data files that are deliberately not in a SQL database due to their size; only artefacts of these files eventually make it into a time-series DB. For us Go works well and is performant without any great difficulty. For domain-specific analytics we generally use Python, and Go just calls out to an API to run them.
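The Go-calls-Python-over-HTTP split can be sketched like this; the analytics service and its endpoint are hypothetical (here faked with httptest so the snippet is self-contained), but the pattern is just JSON over HTTP:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Stand-in for the Python analytics service: sums the values it receives.
	// In reality this would be a separate Python process behind a real URL.
	svc := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Values []float64 `json:"values"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var sum float64
		for _, v := range req.Values {
			sum += v
		}
		json.NewEncoder(w).Encode(map[string]float64{"sum": sum})
	}))
	defer svc.Close()

	// Go side: post the artefact values and read the analytics result back.
	body, _ := json.Marshal(map[string][]float64{"values": {1, 2, 3.5}})
	resp, err := http.Post(svc.URL+"/analyze", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Sum float64 `json:"sum"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Sum)
}
```

The nice property of this split is that the Go pipeline stays a dumb, fast shoveler of data, and all the domain-specific numerics live where the libraries are.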
> it's a nice language but using it for things it's not suited for like data exploration
Pedantic point, but this is an issue of library support, not the language.
For whatever reason, data scientists and ML researchers decided to write their libraries and tools in a kindergartner's language with meaningful whitespace, dynamic typing, and various other juvenile PL features specifically aimed at five-year-olds.
Nobody would really use a compiled language for this; the edit-compile-run cycle just takes too long. Prior to Python, people mostly used MATLAB and Mathematica for that sort of work on the physics/engineering side, and R and Stata/SPSS on the bio and stats side. MATLAB, Mathematica, Stata and SPSS are all commercial, and R has exactly the same problems as Python with environment management and compiled binaries: if you use it today you end up doing a lot of manual compilation of dependencies and putting them on the PATH, on Linux at least.
Python became popular because the key scientific ecosystem libraries closely copied MATLAB's APIs, which made it easy to pick up, and because it was free. Anaconda made a distribution that was easy to install, with all the dependencies compiled for you, that worked on Linux/Mac/Windows, which made it much easier to use than R. The other interactive languages around at the time were Ruby, which was heavily web-dev focused, and Perl. Node didn’t yet exist.
Once you have an ecosystem in a language it’s very hard to supplant. You need big resources to go against the grain. That no big company has decided to pour lots of money into alternatives even despite the problems probably tells us that it’s not viewed as being worth it.