> "spicy" take: allowing users to write imperative code (e.g. using loops) that dynamically generates DAGs is never a good idea.
Said with all love: you're definitely wrong and it's not hard to demonstrate.
1) some upstream service exports data as a series of chunked files each day, you can't know in advance how many
2) processing a single file pegs a CPU core for ~an hour
3) you want a task per chunk so you can process them in parallel
4) ergo: you NEED dynamic task generation
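The scenario in 1-4 can be sketched in a few lines of stdlib Python; the glob pattern, `process_chunk` body, and worker count here are all hypothetical stand-ins, but the point is that the fan-out width is only known at run time:

```python
import glob
from concurrent.futures import ThreadPoolExecutor

def discover_chunks(pattern="export/*.chunk"):
    # the number of chunk files is unknown until we look
    return sorted(glob.glob(pattern))

def process_chunk(path):
    # stand-in for the hour-long CPU-bound job
    return f"done:{path}"

def run(chunks, workers=4):
    # one task per chunk, generated dynamically -- the DAG's
    # shape is decided at run time, not authoring time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunks))
```

A static DAG definition has nowhere to put this loop, which is the whole argument.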
This is a thing that Airflow famously couldn't do until 2.3 or 2.4. Luigi could always do it. Prefect, Nextflow, and Martian(?) can do it. DBT can't do it, though the concept doesn't exactly apply (the DBT version is more like "I want to generate models at compile time" which is much more debatably sane... I have this problem currently and our solution sucks, but mostly I wish I didn't have the problem!)
I see where you're coming from, I think, but my experience has been the opposite. In any major project, I absolutely need my DAGs to have a dynamic shape. Forcing them to have a static shape just pushes the required dynamism outside the tool, likely into some nasty code-generation step.
Interesting... So I'm not sure I agree. While Hamilton does support this kind of thing (and we're likely going to build out more support), the assumption above is that the task-level unit of work maps naturally onto individual items in your example.
In my experience, you want to decouple the task-level orchestration mechanism from the nature of the data itself. A queuing system with multiple consuming threads/processes or a distributed system like spark is kind of meant for this type of task, and can do it more efficiently. So, why rely on the orchestrator to handle it when you potentially have more sophisticated tooling at hand?
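The queue-plus-consumers alternative mentioned above can be sketched with the stdlib (the doubling step is a hypothetical stand-in for real per-item work): the orchestrator sees one fixed task, and parallelism lives inside it.

```python
import queue
import threading

def _consume(work_q, results):
    # drain the queue until empty; each worker pulls the next item
    while True:
        try:
            item = work_q.get_nowait()
        except queue.Empty:
            return
        results.append(item * 2)  # stand-in for real processing
        work_q.task_done()

def run_with_queue(items, n_workers=4):
    work_q = queue.Queue()
    for it in items:
        work_q.put(it)
    results = []
    workers = [threading.Thread(target=_consume, args=(work_q, results))
               for _ in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return sorted(results)  # completion order is nondeterministic
```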
To make it more concrete, say your upstream files change to have 10x smaller chunks, and 10x more files -- does the same orchestration system make sense? Are you going to start polluting the set of tasks?
If you did want to rely on the orchestrator for parallelism, an alternative strategy could be to chunk and assign the files to each task -- e.g. three files per task, using a stable assignment method to round-robin them (basically the same as the queuing system). You might not get quite the same parallelism, but the DAG shape would be stable.
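A minimal sketch of that stable round-robin assignment (the function name and bucket count are illustrative, not from any particular tool):

```python
def assign_round_robin(files, n_tasks=3):
    """Stable assignment: after sorting, file i always lands in
    bucket i % n_tasks, so the DAG keeps a fixed number of tasks
    no matter how many files show up."""
    buckets = [[] for _ in range(n_tasks)]
    for i, path in enumerate(sorted(files)):  # sort for determinism
        buckets[i % n_tasks].append(path)
    return buckets
```

The trade-off is visible right in the code: task count is fixed, but per-task work grows with the file count.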
Your case is particularly tricky since the files take so long to process, but it seems to me that this might be better suited to a listening/streaming tool that reacts when new files are added. It could write the semi-processed data to a single location, and your daily orchestration task could read from that and set a high-watermark (basically a streaming architecture).
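The high-watermark read described above amounts to something like this sketch (the timestamped-tuple representation is an assumption for illustration):

```python
def read_since_watermark(records, watermark):
    """records: iterable of (timestamp, payload) pairs.
    Returns the records newer than the watermark, plus the
    advanced watermark to persist for the next daily run."""
    new = [(t, p) for t, p in records if t > watermark]
    next_watermark = max((t for t, _ in new), default=watermark)
    return new, next_watermark
```

Each daily run consumes only what arrived since the last run, and the watermark is the only state it has to track.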
Anyway, not 100% sure how I feel about this but wanted to present a counter-point...
> In my experience, you want to decouple the task-level orchestration mechanism from the nature of the data itself.
No, that's a fantasy. The data is EVERYTHING. There is no abstract, platonic solution that works equally well with 100 rows or 100 billion rows.
One of the unique things about Data Engineering that sets it apart from most other specialties is that we deal in bulk. The code we write takes significant wall-clock time to execute. Performance ALWAYS matters. Never forget that!
If my DAG tool says "I can't handle this, use Spark" I'm throwing that tool away. This is not an exotic Big Data problem, it is extremely pedestrian and common. It's just not reasonable to punt and say "go adopt some 10-million-line behemoth" as a solution. Spark is cool and all but if you're not using it, adopting it is a massive step. The simple answer to "why rely on the orchestrator to handle it" is that that is literally what the orchestrator is for.
> To make it more concrete, say your upstream files change to have 10x smaller chunks, and 10x more files -- does the same orchestration system make sense? Are you going to start polluting the set of tasks?
This should be handled very smoothly and transparently, and Luigi for example succeeds in doing so. If I tell my framework "generate a task for each input chunk, and run with 4 threads" then it just doesn't matter how many tasks get generated. I routinely run Luigi programs with thousands of tasks generated in this manner. What's funny about this is that your hypothesized scenario actually happened: the system we exported this data from decided one day to radically decrease the chunk size / increase the chunk count, without telling us. I didn't have to change my code at all! In fact I didn't even notice for like a month.
Mapping N chunks onto M tasks breaks catastrophically when you start thinking about resumption. Let's say I have a simple DAG: [invoke export] -> { task per chunk } -> [finalize]. As long as each chunk task has a deterministic name/identity, completion state can be tracked for it. Thus if the enclosing job terminates unexpectedly, we can easily resume from where we left off simply by starting the job again. If there's no correspondence between inputs, tasks, and outputs, you can't do it, not without pushing state tracking into user code, which defeats the purpose of the framework.
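The resumption argument above can be sketched with file-based completion markers, which is essentially the Luigi Target idea (the marker naming and `process` callback here are hypothetical):

```python
import os

def output_path(outdir, chunk):
    # deterministic identity: one marker file per chunk name
    return os.path.join(outdir, chunk + ".done")

def run_resumable(chunks, outdir, process):
    """Process each chunk whose marker doesn't exist yet.
    Re-running after a crash skips completed chunks for free."""
    ran = []
    for chunk in chunks:
        marker = output_path(outdir, chunk)
        if os.path.exists(marker):  # already complete -> skip on resume
            continue
        result = process(chunk)
        with open(marker, "w") as f:
            f.write(result)
        ran.append(chunk)
    return ran
```

Because each chunk task has a stable name, a second invocation does no redundant work; collapse N chunks into M anonymous tasks and that property disappears.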
Reasonable people can disagree here. Again, Airflow chose not to support this for a long time. But it's not a coincidence that I decided in 2016 that Airflow was pretty useless to me. Dynamic DAG shape was and is a bare minimum requirement for me, and is one of the first things I test with new tools.
Even if you don't think this is a great use case, or your sweet spot, it's important to realize that it is very, very common, and when your users hit it they will look up how to deal with it. When the answer is "we can't help you here" that's extremely disappointing, and that will motivate users to leave.
Maybe the best overall advice I can give you is "spend some time using Luigi" and understanding its model and what it enables. It is absolutely not a perfect tool (in particular, depending on Tasks rather than Targets is probably a design error), but it is the only one I've truly been successful with. Looked at a certain way, it is radically more powerful than most of its competitors. It's sad that it gets overlooked just because it doesn't have a built-in cron or a fancy web UI.
> Even if you don't think this is a great use case, or your sweet spot, it's important to realize that it is very, very common, and when your users hit it they will look up how to deal with it. When the answer is "we can't help you here" that's extremely disappointing, and that will motivate users to leave.
Hamilton doesn't support that, but it hasn't made a design decision that would prevent it :) Yeah, I'm definitely going to dive deeper into Luigi. Thanks for the rec. Especially wondering now whether it abstracts things as data types, or some other way...