
How does Sematic compare to Metaflow? Metaflow optimizes for many of the same goals as Sematic: local workflows, cloud access, lineage tracking, state transfer, etc.


There are several differences, but I'd say these are some of the main ones:

UI: whereas Metaflow provides the ability to build your own result visualizations explicitly in your workflow (via their "cards" feature), Sematic makes it so that your outputs (and inputs) get automatic rich visualizations based on the type of the data being passed around.
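A hypothetical sketch (not Sematic's real implementation) of the type-dispatch idea behind automatic visualizations: renderers are registered per type, and the framework picks one based on the value it sees.

```python
# Toy type-based visualization dispatch. The registry, decorator, and
# renderer names here are all assumptions for illustration only.
VIZ = {}

def visualizer(tp):
    """Register a rendering function for values of type `tp`."""
    def register(fn):
        VIZ[tp] = fn
        return fn
    return register

@visualizer(float)
def show_float(x):
    return f"scalar: {x:.3f}"

@visualizer(list)
def show_list(xs):
    return f"series of {len(xs)} points"

def render(value):
    # Fall back to repr() for types without a registered visualizer.
    fn = VIZ.get(type(value), repr)
    return fn(value)

print(render(0.5))        # a float gets the scalar renderer
print(render([1, 2, 3]))  # a list gets the series renderer
```

The point is that the workflow author never writes visualization code; the framework infers it from the type.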

API: Instead of being based around explicitly building up a graph, where you have to specify the I/O connections between steps by hand, Sematic makes defining your steps look like writing/calling Python functions.
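A toy contrast of the two styles (hypothetical code, not either framework's actual API): in one, the step registry and edges are declared explicitly; in the other, the DAG is implied by ordinary Python calls.

```python
def load_data():
    return [1.0, 2.0, 3.0]

def train_model(data):
    return sum(data) / len(data)  # stand-in for "training"

# Graph-style: steps and their I/O wiring are spelled out by hand.
graph = {
    "load": (load_data, []),
    "train": (train_model, ["load"]),
}

def run_graph(graph, target):
    fn, deps = graph[target]
    return fn(*(run_graph(graph, d) for d in deps))

# Function-call style: the graph is implicit in normal Python calls,
# which a framework can trace to recover the same DAG.
def pipeline():
    data = load_data()
    return train_model(data)

print(run_graph(graph, "train"), pipeline())  # both compute the same result
```

Same computation either way; the difference is how much wiring the user writes versus how much the framework infers.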

Packaging: Whereas Metaflow requires you to include packaging information in the code defining your steps (the @conda decorator, etc.), Sematic plugs into your existing dependency management to bundle up dependencies for execution in the cloud.


Interesting approach to passing data between steps and constructing the overall graph - it will be interesting to see what the take rate is between the two approaches (Sematic's and Metaflow's). On the UI front, Metaflow generates visualizations for all objects by default in @card; but how does Sematic package up the PyTorch dependency referenced in the example (https://docs.sematic.dev/real-example) for execution in the cloud? IIRC, Metaflow packages the cwd (in addition to @conda, @pip, etc.) and relies on existing packages for local execution?

Edit: Digging deeper, Sematic relies on Bazel (https://docs.sematic.dev/execution-modes#dependency-packagin...) and needs a BUILD file to specify all the dependencies for cloud execution. It seems that the entire pipeline will execute as a single (or multiple) k8s pod(s) using the same environment?

I am quite interested in trying out Sematic. Any guidelines on what kind of scale Sematic can support today (and the near future)?


The way packaging is designed to work in Sematic is for us to hook into your existing dependency management solution to determine what your dependencies are, then build a docker image for you based on those. As you point out, right now we only integrate with bazel for this purpose, but we hope to add more. A simple plugin for requirements.txt -> Docker image is probably next on the TODO list.
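A minimal sketch of what such a requirements.txt -> Docker image plugin might render; the base image, paths, and function name are all assumptions for illustration, not Sematic's design.

```python
def dockerfile_from_requirements(requirements_path="requirements.txt",
                                 base_image="python:3.10-slim"):
    """Render a Dockerfile that installs pinned dependencies, then copies the code.

    Copying requirements first lets Docker cache the dependency layer,
    so code-only changes don't trigger a full reinstall.
    """
    return "\n".join([
        f"FROM {base_image}",
        "WORKDIR /app",
        f"COPY {requirements_path} .",
        f"RUN pip install --no-cache-dir -r {requirements_path}",
        "COPY . .",
    ])

print(dockerfile_from_requirements())
```

The real plugin would presumably also handle the build/push to a registry, but the core translation is roughly this.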

> It seems that the entire pipeline will execute as a single (or multiple) k8s pod(s) using the same environment?

Single Docker image, but multiple pods (when you are using the full cloud mode). This was an intentional decision to avoid confusion about what can be imported where (mimicking what it would be like in a single Python instance), and also to avoid odd version inconsistencies across the pipeline.

> Any guidelines on what kind of scale Sematic can support today (and the near future)?

Based on some prior tooling experiences, the main bottleneck should be what your Kubernetes cluster can handle.

> I am quite interested in trying out Sematic

Glad to hear it! We'd love to hear about your experiences. You can join our discord if you want help while you're trying it out: https://discord.com/invite/4KZJ6kYVax


This is pretty much what both Netflix and Spotify do. I would argue that there isn't a canonical recommendations stack that FAANG is converging towards, and that's a direct consequence of differing business requirements and organizational structures.



No question they've done some things that have had some impact on others in the industry. But none of them are particularly important. It's all relative. Companies like Twitter, Uber, and Airbnb have all released open source projects or figured out how to solve hard problems in ways that others have emulated.

But for every one of the other FAA(N)G companies, I can barely work a day as a developer without touching their technologies. Yeah, Netflix got into ML years before most, but the Netflix Prize exists as a distant cautionary memory, and as an ML professional, I'd literally never heard of Metaflow before. Just sayin'.


> But none of them are particularly important

Nowhere was the argument made that somehow Netflix was more influential than Twitter/Uber/AirBnB, but your counter-argument that somehow it's less influential because you haven't heard of/used some projects directly holds no ground.


> your counter-argument that somehow it's less influential because you haven't heard of/used some projects directly holds no ground

Oh come on, they are indisputably right that Microsoft, Twitter, Uber, Airbnb, hell, even Cloudflare are more technically influential than Netflix is.

Apple and Google would make anyone's top 5; that's his point. No argument about it. Their products collectively dominate everyone's life, along with MSFT. Netflix is maybe in your top 10, top 20 for sure, but it's not up there as one of the few 'platforms that everyone's lives are built on' techcos.

(Like, Netflix vs Microsoft? Seriously? For that matter, Amazon probably wouldn't be in my top 5 either, and not only because it's not mainly a tech company. I s'pose it depends how you define 'Amazon', and if you include AWS. But for Netflix there's just no argument that they win a spot there.)


What's your argument for Twitter/Uber/AirBnB being indisputably more technologically influential than Netflix? And let's please talk facts rather than opinions.


How does Kedro compare to MLflow and Metaflow?


Kedro sort of fits into a niche where it just overlaps somewhat with 'orchestrators' like Prefect, Metaflow, Dagster, Airflow, and others. What makes it slightly different is that it is focused on the rapid development journey to production, providing guardrails for teams to co-develop ML projects in a way that nudges them toward software engineering best practices and clean code.

The 'finished article' in many cases should be deployed in production in one of those tools which provide specialised bells and whistles like scheduling, monitoring and observability.

Regarding MLflow, there is also a slight overlap in terms of experiment tracking, but not things like model serving. Kedro has a mechanism to track experiments, but it's designed more to give users with zero infrastructure something for free out of the box. It's been built in a way that it can be repurposed for more dedicated experiment tracking tools - the folks at neptune.ai built their own plug-in for this purpose: https://docs.neptune.ai/integrations-and-supported-tools/aut...
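To make the "guardrails" idea concrete, here's a toy imitation (plain Python, not Kedro's real API) of the pattern Kedro nudges teams toward: pure functions wired together by named datasets in a catalog, rather than ad hoc scripts.

```python
def clean(raw):
    """Drop missing values; a pure, independently testable step."""
    return [x for x in raw if x is not None]

def average(cleaned):
    """Compute a summary statistic from cleaned data."""
    return sum(cleaned) / len(cleaned)

# Each node is (function, input dataset names, output dataset name).
# The names decouple the functions from where the data actually lives.
nodes = [
    (clean, ["raw"], "cleaned"),
    (average, ["cleaned"], "avg"),
]

def run(nodes, catalog):
    """Execute nodes in order, reading inputs from and writing outputs to the catalog."""
    for fn, inputs, output in nodes:
        catalog[output] = fn(*(catalog[i] for i in inputs))
    return catalog

catalog = run(nodes, {"raw": [1.0, None, 3.0]})
print(catalog["avg"])
```

Because each step is a pure function keyed by dataset names, steps can be unit-tested in isolation and the same pipeline can later be handed to a production orchestrator.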


Seems like Kedro has a similar thesis to Metaflow - I will look into it.


Yep, Kedro and Metaflow are more similar to each other than to other generic DAG orchestrators like Airflow.

Kedro and Metaflow make it easier to develop robust ML projects where orchestration plays an important role but isn't everything. They are two separate projects, so the way they approach the problem differs greatly in the details.


I believe essentially all of the tools mentioned here focus on the engineering persona rather than the data scientist. Writing classes and functions isn't the native idiom of a data scientist. At Ploomber we tried to put data scientists at the center of everything, helping them work together with ops. Check it out! https://github.com/ploomber/ploomber


thanks for the unnecessary advertisement.


And Prefect

