
DAMA-DMBOK2 covers this very comprehensively

https://www.dama.org/cpages/body-of-knowledge


Data engineering is cool and new while data management is old school and enterprise.

Specifically, data engineering at some tech companies is a genuine revenue driver, so data engineering at other organizations isn't viewed as a cost center nearly as much, even though it's largely the same work at most organizations.


This may be nitpicking, but I find it meaningless to describe technologies as "cool" versus "enterprise" or "new" versus "old". I don't necessarily want the "coolest" or "newest" tech stack; I want the tech stack that reasonably and reliably solves my business problems. If that means leveraging "old" or "enterprise" technologies and practices, that's totally fine.


How do you define the two terms?


Data Engineering is an engineering discipline -- it can involve anything from data ingestion, transformation, storage, enrichment, and aggregation up to presentation in operational reports. But it's still a manufacturing process with "data" as its input and "data" as its output.
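A rough Python sketch of that "data in, data out" framing (the stage names and sample data here are only illustrative, not from any particular framework):

    # Toy "manufacturing" pipeline: data in, data out (stage names are illustrative)

    def ingest(raw_rows):
        # e.g. pull raw records out of a source system
        return [r.strip() for r in raw_rows]

    def transform(rows):
        # e.g. parse and clean each record into a structured form
        return [{"value": float(r)} for r in rows if r]

    def aggregate(records):
        # e.g. roll records up into a figure for an operational report
        return sum(r["value"] for r in records)

    print(aggregate(transform(ingest([" 1.5", "2.5 ", ""]))))  # 4.0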

Data Management is an organizational discipline -- it is about how the enterprise manages data as an asset and how data is embedded in the organization. This includes data governance issues like common data models and a chain of command (which person/role is responsible for which piece of data), but also second-tier data processes such as quality control and data valuation.


Data engineering versus data management?

Data engineering is nominally more pipeline-oriented and less concerned with the governance and people side of things, but good data engineers end up driving a lot of data management work, because that's what makes the data engineering less painful (eliminating the root causes of data errors and annoying data requests) and makes the data overall more useful and valuable.


My sentiment exactly. The premise of the article comes across as a little naive, because there are so many fine-tuned libraries for specialized hardware architectures that already do this computation very efficiently.

However, it did make me wonder what this might look like on a GPU-accelerated database engine designed to leverage the SIMD parallelism of GPGPU architectures.

Beyond using SQL/NoSQL databases for CRUD apps I am not a "database guy", so I'm not sure about the feasibility, but it would be interesting to see it implemented.


Your question "Has anyone converted stuff like gradient descent to set theory?" doesn't quite make sense, because gradient descent uses differential calculus to find the min/max points of an objective function, and differential calculus requires a lot of additional assumptions on top of set theory.

That being said, current deep learning libraries such as JAX and PyTorch use automatic differentiation to efficiently compute the partial derivatives used by optimization algorithms such as gradient descent, and it's not clear to me what level of effort it would take to convert that into something that could run efficiently in SQL.
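For what it's worth, a minimal sketch of that combination in JAX (toy objective and made-up data, nothing from the article): jax.grad builds the gradient function via automatic differentiation, and gradient descent just iterates on it.

    import jax
    import jax.numpy as jnp

    # Toy least-squares objective on made-up data (purely illustrative)
    X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    y = jnp.array([1.0, 2.0, 3.0])

    def loss(w):
        return jnp.sum((X @ w - y) ** 2)

    grad_loss = jax.grad(loss)  # automatic differentiation gives dloss/dw

    w = jnp.zeros(2)
    for _ in range(500):
        w = w - 0.005 * grad_loss(w)  # plain gradient descent step

    print(w, loss(w))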


Thank you, I knew that derivatives were used in gradient descent, but automatic differentiation is new to me. The Wikipedia article is fairly opaque compared to what I learned in school, but this tidbit stood out:

https://en.wikipedia.org/wiki/Automatic_differentiation#Impl...

Source code transformation (SCT): the compiler processes source code so that the derivatives are calculated alongside each instruction.

Operator overloading (OO): operators are overridden so that derivatives are calculated for numbers and vectors.

Based on the state of software these days, I'm guessing that OO (the "bare hands" method) is what's mainstream. It would be far better IMHO to use SCT, since it's a universal solution that doesn't require manually refactoring programs.
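For what it's worth, a bare-bones version of the OO flavor is forward-mode AD with dual numbers, where the overloaded operators carry a derivative along with each value (a toy Python sketch, not how JAX or PyTorch actually implement it):

    # Toy forward-mode autodiff via operator overloading (dual numbers).
    class Dual:
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # product rule: (u*v)' = u'*v + u*v'
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

        __rmul__ = __mul__

    def derivative(f, x):
        # seed the input with derivative 1, read the derivative off the output
        return f(Dual(x, 1.0)).deriv

    # d/dx (x*x + 3*x) at x = 2 is 2*x + 3 = 7
    print(derivative(lambda x: x * x + 3 * x, 2.0))  # 7.0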

But stuff like SCT might be considered metaprogramming, which seems to have fallen out of fashion. I grew up with C++ macros and templates, so I feel that this is somewhat tragic, although readability and collaboration are much better today. A modern example might be something like aspect-oriented programming (AOP):

https://en.wikipedia.org/wiki/Aspect-oriented_programming

I once used the AOP library AspectJ to trace a Java app's execution. Java made the (unfortunate) choice to focus on objects rather than functions, which makes it generally a poor fit for data processing due to its heavy use of mutable state within objects (mutable state is what limits most object-oriented projects to around 1 million lines). With all that mutable state, I couldn't keep the program's context in my head while stepping through it, so I had to analyze traces instead. AspectJ lets you hook into the code without modifying it, sort of like a debugger, so that things like function calls and variable mutations can be watched:

https://en.wikipedia.org/wiki/AspectJ

https://www.eclipse.org/aspectj/doc/released/progguide/index...

Looks like this might still be an open problem in Python:

https://stackoverflow.com/questions/12356713/aspect-oriented...

https://docs.spring.io/spring-python/1.2.x/sphinx/html/aop.h...
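In the meantime, a decorator gets you a crude approximation of "watch the calls without editing the function bodies" in plain Python (nothing like AspectJ's pointcut language, just wrapping by hand; the function names here are made up for illustration):

    import functools

    def traced(fn):
        # the "advice": log entry and exit around the original function
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            print(f"-> {fn.__name__}{args}")
            result = fn(*args, **kwargs)
            print(f"<- {fn.__name__} returned {result!r}")
            return result
        return wrapper

    def clean(value):
        return value.strip().lower()

    # "weave" the tracing in after the fact, without touching clean()'s source
    clean = traced(clean)
    clean("  Hello ")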

But it seems like SQL would be a good candidate for AOP:

https://stackoverflow.com/questions/12271588/aspect-oriented...

https://technology.amis.nl/it/aspect-oriented-programming-ao...

If we had that, maybe we could automatically generate derivatives for the set operations, then access them either as variables in stored procedures, or as something like views, or via metadata stored somewhere like MySQL's INFORMATION_SCHEMA.

I don't really know, but maybe these breadcrumbs could be helpful.


If it's your primary residence, the homestead tax credit can offset that increase substantially:

https://smartasset.com/taxes/what-is-a-homestead-tax-exempti...

OP's original point still stands. Real estate in America is designed to have many economic advantages for wealth accumulation and preservation. You might even go so far as to say the economic advantages of real estate are a byproduct of regulatory capture.


> economic advantages of real estate are a byproduct of regulatory capture

This seems correct to me. I don't think it's been specifically designed to be a store of wealth.


Hopium is a powerful drug


This is an interesting observation. Some of the data scientists I work with have been using Kotlin to define "analytic grammars", and it's the first example (in my limited Kotlin awareness) I've seen of Kotlin outside of Android development.

It seems that Roman Elizarov (Kotlin Project Lead) has identified the opportunity for a better language ecosystem to enter the data science space:

https://discuss.kotlinlang.org/t/ai-and-deep-learning-define...



There's obviously no guarantee in the long run, but Gabriel Weinberg (DDG founder) is arguably one of the most holistic tech founders when it comes to tackling socially impactful problems in tech, probably due to his interdisciplinary background. He's detailed his approach in a variety of podcasts and blog posts over the years:

https://medium.com/@yegg/mental-models-i-find-repeatedly-use...

https://fs.blog/gabriel-weinberg/


There are decades of academic work in the form of "impossibility" results showing that handling Byzantine fault tolerance in distributed consensus is a lot more computationally expensive than omission/crash fault tolerance. The following (the FLP impossibility result) is one of the most famous:

https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf


Yes, that's the root of the problem. A deceptively simple-sounding result.


One of the contributors wrote a memoir of his time spent working on this project as a PhD student, which gives an interesting perspective on the challenges encountered:

http://pgbovine.net/PhD-memoir.htm

