The author here. If you have any questions about the article, I'm happy to help....

vaidhy · on July 15, 2020

I think there is a huge underlying assumption that given some data, building a model for it is trivial and can be done on the fly. I have seen the same kind of approach from people who have built toy models in 10 lines using pytorch and seem to equate fizzbuzz code for production code.

If you can clearly articulate how you do feature engineering, model debugging, meeting latency requirements, handling constant updates, dealing with non-numerical data and all the other issues that real world ML faces inside a query engine automatically, we can sit together and have a meaningful chat.

arauhala · on July 15, 2020

I do understand your point. There are definately tons of hard data science problems, which are simply not suitable for the predictive query kind of approach.

At the same time there are tons of ML problems e.g. in process automation or user interaction, which have extremely strong patterns and very easy to treat with sophisticated enough ML model.

Regarding your list of items. Feature engineering is greatly managed by the user selecting relevant facts in the query, by analyzers, by MDL based feature learning and by information theory based feature selection. I feel this approach is pretty robust for many problems, all thought not complete. There are special queries like $on for making conditional variables of form A|B, and $numeric to deal numeric data, that can be used manually.

Model debugging can be partly done with $why explanations, that are easy to create with the Bayesian approach. I feel that model debugging has been good enough.

Latency requirements and constant updates are more about software/database engineering and they are solvable, but right now we do recommend batch updates and applications, that can deal with sparsely occurring multi second latency. And OFC if you have limited data sets (less than 100k), there shouldn't be such problems.

I feel that all the problems you listed are solvable, but they are of course hard problems and we are still on the roadmap on fully solving those issues for larger set of applications. For many applications (like RPA, internal tools, analytics) these are not real issues, while the benefits (easiness, speed) are extremely concrete and relevant.

YeGoblynQueenne · on July 15, 2020

Thank you for the article. I have a question. The "democratization of machine learning" link in the article is missing and I'm wondering what can that term mean. Can you explain? What is "the democratization of machine learning"?

mellosouls · on July 15, 2020

The actual link should be:

https://knowledge.wharton.upenn.edu/article/democratization-...

It's a mistype bug on the OP site.

ypcx · on July 15, 2020

2020 version: https://openai.com/charter/

barumrho · on July 15, 2020

I would like to see more than just toy benchmarks. Also, can you provide any information on theoretical basis of this?

I can see this carving out a space for low value tasks.

arauhala · on July 15, 2020

We have few customers in production. One is about smart purchase invoice automation and - IMO - the real world data set is not that different from those StatLog / UCI / Kaggle datasets.

Of course our customer datasets tend to be on the easier side of the ML application field, but like you mentioned: the easy / fast or the low / mid-value ML applications are the place, where Aito / predictive database strengths play out and where they can carve out space.