It's hard to find articles like this that give a glimpse into what is used by larger shops doing ML. I take this one with a grain of salt since the source is a vendor, but it is still generous with the amount of detail and even mentions some alternative solutions for cases that might fit them better, which is really appreciated.
The pros working in big shops who write these tend to overlook the tiny use cases such as apps that recognize a cat coming through a cat door (as opposed to a raccoon) which can get by with minuscule training.
There's a lot of discussion of "big data" but small data is amazingly powerful too. I wish there was more bridging of these two worlds — to have tools that deal with the needs of small data, without the assumption that training a model takes days or months, and on the other side, to have the big data world share more insights about how they manage their data for the big cases. There is a ton of info out there but what I find lacking is info about how labeling and tagging is managed on a large scale (I'm interested in both, big and small, as well as medium). Maybe I'm just missing something. This article gave some good clues — thanks!
> There's a lot of discussion of "big data" but small data is amazingly powerful too.
You can oftentimes do surprisingly well on a smaller dataset with proper augmentation. Unfortunately augmentation techniques don't seem to get quite as much attention as they deserve, perhaps because they're perceived in ML conferences as problem-specific hacks rather than general-purpose techniques.
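To make that concrete, here is a minimal sketch of the sort of pipeline I mean, using torchvision for an image-classification setting (the dataset path and parameters are just placeholders):

    # Minimal image-augmentation sketch with torchvision (illustrative only;
    # assumes a small image-classification dataset laid out by class folder).
    import torch
    from torchvision import datasets, transforms

    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crops
        transforms.RandomHorizontalFlip(),                     # mirror half the images
        transforms.ColorJitter(brightness=0.2, contrast=0.2),  # lighting variation
        transforms.ToTensor(),
    ])

    # Every epoch sees a differently perturbed version of each image, which
    # behaves like extra training data / a regulariser on small datasets.
    train_set = datasets.ImageFolder("data/train", transform=train_transforms)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)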
I agree. I’ve actually been working on something along these lines[0], albeit with a focus on marketing analytics. In my experience everyone in marketing cares a lot about dashboards and metrics, but outside of larger shops almost no one is doing any real analysis, and even simple tools like linear regression could have a big impact.
Early days but we’re looking to onboard a few more customers to help guide our roadmap.
I feel that too - I joined HL just a couple of weeks ago (from a Unix/webdev background) and it's been a lot of learning so far. I'm going to do a little research (and write a blog post) into the specific case of 'too subtle for a regex', aimed at general webdev folk who have a problem to solve rather than people who already want to use ML.
Nice to see some active learning around here. To add a data point from a less successful story:
In one of our research projects, we used AL to improve part-of-speech prediction, inspired by work by Rehbein and Ruppenhofer, e.g. https://www.aclweb.org/anthology/P17-1107/
Our data was a corpus of Scientific English from the 17th century to the present, and for our data and situation, we found that choosing the right tool/model and having the right training data were the most important things. Once that was in place, active learning did not, unfortunately, add that much. For different tools/settings, we got about +/-0.2% in accuracy for checking 200k tokens and only correcting 400 of them.
Maybe one problem was that AL was only triggered when a majority vote was inconclusive. Also, we used it on top of individualised, gold-standard (GS) training data. I guess things can look different if you don't have a gold standard to start with. And if you have better computational resources: our oracles spent quite some time waiting, which is why we even reorganised the original design to process batches of corrections.
As so often, those null results were hard to publish :|
Either way, I thought I'd share our experiences. Your work sounds really cool, best of luck!
I would agree - active learning is a neat idea, but while it gets you up the learning curve more quickly, that does not necessarily translate into saving data in practice, for two reasons.
First, a lot of the AL papers use _simulation_ scenarios rather than production scenarios, i.e. there is already more training data available, it just gets withheld. Obviously, if you already have more data, you have already spent the effort annotating it, so there can't have been any saving.
Second, as long as the learning curve isn't flat you always want to annotate more data than you have, so the question isn't how quickly you climb the curve, but whether you should keep annotating or whether a flattening curve suggests you have reached the area of diminishing returns.
There are many sampling strategies - balancing exploration and exploitation, expected model change, expected error reduction, exponentiated gradient exploration, uncertainty sampling, query by committee, querying from diverse subspaces/partitions, variance reduction, conformal predictors, mismatch-first farthest-traversal - and there isn't a theory to pick the best one given what you know (I've mostly heard people play with uncertainty sampling or query by committee in academia, but nobody in industry I know has told me they use AL).
I think active learning has a time and a place. If you're getting started with a project from scratch, you probably don't need active learning for the exact reasons you describe - as long as you still get good improvements to model performance by labeling randomly sampled data, then you should scale out your labeling to get more data faster. For modern convnets fine-tuned on image data, I don't think you should think about active learning until you're past 10,000 examples.
Active learning becomes really useful when you hit diminishing returns, as most real-world ML applications deal with long tail distributions, and random sampling doesn't pick out edge cases for labeling very well. An easy way to tell if you're encountering diminishing returns is to do an ablation study where you train the same model against different subsets of your train set, evaluate them against each other on the same test set, and plot out the curve of model performance vs dataset size to see if you're starting to plateau. Or just eyeball your model errors and try to see if there's any patterns of edge cases it fails on.
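As a rough sketch of that kind of ablation (scikit-learn and a toy dataset purely for illustration; swap in your own model, data and metric):

    # Dataset-size ablation: train on growing subsets of the training data and
    # plot accuracy vs. subset size to spot a plateau. Illustrative only.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    sizes = [100, 200, 400, 800, len(X_train)]
    scores = []
    for n in sizes:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:n], y_train[:n])          # same model, smaller train set
        scores.append(model.score(X_test, y_test))   # same held-out test set

    plt.plot(sizes, scores, marker="o")
    plt.xlabel("training examples")
    plt.ylabel("test accuracy")
    plt.show()   # a flattening curve suggests diminishing returns from more labels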
Lastly, I'm pretty skeptical of model-based uncertainty sampling. In industry, almost every active learning implementation boils down to heuristics for "what data should we label next", since model-based active learning is pretty hard to set up and confidence sampling is often not very reliable. That said, I've anecdotally heard of some teams getting great performance from Bayesian methods once they have a large enough base dataset.
Can you shed some light on what you think are the most valuable methods for identifying high entropy examples for the model to learn faster? I'm familiar with Pool-Based Sampling, Stream-Based Selective Sampling, Membership Query Synthesis[1], but less certain which techniques are most useful in NLP.
So entropy-based active learning methods are an example of pool-based sampling. Even within pool-based sampling there are a few different techniques.
Entropy selection for pool-based methods looks at the model's output probabilities for each prediction over the unlabelled dataset, calculates the entropy of those predictive distributions (in classification this is a bit like looking for the most uniform predictive distributions) and prioritises the highest-entropy examples.
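As a simplified sketch (predict_proba and the pool are placeholders for whatever your model and unlabelled data look like):

    # Entropy-based pool sampling (sketch): score each unlabelled example by the
    # entropy of the model's predictive distribution and pick the top k.
    import numpy as np

    def select_by_entropy(probs, k):
        """probs: (n_unlabelled, n_classes) predicted probabilities."""
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # high = uncertain
        return np.argsort(-entropy)[:k]   # indices of the k most uncertain examples

    # e.g. query_idx = select_by_entropy(model.predict_proba(X_pool), k=100)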
Entropy-based active learning works OK but doesn't distinguish uncertainty that comes from a lack of knowledge (epistemic uncertainty) from noise (aleatoric uncertainty). Techniques like Bayesian Active Learning by Disagreement (BALD) can do better. :)
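For anyone curious, a rough generic sketch of the BALD score from T stochastic forward passes (e.g. MC dropout) - not any particular library's implementation:

    # BALD (sketch): mutual information between predictions and model parameters,
    # approximated from T stochastic forward passes. probs_mc: (T, n, n_classes).
    import numpy as np

    def bald_scores(probs_mc):
        eps = 1e-12
        mean_probs = probs_mc.mean(axis=0)
        h_mean = -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)           # total uncertainty
        mean_h = -np.sum(probs_mc * np.log(probs_mc + eps), axis=2).mean(axis=0)  # average per-pass entropy
        return h_mean - mean_h   # high = the passes disagree, i.e. epistemic uncertainty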
Ha! This is amazing -- we did a similar process for an EEG research project, and it was stellar (working memory and learning curves)! Until now, I didn't have the right words to articulate what we did - so thank you for the incantation!
Mike from Humanloop here - if you're interested in active learning we'll be around on this thread, also we're looking for fullstack SW engineers and ML engineers - https://news.ycombinator.com/item?id=25992607
Hey Mike - I did some work on an industry active learning system a few years ago. The high level finding was that transfer & nonparametric methods were huge wins, but online uncertainty and "real active learning" hardly worked at all and were super complicated (at least in caffe1 anyway lol).
Can you point to any big breakthroughs that have helped in recent years? Linear Hypermodels (https://arxiv.org/abs/2006.07464) seem promising, but that original experience has left me with some healthy skepticism.
Adding to what Raza has said - on your point about "real active learning" hardly working, I would be interested to hear what approaches you took.
We've found that the quality of the uncertainty estimate for your model is quite important for active learning to work well in practice. So applying good approximations of model uncertainty for modern-sized transformer models (like BERT) is an important consideration.
IIRC we were using a Bayesian classification model on top of fixed pretrained features from transfer learning, something along the lines of refitting a GP every time the number of classes changed. This was images as opposed to text, and after an epoch classification was ~OK, but during training (e.g. the active bit) we didn't see much benefit.
Hi NMCA,
I'm Raza, one of the founders of Humanloop. I totally agree that transfer learning is one of the best strategies for data efficiency, and it's pretty common to see people start from large pre-trained models like BERT. Active learning then provides an additional benefit, especially when labels are expensive - for example, we've worked with teams where lawyers have to do the annotation.
I think you're right that it used to be much too hard to get active learning to work. Part of what we're trying to do is make it easy enough that it's worth the benefits.
I'd also point out that people always focus on just the labelling savings from active learning but there are other benefits in practice too:
1) faster feedback on model performance during the annotation process, and
2) better engagement from the annotators, as they can see the benefit of their work.
I'd add that there is a deep connection between active learning and understanding the "domain of expertise" of a model - for example, which inputs are ambiguous or low confidence, and which are out of distribution. E.g. BALD is a form of out-of-distribution detection: a point with high disagreement is not only useful to add to the training pool, it is a point for which the current model has no business making a prediction.
Hi Mike, let me know if I'm getting too into specifics for casual conversation. You mentioned a smaller dataset for training and I'm curious as to how much smaller. Like, for static image recognition, what kind of dataset trade-off are we looking at? Also, randomly, is there a compute cost trade-off or is it just a smarter process?
I read the site, and no one likes cleaning data (no one), and answering questions from a "toddler machine" (for lack of a better term) doesn't sound as bad, but I was curious what potential trade offs there might be.
We've found you can get an order of magnitude improvement in the amount of labelled data you need - but there is some variance based on the difficulty of the problem. Because you are retraining the model in tandem with the data labelling process, there is additional compute associated with an active-learning-powered labelling process versus just selecting the data to label next at random. But this additional compute cost is almost always outweighed by the saving in human time spent labelling.
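To make that trade-off concrete, the loop is roughly the following (a self-contained toy sketch using scikit-learn, with the "human" simulated by an oracle that reveals held-back labels - not our production setup):

    # Toy pool-based active learning loop (illustrative, not production code).
    # The extra compute comes from retraining the model every round.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, random_state=0)
    labelled = np.arange(20)              # small seed set of labelled examples
    pool = np.arange(20, len(X))          # "unlabelled" pool (labels held back)

    for _ in range(10):
        model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
        probs = model.predict_proba(X[pool])
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        query = pool[np.argsort(-entropy)[:50]]   # most uncertain examples
        # In production a human labels X[query] here; the oracle just uses y[query].
        labelled = np.concatenate([labelled, query])
        pool = np.setdiff1d(pool, query)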
I have a question on the compute aspect regarding your business model, hope I'm not being too nosy...
I tried HL, the experience was stellar (well done!) and it made me think...
To get AL working with a great user experience you need quite a bit of compute. How are you thinking about your margins, e.g. the cost to produce what you're offering versus what customers will pay for it?
Thanks for the feedback! It's a good question re compute. There are some fun engineering and ML research challenges that we are constantly iterating on that are related to this. A few examples:
- how to most efficiently share compute resources in a JIT manner (e.g. GPU memory) during model serving for both training and inference (where the use case and privacy requirements permit)
- how to construct model training algorithms that operate in a more online manner effectively (so you don't have to retrain on the whole dataset when you see new examples)
- how to significantly reduce the model footprint (in terms of memory and FLOPs) of modern deep transformer models, given they are highly over-parameterised and can contain a lot of redundancy.
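As one concrete illustration of the footprint point (an example of the kind of technique involved, not a description of our production stack), dynamically quantizing a transformer's linear layers already buys a lot:

    # Dynamic quantization of a BERT-style model (illustrative example only).
    # Linear-layer weights are stored as int8, shrinking memory and speeding up
    # CPU inference, usually with only a small accuracy hit.
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )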
I have a suggestion about the first plot you show in the writeup. From what I can see, it is based on a finite pool of data and so it undersells active learning: performance shoots up as AL finds the interesting points, but then the curve flattens and is less steep than the random curve as the "boring" points get added. It would be nice to see the same curve for a bigger training pool where AL was able to get to a target accuracy without running out of valuable training points. I suspect that would make the difference between the two curves much more stark. As it is, it just looks like AL does better for very low data but to get to high accuracy you need to use the whole dataset anyway so it's a wash between AL and random.
Yeah, I think that's a good point. I'm actually planning to do a follow-up post that is a case study with some real-world data, and the plots in that are much more like what you describe.
P.S. I'm the author of the presentations above; great to see Active Learning (AL) finally getting proper attention (I've been working in the AL area for 10+ years).
I'm not sure I really understand the advantage to AL in this context. Sure you get better performance earlier, but if you want the best performance you still appear to have to train with the same amount of data. Given the training -> example identification -> annotation -> training loop is going to be much slower than just continuing to annotate data and then running all the data at once (for a variety of reasons), I think if you were to do an honest total time and total monetary cost comparison you would probably come out with AL being more expensive overall... Am I missing something here?
I made a comment on this too: they are not presenting it in the best light, because their example has a finite pool of data, so using all of it will give the best performance. You can see the active learning curve is steep at first as the technique identifies the most useful data points to add to the training pool, but then flattens out once these are exhausted.
But for situations where there is a bigger pool of unlabelled data, active learning can identify which subset should be labelled to produce the best model performance. As long as the unlabelled pool contains "valuable" examples, the curve should remain steep and should ideally meet performance targets much faster than just annotating data at random.
Also, there is some evidence that adding too many easy points to the training pool can reduce performance - see e.g. focal loss. So active learning could potentially mitigate that effect (mind you, so could using focal loss, but that would require more labeling).
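(For reference: focal loss down-weights the easy, confidently-classified examples. A minimal PyTorch sketch for the binary case:)

    # Minimal binary focal loss (sketch). Easy examples (high p_t) are
    # down-weighted by the (1 - p_t)^gamma factor.
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                      # probability of the true class
        return ((1.0 - p_t) ** gamma * ce).mean()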
Synthetic data is particularly valuable when even the unlabelled data is expensive to obtain. For example if you want to train a driverless car, you may never see an ambulance driving at night in the rain even if you drive for thousands of miles. In that case, being able to synthesise data makes a lot of sense and we have lots of tools for computer graphics that make this easy.
Synthetic data can also be useful to share data when there are privacy concerns but my own feeling here is that there are better approaches to privacy preservation, like federated learning and learning via randomised response (https://arxiv.org/pdf/2001.04942.pdf).
In general though, outside of some vision applications, I'm pretty sceptical of synthetic data. For synthetic data to work well, you need a really good class-conditional generator, e.g. "generate a tweet that is a negative-sentiment review" - but if you have a sufficiently good model to do this, then you can probably use that model to solve your classification task anyway.
For most settings, I think synthetic data will work for data augmentation as a regulariser but will not be a substitute for all labelled data.
For the labelled data, active learning should still help.
Former self-driving engineer here. I'm also pretty skeptical about synthetic data. For the scenario you described, it turns out that if you drive enough, you'll eventually see some examples of ambulances at night in the rain. If it's really that rare, it's often easier to rent your own ambulance, drive around and do some staged data collection, and annotate the results than it is to set up a synthetic data pipeline.
At the end of the day, even in vision applications, real data is always better than synthetic data if you can get it. Things like sensor noise or interference are hard to replicate in synthetic data. Most teams turn to synthetic data for simulation purposes or as a last resort.
Spam filter is an interesting choice of motivating example, since usually it is your users labeling the data, rather than something that happens during the R&D process. You could try to use active learning but I'm not sure the users would like that product experience.
My high-level answer is weak-labelling overcomes cold starts and active learning helps with the last mile.
More detail:
We see weak labelling as very complementary to active learning. By using labelling functions, you can quickly overcome the cold start problem and also better leverage external resources like knowledge bases.
But much of the work in training ML systems comes in getting the last few percentage points of performance - going from good to good enough. This is where active learning can really shine, because it guides you to the data you really need to move model performance.
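To give a flavour of what labelling functions look like (a deliberately simple, plain-Python sketch rather than a full Snorkel-style pipeline; the heuristics here are made up):

    # Weak labelling sketch: a few noisy heuristics ("labelling functions") vote
    # on each example; abstentions and disagreements are left for human review
    # or active learning.
    SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

    def lf_contains_offer(text, sender):
        return SPAM if "limited time offer" in text.lower() else ABSTAIN

    def lf_known_sender(text, sender):
        return NOT_SPAM if sender.endswith("@mycompany.com") else ABSTAIN

    LABELLING_FUNCTIONS = [lf_contains_offer, lf_known_sender]

    def weak_label(text, sender):
        votes = [lf(text, sender) for lf in LABELLING_FUNCTIONS]
        votes = [v for v in votes if v != ABSTAIN]
        if not votes or len(set(votes)) > 1:
            return ABSTAIN                        # no coverage, or the LFs disagree
        return votes[0]                           # unanimous weak label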
At Humanloop, we've started with active learning tools but are also doing a lot of work on weak labelling.
Hey Luke! I'm a full stack / devops person so this is going to be high level, but I'm getting a couple of my ML researcher colleagues in for a deeper dive now.
Short version:
1) we can get to the same level of accuracy with around 10% of the data points. Getting (and managing) a big enough data set to train a supervised learning model is the biggest thing slowing down ML deployments.
2) the model contacts a human when it can't label a data point with a high degree of confidence. You'll never have people with a bunch of specialist knowledge being asked to perform mundane data labelling tasks.
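Roughly, point 2 looks something like this in code (very simplified; the 0.9 threshold and how each bucket gets handled are placeholders):

    # Simplified confidence-based routing: auto-label what the model is sure
    # about, send the rest to a human annotator.
    import numpy as np

    CONFIDENCE_THRESHOLD = 0.9   # illustrative; tuned per task in practice

    def route(probs, items):
        """probs: (n, n_classes) model probabilities; items: the raw examples."""
        auto_labelled, needs_human = [], []
        for p, item in zip(probs, items):
            if p.max() >= CONFIDENCE_THRESHOLD:
                auto_labelled.append((item, int(p.argmax())))  # trust the model
            else:
                needs_human.append(item)                       # human-in-the-loop
        return auto_labelled, needs_human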
Adding to what Raza said - a consideration for both active learning and weak supervision is the need to construct a gold-standard labelled dataset for model validation and testing purposes. At Humanloop, in addition to collecting training data, we are also using a form of active learning to speed up the collection of an unbiased test set.
Another consideration on the weak supervision side (for the Snorkel-style approach of labelling functions) is that creating labelling functions can be a relatively technical task, which may not be well suited to non-technical domain-expert annotators providing feedback to the model.