Looks like this task focused on binary sentiment analysis (positive or negative movie reviews) - have you tried this on something with a broader potential output space? This seems relevant for what you’re calling “neural tags” on your client’s customer conversations, which seems more open-ended than simply “positive” or “negative”.
Yes, great insight! The choice to focus on sentiment here was mainly to align with fast.ai's original research, hopefully maximizing the generalizability and accessibility of the results.
Internally we have made use of these results to improve a broad set of language tasks; hopefully we will be able to publish on those in the coming months as well.
Can you explain the domain-only model overtaking ULMFiT at 65k(?) unlabeled examples? Just noise, or is the ULM contribution competing with the domain model in some way?
I think in this case the alignment between unlabeled domain data and the language task supports a convergence in language task performance. One argument to continue to prefer the ULM+domain model is that it is likely more generally capable if you remain in the same domain but switch to a task that is less directly related to your unlabeled data. I haven't seen any research that directly speaks to that intuition, so it's a good area for further study.
Out of curiosity, have you done much work on examining your misclassifications? I'd be curious to know if there are giveaways for "negative" sentiment that show up in your task versus, say, reviews of Spiderman II.
In this work we didn't explore classification performance characteristics. I suspect the misclassifications at lower levels of domain data would revolve around the ways language usage in reviews differs from common English. "Blockbuster" may have a generally negative or neutral sentiment in a Wikipedia-based language model, perhaps most often referring to the failed rental chain. In the context of movie reviews, "blockbuster" is almost universally positive.
One of the fantastic qualities of embedding-based language models is that they provide a view on a semantic space that can be used quantitatively in almost any downstream language task. As a conversational intelligence company, Frame has many products that are enhanced by having a high-quality domain-specific language model to build on: tagging, sentiment, topic extraction, keywords, summarization, etc. Best of all, these products can be iterated on in parallel! Improvements in a language model's representation of a body of text should improve all downstream tasks without modification.
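To make that concrete, here's a minimal sketch of the pattern: a frozen language model supplies document embeddings, and each product is just a lightweight head fit on top of them. Everything below (the `DomainLM` class, `embed()`, the toy data) is hypothetical plumbing for illustration, not Frame's actual stack:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class DomainLM:
    """Stand-in for a domain-adapted language model; a real one would
    run each document through its encoder inside embed()."""
    def __init__(self, dim=128, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dim = dim
        self._cache = {}

    def embed(self, text):
        # Deterministic toy embedding keyed on the text, purely so the
        # example runs; real embeddings carry the semantics.
        if text not in self._cache:
            self._cache[text] = self.rng.standard_normal(self.dim)
        return self._cache[text]

def train_head(lm, texts, labels):
    """Fit a lightweight task head on frozen LM embeddings."""
    X = np.stack([lm.embed(t) for t in texts])
    return LogisticRegression(max_iter=1000).fit(X, labels)

lm = DomainLM()
# Separate heads for separate products, all sharing one embedding space;
# swapping in a better lm improves every head with no changes here.
sentiment_head = train_head(lm, ["great support", "awful wait"], [1, 0])
tagging_head = train_head(lm, ["billing issue", "broken login"], ["billing", "account"])
```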
The same thing is also true of computer vision models. A core deep network is usually trained either with a dual embedding of associated text against a search-ranking objective, or to predict tags or labels. The resulting network may be of limited use on its original training task, but extracting the activations of some deep layer turns it into an excellent embedding model.
You then automatically encode your entire image collection, and every incoming image, into that embedding space and rely on it as a lingua franca on which to base all sorts of companion models: object detection, face recognition, gender/age/ethnicity prediction, spam detection, aesthetic/composition appraisal, caption generation, style transfer, and so on.
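For the curious, here's a rough sketch of that extraction step, assuming PyTorch/torchvision (my choice for illustration, not necessarily what's described above): take a pretrained classifier, drop its classification head, and use the penultimate-layer activations as the embedding.

```python
import torch
from torchvision import models, transforms

# Pretrained ImageNet classifier; the classification task itself isn't
# the point, the deep features are. (Assumes torchvision >= 0.13 for
# the weights API.)
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose penultimate-layer activations
backbone.eval()

# Standard ImageNet preprocessing so inputs match what the backbone saw.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    """Map PIL images to 2048-d vectors: the shared representation that
    tagging, face recognition, spam detection, etc. can all consume."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return backbone(batch)
```

Once your collection is encoded this way, even simple nearest-neighbor lookups in that space give you a useful visual-similarity search before you've trained any companion model at all.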