We’ve been trying to implement an active-learning retraining loop for our critical NLP models at Koko but never found the time to prioritize the work, as it was a multi-sprint level of effort. We’ve been working with them for a few weeks now and we are seeing meaningful performance improvements in our models. I highly recommend trying them out.
For many domains, active learning is actually not that efficient. The promise is that you label only a subset of the data and train a model on it with the same accuracy. The reality is that in order to estimate the long tail properly you need all the data points in the training set, not just a subset.
Consider the simple language-model case. In order to learn a specific phrase you need to see it in training, and phrases of interest are rare (often 1-2 occurrences per terabyte of data). You simply cannot select just half the data.
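A back-of-envelope sketch of why subsampling hurts the long tail: if you keep a random fraction of the corpus, each occurrence of a rare phrase is dropped independently, so the chance of missing the phrase entirely decays only geometrically in its count. (The `keep_fraction` and counts below are illustrative, not from the comment.)

```python
# Probability that a random subsample misses every occurrence of a phrase
# that appears k times in the full corpus, keeping each example
# independently with probability keep_fraction.
def p_miss(k, keep_fraction=0.5):
    return (1 - keep_fraction) ** k

for k in (1, 2, 5):
    print(f"phrase seen {k}x -> missed with prob {p_miss(k):.3f}")
```

So a phrase seen once or twice in the whole corpus is missed half or a quarter of the time by a 50% subsample, which is exactly the long-tail content you care about.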
Semi-supervised learning and self-supervised learning are more reasonable and more widely used. You still consider all the data for training; you just don't annotate it manually.
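One common flavor of semi-supervised learning is pseudo-labeling: train on the small labeled set, label the rest of the pool with the model's own confident predictions, and retrain on everything. A minimal sketch, assuming scikit-learn; the synthetic data, model choice, and 0.9 confidence threshold are all illustrative:

```python
# Pseudo-labeling sketch: every data point is still "seen" by training,
# but only a small subset is manually annotated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)   # toy labels
X_unlabeled = rng.normal(size=(200, 5))          # large unannotated pool

model = LogisticRegression().fit(X_labeled, y_labeled)

# Keep only confident predictions as pseudo-labels, then retrain on everything.
confidence = model.predict_proba(X_unlabeled).max(axis=1)
confident = confidence > 0.9
X_all = np.vstack([X_labeled, X_unlabeled[confident]])
y_all = np.concatenate([y_labeled, model.predict(X_unlabeled[confident])])
model = LogisticRegression().fit(X_all, y_all)
```

The confidence threshold is the knob that trades pseudo-label quantity against quality; self-supervised pretraining sidesteps labels entirely by learning representations from the raw data first.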
You are right. Being able to learn good feature representations through SSL is very powerful. We leverage such representations to perform tasks like semantic search to tackle problems like long-tail sampling.
We have seen pretty good results mining for edge cases. Let me know if you'd like to chat about it.
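The comment doesn't specify the mining pipeline, but the usual shape of embedding-based edge-case mining is: embed the unlabeled pool with a self-supervised encoder, then pull the nearest neighbors of a known edge case for annotation. A hedged sketch with random stand-in embeddings (names, shapes, and the top-k value are illustrative):

```python
# Long-tail mining via semantic search: find unlabeled examples whose
# embeddings are closest to a known edge case, then route them to labelers.
import numpy as np

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(1000, 64))  # stand-in for encoder outputs
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

# Stand-in query: a lightly perturbed copy of pool item 42 plays the
# role of a known edge case's embedding.
seed_emb = corpus_emb[42] + 0.01 * rng.normal(size=64)
seed_emb /= np.linalg.norm(seed_emb)

# Cosine similarity reduces to a dot product on unit vectors.
sims = corpus_emb @ seed_emb
top_k = np.argsort(sims)[::-1][:10]  # candidate indices to annotate
```

In practice the brute-force dot product would be replaced by an approximate nearest-neighbor index once the pool gets large.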
Koko | Data Scientist | NYC, USA preferred (REMOTE for the right candidate)
Koko’s mission is to bring well-being to everyone.
At its heart, Koko is a peer-to-peer network that offers evidence-based emotional support wherever people exist online. As is often the case with large networks, our data has unlocked exciting new opportunities. We are using this data to build machines that can interact with people at deep, emotional levels.
We are building empathetic machines
Koko was originally conceived at the MIT Media Lab and is now based in NY. It has raised funding from Union Square Ventures and Omidyar Network and is growing fast through partnerships with very large social communities.
The lead engineer for "data" at Koko will be obsessed with using data to build exceptional products using state-of-the-art tools and algorithms. You’ll be building classifiers to detect nuanced emotional states, such as whether someone is at risk of harming themselves or whether a response created by the community is empathetic. You’ll also develop new data products, such as an information retrieval system that re-purposes existing data to help distressed users in real time.
Requirements:
* Agile, resourceful and pragmatic. Strives to find the most efficient and highest-performing solution to any problem.
* Comfortable working across the entire tech stack to build, launch and maintain data products.
* Highly proficient in statistics and machine learning. Excited about the ever-growing set of libraries, tools and services that enable state-of-the-art deep-learning algorithms, e.g. TensorFlow, Theano, FastText. Excited to stay on top of the latest research being published.
* Skilled in written and oral communication (able to summarize data insights for team members, board members, research collaborators, and the greater public)
We've been developing a bot with their platform for the last few weeks and it's a well-designed API with features that can help sculpt a great bot experience, e.g. allowing for controlling the amount of time the bot appears to be typing. Definitely worth a look if you're into bots (and as a team that recently developed an iOS app, boy is it great to be out from under the thumb of Apple's review process and developing on a platform where the UX is the copy).
Thanks for sharing, jbarmash. Promiscuous has enabled us to break up our monolithic application into 10 single-purpose apps. It's essentially application-level replication for MVC apps.