Cleanlab is automating data curation at scale 🌐

Plus: Chief Scientist Jonas on data curation, multi-modal AI, and Cleanlab Studio...

Published 03 May 2024

CV Deep Dive

Today, we’re talking with Jonas Mueller, Co-Founder and Chief Scientist of Cleanlab.

Cleanlab’s data curation platform helps AI/ML engineers and data scientists seamlessly improve the quality of any dataset. Pioneered out of MIT as an open-source project in 2018 by Jonas and his co-founders Curtis Northcutt (CEO) and Anish Athalye (CTO), the startup has built the world’s most popular Data-Centric AI software. Their mission: to enable reliable AI & Analytics by tackling “garbage in, garbage out” at enterprise-scale. Their secret: a novel AI system that automatically finds and fixes common dataset issues.

The startup’s flagship product, Cleanlab Studio, automates critical data curation, data annotation, and data quality work that today’s top AI companies (Google, OpenAI, Tesla, …) are doing with massive manual labor investments. Using Cleanlab Studio, teams across all industries (who cannot afford such investment) can still deploy similarly successful AI applications.

Cleanlab can be used for any image, text, or structured/tabular dataset (via a no-code interface or Python APIs), and is being adopted by Tech Consultants, Data Analysts, and emerging ‘AI Engineers’ - roles responsible for turning data into business value without formal education in Machine Learning.

Today, Cleanlab has hundreds of companies using its software, including some of the biggest tech companies in the world. In October 2023, the startup announced a $25 million Series A funding round led by Menlo Ventures and TQ Ventures, with existing investor Bain Capital Ventures (BCV) and new investor Databricks also participating.

In this conversation, Jonas walks us through the founding premise of Cleanlab, why data curation is the most critical component of your AI stack, and how open-source has shaped Cleanlab’s fast rise.

Let’s dive in ⚡️

Read time: 8 mins


Our Chat with Jonas 💬

Jonas - welcome to Cerebral Valley! First off, give us a bit about your background and what led you to co-found Cleanlab.

Hey there! I'm Jonas, and I'm the Chief Scientist and co-founder at Cleanlab. I co-founded the company with my two good friends from grad school, Anish and Curtis. We did our PhDs in Machine Learning at MIT and have known each other for 10 years.

In grad school, Curtis was the first ML researcher at edX, MIT’s online education platform. He was trying to classify who was cheating or not, a big deal for course certifications. The labels for who is a cheater or not are very noisy, and so Curtis was chatting with us about research into how to automatically detect incorrect labels in a dataset. We published some ML papers together, and open-sourced the code for this research as a library called Cleanlab, which cleaned the labels of your dataset.
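The intuition behind that label-cleaning research can be illustrated with a toy sketch in plain Python. This is a simplified, hypothetical illustration with made-up data - the actual cleanlab library implements the full confident learning algorithm, with out-of-sample predicted probabilities and much more - but the core comparison is the same: flag examples where the model's confidence in the given label falls below what is typical for that class.

```python
# Minimal sketch of confident-learning-style label-issue detection.
# pred_probs: a model's predicted class probabilities for each example.
# labels: the (possibly noisy) given labels. All data here is made up.

pred_probs = [
    [0.9, 0.1],    # confidently class 0
    [0.2, 0.8],    # confidently class 1
    [0.85, 0.15],  # labeled class 1 below, but model says class 0 -> suspect
    [0.1, 0.9],    # confidently class 1
]
labels = [0, 1, 1, 1]

def find_label_issues(labels, pred_probs):
    n_classes = len(pred_probs[0])
    # Per-class threshold: the average model confidence in class k,
    # computed over examples actually labeled k ("self-confidence").
    thresholds = []
    for k in range(n_classes):
        confs = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(confs) / len(confs))
    # Flag examples whose confidence in their own given label falls
    # below that label's typical self-confidence.
    return [i for i, (p, y) in enumerate(zip(pred_probs, labels))
            if p[y] < thresholds[y]]

print(find_label_issues(labels, pred_probs))  # -> [2]
```

Example 2 is flagged because the model assigns only 0.15 probability to its given label, far below the average self-confidence of other class-1 examples.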

Years later, that library had started being used by large companies like Google, Amazon and many others, and they were asking us for support. Meanwhile, I had started working at Amazon Web Services, doing AutoML research that now supports many key AWS AI services. The goal was to democratize machine learning, and we did a pretty good job on the modeling side, enabling non-expert software engineers to use machine learning effectively on benchmark and tutorial datasets. However, these non-experts were still not driving business impact with ML at their company because the data was inevitably messy and full of issues they couldn’t diagnose/remediate.

I knew that for AutoML to truly democratize AI, we would need to build something that tells you everything wrong with your data. That coincided with our open source library exploding in popularity. At the end of 2021, we decided to start a company around it and expand the mission beyond AI that checks data labels to AI that generally checks data for any kind of issues with it.

Give us a top-level overview of Cleanlab to someone who isn’t familiar with the product. How would you describe Cleanlab?

We provide AI-powered software tools that help you improve the quality of your existing dataset faster than any other tool. The specifics will continually evolve as the capabilities of AI grow and the way people use data changes. What remains constant is how we achieve our mission, via a two-pronged approach.

The first is around algorithms. When given your dataset, our system automatically fits models to predict each piece of information (AutoML), with powerful Foundation models also bringing significant world knowledge to your data. These are combined with novel algorithms invented by our researchers, which map each data point to the likelihood that it exhibits certain kinds of issues commonly occurring in datasets like yours. Through AI, Cleanlab automatically understands the semantic information in your data points, unlike traditional data quality tools based on rules and statistics. While other companies think of Data → AI as a one-way street, Cleanlab’s scientists also think about the reverse direction. Our R&D ensures we're always using the best AI capabilities, combined with the best algorithms, to most reliably detect issues in your data and identify good remediation strategies.

On the other side, we invest heavily in user experience and interactive interfaces, because the overarching data-improvement problem will always be a human-in-the-loop workflow. The reason is that our system doesn't know where your data came from or how exactly it will be used to drive business value. So Cleanlab’s platform merely presents the detected problems in your data and suggests fixes, but you, the user, decide whether to enact those fixes. That's harder than it sounds - enacting a change across millions of data points isn’t easy. We invest a lot in designing intelligent data interfaces that allow a single user to understand and improve millions of data points all at once. It’s a shame that many big data products today aren’t enjoyable to interact with; our system feels like a helpful data prep copilot, guiding you toward the data and actions that are most high-ROI to focus on.

Who are your users today? Who’s currently finding the most value in what you’re building with Cleanlab?

Our data curation platform is totally no-code and very easy to use. You just put your data in, and our AI tells you all of the problems with the data and presents easy ways to curate a better-quality dataset. Today, the biggest users of Cleanlab are data scientists and ML engineers. But we’re seeing rapid growth amongst data analysts, data annotation/curation teams, as well as the emerging role of the AI Engineer. Despite lacking formal data science education, these folks successfully use our platform to achieve reliable information, analytics, and even ML models trained and deployed on their data.

Additionally, we have a lot of tech consulting firms using our platform. These consulting companies have a lot of employees new to the AI space, and with the rise of Generative AI, they're being tasked with cleaning up the datasets of their big company clients. Besides the GPU industry, tech consulting is currently achieving the largest revenue gains from GenAI, and many consultants are each saving millions of dollars in projects leveraging Cleanlab.

Where would you advise a new engineer to start in terms of incorporating Cleanlab into their AI stack?

I’d definitely recommend starting with Cleanlab Studio. Just provide your dataset without having to configure anything, and receive an email when your results are ready to analyze. The platform supports any image, text, and structured/tabular datasets - so it can be used in all your Data/AI projects.

Beyond data curation, Cleanlab also offers access to the internal pieces of the technology powering this no-code data curation experience via individual APIs. If your end goal is to improve your data for better machine learning, you can directly retrain Cleanlab’s AutoML system (that was internally used to detect data issues) on your curated dataset, and deploy it for serving high-accuracy predictions. It’s just a few clicks – the fastest way to do data prep, model training + tuning + selection, and deployment for real-time predictions.

Everything we do relies on trustworthy confidence scores associated with all the AI under the hood, because that's the only way we're able to detect problems in your data, suggest remedies, and auto-label data without exposing all the (inevitable) flaws of machine learning. We offer APIs to access those confidence scores - e.g., for LLM applications, we have a Trustworthy Language Model (TLM) that our platform internally uses to curate your text datasets. Companies have used our TLM in high-stakes AI and data processing applications; for Q&A, it can produce 10% more accurate answers than GPT-4.

How do you measure the success of your platform within the context of an AI/ML stack? Are there specific metrics you’re always looking at?

There are a number of different things we look at given the heterogeneity of applications. Our software is used across all industries and across companies of all sizes - from some of the biggest tech companies, banks, e-commerce platforms, and consulting firms in the world, to tiny startups and understaffed data science teams. There are even many individual professors, grad students, and hospitals using us!

Large organizations usually have big teams of people responsible for either data annotation or data curation - especially organizations like OpenAI, Google, and Tesla that are able to build good AI models. People don’t realize this, but those companies invest a huge amount of money and labor to ensure their data is really well-curated. Our software streamlines a lot of that work via AI, so that your team’s time is focused on only the most impactful subset of the data and corrective actions, and our system automatically handles the rest. Using Cleanlab, your team can often save 80% of your time/costs and still generate much better results. If your team is spending millions of dollars on annotating or curating data and dealing with downstream data quality issues, those savings can be tremendous.

If your end goal is machine learning, a natural thing to do is use our platform to curate a better version of your dataset, and then plug that better version into the machine learning you were doing before. This way, you can get better accuracy without changing any of your machine learning; we often see 15-50% improvements. This is often the difference between successful deployment and endless prototyping - errors and edge-cases in your data are simply hampering the last x% of model reliability that you require. As you keep doing more ML experiments in the future and new models come out, data improvements achieved with Cleanlab will remain valuable for improving your AI systems. Your company’s data is your competitive advantage; increasing its quality will unlock all sorts of long-term value.

The data curation and preparation space has received a lot of attention from a number of players - what differentiates Cleanlab’s approach relative to others in the space?

I'd say our biggest differentiator is that most data quality tools today require you, the data scientist/engineer, to develop a lot of rules. You have to anticipate all the possible problems that could be lurking in your dataset, and then codify rules to detect them. That really only works for structured/tabular datasets and limited types of data issues, and it requires intuition from years of data experience - even then, you're not going to anticipate all the possible problems in the dataset ahead of time.

Similarly, if you're an engineer whose job is to train/deploy a machine learning model, you’d typically explore the data yourself in ad hoc Jupyter notebooks. Our goal is to make all of that much more systematic by having an AI system that learns from your data and automatically surfaces all kinds of issues you didn't even think might be there (erroneous labels/values, near duplicates, outliers, ambiguous examples, low-quality/unsafe content, etc.). That's the big differentiating factor: making this whole process much more systematic, and providing a solution beyond tabular data that also works for images and text.
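Two of the issue types mentioned here - near duplicates and outliers - can be sketched in a few lines of plain Python. This is a toy illustration on made-up 2-D embeddings with hypothetical thresholds; production systems operate on learned, high-dimensional embeddings with calibrated, per-dataset thresholds.

```python
# Toy sketch: near-duplicates = pairs of points closer than a small
# tolerance; outliers = points unusually far from their nearest neighbor.
# Embeddings and thresholds below are made up for illustration.
import math

embeddings = [
    (0.0, 0.0), (0.01, 0.0),   # a near-duplicate pair
    (1.0, 1.0), (1.1, 0.9),
    (9.0, 9.0),                # an outlier, far from everything else
]

def near_duplicates(points, tol=0.05):
    # All pairs whose Euclidean distance is below the tolerance.
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) < tol]

def outliers(points, factor=5.0):
    # Nearest-neighbor distance for each point.
    nn = [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
          for i, p in enumerate(points)]
    typical = sorted(nn)[len(nn) // 2]  # median nearest-neighbor distance
    # Flag points whose nearest neighbor is much farther than typical.
    return [i for i, d in enumerate(nn) if d > factor * typical]

print(near_duplicates(embeddings))  # -> [(0, 1)]
print(outliers(embeddings))         # -> [4]
```

The same nearest-neighbor machinery generalizes to text or image data once each example is embedded as a vector, which is what makes a single approach workable beyond tabular data.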

What are some exciting developments you are seeing or working on?

Multi-modal is definitely interesting. Like I said, our platform works for image, text and tabular datasets, and we’re seeing more users for whom all three apply. For example, e-commerce product catalogs include product images, descriptions, and structured attributes like pricing, size, category. By relying on this multi-modal information, Cleanlab automatically catches problems like incorrect product attributes, low-quality/unsafe images, corrupted descriptions where the text isn’t nicely readable (e.g., truncated or with HTML tags). We're excited to see Data/AI projects become more multimodal! We've been doing it all along - it was painful for us in the early days. Folks advise startups to focus on one narrow data type and application, but we've always been multi-modal and dataset/industry agnostic. We’re glad to see these efforts paying off.

The other area that we’re really excited about is reliable LLM applications. Many big tech companies are trying to make general assistant LLMs that beat GPT-4 - but, for actual business use-cases, you need something much more reliable than GPT-4 for your specific use case. Companies will often fine-tune an LLM using all of their data (e.g., logs from customer service chats) but see worse performance than if they prompt-engineer GPT-4 properly.

The reason is that the data is full of irrelevant pairs of customer requests and responses, with all sorts of bad information in there. With Cleanlab, you're able to curate your dataset and then re-run the fine-tuning job with no change in any of the training/model parameters. We've seen multiple companies do this and produce models with 50% lower error-rates than GPT-4 for their particular application. The same strategy works for in-context learning when you stuff many dataset examples into the LLM prompt instead of fine-tuning the model.

Another big use case that we're seeing a lot nowadays is around document curation. Every enterprise thought having a RAG system to answer employee questions would be a no-brainer GenAI application they would have by now. In reality, few big companies have this in place because their document collections are messy and full of issues. Big companies (or consultants they hired) are using Cleanlab to curate all their documents, categorize them, and deal with duplicated/outdated documents.

How critical has the growth of the open-source AI community been on the trajectory of Cleanlab?

We view open-source as absolutely critical. Everything we build has to be deployable within a company's VPC because of the sensitive nature of data for big enterprises. We can’t rely on external APIs, and we build pretty much all of our tech upon the foundation of open source - we’re definitely not claiming we've invented everything from scratch!

We’re taking these big advances in open source Foundation models and finding creative algorithms that turn that big advance into a new way to curate data. A company like ours could not exist without the progress in open source, and that's why we contribute very regularly to our own open source library - to make sure that we're giving back in the same way that we are benefiting.

Lastly, tell us a bit about your team culture. Are you hiring, and what do you look for in prospective team members joining the company?

I’d say we have a really caring and helpful culture, we actually posted the tenets on GitHub. Today, we have 29 employees and we're looking to grow to 40. We used to have a small office in the Mission, and just upgraded to a bigger one in downtown SF.

The biggest aspect of the culture is to care about the impact of your work on the world and your coworkers. We're in this because ensuring the reliability of AI and information systems is such impactful work. It’s not glamorous - my family jokes that companies have us scrub their dirty data. Everybody at Cleanlab strives to be super helpful and impactful, doing what it takes to enable enterprise Data/AI projects to actually realize their potential value for the world. The key is reliability - as both coworkers and enterprise partners, as well as in how our tools work across arbitrary datasets and in the AI & information systems produced using these tools.

If using AI to improve Data appeals to you, Cleanlab is hiring for jobs across the board (ML Engineer, Hacker in Residence, Backend Engineer, DevOps, Product Manager, Product Marketing Manager, summer ML intern). Apply here: https://cleanlab.ai/careers/


Conclusion

To stay up to date on the latest with Cleanlab, follow them on X and learn more about them at Cleanlab.ai.

If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email (newsletter@cerebralvalley.ai) or DM us on Twitter or LinkedIn.
