Consensus is 'Google Scholar meets GPT' 🌐

Plus: Co-founder Christian on academia, RAG and search...

Published 24 May 2024

CV Deep Dive

Today, we're talking with Christian Salem, Co-Founder of Consensus.

Consensus is a search engine for automating research. Analogous to 'Google Scholar meets GPT', it's an academic search engine that finds peer-reviewed papers relevant to any question you ask it - and under the hood, the tool runs an assembly line of LLMs to automate various steps of the research workflow. Consensus was founded in 2021 by Christian and his co-founder Eric Olson, and the startup's mission is to empower researchers, students and knowledge workers with a tool that makes them more efficient at finding research papers.

Today, Consensus is used by over 2 million knowledge workers, ranging from academics to AI researchers, including at notable institutions such as Stanford, Pfizer, and the National Academy of Sciences. The startup has raised funding from prominent investors including Draper Associates, Nat Friedman, Daniel Gross, and Kevin Carter.

In this conversation, Christian walks us through the founding premise of Consensus, why LLMs are perfectly suited to academic research, and his goals for the next 12 months.

Let's dive in ⚔️

Read time: 8 mins


Our Chat with Christian šŸ’¬

Christian, welcome to Cerebral Valley. Tell us a bit about yourself and what led you to co-found Consensus?

Hey there! I'm Christian, co-founder of Consensus. My professional background is in product - I got my start in tech right out of school, working as a product manager for a B2B startup called TicketManager. I had some great managers there who taught me how to build product and work in a startup environment. But I always wanted to build for consumers, so I joined the product team at the NFL, which was a great opportunity since I'm a massive football fan. I was at the NFL for about three years, leading various teams and helping build different products for the league, the 32 teams, and their hundreds of millions of users.

The founding story of Consensus actually goes back to my time in college at Northwestern. My co-founder and I played college football together, and we bonded over the fact that we were some of the only people on the team from families of academics, scientists, and teachers. My whole family has PhDs except for me - they think I'm evil for trying to make money. Eric is the same way; his whole family is involved in teaching or academia in some way. We became best friends on the team, partly because of that shared background, and we were interested in things like science, research, and AI.

One day at the end of college, Eric called me with an idea: "Could we use AI to make science easier to digest for everyone?" I thought it was a crazy idea and didn't pay much attention to it. But fast forward five or six years: we had both been working in tech and building relevant skills. The pandemic happened, which turbo-charged the demand for better research tools among both researchers and the general public. I'd been following AI developments and LLMs closely and saw the release of GPT-3, and when I first played with it, I immediately called Eric back and said, "Hey, that crazy idea you had in college - here's how we could actually get it done."

So, that's how Consensus was kickstarted. We eventually quit our jobs to go full-time on it at the end of 2021, and launched the product in late 2022 (right before ChatGPT).

How would you describe Consensus to someone who isn't familiar with the platform?

Consensus is a search engine that automates research. You can think of it as Google Scholar meets GPT - it's an academic search engine where you type in a question, and we find peer-reviewed academic papers related to your question. Then, we run an assembly line of LLMs over the top of it, underneath it, inside of it - automating various steps of the research workflow.

So, this could involve simple tasks like summarizing research papers, or more detailed tasks like extracting the sample size of a study or the population of a study, classifying the study design, and ranking the more rigorous study designs towards the top of the results. We built the product by imagining the workflow of an expert researcher. How would a doctor, researcher, or someone who is an expert on a topic search through the literature and come to a conclusion? We break that down into steps, then build features and train LLMs to enhance those steps.
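To make that concrete, here is a minimal sketch of what one step of that assembly line might look like - extracting the sample size, population, and study design from an abstract with an LLM. The prompt, model name, and output schema are our own illustrative assumptions, not Consensus's actual pipeline:

```python
# Hypothetical sketch of one assembly-line step: extracting study
# metadata from an abstract with an LLM. The prompt, model name, and
# schema are illustrative assumptions, not Consensus's pipeline.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Read the abstract below and return JSON with:
- sample_size: integer, or null if not stated
- population: short description of who or what was studied
- study_design: one of ["meta_analysis", "rct", "cohort",
  "case_control", "cross_sectional", "other"]

Abstract:
{abstract}"""

def extract_study_metadata(abstract: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable LLM works here
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(abstract=abstract)}],
    )
    return json.loads(response.choices[0].message.content)

# More rigorous designs can then be boosted at ranking time, e.g.:
DESIGN_RIGOR = {"meta_analysis": 5, "rct": 4, "cohort": 3,
                "case_control": 2, "cross_sectional": 1, "other": 0}
```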

Who are your users today? Who's finding the most value in what you're building with Consensus?

Today, we have a huge variety of users - more types than we ever imagined. Unsurprisingly, some of our top users are students. We have students at every major research university in the world at this point. For students, the most common use case is writing a paper. They're doing some kind of literature review process where they need to provide citations for their paper. We have a mix of undergraduate and graduate students who are finding value in it.

Another surprising but strong user persona for us is practicing physicians. We didn't set out to build a niche healthcare tool specifically for doctors; we just set out to make it easier to get answers from scientific literature. By doing that, people who are combing through the literature all day, including doctors, started finding value in the product. Doctors use it to look up questions their patients ask that they may not be experts in, or to quickly get up to speed on the latest research.

Could you take us under the hood of the core platform? How are you thinking about RAG, chaining and orchestration to surface the right results?

Consensus is an assembly line with over 25 different LLMs doing various parts of the process. We definitely have the classic RAG stuff where we retrieve papers from our search engine and then generate interesting outputs like answers or syntheses across papers - but we're also doing a lot with different models even before that. As an aside, I jokingly said I want to coin the term "GAR" as the reverse of RAG. We're generating a ton of interesting metadata on top of our papers before the retrieval is ever run to enhance the retrieval. So, we're actually doing generation-augmented retrieval!

Some of the things our users care about in the retrieval and ranking of the papers include study design, sample size, and whether the study was done on humans versus animals - these factors help determine how much trust to place in a paper. We can generate and extract that information using language models on our corpus ahead of time, and this allows us to use it in the ranking of millions of papers to surface the best ones at the top.
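In other words, the generation happens offline, before any query arrives. A hedged sketch of what that "GAR" enrichment pass could look like, reusing an extraction function like the one above (the field names are our assumptions):

```python
# Hypothetical "generation-augmented retrieval" pass: run LLM
# enrichment over the whole corpus offline, so the generated fields
# already exist as ranking signals before any query is issued.
def enrich_corpus(papers: list[dict], extract_fn) -> list[dict]:
    """Offline pass; extract_fn is an LLM extractor like the sketch above."""
    enriched = []
    for paper in papers:
        meta = extract_fn(paper["abstract"])
        enriched.append({
            **paper,
            # Pre-generated fields, later usable as filters and ranking signals:
            "study_design": meta.get("study_design"),
            "sample_size": meta.get("sample_size"),
            "population": meta.get("population"),
        })
    return enriched  # index these fields so ranking can use them at query time
```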

We have a multi-layered assembly line of different models doing different things. We use OpenAI for some of the final output with large synthesis models, but we also distill down to really small models, like 3-billion-parameter models and even smaller ones, to handle more niche tasks.
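As a rough illustration of that split, here is a sketch with a small off-the-shelf classifier standing in for the distilled niche-task models and a large hosted model standing in for the synthesis step. Both model choices are public placeholders, not Consensus's own models:

```python
# Illustrative small-model/large-model split. The checkpoints here are
# public stand-ins; Consensus's distilled models are proprietary.
from openai import OpenAI
from transformers import pipeline

# Small model: a cheap, narrow task that runs at corpus scale.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

def classify_design(abstract: str) -> str:
    labels = ["randomized controlled trial", "cohort study",
              "case report", "meta-analysis"]
    return classifier(abstract, candidate_labels=labels)["labels"][0]

# Large model: the final synthesis across top-ranked papers.
def synthesize(question: str, snippets: list[str]) -> str:
    client = OpenAI()
    prompt = ("Question: " + question + "\n\nPapers:\n"
              + "\n---\n".join(snippets)
              + "\n\nSummarize what these papers say about the question.")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed; any strong synthesis model works
        messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content
```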

What has been the biggest technical challenge around building Consensus thus far?

The biggest technical challenge so far, and likely going forward, is search. The retrieval aspect of RAG is a super technical problem for our space. If you're a small company with only a couple thousand documents and want to throw a vector database on top of it and do some generation, that's not too difficult to set up. But if you want to search through hundreds of millions of extremely dense technical documents like academic papers, and it has to be super fine-grained with the very best research at the top every single time, building a search engine that does that is extremely difficult.

We hired dedicated search talent to tackle this. Our search engineer was at Amazon Search for five years and then at Google Search after that. The biggest challenge has been the retrieval and ranking of the right papers at the top, and this will continue to be a huge challenge for us. Language models are getting so good at doing tasks like summarizing, extracting answers, and classifying things. They will only get better at that. Our focus needs to be on getting better at providing the language models with the best documents to perform these tasks on.

What's unique about the technical approach that you and Eric are taking towards Consensus? Anything you'd like to highlight?

The first is our huge investment in the search engine. This means not just relying on the more modern semantic similarity that a vector database gives you, but building a true end-to-end search engine. It has to incorporate more traditional keyword matching and different weights for different fields of the documents. If a user types in quotation marks, they're expecting an exact match. It also needs filtering capabilities, which vector databases don't really solve. For example, if a person only wants randomized controlled trials done since 2020 with a citation count of 50 or more, it's hard to achieve all that with just a vector database. So, we built a more traditional search engine that layers in the language-model-powered semantic relevance of modern vector databases. This approach allows us to handle complex queries and deliver the precise results users expect.
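A toy version of that hybrid approach might look like the following - BM25 keyword scoring blended with embedding similarity, with hard filters applied first. The weights, field choices, and embedding setup are assumptions, not Consensus's implementation:

```python
# Minimal hybrid-retrieval sketch: traditional keyword scoring plus
# vector similarity plus hard filters. Weights and fields are assumed.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, query_vec, docs, top_k=10,
                  bm25_weight=0.5, vec_weight=0.5,
                  min_year=None, min_citations=None, design=None):
    # Hard filters first -- exactly what pure vector search handles poorly.
    pool = [d for d in docs
            if (min_year is None or d["year"] >= min_year)
            and (min_citations is None or d["citations"] >= min_citations)
            and (design is None or d["study_design"] == design)]
    if not pool:
        return []

    # Keyword channel: BM25 over title + abstract tokens.
    bm25 = BM25Okapi([(d["title"] + " " + d["abstract"]).lower().split()
                      for d in pool])
    kw = bm25.get_scores(query.lower().split())
    kw = kw / (kw.max() or 1.0)  # normalize to [0, 1]

    # Semantic channel: cosine similarity against precomputed embeddings.
    embs = np.array([d["embedding"] for d in pool])
    sem = embs @ query_vec / (
        np.linalg.norm(embs, axis=1) * np.linalg.norm(query_vec))

    scores = bm25_weight * kw + vec_weight * sem
    order = np.argsort(scores)[::-1][:top_k]
    return [pool[i] for i in order]
```

Christian's example query - randomized controlled trials since 2020 with 50+ citations - would map to `hybrid_search(query, query_vec, docs, min_year=2020, min_citations=50, design="rct")`, a filter combination a vector index alone can't express.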

Another interesting approach we're taking is what I mentioned before: the reverse of RAG, or generation prior to retrieval. It's an unexplored space of using language models on top of your documents before a user ever tries to retrieve them. This helps create more relevant metadata about those documents, which can then be fed into a search engine or vector database, improving the retrieval and ranking of the best documents for users.

We also have a lot of guardrails in the product because it's high stakes. Researchers and doctors are using the product, and they're expecting accuracy. While I would never claim we're accurate 100% of the time, we include numerous disclaimers encouraging users to read the papers and check other sources. For example, at the end of our RAG flow, where we summarize the top ten or twenty papers, we won't even start that process if we don't think we've found enough relevant papers. We have a separate model whose only job is to determine if there is enough relevant research to actually summarize and if it's close enough to what the user was asking about.

In our testing, by doing this and turning off the final RAG feature for queries with few relevant papers, we significantly reduce hallucinations. Language models want to please you and always answer the question, so if you give them irrelevant content, they might fill in gaps with inaccurate information. By ensuring the content is relevant before processing, we mitigate this issue.
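Here is a hedged sketch of such a gate: an LLM judge counts the papers that actually address the question and vetoes synthesis below a threshold. The prompt, model, and threshold are illustrative assumptions, not Consensus's actual checker:

```python
# Hypothetical "enough relevant research?" gate: run a small judge
# before synthesis, and skip the RAG step entirely when retrieval is
# weak. Prompt, model name, and threshold are assumptions.
from openai import OpenAI

client = OpenAI()

def relevance_gate(question, papers, min_relevant=5):
    """Return True only if enough retrieved papers address the question."""
    listing = "\n".join(f"- {p['title']}" for p in papers)
    judge = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content":
                   f"Question: {question}\nPapers:\n{listing}\n\n"
                   "How many of these papers directly address the "
                   "question? Answer with a single integer."}])
    try:
        n_relevant = int(judge.choices[0].message.content.strip())
    except ValueError:
        return False  # be conservative if the judge answers unexpectedly
    return n_relevant >= min_relevant

def answer(question, papers):
    if not relevance_gate(question, papers):
        return None  # skip synthesis rather than risk hallucinating
    ...  # proceed with the usual summarize-the-top-papers RAG step
```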

Talk us through your roadmap for the next 6-12 months. What are your key areas of focus?

To us, there are three big buckets of work. One is continued investment in the search engine, which demands a huge amount of work and resources. We're focused on making our paper relevance ranking better.

Two would be more AI analysis-type features. We extract a lot of cool information about the research papers now, but our users are always asking for more. They want to know if there were any conflicts of interest in the study, what the funding sources were, and they want to be able to summarize things in even more flexible ways. They also want to chat with the papers, which is something we hope to release very soon.

The third bucket of work is around teams and workflows. We've gotten a ton of interest from industry - universities, doctors' offices, biotech companies - and they want to be able to do research and collaborate with their team members. We've just launched Consensus for Teams with a very light set of features, and we want to expand on that, enabling more people to collaborate on research within Consensus.

So, those are the three big focuses for us: improving search, adding more AI features, and supporting team collaboration.

Lastly, share a little bit about your team. How would you describe your culture, and what do you look for in prospective hires?

We're 10 people now.

Culture-wise, we try to be as scrappy as possible. We're small and attacking a big problem, and to win, we have to be the cheapest, fastest, scrappiest team out there. How this actualizes in the product is that we try to ship as fast as humanly possible - I can even send you a link to our recent changelog so you can see all the features we've been shipping this year!

The number one thing we care about is how quickly we're making improvements for our users, whether that's user-facing features or underlying search infrastructure that bubbles better papers to the top. Every single week, we're pushing improvements. That's probably the most distinct thing about our culture.

Another aspect is that our team deeply cares about this problem. We're trying to democratize scientific knowledge for everyone. Some of the most valuable knowledge ever created is locked in dense PDFs and academic research that is barely intelligible to the masses. We believe that by using AI, we can make that knowledge accessible not only to professional researchers and professionals in healthcare and industry, but also to the general public - and this can help accelerate science and the world's understanding of science.

Our team finds motivation in this mission, and it drives us to move quickly, iterate rapidly, and work hard. Everyone on our team is super fired up about the mission.


Conclusion

To stay up to date on the latest with Consensus, follow them on X and learn more about them at Consensus.


If you would like us to ā€˜Deep Diveā€™ a founder, team or product launch, please reply to this email (newsletter@cerebralvalley.ai) or DM us on Twitter or LinkedIn.

Join Slack | All Events | Jobs

Subscribe to the Cerebral Valley Newsletter