Traceloop (YC W23) is at the forefront of LLM Observability 🎛️

Plus: Founder/CEO Nir on hallucinations and OpenTelemetry...

Published 12 Jul 2024

CV Deep Dive

Today, we're talking with Nir Gazit, CEO of Traceloop.

Traceloop is at the forefront of LLM Observability, allowing AI developers to monitor generative AI applications in production. With a rich background as an engineer at Google and the chief architect at Fiverr, Nir founded Traceloop to address the complexities of understanding and debugging agent architectures in AI systems. Initially built as an internal tool to help debug their own LLM apps, Traceloop has evolved into a comprehensive platform that detects hallucinations, issues, and errors in generative AI applications.

Traceloop participated in YC as part of the Winter 2023 batch. Their tools are used by engineers at leading companies like Google, Microsoft, Miro, Cisco, and IBM. In this conversation, Nir discusses Traceloop's unique OpenTelemetry-based SDK, its real-time error detection capabilities, and how the platform has rapidly grown its user base, helping organizations automate error detection and improve the efficiency of their AI systems.

Let's dive in ⚡️

Read time: 8 mins


Our Chat with Nir 💬

Nir – welcome to Cerebral Valley! To start off, introduce yourself and give us a bit of background on what led you to found Traceloop?

Hi, I'm Nir! I'm the CEO of Traceloop. I'm an engineer and have always been one. I worked at Google for many years and was the chief architect at Fiverr. I've worked with classic ML models and LLMs for a long time. For one of my most recent projects, I was trying to build a way for companies to automatically generate tests using LLMs: unit testing, front-end testing, and back-end testing, all fully automated.

This was back when there were no observability tools, making it difficult to understand how the system was working. We had a complex agent architecture designed to understand your system and generate code to test it. It was challenging to figure out what was going on inside this agent, especially when things didn't work well. We started building Traceloop as an internal tool to help us debug these issues, and it gradually grew into its own project.

Excellent. So how would you describe Traceloop to an uninitiated AI development team?

Traceloop is a platform for monitoring generative AI applications in production. It detects hallucinations, issues, and errors: basically, everything that can go wrong with your app. We can automatically alert you on issues and detect regressions.

Who exactly are Traceloop's users? What kind of developer finds the most value in using Traceloop? And if you'd like, how much growth are you seeing user-wise?

Our users are typically AI developers and teams who use AI at high volumes in production and need to understand and monitor the quality of the generated content.

One key aspect of Traceloop is our open-source component called OpenLLMetry, which is essentially our SDK. It sits within your application and collects all the necessary data. It uses OpenTelemetry behind the scenes, an open standard hosted by the CNCF. The idea is that you use our SDK to collect information and connect it to Traceloop. Because we use OpenTelemetry, you can also connect our SDK to other tools like Datadog, Sentry, New Relic, or Grafana. Since releasing the SDK, we've seen tremendous growth in users, and it's already serving thousands of companies.
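To make that one-line setup concrete, here is a minimal sketch using the OpenLLMetry Python SDK (the traceloop-sdk package; the app name and model choice are illustrative, and defaults may differ across SDK versions):

```python
# pip install traceloop-sdk openai
from openai import OpenAI
from traceloop.sdk import Traceloop

# One-line initialization: instruments supported LLM clients via
# OpenTelemetry and, by default, exports traces to Traceloop
# (reading TRACELOOP_API_KEY from the environment).
Traceloop.init(app_name="insight-extractor")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# This call is now traced automatically: prompt, completion, token
# counts, and latency are recorded as OpenTelemetry spans.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```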

When folks are just getting started with generative AI and want to trace executions, they can use our SDK and even connect it to tools like Grafana if they just want to see nice traces. As they scale and start having hundreds, then thousands, or even millions of users, they need more. They can't manually review all the generated content for errors or mistakes. This is where Traceloop becomes invaluable. At scale, we help automate the detection of hallucinations and other issues, making it much easier to manage large-scale generative AI applications.

What existing use case for Traceloop have you seen work best? Are there any specific customer success stories that you'd like to share?

We had one customer who now handles a few million generations a month, which is significant. When they started, one of their engineers manually reviewed the outputs they were getting from the LLM, looking for hallucinations and structural errors in the output. The LLM extracted insights from long texts, and they wanted to make sure the right insights were extracted and that those insights actually existed in the original text. These are the classic hallucination issues you want to detect in LLMs: relevance and accuracy.

Initially, the engineer would manually mark mistakes and share them in a Slack channel for the rest of the team to optimize the prompts. However, as they scaled up, it became impossible to manually review all the outputs, so they resorted to random sampling: just looking at a small subset of the outputs for errors, which wasn't sustainable.

When we came in, we automated many of these tasks. We started detecting these types of errors automatically, which not only replaced the manual process but also uncovered issues they didn't know existed. They found many cases where the LLM was returning completely unrelated results. It was a clear example of how Traceloop can significantly improve the efficiency and accuracy of monitoring generative AI applications at scale.

There are a number of teams trying to tackle LLM observability. What sets Traceloop apart from a developer's perspective, and what are you doing differently?

I would say two things set us apart. First, our SDK. We are the only ones, or at least the first, to build an OpenTelemetry-based SDK, and ours is the largest. We support most of the models, frameworks, and vector databases out there. Using OpenTelemetry means that you're never vendor-locked to Traceloop. Switching to a different provider is just a matter of setting an environment variable.
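As a rough sketch of that portability, redirecting the same instrumentation to another OpenTelemetry-compatible backend comes down to environment configuration (the variable name follows the OpenLLMetry docs; the endpoint is a placeholder for your own collector):

```python
import os

# Export to a self-hosted OpenTelemetry collector (or a Grafana,
# Datadog, etc. OTLP endpoint) instead of Traceloop's backend.
os.environ["TRACELOOP_BASE_URL"] = "http://localhost:4318"

from traceloop.sdk import Traceloop

# Application code is unchanged; only the export target moved.
Traceloop.init(app_name="insight-extractor")
```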

The real power of Traceloop is in our ability to detect hallucinations in real time in production. This has been a really difficult task, and most companies solve it by using mechanisms like LLMs as judges: using other LLMs to grade the responses they've gotten back from their primary LLM. However, this approach doesn't scale well. You can't go over millions of generations and use another LLM to ensure they are correct without doubling your operational expenses.

We've developed novel methods for detecting hallucinations without using LLMs, which really sets us apart from any competitor I've seen. No one else can do it at the scale and low latency that we can.

How do you see Traceloop evolving in the next six to twelve months?

First, we want to focus more on agentic flows. This is where OpenTelemetry becomes really valuable. With OpenTelemetry, we've added metrics and traces for calls to LLMs like OpenAI, Anthropic, and LLaMA, and to vector databases. But the technology itself already provides visibility into database calls and HTTP calls. So, if you have an agent that crawls the internet or performs other tasks besides calling OpenAI, you can get complete visibility just by using OpenTelemetry. We want to leverage this to monitor and detect issues with agent-based runs as well.
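As a sketch of what that could look like, OpenLLMetry exposes workflow and task decorators that group an agent's steps into a single trace; with OpenTelemetry's HTTP instrumentation enabled, the web request below would land in the same trace as any LLM spans (the crawler logic is a hypothetical stand-in):

```python
import requests
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

Traceloop.init(app_name="research-agent")

@task(name="fetch_page")
def fetch_page(url: str) -> str:
    # A plain HTTP call; OpenTelemetry's requests instrumentation
    # can record it as a span inside the enclosing workflow trace.
    return requests.get(url, timeout=10).text

@workflow(name="crawl_and_summarize")
def crawl_and_summarize(url: str) -> str:
    page = fetch_page(url)
    # An LLM call summarizing `page` would be auto-traced here too.
    return page[:200]

print(crawl_and_summarize("https://example.com"))
```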

Additionally, we want to support multimodal models. We've seen people starting to use more vision models and voice models, and we want to help monitor and observe those runs as well.

I'm glad you brought up those themes. How will new trends in AI, such as agentic and multimodal approaches, affect Traceloop?

The key difference between agentic and non-agentic workflows is the level of determinism in the flow. Many companies today have deterministic flows, where they know exactly how many API calls they'll make to OpenAI to complete a task. With an agent, you have no idea how many calls will be made, how many tools will be activated, or how long the process will take; it could be seconds or hours.

This unpredictability makes tracing these executions very interesting. You'll want to trace not only single executions but also see how executions perform on average. You'll want to identify which tools are being activated more frequently, what a typical flow looks like, and what happens when the agent gets stuck in an endless loop trying to figure out how to continue. This kind of analysis is less interesting in deterministic flows, where the traces always look the same.

By focusing on agentic and multimodal approaches, Traceloop will help users monitor these complex, unpredictable workflows more effectively.

What was one of the hardest technical challenges around building Traceloop?

One of the main challenges was taming OpenTelemetry. It's a complex technology, and figuring out how to wrap it nicely in a user-friendly layer was tough. We aimed to make our SDK easily installable with just one line in Python and TypeScript, which took us some time to achieve while hiding all the complexities of OpenTelemetry.

Another significant challenge was figuring out the right metrics to use to monitor LLM quality and tuning them to correlate well with human feedback. We wanted our metrics to align with a human looking at the generation and being able to say, "Yes, this is a hallucination," or "This is correct."

How would you describe the culture at Traceloop? Are you hiring, and if so, what do you look for in prospective hires?

We are an early-stage startup, and I believe strongly that everyone who joins needs to want and be able to do a bit of everything. As the CEO, I write code because I enjoy it, and I expect similar versatility from our team members. We look for individuals who can write code, think about product, do NLP research, work with models, and handle customer success and support.

We seek multi-talented people who love doing a variety of tasks and enjoy being involved in every aspect of the company. They should be excited about seeing how a company grows from nothing to something significant.

Is there anything else you want people to understand about the work you're doing at Traceloop that we haven't covered yet?

I'd like to emphasize our commitment to open source and open protocols. We chose OpenTelemetry for a reason. Many competitors are defining their own proprietary SDKs and protocols, which locks users into their platforms. In the cloud observability domain, this was also the case until OpenTelemetry became widely adopted, allowing for easy switching between providers.

In LLM observability, we're seeing a similar trend with proprietary protocols. I believe we should embrace openness from the beginning. We've been collaborating with other companies, and our SDK, OpenLLMetry, is already used by some of our competitors for their platforms. Additionally, large companies like IBM, Microsoft, and Amazon are using our SDKs internally and with their customers. This shows the value and benefits of using an open protocol.


Conclusion

To stay up to date on the latest with Traceloop, follow them on X and learn more about them at Traceloop.


If you would like us to 'Deep Dive' a founder, team or product launch, please reply to this email (newsletter@cerebralvalley.ai) or DM us on Twitter or LinkedIn.
