Pinecone Serverless is turbo-charging your time-to-prod ⚡️

Plus: Director of Product Jeffrey Zhu on RAG, multimodal and more...

Published 15 May 2024

CV Deep Dive

Today, we’re talking with Jeffrey Zhu, Director of Product Management at Pinecone.

Pinecone is a leading vector database platform, purpose-built for generative AI applications. Founded in 2019 by Edo Liberty, a former research director at AWS and Yahoo!, Pinecone was conceived with the idea of combining the power of AI models and vector search into a standalone application - which created the now-popular vector database category. The startup’s mission is to make its storage and retrieval capabilities accessible to engineering teams of all sizes and levels of AI expertise, which has led to the fully managed service that Pinecone is known for today.

Today, Pinecone has thousands of companies using its platform for vector search and storage, including Microsoft, Notion, Gong and Plaid. In January 2024, Pinecone launched Serverless, an offering that ‘eliminates the need for developers to provision or manage infrastructure and allows them to build GenAI applications more easily and bring them to market much faster’. Serverless has seen skyrocketing usage since its private preview launch, and now powers Pinecone’s free tier.

In this conversation, Jeff walks us through Pinecone Serverless, vector databases, and Pinecone’s growing position within the generative AI landscape.

Let’s dive in ⚡️

Read time: 8 mins


Our Chat with Jeff 💬

Jeff - welcome to Cerebral Valley! Firstly, tell us a bit about your background and what led you to join Pinecone?

Hey there! My name is Jeff, and I'm a Director of Product Management here at Pinecone. I'm responsible for what I like to call the core database, which covers the core vector search engine itself: how we conduct vector search, and how we make it extremely performant, efficient, and effective. I also focus on deploying that vector engine across various clouds and infrastructures like AWS, Azure and GCP - my scope is mostly everything in the backend.

In terms of my journey, I've been at Pinecone for two years, which in Pinecone years is eons! Before Pinecone, I spent eight years at Microsoft Bing, primarily as a platform PM for their machine learning infrastructure. This included everything from training large-scale language models—before it became commonplace—to optimizing massive-scale inference on ASICs and GPUs. I also initiated the vector search platform in Bing, which led me to Pinecone today.

I’ve always been struck by how interesting and capable vector search is. It uses machine learning and AI to transform complex problems that are intractable for classic rule-based engines into a simplified mathematical problem space. Setting aside all the AI and ML models, at its core it’s essentially a math problem with a powerful property. At Bing, we utilized it to enhance our semantic search capabilities, notably in Q&A and related queries, and it was a very core capability. The one thing I always found interesting was that building a strong vector search system required 10 engineers and a ton of resources, which got me looking further afield to smaller teams working on the same problem.
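To make that “math problem” concrete, here is a minimal brute-force sketch of vector search - cosine similarity over a toy set of embeddings. This is the exact computation that approximate nearest neighbor indexes speed up at scale:

```python
import numpy as np

# A toy "index" of four document embeddings (one per row),
# as produced by any embedding model.
docs = np.array([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.0, 0.2, 0.9],
    [0.7, 0.6, 0.0],
])
query = np.array([0.9, 0.2, 0.0])

# Cosine similarity is just a dot product of L2-normalized vectors.
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = docs_n @ query_n

# "Search" reduces to taking the top-k highest-scoring rows.
top_k = np.argsort(-scores)[:2]
print(top_k, scores[top_k])
```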

One of the things that really attracted me to Pinecone was that they took a radically different approach. While everyone else was out there giving users numerous controls and tuning parameters for every aspect of the algorithms, Pinecone chose to build a completely managed service that abstracts all of that complexity away from the user. We take on the hard task of ensuring your results are high-quality and managing your deployments. Pinecone was bold in saying “let’s abstract all of that away and really empower people to start leveraging vector search without having 10 engineers and researchers involved.”

Give us a top-level overview of Pinecone Serverless - how would you describe it to the uninitiated developer or AI engineer?

Pinecone Serverless is essentially the next evolution of the ethos we've been talking about, which is to abstract as much complexity away from infrastructure management, productionization, and scaling as possible. It's our next step in enhancing the user and developer experience for our customers. Even though our first iteration - our pod-based architecture - abstracted away a lot of the complexity of managing and tuning algorithms, you still had to engage in capacity planning. For example, you had to estimate how many pods you needed based on the number of vectors, and if you grew beyond that amount, you needed to scale vertically or completely re-shard.

So, there were still operational overheads because we were using a classic sharded database architecture, and that's where Pinecone Serverless comes in. We completely rethought what people care about, which is having their data available for vector search and building applications on top of it. They don't want to think about how many pods they need or how much capacity they require. With Serverless, it’s about giving developers a truly serverless experience where they can come to Pinecone, update their vectors and query them at will. We take care of all the scaling and the compute requirements.

Plus, not only is Pinecone now a completely serverless experience, but vector search is also going to be significantly cheaper than before. We're not going to be over-provisioning capacity that sits there underutilized - we're going to scale the capacity to exactly what you need and charge you only for that, rather than for idle CPU.
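As a rough illustration, here is what that workflow looks like with the Pinecone Python client (v3+) - the index name, dimension, cloud, and region below are placeholder choices, not a prescription:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

# Create a serverless index: no pods or capacity planning, just a
# cloud/region and the dimensionality of your embedding model.
pc.create_index(
    name="quickstart",        # placeholder name
    dimension=1536,           # must match your embedding model's output
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("quickstart")

# Upsert vectors and query at will; scaling is handled behind the scenes.
index.upsert(vectors=[
    {"id": "doc-1", "values": [0.1] * 1536, "metadata": {"source": "demo"}},
])
results = index.query(vector=[0.1] * 1536, top_k=3, include_metadata=True)
```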

Could you share a bit about Serverless’ impact on cost? Creating AI applications and doing vector search can be extremely cost-prohibitive - how are you addressing that within Pinecone?

One of the very core principles we’ve followed with our vector database is to completely decouple storage and compute. This isn’t a groundbreaking concept, but it’s essential for a seamless and scalable experience - by doing this, you're no longer bound by the capacity of a single node, and you can scale each independently. This allows you to save significantly on cost. For example, if you have an index that you query briefly and then leave as an experimental index for a week before returning to it, you only need to be charged for the blob storage costs, instead of for the SSD, memory, and CPU that are constantly running in traditional vector database architectures.

From a cost perspective, you can see massive savings depending on your utilization patterns, especially when you're just getting started or if you have very sporadic queries. You don't want to have to think about spinning things up and down and archiving them, and that's where Serverless can bring you massive savings on top of that.
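As a back-of-the-envelope illustration of that difference - using made-up unit prices, not Pinecone's actual rates - consider an experimental index left idle for a week:

```python
# Hypothetical unit prices for illustration only - not Pinecone's pricing.
ALWAYS_ON_NODE_PER_HOUR = 0.10     # SSD + memory + CPU provisioned 24/7
BLOB_STORAGE_PER_GB_MONTH = 0.03   # object storage at rest

index_size_gb = 10
idle_hours = 7 * 24                # an experimental index untouched for a week

# Classic architecture: the node bills whether or not you query it.
node_cost = ALWAYS_ON_NODE_PER_HOUR * idle_hours

# Decoupled architecture: idle data costs only blob storage (pro-rated).
storage_cost = BLOB_STORAGE_PER_GB_MONTH * index_size_gb * idle_hours / 730

print(f"always-on node: ${node_cost:.2f}  vs  blob storage only: ${storage_cost:.2f}")
```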

Multimodal has seen a huge surge in interest in 2024. Anything you can share about how the advent of multimodal factors into your plans at Pinecone?

I’d say the implementation of multimodal largely depends on the specific application and its architecture. Taking a step back, a classic example of multimodal is in e-commerce: imagine having an image of an item and a text description of that item, and wanting to search by both image and text simultaneously. At Pinecone, our approach to these types of use cases involves what we call dense-sparse, which is a hybrid search method. Typically, for many of our multimodal use cases, you can represent an image through the dense vector by using CLIP or any other ML model you prefer for image embedding.

With sparse representations, there are several options. You can use BM25 or some of the learned sparse representations, among other things. The idea is to combine these dense and sparse components to ultimately rerank and effectively serve these multimodal queries by using multiple signals at the same time. Hopefully, this aligns with what you meant by multimodal, but that's immediately where my mind goes.
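As a rough sketch of how such a dense-sparse query can be expressed with the Pinecone Python client and the pinecone-text helper library - the CLIP model, the index name, and the alpha weighting are illustrative assumptions, and the index is assumed to be configured for sparse-dense search (dotproduct metric):

```python
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder key
index = pc.Index("products")            # hypothetical sparse-dense index
                                        # (dotproduct metric, 512 dimensions)

# Dense side: CLIP embeds the text query into the same space as the images.
clip = SentenceTransformer("clip-ViT-B-32")
dense = clip.encode("red leather handbag").tolist()

# Sparse side: BM25-style term weights (pretrained default parameters).
bm25 = BM25Encoder.default()
sparse = bm25.encode_queries("red leather handbag")

# A common convex-combination weighting between the two signals.
alpha = 0.7  # 1.0 = pure dense, 0.0 = pure sparse (illustrative choice)
dense = [v * alpha for v in dense]
sparse = {"indices": sparse["indices"],
          "values": [v * (1 - alpha) for v in sparse["values"]]}

results = index.query(vector=dense, sparse_vector=sparse,
                      top_k=10, include_metadata=True)
```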

This explosion in generative AI has been extremely research-driven - how do you balance the need for AI research with the focus on productization?

You’re absolutely right; everything is moving incredibly fast. There’s research coming out constantly, and someone is training a new state-of-the-art LLM almost every week.

At Pinecone, to keep up with the pace of innovation, we’ve built a separate research arm - we hire PhDs and research scientists who are dedicated to tackling these fundamental problems we aim to solve, whether through advancements in approximate nearest neighbor algorithms, modeling advancements, or otherwise. We have an entire organization dedicated to this—they publish papers and are actively involved in academia. Our founder, Edo, also comes from an academic background, and this academic influence profoundly shapes our DNA, keeping our ear to the ground and always looking out for the next big thing.

In terms of how we respond to industry changes, that's exactly what happened with RAG. When RAG started to surface, we pivoted quickly; the industry as a whole pivoted massively, but we were on top of it as soon as possible. Part of our approach, and something I'd be remiss not to mention, is how we productionize a lot of that research and integrate it as fast as possible. As a small company, we fortunately don't have a lot of distance between our research arm and the rest of us, so staying focused on rapid integration is always a priority for us.

What has been the biggest technical challenge around launching Pinecone Serverless?

I’d say the biggest challenge, which is perhaps less of a technical challenge and more of a product challenge, was born of the fact that we were actually quite successful with our pod-based architecture. We had a reasonable amount of adoption - people trusted us, and were comfortable working with us. However, we knew that in terms of longevity - looking two to three years ahead from where we are today - it wasn't sustainable based on the cost profile and the user experience, among other factors.

A big challenge for us was that we needed to take a big, bold bet and completely re-architect things. We had a very stable, well-running architecture, and we needed to completely shift the entire setup. Essentially, we were moving away from a single-node, shared-nothing architecture to a massively multi-tenant, microservice-driven architecture, and this was a massive paradigm shift in terms of the complexity of the system. However, it ultimately allowed us to provide the cost benefits and scalability benefits that we wouldn't otherwise be able to achieve. There was definitely a massive technical challenge in completely revamping everything - we had to completely change our core algorithm and how we architect the services. It's a 180-degree turn in terms of the direction of our fundamental architecture philosophies.

Even from the product side, we have to re-educate everyone, and this is still ongoing. For example, we just released on AWS, and we need to expand to the rest of the clouds across other regions. This is the start of a journey to help people re-understand how to think about Serverless, what its impact is, and how to model costs as a pay-as-you-go system. In some ways, people could argue that it's much simpler to have a per-hour rate on a pod, but then you're wasting resources on unutilized capacity.

So, our question becomes: how do we retrain and reeducate customers on the right way to think about their utilization? Our challenge is as much a product and packaging shift as it is a technical one.

Tell us a bit about the culture at Pinecone - are you hiring, and what do you look for in prospective members joining the team?

One of the things I love about Pinecone is that people take a lot of pride in their work. The people who draw me in the most during interviews are those whose energy, passion and excitement I can sense when they talk about their previous work, their experiences, or the technical challenges they've faced. They’re like, "I'm ready to take these challenges on and I'm not just doing this for a job." I’ll be frank - if you’re just doing this to clock in and out, that’s not the startup culture we're looking for.

We want people who are really rallying behind our flag, and who are ready to execute in a space of massive ambiguity. Things are changing all the time - the ground is shifting beneath our feet every month. Those are the types of people we want - people who are hungry for that challenge and are genuinely excited and passionate about the opportunity and the potential. By the way, that's very much the Pinecone culture - we're constantly pivoting, and constantly thinking about what our next move is.

So, how do we reassess, refocus, and what will the future be like? We always need to be at the forefront of that. These are the types of people we really look out for at Pinecone. I think, regardless of what role you're in, that's the type of passion or fervor we expect. We want anyone joining Pinecone to be really interested in contributing.


Conclusion

To stay up to date on the latest with Pinecone, follow them on X and learn more about them at Pinecone.


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email (newsletter@cerebralvalley.ai) or DM us on Twitter or LinkedIn.

Join Slack | All Events | Jobs

Subscribe to the Cerebral Valley Newsletter