Semantic Search

At work, we recently released a big new feature I’ve spent the past few months working on: semantic search. Given all of your uploaded documents, you can ask natural language questions against them. Here are some post-release thoughts.

First of all, I know what you’re thinking: “another project that’s stuffed langchain in front of a file connector”. I’m happy to say no. I did start with langchain (don’t get me started), but even as someone who doesn’t program in Python every day I could tell the code was just… wrong. Weird abstractions, functionality divided into odd modules. It just felt unnecessarily difficult for what we wanted.

Instead, I split the task into two areas: indexing and querying.

Indexing was trivial, re-using the logic from our existing full-text search index. Documents are embedded, currently via remote models but with support for local models, and stored in a vector store we already had for other purposes.
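To give a rough idea of the shape of it (a sketch only; the helper names and the vector store interface here are made up, not our actual code):

```python
from typing import Callable, Iterable

def index_document(
    doc_id: str,
    chunks: Iterable[str],
    embed: Callable[[str], list[float]],
    vector_store,
) -> None:
    """Embed each chunk of a document and upsert it alongside the full-text index."""
    for position, chunk in enumerate(chunks):
        vector = embed(chunk)  # remote embedding model today, local models also supported
        vector_store.upsert(
            id=f"{doc_id}:{position}",  # chunk-level IDs keep re-indexing a document simple
            vector=vector,
            metadata={"doc_id": doc_id, "text": chunk},
        )
```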

The actual rendering and chunking of text for embedding was an interesting challenge. Many examples I found online assume you can either embed the entire text (e.g. a tweet or GitHub issue) or break it down into individual sentences. Neither works for documents that range from a few dozen to a few thousand words, with a mix of other content types thrown in for good measure.
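The middle ground looks something like the sketch below: merge small blocks (paragraphs, headings, list items) up to a target size, and split anything oversized on its own. The word counts are illustrative, not the values we actually landed on.

```python
def chunk_blocks(blocks: list[str], target_words: int = 200, max_words: int = 400) -> list[str]:
    """Merge small blocks into reasonably sized chunks; split oversized blocks
    rather than embedding a whole page or a single sentence at a time."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for block in blocks:
        words = block.split()
        if count and count + len(words) > max_words:
            # Adding this block would blow the budget: flush what we have so far.
            chunks.append(" ".join(current))
            current, count = [], 0
        if len(words) > max_words:
            # Oversized block: split it into fixed-size windows on its own.
            for start in range(0, len(words), target_words):
                chunks.append(" ".join(words[start:start + target_words]))
            continue
        current.append(block)
        count += len(words)
        if count >= target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```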

In the end I found a good mix of weights and breakpoints, using a sharded index. Shout-out to Typesense for shipping a great vector query option so quickly. A great little DB, and a welcome change from managing Elasticsearch.
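For anyone curious, a nearest-neighbour query through the Typesense Python client looks roughly like this. The collection and field names are invented for the example, and the exact `vector_query` syntax is worth checking against the Typesense docs for your version:

```python
import typesense

# Illustrative config only; host, key, collection and field names are placeholders.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

query_vector = [0.12, 0.98, 0.33]  # placeholder embedding of the user's question

results = client.multi_search.perform({
    "searches": [{
        "collection": "docs",
        "q": "*",
        # Nearest-neighbour search on the "embedding" float[] field, top 10 hits.
        "vector_query": f"embedding:({query_vector}, k:10)",
    }]
}, {})
```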

For querying, we ended up going with a hybrid RAG (Retrieval-Augmented Generation) approach. We embed the query, find similar documents, and then fill the holes with traditional full-text search. Finding a good similarity cut-off is critical.
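In pseudocode-ish Python, the retrieval half looks something like this. The function names, and especially the 0.75 cut-off, are illustrative; the real value took tuning:

```python
from typing import Callable

def hybrid_retrieve(
    query: str,
    embed: Callable[[str], list[float]],
    vector_search: Callable[[list[float]], list[tuple[str, float]]],  # (doc_id, similarity)
    fulltext_search: Callable[[str], list[str]],                      # ranked doc_ids
    limit: int = 5,
    min_similarity: float = 0.75,  # the cut-off that needed the most tuning
) -> list[str]:
    """Vector hits above the similarity cut-off first, then fill the remaining
    slots from traditional full-text search."""
    vector = embed(query)
    hits = [doc_id for doc_id, score in vector_search(vector) if score >= min_similarity]
    results = hits[:limit]
    for doc_id in fulltext_search(query):
        if len(results) >= limit:
            break
        if doc_id not in results:
            results.append(doc_id)
    return results
```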

One of the big things that, I feel, sets us apart from the competition is that we bring our complete permissions suite into play when querying. We have teams with entire libraries of content, and now there’s no worry that someone can ask a question and surface something they aren’t allowed to see. This isn’t done via prompts but at the pure database level.
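Conceptually it looks like the sketch below: the permission check travels with the retrieval query itself, so content the user can’t see is never a candidate in the first place. The filter syntax and role model here are hypothetical, not our actual schema:

```python
def search_as_user(query_vector: list[float], user, vector_store, limit: int = 10):
    """Apply permissions as a filter on the store/database query itself,
    rather than asking the LLM to withhold content via the prompt."""
    role_ids = [role.id for role in user.roles]
    return vector_store.search(
        vector=query_vector,
        limit=limit,
        # Hypothetical filter: only chunks whose parent document is viewable
        # by one of the user's roles are ever retrieved.
        filter={"viewable_by_role": role_ids},
    )
```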

The actual generative side changes completely based on which LLM provider you’re using. We find Claude by Anthropic requires much more nuanced instructions, with closer attention to roles and tags. The advantage is that it’s an order of magnitude faster than OpenAI.
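A rough sketch of what that provider-specific shaping can look like; the exact wording and tag names are made up for illustration:

```python
def build_messages(provider: str, question: str, context_chunks: list[str]) -> list[dict]:
    """Provider-specific prompt shaping: Claude tends to do better when retrieved
    context is clearly delimited with tags and roles are used deliberately."""
    context = "\n\n".join(context_chunks)
    if provider == "anthropic":
        user_content = (
            "<documents>\n" + context + "\n</documents>\n\n"
            "Answer the question using only the documents above.\n"
            f"<question>{question}</question>"
        )
    else:  # e.g. an OpenAI-style chat completion
        user_content = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": "You answer questions about the user's documents."},
        {"role": "user", "content": user_content},
    ]
```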

Overall we now have many happy teams using the product. On the day of launch our cloud provider decided to nuke their entire network infrastructure, but we were back up and ready in plenty of time. Fingers crossed to see how it grows and develops over time.