You can’t have RAG without a retriever, but what exactly is a retriever? And what kind of retriever am I going to use? This post takes a trip with BERT and talks about some recent innovations built around the model. Let’s jump in!
Parts of a retriever
All retrievers have at least three things in common:

1. A corpus of information is stored somewhere, somehow
2. When presented with an incoming query, they do some stuff to process it
3. A list of relevant information gets returned
Remarkably, these three properties are not unique to vectorstores! In fact, information retrieval has been around for decades! Colin Harman wrote a great piece, Beware Tunnel Vision in AI Retrieval, about how all the hype around vectorstores is pushing older (and possibly better!) methods of retrieval to the sidelines. Quoting from Harman:
To be sure: vector search is a great solution to many LLM-related retrieval problems and I’ve been using it in production for nearly two years! But it struggles in certain conditions and excels in others, and suggesting that it is “the” solution to retrieval for LLMs is simply misinformation. In your application, the optimal solution may be full-text keyword search, vector search, a relational database, a graph database, or a mixture.
So let’s start our recent history of retrievers at full-text keyword search. One of the best and longest-lasting full-text keyword search tools is…
BM25
BM25 (known on Wikipedia as Okapi BM25) is a way of matching a query to a set of documents using this bunch of dumb math: $$ \text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgDL}})} $$
I say “dumb” with love, and to differentiate it from “smart” black-box AI methods (which can be pretty dumb themselves at times). Anyway, BM25 uses statistics to figure out what the important words in the query are, and then returns documents that use those words a lot. I like to think of the old, 2000’s era of SEO, when putting word vomit of keywords at the bottom of your website increased your page’s search ranking. You know, like how this:
Car insurance Auto insurance Vehicle insurance Car coverage Automobile insurance Motor insurance Car policy Insurance premium Collision coverage Comprehensive coverage Liability insurance Underinsured motorist coverage Uninsured motorist coverage Deductible Premium rate No-fault insurance Policyholder Claim process Coverage limits Insurance provider Third-party insurance Premium discount Safe driver discount Accident forgiveness Vehicle type Coverage options Insurance quote Policy renewal Insurance broker Insurance agent Compare quotes Multi-car discount Teen driver insurance SR-22 insurance High-risk driver insurance Rental car coverage Gap insurance Roadside assistance Endorsement Exclusion Premium payment Claims adjuster Policy documents Insurance terms Insurance regulations Insurance industry
Could make your webpage perform better when someone searched “Car Insurance.” Don’t get me wrong, modern implementations like Elasticsearch and OpenSearch are a lot more complicated and immune to such antics, but they are still based on deterministic statistics and “dumb” math.
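If you want to poke at BM25 yourself, here’s a minimal sketch using the rank_bm25 Python package (the corpus and query are made up for illustration):

```python
# A minimal BM25 sketch with the rank_bm25 package (pip install rank_bm25).
# The corpus and query here are toy examples.
from rank_bm25 import BM25Okapi

corpus = [
    "cheap car insurance quotes for teen drivers",
    "how to renew your passport online",
    "comprehensive auto insurance coverage explained",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

# k1 and b are the same free parameters that show up in the formula above
bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75)

query = "car insurance".lower().split()
print(bm25.get_scores(query))              # one score per document
print(bm25.get_top_n(query, corpus, n=1))  # the best-matching document
```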
The thing is, BM25 works well. Like, really well. As in modern-day AI people still spend months fine-tuning custom transformers just to get dunked on by Steve and Karen.
IF BM25 is so hot why don’t you marry it??
Well, to put it succinctly, BM25 searches with words while vectorstores search with meaning. When you have a lot of meaning packed into one word, like “us,” BM25 can struggle to return relevant results. Is it “us” like “just the two of us” or “us” as in “US foreign policy?” The converse is also true: when you have a lot of words with little meaning, vector search is able to distill them down into a meaningful embedding, while BM25 can go off chasing red herrings.
Let’s start with BERT
We’re gonna start the vectorstore search party in 2019, when BERT first dropped. BERT is an encoder-only transformer that gets used for retrieval as a bi-encoder, and it was designed to do well on question-answering tasks like SQuAD and SWAG (https://huggingface.co/datasets/swag) and Bidoof. Its big innovation was applying unsupervised pre-training to a deeply bidirectional encoder. Basically, it looks over a ton of language data to establish understanding, and in the process learns to take incoming text of any kind and turn it into numbers that capture its meaning, and to fill masked-out words back in from those numbers.
BERT is great because we can fine-tune it to do all sorts of stuff! People have trained BERT to detect names from texts, classify stuff as positive or hateful, put things into categories, pick answers for questions, and even fill in the blanks!
Teaching BERT how to do any one of these tasks is called Fine-Tuning and it is a very important technique! We’ll get to that in a second, but first…
Are bi-encoders the only AI models that are good for retrieval?
Nope! Another type worth mentioning is the cross-encoder. Cross-encoders are cool because they take the query and the document together, as a single combined input, and crunch them into one relevance number!
Get it? Cross like angry?…
The thing about cross-encoders, though, is that they require a lot of number crunching. Nothing can be pre-computed: for every query, every candidate document has to be run through the model alongside it, and that can take a while. However, cross-encoders generally give better results than baseline bi-encoders, so they can be worthwhile.
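Here’s roughly what that looks like in code with the CrossEncoder class from sentence-transformers. The checkpoint name is just one commonly used public reranker, and the query and documents are made up:

```python
# A quick cross-encoder scoring sketch with sentence-transformers.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what does liability insurance cover"
docs = [
    "Liability insurance covers damage you cause to other people and their property.",
    "Our passport renewal service is fast and easy.",
]

# Every (query, document) pair goes through the model together, so nothing
# can be pre-computed ahead of time -- that's what makes this expensive.
scores = model.predict([(query, d) for d in docs])
print(scores)  # higher score = more relevant
```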
Fine-Tuning for Similarity Search
Since our task is retrieving relevant documents, we want to use a BERT model that has been fine-tuned to produce the best numbers for retrieving documents. To start with, let’s have a quick overview of how similarity search with BERT works:
- BERT takes each document that we have and turns it into numbers
- These numbers are plotted on a graph
- BERT takes our query and turns it into numbers
- The query numbers are plotted on the graph, too
- Whatever lands close to our query on the graph is deemed a relevant document and returned!
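Here’s a minimal sketch of those five steps with sentence-transformers. The DistilBERT-based checkpoint name is just one reasonable choice, and the documents are made up:

```python
# A minimal similarity-search sketch: embed documents, embed the query,
# and return whatever sits closest on the "graph" (highest cosine similarity).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-distilbert-cos-v1")

documents = [
    "BM25 ranks documents with term-frequency statistics.",
    "Gap insurance covers the difference between a car's value and the loan balance.",
    "DistilBERT is a smaller, faster distillation of BERT.",
]

doc_embeddings = model.encode(documents, convert_to_tensor=True)                  # steps 1-2
query_embedding = model.encode("what is gap insurance", convert_to_tensor=True)   # steps 3-4

hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]          # step 5
for hit in hits:
    print(documents[hit["corpus_id"]], hit["score"])
```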
Similarity search is great for a few reasons:

* Numbers are easier to compute than words
* Synonyms have similar numbers, and numbers don’t lie
* Computers are really good at doing number stuff, fast
Before we can get good results, though, we need to teach BERT how to turn words into numbers good. But there’s a problem: BERT is big! Not as big as other models, for sure, but still, 110 million parameters are a lot of parameters, and we have to do math on every one of those parameters all the time while fine-tuning. It would go so much faster if we had fewer parameters to fine-tune…
Introducing DistilBERT!
DistilBERT is a BERT-type model that uses 40% fewer parameters and still keeps about 97% of BERT’s performance! Wow! Now we can fine-tune BERT quickly and cheaply! Thanks, HuggingFace!
How we fine-tune
The basic premise of fine-tuning is simple: quiz BERT with example questions, grade its answers, and nudge its parameters whenever it gets one wrong.
But which questions you ask and how you ask them are hugely important! Because it turns out that, if you ask BERT a bunch of questions about cooking, then quiz BERT about law, BERT is gonna fail that quiz! How human! So we need to ask questions that are similar to our real-world use case.
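As a rough sketch of what that quizzing looks like in practice, here’s how you might fine-tune a bi-encoder on (question, relevant passage) pairs with sentence-transformers. The base model, the examples, and the choice of MultipleNegativesRankingLoss are all illustrative, not the one true recipe:

```python
# Fine-tuning a bi-encoder on question/passage pairs (sketch).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Plain DistilBERT gets wrapped with mean pooling automatically.
model = SentenceTransformer("distilbert-base-uncased")

# Each example pairs a question with a passage that answers it; the other
# passages in the batch act as "wrong answers" (in-batch negatives).
train_examples = [
    InputExample(texts=["what is a deductible",
                        "A deductible is the amount you pay before insurance kicks in."]),
    InputExample(texts=["what is gap insurance",
                        "Gap insurance covers the difference between a car's value and the loan balance."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```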
But wait! There’s more!
In a banger 2021 paper from the University of Waterloo, scientists figured out that you can get better results for the same amount of training if you train on easy questions first, then move on to harder questions later. Just like a human!
Isn’t that cool? So all we need to fine-tune BERT on our own dataset is:

1. A list of all the knowledge we want BERT to pick from
2. A whole bunch of sample questions, ranging in difficulty from easy to hard
🛑 STOP 🛑 Hold up there! I got a problem… See, #1 is easy, everybody has data. But what if I don’t have a whole bunch of sample questions about my dataset?
BERT gets help from friends
Just like us humans, when BERT is stuck it can rely on friends to help out. In this case, BERT’s friends are other machine learning models!
Synthetic Queries with T5
T5 is a model by Google that is built for text generation. It isn’t an instruction-following model like ChatGPT; instead, it continues from where the incoming text leaves off. A bunch of smart people then fine-tuned it further to create a version of T5 that can take a document as input and output a question that can be answered by that document. It sounds like the perfect thing for creating a bunch of sample questions about our dataset!
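Here’s a sketch of how that query generation might look with the transformers library. BeIR/query-gen-msmarco-t5-base-v1 is one publicly available checkpoint trained for this; treat the name and the generation settings as illustrative:

```python
# Generating synthetic questions from a document with a query-generation T5 model.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

document = (
    "Gap insurance covers the difference between what your car is worth "
    "and what you still owe on the loan if the car is totaled."
)

inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=384)
outputs = model.generate(**inputs, max_length=64, do_sample=True,
                         top_p=0.95, num_return_sequences=3)

for ids in outputs:
    # Three synthetic questions this document should be able to answer
    print(tokenizer.decode(ids, skip_special_tokens=True))
```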
Transfer learning with a Cross Encoder
We mentioned earlier that cross-encoders do a better job but are expensive to run. But what if we used a cross-encoder only on the documents our search returned? This way, we can re-rank the documents to put the most relevant ones first, and then take the top 5 and return them!
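A retrieve-then-rerank pipeline might look something like this sketch. The model names are common public checkpoints, swapped in as assumptions rather than the only valid choices:

```python
# Stage 1: a cheap bi-encoder retrieves candidates over the whole corpus.
# Stage 2: an expensive cross-encoder re-scores only those candidates.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("multi-qa-distilbert-cos-v1")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "A deductible is the amount you pay out of pocket before coverage starts.",
    "Roadside assistance helps when your car breaks down on the highway.",
    "Gap insurance covers the loan balance above your car's value.",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "what is a deductible"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=50)[0]

pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = cross_encoder.predict(pairs)

reranked = sorted(zip(hits, rerank_scores), key=lambda x: x[1], reverse=True)
top_5 = [corpus[hit["corpus_id"]] for hit, _ in reranked[:5]]
print(top_5)
```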
But only scrubs try to run two custom machine learning models at the same time in a production system. Do you know how many things can go wrong with that kind of setup? Not to mention all the increased costs from having two different models going all the time. What if we could get all the benefits of using a cross-encoder for reranking, without actually using a cross-encoder?
Introducing GPL! The folks behind this paper put all the previous techniques into practice, and then did something crazy: they used a cross-encoder as a teacher to help BERT return better results.
In GPL, when BERT retrieves results for a query, the query and the results are all run through a cross-encoder. This cross-encoder assigns its own scores to each of the documents, and the scores are compared. The closer BERT’s scores are to the cross-encoder’s scores, the more cookies BERT gets. Eventually, BERT will become so good at matching the cross-encoder’s results that the student becomes the teacher and we won’t need the cross-encoder any more!
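Here’s a simplified sketch of that idea using MarginMSELoss from sentence-transformers: the cross-encoder teacher scores a good and a bad passage for a query, and the bi-encoder student is trained to reproduce the score gap. Real GPL also generates the queries synthetically and mines hard negatives, so this is only the distillation step, with illustrative names:

```python
# Distilling a cross-encoder teacher into a bi-encoder student (GPL-style sketch).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses

student = SentenceTransformer("distilbert-base-uncased")
teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is gap insurance"
positive = "Gap insurance covers the loan balance above your car's value."
negative = "Roadside assistance helps when your car breaks down."

# The teacher's score margin between the good and bad passage becomes the label.
ce_pos, ce_neg = teacher.predict([(query, positive), (query, negative)])
margin = float(ce_pos - ce_neg)

train_examples = [InputExample(texts=[query, positive, negative], label=margin)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.MarginMSELoss(student)  # student learns to match the teacher's margins

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```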
What’s next for BERT?
Despite the trend of larger and larger LLMs, DistilBERT and its 66M parameters keep finding surprising ways to stay relevant, and I don’t think that is going to change any time soon. There are still tons of new and exciting ideas to try, like:
- BERT and BM25 working together
- New ways of generating sample questions
- Parameter efficient fine tuning techniques like LoRA
- ??????????
Personally, I’m working on actually implementing these ideas on niche, multilingual datasets. All of the papers linked in this post deal with primarily English-language datasets, which leaves non-English speaking people behind. Since DistilBERT is so small and cheap to train and fine-tune, I want to put all of these techniques to use on data in Vietnamese, Hindi, Arabic, and other non-English languages spoken by millions. I’ll keep updating y’all with my progress.
❤️ Gordy