Tiny Retriever vs ChatGPT 🥊

Well folks, the results are in! I prepared my data, trained my model, ran my evals, and… I lost 😭

| nDCG@k | My model | OpenAI |
|-------:|---------:|-------:|
| 1 | 0.22883 | 0.34679 |
| 3 | 0.29118 | 0.45706 |
| 5 | 0.31254 | 0.49071 |
| 10 | 0.33486 | 0.5240 |

I’m actually not too sad about the results because the real win was the friends I made along the way.

BERT and ChatGPT face off in the ring, by DALL·E 2

The friends I made along the way

I employed a whole host of tricks with this model, only one of which was my own. Let’s share the blame and give a shout-out to the researchers whose ideas were the foundation of this project.

Sentence Transformers

sbert logo

While I didn’t directly use the sentence_transformers library, the libraries I did use are built on sentence_transformers. And I did directly use the shit out of a bunch of their fine-tuned DistilBERT QA models. Additionally, Dr. SBERT himself, Dr. Nils Reimers, was a co-author or contributor on almost every other paper here. This project would have been sooooo much harder without Dr. Reimers’ open-source contributions, so big shout-out to him!

Doc2Query

sbert 2 electric boogaloo

Did you think I was done talking about Dr. Reimers? Think again! Not only was he involved in sentence_transformers, but he also trained up a handful of handy, multilingual document expansion T5 models called doc2query. I also used the shit out of these models in my project, so thank you x2 Dr. Reimers!

GPL

GPL was the big science happening in this project, and I totally stole (sorry, adapted) their code for 90% of my model training. Big shout-out (once again) to Dr. Reimers, but also to Nandan Thakur, one of the paper’s authors, who responded to my email and helped me locate datasets and make sense of his paper.

My (bad) idea that added even more science on top of GPL was multilingual capability. GPL is all about generating synthetic questions with a second transformer model, then using a third and fourth transformer model to help train the OG encoder. If that sounds like a lot of transformer models, hold onto your pants, because I introduced an additional nine transformer models into the mix. Why? I decided to use doc2query models that were fine-tuned (by Dr. Reimers) for the different languages in my dataset. Did that help? Maybe!
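For the curious, kicking off GPL training through the authors’ library is surprisingly compact. Below is a minimal sketch adapted from the GPL repo’s README; the paths, checkpoint names, and values are illustrative placeholders rather than my exact configuration.

```python
import gpl

# A minimal GPL training call, adapted from the GPL repo's README.
# Paths, model names, and values here are illustrative, not my exact setup.
gpl.train(
    path_to_generated_data="generated/support-docs",  # where synthetic queries are written
    base_ckpt="distilbert-base-uncased",              # the encoder being domain-adapted
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=140_000,                                # the default step count
    queries_per_passage=-1,                           # -1 lets GPL derive a value from corpus size
    output_dir="output/support-docs",
    generator="doc2query/msmarco-t5-base-v1",         # swap in a language-specific doc2query model per language
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],  # used to mine hard negatives
    retriever_score_functions=["cos_sim", "cos_sim"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",  # pseudo-labels the (query, passage) margins
    qgen_prefix="qgen",
)
```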

BEIR

BEIR project logo

The last shout-out goes to the benchmark that a lot of these papers and projects compete on. Not only does it promote crazy innovation, but the team behind BEIR also maintains a really nice library for standard interfacing with the datasets, sort of like being the sports rules committee for the Information Retrieval Olympics.
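To give a flavor of that standard interface, here is a rough sketch of loading a BEIR-formatted dataset and scoring a dense retriever with the beir library; the dataset path and checkpoint name are placeholders, not my actual setup.

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Any corpus/queries/qrels laid out in the BEIR format loads the same way
corpus, queries, qrels = GenericDataLoader("datasets/my-support-corpus").load(split="test")

# Wrap a sentence-transformers checkpoint as a dense retriever and evaluate it
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g. {"NDCG@1": ..., "NDCG@3": ..., "NDCG@5": ..., "NDCG@10": ...}
```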

Post-game breakdown


Now that the dust has settled, here are some of the things I think underperformed or were just bad ideas (by me):

1. Too little training data

A big part of GPL, again, is generating synthetic questions from the documents. The wonderful code from the paper’s authors includes logic to automatically calculate how many questions need to be generated per document in order to have enough training data to get results. There is a minimum of 3 questions, but no maximum.

My dataset was about 7 million tokens, chunked into about 30,000 bits using Langchain’s recursive text splitter. The data came from a popular finance app’s support page, and I’m not linking it because I did not ask these folks before using their data 🏴‍☠️
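For reference, the chunking step itself is only a couple of lines with LangChain; here’s a minimal sketch, where the chunk size and overlap are illustrative rather than my exact settings.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split one scraped support article into retrieval-sized chunks.
# chunk_size/chunk_overlap are illustrative, not my exact settings.
article_text = open("support_article.txt").read()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(article_text)  # repeated over every article, this gave ~30,000 chunks
```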

So my dumb ass sees 30,000 documents in the corpus, thinks “oh hey, that’s a lot, I shouldn’t need more than 3,” and manually enters 5 (three for training, one for validation, one for testing) into the GPL team’s code. Later, when I actually looked at the math and logic, it turned out I needed nine questions per document as a recommended minimum, and the code only defaults to the minimum of 3 queries when your corpus is bigger than 250,000 documents.
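If I’m reading the intent of that logic right, the calculation works out roughly like this; a sketch of the idea, not the GPL team’s exact code.

```python
import math

corpus_size = 30_000    # my chunked corpus
target_pairs = 250_000  # roughly how much synthetic data GPL wants overall (my reading of the code)

queries_per_passage = max(math.ceil(target_pairs / corpus_size), 3)
print(queries_per_passage)  # 9 -- not the 5 I hard-coded
```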

2. Too shitty training data

My dataset had parallel documents in 10 languages: Arabic, English, Spanish, French, Hindi, Japanese, Thai, Vietnamese, and Simplified Mandarin Chinese. The original GPL paper only used English language data, and so it only needed a single T5 model for document expansion; needless to say, it is a lot easier to ensure quality when you are using only one model.

Unfortunately, I don’t speak ten languages, so it is hard for me to verify how well the nine other models did with their synthetic queries. But if they were as good as the English model, it means they were… shitty. The English model did the bare minimum in terms of query generation. For example, if the document was about “how to reset your password,” the model would give:

  • How to reset password
  • How to reset passcode
  • How to change password

Not a lot of variety there! And this was after I did some hyperparameter tuning: temperature, n-gram repetition penalty, beam search length, and more! Obviously the ideal would be real-world user data, but barring that, I think a custom model fine-tuned on a handful of manually written phrases would do better. It would also be interesting to see what the minimum number of human-written phrases is in order to get quality machine-generated phrases.
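For context, here is roughly how those synthetic queries get generated from a doc2query checkpoint, along with the generation knobs I was fiddling with; the model name and values are illustrative, not my exact configuration.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative checkpoint; in practice I used the language-specific doc2query variants
model_name = "doc2query/msmarco-t5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

doc = "To reset your password, open the app, tap Settings, then tap 'Reset password'."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=384)

# The knobs mentioned above: sampling temperature, n-gram repetition penalty, beam count
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,
    temperature=1.2,
    no_repeat_ngram_size=2,
    num_beams=3,
    num_return_sequences=3,
)
for query in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(query)  # e.g. "How to reset password", "How to change password", ...
```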

3. No in-training validation

One gripe I have with the libraries I used is that they didn’t offer much visibility into the model while it trained. The defaults for GPL were 1 epoch and 140,000 steps. Instead of rewriting the code, I manually ran evals on each checkpoint. But I would have appreciated being able to reserve a dataset for evals and see those evals during training. I’ve used Weights & Biases before and I really like its API, so in the next round I will make sure to include it.
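Next time around, I’d like something like the following wired into the training loop. This is a hedged sketch of logging checkpoint evals to Weights & Biases; the project name and the scores are made-up placeholders, standing in for real checkpoint evals like the BEIR sketch above.

```python
import wandb

wandb.init(project="tiny-retriever")  # hypothetical project name

# Placeholder scores standing in for real held-out checkpoint evals
checkpoint_scores = {10_000: 0.27, 20_000: 0.28, 30_000: 0.29}

# Log each checkpoint's score against its training step so W&B draws the curve
for step, ndcg_at_5 in checkpoint_scores.items():
    wandb.log({"nDCG@5": ndcg_at_5}, step=step)

wandb.finish()
```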

For what it’s worth, the manual evals did not show much improvement with fine-tuning. The model started out at around 0.11 nDCG@5, shot up to 0.27 nDCG@5 after 10,000 steps, and then slowly, slowly crept up from there. I wish I had saved a chart of its progress.

4. Bad choice of metric?

This is a small gripe, but I’m not 100% convinced that nDCG is the best metric for this project. nDCG will penalize a model if it places the most relevant document anywhere other than first in the list. However, when doing RAG, does the order of the documents even matter? For example:

Say for query Q_1 we get 5 documents [d1, d2, d3, d4, d5] with relevance scores [1, 0.5, 0.25, 0.125, 0.0625]. If our retriever returns the docs in this exact order, the DCG is $$ \mathrm{DCG} = \frac{1}{\log_2(2)} + \frac{0.5}{\log_2(3)} + \frac{0.25}{\log_2(4)} + \frac{0.125}{\log_2(5)} + \frac{0.0625}{\log_2(6)} \approx 1.518 $$ and since this is also the ideal ordering, nDCG = DCG / IDCG = 1,

but if we change the order to, say, [d2, d1, d4, d3, d5] we get $$ \mathrm{DCG} = \frac{0.5}{\log_2(2)} + \frac{1}{\log_2(3)} + \frac{0.125}{\log_2(4)} + \frac{0.25}{\log_2(5)} + \frac{0.0625}{\log_2(6)} \approx 1.325, $$ which gives nDCG = 1.325 / 1.518 ≈ 0.873.
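If you want to check that arithmetic yourself, here’s a tiny throwaway snippet (my own, not from any of the libraries above):

```python
import math

def dcg(relevances):
    # Position i (0-indexed) is discounted by log2(i + 2), matching the formulas above
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

ideal    = [1, 0.5, 0.25, 0.125, 0.0625]  # order [d1, d2, d3, d4, d5]
shuffled = [0.5, 1, 0.125, 0.25, 0.0625]  # order [d2, d1, d4, d3, d5]

print(dcg(ideal) / dcg(ideal))     # 1.0
print(dcg(shuffled) / dcg(ideal))  # ~0.873
```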

While the nDCG score is clearly lower in the second example, all the same documents were returned, so I don’t think the order matters that much when you are returning only a handful of documents. That’s an important caveat, though! LLMs are known to have U-shaped attention over their context, meaning that stuff in the middle tends to get lost. If you are returning 20, 50, or 100 documents, you need to be careful about this.

Since the articles that make up my dataset are generally pretty short (under 1,000 words for sure, median length closer to 300 according to my feelings and no science), I am interested in whether [parent document retrieval](link to langchain webinar?) could help boost my little model’s performance.

But wait! Not all is lost!

In researching this project, I came across this thread from an old HuggingFace hackathon/project to build the best encoder model with 1B training pairs. The project wrapped up in 2021 with this note:

Results

The results of the cosine similarity calculations are as follows:

SBERT Cosine Similarity Scores: [0.1356698, 0.076096766, 0.015867135, 0.58982027]

OpenAI Ada Cosine Similarity Scores: [0.7172783545691249, 0.727737901177177, 0.8542776604362744, 0.7744492503622011]

As can be seen from the above results, OpenAI’s Ada model outperformed SBERT in detecting similarity across languages. All the cosine similarity scores from the OpenAI model were above 0.7, whereas the scores from SBERT were much lower.

OpenAI kicked the shit out of them! Critically, though, they were using RoBERTa back in 2021, before innovations like TAS-B and GPL. And since then, the margin has shrunk! I have proof!

I think this validates my idea that a small encoder model can beat OpenAI under very specific conditions:

  1. The dataset is niche. It does not include much general knowledge, was not used to fine-tune chatGPT, and uses words or phrases unique to its niche.
  2. The dataset is multilingual. I don’t have proof (yet), but I have a feeling that a fine-tuned model can trounce OpenAI embeddings in non-Western languages like Arabic, Thai, Indonesian, and Hindi. That feeling also extends to dialects of the better-represented languages, like North African French, Latin American Spanish, Egyptian Arabic, and more.

Finally, I’m just chuffed that I got this far in the first place. I’m not sure how big OpenAI’s Ada 2 embedding model is in terms of parameter count, but I know the BERT model mine is based on has only ~320 million parameters. That’s ~58x smaller than ChatGPT!

So what’s next?

In the immediate future, I’m taking a break! This project was a ton of work, and it really pushed the limits of my understanding. That being said, I learned a ton, too, so the push was worth it.

I’ll be at the AI Engineer Summit this week and might talk to a few folks and share my experience.

After that, I think I will take up something different for a while and then come back to this in early 2024. By then, I should have some fresh stamina and ideas to take on OpenAI once more!

❤️ Gordy


Enjoyed that one? How about signing up for my mailing list below? You can also find another article to read in ./posts


Hanakano

Conversational AI and Technology Consulting


2023-10-07