So I’ve picked my model stack, cleaned and transformed my data, set up a remote machine, and my new retrieval model is baking in the oven like a fresh dozen cookies! I’ve got nothing else to do, so I’m going to record a few thoughts about what I like about this training process, and what I’m going to change for next time.

Once it is done baking, my BERT model will come out of the oven looking like this

Let’s start off easy with the good stuff…

Things I like about this training process

1. Cloud GPUs and SSH are super cool

I’m using RunPod as my cloud GPU provider because they’re cheap and available. I used to use Lambda Labs, but they’re all booked up. Plus, RunPod offers tiny-ass GPUs like the RTX 3060 (with only 8GB of VRAM) in addition to H100s, so I can save money and use smol GPUs for tasks that don’t require a lot of power (like mining hard negatives).

Additionally, accessing these machines via SSH is super convenient. I admit, it took me about an hour to figure out how to properly configure all the keys (RIP my $0.20 GPU rent for the hour), but once it was set up I never had to touch it again and things just worked.

Every time I SSH into a computer I feel like a Hollywood hacker and say “I’m in”

Also, big shout-out to my #1 dude ChatGPT for coming in clutch with Unix commands. It’s been a while since I last had to manipulate files from the terminal alone, and ChatGPT is amazing at telling me which commands to use. From scp to transfer files to and from the remote machine, to df -h to check how much disk space is left, without ChatGPT I’d still be searching for answers on Stack Overflow.

2. The GPL library makes training with GPL fast & easy

The authors of the GPL paper also kindly published their code on GitHub for anybody to use, and use it I did! Their library takes care of every step of applying GPL to a fine-tuned question-answering model.

After I had all my data properly transformed and in the right places (and we’ll get to that later), I could run training and evaluation using the code from their repo’s README file. I just popped this in a Jupyter notebook, ran it, and watched the loading bar go!
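For reference, the training call in their README looks roughly like this; the data paths, output directories, and model names below are placeholders for my setup rather than the exact values I ran with:

```python
import gpl

# One call covers query generation, negative mining, pseudo-labeling,
# and training. Paths and model names are placeholders.
gpl.train(
    path_to_generated_data="generated/my-dataset",  # BeIR-format corpus lives here
    base_ckpt="distilbert-base-uncased",            # the model being adapted
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=140_000,
    output_dir="output/my-dataset",
    evaluation_data="./my-dataset",                 # BeIR-format eval split
    evaluation_output="evaluation/my-dataset",
    generator="BeIR/query-gen-msmarco-t5-base-v1",  # T5 query generator
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],  # negative miners
    retriever_score_functions=["cos_sim", "cos_sim"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",  # pseudo-labeler
    qgen_prefix="qgen",
    do_evaluation=True,
)
```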

Things I didn’t like

1. BeIR dataset documentation is sparse

GPL is designed to work on the BeIR benchmark, a large, open-source benchmark for information retrieval. And that’s great! But it means that your custom data has to follow the same schema as BeIR datasets, and there’s not much documentation about how to do that apart from this page right here.
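For the record, here is the layout I eventually pieced together. A minimal sketch with made-up IDs and text, writing the three pieces the BeIR loaders expect: a corpus.jsonl, a queries.jsonl, and a qrels TSV.

```python
import json
from pathlib import Path

# Minimal BeIR-style dataset with made-up documents and queries.
data_dir = Path("my-dataset")
(data_dir / "qrels").mkdir(parents=True, exist_ok=True)

# corpus.jsonl: one JSON object per document
corpus = [{"_id": "doc1", "title": "Ovens", "text": "Bake cookies at 180C for 12 minutes."}]
with open(data_dir / "corpus.jsonl", "w") as f:
    for doc in corpus:
        f.write(json.dumps(doc) + "\n")

# queries.jsonl: one JSON object per query
queries = [{"_id": "q1", "text": "how long to bake cookies"}]
with open(data_dir / "queries.jsonl", "w") as f:
    for query in queries:
        f.write(json.dumps(query) + "\n")

# qrels/test.tsv: tab-separated relevance judgments linking the two
with open(data_dir / "qrels" / "test.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.write("q1\tdoc1\t1\n")
```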

2. GPL is not flexible

GPL tells you to use a separate T5 model to generate a bunch of synthetic queries for each document in your dataset. What GPL doesn’t tell you is that today’s multilingual T5 models f**king suck! To get good data, you need a T5 model that has been fine-tuned specifically for your language, like these ones from Dr. SBERT themselves, Dr. Nils Reimers.

HOWEVER, GPL does not have an option for mapping your Arabic T5 model to your Arabic dataset and so on, so I had to generate queries with my own code, transform those generated queries into BeIR format, and then resume GPL. Thankfully, GPL does check for existing query data.
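For anyone in the same boat, the workaround is straightforward to sketch: map each language to its own generator and sample a few queries per passage. A minimal sketch, assuming per-language doc2query checkpoints on the Hugging Face Hub (the exact model names and generation settings here are illustrative, so double-check the Hub for your language):

```python
from functools import lru_cache

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Language -> query-generation model. Model names are illustrative;
# check the Hub for the checkpoint matching each of your languages.
GENERATORS = {
    "ar": "doc2query/msmarco-arabic-mt5-base-v1",
    "de": "doc2query/msmarco-german-mt5-base-v1",
}

@lru_cache(maxsize=None)
def load_generator(lang: str):
    """Load (and cache) the tokenizer/model pair for one language."""
    name = GENERATORS[lang]
    return AutoTokenizer.from_pretrained(name), AutoModelForSeq2SeqLM.from_pretrained(name)

def generate_queries(passage: str, lang: str, n: int = 3) -> list[str]:
    """Sample n synthetic queries for one passage using its language's model."""
    tokenizer, model = load_generator(lang)
    inputs = tokenizer(passage, truncation=True, max_length=384, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,  # sampling gives more varied queries than greedy decoding
        top_p=0.95,
        num_return_sequences=n,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

From there it’s the same JSONL dance as above: write the queries and their qrels into the generated-data folder under the qgen prefix, and GPL should pick up from negative mining.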

Another downside of vanilla GPL is that it does not run evals while training, nor does it have flexible checkpoint saving. This is a big deal for me because I am a very anxious boy and I don’t want to wait 6 hours for my model to finish training before I get eval results. That’s why I’m using my local machine to run evals while the cloud machine trains, and it’s also why I’m wearing gloves to type this: my 2021 M1 MacBook Air with 8GB of RAM was not made to handle these workloads 😂

Since GPL is built on top of Sentence Transformers, which has a beautiful in-training eval system built in, I’d like to use that along with WandB to track everything.

The inflexible checkpoint saving also sucks. I’m trying to squeeze as much runtime out of my pod as I can, and that means I might not have enough disk space to save a model checkpoint every 10,000 steps.
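Both gripes look fixable with stock Sentence Transformers machinery, since the classic fit() method accepts an evaluator, an evaluation interval, checkpoint limits, and a callback hook. A rough sketch of the plan, where the hold-out data, project name, and step counts are made up and the loss is a simple stand-in for GPL’s actual MarginMSE objective:

```python
import wandb
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Tiny made-up hold-out set in the dict format the evaluator expects.
queries = {"q1": "how long to bake cookies"}
corpus = {"doc1": "Bake cookies at 180C for 12 minutes."}
relevant_docs = {"q1": {"doc1"}}

wandb.init(project="gpl-retraining")  # hypothetical project name

model = SentenceTransformer("distilbert-base-uncased")  # placeholder base model
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dev")

# Stand-in objective; GPL's MarginMSE training data would slot in here instead.
train_examples = [InputExample(texts=[queries["q1"], corpus["doc1"]])]
train_dataloader = DataLoader(train_examples, batch_size=1, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

def log_to_wandb(score: float, epoch: int, steps: int) -> None:
    # fit() invokes this after every evaluation run
    wandb.log({"dev_score": score, "epoch": epoch, "steps": steps})

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,
    evaluation_steps=1_000,         # evals during training, not just after
    checkpoint_path="checkpoints",
    checkpoint_save_steps=10_000,
    checkpoint_save_total_limit=2,  # keep only the newest 2 to spare the disk
    callback=log_to_wandb,
)
```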

So for next time, I think I am going to cannibalize a lot of the code from GPL and build a training pipeline with the following features (sketched as a rough config after the list):

  1. Splits the corpus by language and uses a different model for each language
  2. Supports configurable checkpoint-saving steps
  3. Evaluates the model during training, as well as after training
  4. Sends data to WandB
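To make that concrete, here’s the kind of config I imagine driving the whole pipeline. Purely hypothetical, since none of this code exists yet:

```python
from dataclasses import dataclass, field

# Hypothetical config for the future pipeline; nothing here exists yet.
@dataclass
class PipelineConfig:
    # 1. per-language query generators (language code -> model name)
    generators: dict[str, str] = field(default_factory=lambda: {
        "ar": "doc2query/msmarco-arabic-mt5-base-v1",
    })
    # 2. configurable checkpointing
    checkpoint_save_steps: int = 10_000
    checkpoint_save_total_limit: int = 2
    # 3. in-training (and post-training) evaluation
    evaluation_steps: int = 1_000
    # 4. experiment tracking
    wandb_project: str = "gpl-retraining"
```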

BUT that’s not going to happen for a couple of weeks. I still plan to build my full demo with this model and show it off in a Hugging Face Space. But once that is all done, I definitely want to come back and make these changes.

Until next time!

❤️ Gordy