Memes from the AI Engineer Summit

I was lucky to attend the AI Engineer Summit this year in San Francisco. There were so many amazing people both speaking and attending, and it was an absolute blast. In the conference’s Slack channel, somebody asked “How does an AI Engineer take notes?” hoping to elicit recommendations for AI note-taking apps or other high-tech solutions. Everybody takes notes differently, but my favorite way is to make memes.

So in the spirit of open-source, I’d like to share my memes with you all and tell you briefly what inspired them.

The hotel wifi was not prepared

hotel wifi is deathly afraid of 500 people downloading huggingface transformers all at once

Many of the folks giving workshops had amazing materials set up. I followed Simon Suo’s workshop on fine-tuning a RAG pipeline in LlamaIndex the first morning, and they had a full Docker container ready for participants to download and follow along.

But while Simon was prepared, the hotel wifi was not 🥲 In fact, the hotel’s wifi was spotty throughout the conference, which left us participants with no option but to talk to each other (the horror!).

One model is all you need

One model per year is really enough

The next workshop, with Wane from AutoGPT, was full of insights from a team that constantly evaluates and iterates on its models, using evaluation techniques that have to be built from the ground up because the models are so cutting-edge. Of course, they don’t recommend that everybody go out and put that much work into building and evaluating a general agent. For most people, one working model that gets the job done is often good enough to not need touching or tweaking for a while.

However, for us AI engineers, the tweaking part is fun! Fine-tuning is a competition, and it is hard to call it a day when you know you can squeeze out just a little more performance.

Evals should mock humans

I bet you heard this image, didn’t you

Another insight from the AutoGPT team was that evals should be designed to emulate real user activity as closely as possible. That isn’t a new concept, but the challenge with AI model evals is that there is no set script for how humans interact with these systems, because the systems themselves have no set script. The evals must therefore be dynamic and respond to the model rather than passively measure it.
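
To make that concrete, here is a minimal sketch (not the AutoGPT team’s actual harness) of a dynamic eval: an LLM plays the user and reacts to whatever the assistant under test actually says, instead of replaying a fixed script. The `call_llm` parameter is a hypothetical stand-in for whatever model call you use.

```python
from typing import Callable

# A sketch of a dynamic eval: a simulated user reacts to the assistant's
# actual replies instead of replaying a fixed script. `call_llm` is assumed
# to take a system prompt plus a message list and return the next message.

def run_dynamic_eval(
    call_llm: Callable[[str, list[dict]], str],
    task: str,
    max_turns: int = 6,
) -> list[dict]:
    # The simulated user opens with the task, then improvises from there.
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_llm("You are the assistant under test.", transcript)
        transcript.append({"role": "assistant", "content": reply})

        # The "user" reads the latest reply and decides what to say next --
        # this is what makes the eval responsive rather than passive.
        followup = call_llm(
            f"You are a user trying to accomplish: {task}. "
            "React to the assistant's last message, or say DONE when satisfied.",
            transcript,
        )
        if "DONE" in followup:
            break
        transcript.append({"role": "user", "content": followup})
    return transcript  # grade the whole transcript afterwards
```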

Evals are also expensive

how to spend 1000 monthly on chatGPT

Running a full test suite isn’t free, either. Teams need to strike a balance between validating their changes and staying within budget.

One common strategy that came up was folding user metrics into the model’s success evaluations. If the newer model sees higher usage, better task success, and so on, that can stand in for some evals. But it is a messy signal that can be influenced by many things outside of your model.

Times have changed for GPT-4

things were better in the olden days

We may all work in different companies, industries, and parts of the world, but one thing we have in common is this: ChatGPT was much smarter when it first released (same for GPT-4), and it has since dumbed down a bit. The AutoGPT team showed, through their workshops and talks, how adding more features can reduce a model’s existing capabilities, and this might be what happened to GPT-3.5 and GPT-4. Since they are closed, proprietary models, we’ll never know…

Agent protocol

My favorite agent is agent Smith

Everybody wants to build AI agents, but the best practices and industry standards haven’t been completely ironed out yet. There are as many competing agent protocols as there are agents, and there is an opportunity for a unified protocol that lets agents talk to each other and to humans in the best way.

AI can take “Just Google it” too far

AIs are just like us!

Another insight from the AutoGPT team was that models with tools will often use their tools for questions they already know the answer to. In their evals, the team found that Google search was the go-to tool being used, and many of the usages were unnecessary. For example, asking an LLM what the capital of Washington state is should not trigger a Google search- the model ought to know this inherently.

If the model is reaching for Google search too often, we can combat this with prompt engineering and techniques like self-ask prompting.
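
As a rough illustration (these are not AutoGPT’s actual prompts), the nudge can be as simple as a system prompt that makes the model check its own knowledge before calling the tool:

```python
# A sketch of a system prompt that discourages unnecessary tool calls.
# The tool name `google_search` is just an example placeholder.

SYSTEM_PROMPT = """\
You can call a `google_search` tool.
Before calling it, ask yourself: do I already know this with high confidence?
- If yes, answer directly and do NOT search.
- Only search for facts that are recent, obscure, or likely to have changed.
"""

# A self-ask flavored variant goes one step further: have the model first list
# the sub-questions it needs answered, mark which ones genuinely require a
# search, and only call the tool for those.
```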

Web UI is for humans, not bots

World’s best LLM vs an $8 octopus

The AutoGPT team truly was a muse for me! They gave a live demo of teaching an LLM how to navigate Amazon to look up prices or buy items. This task is a big challenge for LLMs because the Amazon interface is extremely visual and constantly changing. Building this skill involved abstracting away much of the shopping interface and a lot of trial and error. In the end, AutoGPT was able to search for and find products, though this will only last until the interface (or Amazon’s product lineup) changes.

System prompts can be huge


In RAG, the end product is one big prompt with lots of instructions and contextual information. That is great for keeping the LLM informed about the task, but all that context can crowd out the actual user query. In these cases it is important to keep the final task close to the beginning or the end of the prompt. It also helps to audit your system prompts, context, and generation criteria to make sure you’re not spending extraneous tokens.
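
Here is a small sketch of what that ordering can look like, assuming a plain string template rather than any particular framework’s prompt format:

```python
# A sketch of a RAG prompt layout that keeps the user's question at the end,
# so it isn't buried under the retrieved context.

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "You are a helpful assistant. Answer using only the context below.\n\n"
        f"--- Context ---\n{context}\n--- End context ---\n\n"
        # The actual task goes last, where it is hardest to lose track of.
        f"Question: {question}\nAnswer:"
    )
```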

Hypnotoad

hypnotizing

The keynote stage had this hypnotizing animation…

From New Computer’s talk

Nothing more to it than that! Also big shout-out to Jason Yuan (left side of the above picture) for a great presentation and amazing fashion! Those gloves are killer 😍

Let AutoGPT handle your research and knowledge tasks

reject copy and pasting, embrace Chuck Testa

In their keynote, the AutoGPT team sang the praises of using AutoGPT to automate away inane, copy-and-paste office and knowledge work. I know I have fallen into that sort of rote knowledge work a few times in the last month, and it would be great to have it all automated! I can’t help but think about this timeless XKCD:

Is it worth the time?

AutoGPT (at this stage) takes a lot of effort to set up and verify. We might still save time by doing it the hard way.

Vibe check is valid

how else do you measure LLM performance

One constant talking point across all speakers was that AI Engineers need to look at their own data, and talk to their own LLM. Metrics and statistics can answer questions like “is the LLM working?” but only by talking to the bot or using the system yourself can you tell if it is a good experience or not.

Chunk size

hes not fat hes just fluffy

Finally, to round things out, we have a meme on information retrieval. Chunk size is an incredibly important hyperparameter that changes for every dataset and must be tuned for the best results. Sometimes larger chunks help, sometimes they don’t. The AI Engineering discipline is all about finding out!
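
For the curious, here is a bare-bones sketch of the knob itself, using naive character-based chunking (real pipelines usually split on tokens or sentences, and frameworks like LlamaIndex expose similar chunk size and overlap parameters on their splitters):

```python
# A sketch of fixed-size chunking; chunk_size and overlap are the
# hyperparameters you end up tuning per dataset.

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Smaller chunks give precise, focused retrieval hits; larger chunks keep more
# surrounding context. Which works better depends on your data and queries.
my_corpus = "…your documents go here…"  # hypothetical input text
small_chunks = chunk_text(my_corpus, chunk_size=256)
large_chunks = chunk_text(my_corpus, chunk_size=1024)
```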

And that’s all for now!

I hope you enjoyed my memes! Keep up with the blog for more fun.

❤️ Gordy


Enjoyed that one? How about signing up for my mailing list below? You can also find another article to read in ./posts

 



2023-10-12