VOYAGER from NVIDIA shows a novel way of using GPT-4 to create an agent that plays Minecraft via API. Their approach is very agentic and mirrors other development agents like gpt-engineer and smol-developer. In this post, we're going to break down Voyager's action loop and analyze the fine details of how this agent works.
Many autonomous agents can complete two or three tasks in a row, but Voyager, from a team led by NVIDIA, showed the ability to consistently perform over 150. Voyager is a GPT-4-powered agent that plays Minecraft via API. When given the general goal of "explore as many environments, collect diverse resources, and build new and better tools," Voyager outperformed other autonomous agents by a landslide. In this post, I want to analyze and better understand Voyager's structure so I can implement a similar action loop in my own agents.
In their paper, the Voyager team lists six elements that made it more successful than other agents. They are:
1. Automatic Curriculum
At the start of each action loop, Voyager’s general goal gets decomposed into a smaller task that is suitable for the agent to perform next. The team used GPT-4 to generate this task, and they fed it a snapshot of the local Minecraft environment, a list of past tasks and their pass/fail status, as well as several examples.
The automatic part of the automatic curriculum proved crucial to Voyager's success. Because the local environment and past history are different at the start of each loop, the next task is custom-made just for the agent. In trials with a human-designed curriculum, the agent explored 73% less area.
2. Skill Library
When a task is successfully completed, the steps the agent took to complete it are saved to a library along with the agent's environment. This library is a vector database, and future loops query it with the task and environment for helpful skills.
Whenever Voyager successfully completed a task, the code it wrote to complete that task got saved to a skills database. Then, in future runs, a service searches this skills library and offers relevant skills to the code generator to use as macros in order to accomplish more advanced tasks. For example, the task "craft a bucket" calls the macros to "mine 3 iron," "mine 3 coal," and "place furnace."
3. Environment Feedback
Despite playing in a single-player world, Voyager was a very chatty Minecraft player. The team used the chat as a logging system to track intermediate progress while performing tasks. If a task failed, this log was used to debug what went wrong.
4. Execution Errors
Voyager's code-generation engine was not expected to write perfect code every time. If there are any errors when executing the code, those errors are returned along with the code, and the code-generation engine uses them in its next attempt.
5. Self-Verification
The second-most impactful feature, self-verification uses GPT-4 to evaluate the current inventory and environment and then decide whether or not the task has passed. If the task failed, this engine also generates feedback explaining why. The researchers mention this step as a possible entry point for humans to join the loop; however, they don't elaborate on whether humans should replace GPT-4 in this step or whether the two can work in tandem.
6. GPT-4 for Code Generation
To use the researchers' own words, "GPT-4 exhibits a quantum leap in coding abilities." Code generated by GPT-3.5-turbo just didn't cut it. I would be interested to see whether models fine-tuned for code generation, like StarCoder or Ghostwriter, perform on par with GPT-4.
So what was the order of execution in Voyager? What were the prompts? Which steps were chains? Let’s dive in!
Their paper also provides an overview in pseudocode. Let's go through it line by line:
def voyager(
    environment,       # environment that uses code as action space
    curriculum_agent,  # curriculum agent for proposing the next task
    action_agent,      # action agent for code generation
    critic_agent,      # critic agent for self-verification
    skill_manager,     # skill manager for adding new skills and skill retrieval
):
These are the inputs to the whole system!
Environment is the executable running the entire program. In Voyager, it includes the Mineflayer Minecraft API server, the skill library, the vectorstore, and, of course, the Voyager agent itself.
Curriculum_agent is a GPT prompt that generates a single next task for the agent.
Action_agent is the code-generation and execution loop.
Critic_agent is the self-evaluation part.
Skill_manager is a retrieval service for the skills library vectorstore. An interesting fact about its implementation in Voyager is that this was not included for the first 15 tasks so that the team could force Voyager to build up a diverse starting set of skills rather than rely on the first two or three it makes.
Retrieval was how game knowledge entered Voyager, although it was rarely needed because GPT already knew lots about Minecraft.
agent_state = environment.reset()
This line clears out any agent state and code from previous tasks.
while True:
Start of an infinite loop, gotta love it! Although the researchers only ran Voyager for 160 iterations in each trial.
exploration_progress = curriculum_agent.get_exploration_progress(
    curriculum_agent.get_completed_tasks(),
    curriculum_agent.get_failed_tasks(),
)
task = curriculum_agent.propose_next_task(
    agent_state, exploration_progress
)
This is the automatic curriculum part. It:
1. Gets a list of completed tasks and failed tasks. Voyager stored these in .json files.
2. Adds those to a big long prompt and sends it to GPT-4.
3. Parses the output down to a single task.
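Concretely, that chain might look something like the sketch below. The file names, prompt wording, and the call_gpt4() helper are placeholders I made up for illustration, not Voyager's actual implementation.

import json

def call_gpt4(prompt: str) -> str:
    """Placeholder for a chat-completion request; swap in a real client."""
    raise NotImplementedError

def propose_next_task(agent_state: str) -> str:
    # 1. Load the pass/fail history from disk.
    with open("completed_tasks.json") as f:
        completed = json.load(f)
    with open("failed_tasks.json") as f:
        failed = json.load(f)

    # 2. Add everything to one big prompt and send it to GPT-4.
    prompt = (
        "You are guiding a Minecraft agent whose goal is to explore, collect "
        "diverse resources, and build better tools.\n"
        f"Completed tasks: {completed}\n"
        f"Failed tasks: {failed}\n"
        f"Current environment and inventory: {agent_state}\n"
        "Propose exactly one next task, prefixed with 'Task:'."
    )
    response = call_gpt4(prompt)

    # 3. Parse the reply down to a single task string.
    return response.split("Task:", 1)[-1].strip()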
code = None
environment_feedback = None
execution_errors = None
critique = None
success = False
I think this is just resetting the execution parameters before attempting a new task.
# try at most 4 rounds before moving on to the next task
for i in range(4):
    skills = skill_manager.retrieve_skills(
        task,
        environment_feedback
    )
This second chain in Voyager is a retriever hooked up to a special database of skills. What's interesting is that the environment feedback is included when querying skills, not just the task. So the returned "mine 3 iron" skill will look different if the agent is in a desert compared to a rainforest.
At this stage, the skills are just a list of titles and descriptions like:
name: craftBucket()
description: The function crafts a bucket using a crafting table. It first checks if there are enough iron ingots in the inventory, and if not, it mines iron ores and smelts them into iron ingots. Then, it places a crafting table near the bot and crafts a bucket using the crafting table.
name: craftFurnace()
description: The function crafts a furnace using a crafting table and cobblestones. If there are not enough cobblestones in the inventory, it mines the required amount. Then, it places a crafting table near the bot and crafts a furnace using the crafting table. Finally, it sends a chat message indicating that a furnace has been crafted.
name: killFourSheep()
description: The function is about killing four sheep and collecting their drops. It equips a wooden sword and kills the first three sheep, then kills the fourth sheep. After that, it collects the dropped items from the killed sheep, which include wool and raw mutton.
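Under the hood, this step is most likely a similarity search over those stored descriptions. Here's a rough sketch of how it could work; embed() is a stand-in for any embedding model, and the skill dictionaries are an illustrative layout rather than Voyager's exact schema.

import math

def embed(text: str) -> list[float]:
    """Placeholder for an embedding call (e.g. an embeddings API)."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve_skills(task: str, environment_feedback: str,
                    skills: list[dict], k: int = 5) -> list[dict]:
    # The query mixes the task with the environment feedback, which is why the
    # same task can surface different skills in different surroundings.
    query = embed(f"{task}\n{environment_feedback}")
    ranked = sorted(skills, key=lambda s: cosine(query, s["embedding"]), reverse=True)
    return ranked[:k]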
Up next: code generation!
code = action_agent.generate_code(
task,
code,
environment_feedback,
execution_errors,
critique,
skills,
)
Code is generated in a single shot by GPT-4. To get the code, they pass the task, any previously generated code, the environment state, any errors from the previous code, notes from the self-verification stage, and the library of relevant skills.
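As a rough sketch, those inputs probably get stitched into one prompt along these lines; the section wording here is my guess, not the prompt Voyager actually uses.

def build_codegen_prompt(task: str, previous_code: str | None,
                         environment_feedback: str | None,
                         execution_errors: str | None,
                         critique: str | None, skills: str) -> str:
    parts = [
        "Write an async Mineflayer JavaScript function that completes the task.",
        f"Task: {task}",
        f"Environment: {environment_feedback or 'n/a'}",
        f"Relevant skills you may call:\n{skills}",
    ]
    # On retries, include everything that went wrong last time.
    if previous_code:
        parts.append(f"Previous attempt:\n{previous_code}")
    if execution_errors:
        parts.append(f"Execution errors:\n{execution_errors}")
    if critique:
        parts.append(f"Critic feedback:\n{critique}")
    return "\n\n".join(parts)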
Let’s take a look at its output. This is craftBucket.js from Voyager’s first trial:
async function craftBucket(bot) {
// Check if there are enough iron ingots in the inventory
const ironIngotsCount = bot.inventory.count(mcData.itemsByName.iron_ingot.id);
// If not enough iron ingots, mine iron ores and smelt them into iron ingots
if (ironIngotsCount < 3) {
await mineBlock(bot, "iron_ore", 3 - ironIngotsCount);
bot.chat("Collected iron ores.");
await smeltItem(bot, "iron_ore", "coal", 3 - ironIngotsCount);
bot.chat("Smelted iron ores into iron ingots.");
}
// Place the crafting table near the bot
const craftingTablePosition = bot.entity.position.offset(1, 0, 0);
await placeItem(bot, "crafting_table", craftingTablePosition);
// Craft a bucket using the crafting table
await craftItem(bot, "bucket", 1);
bot.chat("Crafted a bucket.");
}
What's cool about this skill:
* The skill calls functions from shared dependencies
* Each "chunk" of code is nicely commented
* bot.chat() is used as a logger to track intermediate progress
(agent_state, environment_feedback, execution_errors) = environment.step(code)
After the code is written, it is immediately run, and the updated agent state and environment feedback are returned. If there were any errors in the code's execution, they are returned here as well.
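Here's a loose sketch of what environment.step() has to hand back: the new agent state, the accumulated chat log as feedback, and any runtime error. The run_js() and snapshot helpers are hypothetical stand-ins for the Mineflayer bridge, not real Voyager functions.

def run_js(code: str, on_chat) -> None:
    """Placeholder: execute the generated JavaScript via the Mineflayer bridge."""
    raise NotImplementedError

def snapshot_agent_state() -> str:
    """Placeholder: capture the bot's inventory and nearby blocks."""
    raise NotImplementedError

def step(code: str) -> tuple[str, str, str]:
    chat_log: list[str] = []
    execution_errors = ""
    try:
        # Run the skill code and collect every bot.chat() message it emits.
        run_js(code, on_chat=chat_log.append)
    except Exception as err:
        # Errors don't end the task here; they feed into the next attempt.
        execution_errors = str(err)
    agent_state = snapshot_agent_state()
    environment_feedback = "\n".join(chat_log)
    return agent_state, environment_feedback, execution_errors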
success, critique = critic_agent.check_task_success(task, agent_state)
if success:
break
Last but not least, the self-verification stage! This was shown in the study to be the second-most impactful feature on overall performance.
I find it interesting that execution errors do not immediately trigger failure. I would assume it would be safe to mark any attempt a failure and skip the self-verification stage if there were execution errors. On the other hand, the self-verification stage might give useful feedback about why the error occurs, and that could help with faster debugging.
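A hedged sketch of what check_task_success() might boil down to: a single prompt asking GPT-4 to judge the attempt from the task and the post-run agent state. The reply format (a verdict line followed by a critique) is my assumption, and call_gpt4() is the same placeholder as in the curriculum sketch above.

def check_task_success(task: str, agent_state: str) -> tuple[bool, str]:
    prompt = (
        "You judge whether a Minecraft agent completed its task.\n"
        f"Task: {task}\n"
        f"Inventory and environment after the attempt: {agent_state}\n"
        "Reply with 'success' or 'failure' on the first line, then explain "
        "what went wrong (if anything) on the lines after."
    )
    reply = call_gpt4(prompt)  # placeholder LLM call, as above
    verdict, _, critique = reply.partition("\n")
    return verdict.strip().lower() == "success", critique.strip()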
if success:
skill_manager.add_skill(code)
curriculum_agent.add_completed_task(task)
else:
curriculum_agent.add_failed_task(task)
The last step is super important! If the task passed, whatever code was used to pass it is saved to the library. The skill_manager generates a description that is embedded along with the task and the environment state. The skill's name is saved as metadata. Then, the code for the skill itself is saved as a .js file in a directory with all the other skills, where it can be read and called as needed for future tasks.
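Putting that together, add_skill() might look roughly like the sketch below. It reuses the embed() and call_gpt4() placeholders from the earlier sketches, and the paths and field names are illustrative, not Voyager's actual layout.

import json
import pathlib

def add_skill(code: str, task: str, agent_state: str,
              index_path: str = "skills/index.json") -> None:
    # Ask the LLM for a short description of what the generated code does.
    description = call_gpt4(f"Summarize what this Mineflayer function does:\n{code}")
    name = code.split("function ", 1)[1].split("(", 1)[0]  # e.g. "craftBucket"

    # Save the code itself as its own .js file...
    pathlib.Path("skills").mkdir(exist_ok=True)
    pathlib.Path(f"skills/{name}.js").write_text(code)

    # ...and store the embedded description plus metadata for later retrieval.
    entry = {
        "name": name,
        "description": description,
        "embedding": embed(f"{task}\n{agent_state}\n{description}"),
        "code_path": f"skills/{name}.js",
    }
    index_file = pathlib.Path(index_path)
    index = json.loads(index_file.read_text()) if index_file.exists() else []
    index.append(entry)
    index_file.write_text(json.dumps(index))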
Agents and Chains in Voyager
Although many of Voyager's inputs are labeled as "agents," I don't think they (as individual processes) display agentic behavior. I enjoyed Lilian Weng's blog post about autonomous agents and will use her criteria for what an "agent" really is.
Is the Curriculum Agent an agent?
The curriculum agent takes three inputs to create a task:
1. A list of previous successful tasks
2. A list of previous failed tasks
3. Domain knowledge
The previous tasks and their statuses are retrieved by reading a simple JSON file. The domain knowledge was already present in the model. No tools need to be used to fetch any of these inputs, and this is done in a single-shot prompt to GPT-4.
Decision: It’s a chain ⛓️
Is the Action Agent an agent?
The action agent takes many more inputs than the curriculum agent:
1. The current task
2. Code from the previous attempt (if applicable)
3. The environment state
4. Errors from the previous attempt (if applicable)
5. Feedback from the previous attempt (if applicable)
6. Names and code of useful skills
All of these inputs are strings generated by processes that run before the action agent is called. The agent also sends a single prompt to GPT-4 and does not use any tools on its own.
Decision: It’s a chain ⛓️
Are the Skill Manager and Retrieval agents?
A few weeks ago there was a post on the LangChain blog about building a web research agent. The authors ended with the conclusion that, even though they started off anticipating an agent as the end product, they ended up with a novel retrieval chain.
Additionally, even if you want to retrieve from multiple sources and pick the best results, you still don’t need an agent for that. So even though Voyager has one vector database of skills and another database of Minecraft knowledge, agents are not needed to merge and rank results.
So both the skill manager and the retriever can be regular old vector retrievers.
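For example, a plain function can query both stores and merge the results by similarity score, no agent required. The store objects and their search() method here are hypothetical:

def retrieve_context(query: str, skill_store, knowledge_store, k: int = 5) -> list[str]:
    # Each search returns (score, text) pairs, where a higher score means more similar.
    candidates = skill_store.search(query, k) + knowledge_store.search(query, k)
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:k]]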
Decision: They’re both chains ⛓️
Is the Critic Agent an agent?
The critic agent takes these inputs and determines if the task was a success or a failure:
1. The task assigned
2. The agent state
Once more, both of these are either generated with classical programming functions or generated prior to the critic agent being called. If the critic sought out additional information from the environment, it could be classified as an agent. Since all the information is provided to it, it is merely another chain.
Decision: It’s a chain ⛓️
Chains: 5, Agents: 0
However, as a whole I think Voyager itself is an agent. Agents use tools, and other LLM chains can be tools, too!
Copying Voyager
In upcoming blogs, I’ll share my progress with building my own autonomous agent, Taylor the software dev. It will incorporate a lot of the processes from Voyager, and add a few more. Stay tuned!