Ideating Taylor: A Voyager-type LLM Dev

A big reason I've been researching AI agents so much is that I want to try my hand at building a software development agent of my own, named Taylor. The overall goal is for Taylor to be a fully autonomous, AI-powered junior developer who can be a genuine asset on a development team. I want to take this idea as far as I can on my own and use the finished product in future projects to augment my own capabilities.

The best open-domain agent I've found is Voyager, a Minecraft-playing agent based on GPT-4 built by researchers from Nvidia and Princeton University. Voyager has a solid action loop that iteratively completes tasks to achieve long-horizon goals. I did a whole post breaking down Voyager's process here.

Defining the problem

Taylor the junior developer seeks to address these pain points in the software development cycle:

Developers want to build cool things, but “building cool things” involves a lot of non-building tasks:

  • Project managers want 100% test coverage but developers don’t want to slow down the build process to write tests
  • Users want comprehensive documentation and examples but developers don’t want to slow down the build process to document
  • A UX designer wants to focus on making pretty dashboards and a clean interface and not on the boilerplate backend or data pipelines
  • Certain problems like “Add Google OAuth SSO” have well-documented solutions, but developers still have to take time to wrestle them into their project

These tasks can easily take up 30-50% of a project’s time and are not directly related to the cool stuff that makes projects fun. Taylor the junior developer should be able to save developers time by doing these tasks for them.

Defining the constraints

While it's fine to make wish lists, the solution has to reconcile itself with the real world. These are the project constraints:

1. It must output genuinely useful work

There’s no point building a tool if it doesn’t work right. I’m not trying to build an agent capable of discovering a better sorting algorithm, but it can’t output garbage either. “Genuinely useful” means Taylor can do a decent job at:

  1. Writing unit tests
  2. Writing documentation
  3. Initializing projects based on well-documented open-source templates
  4. Applying well-documented solutions to an existing codebase

2. It must follow a GitHub-based CI/CD pipeline

Taylor needs to work well with a real-world development team, and real-world development teams use GitHub. Taylor needs to be able to interface with GitHub by:

  1. Reading and writing GitHub issues
  2. Creating new branches
  3. Committing changes
  4. Creating new pull requests
  5. Responding to feedback

3. It must be Dockerized

I want to avoid the issues that GPT Engineer runs into with new users who struggle to get the program running. The end user should be able to add their secrets to a Dockerfile, pull the image to a container service on AWS, GCP, or their own machine, hit “Go,” and see Taylor get to work.

4. It must be asynchronous

Humans are going to be the slowest part of Taylor’s loop, and I don’t want them to bottleneck progress. If a task is pending completion from GPT-4, StarCoder, or a human, Taylor should move on to the next queued task. Since Taylor will be running on rented cloud machines, I want to make the most efficient use of that time.

The tools

TypeScript and Node.js

TypeScript wins over Python as the base language for this project for a few reasons:

  1. TypeScript does better with asynchronous functions
  2. JSDoc is an easy and industry-standard way to have code comments do double duty: they can act as in-context learning for specific functions, and they can generate external documentation that can be used for longer-horizon planning (a sketch follows this list). JSDoc 3 also supports separate tutorials in markdown.
  3. Node.js backends scale better than Python backends.
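
To make the JSDoc point concrete, here is a rough illustration of the kind of annotated function I have in mind. The endpoint, types, and data layer are all made up for the example; the point is that one comment feeds both in-context learning and generated docs.

```typescript
// Illustrative only: User, NotFoundError, and db are stand-ins for real project code.
interface User {
  id: string;
  name: string;
}

class NotFoundError extends Error {}

const db = {
  users: {
    findById: async (id: string): Promise<User | null> =>
      id === "1" ? { id, name: "Ada" } : null,
  },
};

/**
 * Fetch a single user by ID.
 *
 * Backs the GET /users/:id endpoint. This comment doubles as in-context
 * learning material for Taylor and as a source for generated API docs.
 *
 * @param id - The user's unique identifier.
 * @returns The matching user record.
 * @throws {NotFoundError} If no user exists with the given ID.
 */
export async function getUser(id: string): Promise<User> {
  const user = await db.users.findById(id);
  if (!user) {
    throw new NotFoundError(`No user with id ${id}`);
  }
  return user;
}
```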

Langchain.js!

I’m very familiar with the Langchain framework, but until now I’ve exclusively used the Python version. The JS/TS version is just as capable, and I want to use this project as a learning opportunity to improve my familiarity with it.

Which LLMs?

I want to use OpenAI GPT-4 as the main agent executor. This will be the one that does the initial long-horizon planning, disambiguation, and verification. However, I want to use code-specific LLMs to do the actual code generation, though I’m not sure which ones yet (a rough sketch of this split follows the list). The models I’m considering are:

  1. StarCoder 15B: This model has been getting SOTA results on coding benchmarks and is the best open-source coding LLM available. However, its 15B size means I’ll need at least two GPUs to run it effectively, which can be costly.
  2. Replit Code v1 3B: This is a newer model that, despite its size, shows promising results. It specializes in Markdown, JavaScript, and Python, the exact languages I’ll be testing Taylor on. Its size and speed also give it a lot of promise.
  3. OpenAI GPT-4-32k: This model is the backup plan. GPT-4 is the best in the business right now, and since capability is the #1 criterion, I will not hesitate to use the best model available if the open-source ones disappoint.
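
However the model question shakes out, the division of labor would look roughly like the sketch below. The LLM interface and prompts here are hypothetical placeholders, not a real client binding.

```typescript
// Sketch of the planner/coder split. Anything implementing LLM could sit behind
// either role: GPT-4 as the planner, StarCoder or Replit Code as the coder.
interface LLM {
  complete(prompt: string): Promise<string>;
}

class TaskRunner {
  constructor(
    private planner: LLM, // long-horizon planning, disambiguation, verification
    private coder: LLM,   // the actual code generation
  ) {}

  async run(issue: string): Promise<string> {
    const plan = await this.planner.complete(
      `Break this issue into concrete code-level steps:\n${issue}`,
    );
    const code = await this.coder.complete(
      `Implement the following plan in TypeScript:\n${plan}`,
    );
    // The planner sanity-checks the generated code before it goes any further.
    return this.planner.complete(
      `Does this code satisfy the plan? If not, list the problems.\n${plan}\n${code}`,
    );
  }
}
```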

Taylor Process Diagram

[Figure: Taylor process diagram]

How will it work?

The tests

Let’s talk more about the tests I want to run to evaluate Taylor.

1. Writing Unit Tests

Probably something like taking an existing Node.js API backend with 4 or 5 endpoints, each with JSDoc comments, and writing unit tests for every possible error case. Endpoints should include get, update, insert, and delete functions.
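
For a sense of what I’d expect Taylor to produce, here is the flavor of test I have in mind, assuming Jest and supertest and a hypothetical Express app exported from ../src/app.

```typescript
// users.test.ts — error-case tests for a hypothetical users endpoint.
import request from "supertest";
import { app } from "../src/app";

describe("GET /users/:id", () => {
  it("returns 404 when the user does not exist", async () => {
    await request(app).get("/users/does-not-exist").expect(404);
  });

  it("returns 400 when the id is malformed", async () => {
    await request(app).get("/users/%00").expect(400);
  });
});

describe("DELETE /users/:id", () => {
  it("returns 401 when no auth token is provided", async () => {
    await request(app).delete("/users/123").expect(401);
  });
});
```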

2. Writing Documentation

Take an existing codebase and write JSDoc comments on every class, function, and endpoint. Extra points for writing markdown tutorials for classes and endpoints and putting those tutorials in the appropriate folder on GitHub.

3. Initializing projects based on well-documented open-source templates

Modify a Next.js template for a certain use case or domain. It would be nice to specify which template to start from and then list the individual changes needed.

4. Applying well-documented solutions to an existing codebase

Something like “Add Google OAuth SSO to my website,” “make the main body a flex box,” or “add a stock ticker to the top of the page.”

Agents and chains

While it is safe to say that Taylor the junior dev as a whole will be an agent, will its subprocesses be agents as well?

Retrieval

A few weeks ago there was a blog post on the Langchain blog about building a web research agent. The authors ended with the conclusion that, even though they started off anticipating an agent as the end product, they ended up with a novel retrieval chain.

Additionally, even if you want to retrieve from multiple sources and pick the best results, you still don’t need an agent for that. So we could have one database of vectorized express.js docs, and another web researcher, and then use a combination of the two of them in the final product.

So the skill manager will be a regular old vector retriever, and the context will come from one of the new web research retrievers, because a SERP API key is probably cheaper than embedding the entire documentation set for every open-source library the project uses.
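
In code, the combination could be as simple as the sketch below. The Retriever interface is a placeholder rather than any framework’s actual API; the real version would wrap whatever LangChain.js exposes.

```typescript
// Sketch: merge results from a local vector retriever (e.g. vectorized
// express.js docs) and a web research retriever backed by a search API.
interface Retriever {
  getRelevant(query: string, k: number): Promise<string[]>;
}

async function gatherContext(
  query: string,
  docsRetriever: Retriever,
  webRetriever: Retriever,
): Promise<string[]> {
  const [docs, web] = await Promise.all([
    docsRetriever.getRelevant(query, 4),
    webRetriever.getRelevant(query, 4),
  ]);
  // Naive concatenation; a reranking step could slot in here later.
  return [...docs, ...web];
}
```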

Agent State and Environment Feedback

The Voyager bot used bot.chat to keep a running log of intermediate states. Taylor, however, has an actual log to use. The challenge comes from the asynchronous nature of Taylor’s tasks: fetching the agent state and intermediate steps is not as simple as reading a text file or looking at a pre-fetched file directory tree. We need to avoid the following assumptions:

  • The most recent logs are relevant to the current task
  • The GitHub repository has stayed the same since the last task attempt
  • The project dependencies, globals, and secrets have stayed the same since the last task attempt

So Taylor’s state-gathering has to be more robust than Voyager’s. Here are a few ideas:

  • Logs: logs can have the task ID prepended to the message so that we can filter them easily.
    • We could also try scoping the logger itself: much like Botpress has logger.forBot(), we can have logger.forTask() (sketched after this list).
  • State: The solution depends on how we interact with GitHub. If Taylor builds their own venv and has the GitHub repo cloned “locally,” we would just need to start each task step by pulling or syncing with that issue/task’s specific branch. However, if we are using the API, we would need to call that branch and get the entire directory tree.
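
A minimal sketch of that task-scoped logger, loosely modeled on the Botpress pattern (the API shape here is my own invention):

```typescript
type LogFn = (message: string) => void;

interface TaskLogger {
  info: LogFn;
  error: LogFn;
}

function createLogger(write: LogFn = console.log) {
  return {
    forTask(taskId: string): TaskLogger {
      const prefix = `[task:${taskId}]`;
      return {
        info: (message) => write(`${prefix} ${message}`),
        error: (message) => write(`${prefix} ERROR ${message}`),
      };
    },
  };
}

// Every line carries the task ID, so filtering the log per task is trivial.
const logger = createLogger().forTask("issue-42");
logger.info("pulled branch issue-42 into a fresh working directory");
```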

Validation Layers

Voyager has one validation layer; Taylor will have three.

Self validation

Like Voyager, Taylor will self-assess if the code it has written is valid and fulfills the task requirements.

GitHub Actions and unit tests

Depending on the project, there might be existing GitHub Actions that run on PRs to validate their code. Websites, for example, might have the Vercel app creating preview deployments. If these actions fail, the task is considered unsuccessful, and the failure message should be passed to Taylor for the next iteration.

Taylor will check linting with typescript-eslint. Not only will this enforce clean code, but it will also check parameter and return types, enforce a uniform style, and ensure that code is properly modular and reusable.

If unit tests exist on a repo, an action should be enabled to run the suite of tests before allowing a PR to merge. If no unit tests exist, Taylor should spend time writing them.
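
One plausible way to pull those failures back into the loop is to read the check runs for the PR’s head commit through the GitHub API. Here is a sketch, assuming the @octokit/rest client and a GITHUB_TOKEN in the environment:

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Returns the name and summary of every failed check so the failure text can
// be handed to Taylor as feedback for the next iteration.
async function getFailedChecks(owner: string, repo: string, ref: string) {
  const { data } = await octokit.rest.checks.listForRef({ owner, repo, ref });
  return data.check_runs
    .filter((run) => run.conclusion === "failure")
    .map((run) => ({ name: run.name, summary: run.output.summary ?? "" }));
}
```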

Human Feedback

The last layer of validation is human feedback. After a branch has passed linting and other automatic checks, its PR is changed from a draft to “ready for review” and assigned to a human. There should be a configurable list of humans from whom Taylor can solicit reviews, perhaps even configurable by PR/issue tag so that humans from team A review PRs from repo/library A, and so on.
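
That configuration could be as simple as a label-to-reviewers map; the labels and usernames below are placeholders.

```typescript
const reviewersByLabel: Record<string, string[]> = {
  frontend: ["alice", "bob"],
  "data-pipeline": ["carol"],
  default: ["dave"],
};

// Pick a reviewer from the first label that has candidates, falling back to the
// default pool. A round-robin or load-based policy could replace the random pick.
function pickReviewer(labels: string[]): string {
  for (const label of labels) {
    const candidates = reviewersByLabel[label];
    if (candidates?.length) {
      return candidates[Math.floor(Math.random() * candidates.length)];
    }
  }
  return reviewersByLabel["default"][0];
}
```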

If the review is accepted and merged, the task is marked as successful, the skills saved, and Taylor moves on. If the review fails, Taylor must try at least one more iteration using the human’s feedback.

Skills and Primitives

The Voyager team outfitted their agent with a set of “primitive skills” like mining, placing blocks, attacking mobs, and getting things to or from chests. Taylor will also have primitive skills, but instead of Minecraft-specific tasks, they will be CRUD operations on the working directory.

I think having a venv for each branch and using the GitHub CLI will be the best approach for how Taylor interacts with the codebase. I can use manually designed middleware to translate Taylor’s LLM outputs into terminal commands (a sketch of this middleware follows the list below), and then execute those commands to enact changes on the working directory. Additionally, the GitHub CLI will be used at manually set points throughout the action loop to ensure a consistent codebase:

  1. When an issue has been accepted, I will use the GitHub CLI to make a new branch in the repo and clone that branch to a new venv.
  2. After the branch has been created, I will use the GitHub CLI to create a draft pull request and link it to the issue.
  3. When a task has been successful, I will use the GitHub CLI to commit the code from the working venv to the branch.
  4. If there are any errors in the automatic checks, the GitHub CLI will return the errors and their feedback. I will have to catch these errors asynchronously and feed them back into the loop.
  5. When an issue is deemed successfully addressed, I will use the GitHub CLI to mark the pull request as ready for review and tag the appropriate reviewer.
  6. When a published PR has an update, I will use the GitHub CLI to catch the update and feed it back into the action loop along with any feedback.
  7. When a PR is merged, I will use the GitHub CLI to mark the issue as resolved and delete the branch.
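
Here is a rough sketch of what that middleware might look like: a small set of allowed actions that map to git and GitHub CLI invocations, so Taylor’s outputs never become raw shell. The branch names and titles are illustrative.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Step 1 from the list above: new branch for an accepted issue.
async function createBranch(branch: string, workdir: string) {
  await run("git", ["checkout", "-b", branch], { cwd: workdir });
}

// Step 2: open a draft PR, linking back to the issue via its body text.
async function openDraftPr(title: string, body: string, workdir: string) {
  await run("gh", ["pr", "create", "--draft", "--title", title, "--body", body], {
    cwd: workdir,
  });
}

// Step 5: flip the draft to "ready for review" once checks pass.
async function markReadyForReview(workdir: string) {
  await run("gh", ["pr", "ready"], { cwd: workdir });
}
```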

Asynchronous events

There are a lot of asynchronous events that need to be handled in Taylor’s loop (a rough event-bus sketch follows the list):

  • New branch created and cloned into a venv
  • New message on an issue
  • Automatic checks started
  • Automatic checks completed
  • New message on a PR
  • PR status updated
  • Prompt sent to an LLM
  • Data received from an LLM
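
A minimal sketch of how those events could flow through one bus, using Node’s built-in EventEmitter (the event names and payloads are my own placeholders):

```typescript
import { EventEmitter } from "node:events";

const bus = new EventEmitter();

// A failed check re-queues that task with the failure text as feedback,
// while every other task in the loop keeps moving.
bus.on("checks.completed", (taskId: string, passed: boolean, details: string) => {
  if (!passed) {
    console.log(`re-queueing ${taskId} with feedback: ${details}`);
  }
});

bus.on("pr.comment", (taskId: string, body: string) => {
  console.log(`new reviewer feedback on ${taskId}: ${body}`);
});

// Whatever polls GitHub or receives webhooks only has to emit:
bus.emit("checks.completed", "issue-42", false, "eslint: 3 errors");
```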

Warm-up period

The Nvidia team found that Voyager benefited from a warm-up period where the first 15 tasks had reduced information about the environment. Taylor might also benefit from the same kind of warm-up, though it will need to look different.

Perhaps instead of solving issues for its warm-up, Taylor could begin on documentation and unit tests? Not only does this get a jump start on genuinely helpful work, but it gives Taylor an opportunity to systematically go through the codebase and save classes, functions, and directory structures to its skills library.

Any unit tests generated during this period can also be incorporated into future validation stages. I especially like this, since it lays a foundation of ensuring safe and well-documented code before any features are added or changed.

Conclusion

Voyager demonstrated remarkable results doing something LLMs traditionally haven’t done very well. Since its release, EleutherAI has released Minetester as a way to further test LLM agent performance in Minecraft-like environments. In the future, I hope to apply some of the lessons learned from Voyager to a coding assistant that completes some cool real-world tasks.


Enjoyed that one? How about signing up for my mailing list below? You can also find another article to read in ./posts

 
