
Harnessing LLMs with TDD

Reflections on a weekend project with an AI co-programmer

Almost four years into Ariga, I don’t get to write code as much as I’d like, but occasionally I encounter some interesting problem that seems small enough to tackle in a short amount of time and I dive in.

This was the case with an issue that came up from some of our lovely Atlas customers. One of the great things about Atlas is its ability to inspect a database and automatically generate a schema file that can be used to manage the database schema in a declarative way. The issue many users were facing was that real-world databases can contain thousands of objects, and the schema file was becoming too large to manage effectively.

Atlas uses HCL as the default data definition language (DDL) for defining schemas, so I knew it should be pretty straightforward to write a tool that would parse the schema file and split it into multiple files based on some criteria.
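Roughly, the shape of the tool I had in mind looks like this. This is a minimal sketch using the hclwrite package from HCL’s Go library; the file names and the choice to group by block type are just illustrative assumptions, not the actual splt implementation:

```go
package main

import (
	"log"
	"os"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclwrite"
)

func main() {
	// Read and parse the monolithic schema file.
	src, err := os.ReadFile("schema.hcl")
	if err != nil {
		log.Fatal(err)
	}
	f, diags := hclwrite.ParseConfig(src, "schema.hcl", hcl.InitialPos)
	if diags.HasErrors() {
		log.Fatal(diags)
	}
	// Group top-level blocks (e.g. "schema", "table") by their type.
	// Splitting by block type is just one possible criterion.
	groups := map[string][]*hclwrite.Block{}
	for _, blk := range f.Body().Blocks() {
		groups[blk.Type()] = append(groups[blk.Type()], blk)
	}
	// Write each group to its own file; hclwrite serializes the blocks
	// from their original tokens, so their formatting is preserved.
	for typ, blocks := range groups {
		out := hclwrite.NewEmptyFile()
		for _, blk := range blocks {
			out.Body().AppendBlock(blk)
			out.Body().AppendNewline()
		}
		if err := os.WriteFile(typ+"s.hcl", out.Bytes(), 0600); err != nil {
			log.Fatal(err)
		}
	}
}
```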

Using LLMs to write code?

Putting the buzzwords and loads of BS aside, even to a contrarian cynic like me, recent advancements in AI in general, and in LLMs (Large Language Models) specifically, are truly amazing.

As an individual contributor, I am constantly using tools like GitHub Copilot or OpenAI’s ChatGPT to help me write code snippets, improve documentation, and come up with ideas for copy.

However, as a CTO, I am still very skeptical about their maturity and constantly question their value in creating production-grade code.

Despite my best efforts in prompting the AI to generate code that is readable, maintainable, and efficient, I often find that the generated code is not up to the standards I would expect from my team, and I end up rewriting it.

My Strategy: Test-driven Brute-force-based LLM-assisted Development (T.B.L.D.)

For this project, I decided to try a different approach to using LLMs to write code. It would work like this:

  • Write tests to define the expected behavior of the application.
  • Use an LLM to generate code to pass the tests.
  • Feed any compilation error or test failure back into the LLM to generate new code.
  • Repeat until all tests pass.
  • Rejoice!

The idea here is to take the old TDD approach of defining the desired behavior of the application with tests, and then use the LLM to generate code that passes them.

Recognizing that the LLM-generated code is still far from perfect, I would feed the test and compilation errors back into the LLM to generate new code, iterating until the code compiles and passes all tests.
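In practice, my “loop” was me copy-pasting between my editor, my terminal, and a chat window, but if you squint, the process looks roughly like the sketch below. Here, askLLM is a hypothetical stand-in for whatever model client you use, and split.go is an assumed file name:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// askLLM is a hypothetical stand-in for whatever model client you use.
// It receives the current failure output and returns a new candidate
// implementation of the file under test.
func askLLM(prompt string) (string, error) {
	// ... call Claude / GPT / your model of choice here ...
	return "", fmt.Errorf("not implemented")
}

func main() {
	const maxAttempts = 10
	for i := 0; i < maxAttempts; i++ {
		// Run the test suite and capture compiler errors and test failures alike.
		out, err := exec.Command("go", "test", "./...").CombinedOutput()
		if err == nil {
			fmt.Println("all tests pass, rejoice!")
			return
		}
		// Feed the real-world feedback back into the model.
		code, err := askLLM("These tests are failing:\n" + string(out) + "\nPlease provide a fixed split.go.")
		if err != nil {
			fmt.Fprintln(os.Stderr, "LLM call failed:", err)
			return
		}
		// Overwrite the implementation with the model's new attempt and try again.
		if err := os.WriteFile("split.go", []byte(code), 0600); err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
	}
	fmt.Println("giving up after", maxAttempts, "attempts")
}
```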

I dub this approach Test-driven Brute-force-based LLM-assisted Development (T.B.L.D.).

The Results

While the problem at hand was not super-complicated, it was not trivial either. HCL has great support in Go, but as an infra library used mostly by Terraform-internals devs and other HashiCorp expert-level users, it’s not the most straightforward thing to work with.

Figuring out exactly how to split the schema into files while preserving the original schema’s structure involves subtleties that take some research and experimentation to get right.

All of this is to say that it’s pretty impressive that I was able to get a working solution fairly quickly while chatting with Claude 3.5, my AI pair-programmer!

You can find the final code in the rotemtam/splt repository on GitHub.

Our session would go something like:


Me: Hey Claude, here’s a test that I’m running:

# The test case

The test is failing, here’s the output:

# The full test output

Claude: “I understand the issue. Let me explain the problem: The current implementation is not correctly preserving the structure and formatting of the original HCL file when writing the split files. Specifically, there are two main issues:

  1. Nested blocks are not being properly reconstructed. In the original file, there’s a nested block extra inside the new_block, but it’s not being written to the output file.
  2. The original formatting, including newlines and indentation, is not being preserved. This is why there’s an extra newline after the schema block in the original file that’s missing in the output.

The root cause of these issues is in the writeFile function. Currently, it’s trying to manually reconstruct the HCL structure, but it’s not doing so completely or accurately. It’s only handling top-level attributes and one level of nested blocks, but not deeper nesting or formatting details.”

Me: “Thanks for the explanation. Can you suggest a fix?”

Claude: Here’s some code:

# The code snippet

Me: There’s an error. “Unresolved reference SrcTokens”

Claude: “I apologize for the confusion. You’re right to point that out. It seems I made an incorrect assumption about the availability of SrcTokens. Let’s correct this and use a different approach that should work with the available API.

Here’s a revised version:”

# The revised code snippet

And so on, until we had a working version.
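For illustration, here is a simplified sketch of the general idea (not the code from the session): a block parsed by hclwrite already carries its original token stream, so it can be serialized back out verbatim instead of being hand-reconstructed attribute by attribute. There is no SrcTokens field, but BuildTokens gets you the tokens:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclwrite"
)

func main() {
	src := []byte(`
schema "public" {}

new_block "example" {
  attr = 1
  extra {
    nested = true
  }
}
`)
	f, diags := hclwrite.ParseConfig(src, "in.hcl", hcl.InitialPos)
	if diags.HasErrors() {
		log.Fatal(diags)
	}
	// BuildTokens reconstructs each block's token stream, with nested
	// blocks, newlines, and indentation intact.
	for _, blk := range f.Body().Blocks() {
		fmt.Print(string(blk.BuildTokens(nil).Bytes()))
	}
}
```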

Caveats and takeaways

With a structured feedback loop - LLMs can generate working code

Anyone who has used LLMs to generate code in the past months knows that lots of it is garbage. However, as many in the industry are currently discovering, if we combine LLMs with “agentic” workflows that provide “real world” feedback (e.g., compilation errors, test failures) and iterate a few times, we can get to a working solution.

Statically typed languages have a strong advantage

Whenever I can, I use Go for my projects, including the entirety of Ariga’s codebase and, of course, this project.

What I found is that the very thing that makes Go lovely to work with for humans - its strong typing - also makes it an ideal target language for LLMs. The compilation failures from Go’s type system provide a very clear signal to the LLM about API hallucinations and other issues.

Code-quality is still a human thing

Claude was able to help me generate code that passed the tests, with some intervention from me to guide the process. Some edge cases (like handling deterministic ordering of attributes in nested blocks) required me to roll up my sleeves and write a few blocks of code myself, but overall, I was impressed with the model’s ability to generate code that was close to what I needed.
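To give one concrete example of the kind of thing I had to write myself (a hypothetical helper, not verbatim from splt): hclwrite exposes a body’s attributes as a Go map, which has no stable iteration order, so producing deterministic output means sorting the names before emitting them.

```go
package split

import (
	"sort"

	"github.com/hashicorp/hcl/v2/hclwrite"
)

// sortedAttributeNames returns a body's attribute names in a stable order.
// hclwrite hands them back as a map, so iteration order is otherwise random.
func sortedAttributeNames(body *hclwrite.Body) []string {
	attrs := body.Attributes()
	names := make([]string, 0, len(attrs))
	for name := range attrs {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```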

Having said that, good code is about much more than just passing tests. It’s about readability, maintainability, and performance. While the code generated by Claude was functional, it was pretty ugly and not something I would want to put my name on. I ended up spending a good amount of time refactoring the code to make it more readable and maintainable.

It’s all about the feedback loop

My main takeaway from this project is that one of my key contributions was defining the tests and constructing the feedback loop that allowed me to iterate on the code generation process.

I know that some very cool companies in our industry are building solutions to generate full test suites using LLMs. I am far from an expert on the subject, and I’m sure there is a lot of nuance to the topic, but I am left with the feeling that perhaps defining the verification process should be left to us humans.

Conclusion

RoboCode. Generated by DALLE

I had a great time working on this project, and I was very thrilled to see that with a structured feedback loop, I was able to get a working solution fairly quickly. There are so many important problems in the world that can be solved with software, and if we figure out how to leverage LLMs to help us write it faster, we can make a real impact.

On the other hand, after the initial excitement of seeing tests pass and the code compile, I still spent an additional couple of days refactoring the code to make it something I would be willing to put my name on.

Personally, I don’t see serious software engineering being taken over by the robots any time soon, but as I told my co-founder, Ariel, after I showed him this project, it sure does make me feel a little bit like RoboCop.

(or should I say RoboCode?)