
Yes, and better still, the AI will fix its mistakes if it has access to verification tools directly. You can also have it write and execute tests, and then on failure, decide if the code it wrote or the tests it wrote are wrong, and while there is a chance of confirmation bias, it often works well enough
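Roughly, that loop can look something like this (a minimal sketch only; ask_llm is a placeholder for whatever model call you use, and the file names are made up):

    # Sketch of the write/run/judge loop. ask_llm() is a placeholder for
    # whatever LLM API you use; run_tests() just shells out to pytest.
    import subprocess
    from pathlib import Path

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your model call here")

    def run_tests() -> tuple[bool, str]:
        # Run pytest and hand the full log back to the model on failure.
        r = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return r.returncode == 0, r.stdout + r.stderr

    def write_and_verify(spec: str, rounds: int = 5) -> bool:
        code = ask_llm(f"Implement this spec:\n{spec}")
        tests = ask_llm(f"Write pytest tests for this spec:\n{spec}")
        for _ in range(rounds):
            Path("impl.py").write_text(code)
            Path("test_impl.py").write_text(tests)
            ok, log = run_tests()
            if ok:
                return True
            # Let the model decide which side is wrong, then regenerate only that side.
            verdict = ask_llm(f"Spec:\n{spec}\nFailure log:\n{log}\nAnswer CODE or TEST.")
            if "TEST" in verdict.upper():
                tests = ask_llm(f"Rewrite the failing tests.\nSpec:\n{spec}\nLog:\n{log}")
            else:
                code = ask_llm(f"Fix the implementation.\nSpec:\n{spec}\nLog:\n{log}")
        return False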




> decide if the code it wrote or the tests it wrote are wrong

Personally I think it's too early for this. Either you need to strictly control the code, or you need to strictly control the tests; if you let AI do both, it'll take shortcuts, and misunderstandings will propagate and solidify much more easily.

Personally I chose to tightly control the tests, as most tests LLMs tend to create are utter shit, and it's very obvious. You can prompt against this, but eventually they find a hole in your reasoning and figure out a way of making the tests pass while not actually exercising the code the tests are supposed to exercise.
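A contrived example of the kind of hole I mean: a test that "passes" without ever exercising the real code, because it mocks out the very thing under test (invoices is a made-up module name):

    # Contrived example of the failure mode: the test mocks out the very
    # function it is supposed to exercise, so it passes no matter what
    # parse_invoice() actually does. "invoices" is a hypothetical module.
    from unittest.mock import patch
    import invoices

    def test_parse_invoice_returns_total():
        with patch.object(invoices, "parse_invoice", return_value={"total": 100}):
            assert invoices.parse_invoice("fake.pdf")["total"] == 100  # only tests the mock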


I haven’t found that to be the case in practice. There is a limit on how big the code can be for this to work, and it still can’t reliably subdivide problems on its own (yet?), but give it a module that is small enough and it can write both the code and the tests for it.

You should never let the LLM look at the code when writing tests, so you need to have it figure out the interface ahead of time. Ideally, you wouldn’t let it look at the tests when it was writing code either, but it needs to be able to tell which one is wrong. I haven’t been able to add an investigator into my workflow yet, so I’m just letting the code writer run and evaluate test correctness (but adding an investigator to do this instead would avoid confirmation bias, what you call it finding a loophole).
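As a rough sketch of what "tests only see the interface" looks like (the names are hypothetical, not from my actual pipeline):

    # Sketch of interface-first testing: the test author (human or LLM) only
    # sees this stub signature plus the spec, never the implementation.
    # RateLimiter and its behavior here are hypothetical.
    from typing import Protocol

    class RateLimiter(Protocol):
        def allow(self, key: str) -> bool: ...

    def check_burst_is_capped(limiter: RateLimiter) -> None:
        # Behavioral expectation derived from the spec + interface, not the code:
        # after 10 allowed requests for one key, the 11th must be rejected.
        results = [limiter.allow("user-1") for _ in range(11)]
        assert sum(results) <= 10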


> I haven’t found that to be the case in practice.

Do you have any public test code you could share? Or even create some; it should be fast.

I'm asking because I hear this constantly from people, and since most people don't hold their test code to as high a standard as the rest of their code, it tends to be a half-truth: when you actually take a look at the tests, they're as messy and incorrect as you (I?) would think.

I'd love to be proven wrong though, because writing good tests is hard, which is why I'm currently doing that part myself and not letting the LLM come up with the tests by itself.


I'm doing all my work at Google, so it's not like I can share it easily. Also, since GeminiCLI doesn't support sub-agents yet, I've had to get creative with how I implement my pipelines. The biggest challenge I've found ATM is controlling conversation context so you can control what the AI is looking at when you do things (e.g. not looking at code when writing tests!). I hope I can release what I'm doing eventually, although it isn't a key piece of AI tech (just a way to orchestrate the pipeline to make sure the AI gets different context for different pipeline steps; it might be obsolete once we get better support for orchestrating dev work in GeminiCLI or other dev-oriented AI front ends).

The tests can definitely be incorrect, and often are. You have to tell the AI to consider that the tests might be wrong, not the implementation, and it will generally take a closer look at things. They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.
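For a sense of scale, this is roughly what I mean (slugify and mymodule are made-up names, not real code of mine):

    # The kind of tiny, almost-too-obvious unit tests I mean. slugify() is a
    # hypothetical function from the module the AI is implementing.
    from mymodule import slugify  # hypothetical module under construction

    def test_slugify_lowercases():
        assert slugify("Hello World") == "hello-world"

    def test_slugify_handles_empty_string():
        assert slugify("") == ""

    def test_slugify_collapses_whitespace():
        assert slugify("spaced   out") == "spaced-out"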


> They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.

Yeah, those for me are all not "good tests"; you don't want them in your codebase if you're aiming for a long-term project. Every single test has to make sense and be needed to confirm something, and should give a clear signal when it fails, otherwise you end up locking your entire codebase to things, because knowing which tests are actually needed becomes a mess.

Writing the tests yourself and letting the AI write the implementation leaves you with code where you know what it does and can confidently say what works vs. what doesn't. When the AI ends up writing the tests, you often don't actually know what works or not; even scanning the test titles often doesn't tell you anything useful. How is one supposed to be able to guarantee any sort of quality like that?


If it clarifies anything, here is my workflow (each step is a separate prompt without preserved conversation context; there's a rough sketch of the shape below):

1. Create a test plan for N tests from the description. Note that this step doesn't provide specific data or logic for the tests; it just plans out, vaguely, N tests that don't overlap too much.

2. Create an interface from the description.

3. Create an implementation strategy from the description.

4.N Create the N tests, one at a time, from the test plan + interface (make sure the tests compile). Note that each test is created in its own prompt without conversation context.

5. Create the code using the interface + implementation strategy + general knowledge, using the N tests to validate it. Give feedback to step 4.I if test I fails and the AI decides it is the test's fault.

If anything changes in the description, the test plan is fixed, the tests are fixed, and that just propagates up to the code. You don't look at the tests unless you reach a situation where the AI can't fix the code or the tests (and you really need to help out).
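The shape of it, very roughly (this is only a sketch, not the orchestration I actually run; prompt() stands in for a single-shot model call with no conversation history):

    # Very rough sketch of the pipeline shape. prompt() stands in for a
    # single-shot model call with a fresh context each time.
    def prompt(instruction: str, *context: str) -> str:
        raise NotImplementedError("plug in a single-shot model call, no history")

    def pipeline(description: str, n: int = 5) -> str:
        # Step 1: plan N vague, non-overlapping tests (no data or logic yet).
        test_plan = prompt(f"Plan {n} non-overlapping tests, no specific data or logic.",
                           description)
        # Step 2: design the interface from the description alone.
        interface = prompt("Design the public interface.", description)
        # Step 3: outline an implementation strategy from the description alone.
        strategy = prompt("Outline an implementation strategy.", description)
        # Step 4.N: each test is its own prompt; note that no code is in context.
        tests = [prompt(f"Write test #{i} from the plan; it must compile.",
                        test_plan, interface)
                 for i in range(1, n + 1)]
        # Step 5: write the code against the interface + strategy, validated by the
        # tests. (The feedback loop back to step 4.I on a suspected bad test is
        # omitted here.)
        code = prompt("Implement the interface using the strategy; make the tests pass.",
                      interface, strategy, *tests)
        return code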

This isn't really your quality pass, it's a crap-filter pass (the code should work in the sense that a programmer wrote something they think works, but you can't really call it "tested" yet). Maybe you think I was claiming this is all the testing you'll need? No, you still need real tests as well as these small tests...


Testing is fun, but getting all the scaffolding in place to get to the fun part and do any testing suuuuucks. So let the LLM write the annoying parts (mocks. so many mocks.) while you do the fun part


