A couple of years ago, I wrote about my preferred type of software testing: surface testing (https://federicopereiro.com/surface-testing/). A summary of the approach is that 1) you test a system only through its exposed “surfaces” (APIs, UIs, library functions); 2) you run the tests against the real codebase with zero mocks, in a meaningful linear order, and stop at the first error.
Now, LLM coding agents have come along and changed everything. You can produce code almost as fast as you can think of what it should do. At the same time, AI is still great at producing software of questionable quality. The massive increase in quantity and significant decrease in quality make testing all the more important.
In two separate projects of very different natures (FuelFWD, a mid-size SaaS worked on by a team, and vibey, a new solo open source project), I’ve recently tackled writing a surface test suite from scratch. These are my learnings:
- First, write thorough documentation of the entire surface of the app. What are the main entities, and what are they for? Describe their data at rest. Then describe each endpoint in detail: what it receives and what it responds with.
- Split the tests into modules that have internal coherence. A good splitting boundary is a DB entity. A module would then be all the endpoints that directly read or modify that entity.
- Having documentation for the tests is a game changer. Make it a linear list of tests. You can be as thorough as you want, going all the way to covering every possible branch of your code.
- Once the documentation is there, implement the server tests.
- You will probably find a lot of bugs; most of them are unimportant (think abstruse validations on nested structures), but a few matter.
- Put the fast tests first and make the slow tests of every suite run at the end. The stop-and-go nature of slow tests in the middle of the suite is a dopamine destroyer. This is an important point: fast tests at the beginning let you catch errors earlier and run the suite much more often.
- Once the server is correct and tested, move on to client tests. Don’t re-test what has already been tested on the server. Instead, test that the client is making proper use of the server. This is also the place to make assertions about UX state, which should be considered a surface too.
- Put this sequence in pride of place in the docs, for when new code comes along: 1) Is the spec documentation updated in full detail? 2) Is the test documentation fully updated, covering all new/modified/deleted cases? 3) Are the tests 1:1 aligned with their documentation, and passing?
You can see an example here, in vibey:
- Spec: https://github.com/altocodenl/vibey?tab=readme-ov-file#spec
- Documentation of the tests: https://github.com/altocodenl/vibey?tab=readme-ov-file#test-suites
- Server tests: https://github.com/altocodenl/vibey/blob/main/test-server.js
- Client tests: https://github.com/altocodenl/vibey/blob/main/test-client.js
Note: the vibey suite is not modularized into files. At FuelFWD, because the project is larger, we did split the test modules into separate files, which also allows multiple agents to work in parallel within the same branch.
I feel ambivalent about not having written these tests myself. There are still unnecessary and badly named variables everywhere. But that’s how I feel about coding with agents at high speed in general. I still need to explore how to use agents to make the code elegant.
This approach is now enabling me to use AI to produce software that works decently well. Hope it’s useful to you.