Quality Engineering with AI
Artificial Intelligence is not going to go away; engineering teams will be using it to develop code and products faster. This means that as quality professionals we need to be prepared to offer guardrails and ideas for how to test in this new landscape. Whilst this is still a new technology and way of working, we have to get ahead of the messaging and offer advice now.
Get used to working with AI
It’s not enough to be an armchair critic who warns against risks or simply tells people not to use AI. Engineering teams are going to use it, and we as quality professionals have to get on board with that. It’s neither pragmatic nor a demonstration of value to keep pushing back on the use of this technology.
It’s worth noting that different contexts will need wildly different testing alongside AI. Your context may need very different testing to the contexts I’ve worked in, but hopefully some of these thoughts will also help you.
How do I write tests when AI code keeps changing quickly?
Problem space: The outputs of AI and LLMs are non-deterministic, meaning the generated code will change all the time, making it hard for traditional tests to keep up.
Whilst the code will change frequently (it’ll be practically ephemeral given the amount of AI refactoring), the product and its features should remain more static. Testing should focus on this layer by inverting the traditional testing pyramid to have more E2E and acceptance tests. Doing this allows us to test what should remain static (behaviour) even when the code logic changes frequently.
Using outside-in BDD to test from the desired behaviours of the product is the way to go. This level of testing will need to be more comprehensive, including error and edge cases, to ensure that AI code changes do not regress behaviour. This will be more effective than unit tests, even if it is slightly slower and less efficient.
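As a minimal sketch of what this looks like in practice, here is an acceptance-style test written against behaviour rather than implementation. The `apply_discount` function and the `SAVE10` code are hypothetical stand-ins for a real product feature, and the Given/When/Then comments mirror the BDD structure:

```python
# Acceptance-style tests assert product behaviour, not code structure,
# so they survive the AI rewriting the implementation underneath.

def apply_discount(total: float, code: str) -> float:
    """Hypothetical feature under test; AI may rewrite this freely."""
    if code == "SAVE10":
        return round(total * 0.90, 2)
    return total

def test_valid_code_reduces_total():
    # Given a basket totalling 100.00
    # When the customer applies SAVE10
    # Then they pay 90.00
    assert apply_discount(100.00, "SAVE10") == 90.00

def test_unknown_code_leaves_total_unchanged():
    # Given a basket totalling 100.00
    # When the customer applies an unrecognised code
    # Then the total does not change (edge case coverage)
    assert apply_discount(100.00, "BOGUS") == 100.00
```

Notice the tests only exercise the public contract; whether the implementation uses an `if`, a lookup table, or something else the AI invents tomorrow is irrelevant to them.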
Make sure these tests are kept away from the AI / LLM to avoid them being refactored alongside the code changes being made. They need to be kept as static as possible, in line with keeping product features consistent rather than constantly changing.
What should I test when building features with AI?
Problem space: Things are changing fast in engineering; teams need help navigating what good-enough testing and quality look like whilst using this new technology and optimising for speed.
Especially whilst initially adopting AI in a software delivery pipeline, you need the flexibility to run experiments without the constraints of process. Teams need to look at the risks they really face and solve for those as a minimum. In most contexts these risks will be very similar:
Delivering the wrong thing
Development at pace, non-determinism and fire-and-forget development practices mean a higher risk of delivering the wrong thing, or something that’s just broken. Make sure requirements are well scoped and clear, then use acceptance testing against them to confirm the right thing was built.
Recommendations: Three Amigos, BDD, Acceptance tests.
Changes to code breaking existing behaviour
AI has a strong likelihood of creeping out and changing large amounts of your codebase unexpectedly. Make sure key product behaviour is covered by regression tests that can’t be rewritten as part of AI development spreading out. As mentioned before, these may be best kept as longer-lived acceptance and E2E tests rather than many unit tests.
Another way to handle this is to test in production and to put monitoring and alerting in place to identify workflow failures early. For this to be effective the team should have the ability to roll back or hot fix issues quickly to restore availability of features.
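As an illustrative sketch of the monitoring side (the 5% threshold and the shape of the data are assumptions for illustration, not recommendations), a production error-rate check that flags when a workflow starts failing might look like:

```python
# Minimal error-rate alert: flag a workflow when recent failures exceed
# a threshold, prompting the team to roll back or hot fix quickly.

def should_alert(outcomes: list[bool], threshold: float = 0.05) -> bool:
    """outcomes: True = request succeeded, False = failed."""
    if not outcomes:
        return False
    failure_rate = outcomes.count(False) / len(outcomes)
    return failure_rate > threshold

# 3 failures in the last 20 requests is a 15% failure rate -> alert
recent = [True] * 17 + [False] * 3
```

In a real system this check would run over a sliding window in your observability tooling rather than an in-memory list, but the decision logic is the same.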
Recommendations: Risk based E2E tests, Observability.
Data leaks or security issues
Security should be a non-negotiable. Poor development can cause data leaks or routes to access data that users shouldn’t (back doors). Teams need to keep on top of their vulnerabilities and data, either by maintaining more human-in-the-loop for data management or using tools that help spot and remediate potential security issues.
Recommendations: Security scans, Penetration testing, Reviews.
AI code creating maintainability issues
Inconsistently applied engineering practices, non-deterministically written code and massively verbose or inelegant AI output will build up into a maintainability nightmare. This will make it harder to debug, change and understand the product and codebase at speed and at scale. Keep on top of this to avoid getting stuck later down the line.
Recommendations: Code complexity / Cyclomatic complexity reviews, Human reviews (spot checks).
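One lightweight way to keep an eye on this is to count branch points and fail a check when a function grows too complex. This is a rough sketch using Python’s standard `ast` module, not a substitute for a proper tool; the node list and the limit of 10 are assumptions:

```python
import ast

# Nodes that add a decision point; a rough proxy for cyclomatic complexity.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def rough_complexity(source: str) -> int:
    """Roughly 1 + the number of branch points in the source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

def complexity_gate(source: str, limit: int = 10) -> bool:
    """Spot-check gate: True when AI-generated code stays under the limit."""
    return rough_complexity(source) <= limit
```

Running a gate like this in CI turns "keep on top of maintainability" from a vague intent into a concrete, automatable line the AI output cannot quietly cross.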
The key thing here is to build a test approach that is pragmatic and solves the major risks and quality lines you cannot cross in front of you now. From there you can build up on secondary concerns like usability or accessibility.
How do we test AI-generated pull requests without drowning?
Problem space: With AI there are going to be far more code changes and feature deliveries than ever before; traditional (and even Agile) testing methods just won’t work, as teams will get swamped.
Not everything can be tested at the speed we’ll be going at; the limiting factor for development is going to be how quickly humans can engage with the AI software delivery lifecycle without burning out. Teams need to pull back to what really matters and focus their testing there, which means being more pragmatic than ever before.
Pull out and test at the E2E level
Trying to keep up with everything at a low level just won’t work; we can’t start with units and work up. Work where things are less ephemeral and more stable: the product layer. Use BDD to create tests from requirements, build meaningful acceptance tests, and use these to spot smells for deeper dives where they are needed.
Spot checks rather than full reviews
Teams need to start building a picture of the things they just cannot get wrong. Use this to make risk based spot checks of the PRs and code that matters for a deeper dive review, rather than trying to do all of them.
AI can be used to perform the majority of reviews: create an agent that reviews against your standards and reports when a human should get involved. The trigger could be a threshold of missed standards, or spotting when a change touches areas you can’t get wrong or areas of complexity.
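A sketch of that triage logic is below; the critical paths, size limit, and standards threshold are all invented for illustration and would come from your own risk picture:

```python
# Decide whether a PR needs a human deep-dive after the AI review agent
# has run. All paths and thresholds here are hypothetical examples.

CRITICAL_PATHS = ("payments/", "auth/")   # areas we cannot get wrong
MAX_LINES_CHANGED = 400                   # very large diffs get a human look
MAX_STANDARDS_MISSES = 3                  # tolerance for agent findings

def needs_human_review(changed_files: list[str],
                       lines_changed: int,
                       standards_misses: int) -> bool:
    # Any touch on a critical area escalates immediately.
    if any(path.startswith(CRITICAL_PATHS) for path in changed_files):
        return True
    if lines_changed > MAX_LINES_CHANGED:
        return True
    return standards_misses > MAX_STANDARDS_MISSES
```

The point is not this exact rule set but that the escalation criteria are explicit and cheap to evaluate, so humans only spend attention where the risk picture says it matters.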
Don’t sweat the small stuff
Don’t let perfect be the enemy of good. Not everything needs to be tested in fine detail, especially if we can spot and fix issues quickly. Be pragmatic and know when and where to focus efforts by asking “would we rush to fix this if it broke?”
In organisations with dedicated quality coaches, use them to help build a picture of what testing is needed and useful. This can be helpful to know the lines we cannot cross vs. what might be finessing and gold plating of the system.
Shift right
If teams don’t have the time to test everything up front then learn from production. Put observability and user testing in place to test features and identify areas for fix and improvement. This requires the team to have the ability to roll back or hot fix changes quickly.
Support this with KPIs around change failure rate and escaped defects to tell you when these approaches aren’t working. You may have to sacrifice some speed for higher quality.
The AI doesn’t know our business rules, so how do we catch what it doesn’t know?
Problem space: The AI isn’t going to know the innate details of your domain, product and customer needs, so it will find it hard to test for these things. Teams need a strategy to ensure their products meet the needs of their user base.
AI can be given the business documents you have to hand, like brand guidelines, workflows, business rules and wider domain context, to help inform tests. It also helps to have really well-documented user stories that provide the context of what a feature is and why it is needed; this means writing slightly wordier user stories to support AI testing.
Strong user testing and in-production testing help ensure teams catch anything that was missed from the context. Human-led exploratory testing can shorten this feedback loop too, but it should be focused on high-risk areas to avoid ruining team velocity and feature output.
A feedback loop we can use for refining context is to look at where failures happen after shipping AI code and features, and to use this to inform future tests.
TL;DR
Testing AI-generated products will become the new norm; teams need strategies for building confidence at the faster speed of change.
- Focus tests at the E2E level as this will be more static than the code.
- Build your testing scope around the lines you will not cross (regressions, security, maintainability).
- You can’t review everything, use E2E tests and AI reviews to find smells that you want to deep dive into.
- Use shift right testing to cover business context and pick up what your tests might have missed, feed back learnings as context to improve your AI generated tests over time.
Hopefully this provides a starting point for thinking about testing AI-generated code. If you have questions or thoughts, let’s discuss.
Thanks for taking the time to read! If you found this helpful and would like to learn more, be sure to check out my other posts on the blog. You can also connect with me on LinkedIn for additional content, updates and discussions; I’d love to hear your thoughts and continue the conversation.
