Creating a Playwright framework with AI

TL;DR

Built a Playwright E2E test framework using Claude Code, covering ~100% of high-priority workflows.
Works against local, Docker and deployed QA environments with minimal team overhead.
Core loop: human sets scope → AI writes one spec at a time → AI self-reviews → human reviews → run for flakes → fix.
Best results when treated as a junior SDET with human in the loop, not an autonomous oracle.

I have been using Claude Code to create a Playwright E2E test framework for a product from scratch, with very positive results: tests run successfully against multiple environments and cover almost 100% of high-priority workflows. Whilst you don’t need to be code native to get started on this, I would recommend bringing in someone who can review the code and give pointers, to avoid building the wrong thing.

Identifying the need

I am working with teams who want to accelerate their delivery using AI. The product is mature, so it needs guardrails in place to avoid negatively impacting existing customer value. It also has some tests through the stack, but the team wants to build an AI skill to help increase coverage.

The need for greater E2E / feature coverage comes from needing to ensure that, whilst code logic can change quickly via AI, product behaviour remains consistent. When AI changes unit tests alongside the code they cover, logic and feature behaviour can drift over time without any test failing; the E2E tests provide a safety net against that drift.

The requirements for a test framework are therefore:

To be AI native and allow for easy adding of tests through existing AI tooling (Claude Code).
To support the team in understanding whether they are impacting user behaviour as they develop further with AI.
To be able to retrofit golden master tests from the existing product into the new framework at speed.
To require minimal team input as they work on delivering client value.
To run tests locally and against an existing environment.
To keep tests separate from product code, so that they are not changed alongside AI development / refactoring of features.

To this end I selected Playwright as the tool, as it 1) has an MCP server that Claude Code can use, allowing for AI development, and 2) is established, well supported and well documented, allowing me to guide the AI on best practice from its documentation.

Implementation

I started with an experiment to learn how the Playwright MCP works, how to set it up and what limitations (if any) the tooling has. Initial setup is pretty simple: you can even ask Claude to set up Playwright so that Claude Code can use it to explore an environment within your repo, and it does so very successfully.

Fig 1. The setup of my AI test framework.

Claude Code interacts with Playwright via the MCP to use apps in a browser instance. From this it can determine behaviour and codify it as Playwright tests. This is important because writing meaningful tests needs context, and interacting with the test environment gives Claude Code that context first-hand.

The test environment can be a deployed instance (such as a staging or QA environment), and a local / Docker build works well too.
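One way to keep the same specs pointed at different environments is a small lookup from an environment name to a base URL. The sketch below is illustrative only: the environment names, the URLs and the `resolveTargetEnv` function are assumptions of mine, not part of the framework described here.

```typescript
// Hypothetical environment names and URLs; substitute your own.
type TargetEnvName = "local" | "docker" | "qa";

interface TargetEnv {
  envName: TargetEnvName;
  baseURL: string;
}

const TARGET_ENVS: Record<TargetEnvName, TargetEnv> = {
  local: { envName: "local", baseURL: "http://localhost:3000" },
  docker: { envName: "docker", baseURL: "http://localhost:8080" },
  qa: { envName: "qa", baseURL: "https://qa.example.com" },
};

// Type guard that only accepts keys defined directly on TARGET_ENVS.
function isTargetEnvName(value: string): value is TargetEnvName {
  return Object.prototype.hasOwnProperty.call(TARGET_ENVS, value);
}

// Resolve the requested environment, defaulting to "local" when none is given.
// Unknown names throw so a typo never silently runs against the wrong target.
function resolveTargetEnv(requested: string | undefined): TargetEnv {
  const key = (requested === undefined ? "local" : requested).toLowerCase();
  if (!isTargetEnvName(key)) {
    throw new Error("Unknown test environment: " + key);
  }
  return TARGET_ENVS[key];
}
```

In a Playwright project the resolved `baseURL` would typically be passed to the `use.baseURL` option in the config file, with the requested name read from an environment variable of your choosing.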

What it does well

It is able to explore / spider a site to uncover behaviours without having to be explicitly told.
It can use the code repo to uncover behaviours, not just the front end.
It is good at following best practice patterns for creating tests.
It is good at setting up seeding of test data, rather than following full journeys.
It is good at reviewing the created test spec code for consistency against examples.

What it needs help with

It needs guidance on what to test or what is useful to cover.
It needs to be forced to keep context windows tight.
It needs pushing on what to assert as part of the test.
It needs stopping occasionally so that it doesn’t go down rabbit holes.
It’s not great at creating code that’s optimised for parallel test runs.

Based on my experiments and observations, I have created a playbook to guide my prompting and the development of further Playwright frameworks for other projects.

Playbook for AI generated tests

This is the general flow I’ve been using when prompting AI to write tests. Note that there is a human in the loop throughout, to keep the AI on track and make informed decisions on the outcomes of reviews.

Fig 2. The structure of my AI test creation.

Set the testing scope yourself

Don’t ask Claude to discover everything to test all at once. It won’t cope with such a wide context: it will start to discover the wrong things and write poor, flaky test code. Instead, use AI to discover a specific workflow and document it as test code, as it’s much better at this.

I’ve found it works better to manually review a product or site myself to develop a list of features to be tested, and then use this list to guide my AI test creation. This had the added benefit of giving me context on the feature set, so that I understood the behaviour and could prompt the AI into doing the right thing.

Write: Go one test (or spec) at a time

I find that I get better results by focusing on one test (or one area of testing) at a time when using AI. This keeps the context window small enough to get better results and creates more robust tests.

Where there is a CRUD function, it can work to ask for the create, read, edit and delete tests together, and you get okay responses. This depends on the patterns being very consistent across these operations, so that they are easily discoverable.

Review: Ask it what it’s done

After a batch of tests has been written to a spec, I frequently have to prompt the AI to review the tests in this spec and tell me what they do and what they assert against.

Frequently the tests would:

Miss steps.
Lack a meaningful assertion for pass / failure.
Assert on something trivial (like page URL).
Assert at the wrong level (API when I want a UI based test).
Use suboptimal test input data.

I would then have to correct for these. It’s worth noting that, as I built up patterns of what I wanted from tests, the AI got better at recognising this and made fewer mistakes.
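Two of the problems listed above (no assertion at all, and asserting only on the page URL) can be roughly pre-screened with a text check before asking the AI for a fuller review. The sketch below is a crude heuristic under my own assumptions: `auditSpecAssertions` is a hypothetical helper, it counts `expect(...)` calls and Playwright’s `toHaveURL` matcher by pattern matching on the spec source, and it cannot judge whether an assertion is meaningful.

```typescript
interface AssertionAudit {
  totalAssertions: number;
  urlAssertions: number;
  warnings: string[];
}

// Crude text-based audit of a spec file's source.
// It only counts calls by pattern; it does not parse or run the spec.
function auditSpecAssertions(specSource: string): AssertionAudit {
  // Matches expect(...) as well as variants such as expect.soft(...) or expect.poll(...).
  const totalAssertions = (specSource.match(/\bexpect(?:\.\w+)?\s*\(/g) || []).length;
  // Counts uses of the URL matcher.
  const urlAssertions = (specSource.match(/\.toHaveURL\s*\(/g) || []).length;

  const warnings: string[] = [];
  if (totalAssertions === 0) {
    warnings.push("no assertions found");
  } else if (urlAssertions === totalAssertions) {
    warnings.push("only asserts on the page URL");
  }
  return { totalAssertions, urlAssertions, warnings };
}
```

A clean result from a check like this says nothing about missed steps, assertion level or input data, so the prompt-based review and the human review still do the real work.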

Review: Run the tests for flakes

This is standard best practice. When I’ve created tests for an area of workflow, I prompt the AI to run the tests three times and report on any flakes, with the instruction not to fix them but to make recommendations for fixes.

Frequently the tests would fail and need to be hardened so that they don’t fail or time out when run. I ask the AI to report and recommend fixes so that I can go through them one by one, preserving context and getting better results.

Sometimes the first fix recommendation doesn’t work and the AI has to debug and make further recommendations. If it goes down a rabbit hole I stop the AI and ask it what it’s doing (or use a /btw ask) so I can see what it’s trying and keep on top of it.
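The three-run check in this section comes down to a simple classification: a test that passes every run is stable, one that fails every run is failing, and anything in between is flaky. Here is a minimal sketch of that logic. The `TestRunRecord` shape and the `classifyStability` function are assumptions for illustration and do not match any particular reporter’s output format.

```typescript
type RunOutcome = "passed" | "failed";

// Hypothetical, simplified record: one entry per test per run.
interface TestRunRecord {
  testTitle: string;
  outcome: RunOutcome;
}

interface StabilityReport {
  stable: string[];
  flaky: string[];
  failing: string[];
}

function classifyStability(records: TestRunRecord[]): StabilityReport {
  // Group outcomes by test title across all runs.
  const outcomesByTitle: Record<string, RunOutcome[]> = Object.create(null);
  for (const record of records) {
    if (!outcomesByTitle[record.testTitle]) {
      outcomesByTitle[record.testTitle] = [];
    }
    outcomesByTitle[record.testTitle].push(record.outcome);
  }

  const report: StabilityReport = { stable: [], flaky: [], failing: [] };
  for (const title of Object.keys(outcomesByTitle)) {
    const outcomes = outcomesByTitle[title];
    const passes = outcomes.filter((outcome) => outcome === "passed").length;
    if (passes === outcomes.length) {
      report.stable.push(title); // passed in every run
    } else if (passes === 0) {
      report.failing.push(title); // failed in every run
    } else {
      report.flaky.push(title); // mixed results across runs
    }
  }
  return report;
}
```

Playwright’s own `--repeat-each` command line option (for example `npx playwright test --repeat-each=3`) is one way to produce the repeated runs. The split matters because a consistently failing test usually points to a wrong test or a real bug, while a flaky one usually needs hardening.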

Review: Get expert opinion on the code

This is especially important as you’re getting started with the tests. I made sure to review the code and get tests into a good state that meets code hygiene, best practice and engineering standards. If you’re not proficient with writing code then you can get a developer to review the code by raising a merge request and asking for a review.

These early specs will form examples to drive later development, reducing the need for developer review to create good code.

Review: Check specs for consistency

When I have specs that have been reviewed and make for good examples, I prompt the AI to review the newly created spec against previous specs (which I name in the prompt) for inconsistencies or departures from the patterns those specs establish, and to recommend changes to bring the new spec in line with them.

This usually finds some things to change, both cosmetic and structural and helps keep the test specs in line which makes them easier to review. It also keeps the new tests using the best practices established in previous test specs.

Review: Keep reviewing

As I go through creating more specs, I’ve found it useful to review the specs and the whole suite against best practices. I prompt the AI to review the test suite against trusted sources of best practice for writing good test code (such as Playwright’s website) and to make recommendations for making the test code cleaner and more efficient.

Not every recommendation is great, so you have to pick the most impactful ones. I tend to refactor one recommendation at a time to preserve context, and rerun the tests afterwards to make sure the changes haven’t broken them.

Review: Feedback adds to memories

The review steps in the playbook are tactical: each one improves the test in front of you. But every review also produces a piece of guidance that should improve the next test and the one after that. I got Claude Code to remember fixes to failures, and disagreements that I had with the implementation of specs, and turn these into meaningful memories to shape how it builds tests in the future.

Selector reuse

Pattern reuse is good for macro structure (file shape, fixtures, assertion philosophy) but weak for micro selectors. Each individual locator was chosen fresh from the recon notes rather than by asking “does this codebase already have a working pattern for this kind of element?” The rule that came out of it: before writing any new locator, grep existing helpers and tests for the same element shape (button-styled, sidebar item, table row action) and reuse the proven pattern instead of inventing one from a recon snapshot.

Saved as a feedback memory titled E2E selector reuse — check existing patterns first.
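The “grep first” rule can be sketched as a small search over existing helper and spec sources. This is an illustration under assumptions: `findExistingLocators` is a hypothetical helper, it takes file contents as an in-memory record rather than reading from disk, and it only recognises a handful of common Playwright locator methods by name.

```typescript
interface LocatorHit {
  file: string;
  lineNumber: number;
  text: string;
}

// Common Playwright locator methods, matched by name only.
const LOCATOR_CALL = /\b(?:getByRole|getByLabel|getByTestId|getByText|getByPlaceholder|locator)\s*\(/;

// Find lines that contain a locator call and mention the element hint
// (for example "button" or "sidebar"), so a proven pattern can be reused.
function findExistingLocators(sources: Record<string, string>, elementHint: string): LocatorHit[] {
  const hint = elementHint.toLowerCase();
  const hits: LocatorHit[] = [];
  for (const file of Object.keys(sources)) {
    const lines = sources[file].split("\n");
    lines.forEach((line, index) => {
      if (LOCATOR_CALL.test(line) && line.toLowerCase().indexOf(hint) !== -1) {
        hits.push({ file, lineNumber: index + 1, text: line.trim() });
      }
    });
  }
  return hits;
}
```

In practice the AI can do the same thing with a plain grep over the repo; the point of the memory is that the search happens before a new locator is written.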

Spec structure

Nominate one canonical area as the structural baseline (for me, the initially created specs) and require new specs to mirror its skeleton — fixtures, hooks, naming, helper usage — before filling in new behaviour.

Saved as E2E spec structure — mirror existing specs.

Assertion coverage

CRUD tests must assert UI and persisted state via close-and-reopen; success toasts are noise and should not be asserted; negative assertions must be explicit toHaveCount(0) rather than absence-by-omission.

Saved as E2E assertion coverage — what to verify.

Each memory is one short file that includes the rule, a why line that anchors it to a real prior incident and how to apply. The “why” is what lets the AI judge for edge cases instead of blindly following the rule all the time.
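The three-part shape of a memory (rule, why, how to apply) can be sketched as a small data structure. The field names, the rendered layout and the two functions below are my assumptions for illustration; this is not the format Claude Code itself uses for memory files.

```typescript
interface FeedbackMemory {
  title: string;
  rule: string;
  why: string; // anchors the rule to a real prior incident
  howToApply: string;
}

// Return the names of any parts that are missing or blank.
function findMissingParts(memory: FeedbackMemory): string[] {
  const missing: string[] = [];
  if (memory.title.trim() === "") missing.push("title");
  if (memory.rule.trim() === "") missing.push("rule");
  if (memory.why.trim() === "") missing.push("why");
  if (memory.howToApply.trim() === "") missing.push("howToApply");
  return missing;
}

// Render the memory as a short plain-text file body.
function renderMemory(memory: FeedbackMemory): string {
  return [
    memory.title,
    "",
    "Rule: " + memory.rule,
    "Why: " + memory.why,
    "How to apply: " + memory.howToApply,
    "",
  ].join("\n");
}

// Example using the assertion coverage rule; the title is shortened and the
// "why" and "how to apply" texts are placeholders, not real content.
const assertionMemory: FeedbackMemory = {
  title: "E2E assertion coverage",
  rule: "CRUD tests must assert UI and persisted state via close-and-reopen.",
  why: "(placeholder: the real incident that prompted the rule goes here)",
  howToApply: "(placeholder: when and where the rule applies)",
};
```

Checking for a blank “why” is the useful part: a rule without its incident tends to be followed blindly or ignored.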

Building on these tests

Once basic tests were written I’ve been able to extend them out to make them more useful, including such things as:

Randomising test data, by prompting the AI: “Create random and meaningful test data for this workflow by selecting from a set list of valid data. Use special, accented and Japanese characters throughout where the fields allow for this input.”
Extending my tests to run in multiple environments, by prompting the AI: “Review this Playwright project and create a prompt to extend it to run on a QA environment (location here) as well as on local builds. Optimise for the same specs and tests running across both and highlight issues that would prevent me from being able to do this.”
Creating test reports for coverage, by prompting the AI: “Review the tests in this test suite against (this pre-existing scope of tests) and create a coverage report, and update this whenever a new test is added to the test suite. If a line in the scope doesn’t exist for a test, ask what to do.”
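The first item, picking random but valid data from a set list, might look something like the sketch below. Everything in it is an assumption for illustration: the customer workflow, the value lists and the function names are hypothetical, and the seeded generator is my own addition so that a failing run can be reproduced with the same seed.

```typescript
// Small linear congruential generator (Numerical Recipes constants).
// Returns floats in [0, 1); the same seed always gives the same sequence.
function createSeededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Pick one entry from a non-empty list of valid values.
function pickFrom<T>(values: T[], random: () => number): T {
  if (values.length === 0) {
    throw new Error("cannot pick from an empty list");
  }
  return values[Math.floor(random() * values.length)];
}

// Hypothetical valid values, including special, accented and Japanese characters.
const VALID_CUSTOMER_NAMES = [
  "Zoë O'Brien",
  "José Muñoz-García",
  "山田 太郎",
  "Anne-Marie Lefèvre",
  "Smith & Sons <Ltd>",
];
const VALID_CITIES = ["Kraków", "東京", "São Paulo"];

interface CustomerInput {
  customerName: string;
  city: string;
}

// Build one set of inputs for a hypothetical "create customer" workflow.
function buildCustomerInput(seed: number): CustomerInput {
  const random = createSeededRandom(seed);
  return {
    customerName: pickFrom(VALID_CUSTOMER_NAMES, random),
    city: pickFrom(VALID_CITIES, random),
  };
}
```

Logging the seed alongside each test run is what makes this useful: when a particular name breaks a form, the same seed recreates the same input.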

What this framework gives us

With some reviews I can get pretty good Playwright tests that cover the right thing and broadly follow best practices for code. They are not massively flaky (although, like any tests, they need some support occasionally) and they run well.

It supports solving the tricky problems in test framework setup: I was able to prompt my way around auth setup and test data creation easily using Claude Code. The framework sets up a number of helpers that seed data, which can be used in a beforeAll to get the environment into the state the tests need. Likewise, it’s easy to prompt the AI to create cleanup for test data too.

Because I now have a pattern of what works to get to a good result, I can create a skill for driving more tests for new features. This will allow for automation of extending the Playwright test coverage alongside new development, which will speed up team velocity and reduce the barrier to implementing more tests. I haven’t yet started work on this, and it will need further experimentation before I can share what works.

This framework also opens up the possibility of non-coders being able to contribute to the test suite, which reduces the barrier to implementing new tests even further as anyone can do it. It helps to have a good instinct on what matters for creating a test framework and tests in order to meaningfully prompt and steer the AI.

It also speeds up retrofitting regression / golden master tests. What would possibly have taken weeks with another framework is now taking days to implement, helping me to reduce the risk profile of releases for a team much faster.

Next steps are adding these tests to a CI pipeline, with a skill that says which tests to run for a release, and running tests in parallel to optimise for speed. The framework has options for headless and headed running, which support running in the CI pipeline and running locally for debugging.

Why having a QE helps with this work

Simply put, a Quality Engineer can use their experience to tell you what good enough looks like in this space. You want to be able to scope and tailor the testing to maximise the guardrail and minimise token spend and over-testing. They can also explore the system under test to guide the AI on what needs to be tested and which edge cases matter, and to judge whether the tests have been meaningful and helpful.

A senior Quality Engineer will also have resources to support this work, heuristics and known ways of working that support the creation of E2E or acceptance tests. This will cut down on missteps and the time and token cost associated with these missteps.

Conclusion

This approach isn’t a replacement for engineering judgement; it’s a force multiplier for it. The teams getting value from AI generated tests are the ones treating the AI as a junior SDET that needs review rather than an oracle that produces finished work.

For further reading check out my posts on Quality Engineering with AI and Using AI to help teams to Shift Testing Left. Or look at these other pages on Playwright MCP, Playwright Best Practices and How to create a Claude Skill.

Thanks for taking the time to read! If you found this helpful and would like to learn more, be sure to check out my other posts on the blog. You can also connect with me on LinkedIn for additional content, updates and discussions; I’d love to hear your thoughts and continue the conversation.
