AI Testing Is “Priority #1”… Until You Ask Teams to Trust It: What Leapwork’s 2026 Survey Really Reveals


AI has officially reached the part of the hype cycle where every roadmap slide contains at least one of the following words: agentic, autonomous, or the ever-popular “end-to-end”. And yet, in software testing—where “end-to-end” has a habit of turning into “end-of-quarter panic”—teams are still hesitating.

A new study from test automation vendor Leapwork, summarized in a DevOps.com article by James Maguire, puts numbers on something many QA and DevOps teams have been muttering into their CI logs for months: organizations want AI in testing, but they don’t fully trust it yet.

This article expands on the RSS item “Survey: Adoption of AI Software Testing Slowed by Trust Issues” (DevOps.com). The original piece was written by James Maguire and published on February 20, 2026.

We’ll dig into what Leapwork’s numbers actually say, why the trust gap exists (spoiler: it’s not just “hallucinations”), how this maps to broader enterprise AI adoption patterns, and what practical guardrails can help teams move from pilot projects to production-quality AI-assisted testing.

What the survey says: AI is a priority, but “core workflow” adoption is tiny

Leapwork surveyed 300+ software engineers, QA leaders, and IT decision-makers across midsize and large organizations globally. The topline results look like a victory lap for AI vendors:

  • 88% say AI is a priority for their future testing strategy (with 46% calling it critical or high priority).
  • 80% expect AI to have a positive impact on testing in the next two years.
  • 65% say they’re exploring or using AI in at least one testing activity.

But then comes the quiet part, said out loud:

  • Only 12.6% have embedded AI across core test workflows.

That delta—“AI is a priority” vs. “AI is in the core workflow”—is the story. It’s the difference between we tried it and we rely on it when revenue is on the line.

Trust issues, quantified: accuracy and quality are holding teams back

Leapwork’s survey doesn’t frame the problem as a lack of interest. It frames it as a lack of confidence:

  • 54% cite concerns about accuracy and quality as factors that hold back broader use of AI in testing.

That’s important because testing is a uniquely unforgiving domain for “mostly right.” If an AI assistant helps you write a slightly clumsy internal wiki page, you can shrug and edit it. If it generates a brittle test suite that randomly fails on Tuesdays, you’ll eventually do what software teams have always done to unstable test signals: ignore them. And ignored tests are worse than no tests because they create a false sense of safety.

DevOps.com’s summary highlights that teams worry about quality and dependability, along with brittle tests and the difficulty of automating complex cross-system workflows.

The maintenance tax: updating tests still takes days

One of the most actionable findings: test maintenance remains painfully slow.

  • 45% of respondents say it takes three days or longer to update tests after a change in a critical system.

This is not a minor inconvenience. In modern release pipelines, a multi-day lag in test updates can effectively freeze delivery—or worse, encourage teams to ship with a degraded test suite “temporarily,” which somehow lasts until the next re-org.

If AI tooling doesn’t reliably reduce that maintenance burden (or if it introduces new failure modes), trust will remain conditional.

Automation still isn’t the default: only 41% of testing is automated

Leapwork reports that, on average, only 41% of testing is currently automated. And the biggest blockers are not exotic technical constraints; they are the same classic pain points, now wearing an AI hoodie:

  • 71% say test creation slows them down the most.
  • 56% point to test maintenance.
  • 54% cite lack of time as a barrier.

So the paradox looks like this: AI is being pitched as a way to reduce manual effort, but the teams who would benefit most are already time-starved—and therefore risk-averse about introducing automation they don’t fully trust.

In other words: the same calendar pressure that makes AI attractive also makes it hard to adopt safely.

Why testing is different: you can’t “vibe check” your way to quality

AI’s early wins in software have often been in areas where outputs are easy to evaluate quickly: autocomplete, boilerplate generation, documentation, explaining code, summarizing tickets.

Testing is different. It has three properties that make trust harder:

  • Indirectness: Tests don’t deliver value directly; they deliver confidence. Confidence is hard to measure until it fails.
  • Adversarial reality: Production systems change constantly—UI selectors, APIs, data contracts, infra timing. Tests break even when software is “fine.”
  • High blast radius: Bad test guidance can block releases or—worse—let defects through with a green build.

So when Leapwork’s CEO Kenneth Ziegler says it’s no longer a question of whether testing teams will leverage agentic capabilities, but how confidently and predictably they can rely on them, that’s not marketing fluff. It’s the central constraint.

Context: this trust gap shows up across AI adoption—not just in QA

The DevOps.com piece points to a broader pattern: many enterprises pilot AI initiatives, but fewer deploy them at scale, often due to guardrails, expertise, and confidence issues.

You can see echoes of this in other surveys and reports:

  • A 2024 DevOps.com write-up of a Sogeti/Capgemini survey noted that generative AI adoption was being held back by concerns like data breaches, integration effort, and hallucinations, among others.
  • Tricentis research (via a 2024 Business Wire release) found practitioners ranked testing as the most valuable area for AI investment across the SDLC—so the appetite is there, but value expectations are enormous.
  • IDC has argued that scaling agentic AI safely will require new lifecycle discipline—what it calls an “Agent Development Life Cycle” with guardrails, feedback loops, and transparency.

In short: QA is not uniquely skeptical. QA is just being honest earlier than everyone else.

“Trust” in AI testing isn’t one thing: it’s a stack of worries

When teams say they don’t trust AI testing, that can mean several different (and very practical) things. Let’s break down the most common layers.

1) Trust in the output: is the test correct and meaningful?

The most obvious concern is whether AI-generated tests reflect real user behavior and real risk. A test can be syntactically correct and still be strategically useless (testing trivial paths, asserting the wrong thing, or locking in a buggy behavior as “expected”).

This maps to Leapwork’s “accuracy and quality” barrier.

2) Trust in stability: will the test suite stay green for the right reasons?

Flaky tests aren’t just annoying; they actively train teams to ignore signals. If AI-assisted tools create tests that are brittle—overfitting to transient UI details, timing assumptions, or unstable fixtures—then AI increases noise, and trust declines.

DevOps.com explicitly mentions unstable/brittle tests as a friction point.

3) Trust in coverage: what didn’t it test?

This is the “unknown unknowns” problem. Humans are imperfect too, but experienced testers have mental models about where systems fail: edge cases, concurrency, boundary values, permissions, localization, and the weird behaviors that only show up at 2 a.m. on the last day of a billing cycle.

AI can generate large volumes of tests quickly, but volume is not coverage. Coverage requires intent, risk assessment, and domain context.

4) Trust in governance: where did the data go, and who owns the risk?

Even when AI models are “good,” organizations can be constrained by policy: data residency, regulated environments, IP concerns, auditability, and the need for reproducible decisions. Testing often touches production-like data, credentials, and sensitive flows. That elevates governance requirements.

5) Trust in the humans: will people verify AI output, or just ship it?

Here’s the uncomfortable part: even when teams say they don’t fully trust AI, they often use it anyway—and sometimes skip verification because it’s slow.

A survey discussed by ITPro reported that many developers don’t consistently verify AI-generated code before committing it, despite admitting they don’t fully trust its correctness. This idea has been framed as “verification debt,” a term attributed to AWS CTO Werner Vogels.

If we translate that to testing, the risk becomes “verification debt in quality engineering”: AI generates tests, teams don’t fully understand them, maintenance gets harder, and soon no one trusts the suite—or feels responsible for it.

AI in testing today: where it helps, where it struggles

Based on industry practice and what these surveys suggest, AI is generally more trustworthy when it’s used as an assistant inside a workflow that already has strong automation discipline, rather than as a replacement for that discipline.

High-confidence use cases (where AI tends to shine)

  • Test authoring acceleration: generating test scaffolding, page objects, API client stubs, fixtures, data builders.
  • Test refactoring: translating a brittle locator strategy into a more resilient one; improving naming; reorganizing suites.
  • Failure triage: summarizing logs, clustering similar failures, suggesting likely root causes.
  • Documentation and onboarding: explaining what a test is trying to validate and how to run it locally.

Lower-confidence use cases (where teams hit the trust wall)

  • Autonomous end-to-end UI testing generation without a stable automation architecture and good selectors.
  • Cross-system orchestration where the test spans legacy apps, third-party SaaS, and asynchronous workflows.
  • Security-relevant assertions (permissions, tenancy boundaries, injection paths) where “almost right” is unacceptable.

Leapwork’s findings implicitly support this: the industry is experimenting (65%), but very few are comfortable embedding AI across core workflows (12.6%).

The “three-day update” problem: why maintenance is the real enemy

It’s tempting to assume AI adoption is blocked mainly by “model issues.” But Leapwork’s stats suggest a more mundane bottleneck: the economics of maintenance.

When 45% of teams need three or more days to update tests after critical system changes, the organization is effectively paying a recurring tax for every meaningful release.

AI could help reduce that tax—if it can do two things consistently:

  • Diagnose why a test broke (was it a UI change, data change, environment issue, timing issue, upstream service issue?).
  • Propose a fix that is resilient (not just “update selector from #btn-123 to #btn-124”).
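What “resilient” means here is usually a fallback strategy rather than a single patched ID. Here is a minimal, framework-agnostic sketch of a locator chain that prefers stable hooks over brittle ones; the element model and attribute names are illustrative assumptions, not any specific tool’s API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Element:
    """A simplified DOM element for illustration."""
    attrs: dict = field(default_factory=dict)

def find_element(page: list, locators: list) -> Optional[Element]:
    """Try locators in order of resilience: a stable test id first,
    then an accessibility attribute, then a brittle raw id last."""
    for key, value in locators:
        for el in page:
            if el.attrs.get(key) == value:
                return el
    return None

# This chain survives an id change from btn-123 to btn-124,
# because the team-owned data-testid is tried first.
locators = [
    ("data-testid", "submit-order"),   # stable hook owned by the team
    ("aria-label", "Submit order"),    # accessibility attribute
    ("id", "btn-123"),                 # brittle fallback
]

page = [Element({"id": "btn-124", "data-testid": "submit-order"})]
print(find_element(page, locators) is not None)  # the id changed, but the test still finds the element
```

An AI maintenance suggestion that rewrites a failing locator into a chain like this is more trustworthy than one that simply swaps one brittle id for the next.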

This is where “agentic” approaches get interesting. Research on agentic workflows for test-driven software engineering—like TDFlow—suggests that tightly constrained, test-driven agent loops can reduce failure modes like test hacking and improve pass rates on benchmarks, especially when human-written tests guide the system.

Now, benchmarks are not production. But the direction is relevant: reliability improves when AI is constrained by tests, tools, and verification steps, rather than asked to freestyle in a huge solution space.

How to deploy AI testing without lighting your release pipeline on fire

Organizations that want to move beyond experimentation need a strategy that treats AI as a probabilistic component that must be wrapped in deterministic controls. Here are practical patterns that map to the trust barriers highlighted by Leapwork.

1) Start with “human-in-the-loop” policies you can explain to an auditor

If AI generates or modifies tests, define explicit review rules:

  • AI-generated tests must be code-reviewed by a human owner.
  • Critical-path suites require a second reviewer (two-person rule).
  • Any change to assertions requires justification in the PR description.

This isn’t bureaucracy for its own sake. It’s a direct response to verification debt and the tendency to ship AI output faster than we can reason about it.
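Rules like these can be enforced mechanically in CI rather than by convention. A minimal sketch of such a policy gate, assuming a hypothetical PR record with labels, reviewers, and a changed-files list (none of this is a real CI product’s schema):

```python
def review_policy_violations(pr: dict) -> list:
    """Return human-readable violations of the AI-review policy.
    The PR shape (labels, reviewers, files, description) is illustrative."""
    violations = []
    ai_generated = "ai-generated" in pr["labels"]
    touches_critical = any(f.startswith("tests/critical/") for f in pr["files"])
    if ai_generated and len(pr["reviewers"]) < 1:
        violations.append("AI-generated tests need a human reviewer")
    if touches_critical and len(pr["reviewers"]) < 2:
        violations.append("critical-path suites require two reviewers")
    if pr.get("changes_assertions") and "justification:" not in pr["description"].lower():
        violations.append("assertion changes must be justified in the description")
    return violations

pr = {
    "labels": ["ai-generated"],
    "reviewers": ["alice"],
    "files": ["tests/critical/checkout_test.py"],
    "description": "Regenerated checkout tests.",
    "changes_assertions": True,
}
for v in review_policy_violations(pr):
    print(v)  # this PR trips the two-reviewer and justification rules
```

The point is auditability: a reviewer (or an auditor) can read the gate and see exactly which rule blocked the merge, instead of reconstructing intent from tribal knowledge.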

2) Treat flaky tests as a severity-1 reliability defect

Trust dies when builds go red for nonsense reasons. Set SLOs for test reliability:

  • Maximum acceptable flake rate per suite.
  • Quarantine policies with time limits (no “temporary quarantine” lasting 6 months).
  • Ownership: every flaky test has an accountable team.

AI can help diagnose flakes, but the organization must decide that flakiness is a production problem, not a QA inconvenience.
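Treating flakiness as a measurable SLO implies actually computing a flake rate per suite and flagging breaches. A toy sketch of that computation, with a made-up 1% SLO threshold and an illustrative run record shape:

```python
def flake_rate(runs: list) -> float:
    """Fraction of runs that flipped outcome on retry: a 'flake' here
    is a failure that passed when re-run unchanged."""
    flaky = sum(1 for r in runs if r["failed"] and r["passed_on_retry"])
    return flaky / len(runs) if runs else 0.0

def quarantine_report(suite_runs: dict, max_flake_rate: float = 0.01) -> list:
    """Suites breaching the SLO get flagged for time-boxed quarantine."""
    return [name for name, runs in suite_runs.items()
            if flake_rate(runs) > max_flake_rate]

suite_runs = {
    "checkout": ([{"failed": True, "passed_on_retry": True}]
                 + [{"failed": False, "passed_on_retry": False}] * 49),
    "search": [{"failed": False, "passed_on_retry": False}] * 50,
}
print(quarantine_report(suite_runs))  # checkout: 1/50 = 2%, over the 1% SLO
```

The quarantine list then needs the time limit and ownership rules from the bullets above; a report with no expiry is just a graveyard with better branding.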

3) Use AI where the blast radius is limited, then expand

A sensible adoption ladder looks like:

  • AI for test documentation and failure summaries
  • AI for scaffolding and refactoring
  • AI for maintenance suggestions (human approves)
  • AI for creating net-new tests in low-risk modules
  • Only later: AI across core workflows (what only 12.6% report today)

This mirrors the real-world pattern in the survey: experimentation is common; full embedding is rare. That’s not failure. It’s staged risk management.

4) Demand observability for AI actions in the SDLC

If AI is participating in test creation or maintenance, you need telemetry:

  • What tests were generated/modified?
  • What prompts or instructions were used?
  • What artifacts did the model reference (requirements, code, logs)?
  • What was the before/after impact on flake rate, duration, and defect escape rate?

Without this, “trust” becomes a feelings-based metric, and feelings are notoriously hard to debug.
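Concretely, answering those four questions means emitting one structured, auditable record per AI action. A sketch of what such a record could look like; the field names are an illustrative schema I am assuming here, not an existing standard:

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class AITestAction:
    """One auditable record per AI-generated or AI-modified test."""
    test_id: str
    action: str                 # "generated" | "modified" | "quarantined"
    prompt_ref: str             # pointer to the stored prompt, not the prompt itself
    inputs: list                # artifacts the model referenced
    flake_rate_before: float    # before/after impact, per the telemetry bullets
    flake_rate_after: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = AITestAction(
    test_id="checkout/test_apply_coupon",
    action="modified",
    prompt_ref="prompts/2026-02-20/heal-selector.txt",
    inputs=["logs/run-4812.txt", "specs/checkout.md"],
    flake_rate_before=0.08,
    flake_rate_after=0.01,
)
print(json.dumps(asdict(record), indent=2))
```

Once records like this land in the same observability stack as deployment events, “did AI help?” becomes a query over before/after flake rates instead of a debate.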

5) Ground AI in your own domain knowledge (and keep secrets out of prompts)

Many AI test failures come from lack of context: business rules, data semantics, and environment constraints. Organizations can improve quality by grounding AI assistance in internal documentation, API contracts, and known-good patterns—while implementing strict data handling rules so sensitive information doesn’t leak into external services.

This is also where governance frameworks and lifecycle discipline—like IDC’s emphasis on structured approaches to building and scaling agents—matter.

Comparisons: Leapwork’s numbers line up with broader QA automation research

It’s worth noting that “trust issues” aren’t new in automation. Even before generative AI, teams struggled with the cost and expertise required to adopt automated testing at scale.

A 2024 academic survey on factors preventing adoption of automated software testing highlights the resource-intensive nature of testing and identifies expertise and cost as primary challenges. While that paper isn’t about generative AI specifically, it reinforces that testing adoption is constrained as much by organizational capacity as by tooling.

What AI changes is not the existence of constraints—it changes which constraints show up first. AI may reduce some effort in authoring and triage, but it can amplify the need for governance and verification.

What this means for vendors: “agentic” is not a feature, it’s a liability unless it’s measurable

Vendors in AI testing (Leapwork included) are increasingly using terms like “agentic capabilities.” That can mean many things: autonomous test generation, self-healing, workflow orchestration, natural language test creation, and more.

Leapwork’s CEO framed the challenge plainly: teams want AI to help them move faster and expand coverage, but accuracy is “table stakes,” and the opportunity lies in integrating AI alongside stable automation foundations.

From a market perspective, this implies AI testing tools will win on:

  • Reliability metrics (flake reduction, maintenance time reduction)
  • Explainability (why did it generate this test? what risks does it cover?)
  • Integration (CI/CD, ticketing, observability, secrets management)
  • Control surfaces (policy, approvals, environment boundaries)

In other words, the future of AI testing may look less like a magic button and more like a well-instrumented power tool with a safety guard.

A pragmatic prediction for 2026–2028: AI testing adoption will grow, but “core workflow” will lag until reliability is proven

Leapwork’s data suggests most organizations are in a careful, staged adoption phase: high priority, high optimism, significant experimentation—but low deep integration.

Over the next two years, expect adoption to expand in three ways:

  • More AI-assisted maintenance: self-healing approaches will be judged by whether they reduce the “three-day update” reality.
  • More AI in test triage: summarization and clustering of failures will become standard because it’s low-risk and immediately helpful.
  • More agentic pipelines with guardrails: systems that can propose changes but must pass deterministic checks (linters, compilation, unit tests, policy gates) will feel trustworthy enough for broader use.
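The last pattern—a proposer that must clear deterministic gates—can be sketched in a few lines. The gates below are string checks purely for illustration; a real pipeline would shell out to a linter, a compiler, and the unit-test runner instead:

```python
from typing import Callable

def gated_apply(proposal: str,
                checks: list,
                apply: Callable[[str], None]) -> bool:
    """Apply an AI-proposed change only if every deterministic gate passes.
    Gates are (name, predicate) pairs; the proposer never bypasses them."""
    for name, check in checks:
        if not check(proposal):
            print(f"rejected by gate: {name}")
            return False
    apply(proposal)
    return True

# Illustrative gates standing in for real linters, compilers, and policy scans.
checks = [
    ("compiles", lambda p: "SyntaxError" not in p),
    ("no_skipped_tests", lambda p: "@skip" not in p),
    ("policy", lambda p: "prod_credentials" not in p),
]

applied = []
ok = gated_apply("def test_login(): assert login('u', 'p')",
                 checks, applied.append)
print(ok, len(applied))  # the clean proposal passes all gates and is applied
```

The design choice that makes this feel trustworthy is asymmetry: the AI can only propose, while the accept/reject decision belongs entirely to deterministic, inspectable code.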

The “core workflow” number—12.6% today—will rise, but probably only in organizations that already have mature automation practices and can measure reliability improvements objectively.

Conclusion: the real blocker isn’t AI ambition—it’s predictable quality

Leapwork’s survey results, and James Maguire’s DevOps.com summary, capture an uncomfortable truth: the software industry is extremely good at saying “AI-first,” but software delivery still runs on trust. And trust is earned through repeatability, observability, and outcomes you can bet a release on.

It’s telling that nearly nine in ten respondents say AI is a priority, yet only about one in eight have embedded it across key workflows. That’s not a contradiction. It’s the shape of responsible adoption in a discipline where failure is public, expensive, and occasionally tweeted at your CEO.

If AI testing is going to move from experimentation to operational standard, it won’t be because models got more impressive demos. It’ll be because teams found ways to make AI boringly reliable—the highest compliment any testing tool can receive.

Bas Dorland, Technology Journalist & Founder of dorland.org