Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale
If you have ever stared at hundreds of lines of integration test logs wondering which of the sixteen log files actually contains your bug, you aren't alone, and Google now has data to prove it.
A team of Google researchers introduced Auto-Diagnose, an LLM-powered tool that automatically reads the failure logs from a broken integration test, finds the root cause, and posts a concise diagnosis directly into the code review where the failure showed up. In a manual evaluation of 71 real-world failures spanning 39 distinct teams, the tool correctly identified the root cause 90.14% of the time. It has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes authored by 22,962 distinct developers, with a 'Not helpful' rate of just 5.8% on the feedback received.

The problem: integration tests are a debugging tax
Integration tests verify that multiple components of a distributed system actually communicate with each other correctly. The tests Auto-Diagnose targets are hermetic functional integration tests: tests where a whole system under test (SUT), typically a graph of communicating servers, is brought up inside an isolated environment by a test driver and exercised against business logic. A separate Google survey of 239 respondents found that 78% of integration tests at Google are functional, which is what motivated this scope.
Diagnosing integration test failures showed up as one of the top five complaints in EngSat, a Google-wide survey of 6,059 developers. A follow-up survey of 116 developers found that 38.4% of integration test failures take more than an hour to diagnose, and 8.9% take more than a day, versus 2.7% and 0% for unit tests.
The root cause is structural. Test driver logs usually surface only a generic symptom (a timeout, an assertion). The actual error lives somewhere inside one of the SUT component logs, often buried under recoverable warnings and ERROR-level lines that aren't actually the cause.

How Auto-Diagnose works
When an integration test fails, a pub/sub event triggers Auto-Diagnose. The system collects all test driver and SUT component logs at level INFO and above, across data centers, processes, and threads, then joins and sorts them by timestamp into a single log stream. That stream is dropped into a prompt template along with component metadata.
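A minimal sketch of the log-joining step, assuming each component emits already-ordered timestamped lines (the component names and log lines below are invented for illustration):

```python
import heapq
from datetime import datetime

def merge_logs(component_logs):
    """Join per-component log lines into one stream, sorted by timestamp.

    component_logs: dict mapping component name -> list of (timestamp, line),
    each list already in chronological order, so a k-way heap merge suffices.
    """
    streams = (
        [(ts, name, line) for ts, line in lines]
        for name, lines in component_logs.items()
    )
    return [f"[{ts.isoformat()}] {name}: {line}"
            for ts, name, line in heapq.merge(*streams)]

# Invented example: the driver only sees a timeout; the real error sits in a
# backend component's log, interleaved by timestamp in the merged stream.
logs = {
    "test_driver": [
        (datetime(2025, 5, 1, 12, 0, 30), "ERROR rpc deadline exceeded"),
    ],
    "backend": [
        (datetime(2025, 5, 1, 12, 0, 5), "INFO serving on :8080"),
        (datetime(2025, 5, 1, 12, 0, 12), "ERROR config key 'quota' missing"),
    ],
}
for line in merge_logs(logs):
    print(line)
```

The merged stream is what makes the diagnosis tractable: the generic driver-side timeout and the real backend error end up adjacent in one chronological view.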
The model is Gemini 2.5 Flash, called with temperature = 0.1 (for near-deterministic, debuggable outputs) and top_p = 0.8. Gemini was not fine-tuned on Google's integration test data; this is pure prompt engineering on a general-purpose model.
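The reported sampling settings can be expressed as a plain config; the commented-out call below is a hedged sketch of what invoking Gemini 2.5 Flash with these settings looks like via the public google-genai Python SDK (Google's internal plumbing is, of course, different, and `prompt` is a placeholder):

```python
def diagnosis_generation_config():
    """Sampling settings the article reports: near-deterministic decoding."""
    return {"temperature": 0.1, "top_p": 0.8}

# With the public SDK, the call would look roughly like (needs an API key):
#
#   from google import genai
#   client = genai.Client()
#   response = client.models.generate_content(
#       model="gemini-2.5-flash",
#       contents=prompt,
#       config=diagnosis_generation_config(),
#   )

print(diagnosis_generation_config())
```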
The prompt itself is the most instructive part of this research. It walks the model through an explicit step-by-step protocol: scan log sections, read component context, locate the failure, summarize errors, and only then attempt a conclusion. Critically, it includes hard negative constraints. For example: if the logs don't contain lines from the component that failed, don't draw any conclusion.
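A hedged sketch of that prompt structure: the step order and the refusal rule come from the article, but the exact wording below is invented, not the paper's prompt text.

```python
# Illustrative prompt template: explicit protocol steps plus a hard negative
# constraint that forces refusal when the failing component's logs are absent.
PROMPT_TEMPLATE = """\
You are diagnosing a failed integration test. Follow these steps in order:
1. Scan each log section below.
2. Read the component metadata for context.
3. Locate where the failure first appears.
4. Summarize the errors you found.
5. Only then state a conclusion.

Hard constraints:
- If the logs do not contain lines from the component that failed,
  do NOT draw any conclusion; answer "more information is required".

Component metadata:
{metadata}

Merged log stream (INFO and above, sorted by timestamp):
{logs}
"""

prompt = PROMPT_TEMPLATE.format(
    metadata="backend: shopping-cart service (invented example)",
    logs="[12:00:12] backend: ERROR config key 'quota' missing",
)
print(prompt)
```

The refusal constraint is the design choice worth copying: it trades a few "no answer" findings for far fewer hallucinated root causes.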
The model's response is post-processed into a markdown finding with ==Conclusion==, ==Investigation Steps==, and ==Most Relevant Log Lines== sections, then posted as a comment in Critique, Google's internal code review system. Each cited log line is rendered as a clickable link.
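A minimal sketch of that post-processing step, using the three section headers the article names; the link URL scheme is an assumption (Critique's real log-linking format is internal):

```python
def render_finding(conclusion, steps, log_lines,
                   log_url_base="https://logs.example/"):
    """Assemble a markdown finding with the three reported sections."""
    parts = ["==Conclusion==", conclusion, "", "==Investigation Steps=="]
    parts += [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    parts += ["", "==Most Relevant Log Lines=="]
    # Each cited line becomes a clickable markdown link (URL scheme assumed).
    parts += [f"- [{line}]({log_url_base}{i})"
              for i, line in enumerate(log_lines)]
    return "\n".join(parts)

finding = render_finding(
    conclusion="backend crashed: required config key 'quota' missing",
    steps=["Scanned driver log: generic RPC timeout",
           "Located first ERROR in backend log"],
    log_lines=["[12:00:12] backend: ERROR config key 'quota' missing"],
)
print(finding)
```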
Numbers from production
Auto-Diagnose averages 110,617 input tokens and 5,962 output tokens per execution, and posts findings with a p50 latency of 56 seconds and a p90 of 346 seconds, fast enough that developers see the diagnosis before they've switched contexts.
Critique exposes three feedback buttons on a finding: Please fix (used by reviewers), Helpful, and Not helpful (both used by authors). Across 517 total feedback reports from 437 distinct developers, 436 (84.3%) were "Please fix" from 370 reviewers, by far the dominant interaction and a sign that reviewers are actively asking authors to act on the diagnoses. Among author-side feedback, the helpfulness ratio (H / (H + N)) is 62.96%, and the "Not helpful" rate (N / (PF + H + N)) is 5.8%, well below Google's 10% threshold for keeping a tool live. Across 370 tools that post findings to Critique, Auto-Diagnose ranks #14 in helpfulness, putting it in the top 3.78%.
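The two ratios can be reproduced from counts consistent with the article's figures. PF = 436 is reported directly; H = 51 and N = 30 below are inferred as the integer counts that match the reported 62.96% helpfulness ratio, 5.8% not-helpful rate, and 517 total reports:

```python
# PF = "Please fix" (reported); H = "Helpful" and N = "Not helpful" are
# inferred from the reported percentages, not stated directly in the article.
PF, H, N = 436, 51, 30

total = PF + H + N                   # all feedback reports
helpfulness = H / (H + N)            # author-side helpfulness ratio
not_helpful_rate = N / (PF + H + N)  # "Not helpful" against all feedback

print(total)                          # 517
print(f"{helpfulness:.2%}")           # 62.96%
print(f"{not_helpful_rate:.1%}")      # 5.8%
```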
The manual evaluation also surfaced a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs weren't properly saved on crash, and three were because SUT component logs weren't saved when the component crashed; both are real infrastructure bugs, reported back to the relevant teams. In production, around 20 "more information is required" diagnoses have similarly helped surface infrastructure issues.
Key Takeaways
- Auto-Diagnose hit 90.14% root-cause accuracy in a manual evaluation of 71 real-world integration test failures spanning 39 teams at Google, addressing a problem that 6,059 developers ranked among their top five complaints in the EngSat survey.
- The system runs on Gemini 2.5 Flash with no fine-tuning, just prompt engineering. A pub/sub trigger collects logs across data centers and processes, joins them by timestamp, and sends them to the model at temperature 0.1 and top_p 0.8.
- The prompt is engineered to refuse rather than guess. Hard negative constraints force the model to answer "more information is required" when evidence is missing, a deliberate trade-off that prevents hallucinated root causes and even helped surface real infrastructure bugs in Google's logging pipeline.
- In production since May 2025, Auto-Diagnose has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes from 22,962 developers, posting findings at a p50 of 56 seconds, fast enough that engineers see the diagnosis before switching contexts.
Check out the pre-print paper here.
The post Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale appeared first on MarkTechPost.
