Google Releases Gemini-SQL2: Gemini 3.1 Pro Text-to-SQL Scores 80.04% on BIRD Single-Model Leaderboard

Google Research group has introduced the launch of Gemini-SQL2 on X. They described this method as a breakthrough text-to-SQL functionality powered by Gemini 3.1 Pro. Gemini-SQL2 posted 80.04% execution accuracy on the BIRD Text-to-SQL Leaderboard (Single Model). Google’s chart locations it above its personal Gemini-SQL, the prior high entry. The metric measures whether or not generated SQL runs and returns right outcomes, not whether or not it appears legitimate.

https://x.com/GoogleResearch/standing/2065475343205740911

Gemini-SQL2

Gemini-SQL2 is a text-to-SQL functionality, not a standalone basis mannequin launch. It interprets pure language questions into what Google calls ‘execution-ready SQL queries.’ The functionality is constructed on Gemini 3.1 Pro.

Per the announcement on X, “knowledge subtlety & advanced enterprise contexts make producing correct SQL from pure language notoriously exhausting.” The X Post additionally said that “improved SQL understanding can elevate pure language abilities throughout Google’s knowledge providers.” That factors towards integration targets like BigQuery Studio, AlloyDB AI, and Cloud SQL Studio, which already ship Gemini-based SQL technology. Google has not but confirmed which merchandise will obtain Gemini-SQL2.

Benchmarks

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is an business normal for this job. It accommodates 12,751 question-SQL pairs throughout 95 databases spanning 37 skilled domains, totaling 33.4GB. The databases embody soiled values and require exterior information grounding, not like older benchmarks akin to Spider.

BIRD measures execution accuracy (EX): the generated SQL should run and return outcomes matching the gold question. Google said this straight. “Per the BIRD benchmark, which measures execution-verified accuracy, GeminiSQL-2’s SQL doesn’t simply look proper, it additionally runs efficiently.”

The Single Trained Model Track restricts the preprocessing, retrieval, and agentic frameworks that ensembles use to spice up scores. It measures the mannequin’s core text-to-SQL skill. Google Cloud’s prior document on this observe, reported November 15, 2025, was 76.13. Google benchmarks human efficiency at 92.96, leaving a 12.92-point hole from 80.04.

How the Leaderboard Stacks Up

Google’s chart, on X put up, reveals Gemini-SQL2 forward of eight named rivals, together with a number of unlabeled factors. Only 80.04% is said as textual content. The values under are learn from the chart’s place and are approximate; dates mirror every level’s horizontal placement.

System	Organization	BIRD Execution Accuracy (Single Model)	Chart Date
Gemini-SQL2	Google	80.04% (said)	Jun 2026
Gemini-SQL	Google	~77.2%	Mar 2026
Q-SQL	AWS	~76.5%	Dec 2025
Databricks RLVR 32B	Databricks	~75.7%	Jul 2025
SiriusAI-Text2SQL-32B-v2	Tencent	~75.0%	Dec 2025
Arctic-Text2SQL-R1-32B	Snowflake	~73.9%	Jun 2025
GPT-5.5-xhigh	OpenAI	~72.5%	Apr 2026
SQLWeaver-32B	Alibaba	~71.7%	May 2026
Claude Opus 4.6	Anthropic	~70.1%	Feb 2026

Two patterns are seen. Google now holds the highest two named positions, Gemini-SQL2 and Gemini-SQL. Several specialised 32B SQL fashions additionally sit above some basic frontier fashions on this chart.

Use Cases with Examples

Self-service analytics: A income supervisor asks for month-to-month recurring income by area, for accounts that churned inside 90 days of improve. This wants joins, window logic, and date arithmetic. Execution-verified technology catches SQL that runs however returns flawed rows.
Data engineering drafts: Devs can draft BigQuery transformations from English, then evaluate somewhat than write from scratch. Google’s November 2025 work recognized schema understanding because the exhausting half. Higher BIRD scores mirror higher dealing with of ambiguous columns and messy values.
Embedded “ask your knowledge” options: SaaS groups including natural-language question interfaces nonetheless want human evaluate at 80% accuracy. One in 5 queries may be flawed. The rating units expectations, not a elimination of evaluate.

Gemini-SQL2 Launch: Community Reception Dashboard

Gemini-SQL2 Launch: Community Reception Dashboard

Verified public engagement on Google Research’s announcement posts • first ~3 hours • Jun 12, 2026

X views

X likes

X bookmarks

Reposts (X + LinkedIn)

BIRD Single-Model Leaderboard • Execution Accuracy

Platform Engagement Breakdown

X / Twitter (most important put up)

Views144.4K

Likes2,800

Reposts267

Bookmarks1,300

Replies64

Engagement charge3.1%

LinkedIn (most important put up)

Reactions349+

Comments12

Reposts27

Reception sign

Bookmark-plus-like to answer ratio on X. A excessive save charge with few replies usually indicators approval over controversy. Comment-level sentiment not but measurable; replies nonetheless loading at seize time.

Data verified Jun 12, 2026 from Google Research posts on X (9:44 AM PT, 144.4K views) and LinkedIn (348 reactions + creator, 3h after posting). Leaderboard values in addition to 80.04% are learn from Google’s revealed chart and marked approximate (~). Dashboard by Marktechpost.

“+
“

“+d.txt+”

“;
field.appendChild(row);
});
perform fmt(n,f){
if(f===”okay”&&n>=1000){return (n/1000).toFixed(n>=100000?1:1).substitute(/.0$/,””)+”Ok”;}
return n.toLocaleString();
}
perform animate(){
Array.prototype.forEach.name(doc.querySelectorAll(“#mtp-gsql2-dash .dd-bar-fill”),perform(b){
b.type.setProperty(“width”,b.getAttribute(“data-w”)+”%”,”necessary”);
});
Array.prototype.forEach.name(doc.querySelectorAll(“#mtp-gsql2-dash .dd-num”),perform(el){
var finish=parseInt(el.getAttribute(“data-count”),10),f=el.getAttribute(“data-fmt”),
begin=null,dur=1100;
perform step(ts){if(!begin)begin=ts;var p=Math.min((ts-start)/dur,1);
el.textContent=fmt(Math.spherical(finish*(p*(2-p))),f);
if(p