xAI’s Grok 4.1 Pushes Toward Higher Emotional Intelligence, Lower Hallucinations and Tighter Safety Controls
How do you build an AI assistant that feels emotionally intelligent and reliable to people, instead of simply making a bigger model? Meet Grok 4.1, xAI’s latest large language model, which now powers Grok across grok.com, X and the mobile consumer apps. According to the xAI team, the model is available to all users and is rolling out in Auto mode, with an option to select ‘Grok 4.1’ explicitly in the model picker.
Deployment and preference gains
According to the xAI team’s post, it ran a silent rollout of preliminary Grok 4.1 builds between November 1 and November 14, 2025. During this period, the team shifted a growing slice of production traffic on grok.com, X and mobile clients to 4.1 variants and used blind pairwise evaluations on live conversations.
Against the previous production Grok model, Grok 4.1 responses were preferred 64.78 percent of the time in these online A/B tests. This is not a lab benchmark. It is a direct comparison on real user queries, so it is useful for engineers who care about perceived quality in deployment conditions rather than only synthetic benchmarks.
Two configurations, two top positions
Grok 4.1 comes in two configurations. Grok 4.1 Thinking, code name quasarflux, runs an explicit internal reasoning phase before producing a final message. Grok 4.1 in non-reasoning mode, code name tensor, skips the extra reasoning tokens and targets latency and cost.
On LMArena’s Text Arena leaderboard, xAI reports that Grok 4.1 Thinking holds the #1 overall position with 1483 Elo, which is 31 points above the strongest non-xAI model. The fast non-reasoning Grok 4.1 variant ranks number 2 with 1465 Elo and still surpasses every other model’s full reasoning configuration on that public board. Elon Musk highlighted this result in a short post, stating that ‘Grok 4.1 holds both first and second place on LMArena.’
For context, the earlier Grok 4 model had an overall rank of 33 on the same benchmark, so 4.1 represents a large shift in human preference and Elo-based ranking.
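To put the reported Elo gaps in perspective, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the standard Elo logistic formula with a 400-point scale. LMArena may fit and scale its ratings differently, so treat the results as rough intuition rather than the leaderboard's own computation. The function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: the classic 400-point scale. This is not necessarily the
    exact procedure LMArena uses to fit its leaderboard.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings reported in the article.
thinking_elo = 1483
non_thinking_elo = 1465
best_non_xai_elo = thinking_elo - 31  # "31 points above the strongest non-xAI model"

p_vs_non_thinking = elo_expected_score(thinking_elo, non_thinking_elo)
p_vs_best_non_xai = elo_expected_score(thinking_elo, best_non_xai_elo)

print(f"Thinking vs non-reasoning 4.1: {p_vs_non_thinking:.3f}")
print(f"Thinking vs strongest non-xAI model: {p_vs_best_non_xai:.3f}")
```

Under this assumption, gaps of 18 and 31 points both correspond to expected preference rates only modestly above 50 percent, which is why small Elo margins at the top of a leaderboard should be read with care.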

Reinforcement learning on style, personality and alignment
The Grok 4.1 announcement focuses less on architectural details and more on the post-training pipeline. xAI reuses the large-scale reinforcement learning infrastructure that was built for Grok 4 and applies it specifically to style, personality, helpfulness and alignment.
A key technical point is reward modeling. Many of these objectives do not have clear ground truth labels, so they are non-verifiable. xAI describes using frontier agentic reasoning models as reward models that grade candidate responses autonomously at scale. These reward signals then drive reinforcement learning updates on Grok 4.1. For developers, this is a concrete production example of model-based supervision where strong models act as graders for other models inside a closed-loop training system.
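The grading step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not xAI's pipeline: `toy_grader` is a hypothetical keyword heuristic standing in for a grader model, and subtracting the group mean from each reward is one common baseline choice that the source does not confirm xAI uses.

```python
from statistics import mean
from typing import Callable, List


def toy_grader(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a grader model.

    A real system would call a strong model with a rubric. Here a crude
    heuristic rewards acknowledging feelings and lightly penalises length.
    """
    score = 1.0 if "sorry" in response.lower() else 0.0
    score -= 0.01 * len(response.split())
    return score


def score_candidates(
    prompt: str,
    candidates: List[str],
    grader: Callable[[str, str], float],
) -> List[float]:
    """Grade each candidate and centre rewards on the group mean.

    The centred values (advantages) are the kind of signal a policy
    gradient update could consume. Assumption: a mean baseline.
    """
    rewards = [grader(prompt, c) for c in candidates]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


prompt = "I failed my exam and feel terrible."
candidates = [
    "I am sorry to hear that. Do you want to talk about it?",
    "That happens.",
    "You should simply move on and stop thinking about it.",
]
advantages = score_candidates(prompt, candidates, toy_grader)
best = candidates[advantages.index(max(advantages))]
print(best)
```

The point of the sketch is the shape of the loop: no ground truth label appears anywhere, and the only supervision comes from the grader's relative scores.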

Measuring emotional intelligence and creative writing
To quantify changes in interpersonal behavior, Grok 4.1 is evaluated on EQ Bench3. EQ Bench3 is a multi-turn benchmark that focuses on emotional intelligence in role-play and analysis tasks, judged by Claude Sonnet 3.7. It measures skills such as empathy, psychological insight and social reasoning.
EQ Bench3 uses a test set with 45 challenging role-play scenarios, most of which span 3 turns. Scores combine rubric evaluation and Elo-style model battles. xAI runs the official benchmark repository with default sampling settings and the prescribed judge, without a system prompt, and reports rubric and normalized Elo scores, while working with the benchmark authors to integrate the numbers into the public leaderboard.
A separate Creative Writing v3 benchmark measures performance on 32 prompts with 3 generations per prompt and uses a similar rubric plus battle-based evaluation pipeline.
Reducing hallucinations for information seeking
xAI targets hallucination reduction primarily in the fast, non-reasoning configuration, which runs with web search tools and is used for quick information-seeking answers.
For this setting, the team evaluates hallucination rate on a stratified sample of real production queries where users expect factual answers. They also run FActScore, a public benchmark with 500 biography questions that scores factual consistency.

In the methodology, hallucination rate is defined as the macro average of the percentage of atomic claims with major or minor errors across model responses. Evaluations are performed with the non-reasoning Grok 4.1 model and web search tools enabled, matching the intended deployment mode. The accompanying plot shows Grok 4.1 non-reasoning improving both hallucination rate and FActScore relative to Grok 4 Fast.
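The macro-average definition can be made concrete with a small sketch. It assumes each response has already been split into atomic claims labelled `"ok"`, `"minor"` or `"major"`. Those label names and the sample data are illustrative, not taken from xAI's methodology. The error share is computed per response first and then averaged, so a long response does not outweigh a short one.

```python
from typing import List

ERROR_LABELS = {"major", "minor"}


def response_error_share(claim_labels: List[str]) -> float:
    """Share of atomic claims in one response that carry an error."""
    flawed = sum(1 for label in claim_labels if label in ERROR_LABELS)
    return flawed / len(claim_labels)


def macro_hallucination_rate(responses: List[List[str]]) -> float:
    """Average the per-response error shares (macro average).

    Responses with no extracted claims are skipped (an assumption).
    """
    shares = [response_error_share(r) for r in responses if r]
    if not shares:
        raise ValueError("no responses with claims to score")
    return sum(shares) / len(shares)


# Made-up labels for three responses, purely for illustration.
labelled_responses = [
    ["ok", "ok", "minor", "ok"],  # 1 of 4 claims flawed
    ["major", "ok"],              # 1 of 2 claims flawed
    ["ok", "ok", "ok"],           # no flawed claims
]

macro_rate = macro_hallucination_rate(labelled_responses)

# For contrast, the micro average pools all claims before dividing.
all_claims = [label for r in labelled_responses for label in r]
micro_rate = response_error_share(all_claims)

print(f"macro: {macro_rate:.3f}, micro: {micro_rate:.3f}")
```

In this toy data the macro rate is 0.25 while the pooled (micro) rate is 2 of 9 claims, which shows how the two averaging choices can diverge on the same labels.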
Safety, deception, sycophancy and dual use
The Grok 4.1 technical report gives a detailed safety evaluation. The model is available in two configurations, Grok 4.1 Non Thinking and Grok 4.1 Thinking, and both are tested with the production system prompt.
For abuse potential, xAI reports low answer rates on internal harmful request datasets and on AgentHarm, which measures malicious agentic tasks. The new input filter for restricted biology and chemistry shows a false negative rate of 0.03 for restricted biology prompts and 0.00 for restricted chemistry prompts, with higher false negative rates when prompt injection attacks are added, which indicates remaining vulnerability under adversarial conditions.
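As a reminder of what the false negative rate measures here, the sketch below computes it for an input filter evaluated on a set of prompts that are all known to be restricted. The sample size of 100 is invented for illustration. The source reports only the rates, not the number of prompts.

```python
from typing import List


def false_negative_rate(flagged: List[bool]) -> float:
    """Share of known-restricted prompts that the filter failed to flag.

    Each entry is True if the filter blocked the prompt. Every prompt in
    the list is assumed to be restricted, so each False is a miss.
    """
    if not flagged:
        raise ValueError("need at least one restricted prompt")
    misses = sum(1 for hit in flagged if not hit)
    return misses / len(flagged)


# Hypothetical run: 100 restricted prompts, 3 slip through the filter.
filter_results = [True] * 97 + [False] * 3
fnr = false_negative_rate(filter_results)
print(f"false negative rate: {fnr:.2f}")
```

A rate of 0.03 therefore means roughly 3 in every 100 restricted prompts pass the filter unflagged, before any prompt injection is applied.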

The xAI team also measures deception using the MASK benchmark and sycophancy using Anthropic’s sycophancy evaluation. Training is explicitly aimed at reducing lies and sycophantic behavior. However, the reported dishonesty rates on MASK are 0.49 for Grok 4.1 Thinking and 0.46 for Grok 4.1 Non Thinking, compared with 0.43 for Grok 4, and sycophancy rates are 0.19 and 0.23 for the two Grok 4.1 variants, compared with 0.07 for Grok 4. This means that while xAI is training against these behaviors, Grok 4.1 still shows higher measured deception and sycophancy than Grok 4 in this evaluation.

For dual-use capabilities, Grok 4.1 Thinking is tested on WMDP, VCT, BioLP Bench, ProtocolQA, FigQA, CloningScenarios and CyBench. It matches or exceeds reported human baselines on many text-only knowledge and troubleshooting tasks, but remains below human experts on multimodal and complex multi-step biology and cybersecurity tasks.
Key Takeaways
- Grok 4.1 is now available to all users on grok.com, X and the iOS and Android apps and is rolling out in Auto mode.
- The model comes in 2 configurations, a Thinking variant and a fast non-reasoning variant, and both currently hold the top 2 Elo positions on the LMArena Text Arena leaderboard, with 1483 and 1465 Elo.
- Grok 4.1 is trained with large-scale reinforcement learning that uses stronger agentic reasoning models as reward models to optimize style, personality, alignment and real-world helpfulness.
- xAI reports significant reductions in hallucination rate for information-seeking queries in the non-reasoning configuration, shown on both internal production traffic and the FActScore factuality benchmark.
- The Grok 4.1 report shows improved blocking of harmful requests and strong dual-use capabilities, but also higher measured deception and sycophancy rates compared with Grok 4, which is a key alignment trade-off for developers and safety teams to track.
Editorial Comments
xAI’s Grok 4.1 is a good example of a frontier model tuned for production rather than just leaderboard spectacle. The upgrade combines large-scale reinforcement learning with frontier agentic reasoning models as reward models, pushes Grok 4.1 Thinking and non-reasoning to the top of the LMArena Text Arena, and reduces hallucinations for information-seeking prompts, while at the same time exposing a safety trade-off with higher measured deception and sycophancy compared with Grok 4. Overall, Grok 4.1 shows how pushing emotional intelligence and usability can come with measurable alignment regressions that teams must track explicitly.
The post xAI’s Grok 4.1 Pushes Toward Higher Emotional Intelligence, Lower Hallucinations and Tighter Safety Controls appeared first on MarkTechPost.
