Is this AGI? Google’s Gemini 3 Deep Think Cracks Final Personality Test And Scores 84.6% On ARC-AGI-2 Performance Today

Google announced a major update to the Gemini 3 Think Deeply today. This update is specifically designed to accelerate modern science, research, and engineering. This appears to be more than just another model release. It represents a pivot towards a ‘thinking mode’ that uses internal validation to solve problems that previously required expert intervention.
The updated model hits benchmarks that redefine the frontier of intelligence. By focusing on computer test time—the model’s ability to ‘think’ for a long time before making a response—Google goes beyond simple pattern matching.

It redefines AGI by 84.6% in ARC-AGI-2
I ARC-AGI The benchmark is a comprehensive test of intelligence. Unlike traditional benchmarks that test recall, ARC-AGI measures a model’s ability to learn new skills and perform new tasks that have never been seen before. The Google team reported that Gemini 3 Deep Think has been achieved 84.6% to ARC-AGI-2result confirmed by ARC Prize Foundation.
School of 84.6% it’s a huge leap for the industry. To put this in perspective, people average approx 60% in these visual reasoning puzzles, while previous AI models have often struggled to crack 20%. This means that the model no longer predicts the most likely next word. It develops a dynamic internal representation of logic. This skill is important R&D areas where engineers are dealing with dirty, incomplete, or novel data not in the training set.
Passing ‘The Final Examination of Humanity‘
Google has also set a new standard Final Human Examination (HLE)goals 48.4% (without tools). The HLE is a 1000-question test designed by subject matter experts to be easy for humans but almost impossible for current AI. These questions concern special academic topics where data is scarce and logic is dense.
To gain 48.4% without external search tools it is a sign of thinking models. This performance shows that Gemini 3 Deep Think can handle high-level mental processing. It can work through multi-step logical chains in fields such as advanced law, philosophy, and mathematics without drifting into ‘ideas.’ It proves that the model’s internal validation systems are effective in pruning wrong ways of thinking.
Contest Code: 3455 Elo Milestone
Physical review is in a competitive setting. Gemini 3 Deep Think now owns a 3455 Elo points to Codeforces. In the world of coding, a 3455 Elo placing the model in the ‘Legendary Grandmaster’ category, a level achieved by only a small fraction of programmers worldwide.
This feature means that the model exceeds algorithmic robustness. It can handle complex data structures, optimize time complexity, and solve problems that require intensive memory management. This model works as an elite pair organizer. It’s especially useful for ‘agent scripting’—where AI takes a high-level goal and performs a complex, multi-file solution automatically. In an internal test, the Google team noticed that the Gemini 3 Pro was shown 35% higher accuracy in solving software engineering challenges than previous versions.
Emerging Sciences: Physics, Chemistry, and Mathematics
Google’s review is specifically aimed at scientific discovery. Gemini 3 Critical Thinking is achieved gold medal level results in the written sections of 2025 International Physics Olympiad as well as 2025 International Chemistry Olympiad. It also reached gold medal status in International Math Olympiad 2025.
Beyond these student-level competitions, the model works at the professional research level. Score a goal 50.5% you have CMT-Benchmarkwhich tests proficiency in advanced theoretical physics. For researchers and data scientists in biotech or material science, this means that the model can help interpret experimental data or model physical systems.
Functional Engineering and 3D Modelling
Model thinking is not just abstract; has an active engineering function. A new ability highlighted by the Google team is the model’s ability to open a draw on a 3D printable object. Deep Think can analyze a 2D drawing, model a complex 3D shape with code, and generate the final file for a 3D printer.
This reflects the ‘agent’ nature of the model. It can bridge the gap between a visual idea and a real product by using code as a tool. For developers, this reduces the friction between design and prototyping. It is also effective in solving complex operational problems, such as designing recipes for growing thin films through special chemical processes.
Key Takeaways
- Breakthrough Abstract Reasoning: The derived model 84.6% to ARC-AGI-2 (confirmed by the ARC Prize Foundation), proving that it can learn novel tasks and make generalizations rather than relying on memorized training data.
- Code Performance Elite: with 3455 Elo points to CodeforcesGemini 3 Deep Think plays at the ‘Legendary Grandmaster’ level, surpassing a large number of competing programmers in algorithmic complexity and system architecture.
- A New Standard of Professional Reasoning: Score a goal 48.4% to Humanity’s Final Test (without tools), demonstrating the ability to solve high-level, multi-step logic chains previously considered ‘too human’ for AI to solve.
- Science Olympiad Success: The derived model gold medal level results in the written sections of 2025 International Physics and Chemistry Olympiadsdemonstrating its potential for expert-grade research and complex physical modeling.
- Estimated Time Calculator: Unlike traditional LLMs, this ‘Deep Thinking’ mode is active computer test time to confirm internally and correct its reasoning before responding, greatly reduces technical opinions.
Check it out Technical details here. Also, feel free to follow us Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.




