“It’s smarter than almost all graduate students in all disciplines simultaneously,” Elon Musk said during the launch livestream for Grok 4, the most advanced version yet of his AI startup xAI’s chatbot.
Grok 4 is the latest iteration of xAI’s large language model (LLM) Grok, and it arrives not with minor updates but with some major enhancements over its predecessors. According to early testers, the Grok 4 series demonstrates a massive leap in LLMs, reportedly owing to its use of the technique known as reinforcement learning with verifiable rewards (RLVR). RLVR is a method in which an AI agent learns to make decisions by interacting with its environment and receiving rewards or penalties for its actions.
Grok was launched in 2023 as a model focused entirely on next-token prediction, the fundamental language-modelling task of predicting the next word, or token, in a sequence of text. Subsequent models in the line scaled this up: Grok 3, in particular, used roughly 10x more compute, leading to better pre-training results. Grok 3.5 introduced reasoning capabilities to xAI’s LLMs using reinforcement learning, and Grok 4 has now taken this much further. With its heavy emphasis on RLVR, Grok 4 appears to have outdone frontier models from OpenAI, Google, Anthropic, and others.
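Next-token prediction can be made concrete with a toy example. The sketch below is a minimal illustration, not how Grok itself works: it uses simple bigram counts to “predict” the next token, whereas real LLMs learn the same conditional distribution with billions of neural-network parameters trained on vast corpora.

```python
from collections import Counter, defaultdict

# Toy illustration of next-token prediction: a bigram model that
# predicts the next token from counts observed in a tiny corpus.
corpus = "the model predicts the next token in the sequence".split()

# Count how often each token follows each preceding token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation seen after `token`."""
    candidates = follows.get(token)
    return candidates.most_common(1)[0][0] if candidates else "<unk>"

print(predict_next("the"))  # e.g. "model" -- ties resolve to the first seen
```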
For the uninitiated, the technique rewards an AI model when it correctly solves problems with known answers, such as math equations or scientific facts. The idea is that repeatedly training the model on such verifiable problems improves its reasoning abilities. During the demonstration, Musk’s engineers even shared that they were running out of such problems, hinting that real-world environments, which offer unlimited verifiable feedback, may soon be the best training grounds.
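To make the loop concrete, here is a minimal, hypothetical sketch of RLVR: a stand-in model answers arithmetic problems with known solutions, a verifier assigns a binary reward, and the resulting rollouts are what a policy-gradient step (for example, PPO or GRPO) would then learn from. Everything here, including `toy_model`, is illustrative; xAI has not published its actual training stack.

```python
import random

problems = [("2 + 3", 5), ("7 * 6", 42), ("10 - 4", 6)]

def toy_model(prompt: str) -> int:
    """Stand-in policy: answers correctly only some of the time."""
    correct = eval(prompt)  # safe here: prompts are trusted literals
    return correct if random.random() < 0.6 else correct + 1

def verify(answer: int, reference: int) -> float:
    """Verifiable reward: 1.0 for an exactly correct answer, else 0.0."""
    return 1.0 if answer == reference else 0.0

# One rollout phase: collect (prompt, answer, reward) triples that a
# policy-update step would use to reinforce correct reasoning.
rollouts = []
for prompt, reference in problems:
    answer = toy_model(prompt)
    rollouts.append((prompt, answer, verify(answer, reference)))

print(rollouts)
```

The key property is that the reward needs no human judge: correctness is checked mechanically, so training can be scaled until, as xAI’s engineers noted, the supply of checkable problems runs out.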
Why is Grok 4 the smartest LLM yet?
For any LLM, the ultimate test of its abilities is its score on popular benchmarks that assess how well it answers questions, solves logical problems, identifies patterns, and handles coding tasks. In the last few years, big tech companies have been shipping AI models in a game of one-upmanship, which is perhaps why we keep hearing each new model introduced as the ‘best and most advanced AI yet’. While benchmark scores are key to judging an AI model’s capabilities, its real-world performance and practical usefulness may vary significantly.
Elon Musk’s Grok 4, xAI claims, has shown remarkable performance on benchmarks across categories. One notable benchmark the LLM crushed is ‘Humanity’s Last Exam’, considered one of the most difficult AI benchmarks in the world. The test evaluates a model’s knowledge and understanding in academic fields such as biology, physics, computer science, and engineering, and is designed to challenge even the brightest human experts. Without tools, Grok 4 secured 26.9 per cent, surpassing Google Gemini 2.5 Pro’s 21.6 per cent and OpenAI’s o3, which scored close to 20 per cent. With tools such as web browsing, memory, and coding environments, the model scored 41 per cent. And with scaled test-time compute, Grok 4 Heavy secured 50.7 per cent, a significant leap. The Heavy model’s distinguishing feature is its collaborative multi-agent architecture: it spawns multiple AI agents that work as a team to solve problems, share insights, and refine responses collectively.
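xAI has not disclosed how Grok 4 Heavy’s agents actually coordinate, but the general shape of such a multi-agent setup can be sketched. In the hypothetical Python sketch below, `ask_agent` is a placeholder for a model call; several agents attempt the same question in parallel, and a simple majority vote stands in for the richer “share insights and refine” step the company describes.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def ask_agent(agent_id: int, question: str) -> str:
    # In a real system this would call the model; here it is stubbed
    # so that agents fall into two camps and a majority can emerge.
    return f"answer-from-agent-camp-{agent_id % 2}"

def heavy_answer(question: str, n_agents: int = 5) -> str:
    # Run all agents on the same question in parallel.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(
            lambda i: ask_agent(i, question), range(n_agents)))
    # Aggregate: pick the most common candidate (majority vote).
    return Counter(candidates).most_common(1)[0][0]

print(heavy_answer("Which benchmark did Grok 4 Heavy score 50.7% on?"))
```

Majority voting is only one possible aggregation strategy; debate, critique-and-revise, or a judge model are plausible alternatives, and xAI may use something else entirely.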
Another key benchmark is ARC-AGI, designed to evaluate a model’s abstract reasoning and problem-solving capabilities, including pattern recognition and general reasoning tasks that are easy for humans but much harder for AI models. On ARC-AGI-2, Grok 4 obtained 15.9 per cent, nearly double the previous top score of about 8 per cent, set by Claude Opus 4.
“ARC-AGI-2 is hard for current AI models. To score well, models have to learn a mini-skill from a series of training examples, then demonstrate that skill at test time. The previous top score was ~8% (by Opus 4). Below 10% is noisy; getting 15.9% breaks through that noise barrier. Grok 4 is showing non-zero levels of fluid intelligence,” Greg Kamradt, founder of ARC Prize, posted on X, suggesting that this is a big leap for AI.
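As Kamradt describes, ARC tasks require learning a mini-skill from a few examples and then applying it at test time. The toy sketch below illustrates only that train/test structure; real ARC-AGI tasks use coloured 2-D grids and far subtler transformation rules, so the flat lists and simple value-substitution rule here are simplified stand-ins.

```python
# Training pairs demonstrate an unstated rule; the solver must infer
# it and apply it to a held-out test input.
train_pairs = [
    ([1, 2, 1], [3, 4, 3]),   # implied rule: 1 -> 3, 2 -> 4
    ([2, 2, 1], [4, 4, 3]),
]
test_input = [1, 1, 2]

# "Learn the mini-skill": recover the value mapping from the examples.
mapping = {}
for inp, out in train_pairs:
    for a, b in zip(inp, out):
        mapping[a] = b

# "Demonstrate it at test time": apply the inferred rule.
print([mapping[v] for v in test_input])  # -> [3, 3, 4]
```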
Visualisations, sports predictions, and more
Apart from benchmarks, the engineers also showed during the demonstration how Grok 4 is capable of sports predictions, black hole visualisations, and game design. In the demo, Grok 4 created a scientifically plausible visual of two black holes colliding. The model also has access to real-time data, which allows it to organise timelines of reactions, news developments, and more.
Meanwhile, other benchmarks show Grok 4’s range and versatility. On GPQA, a graduate-level, Google-proof question-answering benchmark, the model scored 88.9 per cent, considered the best so far. On MathArena, it surpassed all rivals with a 96.7 per cent score, and it also dominated the USA Mathematical Olympiad benchmark. A 79.4 per cent score on LiveCodeBench suggests it can also be a top-tier coder. And on AIME 2025, the American Invitational Mathematics Examination, Grok 4 scored a perfect 100 per cent.
Along with traditional benchmarks, Grok 4 was also put to a test of real-world intelligence. VendingBench is a benchmark that simulates the task of managing a vending machine, complete with constraints such as budget and inventory. In the test, AI agents must handle orders, manage inventory and pricing, and essentially make money, which makes it a measure of an AI model’s long-term coherence. Grok 4 finished with a net worth of $4,700, outperforming top AI models and even human participants: in comparison, GPT-3.5 scored $1,800, and a human test taker could only net $844. Grok 4’s performance on VendingBench demonstrates its ability to reason, plan, and act in unpredictable situations over long horizons.
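The long-horizon character of the test can be sketched with a heavily simplified simulation. The loop below is a hypothetical stand-in, not the actual VendingBench environment (which also involves supplier emails, pricing decisions, and richer dynamics); it only shows the shape of managing cash and inventory over a simulated year and scoring the final net worth.

```python
import random

cash, stock = 500.0, 0          # starting budget, empty machine
UNIT_COST, UNIT_PRICE = 1.0, 2.0

for day in range(365):
    # Agent policy (a trivial rule here): keep roughly 40 units stocked,
    # never spending more cash than is on hand.
    restock = min(max(0, 40 - stock), int(cash // UNIT_COST))
    cash -= restock * UNIT_COST
    stock += restock

    # Environment: random daily demand, capped by what's in the machine.
    sold = min(stock, random.randint(10, 30))
    stock -= sold
    cash += sold * UNIT_PRICE

net_worth = cash + stock * UNIT_COST
print(f"net worth after a year: ${net_worth:,.2f}")
```

An LLM agent replaces the trivial restocking rule with free-form decisions each simulated day, which is why a single bad choice early on, or drifting attention over hundreds of steps, can sink the final score.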
Many users have showcased unique use cases where Grok 4 shone. An xAI team member used the model to build a first-person shooter game in just four hours, with the model automating tasks such as asset sourcing, game logic, and visuals, cutting development time and effort dramatically. Not long ago, Elon Musk claimed that AI will generate full-fledged AAA titles. While this is no AAA game, it shows how far AI has come in video game development.
xAI, a relatively new player, has witnessed phenomenal growth in the last few years. Musk has claimed that the company is currently training its Foundation Model v7, which is expected to be complete soon. The company also reportedly plans to unveil a coding-specialised model in August, a multimodal agent in September, and a video generation model in October.
Are we closer to AGI?
On paper, Grok 4 outshines its peers on numerous high-stakes benchmarks. However, Musk’s claim that Grok 4 is smarter than almost all graduate students needs some context. Grok 4 is still an LLM, meaning it is prone to hallucinations, or confidently producing incorrect information, just like any other AI model; in essence, this is not a new kind of AI. Musk later clarified that his comment about ‘graduate-level’ intelligence was based on the model’s performance on academic tests. One X user noted that while the scores are impressive, presentations can be misleading; the charts shared by xAI, for instance, could exaggerate the differences between models. And despite the astounding scores, several users observed that Grok 4 seems to struggle with visual tasks, showing only a modest improvement over Gemini 2.5 Pro on multimodal benchmarks, which involve both text and images.
Artificial General Intelligence, or AGI, is a theoretical concept of AI systems that possess human-level cognitive abilities. While big tech is racing towards AGI and investing billions of dollars in the pursuit, there is no concrete timeline yet. Grok 4’s performance on benchmarks such as ARC-AGI and Humanity’s Last Exam shows how far AI has advanced, but this is not AGI. Grok 4 is an LLM, prone to confidently making up information, whereas AGI is expected to be grounded in reality. Based on the benchmarks, Grok 4 excels at structured tasks such as math and code but falters at spatial reasoning and nuanced visual understanding. It is not AGI, since it lacks agency and goals, and it does not really learn from its mistakes. To put it simply, Grok 4 mimics thinking but is not yet an autonomous thinker.
On Thursday, July 10, xAI launched Grok 4, the multi-agent Grok 4 Heavy, and the SuperGrok Heavy subscription tier, with a demo led by Elon Musk and engineers from xAI. The new Grok 4 is based on xAI’s Foundation Model v6 and can be accessed via xAI’s platform or through an API. It comes with a 256K context window, multimodal reasoning, real-time web access, and enterprise-grade security. Grok 4 is priced at $30 a month, while Grok 4 Heavy costs $300 a month or $3,000 a year.
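For developers, API access follows the familiar OpenAI-compatible chat-completions pattern. The sketch below is illustrative only: the base URL, model identifier, and key handling are assumptions that should be checked against xAI’s current documentation.

```python
from openai import OpenAI  # xAI exposes an OpenAI-compatible API

client = OpenAI(
    api_key="XAI_API_KEY_HERE",      # placeholder, not a real key
    base_url="https://api.x.ai/v1",  # assumed xAI endpoint
)

response = client.chat.completions.create(
    model="grok-4",  # assumed model identifier; verify against xAI docs
    messages=[{"role": "user", "content": "Summarise RLVR in one line."}],
)
print(response.choices[0].message.content)
```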