A New Open Source Model From China is Crushing the Benchmarks

Zhipu AI — also known as Z.ai — has unveiled the new GLM-4.5 family of AI models, which the company claims outperform Anthropic’s highly regarded Claude 4 Opus on several benchmarks. The release follows Moonshot AI’s Kimi K2, which also posted impressive benchmark performance.

Z.ai, based in Beijing and backed by e-commerce giant Alibaba, released the GLM-4.5 and GLM-4.5-Air models.

Looking at the recent releases of open source AI models from the East, it wouldn’t be an overstatement to say that the future of open source AI may not be led by the West. “It seems like Chinese labs are playing musical chairs at this point,” said Satvik Paramkusham, an engineer, on X.

Both the GLM-4.5 and the GLM-4.5-Air are based on the Mixture of Experts (MoE) architecture and are packed with reasoning, coding and agentic capabilities. GLM-4.5 is built with 355 billion total parameters and 32 billion active parameters, and GLM-4.5-Air with 106 billion total parameters and 12 billion active parameters.
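The active-versus-total parameter split follows from how MoE routing works: a gating network scores every expert for each token, but only the top few experts actually run. Below is a minimal, illustrative sketch of top-k routing — not the GLM-4.5 implementation; the sizes, gate weights, and toy “experts” are all invented for demonstration:

```python
import math
import random

def moe_forward(x, gate_w, experts, k=2):
    """Route one token vector through the top-k of n experts.

    Only the k selected experts run for this token, so the 'active'
    parameter count per token is far smaller than the model's total.
    """
    # Gate: one logit per expert (dot product of token with each gate column).
    logits = [sum(xi * wi for xi, wi in zip(x, col)) for col in gate_w]
    top_k = sorted(range(len(logits)), key=logits.__getitem__)[-k:]
    # Softmax over the selected experts only.
    exps = [math.exp(logits[i]) for i in top_k]
    weights = [e / sum(exps) for e in exps]
    outs = [experts[i](x) for i in top_k]
    # Weighted sum of the chosen experts' outputs.
    return [sum(w * o[j] for w, o in zip(weights, outs))
            for j in range(len(x))]

random.seed(0)
d, n_experts = 8, 4
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
# Each toy 'expert' just scales the input by a fixed factor.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (0.5, 1.0, 1.5, 2.0)]
y = moe_forward([random.gauss(0, 1) for _ in range(d)], gate_w, experts, k=2)
print(len(y))  # 8
```

With k=2 of 4 experts selected, only half the expert parameters touch any given token — the same reason GLM-4.5 activates 32B of its 355B parameters.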

The startup claims that this is an effort to develop a genuinely general model. While acknowledging the capabilities of the models today, Z.ai said, “Models are still not really general: some of them are good at coding, some good at math, and some good at reasoning, but none of them could achieve the best performance across all the different tasks.”

“GLM-4.5 makes efforts toward the goal of unifying all the different capabilities,” added the company. 

And if benchmark scores are anything to go by, it would be fair to assert that Z.ai has fulfilled its claims. Across coding, reasoning, and other benchmarks, these models are comparable to some of the best-performing models today, and in some cases, even surpass models like Claude 4 Opus and OpenAI’s o3. 

Z.ai’s models also consistently outperform DeepSeek-R1, the model that once disrupted both the ecosystem and NVIDIA’s market cap, across multiple evaluations.

In addition, a few users who have tried the model have generally reported a positive experience. One user on Reddit said, “GLM-4.5 is absolutely crushing it for coding – way better than Claude’s recent performance.”

Another user on a Hacker News thread said, “I could get it to consistently use the tools and follow instructions in a way that never really worked well with Deepseek R1 or Qwen. Even compared to Kimi, I feel like this is probably the best open source coding model out right now.”

Furthermore, these models also excel across benchmarks that evaluate their agentic and tool-use capabilities. 

The company tested GLM-4.5 on the BrowseComp benchmark for web browsing, which includes complex questions requiring short answers. With a web browsing tool enabled, it provided correct responses for 26.4% of the questions, outperforming Claude-4-Opus (18.8%) and nearing o4-mini-high (28.3%).

On other benchmarks, such as TAU-Bench (airline and retail), which assesses a model’s ability to reliably perform agentic tasks involving realistic customer interactions in the airline and retail domains, both GLM-4.5 and GLM-4.5-Air perform on par with Claude 4 Sonnet and beat OpenAI’s o3.


These models were also put to the test on the ‘pelican benchmark’ from Simon Willison, co-creator of the Django web framework. The rather amusing test asks AI models to generate an SVG of a pelican riding a bicycle, which helps evaluate a model’s practical coding and creative capabilities.

While several models have historically struggled with this particular test, GLM-4.5 produced an impressive result. “I like how the pelican has its wings on the handlebars,” said Willison.

(Left: SVG created by o3 Pro, Right: SVG created by GLM-4.5)

The GLM-4.5 model costs $0.6 per million input tokens and $2.2 per million output tokens, while the more affordable GLM-4.5-Air variant is priced at $0.2 per million input tokens and $1.1 per million output tokens. Moreover, because the models are open source, more information about their training process has been released, much to the appreciation of developers.
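At those rates, per-request cost is simple arithmetic. A quick sketch using the GLM-4.5 prices quoted above, with a purely hypothetical workload of 50,000 input and 10,000 output tokens:

```python
# Per-million-token prices quoted for GLM-4.5 (USD).
INPUT_PRICE = 0.6   # $ per 1M input tokens
OUTPUT_PRICE = 2.2  # $ per 1M output tokens

def request_cost(input_tokens, output_tokens):
    """Cost in USD for one request at GLM-4.5 rates."""
    return (input_tokens * INPUT_PRICE +
            output_tokens * OUTPUT_PRICE) / 1_000_000

# Hypothetical workload: 50k tokens in, 10k tokens out.
print(round(request_cost(50_000, 10_000), 4))  # 0.052
```

So a fairly large request costs about five cents — a fraction of what comparable closed models charge at their list prices.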

Also Read: OpenAI is Flirting with Danger by Naming China’s Blacklisted Zhipu AI as a Threat

The RL Cherry on Top

Z.ai mentioned that during pre-training, the model was first trained on a corpus of 15 trillion tokens of general information, followed by 7 trillion tokens of code and a reasoning corpus, and then introduced additional stages to train it on more specific domains. 

In the blog post, the company also stated, “We employ loss-free balance routing and sigmoid gates for MoE layers. Unlike DeepSeek-V3 and Kimi K2, we reduce the width (hidden dimension and number of routed experts) of the model while increasing the height (number of layers), as we found that deeper models exhibit better reasoning capacity.” 
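The sigmoid-gate detail is worth unpacking: unlike a softmax gate, a sigmoid scores each expert independently, so raising one expert’s logit does not automatically suppress the scores of the others. A toy sketch of the idea follows — illustrative only, not Z.ai’s actual routing code, and the logits are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_gate(logits, k=2):
    """Score each expert independently with a sigmoid, keep the top-k.

    Because each score is computed in isolation (no shared softmax
    denominator), experts do not compete for probability mass until
    the final normalisation over the selected subset.
    """
    scores = [sigmoid(z) for z in logits]
    top_k = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    total = sum(scores[i] for i in top_k)
    return {i: scores[i] / total for i in top_k}  # normalised gate weights

gates = sigmoid_gate([2.0, -1.0, 0.5, 1.5], k=2)
print(gates)  # weights for the two highest-scoring experts, summing to 1
```

The “loss-free balance routing” Z.ai mentions addresses a related problem — keeping expert load even without an auxiliary balancing loss — but that bookkeeping is omitted here.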

Alongside the models, the company has also open-sourced a reinforcement learning infrastructure called slime, which is said to be engineered for “exceptional flexibility, efficiency, and scalability.” 

Slime’s primary innovations are designed to overcome common RL bottlenecks, particularly in complex agentic tasks, said Z.ai. Some of these techniques involve using a flexible training architecture that enables maximum utilisation of GPUs, and an agent-oriented design that separates rollout engines from the training engines, which helps eliminate some bottlenecks associated with RL. 

Slime is also said to employ the memory-efficient FP8 format for data generation, while retaining the more precise BF16 format for training stability.
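The memory motivation for FP8 is straightforward: weights stored at one byte per parameter take half the space of BF16’s two bytes per parameter. A rough back-of-the-envelope calculation for the 106-billion-parameter GLM-4.5-Air — weights only, ignoring activations and KV cache:

```python
def weight_memory_gb(params, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

PARAMS = 106e9  # GLM-4.5-Air total parameters
print(weight_memory_gb(PARAMS, 2))  # BF16: 2 bytes/param -> 212.0 GB
print(weight_memory_gb(PARAMS, 1))  # FP8:  1 byte/param  -> 106.0 GB
```

Halving weight memory also roughly doubles the rate at which weights stream through memory-bandwidth-bound inference, which is consistent with the high token throughput reported below.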

Casper Hansen, a natural language processing (NLP) scientist, noted on X that the GLM-4.5-Air model can “easily give up to 200 tokens/second” in FP8.

The release of Kimi K2, and now of the GLM-4.5 models from Z.ai, comes at a time when users are eagerly awaiting OpenAI’s GPT-5 and its open source model.

OpenAI now faces an unprecedented challenge. The company that initially released GPT-2 cautiously as open-source now re-enters a market filled with competitors, and this time, it won’t have the first-mover advantage. 

On the other hand, Meta faces a larger challenge, having been the creator of the leading open-source model but now halting development on its most powerful and largest model, the Llama 4 Behemoth. This has left many speculating about Llama 5, especially as the company is now fully dedicated to building a ‘superintelligence’ team. 

The post A New Open Source Model From China is Crushing the Benchmarks appeared first on Analytics India Magazine.
