
Alibaba Qwen QwQ-32B and Its Reinforcement Learning Revolution
The Qwen group at Alibaba has introduced QwQ-32B, a 32-billion-parameter AI model that delivers performance comparable to the much larger DeepSeek-R1. The result highlights the potential of scaling reinforcement learning (RL) on strong foundation models.
The Qwen group has also integrated agent capabilities into the reasoning model, enabling it to think critically, use tools, and adapt its reasoning based on feedback from the environment.
“Expanding RL could elevate model performance beyond standard pretraining and post-training techniques,” the group remarked. “Recent research has shown that RL can considerably boost the reasoning skills of models.”
Alibaba QwQ-32B delivers performance on par with DeepSeek-R1, which features 671 billion parameters (with 37 billion activated), showcasing the efficacy of RL when utilized with solid foundational models pretrained on vast global knowledge. This extraordinary result highlights the potential of RL to bridge the divide between model size and performance.
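The scale gap can be made concrete with a quick calculation using the parameter counts cited above:

```python
# Parameter counts cited in the article
deepseek_r1_total = 671e9   # DeepSeek-R1 total parameters
deepseek_r1_active = 37e9   # DeepSeek-R1 parameters activated per token (MoE)
qwq = 32e9                  # QwQ-32B (dense)

# QwQ-32B is roughly 21x smaller than DeepSeek-R1's total parameter count,
# and slightly smaller than even R1's per-token activated count.
print(f"total ratio:  {deepseek_r1_total / qwq:.1f}x")   # ~21.0x
print(f"active ratio: {deepseek_r1_active / qwq:.2f}x")  # ~1.16x
```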
The model has been assessed across various benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, which are designed to evaluate its mathematical reasoning, coding skills, and overall problem-solving abilities.
The findings emphasize Alibaba QwQ-32B’s performance in relation to other top models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
Benchmark findings:
AIME24: Alibaba QwQ-32B scored 79.5, just behind DeepSeek-R1-671B’s 79.8, while significantly outperforming OpenAI o1-mini’s 63.6 and the distilled models.
LiveCodeBench: QwQ-32B scored 63.4, close to DeepSeek-R1-671B’s 65.9 and ahead of both the distilled models and OpenAI o1-mini’s 53.8.
LiveBench: QwQ-32B scored 73.1, edging out DeepSeek-R1-671B’s 71.6 and outperforming the distilled models and OpenAI o1-mini’s 57.5.
IFEval: QwQ-32B achieved 83.9, closely matching DeepSeek-R1-671B’s 83.3 and leading the distilled models and OpenAI o1-mini’s 59.1.
BFCL: Alibaba QwQ-32B recorded 66.4, ahead of DeepSeek-R1-671B’s 62.8, the distilled models, and OpenAI o1-mini’s 49.3.
The Qwen group’s methodology employed a cold-start checkpoint and a multi-phase RL process driven by outcome-based rewards. The initial phase concentrated on scaling RL for mathematical and coding tasks, utilizing accuracy verifiers and code execution servers. The subsequent phase broadened to encompass general capabilities, incorporating rewards from general reward models and rule-based verifiers.
“We discover that this phase of RL training, even with a limited number of steps, can enhance the performance of other general capabilities, such as instruction adherence, alignment with human preferences, and agent efficiency, without substantial performance decline in math and coding,” the group elucidated.
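The article does not publish the team's actual reward code, but the outcome-based rewards described above (an accuracy verifier for math, a code-execution server for coding) can be sketched roughly as follows; the function names and the binary pass/fail scoring are illustrative assumptions, not Qwen's implementation:

```python
import os
import subprocess
import sys
import tempfile


def math_reward(model_answer: str, reference: str) -> float:
    """Outcome-based reward for math: 1.0 only if the final answer matches."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def code_reward(solution: str, tests: str) -> float:
    """Outcome-based reward for coding: run the candidate against unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=10
        )
        # Reward 1.0 if every test assertion passed (exit code 0).
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.remove(path)
```

In this style of training, the reward signal comes from verified outcomes (correct answers, passing tests) rather than from a learned preference model, which is what makes it well suited to math and coding tasks.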
Alibaba QwQ-32B is open-weight, available on Hugging Face and ModelScope under the Apache 2.0 license, and also accessible via Qwen Chat. The Qwen group sees this as a first step in scaling RL to improve reasoning capabilities, and plans to further investigate combining agents with RL for long-horizon reasoning.
“As we aim to develop the next iteration of Qwen, we believe that merging stronger foundational models with RL driven by scaled computational resources will bring us closer to realizing Artificial General Intelligence (AGI),” the group affirmed.