As the race to develop powerful language models heats up, the Qwen2 series is emerging as a contender, bringing new advancements to the realm of artificial intelligence.
Following the surge of interest in large language models (LLMs) sparked by ChatGPT and reinforced by the Llama series, demand has grown for open-source, GPT-level models that can run locally. Meanwhile, proprietary models such as Claude-3 Opus and GPT-4o continue to hold top positions in the Chatbot Arena, a platform notable for its human evaluations.
Among these advancements, the Qwen series has made significant strides. Over recent months, the Qwen team has launched several models, including Qwen1.5, the vision-language model Qwen-VL, and the audio-language model Qwen-Audio. Their latest addition, Qwen2, marks a substantial leap forward.
Qwen2 is a family of LLMs and large multimodal models grounded in the Transformer architecture. These models are trained using next-token prediction and include both foundational (base language models) and instruction-tuned variants. The release comprises four dense models ranging from 0.5 billion to 72 billion parameters, alongside a Mixture-of-Experts (MoE) model with 57 billion parameters, of which 14 billion are activated per token. Smaller models are tailored for portable devices, while larger models target deployment across various GPU scales.
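For readers who want to try the released checkpoints, the sketch below shows how one of the instruction-tuned variants could be loaded with the Hugging Face transformers library; the model ID Qwen/Qwen2-7B-Instruct and the generation settings are illustrative assumptions rather than details taken from the report.

```python
# Minimal sketch: loading and prompting a Qwen2 instruct checkpoint with transformers.
# The Hub ID "Qwen/Qwen2-7B-Instruct" and max_new_tokens are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat prompt with the tokenizer's chat template and generate a reply.
messages = [{"role": "user", "content": "Summarize grouped query attention in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```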
The models were pre-trained on a dataset of over 7 trillion tokens spanning many languages and domains, with an expanded share of code and mathematics content intended to strengthen reasoning abilities.
Post-training stages, namely supervised fine-tuning (SFT) and direct preference optimization (DPO), align the models with human preferences by learning from human feedback, enabling them to follow instructions effectively.
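Direct preference optimization trains on pairs of preferred and rejected responses without a separate reward model. The PyTorch sketch below implements the standard DPO loss under its usual formulation; the beta value and variable names are illustrative and not taken from Qwen2's training configuration.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) of the chosen or rejected response under the policy being trained
    or a frozen reference model. `beta` controls how far the policy may drift
    from the reference; 0.1 is an illustrative value.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```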
In evaluations, Qwen2 has outperformed several baseline models in both fundamental language capabilities and instruction-tuned functionalities. Notably, the instruction-tuned variant, Qwen2-72B-Instruct, scored 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. The base model, Qwen2-72B, achieved scores of 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH.
Qwen2 models employ a byte-level byte-pair encoding tokenizer with high encoding efficiency, which underpins their multilingual capabilities. Each model stacks Transformer layers that combine self-attention with feed-forward networks. Key architectural choices include Grouped Query Attention (GQA), which reduces the key-value cache during inference, along with Dual Chunk Attention (DCA) and YARN for long-context handling and length extrapolation.
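To make the attention design concrete, the sketch below shows a minimal grouped-query attention layer in PyTorch, where a small number of key/value heads is shared across groups of query heads to shrink the KV cache; the dimensions are made up, and rotary embeddings, Dual Chunk Attention, and YARN are omitted, so this is an illustration rather than the actual Qwen2 layer.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_kv_heads key/value heads shared across n_heads query heads."""

    def __init__(self, d_model=1024, n_heads=16, n_kv_heads=4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each key/value head so it serves a whole group of query heads.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```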
The MoE models replace the single feed-forward network (FFN) in each block with a set of smaller FFNs acting as experts; only a subset is activated for each token, so compute per token stays modest while total capacity grows. Qwen2's MoE layers combine shared experts, which process every token, with routed experts selected per token, improving adaptability and efficiency.
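The sketch below illustrates how such a layer could be structured: a few shared experts process every token while a router picks the top-k routed experts per token. The expert counts, dimensions, and gating scheme are illustrative assumptions, not Qwen2's exact configuration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SwiGLU(nn.Module):
    """Gated FFN used as a single expert (SwiGLU-style, common in modern LLM blocks)."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoEBlock(nn.Module):
    """MoE FFN: shared experts see every token; routed experts are chosen per token."""
    def __init__(self, d_model=1024, d_expert=512, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.shared = nn.ModuleList([SwiGLU(d_model, d_expert) for _ in range(n_shared)])
        self.routed = nn.ModuleList([SwiGLU(d_model, d_expert) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):                        # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)     # shared experts process all tokens
        probs = F.softmax(self.router(x), dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)   # per-token expert selection
        for slot in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                mask = idx[..., slot] == e_id           # tokens routed to this expert
                if mask.any():
                    out[mask] = out[mask] + weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```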
As the landscape of LLMs continues to evolve, the Qwen2 series stands out for its scalable architecture and robust performance across a spectrum of applications, elevating the potential of artificial intelligence research and development.