Ziyi Yang's homepage

Ziyi Yang (杨子逸)

My name is Ziyi Yang. I am a second-year MS student (expected to graduate in 2026) at Sun Yat-sen University, advised by Prof. Xiaojun Quan. Before this, I received my Bachelor's degree in computer science and technology (2019-2023) from Sun Yat-sen University. My main research interests are heterogeneous model fusion and preference optimization algorithms.

Email / CV / Google Scholar / GitHub / HF

Research

My main research interests are heterogeneous model fusion (e.g., combining the strengths of multiple large language models (LLMs) with diverse structures/scales), preference learning algorithms (e.g., DPO, SimPO), and large reasoning models (LRMs) (e.g., efficient reasoning, RL scaling, Self-Play Agent RL). Below are my representative papers.

Knowledge Fusion & Preference Learning
	Weighted-Reward Preference Optimization for Implicit Model Fusion Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan ICLR, 2025 [Paper] / [GitHub] / [HF] We propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO achieves an LC Win Rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a Win Rate of 46.2% against GPT-4-0314 on Arena-Hard.
	FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion Ziyi Yang, Fanqi Wan, Longguang Zhong, Canbin Huang, Guosheng Liang, Xiaojun Quan SCI-FM @ ICLR, 2025 [Paper] / [HF] / [HF Daily Papers] / [r/LocalLLaMA] / [GitHub] / [魔搭社区] We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Using Llama-3.1-8B-Instruct as the target model, our fusion approach achieves an average improvement of 6.8 points across 14 benchmarks.
	Mutual-Taught for Co-adapting Policy and Reward Models Tianyuan Shi, Canbin Huang, Fanqi Wan, Longguang Zhong, Ziyi Yang, Weizhou Shen, Xiaojun Quan, Ming Yan ACL main, 2025 [Paper] We propose Mutual-Taught, a self-training method that iteratively improves both the policy model and the reward model without requiring additional human annotation. Our approach mirrors the expectation-maximization (EM) algorithm. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models.
	FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion Longguang Zhong, Fanqi Wan, Ziyi Yang, Guosheng Liang, Tianyuan Shi, Xiaojun Quan Preprint, 2025 We propose FuseRL, a novel two-stage framework comprising FuseSFT and FusePO to maximize the utilization of source LLMs. Using Llama-3.1-8B-Instruct as the target model, our approach achieves state-of-the-art performance among 8B LLMs on AlpacaEval-2 and Arena-Hard.
	FuseChat: Knowledge Fusion of Chat Models Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, Xiaojun Quan Tech Report, 2024 [Paper] / [GitHub] / [HF] / [机器之心] / [mergekit] We propose FuseChat, an extended framework of FuseLLM to integrate the collective knowledge and individual strengths of multiple structure- and scale-varied chat LLMs into a more powerful chat LLM. FuseChat-7B is comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench.

Large Reasoning Models
	QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan Tech Report, 2025 [GitHub] / [HF] / [Paper] / [r/LocalLLaMA] / [HF Daily Papers] We propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking.
	ThinkSwitcher: When to Think Hard, When to Think Fast Guosheng Liang, Longguang Zhong, Ziyi Yang, Xiaojun Quan Tech Report, 2025 [Paper] We propose ThinkSwitcher, a framework that enables a single LRM to dynamically switch between short and long CoT modes based on task complexity. ThinkSwitcher reduces computational cost by 20–30% while maintaining high accuracy on complex tasks.
	FuseO1-Preview: System-II Reasoning Fusion of LLMs Fanqi Wan, Longguang Zhong, Ziyi Yang, Weizhou Shen, Xinting Huang Tech Report, 2025 [GitHub] / [HF] / [Blog] / [r/LocalLLaMA] / [Mergekit] FuseO1-Preview is our initial endeavor to enhance the System-II reasoning capabilities of large language models (LLMs) through innovative model fusion techniques. The resulting FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview achieves a Pass@1 accuracy of 74.0 on AIME24, a significant improvement over OpenAI o1-preview (44.6) and OpenAI o1-mini (63.4), and approaches OpenAI o1 (79.2).

Education

MS Student in Computer Technology, Sun Yat-sen University (2023.09-2026.06).

Bachelor of Computer Science and Technology, Sun Yat-sen University (2019.09-2023.06).

Website's code is from Jon Barron.