Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Published: 2025-03-31

Author:

Yang Yujiao, Lian Jing, Li Linhui

Abstract

Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. In existing MoE paradigms, each expert works as an individual, leading to inadequate collaboration. Moreover, the MoE framework has not been effectively extended to attention blocks, which limits further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts and applies selective routing to input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equivalent expert decomposition on both MLP blocks and attention blocks based on matrix partitioning in tensor parallelism. (2) We develop two routing paradigms: patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop the parallel implementation of UoE’s routing and computation operations and optimize the efficiency based on hardware processing analysis. The experiments demonstrate that our UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers (including the recently proposed DeepSeek-V3 architecture) in several tasks across image and natural language domains. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method with only 76% of its FLOPs. In the Long Range Arena benchmark, it demonstrates an average score at least 0.68% higher than all comparison models, including Full Attention, MoEs, and transformer variants, with only 50% of the FLOPs of the best MoE method. In image classification, our model yields an average accuracy improvement of 1.75% over the best model while maintaining comparable FLOPs. 
The source code is available at https://github.com/YujiaoYang-work/UoE.
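
The "equivalent expert decomposition" of an MLP block mentioned in the abstract rests on a standard identity from tensor-parallel matrix partitioning: splitting the first weight matrix by columns and the second by rows yields slices whose outputs sum to the dense result. The following is a minimal NumPy sketch of that identity only, not the authors' implementation. The function names, the ReLU activation, the absence of biases, and the dimensions are all illustrative assumptions.

```python
import numpy as np


def mlp(x, w1, w2):
    """Dense two-layer MLP without biases: relu(x @ w1) @ w2 (assumed form)."""
    return np.maximum(x @ w1, 0.0) @ w2


def split_into_experts(w1, w2, num_experts):
    """Partition w1 by columns and w2 by rows into matching slices.

    Each (w1 slice, w2 slice) pair acts as one hypothetical "expert".
    """
    w1_parts = np.split(w1, num_experts, axis=1)
    w2_parts = np.split(w2, num_experts, axis=0)
    return list(zip(w1_parts, w2_parts))


def union_of_experts(x, experts):
    """Sum the outputs of the given experts."""
    return sum(mlp(x, w1_e, w2_e) for w1_e, w2_e in experts)


# Illustrative sizes, chosen so the hidden width divides evenly.
rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 8, 32, 4
x = rng.standard_normal((5, d_model))
w1 = rng.standard_normal((d_model, d_hidden))
w2 = rng.standard_normal((d_hidden, d_model))

experts = split_into_experts(w1, w2, num_experts)
dense_out = mlp(x, w1, w2)
union_out = union_of_experts(x, experts)

# The activation is elementwise, so the column split commutes with it and
# the summed expert outputs match the dense MLP up to floating-point error.
print(np.allclose(dense_out, union_out))
```

Because the full set of experts reproduces the dense block, evaluating only a routed subset of them (for example `union_of_experts(x, experts[:2])`) gives a sparse approximation at lower cost. How UoE selects experts and data patches is described in the paper and is not modeled here.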

Publisher

Qeios Ltd

Link

https://www.qeios.com/read/9QH8RX/pdf
