
Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model

  • Mu Chi Chen
  • Po Hsuan Huang
  • Xiangrui Ke
  • Chia Heng Tu
  • Jason Xue
  • Shih Hao Hung

Research output: Conference contribution

1 citation (Scopus)

Abstract

Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small-group services, a goal also pursued by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that the computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory management logic of Apple's software stack. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than a state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations; the model provides valuable insights for designing private LLM systems.
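The trade-off the abstract describes, where splitting experts across nodes cuts compute time but adds a latency-dominated communication cost, can be sketched with a simple analytical cost model. This is an illustrative sketch only: the function, parameter names, and all numeric values below are assumptions for demonstration, not the paper's actual performance model or measurements.

```python
# Hypothetical sketch of a latency/bandwidth cost model for multi-node
# expert parallelism in an MoE layer. All parameters are illustrative
# assumptions, not figures reported in the paper.

def moe_layer_time(num_nodes, expert_compute_s, output_bytes,
                   latency_s, bandwidth_bps):
    """Estimate one MoE layer's time when experts are spread over `num_nodes`.

    Compute shrinks as each node handles 1/num_nodes of the experts;
    communication grows as each node exchanges expert outputs with the
    others. For small activation tensors, the per-message latency term
    dominates the bandwidth term, matching the abstract's observation.
    """
    compute = expert_compute_s / num_nodes
    # (num_nodes - 1) exchanges, each paying link latency plus transfer time
    comm = (num_nodes - 1) * (latency_s + output_bytes / bandwidth_bps)
    return compute + comm

# Example: 20 ms of total expert compute per layer, 64 KiB exchanged per
# peer, 0.2 ms link latency, 10 Gb/s links (1.25e9 bytes/s).
for n in (1, 2, 4):
    t = moe_layer_time(n, 0.020, 64 * 1024, 0.0002, 1.25e9)
    print(f"{n} node(s): {t * 1e3:.2f} ms")
```

With these assumed numbers, the transfer term (64 KiB at 10 Gb/s, about 0.05 ms) is smaller than the 0.2 ms latency term, which is why latency rather than bandwidth governs how far the multi-node speedup scales.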

Original language: English
Title of host publication: 2024 Research in Adaptive and Convergent Systems - Proceedings of the 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024
Publisher: Association for Computing Machinery, Inc
Pages: 57-64
Number of pages: 8
ISBN (Electronic): 9798400706066
DOIs
Publication status: Published - 8 Oct 2025
Event: 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024 - Pompei, Italy
Duration: 5 Nov 2024 – 8 Nov 2024

Publication series

Name: 2024 Research in Adaptive and Convergent Systems - Proceedings of the 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024

Conference

Conference: 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024
Country/Territory: Italy
City: Pompei
Period: 24-11-05 – 24-11-08

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • Control and Systems Engineering
