TY - GEN
T1 - Towards Building Private LLMs
T2 - 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024
AU - Chen, Mu-Chi
AU - Huang, Po-Hsuan
AU - Ke, Xiangrui
AU - Tu, Chia-Heng
AU - Xue, Jason
AU - Hung, Shih-Hao
N1 - Publisher Copyright:
© 2024 Copyright is held by the owner/author(s).
PY - 2025/10/8
Y1 - 2025/10/8
N2 - Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small-group services, a goal also pursued by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that the computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory management logic of Apple's software stack. Based on these findings, we develop optimization schemes that eliminate this memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than a state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model that estimates system performance under varying configurations, providing valuable insights for designing private LLM systems.
AB - Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small-group services, a goal also pursued by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that the computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory management logic of Apple's software stack. Based on these findings, we develop optimization schemes that eliminate this memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than a state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model that estimates system performance under varying configurations, providing valuable insights for designing private LLM systems.
UR - https://www.scopus.com/pages/publications/105021315461
U2 - 10.1145/3649601.3698722
DO - 10.1145/3649601.3698722
M3 - Conference contribution
AN - SCOPUS:105021315461
T3 - 2024 Research in Adaptive and Convergent Systems - Proceedings of the 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024
SP - 57
EP - 64
BT - 2024 Research in Adaptive and Convergent Systems - Proceedings of the 2024 International Conference on Research in Adaptive and Convergent Systems, RACS 2024
PB - Association for Computing Machinery, Inc
Y2 - 5 November 2024 through 8 November 2024
ER -