In distributed machine learning(DML)based on the parameter server(PS)architecture,unbalanced communication load distribution of PSs will lead to a significant slowdown of model synchronization in heterogeneous network...In distributed machine learning(DML)based on the parameter server(PS)architecture,unbalanced communication load distribution of PSs will lead to a significant slowdown of model synchronization in heterogeneous networks due to low utilization of bandwidth.To address this problem,a network-aware adaptive PS load distribution scheme is proposed,which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states.We evaluate the proposed scheme on MXNet,known as a realworld distributed training platform,and results show that our scheme achieves up to 2.68 times speed-up of model training in the dynamic and heterogeneous network environment.展开更多
基金partially supported by the computing power networks and new communication primitives project under Grant No. HC-CN-2020120001the National Natural Science Foundation of China under Grant No. 62102066Open Research Projects of Zhejiang Lab under Grant No. 2022QA0AB02
文摘In distributed machine learning(DML)based on the parameter server(PS)architecture,unbalanced communication load distribution of PSs will lead to a significant slowdown of model synchronization in heterogeneous networks due to low utilization of bandwidth.To address this problem,a network-aware adaptive PS load distribution scheme is proposed,which accelerates model synchronization by proactively adjusting the communication load on PSs according to network states.We evaluate the proposed scheme on MXNet,known as a realworld distributed training platform,and results show that our scheme achieves up to 2.68 times speed-up of model training in the dynamic and heterogeneous network environment.