TY - GEN
T1 - A GPU inference system scheduling algorithm with asynchronous data transfer
AU - Zhang, Qin
AU - Zha, Li
AU - Wan, Xiaohua
AU - Cheng, Boqun
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
N2 - With the rapid expansion of its range of applications, deep learning has become an indispensable practical method for solving problems in various industries. In many application scenarios, especially high-concurrency areas such as search and recommendation, a deep learning inference system must provide both high throughput and low latency, which cannot easily be achieved at the same time. In this paper, we build a model that quantifies the relationship between concurrency, throughput, and job latency. Based on this model, we implement a GPU scheduling algorithm for inference jobs in a deep learning inference system. The algorithm predicts the completion time of the batch jobs currently executing, chooses an appropriate batch size for the next batch of jobs according to the concurrency, and uploads the data to GPU memory ahead of time, so that the system hides the GPU data transfer delay and achieves minimum job latency while still meeting the throughput requirements. Experiments show that the proposed GPU asynchronous data transfer scheduling algorithm improves throughput by 9% compared with the traditional synchronous algorithm, reduces latency by 3%-76% under different concurrency levels, and better suppresses the job latency fluctuations caused by changing concurrency.
AB - With the rapid expansion of its range of applications, deep learning has become an indispensable practical method for solving problems in various industries. In many application scenarios, especially high-concurrency areas such as search and recommendation, a deep learning inference system must provide both high throughput and low latency, which cannot easily be achieved at the same time. In this paper, we build a model that quantifies the relationship between concurrency, throughput, and job latency. Based on this model, we implement a GPU scheduling algorithm for inference jobs in a deep learning inference system. The algorithm predicts the completion time of the batch jobs currently executing, chooses an appropriate batch size for the next batch of jobs according to the concurrency, and uploads the data to GPU memory ahead of time, so that the system hides the GPU data transfer delay and achieves minimum job latency while still meeting the throughput requirements. Experiments show that the proposed GPU asynchronous data transfer scheduling algorithm improves throughput by 9% compared with the traditional synchronous algorithm, reduces latency by 3%-76% under different concurrency levels, and better suppresses the job latency fluctuations caused by changing concurrency.
KW - Deep Learning
KW - GPU
KW - Inference
KW - Latency
KW - Scheduling Algorithm
UR - http://www.scopus.com/inward/record.url?scp=85070419330&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2019.00083
DO - 10.1109/IPDPSW.2019.00083
M3 - Conference contribution
AN - SCOPUS:85070419330
T3 - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
SP - 438
EP - 445
BT - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
Y2 - 20 May 2019 through 24 May 2019
ER -