A GPU inference system scheduling algorithm with asynchronous data transfer

Qin Zhang, Li Zha*, Xiaohua Wan, Boqun Cheng

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

As its range of applications has expanded rapidly, deep learning has become an indispensable practical method for solving problems across many industries. In different application scenarios, especially high-concurrency areas such as search and recommendation, a deep learning inference system must deliver both high throughput and low latency, two goals that are difficult to achieve at the same time. In this paper, we build a model that quantifies the relationship between concurrency, throughput, and job latency. Based on this model, we implement a GPU scheduling algorithm for inference jobs in a deep learning inference system. The algorithm predicts the completion time of the batch jobs currently executing, chooses a suitable batch size for the next batch according to the concurrency, and uploads its data to GPU memory ahead of time. The system can thus hide the GPU data transfer delay and achieve minimum job latency while still meeting throughput requirements. Experiments show that the proposed GPU asynchronous data transfer scheduling algorithm improves throughput by 9% compared with the traditional synchronous algorithm, reduces latency by 3%-76% under different concurrency levels, and better suppresses the job latency fluctuations caused by changing concurrency.
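The core idea in the abstract, overlapping the upload of the next batch with the computation of the current one, can be illustrated with a toy simulation. The sketch below is not the paper's implementation: the function names and the linear cost constants are assumptions chosen only to show why prefetching hides the transfer delay.

```python
# Toy cost model (assumed, not from the paper): per-item transfer time,
# plus a fixed-cost-plus-per-item compute time for each batch.

def batch_cost(batch_size, transfer_per_item=0.2,
               compute_fixed=1.0, compute_per_item=0.1):
    """Return (host-to-GPU transfer time, GPU compute time) for one batch."""
    transfer = transfer_per_item * batch_size
    compute = compute_fixed + compute_per_item * batch_size
    return transfer, compute

def sync_makespan(batch_sizes):
    """Synchronous schedule: each batch is uploaded, then computed,
    so transfer and compute times simply add up."""
    return sum(t + c for t, c in map(batch_cost, batch_sizes))

def async_makespan(batch_sizes):
    """Asynchronous schedule: the copy engine uploads batch i+1 while
    the GPU computes batch i, so later transfers are hidden."""
    transfer_done = 0.0  # when the copy engine finishes its current upload
    compute_done = 0.0   # when the GPU finishes its current batch
    for b in batch_sizes:
        t, c = batch_cost(b)
        transfer_done += t  # copy engine handles uploads one after another
        # a batch can compute only once its data has arrived AND the GPU is free
        compute_done = max(compute_done, transfer_done) + c
    return compute_done

batches = [4, 4, 4]
# With these constants each batch needs 0.8 time units of transfer and
# 1.4 of compute; asynchronously, all transfers after the first overlap
# with compute, so the async makespan is strictly shorter.
print(sync_makespan(batches))
print(async_makespan(batches))
```

Under this model the synchronous makespan is about 6.6 time units versus about 5.0 asynchronously; the gap grows with the number of batches, matching the throughput gain the paper reports for hiding transfers behind compute.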

Original language: English
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 438-445
Number of pages: 8
ISBN (Electronic): 9781728135106
DOIs
Publication status: Published - May 2019
Externally published: Yes
Event: 33rd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019 - Rio de Janeiro, Brazil
Duration: 20 May 2019 - 24 May 2019

Publication series

Name: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
Country/Territory: Brazil
City: Rio de Janeiro
Period: 20/05/19 - 24/05/19

Keywords

  • Deep Learning
  • GPU
  • Inference
  • Latency
  • Scheduling Algorithm
