DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling

Penghao Sun, Zehua Guo*, Junchao Wang, Junfei Li, Julong Lan, Yuxiang Hu

*Corresponding author for this work

School of Automation

National Digital Switching System Engineering and Technological R&D Center

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

37 Citations (Scopus)

Abstract

To improve the processing efficiency of jobs in distributed computing, the concept of coflow has been proposed. A coflow is a collection of flows that are semantically correlated in a multi-stage computation task. A job consists of multiple coflows and can usually be formulated as a directed acyclic graph (DAG). Proper scheduling of coflows can significantly reduce the completion time of jobs in distributed computing. However, this scheduling problem has been proven to be NP-hard. Unlike existing schemes that use hand-crafted heuristic algorithms to solve this problem, in this paper we propose a Deep Reinforcement Learning (DRL) framework named DeepWeave to generate coflow scheduling policies. To improve inter-coflow scheduling within the job DAG, DeepWeave employs a Graph Neural Network (GNN) to process the DAG information. DeepWeave learns from historical workload traces to train the neural networks of the DRL agent and encodes the scheduling policy in those networks, which make coflow scheduling decisions without expert knowledge or a pre-assumed model. The proposed scheme is evaluated with a simulator using real-life traces. Simulation results show that DeepWeave completes jobs at least 1.7× faster than the state-of-the-art solutions.
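To make the abstract's terminology concrete, the following is a minimal sketch of a job modelled as a DAG of coflows. It is not DeepWeave's model or scheduler: the names `Coflow`, `coflow_duration`, and `job_completion_time` are hypothetical, and the sketch assumes every flow gets a fixed bandwidth with no contention between coflows. Under that assumption the job completion time is simply the longest dependency path; the NP-hard scheduling problem the paper addresses arises once coflows compete for shared bandwidth.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Coflow:
    """A group of related flows; hypothetical structure for illustration."""
    name: str
    flow_sizes: list[float]                            # size of each flow, arbitrary units
    parents: list[str] = field(default_factory=list)   # coflows that must finish first


def coflow_duration(coflow: Coflow, bandwidth: float) -> float:
    # A coflow completes only when its slowest flow completes.
    # Assumption: each flow gets `bandwidth` to itself (no contention).
    return max(coflow.flow_sizes) / bandwidth


def job_completion_time(coflows: list[Coflow], bandwidth: float = 1.0) -> float:
    by_name = {c.name: c for c in coflows}
    graph = {c.name: set(c.parents) for c in coflows}
    finish: dict[str, float] = {}
    # Visit coflows so that every parent is processed before its children.
    for name in TopologicalSorter(graph).static_order():
        c = by_name[name]
        start = max((finish[p] for p in c.parents), default=0.0)
        finish[name] = start + coflow_duration(c, bandwidth)
    return max(finish.values())


# A two-stage job: two shuffle coflows feed one reduce coflow.
job = [
    Coflow("shuffle_a", [4.0, 2.0]),
    Coflow("shuffle_b", [3.0]),
    Coflow("reduce", [1.0, 5.0], parents=["shuffle_a", "shuffle_b"]),
]
print(job_completion_time(job))  # 9.0: reduce starts at 4.0 and takes 5.0
```

The value computed here is a contention-free bound; a real scheduler has to decide which coflows to prioritise when links are shared, which is the decision DeepWeave learns.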

Original language	English
Title of host publication	Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
Editors	Christian Bessiere
Publisher	International Joint Conferences on Artificial Intelligence
Pages	3314-3320
Number of pages	7
ISBN (Electronic)	9780999241165
Publication status	Published - 2020
Event	29th International Joint Conference on Artificial Intelligence, IJCAI 2020 - Yokohama, Japan; Duration: 1 Jan 2021 → …

Publication series

Name	IJCAI International Joint Conference on Artificial Intelligence
Volume	2021-January
ISSN (Print)	1045-0823

Conference

Conference	29th International Joint Conference on Artificial Intelligence, IJCAI 2020
Country/Territory	Japan
City	Yokohama
Period	1 Jan 2021 → …

Cite this

Sun, P., Guo, Z., Wang, J., Li, J., Lan, J., & Hu, Y. (2020). DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling. In C. Bessiere (Ed.), Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020 (pp. 3314-3320). (IJCAI International Joint Conference on Artificial Intelligence; Vol. 2021-January). International Joint Conferences on Artificial Intelligence.

Sun, Penghao ; Guo, Zehua ; Wang, Junchao et al. / DeepWeave : Accelerating job completion time with deep reinforcement learning-based coflow scheduling. Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020. editor / Christian Bessiere. International Joint Conferences on Artificial Intelligence, 2020. pp. 3314-3320 (IJCAI International Joint Conference on Artificial Intelligence).

@inproceedings{73192d29a3884e379cff69e80b88a2eb,
  title = "DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling",
  abstract = "To improve the processing efficiency of jobs in distributed computing, the concept of coflow is proposed. A coflow is a collection of flows that are semantically correlated in a multi-stage computation task. A job consists of multiple coflows and can be usually formulated as a Directed-Acyclic Graph (DAG). A proper scheduling of coflows can significantly reduce the completion time of jobs in distributed computing. However, this scheduling problem is proved to be NP-hard. Different from existing schemes that use hand-crafted heuristic algorithms to solve this problem, in this paper, we propose a Deep Reinforcement Learning (DRL) framework named DeepWeave to generate coflow scheduling policies. To improve the inter-coflow scheduling ability in the job DAG, DeepWeave employs a Graph Neural Network (GNN) to process the DAG information. DeepWeave learns from the history workload trace to train the neural networks of the DRL agent and encodes the scheduling policy in the neural networks, which make coflow scheduling decisions without expert knowledge or a pre-assumed model. The proposed scheme is evaluated with a simulator using real-life traces. Simulation results show that DeepWeave completes jobs at least 1.7× faster than the state-of-the-art solutions.",
  author = "Penghao Sun and Zehua Guo and Junchao Wang and Junfei Li and Julong Lan and Yuxiang Hu",
  note = "Publisher Copyright: {\textcopyright} 2020 Inst. Sci. inf., Univ. Defence in Belgrade. All rights reserved.; 29th International Joint Conference on Artificial Intelligence, IJCAI 2020 ; Conference date: 01-01-2021",
  year = "2020",
  language = "English",
  series = "IJCAI International Joint Conference on Artificial Intelligence",
  publisher = "International Joint Conferences on Artificial Intelligence",
  pages = "3314--3320",
  editor = "Christian Bessiere",
  booktitle = "Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020",
}

Sun, P, Guo, Z, Wang, J, Li, J, Lan, J & Hu, Y 2020, DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling. in C Bessiere (ed.), Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020. IJCAI International Joint Conference on Artificial Intelligence, vol. 2021-January, International Joint Conferences on Artificial Intelligence, pp. 3314-3320, 29th International Joint Conference on Artificial Intelligence, IJCAI 2020, Yokohama, Japan, 1/01/21.

DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling. / Sun, Penghao; Guo, Zehua; Wang, Junchao et al.
Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020. ed. / Christian Bessiere. International Joint Conferences on Artificial Intelligence, 2020. p. 3314-3320 (IJCAI International Joint Conference on Artificial Intelligence; Vol. 2021-January).

TY  - GEN
T1  - DeepWeave
T2  - 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
AU  - Sun, Penghao
AU  - Guo, Zehua
AU  - Wang, Junchao
AU  - Li, Junfei
AU  - Lan, Julong
AU  - Hu, Yuxiang
PY  - 2020
Y1  - 2020
N2  - To improve the processing efficiency of jobs in distributed computing, the concept of coflow is proposed. A coflow is a collection of flows that are semantically correlated in a multi-stage computation task. A job consists of multiple coflows and can be usually formulated as a Directed-Acyclic Graph (DAG). A proper scheduling of coflows can significantly reduce the completion time of jobs in distributed computing. However, this scheduling problem is proved to be NP-hard. Different from existing schemes that use hand-crafted heuristic algorithms to solve this problem, in this paper, we propose a Deep Reinforcement Learning (DRL) framework named DeepWeave to generate coflow scheduling policies. To improve the inter-coflow scheduling ability in the job DAG, DeepWeave employs a Graph Neural Network (GNN) to process the DAG information. DeepWeave learns from the history workload trace to train the neural networks of the DRL agent and encodes the scheduling policy in the neural networks, which make coflow scheduling decisions without expert knowledge or a pre-assumed model. The proposed scheme is evaluated with a simulator using real-life traces. Simulation results show that DeepWeave completes jobs at least 1.7× faster than the state-of-the-art solutions.
AB  - To improve the processing efficiency of jobs in distributed computing, the concept of coflow is proposed. A coflow is a collection of flows that are semantically correlated in a multi-stage computation task. A job consists of multiple coflows and can be usually formulated as a Directed-Acyclic Graph (DAG). A proper scheduling of coflows can significantly reduce the completion time of jobs in distributed computing. However, this scheduling problem is proved to be NP-hard. Different from existing schemes that use hand-crafted heuristic algorithms to solve this problem, in this paper, we propose a Deep Reinforcement Learning (DRL) framework named DeepWeave to generate coflow scheduling policies. To improve the inter-coflow scheduling ability in the job DAG, DeepWeave employs a Graph Neural Network (GNN) to process the DAG information. DeepWeave learns from the history workload trace to train the neural networks of the DRL agent and encodes the scheduling policy in the neural networks, which make coflow scheduling decisions without expert knowledge or a pre-assumed model. The proposed scheme is evaluated with a simulator using real-life traces. Simulation results show that DeepWeave completes jobs at least 1.7× faster than the state-of-the-art solutions.
UR  - http://www.scopus.com/inward/record.url?scp=85086827905&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:85086827905
T3  - IJCAI International Joint Conference on Artificial Intelligence
SP  - 3314
EP  - 3320
BT  - Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020
A2  - Bessiere, Christian
PB  - International Joint Conferences on Artificial Intelligence
Y2  - 1 January 2021
ER  - 

Sun P, Guo Z, Wang J, Li J, Lan J, Hu Y. DeepWeave: Accelerating job completion time with deep reinforcement learning-based coflow scheduling. In Bessiere C, editor, Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI 2020. International Joint Conferences on Artificial Intelligence. 2020. p. 3314-3320. (IJCAI International Joint Conference on Artificial Intelligence).
