Parallel Query Processing: To Separate Communication from Computation

Hao Zhang; Jeffrey Xu Yu; Yikai Zhang; Kangfei Zhao

doi:10.1145/3514221.3526164

Chinese University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. Under intra-operator parallelism, the communication cost becomes large if the intermediate results between operators are large. Existing approaches optimize an SQL query by arranging relational algebra operators to reduce the total cost, where each operator involves (i) distribution of partitioned data to computing nodes by communication, and (ii) computation on computing nodes locally. The communication and computation are dealt with inside an operator and are not separable. As a result, it is difficult to avoid large intermediate results and hence to reduce the communication cost. To reduce the communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with the pair operator, we can also explicitly treat the local computation on a computing node as op for any RA (relational algebra) operator op. We give a merge operator (U) to collect all partial results from computing nodes as they are. In short, with the pair operator, op, and U, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely partitioning push-down and computation push-up, to separate communication from computation for RA expressions. We prove that push-down/up is possible for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in performance studies using real datasets.
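The three roles described in the abstract (pairing hash partitions, computing locally on each pair, and merging partial results) can be illustrated with a small single-process sketch. This is not Secco's implementation or API: the function names (`hash_partition`, `pair_partitions`, `local_join`, `merge`, `partitioned_join`) are hypothetical, relations are plain lists of tuples, the local operator is fixed to an equi-join, and pairing only partitions that share a hash bucket is a simplifying assumption that holds for equi-joins.

```python
from collections import defaultdict
from itertools import product


def hash_partition(relation, key_index, num_partitions):
    """Split a relation (list of tuples) into buckets by hashing one attribute."""
    partitions = defaultdict(list)
    for row in relation:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def pair_partitions(r_parts, s_parts):
    """Communication step (assumed form): pair R and S partitions with the same bucket id.

    In a real cluster each pair would be shipped to one computing node.
    """
    shared_buckets = sorted(r_parts.keys() & s_parts.keys())
    return [(r_parts[b], s_parts[b]) for b in shared_buckets]


def local_join(r_part, s_part, r_key, s_key):
    """Computation step: an equi-join evaluated locally on one pair of partitions."""
    return [r + s for r, s in product(r_part, s_part) if r[r_key] == s[s_key]]


def merge(partial_results):
    """Merge step: collect the partial results from all pairs as they are."""
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged


def partitioned_join(R, S, r_key, s_key, num_partitions=4):
    """Pair, compute locally, then merge; the three phases stay explicit."""
    r_parts = hash_partition(R, r_key, num_partitions)
    s_parts = hash_partition(S, s_key, num_partitions)
    pairs = pair_partitions(r_parts, s_parts)
    return merge(local_join(rp, sp, r_key, s_key) for rp, sp in pairs)
```

Calling `partitioned_join(R, S, 0, 0)` on two lists of tuples joins them on their first attribute. Because pairing is a separate function, the sketch makes visible which data would move between nodes before any local computation starts, which is the separation the abstract argues for.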

Original language	English
Title of host publication	SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
Publisher	Association for Computing Machinery
Pages	1447-1461
Number of pages	15
ISBN (Electronic)	9781450392495
DOIs	https://doi.org/10.1145/3514221.3526164
Publication status	Published - Jun 2022
Externally published	Yes
Event	2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022 - Hybrid, Philadelphia, United States Duration: 12 Jun 2022 → 17 Jun 2022

Publication series

Name	Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)	0730-8078

Conference

Conference	2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Country/Territory	United States
City	Hybrid, Philadelphia
Period	12/06/22 → 17/06/22

Keywords

database
olap
parallel query processing
query optimization

Access to Document

10.1145/3514221.3526164

Cite this

Zhang, H., Yu, J. X., Zhang, Y., & Zhao, K. (2022). Parallel Query Processing: To Separate Communication from Computation. In SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data (pp. 1447-1461). (Proceedings of the ACM SIGMOD International Conference on Management of Data). Association for Computing Machinery. https://doi.org/10.1145/3514221.3526164

@inproceedings{7171368a9bf94d5cbe284beb80b3af1f,

title = "Parallel Query Processing: To Separate Communication from Computation",

abstract = "In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. Under intra-operator parallelism, the communication cost becomes large if the intermediate results between operators are large. Existing approaches optimize an SQL query by arranging relational algebra operators to reduce the total cost, where each operator involves (i) distribution of partitioned data to computing nodes by communication, and (ii) computation on computing nodes locally. The communication and computation are dealt with inside an operator and are not separable. As a result, it is difficult to avoid large intermediate results and hence to reduce the communication cost. To reduce the communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with the pair operator, we can also explicitly treat the local computation on a computing node as op for any RA (relational algebra) operator op. We give a merge operator (U) to collect all partial results from computing nodes as they are. In short, with the pair operator, op, and U, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely partitioning push-down and computation push-up, to separate communication from computation for RA expressions. We prove that push-down/up is possible for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in performance studies using real datasets.",

keywords = "database, olap, parallel query processing, query optimization",

author = "Hao Zhang and Yu, {Jeffrey Xu} and Yikai Zhang and Kangfei Zhao",

note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022 ; Conference date: 12-06-2022 Through 17-06-2022",

year = "2022",

month = jun,

doi = "10.1145/3514221.3526164",

language = "English",

series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",

publisher = "Association for Computing Machinery",

pages = "1447--1461",

booktitle = "SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data",

}

Zhang, H, Yu, JX, Zhang, Y & Zhao, K 2022, Parallel Query Processing: To Separate Communication from Computation. in SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, pp. 1447-1461, 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022, Hybrid, Philadelphia, United States, 12/06/22. https://doi.org/10.1145/3514221.3526164

Parallel Query Processing: To Separate Communication from Computation. / Zhang, Hao; Yu, Jeffrey Xu; Zhang, Yikai et al.
SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data. Association for Computing Machinery, 2022. p. 1447-1461 (Proceedings of the ACM SIGMOD International Conference on Management of Data).

TY - GEN

T1 - Parallel Query Processing: To Separate Communication from Computation

T2 - 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022

AU - Zhang, Hao

AU - Yu, Jeffrey Xu

AU - Zhang, Yikai

AU - Zhao, Kangfei

PY - 2022/6

Y1 - 2022/6

N2 - In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. Under intra-operator parallelism, the communication cost becomes large if the intermediate results between operators are large. Existing approaches optimize an SQL query by arranging relational algebra operators to reduce the total cost, where each operator involves (i) distribution of partitioned data to computing nodes by communication, and (ii) computation on computing nodes locally. The communication and computation are dealt with inside an operator and are not separable. As a result, it is difficult to avoid large intermediate results and hence to reduce the communication cost. To reduce the communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with the pair operator, we can also explicitly treat the local computation on a computing node as op for any RA (relational algebra) operator op. We give a merge operator (U) to collect all partial results from computing nodes as they are. In short, with the pair operator, op, and U, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely partitioning push-down and computation push-up, to separate communication from computation for RA expressions. We prove that push-down/up is possible for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in performance studies using real datasets.

KW - database

KW - olap

KW - parallel query processing

KW - query optimization

UR - http://www.scopus.com/inward/record.url?scp=85132772963&partnerID=8YFLogxK

U2 - 10.1145/3514221.3526164

DO - 10.1145/3514221.3526164

M3 - Conference contribution

AN - SCOPUS:85132772963

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 1447

EP - 1461

BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data

PB - Association for Computing Machinery

Y2 - 12 June 2022 through 17 June 2022

ER -
