TY - GEN
T1 - Parallel Query Processing
T2 - 2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
AU - Zhang, Hao
AU - Yu, Jeffrey Xu
AU - Zhang, Yikai
AU - Zhao, Kangfei
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/6/10
Y1 - 2022/6/10
N2 - In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. The communication cost becomes large if the intermediate results between operators are large in intra-operator parallelism. In the existing approaches, it optimizes an SQL query by arranging relational algebra operators to reduce the total cost, where, for each operator, it involves (i) distribution of data partitioned to computing nodes by communication, and (ii)computation on computing nodes locally. The communication and computation are dealt with inside an operator and are not separable. In other words, it is difficult to avoid large intermediate results and hence reduce the communication cost. To reduce communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator () to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with , we can also explicitly treat the local computation on a computing node as op for any RA (relational algebra) operator op. We give a merge operator (U), to collect all partial results from computing nodes as they are. In short, with , op, and U, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely, partitioning push-down and computation push-up to separate communication from computation for RA expressions. We prove that we can push-down/up for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in our performance studies using real datasets.
AB - In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. The communication cost becomes large if the intermediate results between operators are large in intra-operator parallelism. In the existing approaches, it optimizes an SQL query by arranging relational algebra operators to reduce the total cost, where, for each operator, it involves (i) distribution of data partitioned to computing nodes by communication, and (ii)computation on computing nodes locally. The communication and computation are dealt with inside an operator and are not separable. In other words, it is difficult to avoid large intermediate results and hence reduce the communication cost. To reduce communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator () to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with , we can also explicitly treat the local computation on a computing node as op for any RA (relational algebra) operator op. We give a merge operator (U), to collect all partial results from computing nodes as they are. In short, with , op, and U, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely, partitioning push-down and computation push-up to separate communication from computation for RA expressions. We prove that we can push-down/up for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in our performance studies using real datasets.
KW - database
KW - olap
KW - parallel query processing
KW - query optimization
UR - http://www.scopus.com/inward/record.url?scp=85132772963&partnerID=8YFLogxK
U2 - 10.1145/3514221.3526164
DO - 10.1145/3514221.3526164
M3 - Conference contribution
AN - SCOPUS:85132772963
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1447
EP - 1461
BT - SIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 12 June 2022 through 17 June 2022
ER -