Parallel Query Processing: To Separate Communication from Computation

Hao Zhang, Jeffrey Xu Yu, Yikai Zhang, Kangfei Zhao

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. The communication cost becomes large if the intermediate results between operators are large in intra-operator parallelism. In the existing approaches, it optimizes an SQL query by arranging relational algebra operators to reduce the total cost, where, for each operator, it involves (i) distribution of data partitioned to computing nodes by communication, and (ii)computation on computing nodes locally. The communication and computation are dealt with inside an operator and are not separable. In other words, it is difficult to avoid large intermediate results and hence reduce the communication cost. To reduce communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator () to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with , we can also explicitly treat the local computation on a computing node as op for any RA (relational algebra) operator op. We give a merge operator (U), to collect all partial results from computing nodes as they are. In short, with , op, and U, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely, partitioning push-down and computation push-up to separate communication from computation for RA expressions. We prove that we can push-down/up for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in our performance studies using real datasets.

Original languageEnglish
Title of host publicationSIGMOD 2022 - Proceedings of the 2022 International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages1447-1461
Number of pages15
ISBN (Electronic)9781450392495
DOIs
Publication statusPublished - 10 Jun 2022
Externally publishedYes
Event2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022 - Virtual, Online, United States
Duration: 12 Jun 202217 Jun 2022

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference2022 ACM SIGMOD International Conference on the Management of Data, SIGMOD 2022
Country/TerritoryUnited States
CityVirtual, Online
Period12/06/2217/06/22

Keywords

  • database
  • olap
  • parallel query processing
  • query optimization

Fingerprint

Dive into the research topics of 'Parallel Query Processing: To Separate Communication from Computation'. Together they form a unique fingerprint.

Cite this