TY - JOUR
T1 - SQL-G
T2 - Efficient Graph Analytics by SQL
AU - Zhao, Kangfei
AU - Su, Jiao
AU - Yu, Jeffrey Xu
AU - Zhang, Hao
N1 - Publisher Copyright: © 1989-2012 IEEE.
PY - 2021/5/1
Y1 - 2021/5/1
N2 - Querying graphs and conducting graph analytics become important in data processing since many real applications are dealing with massive graphs, such as online social networks, Semantic Web, knowledge graphs, etc. Over the years, many distributed graph processing systems have been developed to support graph analytics using various programming models, and many graph querying languages have been proposed. A natural question that arises is how to integrate graph data and traditional non-graph data in a distributed system for users to conduct analytics. There are two issues. One issue is related to expressiveness on how to specify graph analytics as well as data analytics by a querying language. The other issue is related to efficiency on how to process analytics in a distributed system. For the first issue, SQL is a best candidate, since SQL is a well-accepted language for data processing. We concentrate on SQL for graph analytics. Our early work shows that graph analytics can be supported by SQL in a way from 'semiring + while' to 'relational algebra + while' via the enhanced recursive SQL queries. In this article, we focus on the second issue on how to process such enhanced recursive SQL queries based on the GAS (Gather-Apply-Scatter) model under which efficient graph processing systems can be developed. To demonstrate the efficiency, we implemented a system by tightly coupling Spark SQL and GraphX on Spark which is one of the most popular in-memory data-flow processing platforms. First, we enhance Spark SQL by adding the capability of supporting the enhanced recursive SQL queries for graph analytics. In this regard, graph analytics can be processed using a distributed SQL engine alone. Second, we further propose new transformation rules to optimize/translate the operations for recursive SQL queries to the operations by GraphX. In this regard, graph analytics by SQL can be processed in a similar way as done by a distributed graph processing system using the APIs provided by the system. We conduct extensive performance studies to test graph analytics using large real graphs. We show that our approach can achieve similar or even higher efficiency, in comparison to the built-in graph algorithms in the existing graph processing systems.
KW - Graph analytics
KW - SQL recursive query
KW - distributed graph processing
KW - Spark
UR - https://www.scopus.com/pages/publications/85104004116
U2 - 10.1109/TKDE.2019.2950620
DO - 10.1109/TKDE.2019.2950620
M3 - Article
AN - SCOPUS:85104004116
SN - 1041-4347
VL - 33
SP - 2237
EP - 2251
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 5
M1 - 8888185
ER -