Abstract
Frequent closed subgraphs mining is proved to be a NP-problem. Over the years, although many algorithms have been proposed to solve this problem, they are all faced with a common problem of computational efficiency when mining large scale graph data sets. In particular, when the average degree of vertexes increases in graph data sets, the efficiency of mining decreases sharply. At present, most of the existing distributed frequent subgraph mining algorithms for graph databases adopt a distributed computing framework based on horizontal partitioning, and all of them focus on the problem of mining all frequent subgraphs. The distributed computing framework based on horizontal partition is to partition original graph data sets horizontally and complete the distributed mining process. In terms of computational efficiency, the framework based on horizontal partition has some shortcomings. At the same time, because the frequent closed subgraph patterns mining need to detect the closeness of frequent subgraphs, if the existing distributed frequent subgraph mining algorithms are directly applied to the frequent closed subgraphs mining, it may lead to frequent communication between nodes or a large number of subgraph isomorphism detection. In view of the above problems, this paper proposes an efficient distributed mining algorithm Desu-FSM for large scale graph data sets. Unlike the existing distributed mining framework based on horizontal decomposition, this algorithm adopts the distributed mining framework based on vertical decomposition for the first time. Its basic idea can be summarized as "fast approaching, bidirectional search". First, we can get a summary graph set by merging the τ-domain kernel graphs, and quickly get close to the aggregation area of large subgraphs. On this basis, by reducing and extending the summary graphs, we can find all frequent closed graph patterns that summary graphs contain and the frequent closed graph patterns that containing summary graphs. Compared with the original graph data sets, the size and average degree of vertexes of the summary graphs are smaller. Moreover, the bidirectional search based on summary graphs can be completed independently in distributed environment without coupling. Different from the "data physical divide-and-conquer" approaches adopted in the framework based on horizontal partition, this framework adopts "task logical divide-and-conquer". The former reduces the amount of graph data sets which are processed by each node, while the latter decomposes the mining task from the original graph data sets into a series of subgraph bound tasks with smaller size and average vertexes degree. Therefore, the computational efficiency has been greatly improved. This paper also proposes a set of efficient optimization strategies to reduce the large number of repeated computations caused by common subgraphs between summary graphs. The mining efficiency of the framework based on vertical decomposition is tested on a large number of real and synthetic graph data sets. Compared with the horizontal decomposition framework, the results show that the vertical decomposition framework can be improved by one order of magnitude with less memory space occupancy when mining frequent closed subgraphs in large scale graph data sets.
| Translated title of the contribution | An Efficient Distributed Algorithm for Large-Scale Graph Data Mining Based on Decoupled Summary Subgraph |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 1183-1198 |
| Number of pages | 16 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 43 |
| Issue number | 7 |
| DOIs | |
| Publication status | Published - 1 Jul 2020 |