TY - JOUR
T1 - GPPR
T2 - 跨域分布式个性化 PageRank 算法
AU - Chen, Zi Jun
AU - Ma, De Long
AU - Wang, Yi Shu
AU - Yuan, Ye
N1 - Publisher Copyright:
© 2024 Chinese Academy of Sciences. All rights reserved.
PY - 2024/3
Y1 - 2024/3
N2 - Personalized PageRank, as a basic algorithm in large graph analysis, has a wide range of applications in search engines, social recommendation, community detection, and other fields, and has been a hot problem of interest to researchers. The existing distributed personalized PageRank algorithms assume that all data are located in the same geographic location and the network environment is the same among the computing nodes where the data are located. However, in the real world, these data may be distributed in multiple data centers across continents, and these cross-geo-distributed data centers are connected to each other through WANs, which are characterized by heterogeneous network bandwidth, huge hardware differences, and high communication costs. The distributed personalized PageRank algorithm requires multiple iterations and random wandering on the global graph. Therefore, the existing distributed personalized PageRank algorithms are not applicable to the cross-geo-distributed environment. To address this problem, the GPPR (cross-geo-distributed personalized PageRank) algorithm is proposed in this study. The algorithm first preprocesses the big graph data in the cross-geo-distributed environment and maps the graph data by using a heuristic algorithm to reduce the impact of network bandwidth heterogeneity on the iteration speed of the algorithm. Secondly, GPPR improves the random wandering approach and proposes a probability-based push algorithm to further reduce the number of iterations required by the algorithm by reducing the bandwidth load of transmitting data between working nodes. The GPPR algorithm is implemented based on the Spark framework and a real cross-geo-distributed environment in AliCloud is built to conduct experiments on eight open-source big graph data compared with several existing representative distributed personalized PageRank algorithms. The results show that the communication data volume of GPPR is reduced by 30% on average in the cross-geo-distributed environment compared with other algorithms. In terms of algorithm running efficiency, GPPR improves by an average of 2.5 times compared to other algorithms.
AB - Personalized PageRank, as a basic algorithm in large graph analysis, has a wide range of applications in search engines, social recommendation, community detection, and other fields, and has been a hot problem of interest to researchers. The existing distributed personalized PageRank algorithms assume that all data are located in the same geographic location and the network environment is the same among the computing nodes where the data are located. However, in the real world, these data may be distributed in multiple data centers across continents, and these cross-geo-distributed data centers are connected to each other through WANs, which are characterized by heterogeneous network bandwidth, huge hardware differences, and high communication costs. The distributed personalized PageRank algorithm requires multiple iterations and random wandering on the global graph. Therefore, the existing distributed personalized PageRank algorithms are not applicable to the cross-geo-distributed environment. To address this problem, the GPPR (cross-geo-distributed personalized PageRank) algorithm is proposed in this study. The algorithm first preprocesses the big graph data in the cross-geo-distributed environment and maps the graph data by using a heuristic algorithm to reduce the impact of network bandwidth heterogeneity on the iteration speed of the algorithm. Secondly, GPPR improves the random wandering approach and proposes a probability-based push algorithm to further reduce the number of iterations required by the algorithm by reducing the bandwidth load of transmitting data between working nodes. The GPPR algorithm is implemented based on the Spark framework and a real cross-geo-distributed environment in AliCloud is built to conduct experiments on eight open-source big graph data compared with several existing representative distributed personalized PageRank algorithms. The results show that the communication data volume of GPPR is reduced by 30% on average in the cross-geo-distributed environment compared with other algorithms. In terms of algorithm running efficiency, GPPR improves by an average of 2.5 times compared to other algorithms.
KW - approximate calculation
KW - cross-geo-distributed
KW - personalized PageRank
UR - http://www.scopus.com/inward/record.url?scp=85190123086&partnerID=8YFLogxK
U2 - 10.13328/j.cnki.jos.007072
DO - 10.13328/j.cnki.jos.007072
M3 - 文章
AN - SCOPUS:85190123086
SN - 1000-9825
VL - 35
SP - 1090
EP - 1106
JO - Ruan Jian Xue Bao/Journal of Software
JF - Ruan Jian Xue Bao/Journal of Software
IS - 3
ER -