TY - JOUR
T1 - Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costs
AU - Zhang, Lingling
AU - Wang, Fang
AU - Jiang, Hong
AU - Feng, Dan
AU - Xie, Yanwen
AU - Zhang, Zhiwei
AU - Wang, Guoren
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022/7
Y1 - 2022/7
N2 - Random-walk-based sampling is an efficient way to extract and analyze the properties of large and complex graphs representing social networks. However, it is almost impractical for existing random-walk-based sampling schemes to reach the desired node distribution because of the indeterministic sampling budget (i.e., the number of samples or sampling steps) required for doing so with large volumes of data in graphs. On the other hand, under a small sampling budget, these methods produce low-quality samples with many repeats and high correlations (i.e., many common attributes), which leads to a large deviation from the desired node distribution and large estimation errors. In this paper, we propose a new random-walk sampling scheme based on node cliques (a subset of cliques), called node-clique random walk, or NCRW, to strike a good balance between the estimation error and the sampling budget, by producing unique samples with low correlations. Meanwhile, both the deviation from the desired node distribution and the estimation errors under the constraint of the sampling budget are reduced both theoretically and experimentally. Thus, the sampling costs which are closely related to the sampling budget are reduced. Our extensive experimental evaluation driven by real-world datasets further confirms that NCRW significantly increases the quality of samples and accuracy of estimations with much lower costs than those of existing random-walk-based sampling schemes especially in estimating the higher-order node attributes.
AB - Random-walk-based sampling is an efficient way to extract and analyze the properties of large and complex graphs representing social networks. However, it is almost impractical for existing random-walk-based sampling schemes to reach the desired node distribution because of the indeterministic sampling budget (i.e., the number of samples or sampling steps) required for doing so with large volumes of data in graphs. On the other hand, under a small sampling budget, these methods produce low-quality samples with many repeats and high correlations (i.e., many common attributes), which leads to a large deviation from the desired node distribution and large estimation errors. In this paper, we propose a new random-walk sampling scheme based on node cliques (a subset of cliques), called node-clique random walk, or NCRW, to strike a good balance between the estimation error and the sampling budget, by producing unique samples with low correlations. Meanwhile, both the deviation from the desired node distribution and the estimation errors under the constraint of the sampling budget are reduced both theoretically and experimentally. Thus, the sampling costs which are closely related to the sampling budget are reduced. Our extensive experimental evaluation driven by real-world datasets further confirms that NCRW significantly increases the quality of samples and accuracy of estimations with much lower costs than those of existing random-walk-based sampling schemes especially in estimating the higher-order node attributes.
KW - Estimating
KW - Graph mining
KW - Random walk
KW - Sampling
UR - http://www.scopus.com/inward/record.url?scp=85132545961&partnerID=8YFLogxK
U2 - 10.1007/s10115-022-01691-8
DO - 10.1007/s10115-022-01691-8
M3 - Article
AN - SCOPUS:85132545961
SN - 0219-1377
VL - 64
SP - 1909
EP - 1935
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 7
ER -