TY - JOUR
T1 - Nearly Optimal Rates of Privacy-Preserving Sparse Generalized Eigenvalue Problem
AU - Hu, Lijie
AU - Xiang, Zihang
AU - Liu, Jiabin
AU - Wang, Di
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - In this article, we study the (sparse) generalized eigenvalue problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher's discriminant analysis (FDA), and sliced inverse regression (SIR). We provide the first study on GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low dimensional case, we provide a ρ-Concentrated DP (CDP) method namely DP-Rayleigh Flow and show if the initial vector is close enough to the optimal vector, its output has anℓ2-norm estimation error of Õ(dn+dn2ρ) (under some mild assumptions), where d is the dimension and n is the sample size. Next, we discuss how to find such an initial parameter privately. In the high dimensional sparse case where nd≫n, we propose the DP-Truncated Rayleigh Flow method whose output could achieve an error of (slogdn+slogdn2ρ) for various statistical models, where s is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of Poly(logn) by providing the lower bounds of PCA and SIR under the statistical setting and in the CDP model. Finally, to give a separation between ϵ-DP and ρ-CDP for GEP, we also provide the lower bound Ω(dn+d2n2ϵ2) and Ω(slogdn+s2log2dn2ϵ2) of private minimax risk for PCA, under the statistical setting and ϵ-DP model, in low and high dimensional sparse case respectively. Finally, extensive experiments on both synthetic and real-world data support our previous theoretical analysis.
AB - In this article, we study the (sparse) generalized eigenvalue problem (GEP), which arises in a number of modern statistical learning models, such as principal component analysis (PCA), canonical correlation analysis (CCA), Fisher's discriminant analysis (FDA), and sliced inverse regression (SIR). We provide the first study on GEP in the differential privacy (DP) model under both deterministic and stochastic settings. In the low dimensional case, we provide a ρ-Concentrated DP (CDP) method namely DP-Rayleigh Flow and show if the initial vector is close enough to the optimal vector, its output has anℓ2-norm estimation error of Õ(dn+dn2ρ) (under some mild assumptions), where d is the dimension and n is the sample size. Next, we discuss how to find such an initial parameter privately. In the high dimensional sparse case where nd≫n, we propose the DP-Truncated Rayleigh Flow method whose output could achieve an error of (slogdn+slogdn2ρ) for various statistical models, where s is the sparsity of the underlying parameter. Moreover, we show that these errors in the stochastic setting are optimal up to a factor of Poly(logn) by providing the lower bounds of PCA and SIR under the statistical setting and in the CDP model. Finally, to give a separation between ϵ-DP and ρ-CDP for GEP, we also provide the lower bound Ω(dn+d2n2ϵ2) and Ω(slogdn+s2log2dn2ϵ2) of private minimax risk for PCA, under the statistical setting and ϵ-DP model, in low and high dimensional sparse case respectively. Finally, extensive experiments on both synthetic and real-world data support our previous theoretical analysis.
KW - Dimension reduction
KW - generalized eigenvalue problem
KW - privacy
KW - sliced inverse regression
UR - http://www.scopus.com/inward/record.url?scp=85177043333&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2023.3330775
DO - 10.1109/TKDE.2023.3330775
M3 - Article
AN - SCOPUS:85177043333
SN - 1041-4347
VL - 36
SP - 4101
EP - 4115
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 8
ER -