Dimensionality Reduction Method for Compact Sample Distribution Using Multi-Step Diffusion Mapping (多步扩散映射得到样本紧凑分布的降维方法)

Zhonghai He; Qiong Jia; Zhanbo Feng; Xiaofang Zhang

doi:10.3788/AOS240820

Zhonghai He^*, Qiong Jia, Zhanbo Feng, Xiaofang Zhang

^*Corresponding author for this work

光电学院

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Spectroscopic detection is widely used in industrial process measurement because it is fast, non-contact, and able to measure several components at once. However, spectral measurements must be passed through a chemometric model to obtain concentration values. Environmental changes between model establishment and model use can degrade prediction accuracy for new data, so the model needs periodic updating, and the timing of those updates is worth studying. Reducing high-dimensional spectral data to two dimensions and drawing a scatter plot lets one inspect the point cloud visually and judge when to update the model. Current dimensionality reduction methods produce a scattered sample distribution in which the point cloud can obscure new sample points, making it hard to assess their novelty. We find that a multi-step diffusion process gives a more compact representation of the sample points in the plane, which makes it easier to judge when the model should be updated. We therefore propose a dimensionality reduction method based on multi-step diffusion mapping.

Methods: The method builds on the basic principle of diffusion mapping. First, a Gaussian kernel function is used to compute the similarity matrix K of the sample points. K is then normalized to give the Markov transition probability matrix. Next, multi-step diffusion is applied to the one-step transition matrix to obtain the multi-step diffusion probability matrix. This matrix is converted into diffusion distances, and the low-dimensional coordinates of the dataset are computed by classical multidimensional scaling (CMDS). To select the kernel bandwidth, we construct a similarity matrix W, which depends on the kernel bandwidth, from the Euclidean distances between the sample points.
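The pipeline described in the Methods (Gaussian kernel, Markov normalization, multi-step diffusion, diffusion distances, CMDS) can be sketched in Python with NumPy as follows. This is a minimal illustration and not the authors' implementation: the function name `multistep_diffusion_embedding`, the kernel form exp(-d²/ε), and the weighting of the diffusion distance by the inverse stationary distribution (the standard diffusion-map definition) are assumptions, since the abstract does not state them.

```python
import numpy as np


def multistep_diffusion_embedding(X, epsilon, t, n_components=2):
    """Hypothetical sketch: t-step diffusion distances followed by classical MDS.

    X is an (n_samples, n_features) array; epsilon is the kernel bandwidth;
    t is the number of diffusion steps. Intended for small n only, because
    the distance step below uses O(n^3) memory.
    """
    # Pairwise squared Euclidean distances between samples.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)

    # Gaussian kernel similarity matrix K (assumed form: exp(-d^2 / epsilon)).
    K = np.exp(-sq_dists / epsilon)

    # Row normalization turns K into the one-step Markov transition matrix P.
    degrees = K.sum(axis=1)
    P = K / degrees[:, None]

    # Multi-step diffusion: t-step transition probabilities.
    P_t = np.linalg.matrix_power(P, t)

    # Squared diffusion distance between samples i and j: squared difference of
    # rows i and j of P_t, weighted by the inverse stationary distribution
    # (assumed, standard diffusion-map definition).
    stationary = degrees / degrees.sum()
    diff = P_t[:, None, :] - P_t[None, :, :]
    D2 = np.sum(diff**2 / stationary[None, None, :], axis=2)

    # Classical MDS: double-center the squared distances, then keep the
    # eigenvectors belonging to the largest eigenvalues.
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return coords, D2


# Synthetic stand-in for spectra: 30 samples with 50 channels each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 50))
coords_demo, D2_demo = multistep_diffusion_embedding(X_demo, epsilon=100.0, t=4)
print(coords_demo.shape)
```

Because the eigenvalues of the transition matrix have magnitude at most 1, diffusion distances cannot grow as t increases, which is consistent with the more compact scatter plots reported for larger step counts.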
Summing all elements of the similarity matrix gives a function of the kernel bandwidth. We first narrow the range of this total similarity value to extract the intermediate line segment, and within that range choose the kernel bandwidth that minimizes the line-fitting error. To select the number of diffusion steps t, we compute the Shannon entropy of the normalized eigenvalues of the sample diffusion matrix, which gives an entropy function H(t). The initial rapid decline of H(t) arises mainly because small eigenvalues, which correspond to noise, decay quickly as the power increases. The subsequent slow decline arises because further increases in the power begin to erode essential information. To suppress noise while preserving that information, we select as t the "inflection point" of the H(t) curve, where the rate of decline begins to slow.

Results and Discussions: The choice of the number of diffusion steps t is critical for diffusion mapping. Compared with other values of t, the value computed automatically by the proposed algorithm gives the most compact distribution (Fig. 4). Using PCA and the multi-step diffusion mapping algorithm, we reduce the dimensionality of both the old and the new samples and display them in two-dimensional scatter plots. The scatter plot from multi-step diffusion mapping is more compact, leaving more display space and reducing the overlap between the old and new sample sets. The novelty of newly added samples is therefore easier to assess, and the display is clearer (Fig. 5).
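The kernel bandwidth selection described in the Methods can be sketched as below. The abstract does not give the exact procedure, so this is one possible reading, stated as an assumption: the total similarity S(ε) = Σ exp(-d²/ε) is evaluated on a log-spaced grid, only the middle part of the log S versus log ε curve is kept, a straight line is fitted to that part, and the grid value with the smallest residual from the line is returned. The function name `select_kernel_bandwidth`, the grid limits, and the 20% to 80% cut-offs are all hypothetical.

```python
import numpy as np


def select_kernel_bandwidth(X, n_grid=60, lower=0.2, upper=0.8):
    """Hypothetical sketch of choosing the Gaussian kernel bandwidth epsilon."""
    # Pairwise squared Euclidean distances between samples.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    n = X.shape[0]

    # Candidate bandwidths, log-spaced around the median squared distance
    # (the three-decade span on either side is an assumption).
    median_sq = np.median(sq_dists[np.triu_indices(n, k=1)])
    eps_grid = median_sq * np.logspace(-3, 3, n_grid)

    # Total similarity S(eps): sum of all entries of the kernel matrix.
    # It rises from about n (tiny eps) to about n^2 (huge eps).
    log_eps = np.log(eps_grid)
    log_S = np.array([np.log(np.exp(-sq_dists / eps).sum()) for eps in eps_grid])

    # Narrow the range of the total similarity to the middle of the curve.
    span = log_S.max() - log_S.min()
    mask = (log_S >= log_S.min() + lower * span) & (log_S <= log_S.min() + upper * span)
    if mask.sum() < 2:
        raise ValueError("too few grid points in the intermediate segment")

    # Fit a straight line to the retained segment and pick the bandwidth whose
    # point lies closest to that line.
    slope, intercept = np.polyfit(log_eps[mask], log_S[mask], 1)
    residuals = np.abs(log_S[mask] - (slope * log_eps[mask] + intercept))
    return float(eps_grid[mask][np.argmin(residuals)])


# Synthetic stand-in data: 40 samples with 10 features each.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 10))
print(select_kernel_bandwidth(X_demo))
```

Restricting the fit to the middle of the curve avoids the two flat ends, where the kernel matrix is close to the identity or close to all ones and carries little information about the data geometry.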
Comparing the scatter plots from the two methods, the distance relationship between old and new samples is visible in the plots generated by multi-step diffusion mapping, whereas it is less clear in those produced by PCA. Further comparison shows that the distance between old and new samples in the multi-step diffusion mapping scatter plot is proportional to the corresponding root-mean-square error (Table 2), which highlights the effectiveness of multi-step diffusion mapping for dimensionality reduction.

Conclusions: The multi-step diffusion mapping method produces a compact two-dimensional scatter plot by increasing the number of diffusion steps, and this plot helps determine the best time to update the model. Unlike traditional dimensionality reduction methods, the multi-step diffusion technique balances local and global data structure. Selecting the parameters according to the characteristics of the data improves the separation of point clouds after dimensionality reduction. As a result, using the scatter plot to decide when to update the model becomes more accurate and efficient.
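The entropy-based choice of the number of diffusion steps t described in the Methods can be sketched as follows. Two points are assumptions rather than details from the abstract: the eigenvalues are taken from the symmetric conjugate of the Markov matrix (which has the same spectrum), and the "inflection point" is located with a simple knee heuristic, namely the point of the normalized H(t) curve that lies farthest below the chord joining its first and last points. The names `entropy_curve` and `select_diffusion_steps` are hypothetical.

```python
import numpy as np


def entropy_curve(K, t_max=50):
    """Shannon entropy H(t) of the normalized eigenvalues of P^t, t = 1..t_max.

    K is a symmetric Gaussian kernel matrix; P = D^{-1} K is its Markov
    normalization.
    """
    degrees = K.sum(axis=1)
    # D^{-1/2} K D^{-1/2} is symmetric and shares its eigenvalues with P.
    A = K / np.sqrt(np.outer(degrees, degrees))
    eigvals = np.clip(np.linalg.eigvalsh(A), 0.0, 1.0)

    H = np.empty(t_max)
    for t in range(1, t_max + 1):
        powered = eigvals**t              # eigenvalues of P^t
        p = powered / powered.sum()       # normalize to a probability vector
        p = p[p > 0]                      # treat 0 * log(0) as 0
        H[t - 1] = -np.sum(p * np.log(p))
    return H


def select_diffusion_steps(H):
    """Return the t (1-based) at the knee of a decreasing entropy curve."""
    if len(H) < 3 or H[0] == H[-1]:
        return 1
    t = np.arange(1, len(H) + 1)
    # Rescale both axes to [0, 1] so the chord runs from (0, 1) to (1, 0).
    x = (t - t[0]) / (t[-1] - t[0])
    y = (H - H[-1]) / (H[0] - H[-1])
    # Vertical gap between the chord y = 1 - x and the curve; largest at the knee.
    gap = 1.0 - x - y
    return int(t[np.argmax(gap)])


# Synthetic stand-in data and a Gaussian kernel with an arbitrary bandwidth.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 10))
sq_demo = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
K_demo = np.exp(-sq_demo / 20.0)
H_demo = entropy_curve(K_demo, t_max=30)
print(select_diffusion_steps(H_demo))
```

The curve falls quickly at first because small eigenvalues vanish under powering, then flattens once only the dominant eigenvalues remain, which is the behaviour the abstract attributes to noise removal followed by loss of essential information.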

Translated title of the contribution	Dimensionality Reduction Method for Compact Sample Distribution Using Multi-Step Diffusion Mapping
Original language	Chinese (Traditional)
Article number	2030001
Journal	Guangxue Xuebao/Acta Optica Sinica
Volume	44
Issue number	20
DOI	https://doi.org/10.3788/AOS240820
Publication status	Published - Oct 2024

Keywords

compact display
kernel width determination
model updating
multi-step diffusion
optimal diffusion steps
spectroscopy

Cite this

@article{a9d3777106494435825dcaf6bfafbf84,

title = "多步扩散映射得到样本紧凑分布的降维方法",

keywords = "compact display, kernel width determination, model updating, multi-step diffusion, optimal diffusion steps, spectroscopy",

author = "Zhonghai He and Qiong Jia and Zhanbo Feng and Xiaofang Zhang",

note = "Publisher Copyright: {\textcopyright} 2024 Chinese Optical Society.",

year = "2024",

month = oct,

doi = "10.3788/AOS240820",

language = "Chinese (Traditional)",

volume = "44",

journal = "Guangxue Xuebao/Acta Optica Sinica",

issn = "0253-2239",

publisher = "Chinese Optical Society",

number = "20",

}

TY - JOUR

T1 - 多步扩散映射得到样本紧凑分布的降维方法

AU - He, Zhonghai

AU - Jia, Qiong

AU - Feng, Zhanbo

AU - Zhang, Xiaofang

PY - 2024/10

Y1 - 2024/10

KW - compact display

KW - kernel width determination

KW - model updating

KW - multi-step diffusion

KW - optimal diffusion steps

KW - spectroscopy

UR - http://www.scopus.com/inward/record.url?scp=85207092343&partnerID=8YFLogxK

U2 - 10.3788/AOS240820

DO - 10.3788/AOS240820

M3 - Article

AN - SCOPUS:85207092343

SN - 0253-2239

VL - 44

JO - Guangxue Xuebao/Acta Optica Sinica

JF - Guangxue Xuebao/Acta Optica Sinica

IS - 20

M1 - 2030001

ER -
