Abstract
Objective
Spectroscopic detection is widely used in industrial process measurement because it is fast, non-contact, and capable of multi-component measurement. However, spectral measurements must be analyzed with a chemometric model to obtain concentration values. Environmental changes between model establishment and use can degrade prediction accuracy for new data, so the model must be updated periodically, and it is therefore important to study when to update it. By reducing high-dimensional spectral data to two dimensions and drawing scatter plots, one can visually inspect the point cloud distribution and judge when to update the model. Current dimensionality reduction methods produce a scattered sample distribution in which the point cloud can obscure new sample points, making it difficult to assess their novelty. We find that a multi-step diffusion process yields a more compact representation of the sample points in the plane, which makes it easier to judge when the model should be updated. Consequently, we propose a dimensionality reduction method based on multi-step diffusion mapping.

Methods
Our method is based on the fundamental principle of diffusion mapping. First, a Gaussian kernel function is used to calculate the similarity matrix K of the sample points. K is then normalized to obtain the Markov transition probability matrix. Next, multi-step diffusion is applied to the one-step transition probability matrix to obtain the multi-step diffusion probability matrix. This matrix is converted into diffusion distances, and the low-dimensional coordinates of the dataset are computed using classical multidimensional scaling (CMDS). To select the kernel bandwidth, we construct a similarity matrix W that depends on the kernel bandwidth, based on the Euclidean distances between the sample points.
Summing all elements of the similarity matrix yields a function of the kernel bandwidth. We first narrow the range of the total similarity value to extract the intermediate linear segment of this function. Within this narrowed range, the most suitable kernel bandwidth is chosen by minimizing the error of the fitted line. To select the number of diffusion steps t, we compute the Shannon entropy of the normalized eigenvalues of the sample diffusion matrix, which gives the entropy function H(t). The initial rapid decline of the H(t) curve is primarily caused by the small eigenvalues (which correspond to noise) decaying quickly as the power increases. The subsequent slow decline is mainly caused by the continued increase in power, which begins to remove essential information. To suppress noise while preserving critical information, we select the “inflection point” of the H(t) curve, where the rate of decline begins to slow, as the most suitable value of t.

Results and Discussions
For the diffusion mapping method, the choice of the number of diffusion steps t is critical. Compared with other values of t, the value computed automatically by the proposed algorithm gives the most compact distribution (Fig. 4). Using PCA and the multi-step diffusion mapping algorithm, we reduce the dimensionality of both the old and new samples in the sample set and display them in two-dimensional scatter plots. The scatter plot obtained with multi-step diffusion mapping is more compact, which leaves more display space and reduces the overlap between the old and new sample sets. It is therefore easier to assess the novelty of newly added samples, and the display is clearer (Fig. 5).
A comparison of the scatter plots from the two methods shows that the distance relationship between old and new samples is clearly visible in the plots generated by multi-step diffusion mapping, whereas it is less clear in those produced by PCA. Further comparison shows that the distance between old and new samples in the multi-step diffusion mapping scatter plot is proportional to the corresponding root-mean-square error (Table 2). This demonstrates the effectiveness of multi-step diffusion mapping for dimensionality reduction.

Conclusions
The multi-step diffusion mapping method generates a compact two-dimensional scatter plot by increasing the number of diffusion steps. This improved scatter plot helps determine the best time to update the model. Unlike traditional dimensionality reduction methods, the multi-step diffusion technique effectively balances local and global data structure. Selecting the parameters according to the characteristics of the data improves the separation of the point clouds after dimensionality reduction. As a result, deciding when to update the model from this scatter plot becomes more accurate and efficient.
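
The pipeline summarized in the Methods paragraph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the bandwidth is set by a median-distance rule rather than the line-fitting procedure described above, the diffusion distance is weighted by the stationary distribution (the abstract does not specify a weighting), and the "inflection point" of H(t) is located with a simple farthest-from-chord heuristic.

```python
import numpy as np

def gaussian_kernel(X, eps):
    # Similarity matrix K_ij = exp(-||x_i - x_j||^2 / eps).
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / eps)

def entropy_curve(K, t_max):
    # Eigenvalues of P = D^-1 K equal those of the symmetric D^-1/2 K D^-1/2.
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    lam = np.clip(np.linalg.eigvalsh(S), 0.0, 1.0)
    H = []
    for t in range(1, t_max + 1):
        p = lam**t
        p = p / p.sum()          # normalized eigenvalues after t steps
        p = p[p > 0]             # drop zeros so that p * log(p) is defined
        H.append(-np.sum(p * np.log(p)))
    return np.array(H)

def pick_t_by_knee(H):
    # Assumed heuristic: the step where H lies farthest below the chord
    # joining the first and last points of the curve.
    t = np.arange(1, len(H) + 1)
    chord = H[0] + (H[-1] - H[0]) * (t - 1) / (len(H) - 1)
    return int(t[np.argmax(chord - H)])

def diffusion_embed(K, t, dim=2):
    d = K.sum(axis=1)
    P = K / d[:, None]                     # one-step Markov transition matrix
    Pt = np.linalg.matrix_power(P, t)      # t-step transition probabilities
    pi = d / d.sum()                       # stationary distribution (assumed weighting)
    A = Pt / np.sqrt(pi)[None, :]
    g = np.sum(A**2, axis=1)
    D2 = np.maximum(g[:, None] + g[None, :] - 2.0 * A @ A.T, 0.0)  # squared diffusion distances
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix for classical MDS
    B = -0.5 * J @ D2 @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]        # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Synthetic stand-in for spectra: 40 samples, 50 wavelengths.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
eps = np.median(d2[d2 > 0])                # assumed bandwidth rule, not the paper's
K = gaussian_kernel(X, eps)
H = entropy_curve(K, t_max=30)
t_star = pick_t_by_knee(H)
Y = diffusion_embed(K, t_star)             # 2-D coordinates for the scatter plot
print(t_star, Y.shape)
```

Computing the eigenvalues from the symmetric matrix rather than from P directly keeps them real and lets `numpy.linalg.eigvalsh` be used; in practice the median rule would be replaced by the bandwidth selection described in the Methods paragraph.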
Translated title of the contribution | Dimensionality Reduction Method for Compact Sample Distribution Using Multi-Step Diffusion Mapping |
---|---|
Original language | Chinese (Traditional) |
Article number | 2030001 |
Journal | Guangxue Xuebao/Acta Optica Sinica |
Volume | 44 |
Issue number | 20 |
DOIs | |
Publication status | Published - Oct 2024 |