Abstract
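Reducing high-dimensional spectral data to two dimensions and plotting the result as a scatter plot makes the point cloud distribution directly visible, which helps in judging when a model should be updated. Existing dimensionality reduction methods produce overly scattered sample distributions; the scattered point cloud covers the points of new samples, making their novelty hard to judge. Multistep diffusion yields a compact display of the sample points in the plane, so a dimensionality reduction method based on multistep diffusion mapping is proposed. Methods are given for automatically determining the most suitable bandwidth value and number of diffusion steps from the properties of the dataset itself. The bandwidth of the Gaussian kernel function is determined from the optimal linear position, and the optimal number of diffusion steps is determined by locating the inflection point of the entropy curve. Compared with the conventional principal component analysis (PCA) dimensionality reduction method, the two-dimensional scatter plot obtained with multistep diffusion mapping shows a more compact distribution of sample points, and new samples that differ from the existing ones are easier to identify in the plot.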
Objective: Spectroscopic detection is widely used in industrial process measurement because it is fast, noncontact, and capable of multicomponent measurement. However, spectral measurements must be analyzed with a chemometric model to obtain concentration values. Environmental changes between model establishment and use can degrade prediction accuracy for new data, which necessitates periodic model updates, so the timing of spectral model updates is worth studying. By reducing high-dimensional spectral data to two dimensions and creating scatter plots, one can visually observe the point cloud distribution and judge when to update the model. Current dimensionality reduction methods produce a scattered sample distribution in which the point cloud can obscure new sample points, making it difficult to assess the novelty of new samples. We find that a multistep diffusion process gives a more compact representation of the sample points in the plane, which makes it easier to judge when the model should be updated. Consequently, we propose a dimensionality reduction method based on multistep diffusion mapping.

Methods: Our method is based on the fundamental principle of diffusion mapping. First, the Gaussian kernel function is used to calculate the similarity matrix K of the sample points. The similarity matrix K is then normalized to obtain the Markov probability transition matrix. Next, multistep diffusion is performed on the one-step probability transition matrix to obtain the multistep diffusion probability matrix. This matrix is transformed into diffusion distances, and the low-dimensional coordinates of the dataset are computed using classical multidimensional scaling (CMDS). To select the bandwidth of the kernel function, we construct the similarity matrix W, which depends on the kernel bandwidth, from the Euclidean distances between the sample points. Summing all elements of the similarity matrix yields a function of the kernel bandwidth. We first narrow the range of the total similarity value to extract the intermediate line segment. Within this narrowed range, the most suitable kernel bandwidth is chosen by minimizing the line-fitting error. To select the number of diffusion steps t, the Shannon entropy of the normalized eigenvalues of the sample diffusion matrix is calculated to obtain the Shannon entropy function H(t). The initial rapid decline of the H(t) curve is primarily due to the rapid decrease of the small eigenvalues (which correspond to noise) as the power increases. The subsequent slow decline of the H(t) curve is mainly attributed to the continued increase in power, which reduces essential information. To minimize noise while preserving critical information, we select the "inflection point" of the H(t) curve, where the rate of decline begins to slow, as the most suitable value of t.

Results and Discussions: For the diffusion mapping method, the choice of the number of diffusion steps t is very important. Compared with other values of t, the value calculated automatically by the algorithm in this paper gives the most compact display (Fig. 4). Using PCA and the multistep diffusion mapping algorithm, we reduce the dimensionality of both the old and new samples in the sample set and display them in a two-dimensional scatter plot. The scatter plot obtained with multistep diffusion mapping is more compact, leaving more display space and reducing the overlap between the old and new sample sets. It is therefore easier to assess the novelty of newly added samples, and the display is closer to ideal (Fig. 5). Comparing the scatter plots obtained with the two methods, the distance relationship between old and new samples is visible in the plots generated by multistep diffusion mapping, whereas it is less clear in the plots produced by PCA. Further comparison shows that the distance between old and new samples in the scatter plot obtained with multistep diffusion mapping is proportional to the root-mean-square error value (Table 2). This highlights the effectiveness of multistep diffusion mapping for dimensionality reduction.

Conclusions: The multistep diffusion mapping method generates a compact two-dimensional scatter plot by increasing the number of diffusion steps. This improved scatter plot helps in determining the best timing for model updates. Unlike traditional dimensionality reduction methods, the multistep diffusion technique effectively balances local and global data structures. Selecting optimal parameters based on data characteristics enhances the separation of point clouds after dimensionality reduction. As a result, using this scatter plot to decide when to update the model becomes more accurate and efficient.
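The pipeline described under Methods (Gaussian kernel, Markov normalization, t-step diffusion, diffusion distances, CMDS, and an entropy curve H(t) over normalized eigenvalues) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel scaling `exp(-d^2 / (2 sigma^2))`, the stationary-distribution weighting of the diffusion distance, the inclusion of the trivial eigenvalue in H(t), the median-distance bandwidth, and the chord-based `knee_step` heuristic are all stand-ins chosen here, and every function name is hypothetical. In particular, the paper's line-fitting rule for the bandwidth is not reproduced.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between the rows of X."""
    sq = np.sum(X * X, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def transition_matrix(X, sigma):
    """Gaussian similarity K, its row sums d, and the Markov matrix P."""
    K = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma**2))  # assumed scaling
    d = K.sum(axis=1)
    return K, d, K / d[:, None]

def classical_mds(D2, n_components=2):
    """Classical MDS from a matrix of squared distances."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ D2 @ J                    # double-centered Gram matrix
    w, V = np.linalg.eigh((B + B.T) / 2.0)   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def multistep_diffusion_embedding(X, sigma, t, n_components=2):
    """t-step diffusion distances followed by classical MDS."""
    _, d, P = transition_matrix(X, sigma)
    Pt = np.linalg.matrix_power(P, t)        # t-step transition probabilities
    pi = d / d.sum()                         # stationary distribution of P
    A = Pt / np.sqrt(pi)[None, :]            # assumed 1/pi weighting
    G = A @ A.T
    g = np.diag(G)
    D2 = np.maximum(g[:, None] + g[None, :] - 2.0 * G, 0.0)
    return classical_mds(D2, n_components)

def entropy_curve(X, sigma, t_max):
    """Shannon entropy H(t) of normalized eigenvalues lam**t, t = 1..t_max."""
    K, d, _ = transition_matrix(X, sigma)
    S = K / np.sqrt(np.outer(d, d))          # symmetric, same spectrum as P
    lam = np.clip(np.linalg.eigvalsh(S), 0.0, 1.0)
    H = np.empty(t_max)
    for t in range(1, t_max + 1):
        p = lam**t
        p = p / p.sum()
        nz = p > 0                           # skip underflowed terms
        H[t - 1] = -np.sum(p[nz] * np.log(p[nz]))
    return H

def knee_step(H):
    """Stand-in elbow rule: step whose point lies farthest below the chord."""
    span = H[0] - H[-1]
    if len(H) < 2 or span <= 0:
        return 1
    x = np.linspace(0.0, 1.0, len(H))
    y = (H - H[-1]) / span                   # rescaled to run from 1 down to 0
    return int(np.argmax((1.0 - x) - y)) + 1  # steps are 1-based

# Synthetic stand-in for spectra: 30 samples, 50 wavelengths.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
sigma = np.sqrt(np.median(pairwise_sq_dists(X)))  # placeholder bandwidth
H = entropy_curve(X, sigma, t_max=20)
t_best = knee_step(H)
Y = multistep_diffusion_embedding(X, sigma, t_best)
```

Here `Y` holds one two-dimensional coordinate per sample, ready for a scatter plot. Raising `t` shrinks the diffusion distances, which is the mechanism behind the more compact point cloud the abstract describes.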
Authors
贺忠海
贾琼
冯占波
张晓芳
He Zhonghai; Jia Qiong; Feng Zhanbo; Zhang Xiaofang (School of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, Hebei, China; Hebei Key Laboratory of Micro-Nano Sensing, Qinhuangdao 066004, Hebei, China; School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China)
Source
《光学学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, No. 20, pp. 285-293 (9 pages)
Acta Optica Sinica
Funding
Natural Science Foundation of Hebei Province (F2020501040).
Keywords
spectroscopy
model updating
multistep diffusion
compact display
kernel width determination
optimal diffusion steps