Data File Format Analysis Method Based on Cluster Analysis
Cited by: 2
Abstract: File formats often have to be analyzed in practice so that data resources can be shared and exchanged, yet existing data file format analysis methods suffer from problems such as low parsing efficiency. This paper proposes a data file format analysis method based on cluster analysis. The method clusters the bytes at the same position across different files of the same type, and clusters different byte positions within a file group, to obtain the byte repetition-degree distribution of the data files and to analyze the corresponding cluster distribution characteristics. The repetition-degree distribution characteristics of the file header, the file data body, and the boundary between them serve as the basis for format analysis, and cluster analysis over byte groups is proposed to improve analysis efficiency. Based on the file storage structure and the cluster distribution characteristics, principles for selecting file samples for cluster analysis are formulated. A supporting format-parsing tool was also developed; it automatically checks whether the selected files are reasonable and groups them automatically, which simplifies the parsing process. The method and tool were applied in a format-parsing experiment on MS data files produced by an Agilent GC 6890N MSD 5793N mass spectrometer. The results show that the recovered file format is accurate and that efficiency improves markedly; reusing the method can promote data resource sharing among large scientific research instruments and raise the utilization of data resources.
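The first clustering step in the abstract (comparing the same byte position across different files of the same type) can be illustrated with a minimal sketch. The abstract does not define how the repetition degree is computed, so the definition below is an assumption: the share of sample files that hold the most common value at a given offset. The function names `repetition_by_offset` and `guess_header_end`, the threshold, the window size, and the synthetic sample files are all hypothetical and are not taken from the paper or its tool.

```python
from collections import Counter


def repetition_by_offset(files: list[bytes]) -> list[float]:
    """For each byte offset, return the share of files holding the most common value.

    Assumed definition of "repetition degree"; the paper may use another one.
    """
    if not files:
        return []
    length = min(len(f) for f in files)  # compare only offsets present in every file
    degrees = []
    for offset in range(length):
        counts = Counter(f[offset] for f in files)
        degrees.append(counts.most_common(1)[0][1] / len(files))
    return degrees


def guess_header_end(degrees: list[float], threshold: float = 0.9, window: int = 4) -> int:
    """Return the first offset where `window` consecutive degrees fall below `threshold`.

    A crude stand-in for locating the header/data-body boundary; returns
    len(degrees) when no such run exists.
    """
    for start in range(len(degrees) - window + 1):
        if all(d < threshold for d in degrees[start:start + window]):
            return start
    return len(degrees)


# Synthetic samples: a shared 4-byte magic value followed by differing payloads.
samples = [
    b"HDR1" + bytes([1, 2, 3, 4, 5, 6]),
    b"HDR1" + bytes([9, 8, 7, 6, 5, 4]),
    b"HDR1" + bytes([20, 21, 22, 23, 24, 25]),
]
degrees = repetition_by_offset(samples)
boundary = guess_header_end(degrees)
```

In this toy input the four header offsets are identical in every file and the payload offsets mostly differ, so the drop in repetition degree marks a plausible boundary. Real instrument files would need more samples and a sample selection policy along the lines the abstract mentions.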
Authors: 刘杰 (LIU Jie); 常兴山 (CHANG Xing-shan); 孙锋 (SUN Feng); 周建辉 (ZHOU Jian-hui) (School of Naval Architecture, Ocean and Energy Power Engineering, Wuhan University of Technology, Wuhan 430063, China; Ship Development and Design Center, Wuhan 430064, China)
Source: Journal of Wuhan University of Technology (《武汉理工大学学报》), CAS, 2022, No. 1, pp. 93-99 (7 pages)
Funding: High-Tech Ship Special Project of the Ministry of Industry and Information Technology (CJ02N20).
Keywords: cluster analysis; instrument resource sharing; data file format analysis