Abstract
[Objective] This paper investigates the asymmetric relationship between documents and proposes a model to quantify it. [Methods] Building on topic word co-occurrence, we mine the asymmetric association between topic words, use a document coverage degree to quantify the asymmetric relationship between documents, and evaluate the model empirically through document clustering. [Results] In document clustering, compared with two existing models for quantifying inter-document relationships, the proposed model based on topic word co-occurrence reduced the average entropy of the clustering results by up to 22.6% and 23.3%, respectively. [Limitations] The model considers only the textual content of documents and does not account for the effect of non-textual content, such as pictures and formulas, on the asymmetric relationship between documents. [Conclusions] Exploiting the asymmetric relationship between documents distinguishes differences between documents more clearly and helps improve document clustering accuracy.
Authors
张国防
王鑫
徐建民
Zhang Guofang; Wang Xin; Xu Jianmin (School of Management, Hebei University, Baoding 071002, China; College of Mathematics and Information Science, Hebei University, Baoding 071002, China; School of Cyber Security and Computer, Hebei University, Baoding 071002, China)
Source
《数据分析与知识发现》
CSCD
Peking University Core Journal (北大核心)
2023, Issue 3, pp. 110-120 (11 pages)
Data Analysis and Knowledge Discovery
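Funding
This work is one of the research outcomes of the Post-funded Project of the National Social Science Fund of China (Grant No. 17FTQ002) and the Hebei Province Social Science Fund Project (Grant No. HB20TQ002).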
Keywords
Asymmetric Relationship (非对称关系)
Topic Word Co-occurrence (主题词共现)
Coverage (覆盖度)