Study on Textual Topic Identification by Clustering Clique Structure in Multi-Relationship Text Graph

Cited by: 4
Abstract Now that the Internet is the main channel for scientific communication and information dissemination, a growing number of institutions publish their research results in electronic form, and these electronic texts carry rich semantic information. Faced with this volume of material, researchers find it hard to grasp the main content of a text collection in a short time. Presenting the core topics of text resources efficiently and accurately, so that researchers can focus on the important associated information in a collection and work more efficiently, has long been an important problem in text mining. Drawing on existing work and on the characteristics of terms and term relations in text, the paper proposes representing the terms and the co-occurrence, syntactic, and semantic relations between them as a graph, identifying the tightly connected subgroups (k-cliques) in that text relation graph, and clustering those subgroups to reveal the sub-topics of the text. The work has two parts: (1) the terms in a text collection and the attributes of the relations between them are overlaid and merged according to rules to build a multi-relationship text overlay model; (2) clique subgroups are clustered by the similarity distance between them and by their semantic labels to identify the important sub-topics in the collection.

The method was evaluated in two experiments on a text collection built from the last five years of literature on the topic "migraine disorders". Experiment 1 compared the results with the indexing terms assigned by domain experts in the Medline database, grouped by semantic type, and found the proposed method consistent with that grouping. Experiment 2 compared the results with the widely used Latent Dirichlet Allocation (LDA) method and found higher precision and recall than LDA. Both experiments support the effectiveness of the proposed method.
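The pipeline outlined in the abstract (overlay several relation types into one term graph, find cliques, cluster the cliques) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the merging rules, the similarity distance, or how semantic labels are used, so the sketch assumes a simple count of relation types per edge, Jaccard distance between cliques, and single-linkage grouping under a hypothetical `max_distance` threshold. The function names and the toy terms are also illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def overlay_relations(edge_lists):
    """Merge relation-specific edge lists into one weighted edge map.

    Assumption: an edge's weight is the number of relation types it appears in.
    """
    weights = defaultdict(int)
    for edges in edge_lists.values():
        for pair in {frozenset(e) for e in edges}:
            if len(pair) == 2:  # ignore self-loops
                weights[pair] += 1
    return dict(weights)


def adjacency(weights, min_weight=1):
    """Build an undirected adjacency map from edges with enough support."""
    adj = defaultdict(set)
    for pair, w in weights.items():
        if w >= min_weight:
            u, v = tuple(pair)
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def jaccard_distance(a, b):
    """Assumed stand-in for the paper's clique similarity distance."""
    return 1.0 - len(a & b) / len(a | b)


def cluster_cliques(cliques, max_distance=0.6):
    """Single-linkage grouping: cliques within max_distance share a cluster."""
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if jaccard_distance(cliques[i], cliques[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = defaultdict(set)
    for i, clique in enumerate(cliques):
        groups[find(i)] |= clique
    return sorted(sorted(terms) for terms in groups.values())


# Toy, hand-made relations (hypothetical; not data from the paper).
relations = {
    "cooccurrence": [
        ("migraine", "aura"), ("migraine", "headache"), ("aura", "headache"),
        ("headache", "photophobia"), ("migraine", "photophobia"),
        ("triptan", "sumatriptan"), ("triptan", "serotonin"),
        ("sumatriptan", "serotonin"), ("migraine", "triptan"),
    ],
    "syntactic": [("migraine", "aura"), ("triptan", "sumatriptan")],
    "semantic": [("sumatriptan", "serotonin"), ("migraine", "headache")],
}

weights = overlay_relations(relations)
cliques = maximal_cliques(adjacency(weights))
topics = cluster_cliques(cliques)
print(topics)
```

On this toy input the two overlapping symptom cliques merge into one group, while the drug-related clique stays separate. Raising `min_weight` would keep only edges supported by more than one relation type, which is one possible way to use the overlay; whether the paper does this is not stated in the abstract.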
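Source Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI, CSCD, and the Peking University Core Journals list; 2017, No. 5, pp. 433-442 (10 pages)
Funding Youth Talent Frontier Project of the National Science Library, Chinese Academy of Sciences: "Research on Graph-Pattern-Based Topic Semantic Annotation of Scientific Literature" (G160081001)
Keywords clique subgroup; multi-relationship text; textual topic identification; cluster; k-clique sub-graph; text multi-relationship overlay model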