Abstract
Similar words are a pervasive phenomenon in natural language, and word similarity calculation is an intermediate step in information science, natural language processing, and information processing. First, a Chinese word co-occurrence network is constructed from a large-scale corpus, and the idea of node similarity in complex network structures is then used to compute word similarity. Second, drawing on the distributional hypothesis, context theory, and the structural characteristics of word networks, this paper proposes a word similarity method based on contribution discounting, the Contribution Discount Similarity algorithm (CDSim), which takes into account not only the edge weights of the network but also the global degree characteristics of its nodes. Node similarity experiments show that CDSim performs markedly better than the common neighbors, Jaccard, and Salton methods. Finally, the experimental results and conclusions are analyzed in detail.
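The three baseline indices named in the abstract can be illustrated with a minimal sketch. It assumes an undirected co-occurrence graph built with a sliding window over tokenized sentences, and it takes the third baseline to be the Salton (cosine) index; the function names, the window size, and the toy corpus are hypothetical and not taken from the paper. CDSim itself is not reproduced here because the abstract does not give its formula.

```python
import math
from collections import defaultdict


def build_cooccurrence_network(sentences, window=2):
    """Build an undirected weighted graph: graph[u][v] = co-occurrence count.

    Two words co-occur when they appear within `window` tokens of each other
    in the same sentence (an assumed definition, for illustration only).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if u != v:
                    graph[u][v] += 1
                    graph[v][u] += 1
    return graph


def _neighbors(graph, word):
    # Use .get so that looking up an unknown word does not add it to the graph.
    return set(graph.get(word, {}))


def common_neighbors(graph, x, y):
    """Number of neighbors shared by x and y."""
    return len(_neighbors(graph, x) & _neighbors(graph, y))


def jaccard(graph, x, y):
    """Shared neighbors divided by the size of the neighbor union."""
    nx, ny = _neighbors(graph, x), _neighbors(graph, y)
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0


def salton(graph, x, y):
    """Shared neighbors divided by sqrt(degree(x) * degree(y))."""
    nx, ny = _neighbors(graph, x), _neighbors(graph, y)
    denom = math.sqrt(len(nx) * len(ny))
    return len(nx & ny) / denom if denom else 0.0


# Hypothetical toy corpus, already tokenized.
toy_corpus = [
    ["cat", "eats", "fish"],
    ["dog", "eats", "meat"],
    ["cat", "chases", "mouse"],
    ["dog", "chases", "cat"],
]
network = build_cooccurrence_network(toy_corpus)
for index in (common_neighbors, jaccard, salton):
    print(index.__name__, index(network, "cat", "dog"))
```

All three baselines use only the neighbor sets and ignore the edge counts stored in the graph, which is the limitation the abstract says CDSim addresses by incorporating edge weights and global degree information.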
Source
《情报学报》
CSSCI
Peking University Core Journal (北大核心)
2015, No. 8, pp. 885-896 (12 pages)
Journal of the China Society for Scientific and Technical Information
Funding
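Supported by the National Natural Science Foundation of China Youth Project "Construction of a CSSCI-Based Syntax-Level Chinese-English Parallel Corpus and Knowledge Mining" (Grant No. 71303120),
the Nanjing University of Posts and Telecommunications Scientific Research Start-up Fund for Introduced Talents "Corpus-Based Research on Word Similarity Calculation" (Grant No. NYS213008),
and the Nanjing University of Posts and Telecommunications NSFC Incubation Project "Research on Chinese Word Sense Knowledge Mining in the Big Data Era" (Grant No. NY214112).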
Keywords
complex network, corpus, word similarity, semantic relatedness