Abstract
Domain thesauri play an important role in information retrieval, natural language processing, and question answering systems. Because natural language is complex, NLP-based thesaurus construction methods struggle to achieve satisfactory results. In recent years, Wiki encyclopedias have come into wide use; a Wiki contains not only a large number of articles but also a rich link structure. Exploiting the anchor description and topic locality properties of hyperlinks, this paper proposes an automatic domain thesaurus construction method based on clustering a weighted undirected link structure graph. The method first uses Wiki to build an undirected link structure graph for a specific domain, then computes the weight of each link using the LSI algorithm and cosine similarity, and finally clusters the weighted undirected graph with the CPMw algorithm to obtain the domain thesaurus. Experiments show that the proposed method yields better domain thesaurus construction results.
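The weighting and clustering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each term already has a low-dimensional vector (standing in for the output of an LSI projection, which is not computed here), weights each link by cosine similarity, and then groups terms in the spirit of weighted clique percolation: k-cliques whose intensity (geometric mean of edge weights) reaches a threshold are kept, and cliques sharing k-1 nodes are merged. The term names, vectors, links, and the function names `cosine`, `weight_edges`, and `cpmw` are all hypothetical.

```python
from itertools import combinations
from math import prod, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weight_edges(edges, vectors):
    """Assign each undirected link the cosine similarity of its endpoints."""
    return {frozenset(e): cosine(vectors[e[0]], vectors[e[1]]) for e in edges}


def cpmw(weights, k=3, min_intensity=0.8):
    """Simplified weighted clique percolation (illustrative only).

    Keeps k-cliques whose intensity (geometric mean of edge weights)
    is at least min_intensity, then merges cliques sharing k-1 nodes.
    """
    nodes = sorted({n for edge in weights for n in edge})
    cliques = []
    for cand in combinations(nodes, k):
        pairs = [frozenset(p) for p in combinations(cand, 2)]
        if all(p in weights for p in pairs):
            ws = [max(weights[p], 0.0) for p in pairs]
            if prod(ws) ** (1.0 / len(ws)) >= min_intensity:
                cliques.append(frozenset(cand))

    # Union-find over cliques: adjacent cliques join the same community.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical 2-D vectors standing in for LSI-projected Wiki articles.
vectors = {
    "retrieval": (0.9, 0.1),
    "indexing": (0.8, 0.2),
    "query": (0.85, 0.15),
    "ranking": (0.7, 0.3),
    "football": (0.1, 0.9),
}
# Hypothetical undirected links between the articles.
edges = [
    ("retrieval", "indexing"), ("retrieval", "query"), ("indexing", "query"),
    ("indexing", "ranking"), ("query", "ranking"),
    ("ranking", "football"), ("query", "football"),
]

weights = weight_edges(edges, vectors)
communities = cpmw(weights, k=3, min_intensity=0.8)
print(communities)
```

In this toy graph the off-topic term is linked to two domain terms, but the clique containing it has low intensity and is dropped, so it does not enter the resulting term cluster. The values of `k` and `min_intensity` are arbitrary choices for the example and would need tuning on real data.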
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2014, Issue 6, pp. 1286-1292 (7 pages)
Funding
Supported by the National Key Technology R&D Program of China (2011BAH11B01)