Abstract
To address the problem of extracting a subject thesaurus for a large-scale domain, a method is proposed that builds a word co-occurrence feature matrix from the co-occurrence features of words in a given corpus. Based on this matrix, the words are partitioned into clusters and the central word of each cluster is computed. Each cluster is then reorganized around its central word, which yields an automatically constructed, corpus-oriented subject thesaurus. Experimental results indicate that the proposed algorithm achieves relatively high precision and recall.
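The pipeline summarized in the abstract (co-occurrence matrix, cluster partitioning, central word selection, reorganization around the central word) can be illustrated with a minimal sketch. The abstract does not specify how co-occurrence is counted, how clusters are formed, or how the central word is scored, so everything below is an assumption made for illustration: co-occurrence is counted per document, clusters are connected components of the graph kept after thresholding with a hypothetical `min_count` parameter, and the central word is the member with the largest total co-occurrence inside its cluster. All function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_matrix(docs):
    """Count, for each word pair, how many documents contain both words."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def word_clusters(counts, min_count=2):
    """Assumed clustering: connected components over pairs seen >= min_count times."""
    seen, clusters = set(), []
    for start in sorted(counts):
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], []
        while stack:
            word = stack.pop()
            cluster.append(word)
            for other, n in counts[word].items():
                if n >= min_count and other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(cluster))
    return clusters


def central_word(cluster, counts):
    """Assumed scoring: the member with the largest co-occurrence total in the cluster."""
    members = set(cluster)

    def strength(word):
        return sum(n for other, n in counts[word].items() if other in members)

    return max(cluster, key=strength)


def build_thesaurus(docs, min_count=2):
    """Map each central word to its cluster mates, strongest association first."""
    counts = cooccurrence_matrix(docs)
    thesaurus = {}
    for cluster in word_clusters(counts, min_count):
        if len(cluster) < 2:
            continue  # a lone word has nothing to be organized around
        center = central_word(cluster, counts)
        related = [w for w in cluster if w != center]
        related.sort(key=lambda w: -counts[center][w])
        thesaurus[center] = related
    return thesaurus


# Toy corpus: each document is already tokenized into words.
docs = [
    ["network", "router", "packet"],
    ["network", "router"],
    ["network", "packet"],
    ["protein", "gene"],
    ["protein", "gene", "cell"],
    ["protein", "cell"],
]
thesaurus = build_thesaurus(docs)
```

On this toy corpus the sketch groups the networking words under `network` and the biology words under `protein`. A real corpus would likely need a windowed or normalized co-occurrence measure and a more robust clustering step than simple thresholding.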
Authors
安亚巍
操晓春
罗顺
AN Ya-wei (1), CAO Xiao-chun (2), LO Shun (1) (1. Shanghai General Recognition Technology Institute, Shanghai 201112, China; 2. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China)
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journals (北大核心)
2018, Issue B06, pp. 396-397, 410 (3 pages in total)
Funding
Supported by the National Natural Science Foundation of China (61422213, U1636214)
Keywords
Subject thesaurus
Word co-occurrence feature
Word cluster partitioning
Corpus mining