Abstract
Existing words and phrases are often adopted in the text of a domain in which they have never been used before; such words and phrases are called newly-emerging domain words. Detecting them gives researchers in that domain up-to-date insight into how the domain is developing and helps them analyze current public opinion, so the task is of considerable practical importance. To address it, this paper proposes a method for detecting newly-emerging domain words based on dependency syntactic analysis and word vectors (term vectors). First, the concept of a syntactic dictionary is introduced, together with a method for constructing a domain syntactic dictionary from dependency parsing combined with TF-IDF values computed over the training corpus. Next, the domain syntactic dictionary is combined with word vectors to detect newly-emerging domain words. Finally, the method is validated on a real-world text dataset collected from a skin-care products forum. The experimental results show that the constructed syntactic dictionary is of high quality and that the proposed method performs well at detecting newly-emerging domain words.
Authors
赵志滨
石玉鑫
李斌阳
ZHAO Zhi-bin; SHI Yu-xin; LI Bin-yang (School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China; School of Information Science and Technology, University of International Relations, Beijing 100091, China)
Source
《计算机科学》
CSCD
PKU Core Journal (北大核心)
2019, No. 6, pp. 29-34 (6 pages)
Computer Science
Funding
National Key Research and Development Program of China (2018YFB1004700)
National Natural Science Foundation of China (61472070)
University Cooperation Project on New Technology Research of the Aerospace Professional Department (SKX182010023)
Keywords
Syntactic analysis
Term vector
Newly-emerging domain words
Syntactic dictionary