Research on a New Word Discovery Method for Specific Domains (cited by: 1)

Research on New Word Discovery Methods Facing the Military Field
Abstract Accurately identifying domain-specific new words in text is an important task in safeguarding data security within enterprises and institutions. Based on the characteristics of domain-specific corpora, this paper proposes a new word discovery method for specific domains. First, the corpus is preprocessed. Second, Jieba is combined with the domain's word-formation strategy to segment the text, and a sliding N-gram window produces candidate word strings. Third, candidates are filtered using pointwise mutual information, branch entropy, word frequency, and a normalized score. The resulting new words are then vectorized and reduced in dimensionality. Finally, K-means clustering separates domain new words from common new words, yielding the domain new word set. The method addresses the poor fit of general-purpose new word discovery methods to specific domains, and comparative experiments on a corpus of about 100,000 lines from one domain verify its effectiveness.
Authors 申兆媛, 巢翌, 李晓龙, 张伟 (SHEN Zhao-yuan; CHAO Yi; LI Xiao-long; ZHANG Wei), Beijing Institute of Control and Electronic Technology, Beijing 100038, China
Source Computer Simulation (《计算机仿真》), Peking University Core Journal, 2022, No. 6, pp. 269-273, 335 (6 pages)
Keywords new word discovery; pointwise mutual information; branch entropy; clustering
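
The filtering step named in the abstract (pointwise mutual information and branch entropy computed over N-gram candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `score_candidates`, the relative-frequency probability estimates, and the use of the smaller of the left and right entropies are all assumptions made for illustration, and the paper's normalized score and thresholds are not reproduced here. The input is assumed to be a list of already-segmented tokens (for example, the output of a segmenter such as Jieba).

```python
import math
from collections import Counter, defaultdict


def score_candidates(tokens, n=2):
    """Score each n-gram of `tokens` by frequency, PMI, and branch entropy.

    Hypothetical helper for illustration; not taken from the paper.
    """
    total = len(tokens)
    unigram = Counter(tokens)
    # Sliding N-gram window over the token sequence.
    grams = [tuple(tokens[i:i + n]) for i in range(total - n + 1)]
    gram_count = Counter(grams)

    # Collect the tokens seen immediately left and right of each n-gram.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i, g in enumerate(grams):
        if i > 0:
            left[g][tokens[i - 1]] += 1
        if i + n < total:
            right[g][tokens[i + n]] += 1

    def entropy(counter):
        s = sum(counter.values())
        if s == 0:
            return 0.0
        return -sum(c / s * math.log2(c / s) for c in counter.values())

    scores = {}
    for g, c in gram_count.items():
        # PMI = log2(p(gram) / product of p(token)), using relative frequencies.
        p_gram = c / len(grams)
        p_indep = math.prod(unigram[t] / total for t in g)
        pmi = math.log2(p_gram / p_indep)
        # Branch entropy: assumed here to be the smaller of the two sides.
        be = min(entropy(left[g]), entropy(right[g]))
        scores[g] = {"freq": c, "pmi": pmi, "branch_entropy": be}
    return scores


if __name__ == "__main__":
    sample = list("abxabyabz")  # toy token sequence, one character per token
    for gram, s in score_candidates(sample).items():
        print(gram, s)
```

A candidate with high PMI (its parts co-occur more often than chance) and high branch entropy (it appears in varied contexts) is more likely to be a word in its own right. In a full pipeline, candidates passing such filters would then be vectorized and clustered, as the abstract describes.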