A Novel Approach for Chinese New Word Identification Based on Contextual Word Frequency-Contextual Word Count (cited by: 9)
Abstract This article presents a statistical index based on contextual word frequency and contextual word count (WF-CWC). The index modifies the definition of a parameter in the information entropy formula: the number of times an adjacent string occurs in the corpus is replaced by the size of the set of adjacent strings. This overcomes the weakness of left and right information entropy, whose values are not distinctive enough when identifying new words. The article also presents a recursive string concatenation method based on adjacency relations, which overcomes the fixed sliding window size of the N-gram approach. Empirical analysis indicates that the method achieves relatively high accuracy. By choosing different WF-CWC values as the threshold, it can be tuned flexibly between finding more new words and improving the accuracy of the new words found, which makes it a practical approach to new word identification.
Authors Xing Enjun (邢恩军); Zhao Fuqiang (赵富强) (College of Management and Economics, Tianjin University, Tianjin 300072, China; Department of Information Science and Technology, Tianjin University of Finance and Economics, Tianjin 300222, China)
Source Computer Applications and Software (《计算机应用与软件》), CSCD, 2016, No. 6, pp. 64-67 (4 pages)
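Funding National Natural Science Foundation of China, Youth Fund (61004056); Tianjin Natural Science Foundation (15JCYBJC16000); Tianjin Philosophy and Social Science Research Planning Fund (TJTJ15-002)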
Keywords new word identification; information entropy of context; context word frequency-context word count
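
The abstract contrasts two quantities for a candidate string: the left and right information entropy of its neighbouring strings, and the size of the set of distinct neighbouring strings. The following is a minimal sketch of how those two quantities could be computed side by side, assuming single-character neighbours and a plain unsegmented corpus string. It does not reproduce the WF-CWC formula itself, which the abstract does not state; the function names `neighbor_stats`, `entropy`, and `summarize` are hypothetical.

```python
import math
from collections import Counter


def neighbor_stats(corpus, candidate):
    """Count the characters immediately left and right of each occurrence.

    Assumption: neighbours are single characters; the paper may use
    longer adjacent strings.
    """
    left, right = Counter(), Counter()
    frequency = 0
    start = corpus.find(candidate)
    while start != -1:
        frequency += 1
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        # Step by one so overlapping occurrences are also counted.
        start = corpus.find(candidate, start + 1)
    return frequency, left, right


def entropy(counter):
    """Shannon entropy (base 2) of the neighbour distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def summarize(corpus, candidate):
    """Return occurrence-based entropy and set-size statistics together."""
    frequency, left, right = neighbor_stats(corpus, candidate)
    return {
        "frequency": frequency,
        "left_entropy": entropy(left),
        "right_entropy": entropy(right),
        # Set sizes: number of distinct neighbours, ignoring how often each occurs.
        "left_set_size": len(left),
        "right_set_size": len(right),
    }


if __name__ == "__main__":
    print(summarize("aXYbcXYdaXYb", "XY"))
```

In this toy corpus the candidate `XY` occurs three times with left neighbours `a`, `c`, `a`. The entropy weights each neighbour by how often it occurs, while the set size only records that two distinct neighbours exist, which is the kind of substitution the abstract describes.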