
Extraction of a Word List for a Song History Corpus Based on Mutual Information (cited by: 4)

Word Extraction Based on Mutual Information for Ancient Chinese Language Database
Abstract: Using a statistical language model, a statistical analysis of the Xu Zizhi Tongjian Changbian (《续资治通鉴长编》) was carried out in order to extract multi-character words from an ancient Chinese text database. Candidate character strings are extracted according to the mutual information of their characters. The user then decides interactively whether each candidate is a real word, after which the mutual information values of the related strings are revised dynamically. Repeating this process gradually builds up a word list for the Song history corpus. In the experiment, 6,500 candidate strings were extracted using a mutual information threshold, of which 3,694 (56.8%) were confirmed as words against the Hanyu Da Cidian (《汉语大词典》). The results indicate that the mutual information method is an effective auxiliary approach for building word lists for ancient Chinese corpora.
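The abstract does not give the exact formula or update rule, so the following is only a minimal sketch of the extract, confirm, and revise loop it describes. It assumes the common pointwise definition of mutual information, log2(P(xy) / (P(x)P(y))), estimated from adjacent-character counts. The function names, the `min_count` filter, and the idea of "revising" mutual information by fusing a confirmed word into a single token and recomputing are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Map each adjacent token pair (a, b) to its pointwise mutual information in bits.

    Assumption: probabilities are plain relative frequencies over the input.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def candidates(tokens, threshold, min_count=2):
    """Return (string, score) pairs whose PMI reaches the threshold, best first.

    min_count is a hypothetical extra filter: PMI overrates pairs seen only once.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    picked = [
        (a + b, score)
        for (a, b), score in bigram_pmi(tokens).items()
        if score >= threshold and bigrams[(a, b)] >= min_count
    ]
    return sorted(picked, key=lambda item: (-item[1], item[0]))


def merge_confirmed(tokens, pair):
    """Fuse every adjacent occurrence of pair = (a, b) into the single token a + b.

    Recomputing PMI on the result is one possible reading of "dynamic revision".
    """
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    # Made-up toy string, not taken from the corpus in the paper.
    toy = list("宰相奏事宰相退朝皇帝御殿皇帝下诏")
    for word, score in candidates(toy, threshold=3.0):
        print(word, round(score, 3))
    # After a human confirms a candidate, fuse it and rescore the neighbours.
    toy = merge_confirmed(toy, ("宰", "相"))
    print(candidates(toy, threshold=3.0))
```

In a real workflow the confirmation step would be a lookup in a reference dictionary or a prompt to an annotator; here it is represented only by the explicit call to `merge_confirmed`.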
Source: Journal of Hebei University (Natural Science Edition) (《河北大学学报(自然科学版)》), 2006, No. 5, pp. 557-560 (4 pages). Indexed in CAS and the Peking University Core Journals list.
Funding: Natural Science Foundation of Hebei Province (F2006001020); Scientific Research Fund of the Hebei Provincial Department of Education (2005347)
Keywords: ancient Chinese language database; mutual information; word extraction; statistical feature
