Abstract
Using a statistical language model, this paper analyzes the text of Xu Zizhi Tongjian Changbian (《续资治通鉴长编》). Candidate character strings are extracted according to their mutual information, and a human operator then decides interactively whether each candidate is a word. Once a word is confirmed, the mutual information values of the related strings are revised dynamically, so that a word list for a Song-dynasty history corpus is built up step by step. In the experiment, 6,500 candidate strings were extracted using a mutual information threshold; 3,694 of them (56.8%) were confirmed as words against the Hanyu Da Cidian (《汉语大词典》). The results show that the mutual information method is an effective auxiliary means of building word lists for ancient Chinese corpora.
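The candidate-extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the common pointwise mutual information definition, log2(P(xy) / (P(x)P(y))), applied to adjacent two-character strings; the abstract does not give the paper's exact formula, string lengths, or threshold, so the function names, the two-character restriction, and the `threshold` parameter here are all hypothetical. The interactive confirmation and the dynamic revision of mutual information values are not modelled.

```python
import math
from collections import Counter


def bigram_mutual_information(text):
    """Return {two-character string: pointwise mutual information}.

    Assumption: PMI = log2(P(xy) / (P(x) * P(y))), with probabilities
    estimated from raw character and adjacent-pair counts in `text`.
    """
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_strings(text, threshold):
    """List (string, score) pairs whose score meets the threshold,
    highest score first. The threshold value is a free parameter."""
    scores = bigram_mutual_information(text)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(pair, score) for pair, score in ranked if score >= threshold]
```

In a workflow like the one the abstract outlines, each string returned by `candidate_strings` would then be shown to a human reviewer, who accepts or rejects it as a word.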
Source
Journal of Hebei University (Natural Science Edition) (《河北大学学报(自然科学版)》)
CAS
PKU Core
2006, No. 5, pp. 557-560 (4 pages)
Funding
Natural Science Foundation of Hebei Province (F2006001020)
Scientific Research Fund of the Hebei Provincial Department of Education (2005347)
Keywords
ancient Chinese text database
mutual information
word extraction
statistical features