Abstract
When Korean text is indexed for information retrieval, the usual approach is to run phrase-level and morphological analysis and then extract nouns as index terms. Because this analysis is ambiguous, correct index terms are hard to extract when morphological analysis encounters unregistered words, that is, words missing from the reference dictionary. N-gram indexing requires no linguistic analysis, so it is fast and handles unregistered words well, which makes it effective for compound nouns. Compared with other methods, however, N-gram indexing extracts too many index terms, which wastes storage space and lowers indexing efficiency. To overcome these drawbacks, this paper proposes a new automatic indexing method for Korean. The method first extracts substantives and predicates as index terms, then applies phrase-type rules to separate particles from the phrases on which morphological analysis failed, and finally applies N-gram indexing when processing unregistered words. Comparative analysis and performance evaluation show that the proposed method is effective.
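The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain dictionary lookup stands in for morphological analysis, a short suffix list stands in for the phrase-type rules, and the names `DICTIONARY`, `PARTICLES`, `strip_particle`, `char_ngrams`, and `index_terms` are all hypothetical, as are the toy word lists.

```python
# Toy resources (assumptions for illustration, not taken from the paper).
DICTIONARY = {"정보", "검색", "색인"}          # stand-in for the analysis dictionary
PARTICLES = ("에서", "은", "는", "이", "가", "을", "를", "의", "에")  # longest first


def strip_particle(phrase: str) -> str:
    """Remove one trailing particle, if any, leaving a non-empty stem."""
    for particle in PARTICLES:
        if phrase.endswith(particle) and len(phrase) > len(particle):
            return phrase[: -len(particle)]
    return phrase


def char_ngrams(token: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams; short tokens are kept whole."""
    if len(token) <= n:
        return [token] if token else []
    return [token[i : i + n] for i in range(len(token) - n + 1)]


def index_terms(phrases: list[str], n: int = 2) -> list[str]:
    """Dictionary lookup first, particle stripping second, n-grams last."""
    terms: list[str] = []
    for phrase in phrases:
        if phrase in DICTIONARY:          # stage 1: analysis succeeds
            terms.append(phrase)
            continue
        stem = strip_particle(phrase)     # stage 2: rule-based particle separation
        if stem in DICTIONARY:
            terms.append(stem)
        else:                             # stage 3: unregistered word -> n-grams
            terms.extend(char_ngrams(stem, n))
    return terms


print(index_terms(["정보", "검색을", "자동색인은"]))
# ['정보', '검색', '자동', '동색', '색인']
```

In this sketch only the unregistered stem "자동색인" is expanded into bigrams, while the two known words each yield a single term. That is the intuition behind restricting N-gram indexing to unregistered words: the index stays smaller than it would if every phrase were split into N-grams.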
Source
《小型微型计算机系统》
CSCD
PKU Core (Peking University core journal list)
2012, No. 5, pp. 950-954 (5 pages)
Journal of Chinese Computer Systems
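Funding
Supported by the National Natural Science Foundation of China (61070162, 71071028, 60802023, 70931001)
Supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20070145017)
Supported by the Fundamental Research Funds for the Central Universities (N090504003, N090504006)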