Korean Document Indexing and Evaluating Based on N-GRAM
Abstract: When Korean text is indexed for information retrieval, the usual approach is to analyze sentences and morphemes and then extract nouns as index terms. Because this analysis is ambiguous, however, it is hard to extract correct index terms when morphological analysis meets an unregistered word, that is, a word missing from the reference dictionary. N-gram indexing requires no linguistic analysis of words, so indexing is fast; it also copes with unregistered words that are absent from the morphological-analysis dictionary and therefore works well for compound nouns. Compared with other methods, however, N-gram indexing extracts too many index terms, which lowers both storage utilization and indexing efficiency. To overcome these drawbacks, this paper proposes a new automatic indexing method for Korean. The method first extracts substantives and predicates as index terms, then applies sentence-type rules to separate particles from the statements on which morphological analysis failed, and finally applies N-gram indexing to the remaining unregistered words. Comparative analysis and performance evaluation show that the proposed method is effective.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core list; 2012, No. 5, pp. 950-954 (5 pages)
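Funding: National Natural Science Foundation of China (61070162, 71071028, 60802023, 70931001); Specialized Research Fund for the Doctoral Program of Higher Education (20070145017); Fundamental Research Funds for the Central Universities (N090504003, N090504006)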
Keywords: Korean; N-gram method; unknown word; information retrieval; compound noun; morphological analysis
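
The three-stage idea in the abstract (dictionary-based extraction first, particle separation for failed statements, N-gram indexing only for words that remain unregistered) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary, the particle list, and the function names `char_ngrams`, `strip_particle`, and `index_terms` are hypothetical stand-ins, a plain set lookup replaces real morphological analysis, a suffix match replaces the paper's sentence-type rules, and the input is assumed to consist of precomposed (NFC) Hangul syllables.

```python
from typing import List, Set

# Toy resources invented for illustration only (not taken from the paper).
TOY_DICTIONARY: Set[str] = {"정보", "검색", "색인"}
TOY_PARTICLES: List[str] = ["에서", "을", "를", "은", "는", "이", "가", "의"]


def char_ngrams(word: str, n: int = 2) -> List[str]:
    """Return all overlapping character n-grams of `word`."""
    if len(word) <= n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def strip_particle(token: str) -> str:
    """Remove one trailing particle, trying longer particles first."""
    for particle in sorted(TOY_PARTICLES, key=len, reverse=True):
        if token.endswith(particle) and len(token) > len(particle):
            return token[:-len(particle)]
    return token


def index_terms(token: str, n: int = 2) -> List[str]:
    """Pick index terms for one token using the three-stage fallback."""
    # Stage 1: a set lookup stands in for morphological analysis.
    if token in TOY_DICTIONARY:
        return [token]
    # Stage 2: analysis failed, so separate a trailing particle and retry.
    stem = strip_particle(token)
    if stem in TOY_DICTIONARY:
        return [stem]
    # Stage 3: still unregistered, so fall back to character n-grams.
    return char_ngrams(stem, n)


if __name__ == "__main__":
    for token in ["정보", "검색을", "정보검색"]:
        print(token, index_terms(token))
```

In this toy setting the compound 정보검색 is not in the dictionary, so it falls through to bigrams and yields 정보, 보검, and 검색. Both constituent nouns are recovered, but so is the spurious term 보검, which is the index growth the abstract describes; restricting N-grams to the unregistered remainder keeps that growth limited.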
