Korean Document Indexing and Evaluating Based on N-GRAM

Original title: 基于N-GRAM的朝鲜文索引方法与性能评价

Abstract: When Korean text is indexed for information retrieval, the usual approach is to analyze sentences and morphemes and then extract nouns as index terms. Because this analysis is ambiguous, however, it is hard to extract correct index terms when morphological analysis encounters unregistered words, that is, words absent from the reference dictionary. N-gram indexing needs no linguistic analysis, so it indexes quickly, handles unregistered words that are missing from the morphological-analysis dictionary, and works well on compound nouns. Compared with other methods, though, N-gram indexing extracts too many index terms, which wastes storage space and lowers indexing efficiency. To overcome these drawbacks, this paper proposes a new automatic indexing method for Korean. The method first extracts substantives and predicates as index terms, then applies sentence-type rules to separate particles in the sentences on which morphological analysis failed, and finally applies N-gram indexing to the unregistered words. Comparative analysis and performance evaluation show that the proposed method is effective.
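The three-stage flow described in the abstract (dictionary lookup, particle separation, N-gram fallback for unregistered words) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy `DICTIONARY`, the short `PARTICLES` list, the suffix-stripping rule, and the function names are all assumptions made for illustration, and the paper's sentence-type rules are replaced here by a naive longest-suffix match.

```python
import unicodedata

# Hypothetical toy resources; a real system would use a full
# morphological-analysis dictionary and a complete particle inventory.
DICTIONARY = {"정보", "검색"}
PARTICLES = ("에서", "을", "를", "은", "는", "이", "가", "의", "에")


def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a word."""
    if len(word) <= n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def strip_particle(token):
    """Naively remove one trailing particle, trying longer particles first.

    This stands in for the paper's sentence-type rules and will
    mis-split words that merely end in a particle-like syllable.
    """
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token


def index_terms(tokens, n=2):
    """Map whitespace-separated tokens to index terms."""
    terms = []
    for raw in tokens:
        # Compose Hangul so that each syllable is one code point.
        token = unicodedata.normalize("NFC", raw)
        if token in DICTIONARY:
            terms.append(token)          # stage 1: known word
            continue
        stem = strip_particle(token)     # stage 2: particle separation
        if stem in DICTIONARY:
            terms.append(stem)
        else:
            # stage 3: unregistered word, fall back to n-grams
            terms.extend(char_ngrams(stem, n))
    return terms


if __name__ == "__main__":
    print(index_terms(["정보를", "검색엔진에서"]))
```

In this sketch only the unregistered stem is expanded into bigrams, so a known word contributes a single index term instead of several overlapping ones; that is the storage saving the abstract attributes to restricting N-gram indexing to unregistered words.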

Authors: 金光赫, 王兴伟, 蒋定德

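Affiliations: College of Information Science and Engineering, Northeastern University (东北大学信息科学与工程学院); College of Application Programs, Kim Chaek University of Technology (金策工业综合大学应用程序学院)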

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2012, No. 5, pp. 950-954 (5 pages). Indexed in CSCD and the Peking University Core Journals list.

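Funding: Supported by the National Natural Science Foundation of China (61070162, 71071028, 60802023, 70931001); the Specialized Research Fund for the Doctoral Program of Higher Education (20070145017); and the Fundamental Research Funds for the Central Universities (N090504003, N090504006).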

Keywords: Korean; N-gram method; unknown word; information retrieval; compound noun; morphological analysis

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
