Abstract
This paper analyzes the key techniques of Chinese full-text retrieval search engines and proposes a Chinese word segmentation scheme suited to such engines. The scheme improves segmentation accuracy and can also recognize unknown (out-of-vocabulary) words in the text. For the vector space model of information retrieval, the paper designs a term-weighting function that jointly considers important factors such as the position, length, and frequency of a Chinese word in Web text. The function expresses a word's importance quantitatively, so the weight reflects fairly accurately how important a term is within a Web document. Finally, the segmentation algorithm is tested, and the results indicate that the method improves segmentation accuracy and meets the requirements of practical use.
Source
《情报科学》
CSSCI
Peking University Core Journal (北大核心)
2006, No. 6, pp. 895-899, 909 (6 pages)
Information Science
Keywords
full-text retrieval
search engine
Chinese word segmentation
information retrieval