
An HTML Parsing Method for Improving the Retrieval Quality of Chinese Search Engines (cited by: 20)

A HTML Parser to Improve Chinese Search Engines
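Abstract (translated from Chinese): Chinese search engines often return many irrelevant items, or indirect items that contain no specific information. One cause is the large amount of text in web pages that is unrelated to the page's topic. For search engines based on keyword retrieval, solving this problem at the retrieval or post-processing stage is costly and, in most cases, impossible. In this paper we introduce the concept of web page noise and, targeting the characteristics of Chinese web pages, implement an HTML parsing method that automatically partitions a page into blocks and removes the noise, so that potential irrelevant and indirect items are eliminated at the pre-processing stage. Experimental results show that, without consuming any query time, the method eliminates 100% of the indirect items that Chinese search engines hide, as well as about 11% of the irrelevant or indirect items that they cannot filter or hide, which substantially improves the precision of the retrieval results.

Abstract (original English): When using a search engine, people often find many irrelevant or peripherally relevant items in the result list. Most of them are produced by words irrelevant to the topic of a web page. It is costly or even impossible to remove such items using traditional keyword methods. In this paper, we define the concept of noise in web pages and propose a novel approach to cleaning noise from web pages in the pre-processing stage. A novel model of Chinese web pages and 4 simple rules are built to discard noise from HTML files. Experimental results show that all the indirect items that appear in the results of site grouping are removed correctly, and that about 11% of the irrelevant or indirect items that cannot be excluded by commercial Chinese search engines are removed by our approach.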
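Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2003, No. 4, pp. 19-26 (8 pages)
Funding: National Key Basic Research Program (973) (G1998030509); Natural Science Foundation (60223004); 863 High-Tech Program (2001AA114082)
Keywords: Chinese search engine; retrieval quality; HTML parsing method; HTML parser; web page noise; block model; web page denoising; noise filtering; Chinese information processing; computer application; search engine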

