Abstract
To address the insufficient precision of topic information extraction from Web pages, this paper proposes a new method for defining and quantifying the topic information of a Web page: topic information is divided into three forms, and a different quantification method is applied to each form. Building on this idea, the authors combine the DOM specification with page blocking and propose the IB-DOM tree, which is built on top of the DOM tree. Following a divide-and-conquer strategy, the method first locates the region that contains the topic information and then filters out the noise. Experimental results show that the method resolves, reasonably well, the conflict between completeness and accuracy in the automatic extraction of topic information from Web pages.
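The two-stage idea in the abstract (locate the topic region first, filter noise second) can be illustrated with a minimal sketch. This is not the authors' IB-DOM tree or their three-form quantification, which the abstract does not specify. As a stand-in, the sketch assumes a simple link-density heuristic, and the names `Block`, `BlockTreeBuilder`, `locate_region`, `filter_noise`, and `extract_topic_text`, as well as the thresholds `share` and `max_link_ratio`, are all hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: these tags delimit "information blocks"; the paper's rule is unknown.
BLOCK_TAGS = {"body", "div", "table", "td", "ul", "p", "section", "article"}
SKIP_TAGS = {"script", "style"}


@dataclass
class Block:
    tag: str
    parent: "Block | None" = None
    children: list = field(default_factory=list)
    texts: list = field(default_factory=list)  # text fragments directly in this block
    text_len: int = 0  # characters of text directly in this block
    link_len: int = 0  # of those, characters inside <a> elements


class BlockTreeBuilder(HTMLParser):
    """Builds a coarse block tree from HTML (a simplified stand-in for a DOM tree)."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.in_link = 0
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            child = Block(tag, parent=self.current)
            self.current.children.append(child)
            self.current = child

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK_TAGS:
            # Walk up to the nearest open block with this tag (tolerates unclosed tags).
            node = self.current
            while node.parent is not None and node.tag != tag:
                node = node.parent
            if node.tag == tag and node.parent is not None:
                self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if self.skip or not text:
            return
        self.current.texts.append(text)
        self.current.text_len += len(text)
        if self.in_link:
            self.current.link_len += len(text)


def totals(block):
    """Return (text_len, link_len) summed over the whole subtree."""
    t, l = block.text_len, block.link_len
    for child in block.children:
        ct, cl = totals(child)
        t, l = t + ct, l + cl
    return t, l


def content_len(block):
    """Non-link text length of a subtree, used here as a crude topic score."""
    t, l = totals(block)
    return t - l


def locate_region(root, share=0.6):
    """Stage 1: descend while a single child holds most of the non-link text."""
    node = root
    while node.children:
        best = max(node.children, key=content_len)
        if content_len(best) < share * max(content_len(node), 1):
            break
        node = best
    return node


def filter_noise(block, max_link_ratio=0.5):
    """Stage 2: drop sub-blocks that are empty or dominated by link text.

    Note: a block's own text is emitted before its children's text,
    so the original interleaving order is not preserved.
    """
    t, l = totals(block)
    if t == 0 or l / t > max_link_ratio:
        return []
    out = list(block.texts)
    for child in block.children:
        out.extend(filter_noise(child, max_link_ratio))
    return out


def extract_topic_text(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return filter_noise(locate_region(builder.root))
```

Calling `extract_topic_text(page_html)` returns a list of text fragments from the located region with link-heavy sub-blocks removed. Separating the two stages mirrors the divide-and-conquer description: stage 1 favours completeness by picking a whole region, and stage 2 improves accuracy by pruning inside it.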
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2008, No. 12, pp. 48-53 (6 pages)
New Technology of Library and Information Service
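Funding
This work is one of the research outcomes of the following projects:
National High Technology Research and Development Program of China (863 Program) key project "Research on Key Technologies of Cross-Media Search and Development of Service Products" (Project No. 2006AA010105)
National Natural Science Foundation of China project "Research on Semantics-Based Chinese Text Clustering" (Project No. 60772081)
Beijing municipal universities talent development program project "Innovation Team: Intelligent Search Engine and Text Mining" (Project No. PXM2007_014224_044677)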
Keywords
Topic information of Web page
Information extraction
Information block
Semantic information
IB-DOM tree