Abstract
Combining the internal features of HTML Web pages with their external structural layout, this paper proposes a mapping table as a page-mapping model for transforming the view of a Web page. Pages are segmented into regions, and the regions are identified, using structural and heuristic rules; the vector space model is then used to analyze page content, so that topical content with high semantic cohesion is extracted accurately. Experimental results show that the method performs well at extracting topical information from Web pages with a variety of complex structures.
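The content-analysis step can be illustrated with a minimal sketch of the vector space model. This is not the authors' implementation: the abstract does not give term weights, the reference vector, or a cutoff, so the raw term-frequency weighting, the use of the page title as the reference, the `threshold` value, and the function names `term_vector`, `cosine`, and `topical_blocks` are all assumptions made for illustration. The blocks are assumed to be the text of regions already produced by the segmentation step.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercase word tokens (assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topical_blocks(title, blocks, threshold=0.2):
    """Keep blocks whose similarity to the title reaches the (hypothetical) threshold."""
    reference = term_vector(title)
    return [b for b in blocks if cosine(term_vector(b), reference) >= threshold]


if __name__ == "__main__":
    page_title = "Web page topic extraction"
    page_blocks = [
        "Home | Login | Contact us",
        "Topic extraction from a Web page uses segmentation",
        "Copyright 2006 all rights reserved",
    ]
    for block in topical_blocks(page_title, page_blocks):
        print(block)
```

Navigation and copyright regions share no terms with the title and score zero, so only the content region passes the cutoff. Chinese pages would need a word segmenter in place of the regular-expression tokenizer used here.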
Source
《山东大学学报(理学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2006, No. 3, pp. 41-44 (4 pages)
Journal of Shandong University (Natural Science)
Funding
Supported by the Natural Science Foundation of Shandong Province (Y2005G21)