
An automatic extraction algorithm of Web pages topical information based on blocks

Cited by: 6
Abstract: Many pages on the Internet are generated from templates and are therefore semi-structured. Exploiting this property, the paper proposes an algorithm for Web page segmentation and automatic topic information extraction. The algorithm segments a page into blocks using its HTML tags, improves on the traditional text feature selection method, represents each block as a feature vector, and identifies the topic content blocks according to an ordered tag set. Applying the algorithm in the preprocessing stage of Web page classification improved both the speed and the accuracy of classification. Experiments show that extracting topic information before classification improves the recall and precision of the classification system.
Authors: Yin Xianliang (殷贤亮), Li Meng (李猛)
Source: Journal of Huazhong University of Science and Technology (Natural Science Edition) (《华中科技大学学报(自然科学版)》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2007, No. 10, pp. 39-41 (3 pages).
Keywords: Web page segmentation; topic information; automatic extraction; feature selection; Web page classification
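
The pipeline summarized in the abstract (split a page into blocks at HTML tags, turn each block into a feature vector, pick the topic block) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the record does not give their feature selection method or their ordered tag set, so the sketch substitutes plain term-frequency vectors and picks the block most similar to the page title by cosine similarity. The names `BLOCK_TAGS`, `BlockSplitter`, `term_vector`, `cosine`, and `extract_topic_block` are all hypothetical, and the `\w+` tokenizer assumes whitespace-delimited text.

```python
from collections import Counter
from html.parser import HTMLParser
import math
import re

# Assumption: these tags delimit blocks; the paper's actual tag set is not given here.
BLOCK_TAGS = {"div", "td", "p", "li", "table", "tr", "ul"}


class BlockSplitter(HTMLParser):
    """Collect visible text into blocks, cutting at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.title = ""
        self._buf = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Normalize whitespace and store the buffered text as one block.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text as a final block


def term_vector(text):
    """Term-frequency vector (a stand-in for the paper's feature vector)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def extract_topic_block(html):
    """Return the block whose term vector is closest to the page title's."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    title_vec = term_vector(parser.title)
    return max(parser.blocks, key=lambda b: cosine(term_vector(b), title_vec))
```

In a classification pipeline, the text returned by `extract_topic_block` would be passed to the classifier in place of the full page, so that navigation bars and footers shared across a template do not contribute features.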

