Abstract
Crawling Web academic resources is an important research area for digital libraries. To optimize this crawling, the paper segments web pages by combining their spatial features, content features, and tag information. The resulting blocks are then identified and merged, and the page's main topic text and its set of related link blocks are output. Experimental analysis indicates that the method further removes noise from pages, analyzes the topical relevance of pages more accurately, and improves the quality of topic-focused Web information crawling.
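The segment, identify, and merge pipeline described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses tag information and one content feature (the share of characters inside links) only, and it ignores spatial features, which would require rendered layout. The names `BLOCK_TAGS`, `BlockSplitter`, `segment`, and the threshold `0.5` are hypothetical choices made for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical set of tags that open or close a block; not taken from the paper.
BLOCK_TAGS = {"div", "p", "td", "li", "table", "ul", "ol", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Split a page into flat text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: text, chars, link_chars
        self._parts = []      # text fragments of the block being built
        self._link_chars = 0  # characters of the current block inside <a>
        self._in_link = 0     # depth of currently open <a> tags

    def _flush(self):
        # Close the current block, if it holds any text, and start a new one.
        if self._parts:
            self.blocks.append({
                "text": " ".join(self._parts),
                "chars": sum(len(p) for p in self._parts),
                "link_chars": self._link_chars,
            })
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(data)

    def close(self):
        super().close()
        self._flush()


def segment(html, link_threshold=0.5):
    """Return (main_text, link_blocks) for an HTML string.

    A block counts as a link block when more than `link_threshold` of its
    characters sit inside <a> tags (an assumed heuristic). All remaining
    blocks are merged into the main text.
    """
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    text_blocks, link_blocks = [], []
    for block in splitter.blocks:
        ratio = block["link_chars"] / block["chars"]
        target = link_blocks if ratio > link_threshold else text_blocks
        target.append(block["text"])
    return " ".join(text_blocks), link_blocks
```

A call such as `segment(page_html)` returns the merged body text and a list of link-dominated blocks. Note that this sketch does not tell navigation noise apart from topic-related links. Making that distinction would need a relevance measure that this record does not describe.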
Source
《中国科技资源导刊》
2012, No. 6, pp. 76-80 (5 pages)
China Science & Technology Resources Review
Keywords
digital library, Web academic resources, automatic crawling, information system