Abstract
This paper presents a topic-specific intelligent Web crawler system based on Web content and structure mining, with a focus on its CA(C&S) algorithm. The algorithm exploits the ability of neural networks to model network topology and to compute in parallel, and it uses reinforcement learning to judge how relevant a page is to the topic. When computing relevance, it does not consider the full content of a page. Instead, it extracts the important tags from the page's HTML description and analyzes the page's content and structure to judge whether a crawled page is relevant to the topic, with the aim of improving the efficiency and precision of information collection.
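The idea of scoring a page from its important HTML tags rather than its full text can be illustrated with a minimal sketch. This is not the CA(C&S) algorithm: it uses a plain weighted term overlap with no neural network or reinforcement learning. The tag set, the weights in `TAG_WEIGHTS`, and the names `TagTextExtractor` and `relevance` are all hypothetical, since the abstract does not state which tags or weights are used.

```python
import string
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not say which tags matter or by how much.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.0}


class TagTextExtractor(HTMLParser):
    """Collect terms that appear inside the weighted tags only."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open weighted tags
        self.weighted_terms = []  # (term, weight) pairs

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating unbalanced HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        if not self.open_tags:
            return  # text outside the weighted tags is ignored
        weight = max(TAG_WEIGHTS[t] for t in self.open_tags)
        for raw in data.lower().split():
            term = raw.strip(string.punctuation)
            if term:
                self.weighted_terms.append((term, weight))


def relevance(html, topic_terms):
    """Return the weighted share of tagged terms that belong to the topic (0.0 to 1.0)."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    total = sum(w for _, w in parser.weighted_terms)
    if total == 0:
        return 0.0
    hits = sum(w for term, w in parser.weighted_terms if term in topic_terms)
    return hits / total
```

A crawler built this way could follow links only from pages whose score exceeds a chosen threshold. The threshold, like the weights, would need tuning and is not given in the source.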
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journal (北大核心)
2006, Issue 3, pp. 57-59 (3 pages)
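Funding
Key Program of the National Natural Science Foundation of China (69835001)
National Key Science and Technology Achievements Promotion Program (2003EC000001)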
Keywords
Topic-specific crawler
Web mining
Neural network
Reinforcement learning