Abstract
A topic-based web crawler algorithm is proposed. It uses the order in which Web pages are collected, together with the relevant information and topics of the retrieved pages, to partition and integrate as much of the Web as possible by topic. URL location information is used to make page retrieval more efficient. Simulation experiments on a real BBS show that the proposed topic-relevant crawler can cross the fracture zones (breaks in the chain of linked URL pages) in the BBS, which improves the recall of URL pages and keeps retrieval from halting at a break. Precision analysis shows that the misjudged points cluster near the bisecting line that marks correct judgment, with only small deviations, indicating that the algorithm is reasonably precise.
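The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of crossing a fracture zone, not the authors' method. It assumes a best-first crawl over a small in-memory link graph, a toy keyword-overlap relevance score, and a hypothetical parameter `max_gap` that limits how many consecutive off-topic pages the crawler may pass through before abandoning a path. All names, paths, and page texts here are illustrative assumptions.

```python
import heapq


def relevance(text, topic_terms):
    """Toy score: fraction of topic terms that occur in the page text."""
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)


def focused_crawl(pages, links, seed, topic_terms, threshold=0.5, max_gap=1):
    """Best-first crawl that may cross up to max_gap off-topic pages in a row.

    pages: dict mapping URL -> page text (stands in for fetching a page).
    links: dict mapping URL -> list of outgoing URLs.
    Returns the on-topic URLs in the order they were collected.
    """
    # Heap entries: (-parent_score, insertion_order, url, gap), where gap is
    # the number of consecutive off-topic pages on the path leading to url.
    frontier = [(-1.0, 0, seed, 0)]
    seen = {seed}
    collected = []
    counter = 1
    while frontier:
        _, _, url, gap = heapq.heappop(frontier)
        score = relevance(pages.get(url, ""), topic_terms)
        if score >= threshold:
            collected.append(url)
            gap = 0  # an on-topic page resets the gap
        else:
            gap += 1
            if gap > max_gap:
                continue  # fracture too wide: do not expand this page
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score, counter, nxt, gap))
                counter += 1
    return collected


# Hypothetical BBS fragment: /login and /ad1, /ad2 are off-topic pages.
pages = {
    "/index": "crawler topic retrieval",
    "/login": "please sign in",
    "/thread1": "topic crawler results",
    "/ad1": "advertisement",
    "/ad2": "more ads",
    "/thread2": "crawler topic hidden deep",
}
links = {
    "/index": ["/login", "/ad1"],
    "/login": ["/thread1"],
    "/ad1": ["/ad2"],
    "/ad2": ["/thread2"],
}
topic_terms = ["crawler", "topic"]

print(focused_crawl(pages, links, "/index", topic_terms, max_gap=1))
```

With `max_gap=0` the crawl stops at the first off-topic page and collects only the seed; raising `max_gap` lets it reach on-topic pages that sit behind one or more off-topic pages, at the cost of visiting more irrelevant ones. One simplification to note: each URL is queued only once, so the gap recorded for a page is the one from the first path that discovered it.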
Source
Bulletin of Science and Technology (《科技通报》)
Peking University Core Journal (北大核心)
2014, No. 4, pp. 206-208 (3 pages)
Keywords
network crawler algorithm
URL location information
BBS information retrieval
data mining