Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all the conventional topical Web crawlers focus on building crawlers using different classifiers, which needs...Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all the conventional topical Web crawlers focus on building crawlers using different classifiers, which needs a lot of labeled training data that is very difficult to label manually. This paper presents a novel approach called clustering-based topical Web crawling which is utilized to retrieve information on a specific domain based on link-context and does not require any labeled training data. In order to collect domain-specific content units, a novel hierarchical clustering method called bottom-up approach is used to illustrate the process of clustering where a new data structure, a linked list in combination with CFu-tree, is implemented to store cluster label, feature vector and content unit. During clustering, four metrics are presented. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Then, the precision and recall of clustering are also presented to evaluate the accuracy and comprehensive degree of the whole clustering process. Link-context extraction technique is used to expand the feature vector of anchor text which improves the clustering accuracy greatly. Experimental results show that the performance of our proposed method overcomes conventional focused Web crawlers both in Harvest rate and Target recall.展开更多
As a basic component of search engine and a series of other services on Web,Web crawler is playing an important role. Roughly,a Web crawler is a program which automatically traverses the Web by downloading documents a...As a basic component of search engine and a series of other services on Web,Web crawler is playing an important role. Roughly,a Web crawler is a program which automatically traverses the Web by downloading documents and following links from page to page. This article detailedly explains the principles and difficulties on the Web crawler,comprehensively argues several hot directions of Web crawler,and at last views the new direction of Web crawler.展开更多
文摘Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all the conventional topical Web crawlers focus on building crawlers using different classifiers, which needs a lot of labeled training data that is very difficult to label manually. This paper presents a novel approach called clustering-based topical Web crawling which is utilized to retrieve information on a specific domain based on link-context and does not require any labeled training data. In order to collect domain-specific content units, a novel hierarchical clustering method called bottom-up approach is used to illustrate the process of clustering where a new data structure, a linked list in combination with CFu-tree, is implemented to store cluster label, feature vector and content unit. During clustering, four metrics are presented. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Then, the precision and recall of clustering are also presented to evaluate the accuracy and comprehensive degree of the whole clustering process. Link-context extraction technique is used to expand the feature vector of anchor text which improves the clustering accuracy greatly. Experimental results show that the performance of our proposed method overcomes conventional focused Web crawlers both in Harvest rate and Target recall.
文摘As a basic component of search engine and a series of other services on Web,Web crawler is playing an important role. Roughly,a Web crawler is a program which automatically traverses the Web by downloading documents and following links from page to page. This article detailedly explains the principles and difficulties on the Web crawler,comprehensively argues several hot directions of Web crawler,and at last views the new direction of Web crawler.