6 articles found
A new focused crawler using an improved tabu search algorithm incorporating ontology and host information
1
Authors: Jingfa LIU, Zhen WANG, Guo ZHONG, Zhihe YANG. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2023, Issue 6, pp. 859-875, 17 pages.
To solve the problems of incomplete topic description and repetitive crawling of visited hyperlinks in traditional focused crawling methods, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH), where a domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results on both the tourism and rainstorm disaster domains show that the proposed focused crawlers outperform traditional focused crawlers on different performance metrics.
Keywords: focused crawler; tabu search algorithm; ontology; host information; priority evaluation
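The abstract above does not spell out how the improved tabu search or the host information memory work, so the following is only a generic sketch of the two ideas it names: a fixed-length tabu list that blocks recently visited URLs, and a per-host visit counter. The class `TabuMemory`, the `tenure` parameter, and the example URLs are hypothetical and are not taken from the paper.

```python
from collections import deque
from urllib.parse import urlsplit


class TabuMemory:
    """Fixed-length tabu list of URLs plus a per-host visit counter."""

    def __init__(self, tenure=3):
        # A deque with maxlen drops its oldest entry automatically,
        # which models a tabu tenure measured in number of visits.
        self._tabu = deque(maxlen=tenure)
        self.host_visits = {}

    def is_tabu(self, url):
        """Return True if the URL was visited within the tabu tenure."""
        return url in self._tabu

    def record(self, url):
        """Mark a URL as visited and count the visit against its host."""
        self._tabu.append(url)
        host = urlsplit(url).netloc
        self.host_visits[host] = self.host_visits.get(host, 0) + 1


memory = TabuMemory(tenure=2)
memory.record("http://example.com/a")
memory.record("http://example.com/b")
memory.record("http://example.org/c")  # "/a" now falls out of the tabu list
```

A crawler could consult `is_tabu` before fetching a link and use `host_visits` to down-weight hosts that already dominate the crawl; how the paper actually combines the two is not stated in the abstract.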
A Multi-Threaded Semantic Focused Crawler (cited by: 5)
2
Authors: Punam Bedi, Anjali Thukral, Hema Banati, Abhishek Behl, Varun Mendiratta. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, Issue 6, pp. 1233-1242, 10 pages.
The Web comprises voluminous, rich learning content. The volume of ever-growing learning resources, however, leads to the problem of information overload. A large number of irrelevant search results generated by search engines based on keyword matching techniques further aggravates the problem. A learner in such a scenario needs semantically matched learning resources as the search results. Keeping in view the volume of content and the significance of semantic knowledge, our paper proposes a multi-threaded semantic focused crawler (SFC) specially designed and implemented to crawl the WWW for educational learning content. The proposed SFC utilizes a domain ontology to expand a topic term and a set of seed URLs to initiate the crawl. The results obtained by multiple iterations of the crawl on various topics are shown and compared with the results obtained by executing an open source crawler on a similar dataset. The results are evaluated using Semantic Similarity, a vector space model based metric, and the harvest ratio.
Keywords: e-learning; semantic focused crawler; semantically expanded term; ontology
A Survey about Algorithms Utilized by Focused Web Crawler
3
Authors: Yong-Bin Yu, Shi-Lei Huang, Nyima Tashi, Huan Zhang, Fei Lei, Lin-Yang Wu. Journal of Electronic Science and Technology (CAS, CSCD), 2018, Issue 2, pp. 129-138, 10 pages.
Focused crawlers (also known as subject-oriented crawlers), as the core part of vertical search engines, collect as many topic-specific web pages as they can to form a subject-oriented corpus for later data analysis or user querying. This paper shows that the popular algorithms used in focused web crawling are essentially webpage analysis algorithms and crawling strategies (which prioritize the uniform resource locators (URLs) in the queue). The advantages and disadvantages of three crawling strategies are shown in the first experiment, which indicates that best-first search with an appropriate heuristic is a smart choice for topic-oriented crawling, while depth-first search is of little help in focused crawling. A second experiment, comparing improved versions (with a webpage analysis algorithm added), verifies that crawling strategies alone are not very efficient for focused crawling and that in most cases the two are combined. In light of the experimental results and recent research, some points on research trends in focused crawler algorithms are suggested.
Keywords: crawling strategies; focused crawler; harvest rate; uniform resource locator (URL) prioritizing; webpage analyzing
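The best-first strategy mentioned in this abstract can be illustrated with a priority-queue frontier that always hands back the highest-scoring unvisited URL. This is a minimal sketch, not the survey's implementation: the class `BestFirstFrontier`, the relevance scores, and the example URLs are all assumptions, and the heuristic that produces the scores is left out.

```python
import heapq
from itertools import count


class BestFirstFrontier:
    """Frontier that always yields the queued URL with the highest score."""

    def __init__(self):
        self._heap = []        # entries are (-score, tie_breaker, url)
        self._seen = set()     # every URL ever enqueued, to skip duplicates
        self._order = count()  # tie-breaker preserves insertion order

    def push(self, url, score):
        """Queue a URL with its relevance score; ignore repeats."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def pop(self):
        """Remove and return (url, score) for the best queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


frontier = BestFirstFrontier()
frontier.push("http://example.com/a", 0.2)
frontier.push("http://example.com/b", 0.9)
frontier.push("http://example.com/c", 0.5)
frontier.push("http://example.com/b", 0.1)  # duplicate, ignored
order = [frontier.pop()[0] for _ in range(len(frontier))]
```

Replacing the heap with a FIFO queue or a stack would give the breadth-first and depth-first strategies, which have no notion of score.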
A New Framework for Focused Web Crawling (cited by: 3)
4
Authors: PENG Tao, HE Fengling, ZUO Wanli. Wuhan University Journal of Natural Sciences (CAS), 2006, Issue 5, pp. 1394-1397, 4 pages.
Focused crawlers are important tools to support applications such as specialized Web portals, online searching, and Web search engines. A topic-driven crawler chooses the best URLs and relevant pages to pursue during Web crawling. It is difficult to deal with irrelevant pages. This paper presents a novel focused crawler framework. In our focused crawler, we propose a method to overcome some of the limitations of dealing with irrelevant pages. We also introduce the implementation of our focused crawler and present some important metrics and an evaluation function for ranking page relevance. The experimental results show that our crawler can obtain more "important" pages and has high precision and recall.
Keywords: focused crawlers; irrelevant pages; relevance metrics
On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis (cited by: 6)
5
Authors: Can WANG, Zi-yu GUAN, Chun CHEN, Jia-jun BU, Jun-feng WANG, Huai-zhong LIN. Journal of Zhejiang University-Science A (Applied Physics & Engineering) (SCIE, EI, CAS, CSCD), 2009, Issue 8, pp. 1114-1124, 11 pages.
Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier to focus the crawling on relevant pages. Traditional focused crawlers mainly rely on content analysis. Link-based techniques are not effectively exploited despite their usefulness. In this paper, we propose a new frontier prioritizing algorithm, namely the on-line topical importance estimation (OTIE) algorithm. OTIE combines link- and content-based analysis to evaluate the priority of an uncrawled URL in the frontier. We performed real crawling experiments over 30 topics selected from the Open Directory Project (ODP) and compared the harvest rate and target recall of four crawling algorithms: breadth-first, link-context-prediction, on-line page importance computation (OPIC), and our OTIE. Experimental results showed that OTIE significantly outperforms the other three algorithms on the average target recall while maintaining an acceptable harvest rate. Moreover, OTIE is much faster than the traditional focused crawling algorithm.
Keywords: focused crawlers; topical crawlers; PageRank; classifiers; on-line topical importance estimation (OTIE) algorithm
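The two metrics compared in this abstract, harvest rate and target recall, can be sketched as follows. The abstract does not define them, so the code assumes the commonly used definitions: harvest rate as the fraction of crawled pages that are relevant, and target recall as the fraction of a held-out target set that the crawl reached. The function names and page identifiers are hypothetical.

```python
def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant to the topic."""
    crawled = list(crawled)
    if not crawled:
        return 0.0
    relevant = set(relevant)
    return sum(1 for url in crawled if url in relevant) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of known target pages that the crawl has reached."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(targets & set(crawled)) / len(targets)


crawled_pages = ["p1", "p2", "p3", "p4"]
relevant_pages = {"p1", "p3", "p9"}        # judged relevant by some classifier
target_pages = {"p3", "p7", "p8", "p9"}    # held-out pages known to be on topic

rate = harvest_rate(crawled_pages, relevant_pages)    # 2 of 4 crawled pages
recall = target_recall(crawled_pages, target_pages)   # 1 of 4 targets reached
```

The two numbers pull in different directions: a crawler can keep a high harvest rate by staying in a small relevant region while still missing most of the target set.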
Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge
6
Authors: Jingfa LIU, Fan LI, Ruoyao DING, Zi’ang LIU. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2022, Issue 8, pp. 1189-1204, 16 pages.
At present, the focused crawler is a crucial method for obtaining effective domain knowledge from massive heterogeneous networks. For most current focused crawling technologies, there are some difficulties in obtaining high-quality crawling results. The main difficulties are the establishment of topic benchmark models, the assessment of topic relevance of hyperlinks, and the design of crawling strategies. In this paper, we use domain ontology to build a topic benchmark model for a specific topic, and propose a novel multiple-filtering strategy based on local ontology and global ontology (MFSLG). A comprehensive priority evaluation method (CPEM) based on the web text and link structure is introduced to improve the computation precision of topic relevance for unvisited hyperlinks, and a simulated annealing (SA) method is used to prevent the focused crawler from falling into local optima of the search. By incorporating SA into the focused crawler with MFSLG and CPEM for the first time, two novel focused crawler strategies based on ontology and SA (FCOSA), including FCOSA with only global ontology (FCOSA_G) and FCOSA with both local ontology and global ontology (FCOSA_LG), are proposed to obtain topic-relevant webpages about rainstorm disasters from the network. Experimental results show that the proposed crawlers outperform the other focused crawling strategies on different performance metrics.
Keywords: focused crawler; ontology; priority evaluation; simulated annealing; rainstorm disaster
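The abstract says simulated annealing keeps the crawler from getting stuck in local optima but gives no formulas. The sketch below shows only the textbook Metropolis acceptance rule and a geometric cooling schedule for a maximisation problem; it is an assumption about the general technique, not the FCOSA procedure. The names `accept` and `cool`, the `rate` default, and the example scores are hypothetical.

```python
import math
import random


def accept(current_score, candidate_score, temperature, rng=random):
    """Metropolis rule: always take improvements, sometimes take setbacks."""
    if candidate_score >= current_score:
        return True
    if temperature <= 0:
        return False  # fully cooled: behave greedily
    # A worse candidate is accepted with probability exp(delta / T),
    # where delta is negative, so the chance shrinks as T falls.
    delta = candidate_score - current_score
    return rng.random() < math.exp(delta / temperature)


def cool(temperature, rate=0.95):
    """Geometric cooling schedule: multiply the temperature by a fixed rate."""
    return temperature * rate


# A link scoring 0.5 competes against the current best of 0.8.
temperature = 1.0
took_worse_link = accept(0.8, 0.5, temperature)  # random outcome
temperature = cool(temperature)
```

Early in the crawl, with a high temperature, lower-priority links are followed fairly often, which widens the search; as the temperature falls the crawler behaves more and more like a greedy best-first search.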