Abstract
General-purpose search engine crawlers understand only common HTML tags and cannot effectively parse the content of XML Web sites. This paper builds a pure XML Web site that uses dynamic, custom-defined tags, proposes a scheme in which XSL stylesheet information helps the crawler understand the meaning of the tags in XML pages, and implements a Nutch-based full-text search engine for XML Web sites.
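The idea of letting XSL stylesheet information tell a crawler what custom XML tags mean can be illustrated with a minimal sketch. This is not the paper's implementation and does not use Nutch. It assumes a hypothetical stylesheet in which each template matches a single custom tag and emits one familiar HTML element, and a hypothetical page that uses those tags. The names `tag_roles`, `extract_fields`, `headline`, and `body` are invented for illustration.

```python
import xml.etree.ElementTree as ET

XSL_NS = "{http://www.w3.org/1999/XSL/Transform}"

# Hypothetical stylesheet: each template maps one custom tag to an HTML element.
XSL = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="headline"><h1><xsl:value-of select="."/></h1></xsl:template>
  <xsl:template match="body"><p><xsl:value-of select="."/></p></xsl:template>
</xsl:stylesheet>"""

# Hypothetical XML page that uses the custom tags.
PAGE = "<article><headline>XML search</headline><body>Indexing custom tags.</body></article>"


def tag_roles(xsl_text):
    """Map each matched custom tag to the first HTML element its template emits."""
    roles = {}
    root = ET.fromstring(xsl_text)
    for template in root.iter(XSL_NS + "template"):
        match = template.get("match")
        if match is None:  # named templates carry no tag mapping
            continue
        for child in template:
            if not child.tag.startswith(XSL_NS):  # literal result element
                roles[match] = child.tag
                break
    return roles


def extract_fields(xml_text, roles):
    """Group the text of custom tags under the HTML role given by the stylesheet."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        role = roles.get(element.tag)
        if role is not None:
            fields.setdefault(role, []).append((element.text or "").strip())
    return fields


roles = tag_roles(XSL)
fields = extract_fields(PAGE, roles)
print(roles)
print(fields)
```

In this sketch an indexer could then weight the text filed under `h1` as a title and the text under `p` as body content, much as it would for an ordinary HTML page. Real stylesheets use richer match patterns and nested output, which this simplification does not handle.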
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2008, Issue 15, pp. 95-96, 107 (3 pages in total)
Computer Engineering