摘要
为了解决用户能够快速、准确的搜索互联网上数字作品信息的问题,分析设计了一个对数字作品版权唯一标识符(Digital Copyright Identifier简称DCI)数字作品的垂直搜索引擎。首先基于Heritrix网络爬虫技术,对互联网上的数字作品进行数据采集和正文信息抽取,并将抽取的数据保存到本地;然后基于Lucene的全文检索工具包,对本地数据进行分词、倒排索引、索引检索和改进的相关度排序等处理,最终设计实现了一个通用可扩展的DCI垂直搜索引擎。实验结果表明,该搜索引擎在很大程度上提高了网页信息抽取的准确度和数据的检索效率。
In order to solve the users' problem for searching digital works information quickly and correctly, a vertical search engine about digital work's Copyright Identifier is analyzed and designed. In the first place, based on the Heritrix web crawler, the network digital work's data acquisition and text information extraction are presented and the extracted data is saved to the local; In the second place, on the basis of the Lucene's full-text retrieval toolkit, segmentation, inverted index, index retrieval and im- proved sorting algorithm technology are taken to handle the collected data. a general and extensible DCI vertical search engine is designed and achieved. The experimenal results show that this search engine does enhance web page information extraction accuracy and data indexing efficiency in great degree.
出处
《计算机工程与设计》
CSCD
北大核心
2013年第4期1481-1487,共7页
Computer Engineering and Design
基金
国家科技部支撑计划课题基金项目(2012BAH04f03)
关键词
数据采集
倒排索引
垂直搜索引擎
信息抽取
相关度排序
data acquisition
inverted index
vertical search engine
information acquisition
relevance sorting algorithm