Abstract
The Internet carries a large volume of terrorism-related information, and organizing and exploiting it better plays an important role in preventing and countering terrorism. To address the scattered distribution of this information online, web crawler technology is used to collect terrorism-related information from the Internet and build a terrorism-related information database. On this basis, a Chinese word segmenter is introduced to segment text at a reasonable granularity, and a full-text search engine is built with Lucene to improve retrieval efficiency. In addition, when the index is built, each document's weight is adjusted according to the number of terrorism-related feature terms it contains, so that documents containing more feature terms rank higher at query time. The system uses Python for information collection and data structuring, MySQL for the terrorism-related information database, and Lucene for the full-text search engine. Tests show that the engine completes retrieval quickly and accurately.
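The index-time weighting described in the abstract can be illustrated with a minimal pure-Python sketch. It does not use Lucene and is not the authors' implementation. The feature vocabulary, the `step` parameter, the linear boost formula, the choice to count distinct feature terms, and the function names `feature_boost`, `build_index`, and `search` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical feature vocabulary; the paper's actual term list is not given here.
FEATURE_TERMS = {"attack", "bomb", "hostage"}


def feature_boost(tokens, feature_terms=FEATURE_TERMS, step=0.5):
    """Return an index-time weight: 1.0 plus `step` per distinct feature term present.

    The linear form and the value of `step` are assumptions for illustration.
    """
    distinct_hits = len(set(tokens) & set(feature_terms))
    return 1.0 + step * distinct_hits


def build_index(docs):
    """Map each document id to (term counts, boost).

    `docs` is a dict of {doc_id: list of already-segmented tokens}.
    """
    return {
        doc_id: (Counter(tokens), feature_boost(tokens))
        for doc_id, tokens in docs.items()
    }


def search(index, query_term):
    """Rank documents containing `query_term` by term frequency times boost."""
    scored = [
        (counts[query_term] * boost, doc_id)
        for doc_id, (counts, boost) in index.items()
        if counts[query_term] > 0
    ]
    # Highest score first; ties are broken by document id for a stable order.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = {
    "a": ["bomb", "attack", "city"],
    "b": ["bomb", "city", "news"],
    "c": ["city", "news"],
}
index = build_index(docs)
ranking = search(index, "city")
```

In this toy data all three documents contain the query term once, so the ordering is decided entirely by the boost: the document with two feature terms ranks ahead of the one with a single feature term, which ranks ahead of the one with none.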
Authors
彭世亮
周欣
卿粼波
熊淑华
何小海
Peng Shiliang; Zhou Xin; Qing Linbo; Xiong Shuhua; He Xiaohai (College of Electronic Information, Sichuan University, Chengdu 610065, China; China Information Technology Security Evaluation Center, Beijing 100085, China)
Source
《信息技术与网络安全》
2019, No. 11, pp. 23-28 (6 pages)
Information Technology and Network Security
Keywords
Lucene
search engine
word segmenter
terrorism-related