Abstract
In a web information monitoring system, regular expressions and HTMLparser are used to match web page HTML recursively, so that an entire website can be parsed. In practical use, the delay between the publication of new information and its capture is less than five minutes, and no missed, skipped, or duplicate captures were observed. The system is implemented in Java and reaches a precision of 99% (98% according to the original English abstract) and an omission rate of 0.
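The recursive matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `RecursiveLinkMatcher`, the simplified `href` pattern, and the in-memory `site` map standing in for HTTP fetches are all assumptions made for illustration, and the HTMLparser library is not used here. The sketch only shows how a regular expression can extract links from each page and how a visited set prevents the same page from being captured twice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; names and structure are not taken from the paper.
class RecursiveLinkMatcher {
    // Simplified pattern: matches only double-quoted href attributes on <a> tags.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the link targets found in one page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Depth-first traversal. The visited set guarantees that each known page
    // is processed once, even when pages link back to each other.
    // Assumption: `site` maps a URL to its HTML instead of fetching over HTTP.
    static void crawl(String url, Map<String, String> site, Set<String> visited) {
        String html = site.get(url);
        if (html == null || !visited.add(url)) {
            return; // unknown page, or already captured
        }
        for (String link : extractLinks(html)) {
            crawl(link, site, visited);
        }
    }

    public static void main(String[] args) {
        Map<String, String> site = new HashMap<>();
        site.put("/index", "<a href=\"/news\">News</a> <a href=\"/about\">About</a>");
        site.put("/news", "<a href=\"/index\">Home</a> <a href=\"/news/1\">Item</a>");
        site.put("/about", "<p>No links here</p>");
        site.put("/news/1", "<a href=\"/news\">Back</a>");

        Set<String> visited = new LinkedHashSet<>();
        crawl("/index", site, visited);
        System.out.println(visited);
    }
}
```

A regular expression of this kind is fragile on real-world HTML (single-quoted or unquoted attributes, scripts, comments), which is presumably why the system pairs regular expressions with an HTML parser.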
Source
《信息技术》 (Information Technology)
2008, No. 4, pp. 33-34 (2 pages)
Funding
Supported by a project of a company in Shanghai
Keywords
regular expression
web monitoring
information crawling