Abstract
This article introduces Xpressive, a new regular expression engine implemented with C++ template techniques, and uses Xpressive's static regular expressions to implement web page cleaning. The implementation avoids the main drawback of traditional approaches, whose algorithms for analyzing a page's DOM tree are complex and hard to implement. While keeping the algorithm simple and the results correct, the article also presents test data demonstrating the running-speed advantage of static regular expressions.
Source
Microcomputer Information (《微计算机信息》)
PKU Core Journal (北大核心)
2007, No. 36, pp. 226-227 (2 pages)
Control & Automation
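Funding
863 Program (2003AA116060): Research on Key Technologies and Applications of Multi-Channel Network Education Based on Domestic Software and Hardware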