Abstract
Taobao and Baidu Youa are two representative C2C e-commerce platforms in China. This paper studies how to extract sales records and user information from these two platforms. For the shop sales data on the two sites, a Web data extraction algorithm is designed that is based on JerichoHtmlParser and uses HTML data tags as landmarks; for the user information on the two sites, a Web data extraction algorithm based on regular expression matching is designed. A Web extraction system is designed and implemented that can extract data from different sites according to different extraction rules. Finally, real data from the two platforms are extracted to validate the design, and the experiments confirm that the prototype system achieves relatively high recall and accuracy.
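The two extraction ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's standard `re` and `html.parser` modules rather than JerichoHtmlParser, and the site names, rule patterns, table `id`, and sample markup are all hypothetical and invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical per-site rules: field name -> regex with one capture group.
# Neither the patterns nor the markup come from the real sites.
RULES = {
    "site_a": {
        "user": re.compile(r'<span class="nick">([^<]+)</span>'),
        "city": re.compile(r'<li>City:\s*([^<]+)</li>'),
    },
    "site_b": {
        "user": re.compile(r'<b id="uname">([^<]+)</b>'),
        "city": re.compile(r'<td class="loc">([^<]+)</td>'),
    },
}


def extract_user_info(site, html):
    """Apply the rule set selected by `site`; fields that do not match map to None."""
    record = {}
    for field, pattern in RULES[site].items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record


class LandmarkRowExtractor(HTMLParser):
    """Collect the cell texts of each row inside the table whose id is the landmark.

    Simplification: nested tables inside the landmark table are not handled.
    """

    def __init__(self, landmark_id):
        super().__init__()
        self.landmark_id = landmark_id
        self.in_table = False
        self.in_cell = False
        self.rows = []
        self._row = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.landmark_id:
            self.in_table = True          # landmark reached: start collecting
        elif self.in_table and tag == "tr":
            self._row = []
        elif self.in_table and tag == "td" and self._row is not None:
            self.in_cell = True
            self._buf = []

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag == "td" and self.in_cell:
            self._row.append("".join(self._buf).strip())
            self.in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self.in_table = False         # end of the landmark region

    def handle_data(self, data):
        if self.in_cell:
            self._buf.append(data)


def extract_sales_rows(html, landmark_id):
    parser = LandmarkRowExtractor(landmark_id)
    parser.feed(html)
    parser.close()
    return parser.rows


page_a = '<div><span class="nick">alice</span><ul><li>City: Quanzhou </li></ul></div>'
page_b = '<table><tr><td><b id="uname">bob</b></td><td class="loc">Xiamen</td></tr></table>'
sales_page = (
    '<table id="other"><tr><td>x</td></tr></table>'
    '<table id="deals"><tr><td>item1</td><td>9.90</td></tr>'
    '<tr><td>item2</td><td>15.00</td></tr></table>'
)
```

Keeping the site-specific patterns in a rule table and the landmark as a parameter leaves the extraction code itself unchanged when a new site is added, which is one plausible reading of "different extraction rules for different sites".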
Source
Journal of Quanzhou Normal University
2010, No. 4, pp. 12-17 (6 pages)