Abstract
This paper studies how to extract data from a Web page that contains several data blocks. Comparing the XPaths of the individual data blocks shows that they are very similar. Based on this observation, an XPath-comparison-based Extraction Rules Generation Algorithm (XERG) is proposed. Once the extraction rules for the data blocks have been generated, the information inside each block can be extracted with relative XPath expressions or regular expressions. Experimental results show that the method identifies the data blocks accurately and extracts the information within them correctly.
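The observation that sibling data blocks share nearly identical XPaths can be illustrated with a small sketch. This is not the XERG algorithm itself, whose details the abstract does not give; it is a minimal illustration that assumes simple absolute XPaths made only of tag names with optional positional indices. The names `split_steps` and `generalize` are hypothetical.

```python
import re

# One XPath step: a tag name with an optional positional index, e.g. "tr[3]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def split_steps(xpath):
    """Split an absolute XPath such as /html/body/div[2] into (tag, index) pairs."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported step: {part!r}")
        tag, index = match.group(1), match.group(2)
        steps.append((tag, int(index) if index else None))
    return steps


def generalize(xpaths):
    """Build one XPath covering all inputs by dropping the indices that differ.

    Assumption for this sketch: all paths have the same depth and the same
    tag at every step, so they can differ only in positional indices.
    """
    paths = [split_steps(p) for p in xpaths]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("paths have different depths")
    result = []
    for column in zip(*paths):
        tags = {tag for tag, _ in column}
        if len(tags) != 1:
            raise ValueError("paths differ in tag names")
        tag, first_index = column[0]
        indices = {index for _, index in column}
        if len(indices) == 1 and first_index is not None:
            # Every sample agrees on this index, so keep it.
            result.append(f"{tag}[{first_index}]")
        else:
            # Indices vary between samples: match every sibling with this tag.
            result.append(tag)
    return "/" + "/".join(result)


samples = [
    "/html/body/table[1]/tr[2]",
    "/html/body/table[1]/tr[3]",
    "/html/body/table[1]/tr[5]",
]
rule = generalize(samples)
```

Here `rule` selects every `tr` under the first table, so a rule generalized this way may match more nodes than the samples it was built from. Fields inside each matched block could then be addressed with relative XPath expressions or regular expressions, as the abstract describes.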
Source
《郑州大学学报(理学版)》
CAS
2007, No. 2, pp. 161-166 (6 pages)
Journal of Zhengzhou University:Natural Science Edition
Funding
Supported by the National Natural Science Foundation of China
Grant Nos. 90412015,
60603022