Abstract
To address the low extraction precision, long running time, and poor robustness of wrappers when extracting data from Web sites, this paper proposes an improved data cross-locating method that works bottom-up, inducing extraction rules from the internal features of page elements. The method builds two coordinate systems, one based on the text features of elements and one based on their attribute features, and cross-validates the coordinate values from the two systems to obtain the metadata to be extracted. Experimental results show that, compared with the absolute path, relative path, absolute feature path, relative feature path, and cross-locating methods, the proposed method improves precision by 31.1% at the cost of a slight 2.2% drop in recall, and that it reduces extraction time by 17.9 seconds relative to the cross-locating method.
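The abstract does not give the details of the two coordinate systems, so the following is only a minimal sketch of the general idea of cross-validating two independent element descriptions, not the authors' algorithm. The helper names (`text_coordinate`, `attribute_coordinate`, `cross_locate`), the "text shape" feature, the use of tag and `class` as the attribute feature, and the sample page are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page; it must be well-formed for ElementTree to parse it.
PAGE = """
<html><body>
  <span class="price">$12.50</span>
  <span class="price">Sold out</span>
  <div class="note">$3.99</div>
  <span class="price">$7.00</span>
</body></html>
"""

def text_coordinate(elem):
    """Assumed text feature: a coarse 'shape' of the element's own text."""
    text = (elem.text or "").strip()
    shape = re.sub(r"[A-Za-z]+", "a", text)   # any run of letters -> 'a'
    return re.sub(r"\d+", "9", shape)         # any run of digits  -> '9'

def attribute_coordinate(elem):
    """Assumed attribute feature: the tag name plus the class attribute."""
    return (elem.tag, elem.get("class", ""))

def cross_locate(root, labeled_example):
    """Keep only elements that match the labeled example in BOTH coordinates."""
    want_text = text_coordinate(labeled_example)
    want_attr = attribute_coordinate(labeled_example)
    by_text = {id(e) for e in root.iter() if text_coordinate(e) == want_text}
    by_attr = {id(e) for e in root.iter() if attribute_coordinate(e) == want_attr}
    both = by_text & by_attr                  # the cross-validation step
    return [e for e in root.iter() if id(e) in both]

root = ET.fromstring(PAGE.strip())
example = root.find("body/span")              # one hand-labeled target element
print([e.text for e in cross_locate(root, example)])
```

In this toy page, the text feature alone would also select the `div` containing `$3.99`, and the attribute feature alone would also select the "Sold out" `span`; requiring agreement in both removes each false positive. This shows how cross-validation can raise precision, and a stricter match can also miss true targets, which fits the small recall drop the abstract reports.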
Source
Network New Media Technology (《网络新媒体技术》)
2015, No. 4, pp. 28-34, 40 (8 pages in total)
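Funding
Strategic Priority Research Program project: Development of a Smart TV Platform and Service Support Environment (XDA06040501)
National Key Technology R&D Program project: Application Demonstration of a New TV Commerce Complex Business Model (2012BAH73F02)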
Keywords
Web Information Extraction
Cross Locating
Wrapper
Internal Characteristic
DOM Tree