Abstract
To address the low efficiency of current methods for collecting household textile resources from the Web when handling massive Web resources, particularly deep Web resources, an automatic approach to extracting household textile resources from the Web is proposed. The approach first builds a domain model, based on the finiteness and convergence of query interface attributes, to identify deep Web query interfaces. The identified interfaces are then filled in automatically with household textile domain keywords to extract household textile resources from the deep Web. To filter topic-irrelevant noise from the returned result pages, each page is divided into visual blocks, a block importance model is trained on labeled block samples, and the model is used to filter out the noise. Experimental results show that, compared with rule-based approaches, the domain model improves the positive predictive value and accuracy of query interface identification by 8% and 6%, respectively, and the block importance model improves the harmonic mean of precision and recall for noise filtering by an average of 12.90% across three levels.
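The abstract reports positive predictive value, accuracy, and the harmonic mean of precision and recall, but does not spell out how they are computed. The sketch below uses the standard confusion-matrix definitions, which are assumed here rather than taken from the paper; the function names and the example counts are hypothetical and purely illustrative.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of items flagged positive that are truly positive (precision)."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truly positive items that were flagged positive."""
    actual = tp + fn
    return tp / actual if actual else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all items that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def harmonic_mean(precision: float, rec: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    denom = precision + rec
    return 2 * precision * rec / denom if denom else 0.0


# Hypothetical counts (not from the paper): 100 forms flagged as query
# interfaces, 80 of them correctly; 10 real query interfaces missed.
tp, fp, tn, fn = 80, 20, 90, 10

ppv = positive_predictive_value(tp, fp)
rec = recall(tp, fn)
acc = accuracy(tp, tn, fp, fn)
f1 = harmonic_mean(ppv, rec)

print(f"PPV={ppv:.3f} accuracy={acc:.3f} recall={rec:.3f} F1={f1:.3f}")
```

Under these definitions, a reported gain such as "8% in positive predictive value" would correspond to the difference between the values computed for the two methods being compared on the same test set.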
Authors
Yang Juan
Wu Zhiming
Zhang Yuanpeng
YANG Juan; WU Zhiming; ZHANG Yuanpeng (School of Textiles and Clothing, Nantong University, Nantong, Jiangsu 226019, China; College of Textile and Clothing Engineering, Soochow University, Suzhou, Jiangsu 215123, China; School of Textile and Clothing, Jiangnan University, Wuxi, Jiangsu 214122, China; Department of Medical Informatics, Nantong University, Nantong, Jiangsu 226001, China; School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China)
Source
Fangzhi Xuebao (《纺织学报》)
EI
CAS
CSCD
Peking University Core Journals
2018, Issue 10, pp. 156-161 (6 pages)
Journal of Textile Research
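Funding
National Natural Science Foundation of China (81701793)
Philosophy and Social Science Fund Project of Jiangsu Higher Education Institutions (2016SJB760064)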
Keywords
household textile
resource database
deep Web
information extraction