A Theme Information Extraction Method Based on Repetitive Pattern Reverse Matching
Abstract  Information on web pages is mostly organized in repetitive HTML structures and rendered in a consistent display style. This paper studies the recognition of theme information blocks in web pages with complex repetitive patterns and proposes an improved algorithm based on reverse matching of repetitive patterns. The algorithm refines the DOM tree according to the HTML tag structure and the class attribute, reconstructs the vector space model of the page, matches repetitive structural patterns in reverse order, and then extracts the theme information. Experimental results show that the method accurately identifies theme repetitive patterns in complex page structures, effectively avoids interference from non-theme repetitive patterns, and achieves good recall and precision.
Source  Computer Applications and Software (《计算机应用与软件》), CSCD, Peking University Core Journal, 2013, No. 4, pp. 88-91 (4 pages)
Funding  National Natural Science Foundation of China (61003045)
Keywords  Information extraction; Repetitive pattern; Theme recognition; Reverse matching
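
The abstract names three ingredients: a DOM tree refined by tag structure and class attribute, a structural representation of the page, and matching of repetitive patterns in reverse order. The paper's actual algorithm, including its vector space model, is not given in this record, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a node "signature" made of tag name plus class value, and it treats "reverse matching" as counting identical sibling signatures from the last child backwards. All names (`Node`, `TreeBuilder`, `trailing_run`, `best_repeated_block`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, assumed subset).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, cls, parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.children = []

    @property
    def signature(self):
        # Tag name plus class attribute, e.g. "li.item" (assumed representation).
        return f"{self.tag}.{self.cls}" if self.cls else self.tag


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", "")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def trailing_run(signatures):
    """Count identical signatures, scanning from the last child backwards."""
    if not signatures:
        return 0
    run = 1
    for i in range(len(signatures) - 2, -1, -1):
        if signatures[i] != signatures[-1]:
            break
        run += 1
    return run


def best_repeated_block(root):
    """Return (run_length, parent) for the longest trailing run in the tree."""
    best = (0, None)
    stack = [root]
    while stack:
        node = stack.pop()
        run = trailing_run([c.signature for c in node.children])
        if run > best[0]:
            best = (run, node)
        stack.extend(node.children)
    return best


sample = """
<div class="nav"><a class="link">Home</a><a class="link">About</a></div>
<ul class="news">
  <li class="ad">Sponsored</li>
  <li class="item">A</li><li class="item">B</li><li class="item">C</li>
</ul>
"""
builder = TreeBuilder()
builder.feed(sample)
run, parent = best_repeated_block(builder.root)
print(run, parent.signature)  # prints "3 ul.news"
```

Including the class value in the signature is what keeps the `li.ad` entry out of the `li.item` run here; a tag-only comparison would merge the two. A real implementation would also need to handle multi-node patterns and decide which repeated block carries the theme content, which this sketch does not attempt.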