Automatic structured data extraction from Web forums (针对Web论坛的一种结构化数据自动抽取方法)

Cited by: 1


Abstract: Because forum page layouts are complex and users can post with few restrictions, extracting structured data from Web forum pages remains a challenging and largely unsolved task. This paper proposes a general solution for automatically extracting structured data from arbitrary forum sites. The method analyzes page structure to discover data records in list pages and post pages, then applies a set of production rules to extract structured data from those records. Experimental results show that the approach significantly outperforms existing methods at extracting forum data records, and that it achieves high accuracy in extracting post metadata such as title, author, posting time, and content text blocks.
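The two stages named in the abstract (discovering data records from page structure, then applying rules to each record) can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a well-formed toy page, treats "records" as the largest group of sibling subtrees sharing the same child-tag layout, and uses two hypothetical rules: a timestamp pattern for the posting time, and "first remaining text is the author". The names `signature`, `find_records`, `extract_fields`, and `TIME_RULE` are all invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def signature(node):
    """Crude structural fingerprint: the node's tag plus its children's tags."""
    return (node.tag, tuple(child.tag for child in node))


def find_records(root, min_repeats=2):
    """Return the largest group of sibling subtrees with identical signatures."""
    best = []
    for parent in root.iter():
        groups = defaultdict(list)
        for child in parent:
            groups[signature(child)].append(child)
        for nodes in groups.values():
            if len(nodes) >= min_repeats and len(nodes) > len(best):
                best = nodes
    return best


# Hypothetical rule: a text node shaped like "YYYY-MM-DD HH:MM" is the post time.
TIME_RULE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def extract_fields(record):
    """Apply the illustrative rules to one record and return its fields."""
    texts = [t.strip() for t in record.itertext() if t.strip()]
    fields = {"author": None, "time": None, "content": None}
    rest = []
    for text in texts:
        if fields["time"] is None and TIME_RULE.fullmatch(text):
            fields["time"] = text
        else:
            rest.append(text)
    if rest:
        # Hypothetical rule: first remaining text is the author, the rest is content.
        fields["author"] = rest[0]
        fields["content"] = " ".join(rest[1:])
    return fields


# Toy post page (assumed well-formed so the stdlib XML parser can read it).
PAGE = """
<html><body>
  <div id="nav"><a>Home</a></div>
  <div id="posts">
    <div><span>alice</span><em>2010-01-02 10:30</em><p>First post text.</p></div>
    <div><span>bob</span><em>2010-01-02 11:05</em><p>A reply.</p></div>
    <div><span>carol</span><em>2010-01-03 09:00</em><p>Another reply.</p></div>
  </div>
</body></html>
"""

records = find_records(ET.fromstring(PAGE.strip()))
posts = [extract_fields(r) for r in records]
for post in posts:
    print(post)
```

Real forum pages are rarely well-formed and their records vary in structure, so a practical system would need a tolerant HTML parser and approximate subtree matching rather than the exact signature comparison used here.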

Authors: GUAN Mian (关冕), MA Jun (马军)

Affiliation: School of Computer Science and Technology, Shandong University

Source: Journal of Shandong University (Natural Science) (《山东大学学报（理学版）》), 2010, No. 5, pp. 42-47 (6 pages). Indexed in: CAS, CSCD, PKU Core (北大核心).

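Funding: National Natural Science Foundation of China (60970047); Natural Science Foundation of Shandong Province (Y2008G19); Shandong Province Science and Technology Research Program (2008GG10001026, 2007GG10001002)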

Keywords: Web forums; structured data; information extraction; Web mining

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]







