Research and Realization of a Web Information Extraction and Knowledge Presentation System (Web信息抽取及知识表示系统的研究与实现)

Cited by: 2

Abstract: This paper studies how to automatically extract structured data from data-intensive Web pages and build a knowledge presentation system from it. The system fetches dynamic pages with the help of a knowledge database, preprocesses them, and converts them to XML documents. It then applies a PAT-array-based pattern discovery algorithm to find repeated patterns automatically. Combining these patterns with an ontology-based keyword library, it recognizes the data display structure model of each page, and it stores the extracted data in the knowledge database through XML object-relational mapping. Together these steps make Web data extraction automatic. The system also uses the knowledge already in the knowledge database to extract new knowledge from the Internet, so the knowledge database extends itself. Experiments with a system that automatically extracts traffic information and generates and presents mixed-mode travel plans show that the approach achieves high extraction accuracy and adapts well to Web pages from different domains with different structures.
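The repeated-pattern step mentioned in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes the page is first reduced to a sequence of tag tokens, and it uses a naive sort-based suffix array as a stand-in for a PAT array. The names `tag_tokens`, `suffix_array`, and `repeated_patterns` are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class _TagTokenizer(HTMLParser):
    """Reduce markup to a flat token sequence: tags plus a '#text' marker."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse every non-blank text node to one placeholder token.
        if data.strip():
            self.tokens.append("#text")


def tag_tokens(html: str) -> List[str]:
    """Hypothetical helper: tokenize an HTML fragment into tag tokens."""
    parser = _TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def suffix_array(tokens: List[str]) -> List[int]:
    """Naive suffix array: suffix start positions sorted by suffix content."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _common_prefix(tokens: List[str], i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes at i and j."""
    n = 0
    while i + n < len(tokens) and j + n < len(tokens) and tokens[i + n] == tokens[j + n]:
        n += 1
    return n


def repeated_patterns(tokens: List[str], min_len: int = 2) -> Dict[Tuple[str, ...], List[int]]:
    """Candidate repeated patterns mapped to all their start positions.

    Neighbouring suffixes in the sorted order share their longest common
    prefix, so each such prefix of length >= min_len occurs at least twice.
    """
    sa = suffix_array(tokens)
    found: Dict[Tuple[str, ...], List[int]] = {}
    for a, b in zip(sa, sa[1:]):
        k = _common_prefix(tokens, a, b)
        if k >= min_len:
            pattern = tuple(tokens[a:a + k])
            if pattern not in found:
                # Scan once to collect every (possibly overlapping) occurrence.
                found[pattern] = [
                    i for i in range(len(tokens) - k + 1)
                    if tuple(tokens[i:i + k]) == pattern
                ]
    return found
```

Calling `repeated_patterns(tag_tokens(page_html))` on a page that lists records in table rows would be expected to surface the row template as one of the candidates. A real system would still need to rank the candidates and map them to fields, which is the role the abstract gives to the ontology-based keyword library.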

Authors: Tan Shoubiao (谭守标), Xu Chao (徐超), Jiang Yuan (江元), Ning Renxia (宁仁霞)

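Affiliations: School of Electronic Science and Technology, Anhui University (安徽大学电子科学与技术学院); Department of Electronic Information Engineering, Huangshan University (黄山学院电子信息工程系)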

Source: Computer Systems & Applications (《计算机系统应用》), 2010, No. 9, pp. 1-4, 9 (5 pages in total)

Funding: Natural Science Foundation of the Anhui Provincial Department of Education (2005KJ004ZD)

Keywords: web information extraction; knowledge presentation; data-intensive web pages; ontology-based keyword library

Classification number: TP393.09 [Automation and Computer Technology: Computer Application Technology]
