Domain-oriented structured analysis of Web texts

Original title: 面向领域的Web文本结构化分析. Cited by: 2

Abstract: To make full use of domain features in the structured analysis of Web texts, this paper proposes a domain-oriented method for structured analysis of Web texts. Grounded in domain features, the method first constructs an HTML tree from the structural characteristics of semi-structured text and the hierarchical nature of HTML. It then builds a domain ontology using the ideas and methods of ontology, and uses it to extract valuable information from the HTML tree. Finally, it combines a general lexicon and a domain lexicon to carry out the structured analysis. Experimental results show that the method performs structured analysis of Web texts well.

Authors: Yang Chunlei (杨春磊), Liu Nian (刘念), Tang Linyu (唐林雨), Shao Kun (邵堃)

Affiliation: School of Computer and Information, Hefei University of Technology (合肥工业大学计算机与信息学院)

Source: Journal of Hefei University of Technology: Natural Science (《合肥工业大学学报（自然科学版）》), 2013, No. 3, pp. 309-314 (6 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (grants 60975033, 60575035, 60275022)

Keywords: domain feature; Web text structured analysis; semi-structured text; domain ontology

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
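
The pipeline outlined in the abstract (build an HTML tree, pick out the nodes that carry domain information, then analyze their text against a general lexicon and a domain lexicon) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record gives no algorithmic detail, so the `Node` class, the `DOMAIN_LEXICON` and `GENERAL_WORDS` tables, and the simple token lookup standing in for ontology-based extraction are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical domain lexicon: surface term -> ontology concept (assumption).
DOMAIN_LEXICON = {"price": "Price", "brand": "Brand"}
# Hypothetical general word list, used here only to filter common words.
GENERAL_WORDS = {"the", "is", "of", "a"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a simple tree that mirrors the nesting of the HTML tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1].children.append(Node("#text", text))


def iter_text(node, path=()):
    """Yield (ancestor path, text) for every text node in the tree."""
    if node.tag == "#text":
        yield "/".join(path), node.text
    for child in node.children:
        yield from iter_text(child, path + (node.tag,))


def extract(html):
    """Return one record per text node that mentions a domain concept."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    records = []
    for path, text in iter_text(builder.root):
        tokens = [t.strip(".,:;").lower() for t in text.split()]
        content = [t for t in tokens if t and t not in GENERAL_WORDS]
        concepts = sorted({DOMAIN_LEXICON[t] for t in content if t in DOMAIN_LEXICON})
        if concepts:
            records.append({"path": path, "concepts": concepts, "text": text})
    return records
```

Calling `extract` on a page fragment returns, for each matching text node, its position in the tree, the concepts it mentions, and the raw text. A real system would replace the flat dictionary lookup with a proper ontology and, for Chinese text, a word segmenter.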
