
Method of Web Content Mining Based on XML (cited by: 1)
Abstract: Based on an analysis of the characteristics of Web content mining, a Web content mining model built on XML technology is proposed. The HITS algorithm is used to identify authoritative Web pages, and the HTML Tidy tool is used to clean non-XML files and convert them into well-formed XML documents. Taking an automatic extraction system for traditional scientific papers on the Internet as an example, text clustering and classification techniques are then applied to mine the resulting XML document data. Experimental results show that the model works well and can extract Web page content automatically and effectively.
Authors: Zheng Xia (郑霞), Chen Jianguo (陈建国)
Source: Journal of Shenyang University (Natural Science), CAS, 2012, No. 3, pp. 52-55 (4 pages)
Keywords: Web mining; data mining; text clustering; non-XML documents
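
The abstract names the HITS algorithm as the step that selects authoritative Web pages, but gives no detail on how the authors implemented it. The following is only a minimal sketch of the standard HITS iteration, not the paper's code. The function name `hits`, the `iterations` parameter, and the four-page graph `toy_links` are all hypothetical and chosen purely for illustration.

```python
from math import sqrt


def hits(links, iterations=50):
    """Run the standard HITS iteration on a small link graph.

    links: dict mapping each page to a list of pages it links to
           (hypothetical input format, assumed for this sketch).
    Returns two dicts: authority scores and hub scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page = sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        authority = {p: v / norm for p, v in new_auth.items()}

        # Hub score of a page = sum of authority scores of pages it links to.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += authority[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return authority, hub


# Toy graph (assumed data): pages "a" and "b" both link to "c"; "c" links to "d".
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
authority, hub = hits(toy_links)

# Pages ranked from most to least authoritative.
ranked = sorted(authority, key=authority.get, reverse=True)
print(ranked[0])
```

In a pipeline like the one the abstract outlines, the top-ranked pages from such a step would presumably be the ones passed on to cleaning and XML conversion. How the authors choose the cutoff is not stated in the abstract.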
