Abstract
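The rapid growth of the Internet and the emergence of large volumes of unstructured data pose a major challenge to traditional data mining tools. XML (Extensible Markup Language) inherits the flexibility and simplicity of HTML while also enforcing structural integrity and allowing user-defined tags, and it has become an important direction in web mining. After introducing the basic features of XML, this paper addresses the large amount of heterogeneous data on the web by examining a multi-tree-based method for converting HTML to XML, which normalizes web text. The method is then applied in a proposed XML-based web text mining model to improve the effectiveness of web text mining.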
With the rapid development of the Internet and the emergence of unstructured data, traditional data mining tools are greatly challenged. XML technology not only inherits the flexibility and simplicity of HTML, but also enforces structural integrity and supports custom tags, and it has become a very important approach to web mining. After introducing the development history and basic characteristics of the XML language, this article discusses how to standardize web text according to the characteristics of data on the web. It proposes an XML-based text mining model that, combined with a multi-tree-based HTML-to-XML transformation approach, mines heterogeneous web data effectively.
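As an illustration of the multi-tree-based HTML-to-XML conversion mentioned in the abstract, the following is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not describe in detail. It assumes that loosely structured HTML is first parsed into a multi-way tree (each node may have any number of children) and that the tree is then serialized as well-formed XML. The names `Node`, `MultiTreeBuilder`, `to_element`, `html_to_xml`, the `VOID_TAGS` set and the synthetic `document` root are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a small set of HTML tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of the multi-way tree; children are Node objects or text strings."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        # Valueless attributes (e.g. <input disabled>) arrive as None; store "".
        self.attrs = {k: (v if v is not None else "") for k, v in (attrs or [])}
        self.children = []


class MultiTreeBuilder(HTMLParser):
    """Builds the tree while tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("document")  # synthetic root so the XML has one top element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open tag;
        # an end tag with no matching open tag is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)


def to_element(node):
    """Recursively converts a Node into an ElementTree element."""
    elem = ET.Element(node.tag, node.attrs)
    last = None
    for child in node.children:
        if isinstance(child, str):
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            last = to_element(child)
            elem.append(last)
    return elem


def html_to_xml(html):
    builder = MultiTreeBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(to_element(builder.root), encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<html><body><p>Hello<br>world<p>Second</body></html>"))
```

In this sketch an unclosed `<p>` is simply nested inside the preceding one and closed when its ancestor closes. A real converter would need HTML-specific repair rules and validation of attribute names, which are omitted here.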
Source
《微计算机信息》
Peking University Core Journal
2006, No. 11X, pp. 196-197, 177 (3 pages in total)
Control & Automation
Funding
General Armament Department Pre-Research Fund (grant number not disclosed)
Keywords
XML
Web text mining
Multi-tree
XML (eXtensible Markup Language), web text mining, multi-tree