Abstract
To address the problem of extracting Web data in Web mining, an XML-based Web data extraction method is designed. Because the defining characteristic of Web data is that it is semi-structured, XML, itself a semi-structured data model, is adopted to overcome the poor fit between traditional relational databases and Web data storage. The XML document descriptions are mapped one-to-one to the attributes of a relational database, which allows precise querying and model extraction. Because much of the information in Web data is irrelevant to extraction, XSL is used to filter the irrelevant data out of the XML and to perform extraction in real time; the merged results are then saved to an XML document. Experimental results indicate that the method handles the extraction and storage of Web data well.
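The filter-map-merge flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sample page, the column names, and the helper functions `extract_records` and `merge_to_xml` are all hypothetical, and Python's standard `xml.etree.ElementTree` stands in for the XSL filtering step because the standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical page content, assumed to be already converted to well-formed XML.
PAGE_XML = """
<page>
  <nav>Home | Products | Contact</nav>
  <item><name>Widget</name><price>9.50</price><ad>Buy now!</ad></item>
  <item><name>Gadget</name><price>12.00</price></item>
  <footer>Copyright notice</footer>
</page>
"""

# Assumed relational attributes; each maps one-to-one to an XML child element.
COLUMNS = ("name", "price")


def extract_records(xml_text, record_tag="item", columns=COLUMNS):
    """Keep only the child elements that correspond to table columns."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter(record_tag):
        # findtext returns None when the child element is missing.
        records.append({col: node.findtext(col) for col in columns})
    return records


def merge_to_xml(records, root_tag="result", record_tag="item"):
    """Merge the extracted records into one XML document, returned as a string."""
    root = ET.Element(root_tag)
    for rec in records:
        item = ET.SubElement(root, record_tag)
        for col, value in rec.items():
            ET.SubElement(item, col).text = value
    return ET.tostring(root, encoding="unicode")


records = extract_records(PAGE_XML)
merged = merge_to_xml(records)
print(merged)
```

Elements with no matching column (`nav`, `ad`, `footer` in this sample) never reach the merged document, which mirrors the role the abstract assigns to XSL. Each dictionary in `records` has the shape of one row in the assumed relational table.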
Source
《黑龙江工程学院学报》
CAS
2004, No. 1, pp. 28-30 (3 pages)
Journal of Heilongjiang Institute of Technology
Keywords
KDD (knowledge discovery in databases)
KDW (knowledge discovery in Web)
semi-structured
XML (Extensible Markup Language)
XSL (Extensible Stylesheet Language)
DOM (Document Object Model)
data extraction
Web mining