Algorithms of mining data records from website automatically (in English)
Abstract: To improve the accuracy and completeness of mining data records from the web, the concepts of isomorphic page and directory page are introduced and three algorithms are proposed. Web pages are called isomorphic pages if they share the same structure and differ only in their main information. A web page that contains many links to isomorphic pages is called a directory page. Algorithm 1 finds directory pages in a website by analyzing the similarity of adjacent links. It first sorts the links and counts the links that fall in each directory; if the count is greater than a given threshold, it compares the linked sub-pages for similarity and returns the result. A function for judging the similarity of web pages is also given. Algorithm 2 mines the main information from isomorphic pages with a noise information filter and obtains the data records; it relies on the fact that two isomorphic pages share the same noise information and differ only in their main information. Algorithm 3 mines data records automatically from an entire website using spider technology. Experiments show that the proposed algorithms mine more complete data records than existing algorithms. Mining data records from isomorphic pages is an effective method.
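The first step of Algorithm 1 (sort links, count links per directory, compare against a threshold) and the noise-filtering idea behind Algorithm 2 can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `candidate_directories` and `main_information`, the default threshold, the use of the URL path as the "directory", and the representation of a page as a list of text blocks are all hypothetical. The sub-page similarity comparison is omitted because the abstract does not give the similarity function.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath


def candidate_directories(links, threshold=3):
    """Group links by URL directory; keep directories with more than `threshold` links.

    Assumption: "directory" means host plus the directory part of the URL path.
    """
    groups = defaultdict(list)
    # Sorting places links from the same directory next to each other.
    for url in sorted(links):
        parsed = urlparse(url)
        directory = parsed.netloc + posixpath.dirname(parsed.path)
        groups[directory].append(url)
    # Only directories whose link count exceeds the threshold are candidates;
    # their sub-pages would then be compared for similarity (not shown here).
    return {d: urls for d, urls in groups.items() if len(urls) > threshold}


def main_information(page_a, page_b):
    """Return the text blocks of page_a that do not also occur in page_b.

    Assumption: each page is already split into a list of text blocks, and
    blocks shared by two isomorphic pages are treated as noise.
    """
    noise = set(page_b)
    return [block for block in page_a if block not in noise]


if __name__ == "__main__":
    links = [
        "http://example.com/news/1.html",
        "http://example.com/news/2.html",
        "http://example.com/news/3.html",
        "http://example.com/about.html",
    ]
    print(candidate_directories(links, threshold=2))
    print(main_information(
        ["Site header", "Article one text", "Footer"],
        ["Site header", "Article two text", "Footer"],
    ))
```

Treating shared blocks as noise only works when the two pages really are isomorphic, which is why the directory-page step comes first in the abstract's description.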
Authors: Qiu Yong (邱勇), Lan Yongjie (兰永杰)
Source: Journal of Southeast University (English Edition) (东南大学学报(英文版)), 2006, No. 3, pp. 423-425 (3 pages). Indexed in EI and CAS.
Keywords: data mining; data record; website; isomorphic page