Algorithms of mining data records from website automatically (paper in English)

Abstract: To improve the accuracy and completeness of data records mined from the web, this paper introduces the concepts of isomorphic page and directory page and proposes three algorithms. Web pages are isomorphic when they share the same structure and differ only in their main information. A directory page is a web page that contains many links pointing to isomorphic pages. Algorithm 1 finds directory pages through a similarity analysis of adjacent links: it first sorts the links and counts the links that fall under each directory; if a count exceeds a given threshold, it compares the linked sub-pages in that directory for similarity and reports the result. A function for judging the similarity of web pages is also given. Algorithm 2 applies a noise information filter to isomorphic pages to extract the main information and obtain the data records; it relies on the observation that two isomorphic pages carry the same noise information and differ only in their main information. Algorithm 3 uses spider (crawler) techniques to mine data records automatically from an entire website. Experiments show that the proposed algorithms mine more complete data records than existing algorithms. Mining data records from isomorphic pages is an effective method.
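The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity function or the filter, so this sketch assumes a tag-sequence ratio from `difflib` as the similarity measure and a plain set difference over text fragments as the noise filter. All function names, the example pages, and the `example.com` URLs are hypothetical.

```python
import posixpath
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import urlparse


class _PageScanner(HTMLParser):
    """Collects the tag sequence and the visible text fragments of one page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def scan(html):
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.tags, scanner.texts


def candidate_directories(links, threshold):
    """Sort the links, count them per directory, keep directories above the threshold."""
    counts = Counter(posixpath.dirname(urlparse(u).path) for u in sorted(links))
    return [d for d, n in counts.items() if n > threshold]


def structure_similarity(html_a, html_b):
    """Assumed similarity measure: ratio in [0, 1] over the two tag sequences."""
    tags_a, _ = scan(html_a)
    tags_b, _ = scan(html_b)
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def main_information(html_a, html_b):
    """Treat text shared by both pages as noise; keep what is unique to page A."""
    _, texts_a = scan(html_a)
    _, texts_b = scan(html_b)
    noise = set(texts_b)
    return [t for t in texts_a if t not in noise]


# Hypothetical example data: two product pages built from the same template.
PAGE_A = ("<html><body><div>Site Menu</div><h1>Red Widget</h1>"
          "<p>Price: 10</p><div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div>Site Menu</div><h1>Blue Widget</h1>"
          "<p>Price: 12</p><div>Copyright</div></body></html>")
LINKS = [
    "http://example.com/items/2.html",
    "http://example.com/about.html",
    "http://example.com/items/1.html",
    "http://example.com/items/3.html",
]
```

With this data, `candidate_directories(LINKS, 2)` keeps only `/items`, `structure_similarity(PAGE_A, PAGE_B)` is 1.0 because the tag sequences match, and `main_information(PAGE_A, PAGE_B)` returns the title and price of page A while dropping the shared menu and footer text. A crawler in the spirit of Algorithm 3 would apply these steps to every page reachable from the site root.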

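Authors: 邱勇 (Qiu Yong), 兰永杰 (Lan Yongjie)

Affiliation: School of Information and Electronic Engineering, Shandong Institute of Business and Technology (山东工商学院信息与电子工程学院)

Source: Journal of Southeast University (English Edition) (东南大学学报（英文版）), 2006, No. 3, pp. 423-425 (3 pages). Indexed in EI and CAS.

Keywords: data mining; data record; website; isomorphic page

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]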

