Deep Web Data Source Clustering Using Frequent Itemsets (频繁项集在Deep Web数据源聚类中的应用)

Abstract: A vast number of data sources are hidden behind Deep Web pages and can be accessed through structured query interfaces. Organizing these data sources by domain is a key step in Deep Web data integration. Existing approaches rely mainly on query interface schemas or on the results returned by submitted queries; interface features are difficult to extract completely, and submitting queries to the underlying databases is inefficient. This paper proposes a frequent-itemset-based clustering method that incorporates Web page text: data sources are clustered by domain using the title, keywords, and label text of the page that hosts each query interface. The method avoids the dependence on query interface features found in traditional approaches and mitigates the high dimensionality of text models. Experimental results show that the method is feasible and efficient.
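The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each page has already been reduced to a set of terms taken from its title, keywords, and label text. It mines frequent term sets with a simple Apriori-style search and assigns each page to the largest frequent term set it contains. The names `frequent_itemsets`, `cluster_pages`, `PAGES`, and the `min_count` threshold are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori-style) search; returns {frozenset of terms: support count}."""
    counts = defaultdict(int)
    for terms in transactions:
        for term in terms:
            counts[frozenset([term])] += 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    found = dict(current)
    size = 2
    while current:
        # Join frequent (size-1)-itemsets into candidate itemsets of the next size.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        counts = {c: sum(1 for terms in transactions if c <= terms) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        found.update(current)
        size += 1
    return found


def cluster_pages(pages, min_count):
    """Assign each page to the largest frequent term set it contains (assumed rule)."""
    term_sets = {name: set(terms) for name, terms in pages.items()}
    itemsets = frequent_itemsets(list(term_sets.values()), min_count)
    clusters = defaultdict(list)
    for name, terms in term_sets.items():
        covering = [s for s in itemsets if s <= terms]
        if not covering:
            # Pages that contain no frequent term set go to an "unlabelled" cluster.
            clusters[frozenset()].append(name)
            continue
        # Prefer longer itemsets, then higher support, then alphabetical order.
        best = max(covering, key=lambda s: (len(s), itemsets[s], sorted(s)))
        clusters[best].append(name)
    return dict(clusters)


# Hypothetical toy input: terms extracted from the page hosting each query interface.
PAGES = {
    "books-a": ["book", "author", "title", "isbn"],
    "books-b": ["book", "author", "publisher"],
    "cars-a": ["car", "make", "model", "price"],
    "cars-b": ["car", "make", "year"],
    "misc": ["weather"],
}

if __name__ == "__main__":
    for label, members in cluster_pages(PAGES, min_count=2).items():
        print(sorted(label), members)
```

Working with frequent term sets rather than full term vectors is one way to keep the feature space small, which is the high-dimensionality concern the abstract raises; the assignment rule used here is an assumption and a real system would likely score candidate clusters more carefully.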

Authors: Zhang Pengfei (张蓬飞), Zhu Qunxiong (朱群雄)

Affiliation: College of Information Science and Technology, Beijing University of Chemical Technology

Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2012, No. 14, pp. 152-157 (6 pages)

Keywords: Deep Web; data source clustering; text clustering; frequent itemsets; data integration

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

References (10)

[1] Shestakov D, Salakoski T. Host-IP clustering technique for deep web characterization[C]//Proceedings of the 2010 ACM Symposium on Applied Computing, 2010: 874-875.
[2] Li Yingiun, Nie Tiezheng. Domain-oriented Deep Web data sources' discovery and identification[C]//APWEB, 2010: 464-467.
[3] He B, Tao T, Chang K C C. Organizing structured Web sources by query schemas: a clustering approach[C]//Gravano L. Proc of ACM the 13th Conference on Information and Knowledge Management, 2004.
[4] Peng Qian, Meng Weiyi, He Hai, et al. WISE-cluster: clustering e-commerce search engines automatically[C]//6th ACM International Workshop on Web Information and Data Management, 2004.
[5] Gravano L, Ipeirotis P, Sahami M. QProber: a system for automatic classification of hidden-web databases[J]. ACM Transactions on Information Systems, 2003, 21(1): 1-41.
[6] Ma Jun, Song Ling, Han Xiaohui, Yan Po. Deep Web database classification based on Web page context[J]. Journal of Software, 2008, 19(2): 267-274.
[7] Chang J H, Lee W S. Finding frequent itemsets over online data streams[J]. Information and Software Technology, 2006, 48: 606-618.
[8] Barbosa L, Freire J. Combining classifiers to identify online databases[C]//Proc of the 16th International Conference on World Wide Web, 2007.
[9] Fung B C M, Wang Ke, Ester M. Hierarchical document clustering using frequent itemsets[C]//Proceedings of SDM, 2003.
[10] Salton G, Buckley C. Term-weighting approaches in automatic text retrieval[J]. Information Processing and Management, 1988, 24(5): 513-523.
