Deep Web data source clustering using frequent itemsets
Abstract: Behind Deep Web pages lies a massive number of data sources that can be accessed through structured query interfaces. Organizing these data sources by domain is a key step in Deep Web data integration. Existing approaches rely mainly on query interface schemas or on the results returned by submitted queries; they suffer from the difficulty of fully extracting query interface features and from the low efficiency of submitting queries to the underlying databases. This paper proposes a clustering method based on frequent itemsets that draws on Web page text: data sources are clustered by domain using the title, keywords, and prompt (label) text of the page that hosts each query interface. The method effectively addresses the reliance of traditional approaches on query interface features and the high dimensionality of text models. Experimental results show that the method is feasible and efficient.
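The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea, under stated assumptions: each data source is reduced to a set of terms taken from its page title, keywords, and label text; frequent itemsets are mined with a simple Apriori-style pass; and each source is assigned to the largest frequent itemset it contains. The function names (`frequent_itemsets`, `cluster_sources`), the assignment rule, the thresholds, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets of up to max_size terms whose
    support (fraction of transactions containing them) is >= min_support."""
    min_count = min_support * len(transactions)

    # Level 1: count single terms.
    counts = defaultdict(int)
    for terms in transactions:
        for term in terms:
            counts[frozenset([term])] += 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        # Only terms that survive in the previous level can form candidates.
        items = sorted({term for s in current for term in s})
        counts = defaultdict(int)
        for terms in transactions:
            present = [term for term in items if term in terms]
            for combo in combinations(present, size):
                # Apriori pruning: every (size-1)-subset must be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(combo, size - 1)):
                    counts[frozenset(combo)] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
    return frequent


def cluster_sources(sources, min_support=0.3, max_size=2):
    """Assign each source to the largest frequent itemset it contains
    (an assumed, simplified assignment rule). Sources that contain no
    frequent itemset are grouped under the empty frozenset."""
    transactions = [set(terms) for terms in sources.values()]
    itemsets = frequent_itemsets(transactions, min_support, max_size)

    clusters = defaultdict(list)
    for name, terms in sources.items():
        terms = set(terms)
        covering = [s for s in itemsets if s <= terms]
        if not covering:
            clusters[frozenset()].append(name)
            continue
        # Prefer larger itemsets, then higher support, then a stable order.
        best = max(covering, key=lambda s: (len(s), itemsets[s], sorted(s)))
        clusters[best].append(name)
    return dict(clusters)


# Hypothetical toy data: terms extracted from title, keywords, and labels.
example_sources = {
    "src_books_1": {"book", "author", "title", "isbn"},
    "src_books_2": {"book", "author", "publisher"},
    "src_air_1": {"flight", "departure", "airline"},
    "src_air_2": {"flight", "departure", "return"},
}

example_clusters = cluster_sources(example_sources, min_support=0.5)
```

Working on frequent itemsets rather than the full term vocabulary is what keeps the dimensionality low in this sketch: only terms and term combinations shared by enough sources are ever considered as cluster labels.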
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2012, No. 14, pp. 152-157 (6 pages)
Keywords: Deep Web; data source clustering; text clustering; frequent itemsets; data integration