Hidden Web Domain Clustering Based on Independent Component Analysis (cited by: 1)


Abstract: To identify the topic domains of hidden Web databases automatically, this paper proposes a clustering algorithm based on Independent Component Analysis (ICA). Text is extracted from the query interface pages, which are treated as ordinary Web pages, and then preprocessed. Terms are weighted with the TF-IDF formula, and the N highest-weighted feature terms are selected to construct the document matrix (a vector space model). Latent Semantic Indexing (LSI), implemented as a singular value decomposition, reconstructs the features, and an ICA decomposition of the result yields the cluster information. The main idea is to use the term co-occurrence analysis and noise-reduction abilities of LSI to improve clustering accuracy. Experiments show that the average clustering accuracy exceeds 90%.
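The steps named in the abstract (TF-IDF weighting, top-N term selection, LSI by singular value decomposition, ICA decomposition) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, whitespace tokenization, the `tf * log(N / df)` weighting variant, ranking terms by total weight, skipping mean-centering, the tanh-based symmetric FastICA, and assigning each page to the component with the largest absolute activation are all choices made here for illustration. The abstract does not specify any of them.

```python
import math
from collections import Counter

import numpy as np


def tfidf_top_n(docs, top_n):
    """Build a terms-by-documents TF-IDF matrix, keeping the top_n heaviest terms."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    row = {term: i for i, term in enumerate(vocab)}
    X = np.zeros((len(vocab), n_docs))
    for j, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            X[row[term], j] = count * math.log(n_docs / df[term])
    # Assumption: "top N" means the N terms with the largest total weight.
    keep = np.argsort(-X.sum(axis=1), kind="stable")[:top_n]
    return [vocab[i] for i in keep], X[keep]


def lsi_documents(X, k):
    """Rank-k LSI: return the k leading right singular vectors (k x n_docs)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k]


def fast_ica(Z, n_iter=200, tol=1e-8, seed=0):
    """Symmetric FastICA with a tanh nonlinearity on whitened k x n data Z."""
    k, n = Z.shape
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((k, k)))  # random orthogonal start
    for _ in range(n_iter):
        Y = W @ Z
        g = np.tanh(Y)
        g_prime = 1.0 - g ** 2
        W_new = (g @ Z.T) / n - g_prime.mean(axis=1)[:, None] * W
        # Symmetric decorrelation W <- (W W^T)^(-1/2) W, computed via SVD.
        u, _, vt = np.linalg.svd(W_new)
        W_new = u @ vt
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ Z


def cluster_pages(docs, n_clusters, top_n=50):
    """Hypothetical end-to-end helper: TF-IDF -> top-N terms -> LSI -> ICA -> labels."""
    terms, X = tfidf_top_n(docs, top_n)
    V = lsi_documents(X, n_clusters)
    # Rows of V are orthonormal, so scaling by sqrt(n_docs) whitens the data
    # (no mean-centering is applied; this is a simplification).
    Z = np.sqrt(X.shape[1]) * V
    S = fast_ica(Z)
    labels = np.argmax(np.abs(S), axis=0)  # assumption: strongest component wins
    return labels, S, terms


PAGES = [
    "flight airline ticket departure",
    "airline flight fare departure",
    "flight ticket fare airline",
    "book author title isbn",
    "author book publisher title",
    "book title isbn publisher",
]
labels, sources, terms = cluster_pages(PAGES, n_clusters=2, top_n=8)
print(labels)
```

Setting the number of independent components equal to the number of clusters keeps the label assignment simple. On a corpus as small as `PAGES` the labels are not guaranteed to match the two topics, so the printed result should be read as a demonstration of the data flow rather than of clustering quality.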

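Authors: Wang Xiaobin, Wen Chun, Shi Zhaoxiang

Affiliation: Department of Network Engineering, Electronic Engineering Institute

Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the PKU Core list (北大核心); 2009, Issue 7, pp. 175-176, 179 (3 pages)

Keywords: hidden Web; latent semantic; Independent Component Analysis (ICA); text clustering

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]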



