A Web Site Representation and Mining Algorithm Using the Multiscale Tree Model (cited by: 5)

Abstract: With the rapid growth in both the volume and the diversity of information on the Web, web site mining is important for automatically discovering and classifying topic-specific web resources. However, existing web site classification and mining algorithms have not adequately addressed how to exploit contextual semantic information or how to remove noise effectively in order to improve classification accuracy. This paper examines three issues that must be resolved to design an effective and efficient web site mining algorithm: the sampling size, the analysis granularity, and the representation structure of web sites. On this basis, it proposes a novel multiscale tree representation model of web sites and presents a multiscale web site mining approach that comprises a two-phase classification algorithm based on the hidden Markov tree (HMT), a context-based interscale fusion algorithm, a two-stage text-based denoising procedure, and an entropy-based dynamic pruning strategy. The proposed model and algorithms lay a foundation for further research on related problems such as query optimization across multiple sites and web usage mining. Experiments show that, relative to the baseline system, the approach improves classification accuracy by 16% on average and reduces processing time by 34.5%.
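The abstract names an entropy-based pruning strategy but does not state its criterion. As a purely illustrative sketch, and not the authors' method, the Python below assumes that each node of a site tree carries a class posterior and that a subtree need not be explored once the Shannon entropy of that posterior drops below a threshold. The names `SiteNode`, `prune`, `demo_tree`, and the threshold value are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteNode:
    """Hypothetical site-tree node (for example a page or a page block)."""
    name: str
    posterior: List[float]  # assumed class posterior at this node
    children: List["SiteNode"] = field(default_factory=list)


def entropy(p: List[float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def prune(node: SiteNode, threshold: float) -> SiteNode:
    """Return a copy of the tree that drops the descendants of any node
    whose posterior entropy is already below the threshold (assumed rule)."""
    if entropy(node.posterior) < threshold:
        return SiteNode(node.name, node.posterior, [])
    kept = [prune(child, threshold) for child in node.children]
    return SiteNode(node.name, node.posterior, kept)


def count(node: SiteNode) -> int:
    """Number of nodes in the tree rooted at node."""
    return 1 + sum(count(child) for child in node.children)


def demo_tree() -> SiteNode:
    """Small made-up tree with one confident and one uncertain branch."""
    confident = SiteNode("a", [0.99, 0.01], [
        SiteNode("a1", [0.98, 0.02]),
        SiteNode("a2", [0.97, 0.03]),
    ])
    uncertain = SiteNode("b", [0.6, 0.4], [SiteNode("b1", [0.9, 0.1])])
    return SiteNode("root", [0.5, 0.5], [confident, uncertain])


if __name__ == "__main__":
    tree = demo_tree()
    pruned = prune(tree, threshold=0.5)
    print(count(tree), count(pruned))
```

In this made-up tree the confident branch `a` is cut below its root while the uncertain branch `b` is still expanded, which is one plausible way such a rule could reduce processing time.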

Authors: Tian Yonghong, Huang Tiejun, Gao Wen

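Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of the Chinese Academy of Sciences; Department of Computer Science and Engineering, Harbin Institute of Technology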

Source: Journal of Software (软件学报; indexed in EI, CSCD, and the Peking University Core Journals list), 2004, No. 9, pp. 1393-1404 (12 pages)

Funding: Knowledge Innovation Program of the Chinese Academy of Sciences

Keywords: algorithm; Web site mining; multiscale site tree; context model; hidden Markov tree (HMT); multiscale classification; entropy-based pruning

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]


