
Novel Weighted Suffix Tree Clustering for Web Documents (cited by: 2)
Abstract: A novel Weighted Suffix Tree Clustering (WSTC) method is proposed for clustering Web documents, based on their structure and characteristics. First, according to its HTML tags, each Web document is divided into segments with different levels of importance; each segment is split into sentences, and each sentence into words. Second, the suffix tree is built from sentences rather than from whole documents, and the importance level of each sentence is stored in the suffix tree nodes as a structure weight, which yields a weighted suffix tree model of the document set. Finally, the document count, sentence count, phrase length, and structure weight of each node are used together when selecting and merging base clusters. Simulation experiments indicate that WSTC clusters Web documents more effectively than the original STC algorithm.
Source: Journal of System Simulation (《系统仿真学报》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2011, No. 3, pp. 474-479 (6 pages)
Funding: National Key Technology R&D Program of China (2007BAH08B04); Chongqing Science and Technology Support Program (2008AC20084)
Keywords: suffix tree; suffix tree clustering; Web document clustering; Web document structure; weight computing
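
The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the phrase-length cap, and the scoring formula in `base_cluster_score` are all assumptions made for illustration, since the abstract gives no concrete values or formulas. A naive phrase index (one entry per word phrase) stands in for a real suffix tree, and the base-cluster merging step is omitted.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical structure weights per HTML tag; the abstract gives no values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class SegmentParser(HTMLParser):
    """Collects (weight, text) segments, weighting text by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.segments = []   # list of (weight, text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            weights = [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack]
            self.segments.append((max(weights, default=DEFAULT_WEIGHT), text))


def weighted_sentences(html):
    """Step 1: segments by tag, then sentences, then lowercase words."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    result = []
    for weight, text in parser.segments:
        for sentence in re.split(r"[.!?。！？]+", text):
            words = re.findall(r"\w+", sentence.lower())
            if words:
                result.append((weight, words))
    return result


def build_phrase_index(docs, max_len=4):
    """Step 2 (naive stand-in for a weighted suffix tree): per-phrase statistics."""
    index = defaultdict(lambda: {"docs": set(), "sentences": 0, "weight_sum": 0.0})
    for doc_id, html in enumerate(docs):
        for weight, words in weighted_sentences(html):
            phrases = set()
            for i in range(len(words)):
                for j in range(i + 1, min(i + max_len, len(words)) + 1):
                    phrases.add(tuple(words[i:j]))
            for phrase in phrases:  # each sentence counts once per phrase
                node = index[phrase]
                node["docs"].add(doc_id)
                node["sentences"] += 1
                node["weight_sum"] += weight
    return index


def base_cluster_score(phrase, node):
    """Step 3: hypothetical score from documents, sentences, length, and weight."""
    n_docs = len(node["docs"])
    if n_docs < 2:
        return 0.0  # a phrase found in one document is not a useful base cluster
    mean_weight = node["weight_sum"] / node["sentences"]
    return n_docs * min(len(phrase), 6) * mean_weight


docs = [
    "<html><h1>Suffix tree clustering</h1><p>We cluster web pages. Results look good.</p></html>",
    "<html><h2>Suffix tree clustering</h2><p>A suffix tree indexes phrases.</p></html>",
]
index = build_phrase_index(docs)
ranked = sorted(index.items(), key=lambda kv: base_cluster_score(*kv), reverse=True)
best_phrase, best_node = ranked[0]
print(best_phrase, base_cluster_score(best_phrase, best_node))
```

In this toy run the heading phrase shared by both documents ranks first, because it is long, appears in every document, and sits in heavily weighted tags. A production version would replace the phrase index with a generalized suffix tree (for example, one built with Ukkonen's algorithm) so that construction stays linear in the input size.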
