Novel Weighted Suffix Tree Clustering for Web Documents (cited by: 2)


Abstract: To exploit the structure and characteristics of Web documents, a novel weighted suffix tree clustering method, WSTC, is proposed. First, based on its HTML tags, each Web document is divided into segments with different levels of importance; each segment is split into sentences, and each sentence into words. Second, the suffix tree is built from sentences rather than whole documents, and each segment's importance level is stored in the suffix tree nodes as a structure weight, yielding a weighted suffix tree model of the document set. Finally, the document count, sentence count, phrase length, and structure weight of each node are used together when selecting and merging base clusters. Simulation experiments indicate that WSTC clusters Web documents more effectively than the traditional STC algorithm.
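The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the scoring formula in `base_clusters`, and the 0.5 mutual-overlap threshold in `merge_clusters` are all assumptions made for illustration, since the abstract does not give the actual values or formulas. For brevity the sketch also builds a naive, uncompressed word-level suffix trie instead of a true (compressed) suffix tree.

```python
import re
from html.parser import HTMLParser

# Hypothetical structure weights; the source does not list the actual levels.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


# Collects (weight, text) segments, weighted by the innermost known tag.
class SegmentParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.segments = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        while tag in self.stack and self.stack.pop() != tag:
            pass  # pop up to and including the matching open tag

    def handle_data(self, data):
        if data.strip():
            known = [TAG_WEIGHTS[t] for t in self.stack if t in TAG_WEIGHTS]
            self.segments.append((known[-1] if known else DEFAULT_WEIGHT, data))


def weighted_sentences(html):
    """Split one HTML document into (structure_weight, words) sentences."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    result = []
    for weight, text in parser.segments:
        for sentence in re.split(r"[.!?]+", text):
            words = re.findall(r"\w+", sentence.lower())
            if words:
                result.append((weight, words))
    return result


class Node:
    def __init__(self):
        self.children = {}    # word -> Node
        self.docs = set()     # ids of documents containing the phrase
        self.sentences = 0    # number of sentences containing the phrase
        self.weight = 0.0     # sum of structure weights over those sentences


def build_trie(docs):
    """Insert every suffix of every sentence into an uncompressed word trie."""
    root = Node()
    for doc_id, html in enumerate(docs):
        for weight, words in weighted_sentences(html):
            seen = set()  # nodes already credited for this sentence
            for start in range(len(words)):
                node = root
                for word in words[start:]:
                    node = node.children.setdefault(word, Node())
                    node.docs.add(doc_id)
                    if node not in seen:
                        seen.add(node)
                        node.sentences += 1
                        node.weight += weight
    return root


def base_clusters(root, min_docs=2):
    """Score each phrase shared by at least min_docs documents."""
    clusters, stack = [], [((), root)]
    while stack:
        phrase, node = stack.pop()
        stack.extend((phrase + (w,), c) for w, c in node.children.items())
        if phrase and len(node.docs) >= min_docs:
            mean_weight = node.weight / node.sentences
            # Assumed score: documents x capped phrase length x mean weight.
            score = len(node.docs) * min(len(phrase), 6) * mean_weight
            clusters.append((score, phrase, frozenset(node.docs)))
    return sorted(clusters, key=lambda c: c[0], reverse=True)


def merge_clusters(clusters, threshold=0.5):
    """Union base clusters whose document sets mutually overlap > threshold."""
    parent = list(range(len(clusters)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, (_, _, a) in enumerate(clusters):
        for j, (_, _, b) in enumerate(clusters[i + 1:], i + 1):
            common = len(a & b)
            if common > threshold * len(a) and common > threshold * len(b):
                parent[find(i)] = find(j)
    merged = {}
    for i, (_, _, docs) in enumerate(clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())
```

Given a list of HTML strings `docs`, calling `merge_clusters(base_clusters(build_trie(docs)))` returns the merged clusters as sets of document indices. Because each node keeps the summed structure weight of the sentences that contain its phrase, a phrase that appears in titles or headings outranks an equally frequent phrase found only in body text, which is the intuition the abstract describes.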

Authors: Yang Ruilong (杨瑞龙), Zhu Qingsheng (朱庆生), Xie Hongtao (谢洪涛), Qu Hongchun (屈洪春)

Affiliation: College of Computer Science, Chongqing University

Source: Journal of System Simulation (《系统仿真学报》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2011, No. 3, pp. 474-479 (6 pages)

Funding: National Key Technology R&D Program of China (2007BAH08B04); Chongqing Key Technology R&D Program (2008AC20084)

Keywords: suffix tree; suffix tree clustering; web document clustering; web document structure; weight computing

Classification: TP397.2 [Automation and Computer Technology: Computer Application Technology]






