
Search result clustering algorithm based on frequent itemsets meaning sequence

Cited by: 3
Abstract: Most existing search result clustering algorithms cluster the page snippets that a search engine generates for a user's query. Because these snippets are short and of uneven quality, good clustering is hard to guarantee. This paper proposes a search result clustering algorithm based on frequent word-meaning sequences, which uses WordNet to combine syntactic and semantic features when building clusters and cluster labels for search results. Unlike traditional clustering algorithms based on the vector space model, which treat documents as bags of words, the algorithm takes into account the sequential patterns of words in a document. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The algorithm first preprocesses the text to generate compact documents, which reduces the dimensionality of the text data. It then builds a generalized suffix tree, mines the maximal frequent itemsets, and derives the frequent word-meaning sequences. The ordered frequent itemsets obtained from the documents reflect document topics better, so search results on the same topic are clustered together and those most relevant to the user's query are ranked first. Experiments show that the algorithm produces high-quality, query-relevant clusters with semantics-based cluster labels, achieves higher clustering accuracy and running efficiency, and scales well.
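The pipeline in the abstract (compact documents, frequent sequences, maximal sequences as cluster labels) can be illustrated with a minimal sketch. It is not the paper's implementation: it counts contiguous word sequences with a plain dictionary instead of a generalized suffix tree, it skips the WordNet mapping from words to meanings, and it treats "frequent" as support greater than or equal to the threshold. All function names, the `max_len` cap, and the sample snippets are hypothetical.

```python
from collections import Counter


def frequent_words(docs, min_support):
    """Words that occur in at least min_support (a fraction) of the documents."""
    threshold = min_support * len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word for word, count in doc_freq.items() if count >= threshold}


def compact_documents(docs, min_support):
    """Drop infrequent words while preserving the original word order."""
    keep = frequent_words(docs, min_support)
    return [[word for word in doc if word in keep] for doc in docs]


def frequent_sequences(compact_docs, min_support, max_len=4):
    """Contiguous word sequences whose document support meets the threshold."""
    threshold = min_support * len(compact_docs)
    support = Counter()
    for doc in compact_docs:
        seen = set()
        for start in range(len(doc)):
            for end in range(start + 1, min(start + max_len, len(doc)) + 1):
                seen.add(tuple(doc[start:end]))
        support.update(seen)  # each document counts at most once per sequence
    return {seq for seq, count in support.items() if count >= threshold}


def contains(long_seq, short_seq):
    """True if short_seq appears as a contiguous run inside long_seq."""
    n = len(short_seq)
    return any(tuple(long_seq[i:i + n]) == short_seq
               for i in range(len(long_seq) - n + 1))


def maximal_sequences(sequences):
    """Keep only sequences that are not part of a longer frequent sequence."""
    return sorted(s for s in sequences
                  if not any(s != t and contains(t, s) for t in sequences))


def cluster(docs, min_support):
    """Group document indices under each maximal frequent sequence (the label)."""
    compact_docs = compact_documents(docs, min_support)
    labels = maximal_sequences(frequent_sequences(compact_docs, min_support))
    return {label: [i for i, doc in enumerate(compact_docs) if contains(doc, label)]
            for label in labels}


# Hypothetical tokenized snippets returned for the query "jaguar".
snippets = [
    "jaguar car price review".split(),
    "new jaguar car dealer".split(),
    "jaguar cat habitat".split(),
    "jaguar cat rainforest habitat".split(),
]
clusters = cluster(snippets, min_support=0.5)
```

With a support threshold of 0.5, the sample snippets fall into two clusters labeled `('jaguar', 'car')` and `('jaguar', 'cat', 'habitat')`. The fourth snippet joins the second cluster only because the infrequent word "rainforest" is removed when the compact document is built, which shows why compaction matters for order-sensitive matching.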
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 1, pp. 13-20 (8 pages)
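Funding: Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06030400); Major Special Project of the Xinjiang Uygur Autonomous Region under the 12th Five-Year Plan (No. 201230118); West Light Program of the Chinese Academy of Sciences (No. YB201304)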
Keywords: clustering algorithm; frequent itemset; information retrieval; WordNet
