An N-gram-Based Cluster Label Extraction Algorithm for English Academic Papers (cited by: 3)

N-gram Based on Cluster Label Extracting Algorithm for English Paper


Abstract: This paper proposes an N-gram-based algorithm for extracting cluster labels from English academic papers. In a prior-learning step, the algorithm applies N-grams to a large-scale corpus to generate a list of domain phrases; it then clusters the papers with the K-means algorithm. N-gram terms are extracted from each cluster and scored by TFIDF. A term that also appears in the domain phrase list receives a higher weight (double, per the English abstract), and the highest-scoring term is taken as the cluster label. Experimental results show that the algorithm improves the quality of the cluster labels. The paper also proposes an improved TFIDF weight calculation for label extraction and a new method, R@N, for evaluating label quality.
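The labeling step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the improved TFIDF formula, so a standard smoothed TF-IDF is swapped in; the clusters are assumed to come from an earlier K-means run; the domain phrase list is assumed to come from the prior-learning step; and the names `ngrams`, `label_clusters`, `boost`, and `n_max` are hypothetical.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=3):
    """Return all word n-grams (length 1..n_max) of a lowercased text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def label_clusters(clusters, phrase_list, boost=2.0, n_max=3):
    """Pick one label per cluster: the n-gram with the highest boosted score.

    clusters    -- list of clusters, each a list of document strings
                   (assumed output of a prior K-means step)
    phrase_list -- set of domain phrases (assumed output of prior learning)
    boost       -- weight multiplier for terms found in phrase_list
                   (2.0 mirrors the "double weight" in the abstract)
    """
    # Each cluster is treated as one pseudo-document when counting terms.
    counts = [Counter(g for doc in docs for g in ngrams(doc, n_max))
              for docs in clusters]
    k = len(clusters)
    # Number of clusters in which each term occurs.
    df = Counter(term for c in counts for term in c)

    labels = []
    for c in counts:
        total = sum(c.values())
        scores = {}
        for term, freq in c.items():
            tf = freq / total
            # Standard smoothed IDF, NOT the paper's improved variant.
            idf = math.log((1 + k) / (1 + df[term])) + 1
            score = tf * idf
            if term in phrase_list:
                score *= boost
            scores[term] = score
        # Ties are broken by the term string so the result is deterministic.
        best = max(scores, key=lambda t: (scores[t], t)) if scores else None
        labels.append(best)
    return labels


example_clusters = [
    ["support vector machine for text classification",
     "support vector machine kernel methods"],
    ["suffix tree clustering of web search results",
     "web search results clustering with suffix tree"],
]
example_phrases = {"support vector machine", "suffix tree"}
print(label_clusters(example_clusters, example_phrases))
```

In this toy input several n-grams occur equally often within a cluster, so the boost for phrase-list membership is what decides the label; without the phrase list the choice falls back to the tie-breaking rule.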

Authors: Wu Suhui (吴夙慧), Cheng Ying (成颖), Zheng Yanning (郑彦宁), Pan Yuntao (潘云涛)

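Affiliations: Department of Information Management, Nanjing University; Institute of Scientific and Technical Information of China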

Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2011, No. 7, pp. 68-75 (8 pages)

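Funding: National Social Science Fund of China project "Research on Relevance Integration in Chinese Academic Information Retrieval Systems" (No. 10CTQ027); Ministry of Education Humanities and Social Sciences Research Planning Fund project "User-Oriented Relevance Criteria and Their Application" (No. 07JA870006); one of the research outcomes of a cooperative research project of the Institute of Scientific and Technical Information of China.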

Keywords: cluster label; N-gram; academic paper clustering

Classification number: G353 [Cultural Science: Information Science]


