Word Clustering Based on Similarity and Vari-gram Language Model (基于相似度的词聚类算法和可变长语言模型)

Cited by: 7


Abstract: Class-based statistical language models are an important way to address the data sparsity problem of statistical language models. Conventional statistical clustering methods follow a greedy principle and usually take the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong sensitivity to the initial assignment, and a tendency to converge to a local optimum, so a global optimum is not guaranteed. This paper uses mutual information to define a word similarity measure and, based on that similarity, proposes a bottom-up hierarchical clustering algorithm. Experiments show that the algorithm clearly improves on the conventional greedy statistical clustering algorithm in both computational complexity and clustering quality. To improve predictive power, a new method for building a class-based variable-length (vari-gram) language model is also proposed.
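The abstract does not give the paper's exact similarity formula or merge rule, so the following is only a minimal sketch of the general idea: a word similarity derived from mutual information, followed by bottom-up hierarchical clustering. Everything specific here is an assumption made for illustration: each word is described by its positive pointwise mutual information with its right-hand neighbours, similarity is the cosine of those vectors, and clusters are merged by average linkage. The function names `pmi_vectors`, `cosine`, and `cluster` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(sentences):
    """Map each word to a sparse vector of positive PMI with its right neighbours.

    Assumption for illustration: the paper's own similarity definition may differ.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    vectors = {w: {} for w in unigrams}
    for (w, c), count in bigrams.items():
        p_joint = count / n_bi
        p_indep = (unigrams[w] / n_uni) * (unigrams[c] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[w][c] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, n_clusters):
    """Bottom-up clustering: merge the two most similar clusters until n_clusters remain."""
    clusters = [[w] for w in sorted(vectors)]

    def sim(a, b):
        # Average linkage: mean pairwise similarity between members.
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: sim(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


# Toy corpus: "cat"/"dog" and "a"/"the" occur in identical contexts.
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["a", "cat", "runs"],
    ["a", "dog", "runs"],
]
print(cluster(pmi_vectors(corpus), 4))
```

On this toy corpus the sketch groups "cat" with "dog" and "a" with "the", because each pair shares exactly the same right-hand contexts. The exhaustive pairwise search is quadratic in the number of clusters at every merge, which is acceptable for a demonstration but not for a realistic vocabulary.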

Author: YUAN Lichi (袁里驰)

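Affiliations: School of Information, Jiangxi University of Finance and Economics; Jiangxi Key Laboratory of Data and Knowledge Engineering; School of Information Science and Engineering, Central South University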

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2009, No. 5, pp. 912-915 (4 pages)

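Funding: Supported by the National Natural Science Foundation of China (60763001, 60663007) and the Central South University Postdoctoral Science Foundation (2007)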

Keywords: mutual information; word similarity; clustering algorithm; vari-gram language model

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]





