
Chinese text chunking based on improved K-means clustering
Cited by: 4
Abstract: An improved K-means clustering method is proposed for Chinese text chunking that avoids data sparseness while taking into account both the relationship between adjacent parts of speech and the internal composition of each phrase type. The method treats each phrase as a cluster centered on its headword, which exploits the internal compositional regularities of each phrase type. The initial center of each class is set from corpus data, so that a supervised statistical method and an unsupervised clustering method are combined; this both improves clustering accuracy and avoids the data sparseness caused by the small size of the Chinese chunk bank. Tested on the Chinese Penn Treebank, the method achieves an F-score of 92.94% on seven types of Chinese chunks, which indicates that it is effective for Chinese text chunking.
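The abstract's central idea, setting the initial center of each class from labeled corpus data and then clustering, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the feature representation, the distance measure, or the update rule, so the 2-D feature vectors, the squared Euclidean distance, the function name `seeded_kmeans`, and the labels `NP`/`VP` below are all illustrative assumptions.

```python
from collections import defaultdict


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed metric, not from the paper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(labeled, unlabeled, max_iter=100):
    """K-means whose initial centers come from labeled examples.

    labeled:   list of (vector, label) pairs from an annotated corpus
    unlabeled: list of vectors to be clustered
    Returns (centers, assignment), where assignment[i] is the label of
    the center nearest to unlabeled[i].
    """
    # Supervised step: one initial center per class, the mean of its seeds.
    by_label = defaultdict(list)
    for vec, label in labeled:
        by_label[label].append(vec)
    centers = {label: mean(vecs) for label, vecs in by_label.items()}

    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest center.
        new_assignment = [
            min(centers, key=lambda c: sq_dist(v, centers[c]))
            for v in unlabeled
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment

        # Update step: recompute each center from its current members.
        # A cluster that received no members keeps its previous center.
        members = defaultdict(list)
        for v, c in zip(unlabeled, assignment):
            members[c].append(v)
        for c, vecs in members.items():
            centers[c] = mean(vecs)
    return centers, assignment


# Toy data: hypothetical 2-D features standing in for real chunk features.
seeds = [([0.0, 0.0], "NP"), ([0.0, 1.0], "NP"),
         ([5.0, 5.0], "VP"), ([6.0, 5.0], "VP")]
points = [[0.5, 0.5], [1.0, 0.0], [5.5, 4.5], [6.0, 6.0]]
centers, assignment = seeded_kmeans(seeds, points)
print(assignment)  # ['NP', 'NP', 'VP', 'VP']
```

Because the number of clusters and their starting positions are fixed by the labeled seeds, the result does not depend on random initialization. In this sketch the seeds only initialize the centers; whether they should also stay in the clusters during later updates is a design choice the abstract leaves open.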
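Source: Journal of Harbin Institute of Technology (《哈尔滨工业大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2007, No. 7, pp. 1106-1109 (4 pages)
Funding: National Natural Science Foundation of China (60302021); Ministry of Science and Technology Intergovernmental International Cooperation Project (CI-2003-03); Harbin Youth Science Foundation (2005AFQXJ020)
Keywords: K-means clustering; Chinese text chunking; data sparseness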
