Chinese Character Association Measurement Method and Its Application to Chinese Text Similarity Analysis (汉字关联性量化方法及其在文本相似性分析中的应用) Cited by: 1
Abstract: Text similarity analysis, clustering, and classification are mostly based on feature words. Because Chinese text has no delimiters between words, the groundwork of Chinese word segmentation and of handling a high-dimensional feature space inevitably incurs a high computational cost. This paper explores an approach to text similarity analysis that uses the relationships between Chinese characters instead of feature words. It first defines a method for quantifying the relationship between characters in a text and introduces the notion of Chinese Character Association Measurement (CCAM). It then constructs a CCAM matrix to represent a Chinese text and designs a similarity measurement algorithm for Chinese texts based on that matrix. Experimental results show that CCAM outperforms statistics such as two-character word (bigram) frequency, mutual information, and the t-test. Because it requires no word segmentation, the algorithm is suited to processing massive amounts of Chinese text.
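Source: 《计算机应用》 (Journal of Computer Applications), indexed in CSCD and the Peking University Core Journals list (北大核心), 2006, No. 6, pp. 1396-1397, 1400 (3 pages)
Funding: National Natural Science Foundation of China (Grant No. 60273075)
Keywords: Chinese Character Association Measurement (CCAM); information matrix; text similarity measurement algorithm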
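The abstract describes a pipeline (quantify character-to-character relationships, build a matrix per text, compare matrices) but does not give the CCAM formula. The sketch below illustrates only the general shape of such a segmentation-free pipeline. It assumes a stand-in association score (relative frequency of adjacent character pairs) and cosine similarity between the resulting sparse matrices; neither is taken from the paper, and the function names `association_matrix` and `matrix_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def is_cjk(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def association_matrix(text):
    """Sparse character-pair matrix for a text, stored as {(a, b): weight}.

    ASSUMPTION: the weight is the relative frequency of the adjacent
    pair (a, b). This is a placeholder, not the paper's CCAM definition.
    """
    pairs = Counter()
    for a, b in zip(text, text[1:]):
        # Only count pairs where both characters are Chinese characters.
        if is_cjk(a) and is_cjk(b):
            pairs[(a, b)] += 1
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


def matrix_similarity(m1, m2):
    """Cosine similarity between two sparse matrices (an assumed choice)."""
    dot = sum(w * m2.get(pair, 0.0) for pair, w in m1.items())
    norm1 = sqrt(sum(w * w for w in m1.values()))
    norm2 = sqrt(sum(w * w for w in m2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


doc_a = association_matrix("文本相似性分析")
doc_b = association_matrix("汉语自动分词")
score = matrix_similarity(doc_a, doc_b)
```

No word segmentation step appears anywhere in the sketch, which is the property the abstract emphasises; swapping in the paper's actual association measure would only change how the weights in `association_matrix` are computed.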