摘要
针对维吾尔语领域术语获取难度大,人工扩充领域术语工作量大、效率低等特点,利用词汇共现原理,以维吾尔语连接词和互信息(MI)为工具,快速扩充原始维吾尔语领域术语;建立了以维吾尔语领域术语为特征模板,利用条件随机场(CRF)模型实现Web文本中维吾尔语领域术语的自动发现方法,并在此基础上实现长维吾尔语领域术语的自动发现。实验表明,对短维吾尔语领域术语的自动发现准确率为97.59%,召回率为93.38%,对长维吾尔语领域术语的自动发现正确率达到55.72%。
Since the Uyghur domain term is difficult to achieve, the workload of artificial expansion of the domain term is tremendous, and the efficiency is low, this paper used the Conditional Random Field (CRF) to identify the Uyghur domain term from the Web texts, which expanded the domain term with the conjunction word and the Mutual Information (MI) between the words based on the co-occurrence of terms. The experiments on the collected Web texts show that, for the short Uyghur domain terms, the algorithm achieves the precision as high as 97.59% and the recall 93.38%, and for the long Uyghur domain terms achieves the precision 55.72%.
出处
《计算机应用》
CSCD
北大核心
2012年第2期407-410,共4页
journal of Computer Applications
基金
国家自然科学基金资助项目(60963017)
国家社科基金资助项目(10BTQ045
11XTQ007)
关键词
维吾尔语
互信息
条件随机场
TF/IDF
Uyghur
Mutual Information (MI)
Conditional Random Field (CRF)
Term Frequency/Inverse DocumentFrequency (TF/IDF)