
Study on the Similarity Algorithm of Tibetan Text (藏文文本相似度计算方法研究)

Cited by: 2
Abstract: As the volume of Tibetan literature grows, protecting original Tibetan works is becoming increasingly urgent, so an accurate and effective method for computing the similarity of Tibetan texts is important. Because of the special structure of Tibetan script, similarity methods designed for Chinese and English text cannot be applied to Tibetan directly, and this paper proposes a method tailored to Tibetan text. First, Tibetan characters are encoded in a fixed order according to the Tibetan Coded Character Set standard and associated with synonyms in a thesaurus. Second, the text under test and the reference text are vectorized. Third, keywords are extracted from each vectorized text and used to correct its vector. Finally, the cosine of the angle between the vector of the text under test and the vector of the reference text is computed and taken as the similarity of the two sentences. To assess the effectiveness of keyword extraction, the paper studies how well TF-IDF and TF-IWF recall keywords from corpora containing different proportions of topic words; the results show that TF-IWF reduces the influence of that proportion on the computed result. To assess the accuracy of the similarity scores, the paper uses the Pearson correlation coefficient. The TF-IWF-based method achieves a Pearson correlation coefficient of 0.7108, indicating that it is an effective method for computing the similarity of Tibetan texts.
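The last two steps of the abstract (keyword weighting and cosine similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their formulas, so the code assumes one common formulation of TF-IWF, in which a word's inverse word frequency is the logarithm of the total token count of the corpus divided by that word's count in the corpus. TF-IDF differs in using a ratio of document counts instead. The function names `tf_iwf_weights` and `cosine_similarity` are hypothetical, and the inputs are assumed to be already segmented into words; Tibetan encoding, synonym lookup, and segmentation are left out.

```python
import math
from collections import Counter


def tf_iwf_weights(doc_tokens, corpus_docs):
    """Weight each distinct token of doc_tokens by TF * IWF.

    Assumed formulation (not taken from the paper):
      TF(w)  = count of w in the document / document length
      IWF(w) = log(total tokens in corpus / count of w in corpus)
    """
    corpus_counts = Counter(tok for doc in corpus_docs for tok in doc)
    total_tokens = sum(corpus_counts.values())
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    weights = {}
    for tok, count in doc_counts.items():
        tf = count / doc_len
        # Assumption: a token missing from the corpus is treated as seen once.
        iwf = math.log(total_tokens / corpus_counts.get(tok, 1))
        weights[tok] = tf * iwf
    return weights


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(tok, 0.0) for tok, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Placeholder tokens stand in for segmented Tibetan words.
corpus = [["a", "b", "c"], ["a", "d"], ["a", "b"]]
tested = tf_iwf_weights(["a", "b"], corpus)
reference = tf_iwf_weights(["a", "b", "c"], corpus)
similarity = cosine_similarity(tested, reference)
```

In this toy corpus the token `a` is the most frequent, so it receives a lower IWF than `b`, which matches the intent of down-weighting common words before the vectors are compared.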
Authors: YAN Liqiang (严李强); TIAN Bo (田博); LIANG Weiheng (梁炜恒); YANG Huanhuan (杨欢欢) (School of Information Science and Technology, Tibet University, Lhasa 850000, China)
Source: Plateau Science Research (《高原科学研究》), CSCD, 2021, No. 3, pp. 70-77, 114 (9 pages in total)
Funding: National Natural Science Foundation of China (61561045); Tibet Autonomous Region College Students' Innovation Training Program (S202110694080).
Keywords: Tibetan; text similarity; TF-IDF; TF-IWF; cosine similarity principle
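The accuracy figure quoted in the abstract is a Pearson correlation coefficient. The abstract does not say which two series were correlated; a plausible reading, assumed here, is that the computed similarity scores were compared with reference scores for the same sentence pairs. The sketch below shows only the standard coefficient itself. The function name `pearson` and the sample numbers are hypothetical and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Hypothetical scores for four sentence pairs (illustration only).
computed_scores = [0.91, 0.40, 0.75, 0.10]
reference_scores = [0.95, 0.35, 0.60, 0.20]
r = pearson(computed_scores, reference_scores)
```

A value near 1 means the computed scores rise and fall with the reference scores; a value near 0 means there is no linear relationship between them.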
