Abstract
A corpus is an indispensable data resource in natural language processing, and the quality of its preprocessing directly affects the performance of downstream work. This paper analyzes preprocessing methods for Tibetan corpora and proposes a correction algorithm for Tibetan combined syllables that combines rules with statistics. First, the original corpus is split into syllables at the Tibetan syllable separator "་". Next, forward and backward combined-syllable correction algorithms restore the combined syllables. Finally, an ambiguity resolution algorithm resolves the combined syllables on which the bidirectional correction is ambiguous. Experimental results show that the algorithm effectively corrects non-word errors in combined syllables, reaching a macro-average accuracy of 79.27%.
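The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the actual rules or statistics, so greedy longest matching against a syllable lexicon stands in for the forward and backward correction, and a unigram frequency score stands in for the ambiguity resolution. The sketch also assumes that a combined syllable is a run of syllables whose separator has been lost. All names (`forward_split`, `backward_split`, `correct_text`, `FREQ`) are hypothetical, and Latin letters stand in for Tibetan syllables in the toy data.

```python
import math

TSHEG = "\u0f0b"  # Tibetan syllable separator (tsheg)


def forward_split(chunk, lexicon, max_len):
    """Greedy longest match, left to right; None if the chunk cannot be covered."""
    parts, i = [], 0
    while i < len(chunk):
        for j in range(min(len(chunk), i + max_len), i, -1):
            if chunk[i:j] in lexicon:
                parts.append(chunk[i:j])
                i = j
                break
        else:
            return None
    return parts


def backward_split(chunk, lexicon, max_len):
    """Greedy longest match, right to left; None if the chunk cannot be covered."""
    parts, j = [], len(chunk)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if chunk[i:j] in lexicon:
                parts.append(chunk[i:j])
                j = i
                break
        else:
            return None
    return parts[::-1]


def correct_text(text, freq):
    """Split at the separator, then re-split chunks that are not known syllables."""
    lexicon = set(freq)
    max_len = max(len(s) for s in lexicon)
    total = sum(freq.values())

    def score(parts):
        # Assumed disambiguation rule: sum of log unigram probabilities.
        return sum(math.log(freq[s] / total) for s in parts)

    out = []
    for chunk in text.split(TSHEG):
        if not chunk or chunk in lexicon:
            out.append(chunk)
            continue
        candidates = [c for c in (forward_split(chunk, lexicon, max_len),
                                  backward_split(chunk, lexicon, max_len)) if c]
        if not candidates:
            out.append(chunk)  # leave unknown material untouched
        else:
            out.extend(max(candidates, key=score))
    return TSHEG.join(out)


# Toy lexicon with made-up counts; letters are placeholders for syllables.
FREQ = {"ab": 5, "abc": 1, "c": 1, "cd": 10, "d": 1}
corrected = correct_text("ab" + TSHEG + "abcd" + TSHEG + "zz", FREQ)
```

In this toy run the chunk "abcd" is split as "abc" + "d" by the forward pass and as "ab" + "cd" by the backward pass. The frequency score prefers the backward result, and the unknown chunk "zz" is left unchanged.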
Authors
道吉扎西
尼玛扎西
才智杰
色差甲
仁青东主
Dorje-Tashi; Nima-Tashi; Caizhi-Jie; Secha-Jia; Rinchen-Dongrub (School of Information Science and Technology, Tibet University, Lhasa 850000, China; College of Computer Science and Technology, Qinghai Normal University, Xining 810016, China)
Source
《高原科学研究》
CSCD
2023, Issue 3, pp. 112-118 (7 pages)
Plateau Science Research
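Funding
University-level Research Cultivation Program of Tibet University (ZDQMJH22-01)
Science and Technology Innovation 2030 Major Project "New Generation Artificial Intelligence (2030)" (SQ2022AAA01028802)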
Keywords
natural language processing
corpus
Tibetan
combined syllables