Abstract
In deep-learning-based research on automatic Tibetan sentence segmentation, the construction of a sentence segmentation dataset directly affects the performance and quality of the resulting models. To address the scarcity of existing data for automatic Tibetan sentence segmentation, this paper surveys Tibetan syntactic structures and proposes that sentence-final predicate verbs and predicate adjectives, together with terminative function words (终结虚词) and separating-joining function words (离合虚词), can serve as end-of-sentence markers. From relevant corpora, the paper builds a predicate adjective dictionary, a predicate verb dictionary, and a sentence-final function word dictionary, and then uses a sentence-final word matching method to segment 400,000 sentences from the corpus. This addresses the problem of constructing a Tibetan sentence segmentation dataset and provides a reliable, relatively large-scale data foundation for deep-learning-based Tibetan sentence segmentation research.
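The sentence-final word matching idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, since the abstract does not give the procedure in detail. The sketch assumes that text is first cut at the shad (།), that syllables are delimited by the tsheg (་), and that a segment closes a sentence only when it ends with an entry from a sentence-final dictionary. The names `SENTENCE_FINAL`, `ends_sentence`, and `split_sentences` are hypothetical, and the four dictionary entries are tiny stand-ins for the three dictionaries the paper builds.

```python
SHAD = "\u0f0d"   # Tibetan shad (།), the vertical stroke mark
TSHEG = "\u0f0b"  # Tibetan tsheg (་), the syllable delimiter

# Hypothetical stand-in for the paper's predicate verb, predicate adjective,
# and sentence-final function word dictionaries (illustrative entries only).
SENTENCE_FINAL = {"ཡིན", "རེད", "སོང", "སོ"}


def ends_sentence(segment, final_words=SENTENCE_FINAL):
    """Return True if the segment ends with a dictionary entry on a syllable boundary."""
    syllables = [s for s in segment.strip().split(TSHEG) if s]
    normalized = TSHEG.join(syllables)
    return any(
        normalized == word or normalized.endswith(TSHEG + word)
        for word in final_words
    )


def split_sentences(text, final_words=SENTENCE_FINAL):
    """Group shad-delimited segments into sentences using the dictionary.

    Segments that never reach a recognized sentence-final word are dropped,
    which is an assumption of this sketch.
    """
    sentences = []
    pending = []
    for segment in text.split(SHAD):
        segment = segment.strip()
        if not segment:
            continue
        pending.append(segment)
        if ends_sentence(segment, final_words):
            sentences.append((SHAD + " ").join(pending) + SHAD)
            pending = []
    return sentences


if __name__ == "__main__":
    sample = "བཀྲ་ཤིས་དང་། སྒྲོལ་མ་སློབ་གྲྭར་ཕྱིན་སོང་། ཁོ་དགེ་རྒན་རེད།"
    for sentence in split_sentences(sample):
        print(sentence)
```

In the sample, the first shad follows a conjunction rather than a sentence-final word, so that segment is merged with the next one. This shows why the shad alone is not a reliable sentence boundary and why a dictionary check is needed.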
Authors
才让叁智
多拉
Tsering-Samdrup; Dorla (Department of Chinese Language and Literature, Northwest Minzu University, Lanzhou 730030, China; School of Information Science and Technology, Tibet University, Lhasa 850000, China; State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining 810016, China)
Source
《高原科学研究》
CSCD
2022, No. 4, pp. 85-94 (10 pages)
Plateau Science Research
Funding
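National Natural Science Foundation of China projects (62266037, 61866034)
Tibet University 2019 University-Level Cultivation Fund Project (ZDCZJH19-19)
Tibet University Funding Program for In-Service Doctoral Study (藏财预指[2022]1号).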
Keywords
Tibetan
sentences
Tibetan shad (vertical stroke mark)
sentence segmentation dataset