
Study on the Construction of a Tibetan Sentence Segmentation Dataset under Low-Resource Language Conditions (贫语言资源条件下藏文分句数据集构建研究)
Abstract In deep-learning-based research on automatic Tibetan sentence segmentation, the construction of the sentence segmentation dataset bears directly on the performance and quality of the segmentation model. To address the scarcity of existing data for automatic Tibetan sentence segmentation, this paper reviews the syntactic structures of Tibetan and proposes that sentence-final predicate verbs and predicate adjectives, together with terminative particles (终结虚词) and disjunctive-conjunctive particles (离合虚词), can serve as end-of-sentence markers. A predicate adjective dictionary, a predicate verb dictionary, and a sentence-final particle dictionary are built from relevant corpora, and a sentence-final word matching method is then used to extract 400,000 sentences from the corpus. This addresses the problem of constructing a Tibetan sentence segmentation dataset and provides a reliable, fairly large-scale data foundation for deep-learning-based research on Tibetan sentence segmentation.
Authors Tsering-Samdrup (才让叁智); Dorla (多拉) (Department of Chinese Language and Literature, Northwest Minzu University, Lanzhou 730030, China; School of Information Science and Technology, Tibet University, Lhasa 850000, China; State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining 810016, China)
Source Plateau Science Research (《高原科学研究》), CSCD, 2022, No. 4, pp. 85-94 (10 pages)
Funding National Natural Science Foundation of China (62266037, 61866034); 2019 Tibet University University-Level Cultivation Fund Project (ZDCZJH19-19); Tibet University Funding Project for In-Service Doctoral Study (藏财预指[2022]1号).
Keywords Tibetan; sentence; Tibetan shad (垂符); sentence segmentation dataset
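The sentence-final word matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_endings` and `split_sentences` are hypothetical, the dictionary entries are romanized placeholder syllables rather than real dictionary data, and the sketch assumes that candidate boundaries are the positions of the shad (U+0F0D) and that a shad counts as a sentence boundary only when the text before it ends with a dictionary entry.

```python
SHAD = "\u0f0d"   # Tibetan shad, the candidate boundary mark
TSHEG = "\u0f0b"  # Tibetan tsheg, the syllable delimiter


def build_endings(*dictionaries):
    """Merge several word lists into one set of syllable tuples."""
    endings = set()
    for dictionary in dictionaries:
        for word in dictionary:
            syllables = tuple(s for s in word.split(TSHEG) if s)
            if syllables:
                endings.add(syllables)
    return endings


def split_sentences(text, endings):
    """Split text at a shad only when the preceding chunk ends with a
    dictionary entry; otherwise keep accumulating chunks.

    Returns (sentences, leftover), where leftover is trailing text that
    never reached a recognized sentence ending.
    """
    max_len = max((len(e) for e in endings), default=0)
    sentences, buffer = [], []
    for chunk in text.split(SHAD):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. from a double shad
        buffer.append(chunk)
        syllables = [s.strip() for s in chunk.split(TSHEG) if s.strip()]
        # Compare the last 1..max_len syllables against the dictionaries.
        matched = any(
            tuple(syllables[-n:]) in endings
            for n in range(1, min(max_len, len(syllables)) + 1)
        )
        if matched:
            sentences.append((SHAD + " ").join(buffer) + SHAD)
            buffer = []
    leftover = (SHAD + " ").join(buffer)
    return sentences, leftover


# Placeholder dictionaries (assumed, romanized stand-ins for real entries).
predicate_verbs = {"yin"}
predicate_adjectives = {"chen" + TSHEG + "po"}
final_particles = {"'o"}
endings = build_endings(predicate_verbs, predicate_adjectives, final_particles)

demo_text = (
    "a" + TSHEG + "b" + SHAD + " "
    "c" + TSHEG + "yin" + SHAD + " "
    "d" + TSHEG + "chen" + TSHEG + "po" + SHAD + " "
    "e" + TSHEG + "f" + SHAD
)
demo_sentences, demo_leftover = split_sentences(demo_text, endings)
```

In this sketch the first shad is not treated as a boundary because the chunk before it does not end with a dictionary entry, so the first two chunks are merged into one sentence. A real pipeline would load the three dictionaries from the corpora mentioned in the abstract and would probably need word segmentation to avoid matching a dictionary entry that is only part of a longer word.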