
Building Multilingual Linguistic Knowledge Ontology and Discovering New Terms
Abstract To address the lack of standardization in the use of linguistic terminology and the need to organize domain knowledge systematically, this paper first integrates English, Russian, and Chinese terminology resources in linguistics to build a multilingual glossary. Second, using this glossary, it collects term-related multilingual comparable corpora from Wikipedia, builds a corpus, and applies an ontology approach to construct a linguistics knowledge base with 14 classes, 75 subclasses, 25,385 instances, and 16 attributes and relations. Finally, it tallies term formation patterns and extracts new multi-word terms in order to update the knowledge base iteratively. The work is relevant to term translation, dictionary compilation, linguistic knowledge mining, and multilingual corpus research, and the knowledge base can also serve as a basic data resource for other linguistic studies.

First, the terminology base and the corpora are combined to overcome the lack of systematic research in terminology. The multilingual terminology has reached a preliminary scale, and the comparable corpora of the study reach the 10-million-word level. The research further tackles the following issues: (1) When correspondences between multilingual terms are established, one-to-many and many-to-many cases occur. In this project, the glossary is built with Chinese as the source language and English and Russian as the target languages, and the first term listed in the dictionary is used as the corresponding term. (2) For comparable corpus construction, Wikipedia offers large scale, strong topical focus, and close multilingual correspondence. However, its open editing model calls its authority and representativeness into question. This results in a large volume of candidate terms at the later stage and high manual revision costs. In follow-up work, the authors will try to integrate resources such as specialized encyclopedias and research monographs to improve domain representativeness and authority.

Second, corpus-based mining of new terms is a complex but meaningful undertaking. The project derives the formation patterns of multi-word terms statistically and applies a rule-based approach in which single-word seed terms serve as clues for pattern matching. The following issues require further study: (1) Multi-word terms with complex composition patterns are not tested, because some key problems in the automatic analysis of English and Russian texts have not yet been well addressed by the academic community. (2) In this study, 57,395 English and 78,952 Russian multi-word candidate terms were extracted automatically. Processing efficiency increased significantly, which saved considerable labor costs. Nevertheless, the candidates cannot yet be added directly to the knowledge base as terms, and manual revision and standardization remain necessary. (3) A follow-up study is needed on the extraction of translation-aligned terminology. In this study, a translation relationship was established between the English and Russian seed terms used to extract monolingual terms. In future work, this clue can be combined with cross-linguistic similarity calculation to build an aligned multi-word terminology list, so as to maximize the value and application scope of this work.
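The seed-based pattern matching mentioned in the abstract can be illustrated with a minimal sketch. It assumes part-of-speech tagged input and keeps an n-gram as a candidate only when its tag sequence matches a pattern and it contains at least one single-word seed term. The tag patterns, the tag names, the function name `extract_candidates`, and the sample data are all hypothetical; the abstract does not list the patterns the authors actually used.

```python
from collections import Counter

# Hypothetical part-of-speech patterns for multi-word terms (an assumption;
# the patterns derived in the paper are not given in the abstract).
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN"), ("ADJ", "ADJ", "NOUN")]


def extract_candidates(tagged_sentences, seed_terms, patterns=PATTERNS):
    """Count n-grams whose tags match a pattern and that contain a seed term."""
    seeds = {s.lower() for s in seed_terms}
    counts = Counter()
    for sentence in tagged_sentences:
        words = [word.lower() for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        for pattern in patterns:
            n = len(pattern)
            for i in range(len(sentence) - n + 1):
                window = words[i:i + n]
                # Keep the n-gram only if the tag sequence matches the
                # pattern and at least one word is a known seed term.
                if tuple(tags[i:i + n]) == pattern and seeds & set(window):
                    counts[" ".join(window)] += 1
    return counts


# Illustrative input only: two hand-tagged sentences and two seed terms.
tagged = [
    [("The", "DET"), ("comparable", "ADJ"), ("corpus", "NOUN"),
     ("supports", "VERB"), ("term", "NOUN"), ("extraction", "NOUN")],
    [("A", "DET"), ("large", "ADJ"), ("building", "NOUN")],
]
candidates = extract_candidates(tagged, {"corpus", "term"})
print(candidates)
```

Requiring a seed term inside each match is what keeps general phrases such as "large building" out of the candidate list. The candidates would still need the manual revision that the abstract describes before entering a knowledge base.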
Authors YUAN Wei (原伟); LI Qin (李勤) (Information Engineering University, Luoyang, Henan 471003, China; School of Russian and Eurasian Studies, Shanghai International Studies University, Shanghai 200083, China)
Source Technology Enhanced Foreign Language Education (《外语电化教学》), CSSCI, Peking University Core Journal, 2020, No. 3, pp. 73-80, 12 (9 pages in total)
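Funding Interim research result of the National Social Science Fund of China projects "Construction and Evaluation of Ontology-Based Russian-Chinese Comparable Corpora" (No. 14CYY051) and "Research on Topic Monitoring and Sentiment Recognition of Russian-Chinese Online News Based on Comparable Corpora and Ontology" (No. 18BYY235).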
Keywords Linguistics; Terminology; Ontology; Knowledge Base; Comparable Corpora