Abstract
This paper designs and implements a corpus-based system for recognizing same-shape, different-code words in traditional Mongolian, that is, words that render identically but are stored as different code sequences. A web crawler collects the raw corpus, which is then preprocessed to generate a word list and an inverted index. For each word in the word list, a glyph image is rendered using GDI and a traditional Mongolian font, and words with the same shape are identified from the similarity between their glyph images. Using the inverted index and the list of same-shape words, the system computes statistics on same-shape, different-code occurrences for two types of corpus. The experimental results show that the problem is pervasive in traditional Mongolian, which reflects the necessity and urgency of developing a relevant standard.
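The pipeline described in the abstract (inverted index, shape-based grouping, per-variant statistics) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, `toy_render` is a stand-in for the GDI and font rendering step, and exact equality of the rendered key replaces the glyph-image similarity measure used in the paper. The Latin strings are placeholders for Mongolian code sequences.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word (code sequence) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def group_by_shape(words, render):
    """Group words whose rendered shapes are equal.

    `render` is any callable returning a hashable shape key. Assumption: the
    paper compares glyph images by similarity; exact equality is a simplification.
    """
    groups = defaultdict(list)
    for word in sorted(words):
        groups[render(word)].append(word)
    # Keep only shapes that are reached by more than one code sequence.
    return [group for group in groups.values() if len(group) > 1]


def variant_stats(groups, index):
    """For each same-shape group, count the documents containing each variant."""
    return [{word: len(index[word]) for word in group} for group in groups]


def toy_render(word):
    """Hypothetical renderer: pretend 'o' and 'u' are drawn with the same glyph."""
    return word.replace("u", "o")


if __name__ == "__main__":
    docs = {1: "bol bul nom", 2: "bul nom"}
    index = build_inverted_index(docs)
    groups = group_by_shape(index, toy_render)
    print(groups)
    print(variant_stats(groups, index))
```

In a real system the `render` argument would draw each word with a traditional Mongolian font and return a bitmap or a hash of it, and the grouping step would need a similarity threshold rather than exact equality.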
Source
Information Technology & Standardization (《信息技术与标准化》)
2015, No. 1, pp. 62-66 (5 pages)
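Funding
National Natural Science Foundation of China
Grant Nos.: 61303165, 61202219, 61202220
Major Science and Technology Project of Press and Publication
Project No.: 0610-1041BJNF 2328/23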
Keywords
traditional Mongolian
same shape and different code
web spider
inverted index
corpus