Abstract
Using a diachronic corpus of 230 million Chinese characters and combining statistical methods with a word-embedding algorithm, this paper quantitatively examines how 13 verbal classifiers combine with verbs in pre-modern Chinese. From a macro perspective, it presents and explains the overall profile and characteristics of verbal classifiers in pre-modern Chinese, with the aim of serving research on the history of the Chinese language and the teaching of classifiers. First, combining statistical and rule-based methods, it completes the preprocessing work: automatic recognition of verbal classifiers, automatic word segmentation, and automatic identification of the verbs that collocate with verbal classifiers. Second, it measures, period by period, the frequency of each verbal-classifier construction and of each verbal classifier, and finds that the frequency of verbal classifiers differs sharply between the classical (wenyan) and vernacular (baihua) registers. Finally, following the semantic classification system of Tongyici Cilin (Synonym Forest), it examines which semantic classes of verbs are favored and which are disfavored for modification by verbal classifiers, and finds a non-obligatory, probabilistic association between a verb's semantic class and whether the verb is modified by a verbal classifier.
Authors
JIANG Yan-ting (蒋彦廷); PAN Yu-ting (潘雨婷); YANG Le (杨乐)
Institute of Chinese Information Processing, Beijing Normal University, Beijing 100875, China; School of Chinese Language & Culture, Beijing Normal University, Beijing 100875, China; School of Statistics and Mathematics, Central University of Finance and Economics, Beijing 102206, China
Source
Journal of Xihua University (Philosophy & Social Sciences) (《西华大学学报(哲学社会科学版)》)
2020, Issue 2, pp. 23-32 (10 pages)
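Funding
2018 Key Project of the State Language Commission's 13th Five-Year Research Plan, "A Study on Naming, Quantifying, and Ordering the Lines of Shuowen Small Seal Script for International Encoding" (ZDI135-57).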
Keywords
verbal classifiers
automatic recognition
word segmentation
statistics
regular expression
word embedding
Synonym Forest