Abstract
This study distinguishes between the word and the lexicon: a word is an individual lexical item, and the lexicon is an aggregate of words. Past discussions of differences among lexicons could only list the words they do or do not share, which gives no overall view of lexical characteristics. Comparing the entries collected in dictionaries of different historical periods likewise makes it difficult to see how the lexicon differs from one period to another. This study examines the Old Chinese Digital Resources, the Pre-Modern Chinese Digital Archive, and the Balanced Modern Chinese Digital Resources of "Academia Sinica", the 300 Tang Poems, the 300 Song Lyrics, the 1998 People's Daily news releases as word-segmented and tagged at Peking University, and the 1991-2002 digital news reports of the News Agency of Taiwan. Because the words used in these texts are vast in number, a small number of significant lexical characteristics must be extracted to capture what distinguishes one lexicon from another. The crucial distinction lies in how words are used, not in whether particular words exist. Word usage shows up in the speech stream or in texts, so the attribute proposed here is called the lexical dynamic attribute. Words can be ranked by their number of occurrences in a text, and, starting from the most frequent word, their frequencies can be accumulated as a percentage of all word occurrences in that text. The cumulative percentage of the 15 most frequent words is taken as the concentration level of high frequency words, and this concentration level serves as the lexical dynamic attribute. Computed from the texts, it clearly separates the Old Chinese, Pre-Modern Chinese, Modern Chinese, poetic, and news texts examined. This quantitative lexical attribute may be of some use in future lexical research.
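The measure described in the abstract can be illustrated with a minimal sketch. It assumes the text has already been segmented into a list of word tokens and that "percentage" means the share of all word occurrences in the text; the abstract does not spell out either detail. The function name `high_frequency_concentration` and the toy token list are hypothetical and are not taken from the article.

```python
from collections import Counter


def high_frequency_concentration(tokens, top_n=15):
    """Cumulative share (in percent) of the top_n most frequent words.

    Assumption: `tokens` is a pre-segmented list of words, and the
    denominator is the total number of word occurrences in the text.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    # most_common() ranks words from highest to lowest frequency.
    top_total = sum(freq for _, freq in counts.most_common(top_n))
    return 100.0 * top_total / total


# Toy, made-up token list (not data from the study).
tokens = ["的", "了", "的", "是", "我", "的", "了", "书"]
print(high_frequency_concentration(tokens, top_n=2))  # 62.5
```

In the toy list the two most frequent words account for 5 of 8 occurrences, hence 62.5. With the default `top_n=15` the same function would yield the 15-word concentration level that the abstract uses to compare corpora.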
Source
《语言学论丛》
CSSCI
2017, No. 2, pp. 1-19 (19 pages)
Essays on Linguistics
Keywords
word and lexicon
lexical dynamic attribute
cumulative frequency percentage
concentration level of high frequency words