Abstract
To address the sharp increase in segmentation ambiguities that arises when a vocabulary is enlarged with two-character phrases (TCPs), we grade the added TCPs by their agglomeration degree. We first examine and test the criteria and methods proposed in earlier work. The results show that three factors, namely the structure type of the phrase, the replacement rate (RR) of its component characters (CC), and the degree of ambiguity in connecting with the preceding or following context, are closely related both to agglomeration degree and to the connection type (A/BC~AB/C). Among these, the replacement rate of the back character, taken with the front character as the base (RR1), is especially valuable for three structure types: attributive-head (adnominal + N), adverbial-head (adverbial + V/A), and verb-object (VO). Character strings with a high RR1 mostly connect as A/BC, whereas the other strings mostly connect as AB/C. On this basis, we propose a grading scheme for the vocabulary enlarged with TCPs, together with concrete graded disambiguation strategies.
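The abstract does not define how the replacement rate is computed. As a rough illustration only, the sketch below assumes one plausible reading of RR1: for a given front character, count how many distinct back characters follow it in a list of two-character phrases. The function name `back_replacement_counts` and the toy lexicon are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def back_replacement_counts(phrases):
    """Map each front character to the number of distinct back characters after it.

    Assumption: this count is only a stand-in for the paper's RR1,
    whose exact definition is not given in the abstract.
    """
    followers = defaultdict(set)
    for phrase in phrases:
        if len(phrase) != 2:
            continue  # only two-character phrases are considered
        front, back = phrase
        followers[front].add(back)
    return {front: len(backs) for front, backs in followers.items()}


# Hypothetical toy lexicon, not data from the paper.
toy_lexicon = ["大树", "大河", "大山", "小河", "读书"]
counts = back_replacement_counts(toy_lexicon)
```

Under this reading, a front character that combines freely with many back characters gets a high count, which by the abstract's finding would point toward an A/BC connection for the strings it heads.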
Source
《语言文字应用》
CSSCI
Peking University Core Journals (北大核心)
2000, No. 2, pp. 21-33 (13 pages)
Applied Linguistics