Abstract
Exploring the universal properties of language has long been a central concern of linguistics, and dependency distance minimization has now been confirmed as a universal regularity of human language. To uncover the motivation behind this regularity, this study examines the dependency distance distributions of 30 languages. Through comparative model fitting, it finds that the stretched exponential distribution and the exponentially truncated power law distribution best fit the dependency distance distributions of "short sentences" and "long sentences," respectively. The results further show that dependency distance distributions in human languages lie between the exponential and the power law distribution and can be described by a mixed exponential and power law model. On this basis, the study discusses methods for investigating dependency distance distributions through comparative model fitting. The results suggest that dependency distance in human language may follow a universal distribution pattern, reflecting the dominant role of the principle of least effort and human cognitive mechanisms in the use and evolution of language structure.
Universal properties of language have long been an important topic in linguistics. In recent years, research has increasingly integrated multiple disciplines and methods, e.g. cognitive science, network science, big data analysis and quantitative techniques. Surveys of large-scale cross-linguistic material indicate that human languages tend toward dependency distance minimization. This tendency suggests that, although human languages differ in pronunciation, vocabulary, grammar, etc., their syntax may be bound by universal mechanisms, and their evolution may likewise follow a universal model. Dependency distance, defined as the linear distance between two syntactically related words, reflects the comprehension difficulty of syntactic structure. Dependency distance minimization is therefore considered to result from human cognitive mechanisms and the effect of the "principle of least effort" on syntactic structure; it also indicates that humans prefer to avoid long-distance dependencies in order to reduce cognitive cost. As a result, the distribution of dependency distance may present a certain pattern, and revealing this pattern will help us understand how human cognitive mechanisms act on syntactic structure. The question, however, is which probability distribution fits the pattern of dependency distance distribution more properly: the power law distribution or the exponential distribution?
To answer this question, this paper analyzes dependency distance distributions using the following methods and materials: 1) the Complementary Cumulative Distribution Function (CCDF) is used to smooth the data, avoid statistical fluctuation and lower fitting error; 2) maximum likelihood estimation and the likelihood ratio test are used to fit and compare five kinds of "heavy-tailed" distributions, including the exponential and the power law; 3) the HamleDT 2.0 dependency treebank is adopted, in particular the language materials annotated with the Prague Dependencies scheme, because this annotation scheme is closest to traditional dependency grammar and most helpful for uncovering the rules of language structure. With these methods and materials, the research analyzes dependency treebanks of 30 languages and reaches the following findings: 1) the complementary cumulative distributions indicate that dependency distance distributions in human languages exhibit clear regularity; 2) for the majority of the 30 languages, the distribution of dependency distance conforms to certain models, namely the Stretched Exponential Distribution (SED) for "short sentences" and the Truncated Power Law Distribution (TPLD) for "long sentences"; 3) although dependency distance distribution patterns differ among languages, they all fit, in essence, a mixed exponential and power law distribution; 4) the debate over exponential versus power law distributions may mainly arise from differences in fitting methods, languages, sentence lengths, text sizes, etc. These findings help us better understand the nature of dependency distance minimization and suggest that dependency distance in human languages may follow a universal distribution pattern. They may also contribute significantly to constructing the syntactic synergetic subsystem within the framework of dependency grammar.
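The first method described above, smoothing dependency-distance counts with the empirical CCDF, can be sketched as follows. This is a minimal illustration, not the paper's actual code; the function name and the sample distances are hypothetical.

```python
import numpy as np

def ccdf(samples):
    """Empirical complementary CDF: P(X >= x) for each distinct value x.
    Plotting this instead of a raw histogram smooths tail fluctuation,
    which is why the paper uses it before model fitting."""
    xs = np.sort(np.asarray(samples))
    n = len(xs)
    # For sorted data, the first index of each distinct value equals the
    # number of samples strictly below it, so P(X >= x) = 1 - index / n.
    x_unique, first_idx = np.unique(xs, return_index=True)
    p = 1.0 - first_idx / n
    return x_unique, p

# Hypothetical dependency distances from a small parsed sample
x, p = ccdf([1, 1, 2, 2, 2, 3, 5, 8])
# e.g. P(X >= 2) = 6/8 = 0.75
```

On log-log axes, an approximately straight CCDF suggests a power-law tail, while downward curvature suggests exponential or truncated behaviour, which is what the comparative fitting then tests formally.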
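The second method, maximum likelihood fitting plus a likelihood ratio comparison, can also be sketched in its simplest form: fit an exponential and a power law to the same tail by MLE and compare log-likelihoods, where a positive ratio favours the power law and a negative one the exponential. This is a hedged, continuous-approximation sketch under assumed closed-form estimators; the paper's actual five-model procedure is more involved, and the data below are illustrative.

```python
import math

def fit_tail(samples, xmin=1.0):
    """MLE fits of two heavy-tail candidates on data x >= xmin,
    compared via the log-likelihood ratio (a simplified sketch)."""
    x = [v for v in samples if v >= xmin]
    n = len(x)
    # Exponential tail: p(x) = lam * exp(-lam * (x - xmin)),
    # with closed-form MLE lam = 1 / mean(x - xmin).
    shifted_sum = sum(v - xmin for v in x)
    lam = n / shifted_sum
    ll_exp = n * math.log(lam) - lam * shifted_sum
    # Power-law tail: p(x) = ((alpha - 1) / xmin) * (x / xmin) ** (-alpha),
    # with closed-form MLE alpha = 1 + n / sum(log(x / xmin)).
    log_sum = sum(math.log(v / xmin) for v in x)
    alpha = 1.0 + n / log_sum
    ll_pl = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    # Sign of the ratio indicates the better-fitting model.
    return lam, alpha, ll_pl - ll_exp
```

In practice the raw ratio is supplemented by a significance test, since a ratio near zero (as typically happens for mixed exponential/power-law data like dependency distances) means neither pure model is clearly preferred.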
Source
《浙江大学学报(人文社会科学版)》
CSSCI
Peking University Core Journals (北大核心)
2016, No. 4, pp. 63-76 (14 pages)
Journal of Zhejiang University: Humanities and Social Sciences
Funding
Major Project of the National Social Science Fund of China (11&ZD188)
China Postdoctoral Science Foundation (2015M571852)
Keywords
dependency distance
dependency distance distribution
power law
dependency treebank
cognitive mechanism
language structure