
An analysis of statistical techniques applied to lexical differences between corpora (cited by: 4)
Abstract: Differences in the frequency of words or other linguistic features between corpora are a fundamental research topic in corpus linguistics, and the chi-square test is the main statistical technique used to study them. Every statistical method rests on certain assumptions. Because lexical difference data do not fully satisfy the assumptions of the chi-square test, results obtained with it can contain substantial error. Other statistical techniques applicable to lexical difference analysis include the log-likelihood ratio test and the rank sum test. The experiments reported here indicate that the log-likelihood ratio test behaves much like the chi-square test: both are affected by sample size and sample representativeness and therefore tend to produce statistical bias. The rank sum test can resolve these problems to some extent and yields relatively objective statistical results.
Author: Ge Shili (葛诗利)
Source: Modern Foreign Languages (《现代外语》), 2010, No. 3, pp. 249-257 (9 pages). Indexed in CSSCI and the Peking University Core Journals list.
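To make the contrast drawn in the abstract concrete, the following Python sketch computes the three statistics for a single word. It is an illustration only, not the author's code: the per-text counts are invented, the function names are hypothetical, and the rank sum statistic uses the normal approximation without a tie correction. The chi-square and log-likelihood tests see only pooled corpus totals, while the rank sum test compares the word's frequency text by text.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table.
    a, b: occurrences of the word in corpus 1 and corpus 2
    c, d: all other tokens in corpus 1 and corpus 2
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def log_likelihood_2x2(a, b, c, d):
    """Log-likelihood ratio statistic G2 for the same 2x2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def rank_sum_z(xs, ys):
    """Wilcoxon rank sum z statistic (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(xs), len(ys)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd

# Invented data: occurrences of one word in five 1000-token texts per corpus.
# In corpus 1 the word is concentrated in a single text.
tokens_per_text = 1000
corpus1 = [50, 0, 1, 0, 1]
corpus2 = [2, 1, 3, 2, 2]

a, b = sum(corpus1), sum(corpus2)
c = tokens_per_text * len(corpus1) - a
d = tokens_per_text * len(corpus2) - b

chi2 = chi_square_2x2(a, b, c, d)
g2 = log_likelihood_2x2(a, b, c, d)
z = rank_sum_z(corpus1, corpus2)  # equal text sizes, so raw counts can be ranked

print(f"chi-square = {chi2:.2f}, G2 = {g2:.2f}, rank sum z = {z:.2f}")
```

With these invented counts, both pooled statistics exceed 3.84 (the 5% critical value for one degree of freedom), whereas the rank sum z stays inside ±1.96, because a single atypical text accounts for almost all of the pooled difference.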












