Abstract
Corpus-based research in China currently relies mainly on tools such as AntConc and PowerGREP, and few studies use the NLTK package for Python for data processing and analysis. Constrained by their own design, such stand-alone tools cannot flexibly support different research methods. Using NLTK in research puts the data on a uniform standard and avoids the trouble of converting between text-processing formats; it also makes up for the shortcomings of tools such as Range in syntactic analysis, plotting, and regular-expression search. Focusing on the main stages of corpus research (text tokenization, lemmatization, and text retrieval and statistics), this paper briefly introduces the use of NLTK in corpus research and takes Jane Austen's novel Emma from the Gutenberg corpus as an example to show how the package can be used to process corpus data.
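The stages named in the abstract (tokenization, word-form normalization, retrieval and frequency statistics) might be sketched with NLTK roughly as follows. This is an illustrative sketch, not the paper's own code: the function names `profile_text` and `load_emma` and the example pattern are hypothetical, and Porter stemming is used as a simple stand-in for lemmatization because it needs no extra data files (NLTK's `WordNetLemmatizer` would require the WordNet data to be downloaded first).

```python
import re

from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def load_emma():
    """Return the raw text of Emma from NLTK's Gutenberg sample.

    Assumes the corpus data was fetched beforehand with
    nltk.download('gutenberg').
    """
    from nltk.corpus import gutenberg
    return gutenberg.raw("austen-emma.txt")


def profile_text(raw, pattern=r"^un\w+ly$", top_n=5):
    """Tokenize a text, normalize word forms, count them, and run a regex search.

    `pattern` is only an example query (words such as 'unhappily').
    """
    # Tokenization: split on word/punctuation boundaries, keep alphabetic tokens.
    tokens = [t.lower() for t in wordpunct_tokenize(raw) if t.isalpha()]

    # Word-form normalization: stemming here, as a stand-in for lemmatization.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]

    # Statistics: frequency distribution over the normalized forms.
    freq = FreqDist(stems)

    # Retrieval: distinct tokens matching the regular expression.
    matches = sorted({t for t in tokens if re.search(pattern, t)})

    return {"tokens": tokens, "top_stems": freq.most_common(top_n), "matches": matches}
```

With the Gutenberg data installed, `profile_text(load_emma())` would apply the same steps to the full novel; the function also works on any plain string, which makes it easy to try on a short sample first.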
Source
《昆明冶金高等专科学校学报》
CAS
2015, No. 5, pp. 65-69, 93 (6 pages in total)
Journal of Kunming Metallurgy College