Abstract
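Building on existing methods for measuring the distributional features of words, this paper proposes the main indicators and methods for measuring entity distribution. Using a general corpus and a self-built corpus of U.S. Army English news, it statistically measures and compares entities. The results show that entity distribution differs significantly between English military news and general English. This demonstrates the distinctive capacity of entities to distinguish language varieties, offers a new perspective for observing and measuring the distributional features of language, and may supply new features for applications such as text classification and semantic mining.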
An entity is a special linguistic unit with an explicit external referent, a stable structure, and a single meaning. We filter and process Wikipedia entries to obtain a large entity set that covers a wide range of fields and includes rich entity types. Based on this entity set, the present study proposes an entity recognition algorithm and implements automatic entity recognition. We investigate the distribution of military entities, and the findings are as follows. Compared with their distribution in general English, entities in the military English corpus are denser and concentrated in a relatively closed set. Although general English contains a large number of generic-domain entities, their distribution is widely dispersed. Entity collocations reveal close semantic relationships among related entities, which provides a valuable perspective for further text mining and information processing.
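The abstract does not specify the recognition algorithm or the exact distribution indicators, so the following is only a minimal sketch of one plausible approach: greedy longest-match lookup of token n-grams against an entity set, followed by two simple measures (entity density and entity type-token ratio). The function names, the `max_len` parameter, the toy entity set, and the choice of indicators are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def recognize_entities(tokens, entity_set, max_len=5):
    """Greedy longest-match lookup of token n-grams in entity_set.

    Hypothetical helper: entity_set holds lowercased, space-joined names.
    """
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in entity_set:
                match = (candidate, n)
                break
        if match:
            found.append(match[0])
            i += match[1]  # skip past the matched span
        else:
            i += 1
    return found

def distribution_stats(tokens, entity_set):
    """Illustrative indicators (assumed, not taken from the paper)."""
    entities = recognize_entities(tokens, entity_set)
    counts = Counter(entities)
    return {
        # Entity mentions per token: higher means denser entity use.
        "density": len(entities) / len(tokens) if tokens else 0.0,
        # Number of distinct entities observed.
        "distinct": len(counts),
        # Distinct entities / mentions: lower means a more closed set.
        "type_token_ratio": len(counts) / len(entities) if entities else 0.0,
    }

# Toy data standing in for a Wikipedia-derived entity set and a news text.
ENTITY_SET = {"u.s. army", "fort bragg", "army"}
tokens = "The U.S. Army trains at Fort Bragg and the Army deploys".split()
stats = distribution_stats(tokens, ENTITY_SET)
```

Under these assumptions, comparing `density` and `type_token_ratio` across a military corpus and a general corpus would give a rough view of whether entities are denser and drawn from a more closed set in one variety than in the other.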
Source
《解放军外国语学院学报》
CSSCI
Peking University Core Journal (北大核心)
2020, No. 3, pp. 74-81, F0003 (9 pages in total)
Journal of PLA University of Foreign Languages
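Funding
National Social Science Fund of China project "Development and Application of a Chinese-English Clause-Level Aligned Corpus" (19BYY081)
National Social Science Fund of China key project "Theoretical Modeling of Meaning Discovery in Corpus Linguistics" (16AYY008)
Shandong Provincial Higher Education Youth Innovation Science and Technology Program, "Multilingual Big Data Innovation Team Project" (2019RWC014).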