Abstract: The study of word class categorization, widely discussed and with a history of more than 2,000 years, is known as the study of the "God Particles" of language. As a typical analytic language, Modern Chinese lacks morphological change and therefore faces a thorny word class problem, especially regarding the criteria for word class identification and the treatment of multiple class membership. These controversies have led to contradiction and confusion in word class labeling in Modern Chinese and Chinese-English dictionaries. As an important grammatical device in Chinese and a focus of lexicology and rhetoric, total reduplication lexemes form an essential part of Chinese-English dictionaries and have complex and diverse word classes. Guided by the Two-level Word Class Categorization Theory, this thesis examines the word class labeling of total reduplication lexemes in the New Century Chinese-English Dictionary (2nd edition), drawing on large-scale balanced Modern Chinese corpora. With an innovative theoretical perspective, the study contributes to the word class labeling of total reduplication lexemes, sheds light on the compilation of Chinese-English dictionaries, and advances the study of Modern Chinese word classes in the long term.
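Abstract: To improve how label enhancement methods based on prior knowledge capture emotional information, a Mongolian emotion distribution learning method based on semantic rule enhancement (SRE-MEDL) is proposed. Building on the emotion wheel and an emotion lexicon, a degree lexicon and a negation lexicon are introduced to obtain various combinations of emotion words; corresponding semantic rules are then formulated to compute emotion word weights, which are integrated into label enhancement. In emotion distribution learning, a reverse reconstruction mapping from the emotion distribution space to the instance feature space is incorporated to compensate for the loss of original information caused by the forward mapping. Comparative experiments show that, on Mongolian datasets and commonly used Chinese and English datasets, SRE-MEDL outperforms existing methods in both the label enhancement task and emotion distribution learning.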
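The rule-based weighting described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the lexicon entries, the numeric degree values, the use of English tokens, and the rule that a negation word flips the sign of the weight are hypothetical and are not taken from the SRE-MEDL paper.

```python
# Hypothetical lexicons; entries and values are illustrative only,
# not the lexicons or weights used by SRE-MEDL.
DEGREE = {"very": 1.5, "slightly": 0.5}       # degree lexicon: scales intensity
NEGATION = {"not", "never"}                   # negation lexicon: flips polarity
EMOTION = {"happy": "joy", "sad": "sadness"}  # emotion lexicon: word -> emotion label


def emotion_weights(tokens):
    """Return {emotion: weight}, applying the modifiers directly preceding each emotion word."""
    weights = {}
    for i, tok in enumerate(tokens):
        if tok not in EMOTION:
            continue
        w = 1.0
        j = i - 1
        # Walk backwards over the contiguous run of degree/negation words.
        while j >= 0 and (tokens[j] in DEGREE or tokens[j] in NEGATION):
            if tokens[j] in DEGREE:
                w *= DEGREE[tokens[j]]
            else:
                w *= -1.0  # assumed rule: negation reverses the sign
            j -= 1
        label = EMOTION[tok]
        weights[label] = weights.get(label, 0.0) + w
    return weights


print(emotion_weights(["not", "very", "happy"]))
print(emotion_weights(["slightly", "sad"]))
```

In a label enhancement pipeline, weights of this kind would be normalized into a distribution over emotion labels; how SRE-MEDL does that is not stated in the abstract.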
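Abstract: When classifying video text in the science and technology domain, technical terms that contribute strongly to classification are easily overlooked. To address this, the traditional Labeled latent Dirichlet allocation (LDA) model is improved and a MulCHI-Labeled LDA model is built for science and technology video text, avoiding bias toward high-frequency words. A domain terminology base is constructed to highlight technical terms, and chi-square weighting and text position weighting algorithms are used to improve topic word quality. Experimental results show that, compared with the Labeled LDA model, the proposed model solves the problem of technical terms being overlooked and effectively improves topic word quality and classification accuracy.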
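As background for the chi-square weighting mentioned in this abstract, the sketch below computes the standard chi-square statistic for one term and one class from a 2x2 contingency table. It is the textbook feature-selection formula, shown as an assumption about the general idea; the abstract does not specify the exact weighting scheme used in the MulCHI-Labeled LDA model, and the function and parameter names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# A term that occurs only inside the class scores high ...
print(chi_square(10, 0, 0, 10))
# ... while a term spread evenly across classes scores zero.
print(chi_square(5, 5, 5, 5))
```

A term with a high score is strongly associated with the class regardless of its raw frequency, which is why this kind of weighting can counteract a bias toward high-frequency words.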
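Abstract: To address overlapping triples and diverse entity expressions in maize breeding text data, a joint entity and relation extraction method for maize breeding is proposed, based on BERT-CRF (bidirectional encoder representations from transformers-conditional random field) with embedded lexical information. First, the expression characteristics of the maize breeding corpus are analyzed, and a strategy of simultaneously annotating entity boundaries, relation categories, and entity position information is adopted. Second, a BERT-CRF model with embedded lexical information is built for training and prediction: a maize breeding knowledge dictionary is constructed, lexical information is embedded in BERT to fuse character and word features and strengthen the model's semantic capability, the CRF model outputs the globally optimal label sequence, and an entity and relation triple matching algorithm (ERTM) is designed to match and map labels into triples. Finally, to verify the effectiveness of the method, experiments are conducted on a maize breeding dataset; the model achieves precision, recall, and F1 of 91.84%, 95.84%, and 93.80%, respectively, all improvements over existing models. The results indicate that the method can effectively extract maize breeding domain knowledge and provides a data foundation for building a maize breeding knowledge graph and for other downstream tasks.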
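The three reported figures can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. The short sketch below performs that check; the function name is a hypothetical helper, and the only inputs are the values quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f1_score(91.84, 95.84)
print(f"{f1:.2f}")
```

The computed value agrees with the reported F1 of 93.80% to two decimal places, so the three metrics are mutually consistent.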