Abstract: Popular words online are a form of language welcomed by netizens and the most active part of network language. They consist of expressions that internet users have established through usage, and they have a strong media character. They are comprehensive products of the social, political, economic, cultural and environmental conditions, as well as of people's psychological activities, in a given period. The number of newly emerging popular online words each year is not large, but their influence should not be underestimated. Taking popular words that have appeared in recent years as cases, this study examines the reasons for their rapid spread, summarizes their characteristics, and puts forward some approaches for guiding popular words online.
Funding: Supported by the National Social Science Foundation of China (09BYY024 and 11&ZD188)
Abstract: This study investigates the feasibility of applying complex networks to fine-grained language classification and of employing word co-occurrence networks based on parallel texts as a substitute for syntactic dependency networks in complex-network-based language classification. Fourteen word co-occurrence networks were constructed from parallel texts of 12 Slavic languages and 2 non-Slavic languages. With appropriate combinations of the major parameters of these networks, cluster analysis was able to distinguish the Slavic languages from the non-Slavic ones and to group the Slavic languages correctly into their respective sub-branches. Moreover, the clustering could also capture the genetic relationships of some of these Slavic languages within their sub-branches. The results show that word co-occurrence networks based on parallel texts are applicable to fine-grained language classification and that they constitute a more convenient substitute for syntactic dependency networks in complex-network-based language classification.
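As a rough illustration of the pipeline described above, the following Python sketch builds a word co-occurrence network with networkx, summarises it with a few global network parameters, and performs hierarchical clustering of languages over those parameter vectors. The chosen parameter set, the co-occurrence window, the function names, and the toy inputs are assumptions for illustration only; the study's actual parameters and clustering setup may differ.

```python
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import linkage

def cooccurrence_network(tokens, window=1):
    """Link every pair of word types that co-occur within `window` positions."""
    G = nx.Graph()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if w != tokens[j]:
                G.add_edge(w, tokens[j])
    return G

def network_parameters(G):
    """One possible set of 'major parameters'; the study's exact set may differ."""
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return [
        np.mean([d for _, d in G.degree()]),        # average degree
        nx.average_clustering(G),                   # clustering coefficient
        nx.average_shortest_path_length(giant),     # average path length (giant component)
        nx.degree_assortativity_coefficient(G),     # degree-degree correlation
    ]

# Toy usage: real inputs would be tokenised parallel texts, one per language.
corpora = {
    "lang_a": "the cat sat on the mat and the dog sat too".split(),
    "lang_b": "a dog ran after a cat and a bird flew away".split(),
}
features = np.array([network_parameters(cooccurrence_network(t))
                     for t in corpora.values()])
Z = linkage(features, method="average")  # hierarchical clustering of languages
print(Z)
```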
Abstract: To address the scarcity of text corpora on livestock and poultry diseases and the large number of out-of-vocabulary words such as disease names and phrases in such texts, a BERT-BiLSTM-CRF word segmentation model for livestock and poultry disease texts combined with dictionary matching is proposed. Taking sheep diseases as the research object, a text dataset of common diseases was constructed and combined with the general-purpose PKU corpus. The BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model was used to obtain vector representations of the text; a bidirectional long short-term memory network (BiLSTM) captured contextual semantic features; and a conditional random field (CRF) output the globally optimal label sequence. On this basis, a domain dictionary of livestock and poultry diseases was added after the CRF layer to correct the segmentation by dictionary matching, reducing the ambiguous splitting of disease names and phrases during segmentation and further improving segmentation accuracy. Experimental results show that the dictionary-matching BERT-BiLSTM-CRF model achieved an F1 score of 96.38% on the common sheep disease text dataset, improvements of 11.01, 10.62, 8.3 and 0.72 percentage points over the jieba segmenter, the BiLSTM-Softmax model, the BiLSTM-CRF model and the same model without dictionary matching, respectively, verifying the effectiveness of the method. Compared with a single corpus, the mixed corpus combining the general PKU corpus and the sheep disease text dataset can accurately segment both specialized livestock and poultry disease terms and common words in disease texts, with F1 scores above 95% on both the general corpus and the disease text dataset, showing good generalization ability. The method can be used for word segmentation of livestock and poultry disease texts.
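A minimal PyTorch sketch of the described architecture is given below, assuming the Hugging Face transformers library and the third-party pytorch-crf package for the CRF layer. The BMES tag scheme, the layer sizes, the class and function names, and the greedy dictionary-matching rule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # third-party package: pip install pytorch-crf

TAGS = ["B", "M", "E", "S"]  # word-begin / word-middle / word-end / single-character word

class BertBiLstmCrf(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", hidden=256, num_tags=len(TAGS)):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)                      # contextual semantic features
        emissions = self.emit(x)                 # per-character tag scores
        mask = attention_mask.bool()
        if tags is not None:                     # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequences

def dictionary_correct(words, lexicon, max_span=8):
    """Post-correction step: greedily re-merge adjacent output words whenever
    their concatenation is a term in the domain dictionary (e.g. a disease name)."""
    out, i = [], 0
    while i < len(words):
        merged = None
        for j in range(min(len(words), i + max_span), i + 1, -1):
            if "".join(words[i:j]) in lexicon:
                merged, i = "".join(words[i:j]), j
                break
        if merged is None:
            merged, i = words[i], i + 1
        out.append(merged)
    return out
```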
Abstract: To address the insufficient ability of recurrent neural network language models to learn long-range historical information, this paper proposes a recurrent neural network language model combined with global word vector features. First, the GloVe (Global Word Vectors) algorithm is used to train global word vectors, which are then fed as feature vectors into a recurrent neural network with an added feature layer for training. Compared with local word vector methods, global word vectors exploit global statistical information to produce word vectors with richer semantic and syntactic information. To verify the performance of the new method, perplexity and continuous speech recognition experiments were carried out on the Penn Treebank and Wall Street Journal corpora, respectively. The experimental results show that, compared with the conventional recurrent neural network language model, the perplexity of the model combined with global word vectors is reduced by 20.2%, and the word error rate of the speech recognition system is reduced by 18.3%.
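The following PyTorch sketch illustrates the idea of feeding pre-trained global word vectors into a recurrent language model through an additional feature layer. The class name, the layer sizes, the plain RNN cell, and the `glove_matrix` input (a vocabulary-by-dimension array of GloVe vectors loaded elsewhere) are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GloveFeatureRnnLm(nn.Module):
    """Recurrent LM whose input at each step is the learned word embedding
    concatenated with a frozen pre-trained GloVe vector (the feature layer)."""
    def __init__(self, glove_matrix, embed_dim=200, hidden=400):
        super().__init__()
        vocab_size, glove_dim = glove_matrix.shape
        self.embed = nn.Embedding(vocab_size, embed_dim)           # learned local embedding
        self.glove = nn.Embedding.from_pretrained(                 # global word-vector features
            torch.as_tensor(glove_matrix, dtype=torch.float), freeze=True)
        self.rnn = nn.RNN(embed_dim + glove_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids, state=None):
        # Global co-occurrence information supplements the recurrent history.
        x = torch.cat([self.embed(word_ids), self.glove(word_ids)], dim=-1)
        h, state = self.rnn(x, state)
        return self.out(h), state                                  # next-word logits

# Perplexity is the exponential of the mean cross-entropy on held-out text, e.g.:
# logits, _ = model(batch[:, :-1])
# loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
#                                    batch[:, 1:].reshape(-1))
# perplexity = loss.exp()
```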