In large-scale automated production, fleets of CNC machine tools are stopped by various faults, dragging down the efficiency of the whole line; timely and accurate fault prediction enables preventive inspection and maintenance and helps improve overall line efficiency. Against the background of intelligent manufacturing, and building on the large volume of historical fault alarm data accumulated by CNC machine tools, a data-driven fault alarm prediction method based on Word2vec and LSTM-SVM is designed to predict faults a machine tool may develop. The alarm texts are first vectorized with word embeddings; the alarm vectors are then fed into a long short-term memory (LSTM) network prediction model, with a support vector machine (SVM) replacing the conventional softmax as the model's final classifier. Experimental results show that this method achieves higher prediction accuracy.
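The pipeline lends itself to a short sketch. The following is a minimal illustration, not the authors' code, assuming PyTorch, gensim, and scikit-learn; the alarm codes (AL1001, ...), dimensions, and hyperparameters are invented, and the LSTM is left untrained here although the paper trains it before feature extraction:

```python
# Hypothetical sketch: Word2vec alarm embeddings -> LSTM encoder -> SVM head.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Toy alarm logs: each sequence is a window of alarm codes, label = next fault class.
sequences = [["AL1001", "AL1003", "AL1010"], ["AL1003", "AL1010", "AL2002"]] * 50
labels = np.array([0, 1] * 50)

# 1. Vectorize alarm tokens with Word2vec (hyperparameters are illustrative).
w2v = Word2Vec(sequences, vector_size=32, window=3, min_count=1, epochs=50)

def embed(seq):  # (seq_len, 32) matrix of alarm vectors
    return torch.tensor(np.stack([w2v.wv[t] for t in seq]), dtype=torch.float32)

# 2. LSTM encoder; its final hidden state is the sequence feature.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
with torch.no_grad():  # untrained here; the paper trains the LSTM first
    feats = np.stack([lstm(embed(s).unsqueeze(0))[1][0].squeeze().numpy()
                      for s in sequences])

# 3. SVM replaces the softmax layer as the final classifier.
clf = SVC(kernel="rbf").fit(feats, labels)
print("train accuracy:", clf.score(feats, labels))
```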
To address the scarcity of corpora for livestock and poultry disease texts and the many out-of-vocabulary words they contain, such as disease names and set phrases, a BERT-BiLSTM-CRF word segmentation model for livestock disease texts, combined with dictionary matching, is proposed. Taking sheep diseases as the research object, a dataset of common disease texts was built and combined with the general-purpose PKU corpus. A BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model produces vectorized text representations; a bidirectional long short-term memory network (BiLSTM) captures contextual semantic features; and a conditional random field (CRF) outputs the globally optimal label sequence. On this basis, a domain dictionary of livestock diseases is applied after the CRF layer for match-based correction, reducing the ambiguous segmentation caused by disease names and phrases and further improving segmentation accuracy. Experimental results show that the dictionary-augmented BERT-BiLSTM-CRF model achieves an F1 score of 96.38% on the common sheep disease dataset, an improvement of 11.01, 10.62, 8.3, and 0.72 percentage points over the jieba segmenter, the BiLSTM-Softmax model, the BiLSTM-CRF model, and the proposed model without dictionary matching, respectively, verifying the method's effectiveness. Compared with a single corpus, the mixed corpus combining the PKU corpus and the sheep disease dataset segments both domain terminology and common words in disease texts accurately, with F1 scores above 95% on both the general corpus and the disease dataset, demonstrating good generalization. The method is applicable to word segmentation of livestock and poultry disease texts.
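The dictionary-correction step after the CRF layer can be illustrated in isolation. The sketch below is an assumption about how such a correction might work (longest-match re-merging of CRF output against a domain lexicon), not the authors' implementation; the lexicon entries and the max_words window are invented:

```python
# Hypothetical post-CRF correction: re-merge CRF output segments whenever a
# window of adjacent segments exactly matches a domain-dictionary entry.
def dict_correct(segments, lexicon, max_words=4):
    """segments: CRF-produced word list; lexicon: set of disease terms."""
    out, i = [], 0
    while i < len(segments):
        merged = None
        # Prefer the longest dictionary match starting at position i.
        for j in range(min(len(segments), i + max_words), i + 1, -1):
            cand = "".join(segments[i:j])
            if cand in lexicon:
                merged, i = cand, j
                break
        if merged is None:
            merged = segments[i]
            i += 1
        out.append(merged)
    return out

lexicon = {"羊痘", "传染性脓疱"}          # illustrative disease terms
print(dict_correct(["羊", "痘", "的", "防治"], lexicon))
# -> ['羊痘', '的', '防治']
```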
We study the short-term memory capacity of ancient readers of the original New Testament written in Greek, and of its translations into Latin and into modern languages. To model it, we consider the number of words between any two contiguous interpunctions, I_p, because this parameter can model how the human mind memorizes "chunks" of information. Since I_p can be calculated for any alphabetical text, we can perform experiments with ancient readers, otherwise impossible, by studying the literary works they used to read. The "experiments" compare the I_p of texts of one language/translation to those of another language/translation by measuring the minimum average probability of finding joint readers (those who can read both texts because of similar short-term memory capacity) and by defining an "overlap index". We also define the population of universal readers: people who can read any New Testament text in any language. Future work is vast, with many research tracks, because alphabetical literatures are very large and allow many experiments, such as comparing authors, translations, or even texts written by artificial intelligence tools.
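Computing I_p for a given text is straightforward; the following minimal Python sketch assumes an illustrative set of interpunctions (., ; : ? !), which may differ from the set used in the paper:

```python
# A minimal sketch of the word interval I_p: the average number of words
# between two contiguous punctuation marks in an alphabetical text.
import re

def word_interval(text):
    # Split at the interpunctions considered (an assumed, illustrative set).
    chunks = re.split(r"[.,;:?!]", text)
    sizes = [len(c.split()) for c in chunks if c.strip()]
    return sum(sizes) / len(sizes)

sample = ("In the beginning was the Word, and the Word was with God, "
          "and the Word was God.")
print(round(word_interval(sample), 2))  # average words per "chunk"
```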
To classify dietary text information efficiently, a classification model based on word2vec and a long short-term memory (LSTM) network is built. Tailored to the characteristics of food encyclopedia and dietary health texts, word2vec is first used to produce word vectors that carry semantic information, avoiding the sparse representations and the curse of dimensionality of traditional methods, and K-means++ clustering over semantic relations is applied to improve the quality of the training data. The text vectors built from word2vec serve as the initial input for training the LSTM classification model, which extracts features automatically and classifies texts into dietary "suitable" and "taboo" categories. In tests on 48,000 documents, the classification accuracy reached 98.08%, higher than results obtained with numeric text representations such as tf-idf and bag-of-words, and with support vector machine (SVM) and convolutional neural network (CNN) classifiers. The results show that the method classifies dietary texts automatically with high quality and helps people make effective use of healthy-diet information.
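The data-preparation stage (word2vec vectors grouped by k-means++) can be sketched as follows, assuming gensim and scikit-learn; the toy corpus, the document-averaging step, and the hyperparameters are illustrative rather than the authors' setup:

```python
# Hypothetical sketch: word2vec document vectors clustered with k-means++ to
# group semantically similar dietary texts before LSTM training.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

docs = [["菠菜", "富含", "铁"], ["高血压", "忌", "高盐"], ["苹果", "富含", "维生素"]] * 20
w2v = Word2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Represent each document by the mean of its word vectors.
doc_vecs = np.stack([np.mean([w2v.wv[w] for w in d], axis=0) for d in docs])

# scikit-learn's KMeans uses the k-means++ seeding named in the abstract.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(doc_vecs)
print(km.labels_[:6])  # cluster ids used to filter/balance the training data
```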
To address the low detection accuracy and poor real-time performance of existing algorithms on families of malicious domain names, a malicious domain detection algorithm based on BiLSTM-DAE is proposed. A bidirectional long short-term memory network (BiLSTM) extracts contextual sequence features from the character combinations of a domain name, and a deep auto-encoder (DAE) compresses them layer by layer to extract strong character-composition features that are common within a class and discriminative across classes, which are then classified. Experimental results show that, compared with current mainstream malicious domain detection algorithms, the proposed algorithm achieves higher detection accuracy while keeping detection overhead low.
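A hypothetical PyTorch sketch of the two-stage architecture follows; the layer sizes are invented, and the training objectives (a classification loss plus a reconstruction loss for the auto-encoder) are omitted:

```python
# Hypothetical sketch: a character-level BiLSTM encodes a domain name, and a
# deep auto-encoder compresses the feature vector layer by layer.
import torch
import torch.nn as nn

class BiLSTMDAE(nn.Module):
    def __init__(self, n_chars=40, emb=16, hid=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.bilstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.encoder = nn.Sequential(  # layer-by-layer compression (DAE)
            nn.Linear(2 * hid, 64), nn.ReLU(), nn.Linear(64, 16), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2 * hid))
        self.head = nn.Linear(16, 2)   # benign vs. malicious

    def forward(self, x):              # x: (batch, seq_len) of char ids
        h, _ = self.bilstm(self.emb(x))
        feat = h[:, -1, :]             # last-step context feature
        code = self.encoder(feat)
        return self.head(code), self.decoder(code), feat

model = BiLSTMDAE()
logits, recon, feat = model(torch.randint(0, 40, (8, 20)))
print(logits.shape, recon.shape)       # torch.Size([8, 2]) torch.Size([8, 128])
```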
Statistics of languages are usually calculated by counting characters, words, sentences, and word rankings. Some of these random variables are also the main "ingredients" of classical readability formulae. Revisiting the readability formula of Italian, known as GULPEASE, shows that of the two terms that determine the readability index G (the semantic index GC, proportional to the number of characters per word, and the syntactic index GF, proportional to the reciprocal of the number of words per sentence), GF is dominant, because GC is, in practice, constant for any author throughout seven centuries of Italian literature. Each author can modulate the length of sentences more freely than the length of words, and in different ways from author to author. For any author, any pair of text variables can be modelled by a linear relationship y = mx, but with a slope m that differs from author to author, except for the relationship between characters and words, which is the same for all. The most important relationship found in the paper is that between short-term memory capacity, described by Miller's "7 ± 2 law" (i.e., the number of "chunks" that an average person can hold in short-term memory ranges from 5 to 9), and the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed. The word interval spreads over the same range as Miller's law, and the time interval over the same range as short-term memory response times. The connection between the word interval (and time interval) and short-term memory appears, at least empirically, justified and natural, but remains to be investigated further. Technical and scientific writings (papers, essays, etc.) demand more of their readers because their words are on average longer, the readability index G is lower, and word and time intervals are longer. Future work on ancient languages, such as the classical Greek and Latin literatures (or the literatures of modern languages), could give insight into the short-term memory required of their well-educated ancient readers.
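For reference, the GULPEASE index is commonly stated as G = 89 - GC + GF, with GC = 10 * (letters/words) and GF = 300 * (sentences/words); the sketch below implements that common statement and should be checked against the paper's exact definitions:

```python
# A sketch of the GULPEASE index as commonly stated; higher G = more readable.
import re

def gulpease(text):
    words = text.split()
    letters = sum(len(re.sub(r"[^A-Za-zÀ-ÿ]", "", w)) for w in words)
    sentences = max(1, len(re.findall(r"[.?!]", text)))
    gc = 10 * letters / len(words)         # semantic index: word length
    gf = 300 * sentences / len(words)      # syntactic index: sentence length
    return 89 - gc + gf

print(round(gulpease("Nel mezzo del cammin di nostra vita mi ritrovai "
                     "per una selva oscura."), 1))
```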
TCM terms have deep roots in Chinese culture, which creates cultural obstacles for their translation. This paper attempts to analyze the origin of some TCM terms and to classify the cultural obstacles in translating them. To transfer culture-loaded information to foreign readers effectively, translators should be fully aware of cultural differences and improve their intercultural communication competence.
[Objective/Significance] This paper proposes a Tibetan word segmentation model based on a long short-term memory (LSTM) neural network and a conditional random field (CRF). [Method/Process] An attention mechanism is introduced to capture more feature information and strengthen the model's focus on the links between the context and the current syllable; a syllable expansion method is proposed to obtain more input features and corpus information, enhancing the model's ability to derive semantic information from single-syllable features. [Limitations] Building on the 12,261-entry Tibet University dataset, the corpus was expanded to 74,384 entries, forming the Tibetan-News dataset. [Results/Conclusion] Experimental results show that, after adding the attention mechanism and applying syllable expansion, the model's precision, recall, and F1 on the Tibetan-News dataset improve by 2.9%, 3.5%, and 3.2%, respectively. A segmentation system based on this model has been deployed in engineering practice.
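One plausible shape for the attention step is sketched below in PyTorch; it is an assumption about the design, not the authors' code, and the CRF layer that would consume the emission scores (e.g. from the torchcrf package) is omitted:

```python
# Hypothetical sketch: weight each syllable's LSTM state by a learned attention
# score and concatenate the global context to each step before CRF tagging.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnLSTMTagger(nn.Module):
    def __init__(self, n_syllables=500, emb=64, hid=128, n_tags=4):  # BMES tags
        super().__init__()
        self.emb = nn.Embedding(n_syllables, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid, 1)
        self.out = nn.Linear(4 * hid, n_tags)  # emission scores for the CRF

    def forward(self, x):                      # x: (batch, seq_len) syllable ids
        h, _ = self.lstm(self.emb(x))          # (batch, seq, 2*hid)
        a = F.softmax(self.attn(h), dim=1)     # per-position attention weights
        ctx = (a * h).sum(dim=1, keepdim=True) # global context vector
        ctx = ctx.expand(-1, h.size(1), -1)    # broadcast context to each step
        return self.out(torch.cat([h, ctx], dim=-1))

emissions = AttnLSTMTagger()(torch.randint(0, 500, (2, 10)))
print(emissions.shape)                         # torch.Size([2, 10, 4])
```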