Abstract: Two lines of research on eye movements in reading are summarized. One line of research examines how adult readers identify compound words during reading. The other deals with how a specific reading goal influences the way long expository texts are read. Both lines of research are conducted using Finnish as the source language. With respect to the first research question, it is demonstrated that compound words are recognized either holistically or via their components, depending on the length of the compound word. Readers begin to process whatever information is readily available in foveal vision (i.e., either the whole-word form or the initial component). The second line of research demonstrates that (1) a specific reading goal is capable of exerting an early effect on readers’ eye fixation patterns, (2) time course analyses based on eye movement patterns can reveal interesting individual differences, and (3) working memory capacity is linked to the efficiency with which readers strategically allocate attention and encode information into, and retrieve it from, long-term memory. It is concluded that the eye-tracking technique is an excellent research tool for tapping into the workings of the human mind during the comprehension of written texts.
Abstract: TextRank is a popular tool for obtaining words or phrases that are important for many Natural Language Processing (NLP) tasks. This paper presents a practical approach to domain-specific TextRank using Field Association (FA) words. We present a keyphrase extraction technique that operates not on a single document but on a particular domain. The approach has two stages: the first builds the specific domain field; the second collects a list of single FA terms and compound FA terms from that domain as candidate keyphrases. We then combine word node weights and field-tree relationships into a new approach for generating keyphrases from a particular domain. Keyphrase extraction studies using the modified approach demonstrate that techniques incorporating FA terms are stronger than those using ordinary words, with precision reaching 90%.
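To make the graph-ranking step concrete, the following is a minimal sketch of TextRank over a word co-occurrence graph. The paper's FA-term weighting is only approximated here by a hypothetical `fa_terms` set that biases the random walk via PageRank personalization; the field-tree relationships themselves are not reproduced.

```python
# Minimal TextRank sketch over a word co-occurrence graph.
# The FA-term weighting is approximated by a hypothetical `fa_terms`
# set that receives extra random-walk mass; this is an illustration,
# not the paper's exact method.
import networkx as nx

def textrank_keywords(tokens, window=2, fa_terms=None, boost=2.0, top_k=5):
    """Rank tokens by TextRank; optionally boost Field Association terms."""
    fa_terms = fa_terms or set()
    graph = nx.Graph()
    # Connect words that co-occur within the sliding window.
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window + 1]:
            if w != v:
                graph.add_edge(w, v)
    # Personalization vector gives FA terms extra random-walk mass.
    personalization = {w: (boost if w in fa_terms else 1.0) for w in graph}
    scores = nx.pagerank(graph, personalization=personalization)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = "field association terms improve domain keyphrase extraction".split()
print(textrank_keywords(tokens, fa_terms={"field", "association"}))
```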
Abstract: One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, effectively to feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, which are typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
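As a concrete illustration of the distributional hypothesis underlying these techniques, the following is a toy Latent Semantic Analysis sketch: a term-document count matrix factored by truncated SVD, so that words occurring in similar contexts receive similar low-dimensional vectors. The corpus and the choice of two latent dimensions are invented for illustration.

```python
# Toy LSA sketch: words appearing in similar contexts end up with
# similar low-dimensional vectors, illustrating the distributional
# hypothesis. Corpus and dimensionality are invented for illustration.
import numpy as np

docs = [
    "dog barks at the cat",
    "cat chases the dog",
    "stocks rise as markets rally",
    "markets fall and stocks drop",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# Truncated SVD keeps the two strongest latent dimensions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
embeddings = U[:, :2] * S[:2]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings[index["dog"]], embeddings[index["cat"]]))     # high
print(cosine(embeddings[index["dog"]], embeddings[index["stocks"]]))  # low
```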
Abstract: To address the scarcity of text corpora for livestock and poultry diseases and the large number of out-of-vocabulary words in such texts (disease names, phrases, and the like), a BERT-BiLSTM-CRF word segmentation model for livestock and poultry disease texts, combined with dictionary matching, is proposed. Taking sheep diseases as the research object, a text dataset of common diseases was constructed and combined with the general-purpose PKU corpus. The BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model produces vectorized text representations; a bidirectional long short-term memory network (BiLSTM) captures contextual semantic features; and a conditional random field (CRF) outputs the globally optimal label sequence. On this basis, a livestock and poultry disease domain dictionary is applied after the CRF layer to correct the segmentation by dictionary matching, reducing the ambiguous splits caused by disease names and phrases and further improving segmentation accuracy. Experimental results show that the dictionary-matched BERT-BiLSTM-CRF model achieves an F1 score of 96.38% on the common sheep disease text dataset, improvements of 11.01, 10.62, 8.3, and 0.72 percentage points over the jieba segmenter, the BiLSTM-Softmax model, the BiLSTM-CRF model, and this model without dictionary matching, respectively, verifying the effectiveness of the method. Compared with a single corpus, the mixed corpus combining the general PKU corpus with the common sheep disease text dataset segments both specialized disease terminology and the common words of disease texts accurately, achieving F1 scores above 95% on both the general corpus and the disease text dataset and showing good model generalization. The method is applicable to word segmentation of livestock and poultry disease texts.
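The dictionary-matching correction after the CRF layer can be pictured with the sketch below, which re-merges character spans that match a domain-dictionary entry via greedy forward maximum matching. The label scheme (B/M/E/S), the matching strategy, and the tiny dictionary are all assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of the post-CRF dictionary correction: spans
# matching a domain-dictionary entry are re-merged into one word,
# overriding ambiguous model splits. Labels and dictionary are
# assumptions for illustration.
def correct_with_dictionary(chars, labels, domain_dict, max_len=8):
    """Greedy forward maximum matching against the domain dictionary."""
    words, i = [], 0
    while i < len(chars):
        # Prefer the longest dictionary entry starting at position i.
        for L in range(min(max_len, len(chars) - i), 1, -1):
            cand = "".join(chars[i : i + L])
            if cand in domain_dict:
                words.append(cand)
                i += L
                break
        else:
            # Fall back to the model's own segmentation labels.
            j = i
            while j < len(chars) and labels[j] not in ("E", "S"):
                j += 1
            words.append("".join(chars[i : j + 1]))
            i = j + 1
    return words

chars = list("羊痘病毒感染")
labels = ["B", "E", "B", "E", "B", "E"]  # model split: 羊痘/病毒/感染
print(correct_with_dictionary(chars, labels, {"羊痘病毒"}))  # ['羊痘病毒', '感染']
```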
Abstract: We study the short-term memory capacity of ancient readers of the original New Testament written in Greek and of its translations into Latin and modern languages. To model it, we consider the number of words between any two contiguous interpunctions, I_P, because this parameter can model how the human mind memorizes “chunks” of information. Since I_P can be calculated for any alphabetical text, we can perform experiments (otherwise impossible) with ancient readers by studying the literary works they used to read. The “experiments” compare the I_P of texts in one language/translation to those of another by measuring the minimum average probability of finding joint readers (those who can read both texts because of similar short-term memory capacity) and by defining an “overlap index”. We also define the population of universal readers: people who can read any New Testament text in any language. Future work is vast, with many research tracks, because alphabetical literatures are very large and allow many experiments, such as comparing authors, translations, or even texts written by artificial intelligence tools.
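Because I_P is defined for any alphabetical text, it is straightforward to compute. The following is a minimal sketch that counts the words between consecutive punctuation marks; the punctuation set and tokenization are assumptions, since the paper's exact rules are not reproduced here.

```python
# Sketch of computing the interpunction interval I_P: the number of
# words between two contiguous punctuation marks. The punctuation set
# is an assumption; the paper's exact tokenization may differ.
import re

PUNCT = ".,;:?!"

def interpunction_intervals(text):
    """Return word counts between consecutive punctuation marks."""
    chunks = re.split(f"[{re.escape(PUNCT)}]", text)
    return [len(c.split()) for c in chunks if c.split()]

text = "In the beginning was the Word, and the Word was with God, and the Word was God."
intervals = interpunction_intervals(text)
print(intervals, sum(intervals) / len(intervals))  # [6, 6, 5] and the mean I_P
```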
Abstract: Intelligent classification of radar equipment fault texts helps improve the efficiency of radar equipment support. To address the strong domain specificity of radar fault texts and their small, imbalanced sample sizes, non-core-word EDA is used for intra-class data augmentation, increasing the amount of text while keeping the key information unchanged. To address the insufficient diversity of the new samples generated by non-core-word EDA, SSMix (saliency-based span mixup for text classification) is added for inter-class data augmentation, improving text diversity through nonlinear cross-fusion of input texts. Experiments show that, compared with existing classic baseline classification methods and typical data-augmentation classification methods, this method achieves a substantial improvement in accuracy.
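For intuition, here is a hedged sketch of the non-core-word EDA step: standard EDA-style synonym replacement restricted so that domain core words are never altered. The synonym table and core-word set are invented for illustration; the paper's actual resources are domain-specific and in Chinese.

```python
# Sketch of non-core-word EDA: synonym replacement that never touches
# domain core words, so key fault information is preserved. Synonym
# table and core-word set are invented for illustration.
import random

SYNONYMS = {"signal": ["waveform"], "abnormal": ["anomalous", "faulty"]}
CORE_WORDS = {"transmitter", "receiver"}  # key fault terms kept intact

def non_core_eda(tokens, n_replace=1, seed=0):
    """Replace up to n_replace non-core words with a random synonym."""
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, w in enumerate(out)
                  if w not in CORE_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(n_replace, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

print(non_core_eda("transmitter signal abnormal".split(), n_replace=2))
```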