Abstract: A new method of automatic Chinese term extraction is proposed based on the Patricia (PAT) tree. Mutual information is calculated via prefix searching in a PAT tree of the domain corpus to estimate the internal associative strength between the Chinese characters in a string. This greatly improves the speed of term-candidate extraction compared with methods that operate on the domain corpus directly. Common collocation suffix and prefix banks are constructed, and term part-of-speech (POS) composition rules are summarized, to improve the precision of term extraction. Experimental results show an F-measure of 74.97%.
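As a hedged illustration of the associative-strength measure, the sketch below computes pointwise mutual information for a two-character Chinese string from raw frequency counts; the plain Counter stands in for the PAT tree, which would return the same prefix counts without scanning the corpus. The tiny corpus string is hypothetical.

```python
import math
from collections import Counter

def mutual_information(corpus: str, pair: str) -> float:
    """Pointwise mutual information of a two-character string.
    The Counter lookups stand in for PAT-tree prefix searches,
    which yield the same counts in time proportional to |pair|."""
    n = len(corpus)
    chars = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(n - 1))
    if bigrams[pair] == 0:
        return float("-inf")
    p_xy = bigrams[pair] / (n - 1)
    p_x = chars[pair[0]] / n
    p_y = chars[pair[1]] / n
    return math.log2(p_xy / (p_x * p_y))

# High MI suggests the two characters cohere as a term candidate.
print(mutual_information("信息抽取依赖信息检索与信息过滤", "信息"))
```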
Abstract: OOV term translation plays an important role in natural language processing. Although many researchers have endeavored to solve the OOV term translation problem, none of the existing methods offers definition or context information for OOV terms, and none focuses on cross-language definition retrieval for them. Moreover, it has always been difficult to evaluate the correctness of an OOV term translation without domain-specific knowledge and correct references. Our English definition ranking method differentiates the types of OOV terms and applies different methods for translation extraction; it also extracts multilingual context information and monolingual definitions of OOV terms. In addition, we propose a novel cross-language definition retrieval system for OOV terms, together with an automatic re-evaluation method to assess the correctness of OOV translations and definitions. Our methods achieve high performance compared with existing methods.
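Since the abstract does not detail the ranking features, the following is only a minimal sketch of one plausible component: scoring candidate English definitions of an OOV term by TF-IDF cosine similarity to the term's source context. The example term context and candidate definitions are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical source context of an OOV term and candidate definitions.
context = "the transformer model uses self-attention to encode tokens"
candidates = [
    "a neural network architecture based on self-attention over tokens",
    "a device that transfers electrical energy between circuits",
]

# Rank candidates by lexical overlap with the context.
vec = TfidfVectorizer().fit(candidates + [context])
scores = cosine_similarity(vec.transform([context]),
                           vec.transform(candidates))[0]
ranked = sorted(zip(scores, candidates), reverse=True)
print(ranked[0][1])  # best-matching definition
```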
Funding: Supported by the Zhejiang Provincial Natural Science Foundation (No. LQ16H180004).
Abstract: Objectives Medical knowledge extraction (MKE) plays a key role in natural language processing (NLP) research on electronic medical records (EMR), which are the important digital carriers for recording the medical activities of patients. Named entity recognition (NER) and medical relation extraction (MRE) are two basic tasks of MKE. This study aims to improve the recognition accuracy of these two tasks by exploring deep learning methods. Methods This study discussed and built two application scenarios of the bidirectional long short-term memory combined with conditional random field (BiLSTM-CRF) model for the NER and MRE tasks. In the data preprocessing of both tasks, a GloVe word embedding model was used to vectorize words. In the NER task, a sequence labeling strategy was used to classify each word tag by the joint probability distribution through the CRF layer. In the MRE task, the medical entity relation category was predicted by transforming the classification problem of a single entity into a sequence classification problem and linking the feature combinations between entities, also through the CRF layer. Results Validated on the I2B2 2010 public dataset, the BiLSTM-CRF models built in this study achieved much better results than the baseline methods in the two tasks, with an F1-measure of up to 0.88 in the NER task and 0.78 in the MRE task. Moreover, the model converged faster and avoided problems such as overfitting. Conclusion This study demonstrated the good performance of deep learning on medical knowledge extraction and verified the feasibility of the BiLSTM-CRF model in different application scenarios, laying a foundation for subsequent work in the EMR field.
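The sketch below shows the emission side of such a BiLSTM tagger in PyTorch under assumed dimensions (vocab_size, num_tags, and the hidden sizes are placeholders); the CRF layer, which models the joint probability over the whole tag sequence, is noted in comments rather than implemented.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Emission half of a BiLSTM-CRF sequence labeler (placeholder
    sizes). A CRF layer on top of these per-token scores would model
    the joint probability of the full tag sequence, as in the paper."""
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        # Pretrained GloVe vectors would normally initialize this table.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.emit(h)                  # (batch, seq_len, num_tags)

emissions = BiLSTMTagger(vocab_size=5000, num_tags=7)(
    torch.randint(0, 5000, (2, 30)))
print(emissions.shape)  # torch.Size([2, 30, 7])
```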
Funding: Project supported by the National Natural Science Foundation of China (No. 60082003) and the National High Technology Research and Development Program of China (No. 863-306-ZD03-04-1).
Abstract: This paper presents a new way to extract concepts that can be used to improve text classification performance (precision and recall). The computational measure is divided into two layers: the bottom layer, called the document layer, is concerned with extracting the concepts of a particular document, while the upper layer, called the category layer, finds the description and subject concepts of a particular category. The implementation algorithm, which dramatically decreases the search space, is discussed in detail. An experiment based on real-world data collected from InfoBank shows that the approach is superior to traditional ones.
Funding: Supported by the National Natural Science Foundation of China under Grants No. 61100205 and No. 60873001, the Hi-Tech Research and Development Program of China under Grant No. 2011AA010705, and the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212.
Abstract: Webpage classification differs from traditional text classification in its irregular words and phrases and its massive, unlabeled features, which make it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and on term clustering, with multiple strategies to optimize a Vector Space Model (VSM) in order to improve webpage classification. The results show that document clustering works better than term clustering in coping with document content, and the best overall performance is obtained by spectral clustering with document clustering. Moreover, since images coexist with document content on a webpage, the proposed method is also applied to extract meaningful terms from images, and experimental results likewise show its effectiveness in improving webpage classification.
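As a hedged sketch of the document-clustering step that performed best, the snippet below groups toy webpages with spectral clustering over TF-IDF vectors; the four sample pages, the cosine affinity, and the cluster count are illustrative choices, not the paper's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering

# Toy webpage texts (hypothetical); real input would be cleaned HTML.
pages = [
    "free movie download and streaming site",
    "latest movie reviews and trailers",
    "cheap laptop deals and discounts",
    "laptop hardware benchmark tests",
]

# Spectral clustering on TF-IDF vectors with a cosine affinity.
X = TfidfVectorizer().fit_transform(pages).toarray()
labels = SpectralClustering(n_clusters=2, affinity="cosine",
                            random_state=0).fit_predict(X)
print(labels)  # strings shared within a cluster become VSM features
```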
Abstract: This paper describes a procedure, and reports on experiments performed to study its utility, that combines a structural property of a text's sentences with term expansion using WordNet [1] and a local thesaurus [2] to select the most appropriate extractive summary for a particular document. Sentences were tagged and normalized, then subjected to the Longest Common Subsequence (LCS) algorithm [3] [4] for the selection of the most similar subset of sentences. Similarity was based on the LCS of the pairs of sentences that make up the document. A normalized score was calculated and used to rank sentences. A selected top subset of the most similar sentences was then tokenized to produce a set of important keywords or terms. These terms were further expanded into two subsets using 1) WordNet and 2) a local electronic dictionary/thesaurus. The three sets obtained (the original and the two expanded ones) were then recycled to further refine and expand the list of selected sentences from the original document. The process was repeated a number of times in order to find the best representative set of sentences, and a final set of the top (best) sentences was selected as candidate sentences for summarization. To verify the utility of the procedure, a number of experiments were conducted using an email corpus. The results were compared to those produced by human annotators, as well as to results produced using a basic sentence-similarity calculation method. The results were very encouraging and compared well to those of the human annotators and to Jaccard sentence similarity.
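A minimal sketch of the pairwise LCS similarity underlying the ranking step follows; the max-length normalization is an assumption, as the abstract does not specify the exact normalized score.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def lcs_similarity(s1: str, s2: str) -> float:
    """LCS of two tokenized sentences, normalized by the longer
    length (the normalization scheme here is an assumption)."""
    a, b = s1.lower().split(), s2.lower().split()
    return lcs_length(a, b) / max(len(a), len(b))

print(lcs_similarity("the meeting is moved to friday",
                     "our meeting moved to next friday"))  # ~0.667
```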
Abstract: Two foundational factors concerning the light extraction of light-emitting diodes (LEDs), the escape cone and the transmissivity, are discussed. Based on these factors, a new process for simulating the light extraction of LEDs with the Monte Carlo method is presented. The improved method treats the reflection and refraction of light (beams of light) at the interface between two media approximately. In addition, the light extraction of traditional LEDs is simulated by different processes with the same structure and parameters. The results show that the approximate treatment of reflection and refraction is accurate enough for analyzing LED structures. This method saves much time and improves efficiency in the simulation of LED light extraction.
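For concreteness, the escape cone and normal-incidence transmissivity follow directly from Snell's law and the Fresnel equations; the sketch below evaluates both for an assumed semiconductor refractive index of 2.5 (roughly GaN), which is illustrative rather than taken from the paper.

```python
import math

def escape_cone_fraction(n_in: float, n_out: float = 1.0) -> float:
    """Fraction of isotropically emitted photons falling inside the
    escape cone of one planar face (solid-angle argument)."""
    theta_c = math.asin(n_out / n_in)      # critical angle, Snell's law
    return (1 - math.cos(theta_c)) / 2

def normal_transmissivity(n_in: float, n_out: float = 1.0) -> float:
    """Fresnel power transmissivity at normal incidence."""
    r = (n_in - n_out) / (n_in + n_out)
    return 1 - r * r

# For n ~= 2.5 into air, only about 4% of photons reach one face
# inside the escape cone, of which ~82% are transmitted.
print(escape_cone_fraction(2.5), normal_transmissivity(2.5))
```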
Abstract: This study was carried out to understand the long-term effect of organic waste treatment on the fate of heavy metals originating from the wastes, together with an examination of changes in soil properties. For this, soils that had received three different organic wastes (municipal sewage sludge, alcohol fermentation processing sludge, and pig manure compost) at three rates (12.5, 25, and 50 ton/ha/yr) for seven years (1994-2000) were used. To observe the long-term effect, a plant growth study and a soil examination were each conducted twice, in 2000 and 2010. No additional organic waste was applied during the ten years after the seven-year treatment ceased. The soil examination conducted in 2010 showed decreases in soil pH, EC, total nitrogen, organic matter, available phosphorus, exchangeable cations, and heavy metal contents in all soils that had received organic wastes, compared to the results obtained in 2000. Speciation of heavy metals in soil through sequential extraction showed that organically bound Cu was the dominant species in all treatments, and that exchangeable Cu increased in the plots treated with municipal sewage sludge and alcohol fermentation processing sludge. Organically bound Ni increased from 25%-30% to 32%-45% in 2010 in all treatments, while Pb increased in the carbonate form in all treatments. Zn existed mainly in sulfide and residual forms, with the organically bound form increasing in all treatments during the ten post-treatment years.
Abstract: Automatic web page classification has become inevitable for web directories due to the multitude of web pages on the World Wide Web. In this paper, an improved term weighting technique is proposed for automatic and effective classification of web pages. The web documents are represented as sets of features, and the proposed method selects and extracts the most prominent ones, reducing the high-dimensionality problem of the classifier. Proper selection of features from the large set improves the performance of the classifier. The proposed algorithm is implemented and tested on a benchmark dataset, and the results show better performance than most existing term weighting techniques.
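The abstract does not give the improved weighting formula, so the sketch below implements only classic TF-IDF, the baseline that improved schemes rescale or replace; the two toy pages are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Classic TF-IDF weights per document: term frequency scaled by
    log inverse document frequency, so terms appearing in every
    document receive zero weight."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: (count / len(doc)) * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

pages = [["sports", "news", "football"], ["laptop", "news", "review"]]
print(tfidf(pages))  # "news" gets weight 0; distinctive terms dominate
```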
Abstract: Information science terminology carries the foundational knowledge and core concepts of the discipline. Organizing and analyzing information science terms from a conceptual perspective is important for advancing the discipline and supporting downstream knowledge mining tasks. Facing the rapidly growing volume of scientific literature, automatic term extraction has replaced manual screening, but existing methods rely heavily on large-scale annotated datasets and are difficult to transfer to low-resource scenarios. This paper designs a generative method for information science term extraction (generative term extraction for information science, GTX-IS), which reformulates the traditional sequence-labeling-based extractive task as a sequence-to-sequence generative task. Combining a few-shot learning strategy with supervised fine-tuning improves task-specific text generation and enables fairly accurate extraction of information science terms in scenarios with little labeled data. On the extraction results, the paper further carries out term discovery and multi-dimensional knowledge mining for the information science domain. Using full-text scientometric and informetric methods, it statistically analyzes and mines term frequency, life cycle, and co-occurrence information along the dimensions of the terms themselves, inter-term relations, and temporal information. Applying social network analysis combined with temporal features, it enriches dynamic journal profiles from the perspective of terms and explores the research hotspots, evolution, and future trends of information science. In term extraction experiments, the proposed method outperforms all 13 mainstream generative and extractive models, demonstrating strong few-shot learning ability and offering a new approach to domain information extraction.
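As a hedged sketch of the sequence-to-sequence reformulation (not the actual GTX-IS setup), the snippet below prompts a generic pretrained seq2seq model to generate terms from a sentence; the model name, prompt wording, and example sentence are all illustrative assumptions.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder seq2seq backbone; GTX-IS's actual base model and
# fine-tuning data are not specified in the abstract.
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = ("Citation analysis and altmetrics are widely used to "
        "measure the scholarly impact of publications.")
prompt = f"Extract the domain terms from the text: {text}"

# Generation replaces token-level sequence labeling: the model emits
# the terms themselves rather than per-token BIO tags.
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```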