Document processing in natural language includes retrieval,sentiment analysis,theme extraction,etc.Classical methods for handling these tasks are based on models of probability,semantics and networks for machine learn...Document processing in natural language includes retrieval,sentiment analysis,theme extraction,etc.Classical methods for handling these tasks are based on models of probability,semantics and networks for machine learning.The probability model is loss of semantic information in essential,and it influences the processing accuracy.Machine learning approaches include supervised,unsupervised,and semi-supervised approaches,labeled corpora is necessary for semantics model and supervised learning.The method for achieving a reliably labeled corpus is done manually,it is costly and time-consuming because people have to read each document and annotate the label of each document.Recently,the continuous CBOW model is efficient for learning high-quality distributed vector representations,and it can capture a large number of precise syntactic and semantic word relationships,this model can be easily extended to learn paragraph vector,but it is not precise.Towards these problems,this paper is devoted to developing a new model for learning paragraph vector,we combine the CBOW model and CNNs to establish a new deep learning model.Experimental results show that paragraph vector generated by the new model is better than the paragraph vector generated by CBOW model in semantic relativeness and accuracy.展开更多
Neural Machine Translation(NMT)based system is an important technology for translation applications.However,there is plenty of rooms for the improvement of NMT.In the process of NMT,traditional word vector cannot dist...Neural Machine Translation(NMT)based system is an important technology for translation applications.However,there is plenty of rooms for the improvement of NMT.In the process of NMT,traditional word vector cannot distinguish the same words under different parts of speech(POS).Aiming to alleviate this problem,this paper proposed a new word vector training method based on POS feature.It can efficiently improve the quality of translation by adding POS feature to the training process of word vectors.In the experiments,we conducted extensive experiments to evaluate our methods.The experimental result shows that the proposed method is beneficial to improve the quality of translation from English into Chinese.展开更多
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse ...One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, upwards of hundreds of thousands of terms typically. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, to feature selection in NLP effectively. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants which are typically referred to as word embeddings. In this review of algoriths such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.展开更多
With the development of Internet technology,the explosive growth of Internet information presentation has led to difficulty in filtering effective information.Finding a model with high accuracy for text classification...With the development of Internet technology,the explosive growth of Internet information presentation has led to difficulty in filtering effective information.Finding a model with high accuracy for text classification has become a critical problem to be solved by text filtering,especially for Chinese texts.This paper selected the manually calibrated Douban movie website comment data for research.First,a text filtering model based on the BP neural network has been built;Second,based on the Term Frequency-Inverse Document Frequency(TF-IDF)vector space model and the doc2vec method,the text word frequency vector and the text semantic vector were obtained respectively,and the text word frequency vector was linearly reduced by the Principal Component Analysis(PCA)method.Third,the text word frequency vector after dimensionality reduction and the text semantic vector were combined,add the text value degree,and the text synthesis vector was constructed.Experiments show that the model combined with text word frequency vector degree after dimensionality reduction,text semantic vector,and text value has reached the highest accuracy of 84.67%.展开更多
Word Sense Disambiguation has been a trending topic of research in Natural Language Processing and Machine Learning.Mining core features and performing the text classification still exist as a challenging task.Here the...Word Sense Disambiguation has been a trending topic of research in Natural Language Processing and Machine Learning.Mining core features and performing the text classification still exist as a challenging task.Here the features of the context such as neighboring words like adjective provide the evidence for classification using machine learning approach.This paper presented the text document classification that has wide applications in information retrieval,which uses movie review datasets.Here the document indexing based on controlled vocabulary,adjective,word sense disambiguation,generating hierarchical cate-gorization of web pages,spam detection,topic labeling,web search,document summarization,etc.Here the kernel support vector machine learning algorithm helps to classify the text and feature extract is performed by cuckoo search opti-mization.Positive review and negative review of movie dataset is presented to get the better classification accuracy.Experimental results focused with context mining,feature analysis and classification.By comparing with the previous work,proposed work designed to achieve the efficient results.Overall design is per-formed with MATLAB 2020a tool.展开更多
基金The authors would like to thank all anonymous reviewers for their suggestions and feedback.This work Supported by the National Natural Science,Foundation of China(No.61379052,61379103)the National Key Research and Development Program(2016YFB1000101)+1 种基金The Natural Science Foundation for Distinguished Young Scholars of Hunan Province(Grant No.14JJ1026)Specialized Research Fund for the Doctoral Program of Higher Education(Grant No.20124307110015).
文摘Document processing in natural language includes retrieval,sentiment analysis,theme extraction,etc.Classical methods for handling these tasks are based on models of probability,semantics and networks for machine learning.The probability model is loss of semantic information in essential,and it influences the processing accuracy.Machine learning approaches include supervised,unsupervised,and semi-supervised approaches,labeled corpora is necessary for semantics model and supervised learning.The method for achieving a reliably labeled corpus is done manually,it is costly and time-consuming because people have to read each document and annotate the label of each document.Recently,the continuous CBOW model is efficient for learning high-quality distributed vector representations,and it can capture a large number of precise syntactic and semantic word relationships,this model can be easily extended to learn paragraph vector,but it is not precise.Towards these problems,this paper is devoted to developing a new model for learning paragraph vector,we combine the CBOW model and CNNs to establish a new deep learning model.Experimental results show that paragraph vector generated by the new model is better than the paragraph vector generated by CBOW model in semantic relativeness and accuracy.
基金This work is supported by the National Natural Science Foundation of China(61872231,61701297).
文摘Neural Machine Translation(NMT)based system is an important technology for translation applications.However,there is plenty of rooms for the improvement of NMT.In the process of NMT,traditional word vector cannot distinguish the same words under different parts of speech(POS).Aiming to alleviate this problem,this paper proposed a new word vector training method based on POS feature.It can efficiently improve the quality of translation by adding POS feature to the training process of word vectors.In the experiments,we conducted extensive experiments to evaluate our methods.The experimental result shows that the proposed method is beneficial to improve the quality of translation from English into Chinese.
文摘One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, upwards of hundreds of thousands of terms typically. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, to feature selection in NLP effectively. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants which are typically referred to as word embeddings. In this review of algoriths such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.
基金Supported by the Sichuan Science and Technology Program (2021YFQ0003).
文摘With the development of Internet technology,the explosive growth of Internet information presentation has led to difficulty in filtering effective information.Finding a model with high accuracy for text classification has become a critical problem to be solved by text filtering,especially for Chinese texts.This paper selected the manually calibrated Douban movie website comment data for research.First,a text filtering model based on the BP neural network has been built;Second,based on the Term Frequency-Inverse Document Frequency(TF-IDF)vector space model and the doc2vec method,the text word frequency vector and the text semantic vector were obtained respectively,and the text word frequency vector was linearly reduced by the Principal Component Analysis(PCA)method.Third,the text word frequency vector after dimensionality reduction and the text semantic vector were combined,add the text value degree,and the text synthesis vector was constructed.Experiments show that the model combined with text word frequency vector degree after dimensionality reduction,text semantic vector,and text value has reached the highest accuracy of 84.67%.
文摘Word Sense Disambiguation has been a trending topic of research in Natural Language Processing and Machine Learning.Mining core features and performing the text classification still exist as a challenging task.Here the features of the context such as neighboring words like adjective provide the evidence for classification using machine learning approach.This paper presented the text document classification that has wide applications in information retrieval,which uses movie review datasets.Here the document indexing based on controlled vocabulary,adjective,word sense disambiguation,generating hierarchical cate-gorization of web pages,spam detection,topic labeling,web search,document summarization,etc.Here the kernel support vector machine learning algorithm helps to classify the text and feature extract is performed by cuckoo search opti-mization.Positive review and negative review of movie dataset is presented to get the better classification accuracy.Experimental results focused with context mining,feature analysis and classification.By comparing with the previous work,proposed work designed to achieve the efficient results.Overall design is per-formed with MATLAB 2020a tool.