42 articles found
Word Embeddings and Semantic Spaces in Natural Language Processing (Cited by: 1)
1
Author: Peter J. Worth. Source: International Journal of Intelligence Science, 2023, Issue 1, pp. 1-21 (21 pages)
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, effectively to feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, which are typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.
Keywords: Natural Language Processing Vector Space Models Semantic Spaces Word embeddings Representation Learning Text Vectorization Machine Learning Deep Learning
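The count-based end of the spectrum described in the abstract above can be illustrated with a small sketch. It is not taken from the paper: classical Latent Semantic Analysis factorizes a term-document matrix, whereas this example substitutes a word-word co-occurrence matrix for brevity, and the toy corpus, the window size, and the function names are all hypothetical.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each ordered word pair occurs within `window` tokens."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts

def svd_embeddings(counts, dim=2):
    """Truncated SVD: keep the top `dim` singular directions as word vectors."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dim] * s[:dim]

def cosine(a, b):
    """Cosine similarity, with a tiny epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical toy corpus; real systems use corpora with millions of tokens.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]
vocab, counts = cooccurrence_matrix(corpus)
vectors = svd_embeddings(counts, dim=2)
```

Under the distributional hypothesis, words that share contexts (here `cat` and `dog`) would be expected to end up with similar vectors, although a three-sentence corpus is far too small to show this reliably.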
Aspect-Based Sentiment Classification Using Deep Learning and Hybrid of Word Embedding and Contextual Position
2
Authors: Waqas Ahmad, Hikmat Ullah Khan, Fawaz Khaled Alarfaj, Saqib Iqbal, Abdullah Mohammad Alomair, Naif Almusallam. Source: Intelligent Automation & Soft Computing (SCIE), 2023, Issue 9, pp. 3101-3124 (24 pages)
Aspect-based sentiment analysis aims to detect and classify sentiment polarities as negative, positive, or neutral while associating them with their identified aspects from the corresponding context. In this regard, prior methodologies widely utilize either word embedding or tree-based representations. Meanwhile, the separate use of deep features such as word embedding and tree-based dependencies has become a significant cause of information loss. Generally, word embedding preserves the syntactic and semantic relations between a couple of terms lying in a sentence. Besides, the tree-based structure conserves the grammatical and logical dependencies of context. In addition, the sentence-oriented word position is a critical factor that influences the contextual information of a targeted sentence. Therefore, knowledge of the position-oriented information of words in a sentence has been considered significant. In this study, we propose to use word embedding, tree-based representation, and contextual position information in combination to evaluate whether their combination improves the effectiveness of the results. Their joint utilization enhances the accurate identification and extraction of targeted aspect terms, which also influences their classification process. In this research paper, we propose a method named Attention-Based Multi-Channel Convolutional Neural Network (Att-MC-CNN) that jointly utilizes these three deep features: word embedding, tree-based structure, and contextual position information. These three features are delivered to a Multi-Channel Convolutional Neural Network (MC-CNN) that identifies and extracts the potential terms and classifies their polarities. In addition, these terms are further filtered with the attention mechanism, which determines the most significant words. The empirical analysis demonstrates the proposed approach’s effectiveness compared to existing techniques when evaluated on standard datasets. The experimental results show that our approach outperforms them in the F1 measure, with an overall achievement of 94% in identifying aspects and 92% in the task of sentiment classification.
Keywords: Sentiment analysis word embedding aspect extraction consistency tree multichannel convolutional neural network contextual position information
Enhanced Image Captioning Using Features Concatenation and Efficient Pre-Trained Word Embedding
3
Authors: Samar Elbedwehy, T. Medhat, Taher Hamza, Mohammed F. Alrahmawy. Source: Computer Systems Science & Engineering (SCIE, EI), 2023, Issue 9, pp. 3637-3652 (16 pages)
One of the issues in Computer Vision is the automatic generation of descriptions for images, known as image captioning. Deep Learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to find the most expressive extracted feature vector of the image. For the caption generation lingual subsystem, this paper tested three different pre-trained language embedding models, GloVe (Global Vectors for Word Representation), BERT (Bidirectional Encoder Representations from Transformers), and TaCL (Token-aware Contrastive Learning), to select the most accurate one. Our experiments showed that an image captioning system that uses a concatenation of the two Transformer-based models SWIN (Shifted window) and PVT (Pyramid Vision Transformer) as the image feature extractor, combined with the TaCL language embedding model, gives the best result among the tested combinations.
Keywords: Image captioning word embedding CONCATENATION TRANSFORMER
Hybrid Scalable Researcher Recommendation System Using Azure Data Lake Analytics
4
Authors: Dinesh Kalla, Nathan Smith, Fnu Samaah, Kiran Polimetla. Source: Journal of Data Analysis and Information Processing, 2024, Issue 1, pp. 76-88 (13 pages)
This research paper provides the methodology and design for implementing a hybrid author recommender system using Azure Data Lake Analytics and Power BI. It offers recommendations for the top 1000 authors of computer science in different fields of study. The technique used in this paper handles inadequate citation information; it removes the cold-start problem, which is encountered by many other recommender systems. In this paper, abstracts, titles, and the Microsoft Academic Graph have been used to produce the recommendation list for every document, combining content-based approaches and co-citations. Tuning the system parameters allows each technique to be prioritized and blended, trading off authority in recommendation results against paper novelty. In the end, we observe a direct correlation between the similarity rankings produced by the system and the scores of the participants. The results from the associated analysis scripts and the user survey have been made available through the recommendation system. Managers must gain the required expertise to fully utilize the benefits that come with business intelligence systems [1]. Data mining has become an important tool for managers that provides insights about their daily operations and leverages the information provided by decision support systems to improve customer relationships [2]. Additionally, managers require business intelligence systems that can rank the output in order of priority. Ranking algorithms can replace the traditional data mining algorithms that will be discussed in depth in the literature review [3].
Keywords: Azure Data Lake U-SQL Author Recommendation System Power BI Microsoft Academic Big Data Word embedding
Novel Representations of Word Embedding Based on the Zolu Function
5
Authors: Jihua Lu, Youcheng Zhang. Source: Journal of Beijing Institute of Technology (EI, CAS), 2020, Issue 4, pp. 526-530 (5 pages)
Two learning models, Zolu-continuous bags of words (ZL-CBOW) and Zolu-skip-grams (ZL-SG), based on the Zolu function are proposed. The slope of ReLU in word2vec has been changed by the Zolu function. The proposed models can process extremely large data sets as well as word2vec without increasing the complexity. Also, the models outperform several word embedding methods both in word similarity and syntactic accuracy. ZL-CBOW outperforms CBOW in accuracy by 8.43% on the training set of capital-world, and by 1.24% on the training set of plural-verbs. Moreover, experimental simulations on word similarity and syntactic accuracy show that ZL-CBOW and ZL-SG are superior to LL-CBOW and LL-SG, respectively.
Keywords: Zolu function word embedding continuous bags of words word similarity accuracy
Neural Machine Translation Models with Attention-Based Dropout Layer
6
Authors: Huma Israr, Safdar Abbas Khan, Muhammad Ali Tahir, Muhammad Khuram Shahzad, Muneer Ahmad, Jasni Mohamad Zain. Source: Computers, Materials & Continua (SCIE, EI), 2023, Issue 5, pp. 2981-3009 (29 pages)
In bilingual translation, attention-based Neural Machine Translation (NMT) models are used to achieve synchrony between input and output sequences and the notion of alignment. NMT models have obtained state-of-the-art performance for several language pairs. However, there has been little work exploring useful architectures for Urdu-to-English machine translation. We conducted extensive Urdu-to-English translation experiments using Long Short-Term Memory (LSTM), Bidirectional Recurrent Neural Networks (Bi-RNN), Statistical Recurrent Unit (SRU), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and Transformer. Experimental results show that Bi-RNN and LSTM with attention mechanism, trained iteratively with a scalable data set, make precise predictions on unseen data. The trained models yielded competitive results, achieving 62.6% and 61% accuracy and 49.67 and 47.14 BLEU scores, respectively. From a qualitative perspective, the translation of the test sets was examined manually, and it was observed that trained models tend to produce repetitive output more frequently. The attention scores produced by Bi-RNN and LSTM showed clear alignment, while GRU showed incorrect translation for words, poor alignment and lack of a clear structure. Therefore, we considered refining the attention-based models by defining an additional attention-based dropout layer. Attention dropout fixes alignment errors and minimizes translation errors at the word level. After empirical demonstration and comparison with their counterparts, we found improvement in the quality of the resulting translation system and a decrease in the perplexity and over-translation score. The ability of the proposed model was evaluated using Arabic-English and Persian-English datasets as well. We empirically concluded that adding an attention-based dropout layer helps improve GRU, SRU, and Transformer translation and is considerably more efficient in translation quality and speed.
Keywords: Natural language processing neural machine translation word embedding ATTENTION PERPLEXITY selective dropout regularization URDU PERSIAN Arabic BLEU
Quantum Particle Swarm Optimization with Deep Learning-Based Arabic Tweets Sentiment Analysis
7
Authors: Badriyya B. Al-onazi, Abdulkhaleq Q.A. Hassan, Mohamed K. Nour, Mesfer Al Duhayyim, Abdullah Mohamed, Amgad Atta Abdelmageed, Ishfaq Yaseen, Gouse Pasha Mohammed. Source: Computers, Materials & Continua (SCIE, EI), 2023, Issue 5, pp. 2575-2591 (17 pages)
Sentiment Analysis (SA), a Machine Learning (ML) technique, is often applied in the literature. The SA technique is specifically applied to the data collected from social media sites. Earlier research studies on the SA of tweets were mostly aimed at automating the feature extraction process. Against this background, the current study introduces a novel method called Quantum Particle Swarm Optimization with Deep Learning-Based Sentiment Analysis on Arabic Tweets (QPSODL-SAAT). The presented QPSODL-SAAT model determines and classifies the sentiments of tweets written in Arabic. Initially, data pre-processing is performed to convert the raw tweets into a useful format. Then, the word2vec model is applied to generate the feature vectors. The Bidirectional Gated Recurrent Unit (BiGRU) classifier is utilized to identify and classify the sentiments. Finally, the QPSO algorithm is exploited for the optimal fine-tuning of the hyperparameters involved in the BiGRU model. The proposed QPSODL-SAAT model was experimentally validated using standard datasets. An extensive comparative analysis was conducted, and the proposed model achieved a maximum accuracy of 98.35%. The outcomes confirmed the superiority of the proposed QPSODL-SAAT model over the other approaches, such as the Surface Features (SF), Generic Embeddings (GE), Arabic Sentiment Embeddings constructed using the Hybrid (ASEH) model and the Bidirectional Encoder Representations from Transformers (BERT) model.
Keywords: Sentiment analysis Arabic tweets quantum particle swarm optimization deep learning word embedding
Personality Assessment Based on Natural Stream of Thoughts Empowered with Machine Learning
8
Authors: Mohammed Salahat, Liaqat Ali, Taher M. Ghazal, Haitham M. Alzoubi. Source: Computers, Materials & Continua (SCIE, EI), 2023, Issue 7, pp. 1-17 (17 pages)
Knowing each other is obligatory in a multi-agent collaborative environment. Collaborators may develop the desired know-how of each other in various aspects such as habits, job roles, status, and behaviors. Among the different distinguishing characteristics related to a person, personality traits are an effective predictive tool for an individual’s behavioral pattern. It has been observed that when people are asked to share their details through questionnaires, they intentionally or unintentionally become biased. In open writing about themselves, by contrast, they knowingly or unknowingly provide enough information in a much less biased manner. Such writings can effectively assess an individual’s personality traits, which may yield enormous possibilities for applications such as forensic departments, job interviews, mental health diagnoses, etc. Stream of consciousness, collected by James Pennebaker and Laura King, is one such way of writing, referring to a narrative technique where the emotions and thoughts of the writer are presented in a way that carries the reader fluidly through the mental states of the narrator. Moreover, computationally, various attempts have been made at assessing an individual’s personality traits through deep learning algorithms; however, the effectiveness and reliability of results vary with varying word embedding techniques. This article proposes an empirical approach to assessing personality by applying convolutional networks to text documents. The Bidirectional Encoder Representations from Transformers (BERT) word embedding technique is used for word vector generation to enhance the contextual meanings.
Keywords: Personality traits convolutional neural network deep learning word embedding
Improved Metaheuristics with Deep Learning Enabled Movie Review Sentiment Analysis
9
Authors: Abdelwahed Motwakel, Najm Alotaibi, Eatedal Alabdulkreem, Hussain Alshahrani, Mohamed Ahmed Elfaki, Mohamed K. Nour, Radwa Marzouk, Mahmoud Othman. Source: Computer Systems Science & Engineering (SCIE, EI), 2023, Issue 10, pp. 1249-1266 (18 pages)
Sentiment Analysis (SA) of natural language text is not only a challenging process but also gains significance in various Natural Language Processing (NLP) applications. SA is utilized in various applications, namely education (to improve the learning and teaching processes), marketing strategies, customer trend predictions, and the stock market. Various researchers have applied lexicon-related approaches, Machine Learning (ML) techniques and so on to conduct SA for multiple languages, for instance, English and Chinese. Due to the increased popularity of Deep Learning models, the current study used diverse configuration settings of the Convolution Neural Network (CNN) model and conducted SA for Hindi movie reviews. The current study introduces an Effective Improved Metaheuristics with Deep Learning (DL)-Enabled Sentiment Analysis for Movie Reviews (IMDLSA-MR) model. The presented IMDLSA-MR technique initially applies different levels of pre-processing to convert the input data into a compatible format. Besides, the Term Frequency-Inverse Document Frequency (TF-IDF) model is exploited to generate the word vectors from the pre-processed data. The Deep Belief Network (DBN) model is utilized to analyse and classify the sentiments. Finally, the Improved Jellyfish Search Optimization (IJSO) algorithm is utilized for optimal fine-tuning of the hyperparameters related to the DBN model, which shows the novelty of the work. Different experimental analyses were conducted to validate the better performance of the proposed IMDLSA-MR model. The comparative study outcomes highlighted the enhanced performance of the proposed IMDLSA-MR model over recent DL models with a maximum accuracy of 98.92%.
Keywords: Corpus linguistics sentiment analysis natural language processing deep learning word embedding
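The TF-IDF weighting mentioned in the abstract above can be sketched in a few lines. This is only an illustration of the textbook formula (term frequency normalized by document length, times log(N / document frequency)), not the authors' pipeline; the `tfidf` function name and the two toy reviews are hypothetical, and production libraries usually add smoothing that is omitted here.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights

# Hypothetical toy reviews, already tokenized.
reviews = [
    "great film great cast".split(),
    "boring film".split(),
]
vectors = tfidf(reviews)
```

A term that appears in every document, such as `film` here, receives weight zero, which is why TF-IDF emphasizes words that distinguish one review from another.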
An Intelligent Deep Neural Sentiment Classification Network
10
Authors: Umamaheswari Ramalingam, Senthil Kumar Murugesan, Karthikeyan Lakshmanan, Chidhambararajan Balasubramaniyan. Source: Intelligent Automation & Soft Computing (SCIE), 2023, Issue 5, pp. 1733-1744 (12 pages)
A Deep Neural Sentiment Classification Network (DNSCN) is developed in this work to classify Twitter data unambiguously. It attempts to extract the negative and positive sentiments in the Twitter database. The main goal of the system is to find the sentiment behavior of tweets with minimum ambiguity. A well-defined neural network extracts deep features from the tweets automatically. Before extracting features deeper and deeper, the text in each tweet is represented by Bag-of-Words (BoW) and Word Embeddings (WE) models. The effectiveness of the DNSCN architecture is analyzed using Twitter-Sanders-Apple2 (TSA2), Twitter-Sanders-Apple3 (TSA3), and Twitter-DataSet (TDS). TSA2 and TDS consist of positive and negative tweets, whereas TSA3 has neutral tweets also. Thus, the proposed DNSCN acts as a binary classifier for the TSA2 and TDS databases and a multiclass classifier for TSA3. The performance of the DNSCN architecture is evaluated by F1 score, precision, and recall rates using 5-fold and 10-fold cross-validation. Results show that the DNSCN-WE model provides more accuracy than the DNSCN-BoW model for representing the tweets in the feature encoding. The F1 score of the DNSCN-BW based system is 0.98 on the TSA2 database (binary classification) and 0.97 on the TSA3 database (three-class classification). This system provides a better F1 score of 0.99 for the TDS database.
Keywords: Deep neural network word embeddings BAG-OF-words sentiment analysis text classification
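The precision, recall, and F1 measures reported in the abstract above follow from counts of true positives, false positives, and false negatives. The sketch below shows the standard binary definitions only; the function name and the toy labels are hypothetical and unrelated to the TSA2, TSA3, or TDS data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute binary precision, recall, and F1 for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = positive tweet, 0 = negative tweet.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these toy labels there are three true positives, one false positive, and one false negative, so all three measures come out to 0.75. In a k-fold setting such as the one the abstract describes, the same function would be applied per fold and the scores averaged.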
Translation of English Language into Urdu Language Using LSTM Model
11
Authors: Sajadul Hassan Kumhar, Syed Immamul Ansarullah, Akber Abid Gardezi, Shafiq Ahmad, Abdelaty Edrees Sayed, Muhammad Shafiq. Source: Computers, Materials & Continua (SCIE, EI), 2023, Issue 2, pp. 3899-3912 (14 pages)
English-to-Urdu machine translation is still in its infancy and lacks simple translation methods that provide motivating and adequate English-to-Urdu translation. In order to make knowledge available to the masses, there should be mechanisms and tools in place to make things understandable by translating from source language to target language in an automated fashion. Machine translation has achieved this goal with encouraging results. When decoding the source text into the target language, the translator checks all the characteristics of the text. To achieve machine translation, rule-based, computational, hybrid and neural machine translation approaches have been proposed to automate the work. In this research work, a neural machine translation approach is employed to translate English text into Urdu. A Long Short-Term Memory (LSTM) encoder-decoder is used to translate English to Urdu. The various steps required to perform translation tasks include preprocessing, tokenization, grammar and sentence structure analysis, word embeddings, training data preparation, encoder-decoder models, and output text generation. The results show that the model used in the research work shows better performance in translation. The results were evaluated using bilingual research metrics and showed that the test and training data yielded the highest score sequences with an effective length of ten (10).
Keywords: Machine translation Urdu language word embedding
A Data Mining Approach to Detecting Bias and Favoritism in Public Procurement
12
Authors: Yeferson Torres-Berru, Vivian F. Lopez-Batista, Lorena Conde Zhingre. Source: Intelligent Automation & Soft Computing (SCIE), 2023, Issue 6, pp. 3501-3516 (16 pages)
In a public procurement process, corruption can occur at each stage, favoring a participant with a previous agreement, which can result in over-pricing and purchases of substandard products, as well as gender discrimination. This paper’s aim is to detect biased purchases using a Spanish-language corpus, analyzing text from the questions and answers registry platform by applicants in a public procurement process in Ecuador. Additionally, gender bias is detected, promoting both men and women to participate under the same conditions. In order to detect gender bias and favoritism towards certain providers by contracting entities, the study proposes a unique hybrid model that combines Artificial Intelligence algorithms and Natural Language Processing (NLP). In the experimental work, 303,076 public procurement processes have been analyzed over 10 years (since 2010) with 1,009,739 questions and answers to suppliers and public institutions in each process. Gender bias and favoritism were analyzed using a Word2vec model with word embedding, as well as sentiment analysis of the questions and answers using the VADER algorithm. In 32% of cases (96,984 answers), there was favoritism or gender bias as evidenced by responses from contracting entities. The proposed model provides accuracy rates of 88% for detecting favoritism, and 90% for detecting gender bias. Consequently, one-third of the procurement processes carried out by the state have indications of corruption and bias. In Latin America, government corruption is one of the most significant challenges, making the resulting classifier useful for detecting bias and favoritism in public procurement processes.
Keywords: FAVORITISM BIAS natural language processing Word2vec sentiment analysis word embeddings
Suggestion Mining from Opinionated Text of Big Social Media Data (Cited by: 6)
13
Authors: Youseef Alotaibi, Muhammad Noman Malik, Huma Hayat Khan, Anab Batool, Saif ul Islam, Abdulmajeed Alsufyani, Saleh Alghamdi. Source: Computers, Materials & Continua (SCIE, EI), 2021, Issue 9, pp. 3323-3338 (16 pages)
Social media data are rapidly increasing and constitute a source of user opinions and tips on a wide range of products and services. The increasing availability of such big data on biased reviews and blogs creates challenges for customers and businesses in reviewing all content in their decision-making process. To overcome this challenge, extracting suggestions from opinionated text is a possible solution. In this study, the characteristics of suggestions are analyzed and a suggestion mining extraction process is presented for classifying suggestive sentences from online customers’ reviews. A classification using a word-embedding approach is used via the XGBoost classifier. The two datasets used in this experiment relate to online hotel reviews and Microsoft Windows App Studio discussion reviews. F1, precision, recall, and accuracy scores are calculated. The results demonstrated that the XGBoost classifier outperforms the compared classifiers, with an accuracy of more than 80%. Moreover, the results revealed that suggestion keywords and phrases are the predominant features for suggestion extraction. Thus, this study contributes to knowledge and practice by comparing feature extraction classifiers and identifying XGBoost as a better suggestion mining process for identifying online reviews.
Keywords: Suggestion mining word embedding Naïve Bayes random forest XGBoost DATASET
Multi-Level Knowledge Engineering Approach for Mapping Implicit Aspects to Explicit Aspects (Cited by: 3)
14
Authors: Jibran Mir, Azhar Mahmood, Shaheen Khatoon. Source: Computers, Materials & Continua (SCIE, EI), 2022, Issue 2, pp. 3491-3509 (19 pages)
Aspect extraction is a critical task in aspect-based sentiment analysis, including explicit and implicit aspect identification. While extensive research has identified explicit aspects, little effort has been put forward on implicit aspect extraction due to the complexity of the problem. Moreover, existing research on implicit aspect identification is widely carried out on product reviews targeting specific aspects while neglecting sentences’ dependency problems. Therefore, in this paper, a multi-level knowledge engineering approach for identifying implicit movie aspects is proposed. The proposed method first identifies explicit aspects using a variant of BiLSTM and CRF (Bidirectional Long Short-Term Memory-Conditional Random Field), which serve as a memory to process dependent sentences to infer implicit aspects. It can identify implicit aspects from four types of sentences, including independent and three types of dependent sentences. The study is evaluated on a large movie reviews dataset with 50k examples. The experimental results showed that the explicit aspect identification method achieved 89% F1-score and implicit aspect extraction methods achieved 76% F1-score. In addition, the proposed approach also performs better than the state-of-the-art techniques (NMFIAD and ML-KB+) on the product review dataset, where it achieved 93% precision, 92% recall, and 93% F1-score.
Keywords: Movie NEs (named entities) ASPECTS opinion words annotation process memory implicit aspects implicit aspects mapping word embedding and BiLSTM
Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches (Cited by: 1)
15
Authors: Koraljka Golub, Johan Hagelback, Anders Ardo. Source: Journal of Data and Information Science (CSCD), 2020, Issue 1, pp. 18-38 (21 pages)
Purpose: With more and more digital collections of various information resources becoming available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records which had to be reduced to top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Findings: Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine, but reach close results, with the benefit of a smaller representation size. Impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often less than 100 per class) on which to train are available, and these hold only for top 3 hierarchical levels (803 instead of 14,413 classes). Research limitations: Having to reduce the number of hierarchical levels to top three levels of DDC because of the lack of training data for all classes skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems. Practical implications: In conclusion, for operative information retrieval systems applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance accuracy of automatic classification which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with quality of human decisions at the final stage should be the way for the future. Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
Keywords: LIBRIS; Dewey Decimal Classification; Automatic classification; Machine learning; Support Vector Machine; Multinomial Naive Bayes; Simple linear network; Standard neural network; 1D convolutional neural network; Recurrent neural network; Word embeddings; String matching
Identification of Sarcasm in Textual Data: A Comparative Study (cited by: 1)
16
Authors: Pulkit Mehndiratta, Devpriya Soni. Source: Journal of Data and Information Science (indexed: CSCD), 2019, No. 4, pp. 56-83 (28 pages)
Purpose: The ever-increasing penetration of the Internet in our lives has led to an enormous amount of multimedia content generation on the internet. Textual data contributes a major share of the data generated on the world wide web. Understanding people's sentiment is an important aspect of natural language processing, but this opinion can be biased and incorrect if people use sarcasm while commenting, posting status updates or reviewing any product or a movie. Thus, it is of utmost importance to detect sarcasm correctly and make a correct prediction about people's intentions. Design/methodology/approach: This study evaluates various machine learning models along with standard and hybrid deep learning models across various standardized datasets. We have performed vectorization of text using word embedding techniques, converting the textual data into vectors for analytical purposes. We have used three standardized datasets available in the public domain and three word embeddings, i.e., Word2Vec, GloVe and fastText, to validate the hypothesis. Findings: The results were analyzed and conclusions drawn. The key finding is that the hybrid models that include Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) outperform other conventional machine learning as well as deep learning models across all the datasets considered in this study, making our hypothesis valid. Research limitations: Using data from different sources and customizing the models according to each dataset slightly decreases the usability of the technique. Overall, however, this methodology provides effective measures to identify the presence of sarcasm, with a minimum average accuracy of 80% or above for one dataset and better than the current baseline results for the other datasets. Practical implications: The results provide solid insights for system developers to integrate this model into real-time analysis of any review or comment posted in the public domain. This study has various other practical implications for businesses that depend on user ratings and public opinions. This study also provides a launching platform for various researchers to work on the problem of sarcasm identification in textual data. Originality/value: This is a first-of-its-kind study showing the difference between conventional and hybrid methods of predicting sarcasm in textual data. The study also provides possible indicators that hybrid models are better when applied to textual data for analysis of sarcasm.
Keywords: Machine learning; Artificial neural networks; Word embedding; Text vectorization; Accuracy
Emvirus: An embedding-based neural framework for human-virus protein-protein interactions prediction (cited by: 1)
17
Authors: Pengfei Xie, Jujuan Zhuang, Geng Tian, Jialiang Yang. Source: Biosafety and Health (indexed: CAS, CSCD), 2023, No. 3, pp. 152-158 (7 pages)
Human-virus protein-protein interactions (PPIs) play critical roles in viral infection. For example, the spike protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) binds primarily to human angiotensin-converting enzyme 2 (ACE2) protein to infect human cells. Thus, identifying and blocking these PPIs contribute to controlling and preventing viruses. However, wet-lab experiment-based identification of human-virus PPIs is usually expensive, labor-intensive, and time-consuming, which presents the need for computational methods. Many machine-learning methods have been proposed recently and achieved good results in predicting human-virus PPIs. However, most methods are based on protein sequence features and apply manually extracted features, such as statistical characteristics, phylogenetic profiles, and physicochemical properties. In this work, we present an embedding-based neural framework with convolutional neural network (CNN) and bi-directional long short-term memory unit (Bi-LSTM) architecture, named Emvirus, to predict human-virus PPIs (including human-SARS-CoV-2 PPIs). In addition, we conduct cross-viral experiments to explore the generalization ability of Emvirus. Compared to other feature extraction methods, Emvirus achieves better prediction accuracy.
Keywords: SARS-CoV-2; human-virus PPI; Word embedding; Doc2vec; Neural networks
Machine Learning-Based Advertisement Banner Identification Technique for Effective Piracy Website Detection Process
18
Authors: Lelisa Adeba Jilcha, Jin Kwak. Source: Computers, Materials & Continua (indexed: SCIE, EI), 2022, No. 5, pp. 2883-2899 (17 pages)
In the contemporary world, digital content that is subject to copyright is facing significant challenges against the act of copyright infringement. Billions of dollars are lost annually because of this illegal act. The current most effective trend to tackle this problem is believed to be blocking those websites, particularly through affiliated government bodies. To do so, an effective detection mechanism is a necessary first step. Some researchers have used various approaches to analyze the possible common features of suspected piracy websites. For instance, most of these websites serve online advertisement, which is considered as their main source of revenue. In addition, these advertisements have some common attributes that make them unique as compared to advertisements posted on normal or legitimate websites. They usually encompass keywords such as click-words (words that redirect to install malicious software) and frequently used words in illegal gambling, illegal sexual acts, and so on. This makes them ideal to be used as one of the key features in the process of successfully detecting websites involved in the act of copyright infringement. Research has been conducted to identify advertisements served on suspected piracy websites. However, these studies use a static approach that relies mainly on manual scanning for the aforementioned keywords. This brings with it some limitations, particularly in coping with the dynamic and ever-changing behavior of advertisements posted on these websites. Therefore, we propose a technique that can continuously fine-tune itself and is intelligent enough to effectively identify advertisement (Ad) banners extracted from suspected piracy websites.
We have done this by leveraging the power of machine learning algorithms, particularly the support vector machine with the word2vec word-embedding model. After applying the proposed technique to 1015 Ad banners collected from 98 suspected piracy websites and 90 normal or legitimate websites, we were able to successfully identify Ad banners extracted from suspected piracy websites with an accuracy of 97%. We present this technique with the hope that it will be a useful tool for various effective piracy website detection approaches. To our knowledge, this is the first approach that uses machine learning to identify Ad banners served on suspected piracy websites.
Keywords: Copyright infringement; piracy website detection; online advertisement; advertisement banners; machine learning; support vector machine; word embedding; word2vec
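The abstract above pairs a word2vec embedding with a support vector machine but does not say how the words of one banner become a single fixed-length input. One common choice, assumed here and not confirmed by the source, is to average the word vectors of the banner text. The sketch below shows only that averaging step; `average_embedding`, `TOY_VECTORS`, and the two-dimensional values are hypothetical, chosen for illustration, and are not the authors' model or data.

```python
from typing import Dict, List


def average_embedding(tokens: List[str],
                      vectors: Dict[str, List[float]],
                      dim: int) -> List[float]:
    """Average the vectors of all known tokens into one feature vector.

    Tokens without a vector are skipped; if none are known, a zero
    vector of length `dim` is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional word vectors (real word2vec vectors are
# learned from a corpus and typically have 100 or more dimensions).
TOY_VECTORS = {
    "click": [1.0, 0.0],
    "install": [0.8, 0.2],
    "news": [0.0, 1.0],
}

# "now" has no vector in the toy table, so it is ignored.
banner_feature = average_embedding(["click", "install", "now"], TOY_VECTORS, dim=2)
```

A vector built this way could then be passed to any classifier that accepts fixed-length numeric features, such as a linear SVM.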
Deep Neural Network and Pseudo Relevance Feedback Based Query Expansion
19
Authors: Abhishek Kumar Shukla, Sujoy Das. Source: Computers, Materials & Continua (indexed: SCIE, EI), 2022, No. 5, pp. 3557-3570 (14 pages)
The neural network has attracted researchers immensely in the last couple of years due to its wide applications in various areas such as data mining, natural language processing, image processing, and information retrieval. Word embedding has been applied by many researchers to information retrieval tasks. In this paper a word embedding-based skip-gram model has been developed for the query expansion task. Vocabulary terms are obtained from the top "k" initially retrieved documents using the pseudo relevance feedback model, and they are then trained using the skip-gram model to find the expansion terms for the user query. The performance of the model based on mean average precision (MAP) is 0.3176. The proposed model is compared with other existing models. An improvement of 6.61%, 6.93%, and 9.07% on MAP value is observed compared to the original query, the BM25 model, and query expansion with the Chi-Square model respectively. The proposed model also retrieves 84, 25, and 81 additional relevant documents compared to the original query, query expansion with the Chi-Square model, and the BM25 model respectively, and thus also improves the recall value. The per-query analysis reveals that the proposed model performs well in 30, 36, and 30 queries compared to the original query, query expansion with the Chi-Square model, and the BM25 model respectively.
Keywords: Information retrieval; query expansion; word embedding; neural network; deep neural network
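The query-expansion abstract above reports its results as mean average precision (MAP). As a reminder of what that figure measures, here is a minimal sketch of the standard MAP computation. The function names and the document identifiers are hypothetical, and the toy rankings are not data from the paper.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Sums precision@k at every rank k that holds a relevant document and
    divides by the total number of relevant documents for the query.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision over all queries."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with hypothetical document ids.
ap_first = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})  # (1/1 + 2/3) / 2
ap_second = average_precision(["d2", "d1"], {"d1"})                    # (1/2) / 1
map_score = mean_average_precision([
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
])
```

Because each relevant document contributes the precision at its own rank, moving relevant documents toward the top of the ranking, which is the goal of query expansion, raises the score.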
Payload Encoding Representation from Transformer for Encrypted Traffic Classification
20
Authors: HE Hongye, YANG Zhiguo, CHEN Xiangning. Source: ZTE Communications, 2021, No. 4, pp. 90-97 (8 pages)
Traffic identification is becoming more important, yet more challenging, as related encryption techniques are rapidly developing. Unlike recent deep learning methods that apply image processing to solve such encrypted traffic problems, in this paper we propose a method named Payload Encoding Representation from Transformer (PERT) to perform automatic traffic feature extraction using a state-of-the-art dynamic word embedding technique. By implementing traffic classification experiments on a public encrypted traffic data set and our captured Android HTTPS traffic, we show that the proposed method achieves clearly better effectiveness than the compared baselines. To the best of our knowledge, this is the first time encrypted traffic classification with dynamic word embedding has been addressed.
Keywords: traffic identification; encrypted traffic classification; natural language processing; deep learning; dynamic word embedding