Abstract: This paper proposes an improved naïve Bayes classifier for sentiment analysis of large-scale datasets such as YouTube comments. YouTube contains a large volume of unstructured, unorganized comments and reactions that carry important information. Organizing large amounts of data and extracting useful information is a challenging task. The extracted information can be treated as new knowledge and used for decision-making. We extract comments on YouTube videos, categorize them by domain, and then apply the naïve Bayes classifier with improved techniques. Our method achieved 80% accuracy in classifying those comments. The experiment suggests that the proposed method adapts well to large-scale text classification.
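The abstract does not describe its "improved techniques", so the following is only a minimal sketch of the baseline it builds on: a multinomial naïve Bayes text classifier with Laplace smoothing. The function names (train_nb, predict_nb), the smoothing value, and the four toy comments are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns log priors, per-class word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    return log_prior, word_counts, vocab

def predict_nb(model, tokens, alpha=1.0):
    """Return the class with the highest smoothed log posterior for the given tokens."""
    log_prior, word_counts, vocab = model
    best_label, best_score = None, -math.inf
    for c, lp in log_prior.items():
        # Laplace smoothing: add alpha to every vocabulary word's count.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        score = lp
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][t] + alpha) / denom)
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Toy, hand-written comments (illustrative only, not data from the paper).
train = [
    ("great video loved it".split(), "pos"),
    ("really helpful and clear".split(), "pos"),
    ("boring and too long".split(), "neg"),
    ("bad audio hated it".split(), "neg"),
]
model = train_nb(train)
print(predict_nb(model, "loved this helpful video".split()))
```

Working in log space avoids numeric underflow when comments are long, which matters at the scale the abstract describes.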
Abstract: The naïve Bayes classifier is one of the most commonly used data mining methods for classification. Despite its simplicity, naïve Bayes is effective and computationally efficient. Although the strong attribute independence assumption makes the naïve Bayes classifier a tractable method for learning, this assumption may not hold in real-world applications. Many enhancements to the basic algorithm have been proposed to alleviate violations of the attribute independence assumption. While these methods improve classification performance, they do not necessarily retain the mathematical structure of the naïve Bayes model, and some come at the expense of computational time. One approach to reducing the naïveté of the classifier is to incorporate attribute weights in the conditional probability. In this paper, we propose a method to incorporate attribute weights into naïve Bayes. To evaluate the performance of our method, we used public benchmark datasets and compared it with standard naïve Bayes and baseline attribute weighting methods. Experimental results show that our method improves classification performance over both standard naïve Bayes and the baseline attribute weighting methods in terms of classification accuracy and F1, especially when the independence assumption is strongly violated, which was validated using the chi-square test of independence.
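The abstract does not give its weighting formula. One common way to place attribute weights in the conditional probability, shown here only as an assumed illustration and not as the authors' method, raises each conditional to a per-attribute weight w_i, which becomes a weighted sum in log space:

```latex
\hat{c}(\mathbf{x}) = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)^{w_i}
\quad\Longleftrightarrow\quad
\hat{c}(\mathbf{x}) = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} w_i \log P(x_i \mid c) \right]
```

Setting every w_i = 1 recovers the standard naïve Bayes decision rule, so this form keeps the model's product structure while letting less reliable attributes contribute less.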
Abstract: Intrusion detection is the process of investigating information about system activities or data to detect malicious behavior or unauthorized activity. Most intrusion detection systems (IDS) implement the K-means clustering technique because of its linear complexity and fast computation. Nonetheless, its naive use of the mean data value as the cluster center is a major drawback. Two circular clusters with different radii can be centered at the same mean, and K-means cannot handle this case because the cluster means are nearly identical. It also fails when the clusters are not spherical. To overcome this issue, a new hybrid model is proposed that integrates expectation-maximization (EM) clustering using a Gaussian mixture model (GMM) with a naïve Bayes classifier. In this model, the GMM gives more flexibility than K-means in terms of cluster covariance. GMMs also use probability functions and soft clustering, so a single data point can belong to multiple clusters. In a GMM, the cluster form is defined by two parameters: the mean and the standard deviation. With these two parameters, a cluster can take any kind of elliptical shape. EM-GMM is used to cluster data into the corresponding category based on data activity.
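The case of two clusters that share a mean but differ in spread can be illustrated with a small sketch. It computes only the soft memberships (the E-step of EM) for a one-dimensional mixture whose parameters are fixed by hand; the weights, means, and standard deviations are illustrative assumptions, not values from the paper, and no full EM fit is performed.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """components: list of (weight, mean, std). Returns soft memberships that sum to 1."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Two components with the same mean but different spread (assumed values).
components = [(0.5, 0.0, 1.0), (0.5, 0.0, 5.0)]

for x in (0.5, 8.0):
    # Distance to each center is identical, so a nearest-mean rule cannot choose.
    distances = [abs(x - m) for _, m, _ in components]
    print(x, distances, responsibilities(x, components))
```

A point near the shared center is assigned mostly to the narrow component and a distant point almost entirely to the wide one, even though both centers are equally far away in each case.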
Funding: This work is supported by the KIAS (Research Number: CG076601) and in part by the Sejong University Faculty Research Fund.
Abstract: Roman Urdu has been used for text messaging over the Internet for years, especially in the Indo-Pak subcontinent. People from the subcontinent may speak the same Urdu language but use different scripts for writing. Communication on social media in Urdu written with Roman characters is now considered the most typical standard of communication in the Indian subcontinent, which makes it a rich source of information. English text classification is a solved problem, but there have been only a few efforts to examine the rich information supply of Roman Urdu. This is due to the numerous complexities involved in processing Roman Urdu data, including the non-availability of a tagged corpus, the lack of a set of rules, and the lack of standardized spellings. A large amount of Roman Urdu news data is available on mainstream news websites and social media websites such as Facebook and Twitter, but meaningful information can only be extracted if the data is in a structured format. We have developed a Roman Urdu news headline classifier, which helps classify news into relevant categories on which further analysis and modeling can be done. This research aims to develop a Roman Urdu news classifier that assigns news to five categories (health, business, technology, sports, international). First, we build the news dataset using scraping tools; then, after preprocessing, we compare the results of different machine learning algorithms: Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). After this, we use a phonetic algorithm to control lexical variation and test news from different websites. The preliminary results suggest that a more accurate classification can be accomplished by monitoring noise inside the data and by classifying the news. After applying the above-mentioned machine learning algorithms, results show that the Multinomial Naïve Bayes classifier gives the best accuracy of 90.17%, which is attributed to the noisy lexical variation.
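The abstract does not name the phonetic algorithm it uses. The sketch below is a hypothetical, deliberately simple stand-in that shows the idea of controlling lexical variation: spelling variants are mapped to one key by collapsing repeated letters and dropping vowels after the first letter. The function phonetic_key and the example spellings are assumptions for illustration only.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def phonetic_key(word):
    """Hypothetical key: collapse adjacent repeated letters, then drop vowels after the first letter."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    collapsed = []
    for ch in letters:
        if not collapsed or ch != collapsed[-1]:
            collapsed.append(ch)
    if not collapsed:
        return ""
    return collapsed[0] + "".join(ch for ch in collapsed[1:] if ch not in VOWELS)

def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = defaultdict(list)
    for w in words:
        groups[phonetic_key(w)].append(w)
    return dict(groups)

# Illustrative spelling variants; not drawn from the paper's dataset.
print(group_variants(["zindagi", "zindagee", "zindgi", "mohabbat", "muhabat"]))
```

Replacing each token with its key before feature extraction would let a classifier treat such variants as one feature, at the cost of occasionally merging unrelated words.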
Abstract: Crimes are expected to rise with an increase in population and the widening gap between society's income levels. Crimes contribute a significant portion of the socioeconomic loss to any society, not only through indirect damage to the social fabric and peace but also through more direct negative impacts on the economy, social parameters, and reputation of a nation. Policing and other preventive resources are limited and must be utilized efficiently. Conventional methods are being superseded by more modern machine learning approaches capable of making predictions where the relationships between the features and the outcomes are complex. This makes it possible for such algorithms to indicate specific areas that may become crime hot spots. These predictions can be used by policymakers and police personnel alike to devise effective and informed strategies that can curtail criminal activities and contribute to the nation's development. This paper aims to predict the factors that most affect crime in Saudi Arabia by developing a machine learning model that predicts an acceptable output value. Our results show that FAMD as a feature selection method yielded higher accuracy across machine learning classifiers than PCA. The naïve Bayes classifier performs better than the other classifiers with both feature selection methods, with an accuracy of 97.53% for FAMD and 97.10% for PCA.