Cyberbullying is the act of threatening or humiliating another person online by sending unpleasant messages or comments. Bangla text has recently become much more common on social media, where people communicate through messages and comments. Bullies therefore find social media a rich environment for bullying others, especially on political issues. Cyberbullying disputes on political social media posts are common today and often cause considerable harm. However, little work has been done on monitoring Bangla text on social media, and no work has yet addressed detecting bullying Bangla text on political issues, owing to the lack of annotated corpora and morphological analyzers. In this work, we used several machine learning classifiers and a model to help detect Bangla bullying text on social media. We collected 11,000 Bangla texts from the comments sections of political Facebook posts to build a new dataset and labelled each text as either bullying or not. This dataset was used to train the machine learning classifiers. The results indicate that Random Forest achieves the highest accuracy, 91.08%.
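In recent years, deep learning has played a major role in short text clustering, and the recently proposed Short Text Clustering (STC) algorithm has achieved good results in this area. To further improve clustering accuracy and optimize algorithm performance, an improved stochastic neighbor embedding algorithm based on the exponential function is proposed. The algorithm uses an exponential function to measure the distance between sample points and cluster centers, which amplifies the differences between features, and at a later stage uses the k-means++ algorithm to determine the cluster centers and the number of clusters in advance. Experiments on the Stackoverflow dataset show that the stochastic exponential embedding clustering model (e-STC) outperforms the original STC model in both accuracy and normalized mutual information, with a relative improvement of 3.2% in accuracy and 2.9% in mutual information.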
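The k-means++ step mentioned in this abstract can be illustrated with a minimal sketch of the standard seeding procedure. This is not the e-STC implementation: the function names, the `seed` parameter, and the toy points are assumptions made for illustration, and the number of clusters `k` is taken as an input because the abstract does not say how it is chosen.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers with standard k-means++ seeding.

    The first center is drawn uniformly at random. Each later center is
    drawn with probability proportional to its squared distance from the
    nearest center chosen so far, so far-away points are favored.
    """
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [
            min(squared_distance(p, c) for c in centers) for p in points
        ]
        if sum(weights) == 0:
            # Every point coincides with an existing center.
            raise ValueError("fewer distinct points than requested centers")
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Toy 2-D points in three loose groups (illustrative data only).
toy_points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),
    (9.9, 0.1), (10.1, 0.0), (10.0, 0.3),
]
initial_centers = kmeans_pp_init(toy_points, k=3, seed=0)
```

The returned centers would normally be handed to an ordinary k-means loop; the weighting by squared distance is what distinguishes this seeding from uniform random initialization.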
Mining the rich semantic information hidden in heterogeneous information networks is one of the important tasks of data mining. Generally, a nuclear medicine text consists of a description of the disease (i.e., lesions) and the diagnostic results. However, constructing a computer-aided diagnostic model from a large number of medical texts is a challenging task. To automatically diagnose diseases with SPECT imaging, in this work we create a knowledge-based diagnostic model by exploring the association between a disease and its properties. First, an overview of nuclear medicine and data mining is presented. Second, a method for preprocessing textual nuclear medicine diagnostic reports is proposed. Finally, diagnostic models based on random forest and SVM are proposed. Experimental evaluation conducted on real-world diagnostic reports of SPECT imaging demonstrates that our diagnostic models are workable and effective at automatically identifying diseases from textual diagnostic reports.
With the increasing interest in e-commerce shopping, customer reviews have become one of the most important elements that determine customer satisfaction regarding products. This demonstrates the importance of working with Text Mining. This study is based on the Women’s Clothing E-Commerce Reviews database, which consists of reviews written by real customers. The aim of this paper is to apply a Text Mining approach to a set of customer reviews. Each review was classified as either positive or negative by employing a classification method. Four tree-based methods were applied to solve the classification problem, namely Classification Tree, Random Forest, Gradient Boosting and XGBoost. The dataset was split into training and test sets. The results indicate that Random Forest overfits, XGBoost overfits if the number of trees is too high, Classification Tree is good at detecting negative reviews and bad at detecting positive reviews, and Gradient Boosting shows stable values and quality measures above 77% on the test dataset. The applied methods agree on the terms that are important for classification.
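The two kinds of finding reported in this abstract, overfitting and uneven detection of positive versus negative reviews, are usually read off a train/test accuracy gap and per-class recall. The sketch below shows one way to compute both with the standard library; the function names, the label strings, and the toy predictions are assumptions for illustration and are not taken from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each label: correctly predicted items / items of that label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def generalization_gap(train_accuracy, test_accuracy):
    """Train minus test accuracy; a large positive gap suggests overfitting."""
    return train_accuracy - test_accuracy


# Toy test-set labels for a classifier that favors the negative class
# (made-up values, not results from the paper).
toy_true = ["neg", "neg", "neg", "pos", "pos", "pos", "pos"]
toy_pred = ["neg", "neg", "neg", "neg", "neg", "pos", "neg"]

toy_recall = per_class_recall(toy_true, toy_pred)
toy_gap = generalization_gap(1.0, accuracy(toy_true, toy_pred))
```

On the toy labels, recall is perfect for the negative class and low for the positive class, which is the pattern the abstract attributes to the Classification Tree.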
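With the emergence of large numbers of full-text research papers, mining knowledge from them benefits both the in-depth knowledge organization and the precise retrieval of academic literature. Recognizing the structure of academic texts is the foundation of such work, because it helps in understanding academic texts at a deeper, more semantic level and thereby advances research on academic text mining. This paper takes the different structural functions of academic texts as its research object and uses 1,579 papers published in the Journal of the Association for Information Science and Technology (JASIST) as the dataset. Preliminary experiments were run with three models (a bidirectional long short-term memory network, a support vector machine, and a conditional random field), and after comparing their performance the conditional random field model was selected for further study. Using the conditional random field model, the paper recasts structural function recognition as a sequence labeling problem over sentence units, searches for the best recognition model, and examines how different features affect recognition, ultimately obtaining an overall structure recognition harmonic mean (F-score) of 92.88% in the open test. The experimental results show that lexical information in section headings and characteristic lexical information in section content contribute greatly to recognizing the functional structure of academic texts and yield satisfactory results, whereas structure length features interfere with the performance of the conditional random field method. Finally, the paper summarizes the causes of recognition errors and points out problems and directions for further study.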
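The harmonic mean reported in this abstract is the usual F-score, the harmonic mean of precision and recall. A minimal sketch of how it is computed for one label over a sequence of sentence-level tags follows; the function name, the label strings, and the toy sequences are assumptions for illustration, since the abstract does not list the structural function labels it uses.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and their harmonic mean (F1) for a single label."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)

    predicted_total = true_pos + false_pos
    gold_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy sentence-level tags for one paper (hypothetical label set).
toy_gold = ["introduction", "introduction", "method", "method",
            "result", "conclusion"]
toy_predicted = ["introduction", "method", "method", "method",
                 "result", "conclusion"]

method_scores = precision_recall_f1(toy_gold, toy_predicted, "method")
```

In the toy example, one introduction sentence is mislabeled as method, so precision for "method" drops while recall stays perfect, and the F-score falls between the two.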