Text mining is a text data analysis,found that the relationship between concepts and underlying concepts from unstructured text,it is extracted from large text database has not yet been realized patterns or associatio...Text mining is a text data analysis,found that the relationship between concepts and underlying concepts from unstructured text,it is extracted from large text database has not yet been realized patterns or associations,some information retrieval and text processing system can find the relationship between words and paragraphs.This article first describes the data sources and a brief introduction to the related platforms and functional components.Secondly,it explains the Chinese word segmentation and the Korean word segmentation system.At last,it takes the news,documents and materials of the Korean Peninsula as well as the various public opinion data on the network as the basic data for the research.The examples of word frequency graph and word cloud graph is carried out to show the results of text mining through Chinese word segmentation system and Korean word segmentation system.展开更多
Opinion (sentiment) analysis on big data streams from the constantly generated text streams on social media networks to hundreds of millions of online consumer reviews provides many organizations in every field with o...Opinion (sentiment) analysis on big data streams from the constantly generated text streams on social media networks to hundreds of millions of online consumer reviews provides many organizations in every field with opportunities to discover valuable intelligence from the massive user generated text streams. However, the traditional content analysis frameworks are inefficient to handle the unprecedentedly big volume of unstructured text streams and the complexity of text analysis tasks for the real time opinion analysis on the big data streams. In this paper, we propose a parallel real time sentiment analysis system: Social Media Data Stream Sentiment Analysis Service (SMDSSAS) that performs multiple phases of sentiment analysis of social media text streams effectively in real time with two fully analytic opinion mining models to combat the scale of text data streams and the complexity of sentiment analysis processing on unstructured text streams. We propose two aspect based opinion mining models: Deterministic and Probabilistic sentiment models for a real time sentiment analysis on the user given topic related data streams. Experiments on the social media Twitter stream traffic captured during the pre-election weeks of the 2016 Presidential election for real-time analysis of public opinions toward two presidential candidates showed that the proposed system was able to predict correctly Donald Trump as the winner of the 2016 Presidential election. The cross validation results showed that the proposed sentiment models with the real-time streaming components in our proposed framework delivered effectively the analysis of the opinions on two presidential candidates with average 81% accuracy for the Deterministic model and 80% for the Probabilistic model, which are 1% - 22% improvements from the results of the existing literature.展开更多
在大数据背景下,如何快速准确的从庞大数据集中筛选过滤出有用信息一直是自然语言处理领域的一个研究目标,对用户所提问题进行意图识别能够帮助用户在向问答系统进行沟通的时候,根据用户提出的直接或者间接的信息来快速判断用户的真实意...在大数据背景下,如何快速准确的从庞大数据集中筛选过滤出有用信息一直是自然语言处理领域的一个研究目标,对用户所提问题进行意图识别能够帮助用户在向问答系统进行沟通的时候,根据用户提出的直接或者间接的信息来快速判断用户的真实意图,过滤无用冗余信息后返回一个概率最大答案给用户。FastText是Facebook AI Research推出的文本分类和词训练工具,它的最大特点是模型简单并且在文本分类的准确率上,和现有的深度学习的方法效果相近,即在保证了准确率的情况下大大缩短了分类时间。展开更多
In order to study deeply the prominent problems faced by China’s clean government work,and put forward effective coping strategies,this article analyzes the network information of anti-corruption related news events,...In order to study deeply the prominent problems faced by China’s clean government work,and put forward effective coping strategies,this article analyzes the network information of anti-corruption related news events,which is based on big data technology.In this study,we take the news report from the website of the Communist Party of China(CPC)Central Commission for Discipline Inspection(CCDI)as the source of data.Firstly,the obtained text data is converted to word segmentation and stop words under preprocessing,and then the pre-processed data is improved by vectorization and text clustering,finally,after text clustering,the key words of clean government work is derived from visualization analysis.According to the results of this study,it shows that China’s clean government work should focus on‘the four forms of decadence’issue,and related departments must strictly crack down five categories of phenomena,such as“illegal payment of subsidies or benefits,illegal delivery of gifts and cash gift,illegal use of official vehicles,banquets using public funds,extravagant wedding ceremonies and funeral”.The results of this study are consistent with the official data released by the CCDI’s website,which also suggests that the method is feasible and effective.展开更多
文摘Text mining is a text data analysis,found that the relationship between concepts and underlying concepts from unstructured text,it is extracted from large text database has not yet been realized patterns or associations,some information retrieval and text processing system can find the relationship between words and paragraphs.This article first describes the data sources and a brief introduction to the related platforms and functional components.Secondly,it explains the Chinese word segmentation and the Korean word segmentation system.At last,it takes the news,documents and materials of the Korean Peninsula as well as the various public opinion data on the network as the basic data for the research.The examples of word frequency graph and word cloud graph is carried out to show the results of text mining through Chinese word segmentation system and Korean word segmentation system.
文摘Opinion (sentiment) analysis on big data streams from the constantly generated text streams on social media networks to hundreds of millions of online consumer reviews provides many organizations in every field with opportunities to discover valuable intelligence from the massive user generated text streams. However, the traditional content analysis frameworks are inefficient to handle the unprecedentedly big volume of unstructured text streams and the complexity of text analysis tasks for the real time opinion analysis on the big data streams. In this paper, we propose a parallel real time sentiment analysis system: Social Media Data Stream Sentiment Analysis Service (SMDSSAS) that performs multiple phases of sentiment analysis of social media text streams effectively in real time with two fully analytic opinion mining models to combat the scale of text data streams and the complexity of sentiment analysis processing on unstructured text streams. We propose two aspect based opinion mining models: Deterministic and Probabilistic sentiment models for a real time sentiment analysis on the user given topic related data streams. Experiments on the social media Twitter stream traffic captured during the pre-election weeks of the 2016 Presidential election for real-time analysis of public opinions toward two presidential candidates showed that the proposed system was able to predict correctly Donald Trump as the winner of the 2016 Presidential election. The cross validation results showed that the proposed sentiment models with the real-time streaming components in our proposed framework delivered effectively the analysis of the opinions on two presidential candidates with average 81% accuracy for the Deterministic model and 80% for the Probabilistic model, which are 1% - 22% improvements from the results of the existing literature.
文摘在大数据背景下,如何快速准确的从庞大数据集中筛选过滤出有用信息一直是自然语言处理领域的一个研究目标,对用户所提问题进行意图识别能够帮助用户在向问答系统进行沟通的时候,根据用户提出的直接或者间接的信息来快速判断用户的真实意图,过滤无用冗余信息后返回一个概率最大答案给用户。FastText是Facebook AI Research推出的文本分类和词训练工具,它的最大特点是模型简单并且在文本分类的准确率上,和现有的深度学习的方法效果相近,即在保证了准确率的情况下大大缩短了分类时间。
基金funded by the Open Foundation for the University Innovation Platform in the Hunan Province,grant number 16K013Hunan Provincial Natural Science Foundation of China,grant number 2017JJ2016+2 种基金2016 Science Research Project of Hunan Provincial Department of Education,grant number 16C0269Accurate crawler design and implementation with a data cleaning function,National Students innovation and entrepreneurship of training program,grant number 201811532010This research work is implemented at the 2011 Collaborative Innovation Center for Development and Utilization of Finance and Economics Big Data Property,Universities of Hunan Province.Open project,grant number 20181901CRP03,20181901CRP04,20181901CRP05.
文摘In order to study deeply the prominent problems faced by China’s clean government work,and put forward effective coping strategies,this article analyzes the network information of anti-corruption related news events,which is based on big data technology.In this study,we take the news report from the website of the Communist Party of China(CPC)Central Commission for Discipline Inspection(CCDI)as the source of data.Firstly,the obtained text data is converted to word segmentation and stop words under preprocessing,and then the pre-processed data is improved by vectorization and text clustering,finally,after text clustering,the key words of clean government work is derived from visualization analysis.According to the results of this study,it shows that China’s clean government work should focus on‘the four forms of decadence’issue,and related departments must strictly crack down five categories of phenomena,such as“illegal payment of subsidies or benefits,illegal delivery of gifts and cash gift,illegal use of official vehicles,banquets using public funds,extravagant wedding ceremonies and funeral”.The results of this study are consistent with the official data released by the CCDI’s website,which also suggests that the method is feasible and effective.