Abstract: Societal risk classification is the fundamental issue for online societal risk monitoring. To show the challenge and feasibility of societal risk classification of BBS posts, this paper presents an empirical analysis. Based on an effectiveness analysis, a Support Vector Machine based on Bag-of-Words (BOW-SVM) is adopted to validate the challenge, and the distributed document embeddings of BBS posts generated by Paragraph Vector are applied to the feasibility study. Using BOW-SVM, cross-validations are conducted on BBS posts labeled by different groups and annotators. The large fluctuation in the cross-validation results indicates differences in individual risk perception, which makes societal risk classification more challenging. Furthermore, based on the distributed document embeddings, the pairwise similarities of more than 300 thousand BBS posts from different societal risk categories are compared. The higher similarities among BBS posts in the same societal risk category reveal that such posts share more features than posts in different categories, which demonstrates the feasibility of societal risk classification of BBS posts and also suggests that the performance of societal risk monitoring can be improved.
Funding: This study is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000902 and the National Natural Science Foundation of China under Grant Nos. 61473284, 71601023 and 71371107.
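The pairwise-similarity comparison described in the abstract above can be illustrated with a minimal sketch. It assumes that document embeddings are already available as plain lists of floats; the toy vectors and the category names `housing` and `health` are made up for illustration and are not taken from the paper, and no Paragraph Vector model is trained here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(vectors, labels):
    """Return (mean same-category similarity, mean cross-category similarity)."""
    same, diff = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (same if labels[i] == labels[j] else diff).append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(same), mean(diff)


# Hypothetical toy embeddings; real ones would come from a trained
# Paragraph Vector model and have many more dimensions.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["housing", "housing", "health", "health"]

same_mean, diff_mean = mean_pairwise_similarity(toy_vectors, toy_labels)
print(f"same-category mean: {same_mean:.3f}, cross-category mean: {diff_mean:.3f}")
```

A same-category mean that is clearly higher than the cross-category mean is the kind of evidence the abstract cites for feasibility. Enumerating all pairs is quadratic in the number of posts, so at the scale of hundreds of thousands of posts a sampled or vectorized variant would likely be needed.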
Abstract: Societal risk classification is a fundamental and complex issue for societal risk perception. To conduct societal risk classification, Tianya Forum posts are selected as the data source, and four kinds of representations of BBS posts are applied: string representation, term-frequency representation, TF-IDF representation and distributed representation. Using edit distance or cosine similarity as the distance metric, four k-Nearest Neighbor (kNN) classifiers based on the different representations are developed and compared. Owing to the advantage of the neural network model Paragraph Vector in capturing word order and semantics, kNN based on the distributed representation generated by Paragraph Vector (kNN-PV) proves effective for societal risk classification. Furthermore, to improve classification performance, kNN-PV is combined with the other three kNN classifiers in a weighted ensemble model. The optimal weights of the individual kNN classifiers are determined by a brute-force grid search. The experimental results reveal that, compared with kNN-PV, the Macro-F of the ensemble method is significantly improved for societal risk classification.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 71171187, 71371107, and 61473284.
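The weighted ensemble and brute-force grid search mentioned in the abstract above might look roughly like the following sketch. It assumes each base classifier has already produced one label per post; the function names, the weight grid, the tie-breaking rule and the toy labels `a` and `b` are illustrative assumptions, not the authors' implementation.

```python
from itertools import product


def weighted_vote(predictions, weights):
    """Combine one predicted label per classifier into a single label by weighted voting."""
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    # Iterating labels in sorted order makes ties resolve deterministically.
    return max(sorted(scores), key=scores.get)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes that occur in gold."""
    f1_scores = []
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1_scores) / len(f1_scores)


def grid_search_weights(per_classifier_preds, gold, steps=5):
    """Try every weight combination on a regular grid in [0, 1]; keep the best Macro-F."""
    grid = [i / (steps - 1) for i in range(steps)]
    best_weights, best_score = None, -1.0
    for weights in product(grid, repeat=len(per_classifier_preds)):
        if sum(weights) == 0:
            continue  # all-zero weights carry no information
        combined = [weighted_vote(preds, weights)
                    for preds in zip(*per_classifier_preds)]
        score = macro_f1(gold, combined)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Hypothetical predictions of four base classifiers on four validation posts.
gold = ["a", "a", "b", "b"]
per_classifier_preds = [
    ["a", "a", "b", "a"],
    ["a", "b", "b", "b"],
    ["b", "a", "b", "b"],
    ["a", "a", "a", "b"],
]
best_weights, best_score = grid_search_weights(per_classifier_preds, gold)
print(best_weights, best_score)
```

In practice the weights would be tuned on a held-out validation split rather than on the test data, since exhaustive search over the grid can otherwise overfit the reported Macro-F.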
Abstract: The risk classification of BBS posts is important for evaluating the societal risk level within a period. Using posts collected from Tianya Forum as the data source, the authors adopt societal risk indicators from socio-psychology and conduct document-level, multi-class societal risk classification of BBS posts. To effectively capture the semantics and word order of documents, the shallow neural network model Paragraph Vector is applied to produce distributed vector representations of the posts. Based on the document vectors, the authors apply the kNN classification method to identify the societal risk category of each post. The experimental results reveal that, in document-level societal risk classification, Paragraph Vector achieves much faster training and an improvement of at least 10% in F-measure over Bag-of-Words. Furthermore, Paragraph Vector also outperforms the edit-distance and Lucene-based search methods. The present work is the first attempt to combine a document embedding method with socio-psychology research results in the area of public opinion.
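The kNN step over document vectors described in the abstract above can be sketched as follows. The sketch assumes the embeddings have already been produced by some Paragraph Vector implementation (none is trained here); the toy vectors, the labels `risk_A` and `risk_B`, and the choice of `k=3` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(query, train_vectors[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled embeddings standing in for Paragraph Vector output.
train_vectors = [
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 0.8],
]
train_labels = ["risk_A", "risk_A", "risk_A", "risk_B", "risk_B", "risk_B"]

print(knn_predict([0.95, 0.05], train_vectors, train_labels))
print(knn_predict([0.05, 0.90], train_vectors, train_labels))
```

Cosine similarity is used here because the related abstracts name it as a distance metric for vector representations; the value of `k` is a tuning choice that the abstract does not specify.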