Funding: supported by the National Natural Science Foundation of China under Grants No. 90920005 and No. 61003192; the Key Project of Philosophy and Social Sciences Research, Ministry of Education under Grant No. 08JZD0032; the Program of Introducing Talents of Discipline to Universities under Grant No. B07042; the Natural Science Foundation of Hubei Province under Grants No. 2011CDA034 and No. 2009CDB145; the Chenguang Program of Wuhan Municipality under Grant No. 201050231067; and the self-determined research funds of CCNU from the colleges' basic research and operation of MOE under Grants No. CCNU10A02009 and No. CCNU10C01005.
Abstract: This paper focuses on semantic knowledge acquisition from blogs with the proposed tag-topic model. The model extends the Latent Dirichlet Allocation (LDA) model by adding a tag layer between the document and the topic. Each document is represented by a mixture of tags; each tag is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words. After parameter estimation, the tags are used to describe the underlying topics, so the latent semantic knowledge within the topics can be represented explicitly. The tags are treated as concepts, and the top-N words from the top topics are selected as related words of the concepts. PMI-IR is then employed to compute the relatedness between each tag-word pair, and noisy words with low correlation are removed to improve the quality of the semantic knowledge. Experimental results show that the proposed method can effectively capture semantic knowledge, especially polysemes and synonyms.
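The generative story outlined above (document → tag → topic → word) can be sketched roughly as follows; the dimensions, symmetric Dirichlet hyperparameters, and variable names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the tag-topic generative process described in the abstract:
# each document mixes tags, each tag has a multinomial over topics, and each
# topic has a multinomial over words. All sizes and priors are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_docs, n_tags, n_topics, vocab_size, doc_len = 5, 4, 3, 50, 20
alpha, beta, gamma = 0.5, 0.5, 0.1   # assumed symmetric Dirichlet priors

theta = rng.dirichlet(alpha * np.ones(n_tags), size=n_docs)        # document -> tag mixture
psi   = rng.dirichlet(beta  * np.ones(n_topics), size=n_tags)      # tag -> topic multinomial
phi   = rng.dirichlet(gamma * np.ones(vocab_size), size=n_topics)  # topic -> word multinomial

def generate_doc(d):
    """Sample one synthetic document under the tag-topic generative story."""
    words = []
    for _ in range(doc_len):
        tag   = rng.choice(n_tags, p=theta[d])        # pick a tag for this token
        topic = rng.choice(n_topics, p=psi[tag])      # pick a topic given the tag
        word  = rng.choice(vocab_size, p=phi[topic])  # emit a word given the topic
        words.append(word)
    return words

corpus = [generate_doc(d) for d in range(n_docs)]
print(corpus[0])
```

The PMI-IR step the abstract mentions would then score each (tag, word) pair by pointwise mutual information estimated from co-occurrence counts and discard low-scoring words; that filtering stage is not shown in the sketch.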
Abstract: Probabilistic latent semantic analysis (PLSA) is a topic model for text documents that has been widely used in text mining, computer vision, computational biology, and other fields. For batch PLSA inference algorithms, the required memory grows linearly with the data size, which makes handling massive data streams very difficult. To process big data streams, we propose an online belief propagation (OBP) algorithm based on an improved factor graph representation of PLSA. The factor graph of PLSA facilitates the classic belief propagation (BP) algorithm. OBP splits the data stream into a set of small segments and uses the parameters estimated on previous segments to compute the gradient-descent update for the current segment. Because OBP removes each segment from memory after processing it, it is memory-efficient for big data streams. We examine the performance of OBP on four document data sets and demonstrate that it is competitive with online expectation maximization (OEM) for PLSA in both speed and accuracy, and that it also gives a more accurate topic evolution. Experiments on massive data streams from Baidu further confirm the effectiveness of the OBP algorithm.
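A rough sketch of the segment-by-segment loop described above follows. The EM-style fold-in used for the per-segment updates is a generic stand-in for OBP's belief-propagation messages, and the function name, decay schedule, and update rule are assumptions rather than the paper's equations.

```python
# Sketch of online, segment-by-segment processing: only the global topic-word
# parameters are carried across segments; each segment is discarded after use.
import numpy as np

def process_stream(segments, n_topics, vocab_size, inner_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Global topic-word parameters carried across segments (each row sums to 1).
    phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)

    for t, segment in enumerate(segments):            # segment: docs x vocab count matrix
        counts = np.asarray(segment, dtype=float)
        n_docs = counts.shape[0]
        theta = np.full((n_docs, n_topics), 1.0 / n_topics)  # per-segment doc-topic mixtures

        for _ in range(inner_iters):
            # E-step-like responsibilities for every (doc, topic, word) triple.
            resp = theta[:, :, None] * phi[None, :, :]        # docs x topics x vocab
            resp /= resp.sum(axis=1, keepdims=True) + 1e-12
            # Update the local doc-topic mixtures from this segment only.
            theta = (resp * counts[:, None, :]).sum(axis=2)
            theta /= theta.sum(axis=1, keepdims=True) + 1e-12

        # Fold this segment's sufficient statistics into the global parameters,
        # then drop the segment; only phi stays in memory.
        seg_stats = (resp * counts[:, None, :]).sum(axis=0)   # topics x vocab
        seg_stats /= seg_stats.sum(axis=1, keepdims=True) + 1e-12
        rho = 1.0 / (t + 2)                                   # assumed simple decaying step size
        phi = (1 - rho) * phi + rho * seg_stats

    return phi
```

Called on a list of document-term count matrices (one per segment), this keeps only the topic-word matrix `phi` between segments, which illustrates the memory-efficiency property the abstract credits to OBP.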