This paper focuses on semantic knowledge acquisition from blogs using the proposed tag-topic model. The model extends the Latent Dirichlet Allocation (LDA) model by adding a tag layer between the document and the topic. Each document is represented by a mixture of tags, each tag is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words. After parameter estimation, the tags are used to describe the underlying topics, so that the latent semantic knowledge within the topics can be represented explicitly. The tags are treated as concepts, and the top-N words from the top topics are selected as related words of the concepts. PMI-IR is then employed to compute the relatedness between each tag-word pair, and noisy words with low correlation are removed to improve the quality of the semantic knowledge. Experimental results show that the proposed method can effectively capture semantic knowledge, especially polysemy and synonymy.
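The PMI-IR filtering step described in the abstract can be illustrated with a minimal Python sketch. It assumes the standard pointwise mutual information formula, log2 of p(tag, word) / (p(tag) p(word)), estimated from hit counts. The function names, the count dictionaries, the total document count, and the threshold are all hypothetical and chosen for illustration; the abstract does not specify how the paper obtains counts or what cutoff it uses.

```python
import math


def pmi(hits_xy, hits_x, hits_y, total):
    """Pointwise mutual information (base 2) estimated from hit counts.

    hits_xy: number of documents containing both terms
    hits_x, hits_y: number of documents containing each term
    total: total number of documents (assumed known)
    """
    if hits_xy == 0 or hits_x == 0 or hits_y == 0:
        # No co-occurrence evidence: treat the pair as unrelated.
        return float("-inf")
    return math.log2((hits_xy * total) / (hits_x * hits_y))


def filter_related_words(tag, candidates, single_hits, pair_hits, total, threshold):
    """Keep candidate words whose PMI with the tag reaches the threshold.

    single_hits maps a term to its hit count; pair_hits maps a
    (tag, word) tuple to the co-occurrence hit count. Both are
    hypothetical inputs standing in for search-engine counts.
    """
    kept = []
    for word in candidates:
        score = pmi(
            pair_hits.get((tag, word), 0),
            single_hits.get(tag, 0),
            single_hits.get(word, 0),
            total,
        )
        if score >= threshold:
            kept.append((word, score))
    # Most strongly related words first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative counts only; they are not data from the paper.
single_hits = {"apple": 100, "fruit": 50, "car": 200}
pair_hits = {("apple", "fruit"): 20, ("apple", "car"): 10}
related = filter_related_words(
    "apple", ["fruit", "car"], single_hits, pair_hits, total=1000, threshold=0.0
)
```

With these illustrative counts, "fruit" co-occurs with "apple" more often than chance would predict and is kept, while "car" co-occurs less often than chance and is dropped as a noisy word.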
Funding: Supported by the National Natural Science Foundation of China under Grants No. 90920005 and No. 61003192; the Key Project of Philosophy and Social Sciences Research, Ministry of Education under Grant No. 08JZD0032; the Program of Introducing Talents of Discipline to Universities under Grant No. B07042; the Natural Science Foundation of Hubei Province under Grants No. 2011CDA034 and No. 2009CDB145; the Chenguang Program of Wuhan Municipality under Grant No. 201050231067; and the self-determined research funds of CCNU from the colleges' basic research and operation of MOE under Grants No. CCNU10A02009 and No. CCNU10C01005.