Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbrevi- ations, and coined acronyms and words exacerbate...Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbrevi- ations, and coined acronyms and words exacerbate the prob- lems of synonymy and polysemy, and bring about new chal- lenges to data mining applications such as text clustering and classification. To address these issues, we dissect some poten- tial causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed ap- proach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investi- gate potential factors that contribute to the improved perfor- mance.展开更多
An effective text representation scheme dominates the performance of text categorization system. However, based on the assumption of independent terms, the traditional schemes which tediously use term frequency (TF)...An effective text representation scheme dominates the performance of text categorization system. However, based on the assumption of independent terms, the traditional schemes which tediously use term frequency (TF) and document frequency (DF) are insufficient for capturing enough information of a document and result in poor performance. To overcome this limitation, we investigate exploring the relationships between different terms of the same class tendency and the way of measuring the importance of a repetitive term in a document. In this paper, a group of novel term weighting factors are proposed to enhance the category contribution for each term. Then, based on a novel strategy of generating passages from document, we present two schemes, the weighted co-contributions of different terms corresponding to the class tendency and the weighted co-contributions for each term in different passages, to achieve improvements on text representation. The prior scheme works in a dimensionality reduction mode while the second one runs in the conventional way. By employing the support vector machine (SVM) classifier, experiments on four benchmark corpora show that the proposed schemes could achieve a consistent better performance than the conventional methods in both efficiency and accuracy. Further analysis also confirms some promising directions for the future works.展开更多
文摘Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbrevi- ations, and coined acronyms and words exacerbate the prob- lems of synonymy and polysemy, and bring about new chal- lenges to data mining applications such as text clustering and classification. To address these issues, we dissect some poten- tial causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed ap- proach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investi- gate potential factors that contribute to the improved perfor- mance.
基金supported by the Hi-Tech Research and Development Program of China (2009AA01Z430)the National Natural Science Foundation of China (60972077,60821001)+2 种基金the National S&T Major Program (2010ZX03003-003-01)the Fundamental Research Funds for the Central Universities (BUPT2011RC0210)the Science and Technology on Electronic Control Laboratory
文摘An effective text representation scheme dominates the performance of text categorization system. However, based on the assumption of independent terms, the traditional schemes which tediously use term frequency (TF) and document frequency (DF) are insufficient for capturing enough information of a document and result in poor performance. To overcome this limitation, we investigate exploring the relationships between different terms of the same class tendency and the way of measuring the importance of a repetitive term in a document. In this paper, a group of novel term weighting factors are proposed to enhance the category contribution for each term. Then, based on a novel strategy of generating passages from document, we present two schemes, the weighted co-contributions of different terms corresponding to the class tendency and the weighted co-contributions for each term in different passages, to achieve improvements on text representation. The prior scheme works in a dimensionality reduction mode while the second one runs in the conventional way. By employing the support vector machine (SVM) classifier, experiments on four benchmark corpora show that the proposed schemes could achieve a consistent better performance than the conventional methods in both efficiency and accuracy. Further analysis also confirms some promising directions for the future works.