Abstract
Social media websites allow users to exchange short texts such as tweets via microblogs and status updates in friendship networks. The limited length of these texts, together with pervasive abbreviations and coined acronyms and words, exacerbates the problems of synonymy and polysemy and brings new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches the data representation by using machine translation to add features from different languages. We then propose a novel framework that performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. Given the significant performance improvement it achieves, we further investigate the potential factors that contribute to that improvement.