Abstract
Web document representation underlies all web document analysis and strongly affects its results. We propose a multidimension representation scheme for web documents. Traditional document representation approaches generally extract features directly from document contents; we additionally represent web documents with deeper features learned internally from the documents and with external features drawn from their web context. We study three representation dimensions: the superficial, latent, and social dimensions. Superficial and latent features are extracted and learned from document contents, while social features are captured from the interaction behavior between users and web documents. The proposed scheme is easy to use and can be applied to a variety of document analysis models. In the experiments, we adapt two common document clustering algorithms, K-means and hierarchical agglomerative clustering, into multidimension K-means (MDKM) and multidimension hierarchical agglomerative clustering (MDHAC), and evaluate the scheme in terms of document clustering performance. Extensive experiments verify that the proposed multidimension document representation scheme is effective. Moreover, we report several deeper observations on cross-dimension feature combinations discovered from the experimental results.
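The abstract does not say how the three dimensions are combined or how multidimension k-means differs from standard k-means. As a rough illustration only, the sketch below assumes each document already has one feature vector per dimension, normalizes and weights each vector, concatenates them, and runs plain k-means on the result. The function names, the weights, and the toy data are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def combine_dimensions(superficial, latent, social, weights=(1.0, 1.0, 1.0)):
    """Assumed combination rule: normalize each dimension, weight it, concatenate."""
    combined = []
    for vec, w in zip((superficial, latent, social), weights):
        combined.extend(w * x for x in l2_normalize(vec))
    return combined

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy documents (made up): (term weights, topic weights, interaction counts).
docs = [
    ([1, 0, 0], [0.9, 0.1], [5, 0]),
    ([0, 0, 1], [0.1, 0.9], [0, 4]),
    ([1, 1, 0], [0.8, 0.2], [3, 0]),
    ([0, 1, 1], [0.2, 0.8], [0, 6]),
]
vectors = [combine_dimensions(s, l, u) for s, l, u in docs]
labels = kmeans(vectors, k=2)
print(labels)  # -> [0, 1, 0, 1]
```

Normalizing each dimension before concatenation keeps a dimension with large raw values, such as interaction counts, from dominating the distance; the per-dimension weights are one simple way to study cross-dimension combinations.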
Source
《计算机工程与科学》
CSCD
PKU Core Journal (北大核心)
2016, No. 11, pp. 2348-2355 (8 pages)
Computer Engineering & Science
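Funding
National Natural Science Foundation of China (61462011, 61202089)
Specialized Research Fund for the Doctoral Program of Higher Education (20125201120006)
Guizhou University Research Project for Introduced Talents (2011015)
Guizhou University Graduate Innovation Fund (研理工2016052)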
Keywords
document representation
document clustering
social feature