
基于社交特征的多维度文本表示方法 (cited by: 3)

A multi-dimension document representation approach based on social features
Abstract: Finding a good representation of web documents is fundamental to all web document analysis and strongly affects its results. We propose a multi-dimension representation scheme for web documents. Traditional document representation approaches extract features directly from document contents; we additionally represent web documents with deeper features learned internally from the documents and with external features drawn from the web document context. We study three representation dimensions: the superficial dimension, the latent dimension, and the social dimension. Superficial and latent features are extracted and learned from document contents, while social features are captured from the interaction behavior between users and web documents. The proposed scheme is easy to use and can be applied to a variety of document analysis models. In the experiments, we adapt two common document clustering algorithms, K-means and hierarchical agglomerative clustering, and name the resulting variants multi-dimension K-means (MDKM) and multi-dimension hierarchical agglomerative clustering (MDHAC). Extensive experiments on document clustering performance verify that the proposed representation scheme is effective. Moreover, we report some deeper observations on cross-dimension feature combinations found in the experimental results.
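The abstract does not state how the three dimensions are combined or how MDKM changes K-means, so the following is only a minimal sketch under stated assumptions: each document is assumed to carry one numeric vector per dimension, the vectors are L2-normalised and concatenated with hypothetical per-dimension weights, and a plain K-means (with a deterministic farthest-first initialisation chosen here for reproducibility) clusters the result. The function names, weights, and toy data are illustrative and are not taken from the paper.

```python
import math


def combine(superficial, latent, social, weights=(1.0, 1.0, 1.0)):
    """Concatenate the three per-dimension vectors of one document.

    Assumption (not from the paper): each vector is L2-normalised and
    scaled by a per-dimension weight before concatenation.
    """
    combined = []
    for vec, weight in zip((superficial, latent, social), weights):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        combined.extend(weight * x / norm for x in vec)
    return combined


def kmeans(points, k, iterations=20):
    """Plain K-means; returns one cluster label per point."""
    # Farthest-first initialisation: deterministic, illustrative choice.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy documents: (superficial term counts, latent topic weights,
# social interaction counts). All values are made up for illustration.
docs = [
    ([1, 0, 0], [0.9, 0.1], [5, 0]),
    ([1, 1, 0], [0.8, 0.2], [4, 1]),
    ([0, 0, 1], [0.1, 0.9], [0, 6]),
    ([0, 1, 1], [0.2, 0.8], [1, 5]),
]
points = [combine(*doc) for doc in docs]
labels = kmeans(points, k=2)
print(labels)  # [0, 0, 1, 1]
```

Changing `weights` shifts how much each dimension contributes to the distance, which is one simple way to explore combinations of features across dimensions.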
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal (北大核心), 2016, No. 11, pp. 2348-2355 (8 pages)
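Funding: National Natural Science Foundation of China (61462011, 61202089); Specialized Research Fund for the Doctoral Program of Higher Education (20125201120006); Guizhou University Introduced Talent Research Project (2011015); Guizhou University Graduate Innovation Fund (研理工2016052)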
Keywords: document representation; document clustering; social feature

