A multi-dimension document representation approach based on social features (cited by: 3)

Abstract: Document representation underlies all web document analysis and strongly affects its results. We propose a multi-dimension representation scheme for web documents. Traditional document representation approaches generally extract features directly from document content, but deeper features learned from the content and external features drawn from a document's context can also be used to represent it. We study three representation dimensions: superficial, latent, and social. Superficial and latent features are extracted and learned from document content, whereas social features are captured from the interaction behavior between users and web documents. The proposed scheme is easy to use and can be applied to a variety of document analysis models. In our experiments we adapt two common document clustering algorithms, K-means and hierarchical agglomerative clustering, and name the adapted versions multi-dimension K-means (MDKM) and multi-dimension hierarchical agglomerative clustering (MDHAC). Extensive experiments on document clustering performance verify that the proposed representation scheme is effective. We also report further observations on combining features across dimensions.
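The abstract does not say how MDKM combines the three dimensions, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes each document already has one feature vector per dimension (superficial, latent, social), normalizes each vector, scales it by the square root of a per-dimension weight, and concatenates the results before running plain K-means. The function names, the weights, and the toy data are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_dimensions(superficial, latent, social, weights=(1.0, 1.0, 1.0)):
    """Concatenate the three per-dimension vectors into one vector.

    Assumption (not from the paper): each block is L2-normalized and scaled by
    sqrt(weight), so squared Euclidean distance on the combined vector equals
    the weighted sum of the per-dimension squared distances.
    """
    combined = []
    for vec, weight in zip((superficial, latent, social), weights):
        scale = math.sqrt(weight)
        combined.extend(scale * x for x in l2_normalize(vec))
    return combined


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd-style K-means with deterministic farthest-first seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy documents: term counts, topic proportions, and
# user-interaction counts stand in for the three dimensions.
docs = [
    {"superficial": [3, 0, 1, 0], "latent": [0.9, 0.1], "social": [5, 0]},
    {"superficial": [2, 0, 2, 0], "latent": [0.8, 0.2], "social": [4, 1]},
    {"superficial": [0, 3, 0, 1], "latent": [0.1, 0.9], "social": [0, 6]},
    {"superficial": [0, 2, 0, 2], "latent": [0.2, 0.8], "social": [1, 5]},
]
points = [
    combine_dimensions(d["superficial"], d["latent"], d["social"]) for d in docs
]
labels = kmeans(points, k=2)
print(labels)
```

With equal weights every dimension contributes equally to the distance; raising one weight, for example the social one, would let that dimension dominate the clustering, which is one simple way to probe how the dimensions interact.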

Authors: CHEN Gong (陈功), HUANG Ruizhang (黄瑞章), ZHONG Wenliang (钟文良)

Affiliations: College of Computer Science and Technology, Guizhou University; Guizhou Provincial Key Laboratory of Public Big Data

Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 11, pp. 2348-2355 (8 pages)

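Funding: National Natural Science Foundation of China (61462011, 61202089); Specialized Research Fund for the Doctoral Program of Higher Education (20125201120006); Guizhou University Research Project for Introduced Talents (2011015); Guizhou University Graduate Innovation Fund (研理工2016052)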

Keywords: document representation; document clustering; social feature

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
