Abstract
Sliding-window methods combined with machine-learning classifiers can determine the authorship attribution of a text. However, such methods require careful selection of text features, and different feature choices may change the attribution result. To address this problem, this paper proposes an authorship attribution model based on fastText classification. The model, referred to as Rolling-fastText, incorporates the sliding-window idea and introduces word (character) vectors and data augmentation, so that it makes full use of the textual information, extracts text features automatically, and presents the results visually. The model is applied to the authorship attribution of A Dream of Red Mansions (《红楼梦》) and Roman de la Rose. The experimental results indicate that the first 80 chapters and the last 40 chapters of A Dream of Red Mansions were written by different authors, and that the opening 4,058 lines (approximately 50,000 words) and the following 17,724 lines (approximately 218,000 words) of Roman de la Rose were written by different authors. These results demonstrate the effectiveness of the Rolling-fastText model for authorship attribution.
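The rolling idea summarized in the abstract can be illustrated with a minimal sketch: cut the text into overlapping windows, classify each window, and read the authorship boundary off the sequence of labels. The function names, the window size, and the step below are assumptions made for illustration, and the marker-word classifier is only a hypothetical stand-in for a trained fastText model; none of this is taken from the paper's implementation.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple


def sliding_windows(tokens: List[str], size: int, step: int) -> List[Tuple[int, List[str]]]:
    """Return (start_index, window) pairs of full-size windows over the tokens."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # If the text is shorter than one window, a single short window is returned.
    last_start = max(len(tokens) - size, 0)
    return [(start, tokens[start:start + size])
            for start in range(0, last_start + 1, step)]


def rolling_attribution(
    tokens: List[str],
    size: int,
    step: int,
    classify: Callable[[List[str]], str],
) -> List[Tuple[int, str]]:
    """Label every window with the author returned by `classify`."""
    return [(start, classify(window))
            for start, window in sliding_windows(tokens, size, step)]


def make_toy_classifier(profiles: Dict[str, Set[str]]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for a trained classifier: counts marker words per author."""
    def classify(window: List[str]) -> str:
        counts = Counter(window)
        scores = {author: sum(counts[w] for w in words)
                  for author, words in profiles.items()}
        # Ties go to the author listed first in `profiles`.
        return max(scores, key=scores.get)
    return classify


# Toy text: the first half uses author A's marker word, the second half author B's.
tokens = ["a"] * 6 + ["b"] * 6
classify = make_toy_classifier({"A": {"a"}, "B": {"b"}})
labels = rolling_attribution(tokens, size=4, step=3, classify=classify)
print(labels)
```

In this sketch the label sequence switches from one author to the other at the window where the second author's marker words start to dominate. Plotting the labels against the window start index would give the kind of visual presentation the abstract mentions.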
Authors
LI Xiao (李逍)
GU Changgui (顾长贵)
YANG Leixin (杨雷鑫)
LU Qiling (陆祺灵)
Business School, University of Shanghai for Science and Technology, Shanghai 200093, China
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2021, No. 1, pp. 14-19 (6 pages)
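Funding
National Natural Science Foundation of China (11875042)
University of Shanghai for Science and Technology Undergraduate Innovation and Entrepreneurship Program (SH2020072)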
Keywords
sliding window
authorship attribution
fastText classifier
data augmentation
visualization