Authorship identification method based on the embedding of the syntax tree node
Abstract Authorship identification is an interdisciplinary field that infers the authorship of an unknown text by analyzing its writing style. Existing research is mostly based on character and lexical features, while syntactic relation information has rarely been explored. This paper proposes an authorship identification method based on syntax tree node embedding, which represents each node of the syntax tree as the sum of the embeddings of all its dependency arcs, thereby introducing dependency relation information into a deep learning model. A syntactic attention network is then constructed, and a syntax-aware vector is obtained through it. This vector fuses dependency relations, part-of-speech tags, and words. A sentence attention network then produces the sentence representation, and a classifier performs the final classification. In experiments on three English datasets, the proposed method ranks second or third in performance. More importantly, the introduction of dependency syntactic combinations offers more directions for interpreting the model.

[Objective] Authorship identification infers the authorship of an unknown text by analyzing its stylometry, or writing style. Traditional research on authorship identification generally relies on empirical knowledge from literature or linguistics, whereas modern research mostly uses mathematical methods to quantify an author's writing style. Researchers have proposed various feature combinations and neural network models. Some feature combinations achieve better results with traditional machine learning classifiers, while some neural network models learn the relationship between the input text and its author autonomously and thus extract text features implicitly. However, current research focuses mostly on character and lexical features, and the exploration of syntactic features is limited. How to use the dependency relations between the words of a sentence, and how to combine syntactic features with neural networks, remains unclear. This paper proposes an authorship identification method based on syntax tree node embedding, which introduces syntactic features into a deep learning model.

[Methods] We believe that an author's writing style is mainly reflected in how the author chooses words and constructs sentences. The authorship identification model is therefore developed from the perspectives of words and sentences, and the attention mechanism is used to construct sentence-level features. First, an embedding representation of the syntax tree node is proposed: each node is expressed as the sum of the embeddings corresponding to all its dependency arcs. In this way, information on sentence structure and on the associations between words is introduced into the neural network model. Then, a syntactic attention network is constructed that uses different embedding methods to vectorize text features such as dependencies, part-of-speech tags, and words, and a syntax-aware vector is obtained through this network. Next, the sentence attention network extracts from the syntax-aware vector the features that distinguish different authors, thereby generating the sentence representation. Finally, a classifier produces the result, which is evaluated by accuracy.

[Results] Experiments on the CCAT10, CCAT50, IMDb62, and Chinese novel datasets show that the accuracy of the proposed model generally trends downward as the number of authors increases. At some data points, however, an increase in the number of authors raised the accuracy instead of lowering it. This shows that the model's ability to capture writing style differs considerably from author to author. Furthermore, when the number of authors on the IMDb dataset is varied, the accuracy of the proposed model is slightly lower than that of the BertAA model with 5 authors but higher than that of the BertAA model with 10, 25, and 50 authors. When compared with other models on the CCAT10, CCAT50, and IMDb62 datasets, the proposed model ranks second or third.

[Conclusions] The attention mechanism proved efficient at mining text features and can capture an author's style as it is reflected in different parts of a document. Integrating lexical and syntactic features through the attention mechanism enhances the overall performance of the model. The model performs well on different Chinese and English datasets. Notably, the introduction of dependency syntactic combinations provides more room for interpreting the model, which can explain the text styles of different authors at the levels of word selection and sentence construction.
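
The node representation described in the methods (a syntax tree node as the sum of the embeddings of its dependency arcs), followed by attention pooling into a sentence vector, might be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the toy sentence and its parse, the embedding size `DIM`, the choice to key each arc embedding by relation label and by whether the node is the head or the dependent, the randomly initialised tables, and the dot-product attention with a single `query` vector are all assumptions introduced here.

```python
import math
import random

random.seed(0)
DIM = 4  # hypothetical embedding size

# Hypothetical toy sentence with a hand-written dependency parse.
# Each arc is (head_index, dependent_index, relation_label).
tokens = ["She", "reads", "old", "books"]
arcs = [(1, 0, "nsubj"), (1, 3, "obj"), (3, 2, "amod")]

# Assumption: one embedding per (relation, role) pair, where role says
# whether the node is the head or the dependent of the arc.
# Random values stand in for parameters that would be learned.
arc_table = {
    (rel, role): [random.uniform(-1.0, 1.0) for _ in range(DIM)]
    for rel in ("nsubj", "obj", "amod")
    for role in ("head", "dep")
}


def node_embeddings(n_tokens, arcs, table, dim):
    """Represent each node as the sum of the embeddings of its arcs."""
    nodes = [[0.0] * dim for _ in range(n_tokens)]
    for head, dep, rel in arcs:
        for idx, role in ((head, "head"), (dep, "dep")):
            vec = table[(rel, role)]
            nodes[idx] = [a + b for a, b in zip(nodes[idx], vec)]
    return nodes


def attention_pool(vectors, query):
    """Softmax-weighted sum of vectors, scored by dot product with query."""
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(len(query))
    ]
    return pooled, weights


nodes = node_embeddings(len(tokens), arcs, arc_table, DIM)
query = [random.uniform(-1.0, 1.0) for _ in range(DIM)]  # stand-in for a learned query
sentence_vec, weights = attention_pool(nodes, query)

for token, weight in zip(tokens, weights):
    print(f"{token}: attention weight {weight:.3f}")
```

In this sketch the node for "reads" sums the head-side embeddings of its `nsubj` and `obj` arcs, so two verbs that govern the same set of relations receive the same vector. The abstract states that the syntax-aware vector also fuses part-of-speech and word information, which this sketch leaves out.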
Authors ZHANG Yang (张洋); JIANG Minghu (江铭虎) (Computational Linguistics Laboratory, Department of Chinese, School of Humanities, Tsinghua University, Beijing 100084, China)
Source Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2023, No. 9, pp. 1390-1398 (9 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list (北大核心).
Funding Key Program of the National Natural Science Foundation of China (62036001).
Keywords authorship identification; node of the syntax tree; dependency; attention mechanism