Abstract
Chat text generated by instant messaging and other social software is voluminous and often carries complex semantics such as criminal slang, so digital forensic examiners cannot quickly identify and extract the chat text evidence related to a criminal event. To address this, a chat text evidence classification model (DSR-BGRU) is proposed, based on the DSR (dynamic semantic representation) model and the BGRU (bidirectional gated recurrent unit) model. The chat text data is first preprocessed so that it retains the features of the criminal domain. A DSR-based semantic feature representation method for chat text evidence is then designed and implemented: semantic words are selected with a clustering algorithm, the word vector of each non-semantic word is represented as a weighted combination of word attributes and semantic words, and the semantic words are also used to build sparse representations of new words. Using the Keras framework, a multi-layer chat text feature extraction and classification model is built, consisting of a DSR input layer, BGRU hidden layers, and a softmax classification layer. The model takes as input the text matrix formed from the DSR word vectors, which represents the chat text at the semantic level; the stacked BGRU hidden layers extract contextual features from the text composed of those word vectors, giving a more accurate reading of the chat text's semantics; and the softmax layer identifies and extracts the chat text evidence. Experimental results show that the DSR-BGRU model identifies and extracts chat record evidence more accurately than other text classification models and methods, effectively extracting crime-related text from chat messages with an accuracy of 92.06% and an F1 score of 91.00%.
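The data flow described in the abstract (a matrix of word vectors, a bidirectional GRU hidden layer, then a softmax classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than Keras so that it runs without dependencies, a single bidirectional layer rather than stacked layers, random untrained weights, no bias terms, and random vectors standing in for DSR word vectors. All names (`GRUCell`, `classify`, `EMBED_DIM`, and so on) are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """One GRU direction with random weights (assumption: no biases, no training)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {g: rand_matrix(hidden_dim, input_dim, rng) for g in "zrh"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))]
        # Blend old state and candidate; some formulations swap z and (1 - z).
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, sequence):
        """Consume the whole sequence and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        for x in sequence:
            h = self.step(x, h)
        return h

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sequence, fwd, bwd, out_weights):
    """Concatenate forward and backward final states, then apply softmax."""
    features = fwd.run(sequence) + bwd.run(list(reversed(sequence)))
    return softmax(matvec(out_weights, features))

rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_CLASSES = 4, 3, 2  # e.g. evidence vs. non-evidence
fwd = GRUCell(EMBED_DIM, HIDDEN_DIM, rng)
bwd = GRUCell(EMBED_DIM, HIDDEN_DIM, rng)
out_weights = rand_matrix(NUM_CLASSES, 2 * HIDDEN_DIM, rng)

# A five-word chat message; random vectors stand in for DSR word vectors.
message = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(5)]
probs = classify(message, fwd, bwd, out_weights)
print(probs)
```

Reading the sequence in both directions lets each class score depend on the words before and after any given position, which is the contextual feature extraction the abstract attributes to the BGRU layers. In Keras the same structure would typically be expressed with the `Bidirectional` wrapper around a `GRU` layer followed by a `Dense` layer with softmax activation.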
Authors
张宇
李炳龙
李学娟
张和禹
ZHANG Yu; LI Binglong; LI Xuejuan; ZHANG Heyu (Information Engineering University, Zhengzhou 450001, China; Henan Polytechnic University, Jiaozuo 454003, China)
Source
《网络与信息安全学报》
2022, Issue 2, pp. 150-159 (10 pages)
Chinese Journal of Network and Information Security
Funding
National Natural Science Foundation of China (60903220).
Keywords
text semantic representation
polysemy
text classification
digital forensics