Abstract
Filtering content on sensitive topics out of massive volumes of darknet webpages is important for law enforcement agencies. Based on an in-depth analysis of the textual characteristics and categories of webpages on Freenet and other darknets, a TextCNN-based topic classification model for darknet webpages is proposed. The model preprocesses the data according to the non-standardized language of darknet webpages, represents webpage content with pretrained word embeddings, applies convolution kernels of different sizes to obtain feature maps, and uses max pooling to obtain the final feature vector. The convolutional network is regularized, and a softmax function predicts the class probabilities. Experimental results show that the method achieves a precision of 86.01%, a recall of 78.97%, and a Macro-F1 score of 82.33%, higher than the machine learning models used for comparison, and that it can effectively address the darknet webpage classification problem.
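The pipeline summarized in the abstract (embedding lookup, convolutions with several kernel sizes, max pooling, softmax) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the kernel heights (3, 4, 5), the filter count, the number of classes, and the randomly initialized weights standing in for pretrained word embeddings are all assumptions made for illustration, and training, preprocessing, and regularization are left out.

```python
import math
import random


def conv_max_pool(embedded, kernel, bias):
    """Slide one (h x d) kernel over the token sequence, apply ReLU,
    and keep the maximum response (max-over-time pooling)."""
    h = len(kernel)
    best = 0.0  # ReLU outputs are >= 0; also the result if the text is shorter than h
    for start in range(len(embedded) - h + 1):
        s = bias
        for i in range(h):
            s += sum(w * x for w, x in zip(kernel[i], embedded[start + i]))
        best = max(best, max(0.0, s))
    return best


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def textcnn_forward(token_ids, params):
    """Inference-time forward pass: embed, convolve, pool, classify."""
    embedded = [params["embedding"][t] for t in token_ids]
    features = []
    for kernels in params["conv"].values():  # one group of filters per kernel height
        for kernel, bias in kernels:
            features.append(conv_max_pool(embedded, kernel, bias))
    # Regularization such as dropout would act on `features` during training only.
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(params["fc_w"], params["fc_b"])
    ]
    return softmax(logits)


def init_params(vocab_size, dim, kernel_sizes, n_filters, n_classes, seed=0):
    """Hypothetical random initialization; a real model would load
    pretrained embeddings and learn the remaining weights."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.1, 0.1)

    n_features = len(kernel_sizes) * n_filters
    return {
        "embedding": [[rand() for _ in range(dim)] for _ in range(vocab_size)],
        "conv": {
            h: [([[rand() for _ in range(dim)] for _ in range(h)], 0.0)
                for _ in range(n_filters)]
            for h in kernel_sizes
        },
        "fc_w": [[rand() for _ in range(n_features)] for _ in range(n_classes)],
        "fc_b": [0.0] * n_classes,
    }
```

Calling `textcnn_forward([1, 5, 7, 2, 9, 3], init_params(50, 8, (3, 4, 5), 2, 4))` returns one probability per class. Using several kernel heights lets the model respond to word n-grams of different lengths, and max pooling makes the feature vector's size independent of page length.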
Authors
洪良怡
朱松林
王轶骏
薛质
Hong Liangyi; Zhu Songlin; Wang Yijun; Xue Zhi (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; Nantong Public Security Bureau, Nantong 226001, Jiangsu, China)
Source
《计算机应用与软件》
Peking University Core Journal
2023, No. 2, pp. 320-325, 330 (7 pages in total)
Computer Applications and Software
Funding
National Key Research and Development Program of China, "Cyberspace Security" Key Special Project (2016QY01W0202).