Abstract
[Research purpose] The Sentence-BERT model is applied to patent technology topic clustering to address the sparsity of semantic features in word vectors caused by the unique technical terms that patent documents frequently use to highlight novelty. [Research method] The study takes 22,370 patents in the field of artificial intelligence from 2015 to 2019 as experimental data. First, the Sentence-BERT algorithm is used to produce vector representations of the patent abstract texts; second, the dimensionality of the vector matrix is reduced, and HDBSCAN is used to find high-density clusters in the original data; finally, the topic features of each cluster's text collection are identified and the topics are presented. [Research conclusion] Compared with the LDA topic model, K-means, doc2vec, and other methods, the experimental results of this study improve the granularity and accuracy of topic division and achieve better topic coherence. How to further improve the model with a fine-tuning strategy is a direction for future exploration of this method.
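The abstract outlines a three-step pipeline: Sentence-BERT embedding of the patent abstracts, dimensionality reduction of the embedding matrix, and HDBSCAN density clustering. Below is a minimal Python sketch of that pipeline, assuming the sentence-transformers, umap-learn, and hdbscan packages; the pretrained model name and the use of UMAP for the dimensionality-reduction step are illustrative assumptions, since the abstract does not name either.

```python
# Minimal sketch of the clustering pipeline described in the abstract.
# Assumptions (not specified in the abstract): the particular Sentence-BERT
# checkpoint, UMAP as the dimensionality-reduction method, and all parameters.
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Placeholder corpus; the paper uses the abstracts of 22,370 AI patents (2015-2019).
abstracts = ["patent abstract text ..."]

# Step 1: vectorize each patent abstract with a Sentence-BERT model.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = encoder.encode(abstracts, show_progress_bar=True)

# Step 2: reduce the dimensionality of the embedding matrix.
reduced = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)

# Step 3: find high-density clusters; HDBSCAN labels sparse points as noise (-1).
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, metric="euclidean")
labels = clusterer.fit_predict(reduced)
```

The final step in the abstract, identifying topic features within each cluster's text collection, is not specified in enough detail here to sketch; a term-weighting scheme over the documents sharing a cluster label would be one plausible realization.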
Authors
Ruan Guangce (阮光册); Zhou Mengwei (周萌葳), Faculty of Economics and Management, East China Normal University, Shanghai 200241
Source
Journal of Intelligence (《情报杂志》)
Peking University Core Journal (北大核心)
2024, No. 2, pp. 110-117 (8 pages)