Abstract
[Objective] This study integrates paper and patent data by research topic, overcoming the differences in language features between the two document types. [Method] Using Wikipedia as the basic classification system, we constructed a small annotated set semi-automatically, designed a semi-supervised deep text clustering model to cluster and fuse papers and patents with similar topics, and designed indicators to evaluate the quality of the data fusion results. [Results] On two datasets, the proposed model's clustering accuracy was 2.4 to 11.9 percentage points higher than that of the other baseline models. The quality evaluation score of its data fusion results exceeded 0.9, outperforming the baseline models, and the model can supplement research topics on the basis of the known topics. [Limitations] We did not conduct an empirical analysis using the fused data, and the number of clusters must be determined manually. [Conclusion] The proposed model can extract topic-related features from the differing texts of papers and patents and thereby achieve data fusion effectively.
Authors
Xie Shiyao (谢士尧)
Wang Xiaomei (王小梅)
Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China; School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049, China
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
EI
CSCD
PKU Core Journals (北大核心)
2024, Issue 4, pp. 112-124 (13 pages)
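Funding
One of the research outputs of the Chinese Academy of Sciences Strategic Research Special Project "Research on Development Trends and Decision Support in Important Disciplinary Fields" (Project No. GHJ-ZLZX-2022-09)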
Keywords
Deep Text Clustering
Data Fusion
Papers
Patents
Research Topic Identification