Abstract
Public health emergencies often cause severe damage, and the timeliness and comprehensibility of research are particularly important in responding to them. Methods are urgently needed for rapidly surveying the state of research and extracting specific research information. Scientific literature is one of the main carriers and channels of knowledge dissemination, but the specialized and ambiguous terminology it contains can hinder that dissemination. To address this problem, this paper applies natural language processing and knowledge graph techniques to literature on COVID-19, combining entity recognition and information fusion to build a knowledge graph. First, entities in the titles and abstracts of the literature are annotated to build a dataset for training a BERT-BiLSTM-CRF model, which automatically recognizes and extracts medical entities from text. Next, author name ambiguity is resolved through multi-source cross-validation of author information together with field and institution similarity, yielding a set of authors. Finally, after multi-source information is fused, a COVID-19 knowledge graph is built incrementally from entity-entity, author-author, and entity-author relations. The named entity recognition model achieves an average F1 score of 92.86% across 6 categories of medical entities, and the knowledge graph contains 34,802 medical entities and 397,163 authors. The study shows that this pipeline can build a knowledge graph effectively, and that the graph can be used to quickly identify frontier research hotspots and core scholars in related fields, promoting knowledge acquisition and the dissemination of concepts.
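The last two steps of the pipeline described in the abstract (author disambiguation, then incremental graph construction) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, so Jaccard similarity, the equal weighting of field and institution, the 0.5 threshold, and the record layout (`name`, `fields`, `orgs`) are all assumptions, as are the names `same_author` and `KnowledgeGraph`. Entity lists are passed in directly, standing in for the output of the BERT-BiLSTM-CRF recognizer.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_author(rec_a, rec_b, threshold=0.5):
    """Assumed rule: identical name and sufficiently similar fields/institutions."""
    if rec_a["name"] != rec_b["name"]:
        return False
    score = 0.5 * jaccard(rec_a["fields"], rec_b["fields"]) \
        + 0.5 * jaccard(rec_a["orgs"], rec_b["orgs"])
    return score >= threshold


class KnowledgeGraph:
    """Toy graph that grows one paper at a time (incremental construction)."""

    def __init__(self):
        self.authors = []              # merged author records; index = author id
        self.edges = defaultdict(int)  # (relation, node_a, node_b) -> co-occurrence count

    def resolve_author(self, rec):
        """Return the id of a matching known author, or register a new one."""
        for idx, known in enumerate(self.authors):
            if same_author(known, rec):
                known["fields"] |= set(rec["fields"])
                known["orgs"] |= set(rec["orgs"])
                return idx
        self.authors.append({
            "name": rec["name"],
            "fields": set(rec["fields"]),
            "orgs": set(rec["orgs"]),
        })
        return len(self.authors) - 1

    def add_paper(self, entities, author_records):
        """Add the three relation types for a single paper."""
        entity_set = sorted(set(entities))
        author_ids = sorted({self.resolve_author(r) for r in author_records})
        for a, b in combinations(entity_set, 2):
            self.edges[("entity-entity", a, b)] += 1
        for a, b in combinations(author_ids, 2):
            self.edges[("author-author", a, b)] += 1
        for e in entity_set:
            for a in author_ids:
                self.edges[("entity-author", e, a)] += 1


# Made-up records for illustration only; two people share the name "Li Wei".
kg = KnowledgeGraph()
kg.add_paper(
    ["SARS-CoV-2", "remdesivir"],
    [{"name": "Li Wei", "fields": ["virology"], "orgs": ["University A"]}],
)
kg.add_paper(
    ["SARS-CoV-2", "fever"],
    [
        {"name": "Li Wei", "fields": ["virology"], "orgs": ["University A"]},
        {"name": "Li Wei", "fields": ["economics"], "orgs": ["University B"]},
    ],
)
print(len(kg.authors), "distinct authors")
```

In this toy run the two virology records merge into one author while the economics record stays separate, and edge counts accumulate as papers are added. A production system would need a more robust matching rule and a real graph store.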
Authors
LIU Hua-ling (刘华玲)
SUN Yi (孙毅)
Department of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201620, China
Source
Computer Technology and Development (《计算机技术与发展》)
2022, Issue 9, pp. 107-113 (7 pages)
Funding
Shanghai Philosophy and Social Sciences Planning Project (2018BJB023)
National Social Science Major Project (16ZDA055)