Abstract
Objective To propose a strategy for constructing a unified view of TCM entities over multi-source data, addressing two problems that arise in cross-source queries in a big data environment: TCM entities appear under multiple views, and their representations in the individual data sources show incomplete attributes, multi-modality, and inconsistencies. Methods Based on the relationships among entity attributes, an overall architecture for multi-view entity fusion was constructed, and key elements such as entities and attributes were given abstract representations. A word vector-based correlation calculation method, constrained by user requirements, was proposed, and the Skip-gram model was used to train word vectors that represent entity attributes. A correlation algorithm based on Euclidean distance and the Jaccard coefficient was then proposed and used as the basis for entity fusion. Results A total of 6116 attribute word vectors were trained, of which 230 were effective. With 400 pairs of TCM entities from different sources as the test set, entity fusion experiments were carried out with the AFCDS, FF, and WVCC methods, giving fusion accuracies of 92.20%, 88.47%, and 94.24%, respectively. Conclusion The word vector-based entity fusion strategy is effective and feasible. It makes full use of the useful information shared among attributes, adapts well, achieves high fusion accuracy, and offers a new line of research for the multi-source entity fusion problem.
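The abstract names the ingredients of the correlation step (attribute word vectors, Euclidean distance, and the Jaccard coefficient) but does not give the formula that combines them. The following is only an illustrative sketch under stated assumptions, not the authors' algorithm: each entity is modeled as a set of attribute names, the distance is mapped to a similarity as 1 / (1 + d), and the two scores are blended with a hypothetical weight `alpha`. The toy vectors and all function names are invented for illustration.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient of two attribute sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_vector(attributes, vectors):
    """Average the word vectors of the attributes that have one.

    Returns None when no attribute of the entity has a vector.
    """
    known = [vectors[name] for name in attributes if name in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def euclidean_similarity(u, v):
    """Map Euclidean distance d to a similarity in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + math.dist(u, v))


def correlation(entity_a, entity_b, vectors, alpha=0.5):
    """Blend vector similarity and set overlap; alpha is a hypothetical weight."""
    overlap = jaccard(entity_a, entity_b)
    mean_a = mean_vector(entity_a, vectors)
    mean_b = mean_vector(entity_b, vectors)
    if mean_a is None or mean_b is None:
        vector_score = 0.0
    else:
        vector_score = euclidean_similarity(mean_a, mean_b)
    return alpha * vector_score + (1.0 - alpha) * overlap


# Toy attribute vectors; in the paper these would come from Skip-gram training.
toy_vectors = {"name": [1.0, 0.0], "flavor": [0.0, 1.0], "origin": [1.0, 1.0]}
source_a = {"name", "flavor"}
source_b = {"name", "flavor", "origin"}

score = correlation(source_a, source_b, toy_vectors)
print(f"correlation between the two records: {score:.3f}")
```

In a sketch like this, two records would be treated as candidates for fusion when the score exceeds a chosen threshold; the threshold and the weight would both have to be tuned on labeled entity pairs.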
Authors
梁杨
丁长松
蔡雄
LIANG Yang; DING Changsong; CAI Xiong (School of Information Science and Engineering, Hunan University of Chinese Medicine, Changsha 410208, China; TCM Big Data Analysis Laboratory of Hunan, Changsha 410208, China; School of Computer Science and Engineering, Central South University, Changsha 410000, China; Institute of Innovation and Applied Research in Chinese Medicine, Hunan University of Chinese Medicine, Changsha 410208, China)
Source
《中国中医药信息杂志》
CAS
CSCD
2020, Issue 9, pp. 108-114 (7 pages)
Chinese Journal of Information on Traditional Chinese Medicine
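Funding
National Key Research and Development Program of China (2017YFC1703306)
Scientific Research Project of the Hunan Provincial Department of Education (19C1391)
Key Research and Development Program of Hunan Province (2017SK2111)
Key Project of the Hunan Provincial Department of Education (18A227)
Natural Science Foundation of Hunan Province (2018JJ2301)
Key Project of the Hunan Provincial Traditional Chinese Medicine Research Program (2020002)
Open Fund of the Electronic Science and Technology Discipline, Hunan University of Chinese Medicine (2018DK04)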
Keywords
big data
multi-source data
entity fusion
word vector
correlation