
Construction Strategy for Unified View of TCM Entities in Multi-source Environment

Cited by: 2
Abstract Objective: Cross-source queries in a big data environment return multiple views of the same TCM entity, and the attributes of a TCM entity are incomplete, multimodal, and inconsistent across data sources. To address these problems, a strategy for constructing a unified view of TCM entities over multi-source data is proposed. Methods: Based on the interrelationships between entity attributes, an overall architecture for fusing multiple entity views was built, and key elements such as entities and attributes were given abstract representations. With user requirements as the constraint, a word vector-based correlation calculation method was proposed, and the Skip-gram model was used to train word vectors that characterize entity attributes. A correlation algorithm based on Euclidean distance and the Jaccard coefficient was then proposed and used as the basis for entity fusion. Results: A total of 6116 attribute word vectors were trained, of which 230 were valid. With 400 pairs of TCM entities from different sources as the test set, entity fusion experiments were run with the AFCDS, FF, and WVCC methods, which achieved fusion accuracies of 92.20%, 88.47%, and 94.24%, respectively. Conclusion: The word vector-based entity fusion strategy is effective and feasible. It makes full use of the useful information shared between attributes, adapts well, achieves high fusion accuracy, and offers a new line of research for multi-source entity fusion.
Authors: LIANG Yang; DING Changsong; CAI Xiong (School of Information Science and Engineering, Hunan University of Chinese Medicine, Changsha 410208, China; TCM Big Data Analysis Laboratory of Hunan, Changsha 410208, China; School of Computer Science and Engineering, Central South University, Changsha 410000, China; Institute of Innovation and Applied Research in Chinese Medicine, Hunan University of Chinese Medicine, Changsha 410208, China)
Source: Chinese Journal of Information on Traditional Chinese Medicine (CAS, CSCD), 2020, Issue 9, pp. 108-114 (7 pages)
Funding: National Key R&D Program of China (2017YFC1703306); Scientific Research Project of the Hunan Provincial Department of Education (19C1391); Hunan Provincial Key R&D Program (2017SK2111); Key Project of the Hunan Provincial Department of Education (18A227); Natural Science Foundation of Hunan Province (2018JJ2301); Key Project of the Hunan Provincial TCM Research Program (2020002); Open Fund of the Electronic Science and Technology Discipline, Hunan University of Chinese Medicine (2018DK04).
Keywords: big data; multi-source data; entity fusion; word vector; correlation
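
The abstract describes a correlation measure built from Euclidean distance and the Jaccard coefficient but does not give its formula. The following is a minimal sketch of one way such a measure could be combined, not the authors' WVCC algorithm. The function names, the `1 / (1 + d)` distance-to-similarity mapping, the weight `alpha`, and the fusion `threshold` are all illustrative assumptions, and the vectors stand in for attribute word vectors that would come from a trained Skip-gram model.

```python
import math


def euclidean_similarity(u, v):
    """Map Euclidean distance to a similarity in (0, 1]; identical vectors give 1.

    The 1 / (1 + d) mapping is an assumption made for this sketch.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of attribute names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def correlation(vec_a, vec_b, attrs_a, attrs_b, alpha=0.5):
    """Weighted mix of vector similarity and attribute overlap (alpha is hypothetical)."""
    return (alpha * euclidean_similarity(vec_a, vec_b)
            + (1.0 - alpha) * jaccard(attrs_a, attrs_b))


def should_fuse(score, threshold=0.6):
    """Treat two records as the same entity when the score reaches a hypothetical threshold."""
    return score >= threshold


if __name__ == "__main__":
    # Toy records for one herb as it might appear in two data sources.
    score = correlation(
        [0.0, 0.0], [3.0, 4.0],
        {"name", "nature", "flavor"}, {"name", "flavor", "origin"},
    )
    print(round(score, 4), should_fuse(score))
```

In practice the weight and threshold would need to be tuned on labeled entity pairs, such as a test set like the 400 pairs mentioned in the abstract.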
