Abstract
Electronic medical records (EMRs) contain a large amount of useful medical knowledge, and extracting this knowledge is important for building clinical decision support systems and personalized healthcare information services. Automatic word segmentation is a key prerequisite for analyzing and mining Chinese EMRs. To overcome the difficulty of obtaining annotated corpora, this paper proposes an unsupervised approach to word segmentation for Chinese EMRs. First, a general-domain lexicon is used to produce an initial segmentation; to better resolve ambiguity, a probabilistic model is introduced, and word probabilities are estimated from the raw (unannotated) corpus with the EM algorithm. Then, a goodness measure is built from the left and right branching entropy of character strings, which casts unknown-word recognition as an optimization problem that is solved by dynamic programming. Finally, experiments on 3,000 Chinese EMRs from the Department of Neurology demonstrate the effectiveness of the approach.
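The two building blocks named in the abstract, branching entropy and dynamic-programming segmentation, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `branching_entropy` and `segment`, the `max_len` and `unk_prob` parameters, and the toy probabilities are all assumptions made for illustration. In the paper the word probabilities are estimated by EM and the goodness measure is combined with the lexicon in a way the abstract does not specify; here the probabilities are simply supplied by hand.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus, max_len=4):
    """Return (left, right) dicts mapping each substring (up to max_len
    characters) to the entropy of the characters observed immediately
    before / after it in the corpus."""
    left_ctx = defaultdict(Counter)
    right_ctx = defaultdict(Counter)
    for sent in corpus:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                if i > 0:
                    left_ctx[s][sent[i - 1]] += 1
                if j < n:
                    right_ctx[s][sent[j]] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values())

    left = {s: entropy(c) for s, c in left_ctx.items()}
    right = {s: entropy(c) for s, c in right_ctx.items()}
    return left, right


def segment(sentence, word_prob, max_len=4, unk_prob=1e-8):
    """Maximum-probability segmentation under a unigram word model,
    found by dynamic programming over prefix positions.
    Unknown single characters get the (assumed) fallback unk_prob."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[j]: best log-prob of sentence[:j]
    back = [0] * (n + 1)            # back[j]: start index of the last word
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            w = sentence[i:j]
            p = word_prob.get(w)
            if p is None:
                if len(w) > 1:
                    continue        # unknown multi-character strings are skipped
                p = unk_prob
            score = best[i] + math.log(p)
            if score > best[j]:
                best[j] = score
                back[j] = i
    words = []
    j = n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]
```

A string whose left and right branching entropies are both high occurs in many different contexts, which is the usual signal that it is a free-standing word; a goodness score built on these values could be used to admit unknown words as candidates in the same dynamic program.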
Source
《智能计算机与应用》
2014, No. 2, pp. 68-71 (4 pages)
Intelligent Computer and Applications
Keywords
Chinese EMRs
Unsupervised Segmentation
EM Algorithm
Branching Entropy
Dynamic Programming