Abstract
De-identification was a shared task of the 2014 i2b2/UTHealth challenge; its goal is to identify and remove protected health information (PHI) from electronic medical records. This paper proposes a de-identification method based on a two-tier classification model that combines support vector machines (SVMs) and conditional random fields (CRFs). A preprocessing module first tokenizes the record text, and four types of features are extracted from the tokens. An SVM model is trained on these features to identify the boundaries of PHI entities, and its output is added to the feature set as a new feature. A multi-class CRF classifier is then trained on the extended feature set and used to recognize each category of PHI. Experiments show that the two-tier model is effective for PHI recognition, achieving an F-measure of 0.9110.
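The feature-stacking step described in the abstract, where first-tier boundary predictions become an extra input feature for the second tier, can be sketched roughly as follows. This is only an illustration under stated assumptions: `tokenize`, `base_features`, `placeholder_boundary_model`, and `stack_features` are hypothetical names, the features shown are not the paper's four feature types, and the rule-based placeholder merely stands in for a trained SVM. In a full system the stacked feature dictionaries would be passed to a CRF trainer.

```python
import re

def tokenize(text):
    """Split record text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def base_features(tokens, i):
    """Illustrative token-level features (assumed, not the paper's feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def placeholder_boundary_model(features):
    """Stand-in for the trained first-tier SVM (hypothetical rule, not learned).

    Returns "PHI" if the token looks like part of a PHI entity, else "O".
    """
    titles = {"dr", "mr", "mrs", "ms"}
    if features["has_digit"]:
        return "PHI"
    if features["is_title"] and features["prev"] in titles:
        return "PHI"
    return "O"

def stack_features(tokens, boundary_model):
    """Add the first-tier boundary prediction to each token's feature dict."""
    stacked = []
    for i in range(len(tokens)):
        feats = base_features(tokens, i)
        feats["tier1_boundary"] = boundary_model(feats)
        stacked.append(feats)
    return stacked

if __name__ == "__main__":
    toks = tokenize("Dr Smith saw the patient on 2014-03-05")
    for tok, feats in zip(toks, stack_features(toks, placeholder_boundary_model)):
        print(tok, feats["tier1_boundary"])
```

The second-tier classifier sees both the original features and `tier1_boundary`, so it can concentrate on assigning a PHI category to spans that the first tier has already flagged.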
Source
《智能计算机与应用》
2016, No. 6, pp. 17-19, 24 (4 pages in total)
Intelligent Computer and Applications
Keywords
electronic medical records
de-identification
SVM
CRF