Abstract
Objective To propose a data preprocessing method, centered on data cleaning, for data mining of traditional Chinese medicine (TCM) prescriptions, so that the data become standardized, accurate, and ordered, and are ready for subsequent processing. Methods Text data sources were retrieved from prescription databases. Non-normalized data were then cleaned through line processing of auxiliary word groups, regular expression substitution, and synonym (alias) processing to improve data quality. Results A total of 1758 records were retrieved from the Chinese prescription database (中国方剂数据库) and 91 records from the prescription modern application database (方剂现代应用数据库). After preprocessing, 6913 valid herb entries were obtained, which could be imported into the relevant information mining system to extract prescription names and Chinese herb names. Conclusion The method is suitable for text mining and knowledge discovery based on TCM prescription databases. It cleans the source text into uniformly standardized, noise-free data, enables effective extraction of the required prescription and herb information, and can serve as a reference for research on analyzing and mining TCM prescription text data.
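The cleaning steps named in the abstract (auxiliary word removal, regular expression substitution, alias processing) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `clean_ingredients`, the auxiliary word list, the dosage pattern, and the alias table are all assumptions introduced here, and the sample herb names are illustrative only.

```python
import re

# Hypothetical alias table: variant herb name -> canonical name (illustrative entries).
ALIASES = {"川军": "大黄", "国老": "甘草", "元胡": "延胡索"}

# Hypothetical auxiliary words (processing notes) to strip from ingredient text.
AUX_WORDS = ["去皮", "炙", "各"]
AUX = re.compile("|".join(sorted(map(re.escape, AUX_WORDS), key=len, reverse=True)))

# Assumed patterns: bracketed annotations, dosage expressions, and separators.
BRACKETS = re.compile(r"[（(][^）)]*[）)]")
# Note: a name made of numerals directly followed by a dose (e.g. 三七三钱)
# would be swallowed by this simple pattern and needs a guard in real use.
DOSAGE = re.compile(r"[0-9一二三四五六七八九十半百]+(?:克|g|两|钱|分|枚|片)")
SEPARATORS = re.compile(r"[，,、；;。\s]+")


def clean_ingredients(raw):
    """Return a list of canonical herb names from one raw ingredient line."""
    text = BRACKETS.sub("", raw)    # drop bracketed processing notes
    text = DOSAGE.sub(" ", text)    # replace dosages with a separator
    text = AUX.sub("", text)        # remove auxiliary words
    names = [n for n in SEPARATORS.split(text) if n]
    # Alias processing: map variant names onto one canonical name.
    return [ALIASES.get(n, n) for n in names]


print(clean_ingredients("川军(酒洗)三钱、甘草炙二两，元胡10g"))
```

Blindly stripping auxiliary characters such as 炙 would also alter names that legitimately contain them, so a real pipeline would likely match against a curated herb dictionary rather than rely on substring removal alone.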
Source
Chinese Journal of Library and Information Science for Traditional Chinese Medicine (《中国中医药图书情报杂志》)
2015, Issue 3, pp. 8-11 (4 pages)
Funding
Scientific Research Project of the Department of Education of Liaoning Province (L2012345)
Keywords
TCM prescriptions
prescription database
text mining
data preprocessing
data cleaning