Abstract
Efficient, accurate multi-label classification of mineral deposit description texts, which lowers the difficulty of extracting fine-grained knowledge from large volumes of text, requires purpose-built labeled datasets and machine learning models. Using 17 content labels, such as geographic location, metallogenic zone, and orebody geology, 13,411 sentences from Geology of Mineral Resources in China: Overview of Typical Mineral Deposits (《中国矿产地质志·典型矿床总述卷》) were manually annotated to build a labeled multi-label classification dataset for mineral deposit description texts. The multi-label classification process was decomposed into three steps: tokenization (dividing text into feature units), text vectorization, and classification. Different methods were applied at each step, yielding 30 machine learning classification models, whose performance was tested and compared on the labeled dataset. The experiments show that fine-tuned BERT paired with an FNN classifier reaches a weighted F1 score of 0.91, outperforming the other models; TextCNN paired with a K-nearest neighbors classifier reaches 0.80; and a TF-IDF bag-of-words model paired with an FNN classifier reaches 0.76. With the methods in the other steps held fixed, models that tokenize by character achieve relatively higher weighted F1 scores. Machine learning models based on fine-tuned BERT can replace or assist manual multi-label classification of mineral deposit description texts. Models using a TF-IDF bag of words are comparatively interpretable and can be used to optimize manual classification methods.
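The three-step decomposition described in the abstract (tokenization, vectorization, classification) and the weighted F1 metric can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the two toy sentences, the smoothed IDF formula, and the use of a 1-nearest-neighbor rule in place of a tuned K-nearest neighbors classifier are all assumptions made for illustration. Weighted F1 is taken here to mean the per-label F1 averaged with weights equal to each label's support, which is a common definition but is not spelled out in the abstract.

```python
import math
from collections import Counter


def char_tokens(sentence):
    """Step 1: treat every non-whitespace character as one feature unit."""
    return [ch for ch in sentence if not ch.isspace()]


def fit_idf(token_docs):
    """Learn smoothed inverse document frequencies from tokenized training text."""
    n_docs = len(token_docs)
    doc_freq = Counter(tok for doc in token_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Step 2: map tokens to a sparse TF-IDF vector; unseen tokens are dropped."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {tok: (c / len(tokens)) * idf[tok]
            for tok, c in counts.items() if tok in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_1nn(vec, train_vecs, train_labels):
    """Step 3: copy the label set of the most similar training sentence."""
    best = max(range(len(train_vecs)), key=lambda i: cosine(vec, train_vecs[i]))
    return set(train_labels[best])


def weighted_f1(y_true, y_pred, labels):
    """Average per-label F1, weighting each label by its number of true instances."""
    total_support = 0
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (tp + fn) * f1
        total_support += tp + fn
    return score / total_support if total_support else 0.0


# Hypothetical toy data: two invented sentences, one label each.
train_texts = ["矿区位于云南省东部", "矿体呈层状产出"]
train_labels = [{"geographic location"}, {"orebody geology"}]
train_tokens = [char_tokens(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf(toks, idf) for toks in train_tokens]

query = tfidf(char_tokens("矿体呈脉状产出"), idf)
predicted = predict_1nn(query, train_vecs, train_labels)
print(predicted)
```

Because each step is a separate function, one step can be swapped (for example, word-level instead of character-level tokens) while the others stay fixed, which mirrors how the abstract compares models that differ in a single step.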
Authors
ZHAO Kai (赵锴)
YE Dan (叶丹)
Geoscience Documentation Center, China Geological Survey, Beijing 100083, China; Technology Innovation Center of Geoscience Knowledge and Intelligent Service, China Geological Survey, Beijing 100083, China; Geological Publishing House, Beijing 100083, China
Source
China Mining Magazine (《中国矿业》)
Peking University Core Journal (北大核心)
2024, Issue 10, pp. 153-161 (9 pages)
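Funding
Supported by the Geoscience Literature Knowledge Service and Decision Support project (No. DD20230139)
and the Big Data Intelligent Mineral Prospecting Prediction (Geoscience Documentation Center) project (No. DD20243286).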
Keywords
machine learning
natural language processing
multi-label classification
mineral deposit geology
knowledge engineering