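Deep-level classification of documents under the Chinese Library Classification (《中国图书馆分类法》, hereafter CLC) involves two classic natural language processing problems: extreme multi-label text classification (XMC) and hierarchical text classification (HTC). Existing CLC-based classification studies, however, generally treat the task as ordinary text classification. Because they do not exploit these core characteristics of the problem, their results on deep-level classification are generally unsatisfactory, or the approach is infeasible altogether. In contrast, this paper starts from an in-depth analysis of the characteristics and difficulties of CLC document classification, examines deep-level CLC classification and related solutions from both the XMC and HTC perspectives, and adapts and extends those solutions to this setting, improving classification accuracy while extending the depth and breadth of classification. The proposed model first uses a lightweight deep learning model suited to XMC to extract semantic features of the text as the primary basis for classification. Then, to address the HTC aspect of CLC classification, it uses a learning-to-rank (LTR) framework to incorporate multiple features, including hierarchical structure information, as auxiliary evidence, so as to exploit as fully as possible the information and knowledge contained in the text semantics and the classification scheme. The model combines the semantic understanding of deep learning models with the interpretability of machine learning models, and it is extensible: new expert-designed features can be added later with relatively little effort. It is also lightweight, handling tens of thousands of classification labels with limited computing resources, which lays a good foundation for full-depth CLC classification.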
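The two-stage idea in this abstract (semantic scores first, then re-ranking with hierarchy-aware features) can be illustrated with a small sketch. This is not the paper's model: the label hierarchy `PARENT`, the `ancestor_support` feature, and the fixed weights are all assumptions made for illustration, and a real LTR model would learn the weights from training data rather than hard-code them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical slice of a label hierarchy, child -> parent (None = top level).
# The codes are illustrative CLC-style labels, not a complete scheme.
PARENT: Dict[str, Optional[str]] = {
    "T": None, "TP": "T", "TP3": "TP", "TP391": "TP3",
    "G": None, "G2": "G", "G25": "G2",
}


def ancestors(label: str) -> List[str]:
    """Return the chain of ancestors of a label, nearest parent first."""
    chain: List[str] = []
    node = PARENT.get(label)
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain


def ancestor_support(label: str, semantic_scores: Dict[str, float]) -> float:
    """Hierarchy feature (assumed): mean semantic score of the ancestors."""
    chain = ancestors(label)
    if not chain:
        return 0.0
    return sum(semantic_scores.get(a, 0.0) for a in chain) / len(chain)


def rerank(
    semantic_scores: Dict[str, float],
    candidates: List[str],
    w_semantic: float = 0.7,   # hand-set here; learned in a real LTR model
    w_hierarchy: float = 0.3,  # hand-set here; learned in a real LTR model
) -> List[Tuple[str, float]]:
    """Re-rank candidate labels by a weighted sum of the two features."""
    scored = [
        (
            label,
            w_semantic * semantic_scores.get(label, 0.0)
            + w_hierarchy * ancestor_support(label, semantic_scores),
        )
        for label in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up scores, standing in for the output of a first-stage semantic model.
scores = {"TP391": 0.50, "G25": 0.55, "TP3": 0.9, "TP": 0.8, "T": 0.7,
          "G2": 0.1, "G": 0.1}
ranking = rerank(scores, ["TP391", "G25"])
print(ranking[0][0])  # label ranked first after re-ranking
```

In this made-up example, `G25` has the higher semantic score on its own, but `TP391` ranks first after re-ranking because its ancestors are strongly supported. That is the kind of signal a hierarchy feature is meant to contribute.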
Purpose: This study discusses strategies for mapping Dewey Decimal Classification (DDC) numbers to Chinese Library Classification (CLC) numbers based on co-occurrence mapping while minimizing manual intervention.
Design/methodology/approach: Several statistical tables were created from frequency counts of the mapping relations in samples of USMARC records that contain both DDC and CLC numbers. A manual table was created through direct mapping. To find reasonable mapping strategies, the mapping results were compared in three respects: the sample size, the choice between one-to-one and one-to-multiple mapping relations, and the role of a manual mapping table.
Findings: A larger sample size provides more DDC numbers in the mapping table. The statistical table that includes one-to-multiple DDC-CLC relations provides a higher ratio of correct matches than the table that includes only one-to-one relations. The manual mapping table does not produce a better result than the statistical tables. Therefore, statistical mapping tables should be used as fully as possible, and time-consuming manual mapping should be avoided as much as possible.
Research limitations: All the sample sizes were small. DDC editions were not considered. One-to-multiple DDC-CLC relations in the records were collected in the mapping table, but how to select one appropriate CLC number in the matching process needs further study.
Practical implications: In this study, the ratio of correct matches based on the statistical mapping table reached about 90% at the level of CLC top-level classes and 76% at the level of second-level classes. The statistical mapping table will be improved to enable automatic classification of e-resources and to shorten the cataloging cycle significantly.
Originality/value: The mapping results were investigated from different aspects in order to find suitable strategies for mapping from DDC to CLC while minimizing manual intervention. The findings have facilitated the establishment of a DDC-CLC mapping system for practical applications.
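The co-occurrence approach described in this abstract can be sketched as a frequency table keyed by DDC number. This is a minimal illustration, not the study's procedure: the sample records, the function names, and the rule that a match at the top level means sharing the first letter of the CLC number are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (DDC number, CLC number) found in the same record


def build_cooccurrence_table(records: Iterable[Record]) -> Dict[str, Counter]:
    """Count how often each CLC number co-occurs with each DDC number."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for ddc, clc in records:
        table[ddc][clc] += 1
    return table


def one_to_one(table: Dict[str, Counter]) -> Dict[str, List[str]]:
    """Keep only the most frequent CLC number for each DDC number."""
    return {ddc: [counts.most_common(1)[0][0]] for ddc, counts in table.items()}


def one_to_multiple(table: Dict[str, Counter]) -> Dict[str, List[str]]:
    """Keep every co-occurring CLC number, most frequent first."""
    return {ddc: [clc for clc, _ in counts.most_common()]
            for ddc, counts in table.items()}


def match_ratio(mapping: Dict[str, List[str]], held_out: List[Record],
                prefix_len: int = 1) -> float:
    """Share of held-out records for which some candidate agrees with the true
    CLC number on its first `prefix_len` characters (assumed match criterion)."""
    if not held_out:
        return 0.0
    hits = sum(
        any(c[:prefix_len] == true_clc[:prefix_len] for c in mapping.get(ddc, []))
        for ddc, true_clc in held_out
    )
    return hits / len(held_out)


# Made-up sample records, for illustration only.
sample = [("025.4", "G254"), ("025.4", "G254"), ("025.4", "G250"),
          ("004", "TP3"), ("004", "TP3"), ("004", "TP3"), ("004", "G20")]
held_out = [("025.4", "G252"), ("004", "G203")]

table = build_cooccurrence_table(sample)
ratio_single = match_ratio(one_to_one(table), held_out)
ratio_multi = match_ratio(one_to_multiple(table), held_out)
print(ratio_single, ratio_multi)
```

On this toy data the one-to-multiple table matches more held-out records than the one-to-one table, because it keeps the less frequent CLC candidate for `004`. It also shows the limitation noted in the abstract: with several candidates per DDC number, something still has to choose a single CLC number.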
Funding: Jointly supported by the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No. 11BTQ007) and the Shanghai Society for Library Science (Grant No. 10BSTX02).