Purpose: As more digital collections of information resources become available, the challenge of assigning subject index terms and classes from quality knowledge organization systems also grows. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.
Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research comprised 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).
Findings: Evaluation shows that Support Vector Machine with linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when the characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine, though they came close, with the benefit of a smaller representation size. Analysis of the impact of features in machine learning shows that using keywords, or combining titles and keywords, gives better results than using only titles as input. Stemming improves the results only marginally. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available on which to train; these figures hold only for the top three hierarchical levels (803 instead of 14,413 classes).
Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems.
Practical implications: For operative information retrieval systems, applying purely automatic DDC classification does not work, whether using machine learning (because of the lack of training data for the large number of DDC classes) or the string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. For quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way forward.
Originality/value: The study explored machine learning on a large classification system of over 14,000 classes that is used in operational information retrieval systems. Because sufficient training data was lacking across the entire set of classes, a complementary approach, string matching, was also applied. This combination should be explored further since it offers potential for real-life applications with large target classification systems.
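The string-matching and machine-aided indexing ideas above can be illustrated with a minimal sketch. It is not the algorithm evaluated in the study: the term list `DDC_TERMS`, the scoring rule (a simple count of matched terms), and the function names are all hypothetical, chosen only to show how text from titles and keywords might be matched against class terms and turned into ranked suggestions for a human indexer. It also shows one plausible reading of reducing a DDC number to its top three hierarchical levels, namely keeping the three-digit class.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical term list: a few three-digit DDC classes with illustrative terms.
# A real system would draw its terms from DDC captions and index entries.
DDC_TERMS = {
    "510": ["mathematics"],
    "512": ["algebra"],
    "540": ["chemistry", "chemical"],
    "780": ["music", "musical"],
}


def top_three_levels(ddc_number: str) -> str:
    """Truncate a DDC number such as '512.5' to its three-digit class."""
    return ddc_number.split(".")[0][:3]


def suggest_classes(text: str, max_suggestions: int = 3) -> List[Tuple[str, int]]:
    """Rank DDC classes by how many of their terms occur in the text.

    Returns (class, score) pairs, best first; classes with no match are omitted.
    The output is meant as a suggestion list for a human to confirm.
    """
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {}
    for ddc, terms in DDC_TERMS.items():
        score = sum(tokens[term] for term in terms)
        if score > 0:
            scores[ddc] = score
    # Sort by descending score, then by class number for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_suggestions]


if __name__ == "__main__":
    print(suggest_classes("Linear algebra and mathematics: an introduction to algebra"))
    print(top_three_levels("512.5"))
```

Returning a short ranked list rather than a single class mirrors the abstract's recommendation that automatic output serve as suggestions, with the final decision left to a person.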
Purpose: This study discusses strategies for mapping Dewey Decimal Classification (DDC) numbers to Chinese Library Classification (CLC) numbers based on co-occurrence mapping while minimizing manual intervention.
Design/methodology/approach: Several statistical tables were created based on frequency counts of the mapping relations in samples of USMARC records that contain both DDC and CLC numbers. A manual table was created through direct mapping. To find reasonable mapping strategies, the mapping results were compared in three respects: the sample size, the choice between one-to-one and one-to-multiple mapping relations, and the role of a manual mapping table.
Findings: A larger sample size provides more DDC numbers in the mapping table. The statistical table that includes one-to-multiple DDC-CLC relations provides a higher ratio of correct matches than the table that includes only one-to-one relations. The manual mapping table does not produce a better result than the statistical tables. Statistical mapping tables should therefore be used as fully as possible, and time-consuming manual mapping avoided wherever possible.
Research limitations: All the sample sizes were small. DDC editions were not considered in the study. One-to-multiple DDC-CLC relations in the records were collected in the mapping table, but how to select one appropriate CLC number in the matching process needs further study.
Practical implications: The ratio of correct matches based on the statistical mapping table reached about 90% for CLC top-level classes and 76% for second-level classes in this study. The statistical mapping table will be improved to realize the automatic classification of e-resources and shorten the cataloging cycle significantly.
Originality/value: The mapping results were investigated from different aspects in order to find suitable strategies for mapping from DDC to CLC while minimizing manual intervention. The findings have facilitated the establishment of a DDC-CLC mapping system for practical applications.
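The co-occurrence approach described above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the sample pairs are invented, "one-to-one" is interpreted here as keeping only the most frequent CLC number for each DDC number, and a match at a given CLC level is approximated by comparing notation prefixes (one character for the top-level class). All function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample: (DDC number, CLC number) pairs as they might co-occur
# in catalog records that carry both notations. Invented for illustration.
SAMPLE_PAIRS = [
    ("512", "O15"), ("512", "O15"), ("512", "O13"),
    ("780", "J6"), ("780", "J6"),
    ("005.13", "TP312"), ("005.13", "TP311"), ("005.13", "TP312"),
]


def build_tables(pairs):
    """Build one-to-one and one-to-multiple mapping tables from frequency counts."""
    counts = defaultdict(Counter)
    for ddc, clc in pairs:
        counts[ddc][clc] += 1
    # Assumption: one-to-one keeps only the most frequent CLC number per DDC number.
    one_to_one = {ddc: [c.most_common(1)[0][0]] for ddc, c in counts.items()}
    # One-to-multiple keeps every CLC number seen, most frequent first.
    one_to_multiple = {ddc: [clc for clc, _ in c.most_common()] for ddc, c in counts.items()}
    return one_to_one, one_to_multiple


def match_ratio(table, test_pairs, prefix_len=None):
    """Share of test pairs whose CLC number matches a candidate in the table.

    prefix_len=None compares full CLC numbers; prefix_len=1 compares only the
    first character, used here as a stand-in for the CLC top-level class.
    """
    if not test_pairs:
        return 0.0
    hits = 0
    for ddc, clc in test_pairs:
        candidates = table.get(ddc, [])
        if any(c[:prefix_len] == clc[:prefix_len] for c in candidates):
            hits += 1
    return hits / len(test_pairs)


if __name__ == "__main__":
    one_to_one, one_to_multiple = build_tables(SAMPLE_PAIRS)
    test_pairs = [("512", "O15"), ("780", "J7"), ("005.13", "TP311"), ("999", "K1")]
    print(match_ratio(one_to_one, test_pairs))
    print(match_ratio(one_to_multiple, test_pairs))
    print(match_ratio(one_to_one, test_pairs, prefix_len=1))
```

In this toy data the one-to-multiple table matches more test pairs than the one-to-one table because it keeps less frequent CLC numbers as candidates; the open question noted in the abstract, how to pick a single CLC number from several candidates, is not addressed by the sketch.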
Funding: Jointly supported by the Foundation for Humanities and Social Sciences of the Chinese Ministry of Education (Grant No.: 11BTQ007) and the Shanghai Society for Library Science (Grant No.: 10BSTX02).