Abstract
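The Book of Songs ranks first among the "Five Classics" of the Old Text school and is rich in content. With the wide application of humanities computing, this paper draws on the domain knowledge of the Mao Shi Index in the Sinological Index Series and uses machine learning to study the automatic word segmentation of The Book of Songs. Based on a manually segmented corpus of The Book of Songs, the Guang Yun character table is combined with statistical analysis to obtain 23 sets of feature templates that fuse different kinds of feature knowledge, and these templates are used to train machine learning segmentation models. Each segmentation model is tested for performance. The analysis finds that part-of-speech features have the greatest influence on segmentation quality for The Book of Songs, and that the F value (harmonic mean) of the segmentation models reaches up to 97.42%. Finally, the domain word list of the Mao Shi Index is used to post-process the best-performing segmentation model with long-word correction, yielding a segmented corpus of The Book of Songs that incorporates the expert vocabulary knowledge of the Mao Shi Index. The research approach of integrating multi-dimensional domain knowledge into the automatic segmentation of The Book of Songs not only serves as a reference for related research on Pre-Qin poetry but also offers inspiration for research on the automatic word segmentation of Pre-Qin classics. As part of a corpus of Pre-Qin classics, the segmented corpus of The Book of Songs provides considerable support for further knowledge mining of the Pre-Qin classics.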
The Book of Songs is the earliest anthology of poetry in China and one of the thirteen classics of the Confucian tradition. It ranks first among the Five Classics of the Old Text school: the Yijing ("Classic of Changes"), the Shujing ("Classic of History"), The Book of Songs, the Collection of Rituals, and the Chunqiu ("Spring and Autumn Annals"). Its content is rich, reflecting all aspects of social life in the Zhou Dynasty, including labor and love, war and corvée, oppression and rebellion, customs and marriage, ancestor worship and banquets, and even astronomy, geomorphology, animals, and plants. It is a mirror of Zhou Dynasty society, known as "The Life Encyclopedia of Ancient Society". Moreover, The Book of Songs served as a textbook of ancient Chinese political ethics, aesthetic education, and naturalism. With the extensive application of humanities computing, this paper draws on the domain knowledge of the Mao Shi Index in the Sinological Index Series and studies the automatic word segmentation of The Book of Songs using machine learning. Based on a manually segmented corpus of The Book of Songs, a method combining the Guang Yun character table with statistical analysis was used to obtain 23 sets of feature templates that fuse different kinds of feature knowledge, from which machine learning segmentation models were trained. The performance of each segmentation model was tested, and the analysis shows that part-of-speech features have the greatest influence on segmentation quality for The Book of Songs; the F value (harmonic mean) of the segmentation models reaches up to 97.42%. Finally, the domain word list of the Mao Shi Index is used to apply long-word correction as post-processing to the best-performing segmentation model, yielding a segmented corpus of The Book of Songs that incorporates the expert vocabulary knowledge of the Mao Shi Index.
The research approach of this paper, which integrates multi-dimensional domain knowledge to achieve the automatic segmentation of The Book of Songs, provides a reference for related research on Pre-Qin poetry and offers inspiration for the study of automatic word segmentation of Pre-Qin classics. The segmented corpus of The Book of Songs, as part of a corpus of Pre-Qin classics, supports further knowledge mining of the Pre-Qin classics.
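As an illustration only, the two evaluation and post-processing ideas named in the abstract can be sketched as follows. The abstract does not describe the actual correction procedure or scoring script, so everything here is an assumption: the function names `f_value` and `merge_long_words`, the greedy longest-match merging strategy, the `max_segments` parameter, and the two-entry lexicon are all hypothetical stand-ins, not the authors' method or the real Mao Shi Index word list.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_value(gold: List[str], pred: List[str]) -> float:
    """Harmonic mean of precision and recall, computed over word spans."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def merge_long_words(words: List[str], lexicon: Set[str],
                     max_segments: int = 4) -> List[str]:
    """Assumed long-word correction: greedily merge adjacent segments
    whose concatenation appears in a domain lexicon (longest first)."""
    out: List[str] = []
    i = 0
    while i < len(words):
        merged = False
        # Try joining max_segments adjacent segments, then fewer, down to two.
        for j in range(min(len(words), i + max_segments), i + 1, -1):
            candidate = "".join(words[i:j])
            if candidate in lexicon:
                out.append(candidate)
                i = j
                merged = True
                break
        if not merged:
            out.append(words[i])
            i += 1
    return out


# Hypothetical lexicon and model output, for demonstration only.
toy_lexicon = {"关关", "雎鸠"}
model_output = ["关", "关", "雎", "鸠"]
corrected = merge_long_words(model_output, toy_lexicon)
```

In this toy run, `corrected` holds the two merged words, and `f_value` can be used to compare a segmentation against a manually segmented reference before and after correction.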
Authors
王姗姗
王东波
黄水清
何琳
Wang Shanshan; Wang Dongbo; Huang Shuiqing; He Lin (Nanjing Agricultural University, Nanjing 210095)
Source
《情报学报》
CSSCI
CSCD
Peking University Core Journals
2018, Issue 2, pp. 183-193 (11 pages)
Journal of the China Society for Scientific and Technical Information
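Funding
Major Program of the National Social Science Fund of China, "Construction of a Classics Knowledge Base and Humanities Computing Research Based on the Sinological Index Series" (15ZDB127)
Nanjing Agricultural University Humanities and Social Sciences Fund of the Fundamental Research Funds for the Central Universities, "Research on an Ontology of Ancient Texts Based on the Sinological Index Series" (SKCX2017004)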