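Abstract: To address the scarcity of text corpora on livestock and poultry diseases and the large number of out-of-vocabulary words in such text, such as disease names and phrases, a BERT-BiLSTM-CRF word segmentation model combined with dictionary matching is proposed for livestock and poultry disease text. Taking sheep diseases as the research object, a dataset of common sheep disease text was constructed and combined with the general-purpose PKU corpus. The BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model produces the vector representation of the text; a bidirectional long short-term memory network (BiLSTM) captures contextual semantic features; and a conditional random field (CRF) outputs the globally optimal label sequence. On this basis, a livestock and poultry disease domain dictionary is applied after the CRF layer to correct the segmentation by dictionary matching, which reduces ambiguous splits caused by disease names and phrases and further improves segmentation accuracy. Experimental results show that the BERT-BiLSTM-CRF model combined with dictionary matching achieves an F1 score of 96.38% on the common sheep disease text dataset, an improvement of 11.01, 10.62, 8.3, and 0.72 percentage points over the jieba segmenter, the BiLSTM-Softmax model, the BiLSTM-CRF model, and the proposed model without dictionary matching, respectively, which verifies the effectiveness of the method. Compared with a single corpus, the mixed corpus that combines the PKU general corpus with the common sheep disease text dataset allows accurate segmentation of both specialized disease terms and the common words found in disease text, with F1 scores above 95% on both the general corpus and the disease text dataset, indicating good generalization. The method can be used for word segmentation of livestock and poultry disease text.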
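The dictionary-matching correction applied after the CRF layer can be illustrated with a small sketch. The abstract does not describe the exact correction procedure, so the following is an assumption: a greedy, longest-match-first pass that merges adjacent tokens whenever their concatenation is a dictionary entry. The function name `merge_with_dictionary`, the `max_span` parameter, and the sample lexicon are hypothetical and are not taken from the paper.

```python
def merge_with_dictionary(tokens, lexicon, max_span=6):
    """Merge runs of adjacent tokens whose concatenation is a lexicon entry.

    tokens  : word list produced by an upstream segmenter (assumed input)
    lexicon : set of domain terms (hypothetical disease dictionary)
    max_span: largest number of adjacent tokens considered for one merge
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that longer terms win.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in lexicon:
                merged.append(candidate)
                i = j
                break
        else:
            # No multi-token match starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical over-split output for a sheep disease name.
tokens = ["小", "反刍", "兽", "疫", "的", "症状"]
lexicon = {"小反刍兽疫"}
print(merge_with_dictionary(tokens, lexicon))
```

This sketch only repairs terms that the segmenter split into too many pieces. A term whose boundary falls inside an existing token would need a character-level matching pass instead.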
Abstract: Telling China’s story and spreading China’s voice in Japanese is an important part of shaping China’s international image in Japan, and is also one of the primary goals of Japanese language education in China. The People’s Daily Japanese Edition (《人民网日文版》) is an important medium for communicating China’s story to Japan and can serve as a teaching resource for training Japanese-language talent able to tell China’s story. To analyze how China’s image is currently presented in this outlet’s news texts and to make full use of its educational function, collecting these news texts and building a corpus for quantitative analysis is an effective research approach. Accurate word segmentation is a prerequisite for the quantitative analysis of Japanese corpora. However, the study found that current Japanese word segmentation tools struggle to segment Japanese texts about China’s story precisely, which would seriously affect the reliability of the analysis results. Therefore, this study extracts Japanese expressions related to Chinese society, economy, culture, and technology from the news texts of the People’s Daily Japanese Edition, constructs a dedicated segmentation word list for Japanese texts about China’s story, and evaluates the accuracy and practicality of this word list.
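The abstract does not state how the word list is evaluated. One common way to score segmentation accuracy, shown here only as an assumed illustration, is to compare the character spans of predicted words with those of a gold segmentation and compute precision, recall, and F1. The function names and the sample sentence are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Span-level precision, recall, and F1 for one segmented sentence."""
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)  # words with identical start and end offsets
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: a tool without the custom entry splits the term in two.
gold = ["一帯一路", "の", "建設"]
pred = ["一帯", "一路", "の", "建設"]
print(segmentation_prf(gold, pred))
```

Comparing scores before and after the custom word list is registered with the segmenter gives a simple measure of how much the list helps on domain-specific terms.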
Funding: Supported in part by the Graduate Student-Faculty Mentoring Research Program to the first, second, and fourth authors in the College of Education, Criminal Justice, and Human Services at the University of Cincinnati, USA.