This paper presents the text-linguistic concepts on which the analysis of textual structure is based, including text and discourse, and coherence and cohesion. In addition, we try to identify how text manifests differently in ET and CT, including their different coherence structures.
Text extraction is a key step in character recognition; its accuracy relies heavily on locating the text region. In this paper, we propose a new method that finds the text location automatically, addressing region problems such as incomplete regions, false positions, and orientation deviation that occur when extracting text from low-contrast images. First, we pre-process the original image using color space transformation, contrast-limited adaptive histogram equalization, the Sobel edge detector, morphological methods, and the eight-neighborhood processing method (ENPM), among others, to provide results for comparing the different methods. Second, we use connected component analysis (CCA) to obtain connected and non-connected parts, then apply the morphological method and CCA again to the non-connected parts to erode noise, yielding a further set of connected and non-connected parts. Third, we compute edge features for all connected areas and use a Support Vector Machine (SVM) to classify the real text regions and obtain the text location coordinates. Finally, we use the text region coordinates to extract the blocks containing text, then binarize, cluster, and recognize all the text information. We evaluate the method by calculating precision and recall on more than 200 images. The experiments show that the proposed method is robust for low-contrast text images with variations in font size, font color, language, dim lighting, and so on.
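The connected component step and the final evaluation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `label_components` and `precision_recall` are hypothetical, the input is assumed to be an already binarized edge map stored as a list of 0/1 rows, and the eight-neighborhood rule here is plain 8-connectivity, which may differ from the paper's ENPM.

```python
from collections import deque


def label_components(grid):
    """Label 8-connected foreground regions of a binary grid.

    Returns (labels, boxes), where boxes[i] is the bounding box
    (top, left, bottom, right) of the component labelled i + 1.
    Hypothetical helper, not taken from the paper.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or labels[r][c]:
                continue
            label = len(boxes) + 1
            labels[r][c] = label
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit all eight neighbours (diagonals included).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and grid[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = label
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return labels, boxes


def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from detection counts (0.0 when undefined)."""
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    edge_map = [
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 1, 0, 1],
        [1, 0, 0, 0],
    ]
    _, boxes = label_components(edge_map)
    print(boxes)
    print(precision_recall(8, 2, 2))
```

In a fuller pipeline, each bounding box would be turned into an edge-feature vector and passed to a classifier such as an SVM; that step is omitted here because the abstract does not specify the features.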
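Chinese story ending generation (SEG) is a downstream task in natural language processing. CLSEG (Contrastive Learning of Story Ending Generation), which relies entirely on wrong endings, performs well on story consistency. However, because wrong endings also contain content identical to the original ending, contrastive training that uses only wrong endings strips the main correct parts of the original ending out of the generated text. Therefore, positive-ending enhancement training is added on top of CLSEG to retain the correct parts lost in contrastive training; introducing positive endings also gives the generated endings greater diversity and relevance. The Chinese story ending generation model based on bidirectional contrastive training has two main parts: 1) multi-ending sampling, which uses different model methods to obtain positively enhanced endings and negatively contrasted wrong endings; 2) contrastive training, which modifies the loss function during training so that generated endings move toward the positive endings and away from the wrong endings. Experimental results on the public story dataset OutGen show that, compared with models such as GPT2. ft and deep layer-wise latent variable fusion (Della), the proposed model achieves better results on metrics such as BERTScore and METEOR, and the generated endings have greater diversity and relevance.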
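The idea of a loss that pulls generated endings toward positive endings and pushes them away from wrong endings can be illustrated with an InfoNCE-style term. The abstract does not give the actual loss, so everything here is an assumption made for illustration: the function name `bidirectional_contrastive_loss`, the use of cosine similarity over fixed-size ending embeddings, and the `temperature` parameter.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bidirectional_contrastive_loss(generated, positives, negatives,
                                   temperature=0.1):
    """InfoNCE-style loss (illustrative assumption, not the paper's formula).

    The loss shrinks as `generated` gets more similar to the positive
    endings and less similar to the wrong (negative) endings.
    """
    if not positives:
        raise ValueError("at least one positive ending is required")
    pos = sum(math.exp(cosine(generated, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(generated, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    generated = [1.0, 0.0]
    # Generated ending aligned with the positive ending: low loss.
    print(bidirectional_contrastive_loss(generated, [[1.0, 0.0]], [[0.0, 1.0]]))
    # Generated ending aligned with the wrong ending: high loss.
    print(bidirectional_contrastive_loss(generated, [[0.0, 1.0]], [[1.0, 0.0]]))
```

In training, a term like this would typically be added to the usual generation loss with a weighting factor; the weighting is not stated in the abstract and is left out here.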
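With the development of natural language processing, text summarization has been widely applied in many areas of life. In the judicial domain, text summarization can "reduce the dimensionality" of judicial texts, which helps readers grasp case details quickly and extract case elements, and promotes the informatization and intelligent development of the judiciary. However, when existing summarization models are applied to judicial texts, the quality of the generated summaries is unsatisfactory: they suffer from repetition, redundancy, and inconsistency with the facts. In particular, when the perpetrator faces multiple charges and multiple penalties, summaries generated by common summarization models mismatch charges and penalties. To address these problems, a judicial text summary generation model based on a knowledge-enhanced pre-trained model, LCSG-ERNIE (legal case summary generation based on enhanced language representation with informative entities), is proposed. The model incorporates judicial knowledge into the pre-trained language model and combines it with the idea of contrastive learning to generate summaries, improving summary quality and reducing charge-penalty mismatches. Experiments show that the proposed model achieves good results.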