Purpose - Topic segmentation is an active research field in natural language processing, and many topic segmenters have been proposed. The current challenge for researchers is to improve these segmenters by using external resources. The purpose of this paper is therefore to integrate, study and evaluate a new external semantic resource in topic segmentation.
Design/methodology/approach - New topic segmenters (TSS-Onto and TSB-Onto) are proposed, based on the two well-known segmenters C99 and TextTiling. The proposed segmenters integrate semantic knowledge into the segmentation process by using a domain ontology as an external resource. An evaluation then studies the effect of this resource on the quality of topic segmentation, along with a comparative study against related work.
Findings - The authors show that adding semantic knowledge extracted from a domain ontology improves the quality of topic segmentation. Moreover, TSS-Ont outperforms TSB-Ont in terms of segmentation quality.
Research limitations/implications - The main limitation of this study is that the test corpus used for the evaluation is not a benchmark. However, it is a collection of scientific papers from well-known digital libraries (arXiv and ACM).
Practical implications - The proposed topic segmenters can be useful in different NLP applications such as information retrieval and text summarization.
Originality/value - The primary original contribution of this paper is the improvement of topic segmentation based on semantic knowledge extracted from an external ontological resource.
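The idea of feeding ontology knowledge into a TextTiling-style segmenter can be illustrated with a small sketch. This is not the authors' TSS-Onto or TSB-Onto implementation, which the abstract does not describe in detail. It assumes a hypothetical toy ontology (TOY_ONTOLOGY) that maps terms to domain concepts, a tiny hypothetical stopword list, and a plain similarity threshold in place of TextTiling's depth scoring. All names and parameter values are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical toy "ontology": maps a surface term to a domain concept.
TOY_ONTOLOGY = {
    "cnn": "neural_network",
    "lstm": "neural_network",
    "perceptron": "neural_network",
    "sql": "database",
    "index": "database",
    "query": "database",
}

# Hypothetical minimal stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "each"}


def concept_vector(sentence):
    """Bag of terms in which ontology terms are replaced by their concept."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(TOY_ONTOLOGY.get(t, t) for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Return each index i where a topic boundary is placed before sentence i.

    Adjacent blocks of up to `window` sentences are compared at every gap;
    a gap whose similarity falls below `threshold` becomes a boundary.
    """
    vectors = [concept_vector(s) for s in sentences]
    boundaries = []
    for gap in range(1, len(sentences)):
        left = sum(vectors[max(0, gap - window):gap], Counter())
        right = sum(vectors[gap:gap + window], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(gap)
    return boundaries


sentences = [
    "The CNN uses a perceptron layer",
    "An LSTM extends the perceptron",
    "SQL needs an index",
    "Each query hits the index",
]
print(segment(sentences))
```

Because "CNN", "LSTM" and "perceptron" all collapse to one concept, sentences that share no surface words can still be recognized as belonging to the same topic. That is the kind of effect an external semantic resource is meant to provide.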
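This paper presents a method for segmenting encyclopedia text into passages based on semi-Markov conditional random fields (semi-CRFs). To overcome the problem of repeated passage types that arises with plain HMM and CRF models, the method takes the post-processed posterior distribution of the HMM states as its basic evidence. It adds passage-start features derived from a lexical-semantic ontology knowledge base, together with cue features for specific passage types, to better fit the characteristics of the target text. Experimental results show that the method can combine different types of information, suits the passage structure of encyclopedia text, and achieves better performance than the plain HMM and CRF models.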