Abstract
Clause recognition is a basic problem in discourse information processing. Linguistically, whether a text segment counts as a clause depends not only on its internal structure but also on its function in the surrounding global context. The question is how wide a span of that context clause recognition should generally rely on. This paper explores the question for Chinese clause recognition. Chinese clauses are usually delimited by punctuation, but not every punctuation mark delimits a clause. This paper therefore treats clause recognition as a punctuation classification problem, reduces the global range that recognition depends on to the number of segments before and after a punctuation mark, and examines how the size of that range relates to recognition performance. Text features of the segments on both sides of each punctuation mark are extracted with the pre-trained language model BERT. Experiments show that performance improves as the number of segments increases and is best with four segments on each side of the punctuation mark. Segments before the mark contribute more than segments after it, and segments on both sides contribute more than segments on one side only. After optimizing the context length and the feature weights of the preceding and following segments, the best model reaches an F1 of 95.19% on clause recognition.
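The framing in the abstract, where each punctuation mark is a classification instance whose input is up to k segments on either side, can be illustrated with a minimal sketch. This is not the authors' implementation: the punctuation inventory `PUNCT` and the helper names `split_segments` and `context_windows` are assumptions made for illustration, and the BERT encoding step is left out so the example runs with the standard library alone.

```python
import re
from typing import List, Tuple

# Assumed set of candidate boundary marks; the abstract does not list
# the punctuation inventory actually used in the paper.
PUNCT = ",。;:!?"


def split_segments(text: str) -> Tuple[List[str], List[str]]:
    """Split text at candidate punctuation marks.

    Returns (segments, marks), where marks[j] sits between
    segments[j] and segments[j + 1].
    """
    parts = re.split(f"([{PUNCT}])", text)
    return parts[0::2], parts[1::2]


def context_windows(text: str, k: int = 4) -> List[Tuple[str, List[str], List[str]]]:
    """Build one instance per punctuation mark.

    Each instance is (mark, left_segments, right_segments) with at most
    k segments on each side; empty segments are dropped.
    """
    segments, marks = split_segments(text)
    examples = []
    for j, mark in enumerate(marks):
        left = [s for s in segments[max(0, j + 1 - k): j + 1] if s]
        right = [s for s in segments[j + 1: j + 1 + k] if s]
        examples.append((mark, left, right))
    return examples


if __name__ == "__main__":
    sample = "他来了,我很高兴,因为很久没见。"
    for mark, left, right in context_windows(sample, k=2):
        print(mark, left, right)
```

In a full pipeline, the left and right segment lists for each mark would presumably be joined and passed to an encoder such as BERT, with a classifier deciding whether the mark ends a clause. Varying `k` corresponds to the context-range experiments the abstract describes.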
Authors
FENG Wenhe (冯文贺)
GAO Zixiong (高子雄)
ZHANG Wenjuan (张文娟)
Source
《语言文字应用》 (Applied Linguistics)
CSSCI
PKU Core Journal
2022, Issue 2, pp. 111-121 (11 pages)
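Funding
Supported by the National Social Science Fund of China project "Research on the Feature-Dependency Description Mechanism and Resource Construction for Chinese Discourse Structure" (17BYY036).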
Keywords
clause recognition
discourse analysis
global range of segments
Chinese information processing