Abstract
Large Language Models (LLMs) are developing rapidly in the field of Natural Language Processing (NLP), but their application to educational digitalization still faces significant challenges. To address the scarcity of domain-specific data in education and the instability of summary length, which causes information loss or redundancy, this study proposes a lightweight idempotent model framework for educational text summarization, the Idempotent Generative Language Model (IGLM). The model first employs multi-source training with adaptive augmentation to increase data diversity, and then applies several fine-tuning procedures to the downstream text summarization task. To reduce the influence of text length, an idempotent summary generation strategy is designed that constrains the model by pulling the first-pass summary toward its idempotent summary, mitigating the bias caused by uneven corpus distribution; combined with quantization techniques, it produces more accurate and fluent summaries under low-resource conditions. Experiments use Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores as the evaluation metric on the publicly available Chinese text summarization datasets Large-scale Chinese Short Text Summarization (LCSTS), EDUCATION, and Natural Language Processing and Chinese Computing (NLPCC). The results show clear improvements in the accuracy and fluency of the generated summaries: relative to the baseline model, ROUGE-1/2/L improve by 7.9, 7.4, and 8.7 percentage points on LCSTS, by 12.9, 15.4, and 15.7 percentage points on EDUCATION, and by 12.2, 11.7, and 12.7 percentage points on NLPCC, confirming the effectiveness of the model.
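The framework's central idea is an idempotency constraint: feeding a generated summary back into the model should reproduce (approximately) the same summary. Below is a minimal sketch of how such a constraint could be combined with ordinary supervised fine-tuning, assuming a Hugging Face seq2seq backbone; the backbone name, the weighting factor lambda_idem, and the generation settings are illustrative assumptions, not the paper's exact IGLM recipe (the multi-source augmentation and quantization steps are not shown).

```python
# Minimal sketch of an idempotency-regularized summarization objective.
# Assumptions (not from the paper): mT5 backbone, a cross-entropy form of the
# idempotency term, and the weighting factor lambda_idem.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def training_step(src_texts, ref_summaries, lambda_idem=0.5):
    """One combined loss: supervised fine-tuning + idempotency constraint."""
    # 1) Standard supervised loss: source text -> reference summary.
    enc = tokenizer(src_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(ref_summaries, return_tensors="pt", padding=True,
                       truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions
    sup_loss = model(**enc, labels=labels).loss

    # 2) Idempotency term: a first-pass summary, fed back in as input,
    #    should reproduce itself, i.e. f(f(x)) ≈ f(x).
    with torch.no_grad():  # the first pass is treated as a fixed target
        first_ids = model.generate(**enc, max_new_tokens=64)
    first_pass = tokenizer.batch_decode(first_ids, skip_special_tokens=True)
    enc2 = tokenizer(first_pass, return_tensors="pt", padding=True, truncation=True)
    labels2 = enc2.input_ids.clone()
    labels2[labels2 == tokenizer.pad_token_id] = -100
    idem_loss = model(**enc2, labels=labels2).loss

    return sup_loss + lambda_idem * idem_loss
```

Because the first-pass summary is detached from the computation graph, the gradient only pushes the second pass toward the first; the paper's actual formulation of the distance between the initial and idempotent summaries may differ.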
Authors
YANG Xingrui, MA Bin, LI Senyao, ZHONG Xian (School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, Hubei, China; Informatization Office, Wuhan University of Technology, Wuhan 430070, Hubei, China)
Source
Computer Engineering (《计算机工程》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2024, No. 7, pp. 32-41 (10 pages)
Funding
National Natural Science Foundation of China (62271361).
Keywords
educational digitalization
text summarization
Large Language Model (LLM)
low-resource scenarios
idempotent
augmentation