Abstract
Gender bias is widespread in models across natural language processing tasks. However, no dataset currently exists for evaluating or mitigating gender bias in Chinese, so gender bias in Chinese natural language processing models cannot be assessed. This paper first uses 16 pairs of gendered appellations to select gender-unbiased sentences from a print media corpus, constructing SlguSet (Sentence-Level Gender Unbiased Dataset), a Chinese sentence-level gender-unbiased dataset containing 20,000 sentences. It then proposes a metric that measures the degree of gender bias in pre-trained language models and evaluates the gender bias in five popular pre-trained language models. The results show that Chinese pre-trained language models exhibit varying degrees of gender bias, and that the constructed dataset can effectively evaluate gender bias in these models.
Authors
ZHAO Jishun (赵继舜)
DU Bingjie (杜冰洁)
LIU Pengyuan (刘鹏远)
ZHU Shucheng (朱述承)
ZHAO Jishun; DU Bingjie; LIU Pengyuan; ZHU Shucheng (College of Information Science, Beijing Language and Culture University, Beijing 100083, China; Language Resources Monitoring and Research Center Print Media Language Branch, Beijing Language and Culture University, Beijing 100083, China; School of Humanities, Tsinghua University, Beijing 100084, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal (北大核心)
2023, Issue 9, pp. 15-22 (8 pages)
Journal of Chinese Information Processing
Funding
Beijing Natural Science Foundation (4192057).