Abstract
For Chinese voice text, this paper uses script annotation to screen out the text data that contain keywords, then generates word-vector sequences for the remaining text with a pre-trained BERT model. A logistic regression model is trained on these vectors, yielding voice text data labeled as "meaningful" or "meaningless". The resulting binary-classified voice data can be used to optimize the user utterance lexicon of a voice cloud platform and improve the user's interactive experience.
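The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword list, the sample utterances, and their labels are hypothetical, and a small lookup table of made-up 2-D vectors stands in for the output of a pre-trained BERT model so that the example runs without downloading anything. The logistic regression is a plain gradient-descent version written out by hand.

```python
import math

# Hypothetical command keywords used by the annotation script (not from the paper).
KEYWORDS = ("导航", "空调")  # "navigation", "air conditioning"

# Made-up 2-D vectors standing in for BERT output; a real system would
# compute a high-dimensional vector per utterance with a pre-trained model.
TOY_VECTORS = {
    "今天天气怎么样": [1.0, 0.0],  # "how is the weather today"
    "讲个笑话": [0.9, 0.1],        # "tell a joke"
    "嗯嗯啊啊": [0.0, 1.0],        # filler sounds
    "那个那个": [0.1, 0.9],        # "um, that, that"
}


def split_by_keyword(texts, keywords=KEYWORDS):
    """Script-annotation step: separate keyword hits from the remaining texts."""
    hits = [t for t in texts if any(k in t for k in keywords)]
    rest = [t for t in texts if not any(k in t for k in keywords)]
    return hits, rest


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Probability that the vector x belongs to the "meaningful" class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression fitted by per-sample gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score(xi, w, b) - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def classify(text, w, b, embed=TOY_VECTORS.__getitem__):
    """Return 1 ("meaningful") or 0 ("meaningless") for one utterance."""
    return 1 if score(embed(text), w, b) >= 0.5 else 0


utterances = ["打开空调", "今天天气怎么样", "讲个笑话", "嗯嗯啊啊", "那个那个"]
hits, rest = split_by_keyword(utterances)  # "打开空调" = "turn on the air conditioning"
labels = [1, 1, 0, 0]  # hand-assigned training labels for `rest` (hypothetical)
w, b = train_logreg([TOY_VECTORS[t] for t in rest], labels)
predictions = {t: classify(t, w, b) for t in rest}
```

In a fuller version, the `embed` argument of `classify` would wrap a real pre-trained Chinese BERT encoder. How the paper treats the utterances set aside by the keyword step is not stated in the abstract, so the sketch simply returns them separately in `hits`.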
Authors
SONG Guanyu (宋冠谕); CHENG Deng (程登); ZHANG Sen (张森); LIU Wei (刘威); DING Xiaowen (丁晓雯)
SAIC GM Wuling Automobile Co., Ltd., Guangxi Laboratory of New Energy Automobile, Guangxi Key Laboratory of Automobile Four New Features, Liuzhou, Guangxi 545007, China
Source
Chinese Journal of Computer Application (《计算机应用文摘》)
2022, Issue 18, pp. 96-98 (3 pages)
Keywords
binary classification
script annotation
pre-trained BERT model
logistic regression