Chinese Text-to-SQL model for industrial production

Abstract: When models that translate English natural-language questions into Structured Query Language (SQL) statements (Text-to-SQL) are migrated to Chinese industrial Text-to-SQL tasks, their exact match accuracy drops. Industrial datasets are poorly interpretable and widely dispersed, so the table names and column names in the database are often represented differently from the key information in the questions, and the column names a question refers to are often implicit in its semantics. To address these migration problems, corresponding solutions are proposed and a modified model is constructed. First, factory metadata is incorporated during data use to resolve both the inconsistent representations and the implicit column names. Then, based on the characteristics of Chinese expression, a self-attention model based on relative position is used to identify the values of the WHERE clause directly from the question and the database schema information. Finally, based on the characteristics of industrial query content, a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model classifies the questions to improve the accuracy of SQL statement structure prediction. An industrial dataset based on the aluminum smelting industry was constructed, and experiments were carried out on it. The results show that the proposed model achieves an exact match accuracy of 74.2% on the industrial test set. Comparison with the results of mainstream models at each stage on the English dataset Spider indicates that the proposed model can effectively handle the Chinese industrial Text-to-SQL task.
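The abstract does not say how the factory metadata is incorporated. As a rough illustration only, the sketch below assumes the metadata is a lookup from schema items to the aliases operators use in questions, and uses it to link question phrases to columns before the question reaches a Text-to-SQL model. The table name `pot_daily`, its columns, the alias lists, and the functions `link_schema` and `annotate` are all hypothetical and do not come from the paper.

```python
# Hypothetical factory metadata: each schema item lists aliases that might
# appear in operator questions. None of these names come from the paper.
FACTORY_METADATA = {
    "pot_daily.avg_voltage": ["平均电压", "槽电压"],
    "pot_daily.al_output": ["出铝量", "产铝量"],
    "pot_daily.pot_id": ["槽号", "电解槽"],
}


def link_schema(question, metadata):
    """Return (alias, schema_item) pairs for every alias found in the question.

    Longer aliases are tried first so that a short alias cannot claim part
    of a longer alias that also matches.
    """
    candidates = sorted(
        ((alias, item) for item, aliases in metadata.items() for alias in aliases),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    links = []
    remaining = question
    for alias, item in candidates:
        if alias in remaining:
            links.append((alias, item))
            # Blank out the matched span so shorter aliases cannot reuse it.
            remaining = remaining.replace(alias, " " * len(alias))
    return links


def annotate(question, metadata):
    """Append the linked schema items to the question as plain-text hints."""
    links = link_schema(question, metadata)
    hints = "; ".join(f"{alias} -> {item}" for alias, item in links)
    return f"{question} [{hints}]" if hints else question
```

Calling `annotate("3号电解槽昨天的槽电压是多少", FACTORY_METADATA)` would append hints that tie 槽电压 and 电解槽 to the assumed columns, which is one simple way to expose column names that the question never states in schema form. A production system would more likely feed such links into the encoder than into the question text.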

Authors: LYU Jianqing; WANG Xianbing; CHEN Gang; ZHANG Hua; WANG Minggang (Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education (Wuhan University), Wuhan, Hubei 430072, China; School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China; Zunyi Aluminum Industry Company Limited, Zunyi, Guizhou 563100, China)

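Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education (Wuhan University); School of Computer Science, Wuhan University; Zunyi Aluminum Industry Company Limited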

Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2022, No. 10, pp. 2996-3002 (7 pages)

Funding: National Natural Science Foundation of China (51977155).

Keywords: Chinese Text-to-SQL task; industrial dataset; metadata; self-attention model; Bidirectional Encoder Representations from Transformers (BERT)

Classification number: TP391.2 [Automation and Computer Technology: Computer Application Technology]

