Abstract
When models that translate English natural language questions into Structured Query Language (SQL) statements (Text-to-SQL) are migrated to Chinese industrial Text-to-SQL tasks, their exact match accuracy drops. Industrial datasets are poorly interpretable and scattered, so the table and column names in the database are often represented differently from the key information in the questions, and the column names a question refers to are often only implied by its semantics. To address these migration problems, corresponding solutions were proposed and a modified model was constructed. Firstly, factory metadata was incorporated during data use to resolve both the inconsistent representations and the implicit column names. Then, in view of the characteristics of Chinese expression, a self-attention model based on relative position was used to identify the value of the WHERE clause directly from the question and the database schema information. Finally, in view of the characteristics of industrial queries, a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model was used to classify questions, improving the accuracy of SQL statement structure prediction. An industrial dataset based on the aluminum smelting industry was constructed, and experiments were carried out on it. The results show that the proposed model achieves an exact match accuracy of 74.2% on the industrial test set. A comparison with the results of mainstream models from successive stages on the English dataset Spider indicates that the proposed model can effectively handle the Chinese industrial Text-to-SQL task.
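The abstract does not describe how the factory metadata is actually used, so the following is only a minimal sketch of one plausible reading: a metadata table lists alias phrases for each database column, and the aliases found in a question are linked to their columns. The column names, the aliases, and the function `link_columns` are all hypothetical and are not taken from the paper or its dataset.

```python
# Hypothetical metadata: each database column lists alias phrases that
# operators might use in a question. All names here are illustrative only.
METADATA = {
    "cell_voltage": ["槽电压", "电压"],
    "al_output": ["出铝量", "产铝量"],
    "cell_temp": ["槽温", "电解质温度"],
}


def link_columns(question, metadata):
    """Return (column, matched_alias) pairs whose alias occurs in the question.

    Longer aliases are tried first, and each matched span is masked so that
    a shorter alias contained in an already matched one is not linked again.
    """
    pairs = sorted(
        ((alias, col) for col, aliases in metadata.items() for alias in aliases),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    remaining = question
    linked = []
    for alias, col in pairs:
        if alias in remaining:
            linked.append((col, alias))
            # Mask the matched text with a placeholder of the same length.
            remaining = remaining.replace(alias, "\0" * len(alias))
    return linked


if __name__ == "__main__":
    print(link_columns("3号槽昨天的槽电压和出铝量是多少", METADATA))
```

Plain substring matching like this only covers representation mismatches that can be listed in advance. Column names that are merely implied by the meaning of a question would need a learned matching step, which this sketch does not attempt.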
Authors
吕剑清
王先兵
陈刚
张华
王明刚
LYU Jianqing; WANG Xianbing; CHEN Gang; ZHANG Hua; WANG Minggang (Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education (Wuhan University), Wuhan, Hubei 430072, China; School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China; Zunyi Aluminum Industry Company Limited, Zunyi, Guizhou 563100, China)
Source
《计算机应用》
CSCD
PKU Core Journal (北大核心)
2022, No. 10, pp. 2996-3002 (7 pages)
Journal of Computer Applications
Funding
Supported by the National Natural Science Foundation of China (51977155).