Abstract
Traditional invoice recognition usually scans paper invoices and then applies OCR, with a recognition accuracy of about 80% to 90%. In our case, the invoices are PDF files converted from Word or Excel documents, so the files retain complete character information and some relatively fixed format information. Borrowing from the principles of compilers, the text extracted from an invoice is treated as a programming language and recognized with a finite automaton. Experimental results show that the accuracy of this method can exceed 99%, which is a satisfactory result.
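The idea of treating invoice text as a small language can be illustrated with a minimal sketch. This is not the authors' implementation: the invoice layout, the field labels (`Invoice No:`, `Date:`, `Total:`), the table header, and the function name `recognize` are all hypothetical, chosen only to show how a finite automaton might walk through text extracted from a PDF line by line.

```python
import re
from enum import Enum, auto


class State(Enum):
    HEADER = auto()  # expecting invoice number / date lines
    ITEMS = auto()   # inside the line-item table
    DONE = auto()    # total line has been seen


# Hypothetical line patterns; a real invoice layout would need its own.
INVOICE_NO = re.compile(r"^Invoice No:\s*(\S+)$")
DATE = re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$")
ITEMS_START = re.compile(r"^Item\s+Qty\s+Price$")
ITEM = re.compile(r"^(.+?)\s+(\d+)\s+(\d+\.\d{2})$")
TOTAL = re.compile(r"^Total:\s*(\d+\.\d{2})$")


def recognize(text):
    """Run a small finite automaton over the lines of extracted invoice text."""
    state = State.HEADER
    result = {"items": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information in this sketch
        if state is State.HEADER:
            if m := INVOICE_NO.match(line):
                result["invoice_no"] = m.group(1)
            elif m := DATE.match(line):
                result["date"] = m.group(1)
            elif ITEMS_START.match(line):
                state = State.ITEMS  # table header: switch to item rows
            else:
                raise ValueError(f"unexpected header line: {line!r}")
        elif state is State.ITEMS:
            if m := TOTAL.match(line):
                result["total"] = m.group(1)
                state = State.DONE  # total line closes the table
            elif m := ITEM.match(line):
                result["items"].append((m.group(1), int(m.group(2)), m.group(3)))
            else:
                raise ValueError(f"unexpected item line: {line!r}")
        else:  # State.DONE: nothing may follow the total in this toy grammar
            raise ValueError(f"unexpected text after total: {line!r}")
    if state is not State.DONE:
        raise ValueError("incomplete invoice: no total line found")
    return result


sample = """Invoice No: A-1001
Date: 2020-11-05
Item Qty Price
Resistor 10k 100 0.02
Capacitor 50 0.10
Total: 7.00
"""
print(recognize(sample))
```

Because the text comes from a PDF that still carries its characters, each line can be classified exactly by a pattern, and the current state decides which patterns are legal next. Any line that fits no transition is rejected outright instead of being guessed at, which is one plausible reason such an approach could outperform OCR on fixed-format invoices.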
Authors
SHI Haixin (施海昕)
ZHOU Xuefeng (周雪峰)
CHEN Kai (陈凯)
LIU Yunfeng (刘云锋)
Affiliations: ICkey (Shanghai) Internet and Technology Co. Ltd., Shanghai 201612, China; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 201240, China
Source
Microcomputer Applications (《微型电脑应用》)
2020, Issue 11, pp. 86-89 (4 pages)
Keywords
finite automaton
invoice recognition
principles of compilers