Abstract: In recent years, with the development of natural language processing (NLP) technologies, security analysts have begun to apply NLP directly to assembly code disassembled from binary executables in order to examine binary similarity, and have achieved great progress. However, we found that existing frameworks often ignore the complex internal structure of instructions and do not fully consider the long-term dependencies between instructions. In this paper, we propose firmVulSeeker, a vulnerability search tool for embedded firmware images based on BERT and a Siamese network. It first builds a BERT masked language model (MLM) task to observe and learn the semantics of different instructions in their context from a very large unlabeled binary corpus. Then, a fine-tuning mode based on a Siamese network is constructed to guide training and to match semantically similar functions using the knowledge learned in the first stage. Finally, it uses a function embedding generated by the fine-tuned model to search the target corpus for the most similar function, which is then checked manually to confirm whether it is a real vulnerability. We evaluate the accuracy, robustness, scalability, and vulnerability search capability of firmVulSeeker. Results show that it greatly improves the accuracy of matching semantically similar functions and finds more real vulnerabilities in real-world firmware than other tools.
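The final search step can be illustrated with a minimal sketch. It assumes that each function has already been mapped to a fixed-length embedding and that similarity is measured by cosine similarity; the abstract does not name the metric, so this choice is an assumption. The function names, the helper `rank_candidates`, and the toy vectors are all hypothetical and stand in for embeddings that a fine-tuned model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, corpus_embeddings, top_k=3):
    """Return the top_k (name, score) pairs, most similar to the query first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in corpus_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embedding of a known-vulnerable function (toy values, not model output).
vulnerable_fn = [0.9, 0.1, 0.3]

# Hypothetical embeddings of functions extracted from a target firmware image.
firmware_functions = {
    "sub_401000": [0.88, 0.12, 0.31],
    "sub_402a10": [-0.2, 0.9, 0.1],
    "sub_403f80": [0.1, 0.2, -0.9],
}

ranking = rank_candidates(vulnerable_fn, firmware_functions, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

The top-ranked candidates are only leads: as the abstract states, each match still has to be confirmed manually before it is counted as a real vulnerability.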