A vulnerability detection method and system based on smart contract bytecode
By using a vulnerability detection method based on smart contract bytecode, and leveraging pre-trained models and graph neural networks to learn the semantic and structural information of bytecode, the generality and efficiency issues of smart contract vulnerability detection models are resolved, enabling efficient detection of complex contracts.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF ELECTRONICS SCI & TECH OF CHINA
- Filing Date
- 2023-05-06
- Publication Date
- 2026-06-30
AI Technical Summary
Existing smart contract vulnerability detection models have poor versatility and low detection efficiency, especially for complex smart contracts where the analysis time is too long.
The invention proposes a vulnerability detection method based on smart contract bytecode. Bytecode is collected and used to construct an opcode data flow graph and an opcode control flow graph. A pre-trained model learns the semantic and structural information of the bytecode, and vulnerability detection is then performed by combining it with graph neural networks. The method includes the construction and training of both the pre-trained model and the smart contract vulnerability detection model.
It improves the versatility and detection efficiency of the vulnerability detection model, making it applicable to smart contracts with different vulnerability types, reducing training time, avoiding the OOV problem during source code segmentation, and possessing good scalability.
Figure CN116561761B_ABST
Description
Technical Field
[0001] A vulnerability detection method and system based on smart contract bytecode is disclosed, which is used for vulnerability detection of smart contracts and belongs to the field of smart contract security.
Background Technology
[0002] Smart contracts represent a valuable and flexible application area of blockchain. Essentially, a smart contract is a piece of code implemented in a specific scripting language, inevitably carrying security vulnerabilities. Therefore, accurately and promptly identifying these vulnerabilities has become a key focus and hot topic in blockchain security research.
[0003] The paper "Peculiar: Smart contract vulnerability detection based on crucial data flow graph and pretraining techniques" proposes a smart contract vulnerability detection framework called Peculiar, based on a crucial data flow graph and a pretrained model. This framework uses a modified tree-sitter tool to parse the smart contract source code into an abstract syntax tree (AST), identifies variable sequences within the AST, constructs a data flow graph (DFG) based on the data flow relationships between variables, extracts a subgraph (CDFG) from the DFG consisting of data nodes related to key variables, and finally uses the CDFG as input to pretrain the GraphCodeBert model, combining it with a downstream classification model for vulnerability detection.
[0004] The paper "DeeSCVHunter: A Deep Learning-Based Framework for Smart Contract Vulnerability Detection" developed the first systematic and modular deep learning-based smart contract vulnerability detection framework, DeeSCVHunter. This framework incorporates eight mainstream deep learning models and three embedding schemes. The framework proposes the concept of Vulnerability Candidate Slices (VCS). First, it matches vulnerability-related candidate statements (CS) in the smart contract source code, such as `call.value()` and `block.timestamp`. Then, based on control flow and data flow, it matches statements related to these candidate statements to form vulnerability candidate slices. These slices are then converted into vectors using the embedding scheme and used as input to train the deep learning model.
[0005] Since most smart contracts are not open source, and only the EVM bytecode of the smart contract is stored on the blockchain, researchers cannot obtain the source code of the smart contract. This limits the application scenarios of vulnerability detection methods based on smart contract source code. Therefore, implementing vulnerability detection of smart contracts at the bytecode level is a current research hotspot. However, most deep learning-based smart contract vulnerability detection methods also focus more on the source code level of smart contracts. Research on the bytecode level of smart contracts is still in a relatively early stage. Most studies simply treat bytecode as a text sequence, ignoring the extraction and representation of semantic information of smart contracts, and lacking interpretability.
[0006] CN113904844A - A smart contract vulnerability detection method based on cross-modal teacher-student networks. Although semantic information is used, the training of the teacher network requires the source code of the smart contract, and different source code semantic graphs need to be designed for different vulnerabilities. Therefore, the teacher network needs to be retrained for different vulnerabilities.
[0007] CN112990941A - A vulnerability detection method and system for Ponzi schemes in smart contracts. Although it applies symbolic execution technology to control flow graphs and data flow graphs and only targets Ponzi contracts, it still requires designing different detection rules for different vulnerabilities, relies on expert experience and knowledge, and the symbolic execution technology has the problem of excessively long analysis time for complex smart contracts, resulting in low efficiency.
[0008] Therefore, the existing technology has the following technical problems:
[0009] 1. The model building process is related to the vulnerability type, resulting in poor model versatility;
[0010] 2. Analysis times are excessively long and detection efficiency is low for complex smart contracts.
Summary of the Invention
[0011] To address the problems mentioned above, the present invention aims to provide a vulnerability detection method and system based on smart contract bytecode, thereby solving the problem that the model construction process in the prior art is related to the vulnerability type, resulting in poor model universality.
[0012] To achieve the above objectives, the present invention adopts the following technical solution:
[0013] A vulnerability detection method based on smart contract bytecode includes the following steps:
[0014] Step 1: Collect bytecode from any smart contract to construct a smart contract opcode data flow graph, and then construct a pre-training dataset based on the opcode data flow graph;
[0015] Step 2: Build and train a pre-trained model based on the pre-training dataset to determine whether two basic blocks in a data flow graph are adjacent;
[0016] Step 3: Collect unlabeled smart contract bytecode to construct an opcode control flow graph, and then construct a smart contract vulnerability dataset based on the opcode control flow graph;
[0017] Step 4: Build and train a smart contract vulnerability detection model based on the parameters of the pre-trained model and the smart contract vulnerability dataset;
[0018] Step 5: Use the trained smart contract vulnerability detection model to perform vulnerability detection on the control flow graph of the smart contract under test.
[0019] Furthermore, the specific steps of step 1 are as follows:
[0020] Step 2.1: Disassemble the smart contract bytecode to obtain the smart contract opcode sequence;
[0021] Step 2.2: Divide the smart contract opcode sequence into multiple basic blocks, where each basic block is an opcode sequence;
[0022] Step 2.3: Simulate the execution of smart contract opcode sequences based on basic blocks to construct an opcode data flow graph using basic blocks as nodes;
[0023] Step 2.4: Construct a pre-training dataset based on the opcode data flow graph.
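The disassembly and basic block division of Steps 2.1 and 2.2 can be illustrated with a minimal sketch. It is not the implementation of the invention: the opcode table is deliberately partial, the function names `disassemble` and `split_basic_blocks` are hypothetical, and the block boundary rule (a block ends at a jump or halting opcode and a new block starts at `JUMPDEST`) is a common convention assumed here rather than stated in the text. The execution simulation of Step 2.3 is not shown.

```python
# Partial EVM opcode table (assumption: a real disassembler covers the full set).
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x10: "LT", 0x14: "EQ", 0x15: "ISZERO",
    0x35: "CALLDATALOAD", 0x50: "POP", 0x51: "MLOAD", 0x52: "MSTORE",
    0x54: "SLOAD", 0x55: "SSTORE", 0x56: "JUMP", 0x57: "JUMPI",
    0x5B: "JUMPDEST", 0xF1: "CALL", 0xF3: "RETURN", 0xFD: "REVERT",
    0xFE: "INVALID", 0xFF: "SELFDESTRUCT",
}
# Opcodes assumed to terminate a basic block.
TERMINATORS = {"STOP", "JUMP", "JUMPI", "RETURN", "REVERT", "INVALID", "SELFDESTRUCT"}


def disassemble(bytecode_hex):
    """Return a list of (offset, mnemonic, operand_hex_or_None) tuples."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    instructions, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32 carry 1..32 immediate bytes
            size = op - 0x5F
            operand = code[pc + 1: pc + 1 + size].hex()
            instructions.append((pc, f"PUSH{size}", operand))
            pc += 1 + size
            continue
        if 0x80 <= op <= 0x8F:
            name = f"DUP{op - 0x7F}"
        elif 0x90 <= op <= 0x9F:
            name = f"SWAP{op - 0x8F}"
        else:
            name = OPCODES.get(op, f"UNKNOWN_0x{op:02x}")
        instructions.append((pc, name, None))
        pc += 1
    return instructions


def split_basic_blocks(instructions):
    """Group instructions into basic blocks (each block is an opcode sequence)."""
    blocks, current = [], []
    for ins in instructions:
        # A JUMPDEST is a possible jump target, so it opens a new block.
        if ins[1] == "JUMPDEST" and current:
            blocks.append(current)
            current = []
        current.append(ins)
        # A jump or halting opcode closes the current block.
        if ins[1] in TERMINATORS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


instructions = disassemble("0x6001600657005b00")
for block in split_basic_blocks(instructions):
    print([mnemonic for _, mnemonic, _ in block])
```

The example bytecode is a hand-made fragment chosen only to exercise a conditional jump; it splits into three basic blocks.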
[0024] Furthermore, the pre-trained model in step 2 sequentially includes an input layer, an embedding layer, a generator layer, and a discriminator layer;
[0025] The input layer appends a [SEQ] token to the end of the opcode sequence of each basic block in the data flow graph as a segment end marker, then concatenates the opcode sequences of any two basic blocks and prepends a [CLS] token to the concatenated opcode sequence as a classification marker. Each concatenated opcode sequence is treated as one token sequence, so multiple token sequences are obtained.
[0026] The Embedding layer is used to map the token sequence of the input layer to a high-dimensional space to obtain a token vector;
[0027] The Generator layer consists of multiple TransformerEncoder layers and an MLP stacked sequentially. It takes the token vectors obtained from the Embedding layer as input and predicts the token at each corresponding position of the token sequence to obtain the predicted opcode sequence, i.e., the reconstructed token sequence. The Generator layer contains 3 to 5 Transformer layers.
[0028] The Discriminator layer consists of an Embedding layer, multiple TransformerEncoder layers, and an MLP stacked sequentially. It takes the predicted opcode sequence from the Generator layer as input and predicts, for each position, whether the predicted opcode token is the original opcode token. At the [CLS] position, it predicts whether the two basic blocks are adjacent in the data flow graph. The Discriminator layer contains 3 to 5 Transformer layers.
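The token sequence built by the input layer can be sketched as follows. This is an illustrative sketch only: the helper names (`build_pair_example`, `build_vocab`, `encode`), the `[PAD]` token, the vocabulary ordering, and the choice to treat a data dependency in either direction as "adjacent" are assumptions that do not come from the text. The integer ids produced by `encode` are what an Embedding layer would map to token vectors.

```python
CLS, SEQ, MASK, PAD = "[CLS]", "[SEQ]", "[MASK]", "[PAD]"


def build_pair_example(blocks, dfg_edges, i, j):
    """Concatenate two basic blocks as [CLS] a [SEQ] b [SEQ] and label adjacency."""
    tokens = [CLS] + list(blocks[i]) + [SEQ] + list(blocks[j]) + [SEQ]
    # Assumption: an edge in either direction counts as a data dependency.
    label = int((i, j) in dfg_edges or (j, i) in dfg_edges)
    return tokens, label


def build_vocab(blocks):
    """Special tokens first, then the opcodes seen in the blocks, sorted."""
    opcodes = sorted({op for block in blocks for op in block})
    return {tok: idx for idx, tok in enumerate([PAD, CLS, SEQ, MASK] + opcodes)}


def encode(tokens, vocab):
    """Map tokens to integer ids, the input expected by an embedding lookup."""
    return [vocab[t] for t in tokens]


# Hypothetical basic blocks (opcode mnemonics only) and one data flow edge.
blocks = [["PUSH1", "SSTORE"], ["PUSH1", "SLOAD", "JUMP"], ["STOP"]]
dfg_edges = {(0, 1)}

tokens, label = build_pair_example(blocks, dfg_edges, 0, 1)
vocab = build_vocab(blocks)
print(tokens, label, encode(tokens, vocab))
```

Because the EVM instruction set is closed, a vocabulary built this way can enumerate every opcode in advance, which is consistent with the out-of-vocabulary advantage claimed for bytecode-level analysis.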
[0029] Furthermore, training the pre-trained model involves three pre-training tasks: a masked token recovery task, an adjacent basic block prediction task, and a replacement token detection task. The masked token recovery and adjacent basic block prediction tasks are trained simultaneously, and the replacement token detection task is trained afterwards.
[0030] In the masked token recovery task, a basic block is randomly selected from the data flow graph in each training round, the tokens at the corresponding positions in the token sequence obtained from the input layer are replaced with the [MASK] token, and the Generator layer is trained to predict the opcode tokens at the masked positions, yielding the reconstructed token sequence. The [CLS] and [SEQ] tokens are never replaced.
[0031] In the adjacent basic block prediction task, when constructing the token sequence at the input layer, any two selected basic blocks are adjacent in the data flow graph if they have a data dependency relationship, otherwise they are not adjacent. The Discriminator layer needs to output the prediction result of whether the two basic blocks in the token sequence are adjacent at the position corresponding to [CLS]. During training, the Discriminator layer learns the data dependency relationship between different basic blocks.
[0032] In the replacement token detection task, the Generator layer, once trained on the masked token recovery task, outputs a reconstructed token sequence in which the tokens at the masked positions have been reconstructed by the Generator layer. The reconstructed tokens are of two kinds: those reconstructed correctly, i.e., restored to the original opcode tokens, and those reconstructed incorrectly, i.e., replaced tokens. The Discriminator layer obtained from training the adjacent basic block prediction task takes the reconstructed token sequence as input and outputs, for each position, a prediction of whether the token has been replaced. Combined with the masked token recovery task, this lets the model learn the contextual relationships of opcodes within a basic block. Training ends when the specified number of iterations is reached or the loss falls to a specified threshold, at which point the parameters of the Embedding layer and Transformer layers in the Discriminator layer are obtained. Otherwise, one basic block and two arbitrarily selected basic blocks are chosen for the next round of training.
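The data preparation behind the masked token recovery and replacement token detection tasks can be sketched as below. This is a hedged illustration, not the training procedure itself: the function names are hypothetical, the masking probability of 0.15 is an assumed default that the text does not specify, and the Generator output is a hand-written stand-in rather than the output of a trained model.

```python
import random

SPECIAL = {"[CLS]", "[SEQ]"}  # never masked, as stated in the text
MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace non-special tokens with [MASK]; return the masked copy and positions.

    mask_prob is an assumed hyperparameter, not a value given in the source.
    """
    rng = rng or random.Random()
    masked, positions = list(tokens), []
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked[i] = MASK
            positions.append(i)
    return masked, positions


def replaced_labels(original, reconstructed):
    """Per-position Discriminator targets: 1 if the token differs from the original."""
    return [int(o != r) for o, r in zip(original, reconstructed)]


original = ["[CLS]", "PUSH1", "SSTORE", "[SEQ]", "SLOAD", "[SEQ]"]
masked, positions = mask_tokens(original, mask_prob=0.5, rng=random.Random(0))

# Stand-in for a Generator prediction: one token restored wrongly (hypothetical).
reconstructed = ["[CLS]", "PUSH1", "MSTORE", "[SEQ]", "SLOAD", "[SEQ]"]
print(masked, positions, replaced_labels(original, reconstructed))
```

In this arrangement the Generator is trained against the original tokens at the masked positions, while the Discriminator is trained against the 0/1 labels at every position.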
[0033] Furthermore, the specific steps of step 3 are as follows:
[0034] Vulnerability labeling tools, including Mythril, are used to label the collected unlabeled smart contracts.
[0035] The collected and annotated source code of smart contracts is compiled into smart contract bytecode;
[0036] Disassemble the smart contract bytecode to obtain the opcode sequence;
[0037] The smart contract opcode sequence is divided into multiple basic blocks to construct an opcode control flow graph. The opcode control flow graph and the corresponding vulnerability labels form a smart contract vulnerability dataset.
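One way to turn basic blocks into an opcode control flow graph and pair it with a label is sketched below. It is a simplification under stated assumptions: jump targets are resolved only when the jump is immediately preceded by a `PUSH` (the text elsewhere describes simulating execution, which also handles computed targets), blocks are lists of hypothetical `(offset, mnemonic, operand)` tuples, and `build_cfg` and `make_sample` are illustrative names.

```python
HALTING = {"STOP", "RETURN", "REVERT", "INVALID", "SELFDESTRUCT"}


def build_cfg(blocks):
    """Return sorted (src_block, dst_block) edges for a list of basic blocks."""
    start_to_index = {block[0][0]: i for i, block in enumerate(blocks)}
    edges = set()
    for i, block in enumerate(blocks):
        last = block[-1][1]
        if last in ("JUMP", "JUMPI"):
            # Static resolution: only a PUSH directly before the jump is handled.
            if len(block) >= 2 and block[-2][1].startswith("PUSH") and block[-2][2]:
                target = int(block[-2][2], 16)
                j = start_to_index.get(target)
                if j is not None and blocks[j][0][1] == "JUMPDEST":
                    edges.add((i, j))
            # A conditional jump may also fall through to the next block.
            if last == "JUMPI" and i + 1 < len(blocks):
                edges.add((i, i + 1))
        elif last not in HALTING and i + 1 < len(blocks):
            edges.add((i, i + 1))
    return sorted(edges)


def make_sample(blocks, label):
    """One dataset entry: node opcode sequences, CFG edges, and a vulnerability label."""
    return {
        "nodes": [[mnemonic for _, mnemonic, _ in block] for block in blocks],
        "edges": build_cfg(blocks),
        "label": label,
    }


# Hypothetical blocks for the fragment PUSH1 01, PUSH1 06, JUMPI, STOP, JUMPDEST, STOP.
blocks = [
    [(0, "PUSH1", "01"), (2, "PUSH1", "06"), (4, "JUMPI", None)],
    [(5, "STOP", None)],
    [(6, "JUMPDEST", None), (7, "STOP", None)],
]
print(make_sample(blocks, label=1))
```

The label would come from the labeling tool's report for the contract; the value used here is a placeholder.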
[0038] Furthermore, the smart contract vulnerability detection model specifically includes an Encoder layer, a graph neural network layer, and an output layer arranged sequentially.
[0039] The Encoder layer consists of an Embedding layer and multiple TransformerEncoder layers stacked sequentially. It is used to take the opcode token sequence of a single basic block as input and output the feature vector of the basic block at the [CLS] position. The structure of the Encoder layer is the same as that of the Embedding layer and multiple Transformer layers in the Discriminator layer, and the parameters of the Embedding layer and multiple Transformer layers in the trained Discriminator layer are used as the initial parameters of the Encoder layer.
[0040] The graph neural network layer consists of multiple GGNNs and a pooling layer stacked sequentially. It takes the opcode control flow graph as input and outputs the feature vector of the corresponding smart contract, which contains the semantic and structural information of the smart contract. There are 1 to 3 GGNNs.
[0041] The output layer is an MLP, which takes the feature vector of the smart contract as input and outputs the predicted probability of the smart contract having vulnerabilities.
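The propagation performed by one GGNN (gated graph neural network) layer can be sketched in plain Python. This follows the commonly used GGNN update (sum incoming messages, then apply a GRU-style gate) and is only an assumed reading of the graph neural network layer: the text does not give the equations, biases are omitted, a single edge type is used, messages flow only along edge direction, and the parameter names in `p` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ggnn_step(h, edges, p):
    """One propagation step over node states h using directed edges (src, dst)."""
    dim = len(h[0])
    # Aggregate messages: a_v = sum over incoming edges of W_msg @ h_u.
    a = [[0.0] * dim for _ in h]
    for src, dst in edges:
        msg = matvec(p["W_msg"], h[src])
        a[dst] = [x + y for x, y in zip(a[dst], msg)]
    new_h = []
    for hv, av in zip(h, a):
        # Update gate z and reset gate r.
        z = [sigmoid(x + y) for x, y in zip(matvec(p["W_z"], av), matvec(p["U_z"], hv))]
        r = [sigmoid(x + y) for x, y in zip(matvec(p["W_r"], av), matvec(p["U_r"], hv))]
        rh = [ri * hi for ri, hi in zip(r, hv)]
        cand = [math.tanh(x + y) for x, y in zip(matvec(p["W_h"], av), matvec(p["U_h"], rh))]
        # Blend old state and candidate state.
        new_h.append([(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, hv, cand)])
    return new_h


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical 2-dimensional block features for a 3-node control flow graph.
h = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
edges = [(0, 1), (0, 2)]
params = {name: identity(2) for name in ("W_msg", "W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}
print(ggnn_step(h, edges, params))
```

In the described model the initial node states would be the [CLS] feature vectors produced by the Encoder layer, and one to three such layers would be stacked before pooling.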
[0042] Furthermore, the nodes of the opcode control flow graph are represented by the feature vectors of the corresponding basic blocks output by the Encoder layer.
[0043] Furthermore, the pooling layer is a parametric pooling layer or a non-parametric pooling layer.
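The difference between the two pooling options can be illustrated with a short sketch. Mean and max pooling are taken here as examples of non-parametric pooling, and a softmax attention pooling with a learnable score vector as an example of parametric pooling; the text does not name specific pooling operators, so these choices and the function names are assumptions.

```python
import math


def mean_pool(node_vectors):
    """Non-parametric: average each feature over all nodes."""
    n = len(node_vectors)
    return [sum(col) / n for col in zip(*node_vectors)]


def max_pool(node_vectors):
    """Non-parametric: take the maximum of each feature over all nodes."""
    return [max(col) for col in zip(*node_vectors)]


def attention_pool(node_vectors, score_weights):
    """Parametric: weight nodes by softmax(score_weights . h_v); score_weights is learned."""
    scores = [sum(w * x for w, x in zip(score_weights, v)) for v in node_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(node_vectors[0])
    return [sum(a * v[d] for a, v in zip(alphas, node_vectors)) for d in range(dim)]


# Hypothetical node features after the GGNN layers.
nodes = [[1.0, 2.0], [3.0, 6.0]]
print(mean_pool(nodes), max_pool(nodes), attention_pool(nodes, [0.5, -0.25]))
```

Either variant reduces a variable number of basic block vectors to one fixed-size contract vector that the output MLP can consume.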
[0044] Furthermore, the specific steps of step 5 are as follows:
[0045] The smart contract under test is compiled into bytecode and the opcode sequence is obtained by disassembly.
[0046] The smart contract opcode sequence is divided into multiple basic blocks, each of which is a segment of the opcode sequence.
[0047] Simulate the execution of smart contract opcode sequences, and construct an opcode control flow graph using basic blocks as nodes;
[0048] The opcode control flow graph is input into the trained smart contract vulnerability detection model to perform smart contract vulnerability detection.
[0049] A vulnerability detection system based on smart contract bytecode includes:
[0050] Smart contract collection module: Collects bytecode from arbitrary smart contracts and unannotated smart contract bytecode respectively;
[0051] Data preprocessing module: Constructs a smart contract opcode data flow graph based on arbitrary smart contract bytecode, and then constructs a pre-training dataset based on the opcode data flow graph, or constructs an opcode control flow graph based on unlabeled smart contract bytecode, and then constructs a smart contract vulnerability dataset based on the opcode control flow graph;
[0052] Pre-training module: A pre-trained model built and trained based on a pre-training dataset to determine whether two basic blocks in a data flow graph are adjacent;
[0053] The smart contract vulnerability detection module constructs and trains a smart contract vulnerability detection model based on the parameters of a pre-trained model and a smart contract vulnerability dataset, and uses the trained smart contract vulnerability detection model to detect vulnerabilities in the control flow graph of the smart contract under test.
[0054] Compared with the prior art, the beneficial effects of this invention are as follows:
[0055] First, this invention performs semantic extraction and vulnerability analysis based on smart contract bytecode, and the model construction process is independent of vulnerability type. By incorporating data flow information into the pre-trained model, the semantics learned by the model are more comprehensive, which greatly improves the versatility of the vulnerability detection model. That is, the pre-trained model learns general code semantics and structural information from smart contract bytecode through task-independent training objectives, without being trained for a specific vulnerability. The parameters of the pre-trained model are then used as the initial parameters of the vulnerability detection model, which is trained and updated for specific vulnerabilities. This allows the method to be applied to different vulnerability types while reducing training time and improving the model's generalization ability.
[0056] Second, the vulnerability detection model framework based on deep learning in this invention automatically mines semantic information related to the causes of vulnerabilities within the code, and after the model is trained, it has high detection efficiency for complex smart contracts.
[0057] Third, this invention implements smart contract vulnerability detection based on smart contract bytecode. Since the opcodes obtained by disassembly are known, it can avoid the Out-Of-Vocabulary (OOV) problem encountered by vulnerability detection methods based on source code when performing word segmentation on the source code.
[0058] Fourth, this invention analyzes the entire smart contract. Compared with the method of program slicing based on vulnerability characteristics, this invention has good scalability and can be directly applied to the detection of vulnerabilities in new smart contracts.
Attached Figure Description
[0059] Figure 1 is a flowchart of the vulnerability detection method based on smart contract bytecode according to Embodiment 1 of the present invention;
[0060] Figure 2 is a block diagram of the vulnerability detection system based on smart contract bytecode according to Embodiment 2 of the present invention;
Detailed Implementation
[0061] The present invention will now be further described in conjunction with the accompanying drawings and specific embodiments.
[0062] Embodiment 1
[0063] As shown in Figure 1, the specific embodiment of the vulnerability detection method based on smart contract bytecode provided by the present invention is as follows:
[0064] Step 1: Obtain any smart contract bytecode on Ethereum from platforms such as EtherScan to construct a smart contract opcode data flow graph, and then construct a pre-training dataset based on the opcode data flow graph;
[0065] The specific steps are as follows:
[0066] Step 2.1: Disassemble the smart contract bytecode to obtain the smart contract opcode sequence;
[0067] Step 2.2: Divide the smart contract opcode sequence into multiple basic blocks, where each basic block is an opcode sequence;
[0068] Step 2.3: Simulate the execution of smart contract opcode sequences based on basic blocks to construct an opcode data flow graph using basic blocks as nodes;
[0069] Step 2.4: Construct a pre-training dataset based on the opcode data flow graph.
[0070] Step 2: Build and train a pre-trained model based on the pre-training dataset to determine whether two basic blocks in a data flow graph are adjacent; the pre-trained model includes an input layer, an embedding layer, a generator layer, and a discriminator layer in sequence.
[0071] The input layer appends a [SEQ] token to the end of the opcode sequence of each basic block in the data flow graph as a segment end marker, then concatenates the opcode sequences of any two basic blocks and prepends a [CLS] token to the concatenated opcode sequence as a classification marker. Each concatenated opcode sequence is treated as one token sequence, so multiple token sequences are obtained.
[0072] The Embedding layer is used to map the token sequence of the input layer to a high-dimensional space to obtain a token vector;
[0073] The Generator layer consists of multiple TransformerEncoder layers and an MLP stacked sequentially. It takes the token vectors obtained from the Embedding layer as input and predicts the token at each corresponding position of the token sequence to obtain the predicted opcode sequence, i.e., the reconstructed token sequence. The Generator layer contains 3 to 5 Transformer layers.
[0074] The Discriminator layer consists of an Embedding layer, multiple TransformerEncoder layers, and an MLP stacked sequentially. It takes the predicted opcode sequence from the Generator layer as input and predicts, for each position, whether the predicted opcode token is the original opcode token. At the [CLS] position, it predicts whether the two basic blocks are adjacent in the data flow graph. The Discriminator layer contains 3 to 5 Transformer layers.
[0075] Training the pre-trained model involves three pre-training tasks: masked token recovery, adjacent basic block prediction, and replacement token detection. The masked token recovery and adjacent basic block prediction tasks are trained simultaneously, and the replacement token detection task is trained afterwards.
[0076] In the masked token recovery task, a basic block is randomly selected from the data flow graph in each training round, the tokens at the corresponding positions in the token sequence obtained from the input layer are replaced with the [MASK] token, and the Generator layer is trained to predict the opcode tokens at the masked positions, yielding the reconstructed token sequence. The [CLS] and [SEQ] tokens are never replaced.
[0077] In the adjacent basic block prediction task, when constructing the token sequence at the input layer, any two selected basic blocks are adjacent in the data flow graph if they have a data dependency relationship, otherwise they are not adjacent. The Discriminator layer needs to output the prediction result of whether the two basic blocks in the token sequence are adjacent at the position corresponding to [CLS]. During training, the Discriminator layer learns the data dependency relationship between different basic blocks.
[0078] In the replacement token detection task, the Generator layer, once trained on the masked token recovery task, outputs a reconstructed token sequence in which the tokens at the masked positions have been reconstructed by the Generator layer. The reconstructed tokens are of two kinds: those reconstructed correctly, i.e., restored to the original opcode tokens, and those reconstructed incorrectly, i.e., replaced tokens. The Discriminator layer obtained from training the adjacent basic block prediction task takes the reconstructed token sequence as input and outputs, for each position, a prediction of whether the token has been replaced. Combined with the masked token recovery task, this lets the model learn the contextual relationships of opcodes within a basic block. Training ends when the specified number of iterations is reached or the loss falls to a specified threshold, at which point the parameters of the Embedding layer and Transformer layers in the Discriminator layer are obtained. Otherwise, one basic block and two arbitrarily selected basic blocks are chosen for the next round of training.
[0079] Step 3: Obtain a large amount of unlabeled smart contract bytecode from platforms such as EtherScan (the smart contract bytecode here is related to specific vulnerabilities), construct an opcode control flow graph, and then construct a smart contract vulnerability dataset based on the opcode control flow graph; the specific steps are as follows:
[0080] Vulnerability labeling tools are used to label unlabeled smart contracts that have been collected. These tools include Mythril, etc.
[0081] The collected and annotated source code of smart contracts is compiled into smart contract bytecode;
[0082] Disassemble the smart contract bytecode to obtain the opcode sequence;
[0083] The smart contract opcode sequence is divided into multiple basic blocks to construct an opcode control flow graph. The opcode control flow graph and the corresponding vulnerability labels form a smart contract vulnerability dataset.
[0084] Step 4: Build and train a smart contract vulnerability detection model based on the parameters of the pre-trained model and the smart contract vulnerability dataset;
[0085] The smart contract vulnerability detection model specifically includes an Encoder layer, a graph neural network layer, and an output layer, set up sequentially.
[0086] The Encoder layer consists of an Embedding layer and multiple TransformerEncoder layers stacked sequentially. It is used to take the opcode token sequence of a single basic block as input and output the feature vector of the basic block at the [CLS] position. The structure of the Encoder layer is the same as that of the Embedding layer and multiple Transformer layers in the Discriminator layer, and the parameters of the Embedding layer and multiple Transformer layers in the trained Discriminator layer are used as the initial parameters of the Encoder layer.
[0087] The graph neural network layer consists of multiple GGNNs and a pooling layer stacked sequentially. It takes the opcode control flow graph as input and outputs the feature vector of the corresponding smart contract, which contains the semantic and structural information of the smart contract. There are 1 to 3 GGNNs.
[0088] The output layer is an MLP (multilayer perceptron) that takes the feature vector of the smart contract as input and outputs the predicted probability of the smart contract having vulnerabilities. The nodes of the opcode control flow graph are represented by the feature vectors of the corresponding basic blocks output by the Encoder layer. The pooling layer is either a parametric pooling layer or a non-parametric pooling layer.
[0089] Step 5: Use the trained smart contract vulnerability detection model to perform vulnerability detection on the control flow graph of the smart contract under test. The specific steps are as follows:
[0090] The smart contract under test is compiled into bytecode and the opcode sequence is obtained by disassembly.
[0091] The smart contract opcode sequence is divided into multiple basic blocks, each of which is a segment of the opcode sequence.
[0092] Simulate the execution of smart contract opcode sequences, and construct an opcode control flow graph using basic blocks as nodes;
[0093] The opcode control flow graph is input into the trained smart contract vulnerability detection model to perform smart contract vulnerability detection.
[0094] The above are merely representative embodiments among the many specific applications of this invention, and do not constitute any limitation on the scope of protection of this invention. All technical solutions formed by transformation or equivalent substitution fall within the scope of protection of this invention.
Claims
1. A vulnerability detection method based on smart contract bytecode, characterized in that it includes the following steps: Step 1: Collect bytecode from any smart contract to construct a smart contract opcode data flow graph, and then construct a pre-training dataset based on the opcode data flow graph; Step 2: Build and train a pre-trained model based on the pre-training dataset to determine whether two basic blocks in a data flow graph are adjacent; Step 3: Collect unlabeled smart contract bytecode to construct an opcode control flow graph, and then construct a smart contract vulnerability dataset based on the opcode control flow graph; Step 4: Build and train a smart contract vulnerability detection model based on the parameters of the pre-trained model and the smart contract vulnerability dataset; Step 5: Use the trained smart contract vulnerability detection model to perform vulnerability detection on the control flow graph of the smart contract under test; the pre-trained model in step 2 includes, in sequence, an input layer, an embedding layer, a generator layer, and a discriminator layer; training the pre-trained model involves three pre-training tasks: masked token recovery, adjacent basic block prediction, and replacement token detection, wherein masked token recovery and adjacent basic block prediction are trained simultaneously and the replacement token detection task is trained afterwards. In the masked token recovery task, a basic block is randomly selected from the data flow graph in each training round, the tokens at the corresponding positions in the token sequence obtained from the input layer are replaced with the [MASK] token, and the Generator layer is trained to predict the opcode tokens at the masked positions, yielding the reconstructed token sequence. The [CLS] and [SEQ] tokens are never replaced.
In the adjacent basic block prediction task, when constructing the token sequence at the input layer, any two selected basic blocks are adjacent in the data flow graph if they have a data dependency relationship, otherwise they are not adjacent. The Discriminator layer needs to output, at the position corresponding to [CLS], the prediction result of whether the two basic blocks in the token sequence are adjacent. During training, the Discriminator layer learns the data dependency relationship between different basic blocks. In the replacement token detection task, the Generator layer, once trained on the masked token recovery task, outputs a reconstructed token sequence in which the tokens at the masked positions have been reconstructed by the Generator layer. The reconstructed tokens are of two kinds: those reconstructed correctly, i.e., restored to the original opcode tokens, and those reconstructed incorrectly, i.e., replaced tokens. The Discriminator layer obtained from training the adjacent basic block prediction task takes the reconstructed token sequence as input and outputs, for each position, a prediction of whether the token has been replaced. Combined with the masked token recovery task, this lets the model learn the contextual relationships of opcodes within a basic block. Training ends when the specified number of iterations is reached or the loss falls to a specified threshold, at which point the parameters of the Embedding layer and Transformer layers in the Discriminator layer are obtained. Otherwise, one basic block and two arbitrarily selected basic blocks are chosen for the next round of training.
2. The vulnerability detection method based on smart contract bytecode according to claim 1, characterized in that the specific steps of step 1 are as follows: Step 2.1: Disassemble the smart contract bytecode to obtain the smart contract opcode sequence; Step 2.2: Divide the smart contract opcode sequence into multiple basic blocks, where each basic block is an opcode sequence; Step 2.3: Simulate the execution of smart contract opcode sequences based on basic blocks to construct an opcode data flow graph using basic blocks as nodes; Step 2.4: Construct a pre-training dataset based on the opcode data flow graph.
3. The vulnerability detection method based on smart contract bytecode according to claim 2, characterized in that the input layer appends a [SEQ] token to the end of the opcode sequence of each basic block in the data flow graph as a segment end marker, then concatenates the opcode sequences of any two basic blocks and prepends a [CLS] token to the concatenated opcode sequence as a classification marker. Each concatenated opcode sequence is treated as one token sequence, so multiple token sequences are obtained. The Embedding layer is used to map the token sequence of the input layer to a high-dimensional space to obtain a token vector; the Generator layer consists of multiple TransformerEncoder layers and an MLP stacked sequentially. It takes the token vectors obtained from the Embedding layer as input and predicts the token at each corresponding position of the token sequence to obtain the predicted opcode sequence, i.e., the reconstructed token sequence. The Generator layer contains 3 to 5 Transformer layers. The Discriminator layer consists of an Embedding layer, multiple TransformerEncoder layers, and an MLP stacked sequentially. It takes the predicted opcode sequence from the Generator layer as input and predicts, for each position, whether the predicted opcode token is the original opcode token. At the [CLS] position, it predicts whether the two basic blocks are adjacent in the data flow graph. The Discriminator layer contains 3 to 5 Transformer layers.
4. The vulnerability detection method based on smart contract bytecode according to claim 3, characterized in that the specific steps of step 3 are as follows: vulnerability labeling tools, including Mythril, are used to label the collected unlabeled smart contracts; the source code of the collected and labeled smart contracts is compiled into smart contract bytecode; the smart contract bytecode is disassembled to obtain the opcode sequence; the smart contract opcode sequence is divided into multiple basic blocks to construct an opcode control flow graph; and the opcode control flow graph together with the corresponding vulnerability labels forms the smart contract vulnerability dataset.
5. The vulnerability detection method based on smart contract bytecode according to claim 4, characterized in that the smart contract vulnerability detection model comprises an Encoder layer, a graph neural network layer, and an output layer arranged in sequence. The Encoder layer consists of an Embedding layer and multiple TransformerEncoder layers stacked sequentially; it takes the opcode token sequence of a single basic block as input and outputs the feature vector of that basic block at the [CLS] position. The structure of the Encoder layer is the same as that of the Embedding layer and the multiple Transformer layers in the Discriminator layer, and the parameters of the Embedding layer and the multiple Transformer layers in the trained Discriminator layer are used as the initial parameters of the Encoder layer. The graph neural network layer consists of multiple GGNN layers and a pooling layer stacked sequentially; it takes the opcode control flow graph as input and outputs the feature vector of the corresponding smart contract, which contains the semantic and structural information of the smart contract. The number of GGNN layers is 1 to 3. The output layer is an MLP, which takes the feature vector of the smart contract as input and outputs the predicted probability that the smart contract contains a vulnerability.
6. The vulnerability detection method based on smart contract bytecode according to claim 5, characterized in that, The nodes of the opcode control flow graph are represented by the feature vectors of the corresponding basic blocks output by the Encoder layer.
7. The vulnerability detection method based on smart contract bytecode according to claim 6, characterized in that the pooling layer is either a parametric pooling layer or a non-parametric pooling layer.
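To make the distinction in this claim concrete, the sketch below contrasts one non-parametric pooling (mean over node vectors) with one parametric pooling (a softmax-weighted sum whose scoring vector `w` would be learned). Both choices are assumptions for illustration; the claims do not name a specific pooling operator, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(nodes: List[Vector]) -> Vector:
    """Non-parametric pooling: average the node feature vectors."""
    n = len(nodes)
    return [sum(col) / n for col in zip(*nodes)]


def attention_pool(nodes: List[Vector], w: Vector) -> Vector:
    """Parametric pooling: weight nodes by softmax(w . x) before summing.

    `w` stands in for a learned parameter vector (assumption).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in nodes]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(nodes[0])
    return [sum(a * x[d] for a, x in zip(alphas, nodes)) for d in range(dim)]
```

With an all-zero `w`, every node receives the same weight and the parametric variant reduces to the mean.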
8. The vulnerability detection method based on smart contract bytecode according to claim 7, characterized in that the specific steps of step 5 are as follows: compile the smart contract under test into bytecode and disassemble it to obtain the opcode sequence; divide the smart contract opcode sequence into multiple basic blocks, each of which is an opcode sequence; simulate the execution of the smart contract opcode sequence and construct an opcode control flow graph with the basic blocks as nodes; input the opcode control flow graph into the trained smart contract vulnerability detection model to perform smart contract vulnerability detection.
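The control flow graph step can be illustrated with the sketch below. Note the substitution: the claim builds the graph by simulating execution, whereas this sketch uses a simpler static shortcut that only resolves a jump when its target is pushed immediately before the `JUMP` or `JUMPI`. The function name `build_cfg` and the instruction tuple layout `(offset, mnemonic, operand)` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Instruction = Tuple[int, str, bytes]  # (offset, mnemonic, immediate operand)
Block = List[Instruction]
HALTING = {"STOP", "RETURN", "REVERT", "INVALID", "SELFDESTRUCT"}


def build_cfg(blocks: List[Block]) -> Dict[int, Set[int]]:
    """Return edges keyed by block start offset (static approximation)."""
    starts = [b[0][0] for b in blocks]
    # Only blocks that begin with JUMPDEST are valid jump targets.
    jumpdests = {b[0][0] for b in blocks if b[0][1] == "JUMPDEST"}
    edges: Dict[int, Set[int]] = {s: set() for s in starts}
    for i, block in enumerate(blocks):
        start = block[0][0]
        last = block[-1][1]
        fallthrough = starts[i + 1] if i + 1 < len(blocks) else None
        if last in ("JUMP", "JUMPI"):
            # Resolve the target only for the "PUSHn <dest>; JUMP" pattern.
            if len(block) >= 2 and block[-2][1].startswith("PUSH"):
                target = int.from_bytes(block[-2][2], "big")
                if target in jumpdests:
                    edges[start].add(target)
            # A conditional jump may also continue to the next block.
            if last == "JUMPI" and fallthrough is not None:
                edges[start].add(fallthrough)
        elif last not in HALTING and fallthrough is not None:
            # Block ended only because the next instruction is a JUMPDEST.
            edges[start].add(fallthrough)
    return edges
```

Jumps whose targets are computed at run time are left unresolved here, which is the case that simulated execution is meant to handle.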
9. A vulnerability detection system based on smart contract bytecode, characterized in that it comprises: a smart contract collection module, which collects arbitrary smart contract bytecode and unlabeled smart contract bytecode, respectively; a data preprocessing module, which constructs a smart contract opcode data flow graph from the arbitrary smart contract bytecode and then constructs a pre-training dataset based on the opcode data flow graph, or constructs an opcode control flow graph from the unlabeled smart contract bytecode and then constructs a smart contract vulnerability dataset based on the opcode control flow graph; a pre-training module, which builds and trains, based on the pre-training dataset, a pre-trained model that determines whether two basic blocks in a data flow graph are adjacent; and a smart contract vulnerability detection module, which builds and trains a smart contract vulnerability detection model based on the parameters of the pre-trained model and the smart contract vulnerability dataset, and uses the trained smart contract vulnerability detection model to detect vulnerabilities in the control flow graph of the smart contract under test. The pre-trained model in the data preprocessing module consists of an input layer, an Embedding layer, a Generator layer, and a Discriminator layer, in sequence. The training of the pre-trained model includes three pre-training tasks: masked token recovery, adjacent basic block prediction, and replacement token detection. Masked token recovery and adjacent basic block prediction are trained simultaneously, after which the replacement token detection task is trained.
In the masked token recovery task, a basic block is randomly selected from the data flow graph in each training round, the tokens at the selected positions of the token sequence obtained from the input layer are replaced with the [MASK] token, and the Generator layer is trained to predict the opcode token at each masked position, yielding the reconstructed token sequence. The [CLS] and [SEQ] tokens are never masked. In the adjacent basic block prediction task, when the token sequence is constructed at the input layer, the two selected basic blocks are adjacent in the data flow graph if they have a data dependency relationship; otherwise they are not adjacent. The Discriminator layer outputs, at the position corresponding to [CLS], its prediction of whether the two basic blocks in the token sequence are adjacent. Through this training, the Discriminator layer learns the data dependency relationships between different basic blocks. In the replacement token detection task, the Generator layer trained on the masked token recovery task outputs a reconstructed token sequence, in which the tokens at the masked positions have been reconstructed by the Generator layer. The reconstructed tokens include those that were successfully reconstructed, i.e., restored to the original opcode token, and those that were reconstructed as a different opcode token, i.e., replaced tokens. The Discriminator layer obtained after training on the adjacent basic block prediction task takes the reconstructed token sequence as input and outputs, for each position, its prediction of whether the token at that position has been replaced. Together with the masked token recovery task, this task learns the contextual relationships of opcodes within a basic block.
Training ends when the specified number of iterations is reached or the loss falls to a specified threshold, at which point the parameters of the Embedding layer and the Transformer layers in the Discriminator layer are obtained. Otherwise, one basic block and two arbitrarily selected basic blocks are again selected for the next round of training.
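The data preparation for the masked token recovery and replacement token detection tasks in this claim can be sketched as follows. This is an assumed illustration only: the masking probability of 0.15 is a common convention rather than a value from the claims, and the function names are hypothetical. No model is trained here; the sketch only builds the inputs and targets.

```python
import random
from typing import Dict, List, Optional, Tuple

SPECIAL = {"[CLS]", "[SEQ]"}  # never masked, per the claim
MASK = "[MASK]"


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,  # assumption: not specified in the claims
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Dict[int, str]]:
    """Replace some opcode tokens with [MASK]; return targets by position."""
    rng = rng or random.Random()
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    chosen = [i for i in candidates if rng.random() < mask_prob]
    if not chosen and candidates:
        chosen = [rng.choice(candidates)]  # mask at least one token
    masked = list(tokens)
    targets = {i: tokens[i] for i in chosen}
    for i in chosen:
        masked[i] = MASK
    return masked, targets


def replaced_labels(original: List[str], reconstructed: List[str]) -> List[int]:
    """Discriminator targets: 1 where the generator's token differs."""
    return [int(o != r) for o, r in zip(original, reconstructed)]
```

The generator would be trained to recover `targets` from `masked`, and the discriminator would be trained on `replaced_labels` computed from the generator's reconstructed sequence.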