A binary code analysis method and system for cross-architecture knowledge transfer

By employing a cross-architecture knowledge transfer binary code analysis method, and utilizing word embedding models and linear transformation alignment matrices, a unified instruction semantic space is established. This resolves the analysis differences between different CPU architectures, improves the analysis accuracy of low-frequency CPU architectures, and enhances cross-architecture application capabilities.

CN122195451A · Pending · Publication Date: 2026-06-12 · SHANGHAI PALMIN TECH +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANGHAI PALMIN TECH
Filing Date: 2026-03-27
Publication Date: 2026-06-12

Application Information

Patent Timeline

27 Mar 2026: Application
12 Jun 2026: Publication, CN122195451A

IPC: G06F8/53; G06F8/76; G06F9/30; G06N3/096

AI Tagging

Application Domain

Decompilation/disassembly · Biological models


Similar Technology Patents

Instruction-driven java runtime business logic protection method and system
CN120850259B · Hide call relationship · avoid exposure · Decompilation/disassembly · Digital data protection · Software engineering · Logisim
Cryptographic function identification method based on loop analysis and binary similar code analysis
CN116743363B · Key distribution for secure communication · Decompilation/disassembly · Loop analysis · Algorithm
A system and method for memory leak analysis and testing based on GCC wrap and Python.
CN122240448A · Decompilation/disassembly · Resource allocation
Code processing method and apparatus, and device
EP4209896B1 · Decompilation/disassembly · Program code adaption
A vulnerability detection method, device and equipment for ARM architecture UEFI firmware and a medium
CN122241716A · Decompilation/disassembly · Platform integrity maintenance


AI Technical Summary

Technical Problem

Existing binary code analysis methods are difficult to reuse across CPU architectures, especially for low-frequency CPU architectures where there is a lack of sufficient code samples and labeled data, resulting in low accuracy and coverage of analysis tools. Furthermore, existing cross-architecture conversion methods lose architecture-specific semantic information, affecting analysis accuracy.

Method used

By generating binary code corpora for specific CPU architectures, training the instruction vector space using a word embedding model, and aligning the instruction semantic spaces of different CPU architectures using a linear transformation alignment matrix, a unified general word embedding vector space is established, enabling cross-architecture knowledge transfer.

Benefits of technology

It enables binary code analysis knowledge to be reused across different CPU architectures, improves analysis accuracy for low-frequency CPU architectures, provides a unified vector space that supports cross-architecture tasks such as instruction prediction and vulnerability detection, and scales well to additional architectures.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure FT_1
Figure FT_2

Patent Text Reader

Abstract

The application provides a binary code analysis method and system for cross-architecture knowledge transfer, and relates to static binary code analysis in the field of software code analysis. It addresses two technical problems: binary code analysis methods are difficult to reuse across CPU architectures, and labeled data and analysis knowledge are scarce for low-frequency CPU architectures. The application extracts basic blocks from binary code of multiple CPU architectures and trains word embeddings on them to obtain a word embedding vector space for each architecture. A sparse matrix then marks semantically similar instruction pairs across architectures, and a linear transformation alignment matrix is calculated through iterative optimization. Finally, with the vector space of the high-frequency CPU architecture as the benchmark, the vector spaces of the other architectures are aligned by linear transformation and merged into a unified general word embedding vector space. Compared with the prior art, the application achieves semantic alignment and knowledge transfer across instruction sets, so that analysis knowledge accumulated on the high-frequency CPU architecture can be reused for binary code analysis of low-frequency CPU architectures, addressing the low analysis accuracy caused by insufficient labeled data.


Description

Technical Field

[0001] This invention relates to the field of software code analysis, and more particularly to static analysis techniques for binary code. Specifically, it addresses the differences in binary code analysis across CPU architectures (e.g., x86, ARM, RISC-V) by using knowledge transfer to apply mature binary code analysis methods for one CPU architecture to the analysis of binary code for a wider range of CPU architectures.

Background Technology

[0002] Binary code analysis is an important research direction in the field of software security. Its core objective is to perform static or dynamic analysis on compiled binary executable files to achieve various applications such as vulnerability detection, malware identification, and code similarity analysis. However, binary code analysis needs to consider the differences between different CPU architectures. Different CPU architectures use different instruction set architectures. For example, the x86 architecture uses Complex Instruction Set Computing (CISC), while the ARM and RISC-V architectures use Reduced Instruction Set Computing (RISC), and the MIPS architecture also belongs to the RISC category. The instructions of various CPU architectures differ significantly in syntax, addressing modes, register naming, etc., making it difficult to directly apply binary code analysis methods developed for a specific CPU architecture to other CPU architectures.

[0003] In practical applications, the use of certain CPU architectures is relatively limited. For example, the MIPS architecture is mainly used in embedded systems and network devices, and its corresponding binary code samples are significantly fewer compared to x86 and ARM architectures. Due to the lack of sufficient code samples and labeled data, binary code analysis research on these low-frequency CPU architectures is relatively weak, and the accuracy and coverage of related analysis tools are limited.

[0004] Furthermore, most existing binary code analysis methods are designed and trained independently for specific CPU architectures, and analysis methods cannot be directly reused across different architectures. Although some research attempts to achieve cross-architecture code analysis through intermediate representation languages or abstract semantic representations, these methods often lose architecture-specific semantic information during the conversion process, affecting analysis accuracy. Therefore, how to effectively transfer analysis knowledge from mature architectures to resource-constrained architectures and achieve cross-architecture binary code analysis knowledge reuse has become an urgent technical problem to be solved.

Summary of the Invention

[0005] This invention proposes a binary code analysis method and system for cross-architecture knowledge transfer. By transferring binary code analysis knowledge corresponding to a specific CPU architecture to binary code analysis corresponding to other CPU architectures, existing methods are reused, thereby solving the difficulties in binary code analysis corresponding to certain CPU architectures.

[0006] The method of the present invention specifically includes the following steps:

[0007] Step S1: Generate and label a binary code corpus for a specific CPU architecture. For a given CPU architecture, collect open-source code and compile it to obtain the corresponding binary code. Extract all basic blocks from the compiled binary code, treating each basic block as a set of instructions. Using a word embedding training method with the instructions in each basic block as tokens, train a model on the corpus to obtain the word embedding vector space T_i corresponding to the instructions of this CPU architecture.

[0008] Step S2: Linear Transformation Alignment Matrix Calculation. For two word embedding vector spaces T_i and T_j obtained in Step S1 for different CPU architectures, construct a sparse matrix D_ij: if the x-th element of T_i and the y-th element of T_j are semantically similar, position (x, y) is set to 1; otherwise, it is set to 0. Randomly generate a transformation matrix X, and calculate the Euclidean distance L between T_i multiplied by X and T_j. Multiply each element of the sparse matrix D_ij by L to obtain the updated transformation matrix X, and repeat the calculation until the value of X is stable.

[0009] Step S3: Cross-CPU architecture word embedding vector space alignment. Select the word embedding vector space T_i corresponding to the most frequently used CPU architecture as the baseline. Following step S2, calculate the corresponding transformation matrix with the word embedding vector spaces corresponding to other CPU architectures. For each other CPU architecture's word embedding vector space T_j (excluding the one corresponding to the most frequently used CPU architecture), multiply by the calculated transformation matrix X_j to obtain a new word embedding vector space V_j. Combine all newly generated vector spaces with the word embedding vector space corresponding to the most frequently used CPU architecture to form a new vector space, which serves as the new universal word embedding vector space to guide instruction prediction in subsequent binary code analysis.

[0010] Based on the above steps, this invention proposes a binary code analysis system for cross-architecture knowledge transfer. The system consists of three parts. The first part comprises multiple different compilers, which take the same batch of high-level language source code as input and output compiled binary code for different CPU architectures. The second part includes a mathematical operation unit, which takes binary code for different CPU architectures as input and outputs the corresponding word embedding vector spaces and the linear transformation alignment matrices associated with them. The third part includes a data merging unit, which takes a batch of word embedding vector spaces corresponding to different CPU architectures and the associated linear transformation alignment matrices as input and outputs a unified vector space.

Technical effect

[0011] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0012] First, it enables knowledge transfer of binary code analysis across CPU architectures, establishing a shared semantic space for binary code of different CPU architectures, so that the analysis knowledge for one CPU architecture can be reused for other CPU architectures.

[0013] Second, by using a linear transformation alignment method for word embedding vector space, semantically similar instructions from different CPU instruction sets are mapped to the same vector space, effectively solving the problem of large differences in instruction syntax across different architectures and the inability to reuse analysis methods.

[0014] Third, by using high-frequency CPU architecture as a benchmark, its mature analytical knowledge is transferred to low-frequency CPU architecture, which solves the problems of insufficient labeled data and low analysis accuracy of low-frequency CPU architecture.

[0015] Fourth, it provides a unified general word embedding vector space, supporting various downstream tasks such as instruction prediction, code similarity analysis, and vulnerability detection across architectures.

[0016] Fifth, the method has good scalability. To support a new CPU architecture, it is only necessary to perform an alignment transformation on the word embedding vector space of the new architecture to incorporate it into the existing general vector space.

Attached Figure Description

[0017] Figure 1 is a schematic diagram of a binary code analysis method for cross-architecture knowledge transfer disclosed in an embodiment of the present invention.

[0018] Figure 2 is a schematic diagram of a binary code analysis system architecture for cross-architecture knowledge transfer disclosed in an embodiment of the present invention.

Detailed Implementation

[0019] As shown in Figure 1, the binary code analysis method for cross-architecture knowledge transfer proposed in this invention specifically includes the following steps:

[0020] Step S1: Generation and labeling of binary code corpus for a specific CPU architecture. In this embodiment, a target CPU architecture, such as x86 architecture, is first specified as the object of corpus generation. A large number of representative open-source project source codes are collected from open-source code repositories, covering various types of software projects such as operating system kernels, network protocol stacks, and encryption algorithm libraries, to ensure the diversity and representativeness of the binary code corpus. The collected open-source code is compiled using a cross-compilation toolchain corresponding to the target CPU architecture to generate a binary executable file corresponding to the target CPU architecture.
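As a rough illustration of the compilation stage, the sketch below builds one compile command per architecture and source file. The compiler driver names follow common Linux cross-toolchain packaging and, like the `build_commands` helper and the file names, are assumptions that do not come from the patent.

```python
from pathlib import Path

# Hypothetical mapping from architecture label to a GCC cross-compiler driver.
# The driver names are assumptions based on common Linux toolchain packaging.
CROSS_COMPILERS = {
    "x86": "x86_64-linux-gnu-gcc",
    "arm": "arm-linux-gnueabi-gcc",
    "riscv": "riscv64-linux-gnu-gcc",
    "mips": "mips-linux-gnu-gcc",
}


def build_commands(sources, out_dir="build"):
    """Return one compile command per (architecture, source file) pair."""
    commands = []
    for arch, compiler in CROSS_COMPILERS.items():
        for src in sources:
            obj = Path(out_dir) / arch / (Path(src).stem + ".o")
            commands.append([compiler, "-O2", "-c", str(src), "-o", str(obj)])
    return commands


# The same batch of source files is compiled once per target architecture.
commands = build_commands(["aes.c", "tcp.c"])
```

Each command list could then be handed to `subprocess.run` on a machine where the corresponding toolchain is installed.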

[0021] Furthermore, basic block extraction is performed on the compiled binary code. Specifically, a disassembler is used to perform static disassembly on the binary executable file to identify function boundaries and control flow structures, thereby extracting all basic blocks. Each basic block is considered as a set of consecutive instructions, where each instruction includes an opcode and operands. In this embodiment, the instruction sequence in each basic block is used as the training input for the word embedding model.
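The patent does not say how the disassembler delimits basic blocks. A minimal sketch, assuming the classic leader rule (a block starts at the first instruction, at any branch target, and after any block-ending instruction) and a hypothetical record layout `(address, text, branch_target, ends_block)`, might look like this:

```python
# Hypothetical disassembly record: (address, text, branch_target, ends_block)
# - branch_target: address jumped to, or None
# - ends_block: True for jumps, conditional jumps and returns
def split_basic_blocks(instructions):
    """Split a linear disassembly into basic blocks using the leader rule."""
    if not instructions:
        return []
    addresses = {record[0] for record in instructions}
    leaders = {instructions[0][0]}
    for idx, (_, _, target, ends_block) in enumerate(instructions):
        if target is not None and target in addresses:
            leaders.add(target)  # a branch target starts a block
        if ends_block and idx + 1 < len(instructions):
            leaders.add(instructions[idx + 1][0])  # so does the fall-through
    blocks, current = [], []
    for addr, text, _, _ in instructions:
        if addr in leaders and current:
            blocks.append(current)
            current = []
        current.append(text)
    blocks.append(current)
    return blocks


listing = [
    (0x00, "mov eax, 0", None, False),
    (0x05, "cmp eax, 10", None, False),
    (0x08, "jge 0x12", 0x12, True),
    (0x0A, "add eax, 1", None, False),
    (0x0D, "jmp 0x05", 0x05, True),
    (0x12, "ret", None, True),
]
blocks = split_basic_blocks(listing)
```

Here the loop header at `0x05` becomes its own block because it is a jump target, which is what makes each block a straight-line instruction sequence suitable as one training sentence.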

[0022] Specifically, the extracted instruction sequences are trained using a word embedding model. Each instruction is treated as a token, and each basic block is considered a sentence. A Skip-gram or CBOW model, similar to Word2Vec, is used for training to learn the contextual relationships between instructions. After training, an instruction word embedding vector space T_i corresponding to the target CPU architecture is obtained, where each instruction is mapped to a dense vector of fixed dimensions. This word embedding vector space can capture the semantic relationships between instructions; instructions with similar semantics are closer together in the vector space.
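To make the "instruction as token, basic block as sentence" framing concrete, the sketch below produces the (center, context) training pairs that a Skip-gram model consumes. The window size, the helper names, and the choice to keep each instruction string whole as a single token are illustrative assumptions; the embedding training itself is omitted.

```python
def build_vocab(basic_blocks):
    """Assign an integer index to every distinct instruction token."""
    vocab = {}
    for block in basic_blocks:
        for instr in block:
            vocab.setdefault(instr, len(vocab))
    return vocab


def skipgram_pairs(basic_blocks, window=2):
    """Return (center, context) instruction pairs drawn within each block."""
    pairs = []
    for block in basic_blocks:
        for i, center in enumerate(block):
            lo = max(0, i - window)
            hi = min(len(block), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, block[j]))
    return pairs


corpus = [["push rbp", "mov rbp, rsp", "sub rsp, 16"], ["ret"]]
vocab = build_vocab(corpus)
pairs = skipgram_pairs(corpus, window=1)
```

Because pairs never cross a block boundary, the context of an instruction is limited to instructions that always execute together with it.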

[0023] By repeating step S1 above for different CPU architectures, we can obtain word embedding vector spaces corresponding to each CPU architecture, such as word embedding vector space T_1 for x86 architecture, word embedding vector space T_2 for ARM architecture, word embedding vector space T_3 for RISC-V architecture, and word embedding vector space T_4 for MIPS architecture, etc.

[0024] Step S2: Linear Transformation Alignment Matrix Calculation. In this embodiment, a sparse matrix D_ij is constructed for the word embedding vector spaces T_i and T_j obtained in step S1 for two different CPU architectures. The sparse matrix D_ij is constructed as follows: for the x-th element in T_i and the y-th element in T_j, if they semantically represent similar operation functions, the sparse matrix D_ij takes a value of 1 at position (x, y); otherwise, it takes a value of 0. The semantic similarity determination can be based on the functional category of the instruction, for example, marking instructions that perform the same function such as addition, data movement, and conditional jump under different architectures as semantically similar.
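A minimal sketch of building D_ij from functional categories follows. The category tables are hypothetical, hand-written examples, and the matrix is stored as the set of its non-zero positions, which is one common way to hold a sparse 0/1 matrix.

```python
# Hypothetical functional categories for a few mnemonics. The patent names
# addition, data movement and conditional jump as example categories.
X86_CATEGORY = {"add": "addition", "mov": "data_move",
                "jne": "cond_jump", "cpuid": "other_x86"}
ARM_CATEGORY = {"ADD": "addition", "MOV": "data_move",
                "LDR": "data_move", "BNE": "cond_jump"}


def build_similarity_pairs(vocab_i, vocab_j, category_i, category_j):
    """Return the positions (x, y) where D_ij equals 1, as a set."""
    pairs = set()
    for instr_i, x in vocab_i.items():
        cat_i = category_i.get(instr_i)
        if cat_i is None:
            continue
        for instr_j, y in vocab_j.items():
            if cat_i == category_j.get(instr_j):
                pairs.add((x, y))
    return pairs


vocab_x86 = {name: idx for idx, name in enumerate(X86_CATEGORY)}
vocab_arm = {name: idx for idx, name in enumerate(ARM_CATEGORY)}
d_pairs = build_similarity_pairs(vocab_x86, vocab_arm, X86_CATEGORY, ARM_CATEGORY)
```

Note that one instruction can match several counterparts (`mov` pairs with both `MOV` and `LDR` here), so D_ij is not necessarily a one-to-one mapping.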

[0025] Further, an initial transformation matrix X is randomly generated, with dimensions consistent with the word embedding vector space. The difference between the word embedding vector space T_i after linear transformation by X and the word embedding vector space T_j is then calculated. Specifically, the Euclidean distance between T_i multiplied by X and T_j is calculated and denoted as L. The Euclidean distance L measures the degree of alignment between T_i and T_j after linear transformation; a smaller L indicates better alignment.
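The text does not give a formula for L. One plausible formalization, assuming that T_i[x] and T_j[y] denote d-dimensional row vectors, that X is a d × d matrix, and that the distance is accumulated only over the instruction pairs marked in D_ij (since the two vocabularies generally differ in size), is:

```latex
L(X) \;=\; \sqrt{\sum_{x,y} D_{ij}[x,y]\,
      \bigl\lVert T_i[x]\,X - T_j[y] \bigr\rVert_2^{2}}
```

Under this reading, if both spaces have the same number of rows and D_ij is the identity matrix, L reduces to the Frobenius norm of T_i X − T_j.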

[0026] Specifically, the Euclidean distance L is weighted and adjusted using the sparse matrix D_ij: each element of D_ij is multiplied by L, and the result is used as the updated transformation matrix X. The Euclidean distance calculation described above is then repeated with the updated X, and X is updated iteratively until its value stabilizes, that is, until the change in X between two consecutive iterations is less than a preset convergence threshold. This iterative optimization yields the optimal linear transformation matrix X that aligns the word embedding vector space T_i to the word embedding vector space T_j.
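The update rule is described only loosely, so the sketch below substitutes a standard technique: plain gradient descent on the squared distance summed over the pairs marked in D_ij. It keeps the stated ingredients (random initial X, repeated distance evaluation, stopping when X changes by less than a threshold), but the learning rate, tolerance, and function name are assumptions and this is not presented as the patent's exact procedure.

```python
import random


def align(t_i, t_j, d_pairs, lr=0.05, tol=1e-9, max_iter=10000, seed=0):
    """Fit X so that t_i[x] @ X is close to t_j[y] for each marked pair."""
    dim = len(t_i[0])
    rng = random.Random(seed)
    # Random initial transformation matrix X (dim x dim).
    x_mat = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(max_iter):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in d_pairs:
            src, dst = t_i[x], t_j[y]
            # residual = src @ X - dst
            residual = [
                sum(src[k] * x_mat[k][c] for k in range(dim)) - dst[c]
                for c in range(dim)
            ]
            for r in range(dim):
                for c in range(dim):
                    grad[r][c] += 2.0 * src[r] * residual[c]
        change = 0.0
        for r in range(dim):
            for c in range(dim):
                step = lr * grad[r][c]
                x_mat[r][c] -= step
                change = max(change, abs(step))
        if change < tol:  # X is stable between consecutive iterations
            break
    return x_mat


# Toy data: t_j is t_i rotated by 90 degrees, so the exact answer is known.
t_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t_j = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
x_fit = align(t_i, t_j, {(0, 0), (1, 1), (2, 2)})
```

On this toy input the fitted matrix approaches the rotation `[[0, 1], [-1, 0]]`. When the transformation should preserve distances, a closed-form orthogonal Procrustes solution is a common alternative to iterating.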

[0027] Step S3: Alignment of word embedding vector spaces across CPU architectures. In this embodiment, the word embedding vector space corresponding to the CPU architecture with the highest usage frequency is selected as the reference vector space. For example, as shown in Figure 2, the x86 architecture is one of the most widely used CPU architectures in practice; its binary code analysis research is the most mature, and it has the richest labeled data. Therefore, the word embedding vector space T_1 corresponding to the x86 architecture is preferably used as the reference vector space. Following the linear transformation alignment matrix calculation method described in step S2, the transformation matrices between the word embedding vector spaces corresponding to other CPU architectures and the reference vector space T_1 are calculated, such as the transformation matrix X_2 for the ARM architecture, the transformation matrix X_3 for the RISC-V architecture, and the transformation matrix X_4 for the MIPS architecture.

[0028] Furthermore, for each CPU architecture other than the baseline CPU architecture, the word embedding vector space T_j is multiplied by the corresponding transformation matrix X_j calculated as described above to obtain the transformed word embedding vector space V_j. The transformed space V_j lies in the same vector space as the baseline vector space T_1; that is, instructions with similar semantics in different architectures have similar vector representations after transformation.

[0029] Specifically, all the newly generated word embedding vector spaces V_2, V_3, V_4, etc., after transformation are combined with the word embedding vector space T_1 corresponding to the baseline CPU architecture to form a unified general word embedding vector space. This general word embedding vector space encompasses the instruction semantic information of all CPU architectures involved in the alignment, enabling cross-architecture instruction semantic comparison and analysis. In subsequent binary code analysis tasks, this general word embedding vector space can be directly used for tasks such as instruction prediction, code similarity analysis, and vulnerability detection, thus realizing the transfer of binary code analysis knowledge accumulated on high-frequency CPU architectures to low-frequency CPU architectures.
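The transform-and-merge step, followed by a cross-architecture lookup, can be sketched as follows. The toy vectors, the matrix `x_arm`, the `(architecture, instruction)` key scheme, and the use of cosine similarity for the lookup are all illustrative assumptions.

```python
import math


def transform(space, x_mat):
    """Map every vector v in `space` to v @ x_mat."""
    dim_out = len(x_mat[0])
    return {
        tok: [sum(v[k] * x_mat[k][c] for k in range(len(v))) for c in range(dim_out)]
        for tok, v in space.items()
    }


def merge(baseline_name, baseline, aligned_spaces):
    """Combine baseline and aligned spaces, keyed by (arch, instruction)."""
    unified = {(baseline_name, tok): vec for tok, vec in baseline.items()}
    for arch, space in aligned_spaces.items():
        for tok, vec in space.items():
            unified[(arch, tok)] = vec
    return unified


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def nearest(unified, key, target_arch):
    """Return the key of `target_arch` most similar to `key`."""
    query = unified[key]
    scored = [(cosine(query, vec), k) for k, vec in unified.items()
              if k[0] == target_arch]
    return max(scored)[1]


t_x86 = {"add": [1.0, 0.0], "mov": [0.0, 1.0]}   # baseline space T_1
t_arm = {"ADD": [0.0, 1.0], "MOV": [-1.0, 0.0]}  # rotated copy, for illustration
x_arm = [[0.0, -1.0], [1.0, 0.0]]                # assumed output of step S2
unified = merge("x86", t_x86, {"arm": transform(t_arm, x_arm)})
closest = nearest(unified, ("arm", "ADD"), "x86")
```

Keying the merged space by architecture keeps mnemonics that share a spelling across instruction sets from overwriting each other.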

[0030] The above-described specific implementations can be partially adjusted by those skilled in the art in different ways without departing from the principles and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited to the above-described specific implementations. All implementation schemes within the scope of the claims are bound by the present invention.

Claims

1. A binary code analysis method and system for cross-architecture knowledge transfer, characterized in that, By employing three specific steps—generating and labeling binary code corpora for specific CPU architectures, calculating linear transformation alignment matrices, and aligning word embedding vector spaces across CPU architectures—this method enables the transfer of binary code analysis knowledge corresponding to a specific CPU architecture to the analysis of binary code corresponding to other CPU architectures, reusing existing methods and thus solving the difficulties in binary code analysis for certain CPU architectures.

2. The binary code analysis method and system for cross-architecture knowledge transfer according to claim 1, further characterized in that, In the binary code corpus generation and labeling steps for a specific CPU architecture, binary code is extracted and function boundaries and control flow structures are identified. All basic blocks are extracted, and each basic block is regarded as a set of consecutive instructions.

3. The binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 2, characterized in that, For binary code corpora generated by different CPU architectures, each basic block is treated as a sentence, and each instruction is treated as a token. Common word embedding models are used for training to learn the contextual relationships between instructions, resulting in a corresponding word embedding vector space for each type of CPU architecture.

4. The binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 3, characterized in that, In the linear transformation alignment matrix calculation step, a sparse matrix D_ij is constructed: for any x-th element in the word embedding vector space T_i and y-th element in the word embedding vector space T_j, if they semantically represent similar operation functions, the sparse matrix takes the value 1 at position (x, y), otherwise it takes the value 0.

5. A binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 4, characterized in that, For each type of CPU architecture, an initial transformation matrix X is randomly generated. The Euclidean distance L between the corresponding word embedding vector space T_i multiplied by the transformation matrix X and the word embedding vector space T_j is calculated. The Euclidean distance L is then weighted and adjusted using the sparse matrix D_ij to update the transformation matrix X. This process is repeated until the value of the transformation matrix X tends to stabilize.

6. A binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 5, characterized in that, In the cross-CPU architecture word embedding vector space alignment step, a CPU architecture with the highest frequency of use is selected as the baseline CPU architecture. Then, the word embedding vector space T_j corresponding to each other CPU architecture is multiplied by the corresponding transformation matrix X_j to obtain the transformed word embedding vector space V_j. The semantically similar instructions in the transformed word embedding vector space V_j have similar vector representations to the semantically similar instructions in the baseline vector space.

7. A binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 6, characterized in that, It includes three types of components: a multi-architecture compilation unit, a mathematical operation unit, and a data merging unit. The multi-architecture compilation unit consists of multiple compilers corresponding to different CPU architectures, which are used to compile the same batch of high-level language source code into binary code that supports different CPU architectures. The mathematical operation unit is used to extract basic blocks and train word embeddings on the binary code of the different CPU architectures to obtain the word embedding vector space corresponding to each CPU architecture, and to calculate the linear transformation alignment matrix between different word embedding vector spaces. The data merging unit is used to receive the word embedding vector space and linear transformation alignment matrix output by the mathematical operation unit, and transform and merge the word embedding vector spaces corresponding to different CPU architectures into a unified general vector space.

8. A binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 7, characterized in that, The output of the multi-architecture compilation unit is connected to the input of the mathematical operation unit, and the output of the mathematical operation unit is connected to the input of the data merging unit.

9. A binary code analysis method and system for cross-architecture knowledge transfer according to any one of claims 1 to 8, characterized in that, The mathematical operation unit selects the word embedding vector space corresponding to the CPU architecture with the highest usage frequency as the reference vector space, and calculates the transformation matrix between the word embedding vector space corresponding to other CPU architectures and the reference vector space respectively; the data merging unit uses the transformation matrix to perform a linear transformation on the word embedding vector space of non-reference CPU architectures, and then combines it with the reference vector space to form the unified general vector space.