A malicious file detection method and device, electronic equipment and storage medium

By marking and matching the functions in the macro code of Office files using sequence matrices, this method solves the technical problems in malicious file detection that cannot be effectively addressed by existing technologies, thus improving the accuracy and efficiency of malicious file detection.

CN115577356B · Active · Publication Date: 2026-06-30 · BEIJING TOPSEC NETWORK SECURITY TECH +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING TOPSEC NETWORK SECURITY TECH
Filing Date
2022-09-09
Publication Date
2026-06-30

Figure CN115577356B_ABST

Abstract

This application provides a method, apparatus, electronic device, and storage medium for detecting malicious files. The method includes: acquiring the macro code of the file to be detected; marking the functions in the macro code to obtain a label corresponding to each function; generating a first call sequence matrix based on the labels corresponding to each function and the call relationships between functions, wherein each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label; and matching the first call sequence matrix with a second call sequence matrix corresponding to a malicious sample to obtain a result indicating whether the file to be detected is malicious. This application, by labeling functions, can extract valid labels even if obfuscation techniques are used to render function names meaningless. Furthermore, by using labels to replace the complex and variable original function call sequences, the call sequences are normalized, improving the accuracy of malicious sample detection.

Description

Technical Field

[0001] This application relates to the field of computer security technology, and more specifically, to a method, apparatus, electronic device, and storage medium for detecting malicious files.

Background Technology

[0002] Macros are a special feature designed by Microsoft specifically for the Office software package, but they have become a widely used attack vector: hackers embed malicious macro code in Office documents to achieve their attack goals.

[0003] Currently, malicious Office file detection methods mainly include dynamic detection methods and static detection methods. Dynamic detection methods, because they simulate real-world environments, incur significant performance overhead. Static detection methods can reduce performance overhead, but malicious samples are highly variable, leading to low detection accuracy.

Summary of the Invention

[0004] The purpose of this application is to provide a method, apparatus, electronic device, and storage medium for detecting malicious files, so as to improve the accuracy of malicious file detection.

[0005] In a first aspect, embodiments of this application provide a method for detecting malicious files, including:

[0006] Obtaining the macro code of the file to be detected;

[0007] Mark the functions in the macro code to obtain the label corresponding to each function;

[0008] A first call sequence matrix is generated based on the label corresponding to each function and the call relationship between the functions. Each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label.

[0009] The first call sequence matrix is matched with the second call sequence matrix corresponding to the malicious sample to obtain the result of whether the file to be detected is a malicious file.

[0010] This application embodiment uses tagging of functions, so even if obfuscation techniques are used to make function names meaningless, effective tags can still be extracted. Furthermore, by using tags to replace the complex and varied original function call sequence, the call sequence is normalized, improving the accuracy of malicious sample detection.

[0011] In any embodiment, the first call sequence matrix is matched with the second call sequence matrix corresponding to the malicious sample to obtain a result indicating whether the file to be detected is a malicious file, including:

[0012] Match each first element in the first call sequence matrix with the corresponding second element in the second call sequence matrix;

[0013] If every second element in the second call sequence matrix that indicates a call relationship between the functions corresponding to its row label and column label matches the first element at the corresponding position in the first call sequence matrix, then the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

[0014] This application embodiment can accurately determine whether a file to be detected is malicious by matching the first call sequence matrix corresponding to the file to be detected with the second call sequence matrix corresponding to each malicious sample in the repository.

[0015] In any embodiment, elements in the second call sequence matrix that have a call relationship between the function corresponding to the row label and the function corresponding to the column label are represented by a first identifier, and elements that do not have a call relationship between the function corresponding to the row label and the function corresponding to the column label are represented by a second identifier;

[0016] Match the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample, including:

[0017] Obtain the positions of all elements corresponding to the first identifier in the second call sequence matrix;

[0018] If the elements at all of those positions in the first call sequence matrix are also the first identifier, then the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

[0019] This application embodiment uses a first identifier to indicate that there is a calling relationship between the functions corresponding to the row labels and the functions corresponding to the column labels of the elements. This allows it to determine whether the first identifier in the first call sequence matrix completely covers the first identifier in the second call sequence matrix, thereby accurately and efficiently determining whether the file to be detected is a malicious file.

[0020] In any embodiment, a first call sequence matrix is generated based on the label corresponding to each function and the call relationship between functions, including:

[0021] A function call sequence diagram is generated based on the call relationships between functions corresponding to macro code. The function call sequence diagram includes function names and the call relationships between function names.

[0022] Replace each function name in the function call sequence graph with its corresponding label to obtain a labeled call sequence graph;

[0023] Generate the first call sequence matrix based on the tag call sequence diagram.

[0024] This application embodiment generates a tag call sequence diagram using the tags corresponding to functions in the file to be detected, and then generates a first call sequence matrix based on the tag call sequence diagram, thereby preventing the risk of being unable to detect whether a sample is malicious due to function name obfuscation.

[0025] In any embodiment, obtaining the macro code of the file to be detected includes:

[0026] Obtain the file header of the file to be tested;

[0027] Determine the file type of the file to be detected based on the file header;

[0028] Retrieve macro code from the corresponding file path based on file type.

[0029] In practical applications, different types of files store macro code at different addresses. Therefore, the file type of the file to be detected can be determined by the file header, and the macro code can be accurately obtained based on the file type.

[0030] In any embodiment, obtaining macro code from the corresponding file path according to the file type includes:

[0031] If the file path contains VBA code, then the VBA code will be treated as macro code;

[0032] If the file path does not contain VBA code, then obtain the p-code, disassemble the p-code, and use the disassembled p-code as macro code.

[0033] In practical applications, malicious samples can evade detection by deleting either the VBA code or the p-code from the macro code. Therefore, to improve the accuracy of malicious sample detection, the VBA code is used as the macro code when the file path contains VBA code, and the disassembled p-code is used as the macro code when it does not, ensuring that malicious samples can still be detected based on macro code.

[0034] In any embodiment, the functions in the macro code are marked to obtain a label corresponding to each function, including:

[0035] Extract the first keyword from the function;

[0036] The first keyword is matched against the second keywords in the tag library, and the tag corresponding to the second keyword that successfully matches the first keyword is used as the tag of the function from which the first keyword was extracted; wherein the tag library is pre-built and includes multiple tags, each tag including at least one second keyword.

[0037] This application provides a unified detection method for disordered call sequences by assigning corresponding labels to each function and replacing the complex and ever-changing original function call sequences with labels.

[0038] In a second aspect, embodiments of this application provide a malicious file detection device, comprising:

[0039] The acquisition module is used to acquire the macro code of the file to be detected;

[0040] The tagging module is used to tag functions in macro code and obtain the tag corresponding to each function;

[0041] The matrix generation module is used to generate a first call sequence matrix based on the label corresponding to each function and the call relationship between functions. Each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label.

[0042] The matching module is used to match the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample to obtain the result of whether the file to be detected is a malicious file.

[0043] In a third aspect, embodiments of this application provide an electronic device, including: a processor, a memory, and a bus, wherein,

[0044] The processor and memory communicate with each other via a bus;

[0045] The memory stores program instructions that can be executed by the processor, and the processor can execute the method of the first aspect by calling the program instructions.

[0046] In a fourth aspect, embodiments of this application provide a non-transitory computer-readable storage medium, comprising:

[0047] A non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the method of the first aspect.

[0048] Other features and advantages of this application will be set forth in the following description and will be apparent in part from the description or may be learned by practicing embodiments of this application. The objectives and other advantages of this application may be realized and obtained by means of the structures particularly pointed out in the written description, claims, and drawings.

Attached Figure Description

[0049] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0050] Figure 1 is a schematic flowchart of a malicious file detection method provided in an embodiment of this application;

[0051] Figure 2 is a function call sequence diagram provided in an embodiment of this application;

[0052] Figure 3 is a tag call sequence diagram provided in an embodiment of this application;

[0053] Figure 4 is a schematic diagram of the file header of a file to be detected provided in an embodiment of this application;

[0054] Figure 5 is a schematic diagram of a malicious file detection device provided in an embodiment of this application;

[0055] Figure 6 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0056] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings.

[0057] Malicious documents have long been a favorite weapon of attackers, used to widely spread malware or launch direct attacks. Users often lack awareness of document security and are unprepared for documents spread through email, social media, and the internet, increasing the chances of successful malicious document attacks. Even with spam detection and malware filtering mechanisms in place, many malicious documents still slip through the net.

[0058] Office documents are a crucial component of online documents. Sophos, in its malware situation report, notes that ransomware is frequently spread via spam emails, PDF files, and Microsoft Word documents containing embedded malicious macros. Statistics show that among malicious file types spread via the web, the three most common file types are .exe, .pdf, and .doc, accounting for 52%, 20%, and 5% respectively; among malicious file types spread via email, the three most common file types are .exe, .xlsx, and .pdf, accounting for 34%, 16%, and 9% respectively.

[0059] Malicious Office file detection is crucial for network information security. Existing methods for detecting malicious Office files mainly include dynamic and static detection methods. For static detection methods, malicious samples often employ obfuscation techniques, rendering function names meaningless. Consequently, the call sequence determined by function names cannot accurately identify whether a file is malicious.

[0060] To address the aforementioned technical problems, the inventors of this application propose a method, apparatus, electronic device, and storage medium for detecting malicious files. This method, after obtaining macro code, tags the functions within the macro code, and then generates a first call sequence matrix based on the call relationships between functions and their tags. Using this first call sequence matrix, it is possible to accurately identify whether a file to be detected is malicious.

[0061] It is understood that the embodiments of this application are mainly aimed at detecting whether an Office file is a malicious file. An Office file refers to a file with a preset format, which includes .doc, .xls, .ppt, .docx, .xlsx, .pptx, .vsd, .vsdx, etc. Therefore, files with the above-mentioned extensions are all called Office files.

[0062] It is understood that the malicious file detection method provided in this application embodiment can be applied to electronic devices; electronic devices include terminals and servers, wherein the terminal can specifically be a smartphone, tablet computer, computer, personal digital assistant (PDA), etc.; the server can specifically be an application server or a web server.

[0063] Figure 1 is a schematic flowchart of a malicious file detection method provided in an embodiment of this application. As shown in Figure 1, the method includes:

[0064] Step 101: Obtain the macro code of the file to be tested.

[0065] Macro code is a term used to describe batch processing. Generally speaking, a macro is a rule or pattern, or syntax substitution, used to describe how a specific input (usually a string) is transformed into a corresponding output (usually also a string) according to predefined rules. Microsoft Word defines a macro as: "A macro is a set of Word commands that can be organized together as a single, independent command to make everyday tasks easier." Word uses the macro language Visual Basic to write macros as a series of instructions.

[0066] In computer science, a macro is an abstraction that replaces a certain text pattern according to a set of predefined rules. Excel, an office software, automatically integrates the "VBA" high-level programming language; programs written in this language can be considered a form of macro coding.

[0067] Macro code is a special feature designed by Microsoft specifically for the Office suite, and it is stored in a specific path within the file to be tested. Understandably, the path where macro code is stored in a file is usually fixed; therefore, the macro code can be obtained through this path.

[0068] Step 102: Mark the functions in the macro code to obtain the label corresponding to each function.

[0069] Macro code contains multiple functions that call one another to achieve the corresponding functionality. After acquiring the macro code, the electronic device can obtain the functions by scanning it. It can be understood that functions in macro code typically begin with "Sub" and end with "End Sub"; therefore, the individual functions within the macro code can be extracted based on "Sub" and "End Sub".
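The scanning step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expression, the name `extract_subs`, and the sample macro are assumptions introduced here, and only `Sub ... End Sub` procedures are handled because those are the only delimiters the text mentions.

```python
import re

# Assumed pattern: an optional Public/Private modifier, "Sub <name>", then
# everything up to the next line that starts with "End Sub".
SUB_PATTERN = re.compile(
    r"^[ \t]*(?:(?:Public|Private)[ \t]+)?Sub[ \t]+(\w+)[^\n]*\n(.*?)^[ \t]*End[ \t]+Sub\b",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def extract_subs(macro_code: str) -> dict[str, str]:
    """Return a mapping of procedure name -> procedure body text."""
    return {m.group(1): m.group(2) for m in SUB_PATTERN.finditer(macro_code)}


# Hypothetical macro used only to exercise the sketch.
sample = '''Sub AutoOpen()
    zFcKWSPrk
End Sub

Private Sub zFcKWSPrk()
    Shell "cmd.exe"
End Sub
'''
subs = extract_subs(sample)
print(list(subs))
```

A production parser would also need to handle `Function` procedures, line continuations, and comments, none of which this sketch attempts.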

[0070] After the functions are obtained from the macro code, labels are assigned to each function based on the keywords it contains. It can be understood that a function can have multiple labels, each representing a functionality of that function.
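As a rough illustration of keyword-based labeling, the sketch below matches identifiers found in a function against a small tag library. The library contents, the label names, and the function `label_function` are hypothetical; the text does not disclose the actual tags or keywords.

```python
import re

# Hypothetical tag library: label -> keywords (lower case). The real
# library's labels and keywords are not given in the text.
TAG_LIBRARY = {
    "auto_exec": {"autoopen", "document_open", "workbook_open"},
    "process": {"shell", "createobject"},
    "network": {"urldownloadtofile", "xmlhttp"},
}


def label_function(name: str, body: str, library=TAG_LIBRARY) -> set[str]:
    """Return every label whose keyword set intersects the function's identifiers."""
    # VBA is case-insensitive, so identifiers are compared in lower case.
    tokens = {t.lower() for t in re.findall(r"[A-Za-z_]\w*", name + " " + body)}
    return {label for label, keywords in library.items() if tokens & keywords}


print(sorted(label_function("AutoOpen", 'Shell "cmd.exe"')))
```

Because matching is on keywords inside the function rather than on its name alone, an obfuscated name such as `zFcKWSPrk` still receives labels as long as its body uses recognizable calls.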

[0071] Step 103: Generate a first call sequence matrix based on the labels corresponding to each function and the call relationships between functions. Each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label.

[0072] The first call sequence matrix is a two-dimensional matrix whose rows and columns are both indexed by labels. These labels are the predefined set of all labels that functions in the macro code of an Office file may carry. Each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to its row label and the function corresponding to its column label. For ease of understanding, consider an example in which the macro code contains function 1 and function 2, function 1 carries labels A and B, and function 2 carries labels A and C. Assuming the predefined label set is A, B, C, and D, and function 1 calls function 2, the generated first call sequence matrix is:

[0073]
       A  B  C  D
    A  1  0  1  0
    B  1  0  1  0
    C  0  0  0  0
    D  0  0  0  0

[0074] In this matrix, "A", "B", "C", and "D" are labels. The label corresponding to a row is called the row label, and the label corresponding to a column is called the column label. Here the row label represents a label of function 1 (the caller), and the column label represents a label of function 2 (the callee). A "1" indicates a call relationship, and a "0" indicates no call relationship. Since function 1 calls function 2, each label of function 1 has a call relationship with each label of function 2. Label A of function 1 calls labels A and C of function 2, so the elements in the first row, first column and in the first row, third column have a value of 1. Label B of function 1 calls labels A and C of function 2, so the elements in the second row, first column and in the second row, third column have a value of 1. All other elements in the first call sequence matrix have a value of 0.
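The construction in this example can be sketched in a few lines. The sketch assumes, following the element positions given above, that rows are indexed by the caller's labels and columns by the callee's labels; the names `call_sequence_matrix`, `function1`, and `function2` are illustrative only.

```python
def call_sequence_matrix(labels, function_labels, calls):
    """Build a 0/1 matrix over `labels` from (caller, callee) pairs.

    Element [r][c] is 1 when some caller carrying label r calls some
    callee carrying label c (assumed orientation: rows = caller).
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for caller, callee in calls:
        for row_label in function_labels[caller]:
            for col_label in function_labels[callee]:
                matrix[index[row_label]][index[col_label]] = 1
    return matrix


# The example from the text: function 1 (labels A, B) calls function 2 (labels A, C).
LABELS = ["A", "B", "C", "D"]
function_labels = {"function1": {"A", "B"}, "function2": {"A", "C"}}
matrix = call_sequence_matrix(LABELS, function_labels, [("function1", "function2")])
for row in matrix:
    print(row)
```

Because the matrix is indexed by labels rather than function names, two macros with differently named functions but the same labeled call structure produce the same matrix.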

[0075] It should be noted that the functions corresponding to the row labels and column labels in the first call sequence matrix can be set according to the actual situation. Furthermore, the number of functions corresponding to the row labels and column labels can also be determined according to the actual situation. For example, if the macro code includes functions 1, 2, and 3, where function 1 calls function 2, and function 2 calls function 3, then the row labels can represent the labels corresponding to function 2, and the column labels can represent the labels corresponding to functions 1 and 3. Similarly, if the macro code includes functions 1, 2, and 3, where function 1 calls functions 2 and 3, then the row labels in the first call sequence matrix can represent the labels corresponding to functions 2 and 3, and the column labels can represent the label corresponding to function 1.

[0076] Step 104: Match the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample to obtain the result of whether the file to be detected is a malicious file.

[0077] In the specific implementation process, various malicious samples are collected in advance, and a second call sequence matrix corresponding to the malicious samples is generated. The first call sequence matrix is matched with the second call sequence matrix to determine whether the function call sequence in the macro code of the file to be detected contains the function call sequence of the malicious sample, thereby obtaining the result of whether the file to be detected is a malicious file.

[0078] It can be understood that if the function call sequence in the macro code of the file to be detected contains the function call sequence of a malicious sample, the behavior of the file to be detected covers the malicious behavior of that sample, and the file to be detected can be judged to be a malicious file. If the second call sequence matrices corresponding to all malicious samples have been matched against the first call sequence matrix and in no case does the function call sequence in the macro code of the file to be detected contain the function call sequence of a malicious sample, the file to be detected is determined to be a normal sample.

[0079] This application embodiment uses tagging of functions, so even if obfuscation techniques are used to make function names meaningless, effective tags can still be extracted. Furthermore, by using tags to replace the complex and varied original function call sequence, the call sequence is normalized, improving the accuracy of malicious sample detection.

[0080] Based on the above embodiments, the first call sequence matrix is matched with the second call sequence matrix corresponding to the malicious sample to obtain the result of whether the file to be detected is a malicious file, including:

[0081] Match each first element in the first call sequence matrix with the corresponding second element in the second call sequence matrix;

[0082] If every second element in the second call sequence matrix that indicates a call relationship between the functions corresponding to its row label and column label matches the first element at the corresponding position in the first call sequence matrix, then the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

[0083] In the specific implementation process, the first call sequence matrix and the second call sequence matrix have the same dimensions, and the row labels and column labels are also ordered in the same way. A successful match means that the value of the second element in the second call sequence matrix is the same as the value of the first element at the corresponding position in the first call sequence matrix.

[0084] When determining whether a file to be detected is malicious, each second element in the second call sequence matrix that represents the call relationship between the function corresponding to the row label and the function corresponding to the column label is matched with the first element at the corresponding position in the first call sequence matrix. If the match is successful, it means that the file to be detected is malicious.

[0085] This application embodiment can accurately determine whether a file to be detected is malicious by matching the first call sequence matrix corresponding to the file to be detected with the second call sequence matrix corresponding to each malicious sample in the repository.

[0086] Based on the above embodiments, in the second call sequence matrix, elements in which the functions corresponding to the row labels and the functions corresponding to the column labels have a call relationship are represented by a first identifier, and elements in which the functions corresponding to the row labels and the functions corresponding to the column labels do not have a call relationship are represented by a second identifier;

[0087] Match the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample, including:

[0088] Obtain the positions of all elements corresponding to the first identifier in the second call sequence matrix;

[0089] If the elements at all of those positions in the first call sequence matrix are also the first identifier, then the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

[0090] In the specific implementation process, the first identifier can be "1" and the second identifier can be "0". Of course, in practical applications, the first identifier and the second identifier can also be represented by other characters, such as the first identifier being "T" and the second identifier being "F". This application embodiment does not specifically limit this. For ease of description, this application embodiment uses 1 and 0 to distinguish the first identifier and the second identifier.

[0091] Therefore, the first call sequence matrix can be represented by the first identifier and the second identifier as follows:

[0092]
    0 1 0 1 1 0 0 1
    0 1 0 1 1 0 0 1
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 1 0 1 1 0 0 1

[0093] It should be noted that the specific values of each element in the above matrix are merely an example. The labels corresponding to the functions in the macro code can be divided into 8 categories; therefore, the above matrix is an 8x8 matrix. Similarly, the second call sequence matrix corresponding to the malicious sample is also an 8x8 matrix.

[0094] Taking the above matrix as an example, when determining whether a file to be detected is malicious based on the first and second call sequence matrices, it can be determined whether the positions of all elements with a value of 1 in the second call sequence matrix fall within the following positions: (1,2), (1,4), (1,5), (1,8), (2,2), (2,4), (2,5), (2,8), (8,2), (8,4), (8,5), (8,8). If so, the file to be detected is malicious. If any one or more elements with a value of 1 are not within the above positions, the file to be detected is not malicious.

[0095] This application embodiment uses a first identifier to indicate that there is a calling relationship between the functions corresponding to the row labels and the functions corresponding to the column labels of the elements. This allows it to determine whether the first identifier in the first call sequence matrix completely covers the first identifier in the second call sequence matrix, thereby accurately and efficiently determining whether the file to be detected is a malicious file.
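The coverage test described above amounts to checking that the 1-positions of a malicious sample's matrix are a subset of the 1-positions of the matrix for the file under test. A minimal sketch, assuming 0/1 integer matrices of equal dimensions and hypothetical names `covers` and `is_malicious`:

```python
def covers(first, second):
    """True if every 1 in `second` is also a 1 at the same position in `first`."""
    return all(
        first[r][c] == 1
        for r, row in enumerate(second)
        for c, value in enumerate(row)
        if value == 1
    )


def is_malicious(first, malicious_matrices):
    """True if `first` covers at least one matrix in the malicious repository."""
    return any(covers(first, second) for second in malicious_matrices)


first = [[0, 1], [0, 1]]
sample_a = [[0, 1], [0, 0]]  # its single 1 is also set in `first`
sample_b = [[1, 0], [0, 0]]  # its single 1 is not set in `first`
print(is_malicious(first, [sample_b, sample_a]))
```

Note that an all-zero sample matrix would be covered by every file, so a real repository would presumably exclude such entries; that is an observation about this sketch, not something the text states.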

[0096] Based on the above embodiments, the step of generating a first call sequence matrix according to the label corresponding to each function and the call relationship between the functions includes:

[0097] A function call sequence diagram is generated based on the call relationships between the functions corresponding to the macro code. The function call sequence diagram includes function names and the call relationships between the function names.

[0098] Replace each function name in the function call sequence graph with its corresponding label to obtain a labeled call sequence graph;

[0099] The first call sequence matrix is generated based on the tag call sequence diagram.

[0100] In the specific implementation process, the electronic device generates a function call sequence diagram based on the code information of the obtained macro code. The diagram represents the call relationships of all functions in the macro code of the file to be detected. For example, if the macro code contains two functions, AutoOpen() and zFcKWSPrk(), and AutoOpen() calls zFcKWSPrk(), then the generated function call sequence diagram is as shown in Figure 2. The diagram in Figure 2 contains the function names AutoOpen() and zFcKWSPrk(), with an arrow indicating the call relationship. It should be noted that in VBA code, a function is called by directly referencing the function name, whereas in DisAsm_p-code the syntax is ArgsCall function_name. The original function call sequence diagram is obtained based on this syntax.
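For VBA source, the "direct reference" rule above suggests a simple way to recover call edges: look for the names of known functions inside each function body. The sketch below is one assumed way to do this; the name `call_edges` and the input shape (a name-to-body mapping) are illustrative, and the p-code `ArgsCall` form is not handled.

```python
import re


def call_edges(functions: dict[str, str]) -> set[tuple[str, str]]:
    """Return (caller, callee) pairs for known function names referenced in a body."""
    edges = set()
    for caller, body in functions.items():
        # VBA identifiers are case-insensitive, so compare in lower case.
        tokens = {t.lower() for t in re.findall(r"[A-Za-z_]\w*", body)}
        for callee in functions:
            # Self-references are skipped as a simplification.
            if callee != caller and callee.lower() in tokens:
                edges.add((caller, callee))
    return edges


functions = {
    "AutoOpen": "    zFcKWSPrk\n",
    "zFcKWSPrk": '    Shell "cmd.exe"\n',
}
print(call_edges(functions))
```

This token-based approach can produce false edges when a function name appears inside a string literal or a comment; stripping those first would be a natural refinement.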

[0101] Each function corresponds to at least one label. For example, the function AutoOpen() includes the labels A_func, B_func, and H_func, and the function zFcKWSPrk() includes the labels B_func, D_func, E_func, and H_func. Therefore, the function names in the function call sequence diagram can be replaced with the labels corresponding to those function names to obtain a label call sequence diagram, as shown in Figure 3.

[0102] After obtaining the label call sequence diagram, a first call sequence matrix can be generated based on the call relationships of each label in the diagram. For example, the label A_func in the function AutoOpen() has call relationships with the labels B_func, D_func, E_func, and H_func in the function zFcKWSPrk(). Therefore, the values of the first row elements (the first row corresponding to label A_func) of the first call sequence matrix are: 0, 1, 0, 1, 1, 0, 0, 1. Similarly, the label B_func in the function AutoOpen() has call relationships with the labels B_func, D_func, E_func, and H_func in the function zFcKWSPrk(). Therefore, the values of the second row elements (the second row corresponding to label B_func) of the first call sequence matrix are: 0, 1, 0, 1, 1, 0, 0, 1. The label H_func in the function AutoOpen() has call relationships with the labels B_func, D_func, E_func, and H_func in the function zFcKWSPrk(). Therefore, the values of the eighth row of the first call sequence matrix (the eighth row corresponding to the label H_func) are 0, 1, 0, 1, 1, 0, 0, 1. The other elements in the first call sequence matrix are all 0, resulting in the following first call sequence matrix:

[0103]
    0 1 0 1 1 0 0 1
    0 1 0 1 1 0 0 1
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 1 0 1 1 0 0 1

[0104] This application embodiment generates a tag call sequence diagram using the tags corresponding to functions in the file to be detected, and then generates a first call sequence matrix based on the tag call sequence diagram, thereby preventing the risk of being unable to detect whether a sample is malicious due to function name obfuscation.

[0105] Based on the above embodiments, obtaining the macro code of the file to be detected includes:

[0106] Obtain the file header of the file to be tested;

[0107] Determine the file type of the file to be detected based on the file header;

[0108] Retrieve macro code from the corresponding file path based on file type.

[0109] In practice, Office documents come in two formats: MS Office 97-2003 and MS Office 2007+. MS Office 97-2003 documents are in binary form, using the Compound File Binary Format (CFBF), also called the OLESS (OLE Structured Storage) format, which conforms to the OLE 1.0 specification; such files are referred to simply as OLE2 files. MS Office 2007+ documents conform to the OOXML (Office Open XML) standard, which was renamed OXML (Open XML) when it passed ISO standard certification in 2008. This standard uses a ZIP container and an XML structure: the object properties of an Office document are stored in XML files inside a ZIP archive.

[0110] After a file is input, its file header is read first. Different file types have different header bytes. For example, if the first 8 bytes are "D0 CF 11 E0 A1 B1 1A E1", the file is an MS Office 97-2003 document; if the first two bytes are "PK", the file is an MS Office 2007+ document. It should be noted that using 8 bytes and 2 bytes is only an example. In practical applications, more or fewer bytes can be used, as long as the file type can be distinguished. This application does not impose specific limitations on this.
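
The header check can be sketched as follows. The function names and the returned strings "ole", "ooxml", and "unknown" are illustrative assumptions; only the two byte signatures come from the text.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # MS Office 97-2003 signature
ZIP_MAGIC = b"PK"                               # MS Office 2007+ (ZIP container)


def detect_office_type(header: bytes) -> str:
    """Classify a file from its leading bytes (hypothetical return values)."""
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "ooxml"
    return "unknown"


def detect_office_type_from_path(path: str) -> str:
    """Read only the first 8 bytes of the file and classify it."""
    with open(path, "rb") as handle:
        return detect_office_type(handle.read(8))


print(detect_office_type(OLE_MAGIC))
print(detect_office_type(b"PK\x03\x04"))
```

Note that "PK" identifies any ZIP archive, so a real implementation would probably also confirm that the archive contains Office parts before treating it as an MS Office 2007+ document.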

[0111] It should be noted that in MS Office 97-2003 documents, both VBA code and p-code are stored in the document's Module Stream, while in MS Office 2007+ documents, both VBA code and p-code are stored in the vbaProject.bin file.

[0112] The Module Stream or the vbaProject.bin file storing the VBA code and p-code is then parsed. Experiments also revealed that malicious MS Office 2007+ documents can evade parsing by renaming vbaProject.bin. Therefore, when locating the file that stores the VBA code and p-code in an MS Office 2007+ document, it is necessary to look in the [Content_Types].xml file in the ZIP root directory for the Override element whose ContentType is "application/vnd.ms-office.vbaProject". The PartName value of this element indicates the location of the corresponding VBA code and p-code. It is understood that both VBA code and p-code are considered macro code.
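
A minimal sketch of the lookup in [Content_Types].xml, using only the Python standard library, might look like this. The function name find_vba_part, the demo archive, and the part name "word/renamed.bin" are hypothetical and serve only to show that a renamed VBA project is still found.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
VBA_CONTENT_TYPE = "application/vnd.ms-office.vbaProject"


def find_vba_part(zip_bytes: bytes):
    """Return the archive path of the VBA project part, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        root = ET.fromstring(archive.read("[Content_Types].xml"))
    for override in root.iter(CT_NS + "Override"):
        if override.get("ContentType") == VBA_CONTENT_TYPE:
            # PartName is an absolute part name such as "/word/vbaProject.bin".
            return override.get("PartName", "").lstrip("/")
    return None


def make_demo_archive() -> bytes:
    """Build a tiny in-memory archive whose VBA part has been renamed."""
    content_types = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Override PartName="/word/renamed.bin" '
        'ContentType="application/vnd.ms-office.vbaProject"/>'
        '</Types>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", content_types)
    return buffer.getvalue()


vba_part = find_vba_part(make_demo_archive())
print(vba_part)
```

Looking the part up by content type rather than by file name is what makes the renaming trick described above ineffective.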

[0113] In practical applications, different types of files store macro code at different addresses. Therefore, the file type of the file to be detected can be determined by the file header, and the macro code can be accurately obtained based on the file type.

[0114] Based on the above embodiments, macro code is obtained from the corresponding file path according to the file type, including:

[0115] If the file path contains VBA code, then the VBA code will be treated as macro code;

[0116] If the file path does not contain VBA code, then obtain the p-code, disassemble the p-code, and use the disassembled p-code as macro code.

[0117] In a specific implementation, malicious samples can evade detection by deleting the VBA code and retaining the p-code, or by deleting the p-code and retaining the VBA code, so detection accuracy is low if only the VBA code or only the p-code is examined. Therefore, in this embodiment, if VBA code is obtained from the file path, it is used as the macro code. If only p-code exists and no VBA code exists, the p-code is first disassembled to obtain disassembled code, denoted DisAsm_p-code, and DisAsm_p-code is used for subsequent processing.

[0118] It is understandable that there are two scenarios where the file path includes VBA code: the first is that it only contains VBA code and does not contain p-code; the second is that it contains both VBA code and p-code.

[0119] In practical applications, malicious samples can evade detection by deleting either the VBA code or the p-code from the macro code. Therefore, to improve the accuracy of malicious sample detection, the VBA code is used as the macro code when the file path contains VBA code, and the disassembled p-code is used as the macro code when the file path does not contain VBA code, which ensures that malicious samples can still be detected from the macro code.
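
The selection rule can be summarized in a short sketch. The function select_macro_code and the disassemble callback are hypothetical; the application does not name a specific p-code disassembler, so a placeholder is used here.

```python
from typing import Callable, Optional


def select_macro_code(vba_source: Optional[str],
                      p_code: Optional[bytes],
                      disassemble: Callable[[bytes], str]) -> Optional[str]:
    """Prefer VBA source; fall back to disassembled p-code when it is absent."""
    if vba_source and vba_source.strip():
        return vba_source
    if p_code:
        return disassemble(p_code)
    return None


def fake_disassembler(data: bytes) -> str:
    """Placeholder standing in for a real p-code disassembler (assumption)."""
    return "DisAsm_p-code:" + data.hex()


both = select_macro_code("Sub AutoOpen()\nEnd Sub", b"\x01\x02", fake_disassembler)
only_p_code = select_macro_code(None, b"\x01\x02", fake_disassembler)
neither = select_macro_code("", None, fake_disassembler)
print(both, only_p_code, neither, sep="\n")
```

When both forms are present the VBA source wins, which matches the two scenarios listed above for a file path that contains VBA code.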

[0120] Based on the above embodiments, functions in the macro code are marked to obtain a tag corresponding to each function, including:

[0121] Extract the first keyword from the function;

[0122] The first keyword is matched with the second keyword in the tag library, and the tag corresponding to the second keyword that successfully matches the first keyword is used as the tag of the first keyword; wherein, the tag library is pre-built and includes multiple tags, each tag including at least one second keyword.

[0123] In a specific implementation, the special function keywords in the macro code of Office files are first analyzed and classified, and a tag library is then established. For example, the keywords 'Mid', 'Left', 'Right', 'StrReverse', 'Xor', 'ChrB', 'ChrW', 'Chr', 'Replace', and 'Hex' are all related to obfuscation, so they are classified into the obfuscation category and marked as obfuscation_tag. Similarly, the keywords 'webclient', 'net', 'Socket', 'Connections', and 'WorkbookConnection' are all related to network connections, so they are classified into the network connection category and marked as webconnect_tag. The remaining categories are auto_tag for automatic execution, fileAction_tag for file operations, sysEnv_tag for system environment variables, osdllCall_tag for operating system library calls, hideWindow_tag for window hiding, and shell_tag for shell operations, for a total of eight categories. It is understood that the division into eight categories is only an example, and the categories can be divided according to the actual situation. This application does not impose specific limitations on this.

[0124] After the electronic device obtains the function, it extracts the first keyword in the function. The specific extraction method can be keyword matching, that is, matching the above keywords with the words in the function to determine whether the function contains the above keywords.

[0125] After the first keyword is extracted, it is matched against the second keywords of each tag in the tag library, and the tag whose second keyword successfully matches the first keyword is used as the tag of the first keyword. For example, if a function contains the keyword 'StrReverse', the function is marked as `obfuscation_tag_func`. A single matching keyword is sufficient: even if 'StrReverse' is the only keyword from the obfuscation category in the function, the function is still marked as `obfuscation_tag_func`. It should be noted that a function can have multiple tags, and a tag can be shared by multiple functions.
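
The keyword-to-tag matching can be sketched as follows. Only two of the eight categories are shown, with the keywords listed in the text. The function name tag_function and the choice of whole-word, case-insensitive matching are assumptions of this sketch, since the application does not specify how keywords are compared.

```python
import re

# Partial tag library: tag name -> second keywords (from the text).
TAG_LIBRARY = {
    "obfuscation_tag": ["Mid", "Left", "Right", "StrReverse", "Xor",
                        "ChrB", "ChrW", "Chr", "Replace", "Hex"],
    "webconnect_tag": ["webclient", "net", "Socket", "Connections",
                       "WorkbookConnection"],
}


def tag_function(body: str, library=TAG_LIBRARY):
    """Return every tag that has at least one keyword present in the body."""
    # First keywords: identifiers found in the function body, lower-cased.
    words = {word.lower() for word in re.findall(r"[A-Za-z_]\w*", body)}
    return [tag for tag, keywords in library.items()
            if any(keyword.lower() in words for keyword in keywords)]


print(tag_function('x = StrReverse("exe.a") & Chr(65)'))
print(tag_function('Set s = CreateObject("MSWinsock.Socket")\ny = Left(z, 2)'))
print(tag_function('MsgBox "hello"'))
```

A function with keywords from several categories receives several tags, and one keyword is enough to assign a tag.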

[0126] It should be noted that the names of the tags in the above tag library can also be replaced with other names, such as A, B, C, D, E, F, G, and H.

[0127] This application provides a unified detection method for disordered call sequences by assigning corresponding labels to each function and replacing the complex and ever-changing original function call sequences with labels.

[0128] For ease of understanding, this application uses a file with the hash value 4e28d5cf0bb90add582515196e44f94cc1d24c9a as an example to describe the solution of this application.

[0129] (1) Determine whether the input Office document is an OLE file or an OOXML file.

[0130] Read the file header. As shown in Figure 4, the first 8 bytes are parsed as "D0 CF 11 E0 A1 B1 1A E1", which confirms that the file is an MS Office 97-2003 document.

[0131] (2) Extract the VBA code or p-code from the document.

[0132] Locate the Module Stream of the document and obtain the VBA code offset value ModuleOffset from it. The p-code is located between offset 0 and ModuleOffset, and the VBA code is located from ModuleOffset to the end of the stream.
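
The split at ModuleOffset amounts to slicing the stream bytes, as in the sketch below. The function name split_module_stream and the sample bytes are hypothetical. In real documents the source portion is, as far as the VBA file format is generally documented, stored in compressed form; decompression is not shown here.

```python
def split_module_stream(stream: bytes, module_offset: int):
    """Split a module stream into (p-code bytes, VBA source bytes)."""
    if not 0 <= module_offset <= len(stream):
        raise ValueError("module_offset lies outside the stream")
    p_code = stream[:module_offset]       # offset 0 up to ModuleOffset
    vba_source = stream[module_offset:]   # ModuleOffset to the end
    return p_code, vba_source


# Hypothetical stream: 5 bytes of p-code followed by the source portion.
p_code, vba_source = split_module_stream(b"PCODE" + b"SOURCE", 5)
print(p_code, vba_source)
```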

[0133] In this document, both VBA code and p-code exist. Therefore, the VBA code is used in the subsequent analysis.

[0134] (3) Scan the above VBA code to obtain the functions AutoOpen() and zFcKWSPrk() in the code.
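
One way to scan VBA source for procedures and their call relationships is a regular-expression pass, sketched below. The sample source is invented for illustration and is not the code of the analyzed file; the names extract_procedures and find_calls are hypothetical. The sketch ignores comments, string literals, Declare statements, and Property procedures, which a robust parser would need to handle.

```python
import re

# Matches "Sub name ... End Sub" and "Function name ... End Function" blocks.
PROC_RE = re.compile(
    r"^[ \t]*(?:(?:Public|Private|Friend)[ \t]+)?(?:Static[ \t]+)?"
    r"(?:Sub|Function)[ \t]+(\w+)(.*?)^[ \t]*End[ \t]+(?:Sub|Function)\b",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def extract_procedures(source: str):
    """Map each procedure name to the text of its body."""
    return {m.group(1): m.group(2) for m in PROC_RE.finditer(source)}


def find_calls(procedures):
    """Return (caller, callee) pairs where the callee name occurs in the caller."""
    edges = []
    for caller, body in procedures.items():
        for callee in procedures:
            if callee == caller:
                continue
            if re.search(r"\b" + re.escape(callee) + r"\b", body, re.IGNORECASE):
                edges.append((caller, callee))
    return edges


SAMPLE = """
Sub AutoOpen()
    zFcKWSPrk Right("abc", 1)
End Sub

Function zFcKWSPrk(arg)
    zFcKWSPrk = Chr(65) & arg
End Function
"""

procedures = extract_procedures(SAMPLE)
edges = find_calls(procedures)
print(list(procedures), edges)
```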

[0135] (4) Tag the functions obtained in the above steps according to the self-built tag library. The function AutoOpen() contains the keyword AutoOpen in the auto_tag category of automatic execution, the keywords Right, Left, and Mid in the obfuscation_tag category of obfuscation, and the keywords shell and exe in the shell_tag category of shell operation. Therefore, the function AutoOpen() is tagged as auto_tag, obfuscation_tag, and shell_tag. The function zFcKWSPrk() contains the keywords Right, Left, and Chr in the obfuscation_tag category of obfuscation, the keyword http in the webconnect_tag category of network connection, the keyword DownloadFile in the fileAction_tag category of file operation, and the keyword shell in the shell_tag category of shell operation. Therefore, the function zFcKWSPrk() is tagged as obfuscation_tag, webconnect_tag, fileAction_tag, and shell_tag.

[0136] For ease of representation, auto_tag, obfuscation_tag, hideWindow_tag, webconnect_tag, fileAction_tag, sysEnv_tag, osdllCall_tag, and shell_tag are labeled A, B, C, D, E, F, G, and H, respectively. Correspondingly, the AutoOpen() function is labeled A_func, B_func, and H_func; and the zFcKWSPrk() function is labeled B_func, D_func, E_func, and H_func.

[0137] Based on the code information, the function call sequence diagram of the original functions is obtained first, as shown in Figure 2. The function names in the call sequence diagram are then replaced with labels to obtain the label call sequence diagram, as shown in Figure 3.

[0138] The call matrix generated based on the tag call sequence diagram is shown below:

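[0139] (rows and columns ordered A through H)

        A  B  C  D  E  F  G  H
    A   0  1  0  1  1  0  0  1
    B   0  1  0  1  1  0  0  1
    C   0  0  0  0  0  0  0  0
    D   0  0  0  0  0  0  0  0
    E   0  0  0  0  0  0  0  0
    F   0  0  0  0  0  0  0  0
    G   0  0  0  0  0  0  0  0
    H   0  1  0  1  1  0  0  1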

[0140] Comparing the call sequence matrix of the labeled functions with the call sequence matrices of the labeled malicious samples in the repository shows that this matrix covers the following matrix in the repository:

[0141]

[0142] This indicates that the malicious behavior of this sample covers the malicious behavior of a sample in the repository, so the sample is determined to be malicious.
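
The "covers" relationship can be expressed as a position-wise check: every position that holds 1 in the repository matrix must also hold 1 in the matrix of the file to be detected. The sketch below uses small hypothetical 3 x 3 matrices instead of the matrices of the example, and the function name covers is an assumption.

```python
def covers(first, second):
    """True when every 1 in `second` is also 1 at the same position in `first`."""
    return all(
        first[i][j] == 1
        for i, row in enumerate(second)
        for j, value in enumerate(row)
        if value == 1
    )


# Hypothetical matrix of the file to be detected.
first = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
# Hypothetical matrix of a known malicious sample in the repository.
signature = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]

print(covers(first, signature))
print(covers(signature, first))
```

Extra call relationships in the detected file do not prevent a match; only the relationships recorded for the malicious sample must be present. An all-zero repository matrix would be covered by every file, so a practical implementation would probably exclude such entries.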

[0143] The embodiments of this application reduce system overhead by extracting and detecting call sequences with static methods. Using both VBA code and p-code improves the reliability of the detection results. Because labels are used, valid call sequences can be extracted even when obfuscation techniques are applied, and the call sequences are normalized, which provides a unified detection method for disordered call sequences and improves the detection rate.

[0144] Figure 5 is a schematic diagram of a malicious file detection device provided in an embodiment of this application. The device can be a module, program segment, or code on an electronic device. It should be understood that this device corresponds to the method embodiment of Figure 1 above and can perform the steps involved in the method embodiment of Figure 1. For the specific functions of the device, refer to the description above; to avoid repetition, detailed descriptions are omitted here. The device includes an acquisition module 501, a marking module 502, a matrix generation module 503, and a matching module 504, wherein:

[0145] The acquisition module 501 is used to acquire the macro code of the file to be detected;

[0146] The marking module 502 is used to mark the functions in the macro code to obtain a label corresponding to each function;

[0147] The matrix generation module 503 is used to generate a first call sequence matrix based on the labels corresponding to each function and the call relationships between the functions, wherein each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label of the element.

[0148] The matching module 504 is used to match the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample to obtain the result of whether the file to be detected is a malicious file.

[0149] Based on the above embodiments, the matching module 504 is specifically used for:

[0150] Match each first element in the first call sequence matrix with the corresponding second element in the second call sequence matrix;

[0151] If each second element in the second call sequence matrix that indicates a call relationship between the function corresponding to the row label and the function corresponding to the column label is successfully matched with the corresponding first element in the first call sequence matrix, the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

[0152] Based on the above embodiments, elements in the second call sequence matrix where there is a call relationship between the function corresponding to the row label and the function corresponding to the column label are represented by a first identifier, and elements where there is no call relationship between the function corresponding to the row label and the function corresponding to the column label are represented by a second identifier; the matching module 504 is specifically used for:

[0153] Obtain the positions of all elements corresponding to the first identifier in the second call sequence matrix;

[0154] If the elements at those positions in the first call sequence matrix are also the first identifier, the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

[0155] Based on the above embodiments, the matrix generation module 503 is specifically used for:

[0156] A function call sequence diagram is generated based on the call relationships between the functions corresponding to the macro code. The function call sequence diagram includes function names and the call relationships between the function names.

[0157] Replace each function name in the function call sequence diagram with its corresponding label to obtain a label call sequence diagram;

[0158] The first call sequence matrix is generated based on the label call sequence diagram.

[0159] Based on the above embodiments, the acquisition module 501 is specifically used for:

[0160] Obtain the file header of the file to be detected;

[0161] The file type of the file to be detected is determined based on the file header;

[0162] The macro code is obtained from the corresponding file path based on the file type.

[0163] Based on the above embodiments, the acquisition module 501 is specifically used for:

[0164] If the file path includes VBA code, then the VBA code will be used as the macro code;

[0165] If the file path does not contain the VBA code, then the p-code is obtained, and the p-code is disassembled. The disassembled p-code is then used as the macro code.

[0166] Based on the above embodiments, the marking module 502 is specifically used for:

[0167] Extract the first keyword from the function;

[0168] The first keyword is matched with the second keyword in the tag library, and the tag corresponding to the second keyword that successfully matches the first keyword is used as the tag of the first keyword; wherein, the tag library is pre-built and includes multiple tags, and each tag includes at least one second keyword.

[0169] Figure 6 is a schematic diagram of the physical structure of the electronic device provided in the embodiments of this application. As shown in Figure 6, the electronic device includes a processor 601, a memory 602, and a bus 603, wherein:

[0170] The processor 601 and the memory 602 communicate with each other through the bus 603;

[0171] The processor 601 is used to call program instructions in the memory 602 to execute the methods provided in the above-described method embodiments, for example: obtaining the macro code of the file to be detected; marking the functions in the macro code to obtain a label corresponding to each function; generating a first call sequence matrix based on the label corresponding to each function and the call relationships between the functions, wherein each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label; and matching the first call sequence matrix with a second call sequence matrix corresponding to a malicious sample to obtain a result indicating whether the file to be detected is a malicious file.

[0172] The processor 601 can be an integrated circuit chip with signal processing capabilities. The processor 601 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the various methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor.

[0173] The memory 602 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.

[0174] This embodiment discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the methods provided in the above-described method embodiments, such as: obtaining macro code of a file to be detected; marking functions in the macro code to obtain a label corresponding to each function; generating a first call sequence matrix based on the label corresponding to each function and the call relationship between functions, wherein each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label; and matching the first call sequence matrix with a second call sequence matrix corresponding to a malicious sample to obtain a result on whether the file to be detected is a malicious file.

[0175] This embodiment provides a non-transitory computer-readable storage medium storing computer instructions. These instructions instruct the computer to execute the methods provided in the above embodiments, including, for example: acquiring macro code of a file to be detected; marking functions in the macro code to obtain a label corresponding to each function; generating a first call sequence matrix based on the labels corresponding to each function and the call relationships between functions, wherein each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label; and matching the first call sequence matrix with a second call sequence matrix corresponding to a malicious sample to obtain a result indicating whether the file to be detected is a malicious file.

[0176] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some communication interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical, or other forms.

[0177] Furthermore, the units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0178] Furthermore, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

[0179] In this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any such actual relationship or order between these entities or operations.

[0180] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A malicious file detection method, characterized by comprising: obtaining the macro code of the file to be detected; marking the functions in the macro code to obtain a label corresponding to each function; generating a first call sequence matrix based on the label corresponding to each function and the call relationships between the functions, wherein each element in the first call sequence matrix indicates whether a call relationship exists between the function corresponding to the row label and the function corresponding to the column label of the element; the first call sequence matrix is a two-dimensional matrix whose rows and columns are both labels; the labels are determined based on the keywords contained in the functions, and the labels refer to all labels, determined in advance, that are contained in the functions of the macro code in the Office file; the union of the row labels and the column labels covers all the labels corresponding to the functions of the macro code in the Office file; and matching the first call sequence matrix with a second call sequence matrix corresponding to a malicious sample to obtain a result indicating whether the file to be detected is a malicious file.

2. The method of claim 1, wherein, The step of matching the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample to obtain a result on whether the file to be detected is a malicious file includes: Match each first element in the first call sequence matrix with the corresponding second element in the second call sequence matrix; If the second element in the second call sequence matrix that represents the function corresponding to the row label and the function corresponding to the column label having a call relationship is successfully matched with the first element in the first call sequence matrix, then the file to be detected is determined to be a malicious file; otherwise, the file to be detected is determined not to be a malicious file.

3. The method according to claim 1, characterized in that, in the second call sequence matrix, elements for which the function corresponding to the row label and the function corresponding to the column label have a call relationship are represented by a first identifier, and elements for which they do not have a call relationship are represented by a second identifier; the step of matching the first call sequence matrix with the second call sequence matrix corresponding to the malicious sample includes: obtaining the positions of all elements corresponding to the first identifier in the second call sequence matrix; and if the elements at those positions in the first call sequence matrix are also the first identifier, determining that the file to be detected is a malicious file; otherwise, determining that the file to be detected is not a malicious file.

4. The method according to claim 1, characterized in that the step of generating the first call sequence matrix based on the label corresponding to each function and the call relationships between the functions includes: generating a function call sequence diagram based on the call relationships between the functions corresponding to the macro code, the function call sequence diagram including function names and the call relationships between the function names; replacing each function name in the function call sequence diagram with its corresponding label to obtain a label call sequence diagram; and generating the first call sequence matrix based on the label call sequence diagram.

5. The method according to claim 1, characterized in that the step of obtaining the macro code of the file to be detected includes: obtaining the file header of the file to be detected; determining the file type of the file to be detected based on the file header; and obtaining the macro code from the corresponding file path based on the file type.

6. The method according to claim 5, characterized in that, The step of retrieving the macro code from the corresponding file path based on the file type includes: If the file path includes VBA code, then the VBA code will be used as the macro code; If the file path does not contain the VBA code, then obtain the p-code, disassemble the p-code, and use the disassembled p-code as the macro code.

7. The method according to any one of claims 1-6, characterized in that, The step of marking the functions in the macro code to obtain a tag for each function includes: Extract the first keyword from the function; The first keyword is matched with the second keyword in the tag library, and the tag corresponding to the second keyword that successfully matches the first keyword is used as the tag of the first keyword; wherein, the tag library is pre-built and includes multiple tags, and each tag includes at least one second keyword.

8. A malicious file detection device, characterized by comprising: an acquisition module, used to acquire the macro code of the file to be detected; a marking module, used to mark the functions in the macro code to obtain a label corresponding to each function; a matrix generation module, used to generate a first call sequence matrix based on the label corresponding to each function and the call relationships between the functions, wherein each element in the first call sequence matrix indicates whether there is a call relationship between the function corresponding to the row label and the function corresponding to the column label; the first call sequence matrix is a two-dimensional matrix whose rows and columns are both labels; the labels are determined based on the keywords contained in the functions, and the labels refer to all labels, determined in advance, that are contained in the functions of the macro code in the Office file; the union of the row labels and the column labels covers all the labels corresponding to the functions of the macro code in the Office file; and a matching module, used to match the first call sequence matrix with a second call sequence matrix corresponding to a malicious sample to obtain a result indicating whether the file to be detected is a malicious file.

9. An electronic device, characterized by comprising a processor, a memory, and a bus, wherein the processor and the memory communicate with each other via the bus; the memory stores program instructions that can be executed by the processor, and the processor can execute the method as described in any one of claims 1-7 by calling the program instructions.

10. A non-transitory computer-readable storage medium, characterized in that, The non-transitory computer-readable storage medium stores computer instructions that, when executed by a computer, cause the computer to perform the method as described in any one of claims 1-7.