Malicious process detection methods, devices, and computer-readable storage media
By converting process access logs into a two-dimensional matrix and training a neural network model on it, the invention addresses the difficulty of detecting malicious processes amid a large volume of process access logs, and achieves fast and accurate detection of malicious processes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI GUAN AN INFORMATION TECH
- Filing Date
- 2021-11-10
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies cannot effectively and accurately detect malicious processes based on process access logs, as process access logs are large in volume and difficult to analyze manually or using established rules.
By storing the source process, target process, call content, and malicious identifier fields of the target process in the process access log in the form of a two-dimensional matrix, extracting the absolute path and generating a path list, and using a neural network to train a malicious process detection model, it is possible to determine whether the process to be detected is a malicious process.
It improves the accuracy of malicious process detection by using the correlation between the source process and the absolute path to achieve fast and accurate malicious process identification.
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence, and in particular to a method, apparatus and computer-readable storage medium for detecting malicious processes.
Background Technology
[0002] During operation, processes access necessary resources, and these accesses are part of the process's behavioral characteristics. The access behaviors of malicious processes and normal processes differ to some extent. By analyzing process access behaviors, it may be possible to determine whether a process is malicious or not.
[0003] Windows systems have a logging function. Once Sysmon, an official Microsoft tool, is installed, process access behavior can be recorded, which provides a source of data.
[0004] The amount of logs in process access logs is enormous. A single process can generate hundreds of logs in a short period of time, making them difficult to analyze manually and also difficult to match using established rules.
[0005] There is currently no effective solution to the problem that existing technologies cannot accurately detect malicious processes based on process access logs.
Summary of the Invention
[0006] To address the aforementioned problems, this invention provides a malicious process detection method, apparatus, and computer-readable storage medium. The method stores the absolute path in the call content as a path list, merges the source process into the path list, numbers the list items according to their contents to obtain a sample two-dimensional matrix, trains a malicious process detection model using the sample two-dimensional matrix, and then uses this model to perform malicious detection on process access logs to obtain the detection result of the target process, thereby solving the problem of inaccurate detection of malicious processes.
[0007] To achieve the above objectives, the present invention provides a malicious process detection method, comprising: acquiring multiple process access logs; extracting the source process, target process, call content, and malicious identifier field of the target process from the process access logs; storing the source process, target process, call content, and malicious identifier field of the target process in the form of a two-dimensional matrix to obtain a first two-dimensional matrix; extracting the absolute path from the call content; storing the absolute path as a path list; deleting the call content; and merging the corresponding source process into the path list to obtain a second two-dimensional matrix; merging the process access logs of the same target process; numbering the list items in the path list according to the list item content to obtain a sample two-dimensional matrix; using the target process in the sample two-dimensional matrix as the row index, the numbered path list, and the malicious identifier field of the target process as the column index, inputting them into a neural network for training to obtain a malicious process detection model; and inputting the process to be detected into the malicious process detection model to determine whether the process to be detected is a malicious process.
[0008] Optionally, merging the process access logs of the same target process and numbering the list items in the path list according to their contents to obtain a sample two-dimensional matrix includes: merging the path lists corresponding to all target processes to obtain a two-dimensional list; using the list items in the two-dimensional list as words, generating a word vector dictionary and a sequence number for each word vector using a word vector generation algorithm; and converting the sequence number of each list item according to the word vector dictionary and the sequence number of each word vector to obtain the sample two-dimensional matrix.
[0009] Further optionally, after converting the sequence number of each list item according to the word vector dictionary and the sequence number corresponding to each vector, the method further includes: deleting redundant list items when the length of the target path list is greater than the preset length; and padding the insufficient list items with 0 when the length of the target path list is less than the preset length.
[0010] Further optionally, the step of extracting the absolute path from the call content and storing the absolute path as a path list includes: splitting the call content into multiple short strings using short string concatenation characters as delimiters; splitting the absolute path in the short strings and the address written to the target process using absolute path concatenation characters as delimiters; deleting the address written to the target process; and converting all absolute paths into a path list using each absolute path as a list item.
[0011] Further optionally, merging the process access logs of the same target process includes: concatenating end to end the path lists corresponding to the same target process to obtain the path list corresponding to the target process; and taking the target process malicious identifier field of the first process access log of the same target process as the target process malicious identifier field corresponding to the target process.
[0012] On the other hand, the present invention also provides a malicious process detection device, comprising: a data acquisition module, used to acquire multiple process access logs, extract the source process, target process, call content, and target process malicious identifier field from the process access logs, and store the source process, target process, call content, and target process malicious identifier field in the form of a two-dimensional matrix to obtain a first two-dimensional matrix; a second two-dimensional matrix generation module, used to extract the absolute path from the call content, store the absolute path as a path list, delete the call content, and merge the corresponding source process into the path list to obtain a second two-dimensional matrix; a data merging module, used to merge process access logs of the same target process; a sample two-dimensional matrix generation module, used to number the list items in the path list according to the list item content to obtain a sample two-dimensional matrix; a malicious process detection model generation module, used to input the target process in the sample two-dimensional matrix as the row index, the numbered path list, and the target process malicious identifier field as the column index into a neural network for training to obtain a malicious process detection model; and a malicious process judgment module, used to input the process to be detected into the malicious process detection model to determine whether the process to be detected is a malicious process.
[0013] Further optionally, the sample two-dimensional matrix generation module includes: a two-dimensional list generation submodule, used to merge the path lists corresponding to all target processes to obtain a two-dimensional list; a word vector generation submodule, used to use the list items in the two-dimensional list as words, and use a word vector generation algorithm to generate a word vector dictionary and a sequence number corresponding to each word vector; and a numbering submodule, used to convert the sequence number of each list item according to the word vector dictionary and the sequence number corresponding to each word vector to obtain the sample two-dimensional matrix.
[0014] Further optionally, the device also includes: a deletion module for deleting redundant list items when the length of the target path list is greater than a preset length; and a supplementation module for padding the missing list items with 0 when the length of the target path list is less than the preset length.
[0015] Further optionally, the second two-dimensional matrix generation module includes: a call content string segmentation submodule, used to segment the call content into multiple short strings using short string concatenation characters as delimiters; a short string segmentation submodule, used to segment the absolute paths in the short strings and the addresses written to the target process using absolute path concatenation characters as delimiters; and a list conversion submodule, used to delete the addresses written to the target process and convert all absolute paths into a path list, using each absolute path as a list item.
[0016] Further optionally, the data merging module includes: a data connection submodule, used to concatenate end to end the path lists of the same target process to obtain the path list corresponding to the target process; and a target process malicious identifier field merging submodule, used to take the target process malicious identifier field of the first process access log of the same target process as the target process malicious identifier field corresponding to the target process.
[0017] On the other hand, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the above-described malicious process detection method.
[0018] The above technical solution has the following beneficial effects: This invention establishes a sample set characterized by the source process and absolute path, and uses this sample set as input to train a malicious process detection model, thus linking the detection of malicious processes with the source process and absolute path, which facilitates the detection of malicious processes. Since the source process and absolute path in a malicious process are different from those in a normal process, associating the source process and absolute path with their text content gives them meaning for judging whether a process is malicious, thereby improving the accuracy of malicious process detection.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of the malicious process detection method provided in an embodiment of the present invention;
[0021] Figure 2 is a flowchart of the sample two-dimensional matrix generation method provided in an embodiment of the present invention;
[0022] Figure 3 is a flowchart of the method for unifying the length of the path list provided in an embodiment of the present invention;
[0023] Figure 4 is a flowchart of the path list generation method provided in an embodiment of the present invention;
[0024] Figure 5 is a flowchart of the process access log merging method provided in an embodiment of the present invention;
[0025] Figure 6 is a schematic diagram of the malicious process detection device provided in an embodiment of the present invention;
[0026] Figure 7 is a schematic diagram of the sample two-dimensional matrix generation module provided in an embodiment of the present invention;
[0027] Figure 8 is a schematic diagram of the deletion module and the supplement module provided in an embodiment of the present invention;
[0028] Figure 9 is a schematic diagram of the structure of the second two-dimensional matrix generation module provided in an embodiment of the present invention;
[0029] Figure 10 is a schematic diagram of the data merging module provided in an embodiment of the present invention.
[0030] Figure labeling: 100 - Data acquisition module; 200 - Second 2D matrix generation module; 2001 - Call content string segmentation submodule; 2002 - Short string segmentation submodule; 2003 - List conversion submodule; 300 - Data merging module; 3001 - Data concatenation submodule; 3002 - Target process malicious identifier field merging submodule; 400 - Sample 2D matrix generation module; 4001 - 2D list generation submodule; 4002 - Word vector generation submodule; 4003 - Numbering submodule; 500 - Malicious process detection model generation module; 600 - Malicious process judgment module; 700 - Deletion module; 800 - Supplement module.
Detailed Implementation
[0031] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0032] To address the aforementioned problem of inaccurate detection of malicious processes based on process access logs, this invention provides a method for detecting malicious processes. Figure 1 is a flowchart of the malicious process detection method provided in an embodiment of the present invention. As shown in Figure 1, the method includes:
[0033] S101. Obtain multiple process access logs, extract the source process, target process, call content, and target process malicious identifier fields from the process access logs, and store the source process, target process, call content, and target process malicious identifier fields in the form of a two-dimensional matrix to obtain the first two-dimensional matrix;
[0034] Process access logs are generated when a process accesses a target resource during runtime. Sysmon can be used to generate these logs. Each log records the source process starting the target process, and the log fields include "Source Process," "Target Process," "Call Content," and "Generation Time." A large number of these logs are generated manually, and each target process is labeled as malicious or not, which produces the malicious identifier field. nxlog is used to collect the logs, and only the "Source Process," "Target Process," "Call Content," and "Malicious Identifier Field" fields are retained. The data is then converted to JSON format and stored as a two-dimensional matrix to simplify subsequent data processing.
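The storage step in S101 can be sketched in a few lines of Python. This is a minimal illustration only: the field names (`source_process`, `target_process`, `call_content`, `malicious_flag`) and the sample log lines are hypothetical and do not come from the patent or from real Sysmon/nxlog output.

```python
import json

# Hypothetical field names standing in for the four retained log fields.
FIELDS = ["source_process", "target_process", "call_content", "malicious_flag"]

def build_first_matrix(json_lines):
    """Return a header row plus one row per log record, keeping only FIELDS."""
    rows = [list(FIELDS)]
    for line in json_lines:
        record = json.loads(line)
        # Fields such as the generation time are dropped at this point.
        rows.append([record.get(name, "") for name in FIELDS])
    return rows

# Made-up JSON log lines for illustration.
sample_logs = [
    '{"source_process": "c:/tools/a.exe", "target_process": "c:/win/b.exe", '
    '"call_content": "c:/win/x.dll+1a2b|c:/win/y.dll+3c4d", '
    '"malicious_flag": "Malicious", "generation_time": "t0"}',
    '{"source_process": "c:/win/c.exe", "target_process": "c:/win/d.exe", '
    '"call_content": "c:/win/x.dll+5e6f", '
    '"malicious_flag": "-", "generation_time": "t1"}',
]
first_matrix = build_first_matrix(sample_logs)
```

Each row of `first_matrix` after the header corresponds to one process access log.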
[0035] S102. Extract the absolute path from the call content, store the absolute path as a path list, delete the call content, and merge the corresponding source process into the path list to obtain the second two-dimensional matrix.
[0036] The call content contains multiple absolute paths. These absolute paths are extracted and stored as a path list. At the same time, the call content field is removed, leaving only the path list derived from it. For each process access log, the string in the "source process" field is added as a new list item at the head of the path list to form a new list. The "source process" field is then removed, resulting in the second two-dimensional matrix.
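As a rough sketch of the merge described in S102, assuming each log has already been reduced to a dictionary with a `path_list` entry (the dictionary layout and key names are hypothetical):

```python
def merge_source_into_path_list(row):
    """Return a copy of `row` whose path list starts with the source process."""
    # Drop the separate source-process field...
    merged = {key: value for key, value in row.items() if key != "source_process"}
    # ...and place its string at the head of the path list instead.
    merged["path_list"] = [row["source_process"]] + list(row["path_list"])
    return merged

# Made-up example row.
row = {
    "source_process": "c:/tools/a.exe",
    "target_process": "c:/win/b.exe",
    "path_list": ["c:/win/x.dll", "c:/win/y.dll"],
    "malicious_flag": 1,
}
second_row = merge_source_into_path_list(row)
```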
[0037] As an optional implementation, all data is preprocessed before extracting the absolute path from the called content to facilitate subsequent data processing. The preprocessing methods include:
[0038] (1) Manually normalize case inconsistencies that may exist in the data, such as converting "C://" and "c://" to the latter; delete data items with empty values.
[0039] (2) Use the JSON field names as the header row and the field values as the cell values, then convert the data to CSV format and store it. The header contains "source process", "target process", "call content" and "target process malicious identifier field".
[0040] (3) Convert the value of the “Target Process Malicious Identifier Field” to a numerical value. This field has two possible values: “Malicious” and “-”. The former is converted to 1 and the latter to 0.
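The three preprocessing steps can be illustrated as follows. This is a sketch under stated assumptions: the key names are hypothetical, and lowercasing whole strings is just one simple way to remove the "C://" versus "c://" inconsistency; the source describes the conversion as manual.

```python
import csv
import io

# Hypothetical key names for the four fields.
FIELDS = ["source_process", "target_process", "call_content", "malicious_flag"]
TEXT_FIELDS = FIELDS[:3]

def preprocess(records):
    """Normalize case, drop records with empty values, and map labels to 1/0."""
    cleaned = []
    for record in records:
        if any(record.get(name) in (None, "") for name in FIELDS):
            continue  # delete data items with empty values
        row = {name: record[name].lower() for name in TEXT_FIELDS}
        row["malicious_flag"] = 1 if record["malicious_flag"] == "Malicious" else 0
        cleaned.append(row)
    return cleaned

def to_csv(rows):
    """Write rows as CSV text with the field names as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up records for illustration.
records = [
    {"source_process": "C://a.exe", "target_process": "C://b.exe",
     "call_content": "C://x.dll+1a", "malicious_flag": "Malicious"},
    {"source_process": "c://c.exe", "target_process": "",
     "call_content": "c://y.dll+2b", "malicious_flag": "-"},
    {"source_process": "c://c.exe", "target_process": "c://d.exe",
     "call_content": "c://y.dll+2b", "malicious_flag": "-"},
]
cleaned = preprocess(records)
csv_text = to_csv(cleaned)
```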
[0041] S103. Merge the process access logs of processes with the same target process;
[0042] Since the final detection subject is the target process, access logs of the same target process need to be merged.
[0043] S104. Number the list items in the path list according to their contents to obtain a sample two-dimensional matrix;
[0043] After the access logs for the same target process are merged, each item in its path list is a string with no explicit relation to the others, which makes the path list difficult to use directly as a training sample. Therefore, this embodiment assigns a number to each list item based on its content to facilitate subsequent model training. At this point, the sample two-dimensional matrix contains three parts: the target process, the numbered path list, and the target process malicious identifier field.
[0045] S105. Input the target process in the sample two-dimensional matrix as the row index, the numbered path list and the malicious identification field of the target process as the column index into the neural network for training to obtain the malicious process detection model.
[0046] In the sample two-dimensional matrix, the target process is the detection subject, the numbered path list is the feature, and the target process malicious identifier field is the label. The sample two-dimensional matrix is input into the neural network for training to obtain the malicious process detection model. During model training, the sample two-dimensional matrix is randomly split into training and test sets at a ratio of 4:1, such that malicious and non-malicious samples are evenly represented in both sets.
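One way to obtain a random 4:1 split in which both classes appear proportionally in each set is a stratified split. The sketch below uses only the standard library; the function name, the fixed seed, and the per-class rounding are illustrative choices, not details given in the patent.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with each class split about 4:1."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then cut off its test share.
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up label column: 10 malicious (1) and 40 non-malicious (0) samples.
labels = [1] * 10 + [0] * 40
train_idx, test_idx = stratified_split(labels)
```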
[0047] As a preferred implementation, a BiLSTM neural network is used to train the data in the sample two-dimensional matrix.
[0048] The neural network structure is primarily built using the open-source Keras library. The construction process is as follows:
[0049] (1) Create a sequential network structure and add an embedding layer.
[0050] (2) Add a BiLSTM layer with 64 units. LSTM stands for long short-term memory network. It takes a sequence as input; the output of each unit after receiving an input is combined with the next input to form a new input, which provides short-term memory. At the same time, the input is processed to produce a state that is updated with each input, which provides long-term memory. A BiLSTM feeds the input sequence through in both the forward and reverse directions to obtain two final outputs, and concatenates them to form the output of the layer.
[0051] (3) Add a fully connected hidden layer and use ReLU as the activation function. The purpose is to initially compress the output dimension, which is beneficial for the linear partitioning of the final output layer.
[0052] (4) Add a fully connected output layer and use Sigmoid as the activation function to obtain the output value. The output value is a floating-point number, ranging from 0 to 1, where a value greater than 0.5 indicates a prediction result of 1, and a value less than 0.5 indicates a prediction result of 0. Here, they correspond to whether the prediction result of the target process is malicious or not.
[0053] (5) Compile the network, with binary cross-entropy as the loss function, Adam as the optimizer, and accuracy as the evaluation metric.
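The five construction steps above might look roughly like this in Keras. It is a sketch, not the patented implementation: the hidden layer width (`hidden_units=32`), the vocabulary size, the vector length, and the random embedding matrix are placeholders, and passing the pre-trained matrix through a `Constant` initializer is one way to supply embedding weights that works across recent Keras versions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(embedding_matrix, hidden_units=32):
    """Embedding -> BiLSTM(64) -> Dense(ReLU) -> Dense(sigmoid), compiled."""
    vocab_size, vector_size = embedding_matrix.shape
    model = keras.Sequential([
        # (1) Embedding layer initialized from a word vector matrix.
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=vector_size,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        ),
        # (2) Bidirectional LSTM with 64 units per direction.
        layers.Bidirectional(layers.LSTM(64)),
        # (3) Fully connected hidden layer; its width is an assumed value.
        layers.Dense(hidden_units, activation="relu"),
        # (4) Single sigmoid output in the range 0 to 1.
        layers.Dense(1, activation="sigmoid"),
    ])
    # (5) Binary cross-entropy loss, Adam optimizer, accuracy metric.
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# Placeholder inputs: 50 dictionary entries, 16-dimensional vectors,
# and a batch of 4 numbered path lists of length 10.
embedding_matrix = np.random.rand(50, 16).astype("float32")
model = build_model(embedding_matrix)
batch = np.random.randint(0, 50, size=(4, 10))
probabilities = model.predict(batch, verbose=0)
predictions = (probabilities > 0.5).astype(int)  # 1 = predicted malicious
```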
[0054] S106. Input the process to be detected into the malicious process detection model to determine whether the process to be detected is a malicious process.
[0055] A malicious process detection model is used to detect whether a process under test is malicious. During the detection process, the malicious process detection model determines whether the process under test is malicious by identifying the source process and the absolute path in the process under test, so as to achieve a fast and accurate judgment of malicious processes.
[0056] As an optional implementation, Figure 2 is a flowchart of the sample two-dimensional matrix generation method provided in an embodiment of the present invention. As shown in Figure 2, numbering the list items in the path list according to their contents to obtain a sample two-dimensional matrix includes:
[0057] S1041. Merge the path lists corresponding to all target processes to obtain a two-dimensional list;
[0058] At this point, each target process corresponds to only one path list and a target process malicious identifier field. The value of the path list field is a list of strings of variable length, where the order of the strings contains important information. This is highly similar to natural language, so natural language processing can effectively extract features for classification.
[0059] To number the items in the path list, the path lists corresponding to all target processes are first merged into a two-dimensional list, which is similar to a text composed of multiple sentences.
[0060] S1042. Take the list items in the two-dimensional list as words, and use a word vector generation algorithm to generate a word vector dictionary and the corresponding sequence number of each word vector;
[0061] The above two-dimensional list is used as input, where each list item is treated as a word. The Word2Vec word vector generation algorithm from gensim is used to generate a word vector dictionary. After processing, every word in the input that appears more often than a set threshold is converted into a word vector of a set length. The architecture uses the CBOW model, which predicts a word from the words within a certain window around it. The parameters of each word vector are adjusted based on the prediction results, so that the final word vectors reflect the similarity between words. This process is also known as pre-training. In addition to the word vector dictionary, the algorithm also generates a corresponding index for each word vector. Based on this index, both the word and its corresponding word vector can be looked up.
[0062] S1043. Based on the word vector dictionary and the corresponding index of each word vector, convert the index of each list item to obtain a sample two-dimensional matrix.
[0063] At this point, each word corresponds to a word vector and a sequence number. In this embodiment, each list item is numbered according to the correspondence.
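The numbering itself is a dictionary lookup. In this sketch, skipping items that are missing from the dictionary (for example because they fell below the frequency threshold) and the optional `offset` parameter are assumptions; an offset of 1 is one possible way to keep 0 free as a padding value when the dictionary is 0-based.

```python
def number_path_lists(path_lists, word_to_index, offset=0):
    """Replace each list item with its sequence number, skipping unknown items."""
    return [
        [word_to_index[item] + offset for item in paths if item in word_to_index]
        for paths in path_lists
    ]

# Made-up dictionary and path lists.
word_to_index = {"c:/a.exe": 0, "c:/x.dll": 1, "c:/y.dll": 2}
path_lists = [["c:/a.exe", "c:/x.dll", "c:/rare.dll"], ["c:/y.dll"]]
numbered = number_path_lists(path_lists, word_to_index)
shifted = number_path_lists(path_lists, word_to_index, offset=1)
```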
[0064] As an optional implementation, Figure 3 is a flowchart of the method for unifying the length of the path list provided in an embodiment of the present invention. As shown in Figure 3, after converting the index of each list item according to the word vector dictionary and the index corresponding to each word vector, the method further includes:
[0065] S107. When the length of the target path list is greater than the preset length, delete the redundant list items;
[0066] S108. When the length of the target path list is less than the preset length, the missing list items are padded with 0.
[0067] A preset length is set for the path list, and all path lists are brought to this length to facilitate the construction of the neural network.
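Steps S107 and S108 amount to truncation and zero padding. The sketch below assumes that surplus items are dropped from the end and that zeros are appended at the end; the source does not say which end is used.

```python
def unify_length(numbered_paths, preset_length, pad_value=0):
    """Truncate or zero-pad one numbered path list to exactly preset_length."""
    trimmed = list(numbered_paths[:preset_length])  # S107: drop surplus items
    padding = [pad_value] * (preset_length - len(trimmed))  # S108: pad with 0
    return trimmed + padding

long_example = unify_length([5, 3, 8, 2, 9], preset_length=3)
short_example = unify_length([5, 3], preset_length=4)
```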
[0068] When building the neural network, an embedding layer parameter matrix is generated based on the word vector dictionary. The number of rows in the matrix corresponds to the number of dictionary entries, and the number of columns corresponds to the set length of the word vectors. Each word vector from the dictionary is used as a row in the embedding layer parameter matrix. This matrix can be passed as a parameter when creating the sequential network structure. Specifically, the generated embedding layer parameter matrix is used as the weights parameter, allowing the pre-training results to be applied to the final model's training process. The input to the embedding layer is a two-dimensional tensor of shape (batch_size, input sequence length), and the output is a three-dimensional tensor of shape (batch_size, input sequence length, word vector length). Here, batch_size is a parameter that sets the number of data items input at one time during training, and the input sequence length is the preset length of the path list. The preset length can be determined according to specific circumstances. In this embodiment, it is preferred to arrange all path lists in descending order of length and select the length of the path list at the 1/20 position as the preset length, so as to include as many list items as possible. Preferably, the value of batch_size is 32.
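The rule for choosing the preset length can be sketched as follows. Reading "the 1/20 position" as the zero-based index `len(lengths) // 20` in the descending order is an interpretation, not something the source spells out.

```python
def choose_preset_length(path_lists, divisor=20):
    """Pick the list length found at the 1/divisor position, longest first."""
    if not path_lists:
        raise ValueError("path_lists must not be empty")
    lengths = sorted((len(paths) for paths in path_lists), reverse=True)
    position = len(lengths) // divisor  # assumed zero-based position
    return lengths[position]

# Made-up data: 40 path lists with lengths 1 through 40.
path_lists = [[0] * n for n in range(1, 41)]
preset_length = choose_preset_length(path_lists)
```

With this choice, only the longest few percent of path lists are truncated, while the rest are padded.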
[0069] As an optional implementation, Figure 4 is a flowchart of the path list generation method provided in an embodiment of the present invention. As shown in Figure 4, extracting the absolute path from the call content and storing it as a path list includes:
[0070] S1021. Using short string concatenation characters as delimiters, split the called content into multiple short strings;
[0071] The call content field is a string with a fixed structure: multiple short strings are concatenated using the "|" character, which is the short string concatenation character. Here, "|" is used as the delimiter to split the string, resulting in a list whose elements are the short strings.
[0072] As an optional implementation, to reduce the amount of data to be processed, before the short string splitting is performed the call content is read into pandas.DataFrame format and deduplicated to obtain new DataFrame data without duplicates.
[0073] S1022. Using the absolute path concatenation character as the delimiter, split the absolute path in the short string and the address to be written to the target process.
[0074] Each short string has a fixed structure: the absolute path of the called file and the address written to the target process are concatenated using the "+" character, which is the absolute path concatenation character. Each item in the list above is further split using "+" as the delimiter, and only the first part, the absolute path of the called file, is retained.
[0075] As an optional implementation, deduplication is performed after this step to remove duplicate combinations of absolute paths in the data.
[0076] S1023. Delete the address written to the target process, and convert all absolute paths into a path list, using each absolute path as a list item.
[0077] Ultimately, the value of the "Call Content" field for each piece of data becomes a list consisting of the absolute paths of the called files. The addresses written to the target process are discarded because these addresses are hexadecimal values allocated at a certain moment and have no analytical significance.
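Steps S1021 to S1023 can be sketched with two string splits. The sample call content below is made up for illustration, and skipping empty short strings is an added safeguard rather than something the source describes.

```python
def extract_path_list(call_content, item_sep="|", path_sep="+"):
    """Split call content into short strings and keep only the absolute paths."""
    paths = []
    for short_string in call_content.split(item_sep):  # S1021
        if not short_string:
            continue  # ignore empty fragments (assumed safeguard)
        # S1022/S1023: keep the path, discard the address after "+".
        absolute_path = short_string.split(path_sep, 1)[0]
        paths.append(absolute_path)
    return paths

# Made-up call content in the "path+address|path+address" layout.
call_content = "c:/windows/system32/ntdll.dll+9d4c4|c:/windows/system32/kernelbase.dll+2c13e"
path_list = extract_path_list(call_content)
```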
[0078] As an optional implementation, Figure 5 is a flowchart of the process access log merging method provided in an embodiment of the present invention. As shown in Figure 5, merging the process access logs of the same target process includes:
[0079] S1031. Concatenate end to end the path lists of the same target process to obtain the path list corresponding to the target process.
[0080] S1032. Take the target process malicious identifier field of the first process access log of the same target process and use it as the target process malicious identifier field corresponding to that target process.
[0081] During the data merging phase, the path lists of the same target process are concatenated end to end, transforming multiple path lists into a single new path list. Furthermore, since the target process malicious identifier field is determined by the target process, the target process malicious identifier fields corresponding to the same target process are identical. Therefore, it is sufficient to select any one of them as the target process malicious identifier field for that target process. In this embodiment, the first one is selected.
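The merge in S1031 and S1032 can be sketched as a single pass over the rows. The tuple layout `(target_process, path_list, malicious_flag)` is a hypothetical representation of one row of the second two-dimensional matrix.

```python
def merge_by_target(rows):
    """Concatenate path lists per target process and keep the first label seen."""
    merged = {}
    for target, path_list, flag in rows:
        if target not in merged:
            # S1032: the label of the first log is used for the target process.
            merged[target] = {"path_list": [], "malicious_flag": flag}
        # S1031: join the path lists end to end in log order.
        merged[target]["path_list"].extend(path_list)
    return merged

# Made-up rows for illustration.
rows = [
    ("c:/win/b.exe", ["c:/a.exe", "c:/x.dll"], 1),
    ("c:/win/d.exe", ["c:/c.exe", "c:/y.dll"], 0),
    ("c:/win/b.exe", ["c:/e.exe", "c:/z.dll"], 1),
]
merged = merge_by_target(rows)
```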
[0082] This invention also provides a malicious process detection device. Figure 6 is a schematic diagram of the malicious process detection device provided in an embodiment of the present invention. As shown in Figure 6, the device includes:
[0083] The data acquisition module 100 is used to acquire multiple process access logs, extract the source process, target process, call content, and target process malicious identification field from the process access logs, and store the source process, target process, call content, and target process malicious identification field in the form of a two-dimensional matrix to obtain the first two-dimensional matrix;
[0084] Process access logs are generated when a process accesses a target resource during runtime. Sysmon can be used to generate these logs. Each log records the source process starting the target process, and the log fields include "Source Process," "Target Process," "Call Content," and "Generation Time." A large number of these logs are generated manually, and each target process is labeled as malicious or not, which produces the malicious identifier field. nxlog is used to collect the logs, and only the "Source Process," "Target Process," "Call Content," and "Malicious Identifier Field" fields are retained. The data is then converted to JSON format and stored as a two-dimensional matrix to simplify subsequent data processing.
[0085] The second two-dimensional matrix generation module 200 is used to extract the absolute path in the call content, store the absolute path as a path list, delete the call content, and merge the corresponding source process into the path list to obtain the second two-dimensional matrix.
[0086] The call content contains multiple absolute paths. These absolute paths are extracted and stored as a path list. At the same time, the call content field is removed, leaving only the path list derived from it. For each process access log, the string in the "source process" field is added as a new list item at the head of the path list to form a new list. The "source process" field is then removed, resulting in the second two-dimensional matrix.
[0087] As an optional implementation, all data is preprocessed before extracting the absolute path from the call content to facilitate subsequent data processing. The preprocessing methods include:
[0088] (1) Manually normalize case inconsistencies that may exist in the data, such as converting "C://" and "c://" to the latter; delete data items with empty values.
[0089] (2) Use the JSON field names as the header row and the field values as the cell values, then convert the data to CSV format and store it. The header contains "source process", "target process", "call content" and "target process malicious identifier field".
[0090] (3) Convert the value of the “Target Process Malicious Identifier Field” to a numerical value. This field has two possible values: “Malicious” and “-”. The former is converted to 1 and the latter to 0.
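The three preprocessing steps above might look roughly like the following sketch. It makes several assumptions for brevity: the key names are hypothetical, case normalization is done by lowercasing each whole string, and the label conversion is applied before the CSV text is written rather than after.

```python
import csv
import io

HEADER = ["source_process", "target_process", "call_content", "label"]

def preprocess(records):
    """Drop records with empty values, lowercase strings, map the label to 0/1."""
    cleaned = []
    for record in records:
        values = [record.get(key) for key in
                  ("source_process", "target_process", "call_content", "malicious_flag")]
        if any(v is None or str(v).strip() == "" for v in values):
            continue  # delete data items with empty values
        source, target, content, flag = values
        cleaned.append({
            "source_process": source.lower(),
            "target_process": target.lower(),
            "call_content": content.lower(),
            "label": 1 if flag == "Malicious" else 0,  # "-" becomes 0
        })
    return cleaned

def to_csv(rows):
    """Serialize the rows as CSV text with the field names as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADER)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample records, for illustration only.
records = [
    {"source_process": "C://A.exe", "target_process": "C://B.exe",
     "call_content": "C://X.dll+10", "malicious_flag": "Malicious"},
    {"source_process": "c://a.exe", "target_process": "c://c.exe",
     "call_content": "c://y.dll+20", "malicious_flag": "-"},
    {"source_process": "c://a.exe", "target_process": "",
     "call_content": "c://z.dll+30", "malicious_flag": "-"},
]

rows = preprocess(records)
csv_text = to_csv(rows)
```

The third sample record is discarded because its target process is empty, so only two rows reach the CSV output.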
[0091] The data merging module 300 is used to merge the process access logs of the same target process;
[0092] Since the final detection subject is the target process, access logs of the same target process need to be merged.
[0093] The sample two-dimensional matrix generation module 400 is used to number the list items in the path list according to the content of the list items to obtain the sample two-dimensional matrix;
[0094] After the access logs for the same target process are merged, each item in its path list is a plain string, and the items have no explicit connection to one another, which makes the path list difficult to use directly as a training sample set. This embodiment therefore assigns a number to each list item based on its content to facilitate subsequent model training. At this point, the sample two-dimensional matrix contains three parts: the target process, the numbered path list, and the target process malicious identifier field.
[0095] The malicious process detection model generation module 500 is used to take the target process in the sample two-dimensional matrix as the row index and the numbered path list and the target process malicious identifier field as the column indexes, and to input the sample two-dimensional matrix into the neural network for training, so as to obtain the malicious process detection model.
[0096] In the sample two-dimensional matrix, the target process is the detection subject, the numbered path list is the feature item, and the target process malicious identifier field is the label. The sample two-dimensional matrix is input into the neural network for training to obtain the malicious process detection model. During model training, the sample two-dimensional matrix is randomly split into a training set and a test set at a ratio of 4:1, such that malicious and non-malicious samples are evenly represented in both sets.
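One way to obtain a random 4:1 split in which both classes are evenly represented is to split each class separately, as in the sketch below. Stratifying by label is an interpretation of the text rather than something it spells out, and the function name, seed, and sample data are hypothetical.

```python
import random

def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Randomly split samples 4:1 while keeping the class ratio in both sets."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    # Shuffle again so that the classes are interleaved within each set.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

# Made-up data: 10 malicious and 40 non-malicious numbered path lists.
samples = [[i, i + 1] for i in range(50)]
labels = [1] * 10 + [0] * 40

train_set, test_set = stratified_split(samples, labels)
```

With these inputs the training set receives 40 samples and the test set 10, and one fifth of each set is malicious.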
[0097] As a preferred implementation, a BiLSTM neural network is used to train the data in the sample two-dimensional matrix.
[0098] The neural network structure is primarily built using the open-source Keras library. The construction process is as follows:
[0099] (1) Create a sequential network structure and add an embedding layer.
[0100] (2) Add a BiLSTM layer with 64 units. LSTM stands for Long Short-Term Memory. An LSTM takes a sequence as input. The output of each unit after receiving an input is combined with the next input to form the new input, which provides short-term memory. At the same time, each input is transformed through a series of operations and used to update a state that is carried along the sequence, which provides long-term memory. A BiLSTM feeds the input sequence through the network in both the forward and reverse directions, obtains two final outputs, and concatenates them to form the output of the layer.
[0101] (3) Add a fully connected hidden layer with ReLU as the activation function. Its purpose is to perform an initial reduction of the output dimension, which makes it easier for the final output layer to separate the classes linearly.
[0102] (4) Add a fully connected output layer with Sigmoid as the activation function to obtain the output value. The output value is a floating-point number between 0 and 1: a value greater than 0.5 indicates a prediction result of 1, and a value less than 0.5 indicates a prediction result of 0. These results correspond to the target process being predicted as malicious or not malicious, respectively.
[0103] (5) Compile the network, using binary cross-entropy as the loss function, Adam as the optimizer, and accuracy as the evaluation metric.
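Construction steps (1) to (5) can be sketched in Keras roughly as follows. Several details here are assumptions rather than part of the description: the hidden layer size of 32, the random placeholder embedding matrix (standing in for pre-trained word vectors), the use of a `Constant` initializer as one way to load that matrix into the embedding layer, and the sample batch.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_detector(embedding_matrix, hidden_units=32):
    """Embedding -> BiLSTM(64) -> Dense(ReLU) -> Dense(Sigmoid), then compile."""
    vocab_size, vector_length = embedding_matrix.shape
    model = keras.Sequential([
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=vector_length,
            # Assumed way of starting from given word vectors.
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        ),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(hidden_units, activation="relu"),  # hidden size is assumed
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

# Placeholder data: 50 dictionary entries with word vectors of length 16.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(50, 16)).astype("float32")
model = build_detector(embedding_matrix)

# A batch of 4 numbered path lists, each of preset length 10.
batch = rng.integers(0, 50, size=(4, 10))
scores = model.predict(batch, verbose=0)
predictions = (scores > 0.5).astype(int)  # 1 = malicious, 0 = not malicious
```

The model here is untrained, so the scores are meaningless; the example only shows the layer stack, the compile settings, and the 0.5 threshold applied to the Sigmoid output.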
[0104] The malicious process judgment module 600 is used to input the process to be detected into the malicious process detection model and determine whether the process to be detected is a malicious process.
[0105] A malicious process detection model is used to detect whether a process under test is malicious. During the detection process, the malicious process detection model determines whether the process under test is malicious by identifying the source process and the absolute path in the process under test, so as to achieve a fast and accurate judgment of malicious processes.
[0106] As an optional implementation, Figure 7 is a schematic diagram of the sample two-dimensional matrix generation module provided in an embodiment of the present invention. As shown in Figure 7, the sample two-dimensional matrix generation module 400 includes:
[0107] The two-dimensional list generation submodule 4001 is used to merge the path lists corresponding to all target processes to obtain a two-dimensional list.
[0108] At this point, each target process corresponds to only one path list and a target process malicious identifier field. The value of the path list field is a list of strings of variable length, where the order of the strings contains important information. This is highly similar to natural language, so natural language processing can effectively extract features for classification.
[0109] To number the items in the path list, the path lists corresponding to all target processes are first merged into a two-dimensional list, which is similar to a text composed of multiple sentences.
[0110] The word vector generation submodule 4002 is used to take the list items in the two-dimensional list as words and use the word vector generation algorithm to generate a word vector dictionary and the corresponding index of each word vector.
[0111] The above two-dimensional list is used as input, with each list item treated as a word. The Word2Vec word vector generation algorithm from gensim is used to generate a word vector dictionary. After processing, every word whose number of occurrences in the input exceeds a set threshold is converted into a word vector of a set length. The architecture uses the CBOW model, which predicts a word from the words within a certain window around it. The parameters of each word vector are adjusted based on the prediction results, so that the final word vectors reflect the similarity between words. This process is also known as pre-training. In addition to the word vector dictionary, the algorithm generates a corresponding index for each word vector. Based on this index, both the word and its word vector can be retrieved.
[0112] The numbering submodule 4003 is used to convert the list item into an index based on the word vector dictionary and the index corresponding to each word vector, so as to obtain a sample two-dimensional matrix.
[0113] At this point, each word corresponds to a word vector and an index. In this embodiment, each list item is replaced by the index of its word according to this correspondence.
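The numbering step can be illustrated without the word vectors themselves. The sketch below builds only the word-to-index mapping using the standard library; it is a simplified stand-in for the index produced alongside the Word2Vec dictionary, not the gensim API. Reserving index 0 for padding, ordering words by frequency, using an inclusive `min_count` threshold, and dropping words below it are all assumptions.

```python
from collections import Counter

def build_index(two_dim_list, min_count=1):
    """Assign an index to every list item that occurs at least min_count times."""
    counts = Counter(item for path_list in two_dim_list for item in path_list)
    # Most frequent first; ties are broken alphabetically for determinism.
    vocabulary = sorted((w for w, c in counts.items() if c >= min_count),
                        key=lambda w: (-counts[w], w))
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: i for i, word in enumerate(vocabulary, start=1)}

def encode(path_list, index):
    """Replace each list item by its index; items without an index are dropped."""
    return [index[item] for item in path_list if item in index]

# Made-up two-dimensional list: one path list per target process.
two_dim_list = [
    ["a.exe", "x.dll", "y.dll"],
    ["b.exe", "x.dll"],
]

index = build_index(two_dim_list)
numbered = [encode(path_list, index) for path_list in two_dim_list]
```

In this example `x.dll` occurs twice and therefore receives index 1, and the remaining items are numbered alphabetically after it.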
[0114] As an optional implementation, Figure 8 is a schematic diagram of the structure of the deletion module and the supplementary module provided in an embodiment of the present invention. As shown in Figure 8, the device also includes:
[0115] The deletion module 700 is used to delete redundant list items when the length of the target path list is greater than the preset length.
[0116] The supplementary module 800 is used to pad the list with 0 when the length of the target path list is less than the preset length.
[0117] A preset length is set for the path list so that all samples have the same length, which simplifies construction of the neural network.
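The deletion and padding behaviour can be sketched in a few lines. The text does not say which end of the list is trimmed or padded, so keeping the head of the list and padding at the tail is an assumption, as is the function name.

```python
def fit_to_length(indices, preset_length):
    """Truncate or zero-pad a numbered path list to exactly preset_length items."""
    trimmed = indices[:preset_length]          # delete redundant list items
    padding = [0] * (preset_length - len(trimmed))
    return trimmed + padding                   # pad missing items with 0

short_list = fit_to_length([5, 3], 4)
long_list = fit_to_length([5, 3, 8, 1, 9, 2], 4)
```

Here `short_list` becomes `[5, 3, 0, 0]` and `long_list` becomes `[5, 3, 8, 1]`, so both have the preset length of 4.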
[0118] When building the neural network, an embedding layer parameter matrix is generated from the word vector dictionary. The number of rows in the matrix equals the number of dictionary entries, and the number of columns equals the set length of the word vectors. Each word vector in the dictionary forms one row of the matrix. This matrix is passed as a parameter when the sequential network structure is created: specifically, it is supplied as the weights parameter of the embedding layer, so that the pre-training results carry over into the training of the final model. The input to the embedding layer is a two-dimensional tensor of shape (batch_size, input sequence length), and the output is a three-dimensional tensor of shape (batch_size, input sequence length, word vector length). Here, batch_size is the number of data items input at one time during training, and the input sequence length is the preset length of the path list. The preset length can be determined according to specific circumstances. In this embodiment, all path lists are preferably arranged in descending order of length, and the length of the path list at the 1/20 position is selected as the preset length so as to include as many list items as possible. Preferably, the value of batch_size is 32.
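The choice of preset length and the shapes around the embedding layer can be illustrated as follows. This is a sketch under assumptions: the 1/20 position is computed with integer division, the embedding lookup is written with plain lists rather than a tensor library, and all names and sample data are hypothetical.

```python
def choose_preset_length(path_lists):
    """Sort lengths in descending order and take the one at the 1/20 position."""
    lengths = sorted((len(p) for p in path_lists), reverse=True)
    position = len(lengths) // 20
    return lengths[position]

def embed(batch, embedding_matrix):
    """Look up one row of the embedding matrix for every index in the batch."""
    return [[embedding_matrix[i] for i in sequence] for sequence in batch]

# 100 made-up path lists with lengths 1 to 100.
path_lists = [[0] * n for n in range(1, 101)]
preset_length = choose_preset_length(path_lists)

# A 5-entry dictionary with word vectors of length 4, and a batch of 2 sequences.
embedding_matrix = [[float(row)] * 4 for row in range(5)]
batch = [[1, 2, 0], [4, 3, 3]]
embedded = embed(batch, embedding_matrix)
```

For the 100 sample lists the preset length comes out as 95, so only the five longest lists would be truncated. The embedded batch has shape (2, 3, 4), that is, batch size by sequence length by word vector length.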
[0119] As an optional implementation, Figure 9 is a schematic diagram of the structure of the second two-dimensional matrix generation module provided in an embodiment of the present invention. As shown in Figure 9, the second two-dimensional matrix generation module 200 includes:
[0120] The call content string splitting submodule 2001 is used to split the call content into multiple short strings, using the short string concatenation character as the delimiter;
[0121] The call content field is a string with a fixed structure: multiple short strings are concatenated using the "|" character, which is the short string concatenation character. Splitting on "|" therefore yields a list whose elements are the individual short strings.
[0122] As an optional implementation, to reduce the amount of data to be processed, the call content is read into a pandas DataFrame and deduplicated before short string splitting is performed, yielding a DataFrame without duplicate entries.
[0123] The short string splitting submodule 2002 is used to separate the absolute path from the address written to the target process within each short string, using the absolute path concatenation character as the delimiter.
[0124] Each short string also has a fixed structure: the absolute path of the called file and the address written to the target process are concatenated using the "+" character, which is the absolute path concatenation character. Each item in the list above is further split using "+" as the delimiter, and only the first item, the absolute path of the called file, is retained.
[0125] As an optional implementation, deduplication is performed after this step to remove duplicate combinations of absolute paths in the data.
[0126] The list conversion submodule 2003 is used to remove addresses written to the target process and convert all absolute paths into a list of paths, with each absolute path as a list item.
[0127] Ultimately, the value of the "Call Content" field for each record becomes a list consisting of the absolute paths of the called files. The addresses written to the target process are discarded because they are hexadecimal values allocated at a particular moment and have no analytical significance.
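The two-stage splitting performed by submodules 2001 to 2003 can be sketched as below. The function name and the sample call content string are made up for illustration; the delimiters "|" and "+" are those named in the text.

```python
def extract_paths(call_content):
    """Split on '|' into short strings, then keep the part before '+' of each."""
    paths = []
    for short_string in call_content.split("|"):
        if not short_string:
            continue  # skip empty fragments, e.g. from an empty call content
        # The first item is the absolute path; the address after '+' is discarded.
        absolute_path = short_string.split("+")[0]
        paths.append(absolute_path)
    return paths

# Made-up call content string, for illustration only.
call_content = ("c:/windows/system32/ntdll.dll+9d4c4|"
                "c:/windows/system32/kernelbase.dll+2c13e")
path_list = extract_paths(call_content)
```

The resulting `path_list` contains the two DLL paths in their original order, with the hexadecimal addresses removed.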
[0128] As an optional implementation, Figure 10 is a schematic diagram of the data merging module provided in an embodiment of the present invention. As shown in Figure 10, the data merging module 300 includes:
[0129] The data connection submodule 3001 is used to concatenate the path lists of the same target process end to end to obtain the path list corresponding to the target process.
[0130] The target process malicious identifier field merging submodule 3002 is used to extract the target process malicious identifier field from the first process access log of the same target process and use it as the target process malicious identifier field corresponding to that target process.
[0131] During the data merging phase, the path lists of the same target process are concatenated end to end, turning multiple path lists into a single new path list. Furthermore, since the target process malicious identifier field is determined by the target process, all target process malicious identifier fields belonging to the same target process are identical. It is therefore sufficient to take any one of them as the target process malicious identifier field for that target process. In this embodiment, the first one is selected.
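The merging described above might be sketched as follows, assuming each row is a dictionary with hypothetical keys `target_process`, `path_list`, and `label`.

```python
def merge_by_target(rows):
    """Concatenate path lists per target process and keep the first label seen."""
    merged = {}
    for row in rows:
        target = row["target_process"]
        if target not in merged:
            # The label of the first log for this target is used for the target.
            merged[target] = {"path_list": [], "label": row["label"]}
        # Path lists of the same target are joined end to end.
        merged[target]["path_list"].extend(row["path_list"])
    return merged

# Made-up rows, for illustration only.
rows = [
    {"target_process": "b.exe", "path_list": ["a.exe", "x.dll"], "label": 1},
    {"target_process": "c.exe", "path_list": ["a.exe", "y.dll"], "label": 0},
    {"target_process": "b.exe", "path_list": ["d.exe", "z.dll"], "label": 1},
]

merged = merge_by_target(rows)
```

After merging, `b.exe` has a single path list of four items and each target process appears exactly once.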
[0132] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the aforementioned malicious process detection method.
[0133] The aforementioned storage medium stores the aforementioned software, and the storage medium includes, but is not limited to, optical discs, floppy disks, hard disks, and rewritable memory.
[0134] The above technical solution has the following beneficial effects. This invention establishes a sample set characterized by the source process and the absolute paths, and uses this sample set as input to train a malicious process detection model, thereby linking the detection of malicious processes to the source process and the absolute paths. Because the source process and absolute paths associated with a malicious process differ from those of a normal process, associating them with their text content gives them discriminative meaning for judging whether a process is malicious, which improves the accuracy of malicious process detection.
[0135] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for detecting malicious processes, characterized in that it comprises: obtaining multiple process access logs, extracting the source process, target process, call content, and target process malicious identifier field from the process access logs, and storing the source process, target process, call content, and target process malicious identifier field in the form of a two-dimensional matrix to obtain a first two-dimensional matrix; extracting the absolute paths from the call content, storing the absolute paths as a path list, deleting the call content, and merging the corresponding source process into the path list to obtain a second two-dimensional matrix; merging the process access logs that have the same target process; numbering the list items in the path list according to their contents to obtain a sample two-dimensional matrix; taking the target process in the sample two-dimensional matrix as the row index and the numbered path list and the target process malicious identifier field as the column indexes, and inputting the sample two-dimensional matrix into a neural network for training to obtain a malicious process detection model; and inputting the process to be detected into the malicious process detection model to determine whether the process to be detected is a malicious process.
2. The malicious process detection method according to claim 1, characterized in that the step of numbering the list items in the path list according to their contents to obtain a sample two-dimensional matrix includes: merging the path lists corresponding to all target processes to obtain a two-dimensional list; taking the list items in the two-dimensional list as words, and generating a word vector dictionary and the corresponding index of each word vector using a word vector generation algorithm; and converting each list item into an index based on the word vector dictionary and the corresponding index of each word vector to obtain the sample two-dimensional matrix.
3. The malicious process detection method according to claim 2, characterized in that, after each list item is converted into an index based on the word vector dictionary and the corresponding index of each word vector, the method further includes: deleting the redundant list items when the length of the target path list exceeds the preset length; and padding the missing list items with 0 when the length of the target path list is less than the preset length.
4. The malicious process detection method according to claim 1, characterized in that the step of extracting the absolute paths from the call content and storing the absolute paths as a path list includes: splitting the call content into multiple short strings using the short string concatenation character as the delimiter; separating the absolute path from the address written to the target process in each short string using the absolute path concatenation character as the delimiter; and deleting the addresses written to the target process and converting all absolute paths into a path list, with each absolute path as a list item.
5. The malicious process detection method according to claim 1, characterized in that the merging of process access logs that have the same target process includes: concatenating the path lists of the same target process end to end to obtain the path list corresponding to that target process; and taking the target process malicious identifier field of the first process access log of the same target process as the target process malicious identifier field corresponding to that target process.
6. A malicious process detection device, characterized in that it comprises: a data acquisition module, used to acquire multiple process access logs, extract the source process, target process, call content, and target process malicious identifier field from the process access logs, and store the source process, target process, call content, and target process malicious identifier field in the form of a two-dimensional matrix to obtain a first two-dimensional matrix; a second two-dimensional matrix generation module, used to extract the absolute paths from the call content, store the absolute paths as a path list, delete the call content, and merge the corresponding source process into the path list to obtain a second two-dimensional matrix; a data merging module, used to merge the process access logs of the same target process; a sample two-dimensional matrix generation module, used to number the list items in the path list according to their contents to obtain a sample two-dimensional matrix; a malicious process detection model generation module, used to take the target process in the sample two-dimensional matrix as the row index and the numbered path list and the target process malicious identifier field as the column indexes, and to input the sample two-dimensional matrix into a neural network for training to obtain a malicious process detection model; and a malicious process detection module, used to input the process to be detected into the malicious process detection model and determine whether the process to be detected is a malicious process.
7. The malicious process detection device according to claim 6, characterized in that the sample two-dimensional matrix generation module includes: a two-dimensional list generation submodule, used to merge the path lists corresponding to all target processes to obtain a two-dimensional list; a word vector generation submodule, used to take the list items in the two-dimensional list as words and generate a word vector dictionary and the corresponding index of each word vector using a word vector generation algorithm; and a numbering submodule, used to convert each list item into an index based on the word vector dictionary and the corresponding index of each word vector to obtain the sample two-dimensional matrix.
8. The malicious process detection device according to claim 6, characterized in that the device further includes: a deletion module, used to delete redundant list items when the length of the target path list exceeds the preset length; and a supplementary module, used to pad the list with 0s when the length of the target path list is less than the preset length.
9. The malicious process detection device according to claim 6, characterized in that the second two-dimensional matrix generation module includes: a call content string splitting submodule, used to split the call content into multiple short strings using the short string concatenation character as the delimiter; a short string splitting submodule, used to separate the absolute path from the address written to the target process in each short string using the absolute path concatenation character as the delimiter; and a list conversion submodule, used to delete the addresses written to the target process and convert all absolute paths into a path list, with each absolute path as a list item.
10. The malicious process detection device according to claim 6, characterized in that the data merging module includes: a data connection submodule, used to concatenate the path lists of the same target process end to end to obtain the path list corresponding to that target process; and a target process malicious identifier field merging submodule, used to extract the target process malicious identifier field from the first process access log of the same target process and use it as the target process malicious identifier field corresponding to that target process.
11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the malicious process detection method according to any one of claims 1-5.