A high-precision identification method and system for PowerShell obfuscated scripts
By performing abstract syntax tree parsing and iterative feature comparison on PowerShell scripts, combined with linear regression analysis, the problem of low detection accuracy in existing technologies is solved, achieving high-precision identification of obfuscated scripts.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2021-09-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing PowerShell script obfuscation detection technologies rely on character-level features, resulting in low detection accuracy and susceptibility to being bypassed by meaningless script statements, making them ineffective in identifying highly obfuscated malicious scripts.
By performing abstract syntax tree parsing on PowerShell scripts and extracting structured information, combined with script iteration features and differential comparisons, a linear regression analysis model is trained to improve detection accuracy and resist obfuscation methods involving irrelevant script statements.
It improves the detection accuracy and robustness of PowerShell obfuscated scripts, effectively identifies obfuscated scripts inserted with irrelevant statements, reduces the burden of manual review, and improves the accuracy of the detection model.
Figure CN115840568B_ABST
Description
Technical Field
[0001] This invention relates to the field of network security technology, and in particular to a high-precision identification method and system for PowerShell obfuscated scripts.
Background Technology
[0002] PowerShell obfuscation detection tools are crucial for detecting suspicious PowerShell scripts. They provide early warning of network attacks based on PowerShell scripts, particularly by identifying the highly obfuscated malicious scripts used in advanced persistent penetration testing scenarios. With automation tools in widespread use, obfuscating part of a PowerShell script is easy, whereas manually detecting obfuscated samples is extremely labor-intensive and inefficient. A high-precision obfuscation detection model can significantly reduce the burden of manual review and better expose PowerShell scripts hidden by obfuscation techniques.
[0003] Current PowerShell script obfuscation detection techniques, aside from extensive manual collection and analysis, rely entirely on character-level PowerShell script features—statistics on all visible strings in the script, such as character frequency, number of occurrences, and character entropy. This coarse-grained feature extraction technique is prone to lower accuracy, and can be bypassed simply by adding meaningless script statements.
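The character-level statistics described above can be illustrated with a minimal Python sketch. The function name `char_features`, the chosen statistics, and both sample strings are assumptions made for illustration; they are not taken from any existing detection tool. The sketch also shows why such features are easy to dilute: padding an obfuscated line with harmless statements lowers the share of special characters.

```python
import math
from collections import Counter

def char_features(script: str) -> dict:
    """Character-level statistics over the whole script text (illustrative only)."""
    counts = Counter(script)
    total = len(script)
    # Shannon entropy in bits per character.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values()) if total else 0.0
    # Characters that are neither alphanumeric nor whitespace.
    special = sum(1 for ch in script if not ch.isalnum() and not ch.isspace())
    return {
        "length": total,
        "entropy": entropy,
        "special_ratio": special / total if total else 0.0,
    }

# Hypothetical obfuscated one-liner, then the same line padded with harmless statements.
obfuscated = "&('I'+'EX')(\"{1}{0}\"-f'llo','he')"
padding = "\n$total = 0\nforeach ($item in $list) { $total = $total + $item }\n" * 5
padded = obfuscated + padding

base = char_features(obfuscated)
diluted = char_features(padded)
print(base["special_ratio"], diluted["special_ratio"])
```

The padded variant still contains the obfuscated statement, yet its special-character ratio falls well below that of the bare one-liner, which is the bypass the paragraph refers to.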
[0004] An abstract syntax tree (AST) is a tree-structured representation of source code used in programming-language processing, typically for syntax analysis and compilation. Parsing a script into its AST reveals hierarchical information beyond individual characters and distinguishes strings, arrays, variables, and commands. Combining AST parsing with obfuscated-script detection improves the accuracy of feature acquisition and enriches the feature dimensions, leading to more accurate detection models. Currently, the most advanced PowerShell obfuscation detection engines parse the AST and compute character feature statistics separately for different AST nodes. This improves accuracy compared with analyzing only the character features of the whole script, but it still cannot resist obfuscation that inserts irrelevant script statements to defeat character-level analysis.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the existing technology and provide a high-precision identification method and system for PowerShell obfuscated scripts.
[0006] The objective of this invention can be achieved through the following technical solutions:
[0007] The first aspect of this invention provides a high-precision identification method for PowerShell obfuscated scripts, the method comprising the following steps:
[0008] Original script feature acquisition step:
[0009] The PowerShell script to be detected is obtained, and the original obfuscated PowerShell script is parsed using an abstract syntax tree (AST). Structured script information is extracted, and feature information of the original script is obtained and stored by collecting features from different AST nodes. The structured script information includes the types of AST nodes, the attributes of AST nodes, and the values of AST nodes. The feature information of the original script includes any one or more of the following: the overall character distribution of the script; the distribution of various AST nodes after AST parsing; the string length and character distribution of AST nodes corresponding to constant strings; and the array length and character distribution of AST nodes corresponding to arrays.
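As a rough illustration of the original-script features listed above, the following Python sketch computes a node-type distribution, constant-string lengths, and array lengths. The `Node` class, the node type names, and the hand-built tree are all hypothetical stand-ins; a real implementation would obtain the tree from a PowerShell parser rather than construct it by hand.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    kind: str                      # node type, e.g. "Command" or "StringConstant" (assumed names)
    value: Optional[str] = None    # node value, if any
    children: List["Node"] = field(default_factory=list)

def walk(node: Node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)

def original_features(root: Node) -> dict:
    nodes = list(walk(root))
    kinds = Counter(n.kind for n in nodes)
    strings = [n.value or "" for n in nodes if n.kind == "StringConstant"]
    arrays = [len(n.children) for n in nodes if n.kind == "ArrayLiteral"]
    return {
        "node_distribution": {k: c / len(nodes) for k, c in kinds.items()},
        "string_count": len(strings),
        "mean_string_length": sum(map(len, strings)) / len(strings) if strings else 0.0,
        "max_array_length": max(arrays, default=0),
    }

# Hand-built toy tree standing in for a parsed script.
tree = Node("ScriptBlock", children=[
    Node("Command", "iex", [
        Node("BinaryExpression", "+", [
            Node("StringConstant", "Write-"),
            Node("StringConstant", "Host"),
        ]),
    ]),
    Node("ArrayLiteral", children=[Node("Constant", "72"), Node("Constant", "105")]),
])
features = original_features(tree)
print(features)
```

Per-node character distributions could be added in the same way by applying a character counter to each constant string.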
[0010] Script iterative feature acquisition step:
[0011] The analysis engine performs semantic iterative parsing on the structured script information and records the script iteration methods as dynamic features of the script. The dynamic features of the script include any one or more of the following: the number of iterations for each syntax tree node, the number of iterations for each type of iteration method, and the proportion of each type of iteration method.
[0012] Script comparison feature acquisition step:
[0013] The script information after iterative analysis is compared with the structured information of the original script to obtain the differential features. At the same time, the character features of the script after iterative analysis are collected and stored.
[0014] Model training step:
[0015] All features recorded in the above steps are merged and stored, and used as training parameters for the detection model to obtain a high-precision recognition model for obfuscated PowerShell scripts after training.
[0016] Obfuscation identification step:
[0017] Features are extracted from the script to be detected by passing it, in sequence, through the original script feature acquisition step, the script iterative feature acquisition step, and the script comparison feature acquisition step. The extracted features are used as input parameters for the high-precision recognition model of PowerShell obfuscated scripts, which outputs the obfuscation detection result.
[0018] Furthermore, the specific content of the abstract syntax tree parsing of the original PowerShell obfuscated script is as follows:
[0019] By leveraging the characteristics of scripting languages, an abstract syntax tree is used to parse the original PowerShell obfuscated script to obtain its structured script information.
[0020] Furthermore, in each step, the processed feature information is stored in the form of a database or a text file.
[0021] Furthermore, the analysis engine employs one or more of the following methods for iterative semantic parsing of the script: string concatenation, string replacement, re-parsing of string format scripts, cleaning of unused variables, conversion of ASCII codes to characters, updating of syntax tree node attributes, restoration of command abbreviations, and script analysis hierarchy records.
[0022] Furthermore, the data to be compared in the differential comparison includes any one or more of the following: changes in character frequency, changes in the proportion of various syntax tree nodes, and changes in syntax tree structure.
[0023] Furthermore, the base model trained to obtain the high-precision recognition model for PowerShell obfuscated scripts is a linear regression analysis model.
[0024] In another aspect, the present invention provides a high-precision identification system for PowerShell obfuscated scripts, comprising:
[0025] The original script feature acquisition module performs abstract syntax tree parsing on the original PowerShell obfuscated script, extracts structured script information, and obtains and stores the feature information of the original script by feature acquisition of different abstract syntax tree nodes;
[0026] Script Iteration and Dynamic Feature Acquisition Module: Processes script information through the analysis engine, records script iteration methods, and stores them as dynamic features of the script;
[0027] Script comparison feature acquisition module: performs differential comparison between the script information after iterative analysis and the structured script information, and simultaneously collects the character features of the script after iterative analysis, and stores the differential features and the script features after iterative processing;
[0028] Model training module: The collected script features are used as input to train the model. The labeled scripts are used to train the model, generate model detection parameters and store them.
[0029] Script detection module: Extracts features from the input script using a feature extraction method, uses the features as input to the model, and obtains the detection results.
[0030] The high-precision identification method and system for PowerShell obfuscated scripts provided by this invention have at least the following advantages compared to existing technologies:
[0031] 1) This invention combines the structural information of text and abstract syntax tree, extracts dynamic features in PowerShell scripts through a self-built script structure information iteration engine, and extracts differential features by comparing script information before and after iteration as a supplement to character-level feature extraction, which can improve the accuracy and precision of obfuscation detection.
[0032] 2) This invention supports the detection of PowerShell obfuscated scripts that have been processed by inserting irrelevant statements. It can resist a certain degree of obfuscation, is robust and practical, and has broad application prospects.
Description of the Drawings
[0033] Figure 1 is a flowchart of the PowerShell script hierarchical feature extraction method of the present invention in the embodiment.
[0034] Figure 2 is a flowchart illustrating the training process of the high-precision detection model for PowerShell obfuscated scripts in the embodiment.
[0035] Figure 3 is a flowchart illustrating the judgment process of the high-precision detection model for PowerShell obfuscated scripts in the embodiment.
Detailed Implementation
[0036] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort should fall within the scope of protection of the present invention.
[0037] Example
[0038] This invention relates to a high-precision identification method for obfuscated PowerShell scripts. The method includes: a raw script feature acquisition step: parsing the raw script using an abstract syntax tree (AST) to extract structured script information; acquiring and storing feature information of the raw script by collecting features from different AST nodes; a script iteration feature acquisition step: processing the script information through an analysis engine; recording the script iteration method and storing it as dynamic script features; and a script comparison feature acquisition step: comparing the iteratively analyzed script information with the structured information of the raw script, simultaneously acquiring character features of the iteratively analyzed script, and storing the differential features and the iteratively processed script features.
[0039] Specifically, the process of parsing the original script into an abstract syntax tree mainly includes obtaining the data structure of the script abstract syntax tree by utilizing the characteristics of the scripting language.
[0040] Specifically, the structured script information mainly includes the abstract syntax tree node category, abstract syntax tree node attributes, and abstract syntax tree node values.
[0041] Specifically, the feature information of the original script mainly includes any one or more of the following: the overall character distribution of the script; the distribution of various syntax tree nodes after parsing; the string length and character distribution of the syntax tree node corresponding to the constant string; and the array length and character distribution of the syntax tree node corresponding to the array.
[0042] Specifically, the feature information is stored in the form of a database or a text file.
[0043] Specifically, in the process of processing script information through the analysis engine, the analysis engine uses one or more of the following methods to perform iterative semantic parsing of the script: string concatenation, string replacement, re-parsing of string format scripts, cleaning of unused variables, conversion of ASCII codes to characters, updating of syntax tree node attributes, restoration of command abbreviations, and recording of script analysis hierarchy.
[0044] Specifically, the script dynamic features include one or more of the following features: the number of iterations for each syntax tree node, the number of iterations for each type of iteration method, and the proportion of each type of iteration method.
[0045] Specifically, the data to be compared in the differential comparison includes one or more of the following: changes in character frequency, changes in the proportion of various syntax tree nodes, and changes in syntax tree structure.
[0046] This invention provides a PowerShell obfuscation script prediction model based on linear regression analysis, wherein the model is trained using features extracted using the method described above.
[0047] The preferred embodiments of the method of the present invention are further described below. As shown in Figures 1-3, the high-precision identification method for PowerShell obfuscated scripts of the present invention includes the following specific steps:
[0048] Step 1: Obtain the PowerShell script to be tested and process the original PowerShell script:
[0049] Abstract syntax tree parsing of the original PowerShell script yields the script's tree-like syntax information, i.e., structured script information. This information includes the types of abstract syntax tree nodes, the attributes of the abstract syntax tree nodes, and the values of the abstract syntax tree nodes. This information not only includes all the information in the original script but also provides more syntax-level information in conjunction with the syntax analysis engine.
[0050] Step 2: Store the extracted structured script information in XML format:
[0051] The syntax tree nodes in the script are converted into XML nodes. The XML nodes store information about the type, attributes, and values of the current syntax tree node. They can also support the insertion of more attributes, including information such as the number of iterations during the iterative processing.
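One possible way to store syntax tree nodes as XML nodes is sketched below in Python with the standard `xml.etree.ElementTree` module. The element name `node`, the attribute names `type`, `value`, and `iterationDepth`, and the helper `node_to_xml` are assumptions chosen for illustration; the source does not specify the XML layout.

```python
import xml.etree.ElementTree as ET

def node_to_xml(kind, value=None, attributes=None, children=(), depth=0):
    """Build one XML element for a syntax tree node (hypothetical schema)."""
    elem = ET.Element("node", {"type": kind, "iterationDepth": str(depth)})
    if value is not None:
        elem.set("value", value)
    # Extra attributes can be attached later, e.g. counters filled in during iteration.
    for name, attr_value in (attributes or {}).items():
        elem.set(name, str(attr_value))
    for child in children:
        elem.append(child)
    return elem

left = node_to_xml("StringConstant", "Write-", {"quote": "single"})
right = node_to_xml("StringConstant", "Host", {"quote": "single"})
expr = node_to_xml("BinaryExpression", "+", children=[left, right])

# Serialize to text and read it back to confirm the round trip.
xml_text = ET.tostring(expr, encoding="unicode")
restored = ET.fromstring(xml_text)
print(xml_text)
```

Keeping the iteration depth as an ordinary attribute means later processing steps can update it without changing the node layout.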
[0052] Step 3: Extract feature information from the original script:
[0053] The extracted feature information of the original script includes one or more of the following: the overall character distribution of the original script, the distribution of the various syntax tree nodes after syntax tree parsing, the string length and character distribution of the syntax tree nodes corresponding to constant strings, and the array length and character distribution of the syntax tree nodes corresponding to arrays. These features are recorded in a database or text file.
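Recording features in a database could look like the following Python sketch, which uses an in-memory SQLite database. The table name `features`, its columns, the stage labels, and the helper functions are assumptions for illustration only; a text file holding one JSON record per script would serve equally well.

```python
import json
import sqlite3

def save_features(conn, script_id: str, stage: str, features: dict) -> None:
    """Store one stage's features for a script as a JSON payload (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "script_id TEXT, stage TEXT, payload TEXT, PRIMARY KEY (script_id, stage))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO features VALUES (?, ?, ?)",
        (script_id, stage, json.dumps(features, sort_keys=True)),
    )
    conn.commit()

def load_features(conn, script_id: str) -> dict:
    """Merge all stored stages for one script into a single flat feature dictionary."""
    rows = conn.execute(
        "SELECT stage, payload FROM features WHERE script_id = ? ORDER BY stage",
        (script_id,),
    )
    merged = {}
    for stage, payload in rows:
        for name, value in json.loads(payload).items():
            merged[f"{stage}.{name}"] = value
    return merged

conn = sqlite3.connect(":memory:")
save_features(conn, "sample-001", "original", {"string_count": 4, "entropy": 4.2})
save_features(conn, "sample-001", "dynamic", {"string_concatenation": 2})
merged = load_features(conn, "sample-001")
print(merged)
```

Prefixing each feature with its stage name keeps features from different steps distinct when they are later merged for training.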
[0054] Step 4: Process script information through the analysis engine:
[0055] The corresponding processing method is selected based on the category of the relevant syntax tree node in the script information. This includes operations such as removing unused variables, splitting strings and generating new constant string nodes, replacing strings and generating new constant string nodes, reversing strings and generating new constant string nodes, converting from ASCII codes to characters, and completing common command abbreviations.
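The selection of a processing method by node category can be sketched in Python as follows. This is a toy under stated assumptions, not the engine described here: the `Node` class, the node type names (`BinaryExpression`, `CharCast`, `Command`), the rule names, and the two-entry alias table are hypothetical, and only three of the listed operations are shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    kind: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

# Hypothetical alias table; a real engine would consult PowerShell's own alias list.
ALIASES = {"iex": "Invoke-Expression", "gci": "Get-ChildItem"}

def simplify(node: Node, log: List[str]) -> Node:
    """One bottom-up pass; the name of each applied method is appended to log."""
    node.children = [simplify(child, log) for child in node.children]
    kids = node.children
    all_strings = bool(kids) and all(c.kind == "StringConstant" for c in kids)
    if node.kind == "BinaryExpression" and node.value == "+" and all_strings:
        log.append("string_concatenation")
        return Node("StringConstant", "".join(c.value or "" for c in kids))
    if node.kind == "CharCast" and node.value is not None and node.value.isdigit():
        log.append("ascii_to_char")
        return Node("StringConstant", chr(int(node.value)))
    if node.kind == "Command" and (node.value or "").lower() in ALIASES:
        log.append("alias_restoration")
        node.value = ALIASES[node.value.lower()]
    return node

# Toy tree for: iex ([char]87 + 'rite-' + 'Host')
tree = Node("Command", "iex", [
    Node("BinaryExpression", "+", [
        Node("BinaryExpression", "+", [Node("CharCast", "87"), Node("StringConstant", "rite-")]),
        Node("StringConstant", "Host"),
    ]),
])
log: List[str] = []
result = simplify(tree, log)
print(result.value, result.children[0].value, log)
```

Because children are simplified before their parent, nested concatenations collapse in a single pass here; an engine that re-parses strings as scripts would need to repeat the pass until no rule applies.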
[0056] Step 5: Record the script iteration method:
[0057] In Step 4, each time a matching processing method is applied, the method is recorded and counted as a feature, ultimately yielding the frequency and proportion of each processing method. During processing, each new node is also assigned an iteration depth attribute whose value is one plus the iteration depth of the old node. Finally, the proportion of nodes at each depth is recorded.
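The bookkeeping described in this step can be sketched as a small recorder class in Python. The class and method names are hypothetical. Two choices below are assumptions not stated in the source: when a new node replaces several old nodes, its depth is taken as one plus the largest old depth, and depth proportions are computed over newly created nodes only.

```python
from collections import Counter

class IterationRecorder:
    """Counts applied processing methods and the depths of the nodes they create."""

    def __init__(self):
        self.methods = Counter()
        self.depths = Counter()

    def record(self, method: str, old_depths) -> int:
        """Register one rewrite and return the iteration depth of the new node."""
        self.methods[method] += 1
        # Assumption: with several old nodes, use the deepest one.
        new_depth = 1 + max(old_depths, default=0)
        self.depths[new_depth] += 1
        return new_depth

    def features(self) -> dict:
        total_methods = sum(self.methods.values())
        total_nodes = sum(self.depths.values())
        return {
            "method_counts": dict(self.methods),
            "method_proportions": {m: c / total_methods for m, c in self.methods.items()},
            "depth_proportions": {d: c / total_nodes for d, c in self.depths.items()},
        }

recorder = IterationRecorder()
d1 = recorder.record("ascii_to_char", [0])               # char code replaced by a string
d2 = recorder.record("string_concatenation", [d1, 0])    # that string joined with a literal
d3 = recorder.record("string_concatenation", [d2, 0])    # joined again with another literal
dynamic = recorder.features()
print(dynamic)
```

Deeply nested obfuscation shows up as a large share of nodes at high depths, while a plain script produces few or no recorded rewrites.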
[0058] Step 6: Compare the script produced by the analysis engine with the original script:
[0059] The specific comparison includes analyzing the changes in the number of strings before and after script processing (the specific result is the ratio of the number of strings before and after iteration), the changes in the number and proportion of various commands before and after script processing (the specific result is the ratio of the number of various commands before and after processing and the ratio of the frequency before and after processing), and other analysis of the differences in the script before and after engine processing. These values are recorded in a database or text file as input features for subsequent model training.
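The before/after ratios described above can be computed as in the following Python sketch. The flattened `(kind, value)` representation, the feature names, and the convention that a zero denominator yields 0.0 are assumptions made for illustration.

```python
from collections import Counter

def ratio(before: float, after: float) -> float:
    # Assumption: a zero denominator yields 0.0 instead of raising an error.
    return before / after if after else 0.0

def differential_features(before_nodes, after_nodes) -> dict:
    """Each argument is a list of (kind, value) pairs taken from a flattened tree."""
    def string_count(nodes):
        return sum(1 for kind, _ in nodes if kind == "StringConstant")

    def command_counts(nodes):
        return Counter(value for kind, value in nodes if kind == "Command")

    feats = {"string_count_ratio": ratio(string_count(before_nodes), string_count(after_nodes))}
    before_cmds, after_cmds = command_counts(before_nodes), command_counts(after_nodes)
    before_total, after_total = sum(before_cmds.values()), sum(after_cmds.values())
    for name in sorted(set(before_cmds) | set(after_cmds)):
        feats[f"cmd_count_ratio:{name}"] = ratio(before_cmds[name], after_cmds[name])
        feats[f"cmd_freq_ratio:{name}"] = ratio(
            ratio(before_cmds[name], before_total),
            ratio(after_cmds[name], after_total),
        )
    return feats

# Toy data: four string fragments collapse into one, and a hidden command is revealed.
before = [("Command", "Invoke-Expression")] + [("StringConstant", s) for s in ("Wr", "ite", "-Ho", "st")]
after = [("Command", "Invoke-Expression"), ("Command", "Write-Host"), ("StringConstant", "hello")]
diff = differential_features(before, after)
print(diff)
```

A string count ratio well above 1 indicates that the engine merged many fragments, and a command that appears only after processing indicates that it was hidden in the original text.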
[0060] Step 7: Collect the character features of the script after iterative analysis. For the script information after iterative processing, the information extraction method in Step 3 will be used to extract and save the script information.
[0061] Step 8: Establishing the detection model:
[0062] The features recorded in Steps 1 through 7 are merged and stored as training inputs for the detection model. A large number of scripts are labeled according to whether they are obfuscated, and the scripts together with their labels are used as training input. Specifically, the feature acquisition operations described in Steps 1 through 7 are performed on each sample, and whether the sample is obfuscated is used as the expected output of the model. The product coefficient (weight) of each feature is adjusted repeatedly by linear regression analysis until the accuracy of the detection model on the training sample set reaches a stable peak. The coefficients at that peak are recorded, giving the mathematical mapping between each script feature and whether the script is obfuscated. Together, the coefficients of all features constitute the high-precision identification model for PowerShell obfuscated scripts, which can subsequently be used to determine whether a script has been obfuscated.
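A minimal Python sketch of fitting one coefficient per feature is given below. The source names linear regression analysis but not a fitting procedure; batch gradient descent on squared error is used here as an assumed stand-in. The two-feature training set, the learning rate, and the epoch count are invented for illustration.

```python
def train_linear_regression(X, y, lr=0.1, epochs=5000):
    """Fit weights and a bias by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        # Prediction error of the current coefficients on every sample.
        errors = [sum(w * x for w, x in zip(weights, row)) + bias - target
                  for row, target in zip(X, y)]
        for j in range(d):
            weights[j] -= lr * sum(e * row[j] for e, row in zip(errors, X)) / n
        bias -= lr * sum(errors) / n
    return weights, bias

def predict(weights, bias, row):
    return sum(w * x for w, x in zip(weights, row)) + bias

# Invented samples: two features scaled to [0, 1]; label 1 = obfuscated, 0 = not obfuscated.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_regression(X, y)
high = predict(weights, bias, [0.85, 0.85])
low = predict(weights, bias, [0.15, 0.15])
print(weights, bias, high, low)
```

The learned `weights` play the role of the per-feature product coefficients; scaling features to a common range beforehand keeps a single learning rate workable.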
[0063] Step 9: After training and obtaining the PowerShell obfuscated script recognition model, the script to be detected can be subjected to script feature extraction using the methods described in Steps 1 to 7. The extracted features are used as input parameters for the recognition model obtained in Step 8. The output result is the probability that the script has been obfuscated. The higher the value, the greater the probability that the script has been obfuscated.
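Applying the trained coefficients to a new script could be sketched as follows in Python. The feature names, coefficient values, and bias are hypothetical. Clamping the result to the range 0 to 1 is an assumption added so that the output can be read as a probability-like score, since a plain linear combination is not bounded.

```python
def score(features: dict, coefficients: dict, bias: float = 0.0) -> float:
    """Weighted sum of features, clamped to [0, 1] (the clamping is an assumption)."""
    raw = bias + sum(coefficients[name] * features.get(name, 0.0) for name in coefficients)
    return min(1.0, max(0.0, raw))

# Hypothetical coefficients, as if produced by the training step.
coefficients = {"special_ratio": 0.8, "string_count_ratio": 0.1, "concat_proportion": 0.3}
bias = -0.2

obfuscated_like = {"special_ratio": 0.6, "string_count_ratio": 4.0, "concat_proportion": 0.5}
plain_like = {"special_ratio": 0.1, "string_count_ratio": 1.0, "concat_proportion": 0.0}

high_score = score(obfuscated_like, coefficients, bias)
low_score = score(plain_like, coefficients, bias)
print(high_score, low_score)
```

Looking features up by name, rather than by position, guards against a mismatch between the feature order used in training and the order used at detection time.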
[0064] This invention also provides a high-precision identification system for PowerShell obfuscated scripts, which may include:
[0065] Original script feature acquisition module: performs abstract syntax tree parsing on the original script, extracts structured script information, and obtains and stores the feature information of the original script by acquiring features from different abstract syntax tree nodes.
[0066] Script Iteration and Dynamic Feature Acquisition Module: Processes script information through the analysis engine, records script iteration methods, and stores them as dynamic features of the script.
[0067] Script comparison feature acquisition module: It compares the script information after iterative analysis with the structured information of the original script, and collects the character features of the script after iterative analysis. It stores the differential features and the script features processed by the iteration.
[0068] Model training module: The collected script features are used as input to train the model. The model is trained using a large number of labeled scripts, and the model detection parameters are generated and stored.
[0069] Script detection module: Extracts features from the input script using a feature extraction method, uses the features as input to the model, and obtains the detection results.
[0070] Those skilled in the art should understand that, besides implementing the system and its modules provided by this invention in purely computer-readable program code, the same program can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system, device, and its modules provided by this invention can be considered as a hardware component, and the modules included therein for implementing various programs can also be considered as structures within the hardware component; alternatively, the modules for implementing various functions can be considered as both software programs implementing the method and structures within the hardware component.
[0071] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and such modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A high-precision identification method for PowerShell obfuscated scripts, characterized in that it comprises the following steps:
Original script feature acquisition step: the PowerShell script to be detected is obtained, the original obfuscated PowerShell script is parsed using an abstract syntax tree, structured script information is extracted, and feature information of the original script is obtained and stored by collecting features from different abstract syntax tree nodes.
Script iterative feature acquisition step: the analysis engine performs semantic iterative parsing on the structured script information and records the script iteration methods, storing them as dynamic features of the script.
Script comparison feature acquisition step: the script information after iterative analysis is compared with the structured information of the original script to obtain the differential features, and at the same time the character features of the script after iterative analysis are collected and stored.
Model training step: all features recorded in the above steps are merged and stored, and used as training parameters for the detection model to obtain, after training, a high-precision recognition model for obfuscated PowerShell scripts.
Obfuscation identification step: the script to be detected is processed in sequence through the original script feature acquisition step, the script iterative feature acquisition step, and the script comparison feature acquisition step to extract script features, and the extracted features are used as input parameters for the high-precision recognition model of PowerShell obfuscated scripts to obtain the obfuscation detection result.
The abstract syntax tree parsing of the original PowerShell obfuscated script is specifically as follows: by leveraging the characteristics of scripting languages, an abstract syntax tree is used to parse the original PowerShell obfuscated script to obtain its structured script information.
The feature information of the original script includes any one or more of the following: the overall character distribution of the script, the distribution of various syntax tree nodes after parsing, the string length and character distribution of the syntax tree node corresponding to the constant string, and the array length and character distribution of the syntax tree node corresponding to the array.
The analysis engine employs one or more of the following methods for iterative semantic parsing of scripts: string concatenation, string replacement, re-parsing of string format scripts, cleaning of unused variables, conversion of ASCII codes to characters, updating of syntax tree node attributes, restoration of command abbreviations, and script analysis hierarchy records.
The script dynamic features include any one or more of the following: the number of iterations for each syntax tree node, the number of iterations for each type of iteration method, and the proportion of each type of iteration method.
2. The high-precision identification method for PowerShell obfuscated scripts according to claim 1, characterized in that the structured script information includes the types of abstract syntax tree nodes, the attributes of abstract syntax tree nodes, and the values of abstract syntax tree nodes.
3. The high-precision identification method for PowerShell obfuscated scripts according to claim 1, characterized in that, in each step, the processed feature information is stored in the form of a database or a text file.
4. The high-precision identification method for PowerShell obfuscated scripts according to claim 1, characterized in that the data to be compared in the differential comparison includes any one or more of the following: changes in character frequency, changes in the proportion of various syntax tree nodes, and changes in syntax tree structure.
5. The high-precision identification method for PowerShell obfuscated scripts according to claim 1, characterized in that the base model trained to obtain the high-precision recognition model for obfuscated PowerShell scripts is a linear regression analysis model.
6. A high-precision identification system for PowerShell obfuscated scripts, characterized in that it comprises:
Original script feature acquisition module: performs abstract syntax tree parsing on the original PowerShell obfuscated script, extracts structured script information, and obtains and stores the feature information of the original script by acquiring features from different abstract syntax tree nodes;
Script iteration and dynamic feature acquisition module: processes script information through the analysis engine, records script iteration methods, and stores them as dynamic features of the script;
Script comparison feature acquisition module: performs differential comparison between the script information after iterative analysis and the structured script information, simultaneously collects the character features of the script after iterative analysis, and stores the differential features and the script features after iterative processing;
Model training module: the collected script features are used as input, and the labeled scripts are used to train the model and to generate and store the model detection parameters;
Script detection module: extracts features from the input script using a feature extraction method, uses the features as input to the model, and obtains the detection results.
The abstract syntax tree parsing of the original PowerShell obfuscated script is specifically as follows: by leveraging the characteristics of scripting languages, an abstract syntax tree is used to parse the original PowerShell obfuscated script to obtain its structured script information.
The feature information of the original script includes any one or more of the following: the overall character distribution of the script, the distribution of various syntax tree nodes after parsing, the string length and character distribution of the syntax tree node corresponding to the constant string, and the array length and character distribution of the syntax tree node corresponding to the array.
The analysis engine employs one or more of the following methods for iterative semantic parsing of scripts: string concatenation, string replacement, re-parsing of string format scripts, cleaning of unused variables, conversion of ASCII codes to characters, updating of syntax tree node attributes, restoration of command abbreviations, and script analysis hierarchy records.
The script dynamic features include any one or more of the following: the number of iterations for each syntax tree node, the number of iterations for each type of iteration method, and the proportion of each type of iteration method.