A vulnerability identification-based anomaly evaluation method, medium and device

By using adaptive feature fusion and model analysis, the problems of scenario adaptability and high false positive rate in source code vulnerability identification and anomaly assessment are solved, achieving accurate vulnerability identification and risk quantification, and improving the efficiency and accuracy of source code security analysis.

CN121902167B · Active · Publication Date: 2026-06-26 · QINGDAO WANDAO (BEIJING) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
QINGDAO WANDAO (BEIJING) INFORMATION TECH CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing source code vulnerability identification and anomaly assessment methods suffer from poor scenario adaptability, one-sided feature extraction, and high false alarm rates, failing to accurately match the core assessment needs of source code vulnerabilities.

Method used

The target weight set and target model set are determined from the type of the target object and the user intent. Structural semantic, environment configuration, and data flow taint analysis features are then extracted, encoded, and fused by weighting, and the result is analyzed by the vulnerability identification, false positive filtering, and anomaly assessment models to produce vulnerability identification and anomaly assessment results.

Benefits of technology

It improves the scenario adaptability and feature completeness of source code vulnerability identification and anomaly assessment, accurately filters false positive vulnerabilities, realizes automated vulnerability screening and risk quantification, and generates objective and comprehensive anomaly assessment results.


Abstract

The present application relates to the technical field of anomaly evaluation, and in particular to an anomaly evaluation method based on vulnerability identification, a medium, and a device. A first preset weight set and a target model set are determined from the target object type and the user intent, so that feature fusion and analysis are adapted to the source code vulnerability identification and anomaly evaluation scenario, improving the pertinence and practicality of the scheme. Structural semantic features, environment configuration features, and specific morphological features that include data flow taint analysis features are extracted, giving full coverage of the multi-dimensional original features. The encoded features are fused by weighting with the first preset weight set to generate a first fused feature vector, which amplifies the contribution of core risk features. A vulnerability identification model performs automatic preliminary screening of vulnerabilities, improving identification efficiency; a false positive filtering model removes false positives, reducing manual audit cost; and a first anomaly evaluation model quantifies multi-dimensional risks and assigns risk levels, so that the anomaly evaluation result is objective and comprehensive.

Description

Technical Field

[0001] This invention relates to the field of anomaly assessment technology, and in particular to an anomaly assessment method, medium, and device based on vulnerability identification.

Background Technology

[0002] In software source code security testing, vulnerability identification and anomaly risk assessment are core components of ensuring code security. Existing source code vulnerability identification and anomaly assessment methods generally suffer from the following shortcomings: Poor scenario adaptability: Most methods use generalized feature weights and analysis models, failing to tailor solutions to specific source code files / projects and vulnerability identification / anomaly assessment scenarios. This results in a lack of specificity in feature fusion and model analysis, making it impossible to accurately match the core judgment requirements of source code vulnerabilities; One-sided feature extraction: Existing methods either focus only on code structure semantic features to identify vulnerability logic or only on a single environment configuration dimension, failing to integrate multi-dimensional features such as structural semantics, environment configuration, and data flow taint analysis. This makes it difficult to fully cover the judgment dimensions of vulnerability existence, triggerability, and exploitability; High false positive rate: The vulnerability identification stage relies solely on code logic pattern matching output results, without combining contextual information such as data flow cleanup and environment protection to filter false positives. A large number of pseudo-vulnerabilities with logical matching but no actual exploitation risk interfere with subsequent analysis, increasing the auditing costs for security personnel.

[0003] Therefore, improving the scenario adaptability, feature completeness, and result accuracy of source code vulnerability identification and anomaly assessment has become an urgent problem to be solved.

Summary of the Invention

[0004] To address the aforementioned technical problems, the present invention provides an anomaly assessment method based on vulnerability identification, which includes the following steps:

[0005] S10, determine the target weight set and the target model set according to the type of the target object and the user intent. If the type of the target object and the user intent meet the first preset condition, the target weight set is the first preset weight set. The target model set includes a vulnerability identification model, a false positive filtering model, and a first anomaly assessment model.

[0006] S20, if the type of the target object and the user intent meet the first preset condition, extract the original features of the target object, wherein the original features include at least structural semantic features, environment configuration features, and specific morphological features, and the specific morphological features include at least data flow taint analysis features.

[0007] S30, encode the original features and fuse them by weighting based on the first preset weight set to obtain the first fused feature vector.

[0008] S40, analyze the structural semantic features based on the vulnerability identification model to obtain vulnerability identification results. The vulnerability identification results include at least the potential vulnerability types and the predicted probability corresponding to each potential vulnerability type.

[0009] S50, analyze the first fused feature vector and the vulnerability identification results based on the false positive filtering model to obtain the false positive filtering results.

[0010] S60, analyze the first fused feature vector and the false positive filtering results based on the first anomaly assessment model to obtain the first anomaly assessment result.

[0011] The present invention also provides a non-transitory computer-readable storage medium storing at least one instruction or at least one program, wherein the at least one instruction or at least one program is loaded and executed by a processor to implement the above-described anomaly assessment method based on vulnerability identification.

[0012] The present invention also provides an electronic device, including a processor and the aforementioned non-transitory computer-readable storage medium.

[0013] This invention has at least the following beneficial effects. By adaptively determining the first preset weight set and the target model set according to the target object type and user intent, the feature fusion and analysis process is adapted to the source code vulnerability identification and anomaly assessment scenario, improving the scenario relevance and practicality of the technical solution. By extracting structural semantic features, environment configuration features, and specific morphological features including data flow taint analysis features, the multi-dimensional original features of the source code are comprehensively covered: the logical carrier of the vulnerability is captured and its triggering conditions and exploitability are analyzed, providing complete data support for subsequent identification and assessment. By encoding the multi-dimensional original features and fusing them by weighting with the first preset weight set, the first fused feature vector is generated, which amplifies the contribution of features strongly correlated with vulnerabilities so that the fused features accurately characterize the vulnerability attributes of the source code. The vulnerability identification model focuses on the structural semantic features and outputs potential vulnerability types and predicted probabilities, achieving automated initial screening of source code vulnerabilities and improving screening efficiency. The false positive filtering model performs binary classification on the first fused feature vector and the vulnerability identification results, filtering out false positives that match a vulnerability pattern logically but carry no actual risk. The first anomaly assessment model combines the first fused feature vector and the false positive filtering results, quantifies dimensions such as basic vulnerability risk, exploitability, environmental constraints, and business impact, and generates risk levels, making the first anomaly assessment result both objective and comprehensive.

Attached Figure Description

[0014] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0015] Figure 1 is a flowchart of an anomaly assessment method based on vulnerability identification provided in Embodiment 1 of the present invention;

[0016] Figure 2 is a flowchart of an anomaly assessment method based on component identification provided in Embodiment 2 of the present invention;

[0017] Figure 3 is a flowchart of an adaptive anomaly assessment method provided in Embodiment 3 of the present invention.

Detailed Implementation

[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is understood that, where appropriate, the terms used to distinguish similar objects can be interchanged so that the invention can also be implemented in other embodiments besides the illustrated or described embodiments. Furthermore, the terms "including," "having," and any variations are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or server that includes a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or devices.

[0020] Embodiment 1

[0021] Embodiment 1 provides an anomaly assessment method based on vulnerability identification. As shown in Figure 1, the method includes the following steps:

[0022] S10, determine the target weight set and the target model set according to the type of the target object and the user intent. If the type of the target object and the user intent meet the first preset condition, the target weight set is the first preset weight set. The target model set includes a vulnerability identification model, a false positive filtering model, and a first anomaly assessment model.

[0023] In one specific implementation, the first preset condition is: the type of the target object is a source code file or source code project, and the user's intention is vulnerability identification and anomaly assessment.

[0024] The target object type refers to the specific form of the file or project to be analyzed, while the user intent is the user's core objective. Different types of target objects (such as source code or APK) and different user intents (such as vulnerability identification or component analysis) place fundamentally different requirements on feature importance and on the functions of the analysis models. By first determining the target object type and user intent, and then allocating weight sets and model sets accordingly, the subsequent feature fusion and analysis process is aligned with the core needs of source code vulnerability identification and anomaly assessment, avoiding the low identification accuracy and inaccurate assessment caused by a one-size-fits-all approach.

[0025] The system performs dual verification using file extensions (e.g., .java, .py, .cpp) and project directory structure (e.g., including pom.xml, requirements.txt, and src / main directories) to confirm that the target object is a source code file or source code project. The determination logic is: if the target object contains code files that can be directly compiled / interpreted and executed, and has a well-organized project directory structure, it is determined to be a source code type. The user's intent for vulnerability identification and anomaly assessment can be determined through user interface function selections (e.g., "Source Code Vulnerability Scan" and "Anomaly Risk Assessment" options) or task command parsing.
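The dual verification in [0025] can be illustrated with a minimal Python sketch. The extension set and marker files are only the examples named in the text, and the function name is_source_code_target, as well as the choice to accept a lone file on its extension alone, are assumptions made for illustration.

```python
from pathlib import Path

# Illustrative rule sets taken from the examples in the text;
# a real rule base would be larger.
SOURCE_EXTENSIONS = {".java", ".py", ".cpp"}
PROJECT_MARKERS = {"pom.xml", "requirements.txt"}


def is_source_code_target(target: Path) -> bool:
    """Return True when the target passes the extension check and,
    for directories, also the project-structure check."""
    if target.is_file():
        # Assumption: a lone file has no directory layout,
        # so only its extension is checked.
        return target.suffix in SOURCE_EXTENSIONS
    files = [p for p in target.rglob("*") if p.is_file()]
    has_code = any(p.suffix in SOURCE_EXTENSIONS for p in files)
    has_marker = any(p.name in PROJECT_MARKERS for p in files)
    has_src_main = (target / "src" / "main").is_dir()
    # Both checks must pass: code files AND a recognizable project layout.
    return has_code and (has_marker or has_src_main)
```

A directory holding only documentation fails the first check, and a directory of loose code files with no marker file or src / main folder fails the second.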

[0026] The first preset condition is a predefined rule for triggering the source code vulnerability identification scheme. This serves as a threshold for determining scenario switching, enabling automated matching of objects, intentions, and schemes, and improving the flexibility and adaptability of technical solutions.

[0027] The first preset weight set is a set of feature fusion weight values customized for source code vulnerability identification scenarios. It focuses on increasing the weight ratio of features that are strongly related to vulnerabilities, such as structural semantic features and data flow taint features, to ensure that the value of core vulnerability features is fully amplified in the subsequent feature fusion process, laying a data foundation for accurate vulnerability identification.

[0028] The target model set is a combination of models adapted to specific scenario requirements. In this embodiment, it includes a vulnerability identification model, a false positive filtering model, and a first anomaly assessment model. The vulnerability identification model is an intelligent model built on algorithms such as graph neural networks, specifically designed to analyze the semantic structure and data flow characteristics of source code to identify potential vulnerabilities. It can detect vulnerability patterns in source code and output basic identification results such as vulnerability type and trigger point. The false positive filtering model is a binary classification model trained on labeled samples. Its input includes fused features and vulnerability identification results, used to eliminate false positive vulnerabilities caused by code logic similarity, improving the purity of the identification results. The first anomaly assessment model is a risk quantification model that comprehensively analyzes features such as vulnerability type, propagation path, and business importance. It is used to classify the risk level of filtered true positive vulnerabilities and output anomaly assessment results.

[0029] As described above, by determining the target object type and user intent and matching the first preset condition, the subsequent feature fusion and model analysis process is highly adapted to the needs of source code vulnerability identification, avoiding the problem of insufficient targeting of general solutions and improving the scenario adaptability of the technical solution. By automatically allocating the first preset weight set, features strongly related to vulnerabilities are ensured to receive higher weights, enabling the subsequent feature fusion vector to accurately represent the vulnerability attributes of the source code and providing high-quality input for the vulnerability identification model. By loading a target model set that includes vulnerability identification, false positive filtering, and anomaly assessment functions, the entire process from vulnerability detection to risk quantification is covered, avoiding analysis gaps caused by missing model functions and improving the integrity of the technical solution. Through the automated solution switching mechanism triggered by preset conditions, no manual intervention is required for weight and model configuration, simplifying the operation process and improving the efficiency of source code vulnerability analysis.
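Step S10 amounts to a lookup from (object type, user intent) to a weight set and a model set. The Python sketch below shows one way this could look. The weight values are placeholders (the text does not disclose concrete numbers), and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values: the text does not disclose concrete weights.
FIRST_PRESET_WEIGHTS = {
    "structural_semantic": 0.4,
    "data_flow_taint": 0.4,
    "environment_config": 0.2,
}
FIRST_MODEL_SET = (
    "vulnerability_identification_model",
    "false_positive_filtering_model",
    "first_anomaly_assessment_model",
)
SOURCE_TYPES = {"source_code_file", "source_code_project"}
FIRST_INTENT = "vulnerability_identification_and_anomaly_assessment"


@dataclass(frozen=True)
class AnalysisPlan:
    weights: dict
    models: tuple


def meets_first_preset_condition(object_type: str, intent: str) -> bool:
    """First preset condition from [0023]: source code plus the matching intent."""
    return object_type in SOURCE_TYPES and intent == FIRST_INTENT


def select_plan(object_type: str, intent: str) -> Optional[AnalysisPlan]:
    if meets_first_preset_condition(object_type, intent):
        return AnalysisPlan(FIRST_PRESET_WEIGHTS, FIRST_MODEL_SET)
    return None  # other conditions are handled by other embodiments
```

Calling select_plan("source_code_project", FIRST_INTENT) yields the first preset weight set and the three-model set, while an APK target with a component analysis intent returns None in this sketch.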

[0030] S20, if the type of the target object and the user intent meet the first preset condition, extract the original features of the target object, wherein the original features include at least structural semantic features, environment configuration features, and specific morphological features, and the specific morphological features include at least data flow taint analysis features.

[0031] Security vulnerabilities in source code are essentially hidden in the code logic, the runtime configuration, and the data flow. Three kinds of features are therefore extracted in separate modules: structural semantic features reflecting the core code logic, environment configuration features reflecting runtime dependencies and security constraints, and data flow taint analysis features reflecting the critical paths that determine exploitability. These are then fused by weighting into a unified fused feature vector, while the vulnerability identification model focuses on the structural semantic features to determine the vulnerability type and probability. This forms the chain from feature extraction to initial vulnerability screening and matches the core needs of source code vulnerability identification.

[0032] In one specific embodiment, S20 includes the following steps:

[0033] S210, extract structural semantic features from the code file of the target object.

[0034] S220, extract environment configuration features from the target object's project configuration file and framework configuration file.

[0035] S230, perform taint tracking based on the initial control flow graph and initial data flow graph in the structural semantic features, integrate taint source, propagation path and cleanup function information to obtain data flow taint analysis features.

[0036] In one specific embodiment, S210 includes the following steps:

[0037] S211, parse the target object's code file to generate an initial abstract syntax tree and an initial control flow graph.

[0038] S212, extract class names, method names, string constants, and code structure information from the initial abstract syntax tree and / or the initial control flow graph.

[0039] S213, integrate the initial abstract syntax tree, initial control flow graph, class names, method names, string constants, and code structure information to obtain the structural semantic features.

[0040] The algorithm employs code parsing tools such as Clang / JavaParser to parse source code files such as .java / .cpp, and generates an initial Abstract Syntax Tree (AST) according to grammatical rules. This AST describes the grammatical hierarchy of the code in a tree structure, including nodes such as classes, methods, variables, and expressions. Additionally, it constructs an initial Control Flow Graph (CFG) to describe the execution path of the code in a directed graph, with basic blocks as nodes and jump relationships as edges.

[0041] The algorithm iterates through all nodes of the initial AST, extracting class names (including package name prefixes, such as "com.openssl.crypto"), method names (such as "heartbeat", "encrypt"), and string constants (such as keys, server URLs, and SQL statement fragments) by matching syntax rules. It also iterates through the basic blocks and jump edges of the initial CFG, extracting code structure information (such as the number of nested loops, conditional logic, method call chains, and dangerous function call locations). Invalid strings (such as empty strings and comment text) and redundant structural information (such as duplicate empty jump edges) are filtered out, retaining the core information relevant to vulnerability identification and risk assessment.
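The AST traversal in [0041] can be sketched with Python's standard ast module. This is a stand-in for the Clang / JavaParser tooling named in the text, so it parses Python source only and does not build a CFG. The names extract_structural_info and max_loop_depth are hypothetical, and the sketch does not filter docstrings.

```python
import ast

SAMPLE = '''
class Crypto:
    def encrypt(self, rows):
        url = "https://example.invalid/api"
        for row in rows:
            for cell in row:
                pass
        return ""
'''


def max_loop_depth(node, depth=0):
    """Deepest nesting of for/while loops below this node."""
    if isinstance(node, (ast.For, ast.AsyncFor, ast.While)):
        depth += 1
    child_depths = [max_loop_depth(c, depth) for c in ast.iter_child_nodes(node)]
    return max([depth] + child_depths)


def extract_structural_info(source: str) -> dict:
    tree = ast.parse(source)
    classes, methods, strings = [], [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            classes.append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            methods.append(node.name)
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.strip():  # filter out empty strings
                strings.append(node.value)
    return {
        "classes": classes,
        "methods": methods,
        "strings": strings,
        "max_loop_depth": max_loop_depth(tree),
    }
```

For SAMPLE, the sketch reports one class, one method, one non-empty string constant, and a loop nesting depth of 2.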

[0042] The graph structure data of the initial AST and CFG are associated and bound with the extracted class names, method names, string constants, and code structure information. Text-type information (such as class names and method names) is initially encoded (e.g., converted to string IDs), and the graph structure data is labeled with node and edge features (e.g., AST nodes are labeled with types such as "class" and "method," and CFG edges are labeled with types such as "conditional jump" and "unconditional jump"). Finally, these are integrated into a structured data set containing textual features and graph structure features, i.e., the structural semantic features.
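The binding step in [0042] can be pictured as building an ID vocabulary for names and attaching type labels to nodes and edges. The Python sketch below is a simplification: it uses integer IDs in first-seen order (an assumption; the text only says "string IDs"), and the record layout and function names are hypothetical.

```python
def build_vocab(names):
    """Map each distinct name to a stable integer ID (first-seen order)."""
    vocab = {}
    for name in names:
        vocab.setdefault(name, len(vocab))
    return vocab


def integrate_structural_features(class_names, method_names, cfg_edges):
    """cfg_edges: iterable of (src_block, dst_block, is_conditional)."""
    vocab = build_vocab(list(class_names) + list(method_names))
    # Label AST nodes with their type and the encoded name.
    ast_nodes = [{"label": "class", "name_id": vocab[n]} for n in class_names]
    ast_nodes += [{"label": "method", "name_id": vocab[n]} for n in method_names]
    # Label CFG edges with the jump type.
    labeled_edges = [
        {
            "src": src,
            "dst": dst,
            "label": "conditional jump" if conditional else "unconditional jump",
        }
        for src, dst, conditional in cfg_edges
    ]
    return {"vocab": vocab, "ast_nodes": ast_nodes, "cfg_edges": labeled_edges}
```

The returned dictionary is one possible shape for the structured data set that combines textual and graph structure features.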

[0043] In one specific embodiment, S220 includes the following steps:

[0044] S221, parse the target object's project configuration file to extract permission declarations, component declarations, and project build constraint information.

[0045] S222, parse the target object's framework configuration file to extract third-party framework dependency information and framework security mechanism configuration information.

[0046] S223, integrate the permission declarations, component declarations, project build constraint information, third-party framework dependency information, framework security mechanism configuration information, file paths, and package / module ownership information to obtain the environment configuration features.

[0047] The runtime environment configuration of the source code directly determines the vulnerability triggering conditions, security protection capabilities, and dependency risks. By parsing the project configuration file and framework configuration file in layers, core configuration information such as permissions, components, dependencies, and security mechanisms is extracted. Context information such as file paths and package / module ownership is then integrated to transform scattered configuration items into standardized environment configuration features. This not only fully covers the environment constraints of the source code at runtime, but also provides environment context data in a unified format for subsequent feature fusion and vulnerability assessment.

[0048] Specifically, a dedicated configuration file parsing library (such as an XML parser or a YAML parser) is used to read the contents of the project configuration file and extract core information by tag or key field. This includes permission declarations, such as the code execution permissions (file read / write, network access) declared by <permission> tags in the pom.xml file of a Java project, or the system permission dependencies declared in the setup.py file of a Python project; component declarations, such as the core business components, component scope (singleton / prototype), and initialization method declared by <bean> tags in the applicationContext.xml file of a Spring project; and project build constraints, such as compilation version (JDK 1.8 / 17), dependency package version locking rules, packaging method (JAR / WAR), and build environment (development / production) constraints. The extracted information is then deduplicated and standardized (e.g., permission names are standardized to industry norms) to form structured data.
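As an illustration of the tag-based extraction in [0048], the following Python sketch reads <bean> declarations with the standard xml.etree.ElementTree parser and deduplicates them. The sample XML is invented and omits the XML namespace that real Spring files declare, so the tag lookup is an assumption that would need adjusting for real input. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; real applicationContext.xml files declare an XML namespace.
SAMPLE_CONTEXT = """
<beans>
  <bean id="orderService" class="com.example.OrderService"
        scope="singleton" init-method="start"/>
  <bean id="orderService" class="com.example.OrderService"
        scope="singleton" init-method="start"/>
  <bean id="reportJob" class="com.example.ReportJob" scope="prototype"/>
</beans>
"""

FIELDS = ("id", "class", "scope", "init_method")


def extract_component_declarations(xml_text: str) -> list:
    root = ET.fromstring(xml_text.strip())
    seen, components = set(), []
    for bean in root.iter("bean"):
        record = (
            bean.get("id"),
            bean.get("class"),
            bean.get("scope", "singleton"),  # Spring's default scope
            bean.get("init-method"),
        )
        if record not in seen:  # deduplicate identical declarations
            seen.add(record)
            components.append(dict(zip(FIELDS, record)))
    return components
```

The duplicate orderService declaration is collapsed, leaving two structured records.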

[0049] Further analysis of third-party framework dependency information is performed, extracting framework name, version number, and dependency type (core / optional dependencies), and marking high-risk versions. Framework security mechanism configuration information is also analyzed, extracting built-in security protection switches and parameters, such as Spring's CSRF protection activation status, request rate limiting configuration, Django's SQL injection protection (ORM parameterized queries), XSS filtering configuration, Struts2's OGNL expression execution restrictions, and file upload whitelists. The extracted framework configuration information is then structured and stored in the format of "framework name-configuration item-value".

[0050] The process involves collecting file paths and package / module ownership information for the target object. Permission declarations, component declarations, project build constraints, third-party framework dependencies, and framework security mechanism configurations are then linked according to a "configuration type - configuration value - effective scope" framework. Combined with file paths and package / module ownership information, the code modules corresponding to each configuration item are labeled. Discrete configuration items are encoded, and all encoded configuration features are concatenated to generate a structured environment configuration feature with a unified dimension, such as a 512-dimensional numerical vector. Specifically, Boolean configurations (e.g., enabling CSRF protection) are encoded as 0 / 1 values; version numbers are encoded as normalized values (e.g., Log4j 2.14.1 is converted to 2.141); and permissions / component names are encoded as one-hot encoded vectors.
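The three encoding rules in [0050] can be sketched in Python as follows. The version rule reproduces the single example given in the text (2.14.1 becomes 2.141) by keeping the major number and concatenating the remaining components as decimals. The permission vocabulary, the multi-hot combination of several permissions, and the small output dimension are assumptions made for illustration.

```python
# Hypothetical vocabulary; the text names these permissions only as examples.
PERMISSION_VOCAB = ["file_read", "file_write", "network_access"]


def encode_bool(enabled: bool) -> float:
    return 1.0 if enabled else 0.0


def encode_version(version: str) -> float:
    """'2.14.1' -> 2.141, following the example in the text."""
    major, *rest = version.split(".")
    return float(f"{major}.{''.join(rest) or '0'}")


def multi_hot(values, vocabulary):
    """Union of the one-hot vectors of each declared value (an assumption)."""
    return [1.0 if v in values else 0.0 for v in vocabulary]


def encode_environment(config: dict, dim: int = 8) -> list:
    vector = [
        encode_bool(config["csrf_enabled"]),
        encode_version(config["log4j_version"]),
    ]
    vector += multi_hot(config["permissions"], PERMISSION_VOCAB)
    if len(vector) > dim:
        raise ValueError("encoded features exceed the target dimension")
    # Zero-pad to the unified dimension (512 in the text, 8 here for brevity).
    return vector + [0.0] * (dim - len(vector))
```

One caveat of this reading of the version rule is that it is lossy: "2.14.1" and "2.1.41" both map to 2.141, so a practical encoder may need a different scheme.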

[0051] In one specific embodiment, S230 includes the following steps:

[0052] S231, based on the preset user input source rule base, mark all user-controllable inputs in the target object as taint sources.

[0053] S232, based on the initial control flow graph and initial data flow graph in the structural semantic features, track the propagation path of tainted data in the target object through interprocedural data flow analysis.

[0054] S233, based on a preset cleanup function rule base, determine through function signature matching, parameter verification, and logical semantic analysis whether each data processing function on the propagation path is a valid cleanup function capable of blocking the current vulnerability exploitation chain.

[0055] S234, construct the data flow taint analysis features based on the taint source, propagation path, and valid cleanup functions.

[0056] Vulnerabilities in source code (such as SQL injection and XSS) essentially arise when untrusted user input (a taint source) reaches a sensitive operation without being effectively cleaned up. By first marking user-controllable taint sources, then tracing the propagation path of tainted data across functions / modules, and finally determining whether the cleanup functions on the path can block the vulnerability exploitation chain, the three core types of information ("taint source - propagation path - cleanup function") are integrated into standardized data flow taint analysis features. This accurately reflects the core logic of vulnerability exploitability and provides a crucial basis for the question asked in subsequent vulnerability assessment: whether the vulnerability can actually be triggered.

[0057] Specifically, the preset user input source rule base covers all typical user-controllable input scenarios in the source code, such as Web scenarios: HTTP request parameters (GET / POST parameters), Cookie values, HTTP header information, and form submission data; general scenarios: file reading streams, command line input parameters, database query return values (untrusted sources), and external interface call return data; and local scenarios: console parameters entered by the user and modifiable items in the configuration file.

[0058] The structural semantic features of the target object, namely the initial AST and initial CFG, are traversed and matched against the input source features in the rule base. Each matched user-controllable input node is explicitly marked, and the location of the taint source (file path + line number), the input type (such as "HTTP GET parameters"), and the associated variable name are recorded to form a taint source list.
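A minimal Python sketch of the taint source marking in [0058] follows. It again uses Python's ast module as a stand-in for the parsers named in the text. The three rule base entries and the function names are hypothetical, and only direct assignments from a matching call are detected.

```python
import ast

# Hypothetical rule base: dotted call name -> input type label.
INPUT_SOURCE_RULES = {
    "input": "console input",
    "request.args.get": "HTTP GET parameters",
    "request.cookies.get": "Cookie value",
}


def dotted_name(node):
    """Return 'a.b.c' for Name / Attribute chains, otherwise None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def mark_taint_sources(source: str, file_path: str) -> list:
    taint_sources = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
            continue
        name = dotted_name(node.value.func)
        if name not in INPUT_SOURCE_RULES:
            continue
        for target in node.targets:
            if isinstance(target, ast.Name):
                taint_sources.append({
                    "file": file_path,
                    "line": node.lineno,
                    "input_type": INPUT_SOURCE_RULES[name],
                    "variable": target.id,
                })
    return taint_sources


SAMPLE = 'user_id = request.args.get("id")\nlimit = 10\nname = input()\n'
```

Running mark_taint_sources(SAMPLE, "handler.py") marks user_id on line 1 and name on line 3, and skips the constant assignment on line 2.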

[0059] Starting with the taint source, the algorithm analyzes the initial control flow graph reflecting the code execution path and the initial data flow graph reflecting data dependencies in the structural semantic features. Following the link of "variable assignment → parameter passing → function call return → cross-module reference", the algorithm tracks the flow of tainted data node by node, records the node position (function name + line number) and data processing method (such as string concatenation, type conversion) at each step of the propagation, and marks the branch nodes in the propagation path (such as path forks caused by condition judgments). The algorithm fully reconstructs all possible propagation branches and finally outputs structured propagation path information, including the full link trajectory of "starting point (taint source) - intermediate node - ending point (sensitive operation / no sensitive operation)".
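The path reconstruction in [0059] can be approximated by a depth-first enumeration over a data flow graph. In the Python sketch below, the graph, its node labels, and the set of sensitive operations are invented examples, and the traversal stops at the first sensitive operation on each branch, which is a simplifying assumption.

```python
# Invented data flow graph: node -> [(successor, how the data is processed)].
DATA_FLOW = {
    "handler:12 user_id": [
        ("handler:14 query", "string concatenation"),
        ("handler:20 label", "type conversion"),
    ],
    "handler:14 query": [("db.execute:15", "parameter passing")],
    "handler:20 label": [],
    "db.execute:15": [],
}
SENSITIVE_OPERATIONS = {"db.execute:15"}


def trace_paths(graph, source, sinks):
    """Enumerate every propagation branch from the taint source."""
    paths = []

    def walk(node, trail, operations):
        # Skip nodes already on the trail so cycles cannot loop forever.
        successors = [(n, op) for n, op in graph.get(node, []) if n not in trail]
        if node in sinks or not successors:
            paths.append({
                "nodes": trail,
                "operations": operations,
                "ends_at_sensitive_operation": node in sinks,
            })
            return
        for nxt, op in successors:
            walk(nxt, trail + [nxt], operations + [op])

    walk(source, [source], [])
    return paths
```

For DATA_FLOW this yields two branches: one that reaches db.execute:15 through string concatenation and parameter passing, and one that ends at handler:20 without touching a sensitive operation.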

[0060] The pre-defined cleanup function rule base covers effective cleanup functions for various vulnerability scenarios, such as XSS vulnerabilities: HtmlUtils.escape(), StringEscapeUtils.escapeHtml4() (features: function signature matching + logic is character escaping); SQL injection: PreparedStatement.setParameter() (features: parameterized query + input validation); command injection: Runtime.exec() parameter whitelist validation function (features: logical semantics is "only allow specified characters / values").

[0061] The system iterates through all data processing functions along the propagation path, performing three layers of verification: function signature matching (verifying whether the function name and the number / type of parameters match a cleanup function in the rule base); parameter verification (determining whether tainted data is passed as a core parameter of the cleanup function; if only non-tainted parameters are passed, the cleanup is invalid); and logical semantic analysis (analyzing the function's internal logic through the AST to confirm that "filtering / escaping / verification" is truly implemented, which rules out functions with the same name but no actual cleanup logic). Functions that satisfy all three layers of verification are marked as "valid cleanup functions," and their location, vulnerability type, and cleanup capabilities (e.g., "full escape of XSS characters") are recorded. Furthermore, if a valid cleanup function exists in the propagation path, the path is marked as "exploitation chain blocked"; otherwise, it is marked as "exploitation chain reachable."
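The three-layer verification in [0061] can be sketched as a chain of checks over a simplified call-site record. In the Python sketch below, the CallSite fields and the rule base entry shape are hypothetical. The body_ops field stands in for the result of analyzing the callee's AST, which the sketch does not perform itself, and the explicit vulnerability type comparison is an assumed reading of "blocking the current vulnerability exploitation chain".

```python
from dataclasses import dataclass, field


@dataclass
class CallSite:
    function: str                                    # name seen on the path
    arg_count: int
    tainted_args: set = field(default_factory=set)   # positions receiving tainted data
    body_ops: set = field(default_factory=set)       # operations found in the callee


# Hypothetical rule base entry; the function name is one cited in the text.
CLEANUP_RULES = {
    "StringEscapeUtils.escapeHtml4": {
        "arg_count": 1,
        "core_arg": 0,
        "vulnerability": "XSS",
        "required_ops": {"escape"},
    },
}


def is_valid_cleanup(call, vulnerability, rules=CLEANUP_RULES) -> bool:
    rule = rules.get(call.function)
    # Layer 1: function signature matching (name and parameter count).
    if rule is None or rule["arg_count"] != call.arg_count:
        return False
    # The cleanup must target the vulnerability type being assessed.
    if rule["vulnerability"] != vulnerability:
        return False
    # Layer 2: tainted data must be passed as the core parameter.
    if rule["core_arg"] not in call.tainted_args:
        return False
    # Layer 3: the callee must really implement the expected logic.
    return rule["required_ops"] <= call.body_ops


def classify_path(calls, vulnerability) -> str:
    blocked = any(is_valid_cleanup(c, vulnerability) for c in calls)
    return "exploitation chain blocked" if blocked else "exploitation chain reachable"
```

A call with the right name but an empty body_ops set fails layer 3, which mirrors the case of a same-named function with no real cleanup logic.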

[0062] The extracted features are standardized and encoded. For example, discrete information (such as taint source type and cleanup function type) is encoded using one-hot encoding; numerical information (such as path length and number of cleanup functions) is normalized using Min-Max; and path trajectories are serialized and encoded (e.g., converting node sequences into fixed-length vectors). The encoded information is then concatenated to generate a data flow taint analysis feature of unified dimension (e.g., a 256-dimensional numerical vector), which contains the core feature of "whether a path exists along which tainted data reaches a sensitive operation without being cleaned."
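As a minimal sketch of the encoding in [0062]: the category list, the value ranges, and the small trajectory dimension below are hypothetical, and truncation with zero-padding is used as a simple stand-in for the path serialization step.

```python
def one_hot(value, categories):
    """One-hot encode a discrete value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value, lo, hi):
    """Map a numeric value into [0, 1], clipping values outside [lo, hi]."""
    if hi == lo:
        return 0.0
    return (min(max(value, lo), hi) - lo) / (hi - lo)


def encode_path(node_ids, length):
    """Truncate or zero-pad a node-ID sequence to a fixed length."""
    clipped = node_ids[:length]
    return [float(x) for x in clipped] + [0.0] * (length - len(clipped))


SOURCE_TYPES = ["http_param", "file_read", "env_var"]  # hypothetical categories


def encode_taint_features(source_type, path_length, sanitizer_count, node_ids,
                          max_path_length=50, max_sanitizers=10, trajectory_dim=8):
    # Concatenate the three encodings into one fixed-size numeric vector.
    return (one_hot(source_type, SOURCE_TYPES)
            + [min_max(path_length, 0, max_path_length),
               min_max(sanitizer_count, 0, max_sanitizers)]
            + encode_path(node_ids, trajectory_dim))


vec = encode_taint_features("file_read", 25, 0, [3, 7, 9])
```

The vector length depends only on the configuration (3 + 2 + 8 here), not on the input, which is what makes the later concatenation and fusion steps possible.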

[0063] As described above, extracting three types of features (structural semantics, environment configuration, and data flow taint analysis) in separate modules achieves comprehensive coverage of the source code risk dimensions. This ensures that subsequent vulnerability analysis focuses not only on "whether there is a vulnerability" but also on "whether it can be triggered / exploited," thus improving the completeness of the analysis. The structural semantic features fully preserve the logical form of the code, enabling the vulnerability identification model to accurately match the logical patterns of vulnerabilities, improving the accuracy of initial vulnerability screening. The environment configuration features capture dependency risks and security protection capabilities, allowing subsequent assessments to identify scenarios in which a vulnerability exists but is blocked by a protection mechanism, reducing the risk of misjudgment. The data flow taint analysis features accurately determine vulnerability exploitability, addressing the deficiency of merely identifying the existence of vulnerabilities while ignoring exploitation conditions, thus improving the accuracy of anomaly assessment. Furthermore, the standardized integration of these three types of features provides a unified input format for subsequent weighted fusion and model analysis, enhancing the automation level of source code security analysis.

[0064] S30: Based on the first preset weight set, encode the original features and fuse them by weighting to obtain the first fused feature vector.

[0065] Among them, the three types of original features—source code structural semantics, environment configuration, and data flow taint analysis—contribute differently to vulnerability identification and anomaly assessment. For example, data flow taint analysis features directly reflect vulnerability exploitability and should have a higher weight. Differentiated weights are assigned to different features using a first preset weight set. If the original feature formats are not uniform, the heterogeneous original features are first standardized and encoded to transform them into vectors of a uniform format. Then, they are fused according to the first preset weights to generate a first fused feature vector that accurately represents the vulnerability attributes of the source code. This amplifies the value of core risk features and eliminates the interference of heterogeneous feature format differences, adapting to the input requirements of subsequent model analysis.

[0066] Specifically, for structural semantic features, the textual information (class name, method name, string constant) is transformed into a 256-dimensional dense vector using Word2Vec / BERT embedding technology, preserving semantic relevance; the graph structural information (AST / CFG) is extracted using a graph convolutional network and transformed into a 256-dimensional vector, preserving the code logic structure; the text vector and the graph structural vector are concatenated to generate a 512-dimensional structural semantic feature encoding vector.

[0067] For the environment configuration features, the discrete category information (permission declaration, framework dependency name) is converted into a binary vector using one-hot encoding; the numerical / Boolean information (framework version, security mechanism enabled status) is mapped to the [0, 1] interval using Min-Max normalization; and all encoding results are concatenated to generate a 256-dimensional environment configuration feature encoding vector.

[0068] For the data flow taint analysis features, the discrete information (taint source type, cleanup function type) is encoded using one-hot encoding; the numerical information (propagation path length, number of cleanup functions) is normalized to the [0, 1] interval using Min-Max normalization; the path trajectory information is converted into a 256-dimensional vector using sequence encoding (such as LSTM); and all encoding results are concatenated to generate a 512-dimensional data flow taint analysis feature encoding vector.

[0069] The three types of encoded feature vectors are L2 normalized to avoid the fusion effect being affected by differences in numerical range. The first preset weight set is a set of feature fusion weight values customized for source code vulnerability identification scenarios, focusing on data flow taint analysis and structural semantic features. It is used to amplify the contribution of features strongly correlated with vulnerabilities, weaken the interference of secondary features, and improve the vulnerability representation capability of the fused features. The normalized vectors are weighted and summed according to the first preset weight set. The weighted summed vectors are then mapped to a unified high-dimensional space (e.g., 1024 dimensions) through a fully connected layer to generate the final first fused feature vector, ensuring that it meets the input dimension requirements of subsequent false positive filtering and anomaly assessment models.

[0070] The specific values of the first preset weight set can be set by the implementer according to the actual situation. For example, based on the feature correlation analysis of massive source code vulnerability samples and combined with the core requirements of source code vulnerability identification, the weight allocation principle can be determined, that is, to prioritize strengthening features directly related to the vulnerability. Correspondingly, the weight of data flow taint analysis features reflecting vulnerability exploitability is 40%, the weight of structural semantic features reflecting the vulnerability logical carrier is 35%, and the weight of environmental configuration features reflecting vulnerability triggering conditions and protection capabilities is 25%.
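The weighted fusion of [0069] and [0070] can be sketched as follows, using the example weights 40% / 35% / 25%. Two points are assumptions made only for this illustration: the description does not state how vectors of different length (512, 256, 512) are aligned before summation, so the shorter vector is zero-padded here; and the fully connected layer is replaced by a fixed random projection, since no trained parameters are given.

```python
import math
import random

# Example weights from [0070].
WEIGHTS = {"taint": 0.40, "structural": 0.35, "environment": 0.25}


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pad_to(vec, dim):
    # Assumption: zero-padding aligns vectors of different length.
    return list(vec) + [0.0] * (dim - len(vec))


def fuse(features, weights=WEIGHTS, common_dim=512):
    """L2-normalize each feature vector, then compute the weighted sum."""
    fused = [0.0] * common_dim
    for name, vec in features.items():
        normalized = pad_to(l2_normalize(vec), common_dim)
        for i, x in enumerate(normalized):
            fused[i] += weights[name] * x
    return fused


def project(vec, out_dim=1024, seed=0):
    """Placeholder for the trained fully connected layer (random weights)."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-0.05, 0.05) * x for x in vec) for _ in range(out_dim)]


features = {
    "structural": [1.0] * 512,
    "environment": [1.0] * 256,
    "taint": [1.0] * 512,
}
fused = fuse(features)
first_fused_feature_vector = project(fused)
```

Normalizing before weighting means the 40 / 35 / 25 split reflects the intended contribution of each feature type rather than differences in raw magnitude.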

[0071] As described above, by assigning differentiated weights to different features through the first preset weight set, the data flow taint analysis features that are directly related to vulnerability exploitability receive the highest weight. The fused vector can accurately focus on the core risk dimension and integrate the complementary value of the three types of features, providing comprehensive feature support for subsequent anomaly assessment and improving the accuracy of subsequent anomaly assessment.

[0072] S40. Analyze the structural semantic features based on the vulnerability identification model to obtain vulnerability identification results. The vulnerability identification results include at least the potential vulnerability types and the predicted probability corresponding to each potential vulnerability type.

[0073] In one specific embodiment, S40 includes the following steps:

[0074] S410: Input the structural semantic features into the vulnerability identification model to obtain the predicted probabilities of several preset vulnerability types.

[0075] S420: Identify the preset vulnerability types whose predicted probability is greater than a preset probability threshold as potential vulnerability types.

[0076] S430: Integrate all potential vulnerability types and the predicted probability corresponding to each potential vulnerability type to obtain the vulnerability identification results.

[0077] The vulnerabilities in the source code are essentially specific code logic patterns (such as SQL injection corresponding to the logic of "string concatenation of SQL statements + user input parameters"). The vulnerability identification model adopts a combined architecture of graph convolutional networks and fully connected classifiers. The graph convolutional network is responsible for extracting graph structure vulnerability patterns from the structural semantic features AST / CFG, and the fully connected classifier is responsible for outputting the predicted probability of various vulnerabilities, with a value range of (0, 1), where a higher value indicates a higher matching degree. Those skilled in the art will know that the vulnerability identification models and their training methods in the prior art fall within the protection scope of this invention, and will not be described in detail here.

[0078] By setting a preset probability threshold to filter potential vulnerability types with high confidence, the accuracy of vulnerability identification is ensured while avoiding interference from low-probability false positives. The final output is a structured vulnerability identification result, providing core foundational data for subsequent false positive filtering and anomaly assessment. The specific value of the preset probability threshold can be set by the implementer according to the actual situation. For example, based on the balance between precision and recall during model training, a general preset probability threshold of 0.7 can be set, which can be adjusted according to business needs. For example, if a low false positive rate is desired, it can be increased to 0.8; if full coverage is desired, it can be decreased to 0.6.

[0079] The screened potential vulnerability types are sorted in descending order of predicted probability. The vulnerability type name and corresponding predicted probability are integrated. It can also integrate the core code location of vulnerability matching (file path + line number extracted from structural semantic features), the core logic fragments matched, etc.
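Steps S420 and S430, together with the descending sort in [0079], can be sketched as below. The probabilities are made-up sample values, and the function name is hypothetical.

```python
def select_potential_vulnerabilities(probabilities, threshold=0.7):
    """Keep types whose probability exceeds the threshold, highest first."""
    kept = [(name, p) for name, p in probabilities.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical model output for one code unit.
predicted = {
    "XSS": 0.74,
    "SQL injection": 0.91,
    "Command injection": 0.65,
    "Path traversal": 0.12,
}

default_result = select_potential_vulnerabilities(predicted)       # threshold 0.7
strict_result = select_potential_vulnerabilities(predicted, 0.8)   # fewer false positives
broad_result = select_potential_vulnerabilities(predicted, 0.6)    # wider coverage
```

Raising the threshold to 0.8 leaves only the SQL injection candidate, while lowering it to 0.6 also admits the command injection candidate, which mirrors the trade-off described in [0078].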

[0080] As described above, by analyzing structural semantic features through vulnerability identification models and accurately matching vulnerability logic patterns in source code, the vulnerability identification process is automated and standardized, significantly improving the efficiency of initial source code vulnerability screening. By setting preset probability thresholds to filter potential vulnerability types and filtering low-probability false positives, the accuracy of vulnerability identification results is greatly improved, avoiding interference from low-confidence vulnerabilities in subsequent analysis. By integrating information such as vulnerability type, predicted probability, and code location to generate structured results, the vulnerability identification results are traceable, facilitating security personnel to locate and verify vulnerabilities.

[0081] S50: Analyze the first fused feature vector and vulnerability identification results based on the false positive filtering model to obtain the false positive filtering results.

[0082] The vulnerability identification model may output false positives that are logically matched but not actually exploitable (such as SQL injection pattern matching but with effectively cleaned data). The false positive filtering model, as a binary classification model, integrates the first fusion feature vector and the predicted probability of the vulnerability identification results. This overcomes the limitations of relying solely on a single logical match, accurately distinguishing between true positive vulnerabilities and false positives, filtering out false positives with no actual risk, and improving the purity of the vulnerability analysis results.

[0083] In one specific implementation, the false positive filtering model is a binary classification model, and S50 includes the following steps:

[0084] S510: Concatenate the first fused feature vector and the predicted probability from the vulnerability identification result into a joint feature vector.

[0085] S520: Input the joint feature vector into the false positive filtering model to obtain the binary classification result corresponding to each potential vulnerability type.

[0086] S530: Remove the potential vulnerability types classified as false positives from the vulnerability identification results to obtain the false positive filtering results.

[0087] The false positive filtering model employs a lightweight binary classification model (such as XGBoost or MLP multilayer perceptron) to meet the classification requirements of the joint feature vector. The training data is labeled with samples of actually exploitable true positive vulnerabilities and false positive vulnerabilities that only logically match but are not exploitable, covering mainstream vulnerability scenarios such as SQL injection and XSS. After inputting the joint feature vector, it outputs a binary classification result for each potential vulnerability type, with positive classes representing true positive vulnerabilities and negative classes representing false positives. Those skilled in the art will recognize that existing false positive filtering models and their training methods fall within the protection scope of this invention, and will not be elaborated upon further here.

[0088] The first fused feature vector (1024 dimensions) and the predicted probability feature (1 dimension) are concatenated dimensionally to generate a 1025-dimensional joint feature vector. If multiple potential vulnerability types exist, a separate joint feature vector is generated for each type. The joint feature vector for each potential vulnerability type is input into the false positive filtering model. Based on the trained classification rules, the false positive filtering model analyzes contextual information such as "whether the data stream is cleaned" and "whether the environment configuration blocks the vulnerability" to determine the nature of the vulnerability and output a binary classification result. The binary classification results are traversed, retaining true positive vulnerabilities marked as "positive" and removing false positive vulnerabilities marked as "negative". The filtered true positive vulnerability information is integrated to generate a standardized false positive filtering result, which includes core information such as vulnerability type, predicted probability, and code location.
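Steps S510 to S530 can be sketched as follows. The classifier is passed in as a plain callable; `toy_classifier` is a hypothetical stand-in for a trained XGBoost or MLP model, and its rule has no basis in the embodiment.

```python
def build_joint_vector(fused_vector, probability):
    """S510: append the predicted probability to the fused feature vector."""
    return list(fused_vector) + [probability]


def filter_false_positives(fused_vector, findings, classifier):
    """S520/S530: classify one joint vector per finding, keep the positives."""
    kept = []
    for finding in findings:
        joint = build_joint_vector(fused_vector, finding["probability"])
        if classifier(joint) == 1:  # 1 = true positive, 0 = false positive
            kept.append(finding)
    return kept


def toy_classifier(joint):
    # Hypothetical rule standing in for a trained binary classifier.
    return 1 if joint[0] > 0.5 and joint[-1] > 0.75 else 0


fused = [1.0] + [0.0] * 1023  # placeholder 1024-dimensional fused vector
findings = [
    {"type": "SQL injection", "probability": 0.91},
    {"type": "XSS", "probability": 0.74},
]
joint = build_joint_vector(fused, 0.91)
true_positives = filter_false_positives(fused, findings, toy_classifier)
```

Each potential vulnerability type gets its own 1025-dimensional joint vector, so the same fused context is evaluated once per candidate.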

[0089] As described above, by concatenating the first fused feature vector with the predicted probability into a joint feature vector, multi-dimensional information such as vulnerability logic, context, and matching confidence is integrated, so that false alarm judgment has comprehensive feature support and the judgment accuracy is improved.

[0090] By analyzing the joint feature vector using a binary classification false positive filtering model, true positive vulnerabilities and false positives can be accurately distinguished, solving the problem of high false positive rate in vulnerability identification, significantly improving the reliability of vulnerability analysis results, and enhancing vulnerability handling efficiency, thus providing high-quality analysis objects for subsequent anomaly assessment.

[0091] S60: Analyze the first fused feature vector and the false positive filtering results based on the first anomaly assessment model to obtain the first anomaly assessment result.

[0092] The actual risk level of true positive vulnerabilities depends not only on the vulnerability type itself, but also on the context of data flow, operating environment, and business impact. The first anomaly assessment model integrates the first fusion feature vector and the true positive vulnerability information after false positive filtering. By quantifying the basic risk, exploitability, environmental constraints, and business impact of the vulnerability, it generates risk values and classifies risk levels. Finally, it outputs a first anomaly assessment result that includes at least a list of true positive vulnerabilities, risk levels, and remediation suggestions, thus achieving a closed loop from vulnerability identification to risk quantification assessment.

[0093] Those skilled in the art will recognize that any anomaly assessment model in the prior art, such as a pre-trained large language model, falls within the protection scope of this invention, as does its pre-training method, which will not be elaborated here.

[0094] As described above, by integrating the first fusion feature vector and false positive filtering results through the first anomaly assessment model, a full-dimensional risk assessment of vulnerability attributes, context, and business impact is achieved. This makes the anomaly assessment results more relevant to actual application scenarios, overcomes the one-sidedness of merely determining the existence of vulnerabilities, and improves the accuracy of anomaly assessment.

[0095] As described above, by adaptively determining the first preset weight set and the target model set based on the target object type and user intent, the feature fusion and analysis process is closely adapted to the source code vulnerability identification and anomaly assessment scenario, improving the scenario relevance and practicality of the technical solution. By extracting structural semantic features, environmental configuration features, and exclusive morphological features including data flow taint analysis features, comprehensive coverage of the multi-dimensional original features of the source code is achieved. This both captures the logical carrier of the vulnerability and analyzes its triggering conditions and exploitability, providing complete data support for subsequent accurate identification and assessment. The first preset weight set is used to encode the multi-dimensional original features and fuse them by weighting to generate the first fused feature vector, which amplifies the contribution of features strongly correlated with vulnerabilities, enabling the fused features to accurately characterize the vulnerability attributes of the source code. By focusing on structural semantic feature analysis and outputting potential vulnerability types and predicted probabilities through the vulnerability identification model, automated initial screening of source code vulnerabilities is achieved, improving the efficiency of initial vulnerability screening. The false positive filtering model integrates the first fused feature vector and vulnerability identification results for binary classification analysis, accurately filtering false positive vulnerabilities that are logically matched but have no actual risk. The first anomaly assessment model integrates the first fused feature vector and false positive filtering results, quantifying the basic risk, exploitability, environmental constraints, business impact, and other dimensions of vulnerabilities and generating risk levels, making the anomaly assessment results both objective and comprehensive.

[0096] Embodiment 2

[0097] This second embodiment provides an anomaly assessment method based on component identification. As shown in Figure 2, the anomaly assessment method based on component identification includes the following steps:

[0098] S1. Based on the type of the target object and the user intent, determine the target weight set and the target model set. If the type of the target object and the user intent meet the second preset condition, then the target weight set is the second preset weight set. The target model set includes the component identification model and the second anomaly evaluation model.

[0099] In one specific implementation, the second preset condition is: the target object is an APK file and the user's intent is APK component identification and anomaly assessment.

[0100] The target object type refers to the specific form of the file or project to be analyzed, while the user intent is the user's core objective. Different types of target objects (such as source code or APK) and different user intents (such as vulnerability identification or component analysis) have fundamentally different requirements for feature importance and the functional requirements of the analysis model. By first determining the target object type and user intent, and then allocating weight sets and model sets accordingly, we ensure that the subsequent feature fusion and analysis process better aligns with the core needs of APK component identification and anomaly assessment, avoiding problems such as low identification accuracy and inaccurate assessment caused by a one-size-fits-all approach.

[0101] The second preset condition is a predefined rule that triggers a specific set of weights and models. This serves as the criterion for scene switching, enabling automatic adaptation of objects, intentions, and solutions, and improving the flexibility and applicability of the method.
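The scene switching described in [0098] to [0101] amounts to a lookup keyed on the target object type and the user intent. The sketch below is illustrative only: the string keys and the `AnalysisPlan` structure are hypothetical, and the first entry is inferred from the models used in the first embodiment rather than quoted from it.

```python
from typing import NamedTuple


class AnalysisPlan(NamedTuple):
    weight_set: str
    models: tuple


# Each key is one preset condition: (target object type, user intent).
PRESET_CONDITIONS = {
    ("source_code", "vulnerability_identification_and_anomaly_assessment"): AnalysisPlan(
        "first_preset_weight_set",
        ("vulnerability_identification_model",
         "false_positive_filtering_model",
         "first_anomaly_assessment_model"),
    ),
    ("apk", "component_identification_and_anomaly_assessment"): AnalysisPlan(
        "second_preset_weight_set",
        ("component_identification_model", "second_anomaly_assessment_model"),
    ),
}


def select_plan(target_type, user_intent):
    """Return the plan whose preset condition is met, or None if none matches."""
    return PRESET_CONDITIONS.get((target_type, user_intent))


apk_plan = select_plan("apk", "component_identification_and_anomaly_assessment")
```

Keeping the conditions in a table means a new scenario can be supported by adding an entry, without changing the selection logic.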

[0102] The target weight set is a set of fusion weight values assigned to each type of original feature. In this embodiment, it is the second preset weight set. The binary code feature in the exclusive morphological feature has a high weight ratio to highlight the importance of key features for APK component identification and anomaly assessment, ensuring that the fused feature vector can accurately represent the core attributes of APK components and improve the accuracy of subsequent analysis.

[0103] The target model set is a combination of models adapted to specific scenario requirements. In this embodiment, it includes a component identification model and a second anomaly assessment model. The component identification model is an intelligent identification model based on multimodal feature fusion and deep learning, integrating anti-obfuscation logic such as fuzzy hash matching and CFG semantic similarity comparison. It is specifically used to identify the name and version information of third-party libraries in APKs, addressing the susceptibility of traditional APK component identification to code obfuscation. It outputs accurate component identification results and matching confidence scores, providing basic data for subsequent anomaly assessment. The second anomaly assessment model is an intelligent assessment model based on component interaction context and vulnerability information. The input includes component identification results, confidence scores, call relationship graphs, data flow graphs, etc., and the output is a quantified anomaly risk result, realizing a closed loop from component identification to anomaly assessment. This overcomes the defect of separating component identification and risk assessment in traditional solutions, accurately judging the actual exploitability and business impact of vulnerabilities. Those skilled in the art will know that the component identification model and anomaly assessment model in the prior art fall within the protection scope of this invention, and will not be described in detail here.

[0104] As described above, by first determining the target object type and user intent, and then assigning the corresponding weight set and model set, the feature fusion and analysis process is highly compatible with the core requirements of APK component identification and anomaly assessment, avoiding the insufficient targeting of general-purpose solutions and improving the scenario adaptability of the method. The second preset condition automatically triggers the scheme that matches the object and intent, without the need for manual intervention in weight configuration and model selection, which simplifies the operation process and improves the efficiency of APK component analysis and anomaly assessment.

[0105] S2: If the type of the target object and the user intent meet the second preset condition, extract the original features of the target object, wherein the original features include at least structural semantic features, environmental configuration features and exclusive morphological features, and the exclusive morphological features include at least binary code features and resource file features.

[0106] The components and security risks of an APK are jointly determined by its code logic, configuration information, proprietary resources, and binary components. By systematically extracting structural semantic features reflecting the core logic of the code, environmental configuration features reflecting runtime dependencies and security mechanisms, and exclusive morphological features reflecting the unique component attributes of the APK, a comprehensive capture of the APK's multi-dimensional original features is achieved. This provides complete and accurate data support for subsequent feature fusion, component identification, and anomaly assessment, overcoming the identification and assessment bias caused by the one-sided feature extraction of traditional solutions.

[0107] In one specific embodiment, S2 includes the following steps:

[0108] S201: Extract structural semantic features from the code file of the target object.

[0109] S202: Extract environment configuration features from the manifest file and framework configuration file of the target object.

[0110] S203: Extract the exclusive morphological features from the native library file, resource index file, and manifest file of the target object.

[0111] In one specific embodiment, S201 includes the following steps:

[0112] S2011: Parse the code file of the target object and generate an initial abstract syntax tree and an initial control flow graph.

[0113] S2012: Extract class names, method names, string constants, and code structure information from the initial abstract syntax tree and / or the initial control flow graph.

[0114] S2013: Integrate the initial abstract syntax tree, initial control flow graph, class names, method names, string constants, and code structure information to obtain the structural semantic features.

[0115] Specifically, Apktool is used to decompile the code files (Dex files) in the APK to generate smali intermediate code; the Soot static analysis framework is called to parse the syntax rules and execution logic of the smali code, automatically construct an initial abstract syntax tree, which describes the syntax hierarchy of the code in a tree structure, including nodes such as classes, methods, variables, and expressions, and constructs an initial control flow graph, which describes the execution path of the code in a directed graph, with basic blocks as nodes and jump relationships as edges.

[0116] The algorithm iterates through all nodes of the initial AST, extracting class names (including package name prefixes, such as "com.openssl.crypto"), method names (such as "heartbeat", "encrypt"), and string constants (such as keys, server URLs, and SQL statement fragments) by matching syntax rules. It also iterates through the basic blocks and jump edges of the initial CFG, extracting code structure information (such as the number of nested loops, conditional logic, method call chains, and dangerous function call locations). Invalid strings (such as empty strings and comment text) and redundant structural information (such as duplicate empty jump edges) are filtered out, retaining core information relevant to component identification and risk assessment.
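The AST traversal in [0116] can be illustrated with a small sketch. Note the substitution: the embodiment works on smali code parsed by Soot, whereas this example uses Python's built-in `ast` module on Python source purely as a stand-in to show the traversal and filtering pattern. The sample source and function name are hypothetical.

```python
import ast


def extract_structural_info(source):
    """Walk the syntax tree and collect class names, method names, and strings."""
    tree = ast.parse(source)
    info = {"classes": [], "methods": [], "strings": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            info["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            info["methods"].append(node.name)
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.strip():  # drop empty strings
                info["strings"].append(node.value)
    return info


SAMPLE = '''
class Crypto:
    def encrypt(self, data):
        # comment text never reaches the syntax tree
        url = "https://example.invalid/api"
        empty = ""
        return url + data
'''

structural_info = extract_structural_info(SAMPLE)
```

Comments are discarded by the parser and empty strings by the explicit check, which corresponds to the filtering of invalid strings described above.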

[0117] The initial AST and CFG graph structure data are associated and bound with extracted class names, method names, string constants, and code structure information. Textual class information (such as class names and method names) is initially encoded (e.g., converted to string IDs), and the graph structure data is labeled with node and edge features (e.g., AST nodes are labeled with types such as "class" and "method," and CFG edges are labeled with types such as "conditional jump" and "unconditional jump"). Finally, this is integrated into a structured data set containing textual features and graph structure features, i.e., structural semantic features.

[0118] In one specific embodiment, S202 includes the following steps:

[0119] S2021: Parse the manifest file of the target object to extract the permission declarations and component declarations.

[0120] S2022: Parse the framework configuration file of the target object to extract third-party framework dependency information and framework security mechanism configuration information.

[0121] S2023: Integrate the permission declarations, component declarations, third-party framework dependency information, framework security mechanism configuration information, file paths, and package module ownership information to obtain the environment configuration features.

[0122] The AXMLPrinter tool is used to parse the APK's manifest file (AndroidManifest.xml). This manifest file is in binary format and needs to be converted into a readable format using a dedicated tool. The permission declarations in the <uses-permission> tags are extracted, such as the dangerous permission android.permission.READ_PHONE_STATE and the normal permission android.permission.INTERNET, as are the component declarations in the <activity>, <service>, and broadcast receiver (<receiver>) tags, including the component name, the export attribute android:exported, the launch mode, etc.
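Once the binary manifest has been decoded to text XML, the extraction in [0122] can be sketched with Python's standard XML parser. The sample manifest and the function name are hypothetical, and an absent android:exported attribute is treated as false here, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Attributes such as android:name live in this XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def parse_manifest(xml_text):
    root = ET.fromstring(xml_text.strip())
    permissions = [e.get(ANDROID_NS + "name") for e in root.iter("uses-permission")]
    components = []
    for tag in ("activity", "service", "receiver"):
        for e in root.iter(tag):
            components.append({
                "kind": tag,
                "name": e.get(ANDROID_NS + "name"),
                # Simplification: a missing attribute is treated as not exported.
                "exported": e.get(ANDROID_NS + "exported") == "true",
            })
    return {"permissions": permissions, "components": components}


SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.READ_PHONE_STATE"/>
  <application>
    <activity android:name=".MainActivity" android:exported="true"/>
    <service android:name=".SyncService" android:exported="false"/>
  </application>
</manifest>
"""

manifest_info = parse_manifest(SAMPLE_MANIFEST)
```

Exported components are of particular interest for the later risk assessment because they can be invoked by other applications.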

[0123] Locate framework configuration files (such as build.gradle and pom.xml) in the assets folder, res folder, or associated source code within the unzipped APK directory. Parse these configuration files using an XML / JSON parsing library to extract third-party framework dependency information, such as framework name and version number (e.g., "OpenSSL:1.0.1c" or "Glide:4.12.0"). Also parse framework-specific configurations, such as Spring's application.yml and Django's settings.py, and extract framework security mechanism configuration information, such as whether CSRF protection, SQL parameterized queries, and SSL certificate verification are enabled.

[0124] Collect the file paths of the target objects (such as the /lib/arm64-v8a/ directory and /src/main/ directory after APK decompression) and package module ownership information (such as the package name prefix "com.company.admin" corresponding to the core business module); convert the permission declarations, component declarations, third-party framework dependency information, framework security mechanism configuration information, file paths, and package module ownership information into standardized features (such as converting permission declarations and component declarations into binary vectors, and converting framework security mechanism configuration information into boolean features); and integrate these features into a unified format environment configuration feature.

[0125] In one specific embodiment, S203 includes the following steps:

[0126] S2031: Extract symbol table information and section information of at least one native library file in the target object.

[0127] S2032: Disassemble at least one native library file to obtain an opcode sequence.

[0128] S2033: Integrate the symbol table information, section information, and opcode sequence to obtain the binary code features.

[0129] S2034: Parse the resource index file in the target object to extract the resource type and resource identifier.

[0130] S2035: Extract the hardware requirements declared in the manifest file of the target object.

[0131] S2036: Integrate the resource type, resource identifier, and hardware requirements to obtain the resource file features.

[0132] The process involves locating the native library files (.so files, such as libssl.so and libcrypto.so) in the /lib directory of the APK; using Ghidra or IDA Pro tools to parse the .so files (ELF format) and extract symbol table information (including exported function names and imported function names; if the symbol table is stripped, it is marked as anonymous symbols); parsing the ELF file header and section table to extract section information, such as the starting address, size, and permission attributes of the .text and .data sections. The .text section stores the executable code and is the core analysis object.

[0133] The .text section of the .so file is disassembled, and machine instructions are converted into assembly instructions (such as assembly code for ARM and x86 architectures) using tools; invalid instructions such as NOP (no-operation) and RET (return) are filtered out, and the opcodes corresponding to valid assembly instructions (such as MOV, PUSH, CALL, etc.) are extracted in the order of execution; the opcodes are fragmented into fixed lengths (such as 32 bytes) to form an ordered sequence of opcodes.

[0134] The symbol table information, section information, and opcode sequence are associated, and the function belonging to the opcode sequence is marked (based on symbol table address mapping); the opcode sequence is initially encoded (e.g., each opcode is mapped to a unique integer ID), and integrated into binary code features that include file structure features and instruction sequence features.
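
The filtering, ID mapping, and fixed-length fragmentation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the padding ID 0, and the choice to measure fragment length in opcodes rather than bytes are assumptions made for the example.

```python
from typing import Dict, List

# The text names NOP and RET as instructions to drop before encoding.
FILTERED_OPCODES = {"NOP", "RET"}


def build_opcode_ids(opcodes: List[str], vocab: Dict[str, int]) -> List[int]:
    """Drop filtered opcodes and map the rest to integer IDs in execution order."""
    ids = []
    for op in opcodes:
        name = op.upper()
        if name in FILTERED_OPCODES:
            continue
        # Assign the next free ID the first time an opcode is seen (0 is reserved for padding).
        if name not in vocab:
            vocab[name] = len(vocab) + 1
        ids.append(vocab[name])
    return ids


def chunk_sequence(ids: List[int], length: int, pad_id: int = 0) -> List[List[int]]:
    """Split the ID sequence into fixed-length fragments, padding the last one."""
    chunks = []
    for start in range(0, len(ids), length):
        chunk = ids[start:start + length]
        chunk += [pad_id] * (length - len(chunk))
        chunks.append(chunk)
    return chunks


vocab: Dict[str, int] = {}
ids = build_opcode_ids(["MOV", "NOP", "PUSH", "CALL", "MOV", "RET"], vocab)
fragments = chunk_sequence(ids, length=3)
```

In practice the opcode list would come from the disassembler output for each function in the .text section, and the fragment length would follow the chosen embodiment.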

[0135] The APK's resource index file (resources.arsc), which stores the index and attributes of all resources, is parsed. A resource parsing tool is used to extract the mapping relationship between resource types (such as string resources, drawable image resources, and layout resources), resource identifiers (unique resource IDs, such as 0x7f0d0001), and resource names. Frequently used or sensitive resources (such as string resources containing "key" or "token") are selected, and the core index information is retained.

[0136] Parse the AndroidManifest.xml file again and extract the hardware requirement declarations contained in the <uses-feature> tags (such as whether the application uses hardware functions such as the camera, positioning, and NFC); also record the type and necessity of each hardware requirement (e.g., android:required="true" indicates that the hardware is required).

[0137] The mapping relationship between resource type and resource identifier is associated with the hardware requirement declarations; resource types are classified and encoded (e.g., image resources are encoded as 1, string resources are encoded as 2), and hardware requirements are converted into Boolean features (1 for support, 0 for non-support); these are integrated into resource file features containing resource configuration and hardware adaptation information; finally, the binary code features are combined with the resource file features to form the specific morphological features.
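
As a rough illustration of the resource type encoding and the Boolean hardware features, the sketch below reads <uses-feature> declarations from a manifest that has already been decoded to text XML. The code for layout resources, the fallback code 0, the helper names, and the sample manifest are assumptions for the example; only the codes 1 and 2 come from the text.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Codes 1 and 2 follow the example in the text; the layout code is an assumption.
RESOURCE_TYPE_CODES = {"drawable": 1, "string": 2, "layout": 3}


def encode_resource_types(types: List[str]) -> List[int]:
    """Map resource type names to integer codes, using 0 for unknown types."""
    return [RESOURCE_TYPE_CODES.get(t, 0) for t in types]


def hardware_features(manifest_xml: str, known: List[str]) -> Dict[str, int]:
    """Return 1 for each known hardware feature declared in <uses-feature>, else 0."""
    root = ET.fromstring(manifest_xml)
    declared = {
        elem.get(f"{{{ANDROID_NS}}}name")
        for elem in root.iter("uses-feature")
    }
    return {name: int(name in declared) for name in known}


# Hypothetical decoded manifest used only for this example.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-feature android:name="android.hardware.camera" android:required="true"/>
  <uses-feature android:name="android.hardware.nfc" android:required="false"/>
</manifest>"""

features = hardware_features(
    MANIFEST,
    ["android.hardware.camera", "android.hardware.nfc", "android.hardware.location.gps"],
)
type_codes = encode_resource_types(["drawable", "string", "raw"])
```

The manifest inside an APK is stored as binary XML, so a decoding step (not shown) is assumed before this parsing.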

[0138] As described above, by extracting structural semantic features, the code logic and syntactic attributes of the APK are accurately captured, enabling component identification to match third-party libraries based on core business logic, thus improving the accuracy of identification. By extracting environment configuration features, the dependency framework, security mechanisms, and permission configuration of the APK are clarified, allowing anomaly assessment to judge risks in conjunction with the runtime environment context, avoiding one-sided assessments detached from configuration information. By extracting specific morphological features, the feature capture of the APK's unique binary and resource components is strengthened, giving component identification anti-obfuscation capabilities and effectively addressing the traditional identification failure problem caused by code obfuscation and packing. By extracting three types of core features, full-dimensional feature coverage of APK code, configuration, and specific components is achieved, enabling subsequent feature fusion to comprehensively characterize the APK's component attributes, overcoming the information loss problem caused by single feature extraction in traditional solutions, improving the completeness and accuracy of the original features, and providing an accurate data foundation for subsequent feature fusion and component identification.

[0139] S3, based on the second preset weight set, perform feature encoding and weighted fusion on the original features to obtain the second fused feature vector.

[0140] The original features of the APK belong to different types, such as text, graph structure, sequence, and discrete or continuous numerical values, and cannot be used directly for model analysis. Through targeted encoding, the heterogeneous features are transformed into vectors of a unified format and then weighted and fused using the second preset weight set. This process preserves the core information of each original feature while highlighting the features that are key to component identification (such as binary code features and resource file features). Finally, a second fused feature vector that comprehensively and accurately represents the component attributes of the APK is generated, supporting the anti-obfuscation identification performed by the subsequent component identification model.

[0141] Specifically, text-based feature encoding is suitable for features such as class names and method names in the structural semantic features, and framework names in the environment configuration features: using Word2Vec or BERT embedding techniques, text features are transformed into dense vectors of fixed dimensions (e.g., 256 dimensions) while preserving the semantic relevance of the text. Discrete text labels (such as permission declarations and component types) are converted into binary vectors using one-hot encoding.

[0142] Graph structure feature encoding is applicable to AST and CFG in structural semantic features: it adopts graph convolutional network or graph attention network in graph neural network, takes the node features (such as node type and attributes) and edge features (such as jump relationship and call relationship) of AST / CFG as input, and aggregates neighborhood information through 2-3 layers of graph neural network to generate graph structure vector with fixed dimension (such as 512 dimensions) to fully capture the syntactic and logical structure features of the code.

[0143] Sequence-based feature encoding is suitable for features such as binary opcode sequences in morphological features: a combined 1D-convolutional neural network and long short-term memory network model is used to extract features from opcode sequences. The 1D-convolutional neural network captures the local dependencies of opcodes (such as combinations of consecutively executed instructions), while the long short-term memory network captures the temporal features of the sequence, ultimately outputting a fixed-dimensional (e.g., 256-dimensional) sequence feature vector.

[0144] Numerical and Boolean feature encoding is applicable to features such as the version number and security mechanism activation status in the environment configuration features, and the section size in the specific morphological features: continuous numerical features (such as section size and version numbers converted to numerical values) are mapped to the [0, 1] interval using Min-Max normalization to eliminate dimensional differences. Boolean features (such as whether a security mechanism is enabled) are converted directly into 0/1 values.
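
A minimal sketch of the Min-Max normalization and Boolean conversion follows. The handling of a constant feature (all values equal, mapped to 0.0) is an assumption, since the text does not cover that case, and the sample section sizes are invented for illustration.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Map values to [0, 1]; a constant feature is mapped to 0.0 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_bool(flag: bool) -> int:
    """Convert a Boolean feature into a 0/1 value."""
    return 1 if flag else 0


section_sizes = [4096.0, 8192.0, 12288.0]  # illustrative values only
normalized = min_max_normalize(section_sizes)
```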

[0145] All encoded single-type feature vectors are subjected to L2 normalization to ensure consistent magnitude and avoid the impact of numerical range differences on the fusion effect. Based on a second preset weight set, the normalized feature vectors of each type are weighted and summed. The weighted summed feature vectors are then mapped to a unified high-dimensional space (e.g., 1024-dimensional) through a fully connected layer to obtain the final second fused feature vector, ensuring compatibility with the input requirements of subsequent component recognition models.
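
The L2 normalization and weighted summation can be sketched as below. The sketch assumes the per-category vectors have already been brought to a common dimension, which the text does not spell out, and it omits the learned fully connected projection. The category names and sample values are illustrative.

```python
import math
from typing import Dict, List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def weighted_fusion(features: Dict[str, List[float]],
                    weights: Dict[str, float]) -> List[float]:
    """Weighted sum of L2-normalized vectors that share one dimension (assumption)."""
    dims = {len(v) for v in features.values()}
    if len(dims) != 1:
        raise ValueError("all feature vectors must share one dimension")
    fused = [0.0] * dims.pop()
    for name, vec in features.items():
        for i, x in enumerate(l2_normalize(vec)):
            fused[i] += weights[name] * x
    return fused


features = {
    "structural": [3.0, 4.0],
    "environment": [0.0, 2.0],
    "morphological": [1.0, 0.0],
}
weights = {"structural": 0.3, "environment": 0.1, "morphological": 0.6}
fused = weighted_fusion(features, weights)
```

In a full pipeline the fused vector would then pass through a trained fully connected layer to reach the target dimension (e.g., 1024).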

[0146] The specific values of the second preset weight set can be set by the implementer according to the actual situation. For example, Pearson correlation coefficient analysis can be used to analyze the correlation between each original feature and the APK component identification label (third-party library name + version), and the fusion weight of each feature category can be determined by combining industry experience and sample training optimization. For example, the weight corresponding to the structural semantic features is 25%-35%, the weight corresponding to the environment configuration features is 10%-15%, and the weight corresponding to the specific morphological features is 50%-65%, and the sum of the weights corresponding to the structural semantic features, environment configuration features, and specific morphological features is 1. Within the specific morphological features, the weight of the binary sub-features is not less than 40% of the total weight of the specific morphological features, and the weight of the resource sub-features is not less than 25% of the total weight of the specific morphological features.
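
The example constraints on the second preset weight set can be checked mechanically. The following sketch encodes the ranges quoted above; the function name, the dictionary keys, and the floating-point tolerance are assumptions for the example.

```python
from typing import Dict

# Example ranges from the text, expressed as fractions of the total weight.
RANGES = {
    "structural": (0.25, 0.35),
    "environment": (0.10, 0.15),
    "morphological": (0.50, 0.65),
}


def validate_weights(weights: Dict[str, float],
                     binary_share: float,
                     resource_share: float,
                     tol: float = 1e-9) -> bool:
    """Check the sum, per-category ranges, and sub-feature shares.

    binary_share and resource_share are fractions of the morphological weight.
    """
    if abs(sum(weights.values()) - 1.0) > tol:
        return False
    for name, (lo, hi) in RANGES.items():
        w = weights.get(name)
        if w is None or not (lo - tol <= w <= hi + tol):
            return False
    return binary_share >= 0.40 - tol and resource_share >= 0.25 - tol


ok = validate_weights(
    {"structural": 0.3, "environment": 0.1, "morphological": 0.6},
    binary_share=0.6,
    resource_share=0.4,
)
```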

[0147] As described above, through targeted heterogeneous feature encoding, a unified vector representation of different types of original features such as text, graph structures, and sequences is achieved, solving the problems of incompatible feature formats and inability to deeply fuse features. This ensures that the second fused feature vector can directly adapt to the input requirements of the component recognition model, improving the efficiency and accuracy of model analysis. By using a second preset weight set based on correlation analysis for weighted fusion, the anti-obfuscation binary code and resource file features gain a higher contribution, significantly improving the recognition accuracy of fused features for APK components. This allows the fused second fused feature vector to fully cover the code logic, configuration information, and exclusive component attributes of the APK, providing complete feature support for the component recognition model.

[0148] S4. Analyze the original features and the second fused feature vector according to the component recognition model to obtain the component recognition result and the matching confidence score. The component recognition result includes the name and version information of the third-party library.

[0149] In one specific embodiment, the component recognition model includes a feature matching module and a graph similarity module, and S4 includes the following steps:

[0150] S401, through the feature matching module, calculate the hash similarity between the second fused feature vector and the corresponding feature of each third-party library.

[0151] S402, based on the graph similarity module, calculate the structural semantic similarity between the initial control flow graph of the target object and the reference control flow graph of each third-party library.

[0152] S403, based on hash similarity and structural semantic similarity, calculate the matching confidence score between the target object and each third-party library.

[0153] S404, use the name and version information of the third-party library corresponding to the highest matching confidence score as the component identification result.

[0154] The standardized third-party library feature library includes: reference feature vectors for each version of each third-party library, which can be generated by extracting features, encoding and fusing them from the clean version of the third-party library using the methods in steps S2-S3; reference control flow graphs for each version of each third-party library, which can be generated by decompiling the code files of the clean version of the library to generate the AST and extracting the CFG corresponding to the core business logic as a structural semantic comparison template; and metadata of the third-party library, such as name, version number, function description, historical vulnerability information, etc.

[0155] APK component identification needs to overcome both the problems of "surface feature obfuscation and interference" and "deep logic consistency verification". This embodiment uses a feature matching module to calculate hash similarity based on the global representation of the second fused feature vector, achieving rapid initial screening against obfuscation; a graph similarity module uses the structural semantic features of the control flow graph for deep comparison to ensure the accuracy of component identification; finally, a confidence score is generated by combining the two types of similarity to select the optimal matching result. This not only solves the problem of single feature matching being prone to failure, but also improves the reliability of anomaly identification through dual-path verification.

[0156] Specifically, the feature matching module loads a preset fuzzy hash algorithm (such as TLSH or ssdeep). Fuzzy hash algorithms are robust to minor changes in data and are therefore suited to feature matching after obfuscation. The hash value of the second fused feature vector is calculated using the fuzzy hash algorithm (e.g., the hash value of TLSH is a fixed length of 32 bytes). The feature library of third-party libraries is traversed, and the reference feature vector of the corresponding version of each third-party library is extracted and its hash value is calculated. A hash distance calculation method (such as Hamming distance or edit distance) is used to compare the hash value of the target object with the hash value of each reference feature vector, and the distance is converted into a hash similarity with a value range of [0, 1]. The closer the value is to 1, the higher the similarity.
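
One simple way to turn a hash distance into a similarity in [0, 1] is shown below, using the bitwise Hamming distance between two equal-length hexadecimal digests. This is a stand-in for illustration: real fuzzy hash libraries such as TLSH and ssdeep define their own distance or score functions, which are not reproduced here, and the function names are hypothetical.

```python
def hamming_distance_bits(hex_a: str, hex_b: str) -> int:
    """Count differing bits between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def hash_similarity(hex_a: str, hex_b: str) -> float:
    """Convert the Hamming distance into a similarity in [0, 1]."""
    total_bits = len(hex_a) * 4  # each hex character encodes 4 bits
    return 1.0 - hamming_distance_bits(hex_a, hex_b) / total_bits


identical = hash_similarity("ff00", "ff00")
opposite = hash_similarity("ff00", "00ff")
close = hash_similarity("f0", "f1")
```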

[0157] The graph similarity module calls a graph neural network (such as Siamese-GAT) to compare the semantic similarity of graph structures and capture the deep logical connections of the control flow graph. The input consists of an initial control flow graph and reference control flow graphs from various third-party libraries in the feature library. The graph neural network preprocesses the two CFGs, unifying the feature descriptions of nodes and edges and aligning the graph structure dimensions. Then, through the shared feature extraction layer of the Siamese-GAT network, it aggregates the node neighborhood features of the two CFGs respectively, generating their respective graph embedding vectors. The cosine similarity of the two graph embedding vectors is calculated to obtain the structural semantic similarity, with a value range of [0, 1]. The closer the value is to 1, the more consistent the core logic.
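
The final comparison step can be sketched without the network itself. Raw cosine similarity lies in [-1, 1], so the sketch clamps negative values to 0 to obtain the stated [0, 1] range; that clamping is an assumption, as is the treatment of zero vectors. The embeddings would come from the Siamese-GAT network, which is not shown.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def structural_similarity(a: List[float], b: List[float]) -> float:
    """Clamp cosine similarity into [0, 1] (assumption, see lead-in)."""
    return max(0.0, min(1.0, cosine_similarity(a, b)))


same = structural_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = structural_similarity([1.0, 0.0], [0.0, 1.0])
opposed = structural_similarity([1.0, 0.0], [-1.0, 0.0])
```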

[0158] A weighted fusion weight is set, and the hash similarity and structural semantic similarity are weighted and summed based on the importance allocation ratio of the two types of similarity to calculate the one-to-one correspondence confidence score between the target object and each third-party library. In this embodiment, the weighted fusion weight corresponding to hash similarity is 0.4, and the weighted fusion weight corresponding to structural semantic similarity is 0.6, to highlight the priority of deep logical matching.
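
Steps S403 and S404 with the 0.4/0.6 weights of this embodiment can be sketched as follows. The library names, versions, and similarity values are hypothetical and serve only to show the selection of the highest-scoring candidate.

```python
from typing import Dict, Tuple

HASH_WEIGHT = 0.4    # weight for hash similarity in this embodiment
STRUCT_WEIGHT = 0.6  # weight for structural semantic similarity

Library = Tuple[str, str]  # (name, version)


def confidence(hash_sim: float, struct_sim: float) -> float:
    """Weighted sum of the two similarities."""
    return HASH_WEIGHT * hash_sim + STRUCT_WEIGHT * struct_sim


def best_match(similarities: Dict[Library, Tuple[float, float]]) -> Tuple[Library, float]:
    """Score every candidate library and return the one with the highest confidence."""
    scored = {lib: confidence(h, s) for lib, (h, s) in similarities.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical candidates: (hash similarity, structural semantic similarity).
candidates = {
    ("libfoo", "1.0"): (0.9, 0.5),
    ("libfoo", "1.1"): (0.6, 0.9),
}
library, score = best_match(candidates)
```

Here the candidate with the weaker hash similarity still wins because the structural semantic similarity carries the larger weight.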

[0159] As described above, through the dual-path collaboration of the feature matching module and the graph similarity module, the second fused feature vector is used to achieve rapid, obfuscation-resistant initial screening, and deep verification is achieved through structural semantic comparison of the control flow graphs. This makes component identification both efficient and accurate, overcoming the susceptibility of single feature matching to obfuscation. Calculating hash similarity with the fuzzy hash algorithm makes full use of the global representation of the second fused feature vector. Calculating structural semantic similarity with the Siamese-GAT network sees through interference such as code obfuscation and packing, accurately capturing the core logical features of third-party libraries and improving the accuracy of the component identification results.

[0160] S5. Based on the component identification results, construct a call relationship graph and a data flow graph between the target object and the third-party library.

[0161] In one specific embodiment, S5 includes the following steps:

[0162] S501, construct a call relationship graph using the methods in the application code of the target object and the APIs of the third-party libraries as nodes and method call relationships as edges, and label each edge with its call frequency, call position, and call level.

[0163] S502, use taint analysis technology to track the propagation path of user input data between the application code and the third-party libraries, construct a data flow graph with data dependencies as edges, and mark the function processing operations on the propagation path.

[0164] The security risks of an APK depend not only on the vulnerabilities of the third-party library itself, but also on the call relationships and data flow logic between the application and the third-party library. By constructing a call relationship graph, the call chain and key attributes of methods and APIs are presented intuitively; by constructing a data flow graph, the propagation path and processing of tainted data are accurately tracked, addressing the shortcomings of merely identifying the existence of components without in-depth correlation analysis. This provides the component interaction context for the subsequent second anomaly assessment model, supporting the accurate assessment of vulnerability exploitability and business impact.

[0165] Specifically, when constructing the call relationship graph, two types of core nodes are first identified: target object side nodes (all executable methods in the application code, including class name prefixes, such as "com.app.login.UserAuth.verify"), and third-party library side nodes (the public APIs of third-party libraries in the component identification results, including library name prefixes, such as "org.openssl.crypto.encrypt").

[0166] Using the initial AST and initial CFG, all nodes are extracted and deduplicated, and each node is assigned a unique identifier (e.g., ID + name + module). Using method call relationships as edges, directed edges are established between nodes when application methods actively call third-party library APIs, or when nested calls exist within third-party library APIs. By traversing the call chain of the initial CFG, the relationships between edges are confirmed to avoid missing cross-module or cross-level calls.

[0167] The system counts the number of times each call relationship is executed during the runtime of the target APK (estimated from the number of times the call statement appears in the static code, or from dynamic instrumentation statistics), and labels the call frequency as "high frequency (≥100 times)," "medium frequency (10-99 times)," or "low frequency (<10 times)." It records the file path and line number of the call statement (e.g., "/src/main/java/com/app/login/UserAuth.java:45") to indicate the call location. Using the application entry method as the top level (level 1), the system labels the call levels according to the depth of the call chain (e.g., application method → third-party library API is level 2, and application method → intermediate method → third-party library API is level 3). Finally, a graph database (e.g., Neo4j) or a visualization tool (e.g., Graphviz) is used to integrate nodes, edges, and edge attributes into a structured call relationship graph, supporting node retrieval and link tracing functions.
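
The frequency thresholds and call-level labeling can be sketched as below. The sketch assigns each node the depth of its shortest call chain from the entry method, which is an assumption since the text does not say how to treat nodes reachable through several chains. The entry method name is hypothetical; the other two names follow the examples in the text.

```python
from collections import deque
from typing import Dict, List


def frequency_label(count: int) -> str:
    """Label a call count using the thresholds given in the text."""
    if count >= 100:
        return "high"
    if count >= 10:
        return "medium"
    return "low"


def call_levels(graph: Dict[str, List[str]], entry: str) -> Dict[str, int]:
    """Breadth-first traversal; the entry method is level 1."""
    levels = {entry: 1}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for callee in graph.get(node, []):
            if callee not in levels:
                levels[callee] = levels[node] + 1
                queue.append(callee)
    return levels


call_graph = {
    "com.app.Main.onCreate": ["com.app.login.UserAuth.verify"],  # hypothetical entry
    "com.app.login.UserAuth.verify": ["org.openssl.crypto.encrypt"],
}
levels = call_levels(call_graph, "com.app.Main.onCreate")
```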

[0168] Taint analysis technology is a static code analysis technique that marks untrusted inputs (taint sources), tracks their propagation path in the code, and determines whether they flow into sensitive operations (such as vulnerable APIs or database writes). It provides core technical support for the construction of data flow graphs and enables accurate tracing of tainted data propagation paths.

[0169] Specifically, when constructing the data flow graph, the starting node of the data flow (the taint source node, labeled with the "untrusted input" attribute) is clearly defined based on user-controllable input taint sources (such as network request parameters, file read data, and interactive input box data). Based on taint analysis technology, combined with the initial CFG and data flow propagation path information, the flow of tainted data is tracked across application code and third-party libraries: including data assignment, transmission, parameter input, and return value reception processes, recording the processing node (application method or third-party library API) at each step.

[0170] Using data dependencies as edges, directed edges are established between adjacent processing nodes in the flow of tainted data, visually representing the complete path of data from the taint source to the final processing node. Each edge is labeled with the function processing operations along the propagation path, such as "parameter validation," "string concatenation," "encryption," and "database write," clearly indicating the specific operations performed during data flow. The taint source node, data processing node, directed edges, and operation labels are integrated into a data flow graph, highlighting key risk paths (such as the path where tainted data is directly passed to a vulnerable API in a third-party library without purification).
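
To illustrate how key risk paths might be highlighted, the sketch below enumerates paths from a taint source to a sink in a small data flow graph and keeps those on which no purifying operation occurs. The set of operations treated as purification, the node names, and the graph are assumptions for the example.

```python
from typing import Dict, List, Tuple

# Assumption: operation labels that count as purification of tainted data.
SANITIZERS = {"parameter validation"}

Edge = Tuple[str, str]  # (target node, operation label on the edge)


def unsanitized_paths(graph: Dict[str, List[Edge]],
                      source: str, sink: str) -> List[List[str]]:
    """Return every source-to-sink path with no purifying operation on it."""
    results: List[List[str]] = []

    def walk(node: str, path: List[str], clean: bool) -> None:
        if node == sink:
            if not clean:
                results.append(path)
            return
        for target, operation in graph.get(node, []):
            if target in path:  # skip cycles
                continue
            walk(target, path + [target], clean or operation in SANITIZERS)

    walk(source, [source], False)
    return results


# Hypothetical data flow graph: one validated route and one unvalidated route.
data_flow = {
    "http_param": [
        ("UserAuth.verify", "parameter validation"),
        ("QueryBuilder.build", "string concatenation"),
    ],
    "UserAuth.verify": [("Db.write", "database write")],
    "QueryBuilder.build": [("Db.write", "database write")],
}
risky = unsanitized_paths(data_flow, "http_param", "Db.write")
```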

[0171] As described above, by constructing a call relationship graph and marking the call frequency, location, and level, the interaction logic between the application and third-party libraries is visualized, providing a call context for subsequent risk assessment and solving the problem of lacking component interaction analysis. By using taint analysis technology to track data propagation paths and construct a data flow graph, the flow process and processing operations of tainted data are accurately presented, making vulnerability exploitability analysis based on evidence. The call relationship graph and data flow graph complement each other, presenting both functional call relationships and data flow relationships, comprehensively covering the core dimensions of component interaction, providing complete contextual input for the second anomaly assessment model, supporting the realization of quantitative risk assessment, and improving the accuracy of risk assessment.

[0172] S6. Based on the second anomaly assessment model, the component identification results, matching confidence scores, call relationship graphs, and data flow graphs are analyzed to obtain the second anomaly assessment results.

[0173] The anomaly risk of an APK depends not only on the existence of vulnerabilities in the third-party library itself, but also on the interaction context of its components (call relationships, data flow) and the reliability of its identification (confidence score). Therefore, by integrating component identification results, matching confidence scores, call relationship graphs, and data flow graphs through a second anomaly assessment model, we can break through the one-sided assessment mode of simply matching the CVE database. This model simulates a comprehensive judgment logic of "vulnerability existence + exploitability + business impact," realizing the logic from component identification to risk quantification, and accurately outputting anomaly assessment results.

[0174] Those skilled in the art will recognize that any anomaly assessment model in the prior art, such as a pre-trained large language model, together with its pre-training method, falls within the protection scope of this invention and will not be elaborated here.

[0175] In one specific implementation, the input to the second anomaly assessment model also includes known vulnerability information of third-party libraries, the reachability of the API corresponding to the known vulnerability in the call relationship graph, and the degree of matching between the permissions required to exploit the known vulnerability and the actual permissions possessed by the target object. S6 includes the following steps:

[0176] The component identification results, confidence scores, call relationship graphs, data flow graphs, known vulnerability information, reachability, and matching degree are input into the second anomaly assessment model to obtain the second anomaly assessment results. The second anomaly assessment results include at least a list of third-party library components, a list of vulnerabilities sorted by risk value, and remediation suggestions.

[0177] The known vulnerability information is obtained by querying vulnerability databases such as NVD and CVE Details, and includes the CVE vulnerability ID, CVSS base score, vulnerability type (e.g., remote code execution, SQL injection), vulnerability triggering API, and exploit prerequisites corresponding to the third-party library version. Vulnerability API reachability is determined by traversing the call relationship graph to see whether the target application code can reach the vulnerability API through direct or indirect call chains, outputting a binary identifier of "reachable (1.0)" or "unreachable (0.0)". The permission matching degree is determined by comparing the permissions required for vulnerability exploitation (e.g., network permissions, file read/write permissions) with the actual permissions declared in the target object's manifest file, outputting a matching coefficient (complete match = 1.0, partial match = 0.6, no match = 0.2).
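
The reachability identifier and the permission matching coefficient can be sketched as follows. The coefficients 1.0, 0.6, and 0.2 come from the text; treating an empty requirement set as a complete match, and all node and permission names, are assumptions for the example.

```python
from typing import Dict, List, Set


def reachability(call_graph: Dict[str, List[str]],
                 entries: List[str], vulnerable_api: str) -> float:
    """Return 1.0 if the vulnerable API is reachable from any entry method, else 0.0."""
    seen = set(entries)
    stack = list(entries)
    while stack:
        node = stack.pop()
        if node == vulnerable_api:
            return 1.0
        for callee in call_graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return 0.0


def permission_match(required: Set[str], declared: Set[str]) -> float:
    """Complete match = 1.0, partial match = 0.6, no match = 0.2."""
    if not required or required <= declared:
        return 1.0  # an empty requirement set counts as complete (assumption)
    if required & declared:
        return 0.6
    return 0.2


graph = {"app.Main.run": ["app.Net.fetch"], "app.Net.fetch": ["lib.parse"]}
reachable = reachability(graph, ["app.Main.run"], "lib.parse")
unreachable = reachability(graph, ["app.Main.run"], "lib.unused")
partial = permission_match({"INTERNET", "WRITE_EXTERNAL_STORAGE"}, {"INTERNET"})
```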

[0178] Specifically, the standardized component identification results, confidence scores, call relationship graphs, data flow graphs, known vulnerability information, reachability, and matching degree are simultaneously input into the second anomaly assessment model, and the risk level of each vulnerability is determined by combining the quantified risk value. Further, a structured list of third-party library components is formed by integrating the third-party library name, version, matching confidence score, and associated CVE vulnerability ID; these are then sorted in descending order of quantified risk value. Each vulnerability includes vulnerability type, CVSS base score, risk level, vulnerability API, reachability status, permission matching status, call chain (from the call relationship graph), and data propagation path (from the data flow graph), resulting in a vulnerability list sorted by risk value. Furthermore, precise solutions are matched for different vulnerability scenarios, providing corresponding remediation suggestions, such as third-party library version upgrades (explicitly recommending secure versions), vulnerability API replacement solutions, permission reduction suggestions, and adding data cleanup functions.
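
The ordering of the vulnerability list can be illustrated with a small sketch. The risk formula below (a product of the CVSS base score, reachability, permission matching coefficient, and matching confidence score) is purely hypothetical and is not stated in this section; in the described method the quantified risk value comes from the second anomaly assessment model. The vulnerability identifiers are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    vuln_id: str             # placeholder identifier, not a real CVE
    cvss: float              # CVSS base score
    reachability: float      # 1.0 reachable, 0.0 unreachable
    permission_match: float  # 1.0, 0.6, or 0.2
    confidence: float        # matching confidence score of the component

    @property
    def risk(self) -> float:
        # Hypothetical formula for illustration only; not taken from the source.
        return self.cvss * self.reachability * self.permission_match * self.confidence


def rank(findings: List[Finding]) -> List[Finding]:
    """Sort findings in descending order of quantified risk value."""
    return sorted(findings, key=lambda f: f.risk, reverse=True)


findings = [
    Finding("VULN-A", cvss=9.8, reachability=0.0, permission_match=1.0, confidence=0.9),
    Finding("VULN-B", cvss=7.5, reachability=1.0, permission_match=0.6, confidence=0.8),
]
ranked = rank(findings)
```

Under this illustrative formula, an unreachable vulnerability with a high CVSS score ranks below a reachable one with a lower score, which mirrors the emphasis on exploitability in the text.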

[0179] As described above, by supplementing the model input with known vulnerability information from third-party libraries, vulnerability API accessibility, and permission matching degree, the limitations of assessment relying solely on component identification results are overcome. The model simulates a three-layer judgment logic of vulnerability existence, exploitability, and exploitation condition satisfaction, achieving an upgrade from component identification to precise risk quantification, and ensuring that the anomaly assessment results are both comprehensive and accurate.

[0180] As described above, by adaptively determining the target weight set and target model set based on the target object type and user intent, the feature fusion and analysis process becomes more targeted, effectively adapting to the core requirements of APK component identification and anomaly assessment, thus improving the adaptability and practicality of the technical solution. By extracting structural semantic features, environmental configuration features, and exclusive morphological features including binary code features and resource file features, comprehensive coverage of the original features of the APK in multiple dimensions is achieved, providing rich data support for subsequent accurate identification and risk assessment. By encoding and weighting the original features based on the second preset weight set, a second fusion feature vector that comprehensively represents the component attributes of the APK is obtained, which is then used for component identification. By combining the original features and the second fused feature vector, the analysis yields identification results including the third-party library name, version, and matching confidence score, making the component identification results more reliable and providing accurate basic data for subsequent anomaly assessment. By constructing a call relationship graph and data flow graph between the target object and the third-party library, the interaction logic and data propagation path between components are clearly presented, providing key contextual information for anomaly assessment. Through comprehensive analysis of the component identification results, matching confidence scores, and two types of graph structure data using the second anomaly assessment model, accurate anomaly assessment results are obtained, realizing an intelligent closed loop from component identification to anomaly assessment, significantly improving the accuracy and efficiency of anomaly risk assessment.

[0181] Example 3

[0182] This third embodiment provides an adaptive anomaly assessment method. As shown in Figure 3, the adaptive anomaly assessment method includes the following steps:

[0183] S100, based on the type of the target object and the user intent, determine the target weight set and the target model set. If the type of the target object and the user intent meet the first preset condition, then the target weight set is the first preset weight set, and the target model set includes a vulnerability identification model, a false alarm filtering model, and a first anomaly assessment model. If the type of the target object and the user intent meet the second preset condition, then the target weight set is the second preset weight set, and the target model set includes a component identification model and a second anomaly assessment model.

[0184] S200, based on the type of the target object and the user intent, extract the original features of the target object. The original features include at least structural semantic features, environmental configuration features, and specific morphological features. If the type of the target object and the user intent meet the first preset condition, the specific morphological features include at least data flow taint analysis features. If the type of the target object and the user intent meet the second preset condition, the specific morphological features include at least binary code features and resource file features.

[0185] S300, based on the target weight set, perform feature encoding and weighted fusion on the original features to obtain the target fused feature vector.

[0186] S400, based on the target model set, the original features, and the target fused feature vector, obtain the target anomaly assessment result corresponding to the target object.
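
The branch selection in S100 and S200 can be sketched as a simple dispatch. The concrete form of the two preset conditions is an assumption here: the sketch treats the first as "source code + vulnerability identification" and the second as "APK + component analysis", and all string labels and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Plan:
    weight_set: str
    models: Tuple[str, ...]
    morphological_features: Tuple[str, ...]


def select_plan(target_type: str, intent: str) -> Plan:
    """Pick the weight set, model set, and morphological features for a request."""
    # Assumed form of the first preset condition.
    if target_type == "source_code" and intent == "vulnerability_identification":
        return Plan(
            weight_set="first_preset_weight_set",
            models=("vulnerability_identification", "false_alarm_filtering",
                    "first_anomaly_assessment"),
            morphological_features=("data_flow_taint_analysis",),
        )
    # Assumed form of the second preset condition.
    if target_type == "apk" and intent == "component_analysis":
        return Plan(
            weight_set="second_preset_weight_set",
            models=("component_identification", "second_anomaly_assessment"),
            morphological_features=("binary_code", "resource_file"),
        )
    raise ValueError("no preset condition matches this target type and intent")


apk_plan = select_plan("apk", "component_analysis")
```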

[0187] This embodiment constructs an evaluation system with dual preset condition branches for two typical target objects and corresponding evaluation requirements. By matching the target object type with the user intent, it customizes the allocation of weight sets, model sets, and exclusive morphological features, breaking through the homogeneity limitations of a single evaluation scheme, achieving precise adaptation of objects, intents, and schemes, and covering two core scenarios: source code vulnerability identification and APK component anomaly analysis.

[0188] If the type of the target object and the user intent meet the first preset conditions, the exclusive morphological feature is the data flow taint analysis feature. By marking the taint source, tracking the propagation path, and determining the purification function, the exploitability of the source code vulnerability is reflected. The target fusion feature vector is the first fusion feature vector in Example 2. Based on the target model set, the original features, and the target fusion feature vector, the target anomaly evaluation result corresponding to the target object can be obtained by referring to steps S40-S60 in Example 2.

[0189] If the type of the target object and the user intent meet the second preset condition, the exclusive morphological features are binary code features and resource file features. Obtained by parsing the DEX file and resource list of the APK, these features reflect the component attributes and version risks of third-party libraries. The target fused feature vector is the second fused feature vector in Example 1. Based on the target model set, the original features, and the target fused feature vector, the target anomaly assessment result corresponding to the target object can be obtained by referring to steps S4-S6 in Example 1.

[0190] The above-mentioned dual-scenario adaptive matching design achieves full coverage of anomaly assessment for both source code and APK objects. Through scenario-differentiated weight allocation, customized feature extraction, and precise model invocation, it not only solves the problem of poor adaptability of general solutions, but also improves the accuracy and efficiency of anomaly identification, providing a standardized and intelligent solution for the security detection of multiple types of software objects.
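
For the source code branch, claim 1 details the model invocation as probability thresholding (S410-S430) followed by false positive filtering (S510-S530). The sketch below illustrates that flow under stated assumptions: the predicted probabilities are given rather than produced by a trained model, one joint vector is built per potential vulnerability type, and `toy_classifier` is a hypothetical stand-in for the binary false positive filtering model.

```python
from typing import Callable, Dict, List


def identify_vulnerabilities(probs: Dict[str, float],
                             threshold: float) -> Dict[str, float]:
    """S420-S430: keep preset types whose probability exceeds the threshold."""
    return {t: p for t, p in probs.items() if p > threshold}


def filter_false_positives(fused: List[float],
                           identified: Dict[str, float],
                           is_true_positive: Callable[[List[float]], bool]
                           ) -> Dict[str, float]:
    """S510-S530: build a joint vector per type and drop false positives."""
    kept: Dict[str, float] = {}
    for vuln_type, prob in identified.items():
        joint = fused + [prob]        # S510: fused vector plus probability
        if is_true_positive(joint):   # S520: binary classification
            kept[vuln_type] = prob    # S530: retain true positives only
    return kept


def toy_classifier(joint: List[float]) -> bool:
    """Stand-in for a trained binary classifier (assumption)."""
    return sum(joint) > 1.5


# Hypothetical model outputs and fused feature vector.
predicted = {"sql_injection": 0.9, "xss": 0.6, "path_traversal": 0.2}
identified = identify_vulnerabilities(predicted, threshold=0.5)
confirmed = filter_false_positives([0.5, 0.25], identified, toy_classifier)
```

Here `path_traversal` is dropped by the probability threshold, and `xss` is dropped by the stand-in classifier, leaving only `sql_injection` for the anomaly assessment step S60.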

[0191] Example 4

[0192] Embodiment 4 of the present invention provides a non-transitory computer-readable storage medium that can be disposed in an electronic device and stores at least one instruction or at least one program for implementing the method of the method embodiments. The at least one instruction or at least one program is loaded and executed by a processor to implement the vulnerability identification-based anomaly assessment method provided in the above embodiments.

[0193] Example 5

[0194] Embodiment 5 of the present invention provides an electronic device, which includes a processor and the non-transitory computer-readable storage medium of Embodiment 4 of the present invention.

[0195] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.

Claims

1. An anomaly assessment method based on vulnerability identification, characterized in that the method includes the following steps:
S10: determine a target weight set and a target model set according to the type of the target object and the user intent, wherein if the type of the target object and the user intent meet the first preset condition, the target weight set is the first preset weight set, and the target model set includes a vulnerability identification model, a false positive filtering model, and a first anomaly assessment model;
S20: if the type of the target object and the user intent meet the first preset condition, extract the original features of the target object, wherein the original features include at least structural semantic features, environment configuration features, and exclusive morphological features, and the exclusive morphological features include at least data flow taint analysis features;
S30: encode and weight the original features according to the first preset weight set to obtain a first fused feature vector;
S40: analyze the structural semantic features according to the vulnerability identification model to obtain a vulnerability identification result, wherein the vulnerability identification result includes at least the potential vulnerability types and the predicted probability corresponding to each potential vulnerability type, and S40 includes the following steps:
S410: input the structural semantic features into the vulnerability identification model to obtain the predicted probabilities of several preset vulnerability types;
S420: define the preset vulnerability types whose predicted probability is greater than the preset probability threshold as potential vulnerability types;
S430: integrate all potential vulnerability types and the predicted probability corresponding to each potential vulnerability type to obtain the vulnerability identification result;
S50: analyze the first fused feature vector and the vulnerability identification result according to the false positive filtering model to obtain a false positive filtering result, wherein the false positive filtering model is a binary classification model, and S50 includes the following steps:
S510: concatenate the first fused feature vector and the predicted probability in the vulnerability identification result into a joint feature vector;
S520: input the joint feature vector into the false positive filtering model to obtain a binary classification result corresponding to each potential vulnerability type, wherein the positive class in the binary classification result represents a true positive vulnerability, and the negative class represents a false positive;
S530: filter out the potential vulnerability types classified as false positives from the vulnerability identification result to obtain the false positive filtering result;
S60: analyze the first fused feature vector and the false positive filtering result according to the first anomaly assessment model to obtain the first anomaly assessment result.

2. The anomaly assessment method based on vulnerability identification according to claim 1, characterized in that the first preset condition is that the type of the target object is a source code file or source code project, and the user intent is vulnerability identification and anomaly assessment.

3. The anomaly assessment method based on vulnerability identification according to claim 1, characterized in that S20 includes the following steps:
S210: extract the structural semantic features from the code file of the target object;
S220: extract the environment configuration features from the project configuration file and framework configuration file of the target object;
S230: perform taint tracking based on the initial control flow graph and initial data flow graph in the structural semantic features, and integrate the taint source, propagation path, and cleanup function information to obtain the data flow taint analysis features.

4. The anomaly assessment method based on vulnerability identification according to claim 3, characterized in that S210 includes the following steps:
S211: parse the code file of the target object to generate an initial abstract syntax tree and an initial control flow graph;
S212: extract class names, method names, string constants, and code structure information from the initial abstract syntax tree and/or the initial control flow graph;
S213: integrate the initial abstract syntax tree, the initial control flow graph, the class names, the method names, the string constants, and the code structure information to obtain the structural semantic features.

5. The anomaly assessment method based on vulnerability identification according to claim 4, characterized in that S230 includes the following steps:
S231: based on a preset user input source rule base, mark all user-controllable inputs in the target object as taint sources;
S232: based on the initial control flow graph and initial data flow graph in the structural semantic features, track the propagation path of tainted data in the target object through interprocedural data flow analysis;
S233: based on a preset cleanup function rule base, determine through function signature matching, parameter verification, and logical semantic analysis whether each data processing function on the propagation path is an effective cleanup function that can block the current vulnerability exploitation chain;
S234: construct the data flow taint analysis features based on the taint source, the propagation path, and the effective cleanup functions.

6. The anomaly assessment method based on vulnerability identification according to claim 3, characterized in that S220 includes the following steps:
S221: parse the project configuration file of the target object to extract permission declarations, component declarations, and project construction constraint information;
S222: parse the framework configuration file of the target object to extract third-party framework dependency information and framework security mechanism configuration information;
S223: integrate the permission declarations, the component declarations, the project construction constraint information, the third-party framework dependency information, the framework security mechanism configuration information, the file path, and the package module ownership information to obtain the environment configuration features.

7. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores at least one instruction or at least one program segment, characterized in that the at least one instruction or the at least one program segment is loaded and executed by a processor to implement the anomaly assessment method based on vulnerability identification as described in any one of claims 1-6.

8. An electronic device, characterized in that it includes a processor and the non-transitory computer-readable storage medium as described in claim 7.