A product information risk assessment method and system based on a large model

By using a large-model-based approach, the invention addresses the inaccurate risk assessment of existing technologies. It achieves unified cross-language mapping and reachability analysis, reduces false alarm rates, improves the accuracy of risk assessment and the efficiency of remediation, and forms a closed loop of risk assessment and governance.

CN121615142B · Active · Publication Date: 2026-06-26 · ARTICLE NUMBERING CENT OF CHINA


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ARTICLE NUMBERING CENT OF CHINA
Filing Date: 2025-12-01
Publication Date: 2026-06-26

Application Information


IPC: G06F21/57; G06N5/04

CPC: G06F21/577; G06N5/041; G06F2221/033; Y02P90/30

AI Tagging

Technology Topics

Digital data; Code generation

Technical Efficacy Phrases

Reduce invalid troubleshooting; Reduce false positives


AI Technical Summary

Technical Problem

Existing product information risk assessment solutions are prone to overlooking the transitive dependencies and hidden call paths between nodes when dealing with software that uses multiple languages, has multiple repositories, and multiple dependencies. This leads to inaccurate risk assessments, high false alarm rates, mismatched remediation resources, and increased risks of compliance and security incidents.

Method used

Using a large-model-based approach, the source code and the artifacts are scanned to produce a source dependency list and an artifact component list, which are aligned after their main names are unified. A cross-language path logic graph group is then generated, the actual packaged versions are corrected using the artifact component list, and reachability and taint analysis are performed to obtain a set of risk candidate paths. The path scores are then corrected by setting risk assessment weights, and strategy code is generated for risk management.

Benefits of technology

It achieves unified cross-language mapping and joint reachability and taint analysis, which reduces false alarm rates, improves the accuracy of risk assessment, makes the remediation order better reflect actual risk, reduces duplicate data collection and calculation, and forms a closed loop from discovery to remediation. The result is more accurate discovery, faster handling, and a shorter process.

✦ Generated by Eureka AI based on patent content.


Figure CN121615142B_ABST


Abstract

The application relates to the technical field of electric digital data processing, and specifically discloses a product information risk assessment method and system based on a large model. The method comprises the following steps: a scanner scans the source code and the artifacts simultaneously to generate a source dependency list and an artifact component list, and a large model normalizes and aligns the main names; the repository and version are locked according to the source dependency list, and a parser parses the original code to generate a cross-language path logic graph group; the actual packaged versions are corrected using the artifact component list, and the results are merged into a unified intermediate representation; joint reachability and taint analysis is performed on the unified intermediate representation to obtain a risk candidate path set and complete path scoring; finally, the scores are corrected according to preset weights and runtime evidence, risk paths are screened out and matched with disposal strategies, and the large model generates executable strategy code for risk control on the gateway or service side.


Description

Technical Field

[0001] This invention relates to the field of electronic digital data processing technology, specifically to a product information risk assessment method and system based on a large model.

Background Technology

[0002] Current product information risk assessments are typically implemented through an end-to-end process of checklist creation, modeling, verification, access control, and monitoring. First, the assessment scope is defined, an asset inventory and data map (code and artifact inventory, third-party dependencies, interfaces and data flows, sensitive information classification, and compliance guidelines) are completed, and threat modeling and control baseline mapping are conducted. Then, through static / component / dependency scanning, code review and key leakage detection, configuration baseline verification, dynamic / interface security testing and privacy impact assessment, runtime monitoring, and audit log sampling, the probability of occurrence multiplied by the impact is quantified to form a risk matrix and remediation list. After deployment, continuous monitoring is performed using KPIs / KRIs and alarm thresholds, combined with penetration testing, drills, and incident debriefing for model calibration and baseline updates, thereby achieving traceable assessment results, closed-loop remediation, and continuous improvement.

[0003] For example, Chinese invention patent application CN119577790B discloses a software risk assessment method, apparatus, storage medium, program product, and device. The method includes: parsing the code file of the software to be assessed to obtain code information, which is used to indicate the components and library functions corresponding to the code file; extracting features based on the code information to obtain word vector features, and determining a word vector risk score based on the word vector features; performing risk assessment on the components based on the code file to obtain a first static feature risk score, and performing risk assessment on the library functions to obtain a second static feature risk score; determining a static feature risk score based on the first static feature risk score and the second static feature risk score; and determining an overall risk score based on the word vector risk score and the static feature risk score, thereby identifying the risks present in the software.

[0004] For example, Chinese invention patent application CN114528195A discloses a hierarchical classification and quantitative risk assessment method for open-source software in railway systems. This method mainly includes: scanning the open-source software source code library in the information system; obtaining the risk assessment value out1 for software that depends on open-source software; obtaining the risk assessment value out2 for software that does not depend on open-source software; performing a weighted average of out1 and out2 based on the protocol weight λ and code weight μ; the weighted average is val = λ*out1 + μ*out2; and obtaining the risk assessment value val for the open-source software. The larger this value, the greater the risk of using the software in the current system.

[0005] An analysis of the above technical solutions shows that most existing product information risk assessment solutions rely on static evidence scoring and linear weighting: they score a product from its dependency list, versions, and a few static indicators, which amounts to simply adding separate pieces of evidence together. In practice, however, software built from multiple code bases usually mixes multiple languages, multiple repositories, and multiple dependencies, and often involves dynamic loading, automatic code generation, private packages, and container images, as well as rapid CI/CD deployment and in-service monitoring.

[0006] Therefore, existing technologies easily overlook the transitive dependencies and hidden call paths between nodes, and support only a few languages or static compilation chains. As a result, risk assessment solutions have insufficient cross-language and dynamic coverage, and vulnerabilities or defects flagged as high-risk at the manifest or version level are readily treated as high-priority even when they are not reachable during product execution. This produces numerous but inaccurate alerts, distorted remediation priorities, and misallocated remediation resources, which not only slows remediation and increases costs but also amplifies the risk of compliance and security incidents.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, this invention provides a product information risk assessment method and system based on a large model, which can effectively solve the problems mentioned in the background technology.

[0008] To achieve the above objectives, the present invention provides the following technical solution. The first aspect of the present invention provides a product information risk assessment method based on a large model, comprising:
S1. Scanning the original code repository and code artifacts of the software to be assessed using a scanner to obtain a source dependency list and an artifact component list of the software to be assessed, aligning the two lists after unifying the main names based on the large model, and identifying the source-product differences between the two lists;
S2. Based on the source dependency list, having a parser fetch and parse the original code of the software to be assessed to generate a path logic graph group, and producing a unified intermediate representation based on the artifact component list and the path logic graph group;
S3. Performing reachability and taint analysis on the unified intermediate representation to obtain a risk candidate path set, and scoring the risk of the risk candidate path set;
S4. Correcting the risk scores by setting risk assessment weights, screening out risk paths and matching disposal strategies, with the large model writing the disposal strategies into strategy code for risk management.

[0009] The second aspect of this invention provides a product information risk assessment system based on a large model, comprising:
a source-product difference identification module, used to scan the original code repository and code artifacts of the software to be assessed using a scanner to obtain a source dependency list and an artifact component list of the software to be assessed, and to align the two lists after unifying the main names based on the large model so as to identify the source-product differences between the two lists;
an intermediate representation output module, used to have a parser fetch and parse the original code of the software to be assessed based on the source dependency list to generate a path logic graph group, and to produce a unified intermediate representation based on the artifact component list and the path logic graph group;
a path risk scoring module, used to perform reachability and taint analysis on the unified intermediate representation to obtain a risk candidate path set, and to score the risk of the risk candidate path set;
and a risk management module, used to correct the risk scores by setting risk assessment weights, screen out risk paths and match disposal strategies, and have the large model write the disposal strategies into strategy code for risk management.

[0010] Compared with the prior art, the embodiments of the present invention have at least the following advantages or beneficial effects:

[0011] (1) This invention provides a product information risk assessment method and system based on a large model. First, a scanner scans the source code and the artifacts simultaneously to generate a source dependency list and an artifact component list. The large model then unifies and aligns the main names, which quickly identifies source-product differences and focuses attention on the hotspots of difference, reducing invalid troubleshooting. Next, the repository and version are locked based on the source dependency list, and the parser parses the original code to generate a cross-language path logic graph group. The actual packaged versions are then corrected using the artifact component list and the results are aggregated into a unified intermediate representation, so that the object of analysis is fully consistent with the current build and false alarms caused by scoring the wrong version are reduced. Reachability and taint analysis are then performed on the unified intermediate representation to obtain a set of risk candidate paths and complete path scoring. This measures both whether a path is reachable and whether externally controllable data can reach sensitive points, bringing the priority closer to actual exploitability. Finally, the scores are corrected and screened based on preset weights and runtime evidence, and risk paths are matched with disposal strategies. The large model generates executable strategy code for risk management on the gateway or service side, forming a closed loop from discovery to governance and supporting canary (gray-scale) release and rollback to reduce go-live risk.

[0012] (2) This invention evaluates risk scores from multiple variables. Compared with the traditional approach that relies mainly on a list of known vulnerabilities or only on reachability statistics, this solution introduces multidimensional information such as reachability evidence, taint evidence, runtime hits, authentication strength, exposure surface, and version truth value. It also raises the weight of shadow differences and lowers the weight of missing differences, which significantly reduces false alarms for vulnerabilities that are high-risk on paper but unreachable. The remediation order overlaps more closely with the actual attack surface, and releases are suspended and reworked less often.

[0013] (3) In terms of multivariate reuse, compared with the data-silo approach in which each stage operates independently, this solution merges the source dependency list and the artifact component list into a unified graph, and places runtime coverage, authentication evidence, and compliance metadata in the same namespace and graph model. The same fact can be reused in the scanning, mapping, analysis, scoring, strategy, and review stages. Runtime evidence can be written back to raise confidence, and the effect of an applied strategy feeds back into the evaluation, which reduces repeated collection and calculation and lets the score correct itself automatically as evidence accumulates.

[0014] (4) Overall, compared with the prior art, the solution provided by the present invention integrates list alignment, cross-language unified graph construction, joint reachability and taint analysis, runtime correction, and strategy generation and distribution into a closed loop, and uses a large model to solve traditional difficulties such as master name unification, completion of missing dynamic edges, and strategy code generation. It therefore offers more accurate discovery, faster handling, a shorter process, and better support for microservice and multi-language scenarios.

Attached Figure Description

[0015] The present invention will be further described with reference to the accompanying drawings, but the embodiments in the drawings do not constitute any limitation on the present invention. For those skilled in the art, other drawings can be obtained based on the following drawings without creative effort.

[0016] Figure 1 is a schematic diagram of the method steps of the present invention.

[0017] Figure 2 is a schematic diagram of the system module connections of the present invention.

[0018] Figure 3 is a flowchart of the closed-loop process for software information risk assessment.

[0019] Figure 4 is a flowchart illustrating the execution process of a canary release strategy.

Detailed Implementation

[0020] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

[0021] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.

[0022] The entire process, from version evaluation before release to the formation of closed-loop governance, is shown in Figure 3, the closed-loop flowchart for software information risk assessment. The process begins by scanning the source code and artifacts to generate a source dependency list and an artifact component list. After the large model unifies the main names, the two lists are aligned and source-product differences are identified. Next, the repository and version are located based on the source dependency list, the original code is parsed to generate cross-language call graphs and data flow graphs, and the actual packaged versions are corrected using the artifact component list, producing a unified intermediate representation. Reachability and taint analysis are performed on this unified intermediate representation to obtain risk candidate paths and complete initial scoring. The scores are then adjusted based on preset weights and runtime evidence, risk paths are screened, and handling strategies are matched. The large model generates executable strategy code, which takes effect after being validated and tested on gateways or service meshes. Runtime observation continuously collects coverage hits, call frequency, timing anomalies, and high-entropy signals, and writes this evidence back into the analysis model to correct the scores dynamically. Based on the corrected scores, a decision is made to allow, canary-release, or block the release, forming a closed loop from discovery to governance and reassessment.

[0023] Referring to Figure 1, the first aspect of the present invention provides a product information risk assessment method based on a large model. The method includes: S1. Scanning the original code repository and code artifacts of the software to be assessed using a scanner to obtain the source dependency list and the artifact component list of the software to be assessed, respectively. After the main names are unified based on the large model, the two lists are aligned to identify the source-product differences between them.

[0024] The aforementioned source dependency list refers to the list of declaration / intent layer components extracted from the source code of the software to be evaluated. It reflects the dependencies and version constraints (including direct and transitive dependencies) claimed by the project in the code repository and build configuration, source repositories / private repositories, license declarations, scopes, feature switches, target platforms, etc. It answers what is planned / declared to be used and is often derived from parsing pom.xml, package.json, requirements.txt, go.mod, lock files, and build scripts.

[0025] The aforementioned artifact component list refers to the fact / result layer component list scanned from the built deliverables / runtimes (such as JAR / WHL, binary, container images). It records the exact versions, files / paths, checksums and signatures, detected licenses, suppliers and download sources, image base layers and system packages, etc., of the components actually packaged into the artifact. It answers what was actually packaged / deployed, and is used to verify source credibility, discover undeclared components (shadows), version drift, license mismatches, etc., and compare it with the source dependency list to identify source artifact differences.

[0026] In the risk testing phase before software release (within CI/CD, before entering pre-release / production), the repository is scanned first to obtain the declared dependencies and versions and to locate the repository and commit (the source dependency list, reflecting intent). The artifact / image is then scanned to obtain the components actually present together with their exact versions, signatures, and licenses (the artifact component list, reflecting the facts). The two are cross-validated. If a component is present in the artifact but absent from the repository, there are undeclared embedded / base image components (shadow), which is a high supply chain / compliance risk. If a component is present in the repository but absent from the artifact, it is most likely unpackaged, feature-disabled, or dead code (missing), and can usually be marked as an unreachable candidate and downgraded. On this basis, source-product differences are identified quickly before the release gate, which reduces false positives and improves the accuracy of remediation priority.

[0027] Aligning the two lists after unifying the master names based on the large model proceeds as follows. First, component identifiers (name, purl, group / artifact, file traces, etc.) are extracted from the source dependency list and the artifact component list, and a rule base / alias dictionary is applied for a first round of normalization (letter case, separators, namespaces, and common aliases such as lodash↔lodash-es and reload4j↔log4j). Next, retrieval and semantic matching by the large model are used to normalize the master names of ambiguous items, outputting a mapping from candidate master names to real master names together with a confidence value. At the same time, version semantics (range / lock / backported patch) and source / license metadata are aligned. All entries in both lists are then replaced with their master names to generate an alignment table, which is deduplicated and merged. After identical names are aggregated, the intersection and differences are calculated, yielding missing items (present in the source but not in the artifact), shadow items (present in the artifact but not in the source), and version drift. Low-confidence mappings trigger manual review or runtime verification. Finally, the master name alignment results and a difference report with evidence (string samples, file paths, signatures / hashes) are output for subsequent reachability analysis and scoring.
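The rule-based first pass and the alignment table described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the alias dictionary entries, the version strings, and the function names normalize and align are assumptions, the large-model matching step is omitted, and version drift is detected by a plain string comparison rather than by range semantics.

```python
import re

# Hypothetical alias dictionary: alias -> master name (illustrative entries only).
ALIASES = {"lodash-es": "lodash", "reload4j": "log4j"}


def normalize(name: str) -> str:
    """Rule-based first pass: lowercase, unify separators, apply aliases."""
    key = re.sub(r"[_\s]+", "-", name.strip().lower())
    return ALIASES.get(key, key)


def align(source: dict, artifact: dict) -> dict:
    """Build an alignment table keyed by master name.

    Both inputs map a raw component name to a version string.
    """
    table = {}
    for side, listing in (("source", source), ("artifact", artifact)):
        for raw, version in listing.items():
            row = table.setdefault(normalize(raw), {"source": None, "artifact": None})
            row[side] = version
    for row in table.values():
        if row["source"] is None:
            row["status"] = "shadow"         # packaged but not declared
        elif row["artifact"] is None:
            row["status"] = "missing"        # declared but not packaged
        elif row["source"] != row["artifact"]:
            row["status"] = "version_drift"  # simplistic exact comparison
        else:
            row["status"] = "aligned"
    return table


# Hypothetical input lists with made-up version numbers.
table = align(
    {"Spring_Web": "6.1.0", "lodash": "4.17.21", "snakeyaml": "2.0"},
    {"spring-web": "6.1.0", "lodash-es": "4.17.20", "curl": "8.5.0"},
)
```

In a fuller pipeline, entries that the rules cannot resolve would be passed to the large model, and low-confidence mappings would be queued for manual review as the paragraph describes.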

[0028] Specifically, the process of identifying the source-product differences between the two lists is as follows:

[0029] Perform set operations on the source dependency list and the artifact component list. If there is a missing difference or a shadow difference between them, it is determined that a source-product difference exists between the two lists.

[0030] Each list represents the set of components of the software product. A missing difference means that a component exists in the source dependency list but not in the artifact component list. A shadow difference means that a component exists in the artifact component list but not in the source dependency list.

[0031] In one specific embodiment:

[0032] Source dependency list: {spring-web, snakeyaml, lodash};

[0033] Artifact component list: {spring-web, snakeyaml, curl};

[0034] missing={lodash} (declared to exist but not packaged);

[0035] shadow={curl} (packaged without being declared).
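The set operation in this embodiment can be reproduced directly with Python sets. The sketch below assumes the YAML library is spelled snakeyaml in both lists (an assumption about the intended component name); the function name diff_lists is hypothetical.

```python
def diff_lists(source_list: set, artifact_list: set):
    """Return (missing, shadow, has_difference) for two component sets."""
    missing = source_list - artifact_list  # declared but not packaged
    shadow = artifact_list - source_list   # packaged but not declared
    return missing, shadow, bool(missing or shadow)


missing, shadow, has_difference = diff_lists(
    {"spring-web", "snakeyaml", "lodash"},
    {"spring-web", "snakeyaml", "curl"},
)
```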

[0036] When the source-product difference between the two lists is a missing difference, the code path corresponding to the component is marked as an unreachable candidate, and the base risk assessment weight of that path is set to the minimum risk assessment weight.

[0037] When the source-product difference between the two lists is a shadow difference, the base risk assessment weight of the corresponding path is set to the risk assessment adaptation weight.

[0038] In this embodiment, pre-release risk assessment concentrates on the areas most likely to cause problems. A missing difference usually means that a component was not packaged, that a feature is disabled, or that the code is dead; its actual triggerability is low, so by default it should be downgraded and either placed under observation or scheduled for planned repair. A shadow difference usually comes from the base image itself, from repackaging, or from unmanaged embedded components. Such components not only bypass declaration and audit but may also carry unknown licenses or sources and known vulnerabilities, and they have a larger actual exposure surface. They should therefore be given a higher weight so that reachability / taint verification and compliance checks are prioritized, and they may trigger stricter release strategies (canary release or blocking).
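The weight assignment for missing and shadow differences might look like the sketch below. The numeric weights are purely illustrative assumptions (the text gives no values), and PathInfo and apply_difference are hypothetical names.

```python
from dataclasses import dataclass

MIN_WEIGHT = 0.1      # hypothetical minimum risk assessment weight
SHADOW_WEIGHT = 1.5   # hypothetical raised (adaptation) weight for shadow components
DEFAULT_WEIGHT = 1.0  # hypothetical weight for components present in both lists


@dataclass
class PathInfo:
    component: str
    base_weight: float = DEFAULT_WEIGHT
    unreachable_candidate: bool = False


def apply_difference(path: PathInfo, missing: set, shadow: set) -> PathInfo:
    """Set the base weight of a path from the source-product difference."""
    if path.component in missing:
        # Declared but not packaged: downgrade and keep under observation.
        path.unreachable_candidate = True
        path.base_weight = MIN_WEIGHT
    elif path.component in shadow:
        # Packaged but not declared: prioritize verification.
        path.base_weight = SHADOW_WEIGHT
    return path
```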

[0039] S2. Based on the source dependency list, the parser calls and parses the original code of the software to be evaluated, generates a path logic graph group, and produces a unified intermediate representation based on the product composition list and the path logic graph group.

[0040] Further, the specific process for producing the unified intermediate representation is as follows:

[0041] The source dependency list determines the repository address and version constraints for parsing the original code, and the parser then calls and parses the original code.

[0042] Specifically, the process first reads the repo_url, module_path, commit / tag or version constraints, and lock files (such as package-lock.json, go.sum) from the source dependency list. If necessary, it combines build tracing to resolve the constraints to a unique commit. The parser then checks out a code snapshot according to the repository address and version, applies the profile / feature flag used in this build, and automatically incorporates monorepo submodules, generated code, and vendored / embedded source code. From the checked-out snapshot, the language frontend generates abstract syntax trees, control flow graphs, and data flow graphs, which are summarized into a unified call graph and a unified data flow graph (including file paths, line and column ranges, version fingerprints, and other metadata) as the base graph for subsequent reachability / taint analysis. At the same time, the parsing artifacts, snapshot hashes, and dependency resolution logs are cached on disk to ensure reproducibility, auditability, and incremental analysis.

[0043] The path logic graph group includes the abstract syntax tree, control flow graph, and data flow graph.

[0044] Abstract Syntax Tree (AST): A hierarchical structure that breaks down source code into statements, functions, calls, and variables. Control Flow Graph: The path the program takes (if / else, loops, exceptions). Data Flow Graph: Where data comes from and where it goes (variable assignment, parameter passing).

[0045] Abstract syntax trees contain functions and symbol tables, control flow graphs contain function call relationships, and data flow graphs contain variable flow relationships.
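For a single language frontend, the function symbols and call relationships described above can be extracted roughly as follows. This is a minimal sketch for Python source only, using the standard ast module; the sample program and the helper name functions_and_calls are hypothetical, and control flow and data flow edges are not built here.

```python
import ast

# Hypothetical sample program to analyze.
SOURCE = '''
def read_input():
    return input()

def handle():
    data = read_input()
    save(data)

def save(value):
    print(value)
'''


def functions_and_calls(source: str):
    """Return (function names, caller->callee pairs) found in the source."""
    tree = ast.parse(source)
    symbols = set()
    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            symbols.add(node.name)
            # Collect direct calls by simple name inside this function body.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.add((node.name, inner.func.id))
    return symbols, calls


symbols, calls = functions_and_calls(SOURCE)
```

Attribute calls such as obj.method() and dynamically resolved targets are deliberately ignored in this sketch; the fixed-pattern handling described later in the text addresses the dynamic cases.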

[0046] Functions in the abstract syntax tree are unified into nodes, function calls in the control flow graph are unified into directed edges, and variable flows in the data flow graph are unified into data edges, producing a unified intermediate representation graph. The component nodes on the unified intermediate representation graph are corrected by the product composition list.

[0047] During the alignment phase, the unified intermediate representation is corrected against the artifact. First, the version number of each third-party library node is replaced with the exact version actually packaged. If a shadow component is found, that is, one that exists in the artifact but is not declared in the manifest, the corresponding component node is added to the unified intermediate representation together with its related call / data flow edges. Conversely, for a missing item that is declared in the manifest but not packaged in the artifact, the confidence of its related nodes / edges is reduced and it is marked as an unreachable candidate, so that subsequent scoring downgrades it by default and it is retained only for observation and re-evaluation.
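A rough sketch of this correction step is given below, under stated assumptions: the classes ComponentNode and UnifiedIR, the function correct_with_artifact, and the confidence penalty of 0.5 are all hypothetical, and the addition of call / data flow edges for shadow components is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

MISSING_PENALTY = 0.5  # hypothetical confidence multiplier for missing items


@dataclass
class ComponentNode:
    name: str
    version: Optional[str]
    packaged: bool = True
    shadow: bool = False
    confidence: float = 1.0
    unreachable_candidate: bool = False


@dataclass
class UnifiedIR:
    components: Dict[str, ComponentNode] = field(default_factory=dict)
    call_edges: Set[Tuple[str, str]] = field(default_factory=set)  # directed edges
    data_edges: Set[Tuple[str, str]] = field(default_factory=set)  # data edges


def correct_with_artifact(ir: UnifiedIR, artifact_versions: Dict[str, str]) -> UnifiedIR:
    """Correct component nodes using the artifact component list."""
    for name, version in artifact_versions.items():
        node = ir.components.get(name)
        if node is None:
            # Shadow component: packaged but never declared, so add a node.
            ir.components[name] = ComponentNode(name, version, shadow=True)
        else:
            # Replace the declared version with the version actually packaged.
            node.version = version
    for name, node in ir.components.items():
        if name not in artifact_versions:
            # Missing item: declared but not packaged.
            node.packaged = False
            node.confidence *= MISSING_PENALTY
            node.unreachable_candidate = True
    return ir
```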

[0048] The call graph and data flow graph are extracted based on the projection of the unified intermediate representation graph.

[0049] On the unified graph, each node / edge is associated with its origin and supporting metadata: including repository location (repo_url, commit / tag, module_path), build fingerprint (source code snapshot hash / lock file hash), artifact location (artifact_id / image name and tag, image layer, file path and checksum), whether it was actually packaged (packaged=true / false, source: base image / system package / application packaged), version truth value (actual version vs. declared version), and evidence confidence / source (static matching / runtime hit / LLM (large model) edge supplementation). In this way, every call or data flow can be traced back to which repository, which version, and whether the entity exists in the image, facilitating auditing, reproduction, and risk assessment.
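One way to picture the metadata attached to each node or edge is a small record type. The sketch below is an assumption about layout only: the field names follow the items listed in the paragraph, while the class name Provenance, the helper version_drifted, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    # Repository location
    repo_url: str
    commit: str
    module_path: str
    # Build fingerprint
    snapshot_hash: str
    # Artifact location
    artifact_id: Optional[str]
    file_path: Optional[str]
    checksum: Optional[str]
    # Whether the entity was actually packaged, and from where
    packaged: bool
    package_source: Optional[str]  # e.g. "base image", "system package", "application packaged"
    # Version truth value
    declared_version: Optional[str]
    actual_version: Optional[str]
    # Evidence source and confidence
    evidence_source: str  # e.g. "static matching", "runtime hit", "LLM edge supplementation"
    confidence: float


def version_drifted(p: Provenance) -> bool:
    """True when both versions are known and they disagree."""
    return (
        p.declared_version is not None
        and p.actual_version is not None
        and p.declared_version != p.actual_version
    )
```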

[0050] Specifically, producing the unified intermediate representation further includes:

[0051] In the call graph, the original code of the software to be evaluated is searched for a preset fixed pattern to obtain a set of fixed patterns in the original code. Each fixed pattern is then split into three elements to obtain the field combination of the fixed pattern.

[0052] The aforementioned fixed patterns are code or configuration that reliably and systematically exposes information about the call target, indicating which target will be invoked at runtime. Typical sources:

[0053] Reflection / Dynamic Invocation:

[0054] Class.forName("com.acme.Auth").getMethod("doLogin") (Java).

[0055] Dynamic import:

[0056] getattr(importlib.import_module("auth.handlers"), "login") (Python).

[0057] Plugin / Policy Name: exporter=CsvExporter (Configuration / Convention).

[0058] Routing / conventional loading: require("./controllers/" + name) (Node.js).

[0059] Macro / code generation: REGISTER_HANDLER(Foo, handle) (C / C++ / Rust, etc.).

[0060] These patterns typically contain key information such as namespaces, classes, and functions, making them easy to extract automatically.

[0061] The ternary decomposition involves normalizing the target extracted from each fixed pattern into a unified symbolic key:

[0062] <Namespace ns, owner (class / type / module, can be empty), member (method / function / field, can be empty)>.

[0063] A unified cross-language representation is achieved through ternary decomposition, which facilitates precise or fuzzy matching in the symbol table of the Abstract Syntax Tree (AST); missing items can be filled with null, and subsequent completion and disambiguation can be achieved by combining import / directory / framework conventions / LLM; after successful matching, edges can be added to the call graph / dataflow graph.

[0064] Example:

[0065] 1) Java Reflection

[0066] Source code: Class.forName("com.acme.Auth").getMethod("doLogin");

[0067] Three elements: <com.acme, Auth, doLogin>;

[0068] Matched AST symbol: com.acme.Auth#doLogin(...) → Add a calls edge to the call graph.

[0069] 2) Dynamic import in Python

[0070] Source code: m = importlib.import_module("auth.handlers");

[0071] fn=getattr(m,"login");

[0072] Three elements: <auth.handlers, null, login>;

[0073] Matches the module-level function auth.handlers.login.

[0074] 3) Node.js Conventional Loading

[0075] Source code: app.use("/api/"+v, require("./api/"+v));

[0076] When v="users", the three elements are: <app.api, users, index> (owner=module name, member=default export / entry point);

[0077] Match: app.api.users.index(...) (depending on project conventions).
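
The first two examples above can be sketched as a small extractor that turns a fixed pattern into the symbolic key <ns, owner, member>. This is a minimal illustration under stated assumptions: the regular expressions, the `SymbolKey` type, and the function names are hypothetical and cover only the literal-string forms shown in the examples, not every reflection or dynamic-import style.

```python
import re
from typing import NamedTuple, Optional


class SymbolKey(NamedTuple):
    ns: Optional[str]      # namespace
    owner: Optional[str]   # class / type / module, may be None
    member: Optional[str]  # method / function / field, may be None


# Hypothetical patterns for the literal-string cases in the examples.
JAVA_REFLECTION = re.compile(r'Class\.forName\("([\w.]+)"\)\.getMethod\("(\w+)"')
PY_IMPORT = re.compile(r'import_module\("([\w.]+)"\)')
PY_GETATTR = re.compile(r'getattr\(\s*\w+\s*,\s*"(\w+)"\s*\)')


def decompose_java(source: str) -> Optional[SymbolKey]:
    m = JAVA_REFLECTION.search(source)
    if m is None:
        return None
    qualified, method = m.group(1), m.group(2)
    # Split "com.acme.Auth" into namespace "com.acme" and owner "Auth".
    ns, _, owner = qualified.rpartition(".")
    return SymbolKey(ns or None, owner, method)


def decompose_python(source: str) -> Optional[SymbolKey]:
    mod = PY_IMPORT.search(source)
    if mod is None:
        return None
    attr = PY_GETATTR.search(source)
    # A module-level function has no owner, so that field is None.
    return SymbolKey(mod.group(1), None, attr.group(1) if attr else None)


java_key = decompose_java('Class.forName("com.acme.Auth").getMethod("doLogin");')
py_key = decompose_python('m = importlib.import_module("auth.handlers"); fn=getattr(m,"login");')
```

Here `java_key` corresponds to <com.acme, Auth, doLogin> and `py_key` to <auth.handlers, null, login>, matching the three elements given in the examples.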

[0078] The field combination of each fixed pattern is matched against the existing nodes in the abstract syntax tree symbol table. If all fields of the combination match a single existing node in the abstract syntax tree, that node is recorded as the hit node. The code function node containing the fixed pattern is then connected to the hit node, and the corresponding edge is added to the call graph.

[0079] If the field combination of a fixed pattern hits several existing nodes in the abstract syntax tree, these nodes are recorded as a candidate node set. The candidate node set is then disambiguated and normalized to obtain a unique candidate node. The code function node containing the fixed pattern is connected to the unique candidate node, and the corresponding edge is added to the call graph.

[0080] The disambiguation and normalization process described above is as follows: First, the namespaces, class or module names, and member names of candidate nodes are standardized and common aliases are merged. Then, targets that cannot be referenced are filtered based on the import relationships and visibility of the candidate nodes. Next, precise matching is performed using the number of formal parameters, type signatures, and directory and framework conventions. Simultaneously, targets that have not been packaged or have been disabled are eliminated using the actual package information and configuration switches from this build. If multiple candidates still exist, runtime evidence such as coverage hits and call frequency are introduced as weights and combined with the large model to rank the contextual semantics. After passing the minimum validation, the one with the highest score and sufficient evidence is selected as the unique node. If all candidate scores are close or the evidence is insufficient, the node is not unified as a unique node and its confidence is lowered for further runtime verification or manual review.

[0081] The aforementioned targets that cannot be referenced are candidate symbols that, in the current file and the actual context of this build, do not meet the conditions for being called or accessed by this code. Common cases are as follows, all of which can be eliminated directly during disambiguation.

[0082] Type 1: scope or visibility is not satisfied:

[0083] For example, in Java, a package-private class located in another package, or a private method of another class, is not visible at the current location. In Python, a module function that has not been imported into the current namespace is not visible; in TypeScript, symbols that are not exported are not visible in external files.

[0084] Type 2: the dependency or artifact is missing:

[0085] The library containing the candidate function was not actually packaged into the artifact in this build and does not appear in the artifact list, or the module was turned off by a build switch, so it is not on the classpath or module path at runtime.

[0086] Type 3: the platform or conditional compilation is incompatible:

[0087] The candidate implementation is compiled only under Windows or a specific architecture, whereas the current build targets Linux or a different architecture; in C / C++, a different set of macro paths is taken, so the implementation does not appear in the current executable.
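
The disambiguation steps described in [0080] and the elimination rules above can be sketched as a filter followed by a ranking. This is a simplified illustration under stated assumptions: the `Candidate` fields, the use of raw runtime hit counts as the only score, and the `min_margin` threshold are hypothetical, and the large-model ranking step is omitted entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    symbol: str
    visible: bool       # importable and visible from the call site
    packaged: bool      # present in this build's artifact list, not disabled
    arity: int          # number of formal parameters
    runtime_hits: int   # coverage hits observed at runtime


def disambiguate(cands: List[Candidate], call_arity: int,
                 min_margin: int = 1) -> Optional[Candidate]:
    # Drop targets that cannot be referenced or whose signature does not fit.
    viable = [c for c in cands
              if c.visible and c.packaged and c.arity == call_arity]
    if not viable:
        return None
    if len(viable) == 1:
        return viable[0]
    # Rank the remaining candidates by runtime evidence.
    ranked = sorted(viable, key=lambda c: c.runtime_hits, reverse=True)
    if ranked[0].runtime_hits - ranked[1].runtime_hits >= min_margin:
        return ranked[0]
    # Scores too close: leave unresolved for runtime verification or review.
    return None


candidates = [
    Candidate("com.acme.Auth#doLogin", True, True, 1, 5),
    Candidate("com.legacy.Auth#doLogin", True, False, 1, 9),   # not packaged
    Candidate("com.test.Auth#doLogin", True, True, 1, 0),
]
chosen = disambiguate(candidates, call_arity=1)
```

Returning `None` when the top two scores are too close mirrors the rule that such nodes are not unified and are left for further verification.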

[0088] If disambiguation and normalization fail to yield a unique candidate node, the large model infers the intent from the context and labels one node in the candidate node set as the unique labeled node. The code function node containing the fixed pattern is connected to the unique labeled node, and the corresponding edge is added to the call graph. At the same time, the large model outputs the edge-completion confidence.

[0089] The so-called labeling process involves assigning the judgment of the large model, after fully understanding the context, to a specific code point in the candidate set and recording this decision as traceable metadata. First, the context of the code segment containing the fixed pattern is fully collected, including import relationships, directories and namespaces, parameters and types of the call point, framework annotations, build configurations, and whether it has been packaged. This context, along with the list of candidate nodes, is then provided to the large model one by one, asking it to rank the most likely targets and provide reasons based on semantics and scenario. Subsequently, a minimum validation filter is used to filter unreasonable candidates, such as signature mismatches, not being packaged, or not being visible, retaining only those that pass the validation. The model gives an initial score to each candidate, selecting the one with the highest score as the unique labeled node, with the initial score output as the confidence value. Finally, a label and evidence are added to this node on the unified graph, including the selection reason, the context segment involved in the decision, the confidence value, and time and version fingerprints, facilitating subsequent auditing, rollback, and automatic adjustment of the confidence level when runtime evidence arrives.

[0090] It needs to be explained that some code does not have a fixed pattern (it neither writes reflection strings nor follows conventional paths, or the target is mapped from complex branches / templates / ORMs). In this case, a large model inference is used: the code snippet where the clue is located, along with the import list, directory / namespace, type and signature, comments, framework annotations, files / classes with the same name, historical commit information, and other context, are fed into the model. The model is asked to generate the most likely function / class / module candidates and sort them and assign them confidence scores. Then, minimal checks (whether the symbol exists, whether the signature matches, whether the configuration / feature is enabled, whether the artifact component list has been packaged) are used to filter out obvious errors. The candidates that pass the checks are added to the unified graph with low / medium confidence edges, and automatically upgraded / downgraded during the testing / grayscale phase based on coverage hits and log evidence, eventually converging to the high-confidence target.

[0091] S3. Perform reachability and taint analysis on the unified intermediate representation to obtain a set of risk candidate paths, and score the risk candidate paths.

[0092] Furthermore, reachability and taint analysis are performed on the unified intermediate representation to obtain a set of risk candidate paths. The specific analysis process is as follows:

[0093] Extract the entry point of the external request as the starting point, and traverse the nodes along the call chain path from the entry point on the call graph. Determine whether the call chain path can reach the known dangerous function node. If it can reach the known dangerous function node, mark the path as reachable and record the path reachability score as 1. If it cannot reach the known dangerous function node, mark the path as unreachable and record the path reachability score as 0. This completes the reachability analysis of the call graph.

[0094] The aforementioned entry points of external requests refer to execution entry points exposed to external entities (users, third-party services, client applications, scripts, partner systems), serving as the starting point of a request within the software execution system. Typical examples include: HTTP / HTTPS API routes, WebSocket events, RPC / gRPC interfaces, message queue consumers, file upload / callback receivers, CLI / batch processing trigger points, and webhooks exposed by scheduled tasks. In the analysis, these entry points determine which code paths may be driven by external traffic, and they are often associated with security controls such as authentication, rate limiting, and CSRF protection.

[0095] The aforementioned known dangerous function nodes refer to function nodes marked in the risk point directory that, if reached (or if their key parameters receive externally controllable data), could cause security consequences. Examples include command execution (exec / Runtime.exec), dynamic evaluation (eval), deserialization / template rendering, SQL / NoSQL execution, XML parsing (XXE), file read / write / external transmission, reflection loading, weak encryption / random numbers, etc. They exist because languages and frameworks / third-party libraries, in order to provide powerful capabilities and versatility, inevitably expose these high-authority operations; furthermore, numerous historical CVEs and abuse patterns have proven that they become attack surfaces when configurations are default, inputs are not validated, authentication is lacking, or scenarios are misused. Therefore, they are modeled as known dangerous function nodes in the unified graph for focused checking and weighting during reachability analysis, scoring, and release decisions.
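
The reachability analysis of [0093] can be sketched as a breadth-first traversal of the call graph from an entry point. This is a minimal illustration: representing the call graph as an adjacency dictionary, and the node names in the example, are assumptions.

```python
from collections import deque
from typing import Dict, List, Set


def reachability_score(call_graph: Dict[str, List[str]], entry: str,
                       dangerous: Set[str]) -> int:
    """Return 1 if any known dangerous node is reachable from entry, else 0."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        if node in dangerous:
            return 1
        for callee in call_graph.get(node, []):
            if callee not in seen:   # visit each node once, so cycles terminate
                seen.add(callee)
                queue.append(callee)
    return 0


# Hypothetical call graph: an HTTP route that eventually reaches Runtime.exec.
graph = {
    "POST /api/parse": ["ParseController.parse"],
    "ParseController.parse": ["Converter.run", "Logger.info"],
    "Converter.run": ["Runtime.exec"],
    "GET /health": ["Health.ok"],
}
risky = {"Runtime.exec"}
```

In this example the path from `POST /api/parse` scores 1 and the path from `GET /health` scores 0.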

[0096] Externally controllable data is extracted, and its starting point is designated as the source. In the data flow graph, the value node of each source is marked as tainted and placed in the taint analysis work queue. Nodes are retrieved from the work queue on a first-in-first-out basis and propagated forward along the edges of the data flow graph. If a preset propagation feature is encountered, the taint label is propagated to the next hop; if a preset cleaner is encountered, the taint is removed.

[0097] Externally controllable data specifically refers to data payloads that enter the system through the aforementioned entry points and whose values are determined externally; these are the sources of taint analysis. Common examples include: HTTP request paths / queries / forms / JSON fields, request headers / cookies, uploaded files and their metadata, WebSocket / message queue message bodies, third-party callback parameters, URL template parameters, configurations / instructions from the client, and environmental inputs (such as cross-process stdin). These are untrusted and are marked as taints by default, propagating along the data flow. Only after passing through whitelist verification, strong type parsing, encoding / de-identification, and other cleansing processes can the risk level be reduced.

[0098] The propagation characteristics described above are a set of rules / patterns in taint analysis that describe how data propagates along a program, used to pass taints from one node to the next in the data flow graph. Typical examples include: assignment and aliasing (b=a), operations / concatenation (c=a+b retains the taint), container writes and reads (arr[i]=x and y=arr[i]), cross-procedure parameter passing (actual parameter → formal parameter), return value postback (ret=f(...)), and cross-service / cross-language bridging (RPC / HTTP / messages). Propagation characteristics define which edges should continue propagating and how the tainted attributes change, serving as the basis for the work queue algorithm's progression.

[0099] The aforementioned cleaners refer to code points / functions, together with their rules, that can reduce or remove taint risk. When a taint passes through a cleaner, it is downgraded or removed according to preset semantics, such as whitelist verification, strong type parsing (Integer.parseInt), output encoding / escaping (HTML / SQL), encryption / desensitization, and path normalization. In practice, a cleaner rule base (including function signatures and effects) categorized by language / framework must be maintained. For unknown functions, a cleaner can be identified with low / medium confidence using name / comment / context + LLM, and the confidence can be raised upon a runtime hit, to avoid false positives and over-propagation.

[0100] When a taint reaches a specific parameter of a preset data sensitivity point, a taint hit is recorded and the taint flow score is set to 1; otherwise, it is set to 0. At the same time, it checks whether there is an authentication decision on the taint path to determine whether the taint path is reachable.

[0101] The aforementioned data sensitivity points refer to dangerous entry points where sending controllable external data into their specific parameters could lead to security consequences. These include command execution (the `cmd` parameter in `exec(cmd)`), SQL / NoSQL queries, template / deserialization entry points, file system / network external interfaces, XML parsing, and dangerous reflection loading. Sensitive points are typically maintained in a risk point directory as functions combined with sensitive parameter bits. During analysis, it is determined whether there is an uncleaned data stream (Taint) from an external source to that parameter bit, and path classification and priority ranking are based on reachability, authentication, and runtime coverage.

[0102] Perform iterative analysis on each node of the taint analysis work queue until any iteration stopping condition is met:

[0103] 1) The taint analysis queue is empty.

[0104] 2) The number of iterations is greater than the preset iteration limit.

[0105] 3) The cleaner removes the taint.
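
The work-queue propagation of [0096] to [0105] can be sketched as follows. This is a simplified illustration under stated assumptions: the data flow graph is an adjacency dictionary, every edge is treated as a propagation feature, a cleaner stops propagation on its branch only, and sensitive points are modeled as node names rather than as specific parameter positions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def taint_flow_score(dataflow: Dict[str, List[str]], sources: Iterable[str],
                     cleaners: Set[str], sinks: Set[str],
                     max_iterations: int = 10_000) -> int:
    """Return 1 if taint from any source reaches a sensitive point, else 0."""
    start = list(sources)
    tainted = set(start)
    queue = deque(start)                 # first-in-first-out work queue
    iterations = 0
    while queue and iterations < max_iterations:
        iterations += 1
        node = queue.popleft()
        if node in sinks:
            return 1                     # taint hit at a sensitive point
        for nxt in dataflow.get(node, []):
            if nxt in cleaners:
                continue                 # cleaner removes the taint here
            if nxt not in tainted:
                tainted.add(nxt)
                queue.append(nxt)
    return 0                             # queue empty or iteration limit hit


# Hypothetical flows: an unvalidated filename and a parsed integer id.
flows = {
    "req.filename": ["path"],
    "path": ["File.open"],
    "req.id": ["Integer.parseInt"],
    "Integer.parseInt": ["Sql.query"],
}
sensitive = {"File.open", "Sql.query"}
sanitizers = {"Integer.parseInt"}
```

The filename flow reaches `File.open` and scores 1, while the id flow is stopped at `Integer.parseInt` and scores 0.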

[0106] The paths with a reachability score of 1 and the paths with a taint flow score of 1 are counted as a risk candidate path set, and probe analysis is performed on the risk candidate path set.

[0107] After summarizing the paths screened out by static analysis into a risk candidate path set, probe analysis is performed on this set: lightweight probes are deployed as needed at key nodes of the candidate paths (entry points, suspicious calls, sensitive points / parameter bits, outbound / persistent points) to collect operational evidence such as whether the path is hit, hit frequency, request samples and parameter forms, authentication results, error rate / latency, outbound destination, and high entropy of the payload; based on this, dynamic edge patching is confirmed / disproved, Reach / Taint / confidence is updated, paths that are continuously hit are given higher weight and trigger handling strategies (blocking / grayscale / rate limiting / desensitization / strong authentication), and paths that have not been hit for a long time and whose features are closed or not packaged are marked as unreachable candidates and given lower weight, ultimately producing more accurate risk priorities and release decision inputs.

[0108] When runtime probes detect new calls triggered by reflection / dynamic loading, or when gateway / access logs expose new external entry points, this evidence is written back to the unified call graph: edges are added to known nodes, nodes are added to new targets, and their source and confidence level are labeled. Conversely, if a candidate path is not hit for a long time, and the configuration indicates that the feature is disabled or the artifact ingredient list shows that the code has not been packaged, the confidence level of the path is reduced on the graph or it is directly marked as an unreachable candidate, and it will be demoted in subsequent scoring and release decisions by default.
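
The write-back rule above can be sketched as a small update function. This is only an illustration: the `PathEvidence` record, the fixed `step` size, and the use of a single `hit` flag to stand for "hit within the observation window" are assumptions, since the text does not specify how confidence is quantified.

```python
from dataclasses import dataclass


@dataclass
class PathEvidence:
    confidence: float                  # current confidence of the path, 0..1
    unreachable_candidate: bool = False


def write_back(ev: PathEvidence, hit: bool, feature_disabled: bool,
               packaged: bool, step: float = 0.2) -> PathEvidence:
    if hit:
        # Runtime evidence confirms the path: raise its confidence.
        ev.confidence = min(1.0, ev.confidence + step)
        ev.unreachable_candidate = False
    elif feature_disabled or not packaged:
        # No hits, and the feature is off or the code is not packaged.
        ev.confidence = max(0.0, ev.confidence - step)
        ev.unreachable_candidate = True
    return ev


confirmed = write_back(PathEvidence(0.5), hit=True,
                       feature_disabled=False, packaged=True)
demoted = write_back(PathEvidence(0.5), hit=False,
                     feature_disabled=True, packaged=True)
```

A path that is neither hit nor contradicted by configuration or packaging evidence is left unchanged in this sketch.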

[0109] Specifically, the process involves checking whether authentication checks exist on the tainted path to determine its reachability. The detailed analysis process is as follows:

[0110] If there is an authentication determination on the tainted path, the known maximum permission score of the tainted path is extracted and compared with the limited permission score on the tainted path. If the known maximum permission score of the tainted path is less than the limited permission score, the tainted path is determined to be unreachable, and the basic risk assessment weight corresponding to the tainted path is set to the minimum risk assessment weight.

[0111] It should be explained that the known maximum permission score can be extracted from the runtime logs of the software system. The limited permission score refers to the predefined permission labels in the risk assessment database. The basic risk assessment weight refers to the predefined path weight benchmark value in the risk assessment database.

[0112] If the known maximum permission score of a tainted path is greater than or equal to the limited permission score, the tainted path is determined to be reachable, and the basic risk assessment weight corresponding to the tainted path is set to the adaptive risk assessment weight.

[0113] If there is no authentication determination on the tainted path, the tainted path is determined to be reachable, and the basic risk assessment weight corresponding to the tainted path is set to the maximum risk assessment weight.
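
The three cases in [0110] to [0113] can be sketched as a single decision function. This is a minimal illustration: the numeric weight values are hypothetical placeholders (the text only fixes their ordering), and modeling permission scores as integers is an assumption.

```python
from typing import Optional, Tuple

# Hypothetical values; only the ordering min < basic < adaptive < max is given.
MIN_WEIGHT, BASIC_WEIGHT, ADAPTIVE_WEIGHT, MAX_WEIGHT = 0.1, 0.5, 1.0, 1.5


def auth_weight(limited_permission: Optional[int],
                known_max_permission: int) -> Tuple[bool, float]:
    """Return (reachable, weight) for one tainted path.

    limited_permission is None when the path has no authentication check.
    """
    if limited_permission is None:
        return True, MAX_WEIGHT          # no authentication on the path
    if known_max_permission < limited_permission:
        return False, MIN_WEIGHT         # required permission is not attainable
    return True, ADAPTIVE_WEIGHT         # authentication can be satisfied


public_path = auth_weight(None, known_max_permission=1)
admin_only = auth_weight(3, known_max_permission=1)
user_path = auth_weight(1, known_max_permission=1)
```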

[0114] In the authentication process, the ease with which an external user can be compromised is taken into account when setting the weights: if a path to a vulnerability requires the ADMIN role and there is no known way to elevate a regular user to ADMIN within the system (no privilege escalation chain, no privilege overreach interface, no weak password / default account evidence), then the path is marked as blocked and its risk is reduced. This does not mean there is no risk, but rather that resources are allocated to items that are easier to access first. Conversely, if the path has no authentication (or is a public interface) and externally controllable data (taints) can directly flow into the sensitive parameters of the vulnerability, then it is judged as high-risk and its weight is immediately increased, making it the first target for blocking / gray rollout or rapid patching.

[0115] S4. By setting risk assessment weights, the risk score is corrected, risk paths are selected and matched with disposal strategies. The large model writes the disposal strategies into strategy codes for risk management.

[0116] Furthermore, the risk score is revised by setting risk assessment weights. The specific analysis process is as follows:

[0117] Risk assessment weights include basic risk assessment weights, minimum risk assessment weights, maximum risk assessment weights, adaptive risk assessment weights, and additional risk assessment weights.

[0118] In this embodiment, the arrangement of risk assessment weights from smallest to largest is expressed as: minimum risk assessment weight, basic risk assessment weight, adaptive risk assessment weight, and maximum risk assessment weight.

[0119] The specific process for defining the additional weighting elements in risk assessment is as follows:

[0120] When a path has an edge-completion confidence, the real-time edge-completion confidence is looked up in the association mapping set between edge-completion confidence and additional weight elements to obtain the corresponding additional weight element, which is denoted as the risk assessment additional weight element. When a path has no edge-completion confidence, the risk assessment additional weight element is 0.

[0121] It should be explained that, if an edge-completion confidence exists, the risk assessment additional weight element is added to the risk assessment weight, and the two jointly correct the risk score.

[0122] Extract the risk score for each candidate risk path, and then associate and merge the risk assessment weight corresponding to the path with the risk score. Specifically, multiply the risk assessment weight corresponding to the path with the risk score. If there is an additional risk assessment weight element, multiply the sum of the additional risk assessment weight element and the risk assessment weight with the risk score to obtain the risk score representation of each candidate risk path.
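
The correction in [0120] to [0122] amounts to (weight + additional weight element) x risk score. The sketch below illustrates it under stated assumptions: the association mapping set is modeled as a step table with hypothetical bounds and values, and the choice that higher confidence maps to a larger additional element is an assumption, not something the text states.

```python
import bisect
from typing import Optional

# Hypothetical association mapping: confidence lower bound -> additional element.
CONF_BOUNDS = [0.0, 0.5, 0.8]
EXTRA_WEIGHTS = [0.0, 0.1, 0.2]


def additional_weight(confidence: Optional[float]) -> float:
    if confidence is None:
        return 0.0                       # path has no edge-completion confidence
    idx = bisect.bisect_right(CONF_BOUNDS, confidence) - 1
    return EXTRA_WEIGHTS[max(idx, 0)]


def corrected_score(risk_score: float, weight: float,
                    confidence: Optional[float] = None) -> float:
    # (risk assessment weight + additional weight element) * risk score
    return (weight + additional_weight(confidence)) * risk_score


plain = corrected_score(2.0, 1.0)              # no completed edge on the path
with_edge = corrected_score(2.0, 1.0, 0.9)     # edge completed with confidence 0.9
```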

[0123] Specifically, the risk scoring and analysis process for each candidate risk path is as follows:

[0124] Obtain risk characteristic parameters for each risk candidate path in the risk candidate path set, including the call frequency, temporal anomaly degree, and high entropy signal of each risk candidate path during the risk assessment period. These risk characteristic parameters can be extracted from the software system's execution logs.

[0125] The call frequency of each risk candidate path is obtained by dividing the number of times the path is called within the risk assessment period by the length of the risk assessment period.

[0126] The call frequency, temporal anomaly degree, and high entropy signal of each risk candidate path are preprocessed separately. The data preprocessing includes normalization and removal of dimensions (nondimensionalization). Influence elements are then introduced and combined with the preprocessing results to obtain the risk score of each risk candidate path.

[0127] The specific analysis process is as follows:

[0128]

[0129] In the formula, S_j is the risk score of the j-th risk candidate path; j is the index of each risk candidate path, and J is the total number of risk candidate paths; Cr_j is the call frequency of the j-th risk candidate path; Ta_j is the temporal anomaly degree of the j-th risk candidate path; Ss_j is the high-entropy signal of the j-th risk candidate path; m1 to m4 are influence elements predefined in the risk assessment database, where m2 corresponds to the call frequency, m3 to the temporal anomaly degree, and m4 to the high-entropy signal.

[0130] The aforementioned high-entropy signal is specifically an indicator reflecting the uncertainty or randomness of the code information corresponding to the risk candidate path. It is obtained by taking how many times each byte value appears within a sliding window, dividing it by the window length to get the probability of each byte value appearing, and then processing the standard deviation of the probability of each byte value appearing to represent the high-entropy signal.
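
The computation described above can be sketched as follows. This is an illustration under stated assumptions: probabilities are computed over all 256 possible byte values (values that never occur count as probability 0), the population standard deviation is used, and the function returns the raw standard deviation because the text does not specify how it is further processed into the final signal.

```python
from collections import Counter
from statistics import pstdev
from typing import List


def byte_probability_std(window: bytes) -> float:
    """Standard deviation of per-byte-value probabilities in one window."""
    if not window:
        raise ValueError("window must not be empty")
    counts = Counter(window)                       # occurrences of each byte value
    probs = [counts.get(b, 0) / len(window) for b in range(256)]
    return pstdev(probs)


def sliding_signal(data: bytes, window: int, step: int = 1) -> List[float]:
    """Apply byte_probability_std to each full window of the data."""
    return [byte_probability_std(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]


uniform = byte_probability_std(bytes(range(256)))   # every byte value once
repeated = byte_probability_std(b"a" * 256)         # a single byte value
```

Under these assumptions a perfectly uniform window yields 0 and a window made of one repeated byte yields a larger value, so whatever processing follows has to account for that direction.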

[0131] In this embodiment, multivariate analysis is performed using call frequency, temporal anomaly degree, and high entropy signal. Specifically, the correlation between these parameters is considered. A higher call frequency indicates that the path is repeatedly reached by real traffic, and the risk score should generally be increased. A higher temporal anomaly degree indicates that recent behavior has significantly deviated from the historical baseline, which is common in sudden attacks or misuse of functions, and the risk score should also be increased. A stronger high entropy signal indicates that the payload or outgoing content is closer to ciphertext, keys, compressed packages, or data packaging formats, and the suspicion of leakage and exploitation is greater, which also increases the risk score. When the three factors work together, there is a relationship of mutual reinforcement and mutual inhibition: high frequency accompanied by high entropy strongly points to data outgoing or automated attacks; high anomaly but low frequency may be early attack detection and deserves to be weighted higher; high frequency but low entropy and long-term stability are more likely to be normal business flow, and the frequency weighting should be capped and offset by low anomaly degree; if high entropy only appears in low-frequency and anomaly-free maintenance windows, it can be reduced by low anomaly degree and low frequency.

[0132] Furthermore, risk paths are identified and corresponding response strategies are matched. The specific analysis process is as follows:

[0133] The risk score representation of each candidate risk path is matched with a predefined risk score representation interval to determine the specific interval of the risk score representation for each candidate risk path, and the corresponding treatment strategy is matched for each risk score representation interval.

[0134] Each risk score representation interval includes the first risk score representation interval, the second risk score representation interval, and the third risk score representation interval.

[0135] If the risk score representation of a risk candidate path belongs to the first or second risk score representation interval, the risk candidate path is marked as a risk path. If the risk score representation of a risk candidate path belongs to the third risk score representation interval, the risk candidate path is marked as a pending path and is continuously evaluated.

[0136] The first risk score representation interval corresponds to the blocking release priority repair strategy, the second risk score representation interval corresponds to the gray (canary) release strategy, and the third risk score representation interval corresponds to the allow release strategy with a prompt for continuous evaluation.
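
The interval matching of [0133] to [0136] can be sketched as a threshold lookup. This is only an illustration: the two threshold values are hypothetical, and treating the first interval as the highest score range is an assumption inferred from its pairing with the blocking strategy.

```python
from typing import Tuple


def match_strategy(score_representation: float, first_lower: float = 0.8,
                   second_lower: float = 0.5) -> Tuple[str, str]:
    """Return (path label, disposal strategy) for one risk candidate path."""
    if score_representation >= first_lower:        # first interval
        return "risk path", "block release and prioritize repair"
    if score_representation >= second_lower:       # second interval
        return "risk path", "gray release"
    # third interval: released, but kept under continuous evaluation
    return "pending path", "allow release with continuous evaluation prompt"


decisions = [match_strategy(s) for s in (0.9, 0.6, 0.2)]
```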

[0137] The blocking release priority repair strategy specifically terminates subsequent steps, prevents artifacts from entering the release process, generates traceable work orders and reports, automatically notifies the responsible person, marks the repair path and regression verification requirements, and freezes the release and change window for this version. If necessary, it triggers a rollback or locks dependent versions. The block is lifted and the release is restored only after the repair submission passes the same scanning, mapping, reachability and taint analysis, regression, and acceptance rerun process.

[0138] The canary (gray) release strategy proceeds as follows. When a version passes basic verification but still requires runtime verification, the assessment system first reads the risk candidate paths, the reachability and taint conclusions, the coverage hits, and the timing anomalies generated in this assessment. It then selects an appropriate combination of controls according to the strategy template (proportional traffic import, strong authentication or administrator-only access, interface rate limiting, sensitive field anonymization, Web application firewall input validation, and instrumentation of key functions and outbound points). Next, the system feeds the environment information and risk context into the large model, which generates executable strategy code and configuration snippets. The generated output is verified through syntax and dependency self-checks, shadow releases, or dry runs. Once verification succeeds, the code is deployed to the gateway, service mesh, or business-side software development kit to take effect, and the rollout pace, health thresholds, and automatic rollback conditions are set. During runtime, evidence of hits and high-entropy outbound transmissions is continuously collected. If the indicators are stable, the rollout is gradually expanded while alerts are tightened. If the risk escalates or health degrades, the rollout is automatically frozen and rolled back, and the runtime evidence is written back to the model to correct subsequent assessments and strategies.

[0139] Example 1: Service mesh scales up access proportionally and only allows high-risk entry points to administrators.

[0140] The example input is a risk entry point at /api/parse that requires an administrator role, where externally controllable data can directly reach the deserialization point. The generated policy might be an Istio routing and authorization policy that routes 5% of traffic to the new version and requires an administrator group token.

[0141] apiVersion: networking.istio.io/v1beta1

[0142] kind: VirtualService

[0143] spec:

[0144]   hosts: [app.example.com]

[0145]   http:

[0146]   - match: [{uri: {prefix: "/api/parse"}}]

[0147]     route:

[0148]     - destination: {host: app, subset: canary, port: {number: 80}}

[0149]       weight: 5    # 5% of matching traffic goes to the canary subset

[0150]     - destination: {host: app, subset: stable, port: {number: 80}}

[0151]       weight: 95   # the remaining 95% stays on the stable subset

[0152] ---

[0153] apiVersion: security.istio.io/v1

[0154] kind: AuthorizationPolicy

[0155] spec:

[0156]   selector: {matchLabels: {app: app}}

[0157]   rules:

[0158]   - to: [{operation: {paths: ["/api/parse"]}}]

[0159]     when:

[0160]     - key: request.auth.claims[roles]

[0161]       values: ["ADMIN"]   # only tokens carrying the ADMIN role are allowed

[0162] Example 2: Gateway-side input validation and rate limiting, with sensitive fields de-identified in logs.

[0163] The example input is a filename parameter that is not validated and is suspected of carrying high-entropy outbound content. The generated strategy may be an Nginx gateway configuration with filtering rules that rate-limits the path, validates the filename, and desensitizes it on the log side.

[0164] # Rate limiting (limit_req_zone belongs in the http context)

[0165] limit_req_zone $binary_remote_addr zone=api_limit:10m

[0166]     rate=10r/s;

[0167] location /api/parse {

[0168]     limit_req zone=api_limit burst=20 nodelay;

[0169]     # Basic validation of the filename query parameter

[0170]     if ($arg_filename !~ "^[A-Za-z0-9._-]{1,64}$") { return 400; }

[0171]     # Log desensitization: mask the middle of the filename

[0172]     set $safe_filename "$arg_filename";

[0173]     if ($safe_filename ~ "([A-Za-z0-9._-]{1,3}).+([A-Za-z0-9._-]{1,3})") {

[0174]         set $safe_filename "$1***$2"; }

[0175]     proxy_set_header X-Log-Filename $safe_filename;

[0176]     proxy_pass http://app_upstream;

[0177] }

[0178] The execution process of the above-mentioned canary release strategy is shown in Figure 4, which is the execution flowchart of the canary release strategy. When the evaluation results allow verification through small-scale deployment, the system first reads the candidate risk paths, their reachability and taint conclusions, and the operational coverage and difference information, and selects an appropriate combination of canary release controls, such as proportionally importing traffic, allowing only administrators to access high-risk entry points, interface rate limiting and input validation, anonymizing sensitive fields, and adding probes at key points. Then, the environment and risk context are passed to the large model to generate specific strategy code and configuration snippets, which are subjected to syntax checks and dry runs such as shadow deployment. After passing these checks, the strategy is deployed and executed on the gateway, service mesh, or application side. During deployment, monitoring is strengthened to track health and risk indicators continuously. If performance is stable, traffic is gradually increased; if the risk level escalates or health deteriorates, the traffic increase is immediately frozen and rolled back, or mitigation measures such as rate limiting and strong authentication are applied. At the same time, operational evidence is written back to the evaluation model and the unified graph to adjust strategies and scores in subsequent stages, until the canary release is completed or the strategy converges.
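
The expand, freeze, and roll back loop described above can be sketched as a single step function. This is a minimal illustration: the function name, the fixed `step` increment, and the decision to drop traffic to 0 on rollback are assumptions; the text leaves the rollout pace and rollback conditions to configuration.

```python
def next_traffic_percent(current: int, healthy: bool,
                         risk_escalated: bool, step: int = 10) -> int:
    """Return the canary traffic share for the next rollout stage."""
    if risk_escalated or not healthy:
        return 0                       # freeze the rollout and roll back
    return min(100, current + step)    # stable: expand gradually, cap at 100%


# Hypothetical rollout: start at 5% and expand while indicators stay stable.
share = 5
history = [share]
for _ in range(3):
    share = next_traffic_percent(share, healthy=True, risk_escalated=False)
    history.append(share)
```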

[0179] The allow release strategy with a prompt for continuous evaluation is as follows. If all risk items fall within the acceptable range, deployment is not interrupted, and artifacts continue to be pushed to the target environment. At the same time, alerts and work orders are automatically generated for the corresponding services, recording the risk items, the reachability and taint conclusions, version and source information, and suggested remediation measures. After deployment, enhanced monitoring is enabled within the agreed window to continuously collect operational evidence such as coverage hits, call frequency, timing anomalies, and high-entropy outbound transmissions. Once a preset threshold is triggered, the strategy is automatically upgraded to a canary release. If operation is stable, the risk is added to the planned remediation list, reviewed in subsequent builds, and the alerts are closed. This achieves traceable and reversible control over low and medium risks without affecting business delivery.

[0180] Referring to Figure 2, which is a schematic diagram of the module connections, the second aspect of this invention provides a product information risk assessment system based on a large model. The system includes: a source-artifact difference identification module, an intermediate representation output module, a path risk scoring module, a risk management module, and a risk assessment database. The risk assessment database is used to store preset values for various parameters.

[0181] The source-artifact difference identification module is connected to the intermediate representation output module, the intermediate representation output module is connected to the path risk scoring module, and the path risk scoring module is connected to the risk management module. The source-artifact difference identification module, the intermediate representation output module, the path risk scoring module, and the risk management module are all connected to the risk assessment database.

[0182] The source-artifact difference identification module is used to scan, with a scanner, the original code repository and the code artifacts of the software to be evaluated, obtaining the source dependency list and the artifact composition list of the software respectively. After the main names are unified by the large model, the two lists are aligned to identify the source-artifact differences between them.

[0183] The intermediate representation output module is used to invoke a parser, based on the source dependency list, to parse the original code of the software to be evaluated and generate a path logic graph group, and to output the unified intermediate representation based on the artifact composition list and the path logic graph group.

[0184] The path risk scoring module is used to perform reachability and taint analysis on the unified intermediate representation, obtain a set of risk candidate paths, and score the risk of the risk candidate path set.

[0185] The risk management module is used to correct the risk scores by setting risk assessment weights, screen out the risk paths, and match them with disposal strategies. The large model writes the disposal strategies as strategy code for risk management.

[0186] The above description is merely an example and illustration of the structure of the present invention. Those skilled in the art may make various modifications or additions to the specific embodiments described, or replace them with similar methods. As long as such changes do not deviate from the structure of the invention or exceed the scope defined by the claims, they all fall within the protection scope of the present invention.

Claims

1. A product information risk assessment method based on a large model, characterized in that it includes:
S1. Scanning, with a scanner, the original code repository and the code artifacts of the software to be evaluated to obtain the source dependency list and the artifact composition list of the software respectively; after the main names are unified by the large model, aligning the two lists to identify the source-artifact differences between them.
S2. Based on the source dependency list, invoking a parser to parse the original code of the software to be evaluated and generate a path logic graph group, and outputting a unified intermediate representation based on the artifact composition list and the path logic graph group. Outputting the unified intermediate representation includes the following. In the call graph, the original code of the software to be evaluated is searched for preset fixed patterns to obtain the set of fixed patterns in the original code. Each fixed pattern is split into three elements; the three-element split normalizes the target extracted from each fixed pattern into a unified symbol key, yielding the field combination of the fixed pattern. The field combination is matched against the existing nodes in the abstract syntax tree symbol table. If the entire field combination of a fixed pattern matches one existing node in the abstract syntax tree, that node is recorded as the hit node; the code function node where the fixed pattern is located is extracted and connected to the hit node, completing the edge in the call graph. If the field combination of a fixed pattern hits several existing nodes in the abstract syntax tree, these nodes are recorded as a candidate node set, which is disambiguated and normalized to obtain a unique candidate node; the code function node containing the fixed pattern is then connected to the unique candidate node, completing the edge in the call graph.
The disambiguation and normalization process is as follows. First, the namespace, class or module name, and member name of the candidate nodes are standardized, and common aliases are merged. Then, impossible targets are filtered out based on the import relationships and visibility at the location of each candidate node. Next, the number of formal parameters, the type signature, and directory and framework conventions are used for precise matching. At the same time, the actual packaging information and configuration switches of the current build are used to eliminate targets that were not packaged or have been switched off. If multiple candidates remain, runtime evidence such as coverage hits and call frequency is introduced as weights and combined with the large model's ranking of contextual semantics. After passing minimum verification, the candidate with the highest score and sufficient evidence is selected as the unique node. If all candidate scores are close or the evidence is insufficient, the candidates are not unified into a unique node; the confidence is reduced, pending further runtime verification or manual review.
If disambiguation and normalization fail and a unique candidate node cannot be obtained, the large model infers the intent from the context and marks one node in the candidate node set as the unique marked node, which corresponds to a specific code point in the candidate set. This decision is recorded as traceable metadata. The code function node where the fixed pattern is located is connected to the unique marked node, completing the edge in the call graph, and the large model outputs the edge-completion confidence.
S3. Performing reachability and taint analysis on the unified intermediate representation to obtain a risk candidate path set, and scoring the risk of the risk candidate path set.
S4. Correcting the risk scores by setting risk assessment weights, screening out the risk paths, and matching them with disposal strategies; the large model writes the disposal strategies as strategy code for risk management.

2. The product information risk assessment method based on a large model according to claim 1, characterized in that identifying the source-artifact differences between the two lists includes: performing a set operation on the source dependency list and the artifact composition list; if there is a missing difference or a shadow difference between the source dependency list and the artifact composition list, it is determined that a source-artifact difference exists between the two lists. Each list represents a set of components of the software; a missing difference indicates that a component exists in the source dependency list but not in the artifact composition list; a shadow difference indicates that a component exists in the artifact composition list but not in the source dependency list. When the source-artifact difference is a missing difference, the code path corresponding to the component is marked as an unreachable candidate, and the basic risk assessment weight corresponding to the path is set to the minimum risk assessment weight. When the source-artifact difference is a shadow difference, the basic risk assessment weight corresponding to the path is set to the risk assessment adaptation weight.
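The set operation in claim 2 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `source_artifact_diff` and `base_weight` and the numeric weight values are assumptions, and components are represented simply as name strings after name unification.

```python
from typing import Optional, Set, Tuple


def source_artifact_diff(source_deps: Set[str],
                         artifact_components: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, shadow) differences between the two component sets."""
    missing = source_deps - artifact_components   # declared in source, absent from artifact
    shadow = artifact_components - source_deps    # present in artifact, not declared in source
    return missing, shadow


def base_weight(component: str, missing: Set[str], shadow: Set[str],
                w_min: float = 0.0, w_adapt: float = 0.5) -> Optional[float]:
    """Basic risk assessment weight for paths through `component`.

    w_min and w_adapt are placeholder values for the minimum weight and
    the adaptation weight; the claim does not give concrete numbers.
    """
    if component in missing:
        return w_min      # path is an unreachable candidate
    if component in shadow:
        return w_adapt
    return None           # no source-artifact difference for this component
```

For example, with `source_deps = {"libA", "libB"}` and `artifact_components = {"libB", "libC"}`, `libA` is a missing difference and `libC` is a shadow difference.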

3. The product information risk assessment method based on a large model according to claim 1, characterized in that outputting the unified intermediate representation includes: determining, from the source dependency list, the repository address and version constraints of the original code to be parsed, and invoking the parser to parse the original code. The path logic graph group includes an abstract syntax tree, a control flow graph, and a data flow graph; the abstract syntax tree contains functions and a symbol table, the control flow graph contains function call relationships, and the data flow graph contains variable flow relationships. Functions in the abstract syntax tree are unified into nodes, function calls in the control flow graph are unified into directed edges, and variable flows in the data flow graph are unified into data edges, producing a unified intermediate representation graph. The component nodes on the unified intermediate representation graph are then corrected using the artifact composition list. The call graph and the data flow graph are extracted as projections of the unified intermediate representation graph.

4. The product information risk assessment method based on a large model according to claim 1, characterized in that performing reachability and taint analysis on the unified intermediate representation to obtain a risk candidate path set includes:
Extracting the entry point of an external request as the starting point, and traversing the nodes along the call chain path from the entry point on the call graph to determine whether the call chain path can reach a known dangerous function node. If it can, the path is marked as reachable and the path reachability score is recorded as 1; if it cannot, the path is marked as unreachable and the path reachability score is recorded as 0. This completes the reachability analysis of the call graph.
Extracting externally controllable data: the starting point of the externally controllable data is recorded as a source. In the data flow graph, the value node of each source is marked as tainted and added to the taint analysis work queue. Following a first-in-first-out strategy, nodes are taken from the work queue and propagated forward along the edges of the data flow graph. If a preset propagator is encountered, the taint label is propagated to the next hop; if a preset sanitizer is encountered, the taint is removed. When a taint reaches a specific parameter of a preset data-sensitive point, a taint hit is recorded and the taint flow score is recorded as 1; otherwise, it is recorded as 0. At the same time, it is checked whether an authentication decision exists on the taint path, to determine whether the taint path is reachable.
Iterative analysis is performed on each node of the taint analysis work queue until any of the following stopping conditions is met: 1) the taint analysis work queue is empty; 2) the number of iterations exceeds the preset iteration limit; 3) the sanitizer removes the taint. The paths with a reachability score of 1 and the paths with a taint flow score of 1 are collected as the risk candidate path set, and the risk candidate path set is then tested.
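The two analyses in claim 4 can be sketched as a breadth-first reachability check and a first-in-first-out taint worklist. The Python below is a simplified illustration under stated assumptions: graphs are adjacency dictionaries, every data flow edge is treated as a propagator, sinks are whole nodes rather than specific parameters, and the authentication check is left out. All function and parameter names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

Graph = Dict[str, List[str]]


def reachability_score(call_graph: Graph, entry: str, dangerous: Set[str]) -> int:
    """Return 1 if a known dangerous function is reachable from `entry`, else 0."""
    seen, queue = {entry}, deque([entry])
    while queue:
        node = queue.popleft()
        if node in dangerous:
            return 1
        for nxt in call_graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return 0


def taint_flow_score(data_flow: Graph, sources: Iterable[str],
                     sanitizers: Set[str], sinks: Set[str],
                     max_iter: int = 10_000) -> int:
    """Return 1 if taint from any source reaches a sink, else 0.

    Stops when the work queue is empty or the iteration limit is hit;
    a sanitizer (cleaner) node removes the taint and ends that branch.
    """
    source_list = list(sources)
    tainted, queue = set(source_list), deque(source_list)
    iterations = 0
    while queue and iterations < max_iter:
        iterations += 1
        node = queue.popleft()          # first-in-first-out
        if node in sanitizers:
            continue                    # taint removed, do not propagate further
        if node in sinks:
            return 1                    # taint hit at a data-sensitive point
        for nxt in data_flow.get(node, []):
            if nxt not in tainted:
                tainted.add(nxt)
                queue.append(nxt)
    return 0
```

A path would then enter the risk candidate path set when either score is 1, as the claim states.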

5. The product information risk assessment method based on a large model according to claim 4, characterized in that checking whether an authentication decision exists on the taint path and determining whether the taint path is reachable includes: if an authentication decision exists on the taint path, the known maximum permission score of the taint path is extracted and compared with the limited permission score on the taint path. If the known maximum permission score is less than the limited permission score, the taint path is determined to be unreachable, and the basic risk assessment weight corresponding to the taint path is set to the minimum risk assessment weight. If the known maximum permission score is greater than or equal to the limited permission score, the taint path is determined to be reachable, and the basic risk assessment weight corresponding to the taint path is set to the risk assessment adaptation weight. If no authentication decision exists on the taint path, the taint path is determined to be reachable, and the basic risk assessment weight corresponding to the taint path is set to the maximum risk assessment weight.
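The three-way decision in claim 5 maps directly onto a small function. The Python sketch below is illustrative only; the name `auth_adjusted_weight`, the parameter names, and the default weight values are assumptions, since the claim does not give numeric values.

```python
from typing import Tuple


def auth_adjusted_weight(has_auth_check: bool,
                         max_known_permission: float,
                         limited_permission: float,
                         w_min: float = 0.0,
                         w_adapt: float = 0.5,
                         w_max: float = 1.0) -> Tuple[bool, float]:
    """Return (reachable, basic risk assessment weight) for a taint path.

    w_min, w_adapt and w_max stand in for the minimum weight, the
    adaptation weight and the maximum weight (placeholder values).
    """
    if not has_auth_check:
        return True, w_max       # no authentication decision on the path
    if max_known_permission < limited_permission:
        return False, w_min      # caller cannot satisfy the permission check
    return True, w_adapt         # check exists but can be satisfied
```

Paths without any authentication decision receive the highest weight because nothing restricts who can drive tainted data along them.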

6. The product information risk assessment method based on a large model according to claim 1, characterized in that correcting the risk scores by setting risk assessment weights includes: the risk assessment weights include the basic risk assessment weight, the minimum risk assessment weight, the maximum risk assessment weight, the risk assessment adaptation weight, and the risk assessment additional weight element. The risk assessment additional weight element is defined as follows: when a path has an edge-completion confidence, the real-time edge-completion confidence is looked up in the association mapping set between edge-completion confidence and additional weight element to obtain the additional weight element, which is denoted as the risk assessment additional weight element; when a path has no edge-completion confidence, the risk assessment additional weight element is 0. The risk score of each risk candidate path is extracted, and the risk assessment weight corresponding to the path is associated and merged with the risk score to obtain the risk score representation of each risk candidate path.

7. The product information risk assessment method based on a large model according to claim 6, characterized in that obtaining the risk score of each risk candidate path includes: obtaining the risk characteristic performance parameters of each risk candidate path in the risk candidate path set, including the call frequency of each risk candidate path during the risk assessment period, the time series anomaly degree of each risk candidate path, and the high-entropy signal of each risk candidate path. The call frequency of each risk candidate path is obtained as the ratio of the number of times the path is called within the risk assessment period to the length of the risk assessment period. The call frequency, time series anomaly degree, and high-entropy signal of each risk candidate path are preprocessed separately; the data preprocessing includes normalization and de-dimensionalization. Influencing elements are introduced and correlated with the data preprocessing results to obtain the risk score of each risk candidate path.

8. The product information risk assessment method based on a large model according to claim 1, characterized in that screening out the risk paths and matching them with disposal strategies includes: the risk score representation of each risk candidate path is compared with the predefined risk score representation intervals to determine the interval to which it belongs, and a corresponding disposal strategy is matched to each risk score representation interval. The risk score representation intervals include a first risk score representation interval, a second risk score representation interval, and a third risk score representation interval. If the risk score representation of a risk candidate path belongs to the first or second risk score representation interval, the risk candidate path is marked as a risk path. If the risk score representation of a risk candidate path belongs to the third risk score representation interval, the risk candidate path is marked as a pending path and is continuously evaluated. The first risk score representation interval is matched with the strategy of blocking release and prioritizing repair, the second risk score representation interval is matched with the canary release strategy, and the third risk score representation interval is matched with the strategy of allowing release with a continuous evaluation prompt.
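The interval matching in claim 8 can be illustrated with the short Python sketch below. The boundaries 0.8 and 0.5, the assumption that the first interval covers the highest scores, and the function and label names are all hypothetical; the claim only states that three predefined intervals exist and which strategy each one maps to.

```python
from typing import Tuple


def match_strategy(score_repr: float,
                   first_lower: float = 0.8,
                   second_lower: float = 0.5) -> Tuple[str, str]:
    """Return (path label, disposal strategy) for a risk score representation.

    Assumed intervals: first = [first_lower, +inf), second =
    [second_lower, first_lower), third = (-inf, second_lower).
    """
    if score_repr >= first_lower:
        return "risk_path", "block_release_prioritize_repair"
    if score_repr >= second_lower:
        return "risk_path", "canary_release"   # the canary (gray) release strategy
    return "pending_path", "allow_release_with_continuous_evaluation"
```

The returned strategy label is what would be handed, along with the environment and risk context, to the large model for generating the concrete strategy code.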

9. A product information risk assessment system based on a large model, for implementing the product information risk assessment method based on a large model according to any one of claims 1-8, characterized in that it includes:
The source-artifact difference identification module, which is used to scan, with a scanner, the original code repository and the code artifacts of the software to be evaluated, obtaining the source dependency list and the artifact composition list of the software respectively; after the main names are unified by the large model, the two lists are aligned to identify the source-artifact differences between them.
The intermediate representation output module, which is used to invoke a parser, based on the source dependency list, to parse the original code of the software to be evaluated and generate a path logic graph group, and to output the unified intermediate representation based on the artifact composition list and the path logic graph group.
The path risk scoring module, which is used to perform reachability and taint analysis on the unified intermediate representation, obtain a risk candidate path set, and score the risk of the risk candidate path set.
The risk management module, which is used to correct the risk scores by setting risk assessment weights, screen out the risk paths, and match them with disposal strategies; the large model writes the disposal strategies as strategy code for risk management.

Citation Information

Patent Citations

CN114528195A
CN119577790B
CN120688064A
