LLM-based network attack chain complete identification method and device and electronic equipment

By constructing a network attack chain semantic model based on LLM, the problems of low APT attack identification rate and insufficient identification of unknown attacks are solved, and the accurate positioning and automated protection of attack chain stages are realized. It is applicable to enterprise intranets, cloud platforms and industrial control networks.

CN122247779A · Pending · Publication Date: 2026-06-19 · SHENZHEN Y& D ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN Y& D ELECTRONICS CO LTD
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies suffer from low recognition rates, stage location errors, link breaks, poor generalization, and insufficient identification of unknown attacks when identifying and tracking APT attacks. In particular, they lack flexibility and automated processing capabilities for zero-day attacks and unknown attacks.

Method used

We construct a semantic model of network attack chains based on LLM, which automatically determines the attack type by leveraging the semantic understanding and context learning capabilities of large language models. Furthermore, we construct a semantic model of unknown attack chains through flexible fuzzy clustering and frequent itemset mining, thereby achieving accurate positioning and automated protection of attack chain stages.

Benefits of technology

It achieves accurate localization of known attacks and flexible identification of unknown attacks, significantly reducing the false positive and false negative rates at each stage, and forming a closed-loop capability of discovery-learning-modeling-detection, which is applicable to various scenarios such as enterprise intranets, cloud platforms and industrial control networks.

Abstract

This invention belongs to the field of network security and relates to a method, device, and electronic device for complete identification of network attack chains based on a large language model (LLM). The method constructs an attack chain semantic model, collects and preprocesses attack data from multi-source heterogeneous environments, and uses an LLM classification model to determine whether an attack is of a known or unknown type. For known attacks, entity extraction and semantic matching are used to locate the attack stage; for unknown attacks, flexible fuzzy clustering and frequent itemset mining are used to extract behavioral features, and the LLM automatically constructs an attack chain model, which is then integrated into the detection system to achieve model iteration and real-time protection. By using the semantic understanding capabilities of the LLM, the method locates attack chain stages accurately and significantly reduces false positives and false negatives. It also learns and models unknown attacks automatically, forming a fully automated closed loop of discovery-learning-modeling-detection, and is applicable to scenarios such as enterprise intranets, cloud platforms, and industrial control networks.

Description

Technical Field

[0001] This invention relates to the field of network security protection technology, and in particular to a method, apparatus and electronic device for identifying complete network attack chains based on LLM.

Background Technology

[0002] In recent years, cyberattacks have become increasingly frequent and persistent, with attack methods becoming more diverse and complex. APT attacks, due to their stealthy and organized nature, have become a major challenge in the field of cybersecurity. APT attacks often involve a multi-stage intrusion process, including initial infection, lateral movement, and data leakage, posing significant challenges to identification and tracking.

[0003] Current attack chain identification technologies mainly rely on rule matching (such as MITRE ATT&CK and Sigma rules) and traditional machine learning, which have the following core shortcomings: rules depend on manual maintenance, resulting in extremely low identification rates for unknown attacks such as zero-day attacks; traditional methods cannot understand the contextual semantics of attack behavior, leading to errors in attack chain stage positioning and chain breaks; machine learning models rely on a large number of labeled samples, resulting in poor generalization; clustering of unknown attacks lacks flexibility and cannot adjust clustering results according to actual user needs; and there is a lack of closed-loop mechanisms for rule generation and model iteration for unknown attacks.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides an LLM-based method for complete identification of network attack chains, employing a technical solution that includes the following steps:
Construct an attack chain semantic model, and define entity types and feature requirements;
Based on the entity types and the feature requirements, collect raw attack data from a multi-source heterogeneous environment, and preprocess the raw attack data to output standardized attack data;
Based on the standardized attack data, use the semantic understanding and context learning capabilities of the large language classification model to automatically determine whether the attack belongs to a known attack type or an unknown attack type, and output the classification label as a diversion control signal;
For the known attack type, based on the attack chain semantic model, locate the current attack chain stage through LLM-driven entity extraction and semantic similarity matching;
For the unknown attack type, use flexible fuzzy clustering to group attack events with similar behaviors into the same cluster according to the actual needs of users, extract the core behavioral features of each cluster through frequent itemset mining, and automatically construct an unknown attack chain semantic model using LLM semantic generation;
Integrate the unknown attack chain semantic model into the LLM intrusion detection system, update the attack type knowledge base, and use the updated unknown attack chain semantic model to identify attack chain stages and perform automated protection and handling of real-time traffic, forming a closed-loop optimization.

[0005] Preferably, the step of constructing the attack chain semantic model and defining entity types and feature requirements specifically includes:
Based on the Cyber Kill Chain framework and the MITRE ATT&CK knowledge base, define the core behavioral characteristics of the scanning phase, detection phase, exploitation phase, control phase, and propagation phase respectively;
Define multiple core entities for each attack phase, and store the entities in triplet format;
Establish causal and temporal dependencies between attack phases and provide automatically executable protection strategies for each phase.

[0006] Preferably, the step of collecting raw attack data from a multi-source heterogeneous environment based on the entity types and the feature requirements, preprocessing the raw attack data, and outputting standardized attack data specifically includes:
Use a network traffic capture tool, host log collection agents, and an external threat intelligence platform interface to collect network traffic feature data, host operation logs, and threat intelligence text in parallel;
Filter invalid data, fill in missing fields, and use the min-max normalization formula to linearly map the value of each feature to the interval [0,1];
Convert all heterogeneous data into a unified JSON structured format, and extract core feature vectors that can distinguish attack chains.

[0007] Preferably, the step of automatically determining whether an attack belongs to a known attack type or an unknown attack type based on the standardized attack data, utilizing the semantic understanding and context learning capabilities of a large language classification model, and outputting a classification label as a diversion control signal specifically includes:
Integrate publicly available attack libraries and historical attack samples from within the enterprise to form a semantic set of known attack types;
Using in-context learning, provide few-shot examples in the prompt to guide the LLM to perform classification reasoning on the standardized attack data;
Call the LLM inference API, parse the returned JSON result, and obtain the known-attack or unknown-attack label; if it is a known attack, output the attack type identifier at the same time.
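The classification step above can be illustrated with a minimal Python sketch. It is not the patented implementation: the few-shot examples, the label names `known` / `unknown`, the `attack_type` key, and the `call_llm` callable (a stand-in for whatever LLM inference API is used) are all assumptions made for illustration.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical few-shot examples; records and labels are illustrative only.
FEW_SHOT = [
    ({"request_path": "/login", "request_params": "user=admin' OR '1'='1"},
     {"label": "known", "attack_type": "sql_injection"}),
    ({"request_path": "/status", "request_params": "x=%00%ff%13"},
     {"label": "unknown", "attack_type": None}),
]

def build_prompt(record: dict) -> str:
    """Assemble an in-context-learning prompt from the few-shot examples."""
    lines = ["Classify the attack record as known or unknown. Reply with JSON only."]
    for sample, answer in FEW_SHOT:
        lines.append("Record: " + json.dumps(sample, sort_keys=True))
        lines.append("Answer: " + json.dumps(answer, sort_keys=True))
    lines.append("Record: " + json.dumps(record, sort_keys=True))
    lines.append("Answer:")
    return "\n".join(lines)

def parse_label(raw: str) -> Tuple[str, Optional[str]]:
    """Parse the model reply; treat malformed output as an unknown attack."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "unknown", None
    if not isinstance(data, dict):
        return "unknown", None
    if data.get("label") == "known" and data.get("attack_type"):
        return "known", str(data["attack_type"])
    return "unknown", None

def classify(record: dict, call_llm: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """call_llm is a hypothetical wrapper around the LLM inference API."""
    return parse_label(call_llm(build_prompt(record)))
```

Routing malformed replies to the unknown branch is a design choice of this sketch: it sends anything the classifier cannot label cleanly to the clustering path rather than dropping it.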

[0008] Preferably, the step of locating the current attack chain stage for the known attack type, based on the attack chain semantic model and through LLM-driven entity extraction and semantic similarity matching, specifically includes:
Design entity extraction prompt templates and use the named entity recognition capabilities of the LLM to extract four kinds of key entities (attackers, tools, techniques, and vulnerabilities) from unstructured threat intelligence text;
Align the extracted free-format entities with the defined entity types by rules and semantics, filter out invalid entities, and generate a standardized structured entity set;
Construct a matching library for known attack chain stages, calculate the cosine similarity between the current attack entity set and the standard entity set of each stage, and select the stage with the highest similarity as the output.
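The stage-matching step can be sketched as follows. This is a simplified illustration, assuming entity sets are compared as bag-of-entities count vectors; the source does not say how entity sets are vectorized, and a real system might use embeddings instead. The contents of `STAGE_ENTITIES` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical standard entity sets per stage (illustrative, not from the source).
STAGE_ENTITIES: Dict[str, List[str]] = {
    "scanning": ["nmap", "port_scan", "dirb", "ping_sweep"],
    "detection": ["nessus", "fingerprinting", "hydra", "cve_probe"],
    "exploitation": ["metasploit", "rce", "sql_injection", "reverse_shell"],
    "control": ["c2_server", "cobalt_strike", "privilege_escalation", "psexec"],
    "propagation": ["data_theft", "http_post_upload", "log_clearing", "7z"],
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def locate_stage(entities: List[str]) -> Tuple[str, float]:
    """Return the stage whose standard entity set is most similar."""
    current = Counter(entities)
    scores = {stage: cosine(current, Counter(std))
              for stage, std in STAGE_ENTITIES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

stage, score = locate_stage(["metasploit", "reverse_shell"])
print(stage, round(score, 3))  # exploitation 0.707
```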

[0009] Preferably, the step of, for the unknown attack type, using flexible fuzzy clustering to group attack events with similar behaviors into the same cluster according to the user's actual needs, extracting the core behavioral features of each cluster through frequent itemset mining, and automatically constructing an unknown attack chain semantic model using LLM semantic generation specifically includes:
Construct a fuzzy similarity matrix using the correlation coefficient method, and perform multi-cut-set clustering based on a fast direct clustering algorithm to divide the unknown attack data into multiple clusters according to behavioral similarity;
Use the Apriori algorithm to perform frequent itemset mining and association rule generation for each cluster, extracting high-frequency behavior sets, strong association rules, and typical behavior time series;
Construct specialized prompt templates, input the core features into the LLM, and automatically generate the corresponding attack chain semantic model;
Invoke the LLM to verify the completeness of the generated attack chain semantic model; if a missing stage is found, fill it in by reverse reasoning based on the association rules.
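The clustering step can be sketched in Python under stated assumptions. Here the fuzzy similarity is taken to be the Pearson correlation between feature vectors with negative values clipped to 0, and "direct clustering" at a threshold λ is implemented as connected components of the graph whose edges have similarity ≥ λ. The source does not give its exact formula or algorithm, so both choices are illustrative. Running several λ values yields coarser or finer partitions, which is one way to read "flexible".

```python
import math
from typing import List

def correlation_similarity(samples: List[List[float]]) -> List[List[float]]:
    """Fuzzy similarity matrix from Pearson correlation, clipped to [0, 1]."""
    centered = []
    for row in samples:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    norms = [math.sqrt(sum(v * v for v in row)) for row in centered]
    n = len(samples)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if norms[i] == 0 or norms[j] == 0:
                r = 0.0
            else:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                r = dot / (norms[i] * norms[j])
            sim[i][j] = sim[j][i] = max(0.0, r)
    return sim

def lambda_cut_clusters(sim: List[List[float]], lam: float) -> List[List[int]]:
    """Group indices connected by similarity >= lam (union-find)."""
    n = len(sim)
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= lam:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Four hypothetical events described by normalized features.
events = [
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.8, 0.1, 0.9],
]
sim = correlation_similarity(events)
for lam in (0.0, 0.9, 0.99):
    print(lam, lambda_cut_clusters(sim, lam))
```

A lower λ merges more events into one cluster and a higher λ splits them apart, so the user can pick the cut level that suits the analysis.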

[0010] Preferably, the step of integrating the unknown attack chain semantic model into the LLM intrusion detection system, updating the attack type knowledge base, and using the updated unknown attack chain semantic model to identify attack chain stages and perform automated protection and handling of real-time traffic to form a closed-loop optimization specifically includes:
Store the attack chain semantic model generated by the LLM in the attack chain semantic model library, add the core feature set of the cluster to the attack type knowledge base, and mark it as a known attack;
After real-time captured network traffic is preprocessed and classified by the LLM, call the updated dedicated recognition model to extract entities and match stages;
Based on the identified attack phase, retrieve the corresponding protection policy from the handling suggestion library;
Automatically execute the handling instructions through API linkage with EDR, firewall and WAF security devices, and generate alarm notifications and audit logs.

[0011] To address the aforementioned technical problems, this invention also provides an LLM-based network attack chain complete identification device, employing a technical solution that includes:
A construction module, used to construct the attack chain semantic model and define entity types and feature requirements;
An acquisition module, used to acquire raw attack data from a multi-source heterogeneous environment based on the entity types and the feature requirements, and to preprocess the raw attack data to output standardized attack data;
A judgment module, used to automatically determine whether the attack belongs to a known attack type or an unknown attack type based on the standardized attack data and by utilizing the semantic understanding and context learning capabilities of the large language model, and to output a classification label as a diversion control signal;
A localization module, used to locate the current attack chain stage for the known attack type, based on the attack chain semantic model and through LLM-driven entity extraction and semantic similarity matching;
An extraction module, used, for the unknown attack type, to group attack events with similar behaviors into the same cluster through flexible fuzzy clustering, extract the core behavioral features of each cluster through frequent itemset mining, and automatically construct an unknown attack chain semantic model using LLM semantic generation;
An optimization module, used to integrate the unknown attack chain semantic model into the LLM intrusion detection system, update the attack type knowledge base, and use the updated unknown attack chain semantic model to identify attack chain stages and automatically protect against real-time traffic, forming a closed-loop optimization.

[0012] To address the aforementioned technical problems, the present invention also provides an electronic device that employs the technical solution described below, comprising a memory and a processor. The memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the above-described method for complete identification of network attack chains based on LLM.

[0013] To address the aforementioned technical problems, the present invention also provides a computer-readable storage medium, which employs the technical solution described below. The computer-readable storage medium stores computer-readable instructions, which, when executed by a processor, implement the steps of the above-described method for complete identification of network attack chains based on LLM.

[0014] Compared with existing technologies, this invention has the following main advantages: By constructing a five-stage attack chain semantic model and relying on the natural language understanding capabilities of the LLM, it can deeply analyze the semantics of attack behaviors in heterogeneous data, achieve accurate positioning of all stages of the attack chain, and significantly reduce the false positive and false negative rates of stage identification; for known attacks, it can automatically complete entity extraction and stage matching; for unknown attacks, it can generate recognition rules and complete model iteration through flexible fuzzy clustering and LLM linkage, and flexibly adjust the clustering results according to user needs, forming a closed-loop capability of discovery-learning-modeling-detection; it can achieve full-process automation and can be widely applied to various scenarios such as enterprise intranets, cloud platforms, and industrial control networks.

Brief Description of the Drawings

[0015] To more clearly illustrate the solutions in this invention, the accompanying drawings used in the description of the embodiments of this invention will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0016] Figure 1 is a flowchart of an embodiment of the LLM-based network attack chain complete identification method of the present invention;
Figure 2 is a schematic diagram of an embodiment of the LLM-based network attack chain complete identification device of the present invention;
Figure 3 is a schematic diagram of another embodiment of the LLM-based network attack chain complete identification device of the present invention;
Figure 4 is a schematic diagram of the structure of an embodiment of the electronic device of the present invention.

Detailed Description

[0017] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains; the terminology used herein in the specification is for the purpose of describing particular embodiments only and is not intended to limit the invention; the terms "comprising" and "having," and any variations thereof, in the specification, claims, and foregoing drawings are intended to cover non-exclusive inclusion. The terms "first," "second," etc., in the specification, claims, or foregoing drawings are used to distinguish different objects and not to describe a particular order.

[0018] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0019] To enable those skilled in the art to better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.

[0020] It should be noted that the LLM-based network attack chain complete identification method provided in this embodiment of the invention is generally executed by a server / terminal device, and correspondingly, the LLM-based network attack chain complete identification device is generally installed in the server / terminal device.

[0021] It should be understood that the number of terminal devices, networks, and servers is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be used.

[0022] Embodiment 1. Referring to Figure 1, a flowchart of an embodiment of the LLM-based network attack chain complete identification method of the present invention is shown. The method includes the following steps:
Step S1: Construct an attack chain semantic model and define entity types and feature requirements.

[0023] In this embodiment, the electronic device (e.g., a server / terminal device) on which the LLM-based network attack chain complete identification method runs can receive the identification request via a wired or wireless connection. It should be noted that the aforementioned wireless connection methods may include, but are not limited to, 3G / 4G / 5G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra-wideband) connections, and other currently known or future-developed wireless connection methods.

[0024] In this embodiment, step S1 may specifically include the following steps: S11, drawing on the Cyber Kill Chain framework and the MITRE ATT&CK knowledge base, propose a five-stage attack chain semantic model, defining the core behavioral characteristics of the scanning, detection, exploitation, control, and propagation stages respectively.

[0025] Drawing inspiration from the Cyber Kill Chain framework and the MITRE ATT&CK knowledge base (v14 and above), a five-stage attack chain semantic model is proposed, defining the behavioral characteristics of the following five stages:
Scanning Phase: Behavioral characteristics include: port scanning (such as using nmap to scan common ports like 80, 443, and 22), website directory scanning (such as using dirb to scan paths like /admin and /phpmyadmin), subdomain enumeration (such as using Sublist3r), target asset WHOIS information lookup, network segment traversal (such as ping sweep), and other passive or active reconnaissance operations. The core objective is to obtain the target's basic network information, IP address distribution, open ports, and service types, in order to narrow down the target range for subsequent attacks.

[0026] The detection phase includes behavioral characteristics such as: system / service fingerprinting (e.g., using Nmap's -O parameter to identify the operating system type), known vulnerability detection and matching (e.g., using Nessus or OpenVAS to scan for CVE vulnerabilities), asset type determination (web server, database server, file server, etc.), software and system version identification (e.g., the Server field in the HTTP response header), and weak password probing (e.g., using Hydra to brute-force SSH or RDP passwords). The core objective is to locate exploitable vulnerabilities, security configuration flaws, and other attack entry points on the target.

[0027] Exploitation Phase: Behavioral characteristics include: exploit code execution (e.g., using Metasploit's exploit module), remote code execution (RCE, such as JNDI injection via the Log4j2 vulnerability CVE-2021-44228), malicious payload delivery (e.g., via phishing email attachments or malicious macro documents), script injection (e.g., SQL injection, XSS attacks), and initial privilege acquisition (e.g., reverse shell). The core objective is to breach the target's security mechanisms and gain initial access to the target host or network.

[0028] Control Phase: Behavioral characteristics include: issuing remote control commands (such as issuing commands like whoami and net user via a C2 server), deploying Trojans and backdoor programs (such as Gh0st RAT and Cobalt Strike Beacon), persistent system configuration (such as writing to the registry's Run key and creating cron jobs), privilege escalation operations (such as JuicyPotato privilege escalation on Windows and SUID privilege escalation on Linux), establishing encrypted communication with the C2 server (such as HTTPS or DNS tunneling), and lateral movement within the internal network (such as using PsExec or WMI). The core objective is to achieve long-term, stable control over the target host or internal network, preventing loss of privileges or detection by security devices.

[0029] Propagation (outbound transmission) Phase: Behavioral characteristics include: stealing sensitive data (such as reading /etc/passwd, database configuration files, and customer information tables), encrypting and packaging data (such as using 7z or tar + gpg encryption), transmitting stolen data to a designated external server (such as uploading to a server controlled by the attacker via HTTP POST or FTP), erasing attack traces (such as deleting bash_history and clearing Windows event logs), and destroying system logs (such as using the wevtutil cl command). The core objective is to achieve the core attack intent of data theft and destruction, while concealing attack traces to avoid detection and tracking.

[0030] The above behavioral characteristics are stored in the five-stage attack chain semantic model library in the form of structured natural language descriptions, for direct use in subsequent LLM prompt design, entity extraction and stage matching.

[0031] The purpose of step S11 is to clarify the typical behavioral patterns of each attack stage, so that LLM and clustering algorithms can map specific attack events to the correct stage.

[0032] S12 defines multiple core entities for each attack phase, and the entities are stored in triplet format.

[0033] Define 4-6 core entities for each attack phase. Entities are stored in triplet format (entity type, entity name, attribute list). The specific definitions are as follows:

[0034] This entity type table, as part of the unified supporting knowledge base, is directly referenced by LLM and alignment algorithms during the entity extraction and stage matching phases.

[0035] The purpose of step S12 is to provide clear entity type labels for entity extraction, ensuring that the entities extracted by LLM from unstructured threat intelligence text or log fragments are correctly aligned with the attack phase.
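The triplet storage described in step S12 can be illustrated with a short Python sketch. The entries below are hypothetical placeholders chosen from tools named earlier in the text, not the entity definitions of the source, and `candidate_stages` is an illustrative helper name.

```python
from typing import Dict, List, NamedTuple, Tuple

class EntityTriplet(NamedTuple):
    """(entity type, entity name, attribute list) as described in S12."""
    entity_type: str
    entity_name: str
    attributes: Tuple[str, ...]

# Hypothetical library contents, keyed by attack phase.
ENTITY_LIBRARY: Dict[str, List[EntityTriplet]] = {
    "scanning": [
        EntityTriplet("tool", "nmap", ("port_scan",)),
        EntityTriplet("tool", "dirb", ("directory_scan",)),
    ],
    "detection": [
        EntityTriplet("tool", "nmap", ("os_fingerprint",)),
        EntityTriplet("tool", "hydra", ("weak_password_probe",)),
    ],
    "control": [
        EntityTriplet("tool", "psexec", ("lateral_movement",)),
    ],
}

def candidate_stages(entity_name: str) -> List[str]:
    """List the phases whose entity definitions contain this entity name."""
    wanted = entity_name.strip().lower()
    return [stage for stage, triplets in ENTITY_LIBRARY.items()
            if any(t.entity_name == wanted for t in triplets)]

print(candidate_stages("Nmap"))   # ['scanning', 'detection']
print(candidate_stages("hydra"))  # ['detection']
```

An entity such as nmap can legitimately appear in more than one phase, which is why the attribute list (port scan versus OS fingerprint) matters when aligning an extracted entity to a stage.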

[0036] S13 establishes causal and temporal dependencies between attack phases and provides automatically executable protection strategies for each phase.

[0037] Association rules are formally defined as logical implications P → Q (that is, "if P occurs, then Q must or is likely to occur"), each with an accompanying strength weight in the range 0 to 1. For example:
Scanning → Detection, weight 1.0 (scanning is a necessary prerequisite for detection; detection without scanning is semantically invalid).
Detection → Exploitation, weight 0.95 (in most cases, attackers will try to exploit a vulnerability after detecting it, but pure information-gathering scenarios also occur).
Exploitation → Control, weight 0.98 (attackers will almost always establish persistent control to maintain access).
Control → Propagation, weight 0.90 (this depends on the attacker's ultimate goal; some APT attacks may remain dormant for a long time and not proceed to outbound transmission immediately).
Handling suggestions are stored in the form of condition-action rules for the detection and handling module to call. Examples:
IF the scanning phase is detected THEN execute {block the scanning source IP, hide asset information, and trigger a low-priority alarm}.
IF the exploitation phase is detected THEN execute {isolate the infected host, remove the malicious payload, trigger a high-risk alert, and push it to the SOC}.
IF the propagation (outbound transmission) phase is detected THEN execute {block the outbound IP / domain, restore tampered logs, trigger an emergency alarm, and initiate data breach emergency response}.
These rules are stored in JSON format in the five-stage attack chain semantic model library and support dynamic addition, deletion, modification and query.
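A minimal sketch of how such rules might be stored and queried in JSON follows. It assumes the four weights apply, in order, to the transitions between consecutive stages; the JSON key names, action identifiers, and helper functions are hypothetical and not taken from the source.

```python
import json
from typing import Dict, List, Tuple

STAGES = ["scanning", "detection", "exploitation", "control", "propagation"]

# Assumed JSON layout; key names and action identifiers are illustrative.
RULES = json.loads("""
{
  "dependencies": [
    {"from": "scanning", "to": "detection", "weight": 1.0},
    {"from": "detection", "to": "exploitation", "weight": 0.95},
    {"from": "exploitation", "to": "control", "weight": 0.98},
    {"from": "control", "to": "propagation", "weight": 0.90}
  ],
  "handling": {
    "scanning": ["block_source_ip", "hide_asset_info", "low_priority_alarm"],
    "exploitation": ["isolate_host", "remove_payload", "high_risk_alert", "push_to_soc"],
    "propagation": ["block_outbound_target", "restore_logs", "emergency_alarm",
                    "start_breach_response"]
  }
}
""")

def transition_weights() -> Dict[Tuple[str, str], float]:
    return {(d["from"], d["to"]): d["weight"] for d in RULES["dependencies"]}

def chain_confidence(first: str, last: str) -> float:
    """Multiply weights along the consecutive transitions first..last."""
    weights = transition_weights()
    i, j = STAGES.index(first), STAGES.index(last)
    conf = 1.0
    for a, b in zip(STAGES[i:j], STAGES[i + 1:j + 1]):
        conf *= weights[(a, b)]
    return conf

def missing_stages(observed: List[str]) -> List[str]:
    """Stages between the earliest and latest observed stage that were not seen."""
    seen = set(observed)
    if not seen:
        return []
    idx = sorted(STAGES.index(s) for s in seen)
    return [s for s in STAGES[idx[0]:idx[-1] + 1] if s not in seen]

def actions_for(stage: str) -> List[str]:
    """Condition-action lookup: detected stage -> handling actions."""
    return RULES["handling"].get(stage, [])

print(missing_stages(["scanning", "exploitation"]))  # ['detection']
print(actions_for("scanning")[0])                    # block_source_ip
```

Multiplying the weights along a path is only one possible way to score a reconstructed chain; the source does not specify how the weights are combined.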

[0038] The purpose of step S13 is to establish causal and temporal dependencies between attack stages, support the temporal reconstruction of the attack chain and intelligent completion of broken links, and provide automatically executable protection strategies for each stage.

[0039] The purpose of step S1 is to construct a standardized, computable, and scalable five-stage attack chain semantic model. This model provides a unified semantic framework, stage division criteria, entity type definitions, and handling strategy benchmarks for the identification of all subsequent attack behaviors. This model serves as the knowledge foundation and semantic anchor of the entire method, and all subsequent processing (classification, matching, clustering, and generation) is based on this model.

[0040] Step S2: Based on entity type and feature requirements, collect raw attack data from multi-source heterogeneous environments, preprocess the raw attack data, and output standardized attack data.

[0041] In this embodiment, step S2 may specifically include the following steps: S21 uses network traffic capture tools, host log collection agents, and external threat intelligence platform interfaces to collect network traffic characteristic data, host operation logs, and threat intelligence text in parallel.

[0042] Three types of data collection agents / tools are deployed in parallel:
Network traffic capture: Deploy Suricata or Wireshark on the mirrored port of the core switch. Capture raw network packets (pcap format), parse them in real time, and extract the following fields:
5-tuple: source IP, source port, destination IP, destination port, and transport layer protocol (TCP / UDP / ICMP).
HTTP layer information: request method (GET / POST), URL path, User-Agent, request parameters.
TLS information: SNI (Server Name Indication), certificate information.
Statistical characteristics: flow duration, packet size distribution, uplink and downlink byte counts.
Host log collection: Deploy Filebeat (a lightweight log collector) or OSSEC (a host intrusion detection system) on each protected host. Log types collected include:
System logs (/var/log/syslog on Linux, Event Log on Windows).
Application logs (web server access.log and error.log, database logs).
Process creation records (auditd on Linux, Sysmon Event ID 1 on Windows).
File integrity monitoring (changes to critical configuration files or sensitive files).
Threat intelligence gathering: The latest lists of malicious IPs / domains, CVE vulnerability details, attack pattern descriptions, and YARA rules are retrieved periodically (e.g., every 5 minutes) via API from platforms such as MITRE ATT&CK, CAPEC, Alibaba Cloud Threat Intelligence Center, and VirusTotal. The data format is typically JSON or the STIX 2.1 standard.

[0043] The collected data includes both structured JSON / CSV formats (such as a list of malicious IPs returned by the threat intelligence API) and unstructured text (such as vulnerability descriptions and TTP descriptions in attack reports).

[0044] The purpose of step S21 is to capture traces of attack behavior from different levels (network layer, host layer, intelligence layer), cover the digital footprints that may be left at each stage of the attack chain, and avoid blind spots caused by a single data source.
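As an illustration of the network-layer collection, the sketch below pulls the 5-tuple, HTTP, and TLS fields out of one Suricata EVE JSON event. The input field names (`dest_ip`, `proto`, `http.http_method`, `tls.sni`, and so on) follow Suricata's EVE output as commonly documented and should be verified against the deployed version; the output key names and the sample event are assumptions made for this example.

```python
import json

def eve_to_fields(line: str) -> dict:
    """Extract 5-tuple, HTTP, and TLS fields from one EVE JSON line."""
    event = json.loads(line)
    http = event.get("http", {})
    tls = event.get("tls", {})
    return {
        "timestamp": event.get("timestamp"),
        # 5-tuple
        "src_ip": event.get("src_ip"),
        "src_port": event.get("src_port"),
        "dst_ip": event.get("dest_ip"),
        "dst_port": event.get("dest_port"),
        "protocol": event.get("proto"),
        # HTTP layer (absent for non-HTTP events)
        "http_method": http.get("http_method"),
        "request_path": http.get("url"),
        "user_agent": http.get("http_user_agent"),
        # TLS layer (absent for non-TLS events)
        "tls_sni": tls.get("sni"),
    }

# Hypothetical sample event using documentation-range IP addresses.
sample = json.dumps({
    "timestamp": "2026-04-10T09:35:22.000000+0000",
    "event_type": "http",
    "src_ip": "192.0.2.10", "src_port": 54321,
    "dest_ip": "198.51.100.5", "dest_port": 8080,
    "proto": "TCP",
    "http": {"http_method": "GET", "url": "/api/v1/exec?cmd=id",
             "http_user_agent": "curl/8.0"},
})
fields = eve_to_fields(sample)
print(fields["dst_ip"], fields["dst_port"], fields["http_method"])
```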

[0045] S22 filters invalid data, completes missing fields, and uses the min-max normalization formula to linearly map the value of each feature to the interval [0,1].

[0046] Filter normal access traffic: For example, requests with an HTTP response status code of 200 and a request URL that is a static resource (.js, .css, .png) and has no abnormal parameters can be marked as normal and discarded (not sent to the LLM classification) to reduce the amount of noisy data.

[0047] Remove duplicate logs: Remove duplicates based on the (timestamp, src_ip, dst_ip, port) key combination, keeping only the earliest or the aggregated log.

[0048] Complete missing fields: For critical but missing fields such as "abnormal behavior markers", use rule-based filling: if the request parameters contain patterns such as cmd=, exec=, $(...), then fill in "suspected command injection"; if it is completely impossible to determine, then fill in "unknown".
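The three cleaning rules above (filter normal traffic, remove duplicates, fill missing fields) can be sketched together in Python. This is a simplified illustration: "no abnormal parameters" is approximated as "no parameters at all", the port in the deduplication key is assumed to be the destination port, and the field names `status` and `malicious_mark` are assumptions.

```python
import re
from typing import Dict, List
from urllib.parse import unquote

STATIC_EXT = (".js", ".css", ".png")
# Patterns named in the text: cmd=, exec=, $(...)
INJECTION = re.compile(r"cmd=|exec=|\$\(")

def is_normal(rec: Dict) -> bool:
    """HTTP 200 for a static resource with no parameters counts as normal."""
    return (rec.get("status") == 200
            and rec.get("request_path", "").endswith(STATIC_EXT)
            and not rec.get("request_params"))

def fill_mark(rec: Dict) -> Dict:
    """Rule-based fill of the missing abnormal-behavior marker."""
    if rec.get("malicious_mark"):
        return rec
    params = unquote(rec.get("request_params") or "")
    mark = "suspected command injection" if INJECTION.search(params) else "unknown"
    return {**rec, "malicious_mark": mark}

def preprocess(records: List[Dict]) -> List[Dict]:
    seen = set()
    cleaned = []
    # Sorting by timestamp means the earliest record of each key is kept.
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        if is_normal(rec):
            continue
        key = (rec["timestamp"], rec["src_ip"], rec["dst_ip"], rec["dst_port"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(fill_mark(rec))
    return cleaned

base = {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "dst_port": 8080}
records = [
    {**base, "timestamp": "2026-04-10T09:35:22Z", "status": 200,
     "request_path": "/api/v1/exec", "request_params": "cmd=cat%20/etc/passwd"},
    {**base, "timestamp": "2026-04-10T09:35:22Z", "status": 200,
     "request_path": "/api/v1/exec", "request_params": "cmd=cat%20/etc/passwd"},
    {**base, "timestamp": "2026-04-10T09:35:23Z", "status": 200,
     "request_path": "/static/app.js", "request_params": ""},
    {**base, "timestamp": "2026-04-10T09:35:24Z", "status": 404,
     "request_path": "/index", "request_params": "q=hello"},
]
cleaned = preprocess(records)
print(len(cleaned), [r["malicious_mark"] for r in cleaned])
```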

[0049] Min-max normalization is used to linearly map the value of each feature to the interval [0,1]. The normalization formula is as follows:

$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

[0050] where $i$ is the sample index, indicating the $i$-th sample in the dataset (e.g., a specific attack log entry).

[0051] $x_i$: the raw numerical value of the $i$-th sample for a specific feature. For example, if the feature is "connection duration (seconds)", then $x_i = 12$ indicates that the $i$-th attack lasted for 12 seconds.

[0052] $x_{\min}$: the minimum value of this feature across all samples in the entire dataset. For example, the minimum connection duration in all attack logs is 1 second.

[0053] $x_{\max}$: the maximum value of this feature across all samples in the entire dataset. For example, the maximum connection duration in all attack logs is 60 seconds.

[0054] $x_i'$: the normalized feature value, ranging from 0 to 1. For example, a connection duration of 12 seconds has the normalized value $x_i' = (12 - 1) / (60 - 1) \approx 0.186$.
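The normalization can be checked with a few lines of Python, using the example values from the text (minimum 1 second, maximum 60 seconds, sample 12 seconds). The function name and the handling of a constant feature (mapped to 0.0) are choices made for this sketch.

```python
from typing import List

def min_max_normalize(values: List[float]) -> List[float]:
    """Linearly map values to [0, 1]; a constant feature maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

durations = [1, 12, 60]  # connection durations in seconds
normalized = min_max_normalize(durations)
print([round(v, 3) for v in normalized])  # [0.0, 0.186, 1.0]
```

In a streaming deployment the minimum and maximum would have to come from a reference dataset or a sliding window rather than the current batch; the source does not say which.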

[0055] Eliminating the influence of units: Connection duration (seconds) and access frequency (times/minute) have different units and numerical ranges, so comparing them directly or computing distances on the raw values lets the feature with the larger range (here, connection duration) dominate. After normalization, both lie in the [0,1] interval and participate fairly in the cluster distance calculation.

[0056] Maintaining relative size relationships: Normalization is a linear transformation that does not change the shape of the data distribution. Larger values in the original data remain larger after normalization.

[0057] Adaptation algorithm requirements: Many machine learning algorithms (including distance calculation in fuzzy clustering) are sensitive to the scale of input features, and normalization can significantly improve clustering stability.

[0058] The purpose of step S22 is to delete invalid or interfering data, fill in missing values, and eliminate numerical differences between features caused by different units, so that subsequent clustering algorithms (such as the flexible fuzzy clustering in step S5) can treat each feature fairly and large-valued features do not swamp small-valued ones.

[0059] S23 converts all heterogeneous data into a unified JSON structured format and extracts core feature vectors that can distinguish attack chains.

[0060] Unified format conversion: Write a preprocessing pipeline in Python or Go to unify the output of the network traffic, host log, and threat intelligence data sources into the following JSON structure:
{
  "timestamp": "2026-04-10T09:35:22Z",
  "src_ip": "XXX.XX.XX.XX",
  "dst_ip": "XX.XXXX",
  "src_port": 54321,
  "dst_port": 8080,
  "protocol": "TCP",
  "request_path": "/api/v1/exec",
  "request_params": "cmd=cat%20/etc/passwd",
  "conn_duration_sec": 12,
  "conn_duration_norm": 0.186,
  "request_freq_per_min": 20,
  "request_freq_norm": 0.192,
  "malicious_mark": "Suspected command injection",
  "threat_intel_match": "CVE-2022-22965"
}
Core feature extraction: Extract the following 15-20 dimensional feature vector from the original logs as input for clustering and classification.

[0061] Discrete characteristics: Protocol type (1=TCP, 2=UDP, 3=ICMP), Request method (1=GET, 2=POST, 3=other).

[0062] Continuous characteristics: normalized connection duration, number of packets, and uplink / downlink byte ratio.

[0063] Semantic features: the number of sensitive words in the request path (such as the number of times /etc/passwd, cmd=, and exec appear).

[0064] Behavioral flags: whether it is a known malicious IP (0 / 1) or an uncommon port (0 / 1).
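The four feature groups above can be turned into a numeric vector along the following lines. This is a reduced sketch with 7 of the 15-20 dimensions; the `method` field, the `COMMON_PORTS` and `KNOWN_BAD_IPS` sets, and the choice to count sensitive tokens across both path and parameters are illustrative assumptions.

```python
PROTOCOL_CODES = {"TCP": 1, "UDP": 2, "ICMP": 3}
METHOD_CODES = {"GET": 1, "POST": 2}          # anything else -> 3 (other)
SENSITIVE_TOKENS = ("/etc/passwd", "cmd=", "exec")
COMMON_PORTS = {22, 53, 80, 443}               # assumption: site-specific list
KNOWN_BAD_IPS = {"203.0.113.7"}                # placeholder threat-intel set

def extract_features(rec):
    """Build a small feature vector from one standardized JSON record."""
    text = (rec.get("request_path") or "") + "?" + (rec.get("request_params") or "")
    return [
        PROTOCOL_CODES.get(rec.get("protocol"), 0),        # discrete
        METHOD_CODES.get(rec.get("method"), 3),            # discrete
        rec.get("conn_duration_norm", 0.0),                # continuous
        rec.get("request_freq_norm", 0.0),                 # continuous
        sum(text.count(tok) for tok in SENSITIVE_TOKENS),  # semantic
        int(rec.get("src_ip") in KNOWN_BAD_IPS),           # behavioral flag
        int(rec.get("dst_port") not in COMMON_PORTS),      # behavioral flag
    ]
```

For the example record in the JSON structure above, the sensitive-token count is 3 (one hit each for /etc/passwd, cmd=, and exec) and the uncommon-port flag is 1 for port 8080.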

[0065] The standardized data flows into step S3 (LLM classification) on one hand, and is backed up to the data lake for subsequent cluster analysis on the other.

[0066] The purpose of step S23 is to convert all heterogeneous data (from different sources and in different original formats) into a unified JSON structured format and extract the core feature vectors that are most effective in identifying attack chains, so as to facilitate the subsequent LLM Prompt input and matrix construction of clustering algorithms.

[0067] Step S2 aims to comprehensively collect raw attack-related data from various heterogeneous data sources (network traffic, host logs, threat intelligence platforms), and transform it into high-quality input data in a unified format through cleaning, normalization, and standardization. This step is the cornerstone of the data-driven approach, and data quality directly determines the accuracy of subsequent LLM classification and clustering analysis.

[0068] Step S3: Based on standardized attack data, the semantic understanding and context learning capabilities of the large language classification model are used to automatically determine whether the attack type belongs to a known attack type or an unknown attack type, and output classification labels as diversion control signals.

[0069] In this embodiment, step S3 may specifically include the following steps: S31 integrates publicly available attack libraries and historical attack samples from within the enterprise to form a semantic set of known attack types.

[0070] It integrates multiple publicly available attack libraries: MITRE ATT&CK v14 (provides TTP tactical and technical procedures), CAPEC 3.9 (Common Attack Pattern Enumeration and Classification, providing attack patterns), and CVE vulnerability database (provides vulnerability numbers and descriptions).

[0071] Include at least 1,000 historical attack samples from within the enterprise. Each sample should contain an attack type label (such as "SQL injection", "phishing attack", "Log4j2 RCE"), attack data fragments, and corresponding handling conclusions.

[0072] These text descriptions are converted into dense vectors using an embedding model while preserving the original text, forming a semantic set of known attack types. The vector dimension depends on the model: for example, the open-source BGE base model produces 768-dimensional vectors, whereas OpenAI text-embedding-ada-002 produces 1536-dimensional vectors.

[0073] The purpose of step S31 is to provide the LLM with a reference semantic set of known attack types, so that during inference it can judge by comparison whether the input matches a known attack.

[0074] S32 employs an in-context learning approach, providing few-shot examples in the prompt to guide the LLM in performing classification reasoning based on standardized attack data.

[0075] The ICL (in-context learning) method is used: the prompt provides a set of few-shot examples (2-3 input-output pairs of known attack types), followed by the attack data to be classified. The template is as follows: You are a cybersecurity attack classification expert. Below are known attack types and their typical characteristics: 1. Remote Code Execution (RCE): Attackers execute arbitrary code through system command injection, deserialization vulnerabilities, etc. Characteristics: Request parameters include cmd=, exec(), Runtime.exec(), etc.

[0076] 2. SQL Injection: Attackers insert malicious fragments into SQL statements. Characteristics: Request parameters contain ' OR '1'='1, UNION SELECT, etc.

[0077] 3. Phishing Attacks: Stealing credentials by forging login pages or malicious attachments. Characteristics: URLs similar to the official ones but differing by one letter, or Office files containing malicious macros.

[0078] For example: "Now please determine whether the following standardized attack data belongs to a known attack (and output the type) or an unknown attack." Attack data: { "src_ip": "XXX.XX.XX.XX", "dst_port": 8080, "request_path": "/api/v1/exec", "request_params": "cmd=cat%20/etc/passwd", "malicious_mark": "Suspected command injection" }

[0079] Output format: {"type": "known", "attack_id": "CVE-2022-22965"} or {"type":"unknown"}.
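The template of step S32 could be assembled programmatically along these lines. This is a sketch only: the `KNOWN_TYPES` list and the `build_prompt` helper are hypothetical names, and the wording is condensed from the template above rather than copied from any reference implementation.

```python
import json

# Condensed from the three attack types described in the template.
KNOWN_TYPES = [
    ("Remote Code Execution (RCE)",
     "request parameters include cmd=, exec(), Runtime.exec()"),
    ("SQL Injection",
     "request parameters contain ' OR '1'='1, UNION SELECT"),
    ("Phishing Attacks",
     "look-alike URLs or Office files containing malicious macros"),
]

def build_prompt(record):
    """Assemble the classification prompt for one standardized record."""
    lines = [
        "You are a cybersecurity attack classification expert.",
        "Below are known attack types and their typical characteristics:",
    ]
    for idx, (name, traits) in enumerate(KNOWN_TYPES, start=1):
        lines.append(f"{idx}. {name}. Characteristics: {traits}.")
    lines.append("Now please determine whether the following standardized attack "
                 "data belongs to a known attack (and output the type) or an "
                 "unknown attack.")
    lines.append("Attack data: " + json.dumps(record, sort_keys=True))
    lines.append('Output format: {"type": "known", "attack_id": "..."} '
                 'or {"type": "unknown"}')
    return "\n".join(lines)
```

Serializing the record with `sort_keys=True` keeps the prompt text stable for identical inputs, which helps when caching or auditing LLM calls.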

[0080] The feature descriptions in the prompt are consistent with the entity types defined in step S1, forming a semantic closed loop.

[0081] The purpose of step S32 is to guide the LLM, through a carefully constructed prompt template, to perform in-context learning and reasoning on standardized attack data and output accurate classification results.

[0082] S33 calls the LLM inference API, parses the returned JSON result, and obtains the known attack or unknown attack label. If it is a known attack, the attack type identifier is output synchronously.

[0083] Call the LLM inference API (such as DeepSeek-V3, Qwen-Max, or GPT-4 Turbo), passing in the above prompt together with the standardized data to be classified.

[0084] Parse the JSON returned by the LLM and obtain the type field. If it is "known", then also obtain the attack_id (such as CVE number or attack family name).

[0085] Example of output: {"type": "known", "attack_id": "Remote code execution attack (CVE-2022-22965)"}; {"type": "unknown"}.

[0086] The classification result serves as a control flow signal, determining whether to subsequently call the known attack identification module (step S4) or the unknown attack identification module (step S5).
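The parsing and routing in step S33 might look like the sketch below. The `route` function is a hypothetical helper, and treating malformed or incomplete replies as "unknown" is an assumption of this example; the text does not specify error handling.

```python
import json

def route(llm_reply):
    """Return ("S4", attack_id) for known attacks, ("S5", None) otherwise."""
    try:
        result = json.loads(llm_reply)
    except json.JSONDecodeError:
        # Assumption: an unparseable reply is sent to the unknown-attack branch.
        return ("S5", None)
    if (isinstance(result, dict)
            and result.get("type") == "known"
            and result.get("attack_id")):
        return ("S4", result["attack_id"])
    return ("S5", None)

print(route('{"type": "known", "attack_id": "CVE-2022-22965"}'))  # ('S4', 'CVE-2022-22965')
print(route('{"type": "unknown"}'))                               # ('S5', None)
```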

[0087] The purpose of step S33 is to generate clear known / unknown labels and specific attack type identifiers for routing to step S4 or step S5.

[0088] Step S3 serves to leverage the powerful semantic understanding and context learning capabilities of the large language classification model to automatically determine whether the input attack data belongs to a known attack type or an unknown attack type. This step is the core of the intelligent triage process, ensuring that known attacks are quickly matched while unknown attacks enter the deep clustering modeling process.

[0089] Step S4: For known attack types, based on the attack chain semantic model, the current attack chain stage is located through LLM-driven entity extraction and semantic similarity matching.

[0090] In this embodiment, step S4 may specifically include the following steps: S41, design entity extraction prompt templates, and use the named entity recognition capability of LLM to extract four key entities: attackers, tools, techniques and vulnerabilities from unstructured threat intelligence text.

[0091] Design prompt templates specifically for entity extraction, leveraging the implicit sequence labeling capabilities of the LLM.

[0092] Input text example (obtained from the Cybersecurity Repository): "CVE-2022-22965 is a remote code execution vulnerability in Spring Cloud Gateway. Attackers can execute system commands by constructing malicious HTTP request parameters to gain server privileges. Burp Suite is a commonly used tool, and the attacked IPs are mostly malicious nodes."
Entity extraction prompt template, for example: Please extract four core entities from the following threat intelligence text: attackers, tools, techniques, and vulnerabilities.

[0093] The output format is one triple per line: (Entity type, Entity name, Attribute). Text: {Threat Intelligence Text}
LLM output:
(Attacker, IP: XXX.XX.XX.XX, malicious node)
(Tools, Burp Suite, HTTP request construction)
(Technology, remote code execution, system command injection)
(Vulnerability, CVE-2022-22965, Spring Cloud Gateway)
This approach leverages the named entity recognition (NER) capability of the LLM, eliminating the need to train a dedicated NER model. Extraction is zero-shot, driven by the prompt alone, which gives it strong generalization ability.
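The triples returned by the LLM still have to be parsed into structured data before alignment. A minimal sketch, assuming the LLM keeps to the (type, name, attribute) format and that neither the type nor the name contains a comma:

```python
import re

# (Entity type, Entity name, Attribute); the attribute may itself contain commas.
TRIPLE = re.compile(r"\(([^,()]+),\s*([^,()]+),\s*([^()]+)\)")

def parse_triples(text):
    """Extract (entity_type, entity_name, attribute) tuples from LLM output."""
    return [tuple(part.strip() for part in m.groups())
            for m in TRIPLE.finditer(text)]

reply = ("(Attacker, IP: XXX.XX.XX.XX, malicious node)\n"
         "(Tools, Burp Suite, HTTP request construction)")
print(parse_triples(reply))
```

Lines that do not match the pattern are silently skipped here; a production parser would more likely log them for review.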

[0094] The purpose of step S41 is to extract four key entities—attackers, tools, techniques, and vulnerabilities—from unstructured threat intelligence text (such as CVE descriptions, attack report paragraphs, and security blogs) and convert natural language into structured data.

[0095] S42 performs rule-based and semantic alignment between the extracted free-format entities and entity types, filters out invalid entities, and generates a standardized structured entity set.

[0096] Rule alignment: Establish a mapping table that maps the free-text entity types output by the LLM (such as "vulnerability" and "tool") to standard entity types (such as "Vuln_ID" and "Exploit_Tool"). For example: "Vulnerability" → "Vuln_ID"; "Tool" → "Exploit_Tool" or "Scanner_Tool", depending on the context.

[0097] Semantic alignment: For cases the rules cannot cover, semantic similarity is computed using LLM embeddings. The embedding vector of the output entity name is compared with the embedding vectors of the entity names in the entity table of step S12, and the entity type with the highest similarity is selected, provided that this similarity exceeds a threshold (e.g., 0.85).

[0098] Generate a standardized set:
{
  "attack_type": "CVE-2022-22965",
  "entities": [
    {"stage": "exploitation", "entity_type": "Exploit_Tool", "name": "BurpSuite", "attrs": {"purpose": "HTTP request construction"}},
    {"stage": "exploitation", "entity_type": "Vuln_ID", "name": "CVE-2022-22965", "attrs": {"product": "Spring Cloud Gateway"}},
    {"stage": "exploitation", "entity_type": "Technique", "name": "remote code execution", "attrs": {"method": "system command injection"}}
  ]
}
The purpose of step S42 is to align the free-format entities extracted in the previous step with the standard entity types defined in step S12, filter out invalid or redundant entities, and generate a standardized set of structured entities.

[0099] S43, construct a known attack chain stage matching library, calculate the cosine similarity between the current attack entity set and the standard entity set of each stage, and select the stage with the highest similarity as the output.

[0100] Build a known attack chain stage matching library: Pre-build a standard entity set and stage mapping for each known attack type (such as SQL injection, RCE, phishing attack), and store it as an attack type-stage-entity table.

[0101] Semantic similarity calculation: for the structured entity set $E_{\text{cur}}$ of the current attack and the standard entity set $E_s$ of each stage $s$ in the matching library ($s \in$ {scanning, probing, exploitation, control, propagation}), calculate the cosine similarity: $\mathrm{sim}(E_{\text{cur}}, E_s) = \dfrac{\mathbf{v}_{\text{cur}} \cdot \mathbf{v}_s}{\lVert \mathbf{v}_{\text{cur}} \rVert \, \lVert \mathbf{v}_s \rVert}$.

[0102] where $\mathbf{v}_{\text{cur}}$ and $\mathbf{v}_s$ are dense vectors (e.g., 768-dimensional) obtained by mapping the entity sets $E_{\text{cur}}$ and $E_s$ through an LLM embedding model. End-to-end similarity scoring can also be performed directly using the LLM.

[0103] Stage localization: select $s^* = \arg\max_s \mathrm{sim}(E_{\text{cur}}, E_s)$. For example, the exploitation stage returns a similarity score of 0.95, which is much higher than the 0.4 score of the probing stage.
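The stage-matching step can be sketched as a cosine-similarity argmax over the five stage vectors. The toy 3-dimensional vectors in the usage example stand in for real embedding vectors and are purely illustrative; how the embeddings are obtained is outside this sketch.

```python
import math

STAGES = ("scanning", "probing", "exploitation", "control", "propagation")

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def locate_stage(current_vec, stage_vecs):
    """Return the best-matching stage and the per-stage similarity scores."""
    scores = {s: cosine(current_vec, stage_vecs[s]) for s in STAGES}
    best = max(scores, key=scores.get)
    return best, scores

stage_vecs = {
    "scanning": [1, 0, 0], "probing": [0, 1, 0], "exploitation": [0, 0, 1],
    "control": [1, 1, 0], "propagation": [1, 0, 1],
}
best, scores = locate_stage([0.1, 0.2, 1.0], stage_vecs)
print(best)
```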

[0104] Output: the stage identifier (exploitation stage), along with the entity details and handling recommendations for this stage.

[0105] The purpose of step S43 is to calculate the semantic similarity between the current attack entity set and the five-stage standard entity library in step S1, and select the stage with the highest similarity as the output.

[0106] The purpose of step S4 is to automatically extract key entities from the threat intelligence text for attack types identified by the LLM as known, and to perform semantic similarity matching against the five-stage semantic model constructed in step S1, so as to accurately output the current attack chain stage (scanning / probing / exploitation / control / propagation) and the corresponding entity details.

[0107] Step S5: For unknown attack types, attack events with similar behaviors are grouped into the same cluster according to the user's actual needs through flexible fuzzy clustering. The core behavioral features of each cluster are extracted through frequent itemset mining. Using LLM semantic generation, an unknown attack chain semantic model is automatically constructed.

[0108] In this embodiment, step S5 may specifically include the following steps: S51 uses the correlation coefficient method to construct a fuzzy similarity matrix, and performs multi-cut set clustering based on a fast direct clustering algorithm to divide the unknown attack data into multiple clusters according to behavioral similarity.

[0109] In specific implementation, step S51 further includes the following steps: S511, construct a standardized data matrix of attack events.

[0110] Suppose there are $n$ unknown attack events $x_1, x_2, \dots, x_n$, and each event has $m$ standardized features (numerical features from the preprocessed features in step S2, such as normalized connection duration, access frequency, port number, etc.). Construct the matrix: $X = (x_{ik})_{n \times m}$.

[0111] Each $x_{ik}$ denotes the value of the $i$-th attack event on the $k$-th feature (already normalized to the interval $[0,1]$).

[0112] S512, construct the fuzzy similarity matrix $R$.

[0113] The correlation coefficient method is used to calculate the similarity $r_{ij}$ between any two attack events $x_i$ and $x_j$. The formula is: $r_{ij} = \dfrac{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{m} (x_{jk} - \bar{x}_j)^2}}$.

[0114] where: $r_{ij}$: the similarity coefficient between attack event $x_i$ and attack event $x_j$, with range $[-1, 1]$. A value closer to 1 indicates that the two events have similar behavioral characteristics (strong positive correlation), while a value closer to -1 indicates a negative correlation.

[0115] $i, j$: indices of attack events, $i, j = 1, 2, \dots, n$.

[0116] $k$: index of features, $k = 1, 2, \dots, m$.

[0117] $x_{ik}$: the normalized value of the $i$-th event on the $k$-th feature.

[0118] $x_{jk}$: the normalized value of the $j$-th event on the $k$-th feature.

[0119] $\bar{x}_i$: the average value of the $i$-th event across all $m$ features, calculated using the following formula: $\bar{x}_i = \dfrac{1}{m} \sum_{k=1}^{m} x_{ik}$.

[0120] $\bar{x}_j$: the average value of the $j$-th event across all $m$ features.

[0121] All $r_{ij}$ constitute an $n \times n$ symmetric matrix $R = (r_{ij})_{n \times n}$, called the fuzzy similarity matrix, where $r_{ii} = 1$ and $r_{ij} = r_{ji}$.

[0122] The correlation coefficient method captures linear correlation patterns between the feature profiles of two events, not just their absolute distance. For attack behaviors, even if two attack events have different numerical values (e.g., one scans 100 ports and the other scans 1000 ports), as long as their feature change trends are similar (e.g., both follow an increasing port number scanning pattern), the correlation coefficient can still be high. This makes it more suitable for attack behavior clustering than Euclidean distance.
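The correlation-coefficient construction of the similarity matrix can be sketched in plain Python as follows. Setting the coefficient to 0.0 when an event has zero variance across its features is an assumption of this sketch, since the formula is undefined in that case.

```python
import math

def fuzzy_similarity_matrix(X):
    """X: list of n events, each a list of m normalized features.
    Returns the n x n matrix of row-wise correlation coefficients."""
    n = len(X)
    means = [sum(row) / len(row) for row in X]
    R = [[1.0] * n for _ in range(n)]          # r_ii = 1
    for i in range(n):
        di = [x - means[i] for x in X[i]]
        for j in range(i + 1, n):
            dj = [x - means[j] for x in X[j]]
            num = sum(a * b for a, b in zip(di, dj))
            den = (math.sqrt(sum(a * a for a in di))
                   * math.sqrt(sum(b * b for b in dj)))
            # Assumption: undefined correlation (zero variance) is treated as 0.
            R[i][j] = R[j][i] = num / den if den else 0.0
    return R

X = [[0.1, 0.2, 0.3],   # rising profile
     [0.2, 0.4, 0.6],   # same trend, larger magnitude
     [0.3, 0.2, 0.1]]   # opposite trend
R = fuzzy_similarity_matrix(X)
```

In this toy input, the first two events differ in magnitude but share a trend, so their coefficient is close to 1, while the third event is close to -1 against both.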

[0123] S513 performs fast direct clustering (without transitive closure).

[0124] This avoids the high time complexity of computing the transitive closure in traditional fuzzy clustering: clustering is performed directly on the graph induced by the similarity matrix, which greatly improves efficiency.

[0125] The fuzzy similarity matrix $R$ is viewed as a weighted undirected complete graph: each attack event is a node, and the weight of the edge between node $i$ and node $j$ is $r_{ij}$.

[0126] Given a cut level $\lambda$ (with range $[0, 1]$), delete all edges with weight $r_{ij} < \lambda$ and retain the edges with weight $r_{ij} \ge \lambda$.

[0127] In the retained graph, the connected components are computed. All attack events within each connected component are clustered into one class.

[0128] The result of $\lambda$-cut clustering is equivalent to: events $i$ and $j$ belong to the same class if and only if there is a path $i = p_0, p_1, \dots, p_t = j$ with $r_{p_{l-1} p_l} \ge \lambda$ for every $l = 1, \dots, t$.

[0129] Based on fuzzy clustering theory, clustering by the $\lambda$-cut of the transitive closure $t(R)$ of the similarity matrix and clustering directly by the connected components of the $\lambda$-cut of $R$ give equivalent results. Therefore, there is no need to calculate the transitive closure; clustering can be performed directly, reducing the time complexity to $O(n^2)$ for graph construction plus a connected-component search that is linear in the number of nodes and retained edges.
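A compact way to realize the cut-and-connect procedure is a union-find pass over the upper triangle of the similarity matrix, as sketched below. Union-find is one possible choice for the connected-component search; the text does not prescribe a particular algorithm.

```python
def lambda_cut_clusters(R, lam):
    """Cluster events as connected components of the graph that keeps
    only edges with similarity r_ij >= lam."""
    n = len(R)
    parent = list(range(n))

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if R[i][j] >= lam:
                parent[find(i)] = find(j)   # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

R = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
print(lambda_cut_clusters(R, 0.7))   # [[0, 1], [2, 3]]
```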

[0130] S514 performs flexible multi-cut set result partitioning.

[0131] A traditional single-$\lambda$ cut yields a fixed clustering result and lacks flexibility; this method provides clustering results at multiple cut levels, allowing users to choose the most suitable granularity according to their actual business needs, thus achieving flexibility.

[0132] The values of $\lambda$ are selected from highest to lowest: $\lambda_1 > \lambda_2 > \dots > \lambda_T$ (the step size is usually 0.05 or 0.1).

[0133] For each $\lambda_t$, calculate the clustering result $\mathcal{C}^{(\lambda_t)} = \{C_1^{(\lambda_t)}, \dots, C_{K_t}^{(\lambda_t)}\}$ ($K_t$ clusters).

[0134] Merge the clustering results from all cut levels: $\mathcal{C} = \bigcup_{t=1}^{T} \mathcal{C}^{(\lambda_t)}$.

[0135] Eliminate redundancy: if the same cluster appears under multiple $\lambda$ values simultaneously, only the version corresponding to the highest $\lambda$ is retained (because a higher $\lambda$ implies stricter similarity requirements and stronger intra-cluster cohesion).

[0136] The final output is a set of disjoint clusters $\{C_1, C_2, \dots, C_K\}$ that cover all attack events, satisfying: $\bigcup_{k=1}^{K} C_k = \{x_1, \dots, x_n\}$ and $C_k \cap C_l = \emptyset$ for $k \ne l$.

[0137] Through the clustering parameter setting module of the interactive display layer, users can select a suitable $\lambda$ (or directly accept the system recommendation). This yields a set of clusters at the corresponding granularity, with each cluster representing a new type of unknown attack.
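The merge and redundancy-elimination steps of S514 can be sketched as below, taking the per-level clusterings as given. The `merge_cut_levels` name and the dictionary input format are assumptions of this example; picking the final granularity is left to the user, as described above.

```python
def merge_cut_levels(results):
    """results maps each cut level to its list of clusters (lists of event
    indices). Returns {cluster: highest level at which it appears}."""
    best_level = {}
    # Visit levels from highest to lowest so the first sighting of a
    # cluster is the one with the strictest similarity requirement.
    for lam in sorted(results, reverse=True):
        for cluster in results[lam]:
            best_level.setdefault(frozenset(cluster), lam)
    return best_level

results = {
    0.9: [[0], [1], [2, 3]],
    0.7: [[0, 1], [2, 3]],
}
merged = merge_cut_levels(results)
print(merged[frozenset({2, 3})])   # 0.9
```

The cluster {2, 3} appears at both levels and is kept only once, tagged with the higher level 0.9.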

[0138] Step S51 aims to divide the unclassified unknown attack data into multiple clusters based on behavioral similarity. Unlike traditional hard clustering methods such as K-Means, this method employs a fuzzy similarity matrix and multi-level cut sets, allowing users to select different clustering granularities according to business needs, thus offering high flexibility.

[0139] S52 uses the Apriori algorithm to perform frequent itemset mining and association rule generation for each cluster, extracting high-frequency behavior sets, strong association rules, and typical behavior time series.

[0140] In specific implementation, step S52 further includes the following steps: S521, convert behaviors into transactions.

[0141] Each attack log entry within the cluster is converted into a transaction, and each item in a transaction represents a specific behavioral characteristic. Example: T1: {port 443 scan, /admin path detection, JSP file upload, privilege escalation, external domain mal-xxx.com}; T2: {port 443 scan, /upload path detection, reverse shell, C2 heartbeat, data packaged and encrypted}.

[0142] S522, Mining Frequent Itemsets.

[0143] Set a minimum support threshold $\mathrm{min\_sup}$ (adjustable based on actual data). The support of an itemset $X$ is defined as: $\mathrm{support}(X) = \dfrac{|\{T \in D : X \subseteq T\}|}{|D|}$, where $D$ is the set of transactions in the cluster.

[0144] The Apriori algorithm iteratively generates all itemsets satisfying $\mathrm{support}(X) \ge \mathrm{min\_sup}$, starting with 1-itemsets and gradually expanding.

[0145] Example of mining results: Frequent 1-itemsets: {port 443 scan}: 0.85, {/admin probe}: 0.72, {JSP upload}: 0.68; Frequent 2-itemsets: {port 443 scan, /admin probe}: 0.68, {JSP upload, privilege escalation}: 0.63; Frequent 4-itemset: {port 443 scan, /admin probe, JSP upload, privilege escalation}: 0.60.
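A small level-wise Apriori sketch, assuming transactions are given as sets of behavior labels. It is meant to illustrate the support definition and the candidate-pruning idea, not to be an optimized miner; the toy transactions are invented for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent, k = {}, 1
    while level:
        current = {}
        for cand in level:
            s = support(cand, transactions)
            if s >= min_sup:
                current[cand] = s
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates ...
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # ... and prune candidates that have an infrequent (k-1)-subset.
        level = [c for c in candidates
                 if all(frozenset(sub) in current
                        for sub in combinations(c, k - 1))]
    return frequent

txs = [{"scan", "probe", "upload"}, {"scan", "probe"},
       {"scan", "upload"}, {"probe"}]
freq = apriori(txs, min_sup=0.5)
```

With these four toy transactions and a threshold of 0.5, the pair {probe, upload} (support 0.25) is infrequent, so the triple {scan, probe, upload} is pruned without being counted.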

[0146] S523, generate strong association rules.

[0147] Set a minimum confidence threshold $\mathrm{min\_conf}$. The confidence of an association rule $X \Rightarrow Y$ is defined as follows: $\mathrm{confidence}(X \Rightarrow Y) = \dfrac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$.

[0148] Generate all rules satisfying $\mathrm{confidence}(X \Rightarrow Y) \ge \mathrm{min\_conf}$, together with their support and confidence.

[0149] Example rules: port 443 scan → /admin path detection (support = 0.68, confidence = 0.80); JSP upload → privilege escalation (support = 0.63, confidence = 0.93); privilege escalation → external C2 connection (support = 0.60, confidence = 0.88).
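Rule generation from the frequent itemsets can be sketched as follows. The input is a dictionary of itemset supports; the numbers in the usage example are taken from the mining results quoted in the text, and the 0.75 threshold is an arbitrary value chosen for the illustration.

```python
from itertools import combinations

def association_rules(frequent, min_conf):
    """frequent: {frozenset(itemset): support}.
    Returns (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(sorted(itemset), r)):
                if ante not in frequent:
                    continue
                conf = sup / frequent[ante]   # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((ante, itemset - ante, sup, conf))
    return rules

frequent = {
    frozenset({"port 443 scan"}): 0.85,
    frozenset({"/admin probe"}): 0.72,
    frozenset({"port 443 scan", "/admin probe"}): 0.68,
}
for ante, cons, sup, conf in association_rules(frequent, min_conf=0.75):
    print(set(ante), "->", set(cons), f"support={sup:.2f} confidence={conf:.2f}")
```

The rule port 443 scan → /admin probe gets confidence 0.68 / 0.85 = 0.80, which matches the example rule above.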

[0150] S524, extract the core feature set.

[0151] Frequent itemsets are merged with association rules, deduplicated, and normalized to form structured core features, in the following format: Attack type identifier: Unknown_Cluster_3; High-frequency behavior set: {port 443 scanning, /admin path detection, JSP file upload, privilege escalation, external C2 connection}.

[0152] Strong association rules: port 443 scan → /admin path detection (s=0.68, c=0.80); JSP upload → privilege escalation (s=0.63, c=0.93); privilege escalation → external C2 connection (s=0.60, c=0.88).

[0153] Typical sequence of events: Port scanning → Path detection → File upload → Privilege escalation → C2 communication.

[0154] The purpose of step S52 is to mine high-frequency co-occurring behavioral items and causal association rules from the original behavior logs for each unknown attack cluster, forming the structured core features of the cluster, and providing quantitative basis for the subsequent generation of LLM semantic models.

[0155] S53: Construct a special prompt word template, input the core features into the LLM, and automatically generate the corresponding attack chain semantic model.

[0156] Construct a special prompt word template, input the core features of the cluster, and require the LLM to strictly follow the five-stage output.

[0157] Prompt template: You are a cybersecurity attack chain modeling expert. A new type of unknown cyberattack has been discovered through clustering, and its core behavioral characteristics are as follows: [Unknown Attack Cluster ID]: Cluster_3; [High-Frequency Behavior Set]: {port 443 scanning, /admin path detection, JSP file upload, privilege escalation, external C2 connection, sensitive file reading}; [Strong Association Rules]: 1. Port 443 scan → /admin path detection (support 0.68, confidence 0.80); 2. JSP upload → privilege escalation (support 0.63, confidence 0.93); 3. Privilege escalation → external C2 connection (support 0.60, confidence 0.88).

[0158] The template continues, for example: Please strictly follow the five standard stages of the attack chain (scanning → probing → exploitation → control → propagation) to generate a dedicated attack chain semantic model for this unknown attack. Requirements: 1. For each stage, clearly describe the typical behaviors, objectives, and key entities involved in this unknown attack; 2. Reflect the temporal sequence and logical progression of the behaviors; 3. Use formal language that can be imported directly into the database; 4. Do not fabricate unfounded behaviors; every behavior must be deduced from the given characteristics and rules.

[0159] The purpose of step S53 is to input the numerical frequent itemsets and association rules into the LLM, and to use the semantic understanding and logical reasoning capabilities of the LLM to automatically generate a human-readable and machine-executable semantic model covering the five stages (scanning → probing → exploitation → control → propagation).

[0160] S54, call LLM to verify the integrity of the generated attack chain semantic model. If a missing stage is found, it is filled in by reverse reasoning based on association rules.

[0161] The LLM is invoked to perform a secondary verification of the generated semantic model. The prompt is: "Please check whether the above attack chain model completely includes the five stages of scanning, probing, exploitation, control, and propagation. If any are missing, please supplement the specific content of the missing stages based on association rules and typical behaviors." If the LLM detects a missing stage (e.g., the model contains only probing, exploitation, and control, without scanning and propagation), the missing stage is inferred by reverse reasoning over the association rules: since the rule port 443 scan → /admin path detection exists, the scanning stage should include port scanning behavior.

[0162] Using LLM to generate supplementary content: Scanning phase: Attackers use port scanning tools to perform batch scans on the target port 443 in order to discover web servers that offer HTTPS services. Key entities include the source IP, target IP range, port 443, and scanner type.

[0163] The output shows the complete attack chain sequence after completion, with all five stages present and none missing.

[0164] The purpose of step S54 is to check whether the five-stage semantic model generated by the LLM completely covers the entire process from scanning to propagation. If a stage is found to be missing (for example, the scanning stage is absent from the model), the LLM is used to complete it automatically, so that the attack chain is not interrupted.
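Before asking the LLM to fill gaps, a cheap structural check can flag which stages are absent. The sketch below assumes the generated model is a dictionary keyed by stage name, which is an illustrative format rather than one defined in the text; both helper names are hypothetical.

```python
STAGES = ("scanning", "probing", "exploitation", "control", "propagation")

def missing_stages(model):
    """Return the stages that are absent or empty, in chain order."""
    return [s for s in STAGES if not model.get(s)]

def completion_prompt(missing):
    """Build a follow-up request for the stages that need to be filled in."""
    return ("Please supplement the following missing stages based on the "
            "association rules and typical behaviors: " + ", ".join(missing))

model = {
    "probing": ["/admin path detection"],
    "exploitation": ["JSP file upload"],
    "control": ["privilege escalation", "external C2 connection"],
}
gaps = missing_stages(model)
print(gaps)   # ['scanning', 'propagation']
```

This mirrors the example in the text, where a model containing only probing, exploitation, and control lacks the scanning and propagation stages.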

[0165] Step S5 serves the following purpose: For attack types identified as unknown by LLM, flexible fuzzy clustering is used to group attack events with similar behaviors into the same cluster (representing a new type of unknown attack). Then, frequent itemset mining is used to extract the core behavioral features of each cluster. Finally, LLM semantic generation is used to automatically construct a dedicated five-stage attack chain semantic model for this unknown attack. This step achieves a complete closed loop of discovering the unknown → automatic modeling → database detection.

[0166] Step S6: Integrate the unknown attack chain semantic model into the LLM intrusion detection system, update the attack type knowledge base, and use the updated unknown attack chain semantic model to identify attack chain stages and perform automated protection and handling of real-time traffic, forming a closed-loop optimization.

[0167] In this embodiment, step S6 may specifically include the following steps: S61, store the attack chain semantic model generated by LLM into the attack chain semantic model library, and at the same time add the core feature set of the cluster to the attack type knowledge base and mark it as a known attack.

[0168] Call the API of the model management module to store the five-stage semantic model (JSON format) generated by LLM into the five-stage attack chain semantic model library and assign a unique identifier (such as ATTACK_CHAIN_Unknown_Cluster_3).

[0169] Simultaneously, the cluster's core feature set, association rules, and attack type identifier (such as Unknown_Cluster_3) are added to the attack type knowledge base and marked as a known attack. Afterward, any traffic matching this feature will directly enter the known attack identification process in step S4.

[0170] If needed, drafts of Sigma or YARA rules can also be generated for security engineers to review and deploy to IDS / IPS.

[0171] The purpose of step S61 is to persist the newly modeled unknown attack chain model and convert it into a known attack type for subsequent matching.

[0172] S62, after collecting, preprocessing and classifying the network traffic captured in real time, calls the updated dedicated recognition model to extract entities and perform stage matching.

[0173] After real-time traffic is processed through step S2 (collection and preprocessing), it goes through step S3 (LLM classification).

[0174] If the attack is classified as a known attack and matches Unknown_Cluster_3 (which has now been converted to known), proceed to step S4, where the ATTACK_CHAIN_Unknown_Cluster_3 model is invoked for entity extraction and phase matching.

[0175] Example: The system captures the behavior of source IP XXX.XXX.X.XX uploading a shell.jsp file to the server, and the model determines that it is in the "exploitation phase".

[0176] Output: Stage identifier "Exploitation stage", confidence level 0.93, associated entity {malicious JSP file, upload path, initial privilege level www-data}.

[0177] The purpose of step S62 is to process the newly captured network traffic in real time and call the updated model to determine the attack chain stage.

[0178] S63, based on the identified attack stage, retrieves the corresponding protection policy from the handling suggestion library, and automatically executes handling instructions through API linkage with EDR, firewall and WAF security devices, and issues alarm notifications and audit logs.

[0179] Policy Mapping: The recommended actions for the exploitation phase are to isolate the infected host, block the attack source IP, and delete malicious files.

[0180] Device linkage: Send the isolate_host command to the target server via the EDR (Endpoint Detection and Response) API to isolate the host network communication.

[0181] Deploy ACL rules via firewall API: deny src XXX.XXX.X.XX any.

[0182] Update rules via WAF (Web Application Firewall) API: Block requests that include uploads of shell.jsp.

[0183] Alarm notification: Send emergency alarms via alarm module (such as email), including attack stage, source IP, target IP, and handling measures.

[0184] Record audits: Write all actions taken into the audit log for later review and tracing.
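The policy mapping, device linkage, and audit logging of step S63 can be sketched as a small dispatcher. Everything device-specific here is hypothetical: the action names, the `actions` callables standing in for EDR, firewall, and WAF API clients, and the audit record layout are assumptions made for illustration only.

```python
from datetime import datetime, timezone

# Assumption: one policy entry per stage; only the exploitation stage
# from the example in the text is filled in.
POLICY = {
    "exploitation": ["isolate_host", "block_source_ip", "delete_malicious_file"],
}

def respond(stage, context, actions, audit_log):
    """Run every action mapped to the stage and record each one."""
    for name in POLICY.get(stage, []):
        outcome = actions[name](context)   # e.g. a call to a device API client
        audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "stage": stage,
            "action": name,
            "outcome": outcome,
        })
    return audit_log

# Stub clients; real ones would call the EDR, firewall, and WAF APIs.
stubs = {name: (lambda ctx, n=name: f"{n}: ok") for name in POLICY["exploitation"]}
log = respond("exploitation", {"src_ip": "198.51.100.9"}, stubs, [])
print([entry["action"] for entry in log])
```

Passing the device clients in as callables keeps the dispatcher testable without any live security equipment.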

[0185] The function of step S63 is to: based on the identified attack stage, retrieve the corresponding protection strategy from the handling suggestion library in step S13, and automatically issue an execution command to the security device.

[0186] The purpose of step S6 is to integrate the unknown attack chain semantic model newly generated in step S5 into the LLM intrusion detection system and update the attack type knowledge base, so that a new attack, once discovered, can be detected permanently thereafter; at the same time, the updated model is used to identify the attack chain stages of real-time traffic and automatically execute the corresponding protection strategies.

[0187] This invention can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This invention can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0188] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing related hardware through computer-readable instructions. These computer-readable instructions can be stored in a computer-readable storage medium. When the program is executed, it can include the processes of the embodiments of the above methods. The aforementioned storage medium can be a non-volatile storage medium such as a magnetic disk, optical disk, or read-only memory (ROM), or random access memory (RAM).

[0189] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0190] Example 2. With further reference to Figure 2, as an implementation of the method shown in Figure 1, the present invention provides an embodiment of an LLM-based network attack chain complete identification device. This device embodiment corresponds to the method embodiment shown in Figure 1, and the device can be applied to various electronic devices.

[0191] As shown in Figure 2, the LLM-based network attack chain complete identification device 70 described in this embodiment includes: a construction module 71, a collection module 72, a judgment module 73, a location module 74, an extraction module 75, and an optimization module 76. Wherein: the construction module 71 is used to construct the attack chain semantic model and define entity types and feature requirements; the collection module 72 is used to collect raw attack data from a multi-source heterogeneous environment based on the entity types and the feature requirements, and to preprocess the raw attack data to output standardized attack data; the judgment module 73 is used to automatically determine, based on the standardized attack data and using the semantic understanding and context learning capabilities of the large language classification model, whether the attack type is a known attack type or an unknown attack type, and to output a classification label as a diversion control signal; the location module 74 is used, for the known attack type, to locate the current attack chain stage based on the attack chain semantic model through LLM-driven entity extraction and semantic similarity matching; the extraction module 75 is used, for the unknown attack type, to group attack events with similar behaviors into the same cluster according to the user's actual needs through flexible fuzzy clustering, to extract the core behavioral features of each cluster through frequent itemset mining, and to automatically construct an unknown attack chain semantic model using LLM semantic generation; the optimization module 76 is used to integrate the unknown attack chain semantic model into the LLM intrusion detection system, update the attack type knowledge base, and use the updated unknown attack chain semantic model to identify attack chain stages and automatically protect against real-time traffic, forming a closed-loop optimization.
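The flexible fuzzy clustering performed by the extraction module 75 can be illustrated with a small Python sketch. It builds a fuzzy similarity matrix from feature vectors and then forms clusters at a user-chosen cut value. Two simplifications are assumptions of this sketch rather than statements of the disclosure: similarity is the Pearson correlation with negative values clipped to 0, and clusters are the connected components of the pairs whose similarity reaches the cut value (computed with a union-find structure).

```python
import math


def correlation(x, y):
    """Pearson correlation clipped to [0, 1] (clipping is an assumption)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx)) * math.sqrt(sum(d * d for d in dy))
    if denom == 0:
        return 0.0
    return max(0.0, sum(a * b for a, b in zip(dx, dy)) / denom)


def similarity_matrix(vectors):
    """Symmetric fuzzy similarity matrix with 1.0 on the diagonal."""
    n = len(vectors)
    return [
        [1.0 if i == j else correlation(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


def lambda_cut_clusters(matrix, lam):
    """Group indices whose similarity chain stays at or above the cut value."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= lam:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Raising the cut value splits the events into more, tighter clusters, and lowering it merges them, which is the adjustment a user would make according to actual needs.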

[0192] Example 3. With further reference to Figure 3, as an implementation of the method shown in Figure 1, the present invention provides an embodiment of an LLM-based network attack chain complete identification device. This device embodiment corresponds to the method embodiment shown in Figure 1, and the device can be applied to various electronic devices.

[0193] As shown in Figure 3, the LLM-based network attack chain complete identification device described in this embodiment adopts a layered, modular architecture supported by a knowledge base. From top to bottom it is divided into five layers: the interactive display layer, the core business layer, the data preprocessing layer, the data acquisition layer, and the unified support knowledge base layer. The modules of each layer work together in a closed data-flow loop: data input → processing → analysis → output → iteration.

[0194] The unified support knowledge base layer is the underlying resource layer that keeps the entire system running stably. It integrates core resources such as a five-stage attack chain semantic model library, an attack type library, a prompt word model library, and an LLM model library. It provides the upper-layer modules with attack chain stage definitions, semantic features of known attacks, prompt word rules, and the basis for LLM reasoning, and it stores recognition models and makes them available for calling. It also supports iterative model updates and dynamic expansion of the knowledge base, so that the system's attack recognition remains accurate and continues to evolve. The specific functional modules are as follows: (1) Five-stage attack chain semantic model library: stores the attack chain semantic model data constructed by the system for the scanning → probing → exploitation → control → propagation stages, and handles the updating and maintenance of that model data.

[0195] (2) Attack type library: Stores the library of known attack features identified by the system, and handles the updating and maintenance of that feature library.

[0196] (3) Prompt word model library: Stores various prompt word templates used in the system and manages and maintains them.

[0197] (4) LLM Model Library: This module mainly stores two types of large language models: one is a large language recognition model, such as a recognition model based on Transformer, RWKV model, etc., which is used to classify known and unknown attacks, and to accurately match the attack stage of known attack types; the other is a large language generation model, which is used to extract entities from threat intelligence text of attack types, generate exclusive attack chains for unknown attack types, and complete the stages of existing attack chains.
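To illustrate how the stage models held in this knowledge base could be used for stage matching, the following Python sketch scores an observed entity set against a reference entity set for each stage using cosine similarity and returns the best-scoring stage. The reference entities are invented for illustration, and the vectors are simple token counts; a production system would more plausibly compare semantic embeddings, which is an assumption this sketch does not implement.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical reference entity sets, one per stage of the five-stage model.
STAGE_LIBRARY = {
    "scanning": Counter(["nmap", "port_scan", "icmp_sweep"]),
    "probing": Counter(["banner_grab", "dir_enum", "vuln_scan"]),
    "exploitation": Counter(["sql_injection", "rce", "exploit_kit"]),
    "control": Counter(["c2_beacon", "webshell", "reverse_shell"]),
    "propagation": Counter(["lateral_movement", "smb_spread", "credential_reuse"]),
}


def locate_stage(entities):
    """Return (stage, score) for the most similar stage, or (None, 0.0)."""
    observed = Counter(entities)
    scores = {
        stage: cosine_similarity(observed, reference)
        for stage, reference in STAGE_LIBRARY.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, 0.0  # nothing matched any stage
    return best, scores[best]
```

Returning `None` when no entity overlaps any stage avoids reporting an arbitrary stage for unrelated input; how such cases are actually handled is a design decision left open here.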

[0198] The data acquisition layer and the data preprocessing layer together form the foundation on which the system acquires and standardizes attack data across the entire domain. They consist of two main modules: full-domain data collection and data preprocessing. (1) Full-domain data collection module: Through multiple channels such as network traffic capture tools, host log collection agents, and external threat intelligence platform interfaces, it collects network traffic feature data, host operation logs, threat intelligence texts, and other structured and unstructured attack data, comprehensively covering the behavioral traces of each stage of the attack chain and providing the raw data source for attack identification. (2) Data preprocessing module: The collected raw data is cleaned, min-max normalized, and converted to a standard format, in that order. Redundant and invalid information is filtered out, missing fields are filled in, differences in the units of different features are eliminated, the data structure is unified, and core attack feature vectors such as source IP, port, and behavioral instructions are extracted. This removes interfering factors from the data and provides high-quality, standardized input for the key processing in the core business layer above, such as LLM semantic reasoning, flexible multi-cut-set fuzzy clustering, and frequent itemset mining.
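The preprocessing described above can be sketched in a few lines of Python. Min-max normalization maps each value x of a feature to (x - min) / (max - min), which lies in [0, 1]. The field names, the default values used to fill missing fields, and the choice to map a constant column to 0.0 are assumptions made for this illustration.

```python
import json


def min_max_normalize(values):
    """Linearly map values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(records, numeric_fields, defaults):
    """Fill absent fields, normalize numeric fields, and emit JSON strings."""
    # Fields absent from a record take the value given in `defaults`.
    filled = [{**defaults, **record} for record in records]
    for field in numeric_fields:
        column = min_max_normalize([float(r[field]) for r in filled])
        for record, value in zip(filled, column):
            record[field] = value
    # sort_keys gives every record the same key order in its JSON form.
    return [json.dumps(record, sort_keys=True) for record in filled]
```

For example, three events with byte counts 100, 300, and a missing value filled with 0 would be normalized to roughly 0.33, 1.0, and 0.0, so that features measured in different units become comparable before clustering.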

[0199] The core business layer is the central execution platform through which the system achieves complete attack chain identification and closed-loop protection. It integrates the core business logic: LLM-driven attack classification, precise location of known attack chain stages, autonomous generation of semantic models for unknown attack chains, time-series reconstruction and integrity verification of attack chains, iterative model updates, and real-time detection and handling. It performs differentiated processing of attack data, automatic modeling of unknown threats, attack chain stage determination, and intelligent distribution of protection strategies, and serves as the hub connecting data processing and security protection. The specific functional modules are described below: (1) Model management module: Updates, stores, and manages versions of the large language models held in the LLM model library, including recognition, extraction, and generation models, supporting continuous system iteration.

[0200] (2) LLM classification module: Based on large language recognition models (such as Transformer, RWKV model, etc.), it realizes the automatic identification of attack types, and divides the standardized data into "known attacks" and "unknown attacks", providing a basis for subsequent differentiated processing.

[0201] (3) Known attack identification module: For known attack types, accurately locate the attack chain stage and output stage identifier and entity details.

[0202] (4) Unknown attack identification module: For unknown attack types, a dedicated attack chain semantic model is generated through flexible fuzzy clustering, feature extraction, and LLM modeling.

[0203] (5) Attack chain completion module: An LLM (such as DeepSeek-R1, Qianwen, etc.) is used to perform deep contextual reasoning on the generated attack chain, check whether the generated model covers all five stages, fill in any missing stages, and output the complete attack chain sequence.

[0204] (6) Detection and handling module: Utilize the updated model to detect actual network attacks in real time, identify attack chain stages and implement corresponding protection measures.
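The diversion performed by the LLM classification module, which sends each event either to the known attack identification module or to the unknown attack identification module, can be sketched as follows. The few-shot examples, the prompt wording, the JSON reply schema (`label`, `attack_type`), and the `llm` callable are all hypothetical; no specific inference API is implied. Treating an unparseable reply as "unknown" is likewise an assumption of this sketch.

```python
import json

# Hypothetical few-shot examples placed in the prompt for in-context learning.
FEW_SHOT_EXAMPLES = [
    (
        {"src_ip": "10.0.0.5", "behavior": "sequential SYN to ports 1-1024"},
        {"label": "known", "attack_type": "port_scan"},
    ),
    (
        {"src_ip": "10.0.0.8", "behavior": "beacon over DNS TXT, unseen encoding"},
        {"label": "unknown", "attack_type": None},
    ),
]


def build_prompt(event):
    """Assemble a few-shot classification prompt for one standardized event."""
    lines = ["Classify the attack as known or unknown. Reply with JSON only."]
    for example_event, example_answer in FEW_SHOT_EXAMPLES:
        lines.append("Event: " + json.dumps(example_event, sort_keys=True))
        lines.append("Answer: " + json.dumps(example_answer, sort_keys=True))
    lines.append("Event: " + json.dumps(event, sort_keys=True))
    lines.append("Answer:")
    return "\n".join(lines)


def parse_label(reply):
    """Parse the model reply; anything malformed is treated as unknown."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return "unknown", None
    if isinstance(data, dict) and data.get("label") == "known" and data.get("attack_type"):
        return "known", data["attack_type"]
    return "unknown", None


def route(event, llm):
    """Use the label as a diversion signal to pick the downstream module."""
    label, attack_type = parse_label(llm(build_prompt(event)))
    if label == "known":
        return "known_attack_identification", attack_type
    return "unknown_attack_identification", None
```

Passing the model as a plain callable keeps the routing logic testable with a stub and independent of whichever model the LLM model library supplies.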

[0205] The interactive display layer serves as the human-computer interaction interface between the system and security operations personnel. It is responsible for the visual presentation of the identification results and for operation and maintenance support, and specifically includes: (1) Visualization module: The complete attack chain, the behavior at each stage, the key entities, and the handling results are visualized through a web front-end interface in the form of topology diagrams, timelines, etc. It supports the automatic generation of attack chain identification statistical reports (stage identification accuracy, number of unknown attacks discovered, handling success rate, etc.) and the export of reports in PDF / Excel format.

[0206] (2) Clustering parameter setting module: Users can dynamically adjust the cut set value of fuzzy clustering according to business needs and intervene in the clustering results of unknown attacks.

[0207] (3) Alarm module: Supports SMS and email alarms, pushes alarms according to the severity of the attack (real-time push for high-risk attacks, and periodic summary for medium and low-risk attacks), and supports alarm classification, filtering and query.

[0208] (4) Operation and maintenance module: Provides entry points for functions such as knowledge base update, model parameter configuration, log query, and manual handling, and supports user manual intervention and adjustment.
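The push rule of the alarm module, real-time delivery for high-risk attacks and a periodic summary for medium-risk and low-risk attacks, can be sketched as below. The `send` callback, the channel names, the alarm fields, and the message layout are hypothetical; scheduling of the periodic flush is left to the caller.

```python
from collections import defaultdict


def format_alarm(alarm):
    """Render the fields an alarm carries: stage, source, target, action."""
    return (
        f"[{alarm['severity']}] stage={alarm['stage']} "
        f"src={alarm['source_ip']} dst={alarm['target_ip']} "
        f"action={alarm['action']}"
    )


class AlarmRouter:
    """Push high-risk alarms at once; batch the rest for a summary."""

    def __init__(self, send):
        self._send = send  # callable(channel, message), supplied by the caller
        self._pending = defaultdict(list)

    def handle(self, alarm):
        if alarm["severity"] == "high":
            message = format_alarm(alarm)
            self._send("sms", message)
            self._send("email", message)
        else:
            self._pending[alarm["severity"]].append(alarm)

    def flush_summary(self):
        """Send one summary of queued alarms; return it, or None if empty."""
        if not self._pending:
            return None
        counts = ", ".join(
            f"{severity}={len(items)}"
            for severity, items in sorted(self._pending.items())
        )
        summary = f"summary: {counts}"
        self._send("email", summary)
        self._pending.clear()
        return summary
```

A scheduler would call `flush_summary` at whatever interval the operator configures, while `handle` is called for every alarm as it is raised.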

[0209] Example 4. To address the aforementioned technical problems, embodiments of the present invention also provide an electronic device. Please refer to Figure 4, which is a basic structural block diagram of the electronic device in this embodiment.

[0210] The aforementioned electronic device 8 includes a memory 81, a processor 82, and a network interface 83 that are interconnected via a system bus. It should be noted that only the electronic device 8 with components 81, 82, and 83 is shown in the figure; however, it should be understood that it is not required to implement all the shown components, and more or fewer components can be implemented alternatively. Those skilled in the art will understand that the electronic device described here is a device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

[0211] The aforementioned electronic devices can be computing devices such as desktop computers, laptops, handheld computers, and cloud servers. These electronic devices can interact with users via keyboards, mice, remote controls, touchpads, or voice-activated devices.

[0212] The aforementioned memory 81 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the aforementioned memory 81 may be an internal storage unit of the aforementioned electronic device 8, such as the hard disk or memory of the electronic device 8. In other embodiments, the aforementioned memory 81 may also be an external storage device of the aforementioned electronic device 8, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the electronic device 8. Of course, the aforementioned memory 81 may also include both internal storage units and external storage devices of the aforementioned electronic device 8. In this embodiment, the aforementioned memory 81 is typically used to store the operating system and various application software installed on the aforementioned electronic device 8, such as computer-readable instructions for a network attack chain integrity identification method based on LLM. In addition, the aforementioned memory 81 can also be used to temporarily store various types of data that have been output or will be output.

[0213] In some embodiments, the processor 82 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 82 is typically used to control the overall operation of the electronic device 8. In this embodiment, the processor 82 is used to execute computer-readable instructions stored in the memory 81 or to process data, for example, to execute the computer-readable instructions of the LLM-based network attack chain integrity identification method.

[0214] The aforementioned network interface 83 may include a wireless network interface or a wired network interface, which is typically used to establish communication connections between the aforementioned electronic device 8 and other electronic devices.

[0215] Example 5 The present invention also provides another embodiment, namely, providing a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to cause the at least one processor to perform the steps of the LLM-based network attack chain complete identification method described above.

[0216] The beneficial effects of implementing the above embodiments are as follows: By constructing a five-stage attack chain semantic model and relying on the powerful natural language understanding capabilities of LLM, it is possible to deeply analyze the semantics of attack behaviors in heterogeneous data, achieve accurate positioning of the entire attack chain stage, and significantly reduce the stage misjudgment and false negative rates; for known attacks, entity extraction and stage matching can be automatically completed; for unknown attacks, recognition rules are generated and model iteration is completed through flexible fuzzy clustering and LLM linkage, forming a closed-loop capability of discovery-learning-modeling-detection; it can achieve full-process automation and can be widely applied to various scenarios such as enterprise intranets, cloud platforms, and industrial control networks.

[0217] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods of the various embodiments of the present invention.

[0218] Obviously, the embodiments described above are merely some embodiments of the present invention, not all embodiments. The accompanying drawings show preferred embodiments of the present invention, but do not limit the patent scope of the present invention. The present invention can be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of the present invention will be understood more thoroughly and completely. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent substitutions for some of the technical features. Any equivalent structure made using the content of this specification and drawings, and applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of this invention.

Claims

1. A method for identifying the complete network attack chain based on LLM, characterized in that it includes the following steps: constructing an attack chain semantic model, and defining entity types and feature requirements; based on the entity types and the feature requirements, collecting raw attack data from a multi-source heterogeneous environment, and preprocessing the raw attack data to output standardized attack data; based on the standardized attack data, using the semantic understanding and context learning capabilities of a large language classification model to automatically determine whether the attack type is a known attack type or an unknown attack type, and outputting a classification label as a diversion control signal; for the known attack type, based on the attack chain semantic model, locating the current attack chain stage through LLM-driven entity extraction and semantic similarity matching; for the unknown attack type, using flexible fuzzy clustering to group attack events with similar behaviors into the same cluster according to the actual needs of users, extracting the core behavioral features of each cluster through frequent itemset mining, and automatically constructing an unknown attack chain semantic model using LLM semantic generation; and integrating the unknown attack chain semantic model into the LLM intrusion detection system, updating the attack type knowledge base, and using the updated unknown attack chain semantic model to identify attack chain stages and perform automated protection and handling of real-time traffic, forming a closed-loop optimization.

2. The method for identifying the complete network attack chain based on LLM according to claim 1, characterized in that the steps of constructing the attack chain semantic model and defining entity types and feature requirements specifically include: drawing on the Cyber Kill Chain framework and the MITRE ATT&CK knowledge base, proposing a five-stage attack chain semantic model that defines the core behavioral characteristics of the scanning, probing, exploitation, control, and propagation stages respectively; defining multiple core entities for each attack stage, and storing the entities in triplet format; and establishing causal and temporal dependencies between attack stages and providing automatically executable protection strategies for each stage.

3. The method for identifying the complete network attack chain based on LLM according to claim 1, characterized in that the steps of collecting raw attack data from a multi-source heterogeneous environment based on the entity types and the feature requirements, preprocessing the raw attack data, and outputting standardized attack data specifically include: using a network traffic capture tool, a host log collection agent, and an external threat intelligence platform interface to collect network traffic feature data, host operation logs, and threat intelligence text in parallel; filtering invalid data, filling in missing fields, and using the min-max normalization formula to linearly map the value of each feature to the interval [0,1]; and converting all heterogeneous data into a unified JSON structured format, and extracting core feature vectors that can distinguish attack chains.

4. The method for identifying the complete network attack chain based on LLM according to claim 1, characterized in that the step of automatically determining whether an attack type is a known or unknown attack type based on the standardized attack data, using the semantic understanding and context learning capabilities of a large language classification model, and outputting a classification label as a diversion control signal specifically includes: integrating publicly available attack libraries and historical attack samples from within the enterprise to form a semantic set of known attack types; using an in-context learning approach, providing few-shot examples in the prompt words to guide the LLM to perform classification reasoning based on the standardized attack data; and calling the LLM inference API, parsing the returned JSON result, and obtaining the known attack or unknown attack label, wherein, if the attack is a known attack, the attack type identifier is output at the same time.

5. The method for identifying the complete network attack chain based on LLM according to claim 1, characterized in that the steps of locating, for the known attack type, the current attack chain stage based on the attack chain semantic model through LLM-driven entity extraction and semantic similarity matching specifically include: designing entity extraction prompt templates and using the named entity recognition capabilities of the LLM to extract four key entities (attackers, tools, techniques, and vulnerabilities) from unstructured threat intelligence texts; aligning the extracted free-format entities with the entity types by rules and semantics, filtering out invalid entities, and generating a standardized structured entity set; and constructing a matching library for known attack chain stages, calculating the cosine similarity between the current attack entity set and the standard entity set of each stage, and selecting the stage with the highest similarity as the output.

6. The method for identifying the complete network attack chain based on LLM according to claim 1, characterized in that the steps of, for the unknown attack type, grouping attack events with similar behaviors into the same cluster according to the actual needs of users through flexible fuzzy clustering, extracting the core behavioral features of each cluster through frequent itemset mining, and automatically constructing the unknown attack chain semantic model using LLM semantic generation specifically include: constructing a fuzzy similarity matrix using the correlation coefficient method, and performing multi-cut-set clustering based on a fast direct clustering algorithm to divide the unknown attack data into multiple clusters according to behavioral similarity; using the Apriori algorithm to perform frequent itemset mining and association rule generation for each cluster, extracting high-frequency behavior sets, strong association rules, and typical behavior time series; constructing specialized prompt word templates, inputting the core features into the LLM, and automatically generating the corresponding attack chain semantic model; and invoking the LLM to verify the integrity of the generated attack chain semantic model, wherein, if a missing stage is found, it is filled in by reverse reasoning based on the association rules.

7. The method for identifying the complete network attack chain based on LLM according to any one of claims 1 to 6, characterized in that the steps of integrating the unknown attack chain semantic model into the LLM intrusion detection system, updating the attack type knowledge base, and using the updated unknown attack chain semantic model to identify attack chain stages and perform automated protection and handling of real-time traffic to form a closed-loop optimization specifically include: storing the attack chain semantic model generated by the LLM in the attack chain semantic model library, adding the core feature set of the cluster to the attack type knowledge base, and marking it as a known attack; after real-time captured network traffic has been preprocessed and classified by the LLM, calling the updated dedicated recognition model to extract entities and match stages; and based on the identified attack stage, retrieving the corresponding protection strategy from the handling suggestion library, automatically executing the handling instructions through API linkage with EDR, firewall, and WAF security devices, and generating alarm notifications and audit logs.

8. A network attack chain complete identification device based on LLM, characterized in that it includes: a construction module, used to construct the attack chain semantic model and define entity types and feature requirements; a collection module, used to collect raw attack data from a multi-source heterogeneous environment based on the entity types and the feature requirements, and to preprocess the raw attack data to output standardized attack data; a judgment module, used to automatically determine, based on the standardized attack data and using the semantic understanding and context learning capabilities of the large language classification model, whether the attack type is a known attack type or an unknown attack type, and to output a classification label as a diversion control signal; a location module, used, for the known attack type, to locate the current attack chain stage based on the attack chain semantic model through LLM-driven entity extraction and semantic similarity matching; an extraction module, used, for the unknown attack type, to group attack events with similar behaviors into the same cluster according to the user's actual needs through flexible fuzzy clustering, to extract the core behavioral features of each cluster through frequent itemset mining, and to automatically construct the unknown attack chain semantic model using LLM semantic generation; and an optimization module, used to integrate the unknown attack chain semantic model into the LLM intrusion detection system, update the attack type knowledge base, and use the updated unknown attack chain semantic model to identify attack chain stages and automatically protect against real-time traffic, forming a closed-loop optimization.

9. An electronic device, characterized in that the electronic device includes a memory and a processor, wherein the memory stores computer-readable instructions, and the processor executes the computer-readable instructions to implement the steps of the LLM-based network attack chain complete identification method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the LLM-based network attack chain complete identification method as described in any one of claims 1 to 7.