Network space knowledge extraction method and device based on rule-enhanced prompt learning
By employing a rule-enhanced prompt learning method, the prompt input for the text data to be extracted in the cyberspace security domain is split into multiple sub-prompts. Logical rules in conjunctive normal form connect the conditional functions related to the ontology rules to construct task-specific prompts. This addresses the low efficiency of cyberspace knowledge extraction, achieves efficient and fine-grained entity and relation extraction, and improves the efficiency of threat intelligence data acquisition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-10-16
- Publication Date
- 2026-06-26
AI Technical Summary
Current knowledge extraction technologies in cyberspace suffer from low extraction efficiency, especially in the field of cybersecurity. Traditional deep learning methods struggle to identify entities and relationships in mixed Chinese and English threat intelligence, lack large-scale fine-grained training samples, and the fine-tuning process for pre-trained language models is cumbersome and complex, with high hardware requirements.
A rule-enhanced prompt learning method is adopted: the prompt input is split into subject, relation, and object prompts, and logical rules in conjunctive normal form connect the conditional functions related to the ontology rules to construct task-specific prompts. A pre-trained language model, guided by these prompts filled with cybersecurity knowledge, then extracts cyberspace knowledge efficiently and at fine granularity.
It improves the efficiency of threat intelligence data discovery and acquisition in the field of cyberspace security, realizes efficient extraction of entities and their relationships, reduces the cost of manual annotation, adapts to multiple classification tasks, and improves extraction efficiency and accuracy.
Figure CN117391083B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer data processing technology, and relates to a method and apparatus for extracting cyberspace knowledge based on rule-enhanced prompt learning.
Background Technology
[0002] A cyberspace ontology clearly defines the entities and relationships to be extracted, and a fine-grained ontology places higher demands on knowledge extraction. Faced with complex multi-class classification tasks, current cyberspace knowledge extraction technologies for threat intelligence still have problems and room for improvement. In terms of data, relevant Chinese or mixed Chinese-English datasets are scarce, and publicly available threat intelligence datasets are also limited and predominantly English. There is a lack of large-scale, fine-grained training samples in vertical cyberspace domains. In terms of models, traditional deep learning methods struggle to identify new or mixed Chinese-English threat intelligence entities and relationships in the cybersecurity field, and the extracted features are insufficient.
[0003] In recent years, research on pre-trained language models (PLMs) has been abundant, and Natural Language Processing (NLP) has also benefited from this trend. The research approach based on PLMs is typically "pre-train, fine-tune," which involves applying the PLM to downstream tasks. During the pre-training and fine-tuning phases, training objectives are designed according to the downstream tasks, and the PLM itself is adjusted. However, as the size of PLMs continues to increase, the hardware requirements, data demands, and practical costs of fine-tuning them also rise. Furthermore, the diverse range of downstream tasks makes the design of the pre-training and fine-tuning phases cumbersome and complex. Therefore, under current technological conditions, knowledge extraction in the field of cyberspace security still faces the technical problem of low extraction efficiency.
Summary of the Invention
[0004] To address the problems existing in the aforementioned traditional technologies, this invention proposes a network space knowledge extraction method based on rule-enhanced prompting learning, a network space knowledge extraction device based on rule-enhanced prompting learning, a computer device, and a computer-readable storage medium, which can effectively improve the knowledge extraction efficiency in the field of network space security.
[0005] To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions:
[0006] In one aspect, a method for extracting network space knowledge based on rule-enhanced prompt learning is provided, including the following steps:
[0007] Obtaining the text data to be extracted and the prompt input in the field of cyberspace security;
[0008] The prompt input is split into subject prompts, relation prompts, and object prompts; subject prompts and object prompts are represented by unary functions and their tag word sets are constructed using two characters, while relation prompts are represented by binary functions and their tag word set is constructed using three characters;
[0009] The sub-prompts of the conditional functions related to the ontology rules are connected using logical rules in conjunctive normal form to obtain task-specific prompts for the text to be extracted; wherein the ontology rules are rules pre-constructed for relation extraction tasks in the field of cyberspace security and represented in first-order logic form, and the conditional functions include conditional functions for determining the subject type in the subject prompt, conditional functions for determining the object type in the object prompt, and conditional functions for determining the semantic relationship between the subject and the object in the relation prompt;
[0010] Using task-specific prompts, a pre-trained language model is used to extract cyberspace knowledge from the text data to be extracted, outputting entities and relationships in the current task within the cyberspace security domain; these entities and relationships are used to determine threat intelligence data in the current cyberspace security domain.
[0011] On the other hand, a network space knowledge extraction device based on rule-enhanced prompting learning is also provided, comprising:
[0012] The data input module is used to acquire text data to be extracted and input prompts in the field of cyberspace security.
[0013] The prompt splitting module is used to split the prompt input into subject prompts, relation prompts, and object prompts; subject prompts and object prompts are represented by unary functions and the tag word set is constructed using two characters, while relation prompts are represented by binary functions and the tag word set is constructed using three characters;
[0014] The prompt optimization module is used to connect the sub-prompts of the conditional functions related to the ontology rules using logical rules in conjunctive normal form, to obtain task-specific prompts for the text to be extracted; wherein the ontology rules are rules pre-constructed for relation extraction tasks in the field of cyberspace security and represented in first-order logic form, and the conditional functions include conditional functions for determining the subject type in the subject prompt, conditional functions for determining the object type in the object prompt, and conditional functions for determining the semantic relationship between the subject and the object in the relation prompt;
[0015] The extraction output module is used to extract cyberspace knowledge from the text data to be extracted using task-specific prompts and a pre-trained language model. It outputs the entities and relationships of the current task in the cyberspace security domain. The entities and relationships are used to determine the threat intelligence data in the current cyberspace security domain.
[0016] On the other hand, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method for extracting network space knowledge based on rule-enhanced prompting learning.
[0017] Furthermore, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the above-described method for extracting cyberspace knowledge based on rule-enhanced prompting learning.
[0018] One of the above technical solutions has the following advantages and beneficial effects:
[0019] In the aforementioned method and apparatus for extracting cyberspace knowledge based on rule-enhanced prompt learning, for the cyberspace security threat intelligence data to be acquired, the text data to be extracted and the prompt input in the cyberspace security domain are first obtained. The prompt input is then split into multiple sub-prompts, and logical rules in conjunctive normal form connect the sub-prompts of the conditional functions related to the ontology rules, combining them into task-specific prompts filled with cybersecurity knowledge. Prompts are thus constructed efficiently and flexibly by permuting and combining a small number of sub-prompt templates. Because the prompts are filled with cybersecurity knowledge based on the ontology rules, they effectively represent the auxiliary information contained in the relationships between entities. Given the task-specific prompts, the pre-trained language model can perform fine-grained knowledge extraction on the text data to be extracted and efficiently extract the relationships between entities. Prompt optimization is thereby used to extract entities and their relationships in the cyberspace security domain efficiently, improving the efficiency of threat intelligence data discovery and acquisition in that domain.
Description of the Drawings
[0020] To more clearly illustrate the technical solutions in the embodiments of this application or the conventional technology, the drawings used in the description of the embodiments or the conventional technology will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart of a network space knowledge extraction method based on rule-enhanced prompt learning in one embodiment;
[0022] Figure 2 is a flowchart of the process of adding learning tags in one embodiment;
[0023] Figure 3 is a schematic diagram of a prompt optimization framework with ontology rules in one embodiment;
[0024] Figure 4 is a schematic diagram of the module structure of a network space knowledge extraction device based on rule-enhanced prompt learning in one embodiment.
Detailed Description
[0025] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0026] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
[0027] It should be noted that, in this document, the reference to "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The presentation of this phrase in various locations throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments.
[0028] Those skilled in the art will understand that the embodiments described herein can be combined with other embodiments. The term "and / or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items, and all possible combinations thereof.
[0029] The data from which cybersecurity entities are extracted includes structured, semi-structured, and unstructured data. Extracting from the first two types is relatively simple, so this application mainly focuses on extracting entities from unstructured cybersecurity text data. Generally, the methods can be divided into three categories: rule matching, machine learning, and deep learning methods.
[0030] (1) Rule matching methods closely resemble human thinking and can achieve high accuracy. They first require the manual creation of numerous rule templates, which can be regular expressions or dictionaries; this process typically relies on domain experts. Then, methods such as string matching, forward maximum matching, and shortest path matching are used for extraction. These methods heavily depend on the extracted rules. While they can achieve high accuracy and extraction efficiency with small datasets, their drawbacks include the need for extensive manual annotation and expert involvement, resulting in high labor costs, long construction cycles, and a lack of universality in rule matching, potentially leading to conflicts.
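As a minimal sketch of the rule matching approach described above, the following Python snippet combines a regular-expression rule with dictionary-based forward maximum matching. The pattern, the dictionary entries, the sample sentence, and the identifier CVE-2023-12345 are all hypothetical illustrations and are not taken from this application.

```python
import re

# Hypothetical rule template: a regular expression for CVE-style identifiers.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

# Hypothetical expert-built dictionary mapping surface forms to entity types.
ENTITY_DICT = {
    "Chrome": "product",
    "Google Chrome": "product",
    "XSS": "vulnerability",
}


def forward_maximum_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary entry that matches."""
    max_len = max(len(key) for key in dictionary)
    matches, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                matches.append((candidate, dictionary[candidate]))
                i += size
                break
        else:
            i += 1  # no dictionary entry starts here; advance one character
    return matches


# Illustrative sentence; the CVE number is made up.
sentence = "Google Chrome is affected by an XSS flaw tracked as CVE-2023-12345."
cve_ids = CVE_PATTERN.findall(sentence)
entities = forward_maximum_match(sentence, ENTITY_DICT)
```

The longest-match preference is why "Google Chrome" is returned as one entity rather than "Chrome" alone. It also shows the maintenance cost noted above: every new surface form has to be added to the dictionary by hand.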
[0031] (2) Machine learning-based methods involve manually annotating corpora and then training a machine learning algorithm on them. Based on the degree of human involvement in the training process, machine learning algorithms can be further categorized into supervised, semi-supervised, and unsupervised methods. Currently, the most commonly used named entity recognition methods are based on supervised learning, which learns the influence of different features on sample categories from a large, pre-annotated corpus. Common supervised machine learning algorithms include Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs).
[0032] (3) In named entity recognition, both rule-based and machine learning-based methods require significant manpower and lack versatility. Therefore, many different deep learning frameworks have been developed to improve upon traditional named entity recognition methods. For example, deep neural networks, recurrent neural networks, and convolutional neural networks not only address the inherent shortcomings of traditional methods but also enable automatic feature extraction, eliminating the need for manual feature engineering. In recent years, with the development of big data and of hardware such as GPUs, deep learning has gradually been applied to various tasks in the field of natural language processing. In the field of deep learning, a neural network is a computational or mathematical model that mimics the structure and function of the human brain and is used to estimate or approximate functions.
[0033] In the field of cybersecurity entity extraction, some researchers have developed IoC (Indicator of Compromise) extraction frameworks to generate IoCs in Structured Threat Information Expression (STIX) format from Cuckoo Sandbox (an open-source project for automated malware analysis). Others have proposed an LSTM-CRF-based model for extracting unstructured security information using named entity recognition methods from natural language processing, combining LSTM and Conditional Random Fields (CRF) to identify relevant entities such as products, versions, and attack names in cybersecurity documents. Still others have used natural language processing principles to train feedforward neural networks and Latent Dirichlet Allocation (LDA) topic models to extract entities representing attack behaviors from social media data, thereby enabling the detection of Distributed Denial-of-Service (DDoS) attacks.
[0034] In addition, some researchers have used end-to-end neural networks combined with attention mechanisms to build models for threat intelligence corpora and trained IoC extractors, which show high accuracy in actual IoC extraction. The use of threat intelligence is not limited to IoCs: threat intelligence reports provide more detailed information about cyberattacks, especially semantic information about attackers, attack techniques, and attack tools. Based on the analysis of threat intelligence corpora, some researchers use convolutional neural networks (CNNs) to obtain character embedding features of the corpus and propose a CNN-BiLSTM-CRF network security entity recognition method that incorporates feature templates, achieving good results in recognizing the personal names, place names, organization names, software names, network-related terms, and vulnerability numbers found in network security text data. Others have proposed a BLSTM-CRF-based recognition model for named entities in the field of security vulnerabilities and combined it with domain dictionaries to correct the recognition results, achieving effective recognition of 7 types of vulnerability-related named entities, including vulnerability numbers, vulnerability names, vulnerability types, vulnerability exploitation conditions (software vendors, operating systems, application software), and attack methods. Still others have applied deep active learning algorithms to the named entity recognition task and compared the performance of 3 sampling strategies: least confidence, Bayesian active learning by disagreement, and maximum normalized log probability.
[0035] In addition, relation extraction mainly focuses on discovering the semantic relationships between entities from text content. Its task definition can be described as: based on a piece of text C, determine the category relationship r between target entity pairs. For example, the sentence "Zhang San discovered that Chrome has an XSS vulnerability" contains an entity pair <Chrome, XSS vulnerability>, and the relationship between these two entities is "owns". Using relation extraction technology to discover the deep relationship structure between entities has important research value. At the same time, it is also a preparatory work for optimizing search engines, constructing knowledge graphs, and developing intelligent question-answering systems.
[0036] Relation extraction is divided into supervised extraction, unsupervised extraction, and distant supervision extraction. Supervised extraction transforms the input training samples into a feature vector space, which loses a large amount of context information to a certain extent; how to effectively utilize this information has always been a focus of research. In supervised extraction, entity relation extraction is usually converted into a binary or multi-class classification problem and solved using traditional classifiers such as SVM. Supervised extraction also requires a large amount of manual annotation. To address these problems, many researchers have proposed extraction methods based on unsupervised learning, which are also known as open relation extraction. Unsupervised extraction does not require pre-defining relation types, has strong domain adaptability, and has advantages in processing large-scale text data; its difficulty lies in the need to pre-determine the clustering threshold, and it also lacks an objective evaluation criterion.
[0037] Therefore, with the development of big data technology, researchers have begun to consider how to extract relations from large datasets. For large datasets, relation extraction methods based on distant supervision (distant supervision extraction) have become a research hotspot in the past two years. Distant supervision extraction methods do not require manually labeled seeds and rules, but they do require external knowledge bases, such as Freebase. In 2009, distant supervision was first proposed for relation extraction at the ACL conference and has since been used as a comparison method in relation extraction. Distant supervision extraction methods label unlabeled text data, usually in the form of sentences: by aligning sentences with a knowledge base of triples, large amounts of text data can be labeled automatically. This differs from traditional methods that predefine relation category sets, and combines the advantages of both supervised and unsupervised methods. It significantly reduces the cost of manual labeling, increases the applicability of relation extraction models in general domains, and is particularly suitable for cybersecurity fields that lack labeled data.
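The alignment step behind knowledge-base-driven automatic labeling can be sketched as follows. This is only an illustrative simplification: the toy knowledge base, the relation names, and the plain substring matching are assumptions made for the example, not the procedure of any particular system.

```python
# Hypothetical miniature knowledge base of (subject, relation, object) triples.
KNOWLEDGE_BASE = [
    ("Chrome", "has_vulnerability", "XSS"),
    ("Google", "produces", "Chrome"),
]


def distant_label(sentences, kb):
    """Label a sentence with a KB relation whenever both entities of a triple occur in it."""
    labeled = []
    for sentence in sentences:
        for subj, rel, obj in kb:
            if subj in sentence and obj in sentence:
                labeled.append((sentence, subj, rel, obj))
    return labeled


corpus = [
    "Zhang San discovered that Chrome has an XSS vulnerability.",
    "Google released a new Chrome build this week.",
    "The attacker used a phishing email.",
]
labeled = distant_label(corpus, KNOWLEDGE_BASE)
```

The second sentence is labeled "produces" only because both entity names co-occur, which shows how this kind of automatic alignment can introduce noisy labels as well as save annotation effort.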
[0038] In the field of cybersecurity relation extraction, some researchers have developed the iACE system to automatically extract Indicators of Compromise (IoCs) and their contextual relationships from threat intelligence texts. Others have developed a deep learning-based semantic relation extraction system for threat intelligence, which obtains semantic triples from open-source threat intelligence and integrates them with security operations centers to further enhance cybersecurity defense capabilities.
[0039] Natural Language Processing (NLP) emerged to enable natural language interaction between humans and computers, with fully supervised learning playing a crucial role in early NLP tasks. In recent years, the development of deep learning has dramatically transformed NLP tasks. Modern NLP techniques have been summarized into four paradigms: fully supervised learning without neural networks, fully supervised learning based on neural networks, the "pre-train, fine-tune" paradigm, and the "pre-train, prompt, predict" paradigm. The fourth paradigm was proposed to bridge the gap between pre-trained models and downstream tasks, and it is the paradigm in current use.
[0040] In this paradigm, instead of adapting the pre-trained language model (LM) to the downstream task through objective engineering, the downstream task is reformulated with the help of a text prompt so that it resembles the task solved during the original LM training. For example, in sentiment classification, the goal is for the pre-trained model to understand the sentiment of the input sentence and supply an adjective that classifies it. For the input "I like this movie," a prompt template "This movie is [MASK]" is attached, so that the pre-trained model, on seeing the template, outputs a positive adjective such as "good" or "nice." In the standard "pre-train, fine-tune" paradigm, the gap between the pre-training stage and the downstream task can be significant, and new parameters usually have to be introduced for the downstream task. The prompt essentially unifies NLP tasks into masked language modeling (MLM) tasks, because pre-trained language models are trained primarily on MLM tasks. The prompt allows the downstream task to adopt the same format as the pre-training objective and does not require new parameters.
[0041] Prompt tuning is a self-supervised learning technique based on mask filling. It places masks according to the given prompt words, thereby forcing the model to focus on information related to the prompt words. This application introduces a prompt learning technique that has achieved good results in few-shot settings. To address the lack of interpretability of continuous prompts and the difficulty of manually designing prompts for multi-class classification tasks, this application uses ontology rules to enhance prompt learning: logical rules are used to compose task-specific prompts from several simple sub-prompts for relation extraction in the cybersecurity domain. Specifically, the prompt for the entire input sentence can be split into multiple sub-prompts, and efficient and flexible prompt construction can be achieved through the permutation and combination of a small number of sub-prompt templates. After the sub-prompts are constructed, cybersecurity knowledge is filled into the prompts using the constructed ontology rules, thereby ensuring efficient relation extraction. In addition, the subject and object can be swapped to form inverse rule prompts, which helps explore complex relationships between pieces of knowledge and verify that the design meets the expected requirements. Based on the above understanding, the embodiments of this invention will be described in detail below with reference to the accompanying drawings.
[0042] Referring to Figure 1, in one embodiment, this application provides a method for extracting network space knowledge based on rule-enhanced prompt learning, including the following processing steps S12 to S18:
[0043] S12, Obtain the text data to be extracted and the input prompts in the field of cyberspace security.
[0044] It can be understood that the text data to be extracted is the network text data available for the current threat intelligence acquisition task in the field of cyberspace security. The prompt input is the original prompt given by the user for the text data to be extracted. It tells the pre-trained language model which entities and relationships the user wants to obtain from that text, so that the model can extract and output them and accurate threat intelligence can be obtained.
[0045] In terms of design, prompt templates can be categorized into manually designed templates and automatically learned templates. Automatically learned templates can be further divided into two main categories: discrete prompts and continuous prompts. Manually designed templates are generally based on human natural language knowledge, striving to obtain semantically fluent and efficient templates. The advantage of manually designed templates is their intuitiveness, but the disadvantage is that they require more experimentation, experience, and linguistic expertise, making them costly. To address the difficulty of template construction, many studies have begun to explore how to automatically learn suitable templates. In solving single-level knowledge extraction tasks for cyberspace security, manually constructing prompts is cumbersome and prone to errors. Verifying the effectiveness of automatically generated prompts is also time-consuming. Therefore, this application proposes a method for optimizing prompts based on ontology rules. Logical rules are applied to construct templates, and a large-scale pre-trained language model, bert-base-chinese, is used for training, optimization, and knowledge extraction applications. This application uses ontology rules added during ontology construction to constrain the subjects and objects corresponding to the relationships. At the same time, a sub-prompt is constructed for each entity and relationship. Through the permutation and combination of sub-prompts, efficient and rich template construction is achieved, thereby realizing fine-grained knowledge extraction.
[0046] In knowledge extraction, some terms typically include:
[0047] An ontology is a conceptual model of a specific domain, describing the things, concepts, attributes, and relationships that exist in that domain, as well as the hierarchical structure and relationships between them. An entity refers to a specific thing in reality, such as a person, a place, a book, or a machine. In knowledge extraction, an entity is usually a specific object identified from text, such as a person's name, a place name, or a cyberattack event.
[0048] The subject is the entity that initiates a relation. For example, in the relation "User A is the father of User a1", User A is the subject. The object is the entity that is acted upon in a relation. For example, in the relation "User A is the father of User a1", User a1 is the object. A relation is a connection or interaction between two entities. For example, in the relation "User A is the father of User a1", "is the father" is a relation.
[0049] In knowledge extraction, knowledge and information in a particular domain can be further extracted by identifying entities in a text and constructing ontology models based on the relationships between them.
[0050] S14, the input prompts are split into subject prompts, relation prompts, and object prompts; subject prompts and object prompts are represented by unary functions and the tag word set is constructed using two characters, while relation prompts are represented by binary functions and the tag word set is constructed using three characters.
[0051] It can be understood that prompt optimization casts the task as a cloze (fill-in-the-blank) task on top of a pre-trained language model. Formally, a prompt consists of a template T(·) and a set of tag words V. The [MASK] token is commonly used in natural language processing to represent the word or phrase to be predicted during language model training and prediction. During training, a certain proportion of words is randomly replaced with [MASK] tokens and the language model then predicts the replaced words; this process is called masked language modeling. During prediction, when a masked word needs to be predicted, the model infers the probability of each candidate word from the context and the other known words; this process is called language model inference. The [MASK] token is useful in many natural language processing tasks, such as text classification, question answering, and machine translation.
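The random masking step of masked language modeling can be sketched in plain Python as follows. The helper name `mask_tokens`, the 15% ratio, the fixed seed, and the sample sentence are assumptions made for this illustration; the text above only states that a certain proportion of words is replaced.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}  # what the model must predict
    return masked, targets


tokens = "the attacker exploited a buffer overflow in the web server".split()
masked, targets = mask_tokens(tokens)

# Putting the targets back recovers the original sentence.
restored = [targets.get(i, tok) for i, tok in enumerate(masked)]
```

During training, the model would be asked to predict each entry of `targets` from the surrounding unmasked tokens in `masked`.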
[0052] For each entity x, a template is first used to map entity x to the prompt input. The template defines where each token of entity x is placed and whether any other tokens are added:
[0053] $x_{\mathrm{prompt}} = T(x)$
[0054] At least one [MASK] is included in the prompt input $x_{\mathrm{prompt}}$. For example, for a binary sentiment classification task, a template T(·) can be set and entity x mapped to the prompt input $x_{\mathrm{prompt}}$ as follows:
[0055] T(·) = "· It was [MASK]"
[0056] $x_{\mathrm{prompt}}$ = "x It was [MASK]"
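The template mapping above can be illustrated with a short Python sketch. The helper `make_template` and the `{x}` slot syntax are hypothetical conveniences for this example; only the template text "It was [MASK]" comes from the description.

```python
MASK = "[MASK]"


def make_template(pattern):
    """Return T(.), a function that places the input x into `pattern` at the '{x}' slot."""
    if MASK not in pattern:
        raise ValueError("a prompt template needs at least one [MASK]")

    def template(x):
        return pattern.format(x=x)

    return template


# T(.) = ". It was [MASK]" from the sentiment example.
T = make_template("{x} It was [MASK]")
x_prompt = T("I like this movie.")
```

Here `x_prompt` is the string handed to the pre-trained language model, whose job is then to fill the [MASK] position.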
[0057] Next, the prompt input $x_{\mathrm{prompt}}$ can be fed into the pre-trained language model to compute the hidden vector $h_{[\mathrm{MASK}]}$ at the [MASK] position. The probability of each tag word v at the mask is computed in turn, and the tag word with the highest probability is selected from the candidate tag word set as the target tag word at the mask. The probability that tag word v fills the mask is:
[0058] $$p([\mathrm{MASK}] = v \mid x_{\mathrm{prompt}}) = \frac{\exp(\mathbf{v} \cdot h_{[\mathrm{MASK}]})}{\sum_{v' \in V} \exp(\mathbf{v}' \cdot h_{[\mathrm{MASK}]})}$$
[0059] Here, V represents the set of tag words and $\mathbf{v}$ is the embedding of tag word v. There is also an injective mapping function $\phi$ from the actual category set $\mathcal{Y}$ to the tag word set V. In some papers, this injective mapping function is called a "verbalizer", that is, a tag word mapping:
[0060] $$\phi : \mathcal{Y} \rightarrow V$$
[0061] Using the injective mapping function $\phi$, the probability distribution over the true categories can be formalized through the probability of the tag words at the mask position, i.e., p(y):
[0062] $$p(y \mid x) = p([\mathrm{MASK}] = \phi(y) \mid x_{\mathrm{prompt}})$$
[0063] Take a binary sentiment classification task as an example. The positive sentiment category is mapped to the tag word "great", and the negative sentiment category is mapped to the tag word "terrible". The tag word "great" or "terrible" is filled into the [MASK] position of the prompt input $x_{\mathrm{prompt}}$, and the injective mapping function $\phi$ then maps it back to the actual sentiment category, which tells whether entity x carries positive or negative sentiment.
[0064] Therefore, for prompt optimization, given a template T(·), a set of tag words V, and an injective mapping function $\phi$, the learning objective is to maximize the following expression, where $\mathcal{X}$ is the training set and $y_x$ is the true category of x:
[0065] $$\frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log p([\mathrm{MASK}] = \phi(y_x) \mid T(x))$$
[0066] To address the challenges of designing task prompts for multiple categories, this application employs a prompt optimization method combined with logical rules to extract relationships in the cybersecurity domain by composing task-specific prompts from several simple sub-prompts.
[0067] Furthermore, some earlier template construction methods used a single template for all corpora. While simple and direct, this approach lacked flexibility and struggled to adapt to complex corpora. Other methods designed multiple templates for specific tasks and achieved good results on complex corpora, but different corpora required different templates, which consumed significant effort. Therefore, since this application focuses on relation extraction tasks, the prompt input is split into three sub-prompts: a subject prompt, an object prompt, and a relation prompt. For example, given a sentence $x = \{\ldots e_s \ldots e_o \ldots\}$, where $e_s$ and $e_o$ are the subject entity and the object entity respectively, a unary function $f_{e_s}$ (and likewise $f_{e_o}$) can be defined to determine the type of an entity, and the tag word sets of the subject prompt and the object prompt are constructed from two-character tag words. The tag word sets of both sub-prompts can be {entity, product, version, component, vulnerability, cause, attacker, vendor, method, impact}.
[0068] Since a relationship involves both the subject and the object, it is represented by a binary function $f_{e_s,e_o}$ that determines the relationship between the entities, and the tag word set of the relation prompt is constructed from three-character tag words. This tag word set can be {no relationship, version, vulnerability, component, cause, produced, used, led to, attacked}. Therefore, without the constraint of ontology rules, the candidate set contains a total of 10 × 10 × 9 = 900 seed prompt templates.
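The count of 900 can be checked by enumerating every combination of subject tag, relation tag, and object tag. The sketch below uses the English renderings of the tag words given above; the variable names are chosen for illustration only.

```python
from itertools import product

# Tag word sets as listed in the text (English renderings).
ENTITY_TAGS = ["entity", "product", "version", "component", "vulnerability",
               "cause", "attacker", "vendor", "method", "impact"]
RELATION_TAGS = ["no relationship", "version", "vulnerability", "component",
                 "cause", "produced", "used", "led to", "attacked"]

# Every unconstrained (subject tag, relation tag, object tag) combination.
candidates = list(product(ENTITY_TAGS, RELATION_TAGS, ENTITY_TAGS))
print(len(candidates))  # 10 * 9 * 10 = 900
```

Most of these combinations are meaningless in practice, which is the motivation for restricting them with ontology rules.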
[0069] S16: use logical rules in conjunctive normal form to connect the sub-prompts of the conditional functions related to the ontology rules, obtaining the task-specific prompt for the text to be extracted. The ontology rules are rules pre-constructed for relation extraction tasks in the cyberspace security domain and expressed in first-order logic. The conditional functions include a conditional function for determining the subject type in the subject prompt, a conditional function for determining the object type in the object prompt, and a conditional function for determining the semantic relationship between the subject and the object in the relation prompt.
[0070] It can be understood that, once the sub-prompts are constructed, 900 candidate prompt templates can be formed if no rules restrict them. In real cyberspace, however, the subjects and objects that correspond to a given relationship are relatively fixed rather than unlimited. Ontology rules can therefore be constructed to fill the prompts with network security knowledge. For example, to determine whether the "has_val" relationship holds between two marked entities, two conditions can be constructed from the prior knowledge "there is a vulnerability in the product": (1) the subject belongs to the product category and the object belongs to the vulnerability category; (2) the logical relationship between the subject and the object belongs to the "vulnerability is" category. When both conditions are satisfied, the relationship between the marked entities can be determined to be "has_val". Based on the characteristics of the cyberspace security field, the ontology rules can be constructed as shown in, for example but not limited to, Table 1 below, and can be extended according to application needs:
[0071] Table 1
[0072]
[0073] In relation extraction tasks in the cybersecurity field, the aforementioned rules can be represented using first-order logic. This application designs a set of condition functions F, where each condition function f ∈ F determines whether the input satisfies a certain condition. For example, the condition function f(x, product) indicates whether entity x belongs to the product category, and the condition function f(x, vulnerability, y) indicates whether the vulnerability of entity x is entity y. These condition functions are essentially predicates of first-order logic. Therefore, a relation can be formally represented using three sub-prompts, for example:
[0074] $$f_{e_s}(x, \text{product}) \wedge f_{e_s,e_o}(x, \text{vulnerability}, y) \wedge f_{e_o}(y, \text{vulnerability}) \rightarrow \text{has\_val}$$
[0075] Here, $f_{e_s}$ is the conditional function that determines the type of the subject entity, $f_{e_o}$ is the conditional function that determines the type of the object entity, and $f_{e_s,e_o}$ is the conditional function that determines the semantic relationship between the subject and the object. Based on these logical rules and conditional functions, the sub-prompts of the rule-related conditional functions can be formed to handle the current knowledge extraction task.
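The idea of treating an ontology rule as a conjunction of predicates can be illustrated with a small Python sketch. It is a toy under stated assumptions: the entities `ProductA` and `VulnB`, the lookup tables, and the function names `f_type`, `f_rel`, and `classify` are hypothetical, and in the method itself these judgements come from the pre-trained language model filling [MASK] positions rather than from lookup tables.

```python
# Hypothetical toy annotations standing in for model predictions.
ENTITY_TYPE = {"ProductA": "product", "VulnB": "vulnerability"}
RELATION = {("ProductA", "VulnB"): "vulnerability"}

def f_type(entity: str, type_name: str) -> bool:
    """Unary condition function: does `entity` belong to `type_name`?"""
    return ENTITY_TYPE.get(entity) == type_name

def f_rel(subject: str, relation: str, obj: str) -> bool:
    """Binary condition function: does `relation` hold from subject to obj?"""
    return RELATION.get((subject, obj)) == relation

# One ontology rule: (subject type, relation, object type) -> relation label.
RULES = {("product", "vulnerability", "vulnerability"): "has_val"}

def classify(subject: str, obj: str) -> str:
    """Return the label of the first rule whose three conditions all hold."""
    for (s_type, rel, o_type), label in RULES.items():
        if f_type(subject, s_type) and f_rel(subject, rel, obj) and f_type(obj, o_type):
            return label
    return "no_relation"

print(classify("ProductA", "VulnB"))
```

Swapping the arguments fails the subject type condition, so the reversed pair falls through to `no_relation`.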
[0076] Furthermore, since the above design turns the classification task into the evaluation of a series of condition functions, the sub-prompts of the individual condition functions must be combined into a complete task-specific prompt. This application adopts a simple strategy: using logical rules in conjunctive normal form, the sub-prompts of all condition functions related to the ontology rules are directly concatenated. The template T(x) of the task-specific prompt is thus:
[0077] $$T(x) = \left[\, T_{f_{e_s}}(x);\ T_{f_{e_s,e_o}}(x);\ T_{f_{e_o}}(x) \,\right]$$
[0078] Here, $T_{f_{e_s}}(x)$ is the subject prompt template of the conditional function for the subject type of entity x, $T_{f_{e_s,e_o}}(x)$ is the relation prompt template for the semantic relationship between the subject and the object of entity x, and $T_{f_{e_o}}(x)$ is the object prompt template of the conditional function for the object type of entity x. The mask positions $\texttt{[MASK]}_1$, $\texttt{[MASK]}_2$ and $\texttt{[MASK]}_3$ in T(x) hold the subject tag, the relation tag and the object tag, respectively.
[0079] The tag sets for subject, relation, and object are as follows:
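[0080] $$V_{f_{e_s}} = V_{f_{e_o}} = \{\text{entity, product, version, component, vulnerability, cause, attacker, vendor, method, impact}\}$$ $$V_{f_{e_s,e_o}} = \{\text{no relationship, version, vulnerability, component, cause, produced, used, led to, attacked}\}$$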
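The concatenation of the three sub-prompt templates can be sketched as follows. The text fixes only that the task-specific prompt carries three mask positions (subject tag, relation tag, object tag); the literal wording of each template below, the function names, and the example sentence are assumptions made for illustration.

```python
MASK = "[MASK]"

def subject_template(e_s: str) -> str:
    # Hypothetical wording: one mask for the subject type.
    return f"the {MASK} {e_s}"

def relation_template() -> str:
    # Hypothetical wording: one mask for the relation.
    return MASK

def object_template(e_o: str) -> str:
    # Hypothetical wording: one mask for the object type.
    return f"the {MASK} {e_o}"

def task_prompt(x: str, e_s: str, e_o: str) -> str:
    """Concatenate the sentence with the three sub-prompt templates, in order."""
    return " ".join([x, subject_template(e_s), relation_template(), object_template(e_o)])

prompt = task_prompt("ProductA is affected by VulnB.", "ProductA", "VulnB")
print(prompt)
```

Because each sub-prompt is built independently, a new relation type only needs new tag words, not a new hand-written template.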
[0081] S18 utilizes task-specific prompts to extract cyberspace knowledge from the text data to be extracted using a pre-trained language model, outputting the entities and relationships of the current task in the cyberspace security domain; the entities and relationships are used to determine the threat intelligence data in the current cyberspace security domain.
[0082] It is understandable that the pre-trained language model can be obtained by crawling Chinese cybersecurity datasets and training and fine-tuning it based on the aforementioned prompting and optimization techniques with ontology rules. After constructing task-specific prompts, the pre-trained language model can be used to extract cyberspace knowledge from the text data to be extracted, obtaining the entities and relationships for the current task. This allows for the acquisition of threat intelligence data in the current cyberspace security domain, enabling network administrators to take timely countermeasures and maintain cyberspace security.
[0083] The aforementioned rule-enhanced prompting learning-based cyberspace knowledge extraction method, for the cyberspace security threat intelligence data to be acquired, first obtains the text data to be extracted and the prompt input in the cyberspace security domain. Then, it splits the prompt input of the text data to be extracted into multiple sub-prompts. Then, using logical rules with a conjunctive paradigm, it connects all the sub-prompts with conditional functions related to the ontology rules to combine them into task-specific prompts filled with cybersecurity knowledge. This achieves efficient and flexible construction through a small number of sub-prompts in the form of template permutations and combinations. At the same time, since the prompts are filled with cybersecurity knowledge based on ontology rules, they can effectively represent the auxiliary information implied by the relationships between entities. This allows the pre-trained language model to perform fine-grained knowledge extraction from the text data to be extracted and efficiently extract the relationships between entities after obtaining the task-specific prompts. This achieves efficient extraction of entities and their relationships in the cyberspace security domain using prompt optimization technology, improving the efficiency of threat intelligence data discovery and acquisition in the cyberspace security domain.
[0084] In one embodiment, as shown in Figure 2, the process in step S16 above of connecting the sub-prompts of the conditional functions related to the ontology rules using logical rules in conjunctive normal form can specifically include the following processing steps:
[0085] S162, determine the sub-hints for the conditional functions related to the ontology rules; the sub-hints include subject hints, relation hints, and object hints;
[0086] S164, using logical rules with conjunction normal form to connect the templates for subject hints, relation hints, and object hints;
[0087] S166, add learnable tags to the tag sets of the subject prompt, the relation prompt, and the object prompt, respectively; the parameters of the learnable tags are randomly initialized.
[0088] It can be understood that, based on the aforementioned logical rules and conditional functions, the computing device responsible for knowledge extraction can directly determine the sub-prompts of the conditional functions related to the ontology rules. It then uses logical rules in conjunctive normal form to connect the templates of the subject prompt, the relation prompt, and the object prompt. At the same time, learnable tags with randomly initialized parameters can be added to make the templates fed to the pre-trained language model more effective, further improving knowledge extraction efficiency. A schematic diagram of the prompt optimization framework with ontology rules is shown in Figure 3, where [CLS] represents the classification task symbol and [SEP] represents the separator.
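Step S166 can be pictured as extending an embedding table with new entries whose vectors start from random values and are later updated during training. The sketch below is a plain-Python illustration under assumptions: the tag names `[V1]` and `[V2]`, the dimension, the Gaussian initialization, and its standard deviation are all hypothetical choices, not values stated in this application.

```python
import random

def add_learnable_tags(embeddings, tag_names, dim, seed=0, std=0.02):
    """Add one randomly initialized vector per new tag; existing entries are kept."""
    rng = random.Random(seed)
    for name in tag_names:
        if name not in embeddings:
            embeddings[name] = [rng.gauss(0.0, std) for _ in range(dim)]
    return embeddings

# Toy embedding table with one existing tag word (values assumed).
table = {"product": [0.1, 0.2, 0.3, 0.4]}
table = add_learnable_tags(table, ["[V1]", "[V2]"], dim=4)
print(sorted(table))
```

Only the newly added vectors are randomly initialized; the existing tag word keeps its original embedding.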
[0089] In one embodiment, the pre-trained language model and the classification categories are adapted as follows:
[0090] $$p(y \mid x) = \prod_{j=1}^{n} p\left(\texttt{[MASK]}_j = \phi_j(y) \mid T(x)\right)$$
[0091] It can be understood that, since the aggregated template can contain multiple [MASK] positions, all mask positions must be considered for prediction. The above formula adapts the pre-trained language model (PLM) to the classification categories. Here, p(y|x) represents the probability distribution of entity x over the true categories, $\texttt{[MASK]}_j$ represents the j-th mask position, n represents the number of mask positions in the template, $\phi_j(y)$ maps the network security category y to the tag word set of the j-th mask position, and T(x) represents the template of the task-specific prompt.
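The product over mask positions can be illustrated numerically. In the sketch below, the per-mask distributions are invented numbers standing in for the output of the pre-trained language model, and the relation labels and the name `class_probability` are chosen for the example only.

```python
def class_probability(mask_probs, tag_words):
    """Multiply, over mask positions j, the probability of the tag word phi_j(y)."""
    p = 1.0
    for dist, word in zip(mask_probs, tag_words):
        p *= dist.get(word, 0.0)
    return p

# Hypothetical distributions at the subject, relation, and object masks.
mask_probs = [
    {"product": 0.9, "vendor": 0.1},
    {"vulnerability": 0.8, "no relationship": 0.2},
    {"vulnerability": 0.7, "impact": 0.3},
]

# phi_j(y) for each category y: one tag word per mask position.
verbalizer = {
    "has_val": ("product", "vulnerability", "vulnerability"),
    "no_relation": ("entity", "no relationship", "entity"),
}

scores = {y: class_probability(mask_probs, words) for y, words in verbalizer.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

A category scores highly only when every one of its tag words is likely at its own mask position, which is how the conjunction in the rule carries over to the probability.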
[0092] In one embodiment, further, regarding step S18 above, when the pre-trained language model extracts cyberspace knowledge from the text data to be extracted using the task-specific prompts, the final learning objective is to maximize the following expression:
[0093] $$\frac{1}{|X|} \sum_{x \in X} \log \prod_{j=1}^{n} p\left(\texttt{[MASK]}_j = \phi_j(y_x) \mid T(x)\right)$$
[0094] Here, X represents the set of entities and $y_x$ is the true category of entity x. Maximizing the above expression ultimately yields efficient knowledge extraction.
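An objective of this kind, the average over the set X of the log of the per-mask product, can be computed as a sum of logs. The sketch below assumes that form and takes as input, for each example, the probabilities the model assigned to the correct tag word at each mask position; the numbers and the name `objective` are illustrative only.

```python
import math

def objective(batch):
    """Mean over examples of log prod_j p_j, computed as a sum of logs."""
    total = 0.0
    for gold_probs in batch:
        total += sum(math.log(p) for p in gold_probs)
    return total / len(batch)

# Two toy examples, three mask positions each (values assumed).
batch = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.9]]
value = objective(batch)
print(value)
```

The value is at most zero and reaches zero only when every correct tag word receives probability one, so training pushes it upward.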
[0095] In one embodiment, the formal representations of the subject prompt template and the tag set can be as follows:
[0096]
[0097]
[0098] Here, $T_{f_{e_s}}(x)$ is the template of the subject prompt, in which x represents the entity, $e_s$ represents the subject entity, and [MASK] represents the tag word of the entity; $V_{f_{e_s}}$ is the set of tag words of the subject prompt.
[0099] In one embodiment, the formal representations of the relation hint template and the tag set are further as follows:
[0100]
[0101]
[0102] Here, $T_{f_{e_s,e_o}}(x)$ is the template of the relation prompt, in which x represents the entity, $e_s$ represents the subject entity, $e_o$ represents the object entity, and [MASK] represents the tag word; $V_{f_{e_s,e_o}}$ is the set of tag words of the relation prompt.
[0103] In one embodiment, inverted ontology rules can also be used for tuning and testing the method described above. In relation extraction, there is causal logic between the subject and the object, and some implicit knowledge exists, which raises the question of whether the effect changes when the subject and the object are reversed. The ontology rules can therefore be inverted. For example, the original rule "Has_element: a component of the product is the component" can be inverted to "element_of: the component makes up the product", as shown in Table 2.
[0104] Table 2
[0105]
Reverse relation name | Reversal rule construction (subject + relation + object)
no_relation | Entities are not related to entities
version_of_ | Version defines the product
element_of | Components make up the product
vul_of | The vulnerability exists in the product
cause | The cause led to the vulnerability
has_product | The manufacturer produced the product
meansvof | The method was used by the attacker
consequence_of | The impact stems from the vulnerability
is_exploited | Vulnerability attacking party
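Rule inversion amounts to swapping the subject type and the object type and renaming the relation. The sketch below shows that swap for two rules. The pairing of `has_val` with `vul_of` is an assumption based on the rule descriptions, the relation slot is left unchanged for simplicity (a real inverted rule would also use a reworded relation tag), and all names are illustrative.

```python
# Forward rules: name -> (subject type, relation, object type).
FORWARD_RULES = {
    "has_element": ("product", "component", "component"),
    "has_val": ("product", "vulnerability", "vulnerability"),
}

# Assumed pairing of forward names with reverse names.
REVERSE_NAME = {"has_element": "element_of", "has_val": "vul_of"}

def invert(rules, names):
    """Swap subject and object types and rename each rule."""
    return {names[k]: (o_type, rel, s_type)
            for k, (s_type, rel, o_type) in rules.items()}

reversed_rules = invert(FORWARD_RULES, REVERSE_NAME)
print(reversed_rules["element_of"])
```

Running the same extraction with the forward and the inverted rule sets allows the two directions to be compared on the same data.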
[0106] The datasets used in the experiments were Chinese cybersecurity datasets crawled from the internet. After data cleaning and processing, the experiments were conducted. Two datasets were set up: Dataset 1 labeled entity locations and relationship categories, while Dataset 2 added entity category information to Dataset 1 to compare and verify whether adding knowledge improved the experimental results. Since the datasets were in Chinese, the pre-trained model bert-base-chinese was used with a learning rate of 3e-5, and the experiments were conducted in a GPU environment. The rule-based prompting optimization model constructed above was used for the experiments: Experiment 1 compared and verified the prompting optimization model with ontology rules on the two datasets. Experiment 2 reversed the ontology rules, as forward and reversed rules often contain different information; therefore, the positions of rules, subjects, and objects were reversed, and the experiments were conducted again on the two datasets. The experimental results show that the knowledge extraction of the prompting optimization model with ontology rules met the design requirements, achieving fine-grained and efficient knowledge extraction; the knowledge extraction of the prompting optimization model with rule reversal also met the design requirements, similarly achieving fine-grained and efficient knowledge extraction. The method in this application is advanced in that it uses a knowledge injection approach to learn virtual templates and simulated answers to replace manually defined rules, and can be generalized to a variety of multi-class classification tasks. Furthermore, it uses knowledge constraints to collaboratively optimize templates and answer words, making the embeddings interconnected.
[0107] It should be understood that, although the steps in the flowcharts of Figure 1 and Figure 2 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order requirement for the execution of these steps; they can be executed in other orders. Moreover, at least some of the steps in Figure 1 and Figure 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time but can be executed at different times, and they are not necessarily executed sequentially but can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0108] In one embodiment, as shown in Figure 4, a cyberspace knowledge extraction device 100 based on rule-enhanced prompt learning is also provided, including a data input module 11, a prompt splitting module 13, a prompt optimization module 15, and an extraction output module 17. The data input module 11 is used to acquire the text data to be extracted in the cyberspace security domain and the prompt input. The prompt splitting module 13 is used to split the prompt input into a subject prompt, a relation prompt, and an object prompt; the subject prompt and the object prompt are represented by unary functions with tag word sets constructed from two-character tag words, while the relation prompt is represented by a binary function with a tag word set constructed from three-character tag words. The prompt optimization module 15 is used to connect the sub-prompts of the conditional functions related to the ontology rules using logical rules in conjunctive normal form, to obtain the task-specific prompt for the text to be extracted; the ontology rules are rules pre-constructed for relation extraction tasks in the cyberspace security domain and expressed in first-order logic, and the conditional functions include a conditional function for determining the subject type in the subject prompt, a conditional function for determining the object type in the object prompt, and a conditional function for determining the semantic relationship between the subject and the object in the relation prompt. The extraction output module 17 is used to extract cyberspace knowledge from the text data to be extracted using the task-specific prompt and a pre-trained language model, and to output the entities and relationships of the current task in the cyberspace security domain; the entities and relationships are used to determine the threat intelligence data in the current cyberspace security domain.
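The division of the device into modules can be pictured as a small pipeline of pluggable callables. The sketch below is a structural illustration under assumptions: the class name `ExtractionDevice`, its field names, the template wording, and the example sentence are hypothetical, and the `extract` placeholder returns a fixed triple where a real device would query the pre-trained language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Triple = Tuple[str, str, str]

@dataclass
class ExtractionDevice:
    split_prompt: Callable[[str, str], Dict[str, str]]   # prompt splitting module 13
    build_prompt: Callable[[str, Dict[str, str]], str]   # prompt optimization module 15
    extract: Callable[[str], Triple]                     # extraction output module 17

    def run(self, text: str, e_s: str, e_o: str) -> Triple:
        # The arguments play the role of the data input module 11.
        sub_prompts = self.split_prompt(e_s, e_o)
        prompt = self.build_prompt(text, sub_prompts)
        return self.extract(prompt)

# Placeholder callables (assumed); `extract` stands in for the language model.
device = ExtractionDevice(
    split_prompt=lambda s, o: {"subject": f"[MASK] {s}",
                               "relation": "[MASK]",
                               "object": f"[MASK] {o}"},
    build_prompt=lambda text, sp: " ".join(
        [text, sp["subject"], sp["relation"], sp["object"]]),
    extract=lambda prompt: ("product", "vulnerability", "vulnerability"),
)
result = device.run("ProductA is affected by VulnB.", "ProductA", "VulnB")
print(result)
```

Keeping each module behind a plain callable mirrors the statement that the modules can be realized in software, hardware, or a combination of both.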
[0109] The aforementioned rule-enhanced prompting learning-based cyberspace knowledge extraction device 100, for the cyberspace security threat intelligence data to be acquired, first obtains the text data to be extracted and the prompt input in the cyberspace security field. Then, it splits the prompt input of the text data to be extracted into multiple sub-prompts. Then, using logical rules with a conjunction paradigm, it connects all the sub-prompts of the conditional functions related to the ontology rules to combine them into task-specific prompts filled with cybersecurity knowledge. This achieves efficient and flexible construction through a small number of sub-prompts in the form of template permutation and combination. At the same time, since the prompts are filled with cybersecurity knowledge based on ontology rules, they can effectively represent the auxiliary information contained in the relationships between entities. This allows the pre-trained language model to perform fine-grained knowledge extraction from the text data to be extracted and efficiently extract the relationships between entities after obtaining the task-specific prompts. This realizes the efficient extraction of entities and their relationships in the cyberspace security field using prompt optimization technology, and improves the efficiency of threat intelligence data discovery and acquisition in the cyberspace security field.
[0110] In one embodiment, the prompt optimization module 15, in the process of connecting sub-prompts related to the conditional functions of the ontology rules using logical rules with conjunctive normal form, can specifically be used to determine the sub-prompts related to the conditional functions of the ontology rules; the sub-prompts include subject prompts, relation prompts, and object prompts; connecting the templates of subject prompts, relation prompts, and object prompts using logical rules with conjunctive normal form; adding model-learnable tags to the tag sets of subject prompts, relation prompts, and object prompts respectively; the parameters of the model-learnable tags are randomly initialized.
[0111] In one embodiment, the template for the task-specific prompt is T(x):
[0112] $$T(x) = \left[\, T_{f_{e_s}}(x);\ T_{f_{e_s,e_o}}(x);\ T_{f_{e_o}}(x) \,\right]$$
[0113] Here, $T_{f_{e_s}}(x)$ is the subject prompt template of the conditional function for the subject type of entity x, $T_{f_{e_s,e_o}}(x)$ is the relation prompt template for the semantic relationship between the subject and the object of entity x, and $T_{f_{e_o}}(x)$ is the object prompt template of the conditional function for the object type of entity x. The mask positions $\texttt{[MASK]}_1$, $\texttt{[MASK]}_2$ and $\texttt{[MASK]}_3$ hold the subject tag, the relation tag and the object tag, respectively.
[0114] In one embodiment, the pre-trained language model and classification categories are adjusted as follows:
[0115] $$p(y \mid x) = \prod_{j=1}^{n} p\left(\texttt{[MASK]}_j = \phi_j(y) \mid T(x)\right)$$
[0116] Here, p(y|x) represents the probability distribution of entity x over the true categories, $\texttt{[MASK]}_j$ represents the j-th mask position, n represents the number of mask positions in the template, $\phi_j(y)$ maps the network security category y to the tag word set of the j-th mask position, and T(x) represents the template of the task-specific prompt.
[0117] In one embodiment, during the process of extracting cyberspace knowledge from the text data to be extracted using a pre-trained language model with task-specific prompts, the ultimate learning objective is to maximize the following expression:
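[0118] $$\frac{1}{|X|} \sum_{x \in X} \log \prod_{j=1}^{n} p\left(\texttt{[MASK]}_j = \phi_j(y_x) \mid T(x)\right)$$
[0119] Here, X represents the set of entities and $y_x$ is the true category of entity x.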
[0120] In one embodiment, the formal representations of the subject prompt template and the tag set are as follows:
[0121]
[0122]
[0123] Here, $T_{f_{e_s}}(x)$ is the template of the subject prompt, in which x represents the entity, $e_s$ represents the subject entity, and [MASK] represents the tag word of the entity; $V_{f_{e_s}}$ is the set of tag words of the subject prompt.
[0124] In one embodiment, the formal representations of the relation hint template and the tag set are as follows:
[0125]
[0126]
[0127] Here, $T_{f_{e_s,e_o}}(x)$ is the template of the relation prompt, in which x represents the entity, $e_s$ represents the subject entity, $e_o$ represents the object entity, and [MASK] represents the tag word; $V_{f_{e_s,e_o}}$ is the set of tag words of the relation prompt.
[0128] For specific limitations on the cyberspace knowledge extraction device 100 based on rule-enhanced prompt learning, refer to the corresponding limitations of the method above, which are not repeated here. Each module in the device 100 can be implemented entirely or partially through software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a device with data processing capabilities, or be stored in software form in the memory of that device, so that the processor can call and execute the operations corresponding to each module. The device can be, but is not limited to, any of the various types of cyberspace security monitoring devices that already exist in the field.
[0129] In one embodiment, a computer device is also provided, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the following processing steps: acquiring text data to be extracted in the cyberspace security domain and prompt input; splitting the prompt input into subject prompts, relation prompts, and object prompts; the subject prompts and object prompts are represented by unary functions and the tag word set is constructed using two characters, while the relation prompts are represented by binary functions and the tag word set is constructed using three characters; using logical rules with a conjunction paradigm to connect the sub-prompts of conditional functions related to the ontology rules to obtain task-specific prompts for the text to be extracted; wherein, the ontology rules are rules pre-constructed for relation extraction tasks in the cyberspace security domain and represented in first-order logical form, and the conditional functions include conditional functions for determining the subject type in the subject prompts, conditional functions for determining the object type in the object prompts, and conditional functions for determining the semantic relationship between the subject and the object in the relation prompts; using the task-specific prompts, performing cyberspace knowledge extraction on the text data to be extracted using a pre-trained language model, and outputting entities and relationships of the current task in the cyberspace security domain; the entities and relationships are used to determine threat intelligence data in the current cyberspace security domain.
[0130] It is understood that, in addition to the memory and processor mentioned above, the computer equipment described above also includes other hardware and software components not listed in this specification. The specific components can be determined according to the model of the computer equipment in different application scenarios, and will not be listed and described in detail in this specification.
[0131] In one embodiment, when the processor executes the computer program, it can also implement the steps or sub-steps added to the various embodiments of the network space knowledge extraction method based on rule-enhanced prompting learning.
[0132] In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored. When executed by a processor, the computer program performs the following processing steps: acquiring text data to be extracted in the cyberspace security domain and prompt input; splitting the prompt input into subject prompts, relation prompts, and object prompts; subject prompts and object prompts are represented by unary functions and the tag word set is constructed using two characters, while relation prompts are represented by binary functions and the tag word set is constructed using three characters; using logical rules with conjunctive paradigm to connect the sub-prompts of conditional functions related to the ontology rules to obtain task-specific prompts for the text to be extracted; wherein, the ontology rules are rules pre-constructed for relation extraction tasks in the cyberspace security domain and represented in first-order logical form, and the conditional functions include conditional functions for determining the subject type in the subject prompt, conditional functions for determining the object type in the object prompt, and conditional functions for determining the semantic relationship between the subject and object in the relation prompt; using the task-specific prompts, performing cyberspace knowledge extraction on the text data to be extracted using a pre-trained language model, and outputting entities and relationships of the current task in the cyberspace security domain; the entities and relationships are used to determine threat intelligence data in the current cyberspace security domain.
[0133] In one embodiment, when the computer program is executed by a processor, it can also implement the steps or sub-steps added to the various embodiments of the network space knowledge extraction method based on rule-enhanced prompting learning.
[0134] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and Direct Rambus DRAM (DRDRAM).
[0135] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0136] The above embodiments merely illustrate several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. A method for extracting cyberspace knowledge based on rule-enhanced prompt learning, characterized in that it includes the following steps: acquiring text data to be extracted in the cyberspace security domain and a prompt input; splitting the prompt input into a subject prompt, a relation prompt, and an object prompt, wherein the subject prompt and the object prompt are represented by unary functions with tag word sets constructed from two-character tag words, and the relation prompt is represented by a binary function with a tag word set constructed from three-character tag words; connecting the sub-prompts of the conditional functions related to ontology rules using logical rules in conjunctive normal form to obtain a task-specific prompt for the text to be extracted, wherein the ontology rules are rules pre-constructed for relation extraction tasks in the cyberspace security domain and expressed in first-order logic, and the conditional functions include a conditional function for determining the subject type in the subject prompt, a conditional function for determining the object type in the object prompt, and a conditional function for determining the semantic relationship between the subject and the object in the relation prompt; and using the task-specific prompt to extract cyberspace knowledge from the text data to be extracted through a pre-trained language model, and outputting the entities and relationships of the current task in the cyberspace security domain, wherein the entities and relationships are used to determine the threat intelligence data in the current cyberspace security domain.
2. The network space knowledge extraction method based on rule-enhanced prompting learning according to claim 1, characterized in that, The process of connecting sub-hints of conditional functions related to ontology rules using logical rules with conjunctive normal form includes: Determine sub-hints for the conditional functions related to the ontology rules; the sub-hints include the subject hint, the relation hint, and the object hint; Connect the templates for the subject hints, the relation hints, and the object hints using logical rules with conjunctive normal form; Learnable tags are added to the tag sets of the subject prompt, the relation prompt, and the object prompt, respectively; the parameters of the learnable tags are randomly initialized.
3. The cyberspace knowledge extraction method based on rule-enhanced prompt learning according to claim 2, characterized in that the template for the task-specific prompt is T(x): $$T(x) = \left[\, T_{f_{e_s}}(x);\ T_{f_{e_s,e_o}}(x);\ T_{f_{e_o}}(x) \,\right]$$ wherein $T_{f_{e_s}}(x)$ is the subject prompt template of the conditional function for the subject type of entity x, $T_{f_{e_s,e_o}}(x)$ is the relation prompt template for the semantic relationship between the subject and the object of entity x, $T_{f_{e_o}}(x)$ is the object prompt template of the conditional function for the object type of entity x, and the mask positions $\texttt{[MASK]}_1$, $\texttt{[MASK]}_2$ and $\texttt{[MASK]}_3$ hold the subject tag, the relation tag and the object tag, respectively.
4. The cyberspace knowledge extraction method based on rule-enhanced prompt learning according to any one of claims 1 to 3, characterized in that the pre-trained language model and the classification categories are adapted as follows: $$p(y \mid x) = \prod_{j=1}^{n} p\left(\texttt{[MASK]}_j = \phi_j(y) \mid T(x)\right)$$ wherein p(y|x) represents the probability distribution of entity x over the true categories, $\texttt{[MASK]}_j$ represents the j-th mask position, n represents the number of mask positions in the template, $\phi_j(y)$ maps the network security category y to the tag word set of the j-th mask position, and T(x) represents the template of the task-specific prompt.
5. The network space knowledge extraction method based on rule-enhanced prompt learning according to claim 4, characterized in that, in extracting cyberspace knowledge from the text data to be extracted using the task-specific prompt and the pre-trained language model, the final learning objective is to maximize an expression defined over X, where X represents the set of entities.
6. The network space knowledge extraction method based on rule-enhanced prompt learning according to claim 4, characterized in that, in the formal representation of the template and the tag word set of the subject prompt: the template of the subject prompt takes x and e_s, where x represents the entity, e_s represents the subject entity, and [MASK] represents the tag word of the entity; and the tag word set is the set of tag words of the subject prompt.
7. The network space knowledge extraction method based on rule-enhanced prompt learning according to claim 6, characterized in that, in the formal representation of the template and the tag word set of the relation prompt: the template of the relation prompt takes x, e_s and e_o, where x represents the entity, e_s represents the subject entity, e_o represents the object entity, and [MASK] represents the tag word of the entity; and the tag word set is the set of tag words of the object prompt.
8. A network space knowledge extraction device based on rule-enhanced prompt learning, characterized by comprising: a data input module, used to acquire the text data to be extracted and the prompt input in the field of cyberspace security; a prompt splitting module, used to split the prompt input into a subject prompt, a relation prompt, and an object prompt, wherein the subject prompt and the object prompt are each represented by a unary function with a tag word set constructed using two characters, and the relation prompt is represented by a binary function with a tag word set constructed using three characters; a prompt optimization module, used to connect the sub-prompts of the conditional functions related to ontology rules using logical rules in conjunctive normal form to obtain a task-specific prompt for the text to be extracted, wherein the ontology rules are rules pre-constructed for relation extraction tasks in the field of cyberspace security and represented in first-order logic form, and the conditional functions include a conditional function for determining the subject type in the subject prompt, a conditional function for determining the object type in the object prompt, and a conditional function for determining the semantic relationship between the subject and the object in the relation prompt; and an extraction output module, used to extract cyberspace knowledge from the text data to be extracted using the task-specific prompt and a pre-trained language model, and to output the entities and relationships of the current task in the cyberspace security domain, the entities and relationships being used to determine the threat intelligence data in the current cyberspace security domain.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the network space knowledge extraction method based on rule-enhanced prompt learning according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the network space knowledge extraction method based on rule-enhanced prompt learning according to any one of claims 1 to 7.
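The prompt construction recited in claims 2, 3 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation: the template wording ("the [MASK] e_s", "e_s [MASK] e_o"), the helper names (`SubPrompt`, `subject_prompt`, `relation_prompt`, `object_prompt`, `compose`), the example sentence and the tag words are all hypothetical, and the learnable tags of claim 2 are not modeled. The sketch only shows how three sub-prompts, each with one masking position, might be conjoined into a single task-specific template T(x).

```python
from dataclasses import dataclass

MASK = "[MASK]"


@dataclass(frozen=True)
class SubPrompt:
    template: str      # text containing exactly one [MASK] slot
    tag_words: tuple   # candidate tag words for that slot


def subject_prompt(e_s, tag_words):
    # Unary conditional function: which type does the subject entity have?
    return SubPrompt(f"the {MASK} {e_s}", tuple(tag_words))


def relation_prompt(e_s, e_o, tag_words):
    # Binary conditional function: which relation links subject and object?
    return SubPrompt(f"{e_s} {MASK} {e_o}", tuple(tag_words))


def object_prompt(e_o, tag_words):
    # Unary conditional function: which type does the object entity have?
    return SubPrompt(f"the {MASK} {e_o}", tuple(tag_words))


def compose(x, sub_prompts):
    """Conjoin the sub-prompts: append each template to x and number its mask."""
    parts = [x]
    for j, sp in enumerate(sub_prompts, start=1):
        parts.append(sp.template.replace(MASK, f"[MASK]{j}"))
    return " ".join(parts)


# Hypothetical input sentence, entities and tag words (illustration only).
x = "GroupA uses ToolB."
subs = [
    subject_prompt("GroupA", ["attacker", "organization"]),
    relation_prompt("GroupA", "ToolB", ["uses", "targets"]),
    object_prompt("ToolB", ["tool", "organization"]),
]
task_prompt = compose(x, subs)
print(task_prompt)
```

Numbering the masks in order mirrors the [MASK]_1, [MASK]_2 and [MASK]_3 positions of claim 3, so each position can later be scored against its own tag word set.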
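The scoring described in claims 4 and 5 can also be sketched in Python. The formulas themselves are not reproduced in the claim text, so this sketch assumes one common reading: p(y|x) is the product, over the n masking positions, of the probability that [MASK]_j is filled with the tag word that class y maps to, and the learning objective is the average log of that probability over the entity set X. The mapping `PHI`, the class names and the per-mask probabilities are hypothetical stand-ins for the outputs of a pre-trained language model.

```python
import math

# Hypothetical mapping: class y -> one tag word per masking position.
PHI = {
    "attacker-uses-tool": ("attacker", "uses", "tool"),
    "attacker-targets-organization": ("attacker", "targets", "organization"),
}


def class_probability(mask_probs, y):
    """Assumed form of p(y|x): product over masks of p([MASK]_j = tag word of y)."""
    p = 1.0
    for dist, word in zip(mask_probs, PHI[y]):
        p *= dist.get(word, 0.0)
    return p


def predict(mask_probs):
    """Return the class whose tag words are jointly most probable."""
    return max(PHI, key=lambda y: class_probability(mask_probs, y))


def mean_log_likelihood(batch):
    """Assumed objective: average log p(y|x) over (mask_probs, true_class) pairs.

    Requires every true class to have nonzero probability.
    """
    return sum(math.log(class_probability(mp, y)) for mp, y in batch) / len(batch)


# Hypothetical per-mask distributions for one input (illustration only).
mask_probs = [
    {"attacker": 0.75, "organization": 0.25},  # [MASK]_1, subject tag
    {"uses": 0.75, "targets": 0.25},           # [MASK]_2, relation tag
    {"tool": 0.5, "organization": 0.5},        # [MASK]_3, object tag
]
best = predict(mask_probs)
score = class_probability(mask_probs, best)
objective = mean_log_likelihood([(mask_probs, best)])
```

Under these assumptions, training would adjust the model so that `mean_log_likelihood` increases on labeled examples, while inference only needs `predict`.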