A named entity recognition method, device, electronic equipment and storage medium
An active learning strategy based on evolutionary algorithms selects the optimal data from unlabeled data for annotation, and the model is trained with the Roberta-BiLSTM-CRF framework. This solves the problem that named entity recognition in the field of cybersecurity does not consider data uncertainty and diversity together, achieving low-cost, efficient data annotation and improved model performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING TOPSEC NETWORK SECURITY TECH
- Filing Date
- 2023-07-20
- Publication Date
- 2026-06-23
AI Technical Summary
Existing named entity recognition methods in the field of cybersecurity fail to simultaneously consider the uncertainty and diversity of data, leading to sampling bias and increased data annotation costs.
An active learning strategy based on evolutionary algorithms is adopted, in which the best unlabeled data is selected from the unlabeled data for manual annotation, and the model is trained using the Roberta-BiLSTM-CRF framework to balance uncertainty and diversity, thereby achieving high model performance with a low number of labeled samples.
By selecting and labeling unlabeled data that is rich in information and comprehensive, the cost of data labeling is reduced, while the performance of the named entity recognition model is improved.
Figure CN116861247B_ABST
Description
Technical Field
[0001] This application relates to the field of network security technology, and more specifically, to a named entity recognition method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Named Entity Recognition (NER), a key technology in natural language processing tasks, plays a crucial role in many fields, such as knowledge graph construction, machine translation, information retrieval, and question-answering systems. With the increasing number of cyberattacks, the internet is generating ever-growing amounts of cybersecurity data, such as blogs, forums, and databases. This data contains a wealth of valuable information. NER technology can automatically extract cybersecurity entities of interest to security researchers from this data, enriching cybersecurity knowledge, discovering new threats, viruses, vulnerabilities, etc., and taking timely and effective measures.
[0003] Existing named entity recognition methods in the field of cybersecurity do not consider the uncertainty and diversity of unlabeled data at the same time when selecting data for labeling, which leads to sampling bias and increases the cost of data labeling.
Summary of the Invention
[0004] The purpose of this application is to provide a named entity recognition method, apparatus, electronic device, and storage medium. By utilizing an active learning strategy, it simultaneously considers the uncertainty and diversity of data and selects data with rich and comprehensive information from unlabeled data, thereby achieving high model performance with a lower number of labeled samples. This solves the problem that existing methods do not simultaneously consider the uncertainty and diversity of data, leading to sampling bias and increased data labeling costs.
[0005] An embodiment of this application provides a named entity recognition method, the method comprising: obtaining unstructured text in the field of network security;
[0006] The unstructured text is input into a trained named entity recognition model to obtain network security entities in the unstructured text; wherein the named entity recognition model is obtained by using active learning based on evolutionary algorithms to select the optimal unlabeled data from the unlabeled data pool for manual labeling, and then training on the resulting labeled data.
[0007] In the above implementation process, an active learning strategy based on evolutionary algorithms is used to select the optimal unlabeled data from the unlabeled data. The optimal unlabeled data takes into account both the uncertainty and diversity of the data. It selects data that is rich in information (high uncertainty) and comprehensive (high diversity) from the unlabeled data, thereby achieving high model performance with a low number of labeled samples. This solves the problem that existing methods do not take into account the uncertainty and diversity of the data at the same time, which leads to sampling bias and increases the cost of data labeling.
[0008] Furthermore, using evolutionary algorithm-based active learning, the optimal unlabeled data is selected from the unlabeled data pool for manual annotation to obtain labeled data, which is then used to train the named entity recognition model, including:
[0009] Multiple randomly selected data points are manually labeled, and the labeled data is stored in the labeled data pool;
[0010] The named entity recognition model is trained using the labeled data in the labeled data pool;
[0011] The optimal unlabeled data is selected from the unlabeled data pool using active learning based on evolutionary algorithms, and then manually labeled and stored in the labeled data pool.
[0012] Repeat the steps of training the named entity recognition model and selecting the best unlabeled data using active learning based on evolutionary algorithms and then manually labeling it until the number of labeled data in the labeled data pool reaches a preset threshold.
[0013] The named entity recognition model is trained using the labeled data in the labeled data pool to obtain the final named entity recognition model.
[0014] In the above implementation process, active learning based on evolutionary algorithms is used to select the optimal unlabeled data, thereby reducing the cost of data labeling and achieving better model performance with a smaller amount of labeled data.
[0015] Furthermore, training the named entity recognition model using the labeled data in the labeled data pool includes:
[0016] The Roberta-BiLSTM-CRF framework is trained using the labeled data to obtain a named entity recognition model, where Roberta is used to convert the unstructured text into semantic vectors; BiLSTM is used to bidirectionally model the contextual information of the text; and CRF is used to learn the transition probabilities between labels.
[0017] In the above implementation process, the Roberta-BiLSTM-CRF model is used for training to obtain the globally optimal label sequence, where the label is the entity category corresponding to each word in the text.
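As an illustration of how a CRF layer can yield a globally optimal label sequence, the following minimal sketch runs Viterbi decoding over per-word label scores and a label-transition score matrix. It assumes the per-word scores are already available (for example from the BiLSTM layer); the function name `viterbi_decode` and the toy numbers are hypothetical and are not taken from this application.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j for word t (assumed to come from the encoder).
    transitions[i][j] : score of moving from label i to label j (the CRF part).
    """
    num_labels = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            # Best previous label for reaching label j at position t.
            best_i = max(range(num_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with labels 0 = O, 1 = B, 2 = I, where O -> I is heavily penalized.
emissions = [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
best_path = viterbi_decode(emissions, transitions)
```

In the toy example, picking the best label per word independently would give O followed by I; the transition scores steer the decoder to B followed by I instead, which is the sense in which the sequence is optimal as a whole.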
[0018] Furthermore, the step of selecting the optimal unlabeled data from the unlabeled data pool using active learning based on evolutionary algorithms includes:
[0019] Multiple individuals are randomly selected from the unlabeled data pool to form an initial population. The initial population contains POP individuals, and each individual L contains n sentences: $L = \{l_1, l_2, \ldots, l_n\}$;
[0020] Encode each individual in the initial population with a real number;
[0021] The crossover operator is used to perform a crossover operation between the individuals to generate new individuals and put them into the initial population;
[0022] The individuals are mutated using a polynomial mutation operator to generate new individuals, which are then added to the initial population.
[0023] Fitness is calculated for each individual in the initial population, and a binary tournament selection method is used to select individuals from the initial population based on the calculation results, selecting POP individuals to form a new population;
[0024] Repeat the above crossover, mutation, and individual selection operations to update the new population until the preset maximum number of iterations is reached;
[0025] The Pareto optimal solution for the final population is obtained using an evolutionary algorithm. If there are multiple Pareto optimal solutions, the optimal solution with the highest sum of uncertainty score and diversity score is selected.
[0026] The optimal solution is decoded and mapped to the corresponding optimal unlabeled data in the unlabeled data pool.
[0027] In the above implementation process, a balance can be struck between the uncertainty principle and the diversity principle, so that uncertainty and diversity are both optimized as far as possible, and information-rich, comprehensive data is selected from the unlabeled data.
[0028] Further, the fitness calculation for each individual in the population includes:
[0029] Each individual is input into the named entity recognition model for classification to obtain the probability that each word w in each sentence l is labeled as each category;
[0030] Calculate the information entropy of each sentence based on the aforementioned probability:
[0031]
[0032] Calculate the individual's uncertainty score based on the information entropy:
[0033]
[0034] The individual is represented as $L = \{l_1, l_2, \ldots, l_n\}$;
[0035] The individual is input into the Sentence-BERT model to obtain the vector representation $V_l$ of each sentence l;
[0036] The diversity score of the individual is calculated based on the vector representation:
[0037]
[0038] where μ represents the average vector of the individual, and $\cos(V_l, \mu)$ represents the cosine similarity between $V_l$ and μ.
[0039] In the above implementation process, information entropy is used to calculate the uncertainty score, and cosine similarity is used to calculate the diversity score.
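The formulas for paragraphs [0031], [0033] and [0037] are not reproduced in this text, so the sketch below only illustrates one plausible reading of the description. It assumes that sentence entropy is the sum of per-word Shannon entropies, that the uncertainty score U(L) is the mean sentence entropy of the individual, and that the diversity score D(L) is the mean of 1 - cos(V_l, μ). All function names are hypothetical, and the probabilities and vectors are assumed to be supplied by the named entity recognition model and Sentence-BERT respectively.

```python
import math
from typing import List, Sequence


def sentence_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Assumed form: sum of per-word entropies; token_probs[i][c] = P(category c | word i)."""
    return -sum(p * math.log(p) for probs in token_probs for p in probs if p > 0.0)


def uncertainty_score(individual_probs: Sequence[Sequence[Sequence[float]]]) -> float:
    """Assumed form of U(L): mean sentence entropy over the n sentences of an individual."""
    return sum(sentence_entropy(s) for s in individual_probs) / len(individual_probs)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diversity_score(vectors: Sequence[Sequence[float]]) -> float:
    """Assumed form of D(L): mean of 1 - cos(V_l, mu), with mu the average sentence vector."""
    dim = len(vectors[0])
    mu: List[float] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return sum(1.0 - cosine(v, mu) for v in vectors) / len(vectors)


# One sentence with one word: a confident prediction versus a 50/50 prediction.
confident = uncertainty_score([[[1.0, 0.0]]])
unsure = uncertainty_score([[[0.5, 0.5]]])
# Two identical sentence vectors versus two orthogonal ones.
redundant = diversity_score([[1.0, 0.0], [1.0, 0.0]])
varied = diversity_score([[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, an individual whose sentences the model is unsure about scores higher on uncertainty, and an individual whose sentence vectors spread away from their mean scores higher on diversity.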
[0040] This application embodiment also provides a named entity recognition device, the device comprising:
[0041] The data acquisition module is used to obtain unstructured text in the field of cybersecurity;
[0042] The recognition module is used to input the unstructured text into a trained named entity recognition model to obtain network security entities in the unstructured text; wherein, the named entity recognition model is trained by selecting the best unlabeled data from the unlabeled data pool through active learning based on evolutionary algorithms and manually labeling it.
[0043] In the above implementation process, an active learning strategy based on evolutionary algorithms is used to select the optimal unlabeled data from the unlabeled data. The optimal unlabeled data takes into account both the uncertainty and diversity of the data. It selects data that is rich in information (high uncertainty) and comprehensive (high diversity) from the unlabeled data, thereby achieving high model performance with a low number of labeled samples. This solves the problem that existing methods do not take into account the uncertainty and diversity of the data at the same time, which leads to sampling bias and increases the cost of data labeling.
[0044] Furthermore, the device also includes a model building module, which comprises:
[0045] The data annotation module is used to manually annotate multiple randomly selected data points and store the annotated data in the annotation data pool;
[0046] The training module is used to train the named entity recognition model using the labeled data in the labeled data pool;
[0047] The evolutionary algorithm-based active learning module is used to select the best unlabeled data from the unlabeled data pool using evolutionary algorithm-based active learning, and then manually label and store it in the labeled data pool.
[0048] The labeled data acquisition module is used to repeatedly train the named entity recognition model and select the best unlabeled data and manually label it using active learning based on evolutionary algorithms until the number of labeled data in the labeled data pool reaches a preset threshold.
[0049] The named entity recognition model building module is used to train the named entity recognition model using the labeled data in the labeled data pool to obtain the final named entity recognition model.
[0050] In the above implementation process, active learning based on evolutionary algorithms is used to select the optimal unlabeled data, thereby reducing the cost of data labeling and achieving better model performance with a smaller amount of labeled data.
[0051] Furthermore, the training module includes:
[0052] The named entity recognition model training module is used to train the Roberta-BiLSTM-CRF framework using the labeled data to obtain a named entity recognition model. Roberta is used to convert the unstructured text into semantic vectors; BiLSTM is used to bidirectionally model the contextual information of the text; and CRF is used to learn the transition probabilities between labels.
[0053] In the above implementation process, the Roberta-BiLSTM-CRF model is used for training to obtain the globally optimal label sequence, where the label is the entity category corresponding to each word in the text.
[0054] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor runs the computer program to enable the electronic device to perform the named entity recognition method described in any of the above-described embodiments.
[0055] This application also provides a readable storage medium storing computer program instructions, which, when read and executed by a processor, perform the named entity recognition method described in any of the above-described embodiments.
Description of the Drawings
[0056] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0057] Figure 1 is a flowchart of a named entity recognition method provided in an embodiment of this application;
[0058] Figure 2 is a detailed flowchart of model training provided in an embodiment of this application;
[0059] Figure 3 is a flowchart of the training process of the named entity recognition model provided in an embodiment of this application;
[0060] Figure 4 is a flowchart of the process for obtaining the optimal unlabeled data provided in an embodiment of this application;
[0061] Figure 5 is a structural block diagram of a named entity recognition device provided in an embodiment of this application;
[0062] Figure 6 is a structural block diagram of another named entity recognition device provided in an embodiment of this application.
[0063] Reference numerals:
[0064] 100 - Data Acquisition Module; 200 - Recognition Module; 300 - Model Building Module; 310 - Data Labeling Module; 320 - Training Module; 321 - Named Entity Recognition Model Training Module; 330 - Active Learning Module Based on Evolutionary Algorithm; 340 - Labeled Data Acquisition Module; 350 - Named Entity Recognition Model Building Module.
Detailed Implementation
[0065] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings.
[0066] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this application, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0067] Embodiment 1
[0068] Please refer to Figure 1, which is a flowchart of a named entity recognition method provided in an embodiment of this application. The method recasts the active learning strategy as a multi-objective optimization problem and uses a multi-objective evolutionary optimization algorithm to search for the optimal solution. The algorithm selects the data with the greatest uncertainty and diversity from the unlabeled data, reducing the cost of data labeling.
[0069] The method specifically includes the following steps:
[0070] Step S100: Obtain unstructured text in the field of cybersecurity;
[0071] Step S200: Input the unstructured text into the trained named entity recognition model to obtain network security entities in the unstructured text; wherein the named entity recognition model is trained on data obtained by using active learning based on evolutionary algorithms to select the optimal unlabeled data from the unlabeled data pool and manually labeling it.
[0072] By inputting unstructured text from the cybersecurity field into a named entity recognition model, the model can label the words in the text and extract cybersecurity entities.
[0073] Figure 2 shows the specific flow of model training: active learning based on evolutionary algorithms selects the optimal unlabeled data from the unlabeled data pool for manual annotation, and the resulting labeled data is used to train the named entity recognition model. Figure 3 is a flowchart of the named entity recognition model training process, which may include the following steps:
[0074] Step S310: Manually label multiple randomly selected data points and store the labeled data in the labeled data pool;
[0075] For example, randomly select n data points and have human experts annotate them, then put the annotated data into an annotated data pool.
[0076] Step S320: Train the named entity recognition model using the labeled data in the labeled data pool;
[0077] Specifically, the Roberta-BiLSTM-CRF framework is trained using the labeled data to obtain a named entity recognition model, wherein Roberta is used to convert the unstructured text into semantic vectors; BiLSTM is used to bidirectionally model the contextual information of the text; and CRF is used to learn the transition probabilities between labels.
[0078] The Roberta-BiLSTM-CRF model can be used to obtain the globally optimal label sequence, where the label is the entity category corresponding to each word in the text.
[0079] Step S330: Use active learning based on evolutionary algorithms to select the best unlabeled data from the unlabeled data pool, and then manually label and store it in the labeled data pool again;
[0080] Step S340: Repeat the steps of training the named entity recognition model and selecting the best unlabeled data and manually labeling it using active learning based on evolutionary algorithms until the number of labeled data in the labeled data pool reaches a preset threshold.
[0081] Repeat steps S320-S330 until the amount of data in the labeled data pool reaches the set threshold V.
[0082] Step S350: Train the named entity recognition model using the labeled data in the labeled data pool to obtain the final named entity recognition model.
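Steps S310 to S350 can be read as the loop sketched below. This is a minimal outline under stated assumptions, not the implementation of this application: `train`, `select` and `annotate` are hypothetical callables standing in for Roberta-BiLSTM-CRF training, the evolutionary selection of step S330, and the human annotator, and sentences are referred to by integer index.

```python
import random
from typing import Callable, Dict, List, Tuple


def active_learning_loop(
    pool_size: int,
    initial_size: int,
    threshold: int,
    train: Callable[[Dict[int, str]], object],
    select: Callable[[object, List[int]], List[int]],
    annotate: Callable[[int], str],
    seed: int = 0,
) -> Tuple[object, Dict[int, str]]:
    rng = random.Random(seed)
    unlabeled: List[int] = list(range(pool_size))   # unlabeled data pool (indices)
    labeled: Dict[int, str] = {}                    # labeled data pool

    def move_to_labeled(indices: List[int]) -> None:
        for idx in indices:
            labeled[idx] = annotate(idx)            # manual labeling (stubbed)
            unlabeled.remove(idx)

    move_to_labeled(rng.sample(unlabeled, initial_size))   # S310: random seed set
    while len(labeled) < threshold and unlabeled:          # S340: repeat until threshold V
        model = train(labeled)                             # S320: train on labeled pool
        chosen = select(model, list(unlabeled))            # S330: pick data to label
        if not chosen:                                     # guard against an empty selection
            break
        move_to_labeled(chosen)
    return train(labeled), labeled                         # S350: final model


# Stub callables that only exercise the control flow.
train_calls: List[int] = []


def fake_train(labeled_pool: Dict[int, str]) -> int:
    train_calls.append(len(labeled_pool))
    return len(labeled_pool)


final_model, labeled_pool = active_learning_loop(
    pool_size=1000, initial_size=100, threshold=500,
    train=fake_train, select=lambda model, candidates: candidates[:100],
    annotate=lambda idx: "O",
)
```

With 100 initial sentences, 100 sentences added per round and a threshold of 500, the loop retrains four times during selection and once more at the end.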
[0083] In step S330, the two query strategies for active learning are the uncertainty principle and the diversity principle. The uncertainty principle states that the greater the uncertainty of data, the richer the information it contains; therefore, it is necessary to find data with high uncertainty. The diversity principle states that the greater the diversity among data points, the less repetitive or redundant the information contained within them, and the more comprehensive the information contained in the data; therefore, it is necessary to find data with high diversity.
[0084] Active learning can be transformed into a multi-objective optimization problem, with the final optimization objective being:
[0085] Objective 1: Select the data with the greatest uncertainty from the unlabeled data.
[0086] $\max f_1 = U(L)$;
[0087] Objective 2: Select the data with the greatest diversity from the unlabeled data.
[0088] $\max f_2 = D(L)$;
[0089] Where U(L) represents the uncertainty score and D(L) represents the diversity score.
[0090] The calculation of the uncertainty score and the diversity score is explained below.
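Because both objectives are maximized at once, candidates are usually compared by Pareto dominance rather than by a single number. The following minimal sketch, with hypothetical function names and made-up scores, shows that comparison for pairs (U(L), D(L)).

```python
from typing import List, Tuple

Scores = Tuple[float, float]  # (U(L), D(L)) of one candidate individual


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(scores: List[Scores]) -> List[int]:
    """Indices of the candidates that no other candidate dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Made-up scores: the last candidate is worse than the second on both objectives.
example_scores = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
front = pareto_front(example_scores)
```

The first three candidates trade uncertainty against diversity and none dominates another, so all three remain on the front.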
[0091] Figure 4 shows the process of obtaining the optimal unlabeled data. Step S330 is an optimal-solution search based on an evolutionary algorithm; the NSGA-II evolutionary algorithm can be used for this search. It specifically includes the following steps:
[0092] Step S331: Randomly select multiple individuals from the unlabeled data pool to form an initial population. The number of individuals in the initial population is POP, and each individual L contains n sentences: $L = \{l_1, l_2, \ldots, l_n\}$;
[0093] Step S332: Encode each individual in the initialized population with a real number;
[0094] Each individual is encoded as a chromosome using real-number encoding. Each encoded value lies in the range [0.0, Num], where Num is the number of sentences in the unlabeled data pool.
[0095] For example, if 3 sentences are to be selected from the unlabeled data as one individual, then n is 3. If the unlabeled data contains 100 sentences, the real-number encoding of $L = \{l_1, l_2, \ldots, l_n\}$ takes a form such as {0.5, 49.6, 98.1}.
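A minimal sketch of steps S331 and S332 follows, assuming each gene is drawn uniformly from [0.0, Num]. The function name `init_population` and the use of Python's `random.Random` are illustrative choices, not details given in this application.

```python
import random
from typing import List


def init_population(pop_size: int, n: int, num: int, rng: random.Random) -> List[List[float]]:
    """Create pop_size individuals, each a list of n real-valued genes in [0.0, num]."""
    return [[rng.uniform(0.0, num) for _ in range(n)] for _ in range(pop_size)]


# Four individuals of n = 3 sentences each, drawn from a pool of Num = 100 sentences.
population = init_population(pop_size=4, n=3, num=100, rng=random.Random(0))
```

Each inner list plays the role of one chromosome such as {0.5, 49.6, 98.1}.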
[0096] Step S333: Use the crossover operator to perform a crossover operation between the individuals to generate new individuals and put them into the initialized population;
[0097] Specifically, the crossover probability is set to $p_{cro}$, and a random number between 0 and 1 is generated. If the random number is less than $p_{cro}$, the SBX (Simulated Binary Crossover) operator is used to perform crossover between individuals in the initial population, and the resulting new individuals are added to the initial population.
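The sketch below shows one common textbook form of SBX together with the probability gate described above. The distribution index `eta` and the clipping of children to [low, high] are assumptions, since this application does not state them, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sbx_crossover(
    p1: Sequence[float], p2: Sequence[float],
    low: float, high: float, eta: float, rng: random.Random,
) -> Tuple[List[float], List[float]]:
    """Simulated binary crossover applied gene by gene; children are clipped to [low, high]."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = rng.random()  # u lies in [0, 1)
        if u <= 0.5:
            beta = (2.0 * u) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
        y1 = 0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2)
        y2 = 0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2)
        c1.append(min(max(y1, low), high))
        c2.append(min(max(y2, low), high))
    return c1, c2


def maybe_crossover(
    p1: Sequence[float], p2: Sequence[float], p_cro: float,
    low: float, high: float, eta: float, rng: random.Random,
) -> List[List[float]]:
    """Apply SBX only when a uniform draw falls below p_cro; otherwise produce no children."""
    if rng.random() < p_cro:
        return list(sbx_crossover(p1, p2, low, high, eta, rng))
    return []


rng = random.Random(0)
parent_a, parent_b = [0.5, 49.6, 98.1], [10.0, 20.0, 30.0]
children = maybe_crossover(parent_a, parent_b, 1.0, 0.0, 100.0, 15.0, rng)
```

Before clipping, the two children of each gene keep the same mean as their parents, which is the property that makes SBX behave like single-point crossover on binary strings.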
[0098] Step S334: Use a polynomial mutation operator to mutate the individual to generate a new individual and put it into the initialized population;
[0099] Specifically, the mutation probability is set to $p_{mut}$, and a random number between 0 and 1 is generated. If the random number is less than $p_{mut}$, the polynomial mutation operator is used to mutate the individuals in the initial population, and the resulting new individuals are put into the initial population.
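A minimal sketch of a basic polynomial mutation with the same kind of probability gate is given below. The distribution index `eta`, the choice to perturb every gene of a selected individual, and the clipping to [low, high] are assumptions; the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def polynomial_mutation(
    individual: Sequence[float], low: float, high: float, eta: float, rng: random.Random,
) -> List[float]:
    """Return a mutated copy; each gene is shifted by delta * (high - low) and clipped."""
    mutated = []
    for x in individual:
        u = rng.random()
        if u < 0.5:
            delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0          # delta in [-1, 0)
        else:
            delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))  # delta in [0, 1)
        mutated.append(min(max(x + delta * (high - low), low), high))
    return mutated


def maybe_mutate(
    individual: Sequence[float], p_mut: float,
    low: float, high: float, eta: float, rng: random.Random,
) -> Optional[List[float]]:
    """Mutate only when a uniform draw falls below p_mut; otherwise return None."""
    if rng.random() < p_mut:
        return polynomial_mutation(individual, low, high, eta, rng)
    return None


rng = random.Random(1)
parent = [0.5, 49.6, 98.1]
mutant = polynomial_mutation(parent, 0.0, 100.0, 20.0, rng)
```

A larger `eta` concentrates the perturbation near zero, so mutants stay close to the parent; a smaller value explores the index range more widely.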
[0100] Step S335: Calculate the fitness of each individual in the initial population, and select individuals from the initial population based on the calculation results using the binary tournament selection method, selecting POP individuals to form a new population;
[0101] The fitness of each individual L is calculated, including the calculation of uncertainty score and diversity score, specifically:
[0102] First, perform a fast non-dominated sort on the individuals in the population, dividing them into different non-dominated levels according to their dominance relationships. After this operation, each individual in the population has a non-dominated order attribute $n_{rank}$.
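The following sketch illustrates the fast non-dominated sort in the style of NSGA-II for two maximized objectives. It takes one (f1, f2) pair per individual and returns the non-dominated order, with 1 as the best level; the function name and the sample scores are hypothetical.

```python
from typing import List, Tuple


def non_dominated_ranks(scores: List[Tuple[float, float]]) -> List[int]:
    """Return the non-dominated order of each individual (1 = best level)."""

    def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

    n = len(scores)
    dominated = [[] for _ in range(n)]  # dominated[i]: individuals that i dominates
    counts = [0] * n                    # counts[i]: how many individuals dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(scores[i], scores[j]):
                dominated[i].append(j)
            elif dominates(scores[j], scores[i]):
                counts[i] += 1

    ranks = [0] * n
    front = [i for i in range(n) if counts[i] == 0]
    level = 1
    while front:
        next_front = []
        for i in front:
            ranks[i] = level
            for j in dominated[i]:
                counts[j] -= 1          # i is now ranked, so it no longer blocks j
                if counts[j] == 0:
                    next_front.append(j)
        front = next_front
        level += 1
    return ranks


ranks = non_dominated_ranks([(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)])
```

The first three individuals are mutually non-dominated and share level 1; each further level contains the individuals dominated only by members of earlier levels.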
[0103] Next, calculate the crowding degree $n_d$ of each individual L in the population. $f_1$ and $f_2$ are the two objective functions to be optimized in this application. For each objective function, all individuals in the population are sorted in ascending order of their objective function values. The crowding degree of the first and last individuals is set to ∞, and the crowding degree of each remaining individual is computed from the objective function values of its neighbors. The crowding degree $n_d$ of each individual L in the population is calculated as follows:
[0104] $$n_d(L) = \sum_{m=1}^{2} \frac{f_m(L+1) - f_m(L-1)}{f_m^{\max} - f_m^{\min}}$$
[0105] where $f_m(L-1)$ and $f_m(L+1)$ are the values of objective function $f_m$ for the individuals ranked immediately before and after individual L, and $f_m^{\max}$ and $f_m^{\min}$ are the maximum and minimum values of objective function $f_m$, respectively.
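As a worked illustration, the sketch below computes the usual NSGA-II crowding distance, which is built from the quantities named in [0105]: the neighboring objective values and the maximum and minimum of each objective. The function name is hypothetical, and skipping an objective whose values are all equal is an assumption made here to avoid division by zero.

```python
import math
from typing import List, Tuple


def crowding_distances(scores: List[Tuple[float, ...]]) -> List[float]:
    """Crowding degree of each individual, given one tuple of objective values per individual."""
    n = len(scores)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(scores[0])):
        order = sorted(range(n), key=lambda i: scores[i][m])   # ascending in objective m
        f_min, f_max = scores[order[0]][m], scores[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf            # boundary individuals
        if f_max == f_min:
            continue                                           # assumption: skip flat objectives
        for k in range(1, n - 1):
            i = order[k]
            gap = scores[order[k + 1]][m] - scores[order[k - 1]][m]
            dist[i] += gap / (f_max - f_min)
    return dist


distances = crowding_distances([(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)])
```

In this three-point example the two extreme individuals receive ∞, and the middle one receives 1 from each objective, for a total of 2.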
[0106] The calculation process for the objective function values of f1 and f2 is as follows:
[0107] Each individual is input into the named entity recognition model for classification to obtain the probability that each word w in each sentence l is labeled as each category;
[0108] Calculate the information entropy of each sentence based on the aforementioned probability:
[0109]
[0110] Calculate the individual's uncertainty score based on the information entropy:
[0111]
[0112] The individual is represented as $L = \{l_1, l_2, \ldots, l_n\}$;
[0113] The individual is input into the Sentence-BERT model to obtain the vector representation $V_l$ of each sentence l;
[0114] The diversity score of the individual is calculated based on the vector representation:
[0115]
[0116] where μ represents the average vector of the individual, and $\cos(V_l, \mu)$ represents the cosine similarity between $V_l$ and μ.
[0117] Finally, based on the binary tournament selection strategy, POP individuals are selected from the population to form a new population. The details are as follows:
[0118] Randomly select two individuals from the population, and use a crowding comparison operator to choose the best one to add to the new population. Return the remaining individuals to the original population. Repeat this process until the new population contains a total of POP individuals.
[0119] The crowding comparison operator compares individual L with another individual K, and individual L wins if either of the following conditions is met:
[0120] First, the non-dominated order of individual L is less than the non-dominated order of individual K;
[0121] Second, the non-dominated order of individual L is the same as that of individual K, and the crowding degree of individual L is greater than that of individual K.
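The selection described in [0118] to [0121] can be sketched as follows. The sketch assumes that the winner of each tournament leaves the candidate set while the loser stays, that exact ties go to the second individual drawn, and that a single remaining candidate is taken directly; none of these details is stated in this application, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def crowded_better(rank_l: int, dist_l: float, rank_k: int, dist_k: float) -> bool:
    """True if L beats K: lower non-dominated order, or equal order and larger crowding degree."""
    return rank_l < rank_k or (rank_l == rank_k and dist_l > dist_k)


def tournament_select(
    ranks: Sequence[int], dists: Sequence[float], pop_size: int, rng: random.Random,
) -> List[int]:
    """Pick pop_size distinct individuals (by index) through repeated binary tournaments."""
    remaining = list(range(len(ranks)))
    chosen: List[int] = []
    while len(chosen) < pop_size and remaining:
        if len(remaining) == 1:
            winner = remaining[0]
        else:
            a, b = rng.sample(remaining, 2)
            winner = a if crowded_better(ranks[a], dists[a], ranks[b], dists[b]) else b
        remaining.remove(winner)   # the loser stays available for later tournaments
        chosen.append(winner)
    return chosen


ranks = [1, 1, 2, 3]
dists = [math.inf, 0.5, math.inf, math.inf]
selected = tournament_select(ranks, dists, pop_size=2, rng=random.Random(0))
```

In this example the individual at index 3 has the worst non-dominated order, so it loses every tournament it enters and is never selected, whichever pairs are drawn.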
[0122] Step S336: Repeat the above crossover operation, mutation operation and individual selection operation to update the new population until the preset maximum number of iterations is reached;
[0123] After each round of crossover, mutation, and selection, the old population is replaced by the new population. After N iterations there is still a single population, which has been updated N times.
[0124] Step S337: Use an evolutionary algorithm to obtain the Pareto optimal solution for the final population. If there are multiple Pareto optimal solutions, select the optimal solution with the highest sum of uncertainty score and diversity score.
[0125] Step S338: Decode the optimal solution and map it to the corresponding optimal unlabeled data in the unlabeled data pool.
[0126] Specifically, the currently selected optimal solution exists in the form of a real number encoding, therefore it needs to be decoded and mapped to the specific data in the unlabeled data. Using the example described in step S332, if the selected excellent individual is encoded as {1.2, 67.4, 88.8}, then the encoding is mapped to integers, i.e., {1, 67, 88}. Finally, the data in the unlabeled data with indices 1, 67, and 88 are the selected optimal solutions.
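Steps S337 and S338 can be sketched as below: among the Pareto optimal candidates, the one with the largest sum of uncertainty and diversity scores is kept, and its real-number encoding is truncated to integer indices. Clamping to Num - 1 is an assumption added so that a gene equal to Num still maps to a valid zero-based index; the function names and the scores attached to the two encodings are hypothetical.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[List[float], float, float]  # (real-number encoding, U(L), D(L))


def pick_best(front: Sequence[Candidate]) -> List[float]:
    """Return the encoding whose uncertainty score plus diversity score is largest."""
    return max(front, key=lambda item: item[1] + item[2])[0]


def decode(encoding: Sequence[float], num: int) -> List[int]:
    """Truncate each gene to an integer index, clamped to [0, num - 1] (assumed)."""
    return [min(int(gene), num - 1) for gene in encoding]


# Two Pareto optimal candidates with made-up scores.
front = [
    ([0.5, 49.6, 98.1], 0.9, 0.2),
    ([1.2, 67.4, 88.8], 0.6, 0.7),
]
best_encoding = pick_best(front)
indices = decode(best_encoding, num=100)
```

Here the second candidate has the larger score sum, and its encoding {1.2, 67.4, 88.8} decodes to the indices 1, 67 and 88, matching the example in the text. The text does not say how repeated indices within one individual are handled, so the sketch leaves them as they are.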
[0127] For example, this method can be applied to threat intelligence analysis to extract key entities from unstructured text in the cybersecurity field. Specifically, it includes the following steps:
[0128] Step S11: Define the entity categories to be extracted from the text, such as attack organizations, attack tools, attack methods, etc.
[0129] Step S12: Set the final amount of data to be labeled to 500 (preset threshold V);
[0130] Step S13: Randomly select 100 sentences from the unlabeled data, manually label them, and put them into the labeled data pool;
[0131] Step S14: Train the named entity recognition model using data from the labeled data pool;
[0132] Step S15: Using an active learning strategy based on evolutionary algorithms, select 100 data points from the unlabeled data;
[0133] Step S16: Manually label the data selected in step S15 and put it into the labeled data pool;
[0134] Step S17: Repeat steps S15 to S16 until the amount of data in the labeled data pool is 500;
[0135] Step S18: Train the named entity recognition model again using the data in the labeled data pool. The resulting model is the final named entity recognition model.
[0136] Step S19: Input unstructured text from the cybersecurity field into the named entity recognition model obtained in step S18 to detect the security entities contained in the text.
[0137] This method uses an active learning strategy based on evolutionary algorithms, which balances the uncertainty principle and the diversity principle so that both are optimized as far as possible. It selects information-rich and comprehensive data from the unlabeled data, thereby reducing the cost of data labeling and achieving better model performance with less labeled data.
[0138] Embodiment 2
[0139] This application provides a named entity recognition device. Figure 5 is a structural block diagram of the device, which includes, but is not limited to:
[0140] Data acquisition module 100 is used to acquire unstructured text in the field of network security;
[0141] The recognition module 200 is used to input the unstructured text into a trained named entity recognition model to obtain network security entities in the unstructured text; wherein, the named entity recognition model is obtained by selecting the best unlabeled data from the unlabeled data pool through active learning based on evolutionary algorithms, manually labeling it, and then training it with the labeled data.
[0142] Figure 6 is a structural block diagram of another named entity recognition device. The device further includes a model building module 300, which comprises:
[0143] The data annotation module 310 is used to manually annotate multiple randomly selected data points and store the annotated data in the annotation data pool;
[0144] Training module 320 is used to train the named entity recognition model using the labeled data in the labeled data pool;
[0145] The evolutionary algorithm-based active learning module 330 is used to select the best unlabeled data from the unlabeled data pool using evolutionary algorithm-based active learning, and then manually label and store it in the labeled data pool.
[0146] Specifically:
[0147] Multiple individuals are randomly selected from the unlabeled data pool to form an initial population. The initial population contains POP individuals, and each individual L contains n sentences: $L = \{l_1, l_2, \ldots, l_n\}$;
[0148] Encode each individual in the initial population with a real number;
[0149] The crossover operator is used to perform a crossover operation between the individuals to generate new individuals and put them into the initial population;
[0150] The individuals are mutated using a polynomial mutation operator to generate new individuals, which are then added to the initial population.
[0151] Fitness is calculated for each individual in the initial population, and a binary tournament selection method is used to select individuals from the initial population based on the calculation results, selecting POP individuals to form a new population;
[0152] Repeat the above crossover, mutation, and individual selection operations to update the new population until the preset maximum number of iterations is reached;
[0153] The Pareto optimal solution for the final population is obtained using an evolutionary algorithm. If there are multiple Pareto optimal solutions, the optimal solution with the highest sum of uncertainty score and diversity score is selected.
[0154] The optimal solution is decoded and mapped to the corresponding optimal unlabeled data in the unlabeled data pool.
[0155] The labeled data acquisition module 340 is used to repeatedly train the named entity recognition model and select the best unlabeled data and manually label it using active learning based on evolutionary algorithms until the number of labeled data in the labeled data pool reaches a preset threshold.
[0156] The named entity recognition model building module 350 is used to train the named entity recognition model using the labeled data in the labeled data pool to obtain the final named entity recognition model.
[0157] The calculation of the uncertainty score and the diversity score has been explained in detail in Embodiment 1 and will not be repeated here.
[0158] The training module 320 includes:
[0159] The named entity recognition model training module 321 is used to train the Roberta-BiLSTM-CRF framework using the labeled data to obtain a named entity recognition model, wherein Roberta is used to convert the unstructured text into semantic vectors; BiLSTM is used to bidirectionally model the contextual information of the text; and CRF is used to learn the transition probabilities between labels.
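To illustrate why learned label transitions matter, the following sketch shows Viterbi decoding over per-word label scores and a transition matrix, which is the standard way a linear-chain CRF layer picks the best label sequence. It is a simplified, hand-written example and not the framework of this application: the label set (O, B, I), the emission scores (which in the real model would come from the Roberta and BiLSTM layers), and the transition scores are all made up for demonstration.

```python
def viterbi(emissions, transitions):
    """emissions[t][j]: score of label j at word t; transitions[i][j]: score of i -> j."""
    n_labels = len(transitions)
    score = list(emissions[0])
    back = []
    for em in emissions[1:]:
        prev, score, ptr = score, [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            i_best = max(range(n_labels), key=lambda i: prev[i] + transitions[i][j])
            ptr.append(i_best)
            score.append(prev[i_best] + transitions[i_best][j] + em[j])
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Labels: 0 = O, 1 = B (entity start), 2 = I (entity continuation).
transitions = [
    [0.0, 0.0, -10.0],  # O -> I is heavily penalized
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [2.0, 0.0, 0.0],  # word 1 looks like O
    [0.0, 0.5, 1.0],  # word 2 looks slightly more like I than B
    [0.0, 0.0, 1.0],  # word 3 looks like I
]
greedy = [max(range(3), key=lambda j: em[j]) for em in emissions]
best = viterbi(emissions, transitions)
print(greedy)  # [0, 2, 2]: O, I, I, an invalid sequence
print(best)    # [0, 1, 2]: O, B, I
```

Choosing each label independently yields an entity that starts with I directly after O, whereas decoding with the transition scores yields the well-formed sequence O, B, I.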
[0160] This device utilizes an active learning strategy based on evolutionary algorithms to select the optimal unlabeled data from the unlabeled data. The optimal unlabeled data takes into account both the uncertainty and diversity of the data, selecting data that is both information-rich (high uncertainty) and comprehensive (high diversity) from the unlabeled data. This achieves high model performance with a low number of labeled samples, solving the problem that existing methods do not simultaneously consider the uncertainty and diversity of the data, leading to sampling bias and increased data labeling costs.
[0161] This application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor runs the computer program to enable the electronic device to perform the named entity recognition method described in Embodiment 1.
[0162] This application also provides a readable storage medium storing computer program instructions which, when read and executed by a processor, perform the named entity recognition method described in Embodiment 1.
[0163] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and/or flowchart, and combinations of blocks in block diagrams and/or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0164] In addition, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.
[0165] If the aforementioned functions are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0166] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application. It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.
[0167] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0168] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
Claims
1. A named entity recognition method, characterized in that the method includes: obtaining unstructured text in the field of cybersecurity; and inputting the unstructured text into a trained named entity recognition model to obtain the cybersecurity entities in the unstructured text; wherein the named entity recognition model is trained on labeled data obtained by using active learning based on an evolutionary algorithm to select the optimal unlabeled data from an unlabeled data pool for manual annotation; the active learning is transformed into a multi-objective optimization problem whose final optimization objectives are to select the data with the greatest uncertainty from the unlabeled data and to select the data with the greatest diversity from the unlabeled data; selecting the optimal unlabeled data from the unlabeled data pool using active learning based on an evolutionary algorithm includes: randomly selecting multiple individuals from the unlabeled data pool to form an initial population, wherein the number of individuals in the initial population is POP and each individual L includes n sentences: L = {l_1, l_2, ..., l_n}; encoding each individual in the initial population with a real number; performing a crossover operation between the individuals using a crossover operator to generate new individuals and placing them into the initial population; performing a polynomial mutation operation on the individuals using a polynomial mutation operator to generate new individuals and placing them into the initial population; calculating the fitness of each individual in the initial population, and using a binary tournament selection method to select POP individuals from the initial population based on the calculation results to form a new population; repeating the crossover, mutation, and individual selection operations described above to update the new population until a preset maximum number of iterations is reached; obtaining the Pareto optimal solution for the final population using the evolutionary algorithm; if there are multiple Pareto optimal solutions, selecting the optimal solution with the highest sum of uncertainty score and diversity score; and decoding the optimal solution and mapping it to the corresponding optimal unlabeled data in the unlabeled data pool; the fitness calculation includes: inputting each individual into the named entity recognition model for classification to obtain the probability that each word w of each sentence l is labeled as each category; calculating the information entropy of each sentence based on the probability: [formula not reproduced]; calculating the uncertainty score of the individual based on the information entropy: [formula not reproduced], where the individual is represented as L = {l_1, l_2, ..., l_n}; inputting the individual into the Sentence-BERT model to obtain the vector representation V_l of each sentence l; and calculating the diversity score of the individual based on the vector representation: [formula not reproduced], where the terms of the formula denote the average vector of the individual and a cosine similarity between vector representations.
2. The named entity recognition method according to claim 1, characterized in that using active learning based on an evolutionary algorithm to select the optimal unlabeled data from the unlabeled data pool for manual annotation, so as to obtain the labeled data used for training the named entity recognition model, includes: manually labeling multiple randomly selected data items and storing the labeled data in a labeled data pool; training the named entity recognition model using the labeled data in the labeled data pool; selecting the optimal unlabeled data from the unlabeled data pool using active learning based on the evolutionary algorithm, manually labeling it, and storing it in the labeled data pool; repeating the steps of training the named entity recognition model and of selecting and manually labeling the optimal unlabeled data until the number of labeled data items in the labeled data pool reaches a preset threshold; and training the named entity recognition model using the labeled data in the labeled data pool to obtain the final named entity recognition model.
3. The named entity recognition method according to claim 2, characterized in that the step of training the named entity recognition model using the labeled data in the labeled data pool includes: training the Roberta-BiLSTM-CRF framework using the labeled data to obtain a named entity recognition model, wherein Roberta is used to convert the unstructured text into semantic vectors; BiLSTM is used to bidirectionally model the contextual information of the text; and CRF is used to learn the transition probabilities between labels.
4. A named entity recognition device, characterized in that the device includes: a data acquisition module, used to obtain unstructured text in the field of cybersecurity; and a recognition module, used to input the unstructured text into a trained named entity recognition model to obtain the cybersecurity entities in the unstructured text; wherein the named entity recognition model is trained on labeled data obtained by using active learning based on an evolutionary algorithm to select the optimal unlabeled data from an unlabeled data pool for manual annotation; the active learning is transformed into a multi-objective optimization problem whose final optimization objectives are to select the data with the greatest uncertainty from the unlabeled data and to select the data with the greatest diversity from the unlabeled data; selecting the optimal unlabeled data from the unlabeled data pool using active learning based on an evolutionary algorithm includes: randomly selecting multiple individuals from the unlabeled data pool to form an initial population, wherein the number of individuals in the initial population is POP and each individual L includes n sentences: L = {l_1, l_2, ..., l_n}; encoding each individual in the initial population with a real number; performing a crossover operation between the individuals using a crossover operator to generate new individuals and placing them into the initial population; performing a polynomial mutation operation on the individuals using a polynomial mutation operator to generate new individuals and placing them into the initial population; calculating the fitness of each individual in the initial population, and using a binary tournament selection method to select POP individuals from the initial population based on the calculation results to form a new population; repeating the crossover, mutation, and individual selection operations described above to update the new population until a preset maximum number of iterations is reached; obtaining the Pareto optimal solution for the final population using the evolutionary algorithm; if there are multiple Pareto optimal solutions, selecting the optimal solution with the highest sum of uncertainty score and diversity score; and decoding the optimal solution and mapping it to the corresponding optimal unlabeled data in the unlabeled data pool; the fitness calculation includes: inputting each individual into the named entity recognition model for classification to obtain the probability that each word w of each sentence l is labeled as each category; calculating the information entropy of each sentence based on the probability: [formula not reproduced]; calculating the uncertainty score of the individual based on the information entropy: [formula not reproduced], where the individual is represented as L = {l_1, l_2, ..., l_n}; inputting the individual into the Sentence-BERT model to obtain the vector representation V_l of each sentence l; and calculating the diversity score of the individual based on the vector representation: [formula not reproduced], where the terms of the formula denote the average vector of the individual and a cosine similarity between vector representations.
5. The named entity recognition device according to claim 4, characterized in that the device further includes a model building module, which comprises: a data annotation module, used to manually label multiple randomly selected data items and store the labeled data in the labeled data pool; a training module, used to train the named entity recognition model using the labeled data in the labeled data pool; an evolutionary algorithm-based active learning module, used to select the optimal unlabeled data from the unlabeled data pool using active learning based on the evolutionary algorithm, after which the data is manually labeled and stored in the labeled data pool; a labeled data acquisition module, used to repeat the steps of training the named entity recognition model and of selecting and manually labeling the optimal unlabeled data until the number of labeled data items in the labeled data pool reaches a preset threshold; and a named entity recognition model building module, used to train the named entity recognition model using the labeled data in the labeled data pool to obtain the final named entity recognition model.
6. The named entity recognition device according to claim 5, characterized in that the training module includes: a named entity recognition model training module, used to train the Roberta-BiLSTM-CRF framework using the labeled data to obtain a named entity recognition model, wherein Roberta is used to convert the unstructured text into semantic vectors; BiLSTM is used to bidirectionally model the contextual information of the text; and CRF is used to learn the transition probabilities between labels.
7. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory being used to store a computer program, and the processor running the computer program to cause the electronic device to perform the named entity recognition method according to any one of claims 1 to 3.
8. A readable storage medium, characterized in that the readable storage medium stores computer program instructions which, when read and executed by a processor, perform the named entity recognition method according to any one of claims 1 to 3.