Corpus label acquisition method and device and computer device

By acquiring and integrating the vector similarity of candidate corpus data, candidate corpus labels are determined as training samples, which solves the problem of insufficient training samples and improves the accuracy and efficiency of model training.

CN114328915B · Active · Publication Date: 2026-06-23 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2021-12-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

During model training, insufficient training samples can lead to overfitting, resulting in low generalization and accuracy. Therefore, obtaining a sufficient number of training samples is an urgent problem to be solved.

Method used

By acquiring the sample labels and initial corpus data of the model to be trained, encoding them into label corpus vectors, calculating the vector similarity of candidate corpus data, and integrating them to obtain candidate prediction result vectors, the candidate corpus labels of the candidate corpus data are determined and used as training samples to expand the training samples of the model.

Benefits of technology

This approach expands the training samples for the model, improving the accuracy and efficiency of model training and ensuring the accuracy and efficiency of corpus expansion.

✦ Generated by Eureka AI based on patent content.

Abstract

Embodiments of the present application disclose a corpus label acquisition method and device and a computer device, relating to machine learning technology in the field of artificial intelligence. The method comprises: acquiring k sample labels of a to-be-trained model and initial corpus data corresponding to each sample label, encoding each initial corpus data to obtain label corpus vectors corresponding to the k sample labels respectively; acquiring candidate corpus data, encoding the candidate corpus data into a candidate corpus vector; acquiring vector similarities between the k label corpus vectors and the candidate corpus vector respectively, integrating the k label corpus vectors based on the vector similarities to obtain a candidate prediction result vector corresponding to the candidate corpus data; determining a candidate corpus label of the candidate corpus data based on the candidate prediction result vector, and determining the candidate corpus label and the candidate corpus data as training samples of the to-be-trained model. The present application can improve the accuracy and efficiency of corpus expansion.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a method, apparatus and computer equipment for obtaining corpus tags.

Background Technology

[0002] With the development of artificial intelligence, data analysis often involves training a model and then using that model to analyze the data. Training a model typically requires a sufficient number of training samples. If the number of training samples is too small, the trained model may overfit, resulting in low generalization performance and accuracy. However, only a limited number of training samples may be available during model training. Therefore, how to acquire a sufficient number of training samples for model training has become a pressing issue.

Summary of the Invention

[0003] This application provides a method, apparatus, and computer device for obtaining corpus tags, which can improve the accuracy and efficiency of corpus expansion.

[0004] One embodiment of this application provides a method for obtaining corpus tags, the method comprising:

[0005] Obtain k sample labels for the model to be trained, and the initial corpus data corresponding to each sample label. Encode the initial corpus data corresponding to each sample label to obtain the label corpus vectors corresponding to the k sample labels respectively; k is a positive integer.

[0006] Obtain candidate corpus data and encode the candidate corpus data into candidate corpus vectors;

[0007] Obtain the vector similarity between each of the k label corpus vectors and the candidate corpus vector, and integrate the k label corpus vectors based on the vector similarities to obtain the candidate prediction result vector corresponding to the candidate corpus data;

[0008] Based on the candidate prediction result vector, the candidate corpus labels of the candidate corpus data are determined, and the candidate corpus labels and candidate corpus data are determined as training samples for the model to be trained.
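The four method steps above can be illustrated with a minimal sketch. The source does not specify the similarity measure or how the label corpus vectors are integrated, so the sketch assumes cosine similarity and a softmax over the k similarities. The function names (`cosine`, `softmax`, `predict_label`) and the toy vectors are hypothetical, and real vectors would come from an encoder rather than being written by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(label_vectors, candidate_vector, labels):
    """Return (candidate corpus label, prediction probabilities) for one candidate."""
    sims = [cosine(v, candidate_vector) for v in label_vectors]
    probs = softmax(sims)  # assumption: softmax plays the role of "integration"
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy usage with k = 2 hand-written label corpus vectors.
label, probs = predict_label([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1], ["weather", "music"])
print(label)  # weather
```

The pair of the candidate text and the returned label would then be added to the training samples of the model to be trained.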

[0009] One embodiment of this application provides a corpus tag acquisition device, which includes:

[0010] The initial corpus acquisition module is used to acquire the k sample labels of the model to be trained, and the initial corpus data corresponding to each sample label;

[0011] The initial corpus encoding module is used to encode the initial corpus data corresponding to each sample label, resulting in label corpus vectors corresponding to k sample labels; k is a positive integer.

[0012] The candidate corpus encoding module is used to acquire candidate corpus data and encode the candidate corpus data into candidate corpus vectors.

[0013] The similarity determination module is used to obtain the vector similarity between each of the k label corpus vectors and the candidate corpus vector;

[0014] The initial prediction module is used to integrate the k label corpus vectors based on the vector similarities to obtain the candidate prediction result vector corresponding to the candidate corpus data;

[0015] The sample determination module is used to determine the candidate corpus labels of the candidate corpus data based on the candidate prediction result vector, and to determine the candidate corpus labels and candidate corpus data as training samples for the model to be trained.

[0016] Each sample label corresponds to at least two initial corpus data.

[0017] The initial corpus encoding module includes:

[0018] The block coding unit is used to encode at least two initial corpus data corresponding to the i-th sample label to obtain the initial corpus vectors corresponding to the at least two initial corpus data corresponding to the i-th sample label; i is a positive integer less than or equal to k;

[0019] The vector fusion unit is used to perform vector fusion on at least two initial corpus vectors corresponding to the i-th sample label to obtain the label corpus vector corresponding to the i-th sample label.
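As an illustration of the vector fusion unit, the sketch below fuses the initial corpus vectors of one sample label by element-wise averaging. Averaging is an assumption; the source only says the vectors are fused. The name `fuse_vectors` is hypothetical.

```python
def fuse_vectors(vectors):
    """Average equal-length vectors element-wise into one label corpus vector."""
    if not vectors:
        raise ValueError("need at least one vector to fuse")
    count = len(vectors)
    # zip(*vectors) walks the vectors column by column.
    return [sum(column) / count for column in zip(*vectors)]

# Two initial corpus vectors for the i-th sample label.
print(fuse_vectors([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```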

[0020] The similarity determination module includes:

[0021] The label vector acquisition unit is used to acquire the label vectors corresponding to the k sample labels.

[0022] The vector optimization unit is used to fuse the label corpus vector and the label vector that are associated with the same sample label to obtain the optimized corpus vector corresponding to each sample label.

[0023] The similarity acquisition unit is used to obtain the vector similarity between the k optimized corpus vectors and the candidate corpus vectors.
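One possible reading of the vector optimization unit is a weighted blend of the label corpus vector and the label vector for the same sample label. The sketch below assumes a convex combination with a hypothetical weight `alpha`. The source does not state the fusion rule, so both the rule and the name `optimize_corpus_vector` are illustrative only.

```python
def optimize_corpus_vector(label_corpus_vector, label_vector, alpha=0.5):
    """Blend a label corpus vector with the vector of the label itself.

    alpha is a hypothetical weight: 1.0 keeps only the corpus vector,
    0.0 keeps only the label vector.
    """
    if len(label_corpus_vector) != len(label_vector):
        raise ValueError("vectors must have the same length")
    return [alpha * c + (1.0 - alpha) * t
            for c, t in zip(label_corpus_vector, label_vector)]

print(optimize_corpus_vector([2.0, 0.0], [0.0, 2.0]))  # [1.0, 1.0]
```

The k optimized corpus vectors would then replace the k label corpus vectors when similarities to the candidate corpus vector are computed.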

[0024] The candidate prediction result vector includes k candidate prediction categories and the prediction probability corresponding to each candidate prediction category;

[0025] The sample determination module includes:

[0026] The probability selection unit is used to select the candidate prediction category with the highest prediction probability from k candidate prediction categories, and to determine the candidate prediction category with the highest prediction probability as the candidate corpus label of the candidate corpus data.

[0027] The sample selection unit is used to obtain the corpus selection threshold. If the prediction probability corresponding to the candidate corpus label is greater than or equal to the corpus selection threshold, the candidate corpus label and candidate corpus data are determined as training samples for the model to be trained.
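The probability selection and sample selection units can be sketched as follows. The function name `select_sample` and the example threshold values are hypothetical. The logic follows the text: take the category with the highest prediction probability, and keep the sample only if that probability is greater than or equal to the corpus selection threshold.

```python
def select_sample(candidate, labels, probs, threshold):
    """Return (candidate, label) if the top probability meets the threshold, else None."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] >= threshold:
        return candidate, labels[best]
    return None

labels = ["weather", "music"]
print(select_sample("play some jazz", labels, [0.2, 0.8], 0.7))  # ('play some jazz', 'music')
print(select_sample("play some jazz", labels, [0.2, 0.8], 0.9))  # None
```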

[0028] The number of candidate corpus data is N; N is a positive integer;

[0029] The sample determination module includes:

[0030] Multiple prediction units are used to determine the candidate corpus labels corresponding to each of the N candidate corpus data based on the candidate prediction result vectors corresponding to the N candidate corpus data respectively;

[0031] The quantity determination unit is used to determine the number of samples to be expanded based on the initial corpus data corresponding to each sample label in the model to be trained.

[0032] The sample determination unit is used to determine the N candidate corpus data and the candidate corpus labels corresponding to the N candidate corpus data as training samples for the model to be trained if N is less than or equal to the number of sample expansions.

[0033] The sample determination unit is also used to determine the corpus confidence of the N candidate corpus data based on the candidate prediction result vectors corresponding to the N candidate corpus data if N is greater than the number of sample expansions, and to obtain the training samples of the model to be trained from the N candidate corpus data and the candidate corpus labels corresponding to the N candidate corpus data based on the corpus confidence.
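The two branches of the sample determination unit can be sketched as below. The sketch assumes that the corpus confidence of a candidate is its highest prediction probability. The source leaves the confidence measure open, so that choice and the name `expand_samples` are assumptions.

```python
def expand_samples(candidates, labels, prob_vectors, expansion_count):
    """Pick training samples from N candidates given a sample expansion count."""
    scored = []
    for text, probs in zip(candidates, prob_vectors):
        best = max(range(len(probs)), key=lambda i: probs[i])
        # Assumed confidence: the top prediction probability.
        scored.append((probs[best], text, labels[best]))
    if len(scored) > expansion_count:
        # N exceeds the expansion count: keep only the most confident candidates.
        scored.sort(key=lambda item: item[0], reverse=True)
        scored = scored[:expansion_count]
    # Otherwise all N candidates are kept in their original order.
    return [(text, label) for _, text, label in scored]

probs = [[0.6, 0.4], [0.1, 0.9], [0.55, 0.45]]
print(expand_samples(["a", "b", "c"], ["x", "y"], probs, 2))  # [('b', 'y'), ('a', 'x')]
```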

[0034] The training samples for the model to be trained include training sample data and training sample labels; the training sample data includes candidate corpus data and initial corpus data corresponding to each sample label; the device also includes:

[0035] The sample prediction module is used to input training sample data into the model to be trained for prediction and obtain the sample prediction results corresponding to the training sample data.

[0036] The first training module is used to generate a first loss function based on the sample prediction results and the training sample labels, and to adjust the parameters of the model to be trained based on the first loss function to obtain the target model.

[0037] The device also includes:

[0038] The model parsing module is used to receive data parsing requests for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed.

[0039] The data matching module is used to obtain historical parsing data and the corresponding historical parsing results, match the data to be parsed with the historical parsing data, and obtain the data matching degree between the data to be parsed and the historical parsing data.

[0040] The data parsing module is used to determine the second parsing result of the data to be parsed based on the data matching degree and historical parsing results;

[0041] The parsing and integration module is used to integrate the first and second parsing results to obtain the target parsing result of the data to be parsed.

[0042] The device also includes:

[0043] The model parsing module is also used to receive data parsing requests for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed.

[0044] The template parsing module is used to obtain corpus templates, extract target corpus templates that match the data to be parsed from the corpus templates, and determine the template parsing result corresponding to the target corpus template as the third parsing result of the data to be parsed.

[0045] The parsing and integration module is also used to integrate the first and third parsing results to obtain the target parsing result of the data to be parsed.

[0046] The device also includes:

[0047] The model parsing module is also used to receive data parsing requests for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed.

[0048] The key parsing module is used to extract the key information to be parsed from the data to be parsed through the key information extraction model, perform semantic analysis on the key information to be parsed, and determine the fourth parsing result of the data to be parsed.

[0049] The parsing and integration module is also used to integrate the first parsing result and the fourth parsing result to obtain the target parsing result of the data to be parsed.

[0050] The device also includes:

[0051] The key prediction module is used to input the training sample data into the initial key information extraction model for prediction, and obtain the predicted key information corresponding to the training sample data.

[0052] The key determination module is used to determine the key prediction results corresponding to the training sample data based on the predicted key information.

[0053] The second training module is used to generate a second loss function based on the key prediction results and training sample labels. Based on the second loss function, the parameters of the initial key information extraction model are adjusted to obtain the key information extraction model.

[0054] Specifically, the candidate corpus encoding module is used for:

[0055] Acquire candidate corpus data, and encode the candidate corpus data into candidate corpus vectors through the encoding layer in the corpus matching model;

[0056] The similarity determination module is specifically used for:

[0057] Obtaining, through the feature fusion layer in the corpus matching model, the vector similarity between each of the k label corpus vectors and the candidate corpus vector;

[0058] The initial prediction module is specifically used for:

[0059] Integrating the k label corpus vectors based on the vector similarities to obtain the candidate prediction result vector corresponding to the candidate corpus data.

[0060] The device also includes:

[0061] The sample acquisition module is used to acquire d first corpus samples and the first corpus label of each first corpus sample, and to acquire second corpus samples and the second corpus label of each second corpus sample; d is a positive integer;

[0062] The first corpus encoding module is used to encode the second corpus sample based on the encoding layer in the initial corpus matching model to obtain the second corpus vector corresponding to the second corpus sample;

[0063] The second corpus encoding module is used to encode d first corpus samples based on the encoding layer in the initial corpus matching model, to obtain the first corpus vectors corresponding to the d first corpus samples respectively, and to integrate the first corpus vectors corresponding to the first corpus samples with the same first corpus label to obtain the label corpus matrix.

[0064] The label vector fusion module is used to fuse the label corpus matrix and the second corpus vector through the feature fusion layer in the initial corpus matching model to obtain the initial prediction result of the second corpus sample.

[0065] The model generation module is used to generate a third loss function based on the second corpus labels and the initial prediction results, and to adjust the parameters of the initial corpus matching model based on the third loss function to generate a corpus matching model.
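The training signal for the corpus matching model can be sketched in a few lines. The sketch assumes that the label corpus matrix is built by averaging first corpus vectors that share a label, that the feature fusion layer reduces to dot products followed by a softmax, and that the third loss function is cross-entropy. None of these choices is confirmed by the source, and the names `build_label_matrix` and `matching_loss` are hypothetical. Only the loss value is computed; the parameter update is omitted.

```python
import math
from collections import defaultdict

def build_label_matrix(vectors, labels):
    """Average vectors sharing a label; returns (sorted labels, one row per label)."""
    groups = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        groups[lab].append(vec)
    ordered = sorted(groups)
    matrix = [[sum(col) / len(groups[lab]) for col in zip(*groups[lab])]
              for lab in ordered]
    return ordered, matrix

def matching_loss(matrix, ordered_labels, query_vector, query_label):
    """Cross-entropy of a softmax over dot-product scores against the true label."""
    scores = [sum(a * b for a, b in zip(row, query_vector)) for row in matrix]
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_total - scores[ordered_labels.index(query_label)]

# Three first corpus vectors (d = 3) and one second corpus vector.
ordered, matrix = build_label_matrix([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], ["a", "a", "b"])
print(ordered, matrix)  # ['a', 'b'] [[2.0, 0.0], [0.0, 2.0]]
print(matching_loss(matrix, ordered, [1.0, 0.0], "a") < matching_loss(matrix, ordered, [1.0, 0.0], "b"))  # True
```

A lower loss for the correct second corpus label is what adjusting the parameters of the initial corpus matching model would aim for.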

[0066] The initial corpus acquisition module includes:

[0067] The page display unit is used to respond to requests to add corpus to the model to be trained and to display the corpus management page associated with the model to be trained.

[0068] The data acquisition unit is used to acquire the k sample tags submitted on the corpus management page, output the k sample tags on the corpus management page, and acquire the initial corpus data corresponding to each sample tag submitted on the corpus management page based on the k sample tags.

[0069] One embodiment of this application provides a computer device, including a processor, a memory, and an input / output interface;

[0070] The processor is connected to a memory and an input / output interface, respectively. The input / output interface is used to receive and output data, the memory is used to store computer programs, and the processor is used to call the computer programs so that the computer device containing the processor executes the corpus tag acquisition method in one aspect of the embodiments of this application.

[0071] One aspect of this application provides a computer-readable storage medium storing a computer program adapted to be loaded and executed by a processor, such that a computer device having the processor performs the corpus tag acquisition method of one aspect of this application.

[0072] One aspect of this application provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods provided in various optional embodiments of this application. In other words, when the computer instructions are executed by the processor, they implement the methods provided in various optional embodiments of this application.

[0073] Implementing the embodiments of this application will have the following beneficial effects:

[0074] In this embodiment, k sample labels of the model to be trained and the initial corpus data corresponding to each sample label are obtained. The initial corpus data corresponding to each sample label is encoded to obtain the label corpus vectors corresponding to the k sample labels respectively; k is a positive integer. Candidate corpus data is obtained and encoded into candidate corpus vectors. The vector similarity between the k label corpus vectors and the candidate corpus vectors is obtained. The k label corpus vectors are integrated based on the vector similarity to obtain the candidate prediction result vectors corresponding to the candidate corpus data. The candidate corpus labels of the candidate corpus data are determined based on the candidate prediction result vectors. The candidate corpus labels and candidate corpus data are determined as training samples of the model to be trained. Through the above process, the training samples are expanded using the acquired seed corpus (i.e., the initial corpus data corresponding to each sample label), resulting in candidate corpus data. The labels of the seed corpus are then used to predict the labels of the candidate corpus data, thus determining the training samples for model training from the candidate corpus data. This expands the model's training samples to obtain the maximum number of available training samples, improving training accuracy. Simultaneously, the correspondence between sample labels and initial corpus data is used to establish a relationship with the candidate corpus data, obtaining candidate corpus labels. This maps the candidate corpus data to sample labels, further improving the accuracy and efficiency of corpus expansion.

Attached Figure Description

[0075] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0076] Figure 1 is a network interaction architecture diagram for corpus acquisition provided in an embodiment of this application;

[0077] Figure 2 is a schematic diagram of a corpus tag acquisition scenario provided in an embodiment of this application;

[0078] Figure 3 is a flowchart of a method for obtaining corpus tags provided in an embodiment of this application;

[0079] Figure 4 is a schematic diagram of a corpus acquisition scenario provided in an embodiment of this application;

[0080] Figure 5 is a schematic diagram of a model prediction scenario provided in an embodiment of this application;

[0081] Figure 6 is a flowchart of a data parsing method provided in an embodiment of this application;

[0082] Figure 7 is a schematic diagram of a data parsing scenario provided in an embodiment of this application;

[0083] Figure 8 is a flowchart of a model training method provided in an embodiment of this application;

[0084] Figure 9 is a schematic diagram of a model training scenario provided in an embodiment of this application;

[0085] Figure 10 is a schematic diagram of a corpus tag acquisition device provided in an embodiment of this application;

[0086] Figure 11 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.

Detailed Implementation

[0087] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0088] Before and during user data collection, a prompt interface or pop-up window is displayed to inform the user that XXXX data is currently being collected. Data acquisition steps only begin after the user confirms the prompt interface or pop-up window; otherwise, the process ends. For example, if the candidate corpus data includes some non-public user data, a prompt interface or pop-up window can be displayed to obtain the user's authorization for the collection of that candidate corpus data. Optionally, a data usage message can be provided to the user, indicating the purpose of the requested data collection, so that the user understands the intended use of the collected data.

[0089] Please refer to Figure 1, which is a network interaction architecture diagram for corpus acquisition provided in an embodiment of this application. The computer device 101 can acquire existing training samples of the model to be trained, such as initial corpus data corresponding to k sample labels, where k is a positive integer. This initial corpus data can be stored in the computer device 101 or in a terminal device, such as terminal device 102a, terminal device 102b, or terminal device 102c. Optionally, it can also be stored in a blockchain network, cloud storage space, or the like, without limitation. Specifically, the computer device 101 can determine the data storage location and acquire the initial corpus data corresponding to the k sample labels of the model to be trained from that location. Further, the computer device 101 can acquire candidate corpus data. This candidate corpus data is not data provided for the model to be trained; that is, it is acquired data that initially has no association with the model to be trained. The computer device 101 learns from the existing training samples of the model to be trained, namely the initial corpus data corresponding to the k sample labels, to obtain the candidate corpus labels that the candidate corpus data corresponds to in the model to be trained. In this way, candidate corpus data that was unrelated to the model to be trained becomes associated with it and can be used as training samples for the model to be trained. Furthermore, the training samples are expanded on the basis of the existing training samples, increasing the number of training samples used to train the model to be trained and thereby improving the accuracy and generalization of model training.

[0090] This application may involve machine learning technology in the field of artificial intelligence, such as expanding the training samples of a model and training the model.

[0091] Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that utilize digital computers or computer-controlled machines to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess perception, reasoning, and decision-making capabilities.

[0092] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, as well as machine learning / deep learning, autonomous driving, and intelligent transportation.

[0093] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and instructional learning.

[0094] With the research and advancement of artificial intelligence (AI) technology, AI is being studied and applied in various fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, drones, robots, smart healthcare, smart customer service, vehicle networking, and intelligent transportation. It is believed that with the development of technology, AI will be applied in more fields and play an increasingly important role.

[0095] For details, please see Figure 2, which is a schematic diagram illustrating a corpus tag acquisition scenario provided in an embodiment of this application. As shown in Figure 2, the computer device can acquire k sample labels of the model 201 to be trained, such as sample label 1, sample label 2, ..., and sample label k, where k is a positive integer, and acquire the initial corpus data corresponding to each sample label. The number of initial corpus data corresponding to each sample label can be zero, one, or at least two; there is no restriction here. That is, the number of initial corpus data corresponding to each sample label can be denoted h, where h is a natural number (i.e., 0, 1, or at least two), and this number can be the same or different across sample labels. Optionally, the computer device may directly acquire f initial corpus data, where f is a positive integer. In this case, the computer device can acquire the sample label corresponding to each initial corpus data and, based on the correspondence between sample labels and initial corpus data, determine the initial corpus data corresponding to each of the k sample labels. Further, the computer device can encode the initial corpus data corresponding to each sample label to obtain the label corpus vectors corresponding to the k sample labels. Candidate corpus data is acquired and encoded into candidate corpus vectors. Using the label corpus vectors corresponding to the k sample labels, the candidate corpus vectors are mapped by prediction onto the k sample labels of the model 201 to be trained, thus obtaining candidate corpus labels for the candidate corpus data. Each candidate corpus label can be one of the k sample labels. Further, the computer device can determine the candidate corpus labels and candidate corpus data as training samples for the model 201. Through this process, the candidate corpus data can be associated with the model 201 to be trained, thereby expanding the training samples for the model 201 and improving the accuracy and efficiency of sample expansion.

[0096] It is understood that the computer equipment mentioned in the embodiments of this application includes, but is not limited to, terminal devices or servers. In other words, the computer equipment can be a server or a terminal device, or a system composed of a server and a terminal device. The terminal device mentioned above can be an electronic device, including but not limited to mobile phones, tablets, desktop computers, laptops, handheld computers, in-vehicle devices, augmented reality / virtual reality (AR / VR) devices, head-mounted displays, smart TVs, wearable devices, smart speakers, digital cameras, webcams, and other mobile internet devices (MIDs) with network access capabilities, or terminal devices in scenarios such as trains, ships, and flights. The server mentioned above can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, vehicle-to-everything (V2X) communication, content delivery networks (CDNs), and big data and artificial intelligence platforms.

[0097] Optionally, the data involved in the embodiments of this application may be stored in a computer device, or may be stored based on cloud storage technology or blockchain technology, without limitation.

[0098] Further, please see Figure 3, which is a flowchart illustrating a method for obtaining corpus tags according to an embodiment of this application. As shown in Figure 3, the corpus tag acquisition process includes the following steps:

[0099] Step S301: Obtain k sample labels of the model to be trained, and the initial corpus data corresponding to each sample label. Encode the initial corpus data corresponding to each sample label to obtain the label corpus vectors corresponding to the k sample labels respectively.

[0100] In this embodiment, the computer device can acquire k sample labels of the model to be trained, and obtain the initial corpus data corresponding to each sample label based on the k sample labels. Alternatively, the computer device can acquire f initial corpus data, obtain the sample label corresponding to each initial corpus data, and group the f initial corpus data by sample label to obtain the initial corpus data corresponding to each of the k sample labels. Here, the k sample labels refer to the labels associated with the model to be trained; that is, after the model is trained, it is used to predict the probability that the data to be predicted corresponds to each of the k sample labels. For example, inputting data 1 into the trained model can yield the probability of data 1 corresponding to each of the k sample labels.
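The alternative path in this step, in which f initial corpus data are organized by their sample labels, can be sketched as a simple grouping. The function name `group_by_label` and the example texts are hypothetical.

```python
from collections import defaultdict

def group_by_label(corpus_items):
    """Map each sample label to the list of initial corpus texts that carry it."""
    grouped = defaultdict(list)
    for text, label in corpus_items:
        grouped[label].append(text)
    return dict(grouped)

items = [("will it rain today", "weather"),
         ("play some jazz", "music"),
         ("is it sunny", "weather")]
grouped = group_by_label(items)
print(len(grouped))        # 2, i.e. k = 2 sample labels from f = 3 items
print(grouped["weather"])  # ['will it rain today', 'is it sunny']
```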

[0101] Optionally, the k sample labels of the model to be trained and the initial corpus data corresponding to each sample label can be provided by a target object, that is, an object with the authority to provide samples for the model to be trained, such as a user. For example, the computer device can provide a corpus management page from which the k sample labels and the corresponding initial corpus data are obtained. Specifically, the computer device can respond to a corpus addition request for the model to be trained and display the corpus management page associated with that model. The computer device can then obtain the k sample labels submitted on the corpus management page, output the k sample labels on the page, and obtain the initial corpus data submitted on the page for each sample label. Alternatively, the computer device can obtain f initial corpus data submitted on the corpus management page, together with the sample label of each initial corpus data, and cluster the f initial corpus data based on the sample labels to obtain the k sample labels and the initial corpus data corresponding to each sample label. The number of initial corpus data corresponding to each sample label can be h, where h is a natural number (that is, 0, 1, or at least two), and this number can be the same or different for different sample labels. Optionally, when obtaining the k sample labels and the corresponding initial corpus data, the computer device can match the sample labels against the initial corpus data. If any initial corpus data does not match its corresponding sample label, a sample exception message is output, indicating that the initial corpus data is abnormal because it does not match its sample label.

[0102] For example, refer to Figure 4, which is a schematic diagram of a corpus acquisition scenario provided in an embodiment of this application. As shown in Figure 4, the computer device can respond to a corpus addition request for the model to be trained and display a corpus management page 401 associated with the model. The target object can enter initial corpus data and the sample label of each initial corpus data on the corpus management page 401. In Figure 4, for example, the initial corpus data "Which day this month is suitable for moving to a new house" corresponds to sample label 1, and the initial corpus data "Can I travel next week?" corresponds to sample label 2. The computer device can then respond to a data submission request on the corpus management page 401, obtain the f initial corpus data submitted on the page together with the sample label of each initial corpus data, and cluster the f initial corpus data based on the sample labels to obtain the k sample labels and the initial corpus data corresponding to each sample label. Optionally, the computer device can respond to a corpus addition request for the model to be trained, display the corpus management page 401 associated with the model, obtain the k sample labels submitted on the corpus management page 401 or the k sample labels associated with the model to be trained, output the k sample labels on the corpus management page 401, and obtain the initial corpus data corresponding to each sample label based on the k sample labels. In this mode, the initial corpus data can be grouped and displayed on the corpus management page 401 by sample label, so that all initial corpus data in a group share the same sample label.

[0103] Optionally, the label corpus vector of the i-th sample label, where i is a positive integer less than or equal to k, is determined according to the number of initial corpus data corresponding to that sample label. If the number is 0, a default corpus vector can be determined as the label corpus vector corresponding to the i-th sample label; this default corpus vector can be an empty vector, an invalid corpus vector, or the like, without restriction. If the number is 1, the initial corpus data corresponding to the i-th sample label is encoded to obtain its initial corpus vector, and this initial corpus vector is determined as the label corpus vector corresponding to the i-th sample label. If the number is at least two, each of the at least two initial corpus data corresponding to the i-th sample label is encoded separately to obtain the corresponding initial corpus vectors, and vector fusion is performed on these initial corpus vectors to obtain the label corpus vector corresponding to the i-th sample label. This process is repeated until the label corpus vectors corresponding to the k sample labels are obtained.

[0104] To perform the vector fusion, the computer device can compute a weighted sum of the at least two initial corpus vectors corresponding to the i-th sample label; or take their average vector; or take their center vector; or delete abnormal corpus vectors from them and perform vector fusion on the remaining normal corpus vectors. In each case, the result is the label corpus vector corresponding to the i-th sample label. Here, an abnormal corpus vector is an initial corpus vector to which only a small share of the at least two initial corpus vectors are similar: the number of initial corpus vectors whose similarity to the abnormal corpus vector is greater than or equal to the corpus similarity threshold is less than or equal to the number given by the similarity weight threshold. For example, if there are 10 initial corpus vectors and the similarity weight threshold is 20%, then at most 2 of them have a similarity to the abnormal corpus vector that is greater than or equal to the corpus similarity threshold. Because the corpus vector of a piece of corpus data is correlated with the sample label of that corpus data, it represents the characteristics of the sample label to some extent. Therefore, the initial corpus data corresponding to the same sample label, and the initial corpus vectors converted from them, are similar to one another to some degree. Based on this property, anomaly screening can be performed on the at least two initial corpus vectors corresponding to a sample label, thereby improving the accuracy of the data.
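The anomaly screening followed by average fusion can be sketched as follows. This is only an illustration under stated assumptions: the similarity measure is taken to be cosine similarity (the text does not name one), a vector is not counted as similar to itself, and the function names and default thresholds are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fuse_label_vectors(vectors, sim_threshold=0.8, weight_threshold=0.2):
    """Drop abnormal vectors, then average the remaining normal ones."""
    limit = weight_threshold * len(vectors)  # e.g. 20% of 10 vectors = 2
    normal = []
    for i, v in enumerate(vectors):
        # Count the other vectors that are similar enough to v.
        similar = sum(
            1 for j, w in enumerate(vectors)
            if j != i and cosine(v, w) >= sim_threshold
        )
        if similar > limit:  # abnormal vectors have similar <= limit
            normal.append(v)
    # Assumption: fall back to all vectors if screening removed everything.
    return mean_vector(normal or vectors)

# Four mutually similar vectors and one outlier, all for the same label.
vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.95, 0.0], [0.0, 1.0]]
fused = fuse_label_vectors(vectors)
```

Here the outlier `[0.0, 1.0]` has no sufficiently similar neighbours, so it is excluded before averaging.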

[0105] Optionally, the computer device can encode the initial corpus data corresponding to each sample label using the encoding layer in the corpus matching model, obtaining label corpus vectors corresponding to k sample labels respectively. Here, the encoding layer in the corpus matching model refers to the trained encoding layer.

[0106] Step S302: Obtain candidate corpus data and encode the candidate corpus data into a candidate corpus vector.

[0107] In this embodiment, the computer device can acquire candidate corpus data and encode it into candidate corpus vectors through the encoding layer in the corpus matching model. Optionally, the computer device can acquire candidate corpus data from a corpus or similar source. This corpus can be a publicly available database or a database associated with the application hosting the model to be trained; no limitation is imposed here. Alternatively, the computer device can also acquire candidate corpus data from the Internet.

[0108] Step S303: Obtain the vector similarity between each of the k label corpus vectors and the candidate corpus vector, and integrate the k label corpus vectors based on the vector similarities to obtain the candidate prediction result vector corresponding to the candidate corpus data.

[0109] In this embodiment, the computer device can predict the candidate corpus vector based on the k label corpus vectors to obtain the candidate prediction result vector corresponding to the candidate corpus data. Specifically, the computer device can obtain the vector similarity between each of the k label corpus vectors and the candidate corpus vector, and integrate the k label corpus vectors based on these vector similarities to obtain the candidate prediction result vector. Optionally, both operations can be performed by the feature fusion layer in the corpus matching model.

[0110] Here, it is assumed that each label corpus vector has dimension 1×c and the candidate corpus vector also has dimension 1×c, where c is a positive integer.

[0111] In method ① of determining the candidate prediction result vector, the computer device can obtain the vector similarity between each of the k label corpus vectors and the candidate corpus vector and, based on these vector similarities, compute a weighted sum of the k label corpus vectors to obtain the candidate prediction result vector corresponding to the candidate corpus data. Optionally, the vector obtained by the weighted sum of the k label corpus vectors can be denoted as the weighted vector, and the fully connected layer in the corpus matching model can convert the weighted vector into the candidate prediction result vector corresponding to the candidate corpus data.
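Method ① can be illustrated with the sketch below, which computes the similarities and the weighted vector only. It assumes cosine similarity as the vector similarity and uses the raw similarities as weights; the optional fully connected layer is omitted because its parameters belong to the trained corpus matching model. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_sum_prediction(label_vectors, candidate):
    """Return (similarities, weighted vector) for k label vectors of length c."""
    sims = [cosine(v, candidate) for v in label_vectors]
    weighted = [0.0] * len(candidate)
    for s, v in zip(sims, label_vectors):
        for d, value in enumerate(v):
            weighted[d] += s * value  # similarity acts as the weight
    return sims, weighted

# k = 2 label corpus vectors, c = 2 dimensions.
sims, weighted = weighted_sum_prediction([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```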

[0112] Alternatively, in method ② of determining the candidate prediction result vector, the computer device can concatenate the k label corpus vectors to obtain a label encoding matrix, which can be regarded as a k×c matrix. The computer device can multiply the label encoding matrix by the transpose of the candidate corpus vector to obtain the candidate prediction result vector corresponding to the candidate corpus data; in this case, the dimension of the candidate prediction result vector is k×1. Alternatively, the candidate corpus vector can be multiplied by the transpose of the label encoding matrix, in which case the dimension of the candidate prediction result vector is 1×k. Optionally, the vector obtained by fusing the candidate corpus vector with the label encoding matrix can be denoted as the initial category vector, and this initial category vector can be normalized to obtain the candidate prediction result vector. Multiplying the label encoding matrix by the transpose of the candidate corpus vector comprises two parts: vector multiplication and parameter addition. The vector multiplication can be regarded as obtaining the vector similarity between each of the k label corpus vectors and the candidate corpus vector, and the parameter addition can be regarded as integrating the k label corpus vectors based on the vector similarities. Likewise, multiplying the candidate corpus vector by the transpose of the label encoding matrix can be regarded as comprising vector multiplication and parameter addition.
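A small worked instance of method ② follows. It multiplies a k×c label encoding matrix by a length-c candidate vector to get the initial category vector, then normalizes it. Softmax is used here as one possible normalization; the text does not specify which normalization is applied, and the function names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a k x c matrix by a length-c vector, giving k scores."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax (assumed normalization)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

label_matrix = [[1.0, 0.0, 0.0],   # label corpus vector of sample label 1
                [0.0, 1.0, 0.0]]   # label corpus vector of sample label 2
candidate = [2.0, 1.0, 0.0]        # candidate corpus vector, c = 3

initial_category = matvec(label_matrix, candidate)  # one score per label
prediction = softmax(initial_category)              # values sum to 1
```

Each entry of `initial_category` is the element-wise product of one label row with the candidate, followed by a sum, which corresponds to the two parts (multiplication, then addition) described in the paragraph above.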

[0113] Alternatively, in method ③ of determining the candidate prediction result vector, the computer device can obtain the vector similarity between each of the k label corpus vectors and the candidate corpus vector, and form the candidate prediction result vector corresponding to the candidate corpus data from the k vector similarity values. Optionally, the k vector similarity values can be normalized, and the normalized values can be used to form the candidate prediction result vector.
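Method ③ can be sketched as below, assuming cosine similarity and a simple normalization that clips negative similarities to zero and divides by the sum. Both choices are assumptions made for illustration, as is the function name `similarity_prediction`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_prediction(label_vectors, candidate):
    """Build a length-k prediction vector directly from the k similarities."""
    sims = [max(cosine(v, candidate), 0.0) for v in label_vectors]
    total = sum(sims)
    if total == 0.0:
        # No label is similar at all: fall back to a uniform vector.
        return [1.0 / len(sims)] * len(sims)
    return [s / total for s in sims]

prediction = similarity_prediction([[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0])
```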

[0114] Optionally, the computer device can acquire the label vector corresponding to each of the k sample labels; fuse the label corpus vector and the label vector associated with the same sample label to obtain the optimized corpus vector corresponding to each sample label; and acquire the vector similarity between each of the k optimized corpus vectors and the candidate corpus vector. The computer device can then integrate the k label corpus vectors based on the vector similarities to obtain the candidate prediction result vector corresponding to the candidate corpus data, or integrate the k optimized corpus vectors based on the vector similarities to obtain that vector, without limitation. For the process of acquiring the vector similarities of the k optimized corpus vectors and integrating the k optimized corpus vectors, refer to the above process of predicting the candidate corpus data based on the k label corpus vectors. For example, in method ①, the candidate prediction result vector can be obtained by a weighted sum of the k optimized corpus vectors based on their vector similarities. In method ②, the k optimized corpus vectors can be concatenated to obtain a label encoding matrix, which is then fused with the candidate corpus vector to obtain the candidate prediction result vector; the fusion can consist of multiplying the label encoding matrix by the transpose of the candidate corpus vector, or multiplying the candidate corpus vector by the transpose of the label encoding matrix. In method ③, the candidate prediction result vector can be formed from the vector similarities of the k optimized corpus vectors. Further details are not repeated here.
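The fusion of a label corpus vector with its label vector is not specified further in the text. One simple possibility, shown here purely as an assumption, is a convex combination controlled by a hypothetical mixing weight `alpha`; it also assumes both vectors have the same dimension.

```python
def optimize_corpus_vectors(label_corpus_vectors, label_vectors, alpha=0.5):
    """Blend each label corpus vector with the label vector of the same label."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(corpus_vec, label_vec)]
        for corpus_vec, label_vec in zip(label_corpus_vectors, label_vectors)
    ]

# One sample label: corpus-derived vector and label-derived vector.
optimized = optimize_corpus_vectors([[1.0, 0.0]], [[0.0, 1.0]])
```

The resulting optimized corpus vectors can then be passed to any of methods ①, ②, or ③ in place of the label corpus vectors.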

[0115] Since the corpus matching model is based on model parameters trained with k sample labels, the model parameters in the encoding layer and feature fusion layer of the corpus matching model can all be associated with the k sample labels. That is, the vector obtained by the corpus matching model can be considered to represent the corresponding sample label. For example, the similarity between the vectors obtained by the encoding layer of corpus data with the same sample label is relatively high. Therefore, based on the label corpus vectors corresponding to the k sample labels or the optimized corpus vectors, the candidate corpus data can be predicted to map the candidate corpus vectors to the k sample labels, thereby improving the efficiency of corpus expansion and the accuracy of corpus label acquisition.

[0116] Step S304: Determine the candidate corpus labels of the candidate corpus data based on the candidate prediction result vector, and use the candidate corpus labels and candidate corpus data as training samples for the model to be trained.

[0117] In this embodiment, the candidate prediction result vector may include k candidate prediction categories and the prediction probability corresponding to each candidate prediction category. The candidate prediction result vector obtained through step S303 can be considered a 1*k or k*1 vector, where k corresponds to k candidate prediction categories, and the numerical value in the candidate prediction result vector represents the prediction probability of the candidate prediction category at the given position in the vector. The k candidate prediction categories and k sample labels are in one-to-one correspondence; that is, the k candidate prediction categories can be considered as k sample labels. Further, the computer device can obtain the candidate prediction category with the highest prediction probability from the k candidate prediction categories and determine it as the candidate corpus label for the candidate corpus data. The candidate corpus data and candidate corpus labels are then used as training samples for the model to be trained. Optionally, the computer device can also obtain the corpus selection threshold. If the prediction probability corresponding to the candidate corpus label is greater than or equal to the corpus selection threshold, the candidate corpus label and candidate corpus data are determined as training samples of the model to be trained. This removes candidate corpus data with unclear semantics or ambiguous prediction results, so that the obtained training samples can better represent the k sample labels corresponding to the model to be trained, thereby improving the training accuracy of the model to be trained and thus improving the accuracy and efficiency of corpus expansion.
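The label selection in step S304, including the optional corpus selection threshold, might look like the following sketch. The function name `pick_label` and the default threshold of 0.6 are illustrative assumptions.

```python
def pick_label(prediction, labels, selection_threshold=0.6):
    """Return (label, probability) for the most likely category, or None
    if its probability is below the corpus selection threshold."""
    best = max(range(len(prediction)), key=prediction.__getitem__)
    if prediction[best] >= selection_threshold:
        return labels[best], prediction[best]
    return None  # ambiguous candidate: not used as a training sample

labels = ["label_1", "label_2", "label_3"]
confident = pick_label([0.1, 0.7, 0.2], labels)
ambiguous = pick_label([0.4, 0.35, 0.25], labels)
```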

[0118] Optionally, when expanding the corpus of the model to be trained, generally not only one candidate corpus data is obtained; that is, the number of candidate corpus data can be considered to be N, where N is a positive integer. Through the above process, candidate prediction result vectors corresponding to the N candidate corpus data can be obtained. Further, the computer device can determine the candidate corpus labels corresponding to the N candidate corpus data based on the candidate prediction result vectors corresponding to the N candidate corpus data. Based on the initial corpus data corresponding to each sample label in the model to be trained, the number of samples to be expanded is determined; if N is less than or equal to the number of samples to be expanded, then the N candidate corpus data and the candidate corpus labels corresponding to the N candidate corpus data are determined as the training samples of the model to be trained.

[0119] If N is greater than the number of samples to be expanded, the corpus confidence level of each of the N candidate corpus data is determined based on the corresponding candidate prediction result vectors. The computer device can obtain the numerical distribution state of each candidate prediction result vector and determine, from that state, the corpus confidence level of the corresponding candidate corpus data. The numerical distribution state can be regarded as the distribution of the prediction probabilities corresponding to the k candidate prediction categories. It can include the differences between the values in the candidate prediction result vector, and the mean or center value of these differences can be determined as the corpus confidence level. Alternatively, the maximum value in the candidate prediction result vector can be determined as the corpus confidence level of the corresponding candidate corpus data. Further, based on the corpus confidence levels, training samples for the model to be trained are obtained from the N candidate corpus data and their corresponding candidate corpus labels.
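The two confidence measures and the selection step can be sketched as follows. Keeping the highest-confidence candidates up to the number of samples to be expanded is an assumption; the text only states that samples are obtained "based on the corpus confidence level". The names `budget`, `max_confidence`, `spread_confidence`, and `select_samples` are hypothetical.

```python
from itertools import combinations

def max_confidence(prediction):
    """Confidence as the largest prediction probability."""
    return max(prediction)

def spread_confidence(prediction):
    """Confidence as the mean absolute pairwise difference of the values."""
    diffs = [abs(a - b) for a, b in combinations(prediction, 2)]
    return sum(diffs) / len(diffs) if diffs else 0.0

def select_samples(candidates, predictions, labels, budget,
                   confidence=max_confidence):
    """Label every candidate; keep all if N <= budget, else the top `budget`."""
    scored = []
    for text, pred in zip(candidates, predictions):
        best = max(range(len(pred)), key=pred.__getitem__)
        scored.append((confidence(pred), text, labels[best]))
    if len(scored) > budget:
        scored.sort(key=lambda item: item[0], reverse=True)
        scored = scored[:budget]
    return [(text, label) for _, text, label in scored]

samples = select_samples(
    ["x", "y", "z"],
    [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]],
    labels=["A", "B"],
    budget=2,
)
```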

[0120] This method ensures a sufficient number of training samples while improving the accuracy of the training samples used to train the model. Specifically, when determining candidate corpus labels based on the candidate prediction result vectors corresponding to N candidate corpus data, taking the j-th candidate corpus data as an example, the candidate prediction result vector corresponding to the j-th candidate corpus data includes the prediction probabilities of k candidate prediction categories corresponding to the j-th candidate corpus data. The candidate prediction category with the highest prediction probability corresponding to the j-th candidate corpus data is determined as the candidate corpus label for the j-th candidate corpus data.

[0121] Optionally, the computer device can determine the candidate corpus labels corresponding to each of the N candidate corpus data based on the candidate prediction result vectors corresponding to each candidate corpus data. The N candidate corpus data, and the candidate corpus label corresponding to each candidate corpus data, can be used as training samples for the model to be trained.

[0122] Refer to Figure 5, which is a schematic diagram of a model prediction scenario provided in an embodiment of this application. As shown in Figure 5, the computer device can acquire an initial corpus set and a candidate corpus set. The initial corpus set includes the k sample labels and the initial corpus data corresponding to each sample label. The candidate corpus set includes the candidate corpus data to be expanded. The computer device predicts each candidate corpus data in the candidate corpus set using the initial corpus set. Optionally, assuming the candidate corpus set includes N candidate corpus data, N prediction data pairs can be formed from the initial corpus set and the candidate corpus set, where each prediction data pair includes one candidate corpus data from the candidate corpus set and every initial corpus data in the initial corpus set. Taking one prediction data pair as an example, the computer device can use the encoding layer 501 in the corpus matching model to encode the initial corpus data corresponding to each sample label, obtaining the label corpus vectors corresponding to the k sample labels, and to encode the candidate corpus data, obtaining the candidate corpus vector. In the feature fusion layer 502 of the corpus matching model, the candidate corpus vector is predicted based on the label corpus vectors to obtain the candidate corpus label of the candidate corpus data. For details of this process, refer to the specific description of the steps shown in Figure 3 above (through step S304), which is not repeated here.

[0123] Furthermore, after expanding the training samples, the computer equipment can train the model to be trained based on the training samples to obtain the target model. The training samples of the model to be trained can be considered to include training sample data and training sample labels; the training sample data includes candidate corpus data and initial corpus data corresponding to each sample label; the training sample labels include candidate corpus labels corresponding to the candidate corpus data and sample labels corresponding to the initial corpus data. The computer equipment can input the training sample data into the model to be trained for prediction, obtaining the sample prediction results corresponding to the training sample data; a first loss function is generated based on the sample prediction results and training sample labels; and the parameters of the model to be trained are adjusted based on the first loss function to obtain the target model.
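The training step can be illustrated with a toy example. The sketch below assumes that the model to be trained is a linear softmax classifier over already-encoded feature vectors and that the first loss function is cross-entropy; neither is stated in the text, and the data, learning rate, and function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, features):
    """Probability of each label for one feature vector."""
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])

def cross_entropy(weights, data):
    """Mean negative log-probability of the true labels (assumed first loss)."""
    return -sum(math.log(predict(weights, x)[y]) for x, y in data) / len(data)

def train(data, num_labels, num_features, lr=0.5, epochs=50):
    weights = [[0.0] * num_features for _ in range(num_labels)]
    for _ in range(epochs):
        for x, y in data:
            probs = predict(weights, x)
            for label in range(num_labels):
                # Gradient of the loss with respect to this label's score.
                grad = probs[label] - (1.0 if label == y else 0.0)
                for d in range(num_features):
                    weights[label][d] -= lr * grad * x[d]
    return weights

# Seed samples plus expanded samples, already encoded; labels are indices.
data = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
initial_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], data)
weights = train(data, num_labels=2, num_features=2)
final_loss = cross_entropy(weights, data)
```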

[0124] In this embodiment, k sample labels of the model to be trained and the initial corpus data corresponding to each sample label are obtained. The initial corpus data corresponding to each sample label is encoded to obtain the label corpus vectors corresponding to the k sample labels respectively; k is a positive integer. Candidate corpus data is obtained and encoded into candidate corpus vectors. The vector similarity between the k label corpus vectors and the candidate corpus vectors is obtained. The k label corpus vectors are integrated based on the vector similarity to obtain the candidate prediction result vectors corresponding to the candidate corpus data. The candidate corpus labels of the candidate corpus data are determined based on the candidate prediction result vectors. The candidate corpus labels and candidate corpus data are determined as training samples of the model to be trained. Through the above process, the training samples are expanded using the acquired seed corpus (i.e., the initial corpus data corresponding to each sample label), resulting in candidate corpus data. The labels of the seed corpus are then used to predict the labels of the candidate corpus data, thus determining the training samples for model training from the candidate corpus data. This expands the model's training samples to obtain the maximum number of available training samples, improving training accuracy. Simultaneously, the correspondence between sample labels and initial corpus data is used to establish a relationship with the candidate corpus data, obtaining candidate corpus labels. This maps the candidate corpus data to sample labels, further improving the accuracy and efficiency of corpus expansion.

[0125] The target model trained through the above process can be used for data parsing. For example, a computer device can receive a data parsing request for the data to be parsed, input the data into the target model for prediction to obtain a first parsing result, and determine this first parsing result as the target parsing result for the data to be parsed. Optionally, other parsing methods can also be combined to parse the data. For details, refer to Figure 6, which is a flowchart of a data parsing method provided in an embodiment of this application. As shown in Figure 6, the method includes the following steps:

[0126] Step S601: Receive a data parsing request for the data to be parsed, parse the data to be parsed, and obtain the first parsing result.

[0127] In this embodiment of the application, the computer device can receive a data parsing request for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed.

[0128] Step S602: The candidate parsing method is used to parse the data to be parsed, and the candidate parsing results of the data to be parsed are obtained.

[0129] In this embodiment of the application, the computer device can use a candidate parsing method to parse the data to be parsed, and obtain candidate parsing results for the data to be parsed. The candidate parsing method may include, but is not limited to, any one or any combination of data matching methods, template matching methods, and key parsing methods.

[0130] In one scenario, the candidate parsing method includes a data matching method. The computer device can acquire historical parsing data and corresponding historical parsing results, match the data to be parsed with the historical parsing data, and obtain the data matching degree between the data to be parsed and the historical parsing data. Based on the data matching degree and the historical parsing results, a second parsing result for the data to be parsed is determined. Optionally, historical parsing results with a data matching degree greater than or equal to the parsing determination threshold can be determined as the second parsing result for the data to be parsed; if no historical parsing results with a data matching degree greater than or equal to the parsing determination threshold exist, a default parsing result can be determined as the second parsing result for the data to be parsed. This default parsing result can be a pre-set parsing result or an invalid result that has no practical meaning, i.e., does not affect the final parsing result of the data to be parsed, etc., without restriction. In this case, the candidate parsing result includes the second parsing result.
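The data matching method can be sketched with the standard library's `difflib.SequenceMatcher` as the matching-degree function. This choice, the threshold of 0.8, returning the best match above the threshold, and the use of `None` as the default (invalid) result are all assumptions for illustration.

```python
from difflib import SequenceMatcher

DEFAULT_RESULT = None  # assumed "invalid" default parsing result

def data_match(query, history, threshold=0.8):
    """Return the historical result whose text best matches `query`,
    provided the matching degree reaches `threshold`."""
    best_score, best_result = 0.0, DEFAULT_RESULT
    for text, result in history:
        score = SequenceMatcher(None, query.lower(), text.lower()).ratio()
        if score >= threshold and score > best_score:
            best_score, best_result = score, result
    return best_result

# Hypothetical historical parsing data and results.
history = [("can i travel next week", "travel_query")]
second_result = data_match("Can I travel next week", history)
no_match = data_match("zzz", history)
```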

[0131] In one scenario, the candidate parsing method includes template matching. The computer device can acquire corpus templates, retrieve a target corpus template that matches the data to be parsed, and determine the template parsing result corresponding to the target corpus template as the third parsing result for the data to be parsed. In this case, the candidate parsing result includes the third parsing result.
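A template matching step could be sketched with regular expressions standing in for corpus templates. The two templates, the result names, and the captured slot names below are hypothetical examples built from the sample sentences mentioned earlier, not templates defined by the application.

```python
import re

# Each corpus template pairs a pattern with its template parsing result.
TEMPLATES = [
    (re.compile(r"^which day .* is suitable for (?P<action>.+)$", re.I),
     "auspicious_day_query"),
    (re.compile(r"^can i (?P<action>.+?) (?P<time>next week|tomorrow|today)\??$",
                re.I),
     "activity_feasibility_query"),
]

def template_match(text):
    """Return (template result, captured slots) for the first matching template."""
    for pattern, result in TEMPLATES:
        match = pattern.match(text.strip())
        if match:
            return result, match.groupdict()
    return None  # no target corpus template found

third_result = template_match("Can I travel next week?")
```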

[0132] In one scenario, the candidate parsing method includes a key parsing method. The computer device can extract the key information to be parsed from the data to be parsed using a key information extraction model, perform semantic analysis on the key information to be parsed, and determine the fourth parsing result for the data to be parsed.

[0133] Specifically, the computer device can input the training sample data into the initial key information extraction model for prediction to obtain the predicted key information corresponding to the training sample data; determine the key prediction result corresponding to the training sample data based on the predicted key information; generate a second loss function based on the key prediction result and the training sample labels; and adjust the parameters of the initial key information extraction model based on the second loss function to obtain the key information extraction model. Optionally, as shown in Figure 4, the computer device can obtain key semantics through the corpus management page 401. The key semantics refer to the semantic types of the key information that needs to be extracted from the training sample data, including but not limited to time and action. When training the key information extraction model based on the training sample data, the computer device can introduce these key semantics into the initial key information extraction model for training, thereby obtaining the key information extraction model.

[0134] In one scenario, the candidate parsing method may include both data matching and template matching. In this case, the candidate parsing result includes a second parsing result and a third parsing result.

[0135] In one scenario, the candidate parsing method may include a data matching method and a key parsing method. In this case, the candidate parsing result includes a second parsing result and a fourth parsing result.

[0136] In one scenario, the candidate parsing method may include template matching and key parsing. In this case, the candidate parsing result includes a third parsing result and a fourth parsing result.

[0137] In one scenario, the candidate parsing method may include data matching, template matching, and key parsing. In this case, the candidate parsing results may include second, third, and fourth parsing results, etc.

[0138] Here, the second parsing result is the parsing result corresponding to the data matching method, the third parsing result is the parsing result corresponding to the template matching method, and the fourth parsing result is the parsing result corresponding to the key parsing method. For details, refer to the determination process of each parsing result described above, which is not repeated here.

[0139] Step S603: Integrate the first parsing result and the candidate parsing results to obtain the target parsing result of the data to be parsed.

[0140] In this embodiment, if the candidate parsing method includes the data matching method, the first parsing result and the second parsing result are integrated to obtain the target parsing result of the data to be parsed. If the candidate parsing method includes the template matching method, the first parsing result and the third parsing result are integrated to obtain the target parsing result. If the candidate parsing method includes the key parsing method, the first parsing result and the fourth parsing result are integrated to obtain the target parsing result. If the candidate parsing method includes both the data matching method and the template matching method, the first, second, and third parsing results are integrated to obtain the target parsing result. If the candidate parsing method includes both the data matching method and the key parsing method, the first, second, and fourth parsing results are integrated to obtain the target parsing result. If the candidate parsing method includes both the template matching method and the key parsing method, the first, third, and fourth parsing results are integrated to obtain the target parsing result. If the candidate parsing method includes the data matching method, the template matching method, and the key parsing method, the first, second, third, and fourth parsing results are integrated to obtain the target parsing result of the data to be parsed.

[0141] For example, see Figure 7, which is a schematic diagram of a data parsing scenario provided in an embodiment of this application. As shown in Figure 7, taking the case where the candidate parsing method includes data matching, template matching, and key parsing as an example, the scenario comprises a model training section 701 and a data parsing section 702. In the model training section 701, the computer device can use the steps shown in Figure 3 above to expand the corpus based on the seed, obtaining the expanded corpus (i.e., the candidate corpus data). The seed refers to the initial corpus data obtained. The model to be trained is trained on the seed and the expanded corpus to obtain the target model corresponding to the model to be trained. An initial key information extraction model is also trained on the seed and the expanded corpus to obtain the key information extraction model. In the data parsing section 702, the computer device can input the data to be parsed into the target model for prediction, obtaining the first parsing result; input the data to be parsed into the key information extraction model for prediction, obtaining the fourth parsing result; parse the data to be parsed by data matching, obtaining the second parsing result; and parse the data to be parsed by template matching, obtaining the third parsing result. The first, second, third, and fourth parsing results are then integrated to obtain the target parsing result of the data to be parsed.

[0142] In this embodiment, the expanded training samples can be used to train the model through the above process, yielding the target model corresponding to the model to be trained, and data parsing is performed based on the target model. When only the data matching and template matching methods are used for data parsing, the precision (P) is high but the recall (R) is low, leading to a low final F-measure. This solution improves the generalization ability of the target model and increases the recall, thereby improving the overall F-measure. Higher precision and higher recall each indicate a better method; however, the two metrics can conflict, with one rising as the other falls. The F-measure therefore combines both metrics into a single value, so that a high F-measure requires both precision and recall to be reasonably high. Specifically, see Table 1, which shows the effect of the parsing methods:

[0143] Table 1

[0144]

[0145] The benchmark refers to the combination of the data matching method and the template matching method. As shown in Table 1, the F-measure of this scheme is substantially higher than that of the benchmark scheme.
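As a non-limiting illustration of how the F-measure combines precision and recall, the following sketch computes the F-beta score, which reduces to the harmonic mean of P and R when beta is 1. The source does not state which F-measure variant was used for Table 1, so the `beta` parameter and the sample numbers below are assumptions for illustration only and are not taken from Table 1.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Illustrative numbers only (not taken from Table 1):
# high precision with low recall still yields a modest F-measure.
high_p_low_r = f_measure(0.9, 0.3)
balanced = f_measure(0.8, 0.7)
```

With these assumed inputs, the high-precision, low-recall case scores lower than the balanced case, which matches the reasoning given for the benchmark above.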

[0146] Further, see Figure 8, which is a flowchart of a model training method provided in an embodiment of this application. As shown in Figure 8, the process includes the following steps:

[0147] Step S801: Obtain d first corpus samples and the first corpus label of each first corpus sample; obtain second corpus samples and the second corpus label of each second corpus sample.

[0148] In this embodiment, refer to the specific description of step S301 in Figure 3 above, which is not repeated here.

[0149] Step S802: Based on the encoding layer in the initial corpus matching model, the second corpus sample is encoded to obtain the second corpus vector corresponding to the second corpus sample.

[0150] In this embodiment, refer to the specific description of the candidate corpus vector in step S302 of Figure 3, which is not repeated here.

[0151] Step S803: Based on the encoding layer in the initial corpus matching model, the d first corpus samples are encoded respectively to obtain the first corpus vectors corresponding to the d first corpus samples respectively. The first corpus vectors corresponding to the first corpus samples with the same first corpus label are integrated to obtain the label corpus matrix.

[0152] In this embodiment, the first corpus vectors corresponding to first corpus samples with the same first corpus label can be integrated to obtain a first label sample vector for that first corpus label, wherein one first corpus label corresponds to one first label sample vector. The first label sample vectors of the first corpus labels are concatenated to obtain the label corpus matrix. For details, refer to the specific description of the label corpus vector in step S301 of Figure 3, which is not repeated here.

[0153] Step S804: The label corpus matrix and the second corpus vector are fused through the feature fusion layer in the initial corpus matching model to obtain the initial prediction result of the second corpus sample.

[0154] In this embodiment, refer to the process of determining the candidate prediction result vector in step S303 of Figure 3, which is not described in detail here.

[0155] Step S805: Generate a third loss function based on the second corpus labels and the initial prediction results, and adjust the parameters of the initial corpus matching model based on the third loss function to generate a corpus matching model.

[0156] In this embodiment, the initial prediction result is similar to a candidate prediction result vector, essentially a vector containing the probabilities corresponding to each first corpus sample. The second corpus label may be one of the first corpus labels. A third loss function is generated based on the second corpus label and the initial prediction result, and the parameters of the initial corpus matching model are adjusted based on the third loss function to generate the corpus matching model.
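Steps S803 to S805 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the vectors stand in for the output of the encoding layer, mean pooling is assumed for integrating first corpus vectors that share a label, the feature fusion layer is approximated by a dot product followed by softmax, and the third loss function is assumed to be cross-entropy. All function names are hypothetical.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_label_matrix(first_vectors, first_labels):
    """Integrate first corpus vectors that share a label (mean pooling assumed)."""
    groups = defaultdict(list)
    for vector, label in zip(first_vectors, first_labels):
        groups[label].append(vector)
    labels = sorted(groups)
    return labels, [mean_vector(groups[label]) for label in labels]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def third_loss(label_matrix, labels, second_vector, second_label):
    """Cross-entropy between the initial prediction result and the second corpus label."""
    scores = [sum(a * b for a, b in zip(row, second_vector)) for row in label_matrix]
    probabilities = softmax(scores)
    return -math.log(probabilities[labels.index(second_label)]), probabilities


# Hypothetical encoder outputs for four first corpus samples and one second corpus sample.
first_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
first_labels = ["weather", "weather", "music", "music"]
labels, matrix = build_label_matrix(first_vectors, first_labels)
loss, probabilities = third_loss(matrix, labels, [0.9, 0.1], "weather")
```

In an actual training loop, the gradient of this loss with respect to the encoder parameters would drive the parameter adjustment; that step is omitted here.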

[0157] For details, see Figure 9, which is a schematic diagram of a model training scenario provided in an embodiment of this application. As shown in Figure 9, the computer device can acquire a first corpus set and a second corpus set. The first corpus set includes at least two first corpus samples and the first corpus label of each first corpus sample; there can be t first corpus labels, where t is a positive integer that may equal k. The second corpus set includes second corpus samples, each corresponding to a second corpus label. The computer device predicts the second corpus samples in the second corpus set using the first corpus set. Optionally, assuming the second corpus set includes s second corpus samples, s training data pairs can be formed from the first and second corpus sets, where s is a positive integer. Each training data pair includes one second corpus sample from the second corpus set and a subset of the first corpus samples from the first corpus set. For example, for one training data pair, the computer device can randomly select d first corpus samples from the first corpus set and one second corpus sample from the second corpus set. To ensure the accuracy of model training, the d first corpus samples should be selected so that they cover as many first corpus labels as possible; that is, the first corpus labels corresponding to the d first corpus samples should ideally include all first corpus labels. Then, when the second corpus sample is predicted based on the first corpus samples, the similarity calculation reflects the relationship between the second corpus sample and every first corpus label, thereby improving the accuracy of model training. Specifically, if the second corpus sample is close to the center of the first corpus samples corresponding to first corpus label 1, the second corpus sample can be considered to belong to first corpus label 1, because corpora with similar expressions generally also have similar intents (i.e., corpus labels).
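The construction of a training data pair described above can be sketched as follows. The sampling strategy shown (first one sample per label, then random fill) is one assumed way to make the d first corpus samples cover as many first corpus labels as possible; the function name and data layout are hypothetical.

```python
import random


def build_training_pair(first_set, second_set, d, rng):
    """Pick d first corpus samples that cover as many labels as possible, plus one second corpus sample.

    first_set and second_set are lists of (text, label) tuples.
    """
    if not 1 <= d <= len(first_set):
        raise ValueError("d must be between 1 and the size of the first corpus set")
    indices_by_label = {}
    for index, (_, label) in enumerate(first_set):
        indices_by_label.setdefault(label, []).append(index)
    # Take one sample per label first (for as many labels as d allows) ...
    label_order = list(indices_by_label)
    rng.shuffle(label_order)
    chosen = [rng.choice(indices_by_label[label]) for label in label_order[:d]]
    # ... then fill the remaining slots at random from the unused samples.
    remaining = [i for i in range(len(first_set)) if i not in chosen]
    chosen += rng.sample(remaining, d - len(chosen))
    support = [first_set[i] for i in chosen]
    query = rng.choice(second_set)
    return support, query


first_set = [
    ("will it rain today", "weather"), ("is it sunny", "weather"),
    ("play a song", "music"), ("put on some jazz", "music"),
    ("wake me at seven", "alarm"), ("set an alarm", "alarm"),
]
second_set = [("how hot is it outside", "weather"), ("play something calm", "music")]
support, query = build_training_pair(first_set, second_set, d=4, rng=random.Random(0))
```

Repeating this call s times, once per second corpus sample, would yield the s training data pairs mentioned above.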

[0158] Furthermore, the computer device can encode the d first corpus samples separately through the encoding layer 901 in the initial corpus matching model, obtaining the first corpus vectors corresponding to each of the d first corpus samples. The first corpus vectors corresponding to first corpus samples with the same first corpus label are integrated to obtain the label corpus matrix. The second corpus sample is then encoded through the encoding layer 901 in the initial corpus matching model to obtain the second corpus vector corresponding to the second corpus sample. In the feature fusion layer 902 of the initial corpus matching model, the second corpus vector is predicted based on the label corpus matrix to obtain the initial prediction result of the second corpus sample. A loss function is generated based on the initial prediction result and the second corpus label, and the parameters of the initial corpus matching model are adjusted to obtain the corpus matching model. For this process, refer to the specific descriptions of steps S804 to S805 in Figure 8, which are not repeated here.

[0159] Through the above model training process, the model parameters in the trained corpus matching model can be associated with the first corpus labels, which can be sample labels. As a result, the vector produced by the corpus matching model corresponds to the first corpus label and, to a certain extent, represents the label of the corresponding data. It is therefore possible to predict the labels of unknown corpus data based on existing corpus data, thereby achieving corpus expansion and improving the accuracy and efficiency of corpus label acquisition.

[0160] Further, see Figure 10, which is a schematic diagram of a corpus label acquisition device provided in an embodiment of this application. The corpus label acquisition device can be a computer program (including program code, etc.) running on a computer device; for example, the corpus label acquisition device can be application software. The device can be used to execute the corresponding steps of the method provided in the embodiments of this application. As shown in Figure 10, the corpus label acquisition device 1000 can be used in the computer device of the embodiment corresponding to Figure 3. Specifically, the device may include: an initial corpus acquisition module 11, an initial corpus encoding module 12, a candidate corpus encoding module 13, a similarity determination module 14, an initial prediction module 15, and a sample determination module 16.

[0161] The initial corpus acquisition module 11 is used to acquire k sample labels of the model to be trained, and the initial corpus data corresponding to each sample label;

[0162] The initial corpus encoding module 12 is used to encode the initial corpus data corresponding to each sample label to obtain the label corpus vectors corresponding to k sample labels respectively; k is a positive integer;

[0163] The candidate corpus encoding module 13 is used to acquire candidate corpus data and encode the candidate corpus data into candidate corpus vectors.

[0164] The similarity determination module 14 is used to obtain the vector similarity between the k label corpus vectors and the candidate corpus vectors respectively;

[0165] The initial prediction module 15 is used to integrate the k label corpus vectors based on the vector similarity to obtain the candidate prediction result vector corresponding to the candidate corpus data;

[0166] The sample determination module 16 is used to determine the candidate corpus labels of the candidate corpus data based on the candidate prediction result vector, and to determine the candidate corpus labels and candidate corpus data as training samples for the model to be trained.
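The work of the similarity determination module 14 and the initial prediction module 15 can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the vector similarity and a temperature-scaled softmax as the way the similarities are integrated into a candidate prediction result vector; the source does not fix either choice, and the function names, the `temperature` parameter, and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_prediction_vector(label_corpus_vectors, candidate_vector, temperature=0.1):
    """Return {sample label: prediction probability} for one candidate corpus vector."""
    similarities = {
        label: cosine(vector, candidate_vector)
        for label, vector in label_corpus_vectors.items()
    }
    peak = max(similarities.values())
    exps = {label: math.exp((s - peak) / temperature) for label, s in similarities.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical label corpus vectors for k = 2 sample labels and one candidate corpus vector.
label_corpus_vectors = {"set_alarm": [0.9, 0.1, 0.0], "play_music": [0.0, 0.2, 0.9]}
prediction = candidate_prediction_vector(label_corpus_vectors, [0.8, 0.3, 0.1])
```

The resulting dictionary plays the role of the candidate prediction result vector: one entry per candidate prediction category, with probabilities that sum to 1.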

[0167] Each sample label corresponds to at least two initial corpus data.

[0168] The initial corpus encoding module 12 includes:

[0169] The group coding unit 121 is used to encode at least two initial corpus data corresponding to the i-th sample label respectively, so as to obtain the initial corpus vectors corresponding to the at least two initial corpus data corresponding to the i-th sample label respectively; i is a positive integer less than or equal to k;

[0170] Vector fusion unit 122 is used to perform vector fusion on at least two initial corpus vectors corresponding to the i-th sample label to obtain the label corpus vector corresponding to the i-th sample label.
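The group coding unit 121 and the vector fusion unit 122 can be illustrated as follows. The bag-of-words `encode` function is a hypothetical stand-in for the encoding layer, and element-wise averaging is an assumed fusion rule; neither is specified by the source.

```python
def encode(text, vocabulary):
    """Hypothetical stand-in encoder: bag-of-words counts over a fixed vocabulary."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocabulary]


def fuse(vectors):
    """Vector fusion by element-wise averaging (an assumed fusion rule)."""
    if not vectors:
        raise ValueError("at least one initial corpus vector is required")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def label_corpus_vector(initial_corpus, vocabulary):
    """Encode every initial corpus item of one sample label, then fuse the results."""
    return fuse([encode(text, vocabulary) for text in initial_corpus])


vocabulary = ["wake", "alarm", "song"]
vector = label_corpus_vector(["set an alarm", "wake me with an alarm"], vocabulary)
```

Averaging keeps the label corpus vector in the same space as the individual initial corpus vectors, so it can later be compared directly with a candidate corpus vector.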

[0171] The similarity determination module 14 includes:

[0172] The label vector acquisition unit 141 is used to acquire the label vectors corresponding to the k sample labels respectively;

[0173] The vector optimization unit 142 is used to fuse the label corpus vector and the label vector that are associated with the same sample label to obtain the optimized corpus vector corresponding to each sample label.

[0174] The similarity acquisition unit 143 is used to acquire the vector similarity between the k optimized corpus vectors and the candidate corpus vectors.
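The fusion performed by the vector optimization unit 142 can be sketched as a weighted sum of the label corpus vector and the label vector of the same sample label. The weighted-sum rule, the `weight` parameter, and the assumption that both vectors share one dimensionality are illustrative choices not stated in the source.

```python
def optimize_corpus_vectors(label_corpus_vectors, label_vectors, weight=0.5):
    """Fuse each label corpus vector with the label vector of the same sample label."""
    optimized = {}
    for label, corpus_vector in label_corpus_vectors.items():
        label_vector = label_vectors[label]
        if len(label_vector) != len(corpus_vector):
            raise ValueError(f"dimension mismatch for label {label!r}")
        optimized[label] = [
            weight * c + (1 - weight) * l
            for c, l in zip(corpus_vector, label_vector)
        ]
    return optimized


# Hypothetical vectors: one derived from the initial corpus data, one from the label itself.
optimized = optimize_corpus_vectors(
    {"set_alarm": [1.0, 0.0]},
    {"set_alarm": [0.0, 1.0]},
)
```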

[0175] The candidate prediction result vector includes k candidate prediction categories and the prediction probability corresponding to each candidate prediction category;

[0176] The sample determination module 16 includes:

[0177] The probability selection unit 161 is used to select the candidate prediction category with the highest prediction probability from k candidate prediction categories, and to determine the candidate prediction category with the highest prediction probability as the candidate corpus label of the candidate corpus data.

[0178] The sample selection unit 162 is used to obtain the corpus selection threshold. If the prediction probability corresponding to the candidate corpus label is greater than or equal to the corpus selection threshold, the candidate corpus label and candidate corpus data are determined as training samples for the model to be trained.
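The probability selection unit 161 and the sample selection unit 162 can be sketched as follows. The candidate prediction result vector is represented here as a dictionary from candidate prediction category to prediction probability; that representation, the function name, and the threshold values are assumptions for illustration.

```python
def select_training_sample(candidate_text, prediction, threshold):
    """Return (candidate corpus data, candidate corpus label) if the top probability meets the threshold."""
    label, probability = max(prediction.items(), key=lambda item: item[1])
    if probability >= threshold:
        return candidate_text, label
    return None


prediction = {"set_alarm": 0.82, "play_music": 0.11, "weather": 0.07}
accepted = select_training_sample("wake me up at six", prediction, threshold=0.8)
rejected = select_training_sample("wake me up at six", prediction, threshold=0.9)
```

A candidate whose highest prediction probability falls below the corpus selection threshold is simply not added to the training samples.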

[0179] The number of candidate corpus data is N; N is a positive integer;

[0180] The sample determination module 16 includes:

[0181] The multi-prediction unit 163 is used to determine the candidate corpus labels corresponding to the N candidate corpus data based on the candidate prediction result vectors corresponding to the N candidate corpus data respectively.

[0182] The quantity determination unit 164 is used to determine the number of samples to be expanded based on the initial corpus data corresponding to each sample label in the model to be trained.

[0183] The sample determination unit 165 is used to determine the N candidate corpus data and the candidate corpus labels corresponding to the N candidate corpus data as training samples for the model to be trained if N is less than or equal to the number of samples to be expanded.

[0184] The sample determination unit 165 is also used, if N is greater than the number of samples to be expanded, to determine the corpus confidence of each of the N candidate corpus data based on the candidate prediction result vectors corresponding to the N candidate corpus data, and to select the training samples of the model to be trained from the N candidate corpus data and their corresponding candidate corpus labels based on the corpus confidence.
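The behaviour of units 163 to 165 can be sketched as follows. The sketch assumes that the corpus confidence of a candidate is its highest prediction probability and that the most confident candidates are kept when N exceeds the number of samples to be expanded; the source does not define the confidence measure, and the names used are hypothetical.

```python
def expand_samples(candidates, predictions, expansion_quota):
    """Label N candidates and keep at most expansion_quota of them, most confident first."""
    labelled = []
    for text, prediction in zip(candidates, predictions):
        label, confidence = max(prediction.items(), key=lambda item: item[1])
        labelled.append((text, label, confidence))
    if len(labelled) > expansion_quota:
        # N exceeds the quota: rank by corpus confidence and truncate.
        labelled.sort(key=lambda item: item[2], reverse=True)
        labelled = labelled[:expansion_quota]
    return [(text, label) for text, label, _ in labelled]


candidates = ["wake me at six", "put on a song", "is it raining"]
predictions = [
    {"set_alarm": 0.90, "play_music": 0.10},
    {"set_alarm": 0.45, "play_music": 0.55},
    {"set_alarm": 0.30, "play_music": 0.70},
]
kept = expand_samples(candidates, predictions, expansion_quota=2)
```

When N does not exceed the quota, every labelled candidate is returned unchanged, matching the first branch described above.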

[0185] The training samples for the model to be trained include training sample data and training sample labels; the training sample data includes candidate corpus data and initial corpus data corresponding to each sample label; the device 1000 also includes:

[0186] The sample prediction module 17 is used to input training sample data into the model to be trained for prediction and obtain the sample prediction results corresponding to the training sample data.

[0187] The first training module 18 is used to generate a first loss function based on the sample prediction results and the training sample labels, and to adjust the parameters of the model to be trained based on the first loss function to obtain the target model.

[0188] The device 1000 also includes:

[0189] Model parsing module 19 is used to receive data parsing requests for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed;

[0190] The data matching module 20 is used to obtain historical parsing data and the corresponding historical parsing results, match the data to be parsed with the historical parsing data, and obtain the data matching degree between the data to be parsed and the historical parsing data.

[0191] Data parsing module 21 is used to determine the second parsing result of the data to be parsed based on the data matching degree and historical parsing results;

[0192] The parsing integration module 22 is used to integrate the first parsing result and the second parsing result to obtain the target parsing result of the data to be parsed.
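The data matching performed by modules 20 and 21 can be sketched as follows. The sketch assumes that the data matching degree is a character-level similarity ratio computed with Python's standard `difflib.SequenceMatcher`, and that the second parsing result is the historical parsing result of the best match when its degree reaches a minimum value. The matching measure, the `min_degree` parameter, and the example history are assumptions.

```python
from difflib import SequenceMatcher


def second_parsing_result(query, history, min_degree=0.8):
    """Return (parsing result or None, best data matching degree).

    history is a list of (historical parsing data, historical parsing result) tuples.
    """
    best_degree, best_result = 0.0, None
    for past_text, past_result in history:
        degree = SequenceMatcher(None, query, past_text).ratio()
        if degree > best_degree:
            best_degree, best_result = degree, past_result
    if best_degree >= min_degree:
        return best_result, best_degree
    return None, best_degree


history = [("play some jazz", "play_music"), ("set an alarm", "set_alarm")]
result, degree = second_parsing_result("set an alarm", history)
miss, miss_degree = second_parsing_result("qqq", history)
```

Exact or near-exact repeats of historical data therefore reuse the historical result, which is consistent with the high precision and low recall attributed to this method.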

[0193] The device 1000 also includes:

[0194] The model parsing module 19 is also used to receive data parsing requests for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed.

[0195] Template parsing module 23 is used to obtain corpus templates, obtain target corpus templates that match the data to be parsed from the corpus templates, and determine the template parsing result corresponding to the target corpus template as the third parsing result of the data to be parsed;

[0196] The parsing integration module 22 is also used to integrate the first parsing result and the third parsing result to obtain the target parsing result of the data to be parsed.
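Template matching as performed by the template parsing module 23 can be sketched as follows. Corpus templates are represented here as regular expressions with named groups, each paired with a template parsing result; this representation, the example templates, and the output structure are assumptions for illustration.

```python
import re

# Hypothetical corpus templates: (pattern, template parsing result).
TEMPLATES = [
    (re.compile(r"^remind me to (?P<action>.+) at (?P<time>.+)$"), "set_reminder"),
    (re.compile(r"^play (?P<song>.+)$"), "play_music"),
]


def third_parsing_result(text, templates=TEMPLATES):
    """Return the parsing result of the first template that matches, or None."""
    normalized = text.strip().lower()
    for pattern, intent in templates:
        match = pattern.match(normalized)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return None


parsed = third_parsing_result("Remind me to call mom at 5 pm")
unmatched = third_parsing_result("what is the weather")
```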

[0197] The device 1000 also includes:

[0198] The model parsing module 19 is also used to receive data parsing requests for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed.

[0199] The key parsing module 24 is used to extract the key information to be parsed from the data to be parsed through the key information extraction model, perform semantic analysis on the key information to be parsed, and determine the fourth parsing result of the data to be parsed.

[0200] The parsing integration module 22 is also used to integrate the first parsing result and the fourth parsing result to obtain the target parsing result of the data to be parsed.

[0201] The device 1000 also includes:

[0202] The key prediction module 25 is used to input the training sample data into the initial key information extraction model for prediction, and obtain the predicted key information corresponding to the training sample data.

[0203] The key determination module 26 is used to determine the key prediction results corresponding to the training sample data based on the predicted key information;

[0204] The second training module 27 is used to generate a second loss function based on the key prediction results and training sample labels, and to adjust the parameters of the initial key information extraction model based on the second loss function to obtain the key information extraction model.

[0205] Specifically, the candidate corpus encoding module 13 is used for:

[0206] Acquire candidate corpus data, and encode the candidate corpus data into candidate corpus vectors through the encoding layer in the corpus matching model;

[0207] The similarity determination module 14 is specifically used for:

[0208] Obtain, through the feature fusion layer in the corpus matching model, the vector similarity between the k label corpus vectors and the candidate corpus vectors respectively;

[0209] The initial prediction module 15 is specifically used for:

[0210] Integrate the k label corpus vectors based on the vector similarity to obtain the candidate prediction result vector corresponding to the candidate corpus data.

[0211] The device 1000 also includes:

[0212] The sample acquisition module 28 is used to acquire d first corpus samples and the first corpus label of each first corpus sample, and to acquire second corpus samples and the second corpus label of each second corpus sample; d is a positive integer;

[0213] The first corpus encoding module 29 is used to encode the second corpus sample based on the encoding layer in the initial corpus matching model to obtain the second corpus vector corresponding to the second corpus sample;

[0214] The second corpus encoding module 30 is used to encode d first corpus samples based on the encoding layer in the initial corpus matching model to obtain the first corpus vectors corresponding to the d first corpus samples respectively, and to integrate the first corpus vectors corresponding to the first corpus samples with the same first corpus label to obtain the label corpus matrix.

[0215] The label vector fusion module 31 is used to fuse the label corpus matrix and the second corpus vector through the feature fusion layer in the initial corpus matching model to obtain the initial prediction result of the second corpus sample.

[0216] The model generation module 32 is used to generate a third loss function based on the second corpus labels and the initial prediction results, and to adjust the parameters of the initial corpus matching model based on the third loss function to generate a corpus matching model.

[0217] The initial corpus acquisition module 11 includes:

[0218] Page display unit 111 is used to respond to the request to add corpus for the model to be trained and display the corpus management page associated with the model to be trained;

[0219] The data acquisition unit 112 is used to acquire the k sample labels submitted on the corpus management page, output the k sample labels on the corpus management page, and acquire the initial corpus data corresponding to each sample label submitted on the corpus management page based on the k sample labels.

[0220] This application provides a corpus label acquisition device. The device can acquire k sample labels for a model to be trained, and initial corpus data corresponding to each sample label. It encodes the initial corpus data corresponding to each sample label to obtain label corpus vectors corresponding to the k sample labels, where k is a positive integer. It acquires candidate corpus data and encodes the candidate corpus data into candidate corpus vectors. It acquires the vector similarity between the k label corpus vectors and the candidate corpus vectors, integrates the k label corpus vectors based on the vector similarity, and obtains candidate prediction result vectors corresponding to the candidate corpus data. Based on the candidate prediction result vectors, it determines the candidate corpus labels of the candidate corpus data, and uses the candidate corpus labels and candidate corpus data as training samples for the model to be trained. Through the above process, the training samples are expanded using the acquired seed corpus (i.e., the initial corpus data corresponding to each sample label), resulting in candidate corpus data. The labels of the seed corpus are then used to predict the labels of the candidate corpus data, thus determining the training samples for model training from the candidate corpus data. This expands the model's training samples to obtain the maximum number of available training samples, improving training accuracy. Simultaneously, the correspondence between sample labels and initial corpus data is used to establish a relationship with the candidate corpus data, obtaining candidate corpus labels. This maps the candidate corpus data to sample labels, further improving the accuracy and efficiency of corpus expansion.

[0221] See Figure 11, which is a schematic diagram of the structure of a computer device provided in an embodiment of this application. As shown in Figure 11, the computer device in this embodiment may include one or more processors 1101, a memory 1102, and an input/output interface 1103. The processor 1101, memory 1102, and input/output interface 1103 are connected via a bus 1104. The memory 1102 stores a computer program, which includes program instructions. The input/output interface 1103 receives and outputs data, for example for data interaction between the computer device and a terminal device, or for data interaction between different layers in a model. The processor 1101 executes the program instructions stored in the memory 1102.

[0222] The processor 1101 can perform the following operations:

[0223] Obtain k sample labels for the model to be trained, and the initial corpus data corresponding to each sample label. Encode the initial corpus data corresponding to each sample label to obtain the label corpus vectors corresponding to the k sample labels respectively; k is a positive integer.

[0224] Obtain candidate corpus data and encode the candidate corpus data into candidate corpus vectors;

[0225] Obtain the vector similarity between the k label corpus vectors and the candidate corpus vectors respectively, and integrate the k label corpus vectors based on the vector similarity to obtain the candidate prediction result vector corresponding to the candidate corpus data;

[0226] Based on the candidate prediction result vector, the candidate corpus labels of the candidate corpus data are determined, and the candidate corpus labels and candidate corpus data are determined as training samples for the model to be trained.

[0227] In some feasible implementations, the processor 1101 may be a central processing unit (CPU), but it can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0228] The memory 1102 may include read-only memory and random access memory, and provides instructions and data to the processor 1101 and the input / output interface 1103. A portion of the memory 1102 may also include non-volatile random access memory. For example, the memory 1102 may also store device type information.

[0229] In a specific implementation, the computer device can, through its built-in functional modules, perform the implementations provided by the steps in Figure 3 or Figure 8. For details, refer to the implementations provided by the steps in Figure 3 or Figure 8, which are not elaborated here.

[0230] This application provides a computer device including a processor, an input/output interface, and a memory. The processor retrieves the computer program from the memory and executes the steps of the method shown in Figure 3 to perform the corpus label acquisition operation. This embodiment expands the training samples using the acquired seed corpus (i.e., the initial corpus data corresponding to each sample label), obtaining candidate corpus data. The labels of the seed corpus are used to predict the labels of the candidate corpus data. This allows training samples for model training to be determined from the candidate corpus data, expanding the model's training samples to obtain as many training samples as possible and improving the accuracy of model training. At the same time, the correspondence between sample labels and initial corpus data is used to establish an association with the candidate corpus data, yielding the candidate corpus labels of the candidate corpus data. This maps the candidate corpus data to sample labels, thereby improving the accuracy and efficiency of corpus expansion.

[0231] This application also provides a computer-readable storage medium storing a computer program adapted to be loaded by a processor to execute the corpus label acquisition method provided by the steps in Figure 3 or Figure 8. For details, refer to the implementations provided by the steps in Figure 3 or Figure 8, which are not repeated here. The beneficial effects of the same method are likewise not repeated. For technical details not disclosed in the computer-readable storage medium embodiments of this application, refer to the description of the method embodiments of this application. As an example, a computer program may be deployed to execute on a single computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected via a communication network.

[0232] The computer-readable storage medium can be an internal storage unit of the corpus label acquisition device provided in any of the foregoing embodiments or of the computer device, such as the hard disk or memory of the computer device. The computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on the computer device. Furthermore, the computer-readable storage medium can include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium can also be used to temporarily store data that has been output or will be output.

[0233] This application also provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various optional approaches of Figure 3 or Figure 8. The method expands the training samples using the acquired seed corpus (i.e., the initial corpus data corresponding to each sample label), obtaining candidate corpus data, and uses the labels of the seed corpus to predict the labels of the candidate corpus data. This allows training samples for model training to be determined from the candidate corpus data, expanding the model's training samples to obtain as many training samples as possible and improving the accuracy of model training. At the same time, the correspondence between sample labels and initial corpus data is used to establish an association with the candidate corpus data, yielding the candidate corpus labels of the candidate corpus data. This maps the candidate corpus data to sample labels, further improving the accuracy and efficiency of corpus expansion.

[0234] The terms "first," "second," etc., in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the term "comprising," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, apparatuses, products, or devices.

[0235] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described in terms of functionality. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

[0236] The methods and related apparatus provided in this application are described with reference to the method flowcharts and/or structural diagrams provided in this application. Specifically, each flow and/or block of the method flowcharts and/or structural diagrams, as well as combinations of flows and/or blocks in the flowcharts and/or structural diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable corpus tag acquisition device to produce a machine, such that the instructions executed by the processor of the computer or other programmable corpus tag acquisition device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the structural diagram. These computer program instructions may also be stored in a computer-readable storage medium capable of directing a computer or other programmable corpus tag acquisition device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the structural diagram. These computer program instructions may also be loaded onto a computer or other programmable corpus tag acquisition device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the structural diagram.

[0237] The steps in the method of this application embodiment can be adjusted, combined, or deleted according to actual needs.

[0238] The modules in the device of this application embodiment can be merged, divided, and deleted according to actual needs.

[0239] The above-disclosed embodiments are merely preferred embodiments of this application and should not be construed as limiting the scope of this application. Therefore, any equivalent variations made in accordance with the claims of this application shall still fall within the scope of this application.

Claims

1. A method for obtaining corpus tags, characterized in that, The method includes: Obtain k sample labels for the model to be trained, and at least two initial corpus data corresponding to each sample label; Obtain the initial corpus vectors of the at least two initial corpus data corresponding to each sample label; Perform vector fusion on the at least two initial corpus vectors corresponding to each sample label to obtain the label corpus vector corresponding to that sample label; k is a positive integer; the k sample labels refer to the labels associated with the model to be trained, and the model obtained after training the model to be trained is used to predict the probability of the data to be predicted corresponding to the k sample labels respectively; Obtain candidate corpus data and encode the candidate corpus data into candidate corpus vectors; Obtain the label vectors corresponding to the k sample labels respectively; The label corpus vectors and label vectors that are associated with the same sample label are fused to obtain the optimized corpus vectors corresponding to each sample label; Obtain the vector similarity between the k optimized corpus vectors and the candidate corpus vectors respectively; Based on the vector similarity, the k optimized corpus vectors are weighted and summed to obtain the candidate prediction result vector corresponding to the candidate corpus data; Based on the candidate prediction result vector, the candidate corpus labels of the candidate corpus data are determined, and the candidate corpus labels and the candidate corpus data are determined as training samples of the model to be trained; The model to be trained is trained based on the training samples to obtain a target model; the target model is used for data parsing, and is used to predict the data to be parsed to obtain the target parsing result of the data to be parsed, or to integrate the first parsing result obtained by predicting the data to be parsed with the candidate parsing result obtained by parsing the data to be parsed using the candidate parsing method to obtain the target parsing result of the data to be parsed.
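
As an illustration only, the vector steps of claim 1 can be sketched in Python. The claim does not fix the fusion operator, the similarity measure, or how similarities become weights; this sketch assumes element-wise averaging for both fusion steps, cosine similarity, and softmax-normalized weights. The function names (mean_vector, cosine, softmax, predict_candidate) and the toy two-dimensional vectors are hypothetical, and the encoder that would produce the vectors is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean; stands in for the unspecified 'vector fusion'."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax (assumed way to turn similarities into weights)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_candidate(
    initial_vectors: Dict[str, List[Vector]],
    label_vectors: Dict[str, Vector],
    candidate_vector: Vector,
) -> Tuple[List[str], List[float], Vector]:
    """Return (labels, similarity weights, weighted sum of optimized vectors)."""
    labels = list(initial_vectors)
    optimized = []
    for label in labels:
        # Fuse the initial corpus vectors of one label into a label corpus vector.
        label_corpus_vector = mean_vector(initial_vectors[label])
        # Fuse the label corpus vector with the label vector (assumed: average).
        optimized.append(mean_vector([label_corpus_vector, label_vectors[label]]))
    similarities = [cosine(vec, candidate_vector) for vec in optimized]
    weights = softmax(similarities)
    # Weighted sum of the k optimized corpus vectors.
    result = [
        sum(w * vec[i] for w, vec in zip(weights, optimized))
        for i in range(len(candidate_vector))
    ]
    return labels, weights, result
```

In this sketch the normalized weights can be read as one score per sample label. How the claimed candidate prediction result vector is mapped to per-label probabilities is not spelled out in the claim, so that reading is an assumption.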

2. The method as described in claim 1, characterized in that, The candidate prediction result vector includes k candidate prediction categories and the prediction probability corresponding to each candidate prediction category; The step of determining candidate corpus labels based on the candidate prediction result vector, and determining the candidate corpus labels and the candidate corpus data as training samples for the model to be trained, includes: From the k candidate predicted categories, obtain the candidate predicted category with the highest prediction probability, and determine the candidate predicted category with the highest prediction probability as the candidate corpus label of the candidate corpus data; Obtain a corpus selection threshold. If the predicted probability corresponding to the candidate corpus label is greater than or equal to the corpus selection threshold, then the candidate corpus label and the candidate corpus data are determined as training samples for the model to be trained.
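
The selection rule of claim 2 can be sketched as follows. The function name select_label and the representation of the candidate prediction result vector as a parallel list of labels and probabilities are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def select_label(
    labels: Sequence[str],
    probabilities: Sequence[float],
    threshold: float,
) -> Optional[Tuple[str, float]]:
    """Pick the most probable category; keep it only if it meets the threshold."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    if probabilities[best] >= threshold:
        # The candidate and its label become a training sample.
        return labels[best], probabilities[best]
    # Below the corpus selection threshold: the candidate is not used.
    return None
```

A higher threshold admits fewer but more reliable samples; the claim leaves the threshold value open.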

3. The method as described in claim 1, characterized in that, The number of candidate corpus data is N; N is a positive integer; The step of determining candidate corpus labels based on the candidate prediction result vector, and determining the candidate corpus labels and the candidate corpus data as training samples for the model to be trained, includes: Based on the candidate prediction result vectors corresponding to N candidate corpus data, the candidate corpus labels corresponding to the N candidate corpus data are determined. Based on the initial corpus data corresponding to each sample label in the model to be trained, determine the number of samples to be expanded. If N is less than or equal to the number of samples to be expanded, then the N candidate corpus data and the candidate corpus labels corresponding to the N candidate corpus data are determined as the training samples of the model to be trained. If N is greater than the number of samples to be expanded, then based on the candidate prediction result vectors corresponding to the N candidate corpus data respectively, the corpus confidence corresponding to the N candidate corpus data is determined, and based on the corpus confidence, the training samples of the model to be trained are obtained from the N candidate corpus data and the candidate corpus labels corresponding to the N candidate corpus data respectively.
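
A minimal sketch of the budgeted selection in claim 3 follows. It assumes that the corpus confidence of a candidate is its highest predicted probability and that the number of samples to be expanded is passed in as a ready-made budget; the claim does not define either computation, and the function name choose_training_samples is hypothetical.

```python
from typing import Dict, List, Tuple


def choose_training_samples(
    candidates: List[Tuple[str, Dict[str, float]]],
    budget: int,
) -> List[Tuple[str, str]]:
    """Return (text, label) pairs, capped at `budget` by descending confidence."""
    labelled = []
    for text, probabilities in candidates:
        label = max(probabilities, key=probabilities.get)
        # Assumed confidence: the probability of the chosen label.
        labelled.append((text, label, probabilities[label]))
    if len(labelled) <= budget:
        # N does not exceed the number of samples to be expanded: keep all.
        return [(text, label) for text, label, _ in labelled]
    ranked = sorted(labelled, key=lambda item: item[2], reverse=True)
    return [(text, label) for text, label, _ in ranked[:budget]]
```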

4. The method as described in claim 1, characterized in that, The training samples of the model to be trained include training sample data and training sample labels; The training sample data includes the candidate corpus data and the initial corpus data corresponding to each sample label; The method further includes: The training sample data is input into the model to be trained for prediction, and the sample prediction result corresponding to the training sample data is obtained. A first loss function is generated based on the sample prediction results and the training sample labels. The parameters of the model to be trained are adjusted based on the first loss function to obtain the target model.
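
The training step of claim 4 can be illustrated with a toy classifier. This sketch assumes the training sample data has already been encoded as feature vectors, that the model to be trained is a linear softmax classifier, and that the first loss function is cross-entropy; none of these choices is stated in the claim, and the names train and predict are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _softmax(logits: Sequence[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def _logits(weights: List[List[float]], x: Sequence[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def train(
    samples: List[Tuple[List[float], int]],
    num_labels: int,
    dim: int,
    lr: float = 0.5,
    epochs: int = 50,
) -> Tuple[List[List[float]], List[float]]:
    """Stochastic gradient descent on cross-entropy; returns weights and per-epoch loss."""
    weights = [[0.0] * dim for _ in range(num_labels)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in samples:
            probs = _softmax(_logits(weights, x))
            # Assumed first loss function: negative log-likelihood of the true label.
            total += -math.log(probs[y])
            for c in range(num_labels):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
        losses.append(total / len(samples))
    return weights, losses


def predict(weights: List[List[float]], x: Sequence[float]) -> int:
    """Index of the label with the highest score."""
    scores = _logits(weights, x)
    return max(range(len(scores)), key=lambda c: scores[c])
```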

5. The method as described in claim 4, characterized in that, When the target model is used to integrate the first parsing result obtained by predicting the data to be parsed with the candidate parsing result obtained by parsing the data to be parsed using a candidate parsing method, to obtain the target parsing result of the data to be parsed, and the candidate parsing method is a data matching method, the method further includes: Receive a data parsing request for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed; Obtain historical parsing data and the corresponding historical parsing results, match the data to be parsed with the historical parsing data, and obtain the data matching degree between the data to be parsed and the historical parsing data; Based on the data matching degree and the historical parsing results, a second parsing result for the data to be parsed is determined; The first and second parsing results are integrated to obtain the target parsing result of the data to be parsed.
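
The integration in claim 5 can be sketched as follows. The sketch assumes that the data matching degree is a character-level similarity ratio computed with Python's standard difflib.SequenceMatcher, and that the two parsing results are integrated by a weighted score; the claim prescribes neither. The function names and the model_weight parameter are hypothetical.

```python
import difflib
from typing import Dict, Tuple


def match_history(query: str, history: Dict[str, str]) -> Tuple[str, float]:
    """Return the result of the closest historical entry and its matching degree."""
    def degree(text: str) -> float:
        return difflib.SequenceMatcher(None, query, text).ratio()

    best_text = max(history, key=degree)
    return history[best_text], degree(best_text)


def integrate_results(
    first_result: Dict[str, float],
    second_label: str,
    matching_degree: float,
    model_weight: float = 0.5,
) -> str:
    """Combine model probabilities with the history match (assumed weighting)."""
    scores = {label: model_weight * p for label, p in first_result.items()}
    scores[second_label] = (
        scores.get(second_label, 0.0) + (1.0 - model_weight) * matching_degree
    )
    return max(scores, key=scores.get)
```

Here the second parsing result can override a weak model prediction when the data to be parsed closely matches historical parsing data.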

6. The method as described in claim 4, characterized in that, The method further includes: Receive a data parsing request for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed; Obtain a corpus template, obtain a target corpus template that matches the data to be parsed from the corpus template, and determine the template parsing result corresponding to the target corpus template as the third parsing result of the data to be parsed; The first parsing result and the third parsing result are integrated to obtain the target parsing result of the data to be parsed.
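
A minimal sketch of the template route in claim 6 follows. It assumes corpus templates are regular expressions paired with a parsing result, and that a matched template takes precedence over the model output when the results are integrated; both are illustrative choices, and the names match_template and integrate_with_template are hypothetical.

```python
import re
from typing import List, Optional, Pattern, Tuple

Template = Tuple[Pattern[str], str]


def match_template(text: str, templates: List[Template]) -> Optional[str]:
    """Return the parsing result of the first template that matches, if any."""
    for pattern, result in templates:
        if pattern.search(text):
            return result
    return None


def integrate_with_template(first_result: str, third_result: Optional[str]) -> str:
    """Assumed rule: prefer the template result, else fall back to the model."""
    return third_result if third_result is not None else first_result


# Hypothetical templates for illustration.
EXAMPLE_TEMPLATES: List[Template] = [
    (re.compile(r"^play .+$"), "music"),
    (re.compile(r"\bweather\b"), "weather"),
]
```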

7. The method as described in claim 4, characterized in that, The method further includes: Receive a data parsing request for the data to be parsed, input the data to be parsed into the target model for prediction, and obtain the first parsing result of the data to be parsed; Extract the key information to be parsed from the data to be parsed using the key information extraction model, perform semantic analysis on the key information to be parsed, and determine the fourth parsing result of the data to be parsed; The first parsing result and the fourth parsing result are integrated to obtain the target parsing result of the data to be parsed.

8. The method as described in claim 7, characterized in that, The method further includes: The training sample data is input into the initial key information extraction model for prediction, and the predicted key information corresponding to the training sample data is obtained. Based on the predicted key information, determine the key prediction results corresponding to the training sample data; A second loss function is generated based on the key prediction results and the training sample labels. The parameters of the initial key information extraction model are adjusted based on the second loss function to obtain the key information extraction model.

9. The method as described in claim 1, characterized in that, The step of obtaining candidate corpus data and encoding the candidate corpus data into candidate corpus vectors includes: Acquire candidate corpus data, and encode the candidate corpus data into candidate corpus vectors through the encoding layer in the corpus matching model; The step of obtaining the vector similarity between k optimized corpus vectors and the candidate corpus vectors, and then performing a weighted summation of the k optimized corpus vectors based on the vector similarity to obtain the candidate prediction result vector corresponding to the candidate corpus data includes: The feature fusion layer in the corpus matching model obtains the vector similarity between k optimized corpus vectors and the candidate corpus vectors, respectively. Based on the vector similarity, the k optimized corpus vectors are weighted and summed to obtain the candidate prediction result vector corresponding to the candidate corpus data.

10. The method as described in claim 9, characterized in that, The method further includes: Obtain d first corpus samples and the first corpus label of each first corpus sample; obtain second corpus samples and the second corpus label of each second corpus sample; d is a positive integer; Based on the encoding layer in the initial corpus matching model, the second corpus sample is encoded to obtain the second corpus vector corresponding to the second corpus sample; Based on the encoding layer in the initial corpus matching model, the d first corpus samples are encoded respectively to obtain the first corpus vectors corresponding to the d first corpus samples respectively. The first corpus vectors corresponding to the first corpus samples with the same first corpus label are integrated to obtain the label corpus matrix. The label corpus matrix and the second corpus vector are fused through the feature fusion layer in the initial corpus matching model to obtain the initial prediction result of the second corpus sample. A third loss function is generated based on the second corpus labels and the initial prediction results. The parameters of the initial corpus matching model are adjusted based on the third loss function to generate the corpus matching model.

11. The method as described in claim 1, characterized in that, The process of obtaining k sample labels for the model to be trained, and at least two initial corpus data corresponding to each sample label, includes: In response to a request to add corpus to the model to be trained, the corpus management page associated with the model to be trained is displayed; Obtain k sample labels submitted in the corpus management page, output the k sample labels in the corpus management page, and obtain at least two initial corpus data corresponding to each sample label submitted in the corpus management page based on the k sample labels.

12. A corpus tag acquisition device, characterized in that, The device includes: The initial corpus acquisition module is used to acquire k sample labels of the model to be trained, and at least two initial corpus data corresponding to each sample label; the k sample labels refer to the labels associated with the model to be trained, and the model obtained after training the model to be trained is used to predict the probability of the data to be predicted corresponding to the k sample labels respectively; The initial corpus encoding module is used to obtain the initial corpus vectors of at least two initial corpus data corresponding to each sample label, and to perform vector fusion on the at least two initial corpus vectors corresponding to each sample label to obtain the label corpus vector corresponding to that sample label; k is a positive integer; The candidate corpus encoding module is used to acquire candidate corpus data and encode the candidate corpus data into candidate corpus vectors; The similarity determination module is used to obtain the label vectors corresponding to the k sample labels respectively; The similarity determination module is also used to fuse the label corpus vectors and label vectors that are associated with the same sample label to obtain the optimized corpus vectors corresponding to each sample label; The similarity determination module is also used to obtain the vector similarity between the k optimized corpus vectors and the candidate corpus vectors respectively; The initial prediction module is used to perform a weighted summation of the k optimized corpus vectors based on the vector similarity to obtain the candidate prediction result vector corresponding to the candidate corpus data; The sample determination module is used to determine the candidate corpus labels of the candidate corpus data based on the candidate prediction result vector, and to determine the candidate corpus labels and the candidate corpus data as training samples of the model to be trained; The first training module is used to train the model to be trained based on the training samples to obtain a target model; the target model is used to perform data parsing, and the target model is used to predict the data to be parsed to obtain the target parsing result of the data to be parsed, or to integrate the first parsing result obtained by predicting the data to be parsed with the candidate parsing result obtained by parsing the data to be parsed using the candidate parsing method to obtain the target parsing result of the data to be parsed.

13. A computer device, characterized in that, The computer device includes a processor, a memory, and an input/output interface; The processor is connected to the memory and the input/output interface respectively, wherein the input/output interface is used to receive data and output data, the memory is used to store computer programs, and the processor is used to call the computer programs so that the computer device executes the method according to any one of claims 1-11.