A method, apparatus, storage medium and electronic device for generating a sample
By generating key templates and automatically labeling semi-structured data based on probability, the problem of low labeling efficiency of semi-structured data is solved, and efficient and accurate labeling and entity type recognition are achieved.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
- Filing Date: 2022-09-07
- Publication Date: 2026-06-19
AI Technical Summary
In existing technologies, the annotation of semi-structured data is labor-intensive and costly, making it difficult to efficiently identify entity types.
Multiple key templates are generated in advance. For each key template, the number of labeled semi-structured data records that match it and the probability of each label for each key are counted. Semi-structured data to be labeled is then automatically matched against the key templates and labeled, and the resulting labeled samples are used to train a natural language processing model.
It improves the efficiency and accuracy of semi-structured data annotation, reduces the workload of manual annotation, and lowers costs.
[Figure CN116151207B_ABST]
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a method, apparatus, storage medium and electronic device for generating samples.
Background Technology
[0002] Currently, with the rapid development of the internet, users are paying increasing attention to their private data, and natural language processing (NLP) technology is being applied more widely as a result. Semi-structured data is produced in large volumes and updated rapidly in daily life, which creates a demand for analyzing and retrieving it. Typically, NLP technology can be used to identify the entity types corresponding to the keys in semi-structured data, so that the data can be analyzed and retrieved based on the identified entity types.
[0003] Natural language processing technology typically requires a large amount of labeled data as training samples. For semi-structured data, manual labeling is usually used, which involves a large amount of work and is costly.
Summary of the Invention
[0004] This specification provides a method, apparatus, storage medium, and electronic device for generating samples, which at least partially solves the problems existing in the prior art.
[0005] The following technical solution is adopted in this specification:
[0006] This specification provides a method for generating samples, including:
[0007] Identify multiple pre-generated key templates;
[0008] For each key template, determine the number of semi-structured data that match the key template in the labeled semi-structured data, and determine the labels corresponding to each key in the key template in each matched semi-structured data, as the candidate labels corresponding to each key;
[0009] For each key in the key template, the probability that the label corresponding to the key is one of the candidate labels is determined based on the number of matched semi-structured data and the candidate labels corresponding to the key.
[0010] When annotating the semi-structured data to be annotated, each key template is matched with the semi-structured data to be annotated to determine the key template that matches the semi-structured data to be annotated.
[0011] Based on the probability that the label corresponding to each key in the matched key template is a candidate label, the label corresponding to each key in the semi-structured data to be labeled is determined, and a label sample is obtained. The natural language processing model is trained based on the label sample. The natural language processing model is used to identify entity types in the semi-structured data.
[0012] This specification provides an apparatus for generating samples, comprising:
[0013] The template determination module is used to determine multiple pre-generated key templates;
[0014] The first matching module is used to determine, for each key template, the number of semi-structured data that match the key template in the labeled semi-structured data, and to determine the corresponding labels of each key in the key template in each matched semi-structured data, as candidate labels for each key.
[0015] The probability determination module is used to determine the probability that the label corresponding to the key is one of the candidate labels for each key in the key template, based on the number of matched semi-structured data and the candidate labels corresponding to the key.
[0016] The second matching module is used to match each key template with the semi-structured data to be labeled when labeling the semi-structured data to be labeled, and to determine the key template that matches the semi-structured data to be labeled.
[0017] The annotation module is used to determine the annotation corresponding to each key in the semi-structured data to be annotated based on the probability that the annotation corresponding to each key in the matched key template is each candidate annotation, thereby obtaining annotation samples, and training a natural language processing model based on the annotation samples. The natural language processing model is used to identify entity types in the semi-structured data.
[0018] This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method for generating samples.
[0019] This specification provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the above-described method for generating samples.
[0020] The above-mentioned technical solutions adopted in this specification can achieve the following beneficial effects:
[0021] The method for generating samples provided in this specification first determines multiple pre-generated key templates. For each key template, it determines the number of semi-structured data matching the key template in the labeled semi-structured data, as well as the candidate labels corresponding to each key in the key template. For each key in the key template, it determines the probability that the label corresponding to that key is each candidate label. When labeling the semi-structured data to be labeled, the key template matching the semi-structured data to be labeled is determined. Based on the probability that the label corresponding to each key in that key template is each candidate label, the label for each key in the semi-structured data to be labeled is determined, resulting in labeled samples. Because the probabilities are obtained by statistically analyzing the labeled semi-structured data, and the semi-structured data to be labeled is then labeled automatically based on these statistics, labeling efficiency is improved.
Attached Figure Description
[0022] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0023] Figure 1 is a schematic diagram of a sample generation process provided in this specification;
[0024] Figure 2 is a schematic diagram of a display page provided in this specification;
[0025] Figure 3 is a schematic diagram of a device for generating samples provided in this specification;
[0026] Figure 4 is a schematic diagram of an electronic device for implementing a method for generating samples, as provided in this specification.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments in this specification without creative effort are within the scope of protection of this application.
[0028] The technical solutions provided by the various embodiments of this application are described in detail below with reference to the accompanying drawings.
[0029] Figure 1 is a schematic diagram of a sample generation process in this specification, which specifically includes the following steps:
[0030] S100: Determine multiple pre-generated key templates.
[0031] Generally, semi-structured data can be viewed as data stored in key-value pairs. The specific keys contained in each piece of semi-structured data are not fixed, but the set of keys that may be used to represent the data can be predetermined. That is, multiple keys can be predetermined, and each piece of semi-structured data is constructed from a combination of at least some of these keys. For example, if 1000 keys are predetermined, then each piece of semi-structured data can select several keys as needed, determine the corresponding entity value for each key, and generate several key-value pairs that together form one semi-structured data record.
[0032] Based on this, in one or more embodiments of this specification, the server of the business platform can pre-generate multiple key templates. When applying this solution, the pre-generated multiple key templates can be determined.
[0033] The specific method used to pre-generate multiple key templates can be determined as needed. For example, the server can first determine which keys are used for recording semi-structured data, and then generate key templates from the combinations of at least some of the determined keys.
[0034] For example, the server can first determine that semi-structured data is recorded using four keys: "username", "TEL", "time", and "url". Assuming each piece of semi-structured data records information using at least two keys, combining at least two of these four keys generates the following key templates:
[0035] (username, TEL), (username, time), (username, url), (TEL, time), (TEL, url), (time, url), (username, TEL, time), (username, TEL, url), (username, time, url), (TEL, time, url), (username, TEL, time, url).
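The enumeration above can be sketched in Python as follows. This is only a minimal illustration: the function name `generate_key_templates` and the `min_size` parameter are assumptions made for this example, and the specification does not prescribe any particular implementation.

```python
from itertools import combinations


def generate_key_templates(keys, min_size=2):
    """Return every unordered combination of `keys` that has at least `min_size` keys.

    Hypothetical helper: each key template is represented as a tuple of key names.
    """
    templates = []
    for size in range(min_size, len(keys) + 1):
        templates.extend(combinations(keys, size))
    return templates


templates = generate_key_templates(["username", "TEL", "time", "url"])
for template in templates:
    print(template)
```

With four keys and a minimum of two keys per template, this yields the 6 + 4 + 1 = 11 templates listed above.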
[0036] This specification can be executed by a terminal that implements this solution, such as a mobile phone, tablet computer, or personal computer that implements this method. Of course, the executing entity of this specification can also be a server set up on a business platform, or a device such as a desktop computer or laptop computer capable of executing this specification. For ease of explanation, the following description will only use a server as the executing entity.
[0037] S102: For each key template, determine the number of semi-structured data that match the key template in the labeled semi-structured data, and determine the corresponding labels of each key in the key template in each matched semi-structured data, as the candidate labels corresponding to each key.
[0038] S104: For each key in the key template, based on the number of matched semi-structured data and the candidate labels corresponding to the key, determine the probability that the label corresponding to the key is a candidate label.
[0039] After determining the pre-generated key templates as described above, the server can first obtain a number of labeled semi-structured data, and then, based on each labeled semi-structured data, statistically analyze the labeling of each key in each key template, and subsequently label the semi-structured data to be labeled.
[0040] Specifically, in one or more embodiments of this specification, the server may, for each key template, first determine the number of semi-structured data matching the key template in the labeled semi-structured data, and then determine the labels corresponding to each key in the key template in each matched semi-structured data, as candidate labels for each key. Then, for each key in the key template, based on the number of matched semi-structured data and the candidate labels corresponding to that key, determine the probability that the label corresponding to that key in the key template is a candidate label.
[0041] The semi-structured data that matches the key template is the semi-structured data whose key combinations, obtained by extracting the keys from each key-value pair, match the key template. Furthermore, in semi-structured data, even the same key may correspond to different types of values in different scenarios. For example, when the key name is "username," its corresponding value might be the user's nickname or real name. Of course, if the scenario describes an organization, it could also be the organization's username.
[0042] Therefore, the server can first determine, based on the key template and each piece of labeled semi-structured data, the number of semi-structured data whose key combinations match the key template. Then, for each key in the key template, the server can determine the corresponding key-value pairs in each piece of matched semi-structured data and, based on the annotations of the keys in those key-value pairs, determine the candidate annotations corresponding to that key. Next, for each candidate annotation corresponding to that key, the server counts how many times the key carries that candidate annotation in the semi-structured data matching the key template. Finally, the probability that the key's annotation is that candidate annotation is determined as the ratio of this count to the number of matched semi-structured data.
[0043] For example, suppose the server retrieves 1000 labeled semi-structured data entries. Taking the key template ["username", "TEL", "time"] as an example, and assuming that 100 of these 1000 entries have the same key combination as this key template (i.e., they match this key template), these 100 entries are the semi-structured data entries that match this key template. For each key in this key template, taking the "username" key as an example, assume that among these 100 matching entries the "username" key carries one of two labels: "personal username" or "organizational username". The candidate labels for the "username" key are then "personal username" and "organizational username". Specifically, among these 100 matching entries, 70 entries carry the candidate label "personal username" and 30 entries carry the candidate label "organizational username". Therefore, the probability that the "username" key in the key template corresponds to the candidate label "personal username" is 0.7, and the probability that it corresponds to the candidate label "organizational username" is 0.3. The same logic applies to the other keys; by following the same process, the probability that the label of each key in the key template is each candidate label can be obtained.
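The counting in steps S102 and S104 could be sketched as below. This is a hedged illustration, not the patented implementation: the function `label_probabilities`, the representation of a labeled record as a dictionary from key to label, and the labels "phone number", "date" and "link" are all assumptions introduced for the example.

```python
from collections import Counter, defaultdict


def label_probabilities(labeled_records, template):
    """Count template matches and estimate per-key label probabilities.

    labeled_records: list of dicts mapping key -> label (assumed representation).
    template: iterable of key names; key order is ignored when matching.
    """
    template_keys = frozenset(template)
    # A record matches when its key combination equals the template's keys.
    matched = [r for r in labeled_records if frozenset(r) == template_keys]
    counts = defaultdict(Counter)
    for record in matched:
        for key, label in record.items():
            counts[key][label] += 1
    total = len(matched)
    probabilities = {
        key: {label: count / total for label, count in counter.items()}
        for key, counter in counts.items()
    }
    return total, probabilities


template = ("username", "TEL", "time")
labeled = (
    [{"username": "personal username", "TEL": "phone number", "time": "date"}] * 70
    + [{"username": "organizational username", "TEL": "phone number", "time": "date"}] * 30
    + [{"username": "personal username", "url": "link"}] * 900  # do not match the template
)
matched_count, probs = label_probabilities(labeled, template)
print(matched_count)
print(probs["username"])
```

Here 100 of the 1000 records match the template, and the "username" key receives the probabilities 0.7 and 0.3 from the example above.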
[0044] S106: When annotating the semi-structured data to be annotated, each key template is matched with the semi-structured data to be annotated to determine the key template that matches the semi-structured data to be annotated.
[0045] S108: Based on the probability that the annotation corresponding to each key in the matched key template is each candidate annotation, determine the annotation corresponding to each key in the semi-structured data to be annotated, and obtain annotation samples, so as to train a natural language processing model based on the annotation samples. The natural language processing model is used to identify entity types in the semi-structured data.
[0046] After determining the probability that each key in each key template corresponds to a candidate label, the semi-structured data to be labeled can be labeled.
[0047] Specifically, in one or more embodiments of this specification, when annotating semi-structured data to be annotated, for each piece of data to be annotated, the server can first extract the keys from each key-value pair contained in the semi-structured data to determine the key combinations contained in the semi-structured data to be annotated. Then, based on the key combinations and each key template, a key template matching the key combinations contained in the semi-structured data to be annotated is determined. Afterwards, for each key-value pair contained in the semi-structured data to be annotated, each key in the matching key template is matched with the keys in the key-value pair to determine the keys in the matching key template that match the key-value pair. Based on the probability of each candidate annotation corresponding to the annotation of the matching keys, the target annotation is determined. Finally, the keys in the key-value pair are annotated based on the target annotation.
[0048] Here, a key template matches a key combination when the keys of the template are the same as the keys of the combination. During matching, it is sufficient that the key template matches all the keys in the key combination contained in the semi-structured data to be labeled, regardless of the order of the keys. Based on the probabilities of the candidate labels of the key that matches the key-value pair, the candidate label with the highest probability can be used as the target label, and the key in the key-value pair of the semi-structured data to be labeled is labeled with the target label.
[0049] For example, continuing with the example in step S104, assume the semi-structured data to be labeled is "username: Tom TEL: 13300000000 time: 2022.2.2". The server can first determine that the key template matching the semi-structured data to be labeled is ["username", "TEL", "time"]. For each key-value pair in the semi-structured data to be labeled, taking "username: Tom" as an example, step S104 has determined that the "username" key in the matching key template corresponds to two candidate labels: the probability of the candidate label "personal username" is 0.7, and the probability of the candidate label "organizational username" is 0.3. The server can use the candidate label "personal username" as the target label and label the "username" key in this key-value pair as "personal username". The same logic applies to the other key-value pairs, which will not be elaborated here.
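Steps S106 and S108 could look roughly like the sketch below. The function names, the nested dictionary used to hold the probabilities, and the labels "phone number" and "date" are assumptions made for illustration; only the "username" probabilities come from the example above.

```python
def find_matching_template(record, templates):
    """Return the template whose keys equal the record's keys, ignoring order."""
    record_keys = frozenset(record)
    for template in templates:
        if frozenset(template) == record_keys:
            return template
    return None


def auto_label(record, templates, probabilities):
    """Label each key of `record` with its highest-probability candidate label."""
    template = find_matching_template(record, templates)
    if template is None:
        return None  # no key template matches this record
    key_probabilities = probabilities.get(template, {})
    labels = {}
    for key in record:
        candidates = key_probabilities.get(key, {})
        labels[key] = max(candidates, key=candidates.get) if candidates else None
    return labels


templates = [("username", "TEL"), ("username", "TEL", "time")]
probabilities = {
    ("username", "TEL", "time"): {
        "username": {"personal username": 0.7, "organizational username": 0.3},
        "TEL": {"phone number": 1.0},   # hypothetical label
        "time": {"date": 1.0},
    }
}
record = {"username": "Tom", "TEL": "13300000000", "time": "2022.2.2"}
labels = auto_label(record, templates, probabilities)
print(labels)
```

Comparing key sets with `frozenset` reflects the statement that the order of the keys does not matter during matching.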
[0050] After determining the labels corresponding to each key in the semi-structured data to be labeled, labeled samples can be obtained. Natural language processing models can then be trained based on these labeled samples. These models are used to identify entity types in the semi-structured data.
[0051] Figure 1 illustrates the sample generation process provided in this specification. First, multiple pre-generated key templates are determined. For each key template, the number of semi-structured data matching that template in the labeled semi-structured data is determined, as well as the candidate labels corresponding to each key in that template. For each key in the template, the probability that the label corresponding to that key is each candidate label is determined. When labeling the semi-structured data to be labeled, the key template matching the semi-structured data to be labeled is determined. Based on the probability that the label corresponding to each key in that key template is each candidate label, the label for each key in the semi-structured data to be labeled is determined, resulting in a labeled sample. Because the probabilities are obtained by statistically analyzing the labeled semi-structured data, and the semi-structured data to be labeled is then labeled automatically based on these statistics, labeling efficiency is improved.
[0052] Generally speaking, semi-structured data, as a type of structured data, has a certain structure, but the structure varies greatly.
[0053] For example, taking the logging of business applications as an example, the following are two sample logs:
[0054] Username: zhangsan
[0055] TEL: 15100000000
[0056] Source: 192.168.35.35
[0057] request_method: GET
[0058] status: 200.
[0059] username: lisi
[0060] Time: February 2, 2022
[0061] Source: 192.168.25.25
[0062] URL: index.html.
[0063] As can be seen, logs, as semi-structured data, are stored in the form of key-value pairs, but the specific keys included in each log entry can be determined as needed.
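As a rough illustration of treating a log entry as key-value pairs, the following sketch parses the first sample log. It assumes one "key: value" pair per line and splits on the first colon only; `parse_log` is a hypothetical helper, not part of the specification.

```python
def parse_log(text):
    """Parse a log entry with one 'key: value' pair per line into a dict."""
    pairs = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that are not key-value pairs
        key, value = line.split(":", 1)  # split on the first colon only
        pairs[key.strip()] = value.strip()
    return pairs


log = """Username: zhangsan
TEL: 15100000000
Source: 192.168.35.35
request_method: GET
status: 200"""

pairs = parse_log(log)
print(pairs)
print(list(pairs))  # the key combination of this log entry
```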
[0064] Taking an XML file as an example again, the following is a sample XML record:
[0065] <person>
[0066] <name>A</name>
[0067] <age>13</age>
[0068] <gender>female</gender>
[0069] </person>
[0070] As can be seen, in XML records, tags and their content can also be regarded as key-value pairs. The specific tags used to record XML records can be determined as needed and are not fixed.
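A minimal sketch of reading such an XML record as key-value pairs, using Python's standard `xml.etree.ElementTree` module, is shown below. The helper name `xml_to_pairs` is an assumption, and the sketch only handles flat records whose child tags are unique.

```python
import xml.etree.ElementTree as ET


def xml_to_pairs(xml_text):
    """Map each child tag of the root element to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


record = "<person><name>A</name><age>13</age><gender>female</gender></person>"
pairs = xml_to_pairs(record)
print(pairs)
```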
[0071] Therefore, semi-structured data can neither be organized into a single file and processed as unstructured data, nor be processed by creating a corresponding structured table. The method provided in this specification, however, can annotate semi-structured data automatically, efficiently and accurately.
[0072] In addition, in one or more embodiments of this specification, in step S100, when generating multiple key templates, the server can also generate multiple key templates based on the labeled semi-structured data. Specifically, the server can first obtain each labeled semi-structured data, then extract the keys from each key-value pair contained in each labeled semi-structured data, determine the key combinations contained in the semi-structured data, and finally, based on the determined key combinations, remove duplicates from each key combination to obtain each key template.
[0073] This specification does not restrict the specific annotation method used for the labeled semi-structured data. Similarly, it does not restrict the specific format in which the semi-structured data is stored: as long as the data can be converted into key-value pairs, the sample generation method provided in this specification can be applied.
[0074] For example, consider semi-structured data in log format, such as the semi-structured data "username: zhangsan TEL: 15100000000", "username: zhang TEL: 15200000000", and "username: san TEL: 15300000000 time: 2022.2.2". The first semi-structured data contains two key-value pairs, "username: zhangsan" and "TEL: 15100000000". The server can extract the keys to obtain the key combination ["username", "TEL"]. The same operation is performed on each semi-structured data to obtain the key combinations ["username", "TEL"], ["username", "TEL"], and ["username", "TEL", "time"]. The server can then deduplicate these key combinations and use the resulting ["username", "TEL"] and ["username", "TEL", "time"] as the key templates.
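The extraction and deduplication described above might be sketched as follows. The records are assumed to have already been converted into dictionaries of key-value pairs, and `templates_from_records` is a hypothetical helper name.

```python
def templates_from_records(records):
    """Extract each record's key combination and drop duplicate combinations."""
    seen = set()
    templates = []
    for record in records:
        combination = frozenset(record)  # key order does not matter
        if combination not in seen:
            seen.add(combination)
            templates.append(tuple(record))  # keep keys in their first-seen order
    return templates


records = [
    {"username": "zhangsan", "TEL": "15100000000"},
    {"username": "zhang", "TEL": "15200000000"},
    {"username": "san", "TEL": "15300000000", "time": "2022.2.2"},
]
templates = templates_from_records(records)
print(templates)
```

Only the keys are used; the values play no role in building the templates.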
[0075] For example, for semi-structured data in XML file format, take the semi-structured data "<logs><user>zhang</user><tel>15200000000</tel><gender>female</gender></logs>", "<logs><user>san</user><tel>15300000000</tel><gender>male</gender></logs>", and "<logs><user>zhangsan</user><tel>15300000000</tel><time>2022.2.2</time></logs>" as an example. The first semi-structured data contains three key-value pairs: "user: zhang", "tel: 15200000000", and "gender: female". The server can extract the keys to obtain the key combination ["user", "tel", "gender"]. The same operation is performed on each semi-structured data to obtain the key combinations ["user", "tel", "gender"], ["user", "tel", "gender"], and ["user", "tel", "time"]. The server can then deduplicate these key combinations and use the resulting ["user", "tel", "gender"] and ["user", "tel", "time"] as the key templates.
[0076] This section uses semi-structured data in log format and XML file format as examples for illustration. The same principle applies to other forms of semi-structured data, and they will not be illustrated here.
[0077] Furthermore, steps S102 and S104 determine the probability that each key in each key template belongs to each label. Then, in steps S106 and S108, for the semi-structured data to be labeled, each key-value pair is labeled with the highest-probability label, based on the probability that each key in the key template matching the semi-structured data belongs to each label. However, among the pieces of data to be labeled that correspond to the same key template, the values of the same key will typically not all correspond to the same label.
[0078] For example, through the annotation process in step S108, in all the semi-structured data to be annotated corresponding to the key template ["username", "TEL", "time"], the "username" key will be annotated as "personal username". However, the values of the "username" key in these data are not necessarily all personal usernames, so not all of them should be annotated as "personal username". According to the statistical results in step S104, only about 70% of the semi-structured data to be annotated are expected to have a personal username as the value of the "username" key.
[0079] Therefore, after annotating the semi-structured data to be labeled through the above steps, most of the annotations are usually correct, but a small portion may be inappropriate. Thus, in one or more embodiments of this specification, the server can also use the annotation samples obtained by annotating the semi-structured data to be labeled according to probability as initial annotation samples, and display these initial annotation samples to the user, allowing the user to correct the initial annotation samples based on the key-value pairs contained in the semi-structured data to be labeled and the initial annotation samples. In other words, the user can determine whether each annotation in the initial annotation sample is appropriate based on the specific content of each key-value pair contained in the semi-structured data to be labeled and the initial annotation samples. If the annotation is appropriate, the user can confirm it without modification. If the annotation is inappropriate, the user can correct the inappropriate annotations and upload the correction results to the server. As can be seen from the above, usually only a small portion of the annotations need to be corrected by the user; that is, by implementing this solution, the workload of manual annotation can be greatly reduced, and annotation efficiency improved. Then, in response to the user's correction of the initial labeled sample, the server determines the corrected labeled sample as the final labeled sample corresponding to the semi-structured data to be labeled. By correcting the initial labeled sample corresponding to the semi-structured data to be labeled, the accuracy of the labeling is improved.
[0080] Figure 2 is a schematic diagram of a display page for showing the initial labeled sample to the user, as provided in this specification. The page in Figure 2 displays the initial labeled sample corresponding to semi-structured data 1. The left side of the page shows the key-value pairs contained in semi-structured data 1: "username: lisi time: 2022.2.2 source: 192.168.25.25 url: index.html". The right side of the page displays, on the same line, the label corresponding to the key in each key-value pair. For example, the label corresponding to the key "username" in "username: lisi" is "personal username", and the label corresponding to the key "time" in "time: 2022.2.2" is "date". The other two key-value pairs are not labeled. Users can check whether the labels are appropriate. If the labels are appropriate, the user can click the rectangular "Confirm" button to upload the confirmation result to the server without further action. If a label is inappropriate, the user can click the rectangular "Modify" button to modify it, and then confirm again to upload the corrected result to the server. The server can then respond to the user's correction of the displayed initial labeled sample and determine the corrected labeled sample as the final labeled sample corresponding to the semi-structured data to be labeled.
[0081] Furthermore, in one or more embodiments of this specification, in step S108, to further reduce user operations, when annotating the semi-structured data to be annotated and determining the target annotation, the server can also use all candidate annotations corresponding to the matching keys as target annotations, that is, annotate the corresponding keys in the semi-structured data to be annotated with all candidate annotations. Subsequently, the display page provided in Figure 2 can show all candidate labels for each key in the semi-structured data to be labeled, along with their corresponding probabilities. This allows users to label the data based on the probability of each candidate label and the type of the entity corresponding to the key. The candidate labels can be displayed as button controls on the page, allowing users to click a label to apply it. The server responds to the user's actions and determines the final labeled sample for each piece of semi-structured data to be labeled.
[0082] Furthermore, in one or more embodiments of this specification, to further reduce the number of annotation choices presented to users, a probability threshold can be preset. When annotating semi-structured data to be annotated and determining target annotations, the server can select, from all candidate annotations corresponding to the matching keys, those whose probability is greater than the preset probability threshold as target annotations. This reduces the number of annotations subsequently shown on the display page provided in Figure 2, making it easier for users to select from the available annotations.
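The threshold-based selection could be sketched as below. The function name `select_candidates`, the threshold value 0.2, and the example probabilities (including the label "other") are assumptions for illustration only.

```python
def select_candidates(candidates, threshold):
    """Keep candidate labels whose probability exceeds `threshold`, highest first."""
    kept = [(label, p) for label, p in candidates.items() if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


candidates = {
    "organizational username": 0.25,
    "personal username": 0.7,
    "other": 0.05,  # hypothetical low-probability label
}
shown = select_candidates(candidates, threshold=0.2)
print(shown)
```

Only the labels that pass the threshold would then be offered to the user on the display page.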
[0083] Furthermore, in one or more embodiments of this specification, in step S108, when annotating the currently unannotated semi-structured data and determining the target annotation, the server may also use the candidate annotations with the highest and second-highest probabilities corresponding to the matching keys as target annotations. Then, from the unannotated semi-structured data, the unannotated semi-structured data that matches the matching key template corresponding to the currently unannotated semi-structured data is determined as unannotated semi-structured data of the same type. For unannotated semi-structured data of the same type, the keys in the unannotated semi-structured data of the same type can be annotated according to the ratio of the probabilities corresponding to the two target annotations.
[0084] For example, assume that the probability that the username key in the matched key template corresponds to the candidate label "personal username" is 0.7, and the probability that it corresponds to the candidate label "company username" is 0.25. Both candidate labels can then be used as target labels. Assuming further that 100 pieces of semi-structured data to be labeled match this key template, the username keys of 100 × (0.7 / (0.7 + 0.25)), that is, approximately 74, of them can be labeled as "personal username", and the username keys of the remaining semi-structured data can be labeled as "company username". Which pieces are labeled as "personal username" can be determined as needed, for example by random selection. This specification does not impose any restrictions on this.
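One possible way to carry out this proportional assignment is sketched below. The function `assign_top_two`, the rounding to the nearest whole record, and the seeded random selection are all assumptions; the specification leaves the selection method open.

```python
import random


def assign_top_two(record_ids, candidates, seed=0):
    """Split records between the two most probable labels in proportion to their probabilities."""
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    (first, p_first), (second, p_second) = ranked[:2]
    # Assumption: round the share of the most probable label to a whole number of records.
    n_first = round(len(record_ids) * p_first / (p_first + p_second))
    rng = random.Random(seed)  # seeded so that the illustration is repeatable
    chosen = set(rng.sample(list(record_ids), n_first))
    return {rid: (first if rid in chosen else second) for rid in record_ids}


candidates = {"personal username": 0.7, "company username": 0.25}
assignment = assign_top_two(range(100), candidates)
n_personal = sum(1 for label in assignment.values() if label == "personal username")
print(n_personal, len(assignment) - n_personal)
```

With 100 records, 100 × 0.7 / 0.95 is about 73.7, so this sketch labels 74 records as "personal username" and the remaining 26 as "company username".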
[0085] Furthermore, in one or more embodiments of this specification, after obtaining relatively accurate final annotation samples of each semi-structured data to be annotated through the above-described method, the server can update the probability of each key in each key template belonging to each annotation based on each final annotation sample.
[0086] Specifically, the server can first determine the key combination contained in the semi-structured data to be labeled based on the key-value pairs it contains, thereby determining the key template that matches that data. Then, based on the final annotation sample corresponding to the semi-structured data to be labeled, the server updates the number of semi-structured data that match the key template. Next, for each key in the matching key template, the server uses the final annotation sample to update, for each candidate annotation, the number of times that key has been given that annotation in the matching key template. Finally, based on the updated number of semi-structured data matching the key template and the updated per-annotation counts for the key, the server updates the probability that the annotation corresponding to that key is each candidate annotation.
[0087] The server can perform this update periodically. It acquires the final labeled samples obtained after labeling and correcting the semi-structured data to be labeled within a given period, which serve as the labeled semi-structured data. Then, based on the final labeled samples corresponding to the labeled semi-structured data acquired within that period, the probability of each key in each key template belonging to each label is updated. This further improves the accuracy of subsequent labeling of other semi-structured data to be labeled based on the updated probabilities.
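The update procedure of the preceding paragraphs can be sketched as a pair of counters per key template. The class name `KeyTemplateStats`, the helper `template_of`, the representation of a key template as a sorted tuple of keys, and the sample labels are assumptions made for this illustration only.

```python
from collections import defaultdict


class KeyTemplateStats:
    """Counts for one key template (hypothetical structure for illustration)."""

    def __init__(self):
        self.matched_count = 0
        # key -> candidate annotation -> number of times it was used
        self.label_counts = defaultdict(lambda: defaultdict(int))

    def update(self, final_sample):
        """Fold one final annotation sample ({key: annotation}) into the counts."""
        self.matched_count += 1
        for key, annotation in final_sample.items():
            self.label_counts[key][annotation] += 1

    def probabilities(self, key):
        """Probability of each candidate annotation = its count / matched count."""
        return {a: c / self.matched_count for a, c in self.label_counts[key].items()}


def template_of(sample):
    """Identify a key template by the sorted tuple of keys (an assumption)."""
    return tuple(sorted(sample))


stats = defaultdict(KeyTemplateStats)
# Final annotation samples collected during one (hypothetical) update period.
final_samples = [
    {"username": "personal username", "phone": "phone number"},
    {"username": "personal username", "phone": "phone number"},
    {"username": "company username", "phone": "phone number"},
    {"username": "personal username", "phone": "phone number"},
]
for sample in final_samples:
    stats[template_of(sample)].update(sample)

template = ("phone", "username")
print(stats[template].probabilities("username"))
```

Keeping raw counts rather than stored probabilities means each periodic update only increments counters, and the probabilities are recomputed from the counts when needed.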
[0088] Of course, when determining target labels, if the candidate label with the highest probability corresponding to the matching key is used as the target label, the target label corresponding to each key in the key template can be determined simply by determining the number of labels corresponding to each candidate label for each key in the key template. Therefore, the server can, based on the final label samples, assign target labels to each key in each key template. That is, the server can determine the key template matching the semi-structured data to be labeled based on the final label samples corresponding to the semi-structured data to be labeled. For each key in the matching key template, based on the final label samples corresponding to the semi-structured data to be labeled, the server updates the number of labels corresponding to that key as candidate labels in the matching key template, and uses the candidate label with the highest number of labels as the target label for that key.
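As a brief sketch of this count-based variant, the target label can be taken directly from the label counts without computing probabilities. The function name `target_label_by_count` and the sample data are hypothetical, and tie handling is not specified in this specification.

```python
from collections import Counter


def target_label_by_count(observed_labels):
    """Return the candidate label that was used most often for one key."""
    counts = Counter(observed_labels)
    label, _ = counts.most_common(1)[0]
    return label


# Labels given to the "username" key across the final label samples
# matching one key template (hypothetical data).
observed = ["personal username"] * 7 + ["company username"] * 3
print(target_label_by_count(observed))  # prints: personal username
```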
[0089] Furthermore, as described in step S104 in one or more embodiments of this specification, even the same key may correspond to different types of values in different business scenarios in semi-structured data. Therefore, when the server obtains the labeled semi-structured data, it can also obtain the server identifier that generated each semi-structured data, thereby determining the business scenario corresponding to each semi-structured data based on the business executed by the server that generated each semi-structured data. Then, it determines each key template under each business scenario and determines the label corresponding to each key in each key template under each business scenario. Subsequently, for the semi-structured data to be labeled, it can first determine which business scenario the semi-structured data to be labeled matches, and then label the semi-structured data to be labeled according to the labeling situation of the key template matched by the semi-structured data to be labeled under the matched business scenario.
[0090] Specifically, it may include the following steps:
[0091] S200: Obtain each labeled semi-structured data, wherein the semi-structured data contains several key-value pairs and the server identifier that generated the semi-structured data.
[0092] The server can record a corresponding server identifier within the semi-structured data when generating it. For example, this can be recorded as a key-value pair, such as "Server:001". The key for the key-value pair recording the server identifier can be a unique identifier, meaning that only this key represents the server that generated the semi-structured data. This is merely an example; the specific method used to record the server identifier for generating the semi-structured data can be determined as needed, and this specification does not impose any restrictions on it.
[0093] S202: Based on the server identifier corresponding to each labeled semi-structured data, determine the business scenario corresponding to each labeled semi-structured data.
[0094] The server can pre-store the correspondence between the server identifiers of each server on the business platform and the business they execute. When the server obtains each labeled semi-structured data, it can determine the business executed by the server that generated each semi-structured data based on the server identifiers of the corresponding records of each labeled semi-structured data and the stored correspondence, thereby determining the business scenario corresponding to each labeled semi-structured data.
[0095] For example, the correspondence can be shown in Table 1 below:
[0096]
Server Identifier    Business Scenario
Server: 001          User query
Server: 003          Unit query
……                   ……
[0097] Table 1
[0098] By querying the correspondence based on the server identifier, the business scenario corresponding to each labeled semi-structured data can be determined. For example, the business scenario corresponding to the server with the identifier 001 is a user query, and the semi-structured data generated by this server is usually data associated with the user. The business scenario corresponding to the server with the identifier 003 is a unit query, and the semi-structured data generated by this server is usually data associated with the unit.
[0099] S204: Based on the business scenarios corresponding to each labeled semi-structured data and the key combinations contained therein, determine the key templates for each business scenario.
[0100] After determining the business scenario corresponding to each labeled semi-structured data in step S202, the key templates for each business scenario can be further determined based on the key combinations contained in the labeled semi-structured data within that scenario. At this point, the same key templates may exist in different business scenarios. For details on how to determine each key template, please refer to the aforementioned explanation.
[0101] S206: For each key template, determine the annotations corresponding to each key in the semi-structured data that matches the key template in the annotated semi-structured data.
[0102] After obtaining the key templates for different business scenarios, the server can determine the business scenario corresponding to each key template. Then, from the labeled semi-structured data, it can determine the semi-structured data that matches the key template in the same business scenario. Based on the labeling of the values in each key-value pair contained in the matched semi-structured data, it can determine the labeling of each key in the key template in that business scenario.
[0103] S208: Based on the server identifier corresponding to the semi-structured data to be labeled, determine the business scenario that matches the semi-structured data to be labeled.
[0104] S210: Match each key template in the matched business scenario with the semi-structured data to be labeled to determine the matching key template.
[0105] S212: Based on the annotations corresponding to each key in the matched key template, determine the annotations corresponding to each key in the semi-structured data to be annotated, and obtain annotation samples, so as to train the natural language processing model based on the annotation samples.
[0106] For semi-structured data to be labeled, the server can determine the business scenario that matches the semi-structured data to be labeled based on the server identifier corresponding to the semi-structured data to be labeled. Then, from the key templates under the matching business scenario, the server can determine the key template that matches the key combination contained in the semi-structured data to be labeled. Based on the label corresponding to each key in the key template, the server can determine the label corresponding to each key in the semi-structured data to be labeled, thus obtaining the labeled sample. The natural language processing model can then be trained based on the labeled sample.
[0107] For example, taking Table 1 above as an example, assuming the server identifier corresponding to the semi-structured data to be labeled is 001, then the business scenario corresponding to this semi-structured data is the "user query" scenario. Therefore, we can further determine the key template that matches the key combination of the semi-structured data to be labeled from the key templates under the "user query" scenario. Based on the labeling of the matching key templates, the semi-structured data to be labeled can then be labeled. For details, please refer to the corresponding explanations above.
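The scenario-aware flow of steps S200 onward can be sketched as follows. This is an illustrative sketch only: the function names, the use of a `frozenset` of keys as the key template, the sample records, and the "personal phone number" and "company phone number" labels are assumptions; the server identifiers and scenarios follow Table 1.

```python
SERVER_KEY = "Server"  # key recording the server identifier, as in "Server:001"

# Correspondence between server identifiers and business scenarios (Table 1).
SCENARIO_BY_SERVER = {"001": "user query", "003": "unit query"}


def key_template(record):
    """Key combination of a record, excluding the server identifier key."""
    return frozenset(k for k in record if k != SERVER_KEY)


def build_scenario_templates(labeled_records):
    """Map (scenario, key template) -> {key: annotation} from labeled data.

    Each labeled record is a pair (record, annotations). For brevity, the
    last annotation seen for a key wins; a probability-based choice could
    be substituted here.
    """
    templates = {}
    for record, annotations in labeled_records:
        scenario = SCENARIO_BY_SERVER[record[SERVER_KEY]]
        templates.setdefault((scenario, key_template(record)), {}).update(annotations)
    return templates


def annotate(record, templates):
    """Label a record using the template matched within its business scenario."""
    scenario = SCENARIO_BY_SERVER.get(record.get(SERVER_KEY))
    return templates.get((scenario, key_template(record)))


# Hypothetical labeled data: the same keys carry different labels per scenario.
labeled = [
    ({"Server": "001", "name": "Alice", "phone": "555-0100"},
     {"name": "personal username", "phone": "personal phone number"}),
    ({"Server": "003", "name": "Acme Ltd", "phone": "555-0199"},
     {"name": "company username", "phone": "company phone number"}),
]
templates = build_scenario_templates(labeled)

to_label = {"Server": "001", "name": "Bob", "phone": "555-0123"}
print(annotate(to_label, templates))
```

Because the lookup key combines the scenario with the key template, the identical key combination {name, phone} yields different labels for data generated by server 001 and server 003. A record whose server identifier or key combination is unknown returns `None` in this sketch.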
[0108] The above describes one or more embodiments of the method for generating samples provided in this specification. Based on the same idea, this specification also provides a corresponding apparatus for generating samples, as shown in Figure 3.
[0109] Figure 3 is a schematic diagram of an apparatus for generating samples provided in this specification, which includes:
[0110] Template determination module 200 is used to determine multiple pre-generated key templates;
[0111] The first matching module 202 is used to determine, for each key template, the number of semi-structured data that match the key template in the labeled semi-structured data, and to determine the labels corresponding to each key in the key template in each matched semi-structured data, as candidate labels corresponding to each key.
[0112] The probability determination module 204 is used to determine the probability that the label corresponding to the key is one of the candidate labels for each key in the key template, based on the number of matched semi-structured data and the candidate labels corresponding to the key.
[0113] The second matching module 206 is used to match each key template with the semi-structured data to be labeled when labeling the semi-structured data to be labeled, and to determine the key template that matches the semi-structured data to be labeled.
[0114] The annotation module 208 is used to determine the annotation corresponding to each key in the semi-structured data to be annotated based on the probability that the annotation corresponding to each key in the matched key template is each candidate annotation, and to obtain annotation samples for training a natural language processing model based on the annotation samples. The natural language processing model is used to identify entity types in the semi-structured data.
[0115] Optionally, the template determination module 200 acquires each labeled semi-structured data, extracts the keys from each key-value pair contained in each labeled semi-structured data, determines the key combinations contained in the semi-structured data, and removes duplicates from each key combination to obtain each key template.
[0116] Optionally, the first matching module 202 determines, based on the key template, each of the semi-structured data in the labeled semi-structured data that matches the key template, and determines the number of matched semi-structured data. For each key in the key template, it determines each key-value pair corresponding to the key in each of the semi-structured data matched by the key template, and determines each candidate label corresponding to the key based on the label of the key in each determined key-value pair.
[0117] Optionally, the probability determination module 204, for each candidate label corresponding to the key, determines the number of labels for the candidate label based on the labels corresponding to the key in the semi-structured data that match the key template, and determines the probability that the label corresponding to the key is the candidate label based on the ratio of the number of labels for the candidate label to the number of matched semi-structured data.
[0118] Optionally, the annotation module 208, for each key-value pair contained in the semi-structured data to be annotated, matches each key in the matching key template with the key in the key-value pair, determines the key in the matching key template that matches the key-value pair, determines the target annotation based on the probability of each candidate annotation being labeled with the matching key, and annotates the key in the key-value pair based on the target annotation.
[0119] Optionally, the device further includes: a correction module 212, configured to display an initial labeled sample after labeling the semi-structured data to be labeled, and in response to the user's correction operation on the displayed initial labeled sample, determine the corrected labeled sample as the final labeled sample corresponding to the semi-structured data to be labeled.
[0120] Optionally, the device further includes: an update module 214, configured to update the number of semi-structured data matching the matching key template according to the final annotation sample corresponding to the semi-structured data to be annotated; for each key in the matching key template, update, according to the final annotation sample corresponding to the semi-structured data to be annotated, the number of times the annotation corresponding to that key in the matching key template is each candidate annotation; and update the probability that the annotation corresponding to the key is each candidate annotation according to the updated number of semi-structured data matching the matching key template and the updated number of times the annotation corresponding to the key is each candidate annotation.
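One way to picture how modules 200 to 208 cooperate is the sketch below, which maps each module to a function. The function names, the use of a `frozenset` of keys as the key template, the choice of the most probable candidate as the target annotation, and the sample data are assumptions for illustration, not the apparatus itself.

```python
from collections import Counter, defaultdict


def determine_templates(labeled):
    """Module 200: deduplicate key combinations into key templates."""
    return {frozenset(record) for record, _ in labeled}


def candidate_probabilities(template, labeled):
    """Modules 202 and 204: count matches and derive per-key probabilities."""
    matched = [ann for record, ann in labeled if frozenset(record) == template]
    counts = defaultdict(Counter)
    for annotations in matched:
        for key, annotation in annotations.items():
            counts[key][annotation] += 1
    # Probability = number of labels for the candidate / number of matched data.
    return {
        key: {a: c / len(matched) for a, c in counter.items()}
        for key, counter in counts.items()
    }


def annotate(record, probs_by_template):
    """Modules 206 and 208: match a template, pick the most probable annotation."""
    probs = probs_by_template.get(frozenset(record))
    if probs is None:
        return None  # no key template matches this record
    return {k: max(p, key=p.get) for k, p in probs.items()}


# Hypothetical labeled data: (record, {key: annotation}) pairs.
labeled = [
    ({"username": "alice", "phone": "555-0100"},
     {"username": "personal username", "phone": "phone number"}),
    ({"username": "bob", "phone": "555-0101"},
     {"username": "personal username", "phone": "phone number"}),
    ({"username": "Acme Ltd", "phone": "555-0199"},
     {"username": "company username", "phone": "phone number"}),
]
probs_by_template = {
    t: candidate_probabilities(t, labeled) for t in determine_templates(labeled)
}
sample = annotate({"username": "carol", "phone": "555-0102"}, probs_by_template)
print(sample)
```

The resulting annotation sample is what would then be shown to the user for correction and, once finalized, used to train the natural language processing model.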
[0121] This specification also provides a computer-readable storage medium storing a computer program that can be used to execute the method for generating samples provided in Figure 1 above.
[0122] This specification also provides a schematic structural diagram of the electronic device, shown in Figure 4. As shown in Figure 4, at the hardware level, the electronic device includes a processor, internal bus, network interface, memory, and non-volatile memory, and may also include other hardware required for business operations. The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it to implement the method for generating samples provided in Figure 1 above.
[0123] Of course, in addition to software implementations, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware. In other words, the execution subject of the processing flow is not limited to individual logic units, but can also be hardware or logic devices.
[0124] In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to the circuit structure of diodes, transistors, switches, etc.) or software improvements (improvements to the methodology). However, with technological advancements, many methodological improvements today can be considered direct improvements to the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved methodology into the hardware circuit. Therefore, it cannot be said that a methodological improvement cannot be implemented using hardware physical modules. For example, a Programmable Logic Device (PLD) (such as a Field Programmable Gate Array (FPGA)) is such an integrated circuit whose logic function is determined by the user programming the device. Designers can program and "integrate" a digital system onto a PLD themselves, without needing chip manufacturers to design and manufacture dedicated integrated circuit chips. Furthermore, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented using "logic compiler" software. Similar to the software compiler used in program development, the original code before compilation must be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. 
Those skilled in the art should understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.
[0125] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, ASICs, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.
[0126] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
[0127] For ease of description, the above devices are described in terms of function, divided into various units. Of course, in implementing this specification, the functions of each unit can be implemented in one or more software and / or hardware components.
[0128] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0129] This invention is described with reference to flowcharts and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block in the flowcharts and / or block diagrams, and combinations of flows and / or blocks in the flowcharts and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, executed by the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0130] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0131] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0132] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0133] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0134] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0135] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0136] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0137] This specification can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This specification can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0138] The various embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so their description is relatively brief; for the relevant parts, refer to the descriptions in the method embodiments.
[0139] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification.
Claims
1. A method for generating samples, the method comprising: Identify multiple pre-generated key templates; For each key template, determine the number of semi-structured data that match the key template in the labeled semi-structured data, and determine the labels corresponding to each key in the key template in each matched semi-structured data, as the candidate labels corresponding to each key; For each key in the key template, the probability that the label corresponding to the key is one of the candidate labels is determined based on the number of matched semi-structured data and the candidate labels corresponding to the key. When annotating the semi-structured data to be annotated, each key template is matched with the semi-structured data to be annotated to determine the key template that matches the semi-structured data to be annotated. Based on the probability that the label corresponding to each key in the matched key template is a candidate label, the label corresponding to each key in the semi-structured data to be labeled is determined, and a label sample is obtained. The natural language processing model is trained based on the label sample. The natural language processing model is used to identify entity types in the semi-structured data.
2. The method as described in claim 1, wherein multiple key templates are pre-generated, specifically including: Obtain the labeled semi-structured data; For each labeled semi-structured data, extract the keys from each key-value pair contained in the semi-structured data to determine the key combinations contained in the semi-structured data. Based on the determined key combinations, duplicate key combinations are removed to obtain key templates.
3. The method as described in claim 1, wherein determining the number of semi-structured data matching the key template in the labeled semi-structured data, and determining the labels corresponding to each key in the key template in each matched semi-structured data as candidate labels corresponding to each key, specifically includes: Based on the key template, determine each semi-structured data in the labeled semi-structured data that matches the key template, and determine the number of matching semi-structured data; For each key in the key template, determine the key-value pairs corresponding to that key in each semi-structured data matched by the key template; Based on the annotations of the keys in each determined key-value pair, determine the candidate annotations corresponding to that key.
4. The method as described in claim 3, wherein determining the probability that the label corresponding to the key is one of the candidate labels based on the number of matched semi-structured data and the candidate labels corresponding to the key, specifically includes: For each candidate label corresponding to the key, the number of labels for the candidate label is determined based on the labels corresponding to the key in the semi-structured data that match the key template. The probability that the label corresponding to the key is the candidate label is determined based on the ratio of the number of labels of the candidate label to the number of matched semi-structured data.
5. The method as described in claim 1, wherein the label corresponding to each key in the semi-structured data to be labeled is determined based on the probability that the label corresponding to each key in the matched key template is a candidate label, specifically including: For each key-value pair contained in the semi-structured data to be labeled, each key in the matching key template is matched with the key in the key-value pair to determine the key in the matching key template that matches the key-value pair; The target label is determined based on the probability that the label of the matched key is each candidate label; Based on the target label, the key in the key-value pair is labeled.
6. The method of claim 1, further comprising: The initial labeled sample is shown after the semi-structured data to be labeled is labeled; In response to the user's correction operation on the initial labeled sample displayed, the corrected labeled sample is determined as the final labeled sample corresponding to the semi-structured data to be labeled.
7. The method of claim 6, further comprising: Based on the final labeled sample corresponding to the semi-structured data to be labeled, update the number of semi-structured data that match the matched key template; For each key in the matched key template, based on the final labeled sample corresponding to the semi-structured data to be labeled, update the number of times the label corresponding to that key in the matched key template is each candidate label; Based on the updated number of semi-structured data matching the key template and the updated number of times the label corresponding to the key is each candidate label, update the probability that the label corresponding to the key is each candidate label.
8. An apparatus for generating a sample, the apparatus comprising: The template determination module is used to determine multiple pre-generated key templates; The first matching module is used to determine, for each key template, the number of semi-structured data that match the key template in the labeled semi-structured data, and to determine the corresponding labels of each key in the key template in each matched semi-structured data, as candidate labels for each key. The probability determination module is used to determine the probability that the label corresponding to the key is one of the candidate labels for each key in the key template, based on the number of matched semi-structured data and the candidate labels corresponding to the key. The second matching module is used to match each key template with the semi-structured data to be labeled when labeling the semi-structured data to be labeled, and to determine the key template that matches the semi-structured data to be labeled. The annotation module is used to determine the annotation corresponding to each key in the semi-structured data to be annotated based on the probability that the annotation corresponding to each key in the matched key template is each candidate annotation, thereby obtaining annotation samples, and training a natural language processing model based on the annotation samples. The natural language processing model is used to identify entity types in the semi-structured data.
9. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in any one of claims 1 to 7.