Automatic labeling of text data

The system addresses inefficiencies in search technologies by enabling user-defined classes and generative models to enhance classification accuracy and reduce resource intensity, improving document retrieval efficiency.

JP7883526B2 · Active · Publication Date: 2026-07-01 · MICROSOFT TECHNOLOGY LICENSING LLC

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
MICROSOFT TECHNOLOGY LICENSING LLC
Filing Date
2022-05-23
Publication Date
2026-07-01

AI Technical Summary

Technical Problem

Current search technologies rely on user-provided keywords and predefined taxonomies, leading to inefficiencies and limited relevance in document retrieval due to vocabulary mismatches and the need for manual labeling, which is resource-intensive and costly.

Method used

A system that classifies text without prior training data, allowing users to define classes as natural language inputs, utilizing generative models to generate semantically rich examples and keywords, reducing the need for manual input and computational resources.

Benefits of technology

Improves classification accuracy and reduces processing requirements by generating semantically rich examples and keywords, enhancing search relevance and efficiency in document retrieval.

✦ Generated by Eureka AI based on patent content.


Abstract

The techniques described herein determine whether a candidate text is in a requested class by using a generative model that may not be trained on the requested class. The techniques may use models trained primarily in an unsupervised mode, without requiring a large number of manual user-entered examples of label classes. The techniques may generate semantically rich positive examples of label text from the candidate text and the labels. Similarly, the techniques may generate semantically rich negative examples of label text from the candidate text and the labels. To generate the generated results, the labeling service utilizes a generative model that estimates the likelihood that a label will be correctly applied to the candidate text. In another aspect, the techniques are directed to a method of obtaining semantically rich examples that resemble the candidate text.

Description

Background Art

[0001] Current search technologies have made great progress in simplicity and ease of use. However, as beneficial as these changes have been, they are generally limited in two important ways. First, these methods rely on the user providing the correct specific nouns or keywords in order to discover highly relevant documents and receive a set of results related to the topic. If the vocabulary and experience of the person writing the query are limited, finding the nouns that the user should have been using from the start can require hours of effort and can still fail. If the user does not know the keywords that appear in the index of the topic to be searched, the user is unlikely to obtain results related to the topic without a painful process of trial and error.

[0002] Second, generally available search technologies usually only scratch the surface of relevant documents. There may be many more relevant documents. However, these documents use different terms, different vocabularies and expressions. Therefore, these documents do not obtain a high score related to the user's query.

[0003] These two limitations are partly the result of the failure of past attempts to label linguistic data. The methods that existed used a taxonomy of nouns that did not conform to user definitions and was not necessarily known or meaningful to the user performing the search. Manual labeling systems have been developed only at significant cost in processing power and user effort, and are therefore usually not available to the search process or its index.

Summary of the Invention

[0004] This summary is provided to introduce, in a simplified form, a selection of concepts that are further described below in the "Modes for Carrying Out the Invention." This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

[0005] The techniques described herein determine whether candidate text belongs to the requested class. This technology can perform this classification without any prior training data or models trained on the requested classes. In fact, the user can define the classes as natural language input rather than selecting from existing classes. The requested classes do not need to follow a hierarchy or be predefined. This technology is effective even when the requested classes are not nouns but rather concepts such as diversity. The requested classes may be described herein as labels. Other text labeling systems have required a certain minimum number of manual user input examples and, in addition, required a lot of computer processing for supervised training of the label classifier. This technology improves upon the state of the art by providing good performance while utilizing models trained primarily in an unsupervised mode, without requiring, for example, a large number of manual user input examples of label classes. Because the input and computer training requirements of the labeling service are far less resource-intensive than usual, the computerized system provides a technological improvement that requires less computer processing to render the results.

[0006] The techniques described herein provide this improved efficiency by receiving candidate text and labels and generating semantically rich positive examples of label text from them. Similarly, the labeling service can generate semantically rich negative examples of label text from the candidate text and labels. To produce a generated result, the labeling service utilizes a generative model that estimates the probability that a label correctly fits a candidate text. The success rate of the classification can be improved, while maintaining this improved efficiency, by obtaining a second generated result from the generative model and using it to estimate the label probability.

[0007] In another aspect, the technology is directed towards a method for obtaining semantically rich examples similar to candidate text. Other solutions to this problem have either provided semantically poor representations of the input data or, otherwise, relied on enormous amounts of manual data to provide training. The technology provides this improvement by, for example, obtaining a set of keywords that reflect the richness of the candidate text within the context of the label. The set of keywords is presented to a search service, and text snippets from search results with good relevance rankings are obtained to provide examples if the label class confidence of the extracted snippet is high.

[0008] In another aspect, the technology is directed towards a method for providing a semantically rich set of keywords from candidate text within the context of a label. Other solutions were semantically poor in their representation, so a large number of returns had to be received from search engines to obtain a certain number of relevant results. The technology improves upon state-of-the-art techniques by, for example, generating a semantically rich set of keywords while providing good performance, thus reducing the amount of data required for training. A set of candidate text preferred keywords is obtained from the candidate text. A set of label preferred keywords is obtained from the label. Embedding vectors are assigned to the preferred keywords by using a transformer-based model. A set of context-aware keywords is then determined from the similarity of the preferred keywords based on the embedding vectors. This context-aware keyword set allows search engines to retrieve semantically relevant information for candidate text within the label's context, thus reducing the amount of search processing required to return a certain number of relevant results.

[0009] The technologies described herein are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals indicate similar elements.

Brief Description of the Drawings

[0010]
[Figure 1] A block diagram of an exemplary labeling system operating environment suitable for the implementation of this disclosure.
[Figure 2] An exemplary representation of a labeling application suitable for implementing aspects of this disclosure.
[Figure 3] A flowchart of a method for providing results based on an estimate of the probability that the label is correctly assigned to the candidate text, according to one aspect of the technology described herein.
[Figure 4] A flowchart of a method for providing candidate input results based on candidate text, according to one aspect of the technology described herein.
[Figure 5] A flowchart of an additional embodiment of a method for providing candidate input results based on candidate text, according to one aspect of the technology described herein.
[Figure 6] A flowchart of a method that provides results based on an extension of a set of class examples, according to one aspect of the technology described herein.
[Figure 7] A flowchart of a method for generating a set of context-aware keywords based on a preferred set of keywords in the context of a label, according to one aspect of the technology described herein.
[Figure 8] A block diagram of an exemplary computer environment suitable for use in implementing aspects of the technology described herein.
[Figure 9] A flowchart of a method for preparing a set of preferred keywords, according to one aspect of the technology described herein.
[Figure 10] A flowchart of a method for calculating similarity, according to one aspect of the technology described herein.
[Figure 11] An exemplary representation of a preferred text keyword structure related to a preferred label keyword structure, according to one aspect of the technology described herein.
[Figure 12] A flowchart of an additional embodiment of a method for providing candidate input results based on candidate text, according to one aspect of the technology described herein.
[Figure 13] A flowchart of a method for determining the correspondence between class labels and text, according to one aspect of the technology described herein.
[Figure 14] A flowchart of a method for determining the correspondence between class labels and text, according to one aspect of the technology described herein.
[Figure 15] A flowchart of a method for extending training data for a classifier, according to one aspect of the technology described herein.
[Figure 16] A flowchart of a method for providing candidate input results based on candidate text, according to one aspect of the technology described herein.

Modes for Carrying Out the Invention

[0011] The various technologies described herein are described with sufficient specificity to satisfy legal requirements. However, this specification itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter may also be embodied in other ways, to include different processes or combinations of processes similar to those described herein, in conjunction with other present or future technologies. Furthermore, the terms “process” and / or “block” may be used herein to denote various elements of the methods employed, but these terms should not be construed as implying any particular order among or between the various processes disclosed herein unless and except when the order of the individual processes is explicitly stated.

[0012] The techniques described herein determine whether candidate text belongs to the requested class. This technology can perform this classification without any prior training data or models trained on the requested class. In fact, the user can define the class as natural language input rather than selecting it from existing classes. The requested class does not need to follow a hierarchy or be predefined. This technology is effective even when the requested class is not a noun but rather a concept such as diversity. The requested class may be described herein as a label.

[0013] The label classification system can provide feedback to the user indicating whether a candidate text is likely or unlikely to match a user-defined label. For example, a business document creation assistance application may receive a user-defined class such as "customer-friendly business-like communication". The candidate text can be a word processing document. In this example, each paragraph of the document can be evaluated as belonging or not belonging to the user-defined class. As output, the word processing application can highlight a paragraph if it is not "customer-friendly business-like communication".
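The paragraph-level check in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: `paragraphs_to_highlight` and `toy_classifier` are hypothetical names, paragraphs are assumed to be separated by blank lines, and the keyword test merely stands in for the labeling service.

```python
from typing import Callable, List


def paragraphs_to_highlight(
    document: str,
    label: str,
    belongs_to_label: Callable[[str, str], bool],
) -> List[int]:
    """Return the indices of paragraphs judged NOT to belong to the label."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [i for i, p in enumerate(paragraphs) if not belongs_to_label(p, label)]


def toy_classifier(paragraph: str, label: str) -> bool:
    """Hypothetical stand-in for the labeling service: polite wording matches."""
    return any(w in paragraph.lower() for w in ("thank", "please", "appreciate"))


doc = (
    "Thank you for your order.\n\n"
    "Pay now or else.\n\n"
    "Please let us know if we can help."
)
flagged = paragraphs_to_highlight(
    doc, "customer-friendly business-like communication", toy_classifier
)
print(flagged)  # indices of paragraphs an editor would highlight
```

Passing the classifier in as a callable keeps the highlighting logic independent of how label membership is actually decided.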

[0014] Other text labeling systems have required some predetermined minimum number of manual user input examples and, in addition, have required a lot of computer processing to perform supervised training of the label classifier. The present technology improves on the state of the art by providing good performance while utilizing a model that is primarily trained in an unsupervised mode, without requiring a large number of manual user input examples for label classes. Because the input and computer training requirements of the labeling service are much less resource-intensive than is typical, the computerized system provides a technical improvement that requires less computer processing to render the results.

[0015] The techniques described herein provide this improved efficiency by receiving candidate text and labels and generating semantically rich positive examples of label text from them. Similarly, the labeling service can generate semantically rich negative examples of label text from the candidate text and labels. To produce a generation result, the labeling service utilizes a generative model that estimates the likelihood that a label is correctly applied to a candidate text. The classification success rate can be improved, while maintaining this improved efficiency, by obtaining a second generation result from the generative model and using the second generation result to estimate the label probability.
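One way to picture how positive and negative examples, and a second generation result, could feed a label probability is sketched below. Everything here is an assumption made for illustration: the source does not specify the scoring formula, so token-overlap (Jaccard) similarity, the `pos / (pos + neg)` normalization, and averaging over two rounds are stand-ins, and `toy_generate` replaces a real generative model with canned output.

```python
from typing import Callable, List, Sequence, Tuple

Examples = Tuple[List[str], List[str]]  # (positive examples, negative examples)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for a semantic measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def label_probability(
    candidate: str, positives: Sequence[str], negatives: Sequence[str]
) -> float:
    """Assumed scoring rule: closeness to positives relative to negatives."""
    pos = max((jaccard(candidate, p) for p in positives), default=0.0)
    neg = max((jaccard(candidate, n) for n in negatives), default=0.0)
    if pos + neg == 0.0:
        return 0.5  # no evidence either way
    return pos / (pos + neg)


def estimate_with_two_results(
    candidate: str, label: str, generate: Callable[[str, str, int], Examples]
) -> float:
    """Average the estimates from a first and a second generation result."""
    estimates = [
        label_probability(candidate, *generate(candidate, label, round_id))
        for round_id in (0, 1)
    ]
    return sum(estimates) / len(estimates)


def toy_generate(candidate: str, label: str, round_id: int) -> Examples:
    """Hypothetical generator returning canned examples per round."""
    canned = [
        (["thank you for your order"], ["pay now or else"]),
        (["we appreciate your patience"], ["your patience is not our problem"]),
    ]
    return canned[round_id]


p = estimate_with_two_results(
    "thank you for your patience", "customer-friendly communication", toy_generate
)
print(round(p, 5))
```

The second round pulls the estimate away from the overconfident first round, which is the intuition behind using more than one generation result.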

[0016] In another aspect, the technology is directed to a method of obtaining semantically rich examples that are similar to candidate text. Other solutions to this problem have provided semantically poor representations of the input data or, otherwise, have relied on vast amounts of manual data to provide training. All of these other solutions have required substantial computer processing to train a model to classify. The technology improves upon the state of the art by providing good performance while generating semantically rich examples without requiring a large number of manual user input examples for label classes. Because the input and computer training requirements of the labeling service described herein are far less resource-intensive, the computerized system provides a technical improvement that requires less computer processing to render the results. The labeling service provides this improvement by obtaining, for example, a set of keywords that reflect the richness of the candidate text within the context of the label. The set of keywords is presented to the search service 164, and text snippets from search results having good relevance rankings are obtained to provide examples if the label class confidence of the extracted snippets is high.
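The keyword-to-snippet step described above can be sketched as follows. The names `harvest_examples`, `toy_search`, and `toy_confidence`, along with the `top_k` and `min_confidence` parameters and their values, are hypothetical; a real system would call the search service and a label scoring service instead of the canned functions used here.

```python
from typing import Callable, List, Sequence, Tuple

SearchHit = Tuple[float, str]  # (relevance score, text snippet)


def harvest_examples(
    keywords: Sequence[str],
    search: Callable[[str], List[SearchHit]],
    label_confidence: Callable[[str], float],
    top_k: int = 3,
    min_confidence: float = 0.8,
) -> List[str]:
    """Keep snippets from the best-ranked hits whose label confidence is high."""
    hits = search(" ".join(keywords))
    best = sorted(hits, key=lambda hit: hit[0], reverse=True)[:top_k]
    return [snippet for _, snippet in best if label_confidence(snippet) >= min_confidence]


def toy_search(query: str) -> List[SearchHit]:
    """Hypothetical search service returning canned, unsorted hits."""
    return [
        (0.9, "We are happy to refund your order."),
        (0.2, "Weather today is sunny."),
        (0.7, "Refunds are not our problem."),
        (0.8, "Thank you for contacting billing support."),
    ]


def toy_confidence(snippet: str) -> float:
    """Hypothetical label class confidence for a 'customer-friendly' label."""
    return 0.9 if ("happy" in snippet or "Thank" in snippet) else 0.1


examples = harvest_examples(["refund", "billing"], toy_search, toy_confidence)
print(examples)
```

Relevance ranking is applied before the confidence filter, so a snippet must both rank well for the query and look like a member of the label class to be kept as an example.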

[0017] In another aspect, the technology is directed to a method of providing a semantically rich set of keywords from candidate text within the context of a label. Because other solutions have been semantically poor in their representations, a large number of returns had to be received from search engines to obtain a certain number of relevant results. This large number of required returns meant high computer processing requirements. The technology improves upon the state of the art by providing good performance while, for example, generating a semantically rich set of keywords and thus reducing the amount of data required for training. A set of candidate text priority keywords is obtained from the candidate text. A set of label priority keywords is obtained from the label. Embedding vectors are assigned to the priority keywords by using a transformer-based model. A set of context-aware keywords is then determined from the similarity of the priority keywords based on the embedding vectors. This context-aware keyword set allows search engines to retrieve semantically relevant information for candidate text within the label's context, thus reducing the amount of search processing required to return a certain number of relevant results.
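A minimal sketch of selecting context-aware keywords by embedding similarity is given below. The two-dimensional vectors in `toy_embeddings` are invented for illustration and stand in for the output of a transformer-based model; the use of cosine similarity, the best-match rule, and the `min_similarity` cutoff are assumptions, since the source says only that similarity is based on the embedding vectors.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_aware_keywords(
    text_keywords: Sequence[str],
    label_keywords: Sequence[str],
    embed: Dict[str, Vector],
    min_similarity: float = 0.5,
) -> List[str]:
    """Keep text keywords that are close to some label keyword, best first."""
    scored = []
    for kw in text_keywords:
        best = max((cosine(embed[kw], embed[lk]) for lk in label_keywords), default=0.0)
        if best >= min_similarity:
            scored.append((best, kw))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [kw for _, kw in scored]


# Invented toy vectors; a real system would obtain these from a model.
toy_embeddings = {
    "refund": [0.9, 0.1],
    "weather": [0.0, 1.0],
    "invoice": [0.8, 0.3],
    "billing": [1.0, 0.0],
}
keywords = context_aware_keywords(
    ["invoice", "weather", "refund"], ["billing"], toy_embeddings
)
print(keywords)
```

Keywords unrelated to the label ("weather" here) drop out, which is what lets the resulting query stay focused on the label's context.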

Definitions

[0018] Labels are typically categories described by a single word / term or by a description of the content requirements on which a model is trained. Labels are also typically categories into which other electronic entities, such as natural language input strings, may be classified.

[0019] Antilabels are typically categories that include electronic entities that do not belong to the class described by the label. In the context of multinomial classes, antilabels include all enumerated classes other than the label class.
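For the enumerated-class case, the definition above reduces to a simple complement, sketched here. The function name `antilabels` and the example class names are hypothetical.

```python
from typing import List, Sequence


def antilabels(all_classes: Sequence[str], label: str) -> List[str]:
    """Return every enumerated class other than the label class."""
    if label not in all_classes:
        raise ValueError(f"unknown label: {label!r}")
    return [c for c in all_classes if c != label]


classes = ["sports", "politics", "technology"]
others = antilabels(classes, "politics")
print(others)
```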

[0020] Custom labels are typically user-defined natural language descriptions entered by the user as indications for the desired label category.

[0021] Labeling services are typically applications that assign labels or label probabilities to electronic items such as natural language strings.

[0022] Label scoring services are typically applications that score candidate natural language input strings to measure the distance of candidates from a label in the context of other possible alternative labels. Generally, a label score can be a measure such as probability and can be used to classify candidates into one or more categories related to the label (such as the label, anti-label, a subcategory of the label, or a subcategory of the anti-label).

[0023] A conversion service is typically a service that takes a term or set of terms and converts them according to operations such as synonym, antonym, and word-form substitution.

[0024] A priority keyword extraction service (e.g., Figure 9) is typically a service that takes a text string, extracts keywords, and orders them (e.g., within a label structure such as a list of keywords ordered in descending order of importance).
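The extract-then-order behavior of such a service can be illustrated with a short sketch. Term frequency is used here only as an assumed proxy for importance, and the stopword list is a toy one; the source does not say how importance is measured. `priority_keywords` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Dict, List

# Toy stopword list (an assumption for this example).
STOPWORDS = {"the", "was", "and", "is", "a", "an", "of", "to"}


def priority_keywords(text: str, top_n: int = 3) -> List[str]:
    """Return keywords in descending order of (assumed) importance."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    first_seen: Dict[str, int] = {}
    for position, token in enumerate(tokens):
        first_seen.setdefault(token, position)
    # Most frequent first; ties are broken by first appearance in the text.
    ranked = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    return ranked[:top_n]


sample = "The invoice was late. The invoice total was wrong and the refund is pending."
top = priority_keywords(sample)
print(top)
```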

[0025] A context-aware keyword extraction service (e.g., Figure 7) is typically a keyword extraction service that represents candidate text within the context of a label.

[0026] A term similarity service (e.g., Figure 10) typically operates on the structure of keywords (such as a graph) and represents term similarity (e.g., through weighted graph linkages between terms in the graph).
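A weighted graph of term similarity can be represented as a nested dictionary, as in the sketch below. The names `build_similarity_graph` and `toy_similarity`, the `min_weight` cutoff, and the canned scores are all assumptions made for illustration; any pairwise similarity function could be supplied.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence

Graph = Dict[str, Dict[str, float]]  # term -> {neighbor term: edge weight}


def build_similarity_graph(
    terms: Sequence[str],
    similarity: Callable[[str, str], float],
    min_weight: float = 0.3,
) -> Graph:
    """Link each pair of terms whose similarity reaches min_weight."""
    graph: Graph = {term: {} for term in terms}
    for a, b in combinations(terms, 2):
        weight = similarity(a, b)
        if weight >= min_weight:
            graph[a][b] = weight  # undirected edge, stored in both directions
            graph[b][a] = weight
    return graph


# Canned pairwise scores standing in for a real similarity measure.
TOY_SCORES = {
    frozenset(("refund", "invoice")): 0.8,
    frozenset(("refund", "weather")): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


graph = build_similarity_graph(["refund", "invoice", "weather"], toy_similarity)
print(graph)
```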

[0027] Search service 164, also known as search / retrieval service, is a search service that typically operates on queries across an entire corpus of documents and returns a relevance-ranked list of documents from the corpus along with text snippets that provide portions of documents particularly relevant to the query.

[0028] Natural Language Processing (NLP) applications are typically computerized applications that act on natural language input, such as speech or text input, to perform computerized operations on a string of natural language inputs.

[0029] Natural Language Generation (NLG) models are typically applications that generate natural language text based on a generation input. The generation input can be, for example, tokens, a sequence of tokens, or several other input mechanisms such as sequences / vectors of numbers. On their own, these systems may not typically be able to perform the function of an unsupervised label classifier. Examples of NLG models include GPT-2, GPT-3, and DeBERTa.

[0030] Generative pre-trained transformer models are typically autoregressive language models that use neural networks based on deep learning.

[0031] Transformer models are typically deep learning models that use an attention mechanism to determine the broad context of an input by relating it to other potentially related inputs.

[0032] A transfer-learning model is a neural network model that learns, at least partially, from a large amount of unsupervised and unlabeled data. Such models can be further refined by data (preferably data from similar domains to the model's application).

[0033] Zero-shot generation mode is typically a mode of a generative NLP model in which the model can generate text without fine-tuning for a specific type of data. A generative NLP model typically receives an input text string as a prompt and produces a generation result, which is text generated in response to that prompt.

[0034] This document describes an unsupervised label classifier that does not typically require user-provided examples of labeled classes, but may utilize user-provided examples to enhance performance.

[0035] Semantic search models are typically learning models, such as deep learning models, that measure the distance in the linguistic semantic space from a query document to another document within a set of documents, and return a measure such as cosine similarity (which represents the proximity of the query document to the document in question within the set). (Examples of semantic search models include DSSM.)

[0036] Having outlined some aspects of the technology described herein, exemplary operating environments in which some aspects of the technology described herein may be implemented are described below to provide a general context for the various aspects.

[0037] Referring next to Figure 1, a block diagram is provided showing an exemplary operating environment 100 in which several aspects of the present disclosure may be employed. It should be understood that this and other arrangements described herein are merely examples. Other arrangements and elements (e.g., machines, interfaces, functions, sequences and groupings of functions) may be used in addition to or instead of those shown, and some elements may be omitted altogether for clarity. Furthermore, many of the elements described herein are functional entities that may be implemented as individual or distributed components, or in conjunction with other components, in any suitable combination and location. Various functions described herein as being performed by one or more entities may be performed by hardware, firmware, and / or software. For example, some functions may be performed by a processor that executes instructions stored in memory.

[0038] Among other components not shown, the exemplary operating environment 100 includes a number of computer devices and components, such as user device 105, server 125, cloud service 199, application service 175, fabric controller 179, server cluster 176, server 177, storage service 180, network 186, and network 103. Each of the components shown in Figure 1 may be implemented via any type of computing device (e.g., computing device 800 as described in relation to Figure 8). These components may communicate with each other via network 103 or network 186, which may include, without limitation, one or more local area networks (LANs) and / or wide area networks (WANs). In the exemplary implementation, network 103 and network 186 each include the Internet and / or cellular networks, among any of the wide variety of possible public and / or private networks.

[0039] In one embodiment, the technology is directed to a computerized system (e.g., shown in operating environment 100) that performs a method of classifying text as either belonging to or not belonging to a user-defined text label. A labeling application 110 within operating environment 100 may present prompts to the user on a display 120. The display 120 may be a visual display or a speaker. A user input device 115, such as a microphone, mouse, or keyboard within device 105, receives input from the user. In some embodiments, this input may be a natural language string that acts as a user-defined text label. In one embodiment, an operating system 107 converts the audio signal input to a text string, and the labeling application 110 receives this text string as input. In one embodiment, the operating system 107 receives keystrokes from the keyboard 115 and provides the text string to the labeling application 110. The labeling application 110 also receives candidate text to be classified from the user in a similar manner. The candidate text may be received by the labeling application 110 from user input or from documents in a corpus of system documents 154. At the end of this process, the labeling application 110 provides classification results, such as an indication on the display 120 that "the candidate text is likely to belong to a user-defined label."

[0040] Computer device 105 and server 125 may be client devices on the client side of the operating environment 100, while server 125, server 177, cloud service 199, application service 175, fabric controller 179, server cluster 176, and storage service 180 may be on the server side of the operating environment 100. Computer device 105 typically includes an operating system 107, a user input device 115 such as a touchscreen sensor or mouse, and a display 120. Importantly, computer device 105 also includes a labeling application 110, which may be, for example, a browser, a plugin, a downloadable application, a search application, an information management system, a dedicated application, a labeling application, a label-assisted search application, a label-assisted classification program, a document creation assistant, an automated compliance application, a customer relationship management application, etc. The labeling application 110 may also be a user interface component that performs one or more of these application functions in conjunction with the applications shown on server 177. In one embodiment, the applications on remote server 177 and the applications on device 105 reside on server 125.

[0041] In one embodiment, the labeling application 110 communicates with components on a remote server 177 that cooperate with the labeling application 110 to perform the functions provided to the user. For example, components cooperating with the labeling application 110 may include a labeling service 142, a label scoring service 168, a term conversion service 144, a search service 164, a preferred keyword extraction service 146, a natural language generation (NLG) model repository 162, a context embedding generation model 158, a context-aware keyword extraction service 148, a vectorization function 156, a term similarity service 152, a corpus 130, a corpus 195, and a corpus 154. These components may be embodied as a set of compiled computer instructions or functions, program modules, computer software services, or arrangements of processes performed on one or more computer systems, such as a computing device 800 as described in relation to Figure 8. Server 177 may include server-side software designed to work in conjunction with client-side software on user device 105 to implement any combination of the features and functions discussed in this disclosure. For example, server 177 may run an information management system for device 105 (managing access to and use of information in a knowledge graph). Server 177 may receive digital assets such as files for storage, documents, spreadsheets, emails, social media posts, user profiles, etc., from numerous user devices belonging to many users. This division of the operating environment 100 is provided to illustrate one example of a suitable environment, and there is no requirement in any given implementation that any combination of server 177 and user device 105 remain as separate entities.

[0042] Computing devices such as user device 105 and server 125 may include any type of computing device available for use by a user. For example, in one embodiment, user device 105 and server 125 may be the type of computing devices described herein in relation to Figure 8. As an example and without limitation, computing devices may be embodied as personal computers (PCs), laptop computers, mobile devices, smartphones, tablet computers, smartwatches, wearable computers, fitness trackers, virtual reality headsets, augmented reality glasses, personal digital assistants (PDAs), MP3 players, global positioning system (GPS) devices, video players, handheld communication devices, gaming devices or systems, entertainment systems, vehicle computer systems, embedded system controllers, remote controls, appliances, consumer electronic devices, workstations, file servers, web servers, application servers, host computers, enterprise servers, server clusters, data centers, search appliances, virtual servers, daemons, mainframes, any combination of these enumerated devices, or any other suitable device.

[0043] This disclosure describes systems and methods that operate without the need for representative labeled data or the assistance of a human grader. They can generate representative data that can be used directly or indirectly to train a natural language processing (NLP) model, or generate a text classification model that can map / classify candidate input text into one or more classes (class labels) of interest.

[0044] Generally, an unbiased text classification model trained on unrepresentative training data (when representative labeled data is unavailable) can claim at best 50% accuracy for binary classification. This is comparable to a human heuristic in which all data is labeled as either the "positive label class" or the "negative label class" in binary label classification mode. This figure is used as a scientific baseline for comparing any candidate model (see: baseline of ROC curves).
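The 50% figure can be checked with a short calculation, sketched here under the assumption of a balanced binary dataset (equal numbers of positive and negative items). With an imbalanced dataset, a constant labeler would instead score the majority-class proportion. The `accuracy` helper and the synthetic labels are illustrative only.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Assumption: a balanced dataset of 50 positive and 50 negative items.
labels = [1, 0] * 50
always_positive = [1] * len(labels)  # heuristic: label everything positive
always_negative = [0] * len(labels)  # heuristic: label everything negative

baseline_pos = accuracy(always_positive, labels)
baseline_neg = accuracy(always_negative, labels)
print(baseline_pos, baseline_neg)
```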

[0045] The technology of this system is more accurate than what is possible with the human heuristic or the unbiased model classification (trained on non-representative data) described above. In addition, results from several experiments have demonstrated better accuracy together with an established recall and / or a defined false positive rate (FPR), which provides decision-makers with more objective information about the model's usefulness in real-world scenarios.

[0046] Referring to Figure 2, the exemplary graphic display 200 shows a user display of a browser application acting as a labeling application 110 for an exemplary labeling service 142 that performs the functions of a Customer Relationship Management (CRM) system. The CRM system corpus 154 contains sales, marketing, and service communications over text, web, and email. The graphic area 202 provides control elements. Initially, the user provides text to define the candidate text. The labeling application 110 receives the candidate text. When the user enters text to define a label into the graphic control 206, the labeling application 110 receives the label text string. The text string defining the label may be a word, term, or description of any concept or idea.

[0047] The labeling application 110 sends two strings (candidate text and label) to the labeling service 142. The labeling service 142 performs the labeling process and provides one or more results to the labeling application 110. The labeling application 110 then updates the graphic display 200 to include the results, such as those displayed in graphic display areas 209, 231, 235, 261, 262, 257, 251, 253, 212, 214, 216, 292, and 204.

[0048] Display area 204 shows the label determined by the labeling service 142. Display area 212 shows a set of context-aware keywords representing candidate texts that would be selected. Display area 292 shows an ordered list of keywords representing antilabels derived from candidate label texts. Display area 214 shows a set of antilabel keywords derived from candidate label texts. Display area 216 shows a set of antilabel keywords derived from candidate antilabel texts. Display area 209 shows an estimate of the probability that the candidate text belongs to a specified label class. In addition, display area 209 shows the label class membership threshold applied to the probability estimate, and it may display a determination result (such as true or false) based on that threshold.

[0049] The labeling service 142 returns candidate label class predictions (1 for true, 0 for false) to provide a binary classification output. Generally, all depicted result data is received from the labeling service 142 by the labeling application 110, which provides it for display on the display 120. Generally, a depicted labeling result is any item of label-related information provided to a component of the labeling system shown in the operating environment 100 for display or use (for example, when the labeling service 142 determines that the candidate text satisfies an acceptance criterion). The acceptance criterion may mean that the estimated probability of the label class is above an acceptance threshold. Since this system does not necessarily require the user to provide any examples of text that correctly classify to the given labels, the depicted results can be provided in an unsupervised manner.

[0050] In one embodiment, the graphic display 200 clears the candidate text from the graphic control 202 to prompt the user to enter additional candidate text. Candidate texts that satisfy the acceptance criterion are placed in the positive example display area 231. By repeating this candidate text input, the user can also build a library of positive and negative example labels with computer assistance, with semantic language processing generating positive examples such as those shown in graphic display areas 231 and 235, and negative examples such as those shown in display areas 261, 262, 257, 251, and 253. This method provides automatic classification of candidates and expands a set of input data to include positive and negative examples and keyword structures. By repeating candidate text input, additional anti-label definitions can be made to reside not only in graphic display areas 224, 222, and 218, but also in additional negative example display areas 267, 277, and 287. The uppercase letters A, B, C, D, E, and F in the graphic display 200 indicate that the anti-label display areas 212, 214, 216, 224, 222, and 218 are subcategories that the labeling service 142 has determined to correspond to the anti-label examples shown in display areas 260, 250, 256, 265, 270, and 280. In this way, a rich set of anti-label subcategories is determined by the labeling system shown within the operating environment 100. The results are then displayed to the user in an intuitive and useful graphic display 200. This display pairs a set of antilabel keywords with corresponding examples and allows the user to provide feedback on the usefulness of the antilabel or of the corresponding example related to the antilabel. Generally, any display area that provides results may have associated controls such as 232.
Exemplary display areas, such as display area 231, are shown with a corresponding graphic control 232 that allows the user to override or provide confirmation that "the adjacent example fits the label assigned by the system." The graphic control 232 may include a prompt with "yes" and "no" checkboxes (e.g., "Is this a good example?"). Alternatively, the display may be radio buttons marked "green" or "good" and, when selected, toggle between "red" or "yellow" to indicate bad and mediocre examples. Graphic controls 236, 259, 252, and 254 are similar to graphic control 232. Graphic control 239 allows all shown positive examples to be confirmed or rejected with respect to display areas 231 and 235. Similarly, graphic control 255 allows display areas 251 and 253 to be confirmed or rejected by a single control.

[0051] The graphic display 200 provides an example of the technology disclosed herein, but the system can also operate to determine class labels for candidate text (e.g., text shown in graphic control 202) when that text comes from different users or from the document corpus 154 (e.g., text from a salesperson's email). The user of the graphic display 200 may be a CRM manager who provides only a minimal definition of the label input (such as "Pleasant, and Business-like") into the graphic control 206. The labeling service 142 may then begin building a library of example labels by searching throughout the documents in the corpus 154 to define the label class and testing written texts. Thus, the graphic display 200 may initially provide a much simpler display that shows the user only the graphic control 206. After the user enters the label into the graphic control 206, several iterations may occur, and the graphic display 200 may display an estimate of the feasibility of the presented label over the corpus, or otherwise provide a set of links to the documents or parts of documents that are closest to the description presented by the user. In addition, label-based document search capabilities may be provided by logically combining the label definitions developed by the user. Once each label classifier has achieved sufficient performance, the classifiers may be placed in the user's library and combined to discover documents that score highly in the context of user-defined label combinations.

[0052] The graphic display 200 presents numerous display areas that allow the user to provide low-level feedback to the system to improve the classifier. The display typically includes an anti-label display area 210, a class definition display area 201, a positive example area 203, and a negative example area 205. Generally, any user input entered in the graphic display 200 results in data being signaled from the labeling application 110 to the labeling service 142.

[0053] In one embodiment, the input text can be of any length. The label may simply be a short sentence or document. The required label is given in terms of a positive class label; a negative class is treated as the absence of a positive class. This may be a word, term, or a description of any concept or idea.

[0054] The technology described herein is broadly applicable. This technology has the potential to empower many systems. For example, one use of this technology is for automated compliance, where tenant administration may require that the entire corporate data corpus (including emails, chats, document repositories, contracts, etc.) be labeled with respect to any concept that may be deemed necessary at the time, and that a timely response is crucial not only for legal but also for business reasons. There are several other applications where such insights are needed (at a scale, speed, purpose, or equity level that could not be achieved by either a single person or a group of human teams for one or more of the following reasons):

[0055] The technology described herein scales efficiently. The technology described herein is intended for enterprise-scale data (including emails, chats, document repositories, contracts, etc.) that is not feasible for any number of people to process manually and objectively for the desired purpose. The technology described has low latency and is therefore capable of efficiently processing large volumes of text input. This technology is intended for applications that require processing large amounts of data that are not feasible for any number of people to process manually and fairly for the desired purpose, and that need to deliver output within a reasonable timeframe considered effective and useful for business and legal purposes.

[0056] The technologies described herein maintain user privacy and confidentiality. Human processing of critical data is vulnerable because several risks are involved in involving human analysts in bulk label classification efforts. In addition to legal and compliance requirements, even with respect to non-corporate data, it may be unwise or even impractical to expose such data to a single user or a team of users.

[0057] The technologies described herein possess excellent impartiality. Human perception is not unique and is often biased or limited by knowledge of given concepts, understanding of given contexts, and knowledge of specific languages. Therefore, such tasks, when performed by various individuals, carry a critical risk of bias that may not be reliable or controlled in a wide range of applications as desired. By exposing users to the keyword context of the technologies described herein, users can modify poorly defined labels or labels that use words that do not accurately mean what the user thought the word meant.

[0058] This technology possesses excellent objectivity. The purpose of this system is not only to predict any candidate label level for any candidate text, but also to provide relevant confidence levels required or otherwise useful in many downstream applications and in the relevant software features authorized by this disclosure. Human perception is generally biased due to the limited understanding of individuals, and therefore cannot generate any objectively defined and auditable confidence levels for perception or specific candidate label levels.

[0059] The technologies disclosed herein possess superior multilingual capabilities. Every human being is limited by their own knowledge and ability to use various languages, and even by their ability to use various concepts within the languages they know. Therefore, the cognition of a single person may be insufficient, and the collective cognition of a group of people may not be consistent across the various combinations of languages, concepts, and expertise.

[0060] The technologies disclosed herein possess excellent auditability and reproducibility. In multiple domains and applications requiring compliance, demonstrating reproducibility and consistency, as well as incorporating objectivity into processes, can be critical. Human recognition-based systems cannot be adopted within these domains and applications.

[0061] The techniques described herein are economically efficient and reliable. Currently, many labeling requirements are applied to a few predetermined labels, and therefore many labeling requirements are achieved at enormous cost, with poor reliability, and on a very limited scale. Typically, competing methods are carried out through paid assignments (analysts, sellers, or human contractors) or through crowdsourcing. Those involved in paid assignments are expensive. Those involved in crowdsourcing are unreliable.

[0062] Figure 3 shows the processing flow of a labeling service 142 that performs a computerized method to render results, such as those shown in display area 209 (the results are sent to the labeling application 110 when the labeling service completes without error messages and provides a reasonable estimate). Generally, it may be advantageous to initialize the labeling service 142 on server 177. As part of the labeling service initialization, an NLG model is loaded into the memory of server 177. In one embodiment, the NLG model is hosted within a cloud service 199 that uses multiple real or virtual servers to provide a large-scale service. Generative NLP models available in repository 162 are loaded (or remain pre-loaded). For better results, larger and more expressive models may be used. The model may preferably be pre-trained (following the concept of transfer learning, in which the model partially learns from large unsupervised, unlabeled data) and further fine-tuned with data, preferably from a domain similar to the application requirements. Examples of such models include, but are not limited to, GPT-3 and Microsoft DeBERTa, preferably models with good zero-shot generation capability (a mode in which the model can generate text without fine-tuning on a particular type of data). The current state of the art (SOTA) in NLP generative models is large-scale (over 10 billion trainable parameters) transformer-based models. This disclosure is not limited to these models, so any available model that can be made compatible with one or more of the scoring mechanisms disclosed herein may be used.

[0063] NLG models taken from repository 162 to perform the steps in label scoring service 168 and employed by labeling service 142 are typically trained across an entire unlabeled natural language corpus. Similarly, NLP models, whether stored within a group of models 158 and used to generate contextual embeddings, or employed to perform transformation service 144 or vectorization 156, are also generally trained across an entire unlabeled natural language corpus. Such models are typically trained by applying token masking techniques. In one embodiment, the NLP or NLG models employed in the service are trained across an entire web corpus, enterprise data corpus, or another corpus. The techniques disclosed herein can be operated with neural network models, non-neural network models, partially pre-trained models, fully trained models, and tuned models, among other models.

[0064] A method for rendering labeling results (e.g., method 300) begins in step 303 when the labeling service 142 serves the display page to the labeling application 110. In step 305, the labeling service 142 receives a text string that defines candidate text from a document in the corpus 154 or from the labeling application 110. In step 310, the labeling service 142 receives a text string that defines a label (e.g., from the labeling application 110). In step 307, if the received label has multiple words, the keyword structure of the label is determined by the labeling service 142 (as illustrated, for example, in Figure 9). That is, for a multi-word label, method 300 uses an additional sub-step to make the label relevant to the rest of the process. In one embodiment, the keyword algorithm is an available extractive text summarization and keyword extraction algorithm. In one embodiment, for illustrative purposes, if the input is "fun and businesslike", the output is the ordered candidate labels ("service", score = 0.6) and ("harmony", score = 0.4), as shown in the label graph 1130 in Figure 11, which also shows the tag display 1160.

[0065] Referring briefly to Figure 9, a computerized method for extracting preferred keywords, which performs step 710, begins in step 903. The method then moves to step 905, where the text to be summarized is received by the keyword extraction service 146. In this example, the label text "fun and businesslike" is received. In addition, the method performing step 710 receives constraints that limit the size of the generated structure. For example, the size constraint could be the maximum number of top keywords to be retained and may be received from the storage service 180. In another embodiment, the size constraint could be a keyword intensity threshold received from the storage service 180. The size constraint is applied later, in step 940, to filter out non-essential terms. In step 910, the text is cleaned and pre-processed so that extraneous characters are removed and the text is ready for further processing. In one embodiment, the text is converted to all capital letters to simplify additional processing. The method then moves to step 915, where the cleaned text is tokenized into terms. In one embodiment, the original expression of the text is converted into a more compact vocabulary through synonyms. In step 920, the terms in the text are vectorized, and the conversion is applied. A vectorization function typically converts a set of terms into a meaningful numerical representation. Examples of vectorization functions include Term Frequency-Inverse Document Frequency (TF-IDF), Global IDF, and Entropy Weighting. In step 925, a threshold on the vectorization metric is used to filter out non-essential terms. The remaining terms are used in step 930 to form the vertices of the graph. In step 935, each vertex (term) in the graph is quantified in terms of its similarity to the other terms in the graph by drawing edges between vertices, with edge weights that represent the similarity between terms.
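
Steps 910 through 930 can be sketched as follows. This is a minimal illustration, not the keyword extraction service 146 itself: the function names, the small stopword list, and the use of plain relative term frequency (instead of TF-IDF, which needs a corpus) are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stopword list; stopword removal is an assumption, not from the text.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}

def clean(text: str) -> str:
    """Step 910: replace extraneous characters with spaces and upper-case the text."""
    return re.sub(r"[^A-Za-z\s]", " ", text).upper()

def tokenize(text: str) -> list[str]:
    """Step 915: split the cleaned text into terms, dropping stopwords."""
    return [t for t in text.split() if t.lower() not in STOPWORDS]

def vectorize(terms: list[str]) -> dict[str, float]:
    """Step 920: relative term frequency as a stand-in vectorization metric."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def graph_vertices(text: str, min_weight: float = 0.1) -> dict[str, float]:
    """Steps 925-930: keep terms at or above the threshold as graph vertices."""
    weights = vectorize(tokenize(clean(text)))
    return {term: w for term, w in weights.items() if w >= min_weight}

sample = "Pleasant, pleasant tone and a business-like reply!"
vertices = graph_vertices(sample, min_weight=0.2)
```

With the threshold at 0.2, only the repeated term survives as a vertex; lowering the threshold keeps more terms and produces a denser graph for step 935.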

[0066] In one embodiment, step 935 utilizes a method 1000 for calculating co-occurrence-based term similarity. The similarity calculation begins in step 1003 and proceeds to step 1005, where a graph of terms is received. In this context, the graph is a graph of connected preferred keywords. In step 1010, a collocated search term count, or "TermDistance", is obtained, or a default value is used. For example, if no input is given, the default "TermDistance" is taken as the square root of the number of terms in the text. In one embodiment, the collocated search term count is an integer between 2 and 10, indicating how many terms should be considered in the search for collocated terms. For example, if a count of 10 is assigned, the collocation search covers the terms from the adjacent term through the ninth neighboring term. The method proceeds to step 1015, where the number of times each pair of terms is collocated within the term distance is found: each vertex (term) in the graph is considered in relation to the other terms in the graph, and the number of times two terms co-occur within the term distance is counted. In step 1020, the co-occurrence frequencies are normalized and scaled by adding 1. In step 1025, each normalized and scaled frequency is assigned as the weight of the graph edge between the two vertices. In step 1030, term importance is calculated as needed. For each vertex (term), the term importance is determined from the normalized score of all edges leaving the vertex. In step 1035, the graph edge weights are returned. The method is completed in step 1097.
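
The co-occurrence calculation of method 1000 might look like the sketch below. Several details are assumptions chosen for illustration: counts are normalized by the largest count before 1 is added, a term distance of N is read as "look at the next N - 1 terms", the default is rounded and floored at 2, and term importance is each vertex's share of the total edge weight.

```python
import math
from collections import defaultdict

def cooccurrence_graph(terms, term_distance=None):
    """Return (edge_weights, term_importance) for an ordered list of terms."""
    if term_distance is None:
        # Step 1010: default is the square root of the number of terms.
        term_distance = max(2, round(math.sqrt(len(terms))))

    # Step 1015: count how often two distinct terms fall within the window.
    counts = defaultdict(int)
    for i, a in enumerate(terms):
        for b in terms[i + 1 : i + term_distance]:
            if a != b:
                counts[frozenset((a, b))] += 1
    if not counts:
        return {}, {}

    # Steps 1020-1025: normalize by the largest count, add 1, use as edge weight.
    max_count = max(counts.values())
    edges = {pair: 1 + c / max_count for pair, c in counts.items()}

    # Step 1030: importance = normalized sum of the weights on a vertex's edges.
    totals = defaultdict(float)
    for pair, weight in edges.items():
        for term in pair:
            totals[term] += weight
    norm = sum(totals.values())
    importance = {term: value / norm for term, value in totals.items()}
    return edges, importance

edges, importance = cooccurrence_graph(
    ["service", "harmony", "service", "tone"], term_distance=2
)
```

Edges are keyed by `frozenset` so that the pair (service, harmony) and the pair (harmony, service) share one undirected edge.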

[0067] In step 940, a size constraint (e.g., a threshold) is applied to filter out non-essential terms. This filter removes weak keywords. In step 945, the keyword structure is output. In one embodiment, the output is the resulting graph structure. The graph may be a subgraph with preferred vertices, their respective edge weights, and vertex scores. In one embodiment, the output is an ordered set of keywords. The method is completed in step 997.
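
The two kinds of size constraint mentioned for step 940 (a maximum number of top keywords, or a keyword intensity threshold) can be sketched in a few lines. The function and parameter names are hypothetical, and the scores reuse the "service" and "harmony" values from the earlier example plus one made-up weak term.

```python
def apply_size_constraint(scores, max_keywords=None, min_score=None):
    """Step 940 sketch: return (term, score) pairs, strongest first, after filtering."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        # Intensity threshold: drop weak keywords.
        ranked = [(term, score) for term, score in ranked if score >= min_score]
    if max_keywords is not None:
        # Count limit: keep only the top entries.
        ranked = ranked[:max_keywords]
    return ranked

scores = {"service": 0.6, "harmony": 0.4, "misc": 0.05}
top_two = apply_size_constraint(scores, max_keywords=2)
strong_only = apply_size_constraint(scores, min_score=0.5)
```

The returned list is already ordered, so it can serve directly as the ordered keyword set output in step 945.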

[0068] Returning to Figure 3, the keyword structure determined in step 307 is stored by the labeling service 142, in particular to assist method 300 in generating an example of a candidate label in step 330. The method proceeds to step 372, where an antilabel structure is generated and stored by the labeling service 142. The antilabel structure is used, in particular, to generate an example of a candidate antilabel in step 345. Many different methods can be employed for antilabel generation. For example, the antilabel “bossy disharmony” shown in display area 212 was generated by inverting the individual keywords shown in display area 292. In addition, the entire set of label keywords can be inverted via a context-aware inversion service, which may be provided by the term conversion service 144, or by state-of-the-art vectorization techniques such as an NLP embedding algorithm that provides antonyms for words used in context. In addition, the set of labels and antilabels stored by the labeling service 142 may be stored in a library along with relevant explicit or implicit user approvals to form separate labeling contexts. This approach makes it possible to index the abstract use of label terms as distinct areas of communication that can be mined to more carefully track and utilize the labeling efforts of users or groups of users with similar or shared linguistic contexts. As an additional inversion technique, a language inversion method focused on the term "businesslike" and found "informal", shown in display area 214. Furthermore, an exemplary embedding-vector antonym lookup function returned the possible antonym "self-focused", shown in display area 216. Since anti-label categories are often complex, method 300 can employ one or all of the anti-labels found to generate several examples, as shown in graphic display 200. Furthermore, similar semantic methods can be applied to the label class to increase the number of candidate synonyms, in order to obtain label classes that are as semantically rich as the described anti-label classes.

[0069] Method 300 proceeds to step 315, which performs a method to expand the input data to include one example (or, in other words, to obtain several examples from candidate labels). The example shown in Figure 3 provides a balanced initial set of one positive example and one negative example when the user provides two or more examples of either labels or antilabels. In step 320, if there are positive examples of available candidate labels, the method proceeds to step 335, where the labeling service 142 receives the positive examples as input to be received by one or more label scoring methods. Similarly, in step 325, if there are negative or antilabel examples provided by the user, Method 300 proceeds to step 340, where the labeling service 142 receives the negative examples as input to be received by one or more scoring methods. If the user has not provided any positive examples, the method proceeds from step 320 to step 330, where an example of a candidate label is generated. Similarly, if the user has not provided any negative examples, the method proceeds from step 325 to step 345, where an example of a candidate antilabel is generated. The methods used to generate label examples in step 330 from label information or to discover antilabel examples in step 345 from antilabel information may follow similar processes but with different inputs.
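
The branching in steps 320 through 345 amounts to "use what the user supplied, otherwise generate". The sketch below assumes hypothetical generator callables standing in for steps 330 and 345; the example strings are made up.

```python
from typing import Callable, Sequence

def gather_examples(
    user_positive: Sequence[str],
    user_negative: Sequence[str],
    generate_positive: Callable[[], list[str]],
    generate_negative: Callable[[], list[str]],
) -> tuple[list[str], list[str]]:
    """Return (positive_examples, negative_examples) for the scoring step."""
    # Steps 320/335: use the user's positive examples; otherwise step 330 generates them.
    positives = list(user_positive) if user_positive else generate_positive()
    # Steps 325/340: use the user's negative examples; otherwise step 345 generates them.
    negatives = list(user_negative) if user_negative else generate_negative()
    return positives, negatives

positives, negatives = gather_examples(
    user_positive=["Happy to help, see the attached quote."],
    user_negative=[],
    generate_positive=lambda: ["generated positive"],
    generate_negative=lambda: ["generated negative"],
)
```

Passing the generators as callables means generation runs only when the corresponding user-provided list is empty.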

[0070] An exemplary method for obtaining positive examples in step 330 involves performing a search across the entire corpus 154 by using ordered keywords derived from the labels and by using at least a portion of the extended method 600 shown in Figure 6. Specifically, the search across the entire corpus 154 is performed in step 620 by using preferred keywords of the labels as queries. In step 625, text snippets are obtained, and the method proceeds to step 630, which quantifies the confidence that the text snippet belongs to the label class. An exemplary method for quantifying class confidence is to construct the keyword structure of the text snippet (for example, by using the method that performs step 710 in Figure 9). The overall semantic similarity between the keyword structure of the text snippet and the label keyword structure may then be evaluated, for example, by cosine similarity based on a vectorization of the graph terms or by other methods provided by the vectorization function 156. Other methods disclosed herein provide a similarity score or an estimate of the probability that the label is correctly applied to the text snippet. If the class confidence is determined to be too low in step 635, the method returns to step 625 to obtain another text snippet, which is quantified in step 630 and tested in step 635. If the class confidence is determined to be sufficient in step 635, the method proceeds to step 645, where the input is expanded to include the sufficient snippet as an example. A similar method is applied in step 345 to generate negative examples that match the antilabels generated in step 372.
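
The loop over steps 625, 630, 635, and 645 can be illustrated as below. The sketch assumes that keyword structures are plain term-to-weight dictionaries, that cosine similarity over those weights serves as the class confidence, and that 0.5 is an acceptable threshold; the toy `keyword_structure` function and the snippets are placeholders.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def first_sufficient_snippet(snippets, label_keywords, keyword_structure, min_confidence=0.5):
    """Return (snippet, confidence) for the first snippet that passes, else None."""
    for snippet in snippets:                                   # step 625
        confidence = cosine(keyword_structure(snippet), label_keywords)  # step 630
        if confidence >= min_confidence:                       # step 635
            return snippet, confidence                         # step 645
    return None

def keyword_structure(text):
    """Toy stand-in for the Figure 9 method: every word gets weight 1."""
    return {word: 1.0 for word in text.lower().split()}

label_keywords = {"service": 0.6, "harmony": 0.4}
snippets = ["quarterly invoice attached", "great service and harmony"]
match = first_sufficient_snippet(snippets, label_keywords, keyword_structure)
```

The first snippet shares no terms with the label and is skipped; the second passes the threshold and would be added to the input as a positive example.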

[0071] Another exemplary method of step 340 for generating candidate antilabel examples involves performing a search across the entire corpus of documents by using the search service 164 with an ordered list of preferred keywords from the label, and then using text snippets of low-ranked entries. With a keyword-based index, this procedure is likely to return entries that are strong matches within the corpus but are included only because the entry matches a shared word that is used in a context completely unrelated to the other words in the label. Text snippets are taken from the lowest-ranked returns (for example, the k-th return with k = 100, which is likely to surface a strong corpus keyword used in a different context). Similarly, queries for antilabel preferred keywords across the entire corpus will return strong keyword matches that do not qualify for the antilabel class. The distances of such low-ranked returns can also provide important information about the separability of the label class from the antilabel class.
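
Picking low-ranked returns as negative candidates is a simple slice over a ranked result list. In this sketch the ranked list is assumed to be ordered best match first, `k` is 1-based, and falling back to the last entry when fewer than `k` results exist is an assumption added for the example.

```python
def low_ranked_snippets(ranked_results, k=100, count=1):
    """Return `count` entries starting at the k-th ranked result (1-based)."""
    if not ranked_results:
        return []
    # If the list is shorter than k, start at its last (lowest-ranked) entry.
    start = min(k, len(ranked_results)) - 1
    return ranked_results[start : start + count]

results = [f"doc{i}" for i in range(1, 151)]  # placeholder ranked search results
negatives = low_ranked_snippets(results, k=100)
short_list = low_ranked_snippets(["a", "b", "c"], k=100)
```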

[0072] Another exemplary method for generating an example involves calculating additional (or balancing) examples when an example is provided by the user. For example, assume that "the example shown in display area 231 in Figure 2 was typed by the user into the graphic control represented in display area 231." In this case, the method in step 315 would proceed to step 335 to receive a positive example. In step 325, the method would proceed to step 345 because there are no available candidate antilabel examples. In this case, the labeling service 142 performs at least part of method 600 (starting in step 603, where the extended method 600 begins) in step 345. The method proceeds to step 605, where the candidate text is received by method 600. The method proceeds to step 610, where it receives the antilabel keyword structure as a representation of the candidate antilabel. In step 615, a set of preferred keywords is prepared for the candidate text. In this example, this would occur by first obtaining the preferred keywords for the positive example by performing the method shown in step 710 in Figure 9 to summarize the positive sample text with a preferred graph. Next, the graph is inverted (for example, the label graph was inverted in step 372). The method then proceeds to perform the method shown in step 615 of Figure 7, starting in step 720 to generate a set of context-aware keywords for the inverted graph in the context of the antilabel. In step 720, embedding vectors of preferred terms for negative text keywords are obtained (for example, from context embedding generation model 158), and only high-priority terms are retained. In step 725, embedding vectors of preferred terms for antilabel keywords are obtained. For example, each term in the inverted text is provided with an embedding vector, and this list is filtered to retain only the preferred terms. In step 730, the similarity between preferred antilabel terms and preferred inverted keywords is obtained. 
This may be obtained by calculating the similarity (for example, the cosine similarity) between the embedding vector of each preferred term in the antilabel and that of each preferred term in the inverted text. In step 735, the contextual importance of the preferred text terms is calculated. In one embodiment, the contextual importance of each summary keyword term is calculated as a normalized weighted average of its similarity to each term in the antilabel, where the weights are the importance scores of the antilabel terms. In step 740, the method determines the context-aware priority from the contextual importance and the keyword priority. For example, the context-aware priority of each summary keyword can be calculated as the normalized product of contextual importance and keyword priority. The basic method for performing step 615 is the same for various inputs, such as positive text and positive labels, except that it usually begins in step 703. The method for performing step 615 may also include, in decision step 705, a test to determine whether the input label has multiple terms; if so, the method proceeds to step 710 and performs the operations described elsewhere before returning to step 715, where a candidate text structure providing preferred text keywords is determined. The method performing step 615 ends at step 797 and, in this example, returns to step 620 of method 600 in Figure 6. Next, the method obtains a set of ranked search and retrieval results in step 620, as described elsewhere, then obtains text snippets from the entries in step 625, and then quantifies the label class confidence of the text snippets in step 630. However, in this case, the method may utilize one of the label scoring methods that performs step 380, based on the antilabel and the text of the positive examples.
Method 600 continues to step 635, and if the class confidence is sufficient, the method proceeds to step 645, where method 600 is completed in this case, and returns to step 380 of method 300, as shown in Figure 3.
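
Steps 730 through 740 can be sketched numerically as follows. The two-dimensional vectors are toy stand-ins for real contextual embeddings, the term names are illustrative, and two choices are assumptions: negative similarities are clipped to zero, and "normalized" is read as scaling scores to sum to 1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def normalize(scores):
    """Scale non-negative scores to sum to 1 (left unchanged if the sum is 0)."""
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total else dict(scores)

def context_aware_priority(keywords, antilabel_terms):
    """keywords: {term: (vector, priority)}; antilabel_terms: {term: (vector, importance)}.

    Assumes at least one antilabel term with positive importance.
    """
    weight_sum = sum(imp for _, imp in antilabel_terms.values())
    importance = {}
    for term, (vec, _priority) in keywords.items():
        # Steps 730-735: importance-weighted average similarity to the antilabel terms.
        weighted = sum(
            imp * max(0.0, cosine(vec, avec)) for avec, imp in antilabel_terms.values()
        )
        importance[term] = weighted / weight_sum
    importance = normalize(importance)
    # Step 740: normalized product of contextual importance and keyword priority.
    return normalize({t: importance[t] * keywords[t][1] for t in keywords})

keywords = {"informal": ((1.0, 0.0), 0.5), "invoice": ((0.0, 1.0), 0.5)}
antilabel_terms = {"casual": ((1.0, 0.0), 1.0)}
priorities = context_aware_priority(keywords, antilabel_terms)
```

Both keywords start with equal priority, but only the one aligned with the antilabel term keeps any weight once context is taken into account.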

[0073] The method for performing step 380 employs one or more scoring methods presented in Method 400 in Figure 4, Method 500 in Figure 5, or Method 1200 in Figure 12. The scoring method typically receives several positive or negative examples, labels, and anti-labels, and scores candidate texts with a probability of the label being present using an NLG model. The method may also utilize GPT-3 for classification with the provided input, based on the generated examples, and then estimate the accuracy of the GPT-3 probability based on the similarity of the labels to previous experience of GPT-3 accuracy.

[0074] This disclosure describes at least four different methods for performing the label scoring process 380. The first, shown as method 400 in Figure 4, is known as the Numeric Class (NC) method. The second, method 1300 in Figure 13, is known as the String Label (SL) method. The third, method 500 in Figure 5, is known as the Search-Score (SS) method. The fourth, method 1200 in Figure 12, is known as the Log Probability (LP) method. In addition, the label scoring methods can be parameterized by a risk parameter that controls how risky the text generation performed with the NLG model is. A single label scoring method performing process 380 can be operated, for example, by setting the risk parameter to produce high-risk, medium-risk, or low-risk generation. Thus, the specified methods can be augmented and operated in parallel: with three risk settings each, the four parameterizable methods can be extended to up to 12. For this reason, the process 380 in Figure 3 denotes the application of label scoring methods in general. Multiple label scoring methods may be operated on the same input, and a result vector (providing the results of two or more label scoring methods) may be obtained. Thus, the label scoring service 168 typically produces a result vector of multiple label scoring methods as described herein. Each of the NC, SL, SS, and LP label scoring methods provides an output probability of the label, an indication of the determined class, an indication that the result is inconclusive, and an explanation of why the result is inconclusive (e.g., generation method failure, extension failure, extension too weak, label scoring method failure, lack of class separation, or inappropriate threshold).
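
The count of 12 follows from crossing the four scoring methods with three risk settings. The sketch below enumerates those variants and shows one possible record for a per-method result; the `ScoringResult` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

METHODS = ("NC", "SL", "SS", "LP")          # the four label scoring methods
RISK_LEVELS = ("low", "medium", "high")     # illustrative risk settings

@dataclass
class ScoringResult:
    """One entry of the result vector returned by a scoring method variant."""
    method: str
    risk: str
    probability: Optional[float] = None      # output probability of the label
    predicted_class: Optional[int] = None    # 1 = label applies, 0 = it does not
    inconclusive: bool = False
    reason: Optional[str] = None             # e.g. "lack of class separation"

# Every (method, risk) combination that could be run in parallel.
variants = list(product(METHODS, RISK_LEVELS))

example = ScoringResult(method="SS", risk="low", probability=0.9, predicted_class=1)
failed = ScoringResult(method="LP", risk="high", inconclusive=True,
                       reason="lack of class separation")
```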

[0075] The setup parameters determine how many of the available models stored within the label scoring service 168 are used for label scoring in process 380, by selecting the desired methods from the label scoring service 168. In one embodiment, the setup parameters are determined based on the characteristics of the label and / or anti-label. In the model selection process, the various label prediction modes are selected. In one embodiment, a single mode, such as the NC method or mode, is the default or standard mode, as used by classification systems that handle training / validation sets of labeled data. However, other scoring systems may equally be selected, each of which has advantages under various conditions.

[0076] In step 380 of Figure 3, the results for the selected mode are generated. For example, when all loaded modes of the NC, SL, SS, and LP methods are performed, the composite output may include the following vector output: [NC: (Service Harmony: 0, Confidence: 0.55), SL: (Service Harmony: 1, Confidence: 0.8), SS: (Service Harmony: 1, Confidence: 0.9), LP: (Service Harmony: 0, Confidence: 0.6)]

[0077] In step 385, the label scoring service 142 stores performance, record estimates, similarity weights, and class labels in a library of known performance. The label scoring service 168 uses a repository of vectors and similarity algorithms to determine whether the label currently being scored is similar to any label for which a labeling method is available in the library. The label scoring service also utilizes a repository of embedding algorithms available within NLP vectorization functions.

[0078] In step 390, weighting is applied if available. Two or more scoring methods may be available for a given model / algorithm. In such cases, the predictions from the various mechanisms may differ, or at least the associated probabilities may differ, and the method needs to reconcile the predictions and associated probabilities. By default, if no subsystem exists, no weighting is available in step 390. To determine the result and an estimate of the label probability, the method uses default weighting or vote-based weighting. If additional information is available, the method incorporates it as weighting in the output evaluation. An example of applying weighting is the case where a previous label is found in the classifier library, and the library shows that "SS search and SL search are twice as likely to produce correct results as other available classifiers"; the weighted response would then be (2*SS+2*SL+LP+NC) / 6 = Service Harmony weighted likelihood (70%).
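The weighted reconciliation can be illustrated with a minimal Python sketch applied to the result vector given in paragraph [0076]. This is not the disclosed implementation. The function name is hypothetical, and the sketch assumes that a confidence reported for class 0 is converted into a likelihood of the label as 1 minus that confidence.

```python
def weighted_label_likelihood(results, weights):
    """Combine per-method results into one weighted likelihood for the label.

    results: {method: (predicted_class, confidence)}, where predicted_class is
             1 for the label and 0 for the anti-label.
    weights: {method: relative weight}; methods not listed default to 1.
    """
    total = 0.0
    weight_sum = 0.0
    for method, (predicted_class, confidence) in results.items():
        # Assumption: confidence in class 0 maps to 1 - confidence for the label.
        p_label = confidence if predicted_class == 1 else 1.0 - confidence
        w = weights.get(method, 1.0)
        total += w * p_label
        weight_sum += w
    return total / weight_sum


# Result vector from paragraph [0076]; SS and SL are weighted twice as heavily.
results = {"NC": (0, 0.55), "SL": (1, 0.8), "SS": (1, 0.9), "LP": (0, 0.6)}
weights = {"SS": 2.0, "SL": 2.0}
likelihood = weighted_label_likelihood(results, weights)
print(round(likelihood, 2))
```

Under these assumptions the sketch evaluates (2*0.9 + 2*0.8 + 0.4 + 0.45) / 6, or about 0.71, in line with the roughly 70% weighted likelihood stated in the example.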

[0079] In step 395, if the performance conditions indicate that usable results have been obtained by the labeling service, one or more results are plotted based on the probability estimates. At this point, the labeling service 142 returns all available results, which are displayed in the graphic display 200 in Figure 2, and the method proceeds to step 397, where new input is awaited. If new input is available, for example additional input from the user, it is displayed.

[0080] The NC method for label scoring is shown as method 400 in Figure 4. The method proceeds from step 315 to step 410, where several examples are formatted as input for the generative model. In terms of its output, this label prediction and scoring system behaves like any standard binary / multinomial classification system. The output of the system is a Boolean / multinomial class indicator (for the Boolean class indicator, the positive class is indicated as 1 and the negative class as 0) together with an associated probability / likelihood. The system prompts the generative model in zero-shot mode with several arbitrary positive and negative sentences that have Boolean / multinomial indexed classes, followed by the input sentence, for which the model is expected to generate a similar Boolean / multinomial class label along with its associated "token probabilities". The associated token probabilities are normalized / scaled by a historical, model-specific range parameter and used as the predicted probability / likelihood. Additional checks are performed to ensure that "the generated text contains the required class labels" before the token probabilities and labels are aligned and the output is generated. If these checks fail, a “NONE” output is sent, indicating that this scoring mechanism has dropped out of the final prediction weighting mechanism in step 390 of method 300. The NC method for label scoring in step 380 uses one number to represent the label class and another number to represent the anti-label class. Thus, for the binary case, label=1 and anti-label=0. In an exemplary embodiment, the NLG model is used in zero-shot mode. For example, the model prompt may be prepared by combining several examples with their respective labels using a sentence class separator. Separate examples are delimited by a sentence break. The prompt then continues with another sentence, followed by a “sentence class separator” and then a “start prediction” prompt.
For the case with one positive generated example and one negative generated example, the prompt may be: [“positive example” “sentence class separator” 1 “sentence break” “negative example” “sentence class separator” 0 “sentence break” “candidate text”].
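The prompt layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the literal delimiter strings and the function name are hypothetical (the text names the roles "sentence class separator" and "sentence break" but not their values), and the sketch ends the prompt with the class separator so that the next generated token is expected to be the class digit.

```python
# Hypothetical delimiter values; only their roles are named in the text.
CLASS_SEP = " => "
SENTENCE_BREAK = "\n"


def build_nc_prompt(examples, candidate_text):
    """Build a Numeric Class prompt.

    examples: list of (text, numeric_class) pairs, 1 = label, 0 = anti-label.
    """
    parts = []
    for text, numeric_class in examples:
        parts.append(f"{text}{CLASS_SEP}{numeric_class}{SENTENCE_BREAK}")
    # Assumption: the trailing separator plays the role of "start prediction".
    parts.append(f"{candidate_text}{CLASS_SEP}")
    return "".join(parts)


prompt = build_nc_prompt(
    [("I'd be happy to help.", 1), ("This is your problem, not mine.", 0)],
    "Thanks for waiting.",
)
print(prompt)
```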

[0081] In step 420, the prompt is applied to a generative model such as GPT-3. The generated text is received together with the "log probability" of each token in the generated text. In step 430, the generated output is searched for the digits "1" and "0". If neither of these digit labels is present, the method fails and an error response is returned to the labeling service 142. If a digit is present, its token probability is obtained from the generated output in step 430. For example, in the binary case, the generated output is searched for the numerical labels 1 or 0, and the token probabilities of the found symbols are used to determine an estimate of the label probability. Token probabilities are combined if necessary and normalized for use as predicted probabilities. In addition, the scoring method may apply its own threshold to determine whether the candidate text belongs to the label. The results of the NC label scoring method are stored, for example, when returned to the labeling service 142.
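A minimal Python sketch of the digit search in step 430 follows. It assumes the generative model returns parallel lists of tokens and per-token log probabilities; the function name is hypothetical, and the model-specific normalization and thresholding mentioned in the text are omitted.

```python
import math


def score_nc_output(tokens, token_logprobs):
    """Return (predicted_class, probability), or None if no digit label is found.

    tokens and token_logprobs are assumed to be parallel lists.
    """
    for token, logprob in zip(tokens, token_logprobs):
        digit = token.strip()
        if digit in ("0", "1"):
            # exp() converts the token log probability back to a probability.
            return int(digit), math.exp(logprob)
    return None  # corresponds to the failure / "NONE" outcome


# Made-up model output: whitespace, then the digit "1" with probability 0.8.
result = score_nc_output([" ", "1", "\n"], [-0.1, math.log(0.8), -0.2])
missing = score_nc_output(["maybe"], [-0.3])
print(result, missing)
```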

[0082] The SL method for label scoring is shown in Figure 16, step 1600. This method proceeds from step 315 to step 1610, where several examples are formatted as input for the generative model. The SL method generally performs the same operations as the NC method, but differs in how prompts are generated. The fundamental difference is that text labels are used rather than numerical labels. Arbitrary concepts represented by multiple words are difficult for NLP systems to understand. For this reason, the method generates a sequence of key keyword-based concepts for use as labels. An arbitrary concept may also be used directly when such labels are used to prompt the model. The output similarly includes the prompted concept, which is later mapped back to the original “arbitrary concept” or term for presentation to the user. Prediction probabilities are determined in step 1630 by finding keywords, or synonyms of keywords, in the generated output and calculating from the token probabilities of those keywords or synonyms. In SL mode, instead of numerical labels, the prompts are built by combining the examples with their respective text labels. For example, if the label class has a preferred keyword list "service harmony" and the anti-label class has a preferred keyword list "disservice disharmony", and the method processes one positive example and one negative example, the prompt might be: ["Let me know if there is anything I can do for you. I'd be happy to help." "sentence class separator" "service harmony" "sentence separator" "This is your problem, not mine." "sentence class separator" "disservice disharmony" "sentence separator" "candidate text"].

[0083] In step 1620, the prompt is applied to a generative model such as GPT-3. The generated text is received together with the "log probability" of each token in the generated text. In step 1630, the generated output is searched for label keywords and anti-label keywords (e.g., "service," "harmony," "disservice," "disharmony," or synonyms thereof). If none of these keywords or their synonyms is present, the method fails and an error response is returned to the labeling service 142. If one of the keywords or synonyms is present, its token probability is retrieved from the generated output in step 1630. For example, in this case, the generated output is searched for the labels "Service" and "Harmony", and the token probabilities of the found "Service" and "Harmony" are used to determine an estimate of the label probability. Token probabilities are combined if necessary and normalized for use as predicted probabilities. In addition, the scoring method may apply its own threshold to determine whether the candidate text belongs to the label. The results of the SL label scoring method are stored, for example, when returned to the labeling service 142. In step 1630, the SL method of the label scoring service 168 searches the generated output text for terms from the candidate labels or anti-labels, or for terms that have very similar meanings / embeddings within the generated context. Otherwise, the SL method operates similarly to the NC method.
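The keyword search of step 1630 might look like the following Python sketch. It is a simplification under stated assumptions: only exact keyword matches are handled (no synonym or embedding lookup), token probabilities are combined by a plain mean, and the function name is hypothetical.

```python
import math

LABEL_TERMS = {"service", "harmony"}
ANTI_LABEL_TERMS = {"disservice", "disharmony"}


def score_sl_output(tokens, token_logprobs):
    """Return (class_name, probability), or None when no keyword is found."""
    found = {"label": [], "anti-label": []}
    for token, logprob in zip(tokens, token_logprobs):
        word = token.strip().lower().strip(".,!?\"'")
        if word in ANTI_LABEL_TERMS:
            found["anti-label"].append(math.exp(logprob))
        elif word in LABEL_TERMS:
            found["label"].append(math.exp(logprob))
    if not found["label"] and not found["anti-label"]:
        return None  # no keyword present: the method fails
    # Assumption: combine the token probabilities of each class by their mean.
    means = {cls: sum(p) / len(p) for cls, p in found.items() if p}
    winner = max(means, key=means.get)
    return winner, means[winner]


# Made-up model output containing both label keywords.
result = score_sl_output(
    [" Service", " harmony", "."], [math.log(0.9), math.log(0.7), -0.1]
)
missing = score_sl_output(["hello"], [-0.2])
print(result, missing)
```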

[0084] The SS label scoring method is shown as method 500 in Figure 5. The SS label scoring method proceeds from step 315 to the embodiment of step 380 shown in Figure 5. This method may use one or more samples of similar text and dissimilar text (preferably anti-text), selected for their similarity to the concept in the input label. The system sends the samples along with the input text to a dedicated search ranking subsystem / model (which provides search rankings for various sentences / texts). Based on the retrieved search samples and search rankings, the method determines the label and label probability for the input text. The search score requires additional processing to be converted into a predicted probability. For most search subsystems, a scaled or normalized search rank / score range may be used as a substitute for likelihood. One special consideration is that the additional subsystems for similar / dissimilar text generation are mechanisms that deal with concepts and search queries and are therefore not analogous to traditional classification systems; consequently, the data generated / retrieved from these systems in its unprocessed / unfiltered form cannot be used directly to train classifiers.

[0085] In step 510, each example in the set of label examples and each example in the set of anti-label examples is used to generate text with the NLG model, and the output results are tagged with the relevant labels. Consider the case where there are two label examples (denoted EX-L1 and EX-L2) and two anti-label examples (denoted EX-AL1 and EX-AL2). The result generated from each example input is denoted by prefixing the example name with "GR-". Thus, in step 510, GR-EX-L1 is generated by applying the generative model to EX-L1. GR-EX-L2 is generated by applying the generative model to EX-L2. GR-EX-AL1 is generated by applying the generative model to EX-AL1. GR-EX-AL2 is generated by applying the generative model to EX-AL2.

[0086] In step 520, the generative model is applied to the candidate text (denoted CT) to obtain the corresponding generated output (denoted GR-CT). The method proceeds to step 530 to calculate search scores for the candidate generated text (GR-CT) against the document set made up of the outputs generated from the examples in step 510. Generally, the idea of the SS method is to use the candidate generated output (GR-CT) as a query in a search engine and to determine whether the results generated from the label examples (GR-EX-L1 and GR-EX-L2) are closer to the query (GR-CT) than the results generated from the anti-label examples (GR-EX-AL1 and GR-EX-AL2). Search result rank is used as the metric for this determination. Generally, a trainable structural / semantic similarity search engine (e.g., Microsoft® DSSM) that measures the semantic distance between a query and its results is preferred. Alternatively, GPT-3 search rank may be used.

[0087] One embodiment involves determining an estimate of the label probability in step 540. To do this, the method uses arbitration rules over the search ranks or scores of the differently labeled documents across the entire document set. The first arbitration rule is to use the label and search rank / score of the document with the best search rank (highest search score). The second arbitration rule is to compute a group heuristic (such as the average) of the search scores for the group of all documents generated from the label examples, and to compare it with the same heuristic for the group of documents generated from the anti-label examples. The third arbitration rule is to shortlist candidate documents based on their search score or rank, and then apply the second rule to this shortlist. For example, assume a rank from highest to lowest of (GR-EX-L1, GR-EX-AL2, GR-EX-AL1, GR-EX-L2). The label class will be selected under rule 1. Next, assume that the search score of the search engine is the cosine similarity between documents in the semantic space, producing the relevance scores (GR-EX-L1=0.5, GR-EX-AL2=0.3, GR-EX-AL1=0.21, GR-EX-L2=0.5). Under the second rule, the label will again be selected. However, with the same search scores, arbitration rule 3 will select the anti-label if a search score threshold of 0.08 is used. Next, the estimate of the label probability is formed by normalizing the search score. Method 500 then returns to the output of step 380 in Figure 3.
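The three arbitration rules can be illustrated with a minimal Python sketch. It is not the disclosed implementation, and the function names are hypothetical. Two assumptions are made so that the sketch reproduces the outcomes described in the example: GR-EX-L2 is given a score of 0.05 (the listed value of 0.5 would conflict with GR-EX-L2 being ranked last), and the group heuristic is the sum of scores rather than the average.

```python
def arbitrate_best(scores):
    """Rule 1: take the class of the single best-scoring document."""
    best = max(scores, key=lambda doc: scores[doc][1])
    return scores[best][0]


def arbitrate_group(scores, threshold=None):
    """Rule 2: compare a group heuristic (here: the sum) per class.

    With a threshold this becomes rule 3: shortlist documents whose score
    reaches the threshold, then apply rule 2 to the shortlist.
    """
    totals = {"label": 0.0, "anti-label": 0.0}
    for cls, score in scores.values():
        if threshold is None or score >= threshold:
            totals[cls] += score
    return max(totals, key=totals.get)


scores = {
    "GR-EX-L1": ("label", 0.5),
    "GR-EX-AL2": ("anti-label", 0.3),
    "GR-EX-AL1": ("anti-label", 0.21),
    "GR-EX-L2": ("label", 0.05),  # assumed value, see the note above
}

rule1 = arbitrate_best(scores)
rule2 = arbitrate_group(scores)
rule3 = arbitrate_group(scores, threshold=0.08)
print(rule1, rule2, rule3)
```

With these assumed inputs, rules 1 and 2 select the label, while rule 3 drops GR-EX-L2 from the shortlist and selects the anti-label (0.51 against 0.5).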

[0088] This label prediction and scoring method requires a model with NLP search and ranking capabilities. This could be a pure NLP generative model or another state-of-the-art search ranking model. In addition, the system requires a text generation or retrieval subsystem that can generate / retrieve text based on specific requirements without prior (user-provided or use-case-specific) training data. In one embodiment, this could be a rule-based web search retrieval system. For standard concepts (which are often repeated), this requirement may be lifted and the subsystem replaced with human-generated candidate search ranking text.

[0089] The LP method for label scoring is shown as method 1200 in Figure 12. The LP label scoring method proceeds from step 315 to the embodiment of step 380 shown in Figure 12. The LP method is also known as the dual-pass generative log-probability-based label scoring method. In this system, in one embodiment, either the NC class index mechanism or the SL label scoring mechanism is used to support a sub-step. In the LP method, instead of asking the system to generate labels, the method duplicates the input text for each possible class index or (string) class label and then asks the system to generate the next text. The generated text may not be used directly; instead, the token log probabilities of the submitted class indices / labels are used, and the method selects the one with the highest log probability after applying a softmax function that rescales these log probabilities so that the resulting probabilities sum to 1.
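The rescaling step can be sketched in Python as a softmax over the per-class log probabilities. The function name and the input values are made up for illustration; how the log probabilities are obtained from the generative model is outside the sketch.

```python
import math


def softmax_from_logprobs(logprobs):
    """Rescale class log probabilities so the resulting probabilities sum to 1."""
    peak = max(logprobs.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {cls: math.exp(lp - peak) for cls, lp in logprobs.items()}
    total = sum(exps.values())
    return {cls: value / total for cls, value in exps.items()}


# Made-up token log probabilities for the submitted class labels.
class_logprobs = {"label": -1.2, "anti-label": -2.3}
probs = softmax_from_logprobs(class_logprobs)
best = max(probs, key=probs.get)
print(best, probs)
```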

[0090] After entering the label scoring method that performs step 380 as shown in method 1200, three paths operate in parallel. In step 1210, method 1200 takes as input a positive example together with the candidate text (for example, by using a sentence concatenation technique that combines text examples and label types). In step 1215, the log probability of the label is determined from this input. In step 1220, method 1200 takes as input a negative example together with the candidate text (for example, by using a sentence concatenation technique that combines the text example and the anti-label type). The method proceeds to step 1225, where the log probability of the anti-label is determined from this input. Similarly, in step 1230, the next text is predicted for all combinations of examples and candidate text (for example, by using a sentence concatenation technique that combines example text and label type). The method proceeds to step 1235, where the log probabilities of key terms / tokens are derived and used as threshold indications. The method proceeds to step 1240, where a test is performed to see whether, given the obtained threshold, the log probabilities of the candidate text associated with the label are sufficiently separated from the log probabilities of the candidate text associated with the anti-label. If the test fails, the method proceeds to step 1245, where an error signal is generated. Otherwise, the method proceeds to step 1250, where the positive and negative probabilities are scaled to generate prediction probabilities, and a prediction is generated in favor of the class with the higher score.
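Steps 1240 through 1250 can be illustrated with the following Python sketch. It assumes the separation threshold is already available as a minimum gap between the two log probabilities (the text derives it from key-term log probabilities in step 1235); the function and parameter names are hypothetical.

```python
import math


def lp_decide(label_logprob, anti_logprob, min_separation):
    """Return ("error", None) when the classes are not separated enough,
    otherwise (predicted_class, scaled_probability)."""
    if abs(label_logprob - anti_logprob) < min_separation:
        return "error", None  # step 1245: error signal
    # Step 1250: scale the two probabilities so that they sum to 1.
    p_label = math.exp(label_logprob)
    p_anti = math.exp(anti_logprob)
    scaled = p_label / (p_label + p_anti)
    if scaled >= 0.5:
        return "label", scaled
    return "anti-label", 1.0 - scaled


too_close = lp_decide(-1.0, -1.05, min_separation=0.5)
decided = lp_decide(-0.5, -2.5, min_separation=0.5)
print(too_close, decided)
```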

[0091] This disclosure describes a system and method for data augmentation of state-of-the-art NLP models. Here, “state-of-the-art” NLP models typically refer to a class of NLP models that have learned to focus more on sentence context and are complex enough to learn many rich representations from a large amount of data. Some examples of such models are transfer learning-based models built on transformer architectures (e.g., BERT, TURING, GPT3, etc.).

[0092] Traditional data augmentation techniques (based on generating perturbations of existing training data using one or more methods) were insufficient. Examples of failed attempts include back translation, which involves translating text from one language to another (perhaps several) and then back to the original language. Such transformations produce different expressions of the same text, perhaps with slightly different word choices, but convey the same meaning. Another attempt that is insufficient on its own is Easy Data Augmentation (EDA). This is a set of simple techniques applied in combination to modify specific words / terms in text using methods such as synonym substitution and random insertion / deletion / swap. These also preserve the idea of the sentence and modify only a few words. Likewise, simply performing NLP sequencing / tone alteration is insufficient. This method alters the order / sequence of words in a sentence. The alteration can be random or involve some simple logic (such as changing from first person to third person) but does not change the idea in question. Simply using embedding-based word / term substitutions was also insufficient: these techniques take word embeddings from NLP vectorization models such as GloVe and Word2Vec, and then select vectorially close / similar representations (or inverse vectors for antonyms) of several words to modify some words in a sentence.
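To make the limitation concrete, the following Python sketch shows two EDA-style perturbations, a random swap and a synonym substitution. The function names and the toy synonym table are hypothetical. Both operations change surface wording only, which is why such techniques add no new ideas to the training data.

```python
import random


def random_swap(words, rng):
    """Swap two randomly chosen positions; the set of words is unchanged."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out


def synonym_substitute(words, synonyms):
    """Replace each word that has an entry in a (toy) synonym table."""
    return [synonyms.get(word, word) for word in words]


rng = random.Random(0)  # seeded for repeatability
sentence = "i would be happy to help".split()
swapped = random_swap(sentence, rng)
substituted = synonym_substitute(sentence, {"happy": "glad", "help": "assist"})
print(swapped)
print(substituted)
```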

[0093] Traditional data augmentation techniques are not helpful for modern state-of-the-art NLP models. These traditional techniques are not suitable for augmenting the training data of such models for the following reasons. First, in-context models (e.g., BERT, Turing, etc.) are transformer-based and, from their pre-training phase, already know various formulations of the same word; if some words are replaced by synonyms or terms with similar embeddings, the change gives these models few new ideas from which to learn something new. Second, these large / state-of-the-art NLP models are usually immune to perturbations based on random insertion and deletion, as most of them learn during pre-training to predict masked terms. Third, most of these models are multilingual and therefore work on vector representations of multiple languages in a homogeneous vector space; they are thus immune to translation-based ideas. Fourth, these large models are context-aware, and therefore non-context-aware transformations (e.g., replacing "lay" in the context of "lay-egg" with the synonym "lie") can even degrade the performance of these models. Fifth, these models have billions of trainable parameters and therefore require a "rich" corpus of training data. Here, "rich" refers to both the quantity and the significant diversity of ideas within the same label class ("context"). The insufficient techniques mentioned above cannot on their own generate large amounts of training data and may also fail badly at generating data with a variety of ideas (within the same label class). Sixth, there is the problem of bias towards similar ideas. These large models learn from representations of ideas in text, and if the same idea is repeated multiple times (as happens with traditional augmentation techniques), the model is likely to overfit to that idea and perform poorly on texts with different ideas within the same context.
Seventh, large amounts of training data are required for effective learning. While traditional NLP models require thousands of training samples to saturate their learning needs, state-of-the-art NLP models require millions of labeled samples across a given context to learn the various styles in which the diverse ideas underlying that context are expressed. Manual retrieval and grading of this data can therefore be extremely expensive.

[0094] Alternative approaches for data-scarce settings, intended to mitigate the data augmentation requirements of state-of-the-art NLP models, are equally inadequate. Obtaining large-scale "augmented" data with the richer "context" and "ideas" within text needed to effectively train large-scale, state-of-the-art NLP models is challenging. The techniques currently used to mitigate the data augmentation challenges of these models are as follows:

[0095] First, non-scalable and costly methods were insufficient. These include Manual Data Source Scavenging and Grading, which is the most powerful method for obtaining (not exactly augmenting) data for training state-of-the-art models. In this method, based on the contextual requirements (label class specifications), data is obtained from several diverse sources, and each sample from these sources is then graded manually or via crowdsourcing. Second, scalable but less effective methods are also insufficient, such as Few-Shot Classification. In this method, an NLP model (usually a transformer-based model) is pre-trained on a large corpus of unlabeled "web" or "corporate" data. This lets the model learn from real human-generated data with richer "context" and "ideas" (compared with the synthetic traditional augmentation techniques described above). Such data is unlabeled. However, even with just a few samples of labeled data, such models have been found to perform far better than traditional models trained on the same training data combined with augmentations generated from that training data. Another scalable but inadequate technique is zero-shot learning. In this method, a very large NLP model (billions of parameters, e.g., GPT-3) is trained to generate text (as opposed to classifying text) on even larger unlabeled training data. The assumption is that if a few available training samples are used as prompts to generate text, the model is likely to function as a pseudo NLP classification model, thus reducing the need to train on large labeled training data.

[0096] Returning to Figure 2, the graphic display 200 also includes graphic controls 293, 294, and 295. These controls may be used, for example, to assist the user in performing a set of operations across data items used or generated by the labeling service 142. Such controls may be used for electronic items such as labeling criteria, document corpora, criteria change logs, labeling performance logs, labeling indexes, and labeling indexers. Electronic items are typically stored, retrieved, modified, and displayed by the labeling service 142 using the memory of storage 180 or server 177. As used herein, “labeling criteria” typically refers to a set of data items that together enable the labeling service 142 to make a model-based determination of whether the label correctly belongs to a new candidate. A "document corpus" is typically the set of documents from which new candidates are drawn for determination against the labeling criteria. The "Criteria Change Log" is typically a record of the addition and deletion of data items to and from the labeling criteria. The "Labeling Performance Log" is typically a record of events related to the labeling criteria, which may include indications of dissatisfaction, such as the frequency of rejections, as well as the average confidence of manually added examples, the average confidence of recently added candidates, the average confidence of rejected candidates, the standard deviation of any of these statistics, or the success rate of the labeling criteria on a set of managed documents for which the label has been managed and verified. When several examples are manually added, the labeling service 142 may run the labeling criteria on the entries before adding them to obtain an estimate of the accuracy of the labeling criteria, and may incorporate these estimates into the average confidence of recently added candidates.
The "Labeling Index" is typically a record that shows the portions of the document corpus to which the label correctly applies. The "Labeling Indexer" typically refers to an application function that builds a labeling index of the document corpus and tracks which documents in the corpus have been scanned for labeling.

[0097] Graphic control 293, when selected, provides a dropdown menu that allows the user to perform content management-related operations (e.g., save labeling criteria, load labeling criteria, define a corpus associated with labeling criteria, define a logical combination of labeling criteria, close labeling criteria, open a new labeling criterion, load recently used labeling criteria, etc.). The "Define a logical combination of labeling criteria" function allows two or more predefined labeling criteria to be logically combined to form a new labeling criterion. For example, three labeling criteria defining poor customer service could be logically combined via the OR function to identify the portion of communications that have at least one of these labels. As another example, someone searching for four specific plot elements in a movie database could generate labeling rules for each plot element and then find plots containing at least two of the plot elements by applying the logical combination function to each pair of plot elements, generating a combined rule that defines labeling criteria from the six logical combinations of pairs.
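The pairwise combination in the movie example can be sketched in Python as follows. The function names are hypothetical, and each labeling criterion is reduced to a boolean result for one document. Four criteria yield six pairs, and a document matches the combined rule when any pair matches on both criteria.

```python
from itertools import combinations


def at_least_two(matches):
    """matches: one boolean per labeling criterion for a single document.

    Builds the AND of every pair of criteria and ORs the pairs together.
    """
    pair_rules = combinations(range(len(matches)), 2)
    return any(matches[i] and matches[j] for i, j in pair_rules)


def any_label(matches):
    """OR combination, as in the poor customer service example."""
    return any(matches)


pair_count = len(list(combinations(range(4), 2)))
two_elements = at_least_two([True, False, False, True])
one_element = at_least_two([True, False, False, False])
print(pair_count, two_elements, one_element)
```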

[0098] When selected, graphic control 294 typically provides a dropdown menu that allows the user to perform operations related to the development, operation, and analysis of the loaded labeling criteria and their history: view change log, view performance log, index corpus by labeling criteria, manually expand labeling criteria, import new examples, set index granularity, set label threshold, expand labeling criteria examples, expand anti-labels for labeling criteria, expand labels for labeling criteria, expand all parts, etc. A "manual mode of label expansion" may be provided by graphic display 200 by clearing content to present an empty graphic control such as 235 within display area 203. After the user completes text input, the new text is added to the set of positive examples in the confirmed state. Alternatively, selecting the manual mode of label expansion may provide a traditional keyword index search engine that operates on the document corpus but provides a control adjacent to each text snippet in the ranked results returned. When the user selects a control to indicate a positive or negative example, the text snippet is added to the labeling criteria along with the appropriate designation. The "Import Samples" function can take a predefined dataset containing examples marked as positive and negative and incorporate the dataset into the labeling criteria. For example, a user who has performed a manual search or entry may send an email with an attachment (perhaps an attachment containing those examples stored within a labeling criteria structure, even if it does not have any label definitions). Once the labeling criteria file is saved locally, it can be selected with a file browser to import its examples into another labeling criterion.
The "Set Index Granularity" function defines the amount of text that makes up a candidate text: a sentence, a paragraph, several words, or a document. The "Set Index Granularity" function also allows users to define how precisely the location of positive label indications is recorded. For example, document-level precision would record that a document tested positive for a label, but only one indication per document would be recorded. The "Expand Labeling Criteria Examples" function typically provides a computer-implemented expansion of the available examples that reflects the richness of the current examples in the context of the label. The "Expand Anti-Labels for Labeling Criteria" function works like the example expansion, but instead of simply adding examples, alternative anti-label keyword structures are added to the anti-label area 210 in addition to or instead of additional examples. The "Expand Labels for Labeling Criteria" function works like the example expansion, but instead of simply adding examples, alternative label keyword structures are added to the label definition display area 201 in addition to or instead of additional examples. In one embodiment, a set of label keyword structures is presented to the user in a display area such as the anti-label display area 210 to provide an alternative label set for the keywords found.

[0099] Graphic control 295 is typically a function activation control that enables one of the labeling service operations to be performed for the user. Selecting graphic control 295 performs this function immediately. In one embodiment further described herein, graphic control 295 is assigned to the function of “expanding the examples of the labeling criteria.” The user might select such a control after receiving a new set of 10 positive and 10 negative examples manually entered by a colleague and importing the new examples into the labeling criteria shown in graphic display 200. Another reason might be that the user has changed the definition of the corpus to which the rules are applied, so that the accumulated examples can be used to expand the classified examples in the context of the new corpus. For example, suppose the user initially defined the document corpus as “sales emails,” which are likely to exhibit a high level of customer service. If the document corpus is changed to a “technical support” corpus, the user may be able to obtain a more balanced set of negative examples, and is likely to discover different and richer examples. Because the number of samples in the input labeling criteria is extremely small (compared to the requirements of state-of-the-art large-scale transformer-based NLP models), such models are unlikely to be trained effectively on them. These large-scale state-of-the-art models require highly rich and diverse training data, in terms of the diversity of ideas needed to holistically represent the contextual requirements. It is usually not possible to represent such richness with only small datasets. Often, smaller labeling criteria datasets do not have sufficient data diversity and richness for a model to effectively learn the required contextual representation holistically.

[0100] The disclosed method 600 augments this dataset with sufficiently rich data across both classes, drawing on a well-selected, human-generated corpus. The resulting augmented data therefore represents real-world scenarios and is noise-tolerant. As a result of the augmentation, a model derived from the labeling criteria has improved stability and relevance. The method disclosed herein can generate a labeling criteria dataset suitable for training state-of-the-art large-scale NLP models.

[0101] This system augments extremely small datasets with remarkably rich diversity. The output dataset is enriched not only with richer representations of individual words, as a thesaurus would provide, but also with new ideas about contextual requirements. The output dataset discovers people / business-generated data in a contextually conscious manner. The augmentation methods disclosed herein do not simply randomly replace words / terms / translations / generations, but holistically discover new ideas about specific contextual requirements provided by label descriptions. The presented augmentation methods operate in a noise-tolerant manner. The generated augmented datasets can be used directly to train large and state-of-the-art NLP models. In addition, in contrast to zero-shot / few-shot classification techniques that require existing (unlabeled) datasets to be classified, the disclosed methods satisfy both augmentation and pre-classification requirements. The disclosed augmentation methods automatically and intelligently acquire and extract data samples within the correct dataset subset that is ready for any classification model.

[0102] After the user selects graphic control 295 to invoke the extension, the labeling service 142 receives a control signal from application 110 and, accordingly, performs an extension operation involving extension method 600. Typically, method 600 receives several positive and several negative examples representing a particular contextual requirement. Method 600 receives the examples and labels in the labeling criteria and performs an extension operation that extends the set of examples based on them. The output of the extension invoked by selecting graphic control 295 is typically an improved labeling criteria with a larger, contextual-requirement-aware dataset containing more diverse positive and negative class-specific data samples. That is, the set of examples contains diverse ideas about the required context, even if these ideas are not present in the very small set of input samples. In addition, the generated samples are non-synthetic: they are not produced merely by locally perturbing strings with a generative model. This dataset is ideally suited for training state-of-the-art large-scale NLP models, which require large amounts of rich data that currently must be acquired and graded manually.

[0103] The extension method typically receives a set of examples, such as a set of currently defined examples within a labeling criterion (for example, by receiving the labeling criterion from storage service 180). The extension method then loops through the set of examples, taking one example and associated label at a time. In one embodiment, the selected label is an antilabel associated with a negative example or a label associated with a positive example. If multiple available labels exist (for example, if there are several available antilabels), multiple combinations of labels and examples may be used. In another embodiment, the label is randomly selected from a set of available labels of the same class.

[0104] Once an example and a label are selected, method 600 begins the extension method in step 603. In step 605, method 600 receives candidate text from the current example. In the exemplary case, the previously classified sample “I would be happy to help you with your sprocket order,” shown in graphic control 202, is classified as a positive example and is therefore received by method 600. In step 610, method 600 receives an ordered list shown in graphic control 292, consisting of input labels such as a graph corresponding to the input label shown in graphic control 206, or the list “Service Harmony.”

[0105] In step 615, a set of preferred keywords is prepared. In this step, summary keywords are extracted and the strength of each preferred keyword is calculated in a context-aware manner, that is, with awareness of the contextual requirements within the label description. This context-aware set of keywords is obtained for both negative and positive examples. Generally, the descriptive label text input and the candidate text are each a raw text string containing multiple terms. The labeling criteria may store the preferred keywords for candidate-label pairs. In this case, method 600 receives the preferred candidate-label keywords from storage service 180 to prepare the set of preferred keywords. Alternatively, keyword summary structures of the candidate text and / or labels may be available within the labeling criteria. These structures are received from storage service 180 if available. The method for performing step 615 begins in step 703 and proceeds to step 705. If the label structure is not available from storage service 180, a decision test determines whether the label contains multiple words. Many contextual requirements cannot be explained in a single term. More complex labeling ideas require a collection of ideas. Modern NLP, using large-scale state-of-the-art transformer-based models, excels at generating rich models that can intelligently classify such data. However, these models also require rich training data to holistically learn the fundamental concepts under the diverse representations of the various ideas that make up the conceptual requirements. Since not all fundamental ideas can be holistically represented within very small input data samples or in a single-term labeling requirement, contextual requirements are expressed as label descriptions instead of single-term labeling requirements.

[0106] If the label contains multiple words, the method proceeds to step 710. The summary keyword structure is determined from the input label description, as explained by the method for performing step 710 in Figure 9. The method then proceeds to step 715, where a candidate text structure providing preferred text keywords is obtained. Step 715 proceeds in the same way as step 710, but with a different input text (i.e., the candidate text) to summarize. For example, the candidate text "I would be happy to help you with your sprocket order" can be determined to yield a list of meaningful keywords such as [helping, community-focus, happy, customer, sprocket]. An ordered list of preferred keywords with their priorities is [(helping,0.35),(community-focus,0.35),(happy,0.2),(customer,0.1)]. The resulting graph is illustrated in structure display 1100 of Figure 11 as candidate graph 1110, with helping vertex 1112, community-focus vertex 1114, happy vertex 1116, and customer vertex 1118. The tag display 1160 shows that the social value tag is assigned to helping, community-focus, and service; the people tag is assigned to customer; and the feeling / sentiment tag is assigned to happy and harmony. The graph structure shown provides richer terminology and a richer ordering description that includes not only order but also strength and similarity. Tags, linkages, and directions are available for richer query construction in the subsequent process. When the keyword summarization method that performs step 710 is completed, the priority keywords are [helping, community-focus, happy, customer]. In one embodiment, a different size Z_base is used in this method when the candidate text is summarized. The method proceeds to step 720, where embedding vectors for the preferred terms of the text keywords are obtained. In step 725, embedding vectors for the preferred terms of the label keywords are obtained. For example, each term in the candidate text is provided with an embedding vector, and the list is filtered to retain only the preferred terms. In step 730, the similarity between the preferred label terms and the preferred candidate keywords is obtained. This may be obtained by calculating a similarity measure (e.g., the cosine similarity between the embedding vector of each preferred term in the anti-label and that of each preferred term in the inverted text). In step 735, the contextual importance of the preferred text terms is calculated. In one embodiment, the contextual importance of each summary keyword term is calculated as a normalized weighted average of its similarity to each term in the label, where the weights are the importance scores of the label terms. In step 740, the method determines the context-aware priority from the contextual importance and the keyword priority. For example, the contextual priority of each summary keyword can be calculated as the normalized product of contextual importance and keyword priority. In this example, the contextual priority keywords are "helping, happy, customer". The calculation of the contextual priority keywords is completed in step 797, and the method returns to step 620 in Figure 6.
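The calculation in steps 730 to 740 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the two-dimensional embedding vectors, the label keyword weights, and the choice to clamp negative similarities to zero are all assumptions made for the example, and the function names are hypothetical. Only the text keyword priorities are taken from the example above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def normalize(scores):
    """Scale a dict of non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {k: (v / total if total else 0.0) for k, v in scores.items()}

def contextual_priority(text_keywords, label_keywords, embeddings):
    """Steps 730-740: combine keyword priority with similarity to the label.

    text_keywords / label_keywords map each term to its priority (weight).
    """
    weight_sum = sum(label_keywords.values())
    importance = {}
    for term in text_keywords:
        # Step 730/735: weighted average of similarities to the label terms.
        # Clamping negative similarities to 0 is an assumption of this sketch.
        weighted = sum(
            w * max(cosine(embeddings[term], embeddings[label_term]), 0.0)
            for label_term, w in label_keywords.items()
        )
        importance[term] = weighted / weight_sum
    importance = normalize(importance)
    # Step 740: normalized product of contextual importance and priority.
    return normalize({t: importance[t] * text_keywords[t] for t in text_keywords})

# Toy vectors chosen purely for illustration.
toy_embeddings = {
    "helping": [1.0, 0.0], "happy": [0.8, 0.6], "customer": [0.6, 0.8],
    "community-focus": [-0.6, 0.8], "service": [1.0, 0.0], "harmony": [0.8, 0.6],
}
text_kw = {"helping": 0.35, "community-focus": 0.35, "happy": 0.2, "customer": 0.1}
label_kw = {"service": 0.5, "harmony": 0.5}  # hypothetical label weights

scores = contextual_priority(text_kw, label_kw, toy_embeddings)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these toy vectors, a keyword that is prominent in the text but unrelated to the label (here "community-focus") drops to the bottom of the ranking, which is the intended effect of weighting keyword priority by contextual importance.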

[0107] In step 620, a set of ranked search results is obtained. A search is performed by search service 164 across the entire labeled document corpus, such as corpus 154, using the context-aware keywords as the query. A number of top-ranked search results are obtained from search service 164. For example, if the term “context-aware” is used to search for relevant documents on a given search engine, the top n (e.g., n=10) search results are retrieved. Most search engines also generate text snippets indicating why they consider the retrieved search results relevant to the query. In step 625, the method collects these snippets to extend the database. One embodiment uses an API version of the search engine. Another embodiment uses a client version of the search engine to retrieve the top N search results and extract their respective snippets (in step 625). An exemplary ordered set of context-aware keyword terms for input to this step is “helping, happy, customer”. This input can be further enriched based on class requirement prompts (i.e., prompts to ensure that both positive sentences and negative sentences are generated). For example, graphic control 236 may prompt the user to review the positive examples found. A prompt in graphic control 252 may prompt the user to review the negative examples found.

[0108] In step 625, a text snippet is obtained, and the method proceeds to step 630, which quantifies the confidence that the text snippet belongs to the label class. An exemplary method for quantifying class confidence is to construct the keyword structure of the text snippet (for example, by using the method that performs step 710 in Figure 9). Exemplary methods for evaluating the overall semantic similarity between the keyword structure of the text snippet and the label keyword structure include cosine similarity based on a vectorization of the graph terms, or other methods provided by the vectorization function 156. Other methods disclosed herein provide a similarity score or an estimate of the probability that the label is correctly applied to the text snippet. If the class confidence is determined to be too low at decision step 635, the method indicates the failure in step 640 by recording the failed snippet in the storage service 180 and returns to step 625 to obtain another text snippet, which is quantified in step 630 and tested in step 635. If the class confidence is determined to be sufficient at decision step 635, the method proceeds to step 645, where the input is expanded, for example, to include the sufficient snippet. In one embodiment, step 630 uses method 300 to determine the confidence that the label is correctly applied to the candidate text, with the text snippet as the candidate input in step 305 and the label as the candidate label in step 310. The estimated label probability output by method 300 is then used as the class confidence score. In step 307, since the labels are already known, the method proceeds to step 372. In step 372, in one embodiment, the anti-label is obtained by the labeling service 142 from the labeling criteria, which have the anti-label stored in memory, and the method proceeds to step 315.
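The loop over steps 625 to 645 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the 0.7 threshold, and the retry budget are hypothetical, and the keyword-overlap scorer is only a stand-in for the class confidence that the text attributes to method 300 or to a similarity between keyword structures.

```python
def keyword_overlap(snippet, label_keywords):
    """Toy confidence score: the fraction of label keywords found in the snippet.

    Stand-in for step 630; a real scorer would use embeddings or method 300.
    """
    words = set(snippet.lower().split())
    return len(words & label_keywords) / len(label_keywords)

def filter_snippets(snippets, score_fn, threshold=0.7, max_tries=10):
    """Keep snippets whose class confidence meets the threshold.

    Returns (accepted, failed), each a list of (snippet, confidence) pairs.
    """
    accepted, failed = [], []
    for snippet in snippets[:max_tries]:              # step 625: next snippet
        confidence = score_fn(snippet)                # step 630: quantify
        if confidence >= threshold:                   # decision step 635
            accepted.append((snippet, confidence))    # step 645: expand examples
        else:
            failed.append((snippet, confidence))      # step 640: record failure
    return accepted, failed

label_keywords = {"helping", "happy", "customer"}
snippets = [
    "happy to be helping every customer today",
    "the invoice is overdue",
    "our customer was happy",
]
accepted, failed = filter_snippets(
    snippets, lambda s: keyword_overlap(s, label_keywords)
)
```

Recording the failed snippets alongside their scores, rather than discarding them, mirrors step 640 and makes it possible to report skipped samples once the extension is complete.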

[0109] At step 320, the method determines that an example of the candidate label is available. Therefore, the method proceeds to step 335, where an example of the candidate label is received. In one embodiment, K examples of positive labels are received if available, where K is a non-negative integer. In one embodiment of step 335, the example is randomly selected from the set of positive examples. In one embodiment of step 335, the examples in the positive set are ranked by confidence, and K examples are randomly selected from the top L highest-confidence examples. In one embodiment of step 335, the set of examples from which the positive example is obtained is restricted to positive examples belonging to the same cluster of similar examples.

[0110] At step 325, the method determines that an example of the candidate anti-label is available. Therefore, the method proceeds to step 340, where an example of the candidate anti-label is received. In one embodiment, K examples of negative labels are received if available, where K is a non-negative integer. In one embodiment of step 340, the examples in the negative set are ranked by confidence, and K examples are randomly selected from the top L highest-confidence examples. In one embodiment of step 340, the set of examples from which the negative examples are obtained is restricted to negative examples belonging to the same cluster of similar examples.

[0111] In one embodiment, K and / or L are parameters set by the user to control the extension method 600. In one embodiment, a balanced set of K negative examples and K positive examples is obtained if available.
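One way to read the K-from-top-L selection in steps 335 and 340 is sketched below. The function names and the representation of examples as (text, confidence) pairs are assumptions made for illustration. The balancing rule, which shrinks both samples to the largest size available in both classes, is one possible reading of "if available".

```python
import random

def sample_top_examples(examples, k, top_l, rng=None):
    """Randomly pick up to k examples from the top_l highest-confidence ones.

    examples is a list of (text, confidence) pairs.
    """
    if rng is None:
        rng = random.Random()
    ranked = sorted(examples, key=lambda ex: ex[1], reverse=True)
    pool = ranked[:top_l]
    return rng.sample(pool, min(k, len(pool)))

def balanced_sample(positives, negatives, k, top_l, rng=None):
    """Pick equally many positive and negative examples, at most k of each."""
    n = min(k, top_l, len(positives), len(negatives))
    return (
        sample_top_examples(positives, n, top_l, rng),
        sample_top_examples(negatives, n, top_l, rng),
    )

positives = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.2)]
negatives = [("x", 0.7), ("y", 0.6)]
pos, neg = balanced_sample(positives, negatives, k=3, top_l=3, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the selection reproducible, which is convenient when comparing extension runs with different K and L settings.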

[0112] The method proceeds from step 315 to step 380, where one or more label scoring methods are applied. In step 385, performance records are accumulated, and available weightings for labels similar to the current label are searched for. In step 390, if weightings are found, they are applied and a weighted label score is determined. Otherwise, the label score is the one determined in step 380. An estimate is determined from the set of label scores, and in step 395 the results are plotted based on the estimate. In the example of method 300, the plotted results are the determined label scores, which are provided to method 600 as label class confidence scores and tested in step 635. Method 300 then proceeds to step 397, where new inputs are awaited from the user or the extension. Noise during augmentation is a challenge for perturbation-based systems, especially when the perturbation is not performed at a predetermined location.

[0113] Noise during augmentation is another challenge for alternative AI-based systems intended to augment or generate data for training complex models. When only a small amount of data is available, even samples that have a certain probability of belonging to a class will inevitably include some that belong to the class but are not good representatives of it. Noise from these samples needs to be reduced. This system provides a noise reduction method that works for small sample sizes.

[0114] If the class confidence is determined to be sufficient at decision step 635, the method proceeds to step 645, where the set of positive examples is increased by storing the text snippet as a positive example within the labeling criteria. The method then proceeds to decision step 650 to check for additional user input or additional input from labeling service 142. If there is no additional input, the method proceeds to decision step 655, which tests whether each newly discovered example should be "balanced," that is, supplemented by a negative example that, when discovered, complements the new positive example. This determination is made to increase the richness and scale of the augmented data. The augmentation generates both negative and positive class augmentations from each example, regardless of its original class. Thus, for example, positive class samples are also synthetically transformed into negative class samples to ensure that the resulting balancing subgraph exists. For some labels, there are advantages in determining whether the data has a balanced representation of a particular idea. Embodiments for generating balanced examples may include, for example, thesaurus-based methods, antonym substitution methods, and negative vector-based embedding methods.

[0115] The criteria on which decision 655 is based may be user settings, labeling criteria settings, or extension settings for the labeling service 142. If decision 655 determines that the newly discovered positive example should be balanced, the method proceeds to step 660, where the data required to obtain anti-label examples related to the recently discovered positive example is determined. In one embodiment, the anti-label data determined in step 660 includes a set of preferred keywords for the text snippet, an inversion of the set of preferred keywords for the text snippet, a set of preferred keywords for the anti-label, and a set of context-aware keywords for the inversion of the preferred keywords for the text snippet in the context of the anti-label. Once the set of context-aware keywords for the inversion of the text snippet in the context of the anti-label is obtained, the method proceeds to step 620, where ranked search results for the set of context-aware keywords are obtained. The method then proceeds to discover negative examples via steps 625, 630, 635, and 640, using different inputs but employing the same methods described herein for discovering positive examples. The input received by method 600 includes the negative text context-aware keywords (representing the candidate text) and the anti-labels (representing the candidate labels). If negative examples are sought by extension method 600, the supplementary data is used to obtain extended negative examples with sufficient class confidence in step 645.
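The keyword inversion described for step 660 can be illustrated with a simple antonym substitution, one of the embodiments named above. This is a toy sketch only: the antonym table is hypothetical and hand-written, keeping terms that have no antonym unchanged is an assumption, and a real system would draw on a thesaurus or embedding-based negation rather than a fixed dictionary.

```python
# Hypothetical antonym table, written by hand for this example only.
TOY_ANTONYMS = {
    "helping": "ignoring",
    "happy": "unhappy",
    "harmony": "conflict",
}

def invert_keywords(keywords, antonyms):
    """Replace each weighted keyword with its antonym, keeping the weight.

    Terms with no known antonym are kept unchanged (an assumption of this sketch).
    """
    return [(antonyms.get(term, term), weight) for term, weight in keywords]

def to_query(keywords):
    """Join the keyword terms into a search query string, preserving order."""
    return " ".join(term for term, _ in keywords)

positive_keywords = [("helping", 0.35), ("happy", 0.2), ("customer", 0.1)]
negative_keywords = invert_keywords(positive_keywords, TOY_ANTONYMS)
negative_query = to_query(negative_keywords)
```

The inverted keyword list would still need to be re-weighted against the anti-label, as the paragraph above describes, before being used as the query in step 620.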

[0116] At decision step 650, a test is performed to see whether there is any user input or any remaining examples that have not yet been expanded. If additional input is received, method 600 proceeds to step 665, where the additional input is processed. If there are additional examples to be expanded, the method proceeds to step 605, where candidate text is received, and the method repeats with the new input data. If the user provides additional input in step 650, the method uses the additional input in step 665 to provide an improved expansion. For example, if a newly acquired example is displayed to the user in display area 235 and the user selects "Confirm" or "Green" using control 236, step 665 records the example as a strong example, adds it to the set of samples, and proceeds to step 620 to generate additional examples. Alternatively, if the user determines that a newly discovered example is poor, the user enters "Reject" or "Red" into control 236, and the method proceeds to step 620 using a new example, taken from the set of examples with defined preferred keywords, to expand. If the keywords have not yet been defined, the method instead proceeds to step 605. In addition, if the user modifies a label and provides the modified label definition as input in graphic control 206, the method resets and starts the extension method in step 606 with the new label, treating all examples as needing to be processed again in light of the new label.

[0117] If decision 650 determines that there is no new user input and there are no additional examples to consider for expansion, the method displays an extension completion notification and continues to wait until there is additional input. At decision 650, the method effectively waits by periodically sampling the input state. In the data augmentation process invoked by selecting graphic control 295, noise filtering is used to verify that the predicted class of each selected snippet matches the intended class. In one embodiment, a threshold sets a confidence tolerance level for accepting a sample. In one embodiment, a number of query results from the context-aware keywords are consumed without finding a suitable candidate. In this case, the example is effectively skipped and an error message is stored. Once the extension method is complete, the extension statistics are summarized and presented to the user in a display area such as graphic display 200, so that the user receives an indication of the degree of success of the extension. In one embodiment, the number of positive class examples successfully added is displayed in area 203, the number of negative class examples added is displayed in area 205, and the number of skipped samples is displayed in display area 201.

[0118] When documenting labeling criteria, descriptions of positive classes such as "service harmony" are sought. The system shown in operating environment 100 determines such expressions from descriptive input such as "Pleasant and business-like". The positive class then typically consists of sentences that exhibit positive features that benefit the customer and promote customer happiness and loyalty. To obtain a rich definition of the class being sought, it is also useful to have examples of sentences that reflect either “disservice” or “disharmony.” A system such as the one shown within operating environment 100 provides for the generation and expansion of a set of examples that are semantically rich, contain diverse ideas, are balanced, and are filtered for strength of expression. If a sentence does not exhibit either a positive or a negative tendency with respect to the service harmony label, the sentence is typically labeled “inert” or “yellow.” Some contexts allow two thresholds to be set for sentences, one of them given by distance from the “inert” case rather than distance from the opposite case. It has been found that samples reflecting the “inert” case may be drawn from either positive or negative examples and are not particularly close to the parent example.

[0119] The idea of a text is usually the meaning of a sentence, independent of the specific words / sequences used within it. For example, the two sentences below have the same idea:

[0120] The dog was too tired to cross the street.

[0121] The hunting dogs were too exhausted to cross the road.

[0122] Richness of expression generally lies in diverse wording that maintains the same idea. Below is an example of a text extended from one form to another that offers much richness but still expresses the same idea.

[0123] The dog was too tired to cross the street.

[0124] My pet didn't seem to want to make the extra effort to go all the way to the other side of St. Peter's Basilica.

[0125] The richness of ideas usually lies in the expression of the same idea from various perspectives. Below are examples of two sentences that, while in the same context (e.g., "a sentence describing service harmony"), contain very different ideas.

[0126] I enjoy working in customer service because turning problematic customers into fans through service and kindness makes me feel good.

[0127] I understand what you're getting at; you're saying you're frustrated because I don't know the answer to your question, so let's look at some documents together and see if I can get the information you need.

[0128] Generally, state-of-the-art NLP models require a richness of ideas across the entire set of training samples belonging to the same label in order to effectively train the label classifier.

[0129] The disclosed solution is superior to other methods. Real, rich, and human-generated data of specific training context requirements is the “gold” standard for training any NLP model. However, other methods do not provide an effective way to augment “rich,” “human-generated,” context-aware training data for state-of-the-art and large-scale NLP models. Since “manual” modes of data acquisition and labeling are neither scalable nor cost-effective with respect to the scale of data required for these state-of-the-art models (which can be 100x to 100,000x for any traditional NLP model), this disclosure should be evaluated in comparison to other scalable methods.

[0130] Other methods of data augmentation do not provide richness or contextual awareness. Final model results are compared between models trained without the methods disclosed herein and the same models trained with augmentation using one or more methods and sub-methods of this disclosure. The best baselines are obtained using a few-shot method with state-of-the-art large-scale pre-trained transformer-based NLP models (e.g., Microsoft Turing).

[0131] In addition to its primary intelligent, scalable, and context-aware data augmentation method, the proposed method has an additional method for making the augmented data noise-tolerant. This is demonstrated by comparing the performance of both the isolated primary data augmentation module and the primary data augmentation module with the noise reduction add-on against the baseline performance of modern state-of-the-art large-scale transformer-based NLP models (e.g., Microsoft Turing) trained on the same data samples without these modules. The method takes subsets of standardized datasets of varying sizes, containing only 20 (10 for each of the positive and negative tendencies) to 100 records for a particular contextual requirement / label description. For such large models, this number of training samples is generally considered extremely small, and training a reasonably performing model for any real-world application on it is considered impossible. This is confirmed by the suboptimal performance of models trained on such small representative datasets without this technique: the recall of the NLP system is in the range of 4% to 8% for sample sizes of 20 to 100. All validation / testing of the final trained model was performed on a validation dataset created for the same contextual requirements (label descriptions), but that dataset contains many more varied, richer, and more diverse "ideas" about the contextual requirements than can be represented within the handful of training samples available to train the model. This scenario is also analogous to the case of a much larger (1000x) training dataset obtained from a limited data source (e.g., from a portal on a given topic, or from a portal frequently visited by a specific class / subset of target audiences).

[0132] Next, using the same data samples (not just similarly sized data), a system in operating environment 100 performs the disclosed method once using only the scalable, intelligent, and context-aware method flow without the noise reduction add-on module. This method provided 8% to 17% recall for sample sizes of 20 to 100. With noise reduction, performance is similar, but there is an early advantage of 14% recall at sample size 40. Under both conditions, the disclosed method provided significantly better results than the baseline. The disclosed noise reduction add-on module also provided even better results with smaller data sample sizes.

[0133] The disclosed method extends the training data to large-scale, state-of-the-art NLP data, which may provide better recall / FPR / accuracy for the basic model. Due to the richness and diversity of ideas in the data that can be extended, the model may learn context better and more holistically, meaning that the model may reasonably do better with respect to new data / domains. The extended samples are search-based and therefore real samples generated by people / businesses, which ensures that models trained on these systems under real-world applications are more reliable and stable.
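For reference, the recall, false positive rate (FPR), and accuracy figures discussed above follow the standard definitions for a binary classifier, which can be computed as in the minimal sketch below. The function name and the sample labels are illustrative only and are not data from this disclosure.

```python
def classification_metrics(y_true, y_pred):
    """Compute recall, false positive rate, and accuracy for binary labels.

    y_true and y_pred are equal-length sequences of 0/1 values, where 1
    marks the positive class (e.g., the label applies to the text).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Illustrative labels only.
metrics = classification_metrics(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0],
    y_pred=[1, 0, 0, 0, 1, 0, 0, 0],
)
```

Recall measures how many of the truly positive texts the classifier finds, so a recall in the single digits means that almost all texts matching the label description are missed.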

[0134] The disclosed methods have the potential to extend even state-of-the-art transformer-based NLP models, which require rich contextual learning for a wide variety of ideas, to vast amounts of real-world and human / corporate-generated training data.

[0135] The disclosed method is context-aware (as opposed to simply changing any word by its synonym / antonym or by adding / substituting random words). This is a huge benefit as it not only greatly reduces noise in any downstream models but also ensures more relevant training data for downstream models, thus improving the performance, accuracy, relevance, reliability, and stability of the models.

[0136] Manually acquiring and grading data for large-scale models is insufficient for the following reasons:

[0137] Firstly, state-of-the-art and large-scale transformer-based NLP models require at least several thousand samples of extremely rich representational data. Such data is difficult, time-consuming, and very expensive to obtain from a single source. While such methods have worked well in the past for traditional models (in conjunction with other non-AI-based extensions), they do not scale to the modern NLP ecosystem.

[0138] Secondly, even when such data is obtained from multiple data sources, it still needs to be graded, which incurs time, cost, and, most importantly, all the nuances of bias related to the grading of such data.

[0139] Referring now to Figures 13-15, each block of Methods 1300, 1400, and 1500 described herein includes a computing process that can be performed using any combination of hardware, firmware, and / or software. For example, various functions can be performed by a processor executing instructions stored in memory. The methods can also be embodied as computer-usable instructions stored on a computer storage medium. The methods can be provided, to name a few examples, as a standalone application, a service, or a hosted service (standalone or in combination with another hosted service), or as a plug-in to another product. In addition, Methods 1300, 1400, and 1500 are described by way of example with respect to the systems and methods of Figures 1-12. However, these methods can additionally or alternatively be performed by any one system, or any combination of systems, including, but not limited to, those described herein.

[0140] Figure 13 is a flowchart of Method 1300 for determining the correspondence between class labels and text in some embodiments of the present disclosure. Method 1300 includes receiving candidate text in block 1302. As previously described with reference to Figure 2, candidate text may be received via a user interface. Alternatively, candidate text may be a set of documents, emails, or other sources of text. In some embodiments, candidate text may be part of a larger document (such as a sentence, a phrase, or a paragraph). Method 1300 includes receiving a label description in block 1304. As previously described with reference to Figure 2, label descriptions may be received via a user interface. The user may submit a label description in order to determine whether one or more documents, emails, texts, social media posts, or other text content correspond to the label description. For example, a label description may be submitted for the purpose of identifying documents that embody customer service. Method 1300 can determine whether the label description corresponds to the candidate text. A label corresponds to a candidate text where the concepts in the text and the label description have similar meanings.

[0141] Method 1300 includes using the label description to generate a query in block 1306. For example, preferred keywords derived from the label are used as the query, as described in steps 615 and 620 of Figure 6. Alternatively, preferred keywords derived from the label are used in conjunction with preferred keywords derived from an example to form a set of context-aware keywords in step 615, as shown in Figure 7.

[0142] Method 1300 includes, in block 1308, transmitting the query to the search engine. The labeling service 142 sends the query to the search service 164. In one embodiment, the search service 164 is an API version of the search engine. In one embodiment, a client version of the search is used. The search service 164 receives the query and performs a search across the entire document corpus 154. The search engine provides blocks of ranked retrieved results (including the rank of each result and the search score of each result) and text snippets that sample documents at locations determined to be relevant to the query. The search service 164 obtains a set of ranked search results, as discussed in relation to step 620 in Figure 6.

[0143] Method 1300 includes receiving a text string corresponding to the query from the search engine in block 1310. The search service 164 sends a results page containing a set of ranked search results, each containing a text snippet of the ranked search result, to the labeling service 142. In some cases, the labeling service 142 selects an entry with a high rank or a high search relevance score and takes its text snippet as the text string. In other cases, as described in relation to Figure 6, the list of ranked search results is processed in a loop: a text snippet is obtained as described in step 625, the label class confidence of the text snippet is quantified in step 630, and decision 635 determines whether the text snippet has the appropriate class with sufficient confidence. If it does not, the method records the failure of the snippet in step 640 and returns to step 625. If a text snippet of sufficient confidence is found at decision 635, that text snippet is selected as the text string corresponding to the query.
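The snippet-selection loop of steps 625 to 640 can be sketched as follows. This is a minimal illustration only: the function name `select_snippet`, the `label_confidence` callback, and the default threshold of 0.8 are assumptions made for the example and do not appear in the disclosure.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def select_snippet(
    ranked_snippets: Iterable[str],
    label_confidence: Callable[[str], float],
    threshold: float = 0.8,  # hypothetical confidence threshold
) -> Tuple[Optional[str], List[str]]:
    """Return the first sufficiently confident snippet and the rejected ones."""
    failures: List[str] = []
    for snippet in ranked_snippets:             # step 625: obtain next snippet
        confidence = label_confidence(snippet)  # step 630: quantify confidence
        if confidence >= threshold:             # decision 635
            return snippet, failures
        failures.append(snippet)                # step 640: record the failure
    return None, failures                       # no snippet was confident enough
```

The results are visited in rank order, so the selected snippet is the highest-ranked one that passes the confidence check.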

[0144] Method 1300 includes inputting the text string and the candidate text into a generative model in block 1312. The text string is essentially a positive or negative example and is used in conjunction with the example processing disclosed herein. In some cases, a parameter indicating the amount of risk in the generative model is retrieved from storage 180. There are four basic methods described herein for inputting exemplary text strings into the generative model: the NC method (Figure 4, steps 410 and 420), the SL method (Figure 4, steps 410 and 420), the SS method (Figure 5, steps 510 and 520), and the LP method (Figure 12, steps 1210, 1220 and 1230). In some embodiments, the generative model is in zero-shot mode.

[0145] Method 1300 includes, in block 1314, receiving generated text from the generative model, the generated text including multiple tokens and associated probabilities. The generated text broadly includes not only the actual stream of text tokens generated by the model but also vectors of associated token probabilities and the log probabilities reported for each token, where the log probabilities describe the likelihoods of a certain number of tokens that the model may have selected. As described above, there are four basic methods disclosed herein (the NC, SL, SS, and LP methods) for receiving the generated text from the generative model. In the NC method shown in Figure 4, the text is received and scanned with respect to class labels as described in relation to the NC embodiment of step 420. In the SL method, also shown in Figure 4, the text is received and scanned with respect to label and antilabel keywords as described in relation to the SL embodiment of step 420. In the SS method shown in Figure 5, the generated text is used in a search query as described in step 530. In the LP method shown in Figure 12, log probabilities are used in relation to steps 1215, 1225, and 1235.

[0146] Method 1300 includes, in block 1316, determining a label probability estimate based on the generated text. Again, there are four basic methods for determining the label probability, disclosed as the NC, SL, SS, and LP methods. In the NC method, in step 430, the token probabilities of label numbers or antilabel numbers are used as input to an approximation that, in some embodiments, uses experimentally estimated scaling factors. In the SL method, in step 430, the token probabilities of keywords for labels or antilabels, or of their synonyms, are used to form an approximation of the strength of label indications as opposed to antilabel indications. In the SS method, in step 540, arbitration rules are used to weigh the ranks of positive example documents against those of negative example documents. In the LP method, in step 1250, results exceeding a predictability threshold provide scaling for positive and negative probabilities to approximate label probabilities.
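As a rough illustration of the SL approach, the sketch below sums the token probabilities that fall on label keywords and on antilabel keywords and reports the label share of that mass. The normalization (label mass divided by total matched mass) and the neutral default of 0.5 are assumptions chosen for the example; the disclosure does not specify the exact approximation.

```python
import math
from typing import Iterable, Sequence, Tuple


def sl_label_probability(
    token_logprobs: Sequence[Tuple[str, float]],
    label_keywords: Iterable[str],
    antilabel_keywords: Iterable[str],
) -> float:
    """Estimate label strength from (token, log probability) pairs."""
    label_set = {k.lower() for k in label_keywords}
    anti_set = {k.lower() for k in antilabel_keywords}
    label_mass = 0.0
    anti_mass = 0.0
    for token, logprob in token_logprobs:
        word = token.strip().lower()
        prob = math.exp(logprob)  # convert log probability to probability
        if word in label_set:
            label_mass += prob
        elif word in anti_set:
            anti_mass += prob
    total = label_mass + anti_mass
    if total == 0.0:
        return 0.5  # assumed neutral value when no keyword was generated
    return label_mass / total
```

A value near 1.0 indicates that the generated text leans toward the label keywords, and a value near 0.0 indicates that it leans toward the antilabel keywords.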

[0147] Method 1300 includes, in block 1318, outputting an indication of whether the candidate text corresponds to the label description, based on the label probability estimate. As previously described with reference to Figure 2, the indication may be output via a user interface. In one embodiment, this indication may be a binary "yes / no" or similar indication. In another embodiment, this indication may represent the degree or strength of correlation.

[0148] Figure 14 is a flowchart of Method 1400 for determining the correspondence between class labels and text, according to some embodiments of the present disclosure. Method 1400 includes receiving candidate text in block 1402. As previously described with reference to Figure 2, candidate text may be received via a user interface. Alternatively, candidate text may be a set of documents, emails, or other sources of text. In some embodiments, candidate text may be part of a larger document (such as a sentence, a phrase, or a paragraph). Method 1400 includes receiving a label description in block 1404. As previously described with reference to Figure 2, label descriptions may be received via a user interface. A label description may be submitted for the purpose of determining whether one or more documents, emails, texts, social media posts, or other text content correspond to the label description. For example, a user might want to identify documents that embody customer service.

[0149] Method 1400 includes, in block 1406, generating candidate results from a generative model with the candidate text as input to the generative model. Method 1400 can determine whether the label description corresponds to the candidate text. A label corresponds to candidate text when the concepts in the text and in the label description have similar meanings. In one embodiment, labels are abstract concepts or categories that correctly describe several examples that embody the label or specific examples that fit the label. The process of generating candidate results from a generative model, with candidate text as input to the generative model, is described in step 520 of Figure 5. An example of candidate text input from the graphic display 200 is "We are happy to help you order your sprocket," as shown in the graphic control 202.

[0150] Method 1400 includes, in block 1408, generating a positive example result from the generative model, which has positive example text that embodies a label description as input to the generative model. Steps 1408 and 1410 are generally described in step 530 of Figure 5. In the example shown in graphic display 200, the positive example text may be "Let me know if there is anything else I can do for you. I'd be happy to help," as shown in display area 231.

[0151] Method 1400 includes, in block 1410, generating a negative example result from the generative model, which has negative example text that embodies the opposite concept of the label description as input to the generative model. An example of negative example text, as shown in graphic display 200, might be "This is your problem, not mine," as shown in graphic display area 261.

[0152] Method 1400 includes, in block 1412, determining a first rank score of the positive example results based on the response from submitting the candidate results to the search engine as a second query across a corpus that includes the positive example results and the negative example results. The rank score can be a numerical rank such as 1, 2, or 3, where a lower number reflects a higher rank (listed first). Alternatively, the rank score can be the cosine similarity between the candidate result and the positive example result.

[0153] Method 1400 includes, in block 1414, determining a second rank score of the negative example results based on the response from submitting the candidate results to the search engine as a second query across the corpus that includes the negative example results. The rank score could be, for example, the cosine similarity between a candidate result and a negative example result. The similarity can be measured in a deep vector space by using a semantic search engine.

[0154] Method 1400 includes, in block 1416, determining a label probability estimate by comparing the first rank score of the positive example results with the second rank score of the negative example results. The arbitration rules disclosed herein may be used to estimate probabilities. In one embodiment, the label probability is a scaled comparison between the mean positive example cosine similarity and the mean negative example cosine similarity. In one embodiment, the scaling coefficient is determined by finding the cosine similarity of randomly selected texts and using it as a decay coefficient. In one embodiment, the scaling factor is determined by measuring the ratio of user confirmations and using it as a coefficient.
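One way to picture the scaled comparison of mean cosine similarities is the sketch below. The logistic squashing function and the default `scale` of 5.0 are assumptions made for illustration; the disclosure states only that the comparison is scaled and that the coefficient may be derived from random-text similarity or from user confirmations.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ss_label_probability(
    candidate: Vector,
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    scale: float = 5.0,  # hypothetical scaling coefficient
) -> float:
    """Map the gap between mean positive and mean negative similarity to (0, 1)."""
    mean_pos = sum(cosine_similarity(candidate, p) for p in positives) / len(positives)
    mean_neg = sum(cosine_similarity(candidate, n) for n in negatives) / len(negatives)
    # Assumed logistic squashing of the scaled difference.
    return 1.0 / (1.0 + math.exp(-scale * (mean_pos - mean_neg)))
```

When the candidate is equally similar to both example sets the estimate is 0.5, and it moves toward 1.0 as the positive examples become the closer match.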

[0155] Method 1400 includes, in block 1418, outputting an indication of whether the candidate text corresponds to the label description, based on the label probability estimate. In one embodiment, this indication may be a binary "yes / no" or similar indication. In another embodiment, this indication may represent the degree or strength of correlation.

[0156] Figure 15 is a flowchart illustrating a method 1500 for extending classifier training data according to some embodiments of the present disclosure.

[0157] Method 1500 includes receiving a training data instance containing exemplary text related to class labels for the classifier in block 1502. The training data instance may be provided by the user via an interface. In another embodiment, the training data is drawn from a set of training data.

[0158] Method 1500 includes, in block 1504, determining a set of preferred keywords from the exemplary text. The preferred keywords are determined, for example, as explained in relation to Figure 9.

[0159] Method 1500 includes, in block 1506, determining a set of preferred keywords for the class labels. The set of preferred keywords for the class labels is determined, for example, as explained in step 307 of Figure 3 and in Figure 9.

[0160] Method 1500 includes, in block 1508, determining a set of context-aware keywords from the set of preferred keywords from the exemplary text and the set of preferred keywords for the class labels. The method of determining the set of context-aware keywords is explained in Figure 7. An example of a set of context-aware keywords might be "helping, happy, customer," as shown in the display area 204 of the graphic display 200.
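The idea of keeping only the example keywords that fit the context of the label keywords can be sketched with embeddings and cosine similarity. Everything here is illustrative: the `embed` callback, the toy two-dimensional vectors in the usage note, and the `min_similarity` cutoff of 0.5 are assumptions and are not taken from Figure 7.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def context_aware_keywords(
    example_keywords: Sequence[str],
    label_keywords: Sequence[str],
    embed: Callable[[str], Vector],
    min_similarity: float = 0.5,  # hypothetical relevance cutoff
) -> List[str]:
    """Keep example keywords whose embedding is close to any label keyword."""
    label_vectors = [embed(k) for k in label_keywords]
    selected: List[str] = []
    for keyword in example_keywords:
        vector = embed(keyword)
        best = max((cosine(vector, lv) for lv in label_vectors), default=0.0)
        if best >= min_similarity:
            selected.append(keyword)
    return selected
```

With toy embeddings in which "happy" lies close to "customer" and "sprocket" does not, calling the function with example keywords `["happy", "sprocket"]` and label keywords `["customer", "service"]` keeps only "happy".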

[0161] Method 1500 includes transmitting a query containing the set of context-aware keywords to a search engine in block 1510. Labeling service 142 sends the query containing the context-aware keywords to search service 164. In one embodiment, search service 164 is an API version of the search engine. In one embodiment, a client version of the search is used. Search service 164 receives the query and performs a search across the entire document corpus 154. The search engine provides blocks of ranked search results (including the rank of each result and the search score of each result) and text snippets that sample documents at locations determined to be relevant to the query. The search service 164 obtains a set of ranked search results, as discussed in relation to step 620 in Figure 6.

[0162] Method 1500 includes receiving text snippets from the search engine in response to the query in block 1512. The search service 164 sends a results page containing a set of ranked search results, each containing a text snippet for a ranked search result, to the labeling service 142. In some cases, the labeling service 142 selects entries with a high rank or a high search relevance score, thereby selecting their text snippets. In other cases, as described in relation to Figure 6, the list of ranked search results is processed in a loop: a potential text snippet is obtained as described in step 625, the label class confidence of the potential text snippet is quantified in step 630, and decision 635 determines whether the potential text snippet has the appropriate class with sufficient confidence. If it does not, the method records the failure of the snippet in step 640 and returns to step 625. If a potential text snippet of sufficient confidence is found at decision 635, that potential text snippet is selected as the text snippet returned in response to the query.

[0163] Method 1500 includes generating an extended training data instance in block 1514 that includes text snippets and class labels. In one embodiment, the labeling criterion is augmented by including additional examples that include text snippets and are associated with class labels. A method for storing, modifying, and enhancing a labeling criterion to include additional examples of the labeling criterion disclosed herein is an example of generating an extended instance (or label standard) that includes new examples of text snippets or class labels.
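A minimal sketch of extending a set of training instances with retrieved snippets is shown below. The `TrainingInstance` type, the function name, and the duplicate check are hypothetical choices for the example; the disclosure does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class TrainingInstance:
    text: str
    label: str


def extend_training_data(
    instances: Sequence[TrainingInstance],
    snippets: Iterable[str],
    label: str,
) -> List[TrainingInstance]:
    """Append each new snippet as an instance of `label`, skipping duplicates."""
    seen = {(inst.text, inst.label) for inst in instances}
    extended = list(instances)
    for snippet in snippets:
        key = (snippet, label)
        if key not in seen:  # assumed de-duplication, not stated in the source
            seen.add(key)
            extended.append(TrainingInstance(snippet, label))
    return extended
```

The original instances are left unchanged, so the caller can compare a classifier trained on the original set with one trained on the extended set.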

[0164] Method 1500 includes classifying candidate texts into classes in block 1516 by using a trained classifier having an expanded training data instance.

[0165] Method 1500 includes outputting an indication in block 1518 that the candidate text corresponds to a label corresponding to a class. In one embodiment, this indication may be a binary "yes / no" or similar indication. In another embodiment, this indication may represent the degree or strength of correlation.

[0166] Exemplary operating environment Referring generally to the attached drawings, and initially to Figure 8 in particular, an exemplary operating environment for implementing aspects of the technology described herein is generally shown and designated as computing device 800. Computing device 800 is merely an example of a suitable computing environment and is therefore not intended to imply any limitation on the scope of use of the technology described herein. Computing device 800 should not be construed as having any dependencies or requirements relating to any of the components shown or any combination thereof.

[0167] The technologies described herein may be described in the general context of computer code or machine-usable instructions (including computer executable instructions such as program components) executed by a computer or other machine (such as a personal data assistant or other handheld device). Generally, routines, programs, and program components, including objects, parts, data structures, etc., refer to code that performs a specific task or implements a specific abstract data type. The technologies described herein may be executed in a wide variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, and specific computing devices. Embodiments of the technologies described herein may also be executed in a distributed computing environment in which tasks are performed by remote processing devices linked via a communication network.

[0168] Continuing to refer to Figure 8, the computing device 800 includes a bus 810 that is directly or indirectly coupled to the following devices: memory 812, one or more processors 814, one or more presentation components 816, input / output (I / O) ports 818, I / O components 820, and an exemplary power supply 822. Bus 810 represents what may be one or more buses (such as an address bus, a data bus, or a combination thereof). The various blocks in Figure 8 are shown with lines for clarity, but in reality, the depiction of the various components is not so obvious, and metaphorically, the lines would be more accurately gray and ambiguous. For example, presentation components such as a display device can be considered I / O components. Also, a processor has memory. The inventors acknowledge that such things are essential to the art and reiterate that the diagram in Figure 8 is merely an example of an exemplary computing device that may be used in connection with one or more aspects of the art described herein. Categories such as "workstation," "server," "laptop," and "handheld device" are all considered within the scope of Figure 8 and are not distinguished as they all refer to "computer" or "computing device."

[0169] The computing device 800 typically includes a wide variety of computer-readable media. Computer-readable media may be any available media that can be accessed by the computing device 800 and include both volatile and non-volatile media, and removable and non-removable media. For example, and without limitation, computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technique for storing information such as computer-readable instructions, data structures, program modules, or other data.

[0170] Computer storage media include RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices. Computer storage media do not include propagating data signals.

[0171] Communication media typically embody computer-readable instructions, data structures, program modules, or other data within modulated data signals (such as carrier waves or other transport mechanisms), and include any information transmission medium. The term “modulated data signal” means a signal that has been set or modified in such a way that one or more of its characteristics encode the information within the signal. For example, and without limitation, communication media include wired media such as wired networks or direct wired connections, and wireless media such as acoustic, RF, infrared, and other wireless media. Any combination of the above should also be included within the scope of computer-readable media.

[0172] Memory 812 includes computer storage media in the form of volatile and / or non-volatile memory. Memory 812 may be non-removable, removable, or a combination thereof. Exemplary memory includes solid-state memory, hard drives, optical disc drives, etc. Computing device 800 includes one or more processors 814 that read data from various entities such as bus 810, memory 812, or I / O components 820. Presentation components 816 present data indications to the user or other devices. Exemplary presentation components 816 include display devices, speakers, printing components, vibrating components, etc. I / O ports 818 enable computing device 800 to be logically coupled to other devices (including I / O components 820), some of which may be built in.

[0173] I / O components include microphones, joysticks, gamepads, satellite antennas, scanners, printers, display devices, wireless devices, controllers (such as styluses, keyboards, and mice), natural user interfaces (NUIs), and so on. In some embodiments, a pen digitizer (not shown) and an accompanying input device (also not shown, but which may include, for example, only a pen or stylus) are provided to digitally capture handwritten user input. The connection between the pen digitizer and the processor 814 may be direct, or it may be via a serial port, a parallel port, and / or a coupling utilizing other interfaces and / or system buses known in the art. Furthermore, the digitizer input component may be a component separated from output components such as a display device, or in some embodiments, the usable input area of the digitizer may coexist with and be integrated with the display device, or exist as a separate device covering it, or otherwise be attached to the display device. Any such variations and any combination thereof are contemplated within the scope of the embodiments of the art described herein.

[0174] The NUI processes user-generated air gestures, voice, or other physiological inputs. Appropriate NUI inputs may be interpreted as ink strokes for a presentation associated with the computing device 800. These requests may be sent to appropriate network elements for further processing. The NUI implements any combination of voice recognition, touch and stylus recognition, face recognition, biometric recognition, on-screen gesture recognition and gesture recognition adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with the display on the computing device 800. The computing device 800 may be equipped with depth cameras such as stereo camera systems, infrared camera systems, RGB camera systems, and combinations thereof for gesture detection and recognition. In addition, the computing device 800 may be equipped with an accelerometer or gyroscope to enable motion detection. The output of the accelerometer or gyroscope may be provided to the display of the computing device 800 to render immersive augmented reality or virtual reality.

[0175] Embodiments: The technology described herein has been described in relation to specific embodiments, which are intended in all respects to be illustrative rather than restrictive. While the technology described herein is susceptible to various modifications and alternative structures, some of which are shown in the accompanying drawings and described in detail above, it should be understood that there is no intention to limit the technology described herein to the specific forms disclosed; on the contrary, the intention is to cover all modifications, alternative structures, and equivalents that fall within the spirit and scope of the technology described herein.

[0176] For example, the labeling service 122 was sometimes described as labeling documents across a corpus 154 that is an enterprise corpus of CRM data, but a labeling service can label parts of documents across any corpus of documents. Corpus 154 could be part of a personal hard drive, cloud storage, a set of web pages, a movie database, etc.

[0177] In addition, the labeling application 110 was generally described as an application that provides labeling results. The labeling application 110 can also be advantageously combined with the search service 164. For example, a larger set of results from the search service 164 can be filtered through the labeling service to remove results that do not fit the label. As another example, the search service 164 may be configured to return the 100 most relevant results, and those results related to the label may be moved to the top of a ranked list. In one embodiment, the user types a label description into the graphic control 206, and the search service 164 returns to the user a set of entries presenting possible positive examples and a set of entries presenting possible negative examples. The user selects positive and negative examples, and the method proceeds to perform method 300 with the positive examples taken from the text snippets of the user-selected positive entries and the negative examples taken from the text snippets of the user-selected negative entries. The search service 164 then passes the processing to the labeling service 142. Labeling service 142 then filters the results of search service 164 by calling method 300, in which each text snippet from each entry returned by search service 164 is evaluated as candidate text in light of the user-entered labels, so that entries are ranked based on label probability rather than raw keyword similarity and presented to the user as a semantically relevant list of web results.
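The filtering and re-ranking of search results by label probability can be sketched as follows. The entry layout (a dictionary with a "snippet" key), the `label_probability` callback standing in for method 300, and the `min_probability` cutoff are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence


def rerank_by_label(
    entries: Sequence[Dict[str, str]],
    label_probability: Callable[[str], float],
    min_probability: float = 0.0,  # hypothetical filter cutoff
) -> List[Dict[str, str]]:
    """Drop entries below the cutoff and sort the rest by label probability."""
    scored = [(label_probability(entry["snippet"]), entry) for entry in entries]
    kept = [pair for pair in scored if pair[0] >= min_probability]
    # Sort on the probability only; Python's sort is stable, so entries with
    # equal probability keep their original search-engine order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in kept]
```

Because ties keep the original order, the search engine's keyword ranking still acts as a tiebreaker among entries with the same label probability.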

[0178] Furthermore, the labeling application 110 may be used to generate a search index of a corpus of documents that provides a label strength index, and to return documents based on a combination of label strengths rather than keyword relevance. To achieve this, a hybrid search can be generated that uses a weighted combination of the keyword index and the label strength index.
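A hybrid ranking of this kind can be illustrated with a simple convex combination of the two scores. The linear form, the `label_weight` parameter, and the assumption that both scores lie on a comparable 0 to 1 scale are choices made for this sketch and are not specified in the disclosure.

```python
from typing import Dict, List, Tuple


def hybrid_score(keyword_score: float, label_strength: float, label_weight: float = 0.5) -> float:
    """Convex combination of keyword relevance and label strength."""
    if not 0.0 <= label_weight <= 1.0:
        raise ValueError("label_weight must be between 0 and 1")
    return (1.0 - label_weight) * keyword_score + label_weight * label_strength


def hybrid_rank(
    documents: Dict[str, Tuple[float, float]],
    label_weight: float = 0.5,
) -> List[str]:
    """Rank document ids by hybrid score, highest first.

    `documents` maps a document id to (keyword_score, label_strength).
    """
    return sorted(
        documents,
        key=lambda doc_id: hybrid_score(*documents[doc_id], label_weight),
        reverse=True,
    )
```

Setting `label_weight` to 0.0 reproduces a pure keyword ranking, and setting it to 1.0 ranks by label strength alone.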

[0179] The classification levels disclosed herein were sometimes described as binary levels, serving as labels and antilabels. The techniques described herein can also process multinomial levels to provide a multinomial label classifier.

[0180] In addition, wherever search or web search is described herein, semantic search based on semantic proximity may be performed instead of traditional keyword search.

[0181] Embodiment 1. A method of determining the correspondence between class labels and text, the method comprising receiving candidate text and receiving a label description. The method also comprises using the label description to generate a query. The method also comprises communicating the query to a search engine. The method also comprises receiving a text string from the search engine in response to the query. The method also comprises inputting the text string and the candidate text into a generative model. The method also comprises receiving generated text from the generative model, the generated text including multiple tokens and associated probabilities. The method also comprises determining a label probability estimate based on the generated text. The method also comprises outputting an indication of whether the candidate text corresponds to the label description, based on the label probability estimate.

[0182] Embodiment 2. The method according to Embodiment 1, wherein the label probability estimate is determined from the token probabilities of the generated text corresponding to the label.

[0183] Embodiment 3. The method according to Embodiment 2, wherein the label is a positive label or an anti-label.

[0184] Embodiment 4. The method according to any one of Embodiments 1 to 3, wherein the label probability estimate is determined from the token probabilities of the generated text corresponding to the keywords in the label description or the keywords in the anti-label.

[0185] Embodiment 5. The method according to any one of Embodiments 1 to 4, wherein the search engine technology of the search engine is selected from the group consisting of rule-based search, semantic search based on semantic proximity, and contextual search using a transformer model.

[0186] Embodiment 6. The method according to any one of Embodiments 1 to 5, wherein determining the label probability estimate based on the generated text includes using a first weighting applied to a first label score based on the generated text and a second weighting applied to a second label score based on second generated text received from a second generative model when the candidate text is input to the second generative model.

[0187] Embodiment 7. The method according to Embodiment 6, wherein the first and second weightings are determined by finding the stored weights of a set of different label descriptions that are similar to the label description.

[0188] Embodiment 8. A computer-readable medium comprising instructions that, when executed by a computing device, cause the computing device to perform a method of determining the correspondence between class labels and text, the method including receiving candidate text and receiving a label description. The method also comprises generating candidate results from a generative model with the candidate text as input to the generative model. The method also comprises generating positive example results from the generative model with positive example text, which embodies the label description, as input to the generative model. The method also comprises generating negative example results from the generative model with negative example text, which embodies the opposite concept of the label description, as input to the generative model. The method also comprises determining a first rank score of the positive example results based on the response from submitting the candidate results to a search engine as a second query across a corpus that includes the positive example results and the negative example results. The method also comprises determining a second rank score of the negative example results based on the response from submitting the candidate results to the search engine as a query across the corpus containing the negative example results. The method also comprises determining a label probability estimate by comparing the first rank score of the positive example results with the second rank score of the negative example results. The method also comprises outputting an indication of whether the candidate text corresponds to the label description, based on the label probability estimate.

[0189] Embodiment 9. The medium according to Embodiment 8, wherein the search engine is a semantic search engine.

[0190] Embodiment 10. The medium according to any one of Embodiments 8 to 9, wherein the generative model is GPT3 executed in zero-shot mode.

[0191] Embodiment 11. The medium according to any one of Embodiments 8 to 10, wherein the indication is based on a weighted combination of the label probability estimate and a second label probability estimate calculated using a different method.

[0192] Embodiment 12. The medium according to Embodiment 11, wherein the candidate text is a corpus of documents.

[0193] Embodiment 13. A system comprising one or more processors; and one or more computer storage media storing instructions that, when used by the one or more processors, cause the one or more processors to perform a method. The method comprises receiving a training data instance for a classifier, the training data instance including exemplary text associated with a class label. The method also comprises determining a set of preferred keywords for the exemplary text. The method also comprises determining a set of preferred keywords for the class label. The method also comprises determining a set of context-aware keywords from the set of preferred keywords for the exemplary text and the set of preferred keywords for the class label. The method also comprises communicating a query containing the set of context-aware keywords to a search engine; receiving text snippets from the search engine in response to the query; generating an augmented training data instance containing the text snippets and the class label; classifying candidate text into a class using a classifier trained on the augmented training data instance; and outputting an indication that the candidate text corresponds to a label corresponding to the class.

[0194] Embodiment 14. The system of Embodiment 13, wherein the exemplary text is a positive example of the class label.

[0195] Embodiment 15. The system of Embodiment 13, wherein the exemplary text is a negative example of the class label.

[0196] Embodiment 16. The system according to either Embodiment 14 or 15, wherein the method further comprises storing the set of preferred keywords in the exemplary text and the set of preferred keywords for the class label in a graph structure.

[0197] Embodiment 17. The system of any one of Embodiments 14, 15, or 16, wherein the method further comprises obtaining a first embedding of the preferred keyword terms in the exemplary text. The method also comprises obtaining a second embedding of the preferred keyword terms for the class label. The method also comprises determining the context-aware keywords using an operation on the first and second embeddings.

[0198] Embodiment 18. The system of Embodiment 17, wherein the operation calculates the cosine similarity between the set of preferred keyword terms in the exemplary text and the set of preferred keyword terms for the class label.
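As an illustration of the embedding operation in Embodiments 17 and 18, the sketch below selects context-aware keywords by cosine similarity. The per-keyword comparison, the 0.7 threshold, and the two-dimensional toy vectors used in the usage note are assumptions; a real system would obtain embeddings from a trained model, and the patent does not specify a threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_context_aware(text_embeddings, label_embeddings, threshold=0.7):
    """Keep each text keyword whose best similarity to any label
    keyword embedding reaches the (assumed) threshold."""
    selected = []
    for keyword, vector in text_embeddings.items():
        best = max(
            (cosine_similarity(vector, lv) for lv in label_embeddings.values()),
            default=0.0,
        )
        if best >= threshold:
            selected.append(keyword)
    return selected
```

For example, with toy embeddings `{"goal": [1.0, 0.1], "rates": [0.0, 1.0]}` for the exemplary text and `{"football": [1.0, 0.0]}` for the class label, only `goal` is kept.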

[0199] Embodiment 19. The system of any one of Embodiments 14, 15, 16, 17, or 18, wherein determining the set of context-aware keywords comprises filtering the keywords in the exemplary text according to the relevance of each term of the set of preferred keywords in the exemplary text to the context of the keywords in the class label.

[0200] Embodiment 20. The system of any one of Embodiments 14, 15, 16, 17, 18, or 19, wherein the method further comprises confirming that the text snippet is likely to represent the class label by using a label scoring method that receives the text snippet and the class label and returns an indication that the probability that the text snippet embodies the class label exceeds a threshold.

[0201] Embodiment 21. A method of determining a correspondence between class labels and text, comprising receiving candidate text. The method further comprises receiving a label description and receiving positive example text that embodies the label description. The method further comprises receiving negative example text that embodies a concept opposite to the label description. The method further comprises applying a generative model to the positive example text and the candidate text to obtain a positive example result. The method further comprises applying the generative model to the negative example text and the candidate text to obtain a negative example result; applying the generative model to the positive example text, the negative example text, and the candidate text to obtain a baseline result; and determining a label probability estimate by comparing the associated log probability of the positive example result with the associated log probability of the negative example result in the context of the baseline result. The method further comprises outputting an indication of whether the candidate text corresponds to the label description based on the determination of the label probability estimate.
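One way the log-probability comparison of Embodiment 21 could be realized is sketched below. The embodiment does not say how the baseline result enters the comparison; here it is assumed, purely for illustration, that the magnitude of the baseline log probability rescales the positive-versus-negative margin before a logistic function maps it to a probability. The three log probabilities are taken as given inputs.

```python
import math


def label_probability_from_logprobs(positive_logprob, negative_logprob,
                                    baseline_logprob):
    """Map the positive-vs-negative log-probability margin, scaled by the
    baseline (an assumed normalization), to a value in (0, 1)."""
    scale = abs(baseline_logprob) or 1.0  # avoid division by zero
    margin = (positive_logprob - negative_logprob) / scale
    return 1.0 / (1.0 + math.exp(-margin))
```

Under this assumption, equal log probabilities give 0.5, and swapping the positive and negative inputs gives the complementary estimate.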

[0202] Embodiment 22. The method of Embodiment 2, wherein the token probabilities of the generated text include a number of token probabilities corresponding to the labels.

[0203] Embodiment 23. The method of Embodiment 2, wherein the token probabilities of the generated text include token probabilities corresponding to antilabels.

[0204] Embodiment 24. The method of Embodiment 2, wherein the label probability estimate is determined from the token probability of the generated text corresponding to the antilabel.

[0205] Embodiment 25. The method of Embodiment 2, wherein the label probability estimate is determined from the token probabilities of terms from the generated text that are synonyms of the keywords in the string label.

[0206] Embodiment 26. The method of Embodiment 2, wherein the label probability estimate is determined from the token probabilities of terms from the generated text that are keywords of the string label.

[0207] Embodiment 27. The method of Embodiment 24 or 25, wherein two token probabilities are combined to form an overall probability estimate.

[0208] Embodiment 28. The method of Embodiment 25 or 26, wherein the label probability estimate takes the probabilities of two terms from the generated text that are keywords of the string label or synonyms of keywords of the string label.
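Embodiments 22 to 28 derive the label probability estimate from token probabilities for label keywords, their synonyms, and antilabel terms. A minimal sketch of one such combination follows. The normalization (label mass divided by label plus antilabel mass) and the dictionary input format are assumptions made for illustration; the embodiments do not fix a formula.

```python
def label_probability_from_tokens(token_probs, label_terms, antilabel_terms):
    """Combine token probabilities into a label probability estimate.

    token_probs: mapping of generated token -> probability.
    label_terms: label keywords and their synonyms (lowercase).
    antilabel_terms: antilabel keywords (lowercase).
    """
    label_mass = sum(p for tok, p in token_probs.items()
                     if tok.lower() in label_terms)
    antilabel_mass = sum(p for tok, p in token_probs.items()
                         if tok.lower() in antilabel_terms)
    total = label_mass + antilabel_mass
    # No matching tokens at all: return an uninformative 0.5.
    return label_mass / total if total else 0.5
```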

Claims

1. A method by which a computer determines a correspondence between class labels and text, the method comprising: receiving candidate text; receiving a label description; generating a query using the label description; transmitting the query to a search engine; receiving, from the search engine, a text string corresponding to the query; inputting the text string and the candidate text into a generative model; receiving generated text from the generative model, wherein the generated text includes a plurality of tokens and, for each token, an association probability indicating the probability that the candidate text belongs to the token, and each token is a keyword related to the label description or a synonym of the keyword; determining, based on the association probability for each token, a label probability indicating the probability that the candidate text belongs to the label description; and outputting, based on the label probability, an indication of whether the candidate text corresponds to the label description.

2. The method according to claim 1, wherein the keyword is a positive label keyword or an anti-label keyword.

3. The method according to claim 1, wherein the label probability is determined from the token probability of the generated text corresponding to the keyword of the label description or the keyword of the anti-label.

4. The method according to claim 1, wherein the search engine technology of the search engine is selected from the group consisting of rule-based search, semantic search based on semantic proximity, and contextual search using a transformer model.

5. The method according to claim 1, wherein determining the label probability includes using a first weight applied to a first label score based on the generated text and a second weight applied to a second label score based on second generated text received from a second generative model when the candidate text is input to the second generative model.

6. The method according to claim 5, wherein the first and second weights are determined by discovering stored weights of a set of different label descriptions that are similar to the label description.

7. The method according to claim 1, wherein the text string is a positive example text that embodies the label description or a negative example text that embodies the opposite concept of the label description.

8. The method according to claim 1, wherein using the label description to generate a query includes deriving one or more keywords from the label description.
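The overall flow of claim 1, with the keyword derivation of claim 8, can be sketched end to end as below. The search engine and the generative model are passed in as callables because the claims do not tie them to a specific product; `derive_keywords`, the stopword list, the summed-probability rule, and the 0.5 threshold are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

STOPWORDS = {"a", "an", "the", "of", "about", "and", "or", "to", "in"}


def derive_keywords(label_description: str) -> List[str]:
    """Assumed keyword derivation: non-stopword tokens of the description."""
    words = (w.strip(".,;:!?") for w in label_description.lower().split())
    return [w for w in words if w and w not in STOPWORDS]


def classify(candidate_text: str,
             label_description: str,
             search: Callable[[str], str],
             generate: Callable[[str, str], Dict[str, float]],
             threshold: float = 0.5) -> Tuple[float, bool]:
    """Follow the claimed steps: query, search, generate, score, indicate."""
    keywords = derive_keywords(label_description)
    query = " ".join(keywords)
    text_string = search(query)                          # example text from the search engine
    token_probs = generate(text_string, candidate_text)  # token -> association probability
    # Assumed aggregation: total probability of tokens matching label keywords.
    label_probability = min(
        1.0, sum(p for tok, p in token_probs.items() if tok.lower() in keywords)
    )
    return label_probability, label_probability >= threshold
```

In practice `search` would wrap a real search engine and `generate` a generative model that exposes token probabilities; here both are placeholders that a caller supplies.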