A method and system for improving the performance of vertical domain language models
By collecting high-quality text, generalizing it with a large model, performing topic modeling, and using a Transformer model to generate prompt words with diverse attribute combinations, the invention addresses the shortage of training data for vertical domain language models, enabling efficient, low-cost text acquisition and improved model performance.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: PACHIRA TIMES (ZHUHAI HENGQIN) INFORMATION TECH CO LTD
- Filing Date: 2023-09-01
- Publication Date: 2026-06-19
AI Technical Summary
Vertical domain language models face challenges in collecting and acquiring training data. Traditional methods are inefficient and costly, and existing research uses simple prompts, which limits the diversity of generated data and inherits the systematic biases of large language models.
High-quality text is collected and generalized with a large model; a Latent Dirichlet Allocation topic model extracts topics as attributes; a Transformer model learns the dependencies between attributes to generate prompt words with diverse attribute combinations; a pre-trained language model generates relevant text from those prompts; and the combined data is cleaned and used for fine-tuning to improve model performance.
The method quickly acquires high-quality text corpora from vertical domains, improves the modeling capabilities of language models, reduces the bias inherited from large language models, increases data diversity and robustness, and expands application scenarios.
Figure CN117312545B_ABST
Description
Technical Field
[0001] This invention belongs to the field of natural language processing technology, specifically relating to a method and system for improving the performance of language models in vertical domains.
Background Technology
[0002] As the core of Natural Language Processing (NLP), language models play a crucial role in many fields, such as speech recognition, machine translation, handwriting recognition, input methods, search query understanding, and dialogue systems. Common language modeling methods currently include statistical language models based on N-grams, neural network language models, and transfer learning language models that are pre-trained on large-scale general corpora and then transferred to downstream tasks.
[0003] Since current language models are largely consistent in structure and training methods, the selection and processing of training data becomes extremely important during the training and transfer learning process. This is because the scale and quality of training data directly affect the final quality and application effectiveness of the language model. Especially in vertical domains, the lack of relevant online corpora makes the collection of corpora for those domains a highly challenging task. Good training data is crucial for ensuring the effectiveness of the language model; therefore, the collection of relevant corpora in vertical domains plays a decisive role in the quality of the language model.
[0004] Traditional methods for collecting corpora mainly include web crawling and manual processing. However, the quality of web-crawled corpora varies greatly, often requiring significant manpower and resources for cleaning and processing. Manual processing, on the other hand, is expensive and inefficient, especially when the corpus is very large, where the efficiency problem becomes even more pronounced.
[0005] In recent years, the capabilities that emerge in large language models once their parameter count reaches a certain scale have attracted widespread attention. Such models possess powerful language generation capabilities and can generate various types of text corpora from given prompts; they also generalize well, expanding given sentences into richer and more varied corpora.
[0006] Large Language Models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. Therefore, researchers have explored using LLMs as task-specific training data generators to alleviate the need for task-specific data and annotations. Current research typically uses simple category-based prompts (with a common template format: "Please generate text about {category}") to query the LLM to generate training data, rarely exploring the data generation process itself. In particular, using simple prompts may limit the diversity of generated data and inherit systematic biases inherent in LLMs.
Summary of the Invention
[0007] The purpose of this invention is to provide a method and system for improving the performance of vertical domain language models, which can solve problems such as insufficient training data and difficulty in data acquisition for vertical domain language models.
[0008] The technical solution of the present invention is as follows: A method for improving the performance of a vertical domain language model, comprising the following steps:
[0009] Step 1: Collect relevant high-quality text for the vertical field;
[0010] Step 2: Generalize the collected text using a large model;
[0011] Step 3: Topic Modeling;
[0012] Step 4: Train a Transformer model to learn the dependencies between the attributes in each dimension;
[0013] Step 5: Select different attribute combinations, and then randomly select the attribute values corresponding to the attributes to form prompt words;
[0014] Step 6: The large model generates relevant text D'' based on the prompt words generated in Step 5;
[0015] Step 7: Use the text from Steps 1, 2, and 6 to train a language model for the vertical domain.
[0016] The high-quality text sources mentioned in step 1 include manually written text and web-crawled text that has been manually selected. Together, these texts are denoted as dataset D.
[0017] In step 2, a pre-trained language model is used to generalize the dataset D from step 1 to obtain a generalized text dataset D'. Specifically, the pre-trained language model is loaded, and D is used as the input of the pre-trained language model. Through the forward computation of the pre-trained language model, the output of the pre-trained language model is D'.
[0018] In step 3, the Latent Dirichlet Allocation (LDA) topic model is used to model the generalized dataset D' from step 2, automatically learning the topic distribution of the vertical domain dataset; topics are regarded as attribute dimensions, and high-frequency keywords under each topic are used as the attribute values of that topic.
[0019] The specific process of step 3 is as follows:
[0020] Step 31: Assume that each document consists of multiple topics, and each topic is represented by a series of keywords;
[0021] Step 32: Model the document-topic distribution and topic-word distribution using Dirichlet prior distribution, so that each document can be represented by multiple topics in different proportions, and each topic contains each word with different probabilities;
[0022] Step 33: Through Bayesian inference and iterative solution, infer the document-topic distribution and topic-word distribution in reverse from the documents. This mainly includes: randomly sampling topics based on the document-topic distribution, then randomly sampling words based on the topic-word distribution, to synthesize documents; optimizing the parameters of both distributions based on the synthesized documents; and repeating these steps until the document-topic distribution and topic-word distribution converge to stable values.
[0023] Step 34: Finally, each document is represented as a distribution of multiple topics, and each topic is represented as a distribution of words, automatically discovering the topic information of the document set; where the topic is the required attribute, and the words corresponding to each topic are the attribute values corresponding to the attribute.
[0024] In step 4, the attributes are used as the input sequence of the Transformer model. Each attribute dimension is mapped to a fixed-length vector through embedding, so that the model captures the dependencies between different attributes and produces an attention weight indicating each attribute's importance to the output prediction. Attribute combination configurations are then generated automatically from what the Transformer model has learned; the rules governing these combinations are learned by the model from the input data.
[0025] In step 5, based on the topic-word distribution learned by the topic model in step 3, several high-frequency keywords are randomly selected for each topic as the attribute values of the topic; based on the optimal attribute combination generated in step 4, attribute values are randomly selected for each attribute, contradictory combinations are filtered out, and then the attributes and attribute values are inserted into the prompt template to generate complete attribute prompts.
[0026] In step 7, the text data D, D', and D'' obtained in steps 1, 2, and 6 are processed through data cleaning, word segmentation, and serialization into a format suitable for language model training, denoted D'''. Starting from a base model, the processed data D''' is used to fine-tune the model. The model is then validated on the validation set and evaluated on the test set to verify its performance, and it is adjusted and optimized as needed.
[0027] A system for improving the performance of vertical domain language models includes a text generation module, a generalization module, a topic modeling module, a Transformer module, a prompt word module, and a training module;
[0028] The text generation module collects relevant high-quality text for vertical fields. The sources of the high-quality text include manually written text and web-crawled text that has been manually selected. Together, these texts are denoted as dataset D.
[0029] The generalization module generalizes the dataset D to obtain the generalized text dataset D'. Specifically, it loads a pre-trained language model, uses D as the input of the pre-trained language model, and outputs D' through the forward computation of the pre-trained language model.
[0030] The topic modeling module uses a Latent Dirichlet Allocation (LDA) topic model to model dataset D, automatically learning the topic distribution of the vertical domain dataset; topics are treated as attribute dimensions, and high-frequency keywords under each topic are used as attribute values for that topic, as detailed below:
[0031] Assume that each document consists of multiple topics, and each topic is represented by a series of keywords;
[0032] The document-topic distribution and topic-word distribution are modeled using the Dirichlet prior distribution, so that each document can be represented by multiple topics in different proportions, and each topic contains each word with different probabilities;
[0033] Through Bayesian inference and iterative solution, the document-topic distribution and topic-word distribution are inferred in reverse from the documents. This process mainly includes: randomly sampling topics based on the document-topic distribution, then randomly sampling words based on the topic-word distribution, to synthesize documents; optimizing the parameters of both distributions based on the synthesized documents; and repeating these steps until the document-topic distribution and topic-word distribution converge to stable values.
[0034] Ultimately, each document is represented as a distribution of multiple topics, and each topic is represented as a distribution of words, automatically discovering the topic information of the document collection; where the topic is the required attribute, and the words corresponding to each topic are the attribute values corresponding to the attribute.
[0035] The Transformer module receives the attributes as an input sequence and maps each attribute dimension to a fixed-length vector through embedding, so that it captures the dependencies between different attributes and produces an attention weight indicating each attribute's importance to the output prediction. The module automatically generates attribute combination configurations from what it has learned; the rules governing these combinations are learned by the model from the input data.
[0036] Based on the topic-word distribution learned by the topic modeling module, the prompt word module randomly selects several high-frequency keywords for each topic as that topic's attribute values. Based on the optimal attribute combination generated by the Transformer module, it randomly selects an attribute value for each attribute, filters out contradictory combinations, and then inserts the attributes and attribute values into the prompt template to generate complete attribute prompts. The large model generates relevant text D'' based on the generated prompt words.
[0037] The training module processes the obtained text data D, D', and D'' through data cleaning, word segmentation, and serialization into a format suitable for language model training, denoted D'''. Starting from a base model, the model is fine-tuned using the processed data D'''. The model is then validated on the validation set and evaluated on the test set to verify its performance, and it is adjusted and optimized as needed.
[0038] The beneficial effects of this invention are: (1) it can quickly obtain high-quality text corpora in vertical fields at a low cost; (2) it improves the modeling ability of language models, thereby improving the overall performance of the system; (3) attribute prompts can generate data with richer information and greater diversity, which can reduce the bias of large-scale language models themselves; (4) it can quickly obtain high-quality text corpora in vertical fields at a low cost based on attribute prompts; (5) it can generalize existing texts using large models, increase the diversity of data, and make language models more robust; (6) it can use the generated texts in language model training, improve the intelligence level of language models, and can be extended to more scenarios and more fields, expanding the possibilities of interaction.
Attached Figure Description
[0039] Figure 1: A diagram illustrating the construction of a text database for a language model;
[0040] Figure 2: A diagram illustrating the construction of complex prompt words with multiple attributes.
Detailed Implementation
[0041] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0042] To address the challenge of insufficient training corpora for vertical domain language models, this invention proposes a method and system to improve their performance. The method involves two aspects: first, generalizing existing text; and second, directly generating the required text using a large-scale model based on domain-specific prompts. This approach efficiently acquires a sufficient and high-quality training corpus, contributing to improved performance of vertical domain language models.
[0043] This invention uses complex prompts with diverse attributes, such as the multi-attribute complex prompt word template: "Please help me generate a text of {attribute 1: attribute value} {attribute 2: attribute value}...".
[0044] A method to improve the performance of a vertical domain language model includes the following steps:
[0045] Step 1: Collect relevant high-quality text for the vertical field;
[0046] The high-quality text sources include manually generated text and manually selected text crawled from the web. The manually generated and manually selected crawled texts are denoted as dataset D.
[0047] Manual generation means writing content by hand, in precise language, according to the requirements. For example:
[0048] Question: How do I call Xiao Wang?
[0049] Manually written answer: I will contact Xiao Wang for you, please wait a moment.
[0050] Step 2: Generalize the collected text using a large model;
[0051] The dataset D from step 1 is generalized using a pre-trained language model (such as chatgpt4, BERT, etc.) to obtain the generalized text dataset D'. Specifically, the pre-trained language model is loaded, with D as its input. Through forward computation of the pre-trained language model, the output of the pre-trained language model is D'.
[0052] The forward computation process is as follows:
[0053] y = F(x; θ)
[0054] In the formula, y is the output of the model, x is the input of the model, F is the pre-trained language model, and θ is the parameter of the pre-trained language model.
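The generalization step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `generalize`, `toy_model`, and the instruction string are hypothetical names introduced here, and `toy_model` is a canned stand-in for a real pre-trained language model F with parameters θ.

```python
from typing import Callable, Iterable, List


def generalize(dataset: Iterable[str],
               model: Callable[[str], List[str]],
               instruction: str = "Rewrite the following sentence in several different ways: ") -> List[str]:
    """Apply y = F(x; theta) to every text x in D and collect the outputs as D'."""
    generalized: List[str] = []
    seen = set()
    for x in dataset:
        for y in model(instruction + x):
            y = y.strip()
            # Keep non-empty outputs that differ from the source and from each other.
            if y and y != x and y not in seen:
                seen.add(y)
                generalized.append(y)
    return generalized


def toy_model(prompt: str) -> List[str]:
    """Hypothetical stand-in for a pre-trained model: returns canned rewrites."""
    sentence = prompt.split(": ", 1)[1]
    return [
        sentence.replace("I will", "I'll"),
        sentence.replace("please wait a moment", "one moment please"),
    ]


D = ["I will contact Xiao Wang for you, please wait a moment."]
D_prime = generalize(D, toy_model)
for text in D_prime:
    print(text)
```

In practice `toy_model` would be replaced by a call to whichever pre-trained model is used; passing the model as a callable keeps the loop independent of any particular API.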
[0055] Step 3: Topic Modeling;
[0056] The Latent Dirichlet Allocation (LDA) topic model is used to model the dataset D from Step 1 and the generalized text dataset D' from Step 2, automatically learning the topic distribution of the vertical domain dataset. Topics can be viewed as attribute dimensions, and high-frequency keywords under each topic can be used as attribute values for that topic.
[0057] Latent Dirichlet Allocation (LDA) is an unsupervised probabilistic topic model that can automatically discover topic information in a text collection. LDA is used to model datasets D and D' to learn the topic distribution to which each text belongs. The specific process is as follows:
[0058] Step 31: Assume that each document consists of multiple topics, and each topic is represented by a series of keywords.
[0059] Step 32: Model the document-topic distribution and topic-word distribution using Dirichlet prior distributions, so that each document can be represented by multiple topics in different proportions, and each topic contains each word with different probabilities.
[0060] The Dirichlet prior distribution is a multivariate probability distribution, commonly used as a parameter prior distribution in Bayesian statistics. It is a natural extension of the Beta distribution in the multidimensional case.
[0061] The probability density function of the Dirichlet distribution is as follows:
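[0062] $$\mathrm{Dir}(p \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1}$$
[0063] Here $p = (p_1, \dots, p_K)$ is a probability vector over K categories, which here represent the K topics of the document.
[0064] $\alpha = (\alpha_1, \dots, \alpha_K)$ is the parameter vector of the distribution; each parameter $\alpha_k$ encodes prior information about category k and controls the probability of that category occurring. $B(\alpha) = \prod_{k=1}^{K} \Gamma(\alpha_k) \,/\, \Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)$ is the multivariate beta function, used to normalize the distribution.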
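As a numerical check on the Dirichlet density described above, the following sketch evaluates it in log space using only the Python standard library. The function names `dirichlet_log_pdf` and `dirichlet_pdf`, and the argument names `p` (probability vector) and `alpha` (parameter vector), are illustrative choices rather than notation taken from the patent.

```python
import math
from typing import Sequence


def dirichlet_log_pdf(p: Sequence[float], alpha: Sequence[float]) -> float:
    """Log density of a Dirichlet distribution at probability vector p."""
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_k must be positive")
    # This sketch requires strictly positive entries so that log(p_k) is defined.
    if any(x <= 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be strictly positive and sum to 1")
    # log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)) - log_beta


def dirichlet_pdf(p: Sequence[float], alpha: Sequence[float]) -> float:
    """Density itself, obtained by exponentiating the log density."""
    return math.exp(dirichlet_log_pdf(p, alpha))


# With alpha = (1, 1, 1) the distribution is uniform over the simplex.
print(dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
```

Working in log space with `math.lgamma` avoids overflow in the gamma functions when the parameters are large.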
[0065] Step 33: Through Bayesian inference and iterative solution, infer the document-topic distribution and topic-word distribution in reverse from the documents.
[0066] Specifically, this includes: randomly sampling topics based on the document-topic distribution, then randomly sampling words based on the topic-word distribution, to synthesize documents; optimizing the parameters of both distributions based on the synthesized documents; and repeating these steps until the document-topic distribution and topic-word distribution converge to stable values.
[0067] Step 34: Finally, each document is represented as a distribution of multiple topics, and each topic is represented as a distribution of words, thus automatically discovering the topic information of the document collection.
[0068] The topic is the required attribute, and the word corresponding to each topic is the attribute value.
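Steps 31 to 34 can be sketched with a small pure-Python LDA. The patent only says "Bayesian inference and iterative solution"; collapsed Gibbs sampling is used here as one common choice, which is an assumption. The names `lda_gibbs` and `top_words`, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import random
from typing import List


def lda_gibbs(docs: List[List[str]], num_topics: int, iterations: int = 200,
              alpha: float = 0.1, beta: float = 0.01, seed: int = 0):
    """Fit LDA by collapsed Gibbs sampling; returns (vocab, theta, phi)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # times word v is in topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k
    assignments = []
    for d, doc in enumerate(docs):                      # random initial assignment
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                           / (topic_total[t] + V * beta) for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
             for doc, counts in zip(docs, doc_topic)]
    phi = [[(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
           for k in range(num_topics)]
    return vocab, theta, phi


def top_words(vocab: List[str], phi: List[List[float]], n: int = 3) -> List[List[str]]:
    """High-probability words per topic, used as that attribute's values."""
    return [[vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
            for row in phi]


docs = [["rain", "sunny", "cloud", "rain"], ["penguin", "bear", "seal"],
        ["sunny", "cloud", "wind"], ["seal", "penguin", "bear", "bear"]]
vocab, theta, phi = lda_gibbs(docs, num_topics=2)
print(top_words(vocab, phi))
```

Each row of `theta` is a document's mixture over topics, and each row of `phi` is a topic's distribution over words; the topics play the role of attributes and the top words the role of attribute values.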
[0069] Step 4: Train a Transformer model to learn the dependencies between the attributes of each dimension.
[0070] The Transformer model described is a deep learning model widely used in natural language processing tasks. Its key feature is the use of self-attention to capture global dependencies in the input. Attributes are considered as input sequences to the Transformer model, and each dimension of the attribute is transformed into a fixed-length vector through embedding. This embedding method captures and utilizes the semantic relevance between attributes. Embedding is a representation method that typically maps discrete feature variables to a real-number vector space, where attributes that are close to each other have similar representations, while attributes that are far apart have different representations. Based on this embedding method, the Transformer model can effectively capture the dependencies between different attributes and assign the importance of each attribute to the output prediction, i.e., the attention weight. The learning results from the Transformer model are then used to automatically generate more reasonable attribute combinations. The rules for these combinations are automatically learned by the model based on the input data. For example, it can learn that two attributes frequently appear together, or that a certain attribute has a particularly important influence under certain circumstances; these will all be reflected in the model's generation process.
[0071] Construct a Transformer-based neural network model. The input is the embedding vector of each attribute (the attributes being those obtained in step 3), and the output is a probability distribution over different attribute combinations. Train this model to learn the dependencies between different attributes, and then generate more reasonable attribute combination configurations based on these dependencies. This avoids the redundancy (e.g., duplicated attributes such as "location" and "position") and ineffectiveness (e.g., unrelated attributes such as "weather" and "material") that can occur with random attribute combinations.
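The core operation a Transformer applies to the attribute embeddings is scaled dot-product self-attention, sketched below in pure Python. This is deliberately incomplete: the embeddings are random and untrained, and learned query/key/value projections, multiple heads, feed-forward layers, and the training loop are all omitted. The names `embed_attributes` and `self_attention` and the example attribute names are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_attributes(attributes: List[str], dim: int = 8, seed: int = 0) -> Dict[str, List[float]]:
    """Map each attribute to a fixed-length vector (random here; learned in practice)."""
    rng = random.Random(seed)
    return {a: [rng.gauss(0.0, 1.0) for _ in range(dim)] for a in attributes}


def self_attention(embeddings: List[List[float]]) -> Tuple[List[List[float]], List[List[float]]]:
    """Scaled dot-product self-attention with queries = keys = values = embeddings."""
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    weights, outputs = [], []
    for query in embeddings:
        # Similarity of this attribute to every attribute, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in embeddings]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all attribute vectors.
        outputs.append([sum(w * value[j] for w, value in zip(row, embeddings))
                        for j in range(dim)])
    return weights, outputs


attributes = ["location", "weather", "animal"]
table = embed_attributes(attributes)
weights, outputs = self_attention([table[a] for a in attributes])
for name, row in zip(attributes, weights):
    print(name, [round(w, 3) for w in row])
```

Row i of `weights` shows how strongly attribute i attends to every attribute; after training, such weights are what would indicate which attributes depend on one another.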
[0072] Step 5: Select different attribute combinations, and then randomly select the attribute values corresponding to the attributes to form prompt words;
[0073] Based on the topic-word distribution learned by the topic model in step 3, several high-frequency keywords are randomly selected for each topic as attribute values. Based on the optimal attribute combination generated in step 4, attribute values are randomly selected for each attribute, and rules are manually written to filter out contradictory combinations (e.g., location: Antarctica, animal: polar bear). Then, the attributes and attribute values are inserted into a prompt template ("Please generate a text of {attribute 1: attribute value} {attribute 2: attribute value}...") to generate a complete attribute prompt.
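The prompt construction in step 5 can be sketched as follows, assuming the attribute combination and the candidate values per attribute are already available. The rule list `CONTRADICTIONS`, the helper names, and the sample attribute values are hypothetical; only the template wording and the Antarctica / polar bear example come from the text above.

```python
import random
from typing import Dict, List

TEMPLATE = "Please generate a text of {attributes}"

# Hand-written rules: each rule is a set of (attribute, value) pairs that must not co-occur.
CONTRADICTIONS = [
    frozenset({("location", "Antarctica"), ("animal", "polar bear")}),
]


def is_contradictory(combo: Dict[str, str]) -> bool:
    """True if the combination contains every pair of some contradiction rule."""
    items = set(combo.items())
    return any(rule <= items for rule in CONTRADICTIONS)


def build_prompts(attribute_values: Dict[str, List[str]], combination: List[str],
                  n: int, seed: int = 0, max_tries: int = 1000) -> List[str]:
    """Sample up to n distinct, non-contradictory attribute prompts."""
    rng = random.Random(seed)
    prompts: List[str] = []
    seen = set()
    for _ in range(max_tries):
        if len(prompts) == n:
            break
        # Randomly pick one value for each attribute in the chosen combination.
        combo = {a: rng.choice(attribute_values[a]) for a in combination}
        key = tuple(sorted(combo.items()))
        if key in seen or is_contradictory(combo):
            continue
        seen.add(key)
        body = " ".join("{%s: %s}" % (a, v) for a, v in combo.items())
        prompts.append(TEMPLATE.format(attributes=body))
    return prompts


attribute_values = {
    "location": ["Antarctica", "Arctic"],
    "animal": ["polar bear", "penguin"],
}
prompts = build_prompts(attribute_values, ["location", "animal"], n=3)
for p in prompts:
    print(p)
```

Bounding the loop with `max_tries` keeps the sampler from spinning forever when fewer than `n` valid combinations exist.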
[0074] Step 6: The large model generates relevant text based on the prompts generated in Step 5;
[0075] Input the prompt words generated in step 5 into the pre-trained language model to generate the relevant text D''.
[0076] Step 7: Use the text from Steps 1, 2, and 6 to train a language model for the vertical domain.
[0077] Use the texts D, D', and D'' to fine-tune the language model for the vertical domain.
[0078] The text data D, D', and D'' obtained in steps 1, 2, and 6 are processed through data cleaning, word segmentation, and serialization to form data D''' suitable for language model training. Starting from a base model, the processed data D''' is used to fine-tune the model. The base model may be a large open-source model, a self-built model, or a statistical language model based on N-grams. Fine-tuning is a common training strategy in deep learning, allowing the model to better adapt to specific tasks in a vertical domain and improve its performance on those tasks. The model is then validated on the validation set and evaluated on the test set to verify its performance, and it is adjusted and optimized as needed. Finally, after training and validation, the model can be applied to practical tasks such as text generation, sentiment analysis, and text classification.
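The data preparation in step 7 (cleaning, word segmentation, serialization, and a train/validation/test split) can be sketched as below. Several things here are assumptions: the regular-expression tokenizer is a stand-in for a real word segmenter (Chinese text would need a dedicated one), the 80/10/10 split ratio is arbitrary, and every function name is hypothetical. The fine-tuning itself is not shown.

```python
import random
import re
from typing import Dict, List, Sequence, Tuple


def clean(texts: Sequence[str]) -> List[str]:
    """Collapse whitespace, drop empty strings, and remove exact duplicates."""
    cleaned, seen = [], set()
    for text in texts:
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def tokenize(text: str) -> List[str]:
    """Stand-in word segmentation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(token_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign an integer id to every token in order of first appearance."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def split(items: Sequence, val_frac: float = 0.1, test_frac: float = 0.1,
          seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle and split into (train, validation, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    return items[n_val + n_test:], items[:n_val], items[n_val:n_val + n_test]


def prepare(D: Sequence[str], D1: Sequence[str], D2: Sequence[str]):
    """Turn D, D', D'' into serialized id sequences D''' and split them."""
    texts = clean(list(D) + list(D1) + list(D2))
    token_lists = [tokenize(t) for t in texts]
    vocab = build_vocab(token_lists)
    encoded = [[vocab[tok] for tok in tokens] for tokens in token_lists]
    return vocab, split(encoded)


sample = ["sample text number %d" % i for i in range(10)]
vocab, (train, val, test) = prepare(sample[:4], sample[4:7], sample[7:])
print(len(train), len(val), len(test))
```

Deduplicating before the split matters here because D' and D'' are generated from overlapping sources, and identical sentences landing in both the training and test sets would inflate the evaluation.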
[0079] Text generated by large models can effectively expand the training text of language models, enrich the data, ensure the quality and relevance of the generated text, and improve the performance of language models in vertical domains, thereby effectively solving the problem of insufficient training data for language models.
Claims
1. A method for improving the performance of a vertical domain language model, characterized in that it comprises the following steps:
Step 1: Collect relevant high-quality text for the vertical field. The high-quality text sources mentioned in step 1 include manually generated text or web-crawled text that has been manually selected. The manually generated and manually selected crawled texts are denoted as dataset D.
Step 2: Generalize the collected text using a large model. In step 2, a pre-trained language model is used to generalize the dataset D from step 1 to obtain the generalized text dataset D'. Specifically, the pre-trained language model is loaded, and the dataset D from step 1 is used as the input of the pre-trained language model. Through the forward computation of the pre-trained language model, the output of the pre-trained language model is the generalized text dataset D'.
Step 3: Topic modeling. In step 3, the Latent Dirichlet Allocation topic model is used to model the generalized text dataset D' from step 2, automatically learning the topic distribution of the vertical domain dataset; topics are regarded as attribute dimensions, and high-frequency keywords under each topic are used as the attribute values of that topic. The specific process of step 3 is as follows:
Step 31: Assume that each document consists of multiple topics, and each topic is represented by a series of keywords;
Step 32: Model the document-topic distribution and topic-word distribution using Dirichlet prior distributions, so that each document can be represented by multiple topics in different proportions, and each topic contains each word with different probabilities;
Step 33: Through Bayesian inference and iterative solution, infer the document-topic distribution and topic-word distribution in reverse, including: randomly sampling topics according to the document-topic distribution, and then randomly sampling words according to the topic-word distribution, to synthesize documents; optimizing the parameters of the document-topic distribution and topic-word distribution based on the synthesized documents; and repeating the above steps until a stable document-topic distribution and topic-word distribution are obtained;
Step 34: Finally, each document is represented as a distribution of multiple topics, and each topic is represented as a distribution of words, automatically discovering the topic information of the document set; where the topic is the required attribute, and the words corresponding to each topic are the attribute values corresponding to the attribute.
Step 4: Train a Transformer model to learn the dependencies between the attributes in each dimension. In step 4, the attributes are used as the input sequence of the Transformer model. Each dimension of the attributes is transformed into a fixed-length vector through embedding, capturing the dependencies between different attributes and giving the attention weight of each attribute with respect to the output prediction. The attribute combination configuration is automatically generated based on the learning results of the Transformer model. The rules of the combination configuration are automatically learned by the model based on the input data.
Step 5: Select different attribute combinations, and then randomly select the attribute values corresponding to the attributes to form prompt words. In step 5, based on the topic-word distribution learned by the topic model in step 3, several high-frequency keywords are randomly selected for each topic as the attribute values of the topic; based on the optimal attribute combination generated in step 4, attribute values are randomly selected for each attribute, contradictory combinations are filtered out, and then the attributes and attribute values are inserted into the prompt template to generate complete attribute prompts.
Step 6: The large model generates relevant text D'' based on the complete attribute prompts generated in Step 5.
Step 7: Use the high-quality text dataset D collected in Step 1, the generalized text dataset D' from Step 2, and the related text D'' generated in Step 6 to train the language model for the vertical domain. In step 7, the high-quality text dataset D collected in step 1, the generalized text dataset D' from step 2, and the related text D'' generated in step 6 are processed into a format suitable for language model training, D''', through data cleaning, word segmentation, and serialization. Starting from the base model, the processed data D''' is used to fine-tune the model. Then, the model is validated and evaluated using the validation set and test set respectively to verify the model's performance and adjust and optimize the model as needed.
2. A system for improving the performance of vertical domain language models based on the method described in claim 1, characterized in that it includes a text generation module, a generalization module, a topic modeling module, a Transformer module, a prompt word module, and a training module;
The text generation module collects relevant high-quality text for vertical fields. The sources of the high-quality text include manually generated text or web-crawled text that has been manually selected. The manually generated and manually selected crawled texts are denoted as dataset D.
The generalization module generalizes the dataset D to obtain the generalized text dataset D'. Specifically, it loads a pre-trained language model, uses the dataset D as the input of the pre-trained language model, and outputs the generalized text dataset D' through the forward computation of the pre-trained language model.
The topic modeling module uses a Latent Dirichlet Allocation topic model to model the generalized text dataset D', automatically learning the topic distribution of the vertical domain dataset; topics are treated as attribute dimensions, and high-frequency keywords under each topic are used as attribute values for that topic, as detailed below: Assume that each document consists of multiple topics, and each topic is represented by a series of keywords; the document-topic distribution and topic-word distribution are modeled using Dirichlet prior distributions, so that each document can be represented by multiple topics in different proportions, and each topic contains each word with different probabilities; through Bayesian inference and iterative solution, the document-topic distribution and topic-word distribution are inferred in reverse, including: randomly sampling topics according to the document-topic distribution, then randomly sampling words according to the topic-word distribution, to synthesize documents; optimizing the parameters of the document-topic distribution and topic-word distribution based on the synthesized documents; and repeating the above steps until a stable document-topic distribution and topic-word distribution are obtained. Ultimately, each document is represented as a distribution of multiple topics, and each topic is represented as a distribution of words, automatically discovering the topic information of the document collection; where the topic is the required attribute, and the words corresponding to each topic are the attribute values corresponding to the attribute.
The Transformer module receives attributes as an input sequence, transforms each dimension of the attributes into a fixed-length vector through embedding, captures the dependencies between different attributes, and provides the attention weight of each attribute with respect to the output prediction. The module automatically generates attribute combination configurations based on the learning results of the Transformer module, and the rules for the combination configurations are automatically learned by the model based on the input data.
Based on the topic-word distribution learned in topic modeling, the prompt word module randomly selects several high-frequency keywords for each topic as the topic's attribute values; based on the optimal attribute combination generated by the Transformer module, it randomly selects attribute values for each attribute, filters out contradictory combinations, and then inserts the attributes and attribute values into the prompt template to generate complete attribute prompts. The large model generates relevant text D'' based on the generated prompt words.
The training module processes the obtained high-quality text dataset D, generalized text dataset D', and related text D'' through data cleaning, word segmentation, and serialization into a format suitable for language model training, D''', and then uses the processed data D''' to fine-tune the model starting from the base model. Finally, the model is validated and evaluated using the validation set and test set to verify its performance and adjust and optimize it as needed.