A large model knowledge injection method and system for a vertical industry

By denoising, structuring, and standardizing vertical industry knowledge data, and combining dynamic routing classifiers and hybrid retrieval, the problems of knowledge bias and retrieval redundancy in large models within vertical industries are solved, achieving efficient and professional knowledge injection and answer generation.

CN122240779A · Pending · Publication Date: 2026-06-19 · KUAIJI XINYUN (QINGDAO) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KUAIJI XINYUN (QINGDAO) TECHNOLOGY CO LTD
Filing Date
2026-03-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing large-scale models suffer from problems such as knowledge bias, non-standard expression, high computational resource consumption, redundant search results, and low knowledge retrieval efficiency in vertical industries, making it difficult to meet the needs of vertical industries for professionalism, accuracy, and compliance.

Method used

Professional knowledge data is collected from vertical industries and subjected to denoising, structural transformation, and terminology standardization. The data is then divided into independent knowledge modules, each with a vectorized index. A pre-trained dynamic routing classifier and a hybrid retrieval strategy are combined to generate answers that conform to industry standards.

Benefits of technology

It achieves high efficiency and accuracy in knowledge management, improves the adaptability of knowledge retrieval and the quality of search results, reduces computing costs, and ensures the professionalism and compliance of the output.

✦ Generated by Eureka AI based on patent content.

Figure CN122240779A_ABST

Abstract

This invention discloses a method and system for injecting knowledge into large models for vertical industries, belonging to the field of large model knowledge injection technology. The method involves: collecting professional knowledge data from vertical industries and performing denoising, structural transformation, and terminology standardization; dividing the data into knowledge modules and associating them with vectorized indexes and metadata tags to form a vertical industry knowledge base; extracting the semantic feature vector of the input question and using a dynamic routing classifier to route the question to the target knowledge module in the vertical industry knowledge base; employing a hybrid retrieval strategy to recall a relevant document set from the target knowledge module and deduplicating it to generate enhanced retrieval results; concatenating the enhanced retrieval results with the input question using an industry-specific prompt template; and inputting the concatenated text into the LoRA fine-tuned large model to generate accurate answers conforming to industry standards. This invention solves the problem of insufficient knowledge adaptability of large models in vertical industries, improving their accuracy and applicability.

Description

Technical Field

[0001] This invention relates to the field of large model knowledge injection technology, specifically to a method and system for large model knowledge injection for vertical industries.

Background Technology

[0002] With the development of artificial intelligence technology, general-purpose large models based on the Transformer architecture have demonstrated strong semantic processing and content generation capabilities in general scenarios of natural language understanding and text generation. They have been applied in fields such as information retrieval, intelligent customer service, and general question answering, promoting the large-scale application of artificial intelligence technology. However, in vertical industry scenarios, industry knowledge is highly specialized, standardized, and scenario-dependent. The training data of general-purpose large models mostly comes from publicly available general corpora, which lacks sufficient coverage and internalization of vertical industry professional terminology systems, industry-specific norms, and scenario-based knowledge. This leads to frequent knowledge biases, non-standard expressions, and even erroneous outputs in vertical industry question answering, failing to meet the vertical industry's requirements for professional, accurate, and compliant answers, thus restricting the in-depth implementation and industrial application of large models in vertical industries.

[0003] To improve the knowledge adaptability of large models in vertical industries, existing knowledge injection solutions still have significant shortcomings: Traditional full-scale fine-tuning solutions require updating the overall model parameters based on massive amounts of vertical industry-labeled data, facing problems such as high data acquisition costs, high annotation difficulty, long training cycles, and high computational resource consumption. Furthermore, they are prone to "catastrophic forgetting," causing the model to lose its general semantic understanding ability when absorbing industry knowledge. Existing retrieval enhancement solutions mostly adopt single retrieval strategies. Semantic retrieval is prone to insufficient relevance between recalled documents and problem requirements due to the special semantic connotations of industry terms. Keyword retrieval struggles to handle complex semantic matching scenarios within the industry and lacks standardized processing of industry knowledge terms and deduplication mechanisms for retrieval results, leading to redundant and repetitive retrieval results and inconsistent terminology. In addition, existing solutions do not construct a dynamic knowledge routing mechanism for vertical industries, making it impossible to accurately locate industry knowledge modules based on the semantic features of the input problem. This results in low knowledge retrieval efficiency, further affecting the quality of retrieval results and the industry adaptability of model output, failing to fundamentally solve the core problem of insufficient accuracy and applicability of large models in vertical industry knowledge. 
Therefore, there is an urgent need for a technical solution that can specifically address the pain points of knowledge routing, retrieval accuracy, and model fine-tuning, and achieve efficient and accurate injection of knowledge into vertical industries, so as to break through the application bottleneck of large models in vertical industry scenarios.

Summary of the Invention

[0004] The purpose of this invention is to provide a method and system for injecting large-scale model knowledge into vertical industries, so as to solve the problems mentioned in the background art.

[0005] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows: A method for injecting knowledge into large models for vertical industries includes the following steps:
S1. Collect professional knowledge data from vertical industries, and after denoising, structural transformation, and terminology standardization of the professional knowledge data, divide it into multiple independent knowledge modules according to industry domains, and associate each knowledge module with corresponding vectorized indexes and metadata tags to form a vertical industry knowledge base;
S2. Receive the user's input question, extract the semantic feature vector of the input question and input it into a pre-trained dynamic routing classifier to determine the industry domain to which the input question belongs, and select the corresponding target knowledge module from the vertical industry knowledge base according to the industry domain;
S3. Use a hybrid retrieval strategy to retrieve a relevant document set from the target knowledge module, and deduplicate the relevant document set to obtain enhanced retrieval results;
S4. Based on the preset industry prompt template, concatenate the enhanced retrieval results with the input question to form the large model input text, and input the large model input text into the fine-tuned large model to generate accurate answers that conform to vertical industry standards.

[0006] Preferably, the method for denoising, structural transformation, and terminology standardization of the professional knowledge data is as follows: A Transformer-based denoising autoencoder is used to identify and filter noisy data in the professional knowledge data. The autoencoder segments the professional knowledge data into 512-token segments and inputs them into the encoder to extract semantic feature vectors. The cosine similarity between each semantic feature vector and an industry pure semantic feature library is calculated. If the cosine similarity is <0.75, the segment is considered noisy and is directly filtered out; if the cosine similarity is ≥0.75, the semantic feature vector is input into the decoder to reconstruct noise-free text. The reconstructed text is length-validated and formatted according to the common text format of the vertical industry, and the retained valid professional knowledge data is output. The noisy data includes invalid redundant information, formatted data, and low-quality content mixed in with the professional knowledge data of the vertical industry. The industry pure semantic feature library is a set of semantic features extracted from noise-free authoritative data in the vertical industry. It is constructed as follows: select publicly available, noise-free authoritative data in the vertical industry, with a data volume of no less than 100,000 records. In the medical field, the full text of the guidelines reviewed by experts and the standardized case templates published by tertiary hospitals in the "China Clinical Guidelines Knowledge Base" can be selected. In the legal field, the currently valid legal provisions in the "National Laws and Regulations Database" and the original text of the guiding cases issued by the Supreme People's Court can be selected. The selected authoritative data is segmented into 512-token segments and uniformly converted into UTF-8 encoded plain text. 
The same encoder as the Transformer-based denoising autoencoder is used to extract semantic features from the UTF-8 encoded plain text, generate corresponding semantic feature vectors, and store them in the vector database to form the industry pure semantic feature library. Unstructured professional knowledge from effective professional knowledge data is extracted into core fields using entity recognition and key information extraction technologies. These core fields are then organized into structured professional knowledge according to industry standards and combined with the structured professional knowledge in the professional knowledge data to form intermediate processing data. Specifically, the unstructured professional knowledge refers to text-based data without a fixed format in the vertical industry. The industry standards refer to authoritative standard documents or management regulations within the vertical industry, including but not limited to the "Regulations on Medical Record Management of Medical Institutions" and "Medical Data Element Value Domain Code" in the medical field, the "Measures for the Management of Lawyer Business Files" and "Standards for Judicial Document Format" in the legal field, and the "Guidelines for Financial Data Security and Data Security Classification" and "Regulations on Enterprise Financial Accounting Reports" in the financial field. This ensures that the organization of the core fields conforms to the data storage and application habits within the vertical industry. Based on the terminology mapping relationship established by the authoritative industry terminology database, non-standard terms in the intermediate processing data are uniformly replaced with standard terms to obtain professional knowledge data after terminology standardization. The authoritative industry terminology database is a set of terms that are recognized and have standardization and authority within the vertical industry. 
The terminology mapping relationship is constructed through a combination of manual annotation and machine-aided verification. First, industry experts sort out the correspondence between non-standard terms and standard terms that may appear in the intermediate processing data. Then, based on the terminology embedding model Word2Vec, the semantic similarity between non-standard terms and standard terms is learned to verify and supplement the correspondence of manual annotation. The non-standard terms refer to non-standard expressions of the same professional concept within the vertical industry.
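
The noise-filtering rule in [0006] can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say how a segment's similarity "with the library" is aggregated, so the sketch assumes the best match against any library vector is used, and the vectors here are toy 2-D lists rather than encoder outputs. The names `segment`, `cosine`, and `is_noise` are hypothetical.

```python
import math

SEGMENT_TOKENS = 512     # segment length stated in the text
NOISE_THRESHOLD = 0.75   # similarity threshold stated in the text


def segment(tokens, size=SEGMENT_TOKENS):
    """Split a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_noise(segment_vec, clean_library):
    """Assumption: a segment is noise when its best library match is below 0.75."""
    best = max(cosine(segment_vec, ref) for ref in clean_library)
    return best < NOISE_THRESHOLD


clean_library = [[1.0, 0.0], [0.0, 1.0]]       # toy "pure semantic feature library"
print(is_noise([0.9, 0.1], clean_library))     # close to a library vector
print(is_noise([1.0, 1.0], clean_library))     # cosine is about 0.707 to both
```

Segments that pass the check would then go to the decoder for reconstruction, which this sketch does not model.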

[0007] Preferably, the Transformer-based denoising autoencoder adopts an encoder-decoder bidirectional Transformer architecture; wherein the encoder has 6 Transformer encoder blocks, each containing one 8-head multi-head self-attention layer and one feedforward neural network; the decoder has 4 Transformer decoder blocks, each adding one 8-head cross-attention layer to the Transformer encoder block, and the parameters of the Transformer decoder blocks are consistent with those of the Transformer encoder; the input layer uses industry-adapted word embedding and incorporates positional encoding; the output layer reconstructs text through a linear layer and a Softmax function.

[0008] Preferably, the method for dividing the data into multiple independent knowledge modules according to industry domains is as follows: The terminology-standardized professional knowledge data is clustered into several primary knowledge modules by industry domain using the density-based DBSCAN algorithm. Furthermore, where a single industry domain has a large knowledge volume or clearly defined sub-domains, its primary knowledge module can be further divided into several secondary knowledge modules by sub-domain using an authoritative industry classification tree, with each secondary knowledge module corresponding to a specific sub-domain. In addition, the resulting knowledge modules must possess both knowledge independence and scalability. Knowledge independence means that the professional knowledge within each module revolves around the core needs of the corresponding industry domain, that there is no redundant or repeated core knowledge content between modules, and that each module can be updated and maintained independently. Scalability means that independent knowledge modules can be added using the same partitioning logic when new vertical industries are added or existing vertical industries add new sub-domains.
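
To make the clustering step in [0008] concrete, the following is a minimal pure-Python sketch of DBSCAN. It is not the patent's implementation: the text gives no `eps`, `min_pts`, or distance metric, so the values below and the use of Euclidean distance on toy 2-D points are assumptions chosen for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise            # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster      # border point of the current cluster
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Toy "embeddings": two dense groups and one outlier.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
print(dbscan(points, eps=0.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

In this reading, each cluster label would correspond to a primary knowledge module, and points labeled -1 would need separate handling that the text does not describe.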

[0009] Preferably, the vectorized index is constructed using an industry-adapted pre-trained language model. Specifically, the professional knowledge data within a knowledge module is split into the smallest semantic units according to semantic integrity, and each smallest semantic unit is input into the industry-adapted pre-trained language model to be converted into a vector representation. The vector representations are then assigned to knowledge modules, stored in a vector database, and a mapping relationship is established between the vector representations and the original text of the semantic units to form a complete vectorized index. The industry-adapted pre-trained language model includes BioBERT in the medical field, Legal-BERT in the legal field, or FinBERT in the financial field. The metadata tags at least include the industry domain and data format type of the knowledge module.
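
One possible shape for the vectorized index and metadata tags in [0009] is sketched below. The `KnowledgeModule` class and the `toy_embed` function are hypothetical: `toy_embed` is a hashed bag-of-words stand-in for an industry-adapted model such as BioBERT, and a plain Python list stands in for the vector database.

```python
from dataclasses import dataclass, field


def toy_embed(text, dim=8):
    """Hypothetical stand-in for an industry-adapted encoder: hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec


@dataclass
class KnowledgeModule:
    industry_domain: str                          # metadata tag named in the text
    data_format: str                              # metadata tag named in the text
    vectors: list = field(default_factory=list)   # vectors[i] belongs to texts[i]
    texts: list = field(default_factory=list)     # original text of each semantic unit

    def add_unit(self, text):
        """Embed one smallest semantic unit and record the vector-to-text mapping."""
        self.vectors.append(toy_embed(text))
        self.texts.append(text)
        return len(self.texts) - 1

    def source_text(self, unit_id):
        """Map an index entry back to the original text of its semantic unit."""
        return self.texts[unit_id]


module = KnowledgeModule(industry_domain="cardiovascular medication", data_format="text")
unit_id = module.add_unit("Amlodipine lowers blood pressure.")
print(module.source_text(unit_id))
```

Keeping the vector and its source text under the same identifier is what lets a retrieval hit be traced back to the passage it came from.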

[0010] Preferably, the vertical industry knowledge base consists of several knowledge modules with associated metadata tags and vectorized indexes.

[0011] Preferably, if the input question is in voice form, the input question is converted into text form using ASR technology; if the input question is in image form, the input question is converted into text form using OCR technology.

[0012] Preferably, the semantic feature vector is obtained by inputting the input question into an industry-adapted pre-trained language model. The pre-trained dynamic routing classifier is trained using a question domain labeling dataset from a vertical industry. The classifier employs a hybrid TextCNN and fully connected layer architecture, which includes an industry-adaptive word embedding layer, multi-scale convolutional layers, a pooling layer, and two fully connected layers. This architecture extracts the domain features of the input question and maps them to the corresponding industry domain. The question domain labeling dataset originates from publicly available question-and-answer databases and enterprise-owned compliant data, totaling no fewer than 100,000 entries. Each sample in the dataset is labeled with its corresponding industry domain by at least 10 industry experts. The labeled dataset is divided into training, validation, and test sets at a ratio of 8:1:1. The dynamic routing classifier is trained on these sets using a cross-entropy loss function, the AdamW optimizer, and a batch size of 32, until it achieves an accuracy greater than 90% on the test set. In this architecture, the industry-adaptive word embedding layer uses word embedding parameters from a domain-pre-trained model and incorporates positional encoding. The multi-scale convolutional layer contains three kernel sizes: 3×512, 4×512, and 5×512, with 128 kernels of each size, used to capture domain keyword features of different lengths. The pooling layer uses max pooling, taking the maximum value from the output of each kernel to obtain the global feature vector. 
In the two fully connected layers, the first layer is used for global feature vector compression, and the second layer outputs the industry domain label probability distribution, achieving accurate matching between the input question and the industry domain.
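
The numbers in [0012] fix several dimensions of the routing classifier, which the short arithmetic sketch below works out. It assumes stride-1 convolutions without padding and ignores bias terms. The text does not state either choice, and the function names are hypothetical.

```python
KERNEL_HEIGHTS = (3, 4, 5)   # kernel sizes 3x512, 4x512, 5x512 from the text
KERNELS_PER_SIZE = 128       # 128 kernels of each size
EMBED_DIM = 512              # kernel width, i.e. the embedding dimension


def conv_output_length(seq_len, kernel_height):
    """Positions produced by a stride-1, unpadded convolution over the token axis."""
    return seq_len - kernel_height + 1


def pooled_feature_dim():
    """Max pooling keeps one value per kernel, so the global vector has 3 * 128 entries."""
    return len(KERNEL_HEIGHTS) * KERNELS_PER_SIZE


def conv_weight_count():
    """Convolution weights only (bias terms are left out of this count)."""
    return sum(k * EMBED_DIM * KERNELS_PER_SIZE for k in KERNEL_HEIGHTS)


def split_sizes(n, ratio=(8, 1, 1)):
    """Train / validation / test sizes for the 8:1:1 split described in the text."""
    total = sum(ratio)
    train = n * ratio[0] // total
    val = n * ratio[1] // total
    return train, val, n - train - val


print(pooled_feature_dim())     # 384
print(conv_weight_count())      # 786432
print(split_sizes(100_000))     # (80000, 10000, 10000)
```

Under these assumptions the first fully connected layer receives a 384-dimensional global feature vector, whatever the length of the input question.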

[0013] Preferably, the hybrid retrieval strategy includes semantic retrieval and keyword retrieval. The semantic retrieval uses semantic feature vectors as the retrieval basis and recalls the top 40% of semantically similar documents by calculating cosine similarity in the vectorized index of the target knowledge module. The keyword retrieval uses the TF-IDF algorithm to extract the question keywords of the input question as the retrieval basis and employs the BM25 algorithm to recall the top 30% of keyword-matching documents in the target knowledge module. The relevant document set consists of several relevant documents, which are the intersection of all semantically similar documents and all keyword-matching documents. The semantically similar documents and keyword-matched documents are derived from professional knowledge data. After terminology standardization, the professional knowledge data is further broken down into knowledge units that are adapted to the knowledge module. Among them, the text-type knowledge units that can participate in retrieval are the semantically similar documents and keyword-matched documents in the hybrid retrieval strategy, which are the core components of professional knowledge data used for retrieval in the knowledge module.
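
The hybrid strategy in [0013] can be sketched as two rankings followed by a set intersection. This is an illustration, not the patent's code: the BM25 constants `k1=1.5` and `b=0.75` are common defaults rather than values from the text, the keyword list is passed in directly instead of being extracted with TF-IDF, whitespace tokenization is assumed, and the "top 40%" and "top 30%" cut-offs are rounded up.

```python
import math
from collections import Counter


def top_fraction(scores, fraction):
    """Ids of the best-scoring `fraction` of documents (count rounded up)."""
    k = math.ceil(len(scores) * fraction)
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bm25_scores(keywords, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the given query keywords."""
    tokens = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n
    df = Counter(w for t in tokens.values() for w in set(t))
    scores = {}
    for d, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for w in keywords:
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            norm = tf[w] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[w] * (k1 + 1) / norm
        scores[d] = score
    return scores


def hybrid_retrieve(query_vec, keywords, doc_vecs, docs):
    """Intersection of the semantic top 40% and the BM25 top 30%."""
    semantic = top_fraction({d: cosine(query_vec, v) for d, v in doc_vecs.items()}, 0.40)
    keyword = top_fraction(bm25_scores(keywords, docs), 0.30)
    return semantic & keyword
```

Because the result is an intersection, a document must rank well under both views to be recalled, which trades recall for precision. A production version would also need a rule for documents that tie at a score of zero.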

[0014] Preferably, the method for deduplicating the relevant document set is as follows: Relevant documents in the relevant document set are paired in pairs without repetition to obtain several relevant document pairs. For each relevant document pair, the SimHash algorithm is used to calculate the document hash value. If the document hash value similarity is greater than or equal to 90%, one relevant document in the relevant document pair is randomly retained, and the other relevant document in the relevant document pair is deleted. Otherwise, the relevant documents in the relevant document pair are retained. Based on this, all relevant documents that have not been deleted from all relevant document pairs are collected, and duplicate relevant documents are removed to form the search enhancement results.
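
A minimal sketch of the pairwise SimHash deduplication in [0014] follows. Several details are assumptions not given in the text: 64-bit fingerprints, whitespace tokens with equal weight, BLAKE2b as the per-token hash, and similarity defined as the share of matching fingerprint bits. Where the text retains one document of a pair at random, the sketch deterministically keeps the earlier one so that the result is reproducible.

```python
import hashlib
from itertools import combinations

BITS = 64  # assumed fingerprint width


def simhash(text):
    """64-bit SimHash over whitespace tokens, each token weighted equally."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def similarity(a, b):
    """Fraction of fingerprint bits that agree (1.0 means identical fingerprints)."""
    return 1 - bin(a ^ b).count("1") / BITS


def deduplicate(docs, threshold=0.90):
    """Compare every unordered pair once; drop the later document of a near-duplicate pair."""
    fingerprints = {doc_id: simhash(text) for doc_id, text in docs.items()}
    removed = set()
    for a, b in combinations(docs, 2):
        if a in removed or b in removed:
            continue
        if similarity(fingerprints[a], fingerprints[b]) >= threshold:
            removed.add(b)  # stand-in for the random choice described in the text
    return [doc_id for doc_id in docs if doc_id not in removed]


docs = {
    "a": "amlodipine 2.5 mg once daily for elderly hypertension",
    "b": "amlodipine 2.5 mg once daily for elderly hypertension",
    "c": "contract law article on breach of lease obligations",
}
print(deduplicate(docs))
```

Comparing all pairs is quadratic in the number of recalled documents, which is acceptable here because deduplication runs only on the already narrowed relevant document set.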

[0015] Preferably, the industry prompt template includes an industry role definition, output constraint rules, and knowledge tracing requirements. The industry role definition is used to limit the answer perspective of the large model, the output constraint rules are used to clarify the format specifications of the answer, and the knowledge tracing requirements are used to guide the large model to annotate the source information of the search enhancement results in the answer. The fine-tuned large model is obtained by fine-tuning a general large model using LoRA technology with vertical industry question-answer pairs as fine-tuning training data. The vertical industry Q&A pairs consist of paired data of typical questions and corresponding standard answers in a vertical industry.

[0016] Preferably, the fine-tuning of the general large model uses the open-source general large model LLaMA-7B as the base model and employs LoRA fine-tuning technology for parameter-efficient fine-tuning of the base model. The specific process is as follows: Select publicly available authoritative data from vertical industries and private compliant data from enterprises to construct a fine-tuning training dataset containing no fewer than 100,000 vertical industry question-and-answer pairs. Divide the fine-tuning training dataset into a training set, a validation set, and a test set in a 7:2:1 ratio. The text length of a single question-and-answer pair is kept within 4096 tokens. The Hugging Face Transformers library is used to load the weights of the base model, with device_map set to automatically allocate GPU / CPU memory, and the input sequence length of the base model is extended from 2048 to 4096. The position encoding vectors of the newly added positions are filled in by linear interpolation to adapt to the input requirements of vertical industry text. At the same time, all parameters of the base model are frozen, and only the parameter interfaces of the query projection layer and value projection layer in its Transformer layers are retained. A LoRA layer is then built with the PEFT library and bound to these parameter interfaces of the base model. Only the parameters of the LoRA layer are trained on the fine-tuning training dataset. After training, the fine-tuned large model, that is, the large model fine-tuned by LoRA, is obtained. The rank of the LoRA layer is set to 8, the scaling factor to 16, and the dropout probability to 0.05; its input projection is initialized with a normal distribution, and its output projection is initialized as an all-zero matrix.
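
The LoRA settings in [0016] can be illustrated without any deep learning library. The sketch below is not the PEFT implementation. It assumes the stated "scaling factor" of 16 plays the role of LoRA's alpha, so the update is multiplied by alpha / rank = 2, and it assumes a standard deviation of 0.02 for the normal initialization, which the text does not give. Dropout is omitted. The parameter count uses the published LLaMA-7B shape (hidden size 4096, 32 Transformer layers), which is not stated in the text.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                 # frozen base weight, d_out x d_in
        self.scaling = alpha / rank          # 16 / 8 = 2.0
        # Input projection A: normal initialization (std 0.02 is an assumption).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Output projection B: all zeros, so the update starts at exactly zero.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        delta = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scaling * d[0] for b, d in zip(base, delta)]


def lora_param_count(d_model=4096, n_layers=32, rank=8, targets=2):
    """Trainable parameters when LoRA wraps the query and value projections."""
    return n_layers * targets * rank * (d_model + d_model)


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]])
print(layer.forward([1.0, 1.0]))   # [3.0, 7.0]: identical to the frozen layer at init
print(lora_param_count())          # 4194304
```

The zero-initialized output projection means fine-tuning starts from the base model's exact behavior, and the roughly 4.2 million trainable parameters are a small fraction of the roughly 7 billion frozen ones.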

[0017] Due to the adoption of the above technical solution, the technical progress achieved by this invention compared to the prior art is as follows: 1. This invention improves the quality of the knowledge base through a three-layer progressive knowledge processing approach. It relies on a Transformer-based denoising autoencoder to accurately filter noisy data in professional knowledge data, and combines entity recognition and key information extraction to transform unstructured knowledge into structured knowledge according to industry standards. Then, it standardizes terminology through manual annotation combined with Word2Vec. Furthermore, it constructs independent and scalable "first-level-second-level" knowledge modules according to industry domains, and extracts semantic features with multimodal input conversion and industry-adaptive models. Combined with a pre-trained dynamic routing classifier, it accurately locates the target knowledge module, completely solving the problems of inefficient knowledge management and ambiguous positioning, and improving the adaptability and efficiency of knowledge retrieval.

[0018] 2. This invention designs a hybrid retrieval and intelligent deduplication strategy to solve the key defects of low retrieval accuracy and redundant results. By integrating semantic retrieval and keyword retrieval, the intersection of the two is taken to improve retrieval accuracy and avoid the bias of single retrieval. Then, the SimHash algorithm is used to compare the hash values of documents, automatically deduplicating highly similar documents and removing duplicate documents, effectively eliminating redundancy in retrieval results and ensuring high quality of enhanced retrieval results.

[0019] 3. This invention uses LoRA technology to freeze the main parameters of a general large model and trains low-rank matrices to inject industry knowledge, using vertical industry question-and-answer pairs as training data, which reduces costs, shortens the training cycle, and avoids catastrophic forgetting. At the same time, it constructs prompt templates containing industry role definitions, output constraints, and traceability requirements to ensure that the output of the large model conforms to industry standards and to improve the professionalism, compliance, and credibility of the answers.

Attached Figure Description

[0020] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this invention. For those skilled in the art, other drawings can be obtained based on these drawings.

[0021] Figure 1 is a schematic diagram of the method flow of the present invention.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0023] Embodiment: as shown in Figure 1, a method for injecting knowledge into large models for vertical industries includes the following steps:
S1. Collect professional knowledge data from vertical industries, and after denoising, structural transformation, and terminology standardization of the professional knowledge data, divide it into multiple independent knowledge modules according to industry domains, and associate each knowledge module with corresponding vectorized indexes and metadata tags to form a vertical industry knowledge base;
S2. Receive the user's input question, extract the semantic feature vector of the input question and input it into a pre-trained dynamic routing classifier to determine the industry domain to which the input question belongs, and select the corresponding target knowledge module from the vertical industry knowledge base according to the industry domain;
S3. Use a hybrid retrieval strategy to retrieve a relevant document set from the target knowledge module, and deduplicate the relevant document set to obtain enhanced retrieval results;
S4. Based on the preset industry prompt template, concatenate the enhanced retrieval results with the input question to form the large model input text, and input the large model input text into the fine-tuned large model to generate accurate answers that conform to vertical industry standards.

[0024] Furthermore, the working principle of the present invention will be illustrated below through embodiments: This embodiment applies the present invention to an intelligent question-and-answer system in the medical industry, aiming to address the need for accurate questions and answers from clinicians regarding disease diagnosis and treatment plans, and from patients regarding medication safety and examination recommendations. It avoids the problems of terminology confusion and biased diagnosis and treatment recommendations that occur in general large models in the medical field, and ensures that the output content complies with medical industry standards such as the "Clinical Practice Guidelines" and the "Regulations on the Management of Drug Instructions".

[0025] In constructing the medical industry knowledge base, professional knowledge data is first collected from the medical industry. This data is preprocessed using a Transformer-based denoising autoencoder to filter out irrelevant privacy notes, advertising summaries, and formatted fragments, retaining only valid professional knowledge data containing medical semantics. Next, BioBERT-based entity recognition and key information extraction techniques are employed to extract core fields such as "examination date - patient age - imaging features - impression diagnosis" from the unstructured professional knowledge data. These are then organized into JSON-formatted structured data according to the "Regulations on Medical Record Management in Medical Institutions" and integrated with the original structured professional knowledge data to form intermediate processing data. Finally, a terminology mapping relationship is established based on the "International Classification of Diseases (ICD-11)" and the "Generic Names of Drugs" catalog, and non-standard expressions in the intermediate processing data are uniformly replaced with the corresponding standard terms, such as "myocardial infarction" and "chronic bronchitis". Subsequently, the standardized data is divided into a "cardiovascular disease knowledge module", a "respiratory system disease knowledge module", a "pharmacy knowledge module", and a "medical imaging diagnosis knowledge module" according to medical sub-fields. The "cardiovascular disease knowledge module" is further subdivided into a "coronary heart disease diagnosis and treatment sub-module", an "arrhythmia diagnosis and treatment sub-module", and a "cardiovascular medication sub-module". The BioBERT model is used to convert the smallest semantic units of each module into semantic vectors, which are stored in the Milvus vector database to build a vectorized index. 
At the same time, metadata tags for each module are labeled with the industry field and data format type, and finally a medical industry knowledge base is formed.
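
The structuring and terminology steps of this embodiment can be sketched as follows. The mapping entries are hypothetical English examples, since a real mapping would be derived from ICD-11 and the generic drug name catalog, and the JSON key names are assumptions based on the core fields listed in [0025].

```python
import json
import re

# Hypothetical mapping from non-standard expressions to standard terms.
TERM_MAP = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
}


def standardize_terms(text, term_map=TERM_MAP):
    """Replace each non-standard expression with its standard term (case-insensitive)."""
    # Longer phrases go first so overlapping entries behave predictably.
    for term in sorted(term_map, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, term_map[term], text, flags=re.IGNORECASE)
    return text


def to_structured_record(exam_date, patient_age, imaging_features, impression):
    """Organize the extracted core fields into a JSON record with standardized terms."""
    record = {
        "examination_date": exam_date,
        "patient_age": patient_age,
        "imaging_features": standardize_terms(imaging_features),
        "impression_diagnosis": standardize_terms(impression),
    }
    return json.dumps(record, ensure_ascii=False)


print(standardize_terms("History of heart attack and high blood pressure."))
# History of myocardial infarction and hypertension.
```

A dictionary lookup like this only covers expressions that experts have already listed. The Word2Vec similarity check described in [0006] is what would propose additional candidate pairs for review.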

[0026] If a user inputs a voice question, "My mother is 72 years old. For the past week, she has experienced dizziness every morning after waking up. Her blood pressure is 150 / 95 mmHg. What medication should she take?", the question is converted into text using ASR technology adapted for the medical industry. If the user inputs an image of a blood pressure monitoring record, medical OCR technology is used to extract text information such as "blood pressure value of 150 / 95 mmHg at 8:00 AM daily from May 8, 2024 to May 14, 2024, patient age 72, chief complaint dizziness." The text input question is then fed into the BioBERT model to extract a semantic feature vector. This semantic feature vector accurately represents the medical semantic needs of "elderly patient + hypertension + dizziness + medication advice." The semantic feature vector is then fed into a dynamic routing classifier, which is trained using a medical industry question label dataset containing 10,000 labeled samples and trained with a cross-entropy loss function. The dynamic routing classifier determines that the input question belongs to the cardiovascular medication field and accordingly routes and selects the target knowledge module from the medical industry knowledge base.

[0027] A hybrid retrieval strategy is then executed against the target knowledge module. Semantic retrieval takes the semantic feature vector of the input question and uses cosine similarity to recall the top 40% of semantically similar documents from the vectorized index of the target knowledge module. These include the medication recommendations for elderly hypertension in the "National Essential Medicines List - Cardiovascular Drugs" and a 2024 paper on individualized medication for elderly hypertension in the "Chinese Journal of Cardiovascular Diseases". Keyword retrieval uses the TF-IDF algorithm to extract the question keywords "72 years old, hypertension, dizziness, medication" and the BM25 algorithm to recall the top 30% of keyword-matching documents. These include guidelines for medication selection in elderly hypertension with dizziness and instructions for the dosage and administration of amlodipine in elderly patients. The intersection of the two groups of documents forms the relevant document set. This set is then deduplicated: the documents are paired, and a hash value is calculated for each document with the SimHash algorithm. If the hash similarity of a pair is ≥90%, as for the relevant section of the National Essential Medicines List and the guideline on medication for elderly patients with hypertension, one document of the pair is retained at random. If the similarity is <90%, both documents are retained. Finally, the retained documents are collected and any remaining duplicates are removed, giving enhanced retrieval results that contain the relevant chapters of the "National Essential Medicines List - Cardiovascular Drugs" and the 2024 article from the "Chinese Journal of Cardiovascular Diseases".
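The recall-and-intersect part of the hybrid retrieval strategy can be sketched as follows. The sketch makes several assumptions: keywords are passed in directly rather than extracted with TF-IDF, documents are tokenized on whitespace, the BM25 parameters `k1` and `b` use common default values that the description does not state, and the toy documents and vectors are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_fraction(scores, fraction):
    """Return the ids of the highest-scoring `fraction` of documents (at least one)."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def bm25_scores(keywords, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the given lowercase keywords."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / n
    df = Counter(w for t in tokenized.values() for w in set(t))
    scores = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for w in keywords:
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            norm = tf[w] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[w] * (k1 + 1) / norm
        scores[d] = score
    return scores

def hybrid_retrieve(query_vec, keywords, doc_vecs, docs, sem_frac=0.4, kw_frac=0.3):
    """Intersect the top 40% by cosine similarity with the top 30% by BM25."""
    semantic = top_fraction({d: cosine(query_vec, v) for d, v in doc_vecs.items()}, sem_frac)
    keyword = top_fraction(bm25_scores(keywords, docs), kw_frac)
    return semantic & keyword

# Invented toy corpus and 3-dimensional stand-ins for BioBERT vectors.
DOCS = {
    "d1": "amlodipine dosage for elderly hypertension medication",
    "d2": "elderly hypertension dizziness medication selection guideline",
    "d3": "asthma inhaler usage",
    "d4": "chest x-ray interpretation",
    "d5": "hypertension diet advice",
}
DOC_VECS = {
    "d1": [0.8, 0.2, 0.0], "d2": [0.9, 0.1, 0.1], "d3": [0.1, 0.9, 0.0],
    "d4": [0.0, 0.1, 0.9], "d5": [0.5, 0.5, 0.0],
}
result = hybrid_retrieve([0.9, 0.1, 0.0], ["hypertension", "dizziness", "medication"], DOC_VECS, DOCS)
print(sorted(result))
```

Taking the intersection rather than the union favors precision: a document must be both semantically close and keyword-matching to reach the deduplication step, which is consistent with the stated goal of reducing redundant results.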

[0028] The enhanced retrieval results are concatenated with the input question using a preset prompt template for the medical industry. The template includes an industry role definition ("As a medical consultant with clinical pharmacy qualifications, you need to provide medication advice based on authoritative medical knowledge"), output constraints ("Answers should include 'preliminary judgment - recommended drugs - usage and dosage - precautions', using ICD-11 and generic drug names for terminology"), and knowledge tracing requirements ("Indicate the source documents and chapters of the knowledge"). The concatenation forms the input text for the large model, which is then fed into a large model fine-tuned with LoRA. The fine-tuning uses medical industry question-and-answer pairs as training data, such as "Q: What medication is recommended for a 65-year-old patient with hypertension and a blood pressure of 145 / 90 mmHg? A: Amlodipine 2.5 mg / day, source: National Essential Medicines List." LoRA fine-tunes the general large model by updating only the low-rank matrix parameters of its attention layers, which avoids catastrophic forgetting. The model ultimately generates an accurate answer that conforms to medical standards. The answer includes a preliminary diagnosis (the patient is an elderly person with essential hypertension, and dizziness is related to elevated blood pressure), recommended medication (amlodipine for injection), dosage and administration (initially 2.5 mg / day), and precautions (monitoring lower extremity edema and regular blood pressure checks), with each part clearly indicating the corresponding knowledge source.
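The concatenation of template, retrieval results, and question can be sketched as below. The three quoted template components are taken from the description; the layout of the template, the `build_prompt` helper, the record keys `source` and `text`, and the placeholder passage texts are assumptions made for illustration.

```python
# Template components quoted from the description.
ROLE = ("As a medical consultant with clinical pharmacy qualifications, you need to "
        "provide medication advice based on authoritative medical knowledge.")
CONSTRAINTS = ("Answers should include 'preliminary judgment - recommended drugs - "
               "usage and dosage - precautions', using ICD-11 and generic drug names "
               "for terminology.")
TRACING = "Indicate the source documents and chapters of the knowledge."

# Hypothetical layout; the description does not fix the order or wording of sections.
TEMPLATE = (
    "{role}\n"
    "Output constraints: {constraints}\n"
    "Knowledge tracing: {tracing}\n\n"
    "Retrieved knowledge:\n{context}\n\n"
    "Question: {question}\n"
)

def build_prompt(question, retrieved, role=ROLE, constraints=CONSTRAINTS, tracing=TRACING):
    """Concatenate the enhanced retrieval results with the input question."""
    # Number each passage and show its source so the model can cite it.
    context = "\n".join(
        f"[{i}] ({doc['source']}) {doc['text']}" for i, doc in enumerate(retrieved, 1)
    )
    return TEMPLATE.format(
        role=role, constraints=constraints, tracing=tracing,
        context=context, question=question,
    )

question = ("My mother is 72 years old. For the past week, she has experienced dizziness "
            "every morning after waking up. Her blood pressure is 150 / 95 mmHg. "
            "What medication should she take?")
retrieved = [
    {"source": "National Essential Medicines List - Cardiovascular Drugs",
     "text": "<retrieved passage text>"},
    {"source": "Chinese Journal of Cardiovascular Diseases, 2024",
     "text": "<retrieved passage text>"},
]
prompt = build_prompt(question, retrieved)
print(prompt)
```

Numbering the passages and carrying the source name alongside each one gives the model what it needs to satisfy the knowledge tracing requirement in its answer.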

[0029] This answer was validated by three attending cardiologists and two clinical pharmacists. The accuracy of the answer reached 92% and its compliance reached 95%. The retrieval time was 120 ms and the model generation time was 450 ms, which met the real-time requirements. The reference value for clinical practice and for patients reached 90%. The method thus effectively addresses the insufficient knowledge adaptability of general large models in the medical industry and improves the accuracy and industry applicability of the answer.

[0030] The above description describes specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any modifications or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A large model knowledge injection method for a vertical industry, characterized in that it includes the following steps:
S1. Collect professional knowledge data from a vertical industry, and after denoising, structural transformation, and terminology standardization of the professional knowledge data, divide it into multiple independent knowledge modules according to industry fields, and associate each knowledge module with a corresponding vectorized index and metadata tags to form a vertical industry knowledge base;
S2. Receive the user's input question, extract the semantic feature vector of the input question and input it into a pre-trained dynamic routing classifier to determine the industry field to which the input question belongs, and select the corresponding target knowledge module from the vertical industry knowledge base according to the industry field;
S3. Use a hybrid retrieval strategy to retrieve a relevant document set from the target knowledge module, and deduplicate the relevant document set to obtain enhanced retrieval results;
S4. Based on a preset industry prompt template, concatenate the enhanced retrieval results with the input question to form a large model input text, and input the large model input text into a fine-tuned large model to generate an accurate answer that conforms to the vertical industry standards.

2. The large model knowledge injection method for a vertical industry according to claim 1, characterized in that the denoising, structural transformation, and terminology standardization of the professional knowledge data are performed as follows: noise in the professional knowledge data is identified and filtered out using a Transformer-based denoising autoencoder, retaining the valid professional knowledge data; core fields are then extracted from the unstructured professional knowledge in the valid professional knowledge data using entity recognition technology and key information extraction technology; the core fields are organized into structured professional knowledge according to industry standards and, together with the structured professional knowledge already present in the professional knowledge data, form intermediate processing data; based on a terminology mapping relationship established from an authoritative industry terminology database, the non-standard terms in the intermediate processing data are uniformly replaced with standard terms, giving the terminology-standardized professional knowledge data.

3. The large model knowledge injection method for a vertical industry according to claim 2, characterized in that the vectorized index is constructed using an industry-adapted pre-trained language model, which includes BioBERT in the medical field, Legal-BERT in the legal field, or FinBERT in the financial field; the metadata tags include at least the industry field to which the knowledge module belongs and the data format type.

4. The large model knowledge injection method for a vertical industry according to claim 1, characterized in that, if the input question is not in text format, the input question is converted into text format; specifically, if the input question is in speech format, it is converted into text format using ASR technology, and if the input question is in image format, its text is extracted using OCR technology.

5. The large model knowledge injection method for a vertical industry according to claim 4, characterized in that the semantic feature vector is extracted using a pre-trained language model adapted to the industry of the input question; the dynamic routing classifier is trained with a cross-entropy loss function, using a question industry-labeled dataset of the vertical industry as training data; the labels in the question industry-labeled dataset correspond one-to-one with the industry fields to which the knowledge modules belong.

6. The large model knowledge injection method for a vertical industry according to claim 5, characterized in that the hybrid retrieval strategy includes semantic retrieval and keyword retrieval; the semantic retrieval uses the semantic feature vector as the retrieval basis and recalls the top 40% of semantically similar documents by cosine similarity from the vectorized index of the target knowledge module; the keyword retrieval uses the TF-IDF algorithm to extract the question keywords of the input question as the retrieval basis and uses the BM25 algorithm to recall the top 30% of keyword-matching documents from the target knowledge module; the relevant document set consists of the relevant documents that form the intersection of all semantically similar documents and all keyword-matching documents.

7. The large model knowledge injection method for a vertical industry according to claim 6, characterized in that the relevant document set is deduplicated as follows: the relevant documents in the relevant document set are combined into pairs, with no pair repeated, to obtain several relevant document pairs; for each relevant document pair, the SimHash algorithm is used to calculate the hash value of each document; if the similarity of the two hash values is greater than or equal to 90%, one relevant document of the pair is retained at random and the other is deleted; otherwise, both relevant documents of the pair are retained; all relevant documents that have not been deleted are then collected from all relevant document pairs, and duplicate relevant documents are removed to form the enhanced retrieval results.
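By way of illustration only, the pairwise SimHash deduplication recited above can be sketched as follows. The sketch assumes 64-bit fingerprints built from whitespace tokens hashed with SHA-256, and measures similarity as the fraction of matching fingerprint bits; none of these choices is stated in the text. It also keeps the earlier document of a near-duplicate pair instead of choosing at random, so that the result is reproducible.

```python
import hashlib
from itertools import combinations

def simhash(text, bits=64):
    """SimHash fingerprint: per-bit majority vote over the token hashes."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def similarity(h1, h2, bits=64):
    """Fraction of fingerprint bits that agree (1 minus normalized Hamming distance)."""
    return 1 - bin(h1 ^ h2).count("1") / bits

def deduplicate(docs, threshold=0.9):
    """Drop one document from every pair whose fingerprints are >= threshold similar."""
    hashes = {doc_id: simhash(text) for doc_id, text in docs.items()}
    removed = set()
    for a, b in combinations(docs, 2):  # every unordered pair exactly once
        if a in removed or b in removed:
            continue
        if similarity(hashes[a], hashes[b]) >= threshold:
            removed.add(b)  # assumption: keep the earlier document, not a random one
    return {doc_id: text for doc_id, text in docs.items() if doc_id not in removed}

# Invented documents: "a" and "b" are exact copies, "c" shares no tokens with them.
docs = {
    "a": "amlodipine dosage and administration in elderly patients",
    "b": "amlodipine dosage and administration in elderly patients",
    "c": "chest radiograph interpretation checklist",
}
kept = deduplicate(docs)
print(list(kept))
```

Collecting the surviving documents into a dictionary keyed by document id also performs the final duplicate-removal step, since a document that appears in several pairs is kept only once.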

8. The large model knowledge injection method for a vertical industry according to claim 1, characterized in that the industry prompt template includes an industry role definition, output constraint rules, and knowledge tracing requirements; the industry role definition is used to limit the answer perspective of the large model, the output constraint rules are used to specify the format of the answer, and the knowledge tracing requirements are used to guide the large model to annotate, in the answer, the source information of the enhanced retrieval results.

9. The large model knowledge injection method for a vertical industry according to claim 8, characterized in that the fine-tuned large model is obtained by fine-tuning a general large model using LoRA technology with vertical industry question-answer pairs as training data.
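For reference, the low-rank update that LoRA applies to a frozen weight matrix can be written as below. The symbols are not defined in this document and follow the usual LoRA formulation: W_0 is a pretrained attention weight matrix of size d x k, B and A are the trainable low-rank factors, r is the rank, and alpha is a scaling constant.

```latex
W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only A and B are updated during fine-tuning, so the number of trainable parameters per adapted matrix is r(d + k) rather than dk, while W_0 and the general knowledge it encodes stay unchanged.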

10. A large model knowledge injection system for a vertical industry, the system being used to implement the large model knowledge injection method for a vertical industry according to claim 1, characterized in that it includes:
a knowledge processing and database building module, used to collect professional knowledge data from a vertical industry, perform denoising, structural transformation, and terminology standardization on the professional knowledge data, divide the processed professional knowledge data into multiple independent knowledge modules according to industry fields, and associate each knowledge module with a corresponding vectorized index and metadata tags to form a vertical industry knowledge base;
a question parsing and routing module, used to receive the user's input question, extract the semantic feature vector of the input question, input the semantic feature vector into a pre-trained dynamic routing classifier to determine the industry field to which the input question belongs, and select the corresponding target knowledge module from the vertical industry knowledge base according to the industry field;
a hybrid retrieval and deduplication module, used to retrieve a relevant document set from the target knowledge module using a hybrid retrieval strategy and deduplicate the relevant document set to obtain enhanced retrieval results;
an accurate answer generation module, used to concatenate the enhanced retrieval results with the input question based on a preset industry prompt template to form a large model input text, and to input the large model input text into the fine-tuned large model to generate an accurate answer that conforms to the vertical industry standards.