Text processing and knowledge question answering system based on vertical domain large model
By constructing a text processing and knowledge question answering system based on a vertical domain large model, the invention addresses layout noise interference and table structure loss, enabling accurate knowledge question answering with traceability and improving credibility in vertical domain applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV CHINA
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-12
AI Technical Summary
Existing vertical domain intelligent knowledge question answering systems suffer from layout noise interference and table structure loss. After text segmentation, the fragments contain noise and broken semantic logic, which impairs the understanding and reasoning of large models.
A text processing and knowledge question answering system based on a vertical domain large model is constructed. The model deployment module performs parameter fine-tuning and quantization; the data structuring module performs document parsing, noise removal, and cleaning; the domain knowledge indexing module constructs semantic vector indexes and keyword inverted indexes; the query intent parsing module performs intent recognition and logical decomposition; the intelligent retrieval and adaptation module performs hybrid recall and reranking; and the generative service and interaction module finally generates response data containing traceability information.
The system mitigates noise interference and table structure loss, improves retrieval hit rate and answer accuracy, provides traceability across the whole process from raw data to answer generation, and enhances the system's credibility in vertical fields.
Description
Technical Field
[0001] This invention relates to the field of text processing technology, and specifically to a text processing and knowledge question answering system based on a vertical domain large model.
Background Technology
[0002] In recent years, with the rapid development of deep learning technology, Large Language Models (LLMs) based on the Transformer architecture have demonstrated outstanding capabilities in tasks such as general text generation, machine translation, and dialogue interaction. In vertical industries such as power, healthcare, finance, and industrial manufacturing, building intelligent knowledge question-answering systems using large models has become a key path for industry digital transformation.
[0003] Current mainstream solutions typically employ a Retrieval-Augmented Generation (RAG) architecture: domain documents are first segmented into fragments and stored in a vector database; when a user asks a question, semantic retrieval recalls the relevant fragments, which are then supplied as context to a large model to generate an answer. While this approach alleviates the hallucination problem of large models to some extent and supplements up-to-date knowledge, in practical vertical domain applications existing technologies still have the following significant technical shortcomings. First, layout noise interference: headers, footers, and sidebars are not effectively removed, so the segmented text contains a large amount of non-body-text noise that disrupts the semantic logic of sentences spanning multiple pages. Second, table structure loss: simple text extraction flattens two-dimensional tables into one-dimensional strings, scrambling the row and column correspondences within the table and severely affecting the subsequent understanding and reasoning of large models.
Summary of the Invention
[0004] The purpose of this invention is to provide a text processing and knowledge question answering system based on a vertical domain large model to solve the problems described in the background above.
[0005] The objective of this invention can be achieved through the following technical solutions:
[0006] The text processing and knowledge question answering system based on a vertical domain large model includes a model deployment module, which is configured to fine-tune the parameters of a pre-trained large model using vertical domain corpus, obtain a full-precision fine-tuned model, quantize the full-precision fine-tuned model, generate quantized weights for the vertical domain large model, establish a model inference service with a computing resource scheduling mechanism, and load the quantized weights for the vertical domain large model to start the model inference service.
[0007] The data structuring module is configured to receive multi-source heterogeneous raw files from vertical domains, and perform document parsing, noise removal and cleaning, and semantic segmentation on the multi-source heterogeneous raw files to generate a sequence of structured data objects containing metadata.
[0008] The domain knowledge indexing module is connected to the data structuring module and is configured to receive the sequence of structured data objects, construct a vector index library containing semantic features through vectorization encoding, and simultaneously construct a keyword-based inverted index library.
[0009] The query intent parsing module is configured to receive user input commands submitted from the front end, and to perform intent recognition, logical decomposition, and semantic expansion on the user input commands to generate a retrieval feature vector and a set of retrieval keywords, and encapsulate the two into a combined retrieval condition object.
[0010] The intelligent retrieval and adaptation module is connected to the domain knowledge indexing module and the query intent parsing module, respectively. It is configured to receive the combined retrieval condition object, perform approximate nearest neighbor search in the vector index library based on the retrieval feature vector, perform matching search in the inverted index library based on the retrieval keyword set, and perform fusion ranking and truncation adaptation on the recall results to filter out target context fragments.
[0011] The generative service and interaction module is connected to the intelligent retrieval and adaptation module and is configured to call the model inference service started by the model deployment module, receive the target context fragments and the user input instructions, assemble the two into a structured prompt, input the structured prompt into the model inference service, and output the final response data containing traceability information through a streaming protocol.
[0012] Furthermore, the model deployment module is specifically configured to perform the following steps:
[0013] Collect regulatory documents and business records from vertical industries to build a vertical industry instruction dataset;
[0014] A general large language model is selected as the base, all its pre-trained parameters are frozen, trainable low-rank decomposition matrices are injected into the attention mechanism layer, and the low-rank decomposition matrices are iteratively updated and trained using the vertical domain instruction dataset to derive a full-precision fine-tuned model.
[0015] Convert the full-precision fine-tuned model into a binary format that supports memory mapping;
[0016] The current hardware topology is detected; when a graphics processor is detected, a high-precision quantization mode is selected and the computation layer is offloaded to video memory using the underlying computing library; when only a central processing unit is detected, a high-compression quantization mode is selected and the system falls back to instruction set acceleration mode to generate a quantized version of the vertical domain large model weights.
[0017] Establish a model inference service that includes a computing resource scheduling mechanism, and load a quantized version of the weights of large vertical domain models to start the model inference service.
[0018] Furthermore, the data structuring module is specifically configured to perform the following steps:
[0019] Obtain the multi-source heterogeneous original files, whose formats include PDF documents, Word documents, Excel documents, and image files.
[0020] Read the binary header information of the original file, identify the actual document type of the original file based on the magic number feature in the binary header information, and mark the original file as a data stream to be parsed;
[0021] Based on the identified document type, the data stream to be parsed is input into the corresponding parsing process: when identified as a PDF document or image file, the layout analysis algorithm is executed to remove non-text information in the header, footer, and sidebar areas and extract the original text stream; when identified as an Excel document or a document containing tables, the table structure restoration algorithm is called to identify the row and column lines of the table and convert the table content into an original text stream that maintains two-dimensional logic.
[0022] The original text stream is denoised using a pre-defined set of regular expression rules to output clean text data.
[0023] The sliding window segmentation algorithm is used to segment the clean text data into several text blocks, with an overlapping area of preset length retained between two adjacent text blocks. The file attribute information obtained earlier is traced back to construct a structured data object for each text block, and the name, page number, and chapter title of the original file are injected into the structured data object to generate a sequence of structured data objects.
[0024] Furthermore, the domain knowledge indexing module is specifically configured to execute a parallel, dual-path index building process:
[0025] In the semantic vector index construction branch: a pre-built domain embedding model is loaded, and the text content of each structured data object in the sequence of structured data objects is encoded using the domain embedding model to generate a dense vector; the dense vector is bound to the metadata information carried by the structured data object, and a vector index library is constructed using an approximate nearest neighbor search algorithm;
[0026] In the keyword inverted index construction branch: a word segmenter with an industry-customized dictionary is used to segment the text content in each structured data object and filter stop words to generate a keyword set; the BM25 algorithm is used to calculate the weight score of each keyword in the keyword set in the structured data object, and an inverted index library containing the mapping relationship between keywords, unique identifiers of structured data objects, and weight scores is constructed.
[0027] Furthermore, the query intent parsing module is specifically configured to perform the following steps:
[0028] The query rewriting model is used to perform semantic completion and synonym expansion on user input commands to generate semantically enhanced query text.
[0029] The logical structure of the semantically enhanced query text is analyzed using chain-of-thought reasoning. When logical nesting or cross-document reasoning requirements are identified, the text is decomposed into several atomic sub-queries using a logical decomposition algorithm.
[0030] A domain embedding model consistent with the domain knowledge indexing module is used to vectorize and encode atomic subqueries to generate retrieval feature vectors, and a professional word segmenter is used simultaneously to extract a set of retrieval keywords from atomic subqueries;
[0031] The retrieval feature vector, the set of retrieval keywords, and the semantically enhanced query text are collectively encapsulated into a combined retrieval condition object.
[0032] Furthermore, the intelligent retrieval and adaptation module is specifically configured to perform the following steps:
[0033] Parse the combined retrieval condition object to obtain the retrieval feature vector, the set of retrieval keywords, and the semantically enhanced query text;
[0034] Dual-path hybrid recall: using the retrieval feature vector, perform an approximate nearest neighbor search in the vector index library to generate a semantic recall set, while simultaneously using the retrieval keyword set to perform an exact match search in the inverted index library to generate a literal recall set;
[0035] Fusion ranking: use the reciprocal rank fusion (RRF) algorithm to weight and merge the semantic recall set and the literal recall set to generate a preliminary candidate document set;
[0036] Semantic reranking: load the pre-built cross-encoder reranking model, concatenate the semantically enhanced query text with each document fragment in the preliminary candidate document set, input each concatenation into the model to calculate a relevance score, and rerank the document fragments by that score;
[0037] Truncation adaptation: based on the maximum context window limit of the vertical domain large model, truncate the reranked results and retain the document fragments that fit within the token limit as the target context fragments.
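The fusion ranking and truncation adaptation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the constant `k = 60`, the per-list `weights`, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions. The cross-encoder reranking step that sits between the two functions in the described pipeline is omitted.

```python
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    weights: Optional[Sequence[float]] = None,
    k: int = 60,
) -> List[str]:
    """Merge ranked lists of fragment ids; each list adds weight / (k + rank)."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: Dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


def truncate_to_budget(
    ordered_ids: Sequence[str], fragments: Dict[str, str], max_tokens: int
) -> List[str]:
    """Keep fragments in rank order until the (approximate) token budget is used."""
    kept: List[str] = []
    used = 0
    for doc_id in ordered_ids:
        cost = len(fragments[doc_id].split())  # assumption: words approximate tokens
        if used + cost > max_tokens:
            break
        kept.append(doc_id)
        used += cost
    return kept


semantic_recall = ["a", "b", "c"]   # hypothetical ids from the vector index
literal_recall = ["b", "d", "a"]    # hypothetical ids from the inverted index
fused = reciprocal_rank_fusion([semantic_recall, literal_recall])
```

A fragment that appears near the top of both recall sets accumulates two contributions, which is why RRF favors results confirmed by both paths without needing the two scoring scales to be comparable.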
[0038] Furthermore, the generative service and interaction module is specifically configured to execute the following steps:
[0039] The user input command is concatenated with the target context fragments, constraint instructions are embedded, and the metadata information carried in the target context fragments is parsed into reference tags to generate a structured prompt;
[0040] The structured prompt is input into the model inference service started by the model deployment module, and the generated answer is pushed to the front end word by word using the Server-Sent Events (SSE) protocol to form a natural language answer;
[0041] Based on the reference tags, source links pointing to the original document pages and highlighted paragraphs are generated in the natural language answer, and the final response data is output.
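As a rough sketch of the prompt assembly and streaming steps above, the following shows one way numbered reference tags could be derived from fragment metadata and how a piece of generated text could be framed as a Server-Sent Events message. The `ContextFragment` class, the prompt wording, and the tag format `[n]` are hypothetical; only the `data:` line framing follows the SSE wire format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextFragment:
    # Hypothetical container for a target context fragment and its metadata.
    text: str
    file_name: str
    page: int
    chapter: str


def build_prompt(question: str, fragments: List[ContextFragment]) -> str:
    """Assemble constraint instructions, tagged context, and the question."""
    lines = [
        "Answer using only the numbered context below.",
        "Cite supporting context by its tag, for example [1].",
        "If the context is insufficient, say so.",
        "",
        "Context:",
    ]
    for i, frag in enumerate(fragments, start=1):
        source = f"{frag.file_name}, p.{frag.page}, {frag.chapter}"
        lines.append(f"[{i}] ({source}) {frag.text}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def sse_frame(payload: str) -> str:
    """Frame one payload as an SSE event: one 'data:' line per payload line."""
    return "".join(f"data: {line}\n" for line in payload.split("\n")) + "\n"


prompt = build_prompt(
    "What is the oil temperature limit?",
    [ContextFragment("Top oil temperature must not exceed the stated limit.",
                     "manual.pdf", 3, "Safety")],
)
```

Because each tag index maps back to a file name, page, and chapter, the front end can turn a `[1]` in the streamed answer into a link to the source page.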
[0042] The beneficial effects of this invention are:
[0043] 1. This invention constructs a dual-path hybrid retrieval mechanism that runs semantic vector retrieval and keyword inverted-index retrieval in parallel, and introduces a cross-encoder reranking model. This combines the precision of keyword matching with the semantic understanding of vectors. The mechanism not only corrects the entity confusion bias of vector-only retrieval, but also uses the reranking step to remove noise fragments that are semantically related but logically contradictory, which improves the retrieval hit rate and answer accuracy for complex professional questions.
[0044] 2. This invention employs layout analysis and table structure restoration algorithms to deeply parse complex heterogeneous data such as PDF and Excel files, preserving the hierarchical logic and two-dimensional table structure of the original document and avoiding the semantic loss caused by data fragmentation. Furthermore, it establishes a full-link metadata transmission and reference tag parsing mechanism: during the inference generation stage, the large model is constrained to answer only on the basis of the target context fragments, and source links to the associated original pages and highlighted paragraphs are generated automatically. The system thereby achieves source tracing from raw data acquisition to answer generation, reduces the comprehension bias caused by dirty data, suppresses the hallucination problem of generative models, and improves the system's credibility in vertical domain applications.
Attached Figure Description
[0045] The invention will now be further described with reference to the accompanying drawings.
[0046] Figure 1 is a flowchart of the modules of the present invention.
Detailed Implementation
[0047] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0048] As shown in Figure 1, this invention is a text processing and knowledge question answering system based on a vertical domain large model, including the following modules:
[0049] The model deployment module is configured to fine-tune the parameters of a pre-trained large model using vertical domain corpus, obtain a full-precision fine-tuned model, quantize the full-precision fine-tuned model, generate quantized weights for the vertical domain large model, establish a model inference service that includes a computing resource scheduling mechanism, and load the quantized weights for the vertical domain large model to start the model inference service.
[0050] The data structuring module is configured to receive multi-source heterogeneous raw files from vertical domains, and perform document parsing, noise removal and cleaning, and semantic segmentation on the multi-source heterogeneous raw files to generate a sequence of structured data objects containing metadata.
[0051] The domain knowledge indexing module is connected to the data structuring module and is configured to receive the sequence of structured data objects, construct a vector index library containing semantic features through vectorization encoding, and simultaneously construct a keyword-based inverted index library.
[0052] The query intent parsing module is configured to receive user input commands submitted from the front end, and to perform intent recognition, logical decomposition, and semantic expansion on the user input commands to generate a retrieval feature vector and a set of retrieval keywords, and encapsulate the two into a combined retrieval condition object.
[0053] The intelligent retrieval and adaptation module is connected to the domain knowledge indexing module and the query intent parsing module, respectively. It is configured to receive the combined retrieval condition object, perform approximate nearest neighbor search in the vector index library based on the retrieval feature vector, perform matching search in the inverted index library based on the retrieval keyword set, and perform fusion ranking and truncation adaptation on the recall results to filter out target context fragments.
[0054] The generative service and interaction module is connected to the intelligent retrieval and adaptation module and is configured to call the model inference service started by the model deployment module, receive the target context fragments and the user input instructions, assemble the two into a structured prompt, input the structured prompt into the model inference service, and output the final response data containing traceability information through a streaming protocol.
[0055] The data structuring module is configured to receive multi-source heterogeneous raw data from vertical domains, and to perform document parsing, noise removal and cleaning, and semantic segmentation on the raw data to generate a standardized sequence of structured data objects.
[0056] The domain knowledge indexing module is connected to the data structuring module and is configured to receive the sequence of structured data objects, construct a vector index library containing high-dimensional semantic features through vectorization encoding, and simultaneously construct a keyword-based inverted index library.
[0057] The query intent parsing module is configured to receive user input commands submitted from the front end, and to perform intent recognition, logical decomposition, and semantic expansion on the user input commands to generate corresponding search feature vectors and combined search conditions.
[0058] The intelligent retrieval and adaptation module is connected to the domain knowledge indexing module and the query intent parsing module, respectively. It is configured to perform multi-path recall and relevance reordering operations based on the retrieval feature vector and combined retrieval conditions to filter out target context fragments.
[0059] The generative service and interaction module is connected to the intelligent retrieval and adaptation module. It is configured to encapsulate the fine-tuned vertical domain large model and the computing resource management mechanism, assemble the target context fragments and user input instructions into a prompt that is input to the vertical domain large model, and output the final response data containing traceability information through a streaming transmission protocol.
[0060] Model Deployment Module: Configures an intelligent computing platform adapted to vertical domains, efficiently fine-tunes the parameters of general-purpose large models using industry corpora, performs adaptive quantization on the fine-tuned models, and launches standardized inference services supporting dynamic resource scheduling. Specifically, it includes:
[0061] Collect regulatory documents, technical manuals, and business system database records in the vertical industry;
[0062] Use a rule engine to remove garbled characters and non-text noise;
[0063] Knowledge fragments are extracted using sliding window or semantic segmentation algorithms and reconstructed into instruction pairs containing instruction, input, and output fields;
[0064] By mixing instruction pair data with general dialogue data, a vertical domain instruction dataset is formed;
[0065] Select a general large language model as the base, freeze all its pre-trained parameters, and inject trainable low-rank decomposition matrices into the attention mechanism layers and feedforward neural network layers of the Transformer architecture;
[0066] The parameters of the low-rank decomposition matrices are iteratively updated and trained using the vertical domain instruction dataset;
[0067] The trained low-rank adapter weights are merged with the original weights of the base model to derive a full-precision fine-tuned model.
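The low-rank injection and merge described above can be written compactly. The formulation below assumes the commonly used LoRA parameterization, which the source does not name explicitly; the symbols are introduced here for illustration: $W_0 \in \mathbb{R}^{d \times k}$ is a frozen pre-trained weight, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank matrices with $r \ll \min(d, k)$, and $\alpha$ is a scaling constant.

```latex
% Forward pass during fine-tuning: only A and B receive gradient updates.
h = W_0 x + \frac{\alpha}{r} B A x

% Merge after training: the adapter is folded into the base weight,
% yielding the full-precision fine-tuned weight.
W_{\text{merged}} = W_0 + \frac{\alpha}{r} B A
```

After merging, inference uses a single weight matrix of the original shape, so the adapter adds no extra latency before quantization.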
[0068] Convert the full-precision fine-tuned model into the memory-mapped binary format GGUF, then perform adaptive quantization based on the hardware resources of the deployment target: select 8-bit or half-precision mode for servers with ample video memory, 4-bit mode for devices with limited video memory, and 2-bit mode for environments without a dedicated graphics card, generating the quantized weights of the vertical domain large model.
[0069] Establish a computing resource management mechanism and load quantized weights for large vertical domain models; externally encapsulate RESTful API interfaces and internally scan hardware topology; when a graphics processor is detected, offload the computing layer to video memory for execution, and fall back to instruction set acceleration mode when only a central processing unit is detected; manage computing resources by monitoring the number of concurrent requests through a traffic gateway.
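The hardware-dependent choices in the two paragraphs above (quantization mode and execution backend) amount to a small decision rule, sketched below. The `DeploymentPlan` type, the backend labels, and the 24 GiB threshold separating "ample" from "limited" video memory are assumptions made for illustration; the source gives no numeric threshold and does not say how video memory is measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeploymentPlan:
    quant_mode: str  # "8-bit", "4-bit", or "2-bit"
    backend: str     # "gpu-offload" or "cpu-simd" (hypothetical labels)


def plan_deployment(
    gpu_vram_gib: Optional[float], high_vram_gib: float = 24.0
) -> DeploymentPlan:
    """Pick a quantization mode and backend from the detected hardware.

    gpu_vram_gib is None (or 0) when no dedicated graphics processor is found.
    high_vram_gib is an assumed cut-off, not a value taken from the source.
    """
    if gpu_vram_gib is None or gpu_vram_gib <= 0:
        # CPU only: highest compression, fall back to instruction set acceleration.
        return DeploymentPlan("2-bit", "cpu-simd")
    if gpu_vram_gib >= high_vram_gib:
        # Ample video memory: high-precision quantization, layers offloaded to GPU.
        return DeploymentPlan("8-bit", "gpu-offload")
    # Limited video memory: middle ground.
    return DeploymentPlan("4-bit", "gpu-offload")
```

In practice the detected value would come from a hardware probe at service start-up; here it is simply passed in as an argument.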
[0070] Data structuring module: As the system's data entry point, it first obtains heterogeneous raw files from multiple sources, including the Internet and the company's internal databases. The specific formats of the raw files include PDF documents, Word documents, Excel documents, and image files.
[0071] After obtaining the original files, calculate the hash fingerprint of each original file and compare the hash fingerprint with the pre-set historical fingerprint database;
[0072] If the comparison result shows that the hash fingerprint already exists, the corresponding original file is determined to be redundant data and discarded.
[0073] If the comparison result shows that the hash fingerprint does not exist, the corresponding original file is retained;
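The fingerprint comparison above can be sketched as follows. SHA-256 and the in-memory `set` used as the historical fingerprint database are assumptions; the source specifies neither the hash function nor the storage.

```python
import hashlib
from typing import Set


def fingerprint(data: bytes) -> str:
    """Hash fingerprint of a file's raw bytes (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def is_new_file(data: bytes, seen: Set[str]) -> bool:
    """Return True and record the fingerprint if the file was not seen before."""
    fp = fingerprint(data)
    if fp in seen:
        return False  # redundant data: the caller discards the file
    seen.add(fp)
    return True


history: Set[str] = set()
first = is_new_file(b"regulation v1", history)      # retained
duplicate = is_new_file(b"regulation v1", history)  # discarded
```

Hashing the raw bytes only catches exact duplicates; two files with the same text but different formatting would both be retained.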
[0074] Subsequently, the binary header information of the preserved original file is read, the magic number feature in the header is used to identify the actual document type of the original file, and the original file is marked as a data stream to be parsed;
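A minimal sketch of magic-number identification follows. The signatures listed are the well-known leading bytes of each format; the function name and the returned labels are hypothetical. Note that modern Word and Excel files are both ZIP packages, so telling them apart needs a further look inside the archive, which is not shown here.

```python
from typing import Optional

# (leading bytes, label). Labels are illustrative, not taken from the source.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip-container"),                         # .docx / .xlsx packages
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),  # legacy .doc / .xls
]


def sniff_type(header: bytes) -> Optional[str]:
    """Identify the actual document type from the first bytes of a file."""
    for magic, kind in MAGIC_NUMBERS:
        if header.startswith(magic):
            return kind
    return None  # unknown type: extension-based handling would be a fallback
```

Reading the header rather than trusting the file extension guards against renamed or mislabeled uploads.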
[0075] Based on the identified document type, the original file marked as the data stream to be parsed is input into the corresponding parsing process:
[0076] When the original file is recognized as a PDF document or image file, the optical character recognition engine is called to process it. Before the engine extracts text, a page layout analysis algorithm is first executed on each page: it builds a coordinate system to locate the header, footer, and sidebar areas and removes the non-text information in those areas, after which the original text stream is extracted.
[0077] When the original file is identified as an Excel document or a document containing tables, a table structure restoration algorithm is invoked to process the table area. This algorithm identifies the row and column lines of the table and converts the table content into structured data in Markdown or JSON format, thereby preserving the two-dimensional logical relationship of the data and generating an original text stream that maintains the topological structure.
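To illustrate the final conversion step only, the sketch below serializes an already-recovered grid of cells as a Markdown table so that row and column correspondences survive in the text stream. It assumes the first row is the header and that cell detection has been done upstream; the function name is hypothetical.

```python
from typing import List, Sequence


def table_to_markdown(rows: Sequence[Sequence[object]]) -> str:
    """Serialize a grid of cells as a Markdown table (first row = header)."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[object]) -> str:
        # Escape pipes and flatten newlines so a cell cannot break the row.
        cells: List[str] = [
            str(cell).replace("|", "\\|").replace("\n", " ") for cell in row
        ]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


markdown = table_to_markdown([["Device", "Voltage"], ["T1", 110]])
```

Keeping each row on one line means a later text segmentation step can split between rows without separating a value from its column.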
[0078] After receiving the raw text stream extracted from the original file, the system uses a pre-defined set of regular expression rules to traverse and process the raw text stream. It uses regular expressions to remove meaningless garbled characters, invisible control characters, and consecutive redundant newline characters from the raw text stream, and performs a character standardization program to convert full-width characters to half-width characters and standardize punctuation mark format.
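The cleaning pass above can be sketched with the standard `re` module. The specific character classes and the rule of collapsing three or more newlines into a paragraph break are assumptions; the source describes the rule set only in general terms.

```python
import re

# Control characters except tab and newline, plus common zero-width characters.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\u200b-\u200f\ufeff]")
_EXTRA_NEWLINES = re.compile(r"\n{3,}")


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # full-width letters, digits, punctuation
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_text(raw: str) -> str:
    """Normalize line endings, drop invisible characters, standardize width."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    text = _CONTROL_CHARS.sub("", text)
    text = to_half_width(text)
    text = _EXTRA_NEWLINES.sub("\n\n", text)  # keep at most one blank line
    return text.strip()
```

For example, `clean_text("\uff21\uff22\x00\n\n\n\nnext")` yields `"AB\n\nnext"`: the full-width letters are narrowed, the NUL byte is dropped, and the run of newlines is reduced to one paragraph break.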
[0079] In addition, the named entity recognition model is invoked to scan the text, detect sensitive entities, and replace the detected sensitive entities with mask symbols to output clean text data.
[0080] To adapt to the context window of a large model, a sliding window segmentation algorithm is used to process clean text data. This algorithm prioritizes locating natural paragraph line breaks or document heading levels as segmentation points. When the length of a logical paragraph exceeds a preset threshold, the algorithm performs a secondary segmentation. During the secondary segmentation, a preset length of overlap is forced between two adjacent text blocks, so that the header of the latter text block contains the semantic information of the tail of the former text block. This overlap mechanism ensures that sentences or referential relationships that cross segmentation points maintain semantic integrity.
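The two-stage segmentation above (paragraph boundaries first, then an overlapping window for over-long paragraphs) can be sketched as follows. Lengths are counted in characters rather than model tokens, blank lines stand in for paragraph and heading boundaries, and the default sizes are arbitrary; all three are assumptions.

```python
from typing import List


def sliding_window(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += step
    return chunks


def segment(text: str, max_len: int = 200, overlap: int = 40) -> List[str]:
    """Split on blank lines first; window only paragraphs longer than max_len."""
    blocks: List[str] = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_len:
            blocks.append(para)
        else:
            blocks.extend(sliding_window(para, max_len, overlap))
    return blocks


windows = sliding_window("abcdefghij", size=4, overlap=2)
```

Each window begins with the last `overlap` characters of the previous one, which is what lets a sentence that straddles a cut point appear whole in at least one block.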
[0081] Finally, based on the segmented text blocks, the obtained file attribute information is traced back to construct a structured data object. The name of the original file, the specific page number of the text block in the original file, and the chapter title to which the text block belongs are injected into the structured data object. Finally, a sequence of structured data objects containing complete metadata is output and passed to the domain knowledge indexing module.
[0082] Domain Knowledge Index Module: Receives a sequence of structured data objects output by the data structuring module, and traverses the sequence to parse each structured data object, separating the text content and metadata information within it;
[0083] To balance the breadth of semantic understanding with the accuracy of terminology matching, a parallel dual-path index building process was then initiated, which specifically includes a semantic vector index building branch and a keyword inverted index building branch.
[0084] In the semantic vector index construction branch, load the pre-built domain embedding model;
[0085] This domain embedding model is not a general pre-trained model, but a model that has been fine-tuned by contrastive learning on specialized corpora from the vertical domain.
[0086] Specifically, the model construction process includes: automatically extracting question-paragraph pairs from unstructured documents in the vertical domain, using the potential queries generated by the large model as anchors, using the original fragments as positive samples, and using other fragments in the same document as hard negative samples;
[0087] Then, the InfoNCE loss function is used for training to narrow the distance between the query vector and the positive sample vector in the feature space.
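For reference, the InfoNCE objective for one training example can be written as below. The notation is introduced here and is not from the source: $q$ is the embedding of the anchor query, $p^{+}$ the embedding of the positive fragment, $p^{-}_{1}, \dots, p^{-}_{N}$ the embeddings of the hard negative fragments, $\operatorname{sim}$ a similarity function (cosine similarity is a common choice), and $\tau$ a temperature hyperparameter.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\log
    \frac{\exp\!\big(\operatorname{sim}(q, p^{+}) / \tau\big)}
         {\exp\!\big(\operatorname{sim}(q, p^{+}) / \tau\big)
          + \sum_{j=1}^{N} \exp\!\big(\operatorname{sim}(q, p^{-}_{j}) / \tau\big)}
```

Minimizing this loss raises the similarity of the query to its positive fragment relative to the negatives, which corresponds to narrowing the query-positive distance in the feature space.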
[0088] The fine-tuned domain embedding model is used to encode the text content in each structured data object, transforming the text content into a high-dimensional dense vector.
[0089] To ensure the traceability of search results, the raw vector is not stored directly. Instead, the high-dimensional dense vector is strongly bound to the metadata information carried by the structured data object. The metadata information specifically includes the file name, page number, and chapter title.
[0090] Subsequently, the vector data bound with metadata information is written into a high-performance vector database, and an approximate nearest neighbor search algorithm is configured to build an index structure for these vectors, thereby forming a vector index library.
[0091] Simultaneously, in the keyword inverted index construction branch, in order to solve the problem of entity confusion caused by the close proximity of specific proper nouns in the vector space, a professional word segmenter is used to process the text content in each structured data object;
[0092] This word segmenter is equipped with a custom dictionary specific to the industry, which can accurately identify combined professional terms and avoid incorrect segmentation; after segmentation and removal of stop words, it generates a set of keywords for the structured data object.
[0093] Next, the BM25 algorithm is used to calculate the weight score of each keyword in the structured data object, and an inverted index library is constructed that records the mapping among keywords, unique identifiers of structured data objects, and weight scores.
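A minimal sketch of such an inverted index is shown below. It assumes the text has already been segmented into keyword lists (the dictionary-based word segmenter is out of scope), uses the common parameter defaults `k1 = 1.5` and `b = 0.75`, and uses a non-negative IDF variant; none of these choices are specified in the source.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def build_bm25_index(
    docs: Dict[str, List[str]], k1: float = 1.5, b: float = 0.75
) -> Dict[str, Dict[str, float]]:
    """Map keyword -> {object id -> precomputed BM25 weight}."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = (sum(len(tokens) for tokens in docs.values()) / n_docs) or 1.0
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        for term, freq in Counter(tokens).items():
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            index[term][doc_id] = idf * freq * (k1 + 1) / (freq + length_norm)
    return dict(index)


def keyword_search(
    index: Dict[str, Dict[str, float]], terms: Sequence[str], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Sum the stored weights of matching keywords and return the best objects."""
    scores: Dict[str, float] = defaultdict(float)
    for term in terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = {  # hypothetical keyword sets keyed by structured data object id
    "d1": ["transformer", "oil", "temperature"],
    "d2": ["transformer", "relay"],
    "d3": ["relay", "protection", "relay"],
}
bm25_index = build_bm25_index(corpus)
```

Storing the weight alongside each posting means a query only needs dictionary lookups and additions at search time.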
[0094] Finally, the vector index library and the inverted index library are logically encapsulated to form a hybrid knowledge base, which awaits invocation by downstream intelligent retrieval and adaptation modules.
[0095] Query intent parsing module: Receives user input commands submitted by the front end, which are usually non-standard natural language questions;
[0096] To eliminate the problem of unclear references or ambiguous expressions in user input commands, the query rewriting model is invoked to perform semantic completion and synonym expansion on the user input commands.
[0097] This query rewriting model utilizes a thesaurus and contextual information in a vertical domain to convert colloquial expressions in user input commands into technical terms, generating semantically enhanced query text. For example, it can restore abbreviations entered by users to their full names or replace vague pronouns with specific entity names.
[0098] For question-and-answer scenarios with complex logic, chain-of-thought reasoning is further introduced to analyze the logical structure of the semantically enhanced query text;
[0099] When the semantically enhanced query text is identified to contain multiple levels of logical nesting or to require cross-document reasoning, the logical decomposition algorithm is used to break the text down into several independent atomic subqueries.
[0100] Each atomic subquery represents a single, directly searchable specific question, thus transforming the originally complex compound problem into a simple single-point query sequence, solving the technical defect of low hit rate of traditional retrieval methods when facing multi-hop reasoning problems;
[0101] Subsequently, the semantically enhanced query text or the decomposed atomic subqueries are input into the feature extraction process;
[0102] To ensure mathematical consistency of the retrieval space, the same domain embedding model that is used by the aforementioned domain knowledge indexing module is loaded. This domain embedding model is used to vectorize and encode each atomic subquery, generating a retrieval feature vector that is consistent with the distribution of the knowledge base vector space.
[0103] Simultaneously, a professional word segmenter is invoked to extract keywords from the atomic subqueries, and a set of retrieval keywords is generated after filtering out stop words;
[0104] Finally, the retrieval feature vector and the set of retrieval keywords are encapsulated into a combined retrieval condition object, and this combined retrieval condition object is passed to the downstream intelligent retrieval and adaptation module to trigger the subsequent multi-path recall process.
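As a hedged illustration of the encapsulation step, the sketch below packages a query text, its feature vector, and its filtered keywords into one immutable object. The class name, field names, and the injected `embed` and `tokenize` callables are hypothetical stand-ins for the domain embedding model and the professional word segmenter, which this embodiment does not specify in code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class CombinedRetrievalCondition:
    """Hypothetical container for one semantically enhanced query or atomic subquery."""
    query_text: str
    feature_vector: List[float]
    keywords: List[str]


def build_condition(
    query_text: str,
    embed: Callable[[str], Iterable[float]],
    tokenize: Callable[[str], Iterable[str]],
    stop_words: Set[str],
) -> CombinedRetrievalCondition:
    # Keep keyword order, drop stop words and duplicates.
    keywords: List[str] = []
    for token in tokenize(query_text):
        if token not in stop_words and token not in keywords:
            keywords.append(token)
    return CombinedRetrievalCondition(query_text, list(embed(query_text)), keywords)


# Toy stand-ins: a one-dimensional "embedding" and whitespace segmentation.
query = "the valve pressure of the pump"
condition = build_condition(
    query,
    embed=lambda text: [float(len(text))],
    tokenize=str.split,
    stop_words={"the", "of"},
)
```

Passing the embedding function in as a parameter makes it straightforward to reuse the exact model instance that built the vector index, which is the consistency requirement stated above.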
[0105] Intelligent retrieval and adaptation module: Receives the combined retrieval condition object passed by the query intent parsing module, and parses out the retrieval feature vector, retrieval keyword set, and semantically enhanced query text contained therein; based on the parsed conditions, it immediately triggers a dual-path hybrid recall process;
[0106] In the first path of the dual-path hybrid recall process, the retrieval feature vector is used to perform an approximate nearest neighbor search in the vector index library built by the domain knowledge indexing module. The cosine similarity between the vectors in the vector index library and the retrieval feature vector is calculated, thereby recalling the document fragments that are closest in the semantic space and generating a semantic recall set.
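The semantic path can be sketched as follows. For clarity this sketch computes cosine similarity exhaustively over a small in-memory dictionary, which is an exact search rather than the approximate nearest neighbor search of the embodiment; a production system would substitute an approximate index. The function names and the `top_k` parameter are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_recall(query_vector, vector_index, top_k=3):
    """Return the top_k (fragment_id, similarity) pairs, most similar first."""
    scored = [
        (fragment_id, cosine_similarity(query_vector, vector))
        for fragment_id, vector in vector_index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


vector_index = {"frag-a": [1.0, 0.0], "frag-b": [0.0, 1.0], "frag-c": [1.0, 1.0]}
semantic_set = semantic_recall([1.0, 0.0], vector_index, top_k=2)
```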
[0107] Simultaneously, in the second path, an exact match search is performed in the inverted index using the set of search keywords. Based on the scoring rules of the BM25 algorithm, several document fragments containing specific industry terms are recalled to generate a literal recall set.
[0108] Subsequently, to address the biases that may exist in a single retrieval path, the reciprocal rank fusion (RRF) algorithm is invoked to perform a weighted merging of the semantic recall set and the literal recall set;
[0109] The algorithm does not rely directly on the original absolute value of each score. Instead, it calculates the fusion score from the reciprocal of the rank position of each document fragment in its respective ranked list, thereby smoothing the score differences between the retrieval methods. After deduplication, a preliminary candidate document set is generated.
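The rank-based fusion described above is commonly known as reciprocal rank fusion. A minimal sketch follows, assuming the widely used smoothing constant k = 60 and optional per-path weights; neither the constant nor the weighting scheme is specified by this embodiment.

```python
def reciprocal_rank_fusion(ranked_lists, weights=None, k=60):
    """Fuse several ranked lists of fragment ids into one deduplicated ranking.

    Each fragment scores weight / (k + rank) per list it appears in, with
    rank starting at 1. k=60 and the weights are assumed values.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, fragment_id in enumerate(ranking, start=1):
            scores[fragment_id] = scores.get(fragment_id, 0.0) + weight / (k + rank)
    # Using a dict deduplicates fragments recalled by both paths.
    return sorted(scores, key=lambda fragment_id: scores[fragment_id], reverse=True)


semantic_recall_set = ["a", "b", "c"]
literal_recall_set = ["b", "d"]
fused = reciprocal_rank_fusion([semantic_recall_set, literal_recall_set])
```

Fragment "b" is ranked first because it is recalled by both paths, even though it tops only one of the two lists.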
[0110] Given that the preliminary candidate document set may still contain noisy data that is semantically similar but logically contradictory, the semantic re-ranking process is further initiated;
[0111] A pre-built cross-encoder reordering model is loaded. This model has a deep attention mechanism and can process the interaction information of two text sequences simultaneously.
[0112] Specifically, the model uses binary classification or regression tasks for modeling during the training phase, and deliberately adds difficult negative samples that are literally similar but semantically contradictory to the training data to improve the ability to judge logical relationships.
[0113] The semantically enhanced query text is concatenated with each document fragment in the preliminary candidate document set using a special delimiter, and each concatenated text is then input into the cross-encoder reordering model for scoring, one pair at a time.
[0114] The model outputs a precise relevance score that represents the logical implication between the query and the document; finally, the preliminary candidate document set is reordered based on the precise relevance score, and noisy segments with scores below a preset threshold are removed.
[0115] Meanwhile, to respect the maximum input length of the large vertical domain model, the sorted results are truncated based on the cumulative number of tokens. The highest-scoring document fragments whose total length fits within the context window of the large model are retained as target context fragments, and these target context fragments are passed to the downstream generative service and interaction module.
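The threshold filtering and token-based truncation can be sketched together. In this sketch the cross-encoder is replaced by an injected `score_fn` callable, and tokens are counted by whitespace splitting; both are simplifying assumptions, as are the function and parameter names.

```python
def select_target_context(query, fragments, score_fn, min_score, max_tokens,
                          count_tokens=lambda text: len(text.split())):
    """Re-rank fragments, drop low scorers, and keep as many as fit the budget.

    score_fn(query, fragment) stands in for the cross-encoder reordering
    model; count_tokens stands in for the large model's tokenizer.
    """
    scored = [(score_fn(query, fragment), fragment) for fragment in fragments]
    # Remove noisy fragments below the preset threshold.
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected, used = [], 0
    for _, fragment in scored:
        cost = count_tokens(fragment)
        if used + cost > max_tokens:
            break  # truncate at the first fragment that would overflow the window
        selected.append(fragment)
        used += cost
    return selected


# Fixed toy scores in place of a real model.
toy_scores = {"limit is five": 0.9, "lunch menu": 0.2, "limit applies at night": 0.7}
context = select_target_context(
    "what is the limit",
    list(toy_scores),
    score_fn=lambda query, fragment: toy_scores[fragment],
    min_score=0.5,
    max_tokens=5,
)
```

Stopping at the first overflowing fragment, rather than skipping it and trying shorter ones, keeps the retained set a strict prefix of the re-ranked list; this is one possible reading of the truncation step.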
[0116] Generative service and interaction module: simultaneously receives the target context fragments filtered and output by the intelligent retrieval and adaptation module, as well as the user input commands transmitted by the query intent parsing module;
[0117] Before executing the inference task, the computing resource management mechanism is activated to adapt to the current hardware operating environment. The server's hardware configuration is detected to determine whether a graphics processor is available or only a central processing unit is present, and the model loading strategy is selected adaptively based on the detection result.
[0118] If hardware resource constraints are detected, the model quantization loading program will be automatically invoked to load the quantized version of the large vertical domain model weights generated by the model deployment module.
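The adaptive loading decision can be sketched as a small selection function. The mapping of a graphics processor to a high-precision quantization mode with layer offloading, and of a CPU-only host to a high-compression mode, follows the description of the model deployment module; the `LoadPlan` type, its field names, and the use of `nvidia-smi` as a detection heuristic are assumptions made for illustration only.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPlan:
    """Hypothetical description of how the quantized weights should be loaded."""
    device: str
    quantization_mode: str
    offload_layers_to_video_memory: bool


def detect_graphics_processor():
    # Heuristic assumption: an NVIDIA driver tool on PATH implies a usable GPU.
    return shutil.which("nvidia-smi") is not None


def choose_load_plan(has_graphics_processor):
    if has_graphics_processor:
        return LoadPlan("gpu", "high-precision", True)
    # CPU-only host: favor the smaller weights and CPU instruction-set acceleration.
    return LoadPlan("cpu", "high-compression", False)


plan = choose_load_plan(detect_graphics_processor())
```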
[0119] Subsequently, the prompt assembly engine is invoked to construct the structured prompt that is input into the large model;
[0120] This engine not only concatenates the user input commands with the target context fragments, but also embeds strict constraint commands within the prompt;
[0121] These constraint commands explicitly require the large model to answer only on the basis of the information contained in the target context fragments. If the target context fragments do not contain a valid answer, the model directly outputs a statement that it cannot answer, thereby suppressing the generation of hallucinated content that is not grounded in the source material.
[0122] In addition, the engine parses the metadata information carried in the target context fragment, including the source file name, page number, and chapter, into reference tags and temporarily stores these reference tags for later use.
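A minimal sketch of the prompt assembly step follows. The wording of the constraint, the numbered tag format, and the dictionary keys of each fragment are illustrative assumptions; the embodiment specifies only that constraints are embedded and that file name, page number, and chapter are parsed into reference tags.

```python
# Assumed constraint wording; the embodiment does not fix the exact text.
CONSTRAINT = (
    "Answer only from the context below. If the context does not contain "
    "the answer, reply that you cannot answer."
)


def assemble_prompt(user_command, fragments):
    """Return (structured_prompt, reference_tags).

    Each fragment is a dict with the assumed keys
    'text', 'file_name', 'page', and 'chapter'.
    """
    reference_tags = []
    context_lines = []
    for number, fragment in enumerate(fragments, start=1):
        tag = f"[{number}]"
        # Store the metadata separately so source links can be built later.
        reference_tags.append({
            "tag": tag,
            "file_name": fragment["file_name"],
            "page": fragment["page"],
            "chapter": fragment["chapter"],
        })
        context_lines.append(f"{tag} {fragment['text']}")
    prompt = "\n".join(
        [CONSTRAINT, "", "Context:", *context_lines, "", f"Question: {user_command}"]
    )
    return prompt, reference_tags


prompt, tags = assemble_prompt(
    "What is the pressure limit?",
    [{"text": "The limit is 5 bar.", "file_name": "manual.pdf", "page": 12, "chapter": "3.2"}],
)
```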
[0123] The constructed structured prompts are input into a large vertical domain model to perform inference and generation;
[0124] During the reasoning process, this large model in the vertical domain can accurately understand the industry terms and logical relationships in the prompts and generate natural language answers that conform to professional standards.
[0125] Finally, a persistent connection with the front end is established using the Server-Sent Events (SSE) protocol. The generated content is then pushed word by word to the user interface through a streaming channel, achieving a typewriter-like real-time response effect and reducing the user's perceived waiting time.
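The server-side event push described above corresponds to the Server-Sent Events wire format, in which each message consists of one or more "data:" lines followed by a blank line. The sketch below only formats the frames and does not open a network connection; the "end" event name and the "[DONE]" sentinel are assumptions, not part of the embodiment.

```python
def sse_frame(data, event=None):
    """Format one Server-Sent Events message; multi-line data gets one data: line per line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def stream_answer(words):
    """Yield one frame per generated word, then an assumed end-of-stream marker."""
    for word in words:
        yield sse_frame(word)
    yield sse_frame("[DONE]", event="end")


frames = list(stream_answer(["Pressure", "limit:", "5", "bar"]))
```

A web framework would typically send these frames in a response whose content type is text/event-stream.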
[0126] While pushing text content, the system simultaneously retrieves the temporarily stored reference tags, associates key information cited in the natural language answer with the corresponding reference tags, and automatically generates clickable source links at the end of the final output answer or next to key sentences.
[0127] When a user clicks the source link, the system can directly locate and display the original document page and highlighted paragraphs that support the answer, thus forming a complete chain of evidence from the acquisition of raw data to the final delivery of knowledge.
[0128] In summary, the generative service and interaction module adaptively loads, according to the current hardware environment, the large vertical domain model that has been fine-tuned with LoRA and then quantized, assembles the target context fragments and the user input commands into a strictly constrained prompt, drives the model to perform inference, and streams professional answers with precise source links through the SSE protocol.
[0129] The foregoing has provided a detailed description of one embodiment of the present invention, but this description is merely a preferred embodiment and should not be construed as limiting the scope of the invention. All equivalent variations and modifications made within the scope of the claims of this invention should still fall within the patent coverage of this invention.
Claims
1. A text processing and knowledge question answering system based on a large vertical domain model, characterized in that it comprises:
The model deployment module is configured to fine-tune the parameters of a pre-trained large model using a vertical domain corpus to obtain a full-precision fine-tuned model, quantize the full-precision fine-tuned model to generate quantized weights of the vertical domain large model, establish a model inference service that includes a computing resource scheduling mechanism, and load the quantized weights of the vertical domain large model to start the model inference service;
The data structuring module is configured to receive multi-source heterogeneous raw files from the vertical domain, and perform document parsing, noise removal and cleaning, and semantic segmentation on the multi-source heterogeneous raw files to generate a sequence of structured data objects containing metadata;
The domain knowledge indexing module is connected to the data structuring module and is configured to receive the sequence of structured data objects, construct a vector index library containing semantic features through vectorization encoding, and simultaneously construct an inverted index library based on keywords;
The query intent parsing module is configured to receive user input commands submitted from the front end, and to perform intent recognition, logical decomposition, and semantic expansion on the user input commands to generate a retrieval feature vector and a set of retrieval keywords, and encapsulate the two into a combined retrieval condition object;
The intelligent retrieval and adaptation module is connected to the domain knowledge indexing module and the query intent parsing module, respectively, and is configured to receive the combined retrieval condition object, perform approximate nearest neighbor search in the vector index library based on the retrieval feature vector, perform matching search in the inverted index library based on the set of retrieval keywords, and perform fusion sorting and truncation adaptation on the recall results to filter out target context fragments;
The generative service and interaction module is connected to the intelligent retrieval and adaptation module and is configured to call the model inference service started by the model deployment module, receive the target context fragments and the user input commands, assemble the two into a structured prompt that is input into the model inference service, and output the final response data containing traceability information through a streaming protocol.
2. The text processing and knowledge question answering system based on a large vertical domain model according to claim 1, characterized in that the model deployment module is specifically configured to perform the following steps:
Collect regulatory documents and business records from the vertical domain to build a vertical domain instruction dataset;
Select a general large language model as the base, freeze all of its pre-trained parameters, inject trainable low-rank decomposition matrices into the attention mechanism layers, and iteratively update and train the low-rank decomposition matrices using the vertical domain instruction dataset to derive a full-precision fine-tuned model;
Convert the full-precision fine-tuned model into a binary format that supports memory mapping;
Detect the current hardware topology; when a graphics processor is detected, select a high-precision quantization mode and offload the computation layers to video memory using the underlying computing library; when only a central processing unit is detected, select a high-compression quantization mode and fall back to instruction set acceleration mode, thereby generating a quantized version of the vertical domain large model weights;
Establish a model inference service that includes a computing resource scheduling mechanism, and load the quantized version of the vertical domain large model weights to start the model inference service.
3. The text processing and knowledge question answering system based on a large vertical domain model according to claim 1, characterized in that the data structuring module is specifically configured to perform the following steps:
Obtain the raw files that constitute the multi-source heterogeneous raw files, the raw file formats including PDF documents, Word documents, Excel documents, and image files;
Read the binary header information of each raw file, identify the actual document type of the raw file based on the magic number feature in the binary header information, and mark the raw file as a data stream to be parsed;
Based on the identified document type, input the data stream to be parsed into the corresponding parsing process: when it is identified as a PDF document or image file, execute the layout analysis algorithm to remove non-text information in the header, footer, and sidebar areas and extract the original text stream; when it is identified as an Excel document or a document containing tables, call the table structure restoration algorithm to identify the row and column lines of the table and convert the table content into an original text stream that maintains two-dimensional logic;
Denoise the original text stream using a pre-defined set of regular expression rules to output clean text data;
Segment the clean text data into several text blocks using the sliding window segmentation algorithm, retaining an overlapping area of preset length between two adjacent text blocks;
Trace back the obtained file attribute information to construct structured data objects, and inject the name, page number, and chapter title of the raw file into each structured data object to generate the sequence of structured data objects.
4. The text processing and knowledge question answering system based on a large vertical domain model according to claim 1, characterized in that the domain knowledge indexing module is specifically configured to execute a parallel, dual-path index building process:
In the semantic vector index construction branch: a pre-built domain embedding model is loaded, and the text content of each structured data object in the sequence of structured data objects is encoded using the domain embedding model to generate a dense vector; the dense vector is bound to the metadata information carried by the structured data object, and a vector index library is constructed using an approximate nearest neighbor search algorithm;
In the keyword inverted index construction branch: a word segmenter with an industry-customized dictionary is used to segment the text content in each structured data object and filter stop words to generate a keyword set; the BM25 algorithm is used to calculate the weight score of each keyword in the keyword set in the structured data object, and an inverted index library containing the mapping relationship between keywords, unique identifiers of structured data objects, and weight scores is constructed.
5. The text processing and knowledge question answering system based on a large vertical domain model according to claim 1, characterized in that the query intent parsing module is specifically configured to perform the following steps:
Use the query rewriting model to perform semantic completion and synonym expansion on the user input commands to generate semantically enhanced query text;
Analyze the logical structure of the semantically enhanced query text using chain-of-thought reasoning, and, when logical nesting or cross-document reasoning requirements are identified, decompose the text into several atomic subqueries using a logical decomposition algorithm;
Use the same domain embedding model as the domain knowledge indexing module to vectorize and encode the atomic subqueries to generate retrieval feature vectors, and simultaneously use a professional word segmenter to extract a set of retrieval keywords from the atomic subqueries;
Encapsulate the retrieval feature vector, the set of retrieval keywords, and the semantically enhanced query text together into a combined retrieval condition object.
6. The text processing and knowledge question answering system based on a large vertical domain model according to claim 1, characterized in that the intelligent retrieval and adaptation module is specifically configured to perform the following steps:
Parse the combined retrieval condition object to obtain the retrieval feature vector, the set of retrieval keywords, and the semantically enhanced query text;
Dual-path hybrid recall: using the retrieval feature vector, perform an approximate nearest neighbor search in the vector index library to generate a semantic recall set, while simultaneously using the set of retrieval keywords to perform an exact match search in the inverted index library to generate a literal recall set;
Fusion ranking: use the reciprocal rank fusion algorithm to weight and merge the semantic recall set and the literal recall set to generate a preliminary candidate document set;
Semantic reordering: load the pre-built cross-encoder reordering model, concatenate the semantically enhanced query text with each document fragment in the preliminary candidate document set, input each concatenation into the model, calculate the relevance score, and reorder the document fragments;
Truncation adaptation: based on the maximum context window limit of the large vertical domain model, truncate the reordered results and retain the document fragments that meet the token number limit as target context fragments.
7. The text processing and knowledge question answering system based on a large vertical domain model according to claim 1, characterized in that the generative service and interaction module is specifically configured to perform the following steps:
Concatenate the user input commands with the target context fragments, embed constraint commands, and parse the metadata information carried in the target context fragments into reference tags to generate a structured prompt;
Input the structured prompt into the model inference service started by the model deployment module, and push the generated content to the front end word by word using the Server-Sent Events (SSE) protocol to form a natural language answer;
Based on the reference tags, generate in the natural language answer source links that point to the original document pages and highlighted paragraphs, and output the final response data.