Intelligent customer service question and answer accurate generation method and system based on multi-source knowledge base retrieval

By loading multimodal chemical engineering documents and performing layout analysis and region segmentation, the method obtains and encodes multimodal structured data. This addresses the difficulty of answering precise numerical questions and the insufficient cross-modal retrieval accuracy of question answering in the chemical engineering field, and supports high-quality answer generation.

CN122196145A · Pending · Publication Date: 2026-06-12 · DONGGUAN JUZHENGYUAN NEW ENERGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DONGGUAN JUZHENGYUAN NEW ENERGY CO LTD
Filing Date
2026-05-15
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing multimodal knowledge question answering retrieval technologies cannot accurately answer precise numerical questions in the chemical industry, lack professional analysis of flowchart symbols and molecular structural formulas, and have insufficient cross-modal retrieval accuracy.

Method used

By loading multimodal chemical engineering documents, performing layout analysis and region segmentation, multimodal structured data is obtained, and multimodal encoding is performed on it. The data is then stored in a multimodal vector database, and natural language questions are received for cross-modal retrieval, fusion, and sorting to generate answer text.

Benefits of technology

It improves the accuracy and adaptability of question answering in the chemical industry, ensures the credibility and usability of answer texts, and enhances the accuracy of cross-modal retrieval.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses an intelligent customer service question and answer accurate generation method and system based on multi-source knowledge base retrieval, relates to the field of intelligent customer service, and comprises the following steps: loading a multi-modal chemical document; performing layout analysis and region segmentation on the multi-modal chemical document to obtain multi-modal structured data; performing multi-modal coding on the multi-modal structured data respectively to obtain corresponding embedding vectors and store the embedding vectors in a multi-modal vector database; receiving a user natural language question and coding the question into a query embedding vector; performing cross-modal retrieval in the multi-modal vector database to obtain a plurality of multi-modal candidate results with the highest similarity to the query embedding vector; performing fusion sorting on the plurality of multi-modal candidate results to obtain reordered multi-modal contexts; and generating an answer text based on the user natural language question and the reordered multi-modal contexts. The problems that accurate numerical problems are difficult to answer, professional analysis is lacking, and cross-modal retrieval precision is insufficient in the prior art are solved.

Description

Technical Field

[0001] This invention relates to the field of intelligent customer service, specifically to a method and system for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval.

Background Technology

[0002] Existing multimodal knowledge question answering retrieval technologies do not cover numerical restoration or trend quantification for chemical curves, and they provide no data point restoration mechanism for chemical-specific charts, so they cannot answer precise numerical questions. They also lack professional analysis of flowchart symbols and molecular structural formulas, which makes it impossible to understand equipment connection relationships and the semantics of molecular structures. Furthermore, they do not align chart shapes with semantic descriptions through contrastive learning, resulting in insufficient cross-modal retrieval accuracy.

Summary of the Invention

[0003] This application provides a method and system for generating accurate intelligent customer service questions and answers based on multi-source knowledge base retrieval, which addresses the problems of difficulty in answering precise numerical questions, lack of professional analysis, and insufficient cross-modal retrieval accuracy in existing technologies.

[0004] In view of the above problems, this application provides a method and system for accurate generation of intelligent customer service questions and answers based on multi-source knowledge base retrieval.

[0005] In a first aspect, this application provides a method for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval, the method comprising:
loading a multimodal chemical engineering document;
performing layout analysis and region segmentation on the multimodal chemical engineering document to obtain multimodal structured data, wherein the multimodal structured data includes text fragments, structured table data, curve graph numerical sequences, flowchart topology, and molecular structure strings;
performing multimodal encoding on each type of multimodal structured data to obtain the corresponding embedding vectors, and storing the embedding vectors in a multimodal vector database;
receiving a user's natural language question and encoding it into a query embedding vector;
performing cross-modal retrieval in the multimodal vector database to obtain several multimodal candidate results with the highest similarity to the query embedding vector, wherein the multimodal candidate results include text candidates, table candidates, graph candidates, flowchart candidates, and molecular structure candidates;
performing fusion sorting on the multimodal candidate results to obtain a reordered multimodal context; and
generating the answer text based on the user's natural language question and the reordered multimodal context.

[0006] In a second aspect, this invention provides an intelligent customer service question-and-answer generation system based on multi-source knowledge base retrieval, the system comprising:
a document acquisition module, used to load multimodal chemical engineering documents;
a multimodal data acquisition module, used to perform layout analysis and region segmentation on the multimodal chemical engineering document to obtain multimodal structured data, wherein the multimodal structured data includes text fragments, structured table data, curve graph numerical sequences, flowchart topology, and molecular structure strings;
an embedding vector storage module, used to perform multimodal encoding on each type of multimodal structured data to obtain the corresponding embedding vectors, and to store the embedding vectors in a multimodal vector database;
an embedding vector query module, used to receive a user's natural language question and encode it into a query embedding vector;
a candidate result acquisition module, used to perform cross-modal retrieval in the multimodal vector database to obtain several multimodal candidate results with the highest similarity to the query embedding vector, wherein the multimodal candidate results include text candidates, table candidates, graph candidates, flowchart candidates, and molecular structure candidates;
a context acquisition module, used to perform fusion sorting on the multimodal candidate results to obtain a reordered multimodal context; and
an answer text generation module, used to generate the answer text based on the user's natural language question and the reordered multimodal context.

[0007] One or more technical solutions provided in this application have at least the following technical effects or advantages. First, a multimodal chemical engineering document is loaded. Second, layout analysis and region segmentation are performed on the document to obtain multimodal structured data, which provides high-quality raw data for subsequent answer generation; the structured data is then encoded per modality, and the resulting embedding vectors are stored in a multimodal vector database. Third, the user's natural language question is received and encoded into a query embedding vector. Next, cross-modal retrieval in the multimodal vector database returns the multimodal candidate results most similar to the query embedding vector, which makes the subsequent re-ranking more reliable and improves answer accuracy. The candidates are then fused and sorted across multiple ranking dimensions to produce a reordered multimodal context, so that the input to answer generation is accurate. Finally, the answer text is generated from the user's natural language question and the reordered multimodal context, improving the adaptability and accuracy of the answer text.

Attached Figure Description

[0008] Figure 1 is a flowchart of the intelligent customer service question-and-answer generation method based on multi-source knowledge base retrieval provided in this application. Figure 2 is a schematic diagram of the intelligent customer service question-and-answer generation system based on multi-source knowledge base retrieval provided in this application.

[0009] In the attached figures, the components represented by each number are as follows: document acquisition module 11, multimodal data acquisition module 12, embedding vector storage module 13, embedding vector query module 14, candidate result acquisition module 15, context acquisition module 16, and answer text generation module 17.

Detailed Implementation

[0010] This application provides an intelligent customer service question-and-answer generation method based on multi-source knowledge base retrieval, which specifically addresses the problems of difficulty in answering precise numerical questions, lack of professional analysis, and insufficient cross-modal retrieval accuracy.

[0011] The present invention will now be described in detail with reference to the accompanying drawings.

[0012] Example 1: As shown in Figure 1, this application provides a method for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval, the method comprising:
S10: Load multimodal chemical engineering documents.
In this embodiment of the application, a multimodal chemical engineering document is an unstructured or semi-structured electronic file, used in the context of online sales and technical support for chemical products, that combines several forms of information representation such as text, tables, graphs, flowcharts, and molecular structural formulas.

[0013] Specifically, chemical engineering documents are first loaded from the corresponding data storage module in the original database of the multi-source knowledge base. These documents include technical data sheets, safety data sheets, product manuals, and process flow diagrams, and their non-textual modalities carry crucial quantitative and structural information. Graphs present numerical relationships such as viscosity versus temperature and particle size distribution; flow diagrams combine standard chemical symbols for pumps, reactors, and valves with connecting lines to form the process topology; and molecular structure diagrams describe the atomic bonding relationships of compounds.

[0014] In this embodiment of the application, a unified loading mechanism is used to overcome the limitation of existing technologies that can only process single text or simple images, laying a data foundation for building a comprehensive chemical knowledge base and ensuring that subsequent analysis can be based on rich and authentic multi-source documents in the chemical field.

[0015] S20: Perform layout analysis and region segmentation on the multimodal chemical engineering document to obtain multimodal structured data, wherein the multimodal structured data includes text fragments, structured table data, curve graph numerical sequences, flowchart topology, and molecular structure strings.
In this embodiment, layout analysis refers to parsing document images with computer vision and image processing algorithms to automatically identify the type and bounding box of each functional area in the image, for example distinguishing the title area from other areas. Region segmentation uses the bounding boxes obtained from layout analysis to crop or extract each independent region from the original document image, forming separate image blocks or data blocks. Multimodal structured data is the result of transforming the originally mixed, unstructured document content into computer-understandable data units with clear type labels and internal structure. Text fragments are continuous string paragraphs extracted from the document.

[0016] Structured tabular data is a two-dimensional matrix data that restores a tabular image to rows, columns and their corresponding relationships; a curve graph numerical sequence is a discrete data point pair restored from a curve graph image, forming an ordered numerical array; a flowchart topology is a directed graph of a process flow represented in the form of nodes and edges; a molecular structure string is a fixed-format string used to accurately characterize the structure of chemical molecules.

[0017] Specifically, for the loaded chemical engineering documents, a pre-trained layout analysis model is first called to perform layout analysis. Then, the region segmentation module crops the analyzed regions into independent images. After data processing through layout analysis and region segmentation, the originally mixed unstructured document content is transformed into computer-understandable data units with clear type labels and internal structures, resulting in multimodal structured data containing text fragments, structured table data, curve graph numerical sequences, flowchart topology structures, and molecular structure strings.
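The five kinds of structured data described above can be pictured as typed containers. The following is a minimal sketch only: every class and field name is hypothetical, and the source does not say which fixed-format notation is used for molecular structure strings (the example string below assumes a SMILES-style notation).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TextFragment:
    text: str                                  # continuous paragraph string


@dataclass
class TableData:
    header: List[str]                          # column names
    rows: List[List[str]]                      # two-dimensional cell matrix


@dataclass
class CurveSequence:
    x_unit: str                                # dimension of the independent variable
    y_unit: str                                # dimension of the dependent variable
    points: List[Tuple[float, float]]          # ordered (x, y) pairs
    caption: str = ""


@dataclass
class FlowchartTopology:
    nodes: Dict[str, str]                      # node id -> symbol type
    edges: List[Tuple[str, str]]               # directed (source id, target id)


@dataclass
class MolecularStructure:
    notation: str                              # fixed-format structure string


@dataclass
class StructuredDocument:
    source: str
    texts: List[TextFragment] = field(default_factory=list)
    tables: List[TableData] = field(default_factory=list)
    curves: List[CurveSequence] = field(default_factory=list)
    flowcharts: List[FlowchartTopology] = field(default_factory=list)
    molecules: List[MolecularStructure] = field(default_factory=list)


# Hypothetical document with one curve and one molecule.
doc = StructuredDocument(source="datasheet.pdf")
doc.curves.append(CurveSequence("degC", "g/10min", [(160.0, 2.5), (180.0, 4.2)]))
doc.molecules.append(MolecularStructure("CCO"))  # assumed SMILES-style string
```

Keeping a type label on each unit is what lets the later encoding step route each item to the encoder for its modality.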

[0018] Step S20 in the method provided in this embodiment of the invention includes:
detecting the coordinate axes, dimensions, and scale values of the curve graph region;
extracting discrete data points on the curve in the curve graph region;
restoring the discrete data points to a functional form to generate a numerical sequence, which includes at least one of a temperature-viscosity correspondence and a stress-strain correspondence; and
extracting the statistical features of the numerical sequence, wherein the statistical features include maximum value, minimum value, inflection point, and trend type.

[0019] In this embodiment of the application, the curve graph region is the region, identified by layout analysis, in which the original curve graph is located. It includes the complete coordinate system, the coordinate axis titles, the units of measurement, the scale values, and the image sub-region containing the data curve.

[0020] Dimensions are the units of measurement for the physical quantities marked on a coordinate axis; for example, the dimension of temperature is degrees Celsius. Correct identification of dimensions is crucial for interpreting the physical meaning of numerical sequences. Scale values are the marked points along a coordinate axis, each bearing a specific numerical value.

[0021] Specifically, deep learning-based detection methods, such as a coordinate axis detector, are used to analyze the curve graph region and identify the position and direction of the coordinate axes, the dimension text labeled on the axes, and the numerical value corresponding to each scale mark.

[0022] Secondly, discrete data points are a finite number of representative pixel locations and their corresponding physical coordinates that constitute a continuous curve. Due to storage and computational needs, it is usually not necessary to extract every pixel on the curve; instead, a sampling algorithm is used to obtain key points sufficient to reconstruct the shape of the curve.

[0023] Specifically, pixel extraction is performed first. For example, if the curve is red, the background is white, and the grid lines are gray, all pixel coordinates belonging to the red curve are extracted. The pixels are then traversed from left to right along the curve, and a sampling point is recorded every 5 pixels. For each sampling point, the pixel coordinates are converted into the corresponding physical values, such as temperature and viscosity, which yields a list of raw discrete data points. For a curve whose dependent variable is melt index in g/10 min, for example: [(160, 2.5), (180, 4.2), (200, 6.8), (220, 10.1), (240, 14.5)]. If the image contains multiple curves, the extraction is repeated until discrete data points have been generated for each curve.
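The conversion from pixel coordinates to physical values can be sketched as a linear axis calibration followed by fixed-stride sampling. This is an illustration under stated assumptions: both axes are linear (not logarithmic), the curve has already been thinned to one pixel per column, and two tick marks per axis are known. The function names and tick positions are hypothetical.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a physical value,
    calibrated from two tick marks on a linear axis."""
    if pixel_a == pixel_b:
        raise ValueError("tick marks must be at distinct pixel positions")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


def sample_curve(curve_pixels, x_mapper, y_mapper, step=5):
    """Order curve pixels left to right, keep every `step`-th pixel, and
    convert each kept pixel to physical coordinates."""
    ordered = sorted(curve_pixels)
    return [(x_mapper(px), y_mapper(py)) for px, py in ordered[::step]]


# Hypothetical calibration: x tick 160 at pixel column 100, x tick 240 at
# column 500; y tick 0 at pixel row 400, y tick 20 at row 0 (rows grow downward).
x_map = make_axis_mapper(100, 160.0, 500, 240.0)
y_map = make_axis_mapper(400, 0.0, 0, 20.0)

# Synthetic curve pixels standing in for the extracted red pixels.
pixels = [(100 + i, 350 - i // 2) for i in range(11)]
points = sample_curve(pixels, x_map, y_map, step=5)
```

Because image rows increase downward, the y-axis mapper has a negative scale; calibrating from detected tick marks handles that without special cases.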

[0024] Furthermore, the temperature-viscosity relationship is a data sequence that represents the viscosity of fluid materials such as polymer solutions and lubricating oils as a function of temperature. It is the most common data type in the chemical industry and is used to evaluate the processing fluidity and operating temperature range of materials. The stress-strain relationship is a data sequence that represents the deformation behavior of solid materials such as plastics and rubber under stress. It can be used to calculate key mechanical properties such as elastic modulus, yield strength, and elongation at break.

[0025] Specifically, a cubic spline interpolation method is used to generate a uniform numerical sequence of independent variables from the minimum to the maximum value of the curve with a step size of 1 unit. Cubic spline interpolation is an interpolation method that uses a cubic polynomial to piecewise fit discrete points, ensuring a smooth curve. For each independent variable in the interpolated sequence, the corresponding dependent variable value is calculated, ultimately resulting in a temperature-viscosity numerical sequence containing several points. If multiple curves exist in the original graph, multiple numerical sequences are generated and associated with their corresponding caption text. Simultaneously, the sequence type is labeled as either a temperature-viscosity correspondence or a stress-strain correspondence for subsequent modal identification.
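The resampling step can be illustrated with a small pure-Python cubic spline. The source specifies cubic spline interpolation with a step size of 1 unit but not the boundary conditions; this sketch assumes natural boundary conditions (zero second derivative at both ends), and the function name is hypothetical. The sample points reuse the example list from the extraction step.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a callable natural cubic spline through (xs, ys).
    xs must be strictly increasing and contain at least three points."""
    n = len(xs) - 1
    if n < 2:
        raise ValueError("need at least three points")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]

    # Forward sweep of the tridiagonal system for the quadratic coefficients.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]

    # Back substitution for the per-segment polynomial coefficients.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the segment containing x, clamped to the valid range.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


temps = [160, 180, 200, 220, 240]
values = [2.5, 4.2, 6.8, 10.1, 14.5]
spline = natural_cubic_spline(temps, values)

# Uniform independent-variable grid from minimum to maximum with step 1.
grid = list(range(temps[0], temps[-1] + 1))
sequence = [(t, spline(t)) for t in grid]
```

In practice a library routine such as SciPy's `CubicSpline` would normally replace the hand-written solver; the explicit version is shown only to keep the sketch dependency-free.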

[0026] Finally, the statistical features of the reconstructed numerical sequence are extracted, including the maximum value, minimum value, inflection point, and trend type. The maximum and minimum values are the largest and smallest values of the dependent variable in the numerical sequence, together with the corresponding positions of the independent variable. An inflection point is a point on the curve where the curvature changes sign, that is, where the curve changes from convex to concave or from concave to convex.

[0027] Trend type is a qualitative classification of the overall change pattern of a numerical sequence, such as monotonically increasing, monotonically decreasing, increasing then decreasing, exponential decay, S-shaped curve, etc.

[0028] Specifically, after generating a uniform numerical sequence, numerical analysis is performed. First, the entire sequence is traversed to find the minimum and maximum viscosity values. Then, the second-order differences of the sequence are calculated, and points where the sign of the differences changes are identified as inflection points. For example, if the second-order difference changes from negative to positive near 60°C, this point is marked as an inflection point. Finally, the first-order differences are calculated. If all differences are negative, the trend type is determined to be monotonically decreasing. The rate of change of the absolute value of the differences is further calculated; if the decay rate gradually slows down, it can be refined to an exponential decay type. The statistical features are then stored as metadata along with the original numerical sequence in a multimodal vector database for subsequent data retrieval.
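The difference-based analysis above can be sketched as follows. This is a simplified illustration: the tolerance `tol`, the trend labels, and the rule used to refine a decreasing trend into an exponential-decay type (the size of each decrease shrinks at every step) are assumptions, not details given in the source.

```python
def extract_features(xs, ys, tol=1e-9):
    """Extrema, inflection points, and a coarse trend label for a uniformly
    sampled sequence."""
    i_min = min(range(len(ys)), key=ys.__getitem__)
    i_max = max(range(len(ys)), key=ys.__getitem__)
    d1 = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]    # first-order differences
    d2 = [d1[i + 1] - d1[i] for i in range(len(d1) - 1)]    # second-order differences

    # An inflection lies where the second-order difference changes sign.
    # d2[i] is centred on xs[i + 1]; report the midpoint of the bracket.
    inflections = []
    prev_sign, prev_i = 0, None
    for i, v in enumerate(d2):
        sign = (v > tol) - (v < -tol)
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            inflections.append((xs[prev_i + 1] + xs[i + 1]) / 2)
        prev_sign, prev_i = sign, i

    if all(v < -tol for v in d1):
        trend = "monotonically decreasing"
        drops = [-v for v in d1]
        if all(drops[i + 1] < drops[i] for i in range(len(drops) - 1)):
            trend = "exponential decay"
    elif all(v > tol for v in d1):
        trend = "monotonically increasing"
    else:
        trend = "non-monotonic"

    return {
        "min": (xs[i_min], ys[i_min]),
        "max": (xs[i_max], ys[i_max]),
        "inflections": inflections,
        "trend": trend,
    }


# Synthetic viscosity-like sequence that halves every 10 degrees.
temps = [20, 30, 40, 50, 60, 70, 80]
visc = [100 * 0.5 ** k for k in range(7)]
features = extract_features(temps, visc)

# Synthetic cubic with a single inflection at x = 0.
cubic_x = [-3, -2, -1, 0, 1, 2, 3]
cubic_features = extract_features(cubic_x, [x ** 3 for x in cubic_x])
```

The resulting dictionary is the kind of metadata that could be stored next to the numerical sequence for later retrieval.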

[0029] Step S20 in the method provided in this embodiment of the invention further includes:
loading a pre-built chemical standard symbol library, wherein the chemical standard symbol library includes pump symbol templates, reactor symbol templates, valve symbol templates, and heat exchanger symbol templates;
performing symbol recognition on the flowchart area to obtain symbol instances and their corresponding types;
identifying the direction of the connecting lines in the flowchart area and establishing directed connection relationships between the symbol instances; and
organizing the types of the symbol instances and the directed connection relationships into a directed process flow graph, which serves as the topology of the flowchart.

[0030] In this embodiment of the application, firstly, the pre-constructed chemical standard symbol library is a set of symbol templates established in advance according to the drawing specifications of the chemical industry. Each symbol template is a small binary image or contour feature vector, representing a standard graphical representation of common equipment in chemical process diagrams.

[0031] Specifically, the system first loads a pre-built chemical standard symbol library from the storage module, including pump symbol templates, reactor symbol templates, valve symbol templates, and heat exchanger symbol templates. The pump symbol template is a graphic template used to represent fluid transport equipment, typically a circle with an inscribed triangle or a circle with an arrow, representing centrifugal pumps, gear pumps, etc. The reactor symbol template is a graphic template used to represent chemical reaction vessels, typically a cylinder with a schematic diagram of a stirrer or a rectangle with a jacket.

[0032] Valve symbol templates are graphic templates used to represent fluid on / off or regulating equipment, including various variations such as gate valves, ball valves, and globe valves; heat exchanger symbol templates are graphic templates used to represent heat exchange equipment, usually in the form of circles or rectangles, with wavy lines or cross lines drawn inside, representing shell-and-tube heat exchangers, plate heat exchangers, etc.

[0033] Secondly, symbol recognition is performed on the flowchart area to locate and classify each chemical symbol instance, where a symbol instance is a specific symbol that actually appears in a particular flowchart image.

[0034] Each instance has a unique identifier, bounding box coordinates, and detection confidence score, resulting in several identified symbol instances and their corresponding types.

[0035] Furthermore, the direction of a connecting line is the spatial path of the connecting line, that is, starting from the exit of a certain symbol, passing through a series of straight line segments and inflection points, and finally reaching the entrance of another symbol; a directed connection is a directional edge established between two symbol instances.

[0036] The direction of every connecting line in the flowchart area is identified, yielding the starting point, ending point, path, and other routing information for each connecting line.

[0037] Specifically, after obtaining symbol instances, the flowchart area is preprocessed to separate the connecting line pixels. Path searching is performed, starting from the preset port position of each symbol instance and tracing along the connecting line pixels until the port of another symbol instance is encountered. Simultaneously, the presence of arrow markers on the connecting lines is checked to determine the flow direction; if no arrow is found, the direction is inferred based on default rules. Finally, a directed edge record is generated. For cases where multiple symbols share the same connecting line, multiple directed edges are established. Subsequently, all identified directed connections are collected into an edge list.

[0038] Finally, the types of symbol instances and the directed connections are organized into a directed graph of the process flow, which serves as the topology of the flowchart.

[0039] Specifically, after obtaining the list of symbol instances and the list of directed connections, a graph data structure is instantiated, using the list of symbol instances as nodes and the list of directed connections as edges. First, all symbol instances are traversed, creating a node object for each instance to store its type, instance, and bounding box position. Then, all directed connections are traversed, creating a directed edge object for each connection to store the start node, end node, and pipeline attributes. Node and edge validity is validated to ensure that the start and end nodes of each edge exist in the node set, avoiding dangling edges. Finally, the organized graph is stored as a flowchart topology and associated with the metadata of the original document for subsequent graph neural network encoding and cross-modal retrieval.
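The graph assembly and dangling-edge check can be sketched with plain data classes. All names here are hypothetical, and rejecting a dangling edge with an exception is one possible policy; the source only states that validity is checked.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SymbolNode:
    node_id: str
    symbol_type: str                      # e.g. "pump", "reactor", "valve"
    bbox: Tuple[int, int, int, int]       # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class PipeEdge:
    source: str
    target: str
    attributes: Dict[str, str] = field(default_factory=dict)


def build_flow_graph(nodes: List[SymbolNode], edges: List[PipeEdge]):
    """Build a node index and adjacency list, rejecting edges whose
    endpoints are not in the node set."""
    node_index = {n.node_id: n for n in nodes}
    adjacency: Dict[str, List[str]] = {node_id: [] for node_id in node_index}
    for edge in edges:
        if edge.source not in node_index or edge.target not in node_index:
            raise ValueError(f"dangling edge: {edge.source} -> {edge.target}")
        adjacency[edge.source].append(edge.target)
    return node_index, adjacency


# Hypothetical three-symbol flowchart: pump -> reactor -> heat exchanger.
nodes = [
    SymbolNode("P-1", "pump", (10, 40, 50, 80)),
    SymbolNode("R-1", "reactor", (120, 20, 200, 120)),
    SymbolNode("E-1", "heat_exchanger", (260, 40, 320, 90)),
]
edges = [PipeEdge("P-1", "R-1"), PipeEdge("R-1", "E-1", {"medium": "product"})]
node_index, adjacency = build_flow_graph(nodes, edges)
```

An adjacency list of this form can be converted directly into the edge index that graph neural network libraries typically expect.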

[0040] In this embodiment, coordinate axis detection, discrete point extraction, and function restoration are used to accurately recover the continuous numerical sequence corresponding to each curve, filling the gap in curve data handling for the chemical industry. Detecting dimensions and scale values and establishing a precise mapping between pixel and physical coordinates eliminates systematic errors. Extracting high-value statistical features increases the information density and practical value of the answer. Linking each curve numerical sequence to its caption text provides a precise data basis for the subsequent contrastive alignment. In addition, a pre-constructed library of standard chemical symbols, combined with a chemical symbol dataset, is used to accurately identify symbol instances and reduce the false detection rate. Directed connections between symbol instances are established by recognizing the direction of the connecting lines, which yields a standard graph data structure and provides a structured reasoning basis for answer generation.

[0041] S30: Perform multimodal encoding on each type of multimodal structured data to obtain the corresponding embedding vectors, and store the embedding vectors in the multimodal vector database.
In this embodiment, multimodal encoding maps structured data of different modalities into the same high-dimensional vector space using a separate neural network encoder for each modality, generating dense vectors of fixed dimension. The encoders for the different modalities need to be trained jointly so that semantically related cross-modal data lie close to each other in this space. Embedding vectors are arrays of floating-point numbers that abstractly represent the semantic, shape, or topological features of the original data. The cosine similarity or Euclidean distance between vectors quantifies the semantic correlation between the original data. A multimodal vector database is a database system designed for storing and retrieving high-dimensional vectors; it supports efficient indexing of embedding vectors and enables fast approximate nearest neighbor retrieval.
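How cosine similarity ranks stored embeddings against a query can be shown with a brute-force search. This is a toy sketch: the two-dimensional vectors, item identifiers, and function names are made up for illustration, and a real vector database would use an approximate nearest neighbor index rather than scanning every entry.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, store, k=3):
    """Return the k entries most similar to the query as
    (similarity, item_id, modality) tuples, best first."""
    scored = [
        (cosine_similarity(query, vector), item_id, modality)
        for item_id, modality, vector in store
    ]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return scored[:k]


# Hypothetical store mixing modalities in one shared embedding space.
store = [
    ("curve-7", "graph", [0.9, 0.1]),
    ("table-2", "table", [0.0, 1.0]),
    ("text-31", "text", [0.7, 0.7]),
]
results = top_k([1.0, 0.0], store, k=2)
```

Because every modality is embedded in the same space, a single similarity ranking can return graph, table, and text candidates side by side.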

[0042] Specifically, each modality of the processed multimodal structured data is encoded, converting the multimodal structured data into the corresponding embedding vectors. All generated embedding vectors, along with their corresponding original data and metadata, are stored in the multimodal vector database, and an index is created.

[0043] Step S30 in the method provided in this embodiment of the invention includes:
extracting the caption text associated with the curve graph numerical sequence, and encoding the caption text with a language model fine-tuned for the chemical engineering field to obtain the trend text vector;
inputting the curve graph numerical sequence into a time series encoder to obtain a curve shape vector, and performing contrastive learning between the curve shape vector and the trend text vector so that the two are aligned in the embedding space, thereby obtaining the curve embedding vector;
performing graph neural network encoding on the flowchart topology to obtain the flowchart embedding vector;
performing graph neural network encoding on the molecular structure string to obtain the molecular structure embedding vector; and
encoding the text fragment with the language model fine-tuned for the chemical engineering field to obtain a text embedding vector.

[0044] In this embodiment, firstly, the figure caption text is a type of text fragment, which typically includes the figure title, axis labels, legend, and accompanying trend descriptive annotations; the language model is a specialized model obtained by further pre-training or fine-tuning using a large amount of chemical engineering corpus on the basis of a general pre-trained language model; and the trend text vector is a text embedding vector used to represent the semantic information of the curve graph.

[0045] Specifically, the natural language description corresponding to the graph, i.e., the graph caption text, is obtained, for example "viscosity decreases as temperature increases". This caption text is encoded using a language model fine-tuned for the chemical industry to obtain the trend text vector, which is then aligned with the curve shape vector in the embedding space. When a user inputs a natural language question, the query vector is therefore close to both the trend text vector and the curve shape vector, so the corresponding graph can be retrieved directly, bridging the semantic gap between graph and text.

[0046] For example, for a styrene-butadiene rubber stress-strain curve, the caption text is "the stress-strain curve of styrene-butadiene rubber under three vulcanization systems".

[0047] Secondly, a time series encoder is a neural network model specifically designed for processing ordered sequence data, such as a Transformer-based encoder or a recurrent neural network. It can extract local patterns, long-term dependencies, and overall shape features from the sequence. Since the independent variables of numerical sequences in graphs typically exhibit a monotonically increasing order, they can essentially be treated as time series data.

[0048] Contrastive learning is a self-supervised or weakly supervised representation learning method. Its basic idea is to bring positive sample pairs closer together in the embedding space while pushing negative sample pairs further apart.

[0049] Specifically, the numerical sequence of the curve is input into a Transformer-based time series encoder. The encoder extracts the dependencies between different positions in the sequence using a multi-head self-attention mechanism and outputs a curve shape vector. Contrastive learning is then performed between the curve shape vector and the trend text vector. The shape vector and the text vector of the same curve form a positive sample pair, and a contrastive loss such as InfoNCE is computed to minimize the cosine distance between the two in the embedding space. At the same time, the shape vector of a curve and the trend text vectors of other curves form negative sample pairs, whose cosine distance is maximized. After sufficient training, the time series encoder directly generates the shape vector for a new curve. Through contrastive learning, the curve shape vector and the trend text vector are aligned in the embedding space to obtain the curve embedding vector.
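The InfoNCE objective for a batch of paired shape and text vectors can be written out in a few lines. This sketch covers only the shape-to-text direction, uses in-batch negatives, and assumes a temperature of 0.07; the source names the loss but gives neither the temperature nor whether a symmetric variant is used. Function names are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(shape_vecs, text_vecs, temperature=0.07):
    """Mean InfoNCE loss where shape_vecs[i] and text_vecs[i] are the
    positive pair and all other text vectors in the batch are negatives."""
    shapes = [_normalize(v) for v in shape_vecs]
    texts = [_normalize(v) for v in text_vecs]
    total = 0.0
    for i, shape in enumerate(shapes):
        logits = [
            sum(a * b for a, b in zip(shape, text)) / temperature
            for text in texts
        ]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(shapes)


# Toy batch: perfectly aligned pairs versus deliberately swapped pairs.
shape_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(shape_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(shape_batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each shape vector is closest to its own caption vector and grows large when the pairing is wrong, which is the pressure that pulls matching curve and text embeddings together.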

[0050] Specifically, the time series encoder is constructed as follows: The model structure, from input to output, consists of an input embedding layer, a learnable positional encoding layer, a 4-layer Transformer encoder, a global pooling layer, and an output projection layer. The input embedding layer is a fully connected linear layer with an input dimension of 1 and an output dimension of 256 (256 neurons) and no activation function. Each Transformer encoder layer includes a multi-head self-attention sub-layer and a feedforward neural network sub-layer, both employing residual connections and layer normalization. The multi-head self-attention sub-layer has 8 heads, each with a key / value dimension of 32. The attention outputs are concatenated and mapped through a linear layer. The feedforward neural network sub-layer consists of two linear layers with the GELU activation function and maintains an output tensor size of 128×256. The global pooling layer applies average pooling and max pooling in parallel, each yielding a 256-dimensional vector, and the two are concatenated to form a 512-dimensional vector. The output projection layer is a linear layer followed by a Tanh activation function, compressing the pooled features to 256 dimensions and limiting the output range to [-1, 1] to facilitate the subsequent contrastive learning.
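The pooling stage of this structure can be illustrated with a short sketch. It assumes the encoder output is available as a list of per-position feature lists; the function name `pool_sequence` is hypothetical, and the Transformer layers themselves are not reproduced here.

```python
def pool_sequence(hidden):
    """Concatenate global average pooling and global max pooling.

    hidden: list of positions, each a list of feature values.
    Returns a vector of length 2 * feature_dim (mean part, then max part).
    """
    length = len(hidden)
    dim = len(hidden[0])
    mean_part = [sum(row[j] for row in hidden) / length for j in range(dim)]
    max_part = [max(row[j] for row in hidden) for j in range(dim)]
    return mean_part + max_part

# Dimensions stated in the text: 8 heads of size 32 give a 256-wide model.
NUM_HEADS, HEAD_DIM, SEQ_LEN = 8, 32, 128
MODEL_DIM = NUM_HEADS * HEAD_DIM

# A placeholder 128 x 256 encoder output (all zeros) pools to 512 values.
pooled = pool_sequence([[0.0] * MODEL_DIM for _ in range(SEQ_LEN)])

# A small example that can be checked by hand.
small = pool_sequence([[1.0, 4.0], [3.0, 2.0]])
```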

[0051] Model training employs a contrastive learning framework, jointly optimized with a language model fine-tuned in the chemical engineering field. The training dataset contains paired samples: each curve's numerical sequence and its corresponding caption text constitute a positive sample pair. The optimizer is AdamW, with an initial learning rate of 1e-4 and a weight decay of 0.01; cosine annealing is used for learning rate scheduling, with a minimum learning rate of 1e-6. The batch size is 64, and the maximum number of iterations is 200. The loss function is InfoNCE. In forward propagation, the curve numerical sequence is input to the embedding layer to obtain the embedded representation, which is then superimposed with positional encoding. This representation is passed through four encoder layers to obtain a 128×256 tensor. Global average pooling and global max pooling then yield two 256-dimensional vectors, which are concatenated to obtain a 512-dimensional vector. Finally, the output projection layer produces a 256-dimensional curve shape vector. Simultaneously, the text encoder outputs a text vector, and the similarity matrix of all positive and negative sample pairs within the batch is calculated to determine the InfoNCE loss. The loss gradient is then backpropagated from the output projection layer to the input embedding layer, and AdamW updates the parameters. Training stops when the validation set shows no improvement for 10 consecutive iterations, or after 200 iterations.
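The learning rate schedule and stopping rule can be sketched as below. The closed-form cosine annealing formula is the commonly used one and is an assumption here, since the text does not spell it out; `cosine_annealed_lr` and `EarlyStopping` are hypothetical names.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

class EarlyStopping:
    """Signals a stop after `patience` consecutive rounds without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def should_stop(self, validation_loss):
        if validation_loss < self.best:
            self.best = validation_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience

lr_start = cosine_annealed_lr(0, 200)
lr_end = cosine_annealed_lr(200, 200)

# With patience 2, two non-improving rounds in a row trigger the stop.
stopper = EarlyStopping(patience=2)
decisions = [stopper.should_stop(v) for v in (1.0, 1.1, 1.2)]
```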

[0052] Model validation is performed on an independent validation set, where the optimal model weights are loaded and the model is set to evaluation mode. For each curve, the numerical sequence is input into the encoder to obtain a shape vector, and its caption text is simultaneously input into the text encoder to obtain a text vector. The cosine similarity between the shape vector and all text vectors in the validation set is calculated and sorted in descending order, and the rank of the correct match is recorded. The average rank over all curves is calculated, together with the difference between the average cosine similarity of positive and negative samples, which is required to be greater than 0.3. The trained time-series encoder can be used directly for new curve graphs: it takes a numerical sequence as input and outputs a curve shape vector aligned with the text semantics, which is stored as the curve embedding vector in the multimodal vector database.
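A minimal sketch of this validation check follows, assuming the shape and text vectors are given as plain lists in matching order. The function name `validate` and the toy vectors are illustrative only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def validate(shape_vecs, text_vecs, min_margin=0.3):
    """Return (mean rank of the correct caption, similarity margin, passed)."""
    ranks, positives, negatives = [], [], []
    for i, shape in enumerate(shape_vecs):
        sims = [cosine(shape, text) for text in text_vecs]
        # Rank 1 means the matching caption is the most similar text vector.
        ranks.append(1 + sum(1 for s in sims if s > sims[i]))
        positives.append(sims[i])
        negatives.extend(s for j, s in enumerate(sims) if j != i)
    mean_rank = sum(ranks) / len(ranks)
    margin = sum(positives) / len(positives) - sum(negatives) / len(negatives)
    return mean_rank, margin, margin > min_margin

mean_rank, margin, passed = validate(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
)
```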

[0053] Furthermore, graph neural network encoding is performed on the flowchart topology to obtain the flowchart embedding vector.

[0054] Specifically, the construction steps of graph neural network encoding are as follows: The graph neural network encoder uses a graph attention network (GAT) as its basic architecture, with the following structure: First, the input layer receives a directed graph of a process flow and obtains the number of nodes and edges. The initial feature of each node is a one-hot encoding of the node type, and the feature of each edge consists of the direction indicator of the directed edge and an optional pipeline type, simplified here to the direction feature only. The main body of the graph neural network encoder consists of three stacked GAT layers, each with four attention heads and a per-head output dimension of 32. After the three GAT layers, each node obtains a 128-dimensional embedding vector. A global pooling layer aggregates the node-level representation into a graph-level representation: global average pooling and global max pooling each yield a 128-dimensional vector, and the two are concatenated to form a 256-dimensional vector. Then, a two-layer multilayer perceptron (MLP) is used as the readout function, finally outputting a 256-dimensional process flow embedding vector.

[0055] The model training employs a contrastive learning framework, using flowcharts and corresponding text descriptions as positive sample pairs. The training dataset contains 10,000 pairs, divided into training, validation, and test sets in an 8:1:1 ratio. The optimizer is Adam, with an initial learning rate of 2e-4 and a weight decay of 0.01; the learning rate uses cosine annealing scheduling with a period of 100 rounds and a minimum learning rate of 1e-6. The batch size is 32, and the maximum number of iterations is 200. The loss function is also InfoNCE. A positive sample pair consists of the embedding vector of a flowchart and the embedding vector of its corresponding text description, while negative sample pairs combine a flowchart embedding vector with the text vectors of different flowcharts within the same batch.

[0056] In forward propagation, the graph structure, the initial node feature matrix, and the adjacency matrix are input. The first graph attention network (GAT) layer performs a linear transformation on each node, calculates multi-head attention coefficients, weights and aggregates neighbor features, and concatenates the results. The second and third layers repeat this process, followed by global average pooling and max pooling. Finally, a 256-dimensional flowchart embedding vector is obtained through a multilayer perceptron (MLP). Simultaneously, the text encoder outputs the corresponding text vectors, and the intra-batch similarity matrix and InfoNCE loss are calculated. The loss gradient is backpropagated from the MLP output layer through each GAT layer, with gradient clipping applied. Training stops if there is no improvement on the validation set for 12 consecutive rounds, or when training reaches 200 rounds.
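The per-node computation inside one attention head can be sketched as follows. This follows the standard graph attention formulation (linear transform, LeakyReLU-scored attention, softmax over neighbors, weighted sum) and is an assumption about the details; the weights, node names, and the choice to include self-loops and both upstream and downstream neighbors are illustrative.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(weight, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def gat_head(features, neighbors, weight, attn):
    """One attention head of a graph attention layer.

    features: node -> input vector; neighbors: node -> list of neighbor nodes
    (including the node itself); weight: out_dim x in_dim matrix;
    attn: attention vector of length 2 * out_dim.
    """
    z = {node: matvec(weight, h) for node, h in features.items()}
    outputs, attention = {}, {}
    for i, nbrs in neighbors.items():
        scores = [leaky_relu(sum(p * q for p, q in zip(attn, z[i] + z[j]))) for j in nbrs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]  # softmax over the neighborhood
        attention[i] = dict(zip(nbrs, alphas))
        outputs[i] = [sum(a * z[j][k] for a, j in zip(alphas, nbrs)) for k in range(len(weight))]
    return outputs, attention

# Three devices with one-hot node-type features.
features = {"feed": [1, 0, 0], "reactor": [0, 1, 0], "separator": [0, 0, 1]}
neighbors = {
    "feed": ["feed", "reactor"],
    "reactor": ["reactor", "feed", "separator"],
    "separator": ["separator", "reactor"],
}
weight = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
attn = [0.1, -0.3, 0.2, 0.4]
outputs, attention = gat_head(features, neighbors, weight, attn)
```

In the structure described in the text, four such heads with an output dimension of 32 each would be concatenated to give the 128-dimensional node vectors.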

[0057] Once verified, the graph neural network encoder can be deployed for flowchart embedding vector generation: for newly extracted flowchart topology, the flowchart embedding vector is directly output through forward propagation and stored in a multimodal vector database for subsequent cross-modal retrieval.

[0058] Specifically, the flowchart topology is input into a graph attention network (GAT). The GAT first randomly initializes a feature vector for each node. Then, each layer of the GAT aggregates information based on the attention weights of neighboring nodes and updates the node representation. After each GAT layer, each node obtains a latent vector that incorporates information from its upstream and downstream devices. Finally, global average pooling is performed on the latent vectors of all nodes to obtain the flowchart embedding vector.

[0059] For example, there are two different polymerization process flow diagrams for polyethylene production and polypropylene production, with different specific equipment tag numbers and parameters, but both having the same core topology. The embedding vectors of the two flow diagrams after GNN encoding have high similarity, so both can be effectively searched when a user searches for gas-phase polymerization process flow diagrams.

[0060] Furthermore, graph neural network encoding is performed on the molecular structure string to obtain the molecular structure embedding vector.

[0061] Specifically, molecular structure strings are obtained from the multimodal structured data, and cheminformatics tools are then used to parse the molecular structure strings into molecular graphs. Nodes correspond to atoms, and edges correspond to chemical bonds, which may include carbon-carbon single bonds, double bonds, and aromatic bonds in benzene rings.

[0062] The parsed molecular graphs are input into a graph attention network (GAT). The GAT first randomly initializes a feature vector for each molecular graph. Then, each layer of the GAT aggregates information based on the attention weights of neighboring nodes and updates the node representation. Through multiple iterations, the information of each atom and its bonded neighbors is aggregated. Finally, a molecular embedding vector is output through global pooling. The obtained molecular embedding vector captures the backbone structure, functional groups, and electronic properties, which are used for subsequent molecular structure description.

[0063] Finally, the text fragments are encoded using a language model fine-tuned for the chemical engineering field to obtain text embedding vectors. The text fragments are continuous string paragraphs extracted from multimodal chemical engineering documents after layout analysis and region segmentation. The text fragments vary in length and may be sentences, paragraphs, or chapters.

[0064] Specifically, each text fragment is sequentially extracted from multimodal structured data. The text fragments are then segmented, and [CLS] and [SEP] tags are added before being input into a BERT model fine-tuned for the chemical engineering field. After forward propagation, the output vector corresponding to the [CLS] tag is taken, or average pooling is performed on all output tags to obtain a fixed-dimensional text embedding vector. Finally, the embedding vectors of all text fragments, along with their original text and source metadata, are stored in a multimodal vector database for subsequent use when retrieving text fragments using text queries.
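The two pooling options mentioned above can be sketched as follows, assuming the model's per-token output vectors are already available with the [CLS] vector in first position. The function name `text_embedding` is hypothetical, and the BERT forward pass itself is not shown.

```python
def text_embedding(token_vectors, strategy="cls"):
    """Reduce per-token output vectors to one fixed-dimensional vector.

    token_vectors[0] is assumed to be the output at the [CLS] position.
    """
    if strategy == "cls":
        return list(token_vectors[0])
    if strategy == "mean":
        count = len(token_vectors)
        return [sum(column) / count for column in zip(*token_vectors)]
    raise ValueError(f"unknown pooling strategy: {strategy}")

token_outputs = [[0.2, 0.4], [0.6, 0.0], [0.1, 0.2]]
cls_vector = text_embedding(token_outputs, "cls")
mean_vector = text_embedding(token_outputs, "mean")
```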

[0065] In this embodiment, the numerical sequence of the curve is input into a time series encoder to obtain a curve shape vector. Through contrastive learning, a cross-modal mapping between the curve shape and the natural language trend description is established, achieving deep semantic alignment between the curve shape and the text description. The flowchart topology and the molecular structure string are encoded using graph neural networks, and the text fragments are encoded using a language model fine-tuned for the chemical engineering field, to obtain flowchart embedding vectors, molecular structure embedding vectors, and text embedding vectors, respectively. This enables subsequent cross-modal retrieval to obtain a more accurate initial candidate set, reduces the pressure on the re-ranking model, and improves the accuracy of answer generation.

[0066] S40: Receive a user's natural language question and encode the user's natural language question into a query embedding vector; In this embodiment, the user's natural language question is a consultation request made by a customer or salesperson in everyday spoken or written language in the online sales scenario of chemical products; the encoding uses the same language model fine-tuned for the chemical field that encodes the text fragments, converting the input question text into an embedding vector of the same dimension; the query embedding vector is the embedding vector corresponding to the question text, which is used to calculate the similarity with the multimodal embedding vectors stored in the database in subsequent steps.

[0067] Specifically, when a user enters a question through the intelligent customer service dialog box, the system first receives the string and obtains the user's natural language question. Then, using the same encoding method used for the text fragment embedding vectors, the collected natural language question is converted into a query embedding vector of the same dimension.

[0068] In this embodiment, the user's natural language query is converted into a representation in the same semantic space as the vectors embedded in the multimodal vector database, making cross-modal retrieval possible.

[0069] S50: Perform cross-modal retrieval in the multimodal vector database to obtain several multimodal candidate results with the highest similarity to the query embedding vector, wherein the multimodal candidate results include text candidates, table candidates, graph candidates, flowchart candidates and molecular structure candidates; In this embodiment of the application, cross-modal retrieval is to retrieve data of different modalities from the database using a query of a certain modality; the highest similarity is sorted from high to low according to the calculated similarity score, and the result with the highest ranking is taken; the multimodal candidate result is the result set returned by the retrieval, in which each element is the original data unit and its metadata from different modalities.

[0070] Specifically, the encoded query embedding vector is sent to a multimodal vector database. An approximate nearest neighbor search is performed in the multimodal vector database to calculate the similarity between the query embedding vector and all text, table, curve, flowchart, and molecular structure embedding vectors in the database. Subsequently, the top-ranked similarity candidate results are retrieved and integrated as the multimodal candidate results for return.
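The retrieval step can be sketched as follows. For clarity this uses an exact brute-force scan rather than an approximate nearest neighbor index, which is a deliberate simplification; the record layout (`id`, `modality`, `vector`) and the function name are assumptions.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_top_k(query_vector, database, k=3):
    """Return the k entries most similar to the query as (score, id, modality)."""
    scored = [
        (cosine(query_vector, entry["vector"]), entry["id"], entry["modality"])
        for entry in database
    ]
    return heapq.nlargest(k, scored, key=lambda item: item[0])

database = [
    {"id": "text-1", "modality": "text", "vector": [1.0, 0.0]},
    {"id": "curve-7", "modality": "graph", "vector": [0.8, 0.6]},
    {"id": "mol-3", "modality": "molecular_structure", "vector": [0.0, 1.0]},
]
top = retrieve_top_k([1.0, 0.2], database, k=2)
```

Because all modalities share one embedding space, a single scan returns candidates of mixed modality, ordered by similarity.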

[0071] Step S50 of the method provided in this embodiment of the invention involves performing cross-modal retrieval in the multimodal vector database to obtain several multimodal candidate results with the highest similarity to the query embedding vector. This step includes the following prior steps: Modal intent prediction is performed on the user's natural language question to obtain the expected answer modal type, wherein the expected answer modal type includes text modality, table modality, graph modality, flowchart modality, and molecular structure modality; Based on the expected answer modality type, different initial similarity weights are assigned to the search results of different modalities.

[0072] In this embodiment, firstly, modal intent prediction is based on the text content of the user's natural language question. It utilizes keywords, sentence structure, interrogative words, and domain prior knowledge in the question to automatically determine the information modality type of the user's expected answer through classification models or rule-based reasoning.

[0073] The expected answer modality type is the modality category to which the answer most likely expected by the user belongs. Five modality types are defined: text modality, which presents the answer in the form of natural language paragraphs, such as product descriptions and safety instructions; table modality, which presents the answer in the form of structured row-column tables, such as performance parameter comparison tables; graph modality, which presents the answer in the form of numerical sequences or curve images, such as viscosity-temperature curves; flowchart modality, which presents the answer in the form of directed graphs of process flows, such as equipment connection relationships; and molecular structure modality, which presents the answer in the form of molecular diagrams or SMILES strings, such as chemical structural formulas. Specifically, after the user's question is encoded into a query embedding vector, and before a retrieval request is sent to the multimodal vector database, the user's natural language question is first passed to the modal intent prediction module. Internally, the modal intent prediction module uses a dialogue-tuned expected answer modality classification model. The model outputs probability scores for five categories: flowchart modality probability, text modality probability, table modality probability, graph modality probability, and molecular structure modality probability. The expected answer modality type is the category with the highest probability.

[0074] For example, if the user asks for the flowchart of the reaction between phenol and acetone to produce bisphenol A, the question is input into the modal intent prediction module, which internally uses the expected answer modality classification model to output the probability scores of five categories: flowchart modality probability 0.92, text modality probability 0.05, table modality probability 0.02, graph modality probability 0.01, and molecular structure modality probability 0.
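Selecting the expected answer modality from these scores amounts to taking the category with the highest probability, as the following sketch shows. The dictionary keys and the function name are illustrative; the probabilities are the ones from the example above.

```python
def expected_modality(probabilities):
    """Return the modality with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)

scores = {
    "flowchart": 0.92,
    "text": 0.05,
    "table": 0.02,
    "graph": 0.01,
    "molecular_structure": 0.0,
}
predicted = expected_modality(scores)
```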

[0075] Secondly, in the similarity calculation stage of cross-modal retrieval, the candidate results from different modalities are assigned corresponding modal weights. The weights can be determined according to the degree of influence of each modality on the retrieval results to suit different application scenarios, and their sum is 1.

[0076] Specifically, based on the modality intent prediction results, a weight factor is assigned to each modality. Then, the original similarity score is multiplied by the weight of the corresponding modality to obtain a weighted similarity score. The weighted score is then used to reorder the preliminary search results.

[0077] For example, after obtaining the expected answer modality type, modality weights are configured. For instance, when the expected modality is a flowchart, the weight of the flowchart modality is 0.1, the weight of the text modality is 0.3, the weight of the table modality is 0.3, the weight of the line graph modality is 0.1, and the weight of the molecular structure modality is 0.2.
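The weighting mechanism itself (multiply each raw similarity by its modality weight, then re-sort) can be sketched as follows. The weights used in this snippet are hypothetical values chosen only to show the reordering effect; they are not the values from the example above, and the record fields are assumptions.

```python
def reweight(results, modality_weights):
    """Scale each similarity by its modality weight and sort by the result."""
    rescored = [
        dict(result, weighted=result["similarity"] * modality_weights[result["modality"]])
        for result in results
    ]
    return sorted(rescored, key=lambda r: r["weighted"], reverse=True)

# Hypothetical weights that sum to 1 and favor the flowchart modality.
weights = {
    "flowchart": 0.5,
    "text": 0.2,
    "table": 0.1,
    "graph": 0.1,
    "molecular_structure": 0.1,
}
preliminary = [
    {"id": "t1", "modality": "text", "similarity": 0.80},
    {"id": "f1", "modality": "flowchart", "similarity": 0.70},
]
ranked = reweight(preliminary, weights)
```

With these illustrative weights, the flowchart candidate overtakes the text candidate even though its raw similarity is lower.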

[0078] In this embodiment, modal intent prediction is performed on the user's natural language question, and initial similarity weights are assigned to different modalities. This achieves modality-aware retrieval enhancement, enabling candidate results of the desired modality to obtain a higher ranking position in the initial retrieval list, and achieving accurate identification of the user's modal intent.

[0079] S60: Perform fusion sorting on the plurality of multimodal candidate results to obtain the reordered multimodal context; In this embodiment, the fusion ranking is to re-score and rank the list of multimodal candidate results obtained from the initial retrieval using a more refined model or strategy, so as to correct the possible biases in the initial vector retrieval and place the most relevant results at the top; the re-ranked multimodal context is a set of the most relevant and diverse multimodal information fragments that are selected after fine ranking and used as input to the answer generation model.

[0080] Specifically, the multimodal candidate results obtained after the initial sorting are first re-scored and re-ranked using a fusion ranking method. The user question is then concatenated with each candidate result and input into the model, outputting a relevance score. Simultaneously, modality type weights or numerical accuracy checks may be combined to correct biases. After fusion ranking, the results are sorted according to the new scores, and the top few re-ranked results are used as the multimodal context.

[0081] Step S60 in the method provided in this embodiment of the invention includes: The candidate results from different modalities among the plurality of multimodal candidate results are merged according to a weighted fusion strategy to obtain a fused candidate list, wherein the weighted fusion strategy uses the initial similarity weight; The user's natural language question is concatenated with each candidate result in the fusion candidate list, and then input into the cross-modal re-ranking model to obtain the relevance score of each candidate result; The top K candidate results are selected as the reordered multimodal contexts based on their relevance scores, sorted from highest to lowest.

[0082] In this embodiment, firstly, candidate results from different modalities among several multimodal candidate results, along with predefined weight coefficients, are merged according to a weighted fusion strategy to obtain a fused candidate list. The weighted fusion strategy uses initial similarity weights. Therefore, the fused candidate result = flowchart modality weight × flowchart modality + text modality weight × text modality + table modality weight × table modality + graph modality weight × graph modality + molecular structure modality weight × molecular structure modality. Fusion calculations are performed on all candidate results, and finally, a candidate result list is constructed.

[0083] Secondly, the cross-modal re-ranking model is used to refine the scoring and ranking of the initially retrieved multimodal candidate results. Its input is the user's natural language question and a single candidate result, and its output is a relevance score.

[0084] Specifically, a cross-encoder model based on the Transformer architecture is used, such as a BERT or RoBERTa model fine-tuned on chemical engineering data. The query and the candidate content are concatenated and input into the model. Through a multi-head self-attention mechanism, each word in the question interacts directly with each word in the candidate content, so relevance is determined more accurately. This model typically outputs a relevance score between 0 and 1, representing the degree of relevance between the candidate result and the question.

[0085] For example, a Transformer-based cross-encoder architecture has the following model structure: the input layer receives the concatenated sequence; the embedding layer uses BERT-style embeddings consisting of token embeddings, position embeddings, and segment embeddings, with an embedding dimension of 768 and a vocabulary size of 30522. This is followed by a 12-layer Transformer encoder, each layer containing multi-head self-attention and a feedforward neural network, with each sub-layer followed by a residual connection and layer normalization. From the output of the last encoder layer, the 768-dimensional vector at the [CLS] position is extracted, reduced in dimensionality by a linear layer, and output as a scalar. Finally, the scalar is mapped to the (0,1) interval using a Sigmoid activation function, giving the relevance score.

[0086] During model training, the dataset was constructed from chemical industry data logs, containing 500,000 query-candidate pairs, each labeled with a relevance score. Positive samples were derived from user clicks or manually labeled highly relevant candidates, while negative samples were obtained through random sampling and hard negative example mining. The dataset was divided into training, validation, and test sets in an 8:1:1 ratio. The optimizer was Adam, with an initial learning rate of 2e-5 and a weight decay of 0.01; the learning rate decays linearly after a linear warm-up; the batch size is 32; and the maximum number of training rounds is 5.

[0087] The loss function is binary cross-entropy, which calculates the loss between the predicted result and the true label. In forward propagation, the user question and candidate content are concatenated and fed into the embedding layer to obtain the [CLS] and sequence embedding representations. After passing through the 12-layer Transformer encoder, the [CLS] vector is extracted, and the score is output after two linear layers and a sigmoid function. In backpropagation, the loss is calculated, the gradient at the last linear layer is computed and propagated layer by layer back to the embedding layer, and the parameters are updated based on the gradients. Training stops when the validation set does not improve for three consecutive rounds or when five rounds are reached, and the optimal weights are restored. After successful validation, the cross-modal re-ranking model can be deployed in the fusion ranking process, calculating relevance scores for each candidate in the weighted fusion list for final re-ranking.

[0088] Finally, the results are reordered from highest to lowest relevance score, and then the top K candidate results with higher relevance scores are selected as the reordered multimodal contexts.

[0089] In step S60 of the method provided in this embodiment of the invention, the cross-modal reordering model performs the following judgment when calculating the relevance score: Calculate the matching degree between the expected answer modality type and the actual modality type of the candidate result, and set it as the modality alignment score; When the user's natural language question contains numerical conditions, check whether the candidate results contain precise numerical values that meet the numerical conditions, and set this as the numerical accuracy score. The modality alignment score and the numerical accuracy score are weighted and fused into the relevance score.

[0090] In this embodiment of the application, firstly, the actual modal type of the candidate result is obtained, and then the modal alignment score is set according to the matching degree between the expected answer modal type and the actual modal type of the candidate result.

[0091] Specifically, the matching score is set based on the degree of similarity between the expected answer modality and the actual modality of the candidate result. For example, the matching score is set to 0 when the expected answer modality and the actual modality of the candidate result are completely different; for a close match, it can be set to 0.4-0.6; and for a perfect match, the matching score is set to 1. For instance, if the candidate result is a text fragment and the actual modality is text, the matching score between the expected and actual modalities can be set to 0.6, resulting in a modality alignment score of 0.6.

[0092] Furthermore, when the user's natural language question contains numerical conditions, the candidate results are checked to see if they contain precise numerical values that meet the numerical conditions, and this is set as a numerical accuracy score. Here, the numerical conditions are the constraining descriptions of numerical values that appear in the user's natural language question, including but not limited to specific numerical values, comparison relationships, numerical ranges, and implicit numerical requirements.
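A deliberately simple sketch of such a check is shown below. It assumes the numerical condition has already been extracted from the question as a comparison operator and a threshold, ignores units, and returns a binary score; the text allows graded scores, so the binary scoring, the regular expression, and the function name are all assumptions.

```python
import operator
import re

COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}

def numeric_accuracy_score(candidate_text, comparator, threshold):
    """Return 1.0 if any number in the candidate satisfies the condition, else 0.0.

    Units are not checked in this sketch; a real system would need to match
    the quantity and unit as well as the value.
    """
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", candidate_text)]
    satisfied = any(COMPARATORS[comparator](v, threshold) for v in values)
    return 1.0 if satisfied else 0.0

# Condition taken from a hypothetical question: "tensile strength of at least 20 MPa".
score_hit = numeric_accuracy_score("Tensile strength: 25.5 MPa", ">=", 20)
score_miss = numeric_accuracy_score("Tensile strength: 15 MPa", ">=", 20)
```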

[0093] Finally, the modal alignment score and numerical accuracy score are weighted and merged into the relevance score.

[0094] Specifically, for the current candidate result, the cross-modal re-ranking model first calculates an initial semantic relevance score from the concatenated input of the query and candidate content using a Transformer network. Suppose the modal alignment score is 1 and the numerical accuracy score is 0.6. Weights are assigned according to the influence of the modal alignment score, the numerical accuracy score, and the initial relevance score on the final relevance score. Assuming that the weights w1=0.4, w2=0.3, and w3=0.3 apply to the modal alignment score, the numerical accuracy score, and the initial relevance score, respectively, and that the initial relevance score is 0.6, the final relevance score is 0.4×1.0 + 0.3×0.6 + 0.3×0.6 = 0.76. This final score serves as the candidate's relevance score for subsequent ranking.
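The arithmetic of this example can be reproduced with a short sketch. The set of "close" modality pairs and the value 0.5 used for a close match (inside the 0.4 to 0.6 range given earlier) are assumptions; the weights and input scores are the ones from the example.

```python
def modality_alignment(expected, actual, close_pairs=frozenset()):
    """1.0 for a perfect match, 0.5 for an assumed close match, 0.0 otherwise."""
    if expected == actual:
        return 1.0
    if frozenset((expected, actual)) in close_pairs:
        return 0.5
    return 0.0

def final_relevance(alignment, numeric_accuracy, initial_relevance,
                    w1=0.4, w2=0.3, w3=0.3):
    """Weighted fusion of the three component scores."""
    return w1 * alignment + w2 * numeric_accuracy + w3 * initial_relevance

# Hypothetical notion of closeness: tables and graphs both carry numeric data.
close = frozenset({frozenset(("table", "graph"))})

alignment = modality_alignment("graph", "graph", close)
score = final_relevance(alignment, numeric_accuracy=0.6, initial_relevance=0.6)
```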

[0095] In this embodiment, candidate results of different modalities in the multimodal candidate results are merged according to a weighted fusion strategy to obtain a fused candidate list, providing a data foundation for subsequent re-ranking; the matching degree between the expected answer modality type and the actual modality type of the candidate results is calculated and set as the modality alignment score to ensure the alignment of each modality; when the user's natural language question contains numerical conditions, it is checked whether it contains precise numerical values that meet the numerical conditions, and this is set as the numerical accuracy score, which is used as the ranking standard to improve the accuracy of re-ranking; relevance scores are obtained through weighted fusion and sorted from high to low, reducing the dependence on large-scale high-quality labeled data; the top K candidate results are selected as the multimodal context after re-ranking, providing more reliable evidence for generating answers.

[0096] S70: Generate answer text based on the user's natural language question and the reordered multimodal context.

[0097] In this embodiment of the application, the answer text is generated by utilizing the generation capabilities of a large language model. The user's question and the retrieved multimodal context are organized according to a specific template and then input into the model. The model then generates a fluent, accurate text answer that contains the necessary data or description.

[0098] Specifically, the non-text data in the reordered multimodal context is first converted into text. Then, the converted multimodal context, along with the user's natural language question, is used as input data to a large language model for inference calculation, and finally, the answer text is output.

[0099] Step S70 in the method provided in this embodiment of the invention includes: The reordered multimodal context curve candidates are converted into a list of discrete data points and trend description text; The flowchart candidates in the reordered multimodal context are converted into natural language flowchart descriptions; The molecular structure candidates in the reordered multimodal context are converted into a list of molecular attributes; The transformed, reordered multimodal context is assembled with the user's natural language question into a prompt template, which is then input into the large language model. Extract the answer text from the output of the large language model, and append the source metadata corresponding to each citation to the end of the answer text; When the user's natural language question requires a comparison of the performance of multiple products, the system retrieves the curve numerical sequences and structured table data corresponding to the multiple products from the multimodal vector database, converts the curve numerical sequences and structured table data corresponding to the multiple products into a comparison table format, inputs it into the large language model, and generates answer text containing the comparison table.

[0100] In this embodiment of the application, firstly, by calling the curve conversion, the curve candidates in the reordered multimodal context are converted into a list of discrete data points, that is, the numerical sequence of the curve candidates is converted into a list of discrete data points, and trend description text is generated according to the trend type and inflection point.

[0101] For example, the numerical sequence is [(20,12000),(40,2500),(60,800),(80,300),(100,150)], with statistical characteristics including: a maximum value of 12000 mPa·s, a minimum value of 150 mPa·s, an inflection point near 60°C, and a trend type of "monotonically decreasing with a gradually decreasing decay rate". A curve conversion function is used to format the numerical sequence into a list of discrete data points: viscosity is 12000 mPa·s at 20°C, 2500 mPa·s at 40°C, 800 mPa·s at 60°C, 300 mPa·s at 80°C, and 150 mPa·s at 100°C. Then, a trend description text is generated based on the trend type and inflection point: the viscosity-temperature curve shows a monotonically decreasing trend, the decay rate gradually slows down with increasing temperature, and an inflection point appears near 60°C, indicating that the viscosity decreases less beyond this temperature. The converted text is used as part of the subsequent prompt template.
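A curve conversion function of this kind might look like the sketch below. The function name, its parameters, and the slope-based trend rule are assumptions; inflection point detection is left out because the text does not say how the inflection point is determined.

```python
def curve_to_text(points, x_unit, y_name, y_unit):
    """Format (x, y) pairs as data point sentences and classify the trend."""
    data_points = [f"{y_name} is {y:g} {y_unit} at {x:g}{x_unit}" for x, y in points]
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]
    if all(s < 0 for s in slopes):
        trend = "monotonically decreasing"
        # The decay slows down when each slope is smaller in magnitude than the last.
        if all(abs(later) < abs(earlier) for earlier, later in zip(slopes, slopes[1:])):
            trend += " with a gradually decreasing decay rate"
    elif all(s > 0 for s in slopes):
        trend = "monotonically increasing"
    else:
        trend = "non-monotonic"
    return data_points, trend

sequence = [(20, 12000), (40, 2500), (60, 800), (80, 300), (100, 150)]
data_points, trend = curve_to_text(sequence, "°C", "viscosity", "mPa·s")
```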

[0102] Furthermore, the reordered multimodal context contains flowchart candidates. Through flowchart transformation, directed graph nodes are converted into natural language flowchart descriptions. Nodes without incoming edges are identified as starting points. Then, a depth-first traversal is performed starting from the starting point. For each edge visited, a description is generated, and finally, the descriptions are concatenated into a natural language flowchart description paragraph.
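
The traversal described above can be sketched as follows. This is a minimal sketch under stated assumptions: the node identifiers, the symbol types, and the sentence pattern "feeds" are hypothetical, and the application does not prescribe a particular wording for each edge.

```python
from typing import Dict, List, Set, Tuple


def describe_flowchart(nodes: Dict[str, str], edges: List[Tuple[str, str]]) -> str:
    """Turn a directed process graph into a paragraph, one sentence per edge.

    `nodes` maps a node id to its symbol type; `edges` is a list of (source, target).
    """
    successors: Dict[str, List[str]] = {node: [] for node in nodes}
    has_incoming: Set[str] = set()
    for src, dst in edges:
        successors[src].append(dst)
        has_incoming.add(dst)

    # Nodes without incoming edges are the starting points.
    starts = [node for node in nodes if node not in has_incoming]

    visited: Set[str] = set()
    sentences: List[str] = []

    def visit(node: str) -> None:
        # Depth-first traversal; each node is expanded once, so each edge is described once.
        visited.add(node)
        for nxt in successors[node]:
            sentences.append(
                f"The {nodes[node]} ({node}) feeds the {nodes[nxt]} ({nxt})."
            )
            if nxt not in visited:
                visit(nxt)

    for start in starts:
        if start not in visited:
            visit(start)
    return " ".join(sentences)
```

A graph that consists only of a closed loop has no node without incoming edges, so this sketch would return an empty description for it; how such a case should be handled is not stated in the source.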

[0103] Furthermore, the molecular structure candidates in the reordered multimodal context are converted into a list of molecular attributes. When the reordered multimodal context contains a molecular structure candidate, the RDKit toolkit is called to parse the molecular structure string, calculate the molecular weight, and detect functional groups; the attributes are then formatted into a list, and the original SMILES string is retained as a fallback.

[0104] Furthermore, the transformed multimodal context is obtained, including a list of discrete data points and trend descriptions for candidate curve graphs, a process description for candidate flowcharts, a list of attributes for candidate molecular structures, and the original text fragments. Using a predefined prompt template, the transformed multimodal context and the user's natural language question are populated into the template to generate a complete prompt string. The prompt template is a predefined text structure used to organize the question and context into the input format expected by the large language model. Subsequently, the locally deployed large language model is invoked, using the prompt string as input, and finally, the large language model generates the answer.
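
Prompt assembly of this kind can be sketched as follows. The template wording, the numbering of context items, and the name `build_prompt` are assumptions for illustration; the application only states that a predefined template is filled with the question and the transformed context.

```python
from typing import List

# Hypothetical template text; the actual template is not given in the source.
PROMPT_TEMPLATE = (
    "You are a customer service assistant for chemical products.\n"
    "Answer using only the context below and cite context items by number.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, context_items: List[str]) -> str:
    """Fill the template with numbered context items and the user's question."""
    numbered = [f"[{i}] {item}" for i, item in enumerate(context_items, start=1)]
    return PROMPT_TEMPLATE.format(context="\n".join(numbered), question=question)
```

Numbering the context items gives the model a stable handle for each piece of evidence, which makes it easier to attach source metadata to citations afterwards.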

[0105] Furthermore, the answer text is extracted either by post-processing the raw model output or by parsing the answer field from structured output. The source metadata corresponding to each citation is then appended to the end of the answer text. The source metadata is the original document information associated with each candidate result, such as the document name and page number.
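
The two extraction paths and the appending of source metadata can be sketched as follows. This sketch assumes that structured output is a JSON object with an `answer` key and that each source is a dictionary with `document` and `page` keys; both conventions are hypothetical and not taken from the application.

```python
import json
from typing import Dict, List


def extract_answer(raw_output: str) -> str:
    """Return the answer field of JSON output, or the trimmed raw text otherwise."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Not structured output: fall back to simple post-processing.
        return raw_output.strip()
    if isinstance(parsed, dict) and "answer" in parsed:
        return str(parsed["answer"]).strip()
    return raw_output.strip()


def append_sources(answer: str, sources: List[Dict[str, object]]) -> str:
    """Append one numbered line of source metadata per cited candidate result."""
    if not sources:
        return answer
    lines = [
        f"[{i}] {source['document']}, page {source['page']}"
        for i, source in enumerate(sources, start=1)
    ]
    return answer + "\n\nSources:\n" + "\n".join(lines)
```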

[0106] Finally, when a user's natural language question asks for a comparison of the performance of multiple products, the system retrieves the curve data sequences and structured table data corresponding to multiple products from the multimodal vector database. These data are then integrated into a two-dimensional comparison table, where rows represent products and columns represent performance metrics. This table is then input into a large language model to generate the answer text containing the comparison table.
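
The integration into a two-dimensional comparison table can be sketched as follows, with rows as products and columns as performance metrics. The pipe-delimited text layout, the `n/a` placeholder for missing values, and the function name are assumptions for this sketch; the application does not specify how the table is serialized before it is passed to the model.

```python
from typing import Dict, List


def build_comparison_table(
    products: Dict[str, Dict[str, str]], missing: str = "n/a"
) -> str:
    """Build a text table with one row per product and one column per metric."""
    # Collect metric names in first-seen order so columns are stable.
    metrics: List[str] = []
    for values in products.values():
        for metric in values:
            if metric not in metrics:
                metrics.append(metric)

    header = "| " + " | ".join(["Product"] + metrics) + " |"
    divider = "| " + " | ".join(["---"] * (len(metrics) + 1)) + " |"
    rows = [
        "| " + " | ".join([name] + [values.get(m, missing) for m in metrics]) + " |"
        for name, values in products.items()
    ]
    return "\n".join([header, divider] + rows)
```

Marking a missing metric explicitly, instead of leaving the cell empty, reduces the chance that the model fills the gap with an invented value.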

[0107] In step S70 of the method provided in this embodiment of the invention, the multimodal chemical documents include technical data sheets, safety data sheets, product manuals, and process flow diagrams for chemical products; the numerical sequence of curves includes a viscosity-temperature curve numerical sequence, a particle size distribution curve numerical sequence, and a stress-strain curve numerical sequence; and the molecular structure string includes the SMILES string.

[0108] In this embodiment of the application, when loading multimodal chemical documents, the following can be imported in batches: technical data sheets, safety data sheets, product manuals, and process flow diagrams of chemical products. After loading the documents, the subsequent layout analysis and region segmentation module will perform specialized processing on the tables and curves in the technical data sheets, the text and molecular formulas in the safety data sheets, the flowcharts in the product manuals, and the symbols and connecting lines in the process flow diagrams.

[0109] When acquiring the curve graph numerical sequences, coordinate axis detection, discrete point extraction, function restoration, and statistical feature extraction are performed for each type of curve. When processing viscosity-temperature curves, the horizontal axis is identified as temperature and the vertical axis as viscosity; when processing particle size distribution curves, the horizontal axis is identified as particle size and the vertical axis as cumulative passing percentage or frequency; when processing stress-strain curves, the horizontal axis is identified as strain and the vertical axis as stress. In each case, the numerical sequence is then restored and its statistical features are extracted.
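
One way to picture the mapping from detected pixels to physical values, and the simplest statistical features, is the sketch below. It assumes a linear axis calibrated from two detected tick marks; logarithmic axes, inflection point detection, and the function names are outside what this sketch covers and are not taken from the application.

```python
from typing import Callable, Dict, List, Tuple


def make_axis_mapper(
    pixel_a: float, value_a: float, pixel_b: float, value_b: float
) -> Callable[[float], float]:
    """Build a linear pixel-to-physical mapping from two calibrated tick marks."""
    if pixel_a == pixel_b:
        raise ValueError("the two tick marks must have different pixel positions")
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel: float) -> float:
        return value_a + (pixel - pixel_a) * scale

    return to_value


def basic_statistics(points: List[Tuple[float, float]]) -> Dict[str, object]:
    """Compute maximum, minimum, and a coarse trend type from (x, y) points."""
    ys = [y for _, y in points]
    if all(b < a for a, b in zip(ys, ys[1:])):
        trend = "monotonically decreasing"
    elif all(b > a for a, b in zip(ys, ys[1:])):
        trend = "monotonically increasing"
    else:
        trend = "non-monotonic"
    return {"max": max(ys), "min": min(ys), "trend": trend}
```

Image coordinates usually grow downward, so the vertical axis is calibrated with the larger pixel value at the smaller physical value; the same function handles that because the scale simply becomes negative.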

[0110] In this embodiment, the curve graph candidates, flowchart candidates, and molecular structure candidates in the reordered multimodal context are transformed respectively, realizing the understandable transformation of multimodal data. Then, the transformed reordered multimodal context is assembled with the user's natural language question into a prompt template, input into a large language model, the answer text is extracted, and the source metadata of the citation information is attached, providing a traceable answer citation mechanism. When the user's natural language question requires a comparison of the performance of multiple products, the corresponding curve graph numerical sequence and structured table data are retrieved, converted into a comparison table format, input into the large language model, and the answer text containing the comparison table is generated, enhancing the usability of the answer and the user experience.

[0111] The embodiments of this application, through the above specific implementation methods, achieve the following technical effects: In this embodiment, a unified loading mechanism is first used to overcome the limitation of existing technologies that can only process single text or simple images, laying a data foundation for building a comprehensive chemical knowledge base and ensuring that subsequent analysis can be based on rich and authentic multi-source documents in the chemical field.

[0112] Furthermore, through coordinate axis detection, discrete point extraction, and function restoration, the continuous numerical sequence corresponding to the curve is recovered with high precision, filling the gap in curve graph data in the chemical industry. Systematic errors are eliminated by clearly defining the detection units and scale values and establishing a precise pixel-physical coordinate mapping. High-value statistical features are extracted to enhance the information density and practicality of the answer. The curve numerical sequence, coupled with its caption text, provides a precise data foundation for comparison. Subsequently, a pre-constructed standard chemical symbol library of typical symbols, combined with a chemical symbol dataset, is used to accurately identify symbol instances, reducing the false detection rate. Directed connections between symbol instances are established through line orientation recognition, constructing a standard graph data structure and providing a structured reasoning basis for answer generation.

[0113] Furthermore, the numerical sequence of the curve graph is input into a time series encoder to obtain the curve shape vector. Through contrastive learning, a cross-modal mapping between the curve shape and the natural language trend description is established, achieving deep semantic alignment between the curve shape and the text description. The flowchart topology and the molecular structure string are each encoded with a graph neural network, and the text fragment is encoded using a language model fine-tuned for the chemical engineering field, to obtain flowchart embedding vectors, molecular structure embedding vectors, and text embedding vectors respectively. This enables subsequent cross-modal retrieval to obtain a more accurate initial candidate set, reduces the pressure on the re-ranking model, and improves the accuracy of answer generation.

[0114] Furthermore, converting the user's natural language question into a representation in the same semantic space as the embedding vectors in the multimodal vector database makes cross-modal retrieval possible.

[0115] Furthermore, modal intent prediction is performed on the user's natural language questions, and initial similarity weights are assigned to different modalities. This achieves modality-aware retrieval enhancement, enabling candidate results of the desired modality to obtain a higher ranking position in the initial retrieval list, and achieving accurate identification of the user's modal intent.

[0116] Furthermore, candidate results from different modalities in the multimodal candidate results are merged according to a weighted fusion strategy to obtain a fused candidate list, providing a data foundation for subsequent re-ranking; the matching degree between the expected answer modality type and the actual modality type of each candidate result is calculated and set as the modality alignment score to ensure alignment of each modality; when the user's natural language question contains numerical conditions, each candidate result is checked for precise numerical values that meet the numerical conditions, and the outcome is set as the numerical accuracy score, which is used as a ranking criterion to improve the accuracy of re-ranking; relevance scores are obtained through weighted fusion and sorted from high to low, reducing the dependence on large-scale high-quality labeled data; the top K candidate results are selected as the multimodal context after re-ranking, providing more reliable evidence for generating answers.

[0117] Finally, the curve graph candidates, flowchart candidates, and molecular structure candidates in the reordered multimodal context are each transformed, converting the multimodal data into a form the language model can understand. Subsequently, the transformed reordered multimodal context is assembled with the user's natural language question into a prompt template and input into a large language model, the answer text is extracted, and the source metadata of each citation is attached, providing a traceable answer citation mechanism. When the user's natural language question requires a comparison of the performance of multiple products, the corresponding curve graph numerical sequences and structured table data are retrieved, converted into a comparison table format, and input into the large language model, and answer text containing the comparison table is generated, enhancing the usability of the answer and the user experience.

[0118] Embodiment 2: As shown in Figure 2, based on the same inventive concept as the method for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval provided in Embodiment 1, this embodiment of the invention also provides a system for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval, the system comprising: a document acquisition module 11, used to load multimodal chemical engineering documents; a multimodal data acquisition module 12, used to perform layout analysis and region segmentation on the multimodal chemical engineering document to obtain multimodal structured data, wherein the multimodal structured data includes text fragments, structured table data, curve graph numerical sequences, flowchart topology, and molecular structure strings; an embedding vector storage module 13, used to perform multimodal encoding on the multimodal structured data respectively to obtain the corresponding embedding vectors, and to store the embedding vectors in the multimodal vector database; an embedding vector query module 14, used to receive a user's natural language question and encode it into a query embedding vector; a candidate result acquisition module 15, used to perform cross-modal retrieval in the multimodal vector database to obtain a plurality of multimodal candidate results with the highest similarity to the query embedding vector, wherein the multimodal candidate results include text candidates, table candidates, curve graph candidates, flowchart candidates, and molecular structure candidates; a context acquisition module 16, used to perform fusion sorting on the plurality of multimodal candidate results to obtain the reordered multimodal context; and an answer text generation module 17, used to generate answer text based on the user's natural language question and the reordered multimodal context.

[0119] In one embodiment, the multimodal data acquisition module 12 is used to: detect the coordinate axes, units, and scale values of the curve graph region; extract discrete data points on the curve in the curve graph region; restore the discrete data points to a functional form to generate a numerical sequence, which includes at least one of a temperature-viscosity correspondence and a stress-strain correspondence; and extract the statistical features of the numerical sequence, wherein the statistical features include the maximum value, minimum value, inflection point, and trend type.

[0120] Specifically, performing layout analysis and region segmentation on the multimodal chemical engineering document to obtain the flowchart topology includes: loading a pre-built chemical standard symbol library, wherein the chemical standard symbol library includes pump symbol templates, reactor symbol templates, valve symbol templates, and heat exchanger symbol templates; performing symbol recognition on the flowchart region to obtain symbol instances and their corresponding types; identifying the direction of the connecting lines in the flowchart region and establishing directed connection relationships between the symbol instances; and organizing the types of the symbol instances and the directed connection relationships into a directed process flow graph, which serves as the flowchart topology.

[0121] In one embodiment, the embedding vector storage module 13 is used to perform multimodal encoding on the multimodal structured data respectively to obtain the corresponding embedding vectors, including: extracting the caption text associated with the curve graph numerical sequence, and encoding the caption text with a language model fine-tuned for the chemical engineering field to obtain a trend text vector; inputting the curve graph numerical sequence into a time series encoder to obtain a curve shape vector, and performing contrastive learning between the curve shape vector and the trend text vector so that the two are aligned in the embedding space, thereby obtaining the curve graph embedding vector; performing graph neural network encoding on the flowchart topology to obtain the flowchart embedding vector; performing graph neural network encoding on the molecular structure string to obtain the molecular structure embedding vector; and encoding the text fragment with a language model fine-tuned for the chemical engineering field to obtain the text embedding vector.

[0122] In one embodiment, the candidate result acquisition module 15 is used to: perform modal intent prediction on the user's natural language question to obtain the expected answer modality type, wherein the expected answer modality type includes text modality, table modality, curve graph modality, flowchart modality, and molecular structure modality; and, based on the expected answer modality type, assign different initial similarity weights to the retrieval results of different modalities.

[0123] In one embodiment, the context acquisition module 16 is used to: merge the candidate results from different modalities among the plurality of multimodal candidate results according to a weighted fusion strategy to obtain a fused candidate list, wherein the weighted fusion strategy uses the initial similarity weights; concatenate the user's natural language question with each candidate result in the fused candidate list and input the result into the cross-modal re-ranking model to obtain the relevance score of each candidate result; and sort the candidate results by relevance score from highest to lowest and select the top K candidate results as the reordered multimodal context.
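
The weighted fusion and top-K selection can be sketched as follows. The cross-modal re-ranking model is a trained model in the application; here it is represented by a caller-supplied `score_fn`, and the separator string, the dictionary layout, and the function names are assumptions made only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, object]


def fuse_candidates(
    results_by_modality: Dict[str, List[Tuple[str, float]]],
    modality_weights: Dict[str, float],
) -> List[Candidate]:
    """Merge per-modality results, scaling each similarity by its modality weight."""
    fused: List[Candidate] = []
    for modality, results in results_by_modality.items():
        weight = modality_weights.get(modality, 1.0)
        for content, similarity in results:
            fused.append(
                {"modality": modality, "content": content, "score": weight * similarity}
            )
    fused.sort(key=lambda c: c["score"], reverse=True)
    return fused


def rerank_top_k(
    question: str,
    fused: List[Candidate],
    score_fn: Callable[[str], float],
    k: int,
) -> List[Candidate]:
    """Score each question-candidate pair with `score_fn` and keep the top K."""
    scored = [
        (score_fn(f"{question} [SEP] {candidate['content']}"), candidate)
        for candidate in fused
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]
```

Separating fusion from re-ranking keeps the first stage cheap, since it only rescales existing similarities, while the more expensive pairwise scoring runs on the merged list alone.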

[0124] Specifically, the cross-modal re-ranking model performs the following when calculating the relevance score: calculating the matching degree between the expected answer modality type and the actual modality type of the candidate result, and setting it as the modality alignment score; when the user's natural language question contains numerical conditions, checking whether the candidate result contains precise numerical values that meet the numerical conditions, and setting the outcome as the numerical accuracy score; and fusing the modality alignment score and the numerical accuracy score into the relevance score by weighting.
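
The two auxiliary scores and their weighted fusion can be sketched as follows. Several choices here are assumptions rather than details from the application: the exact-match definition of modality alignment, the regular expression used to find numbers, the neutral score of 1.0 when the question has no numerical condition, the inclusion of a base model score, and the default weights.

```python
import re
from typing import Sequence, Tuple

# Matches unsigned integers and decimals; negative values are not handled in this sketch.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def modality_alignment(expected: str, actual: str) -> float:
    """Return 1.0 when the candidate has the expected answer modality, else 0.0."""
    return 1.0 if expected == actual else 0.0


def numerical_accuracy(required_values: Sequence[float], candidate_text: str) -> float:
    """Return the fraction of required numeric values that appear in the candidate."""
    if not required_values:
        return 1.0  # Assumption: no numerical condition means no penalty.
    found = {float(match) for match in NUMBER_PATTERN.findall(candidate_text)}
    hits = sum(1 for value in required_values if float(value) in found)
    return hits / len(required_values)


def relevance_score(
    base_score: float,
    alignment: float,
    numeric: float,
    weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),
) -> float:
    """Fuse the model score with the two auxiliary scores by fixed weights."""
    w_base, w_align, w_numeric = weights
    return w_base * base_score + w_align * alignment + w_numeric * numeric
```

For a question such as "What is the viscosity at 40 °C?", a candidate that contains the value 40 scores higher on numerical accuracy than one that only covers 60 °C, even if both describe the same curve.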

[0125] In one embodiment, the answer text generation module 17 is used to: convert the curve graph candidates in the reordered multimodal context into a list of discrete data points and trend description text; convert the flowchart candidates in the reordered multimodal context into natural language flowchart descriptions; convert the molecular structure candidates in the reordered multimodal context into a list of molecular attributes; assemble the transformed, reordered multimodal context and the user's natural language question into a prompt template, which is then input into the large language model; extract the answer text from the output of the large language model, and append the source metadata corresponding to each citation to the end of the answer text; and, when the user's natural language question requires a comparison of the performance of multiple products, retrieve the curve graph numerical sequences and structured table data corresponding to the multiple products from the multimodal vector database, convert them into a comparison table format, input the comparison table into the large language model, and generate answer text containing the comparison table.

[0126] The multimodal chemical documents include technical data sheets, safety data sheets, product manuals, and process flow diagrams for chemical products; the numerical sequence of curves includes a viscosity-temperature curve numerical sequence, a particle size distribution curve numerical sequence, and a stress-strain curve numerical sequence; and the molecular structure string includes the SMILES string.

[0127] Compared to existing technologies, this application first overcomes the limitation of existing technologies that can only process single text or simple images by using a unified loading mechanism, laying a data foundation for building a comprehensive chemical knowledge base and ensuring that subsequent analysis can be based on rich and authentic multi-source documents in the chemical field.

[0128] Furthermore, through coordinate axis detection, discrete point extraction, and function restoration, the continuous numerical sequence corresponding to the curve is restored with high precision, filling the gap in curve graph data in the chemical industry. Systematic errors are eliminated by clarifying the detection units and scale values and establishing a precise pixel-physical coordinate mapping. High-value statistical features are extracted to enhance the information density and practicality of the answer. Restoring the curve numerical sequence and associating it with its caption text provides a precise data foundation for comparison. Subsequently, by pre-constructing a standard chemical symbol library of typical symbols and combining it with a chemical symbol dataset, accurate identification of symbol instances is achieved, reducing the false detection rate. Directed connections between symbol instances are established through line orientation recognition. Finally, a standard graph data structure is constructed, providing a structured reasoning basis for answer generation.

[0129] Furthermore, the numerical sequence of the curve graph is input into a time series encoder to obtain the curve shape vector. Through contrastive learning, a cross-modal mapping between the curve shape and the natural language trend description is established, achieving deep semantic alignment between the curve shape and the text description. The flowchart topology and the molecular structure string are each encoded with a graph neural network, and the text fragment is encoded using a language model fine-tuned for the chemical engineering field, to obtain flowchart embedding vectors, molecular structure embedding vectors, and text embedding vectors respectively. This enables subsequent cross-modal retrieval to obtain a more accurate initial candidate set, reduces the pressure on the re-ranking model, and improves the accuracy of answer generation.

[0130] Furthermore, converting the user's natural language question into a representation in the same semantic space as the embedding vectors in the multimodal vector database makes cross-modal retrieval possible.

[0131] Furthermore, modal intent prediction is performed on the user's natural language questions, and initial similarity weights are assigned to different modalities. This achieves modality-aware retrieval enhancement, enabling candidate results of the desired modality to obtain a higher ranking position in the initial retrieval list, and achieving accurate identification of the user's modal intent.

[0132] Furthermore, candidate results from different modalities in the multimodal candidate results are merged according to a weighted fusion strategy to obtain a fused candidate list, providing a data foundation for subsequent re-ranking. The matching degree between the expected answer modality type and the actual modality type of each candidate result is calculated and set as the modality alignment score to ensure the alignment of each modality. When the user's natural language question contains numerical conditions, each candidate result is checked for precise numerical values that meet the numerical conditions, and the outcome is set as the numerical accuracy score, which is used as a ranking criterion to improve the accuracy of re-ranking. Relevance scores are obtained through weighted fusion and sorted from high to low, reducing the dependence on large-scale high-quality labeled data. The top K candidate results are selected as the multimodal context after re-ranking, providing more reliable evidence for generating answers.

[0133] Finally, the curve graph candidates, flowchart candidates, and molecular structure candidates in the reordered multimodal context are each transformed, converting the multimodal data into a form the language model can understand. Subsequently, the transformed reordered multimodal context is assembled with the user's natural language question into a prompt template and input into a large language model, the answer text is extracted, and the source metadata of each citation is attached, providing a traceable answer citation mechanism. When the user's natural language question requires a comparison of the performance of multiple products, the corresponding curve graph numerical sequences and structured table data are retrieved, converted into a comparison table format, and input into the large language model, and answer text containing the comparison table is generated, enhancing the usability of the answer and the user experience.

Claims

1. A method for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval, characterized in that the method comprises: loading a multimodal chemical engineering document; performing layout analysis and region segmentation on the multimodal chemical engineering document to obtain multimodal structured data, wherein the multimodal structured data includes text fragments, structured table data, curve graph numerical sequences, flowchart topology, and molecular structure strings; performing multimodal encoding on the multimodal structured data respectively to obtain the corresponding embedding vectors, and storing the embedding vectors in a multimodal vector database; receiving a user's natural language question and encoding the user's natural language question into a query embedding vector; performing cross-modal retrieval in the multimodal vector database to obtain a plurality of multimodal candidate results with the highest similarity to the query embedding vector, wherein the multimodal candidate results include text candidates, table candidates, curve graph candidates, flowchart candidates, and molecular structure candidates; performing fusion sorting on the plurality of multimodal candidate results to obtain a reordered multimodal context; and generating answer text based on the user's natural language question and the reordered multimodal context.

2. The method as described in claim 1, characterized in that performing layout analysis and region segmentation on the multimodal chemical engineering document to obtain the curve graph numerical sequences includes: detecting the coordinate axes, units, and scale values of the curve graph region; extracting discrete data points on the curve in the curve graph region; restoring the discrete data points to a functional form to generate a numerical sequence, which includes at least one of a temperature-viscosity correspondence and a stress-strain correspondence; and extracting the statistical features of the numerical sequence, wherein the statistical features include the maximum value, minimum value, inflection point, and trend type.

3. The method as described in claim 1, characterized in that performing layout analysis and region segmentation on the multimodal chemical engineering document to obtain the flowchart topology includes: loading a pre-built chemical standard symbol library, wherein the chemical standard symbol library includes pump symbol templates, reactor symbol templates, valve symbol templates, and heat exchanger symbol templates; performing symbol recognition on the flowchart region to obtain symbol instances and their corresponding types; identifying the direction of the connecting lines in the flowchart region and establishing directed connection relationships between the symbol instances; and organizing the types of the symbol instances and the directed connection relationships into a directed process flow graph, which serves as the flowchart topology.

4. The method as described in claim 1, characterized in that performing multimodal encoding on the multimodal structured data respectively to obtain the corresponding embedding vectors includes: extracting the caption text associated with the curve graph numerical sequence, and encoding the caption text with a language model fine-tuned for the chemical engineering field to obtain a trend text vector; inputting the curve graph numerical sequence into a time series encoder to obtain a curve shape vector, and performing contrastive learning between the curve shape vector and the trend text vector so that the two are aligned in the embedding space, thereby obtaining the curve graph embedding vector; performing graph neural network encoding on the flowchart topology to obtain the flowchart embedding vector; performing graph neural network encoding on the molecular structure string to obtain the molecular structure embedding vector; and encoding the text fragment with a language model fine-tuned for the chemical engineering field to obtain the text embedding vector.

5. The method as described in claim 1, characterized in that performing cross-modal retrieval in the multimodal vector database to obtain a plurality of multimodal candidate results with the highest similarity to the query embedding vector includes: performing modal intent prediction on the user's natural language question to obtain the expected answer modality type, wherein the expected answer modality type includes text modality, table modality, curve graph modality, flowchart modality, and molecular structure modality; and, based on the expected answer modality type, assigning different initial similarity weights to the retrieval results of different modalities.

6. The method as described in claim 5, characterized in that performing fusion sorting on the plurality of multimodal candidate results to obtain the reordered multimodal context includes: merging the candidate results from different modalities among the plurality of multimodal candidate results according to a weighted fusion strategy to obtain a fused candidate list, wherein the weighted fusion strategy uses the initial similarity weights; concatenating the user's natural language question with each candidate result in the fused candidate list and inputting the result into a cross-modal re-ranking model to obtain the relevance score of each candidate result; and sorting the candidate results by relevance score from highest to lowest and selecting the top K candidate results as the reordered multimodal context.

7. The method as described in claim 6, characterized in that the cross-modal re-ranking model performs the following when calculating the relevance score: calculating the matching degree between the expected answer modality type and the actual modality type of the candidate result, and setting it as the modality alignment score; when the user's natural language question contains numerical conditions, checking whether the candidate result contains precise numerical values that meet the numerical conditions, and setting the outcome as the numerical accuracy score; and fusing the modality alignment score and the numerical accuracy score into the relevance score by weighting.

8. The method as described in claim 1, characterized in that generating answer text based on the user's natural language question and the reordered multimodal context includes: converting the curve graph candidates in the reordered multimodal context into a list of discrete data points and trend description text; converting the flowchart candidates in the reordered multimodal context into natural language flowchart descriptions; converting the molecular structure candidates in the reordered multimodal context into a list of molecular attributes; assembling the transformed, reordered multimodal context and the user's natural language question into a prompt template, which is then input into a large language model; extracting the answer text from the output of the large language model, and appending the source metadata corresponding to each citation to the end of the answer text; and, when the user's natural language question requires a comparison of the performance of multiple products, retrieving the curve graph numerical sequences and structured table data corresponding to the multiple products from the multimodal vector database, converting them into a comparison table format, inputting the comparison table into the large language model, and generating answer text containing the comparison table.

9. The method as described in claim 1, characterized in that the multimodal chemical engineering documents include technical data sheets, safety data sheets, product manuals, and process flow diagrams for chemical products; the curve graph numerical sequences include viscosity-temperature curve sequences, particle size distribution curve sequences, and stress-strain curve sequences; and the molecular structure strings include SMILES strings.

10. A system for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval, characterized in that the system is used to implement the method for accurately generating intelligent customer service questions and answers based on multi-source knowledge base retrieval as described in any one of claims 1-9, the system comprising: a document acquisition module, used to load a multimodal chemical engineering document; a multimodal data acquisition module, used to perform layout analysis and region segmentation on the multimodal chemical engineering document to obtain multimodal structured data, wherein the multimodal structured data includes text fragments, structured table data, curve graph numerical sequences, flowchart topology, and molecular structure strings; an embedding vector storage module, used to perform multimodal encoding on the multimodal structured data respectively to obtain the corresponding embedding vectors, and to store the embedding vectors in a multimodal vector database; an embedding vector query module, used to receive a user's natural language question and encode the user's natural language question into a query embedding vector; a candidate result acquisition module, used to perform cross-modal retrieval in the multimodal vector database to obtain a plurality of multimodal candidate results with the highest similarity to the query embedding vector, wherein the multimodal candidate results include text candidates, table candidates, curve graph candidates, flowchart candidates, and molecular structure candidates; a context acquisition module, used to perform fusion sorting on the plurality of multimodal candidate results to obtain the reordered multimodal context; and an answer text generation module, used to generate answer text based on the user's natural language question and the reordered multimodal context.