A customer service reply method and system of a multi-modal fusion architecture and a storage medium

The customer service system, built on a multimodal fusion architecture, uses a spatial-semantic dual-constraint graph network and decision compliance analysis to address the fragmented and poorly standardized recognition of multimodal data in government intelligent customer service. This enables precise positioning and structured transformation of government data, thereby improving the accuracy and security of government services.

CN121388977B · Active · Publication Date: 2026-06-26 · STONE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
STONE TECH CO LTD
Filing Date
2025-10-11
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, intelligent customer service systems cannot accurately extract key information such as policy timeliness, scope of application, and responsible departments in the government sector. This results in government users being unable to quickly locate core clauses when uploading or citing policy documents, thus failing to meet the needs of precise government services.

Method used

A multimodal fusion architecture is adopted, which processes multimodal data through a spatial-semantic dual-constraint graph network (SS-DCGN) to construct a data spatial topology. Combined with the node association mechanism of the graph network, it can achieve accurate positioning and structured transformation of multimodal data in government and enterprise scenarios, generate structured policy elements, and combine decision compliance analysis and dynamic weight adjustment models to ensure the accuracy and transparency of decision-making.

Benefits of technology

It improves the accuracy of multimodal data identification, enhances the transparency and credibility of government services, ensures the compliance of decision-making, and complies with data security requirements, thus achieving efficient and secure government services.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of artificial intelligence, and in particular to a customer service reply method and system of a multi-modal fusion architecture and a storage medium. The method comprises the following steps: obtaining initial input data of a user and preprocessing the initial input data to generate a standard data source; performing decision compliance analysis on the standard data source to generate a decision judgment result; and generating a corresponding decision scheme according to the decision judgment result. Through a spatial-semantic dual-constraint graph network, the application addresses the fragmented recognition and low degree of standardization of multi-modal data in traditional intelligent customer service, and overcomes the insufficient adaptation of general recognition technology to government document formats. In the semantic constraint dimension, relying on the node association mechanism of the graph network, spatial features are deeply fused with a professional semantic dictionary for government and enterprises, the conversion of multi-modal data into structured policy elements is completed automatically, and the recognition accuracy is improved by more than 40% compared with traditional NLP technology.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, specifically to a customer service response method, system, and storage medium with a multimodal fusion architecture.

Background Technology

[0002] With the deepening of digital government construction and enterprise digital transformation, AI customer service has become a core supporting technology for government and enterprises to optimize service efficiency and reduce operating costs. In the government sector, the construction of intelligent customer service has become a core task of digital government construction, and by 2025, intelligent customer service systems will account for over 90% of China's government AI market. In the enterprise sector, the surge in inquiries brought about by business expansion and customers' high demands for service response speed are driving the upgrade of traditional customer service models to intelligent ones, making AI customer service a key tool for enterprises to improve customer experience.

[0003] Currently, the technological implementation of AI-powered customer service primarily relies on core technologies such as natural language processing (NLP), machine learning, and large-scale models. NLP technology enables human-computer language interaction through word segmentation and semantic analysis. Large-scale models, with their advantages of hundreds of millions to trillions of parameters and large-scale data training, have achieved breakthroughs in areas such as language understanding and knowledge reasoning, making question answering in complex scenarios possible. Meanwhile, knowledge graph technology helps integrate structured and unstructured data from government and enterprise sectors to build specialized knowledge bases, providing data support for accurate responses.

[0004] Chinese invention patent CN119577459B discloses a method, device, and storage medium for training a multimodal large-scale intelligent customer service model. The method includes: collecting multimodal data; performing data deduplication on the multimodal data; preprocessing the deduplicated data; performing data augmentation on the preprocessed data; constructing an initial multimodal large-scale model based on a Transformer architecture; training the initial multimodal large-scale model using augmented data to obtain a target multimodal large-scale model; obtaining user query information; extracting weighted text from the user query information; inputting the weighted text into the target multimodal large-scale model to obtain related information; preprocessing the related information to generate feature vectors; obtaining the similarity value between the feature vectors and the weighted text using a similarity algorithm; arranging the similarity values in order to obtain a sorting result; extracting the sorting result; determining target related information based on the target similarity value; and displaying the target related information.

[0005] The aforementioned patent lacks the ability to recognize the formats of official documents, making it impossible to accurately extract key information such as policy validity, scope of application, and responsible departments. The patented solution does not include corresponding format parsing modules or text structuring algorithms, so government users cannot quickly locate core clauses when uploading or citing policy documents, and the demand for "precise government services" is not met.

Summary of the Invention

[0006] To address the aforementioned technical problems in the existing technology, this invention provides a customer service response method, system, and storage medium based on a multimodal fusion architecture.

[0007] To achieve the above objectives, the technical solution of the present invention is as follows:

[0008] In a first aspect, the present invention provides a customer service response method with a multimodal fusion architecture, comprising the following steps:

[0009] S1: Obtain the user's initial input data, preprocess the initial input data, and generate a standard data source;

[0010] S2: Perform decision compliance analysis on the standard data source to generate decision judgment results;

[0011] S3: Based on the decision judgment result, generate the corresponding decision scheme.

[0012] Further, the steps for preprocessing the initial input data to generate a standard data source are as follows:

[0013] Determine the data type of the initial input data, and select the corresponding processing model for standardized data processing based on the different data types.

[0014] Furthermore, the steps for performing decision compliance analysis on the standard data source and generating decision judgment results are as follows:

[0015] The standard data source is divided into demand data and policy data;

[0016] The demand data is correlated with the policy data to determine whether the demand data conforms to the policy data, and the decision judgment result is output.

[0017] Furthermore, based on the decision judgment result, the steps for generating the corresponding decision scheme are as follows:

[0018] If the decision judgment result is that the user's needs are met, the demand data is optimized based on the policy data to generate the optimal decision scheme;

[0019] If the decision judgment result is that the user's needs are not met, the demand data is modified based on the policy data to generate a compliant decision scheme.

[0020] Furthermore, based on different data types, the steps for selecting the corresponding processing model for standardized data processing are as follows:

[0021] If the data type is an image, then the YOLOv7_doc_parser function is called for processing;

[0022] If the data type is audio, then the Conformer_ASR function is called for processing;

[0023] If the data type is text, then the BERT_entity_extractor function is called for processing.

[0024] Furthermore, when associating the demand data with the policy data, a fusion association model is used. The construction steps of the fusion association model are as follows:

[0025] Construct a graph network model G:

[0026]-[0028] [formulas not reproduced]

[0029] Here V represents the nodes of the graph and E represents the edges of the graph. The node set consists of the n-th text-type node, the m-th image-type node, and the k-th audio-type node. The formulas further use the spatial coordinates of node p, the spatial coordinates of node q, the semantic information of node p, and the semantic information of node q;

[0030] Construct a dynamic weight adjustment model: if the policy data is a tax policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.6:0.3:0.1;

[0031] If the policy data is a subsidy policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.3:0.5:0.2;

[0032] If the policy data is any other policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.4:0.4:0.2;

[0033] The graph network model and the dynamic weight adjustment model are integrated to generate the fusion association model:

[0034] [formula not reproduced]

[0035] wherein one term denotes the position mask matrix of the image.

[0036] Furthermore, the method for determining the decision result uses a conflict detection model, and the construction steps of the conflict detection model are as follows:

[0037] Conflict detection model C:

[0038] [formula not reproduced]

[0039] where the formula uses a version validity indicator, Wi represents the weight, R represents the user demand, Pi represents the i-th policy data item, and n represents the total number of policy data items.

[0040] If the policy data is a central policy, the value of Wi is 1.0; if the policy data is a non-central policy, the value of Wi is 0.6.

[0041] Furthermore, it also includes step S4: sending the decision plan to the policy application terminal, uploading relevant materials according to the process rules of the policy application terminal, until the policy application is completed.

[0042] In a second aspect, the present invention provides a customer service response system with a multimodal fusion architecture, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above-described customer service response method.

[0043] In a third aspect, the present invention provides a computer-readable storage medium storing a computer program, characterized in that the computer program is executed by a processor to implement the steps of the above-described customer service response method.

[0044] Compared with the prior art, the present invention has the following beneficial effects:

[0045] This invention provides a multimodal fusion architecture-based customer service response method that effectively addresses the technical shortcomings of traditional intelligent customer service in recognizing fragmented and low-standardized multimodal data through a spatial-semantic dual-constraint graph network (SS-DCGN). In the spatial constraint dimension, by constructing a data spatial topology, spatial features are extracted from frequently occurring multimodal data in government and enterprise scenarios, such as image-based policy documents (e.g., scanned copies of official documents, screenshots of business forms), recorded voice consultations, and textual requests. This accurately locates key areas such as document titles, document numbers, and clause paragraphs, overcoming the problem of insufficient adaptation to government formats in general recognition technologies. In the semantic constraint dimension, relying on the node association mechanism of the graph network, spatial features are deeply integrated with a professional semantic dictionary for government and enterprises, automatically completing the transformation of multimodal data into structured policy elements (e.g., applicable subjects, policy validity, and core requirements). The recognition accuracy is improved by more than 40% compared to traditional NLP technologies.

[0046] Furthermore, during the decision-making guidance process, the system generates suggestions based entirely on structured policy data and traceable comparison logic, ensuring that every piece of decision-making guidance has a clear policy basis, thereby enhancing the transparency and credibility of government services. In addition, the standardized data processing workflow can directly connect to a data grading and de-identification engine, and is compatible with the GB/T 39725-2020 requirements for government data security, ensuring data security while improving service efficiency, thus achieving a unity of technological and compliance value.

Attached Figure Description

[0047] Figure 1 is a schematic diagram of the modules of the present invention.

Detailed Implementation

[0048] The technical solution of the present invention will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are not all embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.

[0049] It should be noted that, unless otherwise specifically stated, the relative arrangement and numerical expressions of the components and steps described in these embodiments should not be construed as limiting the scope of the invention.

[0050] The following description of exemplary embodiments is merely illustrative and is not intended to limit the invention or its application or use in any way. Techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail herein, but where applicable, such techniques, methods, and apparatus should be considered part of this specification.

[0051] As shown in Figure 1, the present invention provides a customer service response system with a multimodal fusion architecture, including a user interaction layer, a multimodal fusion parsing engine, a dynamic knowledge hub, a compliance decision engine, a policy graph database, a cross-domain collaborative executor, a transaction flow knowledge base, and a government system interface group.

[0052] The user interaction layer is used to acquire the user's initial input data and transmit it to the multi-modal fusion parsing engine. In the multi-modal fusion parsing engine, when the input data type is detected as IMAGE (image, such as scanned document, screenshot, etc.), the YOLOv7_doc_parser function is called for processing. This function, based on the YOLOv7 object detection algorithm, is specifically designed for document parsing and outputs JSON data containing table coordinates and the location of official seals, enabling the localization and extraction of key visual elements in the image.

[0053] When the input type is AUDIO (audio, such as voice consultation), the Conformer_ASR function is called. This function uses the Conformer model for speech recognition, specifically specifying Cantonese as the dialect, and converts Cantonese speech into standard terminology text, solving the problems of dialect recognition and standardization.

[0054] For other types of input (text, such as user-entered written requests or policy document text), the BERT_entity_extractor function is called. This function extracts entities based on the BERT model, retrieving key policy-related entities (such as policy name, applicable conditions, and responsible departments). The operational logic for multimodal input standardization is as follows:

[0055] def multimodal_normalization(input):
[0056]     if input.type == "IMAGE":
[0057]         return YOLOv7_doc_parser(input)  # Output JSON {"Table coordinates": (x1, y1, x2, y2), "Official seal location": (x, y)}
[0058]     elif input.type == "AUDIO":
[0059]         return Conformer_ASR(input, dialect="cantonese")  # Convert Cantonese to standard terminology
[0060]     else:
[0061]         return BERT_entity_extractor(input)  # Extract policy entities
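The dispatch logic above can be sketched as runnable Python. The specification names YOLOv7_doc_parser, Conformer_ASR, and BERT_entity_extractor but does not give their implementations, so the three parsers below are stand-in stubs, and the `ModalInput` container and the shape of the returned dictionaries are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ModalInput:
    """Hypothetical container for one piece of user input."""
    type: str     # "IMAGE", "AUDIO", or anything else (treated as text)
    payload: Any  # raw bytes, a file path, or a string


# Stub parsers: placeholders for the models named in the specification.
def yolov7_doc_parser(item: ModalInput) -> Dict[str, Any]:
    return {"modality": "image", "table_coordinates": None, "seal_location": None}


def conformer_asr(item: ModalInput, dialect: str = "cantonese") -> Dict[str, Any]:
    return {"modality": "audio", "dialect": dialect, "text": ""}


def bert_entity_extractor(item: ModalInput) -> Dict[str, Any]:
    return {"modality": "text", "entities": []}


def multimodal_normalization(item: ModalInput) -> Dict[str, Any]:
    """Route the input to the parser that matches its declared type."""
    if item.type == "IMAGE":
        return yolov7_doc_parser(item)
    if item.type == "AUDIO":
        return conformer_asr(item, dialect="cantonese")
    return bert_entity_extractor(item)  # default branch: text
```

Keeping the routing in one function means a new modality only needs one extra branch and one parser, while every parser returns the same kind of structured record.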

[0062] Cross-modal correlation modeling, constructing a graph network model G:

[0063]-[0065] [formulas not reproduced]

[0066] Here V represents the nodes of the graph and E represents the edges of the graph. The node set consists of the n-th text-type node, the m-th image-type node, and the k-th audio-type node. The formulas further use the spatial coordinates of node p, the spatial coordinates of node q, the semantic information of node p, and the semantic information of node q;
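As an illustration of how spatial coordinates and semantic information could jointly decide whether two nodes are connected, the following Python sketch builds an edge list. The edge formula itself is not legible in this text, so the rule used here (Euclidean distance at most `d_max` AND cosine similarity at least `s_min`) is purely an assumption, as are the `Node` fields and both threshold names.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass
class Node:
    name: str
    modality: str                # "text", "image", or "audio"
    coords: Tuple[float, float]  # spatial coordinates of the node
    semantics: Sequence[float]   # semantic embedding (assumed representation)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_edges(nodes: List[Node], d_max: float, s_min: float) -> List[Tuple[str, str]]:
    """Connect p and q only if BOTH constraints hold (assumed rule)."""
    edges = []
    for p, q in combinations(nodes, 2):
        close = math.dist(p.coords, q.coords) <= d_max
        similar = cosine(p.semantics, q.semantics) >= s_min
        if close and similar:
            edges.append((p.name, q.name))
    return edges
```

Requiring both conditions is one way to read "dual constraint": a node pair that is semantically similar but far apart on the page, or adjacent but unrelated in meaning, is left unconnected.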

[0067] Construct a dynamic weight adjustment model: if the policy data is a tax policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.6:0.3:0.1;

[0068] If the policy data is a subsidy policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.3:0.5:0.2;

[0069] If the policy data is any other policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.4:0.4:0.2. The operating logic of the dynamic weight adjustment algorithm is as follows:

[0070] function [α, β, γ] = adjust_weights(policy_type)
[0071]     switch policy_type
[0072]         case "TAX_POLICY"
[0073]             α = 0.6; β = 0.3; γ = 0.1;  // Text-dominated
[0074]         case "SUBSIDY_APPLICATION"
[0075]             α = 0.3; β = 0.5; γ = 0.2;  // Image-dominated
[0076]         otherwise
[0077]             α = 0.4; β = 0.4; γ = 0.2;  // Balanced mode
[0078]     end
[0079] end
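The weight table can also be written as a small Python lookup. This is a minimal sketch: the field name `gamma` for the audio weight is assumed (the symbol is not legible in the text), and `fuse_scores`, which applies the weights as a weighted sum of per-modality scores, is a hypothetical use of the weights rather than the fusion formula of the specification.

```python
from typing import NamedTuple


class ModalityWeights(NamedTuple):
    alpha: float  # text weight
    beta: float   # image weight
    gamma: float  # audio weight (symbol name assumed)


_WEIGHT_TABLE = {
    "TAX_POLICY": ModalityWeights(0.6, 0.3, 0.1),           # text-dominated
    "SUBSIDY_APPLICATION": ModalityWeights(0.3, 0.5, 0.2),  # image-dominated
}
_BALANCED = ModalityWeights(0.4, 0.4, 0.2)                  # all other policies


def adjust_weights(policy_type: str) -> ModalityWeights:
    """Return the modality weights for a policy type, defaulting to balanced."""
    return _WEIGHT_TABLE.get(policy_type, _BALANCED)


def fuse_scores(text: float, image: float, audio: float, policy_type: str) -> float:
    """Weighted sum of per-modality scores (an assumed use of the weights)."""
    w = adjust_weights(policy_type)
    return w.alpha * text + w.beta * image + w.gamma * audio
```

Each row sums to 1, so a fused score stays on the same scale as the per-modality scores it combines.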

[0080] The graph network model and the dynamic weight adjustment model are integrated to generate the fusion association model:

[0081] [formula not reproduced]

[0082] wherein one term denotes the position mask matrix of the image. Finally, the multimodal fusion parsing engine divides the initial input data into demand data and policy data, and transmits both to the dynamic knowledge hub.

[0083] The dynamic knowledge hub transmits demand data to the compliance decision engine, and stores policy data as a knowledge graph in the policy graph database for easy retrieval of similar policies in the future. An example of a policy graph is shown below:

[0084] {
[0085]   "node": {
[0086]     "id": "POL-2024-086",
[0087]     "Type": "Central Regulations",
[0088]     "Terms": "The additional deduction ratio for R&D expenses of technology-based SMEs is 150%", "Effective Date": "2024-01-01",
[0089]     "Conflict Relationship": ["CONFLICT:LOC-2023-772"]
[0090]   }
[0091] }
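To show how such a node might be consumed, the sketch below loads a well-formed version of the example with Python's standard `json` module and extracts the identifiers of conflicting policies. The helper name `conflicting_ids` and the assumption that every entry has the form "CONFLICT:<policy id>" are illustrative only.

```python
import json

POLICY_NODE_JSON = """
{
  "node": {
    "id": "POL-2024-086",
    "Type": "Central Regulations",
    "Terms": "The additional deduction ratio for R&D expenses of technology-based SMEs is 150%",
    "Effective Date": "2024-01-01",
    "Conflict Relationship": ["CONFLICT:LOC-2023-772"]
  }
}
"""


def conflicting_ids(node: dict) -> list:
    """Return the policy ids listed under "Conflict Relationship".

    Assumes each entry has the form "CONFLICT:<policy id>", as in the example.
    """
    prefix = "CONFLICT:"
    return [
        ref[len(prefix):]
        for ref in node.get("Conflict Relationship", [])
        if ref.startswith(prefix)
    ]


node = json.loads(POLICY_NODE_JSON)["node"]
print(node["id"], "conflicts with", conflicting_ids(node))
```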

[0092] The dynamic knowledge hub also includes a dynamic update mechanism, which uses a web crawler engine to monitor the latest policy changes on government websites in real time and updates the content of the policy graph and the transaction flow knowledge base. The operating logic of the dynamic update mechanism is as follows:

[0093] sequenceDiagram
[0094]     Government website ->> Web crawler engine: Monitor policy updates
[0095]     Web crawler engine ->> BERT-PolicyDiff: New policy text
[0096]     BERT-PolicyDiff ->> Knowledge Hub: Change markers (ADD / MODIFY / DELETE)
[0097]     Knowledge Hub ->> Transaction Flow Knowledge Base: Trigger related transaction updates
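The last two arrows of the diagram, applying change markers and then notifying the transaction flow knowledge base, can be sketched as follows. The in-memory dictionary standing in for the policy graph, the `Change` tuple layout, and the function name are assumptions; only the three marker names come from the diagram.

```python
from typing import Dict, List, Optional, Tuple

# (marker, policy id, new text or None); the marker names follow the diagram above.
Change = Tuple[str, str, Optional[str]]


def apply_changes(graph: Dict[str, str], changes: List[Change]) -> List[str]:
    """Apply change markers in place.

    Returns the policy ids that were touched, in order, so that the caller can
    trigger updates of the related transaction flows.
    """
    touched = []
    for marker, policy_id, text in changes:
        if marker in ("ADD", "MODIFY"):
            if text is None:
                raise ValueError(f"{marker} requires policy text")
            graph[policy_id] = text
        elif marker == "DELETE":
            graph.pop(policy_id, None)  # deleting an unknown id is a no-op
        else:
            raise ValueError(f"unknown change marker: {marker}")
        touched.append(policy_id)
    return touched
```

Returning the touched ids rather than updating the flows directly keeps the policy graph and the transaction flow knowledge base decoupled.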

[0098] The transaction flow knowledge base stores the transaction flows for enterprise establishment, used for querying relevant processes for policy application. An example of a transaction flow is as follows:

[0099] Step 1:
[0100]     Department: Administration for Industry and Commerce;
[0101]     API: biz/register;
[0102]     Input: ["Company Name", "Registered Capital"]; Output: ["Unified Social Credit Code"];
[0103]     Timeout: 120s.
[0104] Step 2:
[0105]     Department: Public Security Bureau;
[0106]     API: seal/apply;
[0107]     Prerequisite: $step1.output.creditcode != null.
[0108] Exception handling:
[0109]     Error code: ERR_TAX_TIMEOUT;
[0110]     Retry strategy: [10-second interval, 3 retries];
[0111]     Alternative solution: Switch to human operators.
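The exception-handling entry (3 retries at 10-second intervals, then a switch to human operators) corresponds to a standard retry-with-fallback pattern. The sketch below is one possible reading, assuming that "3 retries" means one initial attempt plus three further attempts and that a timeout surfaces as Python's built-in `TimeoutError`; the function and parameter names are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 3,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try once, retry up to `retries` times on TimeoutError, then fall back."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt < retries:
                sleep(interval_s)  # wait before the next attempt
    return fallback()              # e.g. hand the case to a human operator
```

The `sleep` parameter is injectable so the waiting behaviour can be exercised in tests without actually pausing for 30 seconds.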

[0112] The compliance decision engine performs compliance assessments by comparing demand data with the corresponding policy graph, generating a decision plan. It then queries the transaction flow corresponding to the decision plan and transmits the data to the cross-domain collaborative executor.

[0113] Compliance assessment uses conflict detection model C:

[0114] [formula not reproduced]

[0115] where the formula uses a version validity indicator (currently valid = 1, obsolete = 0), Wi represents the weight, R represents the user requirement, Pi represents the i-th policy data item, and n represents the total number of policy data items.

[0116] If the policy data is a central policy, Wi is 1.0; if the policy data is not a central policy, Wi is 0.6. The logic for computing Wi is as follows:

[0117] def weight_calc(policy_level):
[0118]     return 1.0 if policy_level == "CENTRAL" else 0.6
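To make the role of the weight and the validity indicator concrete, the sketch below combines them into a single conflict index. The formula for C is not legible in this text, so the aggregation used here (the weighted share of currently valid policies that the requirement conflicts with) is an assumption chosen only for illustration; the `Policy` fields and the `conflicts` predicate are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Policy:
    policy_id: str
    level: str   # "CENTRAL" or anything else
    valid: bool  # version validity: True = currently valid, False = obsolete


def weight_calc(policy_level: str) -> float:
    """Weight Wi: 1.0 for central policies, 0.6 otherwise (from the specification)."""
    return 1.0 if policy_level == "CENTRAL" else 0.6


def conflict_index(
    requirement: Any,
    policies: Iterable[Policy],
    conflicts: Callable[[Any, Policy], bool],
) -> float:
    """ASSUMED aggregation: weighted share of valid policies in conflict with R."""
    num = den = 0.0
    for p in policies:
        if not p.valid:  # obsolete versions contribute nothing
            continue
        w = weight_calc(p.level)
        den += w
        if conflicts(requirement, p):
            num += w
    return num / den if den else 0.0
```

Under this assumed form the index lies between 0 and 1, and a conflict with a central policy moves it further than a conflict with a non-central one.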

[0119] The logic for generating decision schemes is as follows:

[0120] generate_plan(Company) :-
[0121]     industry(Company, Industry),
[0122]     region(Company, Region),
[0123]     findall(Step, eligible_step(Industry, Region, Step), Steps),
[0124]     optimize_order(Steps).
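The same selection-then-ordering logic can be sketched in Python. The rule base below is hypothetical, and ordering by prerequisites with the standard-library `graphlib.TopologicalSorter` (Python 3.9+) is an assumed interpretation of the ordering step, which the specification does not define.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Hypothetical rule base: step -> (eligible industries, eligible regions, prerequisites)
Rules = Dict[str, Tuple[Set[str], Set[str], Set[str]]]

EXAMPLE_RULES: Rules = {
    "business_registration": ({"tech"}, {"GD"}, set()),
    "seal_application": ({"tech"}, {"GD"}, {"business_registration"}),
    "tax_registration": ({"tech"}, {"GD"}, {"seal_application"}),
    "mining_permit": ({"mining"}, {"GD"}, set()),
}


def generate_plan(industry: str, region: str, rules: Rules) -> List[str]:
    """Select the eligible steps, then order them so prerequisites come first."""
    eligible = {
        step
        for step, (industries, regions, _) in rules.items()
        if industry in industries and region in regions
    }
    # Keep only prerequisites that are themselves part of the plan.
    graph = {step: rules[step][2] & eligible for step in sorted(eligible)}
    return list(TopologicalSorter(graph).static_order())


print(generate_plan("tech", "GD", EXAMPLE_RULES))
```

A topological order guarantees that a step such as the seal application is never scheduled before the registration it depends on.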

[0125] Examples of generated decision-making schemes are shown in Table 1:

[0126] Table 1. Decision-making schemes

[0127]

[0128] The cross-domain collaborative executor executes the transaction flow corresponding to the decision scheme, automatically interfaces with the government system interface group, and submits policy applications. The policy application process uses the process model PN:

[0129]-[0131] [formulas not reproduced]

[0132] where F represents the set of arcs, Q represents the states or conditions in the system, T represents the events or operations that occur in the system, and the remaining symbol represents the initial marking.

[0133] The operating logic of the process model PN is as follows:

[0134] while (!is_final_marking(M)) {
[0135]     enabled_transitions = get_enabled(T, M);
[0136]     fire_transition(select(enabled_transitions));
[0137]     if (api_timeout > threshold) { trigger_rollback();
[0138]         activate_backup_path(); }
[0139] }
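The loop above follows the usual execution rule of a Petri net: fire enabled transitions until a final marking is reached. The Python sketch below implements that rule under simplifying assumptions: every arc has weight 1, the first enabled transition is always selected, the final marking is "a token in a designated place", and the timeout, rollback, and backup-path handling is omitted. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Marking = Dict[str, int]
# transition name -> (input places, output places); arc weights of 1 are assumed
Transitions = Dict[str, Tuple[List[str], List[str]]]


def enabled(transitions: Transitions, marking: Marking) -> List[str]:
    """A transition is enabled when every input place holds at least one token."""
    return [
        t for t, (inputs, _) in transitions.items()
        if all(marking.get(p, 0) > 0 for p in inputs)
    ]


def fire(transitions: Transitions, marking: Marking, t: str) -> Marking:
    """Consume one token per input place and produce one per output place."""
    inputs, outputs = transitions[t]
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] = new.get(p, 0) + 1
    return new


def run(transitions: Transitions, marking: Marking, final_place: str,
        max_steps: int = 100) -> Tuple[Marking, List[str]]:
    """Fire transitions until `final_place` holds a token; return marking and trace."""
    trace: List[str] = []
    for _ in range(max_steps):
        if marking.get(final_place, 0) > 0:
            return marking, trace
        ready = enabled(transitions, marking)
        if not ready:
            raise RuntimeError("deadlock: no transition is enabled")
        marking = fire(transitions, marking, ready[0])
        trace.append(ready[0])
    raise RuntimeError("step limit reached before the final marking")
```

Note that the loop continues while the final marking has not been reached, which is the termination condition the process model needs.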

[0140] The cross-domain collaborative executor also includes an exception handling mechanism, which uses error codes to indicate errors in the policy application process and implements corresponding handling strategies. An example of the error code mapping is shown in Table 2.

[0141] Table 2 Example of Error Code Mapping

[0142]

[0143] Example 1: Establishment of a foreign-invested high-tech enterprise

[0144] Step 1: Multimodal material analysis.

[0145] (1) Input:

[0146] Scanned copy: Page 12 of the "Negative List for Foreign Investment Access (2024 Edition)" (including the table of prohibited industries);

[0147] Voiceover: "Our company plans to invest in quantum computing research and development. Does this fall under the category of restricted investments?"

[0148] (2) Analysis process:

[0149] Image analysis: YOLOv7 locates the "Quantum Information Technology → Restricted Classes" cell in the table;

[0150] Speech recognition: Conformer recognizes "quantum computing" as a subclass of "quantum information technology";

[0151] Semantic fusion: Constructing related edges (voice node → table cell, weight = 0.93).

[0152] Step 2: Compliance decision engine response.

[0153] (1) Conflict detection:

[0154] Enterprises investing in "quantum computing" are categorized under "restricted" in the negative list.

[0155] An orange alert has been triggered (conflict index C=0.72).

[0156] (2) Generated scheme:

[0157] "Action": "Adjust the investment ratio to <50%",
[0158] "steps": [
[0159]     {"Department": "Ministry of Commerce", "Operation": "Foreign Investment Filing (49% Shareholding)"},
[0160]     {"Department": "Science and Technology Bureau", "Operation": "Special R&D Qualification Approval"}
[0161] ],
[0162] "Risk note": "A Technical Security Commitment Letter must be submitted."

[0163] Step 3: Cross-domain collaborative execution.

[0164] (1) API scheduling process:

[0165] System ->> Ministry of Commerce API: Submit foreign investment filing;
[0166] Ministry of Commerce API -->> System: Returns FDI2024-08756;
[0167] System ->> State Administration of Foreign Exchange: Pre-registration of capital account (filing number);
[0168] State Administration of Foreign Exchange -->> System: Generates account number CNY88*******;
[0169] System ->> Corporate Email: Send the materials list (including the English company bylaws template).

[0170] Exception handling:

[0171] Foreign exchange registration timeout triggers automatic retry (3 times with 10-second intervals).

[0172] Example 2: Application for Environmental Protection Technology Upgrade Subsidies in Manufacturing

[0173] Step 1: Intelligent verification of multimodal materials.

[0174] (1) Input:

[0175] Environmental impact assessment report (PDF, including pollutant emission data sheet);

[0176] Equipment list Excel (Air compressor model: SA-2300).

[0177] (2) Analysis process:

[0178] PDF table extraction: Identify an SO2 emission value of 35 mg/m³ (lower than the national standard of 50 mg/m³);

[0179] Equipment compliance verification: The knowledge center searched the "Catalogue of Energy-Saving Technologies and Equipment" and confirmed that SA-2300 is a nationally promoted model (energy efficiency level 1).

[0180] Step 2: Dynamic policy matching.

[0181] (1) Real-time change detection:

[0182] The web crawler captured a new regulation on technological upgrading subsidies in a certain province: "Enterprises with SO2 < 40 mg/m³ will receive a 10% increase in subsidies."

[0183] Automatically update transaction flow: Add an application node for "Environmental Compliance Reward".

[0184] Generate the optimal path:

[0185] Original path: Technological upgrade filing → Equipment verification → Subsidy application;

[0186] New pathway: Technological upgrade filing → Environmental compliance certification → Equipment verification → Subsidy application (+10%).

[0187] Step 3: Risk control and implementation.

[0188] (1) Compliance verification:

[0189] The investment in equipment (5 million) does not exceed 25% of the company's net assets (20 million), so it meets the safety threshold.
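The figures in this check sit exactly on the boundary: 5 million divided by 20 million is 25%. Whether the check passes therefore depends on whether the threshold is inclusive. The sketch below assumes that it is inclusive, and uses exact rational arithmetic so that the boundary case is not affected by floating-point rounding; the function name and the default limit parameter are hypothetical.

```python
from fractions import Fraction


def within_safety_threshold(
    investment: int,
    net_assets: int,
    limit: Fraction = Fraction(1, 4),
) -> bool:
    """True when investment / net_assets does not exceed the limit (inclusive, assumed)."""
    return Fraction(investment, net_assets) <= limit


print(Fraction(5_000_000, 20_000_000))                  # the ratio in lowest terms
print(within_safety_threshold(5_000_000, 20_000_000))   # boundary case from the example
```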

[0190] (2) Cross-system execution:

[0191] Submit the equipment list via the Ministry of Industry and Information Technology's API (response within seconds);

[0192] Synchronized data from the Ecological and Environmental Protection Bureau: Environmental impact assessment report numbers are automatically entered into the subsidy application form.

[0193] The above specific embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to examples, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A customer service response method with a multimodal fusion architecture, characterized in that it includes the following steps: S1: obtain the user's initial input data, preprocess the initial input data, and generate a standard data source; S2: perform decision compliance analysis on the standard data source to generate a decision judgment result, specifically: the standard data source is divided into demand data and policy data; the demand data is correlated with the policy data to determine whether the demand data conforms to the policy data, and the decision judgment result is output; S3: based on the decision judgment result, generate the corresponding decision scheme, specifically: if the decision judgment result is that the user's needs are met, the demand data is optimized based on the policy data to generate the optimal decision scheme; if the decision judgment result is that the user's needs are not met, the demand data is modified according to the policy data to generate a compliant decision scheme; when associating the demand data with the policy data, a fusion association model is used, and the steps for constructing the fusion association model are as follows: construct a graph network model G [formulas not reproduced], where V represents the nodes of the graph and E represents the edges of the graph, the node set consists of the n-th text-type node, the m-th image-type node, and the k-th audio-type node, and the formulas further use the spatial coordinates of node p, the spatial coordinates of node q, the semantic information of node p, and the semantic information of node q; construct a dynamic weight adjustment model: if the policy data is a tax policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.6:0.3:0.1; if the policy data is a subsidy policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.3:0.5:0.2; if the policy data is neither a tax policy nor a subsidy policy, the ratio of the text weight α, image weight β, and audio weight γ is 0.4:0.4:0.2; the graph network model and the dynamic weight adjustment model are integrated to generate the fusion association model [formula not reproduced], wherein one term denotes the position mask matrix of the image; the decision judgment result is determined using a conflict detection model, and the construction steps of the conflict detection model are as follows: conflict detection model C [formula not reproduced], where the formula uses a version validity indicator, Wi represents the weight, R represents the user demand, Pi represents the i-th policy data item, and n represents the total number of policy data items; if the policy data is a central policy, the value of Wi is 1.0; if the policy data is a non-central policy, the value of Wi is 0.6.

2. The customer service response method based on the multimodal fusion architecture according to claim 1, characterized in that, The steps for preprocessing the initial input data to generate a standard data source are as follows: Determine the data type of the initial input data, and select the corresponding processing model for standardized data processing based on the different data types.

3. The customer service response method based on the multimodal fusion architecture according to claim 1, characterized in that, The steps for standardizing data by selecting the appropriate processing model based on different data types are as follows: If the data type is an image, then the YOLOv7_doc_parser function is called for processing; If the data type is audio, then the Conformer_ASR function is called for processing; If the data type is text, then the BERT_entity_extractor function is called for processing.

4. The customer service response method based on the multimodal fusion architecture according to claim 1, characterized in that, It also includes step S4: sending the decision plan to the policy application terminal, uploading materials according to the process rules of the policy application terminal, until the policy application is completed.

5. A customer service response system with a multimodal fusion architecture, comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the customer service response method according to any one of claims 1-4.

6. A computer-readable storage medium storing a computer program thereon, characterized in that the computer program is executed by a processor to implement the steps of the customer service response method according to any one of claims 1-4.