A communication method, apparatus, device, and storage medium based on multimodal data
By preprocessing and standardizing multimodal data, the method addresses inconsistent data types and difficult semantic alignment in multimodal data processing, enabling efficient data transmission and information fusion and improving the system's decision-making efficiency and interaction accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU WAFA INFORMATION TECH CO LTD
- Filing Date
- 2025-04-25
- Publication Date
- 2026-06-12
Figure CN120455540B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of communication technology, and more specifically, to a communication method, apparatus, device, and storage medium based on multimodal data.
Background Technology
[0002] With the rapid development of artificial intelligence technology, especially in applications such as smart terminals, autonomous driving, and human-computer interaction, the demand for multimodal data collection and processing is increasing daily. Traditional systems often face problems such as inconsistent data types, large structural differences, and high difficulty in semantic alignment, resulting in unsatisfactory information fusion effects and further affecting the system's decision-making efficiency and interaction accuracy. Therefore, there is an urgent need for a method that supports standardized processing and efficient transmission of multimodal data to improve data compatibility and processing intelligence, and meet the information interaction needs of complex application scenarios.
Summary of the Invention
[0003] The purpose of this invention is to provide a communication method, apparatus, device, and readable storage medium based on multimodal data to address the aforementioned problems. To achieve the above objective, the technical solution adopted by this invention is as follows:
[0004] In a first aspect, this application provides a communication method based on multimodal data, including:
[0005] Acquiring multimodal data, which includes at least text, images, speech, and structured data;
[0006] Preprocessing the multimodal data to obtain preprocessed target data;
[0007] Standardizing the target data based on a pre-constructed set of prompt words to obtain revalued data;
[0008] Transmitting the revalued data based on a predetermined communication protocol.
[0009] In a second aspect, this application also provides a communication apparatus based on multimodal data, comprising:
[0010] An acquisition unit is used to acquire multimodal data, which includes at least text, images, speech, and structured data;
[0011] The preprocessing unit is used to preprocess the multimodal data to obtain preprocessed target data;
[0012] The first standardization unit is used to standardize the target data based on a pre-constructed set of prompt words to obtain revalued data;
[0013] The transmission unit is used to transmit the revalued data based on a predetermined communication protocol.
[0014] In a third aspect, this application also provides a communication device based on multimodal data, comprising:
[0015] Memory, used to store computer programs;
[0016] A processor is used to implement the steps of the communication method based on multimodal data when executing the computer program.
[0017] In a fourth aspect, this application also provides a readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the above-described communication method based on multimodal data.
[0018] The beneficial effects of this invention are as follows:
[0019] This invention enhances the system's ability to understand and process complex information by fusing and standardizing multimodal data. Combined with an efficient data transmission mechanism, it significantly improves data interaction efficiency and intelligence, making it suitable for cross-modal fusion and communication needs in multiple scenarios.
[0020] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing embodiments of the invention.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a schematic diagram of the communication method based on multimodal data described in an embodiment of the present invention;
[0023] Figure 2 is a schematic structural diagram of the communication apparatus based on multimodal data described in an embodiment of the present invention;
[0024] Figure 3 is a schematic structural diagram of the communication device based on multimodal data described in an embodiment of the present invention.
[0025] The diagram is labeled as follows: 10, acquisition unit; 20, preprocessing unit; 30, first standardization unit; 40, transmission unit; 800, communication device based on multimodal data; 801, processor; 802, memory; 803, multimedia component; 804, I/O interface; 805, communication component.
Detailed Implementation
[0026] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0027] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0028] Example 1:
[0029] This embodiment provides a communication method based on multimodal data.
[0030] Referring to Figure 1, the method includes steps S10, S20, S30 and S40.
[0031] Step S10. Obtain multimodal data, which includes at least text, images, speech, and structured data;
[0032] Specifically, different forms of raw data are collected within the target system or application environment to comprehensively reflect the information status of the current scenario. Multimodal data includes at least text data (such as user input and log information), image data (such as camera footage), voice data (such as voice commands and ambient sounds), and structured data (such as sensor data and tabular data). The acquisition of multimodal data can be achieved through multi-source sensors, API interfaces, database calls, and other methods.
[0033] Step S20. Preprocess the multimodal data to obtain the preprocessed target data;
[0034] Specifically, the acquired multimodal data is first cleaned and formatted to improve data usability and consistency, ensuring that subsequent standardization processing can be carried out efficiently.
[0035] Step S30. Standardize the target data based on the pre-constructed set of prompt words to obtain revalued data;
[0036] Specifically, based on a pre-built set of prompt words, semantic recognition and reconstruction are performed on the target data to generate a standard data format, resulting in revalued data. The prompt word set includes at least standard data prompt words, supplementary semantic prompt words, enumerated value prompt words, unit conversion prompt words, and value setting rule prompt words.
[0037] Step S40. Transmit the revalued data based on a predetermined communication protocol;
[0038] Specifically, standardized revalued data is transmitted to the target system or module via a specific communication protocol, which may include, but is not limited to, MQTT, HTTP, WebSocket, Modbus, etc., selected according to specific business needs. During transmission, reliable data interaction and parsing consistency are ensured across multiple systems, providing support for subsequent processing, storage, or intelligent analysis of multimodal data.
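As an illustration of the transmission step, the following minimal sketch serializes revalued data deterministically and hands it to a protocol-specific sender. The `transports` registry, the function names, and the use of JSON are assumptions made for this example; the source does not specify an encoding, and real MQTT, HTTP, or WebSocket clients would replace the stand-in callables.

```python
import json
from typing import Callable, Dict

# Hypothetical transport type: each callable stands in for a real protocol
# client (MQTT, HTTP, WebSocket, ...) and simply receives the encoded bytes.
Transport = Callable[[bytes], None]


def encode_revalued(record: dict) -> bytes:
    """Serialize with sorted keys so every receiver sees the same byte layout."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")


def transmit(record: dict, protocol: str, transports: Dict[str, Transport]) -> int:
    """Send the record over the chosen protocol and return the payload size."""
    if protocol not in transports:
        raise ValueError(f"unsupported protocol: {protocol}")
    payload = encode_revalued(record)
    transports[protocol](payload)
    return len(payload)


# Usage: a list's append method plays the role of the network client.
sent = []
transports = {"mqtt": sent.append, "http": sent.append}
size = transmit({"order_number": "A-1", "quantity": 30}, "mqtt", transports)
```

Keeping the encoding separate from the transport lookup means the same revalued record can be sent over whichever protocol the business scenario selects without changing how receivers parse it.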
[0039] It should be noted that step S20 includes steps S21 to S25:
[0040] Step S21. Perform modality recognition and data signal classification operations on the multimodal data to determine the modality type corresponding to each data unit. The modality type includes, but is not limited to, text, image, voice, or table.
[0041] Specifically, considering the lack of unified classification labels among multimodal data, the mixed processing of data from different modalities can easily lead to low processing efficiency and semantic confusion. Therefore, this application uses a modality recognition and signal classification mechanism to accurately distinguish different modal types such as text, images, speech, or tables, and can apply different data processing strategies to different modal types, thereby effectively improving the accuracy and efficiency of data parsing in the system.
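A minimal sketch of modality recognition is given below. It assumes that data units arrive as Python strings, byte strings, or dict/list structures, and it uses well-known file signatures (PNG, JPEG, RIFF/WAVE) as a stand-in for the classification mechanism, which the source does not describe in detail. The function name and the "unknown" fallback are hypothetical.

```python
from typing import Any

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"
RIFF_MAGIC = b"RIFF"


def classify_modality(unit: Any) -> str:
    """Return 'text', 'image', 'voice', 'table', or 'unknown' for one data unit."""
    if isinstance(unit, str):
        return "text"
    if isinstance(unit, (bytes, bytearray)):
        data = bytes(unit)
        if data.startswith(PNG_MAGIC) or data.startswith(JPEG_MAGIC):
            return "image"
        # A WAV file starts with 'RIFF', a 4-byte size, then 'WAVE'.
        if data.startswith(RIFF_MAGIC) and data[8:12] == b"WAVE":
            return "voice"
        return "unknown"
    if isinstance(unit, (dict, list)):
        return "table"
    return "unknown"
```

Once each unit carries a modality type, a dispatcher can route it to a modality-specific processing strategy.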
[0042] Step S22. Based on the preset structure-aware model, perform field structure parsing on the multimodal data, identify the hierarchical relationship between fields and the semantic data block boundaries, and generate structure labels and field location information;
[0043] Specifically, existing technologies typically rely on static rules or template matching to perform structured parsing of unstructured or semi-structured data, such as identifying field positions and extracting key content through fixed formats. This approach is prone to failure when the format changes slightly and lacks adaptability to complex hierarchies and diverse structures. Therefore, this application introduces a structure-aware model that can automatically identify field levels and semantic boundaries, extract structural labels and field location information, providing a precise structural foundation for subsequent standardization.
[0044] Step S23. Through a cross-modal semantic nesting mechanism, align the homologous fields or semantically related fields in different modalities to form an intermediate dataset after field alignment;
[0045] Specifically, in existing technologies, information from different modalities is often processed in isolation, leading to contextual fragmentation and insufficient semantic reconstruction capabilities. This application constructs a cross-modal semantic nesting mechanism, which can effectively align common or semantically related fields in text, speech, and structured data, improving the depth and accuracy of information fusion and outputting more consistent and expressive intermediate datasets.
[0046] Step S24. Redundancy content identification and suppression processing is performed on the intermediate dataset, deleting duplicate, conflicting or ambiguous fields, retaining the set of fields with high confidence, and generating cleaned data content;
[0047] Specifically, this application uses a confidence scoring mechanism and an ambiguity resolution model to automatically identify and suppress redundant and conflicting fields in intermediate datasets, effectively retaining representative and credible fields and ensuring high quality and consistency of data content.
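The suppression step can be sketched as follows, under stated assumptions: each candidate field carries a confidence score in [0, 1], duplicates are merged by keeping their highest score, and a field is dropped when its best value is below `min_conf` or when two conflicting values score within `margin` of each other. The `Candidate` type, the thresholds, and the rule itself are hypothetical; the source names a confidence scoring mechanism and an ambiguity resolution model without defining them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    name: str
    value: str
    confidence: float  # hypothetical score in [0, 1]


def suppress_redundancy(
    cands: List[Candidate], min_conf: float = 0.6, margin: float = 0.1
) -> Dict[str, str]:
    """Keep one high-confidence value per field; drop weak or ambiguous fields."""
    by_name: Dict[str, Dict[str, float]] = {}
    for c in cands:
        values = by_name.setdefault(c.name, {})
        # Duplicate (name, value) pairs collapse to their highest confidence.
        values[c.value] = max(values.get(c.value, 0.0), c.confidence)

    cleaned: Dict[str, str] = {}
    for name, values in by_name.items():
        ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
        best_value, best_conf = ranked[0]
        if best_conf < min_conf:
            continue  # not credible enough to retain
        if len(ranked) > 1 and best_conf - ranked[1][1] < margin:
            continue  # conflicting values too close to call: treated as ambiguous
        cleaned[name] = best_value
    return cleaned
```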
[0048] Step S25. Based on the preset field template or business structure rules, the cleaned data content is restructured in a structured manner, and the target data that conforms to the preprocessing standard format is output.
[0049] Specifically, by combining field templates with a business rule engine, the cleaned content can be flexibly reorganized, which not only achieves structural uniformity but also has the advantages of strong business adaptability and good scalability, laying the foundation for subsequent data standardization and transmission.
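A minimal sketch of template-driven restructuring follows. The `TEMPLATE` layout and field names are invented for illustration; the idea shown is only that a nested template maps output positions to cleaned field names, so changing the business structure means editing the template rather than the code.

```python
from typing import Any, Dict

# Hypothetical field template: leaves name the cleaned field to copy in.
TEMPLATE = {
    "customer": {"name": "name", "contact": {"phone": "phone", "email": "email"}},
    "order": {"order_number": "order_number"},
}


def restructure(cleaned: Dict[str, Any], template: Dict[str, Any]) -> Dict[str, Any]:
    """Fill a nested template from flat cleaned content."""
    out: Dict[str, Any] = {}
    for key, spec in template.items():
        if isinstance(spec, dict):
            out[key] = restructure(cleaned, spec)
        else:
            # None marks a template slot with no matching cleaned field.
            out[key] = cleaned.get(spec)
    return out
```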
[0050] It should be noted that step S22 includes steps S221 to S224:
[0051] Step S221. Input multimodal data into a preset structure perception model. The model outputs perception and recognition results containing structural features such as field blocks, row and column logic, indentation levels, and adjacent field relationships.
[0052] Specifically, for example, when given a scanned image of a personal information form, the model can identify structural features including: field blocks such as "name", "date of birth", and "contact information"; row and column logic, where name and date of birth are in the same row, and different rows represent different users; indentation level, where "province / city" and "street" are subfields under the "address" field; and adjacency field relationships, determining that "phone number" and "email" are in the same semantic unit.
[0053] Step S222. Based on the perception and recognition results, analyze the structural hierarchy relationship between fields in the multimodal data;
[0054] Specifically, based on the information output by the model, it is determined whether a field is a primary field or a secondary field, or whether it belongs to the same group, thereby establishing a tree-like or hierarchical relationship between fields.
[0055] For example, in a form containing address information: "Address" is considered the parent field, and "Province", "City", and "Postal Code" are considered child fields.
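The parent/child relationship in the address example can be sketched as a tree built from indentation levels. This assumes the structure-aware model has already produced `(level, name)` pairs with levels starting at 0; the `FieldNode` type and `build_hierarchy` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FieldNode:
    name: str
    children: List["FieldNode"] = field(default_factory=list)


def build_hierarchy(items: List[Tuple[int, str]]) -> FieldNode:
    """Build a field tree from (indentation level, field name) pairs in reading order."""
    root = FieldNode("<root>")
    stack = [(-1, root)]  # (level, node) chain from the root to the current parent
    for level, name in items:
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        node = FieldNode(name)
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


tree = build_hierarchy(
    [(0, "Name"), (0, "Address"), (1, "Province"), (1, "City"),
     (1, "Postal Code"), (0, "Phone")]
)
```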
[0056] Step S223. Based on the structural information of the perception recognition results, determine the semantic consistency region of the multimodal data, identify the boundary position of each semantic data block, and generate boundary recognition results;
[0057] Specifically, based on structural information, it determines which fields belong to the same semantic block (i.e., have the same business logic or semantic goal) and identifies and marks their start and end boundaries.
[0058] Step S224. Based on the structural hierarchy relationship between fields and the boundary recognition results, assign a structural label to each field and generate the corresponding field location information;
[0059] Specifically, the system assigns labels to fields based on the structural hierarchy and boundary information, such as "main field", "subfield", "last-level field", etc., and records their position information in the original data (such as page number, coordinates, index, etc.).
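One possible reading of the labeling rule is sketched below. It assumes a field with no parent is a "main field", a field with both a parent and children is a "subfield", and a field with a parent but no children is a "last-level field"; the source lists these labels without defining them precisely, so this mapping, the `parents` dictionary, and the location record shape are all assumptions.

```python
from typing import Dict, List, Optional


def assign_labels(
    parents: Dict[str, Optional[str]], positions: Dict[str, dict]
) -> List[dict]:
    """Attach a structural label and recorded location to every field."""
    has_children = {p for p in parents.values() if p is not None}
    records = []
    for name, parent in parents.items():
        if parent is None:
            label = "main field"
        elif name in has_children:
            label = "subfield"
        else:
            label = "last-level field"
        # Location is whatever the parser recorded (page, coordinates, index).
        records.append({"field": name, "label": label, "location": positions.get(name)})
    return records
```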
[0060] It should be noted that step S30 includes steps S31 to S33:
[0061] Step S31. For each field information in the target data, combined with its corresponding modality type, structural label information and field location information, select and match the most relevant prompt word type and prompt word content from the prompt word set;
[0062] Specifically, for each field in the target data, the system first determines the modal type of the field (such as text, image, voice, or structured data), and then, in conjunction with its structural tags (such as "main field" and "subfield") and location information (such as page number, location, and context index), retrieves and matches the most relevant prompt word type and content in the prompt word set.
[0063] Step S32. Based on the matched prompt word types and field semantic results, standardize the target data according to preset standard rules;
[0064] Specifically, the system performs cleaning and transformation operations according to standardized rules based on the matched prompt word type and field semantics.
[0065] Step S33. Output the standardized target data as structured revalued data. The revalued data meets the data standards defined by the prompt word set in terms of field semantics, enumeration specifications, unit form and value rules.
[0066] Specifically, the standardized field data is output in a structured manner to form revalued data that meets the predefined data standards. The revalued data needs to meet the requirements of clear semantic definition, consistent units, unified format, and compliant value selection.
[0067] It should be noted that step S31 includes steps S311 to S314:
[0068] Step S311. For each field information in the target data, combine its corresponding modality type, structural label information and field location information to perform field semantic analysis;
[0069] Specifically, modal types (such as text, voice, and image), structural labels (such as "table header" and "data cell"), and field location information (such as "row 2, column 3") are used as semantic analysis context; natural language processing or semantic parsing models are applied to identify the potential meaning of the fields. For example, if the field is "38.5", the modal type is "text", and it is located in the "body temperature" field cell of the table, its initial semantic meaning is determined to be "body temperature value".
[0070] Step S312. Perform semantic analysis and content extraction on each field, and construct a field semantic model based on the semantic tags, contextual semantic relationships, and domain attributes analyzed from each field;
[0071] Specifically, semantic tags (such as "temperature value", "contact information", "time expression", etc.) are extracted from each field. These tags are combined with the context (such as the field's surroundings and the form module it belongs to) and domain attributes (such as medical, financial, or transportation) to construct a semantic embedding or knowledge graph representation, forming a semantic model that can be compared across fields. For example, the field "BP: 120 / 80" is labeled "blood pressure" and belongs to the "health and medical" domain. Its co-occurrence with other fields such as "body temperature" and "heart rate" further strengthens this interpretation.
[0072] Step S313. Based on the field semantic model, determine the most likely applicable prompt word type for each field and determine the limited prompt word matching range;
[0073] Specifically, the semantic model determines which type of prompt word the field is more likely to be associated with, thereby excluding irrelevant prompt word types and avoiding semantic conflicts or mismatches. For example, if the semantic model determines that the field is "blood pressure", the matching range should be limited to "numerical format + unit conversion + normal range prompt words", and will not match "gender enumeration prompt words".
[0074] Step S314. Within the range of limited prompt word matching, match the prompt word content that is most relevant to the field semantics from the pre-constructed prompt word set;
[0075] Specifically, semantic similarity calculation or rule matching is performed within the prompt words of the limited types, and the prompt word with the highest similarity or the best rule fit is selected as the standardization basis for the field. For example, the field "38.5" matches the prompt word "For the body temperature field: the value should be in °C and the range is between 35 and 42"; the unit is then supplemented and the value is checked for plausibility.
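The two-stage matching (first restrict the prompt word types, then pick the most similar prompt word) can be sketched as below. Token-overlap (Jaccard) similarity is swapped in here as a simple stand-in for the semantic similarity calculation, which the source does not specify; the `PROMPTS` entries other than the body temperature example and the type names are hypothetical.

```python
import re
from typing import List, Optional, Set

PROMPTS = [
    {"type": "unit_conversion",
     "text": "For the body temperature field: the value should be in °C "
             "and the range is between 35 and 42"},
    {"type": "unit_conversion",
     "text": "For the weight field: the value should be in kg"},
    {"type": "enumeration",
     "text": "For the gender field: the value should be one of male, female"},
]


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_prompt(
    field_semantics: str, allowed_types: Set[str], prompts: List[dict]
) -> Optional[dict]:
    """Return the best-overlapping prompt among the allowed types, or None."""
    query = tokens(field_semantics)
    best, best_score = None, 0.0
    for prompt in prompts:
        if prompt["type"] not in allowed_types:
            continue  # outside the limited matching range
        cand = tokens(prompt["text"])
        union = query | cand
        score = len(query & cand) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = prompt, score
    return best
```

Restricting the types first keeps an unrelated prompt word (for example a gender enumeration) from being selected for a numeric field even if it shares a few words with the query.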
[0076] It should be noted that step S32 includes steps S321 to S325:
[0077] Step S321. Based on the data standard prompt words, perform unified naming, format conversion, and semantic alignment on the field name and its corresponding value;
[0078] Specifically, the data standard prompt words are used to solve problems such as non-standard field names, chaotic synonyms, and inconsistent data formats, and ensure that the semantic expressions of fields are standard and consistent. For example, different names such as Order No., PO Number, and order number are uniformly recognized as "order number"; dates such as "2024 / 4 / 10, 2024.4.10" are converted into the unified format "2024-04-10"; "1,000 pieces" is converted into the numerical field "1000".
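The naming, date, and quantity examples above can be sketched as three small normalizers. The alias table and function names are assumptions for illustration; in the described method these rules would come from the data standard prompt words rather than being hard-coded.

```python
import re
from datetime import date

# Hypothetical alias table derived from data standard prompt words.
NAME_ALIASES = {
    "order no.": "order number",
    "po number": "order number",
    "order number": "order number",
}


def normalize_name(name: str) -> str:
    """Map known synonyms to the standard field name; unknown names pass through."""
    key = name.strip().lower()
    return NAME_ALIASES.get(key, key)


def normalize_date(text: str) -> str:
    """Convert 'YYYY/M/D', 'YYYY.M.D' or 'YYYY-M-D' (spaces allowed) to ISO format."""
    m = re.fullmatch(r"\s*(\d{4})\s*[/.\-]\s*(\d{1,2})\s*[/.\-]\s*(\d{1,2})\s*", text)
    if not m:
        raise ValueError(f"unrecognized date: {text!r}")
    year, month, day = (int(g) for g in m.groups())
    return date(year, month, day).isoformat()  # date() rejects impossible dates


def normalize_quantity(text: str) -> int:
    """Extract a leading integer such as '1,000 pieces' -> 1000."""
    m = re.match(r"\s*(\d[\d,]*)", text)
    if not m:
        raise ValueError(f"no quantity found: {text!r}")
    return int(m.group(1).replace(",", ""))
```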
[0079] Step S322. Based on the supplementary semantic prompt words, perform semantic supplementation and correction on the fields with missing or ambiguous semantics;
[0080] Specifically, when a field name is too short, its meaning is unclear, or its context lacks information, the prompt words supply the complete semantics so that the system can interpret the field. For example, a field named "time" is ambiguous on its own, and the semantic prompt words help determine from context whether it means "delivery time" or "payment time"; for a field named "material" with the value "WLD001", the prompt words, combined with the context, supply the complete description "cold-rolled steel sheet - material type".
[0081] Step S323. Based on the enumerated value prompt words, map the non-standard enumerated type values to the unified enumeration standard;
[0082] Specifically, perform mapping processing on the field values with enumeration characteristics (such as payment method, delivery status, material type) to avoid confusion caused by different expressions. For example, map "cash on delivery", "cash payment", and "cash delivery" to the standard enumerated item: "cash payment". Map "shipped", "delivered", and "in transit" to the unified enumeration: "shipped".
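The enumeration examples above amount to a lookup from raw expressions to a standard item. A minimal sketch, with the table keyed by hypothetical field names and the mappings taken from the examples in the text:

```python
from typing import Optional

# Mapping tables taken from the examples in the text; field keys are hypothetical.
ENUM_STANDARD = {
    "payment_method": {
        "cash on delivery": "cash payment",
        "cash payment": "cash payment",
        "cash delivery": "cash payment",
    },
    "delivery_status": {
        "shipped": "shipped",
        "delivered": "shipped",
        "in transit": "shipped",
    },
}


def map_enum(field_name: str, raw: str) -> Optional[str]:
    """Return the standard enumeration item, or None if the value is unmapped."""
    table = ENUM_STANDARD.get(field_name, {})
    return table.get(raw.strip().lower())
```

Returning `None` for an unmapped value lets a caller flag it for review instead of silently passing a non-standard expression downstream.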
[0083] Step S324. Based on the unit conversion prompt words, implement standard conversion and unified representation of fields with unit information;
[0084] Specifically, it addresses the issue of inconsistent units and different expressions of physical quantities in field values, achieving numerical standardization and unified dimensional expression. For example, it converts "2 tons" to "2000 kg"; "1000 g" and "1 kilogram" to "1 kg"; and "3 boxes (10 pieces per box)" is converted to "30 pieces" according to the prompt word rules.
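The unit examples above can be worked through with a small conversion table. The sketch assumes metric tons and uses grams as an integer base unit so the arithmetic stays exact for these inputs; the table contents and function names are illustrative assumptions.

```python
import re

# Grams per unit; a 'ton' is assumed to be a metric ton (1000 kg).
GRAMS_PER_UNIT = {
    "ton": 1_000_000, "tons": 1_000_000,
    "kg": 1_000, "kilogram": 1_000, "kilograms": 1_000,
    "g": 1,
}


def to_kg(text: str) -> float:
    """Convert a mass like '2 tons' or '1000 g' to kilograms."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", text)
    if not m or m.group(2).lower() not in GRAMS_PER_UNIT:
        raise ValueError(f"unsupported mass expression: {text!r}")
    return float(m.group(1)) * GRAMS_PER_UNIT[m.group(2).lower()] / 1000


def boxes_to_pieces(text: str) -> int:
    """Convert '3 boxes (10 pieces per box)' to a piece count."""
    m = re.fullmatch(
        r"\s*(\d+)\s*box(?:es)?\s*\(\s*(\d+)\s*pieces per box\s*\)\s*", text
    )
    if not m:
        raise ValueError(f"unsupported packaging expression: {text!r}")
    return int(m.group(1)) * int(m.group(2))
```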
[0085] Step S325. Based on the value setting rule prompt words, constrain the range of field values, and correct, remove, or mark data that does not conform to the rules;
[0086] Specifically, the system uses preset rules (such as upper and lower limits for values, reasonable time ranges, etc.) to perform validity checks and processing on field values to ensure data validity. For example, if the date field value "2049-13-01" does not conform to the time format or exceeds the reasonable date range, the system marks it as "invalid data"; if the delivery quantity field value is "-50" or "0", and is constrained by the prompt word "must be a positive integer", it will be corrected or removed; if the "payment period" field is "300 days", exceeding the set upper limit (such as 180 days), a warning will be issued or it will be marked in red.
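The three validity checks in the examples above can be sketched as follows. The date bounds, the choice to return `None` for a removed quantity, and the status strings are assumptions for illustration; only the 180-day limit and the "must be a positive integer" constraint come from the text.

```python
from datetime import date
from typing import Optional


def check_date(
    text: str, earliest: date = date(2000, 1, 1), latest: date = date(2035, 12, 31)
) -> str:
    """Mark dates that fail to parse or fall outside a hypothetical reasonable range."""
    try:
        parsed = date.fromisoformat(text)
    except ValueError:
        return "invalid data"
    return "ok" if earliest <= parsed <= latest else "invalid data"


def check_quantity(text: str) -> Optional[int]:
    """Enforce 'must be a positive integer'; None means the value is removed."""
    try:
        value = int(text)
    except ValueError:
        return None
    return value if value > 0 else None


def check_payment_period(days: int, upper_limit: int = 180) -> str:
    """Warn when the payment period exceeds the configured upper limit."""
    return "warning" if days > upper_limit else "ok"
```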
[0087] Example 2:
[0088] As shown in Figure 2, this embodiment provides a communication apparatus based on multimodal data, the apparatus comprising:
[0089] The acquisition unit 10 is used to acquire multimodal data, which includes at least text, images, speech and structured data;
[0090] Preprocessing unit 20 is used to preprocess multimodal data to obtain preprocessed target data;
[0091] The first standardization unit 30 is used to standardize the target data based on a pre-constructed set of prompt words to obtain revalued data;
[0092] The transmission unit 40 is used to transmit the revalued data based on a predetermined communication protocol.
[0093] In one specific embodiment disclosed in this application, the preprocessing unit 20 includes:
[0094] The classification unit is used to perform modality recognition and data signal classification operations on multimodal data to determine the modality type corresponding to each data unit. The modality type includes, but is not limited to, text, image, voice or table.
[0095] The parsing unit is used to parse the field structure of multimodal data based on a preset structure-aware model, identify the hierarchical relationship between fields and the boundaries of semantic data blocks, and generate structure labels and field location information.
[0096] The alignment unit is used to align homologous or semantically related fields in different modalities through a cross-modal semantic nesting mechanism, forming an intermediate dataset after field alignment.
[0097] The suppression unit is used to identify and suppress redundant content in the intermediate dataset, remove duplicate, conflicting or ambiguous fields, retain the set of fields with high confidence, and generate cleaned data content.
[0098] The reorganization unit is used to restructure the cleaned data content according to the preset field template or business structure rules, and output the target data that conforms to the preprocessing standard format.
[0099] In one specific embodiment disclosed in this application, the first standardization unit 30 includes:
[0100] The first matching unit is used to select and match the most relevant prompt word type and prompt word content from the prompt word set based on the field information in the target data, combined with its corresponding modality type, structural label information and field location information.
[0101] The second standardization unit is used to standardize the target data according to preset standard rules based on the matched prompt word type and field semantic results.
[0102] The output unit is used to output the standardized target data as structured revalued data. The revalued data meets the data standards defined by the prompt word set in terms of field semantics, enumeration specifications, unit form and value rules.
[0103] In one specific embodiment disclosed in this application, the first matching unit includes:
[0104] The analysis unit is used to perform semantic analysis on each field in the target data, combining its corresponding modality type, structural label information, and field location information.
[0105] The extraction unit is used to perform semantic analysis and extract content from each field, and to construct a field semantic model based on the semantic tags, contextual semantic relationships and domain attributes analyzed from each field.
[0106] The determination unit is used to determine the most likely type of prompt word for each field based on the field semantic model, and to determine the limited scope of prompt word matching;
[0107] The second matching unit is used to match the most relevant prompt words to the field semantics from a pre-constructed set of prompt words within a limited prompt word matching range.
[0108] In one specific embodiment disclosed in this application, the second standardization unit includes:
[0109] The naming unit is used to perform unified naming, format conversion, and semantic alignment on field names and their corresponding values based on the data standard prompt words.
[0110] The supplementary unit is used to supplement and correct semantically missing or ambiguous fields based on supplementary semantic prompts;
[0111] The mapping unit is used to map non-standard enumeration class values to a unified enumeration standard based on enumeration value prompts.
[0112] The conversion unit is used to achieve standard conversion and unified representation of fields with unit information based on the unit conversion prompt words.
[0113] The value setting unit is used to constrain the range of field values based on the value setting rule prompt words, and to correct, remove or mark data that does not conform to the rules.
[0114] In one specific embodiment disclosed in this application, the parsing unit includes:
[0115] The input unit is used to input multimodal data into a preset structure-aware model. The model outputs perception and recognition results containing structural features such as field blocks, row and column logic, indentation levels, and relationships between adjacent fields.
[0116] The recognition unit is used to parse the structural hierarchy relationship between fields in multimodal data based on the perception recognition results;
[0117] The decision unit is used to determine the semantic consistency region of multimodal data based on the structural information of the perception and recognition results, identify the boundary position of each semantic data block, and generate boundary recognition results.
[0118] The generation unit is used to assign structural labels to each field and generate corresponding field location information based on the structural hierarchy relationship between fields and the boundary recognition results.
[0119] It should be noted that the specific manner in which each module performs its operation in the apparatus described in the above embodiments has been described in detail in the embodiments of the method, and will not be elaborated here.
[0120] Example 3:
[0121] Corresponding to the above method embodiments, this embodiment also provides a communication device based on multimodal data. The communication device based on multimodal data described below and the communication method based on multimodal data described above may be read with reference to each other.
[0122] Figure 3 is a block diagram illustrating a communication device 800 based on multimodal data according to an exemplary embodiment. As shown in Figure 3, the communication device 800 based on multimodal data may include a processor 801 and a memory 802. The communication device 800 may also include one or more of a multimedia component 803, an I/O interface 804, and a communication component 805.
[0123] The processor 801 controls the overall operation of the multimodal data-based communication device 800 to complete all or part of the steps in the aforementioned multimodal data-based communication method. The memory 802 stores various types of data to support the operation of the multimodal data-based communication device 800. This data may include, for example, instructions for any application or method operating on the multimodal data-based communication device 800, as well as application-related data such as contact data, sent and received messages, images, audio, video, etc. The memory 802 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The multimedia component 803 may include a screen and an audio component. The screen may be, for example, a touchscreen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 802 or transmitted via the communication component 805. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 804 provides an interface between the processor 801 and other interface modules, such as keyboards, mice, and buttons. These buttons can be virtual or physical. The communication component 805 is used for wired or wireless communication between the multimodal data-based communication device 800 and other devices. Wireless communication includes Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination thereof. Therefore, the corresponding communication component 805 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
[0124] In an exemplary embodiment, the communication device 800 based on multimodal data may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the aforementioned communication method based on multimodal data.
[0125] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, which, when executed by a processor, implement the steps of the multimodal data-based communication method described above. For example, the computer-readable storage medium may be the memory 802 including the program instructions described above, which may be executed by the processor 801 of the multimodal data-based communication device 800 to complete the multimodal data-based communication method described above.
[0126] Example 4:
[0127] Corresponding to the above method embodiments, this embodiment also provides a readable storage medium. The readable storage medium described below and the communication method based on multimodal data described above may be read with reference to each other.
[0128] The readable storage medium stores a computer program which, when executed by a processor, implements the steps of the communication method based on multimodal data described in the above method embodiments.
[0129] Specifically, the readable storage medium can be a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or any other readable storage medium capable of storing program code.
[0130] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
[0131] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A communication method based on multimodal data, characterized in that it comprises:
acquiring multimodal data, which includes at least text, images, speech, and structured data;
preprocessing the multimodal data to obtain preprocessed target data, wherein obtaining the target data specifically includes:
performing modality recognition and data signal classification on the multimodal data to determine the modality type corresponding to each data unit, the modality type including, but not limited to, text, image, voice, or table;
parsing the multimodal data based on a preset structure-aware model to identify the hierarchical relationships between fields and the boundaries of semantic data blocks, and generating structure labels and field location information;
aligning homologous or semantically related fields in different modalities through a cross-modal semantic nesting mechanism to form a field-aligned intermediate dataset;
performing redundant content identification and suppression on the intermediate dataset to remove duplicate, conflicting, or ambiguous fields, retain the set of fields with high confidence, and generate cleaned data content;
restructuring the cleaned data content based on preset field templates or business structure rules, and outputting the target data that conforms to the preprocessing standard format;
standardizing the target data based on a pre-constructed prompt word set to obtain revalued data, wherein the prompt word set includes data standard prompt words, supplementary semantic prompt words, enumerated value prompt words, unit conversion prompt words, and fixed value rule prompt words, and wherein obtaining the revalued data specifically includes:
for each field in the target data, selecting and matching the most relevant prompt word type and prompt word content from the prompt word set based on the corresponding modality type, structure label information, and field location information;
standardizing the target data according to preset standard rules based on the matched prompt word types and the field semantic results;
outputting the standardized target data as structured revalued data, wherein the revalued data meets the data standards defined by the prompt word set in terms of field semantics, enumeration specifications, unit form, and value rules, and meets the requirements of clear semantic definition, consistent units, unified format, and compliant values;
transmitting the revalued data based on a predetermined communication protocol.
2. The communication method based on multimodal data according to claim 1, characterized in that selecting and matching, for each field in the target data, the most relevant prompt word type and prompt word content from the prompt word set based on the corresponding modality type, structure label information, and field location information includes:
performing semantic analysis on each field in the target data in combination with its corresponding modality type, structure label information, and field location information;
performing semantic analysis and content extraction on each field, and constructing a field semantic model based on the semantic tags, contextual semantic relationships, and domain attributes obtained from the analysis of each field;
determining, based on the field semantic model, the prompt word type most likely to apply to each field, and determining a limited prompt word matching range;
matching, within the limited prompt word matching range, the prompt word content most relevant to the semantics of the field from the pre-constructed prompt word set.
3. A communication device based on multimodal data, characterized in that it comprises:
an acquisition unit, used to acquire multimodal data, which includes at least text, images, speech, and structured data;
a preprocessing unit, used to preprocess the multimodal data to obtain preprocessed target data, wherein the preprocessing unit includes:
a classification unit, used to perform modality recognition and data signal classification on the multimodal data to determine the modality type corresponding to each data unit, the modality type including, but not limited to, text, image, voice, or table;
a parsing unit, used to parse the field structure of the multimodal data based on a preset structure-aware model, identify the hierarchical relationships between fields and the boundaries of semantic data blocks, and generate structure labels and field location information;
an alignment unit, used to align homologous or semantically related fields in different modalities through a cross-modal semantic nesting mechanism, forming a field-aligned intermediate dataset;
a suppression unit, used to perform redundant content identification and suppression on the intermediate dataset, remove duplicate, conflicting, or ambiguous fields, retain the set of fields with high confidence, and generate cleaned data content;
a reorganization unit, used to restructure the cleaned data content according to preset field templates or business structure rules, and output the target data that conforms to the preprocessing standard format;
a first standardization unit, used to standardize the target data based on a pre-constructed prompt word set to obtain revalued data, wherein the prompt word set includes data standard prompt words, supplementary semantic prompt words, enumerated value prompt words, unit conversion prompt words, and fixed value rule prompt words, and wherein the first standardization unit includes:
a first matching unit, used to select and match the most relevant prompt word type and prompt word content from the prompt word set based on the field information in the target data, combined with the corresponding modality type, structure label information, and field location information;
a second standardization unit, used to standardize the target data according to preset standard rules based on the matched prompt word type and the field semantic results;
an output unit, used to output the standardized target data as structured revalued data, wherein the revalued data meets the data standards defined by the prompt word set in terms of field semantics, enumeration specifications, unit form, and value rules, and meets the requirements of clear semantic definition, consistent units, unified format, and compliant values;
a transmission unit, used to transmit the revalued data based on a predetermined communication protocol.
4. The communication device based on multimodal data according to claim 3, characterized in that the first matching unit includes:
an analysis unit, used to perform semantic analysis on each field in the target data in combination with its corresponding modality type, structure label information, and field location information;
an extraction unit, used to perform semantic analysis and content extraction on each field, and to construct a field semantic model based on the semantic tags, contextual semantic relationships, and domain attributes obtained from the analysis of each field;
a determining unit, used to determine, based on the field semantic model, the prompt word type most likely to apply to each field, and to determine a limited prompt word matching range;
a second matching unit, used to match, within the limited prompt word matching range, the prompt word content most relevant to the semantics of the field from the pre-constructed prompt word set.
5. A communication device based on multimodal data, characterized in that it comprises:
a memory, used to store a computer program;
a processor, configured to implement the steps of the communication method based on multimodal data according to claim 1 or 2 when executing the computer program.
6. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, implements the steps of the communication method based on multimodal data according to claim 1 or 2.
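Illustrative note (not part of the claims): the two-stage prompt word matching of claim 2 and the standardization step of claim 1 might be sketched in Python as follows. Everything named here is a hypothetical assumption made for illustration: the `Field` and `PromptWord` classes, the `PROMPT_WORDS` entries, the keyword-overlap scoring, and the rule that a field carrying a unit is treated as a unit conversion candidate. The claims do not specify how the field semantic model or the preset standard rules are built, so this simple token matching only stands in for them.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical field of the preprocessed target data."""
    name: str
    value: str
    modality: str          # e.g. "text", "image", "voice", "table"
    structure_label: str   # e.g. "patient.vitals"
    unit: Optional[str] = None


@dataclass(frozen=True)
class PromptWord:
    """Hypothetical prompt word entry: a type, trigger keywords, and a rule."""
    kind: str
    keywords: frozenset
    rule: dict


# Assumed toy prompt word set covering two of the five types named in claim 1.
PROMPT_WORDS = [
    PromptWord("unit_conversion", frozenset({"height", "length"}),
               {"from": "cm", "to": "m", "factor": 0.01}),
    PromptWord("unit_conversion", frozenset({"weight", "mass"}),
               {"from": "g", "to": "kg", "factor": 0.001}),
    PromptWord("enumerated_value", frozenset({"gender", "sex"}),
               {"mapping": {"m": "male", "male": "male",
                            "f": "female", "female": "female"}}),
]


def field_tokens(field):
    """Stand-in for a field semantic model: tokens from name and structure label."""
    text = f"{field.name} {field.structure_label}".lower()
    return set(text.replace(".", " ").replace("_", " ").split())


def candidate_kind(field):
    """Stage 1: choose a prompt word type, which limits the matching range."""
    return "unit_conversion" if field.unit is not None else "enumerated_value"


def match_prompt_word(field, prompt_words):
    """Stage 2: within the limited range, pick the entry with most keyword overlap."""
    kind = candidate_kind(field)
    tokens = field_tokens(field)
    scored = [(len(pw.keywords & tokens), pw)
              for pw in prompt_words if pw.kind == kind]
    score, best = max(scored, key=lambda s: s[0], default=(0, None))
    return best if score > 0 else None


def standardize(field, prompt_words):
    """Apply the matched rule; return the field unchanged when nothing applies."""
    pw = match_prompt_word(field, prompt_words)
    if pw is None:
        return field
    if pw.kind == "unit_conversion" and field.unit == pw.rule["from"]:
        converted = float(field.value) * pw.rule["factor"]
        return replace(field, value=f"{converted:g}", unit=pw.rule["to"])
    if pw.kind == "enumerated_value":
        canonical = pw.rule["mapping"].get(field.value.strip().lower())
        if canonical is not None:
            return replace(field, value=canonical)
    return field


if __name__ == "__main__":
    raw = [
        Field("height", "172", "table", "patient.vitals", "cm"),
        Field("gender", " M ", "text", "patient.profile"),
    ]
    for f in raw:
        print(standardize(f, PROMPT_WORDS))
```

Limiting the candidates by type before scoring content mirrors the order described in claim 2: the type decision narrows the matching range, and only entries inside that range compete on relevance. A field that matches no entry is passed through unchanged rather than guessed at.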