Quality label embedding method and system applied to multi-modal data

By extracting quality features and performing hierarchical clustering on multimodal data, and generating quality labels using a pre-trained model, the problems of low efficiency and poor objectivity in multimodal data evaluation are solved, and efficient integration and accurate management of data quality information are achieved.

CN122241232A · Pending · Publication Date: 2026-06-19 · CHINA ELECTRONICS STANDARDIZATION INST

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA ELECTRONICS STANDARDIZATION INST
Filing Date
2026-03-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal data quality assessment methods mainly rely on manual methods, which are inefficient and lack objectivity. They cannot comprehensively consider the correlation between multimodal data and the overall quality characteristics. They also lack an effective quality label embedding mechanism, making it difficult to closely integrate data quality information with the data itself, which is not conducive to data retrieval, screening and efficient utilization.

Method used

By acquiring a multimodal dataset, quality feature extraction is performed across data units to generate a multidimensional quality feature vector set. Quality hierarchical clustering is then performed in a unified representation space. A pre-trained quality label generation model is called to extract group features and generate a quality label feature set. Finally, the label information is embedded into the metadata structure of the data unit.

Benefits of technology

It enables the effective integration and management of multimodal data quality information, improves the efficiency and accuracy of quality assessment, and ensures the accurate reflection of the similarity of quality characteristics of data units within the same group and the quality differences between groups.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122241232A_ABST]

Abstract

This invention provides a method and system for embedding quality labels into multimodal data. It acquires a raw multimodal dataset containing text, images, audio, video, and 3D models, and performs cross-modal quality feature extraction on each data unit to generate a multidimensional quality feature vector set. Quality hierarchical clustering is performed in a unified representation space to obtain at least one quality hierarchical group containing data units from different modalities. A pre-trained quality label generation model is invoked to extract group features for each quality hierarchical group, obtaining a quality label feature set. Based on the quality label feature set, label embedding processing is performed on each data unit, appending quality label information to the metadata structure to generate a labeled multimodal dataset carrying quality label information. This invention can comprehensively and accurately evaluate the quality of multimodal data and embed quality labels, improving the efficiency and accuracy of data quality management.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence, and more specifically, to a method and system for embedding quality labels for multimodal data.

Background Technology

[0002] In today's digital information age, the application of multimodal data is becoming increasingly widespread, encompassing various types such as text, images, audio, video, and 3D models. This multimodal data plays a crucial role in various fields, such as intelligent security, medical image analysis, and multimedia content creation and dissemination. However, with the explosive growth of data volume, the quality of multimodal data varies greatly, data sources are complex and diverse, data timeliness is difficult to accurately grasp, and data formats lack unified standards. These problems pose significant challenges to the effective utilization of data. Currently, multimodal data quality assessment mainly relies on manual methods, which are not only inefficient but also heavily influenced by subjective factors, lacking objectivity and consistency. Furthermore, existing automated quality assessment methods often only target single-modal data, failing to comprehensively consider the correlation between multimodal data and the overall quality characteristics. In addition, the lack of an effective quality label embedding mechanism in data management and application makes it difficult to closely integrate data quality information with the data itself, hindering data retrieval, filtering, and efficient utilization.

Summary of the Invention

[0003] In view of the aforementioned problems, and in conjunction with the first aspect of the present invention, embodiments of the present invention provide a quality label embedding method for multimodal data, the method comprising:

obtaining the original multimodal dataset, which includes text data units, image data units, audio data units, video data units, and three-dimensional model data units;

performing cross-modal data unit quality feature extraction processing on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit, the multidimensional quality feature vector set including a data integrity feature vector, a data source reliability feature vector, a data timeliness feature vector, and a data format standardization feature vector;

based on the multidimensional quality feature vector set corresponding to each data unit, performing quality hierarchical clustering on the text data units, image data units, audio data units, video data units, and three-dimensional model data units in a unified representation space to generate at least one quality level group, where each quality level group contains data units from at least two different modalities, and the quality feature vectors of the data units within the same quality level group satisfy a preset similarity range in the unified representation space;

invoking the pre-trained quality label generation model to perform group feature extraction processing on each quality level group to obtain the quality label feature set corresponding to the quality level group, the quality label feature set including a group feature identifier, a quality consistency measurement parameter between data units within the group, and a quality difference representation vector between groups;

based on the quality label feature set, performing label embedding processing on the data units within each quality level group, whereby the quality label feature set is appended to the metadata structure of each data unit in the form of label embedding vectors to generate a labeled multimodal dataset carrying quality label information.

[0004] In another aspect, embodiments of the present invention also provide a quality label embedding system for multimodal data, including a processor and a machine-readable storage medium connected to the processor. The machine-readable storage medium is used to store programs, instructions, or code, and the processor is used to execute the programs, instructions, or code in the machine-readable storage medium to implement the above-described method.

[0005] Based on the above, this embodiment of the invention acquires a raw multimodal dataset containing multiple modalities and performs quality feature extraction processing on it across modal data units to generate a multidimensional quality feature vector set. In a unified representation space, it performs quality-level clustering of data units from different modalities, fully considering the correlation between multimodal data, and can generate reasonable and accurate quality-level groups, ensuring that data units within the same group have similarity in quality features. A pre-trained quality label generation model is invoked to perform group feature extraction processing on each quality-level group, accurately obtaining a set of quality label features reflecting the overall quality level of the group, the consistency of quality among data units within the group, and the quality differences between groups. Finally, based on the quality label feature set, label embedding processing is performed on the data units within each quality-level group, attaching quality label information to the metadata structure of each data unit in a scientifically reasonable manner, generating a labeled multimodal dataset carrying quality label information. This achieves effective integration and management of multimodal data quality information, greatly improving the efficiency and accuracy of multimodal data quality assessment.

Attached Figure Description

[0006] Figure 1 is a schematic diagram of the execution flow of the quality label embedding method for multimodal data provided in an embodiment of the present invention.

[0007] Figure 2 is a schematic diagram of exemplary hardware and software components of a quality label embedding system for multimodal data provided in an embodiment of the present invention.

Detailed Implementation

[0008] The present invention will now be described in detail with reference to the accompanying drawings. Figure 1 is a flowchart illustrating a quality label embedding method for multimodal data according to an embodiment of the present invention. The following is a detailed description of this quality label embedding method for multimodal data.

[0009] Step S110: Obtain the original multimodal dataset, which includes text data units, image data units, audio data units, video data units, and three-dimensional model data units.

[0010] In this embodiment, a biodiversity monitoring and research project is used as an example. This project aims to integrate heterogeneous data from different field survey devices and online databases to assess the survival status of species and environmental changes within a specific nature reserve. The process of constructing the original multimodal dataset is as follows.

[0011] First, text data units were collected from field researchers' survey notes and online species databases. A survey note text data unit records descriptive information such as the observation of an adult male golden snub-nosed monkey in a specific area on a specific date, and its activity range around a mixed coniferous and broad-leaved forest zone with a specific altitude range. This text data unit is encoded as a string in a specific character encoding format and includes metadata such as the recorder identifier, record timestamp, and geographic location tag.

[0012] Secondly, image data units are acquired, including image data units captured by infrared-triggered cameras deployed within the protected area. An image data unit captured on a specific date and at a specific time has a specific image format and a resolution within a specific range. Its Exchangeable Image File Format (EXIF) information records the camera model, shutter speed, aperture size, and GPS coordinates. This image data unit clearly captures images of golden monkeys active at night.

[0013] Next, audio data units were acquired. These units were collected by researchers using directional recording equipment during field surveys to record the calls of different species or environmental background sounds. One audio data unit recorded the communication sounds between groups of golden monkeys and the sounds of woodpeckers pecking at tree trunks, collected in a specific area at a specific time on a specific date. This audio data unit was stored in a specific audio format, with a sampling rate and quantization bit depth within a specific range, and configured for a specific number of channels.

[0014] Next, video data units are acquired. These data units come from periodic aerial photographs taken by researchers using drones over the core area of the protected area. A video data unit taken on a specific date and within a specific time period has a specific video encoding standard, a frame rate within a specific range, and a resolution within a specific range. The video footage shows a large area of vegetation cover, topography, and the flow direction of several major rivers from an aerial perspective.

[0015] Finally, three-dimensional model data units are obtained. These data units are three-dimensional models of terrain or individual plants generated based on LiDAR scanning and oblique photogrammetry techniques. A three-dimensional model data unit describes the complete morphology of a century-old pine tree in a specific area. Its file format is a specific three-dimensional model format, containing a specific number of vertices and a specific number of triangular faces, and is accompanied by a high-resolution color texture map. This three-dimensional model data unit records in detail the spatial geometry and surface texture details of the trunk, branches, and pine needles.

[0016] Step S120: Perform cross-modal data unit quality feature extraction processing on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit. The multidimensional quality feature vector set includes data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector.

[0017] This step aims to systematically quantify the quality of each modal data unit acquired above, transforming qualitative quality attributes into quantitative feature vectors. The implementation of this step relies on a series of parallel feature extraction sub-steps for different modal data.

[0018] Step S121: Obtain text data units from the original multimodal dataset, perform text integrity scanning processing on the text data units, identify missing character segments, semantic break segments, and grammatical structure abnormal segments in the text data units, and generate data integrity feature vectors for the text data units based on the length of the missing character segments, the number of semantic break segments, and the number of grammatical structure abnormal segments.

[0019] For the survey note text data units in the aforementioned biodiversity monitoring project, this embodiment first performs text integrity scanning processing. The content of the text data unit is read as a character stream and scanned character by character; continuous sequences of garbled symbols or empty characters are recorded as missing character segments. For example, if several characters in the description of the altitude range are garbled or empty, the length of the missing segment is the number of affected characters. Simultaneously, a semantic integrity detection method based on dependency parsing checks whether the subject-verb-object structure of each sentence is complete. For example, if a sentence reporting the observation of an adult male golden snub-nosed monkey ends at "its activity range" with no predicate or object following, it is marked as a semantic break segment, and the count of such segments is incremented. Furthermore, a preset grammar rule base is used to detect spelling errors or inappropriate word collocations in the text. For example, if the description of the activity range uses a word that does not collocate properly with "surrounding" or "forest belt," the sentence is marked as a grammatically abnormal segment, and the count is incremented. Finally, the total length of missing character segments, the number of semantic break segments, and the number of grammatical structure abnormal segments are combined into a three-dimensional vector, which serves as the data integrity feature vector for this text data unit.
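The missing-character part of this scan can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "garbled symbols or empty characters" means runs of the Unicode replacement character (U+FFFD) or NUL, and it takes the semantic-break and grammar-anomaly counts as inputs because the dependency parser and grammar rule base are not specified. The function name and sample note are hypothetical.

```python
import re

# Assumption: a "missing character segment" is a run of U+FFFD or NUL characters.
MISSING_RUN = re.compile(r"[\ufffd\x00]+")

def text_integrity_vector(text, semantic_breaks, grammar_anomalies):
    """Return [total missing length, semantic break count, grammar anomaly count]."""
    missing_length = sum(len(m.group()) for m in MISSING_RUN.finditer(text))
    return [missing_length, semantic_breaks, grammar_anomalies]

# Hypothetical survey note with three garbled characters in the altitude value.
note = "Altitude range 1\ufffd\ufffd\ufffd0 m; its activity range"
print(text_integrity_vector(note, semantic_breaks=1, grammar_anomalies=0))
```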

[0020] Step S122: Perform source tracing processing on the text data unit, extract the publisher's digital signature, publishing platform authorization certificate and content modification history of the text data unit, and generate a data source reliability feature vector of the text data unit based on the verification status of the publisher's digital signature, the validity period of the publishing platform authorization certificate and the number of modifications in the content modification history.

[0021] For the same text data unit, source tracing processing begins by parsing its metadata. First, the publisher's digital signature is extracted. The publisher identifier for this survey note text is a specific researcher's number. The system verifies the validity of this digital signature by comparing it with a preset trusted researcher digital certificate repository. If the verification passes, the verification status value is recorded as valid; otherwise, it is recorded as invalid. Second, the publishing platform authorization certificate is extracted. This data was published from an internally authorized field survey data management system. The system checks the validity period of this certificate. If the current date is within the certificate's validity period, the validity period status value is recorded as valid; otherwise, it is recorded as invalid. Finally, the content modification history is extracted. The version history of this text data unit shows that it has been modified a specific number of times, and this number of modifications is recorded as a numerical value. This information is combined into a three-dimensional vector, serving as the data source reliability feature vector for this text data unit.
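As a rough sketch of how the three source-tracing results could be combined, the example below encodes "valid" as 1 and "invalid" as 0. That numeric encoding, the function name, and the sample dates are assumptions; signature verification itself is treated as an already-computed boolean.

```python
from datetime import date

def source_reliability_vector(signature_valid, cert_not_before, cert_not_after,
                              modification_count, today):
    """Return [signature status, certificate status, modification count]."""
    # The certificate counts as valid only if today lies inside its validity period.
    cert_valid = cert_not_before <= today <= cert_not_after
    return [1 if signature_valid else 0, 1 if cert_valid else 0, modification_count]

vector = source_reliability_vector(
    signature_valid=True,
    cert_not_before=date(2025, 1, 1),
    cert_not_after=date(2026, 12, 31),
    modification_count=2,
    today=date(2026, 3, 31),
)
print(vector)
```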

[0022] Step S123: Perform content timeliness determination processing on the text data unit, extract the publication timestamp of the text data unit, the occurrence timestamp of the event referenced in the content, and the term update mark used in the content, and generate the data timeliness feature vector of the text data unit according to the time difference between the publication timestamp and the current time, the time difference between the occurrence timestamp and the publication timestamp, and the update version number of the term update mark.

[0023] The text data unit is further processed. The timeliness determination process first extracts its publication timestamp and calculates the time difference between it and the current system time, converting this time difference into a value in hours. Second, the occurrence timestamp of the cited event is extracted from the text content, and the time difference between this time and the publication timestamp is calculated, yielding a value in hours. Finally, update markers for terms used in the text are identified. The system queries a biological taxonomy nomenclature database to obtain the latest version number of the term and compares it with the version used in the text. The difference in update version numbers is used as the numerical value; if there is no update, the difference is zero. Ultimately, a three-dimensional timeliness feature vector is generated.
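The three timeliness values can be computed with ordinary timestamp arithmetic. The sketch below assumes that term versions are plain integers and that all timestamps share one time zone; the function names and sample values are illustrative only.

```python
from datetime import datetime

def hours_between(earlier, later):
    """Time difference in hours between two datetimes."""
    return (later - earlier).total_seconds() / 3600.0

def text_timeliness_vector(published, event, now, term_version_used, term_version_latest):
    """Return [hours since publication, hours from event to publication, version gap]."""
    return [
        hours_between(published, now),
        hours_between(event, published),
        term_version_latest - term_version_used,  # zero when the term is up to date
    ]

timeliness = text_timeliness_vector(
    published=datetime(2026, 1, 2, 0, 0),
    event=datetime(2026, 1, 1, 0, 0),
    now=datetime(2026, 1, 3, 12, 0),
    term_version_used=3,
    term_version_latest=5,
)
print(timeliness)
```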

[0024] Step S124: Perform format structure verification processing on the text data unit, identify the title hierarchy structure, paragraph division structure and citation mark structure of the text data unit, and generate the data format standardization feature vector of the text data unit based on the completeness of the title hierarchy structure, the coherence of the paragraph division structure and the standardization of the citation mark structure.

[0025] Finally, the text data unit undergoes format and structure validation. First, its heading hierarchy is checked to identify the presence of first-level headings, second-level headings, and their nesting relationships. If the heading hierarchy increases continuously from level one without jumps, the completeness of the heading hierarchy is marked as high; otherwise, it is marked as low. Second, the paragraph structure is checked to analyze the presence of obvious transition sentences or paragraphs and the uniformity of paragraph length. If the paragraphs are logically coherent and of uniform length, the coherence of the paragraph structure is marked as high; otherwise, it is marked as low. Finally, the citation structure is checked to identify whether the references in the text use a consistent citation format, such as author-year or numbered citations. If all citations follow the same standard and are correctly positioned, the standardization of the citation structure is marked as high; otherwise, it is marked as low. This information is combined into a three-dimensional vector, serving as the data format standardization feature vector for this text data unit.
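Two of these checks lend themselves to a short sketch: heading levels that never skip a level, and citations that use only one style. The regular expressions below are simplified assumptions about what author-year and numbered citations look like, and paragraph coherence is left out because the text does not define how it is measured.

```python
import re

# Simplified, assumed citation patterns: "(Smith, 2020)" and "[3]".
AUTHOR_YEAR = re.compile(r"\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)")
NUMBERED = re.compile(r"\[\d+\]")

def heading_hierarchy_complete(levels):
    """True if heading levels start at 1 and never jump by more than one level."""
    previous = 0
    for level in levels:
        if level > previous + 1:  # e.g. a level-3 heading directly under level 1
            return False
        previous = level
    return True

def citation_style_consistent(text):
    """True unless author-year and numbered citations are mixed in the same text."""
    return not (AUTHOR_YEAR.search(text) and NUMBERED.search(text))

print(heading_hierarchy_complete([1, 2, 2, 3, 1]))
print(citation_style_consistent("See (Smith, 2020) and [3]."))
```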

[0026] Step S125: Perform vector concatenation on the data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the text data unit to generate a multi-dimensional quality feature vector set corresponding to the text data unit.

[0027] After completing the feature extraction of the above four dimensions, this step performs a vector concatenation operation. The four three-dimensional vectors—data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector—are concatenated end to end in a preset order to form a twelve-dimensional vector, which serves as the set of multi-dimensional quality feature vectors corresponding to this text data unit.
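The concatenation itself is straightforward. A minimal sketch, assuming each feature vector is a list of three numbers and that the preset order is integrity, reliability, timeliness, standardization (the sample values are made up):

```python
def concatenate_quality_features(integrity, reliability, timeliness, standardization):
    """Join four 3-dimensional vectors end to end into one 12-dimensional vector."""
    parts = [integrity, reliability, timeliness, standardization]
    if any(len(part) != 3 for part in parts):
        raise ValueError("each quality feature vector must have exactly 3 components")
    return [value for part in parts for value in part]

quality_vector = concatenate_quality_features(
    [3, 1, 0], [1, 1, 2], [36.0, 24.0, 2], [1, 1, 0]
)
print(len(quality_vector))
```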

[0028] Step S126: Obtain image data units from the original multimodal dataset, perform image integrity detection processing on the image data units, scan the pixel missing regions, compression artifact regions and color channel missing regions of the image data units, and generate the data integrity feature vector of the image data units based on the area ratio of the pixel missing regions, the texture complexity of the compression artifact regions and the number of color channel missing regions.

[0029] For the image data unit captured by the aforementioned infrared-triggered camera, this embodiment first performs image integrity detection processing. The process includes traversing the pixel matrix of the image data unit, identifying continuous regions with zero or outlier pixel values (denoted as pixel missing regions), and calculating the proportion of this region's area to the entire image area to obtain the area ratio of the pixel missing regions. Secondly, discrete cosine transform is used to analyze the frequency domain characteristics of the image, identifying regions with abnormal blocky structures at high-frequency components (denoted as compression artifact regions), and calculating the entropy value of the gray-level co-occurrence matrix of adjacent pixel values within these regions as a texture complexity parameter. Finally, the color space of the image is examined, identifying regions where any one of the red, green, or blue channels is completely missing data (denoted as color channel missing regions), and their number is accumulated. This information is combined into a three-dimensional vector as the data integrity feature vector of the image data unit.
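Two of the quantities above can be illustrated in plain Python on a small grayscale matrix. This sketch simplifies the description in two ways, both assumptions: every zero-valued pixel counts as missing (no connected-region analysis), and the gray-level co-occurrence matrix is built only from horizontally adjacent pixel pairs over the whole matrix rather than inside detected artifact regions.

```python
import math
from collections import Counter

def missing_pixel_ratio(gray):
    """Fraction of pixels whose value is zero."""
    total = sum(len(row) for row in gray)
    missing = sum(1 for row in gray for value in row if value == 0)
    return missing / total

def glcm_entropy(gray):
    """Shannon entropy (bits) of the co-occurrence counts of horizontal neighbours."""
    pairs = Counter((row[i], row[i + 1]) for row in gray for i in range(len(row) - 1))
    n = sum(pairs.values())
    if n == 0:  # rows shorter than two pixels have no neighbour pairs
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in pairs.values())

tile = [[0, 0, 5, 5],
        [5, 5, 5, 5]]
print(missing_pixel_ratio(tile), glcm_entropy(tile))
```

A perfectly uniform tile has entropy 0; more varied neighbour pairs raise the value, which is why the entropy can serve as a texture complexity parameter.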

[0030] Step S127: Perform image source analysis processing on the image data unit, extract the unique identifier of the shooting device, the image acquisition timestamp, and the geographic coordinate marker information of the image data unit, and generate the data source reliability feature vector of the image data unit based on the matching result of the unique identifier of the shooting device and the list of trusted devices, the continuity of the image acquisition timestamp, and the positioning accuracy of the geographic coordinate marker information.

[0031] The image data unit is further processed. Image source analysis first parses the Exchangeable Image File Format (EXIF) information to extract the unique identifier of the capturing device, which is the camera serial number. The system compares this serial number with a preset list of trusted devices. If a match is found, it is recorded as a trusted device identifier; otherwise, it is recorded as an untrusted device identifier. Next, the image acquisition timestamp is extracted and compared with the timestamps of adjacent images captured by the same camera to check whether they are sequentially continuous. If the timestamps increase sequentially and the intervals are uniform, they are recorded as continuous; otherwise, they are recorded as discontinuous. Finally, the horizontal dilution of precision (HDOP) value from the geographic coordinate marker information is extracted as a positioning accuracy parameter; the smaller the value, the higher the positioning accuracy. This information is combined into a three-dimensional vector, serving as the data source reliability feature vector of the image data unit.
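The timestamp continuity test can be sketched as below. The tolerance for "uniform" intervals is an assumed parameter, and the function reports a sequence of fewer than two timestamps as discontinuous because there is nothing to compare.

```python
from datetime import datetime, timedelta

def timestamps_continuous(timestamps, tolerance_seconds=1.0):
    """True if timestamps strictly increase and the gaps differ by at most the tolerance."""
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    if not gaps or any(gap <= 0 for gap in gaps):
        return False
    return max(gaps) - min(gaps) <= tolerance_seconds

start = datetime(2026, 1, 1, 0, 0, 0)
regular = [start + timedelta(minutes=10 * i) for i in range(3)]
irregular = [start, start + timedelta(minutes=10), start + timedelta(minutes=50)]
print(timestamps_continuous(regular), timestamps_continuous(irregular))
```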

[0032] Step S128: Perform image content timeliness analysis processing on the image data unit, identify the seasonal characteristics, vegetation cover characteristics and building status characteristics of the photographed object in the image data unit, and generate the data timeliness feature vector of the image data unit based on the matching degree between the seasonal characteristics and the shooting time, the change range of the vegetation cover characteristics and the comparison results of the building status characteristics with historical records.

[0033] The image data unit is further processed. The image content timeliness analysis first uses a pre-trained convolutional neural network model to identify leaf color, flower opening status, and sky hue features in the image to infer the season in which the image was taken, such as spring, summer, autumn, or winter. This inferred season is compared with the season implied by the image acquisition timestamp; if they agree, the result is recorded as a match; otherwise, it is recorded as a mismatch. Next, the image is compared with historical images of the same location from the same period to calculate the magnitude of change in vegetation cover. For example, the difference in the normalized difference vegetation index (NDVI) is used to quantify the change in vegetation cover, yielding a change magnitude parameter. Finally, the state of man-made buildings in the image is identified and compared with the state of buildings in historical records. For example, the exterior wall color and roof structure of buildings are compared for significant changes; if they match the records, they are marked as consistent; otherwise, they are marked as changed. This information is combined into a three-dimensional vector, serving as the data timeliness feature vector for this image data unit.
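The vegetation change parameter can be illustrated with the standard NDVI formula, (NIR - Red) / (NIR + Red). This sketch assumes that the vegetation index in the text is NDVI and that mean near-infrared and red reflectance values are available for both the current and the historical image; the sample reflectances are made up.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    if nir + red == 0:
        raise ValueError("NIR and red reflectance cannot both be zero")
    return (nir - red) / (nir + red)

def vegetation_change(current, historical):
    """NDVI difference between the current and historical (nir, red) pairs."""
    return ndvi(*current) - ndvi(*historical)

# Hypothetical mean reflectances: (nir, red).
change = vegetation_change(current=(0.6, 0.2), historical=(0.5, 0.1))
print(round(change, 4))
```

A negative value indicates that vegetation cover has decreased relative to the historical image.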

[0034] Step S129: Perform image format standardization verification processing on the image data unit, detect the encoding format identifier, resolution parameter and color space configuration information of the image data unit, and generate the data format standardization feature vector of the image data unit based on the matching status of the encoding format identifier with the preset standard format, the deviation of the resolution parameter from the standard resolution and the completeness of the color space configuration information.

[0035] The image data unit is further processed. Image format standardization verification first checks the encoding format identifier in the file header, such as the identifier for a specific image format, and matches it against a preset list of standard formats. If the match succeeds, it is marked as a format match; otherwise, it is marked as a format mismatch. Next, the image's resolution parameters, such as the number of pixels in width and height, are extracted, and the degree of deviation from the standard resolution is calculated. If the deviation is within the allowable range, it is marked as a low deviation; otherwise, it is marked as a high deviation. Finally, the color space configuration information is checked, including the color space type (e.g., RGB or CMYK), the number of color channels, and the bit depth. If all necessary information is present and complete, it is marked as complete; otherwise, it is marked as missing. This information is combined into a three-dimensional vector, serving as the data format standardization feature vector for this image data unit.
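A header check of this kind is commonly done by comparing leading bytes against known file signatures. The sketch below uses the PNG and JPEG signatures as an assumed "standard format list" and measures resolution deviation as a relative difference in pixel count, which is one possible definition and not one stated in the text.

```python
# Assumed list of accepted formats, keyed by their leading file signature bytes.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}

def detect_format(header):
    """Return the matching format name, or None for a format mismatch."""
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None

def resolution_deviation(width, height, std_width, std_height):
    """Relative difference in pixel count from the standard resolution."""
    standard = std_width * std_height
    return abs(width * height - standard) / standard

print(detect_format(b"\x89PNG\r\n\x1a\n\x00\x00"))
print(resolution_deviation(1280, 720, 1920, 1080))
```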

[0036] Step S1210: Perform vector concatenation processing on the data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the image data unit to generate a multi-dimensional quality feature vector set corresponding to the image data unit.

[0037] After completing the feature extraction of the above four dimensions, this step performs a vector concatenation operation. The four three-dimensional vectors of the image data unit—data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector—are concatenated end to end in a preset order to form a twelve-dimensional vector, which serves as the set of multi-dimensional quality feature vectors corresponding to that image data unit.

[0038] Step S1211: Obtain audio data units from the original multimodal dataset, perform audio continuity detection processing on the audio data units, identify silence segments, pop segments, and spectral interruption segments of the audio data units, and generate a data integrity feature vector of the audio data units based on the duration ratio of the silence segments, the number of pop segments, and the frequency domain energy decay rate of the spectral interruption segments.

[0039] For the aforementioned audio data units of golden monkey communication sounds, this embodiment first performs audio continuity detection processing. The processing includes analyzing the waveform amplitude values of the audio data units, identifying segments with amplitude values below a preset threshold and durations exceeding a specific duration, denoted as silent segments, and calculating the proportion of the total duration of all silent segments to the total audio duration to obtain the percentage of silent segment duration. Secondly, segments with waveform amplitude values that momentarily and sharply increase and exceed a preset upper limit threshold are identified and denoted as pop segments, and their number is accumulated. Finally, a short-time Fourier transform is performed on the audio to obtain a spectrum, identifying continuous frequency band regions in the spectrum where energy is suddenly interrupted, denoted as spectrum interruption segments, and calculating the attenuation rate of frequency domain energy on the time axis within these segments to obtain the frequency domain energy attenuation rate value. This information is combined into a three-dimensional vector as the data integrity feature vector of the audio data unit.
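The silence and pop measurements can be sketched on a list of amplitude samples. The thresholds and the minimum run length (in samples, standing in for the "specific duration") are assumed parameters; the spectral interruption analysis is omitted because it depends on an unspecified short-time Fourier transform configuration.

```python
def silence_ratio(samples, threshold, min_run):
    """Fraction of samples in runs of |amplitude| < threshold lasting at least min_run samples."""
    silent, run = 0, 0
    for sample in samples:
        if abs(sample) < threshold:
            run += 1
        else:
            if run >= min_run:
                silent += run
            run = 0
    if run >= min_run:  # count a silent run that extends to the end of the audio
        silent += run
    return silent / len(samples)

def count_pops(samples, upper_limit):
    """Count separate excursions of |amplitude| above upper_limit."""
    pops, inside = 0, False
    for sample in samples:
        above = abs(sample) > upper_limit
        if above and not inside:
            pops += 1
        inside = above
    return pops

clip = [0.0, 0.0, 0.0, 0.5, 0.99, 0.5, 0.0, 0.5]
print(silence_ratio(clip, threshold=0.01, min_run=3), count_pops(clip, upper_limit=0.9))
```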

[0040] Step S1212: Perform audio source verification processing on the audio data unit, extract the voiceprint features, recording device model identifier and recording environment background noise features of the audio data unit, and generate the data source reliability feature vector of the audio data unit based on the matching degree of the voiceprint features with the known voiceprint library, the device category of the recording device model identifier and the stability of the recording environment background noise features.

[0041] The audio data unit is further processed. The audio source verification process first extracts Mel-frequency cepstral coefficients (MFCCs) from the audio as voiceprint features, forming a voiceprint feature vector. This vector is then matched against a known wildlife voiceprint database, and a matching score is calculated; a higher matching score indicates a more reliable source. Next, the audio file metadata is parsed to extract the recording device model identifier, such as a specific model of recorder. Based on this model, the device is categorized as professional-grade or consumer-grade, with professional-grade devices indicating higher reliability. Finally, the background noise characteristics of the recording environment are analyzed, and the variance of the background noise power spectral density over time is calculated. A smaller variance indicates more stable background noise, better environmental consistency, and higher reliability. This information is combined into a three-dimensional vector, serving as the data source reliability feature vector for this audio data unit.
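A minimal sketch of assembling this vector is given below. It assumes the voiceprint matching score has already been computed elsewhere, encodes professional-grade devices as 1 and consumer-grade devices as 0, and summarizes noise stability as the population variance of per-frame noise power. The device model name is hypothetical.

```python
from statistics import pvariance

# Hypothetical set of recorder models treated as professional-grade.
PROFESSIONAL_MODELS = {"REC-PRO-1"}

def audio_source_vector(voiceprint_score, device_model, noise_power_per_frame):
    """Return [voiceprint matching score, device category, noise power variance]."""
    device_category = 1 if device_model in PROFESSIONAL_MODELS else 0
    noise_variance = pvariance(noise_power_per_frame)  # smaller means more stable noise
    return [voiceprint_score, device_category, noise_variance]

print(audio_source_vector(0.9, "REC-PRO-1", [2.0, 4.0]))
```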

[0042] Step S1213: Perform audio content timeliness determination processing on the audio data unit, identify the event mention time of the speech content in the audio data unit, the environmental feature time imprint in the background sound, and the matching degree between the speech rate feature and the era speech rate model, and generate the data timeliness feature vector of the audio data unit according to the interval between the event mention time and the current time, the era identifier of the environmental feature time imprint, and the matching coefficient of the speech rate feature and the era speech rate model.

[0043] The audio data unit is further processed. The audio content timeliness determination process first uses speech recognition technology to convert the audio content into text, extracting the specific event times mentioned in the text, such as the time description of golden monkey migration. The interval between this event time and the current time is calculated to obtain the time interval value. Second, the environmental features and time imprints in the background sounds are analyzed, such as identifying cicada chirping, wind sounds in a specific season, or machine noises that only occur in a specific year. These environmental features are compared with a time database to infer the approximate year of the audio recording and generate a time identifier. Finally, the speech rate characteristics, such as the number of syllables per minute, are analyzed and compared with the average speech rate models of different historical periods. A matching coefficient is calculated; a higher matching coefficient indicates that the speech rate is more consistent with the characteristics of that era. This information is combined into a three-dimensional vector, serving as the data timeliness feature vector for this audio data unit.

[0044] Step S1214: Perform audio format standardization verification processing on the audio data unit, detect the sampling rate parameter, quantization bit depth parameter and channel configuration parameter of the audio data unit, and generate the data format standardization feature vector of the audio data unit based on the deviation value of the sampling rate parameter from the standard sampling rate, the conformity of the quantization bit depth parameter with the standard quantization bit depth, and the matching status of the channel configuration parameter with the preset channel mapping table.

[0045] The audio data unit is further processed. The audio format standardization verification process first checks the audio file's sampling rate parameter, such as a value in hertz within a specific range, and calculates its absolute deviation from the standard sampling rate. Second, it checks the quantization bit depth parameter, such as a number of bits within a specific range, to determine whether it conforms to the standard quantization bit depth. If it does, a compliance flag is recorded; otherwise, a non-compliance flag is recorded. Finally, it checks the channel configuration parameter, such as mono, stereo, or surround sound, and matches it against a preset channel mapping table. If the match succeeds, a match flag is recorded; otherwise, a non-match flag is recorded. This information is combined into a three-dimensional vector, serving as the data format standardization feature vector for this audio data unit.

[0046] Step S1215: Perform vector concatenation on the data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the audio data unit to generate a multi-dimensional quality feature vector set corresponding to the audio data unit.

[0047] After completing the feature extraction of the above four dimensions, this step performs a vector concatenation operation. The four three-dimensional vectors of the audio data unit—data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector—are concatenated end to end in a preset order to form a twelve-dimensional vector, which serves as the set of multi-dimensional quality feature vectors corresponding to the audio data unit.

[0048] Step S1216: Obtain video data units from the original multimodal dataset, perform video frame sequence continuity detection processing on the video data units, identify dropped frame segments, frozen screen segments, and mosaic distortion segments in the video data units, and generate a data integrity feature vector for the video data units based on the frame ratio of the dropped frame segments, the duration of the frozen screen segments, and the block artifact density of the mosaic distortion segments.

[0049] For the aforementioned UAV aerial video data unit, this embodiment first performs video frame sequence continuity detection processing. The video data unit is decoded into a continuous sequence of image frames and the timestamp of each frame is checked; if the timestamp intervals between adjacent frames are uneven or timestamps are missing, the affected spans are identified as dropped frame segments. The proportion of dropped frame segments to the total number of video frames is calculated to obtain the dropped frame percentage. Second, the pixel differences between adjacent frames are analyzed. If the pixel values of multiple consecutive frames remain essentially unchanged, they are identified as frozen screen segments, and their duration is recorded. Finally, during video stream decoding, regularly arranged blocky distortion regions are detected in the image and recorded as mosaic distortion segments. The number of distortion blocks per unit area within these segments is calculated to obtain the block artifact density value. This information is combined into a three-dimensional vector, serving as the data integrity feature vector for this video data unit.
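As a rough illustration of the first two checks, the sketch below estimates a dropped frame ratio from a timestamp sequence and a frozen screen duration from adjacent-frame differences. The function names, the tolerance parameter, the choice of the expected frame count as the denominator, and the representation of frames as flat lists of pixel values are all assumptions made for the example. Block artifact detection is not covered.

```python
def dropped_frame_ratio(timestamps, nominal_interval, tolerance=0.5):
    """Estimate the share of frames missing from a timestamp sequence.

    A gap larger than nominal_interval * (1 + tolerance) is treated as a
    dropped frame segment; the number of missing frames is inferred from
    the gap length (assumed heuristic).
    """
    missing = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > nominal_interval * (1 + tolerance):
            missing += round(gap / nominal_interval) - 1
    expected = len(timestamps) + missing
    return missing / expected if expected else 0.0


def frozen_duration(frames, frame_interval, diff_threshold):
    """Total time covered by frame transitions with almost no pixel change."""
    frozen_transitions = 0
    for prev, cur in zip(frames, frames[1:]):
        mean_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mean_diff < diff_threshold:
            frozen_transitions += 1
    return frozen_transitions * frame_interval


timestamps_ms = [0, 40, 80, 200, 240]          # two frames missing after 80 ms
drop_ratio = dropped_frame_ratio(timestamps_ms, nominal_interval=40)

frames = [[0, 0], [0, 0], [0, 0], [10, 10]]    # first three frames identical
frozen_ms = frozen_duration(frames, frame_interval=40, diff_threshold=1)
```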

[0050] Step S1217: Perform video source tracing processing on the video data unit, extract the acquisition device serial number, encoding generation timestamp and video fingerprint hash value of the video data unit, and generate the data source credibility feature vector of the video data unit based on the matching result of the acquisition device serial number and the authorized device database, the temporal continuity of the encoding generation timestamp and the uniqueness verification status of the video fingerprint hash value.

[0051] The video data unit is further processed. The video source tracing process first extracts the acquisition device serial number from the video file's metadata, such as the drone's serial number. The system compares this serial number with the authorized device database. If a match is found, it is recorded as an authorized device identifier; otherwise, it is recorded as an unauthorized device identifier. Next, the encoding generation timestamps are extracted and checked to see whether the timestamp sequence increases continuously in the order of recording time. If the timestamps are continuous, a continuous identifier is recorded; otherwise, a discontinuous identifier is recorded. Finally, a hash algorithm is used to calculate the video fingerprint hash value of the video content, and this value is compared with the video fingerprints in a public database to verify its uniqueness. If the hash value has no duplicates, a unique identifier is recorded; otherwise, a duplicate identifier is recorded. This information is combined into a three-dimensional vector, serving as the data source credibility feature vector of the video data unit.

[0052] Step S1218: Perform video content timeliness analysis processing on the video data unit, identify the clothing style features, vehicle model features and scene landmark change features of moving objects in the video data unit, and generate the data timeliness feature vector of the video data unit based on the matching degree of the clothing style features and the fashion trend cycle, the production year of the vehicle model features and the update cycle of the scene landmark change features.

[0053] The video data unit is further processed. The video content timeliness analysis first uses an object detection algorithm to identify pedestrians in the video and extracts clothing style features such as collar shape and cuff design. These features are matched against a clothing fashion trend cycle database to calculate a matching degree; a higher matching degree indicates that the clothing style better matches the fashion characteristics of a specific year. Second, the vehicles appearing in the video are identified, such as specific models of SUVs, and the production year range of each model is queried to generate a production year parameter. Finally, landmarks within the video scene are identified, such as the paint color of a bridge or the content of a building's billboard. These are compared with historical video records to calculate how often each landmark changes (for example, how often the paint color is renewed), which yields an update cycle parameter. This information is combined into a three-dimensional vector, serving as the data timeliness feature vector for this video data unit.

[0054] Step S1219: Perform video format standardization verification processing on the video data unit, detect the encoding standard identifier, frame rate parameter and bit rate parameter of the video data unit, and generate the data format standardization feature vector of the video data unit based on the compatibility status of the encoding standard identifier with the preset encoding standard, the deviation of the frame rate parameter from the standard frame rate and the calculation result of the bit rate parameter matching degree with the resolution.

[0055] The video data unit is further processed. The video format compliance verification process first checks the video file's encoding standard identifier, such as the identifier of a specific video encoding standard, and determines whether it is compatible with a preset list of compatible encoding standards. If compatible, it is marked as compatible; otherwise, it is marked as incompatible. Second, the frame rate parameter, such as frames per second within a specific range, is extracted, and its deviation from the standard frame rate is calculated. If the deviation is within the allowable range, it is marked as low deviation; otherwise, it is marked as high deviation. Finally, the bitrate parameter, such as bits per second within a specific range, is checked, and combined with the resolution parameter, the bitrate allocation per pixel is calculated to determine whether the bitrate is sufficient to support the clarity requirements at the current resolution. If the bitrate matches the resolution, it is marked as matched; otherwise, it is marked as mismatched. This information is combined into a three-dimensional vector, serving as the data format compliance feature vector for this video data unit.

[0056] Step S1220: Perform vector concatenation processing on the data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the video data unit to generate a multi-dimensional quality feature vector set corresponding to the video data unit.

[0057] After completing the feature extraction of the above four dimensions, this step performs a vector concatenation operation. The four three-dimensional vectors of the video data unit—data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector—are concatenated end to end in a preset order to form a twelve-dimensional vector, which serves as the set of multi-dimensional quality feature vectors corresponding to that video data unit.

[0058] Step S1221: Obtain the three-dimensional model data units in the original multimodal dataset, perform model integrity detection processing on the three-dimensional model data units, identify the mesh missing regions, texture mapping missing regions, and normal direction inconsistency regions of the three-dimensional model data units, and generate the data integrity feature vector of the three-dimensional model data units based on the area ratio of the mesh missing regions, the pixel missing rate of the texture mapping missing regions, and the number of triangular faces of the normal direction inconsistency regions.

[0059] For the aforementioned three-dimensional model data unit of the ancient pine tree, this embodiment first performs model integrity detection processing. First, all triangular faces of the model are traversed; regions with missing vertices or broken faces are identified as mesh missing regions, and the ratio of their total area to the total surface area of the model is calculated to obtain the area ratio of the mesh missing regions. Second, the model's texture mapping is checked; regions with missing texture coordinates or missing texture image pixels are identified as texture mapping missing regions, and the ratio of missing pixels in these regions to the total number of pixels that should be mapped is calculated to obtain the pixel missing rate. Finally, the normal direction of each triangular face is calculated; regions where the angle between the normals of adjacent faces exceeds a preset threshold are identified as normal direction inconsistency regions, and the number of triangular faces in these regions is counted. These three values are combined into a three-dimensional vector, serving as the data integrity feature vector for this three-dimensional model data unit.

[0060] Step S1222: Perform model source verification processing on the three-dimensional model data unit, extract the creator digital certificate, model generation software identifier and version iteration record of the three-dimensional model data unit, and generate the data source credibility feature vector of the three-dimensional model data unit according to the credibility of the issuing authority of the creator digital certificate, the software version number of the model generation software identifier and the update frequency of the version iteration record.

[0061] The processing of this three-dimensional model data unit continues. The model source verification process first parses the metadata of the model file, extracts the creator's digital certificate, and obtains the name of the certificate issuing authority, such as a well-known surveying and mapping institution. The authority's level in a pre-defined institution credibility rating is used as the issuing authority credibility parameter. Second, the model generation software identifier is extracted, such as the name and version number of the specific modeling software. A higher version number generally indicates more complete functionality, and the version number is used as the software version parameter. Finally, the version iteration record is extracted to examine the modification history of the model file, and the number of updates within a certain period is calculated to obtain the update frequency parameter. A higher update frequency indicates more active model maintenance. This information is combined into a three-dimensional vector, serving as the data source credibility feature vector for this three-dimensional model data unit.

[0062] Step S1223: Perform model timeliness determination processing on the three-dimensional model data unit, identify the modeling style features, model detail precision features and material texture style features of the three-dimensional model data unit, and generate the data timeliness feature vector of the three-dimensional model data unit based on the matching degree of the modeling style features and style evolution cycle, the conformity of the model detail precision features with the technical level of the modeling era, and the era identifier of the material texture style features.

[0063] The three-dimensional model data unit is further processed. The model timeliness determination first analyzes the modeling style features, such as polygon distribution, surface subdivision, and overall shape simplicity. These are compared with a modeling style evolution cycle database to calculate a matching degree; a higher matching degree indicates that the modeling style better matches the characteristics of a specific era. Second, the model's detail precision features are analyzed, such as the clarity of bark texture on the trunk and the realism of the pine needle distribution density. These are compared with the modeling technology levels of different eras to determine whether they conform to the technical level of a specific era, generating a conformity parameter. Finally, the material texture style features are analyzed, such as texture resolution, color saturation, and lighting rendering method. Based on these features, the approximate era in which the material textures were created is inferred, generating an era identifier. This information is combined into a three-dimensional vector, serving as the data timeliness feature vector for this three-dimensional model data unit.

[0064] Step S1224: Perform model format standardization verification processing on the three-dimensional model data unit, detect the format standard identifier, vertex number scale and texture resolution parameter of the three-dimensional model data unit, and generate the data format standardization feature vector of the three-dimensional model data unit according to the compatibility status of the format standard identifier with the preset standard format, the matching result of the vertex number scale with the model complexity standard and the adaptability of the texture resolution parameter with the model accuracy.

[0065] The processing of this three-dimensional model data unit continues. The model format standardization verification process first checks the format standard identifier of the model file, such as the identifier of a specific three-dimensional model format, and determines whether it is compatible with a preset list of standard formats. If compatible, it is marked as compatible; otherwise, it is marked as incompatible. Second, the vertex number scale of the model, such as the order of magnitude of the vertex count, is determined and matched against model complexity standards. For example, a refined model should contain a higher order of magnitude of vertices. If the vertex count meets the complexity standard, it is marked as matched; otherwise, it is marked as mismatched. Finally, the texture resolution parameters, such as the width and height of the texture image in pixels, are checked to determine whether they suit the geometric precision of the model. If a high-precision model is equipped with a high-resolution texture, it is marked as suitable; otherwise, it is marked as unsuitable. This information is combined into a three-dimensional vector, which serves as the data format standardization feature vector for this three-dimensional model data unit.

[0066] Step S1225: Perform vector concatenation on the data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the three-dimensional model data unit to generate a multi-dimensional quality feature vector set corresponding to the three-dimensional model data unit.

[0067] After completing the feature extraction of the above four dimensions, this step performs a vector concatenation operation. The four three-dimensional vectors of the three-dimensional model data unit—data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector—are concatenated end to end in a preset order to form a twelve-dimensional vector, which serves as the set of multi-dimensional quality feature vectors corresponding to that three-dimensional model data unit.

[0068] Step S130: Based on the set of multidimensional quality feature vectors corresponding to each data unit, perform quality-level clustering on the text data unit, image data unit, audio data unit, video data unit and 3D model data unit in a unified representation space to generate at least one quality level group. Each quality level group contains data units from at least two different modalities, and the quality feature vectors of the data units within the same quality level group satisfy a preset similarity range in the unified representation space.

[0069] After extracting the multidimensional quality feature vector set of all data units, this step aims to map these data units of different modalities into the same abstract space and cluster them based on their quality feature vectors, grouping data units with similar quality levels into the same group. The implementation of this step relies on a series of specific clustering sub-steps.

[0070] Step S131: Map the integrity feature vector, source credibility feature vector, timeliness feature vector, and format standardization feature vector in the multidimensional quality feature vector set corresponding to each data unit to the integrity dimension axis, source credibility dimension axis, timeliness dimension axis, and format standardization dimension axis in the unified representation space, respectively, to obtain the quality feature space coordinate point of each data unit in the unified representation space.

[0071] This step first constructs a four-dimensional unified representation space, where the four dimensions correspond to the four core aspects of quality features: integrity, source credibility, timeliness, and format standardization. For each data unit, the integrity feature vector from its multi-dimensional quality feature vector set is treated as a whole and mapped onto the integrity dimension axis of the unified representation space. Similarly, the source credibility feature vector, timeliness feature vector, and format standardization feature vector are mapped onto the source credibility dimension axis, timeliness dimension axis, and format standardization dimension axis, respectively. Since the feature vector on each dimension axis is itself multi-dimensional, each data unit in the unified representation space is actually a composite coordinate point composed of four sub-coordinate points. Each sub-coordinate point is located on its corresponding dimension axis, and together they form the quality feature space coordinate point of that data unit.

[0072] Step S132: Calculate the Euclidean distance between the quality feature space coordinates of any two data units in the unified representation space to obtain a set of quality difference measurement values between data units.

[0073] After obtaining the quality feature space coordinates of each data unit, this step calculates the quality difference metric between any two data units. For two data units A and B, their quality feature space coordinates consist of four sub-coordinate points: A_int and B_int on the integrity dimension axis, A_rel and B_rel on the source credibility dimension axis, A_temp and B_temp on the timeliness dimension axis, and A_form and B_form on the format standardization dimension axis. The Euclidean distance between A_int and B_int is calculated to obtain the distance value d_int. Similarly, d_rel, d_temp, and d_form are obtained. Then, these four distance values are weighted and summed to obtain the final quality difference metric. The smaller this metric is, the closer the two data units are in terms of quality features.
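The weighted sum of per-axis distances described above can be sketched as follows. The dictionary layout, the axis keys, and the equal weights are illustrative assumptions; the source does not state how the weights are chosen.

```python
import math


def quality_difference(a, b, weights):
    """Weighted sum of per-axis Euclidean distances between two data units.

    `a` and `b` map an axis name to that axis's sub-coordinate (a list of
    floats); `weights` maps the same axis names to non-negative weights.
    """
    return sum(w * math.dist(a[axis], b[axis]) for axis, w in weights.items())


# Hypothetical sub-coordinates for two data units A and B.
unit_a = {"int": [0, 0, 0], "rel": [1, 1, 1], "temp": [0, 0, 0], "form": [0, 0, 0]}
unit_b = {"int": [3, 4, 0], "rel": [1, 1, 1], "temp": [0, 0, 2], "form": [0, 0, 0]}
weights = {"int": 0.25, "rel": 0.25, "temp": 0.25, "form": 0.25}  # assumed equal

metric = quality_difference(unit_a, unit_b, weights)  # 0.25 * 5 + 0.25 * 2
```

Here d_int is 5, d_temp is 2, and the other two distances are 0, so the metric is 1.75.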

[0074] Step S133: Using the set of quality difference measurement values as the basis for clustering, execute the density peak clustering algorithm to cluster all quality feature spatial coordinate points in the unified representation space, and identify the core coordinate points with local density maxima.

[0075] This step employs a density peak clustering algorithm to cluster all quality feature spatial coordinate points in the unified representation space. First, for each data unit's quality feature spatial coordinate point, its local density is calculated. Local density is defined as the number of other data units whose quality difference metric from that point is less than a preset cutoff distance. Then, the minimum quality difference metric between each point and points with higher local densities is calculated. Finally, points with higher local densities and larger minimum distances to higher-density points are identified as core coordinate points.
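A minimal sketch of the density peak computation described above, operating on a precomputed matrix of quality difference metrics. The function names and the two selection thresholds `rho_min` and `delta_min` are assumptions, since the source only says that points with "higher" density and "larger" distance are selected. For the point with the highest density, where no higher-density point exists, the sketch uses its largest distance to any other point, which is a common convention and also an assumption here.

```python
def density_peaks(dist, cutoff):
    """Return local density (rho) and distance to nearest denser point (delta)."""
    n = len(dist)
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # No denser point exists: fall back to the largest distance (assumed).
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def pick_cores(rho, delta, rho_min, delta_min):
    """Select points whose density and separation both reach the thresholds."""
    return [i for i in range(len(rho)) if rho[i] >= rho_min and delta[i] >= delta_min]


# Toy example: six one-dimensional points forming two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
rho, delta = density_peaks(dist, cutoff=1.5)
cores = pick_cores(rho, delta, rho_min=2, delta_min=5)
```

Points with tied maximum density are each treated as having no denser neighbor, so ties can produce several core points.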

[0076] Step S134: Using each core coordinate point as the center, allocate other quality feature spatial coordinate points to the clusters closest to the core coordinate points according to the preset cutoff distance parameter, and generate an initial cluster set.

[0077] After identifying the core coordinate points, this step involves cluster assignment. For each non-core quality feature spatial coordinate point, the quality difference metric between it and all core coordinate points is calculated. The core coordinate point with the smallest metric is found, and this point is assigned to the cluster containing that core coordinate point. This process is repeated until the quality feature spatial coordinate points of all data units are assigned, forming an initial set of clusters. Each initial cluster corresponds to one core coordinate point and contains the quality feature spatial coordinate points of several data units.
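The assignment rule in the paragraph above, in which each non-core point joins the cluster of the core point with the smallest quality difference metric, can be sketched as follows. The function name and the dictionary-of-lists return type are assumptions, and the sketch does not apply the cutoff distance parameter mentioned in step S134.

```python
def assign_to_cores(dist, cores):
    """Assign every non-core point to the cluster of its nearest core point.

    `dist` is a square matrix of quality difference metrics and `cores` is a
    list of core point indices. Returns {core_index: [member indices]}.
    """
    clusters = {c: [c] for c in cores}
    for i in range(len(dist)):
        if i in clusters:          # core points already seed their own cluster
            continue
        nearest = min(cores, key=lambda c: dist[i][c])
        clusters[nearest].append(i)
    return clusters


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
initial_clusters = assign_to_cores(dist, cores=[1, 4])
```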

[0078] Step S135: Perform intra-cluster quality consistency assessment on each cluster in the initial cluster set, calculate the average distance between all quality feature spatial coordinate points in each cluster and the cluster center point, mark the clusters with average distance less than a preset threshold as the quality level group, and treat the quality feature spatial coordinate points in the clusters with average distance greater than the preset threshold as isolated data points for re-clustering iteration, until all quality feature spatial coordinate points are classified into quality level groups with average distance less than the preset threshold.

[0079] This step evaluates the quality consistency of the initial clusters generated in step S134. For each initial cluster, its cluster center is calculated, which is the geometric center of all quality feature spatial coordinate points within the cluster. Then, the Euclidean distance from each quality feature spatial coordinate point within the cluster to the cluster center is calculated, and the average of these distances is taken to obtain the average distance within the cluster. If the average distance is less than a preset quality consistency threshold, the cluster is marked as a qualified quality level group. If the average distance is greater than or equal to the threshold, it indicates that the quality characteristics of the data units within the cluster vary greatly, and the consistency is poor. In this case, all quality feature spatial coordinate points within the cluster are marked as isolated data points. Then, steps S133 and S134 are re-executed for these isolated data points, i.e., the local density is recalculated, the core coordinate points are identified, and cluster assignments are performed until all quality feature spatial coordinate points are assigned to quality level groups with an average distance less than the preset threshold. Finally, several quality level groups are generated, and the data units within each group are highly consistent in terms of quality characteristics.
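One pass of the consistency check described above might look like the following sketch. It assumes each quality feature spatial coordinate point has been flattened into a single list of floats, and the function names are hypothetical. The re-clustering of isolated points (repeating steps S133 and S134) is left to the caller.

```python
import math


def centroid(points):
    """Geometric center of a list of equal-length coordinate lists."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def mean_distance_to_centroid(points):
    """Average Euclidean distance from each point to the cluster center."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def split_by_consistency(clusters, threshold):
    """Separate clusters into accepted quality level groups and isolated points."""
    accepted, isolated = [], []
    for pts in clusters:
        if mean_distance_to_centroid(pts) < threshold:
            accepted.append(pts)
        else:
            isolated.extend(pts)   # to be re-clustered in the next iteration
    return accepted, isolated


clusters = [[[0, 0], [2, 0]], [[0, 0], [10, 0]]]
groups, leftovers = split_by_consistency(clusters, threshold=2)
```

A cluster containing a single point has an average distance of zero, so with a positive threshold it always qualifies as a group.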

[0080] Step S140: Call the pre-trained quality label generation model to perform group feature extraction processing on each quality level group to obtain the quality label feature set corresponding to the quality level group. The quality label feature set includes group feature identifiers, quality consistency measurement parameters between data units within the group, and quality difference representation vectors between groups.

[0081] After completing the quality tier classification, this step calls a pre-trained deep learning model—the quality label generation model—to generate a structured set of quality label features for each quality tier group. This model is specifically designed to extract abstract features from a set of data units that can represent its overall quality level, internal consistency, and relative differences. The implementation of this step relies on a series of processing sub-steps within the model.

[0082] Step S141: Input the quality feature spatial coordinates of all data units contained in the quality level group into the group feature encoder of the quality label generation model. The group feature encoder performs nonlinear mapping processing on each quality feature spatial coordinate point through a multilayer perceptron network to generate the intra-group embedded feature vector of each data unit.

[0083] For each quality level group, the quality feature space coordinates of all its data units in the unified representation space are first fed into the group feature encoder module of the quality label generation model as input. This group feature encoder consists of a multilayer perceptron network, including an input layer, several hidden layers, and an output layer. The input layer receives each quality feature space coordinate point, which is composed of sub-coordinate points on four dimension axes. Each sub-coordinate point is itself a multidimensional vector, so the number of neurons in the input layer matches the total dimensionality of these sub-coordinate points. In the first hidden layer, the input vector undergoes a linear transformation followed by a nonlinear activation function, such as the rectified linear unit (ReLU), mapping the input to a higher-dimensional feature space. Subsequent hidden layers continue to apply nonlinear transformations to the features, progressively extracting more abstract features. Finally, the output layer outputs a fixed-dimensional vector, namely the intra-group embedded feature vector of that data unit. In this way, the group feature encoder maps the original quality feature space coordinates to a new embedding space that is more suitable for subsequent aggregation and analysis.
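The forward pass of such an encoder can be sketched in plain Python. The layer sizes (12, 32, 16, 8) and the randomly initialized weights are placeholders only; the source states that the model is pre-trained but gives neither its dimensions nor its parameters.

```python
import random


def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random placeholder parameters standing in for trained ones."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(x, layers):
    """Hidden layers use ReLU; the output layer is left linear (assumed)."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(linear(h, weights, bias))
    weights, bias = layers[-1]
    return linear(h, weights, bias)


rng = random.Random(0)
layers = [init_layer(12, 32, rng), init_layer(32, 16, rng), init_layer(16, 8, rng)]
embedding = encode([0.1] * 12, layers)   # 12-dimensional input, 8-dimensional output
```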

[0084] Step S142: Input the intra-group embedded feature vector into the group aggregation layer of the quality label generation model, and perform average pooling on the intra-group embedded feature vectors of all data units in the quality level group to obtain the group global feature vector reflecting the overall quality level of the group.

[0085] After obtaining the intra-group embedding feature vector for each data unit within a group, this step inputs these vectors into the group aggregation layer of the quality label generation model. This group aggregation layer performs average pooling, which means that for all data units in the current quality level group, the average of their intra-group embedding feature vectors is calculated dimension by dimension. Specifically, assuming a quality level group contains K data units, and each data unit's intra-group embedding feature vector is a D-dimensional vector, then the group global feature vector obtained after average pooling is also a D-dimensional vector, where the value of its j-th dimension is equal to the sum of the j-th dimension values of all K data units' intra-group embedding feature vectors divided by K. This group global feature vector integrates the quality feature information of all data units within the group, reflecting the overall quality level of that quality level group.
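The dimension-by-dimension average described above is short enough to show directly; the function name is an assumption.

```python
def average_pool(embeddings):
    """Mean of K D-dimensional vectors, computed per dimension."""
    k = len(embeddings)
    d = len(embeddings[0])
    return [sum(e[j] for e in embeddings) / k for j in range(d)]


group_vector = average_pool([[1, 2], [3, 6]])   # K = 2, D = 2
```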

[0086] Step S143: Input the global feature vector of the group into the intra-group consistency analysis module of the quality label generation model, calculate the cosine similarity between the global feature vector of the group and the intra-group embedded feature vector of each data unit in the group, and generate the quality consistency measurement parameter between the data units in the group according to the distribution of the cosine similarity. The quality consistency measurement parameter includes the similarity mean and the similarity standard deviation.

[0087] This step aims to quantify the quality consistency within a quality level group. The intra-group consistency analysis module receives the group global feature vector generated in step S142 and all intra-group embedded feature vectors generated in step S141. For each data unit within a group, the cosine similarity between its intra-group embedded feature vector and the group global feature vector is calculated. The cosine similarity is calculated by dividing the dot product of the two vectors by the product of their magnitudes. The result lies in the interval [-1, 1]; a larger value indicates that the two vectors are more consistent in direction, meaning that the quality characteristics of the data unit are more similar to the overall level of the group. After calculating the cosine similarity between all data units within a group and the group global feature vector, a cosine similarity set is formed. Then, the arithmetic mean of this set is calculated to obtain the similarity mean, and the standard deviation of this set is calculated to obtain the similarity standard deviation. The similarity mean reflects the average level of consistency between the quality of data units within a group and the overall quality of the group, while the similarity standard deviation reflects how much the degree of quality consistency fluctuates among data units within a group. These two values together constitute the quality consistency measurement parameters.
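A minimal sketch of this calculation is given below. The text does not state whether the standard deviation is the population or the sample form; the population form (`statistics.pstdev`) is assumed here, and the function names are illustrative.

```python
import math
import statistics

def cosine_similarity(a, b):
    # Dot product divided by the product of the two magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def consistency_parameters(embeddings, group_vector):
    # Returns (similarity mean, similarity standard deviation).
    similarities = [cosine_similarity(e, group_vector) for e in embeddings]
    return statistics.fmean(similarities), statistics.pstdev(similarities)

embeddings = [[1.0, 0.0], [2.0, 0.0]]
group_vector = [1.5, 0.0]  # average of the two embeddings
sim_mean, sim_std = consistency_parameters(embeddings, group_vector)
```

In this example both embeddings point in the same direction as the group vector, so the mean is 1 and the standard deviation is 0, which corresponds to a perfectly consistent group.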

[0088] Step S144: Input the global feature vector of the quality level group and the global feature vector of the adjacent quality level group into the inter-group difference analysis module of the quality label generation model, calculate the feature difference vector between the global feature vectors of different groups, and generate the inter-group quality difference representation vector based on the norm of the feature difference vector.

[0089] This step aims to quantify the quality differences between different quality level groups. The inter-group difference analysis module receives the global feature vector of the current quality level group and the global feature vectors of its adjacent quality level groups (groups with similar quality levels). For each adjacent group, this module calculates the feature difference vector between the two global feature vectors, i.e., it subtracts the values in corresponding dimensions to obtain a new vector, where each dimension's value represents the degree of difference between the two groups in the corresponding quality feature dimension. Then, the norm of this feature difference vector is calculated. The norm is a measure of vector length, typically the Euclidean norm (square root of the sum of squares). The magnitude of this norm reflects the overall quality difference between the two groups. Since a group may have more than one adjacent group, an inter-group quality difference representation vector containing information about its differences from all adjacent groups can be generated for each group. This vector consists of multiple norm values, each corresponding to one adjacent group.
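The following sketch illustrates the difference-and-norm calculation with the Euclidean norm. The function names and sample vectors are assumptions for illustration.

```python
import math

def difference_vector(a, b):
    # Subtract values in corresponding dimensions.
    return [x - y for x, y in zip(a, b)]

def euclidean_norm(vector):
    # Square root of the sum of squares.
    return math.sqrt(sum(x * x for x in vector))

def inter_group_difference(current, neighbours):
    # One norm value per adjacent quality level group.
    return [euclidean_norm(difference_vector(current, n)) for n in neighbours]

current = [3.0, 4.0]
neighbours = [[0.0, 0.0], [3.0, 1.0]]
diff_repr = inter_group_difference(current, neighbours)  # [5.0, 3.0]
```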

[0090] Step S145: Concatenate the group identifier of the quality level group with the group global feature vector, the quality consistency measurement parameter and the inter-group quality difference representation vector to generate the quality label feature set corresponding to the quality level group.

[0091] After obtaining all the above information, this step performs the final concatenation process to generate a quality label feature set. First, a unique group identifier is generated for this quality level group, such as "Quality Level Group_01". Then, the group global feature vector generated in step S142, the quality consistency metric parameters (including the mean and standard deviation of similarity) generated in step S143, and the inter-group quality difference representation vector generated in step S144 are concatenated. The concatenation operation links these vectors or parameters end-to-end in a preset order to form a larger vector. Finally, the group identifier is used as the label header, and together with the concatenated vector, they form a structured data object, namely the quality label feature set corresponding to this quality level group.
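One possible shape for the resulting structured data object is sketched below. The class name `QualityLabelFeatureSet`, its field names, and the concatenation order (global vector, then consistency parameters, then difference vector) are assumptions; the text only requires a preset order.

```python
from dataclasses import dataclass

@dataclass
class QualityLabelFeatureSet:
    group_id: str   # label header
    features: list  # concatenated vector

def build_label_feature_set(group_id, group_vector, sim_mean, sim_std, diff_repr):
    # Link the parts end to end in a fixed, preset order.
    features = list(group_vector) + [sim_mean, sim_std] + list(diff_repr)
    return QualityLabelFeatureSet(group_id=group_id, features=features)

label = build_label_feature_set(
    "Quality Level Group_01", [0.2, 0.4], 0.95, 0.03, [1.5]
)
```

Because the order is fixed, a consumer that knows D (the embedding dimension) can slice the concatenated vector back into its three parts.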

[0092] Step S150: Based on the quality label feature set, perform label embedding processing on the data units within each quality level group, and append the quality label feature set to the metadata structure of each data unit in the form of label embedding vectors to generate a labeled multimodal dataset carrying quality label information.

[0093] After obtaining the quality label feature set for each quality level group, this step embeds the label information into the metadata structure of each data unit within that group, making the quality label part of the data itself. The implementation of this step depends on a series of specific embedding operation sub-steps.

[0094] Step S151: Extract the group global feature vector from the quality label feature set corresponding to the quality level group, use the group global feature vector as the basic embedding vector, and scale the basic embedding vector according to the mean similarity in the quality consistency measurement parameter between data units within the group to obtain the group embedding vector.

[0095] For each quality level group, the group global feature vector is first extracted from its quality label feature set; this vector serves as the base embedding vector. Then, the similarity mean is extracted from the quality consistency measurement parameters of that group. As a mean of cosine similarities, this value lies in the interval [-1, 1]; a higher value indicates higher intra-group consistency. Using the similarity mean as a scaling factor, each component of the base embedding vector is multiplied by it, resulting in a new vector called the group embedding vector. In this way, groups with higher intra-group consistency keep group embedding vector values of larger magnitude relative to less consistent groups, so their high consistency is reflected more prominently in subsequent embeddings.
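The scaling can be sketched in a few lines; the function name `group_embedding` and the sample values are illustrative assumptions. Because a mean of cosine similarities cannot exceed 1, the scaling never increases the magnitude of the base vector: a factor of 1 leaves it unchanged and smaller factors shrink it.

```python
def group_embedding(base_vector, similarity_mean):
    # Multiply every component by the same scaling factor.
    return [similarity_mean * x for x in base_vector]

base = [2.0, -4.0]
consistent = group_embedding(base, 1.0)  # [2.0, -4.0]
scattered = group_embedding(base, 0.5)   # [1.0, -2.0]
```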

[0096] Step S152: Extract the inter-group quality difference representation vector from the quality label feature set corresponding to the quality level group, and perform dimensional expansion processing on the inter-group quality difference representation vector to obtain a difference embedding vector with the same dimension as the group embedding vector.

[0097] Continue by extracting the inter-group quality difference representation vector from the quality label feature set. This vector may contain multiple norm values, and its dimension may not match the dimension of the group embedding vector. Therefore, it needs to be dimension-expanded to match the dimension of the group embedding vector. Dimension expansion can be achieved through linear mapping, for example, by inputting the inter-group quality difference representation vector into a fully connected layer. The number of input neurons in this fully connected layer is the same as the dimension of the inter-group quality difference representation vector, and the number of output neurons is the same as the dimension of the group embedding vector. After linear transformation and nonlinear activation of this fully connected layer, a vector with the same dimension as the group embedding vector is output, called the difference embedding vector.
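A minimal sketch of such a fully connected layer follows. The activation is not named in the text, so ReLU is assumed; the weight matrix, biases, and the function name `expand` are hypothetical, and in practice the weights would be learned.

```python
def expand(diff_vector, weights, biases):
    # weights has one row per output dimension (the group embedding dimension)
    # and one column per input dimension (the number of norm values).
    pre_activation = [
        sum(w * x for w, x in zip(row, diff_vector)) + b
        for row, b in zip(weights, biases)
    ]
    return [max(0.0, value) for value in pre_activation]  # ReLU, assumed

diff_repr = [5.0, 3.0]            # two adjacent groups
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0],
           [1.0, -2.0]]           # expands 2 dimensions to 4
biases = [0.0, 0.0, 0.0, 0.0]
difference_embedding = expand(diff_repr, weights, biases)  # [5.0, 3.0, 8.0, 0.0]
```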

[0098] Step S153: Perform vector addition on the group embedding vector and the difference embedding vector to generate a composite label embedding vector containing overall group quality information and relative quality information between groups.

[0099] This step performs vector addition on the group embedding vector generated in step S151 and the difference embedding vector generated in step S152, that is, the values of the corresponding dimensions of the two vectors are added together to obtain a new vector, called the composite label embedding vector. This vector contains both the overall quality feature information within the group (group embedding vector) and the relative quality difference information between this group and other groups (difference embedding vector), thus reflecting the quality characteristics of this quality level group more comprehensively.

[0100] Step S154: Split the composite label embedding vector into multiple sub-embedded vector fragments, determine the allocation weight of each sub-embedded vector fragment according to the data type of each data unit in the quality level group, and inject the allocated sub-embedded vector fragments into the reserved fields in the metadata structure of each data unit. The reserved fields in the metadata structure include the metadata extension fields of the text data unit, the custom field of the Exchangeable Image File Format (EXIF) of the image data unit, the ID3 extension tag of the audio data unit, the container metadata extension area of the video data unit, and the metadata block reserved space of the 3D model data unit.

[0101] This step embeds the composite label embedding vector into the metadata of the data unit. First, the composite label embedding vector is split into multiple sub-embedded vector fragments according to a preset partitioning rule. For example, a fixed-dimensional vector can be evenly divided into several smaller vectors. Then, for each data unit within the quality level group, the weight of each sub-embedded vector fragment is determined based on the data type of the data unit. For example, for text data units, integrity features may be more important, so sub-embedded vector fragments more strongly correlated with integrity features can be assigned higher weights; for image data units, source credibility features may be more important, so the corresponding fragments are assigned higher weights. After the weights are assigned, the sub-embedded vector fragments are scaled according to their respective weights and then injected into specific reserved fields in the metadata structure of the corresponding data unit. For text data units, they are injected into metadata extension fields; for image data units, into custom fields of the Exchangeable Image File Format (EXIF); for audio data units, into the ID3 extension tag; for video data units, into the container metadata extension area; and for 3D model data units, into the metadata block reserved space. In this way, quality label information is embedded into the native metadata structure of each modal data unit.
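The split, weight, and inject sequence can be sketched with an in-memory dictionary standing in for real metadata containers. Everything named here is hypothetical: the field keys in `RESERVED_FIELD`, the weight values in `TYPE_WEIGHTS`, and the choice to store each weight next to its scaled fragment so that a later check could undo the scaling (the text does not say how scaled fragments are compared with the unscaled original).

```python
RESERVED_FIELD = {  # hypothetical keys, one per modality
    "text": "metadata_extension",
    "image": "exif_custom_field",
    "audio": "id3_extension_tag",
    "video": "container_metadata_extension",
    "model3d": "metadata_block_reserved",
}

TYPE_WEIGHTS = {  # hypothetical weights for two fragments
    "text": [1.0, 0.5],
    "image": [0.5, 1.0],
    "audio": [1.0, 1.0],
    "video": [1.0, 1.0],
    "model3d": [1.0, 1.0],
}

def split_vector(vector, n_fragments):
    # Even split; the preset partitioning rule is assumed to be equal sizes.
    if n_fragments <= 0 or len(vector) % n_fragments != 0:
        raise ValueError("vector length must be divisible by n_fragments")
    size = len(vector) // n_fragments
    return [vector[i * size:(i + 1) * size] for i in range(n_fragments)]

def inject(data_unit, composite_vector, n_fragments):
    fragments = split_vector(composite_vector, n_fragments)
    weights = TYPE_WEIGHTS[data_unit["type"]]
    if len(weights) != n_fragments:
        raise ValueError("one weight is required per fragment")
    field = RESERVED_FIELD[data_unit["type"]]
    data_unit["metadata"][field] = {
        f"fragment_{i + 1:02d}": {"weight": w, "values": [w * x for x in frag]}
        for i, (frag, w) in enumerate(zip(fragments, weights))
    }
    return data_unit

unit = inject({"type": "text", "metadata": {}}, [1.0, 2.0, 3.0, 4.0], 2)
```

Writing into actual EXIF or ID3 structures would require a format-specific library and is outside the scope of this sketch.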

[0102] Step S155: After injecting the sub-embedded vector fragments into the metadata structure of each data unit, perform metadata integrity verification on the injected data units, read the sub-embedded vector fragments in the metadata structure of each data unit for reorganization verification, and generate a labeled multimodal dataset carrying quality label information.

[0103] After injecting all sub-embedded vector fragments, this step performs final verification and data integration. First, for each data unit, the injected sub-embedded vector fragments are read from its metadata structure. Then, according to the fragment identifiers and partitioning rules recorded during injection, these fragments are recombined into a candidate composite label embedding vector. Next, the candidate composite label embedding vector is compared element-wise with the original composite label embedding vector generated in step S153, and the feature difference rate between the candidate vector and the original vector is calculated. For example, the sum of the absolute values of the differences in the corresponding dimensions of the two vectors is divided by the magnitude of the original vector. If the feature difference rate is less than a preset difference threshold, the label embedding process is considered successful, the quality label information of the data unit is considered complete and error-free, and the data unit is marked as successfully embedded and retained in the dataset. If the feature difference rate is greater than or equal to the threshold, data corruption or errors may have occurred during the embedding process. In this case, the composite label embedding vector of the quality level group to which the data unit belongs is re-extracted, the splitting and injection operation of step S154 is performed again, and the re-injected data unit replaces the original data unit. Finally, embedding status statistics are computed over all data units, generating a report containing the number of successfully embedded data units, the number of data units whose embedding was retried, and the final feature difference rate distribution. At this point, all data units carry the quality label information of their respective quality level group, forming a labeled multimodal dataset.

[0104] For example, in step S1551: scan the metadata structure of all data units in the labeled multimodal dataset, extract each injected sub-embedded vector fragment, and reassemble the sub-embedded vector fragments into candidate composite label embedding vectors according to the fragment identifier recorded at the time of injection.

[0105] In the biodiversity monitoring project of this embodiment, after the quality label embedding of text data units, image data units, audio data units, video data units, and 3D model data units is completed, the system performs a comprehensive integrity verification on the labeled multimodal dataset. This verification process first traverses each data unit in the dataset. For text data units, the system parses their metadata extension fields, reading each previously written sub-embedded vector fragment one by one. Each sub-embedded vector fragment is given a unique fragment identifier during injection, such as "fragment_01_integrity dimension" or "fragment_02_source credibility dimension". Based on these fragment identifiers, the system reassembles the sub-embedded vector fragments scattered in the metadata extension fields according to a preset splicing order. For example, the fragment whose fragment identifier begins with "fragment_01" is placed at the beginning of the recombined vector, followed by the fragment corresponding to "fragment_02", until all fragments are sequentially spliced to form a complete vector, i.e., the candidate composite label embedding vector. For image data units, the system extracts and reassembles sub-embedded vector fragments from their custom fields in the Exchangeable Image File Format (EXIF) using the same identifier logic; for audio data units, it extracts from the ID3 extension tag; for video data units, it extracts from the container metadata extension area; and for 3D model data units, it extracts from the metadata block reserved space. In this way, each data unit yields a candidate composite label embedding vector corresponding to its original composite label embedding vector.
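The identifier-driven reassembly can be sketched as follows, assuming the fragments have already been read from the reserved field into a dictionary keyed by fragment identifier. The function names are illustrative, and the identifier layout `fragment_<index>_<description>` is taken from the examples above.

```python
def fragment_index(identifier):
    # "fragment_01_integrity dimension" -> 1
    return int(identifier.split("_")[1])

def reassemble(fragment_store):
    # Splice fragments in ascending order of their numeric index.
    candidate = []
    for identifier in sorted(fragment_store, key=fragment_index):
        candidate.extend(fragment_store[identifier])
    return candidate

store = {
    "fragment_02_source credibility dimension": [3.0, 4.0],
    "fragment_01_integrity dimension": [1.0, 2.0],
}
candidate_vector = reassemble(store)  # [1.0, 2.0, 3.0, 4.0]
```

Sorting on the parsed integer rather than on the raw string keeps the order correct even if an index were written without zero padding.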

[0106] Step S1552: Perform element-by-element comparison between the candidate composite label embedding vector and the original generated composite label embedding vector, and calculate the feature difference rate between the candidate composite label embedding vector and the original composite label embedding vector.

[0107] After obtaining the candidate composite label embedding vector, the system retrieves the original composite label embedding vector of the quality level group to which the data unit belongs from memory. This original vector was generated and temporarily stored in step S153. The system performs element-wise comparison processing, that is, it compares the values of each corresponding dimension of the two vectors. Assuming that both vectors are D-dimensional, the system calculates the absolute difference between the candidate vector value and the original vector value in each dimension, and then sums these absolute differences to obtain the total absolute difference value. Next, the system calculates the modulus of the original composite label embedding vector, which is the arithmetic square root of the sum of the squares of its values in all dimensions. Finally, the total absolute difference value is divided by the modulus of the original vector to obtain the feature difference rate. This feature difference rate is a dimensionless value that reflects the overall deviation of the candidate vector from the original vector. For example, if the candidate vector after the reorganization of a text data unit has slight deviations from the original vector in multiple dimensions, these deviations are summed and divided by the modulus to obtain a feature difference rate within a specific numerical range. The closer this value is to zero, the more accurate the embedding process.
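The feature difference rate described above (sum of absolute differences divided by the Euclidean modulus of the original vector) can be sketched as follows. The function name and the handling of a zero original vector are assumptions.

```python
import math

def feature_difference_rate(candidate, original):
    if len(candidate) != len(original):
        raise ValueError("vectors must have the same dimension")
    # Total absolute difference across all D dimensions.
    total_abs = sum(abs(c - o) for c, o in zip(candidate, original))
    # Modulus of the original vector.
    modulus = math.sqrt(sum(o * o for o in original))
    if modulus == 0.0:
        raise ValueError("the rate is undefined for a zero original vector")
    return total_abs / modulus

original = [3.0, 4.0]   # modulus 5.0
candidate = [3.0, 4.5]  # deviates by 0.5 in one dimension
rate = feature_difference_rate(candidate, original)  # 0.5 / 5.0
```

A perfectly recovered vector gives a rate of 0, and the rate is never negative.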

[0108] Step S1553: If the feature difference rate is less than the preset difference threshold, the current data unit is marked as successfully embedded and the data unit is retained in the labeled multimodal dataset.

[0109] The system pre-sets a difference threshold, which is configured based on data accuracy requirements and the tolerance of the application scenario. For data units with a feature difference rate less than the pre-set threshold, the system determines that the label embedding process was successful and that the metadata has not been significantly damaged or tampered with. The system adds a status flag to the data unit, such as writing "Embedding Status: Success" in the metadata, and retains the data unit in the labeled multimodal dataset as a final valid data asset.

[0110] Step S1554: If the feature difference rate is greater than or equal to the preset difference threshold, then re-extract the composite label embedding vector of the quality level group to which the data unit belongs, perform the sub-embedded vector fragment decomposition and metadata structure injection operation again, and replace the original data unit with the re-injected data unit to update the labeled multimodal dataset.

[0111] For data units with a feature difference rate greater than or equal to a preset threshold, the system determines that there is an anomaly in its label embedding process, such as data truncation during storage or accidental modification of metadata fields. In this case, the system will not directly discard the data unit but will initiate a retry mechanism. The system first retrieves the original composite label embedding vector of the quality level group to which the data unit belongs from memory. Then, the system executes step S154 again, re-splitting the composite label embedding vector into multiple sub-embedding vector fragments, determining the allocation weight of each sub-embedding vector fragment according to the data unit's data type, and re-injecting the allocated sub-embedding vector fragments into the corresponding reserved fields in the data unit's metadata structure. After re-injection, the system replaces the original data unit with this retried new data unit and updates the corresponding entries in the labeled multimodal dataset. If a data unit still fails verification after a preset number of retries, the system will mark it as an embedding failure and record the reason for the failure, but this will not affect the normal processing flow of other data units.
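The retry mechanism can be sketched as a bounded loop. The callables `reinject` and `difference_rate`, the dictionary keys, and the failure message are hypothetical stand-ins for step S154 and the check of step S1552; the example simulates a unit whose difference rate drops below the threshold on the second re-injection.

```python
def retry_embedding(unit, reinject, difference_rate, threshold, max_retries):
    # Called for a unit that has already failed verification once.
    for attempt in range(1, max_retries + 1):
        reinject(unit)  # repeat the split and injection of step S154
        if difference_rate(unit) < threshold:
            unit["embedding_status"] = "success"
            unit["retries"] = attempt
            return unit
    # Still failing after the preset number of retries: mark it, do not discard it.
    unit["embedding_status"] = "failed"
    unit["retries"] = max_retries
    unit["failure_reason"] = "difference rate stayed at or above the threshold"
    return unit

simulated_rates = iter([0.3, 0.01])  # rate observed after each re-injection

def fake_reinject(unit):
    unit["rate"] = next(simulated_rates)

result = retry_embedding(
    {"rate": 1.0}, fake_reinject, lambda u: u["rate"],
    threshold=0.05, max_retries=3,
)
```

Because the function returns after marking a unit as failed, a failure in one unit does not interrupt the processing of the others.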

[0112] Step S1555: Perform embedding state statistical processing on all data units in the labeled multimodal dataset to generate report data including the number of successfully embedded data units, the number of retried embedding data units, and the final feature difference rate distribution.

[0113] After all data units have completed verification and any necessary retry processing, the system performs a final statistical summary. The system iterates through all data units in the labeled multimodal dataset and counts the data units marked as successfully embedded, obtaining the number of successfully embedded data units. It also counts the data units that underwent at least one retry, obtaining the number of retried data units. Furthermore, the system performs distribution analysis on the final feature difference rates of all successfully embedded data units, for example computing the range and the median of the feature difference rates and the number of data units falling within different intervals. The system integrates this statistical information into a structured report, which can be output in text format to evaluate the reliability and accuracy of the entire quality label embedding process. At this point, all steps are completed, and the final labeled multimodal dataset not only contains the original multimodal data content but also carries structured quality label information in its metadata structure. This label information supports subsequent applications such as data retrieval, filtering, fusion, and credibility assessment based on data quality.
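A sketch of this summary is shown below. The per-unit keys (`embedding_status`, `retries`, `difference_rate`), the report keys, and the interval edges are assumptions made for illustration.

```python
import statistics
from bisect import bisect_right

def interval_counts(rates, edges):
    # counts[i] is the number of rates in the i-th interval defined by sorted edges.
    counts = [0] * (len(edges) + 1)
    for rate in rates:
        counts[bisect_right(edges, rate)] += 1
    return counts

def embedding_report(units, edges):
    succeeded = [u for u in units if u["embedding_status"] == "success"]
    retried = [u for u in units if u.get("retries", 0) >= 1]
    rates = [u["difference_rate"] for u in succeeded]
    report = {
        "successful_units": len(succeeded),
        "retried_units": len(retried),
        "interval_counts": interval_counts(rates, edges),
    }
    if rates:
        report["rate_range"] = (min(rates), max(rates))
        report["rate_median"] = statistics.median(rates)
    return report

units = [
    {"embedding_status": "success", "retries": 0, "difference_rate": 0.0},
    {"embedding_status": "success", "retries": 1, "difference_rate": 0.02},
    {"embedding_status": "success", "retries": 0, "difference_rate": 0.04},
    {"embedding_status": "failed", "retries": 3, "difference_rate": 0.4},
]
report = embedding_report(units, edges=[0.01, 0.05])
```

In this example the retried count includes the unit that ultimately failed, since it also underwent at least one retry.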

[0114] Figure 2 shows exemplary hardware and software components of a quality label embedding system 100 for multimodal data, which can implement the ideas of this application, according to some embodiments of this application. For example, a processor 120 can be used in the quality label embedding system 100 for multimodal data to perform the functions described in this application.

[0115] The quality label embedding system 100 for multimodal data can be a general-purpose server or a special-purpose server; either can be used to implement the quality label embedding method for multimodal data of this application. Although only one server is shown in this application for convenience, the functions described in this application can be implemented in a distributed manner on multiple similar platforms to balance the load.

[0116] For example, a quality label embedding system 100 for multimodal data may include a network port 110 connected to a network, one or more processors 120 for executing program instructions, a communication bus 130, and various forms of storage media 140, such as a disk, ROM, or RAM, or any combination thereof. Exemplarily, the quality label embedding system 100 for multimodal data may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The methods of this application can be implemented according to these program instructions. The quality label embedding system 100 for multimodal data also includes an I/O interface 150 between the computer and other input/output devices.

[0117] For ease of explanation, only one processor is described in the quality tag embedding system 100 for multimodal data. However, it should be noted that the quality tag embedding system 100 for multimodal data in this application may also include multiple processors, and therefore the steps performed by one processor as described in this application may also be performed jointly or individually by multiple processors. For example, if the processor of the quality tag embedding system 100 for multimodal data performs steps A and B, it should be understood that steps A and B may also be performed jointly by two different processors or individually by one processor. For example, the first processor performs step A, the second processor performs step B, or the first processor and the second processor jointly perform steps A and B.

[0118] Furthermore, embodiments of the present invention also provide a readable storage medium having computer-executable instructions pre-set therein. When a processor executes the computer-executable instructions, the quality tag embedding method applied to multimodal data as described above is implemented.

[0119] It should be noted that, in order to simplify the description of the present invention and thus help to understand one or more embodiments of the invention, multiple features may sometimes be grouped into one embodiment, drawing or description thereof in the foregoing description of the embodiments of the present invention.

Claims

1. A method for embedding quality labels for multimodal data, characterized in that the method comprises: obtaining an original multimodal dataset, which includes text data units, image data units, audio data units, video data units, and 3D model data units; performing cross-modal data unit quality feature extraction on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit, wherein the multidimensional quality feature vector set includes a data integrity feature vector, a data source reliability feature vector, a data timeliness feature vector, and a data format standardization feature vector; based on the multidimensional quality feature vector set corresponding to each data unit, performing quality hierarchical clustering on the text data units, image data units, audio data units, video data units, and 3D model data units in a unified representation space to generate at least one quality level group, wherein each quality level group contains data units from at least two different modalities, and the quality feature vectors of the data units within the same quality level group satisfy a preset similarity range in the unified representation space; invoking a pre-trained quality label generation model to perform group feature extraction on each quality level group to obtain the quality label feature set corresponding to the quality level group, wherein the quality label feature set includes a group feature identifier, a quality consistency measurement parameter between data units within the group, and an inter-group quality difference representation vector; and, based on the quality label feature set, performing label embedding processing on the data units within each quality level group, and appending the quality label feature set to the metadata structure of each data unit in the form of label embedding vectors to generate a labeled multimodal dataset carrying quality label information.

2. The quality label embedding method for multimodal data according to claim 1, characterized in that the step of performing cross-modal data unit quality feature extraction on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit includes: obtaining text data units from the original multimodal dataset, performing text integrity scanning processing on the text data units, identifying missing character segments, semantic break segments, and grammatical structure abnormal segments in the text data units, and generating data integrity feature vectors for the text data units based on the length of the missing character segments, the number of semantic break segments, and the number of grammatical structure abnormal segments. Source tracing processing is performed on the text data unit to extract the publisher's digital signature, the publishing platform authorization certificate and the content modification history of the text data unit. Based on the verification status of the publisher's digital signature, the validity period of the publishing platform authorization certificate and the number of modifications in the content modification history, a data source reliability feature vector of the text data unit is generated. The text data unit is subjected to content timeliness determination processing. The publication timestamp of the text data unit, the occurrence timestamp of the event referenced in the content, and the term update mark used in the content are extracted. The data timeliness feature vector of the text data unit is generated based on the time difference between the publication timestamp and the current time, the time difference between the occurrence timestamp and the publication timestamp, and the update version number of the term update mark.
The text data unit is subjected to format structure verification processing to identify the title hierarchy structure, paragraph division structure and citation mark structure of the text data unit. Based on the completeness of the title hierarchy structure, the coherence of the paragraph division structure and the standardization of the citation mark structure, a data format standardization feature vector of the text data unit is generated. The data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the text data unit are concatenated to generate a multidimensional quality feature vector set corresponding to the text data unit.

3. The quality label embedding method for multimodal data according to claim 2, characterized in that the step of performing cross-modal data unit quality feature extraction processing on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit further includes: Image data units are obtained from the original multimodal dataset. Image integrity detection processing is performed on the image data units. The pixel missing regions, compression artifact regions and color channel missing regions of the image data units are scanned. Data integrity feature vectors of the image data units are generated based on the area ratio of the pixel missing regions, the texture complexity of the compression artifact regions and the number of color channel missing regions. Image source analysis processing is performed on the image data unit to extract the unique identifier of the shooting device, the image acquisition timestamp, and the geographic coordinate marker information of the image data unit. Based on the matching result of the unique identifier of the shooting device with the list of trusted devices, the continuity of the image acquisition timestamp, and the positioning accuracy of the geographic coordinate marker information, a data source reliability feature vector of the image data unit is generated. The image data unit is subjected to image content timeliness analysis processing to identify the seasonal characteristics, vegetation cover characteristics and building status characteristics of the photographed objects in the image data unit. Based on the matching degree between the seasonal characteristics and the shooting time, the change range of the vegetation cover characteristics and the comparison results between the building status characteristics and historical records, the data timeliness feature vector of the image data unit is generated.
The image data unit is subjected to image format standardization verification processing, which detects the encoding format identifier, resolution parameter and color space configuration information of the image data unit, and generates the data format standardization feature vector of the image data unit based on the matching status of the encoding format identifier with the preset standard format, the deviation of the resolution parameter from the standard resolution and the completeness of the color space configuration information. The data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the image data unit are concatenated to generate a multidimensional quality feature vector set corresponding to the image data unit.

4. The quality label embedding method for multimodal data according to claim 3, characterized in that the step of performing cross-modal data unit quality feature extraction processing on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit further includes: Audio data units are obtained from the original multimodal dataset. Audio continuity detection processing is performed on the audio data units to identify silent segments, pop segments, and spectral interruption segments of the audio data units. Data integrity feature vectors of the audio data units are generated based on the duration ratio of the silent segments, the number of pop segments, and the frequency domain energy decay rate of the spectral interruption segments. The audio data unit is subjected to audio source verification processing. The voiceprint features, recording device model identifier and recording environment background noise features of the audio data unit are extracted. Based on the matching degree between the voiceprint features and the known voiceprint database, the device category of the recording device model identifier and the stability of the recording environment background noise features, a data source reliability feature vector of the audio data unit is generated. The audio data unit is subjected to audio content timeliness determination processing, which identifies the event mention time of the speech content in the audio data unit, the environmental feature time imprint in the background sound, and the matching degree between the speech rate feature and the era speech rate model. The data timeliness feature vector of the audio data unit is generated based on the interval between the event mention time and the current time, the era identifier of the environmental feature time imprint, and the matching coefficient between the speech rate feature and the era speech rate model.
The audio data unit is subjected to audio format standardization verification processing, which detects the sampling rate parameter, quantization bit depth parameter and channel configuration parameter of the audio data unit. Based on the deviation value of the sampling rate parameter from the standard sampling rate, the conformity of the quantization bit depth parameter with the standard quantization bit depth, and the matching status of the channel configuration parameter with the preset channel mapping table, the audio data unit's data format standardization feature vector is generated. The data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the audio data unit are concatenated to generate a multi-dimensional quality feature vector set corresponding to the audio data unit.
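
The audio integrity and format standardization vectors, and the final concatenation, can be illustrated with a minimal sketch. The standard sampling rate, standard bit depth, channel mapping table and all function names below are assumptions chosen for illustration; the claim does not fix their values or the exact scoring formulas.

```python
# Hypothetical reference values; the claim only says "standard" and "preset".
STANDARD_SAMPLE_RATE = 48_000                      # Hz, assumed
STANDARD_BIT_DEPTH = 16                            # bits, assumed
CHANNEL_MAP = {1: "mono", 2: "stereo", 6: "5.1"}   # assumed preset channel mapping table


def audio_integrity_vector(silent_seconds, total_seconds, pop_count, decay_rate):
    """Integrity features: silent duration ratio, pop count, spectral energy decay rate."""
    return [silent_seconds / total_seconds, float(pop_count), float(decay_rate)]


def audio_format_vector(sample_rate, bit_depth, channels):
    """Format features: relative sampling-rate deviation, bit-depth conformity, channel match."""
    rate_deviation = abs(sample_rate - STANDARD_SAMPLE_RATE) / STANDARD_SAMPLE_RATE
    depth_conforms = 1.0 if bit_depth == STANDARD_BIT_DEPTH else 0.0
    channel_matches = 1.0 if channels in CHANNEL_MAP else 0.0
    return [rate_deviation, depth_conforms, channel_matches]


def concat_quality_vector(integrity, source, timeliness, fmt):
    """Concatenate the four per-dimension vectors into one quality feature vector."""
    return [*integrity, *source, *timeliness, *fmt]
```

The source reliability and timeliness vectors are passed in as plain lists here because they depend on external resources (a voiceprint database, an era speech rate model) that the sketch does not model.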

5. The quality label embedding method for multimodal data according to claim 4, characterized in that the step of performing cross-modal data unit quality feature extraction processing on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit further includes: The video data units in the original multimodal dataset are obtained, and video frame sequence continuity detection processing is performed on the video data units to identify dropped frames, frozen frames, and mosaic distortion frames in the video data units. The data integrity feature vector of the video data unit is generated based on the frame ratio of the dropped frames, the duration of the frozen frames, and the block artifact density of the mosaic distortion frames. The video data unit is subjected to video source tracing processing, which extracts the acquisition device serial number, encoding generation timestamp and video fingerprint hash value of the video data unit. Based on the result of matching the acquisition device serial number against the authorized device database, the temporal continuity of the encoding generation timestamp and the uniqueness verification status of the video fingerprint hash value, the data source reliability feature vector of the video data unit is generated. The video data unit is subjected to video content timeliness analysis processing to identify the clothing style features, vehicle model features and scene landmark change features of moving objects in the video data unit. The data timeliness feature vector of the video data unit is generated based on the matching degree between the clothing style features and the fashion trend cycle, the production year of the vehicle model features and the update cycle of the scene landmark change features.
The video data unit is subjected to video format standardization verification processing, which detects the encoding standard identifier, frame rate parameter and bit rate parameter of the video data unit. Based on the compatibility status of the encoding standard identifier with the preset encoding standard, the deviation of the frame rate parameter from the standard frame rate and the calculation result of the bit rate parameter matching degree with the resolution, the data format standardization feature vector of the video data unit is generated. The data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the video data unit are concatenated to generate a multi-dimensional quality feature vector set corresponding to the video data unit.
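
As a rough illustration of the frame sequence continuity check, the sketch below derives two of the three integrity components (dropped frame ratio and frozen frame duration) from frame timestamps and flattened pixel values. The detection rules, the `freeze_eps` threshold and the function name are assumptions; the mosaic block artifact density is omitted because it needs an image-level detector that the claim does not specify.

```python
def video_integrity_vector(timestamps, frames, fps, freeze_eps=1e-3):
    """Return [dropped_frame_ratio, frozen_duration_seconds].

    timestamps: presentation time of each received frame, in seconds.
    frames: one flat list of pixel values per received frame.
    """
    interval = 1.0 / fps

    # A gap of k frame intervals between neighbours implies k - 1 missing frames.
    dropped = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        dropped += max(round((cur - prev) / interval) - 1, 0)
    dropped_ratio = dropped / (len(timestamps) + dropped)

    # A frame counts as frozen when it is (almost) identical to its predecessor.
    frozen = 0
    for prev, cur in zip(frames, frames[1:]):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mean_abs_diff < freeze_eps:
            frozen += 1

    return [dropped_ratio, frozen * interval]
```

For example, five received frames at 10 fps with one missing timestamp slot and two repeated frames yield a dropped ratio of 1/6 and a frozen duration of 0.2 seconds.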

6. The quality label embedding method for multimodal data according to claim 5, characterized in that the step of performing cross-modal data unit quality feature extraction processing on the original multimodal dataset to generate a multidimensional quality feature vector set corresponding to each data unit further includes: The three-dimensional model data units are obtained from the original multimodal dataset, and model integrity detection processing is performed on the three-dimensional model data units to identify the mesh missing regions, texture mapping missing regions, and normal direction inconsistency regions of the three-dimensional model data units. The data integrity feature vector of the three-dimensional model data unit is generated based on the area ratio of the mesh missing regions, the pixel missing rate of the texture mapping missing regions, and the number of triangular faces in the normal direction inconsistency regions. The three-dimensional model data unit is subjected to model source verification processing, which extracts the creator digital certificate, model generation software identifier and version iteration record of the three-dimensional model data unit. Based on the credibility of the issuing authority of the creator digital certificate, the software version number of the model generation software identifier and the update frequency of the version iteration record, the data source reliability feature vector of the three-dimensional model data unit is generated. The three-dimensional model data unit is subjected to model timeliness determination processing to identify the modeling style features, model detail precision features and material texture style features of the three-dimensional model data unit.
Based on the matching degree between the modeling style features and the style evolution cycle, the conformity between the model detail precision features and the technical level of the modeling era, and the era identifier of the material texture style features, the data timeliness feature vector of the three-dimensional model data unit is generated. The three-dimensional model data unit is subjected to model format standardization verification processing. The format standard identifier, vertex number scale and texture resolution parameter of the three-dimensional model data unit are detected. The data format standardization feature vector of the three-dimensional model data unit is generated according to the compatibility status of the format standard identifier with the preset standard format, the matching result of the vertex number scale with the model complexity standard and the adaptability of the texture resolution parameter with the model accuracy. The data integrity feature vector, data source reliability feature vector, data timeliness feature vector, and data format standardization feature vector of the three-dimensional model data unit are concatenated to generate a set of multi-dimensional quality feature vectors corresponding to the three-dimensional model data unit.

7. The quality label embedding method for multimodal data according to claim 1, characterized in that the step of performing quality-level clustering on the text data units, image data units, audio data units, video data units, and 3D model data units in a unified representation space based on the multidimensional quality feature vector set corresponding to each data unit, to generate at least one quality level group, includes: The integrity feature vector, source credibility feature vector, timeliness feature vector, and format standardization feature vector in the multidimensional quality feature vector set corresponding to each data unit are mapped to the integrity dimension axis, source credibility dimension axis, timeliness dimension axis, and format standardization dimension axis in the unified representation space, respectively, to obtain the quality feature spatial coordinate point of each data unit in the unified representation space; The Euclidean distance between the quality feature spatial coordinate points of any two data units in the unified representation space is calculated to obtain a set of quality difference measurement values between data units. Using the set of quality difference measurement values as the basis for clustering, the density peak clustering algorithm is executed to cluster all quality feature spatial coordinate points in the unified representation space and to identify the core coordinate points with local density maxima.
Centered on each core coordinate point, other quality feature spatial coordinate points are assigned to the clusters closest to the core coordinate points according to the preset cutoff distance parameter, generating an initial cluster set; For each cluster in the initial cluster set, perform intra-cluster quality consistency assessment processing, calculate the average distance between all quality feature spatial coordinate points in each cluster and the cluster center point, mark the clusters with average distance less than a preset threshold as the quality level group, and treat the quality feature spatial coordinate points in the clusters with average distance greater than the preset threshold as isolated data points for re-clustering iteration processing, until all quality feature spatial coordinate points are classified into quality level groups with average distance less than the preset threshold.
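
The clustering steps of this claim can be sketched as a single pass, under stated assumptions. This is a simplified reading of density peak clustering: local density is the count of neighbours inside the cutoff distance, a core point is one with no denser neighbour inside the cutoff (ties go to the lower index), and every point joins its nearest core. The function name, the tie rule and the use of the core point as the cluster center are assumptions, and the re-clustering iteration is left to the caller because the claim does not say how parameters change between rounds.

```python
import math


def density_peak_groups(points, cutoff, max_mean_dist):
    """One clustering pass; returns (accepted_groups, leftover_point_indices)."""
    n = len(points)
    # Pairwise Euclidean distances: the "quality difference measurement values".
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    def outranks(j, i):
        return rho[j] > rho[i] or (rho[j] == rho[i] and j < i)

    # Core points: local density maxima within the cutoff neighbourhood.
    cores = [i for i in range(n)
             if not any(j != i and dist[i][j] < cutoff and outranks(j, i)
                        for j in range(n))]

    # Assign every point to the cluster of its nearest core point.
    clusters = {c: [] for c in cores}
    for i in range(n):
        nearest = min(cores, key=lambda c: dist[i][c])
        clusters[nearest].append(i)

    # Intra-cluster consistency: mean distance of members to the core point.
    groups, leftovers = [], []
    for core, members in clusters.items():
        mean_dist = sum(dist[i][core] for i in members) / len(members)
        if mean_dist < max_mean_dist:
            groups.append(members)
        else:
            leftovers.extend(members)
    return groups, leftovers
```

A caller would feed `leftovers` back into the function, for instance with a smaller cutoff, until the list is empty; that loop is one possible realization of the re-clustering iteration, not something the claim prescribes.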

8. The quality label embedding method for multimodal data according to claim 7, characterized in that the step of invoking the pre-trained quality label generation model to perform group feature extraction processing on each quality level group, to obtain a quality label feature set corresponding to the quality level group, includes: The quality feature spatial coordinate points of all data units contained in the quality level group are input into the group feature encoder of the quality label generation model. The group feature encoder performs nonlinear mapping processing on each quality feature spatial coordinate point through a multilayer perceptron network to generate an intra-group embedded feature vector for each data unit. The intra-group embedded feature vectors are input into the group aggregation layer of the quality label generation model, and average pooling is performed on the intra-group embedded feature vectors of all data units in the quality level group to obtain the group global feature vector that reflects the overall quality level of the group. The group global feature vector is input into the intra-group consistency analysis module of the quality label generation model. The cosine similarity between the group global feature vector and the intra-group embedded feature vector of each data unit in the group is calculated. Based on the distribution of the cosine similarity, the quality consistency measurement parameters between the data units in the group are generated. The quality consistency measurement parameters include the similarity mean and the similarity standard deviation.
The global feature vector of the quality level group and the global feature vector of the adjacent quality level group are input into the inter-group difference analysis module of the quality label generation model to calculate the feature difference vector between the global feature vectors of different groups, and generate the inter-group quality difference representation vector based on the norm of the feature difference vector. The group identifier of the quality level group is concatenated with the group global feature vector, the quality consistency measurement parameter and the inter-group quality difference representation vector to generate the quality label feature set corresponding to the quality level group.
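
The data flow through the quality label generation model can be sketched as follows. The encoder here is a randomly initialized two-layer perceptron standing in for the pre-trained one, the inter-group difference is represented as a one-element vector holding the Euclidean norm, and all function names and the flat output layout are assumptions made for illustration.

```python
import math
import random
import statistics


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Untrained stand-in for the group feature encoder (tanh hidden layer)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def encode(x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return encode


def mean_pool(vectors):
    """Average pooling over the intra-group embedded feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def group_label_features(group_id, coords, neighbor_global, encode):
    """Return [group_id, *global_vec, sim_mean, sim_std, diff_norm]."""
    embedded = [encode(c) for c in coords]
    global_vec = mean_pool(embedded)
    sims = [cosine(global_vec, e) for e in embedded]
    sim_mean = statistics.fmean(sims)
    sim_std = statistics.pstdev(sims)
    diff = [a - b for a, b in zip(global_vec, neighbor_global)]
    diff_norm = math.sqrt(sum(d * d for d in diff))
    return [group_id, *global_vec, sim_mean, sim_std, diff_norm]
```

The population standard deviation is used for the similarity spread; a sample standard deviation would be an equally defensible reading of the claim.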

9. The quality label embedding method for multimodal data according to claim 8, characterized in that the step of performing label embedding processing on the data units within each quality level group based on the quality label feature set, appending the quality label feature set as a label embedding vector to the metadata structure of each data unit, and generating a labeled multimodal dataset carrying quality label information, includes: The group global feature vector is extracted from the quality label feature set corresponding to the quality level group and used as the basic embedding vector, and the basic embedding vector is scaled according to the similarity mean in the quality consistency measurement parameters between data units in the group to obtain the group embedding vector. The inter-group quality difference representation vector is extracted from the quality label feature set corresponding to the quality level group, and dimensional expansion processing is performed on the inter-group quality difference representation vector to obtain a difference embedding vector with the same dimension as the group embedding vector; A vector addition operation is performed between the group embedding vector and the difference embedding vector to generate a composite label embedding vector that contains overall group quality information and relative quality information between groups. The composite label embedding vector is split into multiple sub-embedded vector fragments. The allocation weight of each sub-embedded vector fragment is determined according to the data type of each data unit in the quality level group. The allocated sub-embedded vector fragments are injected into the reserved fields in the metadata structure of each data unit.
The reserved fields in the metadata structure include the metadata extension fields of text data units, the EXIF custom fields of image data units, the ID3 extension tags of audio data units, the container metadata extension area of video data units, and the metadata block reserved space of 3D model data units. After injecting the sub-embedded vector fragments into the metadata structure of each data unit, a metadata integrity verification operation is performed on the injected data units. The sub-embedded vector fragments in the metadata structure of each data unit are read and recombined for verification, generating a labeled multimodal dataset carrying quality label information.
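
The embedding, splitting, injection and verification steps can be sketched with plain dictionaries standing in for real metadata containers. The field names in `RESERVED_FIELD`, the cyclic repetition used for dimensional expansion, the JSON serialization and all function names are assumptions; the type-dependent allocation weight is omitted because the claim does not define how it is computed.

```python
import json

# Hypothetical reserved-field names, one per data type named in the claim.
RESERVED_FIELD = {
    "text": "meta_ext",
    "image": "exif_custom",
    "audio": "id3_ext",
    "video": "container_meta_ext",
    "model3d": "meta_block_reserved",
}


def composite_embedding(global_vec, sim_mean, diff_vec):
    """Scale the group vector, expand the difference vector, and add them."""
    group_vec = [sim_mean * g for g in global_vec]
    # Dimensional expansion by cyclic repetition (an assumed scheme).
    expanded = [diff_vec[i % len(diff_vec)] for i in range(len(group_vec))]
    return [g + d for g, d in zip(group_vec, expanded)]


def split_fragments(vec, n_fragments):
    """Split a vector into at most n_fragments contiguous pieces."""
    size = -(-len(vec) // n_fragments)  # ceiling division
    return [vec[i:i + size] for i in range(0, len(vec), size)]


def inject(unit, fragments):
    """Write the fragments into the reserved field for the unit's data type."""
    unit["metadata"][RESERVED_FIELD[unit["type"]]] = json.dumps(fragments)


def verify(unit, expected_vec):
    """Read the fragments back, recombine them, and compare with the original."""
    fragments = json.loads(unit["metadata"][RESERVED_FIELD[unit["type"]]])
    recombined = [x for frag in fragments for x in frag]
    return recombined == expected_vec
```

A short usage example: `vec = composite_embedding([1.0, 2.0, 3.0, 4.0], 0.5, [1.0])`, then `inject(unit, split_fragments(vec, 2))` followed by `verify(unit, vec)` for a unit such as `{"type": "image", "metadata": {}}`. In a real system the dictionary write would be replaced by the format-specific metadata writer for each container.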

10. A quality label embedding system for multimodal data, characterized in that the quality label embedding system for multimodal data includes a processor and a memory connected to the processor. The memory is used to store programs, instructions or code, and the processor is used to execute the programs, instructions or code in the memory to implement the quality label embedding method for multimodal data as described in any one of claims 1-9.