Multimodal image-text automatic recognition and storage method and system

By preprocessing and vector-matrix transformation of image and text data, combined with association learning and threshold analysis, the system identifies and stores related image and text data, solving the problem of excessive text data in image and text recognition storage and improving accuracy and efficiency.

CN122243720A · Pending · Publication Date: 2026-06-19 · BEIJING YIFEI SHENXI TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING YIFEI SHENXI TECHNOLOGY CO LTD
Filing Date: 2026-03-18
Publication Date: 2026-06-19

Application Information

IPC: G06T1/60; G06F40/16; G06V30/148; G06V10/82; G06N3/0455; G06V30/19

Similar Technology Patents

Cross-modal image-text retrieval method based on multi-window attention mechanism
CN120011609A
Multi-modal sentiment analysis method and system based on image-text fusion
CN117115534A
A multimodal humor recognition method
CN119863742B
Humor identification method oriented to multiple modes
CN119863742A
Multimodal sentiment analysis method and system based on image-text fusion
CN117115534B

AI Technical Summary

⚠Technical Problem

Existing image recognition and storage technologies do not perform text data recognition and filtering, resulting in excessively large amounts of text data being stored when training large image models. This increases the cost of storage devices and reduces the processing speed of large image models.

⚗Method used

By preprocessing the input image and text data, the output valid image and text data are converted into image vector matrices and text vector matrices. Sample data is collected for association learning, and related image and text data are identified and stored based on association thresholds.

🎯Benefits of technology

It improves the accuracy and effectiveness of image and text recognition, prevents misjudgments, optimizes storage space utilization, and reduces the cost of storage devices.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

This invention discloses a multimodal method and system for automatic image and text recognition and storage, relating to the field of image and text recognition and storage technology. The method includes the following steps: preprocessing input image and text data to output valid image data and valid text data; converting the valid image data into an image vector matrix and the valid text data into a text vector matrix; aligning the image feature vectors in the image vector matrix with the text vectors in the text vector matrix; and recognizing the image vector matrix and text vector matrix based on the aligned image feature vectors and text vectors to identify and store related image and text data. This invention addresses the problem that existing image and text recognition and storage technologies do not perform text data recognition and filtering, resulting in excessively large amounts of stored text data when training large image models, increasing storage device costs and reducing the processing speed of large image models.

Description

Technical Field

[0001] This invention relates to the field of image and text recognition and storage technology, and specifically to a multimodal method and system for automatic image and text recognition and storage.

Background Technology

[0002] Image and text recognition storage technology refers to an intelligent information processing technology that deeply integrates image content understanding, text recognition, and data storage management.

[0003] When training large image models, large amounts of image and text data are typically fed into the model so that the image data and text data can be associated, and the text data must be stored along the way. Descriptions of image data often contain many invalid words, so storing all of the text data consumes a great deal of storage space; the text data therefore needs to be identified and filtered. Existing image and text recognition and storage technology does not identify and filter text data, so an excessive amount of text data is stored when training large image models. This increases the cost of storage devices and reduces the processing speed of large image models.

Summary of the Invention

[0004] This invention aims to at least partially solve one of the technical problems in the prior art. It preprocesses input image and text data to output valid image and text data. Then, it identifies the valid image data, converting it into an image vector matrix, and simultaneously identifies the valid text data, converting it into a text vector matrix. Next, it collects sample data and extracts sample features from the image and text vector matrices. Then, it performs association learning on the sample features and aligns the image and text feature vectors in the image and text vector matrices. Based on the aligned image and text feature vectors, it analyzes the association threshold between the image and text data. Finally, it identifies and stores correlated image and text data based on the association threshold. This addresses the problem that existing image and text recognition storage technologies do not perform text data identification and filtering, resulting in excessively large amounts of stored text data when training large image models, increasing storage device costs and reducing the processing speed of large image models.

[0005] To achieve the above objectives, in a first aspect, this application provides a method for automatic recognition and storage of multimodal images and text, comprising the following steps: preprocessing the input image and text data to output valid image data and valid text data; identifying the valid image data and converting it into an image vector matrix, and identifying the valid text data and converting it into a text vector matrix; collecting sample data to perform association learning on the image vector matrix and the text vector matrix, and aligning the image feature vectors in the image vector matrix with the text vectors in the text vector matrix; and, based on the aligned image feature vectors and text vectors, recognizing the image vector matrix and text vector matrix to identify and store related image and text data.

[0006] Further, the input image and text data are preprocessed to output valid image data and valid text data, including the following sub-steps: Image enhancement is performed on image data to obtain effective image data; The text data is uniformly converted into English format to obtain valid text data.

[0007] Further, the effective image data is identified and converted into an image vector matrix. Simultaneously, the effective text data is identified and converted into a text vector matrix, including the following sub-steps: Connect the Transformer ViT model, identify the effective image data through the Transformer ViT model, and convert the effective image data into a vector matrix, named the image vector matrix. The first row of the image vector matrix is the image feature vector of the effective image data. Connect to the BERT model, input the valid text data into the BERT model, and the BERT model outputs a vector matrix, named the text vector matrix. Each row in the text vector matrix is a text vector.

[0008] Furthermore, collecting sample data to perform association learning on the image vector matrix and text vector matrix, and aligning the image feature vectors in the image vector matrix with the text vectors in the text vector matrix, includes the following sub-steps: collecting sample data and extracting sample features from the image vector matrix and text vector matrix in the sample data; and performing association learning on the sample features and aligning the image feature vectors in the image vector matrix with the text vectors in the text vector matrix.

[0009] Furthermore, collecting sample data and extracting sample features from the image vector matrix and text vector matrix includes the following sub-steps: Collect sample data. When collecting sample data, similar image groups and similar text groups are manually divided. The sample data includes sample images and sample text, and each sample image is associated with its sample text. The image feature vector of the sample image is extracted and named the image sample vector. The text vector matrix of the sample text is extracted and named the text sample matrix. The image sample vector is labeled PX. The text vectors in the text sample matrix are numbered from top to bottom and denoted TX_g, where g is a positive integer and is the index of TX. The t-th column of PX is labeled PX(t), and the t-th column of TX_g is labeled TX_g(t), where t is a positive integer. The PX(t) are sorted in ascending order and numbered, denoted EP_d; the TX_g(t) are sorted in ascending order and numbered, denoted ET_d, where d is a positive integer and is the index of EP and ET. Two two-dimensional coordinate systems are established with d as the X-axis and with EP_d and ET_d, respectively, as the Y-axis, named the image feature filtering map and the text feature filtering map. Each EP_d is plotted in the image feature filtering map according to d, and each ET_d is plotted in the text feature filtering map according to d; the coordinate points in the two maps are named image filtering points and text filtering points respectively. Adjacent image filtering points and adjacent text filtering points are connected by straight lines to obtain the image filtering lines and text filtering lines, whose slopes are named the image filtering slope and the text filtering slope respectively. The image filtering line with the largest image filtering slope is named the image feature segmentation line, and the text filtering line with the largest text filtering slope is named the text feature segmentation line. The PX(t) corresponding to each EP_d to the right of the image feature segmentation line is marked as an effective image feature, and the TX_g(t) corresponding to each ET_d to the right of the text feature segmentation line is marked as an effective text feature. All effective image features in the similar image group to which the sample image belongs are extracted, together with all effective text features in the similar text group to which the sample text belongs; together they form a set of sample features.

[0010] Furthermore, performing association learning on the sample features and aligning the image feature vectors in the image vector matrix with the text vectors in the text vector matrix includes the following sub-steps: Collect the PX(t) in the effective image features and the TX_g(t) in the effective text features, and relabel them PYX(t) and TYX_g(t) respectively. Calculate the sum of the PYX(t) with the same t in a set of sample features, and label the result PF(t); likewise calculate the sum of the TYX_g(t) with the same t in a set of sample features, and label the result TF(t). Sort the PF(t) in descending order and number them, denoted PL_a; sort the TF(t) in descending order and number them, denoted TL_b, where a and b are both positive integers, a is the index of PL, and b is the index of TL. Starting with a=b=1, take the t of the PF(t) corresponding to PL_a and mark it T1, take the t of the TF(t) corresponding to TL_b and mark it T2, and obtain the TYX_g(t) for which t=T2; associate column T1 of the image feature vector with column T2 of the text vector. Increment a and b by one and associate again, until the maximum value of a or b is reached; all sample data are analyzed in this way.

[0011] Furthermore, based on the aligned image feature vectors and text vectors, the image vector matrix and text vector matrix are identified to recognize and store related image and text data, including the following sub-steps: Based on the aligned image feature vectors and text vectors, the correlation threshold between image data and text data is analyzed. Identify and store relevant image and text data based on association thresholds.

[0012] Furthermore, analyzing the association threshold between image data and text data based on the aligned image feature vectors and text vectors includes the following sub-steps: Obtain associated sample images and sample texts, count the number of words in each sample text, and denote it W1. Obtain the image feature vector of the sample image and the text vectors of the sample text; each text vector corresponds to one word. If the effective text features in a text vector are associated with the effective image features in the image feature vector, the word corresponding to that text vector is marked as related to the sample image. The number of words in the sample text that are related to the sample image is denoted W2. W1 / W2 is calculated, and the result is named the association rate. The association rate is calculated for all sample data. Establish a two-dimensional coordinate system with W1 as the horizontal axis and the association rate as the vertical axis, named the association threshold analysis graph, and plot each association rate in the graph according to its W1. Polynomial regression is performed on the association threshold analysis graph, and the resulting regression function is named the threshold floating function. The threshold floating function is used to solve for the association threshold.
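The regression step can be sketched as an ordinary least-squares polynomial fit over (W1, association rate) points. This is only an illustration under stated assumptions: the source does not give the polynomial degree (2 is assumed here), the function names `polyfit` and `threshold_floating_function` are hypothetical, and the data points are made up and carry no claim about realistic magnitudes.

```python
def polyfit(xs, ys, degree=2):
    """Least-squares fit; returns coefficients c so that y ~ c[0] + c[1]*x + c[2]*x**2 + ...

    Needs at least degree + 1 distinct x values, otherwise the system is singular.
    """
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the Vandermonde matrix A.
    m = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def threshold_floating_function(coeffs):
    """Wrap fitted coefficients as a function of the word count."""
    return lambda words: sum(c * words ** k for k, c in enumerate(coeffs))


# Hypothetical (W1, association rate) points, generated from an arbitrary quadratic.
w1_values = [4, 6, 8, 10, 12, 15]
rates = [0.9 - 0.05 * w + 0.001 * w ** 2 for w in w1_values]

threshold_for = threshold_floating_function(polyfit(w1_values, rates))
print(round(threshold_for(9), 3))
```

Substituting a word count into the fitted function then yields the association threshold for texts of that length.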

[0013] Furthermore, identifying and storing related image and text data based on the association threshold includes the following sub-steps: The input image data and text data are named the image to be analyzed and the text to be analyzed respectively. The text to be analyzed is divided into sub-texts to be analyzed, using periods and newlines as delimiters. The image feature vector of the image to be analyzed is extracted and named the image recognition feature, and the text feature matrix of each sub-text to be analyzed is extracted. Based on the image recognition feature and the text feature matrix, the association rate between the sub-text to be analyzed and the image to be analyzed is calculated and named the recognition association rate, and the number of words in the sub-text to be analyzed is denoted R1. R1 is substituted into the threshold floating function to solve for the association threshold. If the recognition association rate is greater than or equal to the association threshold, the sub-text to be analyzed is marked as related to the image to be analyzed; otherwise it is marked as not related. The image to be analyzed and its related sub-texts to be analyzed are named image-text data and stored.
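The splitting and thresholding steps in this paragraph can be sketched as follows. The scoring function `toy_rate` and the constant threshold are hypothetical stand-ins supplied only so the example runs; they are not the source's association rate calculation or its fitted threshold floating function, and all function names are illustrative.

```python
import re


def split_subtexts(text):
    """Split on periods and newlines, dropping empty pieces."""
    return [part.strip() for part in re.split(r"[.\n]", text) if part.strip()]


def select_related_subtexts(text, association_rate, threshold_function):
    """Keep each sub-text whose rate is >= the threshold for its word count R1."""
    kept = []
    for subtext in split_subtexts(text):
        r1 = len(subtext.split())
        if association_rate(subtext) >= threshold_function(r1):
            kept.append(subtext)
    return kept


# Hypothetical stand-ins, for illustration only.
RELATED_WORDS = {"gear", "worn", "teeth", "chipped"}


def toy_rate(subtext):
    words = subtext.lower().split()
    return sum(word in RELATED_WORDS for word in words) / len(words)


def constant_threshold(r1):
    return 0.4


text = "The gear is worn. Photo taken on Monday\nThe gear teeth are chipped"
stored = select_related_subtexts(text, toy_rate, constant_threshold)
print(stored)
```

Only the sub-texts that pass the comparison would be stored alongside the image.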

[0014] In a second aspect, this application provides a multimodal image and text automatic recognition and storage system, including an image and text input module, a feature extraction module, a feature learning module, and a recognition and storage module; the image and text input module, the feature extraction module, and the feature learning module are each data-connected to the recognition and storage module. The image and text input module is used to preprocess the input image and text data and output valid image data and valid text data. The feature extraction module is used to identify the valid image data and convert it into an image vector matrix, and to identify the valid text data and convert it into a text vector matrix. The feature learning module is used to collect sample data to perform association learning on the image vector matrix and the text vector matrix, and to align the image feature vectors in the image vector matrix with the text vectors in the text vector matrix. The recognition and storage module is used to recognize the image vector matrix and text vector matrix based on the aligned image feature vectors and text vectors, thereby identifying and storing related image and text data.

[0015] The beneficial effects of this invention are as follows: This invention preprocesses the input image data and text data to output valid image data and valid text data. Then, it identifies the valid image data and converts it into an image vector matrix. Simultaneously, it identifies the valid text data and converts it into a text vector matrix. Then, it collects sample data and extracts sample features from the image vector matrix and text vector matrix in the sample data. Then, it performs association learning on the sample features and aligns the image feature vectors and text vectors in the image vector matrix and the text vector matrix. The advantage is that the same dimension in the image vector matrix and the text vector matrix does not represent the same concept. Therefore, it is necessary to find the dimension in the image vector matrix and the text vector matrix that represents the same concept in order to identify the correlation between image data and text data, thereby improving the accuracy and effectiveness of image and text recognition. This invention analyzes the association threshold between image data and text data based on aligned image feature vectors and text vectors. Finally, it identifies and stores related image-text data based on the association threshold. The advantage is that a piece of text data contains descriptions of image data, but also contains different invalid words. These invalid words also have text vectors, which means that invalid words may be misjudged as related to image data. If only one invalid word in a piece of text data is related to image data, it may lead to misjudgment. Therefore, it is necessary to analyze the association threshold and calculate different association thresholds based on the number of words. 
This ensures that the text data is judged to be related to the image data only when the proportion of related words in the entire text data exceeds the association threshold, preventing misjudgment and improving the accuracy, effectiveness, and rationality of image-text recognition and storage.

Attached Figure Description

[0016] Figure 1 is a schematic diagram of the system of the present invention; Figure 2 is a schematic diagram of the image feature filtering map of the present invention; Figure 3 is a schematic diagram of the association threshold analysis graph of the present invention; Figure 4 is a flowchart of the steps of the method of the present invention.

Detailed Implementation

[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0018] Example 1. Referring to Figure 1, this application provides a multimodal image and text automatic recognition and storage system, including an image and text input module, a feature extraction module, a feature learning module, and a recognition and storage module; the image and text input module, the feature extraction module, and the feature learning module are each data-connected to the recognition and storage module. The image and text input module is used to preprocess the input image and text data and output valid image data and valid text data. The image and text input module is configured with an image and text input strategy, which includes: performing image enhancement on the image data to obtain valid image data; and uniformly converting the text data into English to obtain valid text data. In practical applications, image enhancement is an existing image processing technique that enhances the details in image data, and it is not described in detail in this embodiment. The text data, once converted into English, is the valid text data.

[0019] The feature extraction module is used to identify valid image data and convert it into an image vector matrix, and at the same time, it identifies valid text data and converts it into a text vector matrix. The feature extraction module is configured with feature extraction strategies, which include: Connect to the Transformer ViT model, identify the effective image data through the Transformer ViT model, and convert the effective image data into a vector matrix, named the image vector matrix. The first row of the image vector matrix is the image feature vector of the effective image data. In practical applications, connecting to the existing Transformer ViT model, the first row of the image vector matrix output by the Transformer ViT model is actually a comprehensive vector of all other vectors, that is, a summary of an image data, i.e., an image feature vector. For example, if an image of a gear is input, after being transformed by the Transformer ViT model, a 197×768 vector matrix is obtained. Among them, the vectors in the 196 rows other than the first row are feature vectors of different regions in the image data, and the first row is a summary of them. Since the image feature vector has 768 dimensions, it is inconvenient to list them all in this embodiment. Therefore, this embodiment only lists a small amount of data as an example to illustrate the analysis process of this embodiment. After inputting the above gear image, the image feature vector obtained is [0.92,0.15,0.88,0.95,0.02,0.08].
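The 197×768 shape quoted above matches a ViT that splits a 224×224 image into 16×16 patches and prepends one summary row. The source does not name the exact ViT variant, so the image size, patch size, and hidden width below are assumptions chosen to reproduce that shape, and the function names are hypothetical. A minimal sketch of the row layout, using a zero-filled stand-in rather than real model output:

```python
def vit_output_shape(image_size=224, patch_size=16, hidden_dim=768):
    """Rows = one summary row + one row per image patch."""
    patches_per_side = image_size // patch_size
    return (1 + patches_per_side ** 2, hidden_dim)


def split_image_matrix(image_matrix):
    """Return (image feature vector, per-region vectors)."""
    return image_matrix[0], image_matrix[1:]


rows, cols = vit_output_shape()
stand_in = [[0.0] * cols for _ in range(rows)]  # placeholder, not real model output
feature_vector, region_vectors = split_image_matrix(stand_in)
print(rows, cols, len(region_vectors))
```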

[0020] Connect to the BERT model, input the valid text data into the BERT model, and the BERT model outputs a vector matrix, named the text vector matrix. Each row in the text vector matrix is a text vector. In practical applications, an existing BERT model is connected, and the BERT model converts text into vector representations. For example, if the input text means "the gear is worn," the converted English text is "The gear is worn." After this text is input into the BERT model, the BERT model outputs a text vector matrix in which each of the first four rows corresponds to a word, namely The, gear, is, and worn, from top to bottom. The last row corresponds to the sentence break symbol: each sentence in the text data ends with a sentence break symbol, and the vectors corresponding to sentence break symbols are not included in the analysis. Therefore, each of the first to fourth rows in the text vector matrix is a text vector.

[0021] The feature learning module is used to collect sample data to perform association learning on the image vector matrix and the text vector matrix, and to align the image feature vectors in the image vector matrix with the text vectors in the text vector matrix; the feature learning module includes a sample feature collection unit and a feature alignment unit. The sample feature collection unit is used to collect sample data and extract sample features from the image vector matrix and text vector matrix in the sample data. The sample feature collection unit is configured with a sample feature collection strategy, which includes: Collect sample data. When collecting sample data, similar image groups and similar text groups are manually divided. The sample data includes sample images and sample text, and each sample image is associated with its sample text. The image feature vector of the sample image is extracted and named the image sample vector. The text vector matrix of the sample text is extracted and named the text sample matrix. In practical applications, sample data is collected; for example, in this embodiment, an image of a gear and the text "the gear is worn" constitute one piece of sample data. The similar groups are divided manually. In the similar image group of the gear image, every image contains a gear that shows some wear; the shape, size, color, and degree of wear of the gear are not restricted, as long as the main subject of the image is a gear showing some wear. In the similar text group to which "the gear is worn" belongs, every text has a meaning similar to "the gear is worn". This division must be done by hand, and in the early stage of image training manual division is unavoidable. The following analysis takes the image feature vector listed above as the image sample vector and the text vector matrix listed above as the text sample matrix.

[0022] The image sample vector is labeled PX. The text vectors in the text sample matrix are numbered from top to bottom and denoted TX_g, where g is a positive integer and is the index of TX. The t-th column of PX is labeled PX(t), and the t-th column of TX_g is labeled TX_g(t), where t is a positive integer. In practical applications, PX is [0.92, 0.15, 0.88, 0.95, 0.02, 0.08], TX1 is [0.23, 0.13, 0.02, 0.03, 0.09, 0.17], TX2 is [0.92, 0.18, 0.86, 0.24, 0.03, 0.15], TX3 is [0.04, 0.02, 0.05, 0.06, 0.02, 0.03], and TX4 is [0.24, 0.11, 0.12, 0.93, 0.07, 0.11]. For example, PX(1) is 0.92, PX(2) is 0.15, TX1(1) is 0.23, and so on.
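To make the indexing concrete, the example vectors can be written out with 1-based accessors that mirror the PX(t) and TX g (t) notation. The helper names `px` and `tx` are hypothetical; the values are the ones listed in this paragraph.

```python
# Example values from the embodiment (truncated to 6 of the 768 dimensions).
PX = [0.92, 0.15, 0.88, 0.95, 0.02, 0.08]
TX = {
    1: [0.23, 0.13, 0.02, 0.03, 0.09, 0.17],  # "The"
    2: [0.92, 0.18, 0.86, 0.24, 0.03, 0.15],  # "gear"
    3: [0.04, 0.02, 0.05, 0.06, 0.02, 0.03],  # "is"
    4: [0.24, 0.11, 0.12, 0.93, 0.07, 0.11],  # "worn"
}


def px(t):
    """PX(t): the t-th column of the image sample vector (t starts at 1)."""
    return PX[t - 1]


def tx(g, t):
    """TX_g(t): the t-th column of the g-th text vector (g and t start at 1)."""
    return TX[g][t - 1]


print(px(1), px(2), tx(1, 1))
```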

[0023] The PX(t) are sorted in ascending order and numbered, denoted EP_d; the TX_g(t) are sorted in ascending order and numbered, denoted ET_d, where d is a positive integer and is the index of EP and ET. Referring to Figure 2, two two-dimensional coordinate systems are established with d as the X-axis and with EP_d and ET_d, respectively, as the Y-axis, named the image feature filtering map and the text feature filtering map. Each EP_d is plotted in the image feature filtering map according to d, and each ET_d is plotted in the text feature filtering map according to d; the coordinate points in the two maps are named image filtering points and text filtering points respectively. Adjacent image filtering points and adjacent text filtering points are connected by straight lines to obtain the image filtering lines and text filtering lines, whose slopes are named the image filtering slope and the text filtering slope respectively. The image filtering line with the largest image filtering slope is named the image feature segmentation line, and the text filtering line with the largest text filtering slope is named the text feature segmentation line. The PX(t) corresponding to each EP_d to the right of the image feature segmentation line is marked as an effective image feature, and the TX_g(t) corresponding to each ET_d to the right of the text feature segmentation line is marked as an effective text feature. All effective image features in the similar image group to which the sample image belongs are extracted, together with all effective text features in the similar text group to which the sample text belongs; together they form a set of sample features. In practical applications, EP_d and ET_d are obtained by sorting and numbering, with 1≤d≤6. Note that the ET_d here are sorted within a single text vector, i.e., each TX_g is analyzed independently. The image feature filtering map constructed in this way is shown in Figure 2. From Figure 2 it is easy to see that the image filtering line between EP3 and EP4 has the largest slope, so it is named the image feature segmentation line. The PX(t) corresponding to EP_d with d≥4 are then marked as effective image features; that is, PX(1), PX(3), and PX(4) are effective image features. Since the text feature filtering map is constructed and analyzed in the same way as the image feature filtering map, it is not described in detail in this embodiment; note, however, that each TX_g must be analyzed independently. The effective image features of all images in the similar image group to which the gear image belongs, and the effective text features of all texts in the similar text group to which "the gear is worn" belongs, are extracted to obtain the sample features. At this point, the sample features contain the effective image features and effective text features of multiple images and multiple texts.
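Because adjacent filtering points are one unit apart on the d axis, the slope of each filtering line equals the difference between consecutive sorted values, so the segmentation line sits at the largest jump. A minimal sketch of that selection follows; the function name `effective_columns` is hypothetical, and taking the first maximum when several slopes tie is an assumption, since the source does not say how ties are resolved.

```python
def effective_columns(vector):
    """Return the 1-based columns t whose values lie right of the steepest segment.

    Expects at least two values.
    """
    # order[d - 1] is the 0-based column holding the d-th smallest value.
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    ranked = [vector[i] for i in order]
    # Slope between adjacent points; the horizontal step is always 1.
    slopes = [ranked[d + 1] - ranked[d] for d in range(len(ranked) - 1)]
    # Index of the steepest segment (first one wins on ties: an assumption).
    cut = max(range(len(slopes)), key=lambda d: slopes[d])
    return sorted(i + 1 for i in order[cut + 1:])


PX = [0.92, 0.15, 0.88, 0.95, 0.02, 0.08]
TX2 = [0.92, 0.18, 0.86, 0.24, 0.03, 0.15]
TX4 = [0.24, 0.11, 0.12, 0.93, 0.07, 0.11]

print(effective_columns(PX))   # columns of the effective image features
print(effective_columns(TX2))  # effective text features for "gear"
print(effective_columns(TX4))  # effective text features for "worn"
```

For PX this reproduces the result stated in the text, PX(1), PX(3), and PX(4).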

[0024] The feature alignment unit is used to perform association learning on the sample features and to align the image feature vectors in the image vector matrix with the text vectors in the text vector matrix. The feature alignment unit is configured with a feature alignment strategy, which includes: Collect the PX(t) in the effective image features and the TX_g(t) in the effective text features, and relabel them PYX(t) and TYX_g(t) respectively. Calculate the sum of the PYX(t) with the same t in a set of sample features, and label the result PF(t); likewise calculate the sum of the TYX_g(t) with the same t in a set of sample features, and label the result TF(t). In practical applications, the sum of the PYX(t) with the same t is calculated within a set of sample features. Taking t=1 as an example, this means summing all PYX(1) in the set. For example, if the set of sample features was obtained from 20 sample images and 20 sample texts, the sum of these 20 PYX(1) is calculated, resulting in a PF(1) of 18.36. Note that if the effective image features obtained from a certain sample image do not include PYX(1), only the sum of the remaining 19 PYX(1) needs to be calculated. PF(1) to PF(6) are calculated in the same way. The maximum t here is 6 only because this embodiment shows a truncated portion of the data for convenience; in actual analysis, the maximum value of t is 768. Calculating the sum of the TYX_g(t) with the same t in a set of sample features summarizes the vector features of all words on the same dimension, regardless of the value of g. Taking t=1 as an example, calculating TF(1) amounts to summing the effective text features in the first column of all text vectors obtained from these 20 sample texts. For example, the first-column value of TX2 listed in this embodiment is an effective text feature, so it is included in the sum; the first-column value of TX1 is not an effective text feature, so it is not. TF(1) is thus calculated to be 23.85. TF(1) to TF(6) are calculated in the same way.
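The per-dimension summation can be sketched with each image (or each word) represented as a mapping from column t to its effective feature value, so a column that is not effective for a given sample is simply absent and contributes nothing. The function name `sum_by_column` and all sample values except the TX2 and TX4 entries are hypothetical.

```python
from collections import defaultdict


def sum_by_column(effective_features):
    """Sum effective feature values per column t across all samples."""
    totals = defaultdict(float)
    for features in effective_features:
        for t, value in features.items():
            totals[t] += value
    return dict(totals)


# One dict per sample image: {t: PYX(t)}. The last two are hypothetical.
image_samples = [
    {1: 0.92, 3: 0.88, 4: 0.95},
    {1: 0.90, 4: 0.91},
    {3: 0.85, 4: 0.89},
]
# One dict per word, pooled over all sample texts (g is ignored).
word_samples = [
    {1: 0.92, 3: 0.86},  # "gear" (TX2)
    {4: 0.93},           # "worn" (TX4)
    {1: 0.88, 6: 0.81},  # hypothetical word from another sample text
]

PF = sum_by_column(image_samples)
TF = sum_by_column(word_samples)
print(PF, TF)
```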

[0025] Sort PF(t) in descending order and number the results, denoted PL_a; sort TF(t) in descending order and number the results, denoted TL_b, where a and b are positive integers, a is the index of PL, and b is the index of TL. Starting with a = b = 1, obtain the t of the PF(t) corresponding to PL_a and label it T1; obtain the t of the TF(t) corresponding to TL_b and label it T2; obtain TYX_g(t) for t = T2; and associate column T2 of the text vector with column T1 of the image feature vector. Then increment both a and b by one and associate again, until a or b reaches its maximum value, and analyze all sample data in this way. In practical applications, PL_a and TL_b are obtained by sorting and numbering. Because only the effective image and text features are analyzed, the calculated PF(t) values actually comprise only PF(1), PF(3), and PF(4), so 1 ≤ a ≤ 3; the calculated TF(t) values comprise TF(1), TF(3), TF(4), and TF(6), so 1 ≤ b ≤ 4. Because the text data is a textual description of the image data, a given dimension should carry the same relative weight among the effective image features as among the effective text features. PL_1 is PF(1) and TL_1 is TF(1). PF(1) being the largest indicates that the feature in this dimension is the most prominent in the images, and TF(1) being the largest indicates that the feature in this dimension accounts for the largest share of the text description. It is therefore determined that the first column of the image feature vector is associated with the first column of the text vector. Both a and b are then incremented by one and the association is repeated. By analogy, this set of sample data can establish the association of only 3 dimensions, so further sample data are analyzed until all 768 dimensions of the image feature vector and the text vector are associated.
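As a non-limiting illustration, the rank-based pairing of image and text dimensions can be sketched as follows. The function name `align_dimensions` is an assumption of this sketch, and only PF(1) = 18.36 and TF(1) = 23.85 come from the embodiment; the remaining values are hypothetical placeholders chosen so that the rankings have lengths 3 and 4.

```python
def align_dimensions(PF, TF):
    """Pair image and text dimensions by descending rank of their sums.

    Returns a list of (T1, T2) pairs: the a-th largest PF(t) is paired with
    the b-th largest TF(t) for a = b, stopping when the shorter ranking ends.
    """
    PL = sorted(PF, key=PF.get, reverse=True)  # image dimensions, largest sum first
    TL = sorted(TF, key=TF.get, reverse=True)  # text dimensions, largest sum first
    return list(zip(PL, TL))


PF = {1: 18.36, 3: 9.1, 4: 5.2}           # values for t = 3, 4 are hypothetical
TF = {1: 23.85, 3: 11.0, 4: 7.4, 6: 2.0}  # values for t = 3, 4, 6 are hypothetical

pairs = align_dimensions(PF, TF)
print(pairs)
```

Because `zip` stops at the shorter ranking, this set of sample data yields three associated dimension pairs, and further sample sets would be needed to cover the remaining dimensions.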

[0026] The recognition and storage module is used to recognize the image vector matrix and the text vector matrix based on the aligned image feature vector and text vectors, thereby identifying and storing related image and text data. The recognition and storage module includes an association threshold analysis unit and an image and text data recognition unit. The association threshold analysis unit is used to analyze the association threshold between image data and text data based on the aligned image feature vector and text vectors. The association threshold analysis unit is configured with an association threshold analysis strategy, which includes: obtaining related sample images and sample texts, counting the number of words in each sample text, and denoting the count W1; and obtaining the image feature vector of the sample image and the text vectors of the sample text, where one text vector corresponds to one word, and, if the effective text features in a text vector are associated with the effective image features in the image feature vector, marking the word corresponding to that text vector as related to the sample image. In practical applications, taking the gear image and the text "The gear is worn" listed in this embodiment as an example, the count W1 is 4, and the text vectors are TX_1, TX_2, TX_3, and TX_4, which correspond to "The", "gear", "is", and "worn", respectively. Of these, TX_2 and TX_4 contain effective text features that are associated with the effective image features in the image feature vector. Therefore, the words "gear" and "worn" are related to the gear image.
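As a non-limiting illustration, marking the words that are related to the sample image can be sketched as follows. The sketch assumes one possible reading of the strategy: a word is related when at least one effective dimension of its text vector is aligned with an effective dimension of the image feature vector. The function name `related_words`, the set-based representation, and the dimension values are all hypothetical.

```python
def related_words(words, text_effective_dims, image_effective_dims, alignment):
    """Return the words whose effective text dimensions map onto effective image dimensions.

    words: list of words, one per text vector.
    text_effective_dims: list of sets, the effective dimensions of each text vector.
    image_effective_dims: set of effective dimensions of the image feature vector.
    alignment: dict mapping a text dimension T2 to its associated image dimension T1.
    """
    related = []
    for word, dims in zip(words, text_effective_dims):
        # An unaligned text dimension maps to None, which is never in the image set.
        if any(alignment.get(t2) in image_effective_dims for t2 in dims):
            related.append(word)
    return related


words = ["The", "gear", "is", "worn"]
text_dims = [set(), {1, 3}, {6}, {4}]  # hypothetical effective dimensions per word
image_dims = {1, 3, 4}                 # hypothetical effective image dimensions
alignment = {1: 1, 3: 3, 4: 4}         # hypothetical result of the alignment step

hits = related_words(words, text_dims, image_dims, alignment)
W1, W2 = len(words), len(hits)
print(hits, W1, W2)
```

With these placeholder values the sketch reproduces the outcome described above: "gear" and "worn" are marked as related, W1 is 4, and two words are related.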

[0027] Count the number of words in the sample text that are related to the sample image, denoted W2; calculate W2 / W1 and name the result the association rate; and calculate the association rate for all sample data. As shown in Figure 3, establish a two-dimensional coordinate system with W1 as the horizontal axis and the association rate as the vertical axis, named the association threshold analysis chart, and plot each association rate in the chart according to its W1. Perform polynomial regression on the association threshold analysis chart and name the regression function the threshold floating function, which is used to solve for the association threshold. In practical applications, the counted W2 is 2 and, with W1 being 4, the calculated association rate is 0.5. The association rates of all sample data are calculated to construct the association threshold analysis chart shown in Figure 3, and the threshold floating function obtained through polynomial regression is Y = 0.0001 × X² - 0.0123 × X + 0.3536, where Y is the association threshold and X is W1.
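As a non-limiting illustration, the association rate and a second-degree polynomial regression can be sketched as follows. The rate is taken as W2 / W1, the ratio that reproduces the value 0.5 of the worked example (W1 = 4, W2 = 2). The embodiment does not state the regression degree or the fitting procedure; a quadratic least-squares fit via the normal equations is assumed here because the reported function is quadratic. The data points are synthetic, generated from the reported function purely as a round-trip check, and are not sample data from the embodiment.

```python
def association_rate(w1, w2):
    """Share of words in a sample text that are related to the sample image."""
    return w2 / w1


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c2*x^2 + c1*x + c0 via the normal equations."""
    # Power sums for the 3x3 normal-equation system, unknowns ordered (c0, c1, c2).
    s = [sum(x ** k for x in xs) for k in range(5)]
    r = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[float(s[i + j]) for j in range(3)] + [r[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    c0, c1, c2 = (m[i][3] / m[i][i] for i in range(3))
    return c2, c1, c0


def threshold(x, coeffs):
    """Evaluate the threshold floating function for word count x."""
    c2, c1, c0 = coeffs
    return c2 * x * x + c1 * x + c0


reported = (0.0001, -0.0123, 0.3536)           # coefficients given in the embodiment
xs = [4, 8, 10, 15, 20, 30]                    # synthetic word counts W1
ys = [threshold(x, reported) for x in xs]      # synthetic association rates
fitted = fit_quadratic(xs, ys)

print(association_rate(4, 2))
print(fitted)
```

Because the synthetic points lie exactly on the reported curve, the fit recovers the reported coefficients up to floating-point error; real sample data would scatter around the fitted curve.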

[0028] The image and text data recognition unit is used to identify and store related image and text data based on the association threshold. The image and text data recognition unit is configured with an image and text data recognition strategy, which includes: naming the input image data and text data the image to be analyzed and the text to be analyzed, respectively, and dividing the text to be analyzed into sub-texts to be analyzed using periods and newline characters as delimiters; extracting the image feature vector of the image to be analyzed, named the image recognition feature; extracting the text feature matrix of each sub-text to be analyzed; calculating the association rate between the sub-text to be analyzed and the image to be analyzed based on the image recognition feature and the text feature matrix, named the recognition correlation; labeling the number of words in the sub-text to be analyzed R1; substituting R1 into the threshold floating function to solve for the association threshold; determining whether the recognition correlation is greater than or equal to the association threshold and, if so, marking the sub-text to be analyzed as related to the image to be analyzed, or, if not, marking it as unrelated; and naming the image to be analyzed together with its related sub-texts image-text data and storing the image-text data. In practical applications, dividing the text into sub-texts means that each segment of the text is analyzed separately. This is because the text supplied when training image-related models often contains a large amount of invalid information, which not only interferes with model training but also occupies unnecessary storage space, so it needs to be filtered out. For example, a text to be analyzed may contain two sentences, that is, two sub-texts. The first sub-text has a word count R1 of 10, for which the association threshold is calculated to be 0.2406.
The recognition correlation of the first sub-text is 0.3, which is greater than the association threshold, so the first sub-text is determined to be related to the image. The second sub-text has a word count R1 of 20, for which the association threshold is calculated to be 0.1476. The recognition correlation of the second sub-text is 0.1, which is less than the association threshold, so the second sub-text is determined to be unrelated to the image. Thus, when the image and the text to be analyzed are stored, only the image and the first sub-text need to be stored.
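As a non-limiting illustration, the splitting and threshold comparison described above can be sketched as follows. The coefficients are those of the threshold floating function reported in this embodiment. The function names, the sample sentences, and the `scores` lookup that stands in for the model-based correlation calculation are assumptions of this sketch.

```python
import re


def threshold_floating(word_count):
    """Threshold floating function reported in the embodiment (X is the word count)."""
    return 0.0001 * word_count ** 2 - 0.0123 * word_count + 0.3536


def split_subtexts(text):
    """Split on periods and newlines, dropping empty fragments."""
    return [part.strip() for part in re.split(r"[.\n]", text) if part.strip()]


def select_related(subtexts, correlation_of):
    """Keep the subtexts whose correlation reaches the threshold for their length."""
    kept = []
    for sub in subtexts:
        r1 = len(sub.split())  # word count R1
        if correlation_of(sub) >= threshold_floating(r1):
            kept.append(sub)
    return kept


# Hypothetical input: a 10-word sentence followed by a 20-word sentence.
first = "one two three four five six seven eight nine ten"
second = " ".join(["filler"] * 20)
text = first + ".\n" + second + "."

# Stand-in for the correlation computed from the image and text features.
scores = {first: 0.3, second: 0.1}

kept = select_related(split_subtexts(text), scores.get)
print(kept)
```

The sketch reproduces the worked example: the thresholds evaluate to 0.2406 and 0.1476, so only the first sentence is kept for storage. Splitting on every period is a simplification that would also break decimal numbers and abbreviations.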

[0029] Example 2: As shown in Figure 4, this application provides a method for automatic recognition and storage of multimodal images and text, including the following steps:
Step S1: Preprocess the input image data and text data to output valid image data and valid text data. Step S1 includes the following sub-steps:
Step S101: Perform image enhancement on the image data to obtain valid image data;
Step S102: Convert the text data uniformly into English to obtain valid text data;
Step S2: Recognize the valid image data and convert it into an image vector matrix, and recognize the valid text data and convert it into a text vector matrix. Step S2 includes the following sub-steps:
Step S201: Connect to the Transformer ViT model, recognize the valid image data through the Transformer ViT model, and convert the valid image data into a vector matrix, named the image vector matrix; the first row of the image vector matrix is the image feature vector of the valid image data;
Step S202: Connect to the BERT model and input the valid text data into the BERT model; the BERT model outputs a vector matrix, named the text vector matrix, in which each row is a text vector;
Step S3: Collect sample data to perform association learning on the image vector matrix and the text vector matrix, and align the image feature vector in the image vector matrix with the text vectors in the text vector matrix. Step S3 includes the following sub-steps:
Step S301: Collect sample data and extract sample features from the image vector matrix and the text vector matrix in the sample data. Step S301 includes the following sub-steps:
Step S3011: Collect sample data, manually dividing similar image groups and similar text groups during collection. The sample data includes sample images and sample texts, and each sample image is associated with a sample text. Extract the image feature vector of the sample image, named the image sample vector, and extract the text vector matrix of the sample text, named the text sample matrix;
Step S3012: Label the image sample vector PX, and number the text vectors in the text sample matrix from top to bottom, denoted TX_g, where g is a positive integer and is the index of TX;
Step S3013: Label the t-th column of PX as PX(t) and the t-th column of TX_g as TX_g(t), where t is a positive integer;
Step S3014: Sort PX(t) in ascending order and number the results, denoted EP_d; sort TX_g(t) in ascending order and number the results, denoted ET_d, where d is a positive integer and is the index of EP and ET;
Step S3015: With d as the X-axis and EP_d and ET_d respectively as the Y-axis, establish two two-dimensional coordinate systems, named the image feature filtering map and the text feature filtering map. Plot EP_d in the image feature filtering map according to d and ET_d in the text feature filtering map according to d, and name the coordinate points in the two maps image filtering points and text filtering points, respectively;
Step S3016: Connect the image filtering points with adjacent d, and the text filtering points with adjacent d, by straight lines to obtain image filtering lines and text filtering lines, respectively. Obtain the slopes of the image filtering lines and the text filtering lines, named the image filtering slope and the text filtering slope, respectively. Name the image filtering line with the largest image filtering slope the image feature segmentation line, and the text filtering line with the largest text filtering slope the text feature segmentation line;
Step S3017: Label the PX(t) corresponding to each EP_d located to the right of the image feature segmentation line as an effective image feature, and label the TX_g(t) corresponding to each ET_d located to the right of the text feature segmentation line as an effective text feature. Extract all effective image features in the similar image group to which the sample image belongs and all effective text features in the similar text group to which the sample text belongs; together they form a set of sample features;
Step S302: Perform association learning on the sample features and align the image feature vector in the image vector matrix with the text vectors in the text vector matrix. Step S302 includes the following sub-steps:
Step S3021: Collect PX_h(t) from the effective image features and TX_g(t) from the effective text features, and relabel them as PYX(t) and TYX_g(t), respectively;
Step S3022: Calculate the sum of the PYX(t) with the same t in a set of sample features and label the result PF(t); at the same time, calculate the sum of the TYX_g(t) with the same t in the set of sample features and label the result TF(t);
Step S3023: Sort PF(t) in descending order and number the results, denoted PL_a; sort TF(t) in descending order and number the results, denoted TL_b, where a and b are positive integers, a is the index of PL, and b is the index of TL;
Step S3024: Starting with a = b = 1, obtain the t of the PF(t) corresponding to PL_a and label it T1; obtain the t of the TF(t) corresponding to TL_b and label it T2; obtain TYX_g(t) for t = T2; associate column T2 of the text vector with column T1 of the image feature vector; increment both a and b by one and associate again, until a or b reaches its maximum value; and analyze all sample data in this way;
Step S4: Recognize the image vector matrix and the text vector matrix based on the aligned image feature vector and text vectors, thereby identifying and storing related image and text data. Step S4 includes the following sub-steps:
Step S401: Analyze the association threshold between image data and text data based on the aligned image feature vector and text vectors. Step S401 includes the following sub-steps:
Step S4011: Obtain related sample images and sample texts, count the number of words in each sample text, and denote the count W1;
Step S4012: Obtain the image feature vector of the sample image and the text vectors of the sample text, where one text vector corresponds to one word. If the effective text features in a text vector are associated with the effective image features in the image feature vector, mark the word corresponding to that text vector as related to the sample image;
Step S4013: Count the number of words in the sample text that are related to the sample image, denoted W2; calculate W2 / W1 and name the result the association rate; and calculate the association rate for all sample data;
Step S4014: Establish a two-dimensional coordinate system with W1 as the horizontal axis and the association rate as the vertical axis, named the association threshold analysis chart, and plot each association rate in the chart according to its W1;
Step S4015: Perform polynomial regression on the association threshold analysis chart and name the regression function the threshold floating function, which is used to solve for the association threshold;
Step S402: Identify and store related image and text data based on the association threshold. Step S402 includes the following sub-steps:
Step S4021: Name the input image data and text data the image to be analyzed and the text to be analyzed, respectively, and divide the text to be analyzed into sub-texts to be analyzed using periods and newline characters as delimiters;
Step S4022: Extract the image feature vector of the image to be analyzed, named the image recognition feature; extract the text feature matrix of each sub-text to be analyzed; calculate the association rate between the sub-text to be analyzed and the image to be analyzed based on the image recognition feature and the text feature matrix, named the recognition correlation; and label the number of words in the sub-text to be analyzed R1;
Step S4023: Substitute R1 into the threshold floating function to solve for the association threshold. Determine whether the recognition correlation is greater than or equal to the association threshold; if so, mark the sub-text to be analyzed as related to the image to be analyzed; if not, mark it as unrelated;
Step S4024: Name the image to be analyzed together with its related sub-texts image-text data, and store the image-text data.
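As a non-limiting illustration, the effective-feature selection of steps S3014 to S3017 can be sketched as follows. Because adjacent points are one unit apart on the d axis, the slope of each filtering line equals the difference between consecutive sorted values. The function name `effective_features`, the six-column vector, and applying the routine to a single vector at a time are assumptions of this sketch.

```python
def effective_features(vector):
    """Select the columns whose values lie to the right of the steepest segment.

    vector: list of feature values, where index i holds column t = i + 1.
    Returns {t: value} for the selected columns.
    """
    # Sort (value, t) pairs in ascending order of value; position d is the rank.
    ranked = sorted((value, t) for t, value in enumerate(vector, start=1))
    if len(ranked) < 2:
        return {}
    # Slope of the line joining ranks d and d + 1 (unit spacing on the d axis).
    slopes = [ranked[i + 1][0] - ranked[i][0] for i in range(len(ranked) - 1)]
    cut = max(range(len(slopes)), key=slopes.__getitem__)
    # Everything after the steepest segment counts as an effective feature.
    return {t: value for value, t in ranked[cut + 1:]}


px = [0.92, 0.05, 0.81, 0.78, 0.02, 0.07]  # hypothetical 6-column slice of PX
selected = effective_features(px)
print(selected)
```

With these placeholder values the steepest segment lies between 0.07 and 0.78, so columns 1, 3, and 4 are selected, which matches the shape of the example in which only PF(1), PF(3), and PF(4) exist.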

[0030] Example 3: This application provides an electronic device, which may include a processor, a communication interface, a memory, and a communication bus. The processor, communication interface, and memory communicate with each other via the communication bus. The memory stores computer-readable instructions. The processor can call the instructions in the memory. When the computer-readable instructions are executed by the processor, steps such as those in the multimodal image-text automatic recognition and storage method are performed to achieve the following functions: preprocessing the input image data and text data to output valid image data and valid text data; converting the valid image data into an image vector matrix and the valid text data into a text vector matrix; aligning the image feature vectors and text vectors in the image vector matrix and the text vector matrix; and recognizing the image vector matrix and text vector matrix based on the aligned image feature vectors and text vectors to identify and store related image-text data.

[0031] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0032] Example 4: This application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it performs the steps of the above-described multimodal image-text automatic recognition and storage method to achieve the following functions: preprocessing the input image data and text data to output valid image data and valid text data; converting the valid image data into an image vector matrix and the valid text data into a text vector matrix; aligning the image feature vector in the image vector matrix with the text vectors in the text vector matrix; and recognizing the image vector matrix and the text vector matrix based on the aligned image feature vector and text vectors, thereby identifying and storing related image-text data.

[0033] Based on the above description of the embodiments, the embodiments of the present invention can be provided as methods, systems, or computer program products. Based on this understanding, the above technical solutions, in essence or in terms of their contribution to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or certain parts of the embodiments.

[0034] In the embodiments provided in this application, it should be understood that the disclosed system or method can be implemented in other ways. The embodiments described above are merely illustrative. For example, the division of modules or units is only a logical functional division, and there may be other division methods in actual implementation. Furthermore, multiple modules or units may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the coupling or direct coupling or communication connection shown or discussed may be through some communication interfaces. The indirect coupling or communication connection between systems, modules, and units may be electrical, mechanical, or other forms.

[0035] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A method for automatic recognition and storage of multimodal images and text, characterized in that it includes the following steps: preprocessing the input image data and text data to output valid image data and valid text data; recognizing the valid image data and converting it into an image vector matrix, and recognizing the valid text data and converting it into a text vector matrix; collecting sample data to perform association learning on the image vector matrix and the text vector matrix, and aligning the image feature vector in the image vector matrix with the text vectors in the text vector matrix; and recognizing the image vector matrix and the text vector matrix based on the aligned image feature vector and text vectors, thereby identifying and storing related image and text data.

2. The method for automatic recognition and storage of multimodal images and text according to claim 1, characterized in that preprocessing the input image data and text data to output valid image data and valid text data includes the following sub-steps: performing image enhancement on the image data to obtain valid image data; and converting the text data uniformly into English to obtain valid text data.

3. The method for automatic recognition and storage of multimodal images and text according to claim 2, characterized in that recognizing the valid image data and converting it into an image vector matrix, and recognizing the valid text data and converting it into a text vector matrix, includes the following sub-steps: connecting to the Transformer ViT model, recognizing the valid image data through the Transformer ViT model, and converting the valid image data into a vector matrix, named the image vector matrix, the first row of which is the image feature vector of the valid image data; and connecting to the BERT model and inputting the valid text data into the BERT model, the BERT model outputting a vector matrix, named the text vector matrix, in which each row is a text vector.

4. The method for automatic recognition and storage of multimodal images and text according to claim 3, characterized in that collecting sample data to perform association learning on the image vector matrix and the text vector matrix, and aligning the image feature vector in the image vector matrix with the text vectors in the text vector matrix, includes the following sub-steps: collecting sample data and extracting sample features from the image vector matrix and the text vector matrix in the sample data; and performing association learning on the sample features and aligning the image feature vector in the image vector matrix with the text vectors in the text vector matrix.

5. The method for automatic recognition and storage of multimodal images and text according to claim 4, characterized in that collecting sample data and extracting sample features from the image vector matrix and the text vector matrix in the sample data includes the following sub-steps: collecting sample data, wherein similar image groups and similar text groups are manually divided during collection, the sample data includes sample images and sample texts, and each sample image is associated with a sample text; extracting the image feature vector of the sample image, named the image sample vector, and extracting the text vector matrix of the sample text, named the text sample matrix; labeling the image sample vector PX and numbering the text vectors in the text sample matrix from top to bottom, denoted TX_g, where g is a positive integer and is the index of TX; labeling the t-th column of PX as PX(t) and the t-th column of TX_g as TX_g(t), where t is a positive integer; sorting PX(t) in ascending order and numbering the results, denoted EP_d, and sorting TX_g(t) in ascending order and numbering the results, denoted ET_d, where d is a positive integer and is the index of EP and ET; with d as the X-axis and EP_d and ET_d respectively as the Y-axis, establishing two two-dimensional coordinate systems, named the image feature filtering map and the text feature filtering map, plotting EP_d in the image feature filtering map according to d and ET_d in the text feature filtering map according to d, and naming the coordinate points in the two maps image filtering points and text filtering points, respectively; connecting the image filtering points with adjacent d, and the text filtering points with adjacent d, by straight lines to obtain image filtering lines and text filtering lines, respectively;
obtaining the slopes of the image filtering lines and the text filtering lines, named the image filtering slope and the text filtering slope, respectively; naming the image filtering line with the largest image filtering slope the image feature segmentation line and the text filtering line with the largest text filtering slope the text feature segmentation line; labeling the PX(t) corresponding to each EP_d located to the right of the image feature segmentation line as an effective image feature, and labeling the TX_g(t) corresponding to each ET_d located to the right of the text feature segmentation line as an effective text feature; and extracting all effective image features in the similar image group to which the sample image belongs and all effective text features in the similar text group to which the sample text belongs, which together form a set of sample features.

6. The method for automatic recognition and storage of multimodal images and text according to claim 5, characterized in that performing association learning on the sample features and aligning the image feature vector in the image vector matrix with the text vectors in the text vector matrix includes the following sub-steps: collecting PX_h(t) from the effective image features and TX_g(t) from the effective text features, and relabeling them as PYX(t) and TYX_g(t), respectively; calculating the sum of the PYX(t) with the same t in a set of sample features and labeling the result PF(t), and calculating the sum of the TYX_g(t) with the same t in the set of sample features and labeling the result TF(t); sorting PF(t) in descending order and numbering the results, denoted PL_a, and sorting TF(t) in descending order and numbering the results, denoted TL_b, where a and b are positive integers, a is the index of PL, and b is the index of TL; and, starting with a = b = 1, obtaining the t of the PF(t) corresponding to PL_a, labeled T1, obtaining the t of the TF(t) corresponding to TL_b, labeled T2, obtaining TYX_g(t) for t = T2, associating column T2 of the text vector with column T1 of the image feature vector, incrementing both a and b by one and associating again until a or b reaches its maximum value, and analyzing all sample data in this way.

7. The method for automatic recognition and storage of multimodal images and text according to claim 6, characterized in that recognizing the image vector matrix and the text vector matrix based on the aligned image feature vector and text vectors, thereby identifying and storing related image and text data, includes the following sub-steps: analyzing the association threshold between image data and text data based on the aligned image feature vector and text vectors; and identifying and storing related image and text data based on the association threshold.

8. The method for automatic recognition and storage of multimodal images and text according to claim 7, characterized in that analyzing the association threshold between image data and text data based on the aligned image feature vector and text vectors includes the following sub-steps: obtaining related sample images and sample texts, counting the number of words in each sample text, and denoting the count W1; obtaining the image feature vector of the sample image and the text vectors of the sample text, where each text vector corresponds to one word, and, if the effective text features in a text vector are associated with the effective image features in the image feature vector, marking the word corresponding to that text vector as related to the sample image; counting the number of words in the sample text that are related to the sample image, denoted W2, calculating W2 / W1, naming the result the association rate, and calculating the association rate for all sample data; establishing a two-dimensional coordinate system with W1 as the horizontal axis and the association rate as the vertical axis, named the association threshold analysis chart, and plotting each association rate in the chart according to its W1; and performing polynomial regression on the association threshold analysis chart and naming the regression function the threshold floating function, which is used to solve for the association threshold.

9. The method for automatic recognition and storage of multimodal images and text according to claim 8, characterized in that identifying and storing related image and text data based on the association threshold includes the following sub-steps: naming the input image data and text data the image to be analyzed and the text to be analyzed, respectively, and dividing the text to be analyzed into sub-texts to be analyzed using periods and newline characters as delimiters; extracting the image feature vector of the image to be analyzed, named the image recognition feature, extracting the text feature matrix of each sub-text to be analyzed, calculating the association rate between the sub-text to be analyzed and the image to be analyzed based on the image recognition feature and the text feature matrix, named the recognition correlation, and labeling the number of words in the sub-text to be analyzed R1; substituting R1 into the threshold floating function to solve for the association threshold, and determining whether the recognition correlation is greater than or equal to the association threshold, and, if so, marking the sub-text to be analyzed as related to the image to be analyzed, or, if not, marking it as unrelated; and naming the image to be analyzed together with its related sub-texts image-text data and storing the image-text data.

10. A multimodal image and text automatic recognition and storage system, used to implement the multimodal image and text automatic recognition and storage method according to any one of claims 1-9, characterized in that it includes an image and text input module, a feature extraction module, a feature learning module, and a recognition and storage module, wherein the image and text input module, the feature extraction module, and the feature learning module are each connected to the recognition and storage module for data transmission; the image and text input module is used to preprocess the input image data and text data and output valid image data and valid text data; the feature extraction module is used to recognize the valid image data and convert it into an image vector matrix, and to recognize the valid text data and convert it into a text vector matrix; the feature learning module is used to collect sample data to perform association learning on the image vector matrix and the text vector matrix, and to align the image feature vector in the image vector matrix with the text vectors in the text vector matrix; and the recognition and storage module is used to recognize the image vector matrix and the text vector matrix based on the aligned image feature vector and text vectors, thereby identifying and storing related image and text data.