Data expansion methods, devices, electronic equipment and storage media
By rewriting and deduplicating multimodal datasets and expanding the data using seed text and image features, the problems of data redundancy and insufficient quality in multimodal data models are solved, thereby improving the performance and efficiency of the data models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
- Filing Date
- 2024-12-20
- Publication Date
- 2026-06-30
AI Technical Summary
The performance of multimodal data models is insufficient, mainly due to the lack of quantity and quality of multimodal instruction data. This leads to problems such as data redundancy, uneven distribution, and low quality during the data expansion process, which in turn results in long training cycles and wasted computing resources.
By acquiring the first dataset, rewriting the first text into a second text in descriptive form, filtering semantically similar datasets, applying a pre-defined deduplication strategy to remove redundancy, and expanding the data from the database using seed text and image features, a high-quality target dataset is formed.
It improves the quality and accuracy of multimodal data, ensures data coverage of multiple scenarios, and reduces training cycles and computational resource waste.
Figure CN122309633A_ABST
Description
Technical Field
[0001] This application relates to the field of multimodal data analysis technology, and in particular to a data augmentation method, apparatus, electronic device and storage medium.
Background Technology
[0002] Currently, research on multimodal data models has garnered widespread attention across various fields. However, the performance of these models is hampered by the insufficient quantity and quality of multimodal instruction data. Multimodal data is typically generated by manually constructing image-question-answer pairs or by extracting simple instructions from large models, which is inefficient, yields limited data volume, and produces data with a narrow, task-specific focus. To increase the quantity of multimodal data, datasets are often simply merged. However, this data expansion method can lead to data redundancy, uneven data distribution, and low quality, consequently resulting in long training cycles, wasted computational resources, and poor model performance.
Summary of the Invention
[0003] In view of the above, it is necessary to propose a data augmentation method, apparatus, electronic device and storage medium to solve the technical problem of low accuracy in multimodal data augmentation.
[0004] This application provides a data augmentation method applied to an electronic device. The method includes: acquiring a first dataset, the first dataset including first text and a corresponding first image; determining a corresponding second text based on the semantics of the first text; determining a second dataset from the first dataset based on the second text; performing a deduplication operation on the second dataset based on a preset deduplication strategy to obtain a third dataset, the third dataset including seed text and a seed image; determining augmented text from a database based on the semantics of the seed text; determining an augmented image from the database based on the image features of the seed image; and determining an augmented target dataset based on the third dataset, the augmented text, and the augmented image.
[0005] An embodiment of this application also provides a data augmentation device, the device comprising: an acquisition module, configured to acquire a first dataset, the first dataset including first text and a corresponding first image; a determination module, configured to determine a corresponding second text based on the semantics of the first text; the determination module is further configured to determine a second dataset from the first dataset based on the second text; a deduplication module, configured to perform deduplication on the second dataset based on a preset deduplication strategy to obtain a third dataset, the third dataset including seed text and a seed image; the determination module is further configured to determine augmented text from a database based on the semantics of the seed text; the determination module is further configured to determine an augmented image from the database based on the image features of the seed image; and an augmentation module, configured to determine an augmented target dataset based on the third dataset, the augmented text, and the augmented image.
[0006] This application also provides an electronic device, comprising: a memory storing at least one instruction; and a processor executing the instructions stored in the memory to implement the data expansion method described above.
[0007] This application also provides a computer-readable storage medium storing at least one instruction, which is executed by a processor in an electronic device to implement the data expansion method described above.
[0008] As can be seen from the above technical solutions, the embodiments of this application rewrite the form of the first text based on the semantics of the question text and the answer text in the first text to obtain the second text in the form of a statement. This can unify the semantics and form of the question text and the answer text, and improve the accuracy of the semantic analysis process in the subsequent data expansion process. Based on the second text, a second dataset with semantic similarity is determined from the first dataset, and deduplication is performed on the second dataset based on a preset deduplication strategy to obtain a third dataset. This can improve the data diversity of the third dataset, ensure that the text data and image data in the third dataset cover more scenarios, and improve the quality of multimodal data. Then, seed text and seed images are determined from the third dataset. Expanded text is determined based on the semantics of the seed text, and expanded images are determined based on the image features of the seed images. Finally, the third dataset, expanded text, and expanded images are combined to obtain the target dataset. This can expand the third dataset based on seed text and seed images covering multiple scenarios, ensure the diversity of multimodal data in the expanded target dataset, and thus improve the quality of multimodal data.
Attached Figure Description
[0009] Figure 1 is an application scenario diagram of a data augmentation method provided in an embodiment of this application.
[0010] Figure 2 is a flowchart of a data augmentation method provided in an embodiment of this application.
[0011] Figure 3 is a flowchart of a method for determining a second dataset provided in an embodiment of this application.
[0012] Figure 4 is a flowchart of a method for deduplicating a second dataset according to an embodiment of this application.
[0013] Figure 5 is a flowchart of a method for deduplicating a second dataset provided in another embodiment of this application.
[0014] Figure 6 is a flowchart of a method for deduplicating a second dataset according to another embodiment of this application.
[0015] Figure 7 is a flowchart of a method for constructing graph data according to an embodiment of this application.
[0016] Figure 8 is a functional block diagram of a data expansion device provided in an embodiment of this application.
[0017] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0018] To better understand the purpose, features, and advantages of this application, a detailed description of the application is provided below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the embodiments of this application can be combined with each other. Numerous specific details are set forth in the following description to provide a thorough understanding of this application; the described embodiments are only a part of the embodiments of this application, and not all of them.
[0019] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the stated features. In the description of this application, "a plurality of" means two or more, unless otherwise explicitly specified.
[0020] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0021] This application provides a data expansion method that can be applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculations and / or information processing according to pre-set or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
[0022] Electronic devices can be any electronic product that allows human-computer interaction with a user, such as personal computers, tablets, smartphones, personal digital assistants (PDAs), game consoles, Internet Protocol television (IPTV), smart wearable devices, etc.
[0023] Electronic devices may also include network devices and / or client devices. The network devices include, but are not limited to, a single network server, a server group consisting of multiple network servers, or a cloud based on cloud computing consisting of a large number of hosts or network servers.
[0024] The networks in which electronic devices reside include, but are not limited to, the Internet, wide area networks (WANs), metropolitan area networks (MANs), local area networks (LANs), and virtual private networks (VPNs).
[0025] As shown in Figure 1, the data augmentation method provided in this application can be applied to an electronic device 100, which is communicatively connected to a database 200. The database 200 is used to store multimodal data (e.g., text data, image data, video data, voice data, etc.).
[0026] In one embodiment of this application, the electronic device 100 acquires a first dataset including first text and a first image. It rewrites the form of the first text according to its semantics to determine a second text with a corresponding descriptive form. The electronic device 100 also filters data in the first dataset based on the second text to obtain a corresponding second dataset, and performs deduplication on the second dataset based on a preset deduplication strategy to obtain a third dataset. This reduces the amount of data while ensuring data quality. The electronic device 100 also determines seed text and seed image from the third dataset, determines expanded text from a database 200 based on the semantics of the seed text, and determines expanded images from the database 200 based on the image features of the seed images. Finally, the electronic device 100 fuses the third dataset, expanded text, and expanded images to obtain an expanded target dataset. This enables the expansion of multimodal data and ensures that the types of multimodal data cover a wide range during the expansion process, thereby improving the accuracy of the multimodal data.
[0027] Figure 2 is a flowchart of a data augmentation method according to an embodiment of this application. The order of the steps in the flowchart can be changed, and some steps can be omitted, depending on different requirements. The data augmentation method provided in this embodiment includes the following steps.
[0028] S20, Obtain the first dataset, which includes the first text and the corresponding first image.
[0029] In one embodiment of this application, the first dataset may be a dataset containing multimodal data used to train a generative pre-trained model. A multimodal dataset is a dataset composed of data from two or more modalities. For example, a user may ask a question about a historical event, and the system provides a detailed answer through text data. Specifically, the multimodal data in the first dataset includes first text and a corresponding first image, where the first text represents questions and answers about the content contained in the first image. For some questions requiring visual aids (such as art appreciation, geolocation recognition, etc.), the first image can provide intuitive visual information. Additionally, multimodal data may also include audio data, video data, etc., which are not limited in this application. Audio data can be the primary form of input data in question-and-answer scenarios (e.g., speech recognition or voice question-and-answer systems), and can also be used to provide background music, environmental sound effects, etc., to enhance the accuracy of the question-and-answer system. Video data combines multiple information from images and audio, thus providing multi-dimensional information in some complex question-and-answer scenarios, thereby improving the accuracy of relevant answers based on video content analysis.
[0030] In one embodiment of this application, the first dataset includes first text and a first image, wherein the first text is used to represent questions and answers regarding the content contained in the first image. For example, when the first dataset includes data related to the education field, the first image may be a surveillance image in a teaching scenario, and the first text may be a question text asking about the content in the teaching scenario and a response text answering the question. For example, if the first image is a surveillance image in a classroom, the question text asking about the content may be "What class is being conducted in the image?", and the response text answering the question may be "The class being conducted in the image is a math class taught by Teacher Wang." When the first dataset includes data related to the medical field, the first image may be a surveillance image in a medical scenario, and the first text may be a question text asking about the content in the medical scenario and a response text answering the question. For example, if the first image is a medically related image of a patient, the question text asking about the content may be "What kind of medical information is present in this image, and what kind of medical advice can be given?", and the response text answering the question may be "The image includes multiple physical indicators of the patient, and the relevant advice given includes regular sleep to enhance immunity." When the first dataset includes data related to the entertainment field, the first image can be a surveillance image of an entertainment scene, and the first text can be a question text asking about the content of the entertainment scene and a response text answering the question. For example, if the first image is an image related to an entertainment scene, the question text asking about the content could be "What entertainment facilities are included in this image? What kind of entertainment activity is taking place?", and the response text answering the question could be "The image includes entertainment facilities such as slides and carousels, and a carnival is being held."
[0031] S21, determine the corresponding second text based on the semantics of the first text.
[0032] In one embodiment of this application, the first text in the multimodal data includes both question text and response text, and the semantic difference between the two may lower the accuracy of interpreting the first image from the first text. To train a generative pre-trained model for multimodal question answering on the first text and the first image, and thereby improve its performance, key information in the first text can be extracted and integrated based on the semantics of the question text and response text, and then used in subsequent training. For example, prompt data can be constructed from the question text, the response text, and a pre-stored prompt template, and input into a generative pre-trained model to obtain the second text as output. The second text represents the semantics of the question text and the response text and can take the form of a declarative sentence. This improves the efficiency of multimodal data analysis and saves time on semantic processing. Furthermore, converting question and answer texts into declarative statements makes the data consistent and standardized, which helps eliminate ambiguity, vagueness, or inconsistency in the question-and-answer data and so improves the accuracy and reliability of data analysis. The declarative format also makes the data easier to mine: patterns, trends, and correlations are easier to discover in the transformed data, supporting deeper analysis and decision-making. Moreover, the generative pre-trained model maintains the semantic integrity and coherence of the text during conversion, so that the transformed data remains readable and understandable.
[0033] For example, if the question text in the first text is "What class is taking place in the image?", and the answer text in the first text is "The class taking place in the image is a math class taught by Teacher Wang", then the second text could be "The first image shows a math class taught by Teacher Wang". If the question text in the first text is "What kind of medical information is present in this image, and what kind of medical advice can be given?", and the answer text in the first text is "The image includes multiple physical indicators of the patient, and the relevant advice given includes regular sleep to enhance immunity", then the second text could be "The first image includes multiple physical indicators of the patient, and the relevant advice given based on the content of the first image includes regular sleep to enhance immunity". If the question text in the first text is "What recreational facilities are included in this image? What kind of recreational activities are taking place?", and the answer text in the first text is "The image includes recreational facilities such as slides and carousels, and a carnival is being held," then the second text could be "The image includes recreational facilities such as slides and carousels, and the recreational facilities in the image are being used to hold a carnival."
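The prompt construction described in S21 can be sketched as follows. This is a minimal illustration only: the template wording, the name `PROMPT_TEMPLATE`, and the function `build_rewrite_prompt` are assumptions, since the application states only that a pre-stored prompt template is combined with the question text and response text. The call to the generative pre-trained model is deliberately left out.

```python
from string import Template

# Hypothetical pre-stored prompt template; the wording is an assumption.
PROMPT_TEMPLATE = Template(
    "Rewrite the following question and answer as a single declarative "
    "statement about the image.\n"
    "Question: $question\n"
    "Answer: $answer\n"
    "Statement:"
)


def build_rewrite_prompt(question: str, answer: str) -> str:
    """Fill the template with one question/response pair from the first text."""
    return PROMPT_TEMPLATE.substitute(
        question=question.strip(), answer=answer.strip()
    )


prompt = build_rewrite_prompt(
    "What class is being conducted in the image?",
    "The class being conducted in the image is a math class taught by Teacher Wang.",
)
print(prompt)
```

The resulting string would then be passed to whichever generative model is in use, and its output taken as the second text.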
[0034] S22, Based on the second text, determine the second dataset from the first dataset.
[0035] In one embodiment of this application, in order to filter the first text and the first image in the first dataset and remove first text and first images with a low degree of semantic matching, the semantics of the second text can be determined first, and the semantic matching degree between the first text and the first image can then be evaluated based on the semantics of the second text. Since the second text is a declarative sentence generated from the question text and the answer text, it contains all the semantic information in the first text and unifies the form of the question text and the answer text. Therefore, using the semantics of the second text to evaluate the semantic matching degree between the first text and the first image can improve the accuracy of the evaluation.
[0036] In one embodiment of this application, the method for determining the second dataset from the first dataset based on the second text is described in detail with reference to Figure 3.
[0037] S23, perform deduplication on the second dataset based on a preset deduplication strategy to obtain a third dataset, which includes seed text and seed image.
[0038] In one embodiment of this application, during the acquisition of the second dataset, although the first text and the first image were filtered based on the semantics of the second text, no semantic analysis was performed between different first texts, nor was similarity analysis performed on the image features of different first images. This may result in redundant information being included in the second dataset. To remove potential semantic duplication while ensuring the diversity and richness of the data, a deduplication strategy can be formulated based on factors such as the types of question and answer texts in the first text, the semantic concepts of the question and answer texts, or the image features of the first images. Redundant data in the second dataset is then identified and removed according to the deduplication strategy to obtain the third dataset.
[0039] In one embodiment of this application, the deduplication strategy includes at least one of the following: performing deduplication on the second dataset based on the type of the first text; performing deduplication on the second dataset based on the semantics of the first text; and performing deduplication on the second dataset based on the image features of the first image. Specifically, deduplication based on the type of the first text is described in detail with reference to Figure 4, deduplication based on the semantics of the first text is described in detail with reference to Figure 5, and deduplication based on the image features of the first image is described in detail with reference to Figure 6.
[0040] In one embodiment of this application, the third dataset includes seed text and corresponding seed images. The seed text can be a plurality of first texts randomly determined from the third dataset according to a preset proportion threshold, and the seed image can be a first image corresponding to a randomly determined first text. For example, the preset proportion threshold can be 30%. When the third dataset includes 100 first texts and 100 first images, any 30 first texts can be randomly determined from the third dataset as seed texts, and the corresponding 30 first images can be determined as seed images.
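The seed selection in this paragraph can be sketched as a random sample over text-image pairs. The function name `select_seeds`, the pair representation, and the use of a seeded `random.Random` for reproducibility are assumptions made for illustration; the application specifies only the proportion threshold and the random choice.

```python
import random


def select_seeds(pairs, proportion=0.3, rng=None):
    """Randomly pick a proportion of (first_text, first_image) pairs as seeds.

    Sampling whole pairs keeps each seed image aligned with its seed text.
    """
    rng = rng or random.Random()
    count = round(len(pairs) * proportion)
    return rng.sample(pairs, count)


# Placeholder third dataset with 100 aligned text-image pairs.
third_dataset = [(f"text_{i}", f"image_{i}.jpg") for i in range(100)]

seeds = select_seeds(third_dataset, proportion=0.3, rng=random.Random(0))
seed_texts = [text for text, _ in seeds]
seed_images = [image for _, image in seeds]
print(len(seed_texts), len(seed_images))
```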
[0041] S24, Based on the semantics of the seed text, determine the extended text from the database.
[0042] In one embodiment of this application, the first text and the first image can be applied to a visual question-answering scenario: the question text in the first text asks about the content of the first image, and the response text in the first text answers that question. To use the first text and the first image to train a generative pre-trained model for visual question answering and improve its accuracy, the diversity of the first text and the first image can be expanded. Specifically, the third dataset can be expanded based on the seed text and seed images. Since the seed texts and seed images are limited in number yet cover diverse visual question-answering scenarios, expanding the third dataset from them ensures that the expanded dataset retains diverse information.
[0043] In one embodiment of this application, when expanding the third dataset based on seed text, the analysis can be refined to include the type of question text in the seed text (e.g., yes / no questions, multiple-choice questions, fill-in-the-blank questions, etc.), the precision of the seed text description (e.g., vague descriptions versus precise descriptions), and variations in tone (e.g., inquiries, requests, commands, etc.). A pre-trained first feature recognition model is used to identify the semantics of the seed text, and the expanded text is then determined from the database based on those semantics. The first feature recognition model can be a recurrent neural network model, a long short-term memory model, or a graph neural network model; this application does not limit this. For example, when the similarity between the semantics of any text in the database and the semantics of the seed text reaches a similarity threshold, that text can be determined as expanded text. This ensures that the obtained expanded text has diverse forms and covers a wider range of user intents.
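The threshold-based lookup in S24 can be sketched as follows, assuming the semantics of each text have already been encoded as a numeric vector by some feature model. The toy vectors, the choice of cosine similarity, the function `retrieve_expanded_texts`, and the threshold value of 0.8 are all assumptions for illustration and are not taken from the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_expanded_texts(seed_vector, database, threshold):
    """Return database texts whose similarity to the seed reaches the threshold."""
    return [
        text
        for text, vector in database.items()
        if cosine_similarity(seed_vector, vector) >= threshold
    ]


# Toy semantic vectors standing in for the output of a feature model.
seed_vector = [1.0, 0.0, 0.0]
database = {
    "A math lesson is in progress.": [0.9, 0.1, 0.0],
    "A carnival is being held.": [0.0, 1.0, 0.0],
    "Students are sitting in a room.": [0.6, 0.8, 0.0],
}

expanded_texts = retrieve_expanded_texts(seed_vector, database, threshold=0.8)
print(expanded_texts)
```

The same lookup pattern would apply to images in S25, with image feature vectors in place of text vectors.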
[0044] S25, determine the augmented image from the database based on the image features of the seed image.
[0045] In one embodiment of this application, a pre-trained second feature extraction model can be used to extract features from a seed image to obtain the image features contained in the seed image, providing a data foundation for subsequent calculations and analysis. The second feature extraction model can be a convolutional neural network model, a fully connected neural network model, or a transfer learning model; this application does not limit this. When the similarity between the image features of any image in the database and the image features of the seed image reaches a similarity threshold, that image can be determined as an augmented image. This expands the coverage of the dataset in terms of image content, ensuring the diversity of image data and providing data support for generating multi-dimensional, multimodal data.
[0046] S26, determine the expanded target dataset based on the third dataset, the expanded text, and the expanded image.
[0047] In one embodiment of this application, the expanded text and the corresponding expanded image can be combined into expanded data, and the expanded data can be added to a third dataset to obtain the expanded target dataset.
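The combination step in S26 can be sketched as below. The application does not state how an expanded text is matched to its corresponding expanded image, so this sketch assumes the two are supplied as aligned lists; skipping pairs already present in the third dataset is likewise an added assumption, and `merge_into_target` is a hypothetical name.

```python
def merge_into_target(third_dataset, expanded_texts, expanded_images):
    """Append aligned (expanded_text, expanded_image) pairs to the third dataset.

    Assumes expanded_texts[i] corresponds to expanded_images[i].
    Pairs already present are skipped so the merge adds no exact duplicates.
    """
    if len(expanded_texts) != len(expanded_images):
        raise ValueError("expanded texts and images must be aligned one to one")
    target = list(third_dataset)
    seen = set(target)
    for pair in zip(expanded_texts, expanded_images):
        if pair not in seen:
            target.append(pair)
            seen.add(pair)
    return target


third_dataset = [("t1", "i1.jpg"), ("t2", "i2.jpg")]
target_dataset = merge_into_target(
    third_dataset,
    expanded_texts=["t3", "t1"],
    expanded_images=["i3.jpg", "i1.jpg"],
)
print(target_dataset)
```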
[0048] In one embodiment of this application, after obtaining the target dataset, the target dataset can be further deduplicated based on the seed text and the corresponding seed image to remove duplicate or redundant text and image data. The deduplicated text and image data are then used as updated seed samples to provide a data foundation for subsequent data expansion processes, ensuring that the text and image data retain a high degree of diversity after each expansion of the target dataset.
[0049] In one embodiment of this application, after obtaining the expanded target dataset, graph data is constructed based on the first dataset before expansion and the expanded target dataset. This allows for formatted storage of the target dataset, facilitating continuous updates and improving the efficiency of data expansion. Specifically, the method of constructing graph data based on the first dataset before expansion and the expanded target dataset is described in detail with reference to Figure 7.
[0050] As can be seen from the above technical solutions, the embodiments of this application rewrite the form of the first text based on the semantics of the question text and the answer text in the first text to obtain the second text in the form of a statement. This can unify the semantics and form of the question text and the answer text, and improve the accuracy of the semantic analysis process in the subsequent data expansion process. Based on the second text, a second dataset with semantic similarity is determined from the first dataset, and deduplication is performed on the second dataset based on a preset deduplication strategy to obtain a third dataset. This can improve the data diversity of the third dataset, ensure that the text data and image data in the third dataset cover more scenarios, and improve the quality of multimodal data. Then, seed text and seed images are determined from the third dataset. Expanded text is determined based on the semantics of the seed text, and expanded images are determined based on the image features of the seed images. Finally, the third dataset, expanded text, and expanded images are combined to obtain the target dataset. This can expand the third dataset based on seed text and seed images covering multiple scenarios, ensure the diversity of multimodal data in the expanded target dataset, and thus improve the quality of multimodal data.
[0051] Figure 3 is a flowchart of a method for determining a second dataset from a first dataset according to an embodiment of this application. The order of steps in this flowchart can be changed, and some steps can be omitted, depending on different requirements. The method for determining a second dataset from a first dataset according to an embodiment of this application includes the following steps.
[0052] S30, determine the semantic similarity between the image features of the first image and the text features of the second text.
[0053] In one embodiment of this application, a pre-trained first feature recognition model can be used to identify the semantics of the second text and obtain its text features, and a pre-trained second feature extraction model can be used to extract features from the first image to obtain the image features it contains, providing data support for the subsequent similarity calculation. The first feature recognition model can be a recurrent neural network model, a long short-term memory model, or a graph neural network model, and the second feature extraction model can be a convolutional neural network model, a fully connected neural network model, or a transfer learning model; this application does not limit the specific model.
[0054] In one embodiment of this application, the semantic similarity between image features and text features can be determined according to a preset similarity algorithm. The similarity algorithm can be a cosine similarity algorithm, a similarity algorithm based on Euclidean distance, or a similarity algorithm based on Hamming distance; this application does not limit the specific algorithm used.
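The three similarity options named above can be sketched as follows. The application names the algorithms but not how a distance is turned into a similarity score, so the mapping `1 / (1 + distance)` for Euclidean distance and the fraction of matching positions for Hamming distance are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; identical vectors score 1.0.

    The 1 / (1 + d) mapping is an assumption, not taken from the source.
    """
    return 1.0 / (1.0 + math.dist(a, b))


def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length codes agree."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(euclidean_similarity([0.0, 0.0], [3.0, 4.0]))
print(hamming_similarity([1, 0, 1, 1], [1, 1, 1, 0]))
```

Cosine and Euclidean measures suit real-valued feature vectors, while the Hamming variant applies when features have been binarized into fixed-length codes.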
[0055] S31, if the semantic similarity is less than a preset fourth threshold, delete the first image and the corresponding first text.
[0056] In one embodiment of this application, when the semantic similarity between image features and text features is less than a preset fourth threshold, it indicates that the accuracy of interpreting the content in the first image using the semantics of the second text is low. Therefore, it can be determined that the correlation between the first text corresponding to the second text and the first image is low, and the data quality of the first text and the first image is low. Therefore, the first image and the first text can be deleted.
[0057] S32, determine the second dataset based on the remaining first text and the corresponding first image.
[0058] In one embodiment of this application, after deleting the first text and the first image with a low degree of correlation, it can be determined that the remaining first text and the first image have a high degree of correlation, so the remaining first text and the corresponding first image can be determined as the second dataset.
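Steps S30 to S32 can be sketched as a single filtering pass. The `Sample` record, its field names, and the threshold value of 0.5 are assumptions for illustration; the similarity score is taken as already computed between the first image's features and the second text's features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    first_text: str
    first_image: str
    similarity: float  # image-text similarity computed from the second text (S30)


def build_second_dataset(first_dataset, fourth_threshold):
    """Drop samples below the fourth threshold (S31) and keep the rest (S32)."""
    return [s for s in first_dataset if s.similarity >= fourth_threshold]


first_dataset = [
    Sample("Q/A about a math class", "classroom.jpg", 0.91),
    Sample("Q/A about a carnival", "classroom_2.jpg", 0.12),
    Sample("Q/A about patient indicators", "chart.jpg", 0.77),
]

second_dataset = build_second_dataset(first_dataset, fourth_threshold=0.5)
print([s.first_image for s in second_dataset])
```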
[0059] Figure 4 is a flowchart of a method for deduplicating a second dataset based on the type of a first text, according to an embodiment of this application. Depending on requirements, the order of the steps in this flowchart can be changed and some steps can be omitted. The method includes the following steps.
[0060] S40, classify any first text, determine the type of any first text and the probability that any first text belongs to the type.
[0061] In one embodiment of this application, a pre-trained text classification model can be used to classify the first text to obtain the type of any first text and the probability that the first text belongs to that type. Specifically, the type can be classified according to the question format of the question text in the first text (e.g., the type includes yes / no questions, multiple choice questions, fill-in-the-blank questions, etc.), or according to the precision of the description of the first text (e.g., the type includes vague descriptions and precise descriptions) and the change in tone (e.g., the type includes inquiries, requests, commands, etc.).
[0062] S41, if the probability is less than a preset first threshold, delete any first text and the corresponding first image.
[0063] In one embodiment of this application, if the probability that any first text belongs to its corresponding type is less than a preset first threshold, the type of that first text is unclear. Therefore, the first text and its corresponding first image can be deleted. This ensures that the first texts retained under each type clearly belong to that type, reducing redundant first text and first images while preserving the diversity of data across types. For example, if the first threshold is 80%, and a certain first text is classified as an "inquiry" with a probability of 40%, then that first text and its corresponding first image can be deleted.
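Steps S40 and S41 can be sketched as follows. The classifier here is a hypothetical lookup table standing in for a pre-trained text classification model; the names `toy_classify`, `TOY_SCORES`, and `dedup_by_type` are illustrative assumptions, not part of the application.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-type probabilities standing in for a trained classifier's output.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "Is there a dog in the picture?": {
        "yes/no": 0.95, "multiple choice": 0.03, "fill-in-the-blank": 0.02,
    },
    "Dog or something, what about it?": {
        "yes/no": 0.40, "multiple choice": 0.35, "fill-in-the-blank": 0.25,
    },
}


def toy_classify(text: str) -> Dict[str, float]:
    """Return a probability for each type (assumed classifier interface)."""
    return TOY_SCORES[text]


def dedup_by_type(
    pairs: List[Tuple[str, str]],
    classify: Callable[[str], Dict[str, float]],
    first_threshold: float,
) -> List[Tuple[str, str, str]]:
    """Keep (text, image, type) only when the most likely type reaches the threshold."""
    kept = []
    for text, image in pairs:
        probs = classify(text)
        label, prob = max(probs.items(), key=lambda kv: kv[1])  # S40: type and probability
        if prob >= first_threshold:                             # S41: delete if below threshold
            kept.append((text, image, label))
    return kept
```

With a first threshold of 0.8, the second toy text (most likely type at 0.40) would be removed together with its image, mirroring the 80% / 40% example above.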
[0064] Figure 5 is a flowchart of a method for deduplicating a second dataset based on the semantics of a first text, according to an embodiment of this application. Depending on requirements, the order of the steps in this flowchart can be changed and some steps can be omitted. The method includes the following steps.
[0065] S50, determine the first code corresponding to the question text in any first text, and determine the second code corresponding to the response text in any first text.
[0066] In one embodiment of this application, a pre-trained text encoding model can be used to encode the question text in the first text to obtain a first code corresponding to the question text. The pre-trained text encoding model can then be used to encode the response text in the first text to obtain a second code corresponding to the response text. The first code can represent the semantics of the question text in a quantized form, and the second code can represent the semantics of the response text in a quantized form. The text encoding model can be a word embedding model, a recurrent neural network model, or a long short-term memory model; this application does not limit the specific model used.
[0067] S51, if the similarity between the first code and the second code is less than or equal to a preset second threshold, delete that first text and the corresponding first image.
[0068] In one embodiment of this application, if the similarity between the first code and the second code is less than or equal to the second threshold, it indicates that the semantics of the question text and the response text differ significantly. Therefore, the data quality of the first text is poor, and the first text and the corresponding first image can be deleted.
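Steps S50 and S51 can be sketched as follows. A bag-of-words count vector is used here purely as a stand-in for the pre-trained text encoding model, and cosine similarity as the similarity measure; both are assumptions for illustration, as are the function names.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def encode(text: str) -> Counter:
    """Toy text encoding: lowercase word counts (stand-in for a trained encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_sparse(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def dedup_by_semantics(
    samples: List[Tuple[str, str, str]], second_threshold: float
) -> List[Tuple[str, str, str]]:
    """samples are (question text, response text, image id) triples.
    A triple is deleted when the question/response similarity is <= the threshold."""
    kept = []
    for question, response, image in samples:
        first_code = encode(question)    # S50: first code
        second_code = encode(response)   # S50: second code
        if cosine_sparse(first_code, second_code) > second_threshold:  # S51
            kept.append((question, response, image))
    return kept
```

Note the boundary condition: a similarity exactly equal to the second threshold leads to deletion, matching the "less than or equal to" wording of S51.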
[0069] Figure 6 is a flowchart of a method for deduplicating a second dataset based on image features of a first image, according to an embodiment of this application. Depending on requirements, the order of the steps in this flowchart can be changed and some steps can be omitted. The method includes the following steps.
[0070] S60, traverse any two first images and determine the image features of any two first images.
[0071] In one embodiment of this application, the second dataset can also be deduplicated based on the image features of the first image. Specifically, a pre-trained image feature extraction model can be used to determine the image features of one first image and the image features of another first image. The image features represent the content of a first image in a quantized form.
[0072] S61, if the similarity between the image features of any two first images is greater than a preset third threshold, delete either of the two first images and the corresponding first text.
[0073] In one embodiment of this application, if the similarity between the image features of any two first images is greater than a preset third threshold, it indicates that the content contained in the two first images is highly similar, and therefore either of the two first images and the corresponding first text can be deleted.
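Steps S60 and S61 can be sketched as a greedy pairwise comparison. The application allows either image of a near-duplicate pair to be deleted; this sketch assumes the earlier one is kept, and also assumes cosine similarity and precomputed feature vectors. The comparison is quadratic in the number of images, so it is meant only to show the logic.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def dedup_by_image_features(
    samples: List[Tuple[str, Sequence[float]]], third_threshold: float
) -> List[Tuple[str, Sequence[float]]]:
    """samples are (first text, image features) pairs.
    A sample is kept only if its image is not too similar to any image already kept."""
    kept: List[Tuple[str, Sequence[float]]] = []
    for text, features in samples:
        # S61: similarity above the third threshold means a near-duplicate image.
        if all(cosine(features, kept_features) <= third_threshold
               for _, kept_features in kept):
            kept.append((text, features))
    return kept
```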
[0074] Figure 7 is a flowchart of a method for constructing graph data according to an embodiment of this application. Depending on requirements, the order of the steps in this flowchart can be changed and some steps can be omitted. The method for constructing graph data provided in this embodiment includes the following steps.
[0075] S70, construct a node based on the first text and the corresponding first image, as well as the expanded text and the corresponding expanded image.
[0076] In one embodiment of this application, the text encoding of any first text can be used as the node corresponding to the first text, the text encoding of any expanded text can be used as the node corresponding to the expanded text, the image features of any first image can be used as the node corresponding to the first image, and the image features of any expanded image can be used as the node corresponding to the expanded image.
[0077] Specifically, the first text and the expanded text can be encoded separately using a pre-trained text encoding model to obtain the text encoding of the first text and the text encoding of the expanded text. Image features of the first image and image features of the expanded image can also be determined using a pre-trained image feature extraction model. The text encoding model can be a word embedding model, a recurrent neural network model, or a long short-term memory model; the image feature extraction model can be a convolutional neural network model or a fully connected neural network model, and this application does not limit the specific model used.
[0078] S71, determine the edge weights between the nodes based on the semantics of the first text, the semantics of the expanded text, the image features of the first image, and the image features of the expanded image.
[0079] In one embodiment of this application, the edge weights can be determined from the following similarities: the similarity between the semantics of the first text and the semantics of the expanded text gives the edge weight between the node corresponding to the first text and the node corresponding to the expanded text; the similarity between the semantics of the first text and the image features of the first image gives the edge weight between the node corresponding to the first text and the node corresponding to the first image; the similarity between the semantics of the expanded text and the image features of the first image gives the edge weight between the node corresponding to the expanded text and the node corresponding to the first image; and the similarity between the semantics of the expanded text and the image features of the expanded image gives the edge weight between the node corresponding to the expanded text and the node corresponding to the expanded image.
[0080] S72, construct graph data based on the nodes and edge weights.
[0081] In one embodiment of this application, any two nodes can be connected according to the edge weight between them, thus obtaining graph data from all nodes and edge weights. This graph data stores the first text and its corresponding first image, as well as the expanded text and its corresponding expanded image, and uses the edge weights to represent the association between text and image.
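Steps S70 to S72 can be sketched as building a weighted adjacency map. The node names, the `PAIRINGS` list, and the use of cosine similarity as the edge weight are assumptions for illustration; the sketch also assumes that text encodings and image features lie in a shared vector space so that cross-modal similarities are meaningful.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


# Node pairs for which an edge weight is determined in S71 (hypothetical names).
PAIRINGS: List[Tuple[str, str]] = [
    ("first_text", "expanded_text"),
    ("first_text", "first_image"),
    ("expanded_text", "first_image"),
    ("expanded_text", "expanded_image"),
]


def build_graph(
    nodes: Dict[str, Sequence[float]], pairings: List[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Return an undirected weighted graph as {node: {neighbor: edge weight}}."""
    graph: Dict[str, Dict[str, float]] = {name: {} for name in nodes}  # S70: nodes
    for u, v in pairings:
        weight = cosine(nodes[u], nodes[v])                            # S71: edge weight
        graph[u][v] = weight                                           # S72: connect nodes
        graph[v][u] = weight
    return graph
```

An adjacency map is only one possible representation; an edge list or a graph database would serve equally well for storing the nodes and weights.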
[0082] Figure 8 is a functional block diagram of a data expansion device provided in one embodiment of this application. A data expansion device 81 includes an acquisition module 811, a determination module 812, a deduplication module 813, and an expansion module 814. The modules / units referred to in this application are series of computer-readable instruction segments that can be executed by the processor 13, perform a fixed function, and are stored in the memory 12. In this embodiment, the functions of each module / unit will be described in detail in subsequent embodiments.
[0083] The acquisition module 811 is used to acquire a first dataset, which includes first text and a corresponding first image.
[0084] The determining module 812 is used to determine the corresponding second text based on the semantics of the first text.
[0085] The determining module 812 is further configured to determine a second dataset from the first dataset based on the second text.
[0086] The deduplication module 813 is used to perform deduplication on the second dataset based on a preset deduplication strategy to obtain a third dataset, which includes seed text and seed image.
[0087] The determining module 812 is further configured to determine the expanded text from the database based on the semantics of the seed text.
[0088] The determining module 812 is further configured to determine an expanded image from the database based on the image features of the seed image.
[0089] The expansion module 814 is used to determine the expanded target dataset based on the third dataset, the expanded text, and the expanded image.
[0090] The deduplication module 813 is further configured to perform deduplication on the second dataset according to the type of the first text, including classifying any first text, determining the type of any first text and the probability that any first text belongs to the type; and deleting any first text and the corresponding first image when the probability is less than a preset first threshold.
[0091] The deduplication module 813 is further configured to perform deduplication on the second dataset based on the semantics of the first text, including: determining a first code corresponding to the question text in any first text, determining a second code corresponding to the response text in any first text; and deleting any first text and its corresponding first image if the similarity between the first code and the second code is less than or equal to a preset second threshold.
[0092] The deduplication module 813 is further configured to perform deduplication on the second dataset based on the image features of the first image, including: deleting any one of the two first images and the corresponding first text if the similarity between the image features of any two first images is greater than a preset third threshold.
[0093] The determining module 812 is further configured to determine the semantic similarity between the image features of the first image and the text features of the second text; if the semantic similarity is less than a preset fourth threshold, delete the first image and the corresponding first text; and determine the second dataset based on the remaining first text and the corresponding first image.
[0094] The expansion module 814 is further configured to construct nodes based on the first text and the corresponding first image, as well as the expanded text and the corresponding expanded image; determine the edge weights between the nodes based on the semantics of the first text, the semantics of the expanded text, the image features of the first image, and the image features of the expanded image; and construct graph data based on the nodes and the edge weights.
[0095] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device 100 includes a memory 12 and a processor 13. The memory 12 is used to store computer-readable instructions, and the processor 13 executes the computer-readable instructions stored in the memory to implement the data expansion method described in any of the above embodiments.
[0096] In one embodiment of this application, the electronic device 100 further includes a bus and a computer program, such as a data expansion program, stored in the memory 12 and executable on the processor 13.
[0097] Figure 9 shows only an electronic device 100 with a memory 12 and a processor 13; those skilled in the art will understand that the structure shown in Figure 9 does not constitute a limitation on the electronic device 100, which may include fewer or more components than shown, combine certain components, or have a different component arrangement.
[0098] With reference to Figure 2, the memory 12 in the electronic device 100 stores a plurality of computer-readable instructions to implement a data expansion method. The processor 13 can execute the plurality of instructions to achieve: acquiring a first dataset, the first dataset including a first text and a corresponding first image; determining a corresponding second text based on the semantics of the first text; determining a second dataset from the first dataset based on the second text; performing a deduplication operation on the second dataset based on a preset deduplication strategy to obtain a third dataset, the third dataset including a seed text and a seed image; determining expanded text from the database based on the semantics of the seed text; determining an expanded image from the database based on the image features of the seed image; and determining an expanded target dataset based on the third dataset, the expanded text, and the expanded image.
[0099] Specifically, for the processor 13's implementation of the above instructions, refer to the description of the relevant steps in the embodiment corresponding to Figure 2, which is not repeated here.
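The expansion steps (determining expanded text and an expanded image from the database) can be sketched as a nearest-neighbor lookup. This is a hedged illustration: the application does not state how many database entries are selected or how they are ranked, so the top-k ranking by cosine similarity, the in-memory dictionaries standing in for the database, and all function names are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def retrieve_top_k(
    query: Sequence[float], database: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the keys of the k database entries most similar to the query vector."""
    ranked = sorted(database, key=lambda key: cosine(query, database[key]), reverse=True)
    return ranked[:k]


def expand_seed(
    seed_text_vec: Sequence[float],
    seed_image_vec: Sequence[float],
    text_db: Dict[str, Sequence[float]],
    image_db: Dict[str, Sequence[float]],
    k: int = 1,
) -> Dict[str, List[str]]:
    """Look up expanded text by seed-text semantics and expanded images by seed-image features."""
    return {
        "expanded_texts": retrieve_top_k(seed_text_vec, text_db, k),
        "expanded_images": retrieve_top_k(seed_image_vec, image_db, k),
    }
```

The retrieved entries would then be combined with the third dataset to form the expanded target dataset.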
[0100] Those skilled in the art will understand that the schematic diagram is merely an example of the electronic device 100 and does not constitute a limitation on the electronic device 100. The electronic device 100 may be a bus-type structure or a star-type structure. The electronic device 100 may also include more or fewer other hardware or software than shown in the diagram, or different component arrangements. For example, the electronic device 100 may also include input / output devices, network access devices, etc.
[0101] It should be noted that electronic device 100 is only an example. Other existing or future electronic products that are suitable for this application should also be included within the scope of protection of this application and are incorporated herein by reference.
[0102] The memory 12 includes at least one type of readable storage medium, which can be non-volatile or volatile. The readable storage medium includes flash memory, portable hard drives, multimedia cards, card-type memory (e.g., SD or DX memory), magnetic storage, magnetic disks, optical disks, etc. In some embodiments, the memory 12 can be an internal storage unit of the electronic device 100, such as the portable hard drive of the electronic device 100. In other embodiments, the memory 12 can also be an external storage device of the electronic device 100, such as a plug-in portable hard drive, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the electronic device 100. The memory 12 can be used not only to store application software and various types of data installed on the electronic device 100, such as the code of a data expansion program, but also to temporarily store data that has been output or will be output.
[0103] In some embodiments, the processor 13 may be composed of integrated circuits, such as a single packaged integrated circuit or multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The processor 13 is the control unit of the electronic device 100, connecting to various components of the electronic device 100 via various interfaces and lines. It executes programs or modules stored in the memory 12 (e.g., executing a data expansion program) and calls data stored in the memory 12 to perform various functions of the electronic device 100 and process data.
[0104] The processor 13 executes the operating system of the electronic device 100 and various installed applications. The processor 13 executes the applications to implement the steps in each of the above-described data expansion method embodiments, for example, the steps shown in Figure 2.
[0105] For example, the computer program may be divided into one or more modules / units, which are stored in the memory 12 and executed by the processor 13 to complete this application. The one or more modules / units may be a series of computer-readable instruction segments capable of performing a specific function, which describe the execution process of the computer program in the electronic device 100. For example, the computer program may be divided into an acquisition module 811, a determination module 812, a deduplication module 813, and an expansion module 814.
[0106] The integrated unit implemented as a software functional module described above can be stored in a computer-readable storage medium. This software functional module, stored in a storage medium, includes several instructions to cause a computer device (which may be a personal computer, computer equipment, or network device, etc.) or processor to execute portions of the data expansion method described in the various embodiments of this application.
[0107] If the modules / units integrated in the electronic device 100 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware devices. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above.
[0108] The computer program includes computer program code, which may be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory, and other memory.
[0109] Furthermore, the computer-readable storage medium may primarily include a stored program area and a stored data area, wherein the stored program area may store the operating system, an application program required for at least one function, etc.; and the stored data area may store data created based on the use of blockchain nodes, etc.
[0110] The bus can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This bus can be divided into address bus, data bus, control bus, etc. For ease of representation, the bus is represented by only one arrow in Figure 9, but this does not indicate that there is only one bus or one type of bus. The bus is configured to enable communication between the memory 12 and at least one processor 13, etc.
[0111] This application also provides a computer-readable storage medium (not shown), which stores computer-readable instructions that are executed by a processor in an electronic device to implement the data expansion method described in any of the above embodiments.
[0112] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0113] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0114] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0115] Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices described in the specification may also be implemented by a single unit or device through software or hardware. Terms such as "first," "second," etc., are used to indicate names and do not indicate any specific order.
[0116] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit it. Although this application has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of this application without departing from the spirit and scope of the technical solutions of this application.
Claims
1. A data expansion method applied to an electronic device, wherein the electronic device is communicatively connected to a database, characterized in that the method includes: obtaining a first dataset, which includes a first text and a corresponding first image; determining a corresponding second text based on the semantics of the first text; determining a second dataset from the first dataset based on the second text; deduplicating the second dataset based on a preset deduplication strategy to obtain a third dataset, which includes seed text and a seed image; determining expanded text from the database based on the semantics of the seed text; determining an expanded image from the database based on the image features of the seed image; and determining an expanded target dataset based on the third dataset, the expanded text, and the expanded image.
2. The data expansion method as described in claim 1, characterized in that the deduplication strategy includes at least one of the following: deduplicating the second dataset based on the type of the first text; deduplicating the second dataset based on the semantics of the first text; and deduplicating the second dataset based on the image features of the first image.
3. The data expansion method as described in claim 2, characterized in that the step of deduplicating the second dataset based on the type of the first text includes: classifying any first text, and determining the type of that first text and the probability that the first text belongs to the type; and if the probability is less than a preset first threshold, deleting that first text and its corresponding first image.
4. The data expansion method as described in claim 2, characterized in that the step of deduplicating the second dataset based on the semantics of the first text includes: determining a first code corresponding to the question text in any first text, and determining a second code corresponding to the response text in that first text; and if the similarity between the first code and the second code is less than or equal to a preset second threshold, deleting that first text and the corresponding first image.
5. The data expansion method as described in claim 2, characterized in that the step of deduplicating the second dataset based on the image features of the first image includes: if the similarity between the image features of any two first images is greater than a preset third threshold, deleting either of the two first images and the corresponding first text.
6. The data expansion method as described in claim 1, characterized in that the step of determining the second dataset from the first dataset based on the second text includes: determining the semantic similarity between the image features of the first image and the text features of the second text; if the semantic similarity is less than a preset fourth threshold, deleting the first image and the corresponding first text; and determining the second dataset based on the remaining first text and the corresponding first image.
7. The data expansion method as described in claim 1, characterized in that the method further includes: constructing nodes based on the first text and the corresponding first image, as well as the expanded text and the corresponding expanded image; determining the edge weights between the nodes based on the semantics of the first text, the semantics of the expanded text, the image features of the first image, and the image features of the expanded image; and constructing graph data based on the nodes and the edge weights.
8. A data expansion device, characterized in that the device includes: an acquisition module, used to acquire a first dataset, which includes first text and a corresponding first image; a determining module, used to determine the corresponding second text based on the semantics of the first text; the determining module being further configured to determine a second dataset from the first dataset based on the second text; a deduplication module, used to perform deduplication on the second dataset based on a preset deduplication strategy to obtain a third dataset, which includes seed text and a seed image; the determining module being further configured to determine the expanded text from the database based on the semantics of the seed text; the determining module being further configured to determine an expanded image from the database based on the image features of the seed image; and an expansion module, used to determine the expanded target dataset based on the third dataset, the expanded text, and the expanded image.
9. An electronic device, characterized in that the electronic device includes a processor and a memory, the processor being configured to implement the data expansion method as described in any one of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it implements the data expansion method as described in any one of claims 1 to 7.