Method, device and medium for constructing a minority language image-text dataset for Internet data
HTML webpage files are obtained from the Internet, and a multimodal large model is used to clean and label image and text data in minority languages. This addresses the insufficient coverage of minority language image-text datasets and yields a high-quality, standardized dataset with improved diversity and usability, promoting the development of artificial intelligence technology in countries where minority languages are spoken.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
- Filing Date
- 2024-10-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies lack sufficient coverage of image and text datasets for minority languages, leading to delays in model training and application development. Furthermore, inconsistent annotation quality and a lack of standardized processes affect the usability and reliability of the datasets.
HTML webpage files are obtained from the Internet, and a multimodal large model is used to clean and annotate the text and image data of minority languages. This includes cleaning the document object model tree, extracting text description information, filtering images and text, and combining background information detection to generate high-quality image-text description pairs and visual question-answering pairs. Grammar checks are also performed on the minority language text to build a standardized image-text dataset for the minority languages.
We have constructed a high-quality, diverse, and standardized dataset of text and image data in minority languages, which improves the usability and accuracy of the data, meets the needs of multilingual processing, and promotes the development of artificial intelligence technology in countries with minority languages.
Figure CN119577205B_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing, and in particular to a method, device, and medium for constructing a dataset of images and text in a minority language for Internet data.
Background Technology
[0002] By training large models on multimodal image and text datasets and fusing visual and textual information, not only are rich information and context provided to the large models, but they are also able to master and perform complex visual and language tasks. This is crucial for cultivating large models that can understand and process multimodal inputs, and can achieve significant performance improvements in key areas such as image recognition, image description generation, and visual question answering. It is currently widely used in fields including autonomous driving, medical image analysis, and intelligent education.
[0003] However, most current image-text datasets focus primarily on mainstream languages such as English and Chinese, with insufficient coverage of minority languages, and therefore fail to meet the growing demand for multilingual processing. This lack of datasets limits model training and application development for minority languages, causing the development and innovation of large-model technologies in the countries where they are spoken to lag behind and limiting those countries' access to advanced artificial intelligence technologies. Furthermore, minority languages are often closely linked to specific cultures and communities; the absence of datasets for these languages means overlooking the significant value of large models in terms of cultural diversity. In addition, the inconsistent annotation quality and lack of standardized processes in existing image-text datasets affect their usability and reliability.
[0004] Therefore, there is a need for a method for constructing image and text datasets for minority languages that overcomes the shortcomings of existing image and text datasets for minority languages, especially the limitations in data diversity, annotation quality and usability.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the existing technology by providing a method, device and medium for constructing minority language image and text datasets for Internet data, which can realize the construction of high-quality, diverse and standardized minority language image and text datasets.
[0006] The objective of this invention can be achieved through the following technical solutions:
[0007] According to a first aspect of the present invention, a method for constructing a minority language image-text dataset for Internet data is provided, the method comprising:
[0008] Obtain HTML webpage files from the Internet, extract multimodal documents containing image and text data in minority languages from the HTML webpage files, and obtain image-text related pairs by filtering the replaceable text corresponding to the images in the multimodal documents;
[0009] Input the plain images without replaceable text, the text information after removing the images, and the custom prompts from the multimodal document into the first multimodal large model to generate corresponding image description information and image-text question-and-answer information. Combine the plain images and the corresponding image description information into image-text description pairs, and combine the plain images and the corresponding image-text question-and-answer information into visual question-and-answer pairs.
[0010] Input multimodal documents, image-text related pairs, image-text description pairs, and visual question-answering pairs into the second multimodal large model, and output image scene labels;
[0011] The multimodal documents, image-text related pairs, image-text description pairs, visual question-answering pairs, and image scene labels are combined to obtain the image-text dataset for minority languages.
[0012] Preferably, after obtaining the HTML webpage file in the minority language, the method further includes: using a document object model tree cleaning strategy to simplify the HTML webpage file by removing irrelevant text and images; wherein the simplified HTML webpage file retains the structure of the original HTML webpage file in the minority language.
[0013] Preferably, the text description information associated with the image is extracted from the HTML webpage file in the minority language using the Alt-text attribute of the HTML webpage, and the text context information corresponding to the image is extracted from the HTML webpage file in the minority language using the paragraph, heading, and list information of the text surrounding the image.
[0014] Preferably, the method further includes filtering the multimodal document, the filtering process including image filtering, text paragraph filtering, document-level filtering, and image and text deduplication.
[0015] Preferably, after extracting the multimodal document containing the image and text data of the minority language, the method further includes extracting background information from the images in the multimodal document. The background information is input, together with the plain image, the text information after removing the image, and the custom prompts, into the first multimodal large model to obtain annotation information. The annotation information includes image-text description pairs and visual question-and-answer pairs. The background information includes human body detection information, minority language OCR recognition information, object detection information, object relationship detection information, and image tone detection information.
[0016] Preferably, before the text information after removing the image is sent to the first multimodal large model, the text information in the minority language is translated into Chinese text information; after processing by the first multimodal large model, the Chinese image description information and Chinese image-text question and answer information output by the first multimodal large model are translated into the corresponding minority language and then output.
[0017] Preferably, the visual question-and-answer pair covers key information about the image, including objects, actions, scenes, and contextual information.
[0018] Preferably, the method further includes: performing grammar checks on multimodal documents, image-text related pairs, image-text description pairs, and visual question-answering pairs in minority languages.
[0019] According to a second aspect of the present invention, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the program to implement any of the methods described above.
[0020] According to a third aspect of the invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements any of the methods described herein.
[0021] Compared with the prior art, the present invention has the following beneficial effects:
[0022] 1) This invention utilizes a multimodal large model to clean and process the image and text data in HTML web pages of minority languages, and simultaneously annotates the image and text data to construct a higher quality and more diverse image and text dataset for minority languages.
[0023] 2) This invention employs a document object model tree cleaning strategy to simplify HTML web page files in minority languages by removing irrelevant text and images, thereby improving the availability of data.
[0024] 3) By extracting text descriptions associated with images through the Alt-text attribute in HTML web pages, and by analyzing paragraphs, headings, lists, and images around the images, more comprehensive contextual information can be obtained, thus improving the data quality of the image-text dataset for minority languages.
[0025] 4) When performing multimodal large model annotation, this invention additionally introduces background information extracted from the images, namely human body detection information, minority language OCR recognition information, object detection information, object relationship detection information, and image tone detection information. This allows more accurate image-text description pairs and visual question-answering pairs to be generated, thus improving the quality of minority language image-text datasets.
[0026] 5) By performing grammar checks on multimodal documents, image-text related pairs, image-text description pairs, and visual question-answering pairs in minority languages, the usability and accuracy of data in minority language datasets are improved.
Attached Figure Description
[0027] Figure 1 is a flowchart of the method of the present invention;
[0028] Figure 2 shows the processing procedure for the minority language image and text dataset of the present invention;
[0029] Figure 3 is a schematic diagram illustrating the generation process of textual description information and visual question-and-answer information in the embodiment;
[0030] Figure 4 is a diagram illustrating the process of obtaining descriptive prompts and visual question-and-answer prompts.
Detailed Implementation
[0031] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0032] Example
[0033] As shown in Figure 1 and Figure 2, this embodiment provides a method for constructing a minority language image-text dataset for Internet data. The method includes:
[0034] S1. Obtain HTML webpage files from the Internet, preprocess the HTML webpage files, extract multimodal documents containing image and text data in minority languages from the HTML webpage files, and obtain image-text related pairs based on the replaceable text corresponding to the images in the multimodal documents.
[0035] The preprocessing process includes text deduplication, quality assessment, and language identification to ensure the quality of the selected minority language content. Specifically, after obtaining the minority language HTML webpage files, a refined document object model tree cleaning strategy is used to simplify the HTML webpage files by removing irrelevant text and images (such as advertisements, navigation bars, etc.), retaining only the core information. A multimodal document containing text and images is extracted from the simplified HTML, while maintaining the structural integrity of the minority language HTML webpage files to better understand the context of the content.
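The document object model tree cleaning described above might be sketched as follows. This is a minimal illustration using only the Python standard library, not the patented strategy itself: the set of tags treated as boilerplate (`SKIP_TAGS`) and the names `DomCleaner` and `clean_html` are assumptions, since the text only names advertisements and navigation bars as examples. The sketch assumes reasonably well-formed markup and drops doctype declarations and comments.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers; the source only mentions ads and navigation bars.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}
# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class DomCleaner(HTMLParser):
    """Re-emit HTML while dropping every subtree rooted at a tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if tag in SKIP_TAGS:
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img .../> open and close at once.
        if not self.skip_depth and tag not in SKIP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def clean_html(html: str) -> str:
    """Return the page with boilerplate subtrees removed and structure preserved."""
    parser = DomCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the remaining tags are re-emitted in their original order and nesting, the simplified page keeps the structure of the source page, which is the property the paragraph above asks for.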
[0036] The text description information associated with each image is extracted from the minority language HTML webpage file using the Alt-text attribute, and the text context information corresponding to the image is extracted using the paragraph, heading, and list information of the text surrounding the image. By deeply mining the text content surrounding the images, richer multimodal documents can be constructed.
[0037] After the multimodal documents containing image and text data in minority languages are extracted, they undergo filtering, including image filtering, text paragraph filtering, document-level filtering, and image and text deduplication, to ensure the uniqueness and diversity of the dataset. Throughout the implementation of the filtering strategy, data privacy and copyright regulations are strictly adhered to, and inappropriate content is removed.
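As one possible illustration of this extraction step, the sketch below collects each image's `alt` attribute and the nearest paragraph, heading, and list-item text around it. The names `ImageContextExtractor` and `extract_images_with_context`, and the `window` parameter that controls how many neighbouring text blocks are kept, are assumptions introduced for the example.

```python
from html.parser import HTMLParser

# Block-level tags whose text is treated as context (paragraphs, headings, list items).
CONTEXT_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class ImageContextExtractor(HTMLParser):
    """Record text blocks in document order and the position of each image among them."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # (tag, text) tuples in document order
        self.images = []   # dicts with src, alt and the block index where the image appeared
        self._tag = None   # context tag currently being read, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append({
                "src": a.get("src") or "",
                "alt": (a.get("alt") or "").strip(),
                "pos": len(self.blocks),
            })
        elif tag in CONTEXT_TAGS:
            self._tag, self._buf = tag, []

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.blocks.append((tag, text))
            self._tag = None

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)


def extract_images_with_context(html, window=1):
    """Return one dict per image: src, alt text, and up to `window` blocks on each side."""
    parser = ImageContextExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for img in parser.images:
        lo = max(0, img["pos"] - window)
        hi = img["pos"] + window
        context = [text for _, text in parser.blocks[lo:hi]]
        results.append({"src": img["src"], "alt": img["alt"], "context": context})
    return results
```

Images with a non-empty `alt` value can then be paired with that text directly, while the `context` list supplies the surrounding text for images that lack one.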
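The four filtering stages named above could be sketched roughly as follows. The thresholds (`min_side`, `max_aspect`, `min_chars`, `min_paragraphs`), the `ImageItem` type, and the use of SHA-256 digests for exact deduplication are all assumptions chosen for illustration; the source does not specify filter criteria.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ImageItem:
    url: str
    width: int
    height: int
    content: bytes  # raw image bytes, used here only for exact-duplicate detection


def keep_image(img, min_side=150, max_aspect=3.0):
    """Image filter: drop tiny images and extreme aspect ratios (assumed thresholds)."""
    short, long_ = sorted((img.width, img.height))
    return short >= min_side and long_ / short <= max_aspect


def keep_paragraph(text, min_chars=20):
    """Paragraph filter: drop very short fragments (assumed threshold)."""
    return len(text.strip()) >= min_chars


def dedupe(items, key):
    """Keep the first occurrence of each item, comparing SHA-256 digests of key(item)."""
    seen, out = set(), []
    for item in items:
        digest = hashlib.sha256(key(item)).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(item)
    return out


def filter_document(images, paragraphs, min_paragraphs=1):
    """Apply image, paragraph, deduplication and document-level filters in turn."""
    images = dedupe([i for i in images if keep_image(i)], key=lambda i: i.content)
    paragraphs = dedupe(
        [p for p in paragraphs if keep_paragraph(p)],
        # Normalise case and whitespace so trivially different copies collide.
        key=lambda p: " ".join(p.split()).lower().encode("utf-8"),
    )
    # Document-level filter: a document needs at least one image and enough text.
    if not images or len(paragraphs) < min_paragraphs:
        return None
    return images, paragraphs
```

Exact hashing only catches byte-identical images and near-identical text; a production pipeline would likely need perceptual or fuzzy matching, which is outside this sketch.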
[0038] Furthermore, this step also includes extracting background information from the images in the multimodal document. This background information, along with the plain images, the text information after removing the images, and the custom prompts, is then input into the first multimodal large model to obtain annotation information. The annotation information includes image-text description pairs and visual question-answer pairs, as shown in Figure 3. The background information includes human body detection information, minority language OCR recognition information, object detection information, object relationship detection information, and image tone detection information.
[0039] S2. Input the plain images (without replaceable text) and the text information after removing the images from the multimodal document, as well as the custom prompts tailored to the specific data source, into the first multimodal large model (a vision-language model, VLM) to generate corresponding image description information and image-text question-and-answer information. Combine the plain images and the corresponding image description information into image-text description pairs, and combine the plain images and the corresponding image-text question-and-answer information into visual question-and-answer pairs.
[0040] Visual Question Answering (VQA) generation: using the same multimodal large model and the visual question-answering prompts, generate image-related question-answer pairs. These pairs should cover key information about the image, including objects, actions, scenes, and contextual information.
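The routing in steps S1 and S2 can be sketched as below: images that already carry replaceable (alt) text become image-text related pairs, and the remaining plain images are sent to the first multimodal large model twice, once per prompt. The function name `build_pairs`, the dictionary layout, and the `vlm(image, text, prompt)` callable signature are hypothetical; the callable stands in for whatever model interface is actually used.

```python
from typing import Callable, Dict, List, Tuple


def build_pairs(
    images: List[Dict[str, str]],
    page_text: str,
    describe_prompt: str,
    vqa_prompt: str,
    vlm: Callable[[str, str, str], str],
) -> Tuple[list, list, list]:
    """Split images by alt text and annotate the ones without it.

    `images` holds dicts with "src" and "alt" keys; `vlm` is a hypothetical
    callable taking (image, text, prompt) and returning the model output.
    """
    related, described, vqa = [], [], []
    for img in images:
        if img["alt"]:
            # S1: the replaceable text already describes the image.
            related.append({"image": img["src"], "text": img["alt"]})
            continue
        # S2: plain image, so ask the model for a description and for Q&A.
        described.append({
            "image": img["src"],
            "description": vlm(img["src"], page_text, describe_prompt),
        })
        vqa.append({
            "image": img["src"],
            "qa": vlm(img["src"], page_text, vqa_prompt),
        })
    return related, described, vqa
```

Passing the model as a callable keeps the sketch independent of any particular model API and makes the routing easy to check with a stub.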
[0041] Multimodal large language models may produce so-called "hallucinations" when processing images and text, meaning that they generate outputs inconsistent with the actual visual content. This phenomenon is particularly common when the model generates long responses. To address this issue, background knowledge is incorporated. First, images are preprocessed using specialized small models for human detection, minority language OCR, object detection, object relationship detection, and image tone detection. The outputs of these specialized small models, together with the scene labels added during image cleaning, constitute the background knowledge. This background knowledge and the images are input into the large model along with customized prompts to generate more accurate image-text descriptions and visual question-answer pairs.
[0042] In addition, before the text information after removing the images is fed into the first multimodal large model, the text information in the minority language is translated into Chinese text information; after processing by the first multimodal large model, the Chinese image description information and Chinese image-text question and answer information output by the first multimodal large model are translated into the corresponding minority language and then output.
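One way to picture how the detector outputs are combined with the customized prompt is the sketch below. The dictionary keys (`people`, `ocr_text`, `objects`, `relations`, `tone`), the wording of the preamble, and the function names are assumptions; the source lists the five kinds of detection information and the scene labels but not how they are serialized.

```python
# Assumed keys for the five detector outputs named in the text, in a fixed order.
BACKGROUND_FIELDS = [
    ("people", "Human detection"),
    ("ocr_text", "OCR text"),
    ("objects", "Objects"),
    ("relations", "Object relationships"),
    ("tone", "Image tone"),
]


def format_background(detections, scene_labels=()):
    """Render detector outputs and scene labels as one line per available field."""
    lines = []
    for key, label in BACKGROUND_FIELDS:
        value = detections.get(key)
        if value in (None, "", [], ()):
            continue  # this detector produced nothing for the image
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{label}: {value}")
    if scene_labels:
        lines.append("Scene labels: " + ", ".join(scene_labels))
    return "\n".join(lines)


def build_grounded_prompt(custom_prompt, detections, scene_labels=()):
    """Prepend the background knowledge to the custom prompt, if there is any."""
    background = format_background(detections, scene_labels)
    if not background:
        return custom_prompt
    return "Background knowledge from detectors:\n" + background + "\n\n" + custom_prompt
```

Giving the model explicit detector facts alongside the image is intended to anchor long responses to content that was actually detected, which is the stated motivation for the background knowledge.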
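The translate, annotate, translate-back sequence can be expressed as a small wrapper. All three callables (`to_pivot`, `from_pivot`, `vlm`) are hypothetical placeholders for the translation and model services, which the source does not name.

```python
from typing import Callable


def annotate_via_pivot(
    image: str,
    text: str,
    prompt: str,
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
    vlm: Callable[[str, str, str], str],
) -> str:
    """Annotate through a pivot language (Chinese in the described embodiment).

    1. Translate the minority language context text into the pivot language.
    2. Run the first multimodal large model on the image and pivot text.
    3. Translate the model output back into the minority language.
    """
    pivot_text = to_pivot(text)
    pivot_output = vlm(image, pivot_text, prompt)
    return from_pivot(pivot_output)
```

Since the output passes through translation twice, errors can accumulate, which is one reason a later grammar check on the minority language text is useful.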
[0043] S3. Input the multimodal documents, image-text related pairs, image-text description pairs, and visual question-answering pairs into the second multimodal large model, and output the image scene labels.
[0044] S4. Combine the multimodal documents, image-text related pairs, image-text description pairs, visual question-answering pairs, and image scene labels to obtain the image-text dataset for minority languages.
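A possible shape for the combined record in step S4 is sketched below, serialized as one JSON object per document. The field names and the JSON Lines layout are assumptions; the source states which components are combined but not the storage format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ImageRecord:
    image: str                                         # image path or URL
    alt_text: str = ""                                 # image-text related pair (S1)
    description: str = ""                              # image-text description pair (S2)
    qa_pairs: list = field(default_factory=list)       # visual question-answer pairs (S2)
    scene_labels: list = field(default_factory=list)   # image scene labels (S3)


@dataclass
class DocumentRecord:
    url: str
    language: str
    text: str                                          # multimodal document text
    images: list = field(default_factory=list)         # list of ImageRecord


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(
        json.dumps(asdict(r), ensure_ascii=False, sort_keys=True) for r in records
    )
```

Setting `ensure_ascii=False` keeps minority language scripts as written rather than as escape sequences, which makes the files easier to inspect by hand.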
[0045] In addition, this embodiment also includes performing minor language grammar checks on multimodal documents, image-text related pairs, image-text description pairs, and visual question-answer pairs.
[0046] As shown in Figure 4, the process of obtaining the descriptive prompts used in generating the image-text description information specifically includes:
[0047] Collect high-quality images and annotate the images using the VLM model;
[0048] Based on the country of origin and scene annotation information of the image, generate corresponding descriptive prompts, including:
[0049] Please carefully observe this image from [name of the specific minority language country] and, based on the main elements and scene in the image, generate a detailed, accurate, and rich description in Chinese. The description should cover the following aspects: Please state the main subject of the image, such as people, specific objects, or a location / scene.
[0050] Please describe the activities the subject is engaged in, its characteristics and state, and the emotional atmosphere or story it conveys. Please describe the elements in the background of the image, and the relationship and interaction between these elements and the subject.
[0051] Please describe the cultural characteristics of the [country where the less commonly spoken language is spoken] depicted in the image, such as traditional clothing, architecture, festivals, or customs. Please describe the emotions or atmosphere you felt upon seeing the image, and explain the special significance of these emotions or atmosphere in the context of the cultural background. If there are specific details, such as lighting effects, color textures, artistic style, or handicrafts, please include them in your description and explain their cultural meaning. Please ensure your description is accurate and insightful, reflecting the cultural characteristics of the [specific country where the less commonly spoken language is spoken], providing sufficient detail to allow the reader to form a clear impression of the image.
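Filling the bracketed country placeholder in the descriptive prompt could look like the sketch below. The template reuses a shortened subset of the prompt wording above; the numbered-list layout, the optional scene-label line, and the names `DESCRIPTION_TEMPLATE`, `ASPECTS`, and `make_description_prompt` are assumptions made for the example.

```python
# Opening of the descriptive prompt, with the country left as a placeholder.
DESCRIPTION_TEMPLATE = (
    "Please carefully observe this image from {country} and, based on the main "
    "elements and scene in the image, generate a detailed, accurate, and rich "
    "description in Chinese. The description should cover the following aspects:\n"
    "{aspects}"
)

# A shortened subset of the aspects listed in the prompt text.
ASPECTS = [
    "State the main subject of the image, such as people, specific objects, or a location or scene.",
    "Describe the activities the subject is engaged in, its characteristics and state.",
    "Describe the elements in the background and their relationship to the subject.",
    "Describe the cultural characteristics of {country} depicted in the image, "
    "such as traditional clothing, architecture, festivals, or customs.",
]


def make_description_prompt(country, scene_labels=()):
    """Build a descriptive prompt for one image from its country and scene labels."""
    aspects = "\n".join(
        f"{i}. {aspect.format(country=country)}" for i, aspect in enumerate(ASPECTS, 1)
    )
    prompt = DESCRIPTION_TEMPLATE.format(country=country, aspects=aspects)
    if scene_labels:
        prompt += "\nScene labels from annotation: " + ", ".join(scene_labels)
    return prompt
```

For example, `make_description_prompt("Kenya", ["market"])` yields a prompt that names the country in both the opening sentence and the cultural aspect, followed by the scene labels.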
[0052] The visual question-answering prompts obtained in the same process include:
[0053] General:
[0054] What details in the image are eye-catching? What do these details symbolize? Considering the context of a country where a less commonly spoken language is spoken, what emotions or atmosphere does the image convey?
[0055] If the actions or events in the image continue, what will happen next?
[0056] General category:
[0057] What is the main object or scene in this image? What is the function or role of the elements in the image? What are their numbers and positional relationships?
[0058] Characters:
[0059] Who are the people in the picture? What might their relationship be? What activity are they engaged in?
[0060] Considering the background of a country where a less commonly spoken language is spoken, what kind of emotional state or story do the characters' expressions, postures, and clothing convey?
[0061] Scene category: What kind of scene does the image depict? What are the overall color tones, lighting conditions, and compositional perspective of the image?
[0062] What historical background and cultural significance does the image represent for [countries with less commonly spoken languages]? What folk traditions or social phenomena does this cultural image reflect in [countries with less commonly spoken languages]? Are the elements in the image related to a specific historical period or event?
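Selecting visual question-answering prompts by image category might be organized as in the sketch below. The category keys (`people`, `scene`, `culture`), the idea of always including the general prompts, and the function name `select_vqa_prompts` are assumptions; in particular, the source does not give a category name for the cultural questions. The prompt strings are taken from the list above, with the bracketed country placeholder written as `{country}`.

```python
# Prompts asked for every image.
GENERAL_PROMPTS = [
    "What is the main object or scene in this image?",
    "If the actions or events in the image continue, what will happen next?",
]

# Prompts asked only when the image carries the matching (assumed) category label.
PROMPTS_BY_LABEL = {
    "people": [
        "Who are the people in the picture? What might their relationship be?",
        "What activity are they engaged in?",
    ],
    "scene": [
        "What kind of scene does the image depict?",
        "What are the overall color tones, lighting conditions, and compositional perspective of the image?",
    ],
    "culture": [
        "What historical background and cultural significance does the image represent for {country}?",
    ],
}


def select_vqa_prompts(scene_labels, country):
    """Return the general prompts plus those matching the image's labels."""
    prompts = list(GENERAL_PROMPTS)
    for label in dict.fromkeys(scene_labels):  # de-duplicate, keep order
        prompts.extend(PROMPTS_BY_LABEL.get(label, []))
    return [p.format(country=country) for p in prompts]
```

Labels with no matching entry are ignored, so the function degrades to the general prompts when the scene labels are unknown.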
[0063] The electronic device of this invention includes a central processing unit (CPU), which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) or loaded from a storage unit into random access memory (RAM). The RAM may also store various programs and data required for device operation. The CPU, ROM, and RAM are interconnected via a bus. Input / output (I / O) interfaces are also connected to the bus.
[0064] Multiple components in the device are connected to the I / O interface, including: input units such as keyboards and mice; output units such as various types of displays and speakers; storage units such as disks and optical discs; and communication units such as network interface cards (NICs), modems, and wireless transceivers. The communication unit allows the device to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0065] The processing unit performs the various methods and processes described above. For example, in some embodiments, the methods may be implemented as computer software programs tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and / or installed on the device via ROM and / or a communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the CPU may be configured to execute the methods by any other suitable means (e.g., by means of firmware).
[0066] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0067] The program code used to implement the methods of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
[0068] In the context of this invention, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0069] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for constructing a minority language image-text dataset for Internet data, characterized in that the method includes: obtaining HTML webpage files in a minority language from the Internet, extracting multimodal documents containing image and text data in the minority language from the HTML webpage files, and obtaining image-text related pairs by filtering the replaceable text corresponding to the images in the multimodal documents; inputting the plain images without replaceable text, the text information after removing the images, and the custom prompts from the multimodal document into the first multimodal large model to generate corresponding image description information and image-text question-and-answer information, combining the plain images and the corresponding image description information into image-text description pairs, and combining the plain images and the corresponding image-text question-and-answer information into visual question-and-answer pairs; inputting the multimodal documents, image-text related pairs, image-text description pairs, and visual question-answering pairs into the second multimodal large model, and outputting image scene labels; and combining the multimodal documents, image-text related pairs, image-text description pairs, visual question-answering pairs, and image scene labels to obtain the image-text dataset for minority languages.
2. The method for constructing a minority language image-text dataset for Internet data according to claim 1, characterized in that, after obtaining the HTML webpage file in the minority language, the method further includes: using a document object model tree cleaning strategy to simplify the HTML webpage file by removing irrelevant text and images; wherein the simplified HTML webpage file retains the structure of the original HTML webpage file in the minority language.
3. The method for constructing a minority language image-text dataset for Internet data according to claim 1, characterized in that text description information associated with the image is extracted from the HTML webpage file in the minority language using the Alt-text attribute of the HTML webpage, and the text context information corresponding to the image is extracted from the HTML webpage file in the minority language using the paragraph, heading, and list information of the text surrounding the image.
4. The method for constructing a minority language image-text dataset for Internet data according to claim 1, characterized in that the method further includes filtering the multimodal document, including image filtering, text paragraph filtering, document-level filtering, and image and text deduplication.
5. The method for constructing a minority language image-text dataset for Internet data according to claim 1, characterized in that, after extracting the multimodal document containing image and text data in a minority language, the method further includes extracting background information from the images in the multimodal document; the background information is input, together with the plain image, the text information after removing the image, and the custom prompts, into the first multimodal large model to obtain annotation information; the annotation information includes image-text description pairs and visual question-answering pairs; and the background information includes human body detection information, minority language OCR recognition information, object detection information, object relationship detection information, and image tone detection information.
6. The method for constructing a minority language image-text dataset for Internet data according to claim 5, characterized in that, before the text information after removing the images is sent to the first multimodal large model, the text information in the minority language is translated into Chinese text information; after processing by the first multimodal large model, the Chinese image description information and Chinese image-text question and answer information output by the first multimodal large model are translated into the corresponding minority language and then output.
7. The method for constructing a minority language image-text dataset for Internet data according to claim 1, characterized in that the visual question-and-answer pair covers key information about the image, including objects, actions, scenes, and contextual information.
8. The method for constructing a minority language image-text dataset for Internet data according to claim 1, characterized in that the method further includes performing grammar checks on multimodal documents, image-text related pairs, image-text description pairs, and visual question-answering pairs in minority languages.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1 to 8.