A method and apparatus for constructing a data set
By selecting high-ranking products from e-commerce websites as seeds, and using a multimodal product crawler to extract SKU and SPU information, fine-grained attributes are extracted layer by layer to form a multimodal dataset. This solves the problems of high cost and low quality caused by manual collection in existing technologies, and achieves the construction of a high-quality dataset that saves manpower.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date
- 2024-03-12
- Publication Date
- 2026-06-26
AI Technical Summary
The construction of existing multimodal datasets requires manual data collection, which results in high costs and poor data quality. There is a lack of datasets that include fine-grained alignment of multimodal data.
By selecting high-ranking products from a designated website as seeds, a multimodal product crawler is used to crawl SKU and SPU information, crawling fine-grained attributes layer by layer to form a dataset, and then constructing the dataset by random matching and evaluating the model's output.
It provides a high-quality and labor-saving method for constructing datasets, which can generate fine-grained candidate groups, improve the quality of datasets, and evaluate the accuracy of models.
Figure CN117972432B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of deep learning, and more specifically to a method and apparatus for constructing a dataset.
Background Technology
[0002] Multimodal machine learning, which aims to build models that process and correlate information from multiple modalities, is considered a field with a significant impact on general artificial intelligence. Among these modalities, researchers have shown great interest in vision-language research because vision and language are widely used and closely intertwined in daily human life. Recent Vision-Language Multimodal (VLM) methods have primarily focused on modeling the overall relationship between visual and linguistic inputs, but have rarely considered local relationships. Here, the ability of a method to model local relationships is considered its fine-grained capability. A deeper understanding is necessary to better model the fine-grained properties of the visual and linguistic modalities. However, the fine-grained capabilities of some existing methods are no better than random baselines and fall far short of human expectations. Researchers lack a standard dataset to probe the true fine-grained capabilities of current methods.
[0003] Multimodal fine-grained datasets have been widely and deeply applied in numerous artificial intelligence fields, including healthcare, agriculture, and e-commerce. By integrating information from multiple sensors and data sources, these datasets provide more comprehensive, accurate, and detailed data descriptions. In healthcare, multimodal fine-grained datasets utilize various data types, such as medical images, physiological signals, and medical records, to assist doctors in disease diagnosis, treatment planning, and prognosis assessment. In agriculture, multimodal fine-grained datasets combine meteorological data, soil information, and crop growth data to achieve precision agricultural management, improving crop yield and quality. In e-commerce, multimodal fine-grained datasets leverage various data types, such as user behavior data, product images, and descriptions, to provide more precise support for personalized recommendations, advertising targeting, and risk control. These applications have not only significantly enhanced the effectiveness of artificial intelligence technology in various fields but also brought more business opportunities and development potential to related industries.
[0004] Constructing a multimodal dataset requires ensuring alignment between the modalities, and manual alignment is currently the most accurate way to achieve this. In addition, to ensure that the dataset contains enough information, the data needs to be carefully filtered to obtain data with high information density. Such a construction process faces many challenges, among which the cost of manpower is often the key bottleneck. At present, multimodal datasets still have two drawbacks: (1) the data quality is poor, and there is a lack of datasets that contain fine-grained alignment across modalities; (2) a large amount of manpower is consumed in data collection, which involves data acquisition, organization, cleaning and labeling, all of which require manual work.
Summary of the Invention
[0005] This invention provides a method and apparatus for constructing a dataset, which solves the problem that the construction of existing multimodal datasets requires manual data collection, resulting in high costs and poor data quality.
[0006] This invention provides a method for constructing a dataset, comprising:
[0007] Selecting multiple high-ranking products from multiple categories within a designated website as product seeds;
[0008] Crawling the SKU (stock keeping unit) of each product seed with a multimodal product crawler that includes at least text analysis and image recognition, to obtain the SPU (standard product unit) of each product seed;
[0009] Crawling, layer by layer according to the hierarchical structure, the information corresponding to the SPU of each product seed to obtain all SKUs included in each SPU and the fine-grained attributes corresponding to each SKU, and uniformly annotating the fine-grained attributes to obtain the fine-grained candidate group of each product seed, wherein the fine-grained candidate group includes at least one product and one image-text pair;
[0010] Forming a dataset from multiple fine-grained candidate groups.
[0011] Preferably, the fine-grained attributes include at least one or more of color, style, material, and size.
[0012] Preferably, the SKU and the SPU include at least one or more of the following relationships: hierarchical relationship, inventory management relationship, price relationship, sales statistics relationship, and marketing strategy relationship.
[0013] Preferably, after the plurality of fine-grained candidate groups form a dataset, the method further includes:
[0014] Randomly match the image-text pairs included in the dataset to form random pairs, each consisting of a random image and a random text, and input the random pairs into the model under test;
[0015] The retrieval accuracy of the model under test is determined based on the output of the model under test and the retrieval ranking list of the random pair, which includes fine-grained retrieval samples.
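The retrieval evaluation in [0014] and [0015] could be scored along the following lines. This is a minimal sketch rather than the patented implementation: the function names, the dictionary keys `image` and `text`, and the use of a hit-in-top-k rate as the accuracy measure are all assumptions made for illustration, since the text only states that accuracy is derived from the model output and a retrieval ranking list.

```python
import random


def sample_random_pair(pairs, rng):
    """Draw one image and one text independently from the dataset's image-text pairs.

    `pairs` is a list of dicts with hypothetical keys "image" and "text".
    """
    image = rng.choice(pairs)["image"]
    text = rng.choice(pairs)["text"]
    return image, text


def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant item appears in the top k of the ranking, else 0.0."""
    return float(any(item in relevant_ids for item in ranked_ids[:k]))


def retrieval_accuracy(results, k=1):
    """Average hit-at-k over (ranked_ids, relevant_ids) results, one per query."""
    if not results:
        return 0.0
    return sum(hit_at_k(ranked, relevant, k) for ranked, relevant in results) / len(results)


# Toy rankings (made up): two queries, each with a single relevant item "a".
toy_results = [(["a", "b", "c"], {"a"}), (["b", "a", "c"], {"a"})]
top1 = retrieval_accuracy(toy_results, k=1)
top2 = retrieval_accuracy(toy_results, k=2)
```

Passing a seeded `random.Random` instance as `rng` keeps the sampled pairs reproducible across evaluation runs.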
[0016] Preferably, after the plurality of fine-grained candidate groups form a dataset, the method further includes:
[0017] A random question and a random image from the dataset are input into the model under test, so that the model selects, from a set of alternative answers, the option that corresponds to the random image and the random question. The model is then evaluated based on the accuracy of the selected options; or
[0018] A random question and a random image from the dataset are input into the model under test, so that the model generates an answer based on the random question and the random image, and the model is evaluated based on the accuracy of the generated answer against the reference answer.
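The two question-answering evaluations in [0017] and [0018] can be illustrated as follows. This is a hedged sketch: the `model` callables, the sample keys (`image`, `question`, `options`, `answer`), and the use of normalized exact match for the generated-answer case are assumptions, because the text does not specify how answer accuracy is computed.

```python
def normalize(answer):
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def multiple_choice_accuracy(samples, model):
    """Fraction of samples where the model picks the correct option.

    `model(image, question, options)` is a hypothetical callable returning one option.
    """
    if not samples:
        return 0.0
    correct = sum(
        model(s["image"], s["question"], s["options"]) == s["answer"] for s in samples
    )
    return correct / len(samples)


def open_ended_accuracy(samples, model):
    """Fraction of samples where the generated answer equals the reference after normalization.

    `model(image, question)` is a hypothetical callable returning free text.
    """
    if not samples:
        return 0.0
    correct = sum(
        normalize(model(s["image"], s["question"])) == normalize(s["answer"])
        for s in samples
    )
    return correct / len(samples)
```

Exact match is the simplest possible criterion; a softer measure could be substituted without changing the surrounding loop.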
[0019] This invention provides a dataset construction apparatus, comprising:
[0020] A determining unit, used to select multiple high-ranking products from multiple categories within a designated website as product seeds;
[0021] A first obtaining unit, used to crawl the SKU (stock keeping unit) of each product seed with a multimodal product crawler that includes at least text analysis and image recognition, and obtain the SPU (standard product unit) of each product seed;
[0022] A second obtaining unit, used to crawl all the information corresponding to the SPU of each product seed layer by layer according to the hierarchical structure, obtain all the SKUs included in each SPU and the fine-grained attributes corresponding to the SKUs, and perform consistent annotation on the fine-grained attributes to obtain the fine-grained candidate group of each product seed; wherein the fine-grained candidate group includes at least one product and one image-text pair;
[0023] A forming unit is used to form a dataset from multiple fine-grained candidate groups.
[0024] Preferably, the forming unit is further configured to:
[0025] Randomly match the image-text pairs included in the dataset to form random pairs, each consisting of a random image and a random text, and input the random pairs into the model under test;
[0026] The retrieval accuracy of the model under test is determined based on the output of the model under test and the retrieval ranking list of the random pair, which includes fine-grained retrieval samples.
[0027] Preferably, the forming unit is further configured to:
[0028] A random question and a random image from the dataset are input into the model under test, so that the model selects, from a set of alternative answers, the option that corresponds to the random image and the random question. The model is then evaluated based on the accuracy of the selected options; or
[0029] A random question and a random image from the dataset are input into the model under test, so that the model generates an answer based on the random question and the random image, and the model is evaluated based on the accuracy of the generated answer against the reference answer.
[0030] This invention provides a computer device, which includes a memory and a processor. The memory stores a computer program, and when the computer program is executed by the processor, the processor performs the dataset construction method described in any of the above-described embodiments.
[0031] This invention provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the dataset construction method described in any of the preceding claims.
[0032] This invention provides a method and apparatus for constructing a dataset. The method includes: selecting multiple high-ranking products from a designated website, encompassing multiple categories, as product seeds; crawling the SKU (stock keeping unit) of each product seed using a multimodal product crawler that includes at least text analysis and image recognition, to obtain the SPU (standard product unit) for each product seed; crawling all information corresponding to the SPU of each product seed layer by layer according to a hierarchical structure, to obtain all SKUs included in each SPU and the fine-grained attributes corresponding to the SKUs; performing consistent annotation on the fine-grained attributes to obtain fine-grained candidate groups for each product seed, wherein each fine-grained candidate group includes at least one product and one image-text pair; and forming a dataset from multiple fine-grained candidate groups. This method provides a high-quality and labor-saving approach to constructing datasets. By integrating hierarchical information from e-commerce websites, it obtains fine-grained attributes based on similar product seeds, thereby forming fine-grained candidate groups. Within a group, the products differ significantly in certain attributes, and the image-text pairs accurately display the attribute differences between products. This solves the problem that the construction of existing multimodal datasets requires manual data collection, resulting in high costs and poor data quality.
Description of the Drawings
[0033] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0034] Figure 1 is a schematic diagram of a dataset construction method provided in an embodiment of the present invention;
[0035] Figure 2A is a schematic diagram of the structure of the first half of the dataset construction method provided in an embodiment of the present invention;
[0036] Figure 2B is a schematic diagram of the structure of the latter half of the dataset construction method provided in an embodiment of the present invention;
[0037] Figure 3A is a schematic diagram of the fine-grained candidate group structure provided in an embodiment of the present invention;
[0038] Figure 3B is a schematic diagram of the mixed-modal retrieval evaluation method based on fine-grained candidate groups provided in an embodiment of the present invention;
[0039] Figure 3C is a schematic diagram of the structure of the fine-grained visual question answering evaluation method based on fine-grained candidate groups provided in an embodiment of the present invention;
[0040] Figure 4 is a schematic diagram of a dataset construction apparatus provided in an embodiment of the present invention.
Detailed Implementation
[0041] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0042] In the embodiments of the present invention, the technical terms involved are as follows:
[0043] 1. An SPU (Standard Product Unit) is the smallest unit for aggregating product information. It is a set of reusable and easily searchable standardized information that describes the characteristics of a product. Simply put, products with the same attribute values and characteristics can be called an SPU.
[0044] For example, a certain brand and model of mobile phone is an SPU, which is independent of the merchant, color, style, or package.
[0045] In the process of digitizing product information, the characteristics of a product can be described by multiple "attributes and their corresponding attribute value pairs". Products with identical "attributes and their corresponding attribute value pairs" can be abstracted into a single SPU. At the same time, these "attributes and their corresponding attribute value pairs" are fixed within the SPU and gradually become standardized. Based on the SPU-based product information structure, a wide range of applications can be realized, such as integrating product information with news, reviews, and other SPUs.
[0046] 2. SKU (Stock Keeping Unit): SKU is the unit used to measure inventory inflows and outflows, and can be measured in pieces, boxes, pallets, etc. It is most commonly used in clothing and footwear. For example, in textiles, an SKU typically represents: specifications, color, and style.
[0047] An SKU is the smallest physically indivisible inventory unit. Its usage depends on the business format and management model. For example, a case of cigarettes contains 50 cartons, a carton contains 10 packs, and a pack contains 20 cigarettes. These units need to be set up as different SKUs based on specific needs. For instance, warehouse-style wholesale supermarkets always set SKUs by case; regular supermarkets always set SKUs by carton; and tobacco and liquor specialty stores always set SKUs by pack.
[0048] SPU stands for Standard Product Unit, used to distinguish product varieties; SKU stands for Stock Keeping Unit, used to distinguish individual items.
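The SPU and SKU definitions above map naturally onto a two-level data structure. The sketch below is illustrative only: the class and field names, the example identifiers, and the `varying_attributes` helper are assumptions introduced here and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SKU:
    """One concrete, stock-tracked variant of a product."""
    sku_id: str
    attributes: Dict[str, str]  # e.g. {"color": "black", "storage": "128GB"}


@dataclass
class SPU:
    """A product variety that groups its SKU variants."""
    spu_id: str
    title: str
    skus: List[SKU] = field(default_factory=list)

    def varying_attributes(self) -> Set[str]:
        """Names of attributes that take more than one value across this SPU's SKUs."""
        seen: Dict[str, Set[str]] = {}
        for sku in self.skus:
            for name, value in sku.attributes.items():
                seen.setdefault(name, set()).add(value)
        return {name for name, values in seen.items() if len(values) > 1}


# Made-up example: one phone model (SPU) sold in two colors (two SKUs).
phone = SPU(
    spu_id="spu-001",
    title="Example phone, model X",
    skus=[
        SKU("sku-001", {"color": "black", "storage": "128GB"}),
        SKU("sku-002", {"color": "white", "storage": "128GB"}),
    ],
)
```

The attributes returned by `varying_attributes` are the ones that distinguish SKUs within one SPU, which is the kind of difference the fine-grained candidate groups are meant to capture.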
[0049] In the field of deep learning, datasets are fundamental for training deep learning models. A good dataset provides a sufficient number of high-quality data samples, helping the model learn more accurate and representative features. The quality of the dataset directly affects the model's accuracy and performance. Furthermore, a good dataset helps the model generalize better to new data samples, improving its performance on unseen data. Simultaneously, datasets can also be used for data augmentation, evaluating and comparing different deep learning models. Therefore, the selection, construction, and processing of datasets are crucial steps in deep learning. Only with a good dataset can the model learn and generalize better, improving its performance and robustness.
[0050] In the field of artificial intelligence, data is the core driver of models and algorithms. However, a single type of data often cannot provide enough information to solve complex problems. To better understand and handle the complexities of the real world, multimodal fine-grained datasets have emerged as an important innovation.
[0051] First, it integrates information from multiple sensors and data sources, including images, text, audio, and video, providing rich information from multiple perspectives. Multimodal fine-grained datasets offer more comprehensive information, with different types of data providing complementary perspectives and information. The integrated information from multimodal fine-grained datasets can better describe and understand complex problems, supporting more accurate decision-making and predictions by artificial intelligence systems.
[0052] Secondly, multimodal fine-grained datasets improve data representation and analysis. By integrating multiple data types, more detailed, comprehensive, and accurate data representations can be provided. The interrelationships and inherent characteristics between different data types can be better presented through comprehensive representation. This is of great significance for data analysis and mining. For example, in computer vision tasks, combining images and text allows for a better understanding of the content and semantics within images. In natural language processing tasks, combining text and audio data allows for a better understanding and analysis of the meaning and sentiment of language. Therefore, the comprehensive representation of multimodal fine-grained datasets helps to gain a deeper understanding of the inherent characteristics and interrelationships of data, improving the effectiveness of data analysis and mining.
[0053] Third, multimodal fine-grained datasets support multimodal tasks. Many tasks require processing multiple data types simultaneously, such as image classification, object detection, and sentiment analysis. Using multimodal fine-grained datasets provides the necessary data foundation and training samples for these tasks. For example, in image classification, combining image and text data allows for more accurate identification of objects and scenes within images. In object detection, combining image and audio data enables more accurate detection and tracking of object positions and movements. Therefore, multimodal fine-grained datasets provide crucial support for multimodal tasks.
[0054] Finally, multimodal fine-grained datasets facilitate cross-domain applications. The integration and synthesis of multimodal data enables the cross-fertilization and transfer of knowledge. By collecting and integrating multimodal data across different domains, knowledge and technology can be transferred from one domain to another, accelerating the application and innovation of artificial intelligence technologies in various fields. For example, by integrating user behavior data, product images, and descriptions from the e-commerce sector with medical images and medical records from the medical field, personalized recommendations and targeted advertising can be applied in the medical field. This cross-domain application not only expands the scope of artificial intelligence technology but also brings more business opportunities and development potential to related industries.
[0055] In summary, the necessity of multimodal fine-grained datasets lies in providing more comprehensive information, improving data representation and analysis, supporting multimodal tasks, and promoting cross-domain applications. They provide the necessary data foundation and research directions for the development and application of artificial intelligence technology, driving its widespread application in fields such as healthcare, agriculture, and e-commerce. With continuous technological advancements and the expansion of application scenarios, the importance of multimodal fine-grained datasets will become even more prominent.
[0056] Multimodal datasets still suffer from poor data quality, high manpower costs in data collection, and a lack of challenging benchmarks to evaluate the performance of multimodal models. To address these issues, this invention aims to provide a cost-effective and high-quality fine-grained dataset construction method by extracting fine-grained multimodal features from products and utilizing the hierarchical information of e-commerce websites to save manpower. Furthermore, two novel evaluation tasks—fine-grained mixed-modal retrieval and fine-grained VQA (Visual Question Answering)—are developed to evaluate the multimodal model.
[0057] Specifically, the dataset construction method provided in this embodiment of the invention first collects a fine-grained visual language (FGVL) dataset from Amazon.com, which consists of image-text pairs extracted from products of the same type.
[0058] Within each group, products differ significantly in certain attributes (such as color and style), and image-text pairs accurately reveal these attribute differences. The FGVL dataset contains a rich set of fine-grained multimodal information. At the same time, these comparable attributes can be used to generate test items in the style of the Winograd Schema Challenge (WSC). Through WSC-like evaluations, the commonsense capabilities of VLM methods can be probed thoroughly, because these methods must sensitively and accurately understand fine-grained semantic visual and linguistic information.
[0059] Furthermore, it is well known that creating a Winograd schema dataset from scratch is a labor-intensive task. Taking the VQA dataset as an example, Amazon Mechanical Turk (AMT) was used to collect annotations, and extensive analysis and post-processing were still required after the data was collected from AMT. To save manpower, this method leverages hierarchical information from e-commerce websites (such as SPUs and SKUs) to collect comparable attributes from potential candidate product groups.
[0060] The embodiments of this invention provide two novel evaluation tasks based on the FGVL dataset. Both tasks incorporate large-scale evaluation samples by combining and permuting attributes. These large-scale and fine-grained tasks increase the difficulty of VLM evaluation and more accurately test the true capabilities of VLM models.
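One possible reading of "combining and permuting attributes" in [0060] is to enumerate ordered pairs of members within a candidate group and record which attributes change between them. The sketch below follows that reading as an assumption; the function name, the member keys `sku_id` and `attributes`, and the toy hat data are hypothetical.

```python
from itertools import permutations


def build_eval_samples(group):
    """Enumerate ordered (query, target) pairs of distinct members of one candidate group.

    Each member is a dict with hypothetical keys "sku_id" and "attributes".
    A sample records, per differing attribute, the (query value, target value) pair.
    """
    samples = []
    for query, target in permutations(group, 2):
        changed = {
            name: (query["attributes"].get(name), value)
            for name, value in target["attributes"].items()
            if query["attributes"].get(name) != value
        }
        samples.append(
            {"query": query["sku_id"], "target": target["sku_id"], "changed": changed}
        )
    return samples


# Made-up candidate group of three hats.
hats = [
    {"sku_id": "hat-1", "attributes": {"color": "red", "style": "wide brim"}},
    {"sku_id": "hat-2", "attributes": {"color": "blue", "style": "wide brim"}},
    {"sku_id": "hat-3", "attributes": {"color": "blue", "style": "bucket"}},
]
eval_samples = build_eval_samples(hats)
```

A group of n members yields n × (n − 1) ordered samples, which is how a modest number of crawled products can expand into a large evaluation set.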
[0061] The dataset construction method provided in the embodiments of the present invention is described in detail below with reference to Figure 1, Figure 2A and Figure 2B.
[0062] As shown in Figure 1, the dataset construction method provided in this embodiment of the invention mainly includes the following steps:
[0063] Step 101: Select multiple high-ranking products from multiple categories in the website as product seeds;
[0064] Step 102: Using a multimodal product crawler that includes at least text analysis and image recognition, crawl the SKU inventory units of each product seed to obtain the SPU standardized product unit of each product seed.
[0065] Step 103: Crawl, layer by layer according to the hierarchical structure, all information corresponding to the SPU of each product seed to obtain all SKUs included in each SPU and the fine-grained attributes corresponding to each SKU, and perform consistent annotation on the fine-grained attributes to obtain a fine-grained candidate group for each product seed; wherein the fine-grained candidate group includes at least one product and one image-text pair;
[0066] Step 104: The multiple fine-grained candidate groups form a dataset.
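Steps 101 to 104 can be read as a small pipeline. The sketch below shows only the data flow, under stated assumptions: `select_seeds`, `resolve_spu` and `expand_spu` are hypothetical caller-supplied helpers standing in for the crawler, and skipping an SPU that has already been expanded is a design choice made here, not something the source specifies.

```python
def build_dataset(catalog, select_seeds, resolve_spu, expand_spu):
    """Run steps 101-104 with caller-supplied (hypothetical) helpers.

    select_seeds(catalog) -> iterable of seed SKU ids                  (step 101)
    resolve_spu(sku_id)   -> SPU id for that seed                      (step 102)
    expand_spu(spu_id)    -> list of member dicts, one per SKU         (step 103)
    """
    dataset = []
    seen_spus = set()
    for seed_sku in select_seeds(catalog):
        spu_id = resolve_spu(seed_sku)
        if spu_id in seen_spus:
            # Assumption: two seeds sharing one SPU produce a single candidate group.
            continue
        seen_spus.add(spu_id)
        members = expand_spu(spu_id)
        if members:
            dataset.append({"spu_id": spu_id, "members": members})
    return dataset  # step 104: the candidate groups together form the dataset
```

Passing the helpers in as arguments keeps the pipeline testable with in-memory dictionaries before any real crawling code is attached.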
[0067] In step 101, the method draws on the hierarchical information of e-commerce websites to collect candidate product groups with comparable attributes. These products carry comparable attributes in their text or image information, where image information can include product pictures, colors, styles, types, etc. Specifically, the method selects product seeds from a designated website: the seeds are the best-selling products on the designated website together with the high-ranking products in that website's review dataset. Furthermore, the best-selling or high-ranking products can span multiple categories and are not limited to a single designated category.
[0068] For example, as shown in Figure 2A, over 8.7K product seeds were collected from the designated website, covering 29 categories. These product seeds included best-selling products on the designated website and high-ranking products from the corresponding review dataset of that website. The high-ranking products shown in Figure 2A include images, colors, styles, types, and textual information about these products.
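Seed selection across categories, as described in [0067] and [0068], might look like the following. This is a sketch under assumptions: the record keys `sku_id`, `category` and `rank`, the convention that rank 1 is best, and the `per_category` cutoff are all hypothetical, since the source does not state how "high-ranking" is thresholded.

```python
from collections import defaultdict


def select_seeds(products, per_category=2):
    """Keep the best-ranked products in every category.

    `products` is a list of dicts with hypothetical keys "sku_id", "category"
    and "rank" (assumed convention: 1 = best seller or top rated).
    """
    by_category = defaultdict(list)
    for product in products:
        by_category[product["category"]].append(product)

    seeds = []
    for category in sorted(by_category):  # sorted for a deterministic output order
        ranked = sorted(by_category[category], key=lambda p: p["rank"])
        seeds.extend(p["sku_id"] for p in ranked[:per_category])
    return seeds
```

Selecting per category rather than globally keeps the seed set spread across categories, in line with the statement that seeds are not limited to a single designated category.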
[0069] The high-ranking products provided by this method include not only the text information of the product seed, but also the image information of the product seed, such as color, picture and style.
[0070] It should be noted that after identifying best-selling and high-ranking products as product seeds, the SKUs corresponding to these product seeds can be used to further determine the SPUs associated with those SKUs.
[0071] In practical applications, SKU and SPU are common concepts in merchandise management and sales, and their relationship includes the following aspects:
[0072] 1. Hierarchical Relationship: The SPU is the basic unit of a product, representing its fundamental attributes and characteristics. The SKU, on the other hand, is a further subdivision of the SPU. Each SKU has a unique identifier and specific attributes. One SPU can correspond to multiple SKUs, each representing a variation in specification, color, size, etc.
[0073] 2. Inventory Management Relationship: Each SKU has its own inventory quantity and inventory management. SKU inventory management helps companies accurately track the inventory status of each specific product specification, enabling better supply chain management and stocking decisions. SPU inventory management, on the other hand, is generally an aggregate statistic used to understand the overall inventory status and sales performance of the product.
[0074] 3. Pricing Relationship: Each SKU can have its own independent price because SKUs may differ in specifications, costs, and market demand. An SPU, however, generally has a unified pricing strategy used for overall sales and market positioning.
[0075] 4. Sales statistics relationship: Sales data for both SPUs and SKUs can be statistically analyzed. However, SPU sales statistics are generally overall statistics used to understand the overall sales situation of the product. SKU sales statistics, on the other hand, can provide a more granular understanding of the sales situation of different specifications or variations, which helps in analyzing consumer preferences and market trends.
[0076] 5. Marketing Strategy Relationship: The different positioning and characteristics of SPUs and SKUs can help companies implement different marketing strategies. Based on the overall characteristics of SPUs, companies can formulate overall brand positioning and marketing strategies, while based on the characteristics of different SKUs, companies can formulate targeted promotional activities and marketing strategies.
[0077] In this embodiment of the invention, based on the relationship between SKU and SPU, the SPU corresponding to each product seed can be further obtained.
[0078] In step 102, the SPU of the next level can be obtained by using a multimodal product crawler based on the SKU corresponding to each product seed.
[0079] The multimodal product crawler provided in this invention is a tool that comprehensively utilizes multiple technologies such as image recognition, natural language processing, speech recognition, and video processing to automatically acquire and integrate multimodal data (such as text, images, audio, and video) of products from the Internet. This technology can more comprehensively understand and analyze product information, support more complex data analysis and business decisions, and provide richer and more accurate data services for obtaining fine-grained attributes.
[0080] Specifically, Natural Language Processing (NLP) is one of the core technologies in web crawling. It enables crawlers to understand and analyze textual information on web pages. NLP can support functions such as extracting product names, semantic analysis of product descriptions, and sentiment analysis of user reviews. Through techniques such as word segmentation, part-of-speech tagging, and named entity recognition, useful information can be extracted from messy text. Computer vision technology enables crawlers to understand the content of product images. For example, Convolutional Neural Networks (CNNs) can be used to identify and classify visual elements in product images. In this embodiment of the invention, computer vision can help crawlers identify features such as brand logos, colors, and shapes in product images, and even perform more complex scene understanding and object recognition. Speech recognition technology can convert voice reviews into text, which can then be further analyzed using NLP. This is valuable for understanding user opinions and emotions. Video processing technology can help crawlers extract keyframes from videos and extract audio from videos for speech recognition. In addition, through video understanding technology, crawlers can identify key actions and events in videos, thereby providing richer product information. Combining data from different modalities to provide a unified view of a product is one of the key challenges of multimodal product crawling. Data fusion needs to handle the synchronization and alignment of different data sources to ensure that the data from all modalities refers to the same product. In addition, the fused data needs to be cleaned and standardized to facilitate subsequent analysis and application. Distributed crawling can run in parallel on multiple machines to improve crawling speed and reduce the risk of single points of failure. At the same time, efficient storage and retrieval systems are also crucial for processing and analyzing large datasets.
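The data fusion requirement in [0080], that records from different modalities must refer to the same product and be cleaned before use, can be illustrated with a simple join. This is a minimal sketch, assuming that each per-modality record carries a shared `product_id`; that key, the other field names, and the whitespace normalization step are assumptions for illustration.

```python
def fuse_by_product(text_records, image_records):
    """Join per-modality records on a shared product id.

    Both inputs are lists of dicts carrying a hypothetical "product_id" key.
    Products missing either modality are dropped so every fused record is aligned.
    """
    texts = {record["product_id"]: record for record in text_records}
    fused = []
    for image in image_records:
        text = texts.get(image["product_id"])
        if text is None:
            continue  # no text for this image: cannot form an aligned pair
        fused.append(
            {
                "product_id": image["product_id"],
                "title": " ".join(text["title"].split()),  # collapse stray whitespace
                "image_url": image["url"],
            }
        )
    return fused
```

Dropping unmatched records is the conservative choice; a real pipeline might instead queue them for a second crawl.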
[0081] In this embodiment of the invention, a multimodal product crawler, when crawling product seeds obtained from a designated website, analyzes the text information corresponding to the product seeds based on attributes such as images, colors, styles, types, and text information included in the product seeds. First, it analyzes the text information corresponding to the product seeds using text analysis technologies including word segmentation, part-of-speech tagging, and named entity recognition. Then, it uses computer vision technology to analyze the visual elements in the images corresponding to the product seeds, helping the crawler identify features such as brand logos, colors, and shapes in the product seed images. This allows it to crawl the SKU information of the product seeds and obtain information such as the corresponding images and titles.
[0082] For example, if the product seed is a women's dress, text analysis by the multimodal product crawler can obtain the text information identifying the product as a dress. Computer vision technology can obtain the image information of the dress, such as the picture, color, style, and brand logo. Thus, the corresponding SKU information of the dress can be obtained. Then, by locating the SKU on the webpage, all the information about the women's dress can be obtained, and the SPU content useful for fine-grained attributes, such as titles and images, can be extracted.
[0083] Furthermore, one SPU can correspond to multiple SKUs. In practical applications, an SKU is the smallest unit of inventory. In the apparel industry, "single style, single color, single size" represents one SKU. Correspondingly, since one SPU can include multiple SKUs, "single style, multiple colors, multiple sizes" might be one SPU; "multiple styles, multiple colors, multiple sizes" might also be one SPU. For example, as shown in Figure 2A, when the product seed determined in step 101 is a sun hat and its corresponding SKU is B085Q33Q65, crawling and analysis determine that the SPU corresponding to this SKU is B004NSUSRA. This SPU includes dozens of sun hats, which differ mainly in color or style.
[0084] In this embodiment of the invention, after determining the SKU corresponding to the product seed, one SPU or multiple SPUs including the SKU can be obtained based on the SKU.
[0085] In step 103, all information corresponding to the SPU of each product seed is crawled layer by layer according to the hierarchical structure to obtain all SKUs included in each SPU and the fine-grained attributes corresponding to the SKUs.
[0086] In practical applications, since each SPU can include multiple SKUs, after determining one or more SPUs corresponding to the SKUs of the product seed, we can further crawl the SKUs corresponding to each SPU layer by layer according to the hierarchical relationship of each SPU, so as to obtain all the SKUs corresponding to each product seed and the fine-grained attributes corresponding to each SKU.
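The SKU-to-SPU lookup and the layer-by-layer crawl described above can be sketched as follows. This is a minimal illustration, not the crawler of the embodiment: the in-memory tables `SKU_TO_SPU` and `SPU_PAGES`, the second SKU identifier, the titles, the URLs, and the attribute values are all hypothetical stand-ins for what a real crawler would fetch and parse from product pages. Only the codes B085Q33Q65 and B004NSUSRA come from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sku:
    """One stock keeping unit with the fields the crawler keeps."""
    sku_id: str
    title: str
    image_url: str
    attributes: Dict[str, str] = field(default_factory=dict)


# Hypothetical stand-in for the website (assumption): a real crawler would
# download and parse pages instead of reading these tables.
SKU_TO_SPU = {"B085Q33Q65": "B004NSUSRA"}
SPU_PAGES = {
    "B004NSUSRA": [
        {"sku_id": "B085Q33Q65", "title": "Sun hat, black, striped",
         "image_url": "https://example.com/hat-1.jpg",
         "attributes": {"color": "black", "pattern": "striped"}},
        {"sku_id": "EXAMPLE-SKU-2", "title": "Sun hat, white, striped",
         "image_url": "https://example.com/hat-2.jpg",
         "attributes": {"color": "white", "pattern": "striped"}},
    ],
}


def find_spu(seed_sku_id: str) -> str:
    """Layer 1: resolve the seed SKU to the SPU that contains it."""
    return SKU_TO_SPU[seed_sku_id]


def crawl_spu(spu_id: str) -> List[Sku]:
    """Layer 2: collect every SKU listed under the SPU with its attributes."""
    return [Sku(**record) for record in SPU_PAGES[spu_id]]


def crawl_seed(seed_sku_id: str) -> Dict[str, List[Sku]]:
    """Crawl from a product seed down to all sibling SKUs of its SPU."""
    spu_id = find_spu(seed_sku_id)
    return {spu_id: crawl_spu(spu_id)}


result = crawl_seed("B085Q33Q65")
```

Keeping the SPU as the dictionary key preserves the hierarchy, so the sibling SKUs of one seed stay together for the later grouping step.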
[0087] In this embodiment of the invention, fine-grained attributes may include any one or more combinations of color, style, material, and size. Furthermore, consistent annotation of the fine-grained attributes can yield fine-grained candidate groups for each product seed. It should be noted that a fine-grained candidate group includes at least one product, one image, and one text pair.
[0088] For example, as shown in Figure 2A, after the corresponding SPU code is obtained based on the SKU code B085Q33Q65 of the seed product, dozens of sun hats are obtained based on SPU B004NSUSRA. Most of these hats are the same style as the seed product's hat, all being sun hats; some are made of the same material as the seed product's hat, all being cotton; some are related in color to the seed product's hat, including both black and white; and some have the same pattern as the seed product's hat, both being striped.
[0089] In an embodiment of the present invention, after a series of hats related to the product seed is obtained from SPU B004NSUSRA as shown in Figure 2A, these hats can be further classified according to style, color, material, size, pattern, etc. The classification result is the fine-grained candidate group provided by the embodiments of the present invention.
[0090] It's important to note here that each fine-grained candidate group must include at least one product (hat), along with its corresponding image and text description. These are referred to as image-text pairs. Specifically, within each fine-grained candidate group, a product's main image and manually selected attributes form an <image, text> pair.
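A fine-grained candidate group of <image, text> pairs could be assembled along the following lines. This is a sketch under stated assumptions: the record layout (`main_image`, `attributes`), the file names, and the normalization rule (trim and lowercase) used here to stand in for consistent annotation are all hypothetical, and the text of each pair is simply the selected attribute values joined in order.

```python
from typing import Dict, List, NamedTuple


class ImageTextPair(NamedTuple):
    image: str
    text: str


def build_candidate_group(products: List[Dict],
                          attribute_types: List[str]) -> List[ImageTextPair]:
    """Pair each product's main image with its selected attribute values."""
    pairs = []
    for product in products:
        attrs = product["attributes"]
        if not all(t in attrs for t in attribute_types):
            continue  # skip products that lack one of the selected attributes
        # Assumed normalization so that equal values are annotated identically.
        text = " ".join(attrs[t].strip().lower() for t in attribute_types)
        pairs.append(ImageTextPair(product["main_image"], text))
    return pairs


# Hypothetical hat records for illustration only.
hats = [
    {"main_image": "hat-1.jpg", "attributes": {"color": "Black", "pattern": "striped"}},
    {"main_image": "hat-2.jpg", "attributes": {"color": "white ", "pattern": "Striped"}},
    {"main_image": "hat-3.jpg", "attributes": {"color": "black"}},
]
group = build_candidate_group(hats, ["color", "pattern"])
```

The third record is dropped because it has no pattern attribute, which mirrors the requirement that every member of a group carries a complete image-text pair.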
[0091] It should be noted that in image and text pairs, the text can still include color and type. Here, type can be a description of the material, style, size, or pattern. In this embodiment of the invention, the specific content of the text is not limited.
[0092] In step 104, multiple fine-grained candidate groups form the dataset provided in this embodiment of the invention. It should be noted that in practical applications, before the fine-grained candidate groups form the dataset, the products in each SPU may also be manually screened by experts with extensive experience in vision-language research methods. For example, as shown in Figure 2B, 4K fine-grained candidate groups were obtained in step 103, and after screening, 1788 fine-grained candidate groups were finally obtained, with an average of 6.63 products per group. These products are clearly comparable in some attributes, differing for example in color (e.g., red, yellow, and blue) or style (e.g., shark, lobster, and fish).
[0093] Furthermore, after obtaining the dataset, this embodiment of the invention constructs two evaluation tasks: fine-grained mixed-modal retrieval evaluation and fine-grained visual question answering evaluation.
[0094] 1) Fine-grained mixed-modal retrieval evaluation
[0095] Based on the dataset provided in this embodiment of the invention, random matching is performed on the image and text pairs included in the dataset to form random pairs including random images and random text. The random pairs are then input into the detection model. The retrieval accuracy of the detection model is determined based on the retrieval ranking list, which includes fine-grained retrieval samples, output by the detection model and corresponding to the random pairs.
[0096] It should be noted that the detection model provided in this embodiment of the invention is a visual-language multimodal pre-trained model. This model is an innovative technology in the field of deep learning, aiming to learn and understand the complex interactions between visual (images, videos) and linguistic (text) data through pre-training. This model combines computer vision (CV) and natural language processing techniques, enabling machines not only to understand images but also to comprehend and generate related text descriptions, or to recognize and interpret images based on text content.
[0097] Specifically, a visual-language multimodal pre-trained model typically consists of two parts: a convolutional neural network or Transformer network (such as a Transformer-based vision model) for processing visual information, and a Transformer network for processing linguistic information. During the pre-training phase, the model is usually trained on large-scale image-text datasets to learn cross-modal representations. By designing various pre-training tasks, such as image-text matching, masked language modeling, and masked image modeling, the model can capture the correlation between visual and linguistic data.
[0098] Pre-training datasets: The pre-training of these models typically relies on large multimodal datasets, such as MS COCO, Flickr30k, Conceptual Captions, and Visual Genome. These datasets contain a large number of images and corresponding descriptive texts, providing rich learning resources for the models.
[0099] Pre-training methods: During pre-training, the model learns the intrinsic connections between visual and linguistic information through self-supervised learning. For example, the model might need to predict randomly masked parts of text or infer occluded content in an image. Through these tasks, the model learns richer and more nuanced cross-modal feature representations.
[0100] Fine-tuning and Applications: After pre-training, the model can be fine-tuned to suit various downstream tasks, such as image annotation, visual question answering (VQA), image-text retrieval, and cross-modal translation. During the fine-tuning phase, the model is trained on a smaller dataset specific to the task, thereby learning to perform the task.
[0101] For the model to be detected, there are two types of input information: images and text. The model will extract the image information and text information to recognize the images and text, and then generate the content (images or text) required for the specified task. Therefore, image information is used in this process. If there are no image inputs, the accuracy of the generated content will be greatly reduced.
[0102] In this embodiment of the invention, any two image-text pairs, $\langle V_{im}, T_{im} \rangle$ and $\langle V_{in}, T_{in} \rangle$, are selected from the i-th group of the dataset. If the combination $\langle V_{im}, T_{in} \rangle$ is used as a query, $V_{in}$ can be considered the expected image (i.e., the ground truth), while the other images in the group can be considered negative samples. For the VLM method, these negative samples are difficult to distinguish to some extent. The fine-grained mixed-modal retrieval samples of the i-th group can be as shown in Equation (1):
[0103]
[0104] Here, $V_{im}$ represents the m-th sampled image in the i-th group, $T_{in}$ represents the n-th sampled text in the i-th group, $V_{in}$ represents the n-th sampled image in the i-th group, $V_{ix}$ represents the other sampled images in the i-th group, and $V_{j*}$ indicates an image sampled from another group.
[0105] In this embodiment of the invention, if there are K pairs of images and text in the i-th group, K(K-1) mixed-modal retrieval samples can be created. The mixed-modal retrieval task requires the multimodal model to deeply understand visual and linguistic queries and correctly retrieve the expected image from a set of similar images, which is a severe test of the fine-grained ability of the VLM method.
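The K(K-1) count follows from taking every ordered pair (m, n) with m different from n inside a group. A minimal sketch of that enumeration is given below. The dictionary layout and file names are hypothetical, and treating the query image as excluded from the in-group negatives is an assumption of this sketch rather than something the text specifies.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def mixed_modal_samples(group: List[Tuple[str, str]],
                        other_group_images: List[str]) -> List[Dict]:
    """Build one retrieval sample per ordered pair (m, n), m != n."""
    samples = []
    for m, n in permutations(range(len(group)), 2):
        query_image = group[m][0]            # image of the m-th pair
        target_image, query_text = group[n]  # image and text of the n-th pair
        # Assumption: in-group negatives exclude both the target and the query image.
        in_group = [img for k, (img, _) in enumerate(group) if k not in (m, n)]
        samples.append({
            "query_image": query_image,
            "query_text": query_text,
            "target": target_image,
            "negatives": in_group + list(other_group_images),
        })
    return samples


# Hypothetical group of K = 3 <image, text> pairs.
group = [
    ("img_1.jpg", "green shark"),
    ("img_2.jpg", "white lobster"),
    ("img_3.jpg", "white fish"),
]
samples = mixed_modal_samples(group, ["other_group.jpg"])
```

With K = 3 the function returns 3 × 2 = 6 samples, matching the K(K-1) count stated above.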
[0106] The fine-grained mixed-modality retrieval evaluation provided in this embodiment of the invention differs from previous retrieval tasks in two aspects.
[0107] First, the ground truth in existing retrieval tasks focuses on the semantics of the input text but fails to preserve the other semantic elements of the input image. In contrast, the ground truth in the fine-grained mixed-modal retrieval evaluation provided in this invention differs from the input image only minimally, changing only what is needed to match the semantics of the input text. Second, the large-scale evaluation samples are automatically generated by combining images and text within a group. This method yields 991,228 fine-grained retrieval samples. At the same time, the negative samples include very similar images, which are difficult for VLM models to distinguish. Therefore, the fine-grained mixed-modal retrieval evaluation provided in this invention assesses fine-grained capabilities more reliably.
[0108] As shown in Figure 3A, the dataset provided in this embodiment of the invention includes a set of fine-grained candidate groups. The attribute types of the fine-grained candidate group are {color, style}. The first product in the fine-grained candidate group has an <image, text> pair with the attributes <green, shark>; the second product has an <image, text> pair with the attributes <white, lobster>; and the third product has an <image, text> pair with the attributes <white, fish>.
[0109] The images and text included in the fine-grained candidate group of Figure 3A are randomly matched to obtain the random pairs shown in Figure 3B. These random pairs consist of multiple random images and random texts, including: an image of product 1 and text of product 2, an image of product 1 and text of product 3, an image of product 3 and text of product 2, an image of product 3 and text of product 1, an image of product 2 and text of product 3, and an image of product 2 and text of product 1.
[0110] According to the rank list shown in Figure 3A, the image of product 2 can be retrieved based on the image of product 1 and the text of product 2, and vice versa. It should be noted that the retrieval method provided in this embodiment of the invention is a multimodal image retrieval method of the form image + text -> image. Within each group of products, the images differ only in fine-grained details, and these differences are reflected in the text attributes of the products. The retrieval is therefore performed based on both the image and the text attributes. If the query consists of the image of product 1 and the text of product 2, then, because products in the same group are very similar, the content to be retrieved is determined by the text of product 2, and the image of product 2 is retrieved; conversely, if the query consists of the image of product 2 and the text of product 1, the content to be retrieved is determined by the text of product 1, and the image of product 1 is retrieved.
[0111] In this embodiment of the invention, for the detection model, given a random pair of random images and random text as input, the model under test is allowed to perform image retrieval. The retrieval accuracy of the model under test can be seen through a large dataset or random pairs, and the model can be evaluated.
[0112] For example, if the retrieval metrics are "rank@1", "rank@5", and "rank@10", in practical applications, "rank@1", "rank@5", and "rank@10" are evaluation metrics in the field of information retrieval, used to measure the performance of search engines or recommendation systems.
[0113] Specifically, rank@1: represents the proportion of the user's desired information appearing in the first position of the returned results. In other words, if the user's desired information is at the top of the returned results, this metric is 1. rank@5: represents the proportion of the user's desired information appearing in the first five positions of the returned results. If the user's desired information is in the first five positions of the returned results, this metric is 1. rank@10: represents the proportion of the user's desired information appearing in the first ten positions of the returned results. If the user's desired information is in the first ten positions of the returned results, this metric is 1.
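The rank@k metrics described above can be computed as in the following sketch. The function name `rank_at_k` and the toy ranking lists are illustrative assumptions; each query scores 1 when its expected image appears in the first k positions, and the metric is the average over all queries.

```python
from typing import List


def rank_at_k(ranked_lists: List[List[str]], targets: List[str], k: int) -> float:
    """Fraction of queries whose expected item is within the top k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Hypothetical ranking lists returned for four queries, with their expected items.
ranked_lists = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
targets = ["a", "a", "a", "b"]

r1 = rank_at_k(ranked_lists, targets, 1)  # 0.25: only the first query hits
r2 = rank_at_k(ranked_lists, targets, 2)  # 0.5: the first two queries hit
r3 = rank_at_k(ranked_lists, targets, 3)  # 1.0: every target is in the top 3
```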
[0114] 2) Fine-grained visual question answering evaluation
[0115] Based on the dataset provided in this embodiment of the invention, large-scale fine-grained VQA evaluations can be generated by combining images, text, and attributes.
[0116] Specifically, embodiments of the present invention propose two special VQA tasks: fine-grained multiple-choice VQA and fine-grained open-ended VQA.
[0117] Specifically, for fine-grained multiple-choice VQA, this embodiment of the invention inputs a random question and a random image from the dataset into the model to be detected. The model to be detected selects, from multiple alternative answers, the option corresponding to the random image and the random question. The model to be detected can then be evaluated based on the accuracy of the selected options.
[0118] For example, as shown in Figure 3C (c.1), in the fine-grained multiple-choice VQA task, question templates are written for different groups, such as "What is the <attribute type> of the product in the image?". The attribute type is automatically filled into the template to form a question, and the model to be tested then selects the correct answer that matches "the <attribute type> of the product" from multiple alternative answers. In this method, the image is treated as a visual reference, and all text in the group is combined into multiple choices. That is, for fine-grained multiple-choice VQA, given a question and an image of a product, the model to be tested is asked to choose from multiple options, and it is evaluated by its accuracy in selecting the correct answer to the question "What is the <attribute type> of the product?".
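Filling the template and collecting the options from the group could look like the following sketch. The record layout, file names, and function names are hypothetical; the options are taken here as the distinct values of the asked attribute type across the group, which is one plausible reading of combining all text in the group into multiple choices.

```python
import random
from typing import Dict, List

TEMPLATE = "What is the {attribute_type} of the product in the image?"


def build_multiple_choice(group: List[Dict], index: int,
                          attribute_type: str, rng: random.Random) -> Dict:
    """Create one multiple-choice sample for the product at `index`."""
    # Distinct attribute values in the group become the alternative answers.
    options = sorted({p["attributes"][attribute_type] for p in group})
    rng.shuffle(options)
    product = group[index]
    return {
        "image": product["image"],
        "question": TEMPLATE.format(attribute_type=attribute_type),
        "options": options,
        "answer": product["attributes"][attribute_type],
    }


def choice_accuracy(selected: List[str], samples: List[Dict]) -> float:
    """Share of samples where the selected option equals the correct answer."""
    correct = sum(choice == s["answer"] for choice, s in zip(selected, samples))
    return correct / len(samples)


# Hypothetical group with attribute types {color, style}.
group = [
    {"image": "img_1.jpg", "attributes": {"color": "green", "style": "shark"}},
    {"image": "img_2.jpg", "attributes": {"color": "white", "style": "lobster"}},
    {"image": "img_3.jpg", "attributes": {"color": "white", "style": "fish"}},
]
rng = random.Random(0)
samples = [build_multiple_choice(group, i, "style", rng) for i in range(len(group))]
accuracy = choice_accuracy(["shark", "fish", "fish"], samples)
```

Because the distractors come from sibling products in the same group, a model has to inspect the fine-grained visual detail rather than the product category.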
[0119] Specifically, for fine-grained open-ended VQA, this embodiment of the invention inputs a random question and a random image from the dataset into the model to be detected. The model to be detected generates a reference answer based on the random question and the random image, and the model to be detected is evaluated based on the accuracy of the generated reference answer.
[0120] For example, as shown in Figure 3C (c.2), samples are created in the fine-grained open-ended VQA task by treating the text as the ground truth for the generated reference answers. To increase the difficulty, the reference answers of the VLM model are required to strictly follow the semantics of the question. For example, if the question is "What are the colors and styles of the product in the image?", the reference answer generated by the model under test should likewise state the color and the style of the product, in that order. That is, for fine-grained open-ended VQA, given a question and a product image, the model under test is asked to generate a reference answer, and the evaluation here is the accuracy of the model's generated reference answer.
[0121] It should be noted that when there are multiple attribute types in a fine-grained candidate group, the question can be made more difficult by adjusting the attribute types or the order of the attribute types in the question. Meanwhile, by combining and arranging attribute types, large-scale fine-grained VQA samples can be generated. If there are L attribute types and K products in the i-th group, the total number of fine-grained multiple-choice VQA and fine-grained open-ended VQA samples generated from this group can be expressed by the following formula (2):
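One simple way to score open-ended answers is sketched below. The text does not specify how accuracy is computed, so the normalized exact-match rule used here is an assumption, as are the function names and the question wording. The ground-truth text lists the attribute values in the same order as the question, so an answer with the right values in the wrong order counts as incorrect.

```python
from typing import Dict, List


def build_open_question(attribute_types: List[str]) -> str:
    """Ask for the listed attribute types in the given order."""
    return "What is the {} of the product in the image?".format(
        " and ".join(attribute_types))


def ground_truth_text(attributes: Dict[str, str], attribute_types: List[str]) -> str:
    """Attribute values joined in the same order as they appear in the question."""
    return " ".join(attributes[t] for t in attribute_types)


def exact_match_accuracy(generated: List[str], ground_truths: List[str]) -> float:
    """Assumed scoring rule: case- and whitespace-insensitive exact match."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    correct = sum(normalize(g) == normalize(t)
                  for g, t in zip(generated, ground_truths))
    return correct / len(ground_truths)


attributes = {"color": "green", "style": "shark"}
question = build_open_question(["color", "style"])
truth = ground_truth_text(attributes, ["color", "style"])
swapped = ground_truth_text(attributes, ["style", "color"])
score = exact_match_accuracy(["Green shark", "shark white"],
                             ["green shark", "white shark"])
```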
[0122]
[0123] where the permutation term is computed as $L!/(L-i)!$, and $VQA_{G_i}$ represents the i-th group in VQA.
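A permutation count of this kind gives the number of ordered selections of a given size out of L attribute types, which is what varying the attribute types and their order in the question produces. The sketch below enumerates those selections and checks the count against the factorial expression. It illustrates only the permutation term, not the full formula (2), and the attribute list and function name are hypothetical.

```python
from itertools import permutations
from math import factorial, perm
from typing import List, Tuple


def ordered_attribute_selections(attribute_types: List[str],
                                 size: int) -> List[Tuple[str, ...]]:
    """All ordered selections of `size` attribute types (order matters)."""
    return list(permutations(attribute_types, size))


types = ["color", "style", "material"]
L = len(types)
counts = {}
for size in range(1, L + 1):
    selections = ordered_attribute_selections(types, size)
    # The enumeration agrees with L! / (L - size)!.
    assert len(selections) == perm(L, size) == factorial(L) // factorial(L - size)
    counts[size] = len(selections)
```

For three attribute types this yields 3, 6, and 6 ordered selections of sizes 1, 2, and 3, each of which can be turned into a distinct question.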
[0124] In summary, embodiments of the present invention provide a method and apparatus for constructing a dataset. The method includes: selecting multiple high-ranking products from a set website, including multiple categories, as product seeds; crawling the SKU inventory units of each product seed using a multimodal product crawler that includes at least text analysis and image recognition, to obtain the SPU standardized product unit of each product seed; crawling all information corresponding to the SPU of each product seed layer by layer according to a hierarchical structure, to obtain all SKUs included in each SPU and the fine-grained attributes corresponding to the SKUs, and performing consistent annotation on the fine-grained attributes to obtain a fine-grained candidate group for each product seed; wherein, the fine-grained candidate group includes at least one product, one image, and one text pair; and multiple fine-grained candidate groups form a dataset. This method offers a high-quality and labor-saving approach to constructing datasets. By integrating hierarchical information from e-commerce websites, it obtains fine-grained attributes based on similar product categories, thus forming fine-grained candidate groups. Each candidate group includes at least one product and an image-text pair. The products differ significantly in certain attributes, and the image-text pair accurately reveals these attribute differences, enabling the extraction of multimodal fine-grained features from the products. This method addresses the problem of existing multimodal dataset construction methods, which require manual data collection, resulting in high costs and poor data quality.
[0125] Based on the same inventive concept, this invention provides a dataset construction apparatus. Since the principle by which this apparatus solves the technical problem is similar to that of a dataset construction method, the implementation of this apparatus can refer to the implementation of the method, and repeated details will not be described again.
[0126] As shown in Figure 4, the device mainly includes a determining unit 401, a first obtaining unit 402, a second obtaining unit 403, and a forming unit 404.
[0127] The determining unit 401 is used to select multiple high-ranking products from multiple categories in the set website as product seeds;
[0128] The first obtaining unit 402 is used to crawl the SKU inventory units of each product seed by a multimodal product crawler that includes at least text analysis and image recognition, and obtain the SPU standardized product unit of each product seed.
[0129] The second obtaining unit 403 is used to crawl all the information corresponding to the SPU of each product seed layer by layer according to the hierarchical structure, obtain all the SKUs included in each SPU, and the fine-grained attributes corresponding to the SKUs, and perform consistent annotation on the fine-grained attributes to obtain the fine-grained candidate group of each product seed; wherein, the fine-grained candidate group includes at least one product, one image and one text pair;
[0130] Forming unit 404 is used to form a dataset from multiple fine-grained candidate groups.
[0131] Preferably, the forming unit 404 is further configured to:
[0132] Randomly match the image and text pairs included in the dataset to form random pairs including random images and random text, and input the random pairs into the detection model;
[0133] The retrieval accuracy of the model under test is determined based on the output of the model under test and the retrieval ranking list of the random pair, which includes fine-grained retrieval samples.
[0134] Preferably, the forming unit 404 is further configured to:
[0135] A random question and a random image from the dataset are input into the model to be detected, so that the model to be detected selects an option corresponding to the random image and the random question from the alternative answers, and the model to be detected is evaluated based on the accuracy of the selected options;
[0136] Alternatively, a random question and a random image from the dataset can be input into the model to be detected, so that the model to be detected generates a reference answer based on the random question and the random image, and the model to be detected is evaluated based on the accuracy of the reference answer.
[0137] It should be understood that the units included in the above-described dataset construction apparatus are merely a logical division based on the functions implemented by the apparatus. In practical applications, the units can be superimposed or split. Furthermore, the functions implemented by the dataset construction apparatus provided in this embodiment correspond one-to-one with the dataset construction method provided in the above-described embodiment. The more detailed processing flow implemented by the apparatus has been described in detail in the first embodiment of the method described above, and will not be described in detail here.
[0138] Another embodiment of the present invention provides a computer device, the computer device including: a processor and a memory; the memory is used to store computer program code, the computer program code including computer instructions; when the processor executes the computer instructions, the computer device executes each step of the dataset construction method in the method flow shown in the above method embodiment.
[0139] Another embodiment of the present invention provides a computer-readable storage medium storing computer instructions that, when executed on a computer device, cause the computer device to perform each step of the data set construction method in the method flow shown in the above method embodiment.
[0140] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0141] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A method for constructing a dataset, characterized in that it comprises: selecting multiple high-ranking products from multiple categories included in the website as product seeds; crawling the SKU inventory units of each product seed using a multimodal product crawler that includes at least text analysis and image recognition, to obtain the SPU standardized product unit of each product seed; crawling all the information of the SPU of each product seed layer by layer according to the hierarchical structure to obtain all SKUs included in the SPU corresponding to each product seed, as well as the fine-grained attributes corresponding to the SKUs, and performing consistent annotation on the fine-grained attributes to obtain the fine-grained candidate group of each product seed; wherein the fine-grained candidate group includes at least one product, one image and one text pair; forming a dataset from multiple fine-grained candidate groups; wherein the fine-grained attributes include at least color, style, material, and size; the SKU and SPU include at least the following relationships: hierarchical relationship, inventory management relationship, price relationship, sales statistics relationship, and marketing strategy relationship; after the plurality of fine-grained candidate groups are formed into a dataset, the method further comprises: randomly matching the image and text pairs included in the dataset to form random pairs including random images and random text, and inputting the random pairs into the detection model; and determining the retrieval accuracy of the model under test based on the retrieval ranking list, which includes fine-grained retrieval samples, output by the model under test and corresponding to the random pair.
2. The construction method as described in claim 1, characterized in that, After the plurality of fine-grained candidate groups are formed into a dataset, the following is also included: A random question and a random image from the dataset are input to the model to be detected, causing the model to select an option from alternative answers that corresponds to the random image and the random question. The model is then evaluated based on the accuracy of the selected options; or A random question and a random image from the dataset are input into the model to be detected, so that the model to be detected generates a reference answer based on the random question and the random image, and the model to be detected is evaluated based on the accuracy of the reference answer.
3. A dataset construction apparatus, characterized in that it comprises: a determining unit, used to select multiple high-ranking products from multiple categories included in the website as product seeds; a first obtaining unit, used to crawl the SKU inventory units of each product seed by a multimodal product crawler that includes at least text analysis and image recognition, and obtain the SPU standardized product unit of each product seed; a second obtaining unit, used to crawl all the information of the SPU of each product seed layer by layer according to the hierarchical structure, obtain all the SKUs included in the SPU corresponding to each product seed and the fine-grained attributes corresponding to the SKUs, and perform consistent annotation on the fine-grained attributes to obtain the fine-grained candidate group of each product seed; wherein the fine-grained candidate group includes at least one product, one image and one text pair; a forming unit, used to form a dataset from multiple fine-grained candidate groups; wherein the fine-grained attributes include at least color, style, material, and size; the SKU and SPU include at least the following relationships: hierarchical relationship, inventory management relationship, price relationship, sales statistics relationship, and marketing strategy relationship; the forming unit is further configured to: randomly match the image and text pairs included in the dataset to form random pairs including random images and random text, and input the random pairs into the detection model; and determine the retrieval accuracy of the model under test based on the retrieval ranking list, which includes fine-grained retrieval samples, output by the model under test and corresponding to the random pair.
4. The construction apparatus as described in claim 3, characterized in that, The forming unit is also used for: A random question and a random image from the dataset are input to the model to be detected, causing the model to select an option from alternative answers that corresponds to the random image and the random question. The model is then evaluated based on the accuracy of the selected options; or A random question and a random image from the dataset are input into the model to be detected, so that the model to be detected generates a reference answer based on the random question and the random image, and the model to be detected is evaluated based on the accuracy of the reference answer.
5. A computer device, characterized in that, The computer device includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the method for constructing a dataset as described in any one of claims 1-2.
6. A computer-readable storage medium, characterized in that, The storage medium stores a computer program that, when executed by a processor, causes the processor to perform the method for constructing the dataset as described in any one of claims 1-2.