A method for testing a category-aware and multimodal fashion compatibility model
By introducing a three-stage optimization mechanism of category embedding and multimodal feature fusion, the problems of overestimation of model performance and insufficient interpretability in clothing matching compatibility assessment in existing technologies are solved, and more accurate fine-grained aesthetic understanding and interpretable analysis are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHANGZHOU TEXTILE GARMENT INST
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for evaluating the compatibility of clothing item combinations suffer from several drawbacks. The simple generation of negative samples makes the task too easy, the model struggles to learn fine-grained aesthetic understanding, and it lacks the integration of category prior knowledge and interpretability. As a result, its performance is overestimated in real-world scenarios and it is difficult to provide a basis for decision-making.
By explicitly introducing learnable 768-dimensional category embeddings and concatenating them with CLIP visual features and BERT textual features in the early stage, a 2304-dimensional multimodal vector is formed. Combined with linear projection, GELU activation and layer normalization processing, it is mapped to a 768-dimensional single-item embedding vector. A three-stage joint loss function is used for optimization, and a high-difficulty FITB test set is constructed for multi-dimensional attribution diagnosis.
It significantly improves the model's ability to distinguish subtle style and color coordination differences, increases the difficulty and reliability of the evaluation benchmark, provides multi-dimensional interpretability analysis, and improves the accuracy of model performance evaluation in real-world scenarios.
Abstract
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and computer vision, and in particular to a method for testing a category-aware and multimodal fashion compatibility model.
Background Technology
[0002] In the field of fashion computing and recommendation systems, automatically evaluating the compatibility of clothing items is a core and challenging task. This technology aims to determine, through computational models, whether a given set of clothing items constitutes a harmonious overall combination in terms of visual appeal, style, and semantics. Its applications are widely used in various scenarios, including intelligent fashion recommendations, virtual wardrobe management, and personalized styling design.
[0003] Currently, research and practice in this field primarily rely on publicly available benchmarks to train and validate model performance, with datasets based on the Fill-in-the-Blank (FITB) task being the most common. However, these standard benchmarks have significant limitations in their construction: their negative samples are usually generated through random replacement or simple cross-class selection, resulting in a relatively low task difficulty. This makes the models prone to learning coarse-grained class rejection rules, failing to fully challenge and reflect their deep understanding of fine-grained aesthetic elements such as color, texture, and style, ultimately leading to an overestimation of model performance in real-world, complex scenarios.
[0004] From the perspective of model architecture and learning mechanism, existing methods also have several shortcomings. First, at the feature fusion level, although most methods introduce multimodal information (such as visual and textual information), they fail to explicitly and effectively integrate key prior knowledge of clothing categories. This lack of category information makes the model prone to confusion when faced with hard negative samples that have highly similar visual features but belong to different categories (such as tops and skirts of the same color), leading to a sharp drop in discriminative performance. Second, in terms of training optimization mechanisms, existing methods usually use relatively simple loss functions, lacking consistency between individual item representations and overall suit representations, as well as joint contrastive optimization designs for the most difficult negative samples, limiting the model's ability to learn more discriminative representations.
[0005] More importantly, existing technical solutions generally suffer from a lack of interpretability. The vast majority of models function merely as a "black box," ultimately outputting a compatibility score or ranking list, without providing users or developers with any systematic analysis of the decision-making process or the reasons for failure. When a model determines that a combination is incompatible, users struggle to understand whether the result is due to color discord, style clashes, or improper category matching. This severely hinders the practical application and iterative optimization of the technology.
[0006] To address these challenges, the industry has proposed several improvement strategies. For example, some research attempts to model complex relationships between individual items by constructing graph structures, or to introduce pre-trained large-scale visual-language models (such as CLIP) to enhance feature representation capabilities. Furthermore, commercial systems combining generative technologies and virtual try-on are beginning to emerge. However, these solutions still do not fundamentally solve the core problem of weak discriminative power in evaluation benchmarks, do not fully utilize fine-grained semantics and category priors, and still lack the ability to automatically and multi-dimensionally attribute and explain model prediction results.
[0007] Therefore, there is an urgent need for a new technical solution that can build a more discriminative evaluation benchmark to objectively assess the true capabilities of the model, design a more comprehensive multimodal fusion mechanism to utilize category priors, adopt more advanced joint optimization strategies to improve model performance, and ultimately provide a systematic interpretable analysis framework to gain insight into the model's decision-making logic and the root causes of errors.
Summary of the Invention
[0008] To address the above issues, this invention explicitly introduces a learnable 768-dimensional category embedding and performs early concatenation with CLIP visual features and BERT text features of equal dimension to form a 2304-dimensional original multimodal vector. This vector is then processed through linear projection, GELU activation, and layer normalization, ultimately mapping to a 768-dimensional single-item embedding vector. A weighted combination of a three-stage joint loss function is set to achieve multi-level optimization synergy. Finally, by constructing a high-difficulty FITB based on hard negative sampling of the same category and automatically performing multi-dimensional attribution diagnosis of mispredicted samples, this invention provides a reliable tool for objectively and reproducibly evaluating the model's true ability in understanding fine-grained aesthetic compatibility, significantly improving the model's ability to distinguish subtle differences in style, color, and functional coordination.
[0009] According to an embodiment of the present invention, a method for testing a category-aware and multimodal fashion compatibility model is provided.
[0010] In a first aspect of the invention, a method for testing a category-aware, multimodal fashion compatibility model is provided. The method includes:
Step S01: Construct a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors;
Step S02: Construct the category-aware and multimodal fashion compatibility model, which integrates the visual, textual, and learnable category features of individual items and obtains the final set representation vector through pooling and transformation;
Step S03: Use the training set to jointly optimize the category-aware and multimodal fashion compatibility model by combining bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss;
Step S04: Test and evaluate the category-aware and multimodal fashion compatibility model using the high-difficulty FITB test set, and perform multi-dimensional attribution diagnosis on the mispredicted samples.
[0011] Furthermore, before constructing the training set mentioned in step S01, the fashion matching dataset is cleaned and effective semantic categories are selected. The selection criterion for effective semantic categories is that they appear at least 10 times in the dataset.
[0012] Furthermore, the high-difficulty FITB test set described in step S01 consists of several four-choice fill-in-the-blank questions. Each fill-in-the-blank question includes one item selected from the original set that belongs to a specified category as the correct answer, and three other items selected from the same category as hard negative options. The number of questions used to generate the high-difficulty FITB test set is no less than 1,000.
[0013] Furthermore, the category-aware and multimodal fashion compatibility model described in step S02 extracts and concatenates the CLIP visual features, BERT text features, and learnable category embeddings of each item in the outfit. The item embedding vector is obtained by fusing them through linear projection, GELU activation, and layer normalization. The item embedding vectors are then mean-pooled to obtain the original set representation. Finally, the set representation vector is obtained through a linear transformation, GELU activation, and layer normalization.
[0014] Furthermore, the bidirectional In-batch InfoNCE loss described in step S03 is calculated by generating two views by adding Gaussian noise to the positive example set embedding.
[0015] Furthermore, the weights of the bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss mentioned in step S03 are 1.0, 1.0, and 0.3, respectively.
[0016] Furthermore, the multi-dimensional attribution diagnosis described in step S04 includes: color coordination preference diagnosis, category hierarchy confusion diagnosis, and composite factor influence diagnosis.
[0017] In a second aspect of the invention, an apparatus for testing a category-aware, multimodal fashion compatibility model is provided. The apparatus includes:
Dataset building module: used to build a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors;
Model building module: used to build the category-aware and multimodal fashion compatibility model, which integrates the visual, textual, and learnable category features of individual items and obtains the final set representation vector through pooling and transformation;
Model training module: used to jointly optimize the category-aware and multimodal fashion compatibility model on the training set by combining bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss;
Model testing module: used to test and evaluate the category-aware and multimodal fashion compatibility model using the high-difficulty FITB test set, and to perform multi-dimensional attribution diagnosis on mispredicted samples.
[0018] In a third aspect of the invention, an electronic device is provided. The electronic device includes a memory and a processor, the memory storing a computer program, the processor executing the program to implement the method according to a first aspect of the invention.
[0019] In a fourth aspect of the invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method according to a first aspect of the invention.
[0020] This invention explicitly introduces a learnable 768-dimensional category embedding and performs early concatenation with CLIP visual features and BERT text features of equal dimension to form a 2304-dimensional original multimodal vector. After linear projection, GELU activation, and layer normalization, it is finally mapped to a 768-dimensional single-item embedding vector. A weight combination of a three-stage joint loss function is set to achieve multi-level optimization synergy. Finally, by constructing a high-difficulty FITB based on hard negative sampling of the same category and automatically diagnosing multi-dimensional causes of predicted errors, this invention provides a reliable tool for objectively and reproducibly evaluating the model's true ability to understand fine-grained aesthetic compatibility, and significantly improves the model's ability to distinguish subtle differences in style, color, and functional coordination.
[0021] It should be understood that the description in the Summary of the Invention is not intended to limit the key or essential features of the embodiments of the present invention, nor is it intended to restrict the scope of the invention. Other features of the invention will become readily apparent from the following description.
[0022] Beneficial effects:
1. Significantly improves the difficulty of the evaluation benchmark and the reliability of the assessment of the model's true performance: Constructing a high-difficulty FITB test set based on hard negative sampling from the same category addresses the problem that the existing Polyvore standard FITB is too simple because its negative samples come from random cross-category replacement, which leads to an overestimation of model performance. Experiments show that the method of this invention achieves an argmax accuracy of 66.90% on this high-difficulty benchmark, which is 13.34 percentage points higher than the compatibility-sensitive network (53.56%), 14.90 percentage points higher than the GNN method (52.00%), and more than 27 percentage points higher than early triple conditional embedding methods. This benchmark provides a reliable tool for objectively and reproducibly evaluating the model's true ability in fine-grained aesthetic compatibility understanding.
2. The synergy of multimodal fusion and the category-aware mechanism significantly enhances fine-grained discrimination: A learnable 768-dimensional category embedding is explicitly introduced and concatenated early with CLIP visual features and BERT textual features of the same dimension, forming a 2304-dimensional original multimodal vector. This vector is then processed through linear projection, GELU activation, and layer normalization and mapped to a 768-dimensional item embedding vector. This mechanism achieves deep synergy of multimodal information: the category embedding provides an explicit clothing type prior, effectively mitigating confusion between items of the same category in visually highly similar scenarios; ablation experiments show that the BERT textual modality contributes the most (accuracy decreases by 14.00 percentage points after removal), providing key semantic discrimination clues for hard negative examples with highly similar visual features; the domain-fine-tuned CLIP visual backbone further enhances the robustness of the features. Overall, the model's accuracy on the high-difficulty FITB task is 8.50 percentage points higher than that of the variant with the category embedding removed, and 14.00 percentage points higher than that of the variant with the text modality completely removed.
3. The three-stage joint contrastive loss optimization mechanism significantly improves representation quality: A weighted combination of the three-stage joint loss function achieves multi-level optimization synergy. The first stage adopts bidirectional In-batch InfoNCE loss (generating views by perturbing positive examples with Gaussian noise) to enhance the consistency of set-level representations; the second stage adopts Item-to-Outfit InfoNCE loss to strengthen the semantic alignment between individual items and sets; the third stage introduces Outfit-level Ranking loss to widen the score gap between positive sets and the most difficult negative sets. This mechanism prevents the model from over-relying on local individual item features and directly optimizes the compatibility decision boundary. Ablation experiments show that the model accuracy decreases by 4.20 percentage points when only a single InfoNCE loss is used, confirming that the three-stage joint optimization improves the model's ability to distinguish subtle differences in style, color, and functional coordination.
4. The TACED interpretability diagnostic framework provides systematic error attribution and optimization directions: The TACED framework is the first to achieve automatic multi-dimensional attribution diagnosis of mispredicted samples (supporting multiple labels per sample). Its analysis results show that color coordination preference errors account for 49.5% (judgment threshold of 0.045), category hierarchy confusion accounts for 24.5%, and the influence of composite factors accounts for 35.0%. Through CLIP cosine similarity comparison, predefined clothing hierarchy matching, and composite judgment logic, the framework reveals the model's limitations: susceptibility to interference from color visual saliency, insufficient fine-grained category differentiation, and weak multi-factor joint decision-making. Combined with the web interactive system, error evidence, typical cases, and the error type distribution can be visualized, providing a clear and quantifiable improvement path for subsequent targeted optimization of the model, which is of practical engineering value.
5. Advantages in the practicality and scalability of the overall system and device: The method proposed in this invention supports end-to-end training and inference and can be transferred from public datasets to real user wardrobe image scenarios; the system adopts a modular design, which is convenient for industrial-grade deployment and application; the integrated web interactive diagnostic system can improve user experience and commercial conversion potential, and is suitable for diverse application scenarios such as fashion recommendation, e-commerce shopping guides, and virtual fitting rooms.
Attached Figure Description
[0023] The above and other features, advantages, and aspects of the various embodiments of the present invention will become more apparent from the accompanying drawings and the following detailed description, in which:
Figure 1 shows a flowchart of a method for testing a category-aware and multimodal fashion compatibility model according to an embodiment of the present invention;
Figure 2 shows a comparison diagram of a high-difficulty FITB test set and a standard FITB test set according to an embodiment of the present invention;
Figure 3 shows a schematic diagram of a category-aware and multimodal fashion compatibility model according to an embodiment of the present invention;
Figure 4 shows a schematic diagram of TACED diagnostic test results according to an embodiment of the present invention;
Figure 5 shows a schematic diagram of a Streamlit-based web diagnostic interface according to an embodiment of the present invention;
Figure 6 shows a block diagram of an apparatus for testing a category-aware and multimodal fashion compatibility model according to an embodiment of the present invention;
Figure 7 shows a schematic diagram of a device for testing a category-aware and multimodal fashion compatibility model according to an embodiment of the present invention.
Detailed Implementation
[0024] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0025] According to an embodiment of the present invention, a method for testing a category-aware and multimodal fashion compatibility model is proposed. This method explicitly introduces a learnable 768-dimensional category embedding and concatenates it with CLIP visual features and BERT text features of equal dimension in an early stage, forming a 2304-dimensional original multimodal vector. This vector is then processed through linear projection, GELU activation, and layer normalization, ultimately mapping to a 768-dimensional single-item embedding vector. A weighted combination of a three-stage joint loss function is set to achieve multi-level optimization synergy. Finally, by constructing a high-difficulty FITB based on hard negative sampling of the same category and automatically performing multi-dimensional attribution diagnosis on mispredicted samples, a reliable tool is provided for objectively and reproducibly evaluating the model's true ability in understanding fine-grained aesthetic compatibility. This significantly improves the model's ability to distinguish subtle differences in style, color, and functional coordination.
[0026] The principles and spirit of the present invention will be explained in detail below with reference to several representative embodiments.
[0027] Figure 1 is a schematic flowchart illustrating a method for testing a category-aware and multimodal fashion compatibility model according to an embodiment of the present invention. The method includes:
Step S01: Construct a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors;
Step S02: Construct the category-aware and multimodal fashion compatibility model, which integrates the visual, textual, and learnable category features of individual items and obtains the final set representation vector through pooling and transformation;
Step S03: Use the training set to jointly optimize the category-aware and multimodal fashion compatibility model by combining bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss;
Step S04: Test and evaluate the category-aware and multimodal fashion compatibility model using the high-difficulty FITB test set, and perform multi-dimensional attribution diagnosis on the mispredicted samples.
[0028] It should be noted that although the operation of the method of the present invention has been described in a specific order in the above embodiments and figures, this does not require or imply that the operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0029] To provide a clearer explanation of the above-mentioned method for testing based on category awareness and multimodal fashion compatibility models, several specific embodiments are described below. However, it is worth noting that these embodiments are only for better illustrating the present invention and do not constitute an improper limitation of the present invention.
[0030] Example 1
[0031] Step S01: Construct a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors.
[0032] Before constructing the training set, the fashion matching dataset was cleaned and effective semantic categories were selected. The selection criteria for effective semantic categories were that they appeared at least 10 times in the dataset.
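The category selection rule above can be illustrated with a minimal sketch. The function name `select_valid_categories` and the toy label list are hypothetical; only the threshold of 10 occurrences comes from the text.

```python
from collections import Counter

MIN_CATEGORY_COUNT = 10  # frequency threshold stated in the text


def select_valid_categories(item_categories, min_count=MIN_CATEGORY_COUNT):
    """Return the categories that occur at least `min_count` times."""
    counts = Counter(item_categories)
    return {cat for cat, n in counts.items() if n >= min_count}


# Hypothetical toy data: one category label per item occurrence.
labels = ["tops"] * 12 + ["shoes"] * 10 + ["scarves"] * 3
valid = select_valid_categories(labels)
print(sorted(valid))  # ['shoes', 'tops']
```

Here "scarves" is dropped because it appears only 3 times, while "shoes" is kept because it meets the threshold exactly.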
[0033] The training set contains several complete sets of samples that serve as supervised learning examples.
[0034] The high-difficulty FITB test set consists of several four-choice fill-in-the-blank questions. Each question includes one item from the original set that belongs to a specified category as the correct answer, and three other items from the same category as hard negative options. The number of questions generated for the high-difficulty FITB test set is no less than 1,000.
[0035] In this embodiment, the Polyvore dataset was cleaned, retaining only complete sets with at least 3 items, resulting in a training set of 15,028 sets, a validation set of 3,016 sets, and a test set of 3,059 sets. The semantic categories of all items were counted, and valid semantic categories (including tops, bottoms, coats, shoes, bags, accessories, etc.) with at least 10 occurrences were selected. For each question, one item belonging to a valid category was randomly selected from a test-set outfit as the correct answer. Within the same category (excluding items already present in the current set), three other items were randomly selected as hard negative options, generating 1000 strictly same-category four-choice fill-in-the-blank questions that form the high-difficulty FITB test set. Figure 2 shows a comparison of the high-difficulty FITB test set and the standard FITB test set in this embodiment.
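The question construction described above could be sketched as follows. This is only an illustration under stated assumptions: the function `build_hard_fitb`, the dictionary layout of a question, the retry limit, and the toy item ids are all hypothetical, not the patent's actual implementation.

```python
import random


def build_hard_fitb(outfits, item_category, valid_categories, n_questions, seed=0):
    """Build four-choice FITB questions whose distractors share the answer's category.

    outfits: list of lists of item ids; item_category: dict item id -> category.
    """
    rng = random.Random(seed)
    by_category = {}
    for item, cat in item_category.items():
        by_category.setdefault(cat, []).append(item)

    questions, attempts = [], 0
    while len(questions) < n_questions and attempts < 100 * n_questions:
        attempts += 1
        outfit = rng.choice(outfits)
        candidates = [i for i in outfit if item_category[i] in valid_categories]
        if not candidates:
            continue
        answer = rng.choice(candidates)
        cat = item_category[answer]
        # Same-category pool, excluding items already present in this outfit.
        pool = [i for i in by_category[cat] if i not in outfit]
        if len(pool) < 3:
            continue
        options = [answer] + rng.sample(pool, 3)
        rng.shuffle(options)
        questions.append({
            "context": [i for i in outfit if i != answer],
            "options": options,
            "label": options.index(answer),
            "category": cat,
        })
    return questions


# Hypothetical toy data: six outfits, each with one top, one bottom, one pair of shoes.
item_category = {}
for k in range(6):
    item_category.update({f"top{k}": "tops", f"bottom{k}": "bottoms", f"shoe{k}": "shoes"})
outfits = [[f"top{k}", f"bottom{k}", f"shoe{k}"] for k in range(6)]
questions = build_hard_fitb(outfits, item_category, {"tops", "bottoms", "shoes"}, n_questions=5)
print(questions[0])
```

Because every distractor comes from the answer's own category, a model cannot answer by rejecting options of the wrong garment type, which is the shortcut the standard benchmark allows.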
[0036] Step S02: Construct a category-aware and multimodal fashion compatibility model: Integrate the visual, textual, and learnable category features of individual items, and obtain the final set representation vector through pooling and transformation.
[0037] As shown in Figure 3, the category-aware and multimodal fashion compatibility model extracts and concatenates the CLIP visual features, BERT text features, and learnable category embeddings of each item in the set. The item embedding vector is obtained by fusing them through linear projection, GELU activation, and layer normalization. The item embedding vectors are then mean-pooled to obtain the original set representation. Finally, the set representation vector is obtained through a linear transformation, GELU activation, and layer normalization.
[0038] CLIP-ViT-B/32 extracts visual features (768-dimensional), BERT-base-uncased extracts text features (768-dimensional), and nn.Embedding generates category embeddings (768-dimensional). The three are concatenated into a 2304-dimensional original vector, which is then linearly projected (Linear, 2304 → 768 dimensions), activated by GELU, and normalized by LayerNorm (768 dimensions) to obtain a 768-dimensional individual item embedding vector. Mean pooling is then applied to all individual item embeddings within the set to obtain the original set representation (768-dimensional). This is then linearly transformed (Linear, 768 → 512 dimensions), activated by GELU, and normalized by LayerNorm (512 dimensions) to obtain the final 512-dimensional set representation vector.
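The data flow of this fusion pipeline can be traced with a small, dependency-free sketch. It is an assumption-laden illustration rather than the patent's code: it uses toy sizes (D = 8 instead of 768, output 4 instead of 512), random vectors as stand-ins for CLIP, BERT, and category embeddings, randomly initialized weights, and a layer normalization without learned scale and shift. All function names are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b for one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU: 0.5 * t * (1 + erf(t / sqrt(2)))."""
    return [0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0))) for t in x]


def layer_norm(x, eps=1e-5):
    """Layer normalization without learned scale/shift (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((t - mean) ** 2 for t in x) / len(x)
    return [(t - mean) / math.sqrt(var + eps) for t in x]


def init_linear(rng, d_in, d_out):
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def embed_item(visual, text, category, item_proj):
    fused = visual + text + category  # early concatenation: 3 * D values
    return layer_norm(gelu(linear(fused, *item_proj)))


def embed_outfit(item_vectors, outfit_proj):
    n = len(item_vectors)
    pooled = [sum(col) / n for col in zip(*item_vectors)]  # mean pooling over items
    return layer_norm(gelu(linear(pooled, *outfit_proj)))


# Toy sizes; the text uses D = 768 (3 * D = 2304) and a 512-dimensional set vector.
D, D_OUT = 8, 4
rng = random.Random(0)
item_proj = init_linear(rng, 3 * D, D)
outfit_proj = init_linear(rng, D, D_OUT)


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(D)]


items = [embed_item(rand_vec(), rand_vec(), rand_vec(), item_proj) for _ in range(3)]
outfit_vec = embed_outfit(items, outfit_proj)
print(len(items[0]), len(outfit_vec))  # 8 4
```

In a real implementation these steps would typically be PyTorch modules (for example `nn.Linear`, `nn.GELU`, `nn.LayerNorm`, and the `nn.Embedding` named in the text); the pure-Python version is used here only to make the shapes and the order of operations explicit.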
[0039] Step S03: Use the training set to jointly optimize the category-aware and multimodal fashion compatibility model by combining bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss.
[0040] The bidirectional in-batch InfoNCE loss is calculated by generating two views by adding Gaussian noise to the positive example set embedding.
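One way to read this description is sketched below: each positive set embedding is perturbed twice with Gaussian noise, and InfoNCE is computed in both directions between the two views, with the other sets in the batch acting as negatives. The noise standard deviation, the temperature, the L2 normalization, and the equal averaging of the two directions are assumptions, since the text does not state them.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(t * t for t in v)) or 1.0
    return [t / norm for t in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy where keys[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
        total += log_z - logits[i]
    return total / len(queries)


def bidirectional_info_nce(set_embs, noise_std=0.05, temperature=0.1, rng=None):
    """noise_std and temperature are hypothetical values, not taken from the text."""
    rng = rng or random.Random(0)

    def noisy_view(v):
        return normalize([t + rng.gauss(0.0, noise_std) for t in v])

    view_a = [noisy_view(v) for v in set_embs]
    view_b = [noisy_view(v) for v in set_embs]
    # Average the a->b and b->a directions.
    return 0.5 * (info_nce(view_a, view_b, temperature) + info_nce(view_b, view_a, temperature))


rng = random.Random(42)
batch = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
loss = bidirectional_info_nce(batch, rng=rng)
print(round(loss, 4))
```

With a batch of 4 sets, a model that cannot tell the views apart would score about log(4); well-separated embeddings push the loss toward 0.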
[0041] The weights for the bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss are 1.0, 1.0, and 0.3, respectively.
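The weighted combination can be written out as a short sketch. The weights 1.0, 1.0, and 0.3 come from the text; the hinge form of the ranking loss against the hardest negative, the margin value, and all function names are assumptions made for illustration.

```python
# Loss weights as stated in the text.
LOSS_WEIGHTS = {"in_batch": 1.0, "item_to_outfit": 1.0, "ranking": 0.3}


def hardest_negative_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Hinge loss against the highest-scoring (hardest) negative set.

    The hinge form and the margin of 0.2 are hypothetical choices.
    """
    return max(0.0, margin - pos_score + max(neg_scores))


def joint_loss(in_batch, item_to_outfit, ranking, weights=LOSS_WEIGHTS):
    """Weighted sum of the three loss terms."""
    return (weights["in_batch"] * in_batch
            + weights["item_to_outfit"] * item_to_outfit
            + weights["ranking"] * ranking)


# Hypothetical per-batch values for the three terms.
rank = hardest_negative_ranking_loss(pos_score=0.70, neg_scores=[0.40, 0.65, 0.10])
total = joint_loss(in_batch=0.9, item_to_outfit=1.1, ranking=rank)
print(round(rank, 3), round(total, 3))  # 0.15 2.045
```

In this toy case only the negative scoring 0.65 matters for the ranking term, since it is the one closest to the positive score.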
[0042] Training environment: the CPU is a 16 vCPU Intel(R) Xeon(R) Platinum 8474C; the GPU is an RTX4090D. Specific training parameter configurations are shown in Table 1.
Table 1
[0043] Step S04: Test and evaluate the category-aware and multimodal fashion compatibility model using the high-difficulty FITB test set, and perform multi-dimensional attribution diagnosis on the mispredicted samples.
[0044] TACED multi-dimensional attribution diagnosis supports multi-label annotation. The criterion for a color coordination preference error is that the average CLIP cosine similarity between the wrong option and the context items exceeds that of the correct option by more than a preset threshold.
[0045] In this embodiment, the argmax accuracy reaches 66.90% on the highly challenging FITB benchmark.
[0046] In this embodiment, an ablation experiment was also performed, as detailed below.
[0047] Removing learnable category embeddings: using only CLIP vision + BERT text concatenation (1536-dimensional → 768-dimensional projection), the accuracy drops to 58.40%, a decrease of 8.50 percentage points.
[0048] Removing the BERT text modality: using visual + category embeddings only, accuracy drops to 52.90%, a decrease of 14.00 percentage points.
[0049] CLIP without domain fine-tuning: using frozen pre-trained weights, accuracy drops to 60.20%, a decrease of 6.70 percentage points.
[0050] Using only random negative samples (removing hard negative sampling): accuracy drops to 60.20%, a decrease of 6.70 percentage points.
[0051] Using only the single In-batch InfoNCE loss (removing the three-stage joint loss): accuracy drops to 62.70%, a decrease of 4.20 percentage points.
[0052] Conclusion: Category embedding, BERT text modality, CLIP domain fine-tuning, same-category hard negative sampling, and three-stage joint loss are all necessary components. Among them, BERT text contributes the most, and hard negative sampling and category embedding are particularly effective for fine-grained tasks, as shown in Table 2.
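The ablation figures can be cross-checked with a few lines of arithmetic. The accuracies are the ones reported in this embodiment; the variable names are hypothetical. The check shows that each reported drop is the absolute difference from the full model's 66.90%, in percentage points.

```python
FULL_MODEL_ACC = 66.90  # argmax accuracy of the full model (%)

# Accuracy (%) of each ablated variant, as reported in the text.
ablations = {
    "no category embedding": 58.40,
    "no BERT text": 52.90,
    "frozen CLIP (no domain fine-tuning)": 60.20,
    "random negatives only": 60.20,
    "single in-batch InfoNCE loss": 62.70,
}

# Absolute drop in percentage points relative to the full model.
drops = {name: round(FULL_MODEL_ACC - acc, 2) for name, acc in ablations.items()}
for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{name}: -{drop:.2f} pp")
```

Sorting by drop puts the BERT text modality first, matching the conclusion that it contributes the most.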
[0053] Table 2
[0054] Example 2
[0055] This embodiment demonstrates the application of the TACED interpretable diagnostic system.
[0056] TACED automated diagnosis was performed on the 331 samples that were incorrectly predicted in the high-difficulty FITB test set:
1. Color coordination preference judgment: calculate the average CLIP cosine similarity p_color between the incorrectly chosen option and the context items, and the average similarity c_color between the correct option and the context items; if p_color > c_color + 0.045, mark the sample as a color coordination preference error.
[0057] 2. Category hierarchy confusion judgment: Based on the predefined clothing hierarchy structure (e.g., top → T-shirt / shirt / sweatshirt etc.), determine whether the predicted category and the correct category are different subclasses of the same parent class.
[0058] 3. Judgment of the influence of composite factors: If neither of the first two categories is triggered, it is marked as the influence of composite factors (involving multiple implicit constraints such as occasion, season, body type, style, etc.).
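The three judgment rules above can be sketched as a small labeling function. The threshold 0.045 and the rule order come from the text; the hierarchy contents, the label strings, the helper names, and the use of fine-grained subclass labels as inputs are assumptions for illustration.

```python
import math

COLOR_MARGIN = 0.045  # threshold stated in the text

# Hypothetical two-level hierarchy: parent class -> subclasses.
HIERARCHY = {"top": {"t-shirt", "shirt", "sweatshirt"}, "bottom": {"jeans", "skirt"}}


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def mean_cosine(option_vec, context_vecs):
    """Average cosine similarity between one option and the context items."""
    return sum(cosine(option_vec, c) for c in context_vecs) / len(context_vecs)


def parent_of(category, hierarchy=HIERARCHY):
    for parent, children in hierarchy.items():
        if category in children:
            return parent
    return None


def diagnose(p_color, c_color, predicted_cat, correct_cat):
    """Return the list of TACED-style labels for one mispredicted sample."""
    labels = []
    if p_color > c_color + COLOR_MARGIN:
        labels.append("color_coordination_preference")
    p_parent, c_parent = parent_of(predicted_cat), parent_of(correct_cat)
    if predicted_cat != correct_cat and p_parent is not None and p_parent == c_parent:
        labels.append("category_hierarchy_confusion")
    if not labels:  # neither of the first two rules fired
        labels.append("composite_factors")
    return labels


print(diagnose(0.80, 0.70, "shirt", "t-shirt"))
print(diagnose(0.70, 0.70, "shirt", "jeans"))
```

The first call triggers both the color and the category rule, which is how one sample can carry more than one label; the second triggers neither and falls through to the composite label.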
[0059] The system supports multi-label annotation, and the total number of annotations exceeds the number of samples. Diagnostic result statistics: Errors in color coordination preferences: 164 times, accounting for 49.5%; Category hierarchy confusion: 81 times, accounting for 24.5%; Composite factors had an impact: 116 times, accounting for 35.0%.
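The reported shares can be reproduced from the counts, as a quick consistency check. The counts and the 1000-question test size are taken from this document; the variable names are hypothetical.

```python
n_questions = 1000
accuracy = 0.6690
n_errors = round(n_questions * (1 - accuracy))  # mispredicted samples

# Label counts reported by the diagnosis (multi-label, so they can overlap).
label_counts = {"color": 164, "category": 81, "composite": 116}
shares = {k: round(100 * v / n_errors, 1) for k, v in label_counts.items()}
total_labels = sum(label_counts.values())
extra_labels = total_labels - n_errors

print(n_errors, shares, total_labels, extra_labels)
```

The shares sum to more than 100% because labels overlap: 361 labels are attached to 331 samples. Since the composite label is assigned only when neither of the first two rules fires, the 30 extra labels would correspond to samples carrying both the color and the category label.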
[0060] The specific testing method is as follows: High-difficulty FITB accuracy test: run inference on the 1000-question high-difficulty FITB test set constructed in Example 1, encode the context and the 4 options for each question, calculate the compatibility score (cosine similarity or projected inner product) between the resulting representation vectors, select the option with the highest score as the prediction, and calculate the proportion of correct predictions (argmax Acc@1).
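The argmax evaluation can be sketched as below. This assumes one possible reading of the scoring step, namely cosine similarity between a context representation and each option's embedding; scoring each completed outfit instead would be an equally valid reading. The question layout and the 2-dimensional toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def predict(context_vec, option_vecs):
    """Index of the option with the highest compatibility score."""
    scores = [cosine(context_vec, o) for o in option_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def fitb_accuracy(questions):
    """Proportion of questions whose argmax prediction equals the label."""
    correct = sum(predict(q["context"], q["options"]) == q["label"] for q in questions)
    return correct / len(questions)


# Hypothetical toy questions with 2-dimensional embeddings.
toy_questions = [
    {"context": [1.0, 0.0],
     "options": [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.5, 0.5]],
     "label": 1},
    {"context": [0.0, 1.0],
     "options": [[0.0, 1.0], [1.0, 0.0], [0.2, 1.0], [-1.0, -1.0]],
     "label": 2},
]
acc = fitb_accuracy(toy_questions)
print(acc)  # 0.5
```

The second toy question is answered incorrectly because a distractor scores slightly higher than the labeled answer, which is exactly the kind of sample that would be passed on to the TACED diagnosis.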
[0061] TACED diagnostic test: run the diagnostic logic on all incorrectly predicted samples and statistically analyze the multi-label distribution, as shown in Figure 4; the diagnostic accuracy is verified by manually reviewing at least 100 samples.
[0062] Implementation of a web interaction system: this embodiment also develops a web-based diagnostic interface built on Streamlit, as shown in Figure 5. Users can upload test-set images or select test questions, and the system displays the following in real time: a comparison of the predicted results with the correct answers; TACED diagnostic labels and confidence levels; a visualization of the error evidence (color similarity heatmap, category hierarchy tree, composite factor hint text); and an error type distribution bar chart (global statistics).
[0063] This embodiment ultimately achieves: an FITB benchmark accuracy of 66.90% (proportion of four-choice fill-in-the-blank questions answered correctly); an improvement over mainstream baselines of 13.34% over compatibility-sensitive networks and more than 27% over triple conditional embeddings; a TACED diagnostic coverage of 100% (all error samples can be automatically attributed); and an error type recognition F1 score > 0.85 at a color coordination judgment threshold of 0.045 (based on the validation set).
[0064] Based on the same inventive concept, this invention also proposes a device for testing based on category perception and multi-modal fashion compatibility model. For the implementation of this device, refer to the implementation of the method described above; repeated details are omitted. As shown in Figure 6, the device 100 includes: Dataset building module 101: used to build a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors; Model building module 102: used to build a category-aware and multimodal fashion compatibility model, which integrates the visual, textual, and learnable category features of individual items and obtains the final set representation vector through pooling and transformation; Model training module 103: used to jointly optimize the category-aware and multimodal fashion compatibility model on the training set by combining the bidirectional In-batch InfoNCE loss, the Item-to-Outfit InfoNCE loss, and the Outfit-level Ranking loss; Model testing module 104: used to test and evaluate the category-aware and multimodal fashion compatibility model on the high-difficulty FITB test set, and to perform multi-dimensional attribution diagnosis on mispredicted samples.
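The fusion performed by the model building module can be illustrated with a small pure-Python sketch that follows the steps recited in claim 4: concatenate the visual, text, and category vectors of each item, apply linear projection, GELU, and layer normalization, mean-pool the item embeddings, and project once more. Everything about the code itself is an assumption made for illustration: the helper names, the random weight initialization, the omission of learnable layer-normalization parameters, and the toy dimension DIM = 8 (the described model uses 768 per modality, so 2304 after concatenation).

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-5):
    """Layer normalization without learnable scale and shift (simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def make_linear(in_dim, out_dim, rng):
    """Randomly initialized (weight, bias) pair standing in for trained weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(layer, vec):
    """Linear projection, then GELU, then layer normalization."""
    weight, bias = layer
    hidden = [sum(w * x for w, x in zip(row, vec)) + b
              for row, b in zip(weight, bias)]
    return layer_norm([gelu(h) for h in hidden])


def encode_outfit(items, item_layer, outfit_layer):
    """items: list of (visual, text, category) vector triples for one outfit."""
    # List concatenation plays the role of feature concatenation (3 * DIM values).
    item_embs = [project(item_layer, visual + text + category)
                 for visual, text, category in items]
    pooled = [sum(col) / len(item_embs) for col in zip(*item_embs)]  # mean pooling
    return project(outfit_layer, pooled)


rng = random.Random(0)
DIM = 8  # toy size; 768 in the described model
item_layer = make_linear(3 * DIM, DIM, rng)
outfit_layer = make_linear(DIM, DIM, rng)
items = [tuple([rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(3))
         for _ in range(4)]
outfit_vec = encode_outfit(items, item_layer, outfit_layer)
print(len(outfit_vec))  # 8
```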
[0065] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the described module can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
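As an illustration of the training objective, the sketch below computes a bidirectional in-batch InfoNCE term on two views obtained by adding Gaussian noise to each outfit embedding (claim 5) and combines three loss values with the weights 1.0, 1.0, and 0.3 (claim 6). The noise scale sigma, the temperature, the averaging of the two directions, and all function names are assumptions not given in the text; the Item-to-Outfit InfoNCE and Outfit-level Ranking terms are taken as precomputed numbers because their exact form is not specified in this section.

```python
import math
import random

LOSS_WEIGHTS = (1.0, 1.0, 0.3)  # in-batch, item-to-outfit, ranking (claim 6)


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def noisy_view(vec, sigma, rng):
    """One augmented view: add Gaussian noise, then re-normalize."""
    return normalize([x + rng.gauss(0.0, sigma) for x in vec])


def info_nce(queries, keys, temperature):
    """Mean cross-entropy where queries[i] should match keys[i] within the batch."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def bidirectional_in_batch_info_nce(outfit_embs, rng, sigma=0.01, temperature=0.07):
    """Average of the view_a -> view_b and view_b -> view_a InfoNCE terms."""
    view_a = [noisy_view(e, sigma, rng) for e in outfit_embs]
    view_b = [noisy_view(e, sigma, rng) for e in outfit_embs]
    return 0.5 * (info_nce(view_a, view_b, temperature)
                  + info_nce(view_b, view_a, temperature))


def joint_loss(in_batch, item_to_outfit, ranking, weights=LOSS_WEIGHTS):
    """Weighted sum of the three loss terms."""
    return (weights[0] * in_batch
            + weights[1] * item_to_outfit
            + weights[2] * ranking)


rng = random.Random(0)
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identical = [[1.0, 0.0, 0.0]] * 3
loss_distinct = bidirectional_in_batch_info_nce(distinct, rng)
loss_identical = bidirectional_in_batch_info_nce(identical, rng)
```

With well-separated outfit embeddings the in-batch term is close to zero, while indistinguishable embeddings give a value near log(3) for a batch of three, which is the behavior expected of a contrastive loss.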
[0066] As shown in Figure 7, the device includes a central processing unit (CPU), which can perform various appropriate actions and processes based on computer program instructions stored in read-only memory (ROM) or loaded from storage units into random access memory (RAM). The RAM can also store various programs and data required for device operation. The CPU, ROM, and RAM are interconnected via a bus. Input / output (I / O) interfaces are also connected to the bus.
[0067] Multiple components in the device are connected to the I / O interface, including: input units such as keyboards and mice; output units such as various types of displays and speakers; storage units such as disks and optical discs; and communication units such as network interface cards (NICs), modems, and wireless transceivers. The communication unit allows the device to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0068] The processing unit executes the various methods and processes described above, such as method steps S01 to S04. For example, in some embodiments, method steps S01 to S04 may be implemented as a computer software program tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and / or installed on the device via ROM and / or a communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of method steps S01 to S04 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to execute method steps S01 to S04 by any other suitable means (e.g., by means of firmware).
[0069] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0070] The program code used to implement the methods of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
[0071] In the context of this invention, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0072] Furthermore, although the operations are described in a specific order, this should not be understood as requiring that such operations be performed in the specific order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired result. In certain environments, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the invention. Certain features described in the context of individual embodiments may also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation may also be implemented individually or in any suitable sub-combination in multiple implementations.
[0073] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.
Claims
1. A method for testing based on category perception and multi-modal fashion compatibility model, characterized in that the method includes: Step S01: construct a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors; Step S02: construct a category-aware and multimodal fashion compatibility model, which integrates the visual, textual, and learnable category features of individual items and obtains the final set representation vector through pooling and transformation; Step S03: use the training set to jointly optimize the category-aware and multimodal fashion compatibility model by combining the bidirectional In-batch InfoNCE loss, the Item-to-Outfit InfoNCE loss, and the Outfit-level Ranking loss; Step S04: test and evaluate the category-aware and multimodal fashion compatibility model using the high-difficulty FITB test set, and perform multi-dimensional attribution diagnosis on the mispredicted samples.
2. The method for testing based on category perception and multi-modal fashion compatibility model according to claim 1, characterized in that, before the training set mentioned in step S01 is constructed, the fashion matching dataset is cleaned and effective semantic categories are selected; the selection criterion for an effective semantic category is that it appears at least 10 times in the dataset.
3. The method of claim 1, wherein, The high-difficulty FITB test set mentioned in step S01 consists of several four-choice fill-in-the-blank questions. Each fill-in-the-blank question includes one item selected from the original set that belongs to a specified category as the correct answer, and three other items selected from the same category as hard negative options. The number of questions used to generate the high-difficulty FITB test set is no less than 1,000.
4. The method for testing based on category perception and multi-modal fashion compatibility model according to claim 1, characterized in that, in step S02, the category-aware and multimodal fashion compatibility model extracts and concatenates the CLIP visual features, BERT text features, and learnable category embeddings of each item in the set; the item embedding vector is obtained by fusing the concatenated features through linear projection, GELU activation, and layer normalization; the item embedding vectors are then averaged to obtain the original set representation; finally, the set representation vector is obtained through a projection consisting of linear transformation, GELU, and layer normalization.
5. The method of claim 1, wherein the bidirectional In-batch InfoNCE loss described in step S03 is calculated on two views generated by adding Gaussian noise to the positive example set embedding.
6. The method for testing based on category perception and multi-modal fashion compatibility model according to claim 1, characterized in that, The weights of the bidirectional In-batch InfoNCE loss, Item-to-Outfit InfoNCE loss, and Outfit-level Ranking loss mentioned in step S03 are 1.0, 1.0, and 0.3, respectively.
7. The method of testing based on category perception and multi-modal fashion compatibility model according to claim 1, wherein, The multidimensional attribution diagnosis described in step S04 includes: color coordination preference diagnosis, category hierarchy confusion diagnosis, and composite factor influence diagnosis.
8. An apparatus for testing based on category perception and multi-modal fashion compatibility model, characterized in that the device implements the method as described in any one of claims 1 to 7, comprising: Dataset building module: used to build a training set containing complete fashion outfits and a high-difficulty FITB test set, where each question of the benchmark uses hard negative samples from the same category as distractors; Model building module: used to build a category-aware and multimodal fashion compatibility model, which integrates the visual, textual, and learnable category features of individual items and obtains the final set representation vector through pooling and transformation; Model training module: used to jointly optimize the category-aware and multimodal fashion compatibility model on the training set by combining the bidirectional In-batch InfoNCE loss, the Item-to-Outfit InfoNCE loss, and the Outfit-level Ranking loss; Model testing module: used to test and evaluate the category-aware and multimodal fashion compatibility model on the high-difficulty FITB test set, and to perform multi-dimensional attribution diagnosis on mispredicted samples.
9. An electronic device comprising a memory and a processor, said memory having stored thereon a computer program, characterized in that, When the processor executes the program, it implements the method as described in any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that, When the program is executed by the processor, it implements the method as described in any one of claims 1 to 7.