A Smart Search Method for Clothing Images

Feature vectors are extracted from clothing images and text using a contrastive learning pre-trained model, multi-dimensional attribute vectors are constructed for two-dimensional classification, and the resulting similarities are fused by weighting. This addresses the insufficient accuracy and flexibility of existing clothing image search and achieves more accurate, comprehensive and flexible clothing image retrieval.

CN122309786A · Pending · Publication Date: 2026-06-30 · GUANGZHOU CHUNXIAO INFORMATION TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU CHUNXIAO INFORMATION TECH
Filing Date
2026-04-17
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing methods for searching clothing images have low accuracy and comprehensiveness, and lack flexibility and effectiveness, making it difficult to fully and accurately express the complex visual features of clothing and user needs.

Method used

A contrastive learning pre-trained model is used to extract semantic and textual feature vectors from clothing images, construct multi-dimensional attribute vectors, perform two-dimensional classification by combining predefined attribute dimensions, calculate similarity and perform weighted fusion, and generate category label images for retrieval.

Benefits of technology

It improves the accuracy and comprehensiveness of search results, enhances the flexibility and effectiveness of the search, and is able to more accurately understand user needs and filter out matching photos from a massive amount of images.



Abstract

This invention discloses an intelligent image search method for clothing. It extracts semantic feature vectors from clothing images using an image encoder based on a contrastive learning pre-trained model, mining high-level semantic information. Simultaneously, it constructs text descriptions based on predefined attribute dimensions and extracts text feature vectors. Combining these two methods creates a multi-dimensional attribute vector, integrating image and text information to improve search result accuracy. Secondly, it obtains category tag images based on two-dimensional classification using the multi-dimensional attribute vectors, calculates similarity with candidate photos, and weights and fuses them, fully considering the multi-dimensional features of clothing, accurately measuring similarity, and improving the comprehensiveness of results. Furthermore, by extracting semantic features from images and combining them with text descriptions, users can upload images for searching, overcoming the limitations of text descriptions and enhancing flexibility. Additionally, the construction of multi-dimensional attribute vectors and precise similarity calculations enable a more accurate understanding of user intent, filtering photos that meet requirements from a massive amount of images, thus improving search effectiveness.

Description

Technical Field

[0001] This invention relates to the field of clothing image search technology, and in particular to an intelligent method for searching clothing images.

Background Technology

[0002] In the garment production process, from obtaining design inspiration and selecting fabrics to displaying and selling finished products, there are multiple stages that involve the efficient management and rapid retrieval of a large number of garment images.

[0003] Traditional clothing image search methods primarily rely on text keyword searches. Users need to input specific textual descriptions such as clothing type, color, and style, and the system then matches this text against the database and retrieves the results. However, this method has two limitations. First, textual descriptions often fail to fully and accurately express the complex characteristics of clothing. For example, visual features such as texture, pattern details, and overall style are difficult to describe precisely in words, and different users may describe the same clothing feature differently, which affects the accuracy and comprehensiveness of search results. Second, users may not be clear about the specific clothing features they want when searching, or they may find it difficult to express them with accurate vocabulary, which limits the flexibility and effectiveness of the search.

Summary of the Invention

[0004] In view of this, the present invention proposes an intelligent search method for clothing images, which can effectively solve the shortcomings of existing technologies, such as low accuracy and comprehensiveness of search results, as well as low flexibility and effectiveness of search.

[0005] The technical solution of this invention is implemented as follows:

[0006] A method for intelligent search of clothing images, specifically including:

[0007] Obtain the clothing image to be searched;

[0008] A contrastive learning pre-trained model is used to extract semantic feature vectors from the clothing images to be retrieved;

[0009] Based on predefined attribute dimensions, a text description of the clothing image to be retrieved is constructed, and a text encoder of a contrastive learning pre-trained model is used to extract the text feature vector of the text description.

[0010] Based on semantic feature vectors and text feature vectors, a multidimensional attribute vector is constructed for the clothing image to be retrieved;

[0011] Two-dimensional classification is performed based on the multi-dimensional attribute vector of the clothing image to be retrieved to obtain the category label image of the clothing image to be retrieved;

[0012] Calculate the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and then weight and fuse the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved.

[0013] Based on the similarity of the clothing images to be retrieved, the most matching photos are selected from the candidate photos in the database as the retrieval results.
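The final selection step can be sketched as a top-k ranking over the fused similarity scores. This is a minimal illustration only: the function name `top_k_matches`, the photo identifiers, and the value of `k` are assumptions, since the text says only that the most matching photos are selected.

```python
import heapq

def top_k_matches(similarities, k=3):
    """Return the k candidate ids with the highest fused similarity.

    `similarities` maps a candidate photo id to its final similarity score.
    The cut-off k is an assumption; the source does not fix a number.
    """
    return heapq.nlargest(k, similarities, key=similarities.get)

# Hypothetical scores for four candidate photos.
scores = {"photo_a": 0.82, "photo_b": 0.35, "photo_c": 0.91, "photo_d": 0.60}
print(top_k_matches(scores, k=2))  # ['photo_c', 'photo_a']
```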

[0014] As a further optional solution to the intelligent clothing image search method, the step of extracting the text feature vector of the clothing image to be retrieved using a text encoder based on a contrastive learning pre-trained model and predefined attribute dimensions specifically includes:

[0015] Predefined attribute dimensions for describing clothing, including type, color, sleeve length, collar type, pattern, style, season, material, and fit;

[0016] For each predefined attribute dimension, a corresponding text description is constructed using natural language, combining the features presented by the clothing image to be retrieved in that attribute dimension.

[0017] The text encoder, which employs a contrastive learning pre-trained model, takes the text descriptions corresponding to each attribute dimension as input and performs feature extraction on each text description to obtain the text feature vector corresponding to each text description.

[0018] As a further optional solution to the intelligent search method for clothing images, the step of constructing a multi-dimensional attribute vector of the clothing image to be retrieved based on semantic feature vectors and text feature vectors specifically includes:

[0019] For each predefined attribute dimension, the cosine similarity between the semantic feature vector of the clothing image to be retrieved and the text feature vector under that attribute dimension is calculated, and the calculated cosine similarity is normalized to obtain the confidence probability distribution of each candidate value under each attribute dimension.

[0020] For each attribute dimension, retain the attribute identification results that are higher than the preset confidence threshold in the confidence probability distribution;

[0021] The attribute recognition results under each retained attribute dimension are combined to form a multidimensional attribute vector of the clothing image to be retrieved.

[0022] As a further optional solution to the intelligent search method for clothing images, the step of performing two-dimensional classification based on the multi-dimensional attribute vector of the clothing image to be retrieved to obtain the category label image of the clothing image to be retrieved specifically includes:

[0023] Define a two-dimensional classification system that includes both location and hierarchical dimensions;

[0024] Based on the attribute recognition results in the multidimensional attribute vector of the clothing image to be retrieved, and combined with the defined two-dimensional classification system, the category to which the clothing image to be retrieved belongs in the position dimension and the hierarchical dimension is determined.

[0025] Based on the judgment result of two-dimensional classification, a category label image of the clothing image to be retrieved is generated. The category label image is used to mark the classification information of the clothing image to be retrieved in the position dimension and the hierarchical dimension.

[0026] As a further optional solution to the intelligent search method for clothing images, the step of calculating the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and then weighting and fusing the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved, specifically includes:

[0027] For each category label image, a preset image similarity calculation strategy is used to calculate the similarity between the category label image and the candidate photos in the database.

[0028] Determine the weight assigned to the similarity of each category label image;

[0029] The similarity between each category label image and the candidate photos in the database is multiplied by the corresponding weight and then summed to obtain the final similarity between the clothing image to be retrieved and the candidate photos in the database.
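The weighted fusion described in these steps amounts to a weighted sum of per-label similarities. The sketch below shows that arithmetic for a single candidate photo; the function name, the label names, and all numeric values are illustrative assumptions rather than values fixed by the method.

```python
def fuse_similarities(component_scores, weights):
    """Weighted sum of per-label similarities for one candidate photo."""
    if set(component_scores) != set(weights):
        raise ValueError("scores and weights must cover the same labels")
    return sum(weights[name] * component_scores[name] for name in weights)

# Hypothetical per-label similarities and weights for one candidate photo.
scores = {"position": 0.9, "layer": 0.5}
weights = {"position": 0.6, "layer": 0.4}
final = fuse_similarities(scores, weights)
print(round(final, 2))  # 0.74
```

Whether the weights must sum to 1 is not stated in the text; keeping them normalized keeps the fused score in the same range as its components.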

[0030] As a further optional solution to the aforementioned intelligent search method for clothing images, the preset image similarity calculation strategy is a multi-dimensional similarity calculation method that integrates image color histogram, texture features, shape features, and semantic features. The specific steps are as follows:

[0031] Extract the color histogram features of the category label image and the candidate photo, and calculate the chi-square distance of the color histogram features of the two as the color similarity component;

[0032] The local binary pattern (LBP) algorithm is used to extract the texture features of the category label image and the candidate photo, and the Euclidean distance between the texture features of the two is calculated as the texture similarity component.

[0033] The shape features of category label images and candidate photos are extracted using edge detection algorithms, and the Hausdorff distance between the two shape features is calculated as the shape similarity component.

[0034] Using the contrastive learning pre-trained model, semantic feature vectors of the category label image and the candidate photo are extracted respectively, and the cosine similarity between the two semantic feature vectors is calculated as the semantic similarity component.
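The four measures named in the steps above can be written out in a few lines. The sketch below assumes the features have already been extracted (histograms, texture vectors, edge point sets, and semantic vectors are plain Python sequences here), and it uses the symmetric chi-square convention with a factor of 0.5, which is one of several conventions in use.

```python
import math

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def euclidean_distance(u, v):
    """Euclidean distance between two texture feature vectors."""
    return math.dist(u, v)

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero semantic feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(euclidean_distance([0, 0], [3, 4]))                        # 5.0
print(hausdorff_distance([(0, 0), (1, 0)], [(0, 0), (4, 0)]))    # 3.0
```

Three of these are distances (smaller means more alike) while cosine similarity grows with likeness. The text does not say how the distances are converted to similarity components before fusion, so any such mapping would be a further assumption.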

[0035] As a further optional solution to the intelligent search method for clothing images, the method also includes:

[0036] The most matching candidate photos selected are then subjected to result optimization processing, which includes deduplication and diversity enhancement processing. Deduplication is used to remove duplicate photos, and diversity enhancement processing is used to enhance the diversity of the search results.
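The text does not specify how deduplication or diversity enhancement is carried out. As one possible reading, the sketch below removes duplicates by a caller-supplied fingerprint and then re-ranks with greedy maximal marginal relevance (MMR). MMR is a technique swapped in here for illustration, and `fingerprint`, `pairwise_sim`, and `trade_off` are hypothetical parameters.

```python
def deduplicate(ranked, fingerprint):
    """Keep the first (highest-ranked) photo for each fingerprint value."""
    seen, unique = set(), []
    for photo in ranked:
        key = fingerprint(photo)
        if key not in seen:
            seen.add(key)
            unique.append(photo)
    return unique

def diversify(candidates, relevance, pairwise_sim, k, trade_off=0.7):
    """Greedy MMR: trade relevance against similarity to already-picked photos."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(c):
            redundancy = max((pairwise_sim(c, s) for s in selected), default=0.0)
            return trade_off * relevance[c] - (1 - trade_off) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

# Hypothetical data: photos sharing a first letter count as near-identical.
relevance = {"a1": 0.9, "a2": 0.85, "b1": 0.7}
same_group = lambda x, y: 1.0 if x[0] == y[0] else 0.0
print(diversify(["a1", "a2", "b1"], relevance, same_group, k=2))  # ['a1', 'b1']
```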

[0037] A smart image search system for clothing includes:

[0038] The image acquisition module is used to acquire images of clothing to be retrieved.

[0039] The semantic feature extraction module is used to extract the semantic feature vector of the clothing image to be retrieved using an image encoder that employs a contrastive learning pre-trained model.

[0040] The text feature extraction module is used to construct a text description of the clothing image to be retrieved based on predefined attribute dimensions, and to extract the text feature vector of the text description using a text encoder of a contrastive learning pre-trained model.

[0041] The multidimensional attribute construction module is used to construct a multidimensional attribute vector of the clothing image to be retrieved based on the semantic feature vector and the text feature vector;

[0042] The category tag acquisition module is used to perform two-dimensional classification based on the multi-dimensional attribute vector of the clothing image to be retrieved, and obtain the category tag image of the clothing image to be retrieved;

[0043] The similarity calculation module is used to calculate the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and to perform weighted fusion of the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved.

[0044] The search results filtering module is used to select the most matching photos from the candidate photos in the database as search results based on the similarity of the clothing images to be searched.

[0045] A computing device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of any of the above-described intelligent clothing image search methods.

[0046] A computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described intelligent clothing image search methods.

[0047] The beneficial effects of this invention are as follows: By employing an image encoder with a contrastive learning pre-trained model to extract semantic feature vectors from the clothing images to be retrieved, the high-level semantic information of the clothing in the images can be deeply mined. Simultaneously, text descriptions are constructed based on predefined attribute dimensions, and text feature vectors are extracted, characterizing the clothing features from a textual perspective. Combining semantic and text feature vectors to construct multi-dimensional attribute vectors comprehensively integrates information from both image and text aspects, making the description of clothing features more complete and accurate, effectively improving the accuracy of search results. Secondly, two-dimensional classification based on multi-dimensional attribute vectors yields category label images. Similarity between these category label images and candidate photos in the database is calculated and weighted fusion is performed. This method fully considers the characteristics of clothing across different attribute dimensions. Avoiding the limitations of single features or simple feature combinations, the similarity calculation and weighted fusion of images with different category tags can more accurately measure the similarity between the clothing image to be retrieved and the candidate photos. This allows the search results to more comprehensively cover various situations similar to the clothing to be retrieved, improving the comprehensiveness of the search results. In addition, by extracting semantic features from the images themselves and combining them with text descriptions based on predefined attribute dimensions, even if users cannot accurately describe the characteristics of the clothing, they can still search by uploading images, overcoming the limitations of text descriptions and greatly enhancing the flexibility of the search. 
Furthermore, by constructing multi-dimensional attribute vectors and accurately calculating similarity, it is possible to more accurately understand the user's search intent and filter out the photos that best meet the user's needs from a massive number of clothing images, improving the effectiveness of the search.

Attached Figure Description

[0048] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0049] Figure 1 is a flowchart of an intelligent clothing image search method according to the present invention.

[0050] Figure 2 is a schematic diagram of the composition of an intelligent clothing image search system according to the present invention.

[0051] Figure 3 is a schematic diagram of the composition of a computing device according to the present invention.

Detailed Implementation

[0052] The technical solutions in the embodiments of the present invention will be clearly and completely described below. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0053] Referring to Figures 1 to 3, a method for intelligent search of clothing images specifically includes:

[0054] Retrieve the clothing image to be searched.

[0055] The CLIP (Contrastive Language-Image Pre-training) model is used as the feature extraction backbone. After the clothing image to be retrieved is standardized, its high-dimensional semantic feature vector is extracted by the image encoder of the CLIP model.

[0056] Specifically, the acquired clothing image to be retrieved is first standardized: its pixel values are normalized to the input range expected by the CLIP model. For example, the pixel values are converted from the common 0-255 range using the specific mean and standard deviation required by the CLIP input specification. Next, the standardized image is resized to the input size specified by the CLIP model, such as 224×224 pixels, so that it can be fed into the model. Finally, the preprocessed image is input into the image encoder of the pre-trained CLIP model. The image encoder typically employs a Transformer architecture or an improved convolutional neural network structure and performs multi-layer feature extraction and transformation on the input image. During this process, the model analyzes information such as color distribution, shape contours, and texture details, and gradually transforms these low-level visual features into high-level semantic features. For example, for a beige trench coat, the model can identify that the garment belongs to the outerwear category and capture high-level semantic information corresponding to features such as being long and having a belt. The image encoder then outputs a high-dimensional semantic feature vector (assumed here to have 512 dimensions) that contains rich semantic information about the clothing in the image.
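The standardization step can be illustrated with a small, dependency-free sketch. The mean and standard deviation below are the per-channel constants published with OpenAI's CLIP preprocessing; the nearest-neighbour resize is a simplification chosen for brevity (the reference preprocessing uses bicubic resizing and a center crop), and all function names are hypothetical.

```python
# Per-channel statistics from OpenAI's CLIP preprocessing (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def resize_nearest(image, size=224):
    """Nearest-neighbour resize of an H x W grid of (r, g, b) tuples to size x size."""
    h, w = len(image), len(image[0])
    return [[image[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def normalize_pixel(rgb):
    """Map 0-255 channel values to (value / 255 - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

def preprocess(image, size=224):
    """Resize first, then normalize every pixel."""
    return [[normalize_pixel(px) for px in row] for row in resize_nearest(image, size)]

# A 2 x 2 toy image standing in for an uploaded clothing photo.
tiny = [[(245, 245, 220), (0, 0, 0)], [(255, 255, 255), (128, 128, 128)]]
tensor = preprocess(tiny)
print(len(tensor), len(tensor[0]))  # 224 224
```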

[0057] Thus, because the CLIP model, as a contrastive learning pre-trained model, is fully trained on large-scale image-text pairs, it can learn general visual and linguistic representations. When extracting features from clothing images, it can quickly and accurately capture key features of clothing, such as color, style, and material, and transform them into highly expressive high-dimensional semantic feature vectors. Compared with some traditional feature extraction methods, this greatly improves the efficiency and accuracy of feature extraction. Secondly, because the CLIP model integrates image and text information during training, its image encoder has a certain semantic understanding capability. For clothing images, it can not only recognize intuitive visual features but also understand some abstract semantic concepts, such as clothing style (fashion, retro, etc.) and occasion suitability (formal, casual, etc.). This helps to more accurately match user needs and find semantically similar clothing images in the subsequent retrieval process.

[0058] Based on predefined attribute dimensions, a text description of the clothing image to be retrieved is constructed, and a text encoder of a contrastive learning pre-trained model is used to extract the text feature vector of the text description, specifically including:

[0059] Predefined attribute dimensions for describing clothing, including type, color, sleeve length, collar type, pattern, style, season, material, and fit;

[0060] For each predefined attribute dimension, a corresponding text description is constructed using natural language, combining the features presented by the clothing image to be retrieved in that attribute dimension.

[0061] The text encoder, which employs a contrastive learning pre-trained model, takes the text descriptions corresponding to each attribute dimension as input and performs feature extraction on each text description to obtain the text feature vector corresponding to each text description.

[0062] Specifically, predefined attribute dimensions are used to describe the clothing, including type, color, sleeve length, collar type, pattern, style, season, material, and fit. These attribute dimensions cover multiple key aspects of clothing and can comprehensively describe the characteristics of a garment. For each predefined attribute dimension, combined with the features presented by the uploaded beige long trench coat image in that attribute dimension, a corresponding text description is constructed using natural language: Type: Long trench coat; Color: Beige; Sleeve length: Long sleeve; Collar type: Turn-down collar; Pattern: No pattern; Style: Fashionable, sophisticated; Season: Spring and autumn; Material: Crisp material; Fit: Slim fit. In summary, the complete text description is: "A beige long trench coat, long sleeves, turn-down collar, no pattern, fashionable and sophisticated style, suitable for spring and autumn wear, crisp material, slim fit."
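Building one description per attribute dimension can be sketched as simple template filling. The dimension list and the trench coat values come from the example above; the sentence template and the function name `build_descriptions` are assumptions, since the text does not prescribe a wording.

```python
ATTRIBUTE_DIMENSIONS = ["type", "color", "sleeve length", "collar type",
                        "pattern", "style", "season", "material", "fit"]

def build_descriptions(attributes):
    """Return one short natural-language description per attribute dimension."""
    return {dim: f"a garment whose {dim} is {attributes[dim]}"
            for dim in ATTRIBUTE_DIMENSIONS if dim in attributes}

trench_coat = {
    "type": "long trench coat", "color": "beige", "sleeve length": "long sleeve",
    "collar type": "turn-down collar", "pattern": "no pattern",
    "style": "fashionable, sophisticated", "season": "spring and autumn",
    "material": "crisp material", "fit": "slim fit",
}
for text in build_descriptions(trench_coat).values():
    print(text)
```

Each resulting string would then be passed to the text encoder to obtain one text feature vector per dimension.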

[0063] Using a text encoder derived from the CLIP model (which includes both an image encoder and a text encoder), the constructed text description is taken as input. The text encoder performs a series of operations on the input text, such as word vector encoding and position encoding, and then performs feature extraction and transformation through a multi-layer Transformer structure. Finally, a text feature vector is output, assuming that the dimension is also 512. This vector can well represent the semantic information in the text description.

[0064] Thus, by constructing text descriptions based on predefined attribute dimensions and extracting text feature vectors, which are then combined with previously extracted image semantic feature vectors, the feature representation of the clothing to be retrieved is further enriched. Image feature vectors primarily capture the characteristics of clothing from a visual perspective, while text feature vectors characterize clothing from a semantic perspective. The two complement each other, making the description of clothing more comprehensive and three-dimensional. Secondly, since a text encoder of the same origin as the CLIP image encoder is used, the extracted text feature vectors and the previously extracted image semantic feature vectors are in the same semantic space. This enhances the semantic correlation between images and text, enabling more accurate calculation of the similarity between the clothing image to be retrieved and clothing images in the database during subsequent retrieval processes, thereby improving retrieval accuracy. Furthermore, the method of constructing text descriptions based on predefined attribute dimensions offers a degree of flexibility. Users can focus on different attribute dimensions according to their needs and can also adjust the predefined attribute dimensions according to different application scenarios. For example, in the summer clothing retrieval scenario, attribute dimensions such as "breathability" can be added to make the retrieval more closely aligned with the user's specific needs.

[0065] Based on semantic feature vectors and text feature vectors, a multi-dimensional attribute vector is constructed for the clothing image to be retrieved, specifically including:

[0066] For each predefined attribute dimension, the cosine similarity between the semantic feature vector of the clothing image to be retrieved and the text feature vector under that attribute dimension is calculated, and the calculated cosine similarity is normalized to obtain the confidence probability distribution of each candidate value under each attribute dimension.

[0067] For each attribute dimension, retain the attribute identification results that are higher than the preset confidence threshold in the confidence probability distribution;

[0068] The attribute recognition results under each retained attribute dimension are combined to form a multidimensional attribute vector of the clothing image to be retrieved.

[0069] Specifically, for each predefined attribute dimension, such as type, color, and sleeve length, the cosine similarity between the semantic feature vector and the text feature vector under that attribute dimension is calculated. Taking the color attribute dimension as an example, the cosine similarity between the semantic feature vector and the text feature vector corresponding to "beige" is calculated to obtain a similarity value. The same calculation is performed for all predefined attribute dimensions. The cosine similarities calculated for each attribute dimension are then normalized. For example, for the color attribute dimension, the cosine similarities between several candidate color values and the semantic feature vector are calculated and normalized to the [0,1] interval to obtain the confidence probability distribution of each candidate value under the color attribute dimension. Other attribute dimensions are processed in the same way. For each attribute dimension, a confidence threshold is preset, such as 0.6. In the confidence probability distribution of the color attribute dimension, if the confidence of "beige" is 0.8, which is higher than the preset threshold of 0.6, the attribute recognition result "Color: Beige" is retained. For other attribute dimensions, such as sleeve length, if the calculated confidence of "long sleeve" is 0.7, which is higher than the threshold, the recognition result "Sleeve length: Long sleeve" is also retained. In this way, all predefined attribute dimensions are filtered, and the attribute recognition results that meet the condition are retained. The retained attribute recognition results under each attribute dimension are then combined. For example, the retained results are "Type: Long trench coat", "Color: Beige", "Sleeve length: Long sleeve", "Collar type: Turn-down collar", "Pattern: No pattern", "Style: Fashionable and sophisticated", "Season: Spring and autumn", "Material: Crisp material", and "Fit: Slim fit". These results are combined in a fixed order to form a multi-dimensional attribute vector, which comprehensively and accurately describes the features of the beige long trench coat to be retrieved.
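The per-dimension recognition step can be sketched as cosine similarity followed by normalization and thresholding. The text says only that the similarities are normalized to [0,1]; the softmax with a `scale` factor used below is an assumption, as are the function names and the toy 3-dimensional vectors that stand in for 512-dimensional features.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def confidence_distribution(image_vec, candidate_vecs, scale=10.0):
    """Softmax over scaled cosine similarities; returns {candidate: probability}."""
    logits = {name: scale * cosine_similarity(image_vec, vec)
              for name, vec in candidate_vecs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - peak) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def recognize(image_vec, candidate_vecs, threshold=0.6):
    """Return the top candidate if its confidence exceeds the threshold, else None."""
    probs = confidence_distribution(image_vec, candidate_vecs)
    best = max(probs, key=probs.get)
    return best if probs[best] > threshold else None

# Hypothetical image vector and candidate text vectors for the color dimension.
image_vec = [1.0, 0.0, 0.0]
colors = {"beige": [0.9, 0.1, 0.0], "khaki": [0.6, 0.8, 0.0], "black": [0.0, 0.0, 1.0]}
print(recognize(image_vec, colors))  # beige
```

Running `recognize` once per attribute dimension and collecting the non-`None` results in a fixed order gives the multi-dimensional attribute vector described above.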

[0070] Thus, by calculating the cosine similarity between semantic feature vectors and text feature vectors across various attribute dimensions, and then normalizing and thresholding, the characteristics of clothing across each predefined attribute dimension can be accurately identified. This avoids errors that may arise from single features or simple judgments, improving the accuracy of attribute recognition. For example, in color recognition, it can accurately determine that a trench coat is beige, rather than other similar colors. Secondly, the attribute recognition results from each retained attribute dimension are combined into a multi-dimensional attribute vector, providing a comprehensive feature description of the clothing from multiple aspects. This makes the representation of clothing richer and more complete, providing a more reliable foundation for subsequent similarity calculations and retrieval. Compared to relying solely on image features or single text features, it can more accurately find other clothing similar to the clothing to be retrieved. Furthermore, the multi-dimensional attribute vector constructed based on accurate attribute recognition and comprehensive feature description can more stably match clothing information in the database during clothing retrieval, ensuring the reliability and stability of retrieval results even when faced with a large number of clothing images of different styles and designs.

[0071] Two-dimensional classification is performed based on the multi-dimensional attribute vector of the clothing image to be retrieved, resulting in a category label image for the clothing image to be retrieved, specifically including:

[0072] Define a two-dimensional classification system that includes both location and hierarchical dimensions;

[0073] Based on the attribute recognition results in the multidimensional attribute vector of the clothing image to be retrieved, and combined with the defined two-dimensional classification system, the category to which the clothing image to be retrieved belongs in the position dimension and the hierarchical dimension is determined.

[0074] Based on the judgment result of two-dimensional classification, a category label image of the clothing image to be retrieved is generated. The category label image is used to mark the classification information of the clothing image to be retrieved in the position dimension and the hierarchical dimension.

[0075] Specifically, a two-dimensional classification system is defined: the position dimension is defined as "upper body clothing", "lower body clothing", and "full body clothing (bodysuits, etc.)"; the hierarchical dimension is defined as "outerwear", "middle layer clothing", and "underwear". This two-dimensional classification system can clearly cover the wearing position and hierarchical information of various types of clothing.

[0076] Category determination: Based on the attribute recognition results in the multi-dimensional attribute vector of the beige long trench coat, combined with the defined two-dimensional classification system, the trench coat is classified as upper body clothing from the perspective of wearing position (although the length may cover part of the legs, the main wearing position is on the upper body); from the perspective of hierarchy, it is classified as outerwear. Therefore, this beige long trench coat belongs to "upper body clothing" in the position dimension and "outerwear" in the hierarchy dimension.
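The category determination can be sketched as a lookup from the recognized garment type to a (position, layer) pair. The table below is hypothetical: only the trench coat entry follows the example in the text, and a real system would need a rule set or classifier covering all garment types.

```python
# Hypothetical lookup table: garment type -> (position dimension, layer dimension).
TYPE_TO_CATEGORY = {
    "long trench coat": ("upper body clothing", "outerwear"),
    "sweater": ("upper body clothing", "middle layer clothing"),
    "briefs": ("lower body clothing", "underwear"),
}

def classify(attribute_vector):
    """Return (position, layer) for the recognized type, or None if unknown."""
    garment_type = attribute_vector.get("type", "").lower()
    return TYPE_TO_CATEGORY.get(garment_type)

print(classify({"type": "Long trench coat", "color": "Beige"}))
# ('upper body clothing', 'outerwear')
```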

[0077] Generate category label images: Based on the judgment results of the above two-dimensional classification, generate corresponding category label images. These can be represented in a simple graphical way, such as using a specific icon to represent "upper body clothing" and another icon to represent "outerwear". Combine the two icons to form the category label image of the beige long trench coat. This image intuitively marks the classification information of the garment in the positional and hierarchical dimensions.

[0078] Thus, by defining a clear two-dimensional classification system and generating category label images, clear and intuitive classification identifiers are provided for the clothing to be searched. Users can understand the wearing position and hierarchical attributes of the clothing at a glance, making it easier to filter and compare among many search results. For example, if a user only wants to find outerwear, they can quickly identify clothing with the "outerwear" hierarchical dimension identifier. Secondly, category label images help improve the targeting of the search. When performing similarity calculations and searches in the database, based on the classification information of the category label images, matching can be prioritized among clothing in the same or similar categories, reducing unnecessary calculations and improving search efficiency. For example, when searching for outerwear, clothing in irrelevant categories such as underwear will not be included in the main similarity calculation scope.

[0079] The similarity between the category tag image of the clothing image to be retrieved and the candidate photos in the database is calculated, and the similarity of each category tag image is weighted and fused to obtain the similarity of the clothing image to be retrieved. Specifically, this includes:

[0080] For each category label image, a preset image similarity calculation strategy is used to calculate the similarity between the category label image and the candidate photos in the database.

[0081] Determine the weight corresponding to the similarity of each category label image;

[0082] The similarity between each category label image and the candidate photos in the database is multiplied by its corresponding weight and then summed to obtain the final similarity between the clothing image to be retrieved and the candidate photos in the database.

[0083] Specifically, for the category label image of the trench coat, cosine similarity is used as the image similarity calculation strategy. In the database, a corresponding category label image (also covering the position and hierarchical dimensions) is generated for each candidate photo according to the method described above. The cosine similarity is calculated between the category label image features of the beige long trench coat (the feature representation can be obtained through the encoding method mentioned above) and the category label image features of each candidate photo in the database. For example, comparison with the category label image of a candidate photo of a black long coat, which is also worn on the upper body, yields a similarity value of 0.6. Comparison with the category label image of a candidate photo of trousers, which are worn on the lower body, may yield a lower value, such as 0.2, because the position dimension does not match. The weights of the position and hierarchical dimensions in the similarity calculation are determined based on the actual application and an analysis of user needs. Here the weight of the position dimension is assumed to be 0.6 and the weight of the hierarchical dimension 0.4, because matching the wearing position is often more crucial in clothing retrieval. The similarity between each category label image and the candidate photos in the database is multiplied by the corresponding weight and then summed.
For the beige long trench coat and the candidate photo of the black long coat mentioned above, assume that the similarity in the hierarchical dimension is 0.7 and that the value of 0.6 calculated earlier is the similarity in the position dimension (that is, the position component of the overall category label image similarity). Then the final similarity = 0.6 (position dimension similarity) × 0.6 (position dimension weight) + 0.7 (hierarchical dimension similarity) × 0.4 (hierarchical dimension weight) = 0.36 + 0.28 = 0.64. The final similarity between the beige long trench coat and every candidate photo in the database is calculated in the same way.
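The weighted fusion in this worked example can be reproduced with a few lines of Python. The function name `fuse_dimensions` and the dictionary keys are illustrative assumptions; the similarity values and weights are the ones given in the example.

```python
import math


def fuse_dimensions(similarities: dict, weights: dict) -> float:
    """Weighted sum of per-dimension similarities; the weights must sum to 1."""
    if set(similarities) != set(weights):
        raise ValueError("similarities and weights must cover the same dimensions")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(similarities[name] * weights[name] for name in similarities)


# Values from the trench coat vs. black long coat example.
weights = {"position": 0.6, "hierarchy": 0.4}
similarities = {"position": 0.6, "hierarchy": 0.7}

final_similarity = fuse_dimensions(similarities, weights)
print(round(final_similarity, 2))  # 0.64
```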

[0084] Thus, by using a preset image similarity calculation strategy to calculate the similarity between category label images and candidate photos, and by choosing appropriate weights for fusion, the similarity between the clothing image to be retrieved and the candidate photos in the database can be measured more accurately. The calculation takes into account both the wearing position and the layering of the clothing, avoiding the limitations of single-dimensional judgment and bringing the similarity results closer to actual needs. In addition, because the weights are determined from practical applications and user needs analysis, the more important dimensions are emphasized. In this example, the position dimension receives the higher weight, so matching the wearing position counts for more during retrieval, which improves the accuracy of the retrieval results and returns clothing images that better meet the user's expectations.

[0085] In some embodiments, the preset image similarity calculation strategy is a multi-dimensional similarity calculation method that integrates image color histogram, texture features, shape features, and semantic features. The specific steps are as follows:

[0086] Extract the color histogram features of the category label image and the candidate photo, and calculate the chi-square distance of the color histogram features of the two as the color similarity component;

[0087] The local binary pattern (LBP) algorithm is used to extract the texture features of the category label image and the candidate photo, and the Euclidean distance between the texture features of the two is calculated as the texture similarity component.

[0088] The shape features of category label images and candidate photos are extracted using edge detection algorithms, and the Hausdorff distance between the two shape features is calculated as the shape similarity component.

[0089] Using the contrastive learning pre-trained model, semantic feature vectors of the category label image and the candidate photo are extracted respectively, and the cosine similarity between the two semantic feature vectors is calculated as the semantic similarity component.

[0090] Specifically, the color histogram features of the beige long trench coat category tag image and each candidate photo in the database are extracted. For the trench coat image, the pixel distribution of different color intervals is statistically analyzed to form a color histogram. The same operation is performed on the candidate photos. The chi-square distance between the color histogram features of the two images is calculated as the color similarity component. For example, if the trench coat color histogram has a high proportion of pixels in the beige interval, and a candidate photo's color histogram also has a high proportion in the beige interval, its chi-square distance is small, and its color similarity component value is high; conversely, it is low.
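A minimal sketch of the color component, assuming images are available as lists of (r, g, b) tuples with values 0 to 255. The bin count, the function names, and the symmetric form of the chi-square distance (with a factor of 0.5) are assumptions; the patent names the distance but not its exact variant.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of an iterable of (r, g, b) tuples."""
    hist = [0.0] * bins_per_channel ** 3
    count = 0
    for r, g, b in pixels:
        # Quantize each 0..255 channel value into one of bins_per_channel bins.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return [h / count for h in hist]


def chi_square_distance(h1, h2):
    """Symmetric chi-square distance; 0 for identical, 1 for disjoint normalized histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


# Toy images: two beige-like garments and one black garment.
beige = [(245, 245, 220)] * 10
near_beige = [(240, 235, 215)] * 10
black = [(10, 10, 10)] * 10

print(chi_square_distance(color_histogram(beige), color_histogram(near_beige)))  # 0.0
print(chi_square_distance(color_histogram(beige), color_histogram(black)))       # 1.0
```

The small distance for the two beige images corresponds to a high color similarity component, and the large distance for the black image to a low one.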

[0091] The Local Binary Pattern (LBP) algorithm is used to extract the texture features of the trench coat category label image and the candidate photo. The LBP algorithm generates binary codes to describe local texture features by comparing the gray values of a pixel with those of its neighboring pixels. The Euclidean distance between the texture features of the two is calculated as the texture similarity component. If the binary code distribution of the textures of the trench coat and a candidate photo is similar in the local area, their Euclidean distance is small and the texture similarity component value is large.
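The texture component can be sketched with the basic 3×3 LBP operator in plain Python. Representing the texture feature as a normalized 256-bin histogram of LBP codes is an assumption (the patent does not say how the codes are aggregated), as are the function names and the toy images.

```python
import math

# Offsets (dy, dx) of the 8 neighbours, clockwise from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_histogram(gray):
    """Normalized 256-bin histogram of 3x3 LBP codes over the interior pixels."""
    rows, cols = len(gray), len(gray[0])
    if rows < 3 or cols < 3:
        raise ValueError("image must be at least 3x3")
    hist = [0.0] * 256
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(NEIGHBOURS):
                # A neighbour at least as bright as the centre sets its bit.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            hist[code] += 1.0
    total = (rows - 2) * (cols - 2)
    return [h / total for h in hist]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 5x5 grayscale images: a flat fabric and a vertically striped fabric.
flat = [[100] * 5 for _ in range(5)]
striped = [[0 if x % 2 == 0 else 255 for x in range(5)] for _ in range(5)]

print(euclidean_distance(lbp_histogram(flat), lbp_histogram(flat)))  # 0.0
print(euclidean_distance(lbp_histogram(flat), lbp_histogram(striped)) > 0)  # True
```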

[0092] Edge detection algorithms (such as the Canny algorithm) are used to extract the shape features of trench coat category label images and candidate photos. The Canny algorithm can detect edge information in the image and outline the shape of the garment. The Hausdorff distance between the two shape features is calculated as the shape similarity component. If the shape features of the trench coat, such as its length and slim fit, are similar to the shape features of a candidate photo, the Hausdorff distance is small and the shape similarity component value is high.
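A minimal sketch of the shape component, assuming edge detection has already produced each garment outline as a list of (x, y) points. The edge detector itself is not shown, and the function names and toy outlines are illustrative.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy outlines: corners of a long rectangle and the same rectangle shifted by 1.
outline_a = [(0, 0), (0, 10), (4, 10), (4, 0)]
outline_b = [(1, 0), (1, 10), (5, 10), (5, 0)]

print(hausdorff_distance(outline_a, outline_a))  # 0.0
print(hausdorff_distance(outline_a, outline_b))  # 1.0
```

This brute-force version compares every pair of points, which is adequate for a sketch but slow for dense edge maps.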

[0093] The semantic feature vectors of the trench coat category tag image and the candidate photo are extracted using the contrastive learning pre-trained model used previously. The cosine similarity of the two semantic feature vectors is calculated as the semantic similarity component. If the trench coat and a candidate photo are similar in semantics (such as style, occasion applicability, etc.), the cosine similarity of their semantic feature vectors is high.
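The semantic component reduces to a cosine similarity between two embedding vectors. In the sketch below the vectors are made-up toy values standing in for the outputs of the contrastive learning pre-trained model, which is not invoked here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Toy embeddings (illustrative values only).
trench_coat_vec = [0.2, 0.7, 0.1]
similar_coat_vec = [0.4, 1.4, 0.2]   # same direction, so similarity is 1
unrelated_vec = [0.7, -0.2, 0.0]     # orthogonal, so similarity is 0

print(round(cosine_similarity(trench_coat_vec, similar_coat_vec), 6))  # 1.0
print(round(cosine_similarity(trench_coat_vec, unrelated_vec), 6))     # 0.0
```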

[0094] Based on practical application requirements, weights are assigned to the color, texture, shape, and semantic similarity components, such as a color weight of 0.3, a texture weight of 0.2, a shape weight of 0.2, and a semantic weight of 0.3. Each component is multiplied by its corresponding weight and then summed to obtain the final similarity score between the trench coat and each candidate photo.
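The fusion of the four components can be sketched as follows, using the example weights from this paragraph. Because three of the components are defined as distances, they must first be turned into similarities. The mapping 1 / (1 + d) used here is an assumption, since the patent does not say how this conversion is done, and the component values are made up for illustration.

```python
import math

# Example weights from the description.
WEIGHTS = {"color": 0.3, "texture": 0.2, "shape": 0.2, "semantic": 0.3}


def distance_to_similarity(distance: float) -> float:
    """Assumed conversion: distance 0 maps to 1, larger distances approach 0."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


def fused_similarity(components: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the color, texture, shape, and semantic similarities."""
    if set(components) != set(weights):
        raise ValueError("components and weights must use the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(components[name] * weights[name] for name in weights)


# Illustrative component values for one candidate photo.
components = {
    "color": 0.9,
    "texture": 0.8,
    "shape": distance_to_similarity(1.0),  # Hausdorff distance 1.0 -> 0.5
    "semantic": 0.7,
}

score = fused_similarity(components)
print(round(score, 2))  # 0.74
```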

[0095] Thus, by calculating similarity from multiple dimensions such as color, texture, shape, and semantics, various features of clothing images are comprehensively considered. Color reflects the intuitive visual experience of clothing, texture reflects the detailed features of the surface, shape describes the outline structure of clothing, and semantics contains abstract information such as style. By combining these features to calculate similarity, the degree of similarity between the trench coat to be searched and the candidate photos can be measured more accurately. Secondly, features of different dimensions have different importance in clothing retrieval. By reasonably setting the weight of each similarity component, the role of key features can be highlighted. For example, in this case, color and semantics are given higher weights, so that when searching for beige long trench coats, more attention is paid to the similarity of color matching and semantic style, thereby improving the accuracy of the search results.

[0096] In some embodiments, the method further includes:

[0097] Result optimization is performed on the selected best-matching candidate photos. The optimization includes deduplication and diversity enhancement. Deduplication checks for duplicate photos among the best-matching candidate photos by comparing image hash values. If duplicates are found, the photo with the highest similarity is retained and the remaining duplicates are removed. Diversity enhancement analyzes the distribution of the best-matching candidate photos across attributes such as clothing type, color, and style. If the photos are too concentrated on one value of an attribute dimension, some of them are replaced with photos that have slightly lower similarity but differ in that or other attribute dimensions, thereby enhancing the diversity of the search results.
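One possible reading of these two steps is sketched below. Candidates are plain dictionaries with a precomputed `hash` field, and diversity is enforced by capping how many results may share one attribute value. The field names, the cap `max_per_value`, and the greedy strategy are all assumptions; the patent does not fix a specific procedure.

```python
from collections import Counter


def deduplicate(candidates):
    """Keep the highest-similarity photo for each image hash, ranked best first."""
    best = {}
    for photo in candidates:
        kept = best.get(photo["hash"])
        if kept is None or photo["similarity"] > kept["similarity"]:
            best[photo["hash"]] = photo
    return sorted(best.values(), key=lambda p: p["similarity"], reverse=True)


def diversify(ranked, k, attribute, max_per_value):
    """Greedily pick k photos, allowing at most max_per_value per attribute value."""
    chosen, skipped, counts = [], [], Counter()
    for photo in ranked:
        if len(chosen) == k:
            break
        if counts[photo[attribute]] < max_per_value:
            chosen.append(photo)
            counts[photo[attribute]] += 1
        else:
            skipped.append(photo)
    # Back-fill with skipped photos if the cap left fewer than k results.
    chosen.extend(skipped[: k - len(chosen)])
    return sorted(chosen, key=lambda p: p["similarity"], reverse=True)


candidates = [
    {"id": "a", "hash": "h1", "similarity": 0.95, "color": "beige"},
    {"id": "b", "hash": "h1", "similarity": 0.90, "color": "beige"},  # duplicate of a
    {"id": "c", "hash": "h2", "similarity": 0.93, "color": "beige"},
    {"id": "d", "hash": "h3", "similarity": 0.91, "color": "beige"},
    {"id": "e", "hash": "h4", "similarity": 0.88, "color": "khaki"},
]

unique = deduplicate(candidates)
results = diversify(unique, k=3, attribute="color", max_per_value=2)
print([p["id"] for p in results])  # ['a', 'c', 'e']
```

Here photo `d` is replaced by the slightly less similar khaki photo `e` because two beige results were already selected.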

[0098] Specifically, by deduplication, duplicate photos are eliminated, making search results more concise and clear. Users no longer need to sift through a large number of repetitive images, saving them time and effort and improving the efficiency of browsing search results. Diversity enhancement optimizes search results from multiple attribute dimensions, increasing the richness of the results. Users can not only see clothing from the same brand and similar styles, but also learn about beige long trench coats from different brands, in different price ranges, and with different details, meeting the diverse needs of users.

[0099] A smart image search system for clothing includes:

[0100] The image acquisition module is used to acquire images of clothing to be retrieved.

[0101] The semantic feature extraction module is used to extract the semantic feature vector of the clothing image to be retrieved using an image encoder that employs a contrastive learning pre-trained model.

[0102] The text feature extraction module is used to construct a text description of the clothing image to be retrieved based on predefined attribute dimensions, and to extract the text feature vector of the text description using a text encoder of a contrastive learning pre-trained model.

[0103] The multidimensional attribute construction module is used to construct a multidimensional attribute vector of the clothing image to be retrieved based on the semantic feature vector and the text feature vector;

[0104] The category tag acquisition module is used to perform two-dimensional classification based on the multi-dimensional attribute vector of the clothing image to be retrieved, and obtain the category tag image of the clothing image to be retrieved;

[0105] The similarity calculation module is used to calculate the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and to perform weighted fusion of the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved.

[0106] The search results filtering module is used to select the most matching photos from the candidate photos in the database as search results based on the similarity of the clothing images to be searched.
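The final filtering module amounts to a top-k selection over scored candidates, which can be sketched as below. The function name `select_top_matches`, the `score` callable, and the `top_k` parameter are hypothetical, since the patent does not state how many results are returned, and the toy scores are illustrative.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def select_top_matches(
    query: object,
    candidates: Iterable,
    score: Callable[[object, object], float],
    top_k: int = 3,
) -> List[Tuple[float, object]]:
    """Return the top_k (similarity, candidate) pairs, best match first."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    scored = ((score(query, candidate), candidate) for candidate in candidates)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


# Toy final similarities standing in for the similarity calculation module.
toy_scores = {"black long coat": 0.64, "trousers": 0.2, "khaki trench coat": 0.7}

matches = select_top_matches(
    "beige long trench coat",
    toy_scores,
    score=lambda query, candidate: toy_scores[candidate],
    top_k=2,
)
print([name for _, name in matches])  # ['khaki trench coat', 'black long coat']
```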

[0107] A computing device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of any of the above-described intelligent clothing image search methods.

[0108] A computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described intelligent clothing image search methods.

[0109] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for intelligent search of clothing images, characterized in that, Specifically, it includes: Obtain the clothing image to be searched; A contrastive learning pre-trained model is used to extract semantic feature vectors from the clothing images to be retrieved; Based on predefined attribute dimensions, a text description of the clothing image to be retrieved is constructed, and a text encoder of a contrastive learning pre-trained model is used to extract the text feature vector of the text description. Based on semantic feature vectors and text feature vectors, a multidimensional attribute vector is constructed for the clothing image to be retrieved; Two-dimensional classification is performed based on the multi-dimensional attribute vector of the clothing image to be retrieved to obtain the category label image of the clothing image to be retrieved; Calculate the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and then weight and fuse the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved. Based on the similarity of the clothing images to be retrieved, the most matching photos are selected from the candidate photos in the database as the retrieval results.

2. The intelligent clothing image search method according to claim 1, characterized in that, The step of constructing a text description of the clothing image to be retrieved based on predefined attribute dimensions, and extracting the text feature vector of the text description using a text encoder of a contrastive learning pre-trained model, specifically includes: Predefine attribute dimensions for describing clothing, including type, color, sleeve length, collar type, pattern, style, season, material, and fit; For each predefined attribute dimension, a corresponding text description is constructed using natural language, combining the features presented by the clothing image to be retrieved in that attribute dimension. The text encoder of the contrastive learning pre-trained model takes the text descriptions corresponding to each attribute dimension as input and performs feature extraction on each text description to obtain the text feature vector corresponding to each text description.

3. The intelligent clothing image search method according to claim 1, characterized in that, The construction of a multi-dimensional attribute vector for the clothing image to be retrieved based on semantic feature vectors and text feature vectors specifically includes: For each predefined attribute dimension, the cosine similarity between the semantic feature vector of the clothing image to be retrieved and the text feature vector under that attribute dimension is calculated, and the calculated cosine similarity is normalized to obtain the confidence probability distribution of each candidate value under each attribute dimension. For each attribute dimension, retain the attribute identification results that are higher than the preset confidence threshold in the confidence probability distribution; The attribute recognition results under each retained attribute dimension are combined to form a multidimensional attribute vector of the clothing image to be retrieved.

4. The intelligent search method for clothing images according to claim 1, characterized in that, The step of performing two-dimensional classification based on the multi-dimensional attribute vector of the clothing image to be retrieved, to obtain the category label image of the clothing image to be retrieved, specifically includes: Define a two-dimensional classification system that includes both location and hierarchical dimensions; Based on the attribute recognition results in the multidimensional attribute vector of the clothing image to be retrieved, and combined with the defined two-dimensional classification system, the category to which the clothing image to be retrieved belongs in the position dimension and the hierarchical dimension is determined. Based on the judgment result of two-dimensional classification, a category label image of the clothing image to be retrieved is generated. The category label image is used to mark the classification information of the clothing image to be retrieved in the position dimension and the hierarchical dimension.

5. The intelligent search method for clothing images according to claim 1, characterized in that, The calculation of the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and the weighted fusion of the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved, specifically includes: For each category label image, a preset image similarity calculation strategy is used to calculate the similarity between the category label image and the candidate photos in the database. Determine the weight corresponding to the similarity of each category label image; The similarity between each category label image and the candidate photos in the database is multiplied by its corresponding weight and then summed to obtain the final similarity between the clothing image to be retrieved and the candidate photos in the database.

6. The intelligent search method for clothing images according to claim 5, characterized in that, The preset image similarity calculation strategy is a multi-dimensional similarity calculation method that integrates image color histogram, texture features, shape features, and semantic features. The specific steps are as follows: Extract the color histogram features of the category label image and the candidate photo, and calculate the chi-square distance of the color histogram features of the two as the color similarity component; The local binary pattern (LBP) algorithm is used to extract the texture features of the category label image and the candidate photo, and the Euclidean distance between the texture features of the two is calculated as the texture similarity component. The shape features of category label images and candidate photos are extracted using edge detection algorithms, and the Hausdorff distance between the two shape features is calculated as the shape similarity component. Using the contrastive learning pre-trained model, semantic feature vectors of the category label image and the candidate photo are extracted respectively, and the cosine similarity between the two semantic feature vectors is calculated as the semantic similarity component.

7. The intelligent search method for clothing images according to claim 1, characterized in that, The method further includes: The most matching candidate photos selected are then subjected to result optimization processing, which includes deduplication and diversity enhancement processing. Deduplication is used to remove duplicate photos, and diversity enhancement processing is used to enhance the diversity of the search results.

8. A smart image search system for clothing, characterized in that, it includes: The image acquisition module is used to acquire images of clothing to be retrieved. The semantic feature extraction module is used to extract the semantic feature vector of the clothing image to be retrieved using an image encoder that employs a contrastive learning pre-trained model. The text feature extraction module is used to construct a text description of the clothing image to be retrieved based on predefined attribute dimensions, and to extract the text feature vector of the text description using a text encoder of a contrastive learning pre-trained model. The multidimensional attribute construction module is used to construct a multidimensional attribute vector of the clothing image to be retrieved based on the semantic feature vector and the text feature vector; The category tag acquisition module is used to perform two-dimensional classification based on the multi-dimensional attribute vector of the clothing image to be retrieved, and obtain the category tag image of the clothing image to be retrieved; The similarity calculation module is used to calculate the similarity between each category tag image of the clothing image to be retrieved and the candidate photos in the database, and to perform weighted fusion of the similarity of each category tag image to obtain the similarity of the clothing image to be retrieved. The search results filtering module is used to select the most matching photos from the candidate photos in the database as search results based on the similarity of the clothing images to be searched.

9. A computing device, characterized in that, The computing device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the intelligent clothing image search method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that, The storage medium stores a computer program, which, when executed by a processor, implements the steps of the intelligent clothing image search method according to any one of claims 1-7.