A fabric matching method and system based on multi-modal information matching
By using semantic structuring enhancement driven by a terminology vocabulary for the fabric industry and salient region segmentation, combined with multi-channel texture fidelity processing, the problem of inaccurate semantic matching between fabric images and text descriptions is solved, achieving high-precision fabric retrieval and matching, suitable for complex shooting environments and non-standard description scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHIYI TECH
- Filing Date
- 2025-07-29
- Publication Date
- 2026-06-19
Figure CN120976585B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent matching and recognition technology for fabrics, and in particular to a fabric matching method and system based on multimodal information matching.
Background Technology
[0002] Currently, with the widespread application of multimodal learning models (such as CLIP) in image and text retrieval and content understanding, the industry has begun to explore their application in matching fabric images with text descriptions to assist in apparel selection, design reference, and supply chain fabric positioning. However, existing methods still suffer from key technological shortcomings, especially in processing fabric-level semantic matching, where issues such as incomplete semantic coverage, inaccurate texture representation, and low matching accuracy exist, making it difficult to directly meet the industry's demand for rapid and accurate product selection.
[0003] For example, general image-text matching models are typically trained on open-domain datasets, whose label spaces struggle to represent specialized fabric terms such as "acetate drape," "cationic blend twill," and "washed silk." As a result, the model cannot fully recognize the material properties, fabric structure, and processing characteristics expressed in user input. Furthermore, existing methods generally ignore the complexity of the environment in which the fabric image is shot, including factors such as wrinkles, reflections, and background interference. These factors severely dilute the response intensity of key texture information in the visual semantic vector, leading to unstable image-text matching results, especially for long-tail semantic descriptions or cross-brand retrieval scenarios.
[0004] Existing technologies cannot fully meet the refined needs of designers and buyers to quickly locate corresponding fabrics based on natural language descriptions and further link them to popular styles on Taobao, Douyin, and other social media platforms. Therefore, there is an urgent need for a multimodal matching method that can still achieve high-precision and robust semantic alignment between fabric images and text descriptions even when the industry's semantic space is insufficient or the image quality is not ideal, in order to improve the matching accuracy, response speed, and actual conversion rate of intelligent product selection systems.
Summary of the Invention
[0005] To address the aforementioned technical shortcomings, the present invention aims to propose a fabric matching method based on multimodal information matching. This method addresses the technical problem that existing technologies using simple text retrieval or single-channel feature matching methods cannot achieve accurate fabric positioning and semantic consistency matching, especially when user descriptions are non-standard, fabric image textures are complex, or structures are repetitive.
[0006] To solve the above technical problems, the present invention adopts the following technical solution: The present invention provides a fabric matching method based on multimodal information matching.
[0007] The fabric matching method based on multimodal information matching includes:
[0008] Step S10: Based on the preset fabric industry terminology glossary W, perform semantic structuring enhancement processing on the user-input raw text T, and output enhanced descriptive text T*, wherein the semantic structuring enhancement processing includes adopting a multi-level semantic regularization recognition mechanism and a semantic insertion and splicing strategy based on weighted hint embedding;
[0009] Step S20: Feed the enhanced descriptive text T* into the improved CLIP text encoding model, which outputs a first text semantic vector, and use a semantic category-guided trainable vector projection mechanism to extract, from the first text semantic vector, a second text semantic vector related to the fabric structure attributes.
[0010] Step S30: Obtain the original fabric image from the preset background database, perform foreground mask extraction, multi-channel texture fidelity enhancement and structural balance normalization on the original fabric image, and output the preprocessed image data.
[0011] Step S40: Perform texture saliency region segmentation on the preprocessed image data, and generate the global semantic vector V and texture region sub-semantic vector B based on the region segmentation results;
[0012] Step S50: Based on the global semantic vector V of the image and the sub-semantic vector B of the texture region generated in step S40, perform a cross-modal semantic alignment matching task in combination with the second text semantic vector, and output the final matching fabric result based on the multi-factor joint ranking mechanism.
[0013] Preferably, in step S10, the fabric industry terminology glossary W includes N hierarchical fabric terms organized into a material subset W1, a texture subset W2, a pattern subset W3, and a style modification subset W4. Performing semantic structuring enhancement processing on the user-input original text T based on the preset glossary W and outputting the enhanced descriptive text specifically includes: performing term matching on the original text T using the multi-level semantic regularization recognition mechanism to identify a term sequence of hierarchical fabric terms; performing industry annotation expansion on the term sequence through a preset word meaning mapping table; and connecting the annotation expansion results using the semantic insertion and splicing strategy based on weighted hint embedding to output the enhanced descriptive text.
[0014] Preferably, in step S20, the improved CLIP text encoding model specifically includes: an input layer for receiving the enhanced descriptive text; a semantic structure hint embedding layer for introducing guiding hint vectors that represent fabric term classification; a multi-scale attention focusing layer for improving the model's attention response to fabric keywords at different levels; a domain feature adaptation enhancement layer for integrating the semantic context distribution patterns obtained by fine-tuning on a fabric industry corpus; and an output layer for outputting the first text semantic vector.
[0015] The second text semantic vector is used for subsequent semantic alignment with the texture salient regions extracted from the image. It is extracted as follows: a vector projection structure for recognizing weave structure attributes performs weight learning and semantic clustering on each dimension of the first text semantic vector to identify and extract the sub-semantic features related to fabric weave and texture morphology, and these sub-semantic features are vectorized to obtain the second text semantic vector.
[0016] Preferably, step S30, which involves obtaining the original fabric image from a preset background database, performing foreground mask extraction, multi-channel texture fidelity enhancement, and structural equalization normalization on the original fabric image, and outputting preprocessed image data, specifically includes:
[0017] Step S301: Retrieve raw image data from a pre-set backend database using the HTTPS protocol and an open API.
[0018] Step S302: Based on the preset weakly supervised training semantic segmentation model, perform pixel-level region segmentation on the original image data to generate a fabric region mask map, and use the fabric region mask map to perform masking processing on the original image data to remove background region interference and obtain optimized image data.
[0019] Step S303: Decompose the optimized image data into two sub-channels, a brightness channel and a texture direction channel, and perform local structural clarity enhancement and direction consistency enhancement on them respectively to obtain the reconstructed and enhanced image.
[0020] Step S304: Detect high-frequency abnormal regions in the reconstructed and enhanced image, and perform edge smoothing on wrinkle shadows, overexposed lighting, and reflection areas to preserve real texture information and output preprocessed image data.
[0021] Preferably, step S40, which involves performing texture saliency region modeling on the preprocessed image data and generating a global semantic vector and texture region sub-semantic vectors based on the modeling results, specifically includes:
[0022] An adaptive image segmentation mechanism based on perceptual texture response divides the preprocessed image data into N local image regions, extracts the local feature vector set of each region, and generates the local feature vector $f_i$ of each region $i$ from that set;
[0023] Based on the local feature vector $f_i$ of each region, its texture saliency score $s_i$ is calculated;
[0024] Based on the texture saliency scores $s_i$ of all regions, the local feature vectors $f_i$ are fused by weighting to obtain the global semantic vector V of the image;
[0025] All regions are arranged in descending order of texture saliency score. The top K regions are selected to extract their corresponding local feature vectors, which are then concatenated or averaged to generate the texture region sub-semantic vector B.
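As a minimal sketch of the fusion described above, the following Python example builds V as a saliency-weighted sum of the local feature vectors and builds B from the top K regions. The function name `fuse_regions`, the normalization of the scores into weights that sum to 1, and the toy inputs are assumptions made for illustration; the source does not specify the exact weighting rule.

```python
import numpy as np


def fuse_regions(features, scores, top_k=2, mode="mean"):
    """Build V and B from local vectors.

    features: (N, D) local feature vectors, one row per region.
    scores:   (N,) non-negative texture saliency scores with a positive sum.
    """
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    weights = scores / scores.sum()        # assumed: weights normalized to sum to 1
    global_vec = weights @ features        # saliency-weighted fusion, gives V
    order = np.argsort(-scores)[:top_k]    # indices of the K most salient regions
    top = features[order]
    if mode == "concat":
        sub_vec = top.reshape(-1)          # B as a concatenation, shape (K * D,)
    else:
        sub_vec = top.mean(axis=0)         # B as an average, shape (D,)
    return global_vec, sub_vec, order


# Toy example: three regions with two-dimensional features.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([0.2, 0.5, 0.3])
V, B, order = fuse_regions(features, scores, top_k=2)
print(V, B, order)
```

Averaging keeps B the same size as a single local vector, while concatenation preserves each region separately at the cost of a larger vector; which of the two suits a given matcher is left open by the text.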
[0026] Preferably, in step S40, the texture saliency score $s_i$ of region $i$ is defined by the following formula:
[0027] $$s_i = \alpha \, G(f_i) + \beta \, \lVert \nabla I_i \rVert_2 + \gamma \, H(I_i);$$
[0028] where $G(f_i)$ represents the response energy of the local feature vector $f_i$ obtained after multi-scale, multi-directional Gabor filtering, and is used to measure the texture directionality of the region; $\lVert \nabla I_i \rVert_2$ represents the L2 norm of the image gradient of region $i$, reflecting the intensity of structural changes in the local region; $H(I_i)$ represents the information entropy of the local pixel grayscale distribution, used to measure the image complexity or texture density of the region; and α, β, and γ are saliency weighting coefficients giving the relative weights of the three terms above, satisfying α + β + γ = 1.
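The three cues named above (Gabor response energy, gradient norm, and grayscale entropy) can be illustrated with a short NumPy sketch. Everything below is an assumption made for illustration rather than the patented implementation: the kernel size, wavelengths, number of orientations, the zero-mean Gabor kernel, the use of the mean gradient magnitude as the gradient term, the min-max normalization of each cue across regions, and the default weights 0.4, 0.3, 0.3.

```python
import numpy as np


def gabor_kernel(size, theta, wavelength, sigma):
    """Real Gabor kernel with orientation theta (radians), made zero-mean."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    kern = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)
    return kern - kern.mean()  # zero mean, so flat regions give no response


def gabor_energy(patch, wavelengths=(4, 8), n_orient=4):
    """Mean squared filter response over several scales and orientations."""
    energies = []
    for lam in wavelengths:
        for k in range(n_orient):
            kern = gabor_kernel(9, np.pi * k / n_orient, lam, sigma=lam / 2)
            # Circular convolution via FFT keeps the sketch free of extra dependencies.
            spec = np.fft.fft2(patch) * np.fft.fft2(kern, s=patch.shape)
            energies.append(np.mean(np.real(np.fft.ifft2(spec)) ** 2))
    return float(np.mean(energies))


def gradient_norm(patch):
    """Mean per-pixel L2 norm of the image gradient."""
    gy, gx = np.gradient(patch)
    return float(np.mean(np.hypot(gx, gy)))


def entropy(patch, bins=32):
    """Shannon entropy in bits of the grayscale histogram; values in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))


def minmax(values):
    """Scale a list of values to [0, 1] so the three cues are comparable."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def saliency_scores(patches, alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted sum of the three normalized cues; the weights must sum to 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    g = minmax([gabor_energy(p) for p in patches])
    d = minmax([gradient_norm(p) for p in patches])
    h = minmax([entropy(p) for p in patches])
    return alpha * g + beta * d + gamma * h


# Toy regions: a featureless patch and a twill-like diagonal stripe pattern.
yy, xx = np.mgrid[0:32, 0:32]
flat = np.full((32, 32), 0.5)
stripes = 0.5 + 0.5 * np.sin(2 * np.pi * (xx + yy) / 8)
scores = saliency_scores([flat, stripes])
print(scores)
```

Normalizing each cue before the weighted sum is a design choice of this sketch: the raw Gabor energy, gradient magnitude, and entropy live on different scales, so without some normalization the weights α, β, and γ would not express relative importance.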
[0029] Preferably, step S50, which involves performing a cross-modal semantic alignment matching task based on the image global semantic vector V and texture region sub-semantic vector B generated in step S40, combined with the second text semantic vector, and outputting the final matching fabric result based on a multi-factor joint ranking mechanism, specifically includes:
[0030] Cosine similarity calculation is performed on the second text semantic vector and the image global semantic vector V to obtain the image-text global semantic matching score S1. The image-text global semantic matching score S1 is used to measure the semantic consistency between the text description and the overall style of the image.
[0031] Cosine similarity is calculated between the second text semantic vector and each sub-vector of the texture region sub-semantic vector B. The resulting local similarity scores are then weighted and fused according to a preset Top-K aggregation rule to obtain the local texture semantic matching score S2, which measures the fine-grained semantic association between the text description and the key texture structure of the image.
[0032] Based on the semantic angle or distance between all sub-vectors in the texture region sub-semantic vector B, the texture semantic discreteness index S3 between regions is calculated. The texture semantic discreteness index S3 between regions is used to reflect the diversity distribution of image texture structure.
[0033] A joint matching score function is constructed based on the global semantic matching score S1, the local texture semantic matching score S2, and the inter-regional texture semantic discreteness index S3. The joint matching score function outputs the joint matching score result. The original fabric images in step S30 are sorted according to the joint matching score result, and the final matching fabric result is output.
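The source names the three inputs S1, S2, and S3 but does not give the joint matching score function itself. The following Python sketch shows one possible form under stated assumptions: S2 is an unweighted mean of the top K local similarities, S3 is the mean pairwise cosine distance between region sub-vectors, and the joint score is a linear combination in which S3 acts as a penalty. The weights, the sign of the S3 term, and the names `joint_score` and `rank` are all hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def joint_score(text_vec, global_vec, region_vecs, top_k=2, w=(0.4, 0.5, 0.1)):
    """Return (score, S1, S2, S3) for one candidate fabric image."""
    s1 = cosine(text_vec, global_vec)                    # global image-text match
    local = sorted((cosine(text_vec, r) for r in region_vecs), reverse=True)
    s2 = float(np.mean(local[:top_k]))                   # assumed Top-K rule: plain mean
    # S3: mean pairwise cosine distance between region sub-vectors.
    n = len(region_vecs)
    pairs = [1.0 - cosine(region_vecs[i], region_vecs[j])
             for i in range(n) for j in range(i + 1, n)]
    s3 = float(np.mean(pairs)) if pairs else 0.0
    # Assumed form: dispersion is treated as a penalty; the source does not fix its sign.
    score = w[0] * s1 + w[1] * s2 - w[2] * s3
    return score, s1, s2, s3


def rank(text_vec, candidates):
    """candidates maps name -> (global_vec, region_vecs); returns names, best first."""
    scored = {name: joint_score(text_vec, g, r)[0] for name, (g, r) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy example: the text vector points along the first axis ("twill").
text = np.array([1.0, 0.0])
candidates = {
    "twill": (np.array([0.9, 0.1]), [np.array([1.0, 0.0]), np.array([0.9, 0.2])]),
    "plain": (np.array([0.6, 0.6]), [np.array([0.0, 1.0]), np.array([0.2, 0.9])]),
}
print(rank(text, candidates))
```

Because S2 carries the largest weight here, a candidate whose salient regions match the text outranks one that only matches in overall style, which is the behavior the multi-factor ranking is meant to encourage.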
[0034] The present invention also provides a fabric matching system based on multimodal information matching, comprising:
[0035] The text semantic enhancement module is used to perform semantic structuring enhancement processing on the user-input raw text T based on a preset fabric industry terminology glossary W, and output enhanced descriptive text T*, wherein the semantic structuring enhancement processing includes adopting a multi-level semantic regularization recognition mechanism and a semantic insertion and splicing strategy based on weighted hint embedding;
[0036] The semantic vector extraction module for weave structure attributes is used to feed the enhanced descriptive text T* into the improved CLIP text encoding model, which outputs a first text semantic vector, and to use a semantic category-guided trainable vector projection mechanism to extract, from the first text semantic vector, a second text semantic vector related to the fabric structure attributes.
[0037] The fabric image preprocessing module is used to obtain the original fabric image from the preset background database, perform foreground mask extraction, multi-channel texture fidelity enhancement and structural balance normalization on the original fabric image, and output preprocessed image data.
[0038] The texture saliency modeling module is used to perform texture saliency region segmentation on preprocessed image data and generate a global semantic vector V and texture region sub-semantic vector B based on the region segmentation results.
[0039] The cross-modal alignment and sorting output module is used to perform cross-modal semantic alignment matching task based on the image global semantic vector V and texture region sub-semantic vector B generated in step S40, combined with the second text semantic vector, and output the final matching fabric result based on the multi-factor joint sorting mechanism.
[0040] The present invention also provides a fabric matching device based on multimodal information matching, comprising: a memory, a processor, and a fabric matching program based on multimodal information matching stored in the memory and executable on the processor. When the fabric matching program based on multimodal information matching is executed by the processor, it implements a fabric matching method based on multimodal information matching.
[0041] The present invention also provides a computer program product, including a fabric matching program based on multimodal information matching, wherein the fabric matching program based on multimodal information matching implements the fabric matching method based on multimodal information matching when executed by a processor.
[0042] The beneficial effects of this invention are as follows: existing technology that relies on simple text retrieval or single-channel feature matching cannot achieve accurate fabric positioning or consistent matching of text and image semantics, especially when user descriptions are non-standard, fabric image textures are complex, or structures are repetitive. By introducing an industry ontology-driven semantic enhancement mechanism and a saliency-guided cross-modal matching architecture, this application achieves multi-granular modeling and fine-grained semantic alignment of fabric semantic features, thereby improving fabric retrieval accuracy in open descriptive contexts.
Attached Figure Description
[0043] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0044] Figure 1 is a flowchart illustrating the first embodiment of a fabric matching method based on multimodal information matching according to the present invention.
[0045] Figure 2 is a schematic diagram of a device for a fabric matching method based on multimodal information matching according to the present invention.
Detailed Implementation
[0046] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0047] Example 1: Figure 1 is a flowchart of the first embodiment of the fabric matching method based on multimodal information matching of the present invention.
[0048] In the first embodiment, the fabric matching method based on multimodal information matching includes:
[0049] Step S10: Based on the preset fabric industry terminology glossary W, perform semantic structuring enhancement processing on the user-input raw text T, and output enhanced descriptive text T*, wherein the semantic structuring enhancement processing includes adopting a multi-level semantic regularization recognition mechanism and a semantic insertion and splicing strategy based on weighted hint embedding;
[0050] It should be noted that the "multi-level semantic regularization recognition mechanism" refers to the combined processing of part-of-speech parsing, hierarchical classification, and semantic constraint recognition applied to the original text T based on the fabric industry terminology glossary W. This includes, but is not limited to, the independent extraction and clustering modeling of terms related to fabric materials (such as "acetate" and "nylon"), terms related to weave structure (such as "twill" and "double yarn"), and terms related to processing (such as "washed" and "brushed"). The "semantic insertion and splicing strategy based on weighted hint embedding" refers to assigning weights to the identified core terms according to their semantic importance and, guided by prompt templates, embedding missing key terms or modifiers into the original text according to set rules, thereby constructing a structured enhanced descriptive text T*.
[0051] Understandably, this structured enhancement process can effectively address issues such as missing industry terminology, semantic ambiguity, or non-standard expressions in users' natural language descriptions, making the input text more consistent with the semantic alignment requirements in fabric image-text matching tasks, thereby improving the subsequent encoding model's ability to perceive fine-grained semantics.
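A heavily simplified sketch of the enhancement in step S10 is given below. It is not the mechanism of the invention: the miniature glossary, the annotation table, the per-subset weights, and the bracketed output template are all hypothetical, and the "weighted hint embedding" is reduced here to ordering plain-text hints by weight instead of operating on embedding vectors.

```python
import re

# Hypothetical miniature glossary W, split into the four subsets named in the text.
GLOSSARY = {
    "material": ["chiffon", "cotton", "tencel", "nylon"],
    "texture": ["twill", "plain weave", "jacquard"],
    "pattern": ["diamond", "stripe"],
    "style": ["washed", "brushed", "wrinkled"],
}

# Hypothetical word meaning mapping table: matched term -> industry annotation.
ANNOTATIONS = {
    "chiffon": "lightweight woven fabric",
    "twill": "diagonal-rib weave structure",
    "wrinkled": "natural crepe effect",
}

# Assumed per-subset weights; higher-weight terms are placed first in the hint list.
WEIGHTS = {"material": 1.0, "texture": 0.9, "pattern": 0.6, "style": 0.5}


def match_terms(text):
    """Return (subset, term) pairs found in the text, one pass per glossary level."""
    found = []
    lowered = text.lower()
    for subset, terms in GLOSSARY.items():
        for term in terms:
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                found.append((subset, term))
    return found


def enhance(text):
    """Build an enhanced description by appending weighted, annotated terms."""
    matches = sorted(match_terms(text), key=lambda m: WEIGHTS[m[0]], reverse=True)
    hints = []
    for subset, term in matches:
        note = ANNOTATIONS.get(term)
        hints.append(f"{subset}: {term} ({note})" if note else f"{subset}: {term}")
    if not hints:
        return text  # nothing recognized, so the text is passed through unchanged
    return f"{text} [fabric attributes: " + "; ".join(hints) + "]"


print(enhance("thicker feel, suitable for skirts, slightly wrinkled chiffon"))
```

Appending the structured hints rather than rewriting the user's sentence keeps the original wording available to the text encoder while still exposing the recognized industry terms explicitly.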
[0052] It should be understood that, compared to the traditional CLIP model which directly processes the original text, this invention significantly improves the ability of text embedding vectors to distinguish industry-specific attributes without changing the original model structure by introducing an explicit semantic regularization structure and a prompting enhancement mechanism, especially when dealing with long-tail terms (such as "nylon matte weft elastic twill").
[0053] For example, in a set of original text containing the natural description "thicker feel, suitable for skirts, slightly wrinkled chiffon", the traditional CLIP model failed to identify the possible corresponding semantic tag "high-twist chiffon". However, after structural enhancement, the present invention generates "thicker chiffon fabric with high-twist yarn structure, with natural wrinkles, suitable for skirts", and the Top-1 retrieval accuracy of the corresponding fabric image is improved from 68.4% to 91.2%, which significantly improves the professionalism and commercial applicability of the matching results.
[0054] Step S20: Feed the enhanced descriptive text T* into the improved CLIP text encoding model, which outputs a first text semantic vector, and use a semantic category-guided trainable vector projection mechanism to extract, from the first text semantic vector, a second text semantic vector related to the fabric structure attributes.
[0055] It should be noted that the "trainable vector projection mechanism based on semantic category guidance" works as follows: starting from the first text semantic vector output by the CLIP text encoder, a set of semantic categories predefined from fabric industry knowledge (such as "weave structure", "material composition", "functional characteristics", and "applicable scenarios") is used as guiding labels to construct multiple semantic projection subspaces, and a soft-constraint projection is applied to the first text semantic vector to output one or more sub-vectors, each tied to a semantic dimension. The second text semantic vector, corresponding to the "weave structure" attribute, represents the embedding of the description along the weave structure dimension.
[0056] Understandably, this mechanism uses a learned weight matrix or attention parameters to adaptively model how strongly each dimension of the first text semantic vector contributes to the semantics of the "weave structure" category. Abstract semantics are thereby embedded into a sub-vector space that is interpretable along the industry dimension and has a clear function, so that subsequent image matching tasks not only "know what you said" but also "know what the key point of what you said is".
[0057] It should be understood that the single semantic vector uniformly generated by the traditional CLIP model cannot distinguish the relative weights of "material" and "weave" in a description, which often results in matching drift when multiple attributes are expressed together. Through a guided semantic decomposition mechanism, this invention effectively separates and purifies the semantics of the target attribute dimension, making the matching process more structurally selective and target-oriented. It is especially suitable for understanding and extracting product descriptions that compress multiple attributes into one expression (such as "cotton and linen twill elastic fabric").
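The projection mechanism described above can be pictured with a small NumPy sketch. It assumes one particular realization, a softmax gate over the dimensions of the first text semantic vector followed by a linear projection into a lower-dimensional subspace; the source only speaks of "a learned weight matrix or attention parameters". The parameters below are random placeholders standing in for trained values, and the toy sizes are far smaller than a real text embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, SUB_DIM = 8, 4  # toy sizes chosen for the example

# Hypothetical learnable parameters for one semantic category (the structure attribute).
# In practice these would be trained; here they are random placeholders.
gate_logits = rng.normal(size=DIM)             # per-dimension contribution scores
projection = rng.normal(size=(SUB_DIM, DIM))   # maps the gated vector into the subspace


def softmax(x):
    """Numerically stable softmax."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def project_structure(first_vec):
    """Gate each dimension of the first text vector, then project and L2-normalize."""
    gate = softmax(gate_logits)     # soft weights over dimensions, summing to 1
    gated = gate * first_vec        # emphasize the dimensions the gate favors
    sub = projection @ gated        # candidate second text semantic vector
    return sub / (np.linalg.norm(sub) + 1e-12)


first_vec = rng.normal(size=DIM)    # stand-in for the text encoder output
second_vec = project_structure(first_vec)
print(second_vec.shape)
```

One such gate-and-projection pair per semantic category would yield one sub-vector per category; only the structure-related sub-vector is carried forward as the second text semantic vector.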
[0058] For example, if a user inputs "soft twill cotton fabric, suitable for making casual suits," the enhanced processing generates structured text. After CLIP outputs the first text semantic vector, this mechanism projects a "structure" sub-vector, explicitly focusing on the "twill" attribute. This vector is then matched with the "twill weave features" in the image. Comparative experiments show that using the guided semantic projection mechanism of this invention improves the Top-1 matching accuracy related to structure from 74.2% to 92.3%, significantly enhancing the dimensional accuracy and controllability of image-text alignment.
[0059] Step S30: Obtain the original fabric image from the preset background database, perform foreground mask extraction, multi-channel texture fidelity enhancement and structural balance normalization on the original fabric image, and output the preprocessed image data.
[0060] It should be noted that step S30, which involves obtaining the original fabric image from a preset background database, performing foreground mask extraction, multi-channel texture fidelity enhancement, and structural equalization normalization on the original fabric image, and outputting preprocessed image data, specifically includes:
Step S301: Retrieve the original image data from the preset background database through an open API over the HTTPS protocol;
Step S302: Perform pixel-level region segmentation on the original image data based on a preset weakly supervised semantic segmentation model to generate a fabric region mask map, and use the fabric region mask map to mask the original image data, removing background interference and obtaining optimized image data;
Step S303: Decompose the optimized image data into two sub-channels, a brightness channel and a texture direction channel, and perform local structural clarity enhancement and direction consistency enhancement on them respectively to obtain a reconstructed and enhanced image;
Step S304: Detect high-frequency abnormal regions in the reconstructed and enhanced image, and perform edge smoothing on wrinkle shadows, overexposed lighting, and reflection areas to preserve real texture information and output the preprocessed image data.
Foreground mask extraction refers to extracting the main fabric region and eliminating background interference by combining a lightweight segmentation model (such as an improved U-Net or SAM) with edge enhancement and color clustering.
Multi-channel texture fidelity enhancement refers to enhancing texture contrast and directionality in the RGB, LAB, and grayscale spaces respectively, using frequency domain enhancement and Retinex illumination normalization, while preserving local variations in fabric features.
Structural equalization normalization refers to equalizing the image's grayscale mean, contrast, and sharpness based on the full-image grayscale structure tensor distribution, ensuring comparability and consistency of expression under different shooting conditions.
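To make the data flow of step S30 concrete, the following NumPy sketch covers a reduced subset of it under stated assumptions: the fabric mask is taken as given (no segmentation model is run), only the brightness channel is processed (the texture direction channel is omitted), reflections are suppressed by clipping to a high percentile instead of the edge smoothing described in step S304, and normalization is a plain zero-mean, unit-variance rescaling. All function names and parameter values are hypothetical.

```python
import numpy as np


def apply_mask(image, mask):
    """Zero out background pixels; image is HxWx3 float in [0, 1], mask is HxW bool."""
    return image * mask[..., None]


def luminance(image):
    """Brightness channel using the Rec. 601 luma weights."""
    return image @ np.array([0.299, 0.587, 0.114])


def suppress_highlights(lum, mask, percentile=99.0):
    """Clip overexposed or reflective pixels to a high percentile of the fabric region."""
    cap = np.percentile(lum[mask], percentile)
    return np.minimum(lum, cap)


def normalize(lum, mask):
    """Give the fabric region zero mean and unit variance; background stays 0."""
    region = lum[mask]
    out = np.zeros_like(lum)
    out[mask] = (region - region.mean()) / (region.std() + 1e-8)
    return out


# Synthetic stand-in for a fabric photo with a small specular reflection.
rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.6, size=(32, 32, 3))
image[5:7, 5:7] = 1.0                     # simulated reflection spot
mask = np.zeros((32, 32), dtype=bool)
mask[2:30, 2:30] = True                   # simulated fabric region

lum = luminance(apply_mask(image, mask))
clipped = suppress_highlights(lum, mask)
pre = normalize(clipped, mask)
print(pre.shape)
```

Computing the clipping threshold and the normalization statistics inside the mask only means the background cannot skew them, which matches the stated goal of removing background interference before any enhancement is applied.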
[0061] Understandably, the core objective of this three-stage processing flow is to maximize the enhancement of significant texture representations in the image that are strongly correlated with fabric attributes without introducing spurious features, while suppressing the negative interference of low-correlation factors such as illumination changes, wrinkle shadows, scanning background, and fabric edge interference on the subsequent semantic encoder.
[0062] It should be understood that traditional image enhancement operations (such as global histogram stretching or sharpening filtering) do not discriminate between fabric types and tend to overfit to a particular image style, thus losing generality. The enhancement strategy of this invention emphasizes the combination of semantic relevance and visual consistency, and enhances robustness through multi-channel fusion, so that the image encoding vector remains highly stable under different lighting conditions, shooting distances, and degrees of fabric wrinkling, significantly improving the domain adaptability and anti-misalignment capability of the image-text matching model.
[0063] For example, when faced with an original image of "plain white cotton-linen fabric," the background in the original image is a desktop texture with rolled-up fabric edges. Without processing, the CLIP model extracts vectors mainly concentrated in the "desktop wood grain + fabric projection" area, often resulting in a mismatch of "imitation wood grain jacquard." After processing using S30 of this invention, the background is successfully removed, the main texture direction of the fabric is enhanced, and after local brightness normalization, the model can accurately identify its "plain weave and plain white" features, improving the matching accuracy by approximately 19.6% and significantly reducing interference from "false detection of high-contrast decorative patterns."
[0064] Step S40: Perform texture saliency region segmentation on the preprocessed image data, and generate the global semantic vector V and texture region sub-semantic vector B based on the region segmentation results;
[0065] It should be noted that the "texture saliency region segmentation" in step S40 is not simply dividing the image into regular grid regions, but rather an adaptive partitioning based on image content. Its purpose is to utilize the local performance of the image in terms of texture directionality, detail complexity, and structural gradient to determine which regions are more semantically expressive. This segmentation strategy integrates statistical indicators of multiple visual factors, assigning a "texture saliency score" to each local region. This score measures whether the region has strong organization, typical structure, or recognition value. This segmentation is not isolated image segmentation, but rather provides structural guidance for the subsequent semantic vector construction process.
[0066] Understandably, this step addresses the issue of missing main texture representation caused by the generalization methods of "average processing" and "full-image perception" in original multimodal models like CLIP. Since real-world fabric images often have shooting defects such as wrinkles, light spots, curling edges, and uneven lighting, inputting the entire image with equal weight to the visual encoder would significantly reduce the proportion of the main texture signal in the encoding vector, leading to a shift in the core semantics of matching. This invention, by introducing a structured salient region recognition mechanism before image encoding, ensures that the main texture region receives a higher weight, thus improving both the resolution of image semantics and enhancing the model's sensitivity in matching tasks.
[0067] It should be understood that, compared with the traditional method of directly using visual backbones such as ResNet or ViT to extract whole image vectors and then calculating cosine similarity with text vectors, the salient region segmentation mechanism introduced in this step has two key advantages: (1) It constructs a dual-channel expression structure of "global semantic vector" and "texture region sub-semantic vector". The former ensures the overall fabric style and tone expression, while the latter focuses on key texture details; (2) It actively excludes regions in the image that are irrelevant to or even interfere with semantic expression (such as background tablecloth, shadows, corner wrinkles, etc.) through the structural modeling process, thereby significantly enhancing the focus and matching robustness of semantic alignment.
[0068] For example, for an image of "diamond-patterned jacquard Tencel" taken under natural light with slight wrinkles, traditional methods might produce an incorrect match that favors "light-colored, high-gloss plain weave fabric," because the high-reflectivity area in the lower right corner of the image carries a large weight in the visual vector. In this step, the saliency scores identify the continuous jacquard texture in the central region as having strong directionality, rich structural gradient changes, and high local entropy. This region is therefore extracted as a high-saliency area, and a sub-semantic vector is constructed from it. In the subsequent alignment stage, this texture sub-semantic vector dominates the matching process and successfully retrieves similar Tencel fabrics with a "diamond pattern," improving search accuracy and the user experience.
[0069] Step S50: Based on the global semantic vector V of the image and the sub-semantic vector B of the texture region generated in step S40, perform a cross-modal semantic alignment matching task in combination with the second text semantic vector, and output the final matching fabric result based on the multi-factor joint ranking mechanism.
[0070] It should be noted that the "cross-modal semantic alignment and matching task" refers to achieving joint alignment of visual semantics and textual semantics in a multi-scale semantic space by calculating the embedding spatial distance between the global semantic vector V of the image, the sub-semantic vector B of the texture region, and the second text semantic vector. This process does not simply use cosine similarity as the sole criterion; instead, it constructs a multi-factor scoring function that integrates three core indicators: semantic relevance, texture saliency consistency, and structural feature coverage. Adjustable parameters are used to adaptively optimize the ranking results.
[0071] It can be understood that the core design of this step addresses the following two problems: (1) the general-purpose CLIP model often lacks detail perception in alignment tasks, making it difficult to distinguish the structural difference between "plain weave cotton" and "twill weave cotton"; (2) although the fabric image is rich in texture information, if the matching scoring function does not give sufficient weight to the texture region, the matching result is very likely to be "correct in style but incorrect in detail". This invention introduces the texture region sub-semantic vector B into the scoring process, which specifically strengthens the influence of structural details on the alignment result and effectively alleviates the industry-level pain point of "semantic alignment but physical mismatch".
[0072] For example, when a user inputs "twill-weave, crisp washed Tencel fabric," the traditional CLIP method is likely to recommend soft, plain-weave fabrics with similar tones, ignoring fine-grained structural keywords such as "twill" and "crisp." This method instead raises the ranking weight of the "twill" dimension in the final result by introducing, in the Score function, an alignment mechanism between the second text semantic vector T2 and the texture region sub-semantic vector B. This ensures that the returned results prioritize fabric images with "twill directionality" and "crisp texture." Experimental results show that the visual structural consistency score of the Top-3 recommendations increased from 63.5% with the original model to 92.1%, significantly improving the practical value of the matching.
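The multi-factor ranking of step S50 can be illustrated with the following Python sketch. It is a hedged example only: the text states that a global score, a local texture score, and an inter-region discreteness index are combined, but the linear combination, the weights `(0.4, 0.5, 0.1)`, the plain mean used for Top-K aggregation, the use of mean pairwise cosine distance as the discreteness index, and the names `cosine`, `joint_score`, and `rank_fabrics` are all assumptions made for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def joint_score(t2, V, B, top_k=2, weights=(0.4, 0.5, 0.1)):
    """Combine S1, S2 and S3 linearly (assumed form). B must hold at least one sub-vector."""
    # S1: global image-text match between the second text vector and V.
    s1 = cosine(t2, V)
    # S2: mean of the top_k best local matches between the text and B's sub-vectors.
    local = sorted((cosine(t2, b) for b in B), reverse=True)[:top_k]
    s2 = sum(local) / len(local)
    # S3: mean pairwise cosine distance among sub-vectors (0.0 if only one exists).
    pairs = [(i, j) for i in range(len(B)) for j in range(i + 1, len(B))]
    s3 = sum(1.0 - cosine(B[i], B[j]) for i, j in pairs) / len(pairs) if pairs else 0.0
    w1, w2, w3 = weights
    return w1 * s1 + w2 * s2 + w3 * s3


def rank_fabrics(t2, candidates):
    """Return (name, score) pairs sorted from best to worst joint score."""
    scored = [(name, joint_score(t2, V, B)) for name, (V, B) in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-D embedding: axis 0 ~ "twill-like", axis 1 ~ "plain-like" (hypothetical).
t2 = [1.0, 0.0]
candidates = {
    "plain_weave": ([1.0, 0.1], [[0.0, 1.0], [0.1, 0.9]]),
    "twill_weave": ([1.0, 0.2], [[1.0, 0.0], [0.9, 0.1]]),
}
ranking = rank_fabrics(t2, candidates)
print(ranking)
```

In this toy setup the plain-weave candidate has the slightly higher global similarity, yet the twill candidate ranks first once the local texture term is included, which mirrors the "correct style but incorrect details" case discussed above.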
[0073] Example 2: Furthermore, the present invention provides a fabric matching system based on multimodal information matching, employing a fabric matching method based on multimodal information matching from the above embodiments, which can solve a technical problem related to fabric matching based on multimodal information matching. Compared with the prior art, the beneficial effects of the fabric matching system based on multimodal information matching provided by the present invention are the same as those of the fabric matching method based on multimodal information matching provided in the above embodiments, and other technical features of the fabric matching system based on multimodal information matching are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.
[0074] Example 3: This invention provides a fabric matching device based on multimodal information matching. Referring to Figure 2, the fabric matching device based on multimodal information matching includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the fabric matching method based on multimodal information matching described in Embodiment 1 above. The fabric matching device based on multimodal information matching in this embodiment may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital radio receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. This fabric matching device based on multimodal information matching is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this invention. The fabric matching device based on multimodal information matching may include a processing device 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory 1002 or a program loaded from a storage device 1003 into a random access memory 1004. The random access memory 1004 also stores various programs and data required for the operation of the fabric matching device based on multimodal information matching. The processing device 1001, the read-only memory 1002, and the random access memory 1004 are interconnected via a bus 1005. An I/O interface 1006 is also connected to the bus 1005.
Typically, the following devices can be connected to the I/O interface 1006: input devices 1007 including, for example, touchscreens, touchpads, keyboards, mice, image sensors, microphones, accelerometers, and gyroscopes; output devices 1008 including, for example, liquid crystal displays (LCDs), speakers, and vibrators; storage devices 1003; and communication devices 1009. The communication device 1009 allows the fabric matching device based on multimodal information matching to communicate with other devices, wirelessly or by wire, to exchange data. Although the figures show a fabric matching device based on multimodal information matching with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented.
[0075] Example 4: This invention also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the fabric matching method based on multimodal information matching described above. The computer program product provided by this invention can solve a technical problem related to fabric matching based on multimodal information matching. Compared with the prior art, the beneficial effects of the computer program product provided by this invention are the same as those of the fabric matching method based on multimodal information matching provided in the above embodiments, and will not be repeated here.
[0076] In particular, according to the embodiments disclosed in this invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device, or installed from storage device 1003, or installed from read-only memory 1002. When the computer program is executed by processing device 1001, it performs the functions defined in the methods of the embodiments disclosed in this invention.
[0077] It should be understood that the various parts disclosed in this invention can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics may be combined in any suitable manner in one or more embodiments or examples.
[0078] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A fabric matching method based on multimodal information matching, characterized in that the method comprises:
Step S10: Based on a preset fabric industry terminology lexicon W, perform semantic structuring enhancement processing on the user-input original text T, and output enhanced descriptive text; the semantic structuring enhancement processing includes employing a multi-level semantic regularization recognition mechanism and a semantic insertion and splicing strategy based on weighted hint embedding; the fabric industry terminology lexicon W includes N hierarchical fabric terms, comprising a material subset W1, a texture subset W2, a pattern subset W3, and a style modification subset W4; step S10 specifically includes: performing term matching in the original text T using the multi-level semantic regularization recognition mechanism to identify a term sequence of hierarchical fabric terms; performing industry annotation expansion on the term sequence of hierarchical fabric terms through a preset word meaning mapping table; and connecting the annotation expansion results using the semantic insertion and splicing strategy based on weighted hint embedding to output the enhanced descriptive text; the semantic insertion and splicing strategy based on weighted hint embedding refers to setting weights according to semantic importance and, guided by a prompt template, embedding missing key terms or modifiers into the original text according to set rules to construct structured enhanced descriptive text;
Step S20: Input the enhanced descriptive text into an improved CLIP text encoding model, output a first text semantic vector, and use a semantic category-guided trainable vector projection mechanism to extract, from the first text semantic vector, a second text semantic vector related to fabric structure attributes; the improved CLIP text encoding model specifically includes: an input layer for receiving the enhanced descriptive text; a semantic structure hint embedding layer for introducing guiding hint vectors representing fabric term classification; a multi-scale attention focusing layer for improving the model's attention response to fabric keywords at different levels; a domain feature adaptation enhancement layer for integrating the semantic context distribution pattern obtained by fine-tuning on a fabric industry corpus; and an output layer for outputting the first text semantic vector; the second text semantic vector is used for semantic alignment with the texture salient regions extracted from the image; the extraction of the second text semantic vector specifically includes: by setting a vector projection structure for fabric structure attribute recognition, performing weight learning and semantic clustering on each dimension of the first text semantic vector to identify and extract the sub-semantic feature parts related to fabric structure and texture morphology, and vectorizing the sub-semantic feature parts to obtain the second text semantic vector;
Step S30: Obtain the original fabric image from a preset background database, perform foreground mask extraction, multi-channel texture fidelity enhancement, and structural equalization normalization on the original fabric image, and output preprocessed image data;
Step S40: Perform texture saliency region segmentation on the preprocessed image data, and generate a global semantic vector V and a texture region sub-semantic vector B based on the region segmentation results; step S40 specifically includes: dividing the preprocessed image data into N local image regions using an adaptive image segmentation mechanism based on perceptual texture response, extracting the local feature vector set of each region, and generating the local feature vector of each region i from that local feature vector set; calculating the texture saliency score of each region based on its local feature vector; performing weighted fusion of the local feature vectors based on the texture saliency scores of all regions to obtain the global semantic vector V of the image; and arranging all regions in descending order of texture saliency score, selecting the top K regions, extracting their corresponding local feature vectors, and concatenating or averaging them to generate the texture region sub-semantic vector B;
Step S50: Based on the global semantic vector V of the image and the texture region sub-semantic vector B generated in step S40, perform a cross-modal semantic alignment matching task in combination with the second text semantic vector, and output the final matching fabric result based on a multi-factor joint ranking mechanism.
2. The fabric matching method based on multimodal information matching according to claim 1, characterized in that step S30, in which the original fabric image is obtained from a preset background database, foreground mask extraction, multi-channel texture fidelity enhancement, and structural equalization normalization are performed on the original fabric image, and the preprocessed image data is output, specifically includes:
Step S301: Retrieve the original image data from the preset background database using the HTTPS protocol and an open API;
Step S302: Based on a preset weakly supervised semantic segmentation model, perform pixel-level region segmentation on the original image data to generate a fabric region mask map, and use the fabric region mask map to mask the original image data, removing background region interference and obtaining optimized image data;
Step S303: Decompose the optimized image data into two sub-channels, a brightness channel and a texture direction channel, and improve the local structural clarity and the direction consistency of the respective sub-channels to obtain a reconstructed and enhanced image;
Step S304: Detect high-frequency abnormal regions in the reconstructed and enhanced image, and perform edge smoothing on wrinkle shadows, overexposed lighting, and reflection areas to preserve real texture information, and output the preprocessed image data.
3. The fabric matching method based on multimodal information matching according to claim 1, characterized in that, in step S40, the texture saliency score is defined by the following equation:
$$S_i = \alpha \cdot G(f_i) + \beta \cdot \lVert \nabla I_i \rVert_2 + \gamma \cdot H(I_i)$$
where $G(f_i)$ represents the response energy of the local feature vector $f_i$ after multi-scale, multi-directional Gabor filtering, and is used to measure the texture directionality of the region; $\lVert \nabla I_i \rVert_2$ represents the L2 norm of the image gradient, reflecting the intensity of structural changes in the local region; $H(I_i)$ represents the information entropy of the local pixel grayscale distribution, and is used to measure the image complexity or texture density of the region; and α, β, and γ are saliency weighting coefficients corresponding to the relative weights of the three dimensions above, satisfying α+β+γ=1.
4. The fabric matching method based on multimodal information matching according to claim 1, characterized in that step S50, in which a cross-modal semantic alignment matching task is performed based on the image global semantic vector V and the texture region sub-semantic vector B generated in step S40 in combination with the second text semantic vector, and the final matching fabric result is output based on a multi-factor joint ranking mechanism, specifically includes: calculating the cosine similarity between the second text semantic vector and the image global semantic vector V to obtain an image-text global semantic matching score S1, which measures the semantic consistency between the text description and the overall style of the image; calculating the cosine similarity between the second text semantic vector and each sub-vector of the texture region sub-semantic vector B, and weighting and fusing the local similarity scores according to a preset Top-K aggregation rule to obtain a local texture semantic matching score S2, which measures the fine-grained semantic association between the text description and the key texture structure of the image; calculating an inter-region texture semantic discreteness index S3 based on the semantic angle or distance between all sub-vectors in the texture region sub-semantic vector B, which reflects the diversity distribution of the image texture structure; constructing a joint matching score function based on the image-text global semantic matching score S1, the local texture semantic matching score S2, and the inter-region texture semantic discreteness index S3, the joint matching score function outputting a joint matching score result; and
sorting the original fabric images of step S30 according to the joint matching score result, and outputting the final matching fabric result.
5. A fabric matching system based on multimodal information matching, applied to the fabric matching method based on multimodal information matching according to any one of claims 1 to 4, characterized in that the fabric matching system based on multimodal information matching includes:
a text semantic enhancement module, used to perform semantic structuring enhancement processing on the user-input original text T based on a preset fabric industry terminology lexicon W, and output enhanced descriptive text, wherein the semantic structuring enhancement processing includes employing a multi-level semantic regularization recognition mechanism and a semantic insertion and splicing strategy based on weighted hint embedding;
a fabric structure attribute semantic vector extraction module, used to input the enhanced descriptive text into the improved CLIP text encoding model, output a first text semantic vector, and use a semantic category-guided trainable vector projection mechanism to extract, from the first text semantic vector, a second text semantic vector related to fabric structure attributes;
a fabric image preprocessing module, used to obtain the original fabric image from the preset background database, perform foreground mask extraction, multi-channel texture fidelity enhancement, and structural equalization normalization on the original fabric image, and output preprocessed image data;
a texture saliency modeling module, used to perform texture saliency region segmentation on the preprocessed image data and generate a global semantic vector V and a texture region sub-semantic vector B based on the region segmentation results; and
a cross-modal alignment and ranking output module, used to perform a cross-modal semantic alignment matching task based on the image global semantic vector V and the texture region sub-semantic vector B generated in step S40, in combination with the second text semantic vector, and output the final matching fabric result based on the multi-factor joint ranking mechanism.
6. A fabric matching device based on multi-modal information matching, characterized by, The fabric matching device based on multimodal information matching includes: a memory, a processor, and a fabric matching program based on multimodal information matching stored in the memory and executable on the processor. When the fabric matching program based on multimodal information matching is executed by the processor, it implements a fabric matching method based on multimodal information matching according to any one of claims 1 to 4.
7. A computer program product, characterised in that, The computer program product includes a fabric matching program based on multimodal information matching, which, when executed by a processor, implements a fabric matching method based on multimodal information matching as described in any one of claims 1 to 4.