Constraint propagation based multi-modal image processing method and system

By parsing and fusing constraint information in an image processing system to generate constraint-aware index units, the problem of constraint information not participating in feature construction in existing technologies is solved, thereby improving the accuracy of image retrieval and cross-system security.

CN121636743B · Active · Publication Date: 2026-06-19 · AACAT TECHNOLOGY LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
AACAT TECHNOLOGY LTD
Filing Date
2026-02-05
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, constraint information in image processing and retrieval systems fails to participate in the core processes of image content understanding and feature construction, resulting in low accuracy of retrieval results. Furthermore, comparisons made under different constraint benchmarks are unreliable, making it difficult to achieve cross-system privacy protection equivalence determination.

Method used

Constraint information is parsed into a set of structured constraints, constraint satisfaction representations are calculated, and these representations are fused during multimodal feature extraction. A constraint-aware index unit is then constructed, so that both multimodal feature extraction and similarity comparison are regulated by the constraints.

Benefits of technology

It improves the accuracy and stability of multimodal retrieval and deduplication tasks, ensures comparisons under the same or compatible benchmarks, and enhances the security and scalability of the system in cross-domain collaboration scenarios.

Generated by Eureka AI based on patent content.


Abstract

This invention provides a multimodal image processing method and system based on constraint propagation, relating to the field of computer vision, aiming to solve the problems of unstable features and unreliable comparisons caused by image constraints not participating in core computation. The method generates structured constraints by parsing requirement information, calculates the degree to which the image satisfies the constraints, and integrates the constraints and satisfaction degree into feature extraction to generate modulated features. Then, based on the modulated features, constraints, and satisfaction degree, a unified constraint-aware index unit is constructed. When comparing index units, constraint compatibility is checked first, followed by similarity determination. This invention uses constraints as the core computational element, ensuring the reliability of comparisons and improving feature stability and judgment accuracy.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and information retrieval technology, and more specifically, to a multimodal image processing method and system based on constraint propagation.

Background Technology

[0002] In image processing and retrieval systems, in order to ensure the quality and compliance of image data, it is usually necessary to impose various constraints on the image's clarity, integrity, layout structure, or visibility of specific content.

[0003] In existing technologies, the methods for handling these constraints are typically quite simple. One common approach is to pre-filter images before the image processing workflow begins, discarding those that do not meet the constraints. Another approach is to perform posterior validation on the results after feature extraction and retrieval, using it as a reference for human intervention. A common drawback of these methods is that the constraint information is merely used as an independent judgment condition and fails to participate in the core processes of image content understanding and feature construction. Therefore, even if an image barely passes the filtering, the stability of its feature representation will still be affected by regions that do not meet the constraints, leading to low accuracy in subsequent similarity calculations and retrieval results.

[0004] Some more advanced existing technologies, such as those used in intelligent assisted diagnostic applications, attempt to incorporate expert rules into the feature generation process to guide neural networks in generating features that better match the rules. However, in scenarios requiring large-scale image comparison, existing technologies still lack a standardized data structure that uniformly encapsulates image feature information together with the context in which it was generated, i.e., the constraints it follows and the degree to which they are actually satisfied. This makes comparisons under different constraint benchmarks unreliable and poses challenges to cross-system, privacy-preserving equivalence determination.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, the purpose of this invention is to provide a multimodal image processing method and system based on constraint propagation.

[0006] A multimodal image processing method based on constraint propagation provided by the present invention includes:

[0007] Step S1: Obtain the requirement information related to image processing from the task environment, and parse the requirement information into a set of structured constraints;

[0008] Step S2: Calculate the degree to which the target image satisfies each structured constraint in the set of structured constraints, and generate a constraint satisfaction representation;

[0009] Step S3: During the extraction of multimodal features, the constraint satisfaction representation is fused to generate constraint-controlled multimodal features.

[0010] Preferably, step S1 includes:

[0011] Step S1.1: Obtain the template file or page structure information of the task environment; the template file or page structure information of the task environment includes the location and requirements of each key functional area;

[0012] Step S1.2: Parse the obtained template file or page structure information into a set of structured constraints R; the set of structured constraints R includes: regional visibility structured constraints, layout matching constraints, and text readability structured constraints.

[0013] The structured constraints include: constraint identifier, constraint type, and constraint scope.

[0014] Preferably, step S2 includes:

[0015] Step S2.1: Identify the constraint type based on the constraint identifier. When the constraint type is a regional visibility structured constraint, calculate the variance of the Laplacian operator in the image region defined by the constraint scope of the target image, and normalize the calculated variance to obtain the constraint satisfaction degree related to the regional sharpness, which is used to characterize the regional sharpness.

[0016] Step S2.2: Identify the constraint type based on the constraint identifier. When the constraint type is layout matching constraint, calculate the similarity between the layout and the standard template of the target image within the constraint scope to obtain the constraint satisfaction degree related to the layout, which is used to characterize the layout matching degree of the region.

[0017] Step S2.3: Identify the constraint type based on the constraint identifier. When the constraint type is a text readability structured constraint, call the optical character recognition engine to identify the text within the constraint scope, and then calculate the text readability score based on the recognition confidence to obtain the constraint satisfaction degree related to the text, which is used to characterize the text readability of the region.

[0018] Step S2.4: Combine the satisfaction scores of all constraints to generate a constraint satisfaction vector that quantifies the level of satisfaction.

[0019] Preferably, step S3 includes:

[0020] Multiple parallel feature extraction channels are constructed to extract features from different modalities; the multiple parallel feature extraction channels include layout feature channels, text feature channels, and visual feature channels;

[0021] When the constraint satisfaction related to the layout is less than the preset value, the structure that does not meet the layout matching constraint is blocked, and then the layout feature g is generated by a graph neural network based on the layout feature channel.

[0022] When the satisfaction of constraints related to the text is less than the preset value, the text that does not meet the text readability structure constraints is suppressed, and then the deep text semantic features t are extracted based on the text feature channels using a language model based on the Transformer architecture.

[0023] When the constraint satisfaction related to regional sharpness is less than the preset value, the region that does not satisfy the constraint related to image sharpness is deleted, and then visual features v are extracted using a convolutional neural network based on the visual feature channel;

[0024] The layout feature g, textual semantic feature t, and visual feature v are fused to generate the final fused feature F.

[0025] Preferably, the method further includes:

[0026] Step S4: Construct a constraint-aware index unit based on the constraint-controlled multimodal features, the constraint satisfaction representation, and the structured constraint set;

[0027] Step S5: Construct constraint-aware index units for target image A and target image B, and compare the constraint-aware index units of target image A and target image B to determine whether target image A and target image B are duplicate images.

[0028] Preferably, step S5 includes:

[0029] The process checks whether the structured constraint sets contained in the constraint-aware index unit CAU-1 of target image A and the constraint-aware index unit CAU-2 of target image B are consistent or compatible. If the constraints are inconsistent, the process terminates, and the two are considered incomparable. If the constraint consistency check passes, the fusion features F1 and F2 in the constraint-aware index unit CAU-1 of target image A and the constraint-aware index unit CAU-2 of target image B are extracted, and the similarity between the fusion features F1 and F2 is calculated using cosine similarity. The calculated similarity is compared with a preset threshold. If the similarity is greater than the preset threshold, image A and image B are finally determined to be duplicate images; otherwise, they are determined to be non-duplicate.

[0030] A multimodal image processing system based on constraint propagation according to the present invention includes:

[0031] Module M1: Obtains image processing-related requirement information from the task environment and parses the requirement information into a set of structured constraints;

[0032] Module M2: Calculates the degree to which the target image satisfies each structured constraint in the structured constraint set, and generates a constraint satisfaction representation;

[0033] Module M3: During the extraction of multimodal features, the constraint satisfaction representation is fused to generate constraint-controlled multimodal features.

[0034] Preferably, the module M1 includes:

[0035] Module M1.1: Obtain the template file or page structure information of the task environment; the template file or page structure information of the task environment includes the location and requirements of each key functional area;

[0036] Module M1.2: Parses the acquired template file or page structure information into a set of structured constraints R; the set of structured constraints R includes: regional visibility structured constraints, layout matching constraints, and text readability structured constraints;

[0037] The structured constraints include: constraint identifier, constraint type, and constraint scope.

[0038] Preferably, the module M2 includes:

[0039] Module M2.1: Based on constraint identifiers, the constraint type is identified. When the constraint type is a regional visibility structured constraint, the variance of the Laplacian operator is calculated within the image region defined by the constraint scope of the target image. The calculated variance is then normalized to obtain the constraint satisfaction degree related to the regional sharpness, which is used to characterize the regional sharpness.

[0040] Module M2.2: Based on constraint identifiers, the constraint type is identified. When the constraint type is layout matching constraint, the similarity between the layout and the standard template of the target image is calculated within the constraint scope to obtain the constraint satisfaction degree related to the layout, which is used to characterize the layout matching degree of the region.

[0041] Module M2.3: Based on constraint identifiers, it identifies constraint types. When the constraint type is a text readability structured constraint, it calls the optical character recognition engine to identify the text within the constraint scope, and then calculates the text readability score based on the recognition confidence to obtain the constraint satisfaction degree related to the text, which is used to characterize the text readability of the region.

[0042] Module M2.4: Combines the satisfaction scores of all constraints to generate a constraint satisfaction vector that quantifies the level of satisfaction.

[0043] Preferably, the module M3 includes:

[0044] Multiple parallel feature extraction channels are constructed to extract features from different modalities; the multiple parallel feature extraction channels include layout feature channels, text feature channels, and visual feature channels;

[0045] When the constraint satisfaction related to the layout is less than the preset value, the structure that does not meet the layout matching constraint is blocked, and then the layout feature g is generated by a graph neural network based on the layout feature channel.

[0046] When the satisfaction of constraints related to the text is less than the preset value, the text that does not meet the text readability structure constraints is suppressed, and then the deep text semantic features t are extracted based on the text feature channels using a language model based on the Transformer architecture.

[0047] When the constraint satisfaction related to regional sharpness is less than the preset value, the region that does not satisfy the constraint related to image sharpness is deleted, and then visual features v are extracted using a convolutional neural network based on the visual feature channel;

[0048] The layout feature g, textual semantic feature t, and visual feature v are fused to generate the final fused feature F.

[0049] Compared with the prior art, the present invention has the following beneficial effects:

[0050] 1. This invention deeply integrates constraint information into the entire process of feature extraction calculation, enabling feature representation to actively adapt to constraint requirements, reducing the interference of non-compliant content on the results, thereby significantly improving the accuracy and stability of multimodal retrieval and deduplication tasks;

[0051] 2. This invention upgrades constraints from simple filtering conditions in existing technologies to computationally calculable, propagable, and core computational technical elements, thus innovating the technical paradigm of image processing. Furthermore, by constructing a unified constraint-aware index unit, a compatibility check of constraint conditions is performed before similarity comparison, ensuring that the comparison is conducted under the same or compatible benchmark, fundamentally avoiding misjudgments caused by inconsistent constraints.

[0052] 3. The unified index unit structure provides a controllable constraint basis for equivalence determination under privacy protection conditions, enhancing the security and scalability of the system in cross-domain collaboration scenarios.

[0053] 4. This invention transforms constraints in image processing into core elements that can participate in computation and propagation, and provides a unified data structure that can encapsulate image features, the constraints they follow, and the degree to which they are satisfied, in order to improve the stability of multimodal features and the accuracy of similarity determination.

Attached Figure Description

[0054] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0055] Figure 1 is a schematic diagram of the overall process of the multimodal image processing method based on constraint propagation.

[0056] Figure 2 is a schematic diagram of the constraint generation and propagation process.

[0057] Figure 3 is a schematic diagram of the structure of the constraint-aware index unit.

[0058] Figure 4 is a flowchart of the equivalence determination process under constraints.

Detailed Implementation

[0059] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.

[0060] Example 1

[0061] According to the present invention, a multimodal image processing method based on constraint propagation is provided which, as shown in Figures 1 to 4, includes:

[0062] Step S1: Obtain the requirement information related to image processing from the task environment, and parse the requirement information into a set of structured constraints;

[0063] Specifically, step S1 includes:

[0064] Step S1.1: Obtain the template file or page structure information of the task environment; the template file or page structure information of the task environment includes the location and requirements of each key functional area;

[0065] Step S1.2: Parse the obtained template file or page structure information into a set of structured constraints R; the set of structured constraints R includes: regional visibility structured constraints, overall layout matching constraints, and text readability structured constraints.

[0066] The structured constraints include: constraint identifier, constraint type, and constraint scope.

[0067] Step S2: Calculate the degree to which the target image satisfies each structured constraint in the set of structured constraints, and generate a constraint satisfaction representation;

[0068] Specifically, step S2 includes:

[0069] Step S2.1: For the regional visibility structured constraint in the structured constraint set, the variance of the Laplacian operator of the target image is calculated within the image region defined by the constraint scope, and the calculated variance is normalized to obtain the constraint satisfaction degree related to the regional sharpness, which is used to characterize the regional sharpness.

[0070] Step S2.2: For the overall layout matching degree constraint in the structured constraint set, the similarity between the overall layout of the target image and the standard template is calculated within the constraint scope to obtain the constraint satisfaction degree related to the layout, which is used to characterize the layout matching degree of the region.

[0071] Step S2.3: For the text readability structured constraints in the structured constraint set, call the optical character recognition engine to recognize the text within the constraint scope, and then calculate the text readability score based on the recognition confidence to obtain the constraint satisfaction degree related to the text, which is used to characterize the text readability of the region.

[0072] Step S2.4: Combine the satisfaction scores of all constraints to generate a constraint satisfaction vector that quantifies the level of satisfaction.

[0073] Step S3: During the extraction of multimodal features, the constraint satisfaction representation is fused to generate constraint-controlled multimodal features;

[0074] Specifically, step S3 includes:

[0075] Multiple parallel feature extraction channels are constructed to extract features from different modalities; the multiple parallel feature extraction channels include layout feature channels, text feature channels, and visual feature channels;

[0076] When the constraint satisfaction related to the layout is less than the preset value, the structure that does not meet the layout matching constraint is blocked, and then the layout feature g is generated by a graph neural network based on the layout feature channel.

[0077] When the satisfaction of constraints related to the text is less than the preset value, the text that does not meet the text readability structure constraints is suppressed, and then the deep text semantic features t are extracted based on the text feature channels using a language model based on the Transformer architecture.

[0078] When the constraint satisfaction related to regional sharpness is less than the preset value, the region that does not satisfy the constraint related to image sharpness is deleted, and then visual features v are extracted using a convolutional neural network based on the visual feature channel;

[0079] The layout feature g, textual semantic feature t, and visual feature v are fused to generate the final fused feature F.

[0080] Step S4: Construct a constraint-aware index unit based on the constraint-controlled multimodal features, the constraint satisfaction representation, and the structured constraint set;

[0081] Step S5: Construct constraint-aware index units for target image A and target image B, and compare the constraint-aware index units of target image A and target image B to determine whether target image A and target image B are duplicate images.

[0082] Specifically, step S5 includes:

[0083] The process checks whether the structured constraint sets contained in the constraint-aware index unit CAU-1 of target image A and the constraint-aware index unit CAU-2 of target image B are consistent or compatible. If the constraints are inconsistent, the process terminates, and the two are considered incomparable. If the constraint consistency check passes, the fusion features F1 and F2 in the constraint-aware index unit CAU-1 of target image A and the constraint-aware index unit CAU-2 of target image B are extracted, and the similarity between the fusion features F1 and F2 is calculated using cosine similarity. The calculated similarity is compared with a preset threshold. If the similarity is greater than the preset threshold, image A and image B are finally determined to be duplicate images; otherwise, they are determined to be non-duplicate.

[0084] The present invention also provides a constraint propagation-based multimodal image processing system, which can be implemented by executing the process steps of the constraint propagation-based multimodal image processing method. That is, those skilled in the art can understand the constraint propagation-based multimodal image processing method as a preferred embodiment of the constraint propagation-based multimodal image processing system.

[0085] Example 2

[0086] Example 2 is a preferred example of Example 1.

[0087] According to the present invention, a multimodal image processing method based on constraint propagation is provided which, as shown in Figures 1 to 4, includes: a structured constraint generation step, a constraint satisfaction encoding step, a constraint propagation and feature extraction step, a constraint-aware index construction step, and an equivalence determination step under constraints.

[0088] In the structured constraint generation step, requirement information is obtained from a preset task environment. In this embodiment, the requirement information comes from a template file describing the standard structure of the target webpage. The template file can be a JSON document that defines the location and requirements of each key functional area on the page.

[0089] The template file is read and parsed into a set of structured constraints, R. This set R contains three structured constraints: the first is identified as `logo_area`, of type `region_visibility`, with a scope defined as the rectangular region in the image coordinate system determined by points (10,10) and (100,50); the second is identified as `title_text`, of type `text_readability`, with a scope that is the rectangular region from coordinates (110,20) to (300,40); and the third is identified as `layout_match`, of type `layout_similarity`. Each structured constraint is a structured data unit containing constraint identifier, constraint type, and constraint scope information, which together constitute the set of structured constraints, R.
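The parsing described in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation: the JSON key names (`areas`, `id`, `type`, `scope`), the `StructuredConstraint` class, and the use of a null scope for `layout_match` (the text gives no scope for it) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StructuredConstraint:
    """One structured constraint: identifier, type, and scope."""
    constraint_id: str
    constraint_type: str
    # (x1, y1, x2, y2) in image coordinates; None is an assumed
    # convention meaning "no explicit region given".
    scope: Optional[Tuple[int, int, int, int]]


# Hypothetical template layout. The source only says the template
# "can be a JSON document"; these key names are illustrative.
TEMPLATE_JSON = """
{
  "areas": [
    {"id": "logo_area", "type": "region_visibility", "scope": [10, 10, 100, 50]},
    {"id": "title_text", "type": "text_readability", "scope": [110, 20, 300, 40]},
    {"id": "layout_match", "type": "layout_similarity", "scope": null}
  ]
}
"""


def parse_template(text: str) -> List[StructuredConstraint]:
    """Parse the template document into the constraint set R."""
    doc = json.loads(text)
    constraints = []
    for area in doc["areas"]:
        scope = area.get("scope")
        constraints.append(
            StructuredConstraint(
                constraint_id=area["id"],
                constraint_type=area["type"],
                scope=tuple(scope) if scope is not None else None,
            )
        )
    return constraints


R = parse_template(TEMPLATE_JSON)
```

Making the constraint immutable (`frozen=True`) is a design choice for the sketch: the set R is later stored alongside the features, so it should not change after parsing.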

[0090] The constraint satisfaction encoding step receives a screenshot of a webpage to be processed. Upon receiving the screenshot, it analyzes the content of the input image based on the constraints in the structured constraint set R and calculates the degree to which the image satisfies each constraint. For example, for the visibility constraint of the region identified as logo_area, the image sharpness evaluation algorithm is called to calculate the variance of its Laplacian operator within the image region defined by the constraint scope [10,10,100,50]. The larger the variance value, the richer the image edges and the clearer the content. After normalization, a continuous value between 0 and 1 is obtained, for example, 0.95, indicating high sharpness of the region. For the text readability constraint identified as title_text, the optical character recognition engine is invoked to identify the text within the constraint scope [110,20,300,40]. A readability score is then calculated based on the recognition confidence, for example, 0.60, indicating that the text in this area is partially readable but of low quality. For the overall layout matching constraint identified as layout_match, the similarity between the overall layout of the target image and the standard template is calculated within the constraint scope, for example, 0.8, indicating a high degree of layout matching in the region. Finally, the satisfaction scores of all constraints are combined to generate a constraint satisfaction vector S that quantifies the satisfaction level, for example, S=[0.95,0.60,0.8]. This continuous numerical representation method can more finely characterize the compliance level of the image compared to the traditional binary judgment.
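As a rough sketch of the sharpness part of this step, the code below computes the variance of a 4-neighbour Laplacian over a region and maps it into the range 0 to 1. The text does not specify the Laplacian kernel or the normalization, so the kernel, the `v / (v + scale)` mapping, the `scale` value, and the half-open interpretation of the scope are assumptions. The readability and layout scores are simply taken from the example values in the paragraph.

```python
from statistics import pvariance


def laplacian_variance(gray, scope):
    """Variance of the 4-neighbour Laplacian response inside a region.

    gray  : list of rows of grayscale intensities
    scope : (x1, y1, x2, y2), treated here as half-open pixel bounds
    """
    x1, y1, x2, y2 = scope
    responses = []
    # Only interior pixels of the region have all four neighbours.
    for y in range(y1 + 1, y2 - 1):
        for x in range(x1 + 1, x2 - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return float(pvariance(responses)) if responses else 0.0


def normalize(variance, scale=100.0):
    """Assumed normalization: maps [0, inf) monotonically into [0, 1)."""
    return variance / (variance + scale)


# Two tiny synthetic images: a flat patch and a high-contrast checkerboard.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]
region = (0, 0, 8, 8)

sharpness_flat = normalize(laplacian_variance(flat, region))
sharpness_checker = normalize(laplacian_variance(checker, region))

# Constraint satisfaction vector S in the order used in the text:
# [region sharpness, text readability, layout match]. The last two
# values are the example scores from the paragraph, not computed here.
S = [sharpness_checker, 0.60, 0.8]
```

The flat patch has no edges and scores 0, while the checkerboard scores close to 1, which matches the statement that a larger variance indicates richer edges and clearer content.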

[0091] The constraint propagation and feature extraction steps integrate multiple parallel feature extraction channels, each used to extract features from different modalities, including: layout / structure channel, text feature channel, and visual feature channel. Upon receiving the structured constraint set R and the constraint satisfaction vector S, the constraint propagation controller sends differentiated control commands to these three channels.

[0092] In the layout / structure feature channel, a graph neural network is used to model the layout structure of the image. Each element of the image, such as a text block, image, or table, is considered a node in the graph, and the spatial proximity or logical relationship between nodes is considered an edge. This channel aims to generate a layout feature g that represents the overall macroscopic structure of the image. When the constraint satisfaction related to the layout is low, for example, 0.45, below a preset threshold of 0.5, the controller issues instructions to the layout feature channel to adjust the graph structure, including: masking or weakening layout structures that do not meet the structural constraints before the graph neural network propagates information; thus, the final generated layout feature g better reflects the main, compliant structure of the image and reduces interference caused by layout disorder.
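The masking of non-compliant structures before propagation can be illustrated with a toy graph. This is only a sketch: a single round of mean-neighbour aggregation stands in for the graph neural network, and the node names, the per-node `compliant` flags, and the helper functions are hypothetical. The threshold of 0.5 and the satisfaction of 0.45 are the example values from the paragraph.

```python
def mask_edges(edges, compliant):
    """Drop every edge that touches a non-compliant layout element."""
    return [(a, b) for a, b in edges if compliant[a] and compliant[b]]


def propagate(features, edges):
    """One round of mean aggregation over each node and its neighbours.

    A stand-in for one message-passing layer of a graph neural network.
    """
    neighbours = {name: [] for name in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    out = {}
    for name, vec in features.items():
        group = [vec] + [features[m] for m in neighbours[name]]
        out[name] = [sum(col) / len(group) for col in zip(*group)]
    return out


def layout_node_features(features, edges, compliant,
                         layout_satisfaction, threshold=0.5):
    """Mask the graph only when layout satisfaction is below the threshold."""
    if layout_satisfaction < threshold:
        edges = mask_edges(edges, compliant)
    return propagate(features, edges)


# Hypothetical page elements; "stray" violates the layout constraint.
features = {"header": [1.0, 0.0], "body": [0.0, 1.0], "stray": [10.0, 10.0]}
edges = [("header", "body"), ("body", "stray")]
compliant = {"header": True, "body": True, "stray": False}

masked = layout_node_features(features, edges, compliant, layout_satisfaction=0.45)
unmasked = layout_node_features(features, edges, compliant, layout_satisfaction=0.80)
```

With masking, the `body` node aggregates only from `header`; without it, the outlying `stray` element dominates the aggregated value, which is the kind of interference the paragraph describes.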

[0093] In the text feature channel, a language model based on the Transformer architecture, such as the BERT model, is used to extract deep text semantic features t. This channel first extracts all text content and its location from the image using optical character recognition (OCR). When the constraint satisfaction related to the text is extremely low, it indicates that key text information is missing, and the controller sends a suppression command to the text feature channel. Specifically, before feeding the text extracted from the corresponding text region through OCR into the BERT model, the corresponding input embedding vector is multiplied by a near-zero suppression coefficient, or directly set to zero. This ensures that even if there are some incorrect recognition results or irrelevant text in the region, they have almost no effect in subsequent self-attention calculations and semantic information extraction, thus guaranteeing that the final generated text feature t is not contaminated by this invalid information.
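The suppression described here amounts to scaling selected input embeddings before they reach the language model. The sketch below shows only that scaling step on plain Python lists; it does not call any real BERT implementation. The function name, the `in_region` flags, and the threshold of 0.5 are assumptions, since the text only says "extremely low" for this channel.

```python
def suppress_embeddings(embeddings, in_region, satisfaction,
                        threshold=0.5, coeff=0.0):
    """Scale the embeddings of tokens from a low-satisfaction region.

    embeddings   : list of token embedding vectors
    in_region    : per-token flag, True if the token came from the
                   constrained text region
    satisfaction : text-related constraint satisfaction in [0, 1]
    coeff        : suppression coefficient (0.0 zeroes the vectors;
                   a near-zero value would only weaken them)
    """
    if satisfaction >= threshold:
        return [list(vec) for vec in embeddings]
    return [
        [x * coeff for x in vec] if flagged else list(vec)
        for vec, flagged in zip(embeddings, in_region)
    ]


# Toy 2-dimensional embeddings for three OCR tokens.
token_embeddings = [[0.2, -0.4], [0.9, 0.1], [0.5, 0.5]]
in_title_region = [False, True, True]

suppressed = suppress_embeddings(token_embeddings, in_title_region, satisfaction=0.10)
unchanged = suppress_embeddings(token_embeddings, in_title_region, satisfaction=0.60)
```

With a satisfaction of 0.60, as in the `title_text` example, nothing is suppressed under the assumed threshold; with 0.10 the two tokens from the region are zeroed and the remaining token is left intact.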

[0094] In the visual feature channel, a convolutional neural network is used to extract general visual features v. An attention mechanism adjusts the visual feature weights of each constrained image region according to its sharpness-related constraint satisfaction: when the satisfaction is extremely low, the corresponding region is down-weighted or removed. In this example, since the logo_area region has a high satisfaction (0.95), its feature weights are enhanced or maintained, ensuring that key visual information dominates the final features.

[0095] After the constraint-regulated features (layout feature g, text feature t, and visual feature v) are generated, a multimodal fusion engine fuses these three feature vectors, for example through concatenation, weighted summation, or a more complex attention-based fusion mechanism, to generate the final fused feature F.
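Two of the fusion options named in this paragraph, concatenation and weighted summation, can be sketched in a few lines. The function names and the default equal weights are assumptions; attention-based fusion is not shown.

```python
def fuse_concat(g, t, v):
    """Fuse by concatenation: F has len(g) + len(t) + len(v) entries."""
    return list(g) + list(t) + list(v)


def fuse_weighted(g, t, v, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fuse by weighted summation; all three vectors must share a length."""
    if not (len(g) == len(t) == len(v)):
        raise ValueError("weighted summation needs equal-length vectors")
    wg, wt, wv = weights
    return [wg * a + wt * b + wv * c for a, b, c in zip(g, t, v)]


F_concat = fuse_concat([1.0, 2.0], [3.0], [4.0, 5.0])
F_weighted = fuse_weighted([1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                           weights=(0.5, 0.25, 0.25))
```

Concatenation keeps each modality in its own coordinates, while weighted summation requires the three channels to project into a shared dimension first.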

[0096] The constraint-aware index construction step combines the fused features F generated in the previous step, the representation of the structured constraint set R, and the constraint satisfaction vector S to construct a unified constraint-aware index unit CAU. In this embodiment, the index unit CAU can be defined as a tuple or structure: CAU = <F, R_repr, S>. Here, F is a 512-dimensional feature vector, R_repr is the second representation of the constraint set R, and S is a vector [0.95, 0.60, 0.8]. These three parts are encapsulated together to form an indivisible data unit, which is then stored in a database or indexing system. This constraint-aware index unit (CAU), as a whole, contains both the image's content information (features F) and records the constraints under which this content information was generated (constraints R), as well as the degree to which these constraints are actually satisfied (satisfaction S).
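One possible shape for the tuple CAU = <F, R_repr, S> is an immutable record. In the sketch below, `R_repr` is computed as a SHA-256 digest of a canonical JSON serialization of R. That is purely an assumption for illustration, since the text does not say how R_repr is derived; the zero-filled F is likewise a placeholder for a real 512-dimensional feature vector.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CAU:
    """Constraint-aware index unit: features, constraint repr, satisfaction."""
    F: Tuple[float, ...]
    R_repr: str
    S: Tuple[float, ...]


def constraint_repr(constraints):
    """Assumed derivation of R_repr: a digest of the canonicalized set R.

    Sorting by identifier makes the result independent of the order in
    which the constraints were listed in the template.
    """
    canonical = json.dumps(
        sorted(constraints, key=lambda c: c["id"]), sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


R = [
    {"id": "logo_area", "type": "region_visibility", "scope": [10, 10, 100, 50]},
    {"id": "title_text", "type": "text_readability", "scope": [110, 20, 300, 40]},
    {"id": "layout_match", "type": "layout_similarity", "scope": None},
]

F = tuple([0.0] * 512)  # placeholder for the fused feature vector
cau = CAU(F=F, R_repr=constraint_repr(R), S=(0.95, 0.60, 0.8))
```

Freezing the dataclass mirrors the description of the unit as indivisible: once stored, its three parts cannot be modified independently of one another.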

[0097] The equivalence determination steps under constraints are as follows: When it is necessary to determine whether a new webpage screenshot, denoted as image B, is a duplicate of a screenshot already existing in the database, denoted as image A, the above steps are first performed on image B to generate its corresponding constraint-aware index unit CAU-2. Then, the constraint-aware index unit CAU-1 corresponding to image A is retrieved from the database, and CAU-1 and CAU-2 are compared. Specifically, this comparison process first checks whether the structured constraint sets contained in CAU-1 and CAU-2 are consistent or compatible, that is, whether their R_repr are the same. Subsequent feature comparison is only meaningful when the two screenshots are processed under completely identical constraint benchmarks. If the constraints are inconsistent, the determination process terminates, and the two are considered incomparable. If the constraint consistency check passes, the process continues, extracting the fusion features F1 and F2 from CAU-1 and CAU-2, and calculating their similarity. A commonly used calculation method is cosine similarity, and its calculation formula is:

[0098] $$\mathrm{sim}(F_1, F_2) = \frac{F_1 \cdot F_2}{\lVert F_1 \rVert \, \lVert F_2 \rVert}$$

[0099] The calculated similarity value is a continuous numerical value between -1 and 1. This similarity value is then sent to the constraint-related threshold determination module and compared with a preset threshold, such as 0.98. If the similarity is greater than 0.98, image A and image B are ultimately determined to be duplicate images; otherwise, they are determined to be non-duplicate.
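The determination flow described above can be sketched as follows, as an illustration only. The function names, the representation of a CAU as a plain `(F, R_repr, S)` tuple, the use of strict equality of R_repr as the compatibility check, and the return value `None` for "incomparable" are assumptions of this sketch. The default threshold of 0.98 is the example value given in this embodiment.

```python
import math
from typing import Optional, Sequence, Tuple

# A CAU is modelled here as a plain tuple (F, R_repr, S).
CAU = Tuple[Sequence[float], str, Sequence[float]]


def cosine_similarity(f1: Sequence[float], f2: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def is_duplicate(cau1: CAU, cau2: CAU,
                 threshold: float = 0.98) -> Optional[bool]:
    """Return None if the constraint sets differ (incomparable),
    otherwise True/False according to the similarity threshold."""
    f1, r1, _ = cau1
    f2, r2, _ = cau2
    if r1 != r2:  # constraint check comes first
        return None
    return cosine_similarity(f1, f2) > threshold


# Toy 3-dimensional features; a real F would be 512-dimensional.
cau_a = ([1.0, 0.0, 1.0], "template-v1", [0.95, 0.60, 0.8])
cau_b = ([1.0, 0.05, 1.0], "template-v1", [0.93, 0.62, 0.8])
cau_c = ([1.0, 0.0, 1.0], "template-v2", [0.95, 0.60, 0.8])

print(is_duplicate(cau_a, cau_b))  # same constraints, near-identical features
print(is_duplicate(cau_a, cau_c))  # different constraints: incomparable
```

Returning a three-valued result keeps "incomparable" distinct from "non-duplicate", so that a caller cannot mistake a failed constraint check for a negative match.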

[0100] This embodiment enables more refined processing of images that are "partially unclear" or "partially non-compliant." The generated features better reflect the condition of key areas, thus more accurately identifying substantial duplicates during deduplication, significantly reducing misjudgments caused by local quality issues, and improving the accuracy and stability of image processing.
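To illustrate how a "partially unclear" region might be detected and excluded, the following sketch computes the variance of a Laplacian response over a grayscale region, normalizes it into a satisfaction degree, and decides whether the region is kept for visual feature extraction. The 4-neighbour Laplacian kernel, the normalization constant `scale`, the preset value 0.5, and all function names are assumptions of this sketch; the normalization method and preset value are not fixed by this embodiment.

```python
from typing import List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("region must be at least 3x3 pixels")
    responses: List[float] = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            responses.append(gray[y - 1][x] + gray[y + 1][x]
                             + gray[y][x - 1] + gray[y][x + 1]
                             - 4 * gray[y][x])
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def sharpness_satisfaction(gray: Sequence[Sequence[float]],
                           scale: float = 1000.0) -> float:
    """Map the variance into [0, 1]; `scale` is an assumed constant."""
    return min(laplacian_variance(gray) / scale, 1.0)


def keep_region(satisfaction: float, preset: float = 0.5) -> bool:
    """A region below the preset value is excluded from the visual channel."""
    return satisfaction >= preset


# A flat (featureless) region versus a high-contrast checkerboard region.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]

print(keep_region(sharpness_satisfaction(flat)))     # low sharpness
print(keep_region(sharpness_satisfaction(checker)))  # high sharpness
```

In this sketch the flat region yields zero Laplacian variance and is excluded, while the checkerboard saturates the normalized score and is kept.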

[0101] Those skilled in the art will understand that, in addition to implementing the system, apparatus, and their modules provided by this invention in purely computer-readable program code, the same program can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system, apparatus, and their modules provided by this invention can be considered a hardware component, and the modules included therein for implementing various programs can also be considered structures within the hardware component; alternatively, modules for implementing various functions can be considered both software programs implementing the method and structures within the hardware component.

[0102] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.

Claims

1. A multimodal image processing method based on constraint propagation, characterized in that it includes:
Step S1: Obtain image processing-related requirement information from the task environment and parse the requirement information into a set of structured constraints;
Step S2: Calculate the degree to which the target image satisfies each structured constraint in the structured constraint set, and generate a constraint satisfaction representation;
Step S3: During the extraction of multimodal features, fuse the constraint satisfaction representation to generate constraint-controlled multimodal features;
Step S1 includes:
Step S1.1: Obtain the template file or page structure information of the task environment; the template file or page structure information of the task environment includes the location and requirements of each key functional area;
Step S1.2: Parse the obtained template file or page structure information into a set of structured constraints R; the set of structured constraints R includes: regional visibility structured constraints, layout matching constraints, and text readability structured constraints; each structured constraint includes: a constraint identifier, a constraint type, and a constraint scope;
Step S2 includes:
Step S2.1: Identify the constraint type based on the constraint identifier; when the constraint type is a regional visibility structured constraint, calculate the variance of the Laplacian operator in the image region of the target image defined by the constraint scope, and normalize the calculated variance to obtain the constraint satisfaction degree related to regional sharpness, which is used to characterize the regional sharpness;
Step S2.2: Identify the constraint type based on the constraint identifier; when the constraint type is a layout matching constraint, calculate the similarity between the layout of the target image within the constraint scope and the standard template to obtain the constraint satisfaction degree related to the layout, which is used to characterize the layout matching degree of the region;
Step S2.3: Identify the constraint type based on the constraint identifier; when the constraint type is a text readability structured constraint, call the optical character recognition engine to recognize the text within the constraint scope, and then calculate the text readability score based on the recognition confidence to obtain the constraint satisfaction degree related to the text, which is used to characterize the text readability of the region;
Step S2.4: Combine the satisfaction degrees of all constraints to generate a constraint satisfaction vector that quantifies the satisfaction level;
Step S3 includes:
Multiple parallel feature extraction channels are constructed to extract features from different modalities; the multiple parallel feature extraction channels include layout feature channels, text feature channels, and visual feature channels;
When the constraint satisfaction degree related to the layout is less than the preset value, the structure that does not meet the layout matching constraint is blocked, and then the layout feature g is generated by a graph neural network based on the layout feature channel;
When the constraint satisfaction degree related to the text is less than the preset value, the text that does not meet the text readability structured constraint is suppressed, and then the deep text semantic feature t is extracted based on the text feature channel using a language model based on the Transformer architecture;
When the constraint satisfaction degree related to regional sharpness is less than the preset value, the region that does not satisfy the constraint related to image sharpness is deleted, and then the visual feature v is extracted using a convolutional neural network based on the visual feature channel;
The layout feature g, the text semantic feature t, and the visual feature v are fused to generate the final fused feature F.

2. The multimodal image processing method based on constraint propagation according to claim 1, characterized in that the method further includes: Step S4: Construct a constraint-aware index unit based on the constraint-controlled multimodal features, the constraint satisfaction representation, and the structured constraint set; Step S5: Construct constraint-aware index units for target image A and target image B, and compare the constraint-aware index units of target image A and target image B to determine whether target image A and target image B are duplicate images.

3. The multimodal image processing method based on constraint propagation according to claim 2, characterized in that Step S5 includes: Check whether the structured constraint sets contained in the constraint-aware index unit CAU-1 of target image A and the constraint-aware index unit CAU-2 of target image B are consistent or compatible. If the constraints are inconsistent, the process terminates, and the two images are considered incomparable. If the constraint consistency check passes, the fusion features F1 and F2 are extracted from the constraint-aware index unit CAU-1 of target image A and the constraint-aware index unit CAU-2 of target image B, respectively, and the similarity between the fusion features F1 and F2 is calculated using cosine similarity. The calculated similarity is compared with a preset threshold. If the similarity is greater than the preset threshold, target image A and target image B are finally determined to be duplicate images; otherwise, they are determined to be non-duplicate.

4. A multimodal image processing system based on constraint propagation, characterized in that it includes:
Module M1: Obtains image processing-related requirement information from the task environment and parses the requirement information into a set of structured constraints;
Module M2: Calculates the degree to which the target image satisfies each structured constraint in the structured constraint set, and generates a constraint satisfaction representation;
Module M3: During the extraction of multimodal features, fuses the constraint satisfaction representation to generate constraint-controlled multimodal features;
The module M1 includes:
Module M1.1: Obtains the template file or page structure information of the task environment; the template file or page structure information of the task environment includes the location and requirements of each key functional area;
Module M1.2: Parses the obtained template file or page structure information into a set of structured constraints R; the set of structured constraints R includes: regional visibility structured constraints, layout matching constraints, and text readability structured constraints; each structured constraint includes: a constraint identifier, a constraint type, and a constraint scope;
The module M2 includes:
Module M2.1: Identifies the constraint type based on the constraint identifier; when the constraint type is a regional visibility structured constraint, calculates the variance of the Laplacian operator in the image region of the target image defined by the constraint scope, and normalizes the calculated variance to obtain the constraint satisfaction degree related to regional sharpness, which is used to characterize the regional sharpness;
Module M2.2: Identifies the constraint type based on the constraint identifier; when the constraint type is a layout matching constraint, calculates the similarity between the layout of the target image within the constraint scope and the standard template to obtain the constraint satisfaction degree related to the layout, which is used to characterize the layout matching degree of the region;
Module M2.3: Identifies the constraint type based on the constraint identifier; when the constraint type is a text readability structured constraint, calls the optical character recognition engine to recognize the text within the constraint scope, and then calculates the text readability score based on the recognition confidence to obtain the constraint satisfaction degree related to the text, which is used to characterize the text readability of the region;
Module M2.4: Combines the satisfaction degrees of all constraints to generate a constraint satisfaction vector that quantifies the satisfaction level;
The module M3 includes:
Multiple parallel feature extraction channels are constructed to extract features from different modalities; the multiple parallel feature extraction channels include layout feature channels, text feature channels, and visual feature channels;
When the constraint satisfaction degree related to the layout is less than the preset value, the structure that does not meet the layout matching constraint is blocked, and then the layout feature g is generated by a graph neural network based on the layout feature channel;
When the constraint satisfaction degree related to the text is less than the preset value, the text that does not meet the text readability structured constraint is suppressed, and then the deep text semantic feature t is extracted based on the text feature channel using a language model based on the Transformer architecture;
When the constraint satisfaction degree related to regional sharpness is less than the preset value, the region that does not satisfy the constraint related to image sharpness is deleted, and then the visual feature v is extracted using a convolutional neural network based on the visual feature channel;
The layout feature g, the text semantic feature t, and the visual feature v are fused to generate the final fused feature F.
