Multi-modal event extraction method and system based on neural-symbol hybrid gradient backpropagation
By employing a neural-symbolic hybrid gradient backpropagation method, the problems of visual attention bias and self-correction in multimodal event extraction are solved, achieving cross-modal optimization and interpretable iterative optimization, which is applicable to a variety of application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUZHOU AEROSPACE INFORMATION RES INST
- Filing Date
- 2026-06-03
- Publication Date
- 2026-06-30
AI Technical Summary
Existing multimodal event extraction methods lack cross-modal optimization mechanisms, making it difficult to effectively correct visual attention biases and self-correction, especially when errors occur at the visual perception level, leading to problems of omitted or incorrect argument attribution.
A neural-symbolic hybrid gradient backpropagation method is adopted. Through five steps, namely initialization, forward propagation, evaluation, gradient decomposition, gradient mapping and input update, a spatial attention mask is generated and the input is iteratively optimized by utilizing pre-trained visual experts, inference agents, evaluators, gradient decomposition and gradient mapping modules to achieve cross-modal closed-loop feedback.
It enables dynamic correction of visual attention bias during the testing phase, has strong self-correction capabilities, requires no training of model parameters, is suitable for application scenarios with scarce data or high annotation costs, provides an interpretable iterative optimization process, and improves event extraction performance.
Smart Images

Figure CN122309770A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to artificial intelligence, multimodal information processing, and natural language processing technologies, specifically to a multimodal event extraction method and system based on neural-symbolic hybrid gradient backpropagation. Background Technology
[0002] Multimodal event extraction aims to automatically detect event types and extract corresponding argument roles from image and text pairs, and is a key technology for multimedia content understanding and intelligent information extraction. Existing methods can be mainly divided into several categories: classification-based methods achieve image-text alignment by constructing a cross-modal common space and are trained using graph neural networks or contrastive learning, but are limited by fixed categories and have difficulty flexibly handling diverse event structures; generative-based methods transform event extraction into a sequence generation task, usually relying on external object detectors to pre-extract candidate arguments, which increases the risk of omissions; and methods based on multimodal large language models directly utilize visual language models for end-to-end extraction or improve performance through few-shot cues, but semantic accuracy and cross-modal alignment remain the main bottlenecks.
[0003] Reference 1 (Yuksekgonul, Mert et al. "Optimizing Generative AI by Backpropagating Language Model Feedback.", Nature 639.8055 (2025): 609-616.) proposes a method called text feedback optimization. This method models a complex AI system as a computational graph. First, it constructs a directed acyclic graph consisting of variable nodes and function edges, and performs forward propagation to obtain the output. Then, a large language model is used as a judge to generate feedback in natural language form based on the output. This feedback is then used as "text feedback" and backpropagated to each node through another large language model optimizer to generate modification suggestions for each variable. Finally, the node values are updated and the above process is repeated until the feedback disappears or a preset number of rounds is reached. The optimization signal of this method exists entirely in the text space, influencing subsequent outputs by modifying text prompts. However, directly applying this method to multimodal event extraction has inherent limitations: errors are often rooted in visual perception, such as the model ignoring key objects in the corners of an image. While textual feedback can prompt the model "where to look", it cannot forcibly correct the attention distribution of the visual encoder. Even if the model receives feedback, it will still "ignore it".
[0004] In summary, the main drawback of existing technologies lies in the lack of a cross-modal optimization mechanism capable of translating high-level semantic evaluation into low-level visual attention constraints. Specifically, text feedback optimization methods can only transmit optimization signals within the text space. When event arguments are presented visually (such as small objects or occluded objects in an image) but are not described in detail in the text, the model lacks effective visual refocusing mechanisms, leading to persistent problems of visual focus shift and argument misattribution. Furthermore, existing methods cannot self-correct based on output quality during testing; once preliminary results contain omissions or errors, the system cannot correct them. Summary of the Invention
[0005] The purpose of this invention is to provide a multimodal event extraction method and system based on neural-symbolic hybrid gradient backpropagation, so as to dynamically correct visual attention bias during testing.
[0006] The technical solution to achieve the purpose of this invention is: a multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation, comprising the following steps:
[0007] Initialization steps: Set the maximum number of iterations and the highlight intensity coefficient, obtain the predefined event pattern, initialize visual query hints and inference hints, and use the original image as the current image;
[0008] Forward propagation steps: Input the current image and current visual query prompts into the pre-trained visual expert module to generate a visual description; input the original text, visual description, and inference prompts into the pre-trained inference agent module to generate structured events;
[0009] Evaluation steps: Input the structured event, raw text, and visual description into the pre-trained evaluator module to generate text feedback; if the text feedback is empty, output the structured event and terminate.
[0010] Gradient decomposition steps: Input the text feedback into the gradient decomposition module, which decomposes it into a signed gradient and a set of keywords;
[0011] Gradient mapping step: Input the current image and keyword set into the pre-trained gradient mapping module to generate a spatial attention mask as a neural gradient;
[0012] Input update steps: Overlay the spatial attention mask onto the current image in the form of a heatmap to generate the updated image; input the current visual query hints and symbolic gradients into the pre-trained optimizer large language model to generate the updated visual query hints;
[0013] Iterative control steps: Using the updated image as the new current image and the updated visual query cue as the new current visual query cue, repeat the forward propagation step, evaluation step, gradient decomposition step, gradient mapping step and input update step until the stopping condition is met, and output the structured event generated in the last round.
[0014] The parameters of all pre-trained modules remain unchanged during the method's execution.
[0015] Furthermore, in the gradient decomposition step, the text feedback is decomposed into a signed gradient and a set of keywords, specifically in the following ways:
[0016] Rule extraction: Based on predefined event patterns, match missing argument roles appearing in text feedback to generate an initial keyword list;
[0017] Semantic expansion: Using a thesaurus or pre-trained word vectors, each initial keyword is expanded into a set of synonyms;
[0018] Large Language Model Refinement: The expanded keyword list is input into the large language model, which removes irrelevant words and sorts them in descending order of relevance, outputting a refined keyword set; furthermore, the large language model generates natural language instructions for modifying visual query prompts based on text feedback, which serve as symbolic gradients.
[0019] Furthermore, the gradient mapping module is an open vocabulary visual localization model; the specific method for generating the spatial attention mask in the gradient mapping step is as follows: generate a heatmap of the same size as the current image for each keyword, and the pixel value of each heatmap represents the confidence that the position belongs to the corresponding keyword; take the maximum value of each element of all heatmaps at the same coordinate position, and then scale the value to the [0,1] interval through min-max linear normalization to obtain the spatial attention mask.
[0020] Furthermore, the specific method for overlaying the spatial attention mask onto the current image in the form of a heatmap in the input update step is as follows:
[0021] ;
[0022] in, For the current image, For high brightness intensity coefficient, Here, is the spatial attention mask, and colormap is the function that maps a single-channel mask to a three-channel color heatmap.
[0023] Furthermore, the evaluator module generates text feedback based on the following two criteria:
[0024] Image-text consistency criterion: Check whether each argument in a structured event has corresponding evidence to support it in the original text or the visual description generated in the current round;
[0025] Event pattern integrity criterion: Based on the predefined event pattern, check whether the event type of the current structured event is missing any required argument roles;
[0026] When both criteria are met, the text feedback is an empty string; otherwise, the output is natural language text containing descriptions of missing arguments and suggested modifications.
[0027] Furthermore, the visual expert module is a multimodal large language model, the reasoning agent module, the judge module, and the optimizer large language model are all large language models, and the gradient mapping module is an open vocabulary visual localization model; all modules are pre-trained models, and their model parameters remain frozen during the method's operation.
[0028] Furthermore, the stopping condition in the iterative control step includes any of the following:
[0029] Condition 1: The text feedback generated during the evaluation process is an empty string;
[0030] Condition 2: The number of iterations executed has reached the preset maximum number of iterations K;
[0031] Condition 3: The similarity of structured events generated in two consecutive rounds exceeds a preset threshold;
[0032] Condition 4: Dynamically adjust the maximum number of iteration rounds based on the semantic strength in the text feedback;
[0033] Condition 5: The reasoning agent module outputs the confidence score of each argument, and the confidence scores of all required arguments are higher than the preset threshold.
[0034] Furthermore, the open-vocabulary visual localization model used in the gradient mapping step can be replaced by any of the following alternatives:
[0035] Alternative Solution 1: Based on the closed-set target detection model, pre-establish a mapping table from keywords to detection categories;
[0036] Alternative Solution 2: Directly utilize the cross-modal attention weights of the visual expert module itself to extract the cross-attention map of text keywords on image patches as a spatial attention mask;
[0037] Alternative Solution 3: Generate a mask based on an unsupervised saliency detection method;
[0038] Alternative Solution 4: Use the instance segmentation mask output by the open vocabulary segmentation model as a heatmap.
[0039] Furthermore, the three-stage processing in the gradient decomposition step is entirely replaced by any of the following alternatives:
[0040] Alternative Option 1: Directly use a large language model to output the symbolic gradient and keyword set in JSON format;
[0041] Alternative Solution 2: Introduce external knowledge graphs or domain ontologies to expand keywords through multi-hop reasoning;
[0042] Alternative Solution 3: Use pre-trained word vectors to calculate semantic similarity and automatically select words with similarity exceeding a threshold to add to the keyword set.
[0043] A multimodal event extraction system based on neural-symbolic hybrid gradient backpropagation, used to implement the aforementioned multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation, includes:
[0044] The Visual Expert module is configured to receive images and visual query prompts and generate visual descriptions.
[0045] The reasoning agent module is configured to receive raw text, visual descriptions, and reasoning prompts, and generate structured events.
[0046] The judge module is configured to receive structured events, raw text, and visual descriptions, and generate textual feedback.
[0047] The gradient decomposition module is configured to receive text feedback and decompose it into a symbolic gradient and a set of keywords.
[0048] The gradient mapping module is configured to receive the current image and a set of keywords, and generate a spatial attention mask as a neural gradient.
[0049] The optimizer large language model is configured to receive the current visual query hints and symbolic gradients, and generate updated visual query hints.
[0050] The input update module is configured to overlay the spatial attention mask onto the current image in the form of a heatmap to generate an updated image, and input the current visual query hints and symbolic gradients into the optimizer large language model to generate updated visual query hints.
[0051] The iteration control module is configured to control the iteration rounds, repeatedly calling the above module until the stopping condition is met, and outputting structured events.
[0052] All modules are pre-trained modules, and their parameters remain unchanged during system operation.
[0053] Compared with the prior art, the significant advantages of this invention are: (1) cross-modal optimization capability: transforming semantic evaluation into visual attention constraint.
[0054] Existing technologies can only transmit optimization signals within the text space. When errors are rooted in visual perception (such as the model ignoring key objects in the corners of an image), even if the feedback prompts "Please pay attention to the blue car in the lower right corner," the model still processes the image according to its inherent attention pattern and cannot force a correction of the visual encoder's area of focus. This invention uses a gradient decomposition module to parse the text feedback into a set of keywords, then uses an open-vocabulary visual localization model to map it into a spatial attention mask, which is then physically superimposed onto the input image in the form of a red heatmap. This operation is equivalent to applying a "gradient update" in the image space, ensuring that the next visual expert module will inevitably focus on the highlighted area.
[0055] (2) Self-correction during testing can be achieved without training.
[0056] Existing classification and generative frameworks require large amounts of labeled data for training or fine-tuning, making them difficult to apply in data-scarce vertical domains (such as rare event types). This invention, however, runs entirely at test time, without updating any model parameters, optimizing extraction results solely through iterative modifications to the input image and text prompts. This characteristic makes this invention particularly suitable for real-world applications where annotation costs are high or event types change dynamically.
[0057] (3) Interpretable iterative optimization process
[0058] Existing optimization techniques lack interpretability: users can only see the final output and cannot understand why or how the model was modified. This invention generates readable textual feedback (e.g., "Missing damaged parts, please focus on the front of the blue car") and a visual attention mask (red highlighted areas on the image) for each iteration. Users can not only see the final event extraction results but also trace which regions the model focused on and which arguments were completed in each iteration. This transparency enhances the system's credibility and debuggability, facilitating manual review and intervention in actual deployment.
[0059] (4) Good model generalization and component replaceability
[0060] Existing technologies (text feedback optimization), while not tied to a specific large language model, limit their optimization signals to the text modality. This invention adds a visual gradient branch, but similarly does not bind to a specific visual expert, inference agent, or localization model. This method, by not binding to a specific visual expert, inference agent, or localization model, allows for flexible selection of appropriate pre-trained models based on the application scenario. Attached Figure Description
[0061] Figure 1 This is a flowchart of a multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation. Detailed Implementation
[0062] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0063] This invention proposes a multimodal event extraction method and system based on neural-symbolic hybrid gradient backpropagation, which transforms high-level semantic evaluation into spatial attention constraints that can directly affect visual input, and realizes cross-modal closed-loop feedback optimization, thereby effectively solving the problems of visual omission and argument misattribution.
[0064] This solution is applied to computer systems, utilizing multimodal large language models, large language models, and open-vocabulary visual localization models as basic components. During the testing phase (i.e., the inference phase), it iterative optimization of the input improves event extraction performance without updating any model parameters. The following, combined with the appendix... Figure 1 The flowchart shown describes in detail the implementation process of this solution.
[0065] I. System Composition and Input / Output
[0066] The input for this method is:
[0067] Original image I∈R H×W×3 , where H and W are the height and width pixels of the image, respectively;
[0068] The original text T is a natural language description associated with the image, such as a news headline or body text;
[0069] The predefined event pattern Ω contains the event type and its corresponding list of required argument roles. For example, the "traffic accident" event must include arguments such as "the vehicle involved" and "the location of the accident".
[0070] The output of this method is a structured event E, which includes the event type, trigger words, and the correspondence between argument roles and entities.
[0071] This method involves the following core computational modules, all of which are pre-trained public models and their parameters are not updated with this method:
[0072] The visual expert module employs a multimodal large language model, receives images and visual query prompts, and outputs a detailed natural language description of the images.
[0073] The reasoning agent module uses a large language model to receive text, visual descriptions, and reasoning prompts, and outputs structured events.
[0074] The judge module uses a large language model to evaluate the current event and generate natural language feedback.
[0075] The gradient decomposition module uses a hybrid approach of rules and large language models to parse the feedback into symbolic gradients and keyword sets.
[0076] The gradient mapping module uses an open vocabulary visual localization model to map the keyword set into a spatial attention mask.
[0077] The input update module updates the visual query cues and the input image based on the symbolic gradient and the attention mask, respectively.
[0078] II. Terminology Definition
[0079] The neural-symbolic hybrid gradient consists of two parts: the symbolic gradient and the neural gradient. The symbolic gradient is a suggestion for modifying textual cues in natural language form, while the neural gradient is a spatial attention mask used to guide visual attention. Together, they constitute the optimization signal from high-level semantic evaluation to low-level input correction.
[0080] Text feedback refers to the natural language feedback generated by the evaluator module, which is similar to the adjustment direction information carried by the gradient in a traditional neural network.
[0081] Spatial attention mask is a two-dimensional matrix of the same size as the image, with each element ranging from 0 to 1. The larger the value, the more attention the model needs to pay to that pixel region.
[0082] Open-vocabulary visual localization refers to a localization technique that can detect any object described in natural language, not limited to predefined object categories.
[0083] III. Implementation Steps
[0084] The following is in conjunction with the appendix Figure 1 The following describes a complete iteration of this method in chronological order. This method employs multi-round iterative optimization, with each round consisting of five stages: forward propagation, evaluation, gradient decomposition, gradient mapping, and input update.
[0085] Step 1: Initialization
[0086] Set the maximum number of iterations. The preferred value is 5, which sets the high brightness intensity coefficient. The preferred value is 0.3. Simultaneously, a predefined event pattern Ω is obtained as the basis for subsequent evaluation and gradient decomposition. Initialize visual query hints. Default template: "Please describe in detail the events depicted in this news image, including all visible people, objects, actions, and scene details." Initialize inference hints The default template is: "Based on the following news text and image description, extract the event type and arguments. The output format is JSON: containing the event type, trigger words, and a list of arguments." Let the current iteration round... Current image It should be noted that t represents the number of forward propagations completed, and the first forward propagation is performed when t=0 initially.
[0087] Step 2: Forward Propagation
[0088] Current image And current visual query prompts Input Vision Expert Module Generate visual description :
[0089]
[0090] in The string is a natural language string, and its length does not exceed the model output limit, such as 256 tokens.
[0091] Original text Visual description and reasoning hints Input reasoning agent module Generate structured events :
[0092]
[0093] Includes event type, trigger word, and list of argument role-entity pairs.
[0094] Step 3: Evaluation and Assessment
[0095] The current structured events Original text and current visual description Input Judge Module Generate text feedback :
[0096]
[0097] The judge module evaluates arguments based on two criteria. The first is the text-image consistency criterion, which checks whether each argument has corresponding supporting evidence in the original text or the visual description generated in the current round. The second is the event pattern integrity criterion, which assesses arguments based on predefined event patterns. Check if the current event type is missing a key argument. If the current structured event... If all criteria are met, then text feedback will be provided. If the string is empty, this method terminates and outputs the structured event. Otherwise, text feedback For example, a natural language assessment might detect a traffic accident, but the vehicles involved are missing from the argument. The text mentions a collision between two vehicles, but the visual description does not include vehicle details.
[0098] Step 4: Gradient Decomposition
[0099] If text feedback If not empty, then execute the gradient decomposition module. This module decomposes text feedback into symbolic gradients. and keyword set :
[0100]
[0101] The specific decomposition process is as follows:
[0102] Sub-step 4.1 Rule extraction: Event-based pattern extraction Matching text feedback The system generates an initial keyword list based on the missing argument roles appearing in the feedback. For example, if the feedback includes "missing vehicle involved", the initial keyword would be "vehicle".
[0103] Sub-step 4.2 Semantic expansion: Using a thesaurus (such as WordNet) or pre-trained word vectors, expand each keyword into a set of synonyms. For example, "vehicle" is expanded to "car, truck, sedan, motorcycle, bicycle".
[0104] Sub-step 4.3 Refining the Large Language Model: Input the expanded keyword list into a large language model. This model can be the same as the judge module but uses different prompts. The model is required to remove irrelevant words and sort them in descending order of relevance, outputting the refined keyword set. Simultaneously, this large language model generates symbolic gradients. This refers to suggestions for modifying visual query prompts, such as "Please check the image more carefully for any damaged vehicles."
[0105] Step 5: Gradient Mapping
[0106] Current image and keyword set Input gradient mapping module Gradient mapping module An open-vocabulary visual positioning model is used for each keyword. Generate a heatmap :
[0107]
[0108] Heat map Size and current image Similarly, each pixel value represents the confidence level that the location belongs to the object described by the keyword. Then, the aggregated spatial attention mask is obtained by taking the maximum value element-wise and normalizing. :
[0109]
[0110] Where the normalization function Scale the mask values linearly to Interval. This refers to the neural gradient described in this method.
[0111] Step Six: Input Update
[0112] This step updates both the image and visual query suggestions, creating a closed-loop feedback loop.
[0113] Image Update: Spatial Attention Mask The heatmap is overlaid onto the current image to generate the next image. :
[0114]
[0115] in Map a single-channel mask to a color heatmap. A preferred implementation is to map it to the red channel: This involves overlaying a red semi-transparent highlight onto the original image. The highlight intensity is controlled, with a value ranging from 0.1 to 0.5, preferably 0.3. This operation is equivalent to applying a physical "gradient update" to the input image, forcing the next round of the vision expert module to focus its attention on the highlighted area.
[0116] Text suggestion update: Update the current visual query suggestion and sign gradient Input an optimizer large language model, which can be the same as the judge module but uses different prompts, to generate updated visual query prompts:
[0117]
[0118] For example, if the initial prompt is "Please describe this image in detail," and the symbolic gradient is "Please pay attention to the vehicles in the image," then the updated prompt will be "Please describe this image in detail, paying particular attention to the vehicles in the image." (Inference prompt) It remains unchanged throughout the entire process.
[0119] Step 7: Iteration Loop
[0120] make Return to step two and repeat the forward propagation, evaluation, gradient decomposition, gradient mapping, and input update until the evaluator module outputs an empty string (i.e., the evaluation passed), or Reaching the maximum number of iterations The final output is the structured event generated in the last round. It should be noted that when t=0, round 0, i.e., the initial round, is executed; if an update is needed after round 0, then t becomes 1, and round 1 forward propagation is executed; and so on. The stopping condition is checked after each round's evaluation.
[0121] This method runs on general-purpose computing devices such as servers or personal computers, leveraging graphics processing units (GPUs) to accelerate model inference. During the testing phase, it invokes pre-trained multimodal large language models, large language models, and open-vocabulary visual localization models, all loaded into GPU memory. Each forward propagation involves numerous matrix operations. This method does not modify model parameters, only the input image and prompt words, thus avoiding additional training overhead. Compared to single-pass inference, this method adds multiple iterations and localization model calls, resulting in a computational cost approximately 3 to 5 times that of single-pass inference, but yields a significant performance improvement. It is particularly suitable for vertical domains where data is scarce and annotation costs are high, such as rare event type analysis and professional news event extraction.
[0122] IV. Examples
[0123] To more clearly demonstrate the implementation process of this invention, the following uses a specific traffic accident news photo pair as an example to describe in detail the complete process from initial input to iteration termination.
[0124] (1) Case input:
[0125] Original image: A photo of a traffic accident scene showing two cars, a red sedan and a white SUV, stopped in the center of an intersection, facing each other. In the lower right corner of the image, there is a blue sedan with a severely deformed front end, its bumper detached, and its hood warped. The entire image has a resolution of 1920×1080 pixels.
[0126] Original text: "Around 2 p.m. today, a three-vehicle collision occurred at the intersection of Jianshe Road and Hongqi Street. One of the vehicles, a blue sedan, was severely damaged. There are no reports of injuries at this time."
[0127] Event pattern: The traffic accident type must include three mandatory arguments: "vehicle involved", "accident location", and "damaged parts".
[0128] (2) Initial settings:
[0129] Maximum number of iterations .
[0130] High brightness intensity coefficient .
[0131] Initial visual query prompts "Please describe in detail the events depicted in this news image, including all visible people, objects, actions, and scene details."
[0132] Reasoning hints "Based on the following news text and image description, extract the event type and arguments. The output format is JSON: {'event_type': str, 'trigger_word': str, 'arguments': [{'role': str,'text_mention': str or null, 'image_description': str or null}]}"
[0133] Current round Current image This is the original image.
[0134] (3) Implementation process of round 0 (initial round)
[0135] Step 1, Initialization
[0136] Step 2: Forward Propagation
[0137] Visual expert module input and Output visual description "In the center of the image, there are two cars, a red sedan and a white SUV, stopped at an intersection. In the lower right corner of the image, there is a blue sedan. There are no pedestrians on the road."
[0138] Input text for the reasoning agent module , and Output structured events :
[0139] {
[0140] "event_type": "traffic accident",
[0141] "trigger_word": "collision",
[0142] "arguments": [
[0143] {"role": "Vehicles involved", "text_mention": "Three vehicles", "image_description": "Red sedan, white SUV, blue sedan"},
[0144] {"role": "Accident Location", "text_mention": "Intersection of Jianshe Road and Hongqi Street", "image_description": "Intersection"} ]
[0146] }
[0147] The "damaged part" argument is missing at this point.
[0148] Step 3: Evaluation and Assessment
[0149] Judge module input , , Based on the criteria, the following assessment was conducted:
[0150] Consistency between text and images: The text mentions that "the blue sedan was severely damaged," but... The report does not describe any details of the vehicle damage, therefore it does not meet the requirements.
[0151] Model completeness: The traffic accident model requires the "damaged part" argument, which is currently missing.
[0152] The judge outputs text feedback "A 'traffic accident' event was detected, but the 'damaged parts' data is missing. The text clearly states that the 'blue sedan was severely damaged,' but the visual description does not include any details of the vehicle damage, such as a deformed front end or a detached bumper. Please focus on the blue sedan and its damage characteristics."
[0153] Step 4: Gradient Decomposition
[0154] Gradient decomposition module processing :
[0155] Rule extraction: Match event patterns from “damaged parts” to obtain the initial keyword “damaged parts”; extract “blue car” and “damaged” from “blue car is severely damaged”.
[0156] Semantic expansion: "blue sedan" is expanded to "blue small sedan, blue car"; "damaged" is expanded to "deformed, bumper detached, hood lifted, dented".
[0157] Large Language Model Refinement: Input an expanded list, output the keywords most relevant to the visual features that may appear in the image. Output: a set of keywords. = {"blue sedan", "deformed front end", "detached bumper", "raised hood"}. Simultaneously generate symbolic gradients. = "Please pay special attention to the blue sedan in the lower right corner of the image, and describe the deformation of its front end, whether the bumper has fallen off, and whether the hood is raised."
[0158] Step 5: Gradient Mapping
[0159] Gradient Mapping Module (Open Vocabulary Visual Localization Model) Input and :
[0160] For the "blue car": Localization model output mask The highlighted area is the overall outline of the blue sedan in the lower right corner of the image.
[0161] For the "deformed front of the car": output mask The front area of the bright blue sedan (front bumper area).
[0162] For "detached bumper": output mask High-gloss part that fell off below the front of the car.
[0163] For "raised hood": output mask Highlight the hood area.
[0164] Find the maximum value element by element: .
[0165] Normalize to [0,1]: The final attention mask value is close to 1 in the front and bumper area of the blue car, and close to 0 in other areas.
[0166] Step Six: Input Update
[0167] Image update: Effect: A red semi-transparent highlight is overlaid on the original image, giving the front and bumper areas of the blue sedan a red overlay.
[0168] Text hint update: Optimizer large language model input and Output updated visual query hints "Please describe in detail the events depicted in this news image, including all visible people, objects, actions, and scene details. Pay particular attention to the blue sedan in the lower right corner of the image, describing the deformation of its front end, whether the bumper has detached, and whether the hood is raised."
[0169] Step 7: Iterate through the loop. Then proceed to the next round.
[0170] (4) Implementation process of the first round
[0171] Step 2: Forward Propagation
[0172] The visual expert module takes the updated image as input. (Highlighted in red) and the updated prompt Output visual description "In the center of the image, two cars (a red sedan and a white SUV) are stopped at an intersection. The blue sedan in the lower right corner of the image is highlighted in red. The front of the blue sedan is severely deformed, the bumper is completely detached, the hood is tilted up at about 30 degrees, and the windshield is cracked. There are no pedestrians on the road."
[0173] Input of the reasoning agent module , and Output structured events :
[0174] {
[0175] "event_type": "traffic accident",
[0176] "trigger_word": "collision",
[0177] "arguments": [
[0178] {"role": "Vehicles involved", "text_mention": "Three vehicles", "image_description": "Red sedan, white SUV, blue sedan"},
[0179] {"role": "Accident Location", "text_mention": "Intersection of Jianshe Road and Hongqi Street", "image_description": "Intersection"},
[0180] {"role": "Damaged area", "text_mention": "Blue sedan is severely damaged", "image_description": "Blue sedan's front end is severely deformed, bumper is detached, and hood is lifted up"} ]
[0182] }
[0183] Step 3: Evaluation and Assessment
[0184] Judge module input , , :
[0185] Image-text consistency: The text “the blue sedan was severely damaged” perfectly matches the visual description “the front of the car was severely deformed, the bumper was detached, and the hood was raised”.
[0186] Pattern completeness: The three required arguments for the traffic accident, namely "vehicles involved", "accident location" and "damaged parts", already exist.
[0187] The judge outputs an empty string. This indicates that the evaluation has passed.
[0188] Step 7: Iteration Termination
[0189] This method outputs the final structured events. As a result, the entire process involves one iteration, namely the initial round plus one update round, which ultimately correctly extracts the complete event structure containing details of vehicle damage.
[0190] As demonstrated by the above examples, this invention uses an evaluator module to identify missing key information in the visual description and generate textual feedback. Gradient decomposition transforms the feedback into specific keywords and suggested modifications. Gradient mapping generates an attention mask and overlays it onto the image, creating physical visual guidance. The updated prompts and image together prompt the visual expert module to focus on the highlighted areas in the next round, thus completing the missing argument information. Throughout the entire process, no model retraining or fine-tuning is required; self-correction is achieved simply by modifying the input.
[0191] To broaden the scope of protection of this invention and prevent others from achieving the same invention by replacing some components or steps, several alternative solutions are listed below. These alternative solutions can all achieve the core function of this invention, namely, transforming high-level semantic evaluation into cross-modal closed-loop optimization signals.
[0192] V. Alternatives to the Gradient Mapping Module
[0193] (1) Alternatives based on closed-set object detection: If the object categories in the application scenario are limited and predefined, such as industrial fault detection and traffic monitoring, traditional closed-set object detection models, such as the YOLO series and Faster R-CNN, can be used instead of open-vocabulary models. In this case, it is necessary to pre-establish a mapping table from keywords to detection categories, for example, the keyword "deformed car front" is mapped to the detection category "car_damage". This method may achieve higher localization accuracy and faster inference speed in specific domains.
[0194] (2) An alternative based on the internal attention of the visual language model: Instead of an independent target localization model, the cross-modal attention weights of the multimodal large language model, i.e., the visual expert module itself, can be directly utilized. Specifically, the cross-attention weights of the Chinese text keywords on the image patch in the last layer of the visual expert module are extracted. These weights are then reshaped into an attention heatmap of the same size as the image, and normalized before being used as a mask. This approach reduces the dependence on external localization models, making the overall framework more lightweight.
[0195] (3) Alternative based on saliency detection: When the keywords are general attributes, such as "damaged area" or "abnormal area", rather than specific object names, unsupervised saliency detection methods can be used, such as image gradient, frequency domain transformation, or deep learning saliency detectors, to generate a mask. In this case, the keywords are only used to trigger the behavior of "needing to pay attention to salient areas" and do not participate in the specific localization.
[0196] (4) Alternative based on segmentation models: Use open vocabulary segmentation models, such as Grounding SAM and SEEM, to replace the localization model and directly output the instance segmentation mask corresponding to the keyword. The segmentation mask is more refined than the attention mask provided by the localization bounding box and is suitable for scenarios that require precise region guidance.
[0197] 2. Alternatives to the Judge Module
[0198] (1) Multi-judge ensemble scheme: Multiple independent judge models are used, including different prompt templates, different base language models, or different temperature parameters of the same model, to evaluate the current event in parallel, and the text feedback of each model is combined by voting or weighted averaging. This scheme can enhance the robustness of the judgment signal and reduce the misjudgment of a single model.
[0199] (2) Rule-based evaluator scheme: For vertical domains with highly fixed event patterns, such as industrial assembly motion detection and traffic violation event extraction, a pure rule and template matching approach can be used to replace the large language model evaluator. The system automatically checks for missing arguments based on predefined event patterns and splices pre-set feedback templates. This scheme can significantly reduce computational overhead, and the feedback content is predictable and easy to debug.
[0200] (3) Joint Judge and Optimizer Scheme: The judge and the large language model optimizer in gradient decomposition are merged into a single model, which outputs text feedback, symbolic gradient and keyword set in a single call, reducing the number of model calls. This scheme sacrifices some module decoupling, but improves running efficiency.
[0201] 3. Alternative solutions for image update methods
[0202] (1) Non-overlay image modification schemes: In addition to the Alpha fusion and overlay heatmaps mentioned above, the following methods can also be used: cropping the highlighted areas, enlarging them, and replacing the corresponding areas in the original image (i.e., "digital zoom"), drawing colored bounding boxes or outlines around the highlighted areas, converting the highlighted areas to grayscale while retaining the color of the remaining areas, or increasing the overall brightness of the highlighted areas. These methods can all force the visual expert module to focus on specific areas and are simple to implement.
[0203] (2) Multi-region differentiated highlighting scheme: When the keyword set contains multiple semantically different objects, such as "deformed car front" and "fallen tire", different colors can be used for different regions, such as red to highlight the car front and blue to highlight the tire, and the model can be informed of the attention intent corresponding to each color in the text feedback. This scheme helps the model distinguish different types of key information.
[0204] 4. Alternatives to the gradient decomposition module
[0205] (1) Pure model-driven decomposition scheme: Directly use the large language model for end-to-end structured output, requiring the model to output in JSON format, including a "symbolic_gradient" field (string) and a "keywords" field (string array), without intermediate steps of rule extraction and semantic expansion. This scheme simplifies implementation, but requires carefully designed prompt templates to ensure stable output format.
[0206] (2) Knowledge Enhancement Decomposition Scheme: Introduce external knowledge bases, such as WordNet, ConceptNet, domain ontology, or knowledge graphs, during the semantic expansion stage to perform multi-hop reasoning expansion. For example, expand from "vehicle" to component-level words such as "front", "bumper", "hood", and "door" to improve the localization recall rate.
[0207] (3) Automatic expansion scheme based on word vectors: Using pre-trained word vectors, such as GloVe and BERT embeddings, the semantic similarity between keywords and candidate words is calculated, and words with similarity exceeding the threshold are automatically selected to be added to the keyword set without calling a large language model for refinement. This scheme is particularly suitable when computational resources are limited.
[0208] 5. Alternatives to the iteration stopping condition
[0209] (1) Confidence threshold stopping scheme: When the similarity of the event structure output in two consecutive rounds, such as the Jaccard coefficient of the argument set or the consistency of the event type, exceeds the preset threshold (e.g., 0.95), the process will stop in advance even if the judge still has feedback, in order to avoid unnecessary iterations.
[0210] (2) Adaptive round number scheme: The maximum number of iteration rounds is dynamically adjusted based on the feedback from the judges. For example, more rounds are allowed when strong negative words such as "severely missing" or "completely wrong" appear in the feedback, while the process is terminated early when "slightly incomplete" or "suggests to be supplemented" appear.
[0211] (3) Stopping scheme based on model confidence output: If the reasoning agent module can output the confidence score of each argument, such as the generation probability value, then the iteration stops when the confidence of all required arguments is higher than the threshold.
[0212] 6. Alternative integration schemes for visual experts and reasoning agents
[0213] (1) End-to-end joint model scheme: The visual expert and the reasoning agent are merged into a single multimodal large language model, which simultaneously receives images and text and directly outputs structured events. In this case, the iterative framework of the present invention still applies: the judge module evaluates the output of the model, and updates the input image (overlay mask) and text prompts (modify user instructions) after gradient decomposition, without distinguishing between two independent modules. This scheme simplifies the system architecture, but requires the selected model to have sufficient instruction compliance capabilities.
[0214] (2) Streaming Iterative Scheme: This scheme does not strictly distinguish the boundary between forward propagation and input update, but instead directly adds the feedback generated by the judge in each round to the dialogue history, using a multi-turn dialogue mechanism to achieve iterative optimization. At this time, image updates are still performed independently, but text prompt updates become cumulative updates of the entire dialogue context.
[0215] The above alternative solutions do not deviate from the core idea of this invention, which is to map text evaluation to image space attention constraints through neural-symbolic hybrid gradients, forming a cross-modal closed-loop optimization. Any method that adopts this core idea but replaces some implementation details should fall within the protection scope of this invention.
[0216] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0217] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these modifications and improvements all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation, characterized in that, Includes the following steps: Initialization steps: Set the maximum number of iterations and the highlight intensity coefficient, obtain the predefined event pattern, initialize visual query hints and inference hints, and use the original image as the current image; Forward propagation step: Input the current image and the current visual query prompt into the pre-trained visual expert module to generate a visual description; The raw text, visual description, and reasoning hints are input into a pre-trained reasoning agent module to generate structured events; Evaluation steps: Input the structured event, raw text, and visual description into the pre-trained evaluator module to generate text feedback; If the text feedback is empty, output a structured event and terminate; Gradient decomposition steps: Input the text feedback into the gradient decomposition module, which decomposes it into a signed gradient and a set of keywords; Gradient mapping step: Input the current image and keyword set into the pre-trained gradient mapping module to generate a spatial attention mask as a neural gradient; Input update steps: Overlay the spatial attention mask onto the current image in the form of a heatmap to generate the updated image; The current visual query suggestion and symbolic gradient are input into the pre-trained optimizer large language model to generate an updated visual query suggestion; Iterative control steps: Using the updated image as the new current image and the updated visual query cue as the new current visual query cue, repeat the forward propagation step, evaluation step, gradient decomposition step, gradient mapping step and input update step until the stopping condition is met, and output the structured event generated in the last round. The parameters of all pre-trained modules remain unchanged during the method's execution.
2. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, In the gradient decomposition step, the text feedback is decomposed into a signed gradient and a set of keywords. Specific methods include: Rule extraction: Based on predefined event patterns, match missing argument roles appearing in text feedback to generate an initial keyword list; Semantic expansion: Using a thesaurus or pre-trained word vectors, each initial keyword is expanded into a set of synonyms; Large Language Model Refinement: The expanded keyword list is input into the large language model, which removes irrelevant words and sorts them in descending order of relevance, outputting a refined keyword set; furthermore, the large language model generates natural language instructions for modifying visual query prompts based on text feedback, which serve as symbolic gradients.
3. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The gradient mapping module is an open vocabulary visual localization model; the specific method for generating the spatial attention mask in the gradient mapping step is as follows: generate a heatmap of the same size as the current image for each keyword, and the pixel value of each heatmap represents the confidence that the position belongs to the corresponding keyword; take the maximum value of each element of all heatmaps at the same coordinate position, and then scale the value to the [0,1] interval through min-max linear normalization to obtain the spatial attention mask.
4. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The specific method for overlaying the spatial attention mask onto the current image in the form of a heatmap in the input update step is as follows: ; in, For the current image, For high brightness intensity coefficient, Here, is the spatial attention mask, and colormap is the function that maps a single-channel mask to a three-channel color heatmap.
5. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The evaluator module generates text feedback based on the following two criteria: Image-text consistency criterion: Check whether each argument in a structured event has corresponding evidence to support it in the original text or the visual description generated in the current round; Event pattern integrity criterion: Based on the predefined event pattern, check whether the event type of the current structured event is missing any required argument roles; When both criteria are met, the text feedback is an empty string; otherwise, the output is natural language text containing descriptions of missing arguments and suggested modifications.
6. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The visual expert module is a multimodal large language model, the reasoning agent module, the judge module, and the optimizer large language model are all large language models, and the gradient mapping module is an open vocabulary visual localization model; all modules are pre-trained models, and their model parameters are frozen during the method operation.
7. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The stopping condition in the iterative control step includes any of the following: Condition 1: The text feedback generated during the evaluation process is an empty string; Condition 2: The number of iterations executed has reached the preset maximum number of iterations K; Condition 3: The similarity of structured events generated in two consecutive rounds exceeds a preset threshold; Condition 4: Dynamically adjust the maximum number of iteration rounds based on the semantic strength in the text feedback; Condition 5: The reasoning agent module outputs the confidence score of each argument, and the confidence scores of all required arguments are higher than the preset threshold.
8. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The open-vocabulary visual localization model used in the gradient mapping step can be replaced by any of the following alternatives: Alternative Solution 1: Based on the closed-set target detection model, pre-establish a mapping table from keywords to detection categories; Alternative Solution 2: Directly utilize the cross-modal attention weights of the visual expert module itself to extract the cross-attention map of text keywords on image patches as a spatial attention mask; Alternative Solution 3: Generate a mask based on an unsupervised saliency detection method; Alternative Solution 4: Use the instance segmentation mask output by the open vocabulary segmentation model as a heatmap.
9. The multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation according to claim 1, characterized in that, The three-stage processing in the gradient decomposition step can be entirely replaced by any of the following alternatives: Alternative Option 1: Directly use a large language model to output the symbolic gradient and keyword set in JSON format; Alternative Solution 2: Introduce external knowledge graphs or domain ontologies to expand keywords through multi-hop reasoning; Alternative Solution 3: Use pre-trained word vectors to calculate semantic similarity and automatically select words with similarity exceeding a threshold to add to the keyword set.
10. A multimodal event extraction system based on neural-symbolic hybrid gradient backpropagation, characterized in that, The method for implementing the multimodal event extraction method based on neural-symbolic hybrid gradient backpropagation as described in any one of claims 1-9 includes: The Visual Expert module is configured to receive images and visual query prompts and generate visual descriptions. The reasoning agent module is configured to receive raw text, visual descriptions, and reasoning prompts, and generate structured events. The judge module is configured to receive structured events, raw text, and visual descriptions, and generate textual feedback. The gradient decomposition module is configured to receive text feedback and decompose it into a symbolic gradient and a set of keywords. The gradient mapping module is configured to receive the current image and a set of keywords, and generate a spatial attention mask as a neural gradient. The optimizer large language model is configured to receive the current visual query hints and symbolic gradients, and generate updated visual query hints. The input update module is configured to overlay the spatial attention mask onto the current image in the form of a heatmap to generate an updated image, and input the current visual query hints and symbolic gradients into the optimizer large language model to generate updated visual query hints. The iteration control module is configured to control the iteration rounds, repeatedly calling the above module until the stopping condition is met, and outputting structured events. All modules are pre-trained modules, and their parameters remain unchanged during system operation.