A sarcasm detection method and system based on multi-modal cue routing fusion
By constructing a structured ironic cue graph and a two-layer controlled dynamic fusion mechanism, the method addresses inadequate cue localization and poor fusion strategy adaptation in multimodal irony detection, thereby improving the accuracy and stability of irony recognition in social media environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing multimodal irony detection methods fall short in key cue localization, fusion strategy adaptation, and noise robustness, which limits the accuracy, stability, and engineering usability of irony recognition in social media environments.
The method adopts multimodal cue routing fusion. It constructs a structured ironic cue graph that explicitly characterizes cue strength and cue relationships, uses a two-layer controlled dynamic fusion mechanism to dynamically select fusion paths and granularity, and applies light-perturbation consistency training to improve the model's robustness and stability under noise.
The method improves the accuracy and stability of irony detection, reduces the risk of false positives, and enhances adaptability and engineering usability in open social media environments.
Figure CN122240937A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of natural language processing and multimodal learning technology, and in particular to a method and system for sarcasm detection based on multimodal cue routing fusion.
Background Technology
[0002] With the widespread adoption of social media and mobile devices, user expressions on platforms like Weibo and Twitter exhibit typical "multimodal collage" characteristics: a post often simultaneously contains natural images, accompanying text, and additional information carried in screenshots / memes, such as Optical Character Recognition (OCR) text, symbols, emojis, and local visual identifiers. Irony, as a high-frequency social language phenomenon, typically conveys an attitude through an expression that is "contrary to or inconsistent with the true intent," and is widely used for emotional venting, group stance confrontation, public opinion guidance, and topic manipulation. For content moderation and community governance, irony detection can be used to identify covert attacks, evasive expressions of hate speech, and misleading narratives; for public opinion analysis and public safety scenarios, it can be used to capture emotional escalation and confrontational rhetoric in controversial events; and in recommendation and search filtering, it can also be used to reduce the risk of spreading content with provocative or sarcastic tendencies. Therefore, multimodal irony detection has clear practical needs and application value.
[0003] Unlike general sentiment classification or topic recognition, the difficulty in identifying irony lies not in "whether there is emotion," but in "whether the emotion and semantics have been reversed, contradicted, or misaligned." Its key triggering cues often exhibit sparseness and locality: on the text side, triggering structures such as negation, transition, exaggeration, irony, quotation, and rhetorical questions may only appear in a few fragments; on the image side, irony may only be provided by local targets, expressions, gestures, background symbols, or compositional context; meanwhile, OCR slogans and symbolic information in screenshots / memes often play a "theme-highlighting" role, even directly determining the direction of semantic reversal. In this context, simply performing global representation learning on the entire text or image is easily overwhelmed by a large amount of neutral background, resulting in "diluted ironic cues" or "failure to capture key triggering fragments."
[0004] Furthermore, the cross-modal relationships in multimodal irony are highly diverse and not singular: they can be direct semantic reversals between text and image (e.g., text conveying a positive message while the image presents a negative one), or implicit conflicts in stance / emotion between text and image (e.g., an image presenting success while the text expresses mockery). They can also be triggered by OCR text, similes, or implicit contexts, leading to "semantic jumps" at a finer, more subtle cue level. Different types of inconsistencies often correspond to different effective fusion strategies: some samples require fine-grained alignment of local cues, some rely more on key triggering information from a particular modality, and some must explicitly model the "alignment-conflict" relationship to arrive at a reliable judgment. Therefore, the system not only needs to accurately locate triggering fragments within each modality but also needs to characterize the alignment and conflict relationships between cues between modalities, and dynamically select appropriate fusion paths and granularities accordingly to achieve adaptive reasoning that varies "depending on the cue form."
[0005] For irony detection in image-text social media content, existing research has explored static fusion, cross-modal interaction, external knowledge enhancement, and dynamic fusion, as detailed below.
[0006] (1) Static fusion and early splicing methods: Early multimodal irony detection often adopted the static fusion approach, directly splicing and pooling image and text features, or obtaining a joint representation through bilinear fusion and then classifying it. These methods are simple to implement and have controllable inference overhead, but because irony triggering cues are often sparse and local (e.g., a few negations, transitions, exaggerated segments, or local symbols in the image that contrast with the scene), static fusion is easily "diluted" by a large amount of neutral background semantics, making it difficult to stably capture the reversal and misalignment relationships between the image and text. These methods also usually lack explicit characterization of "what the key cues are and where they are located", resulting in insufficient interpretability and controllability, which provides the motivation for the subsequent introduction of "explicit cue expression and selection mechanism".
[0007] (2) Cross-modal attention and end-to-end alignment methods: To enhance the interactive capabilities of text and images, researchers generally introduce co-attention or cross-modal Transformers to model alignment relationships at the token / patch level, thereby improving cross-modal semantic understanding. These methods can learn the connections and inconsistencies between text and images to a certain extent, but satirical cues are often implicitly encoded in high-dimensional attention distributions, lacking operable and storable structured objects to carry the information of "cues-relationships-conflicts". At the same time, in real social media scenarios, perturbations such as screenshot compression, minor text rewriting, and collage propagation can cause the attention distribution and alignment relationships to drift, thus causing problems such as unstable fusion strategies and sensitivity to noise. Therefore, it is necessary to upgrade key cues from "implicit attention" to "computable structured representations" to support more controllable and robust fusion decisions.
[0008] (3) External knowledge enhancement methods: For content with semantic deficiencies or strong context dependence, such as memes, screenshots, and slogans, external knowledge enhancement methods attempt to introduce supplementary information to improve comprehension. For example, they utilize text information obtained from OCR recognition, semantic symbols or expressions, common sense knowledge, dictionaries, and knowledge bases to complete key signals in the image and explain context reversals. These methods can significantly improve performance on some samples, but they often face two challenges: First, external information naturally contains noise and uncertainty, lacks reliability assessment and controlled use mechanisms, and is prone to introducing erroneous information into decision-making; second, external information is often coarsely integrated in the form of "additional features," lacking unified modeling of alignment and conflict with the original textual clues, making it difficult for the enhanced signal to play a stable role or even causing interference. Therefore, a more reasonable direction is to regard external knowledge as a "source of clues" that can participate in reasoning, incorporate it into a unified clue space, and combine it with reliability for controlled fusion.
[0009] (4) Dynamic fusion method: To adapt to the diverse semantic relationships of different samples, one type of research introduces a dynamic fusion mechanism, which adaptively adjusts the fusion weights of different modalities, stages, or granularities according to the input content, thereby improving the adaptability to cross-modal inconsistencies. Compared with static fusion, dynamic fusion has more potential in irony detection, but existing methods still often have two problems: First, dynamic weights often lack interpretable and operable basis, making it difficult to ensure the stable selection of appropriate strategies under different inconsistencies; Second, under slight perturbation conditions (compression, cropping, OCR error, synonym rewriting, etc.), the gating weights may fluctuate unreasonably, leading to unstable output and misjudgment. The above shortcomings indicate that dynamic fusion not only needs to "adjust weights", but also needs to be driven by more structured cue relationships, and introduce stability and reliability constraints in training and inference to achieve controllable dynamic decision-making.
[0010] However, in open, noisy environments and in scenarios where the form of inconsistency varies from sample to sample, existing technologies still have the following three limitations.
[0011] (1) Existing methods are not good at locating and characterizing key ironic triggers, which leads to unstable decision-making basis.
[0012] The discrimination of multimodal irony is typically determined by a few triggering cues, including negation, transitions, and ironic segments in text, local symbols and contextual contrasts in images, screenshot captions, and emoji cues. These cues are sparse and localized, and their effectiveness depends on the correspondence and contradictions between them. Existing methods primarily rely on holistic representation learning, dispersing triggering cues across high-dimensional features and attention distributions, making it difficult to consistently highlight decisive segments. Models are prone to attention drift and cue omission in samples with strong irrelevant background semantics or weak cues. Furthermore, when there are semantic inversions, emotional misalignments, and contextual shifts between text and images, models struggle to clearly characterize the correspondences and conflicts between cues, easily misinterpreting irrelevant correspondences as supporting evidence and downplaying key conflicts as noise. These problems result in a lack of stability and controllability in the model's decision-making process, limited cross-scene generalization ability, and a significantly increased risk of misjudgment in real-world noisy environments.
[0013] (2) Existing fusion strategies are difficult to stably select appropriate inference paths under different inconsistency states, resulting in insufficient strategy adaptability.
[0014] Multimodal satirical samples exhibit significant morphological differences. Some samples are determined by direct image-text reversal, requiring fine-grained alignment and interaction; others are dominated by single-modal trigger fragments, requiring highlighting key modalities and suppressing interfering modalities; still others are triggered by screenshots, symbols, and contextual information, requiring semantic completion before determining conflict relationships. While existing methods introduce gating mechanisms or multi-stage fusion to achieve sample-level weight allocation, fusion decisions typically rely on the overall semantic representation, lacking a stable path selection mechanism that matches inconsistent morphologies. When faced with samples exhibiting significant morphological differences, models are prone to inference path mismatches, manifesting as excessive interaction amplifying noise or insufficient fusion missing conflicts. Existing methods also lack clear and controllable constraints, and fusion weights are prone to unreasonable allocation under complex noise and distribution variations, leading to fluctuations in model output and insufficient overall strategy adaptability and engineering usability.
[0015] (3) Existing models are sensitive to common propagation noise and expression variants, leading to fluctuations in online performance and an increased risk of misjudgment.
[0016] Social media content often undergoes slight changes during dissemination. For example, screenshot compression and cropping lead to loss of detail, forwarding and reposting reduce image clarity, and captions may be paraphrased, rearranged, or use colloquial variations. Text recognition in screenshots may also suffer from omissions, misspellings, and suffix confusion. While these changes typically do not alter the core meaning of the post at the semantic level, they do change the form and signal strength of local cues visible to the model. Existing methods often focus more on the accuracy of the final classification during training and inference, lacking constraints and verification on the stability of the fusion decision process. This makes the model prone to reallocating attention and fusion weights under these slight changes. Fluctuations in fusion weights further lead to inconsistent confidence levels, and boundary samples are more likely to show prediction reversals. Especially when OCR quality fluctuates or a particular modality weakens, the model struggles to mitigate the influence of that modality in a timely manner. Noise signals are easily amplified and propagated to the final judgment, resulting in online misjudgments and performance fluctuations.
Summary of the Invention
[0017] This invention addresses the shortcomings of existing multimodal irony detection methods in key cue localization, fusion strategy adaptation, and noise robustness. It proposes an irony detection method and system based on multimodal cue routing fusion, which handles the complex alignment and conflict relationships among images, text, and external text information in social media posts and improves the accuracy, stability, and engineering usability of irony detection in open social media environments.
[0018] Here, "cue" refers to a local or global semantic information unit extracted from multimodal inputs such as text, images, and OCR text, which can serve as a basis for judging irony. "Route" refers to an adaptive allocation mechanism that dynamically determines which cues are retained, which cues participate in subsequent interactions, and which fusion path processes them, based on the importance, reliability, or conflict characteristics of the different cues in the current sample.
[0019] To achieve the above objectives, the present invention provides the following technical solution:
[0020] In a first aspect, the present invention provides an irony detection method based on multimodal cue routing fusion, comprising the following steps:
[0021] Step 1: Obtain the social media post data to be detected, which includes images, text, and external text information;
[0022] Step 2: Encode the image, text, and external text information respectively to obtain the corresponding basic feature representations;
[0023] Step 3: Based on the basic feature representations, generate candidate ironic cues through a cross-modal guidance mechanism and construct a structured ironic cue graph. The nodes of the graph include text cue nodes, image cue nodes, and external text cue nodes, and the edges of the graph include alignment relationship edges and conflict relationship edges;
[0024] Step 4: Based on the structured ironic cue graph, a fused representation is generated through a two-layer controlled dynamic fusion mechanism. The two-layer controlled dynamic fusion mechanism first performs structural-level routing selection to determine the main inference structure, then performs parameter-level continuous gating fine-tuning within the selected structure, and generates a fused representation under controlled constraints.
[0025] Step 5: Input the fused representation into the discriminant layer and output the irony detection result.
[0026] Furthermore, step 3, constructing the structured ironic cue graph, specifically includes:
[0027] Step 3.1: Apply a symmetric cross-modal attention mechanism to the text representation matrix, the image representation matrix, and the external text representation matrix to generate candidate cue representations;
[0028] Step 3.2: Calculate the weight distribution of the candidate cues of each modality, and select the Top K cues as the candidate node set based on their weights;
[0029] Step 3.3: Focus and refine the candidate node set to obtain the focused cue representations, and minimize redundancy between feature dimensions by computing a redundancy removal loss;
[0030] Step 3.4: Calculate the alignment relationship edge weights and the conflict relationship edge weights based on the focused cue representations, forming a structured ironic cue graph G containing nodes, edges, and weights.
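Step 3.2 can be illustrated with a minimal sketch. It assumes that the weight of each candidate cue is a dot product with a trainable vector followed by softmax, as the detailed description suggests; the names `cue_weights`, `top_k_indices`, `candidates`, and `w` are hypothetical and are not taken from the patent.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cue_weights(candidates, w):
    """Project each candidate cue vector onto w, then normalize with softmax."""
    scores = [sum(c_k * w_k for c_k, w_k in zip(c, w)) for c in candidates]
    return softmax(scores)

def top_k_indices(weights, k):
    """Indices of the k largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]

# Toy example: four text-cue candidates with feature dimension 3.
candidates = [[0.1, 0.0, 0.2], [2.0, 1.0, 0.0], [0.0, 0.1, 0.1], [1.0, 1.5, 0.5]]
w = [1.0, 1.0, 0.0]  # hypothetical stand-in for the trainable parameter vector
weights = cue_weights(candidates, w)
selected = top_k_indices(weights, k=2)  # candidate node set for this modality
```

The same selection would be run once per modality (text, image, external text), each with its own candidate matrix.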
[0031] Furthermore, the alignment relationship edge weight is calculated as follows: the representations of the two nodes are linearly transformed, their cosine similarity is calculated, and the result is normalized using the Sigmoid function. The conflict relationship edge weight is calculated as follows: the degree of semantic inversion and the degree of polarity difference are combined by weighted summation, and the result is normalized using the Sigmoid function. The degree of polarity difference is calculated from the polarity scalars of the two nodes, and each polarity scalar is the expected value of the probability distribution output by the polarity classification head.
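The two edge weights can be sketched as follows. Several details are assumptions made only for illustration: the polarity values (-1, 0, +1) assigned to the negative, neutral, and positive classes, the absolute difference used as the degree of polarity difference, the use of untransformed node vectors for the semantic inversion term, and the mixing coefficients `lam_sem` and `lam_pol`. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, h):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w_k * h_k for w_k, h_k in zip(row, h)) for row in W]

def polarity_scalar(probs, values=(-1.0, 0.0, 1.0)):
    """Expected polarity under (negative, neutral, positive) probabilities."""
    return sum(p * v for p, v in zip(probs, values))

def alignment_weight(h_i, h_j, W):
    """Sigmoid of the cosine similarity between linearly transformed nodes."""
    return sigmoid(cosine(linear(W, h_i), linear(W, h_j)))

def conflict_weight(h_i, h_j, p_i, p_j, lam_sem=1.0, lam_pol=1.0):
    """Sigmoid of a weighted sum of semantic inversion and polarity difference."""
    inversion = 1.0 - cosine(h_i, h_j)
    polarity_gap = abs(polarity_scalar(p_i) - polarity_scalar(p_j))
    return sigmoid(lam_sem * inversion + lam_pol * polarity_gap)

# Toy example: two opposed cue vectors with opposite polarity.
W = [[1.0, 0.0], [0.0, 1.0]]  # identity as a stand-in for the trained matrix
h_text, h_image = [1.0, 0.0], [-1.0, 0.0]
a = alignment_weight(h_text, h_image, W)
c = conflict_weight(h_text, h_image, [0.8, 0.1, 0.1], [0.1, 0.1, 0.8])
```

In this toy pair the alignment weight falls below 0.5 and the conflict weight rises above it, which is the pattern the conflict edge is meant to expose.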
[0032] Furthermore, step 4, generating the fused representation, specifically includes:
[0033] Step 4.1: Extract the routing feature vector r from the structured ironic cue graph G. The routing feature vector includes conflict intensity, conflict concentration, alignment intensity, external text involvement, and cue concentration.
[0034] Step 4.2: Input the routing feature vector r into the structured routing selector and output the fusion structure template M. The template M is used to determine the main fusion depth path, the main fusion granularity strategy and the enabled status of the external text strong alignment submodule.
[0035] Step 4.3: Generate the path gating vector, the granularity gating vector, and the modal gating vector based on the fusion structure template M;
[0036] Step 4.4: Calculate the path fusion vector, the granularity fusion vector, and the modal fusion vector based on the gating vectors, and merge them to obtain the final fused representation z.
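The structural-level routing of Steps 4.1 and 4.2 can be pictured with a rule-based stand-in. The text here does not specify how the structured routing selector is implemented, so the thresholds, the fallback to a late path, and the names `FusionTemplate` and `select_template` are all assumptions; only the five routing features and the three things the template decides come from the description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FusionTemplate:
    main_path: str             # "early", "mid", or "late"
    main_granularity: str      # "local" or "global"
    external_alignment: bool   # external-text strong alignment submodule on/off

def select_template(r, tau_conflict=0.6, tau_align=0.6, tau_external=0.3):
    """Map a routing feature dict r to a fusion structure template M.

    Hypothetical rule set: concentrated strong conflict favors the mid path
    with local granularity, strong alignment favors the early path with
    global granularity, and anything else falls back to the late path.
    """
    conflict_high = (r["conflict_intensity"] > tau_conflict
                     and r["conflict_concentration"] > tau_conflict)
    if conflict_high:
        path, granularity = "mid", "local"
    elif r["alignment_intensity"] > tau_align:
        path, granularity = "early", "global"
    else:
        path, granularity = "late", "global"
    return FusionTemplate(path, granularity,
                          r["external_involvement"] > tau_external)

r = {
    "conflict_intensity": 0.8,
    "conflict_concentration": 0.7,
    "alignment_intensity": 0.2,
    "external_involvement": 0.5,
    "cue_concentration": 0.6,  # carried in r but unused by this toy rule set
}
template = select_template(r)
```

A learned selector would replace the hand-written rules; the sketch only shows the shape of the input and output.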
[0037] Furthermore, the path gating vector is used to dynamically fuse the early path output, the intermediate path output, and the late path output. The early path integrates global consistency judgments, the intermediate path performs fine-grained interaction modeling around cue nodes, and the late path aggregates information from each modality independently and then integrates it.
[0038] Furthermore, the granularity gating vector is used to dynamically fuse the local granularity output and the global granularity output. The local granularity output is obtained by aggregating high-weight cue nodes, and the global granularity output is obtained by global pooling of the three modalities.
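As a minimal sketch of the gated fusion described above, the snippet below treats each gating vector as a softmax distribution and each fusion vector as the corresponding convex combination of branch outputs. The softmax parameterization, the toy logits, and the names `gated_sum`, `g_path`, and `g_gran` are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_sum(gates, outputs):
    """Convex combination of equally sized vectors, one gate per vector."""
    dim = len(outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]

# Hypothetical branch outputs with feature dimension 2.
early, mid, late = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
local_out, global_out = [0.5, 0.5], [0.0, 1.0]

g_path = softmax([0.0, 2.0, 0.0])   # gate logits favoring the mid path
g_gran = softmax([1.0, 0.0])        # gate logits favoring local granularity

z_path = gated_sum(g_path, [early, mid, late])
z_gran = gated_sum(g_gran, [local_out, global_out])
```

How the path, granularity, and modal fusion vectors are merged into z is not detailed at this point in the text, so the sketch stops at the individual fusion vectors.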
[0039] Furthermore, the method applies a monotonic reliability constraint when generating the modal gating vector: based on the text reliability parameter, the image reliability parameter, and the external text reliability parameter, the modal weights are limited by a bound that depends on these reliability parameters and on a constant coefficient κ, and gating vectors that do not meet the constraint are truncated and renormalized.
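The exact form of the bound is not reproduced in this text. The sketch below assumes one plausible form, in which each modal weight is capped at κ times the reliability of that modality, and then applies the truncate-and-renormalize step. The function name `constrain_modal_gates` and the example values are hypothetical.

```python
def constrain_modal_gates(gates, reliabilities, kappa=1.0):
    """Truncate each gate at kappa * reliability, then renormalize to sum 1.

    Assumed constraint form: gate_m <= kappa * reliability_m.
    Order of entries: (text, image, external text).
    """
    caps = [kappa * r for r in reliabilities]
    clipped = [min(g, c) for g, c in zip(gates, caps)]
    total = sum(clipped)
    if total <= 0.0:
        # Degenerate case: every modality is fully unreliable.
        return [1.0 / len(gates)] * len(gates)
    return [c / total for c in clipped]

g_modal = [0.2, 0.3, 0.5]   # raw gates; external text dominates
rho = [0.9, 0.8, 0.2]       # external text (e.g. noisy OCR) is unreliable
constrained = constrain_modal_gates(g_modal, rho, kappa=1.0)
```

In this example the external text weight drops from 0.5 to about 0.29. Note that renormalization can lift a truncated weight back above its cap; an implementation that needs the bound to hold exactly would have to redistribute the excess iteratively.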
[0040] Furthermore, the method applies strategy consistency constraints when generating the path gating vector and the granularity gating vector: when the conflict intensity and the conflict concentration exceed preset thresholds, the lower limits of the intermediate path weight and the local granularity weight are raised; when the alignment strength exceeds a preset threshold, the lower limits of the early path weight and the global granularity weight are raised.
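One way to picture the strategy consistency constraint is as a floor on selected gate entries, with the remaining entries rescaled so that the vector still sums to 1. The thresholds, the floor value 0.4, the precedence of the conflict rule over the alignment rule, and the names `raise_floor` and `apply_policy` are assumptions made for this sketch.

```python
def raise_floor(gates, index, floor):
    """Ensure gates[index] >= floor; rescale the other entries to keep sum 1.

    Assumes the input gates are non-negative and sum to 1, and floor <= 1.
    """
    if gates[index] >= floor:
        return list(gates)
    rest = sum(g for i, g in enumerate(gates) if i != index)
    scale = (1.0 - floor) / rest
    return [floor if i == index else g * scale for i, g in enumerate(gates)]

def apply_policy(g_path, g_gran, r, tau_conflict=0.6, tau_align=0.6, floor=0.4):
    """g_path order: (early, mid, late); g_gran order: (local, global)."""
    if (r["conflict_intensity"] > tau_conflict
            and r["conflict_concentration"] > tau_conflict):
        return raise_floor(g_path, 1, floor), raise_floor(g_gran, 0, floor)
    if r["alignment_intensity"] > tau_align:
        return raise_floor(g_path, 0, floor), raise_floor(g_gran, 1, floor)
    return list(g_path), list(g_gran)

r = {"conflict_intensity": 0.8, "conflict_concentration": 0.7,
     "alignment_intensity": 0.2}
g_path, g_gran = apply_policy([0.5, 0.2, 0.3], [0.3, 0.7], r)
```

Here the intermediate path weight is lifted from 0.2 to 0.4 and the local granularity weight from 0.3 to 0.4, while both vectors remain normalized.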
[0041] Furthermore, the training phase of the method also includes training steps with light perturbation consistency and monotonicity constraints:
[0042] Step A: Construct a semantically invariant lightly perturbed sample x′ for each training sample x;
[0043] Step B: Input the sample pair (x, x′) into the network simultaneously to obtain the classification outputs (y, y′) and the gating vectors (g, g′), respectively;
[0044] Step C: Construct the total loss function;
[0045] The total loss function combines five terms through balance coefficients: the classification cross-entropy loss; the redundancy removal loss; the output consistency loss, which constrains the predicted distributions of the original sample and the perturbed sample to be consistent; the gating consistency loss, which constrains the gating vectors of the original sample and the perturbed sample to be consistent; and the reliability monotonicity loss, which constrains the weight of a modality to decrease in step with a decrease in the reliability of that modality.
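The loss terms can be sketched under explicit assumptions, since their exact forms are not reproduced here: symmetric KL divergence for output consistency, squared L2 distance for gating consistency, a hinge penalty for reliability monotonicity, and a total of the form L_cls + λ1·L_red + λ2·L_out + λ3·L_gate + λ4·L_mono. All function names and the λ values are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def output_consistency(y, y_pert):
    """Symmetric KL between predictions on the original and perturbed sample."""
    return 0.5 * (kl_divergence(y, y_pert) + kl_divergence(y_pert, y))

def gate_consistency(g, g_pert):
    """Squared L2 distance between the two gating vectors."""
    return sum((a - b) ** 2 for a, b in zip(g, g_pert))

def monotonicity_loss(g, g_pert, rho, rho_pert):
    """Penalize a modality whose reliability fell but whose weight rose."""
    return sum(max(0.0, gp - g0)
               for g0, gp, r0, rp in zip(g, g_pert, rho, rho_pert)
               if rp < r0)

def total_loss(l_cls, l_red, l_out, l_gate, l_mono,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five terms with balance coefficients lambdas."""
    l1, l2, l3, l4 = lambdas
    return l_cls + l1 * l_red + l2 * l_out + l3 * l_gate + l4 * l_mono

# Toy pair: the perturbation degrades the third modality (e.g. OCR noise),
# yet its gate grows from 0.4 to 0.5, which the monotonicity term penalizes.
g, g_pert = [0.3, 0.3, 0.4], [0.3, 0.2, 0.5]
rho, rho_pert = [0.9, 0.9, 0.9], [0.9, 0.9, 0.5]
l_mono = monotonicity_loss(g, g_pert, rho, rho_pert)
l_out = output_consistency([0.7, 0.3], [0.6, 0.4])
l_gate = gate_consistency(g, g_pert)
loss = total_loss(l_cls=0.5, l_red=0.2, l_out=l_out, l_gate=l_gate, l_mono=l_mono)
```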
[0046] In a second aspect, the present invention provides an irony detection system based on multimodal cue routing fusion, comprising the following modules for implementing any of the methods described above:
[0047] The input encoding module is used to receive post data containing images, text, and external text information, and extract the corresponding basic feature representations;
[0048] The structured ironic cue graph construction module is used to construct a structured ironic cue graph containing nodes, aligned edges, conflict edges and their weights based on the basic feature representation, through cross-modal guidance, cue filtering and focusing, and redundancy removal constraints.
[0049] The two-layer controlled dynamic fusion module is used to first determine the main inference structure through structural-level routing selection based on the structured ironic cue graph, then dynamically allocate the fusion path, semantic granularity, and modal contribution through parameter-level continuous gating fine-tuning, and generate the fused representation under the constraints of reliability monotonicity and policy consistency.
[0050] The discrimination output module is used to map the fused representation into an irony detection result and output it.
[0051] The training constraint module is used to improve the robustness of the model during the training phase by constructing lightly perturbed sample pairs and jointly optimizing output consistency, gating consistency and reliability monotonicity loss.
[0052] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0053] 1. Irony detection relies on a small number of local triggering cues, which are scattered throughout text fragments, local image semantics, and screenshot captions. Existing methods are easily interfered with by irrelevant background semantics, making it difficult to consistently highlight decisive fragments. To address this, this invention constructs a computable multimodal irony cue expression system. Using cues as the core organizing object, it incorporates textual cues, visual cues, and external textual cues into a unified representation, explicitly characterizing the correspondence and conflict relationships between cue strengths. This allows the model to stably focus on decisive triggering fragments. Simultaneously, it introduces redundancy removal constraints to suppress repeated responses from redundant and noisy cues, reducing their interference with decision-making and thus improving the stability and reliability of cue localization and relationship characterization.
[0054] 2. Multimodal satirical samples exhibit significant morphological differences, with different inconsistencies corresponding to different effective reasoning methods. Existing methods are prone to inference path mismatch during the fusion process, leading to excessive interaction amplifying noise or insufficient fusion resulting in missed conflicts. To address this, this invention uses cue relationship representation as the basis for fusion decisions, first selecting the inference structure and then fine-tuning the fusion weights to ensure the fusion strategy matches the inconsistency patterns of the samples. Simultaneously, this invention introduces controllable constraints in the fusion decision-making process to ensure that the fusion weights maintain a reasonable trend with changes in modality reliability, avoiding performance fluctuations caused by unreasonable weight allocation, thereby improving strategy adaptability and inference controllability. This invention enhances the adaptability of inference paths under different inconsistency patterns through a cue relationship-driven controlled fusion strategy.
[0055] 3. Social media content commonly experiences minor perturbations during dissemination, such as screenshot compression, cropping, reduced clarity, paraphrasing, and text recognition errors. These changes typically do not alter the core semantics but weaken local cue strength and change feature distribution, leading to fluctuations in fusion weights and inconsistencies in predictions. To address this, this invention proposes a joint training mechanism for minor perturbation consistency and reliability monotonicity. During training, it constructs semantically invariant minor perturbation sample pairs, constraining the model's output to remain consistent while maintaining semantic integrity. It also constrains the fusion weights and inference path selection to remain consistent and ensures that weight allocation exhibits a monotonically decreasing trend when modal quality deteriorates. This improves the model's robustness to propagation noise and expression variants, reduces output fluctuations and misjudgment risks under minor perturbation conditions, and enhances stability and engineering usability in real-world propagation noise environments.
Attached Figure Description
[0056] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this invention. For those skilled in the art, other drawings can be obtained based on these drawings.
[0057] Figure 1 is a schematic diagram of the overall architecture of the irony detection method based on multimodal cue routing fusion provided in an embodiment of the present invention.
[0058] Figure 2 is a schematic diagram of the structured ironic cue graph construction module provided in an embodiment of the present invention.
[0059] Figure 3 is a schematic diagram of the two-layer controlled dynamic fusion module provided in an embodiment of the present invention.
Detailed Implementation
[0060] To better understand this technical solution, the method of the present invention will be described in detail below with reference to the accompanying drawings.
[0061] The overall technical solution of the irony detection method based on multimodal cue routing fusion proposed in this invention is shown in Figure 1. Designed for social media posts containing images, text, and external textual information, the method automatically distinguishes ironic from non-ironic content and achieves cue focusing, fusion strategy selection, and stability assurance during the reasoning process. The technical solution consists of three cooperating core components: a structured ironic cue graph construction module, a two-layer controlled dynamic fusion module, and a light-perturbation consistency constraint training module. Together, these three modules form a complete closed loop from cue extraction to fusion decision-making and robustness enhancement.
[0062] The overall architecture includes an input layer, an encoding layer, a cue graph layer, a controlled fusion layer, a training constraint layer, and a discriminative output layer. As shown in Figure 1, the overall process of the method includes:
[0063] 1) The input layer receives social media post data, which includes images, text, and external text information. The external text information comes from the screenshot text recognition results and symbol semantic parsing results. The encoding layer extracts basic feature representations from the images, text, and external text information respectively.
[0064] 2) The cue graph layer extracts candidate ironic cues from the basic features, constructs a structured ironic cue graph, and outputs the graph nodes, edges, and their weights.
[0065] 3) The controlled fusion layer uses the features of the cue graph to first complete the structural routing selection, then completes continuous gating fine-tuning within the selected structure, and generates a fused representation under controlled constraints.
[0066] 4) The training constraint layer constructs lightly perturbed sample pairs during the training phase to jointly constrain output consistency, gating consistency, and reliability monotonicity, thereby improving the stability and robustness during the deployment phase.
[0067] 5) The discriminant output layer outputs the irony detection result based on the fusion representation.
[0068] The specific implementation of the key modules is shown in the following scheme:
[0069] 1. Structured ironic cue graph construction module
[0070] This module receives the basic feature representations of images, text, and external textual information, and outputs a structured ironic cue graph and focused cue representations. The system first encodes the input data to obtain a text representation matrix, an image representation matrix, and an external text representation matrix, whose sizes are determined by the number of text tokens, the number of image patches, and the number of external text tokens, respectively, together with a shared feature dimension. Under a cross-modal guidance mechanism, candidate ironic cue representations are generated. This mechanism employs symmetric cross-modal attention computation to achieve bidirectional information guidance between text and image, and between the external text and both image and text. The guidance result is residually fused with the original representation to obtain the candidate cue representation of each modality. The system then calculates cue weights based on the candidate cue representations and selects high-contribution cues to form the cue node set. Cue weights are obtained through linear mapping and normalization; weight vectors are calculated separately for text, images, and external text.
[0071]
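The cue weight formula is described in the surrounding text as a linear mapping followed by softmax normalization. Written out with assumed symbol names (the candidate cue matrices C_t, C_v, C_o for text, image, and external text, and trainable vectors w_t, w_v, w_o, none of which are the patent's own notation), one consistent form would be:

```latex
\alpha_t = \mathrm{softmax}(C_t\, w_t), \qquad
\alpha_v = \mathrm{softmax}(C_v\, w_v), \qquad
\alpha_o = \mathrm{softmax}(C_o\, w_o)
```

Whether the trainable vector is shared across modalities or separate per modality is not stated; separate vectors are assumed here.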
[0072] Here, the linear mapping uses a trainable parameter vector, and softmax(·) is the normalization function that makes the weights sum to 1; the results are the text cue weight distribution, the image cue weight distribution, and the external text cue weight distribution. The system selects the Top K cues as a candidate node set based on their weights, and performs focus refinement within each modality to enhance trigger fragments and suppress background interference, resulting in a focused cue representation for each of the three modalities. To reduce redundant cue responses and improve cue diversity, the system imposes a redundancy removal constraint on the focused cue representations. The system concatenates the three types of focused cue representations along the sample dimension, normalizes them to obtain a matrix, and calculates its correlation matrix. The system minimizes the off-diagonal terms of the correlation matrix, which constitutes the redundancy removal loss.
[0073]
[0074] in Represents the first element in the correlation matrix. Line number Column elements, A smaller value indicates lower redundancy between feature dimensions. This loss is jointly optimized with the main task loss during the training phase, ensuring that the cue representation remains discriminative in the feature space, thereby improving the stability of cue selection.
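As an illustrative, non-limiting sketch of the cue weighting, Top K selection, and redundancy removal loss described above, the following Python fragment may help. The function names, the toy matrices, and the parameter vector are assumptions made for illustration only; in particular, squaring the off-diagonal correlation terms is one common way to penalize them and is assumed here rather than taken from the embodiment.

```python
import math


def softmax(scores):
    """Normalize scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cue_weights(cues, w):
    """Score each cue row with a linear mapping, then apply softmax."""
    return softmax([sum(c * p for c, p in zip(row, w)) for row in cues])


def top_k_indices(weights, k):
    """Indices of the k largest weights, returned in original order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


def redundancy_loss(rows):
    """Sum of squared off-diagonal entries of the feature correlation matrix."""
    n, d = len(rows), len(rows[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        centered = [v - mean for v in col]
        # Guard against constant columns, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in centered)) or 1.0
        columns.append([v / norm for v in centered])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(a * b for a, b in zip(columns[i], columns[j]))
                loss += corr * corr
    return loss


# Toy candidate cues: 3 tokens with feature dimension 2 (hypothetical values).
cues = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
weights = cue_weights(cues, w=[1.0, 0.5])  # hypothetical parameter vector
selected = top_k_indices(weights, k=2)
```

With these toy values the first and third tokens receive the largest weights and are kept as cue nodes, while two perfectly correlated feature columns would be penalized by the redundancy term.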
[0075] After the cue nodes are stably formed, the system constructs a structured satirical cue graph. The graph nodes consist of text cue nodes, image cue nodes, and external text cue nodes, and the graph edges include alignment edges and conflict edges. For any two cue nodes $i$ and $j$, the system extracts the corresponding node representation vectors $h_i$ and $h_j$, where $h_i$ is the focused cue representation of node $i$ and $h_j$ is the focused cue representation of node $j$. A node representation vector is read directly from the focused cue representation matrix, at the token or patch to which the node corresponds. The system calculates the alignment edge weight $a_{ij}$, which characterizes the strength of the semantic correspondence between node $i$ and node $j$. The system first applies a linear transformation to the node representations to obtain $\tilde{h}_i = W_a h_i$ and $\tilde{h}_j = W_a h_j$, where $W_a$ is a trainable parameter matrix. It then calculates the cosine similarity and obtains the alignment edge weight through Sigmoid normalization:
[0076] $a_{ij} = \sigma\big(\cos(\tilde{h}_i, \tilde{h}_j)\big)$
[0077] where $\cos(\cdot,\cdot)$ is the cosine similarity function and $\sigma(\cdot)$ is the Sigmoid function, whose output lies between 0 and 1. A larger alignment edge weight indicates greater semantic consistency between the two cues. The system also calculates the conflict edge weight $c_{ij}$, which characterizes the conflict intensity between node $i$ and node $j$. The conflict intensity is decomposed into two parts: the degree of semantic inversion and the degree of polarity difference. The system first calculates the cosine similarity $\cos(h_i, h_j)$ of the node representations and uses $1 - \cos(h_i, h_j)$ as the degree of semantic inversion; a higher value indicates greater semantic inconsistency between the two cues. The system also calculates a polarity scalar $s_i$ for each node, which characterizes the emotion or stance expressed by the cue. The node representation vector $h_i$ is fed to a polarity classification head that outputs a three-class probability distribution $(p_i^{-}, p_i^{0}, p_i^{+})$ over the negative, neutral, and positive polarities, with the three probabilities summing to 1. The polarity scalar is defined as the expected value of this distribution:
[0078] $s_i = p_i^{+} - p_i^{-}$
[0079] where $p_i^{+}$ is the probability of positive polarity and $p_i^{-}$ is the probability of negative polarity. The polarity scalar ranges from -1 to +1; larger values indicate a more positive bias and smaller values a more negative bias. From the polarity scalars $s_i$ and $s_j$, the system then calculates the polarity difference strength $d_{ij}$ between node $i$ and node $j$.
[0080]
[0081] The system calculates the conflict edge weight by weighting the degree of semantic inversion and the polarity difference strength and then normalizing the sum with the Sigmoid function:
[0082] $c_{ij} = \sigma\big(\lambda_s\,(1 - \cos(h_i, h_j)) + \lambda_p\, d_{ij}\big)$
[0083] where $\lambda_s$ and $\lambda_p$ are constant coefficients fixed before training that balance the contributions of semantic inversion and polarity difference. A larger conflict edge weight indicates a stronger conflict between the two cues, and the system accordingly gives that cue pair more attention in subsequent fusion routing and interaction modeling. The system finally outputs the cue graph $G$, which contains the node set, the alignment edge set, the conflict edge set, and their weights, and also outputs the focused cue representations. The cue graph serves as the unified control object for subsequent structural routing and controlled fusion, providing a deterministic basis for the fusion strategy.
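The edge weight calculations above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the trainable linear transformation before the alignment similarity is replaced by the identity, the polarity difference strength is assumed to be half the absolute difference of the two polarity scalars, and the coefficient names `lam_sem` and `lam_pol` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def alignment_weight(h_i, h_j):
    """Sigmoid of the cosine similarity (linear transform omitted: assumption)."""
    return sigmoid(cosine(h_i, h_j))


def polarity_scalar(p_neg, p_neu, p_pos):
    """Expected polarity with values -1, 0, +1 for negative, neutral, positive."""
    return (-1.0) * p_neg + 0.0 * p_neu + 1.0 * p_pos


def conflict_weight(h_i, h_j, s_i, s_j, lam_sem=1.0, lam_pol=1.0):
    """Sigmoid of a weighted sum of semantic inversion and polarity difference."""
    inversion = 1.0 - cosine(h_i, h_j)
    polarity_gap = abs(s_i - s_j) / 2.0  # assumed form, scaled to [0, 1]
    return sigmoid(lam_sem * inversion + lam_pol * polarity_gap)


# A positive caption paired with a negative image region (toy vectors).
text_node, image_node = [1.0, 0.0], [-1.0, 0.0]
s_text = polarity_scalar(0.1, 0.1, 0.8)
s_image = polarity_scalar(0.8, 0.1, 0.1)
strong_conflict = conflict_weight(text_node, image_node, s_text, s_image)
no_conflict = conflict_weight(text_node, text_node, s_text, s_text)
```

In this sketch a pair of opposed cues receives a clearly larger conflict weight than a pair of identical cues, which is the ordering the routing stage relies on.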
[0084] 2. Dual-layer controlled dynamic fusion module
[0085] This module transforms multimodal cue representations into fused representations for irony discrimination. It employs a two-layer controlled dynamic fusion mechanism. The two-layer control means two things: the first layer performs structure-level routing selection, where the system first determines which inference structure is more suitable for the current sample; the second layer performs parameter-level continuous gating fine-tuning, where the system then finely allocates different fusion paths, different granular representations, and different modal contributions within the selected inference structure. By selecting the structure first and then adjusting the weights, this module ensures that the fusion strategy can adapt to different types of cross-modal inconsistencies and maintains inference stability under reliability constraints.
[0086] The input to this module comprises the three focused cue representations and the structured satirical cue graph. The three focused cue representations are the text focus representation $F_T \in \mathbb{R}^{m_T \times d}$, the image focus representation $F_V \in \mathbb{R}^{m_V \times d}$, and the external text focus representation $F_O \in \mathbb{R}^{m_O \times d}$, where $m_T$, $m_V$, and $m_O$ are the numbers of tokens or patches of the three types and $d$ is the feature dimension. The structured satirical cue graph is denoted $G$ and includes the cue nodes, the alignment edges, the conflict edges, and their weights. The module outputs the final fused representation $z$ and, at the same time, the path gating vector, the granularity gating vector, and the modal gating vector, which are used for stability constraints during the training phase. To achieve structure-level routing selection, the system first extracts a routing feature vector from the cue graph. Routing features are designed to describe the inconsistency pattern of a sample with a small number of deterministic statistics, providing the basis for subsequent structure selection. Specifically, the system calculates five quantities: the conflict intensity $r_1$, the mean of the conflict edge weights, which characterizes the overall level of cross-modal conflict; the conflict concentration $r_2$, the mean of the top portion of the conflict edge weights, which characterizes whether conflicts are concentrated on a small number of cues; the alignment strength $r_3$, the mean of the alignment edge weights, which characterizes cross-modal consistency; the external text participation $r_4$, the sum of the weights of the external text cues, which characterizes the contribution of the external text to the decision; and the cue concentration $r_5$, the negative entropy of the cue weight distribution, which characterizes whether the cues are highly concentrated. The system concatenates these quantities to obtain the routing feature vector:
[0087] $r = [r_1, r_2, r_3, r_4, r_5]^{\top}$
[0088] where $r \in \mathbb{R}^{5}$, and the system uses this vector as the input for structural routing decisions. After routing feature extraction, the system performs structure-level routing selection and outputs a fusion structure template. The fusion structure template determines the main inference structure for the sample and includes three decision components: the main fusion depth path, the main fusion granularity strategy, and the activation status of the external text strong alignment submodule. The template elevates the fusion strategy from continuous weighting to structure-level selection, so that the system can stably adopt a matching inference structure under different inconsistency patterns. The system calculates the template distribution:
[0089] $\pi = \mathrm{softmax}(W_r\, r)$
[0090] where $W_r$ is a trainable parameter matrix and $\pi$ is the template probability distribution vector. The system selects the template with the highest probability:
[0091] $M = \arg\max_{k} \pi_k$
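A minimal sketch of routing feature extraction and deterministic template selection is given below. It assumes that the cue weights passed in form a single normalized distribution over the selected cue nodes, that the "top portion" of conflict edges is a fixed fraction `top_frac`, and that the routing matrix `W` is a toy stand-in for trained parameters; these names and values are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean(values):
    return sum(values) / len(values) if values else 0.0


def routing_features(conflict_w, align_w, cue_w, external_idx, top_frac=0.2):
    """Five deterministic statistics describing a sample's inconsistency pattern."""
    top_n = max(1, math.ceil(len(conflict_w) * top_frac)) if conflict_w else 0
    top_conflicts = sorted(conflict_w, reverse=True)[:top_n]
    neg_entropy = sum(p * math.log(p) for p in cue_w if p > 0)
    return [
        mean(conflict_w),                       # conflict intensity
        mean(top_conflicts),                    # conflict concentration
        mean(align_w),                          # alignment strength
        sum(cue_w[i] for i in external_idx),    # external text participation
        neg_entropy,                            # cue concentration
    ]


def select_template(r, W):
    """Softmax over a linear map of r, followed by a deterministic argmax."""
    probs = softmax([sum(w * x for w, x in zip(row, r)) for row in W])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


r = routing_features(
    conflict_w=[0.9, 0.5, 0.5, 0.5, 0.6],
    align_w=[0.6, 0.8],
    cue_w=[0.5, 0.25, 0.25],
    external_idx=[2],
)
W = [  # three hypothetical templates
    [1.0, 1.0, 0.0, 0.0, 0.0],   # favors conflict-heavy samples
    [0.0, 0.0, 1.0, 0.0, 0.0],   # favors well-aligned samples
    [0.0, 0.0, 0.0, 1.0, 0.0],   # favors external-text-driven samples
]
template, template_probs = select_template(r, W)
```

Because the selection is an argmax rather than a sample from the distribution, the same input always yields the same template, matching the deterministic deployment behavior described in the text.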
[0092] Template selection remains deterministic during the deployment phase; the system does not use random sampling. After the structure template is determined, the system enters the second layer of control and performs parameter-level continuous gating fine-tuning. Continuous gating further allocates weights within the selected structure, so that the fusion process takes fusion depth, semantic granularity, and modal contribution into account at the same time. The system defines three types of gating vectors. The path gating vector is denoted $g^{\mathrm{path}} = (\rho_{\mathrm{early}}, \rho_{\mathrm{mid}}, \rho_{\mathrm{late}})$, whose components correspond to the early path, the intermediate path, and the late path, respectively. The granularity gating vector is denoted $g^{\mathrm{gran}} = (\gamma_{\mathrm{loc}}, \gamma_{\mathrm{glo}})$, whose components correspond to the local focusing granularity and the global aggregation granularity, respectively. The modal gating vector is denoted $g^{\mathrm{mod}} = (\beta_T, \beta_V, \beta_O)$, whose components correspond to the text modality, the image modality, and the external text modality, respectively. All three gating vectors are normalized with softmax so that the weights are non-negative and sum to 1:
[0093] $g^{\mathrm{path}} = \mathrm{softmax}(W_p\, u), \quad g^{\mathrm{gran}} = \mathrm{softmax}(W_g\, u), \quad g^{\mathrm{mod}} = \mathrm{softmax}(W_m\, u)$
[0094] where $W_p$, $W_g$, and $W_m$ are trainable parameter matrices and $u$ is the fusion context vector. The fusion context vector is built from the routing feature vector $r$ together with the pooled vectors of the three focused representations: the system pools $F_T$, $F_V$, and $F_O$ separately to obtain $\bar{f}_T, \bar{f}_V, \bar{f}_O \in \mathbb{R}^{d}$, and applies a linear mapping to the concatenation of $r$, $\bar{f}_T$, $\bar{f}_V$, and $\bar{f}_O$ to obtain $u$. To clarify the division of labor, this invention divides the fusion depth into three paths: early, intermediate, and late. The early path is used to quickly form a globally consistent judgment; the system jointly fuses the global aggregate representations of the three modalities to obtain $z_{\mathrm{early}}$. The intermediate path is used for fine-grained interaction modeling to address inconsistencies; the system performs cross-modal interactions around the cue nodes and cue relationships and aggregates the results to obtain $z_{\mathrm{mid}}$. The late path is used to enhance stability and reduce the risk of noise amplification; the system first aggregates each modality independently to obtain a modality summary and then fuses the summaries to obtain $z_{\mathrm{late}}$. The outputs of the three paths satisfy $z_{\mathrm{early}}, z_{\mathrm{mid}}, z_{\mathrm{late}} \in \mathbb{R}^{d}$. The system calculates the path fusion vector from the path gating vector:
[0095] $z_{\mathrm{path}} = \rho_{\mathrm{early}}\, z_{\mathrm{early}} + \rho_{\mathrm{mid}}\, z_{\mathrm{mid}} + \rho_{\mathrm{late}}\, z_{\mathrm{late}}$
[0096] The system also integrates local and global information at the granularity level. Local granularity highlights the local representations that correspond to the focused cues; the system aggregates the high-weight cue nodes to obtain $z_{\mathrm{loc}}$. Global granularity preserves the overall context and background consistency information; the system performs global pooling over the three modalities to obtain $z_{\mathrm{glo}}$. The system calculates the granularity fusion vector from the granularity gating vector:
[0097] $z_{\mathrm{gran}} = \gamma_{\mathrm{loc}}\, z_{\mathrm{loc}} + \gamma_{\mathrm{glo}}\, z_{\mathrm{glo}}$
[0098] The system also fuses the three modality summaries along the modality dimension. It aggregates the text, the image, and the external text separately to obtain the summaries $e_T, e_V, e_O \in \mathbb{R}^{d}$, and calculates the modality fusion vector from the modal gating vector:
[0099] $z_{\mathrm{mod}} = \beta_T\, e_T + \beta_V\, e_V + \beta_O\, e_O$
[0100] The system concatenates the path fusion vector, the granularity fusion vector, and the modality fusion vector and applies a linear mapping to obtain the final fused representation:
[0101] $z = W_z\, [z_{\mathrm{path}};\, z_{\mathrm{gran}};\, z_{\mathrm{mod}}]$
[0102] where $W_z$ is a trainable parameter matrix, and the final fused representation $z$ serves as the input to the satire discrimination layer.
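The gated fusion described above can be illustrated with the short sketch below. All names and numeric values are hypothetical, the gating parameters are toy matrices rather than trained weights, and the path, granularity, and modality inputs are given directly instead of being produced by encoders.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, u):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def gate(W, u):
    """Gating vector: softmax of a linear map of the fusion context vector."""
    return softmax(matvec(W, u))


def mix(weights, vectors):
    """Weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(g * v[k] for g, v in zip(weights, vectors)) for k in range(dim)]


def fuse(z_path, z_gran, z_mod, W_z):
    """Concatenate the three fusion vectors and apply a linear mapping."""
    return matvec(W_z, z_path + z_gran + z_mod)


u = [1.0, 0.0]                                   # toy fusion context vector
g_path = gate([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]], u)
g_gran = gate([[0.0, 0.0], [0.0, 0.0]], u)
g_mod = gate([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], u)

z_path = mix(g_path, [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])   # early, mid, late
z_gran = mix(g_gran, [[1.0, 0.0], [0.0, 1.0]])               # local, global
z_mod = mix(g_mod, [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])     # text, image, external

third = 1.0 / 3.0
W_z = [[third, 0.0, third, 0.0, third, 0.0],
       [0.0, third, 0.0, third, 0.0, third]]
z = fuse(z_path, z_gran, z_mod, W_z)
```

Each gate is a proper probability vector, so every fusion vector is a convex combination of its inputs, and the final linear mapping only mixes the three fusion vectors.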
[0103] To make the fusion controllable and stable in engineering use, this module applies two types of constraints to the gating decision. The first type is the reliability monotonicity constraint, which ensures that the weight of a modality decreases in step when the quality of that modality decreases. The system calculates a text reliability parameter $q_T$, an image reliability parameter $q_V$, and an external text reliability parameter $q_O$. All three reliability parameters range from 0 to 1, with higher values indicating higher modal quality. The system uses these reliability parameters to limit the upper bound of the modal gating weights.
[0104]
[0105] Here $\beta_m$, with $m \in \{T, V, O\}$, is the weight of the corresponding modality in the modal gating vector, and the coefficients of the bound are constants fixed before training. The system truncates each $\beta_m$ at its upper bound and renormalizes, so that the weights satisfy the upper bound constraint and sum to 1. The second type is the policy consistency constraint, which ensures that the fusion policy matches the inconsistency pattern. When the conflict intensity and the conflict concentration exceed preset thresholds, the system raises the lower bounds of the intermediate path weight and the local granularity weight:
[0106] $\rho_{\mathrm{mid}} \ge \tau_{\mathrm{mid}}, \quad \gamma_{\mathrm{loc}} \ge \tau_{\mathrm{loc}}$
[0107] When the alignment strength exceeds a preset threshold, the system raises the lower bounds of the early path weight and the global granularity weight:
[0108] $\rho_{\mathrm{early}} \ge \tau_{\mathrm{early}}, \quad \gamma_{\mathrm{glo}} \ge \tau_{\mathrm{glo}}$
[0109] where $\tau_{\mathrm{mid}}$, $\tau_{\mathrm{loc}}$, $\tau_{\mathrm{early}}$, and $\tau_{\mathrm{glo}}$ are constant lower bound parameters fixed before training. After the gating computation, the system performs a projection correction so that all lower bound constraints hold while the weights remain non-negative and sum to 1. The system finally outputs the fused representation together with the three gating vectors, which are used for subsequent discrimination and for the stability constraints during the training phase.
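One possible way to realize the truncation with renormalization and the lower bound projection is sketched below. The upper bounds are passed in directly as `caps` because the exact dependence of the bound on the reliability parameters is not reproduced here; the function names, the iterative redistribution strategy, and the sample numbers are assumptions for illustration.

```python
def cap_and_renormalize(weights, caps, tol=1e-12):
    """Clip weights to per-entry upper bounds while keeping their sum at 1.

    Mass removed from capped entries is redistributed proportionally over
    the remaining entries, repeating until no entry exceeds its cap.
    """
    if sum(caps) < 1.0 - tol:
        raise ValueError("caps must sum to at least 1")
    n = len(weights)
    w = list(weights)
    capped = [False] * n
    while True:
        free = [i for i in range(n) if not capped[i]]
        free_mass = 1.0 - sum(caps[i] for i in range(n) if capped[i])
        free_sum = sum(w[i] for i in free)
        for i in free:
            if free_sum > 0:
                w[i] = w[i] * free_mass / free_sum
            else:
                w[i] = free_mass / len(free)
        violators = [i for i in free if w[i] > caps[i] + tol]
        if not violators:
            return w
        for i in violators:
            capped[i] = True
            w[i] = caps[i]


def enforce_floor(weights, index, floor):
    """Raise one weight to a lower bound and rescale the others to sum to 1."""
    w = list(weights)
    if w[index] >= floor:
        return w
    rest = sum(w) - w[index]
    scale = (1.0 - floor) / rest if rest > 0 else 0.0
    return [floor if i == index else v * scale for i, v in enumerate(w)]


# Modal gate (text, image, external text) with a low-reliability text modality.
g_mod = cap_and_renormalize([0.6, 0.3, 0.1], caps=[0.4, 1.0, 1.0])

# Path gate (early, mid, late) under strong, concentrated conflict.
g_path = enforce_floor([0.2, 0.3, 0.5], index=1, floor=0.4)
```

A plain clip followed by a single renormalization can push a weight back above its bound, which is why the sketch repeats the redistribution until every bound holds.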
[0110] 3. Training module for consistency and monotonicity constraints under slight perturbations
[0111] This module enhances the model's robustness to propagated noise and representational variants during the training phase, ensuring stable output under semantically invariant minor perturbations and making the fusion weights exhibit a reasonable trend with modal quality changes. This module does not alter the inference flow during the deployment phase; the deployment phase only uses the cue graph construction module and the two-layer controlled dynamic fusion module to complete the irony discrimination. During the training phase, this module constructs minor perturbation sample pairs and introduces consistency and monotonicity constraints, enabling the model to acquire stable cue selection and fusion decision-making capabilities.
[0112] The system constructs a corresponding lightly perturbed sample for each training sample, forming a sample pair. Let the original sample be $x$ and the perturbed sample be $x'$. The light perturbation follows the principle of semantic invariance, so that a sample retains its original label after perturbation. Text perturbation uses synonym rewriting and minor format changes while preserving entities and core viewpoints. Image perturbation uses compression noise and light cropping while preserving the main objects and scenes. External text perturbation uses confidence reduction and simulated local character noise while preserving the main words. The system feeds $x$ and $x'$ separately into the inference network of this invention to obtain the classification outputs of the original and perturbed samples, along with their gating vectors. The classification output of the original sample is denoted $y$ and that of the perturbed sample $y'$, where $y$ and $y'$ are the unnormalized logit vectors output by the classification head. The gating vector of the original sample is denoted $g$ and that of the perturbed sample $g'$. The gating vector $g$ is composed of three parts: the path gating vector $g^{\mathrm{path}}$, the granularity gating vector $g^{\mathrm{gran}}$, and the modal gating vector $g^{\mathrm{mod}}$; correspondingly, the perturbed sample gating vector $g'$ contains $g'^{\mathrm{path}}$, $g'^{\mathrm{gran}}$, and $g'^{\mathrm{mod}}$. The true label of the original sample is denoted $y^{*}$. To improve the model's stability in real-world noisy environments, this module designs the training objective as follows. For classification accuracy, the system uses the cross-entropy loss to measure the difference between the predicted distribution and the true label:
[0113] $L_{\mathrm{cls}} = \mathrm{CE}(y, y^{*})$
[0114] where CE(·) is the cross-entropy loss function.
[0115] For output stability, the system constrains the predicted distributions of the original sample and the perturbed sample to remain consistent, so that minor perturbations do not shift the prediction. The system converts $y$ and $y'$ into probability distributions via softmax and defines the output consistency loss as a mean squared error:
[0116] $L_{\mathrm{out}} = \big\lVert \mathrm{softmax}(y) - \mathrm{softmax}(y') \big\rVert_2^{2}$
[0117] where softmax(·) is the normalization function and $\lVert \cdot \rVert_2$ is the L2 norm.
[0118] To keep the fusion decision stable, the system constrains the gating vectors of the original sample and the perturbed sample to remain consistent, thereby avoiding fluctuations in the fusion weights caused by minor perturbations. The system calculates the differences for the three types of gating separately and sums them to form the gating consistency loss:
[0119] $L_{\mathrm{gate}} = \big\lVert g^{\mathrm{path}} - g'^{\mathrm{path}} \big\rVert_2^{2} + \big\lVert g^{\mathrm{gran}} - g'^{\mathrm{gran}} \big\rVert_2^{2} + \big\lVert g^{\mathrm{mod}} - g'^{\mathrm{mod}} \big\rVert_2^{2}$
[0120] This constraint ensures that the model maintains stability in fusion path selection and fusion weight allocation under slight perturbation conditions, reducing the risk of prediction reversal for boundary samples.
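The two consistency terms can be sketched as follows. This is an illustration under assumptions: the output term is averaged over classes, the gating term uses squared L2 differences, and all names and sample values are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def output_consistency(logits, logits_perturbed):
    """Mean squared error between the two softmax distributions."""
    p = softmax(logits)
    q = softmax(logits_perturbed)
    return squared_distance(p, q) / len(p)


def gating_consistency(gates, gates_perturbed):
    """Sum of squared differences over the path, granularity, and modal gates."""
    return sum(squared_distance(g, h) for g, h in zip(gates, gates_perturbed))


# Gates are given as (path, granularity, modality) triples.
gates = ([0.2, 0.5, 0.3], [0.6, 0.4], [0.5, 0.3, 0.2])
gates_perturbed = ([0.2, 0.4, 0.4], [0.6, 0.4], [0.5, 0.3, 0.2])

loss_out = output_consistency([0.0, 0.0], [math.log(3.0), 0.0])
loss_gate = gating_consistency(gates, gates_perturbed)
```

Both terms are zero when the perturbed sample produces exactly the same prediction and gates as the original, and grow as the perturbation moves either one.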
[0121] To obtain a reasonable response to quality changes, the system requires that when the reliability of a modality decreases, the weight of that modality decreases accordingly, which prevents low-quality modalities from being overused under noisy conditions. The modal reliability parameters of the original sample are denoted $q_T$, $q_V$, and $q_O$, corresponding to the text, image, and external text modalities, respectively. The modal reliability parameters of the perturbed sample are denoted $q'_T$, $q'_V$, and $q'_O$. Each reliability parameter ranges from 0 to 1, with higher values indicating higher modal quality. The system compares the reliability before and after the perturbation; if the reliability of a modality decreases, its weight must not increase. The system defines the reliability monotonicity loss in hinge form:
[0122] $L_{\mathrm{mono}} = \sum_{m \in \{T, V, O\}} \mathbb{1}\big(q'_m < q_m\big)\, \max\big(0,\; \beta'_m - \beta_m\big)$
[0123] where $\beta_m$ is the weight of modality $m$ in the modal gating vector of the original sample, $\beta'_m$ is the corresponding weight for the perturbed sample, and $\mathbb{1}(\cdot)$ is the indicator function, which takes the value 1 when the condition in parentheses is true and 0 otherwise. This loss ensures that the system reduces the influence of a modality on the final decision when the quality of that modality deteriorates. The system combines this loss with the redundancy removal loss from the cue graph module to form the training objective. The redundancy removal loss is denoted $L_{\mathrm{red}}$ and is defined in the cue graph construction module. The total loss of the system is defined as:
[0124] $L = L_{\mathrm{cls}} + \lambda_{\mathrm{red}} L_{\mathrm{red}} + \lambda_{\mathrm{out}} L_{\mathrm{out}} + \lambda_{\mathrm{gate}} L_{\mathrm{gate}} + \lambda_{\mathrm{mono}} L_{\mathrm{mono}}$
[0125] where $\lambda_{\mathrm{red}}$, $\lambda_{\mathrm{out}}$, $\lambda_{\mathrm{gate}}$, and $\lambda_{\mathrm{mono}}$ are constant coefficients fixed before training that balance the contributions of the loss terms. After training, the system parameters are fixed. During the deployment phase, perturbed sample construction and loss calculation are not performed; the satire detection result is output directly according to the inference flow.
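The hinge-form monotonicity loss and the weighted combination of the loss terms can be illustrated as below. The coefficient values, the placement of an unweighted classification term, and all function names are assumptions chosen for the sketch.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def monotonicity_loss(beta, beta_perturbed, q, q_perturbed):
    """Penalize a modality whose weight rises although its reliability fell."""
    loss = 0.0
    for b, b_p, r, r_p in zip(beta, beta_perturbed, q, q_perturbed):
        if r_p < r:                      # indicator: reliability decreased
            loss += max(0.0, b_p - b)    # hinge: only weight increases count
    return loss


def total_loss(l_cls, l_red, l_out, l_gate, l_mono,
               lam_red=0.1, lam_out=1.0, lam_gate=1.0, lam_mono=1.0):
    """Weighted sum of the loss terms (coefficient values are hypothetical)."""
    return (l_cls + lam_red * l_red + lam_out * l_out
            + lam_gate * l_gate + lam_mono * l_mono)


# Modal weights and reliabilities in the order (text, image, external text).
# The image reliability drops after perturbation but its weight goes up.
l_mono = monotonicity_loss(
    beta=[0.5, 0.3, 0.2], beta_perturbed=[0.4, 0.4, 0.2],
    q=[0.9, 0.8, 0.7], q_perturbed=[0.9, 0.5, 0.7],
)
l_cls = cross_entropy([0.0, 0.0], label=0)
loss = total_loss(l_cls, l_red=0.0, l_out=0.0, l_gate=0.0, l_mono=l_mono)
```

In the toy values only the image modality is penalized, because it is the only one whose reliability decreased while its gating weight increased.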
[0126] Taking the detection of sarcasm in text and image posts on online social platforms as an example, the steps of the sarcasm detection method based on multimodal cue routing fusion proposed in this invention are as follows:
[0127] 1) Obtain the post data to be detected. The input sample includes image content, accompanying text, and external text information. The external text information is provided by the upstream system and includes the text recognized from screenshots together with its confidence parameters.
[0128] 2) Encode the image, the text, and the external text information separately to obtain the corresponding vector representations, generate candidate satirical cue representations with the cross-modal guidance mechanism, calculate the cue weights, and select the high-contribution cues.
[0129] 3) Perform intramodal focusing refinement and redundancy removal constraints on the selected candidate clues to form a stable set of clue nodes, and construct a structured ironic clue graph. The graph explicitly depicts the alignment and conflict relationships and their weights among the clues.
[0130] 4) Extract routing features based on the clue graph and complete the structural-level routing selection to determine the main inference structure of the sample; perform continuous gating fine-tuning within the selected structure to dynamically allocate the fusion path, semantic granularity and modal contribution, and generate the fusion representation under the constraints of reliability monotonicity and policy consistency.
[0131] 5) The fused representation is input to the discriminative layer, which outputs the predicted satire category and the corresponding confidence score. During the training phase, semantically invariant lightly perturbed sample pairs are constructed, and output consistency, gating consistency, and reliability monotonicity are jointly constrained to improve the model's robustness to perturbations such as screenshot compression, cropping, text rewriting, and external text noise. In practical applications, the model is deployed in content moderation and public opinion monitoring systems, and online or offline evaluations verify its stable detection performance and low false positive fluctuation across scenarios. Over multiple rounds of testing, the proposed solution shows high detection accuracy and low false positive fluctuation on large-scale data, with an accuracy exceeding 95%.
[0132] The method provided by this invention for detecting sarcasm and identifying risks in social media image and text content scenarios has the following advantages:
[0133] 1. This invention significantly improves the ability to capture and accurately identify sparse satirical cues. In real posts, the label is often determined by a few local trigger fragments, and existing methods are easily disturbed by large amounts of neutral background and irrelevant visual content. This invention can stably locate decisive trigger information in complex backgrounds and use it as the basis for inference, so that the model maintains higher recognition reliability for typical satirical forms such as image-text reversal, context jumps, and external text triggers, thereby reducing missed detections and false detections.
[0134] 2. This invention improves policy adaptability and inference stability under different inconsistency patterns. Faced with diverse satirical expressions, the model needs to make reasonable trade-offs between fine-grained interactions and global fusion. Existing methods are prone to inference path mismatches when sample morphology changes, leading to performance instability. This invention provides clear selection criteria for fusion decisions and ensures that fusion weights maintain a reasonable trend with changes in input quality. This allows the model to maintain more stable output performance across topics, platforms, and propagation patterns, reducing performance fluctuations caused by modal noise and inappropriate structure selection.
[0135] 3. This invention significantly enhances robustness to propagation noise and expression variants, reducing online deployment risks. Social media content commonly suffers from perturbations such as screenshot compression and cropping, detail loss due to secondary propagation, minor text rewriting, and external character recognition errors. Existing methods are prone to confidence drift and prediction flipping under these conditions. This invention enables the model to maintain consistent output under semantically invariant minor perturbations and automatically reduces the influence of a mode on decision-making when mode quality deteriorates, thereby reducing the probability of boundary sample flipping and improving the stability and maintainability of the online system.
[0136] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. However, these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A satire detection method based on multimodal cue routing fusion, characterized in that it includes the following steps: Step 1: obtain the social media post data to be detected, which includes an image, text, and external text information; Step 2: encode the image, the text, and the external text information separately to obtain the corresponding basic feature representations; Step 3: based on the basic feature representations, generate candidate satirical cues through a cross-modal guidance mechanism and construct a structured satirical cue graph, wherein the nodes of the graph include text cue nodes, image cue nodes, and external text cue nodes, and the edges of the graph include alignment relationship edges and conflict relationship edges; Step 4: based on the structured satirical cue graph, generate a fused representation through a two-layer controlled dynamic fusion mechanism, wherein the two-layer controlled dynamic fusion mechanism first performs structure-level routing selection to determine the main inference structure, then performs parameter-level continuous gating fine-tuning within the selected structure, and generates the fused representation under controlled constraints; Step 5: input the fused representation into the discriminative layer and output the satire detection result.
2. The satire detection method based on multimodal cue routing fusion according to claim 1, characterized in that constructing the structured satirical cue graph in Step 3 specifically includes: Step 3.1: apply a symmetric cross-modal attention mechanism to the text representation matrix $H_T$, the image representation matrix $H_V$, and the external text representation matrix $H_O$ to generate the candidate cue representations $C_T$, $C_V$, and $C_O$; Step 3.2: calculate the weight distributions $\alpha_T$, $\alpha_V$, and $\alpha_O$ of the candidate cues of each modality, and select the Top K cues as the candidate node set according to their weights; Step 3.3: perform focus refinement on the candidate node set to obtain the focused cue representations $F_T$, $F_V$, and $F_O$, and minimize the redundancy between feature dimensions by calculating the redundancy removal loss $L_{\mathrm{red}}$; Step 3.4: calculate the alignment relationship edge weights $a_{ij}$ and the conflict relationship edge weights $c_{ij}$ from the focused cue representations, thereby forming a structured satirical cue graph G containing nodes, edges, and weights.
3. The satire detection method based on multimodal cue routing fusion according to claim 2, characterized in that the alignment relationship edge weight $a_{ij}$ is calculated as follows: the node representations $h_i$ and $h_j$ are linearly transformed, their cosine similarity is calculated, and the result is normalized with the Sigmoid function; the conflict relationship edge weight $c_{ij}$ is calculated as follows: the degree of semantic inversion and the degree of polarity difference are weighted and summed, and the sum is normalized with the Sigmoid function; wherein the degree of polarity difference $d_{ij}$ is calculated from the node polarity scalars $s_i$ and $s_j$, and each polarity scalar is determined by the expected value of the probability distribution output by the polarity classification head.
4. The satire detection method based on multimodal cue routing fusion according to claim 1, characterized in that generating the fused representation in Step 4 specifically includes: Step 4.1: extract the routing feature vector r from the structured satirical cue graph G, wherein the routing feature vector includes the conflict intensity, the conflict concentration, the alignment strength, the external text participation, and the cue concentration; Step 4.2: input the routing feature vector r into the structural routing selector and output the fusion structure template M, wherein the template M determines the main fusion depth path, the main fusion granularity strategy, and the enabled status of the external text strong alignment submodule; Step 4.3: generate the path gating vector $g^{\mathrm{path}}$, the granularity gating vector $g^{\mathrm{gran}}$, and the modal gating vector $g^{\mathrm{mod}}$ based on the fusion structure template M; Step 4.4: calculate the path fusion vector $z_{\mathrm{path}}$, the granularity fusion vector $z_{\mathrm{gran}}$, and the modality fusion vector $z_{\mathrm{mod}}$ from the gating vectors, and merge them to obtain the final fused representation z.
5. The satire detection method based on multimodal cue routing fusion according to claim 4, characterized in that the path gating vector $g^{\mathrm{path}}$ is used to dynamically fuse the early path output $z_{\mathrm{early}}$, the intermediate path output $z_{\mathrm{mid}}$, and the late path output $z_{\mathrm{late}}$; the early path is used to form a globally consistent judgment, the intermediate path is used to perform fine-grained interaction modeling around the cue nodes, and the late path is used to aggregate the information of each modality independently and then fuse it.
6. The satire detection method based on multimodal cue routing fusion according to claim 4, characterized in that the granularity gating vector $g^{\mathrm{gran}}$ is used to dynamically fuse the local granularity output $z_{\mathrm{loc}}$ and the global granularity output $z_{\mathrm{glo}}$; the local granularity output $z_{\mathrm{loc}}$ is obtained by aggregating the high-weight cue nodes, and the global granularity output $z_{\mathrm{glo}}$ is obtained by global pooling over the three modalities.
7. The satire detection method based on multimodal cue routing fusion according to claim 4, characterized in that, when generating the modal gating vector $g^{\mathrm{mod}}$, the method applies a reliability monotonicity constraint, that is, based on the text reliability parameter $q_T$, the image reliability parameter $q_V$, and the external text reliability parameter $q_O$, each modal weight is limited to an upper bound determined by the corresponding reliability parameter, where κ is a constant coefficient, and gating vectors that do not satisfy the constraint are truncated and renormalized.
8. The satire detection method based on multimodal cue routing fusion according to claim 4, characterized in that, when generating the path gating vector $g^{\mathrm{path}}$ and the granularity gating vector $g^{\mathrm{gran}}$, the method applies a policy consistency constraint, that is, when the conflict intensity and the conflict concentration exceed preset thresholds, the lower bounds of the intermediate path weight $\rho_{\mathrm{mid}}$ and the local granularity weight $\gamma_{\mathrm{loc}}$ are raised; and when the alignment strength exceeds a preset threshold, the lower bounds of the early path weight $\rho_{\mathrm{early}}$ and the global granularity weight $\gamma_{\mathrm{glo}}$ are raised.
9. The satire detection method based on multimodal cue routing fusion according to claim 1, characterized in that the training phase of the method further includes training steps with light perturbation consistency and monotonicity constraints: Step A: construct a semantically invariant lightly perturbed sample x′ for each training sample x; Step B: input the sample pair (x, x′) into the network to obtain the classification outputs (y, y′) and the gating vectors (g, g′), respectively; Step C: construct the total loss function as a weighted combination of $L_{\mathrm{cls}}$, $L_{\mathrm{red}}$, $L_{\mathrm{out}}$, $L_{\mathrm{gate}}$, and $L_{\mathrm{mono}}$; wherein $L_{\mathrm{cls}}$ is the classification cross-entropy loss, $L_{\mathrm{red}}$ is the redundancy removal loss, $L_{\mathrm{out}}$ is the output consistency loss that constrains the predicted distributions of the original sample and the perturbed sample to be consistent, $L_{\mathrm{gate}}$ is the gating consistency loss that constrains the gating vectors of the original sample and the perturbed sample to be consistent, $L_{\mathrm{mono}}$ is the reliability monotonicity loss that constrains the weight of a modality to decrease when its reliability decreases, and the weights of the combination are balance coefficients.
10. An irony detection system based on multimodal cue routing fusion, characterized in that, The following modules are included for implementing the method according to any one of claims 1-9: The input encoding module is used to receive post data containing images, text, and external text information, and extract the corresponding basic feature representations; The structured ironic cue graph construction module is used to construct a structured ironic cue graph containing nodes, aligned edges, conflict edges and their weights based on the basic feature representation, through cross-modal guidance, cue filtering and focusing, and redundancy removal constraints. The two-layer controlled dynamic fusion module is used to determine the main inference structure based on the structured ironic clue graph by first selecting the main inference structure through structural routing, then dynamically allocating the fusion path, semantic granularity and modal contribution through parameter-level continuous gating fine-tuning, and generating the fusion representation under the constraints of reliability monotonicity and policy consistency. The discrimination output module is used to map the fused representation into an irony detection result and output it. The training constraint module is used to improve the robustness of the model during the training phase by constructing lightly perturbed sample pairs and jointly optimizing output consistency, gating consistency and reliability monotonicity loss.