A large language model-based multi-modal sarcasm detection method
By constructing a high-quality multimodal satire detection training dataset and a chain-based satire inference framework, combined with the ImageBind encoder and LoRA low-rank adaptation technique, the problems of data scarcity and high annotation costs in multimodal satire detection are solved, achieving high efficiency, accuracy, and interpretability in fine-grained satire detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INSTITUTE OF PETROCHEMICAL TECHNOLOGY
- Filing Date
- 2025-08-28
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies suffer from data scarcity and high annotation costs in multimodal irony detection, and traditional methods struggle to effectively integrate multimodal information, failing to deeply understand and identify cross-modal contradictions in irony.
We construct a high-quality multimodal satire detection training dataset, combine cross-enhancement with a strict quality discriminator, use the ImageBind encoder for modality alignment, and employ LoRA low-rank adaptation and dynamic weighted contrastive loss for model fine-tuning. The model is decomposed into a chain-reasoning framework for target-aspect detection, reason-opinion detection, and satire category mining.
It significantly improves the model's generalization ability and cross-modal irony semantic understanding, achieves fine-grained and interpretable irony detection, reduces computational and storage overhead, and improves detection accuracy and robustness.
Smart Images

Figure CN120952006B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of natural language processing technology, and in particular to a multimodal irony detection method based on a large language model. Background Technology
[0002] Irony is a rhetorical device that typically conveys opinions or emotions by expressing the opposite of its intended meaning. Irony can reverse the emotional tone of text or commentary, carrying a degree of criticism and aggression. With the widespread use of the internet, more and more users express their views and opinions through posts, especially on social media, news reports, forums, and product reviews. To comprehensively mine the information within this data and analyze the attitudes, emotions, and tendencies of comments, it is necessary to establish an irony detection system capable of perceiving and understanding the meaning of irony.
[0003] Early research on satire detection mainly fell into two categories: those based on traditional machine learning and those based on deep learning. Methods based on traditional machine learning can be further divided into feature engineering and classification. Feature engineering, a semi-supervised approach, constructs text patterns using high-frequency words and content words, calculates the matching degree between the comment text and the constructed text patterns, combines text length and punctuation as text features, and calculates the degree of satire using the K-nearest neighbor algorithm based on Eulerian distance; a higher value indicates a greater degree of satire. Classification uses decision trees as classifiers to detect satire in text. By combining different features, the accuracy of satire detection is compared to demonstrate the effectiveness of manually designed text features in satire detection.
[0004] However, traditional satire detection is mainly based on text. More and more social media platforms allow users to create multimodal messages, including text, images and videos. Text alone cannot accurately detect satire in multimodal information. Therefore, the main work of this invention focuses on multimodal satire detection and conducts in-depth research and discussion on feature fusion between multimodalities.
[0005] Multimodal irony detection refers to the process of identifying and understanding the existence and expression of irony using multiple information modalities. This process considers not only the meaning expressed by the text itself but also focuses on analyzing associated visual information, vocal intonation, and other contextual elements. From the perspective of detection granularity, multimodal irony detection can be divided into coarse-grained and fine-grained recognition methods. Fine-grained irony recognition mainly refers to extracting elements from the text that can affect the recognition accuracy to form tuples, generally including tuples such as target, aspect category, opinion item, and irony classification.
[0006] While fine-grained sarcasm detection has achieved some success, research on sarcasm detection using large language models remains lacking. Therefore, defining a more comprehensive granularity for sarcasm tuples to fully cover sarcasm elements is still crucial. Thus, how to seamlessly integrate multimodal and fine-grained approaches with massive datasets and accurately perform aspect-level sarcasm detection is a problem that urgently needs to be solved by those skilled in the art. Summary of the Invention
[0007] To address the technical problems existing in the prior art, this invention proposes a multimodal sarcasm detection method based on a large language model. The aim is to define a more comprehensive sarcasm quintuple that fully covers the granularity of sarcasm elements, providing a transparent path for model decision-making and meeting the traceability requirements of detection results in practical applications.
[0008] To achieve the above objectives, this invention provides a multimodal irony detection method based on a large-scale language model, comprising:
[0009] Construct large-scale, high-quality text datasets;
[0010] A pre-trained language model is constructed, and the parameters of the pre-trained language model are optimized using a supervised fine-tuning strategy. The pre-trained language model is then trained using the cross-entropy loss function of autoregressive language modeling to obtain a multimodal large-scale language model.
[0011] The large-scale, high-quality text dataset is input into the multimodal large-scale language model for processing to obtain detection results.
[0012] Preferably, a large-scale, high-quality text dataset is constructed, including:
[0013] The target-opinion binary pairs of the original text are expanded using a cross-tabulation method through a large language model, and a discriminator containing irony category judgment and automatic scoring mechanism is introduced to filter low-quality data, thereby obtaining the large-scale high-quality text dataset.
[0014] Among these, the cross-method is used to expand the target-viewpoint binary of the original text using a large language model, including:
[0015] Based on the target terms and opinion terms in the original text, text containing the replaced terms is generated separately through synonym substitution. Then, the target terms and opinion terms in the text with replaced terms are cross-combined to generate new pseudo-data. When the target terms are empty, pseudo-data is generated only by replacing opinion terms. Specifically:
[0016]
[0017] In the formula, It is the original text. It is enhanced text.
[0018] Preferably, the satire category determination is used to determine whether the expanded text contains satirical elements, output satire identifiers, and extract aspect tuples from the text;
[0019] The automatic scoring mechanism is used to score texts from three dimensions: syntactic complexity, lexical richness, and matching degree with real-world scenarios. Texts with scores lower than the preset score are filtered out.
[0020] Preferably, obtaining the multimodal large-scale language model includes:
[0021] Multimodal data is input into the pre-trained language model, and features are extracted using the ImageBind encoder. The extracted features are then projected into the input space of Qwen3-14B to obtain the embedded representation of the text input.
[0022] A chain-based irony reasoning framework is constructed, and irony quintuples of the input text are extracted and irony detection is performed based on the chain-based irony reasoning framework.
[0023] The pre-trained language model is fine-tuned using the LoRA low-rank adaptation technique and trained using dynamic weighted contrastive loss to obtain the multimodal large-scale language model.
[0024] Preferably, feature extraction is performed using an ImageBind encoder, including:
[0025] h img =ImageBind img (X img );
[0026] In the formula, h img X is the feature representation vector of the image modality. img For the input image tensor, ImageBind img This is the image encoder function for the ImageBind model.
[0027] Preferably, the extracted features are projected into the input space of Qwen3-14B, including:
[0028] Define the projection matrix and bias Then, a linear transformation is performed on the multimodal encoded features output by ImageBind:
[0029] h proj =W·h ImageBind +b;
[0030] In the formula, W is the projection matrix, b is the bias vector, and h projh is the projected vector. ImageBind This refers to the multimodal feature vector obtained by the ImageBind encoder after encoding the input image data.
[0031] The projected vector h proj Concatenated with embedded text, as input h for Qwen3-14B T5-input :
[0032] h T5-input =Concat(h text ,h proj );
[0033] In the formula, h text An embedded representation of text input.
[0034] Preferably, the process of extracting ironic quintuples from the input text and performing irony detection based on the chain-like irony reasoning framework includes:
[0035] Target-Aspect Detection: Given input text D, containing multimodal signals h Qwen-input And a specific instruction P1, extracting all targets and their corresponding aspects from the text, generating a target-aspect pair set, i.e., {(t i ,a i )};
[0036] Reason-Opinion Detection: Used to extract reasons, objectives, aspects, and opinions from the target-aspect pair set, generating a set of four-tuples, i.e., {(r l ,t i ,a i ,o j )};
[0037] Irony Category Mining: This function analyzes the irony categories corresponding to viewpoints using the set of quadruples, generating a set of quintuples containing the target, aspect, viewpoint, reason, and irony category, i.e., {(r l ,t i ,a i ,o j ,s k )}.
[0038] Preferably, the pre-trained language model is fine-tuned using the LoRA low-rank adaptation technique, including:
[0039] Construct a fine-tuning dataset of instructions, including task instructions, input text, and target satirical labels;
[0040] The cross-entropy loss function, modeled using autoregressive language, is used for optimization.
[0041] Using the LoRA low-rank adaptive technique, only the injected low-rank matrix pairs are updated, keeping the pre-trained model parameters frozen, and LoRA modules are injected into specific projection layers of Qwen3-14B.
[0042] Preferably, in the LoRA low-rank adaptive technique, the original weight matrix of the pre-trained language model Forward propagation W ′ x Revised to:
[0043] W ′ x =W x +ΔW x =W x +BA x ;
[0044] In the formula, r is the rank decomposition dimension, ΔW x For low-rank adaptation increments, W x For the original weights, W ′ x For the adjusted weights.
[0045] Preferably, training is performed using a dynamically weighted contrastive loss, including:
[0046] Based on the principle of stochastic gradient descent, an 8-bit AdamW optimizer is used to update the model parameters. The objective function is optimized as the cross-entropy loss for autoregressive language modeling. The training configuration uses a global batch size of 8, a weight decay coefficient of 0.01, and a linear learning rate decay strategy for training. Finally, parameter fusion technology is used to save the final model, which is the multimodal large language model.
[0047] Compared with the prior art, the present invention has the following advantages and technical effects:
[0048] (1) This invention constructs a high-quality multimodal irony detection training dataset, effectively solving the problems of data scarcity and high annotation costs. Existing technologies typically rely on datasets with limited scale, coarse annotation granularity, and poor multimodal alignment. This invention innovatively proposes a data construction scheme that combines a cross-enhancement method based on large-scale language models (LLMs) with a rigorous quality discriminator. This method can automatically generate large-scale, high-quality, fine-grained synthetic data, and effectively filter low-quality samples through irony category judgment and multivariate automatic scoring mechanisms, thereby providing a rich and reliable data foundation for model training and significantly improving the model's generalization ability.
[0049] (2) This invention achieves deep modal alignment and fusion, enhancing the model's ability to understand cross-modal satirical semantics. Addressing the shortcomings of traditional methods in multimodal representation alignment, this invention utilizes the ImageBind encoder to map images (a non-textual modality) into a semantic space, and then efficiently aligns them with the text space of the Qwen3-14B large-scale language model using a learnable projection matrix. This design enables the model to deeply understand the potential contradictions, consistency, or reinforcement relationships between textual and non-textual information, laying a solid foundation for accurately identifying satirical expressions in multimodal contexts.
[0050] (3) This invention proposes a chain-based irony reasoning framework, achieving fine-grained and interpretable irony detection. Unlike existing coarse-grained or simple tuple detection methods, this invention decomposes the complex irony detection task into a progressive reasoning chain of "target-aspect detection → reason-opinion detection → irony category mining". This framework not only outputs a more comprehensive irony quintuple (reason, target, aspect, opinion, irony category), providing a transparent reasoning path for decision-making and enhancing the credibility and interpretability of the results, but also reduces the difficulty of direct end-to-end detection by solving step by step, thus improving the overall accuracy.
[0051] (4) This invention employs a parameter-efficient fine-tuning strategy, significantly reducing computational and storage overhead while ensuring performance. This invention uses LoRA (Low-Rank Adaptation) technology to fine-tune a large-scale pre-trained model (Qwen3-14B). This method only requires updating a very small number (approximately 0.1%) of parameters to enable the model to efficiently adapt to downstream multimodal irony detection tasks, avoiding the huge computational cost and catastrophic forgetting risk associated with full parameter fine-tuning, making it possible to deploy and fine-tune large models with limited computing resources.
[0052] (5) This invention introduces a dynamic weighted contrastive loss training mechanism, which optimizes the training process and further improves model performance. During the training phase, this invention innovatively utilizes the results of the automatic scoring mechanism to assign higher weights to high-quality training samples through dynamic weighted contrastive loss. This strategy guides the model to focus more on high-quality, high-difficulty samples, effectively improving the stability of training and the discrimination accuracy and robustness of the final model. Attached Figure Description
[0053] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0054] Figure 1 This is a flowchart of a multimodal irony detection method based on a large language model according to an embodiment of the present invention;
[0055] Figure 2 This is a technical roadmap of an embodiment of the present invention;
[0056] Figure 3 This is a schematic diagram of a multimodal large-scale language model framework according to an embodiment of the present invention;
[0057] Figure 4 This is an overview diagram of the data augmentation framework for an embodiment of the present invention;
[0058] Figure 5 This is an example diagram illustrating data augmentation using the crossover method in an embodiment of the present invention.
[0059] Figure 6 This is a flowchart of the discriminator processing according to an embodiment of the present invention. Detailed Implementation
[0060] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0061] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0062] This embodiment includes four steps: data augmentation, building a multimodal LLMs framework, model fine-tuning, and model training. Figures 1-2 .
[0063] Irony detection tasks rely on fine-grained labeled data, but existing datasets are small in size and multimodal alignment annotation is extremely costly. By generating diverse pseudo-data through cross-tabulation and incremental methods, and combining this with the semantic control capabilities of large-scale language models (LLMs), the data volume can be dynamically expanded. This process not only alleviates data scarcity but also ensures the semantic consistency between the augmented data and the multimodal context, providing reliable input for subsequent model training; therefore, data augmentation is the first step.
[0064] Traditional unimodal models struggle to capture cross-modal contradictions in satire. To address this, ImageBind is used to encode non-textual modalities and project them into a semantic space aligned with the text, achieving joint cross-modal representation. Simultaneously, a chain-reasoning framework decomposes the five-tuple extraction into a three-step progressive reasoning process: "goal-aspect → reason-opinion → satire category," gradually parsing the implicit logic of multimodal text. Therefore, the second step requires building a multimodal LLMs framework.
[0065] After the framework is built, although the pre-trained LLMs have powerful semantic understanding capabilities, their default parameters are not optimized for the fine-grained requirements of multimodal irony detection. Therefore, LoRA low-rank adaptation technology is needed to efficiently adapt to the task while retaining pre-training knowledge, hence the third step is model fine-tuning.
[0066] However, the effectiveness of model fine-tuning depends not only on algorithm design but also on the support of high-quality training data. Therefore, it is necessary to use dynamic weighted contrastive loss to assign higher weights to high-quality samples based on automatic scoring results, and at the same time introduce a contrastive regularization term to force high-quality samples of the same type to cluster in the feature space before model training.
[0067] From data augmentation to model training, the four-step approach forms a logical chain of "input → architecture → optimization → output". The following are the specific steps of the technical solution in this embodiment:
[0068] The first step consists of two modules. The first module involves building a large-scale, high-quality text dataset. The second module involves data augmentation using LLMs. The quantitative augmentation module uses LLMs to extract target-opinion pairs and then expands the text according to the original text by modifying the pairs to generate pseudo-data. Finally, a discriminator and scoring mechanism are used to evaluate and filter the augmented text.
[0069] The second step is to build such Figure 3 The multimodal LLMs framework (i.e., multimodal large language models) presented here constructs a chained ironic reasoning framework for encoding and understanding multimodal text content, achieving a high-performance task solution. This framework decomposes the task into three progressive reasoning steps, from simple to complex. It can more effectively extract ironic quintuple elements, while a rewritten verification mechanism enhances the robustness of the chained ironic reasoning process.
[0070] The third step is model fine-tuning, which uses LoRA to fine-tune the multimodal LLMs model to achieve minimal parameter updates. This step involves using the training set as supervised data and packaging the corresponding instructions to obtain model fine-tuning data. The model is then trained on this data to learn its response patterns to the given inputs and outputs.
[0071] The fourth step is model training, which uses the pre-tuned Qwen3-14B model to classify sarcasm based on the conclusions of the sarcasm quintuple and outputs the final detection results.
[0072] A multimodal irony detection method based on a large language model includes:
[0073] Construct large-scale, high-quality text datasets;
[0074] A pre-trained language model is constructed, and the parameters of the pre-trained language model are optimized using a supervised fine-tuning strategy. The pre-trained language model is then trained using the cross-entropy loss function of autoregressive language modeling to obtain a multimodal large-scale language model.
[0075] The large-scale, high-quality text dataset is input into the multimodal large-scale language model for processing to obtain detection results.
[0076] Furthermore, a large-scale, high-quality text dataset is constructed, including:
[0077] The target-opinion binary pairs of the original text are expanded using a cross-tabulation method through a large language model, and a discriminator with irony category judgment and automatic scoring mechanism is introduced to filter low-quality data, thus obtaining the large-scale high-quality text dataset.
[0078] Specifically, the framework overview diagram for this step is as follows: Figure 4 As shown, it is divided into two parts: pseudo-data generation and evaluation / selection. In general, data augmentation requires an input sequence x and outputs a sequence y containing the answers using the following formula:
[0079] y = argmaxP(y i |x);
[0080] Where y is the output sequence, y i It represents all possible outcomes of y.
[0081] The first part of data augmentation requires text augmentation, also known as text generation. Generally speaking, for the original training dataset D... o ={T O The dataset is obtained through the quantity expansion module.
[0082] Text generation (TG) aims to produce expanded commentary text by modifying the goals and viewpoints in the original commentary text; that is, by generating text by replacing the original terms with similar goal terms. Since goal and viewpoint terms are often composed of phrases, TG's goal is not merely synonym replacement, but to further generate fluent text.
[0083] When performing data augmentation, the dataset needs to be augmented using cross-validation. Specifically, TG-Prompt is used to generate two entirely new texts by replacing the target and opinion terms in the original commentary text. Then, a new text is generated based on these two augmented texts using cross-validation. An example of augmentation is shown in the figure below. Figure 5 As shown in Table 1.
[0084] Table 1
[0085]
[0086] refer to Figure 5 The cross-referencing method can replace the target and opinion tuples in the text content of "good drink" with synonyms, thereby generating two texts: "delicious drink" and "good beverage". The generated texts are then recombined to form the augmented data "delicious beverage", which is different from the original data.
[0087] In the cross-referencing method, LLMs are used to obtain expanded text by replacing the target terms and opinion terms in the original text using the following formula:
[0088]
[0089] in, It is the original text. It is enhanced text.
[0090] Given the original dataset D o China T o Since it is the text of the training set, augmented datasets can be obtained through LLMs. In the cross-validation method, obtaining augmented datasets requires modifying the original text through the two steps mentioned above. The target term 't' and the viewpoint term 'o' are used to obtain expanded text. Next, the new and old texts need to be merged to obtain an expanded dataset. If the target term is empty, the expanded text is generated by modifying the viewpoint term.
[0091] The second part is evaluation and filtering. While large language models are powerful, they are prone to illusions and may unintentionally generate low-quality data, thus impacting their performance. Therefore, evaluating the quality of the generated data and filtering out low-quality data is crucial. To achieve this, a new discriminator needs to be introduced, such as... Figure 6 As shown, it includes a satirical category judgment and an automatic scoring mechanism. The detailed instructions for the judgment module and the automatic scoring mechanism are shown in Tables 2-3 below.
[0092] Table 2
[0093]
[0094] Table 3
[0095]
[0096] Specifically, the satire category determination employs a large language model as the judge, forcing the model to determine the relevance of the synthesized data to the given domain and satire category. In other words, the large language model is used to verify whether the synthesized data is relevant to the given domain and satire category. After filtering out data with low relevance to the domain and satire category, an automatic scoring mechanism is further used to quantitatively measure data quality based on grammatical structure, lexical richness, and real-world relevance.
[0097] The scoring mechanism assigns a score of 1-10 to each sample, with higher scores indicating higher data quality. To filter out low-quality data, this embodiment sets a filtering threshold of 8. Data exceeding the threshold is used as the final training data, while other data is discarded.
[0098] Further, obtaining the multimodal large-scale language model includes:
[0099] Multimodal data is input into the pre-trained language model, and features are extracted using the ImageBind encoder. The extracted features are then projected into the input space of Qwen3-14B to obtain the embedded representation of the text input.
[0100] A chain-based irony reasoning framework is constructed, and irony quintuples of the input text are extracted and irony detection is performed based on the chain-based irony reasoning framework.
[0101] The pre-trained language model is fine-tuned using the LoRA low-rank adaptation technique and trained using dynamic weighted contrastive loss to obtain the multimodal large-scale language model.
[0102] Specifically, given a text D = {u1,…,u...} n The task is to extract all quintuples (t, a, o, s, r) from each text. Includes m i The word consists of elements t (goal), a (aspect), o (opinion), and r (reason), which can be a continuous textual span in an explicitly mentioned discourse or implied from the context or non-textual modality, and s represents the satirical category (Sarcasm / No-sarcasm).
[0103] This embodiment develops a novel multimodal large-scale language model, such as Figure 3 As shown, this model employs a large language model, Qwen3-14B, as its core for semantic understanding and decision-making. For non-text input, a multimodal model is used to encode the signals into representations that the large language model can understand. Specifically, ImageBind is used as a unified encoder for all non-text modalities, and then a linear layer is used to connect ImageBind to the large language model for representation projection.
[0104] The purpose of this section is to uniformly encode non-textual modalities such as images, audio, and video into vector representations, including steps S1 and S2, as follows:
[0105] Step S1 is multimodal signal encoding:
[0106] In this step, multimodal data needs to be input, and the ImageBind encoder is used for feature extraction. The feature extraction formula is as follows:
[0107] h img =ImageBind img (X img );
[0108] In the formula, h img X is the feature representation vector of the image modality. img For the input image tensor, ImageBind img This is the image encoder function for the ImageBind model.
[0109] ImageBind achieves representation alignment by mapping different modalities to the same semantic space through cross-modal contrastive learning. ImageBind's contrastive loss function... for:
[0110]
[0111] Where s(·) is the similarity score, τ is the temperature coefficient controlling the sharpness of the probability distribution, and h i ,h j is the feature vector of the positive sample pair (such as the matched image-text pair), N is the batch size, and k is the index of the negative sample (1≤k≤N).
[0112] This contrastive loss function can effectively describe the matching degree of paired samples (image-text pairs) and is applied to the training model for extracting features.
[0113] Step S2 is the representation projection, the purpose of which is to project the multimodal encoding of ImageBind onto the input space of Qwen3-14B.
[0114] This step requires defining the projection matrix. and bias Then, a linear transformation is performed on the multimodal code output by ImageBind:
[0115] h proj =W·h ImageBind +b;
[0116] In the formula, W is the projection matrix, b is the bias vector, and hproj h is the projected vector. ImageBind This is the multimodal feature representation vector obtained by the ImageBind encoder after extracting features from the input image data;
[0117] Finally, the projected vector h is... proj Concatenated with embedded text, as input h for Qwen3-14B T5-input :
[0118] h T5-input =Concat(h text ,h proj );
[0119] Among them, h text It is an embedded representation of text input.
[0120] Furthermore, based on the aforementioned chain-based irony reasoning framework, irony quintuples are extracted from the input text and irony detection is performed, including:
[0121] Target-Aspect Detection: Given input text D, containing multimodal signals h Qwen-input And a specific instruction P1, extracting all targets and their corresponding aspects from the text, generating a target-aspect pair set, i.e., {(t i ,a i )};
[0122] Reason-Opinion Detection: Used to extract reasons, objectives, aspects, and opinions from the target-aspect pair set, generating a set of four-tuples, i.e., {(r l ,t i ,a i ,o j )};
[0123] Irony Category Mining: This function analyzes the irony categories corresponding to viewpoints using the set of quadruples, generating a set of quintuples containing the target, aspect, viewpoint, reason, and irony category, i.e., {(r l ,t i ,a i ,o j ,s k )}.
[0124] Specifically, this embodiment proposes a chain-like irony reasoning framework. The main idea is to decompose the irony detection task into three progressive, chained reasoning steps, from simple to complex. By leveraging the capabilities of a large model, solving each step step by step accumulates key clues and insights for subsequent steps.
[0125] Step 1: Target-Aspect Detection. Given input text D containing multimodal signals h Qwen-inputAnd specific instruction P1, the initial step is designed to prompt the model to detect all possible targets and their specific aspects discussed in the text, i.e., {(t i ,a i )}.
[0126] Input instruction P1 as shown in Table 4. This step can be described as: {(t i ,a i )}←f1(D|P1).
[0127] Table 4
[0128]
[0129] Step 2: Reason-Opinion Detection. The second step is to detect the holder and their specific opinions. This step requires the model to output a quadruple consisting of the holder, the goal, the aspect, and the opinion {(r l ,t i ,a i ,o j Following this step, construct the reason-goal-aspect-viewpoint quadruple. This step can be expressed as: {(r l ,t i ,a i ,o j )}←f2(D,{(t i ,a i )}|P2); as shown in Table 5.
[0130] Table 5
[0131]
[0132]
[0133] Step 3: Satirical Category Mining. The third step is to analyze the satirical category s of each opinion. k And based on the detected reason-goal-aspect-opinion quadruple, the irony category is detected. This step requires the model to output a set of quintuples, which are formed by further adding the irony category to the previous quadruples to form {(r l ,t i ,a i ,o j ,s k This step can be expressed as: {(t)}. i ,a i ,o j ,s k ,r l )}←f3(D,{(r l ,t i ,a i ,o j)}|P3); as shown in Table 6.
[0134] Table 6
[0135]
[0136] Furthermore, the pre-trained language model is fine-tuned using the LoRA low-rank adaptation technique, including:
[0137] Construct a fine-tuning dataset of instructions, including task instructions, input text, and target satirical labels;
[0138] The cross-entropy loss function, modeled using autoregressive language, is used for optimization.
[0139] Using the LoRA low-rank adaptive technique, only the injected low-rank matrix pairs are updated, keeping the pre-trained model parameters frozen, and LoRA modules are injected into specific projection layers of Qwen3-14B.
[0140] Specifically, model fine-tuning includes a two-stage training process.
[0141] The first stage involves fine-tuning the model using supervised learning methods. This process comprises three key steps: training dataset processing (S1), training loss calculation (S2), and LoRA parameter update (S3). The specific implementation process is as follows:
[0142] Step S1 involves processing the training dataset. This includes constructing the instruction fine-tuning dataset. Where P j To imply task instructions (such as "detect the irony in the following text"), X j For the input text, i.e., the "text" field in the original data, Y j The target output is the binary ironic tag "is_sar".
[0143] The specific processing procedure is divided into three parts:
[0144] The first part requires converting the raw JSON data into a Hugging Face Dataset object.
[0145] The second part requires applying a dialogue template conversion function to format each sample into a structured dialogue format, i.e., P... j +X j Provided to the model, causing the model to output Y. j .
[0146] The third part requires applying the ChatML format template using the `tokenizer.apply_chat_template` method. This process generates a final training dataset containing N samples, which is then shuffled using a fixed random seed to ensure reproducibility.
[0147] Step S2 is to calculate the training loss.
[0148] This step requires optimizing the model L using the cross-entropy loss function modeled by autoregressive language. stage1 :
[0149]
[0150] Where M is the length of the target output sequence, P j X provides a task instruction prompt. j Given the input text sequence, Y j Let M be the target output sequence, t be the sequence position index, and Y be the target output sequence. j,t This is the output context before position t.
[0151] The specific calculation process is as follows: Input sequence S = [P] j ,X j ,Y j The token is converted into a token ID sequence using a tokenizer; the tag sequence is the same as the input sequence, but only Y is calculated. j Partial loss; after the model outputs logits Z, calculate the softmax probability distribution P at each position t. t :
[0152]
[0153] Where V is the vocabulary size, Z t,c Let Z be a fractional vector. t Z corresponds to the score of a specific word c in the vocabulary. t,c′ Z is a collective term for the scores of all words in the vocabulary. t This is the unnormalized score vector output by the model at time step (position) t.
[0154] Finally, the cross-entropy loss is calculated as follows:
[0155]
[0156] Among them, y t Let y be a one-hot label vector. t,c P is the one-hot vector of the real label. t,c L represents the predicted probability of the model. t Let be the cross-entropy loss at time step t.
[0157] Optimized configuration requires an effective batch size of 2×4 (achieved through gradient accumulation) and an initial learning rate of 2×10. -4 (with linear scheduling decay), and a weight decay coefficient of 0.01.
[0158] Step S3 updates the LoRA parameters.
[0159] This step requires efficient fine-tuning using low-rank adaptive techniques, keeping the pre-trained model parameters θ frozen, and only updating the injected low-rank matrix pair (A,B).
[0160] For the original weight matrix Its forward propagation is modified as follows:
[0161] W ′ x =W x +ΔW x =W x +BA x ;
[0162] in, r is the rank (rank decomposition dimension), W is the original weight of the pre-trained model, and ΔW is the low-rank adaptation increment.
[0163] In practice, LoRA modules need to be injected into the seven projection layers of Qwen3-14B, and LoRA alpha=16 and dropout=0 should be set.
[0164] Parameter updates use an 8-bit AdamW optimizer, and the learning rate scheduling follows a linear decay strategy.
[0165]
[0166] Where t is the current training step number, T max η is the maximum number of training steps. t Let η be the current learning rate at step t. max t represents the maximum learning rate, and t represents the current iteration number.
[0167] This method significantly reduces the number of trainable parameters (approximately 0.1% of the original number of parameters) while maintaining the model's representational power.
[0168] Furthermore, training is performed using a dynamically weighted contrastive loss, including:
[0169] Based on the principle of stochastic gradient descent, an 8-bit AdamW optimizer is used to update the model parameters. The objective function is optimized as the cross-entropy loss for autoregressive language modeling. The training configuration uses a global batch size of 8, a weight decay coefficient of 0.01, and a linear learning rate decay strategy for training. Finally, parameter fusion technology is used to save the final model, which is the multimodal large language model.
[0170] Specifically, during the training phase, a supervised fine-tuning strategy is employed to optimize the parameters of the pre-trained language model. The training process is based on the principle of stochastic gradient descent, using an 8-bit AdamW optimizer to update the model parameters. The mathematical formula is as follows:
[0171]
[0172] Where η=2×10 -4 For learning rate, This is the first-order moment estimate after bias correction. This is the second-order moment estimate after bias correction, where ∈ is the numerical stability constant, and θ is the trainable parameter of the model. t Let θ be the parameter value of the model at step t. t-1 Let be the parameter values of the model at step t-1.
[0173] The objective function is optimized using the cross-entropy loss L modeled by autoregressive language:
[0174]
[0175] Where T is the length of the target sequence, y <t This represents the historical output context, where x is the input text sequence.
[0176] The training configuration uses a global batch size of 8, a weight decay factor of 0.01, and applies a linear learning rate decay strategy.
[0177]
[0178] Among them, T max This is the preset maximum number of training steps.
[0179] After training, the final model is saved using parameter fusion techniques. First, the low-rank adapter (LoRA) weights are fused with the base model parameters:
[0180] θ final =θ pretrained +B×A;
[0181] in, The LoRA parameter matrix is then used, and the fused FP32 parameters are converted to FP16 precision representation:
[0182] θ FP16 =quantize FP16 (θ FP32 );
[0183] Finally, save the complete suite, including model parameters, word segmenter, and configuration files, to the specified directory, supporting plug-and-play deployment.
[0184] The saved model can directly receive text input and output a prediction of the probability of irony:
[0185]
[0186] in, This is a sequence classification representation vector.
[0187] This method has significant advantages in parameter efficiency, computational efficiency, and ease of deployment, making it suitable for resource-constrained satire recognition applications.
[0188] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A multimodal irony detection method based on a large language model, characterized in that, include: Construct large-scale, high-quality text datasets; A pre-trained language model is constructed, and the parameters of the pre-trained language model are optimized using a supervised fine-tuning strategy. The pre-trained language model is then trained using the cross-entropy loss function of autoregressive language modeling to obtain a multimodal large-scale language model. The large-scale, high-quality text dataset is input into the multimodal large-scale language model for processing to obtain detection results; Obtaining the multimodal large-scale language model includes: Multimodal data is input into the pre-trained language model, and features are extracted using the ImageBind encoder. The extracted features are then projected into the input space of Qwen3-14B to obtain the embedded representation of the text input. A chain-based irony reasoning framework is constructed, and irony quintuples of the input text are extracted and irony detection is performed based on the chain-based irony reasoning framework. The pre-trained language model is fine-tuned using LoRA low-rank adaptation technology and trained using dynamic weighted contrastive loss to obtain the multimodal large-scale language model. Based on the chain-based irony reasoning framework, the irony quintuples of the input text are extracted and irony detection is performed, including: Target-Aspect Detection: Given input text D, containing multimodal signals And a specific instruction P1, extracting all targets and their corresponding aspects from the text, generating a target-aspect pair set, i.e. ; Reason-Opinion Detection: Used to extract reasons, objectives, aspects, and opinions from the target-aspect pair set, generating a set of four-tuples, i.e. ; Irony Category Mining: This function analyzes the irony categories corresponding to viewpoints using the set of four-tuples, generating a set of five-tuples containing the target, aspect, viewpoint, reason, and irony category. .
2. The multimodal irony detection method based on a large language model according to claim 1, characterized in that, Construct a large-scale, high-quality text dataset, including: The target-opinion binary pairs of the original text are expanded using a cross-tabulation method through a large language model, and a discriminator containing irony category judgment and automatic scoring mechanism is introduced to filter low-quality data, thereby obtaining the large-scale high-quality text dataset. Among these, the cross-method is used to expand the target-viewpoint binary of the original text using a large language model, including: Based on the target terms and opinion terms in the original text, text containing the replaced terms is generated separately through synonym substitution. Then, the target terms and opinion terms in the text with replaced terms are cross-combined to generate new pseudo-data. When the target terms are empty, pseudo-data is generated only by replacing opinion terms. Specifically: ; In the formula, It is the original text. It is enhanced text.
3. The multimodal irony detection method based on a large language model according to claim 2, characterized in that, The satire category judgment is used to determine whether the expanded text contains satirical elements, output satire identifiers, and extract aspect tuples from the text; The automatic scoring mechanism is used to score texts from three dimensions: syntactic complexity, lexical richness, and matching degree with real-world scenarios. Texts with scores lower than the preset score are filtered out.
4. The multimodal irony detection method based on a large language model according to claim 1, characterized in that, Feature extraction is performed using the ImageBind encoder, including: ; In the formula, This is the feature representation vector of the image modality. Given the input image tensor, This is the image encoder function for the ImageBind model.
5. The multimodal irony detection method based on a large language model according to claim 4, characterized in that, The extracted features are projected onto the input space of Qwen3-14B, including: Define the projection matrix and bias Then, a linear transformation is performed on the multimodal encoded features output by ImageBind: ; In the formula, Let be the projection matrix. For bias vectors, The projected vector. This refers to the multimodal feature vector obtained by the ImageBind encoder after encoding the input image data. The projected vector Concatenate with embedded text as input for Qwen3-14B. : ; In the formula, An embedded representation of text input.
6. The multimodal irony detection method based on a large language model according to claim 1, characterized in that, The pre-trained language model is fine-tuned using the LoRA low-rank adaptation technique, including: Construct a fine-tuning dataset of instructions, including task instructions, input text, and target satirical labels; The cross-entropy loss function, modeled using autoregressive language, is used for optimization. Using the LoRA low-rank adaptive technique, only the injected low-rank matrix pairs are updated, keeping the pre-trained model parameters frozen, and LoRA modules are injected into specific projection layers of Qwen3-14B.
7. The multimodal irony detection method based on a large language model according to claim 6, characterized in that, In the LoRA low-rank adaptive technique, the original weight matrix of the pre-trained language model Forward propagation Revised to: ; In the formula, , , For rank decomposition dimension, To adapt the increment for low rank, For the original weights, For the adjusted weights.
8. The multimodal irony detection method based on a large language model according to claim 1, characterized in that, Training is performed using dynamically weighted contrastive loss, including: Based on the principle of stochastic gradient descent, an 8-bit AdamW optimizer is used to update the model parameters. The objective function is optimized as the cross-entropy loss for autoregressive language modeling. The training configuration uses a global batch size of 8, a weight decay coefficient of 0.01, and a linear learning rate decay strategy for training. Finally, parameter fusion technology is used to save the final model, which is the multimodal large language model.