Multimodal named entity recognition and localization method in low-resource scenarios

By learning from examples and using the LLaMA model, a multimodal named entity recognition and localization method is constructed in low-resource scenarios, which solves the problem of insufficient data utilization under low-resource conditions and achieves better recognition and localization results.

CN119005190B · Active · Publication Date: 2026-06-12 · CHINA ACADEMY OF ELECTRONICS AND INFORMATION TECHNOLOGY OF CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA ACADEMY OF ELECTRONICS AND INFORMATION TECHNOLOGY OF CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION
Filing Date: 2024-07-31
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In low-resource scenarios, existing multimodal named entity recognition and localization technologies cannot effectively utilize data, resulting in poor model training and inference performance.

Method used

The example learning approach is adopted. By calculating the semantic similarity of text-image pairs, the most semantically similar training set instances are selected for example learning. The LLaMA model is used for example learning and decoding, and the loss functions of named entity recognition and image entity localization are combined for training and inference.

Benefits of technology

Under low-resource conditions, the performance of multimodal named entity recognition and localization is improved; by making full use of pre-trained knowledge, the model performs better when few samples are available.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a multimodal named entity recognition and localization method for low-resource scenarios. The method uses LLaMA as its core structure and constructs multimodal instances so as to make fuller use of the model's pre-trained knowledge. In the training stage, semantically similar image-text pairs are selected by calculating similarity and used to construct instances that assist training, and the named entity recognition loss function and the entity localization loss function are calculated together during training. In the non-training stage, instances are constructed by calculating semantic similarity to assist inference. The effect of multimodal named entity recognition and localization in low-resource scenarios is thereby improved.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to a method for multimodal named entity recognition and localization in low-resource scenarios.

Background Technology

[0002] In recent years, with the rapid development of artificial intelligence, the field of natural language processing has gradually emerged. Information extraction is a key task in natural language processing, aiming to identify and extract specific types of information from unstructured or semi-structured data, that is, to transform scattered information into structured data. Information extraction mainly includes three sub-tasks: named entity recognition, relation extraction, and event extraction.

[0003] Named entity recognition is an important area of information extraction. Traditional named entity recognition tasks mainly deal with plain text data, while multimodal named entity recognition tasks extend to multimodal information including images and text and are a relatively new research direction.

[0004] With the development of technology, multimodal named entity recognition and localization tasks have gradually emerged. This task requires identifying named entities in text and locating regions related to these entities in images. Its applications include: on social media, user-posted content often contains a wealth of information, which can be automatically extracted using multimodal named entity recognition and localization technology for user profiling and interest analysis; in intelligent question-answering systems, multimodal named entity recognition and localization technology helps the system better understand user queries. In addition to textual information, the system can further understand user needs based on information from images provided by the user, thus providing more accurate and detailed answers; in the medical field, by combining medical images and case descriptions, multimodal named entity recognition and localization technology can assist doctors in diagnosis and treatment, improving the quality and efficiency of medical services.

[0005] In the research on multimodal named entity recognition and localization, data from both text and image modalities are combined, thus providing richer information. However, this multimodalization of data also brings many challenges: due to the significant differences between text and images, fusing information from the two modalities becomes very difficult. Therefore, it is necessary to design effective mechanisms to capture the semantic relationships between images and text. In addition, it is also necessary to develop models that can learn cross-modal associations to establish effective connections between images and text.

[0006] Recent methods handle the above difficulties quite well. However, research on existing multimodal named entity recognition and localization technologies shows that the current training process requires a large amount of text data with corresponding images, as well as a large amount of labeled data. This makes it difficult for current technology to train the model well, and to make full use of pre-trained knowledge, in low-resource scenarios.

Summary of the Invention

[0007] The technical problem to be solved by this invention is how to construct a multimodal named entity recognition and localization model that can make fuller use of data in low-resource scenarios; in view of this, this invention provides a multimodal named entity recognition and localization method in low-resource scenarios.

[0008] The technical solution adopted in this invention is a method for multimodal named entity recognition and localization in low-resource scenarios, comprising:

[0009] Step 1: For each text-image pair (T_i, I_i) in the test set, the text T_i is tokenized and encoded as a vector v_T using a pre-trained language model, and the image I_i is encoded as a vector v_I using a visual feature extractor;

[0010] Step 2: Calculate the text semantic similarity and image semantic similarity between each text-image pair in the test set and each text-image pair in the training set, combine them into a comprehensive semantic similarity, and form a similarity matrix;

[0011] Step 3: Based on the semantic similarity matrix, select the training-set text-image pair (T_j, I_j) that is most semantically similar to the test-set text-image pair (T_i, I_i);

[0012] Step 4: Combine (T_i, I_i) and (T_j, I_j) into one set of data, in which (T_i, I_i) is the content to be inferred and (T_j, I_j) is used to build the demonstration instance;

[0013] Step 5: For (T_j, I_j), an example is created for the text T_j by natural language template filling based on the predefined entity types, yielding a template-filled text; a text encoder encodes T_j and the template-filled text into text vectors;

[0014] For the image I_j, a visual feature extractor extracts visual features from the specific regions containing entities, yielding the visual feature vectors of the entities, and extracts the features of the entire image, yielding the visual feature vector of the entire image;

[0015] Finally, the image-text pair represented by these vectors is obtained; the text vectors and the visual feature vectors together constitute a demonstration instance;

[0016] Step 6: For (T_i, I_i), the text T_i is tokenized and encoded into text vectors using a text encoder, and a visual feature extractor extracts visual features from I_i to obtain visual feature vectors, yielding the image-text pair represented by vectors;

[0017] Step 7: Transform all image feature vectors to the same dimension as the text vectors using a linear projection layer;

[0018] Step 8: Concatenate the converted image feature vector and text vector to obtain the input vector;

[0019] Step 9: Input the input vector into the LLaMA model for instance learning. Through the attention mechanism, the extracted text features and visual features are weighted and fused to generate a joint feature representation.

[0020] Step 10: Decode the joint feature representation using the LLaMA decoder to obtain the predicted output for the image-text pair (T_i, I_i);

[0021] The method further includes a model training process, comprising:

[0022] Step 11: Based on the predicted output and the actual information of the test set, determine the named entity recognition loss function and the image entity localization loss function based on cross-entropy loss and IoU, respectively.

[0023] Step 12: Construct an overall loss function based on the named entity recognition loss function and the image entity localization loss function;

[0024] Step 13: Use the overall loss function to update the parameters using the optimizer.

[0025] In one implementation, the semantic similarity is calculated in step two as follows:

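[0026] [Equation not reproduced in the source text: text semantic similarity]

[0027] [Equation not reproduced in the source text: image semantic similarity]

[0028] [Equation not reproduced in the source text: overall semantic similarity s_ij, combining the two with weights α and β]

[0029] Here, the text of each pair is represented in vector form, as is the image of each pair; the three equations give the text semantic similarity, the image semantic similarity, and the overall semantic similarity s_ij, respectively; α and β are parameters that control the weights; the similarity matrix composed of the semantic similarities is denoted S.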

[0030] In one implementation, the predefined entity types in step five include Person, Organization, Location, and Other.

[0031] In one implementation, the linear projection is calculated in step seven as follows:

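[0032] [Equation not reproduced in the source text: linear projection of an image feature vector]

[0033] [Equation not reproduced in the source text: linear projection of an image feature vector]

[0034] [Equation not reproduced in the source text: linear projection of an image feature vector]

[0035] Here, Linear(x) is the linear projection layer; its inputs are the image features in their original dimension, and its outputs are the projected image feature vectors, which have the same dimension as the text vectors.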

[0036] In one implementation, in step eight, the sequence of input vectors is constructed as follows:

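[0037] [Equation not reproduced in the source text: the input sequence]

[0038] Here, one component of the sequence denotes the instance content and the other denotes the training content.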

[0039] In one implementation, the named entity recognition loss function and the image entity localization loss function are calculated in step eleven as follows:

[0040] Named entity recognition loss function:

[0041] [Equation not reproduced in the source text: cross-entropy loss L_NER]

[0042] Where p_{i,c} represents the probability that word i belongs to category c; y_{i,c} = 1 if word i belongs to category c, and y_{i,c} = 0 otherwise; N represents the total number of lexical units, and C represents the total number of categories.

[0043] Image entity localization loss function:

[0044] [Equation not reproduced in the source text: IoU-based loss L_BOX]

[0045] Where M represents the number of entities; IoU represents the intersection over union, which is the ratio of the area of intersection to the area of union between the predicted region and the actual region; ε is a positive constant.

[0046] In one implementation, in step twelve, the overall loss function is constructed as follows:

[0047] L = L_NER + λ·L_BOX

[0048] Where λ is the parameter controlling the weights; L_NER is the named entity recognition loss function, L_BOX is the image entity localization loss function, and L is the final loss function.

[0049] Another aspect of the present invention provides an electronic device comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the multimodal named entity recognition and localization method in low-resource scenarios as described in any of the preceding claims.

[0050] Another aspect of the present invention provides a computer storage medium storing a computer program, which, when executed by a processor, implements the steps of the multimodal named entity recognition and localization method in low-resource scenarios as described in any of the preceding claims.

[0051] Compared with the prior art, the present invention has at least the following advantages:

[0052] This application provides a method for multimodal named entity recognition and localization in low-resource scenarios based on instance learning. It introduces instance learning to construct new training and inference methods, making more efficient use of the dataset in low-resource scenarios. By using the LLaMA language model, it better utilizes pre-trained knowledge for instance learning, improving data utilization when there are few samples, and achieving better multimodal named entity recognition and localization results under low-resource conditions.

Description of the Drawings

[0053] Figure 1 is a schematic diagram of the logic of the multimodal named entity recognition and localization method in low-resource scenarios according to an embodiment of the present invention;

[0054] Figure 2 is a schematic diagram of the multimodal named entity recognition and localization method in low-resource scenarios according to an embodiment of the present invention;

[0055] Figure 3 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed Implementation

[0056] To further illustrate the technical means and effects of the present invention in achieving its intended purpose, the present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments.

[0057] Unless otherwise specified, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms (e.g., those defined in common dictionaries) shall be interpreted as having the meaning consistent with their meaning in the context of the relevant art and shall not be interpreted in an idealized or overly formal sense unless expressly so specified herein.

[0058] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0059] For ease of understanding, the abbreviations and key terms used in this embodiment are defined as follows:

[0060] SBERT: Sentence-BERT is a pre-trained model suitable for calculating text similarity;

[0061] ResNet: Residual Neural Network, a type of convolutional neural network;

[0062] LLaMA: Large Language Model Meta AI, an autoregressive pre-trained model;

[0063] Linear: Linear layer;

[0064] Adam: Adaptive Moment Estimation, an optimizer.

[0065] IoU: Intersection over Union, the area of the overlapping region divided by the area of the union of the two regions.

[0066] BP: Back Propagation, a common method used for training.

[0067] This invention provides a method for multimodal named entity recognition and localization in low-resource scenarios, comprising:

[0068] Step 1: For each text-image pair (T_i, I_i) in the test set, the text T_i is tokenized and encoded as a vector v_T using a pre-trained language model, and the image I_i is encoded as a vector v_I using a visual feature extractor;

[0069] Step 2: Calculate the text semantic similarity and image semantic similarity between each text-image pair in the test set and each text-image pair in the training set, combine them into a comprehensive semantic similarity, and form a similarity matrix;

[0070] Step 3: Based on the semantic similarity matrix, select the training-set text-image pair (T_j, I_j) that is most semantically similar to the test-set text-image pair (T_i, I_i);

[0071] Step 4: Combine (T_i, I_i) and (T_j, I_j) into one set of data, in which (T_i, I_i) is the content to be inferred and (T_j, I_j) is used to build the demonstration instance;

[0072] Step 5: For (T_j, I_j), an example is created for the text T_j by natural language template filling based on the predefined entity types, yielding a template-filled text; a text encoder encodes T_j and the template-filled text into text vectors;

[0073] For the image I_j, a visual feature extractor extracts visual features from the specific regions containing entities, yielding the visual feature vectors of the entities, and extracts the features of the entire image, yielding the visual feature vector of the entire image;

[0074] Finally, the image-text pair represented by these vectors is obtained; the text vectors and the visual feature vectors together constitute a demonstration instance;

[0075] Step 6: For (T_i, I_i), the text T_i is tokenized and encoded into text vectors using a text encoder, and a visual feature extractor extracts visual features from I_i to obtain visual feature vectors, yielding the image-text pair represented by vectors;

[0076] Step 7: Transform all image feature vectors to the same dimension as the text vectors using a linear projection layer;

[0077] Step 8: Concatenate the converted image feature vector and text vector to obtain the input vector;

[0078] Step 9: Input the input vector into the LLaMA model for instance learning. Through the attention mechanism, the extracted text features and visual features are weighted and fused to generate a joint feature representation.

[0079] Step 10: Decode the joint feature representation using the LLaMA decoder to obtain the predicted output for the image-text pair (T_i, I_i);

[0080] The method further includes a model training process, comprising:

[0081] Step 11: Based on the predicted output and the actual information of the test set, determine the named entity recognition loss function and the image entity localization loss function based on cross-entropy loss and IoU, respectively.

[0082] Step 12: Construct an overall loss function based on the named entity recognition loss function and the image entity localization loss function;

[0083] Step 13: Use the overall loss function to update the parameters using the optimizer.

[0084] With reference to Figure 1 and Figure 2, the method provided in this embodiment is described in detail step by step below.

[0085] Specifically, when using the model for inference, the method includes:

[0086] Step 1: For each text-image pair (T_i, I_i) in the test set, the text T_i is tokenized and encoded as a vector v_T using a pre-trained language model, and the image I_i is encoded as a vector v_I using a visual feature extractor;

[0087] In this embodiment, the pre-trained language model uses SBERT, and the visual feature extractor uses ResNet;

[0088] Step 2: Calculate the text semantic similarity and image semantic similarity between each text-image pair in the test set and each text-image pair in the training set, combine them into a comprehensive semantic similarity, and form a similarity matrix;

[0089] Furthermore, in step two, the semantic similarity is calculated as follows:

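[0090] [Equation not reproduced in the source text: text semantic similarity]

[0091] [Equation not reproduced in the source text: image semantic similarity]

[0092] [Equation not reproduced in the source text: overall semantic similarity s_ij, combining the two with weights α and β]

[0093] Here, the text of each pair is represented in vector form, as is the image of each pair; the three equations give the text semantic similarity, the image semantic similarity, and the overall semantic similarity s_ij, respectively; α and β are parameters that control the weights; the similarity matrix composed of the semantic similarities is denoted S.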

[0094] In this embodiment, α is set to 0.7 and β is set to 0.3.
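The weighted combination in step two can be sketched as follows. This is a minimal illustration only: the similarity measure is not recoverable from this text, so cosine similarity is an assumption, and the function names `cosine` and `similarity_matrix` are hypothetical. The toy vectors stand in for SBERT and ResNet outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure, not from the patent)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(test_pairs, train_pairs, alpha=0.7, beta=0.3):
    """S[i][j] = alpha * text_sim + beta * image_sim for test pair i and training pair j."""
    return [
        [alpha * cosine(t_i, t_j) + beta * cosine(im_i, im_j)
         for (t_j, im_j) in train_pairs]
        for (t_i, im_i) in test_pairs
    ]

# Toy (text_vector, image_vector) pairs standing in for encoder outputs.
test_pairs = [([1.0, 0.0], [0.0, 1.0])]
train_pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
S = similarity_matrix(test_pairs, train_pairs)
```

With α = 0.7 and β = 0.3, text similarity dominates the ranking, and the image term mainly separates candidates whose texts are about equally close.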

[0095] Step 3: Based on the semantic similarity matrix, select the training-set text-image pair (T_j, I_j) that is most semantically similar to the test-set text-image pair (T_i, I_i);

[0096] Furthermore, in step three, the calculation of the selected text-image pair is as follows:

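[0097] [Equation not reproduced in the source text: selection, for the test pair (T_i, I_i), of the training pair (T_j, I_j) with the highest overall semantic similarity]

[0098] [The symbol definitions accompanying this equation are not reproduced in the source text.]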
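Step three amounts to a row-wise arg-max over the similarity matrix. The sketch below assumes S is stored as a list of rows (one per test pair) and that ties go to the first maximum; `select_demonstrations` is a hypothetical name.

```python
def select_demonstrations(S):
    """For each test row i of S, return the training index j with the largest s_ij."""
    return [max(range(len(row)), key=row.__getitem__) for row in S]

# Two test pairs scored against three training pairs (toy values).
S = [[0.2, 0.9, 0.4],
     [0.8, 0.1, 0.3]]
best = select_demonstrations(S)
```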

[0099] Step 4: Combine (T_i, I_i) and (T_j, I_j) into one set of data, in which (T_i, I_i) is the content to be inferred and (T_j, I_j) is used to build the demonstration instance;

[0100] Step 5: For (T_j, I_j), an example is created for the text T_j by natural language template filling based on the predefined entity types, yielding a template-filled text; a text encoder encodes T_j and the template-filled text into text vectors;

[0101] For the image I_j, a visual feature extractor extracts visual features from the specific regions containing entities, yielding the visual feature vectors of the entities, and extracts the features of the entire image, yielding the visual feature vector of the entire image;

[0102] Finally, the image-text pair represented by these vectors is obtained; the text vectors and the visual feature vectors together constitute a demonstration instance;

[0103] Furthermore, in step five, the predefined entity types include {Person, Organization, Location, Other}.
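The patent does not give the wording of the natural language template, so the following is only a hedged sketch of what template filling over the four predefined entity types could look like. The template string and the function name `fill_template` are hypothetical.

```python
ENTITY_TYPES = ("Person", "Organization", "Location", "Other")

def fill_template(text, entities):
    """Build a demonstration text from a sentence and its labelled entities.

    `entities` is a list of (mention, entity_type) pairs. The template
    wording below is hypothetical, not taken from the patent.
    """
    for _, etype in entities:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
    facts = [f'"{mention}" has type {etype}' for mention, etype in entities]
    return f"Text: {text} Entities: " + "; ".join(facts) + "."

demo = fill_template(
    "Alice visited Paris.",
    [("Alice", "Person"), ("Paris", "Location")],
)
```

The resulting string plays the role of the template-filled text that is encoded alongside T_j in the demonstration instance.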

[0104] Step 6: For the training content (T_i, I_i), the text T_i is tokenized and encoded into text vectors, and a visual feature extractor extracts image features from I_i to obtain the image feature vector, yielding the image-text pair represented by vectors.

[0105] In this embodiment, the visual feature extractor uses ResNet.

[0106] Step 7: Transform the image feature vector to the same dimension as the text vector using a linear projection layer;

[0107] Furthermore, in step seven, the linear projection is calculated as follows:

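[0108] [Equation not reproduced in the source text: linear projection of an image feature vector]

[0109] [Equation not reproduced in the source text: linear projection of an image feature vector]

[0110] [Equation not reproduced in the source text: linear projection of an image feature vector]

[0111] Here, Linear(x) is the linear projection layer; its inputs are the image features in their original dimension, and its outputs are the projected image feature vectors, which have the same dimension as the text vectors.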
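A linear projection layer of this kind computes y = W·x + b. The sketch below is a plain-Python illustration under assumed toy dimensions (8 for image features, 4 for text vectors); the real dimensions, initialization, and whether one layer is shared across region and whole-image features are not stated in the text.

```python
import random

class Linear:
    """y = W x + b, mapping in_dim image features to out_dim (the text dimension)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Small random weights; the initialization scheme is an assumption.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

proj = Linear(in_dim=8, out_dim=4)   # toy sizes, not from the patent
image_feature = [0.5] * 8
projected = proj(image_feature)
```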

[0112] Step 8: Concatenate the converted image feature vector and text vector to use as the input vector for the model;

[0113] Furthermore, in step eight, the input sequence is constructed as follows:

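[0114] [Equation not reproduced in the source text: the input sequence]

[0115] Here, one component of the sequence denotes the instance content and the other denotes the training content.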
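One way to picture the concatenation in step eight is shown below. The ordering (demonstration first, then the content to be inferred, with text before image inside each part) is an assumption, since the original equation is not available here; `build_input` and its argument names are hypothetical.

```python
def build_input(demo_text, demo_label_text, demo_region, demo_image,
                query_text, query_image):
    """Concatenate vectors into one sequence: demonstration first, query last.

    Text arguments are lists of token vectors; image arguments are single
    vectors already projected to the text dimension.
    """
    demonstration = demo_text + demo_label_text + [demo_region, demo_image]
    query = query_text + [query_image]
    return demonstration + query

d = 2  # toy embedding dimension
sequence = build_input(
    demo_text=[[0.1] * d for _ in range(2)],
    demo_label_text=[[0.2] * d for _ in range(3)],
    demo_region=[0.3] * d,
    demo_image=[0.4] * d,
    query_text=[[0.5] * d for _ in range(2)],
    query_image=[0.6] * d,
)
```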

[0116] Step 9: Input the constructed input vector into the LLaMA model for instance learning. Through the attention mechanism, the extracted text features and visual features are weighted and fused to generate a joint feature representation.

[0117] Step 10: Decode the joint feature representation using the LLaMA decoder to obtain the predicted output for the image-text pair (T_i, I_i).

[0118] Furthermore, in step ten, the output is as follows:

[0119] Output = [E, T, L]

[0120] Where E is the entity, T is the entity type, and L is the bounding box coordinates of the entity in the image.
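The [E, T, L] triple can be held in a small record type. In this sketch the box layout (x1, y1, x2, y2) is an assumption, as the text only says L holds bounding box coordinates; `Prediction` and `to_prediction` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Prediction:
    entity: str                              # E: the entity mention
    entity_type: str                         # T: e.g. Person, Organization, Location, Other
    box: Tuple[float, float, float, float]   # L: assumed (x1, y1, x2, y2) layout

def to_prediction(output):
    """Convert a decoded [E, T, L] triple into a Prediction record."""
    entity, entity_type, box = output
    return Prediction(entity, entity_type, tuple(float(v) for v in box))

pred = to_prediction(["Alice", "Person", [10, 20, 110, 220]])
```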

[0121] The method provided in this application further includes a model training process, comprising:

[0122] Step 11: Calculate the named entity recognition loss function and the image entity localization loss function based on the decoded output and the actual information of the training set.

[0123] Furthermore, in step eleven, the two loss functions are calculated as follows:

[0124] (1) Named entity recognition loss function:

[0125] [Equation not reproduced in the source text: cross-entropy loss L_NER]

[0126] Where p_{i,c} represents the probability that word i belongs to category c; y_{i,c} = 1 if word i belongs to category c, and y_{i,c} = 0 otherwise; N represents the total number of lexical units; C represents the total number of categories.
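A cross-entropy loss over the quantities defined above can be sketched as follows. Because y_{i,c} is one-hot, the double sum over i and c reduces to the log-probability of the gold class for each token. Averaging over N is an assumption; the original equation may use a plain sum.

```python
import math

def ner_loss(probs, labels):
    """Cross-entropy over N tokens.

    probs[i][c] is p_{i,c}; labels[i] is the gold class index of token i.
    Dividing by N is an assumption about the normalization.
    """
    n = len(probs)
    return -sum(math.log(p_i[y_i]) for p_i, y_i in zip(probs, labels)) / n

probs = [[0.7, 0.1, 0.1, 0.1],        # token 0, C = 4 categories
         [0.25, 0.25, 0.25, 0.25]]    # token 1
labels = [0, 2]
loss = ner_loss(probs, labels)
```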

[0127] (2) Image entity localization loss function:

[0128] [Equation not reproduced in the source text: IoU-based loss L_BOX]

[0129] Where M represents the number of entities; IoU represents the intersection over union, which is the ratio of the area of intersection to the area of union between the predicted region and the actual region; ε is a positive constant.
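The IoU itself is fully defined by the text; the loss built on it is not recoverable here. The sketch below computes IoU for axis-aligned boxes in an assumed (x1, y1, x2, y2) layout and then uses the mean of -log(IoU + ε) over the M entities as one plausible form of the loss. That form is an assumption, chosen because a positive ε keeps the logarithm finite when boxes do not overlap.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_loss(pred_boxes, gold_boxes, eps=1e-6):
    """Assumed form: mean of -log(IoU + eps) over the M entities."""
    m = len(pred_boxes)
    return -sum(math.log(iou(p, g) + eps)
                for p, g in zip(pred_boxes, gold_boxes)) / m

overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))   # intersection 2, union 6
perfect = box_loss([(0, 0, 2, 2)], [(0, 0, 2, 2)])
```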

[0130] Step 12: Combine the two loss functions to construct the overall loss function;

[0131] L = L_NER + λ·L_BOX

[0132] Where λ is the parameter controlling the weights, L_NER is the named entity recognition loss function, L_BOX is the image entity localization loss function, and L is the final loss function.

[0133] Step 13: During back propagation (BP), update the parameters using the optimizer.

[0134] In this embodiment, the optimizer is Adam optimizer, the learning rate is set to 3e-5, and the number of training epochs is set to 30.
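Steps twelve and thirteen can be illustrated with a small plain-Python sketch: the two losses are combined with a weight λ, and a standard Adam update is applied to a flat parameter list. Only the learning rate 3e-5 comes from the text. The value of λ, the Adam hyperparameters β1, β2 and eps (common defaults), and the toy gradients are all assumptions.

```python
import math

def total_loss(l_ner, l_box, lam=1.0):
    """L = L_NER + lambda * L_BOX; the value of lambda is not given in the text."""
    return l_ner + lam * l_box

class Adam:
    """Standard Adam update for a flat list of parameters."""

    def __init__(self, lr=3e-5, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def step(self, params, grads):
        if self.m is None:
            self.m = [0.0] * len(params)
            self.v = [0.0] * len(params)
        self.t += 1
        updated = []
        for k, (p, g) in enumerate(zip(params, grads)):
            # Exponential moving averages of the gradient and its square.
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * g
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * g * g
            # Bias correction for the zero-initialized averages.
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            updated.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated

combined = total_loss(0.5, 0.2, lam=2.0)
opt = Adam()
params = opt.step([1.0, -2.0], grads=[0.5, -0.25])  # toy gradients of L
```

On the first step Adam moves each parameter by roughly the learning rate in the direction opposite to its gradient, regardless of gradient magnitude, which is why the small rate of 3e-5 gives gentle updates to a pre-trained model.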

[0135] In the second embodiment of the present invention, when performing multimodal named entity recognition and localization without training, steps eleven, twelve, and thirteen are removed.

[0136] Referring again to Figure 1: first, the image-text pairs in the training set are encoded, with text encoded using SBERT and images encoded using ResNet. The text semantic similarity and image semantic similarity between each pair are calculated separately, and their weighted sum gives a comprehensive semantic similarity matrix. Based on this matrix, the training-set image-text pair with the highest similarity to the pair being trained on or inferred is found and used to construct instances. Text and image instances are then constructed using natural language template filling and entity-region encoding, combined into multimodal instances, and added to the training or inference process. The integrated vectors are input into the LLaMA model for instance learning. During training, a loss function that jointly considers named entity recognition and entity localization is constructed, and the Adam optimizer updates the weights during back propagation. Outside training, instances are selected from the training set based on the comprehensive semantic similarity to assist inference. The final multimodal named entity recognition and localization results are obtained from the output of the LLaMA decoder.

[0137] Embodiment 1 of this invention employs a multimodal named entity recognition and localization method based on instance learning in low-resource scenarios. This method filters semantically similar image-text pairs by calculating similarity. In low-resource scenarios, it uses LLaMA as the core structure, constructing multimodal instances to more fully utilize the model's pre-trained knowledge. During the training phase, semantically similar image-text pairs are filtered by calculating similarity to construct instances to assist training. Simultaneously, named entity recognition loss functions and entity localization loss functions are calculated to aid training. In the non-training phase, instances are constructed using semantic similarity calculation to assist in reasoning, improving the performance of multimodal named entity recognition and localization in low-resource scenarios.

[0138] A second embodiment of the present invention provides an electronic device which, as shown in Figure 3, can be understood as a physical device including a processor and a memory storing processor-executable instructions. When the instructions are executed by the processor, the following operations are performed:

[0139] Step 1: For each text-image pair (T_i, I_i) in the test set, the text T_i is tokenized and encoded as a vector v_T using a pre-trained language model, and the image I_i is encoded as a vector v_I using a visual feature extractor;

[0140] Step 2: Calculate the text semantic similarity and image semantic similarity between each text-image pair in the test set and each text-image pair in the training set, combine them into a comprehensive semantic similarity, and form a similarity matrix;

[0141] Step 3: Based on the semantic similarity matrix, select the training-set text-image pair (T_j, I_j) that is most semantically similar to the test-set text-image pair (T_i, I_i);

[0142] Step 4: Combine (T_i, I_i) and (T_j, I_j) into one set of data, in which (T_i, I_i) is the content to be inferred and (T_j, I_j) is used to build the demonstration instance;

[0143] Step 5: For (T_j, I_j), an example is created for the text T_j by natural language template filling based on the predefined entity types, yielding a template-filled text; a text encoder encodes T_j and the template-filled text into text vectors;

[0144] For the image I_j, a visual feature extractor extracts visual features from the specific regions containing entities, yielding the visual feature vectors of the entities, and extracts the features of the entire image, yielding the visual feature vector of the entire image;

[0145] Finally, the image-text pair represented by these vectors is obtained; the text vectors and the visual feature vectors together constitute a demonstration instance;

[0146] Step 6: For (T_i, I_i), the text T_i is tokenized and encoded into text vectors using a text encoder, and a visual feature extractor extracts visual features from I_i to obtain visual feature vectors, yielding the image-text pair represented by vectors;

[0147] Step 7: Transform all image feature vectors to the same dimension as the text vectors using a linear projection layer;

[0148] Step 8: Concatenate the converted image feature vector and text vector to obtain the input vector;

[0149] Step 9: Input the input vector into the LLaMA model for instance learning. Through the attention mechanism, the extracted text features and visual features are weighted and fused to generate a joint feature representation.

[0150] Step 10: Decode the joint feature representation using the LLaMA decoder to obtain the predicted output for the image-text pair (T_i, I_i);

[0151] The method further includes a model training process, comprising:

[0152] Step 11: Based on the predicted output and the actual information of the test set, determine the named entity recognition loss function and the image entity localization loss function based on cross-entropy loss and IoU, respectively.

[0153] Step 12: Construct an overall loss function based on the named entity recognition loss function and the image entity localization loss function;

[0154] Step 13: Use the overall loss function to update the parameters using the optimizer.

[0155] In the third embodiment of the present invention, the process of the multimodal named entity recognition and localization method in low-resource scenarios is the same as that in the first and second embodiments. The difference lies in the engineering implementation: this embodiment can be implemented using software plus necessary general-purpose hardware platforms. While hardware can also be used, the former is often a better implementation method. Based on this understanding, the method of the present invention can be embodied in the form of a computer software product stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), including several instructions to cause a device to execute the method described in the embodiments of the present invention.

[0156] Through the description of specific embodiments, a more in-depth and specific understanding should be gained of the technical means and effects adopted by the present invention to achieve the intended purpose. However, the accompanying drawings are only provided for reference and illustration and are not intended to limit the present invention.

Claims

1. A method for multimodal named entity recognition and localization in low-resource scenarios, characterized in that, when using a model for inference, the method comprises:

Step 1: For each text-image pair (T_i, I_i) in the test set, the text T_i is tokenized and encoded as a vector using a pre-trained language model, and the image I_i is encoded as a vector using a visual feature extractor;

Step 2: Calculate the text semantic similarity and image semantic similarity between each text-image pair in the test set and each text-image pair in the training set, combine them into a comprehensive semantic similarity, and form a similarity matrix;

Step 3: Based on the semantic similarity matrix, select the training-set text-image pair (T_j, I_j) that is most semantically similar to the test-set text-image pair (T_i, I_i);

Step 4: Combine (T_i, I_i) and (T_j, I_j) into one set of data, in which (T_i, I_i) is the content to be inferred and (T_j, I_j) is used to build the demonstration instance;

Step 5: For (T_j, I_j), an example is created for the text T_j by natural language template filling based on predefined entity types, yielding a template-filled text; a text encoder encodes T_j and the template-filled text into text vectors; for the image I_j, a visual feature extractor extracts visual features from the specific regions containing entities, yielding the visual feature vectors of the entities, and extracts the features of the entire image, yielding the visual feature vector of the entire image; finally, the image-text pair represented by these vectors is obtained, and the text vectors and the visual feature vectors together constitute a demonstration instance;

Step 6: For (T_i, I_i), the text T_i is tokenized and encoded into text vectors using a text encoder, and a visual feature extractor extracts visual features from I_i to obtain visual feature vectors, yielding the image-text pair represented by vectors;

Step 7: Transform all image feature vectors to the same dimension as the text vectors using a linear projection layer;

Step 8: Concatenate the converted image feature vector and text vector to obtain the input vector;

Step 9: Input the input vector into the LLaMA model for instance learning; through the attention mechanism, the extracted text features and visual features are weighted and fused to generate a joint feature representation;

Step 10: Decode the joint feature representation using the LLaMA decoder to obtain the predicted output for the image-text pair (T_i, I_i);

The method further includes a model training process, comprising:

Step 11: Based on the predicted output and the actual information of the test set, determine the named entity recognition loss function and the image entity localization loss function based on cross-entropy loss and IoU, respectively;

Step 12: Construct an overall loss function based on the named entity recognition loss function and the image entity localization loss function;

Step 13: Use the overall loss function to update the parameters using the optimizer.

2. The method for multimodal named entity recognition and localization in low-resource scenarios as described in claim 1, characterized in that, in step 2, the semantic similarity is calculated from the vector-form text representations and the vector-form image representations of the two text-image pairs being compared, yielding a text semantic similarity, an image semantic similarity, and an overall semantic similarity that combines the two under weight-controlling parameters; the similarity matrix is composed of the overall semantic similarities.
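The similarity computation of claim 2 and the selection in step 3 can be illustrated with a minimal sketch. The claim text as reproduced here does not fix the similarity measure or the symbols, so the following are assumptions made only for illustration: cosine similarity for both modalities, a weighted sum with hypothetical weights `alpha` and `beta`, and the function names `cosine`, `similarity_matrix`, and `most_similar`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(test_pairs, train_pairs, alpha=0.5, beta=0.5):
    """Each pair is (text_vector, image_vector).
    Rows index test pairs, columns index training pairs.
    The weighted sum alpha * text + beta * image is an assumption."""
    return [
        [alpha * cosine(t_txt, r_txt) + beta * cosine(t_img, r_img)
         for r_txt, r_img in train_pairs]
        for t_txt, t_img in test_pairs
    ]

def most_similar(matrix):
    """For each test pair, the index of the training pair with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

# Toy vectors standing in for encoder outputs.
test_pairs = [([1.0, 0.0], [0.0, 1.0])]
train_pairs = [
    ([0.0, 1.0], [1.0, 0.0]),  # orthogonal to the test pair in both modalities
    ([1.0, 0.1], [0.1, 1.0]),  # close to the test pair in both modalities
]
S = similarity_matrix(test_pairs, train_pairs)
best = most_similar(S)
```

Here `best[0]` is the index of the training pair that would be used to build the demonstration instance for the first test pair.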

3. The method for multimodal named entity recognition and localization in low-resource scenarios as described in claim 1, characterized in that, in step 5, the predefined entity types include Person, Organization, Location, and Other.
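Template filling with these entity types might look like the sketch below. The template wording, the constant `TEMPLATE`, and the function `build_example_text` are hypothetical; the claims state only that a natural language template is filled according to the predefined entity types.

```python
# The four entity types named in claim 3.
ENTITY_TYPES = ("Person", "Organization", "Location", "Other")

# Hypothetical template; the patent text does not give its wording.
TEMPLATE = '"{span}" is a {etype} entity.'

def build_example_text(text, entities):
    """Append one filled template per annotated entity to the sentence.
    `entities` is a list of (span, entity_type) tuples."""
    parts = [text]
    for span, etype in entities:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        parts.append(TEMPLATE.format(span=span, etype=etype))
    return " ".join(parts)

example = build_example_text(
    "Alice visited Paris.",
    [("Alice", "Person"), ("Paris", "Location")],
)
```

The resulting string is what a text encoder would then turn into the text vector of the demonstration instance.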

4. The method for multimodal named entity recognition and localization in low-resource scenarios as described in claim 1, characterized in that, in step 7, the linear projection is calculated by applying a linear projection layer to the image features in their original dimension, yielding a projected image feature vector whose dimension is the same as that of the text vector.
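A linear projection of this kind can be sketched in plain Python. The affine form with a bias term, the names `W`, `b`, and `linear_project`, and the toy dimensions are assumptions; the claim states only that a linear projection layer maps image features to the dimension of the text vector.

```python
def linear_project(W, b, v):
    """Map an image feature v to the text dimension: W v + b.
    W has one row per output dimension; the bias b is an assumption."""
    if len(W) != len(b) or any(len(row) != len(v) for row in W):
        raise ValueError("shape mismatch between W, b and v")
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

# Toy example: 4-dimensional image feature projected to a 2-dimensional text space.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0]]
b = [0.5, -1.0]
image_feature = [2.0, 3.0, 4.0, 5.0]
projected = linear_project(W, b, image_feature)
```

In a trained model the entries of `W` and `b` would be learned parameters rather than fixed values.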

5. The method for multimodal named entity recognition and localization in low-resource scenarios as described in claim 4, characterized in that, in step 8, the input vector sequence is constructed from two bracketed segments, one representing the instance content and the other representing the training content.
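One way to picture the concatenation is the sketch below. The ordering (demonstration instance first, then the content to be predicted, with image vectors before text vectors inside each segment) and the function name `build_input_sequence` are assumptions; the claim text as reproduced here does not preserve the exact sequence.

```python
def build_input_sequence(demo_image_vecs, demo_text_vecs,
                         query_image_vecs, query_text_vecs):
    """Concatenate vector lists into one input sequence.
    Assumed order: [demonstration image, demonstration text,
    query image, query text]. All vectors must share one dimension,
    which is why image features are projected beforehand."""
    segments = [demo_image_vecs, demo_text_vecs,
                query_image_vecs, query_text_vecs]
    sequence = [vec for segment in segments for vec in segment]
    if len({len(vec) for vec in sequence}) > 1:
        raise ValueError("all vectors must have the same dimension")
    return sequence

# Toy 2-dimensional vectors standing in for projected features and text embeddings.
demo_img = [[0.1, 0.2], [0.3, 0.4]]  # entity-region feature, whole-image feature
demo_txt = [[1.0, 1.0]]
query_img = [[0.5, 0.6]]
query_txt = [[2.0, 2.0], [3.0, 3.0]]
inputs = build_input_sequence(demo_img, demo_txt, query_img, query_txt)
```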

6. The method for multimodal named entity recognition and localization in low-resource scenarios as described in claim 1, characterized in that, in step 11, the named entity recognition loss function and the image entity localization loss function are calculated as follows: the named entity recognition loss function is a cross-entropy computed from the predicted probability that each token belongs to each category and an indicator that equals 1 if the token belongs to the category and 0 otherwise, taken over the total number of tokens and the total number of categories; the image entity localization loss function is computed from the intersection over union (IoU) over the number of entities, where the IoU is the ratio of the area of the intersection of the actual region and the predicted region to the area of their union, and a further quantity in the formula is a positive integer.
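The two losses can be sketched as follows. Several details are assumptions, since the formulas are not preserved in the text: the cross-entropy is averaged over tokens, the localization loss is taken as the mean of 1 - IoU over entities, boxes are given as `(x1, y1, x2, y2)` corner coordinates, and all function names are hypothetical.

```python
import math

def ner_cross_entropy(probs, labels):
    """Token-level cross-entropy.
    probs[i][c]: predicted probability that token i has category c.
    labels[i]: index of the true category of token i.
    Averaging over tokens is an assumption."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def localization_loss(pred_boxes, true_boxes):
    """Mean of (1 - IoU) over entities; the exact form is an assumption."""
    return sum(1.0 - iou(p, t)
               for p, t in zip(pred_boxes, true_boxes)) / len(true_boxes)

ner = ner_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
loc = localization_loss([(0, 0, 2, 2)], [(1, 1, 3, 3)])
```

With the toy boxes above, the intersection area is 1 and the union area is 7, so the IoU is 1/7.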

7. The method for multimodal named entity recognition and localization in low-resource scenarios as described in claim 1, characterized in that, in step 12, the overall loss function is constructed by combining the named entity recognition loss function and the image entity localization loss function under weight-controlling parameters, yielding the final loss function.
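A weighted sum is the most common way to combine two losses under weight parameters, and the sketch below assumes that form. The function name `overall_loss`, the parameter names `lambda_ner` and `lambda_loc`, and the numeric values are hypothetical.

```python
def overall_loss(ner_loss, loc_loss, lambda_ner=1.0, lambda_loc=1.0):
    """Combine the two losses; the weighted-sum form is an assumption."""
    return lambda_ner * ner_loss + lambda_loc * loc_loss

# Toy values: localization is down-weighted relative to recognition.
total = overall_loss(0.8, 0.4, lambda_ner=1.0, lambda_loc=0.5)
```

The optimizer in step 13 would then minimize this single scalar, so the two weights control how much each task influences the parameter update.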

8. An electronic device, characterized in that the electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the multimodal named entity recognition and localization method in low-resource scenarios as described in any one of claims 1 to 7.

9. A computer storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the multimodal named entity recognition and localization method in low-resource scenarios as described in any one of claims 1 to 7.