A picture text-oriented named entity recognition method, electronic equipment and medium

CN116563856B · Active · Publication Date: 2026-06-19 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG UNIV
Filing Date
2023-06-08
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

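Current named entity recognition techniques are generally designed for plain text and lack the ability to recognize text contained in images.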

Benefits of technology

[0016] (1) The method of the present invention can extract key information of text in the form of images, which solves the problem that the named entity recognition work in the prior art is all for text data and lacks research on named entity recognition of text in images.



Abstract

This invention discloses a named entity recognition method, electronic device, and medium for image-text recognition, comprising: acquiring an image containing text; performing text detection on the image and segmenting the text lines; performing text recognition on the text lines to obtain Chinese text information; segmenting the text lines to obtain several image blocks; flattening the image blocks to obtain a one-dimensional feature vector sequence; superimposing a first position vector and a second position vector on the feature vector corresponding to each image block to obtain an image feature vector; inputting the image feature vector into an encoder for encoding to obtain an encoded output vector; performing named entity annotation on the Chinese text information to obtain text data, then encoding the text data and superimposing sequence position vector encoding to obtain a text sequence; inputting the encoded output vector and the text sequence into a decoder to obtain a decoded output vector; and inputting the decoded output vector into a conditional random field for label prediction to obtain the entity corresponding to the text in the image.

Description

Technical Field

[0001] This invention relates to named entity recognition tasks, and more particularly to a named entity recognition method, electronic device, and medium for image text.

Background Technology

[0002] Textual information is ubiquitous in daily life, appearing in social media, news, and delivery information. With the rapid increase in textual information, the methods for processing it have become increasingly important. However, due to the unstructured and disordered nature of textual information, it is generally only searchable through full-text retrieval. But textual information often contains a large amount of irrelevant information, with useful and useless information mixed together, posing a significant challenge to textual information retrieval.

[0003] Named entity recognition (NER) is a subtask of information extraction. It outputs unordered information in a structured data format, effectively filtering the content of the information. The task of NER involves detecting named entities from natural language text and classifying them into predefined categories, such as people, organizations, places, and times. In general, NER is the foundation for other tasks in information extraction techniques.

[0004] In recent years, with the development of information technology, the forms of natural text information have become increasingly diverse, encompassing not only plain text but also text within images, such as mobile phone screenshots, document data, and photos of express delivery tracking numbers, that is, multimodal information. Current named entity recognition technologies are typically designed for plain text and lack techniques for recognizing text within images.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a named entity recognition method, electronic device, and medium for image text.

[0006] According to a first aspect of the present invention, a named entity recognition method for image text is provided, the method comprising:

[0007] Obtain an image containing text; perform text detection on the image and segment the image by line to obtain text line images; perform text recognition on the text line images to obtain the Chinese text information corresponding to each text line image;

[0008] The text line image is segmented to obtain several image blocks; all image blocks are vector flattened to obtain a one-dimensional feature vector sequence; the first position vector and the second position vector are superimposed on the one-dimensional feature vector corresponding to each image block to obtain the image feature vector; where the first position vector represents the position of the image block in the text line image, and the second position vector represents the position of the one-dimensional feature vector corresponding to the image block in the feature vector sequence.

[0009] The image feature vector is input into the encoder for encoding, and the encoded output vector is obtained.

[0010] Named entity annotation is performed on Chinese text information to obtain text data. Sequence encoding is performed on the text data, and sequence position vector encoding is superimposed to obtain the text sequence.

[0011] The encoded output vector and the text sequence are input into the decoder to obtain the decoded output vector;

[0012] The decoded output vector is input into a conditional random field for label prediction to obtain the entity corresponding to the text in the image.

[0013] According to a second aspect of the present invention, an electronic device is provided, including a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is used to store program data, and the processor is used to execute the program data to implement the above-described named entity recognition method for image text.

[0014] According to a third aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the above-described named entity recognition method for image text.

[0015] The beneficial effects of this invention are:

[0016] (1) The method of the present invention can extract key information of text in the form of images, which solves the problem that the named entity recognition work in the prior art is all for text data and lacks research on named entity recognition of text in images.

[0017] (2) The named entity recognition method for image text in this invention is an end-to-end method. It does not require first performing text recognition on the image to obtain the text in the image, and then using a named entity recognition method to extract the entity category from the text. The method of this invention does not require multi-stage processing and can obtain the final named entity recognition output in one step, which is convenient and fast.

[0018] (3) The method of the present invention uses the Transformer structure to process two modalities of information, image and text, through an encoder and a decoder. It abandons the traditional approach of using a convolutional neural network for visual information and a recurrent neural network for text information, and achieves a unified structure that can better integrate and process visual and text features.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a flowchart of the named entity recognition method for image text provided in an embodiment of the present invention;

[0021] Figure 2 is an overall schematic diagram of the named entity recognition method for image text provided in an embodiment of the present invention;

[0022] Figure 3 is a schematic diagram of the encoder provided in an embodiment of the present invention;

[0023] Figure 4 is a schematic diagram of the structure of the multi-head self-attention layer provided in an embodiment of the present invention;

[0024] Figure 5 is a schematic diagram of the decoder provided in an embodiment of the present invention;

[0025] Figure 6 is a schematic diagram of the structure of the self-attention layer provided in an embodiment of the present invention;

[0026] Figure 7 is a schematic diagram of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0027] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings. For clarity, the embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of methods consistent with some aspects of the present invention as detailed in the appended claims.

[0028] To better address the problem of image text information extraction, this invention proposes a named entity recognition method for image text, specifically an end-to-end Transformer-based image text named entity recognition method, the method of which is as follows:

[0029] Step S1: Obtain an image containing text; perform text detection on the image, segment the image by line to obtain text line images; perform text recognition on the text line images to obtain the Chinese text information corresponding to each text line image.

[0030] Furthermore, in this example, the text processing toolkit PaddleOCR can be used to perform text detection on the initial image containing text, obtaining the four coordinates of the top left, top right, bottom right, and bottom left corners of all text lines in the image. Based on these four coordinates, the image is segmented by line to obtain text line images. Then, the PaddleOCR toolkit is used to perform text recognition on the text line images to obtain the Chinese text information corresponding to each text line image.
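As a minimal sketch of the line-cropping step only, assuming the detector has already returned four corner points for a text line, the snippet below crops the axis-aligned bounding box of those corners from an image stored as a nested list. The function name `crop_text_line` and the toy image are hypothetical, and the call into the PaddleOCR toolkit itself is deliberately not shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Image = List[List[int]]  # rows of pixel values (grayscale for simplicity)


def crop_text_line(image: Image, corners: Sequence[Point]) -> Image:
    """Crop the axis-aligned bounding box of a four-corner text box."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    height, width = len(image), len(image[0])
    # Clamp to the image bounds so slightly out-of-range boxes do not fail.
    left = max(0, int(min(xs)))
    right = min(width, int(max(xs)) + 1)
    top = max(0, int(min(ys)))
    bottom = min(height, int(max(ys)) + 1)
    return [row[left:right] for row in image[top:bottom]]


# A 4 x 6 toy image whose pixel value encodes (row, column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
# Corners in the order top-left, top-right, bottom-right, bottom-left.
box = [(1, 1), (4, 1), (4, 2), (1, 2)]
line_image = crop_text_line(image, box)
```

A rotated or skewed text line would need a perspective transform rather than a plain bounding-box crop; that case is outside this sketch.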

[0031] Step S2: Segment the text line image to obtain several image blocks; flatten all image blocks to obtain a one-dimensional feature vector sequence; superimpose a first position vector and a second position vector on the one-dimensional feature vector corresponding to each image block to obtain an image feature vector; the first position vector represents the position of the image block in the text line image; the second position vector represents the position of the one-dimensional feature vector corresponding to the image block in the feature vector sequence.

[0032] Image segmentation of text line images involves the following steps: First, each text line image is scaled to a fixed size for standardization. This ensures consistency across all input data, facilitating subsequent segmentation. Then, the scaled image is segmented into several identical square blocks. Next, these image blocks are flattened. The original image can be considered a two-dimensional feature vector composed of pixels; the flattening operation transforms this two-dimensional feature vector into a one-dimensional feature vector. Each one-dimensional feature vector represents an image block, and its dimension equals the number of pixels in that block. Finally, a first position vector and a second position vector are added to the one-dimensional feature vector to obtain the image feature vector, which is then used as input to the encoder.

[0033] Step S3: Input the image feature vector into the encoder for encoding to obtain the encoded output vector;

[0034] As shown in Figure 3, the encoder consists of N stacked encoder blocks. Each encoder block calculates the correlation between each image feature vector and the other image feature vectors to obtain a first correlation score. Based on the first correlation score and the input image feature vector, the encoder block obtains its first encoded vector. This first encoded vector is then used as the input to the next encoder block, and so on, until the final encoded output vector is obtained.

[0035] The encoder block comprises a multi-head self-attention layer, a residual normalization layer (Add & Norm), a fully connected layer (Feed Forward), and a residual normalization layer (Add & Norm) connected in sequence.

[0036] Each encoder block uses a self-attention mechanism to compute the correlation between each image feature vector and other image feature vectors. Based on this method, a multi-head self-attention layer is used when inputting image vectors to fully extract information from the images. Multi-head self-attention does not compute attention only once, but multiple times in parallel. Each independent attention output is simply concatenated and linearly transformed into the desired dimension. Multi-head attention allows the model to collectively focus on information from different vector subspaces at different locations.

[0037] Step S4: Named entity annotation is performed on the Chinese text information obtained in step S1 to obtain text data; sequence encoding is performed on the text data, and sequence position vector encoding is superimposed, to obtain a text sequence.

[0038] The BIO (Begin, Inside, Outside) annotation method is used to annotate the Chinese text information to obtain text data. Here, B (Begin) marks the first character of an entity, I (Inside) marks the remaining characters of the entity, and O (Outside) marks non-entity characters. Then, the text data is shifted right by one position, a text start symbol "BOS" is added at the beginning, and a text end symbol "EOS" is added at the end. The text data is padded to a fixed length using the "pad" symbol. Finally, the text data is sequence-encoded using an index, and the sequence position vector encoding is superimposed to obtain the text sequence.
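The annotation and framing described above can be sketched as follows. This is an illustration under stated assumptions: the entity categories `ORG` and `LOC`, the helper names, and the fixed length of 12 are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, category); categories are hypothetical


def bio_tags(text: str, entities: Sequence[Span]) -> List[str]:
    """Tag every character as B-<category>, I-<category>, or O."""
    tags = ["O"] * len(text)
    for start, end, category in entities:
        tags[start] = f"B-{category}"
        for position in range(start + 1, end):
            tags[position] = f"I-{category}"
    return tags


def frame_sequence(tags: Sequence[str], length: int) -> List[str]:
    """Prepend BOS (which shifts the tags right by one), append EOS, then pad."""
    framed = ["BOS", *tags, "EOS"]
    if len(framed) > length:
        raise ValueError("sequence is longer than the fixed length")
    return framed + ["pad"] * (length - len(framed))


text = "浙江大学在杭州"
tags = bio_tags(text, [(0, 4, "ORG"), (5, 7, "LOC")])
framed = frame_sequence(tags, length=12)
```

Here `tags` holds one label per character and `framed` is the fixed-length symbol sequence that would subsequently be index-encoded.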

[0039] Step S5: Input the encoded output vector obtained in step S3 and the text sequence obtained in step S4 into the decoder to obtain the decoded output vector.

[0040] The decoder can calculate the correlation between images and text, and fuse the feature vectors of images and text to obtain a new decoded output vector, which is used for final sequence prediction.

[0041] The decoder is also composed of N decoder blocks. Each decoder block uses the same multi-head self-attention layer as the encoder, plus a masked multi-head self-attention layer. Specifically, the decoder block includes, in sequence, a masked multi-head self-attention layer, a residual normalization layer (Add & Norm), a multi-head self-attention layer, a residual normalization layer (Add & Norm), a fully connected layer (Feed Forward), and a residual normalization layer (Add & Norm).

[0042] Specifically, the masked multi-head self-attention layer in the first decoding block converts the text sequence into a masked text feature vector, calculates the correlation between the encoded output vector and the masked text feature vector to obtain a second correlation score, obtains the first decoding vector output by the first decoding block based on the second correlation score, uses the first decoding vector as the input of the next decoding block, and so on, to obtain the final decoding output vector.

[0043] The decoder block utilizes a masked multi-head self-attention layer to ensure consistency between the information obtained during training and prediction. Specifically, the masked multi-head self-attention layer masks "future" information during training, ensuring that when predicting a label at a given position, the decoder can only compute the output information from the text's starting position to the current position. The multi-head self-attention layer in the decoder calculates the correlation between the image feature vector output by the encoder and the text vector processed by the masked multi-head self-attention layer and the residual normalization layer.
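The masking of "future" positions can be illustrated with a small sketch. It assumes the usual construction in which negative infinity is added to the attention scores of masked positions before the softmax, so their weights become zero; the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_mask(length: int) -> Matrix:
    """0 where position j is visible from position i, -inf where j lies in the future."""
    return [[0.0 if j <= i else float("-inf") for j in range(length)]
            for i in range(length)]


def masked_softmax(scores: Matrix) -> Matrix:
    """Row-wise softmax over scores after adding the causal mask."""
    mask = causal_mask(len(scores))
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With equal raw scores, each position spreads its weight evenly over
# itself and the positions before it, and gives zero weight to later ones.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores)
```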

[0044] Step S6: Input the decoded output vector into a conditional random field (CRF) for label prediction to obtain the corresponding entity in the image text.

Example

[0045] This embodiment provides a named entity recognition method for image text, including:

[0046] Step S1: Obtain an image containing text; perform text detection on the image, segment the image by line to obtain text line images; perform text recognition on the text line images to obtain the Chinese text information corresponding to each text line image.

[0047] Furthermore, obtaining images containing text includes:

[0048] The system crawls image information from user complaints about products on the Black Cat Complaint platform. The image information includes the time of the user complaint, user comments, and user-uploaded images. These images contain a lot of text, such as mobile phone screenshots, photos of express delivery tracking numbers, images of contract terms, and document images.

[0049] Furthermore, text detection is performed on the image, and the image is segmented line by line to obtain text line images; text recognition is performed on the text line images to obtain the Chinese text information corresponding to each text line image, including:

[0050] The PaddleOCR toolkit is used to perform text detection on the initial image containing text, obtaining the text lines in the image and the coordinates of the top-left, top-right, bottom-right, and bottom-left corners of each text line. Based on these four coordinates, the image is segmented into text line images. Then, the PaddleOCR toolkit is used to perform text recognition on the text line images to obtain the Chinese text information corresponding to each text line image.

[0051] Step S2: Perform image segmentation and vector flattening on the text line images, convert the two-dimensional image feature vector into a one-dimensional feature vector, add a position vector to the one-dimensional feature vector to obtain the image feature vector; mark the position of each image in the original text line.

[0052] The process of processing an image into an input vector can be divided into the following three sub-steps:

[0053] Step S201: First, all text line images are scaled to the same fixed size for standardization. The height of the scaled image is denoted as H, and the width as W. The image is then divided into blocks of P × P pixels (in this embodiment, P is 16), resulting in N image blocks of size P × P, where N = HW / P^2.

[0054] Step S202: Flatten each image block, expanding the two-dimensional feature vector (P, P) into a one-dimensional feature vector (1, P × P). After flattening, each text line image yields a feature vector sequence of dimension (N, P × P), with each one-dimensional feature vector corresponding to one image block in the original image.

[0055] Step S203: Superimpose the first position vector and the second position vector onto the one-dimensional feature vector corresponding to each image patch to obtain the image feature vector; the first position vector represents the position of the image patch in the text line image (Patch embedding); the second position vector represents the position of the one-dimensional feature vector corresponding to the image patch in the feature vector sequence (Position embedding {E_1, E_2, ..., E_n}).

[0056] The feature vector of this image is used as the input vector of the encoder.
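Steps S201 to S203 can be sketched roughly as below. P = 16 follows the embodiment; the scaled size H = 32, W = 64, the placeholder position vectors, and the function names are assumptions made for illustration, since the patent does not state how the position vectors are produced.

```python
from typing import List

Vector = List[float]
Image = List[List[float]]


def split_into_patches(image: Image, patch: int) -> List[Vector]:
    """Split an H x W image into P x P blocks and flatten each block row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be multiples of P")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def superimpose(patches: List[Vector], first_position: List[Vector],
                second_position: List[Vector]) -> List[Vector]:
    """Element-wise sum of each flattened block with its two position vectors."""
    return [[p + a + b for p, a, b in zip(patch, first, second)]
            for patch, first, second in zip(patches, first_position, second_position)]


H, W, P = 32, 64, 16  # P as in the embodiment; H and W are assumed values
image = [[float(r * W + c) for c in range(W)] for r in range(H)]
patches = split_into_patches(image, P)  # N = HW / P^2 blocks of P * P values

# Placeholder position vectors (hypothetical values, one per block).
dim = P * P
first_position = [[0.01 * n] * dim for n in range(len(patches))]
second_position = [[0.001 * n] * dim for n in range(len(patches))]
features = superimpose(patches, first_position, second_position)
```

With these sizes the image yields N = 32 × 64 / 16^2 = 8 blocks, each flattened to 256 values.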

[0057] Step S3: Input the image feature vector into the encoder for encoding to obtain the encoded output vector {H_1, H_2, ..., H_n};

[0058] As shown in Figure 3, the encoder is composed of several encoder blocks stacked together. The first encoder block calculates the correlation between each image feature vector and the other image feature vectors to obtain the first correlation score. Based on the first correlation score and the input image feature vector, the first encoded vector output by the first encoder block is obtained. The first encoded vector is used as the input of the next encoder block, and so on, to obtain the final encoded output vector.

[0059] The encoder block comprises a multi-head self-attention layer, a residual normalization layer (Add & Norm), a fully connected layer (Feed Forward), and a residual normalization layer (Add & Norm) connected in sequence.

[0060] Figure 4 illustrates the structure of a multi-head self-attention layer, which is formed by combining multiple self-attention layers. The input vector X is fed into h different self-attention layers, where h is the number of heads, and each layer computes its own output matrix Attention(Q_i, K_i, V_i). All output matrices are concatenated and fed into a linear layer to obtain the final output MultiHead(Q, K, V) of the multi-head attention.
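A compact sketch of the concatenate-then-project structure is given below. It assumes scaled dot-product attention per head (the scaling by the square root of the key dimension is the standard Transformer choice, not something stated here), and the toy projection matrices and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    scale = math.sqrt(len(k[0]))  # assumed scaling factor
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)


def multi_head(x: Matrix, heads: Sequence[Tuple[Matrix, Matrix, Matrix]],
               w_out: Matrix) -> Matrix:
    """Run one attention per head, concatenate per position, then apply a linear layer."""
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
               for wq, wk, wv in heads]
    concatenated = [[value for head in outputs for value in head[i]]
                    for i in range(len(x))]
    return matmul(concatenated, w_out)


# Two toy heads: one projects onto the first two dimensions, one onto the last two.
select_front = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_back = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(select_front,) * 3, (select_back,) * 3]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
result = multi_head(x, heads, identity)
```

The output has the same shape as the input (one vector per position), which is what allows blocks to be stacked.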

[0061] Figure 6 shows the working principle of the self-attention mechanism. It uses dot-product attention, and its output is a weighted sum of values, where the weight assigned to each value is determined by the query and the key. The details of the self-attention mechanism implementation are as follows:

[0062] First, the input feature vector X is multiplied by three learnable linear transformation matrices Wq, Wk, and Wv to obtain three feature vectors Q (query), K (key), and V (value), respectively. The K and V vectors have a one-to-one correspondence. Next, the correlation coefficient between the Q and K vectors is calculated: the inner product of each element (query) in Q with each element in K is taken, and the softmax function is then applied to obtain normalized similarity weights between the elements of Q and K. Finally, these weights are used to compute a weighted sum of the elements of V, yielding a new vector.
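The computation just described can be written out as a short sketch. The division by the square root of the key dimension is an assumption borrowed from the standard Transformer formulation, and the identity projection matrices are toy values chosen so the result is easy to inspect.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (output, weights) for single-head dot-product self-attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(k[0]))  # assumed scaling factor
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]  # similarity of each query to each key
    return matmul(weights, v), weights  # weighted sum of the values


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

In this toy input the first vector is orthogonal to the second, so it attends less to the second position than to itself or to the third.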

[0063] The residual normalization layer (Add & Norm) consists of two parts. Add refers to the residual connection, typically used to address training difficulties in multi-layer networks by letting each layer learn only the difference (residual) relative to its input. Norm refers to Layer Normalization, commonly used in recurrent neural network structures. Layer Normalization transforms the inputs of the neurons in a layer so that they have the same mean and variance, thus accelerating convergence.
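A minimal sketch of Add & Norm for a single vector is shown below. It omits the learnable gain and bias that layer normalization usually carries, and the epsilon value and function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(vector: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def add_and_norm(x: Vector, sublayer_output: Vector) -> Vector:
    """Residual connection (Add) followed by layer normalization (Norm)."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])


x = [1.0, 2.0, 3.0, 4.0]
sublayer_output = [1.0, 2.0, 3.0, 4.0]
normalized = add_and_norm(x, sublayer_output)
```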

[0064] The fully connected layer (Feed Forward) works as follows: after the output matrix MultiHead(Q, K, V) passes through the residual connection and layer normalization, it passes through two fully connected layers to obtain the final output matrix O. The activation function of the first layer is ReLU, and no activation is applied to the second layer.
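The two-layer structure can be sketched for a single vector as follows; the weights and biases are toy values and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """weight has one row per input dimension and one column per output dimension."""
    return [sum(xi * wi for xi, wi in zip(x, column)) + b
            for column, b in zip(zip(*weight), bias)]


def feed_forward(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # first layer: ReLU
    return linear(hidden, w2, b2)                      # second layer: no activation


x = [1.0, 2.0]
w1 = [[1.0, -1.0], [1.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [[2.0], [5.0]]
b2 = [-10.0]
output = feed_forward(x, w1, b1, w2, b2)
```

The second hidden unit is negative before ReLU and is clipped to zero, while the final output is allowed to be negative because the second layer has no activation.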

[0065] Step S4: Named entity annotation is performed on the Chinese text information obtained in step S1 to obtain text data; sequence encoding is performed on the text data, and sequence position vector encoding is superimposed, to obtain a text sequence.

[0066] In this example, the BIO annotation method is used to apply named entity annotations to the text lines obtained from PaddleOCR in step S1. Then, the text lines are encoded according to the vocabulary index, converting natural text into numbers. Specifically, in the text sequence encoding process, in addition to encoding the BIO-annotated text, some symbols representing text states also need to be encoded and added to the text sequence. Text state symbols include: "BOS", indicating the start of text; "pad", indicating text padding to ensure all text lengths are consistent; "EOS", indicating the end of text; "unk", indicating unrecognizable information; and "mask", indicating occlusion of the sequence. When the text line encoding is used as input to the decoder, it needs to be shifted one position to the right, with a "BOS" symbol added at the beginning to indicate the start position of the sentence and an "EOS" symbol added at the end to indicate the end position of the sentence. Then, "pad" is used to pad the text to the specified length. Finally, the text data is encoded according to the index, and the sequence position encoding (Position embedding) {E_1, E_2, ..., E_n} is superimposed, thus obtaining the final decoder input.
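The index encoding can be illustrated with a small sketch. The five state symbols come from the paragraph above; the index values, the entity categories `ORG` and `LOC`, and the function names are hypothetical and do not reproduce the patent's index table.

```python
from typing import Dict, List, Sequence

STATE_SYMBOLS = ["pad", "BOS", "EOS", "unk", "mask"]


def build_index(categories: Sequence[str]) -> Dict[str, int]:
    """Assign an integer to every state symbol and every BIO label (assumed ordering)."""
    symbols = STATE_SYMBOLS + ["O"]
    symbols += [f"{prefix}-{category}" for category in categories for prefix in ("B", "I")]
    return {symbol: number for number, symbol in enumerate(symbols)}


def encode(tags: Sequence[str], index: Dict[str, int], length: int) -> List[int]:
    """Frame with BOS/EOS, pad to a fixed length, and map symbols to indices."""
    framed = ["BOS", *tags, "EOS"]
    if len(framed) > length:
        raise ValueError("sequence is longer than the fixed length")
    framed += ["pad"] * (length - len(framed))
    # Symbols missing from the index fall back to "unk".
    return [index.get(symbol, index["unk"]) for symbol in framed]


index = build_index(["ORG", "LOC"])
encoded = encode(["B-ORG", "I-ORG", "O", "B-PER"], index, length=8)
```

Because `B-PER` is not in this toy index, it is mapped to the index of "unk".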

[0067] Table 1 below is the index table of text annotations; each index corresponds to an annotation and its meaning in the text line.

[0068] Table 1: Index of Text Annotation

[0069]

[0070] Step S5: Input the encoded output vector obtained in step S3 and the encoded text sequence obtained in step S4 into the decoder to obtain the decoded output vector.

[0071] The decoder uses a self-attention mechanism to calculate the relevance scores between the image and the text, and then obtains the final decoder output vector based on the image-text relevance scores and the image's feature vector. Each decoder block in the decoder has two multi-head attention sub-layers and one fully connected feedforward sub-layer. Similar to the encoder, each sub-layer uses residual connections and layer normalization.

[0072] Figure 5 shows the decoder structure. The structure of the decoder block is similar to that of the encoder block, except for the input sources of the masked multi-head self-attention layer and the multi-head self-attention layer.

[0073] (1) The masked multi-head self-attention layer in the decoding block ensures that the information acquired by the model during training and prediction remains consistent, preventing the model from seeing "future" information. Specifically, the decoder input is divided into two categories. During training, the decoder input is the entire text sequence, while during prediction, the model's input starts from "BOS" and continuously uses newly generated outputs as the next input. That is, when predicting the output at a certain position, only the decoder output information from the text start position to the current position can be obtained. Therefore, it is necessary to mask part of the text sequence to ensure the consistency of information during training and prediction. In the process of calculating the output vector, for the positions that need to be masked, by adding an infinitely large negative number (negative infinity) to the value at that position, the correlation weights of these positions will approach 0 when passing through the activation function. The text sequence is processed through the masked multi-head self-attention layer to obtain the masked text feature vector.

[0074] (2) When calculating the relevance score, the input of the multi-head attention layer in the decoding block differs from that of the encoding block. The K (Key) and V (Value) vectors come from the image feature vector output by the encoder, while the Q (Query) vector comes from the masked text feature vector produced by the masked multi-head self-attention layer. The correlation between the image feature vector and the masked text feature vector is calculated to obtain the output vector of the decoding block, and this output vector is used as the input vector of the next decoder block.
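The different input sources can be illustrated with a single-head sketch in which the queries come from the text side and the keys and values come from the encoder output. The learned projection matrices are omitted for brevity, the scaling factor is the standard Transformer assumption, and all names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(row: Vector) -> Vector:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_vectors: List[Vector], image_vectors: List[Vector]) -> List[Vector]:
    """Q comes from the text side; K and V come from the encoder output."""
    dim = len(image_vectors[0])
    scale = math.sqrt(dim)
    attended = []
    for query in text_vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in image_vectors]
        weights = softmax(scores)
        attended.append([sum(w * value[d] for w, value in zip(weights, image_vectors))
                         for d in range(dim)])
    return attended


masked_text = [[1.0, 0.0], [0.0, 1.0]]                 # two text positions
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three image blocks
fused = cross_attention(masked_text, encoder_output)
```

The result has one vector per text position, each a mixture of image feature vectors, which is how the image and text features are fused.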

[0075] Finally, the output vector of the last layer of the decoder block goes through a linear mapping layer to change the dimension of each output vector to the size of the index table, thus obtaining the final output vector of the decoder.

[0076] Step S6: Input the decoded output vector into a conditional random field (CRF) for label prediction to obtain the corresponding entity in the image text.
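As a rough illustration of label prediction with validity constraints, the sketch below runs Viterbi decoding over per-position label scores. It is a simplification: the transition scores are hard-coded as 0 (allowed) or negative infinity (forbidden) instead of being learned, and the label set, scores, and function names are hypothetical.

```python
from typing import List, Optional, Sequence

NEG_INF = float("-inf")
LABELS = ["O", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def allowed(previous: Optional[str], current: str) -> bool:
    """An I- label may only follow a B- or I- label of the same category."""
    if not current.startswith("I-"):
        return True
    category = current[2:]
    return previous in (f"B-{category}", f"I-{category}")


def viterbi(emissions: Sequence[Sequence[float]], labels: Sequence[str]) -> List[str]:
    """Return the highest-scoring label sequence that satisfies the constraints."""
    scores = [e if allowed(None, label) else NEG_INF
              for label, e in zip(labels, emissions[0])]
    backpointers = []
    for row in emissions[1:]:
        new_scores, pointers = [], []
        for current, emit in zip(labels, row):
            candidates = [scores[p] if allowed(previous, current) else NEG_INF
                          for p, previous in enumerate(labels)]
            best = max(range(len(labels)), key=candidates.__getitem__)
            new_scores.append(candidates[best] + emit)
            pointers.append(best)
        scores = new_scores
        backpointers.append(pointers)
    best = max(range(len(labels)), key=scores.__getitem__)
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]


# At position 0 the raw scores favour "I-ORG", which is invalid at a sentence start.
emissions = [
    [0.1, 0.2, 2.0, 0.0, 0.0],
    [0.0, 0.1, 1.5, 0.0, 0.3],
    [1.0, 0.0, 0.2, 0.1, 0.9],
]
prediction = viterbi(emissions, LABELS)
```

Picking the best label independently at each position would start the sequence with "I-ORG"; the constrained decoding instead yields a sequence that begins with a "B-" label.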

[0077] The decoded output vector passes through a sequence prediction layer, namely a Conditional Random Field (CRF) layer. The decoder's output vector gives, for each position, the score of each label in the index table, i.e., the state transition score from each vector to each label. Given a set of output vectors X and the state transition scores of each output vector, the CRF layer can predict the entity sequence Y with the highest score. The CRF layer adds constraints to ensure that the final prediction result is valid. These constraints can be automatically learned by the CRF layer during training. Possible constraints include: the sentence should start with "B-" or "O", not "I-"; and in a pattern "B-label1 I-label2 I-label3", label1, label2, and label3 should be the same category.

Example

[0078] This invention is compared against a control group of two-stage methods, whose two stages belong to the fields of text recognition and named entity recognition, respectively. Specifically, in the control group the input data is first processed by a text recognition model to obtain text information, and the text information is then input into a named entity recognition model to obtain an entity sequence, which is then evaluated. The experimental results are shown in Table 2 below. On the collected data, this invention achieved an overall average accuracy of 82.14%, an overall average precision of 87.73%, an overall average recall of 86.76%, and an overall average F1 score of 87.24% for named entity recognition. Compared to the best-performing two-stage method, the overall average accuracy improved by 17.22%, the overall average precision improved by 4.72%, the overall average recall improved by 12.48%, and the overall average F1 score improved by 8.63%.

[0079] Table 2: Experimental Results

[0080]
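As a consistency check on the figures reported for this invention, the F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The function name is hypothetical; only the two input percentages come from the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported overall average precision and recall, in percent.
recomputed_f1 = round(f1_score(87.73, 86.76), 2)
```

The recomputed value rounds to 87.24, matching the reported overall average F1 score.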

[0081] In summary, the method of this invention can extract key information from text existing in image form. This method is an end-to-end approach, eliminating the need for prior text recognition of the image to obtain the text, followed by named entity recognition to extract entity categories. It provides the final named entity recognition output in one step, offering convenience and speed. Furthermore, this method utilizes a Transformer structure to process both image and text modal information through an encoder and decoder, achieving structural unification and enabling better fusion and processing of visual and textual features.

[0082] Accordingly, this application also provides an electronic device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the named entity recognition method for image text as described above. Figure 7 is a hardware structure diagram of a device with data processing capabilities on which the named entity recognition method for image text provided in this embodiment of the invention runs. In addition to the processor, memory, and network interface shown in Figure 7, any data processing device in the embodiment may also include other hardware depending on its actual function, which will not be described in detail here.

[0083] Accordingly, this application also provides a computer-readable storage medium storing computer instructions thereon, which, when executed by a processor, implement the named entity recognition method for image text as described above. The computer-readable storage medium can be an internal storage unit of any data processing device as described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc., equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units of any data processing device and external storage devices. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.

[0084] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and embodiments are to be considered exemplary only.

[0085] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope.

Claims

1. A picture-text-oriented named entity recognition method, characterized in that the method comprises: obtaining an image containing text; performing text detection on the image and segmenting the image line by line to obtain text line images; performing text recognition on the text line images to obtain the Chinese text information corresponding to each text line image; segmenting each text line image to obtain several image blocks; flattening all image blocks into vectors to obtain a one-dimensional feature vector sequence; superimposing a first position vector and a second position vector on the one-dimensional feature vector corresponding to each image block to obtain an image feature vector, where the first position vector represents the position of the image block in the text line image, and the second position vector represents the position of the one-dimensional feature vector corresponding to the image block in the feature vector sequence; inputting the image feature vector into an encoder for encoding to obtain an encoded output vector; performing named entity annotation on the Chinese text information to obtain text data; performing sequence encoding on the text data and superimposing a sequence position vector encoding to obtain a text sequence; inputting the encoded output vector and the text sequence into a decoder to obtain a decoded output vector; and inputting the decoded output vector into a conditional random field for label prediction to obtain the entity corresponding to the text in the image.

2. The picture-text-oriented named entity recognition method of claim 1, wherein performing text detection on the image and segmenting the image line by line to obtain text line images comprises: performing text detection on the image containing text to obtain the coordinates of the top-left, top-right, bottom-right, and bottom-left corners of every text line in the image; and segmenting the image line by line based on these four coordinates to obtain the text line images.

3. The picture-text-oriented named entity recognition method of claim 1, wherein segmenting the text line image to obtain several image blocks comprises: scaling all text line images to the same size, and then dividing each scaled image into several identical square blocks.

4. The picture-text-oriented named entity recognition method of claim 1, wherein inputting the image feature vector into the encoder for encoding to obtain the encoded output vector comprises: the encoder consists of several stacked coding blocks; the first coding block calculates the correlation between each image feature vector and the other image feature vectors to obtain a first correlation score, and obtains the first coding vector output by the first coding block based on the first correlation score and the input image feature vector; the first coding vector is used as the input of the next coding block, and so on, to obtain the final encoded output vector.

5. The picture-text-oriented named entity recognition method of claim 4, wherein the coding block includes a multi-head attention layer, a residual normalization layer, a fully connected layer, and a residual normalization layer connected in sequence.

6. The picture-text-oriented named entity recognition method of claim 1, wherein performing named entity annotation on the Chinese text information to obtain text data, performing sequence encoding on the text data, and superimposing the sequence position vector encoding to obtain the text sequence comprises: performing named entity annotation on the Chinese text information using the BIO annotation method to obtain the text data, where B marks the first character of an entity, I marks each character of an entity other than the first, and O marks a non-entity character; shifting the text data one position to the right, adding a text start symbol "BOS" at the beginning and a text end symbol "EOS" at the end; padding the text data to a fixed length with the "pad" symbol; and finally encoding the text data as a sequence of indices and superimposing the sequence position vector encoding to obtain the text sequence.

7. The picture-text-oriented named entity recognition method of claim 1, wherein the decoder is composed of several stacked decoding blocks, and each decoding block includes a masked multi-head self-attention layer, a residual normalization layer, a multi-head attention layer, a residual normalization layer, a fully connected layer, and a residual normalization layer connected in sequence.

8. The picture-text-oriented named entity recognition method of claim 7, wherein inputting the encoded output vector and the text sequence into the decoder to obtain the decoded output vector comprises: the masked multi-head self-attention layer in the first decoding block converts the text sequence into a masked text feature vector; the correlation between the encoded output vector and the masked text feature vector is calculated to obtain a second correlation score, and the first decoding vector output by the first decoding block is obtained based on the second correlation score; the first decoding vector is used as the input of the next decoding block, and so on, to obtain the final decoded output vector.

9. An electronic device comprising a memory and a processor, characterized in that the memory is coupled to the processor, wherein the memory is used to store program data, and the processor is used to execute the program data to implement the picture-text-oriented named entity recognition method of any one of claims 1-8.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that, when the program is executed by a processor, it implements the picture-text-oriented named entity recognition method of any one of claims 1-8.
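
As a non-limiting illustration of the block segmentation and position superimposition recited in claims 1 and 3, the following Python sketch splits an already-scaled text line image into identical square blocks, flattens each block, and adds two position vectors. The claims do not state how the position vectors are constructed, so the sinusoidal form used here is an assumption, and the function names split_into_patches, sinusoid, and image_feature_vectors are hypothetical. The scaling step is omitted; the input is assumed to have a height and width divisible by the block size.

```python
import math
from typing import List, Tuple

Vector = List[float]
Image = List[List[float]]  # grayscale text line image, rows of pixel values


def split_into_patches(image: Image, patch: int) -> List[Tuple[Tuple[int, int], Vector]]:
    """Split the image into patch x patch squares in row-major order.

    Returns ((row, col), flattened_pixels) for each block.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image must be scaled to a multiple of the block size")
    blocks = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            blocks.append(((top // patch, left // patch), flat))
    return blocks


def sinusoid(pos: int, dim: int) -> Vector:
    """Sinusoidal position vector (an assumed form, not taken from the claims)."""
    vec = []
    for k in range(dim):
        exponent = (k - k % 2) / dim  # even/odd pairs share one frequency
        angle = pos / (10000 ** exponent)
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def image_feature_vectors(image: Image, patch: int) -> List[Vector]:
    """Flattened block + first position vector + second position vector."""
    features = []
    for seq_idx, ((row, col), flat) in enumerate(split_into_patches(image, patch)):
        dim = len(flat)
        # First position vector: where the block sits in the text line image.
        first = sinusoid(row, dim // 2) + sinusoid(col, dim - dim // 2)
        # Second position vector: index of the block's vector in the sequence.
        second = sinusoid(seq_idx, dim)
        features.append([x + a + b for x, a, b in zip(flat, first, second)])
    return features


if __name__ == "__main__":
    line_image = [[1, 2, 3, 4], [5, 6, 7, 8]]
    for vec in image_feature_vectors(line_image, patch=2):
        print(vec)
```

In this sketch the first position vector is built from the block's row and column, while the second is built from its sequence index, so the two remain distinct even when the blocks are read in row-major order.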
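
As a non-limiting illustration of the text-side preparation recited in claim 6, the following Python sketch labels characters with B, I, and O, prepends "BOS" (which shifts the labels one position to the right), appends "EOS", pads with "pad" to a fixed length, and maps each symbol to an index. The index values in VOCAB, the entity span format, and the function names are assumptions made for the example; superimposing the sequence position vector encoding is not shown.

```python
from typing import Dict, List, Tuple

BOS, EOS, PAD = "BOS", "EOS", "pad"

# Hypothetical symbol-to-index table; the claims only say "using index".
VOCAB: Dict[str, int] = {PAD: 0, BOS: 1, EOS: 2, "B": 3, "I": 4, "O": 5}


def bio_tags(text: str, entities: List[Tuple[int, int]]) -> List[str]:
    """Tag each character: B for the first character of an entity,
    I for the remaining characters of the entity, O for everything else.

    entities holds (start, end) character spans, with end exclusive.
    """
    tags = ["O"] * len(text)
    for start, end in entities:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def build_text_data(tags: List[str], max_len: int) -> List[str]:
    """Prepend BOS, append EOS, then pad to max_len."""
    seq = [BOS] + tags + [EOS]
    if len(seq) > max_len:
        raise ValueError("max_len is too small for this text line")
    return seq + [PAD] * (max_len - len(seq))


def index_encode(seq: List[str], vocab: Dict[str, int] = VOCAB) -> List[int]:
    """Replace every symbol with its index."""
    return [vocab[symbol] for symbol in seq]


if __name__ == "__main__":
    text = "浙江大学在杭州"
    spans = [(0, 4), (5, 7)]  # "浙江大学" and "杭州"
    tags = bio_tags(text, spans)
    print(tags)  # ['B', 'I', 'I', 'I', 'O', 'B', 'I']
    print(index_encode(build_text_data(tags, max_len=10)))
    # [1, 3, 4, 4, 4, 5, 3, 4, 2, 0]
```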
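
As a non-limiting illustration of the final label prediction step of claim 1, the following Python sketch shows Viterbi decoding, a standard way for a linear-chain conditional random field to pick the highest-scoring label sequence. The claims do not specify the decoding procedure, so its use here is an assumption. The emission scores stand in for values derived from the decoded output vector, and the transition scores are hand-set for the example rather than learned; the name viterbi_decode is hypothetical.

```python
from typing import List

LABELS = ["B", "I", "O"]


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the label index path with the highest total score.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        best_prev, new_score = [], []
        for j in range(n_labels):
            # Best previous label for ending at label j on this step.
            i_best = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            best_prev.append(i_best)
            new_score.append(score[i_best] + transitions[i_best][j] + emit[j])
        backpointers.append(best_prev)
        score = new_score
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda i: score[i])]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Hand-set transitions: heavily penalise O -> I, since an I label
    # must follow a B or another I under the BIO scheme.
    transitions = [[0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, -1e4, 0.0]]
    emissions = [[0.1, 0.0, 2.0],
                 [1.0, 1.5, 0.0],
                 [0.0, 2.0, 0.5]]
    path = viterbi_decode(emissions, transitions)
    print([LABELS[i] for i in path])  # ['O', 'B', 'I']
```

Taking the highest-scoring label at each position independently would give O, I, I, which is invalid under BIO; the transition penalty steers the decoded path to O, B, I instead.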