Entity recognition method and training method of entity recognition model

By employing an entity recognition model training method that does not rely on an OCR engine, and utilizing the alignment of sample images with text features for image feature comparison and learning, the problem of low image recognition efficiency in natural scenes is solved, and efficient and accurate extraction of structured information from images is achieved.

CN116524500B · Active · Publication Date: 2026-06-26 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIBABA (CHINA) CO LTD
Filing Date
2023-03-16
Publication Date
2026-06-26

Abstract

Embodiments of the present specification provide an entity recognition method and a training method of an entity recognition model, wherein the entity recognition method comprises: acquiring a target image; inputting the target image into an entity recognition model to obtain an entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and a sample text, and the sample text is recognized from the sample image. By inputting the target image into the entity recognition model, the entity recognition model can complete entity extraction of the target image; the entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, so that the entity recognition model has a semantic analysis function, and thus can obtain an entity recognition result meeting requirements, and improve entity recognition efficiency and accuracy.

Description

Technical Field

[0001] The embodiments in this specification relate to the field of computer technology, and in particular to entity recognition methods and training methods for entity recognition models.

Background Technology

[0002] With the continuous development of computer technology, image processing technology is also constantly improving.

[0003] Currently, in order to detect and recognize image content, image recognition algorithms are typically used to directly identify the content contained in the image. This is usually applicable to images containing clear entities, such as document scans or images containing only one type of entity. However, for more complex images, such as natural scene images, the text in the image may be folded, distorted, or otherwise degraded, which reduces recognition accuracy and efficiency.

[0004] Therefore, how to improve the efficiency of image recognition and detection and obtain structured entity content has become a technical problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0005] In view of this, embodiments of this specification provide entity recognition methods. One or more embodiments of this specification also relate to methods for training entity recognition models, document entity recognition methods, entity recognition devices, computing devices, computer-readable storage media, and computer programs, to address technical deficiencies in the prior art.

[0006] According to a first aspect of the embodiments of this specification, an entity recognition method is provided, comprising:

[0007] Acquire the target image;

[0008] The target image is input into an entity recognition model to obtain the entity recognition result of the target image. The entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, and the sample text is identified from the sample image.

[0009] According to a second aspect of the embodiments of this specification, a method for training an entity recognition model is provided, applied to a cloud-based device, comprising:

[0010] Sample images are obtained based on model training requests sent by the terminal, wherein the sample images carry sample text and sample location;

[0011] The sample image is input into the entity recognition model, and image features of the sample image are extracted based on the sample location, and text features are extracted based on the sample text;

[0012] The image features are compared with the text features to obtain the comparison results;

[0013] Based on the comparison results, the model loss value is determined, and the entity recognition model is trained based on the model loss value until the model training stops. Then, the model parameters of the entity recognition model are sent to the terminal.

[0014] According to a third aspect of the embodiments of this specification, an entity recognition method is provided, applied to a cloud-side device, comprising:

[0015] Receive entity recognition request sent by the terminal;

[0016] The target image is obtained according to the entity recognition request, and the target image is input into the entity recognition model, wherein the entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, and the sample text is recognized from the sample image;

[0017] Obtain the entity recognition result output by the entity recognition model and return the entity recognition result to the terminal.

[0018] According to a fourth aspect of the embodiments of this specification, a document entity recognition method is provided, comprising:

[0019] Acquire the image to be processed, wherein the image to be processed is collected from the document to be processed;

[0020] The image to be processed is input into an entity recognition model to obtain the text entity recognition result of the image to be processed. The entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, and the sample text is recognized from the sample image.

[0021] According to a fifth aspect of the embodiments of this specification, an entity recognition device is provided, comprising:

[0022] The acquisition module is configured to acquire the target image;

[0023] The input module is configured to input the target image into an entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image.

[0024] According to a sixth aspect of the embodiments of this specification, a computing device is provided, comprising:

[0025] Memory and processor;

[0026] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the above-described entity recognition method.

[0027] According to a seventh aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores computer-executable instructions that, when executed by a processor, implement the steps of the entity recognition method described above.

[0028] According to an eighth aspect of the embodiments of this specification, a computer program is provided, wherein when the computer program is executed in a computer, it causes the computer to perform the steps of the entity recognition method described above.

[0029] One embodiment of this specification implements the following: acquiring a target image; inputting the target image into an entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image.

[0030] By inputting the target image into the entity recognition model, the entity recognition model can extract entities from the target image. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, which gives the entity recognition model semantic analysis capabilities, thereby obtaining entity recognition results that meet the requirements and improving the efficiency and accuracy of entity recognition.

Attached Figure Description

[0031] Figure 1 is a schematic diagram illustrating a scenario of an entity recognition method provided in one embodiment of this specification;

[0032] Figure 2 is a flowchart of an entity recognition method provided in one embodiment of this specification;

[0033] Figure 3 is a flowchart illustrating the processing procedure of a document entity recognition method provided in one embodiment of this specification;

[0034] Figure 4 is a flowchart illustrating a training method for an entity recognition model provided in one embodiment of this specification;

[0035] Figure 5 is a flowchart of another entity recognition method provided in one embodiment of this specification;

[0036] Figure 6 is a schematic diagram of the structure of an entity recognition device provided in one embodiment of this specification;

[0037] Figure 7 is a structural block diagram of a computing device provided in one embodiment of this specification.

Detailed Implementation

[0038] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.

[0039] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.

[0040] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to a determination."

[0041] First, the terms and concepts used in one or more embodiments of this specification will be explained.

[0042] OCR: Optical Character Recognition (OCR) refers to the process by which electronic devices (such as scanners or digital cameras) examine characters printed on paper and then use character recognition methods to translate the shapes into computer text.

[0043] BERT: BERT stands for Bidirectional Encoder Representations from Transformers. It is an unsupervised pre-trained language model for natural language processing tasks, proposed in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."

[0044] Information Extraction: Information extraction (IE) is the process of structuring the information contained in text and transforming it into a table-like organizational form.

[0045] CenterNet: CenterNet is an anchor-free object detection network that can be used not only for object detection, but also for other tasks such as pose recognition or 3D object detection.

[0046] Currently, information extraction is one of the core algorithms in the OCR field. Whether it's cards, receipts, forms, or documents, information extraction algorithms are needed to extract core information to obtain easily understandable structured information. Early OCR algorithms were mainly designed for scenarios like scanning and printing documents, where the data was clean and clear, allowing for high accuracy through direct algorithm recognition. However, with the emergence of smart terminals and the rapid development of information technology, the scenarios faced by OCR have gradually shifted to natural scenes. The input data is no longer clean and clear; it's accompanied by various noises such as rotation, folding, and wrinkles, making text detection and recognition a challenging task.

[0047] The mainstream information extraction algorithm first uses an OCR engine to extract the position and content of text in an image, then uses a language model to extract features from the input image and text, and finally connects a lightweight head network designed for downstream tasks to implement specific functions, such as entity labeling or entity relationship extraction. However, this method depends on the accuracy of the OCR engine. If a large number of characters extracted by the OCR engine are missed or misrecognized in natural scenes, it will seriously affect the subsequent information extraction results.

[0048] When solving information extraction tasks in natural scenes, it is common practice to implement text detection, text recognition, and information extraction within a single framework, so information extraction relies heavily on text detection and text recognition. However, because text recognition involves a large number of categories, training a good model often requires vast amounts of data, and text recognition annotation is extremely time-consuming and costly. Furthermore, these methods lack pre-training solutions, resulting in insufficient accuracy and robustness. The LayoutLM series of methods is based on document pre-training, but it relies on an OCR engine to extract OCR information; the accuracy of OCR engine extraction drops significantly in natural scenes, which in turn reduces the accuracy of information extraction.

[0049] Therefore, the solution presented in this specification proposes an information extraction method that does not rely on an OCR engine or text recognition algorithm. This improves the accuracy and robustness of the model in extracting information from natural scenes, and it can extract structured information from images without using any additional OCR engine. While significantly improving speed, the number of model parameters is also greatly reduced.

[0050] This specification provides an entity recognition method, and also relates to a training method for an entity recognition model, a document entity recognition method, an entity recognition device, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.

[0051] Referring to Figure 1, Figure 1 shows a scenario diagram of an entity recognition method according to an embodiment of this specification, specifically including:

[0052] The system receives an entity recognition request for an image and acquires the image. It then inputs the image into a trained entity recognition model. Specifically, the image is input into the feature extraction layer of the entity recognition model to obtain the image features output by the feature extraction layer. These image features are then input into the preprocessing layer of each branch to obtain the image features output by the preprocessing layer of each branch. Finally, the preprocessed image features are input into the entity location layer to obtain the entity location information output by the entity location layer. The entity location layer can perform entity detection on the image, which can be achieved using object detection methods such as CenterNet and Seglink.

[0053] The preprocessed image features and entity location information are input into the feature transformation layer. Based on the entity location information, the entity center information is determined, and the image vector corresponding to the entity center information is extracted from the image features to form entity features. Specifically, the feature transformation layer maps field features from a three-dimensional feature map to a point: the field center point is determined based on the entity location information, and a one-dimensional vector feature is extracted at the center point of the field as the representation of the field. The entity features are composed of these one-dimensional vectors.
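The mapping from a three-dimensional feature map to a single point can be sketched as follows. This is a minimal illustration rather than the implementation described in this specification: the feature map is assumed to be a nested list in channel-first layout (C x H x W), entity boxes are assumed to be given as (x_min, y_min, x_max, y_max) in feature-map coordinates, and the name `center_vector` is hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # assumed layout: [channel][row][col]
Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)


def center_vector(feature_map: FeatureMap, box: Box) -> List[float]:
    """Return the C-dimensional vector found at the center point of `box`."""
    x_min, y_min, x_max, y_max = box
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Center of the entity box, clamped to the feature-map bounds.
    cx = min(max(int((x_min + x_max) / 2), 0), width - 1)
    cy = min(max(int((y_min + y_max) / 2), 0), height - 1)
    # One value per channel at (cy, cx) gives a one-dimensional vector.
    return [channel[cy][cx] for channel in feature_map]


# Toy feature map with C=2, H=3, W=4; each value encodes its own position.
fmap = [[[c * 100 + y * 10 + x for x in range(4)] for y in range(3)] for c in range(2)]
entity_features = [center_vector(fmap, b) for b in [(0, 0, 2, 2), (2, 1, 4, 3)]]
print(entity_features)
```

Each entity is reduced to one vector of length C, so N detected fields yield an N x C set of entity features regardless of how large each field is in the image.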

[0054] Entity features are input into the entity category layer and the entity relation layer. The entity relation layer outputs the entity relations of the image, and the entity category layer outputs the entity categories. Specifically, the entity category layer concatenates the positional encoding with the entity features, processes it through an encoder, and then inputs it into a classifier to obtain the entity categories output by the classifier. The entity relation layer constructs an NxN relation matrix (N being the number of fields) based on the entity features, and then inputs the relation matrix into the encoder to obtain NxC features (C being the number of channels). Matrix transformation converts the NxC features into two feature maps, 1xNxC and Nx1xC. Subtracting these two maps yields an NxNxC matrix, which is the relation matrix of the fields. This relation matrix is then input into the classifier to obtain the entity relations output by the classifier.
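The subtraction of the 1xNxC and Nx1xC views can be illustrated with a short sketch. It assumes the NxC features are a list of N vectors and that entry [i][j] of the result is the feature of field j minus the feature of field i; the text above does not state the subtraction order, so that choice and the name `pairwise_difference` are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def pairwise_difference(features: Matrix) -> List[Matrix]:
    """Turn N x C field features into an N x N x C tensor of pairwise differences.

    Entry [i][j] is features[j] - features[i], which mirrors subtracting an
    N x 1 x C view from a 1 x N x C view with broadcasting.
    """
    return [
        [[b - a for a, b in zip(row_i, row_j)] for row_j in features]
        for row_i in features
    ]


# Three fields (N=3) with two channels each (C=2).
features = [[1.0, 2.0], [4.0, 6.0], [0.0, 1.0]]
relation = pairwise_difference(features)
print(len(relation), len(relation[0]), len(relation[0][0]))
print(relation[0][1], relation[1][0])
```

Every ordered pair of fields gets its own C-dimensional vector, which a classifier can then label as corresponding or non-corresponding.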

[0055] The entity relationship, entity category, and entity location information output by the entity recognition model are used as the entity recognition results.

[0056] The above describes the application process of the entity recognition model. Before applying the entity recognition model, it needs to be trained, specifically as follows:

[0057] The process involves acquiring sample images and extracting sample text and sample locations from them using an OCR engine as sample labels. Further, the frequency of each field in the sample text is counted, and fields with frequencies exceeding a threshold are labeled with attribute tags, while the remaining fields are labeled with attribute value tags. This process sets sample category labels for the sample images. Finally, based on the spatial relationships between fields with different category labels, sample relationship labels are set for the sample images.

[0058] The entity recognition model is trained using the sample images and their sample labels. The specific process is as follows. The sample text is input into a language model to obtain text features, and the image features output by the feature transformation layer are obtained based on the sample locations. Contrastive learning is performed on the image features and the text features to obtain comparison results, from which the model loss value is determined; the entity recognition model is then trained based on this model loss value. Specifically, the model loss value is obtained by comparing the predicted text features that the entity recognition model outputs based on the image features with the text features output by the language model. The difference between these two sets of text features is calculated, and the resulting similarity is used to optimize the text feature output function of the entity recognition model, which gives the image processing model semantic analysis capabilities. The image features aligned by contrastive learning are concatenated with the positional encoding and passed through an encoder and a classifier to obtain the predicted category. The image features aligned by contrastive learning are concatenated with the positional encoding and the classification encoding and passed through an encoder, a matrix transformation, and a classifier to obtain the predicted relationship. The relationship loss value is calculated from the sample relationship and the predicted relationship, the category loss value from the sample category and the predicted category, and the positional loss value from the sample position and the predicted position. The entity recognition model is trained using the comparison results, positional loss value, category loss value, and relationship loss value until the model training stops.
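One way to turn the image-text comparison into a loss value is sketched below. The specification does not give the form of the loss, so this example assumes a symmetric contrastive (InfoNCE-style) objective over cosine similarities with a temperature of 0.07; the function names and the temperature are assumptions made for illustration only.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits: List[Vector]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def alignment_loss(image_feats: List[Vector], text_feats: List[Vector],
                   temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; image feature i and text feature i are a pair."""
    img = [_normalize(v) for v in image_feats]
    txt = [_normalize(v) for v in text_feats]
    # Cosine similarity between every image feature and every text feature.
    logits = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt] for u in img]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


aligned = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when each field's image feature is most similar to the text feature of the same field, and large when the pairing is wrong. How this term would be weighted against the position, category, and relationship losses is not stated in the text and is left open here.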

[0059] One embodiment of this specification implements the following: acquiring a target image; inputting the target image into an entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image.

[0060] By setting up a pre-training task—that is, extracting sample text and sample locations from sample images using an OCR engine as sample labels—a large amount of label data can be efficiently obtained, increasing the amount of sample data used to train the model. This rich sample data then improves the model's processing accuracy. Furthermore, the capabilities of a language model are introduced into the image processing model. This involves comparing and learning the text features output by the language model with those analyzed by the image processing model. This allows the image processing model to extract entities based on the text features calculated by the model without needing to recognize the text content, thus skipping text recognition and freeing the model from limitations imposed by text recognition performance. Finally, the entity recognition results are output within the image processing model, avoiding the need to train multiple models with different functions and then perform entity recognition tasks based on each model, thereby improving the efficiency of entity recognition for images.

[0061] Referring to Figure 2, Figure 2 shows a flowchart of an entity recognition method according to an embodiment of this specification, which specifically includes the following steps.

[0062] Step 202: Obtain the target image.

[0063] The target image refers to an image that requires entity recognition. For example, the target image can be an image captured by an image acquisition device, a picture extracted from an electronic document, an image generated by image processing software, etc. The target image can be an image determined based on an entity recognition request sent by the user, or an image determined by an entity recognition request triggered by other tasks on the terminal, etc. This specification does not make specific limitations.

[0064] Specifically, the terminal receives an entity recognition request for a target image; the entity recognition request either carries the target image itself or contains an image identifier, in which case the target image can be obtained based on the image identifier.

[0065] In one specific embodiment of this specification, a user sends an entity recognition request for a contract photo; the entity recognition request includes the contract photo.

[0066] In another specific embodiment of this specification, the terminal's retrieval task requires retrieving image content from a document; based on the retrieval task, an entity recognition request for the image in the document is triggered; the entity recognition request includes an image identifier; the document is parsed, and the target image corresponding to the entity recognition request is obtained from the image corresponding to the document based on the image identifier.
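The two ways of obtaining the target image described above (carried in the request, or looked up by identifier) can be sketched as follows. The request class, its field names, and the dictionary of parsed document images are hypothetical and only illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EntityRecognitionRequest:
    """Hypothetical request shape: carries the image itself or an identifier."""
    image: Optional[bytes] = None
    image_id: Optional[str] = None


def resolve_target_image(request: EntityRecognitionRequest,
                         document_images: Dict[str, bytes]) -> bytes:
    """Return the target image that the request refers to."""
    if request.image is not None:
        return request.image  # the request carries the image directly
    if request.image_id is not None:
        if request.image_id not in document_images:
            raise LookupError(f"no image with identifier {request.image_id!r}")
        return document_images[request.image_id]
    raise ValueError("request carries neither an image nor an image identifier")


# Images parsed out of a document, keyed by identifier (toy byte strings).
parsed = {"img-1": b"contract-page", "img-2": b"table-scan"}
direct = resolve_target_image(EntityRecognitionRequest(image=b"photo"), parsed)
by_id = resolve_target_image(EntityRecognitionRequest(image_id="img-2"), parsed)
print(direct, by_id)
```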

[0067] Acquiring the target image facilitates subsequent entity recognition.

[0068] Step 204: Input the target image into the entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, and the sample text is identified from the sample image.

[0069] After determining the target image for entity recognition, the target image can be input into the trained entity recognition model, which will then recognize the entities in the target image.

[0070] In this context, an entity recognition model refers to a pre-trained model capable of recognizing entities in a target image. For example, an image requiring entity extraction can be input into a pre-trained entity recognition model. The entity recognition result refers to the entity recognition result output by the entity recognition model, corresponding to the target image. For example, the entity recognition result might be the student ID field in a document image, the corresponding field value "123", and the position information of the student ID field and its value within the document image. In practical applications, the entity recognition result can include entity relationships, entity categories, and entity location information. Entity location refers to the position of an entity in the target image. For example, the entity location of field A in a document image is the coordinate information of field A within the document image. Entity category refers to the category corresponding to the entity in the target image. For example, an entity category can include both fields and field values; for instance, field A is the field category, and field B is the field value category. Entity relationship refers to the relationship between different entity categories. For example, the field "product name" and the field value "hand sanitizer" have a corresponding relationship, while the field "product name" and the field value "hello" do not correspond, indicating a lack of correspondence based on the entity relationship.
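As an illustration of how the three parts of an entity recognition result fit together, the sketch below stores entity locations, entity categories, and entity relationships in one structure. The class names, the category strings, and the box format (x_min, y_min, x_max, y_max) are assumptions for this example and are not defined by the specification.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    category: str  # assumed labels: "field" or "field_value"
    box: Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels


@dataclass
class EntityRecognitionResult:
    entities: List[Entity] = field(default_factory=list)
    # Each relation is (index of a field entity, index of a field-value entity).
    relations: List[Tuple[int, int]] = field(default_factory=list)

    def values_of(self, field_index: int) -> List[Entity]:
        """Return the field-value entities that correspond to a field entity."""
        return [self.entities[v] for f, v in self.relations if f == field_index]


result = EntityRecognitionResult(
    entities=[
        Entity("field", (10, 10, 90, 30)),          # e.g. a "student ID" label
        Entity("field_value", (100, 10, 160, 30)),  # the value next to it
        Entity("field_value", (100, 50, 160, 70)),  # a value of some other field
    ],
    relations=[(0, 1)],
)
print(result.values_of(0))
```

A pair that does not appear in `relations` is treated as non-corresponding, which matches the corresponding / non-corresponding distinction described above.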

[0071] Specifically, before inputting the target image into the entity recognition model, the entity recognition model needs to be trained to enable it to recognize entities. Since the entity recognition model cannot recognize text semantics before training, it needs to be trained by feature alignment between sample images and sample text. The sample text corresponding to the sample image is identified from the sample image. The target image, i.e., the image for which entity recognition is required, is input into the entity recognition model to obtain the entity recognition result output by the entity recognition model based on the target image. Subsequent downstream tasks can be processed based on the entity recognition result.

[0072] In one specific embodiment of this specification, a table image that requires entity recognition is input into a pre-trained entity recognition model, and the entity recognition result output by the entity recognition model based on the table image is obtained. Before the table image is input into the entity recognition model, feature alignment can be performed based on the sample image and the sample text corresponding to the sample image, thereby realizing the training of the entity recognition model.

[0073] The entity recognition task is completed by inputting the target image into the entity recognition model and obtaining the entity recognition result output by the entity recognition model. The entity recognition result can also be used for other downstream tasks.

[0074] In practical applications, it is necessary to train the entity recognition model based on feature alignment of sample text and sample images. Before the target image is input into the entity recognition model to obtain the entity recognition result of the target image, the method further includes the following steps:

[0075] Obtain a sample image, wherein the sample image carries sample text and sample location;

[0076] The sample image is input into the entity recognition model, and image features of the sample image are extracted based on the sample location, and text features are extracted based on the sample text;

[0077] The image features are compared with the text features to obtain the comparison results;

[0078] Based on the comparison results, the model loss value is determined, and the entity recognition model is trained based on the model loss value until the model training stops.

[0079] Here, sample images refer to images used to train entity recognition models. In practical applications, entity recognition models can be trained based on a sample set composed of multiple sample images, thereby improving the recognition performance of entity recognition models. In order to train entity recognition models, each sample image carries a sample label, which can contain sample text and sample location. Sample text refers to the text recognized in the sample image, and sample location refers to the position of the sample text in the sample image.

[0080] A sample image carrying sample text and sample location is input into the entity recognition model. The entity recognition model extracts image features from the sample image based on the sample location and text features based on the sample text. Here, image features refer to the features extracted from the sample image based on the sample location, and text features refer to the features extracted from the sample text. The comparison result refers to the similarity between the image features and the text features. For example, the comparison result between the image features and the text features is 90%. The model loss value refers to the loss value used to adjust the model parameters of the entity recognition model.

[0081] Specifically, sample images are acquired for training the entity recognition model; the sample images carry sample text and sample location labels; the sample images carrying sample locations and sample text are input into the entity recognition model, which extracts image features from the sample images based on the sample locations and extracts text features from the sample text; the image features and text features are compared, i.e., feature alignment is performed, and the comparison result is obtained; the model loss value of the entity recognition model is determined based on the comparison result; the model parameters of the entity recognition model are adjusted based on the model loss value until the model training conditions are met.

[0082] In a specific embodiment of this specification, a sample image set G is obtained, where each sample image in the sample image set G carries corresponding sample text and sample location; sample image 1 in the sample image set G is input into the entity recognition model, and image features of sample image 1 are extracted based on the sample location of sample image 1, and text features are extracted from the sample text of sample image 1; the image features and text features are compared to obtain the comparison result; the model loss value is determined based on the comparison result, and the entity recognition model is trained based on the model loss value; then other sample images in the sample image set G are input into the entity recognition model, and the entity recognition model is trained again until the model training stopping condition is reached, thus obtaining the trained entity recognition model.

[0083] By using sample images carrying sample locations and sample text to train the entity recognition model, the model can be trained based on the feature comparison results between the image and the text, thus enabling the entity recognition model to recognize the text.

[0084] In practical applications, before training an entity recognition model, it is necessary to obtain sample images carrying sample labels. The sample labels on the sample images can be manually labeled, labeled by a pre-trained model, or extracted directly from the sample images based on the recognition algorithm. Since training an entity recognition model requires a large number of sample images, and the accuracy requirements for the sample labels are not high, the method for generating sample images carrying sample labels in this specification is to extract them directly from the sample images based on the recognition algorithm.

[0085] Specifically, before acquiring the sample image, the following steps are also included:

[0086] Extract sample image features from the sample images;

[0087] Based on the features of the sample image, the text boxes in the sample image are identified to determine the sample text and the corresponding sample position;

[0088] By analyzing the target entity fields in the sample text, the sample category corresponding to the entity in the sample image is determined.

[0089] The sample relationships between entities are determined based on the sample location and sample category corresponding to the entity.

[0090] Among them, sample image features refer to the image features extracted from the sample image; text box refers to a text content box contained in the sample image; target entity field refers to the field in the sample text whose frequency of occurrence exceeds a preset threshold; sample category refers to the category corresponding to the field in the sample image; sample relationship refers to the relationship between sample fields of different sample categories, such as corresponding relationship or non-corresponding relationship.

[0091] Specifically, sample image features are extracted from the sample image; based on the sample image features, the sample text in each text box of the sample image is identified and the corresponding sample position is determined; each field contained in the sample text is determined, and fields whose frequency of occurrence exceeds a preset threshold are designated as target entity fields, while fields whose frequency of occurrence does not exceed the preset threshold are designated as non-target entity fields; based on the position information of the target entity fields and non-target entity fields in the sample image, the correspondence between the target entity fields and non-target entity fields, i.e., the sample relationship, is determined.

[0092] In a specific embodiment of this specification, the sample location and sample text in the sample image are identified based on an OCR recognition engine; fields that occur more than 30 times in the sample text are identified as target entity fields, and the remaining fields are identified as non-target entity fields; the sample relationship is determined based on the positional relationship between the target entity fields and the non-target entity fields, that is, the relationship between a target entity field and a non-target entity field is either a corresponding relationship or a non-corresponding relationship.

[0093] By directly identifying sample labels in sample images, the speed of sample label generation is improved, saving the time of obtaining sample images containing sample labels, thereby improving the training efficiency of entity recognition models.
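The label generation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the input is taken to be OCR output in the form of `(text, box)` pairs with boxes as `(x0, y0, x1, y1)`, and the nearest-center rule used to decide which fields correspond is a hypothetical stand-in, since the specification only says the relationship is determined from position. The function names are not from the specification.

```python
import math
from collections import Counter


def box_center(box):
    """Center point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def label_fields(images, threshold):
    """Mark fields whose frequency across all images exceeds the threshold as target fields."""
    counts = Counter(text for fields in images for text, _ in fields)
    labeled = []
    for fields in images:
        labeled.append([
            {
                "text": text,
                "box": box,
                "category": "target" if counts[text] > threshold else "non_target",
            }
            for text, box in fields
        ])
    return labeled


def link_fields(fields):
    """Assumed rule: each target field corresponds to the nearest non-target field."""
    targets = [f for f in fields if f["category"] == "target"]
    others = [f for f in fields if f["category"] == "non_target"]
    links = []
    for target in targets:
        if not others:
            break
        tx, ty = box_center(target["box"])
        nearest = min(
            others,
            key=lambda o: math.hypot(box_center(o["box"])[0] - tx,
                                     box_center(o["box"])[1] - ty),
        )
        links.append((target["text"], nearest["text"]))
    return links
```

Fields such as "Name" that recur across many documents end up as target entity fields, while the values printed next to them, which vary from image to image, fall below the threshold and become non-target entity fields.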

[0094] In practical applications, image features of the sample image are extracted based on the sample location, and text features are extracted based on the sample text, including:

[0095] Extract the sample image features from the sample image;

[0096] The entity center information of the entity in the sample image is determined based on the sample location;

[0097] Based on the entity center information, extract the sample image vector from the sample image features of the sample image, and generate image features based on the sample image vector;

[0098] The sample text is input into the target language model to obtain the text features output by the target language model.

[0099] Among them, sample image features refer to the image features extracted from the sample image. The specific method for extracting image features is not specifically limited in this specification. The position information of the entity in the image can be determined based on the sample position. The sample center point, i.e., the entity center information, can be determined based on the sample position. The sample image vector refers to the vector corresponding to the entity center information in the sample image features. The sample image vectors corresponding to the entity are combined to obtain the image features corresponding to the sample image. The target language model refers to the model with semantic analysis function, such as the BERT model.

[0100] Specifically, the sample images are input into the entity recognition model to extract sample image features; the entity center location information is determined based on the sample location; sample image vectors are extracted from the sample image features based on the entity center location information, and image features are formed by each sample image vector; the sample text is input into the target language model to obtain the text features output by the target language model.

[0101] By acquiring image features generated by the entity recognition model and text features extracted from sample text, it is easy to compare the image features with the text features. This allows the entity recognition model to be trained based on the comparison results, enabling the entity recognition model to have the semantic analysis function of the target language model.
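Extracting the sample image vector at each entity center can be sketched as a lookup into a feature map. The sketch assumes the sample image features form a grid of vectors (`feature_map[row][col]`) in which each cell covers `stride` pixels; the specification does not fix the feature layout, so `stride` and the function names are illustrative. Obtaining text features from a language model such as BERT is omitted here.

```python
def entity_center(box):
    """Center point of an (x0, y0, x1, y1) sample location, in pixels."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def gather_center_vectors(feature_map, boxes, stride):
    """Pick, for every entity, the feature vector of the grid cell containing its center."""
    height = len(feature_map)
    width = len(feature_map[0])
    vectors = []
    for box in boxes:
        cx, cy = entity_center(box)
        # Convert pixel coordinates to grid indices, clamped to the map bounds.
        col = min(int(cx // stride), width - 1)
        row = min(int(cy // stride), height - 1)
        vectors.append(feature_map[row][col])
    return vectors
```

The returned list, one vector per entity, plays the role of the image features that are then compared with the text features.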

[0102] Furthermore, in order for the entity recognition model to output entity location information in an image, it is necessary to train the entity recognition model based on sample locations. That is, before training the entity recognition model based on the model loss value, the following steps are also included:

[0103] Obtain the predicted location output by the entity recognition model;

[0104] Calculate the location loss value based on the predicted location and the sample location;

[0105] Based on the comparison results, the model loss value is determined, including:

[0106] The model loss value is determined based on the location loss value and the comparison result.

[0107] Among them, the predicted location refers to the location information output by the entity recognition model based on the input sample image; the location loss value refers to the loss value obtained based on the sample location and the predicted location.

[0108] Specifically, based on the predicted location output by the entity recognition model and the sample location carried in the sample image, the location loss value is calculated; based on the location loss value and the comparison result, the model loss value for the entity recognition model is determined, so that the entity recognition model can be trained based on the model loss value including the location loss value.

[0109] In one specific embodiment of this specification, the predicted position output by the entity recognition model based on the sample image is obtained; the position loss value is calculated based on the loss function, the predicted position, and the sample position; the sum of the position loss value and the comparison result is used as the model loss value to train the entity recognition model.

[0110] By training the entity recognition model with location loss values, the entity recognition model can output more accurate entity location information.

[0111] Furthermore, in order for the entity recognition model to output the entity category of an image, the entity recognition model can be trained based on the sample category labels of the sample images, that is, the sample images can carry sample categories; before training the entity recognition model based on the model loss value, the following steps are also included:

[0112] Obtain the predicted category output by the entity recognition model;

[0113] Calculate the category loss value based on the sample category and the predicted category;

[0114] Based on the comparison results, the model loss value is determined, including:

[0115] The model loss value is determined based on the category loss value and the comparison result.

[0116] Among them, the predicted category refers to the category information output by the entity recognition model based on the input sample image; the category loss value refers to the loss value obtained based on the sample category and the predicted category.

[0117] Specifically, based on the predicted category output by the entity recognition model and the sample category carried by the sample image, a category loss value is calculated; based on the category loss value and the comparison result, a model loss value for the entity recognition model is determined, so that the entity recognition model can be trained based on the model loss value including the category loss value.

[0118] In one specific embodiment of this specification, the predicted category output by the entity recognition model based on the sample image is obtained; the category loss value is calculated based on the loss function, the predicted category, and the sample category; and the sum of the category loss value and the comparison result is used as the model loss value to train the entity recognition model.

[0119] By training the entity recognition model with category loss values, the entity recognition model can output more accurate entity category information.

[0120] Furthermore, in order for the entity recognition model to output the entity relationships in an image, the entity recognition model can be trained based on the sample relationship labels of the sample images, that is, the sample images can carry sample relationships. Before training the entity recognition model based on the model loss value, the following steps are also included:

[0121] Obtain the predicted relationship output by the entity recognition model;

[0122] Calculate the relationship loss value based on the sample relationship and the predicted relationship;

[0123] Based on the comparison results, the model loss value is determined, including:

[0124] The model loss value is determined based on the relationship loss value and the comparison result.

[0125] Among them, the predicted relationship refers to the relationship information output by the entity recognition model based on the input sample image; the relationship loss value refers to the loss value obtained based on the sample relationship and the predicted relationship.

[0126] Specifically, the relationship loss value is calculated based on the predicted relationship output by the entity recognition model and the sample relationship carried by the sample image; the model loss value for the entity recognition model is determined based on the relationship loss value and the comparison result, so that the entity recognition model can be trained based on the model loss value including the relationship loss value.

[0127] In one specific embodiment of this specification, the predicted relationship output by the entity recognition model based on the sample image is obtained; the relationship loss value is calculated based on the loss function, the predicted relationship, and the sample relationship; the sum of the relationship loss value and the comparison result is used as the model loss value to train the entity recognition model.

[0128] By training the entity recognition model with relation loss values, the entity recognition model can output more accurate entity relation information.

[0129] In another specific embodiment of this specification, the position loss value, relationship loss value, and category loss value are calculated based on the predicted position, predicted category, and predicted relationship output by the entity recognition model, respectively; the sum of the position loss value, relationship loss value, category loss value, and comparison result is used as the model loss value for training the entity recognition model.
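The summed model loss described above can be sketched as follows. The specification only refers to "the loss function" for each term, so the concrete choices here (mean absolute error for positions, cross-entropy for categories, binary cross-entropy for relationships) are assumptions, as are all function names.

```python
import math


def location_loss(pred_boxes, true_boxes):
    """Assumed position loss: mean absolute error over box coordinates."""
    total, count = 0.0, 0
    for pred, true in zip(pred_boxes, true_boxes):
        total += sum(abs(p - t) for p, t in zip(pred, true))
        count += len(true)
    return total / count


def category_loss(pred_probs, true_labels, eps=1e-12):
    """Assumed category loss: cross-entropy on predicted class probabilities."""
    total = 0.0
    for probs, label in zip(pred_probs, true_labels):
        total -= math.log(max(probs[label], eps))
    return total / len(true_labels)


def relationship_loss(pred_probs, true_links, eps=1e-12):
    """Assumed relationship loss: binary cross-entropy (1 = corresponding, 0 = not)."""
    total = 0.0
    for prob, link in zip(pred_probs, true_links):
        prob = min(max(prob, eps), 1.0 - eps)
        total -= link * math.log(prob) + (1 - link) * math.log(1.0 - prob)
    return total / len(true_links)


def model_loss(location, category, relationship, comparison):
    """Model loss as the plain sum of the three task losses and the comparison result."""
    return location + category + relationship + comparison
```

An unweighted sum follows the wording of the embodiment; weighting the individual terms would be a design choice beyond what the text states.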

[0130] After the entity recognition model has been trained, the target image can be input into the trained entity recognition model to extract entities from the target image.

[0131] Specifically, the entity recognition model includes an entity location layer, an entity category layer, an entity relationship layer, a feature conversion layer, and a feature extraction layer;

[0132] The target image is input into an entity recognition model to obtain the entity recognition result of the target image, including:

[0133] The target image is input into the feature extraction layer to obtain the target image features output by the feature extraction layer;

[0134] The target image features are input into the entity location layer to obtain the entity location information detected by the entity location layer;

[0135] The target image features and the entity location information are input into the feature conversion layer to obtain the target entity features output by the feature conversion layer.

[0136] The target entity features are input into the entity category layer and the entity relationship layer respectively to obtain the entity category output by the entity category layer and the entity relationship output by the entity relationship layer.

[0137] The entity location information, entity category, and entity relationship are used as the entity recognition result of the target image.

[0138] Among them, the entity location layer refers to the layer for identifying the location information of entities in the target image; the entity category layer refers to the layer for identifying the category of entities in the target image; the entity relationship layer refers to the layer for identifying the relationships between entities in the target image; the feature conversion layer refers to the layer for converting the image features of the target image into target entity features; the feature extraction layer refers to the layer for extracting the image features of the target image; target image features refer to the image features extracted from the target image; entity location information refers to the location information of entities in the target image; target entity features refer to the entity features obtained by converting the target image features; entity category refers to the category corresponding to an entity in the target image; and entity relationship refers to the relationship between entities in the target image.

[0139] In one specific embodiment of this specification, an image containing text content is input into a trained entity recognition model; target image features output by the feature extraction layer of the entity recognition model are obtained; the target image features are input into the entity location layer to obtain entity location information output by the entity location layer; both the target image features and the entity location information are input into the feature conversion layer to obtain target entity features output by the feature conversion layer; the target entity features are input into the entity category layer to obtain the entity category output by the entity category layer; the target entity features are input into the entity relationship layer to obtain the entity relationship output by the entity relationship layer; and the entity location information, entity category, and entity relationship output by the entity recognition model are used as the entity recognition result of the image.
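The data flow between the five layers can be sketched as a simple composition. This is only a structural illustration: each layer is represented by an arbitrary callable, and the class name, field names, and result keys are hypothetical rather than taken from the specification.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EntityRecognitionModel:
    """Wires the five layers together; each layer is any callable."""
    feature_extraction: Callable[[Any], Any]
    entity_location: Callable[[Any], Any]
    feature_conversion: Callable[[Any, Any], Any]
    entity_category: Callable[[Any], Any]
    entity_relationship: Callable[[Any], Any]

    def recognize(self, image):
        # Image -> target image features.
        image_features = self.feature_extraction(image)
        # Target image features -> entity location information.
        locations = self.entity_location(image_features)
        # Features plus locations -> target entity features.
        entity_features = self.feature_conversion(image_features, locations)
        # The same entity features feed both the category and relationship heads.
        return {
            "locations": locations,
            "categories": self.entity_category(entity_features),
            "relationships": self.entity_relationship(entity_features),
        }
```

Note that no text recognition step appears in this flow: categories and relationships are predicted from entity features rather than from recognized characters.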

[0140] Further, the target image features and the entity location information are input into the feature conversion layer to obtain the target entity features output by the feature conversion layer, including:

[0141] The target image features and the entity location information are input into the feature conversion layer;

[0142] The entity center information of the entity in the target image is determined based on the entity location information;

[0143] Based on the entity center information, the target image vector is extracted from the target image features, and entity center features are generated based on the target image vector;

[0144] Target entity features are generated based on the entity center features and target semantic features.

[0145] Among them, entity center information refers to the central location information of the entity; target image vector refers to the image vector in the target image features that corresponds to the entity center information; entity center feature refers to the matrix generated based on the target image vector; target semantic feature refers to the semantic features used to train the entity recognition model based on the feature alignment between the sample image and the sample text.

[0146] Specifically, the entity center information of the entity in the target image is determined based on the entity location information; the corresponding target image vector is extracted from the target image features based on the entity center information; the entity center feature is constructed based on each target image vector; and the entity center feature and the target semantic feature are fused to obtain the target entity feature.

[0147] By extracting, from the target image features, target entity features that can be used to identify entity categories and entity relationships, entity categories and entity relationships can subsequently be identified based on the target entity features.
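The fusion of entity center features with target semantic features can be sketched as below. The specification does not say how the two are fused or what shape the target semantic features take, so this sketch assumes one semantic vector per entity and uses concatenation; both the assumption and the name `fuse_entity_features` are illustrative only.

```python
def fuse_entity_features(center_features, semantic_features):
    """Assumed fusion: concatenate each entity's center vector with its semantic vector."""
    if len(center_features) != len(semantic_features):
        raise ValueError("expected one semantic vector per entity")
    return [
        list(center) + list(semantic)
        for center, semantic in zip(center_features, semantic_features)
    ]
```

Element-wise addition or a learned projection would be equally plausible readings of "fused"; concatenation is used here only because it needs no assumption about matching dimensions.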

[0148] One embodiment of this specification implements the following: acquiring a target image; inputting the target image into an entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image.

[0149] By inputting the target image into the entity recognition model, the entity recognition model can extract entities from the target image. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, which enables the entity recognition model to have semantic analysis capabilities, thereby obtaining entity recognition results that meet the requirements and improving the efficiency and accuracy of entity recognition.

[0150] The entity recognition method is further explained below with reference to Figure 3, taking the application of the entity recognition method provided in this specification to documents as an example. Figure 3 shows a flowchart of a document entity recognition method according to an embodiment of this specification, which specifically includes the following steps.

[0151] Step 302: Obtain the image to be processed, wherein the image to be processed is collected from the document to be processed.

[0152] Specifically, the document to be processed may contain one or more images to be processed; the images to be processed are determined by parsing the document to be processed.

[0153] Step 304: Input the image to be processed into the entity recognition model to obtain the text entity recognition result of the image to be processed, wherein the entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, and the sample text is recognized from the sample image.

[0154] Specifically, the entity recognition model is a trained model; furthermore, in order to improve the recognition accuracy of the entity recognition model, it can be fine-tuned based on document-related sample images, and then the fine-tuned entity recognition model is used to recognize the image to be processed.

[0155] The embodiments in this specification implement the following: acquiring an image to be processed; inputting the image to be processed into an entity recognition model to obtain the entity recognition result of the image to be processed, wherein the entity recognition model is pre-trained based on feature alignment between sample images and sample text, and the sample text is identified from the sample images.

[0156] By inputting the image to be processed into the entity recognition model, the entity recognition model can extract entities from the image. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, which enables the entity recognition model to have semantic analysis capabilities, thereby obtaining entity recognition results that meet the requirements and improving the efficiency and accuracy of entity recognition.

[0157] Referring to Figure 4, Figure 4 shows a flowchart of a training method for an entity recognition model according to an embodiment of this specification. The method is applied to a cloud-side device and specifically includes the following steps.

[0158] Step 402: Obtain a sample image, wherein the sample image carries sample text and sample location.

[0159] Step 404: Input the sample image into the entity recognition model, extract the image features of the sample image based on the sample location, and extract the text features based on the sample text.

[0160] Step 406: Compare the image features with the text features to obtain the comparison results.

[0161] Step 408: Based on the comparison results, determine the model loss value, and train the entity recognition model based on the model loss value until the model training stopping condition is met, and obtain the trained entity recognition model;

[0162] Step 410: Based on the model acquisition request sent by the terminal, send the model parameters of the trained entity recognition model to the terminal.

[0163] Specifically, the cloud-side device acquires a sample dataset for model training based on the model training task. This dataset contains sample images. Sample labels are assigned to the sample images in the dataset, containing sample text and sample location. The labeled sample images are then input into the entity recognition model. The cloud-side device extracts image features from the sample images based on their locations and extracts text features based on the sample text. The image features and text features are compared to obtain the comparison results. Based on the comparison results, the model loss value is calculated, and the entity recognition model on the cloud-side device is trained based on this loss value until the model training stops, resulting in a trained entity recognition model. The user sends a model acquisition request to the cloud-side device via their terminal. Based on the model acquisition request, the cloud-side device returns the model parameters of the trained entity recognition model to the terminal. The terminal uses these model parameters to call the trained entity recognition model from the cloud-side device, avoiding the consumption of terminal computing resources for model training and application, thus reducing the terminal's computational burden.

[0164] One embodiment of this specification implements the following: acquiring a sample image, wherein the sample image carries sample text and sample location; inputting the sample image into the entity recognition model; extracting image features of the sample image based on the sample location, and extracting text features based on the sample text; comparing the image features with the text features to obtain a comparison result; determining a model loss value based on the comparison result, and training the entity recognition model based on the model loss value until a model training stopping condition is reached to obtain a trained entity recognition model; and sending the model parameters of the trained entity recognition model to the terminal based on a model acquisition request sent by the terminal.

[0165] By setting up a pre-training task—that is, extracting sample text and sample locations from sample images using an OCR engine as sample labels—a large amount of label data can be efficiently obtained, increasing the amount of sample data used to train the model. This rich sample data then improves the model's processing accuracy. Furthermore, the capabilities of a language model are introduced into the image processing model. This involves comparing and learning the text features output by the language model with those analyzed by the image processing model. This allows the image processing model to extract entities based on the text features calculated by the model without needing to recognize the text content, thus skipping text recognition and freeing the model from limitations imposed by text recognition performance. Finally, the entity recognition results are output within the image processing model, avoiding the need to train multiple models with different functions and then perform entity recognition tasks based on each model, thereby improving the efficiency of entity recognition for images.
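The cloud-side flow of Steps 402 to 410 can be sketched as a training loop with a stopping condition followed by a handler for model acquisition requests. Everything concrete here is an assumption for illustration: the stopping condition (mean loss below a threshold or a maximum number of epochs), the request format, and the names `train_until_stop` and `ModelStore` do not come from the specification.

```python
def train_until_stop(train_step, samples, max_epochs=100, loss_threshold=1e-3):
    """Call train_step on every sample until the mean loss drops below the threshold.

    train_step is expected to update the model in place and return the sample's loss.
    """
    history = []
    for _ in range(max_epochs):
        mean_loss = sum(train_step(sample) for sample in samples) / len(samples)
        history.append(mean_loss)
        if mean_loss < loss_threshold:  # assumed training stopping condition
            break
    return history


class ModelStore:
    """Holds trained parameters on the cloud side and answers terminal requests."""

    def __init__(self):
        self._params = None

    def publish(self, params):
        self._params = dict(params)

    def handle_model_request(self, request):
        if request.get("type") != "get_model" or self._params is None:
            return {"status": "unavailable"}
        return {"status": "ok", "params": dict(self._params)}
```

In this arrangement the terminal never runs `train_until_stop` itself; it only receives the published parameters, which matches the stated goal of keeping training cost off the terminal.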

[0166] Referring to Figure 5, Figure 5 shows a flowchart of another entity recognition method according to an embodiment of this specification. The method is applied to a cloud-side device and specifically includes the following steps.

[0167] Step 502: Receive the entity recognition request sent by the terminal.

[0168] Step 504: Obtain the target image according to the entity recognition request, and input the target image into the entity recognition model, wherein the entity recognition model is pre-trained based on feature alignment between the sample image and the sample text, and the sample text is recognized from the sample image.

[0169] Step 506: Obtain the entity recognition result output by the entity recognition model and return the entity recognition result to the terminal.

[0170] One embodiment of this specification implements the following: acquiring a target image based on an entity recognition request sent by a terminal; inputting the target image into an entity recognition model to obtain an entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image; acquiring the entity recognition result output by the entity recognition model and returning the entity recognition result to the terminal.

[0171] By inputting the target image into the entity recognition model, the entity recognition model can extract entities from the target image. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, which enables the entity recognition model to have semantic analysis capabilities, thereby obtaining entity recognition results that meet the requirements and improving the efficiency and accuracy of entity recognition.

[0172] Corresponding to the above method embodiments, this specification also provides embodiments of an entity recognition device. Figure 6 shows a schematic diagram of the structure of an entity recognition device according to one embodiment of this specification. As shown in Figure 6, the device includes:

[0173] The acquisition module 602 is configured to acquire the target image;

[0174] The input module 604 is configured to input the target image into an entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image.

[0175] Optionally, the entity recognition model includes an entity location layer, an entity category layer, an entity relationship layer, a feature conversion layer, and a feature extraction layer; the input module 604 is further configured to:

[0176] The target image is input into the feature extraction layer to obtain the target image features output by the feature extraction layer;

[0177] The target image features are input into the entity location layer to obtain the entity location information detected by the entity location layer;

[0178] The target image features and the entity location information are input into the feature conversion layer to obtain the target entity features output by the feature conversion layer.

[0179] The target entity features are input into the entity category layer and the entity relationship layer respectively to obtain the entity category output by the entity category layer and the entity relationship output by the entity relationship layer.

[0180] The entity location information, entity category, and entity relationship are used as the entity recognition result of the target image.

[0181] Optionally, the input module 604 is further configured to:

[0182] The target image features and the entity location information are input into the feature conversion layer;

[0183] The entity center information of the entity in the target image is determined based on the entity location information;

[0184] Based on the entity center information, the target image vector is extracted from the target image features, and entity center features are generated based on the target image vector;

[0185] Target entity features are generated based on the entity center features and target semantic features.

[0186] Optionally, the input module 604 is further configured to:

[0187] Obtain a sample image, wherein the sample image carries sample text and sample location;

[0188] The sample image is input into the entity recognition model, and image features of the sample image are extracted based on the sample location, and text features are extracted based on the sample text;

[0189] The image features are compared with the text features to obtain the comparison results;

[0190] Based on the comparison results, the model loss value is determined, and the entity recognition model is trained based on the model loss value until the model training stops.

[0191] Optionally, the input module 604 is further configured to:

[0192] Extract sample image features from the sample images;

[0193] Based on the features of the sample image, the text boxes in the sample image are identified to determine the sample text and the corresponding sample position;

[0194] By analyzing the target entity fields in the sample text, the sample category corresponding to the entity in the sample image is determined.

[0195] The sample relationships between entities are determined based on the sample location and sample category corresponding to the entity.

[0196] Optionally, the input module 604 is further configured to:

[0197] Extract the sample image features from the sample image;

[0198] The entity center information of the entity in the sample image is determined based on the sample location;

[0199] Based on the entity center information, extract sample image vectors from the sample image features of the sample image, and generate image features based on the sample image vectors;

[0200] The sample text is input into the target language model to obtain the text features output by the target language model.

[0201] Optionally, the input module 604 is further configured to:

[0202] Obtain the predicted location output by the entity recognition model;

[0203] Calculate the location loss value based on the predicted location and the sample location;

[0204] Based on the comparison results, the model loss value is determined, including:

[0205] The model loss value is determined based on the location loss value and the comparison result.

[0206] Optionally, the sample image carries a sample category; the input module 604 is further configured to:

[0207] Obtain the predicted category output by the entity recognition model;

[0208] Calculate the category loss value based on the sample category and the predicted category;

[0209] Based on the comparison results, the model loss value is determined, including:

[0210] The model loss value is determined based on the category loss value and the comparison result.

[0211] Optionally, the sample image carries a sample relationship; the input module 604 is further configured to:

[0212] Obtain the predicted relationship output by the entity recognition model;

[0213] Calculate the relationship loss value based on the sample relationship and the predicted relationship;

[0214] Based on the comparison results, the model loss value is determined, including:

[0215] The model loss value is determined based on the relationship loss value and the comparison result.

[0216] The entity recognition device described in this specification implements the following: acquiring a target image; inputting the target image into an entity recognition model to obtain the entity recognition result of the target image, wherein the entity recognition model is pre-trained based on feature alignment between a sample image and sample text, and the sample text is identified from the sample image.

[0217] By inputting the target image into the entity recognition model, the entity recognition model can extract entities from the target image. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, which gives the entity recognition model semantic analysis capabilities, so that it obtains entity recognition results that meet the requirements and improves the efficiency and accuracy of entity recognition.
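One possible form of the comparison between image features and text features that underlies feature alignment is sketched below. The specification states only that the two feature sets are compared; the use of cosine similarity, a softmax-based contrastive loss, the `temperature` value, and the function names are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alignment_loss(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Image-to-text contrastive loss: entity i's image feature should match text feature i.

    Assumes image_feats[i] and text_feats[i] describe the same entity.
    """
    total = 0.0
    for i, img in enumerate(image_feats):
        logits = [cosine(img, txt) / temperature for txt in text_feats]
        # Numerically stable log-sum-exp over all candidate texts.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(image_feats)
```

Under these assumptions, the loss is small when each entity's image feature is most similar to its own text feature and grows when the pairing is wrong, which is the behavior a feature alignment objective needs.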

[0218] The above is an illustrative scheme of an entity recognition device according to this embodiment. It should be noted that the technical solution of this entity recognition device and the technical solution of the entity recognition method described above belong to the same concept. For details not described in detail in the technical solution of the entity recognition device, please refer to the description of the technical solution of the entity recognition method described above.

[0219] Figure 7 shows a structural block diagram of a computing device 700 according to one embodiment of this specification. The components of the computing device 700 include, but are not limited to, a memory 710 and a processor 720. The processor 720 is connected to the memory 710 via a bus 730, and a database 750 is used to store data.

[0220] The computing device 700 also includes an access device 740, which enables the computing device 700 to communicate via one or more networks 760. Examples of these networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of wired or wireless network interface (e.g., a network interface controller (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.

[0221] In one embodiment of this specification, the above-described components of the computing device 700, as well as other components not shown in Figure 7, can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 7 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.

[0222] The computing device 700 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 700 can also be a mobile or stationary server.

[0223] The processor 720 executes computer-executable instructions, which, when executed by the processor, implement the steps of the aforementioned entity recognition method. The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the aforementioned entity recognition method belong to the same concept; details not described in detail in the technical solution of the computing device can be found in the description of the technical solution of the aforementioned entity recognition method.

[0224] An embodiment of this specification also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the entity recognition method described above.

[0225] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the entity recognition method described above belong to the same concept. For details not described in detail in the technical solution of the storage medium, please refer to the description of the technical solution of the entity recognition method described above.

[0226] An embodiment of this specification also provides a computer program, wherein when the computer program is executed in a computer, it causes the computer to perform the steps of the above-described entity recognition method.

[0227] The above is an illustrative example of a computer program according to this embodiment. It should be noted that the technical solution of this computer program and the technical solution of the aforementioned entity recognition method belong to the same concept. Details not described in detail in the computer program's technical solution can be found in the description of the technical solution of the aforementioned entity recognition method.

[0228] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0229] The computer instructions include computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately expanded or reduced according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0230] It should be noted that the images, models, sample sets, and other information and data involved in the above method embodiments are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0231] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.

[0232] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0233] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

1. An entity recognition method, comprising: Acquire the target image; The target image is input into an entity recognition model to obtain the entity recognition result of the target image. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, and the sample text is identified from the sample images. The entity recognition model includes an entity location layer, an entity category layer, an entity relationship layer, a feature transformation layer, and a feature extraction layer. The target image is input into the feature extraction layer to output target image features. The target image features are input into the entity location layer to obtain entity location information. The entity location information and the target image features are input into the feature transformation layer to determine entity center information and output target entity features. The target entity features are input into the entity category layer and the entity relationship layer respectively to output entity category and entity relationship.

2. The method as described in claim 1, wherein inputting the target image into an entity recognition model to obtain the entity recognition result of the target image includes: The target image is input into the feature extraction layer to obtain the target image features output by the feature extraction layer; The target image features are input into the entity location layer to obtain the entity location information detected by the entity location layer; The target image features and the entity location information are input into the feature transformation layer to obtain the target entity features output by the feature transformation layer; The target entity features are input into the entity category layer and the entity relationship layer respectively to obtain the entity category output by the entity category layer and the entity relationship output by the entity relationship layer; The entity location information, entity category, and entity relationship are used as the entity recognition result of the target image.

3. The method as described in claim 2, wherein the target image features and the entity location information are input into the feature transformation layer to obtain the target entity features output by the feature transformation layer, comprising: The target image features and the entity location information are input into the feature transformation layer; The entity center information of the entity in the target image is determined based on the entity location information; Based on the entity center information, the target image vector is extracted from the target image features, and entity center features are generated based on the target image vector; Target entity features are generated based on the entity center features and target semantic features.

4. The method as described in claim 1, before inputting the target image into the entity recognition model to obtain the entity recognition result of the target image, further includes: Obtain a sample image, wherein the sample image carries sample text and sample location; The sample image is input into the entity recognition model, and image features of the sample image are extracted based on the sample location, and text features are extracted based on the sample text; The image features are compared with the text features to obtain the comparison results; Based on the comparison results, the model loss value is determined, and the entity recognition model is trained based on the model loss value until the model training stopping condition is met.

5. The method of claim 4, further comprising, before acquiring the sample image: Extract sample image features from the sample images; Based on the features of the sample image, the text boxes in the sample image are identified to determine the sample text and the corresponding sample position; By analyzing the target entity fields in the sample text, the sample category corresponding to the entity in the sample image is determined. The sample relationships between entities are determined based on the sample location and sample category corresponding to the entity.

6. The method as described in claim 4 or 5, wherein image features of the sample image are extracted based on the sample location, and text features are extracted based on the sample text, comprising: Extract the sample image features from the sample image; The entity center information of the entity in the sample image is determined based on the sample location; Based on the entity center information, extract the sample image vector from the sample image features of the sample image, and generate image features based on the sample image vector; The sample text is input into the target language model to obtain the text features output by the target language model.

7. The method of claim 4, further comprising, before training the entity recognition model based on the model loss value: Obtain the predicted location output by the entity recognition model; Calculate the location loss value based on the predicted location and the sample location; Based on the comparison results, the model loss value is determined, including: The model loss value is determined based on the location loss value and the comparison result.

8. The method of claim 5, wherein the sample image carries a sample category; and before training the entity recognition model based on the model loss value, further comprising: Obtain the predicted category output by the entity recognition model; Calculate the category loss value based on the sample category and the predicted category; Based on the comparison results, the model loss value is determined, including: The model loss value is determined based on the category loss value and the comparison result.

9. The method of claim 5, wherein the sample image carries sample relationships; before training the entity recognition model based on the model loss value, it further includes: Obtain the predicted relationship output by the entity recognition model; Calculate the relationship loss value based on the sample relationship and the predicted relationship; Based on the comparison results, the model loss value is determined, including: The model loss value is determined based on the relationship loss value and the comparison result.

10. A training method for an entity recognition model, applied to cloud-based devices, comprising: Obtain a sample image, wherein the sample image carries sample text and sample location; The sample image is input into the entity recognition model. Based on the sample location, image features of the sample image are extracted, and text features are extracted based on the sample text. The entity recognition model includes an entity location layer, an entity category layer, an entity relationship layer, a feature transformation layer, and a feature extraction layer. The sample image is input into the feature extraction layer, which outputs target image features. The target image features are input into the entity location layer to obtain entity location information. The entity location information and the image features are input into the feature transformation layer to determine entity center information and output target entity features. The target entity features are input into the entity category layer and the entity relationship layer, respectively, to output entity category and entity relationship. The image features are compared with the text features to obtain the comparison results; Based on the comparison results, the model loss value is determined, and the entity recognition model is trained based on the model loss value until the model training stopping condition is met, thereby obtaining a trained entity recognition model. The model loss value includes a location loss value, a category loss value, and a relationship loss value. Based on the model acquisition request sent by the terminal, the model parameters of the trained entity recognition model are sent to the terminal.

11. An entity recognition method, applied to a cloud-based device, comprising: Receive entity recognition request sent by the terminal; The target image is obtained according to the entity recognition request, and then input into the entity recognition model. The entity recognition model is pre-trained based on feature alignment between sample images and sample text, where the sample text is identified from the sample images. The entity recognition model includes an entity location layer, an entity category layer, an entity relationship layer, a feature transformation layer, and a feature extraction layer. The target image is input into the feature extraction layer, which outputs target image features. The target image features are input into the entity location layer to obtain entity location information. The entity location information and the target image features are input into the feature transformation layer to determine entity center information and output target entity features. The target entity features are input into the entity category layer and the entity relationship layer, respectively, to output entity category and entity relationship. Obtain the entity recognition result output by the entity recognition model and return the entity recognition result to the terminal.

12. A document entity recognition method, comprising: Acquire the image to be processed, wherein the image to be processed is collected from the document to be processed; The image to be processed is input into an entity recognition model to obtain the text entity recognition result of the image to be processed. The entity recognition model is pre-trained based on feature alignment between sample images and sample text. The sample text is identified from the sample images. The entity recognition model includes an entity location layer, an entity category layer, an entity relationship layer, a feature transformation layer, and a feature extraction layer. The image to be processed is input into the feature extraction layer to output target image features. The target image features are input into the entity location layer to obtain entity location information. The entity location information and the target image features are input into the feature transformation layer to determine entity center information and output target entity features. The target entity features are input into the entity category layer and the entity relationship layer respectively to output entity category and entity relationship.

13. A computing device, comprising: Memory and processor; The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 12.

14. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 12.