Document analysis method, apparatus, device, and storage medium

By performing multimodal fusion and serialization processing on textual and two-dimensional positional information in documents, the problem of low analysis efficiency for complex document layouts is solved, and efficient classification of long texts and multi-page documents is achieved.

CN115294594B · Active · Publication Date: 2026-06-12 · SHANGHAI SENSETIME INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI SENSETIME INTELLIGENT TECH CO LTD
Filing Date: 2022-08-16
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively analyze documents with complex layouts, especially multi-page contracts. Traditional methods are inefficient and have difficulty handling long texts and multi-page documents.

Method used

By performing multimodal information fusion processing on the textual information and two-dimensional positional information of the target text in the document to be analyzed, an analysis vector is obtained. Then, based on the two-dimensional positional information, serialization and classification processing are performed to obtain the category attribute of the text corresponding to each analysis vector.

Benefits of technology

It improves the accuracy and efficiency of classifying documents with complex reading sequences, can quickly process long texts and multi-page documents, and reduces hardware resource consumption and processing time.

Generated by Eureka AI based on patent content.

Abstract

Embodiments of the present application disclose a document analysis method, device and equipment and a storage medium. The method comprises: obtaining text information and two-dimensional position information of a target text in a document to be analyzed; performing multi-modal information fusion processing on the text information and the two-dimensional position information to obtain an analysis vector corresponding to the target text; performing serialization processing on the analysis vector according to the two-dimensional position information of all texts in the document to be analyzed to obtain a sequence of analysis vectors to be analyzed; performing classification processing on each analysis vector in the sequence of analysis vectors to be analyzed according to sequence position information of the analysis vector in the sequence of analysis vectors to be analyzed to obtain a category attribute of each analysis vector; and performing document analysis on the document to be analyzed according to the category attribute of the target text.

Description

Technical Field

[0001] This application relates to, but is not limited to, the field of computer technology, and in particular to a document analysis method, apparatus, device, and storage medium.

Background Technology

[0002] Analyzing document layout using natural language processing is currently the mainstream approach to document analysis. However, most natural language processing solutions are designed for plain text, and current document analysis technologies are limited to traditional hand-written, rule-based approaches, in which pattern information is manually summarized from key fields and rules are written by hand.

[0003] As a result, the technical solutions provided by the related technologies have difficulty analyzing documents with complex layouts.

Summary of the Invention

[0004] Based on the problems existing in related technologies, this application provides a document analysis method, apparatus, device, and storage medium.

[0005] The technical solution of this application embodiment is implemented as follows:

[0006] This application provides a document analysis method, the method comprising:

[0007] obtaining the text information and two-dimensional position information of the target text in the document to be analyzed;

[0008] performing multimodal information fusion processing on the text information and the two-dimensional position information to obtain the analysis vector corresponding to the target text;

[0009] serializing the corresponding analysis vectors, based on the two-dimensional position information corresponding to all the text in the document to be analyzed, to obtain the sequence of vectors to be analyzed;

[0010] classifying each analysis vector in the sequence of vectors to be analyzed, based on the sequence position information of the analysis vector in that sequence, to obtain the category attribute of the text corresponding to each analysis vector;

[0011] performing document analysis on the document to be analyzed based on the category attributes of the target text.

[0012] This application provides a document analysis device, the device comprising:

[0013] The acquisition module is used to acquire the text information and two-dimensional position information of the target text in the document to be analyzed;

[0014] A multimodal information fusion module is used to perform multimodal information fusion processing on the text information and the two-dimensional position information to obtain the analysis vector corresponding to the target text;

[0015] The serialization processing module is used to serialize the corresponding analysis vectors based on the two-dimensional position information corresponding to all the text in the document to be analyzed, so as to obtain a sequence of vectors to be analyzed;

[0016] The classification processing module is used to classify each analysis vector in the sequence of vectors to be analyzed according to the sequence position information of the analysis vector in the sequence of vectors to be analyzed, so as to obtain the category attribute of the text corresponding to each analysis vector;

[0017] The document analysis module is used to perform document analysis on the document to be analyzed based on the category attributes of the target text.

[0018] This application provides a document analysis device, including a processor and a memory. The memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the above-described document analysis method.

[0019] This application provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the above-described document analysis method.

[0020] This application provides a computer program product, which includes executable instructions stored in a computer-readable storage medium. When the processor of a document analysis device reads the executable instructions from the computer-readable storage medium and executes the executable instructions, the above-described document analysis method is implemented.

[0021] The document analysis method, apparatus, device, and storage medium provided in this application embodiment perform multimodal information fusion processing on the text information and two-dimensional position information of target text in the document to be analyzed, obtaining the analysis vector corresponding to the target text. Based on the two-dimensional position information corresponding to the text, the analysis vector is serialized to obtain a sequence of vectors to be analyzed. Based on the sequence position information of the analysis vectors in the sequence of vectors to be analyzed, each analysis vector is classified to obtain the category attribute of the text corresponding to each analysis vector, thereby realizing document analysis of the document to be analyzed. Thus, this application embodiment improves upon traditional natural language processing algorithms when classifying text in the document to be analyzed by fusing multimodal information such as text information and two-dimensional position information of the document to be analyzed, and uses the two-dimensional coordinate information of each target text in the document to be analyzed instead of the reading order information of the text. Because it uses the two-dimensional position information of the target text, this application embodiment can quickly classify documents with complex reading orders, improving the accuracy of text classification.

[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application.

Brief Description of the Drawings

[0023] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the technical solutions of this application.

[0024] Figure 1 is a schematic diagram illustrating an application scenario of the document analysis method provided in the embodiments of this application;

[0025] Figure 2 is a schematic diagram illustrating the implementation process of a document analysis method provided in an embodiment of this application;

[0026] Figure 3 is a schematic diagram illustrating the implementation process of a document analysis method provided in an embodiment of this application;

[0027] Figure 4 is a schematic diagram illustrating the implementation process of a document analysis method provided in an embodiment of this application;

[0028] Figure 5 is a schematic diagram of the sliding window solution provided in the embodiments of this application;

[0029] Figure 6 is a schematic diagram of a document analysis device provided in an embodiment of this application;

[0030] Figure 7 is a schematic diagram of the hardware entity of a document analysis device provided in an embodiment of this application.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0032] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0033] In the following description, the terms "first, second, third" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0034] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0035] Natural Language Processing (NLP) algorithms are still at an early stage, and using NLP for document layout analysis (e.g., of rich text documents, text documents, image documents, or mixed-type documents, each of which may include at least one paragraph of text) is gradually becoming the mainstream approach. However, most NLP solutions are designed for plain text documents and require encoding the text reading order (i.e., position embedding). For documents with complex layouts, the reading order is hard to obtain, which reduces classification accuracy. Furthermore, modeling multi-page contracts is a significant challenge in structured document analysis. Current technologies are limited to traditional hand-written, rule-based solutions, in which patterns in key fields are summarized manually and rules are written by hand. This approach is inefficient and ill-suited to complex contract layouts. In addition, encoding the text reading order limits the maximum text length of the document being analyzed, making it difficult to analyze long texts and multi-page documents.

[0036] To address the problems existing in related technologies, this application provides a document analysis method. The method performs multimodal information fusion processing on the text information and two-dimensional position information of the target text in the document to be analyzed, obtaining an analysis vector corresponding to the target text. Based on the two-dimensional position information of the text, the analysis vectors are serialized to obtain a sequence of vectors to be analyzed. Based on the sequence position information of the analysis vectors in that sequence, each analysis vector is classified to obtain the category attribute of the text corresponding to each analysis vector, thereby achieving document analysis of the document to be analyzed. Thus, when classifying text in the document to be analyzed, this application improves upon traditional natural language processing algorithms by fusing multimodal information such as the text information and two-dimensional position information, and uses the two-dimensional coordinate information of each target text in place of the reading order information of the text. Related text recognition methods rely on one-dimensional position information and can therefore only recognize straight, horizontal text; using the two-dimensional position information of the target text removes this limitation. For documents with complex reading orders (such as curved text, jumping text, or text in different sizes), this embodiment can accurately obtain the features of the text from its two-dimensional position information, thereby enabling rapid classification of such documents and improving the accuracy of text classification.

[0037] The document analysis method provided in this application can be executed by electronic devices such as document analysis equipment. These electronic devices can be various types of terminals, including laptops, tablets, desktop computers, set-top boxes, and mobile devices (e.g., mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable gaming devices), or they can be implemented as servers. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.

[0038] The following will describe an exemplary application of the document analysis device as a server. The technical solutions in the embodiments of this application will be clearly and completely described in conjunction with the accompanying drawings.

[0039] Figure 1 is a schematic diagram illustrating an application scenario of the document analysis method provided in this application embodiment. The document analysis system 10 provided in this application embodiment includes a terminal 100, a network 200, and a server 300. The network 200 can be a wide area network (WAN), a local area network (LAN), or a combination of both. The server 300 and the terminal 100 can be physically separate or integrated. When performing document analysis, the server 300 can use the method provided in this application embodiment as follows. It obtains, through the network 200, the text information and two-dimensional position information of each character in the document to be analyzed, and performs multimodal information fusion processing on them to obtain the analysis vector corresponding to each character. It then serializes the analysis vectors according to the two-dimensional position information of all characters in the document to obtain the analysis vector sequence. According to the sequence position information of each analysis vector in that sequence, it classifies each analysis vector to obtain the category attribute of the corresponding character. According to the category attribute of each character, the set of characters corresponding to each category attribute in the document to be analyzed can be determined; this set of characters is sent to the terminal 100 and displayed on the display interface 100-1 of the terminal 100.

[0040] Figure 2 is a schematic diagram illustrating the implementation flow of a document analysis method provided in an embodiment of this application. As shown in Figure 2, the method is implemented through steps S201 to S205:

[0041] Step S201: Obtain the text information and two-dimensional position information of the target text in the document to be analyzed.

[0042] In some embodiments, the document to be analyzed can be a rich text document, a text document, an image document, or a mixed-type document. Image documents can be files such as bmp, jpg, and png, while text documents can be documents such as xml, pdf, and doc. A rich text document refers to a document with a large amount of character information and includes multiple formats (such as font color, images, and tables). Examples include shopping receipts, emails, project proposals, calculation sheets, negotiation materials, contracts, organizational charts, and business plans. Its sources include, but are not limited to, web pages, portable document format (PDF), and scanned copies of paper documents.

[0043] In some embodiments, the target text can be every single character in the document to be analyzed, or it can be a portion of the text in the document to be analyzed. For example, when the document to be analyzed is a receipt, the target text can be the text of a paragraph, such as "potato chips" on a shopping receipt. In the following, some embodiments will be explained in detail with the target text being every single character in the document to be analyzed.

[0044] In this embodiment of the application, the document to be analyzed may be obtained by scanning a paper document with a scanner; or it may be obtained by recognizing scanned images of contracts or other contract images using online recognition software installed on, for example, a mobile phone, laptop or tablet. This embodiment of the application does not limit the source of the document acquisition.

[0045] In some embodiments, the document to be analyzed can be a short text document such as a shopping receipt, or a long text document with multiple pages such as a contract. For short text documents, Optical Character Recognition (OCR) technology can be used to extract information. During information extraction, the target text can correspond to a recognition box, and the two-dimensional position information of the target text can be determined by the position of the recognition box. For long text documents with multiple pages such as contracts, the processing time is long when recognizing long text across pages. This embodiment of the application can reduce the document processing time by segmenting the long text and recognizing the segmented text sequence.

[0046] Step S202: Perform multimodal information fusion processing on the text information and the two-dimensional position information to obtain the analysis vector corresponding to the target text.

[0047] In this application embodiment, multimodal information fusion processing refers to fusing information from different modalities such as text information, image information, or location information. By leveraging the complementarity of multimodal information, information with multimodal features is obtained. This can be achieved by representing the entity information as a vector through machine learning.

[0048] In some embodiments, multimodal information fusion processing may involve adding or multiplying multimodal vectors.
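The element-wise addition or multiplication mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation of this application: the function names `fuse_add` and `fuse_multiply`, the example values, and the assumption that the text vector and the position vector have already been mapped to the same length are all hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_add(text_vec: Vector, pos_vec: Vector) -> Vector:
    """Fuse two modality vectors by element-wise addition."""
    if len(text_vec) != len(pos_vec):
        raise ValueError("vectors must have the same length")
    return [t + p for t, p in zip(text_vec, pos_vec)]


def fuse_multiply(text_vec: Vector, pos_vec: Vector) -> Vector:
    """Fuse two modality vectors by element-wise multiplication."""
    if len(text_vec) != len(pos_vec):
        raise ValueError("vectors must have the same length")
    return [t * p for t, p in zip(text_vec, pos_vec)]


# Hypothetical vectors for a single character.
text_vec = [0.5, -1.0, 2.0]   # semantic (word vector) features
pos_vec = [1.0, 0.5, 0.25]    # features derived from the 2D position

added = fuse_add(text_vec, pos_vec)            # [1.5, -0.5, 2.25]
multiplied = fuse_multiply(text_vec, pos_vec)  # [0.5, -0.5, 0.5]
```

Either result would play the role of the analysis vector for that character in the later steps.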

[0049] This application embodiment can fuse the textual information and two-dimensional positional information of the target text using a neural network model or a sequence representation model. For example, the textual information and two-dimensional positional information of the target text can be mapped to a shared subspace. A shared semantic subspace is implemented in different hidden layers, and the semantics of the single-modal feature vectors corresponding to the transformed textual information and two-dimensional positional information are semantically combined to achieve multimodal fusion and obtain the analysis vector corresponding to the target text. Alternatively, attention vectors for each modality can be obtained, and then the weight distribution of the two attention vectors can be calculated using the hidden layer representation of the decoder. Finally, the two attention vectors are fused according to the weights to obtain the analysis vector corresponding to the target text.
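The attention-based alternative described above can be illustrated with a small sketch. It assumes that one attention vector per modality and one relevance score per modality are already available; in the text those scores would be derived from the hidden layer representation of the decoder, whereas here they are passed in as plain numbers. The use of a softmax to turn the two scores into weights, and the names `softmax` and `fuse_weighted`, are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_weighted(text_attn: Sequence[float], pos_attn: Sequence[float],
                  text_score: float, pos_score: float) -> List[float]:
    """Fuse two attention vectors as a weighted sum.

    text_score and pos_score are hypothetical stand-ins for the scores
    that a decoder hidden state would assign to each modality.
    """
    if len(text_attn) != len(pos_attn):
        raise ValueError("attention vectors must have the same length")
    w_text, w_pos = softmax([text_score, pos_score])
    return [w_text * t + w_pos * p for t, p in zip(text_attn, pos_attn)]


# With equal scores, both modalities receive weight 0.5.
fused = fuse_weighted([1.0, 3.0], [3.0, 1.0], 0.0, 0.0)  # [2.0, 2.0]
```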

[0050] In this embodiment of the application, each character in the text to be analyzed can be converted into a one-dimensional vector (i.e., word vector) by querying the word vector table. The one-dimensional vector of each character carries the semantic information of each character.
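A word vector table lookup of this kind can be sketched as a dictionary query. The table contents, the three-dimensional vectors, and the `<unk>` fallback entry for characters missing from the table are hypothetical; the text does not state how unknown characters are handled.

```python
from typing import Dict, List

# Hypothetical word vector table: each character maps to a 1D vector.
WORD_VECTORS: Dict[str, List[float]] = {
    "2": [0.1, 0.0, 0.9],
    "0": [0.2, 0.1, 0.8],
    "<unk>": [0.0, 0.0, 0.0],  # assumed fallback for unknown characters
}


def lookup(chars: str) -> List[List[float]]:
    """Convert each character into its word vector by querying the table."""
    unk = WORD_VECTORS["<unk>"]
    return [WORD_VECTORS.get(ch, unk) for ch in chars]


vectors = lookup("20?")
# [[0.1, 0.0, 0.9], [0.2, 0.1, 0.8], [0.0, 0.0, 0.0]]
```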

[0051] Step S203: Based on the two-dimensional position information corresponding to all the text in the document to be analyzed, the corresponding analysis vectors are serialized to obtain the sequence of vectors to be analyzed.

[0052] In some embodiments, serialization processing refers to determining the reading order of the document based on the two-dimensional position information of each character, and arranging all analysis vectors in that reading order to obtain the sequence of vectors to be analyzed.
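The text does not specify how the reading order is derived from the two-dimensional positions. As one possible illustration, the sketch below assumes a simple top-to-bottom, left-to-right heuristic: boxes whose top edges lie within a tolerance of each other are treated as one row, and each row is read from left to right. The names `reading_order` and `serialize` and the `row_tolerance` parameter are hypothetical, and this heuristic would not be sufficient for layouts such as curved text.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), top-left and bottom-right


def reading_order(boxes: Sequence[Box], row_tolerance: float = 5.0) -> List[int]:
    """Return box indices ordered top-to-bottom, then left-to-right."""
    by_top = sorted(range(len(boxes)), key=lambda i: boxes[i][1])
    rows: List[List[int]] = []
    for i in by_top:
        # Join the current row if the top edge is close to the row's first box.
        if rows and abs(boxes[i][1] - boxes[rows[-1][0]][1]) <= row_tolerance:
            rows[-1].append(i)
        else:
            rows.append([i])
    order: List[int] = []
    for row in rows:
        order.extend(sorted(row, key=lambda i: boxes[i][0]))
    return order


def serialize(vectors: Sequence, boxes: Sequence[Box],
              row_tolerance: float = 5.0) -> List:
    """Arrange analysis vectors in the derived reading order."""
    return [vectors[i] for i in reading_order(boxes, row_tolerance)]


boxes = [(50, 10, 60, 20), (10, 12, 20, 22), (10, 40, 20, 50)]
order = reading_order(boxes)  # [1, 0, 2]
```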

[0053] In some embodiments, the document can be segmented into multiple segmented images, and the feature vector of each segmented image can be determined. The feature vector of each image can be concatenated with the analysis vectors corresponding to all the text in the reading order, and the resulting sequence of vectors to be analyzed also contains the image features of the document. Thus, the embodiments of this application not only provide the ability to understand the semantic context of text, but also enhance the association between visual and linguistic modalities by utilizing the corresponding two-dimensional positional information.

[0054] Step S204: Based on the sequence position information of the analysis vector in the sequence of vectors to be analyzed, classify each analysis vector in the sequence of vectors to be analyzed to obtain the category attribute of the text corresponding to each analysis vector.

[0055] In this embodiment, each analysis vector in the sequence of vectors to be analyzed can be classified using a pre-trained classification model to obtain the category attribute corresponding to the target text. Here, the category attribute refers to the semantic category attribute of the text, such as a date attribute, a number attribute, or a tag attribute. For example, the category attribute corresponding to each character in "July 15, 2022" on a shopping receipt could be a date attribute, and the category attribute corresponding to each character in the product name could be a tag attribute.

[0056] Step S205: Perform document analysis on the document to be analyzed based on the category attribute of the target text.

[0057] In some embodiments, document analysis of the document to be analyzed may refer to grouping texts with the same category attributes together, sorting all texts corresponding to each category attribute according to the two-dimensional position information of the target text, obtaining a text category sequence corresponding to each category attribute, and obtaining key information corresponding to each category attribute in the document to be analyzed through the text category sequence.

[0058] In some embodiments, step S205 can be implemented by steps S2051 to S2053:

[0059] Step S2051: Based on the category attributes of the target text, classify all the text in the document to be analyzed to obtain the text set corresponding to each category attribute.

[0060] Step S2052: Based on the two-dimensional position information of the target text, sort the texts in the text set to obtain the text category sequence corresponding to each category attribute.

[0061] Step S2053: Determine the information corresponding to each category attribute in the document to be analyzed based on the text category sequence.
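Steps S2051 to S2053 can be sketched as a group, sort, and join over classified characters. The sketch assumes each character arrives with its category attribute and a single (x, y) position, and that sorting within a category is top-to-bottom and then left-to-right; the `Item` layout and the function name `extract_by_category` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (character, category attribute, (x, y) position) -- hypothetical layout
Item = Tuple[str, str, Tuple[float, float]]


def extract_by_category(items: List[Item]) -> Dict[str, str]:
    """Group characters by category, sort by position, and join them."""
    groups: Dict[str, List[Tuple[float, float, str]]] = defaultdict(list)
    for char, category, (x, y) in items:
        groups[category].append((y, x, char))          # S2051: text set per category
    result: Dict[str, str] = {}
    for category, members in groups.items():
        members.sort(key=lambda m: (m[0], m[1]))       # S2052: order by y, then x
        result[category] = "".join(m[2] for m in members)  # S2053: key information
    return result


items = [
    ("5", "date", (30, 10)),
    ("1", "date", (20, 10)),
    ("A", "tag", (10, 40)),
    ("B", "tag", (20, 40)),
]
info = extract_by_category(items)  # {"date": "15", "tag": "AB"}
```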

[0062] By fusing multimodal information from the textual and two-dimensional positional information of the target text in the document to be analyzed, an analysis vector corresponding to the target text is obtained. Based on the two-dimensional positional information of the text, the analysis vector is serialized to obtain a sequence of vectors to be analyzed. Based on the sequence position information of the analysis vectors in the sequence of vectors to be analyzed, each analysis vector is classified to obtain the category attribute of the text corresponding to each analysis vector, thus achieving document analysis of the document to be analyzed. Therefore, this embodiment improves upon traditional natural language processing algorithms when classifying text in a document by fusing multimodal information such as textual and two-dimensional positional information from the document to be analyzed, and uses the two-dimensional coordinate information of each text in the document to replace the reading order information of the text for analysis. Because it uses the two-dimensional positional information of the text, this embodiment can quickly classify documents with complex reading orders, improving the accuracy of text classification.

[0063] In some embodiments, the document to be analyzed can be a short text document or a long text document with multiple pages. For short text documents, text recognition can be performed directly. For long text documents, it is necessary to segment the long text to obtain multiple text sequences, and then recognize the multiple text sequences to obtain the text information and two-dimensional position information of each character. Based on the above embodiments, Figure 3 is a schematic diagram illustrating the implementation flow of a document analysis method provided in an embodiment of this application. As shown in Figure 3, step S201 can be achieved through steps S301 to S304:

[0064] Step S301: In response to the number of characters in a first text paragraph in the document to be analyzed being less than a first preset number, perform text recognition processing on the first text paragraph to obtain the text information and recognition box corresponding to the target characters in the first text paragraph.

[0065] In some embodiments, the first preset number can be set according to requirements. The value of the preset number can be set based on the device's processing power, video memory capacity, and processor speed. For example, when the device's video memory capacity is 1 terabyte (TB), the first preset number can be 3000; when the device's video memory capacity is 8 gigabytes (GB), the first preset number can be 500. When the first preset number is 3000, paragraphs in the document to be analyzed with fewer than 3000 characters are considered first text paragraphs, i.e., short text, such as dates and company names in a contract; paragraphs in the document to be analyzed with more than 3000 characters are considered second text paragraphs, i.e., long text, such as the main body of a contract. The number of characters in a paragraph can be determined using character recognition technology or other text recognition technologies.

[0066] Step S302: Determine the two-dimensional position information corresponding to the target text based on the position of each recognition box on the document to be analyzed.

[0067] In some embodiments, when a paragraph is identified as short text, each character can be identified using OCR technology, and the corresponding recognition box for each character can be determined. The position information of the recognition box can be used as the two-dimensional position information of each character. For example, the x and y coordinates of the top left and bottom right corners of the recognition box can be used as the two-dimensional position information of each character, such as [13, 25, 40, 50], where the top left corner of the recognition box is [13, 25] and the bottom right corner is [40, 50]. Alternatively, the x and y coordinates of the center position of the recognition box can be used as the two-dimensional position information of each character, such as [17, 36].
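Both position representations can be derived from a recognition box. The sketch below assumes a box given as top-left and bottom-right corner coordinates and computes its center; the function name `box_center` is hypothetical. The box values reuse the illustrative corner coordinates from the text, while the [17, 36] center mentioned there is a separate illustrative value.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a recognition box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


corner_form = (13, 25, 40, 50)          # top-left [13, 25], bottom-right [40, 50]
center_form = box_center(corner_form)   # (26.5, 37.5)
```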

[0068] Step S303: In response to the number of characters in a second text paragraph in the document to be analyzed being greater than a second preset number, cut the second text paragraph, starting from its beginning, using a sliding window of a second length with a first length as the sliding step size, to obtain at least two text sequences; the first length is less than or equal to the second length.

[0069] In some embodiments, the method for setting the second preset number may be the same as or different from the method for setting the first preset number, and the values of the second preset number and the first preset number may be the same or different.
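The choice between direct recognition (step S301) and sliding-window segmentation (step S303) can be sketched as a threshold check. The sketch assumes the first and second preset numbers are set to the same value and that a paragraph whose length exactly equals that value is segmented; the text leaves both points open, and the function name `route_paragraph` and its return labels are hypothetical.

```python
def route_paragraph(paragraph: str, preset: int) -> str:
    """Pick a processing path from the character count of a paragraph.

    Assumes a single shared preset number; equality is routed to the
    sliding-window path, which is an assumption of this sketch.
    """
    if len(paragraph) < preset:
        return "recognize_directly"   # short text, step S301
    return "sliding_window"           # long text, step S303


short_path = route_paragraph("2022-07-15", 3000)   # "recognize_directly"
long_path = route_paragraph("x" * 5000, 3000)      # "sliding_window"
```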

[0070] When determining the category attribute of a text, it is only necessary to look at the text adjacent to it to determine the category attribute. It is not necessary to use other texts that are far away to assist in the determination. Furthermore, long texts will consume more hardware memory and take longer to process. For example, it takes 100 seconds to process a paragraph of 5,000 words at a time. However, if the paragraph of 5,000 words is divided into multiple sub-paragraphs of 50 words each, it only takes 20 seconds to process multiple sub-paragraphs at the same time.

[0071] Therefore, in this embodiment, a super-long text can be obtained first. For example, the text in a multi-page document can be spliced together, from top to bottom and from left to right within each page, to obtain the super-long text. Then, a sliding window with the second length can be used to cut the super-long text, with the first length as the sliding step, to obtain at least two text sequences.
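The sliding-window cut can be sketched as follows, with `step` standing for the first length and `window_len` for the second length. The function name and the handling of the final, possibly shorter, sequence are assumptions made for illustration. A text no longer than the window is returned as a single sequence, although the method applies the cut only to paragraphs above the preset number.

```python
from typing import List


def sliding_windows(text: str, window_len: int, step: int) -> List[str]:
    """Cut text into sequences using a window of window_len moved by step."""
    if not 0 < step <= window_len:
        raise ValueError("step must satisfy 0 < step <= window_len")
    if len(text) <= window_len:
        return [text]
    sequences: List[str] = []
    start = 0
    while True:
        sequences.append(text[start:start + window_len])
        if start + window_len >= len(text):  # window reached the end of the text
            break
        start += step
    return sequences


# step == window_len: no overlap between adjacent sequences
no_overlap = sliding_windows("abcdefghij", 4, 4)   # ["abcd", "efgh", "ij"]
# step < window_len: adjacent sequences share window_len - step characters
overlapped = sliding_windows("abcdefghij", 4, 2)   # ["abcd", "cdef", "efgh", "ghij"]
```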

[0072] Step S304: Perform text recognition processing on the at least two text sequences to determine the text information and the recognition box of the target text in the at least two text sequences.

[0073] In some embodiments, when the first length is equal to the second length, there are no overlapping characters between two adjacent text sequences. Text recognition processing is directly performed on at least two text sequences to determine the text information and recognition box of the target characters in at least two text sequences.

[0074] In some embodiments, when the first length is less than the second length, cutting the text segment with a sliding window of the second length produces overlapping characters between two adjacent text sequences, because the sliding step is smaller than the window length.

[0075] In this embodiment of the application, after obtaining at least two text sequences, it is necessary to remove duplicate text to avoid introducing erroneous information during recognition and affecting the accuracy of recognition.

[0076] In some embodiments, for two adjacent text sequences with overlapping characters, the overlapping characters can be removed from either one of the two text sequences to obtain a cleared text sequence. For example, if the two adjacent text sequences are "multimodal solution intelligence" and "intelligent contract document analysis", the overlapping word "intelligent" ("intelligence") can be removed from either sequence. This application only provides exemplary embodiments for removing overlapping characters, and does not limit the method used to remove them.

[0077] In some embodiments, after obtaining the cleared text sequence, text recognition processing can be performed on the cleared text sequence to determine the text information and the recognition box of the target text in the cleared text sequence.

[0078] In some embodiments, for any two adjacent character sequences with overlapping characters, removing overlapping characters can also be achieved through steps S1 to S5:

[0079] Step S1: For any two adjacent text sequences, determine the number of overlapping characters in the first text sequence or the second text sequence.

[0080] In some embodiments, any two adjacent text sequences may include a first text sequence and a second text sequence, wherein the text sequence closer to the beginning of the document may be the first text sequence. When two adjacent text sequences have overlapping characters, the number of overlapping characters in the first text sequence or the second text sequence is determined. For example, if the first text sequence is "multimodal solution intelligence" and the second text sequence is "intelligent contract document analysis", the overlapping text is "intelligent", which consists of two characters, so the number of overlapping characters is 2.

[0081] Step S2: Based on the number of overlaps, divide the overlapping characters into first overlapping characters and second overlapping characters.

[0082] In some embodiments, when the number of overlaps is even, the overlapping text can be divided into first overlapping text and second overlapping text with the same number of characters. For example, when the overlapping text is "smart contract", the number of overlaps is 4, and it can be evenly divided into "smart" and "contract". When the number of overlaps is odd, the overlapping text can be divided as evenly as possible. For example, when the overlapping text is "using a deep learning framework to solve the problem of multi-page contract document analysis", the number of overlaps is 21, and the overlapping text can be divided into a leading part, "using a deep learning framework to solve the problem of multi-", and a trailing part, "page contract document analysis".

[0083] Step S3: In response to the distance between the first overlapping character and the sequence center of the first character sequence being less than the distance between the first overlapping character and the sequence center of the second character sequence, remove the second overlapping character from the first character sequence.

[0084] Step S4: In response to the distance between the second overlapping character and the sequence center of the second character sequence being less than the distance between the second overlapping character and the sequence center of the first character sequence, remove the first overlapping character from the second character sequence.

[0085] In some embodiments, the characters to be removed can be determined based on the distance between the overlapping characters and the sequence centers of the first and second character sequences. When the distance between the first overlapping character and the sequence center of the first character sequence is less than the distance between the first overlapping character and the sequence center of the second character sequence, the second overlapping character is removed from the first character sequence. That is, among the overlapping characters, each sequence retains those closer to its own center. For example, in two adjacent character sequences, "Multimodal Solution Intelligence" and "Intelligent Contract Document Analysis", the text "Intelligent" is the overlap and consists of two characters. In the character sequence "Multimodal Solution Intelligence", the first of these two characters is closer to the center of the sequence, so the first character is retained and the second is removed. Similarly, in the character sequence "Intelligent Contract Document Analysis", the second of the two characters is closer to the center of the sequence, so the second character is retained and the first is removed.

[0086] Step S5: Determine all the first and second character sequences after clearing as the cleared character sequences.

[0087] In this embodiment, the different sliding windows are equal-length sliding windows, and the first length of the sliding step can be much smaller than the second length of the sliding window, so that there are more overlapping characters in the adjacent text sequences. For example, the adjacent text sequences after long text is cut can be "ABCDEFGHIJKLMNOP" and "IJKLMNOPQRSTUVWX", with the overlapping character being "IJKLMNOP". "IJKL" is closer to the first text sequence, and "MNOP" is closer to the second text sequence. Therefore, the first text sequence retains "IJKL", and the second text sequence retains "MNOP". The text sequences after removing the overlapping characters are "ABCDEFGHIJKL" and "MNOPQRSTUVWX".
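Steps S1 to S5 can be sketched as follows for one pair of adjacent sequences. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the overlap is taken to be the longest suffix of the first sequence that equals a prefix of the second, and for an odd overlap the extra character is kept by the first sequence. Because the overlap sits at the tail of the first sequence and the head of the second, the leading half of the overlap is the one nearer the first sequence's center, so the explicit distance comparison of steps S3 and S4 reduces to a split at the midpoint.

```python
def overlap_len(first: str, second: str) -> int:
    """Step S1: length of the longest suffix of `first` that is a prefix of `second`."""
    for n in range(min(len(first), len(second)), 0, -1):
        if first.endswith(second[:n]):
            return n
    return 0


def remove_overlap(first: str, second: str) -> tuple[str, str]:
    """Steps S2 to S5: each sequence keeps the overlap half nearer its own center."""
    n = overlap_len(first, second)
    # Step S2: split the overlap as evenly as possible.
    # Assumption: with an odd overlap, the first sequence keeps the extra character.
    keep_in_first = (n + 1) // 2
    drop_from_first = n - keep_in_first
    # Step S3: remove the second overlapping part from the first sequence.
    cleared_first = first[:len(first) - drop_from_first]
    # Step S4: remove the first overlapping part from the second sequence.
    cleared_second = second[keep_in_first:]
    # Step S5: the cleared pair.
    return cleared_first, cleared_second


if __name__ == "__main__":
    print(remove_overlap("ABCDEFGHIJKLMNOP", "IJKLMNOPQRSTUVWX"))
```

Run on the sequences from paragraph [0087], the sketch keeps "IJKL" in the first sequence and "MNOP" in the second, so concatenating the cleared sequences restores the original text without duplicates.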

[0088] In this embodiment of the application, after removing overlapping characters, text recognition processing can be performed on the cleared text sequence to determine the text information and recognition box of each character in the cleared text sequence. Based on the position of each recognition box on the document to be analyzed, the two-dimensional position information corresponding to each character is determined.

[0089] This application embodiment classifies paragraphs in the document to be analyzed and segments paragraphs with more than a preset number of characters. This allows the application embodiment to avoid consuming too much hardware memory when processing long texts, thereby reducing the computational load on the server, reducing processing time, and improving document processing efficiency.

[0090] In this embodiment, the one-dimensional reading order encoding is removed, the coordinates of the text are normalized, and a relative two-dimensional position encoding technique is used to determine the position and reading order of text in a multi-page document. This solves the problem of difficulty in obtaining the reading order of a multi-page document. Based on the foregoing embodiments, the document analysis method provided in this embodiment further includes steps S11 and S12.

[0091] Step S11: Normalize the two-dimensional position information of each target character to obtain normalized two-dimensional position information.

[0092] In some embodiments, normalizing the two-dimensional position information of each target character can refer to normalizing the x and y coordinate values of each target character to the range 1 to 1000. This application embodiment places no specific limitation on the coordinate normalization values.

[0093] Step S12: According to the page number order of the multi-page document, add weight information to the first direction position information in the normalized two-dimensional position information corresponding to the target text in each page of the document, so as to obtain the first direction position information corresponding to each target text in the multi-page document.

[0094] In some embodiments, the two-dimensional position information of each target character includes at least first direction position information, which may be the y-axis coordinate of each character.

[0095] In this embodiment, the weight information can be numerical values corresponding to different page numbers in a multi-page document. Adding weight information means adding, to the y-axis coordinate of each target character, a value determined by the page on which the character appears. The value after adding the weight information becomes the y-axis coordinate of that character, and from it the position of the character in the multi-page document can be determined. For example, if the y-axis coordinate of the first character on the first page of the document is 1, adding a weight of 1000 to the y-axis coordinates of the characters on the second page makes the y-axis coordinate of the first character on the second page 1001. Adding a weight of 2000 to the y-axis coordinates of the characters on the third page makes the y-axis coordinate of the first character on the third page 2001. By adding weight information to the y-axis coordinate of each character sequentially according to the page number order of the multi-page document, the first direction position information corresponding to each character in the multi-page document is obtained.
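Steps S11 and S12 can be sketched as follows. The function names, the rounding and clamping used for normalization, and the zero-based `page_index` are assumptions made for illustration; the source only fixes the 1 to 1000 range as one example and the per-page weights of 1000, 2000, and so on.

```python
def normalize(value: float, page_extent: float, scale: int = 1000) -> int:
    """Step S11: map a coordinate in [0, page_extent] to an integer in [1, scale]."""
    scaled = round(value / page_extent * scale)
    return min(max(scaled, 1), scale)


def global_position(x: float, y: float, page_width: float, page_height: float,
                    page_index: int, scale: int = 1000) -> tuple[int, int]:
    """Step S12: add a per-page weight to the normalized y coordinate.

    `page_index` is zero-based, so the first page gets no offset, the second
    page gets `scale`, the third page gets 2 * scale, and so on.
    """
    nx = normalize(x, page_width, scale)
    ny = normalize(y, page_height, scale) + scale * page_index
    return nx, ny


if __name__ == "__main__":
    # The same top-left character on pages 1, 2 and 3 of an 800 x 1200 page.
    for page_index in range(3):
        print(global_position(0, 0, 800, 1200, page_index))
```

Only the y coordinate carries the page weight, so characters on later pages always sort after those on earlier pages while their x coordinates stay comparable across pages.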

[0096] This application embodiment adds an offset to the vertical coordinate of each character in a multi-page document and uses relative two-dimensional position encoding to distinguish characters from different page numbers. This enables the document analysis method provided by this application embodiment to extract key information from multi-page documents and improves the versatility of the document analysis method.

[0097] In the embodiments of this application, multimodal information fusion processing can be achieved through a multimodal fusion neural network. Figure 4 is a schematic diagram illustrating the implementation flow of a document analysis method provided in an embodiment of this application. As shown in Figure 4, step S202 is achieved through steps S401 to S403.

[0098] Step S401: Extract text features from the text information corresponding to the target text to obtain a text feature vector.

[0099] Step S402: Extract position features from the two-dimensional position information corresponding to the target text to obtain a two-dimensional position feature vector.

[0100] In some embodiments, feature extraction layers in a multimodal fusion neural network can be used to extract features from textual information and two-dimensional positional information to obtain textual feature vectors and two-dimensional feature vectors.

[0101] In this embodiment of the application, since the two-dimensional position information of each character is introduced in the model attention encoding stage, the T5 (Text-to-Text Transfer Transformer) model can be used to obtain an attention matrix for the x coordinates and an attention matrix for the y coordinates of the characters, and the two attention matrices are then added to obtain the two-dimensional position feature vector.

[0102] In some embodiments, the two-dimensional position information includes at least first-direction position information and second-direction position information, namely y-coordinate information and x-coordinate information. Therefore, step S402 can be implemented by steps S4021 to S4023:

[0103] Step S4021: Encode the first direction position information and the second direction position information corresponding to the target text respectively to obtain the first direction attention matrix and the second direction attention matrix.

[0104] Step S4022: Superimpose the first directional attention matrix and the second directional attention matrix to obtain the two-dimensional position matrix corresponding to the target text.

[0105] Step S4023: Extract features from the two-dimensional position matrix to obtain the two-dimensional position feature vector.

[0106] Here, the first and second directional position information corresponding to the target text can be introduced into the Spatial-Aware Self-Attention Mechanism using the T5 model. This allows the Spatial-Aware Self-Attention Mechanism to perceive two-dimensional spatial distance information. The Spatial-Aware Self-Attention Mechanism encodes and models the first and second directional position information corresponding to the target text using the two-dimensional spatial distance information, obtaining a first-directional attention matrix and a second-directional attention matrix. Adding the first and second directional attention matrices yields a two-dimensional position matrix for each character. Feature extraction from the two-dimensional position matrix yields a two-dimensional position feature vector.
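Steps S4021 and S4022 can be illustrated with a small sketch of a relative position bias in each direction. This is a loose simplification rather than the model described here: T5 uses learned, per-head biases over log-spaced distance buckets, whereas the sketch below uses a single hand-supplied lookup table with linear clipping. The names `relative_bias`, `position_matrix_2d`, `x_table`, `y_table`, and `max_distance` are all hypothetical.

```python
def relative_bias(coords: list[int], table: list[float],
                  max_distance: int) -> list[list[float]]:
    """Pairwise bias matrix whose entry [i][j] depends only on coords[j] - coords[i].

    `table` must hold 2 * max_distance + 1 values, one per clipped offset.
    """
    def bucket(delta: int) -> int:
        clipped = max(-max_distance, min(max_distance, delta))
        return clipped + max_distance  # index in [0, 2 * max_distance]

    return [[table[bucket(cj - ci)] for cj in coords] for ci in coords]


def position_matrix_2d(xs: list[int], ys: list[int],
                       x_table: list[float], y_table: list[float],
                       max_distance: int) -> list[list[float]]:
    """Add the x-direction and y-direction matrices element-wise (step S4022)."""
    bias_x = relative_bias(xs, x_table, max_distance)
    bias_y = relative_bias(ys, y_table, max_distance)
    return [[a + b for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(bias_x, bias_y)]


if __name__ == "__main__":
    x_table = [-2.0, -1.0, 0.0, 1.0, 2.0]
    y_table = [0.5] * 5
    print(position_matrix_2d([0, 10], [5, 5], x_table, y_table, max_distance=2))
```

Because each entry depends only on coordinate differences, shifting every y coordinate by the same page offset leaves the matrix unchanged, which is why a relative encoding combines naturally with per-page offsets.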

[0107] Step S403: Perform multimodal feature fusion processing on the text feature vector and the two-dimensional position feature vector to obtain the analysis vector corresponding to the target text.

[0108] In this embodiment of the application, a multimodal feature fusion processing can be performed on the text feature vector and the two-dimensional position feature vector through a multimodal fusion neural network to obtain the analysis vector corresponding to the target text.

[0109] In some embodiments, after obtaining the analysis vector corresponding to the target text, the corresponding analysis vectors can be sorted according to the two-dimensional position information of all texts in the document to be analyzed, resulting in a sequence of analysis vectors. Then, based on the sequence position information of each analysis vector in the sequence of analysis vectors, each analysis vector in the sequence is classified sequentially using a fully connected layer or other classification model to obtain a classification sequence. This classification sequence may include the score of each attribute category corresponding to the target text. Finally, the classification sequence can be normalized using a softmax function to obtain the probability of each attribute category corresponding to the target text. The category attribute with the highest probability is determined as the category attribute corresponding to that text, thus obtaining the category attribute of the text corresponding to each analysis vector.
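The serialization, scoring, softmax normalization, and highest-probability selection described above can be sketched as follows. This is a toy illustration: a single linear layer with hand-set `weights` and `bias` stands in for the fully connected layer, each vector is scored independently of its neighbors, and the labels and vectors are invented for the example.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sequence(items, weights, bias, labels):
    """items: list of (x, y, vector). Returns [(label, probability)] in reading order."""
    # Serialization: order top-to-bottom, then left-to-right.
    ordered = sorted(items, key=lambda item: (item[1], item[0]))
    results = []
    for _, _, vector in ordered:
        # One score per category from a linear (fully connected) layer.
        scores = [sum(w * v for w, v in zip(row, vector)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


if __name__ == "__main__":
    items = [(10, 20, [0.0, 1.0]), (5, 10, [1.0, 0.0])]
    weights = [[2.0, 0.0], [0.0, 2.0]]
    print(classify_sequence(items, weights, [0.0, 0.0], ["party", "amount"]))
```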

[0110] In this embodiment, the two-dimensional position information of each target text is introduced in the model attention encoding stage. This allows the embodiment to process documents without being limited by the length of the text input or the difficulty in obtaining the document reading order. It can also process multi-page documents, effectively improving server processing efficiency.

[0111] In some embodiments, when classifying text, image information of the document can be incorporated. This can be achieved by segmenting the image, mapping the segments to high-dimensional features, and overlaying them along the text sequence direction, or by overlaying the image information of each character with its feature vector. Therefore, embodiments of this application may further include steps S21 to S22.

[0112] Step S21: Perform image segmentation processing on the document to be analyzed to obtain at least two segmented images.

[0113] In some embodiments, image segmentation can involve dividing the image into equal parts to obtain at least two segmented images. For example, dividing the image into four or nine equal parts results in four or nine sub-images of the same area. Here, some appearance features can be captured from the image, such as font orientation, type, color, and other information.

[0114] Step S22: Extract image features from the at least two segmented images to obtain at least two image feature vectors.

[0115] Here, image feature extraction on a segmented image can be performed by extracting the layout information of the text in the segmented image, or the image information corresponding to each text, to obtain at least two image feature vectors.

[0116] Correspondingly, step S203 can be achieved through steps S2031 to S2032:

[0117] Step S2031: Based on the two-dimensional position information corresponding to all the text in the document to be analyzed, the corresponding analysis vectors are serialized to obtain an initial vector sequence.

[0118] Step S2032: Add the at least two image feature vectors to the initial vector sequence to obtain the vector sequence to be analyzed.

[0119] In some embodiments, the order of each character in the document can be determined based on the two-dimensional position information corresponding to each character. The analysis vectors corresponding to each character can be sorted according to the two-dimensional position information to obtain the analysis vector sequence corresponding to all characters in the document to be analyzed. The image feature vectors of at least two segmented images corresponding to the document to be analyzed can be added to the analysis vector sequence to obtain the analysis sequence. Then, the analysis sequence is classified to obtain the category attribute corresponding to each character.
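Steps S21, S2031 and S2032 can be sketched as follows. The sketch is illustrative only: `grid_boxes` and `build_sequence` are hypothetical names, image feature extraction itself is left out (the image vectors are passed in ready-made), and appending the image vectors at the end of the sequence is an assumption, since the source only says they are added to the initial vector sequence.

```python
def grid_boxes(width: int, height: int, rows: int, cols: int):
    """Step S21: boxes (left, top, right, bottom) for an equal rows x cols split."""
    return [(c * width // cols, r * height // rows,
             (c + 1) * width // cols, (r + 1) * height // rows)
            for r in range(rows) for c in range(cols)]


def build_sequence(text_items, image_vectors):
    """Steps S2031 and S2032.

    text_items: list of (x, y, analysis_vector).
    image_vectors: one feature vector per segmented image.
    """
    # S2031: serialize analysis vectors top-to-bottom, then left-to-right.
    ordered = sorted(text_items, key=lambda item: (item[1], item[0]))
    initial_sequence = [vector for _, _, vector in ordered]
    # S2032: add the image feature vectors (assumption: appended at the end).
    return initial_sequence + list(image_vectors)


if __name__ == "__main__":
    print(grid_boxes(100, 100, 2, 2))
    print(build_sequence([(5, 20, [1.0]), (1, 10, [2.0])], [[9.0], [8.0]]))
```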

[0120] In this embodiment, the acquired two-dimensional coordinate information is position-encoded, and the text information is text-encoded. The position-encoded information, text-encoded information, and image features of the document are fused together, and a multimodal fusion network is used to classify the text. By fusing the multimodal information of text, position, and image, the accuracy of document classification is improved.

[0121] In some embodiments, the document to be analyzed may be a multi-page contract document, and the two-dimensional position information of the target text in the multi-page contract document includes at least first-direction position information. The document analysis method provided in this application embodiment can also normalize the two-dimensional position information of the target text in the multi-page contract document to obtain normalized two-dimensional position information. Based on the page number order of the multi-page contract document, weight information is sequentially added to the first-direction position information in the normalized two-dimensional position information corresponding to the target text in each page of the contract document to obtain the first-direction position information corresponding to the target text in the multi-page contract document. Multimodal information fusion processing is then performed on the text information and the two-dimensional position information including the first-direction position information to obtain the analysis vector corresponding to the target text.

[0122] This application provides another example of a document analysis method applied in a real-world scenario.

[0123] In some embodiments, document analysis methods can extract key information from documents such as contracts. First, the document is scanned into an image, and an optical character recognition (OCR) scheme is used to obtain the text information and text locations within the document. Second, a multimodal scheme is used to fuse and analyze the obtained text information and text locations to obtain the category attribute to which each piece of text belongs. Finally, texts with the same category attribute are collected to obtain a set of texts for each category attribute, and the document analysis results are output.
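The final grouping step can be sketched as follows. The token dictionaries, their keys, and the category names are invented for illustration; in practice they would come from the OCR and classification stages.

```python
from collections import defaultdict


def group_by_category(tokens, sep: str = " "):
    """Collect texts per category attribute, ordered by position.

    tokens: list of dicts with keys "text", "x", "y" and "category".
    """
    groups = defaultdict(list)
    # Sort top-to-bottom, then left-to-right, so each group is in reading order.
    for token in sorted(tokens, key=lambda t: (t["y"], t["x"])):
        groups[token["category"]].append(token["text"])
    return {category: sep.join(parts) for category, parts in groups.items()}


if __name__ == "__main__":
    tokens = [
        {"text": "Ltd", "x": 60, "y": 10, "category": "party"},
        {"text": "Acme", "x": 10, "y": 10, "category": "party"},
        {"text": "5000", "x": 10, "y": 40, "category": "amount"},
    ]
    print(group_by_category(tokens))
```

For text without word spacing, such as Chinese, `sep=""` would be the natural choice.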

[0124] The multimodal fusion model provided in this application improves on the BERT model by removing the one-dimensional reading-order encoding (position embedding), fusing in the two-dimensional coordinate information of the text (i.e., the two-dimensional position information), and normalizing the text coordinates and the text length and width to a scale of 1000 within each page. In this way, the model is limited neither by the length of the text input nor by the difficulty of obtaining the reading order.

[0125] In this embodiment, the two-dimensional position information of each character is introduced in the model attention encoding stage. Therefore, the T5 (Text-to-Text Transfer Transformer) model can be used to obtain an attention matrix for the x coordinates and an attention matrix for the y coordinates of the characters, and the two attention matrices are then added together to obtain the two-dimensional position feature vector.

[0126] In this embodiment, the text coordinates obtained on each page are normalized to a scale of 1000 to give the coordinates of each text within its page. For multi-page text, a bias of 1000*n is added to the vertical coordinate (i.e., the y-coordinate) of each text on the nth page after the first page, to distinguish text on different pages. Since this embodiment uses relative position encoding for coordinates, it can perform attention analysis on multi-page text.

[0127] In this embodiment, sequences of unbounded length can in principle be processed. However, to save memory, and because the type of a text can often be determined from neighboring texts without help from distant ones, the extremely long text is first obtained from the multi-page document by splicing its text from top to bottom and left to right within each page. A sliding window scheme then segments the extremely long text into intersecting windows; the text in each window is analyzed, and the results are merged. A text in the original sentence may appear in multiple windows; in that case, the occurrence closest to the middle of its window is selected. In post-processing, adjacent texts with the same category attribute are output as a single text. Finally, the model uses all of the above to reconstruct the sentence for subsequent analysis.
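The merging and post-processing described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the per-window labels are supplied by hand rather than by a model, every position is assumed to be covered by at least one window, and a tie between two windows is resolved in favor of the earlier one.

```python
from itertools import groupby


def merge_window_labels(total_len: int, step: int, window_labels):
    """window_labels[k][i] is the label predicted in window k for character k * step + i.

    For a character covered by several windows, keep the prediction from the
    window in which the character lies closest to the window's middle.
    """
    merged = []
    for pos in range(total_len):
        best = None  # (distance to window middle, label)
        for k, labels in enumerate(window_labels):
            offset = pos - k * step
            if 0 <= offset < len(labels):
                dist = abs(offset - (len(labels) - 1) / 2)
                if best is None or dist < best[0]:
                    best = (dist, labels[offset])
        merged.append(best[1])
    return merged


def adjacent_spans(text: str, labels):
    """Post-processing: output adjacent characters with the same label as one text."""
    spans, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, text[start:start + length]))
        start += length
    return spans


if __name__ == "__main__":
    # Two windows of length 5 with step 3 over an 8-character text.
    window_labels = [["A", "A", "A", "X", "X"], ["B", "B", "B", "B", "B"]]
    merged = merge_window_labels(8, 3, window_labels)
    print(adjacent_spans("abcdefgh", merged))
```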

[0128] Figure 5 is a schematic diagram of a sliding window scheme provided in an embodiment of this application. As shown in Figure 5, the long text 501 is segmented using a sliding window with a length of 5 characters. The sliding window moves in steps of 3, resulting in six text sequences, 5011 to 5016. Adjacent sliding windows intersect, as shown in the dashed box corresponding to 502, and the overlapping text between 5011 and 5012 is identical. Therefore, when analyzing the segmented text sequences, it is necessary to remove overlapping text from adjacent sequences. Here, the text closer to the center of its sliding window can be kept. For example, 1 and 3 overlap between text sequences 5011 and 5012, as do 2 and 4. Since 3 is closer to the center of text sequence 5011 and 1 is farther from the center of text sequence 5012, of the overlapping texts 1 and 3, text 1 can be removed from text sequence 5012. Similarly, since 2 is closer to the center of text sequence 5012 and 4 is farther from the center of text sequence 5011, of the overlapping texts 2 and 4, text 4 can be removed from text sequence 5011.

[0129] In some embodiments, overlapping characters in either of the text sequences 5011 and 5012 can also be removed, for example, removing 1 and 2 simultaneously, or removing 3 and 4 simultaneously. This application embodiment only provides exemplary embodiments of removing overlapping characters, and this application embodiment does not limit any method of removing overlapping characters.

[0130] In this embodiment of the application, image information can also be incorporated during classification. This can be achieved by mapping image segments to high-dimensional features and then overlaying them along the text sequence, or by overlaying the image information of each text segment.

[0131] This application embodiment can not only use this algorithm to extract key information from contract documents, but also to extract key information from multi-page documents, and to extract key information from extremely long text documents. It can also compare key information in multiple contracts to obtain contract comparison results, etc.

[0132] Based on the above embodiments, this application provides a document analysis device. Figure 6 is a schematic diagram of a document analysis device provided in an embodiment of this application. As shown in Figure 6, the device 60 includes an acquisition module 601, a multimodal information fusion processing module 602, a serialization processing module 603, a classification processing module 604, and a document analysis module 605.

[0133] The acquisition module 601 is used to acquire the text information and two-dimensional position information of the target text in the document to be analyzed;

[0134] The multimodal information fusion processing module 602 is used to perform multimodal information fusion processing on the text information and the two-dimensional position information to obtain the analysis vector corresponding to the target text;

[0135] The serialization processing module 603 is used to perform serialization processing on the corresponding analysis vectors based on the two-dimensional position information corresponding to all the text in the document to be analyzed, so as to obtain the sequence of vectors to be analyzed.

[0136] The classification processing module 604 is used to classify each analysis vector in the sequence of vectors to be analyzed according to the sequence position information of the analysis vector in the sequence of vectors to be analyzed, so as to obtain the category attribute of the text corresponding to each analysis vector.

[0137] The document analysis module 605 is used to perform document analysis on the document to be analyzed based on the category attributes of the target text.

[0138] In some embodiments, the acquisition module 601 is further configured to, in response to the fact that the number of characters in the first text paragraph in the document to be analyzed is less than a first preset number, perform text recognition processing on the first text paragraph to obtain text information and recognition boxes corresponding to the target characters in the first text paragraph; and determine the two-dimensional position information corresponding to the target characters based on the position of each recognition box on the document to be analyzed.

[0139] In some embodiments, the apparatus further includes: a cutting module, configured to, in response to the fact that the number of characters in the second text segment in the document to be analyzed is greater than a second preset number, cut the second text segment from the starting position of the second text segment with a first length as the sliding step size through a sliding window with a second length to obtain at least two text sequences; wherein the first length is less than or equal to the second length; and a text recognition processing module, configured to perform text recognition processing on the at least two text sequences to determine the text information and the recognition box of the target text in the at least two text sequences.

[0140] In some embodiments, the apparatus further includes: a determining module, configured to determine, in response to the first length being less than the second length, that there are overlapping characters between two adjacent text sequences; a removing module, configured to remove overlapping characters from either text sequence that are the same as those in the other text sequence, for two adjacent text sequences, to obtain a cleared text sequence; correspondingly, the text recognition processing module is further configured to perform text recognition processing on the cleared text sequence to determine the text information and the recognition box of the target text in the cleared text sequence.

[0141] In some embodiments, any two adjacent text sequences include a first text sequence and a second text sequence; the removal module is further configured to, for any two adjacent text sequences, determine the number of overlapping characters in the first text sequence or the second text sequence; divide the overlapping characters into first overlapping characters and second overlapping characters according to the number of overlapping characters; remove the second overlapping characters from the first text sequence in response to the distance between the first overlapping characters and the sequence center of the first text sequence being less than the distance between the first overlapping characters and the sequence center of the second text sequence; remove the first overlapping characters from the second text sequence in response to the distance between the second overlapping characters and the sequence center of the second text sequence being less than the distance between the second overlapping characters and the sequence center of the first text sequence; and determine the cleared first text sequence and the cleared second text sequence as the cleared text sequence.

[0142] In some embodiments, the document to be analyzed includes at least multiple pages; the two-dimensional position information includes at least first direction position information; the device further includes: a normalization processing module, configured to normalize the two-dimensional position information of each target character to obtain normalized two-dimensional position information; and an adding module, configured to add weight information to the first direction position information in the normalized two-dimensional position information corresponding to the target character in each page of the document according to the page number order of the multiple pages, to obtain the first direction position information corresponding to each target character in the multiple pages of the document.

[0143] In some embodiments, the multimodal information fusion processing module 602 is further configured to extract text features from the text information corresponding to the target text to obtain a text feature vector; extract position features from the two-dimensional position information corresponding to the target text to obtain a two-dimensional position feature vector; and perform multimodal feature fusion processing on the text feature vector and the two-dimensional position feature vector to obtain an analysis vector corresponding to the target text.

[0144] In some embodiments, the two-dimensional position information includes at least first-direction position information and second-direction position information; the multimodal information fusion processing module 602 is further configured to encode the first-direction position information and the second-direction position information corresponding to the target text respectively to obtain a first-direction attention matrix and a second-direction attention matrix; superimpose the first-direction attention matrix and the second-direction attention matrix to obtain a two-dimensional position matrix corresponding to the target text; and extract features from the two-dimensional position matrix to obtain the two-dimensional position feature vector.

[0145] In some embodiments, the apparatus further includes: an image segmentation module for performing image segmentation processing on the document to be analyzed to obtain at least two segmented images; an image feature extraction module for performing image feature extraction on the at least two segmented images to obtain at least two image feature vectors; correspondingly, the serialization processing module 603 is further configured to perform serialization processing on the corresponding analysis vectors according to the two-dimensional position information corresponding to all the text in the document to be analyzed to obtain an initial vector sequence; and add the at least two image feature vectors to the initial vector sequence to obtain the vector sequence to be analyzed.

[0146] In some embodiments, the classification processing module 604 is further configured to classify each analysis vector in the sequence of vectors to be analyzed sequentially according to the sequence position information of each analysis vector in the sequence of vectors to be analyzed, to obtain a classification sequence; and to normalize the classification sequence to obtain the category attribute of the text corresponding to each analysis vector.
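As a non-limiting illustration, the normalization of a classification sequence can be sketched as follows. The sketch assumes that the classification sequence holds one list of raw scores per analysis vector, that normalization is a softmax, and that the category attribute is the highest-probability label. The label set `CATEGORIES` and the function names are hypothetical; the classifier that produces the raw scores is outside the sketch.

```python
import math
from typing import List

CATEGORIES = ["party", "amount", "other"]  # hypothetical label set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sequence(classification_sequence: List[List[float]]) -> List[str]:
    """Normalize each score list in sequence order and pick the top category."""
    labels = []
    for logits in classification_sequence:
        probs = softmax(logits)
        labels.append(CATEGORIES[probs.index(max(probs))])
    return labels


labels = classify_sequence([[2.0, 0.1, 0.1], [0.0, 3.0, 1.0]])
```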

[0147] In some embodiments, the document analysis module 605 is further configured to classify all the text in the document to be analyzed according to the category attribute of the target text to obtain a text set corresponding to each category attribute; sort the text in the text set according to the two-dimensional position information of the target text to obtain a text category sequence corresponding to each category attribute; and determine the information corresponding to each category attribute in the document to be analyzed according to the text category sequence.
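As a non-limiting illustration, the grouping and sorting steps of the preceding paragraph can be sketched as follows. The sketch assumes each text item is a `(text, category, (x, y))` tuple, sorts by `(y, x)`, and joins the sorted text with spaces; the tuple layout, the sort key, and the name `extract_fields` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str, Tuple[float, float]]  # (text, category, (x, y))


def extract_fields(tokens: List[Token]) -> Dict[str, str]:
    """Group text by category attribute, sort each group by position, and join."""
    groups = defaultdict(list)
    for text, category, position in tokens:
        groups[category].append((position, text))

    fields = {}
    for category, members in groups.items():
        # Assumed ordering: top-to-bottom, then left-to-right.
        members.sort(key=lambda member: (member[0][1], member[0][0]))
        fields[category] = " ".join(text for _, text in members)
    return fields


tokens = [
    ("Ltd", "party", (0.4, 0.1)),
    ("Acme", "party", (0.1, 0.1)),
    ("100", "amount", (0.2, 0.5)),
]
fields = extract_fields(tokens)
```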

[0148] In some embodiments, the document to be analyzed is a multi-page contract document, and the two-dimensional position information includes at least first direction position information; the device further includes: a normalization processing module, used to normalize the two-dimensional position information of the target text to obtain normalized two-dimensional position information; a weighting information adding module, used to add weight information to the first direction position information in the normalized two-dimensional position information corresponding to the target text in each page of the multi-page contract document according to the page number order, to obtain the first direction position information corresponding to the target text in the multi-page contract document; correspondingly, the multimodal information fusion processing module is also used to perform multimodal information fusion processing on the text information and the two-dimensional position information including the first direction position information to obtain the analysis vector corresponding to the target text.
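As a non-limiting illustration, the normalization and page weighting of the preceding paragraph can be sketched as follows. The sketch assumes that the first direction is the vertical direction, that normalization divides by the page size, and that the weight information is the zero-based page index multiplied by a constant. These assumptions and all function and parameter names are hypothetical.

```python
from typing import Tuple


def normalize(x: float, y: float, page_width: float, page_height: float) -> Tuple[float, float]:
    """Scale page coordinates into the range [0, 1]."""
    return x / page_width, y / page_height


def add_page_weight(first_dir: float, page_index: int, weight_per_page: float = 1.0) -> float:
    """Offset the first-direction position by a weight that encodes the page."""
    return first_dir + page_index * weight_per_page


def global_position(x: float, y: float, page_index: int,
                    page_width: float, page_height: float) -> Tuple[float, float]:
    """Normalized position whose first-direction component is unique across pages."""
    nx, ny = normalize(x, y, page_width, page_height)
    return nx, add_page_weight(ny, page_index)


# Text near the bottom of page 0 and near the top of page 1 (illustrative values).
bottom_of_first = global_position(100, 790, 0, 1000, 800)
top_of_second = global_position(100, 10, 1, 1000, 800)
```

Under these assumptions, text on a later page always has a larger first-direction position than text on an earlier page, so a single sort over all pages preserves the page number order.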

[0149] The descriptions of the above device embodiments are similar to those of the above method embodiments, and have similar beneficial effects. For technical details not disclosed in the device embodiments of this application, please refer to the descriptions of the method embodiments of this application for understanding.

[0150] If the technical solution of this application involves personal information, the product using this technical solution has clearly informed the user of the personal information processing rules and obtained the user's voluntary consent before processing the personal information. If the technical solution of this application involves sensitive personal information, the product using this technical solution has obtained the user's separate consent before processing the sensitive personal information, and also meets the requirement of "express consent". For example, at personal information collection devices such as cameras, clear and prominent signs are set up to inform users that they have entered the scope of personal information collection and that personal information will be collected; if an individual voluntarily enters the collection scope, the individual is deemed to have agreed to the collection of their personal information. Alternatively, on a personal information processing device, where clear signs or notices inform users of the personal information processing rules, authorization is obtained from the user through a pop-up message or by asking the user to upload their personal information. The personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.

[0151] It should be noted that, in the embodiments of this application, if the above-described document analysis method is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, or the part that contributes to the related technology, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), magnetic disks, or optical disks. Thus, the embodiments of this application are not limited to any specific hardware and software combination.

[0152] This application provides an electronic device, including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the above-described document analysis method.

[0153] This application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the document analysis method described above. The computer-readable storage medium can be transitory or non-transitory.

[0154] This application provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, it implements some or all of the steps in the above-described method. This computer program product can be implemented specifically through hardware, software, or a combination thereof. In one optional embodiment, the computer program product is specifically embodied as a computer storage medium; in another optional embodiment, the computer program product is specifically embodied as a software product, such as a software development kit (SDK), etc.

[0155] It should be noted that Figure 7 is a schematic diagram of the hardware entity of a document analysis device provided in an embodiment of this application. As shown in Figure 7, the hardware entity of the electronic device 70 includes: a processor 701, a communication interface 702, and a memory 703, wherein:

[0156] The processor 701 typically controls the overall operation of the electronic device 70.

[0157] The communication interface 702 enables the electronic device 70 to communicate with other terminals or servers via a network.

[0158] The memory 703 is configured to store instructions and applications executable by the processor 701, and can also cache data that is to be processed or has already been processed by the processor 701 and the various modules of the electronic device 70 (e.g., image data, audio data, voice communication data, and video communication data). The memory 703 can be implemented using flash memory or random access memory (RAM). Data transfer between the processor 701, the communication interface 702, and the memory 703 can be performed via a bus 704.

[0159] It should be noted that the descriptions of the storage medium and device embodiments above are similar to the descriptions of the method embodiments above, and have similar beneficial effects. For technical details not disclosed in the storage medium and device embodiments of this application, please refer to the descriptions of the method embodiments of this application for understanding.

[0160] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. It should be understood that in the various embodiments of this application, the sequence numbers of the above-described processes do not imply a sequential order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. The sequence numbers of the above-described embodiments are merely descriptive and do not represent the superiority or inferiority of the embodiments.

[0161] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0162] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.

[0163] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.

[0164] In addition, each functional unit in the embodiments of this application can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0165] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as mobile storage devices, read-only memory (ROM), magnetic disks, or optical disks.

[0166] Alternatively, if the integrated units described above are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence or the part that contributes to related technologies, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROMs, magnetic disks, or optical disks.

[0167] The above description is merely an embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.

Claims

1. A document analysis method, characterized in that the method includes: obtaining text information and two-dimensional position information of a target text in a document to be analyzed, wherein the document to be analyzed includes at least multiple pages, and the two-dimensional position information includes at least first-direction position information; performing multimodal information fusion processing on the text information and the two-dimensional position information to obtain an analysis vector corresponding to the target text; serializing the corresponding analysis vectors based on the two-dimensional position information corresponding to all the text in the document to be analyzed, to obtain a sequence of vectors to be analyzed; classifying each analysis vector in the sequence of vectors to be analyzed based on the sequence position information of the analysis vector in the sequence of vectors to be analyzed, to obtain the category attribute of the text corresponding to each analysis vector; and performing document analysis on the document to be analyzed based on the category attribute of the target text; wherein the method further includes: normalizing the two-dimensional position information of each target text to obtain normalized two-dimensional position information; and, based on the page number order of the multi-page document, sequentially adding weight information to the first-direction position information in the normalized two-dimensional position information corresponding to the target text in each page of the document, to obtain the first-direction position information corresponding to the target text in the multi-page document, wherein the weight information at least represents the page number information of the target text.

2. The method according to claim 1, characterized in that obtaining the text information and two-dimensional position information of the target text in the document to be analyzed includes: in response to the number of characters in a first text paragraph of the document to be analyzed being less than a first preset number, performing text recognition processing on the first text paragraph to obtain the text information and a recognition box corresponding to the target text in the first text paragraph; and determining the two-dimensional position information corresponding to the target text based on the position of each recognition box in the document to be analyzed.

3. The method according to claim 2, characterized in that the method further includes: in response to the number of characters in a second text segment of the document to be analyzed being greater than a second preset number, cutting the second text segment, starting from its beginning position, through a sliding window of a second length with a first length as the sliding step size, to obtain at least two text sequences, wherein the first length is less than or equal to the second length; and performing text recognition processing on the at least two text sequences to determine the text information and the recognition box of the target text in the at least two text sequences.

4. The method according to claim 3, characterized in that the method further includes: in response to the first length being less than the second length, determining that overlapping characters exist between two adjacent text sequences; and, for two adjacent text sequences with overlapping characters, removing from one text sequence the overlapping characters that are the same as those in the other text sequence, to obtain cleaned text sequences; correspondingly, performing text recognition processing on the at least two text sequences to determine the text information and the recognition box of the target text in the at least two text sequences includes: performing text recognition processing on the cleaned text sequences to determine the text information and the recognition box of the target text in the cleaned text sequences.

5. The method according to claim 4, characterized in that any two adjacent text sequences include a first text sequence and a second text sequence, and the method further includes: for any two adjacent text sequences with overlapping characters, determining the number of overlapping characters in the first text sequence or the second text sequence; dividing the overlapping characters into first overlapping characters and second overlapping characters based on the number of overlapping characters; in response to the distance between a first overlapping character and the sequence center of the first text sequence being less than the distance between the first overlapping character and the sequence center of the second text sequence, removing the second overlapping characters from the first text sequence; in response to the distance between a second overlapping character and the sequence center of the second text sequence being less than the distance between the second overlapping character and the sequence center of the first text sequence, removing the first overlapping characters from the second text sequence; and determining the first text sequence and the second text sequence after removal as the cleaned text sequences.

6. The method according to any one of claims 1 to 5, characterized in that performing multimodal information fusion processing on the text information and the two-dimensional position information to obtain the analysis vector corresponding to the target text includes: performing text feature extraction on the text information corresponding to the target text to obtain a text feature vector; performing position feature extraction on the two-dimensional position information corresponding to the target text to obtain a two-dimensional position feature vector; and performing multimodal feature fusion processing on the text feature vector and the two-dimensional position feature vector corresponding to the same text to obtain the analysis vector corresponding to the target text.

7. The method according to claim 6, characterized in that the two-dimensional position information includes at least first-direction position information and second-direction position information, and performing position feature extraction on the two-dimensional position information corresponding to the target text to obtain a two-dimensional position feature vector includes: encoding the first-direction position information and the second-direction position information corresponding to the target text respectively to obtain a first-direction attention matrix and a second-direction attention matrix; superimposing the first-direction attention matrix and the second-direction attention matrix to obtain a two-dimensional position matrix corresponding to the target text; and performing feature extraction on the two-dimensional position matrix to obtain the two-dimensional position feature vector.

8. The method according to any one of claims 1 to 7, characterized in that the method further includes: performing image segmentation processing on the document to be analyzed to obtain at least two segmented images; and performing image feature extraction on the at least two segmented images to obtain at least two image feature vectors; correspondingly, serializing the corresponding analysis vectors based on the two-dimensional position information corresponding to all the text in the document to be analyzed, to obtain the sequence of vectors to be analyzed, includes: serializing the corresponding analysis vectors based on the two-dimensional position information corresponding to all the text in the document to be analyzed, to obtain an initial vector sequence; and adding the at least two image feature vectors to the initial vector sequence to obtain the sequence of vectors to be analyzed.

9. The method according to any one of claims 1 to 8, characterized in that classifying each analysis vector in the sequence of vectors to be analyzed based on its sequence position information to obtain the category attribute of the text corresponding to each analysis vector includes: classifying each analysis vector in the sequence of vectors to be analyzed sequentially, based on the sequence position information of each analysis vector in the sequence of vectors to be analyzed, to obtain a classification sequence; and normalizing the classification sequence to obtain the category attribute of the text corresponding to each analysis vector.

10. The method according to any one of claims 1 to 9, characterized in that performing document analysis on the document to be analyzed based on the category attribute of the target text includes: classifying all the text in the document to be analyzed based on the category attribute of the target text, to obtain a text set corresponding to each category attribute; sorting the text in the text set based on the two-dimensional position information of the target text, to obtain a text category sequence corresponding to each category attribute; and determining, based on the text category sequence, the information corresponding to each category attribute in the document to be analyzed.

11. The method according to claim 1, characterized in that the document to be analyzed is a multi-page contract document, the two-dimensional position information includes at least first-direction position information, and the method further includes: normalizing the two-dimensional position information of the target text to obtain normalized two-dimensional position information; and, according to the page number order of the multi-page contract document, adding weight information to the first-direction position information in the normalized two-dimensional position information corresponding to the target text in each page of the contract document, to obtain the first-direction position information corresponding to the target text in the multi-page contract document; correspondingly, performing multimodal information fusion processing on the text information and the two-dimensional position information to obtain the analysis vector corresponding to the target text includes: performing multimodal information fusion processing on the text information and the two-dimensional position information including the first-direction position information, to obtain the analysis vector corresponding to the target text.

12. A document analysis device, characterized in that the device includes: an acquisition module, used to acquire text information and two-dimensional position information of a target text in a document to be analyzed, wherein the document to be analyzed includes at least multiple pages, and the two-dimensional position information includes at least first-direction position information; a multimodal information fusion processing module, used to perform multimodal information fusion processing on the text information and the two-dimensional position information to obtain an analysis vector corresponding to the target text; a serialization processing module, used to serialize the corresponding analysis vectors based on the two-dimensional position information corresponding to all the text in the document to be analyzed, to obtain a sequence of vectors to be analyzed; a classification processing module, used to classify each analysis vector in the sequence of vectors to be analyzed according to the sequence position information of the analysis vector in the sequence of vectors to be analyzed, to obtain the category attribute of the text corresponding to each analysis vector; and a document analysis module, used to perform document analysis on the document to be analyzed based on the category attribute of the target text; wherein the device further includes: a normalization processing module, used to normalize the two-dimensional position information of each target text to obtain normalized two-dimensional position information; and an adding module, used to add weight information, according to the page number order, to the first-direction position information in the normalized two-dimensional position information corresponding to the target text in each page of the multi-page document, to obtain the first-direction position information corresponding to the target text in the multi-page document, wherein the weight information at least represents the page number information of the target text.

13. A document analysis device, comprising a processor and a memory, wherein the memory stores a computer program executable on the processor, characterized in that, when the processor executes the computer program, it implements the method according to any one of claims 1 to 11.

14. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 11.
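
As a non-limiting illustration that forms no part of the claims, the sliding-window cutting and overlap removal recited in claims 3 to 5 can be sketched as follows. The sketch assumes that windows are represented as half-open index spans, that the sequence center is the mean index of the original window, and that a character equally distant from both centers is kept in the first sequence. The function names and the tie-breaking rule are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character indices


def sliding_windows(length: int, window: int, step: int) -> List[Span]:
    """Cut a text of `length` characters from its beginning with the given step."""
    if not 0 < step <= window:
        raise ValueError("step (first length) must be in (0, window]")
    spans = []
    start = 0
    while True:
        end = min(start + window, length)
        spans.append((start, end))
        if end == length:
            return spans
        start += step


def trim_overlaps(spans: List[Span]) -> List[Span]:
    """Keep each overlapping character only in the sequence whose center is nearer."""
    centers = [(start + end - 1) / 2 for start, end in spans]
    trimmed = [list(span) for span in spans]
    for i in range(len(spans) - 1):
        cut = spans[i][1]
        for idx in range(spans[i + 1][0], spans[i][1]):  # overlapping indices
            if abs(idx - centers[i]) > abs(idx - centers[i + 1]):
                cut = idx  # first character that is nearer to the second center
                break
        trimmed[i][1] = cut
        trimmed[i + 1][0] = cut
    return [(start, end) for start, end in trimmed]


text = "abcdefghij"
spans = sliding_windows(len(text), window=6, step=4)
cleaned = trim_overlaps(spans)
pieces = [text[start:end] for start, end in cleaned]
```

Under these assumptions the cleaned sequences cover the original text exactly once, so no character is recognized twice.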