Method and system for structured parsing of project electronic documents

By identifying and extracting paragraphs and modal elements from project electronic documents, and constructing feature libraries and semantic annotation rules, a highly accurate structured parsing of diverse scientific and technological project documents is achieved. This solves the problem of low parsing accuracy in existing technologies and simplifies the document organization and storage process.

CN115408995B · Active · Publication Date: 2026-06-23 · NAVAL UNIV OF ENG PLA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NAVAL UNIV OF ENG PLA
Filing Date
2022-08-23
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing electronic document structured parsing methods are difficult to apply to diverse scientific and technological project documents, have low parsing accuracy, and cannot meet the needs of in-depth and fine-grained parsing.

Method used

By identifying paragraphs and modal elements in project electronic documents, a coding type feature library and a semantic annotation rule library are constructed. Information from the cover, basic information table, and project member table is extracted, and a set of structural information fields is constructed to achieve fine-grained document parsing.

Benefits of technology

It is suitable for electronic documents of all types of projects, with high parsing accuracy, covering important content, reducing the complexity of document structure parsing, and facilitating organization and storage.


Abstract

The application discloses a structured analysis method and system for a project electronic document. The method comprises the following steps: identifying paragraphs of the project electronic document and a modality element of each paragraph, wherein the modality element is used to distinguish content types of different paragraphs; identifying a cover, a basic information table and a project member table of the project electronic document according to the modality element, and extracting project basic information from the cover, the basic information table and the project member table; constructing a structure information field set, extracting project structure information from the project electronic document based on the modality element and the structure information field set; and storing or displaying the project basic information and the project structure information according to a preset specification. The application is suitable for various types of project electronic documents and has high analysis accuracy.

Description

Technical Field

[0001] This application relates to the field of document processing technology, and more specifically, to a structured parsing method and system for project electronic documents.

Background Technology

[0002] Structured parsing of unstructured or semi-structured electronic documents, such as scientific and technological project documents, facilitates the effective organization, management, and analysis of electronic materials. Existing electronic document structuring methods typically target documents with fixed formats or templates and mostly perform shallow content parsing: they analyze the encoding of a particular document format to obtain text and positional information, then extract keywords or sentences that conform to certain fixed rules based on a fixed template. Such methods can only achieve shallow information extraction from fixed-format documents and cannot meet the needs of deep, fine-grained parsing of scientific and technological project electronic documents with diverse formats and significant structural differences.

Summary of the Invention

[0003] To address at least one deficiency or improvement need in the existing technology, the present invention provides a structured parsing method and system for project electronic documents, applicable to various types of project electronic documents, and with high parsing accuracy.

[0004] To achieve the above objectives, according to a first aspect of the present invention, a structured parsing method for project electronic documents is provided, comprising:

[0005] Identify paragraphs in the project's electronic documents and the modal elements of each paragraph, wherein the modal elements are used to distinguish the content types of different paragraphs;

[0006] The cover, basic information table, and project member table of the project electronic document are identified based on the modal elements, and basic project information is extracted from the cover, basic information table, and project member table.

[0007] Construct a set of structural information fields, and extract project structural information from the project electronic document based on the modal elements and the set of structural information fields;

[0008] The basic information and structural information of the project are stored or displayed according to preset specifications.

[0009] Furthermore, before identifying the paragraphs and modal elements of each paragraph in the project electronic document, the project electronic document is preprocessed, and the preprocessing includes:

[0010] An encoding type feature library is constructed, which stores the features corresponding to each encoding type. The candidate encoding type of the project electronic document is determined according to the file name suffix of the project electronic document. If the project electronic document includes the features corresponding to the candidate encoding type, the candidate encoding type is used as the encoding type of the project electronic document.

[0011] The project electronic document is converted into a standard format document according to the encoding type of the project electronic document.

[0012] Furthermore, the identification of paragraphs in the electronic document and the modal elements of each paragraph includes:

[0013] Extract paragraphs from the electronic documents of the project and the original document tags for each paragraph;

[0014] Identify the modal elements of each paragraph based on the original document tags;

[0015] Mark the modal elements of each paragraph in paragraph order.

[0016] Furthermore, the modal elements include any combination of text, tables, images, code, and formulas.

[0017] Furthermore, the step of identifying the cover, basic information table, and project member table of the project electronic document based on the modal elements includes:

[0018] A cover content dictionary and cover recognition rules are constructed. The area in the project's electronic document where words from the cover content dictionary appear is taken as the initial positioning area of the cover. Then, the cover position is determined based on the initial positioning area and the preset cover recognition rules.

[0019] Construct a basic information table dictionary, and use the table in the project's electronic document that contains words from the basic information table dictionary as the basic information table;

[0020] Construct a project member table dictionary, and use the table in the project's electronic document that contains words from the project member table dictionary as the project member table.

[0021] Furthermore, the extraction of basic project information from the cover, basic information table, and project member table includes:

[0022] A semantic annotation rule library is pre-built. The semantic annotation rule library stores different document templates and the semantic annotation rules corresponding to each document template. The semantic annotation rules include the parsing methods for specific content in the cover, basic information table and project member table, as well as the basic information table dictionary and project member table dictionary on which the parsing depends.

[0023] If the semantic annotation rule base stores the document template used by the electronic document of the project, then the basic information of the document is extracted using the semantic annotation rules corresponding to the document template;

[0024] If the semantic annotation rule base does not store the document template used by the project's electronic document, then a corresponding semantic annotation rule is customized for the document template used by the project's electronic document. The customized semantic annotation rule is used to extract the basic information of the document, and the document template used by the project's electronic document and its corresponding semantic annotation rule are updated and stored in the semantic annotation rule base.

[0025] Further, determining whether the semantic annotation rule base stores the document template used by the project's electronic document includes:

[0026] Calculate the similarity of the project electronic document with each document template in the semantic annotation rule base for the project type, project cover, basic information table, and project member table. Then, weight the similarity of the project type, project cover, basic information table, and project member table according to a preset weight to obtain the overall similarity between the project electronic document and each document template in the semantic annotation rule base. If there is a document template in the semantic annotation rule base with an overall similarity greater than a preset threshold, it is determined that the semantic annotation rule base stores the document template used by the project electronic document.

[0027] Furthermore, the methods for parsing the specific content of the cover, basic information table, and project member table include:

[0028] The cover is divided into multiple lines. Each line of the cover is matched against the basic information table dictionary from left to right. If the whole line matches a word in the basic information table dictionary, the line contains no specific content for that basic information item. Otherwise, the longest string in the line that matches a word in the basic information table dictionary is used as the basic information metadata item, and the remaining content of the line is used as the specific content of that metadata item.

[0029] Identify the basic information metadata items and values in the basic information table, establish the association between each basic information metadata item cell and the value cell, and determine the value of each basic information metadata item based on the association.

[0030] Identify the basic information metadata items and corresponding values of the project member table based on the project member table dictionary.
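The cover-line rule in [0028] can be illustrated with a minimal Python sketch. It assumes that "matching from left to right" means finding the longest dictionary word at the start of the line; the function name `parse_cover_line`, the separator characters, and the sample dictionary are hypothetical and not taken from the patent.

```python
def parse_cover_line(line, dictionary):
    """Split a cover line into (metadata item, specific content).

    Returns None when no dictionary word starts the line.
    """
    text = line.strip()
    # Collect every dictionary word that matches the start of the line.
    matches = [word for word in dictionary if text.startswith(word)]
    if not matches:
        return None
    # Keep the longest match as the metadata item.
    item = max(matches, key=len)
    # Assumed separators between item and value: spaces and colons.
    content = text[len(item):].lstrip(" :：")
    # An empty remainder means the line holds the item name but no content.
    return item, content


if __name__ == "__main__":
    words = {"Project", "Project Name", "Applicant"}
    print(parse_cover_line("Project Name: Document Parsing", words))
```

Choosing the longest match keeps "Project Name" from being cut short by the shorter entry "Project".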

[0031] Furthermore, the set of structural information fields includes several structural information fields, each of which includes several constituent elements. The extraction of project structural information includes:

[0032] Let $q_i$ denote the i-th paragraph in the project electronic document, and let $s_{field\text{-}j}$ denote the j-th structural information field in the set of structural information fields. If $q_i$ contains any constituent element of $s_{field\text{-}j}$, then the specific content of $q_i$ is taken as the structural information corresponding to $s_{field\text{-}j}$.
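The field-matching rule in [0032] might be sketched as follows. This is an illustration only: the field names, the constituent elements, and the use of plain substring containment are assumptions, since the patent does not specify how containment is tested.

```python
def extract_structure_info(paragraphs, field_set):
    """Map each structural field to the paragraphs containing its elements.

    field_set maps a field name to its list of constituent elements.
    """
    result = {name: [] for name in field_set}
    for paragraph in paragraphs:
        for name, elements in field_set.items():
            # A paragraph belongs to a field if it contains any element.
            if any(element in paragraph for element in elements):
                result[name].append(paragraph)
    return result


if __name__ == "__main__":
    fields = {
        "research_objectives": ["research objective", "aims"],
        "innovations": ["innovation"],
    }
    text = ["The research objective is X.", "Budget details.", "Key innovation: Y."]
    print(extract_structure_info(text, fields))
```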

[0033] According to a second aspect of the present invention, a structured parsing system for project electronic documents is also provided, comprising:

[0034] The modal recognition module is used to identify paragraphs in the project's electronic documents and the modal elements of each paragraph. The modal elements are used to distinguish the content types of different paragraphs.

[0035] The basic information extraction module is used to identify the cover, basic information table and project member table of the project electronic document based on the modal elements, and to extract basic project information from the cover, basic information table and project member table;

[0036] The project structure information extraction module is used to construct a set of structure information fields and extract project structure information from the project electronic document based on the modal elements and the set of structure information fields.

[0037] The standardization module is used to store or display the basic information and structural information of the project according to a preset standard.

[0038] In summary, compared with the prior art, the above-described technical solutions conceived by this invention can achieve the following beneficial effects:

[0039] (1) Applicable to all types of project electronic documents, first identify the paragraphs of the project electronic document and the modal elements of each paragraph, and then extract the basic information and project structure information of the project based on the modal elements. That is, the document is split into fine-grained parts from three aspects: modal elements, basic information of the project and project structure information. The parsing results can cover all the important content in the electronic documents of science and technology projects and the parsing accuracy is high.

[0040] (2) Before parsing, preprocessing is performed on documents of different formats, including encoding format identification, verification and standard format conversion. This not only reduces the complexity of subsequent document structure parsing, but also facilitates the organization and storage of electronic documents of scientific and technological projects.

Description of the Drawings

[0041] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0042] Figure 1 is a flowchart of a structured parsing method for project electronic documents provided in an embodiment of this application;

[0043] Figure 2 is a schematic diagram of the parsing principle for the cover, basic information table, and project member table provided in an embodiment of this application;

[0044] Figure 3 is a schematic diagram of document table segmentation for similarity calculation provided in an embodiment of this application.

Detailed Implementation

[0045] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0046] The terms "comprising" and "having," and any variations thereof, in the specification, claims, and accompanying drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or modules is not limited to the steps or modules listed, but may optionally include steps or modules not listed, or may optionally include other steps or modules inherent to such process, method, product, or apparatus.

[0047] As shown in Figure 1, an embodiment of the present invention provides a structured parsing method for project electronic documents, comprising:

[0048] S101, Identify the paragraphs of the project's electronic document and the modal elements of each paragraph. The modal elements are used to distinguish the content types of different paragraphs.

[0049] The process involves acquiring project electronic documents, which typically consist of multiple paragraphs, and identifying the modal elements of each paragraph. By identifying the paragraphs and modal elements of each paragraph, the project electronic document is transformed into a data format that is easier to process in subsequent steps.

[0050] The information in the project's electronic document is divided into three categories: modal elements, basic project information, and project structure information. In one embodiment, these three categories can be: ① Modal elements, which can be text, images, tables, formulas, code, etc. These elements annotate each paragraph of the document from a modal perspective. ② Basic project information is custom content reflecting the basic information of the project, including the research funding, project type, project name, discipline, applicant's name, and applicant details. ③ Project structure information is custom content reflecting the main body of the project, including the basis for project approval, research object, research objectives, research framework, key problems to be solved, other research content, research plan, innovations, academic resume, etc. These are key components of the electronic document's main body and are also the core for determining whether the document is duplicated. The academic resume contains some personal information of the applicant and project team members, which can serve as the basis for extracting researcher attributes and relationship knowledge.

[0051] Table 1. Elements extracted from electronic documents of the project

[0052]

[0053] Furthermore, before identifying the paragraphs and modal elements of each paragraph in the project's electronic document, the project's electronic document undergoes preprocessing, which includes:

[0054] (1) Document encoding type identification and verification. An encoding type feature library is constructed, which stores the features corresponding to each encoding type. The candidate encoding type of the project electronic document is determined based on the file name suffix of the project electronic document. If the project electronic document includes the features corresponding to the candidate encoding type, the candidate encoding type is used as the encoding type of the project electronic document.

[0055] Generally, a document is named as "document name + file extension," where the file extension indicates the document's encoding type. Different project document formats have significantly different encoding types. For example, because the encodings differ, Apache POI, a common Word document parsing package, uses its HWPF and XWPF modules to read doc and docx documents respectively and to define the elements in the document. Different types of documents therefore require different parsing and modal element annotation methods, so the preprocessing must first identify and verify the format of the project document to be annotated. Common document extensions include .doc, .docx, .wps, and .pdf. To obtain the extension, a partial matching method can be used to match the full document name against the list of known project document extensions, yielding the extension category of the target document.

[0056] However, determining a document's encoding type solely from its file extension can be inaccurate. For example, if the extension of a .docx document is simply changed from .docx to .doc without altering the internal encoding, the encoding type will be identified incorrectly, which in turn affects subsequent document parsing and modal element annotation. To address this issue, the document content must be validated in addition to obtaining the extension. Features unique to the target encoding type are selected and matched against the document to determine whether it contains them, which completes the document format identification.
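A minimal Python sketch of this two-step check (extension first, then content verification) is given below. The patent does not list the features stored in its library; the leading-byte signatures used here are well-known file magic numbers chosen as an assumption, and the names `FEATURE_LIBRARY`, `candidate_type`, and `verified_type` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Hypothetical feature library: leading byte signature per encoding type.
# Using magic numbers as the "features" is an assumption of this sketch.
FEATURE_LIBRARY = {
    "docx": b"PK\x03\x04",                       # ZIP container
    "doc": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # OLE2 compound file
    "pdf": b"%PDF-",
    "rtf": b"{\\rtf",
}


def candidate_type(file_name: str) -> Optional[str]:
    """Return the candidate encoding type from the file name suffix."""
    suffix = Path(file_name).suffix.lower().lstrip(".")
    return suffix if suffix in FEATURE_LIBRARY else None


def verified_type(file_name: str, head: bytes) -> Optional[str]:
    """Accept the candidate type only if the content carries its feature."""
    candidate = candidate_type(file_name)
    if candidate is not None and head.startswith(FEATURE_LIBRARY[candidate]):
        return candidate
    return None


if __name__ == "__main__":
    # A docx file renamed to .doc fails verification.
    print(verified_type("renamed.doc", b"PK\x03\x04..."))
```

In practice `head` would be the first few bytes read from the file; a failed verification signals that the extension cannot be trusted and another type should be tried.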

[0057] (2) Convert the project electronic documents into standard format documents according to the encoding type of the project electronic documents.

[0058] In one embodiment, the docx format, which is more structured and responds faster, is selected as the standard format. First, the format of the document to be parsed is identified. Documents in doc, wps, rtf, and pdf formats are then uniformly converted to docx format using the JACOB (Java COM Bridge) component on Windows.

[0059] Furthermore, identifying paragraphs in the project's electronic documents and the modal elements of each paragraph includes:

[0060] (1) Extract paragraphs from the project's electronic documents and the original document tags for each paragraph.

[0061] Project electronic documents typically contain their own paragraph marks and original document tags for each paragraph. Existing technologies can be used to extract paragraphs and original document tags from project electronic documents.

[0062] (2) Identify the modal elements of each paragraph based on the original document tags.

[0063] In one embodiment, suppose the document content needs to be divided into five modal elements: "text," "tables," "images," "code," and "formulas." Document element parsing involves extracting element information from each paragraph, and the extraction steps for each paragraph are essentially the same: determine the element information of paragraph $e_i$. The modal element types are primarily divided into paragraph format and table format. Paragraph format includes text, image, and formula types, while table format includes table and code types. In implementation, the identifiers and content features of paragraph $e_i$ are considered together to identify the modal element to which the paragraph belongs.

[0064] (3) Mark the modal elements of each paragraph in paragraph order.

[0065] When marking the content and modal elements in a project electronic document according to the order of paragraphs, it is important to ensure that the order of the document content remains unchanged during the parsing and marking of the element information in the document, so as to output the content and corresponding element information of the project electronic document in sequence from front to back.
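The three steps above can be sketched in Python as follows. The tag names mirror common Office Open XML element names (`w:tbl`, `w:drawing`, `m:oMath`), but the mapping itself, the code-detection heuristic, and the function name `label_modal_elements` are assumptions made for illustration.

```python
# Hypothetical mapping from original document tags to modal elements.
TAG_TO_MODALITY = {
    "w:tbl": "table",
    "w:drawing": "image",
    "m:oMath": "formula",
}

# Assumed content feature: table text that starts like source code.
CODE_PREFIXES = ("def ", "import ", "#include")


def label_modal_elements(paragraphs):
    """Label each (tag, content) pair with a modal element, keeping order."""
    labeled = []
    for index, (tag, content) in enumerate(paragraphs):
        # Untagged or unknown paragraphs default to plain text.
        modality = TAG_TO_MODALITY.get(tag, "text")
        if modality == "table" and content.lstrip().startswith(CODE_PREFIXES):
            modality = "code"
        labeled.append({"index": index, "modality": modality, "content": content})
    return labeled


if __name__ == "__main__":
    sample = [("w:p", "Research objectives"), ("w:tbl", "Name | Title")]
    for entry in label_modal_elements(sample):
        print(entry)
```

The explicit `index` field records the original position, so the labeled output can always be emitted from front to back.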

[0066] S102, Identify the cover, basic information table and project member table of the project electronic document based on modal elements, and extract basic project information from the cover, basic information table and project member table.

[0067] A cover page, a basic information sheet, and a project member list are standard components of a project's electronic documentation.

[0068] Furthermore, identifying the cover, basic information table, and project member table of the project electronic document based on the modal elements includes:

[0069] (1) Construct a cover content dictionary and cover recognition rules. Use the area in the project's electronic document where the words in the cover content dictionary appear as the initial positioning area of the cover. Then, determine the cover position based on the initial positioning area and the preset cover recognition rules.

[0070] Regardless of whether the document is in native doc, docx, or wps format, in native PDF format, or a scanned copy of a paper document, rule-based cover recognition can be divided into two stages: preliminary cover location and cover content determination. Preliminary cover location can be achieved using a dictionary approach: if a region contains project type information and multiple basic information items, that region can be considered the approximate location of the document cover. In the second stage, if the document is in native doc, docx, or wps format, the cover information is extracted from the location of the project type information together with the areas before and after it; for example, the 5 lines above and the 10 lines below are extracted as the cover information. Both directions are covered because, although most basic project information appears below the project type information, some may appear above it.
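The two-stage cover recognition can be illustrated with a short Python sketch. The window of 5 lines before and 10 lines after follows the example in the text; the requirement of at least two basic information items (`min_items`), the function name `locate_cover`, and the sample word lists are assumptions.

```python
def locate_cover(lines, project_type_words, basic_info_words,
                 above=5, below=10, min_items=2):
    """Return (start, end) line indexes of the cover, or None if not found.

    The end index is exclusive, so lines[start:end] is the cover region.
    """
    for index, line in enumerate(lines):
        # Stage 1: look for a line carrying project type information.
        if not any(word in line for word in project_type_words):
            continue
        # Stage 2: take a window around that line as the candidate cover.
        start = max(0, index - above)
        end = min(len(lines), index + below + 1)
        window = lines[start:end]
        hits = sum(1 for word in basic_info_words
                   if any(word in candidate for candidate in window))
        if hits >= min_items:
            return start, end
    return None


if __name__ == "__main__":
    doc = ["", "National Natural Science Foundation Application",
           "Project Name: X", "Applicant: Y", "Body text"]
    print(locate_cover(doc, ["Natural Science Foundation"],
                       ["Project Name", "Applicant", "Discipline"]))
```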

[0071] (2) Construct a basic information table dictionary, and use the table in the project electronic document that contains words from the basic information table dictionary as the basic information table.

[0072] The table modal elements identified in the preprocessing can be used as objects. If the values of multiple cells in the table appear in the basic information table dictionary, then it is regarded as the basic information table of the project.

[0073] (3) Construct a project member table dictionary, and use the table in the project electronic document that contains words from the project member table dictionary as the project member table.

[0074] Since independent member information tables are mostly row header tables, and the header generally does not exceed two rows, in member information table identification, if the values of multiple cells in the first row or the first two rows of the table appear in the project member table dictionary, then it is regarded as a member information table.
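The dictionary-based table identification in [0071] to [0074] might look like the sketch below. Tables are modeled as lists of rows of cell strings; the threshold `min_hits=2` stands in for "multiple cells" and, like the function names, is an assumption.

```python
def dictionary_hits(cells, dictionary):
    """Count cells whose stripped value appears in the dictionary."""
    return sum(1 for cell in cells if cell.strip() in dictionary)


def is_basic_info_table(table, dictionary, min_hits=2):
    """Check all cells of the table against the basic information dictionary."""
    cells = [cell for row in table for cell in row]
    return dictionary_hits(cells, dictionary) >= min_hits


def is_member_table(table, dictionary, min_hits=2):
    """Check only the first row, then the first two rows, as the header."""
    for header_rows in (1, 2):
        cells = [cell for row in table[:header_rows] for cell in row]
        if dictionary_hits(cells, dictionary) >= min_hits:
            return True
    return False


if __name__ == "__main__":
    members = [["Name", "Title", "Institution"], ["Li", "Professor", "X"]]
    print(is_member_table(members, {"Name", "Title", "Institution"}))
```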

[0075] As can be seen from the above recognition rules, dictionary quality has a significant impact on the recognition results of the cover, basic information table, and project member table. The dictionary can be maintained manually or created manually with automatic updates. For example, an initial dictionary can be created based on a few document templates, and then automatically supplemented and improved from the recognized basic information table and project member table using unsupervised methods.

[0076] Furthermore, the principle of extracting basic project information from the cover, basic information table, and project member table is shown in Figure 2. The process includes: pre-building a semantic annotation rule base, which stores different document templates and the semantic annotation rules corresponding to each document template. The semantic annotation rules include parsing methods for specific content in the cover, basic information table, and project member table, as well as the basic information table dictionary and project member table dictionary on which the parsing depends. If the semantic annotation rule base stores the document template used by the project electronic document, the basic information of the document is extracted using the semantic annotation rules corresponding to that document template. If the semantic annotation rule base does not store the document template used by the project electronic document, a corresponding semantic annotation rule is customized for the document template used by the project electronic document, the basic information of the document is extracted using the customized semantic annotation rule, and the document template used by the project electronic document and its corresponding semantic annotation rules are updated and stored in the semantic annotation rule base.

[0077] In other words, when a new batch of project electronic documents is added to be processed, semantic annotation is first performed on the project electronic documents covered by the existing document template set and annotation rule set; then, taking the data that cannot be annotated as the object, new document templates and annotation rules are learned, and the basic information table dictionary and member information table dictionary are updated according to the learning results to better support the discovery of the cover, basic information table and project member table; finally, the documents that were not successfully annotated before are processed again until semantic annotation of all documents is achieved.

[0078] Further, determining whether the semantic annotation rule base stores the document template used by the project electronic document includes: calculating the similarity of the project type, project cover, basic information table, and project member table between the project electronic document and each document template in the semantic annotation rule base; weighting the similarity of the project type, project cover, basic information table, and project member table according to preset weights to obtain the overall similarity between the project electronic document and each document template in the semantic annotation rule base; if there is a document template in the semantic annotation rule base with an overall similarity greater than a preset threshold, then it is determined that the semantic annotation rule base stores the document template used by the project electronic document.
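The weighted combination and threshold test in [0078] can be sketched as below. The equal weights, the threshold values, the aspect keys, and the choice to return the highest-scoring template are all assumptions; the patent only requires that some template exceed a preset threshold.

```python
def overall_similarity(similarities, weights):
    """Weighted sum of the per-aspect similarities."""
    return sum(weights[aspect] * similarities[aspect] for aspect in weights)


def find_template(doc_similarities, weights, threshold):
    """Return the best template name above the threshold, else None.

    doc_similarities maps a template name to its per-aspect similarities.
    """
    best_name, best_score = None, threshold
    for name, similarities in doc_similarities.items():
        score = overall_similarity(similarities, weights)
        # Strictly greater than the threshold (or the best score so far).
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    weights = {"project_type": 0.25, "cover": 0.25,
               "basic_table": 0.25, "member_table": 0.25}
    scores = {"A": {"project_type": 1, "cover": 1,
                    "basic_table": 1, "member_table": 0}}
    print(find_template(scores, weights, threshold=0.7))
```

A `None` result corresponds to the case where a new semantic annotation rule has to be customized and stored.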

[0079] The core of achieving template-based electronic document recognition for projects is to determine whether the semantic annotation rule base stores the document template used by the project's electronic document based on the project type, cover, basic information table, and project member table. The calculation methods for project type similarity, project cover similarity, basic information table similarity, and project member table similarity are as follows.

[0080] ① Project type similarity calculation method. The project type of a document is determined based on the project type thesaurus. If the project types of two documents are completely identical, the similarity is 1; otherwise, it is 0.

[0081] ② Cover Similarity Calculation Method. The cover content is read paragraph by paragraph, and paragraphs with empty content are removed. If multiple consecutive spaces appear, the paragraph is split into multiple parts. These split units can be broadly categorized into two types: first, basic information on the cover; and second, information outside the cover. This latter type of information is often found in templates such as bidder commitments and instructions, so including it in the similarity calculation will not significantly affect the results. Assume the cover of document i can be split into $c_i$ units and the cover of document j can be split into $c_j$ units. If the number of units whose first n characters (n ≥ 2) are completely identical is k, then the cover similarity $s(c_i, c_j)$ between document i and document j is calculated as shown below.

[0082] $$s(c_i, c_j) = \begin{cases} 1, & c_i = c_j = k \\ 0, & \text{otherwise} \end{cases}$$

[0083] In the formula above, k represents the number of similar units. For example, if i has 10 units and j has 10 units, and all 10 units in i and j are identical, then $c_i = c_j = k$ and the similarity is 1; if only 9 are identical, the similarity is 0.
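The cover similarity described in [0081] and the worked example in [0083] can be sketched in Python. The binary result follows the worked example (1 only when both unit counts equal k). Matching units regardless of their position on the cover is an assumption of this sketch, as are the function names.

```python
import re


def split_cover_units(cover_paragraphs):
    """Drop empty paragraphs and split on runs of two or more spaces."""
    units = []
    for paragraph in cover_paragraphs:
        parts = re.split(r" {2,}", paragraph)
        units.extend(part.strip() for part in parts if part.strip())
    return units


def cover_similarity(units_i, units_j, n=2):
    """Return 1 when both covers have the same unit count and all match."""
    remaining = [unit[:n] for unit in units_j]
    k = 0
    for unit in units_i:
        prefix = unit[:n]
        # Each unit of j can be matched at most once.
        if prefix in remaining:
            remaining.remove(prefix)
            k += 1
    return 1 if len(units_i) == len(units_j) == k else 0


if __name__ == "__main__":
    cover_a = split_cover_units(["Project Name:   X", "", "Applicant:  Y"])
    cover_b = split_cover_units(["Project Name:   Z", "Applicant:  W"])
    print(cover_similarity(cover_a, cover_b, n=1))
```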

[0084] ③ Basic Information Table Similarity Calculation Method. Given the complex structure of the basic information table, which is an unstructured table (i.e., the header is not fixed in the first row / first n rows or the first column / first n columns), it is difficult to distinguish between the basic information items and the content to be filled in. Furthermore, users may make minor adjustments to the structure, such as deleting unnecessary blank rows, adding rows, changing row heights, or adjusting cell widths. Therefore, direct cell matching is not feasible for similarity calculation. To address this issue, a method combining visual similarity and cell content similarity is proposed, mainly including three steps: table preprocessing, visual similarity calculation, and cell content similarity calculation.

[0085] The preprocessing stage of the basic information table involves processing the original table, dividing it into several modules and standardizing them to support similarity calculation. First, considering only the horizontal lines, the table is divided into several regions based on line length, position, and whether they are adjacent (for adjacent lines of unequal length, the longer line can be truncated). Within each region, the ordinates of the start and end points of the horizontal lines are consistent. Second, each segmented region is analyzed row by row. If the number of vertical lines in a subsequent row is the same as that in the previous row, and the ordinates of each vertical line are the same, then it is considered the same module; otherwise, it is divided into different modules. Third, if a segmented module contains multiple rows, only the first row is retained. While deleting other rows, the position information of related regions needs to be adjusted according to the characteristics of the horizontal lines. Finally, only the content of the first cell in each region is retained, and the content of other cells is cleared. The row height and cell width of each region are adjusted to ensure that the width of horizontally adjacent regions is consistent, the width of each cell within each region is consistent, and the height of each row in the table is consistent (except for vertical cell merging). As shown in Figure 3, the basic information table of this electronic document can be partially decomposed into 9 modules. After normalization, modules 1 and 2 are vertically merged cells, while modules 3-9 contain 8, 6, 8, 2, 2, 4, and 8 cells respectively. After preprocessing, the table structure is further simplified, forming a new table composed of several normalized modules. The content of the first cell of each module is retained, allowing similarity calculations to consider both the visual features of the table (i.e., the number of cells and layout consistency) and some of the table's content, thus laying a data foundation for more accurate table similarity calculations.

[0086] After the table preprocessing is completed, visual similarity and cell content similarity are calculated in sequence. In the visual similarity calculation, if the number of modules and the number of cells in each module are exactly the same in both tables, the visual similarity is 1. Cell content similarity is calculated only for tables with a visual similarity of 1. During implementation, the contents of corresponding cells in the two tables are compared one by one. If all cells are the same, the cell content similarity is 1; otherwise, the cell content similarity is 0. Let the visual similarity between the basic information tables of document i and document j be $s(v_i, v_j)$, the cell content similarity be $s(con_i, con_j)$, and the overall similarity of the basic information tables be $s(b_i, b_j)$. The calculation method for $s(b_i, b_j)$ is shown below.

[0087]
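As an illustration only, the two component comparisons described in [0086] can be sketched in Python as follows. The representation of a preprocessed table as a list of modules (each a list of cell texts), the function names, and the choice to return 0 whenever the layouts differ are assumptions made for this sketch and are not taken from the original. The sketch does not attempt the step that combines the two values into $s(b_i, b_j)$.

```python
from typing import List

# Hypothetical representation: a preprocessed table is a list of modules,
# and each module is the list of its cell texts after normalization
# (only the first cell of a module keeps its text, the rest are empty).
Table = List[List[str]]


def visual_similarity(a: Table, b: Table) -> int:
    """Return 1 if both tables have the same number of modules and the
    same number of cells in each module; otherwise 0 (assumed)."""
    if len(a) != len(b):
        return 0
    return int(all(len(ma) == len(mb) for ma, mb in zip(a, b)))


def content_similarity(a: Table, b: Table) -> int:
    """Return 1 if every pair of corresponding cells is identical, else 0.

    The source only computes this for tables whose visual similarity is 1;
    returning 0 for other tables is an assumption of this sketch.
    """
    if visual_similarity(a, b) != 1:
        return 0
    return int(all(ca == cb
                   for ma, mb in zip(a, b)
                   for ca, cb in zip(ma, mb)))
```

For example, two tables with identical module layouts but a different label in one retained cell would give a visual similarity of 1 and a cell content similarity of 0.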

[0088] ④ Project Member Table Similarity Calculation Method. If both project documents contain a project member table and their headers are completely identical, their similarity is 1; otherwise, the similarity is 0. Accordingly, assume that the header of the project member table in document i contains $m_i$ cells and the header of the project member table in document j contains $m_j$ cells. If k cells have exactly the same value, the similarity $s(m_i, m_j)$ between the project member tables of documents i and j is calculated as shown below.

[0089] $$ s(m_i, m_j) = \begin{cases} 1, & m_i = m_j = k \\ 0, & \text{otherwise} \end{cases} $$

[0090] In the formula above, k represents the number of identical header cells. For example, if the header in document i has 10 cells and the header in document j has 10 cells, and all 10 cells are identical, then $m_i = m_j = k$ and the similarity is 1; if only 9 cells are identical, the similarity is 0.
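A minimal Python sketch of the all-or-nothing rule in [0088] and [0090] is given below. The function name, the position-by-position comparison of header cells, and the treatment of an empty header as "no project member table" are assumptions made for illustration.

```python
from typing import Sequence


def member_table_similarity(header_i: Sequence[str],
                            header_j: Sequence[str]) -> int:
    """Return 1 only when both headers exist and are completely identical."""
    m_i, m_j = len(header_i), len(header_j)
    # Assumption: an empty header means the document has no member table.
    if m_i == 0 or m_j == 0:
        return 0
    # k: number of header cells with exactly the same value, compared
    # position by position (the comparison order is an assumption).
    k = sum(1 for a, b in zip(header_i, header_j) if a == b)
    return 1 if m_i == m_j == k else 0
```

With two 10-cell headers, the function returns 1 when all 10 cells agree and 0 when only 9 agree, matching the example in [0090].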

[0091] Furthermore, the methods for parsing the specific content of the cover, basic information table, and project member table include:

[0092] (1) Cover Analysis. A piece of basic information on the document cover usually appears together with the name of its basic information item and is located to the right of that name. The cover is divided into multiple lines, and each line is matched against the basic information table dictionary from left to right. If a line as a whole matches a word in the basic information table dictionary, the line contains only the item name and no specific content for that item. Otherwise, the longest string in the line that matches a word in the basic information table dictionary is taken as the basic information metadata item, and the remaining content of the line is taken as the specific content of that metadata item.
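The cover analysis above can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the function name, the use of a prefix match to model "from left to right", the separator characters that are stripped, and the use of `None` for items without content are all assumptions.

```python
from typing import Dict, Iterable, Optional


def parse_cover(lines: Iterable[str],
                dictionary: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each recognized basic information item to its content.

    A value of None means the line held only the item name.
    """
    # Try longer dictionary words first so that the longest match wins.
    words = sorted(set(dictionary), key=len, reverse=True)
    result: Dict[str, Optional[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line in words:
            # The whole line is an item name: no specific content here.
            result[line] = None
            continue
        for word in words:
            if line.startswith(word):
                # The rest of the line is the content; drop separators.
                result[word] = line[len(word):].strip(" :：\t")
                break
    return result
```

Lines that match no dictionary word are simply skipped in this sketch; how the original method treats them is not stated.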

[0093] (2) Basic Information Table Parsing. Identify the basic information metadata items and values in the basic information table, establish the association between each basic information metadata item cell and its value cell, and determine the value of each basic information metadata item based on the association.

[0094] In the basic information table, metadata items are distributed sparsely and may appear in any area, but within the same template the value of each metadata item cell is the same in every document. Each document template corresponds to multiple document samples. First, the document samples that match the template are analyzed statistically: for each cell, the number of samples in which the cell at the corresponding position has the same value is counted. If this number equals the number of samples, the cell is considered a metadata item; the other cells are considered metadata values. Next, the relationship between each metadata item cell and its value cell is established. Analysis shows that this relationship follows a right-then-below rule: the value is first sought in the cell to the right, and if there is no value on the right, it is sought in the cell below. There are four main types of relationships, as shown in Table 2. Based on the positional distribution of metadata item cells and value cells, the relationship between the two types of cells can be established automatically. Take the L3 relationship as an example: if the cell to the right of metadata item C is itself a metadata item, the value is sought in the adjacent cell below C; if that cell is not a metadata item, it is the value of metadata item C.

[0095] Table 2 shows the positional relationship between metadata items and value cells.

[0096]

[0097]
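The two steps described in [0094] (finding metadata item cells by their consistency across samples, then locating each value with the right-then-below rule) can be sketched in Python as follows. The grid representation, the function names, the requirement that all samples have the same shape, and the exclusion of empty cells from the metadata items are assumptions of this sketch; it covers only the right-then-below rule and does not model the four relationship types of Table 2 individually.

```python
from typing import Dict, List, Optional, Set, Tuple

Grid = List[List[str]]   # hypothetical: a table as rows of cell texts
Pos = Tuple[int, int]    # (row index, column index)


def find_item_cells(samples: List[Grid]) -> Set[Pos]:
    """Cells whose text is identical in every sample of the template."""
    items: Set[Pos] = set()
    for r, row in enumerate(samples[0]):
        for c, text in enumerate(row):
            matches = sum(1 for s in samples if s[r][c] == text)
            # Assumption: empty cells are never metadata items.
            if text and matches == len(samples):
                items.add((r, c))
    return items


def locate_value(grid: Grid, items: Set[Pos], pos: Pos) -> Optional[Pos]:
    """Right-then-below rule: try the right neighbour, then the one below."""
    r, c = pos
    for cr, cc in ((r, c + 1), (r + 1, c)):
        in_grid = cr < len(grid) and cc < len(grid[cr])
        if in_grid and (cr, cc) not in items:
            return (cr, cc)
    return None


def extract_values(grid: Grid, items: Set[Pos]) -> Dict[str, Optional[str]]:
    """Pair every metadata item in one document with its value text."""
    out: Dict[str, Optional[str]] = {}
    for r, c in sorted(items):
        target = locate_value(grid, items, (r, c))
        out[grid[r][c]] = grid[target[0]][target[1]] if target else None
    return out
```

In this sketch a value cell that happens to be identical in every sample would be misclassified as a metadata item; a larger or more varied set of samples reduces that risk.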

[0098] (3) Project Member Table Parsing. Identify the basic information metadata items and corresponding values of the project member table based on the project member table dictionary. Since the headers of the project member table are all row headers, cells whose values are consistent across documents can be regarded as the header, and the other cells can be regarded as the body; then, based on this division into header and body, a one-to-many mapping between metadata items and values can be established.
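The one-to-many mapping mentioned above can be illustrated with a short Python sketch. It assumes, for simplicity, that the header has already been identified as the first row of the table; the function name and the row-list representation are likewise hypothetical.

```python
from typing import Dict, List


def parse_member_table(rows: List[List[str]]) -> Dict[str, List[str]]:
    """Map each header cell to the list of values in its column."""
    # Assumption: the first row is the header, the remaining rows the body.
    header, body = rows[0], rows[1:]
    return {
        name: [row[col] for row in body if col < len(row)]
        for col, name in enumerate(header)
    }
```

Each metadata item (header cell) is thus associated with one value per project member.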

[0099] S103, Construct a set of structural information fields, and extract project structural information from project electronic documents based on modal elements and the set of structural information fields.

[0100] Furthermore, the set of structured information fields includes several structured information fields, and each structured information field includes several constituent elements. For example, the structured information field "research framework" may be expressed differently in different documents, such as "overall research framework" or "research architecture". "Overall research framework", "research architecture", and "research framework" constitute the constituent elements of the structured information field "research framework".

[0101] Extracting project structure information includes:

[0102] Let $q_i$ be the i-th paragraph in the project electronic document, and let $s_{\text{field-}j}$ be the j-th structural information field in the set of structural information fields. If $q_i$ contains any one of the constituent elements of $s_{\text{field-}j}$, the specific content of $q_i$ is taken as the structural information corresponding to $s_{\text{field-}j}$.
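The matching rule in [0102] can be sketched as a simple substring test. In this illustrative Python sketch, the dictionary layout (field name mapped to its constituent elements), the case-insensitive comparison, and the decision to store the matching paragraph text itself as the "specific content" are assumptions rather than details given in the original.

```python
from typing import Dict, List


def extract_structure(paragraphs: List[str],
                      fields: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Collect, for each structural information field, the paragraphs
    that contain at least one of its constituent elements."""
    found: Dict[str, List[str]] = {name: [] for name in fields}
    for q in paragraphs:
        q_folded = q.casefold()
        for name, elements in fields.items():
            # Assumption: matching is a case-insensitive substring test.
            if any(e.casefold() in q_folded for e in elements):
                found[name].append(q)
    return found
```

Using the example from [0100], a field "research framework" with the constituent elements "overall research framework", "research architecture", and "research framework" would match a heading such as "2. Overall Research Framework".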

[0103] S104, Store or display the basic project information and project structure information according to preset specifications.

[0104] The preset specifications can be flexibly set as needed.

[0105] In one embodiment, a JSON structure is used to store or display the parsed content. The document's own information, including "Document Name (wdName)," "Document Size (wdFilesize)," "Document Parsing Status (wdAnalystate)," "Parsing Failure Reason (wdErrormemo)," "Is it an Attachment (wdIsfj)," and "Document Type (wdType)," along with "Basic Information (basic)" and "Structure Information (structure)," is placed at the first level of the JSON structure. Because the person in charge and research team members require special handling during subsequent database entry, "Person in Charge (fzr)" and "Research Team (wcr)" are also placed at the first level. Other basic information is placed in a second-level array of "Basic Information," and their corresponding JSON names are shown in the attached "Information Field Correspondence Table." In the second-level structure, "id" and "location" indicate the relative position of the corresponding field in the article, and "name" and "construction" indicate the field name. "content" represents the content corresponding to the field, "element" represents the element structure of the content, and "constructname" represents the synonym of the structure information field in the article.
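To make the layout in [0105] concrete, the following Python sketch assembles one such JSON object. The key names are those listed above; every value, the value types, the helper name `build_record`, the JSON name "projectName", and the assignment of "name" to basic entries and "construction" to structure entries are placeholders or assumptions, since the original does not specify them here.

```python
import json
from typing import Any, Dict, List


def build_record(doc_name: str, file_size: int, leader: str,
                 team: List[str], basic: List[Dict[str, Any]],
                 structure: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble the first-level JSON object (all values are placeholders)."""
    return {
        "wdName": doc_name,
        "wdFilesize": file_size,
        "wdAnalystate": "parsed",  # placeholder; real status values unknown
        "wdErrormemo": "",         # assumed empty when parsing succeeds
        "wdIsfj": False,
        "wdType": "docx",          # placeholder
        "fzr": leader,             # person in charge, kept at the first level
        "wcr": team,               # research team, kept at the first level
        "basic": basic,
        "structure": structure,
    }


record = build_record(
    "example.docx", 20480, "Zhang San", ["Li Si", "Wang Wu"],
    basic=[{"id": 1, "location": 3,
            "name": "projectName", "content": "Example project"}],
    structure=[{"id": 1, "location": 12,
                "construction": "research framework",
                "constructname": "overall research framework",
                "element": "text",
                "content": "Paragraph text goes here"}],
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps non-ASCII text such as Chinese field values readable in the serialized output.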

[0106] An embodiment of the present invention provides a structured parsing system for project electronic documents, comprising:

[0107] The modality recognition module is used to identify paragraphs in the project's electronic documents and the modal elements of each paragraph. The modal elements are used to distinguish the content types of different paragraphs.

[0108] The basic information extraction module is used to identify the cover, basic information table and project member table of the project electronic document based on modal elements, and extract basic project information from the cover, basic information table and project member table;

[0109] The project structure information extraction module is used to construct a set of structure information fields and extract project structure information from project electronic documents based on modal elements and the set of structure information fields.

[0110] The standardization module is used to store or display basic project information and project structure information according to preset standards.

[0111] The specific implementation and technical effects of the structured parsing system are the same as those of the structured parsing method described above, and will not be repeated here.

[0112] The foregoing description is merely an exemplary embodiment of this disclosure and should not be construed as limiting the scope of this disclosure. Any equivalent changes and modifications made in accordance with the teachings of this disclosure shall still fall within the scope of this disclosure. Those skilled in the art will readily conceive of embodiments of this disclosure upon considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not described herein. The specification and embodiments are to be considered exemplary only, and the scope and spirit of this disclosure are defined by the claims.

[0113] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0114] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A structured parsing method for project electronic documents, characterized in that it comprises: identifying paragraphs in the project electronic document and the modal elements of each paragraph, wherein the modal elements are used to distinguish the content types of different paragraphs; identifying the cover, basic information table, and project member table of the project electronic document based on the modal elements, and extracting basic project information from the cover, basic information table, and project member table, which specifically includes: pre-constructing a semantic annotation rule base, which stores different document templates and the semantic annotation rules corresponding to each document template, the semantic annotation rules including parsing methods for the specific content of the cover, basic information table, and project member table, as well as the basic information table dictionary and project member table dictionary on which the parsing depends; calculating the project type similarity, project cover similarity, basic information table similarity, and project member table similarity between the project electronic document and each document template in the semantic annotation rule base, respectively; weighting the project type similarity, project cover similarity, basic information table similarity, and project member table similarity according to preset weights to obtain the overall similarity between the project electronic document and each document template in the semantic annotation rule base; if the semantic annotation rule base contains a document template whose overall similarity is greater than a preset threshold, determining that the semantic annotation rule base stores the document template used by the project electronic document, and extracting the basic project information using the semantic annotation rules corresponding to that document template; otherwise, customizing a corresponding semantic annotation rule for the document template used by the project electronic document, extracting the basic project information using the customized semantic annotation rule, and storing the document template used by the project electronic document and its corresponding semantic annotation rule in the semantic annotation rule base as an update; constructing a set of structural information fields, and extracting project structure information from the project electronic document based on the modal elements and the set of structural information fields; and storing or displaying the basic project information and the project structure information according to preset specifications.

2. The structured parsing method for project electronic documents as described in claim 1, characterized in that, before the paragraphs in the project electronic document and the modal elements of each paragraph are identified, the project electronic document is preprocessed, the preprocessing including: constructing a coding type feature library, which stores the features corresponding to each coding type; determining the candidate coding type of the project electronic document according to the file name suffix of the project electronic document; if the project electronic document includes the features corresponding to the candidate coding type, using the candidate coding type as the coding type of the project electronic document; and converting the project electronic document into a standard format document according to the coding type of the project electronic document.

3. The structured parsing method for project electronic documents as described in claim 1, characterized in that identifying the paragraphs in the project electronic document and the modal elements of each paragraph includes: extracting the paragraphs from the project electronic document and the original document tags of each paragraph; identifying the modal elements of each paragraph based on the original document tags; and marking the modal elements of each paragraph in sequential order.

4. The structured parsing method for project electronic documents as described in claim 1, characterized in that the modal elements include any combination of documents, tables, images, code, and formulas.

5. The structured parsing method for project electronic documents as described in claim 4, characterized in that identifying the cover, basic information table, and project member table of the project electronic document based on the modal elements includes: constructing a cover content dictionary and cover recognition rules, taking the area of the project electronic document in which words from the cover content dictionary appear as the initial location area of the cover, and then determining the cover position based on the initial location area and the preset cover recognition rules; constructing a basic information table dictionary, and taking the table in the project electronic document that contains words from the basic information table dictionary as the basic information table; and constructing a project member table dictionary, and taking the table in the project electronic document that contains words from the project member table dictionary as the project member table.

6. The structured parsing method for project electronic documents as described in claim 1, characterized in that the methods for parsing the specific content of the cover, basic information table, and project member table include: dividing the cover into multiple lines and matching each line of the cover against the basic information table dictionary from left to right, wherein if a line matches a word in the basic information table dictionary, the line contains no specific content of the basic information item, and otherwise the longest string in the line that matches a word in the basic information table dictionary is used as the basic information metadata item and the remaining content of the line is used as the specific content of the basic information metadata item; identifying the basic information metadata items and values in the basic information table, establishing the association between each basic information metadata item cell and its value cell, and determining the value of each basic information metadata item based on the association; and identifying the basic information metadata items and corresponding values of the project member table based on the project member table dictionary.

7. The structured parsing method for project electronic documents as described in claim 1, characterized in that the set of structural information fields includes several structural information fields, and each structural information field includes several constituent elements; and extracting the project structure information includes: denoting the i-th paragraph in the project electronic document as $q_i$ and the j-th structural information field in the set of structural information fields as $s_{\text{field-}j}$; if $q_i$ contains any one of the constituent elements of $s_{\text{field-}j}$, taking the specific content of $q_i$ as the structural information corresponding to $s_{\text{field-}j}$.

8. A structured parsing system for project electronic documents, characterized in that it comprises: a modality recognition module, used to identify paragraphs in the project electronic document and the modal elements of each paragraph, wherein the modal elements are used to distinguish the content types of different paragraphs; a basic information extraction module, used to identify the cover, basic information table, and project member table of the project electronic document based on the modal elements; to pre-construct a semantic annotation rule base, which stores different document templates and the semantic annotation rules corresponding to each document template, the semantic annotation rules including the parsing methods for the specific content of the cover, basic information table, and project member table, as well as the basic information table dictionary and project member table dictionary on which the parsing depends; to calculate the project type similarity, project cover similarity, basic information table similarity, and project member table similarity between the project electronic document and each document template in the semantic annotation rule base, respectively; to weight the project type similarity, project cover similarity, basic information table similarity, and project member table similarity according to preset weights to obtain the overall similarity between the project electronic document and each document template in the semantic annotation rule base; and, if the semantic annotation rule base contains a document template whose overall similarity is greater than a preset threshold, to determine that the semantic annotation rule base stores the document template used by the project electronic document and to extract the basic project information using the semantic annotation rules corresponding to that document template, and otherwise to customize a corresponding semantic annotation rule for the document template used by the project electronic document, to extract the basic project information using the customized semantic annotation rule, and to store the document template used by the project electronic document and its corresponding semantic annotation rule in the semantic annotation rule base as an update; a project structure information extraction module, used to construct a set of structural information fields and to extract project structure information from the project electronic document based on the modal elements and the set of structural information fields; and a standardization module, used to store or display the basic project information and the project structure information according to preset specifications.