A method and device for identifying a table of contents of an article in the financial field

CN116151224B · Active · Publication Date: 2026-06-30 · BEIJING RONGDAKEJI CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING RONGDAKEJI CO LTD
Filing Date
2023-02-16
Publication Date
2026-06-30

Abstract

This invention provides a method and apparatus for identifying the table of contents (directory) of articles in the financial field, relating to the field of data processing technology. The apparatus includes a part-of-speech tagging module, a preprocessing module, a processing module, and a generation module. The part-of-speech tagging module is a supporting module of the apparatus. The preprocessing and processing modules extract the directory based on the part-of-speech tagging module. The generation module outputs the results generated by the processing module. The part-of-speech tagging module initializes directory part-of-speech tags, nshort part-of-speech tags, and custom part-of-speech tags, exhaustively lists three types of first-level directory features found in prospectuses, and supports custom directory features. By handling special directories, the method improves recognition accuracy, achieving an accuracy rate of over 95% for file directories and over 90% for scanned documents, and it is efficient and fast.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, specifically to a method and apparatus for identifying article catalogs in the financial field.

Background Technology

[0002] Articles in the financial field, such as prospectuses, IPO prospectuses, annual reports, IPO plans, bond listings, and audit reports, often run to hundreds of pages. Finding information in hundreds of pages of content is laborious and greatly reduces efficiency. With an extracted table of contents, users can quickly grasp the outline of the article and locate the corresponding section, which improves the reading experience.

[0003] Current technology can only extract table-of-contents content based on a single table of contents field, outline level, or style level, and then match this content with page numbers to create a table of contents. It performs poorly on special tables of contents.

[0004] Therefore, those skilled in the art provide a method and apparatus for identifying the directory of articles in the financial field. The apparatus is based on HTML structure, combines the HanLP Chinese language processing package with a custom part-of-speech database for the financial field, extracts the directory hierarchy of source files from HTML structured data, and supports content location, so as to solve the problems described in the background above.

Summary of the Invention

[0005] (I) Technical Problems to Be Solved

[0006] To address the shortcomings of existing technologies, this invention provides a method and apparatus for recognizing the table of contents of articles in the financial field. It can process special directories and improve recognition accuracy, achieving an accuracy rate of over 95% for document directories and over 90% for scanned documents, and it is efficient and fast. It solves the problem that current technologies, when extracting article directory content, can only extract based on a single directory field, outline level, or style level and then match this content with page numbers to create a table of contents, which results in poor recognition performance for special directories.

[0007] (II) Technical Solution

[0008] To achieve the above objectives, the present invention provides the following technical solution:

[0009] A device for identifying the table of contents of articles in the financial field includes a part-of-speech tagging module, a preprocessing module, a processing module, and a generation module. The part-of-speech tagging module is a supporting module of the device. The preprocessing module and the processing module extract the table of contents based on the part-of-speech tagging module. The generation module outputs the result generated by the processing module.

[0010] Part-of-speech tag module: initializes directory part-of-speech tags, nshort part-of-speech tags, and custom part-of-speech tags (circled numbers, uppercase numbers), exhaustively lists three types of first-level directory features found in prospectuses, and supports custom directory features.
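As an illustration of the custom part-of-speech tags mentioned above, the following is a minimal sketch of how heading prefixes with circled numbers or Chinese numerals might be recognized. The tag names, the regular expressions, and the reading of "uppercase numbers" as Chinese numerals are assumptions for illustration only; the patent does not disclose its actual tag identifiers or dictionary entries.

```python
import re
from typing import Optional

# Hypothetical tag names and patterns; not taken from the patent.
CUSTOM_TAG_PATTERNS = [
    # Circled numbers U+2460..U+2473, i.e. ① through ⑳
    ("circled_number", re.compile(r"^[\u2460-\u2473]")),
    # Chinese numerals (ordinary and financial forms) followed by a separator
    ("uppercase_number",
     re.compile(r"^[一二三四五六七八九十百壹贰叁肆伍陆柒捌玖拾]+[、.．]")),
]


def custom_tag(text: str) -> Optional[str]:
    """Return the first custom tag whose pattern matches the start of the text."""
    stripped = text.strip()
    for tag, pattern in CUSTOM_TAG_PATTERNS:
        if pattern.match(stripped):
            return tag
    return None
```

In a real deployment these patterns would more likely live in the segmenter's custom dictionary than in standalone regular expressions; the sketch only shows the kind of prefixes involved.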

[0011] Preprocessing module: Obtains table of contents features from the table of contents chapters, finds the line containing the table of contents, and determines whether the next line of the table of contents contains the table of contents features. If it does, it returns the first-level table of contents features.

[0012] If the previous step did not find the directory features, logical reasoning is required. Analysis of sample features shows that, in most files, the first-level directory features follow a priority order that decreases from left to right. Because line breaks may cause non-directory text to be treated as a directory and disturb the processing logic, directory features are counted with the requirement that the first item of the current feature must appear before subsequent directories with the same feature are considered valid.

[0013] After the occurrence counts of the first-level directory features are obtained in the previous step, feature tags with an occurrence count less than 1 are removed, and the remaining feature tag with the highest priority is returned as the final first-level directory feature.
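The selection rule described above (drop features that never occurred, then return the highest-priority survivor) can be sketched as follows. The feature names and the priority list in the usage note are hypothetical; only the rule itself comes from the text.

```python
from typing import Dict, List, Optional


def pick_first_level_feature(counts: Dict[str, int],
                             priority: List[str]) -> Optional[str]:
    """Return the highest-priority feature that occurred at least once.

    `priority` is ordered from highest to lowest priority, mirroring the
    "decreasing from left to right" ordering described in the text.
    """
    for feature in priority:
        # Features with an occurrence count below 1 are discarded.
        if counts.get(feature, 0) >= 1:
            return feature
    return None
```

For example, with the hypothetical priority list `["section", "chapter", "numbered"]` and counts `{"section": 0, "chapter": 3, "numbered": 5}`, the function returns `"chapter"`: the more frequent `"numbered"` feature loses because it has lower priority.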

[0014] Processing module: By adding custom part-of-speech tags to the statistically analyzed directory, it performs word segmentation on paragraphs with p tags, and adds the iscatalog attribute to those containing directory features.

[0015] The generation module aggregates the identified directories, such as adding the IDs of the first and second level directories to the third level directory, and generates HTML with a left-right structure containing anchor points.
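One way to picture the aggregation step is to give every directory entry an ID that embeds the IDs of its ancestors, so a third-level entry carries its first- and second-level positions. The sketch below assumes a flat list of `(level, title)` pairs and a hypothetical `catalog-` ID prefix; neither the input shape nor the ID format is specified in the patent.

```python
from typing import Dict, List, Tuple


def assign_directory_ids(entries: List[Tuple[int, str]]) -> List[Dict[str, str]]:
    """Assign hierarchical IDs such as catalog-1-2-1 to (level, title) pairs."""
    counters: List[int] = []  # counters[i] = current index at level i + 1
    result: List[Dict[str, str]] = []
    for level, title in entries:
        del counters[level:]            # moving back up resets deeper levels
        while len(counters) < level:    # pad if a level was skipped
            counters.append(0)
        counters[level - 1] += 1
        directory_id = "catalog-" + "-".join(str(n) for n in counters)
        result.append({"id": directory_id, "title": title})
    return result
```

Because each ID is also usable as an HTML anchor, the same value can serve both for aggregation and for the links in the generated navigation pane.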

[0016] In the preprocessing module, when “Section 3 xxx” is encountered, the frequency of mq occurrences is not counted immediately. Only when “Section 1” is encountered is the mq part-of-speech occurrence recorded as 1; when “Section 2” is then encountered, the mq count is incremented.
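The counting rule above can be sketched as follows, using the English "Section N" form that appears in this translation. The regular expression and the function name are assumptions for illustration; the original system works on part-of-speech tags rather than on a regular expression.

```python
import re
from typing import Iterable

# Hypothetical stand-in for an mq-tagged heading prefix.
SECTION_RE = re.compile(r"^Section\s+(\d+)\b")


def count_sequential_sections(lines: Iterable[str]) -> int:
    """Count section headings, starting only once "Section 1" has been seen."""
    count = 0
    for line in lines:
        match = SECTION_RE.match(line.strip())
        if not match:
            continue
        if count == 0:
            # Ignore headings such as "Section 3 xxx" that appear before
            # the first item; they are likely wrapped body text.
            if int(match.group(1)) == 1:
                count = 1
        else:
            count += 1
    return count
```

The sketch increments on any later heading once the first item has appeared, which is the behavior the text describes; a stricter variant could also require consecutive numbering.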

[0017] In the processing module, recognition is optimized when segmenting p-tag paragraphs. Specifically, HanLP part-of-speech tagging identifies dates or certain Chinese characters as the m part of speech, so these are excluded by rule. In addition, the w (symbol) part of speech is restricted to commas, dots, full-width and half-width brackets, etc.
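A minimal sketch of these two filtering rules is shown below. It operates on `(word, tag)` pairs of the kind a segmenter produces and does not call HanLP itself. The date pattern and the exact set of allowed symbols are assumptions, since the text only gives examples followed by "etc."

```python
import re

# Assumed date shapes: 2023-02, 2023/2, 2023.02, or a number followed by 年/月/日.
DATE_RE = re.compile(r"\d{4}[-/.]\d{1,2}|\d+\s*[年月日]")

# Assumed whitelist: commas, dots, and full-width and half-width brackets.
ALLOWED_SYMBOLS = set(",，.．、()（）")


def keeps_directory_feature(word: str, tag: str) -> bool:
    """Return True if a token may still count toward a directory feature."""
    if tag == "m":
        # Numeral tokens that are really dates are excluded by rule.
        return DATE_RE.search(word) is None
    if tag == "w":
        # Only whitelisted punctuation is allowed in a directory prefix.
        return word in ALLOWED_SYMBOLS
    return True
```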

[0018] A method for identifying article catalogs in the financial field includes the following steps:

[0019] S1. Initialize the part-of-speech tag library;

[0020] S2. Input the HTML address to be processed and the target HTML address to be output;

[0021] S3. Load the HTML file that needs to be processed;

[0022] S4. Import the part-of-speech tag database;

[0023] S5. Extract the first-level directory features from the HTML document that needs to be processed;

[0024] S6. Find all directories from the processed HTML (with special handling for the three types of prospectus directory features), tag the directories and generate directory IDs, and construct a directory tree structure based on the directory IDs;

[0025] S7. Generate an HTML document with left and right structures based on the directory tree structure, and finally write it to the target HTML path.

[0026] The initialization of the part-of-speech tag library in step S1 includes initializing the directory part-of-speech tags, initializing the nshort part-of-speech tags, and initializing custom part-of-speech tags.

[0027] After the process in step S5 is completed, it is determined whether directory features have been extracted. If they have, the first-level directory part-of-speech model list of the HTML file is returned. If they have not, the tagName of each Element in the HTML is traversed: table elements are skipped, spaces are removed, and text longer than 500 characters is skipped. The first 50 characters are used to determine the part of speech; text that is empty or shorter than 2 characters is skipped. Text that does not meet the first-level directory features is skipped; text that does is stored in the first-level directory list with its part of speech added. If there are more than two such entries, the feature is considered a first-level directory, and the first-level directory part-of-speech model list of the HTML file is returned. If it is not a first-level directory, the part of speech of the current HTML is matched against the part-of-speech library; if no match is found, the mq part of speech is returned.
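The element filtering in this fallback path can be sketched as follows. The input is assumed to be a sequence of `(tag_name, text)` pairs already pulled from the parsed HTML, and the set of table-related tag names is an assumption based on reading the skip rule as "skip table elements". The length thresholds (500, 2, and 50) come from the text.

```python
from typing import Iterable, Iterator, Tuple

# Assumed interpretation of the skip rule: ignore table markup entirely.
TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "td", "th"}


def candidate_prefixes(elements: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield the first 50 characters of each element worth tagging."""
    for tag_name, text in elements:
        if tag_name.lower() in TABLE_TAGS:
            continue
        compact = "".join(text.split())  # remove all whitespace
        if len(compact) > 500 or len(compact) < 2:
            continue  # too long to be a heading, or empty or too short
        yield compact[:50]
```

Each yielded prefix would then be passed to the part-of-speech tagger to decide whether it matches a first-level directory feature.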

[0028] The process in step S6 is as follows:

[0029] Iterate through the tagName of each Element in the HTML, skipping table elements, removing spaces, and skipping text longer than 500 characters. Take the first 50 characters to determine the part of speech; skip text that is empty or shorter than 2 characters, and skip text that does not meet the characteristics of a first-level directory. Then determine whether it is a financial part-of-speech directory. If not, add a mark to the special directory tag between directories and generate a directory ID. If not, skip, add a mark, generate a directory ID, and label the tag as a directory. Then build standard data for all directories. Finally, build a directory tree structure based on the directory IDs.
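The final step, building a tree from directory IDs, can be sketched as below. It assumes hyphen-separated hierarchical IDs such as `1`, `1-1`, and `1-1-2`, listed in document order; this ID format is an illustration and is not specified in the patent.

```python
from typing import Dict, List


def build_directory_tree(items: List[Dict[str, str]]) -> List[dict]:
    """Nest flat directory records under their parents using their IDs."""
    roots: List[dict] = []
    by_id: Dict[str, dict] = {}
    for item in items:
        node = {"id": item["id"], "title": item["title"], "children": []}
        by_id[item["id"]] = node
        # The parent ID is everything before the last hyphen ("" for top level).
        parent = by_id.get(item["id"].rpartition("-")[0])
        if parent is not None:
            parent["children"].append(node)
        else:
            roots.append(node)
    return roots
```

Entries whose parent ID is unknown are kept as top-level nodes rather than dropped, so a mis-detected level does not lose content.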

[0030] The process in step S7 is as follows:

[0031] 1) Generate a tree structure HTML file based on the directory tree structure;

[0032] 2) Remove the original directory (table of contents) from the HTML;

[0033] 3) Construct the left and right structure HTML;

[0034] 4) Write the result to the target file path.
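The left-right output described in step S7 can be sketched as a page with a navigation pane on the left and the document body on the right, linked through anchors. The inline styles, class names, and input shape below are assumptions for illustration; the patent does not describe the generated markup.

```python
from html import escape
from typing import List, Tuple


def render_left_right(entries: List[Tuple[int, str, str]], body_html: str) -> str:
    """Render (level, anchor_id, title) entries beside the document body."""
    links = "\n".join(
        f'<li class="level-{level}">'
        f'<a href="#{escape(anchor, quote=True)}">{escape(title)}</a></li>'
        for level, anchor, title in entries
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><body style="display:flex">\n'
        f'<nav style="width:25%"><ul>\n{links}\n</ul></nav>\n'
        f'<main style="width:75%">\n{body_html}\n</main>\n'
        "</body></html>\n"
    )
```

The body HTML is expected to already contain elements whose `id` attributes match the anchors, for example `<p id="catalog-1">`. Writing the returned string to the target path completes the step.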

[0035] (III) Beneficial Effects

[0036] This invention provides a method and apparatus for identifying article catalogs in the financial field. It has the following beneficial effects:

[0037] 1. This invention provides a method and apparatus for recognizing article directories in the financial field. The method processes special directories to improve recognition accuracy, and the recognition accuracy of document directories reaches more than 95%, and the accuracy of scanned documents reaches more than 90%, which is very efficient and fast.

[0038] 2. This invention provides a method and apparatus for identifying the table of contents of articles in the financial field. Through the cooperation of multiple modules and multiple steps of text recognition and analysis, the method optimizes the recognition of the table of contents. Because HanLP tags dates or certain Chinese characters as the m part of speech, rule-based exclusion is applied, and the w (symbol) part of speech is restricted to symbols such as commas, periods, and full-width and half-width brackets. This makes the method applicable not only to the extraction of Chinese table of contents pages but also to the extraction of table of contents pages in other languages, thus broadening its applicability.

Attached Figure Description

[0039] Figure 1 is a flowchart of the part-of-speech tag module of the present invention;

[0040] Figure 2 is a flowchart of the preprocessing module of the present invention;

[0041] Figure 3 is a flowchart of the processing module of the present invention;

[0042] Figure 4 is a flowchart of the generation module of the present invention;

[0043] Figure 5 is a schematic diagram showing the connection relationship between the modules of the present invention;

[0044] Figure 6 is a detailed flowchart of the steps of the present invention.

Detailed Implementation

[0045] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0046] Example:

[0047] As shown in Figures 1-6, this embodiment of the invention provides an apparatus for identifying the directory of articles in the financial field, including a part-of-speech tagging module, a preprocessing module, a processing module, and a generation module. The part-of-speech tagging module is a supporting module of the apparatus. The preprocessing module and the processing module extract the directory based on the part-of-speech tagging module. The generation module outputs the result generated by the processing module.

[0048] Part-of-speech tag module: initializes directory part-of-speech tags, nshort part-of-speech tags, and custom part-of-speech tags (circled numbers, uppercase numbers), exhaustively lists three types of first-level directory features found in prospectuses, and supports custom directory features.

[0049] Preprocessing module: Obtains table of contents features from the table of contents chapters, finds the line containing the table of contents, and determines whether the next line of the table of contents contains the table of contents features. If it does, it returns the first-level table of contents features.

[0050] If the previous step did not find the directory features, logical reasoning is required. Analysis of sample features shows that, in most files, the first-level directory features follow a priority order that decreases from left to right. Because line breaks may cause non-directory text to be treated as a directory and disturb the processing logic, directory features are counted with the requirement that the first item of the current feature must appear before subsequent directories with the same feature are considered valid.

[0051] After the occurrence counts of the first-level directory features are obtained in the previous step, feature tags with an occurrence count less than 1 are removed, and the remaining feature tag with the highest priority is returned as the final first-level directory feature.

[0052] Processing module: By adding custom part-of-speech tags to the statistically analyzed directory, it performs word segmentation on paragraphs with p tags, and adds the iscatalog attribute to those containing directory features.

[0053] The generation module performs aggregation operations on the identified directories, such as adding the IDs of the first and second level directories to the third level directory, and generates an HTML document with left and right structure and anchor points.

[0054] In the preprocessing module, when "Section 3 xxx" is encountered, the frequency of mq occurrences is not counted immediately. Only when "Section 1" is encountered is the mq part-of-speech occurrence recorded as 1; when "Section 2" is subsequently encountered, the mq count is incremented. In the processing module, recognition is optimized when segmenting p-tag paragraphs. Specifically, HanLP part-of-speech tagging identifies dates or certain Chinese characters as the m part of speech, so these are excluded by rule. In addition, the w (symbol) part of speech is restricted to commas, dots, full-width and half-width brackets, etc.

[0055] The method for identifying the financial article catalog includes the following steps:

[0056] S1. Initialize the part-of-speech tag library. Initializing the part-of-speech tag library includes initializing the directory part-of-speech tags, initializing the nshort part-of-speech tags, and initializing custom part-of-speech tags.

[0057] S2. Input the HTML address to be processed and the target HTML address to be output;

[0058] S3. Load the HTML file that needs to be processed;

[0059] S4. Import the part-of-speech tag database;

[0060] S5. Extract the first-level directory features from the HTML document that needs to be processed;

[0061] Determine whether directory features have been extracted. If they have, return the list of part-of-speech tags for the first-level directory in the HTML file. If they have not, iterate through the tagName of each Element in the HTML, skipping table elements, removing spaces, and skipping text longer than 500 characters. Use the first 50 characters to determine the part of speech; skip text that is empty or shorter than 2 characters. Skip text that does not meet the first-level directory features; store text that does in the first-level directory list and add its part-of-speech tag. If there are more than two such entries, consider the feature a first-level directory and return the list of part-of-speech tags for the first-level directory in the HTML file. If it is not a first-level directory, match the current HTML's part of speech against the part-of-speech tag database. If no match is found, return the mq part-of-speech tag.

[0062] S6. Find all directories from the processed HTML (with special handling for the three types of prospectus directory features), tag the directories and generate directory IDs, and construct a directory tree structure based on the directory IDs.

[0063] The specific process is as follows: Iterate through the tagName of each Element in the HTML, skipping table elements, removing spaces, and skipping text longer than 500 characters. Take the first 50 characters to determine the part of speech; skip text that is empty or shorter than 2 characters, and skip text that does not meet the characteristics of a first-level directory. Then determine whether it is a financial part-of-speech directory. If not, add a mark to the special directory tag between the directories and generate a directory ID. If not, skip it, add a mark, generate a directory ID, and label the tag as a directory. Then build the standard data for all directories. Finally, build a directory tree structure based on the directory IDs.

[0064] S7. Generate HTML with a left-right structure based on the directory tree structure, and finally write it to the target HTML path;

[0065] The specific process is as follows: 1) Generate a tree structure HTML based on the directory tree structure;

[0066] 2) Remove the original directory (table of contents) from the HTML;

[0067] 3) Construct a left-right structured HTML file;

[0068] 4) Write the result to the target file path.

[0069] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A device for recognizing article catalogs in the financial field, characterized in that it includes a part-of-speech tag module, a preprocessing module, a processing module, and a generation module. The part-of-speech tag module is a supporting module of the device. The preprocessing module and the processing module perform directory extraction based on the part-of-speech tag module. The generation module outputs the results generated by the processing module. Part-of-speech tag module: includes initializing directory part-of-speech tags, initializing nshort part-of-speech tags, initializing custom part-of-speech tags, exhaustively listing three types of first-level directory features in the prospectus, and custom directory features; Preprocessing module: obtains table of contents features from the table of contents chapters, finds the line containing the table of contents, and determines whether the next line of the table of contents contains the table of contents features. If it does, it returns the first-level table of contents features. If the directory feature was not found in the previous step, logical reasoning is required: analysis of sample features shows that, in most files, the first-level directory features follow a priority order that decreases from left to right. Because line breaks may cause non-directory text to be treated as a directory and disturb the processing logic, directory features are counted with the requirement that the first item of the current feature must appear before subsequent directories with the same feature are considered valid. After the occurrence counts of the first-level directory features are obtained in the previous step, feature tags with an occurrence count less than 1 are removed, and the feature tag with the highest priority is returned as the final first-level directory feature; Processing module: by adding custom part-of-speech tags to the statistically analyzed directory, it performs word segmentation on paragraphs with p tags, and adds the iscatalog attribute to those containing directory features; The generation module aggregates the identified directories and generates HTML with a left-right structure containing anchor points. In the processing module, recognition is optimized when segmenting p-tag paragraphs. Specifically, HanLP part-of-speech tagging identifies dates or certain Chinese characters as the m part of speech, so these are excluded by rule. In addition, the w (symbol) part of speech is restricted to commas, periods, and full-width and half-width brackets.

2. The device for identifying an article directory in the financial field according to claim 1, characterized in that: in the preprocessing module, if "Section 3 xxx" is encountered, the frequency of mq is not counted immediately. Only when "Section 1" is encountered is the mq part-of-speech occurrence recorded as 1. When "Section 2" is then encountered, the mq count is incremented.

3. A method for identifying article catalogs in the financial field, characterized in that it includes the following steps: S1. Initialize the part-of-speech tag library; S2. Input the HTML address to be processed and the target HTML address to be output; S3. Load the HTML that needs to be processed; S4. Import the part-of-speech tag database; S5. Extract the first-level directory features from the HTML that needs to be processed; S6. Find all directories from the processed HTML, tag the directories and generate directory IDs, and build a directory tree structure based on the directory IDs; S7. Generate HTML with a left-right structure based on the directory tree structure, and finally write it to the target HTML path; The initialization of the part-of-speech tag library in step S1 includes initializing the directory part-of-speech tags, initializing the nshort part-of-speech tags, and initializing custom part-of-speech tags. After the process in step S5 is completed, it is determined whether directory features have been extracted. If directory features have been extracted, the first-level directory part-of-speech model list of the HTML file is returned. If directory features have not been extracted, the tagName of each Element in the HTML is traversed, and table elements are skipped. Spaces are removed and elements with a length greater than 500 are skipped. The first 50 characters are used to determine the part of speech. If the text is empty or its length is less than 2, it is skipped. Elements that do not meet the first-level directory features are skipped; elements that do are stored in the first-level directory and have their part of speech added. If there are more than two such entries, the feature is considered a first-level directory. If it is a first-level directory, the first-level directory part-of-speech model list of the HTML file is returned. If it is not a first-level directory, the part of speech of the current HTML is matched with the part-of-speech library. If the match is unsuccessful, the mq part of speech is returned. The process in step S6 is as follows: Iterate through the tagName of each Element in the HTML, skipping table elements, removing spaces, and skipping elements with a length greater than 500. Take the first 50 characters to determine the part-of-speech tag; skip elements that are empty or have a length less than 2 characters. Skip elements that do not meet the characteristics of a first-level directory. Then determine if it is a financial part-of-speech directory. If not, add a special directory tag between directories and generate a directory ID. If not, skip the directory, add a tag, generate a directory ID, and mark the tag as a directory. Then construct the standard data for all directories. Finally, construct the directory tree structure based on the directory IDs.

4. The method for identifying an article directory in the financial field according to claim 3, characterized in that the process in step S7 is as follows: 1) Generate a tree structure HTML based on the directory tree structure; 2) Remove the original directory (table of contents) from the HTML; 3) Construct a left-right structured HTML file; 4) Write the result to the target file path.