Bank transaction document parsing method and device

CN122242479A · Pending · Publication Date: 2026-06-19 · CHENGDU WANWANG SECONDARY PLANET COMM EQUIP CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHENGDU WANWANG SECONDARY PLANET COMM EQUIP CO LTD
Filing Date
2026-02-27
Publication Date
2026-06-19


Abstract

This invention belongs to the technical field of information technology, specifically disclosing a method and apparatus for parsing bank transaction documents. The method includes obtaining format samples from multiple periods through a historical transaction document database and extracting field arrangement sequence features; performing time-series analysis on the sequence features using a sequence model to obtain the format evolution path; dividing time windows according to the format evolution path and calculating the field change frequency within each window, determining whether the change frequency exceeds a preset threshold to identify potential deformation patterns; if the potential deformation pattern includes repeated field adjustment patterns, using a clustering algorithm to group the adjustment patterns to obtain format change clusters; and extracting dominant change factors, such as field additions or deletions, for the format change clusters. The purpose of this invention is to solve the problem in existing technologies where parsing rules fail due to the evolution of transaction document formats over time.

Description

Technical Field

[0001] This invention relates to the technical field of information technology, specifically to a method and apparatus for parsing bank transaction documents.

Background Technology

[0002] Bank statement document parsing, as a crucial area of financial data processing, plays an irreplaceable role in improving bank operational efficiency, ensuring data accuracy, and supporting business decision-making. In the financial industry, bank statements are not only direct records of account transactions but also vital evidence for regulatory compliance and risk management. The accuracy and timeliness of their parsing directly impact the stable operation of the banking system and the quality of customer service.

[0003] However, a common problem in this field is the insufficient adaptability of parsing methods to format changes. Many existing solutions tend to focus only on the fixed format of a certain time period or a certain bank, lacking attention to and response mechanisms for dynamic format evolution. This neglect leads to parsing systems often needing to spend a lot of time and resources to readjust rules when faced with frequent adjustments to bank transaction record formats, seriously affecting business continuity.

[0004] A deeper technical challenge lies in the fact that the patterns of change in bank statement formats have not yet been effectively grasped. Changes in bank statement formats are often influenced by various external factors, such as system upgrades or regulatory policy adjustments. These factors make format changes complex and unpredictable. Furthermore, this complexity brings another problem: it is difficult to identify potential format change trends in advance, causing the parsing system to always be in a reactive state. For example, after a bank adjusted the field arrangement of its statement documents due to regulatory requirements, the parsing system could not promptly identify the new field meanings and arrangement logic, resulting in data extraction errors and consequently affecting subsequent financial reconciliation and customer service processes.

[0005] Therefore, developing a technical method that can proactively adapt to and anticipate format changes in the context of frequent changes in bank transaction document formats has become a key issue in the field of bank transaction document parsing. Solving this problem is not only about improving parsing efficiency, but also directly related to the bank's ability to respond to changes in the business environment.

Summary of the Invention

[0006] This invention provides a method and apparatus for parsing bank transaction documents, aiming to solve the problem in the prior art where parsing rules become invalid due to the evolution of transaction document formats over time.

[0007] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows: A bank statement document parsing method includes: obtaining format samples from multiple periods through a historical statement document database and extracting field arrangement sequence features; performing time-series analysis on the sequence features using a sequence model to obtain the format evolution path; dividing time windows according to the format evolution path and calculating the field change frequency within each window, determining whether the change frequency is higher than a preset threshold to identify potential deformation patterns; if the potential deformation patterns include repeated field adjustment patterns, grouping the adjustment patterns using a clustering algorithm to obtain format change clusters; extracting dominant change factors such as field additions or deletions for the format change clusters, determining the matching degree by comparing the current statement document with the most recent group in the change cluster to obtain preliminary adaptation rules; acquiring real-time statement document input and applying the preliminary adaptation rules for field mapping; if the mapping coverage is lower than a preset threshold, it is judged as a new format variation to obtain an extended rule set; updating the parsing engine's matching logic through the extended rule set and performing secondary extraction on the input document to determine the completeness of the extraction results to obtain corrected data output; performing consistency verification between the corrected data output and a historical validation set; if the consistency score is higher than a preset threshold, the adaptation process is judged to be complete to obtain a final parsing model for subsequent document processing.

[0008] In one aspect of this disclosure, the step of obtaining format samples from multiple periods through a historical transaction document database and extracting field arrangement sequence features, and then using a sequence model to perform time-series analysis on the sequence features to obtain the format evolution path, includes: accessing the stored archival data through the historical transaction document database and filtering the datasets from multiple periods to obtain format samples that meet the criteria; parsing the field arrangement information from the obtained format samples and extracting the corresponding sequence features to form a structured feature dataset; using a long short-term memory network model to perform time-series analysis on the extracted sequence features to determine the trend of feature changes over time; if the trend shows a significant adjustment to the field arrangement, recording the time point of the adjustment and the specific changes to obtain the key nodes of the format evolution; analyzing the correlations between the key nodes to construct the format evolution path and determine the main stages of the evolution process; obtaining the stage division data in the evolution path and combining it with the timestamp information of the historical transaction documents to generate a mapping table of per-stage field arrangements and determine the completeness of the evolution path; and analyzing the changes in field arrangement in the mapping table to extract common features across stages and obtain a stable pattern of format evolution.

[0009] In one aspect of this disclosure, the step of dividing time windows according to the format evolution path and calculating the field change frequency within each window, and determining whether the change frequency is higher than a preset threshold to determine potential deformation patterns, includes: Based on the data of the format evolution path, time windows are divided for the historical records, and multiple consecutive time window sets are obtained by using pre-established time period segmentation rules. For the set of divided time windows, obtain the field change records in each window, and determine the change statistics in each window by comparing the arrangement and content differences of the fields one by one. Based on the statistical data of changes, the frequency of field changes within each time window is calculated, and the distribution of frequency data is obtained by summing and averaging. Based on the distribution of frequency data, a comparison is made with a preset threshold. If the frequency data within a certain time window is higher than the preset threshold, it is determined that there is a potential deformation pattern within that window, and a preliminary pattern determination result is obtained. Based on the preliminary pattern judgment results, time windows with deformation patterns are extracted, and in-depth comparisons are performed on the field change records in these windows to determine the specific change patterns and directions. For the identified change patterns and directions, a long short-term memory network model is used to perform time-series correlation analysis on the change patterns of multiple time windows to obtain the deformation trend across windows. Based on the deformation patterns and trends across windows, the results of pattern determination from all time windows are integrated, and the final set of field deformation patterns is determined through correlation comparison.

[0010] In one aspect of this disclosure, if the potential deformation pattern includes a repeating field adjustment pattern, then a clustering algorithm is used to group the adjustment patterns to obtain a format change cluster, including: To identify deformation patterns, records containing duplicate fields are obtained from historical data, and a preliminary set of adjustment patterns is determined using pre-established filtering rules. Based on the set of adjustment patterns, clustering methods are used to group these patterns to obtain the classified pattern categories, focusing on feature extraction of format changes; For each categorized pattern, the characteristic data of each field is obtained, and the changing trend between categories is determined by comparing the differences between the fields. If the trend of change is consistent in a certain category, then that category is marked as a key cluster, resulting in a set of format change clusters of key concern; For the key clusters marked, obtain the regular distribution data within them, and determine the specific pattern classification result by comparing the order of the fields one by one; Based on the pattern classification results, the data grouping situation within each cluster is obtained, and the final deformation pattern distribution is determined by analyzing the correlation between the groups. The final deformation pattern distribution is recorded and archived in a pre-defined database to obtain a complete field characteristic analysis file.

[0011] In one aspect of this disclosure, the step of extracting dominant change factors, such as field additions or deletions, for the format change cluster, and obtaining preliminary adaptation rules by comparing the current transaction document with the most recent group in the change cluster to determine the matching degree, includes: obtaining the historical data of each group in the format change cluster and determining the distribution of dominant factors by comparing the patterns of field additions and removals; based on the distribution of dominant factors, obtaining the latest field structure data of the transaction document and determining the matching degree by comparing it item by item with the field information of the most recent group; if the matching degree is higher than the preset threshold, classifying the current transaction document into the corresponding most recent group to obtain a preliminary classification result; based on the preliminary classification result, obtaining the field change data after classification and determining the detailed content of the changes by analyzing the specific locations of field additions and removals; based on the detailed content of the changes, using a pre-established mapping table to associate the field changes with adaptation rules and generate a corresponding rule draft; and comparing the rule draft with historical rules and using logic verification tools to determine its applicability, thereby obtaining the final adaptation rules.
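The matching-degree comparison and the extraction of field additions and removals can be illustrated with a minimal Python sketch. The disclosure does not specify a similarity measure; the use of `difflib.SequenceMatcher`, the field names, and the threshold value of 0.8 are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher

def field_diff(reference, current):
    """Return (added, removed) fields of `current` relative to `reference`."""
    added = [f for f in current if f not in reference]
    removed = [f for f in reference if f not in current]
    return added, removed

def matching_degree(reference, current):
    """Order-aware similarity in [0, 1] between two field sequences."""
    return SequenceMatcher(None, reference, current).ratio()

# Hypothetical field layouts: the most recent group in a change cluster
# and the field list of the current transaction document.
latest_group = ["account", "amount", "date", "remarks"]
current_doc = ["account", "amount", "date", "remarks", "transaction_type"]

MATCH_THRESHOLD = 0.8  # assumed value; the source only says "preset threshold"

added, removed = field_diff(latest_group, current_doc)
degree = matching_degree(latest_group, current_doc)
# The document is classified into the group only if the degree exceeds the threshold.
assigned_to_group = degree > MATCH_THRESHOLD
```

Here the added field list is the dominant change factor for this document; a rule draft could then be looked up from it in a mapping table, which is not modeled in this sketch.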

[0012] In one aspect of this disclosure, the step of acquiring real-time transaction document input and applying the preliminary adaptation rules to perform field mapping, and determining a new format variation if the mapping coverage is lower than a preset threshold to obtain an extended rule set, includes: obtaining the latest transaction document input, parsing its structural information, and extracting the distribution of fields to obtain a preliminary field list; applying the pre-established adaptation rules to the preliminary field list to perform field mapping and determine the mapping coverage; if the coverage is lower than the preset threshold, marking the input as a new format variation that requires further processing; based on the marking result, triggering the extended rule generation process, which extracts field features related to the new format variation from historical data to obtain a feature comparison dataset; classifying the feature comparison dataset with a support vector machine algorithm to divide the field features into different variation categories and determine the variation pattern of each category; combining the categorized variation patterns with pre-established rule templates to generate corresponding extended rule drafts, resulting in a set of rule drafts; comparing each rule draft in the set with the historical rules and filtering out any draft whose difference exceeds the preset range, thereby determining the final applicable extended rule set; and, based on the final applicable extended rule set, updating the field mapping logic and applying it to subsequent transaction document inputs to obtain the updated mapping results.
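The coverage check that flags a new format variation can be sketched as follows. This is a minimal illustration, assuming that adaptation rules take the form of a dictionary from source column labels to canonical field names; the labels, the rule contents, and the 0.8 threshold are hypothetical and not taken from the disclosure.

```python
def map_fields(field_list, rules):
    """Apply label-to-canonical-name rules; return (mapped, unmapped, coverage)."""
    mapped = {f: rules[f] for f in field_list if f in rules}
    unmapped = [f for f in field_list if f not in rules]
    coverage = len(mapped) / len(field_list) if field_list else 0.0
    return mapped, unmapped, coverage

# Hypothetical preliminary adaptation rules.
rules = {
    "Txn Date": "transaction_date",
    "Acct No": "account_number",
    "Amt": "amount",
}

# Hypothetical field list parsed from a newly received document.
incoming_fields = ["Txn Date", "Acct No", "Amt", "Channel", "Payment Method"]

COVERAGE_THRESHOLD = 0.8  # assumed value for the "preset threshold"

mapped, unmapped, coverage = map_fields(incoming_fields, rules)
# Coverage below the threshold marks the input as a new format variation,
# which would trigger generation of extended rules for the unmapped fields.
is_new_variation = coverage < COVERAGE_THRESHOLD
```

The unmapped fields are the natural input to the extended rule generation step; the support vector machine classification described above is outside the scope of this sketch.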

[0013] In one aspect of this disclosure, the step of updating the matching logic of the parsing engine through the extended rule set and performing secondary extraction on the input document to determine the completeness of the extraction results and thus obtain the corrected data output includes: obtaining the extended rule set, updating the matching logic of the parsing engine, generating the updated logical framework, and determining the scope of application of the logical framework; loading the data content of the input document into the updated logical framework and executing the document processing flow to obtain an initially processed document dataset; applying a secondary extraction mechanism to the initially processed document dataset, which calls the matching logic to identify fields and determine the set of extracted fields; analyzing the completeness of the extraction results based on the field set and, if the completeness is lower than a preset threshold, triggering the correction mechanism to generate corrected field content; integrating the corrected field content to form corrected data and judging the accuracy of the corrected data using pre-established verification rules; constructing the final data output based on the accuracy of the corrected data, verifying the results, and obtaining the processed data records; and archiving the processed data records to the document processing library, updating the parsing engine's log information, and determining the completion status of the archiving operation.

[0014] In one aspect of this disclosure, the step of performing a consistency check between the corrected data output and the historical validation set, and determining that the adaptation process is complete if the consistency score is higher than a preset threshold, thereby obtaining the final parsing model for subsequent document processing, includes: Obtain the data content of the correction results and historical verification, and make a preliminary comparison of the matching degree between the two to obtain the preliminary calculation result of the consistency score; Based on the preliminary calculation results, a standard comparison is performed using a preset threshold. If the consistency score is higher than the preset threshold, the adaptation process is determined to meet the requirements, and the determination is completed. By completing the judgment state, the logical framework of the final model is constructed, and the parameters of the parsing logic are solidified to obtain a stable structure suitable for document processing; For parsing logic under a stable structure, load the document content to be processed, perform batch content recognition, and determine the core field set for document processing; Based on the core field set, the result comparison mechanism is invoked to verify whether the data in the set meets expectations and to obtain the matching status of the field recognition. By analyzing the matching status, the document processing results that meet the conditions are filtered out and archived into a pre-established repository to determine the application scope of the final model. Based on the defined application scope, update the execution log of the parsing logic, record the execution details of each document processing, and obtain a complete operation trajectory record.
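One possible reading of the consistency check is a field-level agreement rate between the corrected output and the historical validation set. The sketch below assumes that both are keyed by a document identifier and that the score is the fraction of validation field values reproduced exactly; the data, the scoring rule, and the 0.8 threshold are illustrative assumptions, since the disclosure does not define the score.

```python
def consistency_score(corrected, validation):
    """Fraction of validation field values that the corrected output reproduces."""
    total = 0
    matches = 0
    for doc_id, expected in validation.items():
        actual = corrected.get(doc_id, {})
        for field, value in expected.items():
            total += 1
            if actual.get(field) == value:
                matches += 1
    return matches / total if total else 0.0

# Hypothetical historical validation set and corrected parser output.
validation_set = {
    "doc-1": {"account": "6222-01", "amount": "100.00", "date": "2023-01-05"},
    "doc-2": {"account": "6222-02", "amount": "250.50", "date": "2023-01-06"},
}
corrected_output = {
    "doc-1": {"account": "6222-01", "amount": "100.00", "date": "2023-01-05"},
    "doc-2": {"account": "6222-02", "amount": "250.50", "date": "2023-01-07"},
}

CONSISTENCY_THRESHOLD = 0.8  # assumed value for the "preset threshold"

score = consistency_score(corrected_output, validation_set)
# A score above the threshold means the adaptation process is judged complete.
adaptation_complete = score > CONSISTENCY_THRESHOLD
```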

[0015] In another aspect, this disclosure also relates to a bank statement document parsing device, the device comprising:
The historical transaction format analysis module is used to obtain format samples from multiple periods through a historical transaction document database and extract field arrangement sequence features, and to use a sequence model to perform time-series analysis on the sequence features to obtain the format evolution path.
The format deformation pattern identification module is used to divide time windows according to the format evolution path and calculate the field change frequency in each window, and to determine whether the change frequency is higher than a preset threshold to determine the potential deformation pattern.
The format change cluster generation module is used to group the adjustment patterns using a clustering algorithm to obtain a format change cluster if the potential deformation pattern contains a repeating field adjustment pattern.
The preliminary adaptation rule extraction module is used to extract the dominant change factors, such as field additions or deletions, for the format change cluster, and to obtain the preliminary adaptation rules by comparing the current transaction document with the most recent group in the change cluster to determine the matching degree.
The real-time format adaptation module is used to acquire real-time transaction document input and apply the preliminary adaptation rules to perform field mapping; if the mapping coverage is lower than a preset threshold, the input is judged to be a new format variation and an extended rule set is obtained.
The parsing engine correction module is used to update the matching logic of the parsing engine through the extended rule set and perform secondary extraction on the input document to determine the completeness of the extraction results and thus obtain the corrected data output.
The final parsing model generation module is used to perform consistency verification between the corrected data output and the historical validation set; if the consistency score is higher than a preset threshold, the adaptation process is determined to be complete, thereby obtaining the final parsing model for subsequent document processing.

[0016] Compared with the prior art, the present invention has the following beneficial effects: This invention extracts field arrangement sequence features and applies time-series analysis to construct format evolution paths, and identifies potential deformation patterns by combining time window division with change frequency judgment. Furthermore, it uses clustering algorithms to group adjustment patterns, extracts dominant change factors, generates preliminary adaptation rules, and applies field mapping to real-time documents, dynamically updating the parsing logic. For new format variations, this invention optimizes the matching logic by extending the rule set to ensure the completeness of the extraction results, and finally performs consistency verification against the historical validation set to complete the adaptation process. This invention achieves intelligent tracking and dynamic adjustment of parsing rules for complex document format changes, significantly improving the parsing engine's adaptability to heterogeneous documents and the accuracy of data extraction, providing efficient and stable technical support for subsequent document processing.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained from these drawings without creative effort.

[0018] Figure 1 is the first flowchart of a bank statement document parsing method according to the present invention.

[0019] Figure 2 is the second flowchart of a bank statement document parsing method according to the present invention.

[0020] Figure 3 is the third flowchart of a bank statement document parsing method according to the present invention.

Detailed Implementation

[0021] The present invention will be further described below with reference to embodiments. These embodiments are merely some, not all, of the embodiments of the present invention. Other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are all within the protection scope of the present invention.

[0022] Referring to Figures 1-3, this embodiment discloses a method and system for parsing bank statement documents, which may specifically include the following steps.

S101. Obtain format samples from multiple periods through a historical transaction document database and extract field arrangement sequence features. Use a sequence model to perform time-series analysis on the sequence features to obtain the format evolution path.

[0023] The stored archival data is accessed through a historical transaction document database, and datasets from multiple periods are filtered to obtain format samples that meet the criteria. Based on the obtained format samples, the field arrangement information is parsed, and the corresponding sequence features are extracted to form a structured feature dataset. A Long Short-Term Memory (LSTM) network model is used to perform time-series analysis on the extracted sequence features to determine the trend of feature changes over time. If significant adjustments to field arrangements are found in the trend, the time point and specific changes are recorded to identify key nodes in the format evolution. Through correlation analysis between key nodes, a format evolution path is constructed to determine the main stages in the evolution process. Stage division data in the evolution path is obtained and combined with the timestamp information of the historical transaction documents to generate a mapping table of stage-specific field arrangements and determine the completeness of the evolution path. For changes in field arrangements in the mapping table, common features across stages are extracted to obtain a stable pattern of format evolution.

[0024] For example, the specific implementation method of obtaining format samples from multiple periods through a historical transaction document database and extracting field arrangement sequence features, and using a sequence model to perform time-series analysis on the sequence features to obtain the format evolution path is as follows: First, assuming that data is extracted from a database containing corporate financial transaction records of the past 20 years, document samples from five key years—2003, 2008, 2013, 2018, and 2023—are selected. 1000 transaction documents are randomly selected from each year as the analysis object. Natural language processing technology combined with regular expressions is used to extract field information from each document, such as transaction date, amount, account number, etc. The arrangement order of fields for each year is statistically analyzed to form a sequence feature dataset. It is found that the field order in 2003 is mostly "date-account-amount", while in 2023 it gradually evolves into "account-amount-date", and the number of fields increases from an average of 3.5 to 5.2. Next, these sequence features are transformed into vector representations, and a Hidden Markov Model (HMM) is used for time series modeling. The number of states is set to 5, corresponding to five main format patterns. The optimal state transition path is calculated using the Viterbi algorithm, yielding a probability matrix for format evolution. The transition probability from "Date-Account-Amount" to "Account-Date-Amount" is 0.65, indicating this is the main evolutionary trend. Further analysis using timestamp data calculates the duration of each format pattern, revealing that early formats lasted an average of 4.2 years, while this has shortened to 2.8 years in the last decade, reflecting an accelerated format update frequency. 
Finally, based on the above analysis results, a format evolution path diagram is generated, marking the field changes and transition probabilities at each key node. For example, 2013 was a turning point, with the "Remarks" field being added in 78% of cases. The results are stored in the database for subsequent business optimization, such as adjusting data entry templates to adapt to the latest format trends. Simultaneously, business needs are considered, and the possibility of adding a "Transaction Type" field in the next three years is predicted with a probability of 0.72. Machine learning models further validate the prediction accuracy, achieving 85%, thus forming a complete logical chain from data extraction to path analysis and business application.
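The idea of key nodes and transition probabilities between format patterns can be illustrated with a short Python sketch. It is deliberately simpler than the LSTM or Hidden Markov Model described above: it only counts first-order transitions between observed field-order patterns. The sample data below is hypothetical and is not the dataset from the example.

```python
from collections import Counter, defaultdict

def key_nodes(samples):
    """Return (year, old_pattern, new_pattern) wherever the field order changes."""
    return [
        (year, prev_pattern, pattern)
        for (_, prev_pattern), (year, pattern) in zip(samples, samples[1:])
        if pattern != prev_pattern
    ]

def transition_probabilities(patterns):
    """Estimate first-order transition probabilities by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for prev_pattern, next_pattern in zip(patterns, patterns[1:]):
        counts[prev_pattern][next_pattern] += 1
    return {
        prev_pattern: {
            nxt: n / sum(next_counts.values()) for nxt, n in next_counts.items()
        }
        for prev_pattern, next_counts in counts.items()
    }

# Hypothetical dominant field order per sampled year, in chronological order.
samples = [
    (2003, "date-account-amount"),
    (2008, "date-account-amount"),
    (2013, "account-date-amount"),
    (2018, "account-date-amount"),
    (2023, "account-amount-date"),
]

nodes = key_nodes(samples)
probs = transition_probabilities([pattern for _, pattern in samples])
```

With so few observations the estimated probabilities are only indicative; a real evolution path would be built from many documents per period.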

[0025] S102. Divide the time window according to the format evolution path and calculate the field change frequency in each window, and determine whether the change frequency is higher than a preset threshold to determine the potential deformation pattern.

[0026] Based on the data of the format evolution path, historical records are divided into time windows using pre-established time segmentation rules, resulting in multiple consecutive sets of time windows. For each time window set, field change records are retrieved, and the change statistics for each window are determined by comparing the arrangement and content differences of the fields one by one. Based on the change statistics, the field change frequency within each time window is calculated, and the frequency data distribution is obtained through accumulation and averaging. The frequency data distribution is compared with a preset threshold. If the frequency data within a time window exceeds the preset threshold, a potential deformation pattern is identified within that window, yielding preliminary pattern identification results. Based on the preliminary pattern identification results, time windows exhibiting deformation patterns are extracted, and in-depth comparisons are performed on the field change records within these windows to determine the specific change patterns and directions. For the identified change patterns and directions, a Long Short-Term Memory (LSTM) network model is used to perform temporal correlation analysis on the change patterns of multiple time windows, obtaining cross-window deformation pattern trends. Based on the cross-window deformation pattern trends, the pattern identification results of all time windows are integrated, and the final set of field deformation patterns is determined through correlation comparison.

[0027] For example, based on the existing format evolution path analysis results, the system first loads enterprise financial document format data from the database for the past 15 years, dividing the time into three time windows: 2008-2012, 2013-2017, and 2018-2022. Each window contains approximately 5,000 document samples. Using automated scripts combined with text parsing algorithms, the system extracts field changes in the documents within each window, such as field additions, deletions, or position adjustments. Statistical analysis reveals that the field change frequency in the first window is 0.3 times per year, in the second window it is 0.6 times per year, and in the third window it rises to 1.2 times per year. Next, the system sets a preset threshold of 0.8 times per year. Comparison shows that the change frequency in the third window is significantly higher than the threshold, triggering a potential deformation pattern identification mechanism. A clustering algorithm is used to perform pattern mining on the data within the high-frequency change windows. Analysis shows that the occurrence rate of the field "Payment Method" increased from 12% to 45% between 2018 and 2022, indicating that it may become a core field. Furthermore, the system uses time series analysis algorithms to fit the trend of the frequency change data, calculates the acceleration of frequency growth to be 0.15 times / year, and infers that the frequency change may reach 1.5 times per year in the next two years. This result is combined with business needs to automatically generate field adjustment suggestions and store them in the system knowledge base for subsequent dynamic updates of financial document templates, forming a complete analysis chain from time window division to frequency calculation, threshold judgment and pattern recognition.
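The window division and threshold comparison can be sketched in a few lines of Python. The three windows and the threshold of 0.8 changes per year follow the example; the list of change-event years is hypothetical and does not reproduce the exact frequencies quoted above.

```python
def change_frequency(change_years, windows):
    """Changes per year in each inclusive (start, end) window."""
    frequency = {}
    for start, end in windows:
        count = sum(start <= year <= end for year in change_years)
        frequency[(start, end)] = count / (end - start + 1)
    return frequency

windows = [(2008, 2012), (2013, 2017), (2018, 2022)]
THRESHOLD = 0.8  # changes per year, as in the example

# Hypothetical years in which a field addition, deletion, or move was recorded.
change_years = [2010, 2014, 2016, 2017, 2018, 2019, 2019, 2020, 2021, 2022]

frequency = change_frequency(change_years, windows)
# Windows whose change frequency exceeds the threshold are candidates
# for containing a potential deformation pattern.
flagged = [window for window, value in frequency.items() if value > THRESHOLD]
```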

[0028] S103. If the potential deformation pattern includes a repeating field adjustment pattern, then a clustering algorithm is used to group the adjustment patterns to obtain a format change cluster.

[0029] To identify deformation patterns, records containing duplicate fields are retrieved from historical data. Using pre-established filtering rules, a preliminary set of adjustment patterns is determined. Based on this set, clustering methods are used to group these patterns, resulting in categorized pattern classes, focusing on feature extraction for format changes. For each categorized pattern class, field characteristic data is obtained, and differences between fields are compared to determine the changing trends between classes. If the changing trend is consistent within a certain class, that class is marked as a key cluster, resulting in a set of format change clusters of focus. For each marked key cluster, the internal pattern distribution data is obtained, and the specific pattern classification result is determined by comparing the order of fields one by one. Based on the pattern classification results, the data grouping within each cluster is obtained, and the final distribution of deformation patterns is determined by analyzing the correlation between groups. Finally, the final distribution of deformation patterns is recorded and archived in a pre-defined database, resulting in a complete field characteristic analysis archive.

[0030] For example, after identifying potential deformation patterns in the format of corporate financial documents, the system automatically initiates a pattern analysis process targeting repeated field adjustment patterns. First, the system extracts format adjustment records from 8,000 documents over the past 10 years from the database, focusing on repeated field adjustments, such as certain fields being moved or renamed multiple times within different time periods. Next, the system uses the K-means clustering algorithm to group these adjustment records according to adjustment type and time distribution, setting the number of clusters to 5 and the algorithm iterations to 100, ultimately generating 5 format change clusters. One cluster shows that the "Invoice Number" field was adjusted in 28% of the past 3 years, with the adjustment direction mostly being from the bottom to the top of the document. Further, the system extracts features from each cluster and uses a decision tree algorithm to analyze the commonalities of adjustment patterns within each cluster. It finds a high correlation between the "Invoice Number" adjustment and the proportion of "Value-Added Tax Invoice" document types, with a correlation coefficient of 0.75, suggesting that its adjustment may be linked to changes in tax policies. Subsequently, the system compares the analysis results with the enterprise's financial process database, automatically identifying the clusters with the highest proportion of tax-related documents (42%), and marking them as high-priority focus objects. Finally, the system stores the feature data of these clusters in the format management module, forming a classification basis and providing a reference for subsequent document format optimization, ensuring a complete closed loop of analysis logic from data extraction and pattern grouping to feature analysis and business association.
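The grouping step can be sketched with a plain K-means (Lloyd's algorithm) written in standard-library Python. The feature encoding (an adjustment type code and the number of years since the first sample), the data points, and the choice of two clusters are assumptions made to keep the sketch small; the example above uses 5 clusters and 100 iterations on real adjustment records.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical adjustment records: (type code, years since first sample),
# where type 0.0 stands for "field moved" and 1.0 for "field renamed".
records = [(0.0, 1.0), (1.0, 8.0), (0.0, 2.0), (1.0, 9.0), (0.0, 1.5), (1.0, 8.5)]
assignment, centroids = kmeans(records, k=2)
print(assignment)
print(centroids)
```

Each resulting group plays the role of one format change cluster. Deterministic initialization is used here only so that the sketch is reproducible; a production system would more likely rely on an established clustering library.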

[0031] S104. Extract the dominant change factors, such as field additions or deletions, for the format change cluster, and determine the matching degree by comparing the current transaction document with the most recent group in the change cluster to obtain preliminary adaptation rules.

[0032] To address format changes, historical data from cluster groups is retrieved. By comparing patterns of field additions and removals, the distribution of dominant factors is determined. Based on this distribution, the latest field structure data in the transaction document is obtained. This data is then compared item by item with the field information of the most recent group to assess the degree of match. If the match degree exceeds a preset threshold, the current transaction document is categorized into the corresponding most recent group, yielding a preliminary classification result. For this preliminary classification, data on field changes after classification is retrieved. By analyzing the specific locations of field additions and removals, the detailed content of the changes is extracted. Based on this detailed content, a pre-established mapping table is used to associate field changes with adaptation rules, generating a corresponding draft rule. Finally, the draft rule is compared with historical rules. A logical validation tool is used to determine the applicability of the draft rule, resulting in the final adaptation rule.
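The item-by-item comparison and the draft rule generation described above can be sketched as follows. The Jaccard index used as the match degree, the rule tuple format, and the field names are assumptions for illustration; the description does not fix a particular metric or rule representation at this step.

```python
def field_diff(reference, current):
    """Return (added, removed) field names, preserving document order."""
    ref_set, cur_set = set(reference), set(current)
    added = [f for f in current if f not in ref_set]
    removed = [f for f in reference if f not in cur_set]
    return added, removed


def match_degree(reference, current):
    """Jaccard index of the two field sets (an assumed matching metric)."""
    a, b = set(reference), set(current)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def draft_rules(added, removed):
    """Turn field changes into draft rules via a trivial, hypothetical mapping."""
    return [("add_mapping", f) for f in added] + [("drop_mapping", f) for f in removed]


# Hypothetical field lists for the most recent group and the current document.
recent_group = ["Date", "Amount", "Payer", "Remarks"]
current_doc = ["Date", "Amount", "Payer", "Receiving Unit"]

added, removed = field_diff(recent_group, current_doc)
degree = match_degree(recent_group, current_doc)
rules = draft_rules(added, removed)
print(added, removed, degree, rules)
```

In this sketch the draft rules would still need to be checked against historical rules before being accepted, as the paragraph above requires.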

[0033] For example, in the field of enterprise financial document format management, the system's process for extracting dominant change factors and generating preliminary adaptation rules for format change clusters is as follows: First, the system extracts dominant change factors from the format change clusters, such as the addition or deletion of fields. Specifically, by analyzing the format records of 3,000 documents over the past 5 years, it identifies that the field "Receiving Unit" was added in 45% of the documents, while the field "Remarks" was deleted in 32%. The system uses a weighted average algorithm to calculate the change frequency, finding that the weight of the addition of "Receiving Unit" is 0.65, thus considering it a major change factor. Next, the system compares the current transaction documents with the most recent group in the change cluster. Assuming the current batch of transaction documents contains 100 documents, the system uses a cosine similarity algorithm to calculate the matching degree between the current document format and the most recent cluster, finding a matching degree of 0.82. The "Receiving Unit" field appears in 60% of these clusters, further confirming its dominant position. Subsequently, the system generates preliminary adaptation rules based on the matching results. It uses a logistic regression algorithm to analyze the correlation between field changes and business scenarios, and finds that the correlation coefficient between the addition of "Receiving Unit" and the "cross-regional transaction" document type is 0.78. It infers that the change may be related to the expansion of the transaction scope. The system automatically compares this rule with the financial audit process database, filters out the document category with a cross-regional transaction ratio of 38%, marks it as a key adaptation object, and finally stores the generated rule in the format adaptation module to form a basis for dynamic adjustment, ensuring a logical closed loop from change extraction to rule generation.
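The cosine similarity matching mentioned in this example can be sketched as follows. Representing each group of documents by the fraction of documents that contain each field is an assumption made for illustration, as are the vocabulary and the sample documents; the example does not state how the format vectors are built.

```python
import math


def occurrence_vector(documents, vocabulary):
    """Fraction of documents that contain each field of the vocabulary."""
    n = len(documents)
    return [sum(field in doc for doc in documents) / n for field in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and documents, each document given as a set of field names.
vocabulary = ["Date", "Amount", "Receiving Unit", "Remarks"]
recent_group = [
    {"Date", "Amount", "Receiving Unit"},
    {"Date", "Amount", "Receiving Unit"},
    {"Date", "Amount", "Remarks"},
]
current_batch = [
    {"Date", "Amount", "Receiving Unit"},
    {"Date", "Amount"},
]

score = cosine_similarity(
    occurrence_vector(recent_group, vocabulary),
    occurrence_vector(current_batch, vocabulary),
)
print(round(score, 3))
```

A score close to 1 indicates that the current batch uses nearly the same field mix as the most recent group; the score would then be compared with the preset matching threshold.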

[0034] S105. Obtain real-time streaming document input and apply the preliminary adaptation rules to perform field mapping. If the mapping coverage is lower than the preset threshold, it is judged as a new format variation, thereby obtaining an extended rule set.

[0035] The system acquires the latest data input from the transaction document. By parsing its structural information, it extracts the distribution of fields to obtain a preliminary field list. For this preliminary field list, pre-established adaptation rules are applied for field mapping to determine the mapping coverage. If the coverage is below a preset threshold, it is marked as a new format variation, requiring further processing. Based on the marking results of the new format variations, the system triggers the generation of extended rules, extracting field features related to the new format variations from historical data to obtain a feature comparison dataset. For this feature comparison dataset, a support vector machine algorithm is used for classification, dividing the field features into different variation categories to determine the variation patterns after classification. Using the classified variation patterns and pre-established rule templates, corresponding extended rule drafts are generated, resulting in a set of rule drafts. The set of rule drafts is then compared item by item with historical rules. If the differences exceed a preset range, the rule drafts are filtered to determine the final applicable set of extended rules. Based on the final applicable set of extended rules, the field mapping processing logic is updated and applied to subsequent transaction document data inputs to obtain updated mapping results.

[0036] For example, in the field of enterprise financial document format management, the system achieves real-time field mapping and rule expansion of transaction documents through a series of automated processes. First, the system obtains real-time input transaction documents from the financial data stream. Assuming this input contains 200 documents, the system parses each document using a pre-loaded field recognition model, extracting key fields such as "transaction amount" and "invoice date," and recording the frequency of these fields. The "transaction amount" occurrence rate is 98%, while "invoice date" is only 55%. Next, the system applies previously stored preliminary adaptation rules for field mapping, using a vector matching algorithm to calculate the similarity between the current document fields and the standard fields defined in the rule base, resulting in an average mapping coverage of 0.75. Simultaneously, for fields that fail to map successfully, the system automatically generates a temporary mapping log, finding that 15% of the documents contain the undefined field "tax number." Subsequently, the system compares the mapping coverage with a preset threshold of 0.80, finding that the current coverage is below the threshold, indicating a new format variation, and triggering the extended rule set generation mechanism. The system uses clustering analysis algorithms to group unmapped fields and combines the correlation analysis of the "tax compliance inspection" process in the business scenario database to calculate the correlation coefficient between "tax number" and compliance inspection as 0.85. It infers that this may be related to recent policy adjustments and automatically adds this field to the rule base to generate an extended rule set. At the same time, it associates the field with the compliance review module and marks it as a priority processing field to ensure that the coverage is increased to above 0.82 when processing subsequent documents, forming a complete logical closed loop from real-time input to rule expansion.
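The coverage check and rule extension in this example can be sketched as follows. The rule base is modelled as a dictionary from source field names to standard field names; this representation, the field names on the right-hand side, and the naming of the newly added target are assumptions for illustration.

```python
def mapping_coverage(document_fields, rules):
    """Share of the document's fields that the rule base can map."""
    if not document_fields:
        return 1.0
    return sum(f in rules for f in document_fields) / len(document_fields)


def unmapped_fields(document_fields, rules):
    """Fields that have no entry in the rule base."""
    return [f for f in document_fields if f not in rules]


def extend_rules(rules, new_fields):
    """Return a copy of the rules with a placeholder target for each new field."""
    extended = dict(rules)
    for f in new_fields:
        extended[f] = f.replace(" ", "_")  # hypothetical naming of the standard field
    return extended


# Hypothetical preliminary adaptation rules and one incoming document.
preliminary_rules = {
    "transaction amount": "amount",
    "invoice date": "date",
    "payer name": "payer",
}
incoming = ["transaction amount", "invoice date", "payer name", "tax number"]

THRESHOLD = 0.80
coverage_before = mapping_coverage(incoming, preliminary_rules)
is_new_variation = coverage_before < THRESHOLD
extended_rules = extend_rules(preliminary_rules, unmapped_fields(incoming, preliminary_rules))
coverage_after = mapping_coverage(incoming, extended_rules)
print(coverage_before, is_new_variation, coverage_after)
```

Here three of four fields map, giving a coverage of 0.75, which is below the 0.80 threshold and marks the document as a new format variation; after "tax number" is added to the rule base, the same document maps completely.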

[0037] S106. Update the matching logic of the parsing engine through the extended rule set and perform secondary extraction on the input document to determine the completeness of the extraction results and obtain the corrected data output.

[0038] The process begins by acquiring an extended rule set, updating the parsing engine's matching logic, generating an updated logical framework, and determining its applicability. Using this updated framework, the input document's data is loaded, and the document processing flow is executed to obtain a pre-processed document dataset. For this dataset, a secondary extraction mechanism is employed, calling the matching logic to identify fields and determine the extracted field set. Based on this field set, the completeness of the extraction results is analyzed. If the completeness falls below a preset threshold, a correction mechanism is triggered, generating corrected field content. This corrected field content is then integrated to form corrected data, and pre-established verification rules are used to determine its accuracy. Based on the accuracy judgment of the corrected data, the final data output is constructed, the result is verified, and the processed data records are obtained. Finally, the processed data records are archived to the document processing library, the parsing engine's log information is updated, and the completion status of the archiving operation is confirmed.

[0039] For example, in the field of enterprise financial document management, the system performs in-depth processing of input documents through automated processes to ensure data accuracy. First, the system updates the matching logic of the parsing engine using a newly generated extended rule set. Specifically, the matching weight of newly added fields in the rule base is adjusted to 0.9, and the field recognition priority is reordered using a decision tree algorithm, resulting in a 12% improvement in matching efficiency. Next, the system performs a secondary extraction operation on 300 input financial documents. Using a natural language processing model based on semantic analysis, it calculates the contextual relevance of field content, finding that 85% of the documents have an extraction completeness of 0.95 or higher for the "Payer Name" field, while the remaining 15% are partially missing due to inconsistent formatting. Then, the system evaluates the completeness of the extraction results. By comparing it to the historical data completeness benchmark of 0.92, a data correction mechanism is automatically triggered. An association rule mining algorithm is used to infer and fill in missing fields. For example, based on the association degree of 0.88 between "Payer Name" and "Transaction Type," possible values for missing fields are derived, and corrected data output is generated, improving the completeness to 0.96. At the same time, the system will link the correction results with the financial audit process database to automatically generate a compliance verification report, ensuring that the data output meets business needs and forming a complete closed-loop logic from rule updates to data correction.
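The completeness check and the correction trigger can be sketched as follows. The fill-in step below is a deliberately simplified stand-in for the association rule mining named in the example: it copies the most frequent value of the target field observed alongside the same value of a related field. The function names and sample records are assumptions for illustration.

```python
from collections import Counter, defaultdict


def completeness(record, required_fields):
    """Share of required fields that hold a non-empty value."""
    filled = sum(1 for f in required_fields if record.get(f) not in (None, ""))
    return filled / len(required_fields)


def infer_missing(records, target, given):
    """Fill empty `target` values with the most common value seen for the same `given`."""
    votes = defaultdict(Counter)
    for r in records:
        if r.get(target) and r.get(given):
            votes[r[given]][r[target]] += 1
    corrected = []
    for r in records:
        r = dict(r)  # work on a copy so the extracted records stay untouched
        if not r.get(target) and r.get(given) in votes:
            r[target] = votes[r[given]].most_common(1)[0][0]
        corrected.append(r)
    return corrected


# Hypothetical extraction results.
REQUIRED = ["Payer Name", "Transaction Type"]
BENCHMARK = 0.92
extracted = [
    {"Payer Name": "Acme Ltd", "Transaction Type": "rent"},
    {"Payer Name": "Acme Ltd", "Transaction Type": "rent"},
    {"Payer Name": "", "Transaction Type": "rent"},
]

incomplete = [r for r in extracted if completeness(r, REQUIRED) < BENCHMARK]
corrected = infer_missing(extracted, target="Payer Name", given="Transaction Type")
print(len(incomplete), corrected[2]["Payer Name"])
```

Values inferred this way are guesses, so in line with the description they would still pass through the pre-established verification rules before being output.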

[0040] S107. Perform consistency verification between the corrected data output and the historical verification set. If the consistency score is higher than the preset threshold, the adaptation process is completed, and the final parsing model is obtained for subsequent document processing.

[0041] The process begins by acquiring the calibration results and historical verification data, performing a preliminary comparison of their matching degree to obtain a preliminary consistency score. Based on this preliminary score, a standard comparison is performed using a preset threshold. If the consistency score is higher than the preset threshold, the adaptation process is deemed to meet the requirements, and the judgment is considered complete. Using this completed judgment status, the logical framework of the final model is constructed, and the parameters of the parsing logic are solidified to obtain a stable structure suitable for document processing. For the parsing logic under this stable structure, the document content to be processed is loaded, and batch content recognition is performed to determine the core field set for document processing. Based on the core field set, a result comparison mechanism is invoked to verify whether the data in the set meets expectations, obtaining the field recognition matching status. Through analysis of the matching status, document processing results that meet the conditions are selected and archived in a pre-established repository, determining the application scope of the final model. Based on the defined application scope, the execution log of the parsing logic is updated, recording the execution details of each document processing step to obtain a complete operation trajectory record.

[0042] For example, in the field of enterprise financial document management, the system uses an automated process to perform consistency verification on the corrected data output and complete the construction of the final parsing model to support the efficiency of subsequent document processing. First, the system compares the corrected data output with a historical validation set. Specifically, it extracts 500 financial documents approved within the past six months as the validation set and uses a cosine similarity algorithm to calculate the consistency score between the current data output and the validation set on key fields such as "transaction amount" and "invoice date." The analysis shows a consistency score of 0.93, significantly higher than the preset threshold of 0.90, indicating that the data output has met the expected standard in terms of content accuracy. Next, based on the consistency score, the system automatically triggers an adaptive evaluation mechanism, using a clustering analysis algorithm to compare the feature distribution of the current data output with the historical validation set. The results show a data distribution similarity of 0.87, further confirming the system's stability and reliability. To ensure a closed-loop logic, the system correlates the consistency verification results with the financial compliance database, automatically extracting compliance indicators such as a "tax rate compliance" score of 0.98 or higher. If deviations are detected, the system fine-tunes the data output using a built-in regression analysis model, ultimately generating an adaptability score report to indicate that the adaptation process is complete. Subsequently, the system solidifies the current parsing logic into the final parsing model. 
For future input financial documents, this model is automatically applied for batch processing, while new data samples are periodically extracted to update the validation set, ensuring the model's continued applicability and forming a complete automated process from verification to model building.
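The threshold gate on the consistency score can be sketched as follows. The example above computes the score with cosine similarity; this sketch instead uses a simpler exact-match ratio over key fields, which is an assumption chosen for brevity, as are the document identifiers and values.

```python
def consistency_score(outputs, validation, key_fields):
    """Fraction of (document, field) pairs whose parsed value equals the validated one."""
    total = 0
    matches = 0
    for doc_id, expected in validation.items():
        parsed = outputs.get(doc_id, {})
        for field in key_fields:
            total += 1
            if parsed.get(field) == expected.get(field):
                matches += 1
    return matches / total if total else 0.0


def adaptation_complete(score, threshold=0.90):
    """The adaptation is accepted only if the score is strictly above the threshold."""
    return score > threshold


# Hypothetical validation set and corrected data output, keyed by document id.
KEY_FIELDS = ["transaction amount", "invoice date"]
validation_set = {
    "doc-1": {"transaction amount": "100.00", "invoice date": "2025-01-03"},
    "doc-2": {"transaction amount": "250.50", "invoice date": "2025-01-04"},
}
corrected_output = {
    "doc-1": {"transaction amount": "100.00", "invoice date": "2025-01-03"},
    "doc-2": {"transaction amount": "250.50", "invoice date": "2025-01-05"},
}

score = consistency_score(corrected_output, validation_set, KEY_FIELDS)
print(score, adaptation_complete(score))
```

In this sketch three of four field values agree, so the score of 0.75 stays below the 0.90 threshold and the parsing logic would not yet be solidified into the final model.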

[0043] This invention provides a bank statement document parsing system, which mainly includes: The historical transaction format analysis module is used to obtain format samples from multiple periods through a historical transaction document database and extract field arrangement sequence features. It then uses a sequence model to perform time-series analysis on the sequence features to obtain the format evolution path. The format deformation pattern identification module is used to divide time windows according to the format evolution path and calculate the field change frequency in each window, and determine whether the change frequency is higher than a preset threshold to determine the potential deformation pattern. The format change cluster generation module is used to group the adjustment patterns using a clustering algorithm to obtain a format change cluster if the potential deformation pattern contains a repeating field adjustment pattern. The preliminary adaptation rule extraction module is used to extract the dominant change factors, such as field additions or deletions, for the format change cluster, and obtain the preliminary adaptation rules by comparing the current transaction document with the most recent group in the change cluster to determine the matching degree. The real-time format adaptation module is used to acquire real-time transaction document input and apply the preliminary adaptation rules to perform field mapping. If the mapping coverage is lower than a preset threshold, the input is judged to be a new format variation and an extended rule set is obtained. The parsing engine correction module is used to update the matching logic of the parsing engine through the extended rule set and perform secondary extraction on the input document to determine the completeness of the extraction results and thus obtain the corrected data output. The final parsing model generation module is used to perform consistency verification between the corrected data output and the historical validation set. If the consistency score is higher than a preset threshold, the adaptation process is determined to be complete, thereby obtaining the final parsing model for subsequent document processing.

[0044] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for parsing bank statement documents, characterized in that it comprises: obtaining format samples from multiple periods through a historical transaction document database, extracting field arrangement sequence features, and using a sequence model to perform time-series analysis on the sequence features to obtain a format evolution path; dividing time windows according to the format evolution path, calculating the field change frequency in each window, and determining whether the change frequency is higher than a preset threshold to determine a potential deformation pattern; if the potential deformation pattern includes a repeating field adjustment pattern, using a clustering algorithm to group the adjustment patterns to obtain a format change cluster; for the format change cluster, extracting dominant change factors such as field additions or deletions, and determining the matching degree by comparing the current transaction document with the most recent group in the change cluster to obtain preliminary adaptation rules; acquiring real-time transaction document input and applying the preliminary adaptation rules to perform field mapping, and, if the mapping coverage is lower than a preset threshold, judging the input to be a new format variation, thereby obtaining an extended rule set; updating the matching logic of the parsing engine through the extended rule set and performing a secondary extraction on the input document to determine the completeness of the extraction results and thus obtain the corrected data output; and performing consistency verification between the corrected data output and the historical validation set, and, if the consistency score is higher than a preset threshold, determining that the adaptation process is complete, thereby obtaining the final parsing model for subsequent document processing.

2. The bank statement document parsing method according to claim 1, characterized in that the step of obtaining format samples from multiple periods through a historical transaction document database, extracting field arrangement sequence features, and using a sequence model to perform time-series analysis on the sequence features to obtain the format evolution path includes: accessing the stored archival data through the historical transaction document database and filtering the datasets from multiple periods to obtain format samples that meet the criteria; based on the obtained format samples, parsing the field arrangement information, extracting the corresponding sequence features, and forming a structured feature dataset; using a long short-term memory network model to perform time-series analysis on the extracted sequence features to determine the trend of feature changes over time; if there are significant adjustments to the field arrangement in the trend, recording the time point of the adjustment and the specific changes to obtain the key nodes of the format evolution; analyzing the correlations between key nodes to construct the format evolution path and determine the main stages in the evolution process; obtaining the stage division data in the evolution path, combining it with the timestamp information of the historical transaction documents, generating a mapping table of field arrangements by stage, and determining the completeness of the evolution path; and analyzing the changes in the field arrangement in the mapping table and extracting common features across stages to obtain a stable pattern of format evolution.

3. The bank statement document parsing method according to claim 1, characterized in that, The step of dividing time windows according to the format evolution path and calculating the field change frequency within each window, and determining whether the change frequency is higher than a preset threshold to determine potential deformation patterns, includes: Based on the data of the format evolution path, time windows are divided for the historical records, and multiple consecutive time window sets are obtained by using pre-established time period segmentation rules. For the set of divided time windows, obtain the field change records in each window, and determine the change statistics in each window by comparing the arrangement and content differences of the fields one by one. Based on the statistical data of changes, the frequency of field changes within each time window is calculated, and the distribution of frequency data is obtained by summing and averaging. Based on the distribution of frequency data, a comparison is made with a preset threshold. If the frequency data within a certain time window is higher than the preset threshold, it is determined that there is a potential deformation pattern within that window, and a preliminary pattern determination result is obtained. Based on the preliminary pattern judgment results, time windows with deformation patterns are extracted, and in-depth comparisons are performed on the field change records in these windows to determine the specific change patterns and directions. For the identified change patterns and directions, a long short-term memory network model is used to perform time-series correlation analysis on the change patterns of multiple time windows to obtain the deformation trend across windows. 
Based on the deformation patterns and trends across windows, the results of pattern determination from all time windows are integrated, and the final set of field deformation patterns is determined through correlation comparison.

4. The bank statement document parsing method according to claim 1, characterized in that, If the potential deformation pattern includes a repeating field adjustment pattern, then a clustering algorithm is used to group the adjustment patterns to obtain a format change cluster, including: To identify deformation patterns, records containing duplicate fields are obtained from historical data, and a preliminary set of adjustment patterns is determined using pre-established filtering rules. Based on the set of adjustment patterns, clustering methods are used to group these patterns to obtain the classified pattern categories, focusing on feature extraction of format changes; For each categorized pattern, the characteristic data of each field is obtained, and the changing trend between categories is determined by comparing the differences between the fields. If the trend of change is consistent in a certain category, then that category is marked as a key cluster, resulting in a set of format change clusters of key concern; For the key clusters marked, obtain the regular distribution data within them, and determine the specific pattern classification result by comparing the order of the fields one by one; Based on the pattern classification results, the data grouping situation within each cluster is obtained, and the final deformation pattern distribution is determined by analyzing the correlation between the groups. The final deformation pattern distribution is recorded and archived in a pre-defined database to obtain a complete field characteristic analysis file.

5. The bank statement document parsing method according to claim 1, characterized in that the step of extracting dominant change factors, such as field additions or deletions, for the format change cluster, and determining the matching degree by comparing the current transaction document with the most recent group in the change cluster to obtain preliminary adaptation rules includes: in response to format changes, obtaining historical data from cluster groups and determining the distribution of dominant factors by comparing the patterns of field additions and removals; based on the distribution of dominant factors, obtaining the latest field structure data in the transaction document and determining the degree of matching by comparing it item by item with the field information of the most recent group; if the matching degree is higher than the preset threshold, classifying the current transaction document into the corresponding most recent group to obtain the preliminary classification result; based on the preliminary classification result, obtaining the field change data after classification and determining the detailed content of the changes by analyzing the specific locations of field additions and removals; based on the detailed content extracted from the changes, using a pre-established mapping table to associate the field changes with the adaptation rules and generate a corresponding draft rule; and comparing the generated draft rule with historical rules and using a logic verification tool to determine the applicability of the draft rule, thereby obtaining the final adaptation rule.

6. The bank statement document parsing method according to claim 1, characterized in that the step of acquiring real-time transaction document input and applying the preliminary adaptation rules to perform field mapping, and, if the mapping coverage is lower than a preset threshold, judging the input to be a new format variation, thereby obtaining an extended rule set, includes: obtaining the latest data input from the transaction document, parsing its structural information, extracting the distribution of fields, and obtaining a preliminary field list; for the preliminary field list, applying the pre-established adaptation rules to perform field mapping and determine the mapping coverage; if the coverage is lower than the preset threshold, marking the input as a new format variation that requires further processing; based on the marking results of the new format variation, triggering the generation of extended rules, extracting field features related to the new format variation from historical data, and obtaining a feature comparison dataset; for the feature comparison dataset, using a support vector machine algorithm for classification to divide the field features into different variation categories and determine the variation patterns after classification; combining the classified variation patterns with pre-established rule templates to generate corresponding extended rule drafts, resulting in a set of rule drafts; comparing each rule draft in the set with the historical rules and, if the difference exceeds the preset range, filtering the rule drafts to determine the final applicable extended rule set; and, based on the final applicable extended rule set, updating the field mapping processing logic and applying it to subsequent transaction document data inputs to obtain the updated mapping results.

7. The bank statement document parsing method according to claim 1, characterized in that, The step of updating the parsing engine's matching logic through the extended rule set and performing secondary extraction on the input document to determine the completeness of the extraction results and obtain the corrected data output includes: Obtain the extended rule set, update the matching logic of the parsing engine, generate the updated logical framework, and determine the scope of application of the logical framework; The updated logical framework loads the data content of the input document, executes the document processing flow, and obtains the pre-processed document dataset. For the document dataset after initial processing, a secondary extraction mechanism is adopted, which calls the matching logic to identify fields and determine the set of extracted fields; Based on the field set, analyze the completeness of the extracted results. If the completeness is lower than a preset threshold, trigger the correction mechanism to generate corrected field content. The corrected field content is integrated to form corrected data, and the accuracy of the corrected data is judged using pre-established verification rules. Based on the accuracy of the calibration data, the final data output is constructed, the results are verified, and the processed data record is obtained. Once the data records have been processed, they are archived to the document processing library, the parsing engine's log information is updated, and the completion status of the archiving operation is determined.

8. The bank statement document parsing method according to claim 1, characterized in that, The process of verifying the consistency between the corrected data output and the historical validation set, and determining that the adaptation process is complete if the consistency score is higher than a preset threshold, thereby obtaining the final parsing model for subsequent document processing, includes: Obtain the data content of the correction results and historical verification, and make a preliminary comparison of the matching degree between the two to obtain the preliminary calculation result of the consistency score; Based on the preliminary calculation results, a standard comparison is performed using a preset threshold. If the consistency score is higher than the preset threshold, the adaptation process is determined to meet the requirements, and the determination is completed. By completing the judgment state, the logical framework of the final model is constructed, and the parameters of the parsing logic are solidified to obtain a stable structure suitable for document processing; For parsing logic under a stable structure, load the document content to be processed, perform batch content recognition, and determine the core field set for document processing; Based on the core field set, the result comparison mechanism is invoked to verify whether the data in the set meets expectations and to obtain the matching status of the field recognition. By analyzing the matching status, the document processing results that meet the conditions are filtered out and archived into a pre-established repository to determine the application scope of the final model. Based on the defined application scope, update the execution log of the parsing logic, record the execution details of each document processing, and obtain a complete operation trajectory record.

9. A bank statement document parsing device, characterized in that the device includes: a historical transaction format analysis module, used to obtain format samples from multiple periods through a historical transaction document database, extract field arrangement sequence features, and perform time-series analysis on the sequence features using a sequence model to obtain the format evolution path; a format deformation pattern identification module, used to divide time windows according to the format evolution path, calculate the field change frequency in each window, and determine whether the change frequency is higher than a preset threshold so as to determine the potential deformation pattern; a format change cluster generation module, used to group the adjustment patterns using a clustering algorithm to obtain format change clusters if the potential deformation pattern contains a repeated field adjustment pattern; a preliminary adaptation rule extraction module, used to extract dominant change factors, such as field additions or deletions, for the format change clusters, and to obtain the preliminary adaptation rules by comparing the current transaction document with the most recent group in the change clusters to determine the matching degree; a real-time format adaptation module, used to acquire real-time transaction document input and apply the preliminary adaptation rules to perform field mapping, wherein, if the mapping coverage is lower than a preset threshold, the input is judged to be a new format variation and an extended rule set is obtained; a parsing engine correction module, used to update the matching logic of the parsing engine through the extended rule set and perform secondary extraction on the input document to determine the completeness of the extraction results and thus obtain the corrected data output; and a final parsing model generation module, used to perform consistency verification between the corrected data output and the historical validation set, wherein, if the consistency score is higher than a preset threshold, the adaptation process is determined to be complete, thereby obtaining the final parsing model for subsequent document processing.
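As an illustration of the format deformation pattern identification module, the following Python sketch computes a field change frequency per time window and flags the windows that exceed a threshold. The claim does not fix the window size or the frequency formula; calendar-month windows, the ratio of changed transitions to all transitions, and the function names are assumptions made here.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[date, Tuple[str, ...]]  # (statement date, ordered field names)
Window = Tuple[int, int]               # (year, month), an assumed window size


def change_frequency_by_window(samples: Sequence[Sample]) -> Dict[Window, float]:
    """Per window, the share of consecutive samples whose field sequence changed."""
    ordered = sorted(samples, key=lambda sample: sample[0])
    transitions: Dict[Window, int] = defaultdict(int)
    changes: Dict[Window, int] = defaultdict(int)
    for (_, previous_fields), (day, fields) in zip(ordered, ordered[1:]):
        window = (day.year, day.month)  # a transition belongs to the later sample's window
        transitions[window] += 1
        if previous_fields != fields:
            changes[window] += 1
    return {window: changes[window] / transitions[window] for window in transitions}


def deformation_windows(samples: Sequence[Sample], threshold: float = 0.5) -> List[Window]:
    """Windows whose change frequency is strictly higher than the threshold."""
    frequency = change_frequency_by_window(samples)
    return sorted(window for window, value in frequency.items() if value > threshold)
```

Comparing whole field sequences treats a reordering and an added field as the same kind of change; separating those cases would be the job of the clustering and dominant-factor modules described in the claim.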