A method and apparatus for matching heterogeneous data

By using time-series binning and multi-dimensional fuzzy matching algorithms, the problem of high-confidence automated matching of transaction logs and invoice data was solved, achieving efficient and robust heterogeneous data matching, reducing manual intervention, and improving matching accuracy and system stability.

CN122243671A · Pending · Publication Date: 2026-06-19 · BEIJING HESI HUIZHI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING HESI HUIZHI INFORMATION TECHNOLOGY CO LTD
Filing Date
2026-04-01
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In corporate expense control and personal advance payment scenarios, existing technologies struggle to achieve highly reliable automated matching of transaction records and invoice data. This is especially true in complex scenarios where multiple transactions correspond to a single invoice or a single transaction corresponds to multiple invoices, where reliability assessment is lacking and extensive manual intervention is required.

Method used

By using time series binning technology, the matching of massive data is transformed into a small-scale sub-matching problem. Merchant name semantic normalization and hybrid similarity algorithm are adopted, combined with a multi-factor fuzzy matching algorithm with dynamic weight allocation, to perform multi-dimensional fuzzy matching of amount, time and name. Through continuous quantitative matching confidence calculation and a three-level decision mechanism, automated and high-confidence matching is achieved.

Benefits of technology

It significantly improves matching efficiency, reduces the rate of manual intervention, enhances the robustness and accuracy of matching, and has the ability to learn and continuously optimize itself, ensuring long-term stability and accuracy.



Abstract

This application provides a method and apparatus for matching heterogeneous data, relating to the field of data processing technology. The method includes: acquiring transaction flow data and invoice data from a transaction flow system and an invoice recognition system, respectively; dividing the processed transaction flow data and invoice data into buckets based on a time tolerance window, and identifying transaction flow data and invoice data within the same time bucket as matching candidate pairs; performing multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs, determining the amount similarity, time similarity, and name similarity respectively, and adjusting the weight coefficients of each similarity based on the characteristics of the matching candidate pairs; determining the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients; and performing corresponding matching decision operations on the matching candidate pairs according to a preset confidence threshold. This application achieves efficient and high-confidence automatic matching of heterogeneous data.

Description

Technical Field

[0001] This application relates to the field of data processing, and more specifically, to a method and apparatus for matching heterogeneous data.

Background Technology

[0002] Currently, in corporate expense control and personal advance payment scenarios, employees or users generate transaction records through online payment platforms, while compliant reimbursement requires the submission of corresponding invoices as proof. Since payment systems and invoice issuance systems are typically two independent and heterogeneous systems, they often lack a unique and reproducible business ID. Therefore, matching massive transaction records with corresponding invoices is a crucial step in the expense control and reimbursement process. Existing technologies mainly rely on precise matching of transaction amounts and consistency of transaction times, but cannot quantify the reliability of the matching. Fuzzy matching results still require significant manual verification, and complex scenarios involving multiple transactions corresponding to a single invoice or a single transaction corresponding to multiple invoices are difficult to handle.

Summary of the Invention

[0003] The purpose of this application is to provide a method and apparatus for matching heterogeneous data, which solves the above-mentioned problems existing in the prior art and can realize automated and high-confidence matching of transaction flow and invoice data.

[0004] In a first aspect, a method for matching heterogeneous data is provided, which may include: obtaining transaction flow data and invoice data from a transaction flow system and an invoice recognition system, respectively; dividing the processed transaction flow data and invoice data into buckets based on a preset time tolerance window, and identifying the transaction flow data and invoice data in the same time bucket as matching candidate pairs; performing multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs to determine the amount similarity, time similarity, and name similarity respectively, and adjusting the weight coefficient of each similarity according to the features of the matching candidate pairs; determining the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients; and performing the corresponding matching decision operation on the matching candidate pairs according to a preset confidence threshold.

[0005] In one possible implementation, processing the transaction log data and the invoice data includes: The transaction amounts in the transaction log data and the invoice amounts in the invoice data are uniformly converted into the smallest currency unit; The transaction time in the transaction log data and the invoice issuance time in the invoice data are uniformly converted into corresponding timestamps.

[0006] In one possible implementation, the processed transaction log data and invoice data are bucketed according to a preset time tolerance window, including: For each transaction record, the time tolerance window is extended forward and backward based on the transaction time of that transaction record to form an effective time interval. Records in the invoice data whose invoice issuance time falls within the valid time interval are assigned to the same time bucket as the transaction flow data. Within each time bucket, transaction data and invoice data are sorted and indexed based on amount.

[0007] In one possible implementation, the weight coefficients of each similarity are dynamically adjusted based on the features of the matching candidate pairs, including: If the monetary similarity of the matching candidate pair reaches the configured matching state, the weight coefficient corresponding to the monetary similarity is increased, and the weight coefficients corresponding to the time similarity and name similarity are decreased. If the time difference of the matching candidate pairs is close to the edge of the time tolerance window, the weight coefficient corresponding to time similarity is reduced and the weight coefficient corresponding to name similarity is increased, while the lower limit threshold of name similarity is set.

[0008] In one possible implementation, monetary similarity and temporal similarity are determined separately, including: The amount similarity is determined based on the difference or ratio difference between the transaction amount and the invoice amount in the matching candidate pair, and a preset fault tolerance threshold is allowed. The time similarity is determined based on the time difference between the transaction time and the invoice issuance time in the matching candidate pair, and the normalized result within the time tolerance window.

[0009] In one possible implementation, a matching decision operation is performed on the candidate matching pairs based on a preset confidence threshold, including: If the matching confidence is greater than or equal to the configured first threshold, then the matching candidate pair is confirmed as a valid match; If the matching confidence is less than the configured second threshold, then the matching candidate pair is confirmed as an invalid match; If the matching confidence level is between the second threshold and the first threshold, the matching candidate pair is marked as a fuzzy match and pushed to the manual review pool.

[0010] In one possible implementation, after performing the corresponding matching decision operation on the matching candidate pairs based on a preset confidence threshold, the method further includes: collecting the manual verification results of the fuzzy matching pairs in the manual review pool, and labeling the verification results as training data with valid-match or invalid-match labels; periodically using the training data to retrain the model that adjusts the weight coefficients, or to adjust the values of the first threshold and the second threshold; and updating the optimized model parameters or thresholds to the production environment.

[0011] In a second aspect, a heterogeneous data matching device is provided. The device may include: an acquisition unit, used to acquire transaction flow data and invoice data from the transaction flow system and the invoice recognition system, respectively; a partitioning unit, used to divide the processed transaction flow data and invoice data into buckets according to a preset time tolerance window, and to identify the transaction flow data and invoice data in the same time bucket as matching candidate pairs; a determining unit, used to perform multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs, determine the amount similarity, time similarity, and name similarity respectively, adjust the weight coefficient of each similarity according to the features of the matching candidate pairs, and determine the matching confidence of the matching candidate pairs based on the three similarities and their corresponding weight coefficients; and an execution unit, used to perform the corresponding matching decision operation on the matching candidate pairs based on a preset confidence threshold.

[0012] In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the steps of any of the methods described in the first aspect above.

[0013] Fourthly, a computer-readable storage medium is provided, wherein a computer program is stored therein, and when executed by a processor, the computer program implements the steps of any of the methods described in the first aspect above.

[0014] This application provides a method and apparatus for matching heterogeneous data. By employing time-series bucketing technology, it transforms the massive data matching problem into multiple small-scale sub-matching problems, significantly improving matching efficiency. Through merchant name semantic normalization and a hybrid similarity algorithm, it effectively solves the matching failure problem caused by the heterogeneity of merchant names across systems. The multi-factor fuzzy matching algorithm with dynamic weight allocation can adapt to varying errors and deviations in business operations, improving the robustness of matching. Through continuous quantification of matching confidence and a three-level decision-making mechanism, it achieves an optimal configuration for human-machine collaboration, significantly reducing the rate of manual intervention. Through a feedback optimization loop, the system can self-learn and continuously optimize, ensuring long-term stability and accuracy.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart illustrating a method for matching heterogeneous data provided in an embodiment of this application; Figure 2 is a schematic diagram of the structure of a heterogeneous data matching device provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0017] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0018] Currently, in corporate expense control and personal advance payment scenarios, employees or users generate transaction records through online payment platforms, while compliant reimbursement requires the submission of corresponding invoices as proof. Since payment systems and invoice issuance systems are typically two independent and heterogeneous systems, they often lack a unique and reproducible business ID. Therefore, matching massive transaction records with corresponding invoices is a crucial step in the expense control and reimbursement process. Existing technologies mainly rely on precise matching of transaction amounts and consistency of transaction times, but cannot quantify the reliability of the matching. Fuzzy matching results still require significant manual verification, and complex scenarios involving multiple transactions corresponding to a single invoice or a single transaction corresponding to multiple invoices are difficult to handle.

[0019] Therefore, this application provides a method for matching heterogeneous data to solve the above-mentioned problems existing in the prior art, and can realize automated and high-confidence matching of transaction flow and invoice data.

[0020] The preferred embodiments of this application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are for illustration and explanation only and are not intended to limit this application. Furthermore, the embodiments and features in the embodiments of this application can be combined with each other without conflict.

[0021] Figure 1 is a flowchart illustrating a method for matching heterogeneous data provided in an embodiment of this application. As shown in Figure 1, the method may include: Step S110: Obtain transaction flow data and invoice data from the transaction flow system and invoice recognition system, respectively.

[0022] Specifically, this step involves the collection of heterogeneous data. The transaction log system includes bank clearing systems and third-party payment platform systems (such as WeChat and Alipay). The collected transaction log data must include core fields such as transaction amount, transaction time, and merchant name. The invoice recognition system is an invoice data extraction system based on optical character recognition (OCR) technology. It can extract invoice data from paper invoices and electronic invoices. The collected invoice data must include core fields such as invoice amount, invoice date, and seller name.

[0023] The collected transaction log data and invoice data are heterogeneous data from different systems, with issues such as inconsistent field formats, inconsistent data precision, and redundant names. Therefore, before proceeding to the subsequent bucketing stage, both types of data need to be cleaned and normalized. Specific processing includes: unifying the transaction amounts in the transaction flow data and the invoice amounts in the invoice data into the smallest currency unit (such as "cents" for RMB), which eliminates the precision errors caused by floating-point calculations and ensures a consistent matching benchmark in the amount dimension; unifying the transaction time in the transaction flow data and the invoicing time in the invoice data into the corresponding UTC timestamp, which standardizes the date and time format and resolves heterogeneous time formats across systems and time zones; and performing text preprocessing on the transaction merchant name and the seller name, which removes noise words from the name (such as the prefixes or suffixes "payment on behalf", "POS", "Alipay", "*", etc.), converts full-width characters to half-width characters, and eliminates meaningless redundant information in preparation for the subsequent name similarity calculation.
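The three normalization steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the `NOISE_WORDS` list, and the fixed-offset time parsing are assumptions made for the example, and the noise-word removal is a naive substring replacement.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal, ROUND_HALF_UP
import unicodedata

# Hypothetical noise-word list; the text gives these only as examples.
NOISE_WORDS = ("payment on behalf", "POS", "Alipay", "*")


def to_minor_units(amount: str) -> int:
    """Convert a decimal amount string (e.g. '12.30' yuan) to integer cents."""
    cents = (Decimal(amount) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


def to_utc_timestamp(local_time: str, utc_offset_hours: int) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS' at a fixed UTC offset; return UTC epoch seconds."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    parsed = datetime.strptime(local_time, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(parsed.timestamp())


def clean_name(name: str) -> str:
    """Fold full-width characters to half-width, drop noise words, collapse spaces."""
    text = unicodedata.normalize("NFKC", name)
    for word in NOISE_WORDS:
        text = text.replace(word, " ")
    return " ".join(text.split())
```

Using `Decimal` rather than `float` for the conversion keeps the amount exact before it becomes an integer, which is the point of moving to the smallest currency unit.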

[0024] After the above processing, transaction flow data and invoice data with unified formats and regularized information are obtained, which serve as the basic data for subsequent bucket partitioning.

[0025] In some embodiments, user behavior profiles are introduced to optimize the setting of the time tolerance window, replacing a traditional fixed window or static rules that are adjusted only by business type. In a specific implementation, a user profile model is trained on the user's historical matching data (such as the distribution of transaction-invoice time differences over the past 3 months and the pass rate of manual review), and user-level features such as "average time difference", "standard deviation of time difference", and "high-frequency consumption period" are extracted. For the transaction flow data currently to be matched, the user profile model is called to predict the "expected time difference range" of the transaction, and the length of the time tolerance window is dynamically adjusted based on the prediction (for example, when the predicted standard deviation of the time difference is large, the window length is increased by 20%; for transactions during the high-frequency consumption period, the window length is shortened by 10%). The dynamically adjusted time tolerance window is used for subsequent bucket partitioning, which avoids the "overmatching" caused by a fixed window that is too large and the "missed matching" caused by one that is too small, and improves matching accuracy in personalized scenarios.
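As a rough illustration of the window adjustment, the sketch below replaces the trained user profile model with a simple rule on the standard deviation of past time differences. The function name, the `large_std_s` cutoff for what counts as a "large" standard deviation, and the use of a rule instead of a model are all assumptions; only the +20% and -10% factors come from the example in the text.

```python
from statistics import pstdev


def adjust_window(
    base_window_s: int,
    history_gaps_s: list[float],
    in_peak_period: bool,
    large_std_s: float = 6 * 3600,  # assumed cutoff for a "large" spread
) -> int:
    """Return a per-user time tolerance window in seconds."""
    window = float(base_window_s)
    # Widen the window when the user's past transaction-invoice gaps vary a lot.
    if len(history_gaps_s) >= 2 and pstdev(history_gaps_s) > large_std_s:
        window *= 1.20
    # Narrow the window during the user's high-frequency consumption period.
    if in_peak_period:
        window *= 0.90
    return round(window)
```

For a 24-hour base window, a user with widely spread historical gaps would get roughly 28.8 hours, and a transaction in a peak period would get roughly 21.6 hours.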

[0026] Step S120: Perform bucket partitioning on the processed transaction flow data and invoice data according to the preset time tolerance window, and determine the transaction flow data and invoice data within the same time bucket as matching candidate pairs.

[0027] Among them, performing bucket partitioning on the processed transaction flow data and invoice data according to the preset time tolerance window includes: For each transaction flow data, based on the transaction time of this transaction flow data, extend the length of the time tolerance window forward and backward to form a valid time interval; Record the invoice data whose invoicing time falls within the valid time interval into the same time bucket as this transaction flow data; within each time bucket, sort the transaction flow data and invoice data according to the amount respectively and establish an index.

[0028] Processing the transaction log data and invoice data includes: converting the transaction amount in the transaction log data and the invoice amount in the invoice data into the smallest currency unit; converting the transaction time in the transaction log data and the invoice time in the invoice data into corresponding timestamps; and performing text cleaning on the merchant names and seller names, to obtain the processed transaction log data and invoice data.

[0029] This step is the pre-screening stage for matching candidate pairs. Its core is to transform global matching of massive heterogeneous data into local matching within time buckets using time-series bucketing technology. This significantly reduces the matching search space and lowers the matching complexity from O(N×M) to near O(N), ensuring the algorithm's efficiency in massive data environments. The time tolerance window is a preset time range (e.g., 24 hours, 48 hours) based on the actual business scenario and can be adaptively adjusted according to the business type (e.g., longer time tolerance windows for travel-related consumption, shorter time tolerance windows for daily office consumption).

[0030] The specific implementation steps for bucketing based on the preset time tolerance window are as follows: For each processed transaction log, a unique valid time interval is formed by extending the time tolerance window forward and backward from the transaction time (UTC timestamp) of that transaction log. The valid time interval is defined as [Ti.Time − WTolerance, Ti.Time + WTolerance], where Ti.Time is the transaction time of the i-th transaction and WTolerance is the preset time tolerance window. All processed invoice data are then traversed to determine whether the invoice date (UTC timestamp) of each invoice falls within the valid time interval of a given transaction. If it does, the invoice is assigned to the same time bucket as that transaction and becomes a potential matching object for it. Within each generated time bucket, the transaction data and invoice data are sorted in ascending or descending order by amount, and an amount index is created over the sorted data to further improve the efficiency of subsequent multi-dimensional fuzzy matching queries and calculations.

[0031] After completing the above bucket division, the transaction flow data and invoice data within the same time bucket are paired up to determine the matching candidate pairs for this matching. Data from different time buckets are directly excluded and do not participate in subsequent matching calculations.
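The bucketing step can be sketched with a binary search over invoice timestamps. This is one possible realization under stated assumptions: records are plain dictionaries with hypothetical `id`, `ts` (UTC epoch seconds), and `amount` (minor units) keys, and each transaction gets its own bucket keyed by its ID.

```python
from bisect import bisect_left, bisect_right


def build_time_buckets(transactions, invoices, window_s):
    """Map each transaction id to the invoices inside its valid time interval."""
    # Sort invoices by time once so each interval lookup is a binary search.
    by_time = sorted(invoices, key=lambda inv: inv["ts"])
    timestamps = [inv["ts"] for inv in by_time]

    buckets = {}
    for txn in transactions:
        lo = bisect_left(timestamps, txn["ts"] - window_s)
        hi = bisect_right(timestamps, txn["ts"] + window_s)
        # Within the bucket, order candidates by amount (the "amount index").
        buckets[txn["id"]] = sorted(by_time[lo:hi], key=lambda inv: inv["amount"])
    return buckets
```

With invoices sorted once, each lookup costs O(log M) plus the size of the bucket, so the total work stays close to linear in the number of transactions when buckets are small.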

[0032] Step S130: Perform multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs, determine the similarity of amount, time, and name respectively, and adjust the weight coefficient of each similarity according to the features of the matching candidate pairs.

[0033] The weight coefficients of each similarity are dynamically adjusted based on the features of the matching candidate pairs, including: If the amount similarity of the matching candidate pairs reaches the configured matching state, the weight coefficient corresponding to the amount similarity is increased, and the weight coefficients corresponding to the time similarity and name similarity are decreased. If the time difference of matching candidate pairs is close to the edge of the time tolerance window, the weight coefficient corresponding to time similarity is reduced and the weight coefficient corresponding to name similarity is increased, while the lower limit threshold of name similarity is set.

[0034] Among them, determining the similarity of amounts and the similarity of times includes: The amount similarity is determined based on the difference or ratio between the transaction amount and the invoice amount in the matching candidate pair, and a preset fault tolerance threshold is allowed. Time similarity is determined based on the time difference between the transaction time and the invoice issuance time in the matching candidate pair, and the normalized result within the time tolerance window.

[0035] This step is the core computational step in heterogeneous data matching. It achieves flexible evaluation through fuzzy matching of three core dimensions: amount, time, and name. Simultaneously, a dynamic weight allocation mechanism is introduced, adjusting the weight coefficients of the similarity of each dimension based on the real-time characteristics of the matching candidate pairs. This adapts to the variable amounts and time errors in business scenarios, avoiding distortion of the overall matching result due to weak matching in a single dimension. The method for determining the similarity of each dimension and the dynamic adjustment rules of the weight coefficients in this step are the core technical features of the invention. Specific implementation details are as follows: A. Determining the similarity of monetary amounts: The amount similarity is determined based on the difference or ratio difference between the transaction amount and the invoice amount in the matching candidate pair, and a preset error tolerance threshold is allowed (such as absolute amount error ±0.05 yuan, ratio error ±1%), which is suitable for actual business scenarios such as rounding and service fee adjustment.

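[0036] In the specific calculation, the amount ratio similarity is calculated first. The formula itself is not reproduced in this text; consistent with the definitions that follow, it presumably takes the form: Amount ratio similarity = 1 − Difference / Max(Amounts).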

[0037] Where Difference is the absolute difference between the transaction amount and the invoice amount in the matching candidate pair, and Max(Amounts) is the maximum value between the transaction amount and the invoice amount; if the calculated proportional similarity falls within the preset fault tolerance threshold range (such as 0.995~1.0), a high score is assigned to the amount similarity, and finally the amount proportional similarity is normalized to the [0,1] interval as the amount similarity of the matching candidate pair.
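A hedged sketch of the amount similarity follows. It assumes the ratio similarity is 1 − Difference / Max(Amounts), which fits the definitions above and the 0.995~1.0 tolerance range. The `HIGH_SCORE` value of 0.99 assigned inside the tolerance band, the parameter names, and the choice to reserve 1.0 for an exact match are assumptions for illustration.

```python
HIGH_SCORE = 0.99  # assumed "high score" for amounts within tolerance


def amount_similarity(
    txn_cents: int,
    inv_cents: int,
    abs_tol_cents: int = 5,      # e.g. an absolute error of 0.05 yuan
    ratio_floor: float = 0.995,  # lower end of the tolerance range
) -> float:
    """Return an amount similarity in [0, 1] for amounts in minor units."""
    if txn_cents <= 0 or inv_cents <= 0:
        return 0.0
    difference = abs(txn_cents - inv_cents)
    if difference == 0:
        return 1.0  # exact match
    ratio_similarity = 1.0 - difference / max(txn_cents, inv_cents)
    # Inside the tolerance band, never score below the assumed high score.
    if difference <= abs_tol_cents or ratio_similarity >= ratio_floor:
        return max(ratio_similarity, HIGH_SCORE)
    return max(0.0, ratio_similarity)
```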

[0038] B. Determining time similarity: Time similarity is determined based on the time difference between the transaction time and the invoice issuance time in the matching candidate pair, and is a normalized result within a preset time tolerance window. The smaller the time difference, the higher the time similarity.

[0039] In the specific calculation, the time difference Δt between the transaction time and the invoice issuance time of the matching candidate pair is calculated first, and Δt is then normalized by the time tolerance window WTolerance. The normalization formula is not reproduced in this text; a form consistent with the description is: Time similarity = 1 − |Δt| / WTolerance. Finally, the result is normalized to the [0,1] interval and used as the time similarity of the matching candidate pair; if the time difference Δt is close to the edge of the time tolerance window WTolerance, the time similarity approaches 0.
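The time similarity can be illustrated with a linear decay, 1 − |Δt| / WTolerance. The linear shape is an assumption, chosen as one form consistent with the description (1 at zero difference, approaching 0 at the window edge); the function name is hypothetical.

```python
def time_similarity(txn_ts: int, inv_ts: int, window_s: int) -> float:
    """Return a time similarity in [0, 1]; timestamps are UTC epoch seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    delta_t = abs(txn_ts - inv_ts)
    # Linear decay from 1 (same instant) to 0 (at or beyond the window edge).
    return max(0.0, 1.0 - delta_t / window_s)
```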

[0040] C. Determining name similarity: Determining name similarity is the core of solving the problem of heterogeneous merchant names across systems. It requires first semantically normalizing the merchant names and seller names in the matching candidate pairs, and then using a hybrid similarity algorithm for calculation. The specific implementation steps are as follows: Maintain a merchant alias database, which stores the abbreviation, common name, and full name of core merchants together with the corresponding normalized merchant entity ID. Use Named Entity Recognition (NER) technology from Natural Language Processing (NLP) to map the cleaned merchant names and seller names to normalized merchant entity IDs in the merchant alias database. If both are successfully mapped to the same normalized merchant entity ID, assign a name similarity value of 1.0 (perfect match). If the merchant name and the seller name cannot be mapped to the same normalized merchant entity ID, a hybrid similarity calculation is initiated. First, the prefix matching degree of the two names is evaluated by the Jaro-Winkler similarity (the core information of a merchant name is usually concentrated in the prefix). Then, the lexical overlap of the two names is evaluated by the N-gram Jaccard similarity (N is 3 or 4). Finally, the scores calculated by the two algorithms are fused by weighted averaging, and the fusion result is normalized to the [0,1] interval as the name similarity of the matching candidate pair.
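The hybrid name similarity can be sketched in plain Python. Several parts are assumptions made for the example: the `ALIASES` dictionary stands in for the merchant alias database and the NER mapping, the equal fusion weight `jw_weight=0.5` is arbitrary, and the Jaro-Winkler routine is a textbook implementation rather than anything specified in the text.

```python
# Hypothetical alias table: cleaned, lowercased name -> normalized entity ID.
ALIASES = {"starbucks": "M001", "starbucks coffee co ltd": "M001"}


def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # common prefix, capped at 4 characters
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def ngram_jaccard(s1: str, s2: str, n: int = 3) -> float:
    grams1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    grams2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    if not grams1 or not grams2:
        return 1.0 if s1 == s2 else 0.0
    return len(grams1 & grams2) / len(grams1 | grams2)


def name_similarity(merchant: str, seller: str, jw_weight: float = 0.5) -> float:
    a, b = merchant.lower(), seller.lower()
    entity_a, entity_b = ALIASES.get(a), ALIASES.get(b)
    if entity_a is not None and entity_a == entity_b:
        return 1.0  # both names resolve to the same normalized entity
    return jw_weight * jaro_winkler(a, b) + (1 - jw_weight) * ngram_jaccard(a, b)
```

Because both component scores already lie in [0, 1] and the fusion weights sum to 1, the fused score needs no further normalization.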

[0041] In some embodiments, reinforcement learning or a rule-based adaptive model is used to adjust the weight coefficients of amount similarity, time similarity, and name similarity in real time based on the features of the matching candidate pairs. Each weight coefficient lies in [0,1], and the weight coefficients of all dimensions sum to 1. The core adjustment rules are as follows. Weight adjustment when the amount is perfectly matched: if the amount similarity of the matching candidate pair reaches the configured perfect match state (e.g., amount similarity = 1.0, meaning there is no difference between the transaction amount and the invoice amount), the weight coefficient corresponding to the amount similarity is increased (e.g., to 0.6), and the weight coefficients corresponding to the time similarity and name similarity are reduced accordingly, so that the matching result of the amount dimension becomes the core evaluation basis. Weight adjustment when the time difference is close to the window edge: if the time difference Δt between the transaction time and the invoice issuance time of the matching candidate pair is close to the edge of the time tolerance window WTolerance, the weight coefficient corresponding to time similarity is reduced (e.g., to 0.1), and the weight coefficient corresponding to name similarity is increased. At the same time, a hard lower limit (e.g., 0.8) is set for name similarity, which forces the name dimension to reach a high degree of matching before the pair can participate in the subsequent confidence calculation, thus preventing erroneous matches caused by coincidence in time.
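The two rule-based adjustments can be sketched as below. The base weights `(0.4, 0.3, 0.3)`, the `edge_time_sim` cutoff that defines "close to the window edge", and the way the remaining weight is redistributed are assumptions; the 0.6, 0.1, and 0.8 values are the examples given in the text. The sketch covers only the rule-based variant, not reinforcement learning.

```python
def adjust_weights(
    s_amount: float,
    s_time: float,
    s_name: float,
    base=(0.4, 0.3, 0.3),        # assumed default (amount, time, name) weights
    edge_time_sim: float = 0.1,  # assumed: time similarity this low is "near the edge"
    name_floor: float = 0.8,
):
    """Return ((w_amount, w_time, w_name), eligible) with weights summing to 1."""
    w_amount, w_time, w_name = base
    eligible = True

    # Rule 1: exact amount match makes the amount dimension dominant.
    if s_amount == 1.0:
        rest = w_time + w_name
        w_time, w_name = 0.4 * w_time / rest, 0.4 * w_name / rest
        w_amount = 0.6

    # Rule 2: near the window edge, shift weight from time to name and
    # require the name similarity to clear a hard lower limit.
    if s_time <= edge_time_sim:
        reduced = min(w_time, 0.1)
        w_name += w_time - reduced
        w_time = reduced
        eligible = s_name >= name_floor

    return (w_amount, w_time, w_name), eligible
```

A pair flagged as not eligible would be excluded from the confidence calculation regardless of its weighted score.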

[0042] In some embodiments, the weight coefficients of each dimension can be flexibly adjusted according to the risk model of the actual business to ensure that the algorithm is adapted to different business error scenarios.

[0043] In some embodiments, for business scenarios where the amount of a single transaction exceeds the limit of a single invoice (such as the maximum limit of a VAT invoice), a multi-invoice aggregation and matching function is supported. Specifically, after time bucket division, the system automatically checks whether the transaction amount in the matching candidate pair exceeds a preset threshold for the maximum amount of a single invoice. If it does, the multi-invoice combination matching logic is triggered: the transaction data is included in the candidate set along with multiple invoices within the same time bucket whose seller name similarity is ≥ a preset lower limit (e.g., 0.85). A dynamic programming algorithm is used to calculate the "minimum difference between the transaction amount and the sum of the amounts of multiple invoices," and the combination weight is dynamically adjusted based on name similarity and time similarity (e.g., the more combined invoices, the lower the weight of time similarity). Finally, the combination matching confidence score is calculated. If the combination matching confidence score is ≥ a first threshold, the association between the transaction data and multiple invoices is automatically confirmed, and the multiple invoice IDs are bound and stored with the transaction ID. This mechanism solves the problem of heterogeneous data matching in large transactions where "a single transaction corresponds to multiple invoices."
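The "minimum difference between the transaction amount and the sum of the amounts of multiple invoices" can be found with a subset-sum style dynamic program over amounts in minor units. This is a sketch under assumptions: the function name and the `overshoot_cents` parameter are hypothetical, and the name-similarity prefilter and combination weighting described above are omitted.

```python
def best_invoice_combo(txn_cents: int, invoice_cents: list[int], overshoot_cents: int = 0):
    """Return (indices, gap): the invoice subset whose total is closest to txn_cents."""
    cap = txn_cents + overshoot_cents
    # reachable maps an achievable total to one subset of indices producing it.
    reachable = {0: ()}
    for idx, amount in enumerate(invoice_cents):
        # Iterate over a snapshot so each invoice is used at most once.
        for total, combo in list(reachable.items()):
            new_total = total + amount
            if new_total <= cap and new_total not in reachable:
                reachable[new_total] = combo + (idx,)
    best_total = min(reachable, key=lambda total: abs(txn_cents - total))
    return reachable[best_total], abs(txn_cents - best_total)
```

For a 1,500.00 yuan transaction and invoices of 1,000.00, 500.00, and 300.00 yuan, the first two invoices are selected with a gap of zero. The number of stored totals is bounded by `cap + 1`, so very large amounts may call for a coarser unit or a pruned search.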

[0044] Step S140: Based on the similarity of amount, time, and name, and the corresponding weight coefficients, determine the matching confidence of the matching candidate pairs, and perform the corresponding matching decision operation on the matching candidate pairs according to the preset confidence threshold.

[0045] The determined similarity in amount, time, and name is denoted as S1, S2, and S3, respectively, and the corresponding dynamically adjusted weight coefficients are denoted as W1, W2, and W3. The final matching confidence of the matching candidate pair is calculated using a weighted summation formula: Confidence = W1 × S1 + W2 × S2 + W3 × S3. The calculated matching confidence is normalized to the interval [0,1]. The higher the value, the stronger the matching reliability of the matching candidate pair.
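The weighted summation can be sketched in Python as follows. The function name `matching_confidence` and the input validation are illustrative assumptions. Because each similarity lies in [0, 1] and the weights sum to 1, the result already falls in [0, 1], and the final clamp only guards against floating-point rounding.

```python
def matching_confidence(similarities, weights):
    """Confidence = W1*S1 + W2*S2 + W3*S3 for (amount, time, name)."""
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    if any(not 0.0 <= s <= 1.0 for s in similarities):
        raise ValueError("each similarity must lie in [0, 1]")
    if any(not 0.0 <= w <= 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    score = sum(w * s for w, s in zip(weights, similarities))
    # Clamp to [0, 1] to absorb floating-point rounding.
    return min(1.0, max(0.0, score))
```

For example, with S1 = 1.0, S2 = 0.8, S3 = 0.9 and W1 = 0.6, W2 = 0.1, W3 = 0.3, the confidence is 0.6 + 0.08 + 0.27 = 0.95.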

[0046] The matching decision operation is performed on the candidate matching pairs according to the preset confidence threshold, including: If the matching confidence is greater than or equal to the configured first threshold, then the matching candidate pair is confirmed as a valid match; If the matching confidence is less than the configured second threshold, the candidate pair is confirmed as an invalid match. If the matching confidence level is between the second threshold and the first threshold, the matching candidate pair is marked as a fuzzy match and pushed to the manual review pool.

[0047] Specifically, two core confidence thresholds are preset: a first threshold (high confidence threshold) and a second threshold (low confidence threshold), wherein the first threshold is preferably 0.95 and the second threshold is preferably 0.70; based on the relationship between the matching confidence and the two thresholds, the corresponding matching decision operation is performed, and the specific rules are as follows: If the matching confidence of a candidate pair is greater than or equal to the configured first threshold, the candidate pair is confirmed as a valid match. The system automatically associates the transaction data with the invoice data and writes the association ID back to the business database without manual intervention. If the matching confidence of a candidate pair is less than the configured second threshold, the candidate pair is confirmed as an invalid match, and the system marks the candidate pair as abnormal data and does not associate it. If the matching confidence of a candidate pair is between the second threshold and the first threshold, the candidate pair is marked as a fuzzy match and pushed to the manual review pool, where manual reviewers will make a one-to-one matching result decision.
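The three-level decision rule can be sketched in Python as follows, using the preferred thresholds of 0.95 and 0.70 as defaults. The names `Decision` and `decide` are hypothetical. A confidence exactly equal to the second threshold is routed to manual review, since the text treats only values below it as invalid.

```python
from enum import Enum


class Decision(Enum):
    VALID_MATCH = "valid_match"        # associate automatically
    MANUAL_REVIEW = "manual_review"    # fuzzy match, push to review pool
    INVALID_MATCH = "invalid_match"    # mark as abnormal, do not associate


def decide(confidence, first_threshold=0.95, second_threshold=0.70):
    """Map a matching confidence to one of the three decision levels."""
    if second_threshold > first_threshold:
        raise ValueError("second_threshold must not exceed first_threshold")
    if confidence >= first_threshold:
        return Decision.VALID_MATCH
    if confidence < second_threshold:
        return Decision.INVALID_MATCH
    return Decision.MANUAL_REVIEW
```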

[0048] In some embodiments, after performing the corresponding matching decision operation on the matching candidate pairs according to the preset confidence threshold, the method further includes: collecting the manual review results of the fuzzy matching pairs in the manual review pool, and labeling the review results as training data with valid-match or invalid-match labels; periodically retraining the model that dynamically adjusts the weight coefficients using the training data, or adjusting the values of the first threshold and the second threshold; and updating the optimized model parameters or thresholds to the production environment.

[0049] Specifically, the manual review results of all fuzzy matching pairs in the manual review pool are collected. The review results are divided into two categories: "valid match" and "invalid match". The multi-dimensional similarities, weight coefficients, and matching confidence of each fuzzy matching pair are associated with the corresponding manual review result and labeled as training data with valid/invalid match labels. The labeled training data is input into the dynamic weight allocation model, and the model is periodically retrained to optimize its weight adjustment rules and parameters. At the same time, the values of the first threshold and the second threshold can be adjusted based on statistical analysis of the manual review results to achieve adaptive optimization of the thresholds. The core objective of the optimization is to maximize the automatic matching rate while minimizing the false matching rate. The retrained dynamic weight allocation model parameters, or the adjusted confidence thresholds, are updated to the production environment in a timely manner so that subsequent heterogeneous data matching calculations use the optimized model and parameters, thereby achieving continuous improvement in algorithm performance.
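One simple way to adjust the first threshold from the review results is sketched below in Python. It is only an illustrative heuristic, not the retraining procedure of the application. The function name `tune_first_threshold` and the `max_false_match_rate` parameter are assumptions. The sketch lowers the first threshold to the smallest reviewed confidence at which the share of invalid matches among all reviewed pairs at or above that confidence stays within the allowed rate, which raises the automatic matching rate while bounding false matches.

```python
def tune_first_threshold(reviewed, current=0.95, max_false_match_rate=0.01):
    """Suggest a new first threshold from manually reviewed fuzzy matches.

    reviewed: iterable of (confidence, is_valid) pairs from the review pool.
    Returns the current threshold if no lower value satisfies the rate.
    """
    ordered = sorted(reviewed, key=lambda item: item[0], reverse=True)
    best = current
    accepted = 0
    false_matches = 0
    for position, (confidence, is_valid) in enumerate(ordered):
        accepted += 1
        false_matches += 0 if is_valid else 1
        # Only evaluate once all pairs sharing this confidence are counted.
        is_last_of_group = (
            position + 1 == len(ordered)
            or ordered[position + 1][0] != confidence
        )
        if is_last_of_group and false_matches / accepted <= max_false_match_rate:
            # Never raise the threshold above its current value.
            best = min(best, confidence)
    return best
```

Because only fuzzy matches reach the review pool, the reviewed confidences lie between the two thresholds, so this heuristic can only lower the first threshold. Raising it would require auditing automatically confirmed matches as well.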

[0050] This application provides a method for matching heterogeneous data. The method includes: acquiring transaction data and invoice data from a transaction log system and an invoice recognition system, respectively; dividing the processed transaction data and invoice data into buckets according to a preset time tolerance window, and identifying transaction data and invoice data within the same time bucket as matching candidate pairs; performing multi-dimensional fuzzy matching on the matching candidate pairs based on amount, time, and name, determining the amount similarity, time similarity, and name similarity respectively, and adjusting the weight coefficients of each similarity according to the characteristics of the matching candidate pairs; determining the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients; and performing corresponding matching decision operations on the matching candidate pairs according to a preset confidence threshold. Through a complete technical solution involving time series bucketing, multi-dimensional fuzzy matching, dynamic weight allocation, hierarchical confidence decision-making, and feedback optimization, this method solves the core technical challenge of matching heterogeneous data across systems.
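The time-bucketing step that reduces the matching problem to small sub-problems can be sketched in Python as follows. It assumes that each bucket is formed by extending the time tolerance window on both sides of a transaction's timestamp and that records inside a bucket are ordered by amount. The function name `build_time_buckets` and the tuple layout `(id, timestamp, amount)` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_time_buckets(transactions, invoices, window_seconds):
    """Group invoices into one time bucket per transaction.

    transactions, invoices: lists of (id, timestamp, amount_minor_units).
    Returns {transaction_id: [invoice records sorted by amount]}.
    """
    # Sort invoices once by issuance time so each window is a binary search.
    by_time = sorted(invoices, key=lambda record: record[1])
    times = [record[1] for record in by_time]

    buckets = {}
    for tx_id, tx_time, _amount in transactions:
        low = bisect_left(times, tx_time - window_seconds)
        high = bisect_right(times, tx_time + window_seconds)
        # Invoices inside [tx_time - W, tx_time + W], ordered by amount.
        buckets[tx_id] = sorted(by_time[low:high], key=lambda record: record[2])
    return buckets
```

Sorting the invoices once and locating each window by binary search avoids comparing every transaction with every invoice, so only the pairs within a bucket go on to the similarity calculation.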

[0051] Corresponding to the above method, embodiments of this application also provide a heterogeneous data matching device. As shown in Figure 2, the device includes: an acquisition unit 210, used to acquire transaction flow data and invoice data from the transaction flow system and the invoice recognition system, respectively; a partitioning unit 220, used to divide the processed transaction flow data and invoice data into buckets according to a preset time tolerance window, and to determine the transaction flow data and invoice data in the same time bucket as matching candidate pairs; a determining unit 230, used to perform multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs, determine the amount similarity, time similarity, and name similarity respectively, adjust the weight coefficient of each similarity according to the features of the matching candidate pairs, and determine the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients; and an execution unit 240, used to perform the corresponding matching decision operation on the matching candidate pairs according to a preset confidence threshold.

[0052] The functions of each functional unit of the heterogeneous data matching device provided in the above embodiments of this application can be implemented through the above method steps. Therefore, the specific working process and beneficial effects of each unit in the device provided in the embodiments of this application will not be repeated here.

[0053] This application also provides an electronic device. As shown in Figure 3, it includes a processor 310, a communication interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communication interface 320, and the memory 330 communicate with each other through the communication bus 340.

[0054] The memory 330 is used to store computer programs. When the processor 310 executes the program stored in the memory 330, it performs the following steps: obtaining transaction flow data and invoice data from the transaction flow system and the invoice recognition system, respectively; dividing the processed transaction data and invoice data into buckets based on the preset time tolerance window, and identifying the transaction data and invoice data in the same time bucket as matching candidate pairs; performing multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs to determine the amount similarity, time similarity, and name similarity, and adjusting the weight coefficient of each similarity according to the features of the matching candidate pairs; and determining the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients, and performing the corresponding matching decision operation on the matching candidate pairs according to the preset confidence threshold.

[0055] The communication bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0056] The communication interface is used for communication between the aforementioned electronic devices and other devices.

[0057] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0058] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0059] For the implementation and beneficial effects of the components of the electronic device in the above embodiment, reference may be made to the steps of the method embodiment shown in Figure 1. Therefore, the specific working process and beneficial effects of the electronic device provided in this application will not be repeated here.

[0060] In another embodiment provided in this application, a computer-readable storage medium is also provided, which stores instructions that, when executed on a computer, cause the computer to perform a heterogeneous data matching method as described in any of the above embodiments.

[0061] In another embodiment provided in this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the heterogeneous data matching methods described in the above embodiments.

[0062] Those skilled in the art will understand that the embodiments in this application can be provided as methods, systems, or computer program products. Therefore, the embodiments in this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments in this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0063] This application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0064] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0065] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0066] Unless otherwise defined, the technical or scientific terms used in this application shall have the ordinary meaning understood by one of ordinary skill in the art to which this invention pertains. The terms "first," "second," and similar terms used in this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the element or object listed following the word and its equivalents, without excluding other elements or objects. Terms such as "connected," "coupled," or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0067] Although preferred embodiments have been described in this application, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the embodiments in this application are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments in this application.

[0068] Obviously, those skilled in the art can make various modifications and variations to the embodiments of this application without departing from the spirit and scope of the embodiments of this application. Therefore, if these modifications and variations to the embodiments of this application fall within the scope of the embodiments of this application and their equivalents, then these modifications and variations are also intended to be included in the embodiments of this application.

Claims

1. A method for matching heterogeneous data, characterized in that the method includes: obtaining transaction flow data and invoice data from a transaction flow system and an invoice recognition system, respectively; dividing the processed transaction data and invoice data into buckets based on a preset time tolerance window, and identifying the transaction data and invoice data in the same time bucket as matching candidate pairs; performing multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs to determine the amount similarity, time similarity, and name similarity, and adjusting the weight coefficient of each similarity according to the features of the matching candidate pairs; and determining the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients, and performing the corresponding matching decision operation on the matching candidate pairs according to a preset confidence threshold.

2. The method as described in claim 1, characterized in that processing the transaction log data and the invoice data includes: uniformly converting the transaction amounts in the transaction log data and the invoice amounts in the invoice data into the smallest currency unit; and uniformly converting the transaction time in the transaction log data and the invoice issuance time in the invoice data into corresponding timestamps.

3. The method as described in claim 1, characterized in that dividing the processed transaction log data and invoice data into buckets according to a preset time tolerance window includes: for each transaction record, extending the time tolerance window forward and backward from the transaction time of that transaction record to form a valid time interval; assigning the records in the invoice data whose invoice issuance time falls within the valid time interval to the same time bucket as the transaction record; and, within each time bucket, sorting and indexing the transaction data and invoice data by amount.

4. The method as described in claim 1, characterized in that dynamically adjusting the weight coefficient of each similarity based on the features of the matching candidate pairs includes: if the amount similarity of the matching candidate pair reaches the configured matching state, increasing the weight coefficient corresponding to the amount similarity and decreasing the weight coefficients corresponding to the time similarity and the name similarity; and if the time difference of the matching candidate pair is close to the edge of the time tolerance window, reducing the weight coefficient corresponding to the time similarity, increasing the weight coefficient corresponding to the name similarity, and setting a lower-limit threshold for the name similarity.

5. The method as described in claim 1, characterized in that determining the amount similarity and the time similarity includes: determining the amount similarity based on the difference or ratio difference between the transaction amount and the invoice amount in the matching candidate pair, with a preset error tolerance threshold allowed; and determining the time similarity based on the time difference between the transaction time and the invoice issuance time in the matching candidate pair, normalized within the time tolerance window.

6. The method as described in claim 1, characterized in that performing the corresponding matching decision operation on the matching candidate pairs based on a preset confidence threshold includes: if the matching confidence is greater than or equal to the configured first threshold, confirming the matching candidate pair as a valid match; if the matching confidence is less than the configured second threshold, confirming the matching candidate pair as an invalid match; and if the matching confidence is between the second threshold and the first threshold, marking the matching candidate pair as a fuzzy match and pushing it to the manual review pool.

7. The method as described in claim 6, characterized in that, after performing the corresponding matching decision operation on the matching candidate pairs according to the preset confidence threshold, the method further includes: collecting the manual review results of the fuzzy matching pairs in the manual review pool, and labeling the review results as training data with valid-match or invalid-match labels; periodically retraining the model that adjusts the weight coefficients using the training data, or adjusting the values of the first threshold and the second threshold; and updating the optimized model parameters or thresholds to the production environment.

8. A heterogeneous data matching device, characterized in that the device includes: an acquisition unit, used to acquire transaction flow data and invoice data from the transaction flow system and the invoice recognition system, respectively; a partitioning unit, used to divide the processed transaction data and invoice data into buckets according to a preset time tolerance window, and to identify the transaction data and invoice data in the same time bucket as matching candidate pairs; a determining unit, used to perform multi-dimensional fuzzy matching of amount, time, and name on the matching candidate pairs, determine the amount similarity, time similarity, and name similarity respectively, adjust the weight coefficient of each similarity according to the features of the matching candidate pairs, and determine the matching confidence of the matching candidate pairs based on the amount similarity, time similarity, and name similarity and their corresponding weight coefficients; and an execution unit, used to perform the corresponding matching decision operation on the matching candidate pairs based on a preset confidence threshold.

9. An electronic device, characterized in that the electronic device includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the steps of the method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method according to any one of claims 1-7.