A human-machine collaborative data association verification method and system based on multi-dimensional similarity fusion

The human-machine collaborative data association verification method using multi-dimensional similarity fusion solves the problems of incomplete association discovery and high false alarm rate in data table association management, achieving efficient and accurate data association identification and optimization, reducing enterprise costs and improving data governance efficiency.

CN122309485A · Pending · Publication Date: 2026-06-30 · JIANGSU DAMENG DATABASE CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: JIANGSU DAMENG DATABASE CO LTD
Filing Date: 2026-04-08
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for managing data table associations suffer from incomplete association discovery, a high false alarm rate, the lack of a confidence measurement mechanism, and weak human-machine collaboration. In particular, they struggle to identify implicit associations that arise from non-standard naming or missing foreign keys, and they cannot readily integrate manual review and experience feedback into a closed loop of continuous optimization.

Method used

A human-computer collaborative data association verification method based on multidimensional similarity fusion is adopted. The multidimensional similarity of field names is calculated by natural language processing technology, and the distribution similarity is calculated using cosine similarity or the Pearson correlation coefficient. Data standardization is performed and the value range overlap is calculated. The three results are combined to generate potential association candidates, and user feedback is recorded for model optimization.

Benefits of technology

It significantly improves the discovery rate of implicit associations, reduces the false positive rate, improves the accuracy of association recommendations, enhances the universality and robustness of the method, reduces manual sorting time, saves enterprise data governance costs, and provides intelligent data lineage management and data asset catalog construction solutions.

✦ Generated by Eureka AI based on patent content.

Figure CN122309485A_ABST

Abstract

This invention discloses a human-computer collaborative data association verification method and system based on multi-dimensional similarity fusion. The method establishes an initial communication connection with the database to obtain data access permissions. Using natural language processing technology, it calculates multi-dimensional similarity for the collected field names to generate naming similarity. The multi-dimensional distribution features are mapped to vectors, and distribution similarity is calculated using cosine similarity or Pearson correlation coefficient. Data standardization is performed, and value range overlap is calculated. Potential association candidates are generated by combining the output results of naming similarity, distribution similarity, and value range overlap. All user operation feedback during the verification process, along with corresponding association features and confidence information, are recorded, and the feedback data is stored in the training set for subsequent iterative optimization. This invention effectively identifies implicit associations in data tables due to non-standard naming or lack of foreign keys, integrating manual review and experience feedback into a closed loop of continuous optimization.

Description

Technical Field

[0001] This invention relates to the field of data governance technology, and in particular to a human-machine collaborative data association verification method and system based on multi-dimensional similarity fusion.

Background Technology

[0002] In the course of enterprise digital transformation, databases have expanded rapidly and often contain hundreds or even thousands of data tables with complex relationships among them. At present, data lineage and table relationships are managed mainly through manual annotation, primary and foreign key constraints, and SQL log parsing.

[0003] While these methods have some effect, they still have significant shortcomings. Manual annotation relies on manual sorting and maintenance, which is inefficient and makes completeness and timeliness difficult to guarantee. Primary and foreign key constraints can only identify relationships that are explicitly defined in the system and cannot cover the large number of actual business relationships that lack explicit constraints. SQL log parsing derives relationships from historical queries, so it is ineffective for legacy systems, new tables, or scenarios with incomplete query coverage.

[0004] Overall, existing technologies generally suffer from incomplete association detection, a high false alarm rate, the lack of a confidence measurement mechanism, and weak human-machine collaboration. In particular, they struggle to identify implicit associations that arise from non-standard naming or missing foreign keys, and they cannot readily integrate manual review and experience feedback into a closed loop of continuous optimization.

Summary of the Invention

[0005] Purpose of the invention: This invention provides a human-computer collaborative data association verification method and system based on multi-dimensional similarity fusion, which can effectively identify implicit associations in data tables due to non-standard naming or lack of foreign keys, and integrate manual review and experience feedback into a closed loop of continuous optimization.

[0006] Technical solution: The present invention provides a human-computer collaborative data association verification method based on multi-dimensional similarity fusion, comprising the following steps:

[0007] Step 1: Establish an initial communication connection with the database, obtain data access permissions, and collect metadata;

[0008] Step 2: Based on natural language processing technology, perform multi-dimensional similarity calculation on the collected field names to generate naming similarity;

[0009] Step 3: Map the multi-dimensional distribution features into vectors, and calculate the distribution similarity using cosine similarity or Pearson correlation coefficient;

[0010] Step 4: Perform data standardization and calculate the overlap of value ranges;

[0011] Step 5: Combine the output results of naming similarity, distribution similarity, and value range overlap to generate potential association candidates;

[0012] Step 6: Record all user operation feedback during this verification process, as well as the corresponding associated features and confidence information. Store the feedback data in the training set for subsequent iterative optimization.

[0013] Furthermore, in step 1, the metadata information and statistical characteristics of the data table are automatically extracted, including basic metadata such as field name, data type, length, and constraint conditions, while statistical characteristics such as the number of unique values, null value rate, maximum value, minimum value, and average value are collected.

[0014] Furthermore, in step 2, based on natural language processing technology, multi-dimensional similarity calculations are performed on the collected field names to generate naming similarity, specifically including the following steps:

[0015] Step 21: Calculate the character-level similarity between field names using the edit distance algorithm, which counts the minimum number of editing operations (insertion, deletion, and replacement) required to turn one field name into the other. This quantifies character-level similarity and identifies spelling differences and minor distortions.

[0016] Step 22: A segment-level matching method combining variable-length N-gram-based Jaccard similarity and longest common subsequence (LCS) is used to identify local structural similarities such as abbreviations, truncations, and word order reversals in field names.

[0017] Step 23: Map field names to low-dimensional vectors using a deep semantic model and calculate cosine similarity to identify synonyms, business terms, and cross-language semantic alignment.

[0018] Step 24: Call the preset industry terminology rule library, consisting of regular expressions or pattern rules built on industry knowledge, to perform precise pattern matching on field names and identify fields that conform to specific naming conventions.

[0019] Step 25: Weight and fuse the character-level, fragment-level, semantic-level, and rule-matching similarity results above to generate the final comprehensive naming similarity score.

[0020] Furthermore, step 22, identifying the local structural similarity in the field names, specifically includes the following steps:

[0021] (1) Construction of variable-length N-gram sets;

[0022] Treat the field name s to be matched as a string, and generate a set G(s) of all continuous substrings of length n≥1:

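[0023] $$G(s) = \{\, s[i..i+n-1] \mid 1 \le n \le |s|,\ 1 \le i \le |s|-n+1 \,\}$$ where s[i..i+n-1] denotes the substring of s that starts at position i and has length n.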

[0024] (2) Calculation of Jaccard coefficient;

[0025] For two field names s1 and s2, the Jaccard coefficient is defined as follows:

[0026] $$J(s_1, s_2) = \frac{|G(s_1) \cap G(s_2)|}{|G(s_1) \cup G(s_2)|}$$

[0027] This coefficient directly reflects the overlap ratio of the two field names at the fragment level; the larger the value, the more common substrings are shared.

[0028] (3) Normalization of the length of the longest common subsequence (LCS);

[0029] The LCS algorithm is used to extract the length of the longest common subsequence (LCS(s1,s2)) of two strings that maintain their order, and normalizes it to the interval [0,1].

[0030]
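The normalization formula itself is not reproduced in this text. One common choice, assumed here rather than taken from the source, divides the LCS length by the length of the longer field name, which keeps the result in [0, 1]:

$$\mathrm{LCS}_{\mathrm{norm}}(s_1, s_2) = \frac{\mathrm{LCS}(s_1, s_2)}{\max(|s_1|, |s_2|)}$$

Under this assumption, cust and customer give 4 / 8 = 0.5. Dividing by the shorter length or by the mean length would also yield a value in [0, 1]; the choice determines how strongly truncations are rewarded.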

[0031] (4) Fragment similarity fusion;

[0032] The Jaccard coefficient is weighted and fused with the normalized LCS length to obtain the final fragment-level similarity:

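[0033] $$\mathrm{Sim}_{\mathrm{frag}}(s_1, s_2) = \alpha \cdot J(s_1, s_2) + (1 - \alpha) \cdot \mathrm{LCS}_{\mathrm{norm}}(s_1, s_2)$$ with J the Jaccard coefficient and LCS_norm the normalized LCS length.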

[0034] Where α is the preset weight, with a value of 0.6, which is determined through cross-validation based on the features of the actual corpus.

[0035] Furthermore, step 23, which involves identifying synonyms, business terms, and cross-linguistic semantic alignment, specifically includes the following steps:

[0036] (1) Generation of basic semantic vectors;

[0037] The system supports two mainstream semantic vector generation methods:

[0038] Static word embedding: Using pre-trained Word2Vec, FastText, or GloVe models, for multi-word field names, average pooling or weighted average is used to aggregate the word vectors into field name vectors.

[0039] Contextual dynamic embedding: Using a Transformer-based deep language model, the field name is taken as the input sentence, and the vector at the [CLS] position in the model output is taken as the semantic representation of the entire field name.

[0040] (2) Domain adaptation fine-tuning;

[0041] For specific business domains, the system uses a contrastive learning framework (such as SimCSE) for domain-unsupervised fine-tuning: constructing positive sample pairs (selecting high-frequency co-occurrence or manually labeled similar field name pairs from the corpus), constructing negative sample pairs (randomly sampling dissimilar field name pairs), and optimizing the objective (using the contrastive loss function (InfoNCE) to bring positive sample pairs closer in the semantic space and push negative sample pairs further apart, making the model more sensitive to semantic similarity within the domain).

[0042] (3) Cross-language semantic alignment;

[0043] For multinational enterprise data integration scenarios, multilingual pre-trained models (mBERT, XLM-R) are introduced to achieve cross-language field name matching.

[0044] (4) Similarity calculation;

[0045] For semantic vectors v1 and v2 of two field names, the semantic similarity is defined as:

[0046] $$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

[0047] This is the cosine similarity, with a value range of [-1, 1]; positive values indicate similarity. When using contextual dynamic embedding, this similarity can effectively distinguish between synonyms and antonyms.

[0048] (5) Dynamic model selection;

[0049] The system dynamically selects the most suitable semantic model based on the language, domain, and computing resources of the field name.

[0050] Furthermore, in step 3, the multi-dimensional distribution features are mapped into vectors, and the distribution similarity is calculated using cosine similarity or Pearson correlation coefficient. This specifically includes the following steps:

[0051] Step 31: For numerical fields, calculate the mean, variance, standard deviation, and quartile statistics, and generate an isofrequency histogram or kernel density estimation curve; use the histogram intersection algorithm or chi-square test algorithm to calculate the distribution similarity; extract the basic statistical features of numerical fields to describe their central tendency and dispersion, and capture the actual distribution pattern of the data through isofrequency histograms or kernel density estimation.

[0052] Step 32: For categorical fields, extract the unique value set of the field as category labels, calculate the Jaccard overlap of the category set, and compare the consistency of the frequency distribution of each category; extract the category value set of the two fields, calculate the overlap ratio of the category labels through the Jaccard coefficient, and further compare the frequency distribution of each category in the data.

[0053] Step 33: For time / date fields, analyze the degree of overlap in time ranges and verify the alignment of time granularities; calculate the overlap ratio of time ranges by comparing the minimum / maximum dates of two time fields, and check whether the time granularities are consistent.

[0054] Step 34: Map the above multi-dimensional distribution features into vectors, and calculate the comprehensive distribution similarity using cosine similarity or Pearson correlation coefficient; encode the distribution features of numerical, categorical, and time fields into vector form, and then use cosine similarity or Pearson correlation coefficient to calculate the similarity between the feature vectors of the two fields.
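Step 34 can be illustrated with a minimal sketch. The feature encoding is not specified in the text, so the `numeric_profile` helper and its five-element layout (mean, standard deviation, and the three quartiles) are assumptions made for illustration only.

```python
import math
import statistics


def numeric_profile(values):
    """Hypothetical encoding: [mean, population std dev, Q1, median, Q3].

    Requires at least two values.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [statistics.mean(values), statistics.pstdev(values), q1, median, q3]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    dev_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    dev_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    # A constant vector has no defined correlation; report 0.0 by convention here.
    return cov / (dev_u * dev_v) if dev_u and dev_v else 0.0
```

Note that the Pearson coefficient is invariant to scaling, so with this encoding a field and a ten-times-larger copy of it would score close to 1. Whether that is desirable depends on the scenario; cosine similarity on unscaled features, or an added scale feature, would behave differently.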

[0055] Furthermore, in step 4, data standardization is performed, and the calculation of the value range overlap specifically includes the following steps:

[0056] Step 41: Remove whitespace characters from both ends of each string, unify capitalization, and convert the various date formats to a standard format. These preprocessing operations eliminate format noise in the data.

[0057] Step 42: Extract all distinct values from the standardized data of the two fields to form the unique value sets A and B, respectively.

[0058] Step 43: Introduce a fault-tolerance mechanism suited to each data type. For numeric fields, set an absolute or relative tolerance threshold (e.g., ±0.01 or ±1%) and treat values within the tolerance as matches, which absorbs minor differences caused by precision or rounding. For string fields, treat values whose edit distance does not exceed a preset threshold (e.g., ≤2) as fuzzy matches, which absorbs spelling variations and slight formatting differences.

[0059] Step 44: Calculate the set overlap index, that is, the Jaccard overlap of the unique value sets after standardization and fault-tolerant matching: overlap = |A ∩ B| / |A ∪ B|. Based on the overlap value, divide the association strength into four levels according to preset thresholds: overlap ≥ 95% is an extremely strong association, 80%~95% is a strong association, 60%~80% is a moderate association, and < 60% is a weak association for which filtering is suggested.
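Steps 41, 42, and 44 can be sketched as follows for string fields. This is a simplified illustration: the fault-tolerant matching of step 43 is omitted, date conversion is not shown, and the function names are hypothetical. Treating the 80% and 60% boundaries as inclusive is also an assumption, since the text only fixes the 95% boundary.

```python
def standardize(values):
    """Steps 41-42: trim, lowercase, drop missing values, keep distinct values."""
    return {str(v).strip().lower() for v in values if v is not None}


def value_overlap(field_a, field_b):
    """Step 44: Jaccard overlap |A ∩ B| / |A ∪ B| of the standardized sets."""
    set_a, set_b = standardize(field_a), standardize(field_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def association_level(overlap):
    """Map an overlap ratio in [0, 1] to the four levels named in step 44."""
    if overlap >= 0.95:
        return "extremely strong"
    if overlap >= 0.80:
        return "strong"
    if overlap >= 0.60:
        return "moderate"
    return "weak"
```

For example, `value_overlap([" A ", "b"], ["a", "B ", "c"])` compares the sets {a, b} and {a, b, c}, so two of three distinct values are shared and the level is "moderate".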

[0060] Furthermore, by combining the output results of naming similarity, distribution similarity, and value range overlap, potential association candidates are generated as follows: if naming similarity > first threshold, distribution similarity > second threshold, and value range overlap > third threshold, then high-confidence association candidates are generated.

[0061] If the naming similarity is greater than the first threshold and the value range overlap is greater than the fourth threshold, then generate medium-confidence association candidates (applicable to synonym fields).

[0062] If the distribution similarity is greater than the second threshold and the value range overlap is greater than the third threshold, then a low-confidence association candidate is generated (applicable to implicit association).

[0063] The first, second, third, and fourth thresholds are all configurable parameters, and their default values can be set to 80%, 70%, 80%, and 60%, respectively.
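The three rules above can be sketched as a small decision function using the stated default thresholds. The function name and the evaluation order (most specific rule first, so that a pair satisfying several rules receives the highest applicable label) are assumptions; the text does not state how overlapping rules are resolved.

```python
def candidate_confidence(naming, distribution, overlap,
                         t1=0.80, t2=0.70, t3=0.80, t4=0.60):
    """Return "high", "medium", "low", or None when no rule fires.

    All scores are ratios in [0, 1]; t1..t4 are the first to fourth thresholds.
    """
    if naming > t1 and distribution > t2 and overlap > t3:
        return "high"
    if naming > t1 and overlap > t4:
        return "medium"  # e.g. synonym fields
    if distribution > t2 and overlap > t3:
        return "low"     # e.g. implicit associations
    return None
```

A pair with naming 0.9, distribution 0.5, and overlap 0.7 fails the first rule on distribution but passes the second, so it is reported as a medium-confidence candidate.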

[0064] Furthermore, the reliability of the generated candidate associations is quantitatively assessed, and a weighted summation model is used to calculate the overall confidence level.

[0065] Total confidence = w1·name similarity + w2·distribution similarity + w3·value range overlap + w4·business rule bonus;

[0066] Where w1, w2, w3, and w4 are configurable weight coefficients, and satisfy w1+w2+w3+w4=1;

[0067] The business rule bonus provides an additional 5% to 10% confidence boost for associations that conform to specific naming conventions (such as fields ending with "id", "key", or "code").

[0068] Association candidates are classified according to their confidence scores: 90%~100% is considered extremely high confidence and can be included in the automatic confirmation process; 70%~90% is considered high confidence and is recommended for quick manual review; 50%~70% is considered medium confidence and requires detailed verification; and below 50% is considered low confidence and should only be used as a reference or filtered directly.
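The weighted summation and the tiering can be sketched as below. The default weights are hypothetical placeholders, not values from the text. The sketch also assumes that the business rule term is 1.0 when a naming convention matches and 0.0 otherwise, so that a weight w4 between 0.05 and 0.10 yields the stated 5% to 10% boost; the text does not spell out this interpretation.

```python
def total_confidence(naming, distribution, overlap, rule_bonus,
                     weights=(0.35, 0.25, 0.30, 0.10)):
    """Weighted sum w1*naming + w2*distribution + w3*overlap + w4*rule_bonus."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w1, w2, w3, w4 = weights
    return w1 * naming + w2 * distribution + w3 * overlap + w4 * rule_bonus


def confidence_tier(score):
    """Map a total confidence in [0, 1] to the four review tiers."""
    if score >= 0.90:
        return "extremely high"  # eligible for automatic confirmation
    if score >= 0.70:
        return "high"            # quick manual review
    if score >= 0.50:
        return "medium"          # detailed verification
    return "low"                 # reference only, or filtered
```

With these placeholder weights, scores of 0.9, 0.8, and 0.9 plus a matched naming convention give 0.315 + 0.200 + 0.270 + 0.100 = 0.885, which falls in the high-confidence tier.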

[0069] Accordingly, a human-machine collaborative data association verification system based on multi-dimensional similarity fusion includes: a raw data input module, a multi-dimensional association analysis module, an association generation output module, and a human-machine collaborative verification and feedback module;

[0070] The raw data input module is used to connect to the target database, perform metadata collection and sample data collection, and obtain basic information and data samples of the fields to be analyzed.

[0071] The multi-dimensional association analysis module, based on natural language processing technology, performs multi-dimensional similarity calculations on the collected field names to generate naming similarity; maps multi-dimensional distribution features into vectors and calculates distribution similarity using cosine similarity or Pearson correlation coefficient; performs data standardization and calculates value range overlap.

[0072] The association generation output module integrates a multi-dimensional association reasoning engine, which comprehensively considers naming similarity, distribution similarity, and value range overlap to generate an association candidate list based on preset rules, and uses a confidence scoring model to quantitatively evaluate the reliability of each candidate.

[0073] The human-machine collaborative verification and feedback module presents related data and evidence chains through a visual interface, provides an entry point for audit operations, and records user feedback for model iteration and optimization, forming a closed-loop mechanism for human-machine collaboration.

[0074] Beneficial Effects: Compared with existing technologies, this invention has the following significant advantages. First, it employs a multi-dimensional fusion reasoning strategy that integrates features from three dimensions (field naming similarity, data distribution similarity, and value range overlap) for comprehensive judgment. This significantly improves the discovery rate of implicit associations and greatly reduces the false positive rate, ensuring an association recommendation accuracy exceeding 85%. Second, it introduces a dynamic weighted reliability model with a configurable weighted scoring mechanism. This allows the system to adapt flexibly to the data characteristics of different industries and business scenarios, avoids the poor adaptability of fixed-threshold methods across data environments, and significantly enhances the universality and robustness of the method. Third, it designs a progressive discovery mechanism that analyzes massive numbers of table fields efficiently and scalably through a five-stage pipeline architecture. This avoids the performance bottleneck caused by a full Cartesian-product scan and ensures processing efficiency and system stability at the scale of hundreds of TB of data. Fourth, it constructs a human-machine collaborative closed loop that feeds the results of manual verification back into model weight optimization and rule base updates in real time, forming a virtuous cycle of continuous learning in which accuracy increases with use. This significantly reduces the manual sorting time of DBAs and data engineers, improves the efficiency of manual review by 3 times, and greatly reduces the human resource costs of enterprise data governance.

This invention can effectively discover implicit relationships that lack explicit primary and foreign key constraints but are closely linked in business logic, solving a series of technical problems of traditional methods such as incomplete discovery, a high false positive rate, the lack of confidence measurement, and poor interpretability. It provides a complete, efficient, and evolvable intelligent solution for scenarios such as data lineage management, data asset catalog construction, and ETL intelligent design.

Attached Figure Description

[0075] Figure 1 is a schematic diagram of the method flow of the present invention.

[0076] Figure 2 is a schematic diagram of the system structure of the present invention.

[0077] Figure 3 is a schematic diagram of the human-machine collaborative verification and confidence scoring dashboard interface of the present invention.

[0078] Figure 4 is a chart analyzing the expected results of a typical case.

Detailed Implementation

[0079] As shown in Figure 1, a human-computer collaborative data association verification method based on multi-dimensional similarity fusion includes the following steps:

[0080] Step 1: Establish an initial communication connection with the database, obtain data access permissions, and collect metadata;

[0081] Step 2: Based on natural language processing technology, perform multi-dimensional similarity calculation on the collected field names to generate naming similarity;

[0082] Step 3: Map the multi-dimensional distribution features into vectors, and calculate the distribution similarity using cosine similarity or Pearson correlation coefficient;

[0083] Step 4: Perform data standardization and calculate the overlap of value ranges;

[0084] Step 5: Combine the output results of naming similarity, distribution similarity, and value range overlap to generate potential association candidates;

[0085] Step 6: Record all user operation feedback during this verification process, as well as the corresponding associated features and confidence information. Store the feedback data in the training set for subsequent iterative optimization.

[0086] In step 1, the metadata information and statistical characteristics of the data table are automatically extracted, including basic metadata such as field name, data type, length, and constraint conditions. At the same time, statistical characteristics such as the number of unique values, null value rate, maximum value, minimum value, and average value are collected.
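The metadata and statistics collection of step 1 can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in so the example is self-contained; the target database in the text is not SQLite, and a real deployment would query that system's own catalog views. The function name and the returned dictionary layout are hypothetical.

```python
import sqlite3


def collect_field_profile(conn, table, column):
    """Collect basic metadata and statistics for one column.

    Assumes `table` and `column` come from a trusted catalog listing,
    because identifiers cannot be passed as bound SQL parameters.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    info = {row[1]: row for row in conn.execute(f'PRAGMA table_info("{table}")')}
    _, name, col_type, notnull, _, pk = info[column]
    total, non_null, distinct, min_v, max_v, avg_v = conn.execute(
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}"), '
        f'MIN("{column}"), MAX("{column}"), AVG("{column}") FROM "{table}"'
    ).fetchone()
    return {
        "field_name": name,
        "data_type": col_type,
        "not_null": bool(notnull),
        "primary_key": bool(pk),
        "unique_values": distinct,
        "null_rate": (total - non_null) / total if total else 0.0,
        "min": min_v,
        "max": max_v,
        "avg": avg_v,
    }
```

On large tables the same statistics would more plausibly be computed on a sample, which matches the sample data collection mentioned for the raw data input module.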

[0087] In step 2, based on natural language processing technology, multi-dimensional similarity calculation is performed on the collected field names to generate naming similarity, specifically including the following steps:

[0088] Step 21: Use the edit distance algorithm to calculate the character-level similarity between field names and identify spelling differences and slight deformations;

[0089] Levenshtein Distance quantifies character-level similarity by calculating the minimum number of edit operations (insert, delete, replace) required between two field names. It effectively tolerates minor variations in field naming, such as common spelling errors, inconsistent capitalization, and differences in separators (e.g., underscores, hyphens). Even if field names have similar overall structures but differ in individual characters, it accurately captures their degree of similarity. This is highly relevant to data table processing: in real-world data integration, field names from different systems or tables often exhibit spelling variations due to manual input or inconsistent naming conventions (e.g., "customer_name" vs. "customername"). Levenshtein Distance can quickly identify these candidate matches, providing a foundation for subsequent field mapping and significantly reducing manual verification costs.
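A minimal sketch of the character-level score follows. The Levenshtein distance itself is standard; converting it to a similarity by dividing by the longer length, and lowercasing first, are assumptions, since the text does not state how the distance is normalized.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def char_similarity(name_a: str, name_b: str) -> float:
    """Assumed normalization: 1 - distance / longer length, case-insensitive."""
    a, b = name_a.lower(), name_b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

For "customer_name" and "customername" the distance is 1 (one deleted underscore), so the similarity under this normalization is 1 - 1/13, roughly 0.92.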

[0090] Step 22: A segment-level matching method combining variable-length N-gram-based Jaccard similarity and longest common subsequence (LCS) is used to identify local structural similarities such as abbreviations, truncations, and word order inversions in field names. The specific implementation is as follows:

[0091] (1) Construction of variable-length N-gram sets;

[0092] Treat the field name s to be matched as a string, and generate a set G(s) of all continuous substrings of length n≥1:

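[0093] $$G(s) = \{\, s[i..i+n-1] \mid 1 \le n \le |s|,\ 1 \le i \le |s|-n+1 \,\}$$ where s[i..i+n-1] denotes the substring of s that starts at position i and has length n.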

[0094] For example, the set generated by the field name cust is: {c, u, s, t, cu, us, st, cus, ust, cust}. Compared with fixed-length N-grams, variable-length N-grams capture fragment information of different granularities more comprehensively and are more robust to abbreviations (such as addr and address sharing addr) and truncation (such as cust and customer sharing cust).

[0095] (2) Calculation of Jaccard coefficient;

[0096] For two field names s1 and s2, the Jaccard coefficient is defined as follows:

[0097] $$J(s_1, s_2) = \frac{|G(s_1) \cap G(s_2)|}{|G(s_1) \cup G(s_2)|}$$

[0098] This coefficient directly reflects the overlap ratio of the two field names at the fragment level; the larger the value, the more common substrings are shared.

[0099] (3) Normalization of the length of the longest common subsequence (LCS);

[0100] The LCS algorithm is used to extract the length of the longest common subsequence (LCS(s1,s2)) that maintains the order of two strings. It is normalized to the interval [0,1].

[0101]

[0102] LCS has good tolerance for word order reversal (such as user_name and name_user) and can capture the order invariance of core phrases.

[0103] (4) Fragment similarity fusion;

[0104] The Jaccard coefficient is weighted and fused with the normalized LCS length to obtain the final fragment-level similarity:

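[0105] $$\mathrm{Sim}_{\mathrm{frag}}(s_1, s_2) = \alpha \cdot J(s_1, s_2) + (1 - \alpha) \cdot \mathrm{LCS}_{\mathrm{norm}}(s_1, s_2)$$ with J the Jaccard coefficient and LCS_norm the normalized LCS length.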

[0106] Where α is a preset weight, preferably 0.6, which can be determined through cross-validation based on the features of the actual corpus. This fusion formula combines the coverage of common segments (Jaccard) and the order preservation of core phrases (LCS), so that segment-level matching can tolerate abbreviations and truncation, and will not fail due to word order changes.
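The four sub-steps of step 22 can be sketched end to end. Two points are assumptions rather than statements from the text: the LCS length is normalized by the longer field name, and α weights the Jaccard term while (1 - α) weights the LCS term.

```python
from __future__ import annotations


def ngram_set(s: str) -> set[str]:
    """All contiguous substrings of s with length 1..len(s)."""
    return {s[i:i + n] for n in range(1, len(s) + 1) for i in range(len(s) - n + 1)}


def jaccard(s1: str, s2: str) -> float:
    g1, g2 = ngram_set(s1), ngram_set(s2)
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 1.0


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def fragment_similarity(s1: str, s2: str, alpha: float = 0.6) -> float:
    longest = max(len(s1), len(s2))
    # Assumed normalization: divide by the longer length.
    lcs_norm = lcs_length(s1, s2) / longest if longest else 1.0
    return alpha * jaccard(s1, s2) + (1 - alpha) * lcs_norm
```

For cust and customer, all 10 fragments of cust appear among the 36 distinct fragments of customer, so the Jaccard coefficient is 10/36; the LCS length is 4, giving a normalized value of 0.5 and a fused score of 0.6 × 10/36 + 0.4 × 0.5, about 0.37.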

[0107] Step 23: Map field names to low-dimensional vectors using a deep semantic model and calculate cosine similarity to identify synonyms, business terms, and cross-linguistic semantic alignment. The specific implementation is as follows:

[0108] (1) Generation of basic semantic vectors;

[0109] The system supports two mainstream semantic vector generation methods:

[0110] Static word embeddings: These use pre-trained Word2Vec, FastText, or GloVe models. For multi-word field names (such as customer_id), average pooling or weighted averaging can be used to aggregate the word vectors into a field name vector. The advantage of static embeddings is their fast computation speed, making them suitable for large-scale real-time matching.

[0111] Contextual dynamic embedding: This uses Transformer-based deep language models such as BERT and RoBERTa. The field name is taken as the input sentence, and the vector at the [CLS] position in the model output is used as the semantic representation of the entire field name. Dynamic embedding can generate differentiated vectors based on the specific context of the field name (e.g., "level" in different contexts of "grade" and "level"), thus better resolving ambiguity of polysemous words.

[0112] (2) Domain adaptation fine-tuning;

[0113] For specific business domains (such as finance and e-commerce), general pre-trained models may not accurately capture domain-specific synonym relationships (for example, "customer" and "user" can be considered equivalent in specific businesses). The system uses a contrastive learning framework (such as SimCSE) for domain-specific unsupervised fine-tuning: constructing positive sample pairs (selecting high-frequency co-occurrence or manually labeled similar field name pairs from the corpus), constructing negative sample pairs (randomly sampling dissimilar field name pairs), and optimizing the objective (using the contrastive loss function (InfoNCE) to bring positive sample pairs closer in the semantic space and push away negative sample pairs, making the model more sensitive to semantic similarity within the domain).

[0114] (3) Cross-language semantic alignment;

[0115] For multinational enterprise data integration scenarios, the system introduces multilingual pre-trained models (mBERT, XLM-R) to achieve cross-language field name matching. These models are jointly trained on multilingual corpora and can map the semantic representations of the same concept in different languages (such as the words for "customer") to the same vector space, directly calculating cross-language similarity through cosine similarity without the need for translation tools.

[0116] (4) Similarity calculation;

[0117] For semantic vectors v1 and v2 of two field names, the semantic similarity is defined as:

[0118] $$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

[0119] This refers to cosine similarity, with values ranging from -1 to 1. Positive values are typically used to indicate similarity. When using context-based dynamic embedding, this similarity can effectively distinguish between synonyms and antonyms.

[0120] (5) Dynamic model selection;

[0121] The system can dynamically select the most suitable semantic model based on the language, domain, and computing resources of the field name. For example, the lightweight FastText model can be used when the field name is purely in English and computing resources are limited; for complex scenarios containing polysemous words, the system switches to the domain-fine-tuned BERT model; and for multilingual scenarios, the XLM-R model is automatically enabled.
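The static-embedding path of step 23 (average pooling followed by cosine similarity) can be sketched with a toy vocabulary. The three-dimensional vectors below are invented for illustration and do not come from any pre-trained model; a real system would load Word2Vec, FastText, or GloVe vectors instead. Splitting field names on underscores is likewise an assumption.

```python
import math

# Hypothetical toy embeddings, standing in for a pre-trained static model.
TOY_VECTORS = {
    "customer": [0.9, 0.1, 0.0],
    "client":   [0.8, 0.2, 0.1],
    "order":    [0.1, 0.9, 0.2],
    "id":       [0.0, 0.1, 0.9],
}


def field_vector(field_name):
    """Average-pool the vectors of the known tokens in an underscore-separated name."""
    tokens = [t for t in field_name.lower().split("_") if t in TOY_VECTORS]
    if not tokens:
        raise ValueError(f"no known tokens in {field_name!r}")
    dim = len(TOY_VECTORS[tokens[0]])
    return [sum(TOY_VECTORS[t][k] for t in tokens) / len(tokens) for k in range(dim)]


def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```

With these toy vectors, customer_id scores higher against client_id than against order_id, which is the behavior the semantic dimension is meant to add on top of character-level matching.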

[0122] Step 24: Call the preset industry terminology rule library to perform regular expression matching and identify fields that conform to specific naming conventions;

[0123] Regular expressions or pattern rule bases built upon industry knowledge perform precise pattern matching on field names, identifying fields that conform to specific naming conventions (such as date fields "YYYY-MM-DD", primary key fields "id" or "pk"). Encoding domain expert knowledge into executable rules accurately identifies standardized fields, compensating for the blind spots of purely data-driven methods regarding specific terminology. This is highly relevant to data table processing: in heavily regulated industries such as finance and telecommunications, data table fields often follow strict naming conventions (e.g., including standard fields like "imsi" and "msisdn"). Regular expression matching can quickly locate these fields and assign them high similarity weights. Furthermore, in data quality checks, rule bases can be used to verify whether field names conform to expected formats, promptly detect abnormal naming, and ensure data consistency.
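A rule library of this kind can be sketched with Python's `re` module. The rule names and patterns below are hypothetical examples built from the conventions mentioned in the text ("id", "pk", "key", "code", "imsi", "msisdn"); the binary rule score is also an assumption.

```python
import re

# Hypothetical rule library: rule name -> compiled pattern.
RULES = {
    "identifier": re.compile(r"(^|_)(id|pk|key|code)$", re.IGNORECASE),
    "telecom_subscriber": re.compile(r"(^|_)(imsi|msisdn)($|_)", re.IGNORECASE),
    "date": re.compile(r"(^|_)(date|dt)$", re.IGNORECASE),
}


def matched_rules(field_name):
    """Names of all rules whose pattern is found in the field name."""
    return [name for name, pattern in RULES.items() if pattern.search(field_name)]


def rule_similarity(name_a, name_b):
    """1.0 when both names satisfy at least one common rule, else 0.0."""
    return 1.0 if set(matched_rules(name_a)) & set(matched_rules(name_b)) else 0.0
```

Anchoring each suffix to the start of the name or to an underscore keeps a name such as valid from being mistaken for an identifier merely because it ends in "id".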

[0124] Step 25: Weight and fuse the above multi-dimensional similarity results to generate a comprehensive naming similarity score;

[0125] This step generates a final comprehensive similarity score by weighted fusion of the character-level, fragment-level, semantic-level, and rule-based matching results, where the weights can be adjusted dynamically or learned by machine learning methods. It comprehensively utilizes multi-dimensional information to overcome the limitations of single methods and adapts to different business scenarios through an adaptive weighting strategy (e.g., focusing on character matching for internal tables and semantic matching for cross-source tables). This is highly relevant to data table processing: in real-world data integration tasks, field matching often requires considering multiple factors, and weighted fusion can output more reliable similarity scores to support subsequent field mapping decisions. For example, in ETL processes, weights can be dynamically adjusted based on the characteristics of the source and target, improving matching accuracy, reducing manual intervention, and thus accelerating data pipeline construction.
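The weighted fusion of Step 25 can be illustrated with a short sketch. The two weight profiles are hypothetical values chosen only to mirror the example of favoring character matching for internal tables and semantic matching for cross-source tables; the original text does not give concrete weights.

```python
def fuse_naming_similarity(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, normalized by total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * scores[k] for k in weights) / total


# Hypothetical weight profiles (not from the source).
INTERNAL_WEIGHTS = {"char": 0.4, "fragment": 0.3, "semantic": 0.2, "rule": 0.1}
CROSS_SOURCE_WEIGHTS = {"char": 0.1, "fragment": 0.2, "semantic": 0.6, "rule": 0.1}

# A pair with different spelling but close meaning.
scores = {"char": 0.2, "fragment": 0.4, "semantic": 0.9, "rule": 0.0}
internal_score = fuse_naming_similarity(scores, INTERNAL_WEIGHTS)
cross_source_score = fuse_naming_similarity(scores, CROSS_SOURCE_WEIGHTS)
```

With these assumed profiles the same pair scores higher under the cross-source profile, because the semantic dimension dominates there.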

[0126] Step 3 involves mapping the multi-dimensional distribution features into vectors and calculating the distribution similarity using cosine similarity or Pearson correlation coefficient. This includes the following steps:

[0127] Step 31: For numerical fields, calculate the mean, variance, standard deviation, and quartile statistics, and generate an isofrequency histogram or kernel density estimation curve; use the histogram intersection algorithm or chi-square test algorithm to calculate the distribution similarity.

[0128] This method extracts basic statistical characteristics (such as mean, variance, standard deviation, and quartiles) from numerical fields to describe their central tendency and dispersion, while capturing the actual distribution pattern of the data through isofrequency histograms or kernel density estimation. It integrates global statistics and local density distributions, and uses histogram intersection or chi-square tests to quantify the overlap between two fields across numerical intervals, thus more precisely measuring distribution similarity and avoiding differences in distribution shape that might be overlooked by relying solely on statistics. This is highly relevant to data table processing: in data integration, even if field names are similar (e.g., "sales revenue" in two tables), significant differences in data distribution (e.g., one for annual sales revenue, the other for monthly sales revenue) may indicate different business meanings or data quality issues. Distribution similarity calculation can help verify the rationality of field matching and prevent erroneous merging. Furthermore, in data quality monitoring, regularly comparing field distributions can promptly detect data drift or anomalies.
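One possible reading of the isofrequency (equal-frequency) histogram combined with histogram intersection is sketched below. Deriving the bin edges from the pooled values of both fields, the bin count, and the function names are assumptions for illustration; the original text does not fix these details.

```python
import bisect


def equal_frequency_edges(values, bins):
    """Interior cut points so each bin holds roughly the same number of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def bin_proportions(values, edges):
    """Share of values falling into each bin delimited by the cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def histogram_intersection(a, b, bins=4):
    """Sum of per-bin minimum proportions: 1.0 identical, 0.0 disjoint."""
    edges = equal_frequency_edges(list(a) + list(b), bins)
    pa = bin_proportions(a, edges)
    pb = bin_proportions(b, edges)
    return sum(min(x, y) for x, y in zip(pa, pb))
```

Two fields on very different scales, such as monthly versus annual revenue, would fall into different bins and yield a low score even if their names match.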

[0129] Step 32: For categorical fields, extract the set of unique values of the field as category labels, calculate the Jaccard overlap of the category set, and compare the consistency of the distribution of the frequency of occurrence of each category.

[0130] The category value sets of the two fields are extracted, the overlap ratio of category labels is calculated using the Jaccard coefficient, and the frequency distribution of each category in the data is then compared (e.g., using the chi-square test or cosine similarity). This approach not only focuses on whether the category sets are consistent but also examines the similarity of the actual proportions of each category, enabling the identification of situations where the categories are the same but their distributions are vastly different, or where the category sets partially overlap but their frequency distributions are similar. This is highly relevant to data table processing: in data warehouse construction or data fusion, the values of categorical fields (such as "region" or "product type") often differ due to different business scopes. Distribution similarity can quantify this difference, helping to determine whether two fields represent the same business dimension. For example, when merging two customer tables, the values of the "province" field may not be exactly the same, but if the frequency distributions are highly similar, they can still be considered the same field.
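The two measurements of Step 32 can be sketched as follows: a Jaccard coefficient over the label sets, and a cosine similarity over the relative frequencies of each label. Using cosine rather than the chi-square test for the frequency comparison is one of the two options named above; the function names and sample labels are illustrative assumptions.

```python
import math
from collections import Counter


def category_jaccard(a, b) -> float:
    """Share of distinct labels that the two fields have in common."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def frequency_cosine(a, b) -> float:
    """Cosine similarity of the relative label frequencies of two fields."""
    if not a or not b:
        return 0.0
    ca, cb = Counter(a), Counter(b)
    labels = list(set(ca) | set(cb))
    fa = [ca[label] / len(a) for label in labels]
    fb = [cb[label] / len(b) for label in labels]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0


# Hypothetical province codes from two customer tables.
table_a = ["JS", "JS", "ZJ", "SH"]
table_b = ["JS", "ZJ", "ZJ", "BJ"]
label_overlap = category_jaccard(table_a, table_b)
```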

[0131] Step 33: For time / date fields, analyze the degree of overlap in time range coverage and verify the alignment of time granularity;

[0132] By comparing the minimum / maximum dates of two time fields, the overlap ratio of time ranges is calculated, while simultaneously checking the consistency of time granularity (e.g., year, month, day, hour). The similarity of time fields is comprehensively evaluated from both coverage and granularity dimensions, and alignment requirements can be determined by setting thresholds (e.g., allowing for a certain offset). This is highly relevant to data table processing: in time series data analysis or multi-table joins, the alignment of time fields is crucial. For example, the "transaction time" records from two systems might be accurate to the second in one and only to the day in the other; directly joining them could lead to data loss or errors. Granularity verification can identify such problems in advance and guide subsequent date format conversions or data alignment operations. Furthermore, in data integration, the degree of overlap in time range coverage reveals the temporal continuity of the data, helping to determine data completeness.
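A sketch of the two checks in Step 33 follows. Defining the overlap ratio as the intersection of the two [min, max] ranges divided by their combined span, and inferring granularity from which components are ever non-default, are both assumptions for illustration; the original text does not specify either formula.

```python
from datetime import datetime


def time_range_overlap(a, b) -> float:
    """Overlap of the [min, max] ranges divided by their combined span."""
    start_a, end_a = min(a), max(a)
    start_b, end_b = min(b), max(b)
    shared = (min(end_a, end_b) - max(start_a, start_b)).total_seconds()
    span = (max(end_a, end_b) - min(start_a, start_b)).total_seconds()
    if span == 0:
        return 1.0  # both fields hold the same single instant
    return max(shared, 0.0) / span


def infer_granularity(values) -> str:
    """Heuristic: the finest component that is ever non-default."""
    if any(v.second or v.microsecond for v in values):
        return "second"
    if any(v.minute for v in values):
        return "minute"
    if any(v.hour for v in values):
        return "hour"
    if any(v.day != 1 for v in values):
        return "day"
    if any(v.month != 1 for v in values):
        return "month"
    return "year"


field_a = [datetime(2023, 1, 1), datetime(2023, 1, 11)]
field_b = [datetime(2023, 1, 6), datetime(2023, 1, 16)]
overlap = time_range_overlap(field_a, field_b)  # 5 shared days of a 15-day span
```

A granularity mismatch (for example "second" versus "day") would flag the pair for format conversion before any join.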

[0133] Step 34: Map the above multi-dimensional distribution features into vectors, and calculate the comprehensive distribution similarity using cosine similarity or Pearson correlation coefficient.

[0134] The distribution characteristics of numerical, categorical, and time-based fields are uniformly encoded into vector form (e.g., concatenating statistics, histogram distributions, and category frequencies into feature vectors). Then, cosine similarity or Pearson correlation coefficient is used to calculate the similarity between the feature vectors of two fields. This achieves the fusion and unified measurement of heterogeneous distribution characteristics, allowing for the comparison of distribution similarity across different types of fields within a single framework. Furthermore, feature weights can be adjusted to adapt to different business scenarios. This is highly relevant to data table processing: in automated data integration tools, comprehensive distribution similarity can serve as a crucial basis for field matching, complementing naming similarity and improving mapping accuracy. For example, two fields may have different names but highly similar distributions, suggesting they may represent the same concept; conversely, similar names but vastly different distributions may require manual verification. This metric can also be used for data table structure comparison, data version difference analysis, and other scenarios, providing quantitative support for data governance.
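The vector mapping and the Pearson option of Step 34 can be sketched as below for numeric fields. The choice of features (mean, population standard deviation, and three quartile cut points) and the function names are assumptions; the original text also allows histogram and category-frequency components, which are omitted here for brevity.

```python
import math
import statistics


def numeric_features(values) -> list:
    """Encode a numeric field as [mean, std, Q1, median, Q3]."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return [statistics.mean(values), statistics.pstdev(values), q1, q2, q3]


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # assumption: a constant vector has no defined correlation
    return cov / (dev_x * dev_y)
```

Because Pearson correlation is invariant to scale and offset, a production system would likely normalize features or combine this score with a scale-sensitive measure; that design choice is left open here.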

[0135] Step 4 involves data standardization and calculating the overlap of value ranges, specifically including the following steps:

[0136] Step 41: Remove whitespace characters from both ends of the string, unify capitalization, and convert the date format to standard type;

[0137] Preprocessing eliminates format noise in the data, including removing leading and trailing spaces, standardizing letter case, and converting various date formats (such as "YYYY-MM-DD" and "DD / MM / YYYY") to a standard format. Standardization eliminates misjudgments of value ranges caused by inconsistent formats, laying the foundation for accurate matching later. Standardization rules can be customized for different data types (such as automatic recognition and conversion of date formats). This is highly relevant to data table processing: in real-world data tables, fields with the same meaning often have inconsistent formats due to input habits or system differences, such as "Zhang San" versus the same name entered with stray leading or trailing spaces, or "2023-01-01" versus "01 / 01 / 2023". Without standardization, the overlap of value ranges will be severely underestimated, affecting the determination of field correlation. This step ensures the accuracy of subsequent calculations and is a key preprocessing step in data cleaning and integration.
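The standardization of Step 41 might look like the following sketch. The list of accepted date formats, the day-first reading of slash-separated dates, and the choice of lower case and ISO 8601 as the standard forms are assumptions for illustration.

```python
from datetime import datetime

# Assumed input formats; slash dates are read day-first as in "DD / MM / YYYY".
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def standardize(value: str) -> str:
    """Trim whitespace, unify case, and convert recognized dates to ISO 8601."""
    text = value.strip()
    compact = text.replace(" ", "")  # tolerate "01 / 01 / 2023"
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(compact, fmt).date().isoformat()
        except ValueError:
            continue
    return text.lower()
```

After this step, "2023-01-01" and "01 / 01 / 2023" map to the same string and therefore count as one value in the later overlap calculation.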

[0138] Step 42: Extract the unique value sets A and B from the two fields;

[0139] All unique values are extracted from the standardized field data, forming sets A and B respectively. Using these unique value sets as the basis for value range comparison avoids interference from duplicate values in the overlap calculation and reflects the similarity of the field value spaces more cleanly. This is highly relevant to data table processing: in data table field comparison, the unique value set represents the actual value range of a field. By comparing sets, it is possible to intuitively determine whether two fields share the same value range. For example, when merging customer tables, the unique value set of the "gender" field is usually {male, female}. If one table also contains "unknown", the set difference will indicate a data quality problem or a difference in business scope.

[0140] Step 43: For numeric fields, set a tolerance threshold (e.g., ±0.01 or ±1%), and consider it a match if it is within the tolerance range; for string fields, if the edit distance is lower than the preset threshold (e.g., ≤2), it is considered a fuzzy match.

[0141] Introduce a fault tolerance mechanism for different data types: For numeric fields, allow minor differences caused by precision or rounding. Consider numerically close values as matches by setting absolute or relative error thresholds. For string fields, identify spelling variants or minor format differences (such as "Beijing, China" and "Beijing") through the edit distance. Expand exact matching to fuzzy matching to accommodate inevitable noise in actual data, and flexibly control the matching strictness through adjustable thresholds. This is highly relevant to data table processing: In many business scenarios, numerically equivalent values may vary slightly due to calculation precision or storage format (such as 3.14 and 3.1416), and strings may also have abbreviations or typos (such as "Contact Number" and "Contact Number "). Tolerance matching can more realistically reflect the overlap of field value ranges, avoid misjudging as non-matching due to minor differences, and thus improve the recall rate of field association recognition.
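The tolerance rules of Step 43 can be sketched with two small predicates. Accepting a numeric pair when either the absolute or the relative threshold is met is an assumption (the original text lists the two thresholds as alternatives), and the function names are illustrative.

```python
def numbers_match(x: float, y: float,
                  abs_tol: float = 0.01, rel_tol: float = 0.01) -> bool:
    """Match if the difference is within the absolute or relative tolerance."""
    diff = abs(x - y)
    return diff <= abs_tol or diff <= rel_tol * max(abs(x), abs(y))


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]


def strings_match(s: str, t: str, max_distance: int = 2) -> bool:
    """Fuzzy match when the edit distance does not exceed the threshold."""
    return edit_distance(s, t) <= max_distance
```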

[0142] Step 44: Calculate the set overlap metric: Overlap = |A ∩ B| / |A ∪ B|; Divide the association strength into four levels based on the overlap value: Overlap ≥ 95% is extremely strong association, 80% - 95% is strong association, 60% - 80% is medium association, and < 60% is weak association or recommended for filtering.

[0143] Based on the unique value sets after standardization and tolerance matching, calculate the Jaccard overlap and divide the results into different association strength levels according to preset thresholds. Convert the continuous overlap scores into discrete semantic levels for easy business understanding and subsequent decision-making (such as automatic mapping, manual review, or filtering), and the level thresholds can be dynamically adjusted according to industry experience or specific requirements. This is highly relevant to data table processing: In data integration or schema matching, the overlap of field value ranges is an important metric for measuring field semantic consistency. For example, when the overlap of two fields reaches over 95%, it can be highly suspected that they represent the same attribute, and thus an automatic mapping can be established; while an overlap below 60% may indicate different meanings and is recommended for filtering to reduce candidate pairs. This grading method can effectively guide the processing flow of automated tools, reduce the cost of manual review, and improve data fusion efficiency.
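The overlap metric and four-level grading of Step 44 can be sketched as follows. Assigning the boundary values 80% and 60% to the higher level is an assumption, since the original ranges share their endpoints; the level labels follow the text.

```python
def jaccard_overlap(a, b) -> float:
    """Overlap = |A ∩ B| / |A ∪ B| over the unique values of two fields."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def association_strength(overlap: float) -> str:
    """Map an overlap score to the four association strength levels."""
    if overlap >= 0.95:
        return "extremely strong"
    if overlap >= 0.80:
        return "strong"
    if overlap >= 0.60:
        return "medium"
    return "weak"


gender_a = {"male", "female"}
gender_b = {"male", "female", "unknown"}
level = association_strength(jaccard_overlap(gender_a, gender_b))
```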

[0144] Integrate the output results of name similarity, distribution similarity, and value range overlap to generate potential association candidates. Specifically: If name similarity > the first threshold, distribution similarity > the second threshold, and value range overlap > the third threshold, then generate high-confidence association candidates;

[0145] If name similarity > the first threshold and value range overlap > the fourth threshold, then generate medium-confidence association candidates (applicable to synonym fields);

[0146] If distribution similarity > the second threshold and value range overlap > the third threshold, then generate low-confidence association candidates (applicable to implicit associations);

[0147] The first, second, third, and fourth thresholds are all configurable parameters, and their default values can be set to 80%, 70%, 80%, and 60%, respectively.
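The three candidate-generation rules and their default thresholds can be written as a single function. Evaluating the rules from highest to lowest confidence and returning `None` when no rule fires are assumptions; the function and parameter names are illustrative.

```python
from typing import Optional


def candidate_level(name_sim: float, dist_sim: float, overlap: float,
                    t1: float = 0.80, t2: float = 0.70,
                    t3: float = 0.80, t4: float = 0.60) -> Optional[str]:
    """Return the confidence level of a candidate pair, or None if no rule fires."""
    if name_sim > t1 and dist_sim > t2 and overlap > t3:
        return "high"
    if name_sim > t1 and overlap > t4:
        return "medium"   # e.g. synonym fields
    if dist_sim > t2 and overlap > t3:
        return "low"      # e.g. implicit associations
    return None
```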

[0148] The reliability of the generated association candidates is quantitatively evaluated, and the overall confidence level is calculated using a weighted summation model.

[0149] Total confidence = w1·name similarity + w2·distribution similarity + w3·value range overlap + w4·business rule bonus;

[0150] Where w1, w2, w3, and w4 are configurable weight coefficients, and satisfy w1+w2+w3+w4=1;

[0151] The business rule bonus provides an additional 5% to 10% confidence boost for associations that conform to specific naming conventions (such as fields ending with "id", "key", or "code").

[0152] Association candidates are classified according to their confidence scores: 90%~100% is considered extremely high confidence and can be included in the automatic confirmation process; 70%~90% is considered high confidence and is recommended for quick manual review; 50%~70% is considered medium confidence and requires detailed verification; and below 50% is considered low confidence and should only be used as a reference or filtered directly.
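The weighted summation model and the confidence tiers can be sketched as below. The default weights are hypothetical, and treating the business rule bonus as a score in [0, 1] whose weight w4 = 0.1 caps its contribution at 10 percentage points is only one possible reading of the 5% to 10% boost; the original text does not spell out this mapping. Boundary scores are assigned to the higher tier by assumption.

```python
def total_confidence(name_sim: float, dist_sim: float, overlap: float,
                     rule_bonus: float,
                     weights=(0.3, 0.3, 0.3, 0.1)) -> float:
    """Total confidence = w1*name + w2*distribution + w3*overlap + w4*bonus."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w1, w2, w3, w4 = weights
    return w1 * name_sim + w2 * dist_sim + w3 * overlap + w4 * rule_bonus


def confidence_tier(score: float) -> str:
    """Map a total confidence score to the four review tiers."""
    if score >= 0.90:
        return "extremely high"   # automatic confirmation
    if score >= 0.70:
        return "high"             # quick manual review
    if score >= 0.50:
        return "medium"           # detailed verification
    return "low"                  # reference only or filtered


score = total_confidence(0.9, 0.8, 0.9, 1.0)
tier = confidence_tier(score)
```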

[0153] This invention also provides three types of operation entry points: "verification" (confirming correct association), "ignore" (marking as a false alarm), and "correction" (manually specifying the correct association); it records all user review operation logs and exports them periodically for model parameter retraining and rule base updates. It also supports batch review and quick operations, improving the processing efficiency of large-scale association candidates.

[0154] The system receives the user's decision on the verification interface and determines whether the association is confirmed. If confirmed, the user-confirmed valid association is written into the official lineage graph to solidify the data lineage link. If ignored or corrected (and reconfirmed after correction), the current association does not enter the official lineage graph and the process ends. All user feedback from the verification process (confirmation, ignore, correction) is recorded together with the corresponding association features and confidence information, and the feedback data is stored in the training set so that subsequent iterations can optimize the weight model and similarity algorithm.
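The review flow above can be sketched with two small data classes. All class, field, and action names are hypothetical, and the in-memory set and list merely stand in for the official lineage graph and the training set. The sketch follows the text in writing only confirmed associations to the lineage graph; how a corrected association is later reconfirmed is left out because the original text does not detail it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    source: str
    target: str
    features: dict
    confidence: float


@dataclass
class ReviewStore:
    lineage: set = field(default_factory=set)         # stand-in for the lineage graph
    training_set: list = field(default_factory=list)  # feedback for later retraining

    def review(self, candidate: Candidate, action: str,
               corrected_target: Optional[str] = None) -> None:
        if action not in {"verify", "ignore", "correct"}:
            raise ValueError(f"unknown action: {action}")
        if action == "verify":
            # Only confirmed associations enter the lineage graph.
            self.lineage.add((candidate.source, candidate.target))
        # Every decision is logged with its features and confidence.
        self.training_set.append({
            "source": candidate.source,
            "target": candidate.target,
            "action": action,
            "corrected_target": corrected_target,
            "features": dict(candidate.features),
            "confidence": candidate.confidence,
        })


store = ReviewStore()
store.review(Candidate("orders.cust_id", "customers.id", {"name": 0.9}, 0.92), "verify")
store.review(Candidate("orders.amount", "refunds.amount", {"name": 1.0}, 0.55), "ignore")
```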

[0155] As shown in Figure 2 and Figure 3, a human-computer collaborative data association verification system based on multidimensional similarity fusion includes: a raw data input module, a multidimensional association analysis module, an association generation output module, and a human-computer collaborative verification and feedback module.

[0156] The raw data input module is used to connect to the target database, perform metadata collection and sample data collection, and obtain basic information and data samples of the fields to be analyzed.

[0157] The multi-dimensional association analysis module, based on natural language processing technology, performs multi-dimensional similarity calculations on the collected field names to generate naming similarity; maps multi-dimensional distribution features into vectors and calculates distribution similarity using cosine similarity or Pearson correlation coefficient; performs data standardization and calculates value range overlap.

[0158] The association generation output module integrates a multi-dimensional association reasoning engine, which comprehensively considers naming similarity, distribution similarity, and value range overlap to generate an association candidate list based on preset rules, and uses a confidence scoring model to quantitatively evaluate the reliability of each candidate.

[0159] The human-machine collaborative verification and feedback module presents related data and evidence chains through a visual interface, provides an entry point for audit operations, and records user feedback for model iteration and optimization, forming a closed-loop mechanism for human-machine collaboration.

[0160] As shown in Figure 4, single-dimensional similarity calculation has significant limitations when dealing with complex and ever-changing real-world data:

[0161] Relying solely on name similarity is easily affected by factors such as spelling, abbreviations, synonyms, and cross-language issues, leading to missed or incorrect matches.

[0162] Relying solely on distribution similarity may work well for categorical fields with limited value space, but it is difficult to distinguish semantically different fields with the same name.

[0163] Relying solely on the overlap of value ranges can easily lead to biases when data is sparse or noisy, and it cannot handle semantic synonyms.

[0164] This invention integrates three dimensions, namely naming similarity, distribution similarity, and value range overlap, so that information from multiple perspectives complements and cross-verifies each other. For example, in the "homonym" scenario (identically named fields with different meanings), although the naming similarity is extremely high, the low scores for distribution features and value range overlap promptly correct the judgment and avoid erroneous associations. In the "synonym" or "abbreviation" scenarios, distribution and value range information compensate for the deficiencies in naming similarity, ensuring the recall of correct matches.

[0165] Furthermore, the human-machine collaborative feedback mechanism introduced in this invention can use users' feedback on the matching results in the above cases (e.g., confirming or rejecting a certain association) as training data to continuously optimize the weights or model parameters of each dimension. As feedback data accumulates, the accuracy and stability of the system in handling similar cases will be further improved, forming a virtuous cycle and ultimately achieving a deep integration of automated matching and manual verification.

[0166] In summary, this invention significantly improves the accuracy and recall rate of data field association verification through multi-dimensional integration and human-machine collaboration, effectively reducing the cost of manual review, and is suitable for large-scale, multi-source heterogeneous data integration and governance scenarios.

Claims

1. A human-computer collaborative data association verification method based on multi-dimensional similarity fusion, characterized in that it includes the following steps: Step 1: Establish an initial communication connection with the database, obtain data access permissions, and collect metadata; Step 2: Based on natural language processing technology, perform multi-dimensional similarity calculation on the collected field names to generate naming similarity; Step 3: Map the multi-dimensional distribution features into vectors, and calculate the distribution similarity using cosine similarity or Pearson correlation coefficient; Step 4: Perform data standardization and calculate the overlap of value ranges; Step 5: Combine the output results of naming similarity, distribution similarity, and value range overlap to generate potential association candidates; Step 6: Record all user operation feedback, the corresponding association features, and confidence information during the verification process, and store the feedback data in the training set for subsequent iterative optimization.

2. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 1, characterized in that, In step 1, the metadata information and statistical characteristics of the data table are automatically extracted, including basic metadata such as field name, data type, length, and constraint conditions. At the same time, statistical characteristics such as the number of unique values, null value rate, maximum value, minimum value, and average value are collected.

3. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 1, characterized in that, In step 2, based on natural language processing technology, multi-dimensional similarity calculation is performed on the collected field names to generate naming similarity, specifically including the following steps: Step 21: Calculate the character-level similarity between field names using the edit distance algorithm to identify spelling differences and minor distortions; calculate the minimum number of editing operations required between two field names, including insertion, deletion, and replacement, through the edit distance algorithm to quantify the character-level similarity. Step 22: A segment-level matching method combining variable-length N-gram-based Jaccard similarity and longest common subsequence (LCS) is used to identify local structural similarities in field names, such as abbreviations, truncations, and word order reversals. Step 23: Map field names to low-dimensional vectors using a deep semantic model and calculate cosine similarity to identify synonyms, business terms, and cross-language semantic alignment. Step 24: Call the preset industry terminology rule library to perform regular expression matching and identify fields that conform to specific naming conventions; use regular expressions or pattern rule libraries built based on industry knowledge to perform precise pattern matching on field names and identify fields that conform to specific naming conventions. Step 25: Weight and fuse the above multi-dimensional similarity results to generate a comprehensive naming similarity score; weight and fuse the character-level, fragment-level, semantic-level, and rule-matching multi-dimensional similarity results to generate the final comprehensive similarity score.

4. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 3, characterized in that, Step 22, identifying the local structural similarity in field names, specifically includes the following steps: (1) Construction of variable-length N-gram sets; Treat the field name s to be matched as a string, and generate a set G(s) of all continuous substrings of length n≥1: (2) Calculation of Jaccard coefficient; For two field names s1 and s2, the Jaccard coefficient is defined as follows: This coefficient directly reflects the overlap ratio of the two field names at the fragment level; the larger the value, the more common substrings are shared. (3) Normalization of the length of the longest common subsequence (LCS); The LCS algorithm is used to extract the length of the longest common subsequence (LCS(s1,s2)) of two strings that maintain their order, and normalizes it to the interval [0,1]. (4) Fragment similarity fusion; The Jaccard coefficient is weighted and fused with the normalized LCS length to obtain the final fragment-level similarity: Where α is the preset weight, with a value of 0.6, which is determined through cross-validation based on the features of the actual corpus.

5. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 3, characterized in that Step 23, which involves identifying synonyms, business terms, and cross-language semantic alignment, specifically includes the following steps: (1) Generation of basic semantic vectors; The system supports two mainstream semantic vector generation methods: Static word embedding: Using pre-trained Word2Vec, FastText, or GloVe models, for multi-word field names, average pooling or weighted average is used to aggregate the word vectors into field name vectors; Contextual dynamic embedding: Using a Transformer-based deep language model, the field name is taken as the input sentence, and the vector at the [CLS] position in the model output is taken as the semantic representation of the entire field name; (2) Domain adaptation fine-tuning; For specific business domains, the system uses a contrastive learning framework for unsupervised domain fine-tuning: constructing positive sample pairs and negative sample pairs, and using the contrastive loss function (InfoNCE) to bring positive sample pairs closer in the semantic space and push negative sample pairs further apart, making the model more sensitive to semantic similarity within the domain; (3) Cross-language semantic alignment; For data integration scenarios of multinational enterprises, a multilingual pre-trained model is introduced to achieve cross-language field name matching; (4) Similarity calculation; For semantic vectors v1 and v2 of two field names, the semantic similarity is defined as their cosine similarity, with a value range of [-1, 1]. Positive values indicate similarity. When using contextual dynamic embedding, this similarity can effectively distinguish between synonyms and antonyms; (5) Dynamic model selection; The system dynamically selects the most suitable semantic model based on the language, domain, and computing resources of the field name.

6. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 1, characterized in that Step 3 involves mapping the multi-dimensional distribution features into vectors and calculating the distribution similarity using cosine similarity or Pearson correlation coefficient, and includes the following steps: Step 31: For numerical fields, calculate the mean, variance, standard deviation, and quartile statistics, and generate an isofrequency histogram or kernel density estimation curve; use the histogram intersection algorithm or chi-square test algorithm to calculate the distribution similarity; extract the basic statistical features of numerical fields to describe their central tendency and dispersion, and capture the actual distribution pattern of the data through isofrequency histograms or kernel density estimation. Step 32: For categorical fields, extract the unique value set of the field as category labels, calculate the Jaccard overlap of the category set, and compare the consistency of the frequency distribution of each category; extract the category value set of the two fields, calculate the overlap ratio of the category labels through the Jaccard coefficient, and further compare the frequency distribution of each category in the data. Step 33: For time / date fields, analyze the degree of overlap in time range coverage and verify the alignment of time granularity; the overlap ratio of time ranges is calculated by comparing the minimum / maximum dates of two time fields, while also checking whether the time granularity is consistent. Step 34: Map the above multi-dimensional distribution features into vectors, and calculate the comprehensive distribution similarity using cosine similarity or Pearson correlation coefficient; encode the distribution features of numerical, categorical, and time fields into vector form, and then use cosine similarity or Pearson correlation coefficient to calculate the similarity between the feature vectors of the two fields.

7. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 1, characterized in that Step 4 involves data standardization and calculating the overlap of value ranges, specifically including the following steps: Step 41: Remove whitespace characters from both ends of the string, unify capitalization, and convert the date format to standard type; eliminate format noise in the data through preprocessing operations, including removing leading and trailing spaces, unifying capitalization, and converting various date formats to standard format; Step 42: Extract unique value sets A and B from the two fields; extract all non-repeating values from the standardized field data to form sets A and B respectively; Step 43: For numeric fields, set a tolerance threshold, and consider those within the tolerance range as matches; for string fields, if the edit distance is lower than the preset threshold, it is considered a fuzzy match; introduce error tolerance mechanisms for different data types: numeric fields allow minor differences caused by precision or rounding, and close values are considered matches by setting absolute or relative error thresholds; string fields identify spelling variations or slight format differences by edit distance. Step 44: Calculate the set overlap index: overlap = |A ∩ B| / |A ∪ B|; Based on the overlap value, divide the association strength into four levels: overlap ≥ 95% is extremely strong association, 80%~95% is strong association, 60%~80% is moderate association, and < 60% is weak association or recommended for filtering; Based on the unique value set after standardization and fault-tolerant matching, calculate the Jaccard overlap, and divide the results into different association strength levels according to the preset threshold.

8. The human-computer collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 7, characterized in that, based on the combined outputs of naming similarity, distribution similarity, and value range overlap, potential association candidates are generated as follows: if naming similarity > first threshold, distribution similarity > second threshold, and value range overlap > third threshold, then high-confidence association candidates are generated. If the naming similarity is greater than the first threshold and the value range overlap is greater than the fourth threshold, then a medium-confidence association candidate is generated, which is applicable to synonym fields. If the distribution similarity is greater than the second threshold and the value range overlap is greater than the third threshold, then a low-confidence association candidate is generated, which is suitable for implicit associations. The first, second, third, and fourth thresholds are all configurable parameters, with default values set to 80%, 70%, 80%, and 60%, respectively.

9. The human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 8, characterized in that, The reliability of the generated association candidates is quantitatively evaluated, and the overall confidence level is calculated using a weighted summation model. Total confidence = w1·name similarity + w2·distribution similarity + w3·value range overlap + w4·business rule bonus; Where w1, w2, w3, and w4 are configurable weight coefficients, and satisfy w1+w2+w3+w4=1; Business rule bonuses add an extra 5% to 10% confidence level to associations that conform to specific naming conventions; Candidates are classified according to their confidence scores: 90%~100% is considered extremely high confidence and is included in the automatic confirmation process; 70%~90% is considered high confidence and is recommended for quick manual review; 50%~70% is considered medium confidence and requires detailed verification; and below 50% is considered low confidence and is only for reference or can be filtered directly.

10. A system for implementing the human-machine collaborative data association verification method based on multi-dimensional similarity fusion as described in claim 1, characterized in that it includes: a raw data input module, a multi-dimensional association analysis module, an association generation output module, and a human-computer collaborative verification and feedback module. The raw data input module is used to connect to the target database, perform metadata collection and sample data collection, and obtain basic information and data samples of the fields to be analyzed. The multi-dimensional association analysis module, based on natural language processing technology, performs multi-dimensional similarity calculations on the collected field names to generate naming similarity; maps multi-dimensional distribution features into vectors and calculates distribution similarity using cosine similarity or Pearson correlation coefficient; performs data standardization and calculates value range overlap. The association generation output module integrates a multi-dimensional association reasoning engine, which comprehensively considers naming similarity, distribution similarity, and value range overlap to generate an association candidate list based on preset rules, and uses a confidence scoring model to quantitatively evaluate the reliability of each candidate. The human-machine collaborative verification and feedback module presents association data and evidence chains through a visual interface, provides an entry point for review operations, and records user feedback for model iteration and optimization, forming a closed-loop mechanism for human-machine collaboration.