Intelligent due diligence method and system with automatic mapping of evidence fields

By extracting semantic vectors from evidence documents and applying graph neural network reasoning, the method addresses the insufficient structuring of evidence-document information in existing technologies and generates due diligence data efficiently and accurately.

CN122221972A · Pending · Publication Date: 2026-06-16 · 北京鉴微知著智能科技有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: 北京鉴微知著智能科技有限公司
Filing Date: 2026-03-19
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing due diligence technologies rely on manual review or simple machine learning methods, which adapt poorly to evidence documents with diverse formats and flexible expressions. The result is poorly structured and incomplete extracted information and limited automation efficiency.

Method used

The method acquires evidence documents, extracts fields and generates semantic vectors, computes the mapping distance between cross-modal field vectors and standard fields, and uses graph neural networks to infer and complete field values, producing structured due diligence data.

Benefits of technology

It enables the generation of high-quality, structured due diligence data, reduces the workload of manual verification, improves the efficiency and accuracy of due diligence reports, and provides a reliable data foundation for risk assessment.

Generated by Eureka AI based on patent content.


Abstract

The application provides an intelligent due diligence method and system for automatic mapping of evidence fields and relates to the technical field of intelligent data processing. The method comprises the following steps: original fields of evidence documents are extracted and encoded, and field names and values are fused into cross-modal vectors to determine a preliminary mapping relationship; a field knowledge graph and a query graph structure are constructed, and a structured mapping relationship is generated through graph isomorphism determination; a graph neural network is used to complete the values of missing fields in the graph; finally, structured due diligence data is generated according to the mapping and completion results. The application achieves accurate, automatic mapping and information completion of evidence fields and improves the efficiency and data integrity of due diligence work.

Description

Technical Field

[0001] This invention relates to the field of intelligent data processing technology, and in particular to an intelligent due diligence method and system for automatic mapping of evidence fields.

Background Technology

[0002] In due diligence across various fields, processing and analyzing large volumes of unstructured or semi-structured evidence documents is a core and demanding task. Existing technologies typically rely on manual review and information extraction, or employ rule-based or simple machine learning-based automated methods. These methods usually predefine a set of fixed keywords or regular expression patterns, matching them within the document to locate and extract specific field information. Alternatively, they utilize traditional named entity recognition technology to identify entities such as names of people, organizations, dates, and amounts in the document and attempt to categorize them into preset categories. These methods can reduce manual workload to some extent and constitute the current mainstream automated or semi-automated processing paradigm in this field.

[0003] However, the aforementioned conventional approaches have significant limitations. Keyword- or rule-based methods rely heavily on expert experience to build rule bases, resulting in poor generalization. When faced with new documents that are diverse in format, flexible in expression, or use variants of technical terminology, the rule system requires frequent adjustment and maintenance, making it difficult to adapt to large-scale, ever-changing real-world application scenarios. Traditional named entity recognition technologies typically focus on identifying the entities themselves and lack an understanding of the deep semantic relationships and business logic dependencies between fields. Relying on entity type matching alone cannot accurately establish the mapping between extracted fields and the standard fields required for due diligence, nor can it infer key fields that are not explicitly present in the document but can be deduced from business logic. The extracted information is therefore poorly structured and incomplete, and substantial manual intervention is ultimately required for verification and completion, which limits the efficiency of automation.

Summary of the Invention

[0004] This invention provides an intelligent due diligence method and system for automatic mapping of evidence fields, which can solve the problems in the prior art.

[0005] A first aspect of this invention provides an intelligent due diligence method for automatic mapping of evidence fields, comprising:

[0006] Obtain the target evidence document to be investigated, extract the original fields and field values to obtain the original field set, and perform semantic encoding on the original field set to obtain the original field semantic vector set;

[0007] Fuse the semantics of field names with the content of field values to generate cross-modal field vectors, calculate the mapping distance between the cross-modal field vectors and the standard field semantic vectors in the preset due diligence template, and determine the preliminary mapping relationship set;

[0008] Construct a field knowledge graph based on the standard fields and business dependencies in the due diligence template, construct a query graph structure based on the original field set and the preliminary mapping relationship set, perform isomorphism determination between candidate subgraphs and the query graph structure to determine the target subgraph, and generate a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure;

[0009] Mark the missing field nodes in the field knowledge graph, extract the field values of the mapped neighbor nodes and encode them as initial node features, generate multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decode them to generate completed field values;

[0010] Based on the structured mapping relationship set, fill the field values in the original field set into the corresponding standard field positions, and fill the completed field values into the missing standard field positions to generate structured due diligence data.

[0011] In one optional embodiment, fusing the semantics of field names with the content of field values to generate a cross-modal field vector, calculating the mapping distance between the cross-modal field vector and the standard field semantic vector in the preset due diligence template, and determining the preliminary mapping relationship set includes:

[0012] Encode the field name of each original field in the original field set to generate a name semantic vector, and extract features from the corresponding field values to generate a content feature vector;

[0013] Calculate the semantic association weight between the name semantic vector and the content feature vector through an attention mechanism, and perform weighted fusion of the name semantic vector and the content feature vector to generate a cross-modal field vector;

[0014] Align the cross-modal field vector with the standard field semantic vector in the preset due diligence template in the shared semantic space, calculate the vector distance between the cross-modal field vector and the standard field semantic vector, and obtain the mapping distance;

[0015] Construct a weighted complete bipartite graph based on the mapping distance, solve for the globally optimal mapping scheme in the weighted complete bipartite graph through augmenting path search and iterative flipping of matching states, and use the mapping pairs in the globally optimal mapping scheme as the preliminary mapping relationship set.

[0016] In one optional embodiment, encoding the field name of each original field in the original field set to generate a name semantic vector, and extracting features from the corresponding field values to generate a content feature vector, includes:

[0017] The field name of each original field in the original field set is segmented to obtain the word sequence in the field name;

[0018] Each word in the word sequence is converted into a word vector, and the word vector sequence is encoded by a multi-layer neural network. Each layer of the multi-layer neural network performs a non-linear transformation on the input vector and outputs a hidden state vector.

[0019] Extract the hidden state vector of the final layer of a multi-layer neural network to determine the name semantic vector;

[0020] Obtain the set of field values corresponding to the original field, iterate through each field value in the set of field values, determine whether the field value conforms to the numeric data format, text data format or date data format, and determine the data type of the field value;

[0021] Perform statistical analysis on the set of field values to extract the numerical distribution characteristics, string length distribution characteristics, and format consistency characteristics of the set of field values;

[0022] The numerical distribution feature, the string length distribution feature, and the format consistency feature are encoded into a feature vector to determine the content feature vector.
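The content-feature steps above can be illustrated with a minimal sketch. It assumes a small set of accepted date formats, a fixed five-element feature layout, and a non-empty list of values; the function names `detect_type` and `content_features` are hypothetical and are not taken from the application.

```python
import re
import statistics
from datetime import datetime

# Assumed date layouts; a real system would likely accept many more.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%B %d, %Y")


def detect_type(value: str) -> str:
    """Classify a raw field value as 'numeric', 'date', or 'text'."""
    text = value.strip()
    if re.fullmatch(r"[-+]?\d[\d,]*(\.\d+)?", text):
        return "numeric"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    return "text"


def content_features(values: list[str]) -> list[float]:
    """Encode a non-empty list of field values as
    [numeric mean, numeric stdev, mean length, stdev of length, format consistency]."""
    types = [detect_type(v) for v in values]
    dominant = max(set(types), key=types.count)
    numbers = [float(v.replace(",", "")) for v, t in zip(values, types) if t == "numeric"]
    lengths = [len(v) for v in values]
    return [
        statistics.fmean(numbers) if numbers else 0.0,   # numerical distribution
        statistics.pstdev(numbers) if numbers else 0.0,
        statistics.fmean(lengths),                       # string length distribution
        statistics.pstdev(lengths),
        types.count(dominant) / len(types),              # share of values in the dominant format
    ]
```

The last element is one possible reading of "format consistency": the fraction of values that share the most common detected type.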

[0023] In one optional embodiment, constructing a weighted complete bipartite graph based on the mapping distance, and solving for the globally optimal mapping scheme in the weighted complete bipartite graph through augmenting path search and iterative flipping of matching states includes:

[0024] Each original field in the original field set is used as the left node set of the bipartite graph structure, and each standard field in the preset due diligence template is used as the right node set.

[0025] For each left node and each right node, the mapping distance between the corresponding original field and the standard field is converted into an edge weight, and the edge weight is negatively correlated with the mapping distance;

[0026] Based on the edge weights, a weighted complete bipartite graph is constructed in the bipartite graph structure, and the matching state of all nodes in the weighted complete bipartite graph is initialized to unmatched.

[0027] Iteratively traverse the set of left nodes. For each unmatched left node, construct an augmenting path search tree rooted at that left node, and use depth-first search to find an alternating path in the weighted complete bipartite graph that yields a positive increase in the global sum of edge weights. The alternating path consists of alternating unmatched and matched edges.

[0028] When such an alternating path is found, flip the states of the matched and unmatched edges along the path and update the set of matching edges so that the global sum of edge weights increases. Repeat the iteration until no left node can find an alternating path that meets the condition.

[0029] Take the final set of matching edges as the globally optimal mapping scheme.
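As an illustration of this kind of search, the following is a minimal sketch of maximum-weight bipartite matching using depth-first augmenting paths with vertex labels, a technique commonly known as the Kuhn-Munkres algorithm. It is one possible realization, not necessarily the exact procedure of the application. It assumes a square weight matrix (unequal sides could be padded with zero-weight dummy nodes), and the function name `km_max_weight_matching` is hypothetical.

```python
def km_max_weight_matching(weights):
    """Return match_left, where match_left[i] is the right node assigned to left node i.

    weights[i][j] is the edge weight between left node i and right node j;
    the matrix is assumed to be square.
    """
    n = len(weights)
    label_left = [max(row) for row in weights]   # feasible vertex labels
    label_right = [0.0] * n
    match_right = [-1] * n                       # left node matched to each right node

    def try_augment(i, seen_left, seen_right, slack):
        """Depth-first search for an augmenting path over tight edges."""
        seen_left[i] = True
        for j in range(n):
            if seen_right[j]:
                continue
            gap = label_left[i] + label_right[j] - weights[i][j]
            if gap < 1e-9:                       # tight edge: usable in the path
                seen_right[j] = True
                if match_right[j] == -1 or try_augment(match_right[j], seen_left, seen_right, slack):
                    match_right[j] = i           # flip matched/unmatched state
                    return True
            else:
                slack[j] = min(slack[j], gap)
        return False

    for i in range(n):
        while True:
            seen_left = [False] * n
            seen_right = [False] * n
            slack = [float("inf")] * n
            if try_augment(i, seen_left, seen_right, slack):
                break
            # No path yet: relax labels so that at least one new edge becomes tight.
            delta = min(slack[j] for j in range(n) if not seen_right[j])
            for k in range(n):
                if seen_left[k]:
                    label_left[k] -= delta
                if seen_right[k]:
                    label_right[k] += delta

    match_left = [-1] * n
    for j, i in enumerate(match_right):
        match_left[i] = j
    return match_left


# Hypothetical weights: a greedy choice of 0.90 would force the poor 0.10 pairing.
similarity = [[0.90, 0.80], [0.85, 0.10]]
assignment = km_max_weight_matching(similarity)
```

With these illustrative weights the global optimum pairs left node 0 with right node 1 and left node 1 with right node 0 (total 1.65), whereas picking the single largest weight first would reach only 1.00.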

[0030] In one optional embodiment, constructing a field knowledge graph based on the standard fields and business dependencies in the due diligence template, constructing a query graph structure based on the original field set and the preliminary mapping relationship set, performing isomorphism determination between candidate subgraphs and the query graph structure to determine the target subgraph, and generating a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure includes:

[0031] Extract standard fields from the due diligence template as nodes, extract business dependencies between standard fields as directed edges, and construct a field knowledge graph based on nodes and directed edges;

[0032] Each original field in the original field set is used as a query node, and the mapping relationship between the original field and the standard field of each mapping pair in the preliminary mapping relationship set is used as a query edge. A query graph structure is constructed based on the query nodes and the query edges.

[0033] Candidate subgraphs are selected from the field knowledge graph, and isomorphism is determined between the candidate subgraphs and the query graph structure. When the candidate subgraph is isomorphic to the query graph structure, the candidate subgraph is determined as the target subgraph.

[0034] Based on the correspondence between the nodes of the target subgraph and the nodes of the query graph structure, the original fields in the query graph structure are mapped to the standard fields in the target subgraph, generating a set of structured mapping relationships.

[0035] In one optional embodiment, performing isomorphism determination between the candidate subgraph and the query graph structure, and determining the candidate subgraph as the target subgraph when it is isomorphic to the query graph structure, includes:

[0036] Obtain the candidate node set and candidate edge set of the candidate subgraph and the query node set and query edge set of the query graph structure, and determine whether the candidate node set and the query node set contain the same number of nodes and whether the candidate edge set and the query edge set contain the same number of edges;

[0037] When the number of nodes and the number of edges are equal, establish a bijective mapping relationship between the candidate node set and the query node set;

[0038] Extract the starting and ending candidate nodes of each candidate edge in the candidate edge set, and extract the starting and ending query nodes of each query edge in the query edge set.

[0039] Determine whether, under the bijective mapping relationship, the starting and ending candidate nodes of each candidate edge correspond to the starting and ending query nodes of a query edge, respectively;

[0040] When every candidate edge corresponds to a query edge under the bijective mapping relationship, determine that the candidate subgraph is isomorphic to the query graph structure, and take the candidate subgraph as the target subgraph.
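The isomorphism check described in this embodiment can be sketched as a brute-force search over bijections. The sketch assumes both graphs are given as a node list plus directed edges between their own nodes, and that the graphs are small enough to enumerate permutations; the function name `find_isomorphism` and the sample node names are hypothetical.

```python
from itertools import permutations


def find_isomorphism(candidate_nodes, candidate_edges, query_nodes, query_edges):
    """Return a dict mapping query nodes to candidate nodes if the two directed
    graphs are isomorphic, otherwise None."""
    candidate_nodes, query_nodes = list(candidate_nodes), list(query_nodes)
    candidate_edges, query_edges = set(candidate_edges), set(query_edges)

    # Step 1: node counts and edge counts must match.
    if len(candidate_nodes) != len(query_nodes) or len(candidate_edges) != len(query_edges):
        return None

    # Step 2: try each bijection between the node sets.
    for ordering in permutations(candidate_nodes):
        bijection = dict(zip(query_nodes, ordering))
        # Step 3: every query edge must land on a candidate edge. Because the
        # edge counts are equal and the mapping is one-to-one, this also means
        # every candidate edge is covered.
        if all((bijection[u], bijection[v]) in candidate_edges for u, v in query_edges):
            return bijection
    return None


mapping = find_isomorphism(
    ["Total Liabilities", "Total Assets", "Debt-to-Asset Ratio"],
    [("Total Liabilities", "Debt-to-Asset Ratio"), ("Total Assets", "Debt-to-Asset Ratio")],
    ["a", "b", "c"],
    [("a", "c"), ("b", "c")],
)
```

Enumerating permutations grows factorially with the number of nodes, so this form only suits small candidate subgraphs; pruning by node degree or a dedicated matcher would be the usual refinement.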

[0041] In one optional embodiment, marking missing field nodes in the field knowledge graph, extracting field values from mapped neighbor nodes and encoding them as initial node features, generating multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decoding them to generate completed field values includes:

[0042] Traverse the standard field nodes in the field knowledge graph and mark the standard field nodes that have no corresponding original field as missing field nodes;

[0043] Obtain the neighbor nodes of each missing field node along the business dependency edges in the field knowledge graph, select the neighbor nodes that have a corresponding mapped original field as the mapped neighbor nodes, extract the field values of the mapped original fields corresponding to the mapped neighbor nodes, and encode them as the initial node features;

[0044] The initial node features are input into the first layer of the graph neural network, the initial node features are transformed and propagated along the adjacent business dependency edges to the missing field nodes, generating the first layer of aggregate features;

[0045] The first layer of aggregated features is input into the second layer of the graph neural network. The first layer of aggregated features is then propagated along the business dependency edges, and the features of the two-hop neighbor nodes are aggregated to the missing field nodes to generate the second layer of aggregated features.

[0046] The first-layer aggregated features and the second-layer aggregated features are concatenated to form a multi-layer aggregated feature vector. The multi-layer aggregated feature vector is then input into the decoding network to be converted into a field value format, generating the completed field value for the missing field node.
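The two-layer propagation and concatenation can be illustrated with a deliberately simplified sketch. It assumes mean aggregation over neighbors with no learned weight matrices or non-linearities (a trained graph neural network would include both), treats dependency edges as undirected for propagation, and omits the decoding network. The function names and the three-node graph are hypothetical.

```python
def mean_aggregate(features, neighbors, dim):
    """One propagation step: each node receives the mean of its neighbors' features."""
    updated = {}
    for node, adjacent in neighbors.items():
        if adjacent:
            updated[node] = [
                sum(features[n][k] for n in adjacent) / len(adjacent) for k in range(dim)
            ]
        else:
            updated[node] = [0.0] * dim
    return updated


def multi_layer_feature(initial, neighbors, target, dim):
    """Concatenate the target node's first-layer and second-layer aggregated features."""
    layer1 = mean_aggregate(initial, neighbors, dim)   # one-hop information
    layer2 = mean_aggregate(layer1, neighbors, dim)    # carries two-hop information
    return layer1[target] + layer2[target]             # list concatenation


# Hypothetical chain: missing node X is adjacent to A, and A is adjacent to B.
neighbors = {"X": ["A"], "A": ["X", "B"], "B": ["A"]}
initial = {"X": [0.0, 0.0], "A": [1.0, 2.0], "B": [3.0, 4.0]}  # X starts empty
vector = multi_layer_feature(initial, neighbors, "X", dim=2)
```

In this toy graph the first half of `vector` comes only from the one-hop neighbor A, while the second half is influenced by B, which is two hops from the missing node.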

[0047] A second aspect of the present invention provides an intelligent due diligence system for automatic mapping of evidence fields, comprising:

[0048] The field extraction unit is used to obtain the target evidence document to be investigated, extract the original fields and field values to obtain the original field set, and perform semantic encoding on the original field set to obtain the original field semantic vector set;

[0049] The field mapping unit is used to fuse the semantics of field names with the content of field values to generate cross-modal field vectors, calculate the mapping distance between the cross-modal field vectors and the standard field semantic vectors in the preset due diligence template, and determine the preliminary mapping relationship set;

[0050] The relationship building unit is used to build a field knowledge graph based on the standard fields and business dependencies in the due diligence template, build a query graph structure based on the original field set and the preliminary mapping relationship set, perform isomorphism determination between candidate subgraphs and the query graph structure to determine the target subgraph, and generate a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure;

[0051] The field completion unit is used to mark missing field nodes in the field knowledge graph, extract the field values of mapped neighbor nodes and encode them as initial node features, generate multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decode them to generate completed field values;

[0052] The data generation unit is used to fill the field values in the original field set into the corresponding standard field positions according to the structured mapping relationship set, and to fill the completed field values into the missing standard field positions, thereby generating structured due diligence data.

[0053] A third aspect of the present invention provides an electronic device, comprising:

[0054] A processor;

[0055] A memory for storing processor-executable instructions;

[0056] Wherein the processor is configured to invoke the instructions stored in the memory to execute the aforementioned method.

[0057] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.

[0058] In this embodiment of the invention, extracting the semantic vector of each original field and fusing the field name with the field value into a cross-modal vector captures the deep semantic information of the field more accurately. Calculating the mapping distance between the cross-modal vector and the standard field semantic vector overcomes the limitations of traditional keyword-based matching in handling semantic ambiguity and diverse expressions. Constructing a query graph and performing subgraph isomorphism determination against the knowledge graph verifies and corrects the preliminary mapping results from a global association perspective, so that the generated mapping relationships are consistent and complete in terms of business logic and avoid the contradictions and errors that isolated mappings can produce. Using the features of mapped neighbor nodes as initial information, the multi-layer message propagation mechanism of the graph neural network aggregates the contextual information of multi-hop neighbors and thereby infers reasonable values for missing fields more accurately, with stronger reasoning ability and generalization than methods that rely on a single field or simple rules. Filling the original field values and the completed field values into the corresponding positions of the standard template automatically generates high-quality, structured due diligence data, which greatly reduces the manual workload of verifying, organizing, and completing data, improves the efficiency and accuracy of due diligence report generation, and provides a reliable and consistent data foundation for subsequent risk assessment and decision analysis.

Description of the Drawings

[0059] Figure 1 is a flowchart of the intelligent due diligence method for automatic mapping of evidence fields according to an embodiment of the present invention.

[0060] Figure 2 is a flowchart of the graph construction and isomorphism determination logic.

Detailed Implementation

[0061] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0062] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0063] Figure 1 is a flowchart of the intelligent due diligence method for automatic mapping of evidence fields according to an embodiment of the present invention. As shown in Figure 1, the method includes:

[0064] Obtain the target evidence document to be investigated, extract the original fields and field values to obtain the original field set, and perform semantic encoding on the original field set to obtain the original field semantic vector set;

[0065] Fuse the semantics of field names with the content of field values to generate cross-modal field vectors, calculate the mapping distance between the cross-modal field vectors and the standard field semantic vectors in the preset due diligence template, and determine the preliminary mapping relationship set;

[0066] Construct a field knowledge graph based on the standard fields and business dependencies in the due diligence template, construct a query graph structure based on the original field set and the preliminary mapping relationship set, perform isomorphism determination between candidate subgraphs and the query graph structure to determine the target subgraph, and generate a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure;

[0067] Mark the missing field nodes in the field knowledge graph, extract the field values of the mapped neighbor nodes and encode them as initial node features, generate multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decode them to generate completed field values;

[0068] Based on the structured mapping relationship set, fill the field values in the original field set into the corresponding standard field positions, and fill the completed field values into the missing standard field positions to generate structured due diligence data.

[0069] In one specific implementation, the intelligent due diligence method with automatic mapping of evidence fields begins by acquiring the target evidence document to be investigated. The target evidence document is converted into text format using a document parsing engine, and layout analysis is performed to identify tables, paragraphs, and structural relationships within the document. For table data, table headers are extracted as field names and table content is extracted as field values; for text paragraphs, key information points and their corresponding values are extracted through keyword matching and semantic analysis. The extracted items form the original field set, in which each element contains a field name and a field value. A pre-trained BERT language model encodes the original field names, converting each field name into a 768-dimensional vector representation and forming the original field semantic vector set.

[0070] When fusing field features, each original field name is segmented using a Chinese word segmentation tool, stop words are removed, and core words are retained. The segmentation results are input into a pre-trained word vector model to obtain a 300-dimensional word vector for each word. The word vector sequence is then encoded using a bidirectional LSTM network, and the last hidden state is extracted as a 512-dimensional name semantic vector. Data type identification is performed on field values to distinguish between numeric, text, and date types. For numeric data, statistical features such as maximum, minimum, mean, median, and standard deviation are calculated; for text data, features such as character count, word frequency, and proportion of special characters are calculated; for date data, features such as the year-month-day format and time span are extracted. These features are integrated into a 256-dimensional content feature vector. An attention layer is constructed to calculate the association weights between the name semantic vector and the content feature vector: the two vectors are multiplied and the result is normalized with softmax. The name semantic vector and content feature vector are then weighted and fused according to the attention weights to generate a 768-dimensional cross-modal field vector.

[0071] A predefined due diligence template is used, containing the standard fields and relationships for various due diligence scenarios. Each standard field is encoded into a standard field semantic vector using a pre-trained model. The cosine similarity between each cross-modal field vector and every standard field semantic vector is calculated, and the negative of the similarity is used as the mapping distance. A weighted bipartite graph is constructed from the mapping distances, with the original fields as left nodes, the standard fields as right nodes, and the similarity values as edge weights. The Kuhn-Munkres (KM) algorithm is used to solve the maximum-weight matching problem on the weighted bipartite graph, yielding the globally optimal field mapping scheme and forming the preliminary mapping relationship set.
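The relationship between similarity, mapping distance, and edge weight in this paragraph can be made concrete with a small sketch. It assumes the vectors are plain lists of floats of equal length; the function names `cosine_similarity` and `mapping_tables` are hypothetical, and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mapping_tables(field_vectors, standard_vectors):
    """Return (distance, weight) matrices indexed [original field][standard field].

    weight is the cosine similarity and distance is its negative, so a smaller
    distance always corresponds to a larger edge weight.
    """
    weight = [[cosine_similarity(f, s) for s in standard_vectors] for f in field_vectors]
    distance = [[-w for w in row] for row in weight]
    return distance, weight


# One original field compared against two standard fields (toy 2-D vectors).
distance, weight = mapping_tables([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The `weight` matrix is what a maximum-weight matching step would consume; the `distance` matrix is kept only to mirror the terminology of the text.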

[0072] When constructing the field knowledge graph, standard fields from the due diligence template are used as nodes in the graph, and business dependencies are used as directed edges. Taking financial due diligence as an example, the debt-to-asset ratio depends on total liabilities and total assets, forming dependency edges. A query graph structure is constructed based on the original field set and preliminary mapping relationships, using the original fields as query nodes and the mapping relationships as query edges. All possible subgraphs are enumerated in the field knowledge graph for isomorphism determination. Isomorphism determination first compares whether the number of nodes and edges match, then establishes a one-to-one correspondence between nodes, checking whether the edge relationships of each pair of corresponding nodes are consistent. After finding an isomorphic subgraph, it is identified as the target subgraph, and a set of structured mapping relationships is generated based on the node correspondences.
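A minimal sketch of the two graph structures, using the debt-to-asset ratio example from the paragraph above. The edge direction (from prerequisite field to dependent field), the function names, and the original field names `"Liabilities (total)"` and `"Assets (total)"` are assumptions made for illustration.

```python
def build_knowledge_graph(standard_fields, dependencies):
    """Return {field: set of fields that depend on it}.

    dependencies holds (dependent, prerequisite) pairs; each becomes a directed
    edge prerequisite -> dependent (the direction is an assumption).
    """
    graph = {field: set() for field in standard_fields}
    for dependent, prerequisite in dependencies:
        graph[prerequisite].add(dependent)
    return graph


def build_query_graph(original_fields, preliminary_mapping):
    """Return (query nodes, query edges) built from the original fields and
    the preliminary mapping pairs (original field -> standard field)."""
    nodes = list(original_fields)
    edges = [(orig, std) for orig, std in preliminary_mapping.items() if orig in nodes]
    return nodes, edges


knowledge_graph = build_knowledge_graph(
    ["Total Liabilities", "Total Assets", "Debt-to-Asset Ratio"],
    [("Debt-to-Asset Ratio", "Total Liabilities"),
     ("Debt-to-Asset Ratio", "Total Assets")],
)
query_nodes, query_edges = build_query_graph(
    ["Liabilities (total)", "Assets (total)"],
    {"Liabilities (total)": "Total Liabilities", "Assets (total)": "Total Assets"},
)
```

Here "Debt-to-Asset Ratio" has two incoming dependency edges and no mapped original field, which is the situation in which the completion step would try to infer its value.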

[0073] When filling in missing fields, all standard field nodes in the field knowledge graph are traversed to check which nodes lack corresponding original fields and mark them as missing field nodes. Along the business dependency edges in the knowledge graph, one-hop neighbor nodes of the missing nodes are identified, and neighbor nodes with existing mapping relationships are selected. The field values corresponding to the original fields of these neighbor nodes are extracted and encoded into numerical vectors as initial node features. A Graph Convolutional Network (GCN) is used for feature propagation. The first layer of the GCN propagates the initial features of neighbor nodes along dependency edges to the missing nodes, generating the first layer of aggregated features; the second layer of the GCN continues to aggregate the features of two-hop neighbor nodes, generating the second layer of aggregated features. The multi-layer aggregated features are concatenated and input into a fully connected network for dimensionality reduction and feature transformation, and then a decoding layer generates values that conform to the target field type. Numerical fields are decoded into specific numerical values, categorical fields are decoded into category labels, and textual fields are decoded into text content.

[0074] When generating structured due diligence data, each pair of mappings in the set of structured mapping relationships is traversed, and the field values of the original fields are filled into the corresponding standard field positions. For missing fields, the completed field values are filled into the appropriate positions. According to the data structure defined in the due diligence template, the filled standard fields and values are organized into a structured data format, such as JSON or XML. The final generated structured due diligence data is organized according to the preset template structure, containing all necessary standard fields and corresponding values, providing complete and standardized data support for subsequent due diligence analysis and decision-making.
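A minimal sketch of this assembly step, producing JSON. The function name `build_structured_record`, the argument layout, and the sample contract values are illustrative assumptions; a template field with neither a mapped original field nor a completed value is left as `None` here.

```python
import json


def build_structured_record(template_fields, structured_mapping, original_values, completed_values):
    """Fill each template field from its mapped original field, falling back to
    the completed value for fields that have no mapping."""
    standard_to_original = {std: orig for orig, std in structured_mapping.items()}
    record = {}
    for field in template_fields:                      # keeps the template's field order
        if field in standard_to_original:
            record[field] = original_values[standard_to_original[field]]
        else:
            record[field] = completed_values.get(field)
    return record


record = build_structured_record(
    template_fields=["Counterparty", "Effective Date", "Transaction Amount", "Expiration Date"],
    structured_mapping={
        "Contract Party A": "Counterparty",
        "Signing Date": "Effective Date",
        "Contract Amount": "Transaction Amount",
    },
    original_values={
        "Contract Party A": "XXX Technology Co., Ltd.",
        "Signing Date": "January 15, 2024",
        "Contract Amount": "5 million yuan",
    },
    completed_values={"Expiration Date": "January 14, 2026"},
)
document = json.dumps(record, ensure_ascii=False, indent=2)
```

Iterating over the template rather than over the mapping keeps the output in the template's order and makes unfilled fields visible instead of silently dropping them.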

[0075] In a practical application, original fields are extracted from a contract document, such as "Contract Party A" with the value "XXX Technology Co., Ltd.", "Signing Date" with the value "January 15, 2024", and "Contract Amount" with the value "5 million yuan". The standard due diligence template includes fields such as "Counterparty", "Effective Date", "Transaction Amount", and "Expiration Date". The computed semantic similarity is 0.89 between "Contract Party A" and "Counterparty", 0.93 between "Signing Date" and "Effective Date", and 0.87 between "Contract Amount" and "Transaction Amount". After the field knowledge graph is established, the standard field "Expiration Date" is found to have no corresponding original field, but the knowledge graph shows that "Expiration Date" depends on "Effective Date" and "Contract Term". The "Signing Date" value "January 15, 2024" and the "Contract Term" value "24 months" are extracted from the original fields, and the graph neural network infers the "Expiration Date" value "January 14, 2026". The final structured due diligence data includes complete information such as "Counterparty": "XXX Technology Co., Ltd.", "Effective Date": "January 15, 2024", "Transaction Amount": "5 million yuan", and "Expiration Date": "January 14, 2026".
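The date in this example can be checked with ordinary calendar arithmetic. The sketch below is a deterministic rule, not the graph neural network described in the application, and it assumes that the effective date counts as the first day of the term, so the term ends one day before the same calendar day 24 months later. The function names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the last day of the month."""
    index = start.month - 1 + months
    year, month = start.year + index // 12, index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def expiration_date(effective: date, term_months: int) -> date:
    """Last day of the term, assuming the effective date is day one of the term."""
    return add_months(effective, term_months) - timedelta(days=1)


expires = expiration_date(date(2024, 1, 15), 24)
```

Under that assumption, an effective date of January 15, 2024 and a 24-month term give January 14, 2026, which agrees with the value stated in the example.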

[0076] In one optional embodiment, a cross-modal field vector is generated by fusing the semantics of field names and the content of field values. The mapping distance between the cross-modal field vector and the standard field semantic vector in a preset due diligence template is calculated, and a preliminary mapping relationship set is determined, including:

[0077] Encode the field name of each original field in the original field set to generate a name semantic vector, and extract features from the corresponding field values to generate a content feature vector;

[0078] The semantic association weight between the name semantic vector and the content feature vector is calculated through an attention mechanism, and the name semantic vector and the content feature vector are weighted and fused to generate a cross-modal field vector.

[0079] Align the cross-modal field vector with the standard field semantic vector in the preset due diligence template in the shared semantic space, calculate the vector distance between the cross-modal field vector and the standard field semantic vector, and obtain the mapping distance;

[0080] A weighted complete bipartite graph is constructed based on the mapping distance. The globally optimal mapping scheme is solved by augmenting path search and iterative flipping of matching states in the weighted complete bipartite graph. The mapping pairs in the globally optimal mapping scheme are used as the preliminary mapping relationship set.

[0081] In one specific implementation, after acquiring the target evidence document, dual-channel encoding is performed on each original field in the original field set. For field names, a pre-trained language model, such as BERT, is used to convert the text sequence of field names into a dense vector representation of fixed dimensions. This semantic vector has a dimension of 768, which can capture the semantic information of the field name. Simultaneously, feature extraction is performed on the field values. When the field value is text-based, a document-level encoder is used to extract its content features; when the field value is numeric, statistical feature engineering is used to extract numerical distribution features, dimensional features, etc., ultimately generating a 512-dimensional content feature vector.

[0082] In the cross-modal fusion stage, a multi-head attention mechanism is introduced to calculate the interaction between the name semantic vector and the content feature vector. Specifically, the name semantic vector is used as the query vector Q, and the content feature vectors are used as the key vector K and value vector V. A semantic association weight matrix is calculated using a scaled dot product attention mechanism. The attention score is normalized by Softmax to obtain the weight coefficient α, which reflects the semantic consistency between the field name and the field value content. Subsequently, a weighted fusion operation is performed, combining the name semantic vector and the weighted content feature vector through linear transformation and residual connection to output a 1024-dimensional cross-modal field vector. This vector simultaneously encodes the naming convention of the field name and the actual content features of the field value.
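A single-head version of this attention step can be sketched in plain Python. The sketch is illustrative only: it uses toy 4-dimensional vectors of equal size instead of the 768- and 512-dimensional inputs, omits the learned projection matrices and the multi-head split, and replaces the linear transformation with a bare residual addition. The function names `attend` and `fuse` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    alpha = softmax(scores)  # weight coefficients
    dim = len(values[0])
    context = [sum(a * v[i] for a, v in zip(alpha, values)) for i in range(dim)]
    return alpha, context

def fuse(name_vec, context):
    """Residual-style fusion (assumes both vectors share one dimension)."""
    return [n + c for n, c in zip(name_vec, context)]

name_vec = [1.0, 0.0, 0.0, 0.0]       # toy name semantic vector (query)
content_vecs = [                      # toy content feature vectors (keys and values)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
alpha, context = attend(name_vec, content_vecs, content_vecs)
fused = fuse(name_vec, context)
print([round(a, 3) for a in alpha])
```

The content vector that points in the same direction as the name vector receives the largest weight, which is the sense in which α reflects consistency between the field name and its value content.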

[0083] To achieve precise alignment between the original fields and the standard fields, a shared semantic space is pre-constructed. The standard field names in the preset due diligence template are converted into a set of standard field semantic vectors using the same encoder. These vectors have the same dimensionality of 1024 as the cross-modal field vectors. Within the shared semantic space, the vector distance between the cross-modal field vector and each standard field semantic vector is calculated from their cosine similarity, as a cosine distance; a smaller distance value indicates higher semantic similarity. All combinations of original and standard fields are iterated to construct a distance matrix $D$, where the element $D_{ij}$ represents the mapping distance between the $i$-th original field and the $j$-th standard field.
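The distance matrix can be sketched as below, assuming the mapping distance is the cosine distance, that is, one minus the cosine similarity, which ranges over [0, 2]. The toy two-dimensional vectors and the function names are illustrative assumptions; returning 1.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 for identical directions, 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention for a zero vector
    return 1.0 - dot / (norm_a * norm_b)

def distance_matrix(field_vecs, standard_vecs):
    """D[i][j] is the distance between original field i and standard field j."""
    return [[cosine_distance(f, s) for s in standard_vecs] for f in field_vecs]

field_vecs = [[1.0, 0.0], [0.0, 1.0]]                  # toy cross-modal field vectors
standard_vecs = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]  # toy standard field vectors
D = distance_matrix(field_vecs, standard_vecs)
for row in D:
    print([round(d, 4) for d in row])
```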

[0084] A weighted complete bipartite graph is constructed based on the distance matrix. The set of nodes on the left side of the graph corresponds to the original field set, and the set of nodes on the right side corresponds to the standard field set. The weight of each edge is the negative of the mapping distance, so the Hungarian algorithm is used to solve the maximum-weight matching problem, which is equivalent to minimizing the total mapping distance. The algorithm iterates by alternately searching for augmenting paths and updating dual variables. After feasible vertex labels are initialized, each iteration starts from an unmatched left-side node and searches for an augmenting path using breadth-first search. When an augmenting path is found, the matching state is flipped along it: unmatched edges on the path become matched edges, and matched edges become unmatched edges. The iteration terminates when all left-side nodes are matched or no new augmenting path can be found. After convergence, the algorithm outputs the globally optimal mapping scheme, which minimizes the sum of the mapping distances and satisfies the one-to-one mapping constraint. Mapping pairs in the optimal scheme with a mapping distance less than the threshold of 0.3 are retained, and the remaining mapping pairs are marked as low-confidence mappings, thus forming the preliminary mapping relationship set. Each element in the set contains three pieces of information: the original field identifier, the standard field identifier, and the mapping confidence score.
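One way to realize this matching step is the standard potentials (dual variable) formulation of the Hungarian algorithm, sketched below. It minimizes the total distance directly, which is equivalent to maximizing the sum of negated distances, and it requires the number of original fields not to exceed the number of standard fields. The diagonal distances are one minus the similarities quoted in the contract example; the off-diagonal distances, the fourth field, and the confidence definition (one minus distance) are assumptions made for illustration.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m. Returns {row: col}."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (dual variables)
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row matched to column j (0 means free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):      # update dual variables
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:              # reached a free column
                break
        while j0 != 0:                  # flip matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}

def split_by_threshold(assignment, dist, threshold=0.3):
    """Separate retained pairs from low-confidence pairs."""
    retained, low_confidence = [], []
    for i, j in sorted(assignment.items()):
        entry = (i, j, round(1.0 - dist[i][j], 2))  # assumed confidence score
        (retained if dist[i][j] < threshold else low_confidence).append(entry)
    return retained, low_confidence

dist = [
    [0.11, 0.80, 0.75, 0.85],   # Contract Party A
    [0.82, 0.07, 0.90, 0.25],   # Signing Date
    [0.78, 0.88, 0.13, 0.92],   # Contract Amount
    [0.90, 0.95, 0.85, 0.70],   # hypothetical extra field with no good match
]
assignment = hungarian(dist)
retained, low_confidence = split_by_threshold(assignment, dist)
print(retained, low_confidence)
```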

[0085] In one optional embodiment, encoding the field name of each original field in the original field set to generate a name semantic vector, and extracting features from the corresponding field values to generate a content feature vector, includes:

[0086] The field name of each original field in the original field set is segmented to obtain the word sequence in the field name;

[0087] Each word in the word sequence is converted into a word vector, and the word vector sequence is encoded by a multi-layer neural network. Each layer of the multi-layer neural network performs a non-linear transformation on the input vector and outputs a hidden state vector.

[0088] Extract the hidden state vector of the final layer of a multi-layer neural network to determine the name semantic vector;

[0089] Obtain the set of field values corresponding to the original field, iterate through each field value in the set of field values, determine whether the field value conforms to the numeric data format, text data format or date data format, and determine the data type of the field value;

[0090] Perform statistics on the set of field values, and extract the numerical distribution characteristics, string length distribution characteristics, and format consistency characteristics of the set of field values;

[0091] Encode the numerical distribution characteristics, the string length distribution characteristics, and the format consistency characteristics into feature vectors to determine the content feature vector.

[0092] In a specific embodiment, in the intelligent due diligence method for automatic mapping of evidence fields, field names such as "Enterprise Operating Period", "Approved Registered Capital", and "Paid-in Registered Capital" are extracted from the original field set, and word segmentation processing is applied to each field name. The field name "Enterprise Operating Period" is segmented into three words: "Enterprise", "Operation", and "Period" through word segmentation. A professional Chinese word segmentation tool is used to segment the field names. The word segmentation tool identifies the boundaries of Chinese words based on a dictionary and a probability statistical model, and at the same time introduces a stop word filtering function to remove words with little semantic contribution such as "de", "he", and "yu". After word segmentation, a word sequence is formed for each field name. For example, the word segmentation of "Approved Registered Capital" results in a word sequence of "Approved", "Registered", and "Capital".

[0093] Input each word in the word sequence into a pre-trained word vector model to convert it into a word vector of a fixed dimension. For the word "Enterprise", it is converted into a 300-dimensional word vector through the pre-trained word vector model, representing the semantic features of the word. The pre-trained word vector model is obtained through training on a large-scale corpus and can map words with similar semantics to close positions in the vector space. For example, words with similar semantics such as "Enterprise", "Company", and "Unit" are close in the vector space. After each word in the word sequence is converted, a word vector sequence is formed. For example, "Enterprise Operating Period" is converted into a sequence composed of three 300-dimensional word vectors.

[0094] Construct a multi-layer neural network to encode the word vector sequence, and adopt a bidirectional long short-term memory network (BiLSTM) structure. BiLSTM contains LSTM layers in both the forward and backward directions and can consider the context information of the word sequence simultaneously. For the input word vector sequence, the forward LSTM unit of the first layer of BiLSTM processes the sequence from left to right, and the backward LSTM unit processes the sequence from right to left. Each LSTM unit contains three gating units: an input gate, a forget gate, and an output gate, and controls the information flow through the gating mechanism. Taking "Enterprise Operating Period" as an example, the forward LSTM unit processes the word vectors of "Enterprise", "Operation", and "Period" in sequence, and the backward LSTM unit processes the word vectors of "Period", "Operation", and "Enterprise" in sequence. The output hidden state vector of the first layer of LSTM is used as the input of the second layer of LSTM. The last layer of BiLSTM outputs a 256-dimensional forward hidden state vector and a 256-dimensional backward hidden state vector.

[0095] Extract the hidden state vector of the final layer of the BiLSTM, and concatenate the forward and backward hidden state vectors to obtain a 512-dimensional vector. For the "Enterprise Operating Period", the finally obtained hidden state vector contains the complete semantic information of this field name. To enhance the expression ability of the hidden state vector, an attention mechanism is introduced to assign different weights to each word vector in the word vector sequence. Calculate the correlation score of each word vector with the overall sequence, and obtain the attention weight after normalization by the softmax function. For the "Enterprise Operating Period", the word weights of "operating" and "period" are relatively high, reflecting their importance in the field semantics. Weight and sum the word vectors according to the attention weights, concatenate them with the hidden state vector output by the BiLSTM, and perform dimensionality reduction processing through the fully connected layer to obtain the final name semantic vector.

[0096] Obtain the set of field values, such as the value "October 12, 2005 to October 11, 2035" corresponding to the "Enterprise Operating Period". Judge the data type of the field value, and distinguish between numerical type, text type, date type, etc. Numerical type judgment basis: The field value consists entirely of numbers or contains numerical symbols such as decimal points, plus and minus signs, and conforms to the numerical format specification. Text type judgment basis: The field value contains non-numerical characters such as Chinese and English characters, punctuation marks, etc. Date type judgment basis: The field value conforms to a specific date format pattern, such as "yyyy-MM-dd", "yyyy year MM month dd day", etc. Apply the regular expression pattern matching to the value of the "Enterprise Operating Period" to identify the date format "yyyy year MM month dd day" contained therein, and determine it as date type data.

[0097] Perform statistical analysis on the set of field values, and extract numerical distribution characteristics, string length characteristics, and format consistency characteristics. Numerical distribution characteristics include statistical quantities such as maximum value, minimum value, average value, median, standard deviation, quartiles, etc. For the set of values of the numerical field "Registered Capital", such as "1 million yuan", "5 million yuan", "10 million yuan", etc., calculate the minimum value of 1 million yuan, the maximum value of 10 million yuan, the average value of 5.3333 million yuan, and the standard deviation of 4.5093 million yuan. String length characteristics include average length, maximum length, minimum length, length standard deviation, etc. For the set of values of the text field "Enterprise Address", calculate the average length of 25.7 characters, the maximum length of 42 characters, the minimum length of 15 characters, and the length standard deviation of 8.9. The format consistency characteristic represents the degree of uniformity of the format in the set of field values. Match various common formats through regular expressions, and count the occurrence frequency of each format. For the set of values of the date field "Establishment Date", count that the "yyyy-MM-dd" format accounts for 60%, and the "yyyy year MM month dd day" format accounts for 40%.

[0098] Encode the feature vectors to construct content feature vectors. For the numerical distribution features, normalize statistics such as the maximum value, minimum value, and average value, map them to the interval [0, 1], and splice them to form numerical feature sub-vectors. For the string length features, normalize the length statistics in the same way and splice them to form length feature sub-vectors. For the format consistency features, form a probability distribution vector of the proportion of each format as the format feature sub-vector. Taking "Enterprise Operating Period" as an example, analyze its value set to obtain date format features, and extract the interval features between the start and end dates, such as the minimum interval of 10 years, the maximum interval of 30 years, and the average interval of 20.5 years, etc. Splice the numerical feature sub-vectors, length feature sub-vectors, and format feature sub-vectors, and then perform feature fusion and dimensionality reduction through a fully connected neural network to finally obtain a 256-dimensional content feature vector, which completely represents the statistical characteristics of the field value set.
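The normalization and splicing of sub-vectors can be sketched as follows, before any fully connected reduction to 256 dimensions, which is not shown. The application does not state the reference range used for normalization, so the bounds below are assumptions, as are the function names.

```python
def min_max(value, lower, upper):
    """Map a statistic into [0, 1] using assumed reference bounds."""
    if upper == lower:
        return 0.0
    scaled = (value - lower) / (upper - lower)
    return min(1.0, max(0.0, scaled))

def splice_content_features(numeric_stats, length_stats, format_shares,
                            numeric_bounds, length_bounds):
    """Concatenate numeric, length, and format sub-vectors into one raw vector."""
    numeric_part = [min_max(v, *numeric_bounds) for v in numeric_stats]
    length_part = [min_max(v, *length_bounds) for v in length_stats]
    format_part = list(format_shares)  # already a probability distribution
    return numeric_part + length_part + format_part

raw_vector = splice_content_features(
    numeric_stats=[10.0, 1.0, 5.3333],   # max, min, mean in millions of yuan
    length_stats=[25.7, 42.0, 15.0],     # mean, max, min string length
    format_shares=[0.6, 0.4],            # share of each date format
    numeric_bounds=(0.0, 100.0),         # assumed reference range
    length_bounds=(0.0, 100.0),          # assumed reference range
)
print(raw_vector)
```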

[0099] In the intelligent due diligence process, when processing enterprise industrial and commercial information, the original field name "Enterprise Establishment Time" is segmented into three words: "Enterprise", "Establishment", and "Time", and converted into a 300-dimensional word vector sequence through a word vector model. This sequence is processed by two layers of BiLSTM, each layer containing 128 LSTM units, to obtain forward and backward hidden state vectors of 256 dimensions each, and a 512-dimensional vector is obtained after splicing. The attention mechanism is applied to calculate the word vector weights, with the weight of "Establishment" being 0.45, the weight of "Time" being 0.35, and the weight of "Enterprise" being 0.2, and the representation vector is obtained through weighted summation. After this vector is spliced with the output vector of the BiLSTM, a 512-dimensional name semantic vector is output through a fully connected layer. The field value set of "Enterprise Establishment Time", such as "2015-06-18", "March 24, 2017", etc., is analyzed and identified as date-type data, and statistical features are extracted, such as an average year of 2016.8, an earliest year of 2015, and a latest year of 2020. The average string length is calculated as 10.3 characters, and the format consistency feature shows that the "yyyy-MM-dd" format accounts for 60% and the "yyyy year MM month dd day" format accounts for 40%. After these features are normalized and spliced, dimensionality reduction is performed through two fully connected layers to obtain a 256-dimensional content feature vector. Together, the name semantic vector and the content feature vector represent the field name semantics and the field value features of the original field.

[0100] In an optional embodiment, constructing a weighted complete bipartite graph based on the mapping distance, and solving the global optimal mapping scheme in the weighted complete bipartite graph through augmenting path search and matching state iterative flipping includes:

[0101] Take each original field in the original field set as the left node set of the bipartite graph structure, and take each standard field in the preset due diligence template as the right node set;

[0102] For each left node and each right node, the mapping distance between the corresponding original field and the standard field is converted into an edge weight, and the edge weight is negatively correlated with the mapping distance;

[0103] Based on the edge weights, a weighted complete bipartite graph is constructed in the bipartite graph structure, and the matching state of all nodes in the weighted complete bipartite graph is initialized to unmatched.

[0104] Iteratively traverse the set of left nodes. For each unmatched left node, construct an augmented path search tree starting from that left node. Use depth-first search to find an alternating path in the weighted complete bipartite graph that satisfies the positive increment of the sum of global edge weights. The alternating path consists of alternating unmatched and matched edges.

[0105] When the alternating path is found, the state of the matching edge and the unmatched edge are flipped along the alternating path, the set of matching edges is updated to increase the sum of the global edge weights, and the iteration is repeated until all left nodes cannot find an alternating path that meets the conditions.

[0106] The set of all matching edges in the final set is taken as the globally optimal mapping scheme.

[0107] In one specific implementation, when establishing the initial mapping relationship, each original field in the original field set is first used as the left-side node set of the bipartite graph structure, and each standard field in the preset due diligence template is used as the right-side node set. Assuming 15 original fields are extracted from the evidence document to be investigated, and the preset due diligence template defines 20 standard fields, the constructed bipartite graph will contain 15 nodes on the left and 20 nodes on the right.

[0108] For each left-hand node and each right-hand node, the mapping distance between the corresponding original field and the standard field is converted into edge weights. The mapping distance is obtained by calculating the cosine distance between the cross-modal field vector and the semantic vector of the standard field, with a distance value ranging from [0, 2]. The edge weights are negatively correlated with the mapping distance, and the specific conversion formula is w = 1 / (d + 0.01), where w represents the edge weight, d represents the mapping distance, and the constant 0.01 is used to avoid division by zero anomalies. The smaller the mapping distance, the larger the corresponding edge weight, indicating a higher confidence level in the match between the original field and the standard field.

[0109] A weighted complete bipartite graph is constructed in a bipartite graph structure based on the calculated edge weights. In this graph, every node on the left is connected to all nodes on the right by an edge, and the total number of edges is the product of the number of nodes on the left and the number of nodes on the right. The matching state of all nodes in the weighted complete bipartite graph is initialized to unmatched, and the set of matched edges is initialized to an empty set.
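The weight conversion $w = 1 / (d + 0.01)$ and the construction of the complete bipartite graph can be sketched as follows. The distance values are deterministic placeholders within [0, 2], not real data; the 15 by 20 size follows the example above, and the function names are hypothetical.

```python
def edge_weight(distance, eps=0.01):
    """w = 1 / (d + 0.01); the constant avoids division by zero when d = 0."""
    return 1.0 / (distance + eps)

def build_complete_bipartite(distances):
    """Return {(left_index, right_index): weight} with one edge per field pair."""
    return {(i, j): edge_weight(d)
            for i, row in enumerate(distances)
            for j, d in enumerate(row)}

# Placeholder distances for 15 original fields and 20 standard fields.
distances = [[((i * 7 + j * 3) % 20) / 10 for j in range(20)] for i in range(15)]
graph = build_complete_bipartite(distances)

matched_edges = set()                     # the matching starts empty
left_matched = {i: None for i in range(15)}
print(len(graph), round(edge_weight(0.0), 2), round(edge_weight(2.0), 4))
```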

[0110] Iteratively traverse the set of left-hand nodes. For each unmatched left-hand node, construct an augmenting path search tree starting from that node. A depth-first search strategy is used to find alternating paths in a weighted complete bipartite graph where the sum of global edge weights increases positively. Alternating paths consist of alternating unmatched and matched edges, starting at an unmatched left-hand node and ending at an unmatched right-hand node. During the search, a visit marker array is maintained to prevent repeated visits to the same node from causing the search to loop.

[0111] When an alternating path that meets the conditions is found, the states of matched and unmatched edges are reversed along the alternating path. Specifically, unmatched edges in the path are added to the set of matched edges, and matched edges in the path are removed from the set of matched edges. This operation increases the sum of global edge weights because the sum of the weights of the newly added edges is greater than the sum of the weights of the removed edges. After updating the set of matched edges, the matching states of the relevant nodes are updated synchronously, and the left start node and the right end node become matched.
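The flipping operation amounts to taking the symmetric difference between the current matching and the edges on the path. The sketch below illustrates this on a two-by-two toy graph; the node labels, weights, and function names are assumptions made for illustration.

```python
def edges_on_path(path):
    """Convert an alternating node path (left, right, left, ...) into (left, right) edges."""
    edges = set()
    for step in range(len(path) - 1):
        a, b = path[step], path[step + 1]
        # Even steps run left -> right; odd steps run right -> left.
        edges.add((a, b) if step % 2 == 0 else (b, a))
    return edges

def weight_gain(matching, path_edges, weights):
    """Weight of edges that would be added minus weight of edges that would be removed."""
    added = path_edges - matching
    removed = path_edges & matching
    return sum(weights[e] for e in added) - sum(weights[e] for e in removed)

def flip(matching, path_edges):
    """Unmatched path edges become matched and matched path edges become unmatched."""
    return matching ^ path_edges

weights = {("L1", "R1"): 6.0, ("L1", "R2"): 4.0,
           ("L2", "R1"): 5.0, ("L2", "R2"): 1.0}
matching = {("L1", "R1")}
path = ["L2", "R1", "L1", "R2"]   # starts and ends at unmatched nodes

path_edges = edges_on_path(path)
gain = weight_gain(matching, path_edges, weights)
if gain > 0:
    matching = flip(matching, path_edges)
print(gain, sorted(matching))
```

Here the flip replaces one matched edge of weight 6 with two edges of total weight 9, so the matching grows by one edge and the global weight sum increases.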

[0112] Repeat the above iterative process until no alternating path satisfying the conditions can be found for any of the left-hand nodes. At this point, the weighted complete bipartite graph reaches the maximum weight matching state, and the sum of global edge weights reaches its maximum value. All mapping pairs in the final set of matched edges are taken as the globally optimal mapping scheme, where each mapping pair represents a definite mapping relationship between an original field and its corresponding standard field. For left-hand nodes that remain unmatched, the corresponding original fields are not mapped for the time being, and will be further optimized later through the structural constraints of the field knowledge graph.

[0113] In one optional embodiment, a field knowledge graph is constructed based on standard fields and business dependencies in the due diligence template; a query graph structure is constructed based on the original field set and the preliminary mapping relationship set; isomorphism is determined between candidate subgraphs and the query graph structure to identify the target subgraph; and a structured mapping relationship set is generated based on the node correspondence between the target subgraph and the query graph structure, including:

[0114] Extract standard fields from the due diligence template as nodes, extract business dependencies between standard fields as directed edges, and construct a field knowledge graph based on nodes and directed edges;

[0115] Each original field in the original field set is used as a query node, and the mapping relationship between the original field and the standard field of each mapping pair in the preliminary mapping relationship set is used as a query edge. A query graph structure is constructed based on the query nodes and the query edges.

[0116] Candidate subgraphs are selected from the field knowledge graph, and isomorphism is determined between the candidate subgraphs and the query graph structure. When the candidate subgraph is isomorphic to the query graph structure, the candidate subgraph is determined as the target subgraph.

[0117] Based on the correspondence between the nodes of the target subgraph and the nodes of the query graph structure, the original fields in the query graph structure are mapped to the standard fields in the target subgraph, generating a set of structured mapping relationships.

[0118] In one specific implementation, when constructing the field knowledge graph, all standard fields, such as "company name," "registered capital," "legal representative," and "shareholder information," are parsed from the data dictionary of the due diligence template. Each standard field is abstracted as a node in the graph. Simultaneously, the business dependencies between standard fields are analyzed. For example, "legal representative" depends on the existence of "company name," and "shareholder shareholding ratio" depends on the joint determination of "shareholder name" and "registered capital." These dependencies are represented as directed edges. Each directed edge points from the field that is depended upon to the field that depends on it, and the dependency type can be labeled on the edge, such as required dependency, conditional dependency, or computational dependency. By traversing all standard fields and their dependencies, a complete field knowledge graph $G_K = (V_K, E_K)$ is constructed, where $V_K$ is the standard field node set and $E_K$ is the business dependency edge set.
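A field knowledge graph of this kind can be held in a small adjacency structure, sketched below. The sketch assumes that edges point from the depended-upon field to the dependent field, as in the "company name" to "legal representative" edge used later in this description. The class name, method names, and the dependency types assigned to each example edge are illustrative assumptions.

```python
class FieldKnowledgeGraph:
    """Standard fields as nodes, typed business dependencies as directed edges."""

    def __init__(self):
        self.nodes = set()
        self.edges = {}  # (depended_on, dependent) -> dependency type

    def add_field(self, name):
        self.nodes.add(name)

    def add_dependency(self, depended_on, dependent, kind):
        """Add an edge pointing from the depended-upon field to the dependent field."""
        self.nodes.update((depended_on, dependent))
        self.edges[(depended_on, dependent)] = kind

    def prerequisites(self, field):
        """Fields that the given field depends on."""
        return sorted(src for (src, dst) in self.edges if dst == field)

graph = FieldKnowledgeGraph()
graph.add_field("shareholder information")
graph.add_dependency("company name", "legal representative", "required")
graph.add_dependency("shareholder name", "shareholder shareholding ratio", "computational")
graph.add_dependency("registered capital", "shareholder shareholding ratio", "computational")

print(graph.prerequisites("shareholder shareholding ratio"))
```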

[0119] When constructing the query graph structure, each original field in the original field set is traversed and used as a node in the query graph. Mapping pairs are extracted from the preliminary mapping relationship set. Each mapping pair contains an original field, its corresponding candidate standard field, and the mapping confidence between them. The association between the original field and the standard field in the mapping pair is transformed into a query edge, and the weight of the edge is set to the mapping confidence. For example, if the confidence of mapping the original field "Company Name" to the standard field "Enterprise Name" is 0.92, then a query edge is created in the query graph from the "Company Name" node to the "Enterprise Name" node, with a weight of 0.92. After traversing all the preliminary mapping relationships, the query graph structure $G_Q = (V_Q, E_Q)$ is obtained, where $V_Q$ is the original field node set and $E_Q$ is the edge set of the mapping relationships.

[0120] When performing subgraph isomorphism determination, candidate subgraphs are enumerated from the field knowledge graph. The number of nodes in the candidate subgraph should be the same as the number of nodes in the query graph structure, and the node types must match. A depth-first search-based VF2 algorithm is used for graph isomorphism determination, comparing the topological structure of the candidate subgraph with that of the query graph structure node by node. During the determination process, the consistency of node degree, adjacency, and edge directionality is checked. When the query graph contains two query edges, "Company Name" → "Enterprise Name" and "Legal Person" → "Legal Representative," the field knowledge graph is searched for the existence of two nodes, "Enterprise Name" and "Legal Representative," and whether their topological relationship corresponds to the query edge. If the node correspondence of the candidate subgraph completely matches the query graph, and the edge connection pattern is consistent, it is determined to be isomorphic, and the candidate subgraph is marked as the target subgraph.

[0121] When generating structured mapping relationships, the pairing relationship between each standard field node in the target subgraph and the corresponding original field node in the query graph structure is extracted. This is based on the node mapping function $f: V_Q \to V_K$ established during the isomorphism determination process. The process maps the original field "Company Name" in the query graph to the standard field "Enterprise Name" in the target subgraph, and "Legal Person" to "Legal Representative". For each mapping relationship, the original field name, standard field name, mapping confidence level, and field value are recorded to form a structured mapping entry. After traversing all node correspondences, a set of structured mapping relationships is generated. This set uses the standard fields as indexes and stores the original field and its value corresponding to each standard field, providing accurate mapping rules for subsequent data population.

[0122] Figure 2 shows the flowchart of the graph construction and isomorphism determination logic.

[0123] In an optional embodiment, isomorphism determination is performed between the candidate subgraph and the query graph structure. When the candidate subgraph is isomorphic to the query graph structure, determining the candidate subgraph as the target subgraph includes:

[0124] Get the candidate node set and candidate edge set of the candidate subgraph, get the query node set and query edge set of the query graph structure, and determine whether the number of nodes in the candidate node set and the query node set are equal, and whether the number of edges in the candidate edge set and the query edge set are equal.

[0125] When the number of nodes and the number of edges are equal, establish a bijective mapping relationship between the candidate node set and the query node set;

[0126] Extract the starting and ending candidate nodes of each candidate edge in the candidate edge set, and extract the starting and ending query nodes of each query edge in the query edge set.

[0127] Determine whether the candidate nodes for the starting point and the candidate nodes for the ending point correspond to the query nodes for the starting point and the query nodes for the ending point, respectively, under the bijective mapping relationship;

[0128] When all candidate edges correspond to the query edge under the bijective mapping relationship, the candidate subgraph is determined to be structurally isomorphic to the query graph, and the candidate subgraph is determined as the target subgraph.

[0129] In one specific implementation, during isomorphism determination, the first step is to obtain all node and edge information of the candidate subgraph from the field knowledge graph. Specifically, the candidate node set records all standard field nodes in the candidate subgraph, such as "company name," "registered capital," and "legal representative," with each node carrying field semantic encoding information. The candidate edge set records the business dependencies between nodes; for example, an edge pointing from "company name" to "legal representative" indicates an association constraint between the two. Simultaneously, the query node set and query edge set are extracted from the query graph structure. Query nodes correspond to fields in the original field set that have already been preliminarily mapped, and query edges reflect the logical relationships between these original fields.

[0130] After obtaining the complete graph structure data, perform basic topology verification. Count the number of elements $n_c$ in the candidate node set and the number of elements $n_q$ in the query node set, and at the same time count the number of edges $e_c$ in the candidate edge set and the number of edges $e_q$ in the query edge set. When $n_c = n_q$ and $e_c = e_q$, the two graphs have the same size, which is a necessary condition for isomorphism; otherwise, they are directly determined to be non-isomorphic and the detection process ends.

[0131] After topological verification, a bijective mapping relationship is established between the candidate node set and the query node set. A recursive backtracking algorithm is used to attempt to map each query node to a candidate node one by one, requiring that the mapping relationship is one-to-one and without repetition. During the mapping process, the semantic similarity of nodes is used as a constraint condition. The cosine distance between the field semantic vector of the query node and the standard field semantic vector of the candidate node is calculated, and the candidate node with the smallest semantic distance is selected for pairing to ensure that the mapping relationship conforms to the business logic.

[0132] After establishing the bijective mapping, edge preservation verification is performed. Each candidate edge in the candidate edge set is traversed, and its starting and ending candidate nodes are extracted. For example, if a candidate edge connects the "Company Name" node and the "Registered Capital" node, then the starting candidate node is "Company Name," and the ending candidate node is "Registered Capital." Correspondingly, query edges with the same semantic relationship are searched in the query edge set, and their starting and ending query nodes are extracted.

[0133] Under the bijective mapping $f$, check whether the starting query node is mapped to the starting candidate node and, at the same time, whether the ending query node is mapped to the ending candidate node. If the candidate edge $(u_c, v_c)$ and the query edge $(u_q, v_q)$ satisfy $f(u_q) = u_c$ and $f(v_q) = v_c$, the candidate edge passes verification. This verification is performed on all edges in the candidate edge set. When every candidate edge has a corresponding query edge in the query edge set and the mapping relationship is consistent, it is confirmed that the candidate subgraph and the query graph structure satisfy the edge isomorphism condition.

[0134] Based on the four conditions of equal number of nodes, equal number of edges, existence of bijective mapping, and passing edge preservation verification, the candidate subgraph is determined to be structurally isomorphic to the query graph. The candidate subgraph that passes the isomorphism test is marked as the target subgraph, and the correspondence between its nodes and the nodes in the query graph constitutes a structured mapping relationship, used for subsequent automatic field value filling and completion operations.
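The four conditions can be sketched as a single check, assuming directed edges stored as `(start, end)` tuples and node identifiers given as strings. The function name `find_isomorphism`, the optional `distance` callback, and the example field names are hypothetical; this is a simplified illustration of a recursive backtracking search, not the application's own implementation.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]


def find_isomorphism(
    cand_nodes: List[str],
    cand_edges: Set[Edge],
    query_nodes: List[str],
    query_edges: Set[Edge],
    distance: Optional[Callable[[str, str], float]] = None,
) -> Optional[Dict[str, str]]:
    """Return a bijection f (query node -> candidate node), or None."""
    # Conditions 1 and 2: equal node counts and equal edge counts.
    if len(cand_nodes) != len(query_nodes) or len(cand_edges) != len(query_edges):
        return None

    mapping: Dict[str, str] = {}
    used: Set[str] = set()

    def consistent(q: str, c: str) -> bool:
        # Condition 4, checked incrementally: every query edge whose two
        # endpoints are mapped must have its image in the candidate edge set.
        trial = dict(mapping)
        trial[q] = c
        for uq, vq in query_edges:
            if q in (uq, vq) and uq in trial and vq in trial:
                if (trial[uq], trial[vq]) not in cand_edges:
                    return False
        return True

    def backtrack(i: int) -> bool:
        if i == len(query_nodes):
            return True
        q = query_nodes[i]
        options = [c for c in cand_nodes if c not in used]
        if distance is not None:
            # Optional semantic ordering: closest candidate is tried first.
            options.sort(key=lambda c: distance(q, c))
        for c in options:
            if consistent(q, c):
                mapping[q] = c
                used.add(c)
                if backtrack(i + 1):
                    return True
                del mapping[q]
                used.discard(c)
        return False

    # Condition 3: a one-to-one mapping exists. Because the edge counts are
    # equal and f is one-to-one, mapping every query edge onto a candidate
    # edge also covers every candidate edge.
    return dict(mapping) if backtrack(0) else None
```

Returning the mapping itself, rather than a boolean, means the node correspondence is directly available as the structured mapping relationship once the test succeeds.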

[0135] In one optional embodiment, marking missing field nodes in the field knowledge graph, extracting the field values of mapped neighbor nodes and encoding them as initial node features, generating multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decoding them to generate fill-in field values includes:

[0136] Traverse the standard field nodes in the field knowledge graph and mark the standard field nodes that do not have a corresponding original field as missing field nodes;

[0137] Obtain the neighbor nodes of the missing field node along the business dependency edges in the field knowledge graph, select the neighbor nodes that have a corresponding mapped original field as the mapped neighbor nodes, extract the field value of the mapped original field corresponding to each mapped neighbor node, and encode it as the initial node feature;

[0138] The initial node features are input into the first layer of the graph neural network, where they are transformed and propagated along the adjacent business dependency edges to the missing field nodes, generating the first-layer aggregated features;

[0139] The first layer of aggregated features is input into the second layer of the graph neural network. The first layer of aggregated features is then propagated along the business dependency edges, and the features of the two-hop neighbor nodes are aggregated to the missing field nodes to generate the second layer of aggregated features.

[0140] The first-layer aggregated features and the second-layer aggregated features are concatenated to form a multi-layer aggregated feature vector. The multi-layer aggregated feature vector is then input into the decoding network to be converted into a field value format, generating the fill-in field value for missing field nodes.

[0141] After obtaining the structured mapping relationship set, missing fields are identified and filled in using the field knowledge graph. All standard field nodes in the field knowledge graph are traversed, and each standard field node is checked to see whether it has a mapping relationship from the original field set. If a standard field node does not have a corresponding original field in the structured mapping relationship set, the node is marked as a missing field node. The marking process records the node identifier and a missing-status attribute so that target nodes can be located during subsequent feature propagation.
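The marking step can be sketched as follows, assuming the structured mapping relationship set is represented as a dictionary from original field name to standard field name. The function name `mark_missing_nodes`, the record keys `node_id` and `missing`, and the example field names are hypothetical.

```python
from typing import Dict, List


def mark_missing_nodes(
    standard_fields: List[str],
    structured_mapping: Dict[str, str],
) -> Dict[str, dict]:
    """Record a node identifier and missing-status attribute per standard field.

    structured_mapping maps original field name -> standard field name.
    """
    mapped = set(structured_mapping.values())
    return {
        field: {"node_id": node_id, "missing": field not in mapped}
        for node_id, field in enumerate(standard_fields)
    }


fields = ["company name", "registered capital", "paid-in capital"]
mapping = {"firm": "company name", "capital": "registered capital"}
marks = mark_missing_nodes(fields, mapping)
missing_nodes = [name for name, info in marks.items() if info["missing"]]
```

Here `missing_nodes` contains only the standard fields that no original field was mapped to, which are the targets of the later feature propagation.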

[0142] In one specific implementation, for each missing field node, a neighbor node search is performed along the pre-constructed business dependency edges in the field knowledge graph. Business dependency edges reflect the business logic relationships between standard fields; for example, "registered capital" and "paid-in capital" have a dependency relationship. After obtaining the set of one-hop neighbor nodes for the missing field node, nodes with corresponding mapped original fields are selected and identified as mapped neighbor nodes. The field values of the original fields corresponding to the mapped neighbor nodes are extracted from the structured mapping relationship set. These field values are then converted into fixed-dimensional semantic vectors using a pre-trained text encoder, serving as the initial node features of each mapped neighbor node. The initial node feature dimension is set to 256 to ensure sufficient feature representation.
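A minimal sketch of the neighbor search and feature initialization follows. It assumes that business dependency edges can be followed in either direction when searching for neighbors, and it replaces the pre-trained text encoder with a hash-based placeholder (`toy_encode`) so that the example runs without any model. All names and example values are hypothetical; only the 256-dimensional feature size comes from the text.

```python
import hashlib
from typing import Dict, List, Tuple

FEATURE_DIM = 256  # initial node feature dimension stated in the text


def toy_encode(text: str, dim: int = FEATURE_DIM) -> List[float]:
    """Placeholder for the pre-trained text encoder (NOT a semantic model).

    Produces a deterministic vector of hash bytes scaled to [0, 1).
    """
    out: List[float] = []
    counter = 0
    while len(out) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        out.extend(b / 256.0 for b in digest)
        counter += 1
    return out[:dim]


def mapped_neighbor_features(
    missing_node: str,
    dependency_edges: List[Tuple[str, str]],
    field_values: Dict[str, str],
) -> Dict[str, List[float]]:
    """Encode the values of one-hop neighbors that have a mapped original field.

    field_values maps standard field name -> value taken from its mapped
    original field; unmapped standard fields are simply absent.
    """
    neighbors = {v for u, v in dependency_edges if u == missing_node}
    neighbors |= {u for u, v in dependency_edges if v == missing_node}
    return {
        name: toy_encode(field_values[name])
        for name in sorted(neighbors)
        if name in field_values
    }


edges = [
    ("registered capital", "paid-in capital"),
    ("company name", "registered capital"),
    ("company name", "legal representative"),
]
values = {"registered capital": "10,000,000 CNY", "company name": "Example Co., Ltd."}
features = mapped_neighbor_features("paid-in capital", edges, values)
```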

[0143] The initial node features are input into the first layer of the graph neural network. This layer uses graph convolution to apply a linear transformation to the initial features of each node, with a transformation matrix of dimension 256 × 128. The transformed features are propagated to neighboring nodes along the business dependency edges. For each missing field node, the feature vectors propagated from all of its mapped neighbor nodes are aggregated by summation. After the ReLU activation function is applied, the first-layer aggregated features are generated. These features have a dimension of 128 and contain the one-hop neighborhood information of the missing field node.

[0144] The aggregated features from the first layer are used as input to the second layer network. The second layer network structure is the same as the first layer, using a 128 × 64-dimensional transformation matrix. Features continue to propagate along the business dependency edges, at which point the aggregation range expands to two-hop neighbor nodes. Two-hop neighbor nodes are nodes that are two edges away from the missing field node; the features of these nodes reach the missing field node after two propagations. During the second-layer aggregation, the features propagated from the two-hop neighbors are summed and aggregated with the first-layer features of the missing field node itself. After processing by an activation function, the second-layer aggregated features are generated, with a dimension of 64.
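The two propagation layers can be sketched in pure Python as below. The sketch assumes an undirected adjacency list, random weights standing in for trained parameters, ReLU as the second-layer activation (the text only says "an activation function"), and no bias terms. The names `gnn_layer` and `two_layer_features` and the three-node example graph are hypothetical; only the 256 × 128 and 128 × 64 matrix shapes and the summation aggregation come from the text.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def make_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random stand-in for trained weights (assumption: training is not shown)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x: Vector, w: Matrix) -> Vector:
    """Multiply a row vector of length len(w) by a len(w) x len(w[0]) matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]


def gnn_layer(
    feats: Dict[str, Vector],
    adj: Dict[str, List[str]],
    w: Matrix,
    include_self: bool,
) -> Dict[str, Vector]:
    """Transform features, sum the messages arriving at each node, apply ReLU."""
    transformed = {node: linear(x, w) for node, x in feats.items()}
    out: Dict[str, Vector] = {}
    for node, neighbors in adj.items():
        msgs = [transformed[nb] for nb in neighbors if nb in transformed]
        if include_self and node in transformed:
            msgs.append(transformed[node])
        if msgs:
            out[node] = [max(0.0, sum(col)) for col in zip(*msgs)]
    return out


def two_layer_features(
    init_feats: Dict[str, Vector],
    adj: Dict[str, List[str]],
    seed: int = 0,
) -> Tuple[Dict[str, Vector], Dict[str, Vector]]:
    rng = random.Random(seed)
    w1 = make_matrix(256, 128, rng)  # first-layer transformation matrix
    w2 = make_matrix(128, 64, rng)   # second-layer transformation matrix
    h1 = gnn_layer(init_feats, adj, w1, include_self=False)
    h2 = gnn_layer(h1, adj, w2, include_self=True)
    return h1, h2


# "M" is the missing field node; "B" is its one-hop and "A" its two-hop neighbor.
adj = {"A": ["B"], "B": ["A", "M"], "M": ["B"]}
h1, h2 = two_layer_features({"A": [1.0] * 256, "B": [0.5] * 256}, adj)
```

In this example the first-layer feature of `M` depends only on `B`, while its second-layer feature also depends on `A`, because the first-layer feature of `B` already carries information from `A`.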

[0145] The first-layer and second-layer aggregated features are concatenated along the feature dimension to form a 128 + 64 = 192-dimensional multi-layer aggregated feature vector. This vector captures both the multi-hop neighborhood information of the missing field node and semantic abstractions at different levels. The vector is then input into the decoding network, which consists of two fully connected layers: the first layer has a dimension of 192 × 128, and the second layer has a dimension of 128 × V, where V is the size of the field value vocabulary. The decoding network outputs a probability distribution over field values after Softmax normalization, and the word sequence with the highest probability is selected as the fill-in field value for the missing field node. The format of the fill-in field value is consistent with the data type required by the standard field: if the standard field requires numeric data, the value is converted to a numeric format; if it requires text data, the original string format is retained.
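The decoding step can be sketched as follows, under several assumptions that the text does not state: the vocabulary is treated as a flat list of candidate field values (so a single entry is picked rather than a word sequence), a ReLU is placed between the two fully connected layers, bias terms are omitted, and the weights are random stand-ins for trained parameters. The names `decode_field_value` and `coerce` and the example vocabulary are hypothetical.

```python
import math
import random
from typing import List, Tuple, Union

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, w: Matrix) -> Vector:
    """Fully connected layer without bias: row vector times matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_field_value(
    h1: Vector, h2: Vector, w_hidden: Matrix, w_out: Matrix, vocab: List[str]
) -> Tuple[str, Vector]:
    """Concatenate (128 + 64 = 192), apply 192x128 and 128xV layers, then Softmax."""
    z = h1 + h2  # list concatenation, giving a 192-dimensional vector
    hidden = [max(0.0, v) for v in dense(z, w_hidden)]  # ReLU is an assumption
    probs = softmax(dense(hidden, w_out))
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


def coerce(value: str, required_type: str) -> Union[float, str]:
    """Convert to a number for numeric standard fields; keep text as is."""
    if required_type == "numeric":
        return float(value.replace(",", ""))
    return value


rng = random.Random(0)
vocab = ["5,000,000", "10,000,000", "unknown"]
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(128)] for _ in range(192)]
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(len(vocab))] for _ in range(128)]
value, probs = decode_field_value([0.1] * 128, [0.2] * 64, w_hidden, w_out, vocab)
```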

[0146] The intelligent due diligence system for automatic mapping of evidence fields in this invention includes:

[0147] The field extraction unit is used to obtain the target evidence document to be investigated, extract the original fields and field values to obtain the original field set, and perform semantic encoding on the original field set to obtain the original field semantic vector set;

[0148] The field mapping unit is used to fuse the semantics of field names with the content of field values to generate cross-modal field vectors, calculate the mapping distance between the cross-modal field vectors and the standard field semantic vectors in the preset due diligence template, and determine the preliminary mapping relationship set;

[0149] The relationship building unit is used to build a field knowledge graph based on the standard fields and business dependencies in the due diligence template, build a query graph structure based on the original field set and the preliminary mapping relationship set, determine the target subgraph by performing isomorphism determination between a candidate subgraph and the query graph structure, and generate a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure;

[0150] The field completion unit is used to mark missing field nodes in the field knowledge graph, extract the field values of mapped neighbor nodes and encode them as initial node features, generate multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decode them to generate fill-in field values;

[0151] The data generation unit is used to fill the field values in the original field set into the corresponding standard field positions according to the structured mapping relationship set, and to fill the fill-in field values into the missing standard field positions, thereby generating structured due diligence data.
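The behavior of the data generation unit can be sketched as below, assuming the structured mapping relationship set is a dictionary from original field name to standard field name. The function name `build_structured_record`, the `source` provenance label, and the example values are hypothetical additions for illustration; the text itself only requires that mapped values and fill-in values end up in the standard field positions.

```python
from typing import Dict, List, Optional


def build_structured_record(
    standard_fields: List[str],
    structured_mapping: Dict[str, str],
    original_values: Dict[str, str],
    fill_in_values: Dict[str, str],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Fill each standard field from its mapped original field or a fill-in value."""
    # Invert original -> standard into standard -> original for direct lookup.
    by_standard = {std: orig for orig, std in structured_mapping.items()}
    record: Dict[str, Dict[str, Optional[str]]] = {}
    for field in standard_fields:
        if field in by_standard:
            record[field] = {"value": original_values[by_standard[field]], "source": "mapped"}
        elif field in fill_in_values:
            record[field] = {"value": fill_in_values[field], "source": "filled"}
        else:
            # Assumption: fields with neither source are kept, flagged for review.
            record[field] = {"value": None, "source": "unresolved"}
    return record


record = build_structured_record(
    standard_fields=["company name", "registered capital", "paid-in capital", "legal representative"],
    structured_mapping={"firm": "company name", "capital": "registered capital"},
    original_values={"firm": "Example Co., Ltd.", "capital": "10,000,000"},
    fill_in_values={"paid-in capital": "10,000,000"},
)
```

Keeping a provenance label next to each value is one way to let a reviewer distinguish values read from the evidence document from values inferred by the completion step.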

[0152] A third aspect of the present invention provides an electronic device, comprising:

[0153] a processor;

[0154] a memory for storing processor-executable instructions;

[0155] wherein the processor is configured to invoke the instructions stored in the memory to execute the aforementioned method.

[0156] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.

[0157] This invention can be a method, apparatus, system, and / or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of the invention.

[0158] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An intelligent due diligence method for automatic mapping of evidence fields, characterized in that it comprises:
obtaining a target evidence document to be investigated, extracting original fields and field values to obtain an original field set, and performing semantic encoding on the original field set to obtain an original field semantic vector set;
fusing the semantics of field names with the content of field values to generate cross-modal field vectors, calculating the mapping distance between the cross-modal field vectors and the standard field semantic vectors in a preset due diligence template, and determining a preliminary mapping relationship set;
constructing a field knowledge graph based on the standard fields and business dependencies in the due diligence template, constructing a query graph structure based on the original field set and the preliminary mapping relationship set, performing isomorphism determination between a candidate subgraph and the query graph structure to determine a target subgraph, and generating a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure;
marking missing field nodes in the field knowledge graph, extracting the field values of mapped neighbor nodes and encoding them as initial node features, generating multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decoding them to generate fill-in field values;
filling, based on the structured mapping relationship set, the field values in the original field set into the corresponding standard field positions, and filling the fill-in field values into the missing standard field positions, to generate structured due diligence data.

2. The method according to claim 1, characterized in that fusing the semantics of field names with the content of field values to generate cross-modal field vectors, calculating the mapping distance between the cross-modal field vectors and the standard field semantic vectors in the preset due diligence template, and determining the preliminary mapping relationship set comprises:
encoding the field name of each original field in the original field set to generate a name semantic vector, and extracting features from the corresponding field values to generate a content feature vector;
calculating the semantic association weight between the name semantic vector and the content feature vector through an attention mechanism, and performing weighted fusion of the name semantic vector and the content feature vector to generate the cross-modal field vector;
aligning the cross-modal field vector with the standard field semantic vector in the preset due diligence template in a shared semantic space, and calculating the vector distance between the cross-modal field vector and the standard field semantic vector to obtain the mapping distance;
constructing a weighted complete bipartite graph based on the mapping distance, solving for a globally optimal mapping scheme in the weighted complete bipartite graph through augmenting path search and iterative flipping of matching states, and taking the mapping pairs in the globally optimal mapping scheme as the preliminary mapping relationship set.

3. The method according to claim 2, characterized in that encoding the field name of each original field in the original field set to generate a name semantic vector, and extracting features from the corresponding field values to generate a content feature vector, comprises:
segmenting the field name of each original field in the original field set to obtain the word sequence in the field name;
converting each word in the word sequence into a word vector, and encoding the word vector sequence with a multi-layer neural network, wherein each layer of the multi-layer neural network performs a non-linear transformation on its input vector and outputs a hidden state vector;
extracting the hidden state vector of the final layer of the multi-layer neural network as the name semantic vector;
obtaining the set of field values corresponding to the original field, traversing each field value in the set, and determining whether the field value conforms to a numeric data format, a text data format, or a date data format, thereby determining the data type of the field value;
performing statistical analysis on the set of field values to extract its numerical distribution feature, string length distribution feature, and format consistency feature;
encoding the numerical distribution feature, the string length distribution feature, and the format consistency feature into a feature vector as the content feature vector.

4. The method according to claim 2, characterized in that constructing a weighted complete bipartite graph based on the mapping distance, and solving for the globally optimal mapping scheme in the weighted complete bipartite graph through augmenting path search and iterative flipping of matching states, comprises:
taking each original field in the original field set as a node of the left node set of a bipartite graph structure, and each standard field in the preset due diligence template as a node of the right node set;
for each left node and each right node, converting the mapping distance between the corresponding original field and standard field into an edge weight, the edge weight being negatively correlated with the mapping distance;
constructing the weighted complete bipartite graph in the bipartite graph structure based on the edge weights, and initializing the matching state of all nodes in the weighted complete bipartite graph to unmatched;
iteratively traversing the left node set and, for each unmatched left node, constructing an augmenting path search tree rooted at that left node, and using depth-first search to find an alternating path in the weighted complete bipartite graph that yields a positive increase in the global sum of edge weights, the alternating path consisting of alternating unmatched edges and matched edges;
when such an alternating path is found, flipping the states of the matched edges and unmatched edges along the alternating path and updating the set of matched edges so that the global sum of edge weights increases, and repeating the iteration until no left node can find an alternating path that meets the condition;
taking the final set of matched edges as the globally optimal mapping scheme.

5. The method according to claim 1, characterized in that constructing the field knowledge graph based on the standard fields and business dependencies in the due diligence template, constructing the query graph structure based on the original field set and the preliminary mapping relationship set, performing isomorphism determination between a candidate subgraph and the query graph structure to determine the target subgraph, and generating the structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure comprises:
extracting the standard fields from the due diligence template as nodes, extracting the business dependencies between standard fields as directed edges, and constructing the field knowledge graph from the nodes and directed edges;
taking each original field in the original field set as a query node, taking the mapping relationship between the original field and the standard field of each mapping pair in the preliminary mapping relationship set as a query edge, and constructing the query graph structure from the query nodes and query edges;
selecting candidate subgraphs from the field knowledge graph, performing isomorphism determination between each candidate subgraph and the query graph structure, and, when a candidate subgraph is isomorphic to the query graph structure, determining that candidate subgraph as the target subgraph;
mapping, based on the correspondence between the nodes of the target subgraph and the nodes of the query graph structure, the original fields in the query graph structure to the standard fields in the target subgraph, to generate the structured mapping relationship set.

6. The method according to claim 5, characterized in that performing isomorphism determination between the candidate subgraph and the query graph structure, and determining the candidate subgraph as the target subgraph when the candidate subgraph is isomorphic to the query graph structure, comprises:
obtaining the candidate node set and candidate edge set of the candidate subgraph, obtaining the query node set and query edge set of the query graph structure, and determining whether the numbers of nodes in the candidate node set and the query node set are equal and whether the numbers of edges in the candidate edge set and the query edge set are equal;
when the numbers of nodes and the numbers of edges are equal, establishing a bijective mapping relationship between the candidate node set and the query node set;
extracting the starting candidate node and ending candidate node of each candidate edge in the candidate edge set, and extracting the starting query node and ending query node of each query edge in the query edge set;
determining whether, under the bijective mapping relationship, the starting candidate node and the ending candidate node correspond to the starting query node and the ending query node, respectively;
when all candidate edges correspond to query edges under the bijective mapping relationship, determining that the candidate subgraph is structurally isomorphic to the query graph structure, and determining the candidate subgraph as the target subgraph.

7. The method according to claim 1, characterized in that marking missing field nodes in the field knowledge graph, extracting the field values of mapped neighbor nodes and encoding them as initial node features, generating multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decoding them to generate fill-in field values comprises:
traversing the standard field nodes in the field knowledge graph and marking the standard field nodes that do not have a corresponding original field as missing field nodes;
obtaining the neighbor nodes of the missing field node along the business dependency edges in the field knowledge graph, selecting the neighbor nodes that have a corresponding mapped original field as the mapped neighbor nodes, extracting the field value of the mapped original field corresponding to each mapped neighbor node, and encoding it as the initial node feature;
inputting the initial node features into the first layer of the graph neural network, where they are transformed and propagated along the adjacent business dependency edges to the missing field nodes, generating first-layer aggregated features;
inputting the first-layer aggregated features into the second layer of the graph neural network, propagating them further along the business dependency edges, and aggregating the features of the two-hop neighbor nodes to the missing field nodes to generate second-layer aggregated features;
concatenating the first-layer aggregated features and the second-layer aggregated features to form a multi-layer aggregated feature vector, and inputting the multi-layer aggregated feature vector into a decoding network to convert it into a field value format, generating the fill-in field value for the missing field node.

8. An intelligent due diligence system for automatic mapping of evidence fields, used to implement the method of any one of claims 1 to 7, characterized in that it comprises:
a field extraction unit, used to obtain the target evidence document to be investigated, extract the original fields and field values to obtain the original field set, and perform semantic encoding on the original field set to obtain the original field semantic vector set;
a field mapping unit, used to fuse the semantics of field names with the content of field values to generate cross-modal field vectors, calculate the mapping distance between the cross-modal field vectors and the standard field semantic vectors in the preset due diligence template, and determine the preliminary mapping relationship set;
a relationship building unit, used to build a field knowledge graph based on the standard fields and business dependencies in the due diligence template, build a query graph structure based on the original field set and the preliminary mapping relationship set, determine the target subgraph by performing isomorphism determination between a candidate subgraph and the query graph structure, and generate a structured mapping relationship set based on the node correspondence between the target subgraph and the query graph structure;
a field completion unit, used to mark missing field nodes in the field knowledge graph, extract the field values of mapped neighbor nodes and encode them as initial node features, generate multi-layer aggregated feature vectors through multi-layer propagation of a graph neural network, and decode them to generate fill-in field values;
a data generation unit, used to fill the field values in the original field set into the corresponding standard field positions according to the structured mapping relationship set, and to fill the fill-in field values into the missing standard field positions, thereby generating structured due diligence data.

9. An electronic device, characterized in that it comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 7.