A data asset standardization packaging method, system and storage medium

By constructing a high-dimensional topological graph and a semantic gravity field model, and by filtering and weighting external related nodes to generate residual semantic metadata, the problem of balancing data independence and analytical value in data asset encapsulation is solved, thereby improving the model's learning efficiency and prediction accuracy.

CN122113144B · Active · Publication Date: 2026-06-26 · SHAANXI YUNCHUANG NETWORK TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHAANXI YUNCHUANG NETWORK TECH CO LTD
Filing Date: 2026-04-24
Publication Date: 2026-06-26

Application Information

IPC: G06F21/60; G06F21/62; G06F18/2135

AI Tagging

Technology Topics

Topological graph; Theoretical computer science

Technical Efficacy Phrases

Improve learning efficiency; Improve forecast accuracy




AI Technical Summary

Technical Problem

Existing technologies struggle to balance data independence and analytical value during data asset encapsulation. Hard-condition filtering causes data assets to lose their global contextual characteristics, while expanding the selection scope leads to storage redundancy and privacy leaks.

Method used

By constructing a high-dimensional topological graph and a semantic gravity field model, the core asset node set and the external related node set are determined, strong gravity boundary nodes are screened out, weighted semantic features are generated and aggregated into residual semantic metadata, which are then spliced and encrypted with the raw data of the core asset node set to form a standard data asset package.

Benefits of technology

It accurately selects external nodes with high analytical value, solves the problems of privacy leakage and storage redundancy, improves the learning efficiency and prediction accuracy of machine learning models, and preserves global topological features.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a data asset standardization packaging method, system, and storage medium, and belongs to the technical field of data processing. The method comprises the following steps: constructing a high-dimensional topological graph based on a business data domain, and determining a core asset node set and an external associated node set in the high-dimensional topological graph; calculating, based on a semantic gravity field model, the semantic gravity probability of each edge node in the core asset node set with respect to each associated node in the external associated node set, and marking each associated node whose semantic gravity probability is higher than a preset threshold as a strong gravity boundary node; aggregating the weighted semantic features of all the strong gravity boundary nodes belonging to the same edge node to generate residual semantic metadata; and splicing and encrypting the residual semantic metadata with the raw data ontology of the core asset node set to generate a standard data asset package. The application improves the rationality of data asset packaging.


Description

Technical Field

[0001] This invention belongs to the field of data processing technology, specifically relating to a standardized encapsulation method, system, and storage medium for data assets.

Background Technology

[0002] Data asset encapsulation refers to the process of standardizing and packaging scattered business detail data within an enterprise to form independent data products that can be externally accessed or traded. Encapsulated data assets can provide high-quality training datasets for various artificial intelligence models. In existing data processing workflows, technicians typically use structured query statements to set filtering rules based on specific business themes, directly extracting matching record rows from the underlying relational database. Subsequently, the extracted record rows are converted into table files, and basic descriptive tags and access control policies are attached to the outside of the table files, thereby completing the physical isolation and encapsulation of the data assets.

[0003] However, existing technologies struggle to balance data independence and analytical value during encapsulation. On one hand, encapsulation methods that rely on hard-condition filtering only perform physical isolation, forcibly severing the implicit topological connections between the target data and the external business environment. The data assets lose their original network structure associations, so downstream AI models cannot acquire the global contextual features of the data, which severely weakens the in-depth analytical value of the data assets. On the other hand, an encapsulation strategy that expands the selection scope to preserve network structure associations packages all external nodes related to the target data into the asset package. This not only introduces a large amount of detailed data unrelated to the target topic, causing storage redundancy, but also easily exposes external detailed data containing sensitive fields directly to unauthorized callers, triggering privilege escalation and privacy leaks.

Summary of the Invention

[0004] To address the aforementioned problems, this invention provides a standardized data asset encapsulation method, system, and storage medium to resolve the issues present in the background art.

[0005] To achieve the aforementioned objectives, this invention proposes a standardized data asset encapsulation method, comprising:

[0006] A high-dimensional topology graph is constructed based on the entities and relationships in the specified business data domain. The core asset node set and the external related node set that have topological connections with the core asset node set are determined in the high-dimensional topology graph according to the encapsulation instructions.

[0007] Construct a semantic gravity field model, calculate the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set based on the semantic gravity field model, and mark the associated nodes whose semantic gravity probability exceeds a preset threshold as strong gravity boundary nodes.

[0008] Obtain the feature representation matrix of strong gravitational boundary nodes in a high-dimensional topological graph, and generate weighted semantic features by weighting them according to the corresponding semantic gravitational probabilities.

[0009] The weighted semantic features of all strong gravitational boundary nodes belonging to the same edge node are aggregated to generate residual semantic metadata for characterizing the external context.

[0010] The residual semantic metadata is concatenated and encrypted with the raw data ontology of the core asset node set to generate a standard data asset package.

[0011] This invention also provides a data asset standardization and encapsulation system for implementing the above-described method. The system includes:

[0012] The graph recognition module constructs a high-dimensional topology graph based on entities and relationships in the specified business data domain, and determines the core asset node set and the external related node set that has a topological connection with the core asset node set in the high-dimensional topology graph according to the encapsulation instructions.

[0013] The gravity screening module constructs a semantic gravity field model, calculates the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set based on the semantic gravity field model, and marks associated nodes whose semantic gravity probability exceeds a preset threshold as strong gravity boundary nodes.

[0014] The semantic generation module obtains the feature representation matrix of strong gravitational boundary nodes in the high-dimensional topology graph, and generates weighted semantic features based on the corresponding semantic gravity probabilities. It then aggregates the weighted semantic features of all strong gravitational boundary nodes belonging to the same edge node to generate residual semantic metadata for characterizing the external context environment.

[0015] The asset encapsulation module concatenates and encrypts the residual semantic metadata with the raw data ontology of the core asset node set to generate a standard data asset package.

[0016] This application also provides a computer-readable storage medium storing instructions that, when executed by a processor, implement the method described above.

[0017] The beneficial effects of this invention are as follows:

[0018] This invention first constructs a high-dimensional topology graph from the business data domain to determine the core asset node set and the external associated node set. It then constructs a semantic gravity field model to calculate semantic gravity probabilities and marks associated nodes whose probability exceeds a preset threshold as strong gravity boundary nodes, thereby accurately selecting the external nodes with high analytical value. Next, it obtains and weights the feature representation matrices of the strong gravity boundary nodes, and aggregates the weighted semantic features to generate residual semantic metadata representing the external context environment. This avoids the privacy leakage and unauthorized access caused by directly introducing the detailed data of external nodes: sensitive values are filtered out while global topological features are retained. As a result, machine learning models trained on data encapsulated by this invention can achieve higher learning efficiency and prediction accuracy. The invention thus addresses the problems in existing technologies where data asset encapsulation struggles to balance data independence and analytical value, where hard filtering prevents downstream models from obtaining global context features, and where expanding the selection range easily leads to storage redundancy and privacy leakage.

Attached Figure Description

[0019] Figure 1 is a flowchart illustrating the steps of a standardized data asset encapsulation method according to the present invention.

[0020] Figure 2 is a schematic diagram illustrating the principle of the high-dimensional topology graph of the present invention.

[0021] Figure 3 is a schematic diagram illustrating the principle of the wandering particles moving towards a node in this invention.

[0022] Figure 4 is a schematic diagram of the structure of a data asset standardization and encapsulation system according to the present invention.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0024] It is understood that the terms "first," "second," etc., used herein may be used to describe various elements, but unless otherwise stated, these elements are not limited by these terms. These terms are used only to distinguish one element from another. For example, without departing from the scope of this application, a first script may be referred to as a second script, and similarly, a second script may be referred to as a first script.

[0025] As shown in Figure 1, a standardized encapsulation method for data assets includes:

[0026] S1: Construct a high-dimensional topology graph based on entities and relationships in the specified business data domain, and determine the core asset node set and the external related node set that have topological connections with the core asset node set in the high-dimensional topology graph according to the encapsulation instructions.

[0027] A business data domain refers to a heterogeneous data collection for a specific business, such as an e-commerce retail business data domain or a promotional activity business data domain. After obtaining the business data domain, the table structure is first parsed, converting table records into entity nodes and primary / foreign key relationships into related edges, thus generating a graph structure. Then, the user's encapsulation instructions are received, and target nodes are filtered out as the core asset node set by matching node attributes. The neighbor nodes of all nodes within the core asset node set are traversed, and nodes that do not belong to the core asset node set but have connecting edges are found and added to the external related node set.

[0028] For example, suppose the business data domain includes a user table, an order table, and a product table. The system maps user Zhang San, order 1001, and product mobile phone as entity nodes, and maps user purchase behavior as related edges, constructing a high-dimensional topology graph containing 10,000 nodes. When the system receives an encapsulation instruction to extract the consumption behavior assets of high-end users of a certain platform in the third quarter, it filters out a core asset node set consisting of 500 user nodes and 2,000 order nodes based on two conditions: the time attribute is the third quarter and the user tag is high-end. Then, it searches outward along the related edges of the 2,000 order nodes, finding 800 product nodes and 300 supplier nodes that are connected to the order nodes but do not belong to the core asset node set. The system then includes the 800 product nodes and 300 supplier nodes in the external related node set.
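The selection in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the adjacency dictionary, the attribute names (`type`, `tier`, `quarter`, `user_tier`), the node identifiers, and the `is_core` predicate are all hypothetical stand-ins for the encapsulation instruction.

```python
from typing import Callable, Dict, Set, Tuple


def split_core_and_external(
    adjacency: Dict[str, Set[str]],
    attributes: Dict[str, dict],
    matches: Callable[[dict], bool],
) -> Tuple[Set[str], Set[str]]:
    """Return (core asset node set, external related node set)."""
    # Core set: nodes whose attributes satisfy the encapsulation instruction.
    core = {node for node, attrs in attributes.items() if matches(attrs)}
    # External set: neighbours of core nodes that are not themselves core.
    external: Set[str] = set()
    for node in core:
        for neighbour in adjacency.get(node, set()):
            if neighbour not in core:
                external.add(neighbour)
    return core, external


# Hypothetical toy graph (undirected, stored as symmetric adjacency sets).
attributes = {
    "user:zhang_san": {"type": "user", "tier": "high-end"},
    "order:1001": {"type": "order", "quarter": "Q3", "user_tier": "high-end"},
    "order:1002": {"type": "order", "quarter": "Q1", "user_tier": "high-end"},
    "product:phone": {"type": "product"},
    "supplier:A": {"type": "supplier"},
}
adjacency = {
    "user:zhang_san": {"order:1001", "order:1002"},
    "order:1001": {"user:zhang_san", "product:phone"},
    "order:1002": {"user:zhang_san", "product:phone"},
    "product:phone": {"order:1001", "order:1002", "supplier:A"},
    "supplier:A": {"product:phone"},
}


def is_core(attrs: dict) -> bool:
    """Hypothetical instruction: high-end users and their third-quarter orders."""
    if attrs["type"] == "user":
        return attrs.get("tier") == "high-end"
    if attrs["type"] == "order":
        return attrs.get("quarter") == "Q3" and attrs.get("user_tier") == "high-end"
    return False


core, external = split_core_and_external(adjacency, attributes, is_core)
```

Only direct neighbours of core nodes enter the external set here, so `supplier:A`, which is two hops away from any core node in this toy graph, is left out.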

[0029] S2: Construct a semantic gravity field model, calculate the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set based on the semantic gravity field model, and mark the associated nodes whose semantic gravity probability exceeds a preset threshold as strong gravity boundary nodes.

[0030] Directly severing external connections loses crucial contextual information, while retaining all external connections leads to data redundancy. To address this trade-off, this embodiment first establishes a semantic gravity field model. The information entropy of associated nodes and the shortest path length from associated nodes to edge nodes are extracted as feature inputs to the semantic gravity field model. The model outputs initial gravity values, which are then converted into semantic gravity probabilities between 0 and 1 using a normalized exponential function (e.g., the Softmax function). The calculated semantic gravity probabilities are compared to a preset probability threshold. If the semantic gravity probability is greater than the threshold, the corresponding associated node is marked as a strong gravity boundary node.

[0031] For example, the semantic gravity field model yields a semantic gravity probability of 85% for mobile phone nodes. A preset probability threshold of 60% is set. Since 85% is greater than 60%, mobile phone nodes are marked as strong gravity boundary nodes. This step quantifies the semantic impact of external data on core assets, thereby selecting the external nodes with the highest value for asset analysis and avoiding interference from useless data.

[0032] S3: Obtain the feature representation matrix of each strong gravitational boundary node in the high-dimensional topological graph, and generate weighted semantic features by weighting the matrix with the corresponding semantic gravitational probability.

[0033] S4: Aggregate the weighted semantic features of all strong gravitational boundary nodes belonging to the same edge node to generate residual semantic metadata for characterizing the external context.

[0034] To address the technical issues of privacy leaks and unauthorized access caused by directly importing raw detailed data from external nodes, this embodiment first obtains the feature vectors of strong gravitational boundary nodes in a high-dimensional topology graph, and then uses semantic gravitational probability to weight and correct the feature vectors to obtain weighted semantic features.

[0035] Subsequently, the weighted semantic features are aggregated or dimensionality reduced to remove the original detailed dimensions containing specific privacy values, retaining only the dimensions reflecting the business distribution pattern, thus obtaining residual semantic metadata. The residual semantic metadata is a comprehensive vector that condenses the global context topology features of the external network.
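Steps S3 and S4 can be sketched as a probability-weighted sum over the feature vectors of the strong gravitational boundary nodes attached to one edge node. The text leaves the aggregation operator open ("aggregated or dimensionality reduced"), so the element-wise sum used here is an assumption, and the node names, feature values, and probabilities are hypothetical.

```python
from typing import Dict, List


def residual_semantic_metadata(
    features: Dict[str, List[float]],
    probabilities: Dict[str, float],
) -> List[float]:
    """Probability-weighted sum of boundary-node feature vectors for one edge node."""
    if not features:
        return []
    dim = len(next(iter(features.values())))
    aggregated = [0.0] * dim
    for node, vector in features.items():
        weight = probabilities[node]  # semantic gravity probability of this node
        for i, value in enumerate(vector):
            aggregated[i] += weight * value  # S3 weighting, accumulated for S4
    return aggregated


# Hypothetical feature vectors and probabilities for two boundary nodes.
features = {"product:800": [1.0, 2.0, 0.0], "supplier:A": [0.0, 4.0, 2.0]}
probabilities = {"product:800": 0.5, "supplier:A": 0.25}
metadata = residual_semantic_metadata(features, probabilities)
```

The output vector has the same dimensionality as a single node vector and no longer exposes any individual external record, which is the property the residual semantic metadata relies on.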

[0036] S5: The residual semantic metadata is concatenated and encrypted with the raw data ontology of the core asset node set to generate a standard data asset package.

[0037] The raw data of the core asset node set is converted into a relational table or comma-separated value format as the raw data ontology. The residual semantic metadata is then concatenated with the raw data ontology and encrypted using an advanced encryption standard algorithm to obtain the standard data asset package.

[0038] Construct a high-dimensional topology graph based on entities and relationships within a specified business data domain, including:

[0039] Structured and unstructured data in the business data domain are mapped to triples. An initial topology graph is constructed based on the triples. A graph convolutional network is then used to learn from the initial topology graph to generate a high-dimensional topology graph.

[0040] First, the system connects to a relational database and a document storage system, reading text paragraphs and table fields related to the business data domain. Natural language processing (NLP) techniques are used to extract entity nouns and action verbs from the text paragraphs. The primary and foreign keys in the table fields, along with the extracted entity nouns and action verbs, are converted into standardized triple expressions according to the subject-predicate-object grammatical structure. Simultaneously, the original timestamp field indicating when the business data occurred is extracted. For example, the system connects to a structured relational database containing basic user information and an unstructured document storage system containing user review records. The system reads the user table from the relational database, extracts the primary key "User Identifier 101" and the foreign key "Product Number 505," and converts them into a Resource Description Framework (RDF) triple containing the subject "User Identifier 101," the predicate "Purchase," and the object "Product Number 505." The system also reads a text paragraph from the document storage system in which User Identifier 101 reviews Product Number 505 as "very good." Using NLP techniques, the system extracts the entity nouns "User Identifier 101" and "Product Number 505" and the action verb "Review," and converts them into an RDF triple containing the subject "User Identifier 101," the predicate "Review," and the object "Product Number 505." The specific extraction and construction algorithm described above is existing technology and will not be described here.

[0041] Next, the subject and object in the triple are mapped to graph nodes, and the predicate is mapped to the connecting edge between nodes. If the same pair of nodes interacts multiple times within the historical time window, only one unique connecting edge is retained on the topology graph. The original timestamps of the multiple interactions are statistically analyzed according to the preset time slice granularity, and the temporal frequency sequence attribute of the connecting edge is generated and attached. For example, the system extracts the subject user identifier 101, the predicate purchase, and the object product number 505, and extracts the purchase time as October 1st and October 2nd, respectively. The system establishes a static connecting edge between user identifier 101 and product number 505, and statistically analyzes these two purchase behaviors as the temporal frequency sequence attribute on the connecting edge (e.g., [October 1st: 1 time, October 2nd: 1 time]), which is used for subsequent temporal fluctuation entropy calculation.
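The edge deduplication and temporal frequency attribute described above can be sketched as follows. This is a simplified illustration: the tuple layout, the day-level time slice, and the date strings are assumptions, and the predicate is dropped from the edge key because the text keeps a single edge per node pair.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# (subject, predicate, object, time slice); the layout is hypothetical.
Triple = Tuple[str, str, str, str]


def build_initial_graph(triples: Iterable[Triple]) -> Dict[Tuple[str, str], Counter]:
    """Keep one edge per (subject, object) pair and count interactions per time slice."""
    edges: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for subject, _predicate, obj, time_slice in triples:
        # Repeated interactions do not add edges; they extend the frequency sequence.
        edges[(subject, obj)][time_slice] += 1
    return dict(edges)


triples = [
    ("user:101", "purchase", "product:505", "10-01"),
    ("user:101", "purchase", "product:505", "10-02"),
]
graph = build_initial_graph(triples)
```

The two purchases collapse into one static edge whose `Counter` plays the role of the temporal frequency sequence attribute.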

[0042] Initial feature encoding (e.g., one-hot encoding) is performed on the graph nodes to generate initial vectors containing data attribute features. Next, a multi-layer graph convolutional network is run. During message passing within the graph convolutional network, the temporal frequency sequence of each connecting edge is summed into a scalar and used as the static aggregation weight for that edge. This guides the fusion of local attributes and global network connectivity, enabling each graph node to aggregate the feature vectors of its neighboring graph nodes. As shown in Figure 2, the model ultimately outputs a high-dimensional topology graph that integrates local attributes and global network connectivity.

[0043] For example, for a Resource Description Framework (RDF) triple containing the subject user identifier 101, the predicate purchase, and the object product number 505, user identifier 101 and product number 505 are mapped to graph nodes, and purchase is mapped to a connection edge. The system performs initial feature encoding on graph node user identifier 101, combining the user's age and gender attributes to generate a 16-dimensional initial vector. Subsequently, the system launches a graph convolutional network with three hidden layers. During the information transmission process in the first layer of the graph convolutional network, graph node user identifier 101 reads and aggregates the 16-dimensional feature vector of its connected graph node product number 505. After iterative propagation through three hidden layers, graph node user identifier 101 not only retains its own age and gender attributes but also absorbs the features of surrounding product nodes and surrounding supplier nodes within a three-hop range. Finally, the system outputs a high-dimensional fused feature map where each graph node is represented by a 128-dimensional vector. The system uses this high-dimensional fused feature map containing 10,000 graph nodes, each with a 128-dimensional vector, as a high-dimensional topology graph.
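One propagation step with edge weights derived from the temporal frequency sequences can be sketched as below. This is a deliberately reduced illustration and not a full graph convolutional network: it omits the learned weight matrices and nonlinear activations, and the self-loop weight of 1 and the normalization by total weight are assumptions.

```python
from typing import Dict, List, Tuple


def propagate_once(
    features: Dict[str, List[float]],
    edge_sequences: Dict[Tuple[str, str], List[int]],
) -> Dict[str, List[float]]:
    """Weighted mean of each node's own vector and its neighbours' vectors."""
    # Static edge weight: scalar sum of the edge's temporal frequency sequence.
    weights = {edge: float(sum(seq)) for edge, seq in edge_sequences.items()}
    updated: Dict[str, List[float]] = {}
    for node, own in features.items():
        total = list(own)  # self-loop with weight 1 (assumption)
        norm = 1.0
        for (a, b), weight in weights.items():
            if node == a:
                neighbour = b
            elif node == b:
                neighbour = a
            else:
                continue
            for i, value in enumerate(features[neighbour]):
                total[i] += weight * value
            norm += weight
        updated[node] = [x / norm for x in total]
    return updated


features = {"user:101": [1.0, 0.0], "product:505": [0.0, 1.0]}
edge_sequences = {("user:101", "product:505"): [1, 1]}  # two purchases, one per day
out = propagate_once(features, edge_sequences)
```

With an edge weight of 2, the user node's vector moves two thirds of the way toward the product node's vector. Stacking several such steps is what lets a node absorb information from multi-hop neighbours.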

[0044] Calculating, based on the semantic gravity field model, the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set includes:

[0045] Calculate the information entropy response value of associated nodes in the high-dimensional topology graph, extract the shortest path length from associated nodes to edge nodes, and calculate the connectivity curvature index based on the intermediate nodes on the shortest path.

[0046] The information entropy response value is calculated from the probability distribution using the information entropy formula:

$$H(v) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $H(v)$ is the information entropy response value of associated node $v$; $n$ is the number of distinct types among the neighboring nodes of the associated node (for example, if an associated node connects to three types, namely user, order, and supplier, then $n = 3$); and $p_i$ is the probability that a neighboring node connected to the associated node belongs to the $i$-th type, calculated as the number of neighboring nodes of the $i$-th type divided by the total number of neighboring nodes connected to the associated node.

[0047] For example, for a product node with node number 800, the system counts the number of connecting edges in the high-dimensional topology graph. Assume that product node number 800 has 10 connecting edges, with 5 connected to user nodes, 3 to order nodes, and 2 to supplier nodes. The system calculates the probability distribution of the connection between the product node and user nodes as 0.5, the probability distribution of the connection to order nodes as 0.3, and the probability distribution of the connection to supplier nodes as 0.2. Substituting these probabilities (0.5, 0.3, and 0.2) into the information entropy formula, the system calculates the information entropy response value for product node number 800 as 1.48.
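The worked example above can be reproduced with a short sketch. It assumes a base-2 logarithm, which is the base consistent with the stated result (the computed value is about 1.485, given in the text as 1.48); the function and key names are illustrative.

```python
import math
from typing import Dict


def information_entropy(type_counts: Dict[str, int]) -> float:
    """Shannon entropy (base 2) of a node's neighbour-type distribution."""
    total = sum(type_counts.values())
    entropy = 0.0
    for count in type_counts.values():
        if count > 0:  # types with no edges contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Product node 800: 10 edges, 5 to users, 3 to orders, 2 to suppliers.
h = information_entropy({"user": 5, "order": 3, "supplier": 2})
```

A node whose edges all lead to a single type yields an entropy of 0, which matches the low-information case discussed in the text.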

[0048] High information entropy indicates that a node not only connects to multiple types of nodes, but also that the distribution of these connections is relatively balanced. This suggests that the node is a core node containing rich cross-domain business logic. Low information entropy indicates that a node has a single type of connection. For example, if 99% of the connections for a product point only to supplier A, it indicates that the amount of information contained is relatively small. Therefore, the higher the information entropy of a node, the greater the amount of structured information it contains. When performing semantic folding, retaining such high-entropy nodes can maximize the reconstruction of the complex network environment of core assets in the real world, thereby providing significant contextual information for downstream machine learning model training.

[0049] The system uses a breadth-first search algorithm to find the minimum number of hops from external related nodes to edge nodes in a high-dimensional topology graph, and uses the minimum number of hops as the shortest path length. The smaller the shortest path length, the greater the semantic relevance. However, this linear distance metric cannot reveal the intrinsic quality of the connection. Therefore, this embodiment extracts all intermediate nodes on the shortest topology path, calculates the clustering coefficient of the intermediate nodes, and uses the average of the clustering coefficients as a connectivity curvature index. The calculation method of the clustering coefficient is existing technology and will not be described here. A high connectivity curvature index means that the path traverses a tight local community, and its association has high redundancy and support; conversely, low curvature indicates that the path may be a fragile bridge-like connection with a high degree of randomness.

[0050] For example, the system calculates the path features from product node number 800 in the external associated node set to order node 1001 in the core asset node set. Using a breadth-first search algorithm, the system finds that starting from product node 800, passing through intermediate node store node A, and finally reaching edge node order 1001, the path involves two connecting edges. The system uses 2 as the shortest path length. Subsequently, the system extracts intermediate node store node A on the shortest topological path and calculates the proportion of interconnections between store node A's neighboring nodes, obtaining a clustering coefficient of 0.6. Since there is only one intermediate node on the shortest topological path, the system uses the average of the clustering coefficients of 0.6, i.e., 0.6, as the connectivity curvature index. By combining breadth-first search and clustering coefficient statistics for path feature extraction, this invention not only measures the physical hierarchical distance between data nodes but also obtains the local topological density of the connection channels, enabling subsequent gravity calculations to fully consider the mesh transmission effect in real business networks.
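The path features in this example can be sketched with a breadth-first search and the standard local clustering coefficient. The toy graph below is hypothetical: the nodes `n1` to `n3` and their edges are invented so that store node A has a clustering coefficient of 0.6, and returning 0.0 when a path has no intermediate nodes is an assumption the text does not cover.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_path(adj: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns one minimum-hop path or None."""
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            cursor: Optional[str] = node
            while cursor is not None:
                path.append(cursor)
                cursor = parents[cursor]
            return path[::-1]
        for neighbour in adj[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def clustering_coefficient(adj: Dict[str, Set[str]], node: str) -> float:
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = sorted(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in adj[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


def connectivity_curvature(adj: Dict[str, Set[str]], path: List[str]) -> float:
    """Mean clustering coefficient of the intermediate nodes on the path."""
    inner = path[1:-1]
    if not inner:
        return 0.0  # assumption: direct edge, no intermediate nodes
    return sum(clustering_coefficient(adj, n) for n in inner) / len(inner)


edge_list = [
    ("product:800", "store:A"), ("order:1001", "store:A"),
    ("n1", "store:A"), ("n2", "store:A"), ("n3", "store:A"),
    ("product:800", "n1"), ("product:800", "n2"), ("order:1001", "n3"),
    ("n1", "n2"), ("n1", "n3"), ("n2", "n3"),
]
adj: Dict[str, Set[str]] = {}
for a, b in edge_list:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

path = shortest_path(adj, "product:800", "order:1001")
hops = len(path) - 1
curvature = connectivity_curvature(adj, path)
```

In this graph the only two-hop route runs through store node A, whose five neighbours share 6 of the 10 possible links, giving the 0.6 used in the example.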

[0051] The information entropy response value, connectivity curvature index, and preset weight coefficients of the business subdomain where the associated node is located are input into the semantic gravity field model to obtain the initial gravity value of the associated node belonging to the edge node. The initial gravity value is normalized to obtain the semantic gravity probability.

[0052] Preset weighting coefficients are assigned based on the business type to which the associated node belongs. For example, if the associated node belongs to the "credit rating" business subdomain, it is assigned a higher preset weighting coefficient, such as 1.8, because it directly determines transaction risk. If the associated node belongs to the "internal approval process" business subdomain, it has a smaller impact on transaction analysis and is assigned a lower preset weighting coefficient, such as 0.3.

[0053] The information entropy response value, connectivity curvature index, and weight coefficients are concatenated into an input vector. This input vector is then fed into a multilayer perceptron semantic gravity field model containing multiple hidden layers for nonlinear mapping, outputting initial gravity values. The above steps are repeated to calculate the initial gravity value for each associated node connected to the edge node. Normalization is then applied to map the initial gravity values to the interval between 0 and 1, generating the semantic gravity probability of the associated nodes.

[0054] For example, the system determines that product node number 800 belongs to the product business subdomain and reads the weight coefficient of the product business subdomain as 1.2 from the configuration table. The system concatenates the information entropy response value of product node number 800 (1.48), the connectivity curvature index (0.6), and the weight coefficient (1.2) into a three-dimensional input vector. The system inputs this three-dimensional input vector into a pre-trained multilayer perceptron semantic gravity field model. After the activation function is applied, the multilayer perceptron semantic gravity field model outputs an initial gravity value of 3.5 for product node number 800. Assume that the initial gravity value of another external related node, supplier A, is calculated in the same way to be 1.5. The system substitutes the initial gravity values of 3.5 and 1.5 into the normalized exponential function for exponentiation and summation, obtaining a semantic gravity probability of 0.88 for product node number 800 and a semantic gravity probability of 0.12 for supplier A.
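The normalization and thresholding in this example can be checked with a short sketch. The initial gravity values 3.5 and 1.5 come from the example, and the 0.6 threshold is the 60% value used in the earlier example of step S2; the function names and node identifiers are illustrative, and the multilayer perceptron that produces the initial gravity values is not modeled here.

```python
import math
from typing import Dict, Set


def softmax(values: Dict[str, float]) -> Dict[str, float]:
    """Normalized exponential function over the nodes attached to one edge node."""
    peak = max(values.values())  # subtract the maximum for numerical stability
    exps = {key: math.exp(value - peak) for key, value in values.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


def strong_boundary_nodes(probabilities: Dict[str, float], threshold: float) -> Set[str]:
    """Nodes whose semantic gravity probability exceeds the preset threshold."""
    return {node for node, p in probabilities.items() if p > threshold}


initial_gravity = {"product:800": 3.5, "supplier:A": 1.5}
probabilities = softmax(initial_gravity)
strong = strong_boundary_nodes(probabilities, threshold=0.6)
```

Because the probabilities of all associated nodes of one edge node sum to 1, a threshold above 0.5 can be exceeded by at most one node per edge node under this normalization.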

[0055] In this embodiment, a semantic gravity field model is established, including:

[0056] A training set is constructed based on historical business data. A semantic gravity field model composed of a multilayer perceptron is trained based on the training set. The label of each training sample in the training set is generated by weighted summation of the query frequency probability and the steady-state probability distribution value. The steady-state probability distribution value is calculated using a random walk algorithm. As part of this calculation, the random walk algorithm computes the historical temporal fluctuation entropy of the connection frequency of each associated node within the historical time window and maps it to a temporal gravitational barrier value.

[0057] When evaluating the correlation between data nodes, using global random walks or statistical analysis of massive historical query logs presents technical challenges for graph networks with tens of millions of nodes, resulting in excessively high computational complexity and failing to meet the requirements for online real-time encapsulation of data assets. The semantic correlation strength between an external node and an edge node (quantified by historical query frequency and network steady-state reachability) can be approximated by local features such as its topological importance (information entropy response value), the quality of its connection path to core assets (connectivity curvature index), and the prior importance of its business domain (preset weight coefficient). Therefore, this invention trains a semantic gravity field model, using the computationally expensive global steady-state probability and historical query frequency as training labels, and the low-computational-cost local topological features as input. This allows for rapid inference of high-precision semantic gravity probabilities through the model during the online encapsulation stage, requiring only the computation of local lightweight features, significantly reducing the computational latency of online encapsulation. The specific technical solution is as follows.

[0058] The input features of the semantic gravity field model in the training set are the aforementioned information entropy response value, connectivity curvature index, and weight coefficients. The labels of the training set, i.e., the output features of the semantic gravity field model, are generated by a weighted sum of the query frequency probability and the steady-state probability distribution value. The query frequency probability is calculated as follows: among the queries involving an edge node, the number of times each associated node is queried together with that edge node is counted to obtain the joint query count of the edge node and the associated node. The total query count of the edge node is used as the total cardinality, and the ratio of the joint query count to the total cardinality is used as the query frequency probability.
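A minimal sketch of the label construction described above follows. The mixing weight `query_weight` and all numeric values are hypothetical; the text only states that the two probabilities are combined by a weighted sum.

```python
def query_frequency_probability(joint_query_count, total_edge_queries):
    """Share of the edge node's queries that also involved the associated node."""
    if total_edge_queries == 0:
        return 0.0
    return joint_query_count / total_edge_queries

def training_label(query_prob, steady_state_prob, query_weight=0.5):
    """Weighted sum of the two probabilities (query_weight is an assumed setting)."""
    return query_weight * query_prob + (1.0 - query_weight) * steady_state_prob

# Illustrative numbers only: 240 of 1,000 queries on the edge node
# also touched the associated node.
q = query_frequency_probability(240, 1000)
label = training_label(q, steady_state_prob=0.10, query_weight=0.5)
```

With these inputs the query frequency probability is 0.24 and the label is the midpoint between 0.24 and 0.10.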

[0059] The steady-state probability distribution is then calculated using a random walk algorithm. However, in the complex business graph network encapsulated in data assets, certain externally related nodes (such as "flash sale" nodes during short-term promotional events) generate a large number of instantaneous connections in a very short time, forming "instantaneous high-frequency nodes." If the traditional random walk algorithm is used directly, once the walking particles enter such nodes, they will wander back and forth in their local star network, leading to an artificially high calculated steady-state probability distribution value. This causes short-term abnormal features to be folded into the data asset package, severely interfering with the prediction accuracy of downstream callers' machine learning models. Existing technologies typically address this issue by adding time decay weights to the graph network edges or directly removing nodes with excessively high degrees. However, this forcibly cuts off the real business links, destroys the global integrity of the graph structure, and cannot dynamically adapt to changes in the network.

[0060] To address the aforementioned issues, the random walk algorithm in this embodiment first calculates the historical temporal fluctuation entropy of the connection frequency of each associated node within a historical time period, and maps it to a temporal gravitational barrier value. For example, a historical time window is set and divided into N time slices of equal length. The system reads the temporal frequency sequence attributes pre-attached to each connection edge in the high-dimensional topology graph, counts the total number of times the associated node connects with all other nodes in the high-dimensional topology graph within each time slice, and uses this as the connection frequency under that slice, thereby reconstructing the temporal frequency sequence of the node. Based on the maximum and minimum values in the temporal frequency sequence, the frequency range is divided into M statistical intervals, and the probability that the connection frequency falls into the m-th interval is calculated. This probability is used as the event occurrence probability, and the historical temporal fluctuation entropy is calculated using the information entropy formula. The form of the information entropy formula has been given above and will not be described again here. Unlike the aforementioned topological information entropy based on node type, this temporal fluctuation information entropy is specifically used to quantify the suddenness and oscillation of node interaction behavior over time. The more drastic the fluctuation of a node's connection frequency over time, and the more dispersed its frequency distribution across different intervals, the greater the calculated historical temporal fluctuation entropy. 
For example, for long-term stable regular supplier nodes, their connection frequency fluctuations are minimal, resulting in low historical temporal fluctuation entropy; while for flash sale product nodes, their connection frequency oscillates violently during promotional periods, resulting in extremely high historical temporal fluctuation entropy.
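The binning and entropy steps above can be sketched as follows. The base-2 logarithm, the equal-width binning between the minimum and maximum, and the sample sequences are assumptions made for illustration; the patent's own entropy formula is defined earlier in the specification and is not reproduced here.

```python
import math

def temporal_fluctuation_entropy(frequencies, num_bins):
    """Shannon entropy (base 2, assumed) of a binned connection-frequency sequence."""
    lo, hi = min(frequencies), max(frequencies)
    if hi == lo:
        return 0.0  # perfectly stable node: every slice falls in one interval
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for f in frequencies:
        # Clamp so that the maximum value falls into the last interval.
        idx = min(int((f - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(frequencies)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Hypothetical per-slice connection counts (N = 8 slices, M = 4 intervals).
stable_supplier = [20] * 8
flash_sale = [2, 950, 10, 400, 3, 700, 150, 5]

h_stable = temporal_fluctuation_entropy(stable_supplier, 4)  # 0.0
h_flash = temporal_fluctuation_entropy(flash_sale, 4)
```

The constant sequence yields an entropy of 0, while the oscillating flash sale sequence spreads across several intervals and yields a clearly larger value.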

[0061] Then, a nonlinear exponential mapping function is used to map the historical temporal fluctuation entropy to the temporal gravitational barrier value. The mapping formula takes as its parameters a basic potential barrier value and preset first and second potential barrier amplification coefficients, and takes the historical temporal fluctuation entropy as its variable. Specific values of the two amplification coefficients can be determined experimentally, for example, 1.5 and 2.

[0062] The random walk algorithm starts a random walk from an edge node. When the walking particle jumps to a node, it uses the Boltzmann distribution factor to exponentially decay the preset basic transition probability to generate topological repulsion. When the walking particle enters a node whose temporal gravitational barrier value is greater than the critical threshold, it blocks the outward diffusion path weight and forces the transition probability to be allocated to the previous hop node or the starting point.

[0063] Subsequently, a random walk is initiated starting from an edge node within the core asset node set (such as order 1001). The system first calculates the basic transition probability based on the edge weights of the graph network. Specifically, for a given node in the high-dimensional topology graph, the system obtains the weights of the edges between that node and all of its neighboring nodes, and then normalizes these edge weights to obtain the basic transition probabilities from the node to each of its neighboring nodes.

[0064] During the walk, as the wandering particle moves towards a node, its transition probability is exponentially decayed using the Boltzmann distribution factor. For example, when the wandering particle is about to move from its current node to a neighboring target node, the transition probability is updated as follows. The system extracts the temporal gravitational barrier value of the target node and sets a temperature hyperparameter, which can be set to 0.8, for example. The system first calculates the Boltzmann attenuation factor of the target node from these two quantities. The basic transition probability from the current node to the target node is then multiplied by this attenuation factor to obtain the basic decay value for the target node. Finally, the basic decay values of all neighboring nodes are normalized to obtain the decayed transition probability from the current node to each neighboring node. In this way, the attenuation factor of nodes with high temporal gravitational barrier values (such as "flash sale" items with instantaneous fluctuations) approaches 0, thereby significantly reducing the probability of particles accidentally entering unstable nodes in the first stage.
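The decay step can be illustrated with a small sketch. It assumes the attenuation factor takes the standard Boltzmann form exp(-B / T), where B is the target node's temporal gravitational barrier value and T is the temperature hyperparameter; the exact expression is not legible in this text, and the edge weights and barrier values below are hypothetical.

```python
import math

def basic_transition_probabilities(edge_weights):
    """Normalize a node's outgoing edge weights into basic transition probabilities."""
    total = sum(edge_weights.values())
    return {nbr: w / total for nbr, w in edge_weights.items()}

def decayed_transition_probabilities(edge_weights, barrier, temperature=0.8):
    """Damp each neighbor by exp(-barrier / temperature) (assumed form), then renormalize."""
    basic = basic_transition_probabilities(edge_weights)
    decayed = {
        nbr: p * math.exp(-barrier[nbr] / temperature)
        for nbr, p in basic.items()
    }
    total = sum(decayed.values())
    return {nbr: d / total for nbr, d in decayed.items()}

# Hypothetical neighbors of edge node "order 1001".
edge_weights = {"supplier_A": 1.0, "product_800": 2.0, "flash_sale_item": 5.0}
barrier = {"supplier_A": 1.0, "product_800": 1.2, "flash_sale_item": 12.0}

basic = basic_transition_probabilities(edge_weights)
probs = decayed_transition_probabilities(edge_weights, barrier)
```

Although the flash sale item carries the largest edge weight, and therefore the largest basic transition probability, its high barrier value drives its decayed transition probability close to zero.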

[0065] As shown in Figure 3, even after the above steps, wandering particles still have a very small probability of entering high-barrier nodes. High-barrier nodes are those whose temporal gravitational barrier value is greater than a critical threshold (for example, the 95th percentile of the temporal gravitational barrier values). If a wandering particle accidentally enters such a node, for example a "flash sale item" node, the system cuts off the possibility of the particle continuing to wander outward from that node and forcibly allocates the transition probability to the previous hop node or the starting point. Through this forced backtracking operation, the wandering particle is effectively bounced back from high-frequency noise nodes and continues to explore the surrounding stable business network. After multiple iterations of the walk until the network converges, the steady-state probability distribution values on each external associated node are extracted to generate training labels.
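The forced backtracking rule can be sketched with a seeded Monte Carlo walk. This is only an illustration under stated assumptions: it estimates visit frequencies over a fixed number of steps rather than iterating to convergence, and the graph, transition probabilities, barrier values, and threshold are all hypothetical.

```python
import random

def walk_visit_frequencies(transitions, barrier, threshold, start, steps, seed=0):
    """Estimate visit frequencies with forced backtracking out of high-barrier nodes."""
    rng = random.Random(seed)
    visits = {node: 0 for node in transitions}
    previous, current = start, start
    for _ in range(steps):
        visits[current] += 1
        if barrier[current] > threshold:
            # High-barrier node: block outward diffusion and bounce back
            # to the previous hop (or to the start node if there is none).
            nxt = previous if previous != current else start
        else:
            neighbors = list(transitions[current])
            weights = [transitions[current][n] for n in neighbors]
            nxt = rng.choices(neighbors, weights=weights, k=1)[0]
        previous, current = current, nxt
    return {node: count / steps for node, count in visits.items()}

transitions = {
    "order_1001": {"supplier_A": 0.6, "flash_sale_item": 0.4},
    "supplier_A": {"order_1001": 1.0},
    "flash_sale_item": {"promo_1": 0.5, "promo_2": 0.5},
    "promo_1": {"flash_sale_item": 1.0},
    "promo_2": {"flash_sale_item": 1.0},
}
barrier = {"order_1001": 1.0, "supplier_A": 1.0, "flash_sale_item": 12.0,
           "promo_1": 1.0, "promo_2": 1.0}

freq = walk_visit_frequencies(transitions, barrier, threshold=5.0,
                              start="order_1001", steps=10_000)
```

Because the particle is always bounced back from the flash sale node, the star of promotion nodes behind it is never visited, so it cannot inflate the estimated distribution.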

[0066] In the above steps, the historical time-series fluctuation entropy calculated based on Shannon's formula is essentially logarithmic in growth, with a typically narrow range and insignificant differences (for example, the entropy of an absolutely stable node is 0, while the entropy of a violently oscillating flash sale node may only be 2.5). If the fluctuation entropy, which exhibits a linear or logarithmic distribution, is directly substituted as a resistance factor into the subsequent random walk algorithm, the resistance difference between stable nodes and noisy nodes is too small. Wandering particles still have a high probability of mistakenly entering high-frequency noise nodes (such as flash sale items), resulting in severe distortion of the extracted steady-state probability distribution value and folding instantaneous business noise into the final data asset package. However, by introducing a mapping formula for nonlinear exponential mapping, for long-term stable nodes with small historical time-series fluctuation entropy, the growth of the exponential function is extremely slow. The mapped barrier value closely follows the basic barrier value, thus ensuring normal semantic topological connectivity and allowing wandering particles to penetrate freely. When the historical time-series fluctuation entropy exceeds a certain threshold, the exponential function will trigger explosive growth, causing the barrier value of that node to increase dramatically in a geometric progression. When this nonlinearly amplified barrier value is used to calculate the transition probability in conjunction with the Boltzmann distribution factor, it causes the probability of a wandering particle jumping to the noise node to decrease drastically. 
Through the above improvements, this invention blocks the interference of instantaneous business noise nodes at the underlying physical mechanism of the algorithm without destroying the connectivity of the underlying historical graph structure, so that the generated steady-state probability distribution value can accurately quantify the true and stable semantic gravitational field distribution of data assets.

[0067] Generate residual semantic metadata to characterize the external context, including:

[0068] For each strong gravitational boundary node, obtain its feature representation matrix on the high-dimensional topological graph, and multiply the feature representation matrix with the semantic gravity probability corresponding to the strong gravitational boundary node to obtain the weighted semantic features of the strong gravitational boundary node.

[0069] Based on the strong gravitational boundary node identifiers marked in the previous steps, an index search is performed in the high-dimensional topology graph. The feature vectors of the found strong gravitational boundary nodes are extracted and arranged into a two-dimensional feature representation matrix according to a preset dimensional format for subsequent matrix operations. For example, based on the product node numbered 800 marked in the previous steps, the system searches for the 128-dimensional vector corresponding to product node number 800 in a high-dimensional topology graph containing 10,000 graph nodes. The system extracts this 128-dimensional vector and arranges it in an 8x16 format to generate an 8x16 feature representation matrix.

[0070] To address the technical issue of varying semantic contributions of different external nodes to core assets, which could lead to secondary information overshadowing primary information if treated equally, the system reads the semantic gravity probability values calculated for strong gravity boundary nodes. Each element in the feature representation matrix is multiplied by this semantic gravity probability value to obtain a scaled feature representation matrix, which is then used as the weighted semantic feature of that external node.

[0071] For example, the system reads that the semantic gravity probability of product node number 800, calculated in the previous step, is 0.88. The system multiplies each of the 128 values in the 8x16 feature representation matrix by 0.88. Assuming the value in the first row and first column of the feature representation matrix is 2.5, the system multiplies 2.5 by 0.88 to obtain a new value of 2.2, and fills the corresponding position in the new matrix with 2.2. The system performs the same multiplication on all 128 values, finally generating a new 8x16 matrix, which is used as the weighted semantic feature of product node number 800.
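The reshape and scaling steps from the example can be reproduced in a few lines. The helper names and the placeholder embedding values (other than the 2.5 in the first position) are illustrative assumptions.

```python
def to_matrix(vector, rows, cols):
    """Arrange a flat feature vector into a rows x cols matrix (row-major)."""
    if len(vector) != rows * cols:
        raise ValueError("vector length does not match the requested shape")
    return [vector[r * cols:(r + 1) * cols] for r in range(rows)]

def weight_features(matrix, gravity_probability):
    """Scale every element by the node's semantic gravity probability."""
    return [[value * gravity_probability for value in row] for row in matrix]

# Placeholder 128-dimensional vector for product node 800; only the
# first value (2.5) comes from the example in the text.
embedding = [2.5] + [1.0] * 127
matrix = to_matrix(embedding, 8, 16)
weighted = weight_features(matrix, 0.88)
print(round(weighted[0][0], 2))  # 2.2
```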

[0072] Perform feature aggregation on the weighted semantic features of all strong gravitational boundary nodes connected to the same edge node to obtain residual semantic metadata.

[0073] In this embodiment, feature dimensionality reduction is performed on the weighted semantic features of all strong gravitational boundary nodes connected to the same edge node, including:

[0074] The weighted semantic features of all strong gravitational boundary nodes connected to the same edge node are summed and averaged to obtain residual semantic metadata.

[0075] The weighted semantic features of all strong gravity boundary nodes belonging to the same edge node are aggregated. Specifically, the system calculates the average vector of the weighted semantic feature vectors in the vector space to obtain residual semantic metadata. At this point, the residual semantic metadata not only integrates the feature information of all high gravity-related nodes, but also reflects their different importance through the pre-weighting of semantic gravity probabilities.
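The sum-and-average aggregation can be sketched as an element-wise mean over equally shaped matrices. The 2x2 inputs below are placeholders chosen to keep the arithmetic easy to follow; in the embodiment the matrices would be 8x16.

```python
def aggregate_mean(weighted_features):
    """Element-wise mean of equally shaped weighted feature matrices."""
    count = len(weighted_features)
    rows = len(weighted_features[0])
    cols = len(weighted_features[0][0])
    return [
        [sum(m[r][c] for m in weighted_features) / count for c in range(cols)]
        for r in range(rows)
    ]

# Two hypothetical strong gravitational boundary nodes of the same edge node.
node_a = [[2.0, 4.0], [6.0, 8.0]]
node_b = [[0.0, 2.0], [2.0, 0.0]]
residual = aggregate_mean([node_a, node_b])
print(residual)  # [[1.0, 3.0], [4.0, 4.0]]
```

Because each input was already scaled by its semantic gravity probability, nodes with higher probability contribute more to the mean.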

[0076] In other embodiments, the PCA algorithm can also be used to perform feature dimensionality reduction on the weighted semantic features of strong gravitational boundary nodes to obtain residual semantic metadata.

[0077] Generate a standard data asset package, including:

[0078] The standard data asset package is divided into a semantic residual area and a raw data ontology area. The residual semantic metadata is written into the semantic residual area, and the original business detail records belonging to the core asset node set are written into the raw data ontology area. An index mapping is established between the semantic residual area and the raw data ontology area to achieve joint reading of the semantic residual area and the raw data ontology area.

[0079] When generating a standard data asset package, residual semantic metadata is written to the semantic residual area, and detailed entity records exported from the underlying database are allocated to the raw data ontology area. Simultaneously with writing the residual semantic metadata, the system obtains the specific physical storage address or row number of the node to which the residual semantic metadata belongs in the raw data ontology area. The system generates a mapping key-value pair containing the node's unique identifier and its physical storage address, and appends this mapping key-value pair to the residual semantic metadata, thereby establishing an index mapping. For example, the system converts the one-dimensional low-dimensional vector containing 8 values generated for edge node order 1001 in the previous steps into a byte stream. The system sequentially writes the byte stream into the semantic residual area of the standard data asset package. The system locates the entity detail record of edge node order 1001 in the raw data ontology area, finding it to be stored at row 150. The system generates a mapping key-value pair, with the key being the order identifier 1001 and the value being the row number 150. The system appends the mapping key-value pair immediately to the end of the one-dimensional low-dimensional vector containing 8 values.
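The write path in the example above can be sketched as follows. The byte layout (little-endian 32-bit floats followed by two unsigned 32-bit integers) and the vector values are assumptions for illustration; the text does not specify an encoding.

```python
import struct

def write_residual_entry(residual_area, vector, node_id, row_number):
    """Append one residual vector plus its (node id -> row number) mapping pair."""
    # Assumed layout: each value as a little-endian 32-bit float.
    residual_area += struct.pack(f"<{len(vector)}f", *vector)
    # Mapping key-value pair appended immediately after the vector.
    residual_area += struct.pack("<II", node_id, row_number)
    return residual_area

# Hypothetical 8-value residual vector for edge node "order 1001",
# whose detail record sits at row 150 of the raw data ontology area.
vector = [0.12, 0.40, 0.05, 0.33, 0.27, 0.09, 0.51, 0.18]
residual_area = write_residual_entry(bytearray(), vector, node_id=1001, row_number=150)

key, value = struct.unpack("<II", residual_area[-8:])
print(key, value)  # 1001 150
```

Keeping the mapping pair adjacent to the vector lets a reader resolve the row in the raw data ontology area without a separate lookup table.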

[0080] By employing an index mapping method that combines feature writing with cross-segment physical addressing, this invention precisely anchors highly compressed external network features to specific internal entity records, ensuring that data assets maintain a tight internal logical correspondence even under physical isolation. When the calling system initiates an access request and passes authentication in the access control zone, the system allows the calling algorithm model to simultaneously access both the raw data ontology area and the semantic residual area of the standard data asset package. The calling algorithm model first parses the raw data ontology area to extract entity attribute features, and then extracts the corresponding residual semantic metadata in the semantic residual area based on the index mapping. The calling algorithm model then concatenates and fuses the entity attribute features and residual semantic metadata in the feature space, using the concatenated and fused vectors to reconstruct the context of the edge nodes before the cutting, thereby restoring local topological features.

[0081] For example, a financial institution, acting as the caller, requests its credit assessment algorithm model to access third-quarter high-end user consumption behavior assets for default risk prediction. After system verification, the credit assessment algorithm model reads the edge node "Order 1001" in row 150 of the raw data ontology, extracting the order amount of 5000 yuan and the purchase time entity attribute features. Simultaneously, the model reads the corresponding one-dimensional low-dimensional vector containing eight values from the semantic residual region based on the index mapping. The model then concatenates the order amount of 5000 yuan with the one-dimensional low-dimensional vector containing eight values to form a comprehensive feature vector. Through this comprehensive feature vector, the model not only learns the order amount of 5000 yuan but also perceives, from the one-dimensional low-dimensional vector containing eight values, the local topological features of order 1001 in the original network, indicating a close association with high-quality suppliers and popular products.
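The joint read in this example can be sketched with plain dictionaries standing in for the two areas of the package. The data structures, field names, and residual values are hypothetical; only the order identifier 1001, row number 150, and order amount 5000 come from the text.

```python
def build_feature_vector(raw_rows, index_mapping, residual_vectors, node_id):
    """Concatenate an entity attribute with the residual vector mapped to the same node."""
    row_number = index_mapping[node_id]       # index mapping: node id -> row
    record = raw_rows[row_number]             # raw data ontology area
    residual = residual_vectors[node_id]      # semantic residual area
    return [record["order_amount"]] + list(residual)

raw_rows = {150: {"order_id": 1001, "order_amount": 5000.0}}
index_mapping = {1001: 150}
residual_vectors = {1001: [0.12, 0.40, 0.05, 0.33, 0.27, 0.09, 0.51, 0.18]}

features = build_feature_vector(raw_rows, index_mapping, residual_vectors, 1001)
print(len(features))  # 9
```

The resulting vector carries the order amount in its first position and the eight residual values after it, which is the comprehensive feature vector the calling model would consume.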

[0082] By employing a dual-segment joint reading and feature space reconstruction method, this invention enables external algorithm models to achieve holographic perception of the context of core asset node sets without accessing the underlying raw privacy data, greatly enhancing the high-dimensional analysis value and model prediction accuracy of data assets in cross-domain circulation scenarios.

[0083] As shown in Figure 4, the present invention also provides a data asset standardization and encapsulation system for implementing the above-described method. The system includes:

[0084] The graph recognition module constructs a high-dimensional topology graph based on entities and relationships in the specified business data domain, and determines the core asset node set and the external related node set that has a topological connection with the core asset node set in the high-dimensional topology graph according to the encapsulation instructions.

[0085] The gravity screening module constructs a semantic gravity field model, calculates the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set based on the semantic gravity field model, and marks associated nodes whose semantic gravity probability exceeds a preset threshold as strong gravity boundary nodes.

[0086] The semantic generation module obtains the feature representation matrix of strong gravitational boundary nodes in the high-dimensional topology graph, and generates weighted semantic features based on the corresponding semantic gravity probabilities. It then aggregates the weighted semantic features of all strong gravitational boundary nodes belonging to the same edge node to generate residual semantic metadata for characterizing the external context environment.

[0087] The asset encapsulation module concatenates and encrypts the residual semantic metadata with the raw data ontology of the core asset node set to generate a standard data asset package.

[0088] This application also provides a computer-readable storage medium storing instructions that, when executed by a processor, implement the method described above.

[0089] It should be noted that, in the specific implementation of this invention, the specific values of some algorithm parameters and thresholds involved, such as the "preset threshold" used to screen strong gravitational boundary nodes, the "preset weight coefficient" characterizing the importance of the business subdomain, the "critical threshold" used to block random walks, the amplification coefficient used to construct the temporal gravitational barrier value, and the specific structure and hyperparameters (such as the number of network layers, feature dimension, learning rate, etc.) of deep learning models such as semantic gravitational field models and graph convolutional networks, are not uniquely fixed. Those skilled in the art can flexibly set and optimize these parameters according to the specific business data domain being processed, the data distribution characteristics, and the desired technical effect (such as maximizing semantic integrity preservation) through conventional experimental testing, cross-validation, and parameter tuning to achieve the best encapsulation performance. Therefore, determining these parameters is a conventional technical practice that can be accomplished by those skilled in the art through limited and reasonable experiments.

[0090] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0091] The above embodiments merely illustrate several implementation methods of the present invention, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be determined by the appended claims.

[0092] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A standardized packaging method for data assets, characterized in that it comprises: constructing a high-dimensional topology graph based on the entities and relationships in a specified business data domain, and determining, in the high-dimensional topology graph and according to the encapsulation instructions, a core asset node set and an external associated node set that has topological connections with the core asset node set; constructing a semantic gravity field model, calculating, based on the semantic gravity field model, the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set, and marking the associated nodes whose semantic gravity probability exceeds a preset threshold as strong gravitational boundary nodes; obtaining the feature representation matrix of the strong gravitational boundary nodes in the high-dimensional topology graph, and generating weighted semantic features by weighting it according to the corresponding semantic gravity probabilities; aggregating the weighted semantic features of all strong gravitational boundary nodes belonging to the same edge node to generate residual semantic metadata for characterizing the external context; and concatenating and encrypting the residual semantic metadata with the raw data ontology of the core asset node set to generate a standard data asset package.

2. The method according to claim 1, characterized in that, Construct a high-dimensional topology graph based on entities and relationships within a specified business data domain, including: Structured and unstructured data in the business data domain are mapped to triples. An initial topology graph is constructed based on the triples. A graph convolutional network is then used to learn from the initial topology graph to generate a high-dimensional topology graph.

3. The method according to claim 1, characterized in that, The semantic gravity field model is used to calculate the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set, including: Calculate the information entropy response value of associated nodes in the high-dimensional topology graph, extract the shortest path length from associated nodes to edge nodes, and calculate the connectivity curvature index based on the intermediate nodes in the shortest path length. The information entropy response value, connectivity curvature index, and preset weight coefficients of the business subdomain where the associated node is located are input into the semantic gravity field model to obtain the initial gravity value of the associated node belonging to the edge node. The initial gravity value is normalized to obtain the semantic gravity probability.

4. The method according to claim 3, characterized in that, Establish a semantic gravity field model, including: A training set is constructed based on historical business data. A semantic gravity field model composed of a multilayer perceptron is trained based on the training set. The label of each training sample in the training set is generated by weighted summation of query frequency probability and steady-state probability distribution value. The steady-state probability distribution value is calculated based on the random walk algorithm. When the random walk algorithm is calculated, it calculates the historical temporal fluctuation entropy of the connection frequency of each associated node within the historical time window and maps it to the temporal gravity barrier value. The random walk algorithm starts a random walk from an edge node. When the walking particle jumps to a node, it uses the Boltzmann distribution factor to exponentially decay the preset basic transition probability to generate topological repulsion. When the walking particle enters a node whose temporal gravitational barrier value is greater than the critical threshold, it blocks the outward diffusion path weight and forces the transition probability to be allocated to the previous hop node or the starting point.

5. The method according to claim 3, characterized in that, Generate residual semantic metadata to characterize the external context, including: For each strong gravitational boundary node, obtain its feature representation matrix on the high-dimensional topology graph, and multiply the feature representation matrix with the semantic gravity probability corresponding to the strong gravitational boundary node to obtain the weighted semantic features of the strong gravitational boundary node. Perform feature aggregation on the weighted semantic features of all strong gravitational boundary nodes connected to the same edge node to obtain residual semantic metadata.

6. The method according to claim 5, characterized in that, Perform feature aggregation on the weighted semantic features of all strong gravitational boundary nodes connected to the same edge node, including: The weighted semantic features of all strong gravitational boundary nodes connected to the same edge node are summed and averaged to obtain residual semantic metadata.

7. The method according to claim 1, characterized in that, Generate a standard data asset package, including: The standard data asset package is divided into a semantic residual area and a raw data ontology area. The residual semantic metadata is written into the semantic residual area, and the original business detail records belonging to the core asset node set are written into the raw data ontology area. An index mapping is established between the semantic residual area and the raw data ontology area to achieve joint reading of the semantic residual area and the raw data ontology area.

8. A data asset standardization and encapsulation system for implementing the method as described in any one of claims 1-7, characterized in that, The system includes: The graph recognition module constructs a high-dimensional topology graph based on entities and relationships in the specified business data domain, and determines the core asset node set and the external related node set that has a topological connection with the core asset node set in the high-dimensional topology graph according to the encapsulation instructions. The gravity screening module constructs a semantic gravity field model, calculates the semantic gravity probability of each edge node in the core asset node set to each associated node in the external associated node set based on the semantic gravity field model, and marks associated nodes whose semantic gravity probability exceeds a preset threshold as strong gravity boundary nodes. The semantic generation module obtains the feature representation matrix of strong gravitational boundary nodes in the high-dimensional topology graph, and generates weighted semantic features based on the corresponding semantic gravity probabilities. It then aggregates the weighted semantic features of all strong gravitational boundary nodes belonging to the same edge node to generate residual semantic metadata for characterizing the external context environment. The asset encapsulation module concatenates and encrypts the residual semantic metadata with the raw data ontology of the core asset node set to generate a standard data asset package.

9. A computer-readable storage medium storing instructions thereon, characterized in that, when the instructions are executed by a processor, they implement the method as described in any one of claims 1-7.

Citation Information

Patent Citations

CN120510623A
CN120874853A
