A comprehensive business data analysis system and method based on dynamic data reuse
By constructing a unified data semantic layer, dynamic data lineage tracking, and an intelligent scheduling engine, the invention addresses resource waste and poor data sharing in business data analysis systems, enabling efficient and flexible data analysis and result reuse and improving the real-time performance and consistency of the system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DONGYING DONGWANG INTERNET INFORMATION TECH CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing business data analysis systems suffer from problems such as wasted computing resources, response delays, difficulty in sharing and reusing data results, limited ability to fuse and process multi-source heterogeneous data, and lack of intelligent scheduling engines in terms of dynamic data reuse and integrated analysis, resulting in insufficient real-time performance, flexibility, and intelligence.
We construct a unified data semantic layer, implement dynamic data lineage tracing, adopt a context-aware caching strategy, build an intelligent scheduling engine, and achieve end-to-end analysis link connectivity through a unified execution framework, supporting the reuse and collaboration of analysis results across business scenarios.
It achieves high reusability, high consistency, high real-time performance, and high scalability in business data analysis systems, reduces semantic ambiguity in cross-system analysis, improves the interpretability and compliance of data analysis results, and enhances system response efficiency and resource utilization.
[Figure: abstract drawing, CN122240670A_ABST]
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, specifically to a comprehensive business data analysis system and method based on dynamic data reuse.
Background Technology
[0002] In the context of the rapid development of the digital economy, business data analytics has become a core tool for enterprise decision support, market insights, and operational optimization. Business data comes from a wide range of sources, encompassing multi-dimensional and heterogeneous data such as transaction records, user behavior, supply chain information, and external market dynamics. Its efficient integration and in-depth analysis are crucial for enhancing the intelligence level of enterprises.
[0003] Among them, business data analytics technology based on dynamic data reuse aims to reduce redundant computation, improve analysis efficiency, and achieve data value integration across business scenarios by flexibly scheduling and reusing processed or intermediate data. This type of approach typically involves key technical aspects such as data lineage tracing, context-aware caching, and task dependency modeling to support agile response and resource optimization in complex analysis processes.
[0004] Existing technologies still have significant shortcomings in achieving dynamic reuse and integrated analysis of business data: First, most systems adopt a static data pipeline architecture, making it difficult to dynamically adjust data reuse strategies based on real-time query needs or business changes, leading to wasted computing resources and response delays. Second, the lack of a unified data semantic layer and context-aware mechanism makes it difficult to effectively share and reuse data results across different analysis tasks. Third, existing solutions generally neglect data timeliness and consistency constraints, easily causing distortion of analysis results in high-frequency update scenarios. Furthermore, mainstream platforms have limited capabilities in fusing and processing multi-source heterogeneous data, failing to achieve end-to-end analysis continuity while ensuring performance. Finally, the lack of intelligent scheduling engines tailored to business scenarios makes it difficult to automatically identify reusable data units and optimize execution paths in complex dependencies. These problems severely restrict the real-time performance, flexibility, and intelligence of business data analysis systems, urgently requiring a new system architecture and method capable of achieving dynamic data reuse and end-to-end integration.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a comprehensive business data analysis system and method based on dynamic data reuse, which solves the problems mentioned in the background section.
[0006] To achieve the above objectives, the present invention provides the following technical solution: a comprehensive business data analysis method based on dynamic data reuse, comprising the following specific steps:
Step 1: Construct a unified data semantic layer: standardize and model multi-source heterogeneous business data from transaction records, user behavior, supply chain information and external market dynamics, establish a unified semantic model covering entities, attributes, relationships and business rules, and dynamically bind analysis tasks and data semantics through a context-aware mechanism;
Step 2: Perform dynamic data lineage tracing: in the data processing flow, record the generation source, transformation logic and dependencies of each data unit in real time to form a traceable data lineage map, and identify reusable intermediate data based on the map;
Step 3: Implement a context-aware caching strategy: based on the business context, query pattern, and data timeliness requirements of the current analysis task, dynamically select the cache granularity and storage location to intelligently manage the cache of processed or intermediate data;
Step 4: Build an intelligent scheduling engine: based on the data lineage graph and task dependencies, automatically identify reusable data units, generate the optimal execution path, and dynamically adjust the scheduling strategy at runtime to respond to business changes;
Step 5: Achieve end-to-end analysis chain connectivity: seamlessly integrate data access, cleaning, modeling, calculation and visualization, ensure data consistency and timeliness through a unified execution framework, and support the reuse and collaboration of analysis results across business scenarios.
[0007] Preferably, the unified data semantic layer in step 1 adopts a layered modeling structure, including a basic layer, a business layer, and an application layer. The basic layer defines general data types and metadata specifications, the business layer encapsulates industry-specific entities and rules, and the application layer maps the semantic requirements of specific analysis scenarios. The three layers are dynamically associated through a semantic mapping table.
[0008] Preferably, in step 2, the data lineage tracking adopts a lightweight log injection mechanism, which automatically records the input data identifier, processing operator type, output data identifier and timestamp at each data processing node. The lineage information is stored in the form of a directed acyclic graph and supports graph slicing by time window or business dimension.
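The lineage record described above can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the names `LineageEvent` and `LineageGraph` are hypothetical, and a plain Python list stands in for whatever log channel and graph store a real deployment would use. Each event carries the four fields named in the text (input identifier, operator type, output identifier, timestamp), edges that would break the acyclic property are rejected, and the graph can be sliced by time window.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    """One lineage log entry emitted by a processing node (hypothetical shape)."""
    input_id: str
    operator: str
    output_id: str
    timestamp: float  # seconds since epoch


class LineageGraph:
    """Directed acyclic graph of data units, built from lineage events."""

    def __init__(self):
        self.edges = []                      # all recorded events, in arrival order
        self.downstream = defaultdict(set)   # data unit -> directly derived units

    def _reachable(self, start, target):
        """Return True if `target` can be reached from `start` along edges."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.downstream[node])
        return False

    def record(self, event):
        """Add an edge input -> output, refusing edges that would form a cycle."""
        if self._reachable(event.output_id, event.input_id):
            raise ValueError("edge would make the lineage graph cyclic")
        self.edges.append(event)
        self.downstream[event.input_id].add(event.output_id)

    def slice_by_time(self, start, end):
        """Return the events whose timestamp falls in [start, end)."""
        return [e for e in self.edges if start <= e.timestamp < end]
```

Slicing by business dimension would follow the same pattern with an extra field on the event; that field is omitted here because the text does not specify its form.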
[0009] Preferably, in step 3, the context-aware caching strategy is divided into three categories based on the data update frequency: static caching, near real-time caching, and streaming caching. Static caching is suitable for historical dimension tables, and the cache validity period is greater than or equal to a preset time threshold. Near real-time caching is suitable for daily updated summary indicators, and the cache validity period is a preset time period. Streaming caching is suitable for event stream data updated at the second level, and the cache validity period is less than or equal to a preset time threshold.
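A minimal sketch of the three cache categories, assuming the validity periods given later in Example 2 (7 days, 24 hours, 5 minutes). The `choose_tier` boundaries and the names `Tier` and `TieredCache` are illustrative assumptions; the text only states that the category depends on update frequency.

```python
from enum import Enum


class Tier(Enum):
    STATIC = "static"
    NEAR_REAL_TIME = "near_real_time"
    STREAMING = "streaming"


# Validity periods in seconds, taken from Example 2 of this document.
TTL_SECONDS = {
    Tier.STATIC: 7 * 24 * 3600,
    Tier.NEAR_REAL_TIME: 24 * 3600,
    Tier.STREAMING: 5 * 60,
}


def choose_tier(update_interval_seconds):
    """Pick a cache category from how often the data is updated.

    The 60-second and 24-hour boundaries are assumptions for illustration.
    """
    if update_interval_seconds <= 60:
        return Tier.STREAMING
    if update_interval_seconds <= 24 * 3600:
        return Tier.NEAR_REAL_TIME
    return Tier.STATIC


class TieredCache:
    """In-memory cache whose entries expire according to their tier."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry time in seconds)

    def put(self, key, value, tier, now):
        self._store[key] = (value, now + TTL_SECONDS[tier])

    def get(self, key, now):
        """Return the cached value, or None if it is absent or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._store[key]
            return None
        return value
```

Passing `now` explicitly keeps the sketch deterministic; a real cache would read a clock instead.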
[0010] Preferably, in step 4, the intelligent scheduling engine has a built-in deep learning-based reuse prediction model. This model takes historical task characteristics, data access patterns and resource load as inputs and outputs the reuse probability of each intermediate data unit. When the reuse probability is greater than or equal to a preset probability threshold, the reuse path is scheduled first instead of being recalculated.
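To illustrate the threshold decision, the sketch below runs a forward pass through a small fully connected network and compares the output with a probability threshold. The layer sizes (224 inputs, hidden layers of 256, 128 and 64 units, ReLU) and the 0.85 threshold follow Example 2 of this document. The random weights, the sigmoid output unit, and the names `ReusePredictor` and `should_reuse` are assumptions; a trained model would replace the random initialization.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """Apply one fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Random placeholder weights; a trained model would load real ones."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class ReusePredictor:
    """Feed-forward network mapping task features to a reuse probability."""

    def __init__(self, sizes=(224, 256, 128, 64), seed=0):
        rng = random.Random(seed)
        self.hidden = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
        self.out = init_layer(rng, sizes[-1], 1)

    def predict(self, features):
        x = list(features)
        for weights, biases in self.hidden:
            x = dense(x, weights, biases, relu)
        weights, biases = self.out
        return dense(x, weights, biases, sigmoid)[0]


def should_reuse(probability, threshold=0.85):
    """Prefer the reuse path when the probability reaches the threshold."""
    return probability >= threshold
```

With untrained weights the predicted probability is meaningless; the sketch only shows the shape of the computation and the comparison against the preset threshold.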
[0011] Preferably, the end-to-end analysis link in step 5 is achieved through a unified execution framework. This framework provides standardized interfaces to connect with various data sources and computing engines, ensuring that the entire link from raw data to final analysis results is executed in a single transaction context, with a transaction isolation level of repeatable read and a data consistency error rate lower than a preset error threshold.
[0012] Preferably, the unified data semantic layer supports dynamic semantic expansion. When a new business scenario is added, new entity definitions and rule sets can be uploaded through the semantic registration interface. The system automatically verifies their compatibility with the existing semantic model and incorporates them into the unified semantic space after passing the verification. The expansion process takes less than or equal to a preset time threshold.
[0013] Preferably, the data lineage map supports reverse impact analysis. When a basic data source changes, the system can automatically identify all affected downstream analysis tasks and trigger corresponding data recalculation or cache invalidation mechanisms. The accuracy rate of impact range identification is greater than or equal to a preset accuracy threshold.
[0014] Preferably, in the context-aware caching strategy, the cache granularity is dynamically adjusted according to the coverage of the query predicate. For highly selective queries, the cache granularity is refined to the level of a single record. For aggregate queries, the cache granularity is partitioned by dimension combination, and the upper limit of the number of partitions is a preset threshold.
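The granularity rule can be sketched as a small decision function. The default thresholds (selectivity above 0.95, at most 1024 dimension combinations) and the MD5 partition key are the values given in Example 2; the `"dataset"` fallback for queries that match neither case is an assumption, since the text does not say what happens then. Selectivity is taken here as the fraction of records filtered out, matching the definition used in Example 2.

```python
import hashlib


def selectivity(matched_records, total_records):
    """Fraction of records removed by the predicate (1.0 means nothing matches)."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1.0 - matched_records / total_records


def choose_granularity(predicate_selectivity, is_aggregate, dimension_combinations=0,
                       selectivity_threshold=0.95, max_partitions=1024):
    """Return 'record', 'partition', or 'dataset' (the last is an assumed fallback)."""
    if is_aggregate:
        if dimension_combinations <= max_partitions:
            return "partition"
        return "dataset"
    if predicate_selectivity > selectivity_threshold:
        return "record"
    return "dataset"


def partition_key(dimension_values):
    """MD5 digest of the dimension values, used as the partition cache key."""
    joined = "|".join(str(v) for v in dimension_values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

For example, a point lookup that keeps 10 of 1,000 records has selectivity 0.99 and is cached per record, while a `GROUP BY region, category` query with 100 combinations is cached per partition.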
[0015] When generating execution paths, the intelligent scheduling engine comprehensively considers computing resource consumption, I/O overhead, and network transmission latency, and uses a weighted scoring model to evaluate reused paths and recalculated paths. The reuse strategy is activated only when the comprehensive score of the reused path exceeds that of the recalculated path by at least a preset proportion threshold.
[0016] The end-to-end analysis link supports multi-tenant isolation, ensuring that analysis tasks from different business departments are logically completely isolated, each with its own independent semantic space, cache area, and scheduling queue, but sharing underlying computing and storage resources, thereby improving resource utilization to a preset level.
A business data analysis system based on dynamic data reuse includes a semantic modeling module, a lineage tracing module, a cache management module, an intelligent scheduling module, and a unified execution framework module. The semantic modeling module is used to construct a unified semantic model covering entities, attributes, relationships and business rules, and dynamically binds analysis tasks and data semantics through a context-aware mechanism. The lineage tracing module is used to record the generation source, transformation logic and dependency relationship of each data unit in real time during the data processing flow, forming a traceable data lineage map. The cache management module is used to dynamically select the cache granularity and storage location based on the business context, query mode, and data timeliness requirements of the current analysis task. The intelligent scheduling module is used to automatically identify reusable data units, generate the optimal execution path, and dynamically adjust the scheduling strategy during runtime based on the data lineage graph and task dependency relationship. The unified execution framework module is used to seamlessly integrate the data access, cleaning, modeling, calculation and visualization processes, and ensures data consistency and timeliness through the unified execution framework.
[0017] This invention provides a comprehensive business data analysis system and method based on dynamic data reuse, which has the following advantages: (1) During system operation, a unified and context-aware business semantic model is constructed through the semantic modeling module, enabling analysis tasks to be dynamically bound to data at the semantic level, significantly reducing semantic ambiguity and understanding costs during cross-system and cross-specification analysis; the lineage tracing module records the generation source, transformation logic, and dependencies of data throughout its entire lifecycle in real time, forming a complete and traceable data lineage map, effectively improving the interpretability, auditability, and compliance of data analysis results; the cache management module dynamically adjusts the cache granularity and storage location based on business context, query mode, and data timeliness requirements, ensuring data freshness while reducing redundant calculations and invalid accesses, significantly improving the overall response efficiency of the system; the intelligent scheduling module automatically identifies reusable data units based on data lineage graphs and generates optimal execution paths, enabling dynamic optimization and adaptive resource scheduling during the execution of analysis tasks, thereby reducing computational resource consumption and improving system throughput; the unified execution framework module integrates data access, cleaning, modeling, computation, and visualization, avoiding data consistency issues caused by fragmented multi-system architectures. Overall, it achieves high reusability, high consistency, high real-time performance, and high scalability in the business data analysis process, effectively solving problems such as severe redundant calculations, opaque links, low scheduling efficiency, and difficulty in unifying analysis results in existing business data analysis systems.
[0018] (2) This invention achieves unified modeling and dynamic binding of multi-source heterogeneous business data at the semantic level by constructing a unified data semantic layer and introducing a context-aware mechanism during the execution of analysis tasks. By standardizing and abstracting entities, attributes, relationships, and business rules and managing them in layers, it can effectively reduce the analysis bias caused by inconsistent standards and semantics between different data sources, and enable analysis tasks to have a consistent data understanding foundation when called across business scenarios and systems, thereby improving the consistency and reusability of business data analysis results.
[0019] (3) This invention introduces a dynamic data lineage tracking mechanism into the data processing flow, which records the generation source, transformation logic, and dependencies of data units in real time and organizes and manages them in the form of a lineage graph, making the generation process of intermediate data traceable. Based on the lineage graph, the system can automatically identify intermediate data units in the analysis process that have reusable conditions, reduce resource consumption caused by repeated calculations, and accurately locate the affected downstream tasks through reverse impact analysis when the basic data changes, thereby improving the data reliability and operational stability of the system in complex analysis scenarios.
[0020] (4) This invention, through the collaborative design of a context-aware caching strategy and an intelligent scheduling mechanism, dynamically adjusts the caching granularity, storage location, and execution path selection based on business context, query mode, and data timeliness requirements. Supported by a unified execution framework, it achieves end-to-end connectivity between data access, processing, and analysis result output, ensuring consistency and timeliness throughout the entire data processing process. Compared to traditional segmented or static scheduling analysis methods, this invention can improve the overall system execution efficiency while ensuring the correctness of the analysis, and supports the reuse and collaborative processing of analysis results across business scenarios.
Attached Figure Description
[0021] Figure 1 is a schematic diagram of the overall technical solution architecture of the comprehensive business data analysis system and method based on dynamic data reuse proposed in this invention; Figure 2 is a schematic diagram of the core principle framework in which the intelligent scheduling engine and dynamic data lineage tracking work together in this invention; Figure 3 is a schematic diagram of the multi-level interaction relationship and data flow between the unified data semantic layer, the context-aware caching strategy, and the end-to-end analysis link in this invention.
Detailed Implementation
[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.
[0023] Example 1: This embodiment provides a comprehensive business data analysis method based on dynamic data reuse. Referring to Figure 1, the method includes the following steps:
Step 1: Construct a unified data semantic layer: standardize and model multi-source heterogeneous business data from transaction records, user behavior, supply chain information and external market dynamics, establish a unified semantic model covering entities, attributes, relationships and business rules, and dynamically bind analysis tasks and data semantics through a context-aware mechanism;
Step 2: Perform dynamic data lineage tracing: record the generation source, transformation logic and dependencies of each data unit in real time in the data processing flow, form a traceable data lineage map, and identify reusable intermediate data based on the map;
Step 3: Implement a context-aware caching strategy: based on the business context, query pattern, and data timeliness requirements of the current analysis task, dynamically select the cache granularity and storage location to intelligently manage the cache of processed or intermediate data;
Step 4: Build an intelligent scheduling engine that automatically identifies reusable data units based on data lineage graphs and task dependencies, generates the optimal execution path, and dynamically adjusts the scheduling strategy at runtime to respond to business changes;
Step 5: Achieve end-to-end analysis connectivity: seamlessly integrate data access, cleaning, modeling, computation, and visualization, ensure data consistency and timeliness through a unified execution framework, and support the reuse and collaboration of analysis results across business scenarios.
[0024] In this embodiment, a unified and context-aware business semantic model is constructed through a semantic modeling module, enabling analysis tasks to be dynamically bound to data at the semantic level. This significantly reduces semantic ambiguity and understanding costs during cross-system and cross-specification analysis. A lineage tracing module records the generation source, transformation logic, and dependencies of data throughout its entire lifecycle in real time, forming a complete and traceable data lineage graph, effectively improving the interpretability, auditability, and compliance of data analysis results. A cache management module dynamically adjusts cache granularity and storage location based on business context, query patterns, and data timeliness requirements, ensuring data freshness while reducing redundant calculations and invalid accesses, significantly improving the overall response efficiency of the system. The intelligent scheduling module automatically identifies reusable data units based on the data lineage graph and generates the optimal execution path, realizing dynamic optimization and adaptive resource scheduling during the analysis task, thereby reducing computing resource consumption and improving system throughput. The unified execution framework module integrates data access, cleaning, modeling, calculation, and visualization, avoiding data consistency problems caused by fragmented multi-system architectures. Overall, it achieves high reusability, high consistency, high real-time performance, and high scalability in the business data analysis process, effectively solving the problems of serious duplicate calculations, opaque links, low scheduling efficiency, and difficulty in unifying analysis results in existing business data analysis systems.
[0025] Example 2: This embodiment elaborates on Embodiment 1; please refer to Figure 1. Specifically, this embodiment is applied to the omnichannel business intelligence analytics platform of a large retail enterprise, aiming to support the real-time insight needs of processing over 1 billion transaction records, user behavior logs, and supply chain events daily. In this scenario, the enterprise needs to quickly respond to changes in sales trends during promotional activities and reuse intermediate calculation results from previous major promotional periods (such as aggregation of regional best-selling categories and user segmentation profiles) to reduce computational overhead and ensure timely decision-making.
[0026] First, a hardware and software collaborative system was built at the system architecture level to support the aforementioned dynamic data reuse capabilities. This system is deployed in a distributed cluster of 20 high-performance servers. Each server is equipped with dual Intel Xeon Gold 6348 processors (28 cores / 56 threads), 512GB of DDR4 ECC memory, four NVMe SSDs (total capacity 15.36TB), and dual-port 25GbE network cards. The cluster is interconnected via a Spine-Leaf topology, and the core switch uses a Mellanox Spectrum-3 ASIC chipset supporting the RoCEv2 protocol, ensuring that the RDMA communication latency between nodes is less than 1.2 microseconds.
[0027] At the software stack level, the system is divided into five core functional modules: semantic modeling module, lineage tracing module, cache management module, intelligent scheduling module, and unified execution framework module. Each module is deployed through a microservice architecture and runs on a Kubernetes container orchestration platform. Inter-service communication uses the gRPC over TLS 1.3 protocol, and the message serialization format is Protocol Buffers v3.
[0028] The semantic modeling module is deployed on a dedicated metadata server (equipped with 128GB of Intel Optane persistent memory). Internally, it contains a three-layer semantic model database: the base layer uses Apache Atlas as the metadata storage backend, defining common data types (such as DECIMAL(18,2) for amounts and TIMESTAMP(6) for timestamps) and metadata specifications conforming to the ISO/IEC 11179 standard; the business layer builds an industry entity relationship network based on the Neo4j graph database, pre-configuring a retail SKU classification system, store organizational structure, and membership level rules; the application layer uses Elasticsearch indexes to map semantic templates required for specific analysis scenarios (such as "real-time GMV monitoring for the 618 promotion" and "new product launch effect evaluation"). The three layers are dynamically linked at millisecond levels through a semantic mapping table (stored in Redis Cluster with 16 shards). The key-value pair structure of the mapping table is <business scenario ID, [base layer attribute set, business layer entity set]>.
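The mapping-table structure <business scenario ID, [base layer attribute set, business layer entity set]> can be sketched with an in-memory dictionary. Everything named below is hypothetical (`SemanticBinding`, `bind_task`, the scenario ID `SCENARIO_618_GMV`, and the sample attributes and entities), and a Python dict stands in for the Redis-backed table described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticBinding:
    """Value side of one mapping-table entry."""
    base_attributes: dict          # attribute name -> base-layer data type
    business_entities: frozenset   # business-layer entities used by the scenario


# Hypothetical sample entry; a dict stands in for the shared mapping table.
mapping_table = {
    "SCENARIO_618_GMV": SemanticBinding(
        base_attributes={"amount": "DECIMAL(18,2)", "event_time": "TIMESTAMP(6)"},
        business_entities=frozenset({"SKU", "Store"}),
    ),
}


def bind_task(scenario_id, required_attributes):
    """Resolve the semantics for an analysis task, failing fast on gaps."""
    binding = mapping_table.get(scenario_id)
    if binding is None:
        raise KeyError(f"no semantic mapping for scenario {scenario_id!r}")
    missing = set(required_attributes) - set(binding.base_attributes)
    if missing:
        raise ValueError(f"attributes not defined for scenario: {sorted(missing)}")
    return binding
```

A task that asks for `amount` under `SCENARIO_618_GMV` receives the binding with its types and entities; an unknown scenario or attribute is reported before any computation starts.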
[0029] The lineage tracing module is injected into the JVM processes of all data processing tasks as a Java Agent. Its core components include a lightweight log probe and a lineage graph builder. The log probe automatically captures input data source identifiers (such as Kafka Topic partition offsets, HDFS file paths + Block IDs), operator types (such as WindowAgg, JoinWithBroadcast), output data identifiers (such as Iceberg table snapshot IDs), and nanosecond-level timestamps before and after each operator execution in Spark Structured Streaming or Flink jobs, generating standardized lineage log events. These events are written to Apache Pulsar topics (number of partitions = total number of cluster CPU cores) via asynchronous non-blocking channels, consumed by the graph builder to construct a directed acyclic graph (DAG). This DAG is stored in the JanusGraph graph database. Vertex attributes include unique hash values of data units and data schema version numbers, while edge attributes record the AST (Abstract Syntax Tree) summary of the transformation logic and the execution environment fingerprint (such as JDK version and dependency library SHA256).
[0030] The cache management module is deployed in a hybrid storage pool of local memory and SSD on each compute node, and is implemented by the Alluxio distributed caching system. The cache policy controller dynamically allocates three types of cache areas based on the task context: a static cache area (30% of total memory) using an LRU eviction policy, dedicated to storing historical dimension tables (such as product master data and store geographic information), with a cache expiration of 7 days; a near real-time cache area (40% of total memory) using a Time-Based Eviction policy, storing daily updated summary metrics (such as a list of the top 100 best-selling products), with a validity period of 24 hours; and a streaming cache area (30% of total memory) based on a sliding time window mechanism, caching second-level event streams (such as user click streams and POS transaction streams), with a validity period of 5 minutes. The cache granularity controller dynamically adjusts the cache by parsing the predicate tree of the query predicate: when the predicate selectivity is higher than 0.95 (i.e., the number of records after filtering is less than 5% of the total number of records), single-record caching is enabled, and the cache key is the composite primary key hash value; when the query is a GROUP BY aggregation and the number of dimension combinations is ≤1024, partition caching is performed according to the combination of dimension values, and the partition key is the MD5 digest of the dimension values.
[0031] The intelligent scheduling engine, acting as the system's central hub, runs on a highly available master node (Active-Standby mode, coordinated by ZooKeeper). Its core comprises a deep learning reuse prediction model and a weighted path estimator. The reuse prediction model, built on TensorFlow Extended (TFX), takes a feature vector containing 128 dimensions of historical task features (e.g., task type encoding, input data volume, operator complexity), 64 dimensions of data access patterns (e.g., cache hit rate, I / O throughput fluctuation coefficient), and 32 dimensions of resource load metrics (e.g., CPU utilization, network queue depth). This vector is then processed by a three-layer fully connected neural network (with 256, 128, and 64 hidden neurons, ReLU activation) to output the reuse probability of each intermediate data unit. The model undergoes incremental online training every 24 hours using the Adam optimizer with a learning rate of 0.001. The weighted path estimator calculates a combined score for reused paths and recalculated paths when generating the execution plan. Reuse is only triggered when the reused path score is more than 10% higher than the recalculated path score.
[0032] Wherein, $\mathrm{Score}$ is the overall score of an execution path; $C_{norm}$ is the normalized value of computing resource consumption; $IO_{norm}$ is the normalized value of I/O overhead; and $L_{norm}$ is the normalized value of network transmission latency.
[0033] The path selection criterion is $\mathrm{Score}_{reuse} \geq 1.1 \times \mathrm{Score}_{recalc}$. In other words, reuse is only allowed if the reused path score is at least 10% higher than the recalculated path score.
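A minimal sketch of the weighted scoring and the 10% selection rule. The weights and the exact form of the score are not given in this text, so the function below makes two labeled assumptions: the three normalized costs lie in [0, 1], and the score is one minus their weighted average, so that a cheaper path scores higher. Only the selection rule itself comes from the source.

```python
def path_score(c_norm, io_norm, l_norm, weights=(0.4, 0.3, 0.3)):
    """Score an execution path from its normalized costs.

    Assumption: score = 1 - weighted average cost. The weights are
    placeholders, not values from the source.
    """
    for value in (c_norm, io_norm, l_norm):
        if not 0.0 <= value <= 1.0:
            raise ValueError("normalized costs must lie in [0, 1]")
    w_c, w_io, w_l = weights
    cost = w_c * c_norm + w_io * io_norm + w_l * l_norm
    return 1.0 - cost / (w_c + w_io + w_l)


def select_path(score_reuse, score_recalc, margin=0.10):
    """Choose reuse only if its score is at least `margin` higher, relatively."""
    if score_reuse >= (1.0 + margin) * score_recalc:
        return "reuse"
    return "recalculate"
```

With the scores quoted in this example, `select_path(0.87, 0.72)` chooses the reuse path, because 0.87 is above 1.1 × 0.72 = 0.792.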
[0034] The end-to-end execution framework is developed based on the integration of Apache Calcite and Apache Beam, providing standardized adapters to connect to more than 20 data sources (including Oracle RAC, MongoDB sharded clusters, SAP HANA, Google Analytics API, etc.) and computing engines (Spark, Flink, Presto). The framework embeds a transaction manager, employing a two-phase commit (2PC) protocol to coordinate cross-engine operations. The transaction isolation level is enforced to Repeatable Read, and a Global Consistent Snapshot mechanism ensures that the entire chain from raw data access to final visualization result generation is executed within a single transaction context. The multi-tenant isolation module allocates independent namespaces to each business unit (such as the e-commerce business unit and the offline retail business unit). Their semantic models, cache areas, and scheduling queues are logically completely isolated, but they share the underlying Kubernetes Pod resource pool, achieving fine-grained resource quota control through ResourceQuota and LimitRange.
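The two-phase commit coordination mentioned above can be sketched generically. The `Participant` class is a hypothetical stand-in for a per-engine adapter, and the sketch leaves out timeouts, crash recovery, and durable logging, all of which a production coordinator needs.

```python
class Participant:
    """Stand-in for one engine or data store taking part in the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote on whether the local work can be committed."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    """Commit on all participants, or roll back everywhere on any 'no' vote."""
    # Phase 1: collect votes, stopping at the first refusal.
    for participant in participants:
        if not participant.prepare():
            for other in participants:
                other.rollback()
            return False
    # Phase 2: every participant voted yes, so commit everywhere.
    for participant in participants:
        participant.commit()
    return True
```

If the visualization sink refuses in phase 1, the stream and table writers are rolled back too, so no partial result becomes visible.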
[0035] Based on the above system architecture, the system workflow is as follows: First, the semantic modeling module receives metadata change events (such as adding a "live-streaming e-commerce" business scenario) from the data governance platform. The semantic registration interface (RESTful API) receives a JSON payload containing new entity definitions (such as LiveStreamSession) and rule sets. The system first calls the compatibility validator to compare the matching degree of the new entity attributes with the common types of the base layer (such as the live-streaming start time must be TIMESTAMP(6)), and checks whether the business rules conflict with existing rules (such as avoiding overlap with the "short video e-commerce" rule). After the verification is passed, the new semantic element is written into the Neo4j business layer graph, and an entry <Scenario ID_LIVE is added to the Redis semantic mapping table. The entire extension process takes 28 seconds, which meets the requirement of ≤30 seconds.
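The compatibility check in the registration step can be sketched as a validator that returns a list of problems. The payload shape, the `STRING` base type, and the idea of detecting rule conflicts by rule ID are all assumptions for illustration; the text does not specify how overlap between rules is determined.

```python
# Base-layer types; DECIMAL(18,2) and TIMESTAMP(6) come from the text,
# STRING is an assumed extra for illustration.
BASE_TYPES = {"DECIMAL(18,2)", "TIMESTAMP(6)", "STRING"}


def validate_entity(entity, existing_rule_ids):
    """Return a list of compatibility errors; an empty list means it passes."""
    errors = []
    for attribute, type_name in entity["attributes"].items():
        if type_name not in BASE_TYPES:
            errors.append(f"{attribute}: type {type_name} is not a base-layer type")
    for rule_id in entity.get("rules", []):
        if rule_id in existing_rule_ids:
            errors.append(f"rule {rule_id} conflicts with an existing rule")
    return errors
```

A `LiveStreamSession` entity whose start time is declared as `TIMESTAMP(6)` passes, while one that declares it as `DATE` is rejected before anything is written to the semantic model.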
[0036] Subsequently, when a user initiates a task to "monitor the GMV of each livestream room during the 618 promotion in real time," the end-to-end execution framework first parses the SQL query statement and extracts the business context (scene ID_LIVE, time window = the most recent 5 minutes, data timeliness requirement = ≤10 seconds delay). This context is passed to the intelligent scheduling engine, which queries the lineage graph to identify reusable intermediate data units—for example, the "livestream room-product" association table calculated during yesterday's promotion. The scheduling engine calls the reuse prediction model, inputting the current task characteristics (livestream scene, high concurrency, small time window) and historical access patterns (this association table was accessed frequently yesterday). The model outputs a reuse probability of 89.7%, exceeding the 85% threshold.
[0037] Next, the weighted path estimator compares the two paths: Path A (reuse) requires loading the association table from the Alluxio streaming cache area (network transmission latency 12 ms, I/O overhead 0.8 GB/s), while Path B (recalculation) requires rejoining the Kafka live event stream and product master data (computing resource consumption 16 vCPU·min, I/O overhead 3.2 GB/s). The calculation shows that Path A has a comprehensive score of 0.87, while Path B has a score of 0.72. Path A's score is 0.15 higher, which is about 21% of Path B's score and exceeds the 10% threshold, so the reused path is selected. The scheduling engine generates an execution plan, instructs the cache manager to retrieve the association table from the streaming cache area (valid for 5 minutes), and instructs the lineage tracing agent to record the log events of this reuse operation.
[0038] During execution, if the product master data source changes (such as a price adjustment for a certain SKU), the lineage tracing agent captures the change event, and the lineage graph builder initiates a reverse impact analysis: by traversing all vertices in the DAG that have the SKU as an ancestor, it identifies 17 affected downstream tasks, including "live stream GMV calculation" and "category gross profit analysis." The system automatically triggers the cache invalidation mechanism, clears all cache partitions in Alluxio that depend on the SKU, and sends a recalculation request to the scheduling engine. The accuracy of impact scope identification reaches 99.3%, meeting the requirement of ≥99%.
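Reverse impact analysis amounts to collecting every descendant of the changed vertex in the lineage DAG. A minimal sketch using breadth-first traversal is shown below; the adjacency-dictionary representation and the node names are hypothetical and stand in for the lineage graph described in the text.

```python
from collections import deque


def affected_downstream(edges, changed):
    """Return every node reachable from `changed`.

    `edges` maps an upstream node to the list of nodes derived directly from it.
    """
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage fragment: SKU master data feeds two derived data units.
lineage = {
    "sku_master": ["room_product_assoc", "category_gross_profit"],
    "room_product_assoc": ["live_gmv"],
}
print(sorted(affected_downstream(lineage, "sku_master")))
```

Each node in the returned set would then have its cache partitions invalidated or be queued for recalculation.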
[0039] Finally, the end-to-end execution framework writes the cleaned live transaction stream, the reused association table, and the GMV results calculated in real time to a visualization service (such as Tableau Server) within a single transaction context. The transaction manager ensures that all write operations are committed atomically, and the data consistency error rate, as verified by sampling, is 0.07%, lower than the 0.1% threshold. The entire analysis chain, from data access to visualization, takes an average of 1.8 seconds, a 62% reduction compared to traditional segmented architectures.
[0040] Example 3 This embodiment is an explanation based on Embodiment 1; please refer to Figure 2. Specifically, this embodiment is a real-time anti-fraud analysis system for multinational financial institutions. It must process transaction records, customer identity information, and external credit data from 30 branches around the world, complete the judgment of suspicious transactions within 500 milliseconds, and reuse intermediate features mined from historical fraud patterns (such as abnormal transfer frequency and high-risk merchant association).
[0041] The system hardware deployment adopts a hybrid cloud architecture: the core data processing cluster resides in a private cloud (10 Dell PowerEdge R750xa servers, each configured with two NVIDIA A100 40GB GPUs to accelerate deep learning inference), while edge nodes are distributed across regional data centers (each node is one HPE ProLiant DL360 Gen10 server responsible for local data preprocessing). The network is interconnected with AWS Direct Connect via a leased MPLS line, and IPSec encryption is enabled for data transmission between the private and public clouds.
[0042] The semantic modeling module strengthens security controls tailored to the characteristics of the financial industry: the base layer adds PCI DSS compliance field markers (such as PAN card number, CVV code); the business layer incorporates Basel III risk entity models (such as Counterparty, Exposure); and the application layer includes pre-set anti-fraud scenario templates (such as "cross-border fast in and out" and "distributed transfer in and centralized transfer out"). The semantic mapping table is stored using the national cryptographic standard SM4 encryption, with the key dynamically rotated by HashiCorp Vault.
[0043] The lineage tracing module integrates hardware-level performance monitoring: in GPU-accelerated tasks, the log probe additionally collects metrics such as CUDA kernel execution time and memory bandwidth utilization, and the lineage graph edge attributes are extended to include a summary of the GPU utilization curve. Lineage information is stored in DAOS object storage that supports GPU Direct Storage (GDS), enabling zero-copy graph lookup.
[0044] The cache management module introduces a tiered security caching strategy: data involving PII (Personally Identifiable Information) is cached only in the SGX enclave memory area of the private cloud nodes, with a validity period of ≤10 minutes; non-sensitive aggregate metrics (such as regional fraud rates) can be cached in public cloud S3 Intelligent-Tiering storage, with a validity period of 24 hours. The cache granularity controller dynamically de-identifies data according to GDPR compliance requirements: when a query involves EU customers, single-record cache granularity is automatically coarsened to the country/region level.
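The tiered placement rules above can be expressed as a small decision function. This is a sketch under stated assumptions: the `CachePlacement` structure, the location labels, and the granularity labels are hypothetical names, while the validity periods and the EU coarsening rule come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePlacement:
    location: str
    max_ttl_seconds: int
    granularity: str


def place(contains_pii: bool, involves_eu_customers: bool,
          requested_granularity: str = "record") -> CachePlacement:
    """Pick a cache location, validity period, and granularity for one data unit."""
    granularity = requested_granularity
    # GDPR rule from the text: single-record caching is coarsened for EU customers.
    if involves_eu_customers and requested_granularity == "record":
        granularity = "country_region"
    if contains_pii:
        # PII stays in private-cloud enclave memory for at most 10 minutes.
        return CachePlacement("private_cloud_sgx_enclave", 10 * 60, granularity)
    # Non-sensitive aggregates may live in public-cloud storage for 24 hours.
    return CachePlacement("public_cloud_s3_intelligent_tiering", 24 * 3600, granularity)


print(place(contains_pii=True, involves_eu_customers=True))
```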
[0045] The intelligent scheduling engine's reuse prediction model adds financial risk control features: the input vector includes dimensions such as regulatory compliance costs (e.g., GDPR penalty risk coefficient) and expected fraud losses. The weighted path evaluator introduces a risk adjustment factor: if the reuse path involves cross-regional data transmission, an additional 0.15 points will be deducted (due to compliance review delays).
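The risk adjustment factor can be sketched as a fixed deduction applied before paths are compared. The function and constant names are hypothetical; only the 0.15-point deduction for cross-regional transmission is taken from the text.

```python
CROSS_REGION_PENALTY = 0.15  # deduction stated in the text for cross-regional transfer


def risk_adjusted_score(base_score: float, cross_region: bool) -> float:
    """Apply the cross-regional deduction to a reuse path's score."""
    return base_score - CROSS_REGION_PENALTY if cross_region else base_score


# Illustrative value only: a reuse path scored 0.87 before adjustment.
print(round(risk_adjusted_score(0.87, cross_region=True), 2))
```

Because the deduction is applied to the reuse path only, a cross-regional reuse candidate needs a correspondingly larger raw advantage over recalculation to be selected.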
[0046] The end-to-end execution framework manages transaction keys through a FIPS 140-2 Level 3 certified HSM (Hardware Security Module), ensuring end-to-end encryption. Multi-tenant isolation is divided by jurisdiction (e.g., EU, North America), with each region having its own independent encryption key domain.
[0047] In the workflow, when a cross-border transaction triggers the risk control rules, the system reuses intermediate features from the XGBoost fraud scoring model trained the previous week. The reuse prediction model outputs a reuse probability of 91.2%, and path evaluation shows that reuse saves 42% of computing resources. If the central bank's credit database is updated (e.g., a new list of high-risk merchants is added), reverse impact analysis identifies the 23 affected models and automatically triggers incremental retraining. The end-to-end response time is 480 milliseconds, meeting the <500 millisecond requirement.
[0048] Example 4 This embodiment is an explanation based on Embodiment 1; please refer to Figure 3. Specifically, the unified data semantic layer adopts a layered modeling structure, including a basic layer, a business layer, and an application layer. The basic layer defines general data types and metadata specifications, the business layer encapsulates industry-specific entities and rules, and the application layer maps the semantic requirements of specific analysis scenarios. The three layers are dynamically associated through a semantic mapping table.
[0049] The dynamic data lineage tracking adopts a lightweight log injection mechanism, which automatically records the input data identifier, processing operator type, output data identifier and timestamp at each data processing node. The lineage information is stored in the form of a directed acyclic graph and supports graph slicing by time window or business dimension.
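A lineage record with the four fields named above, together with a time-window slice of the resulting graph, might look like the following sketch. The `LineageEvent` structure and the half-open window convention are assumptions; the text specifies only which fields are recorded and that slicing by time window is supported.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LineageEvent:
    input_ids: tuple      # identifiers of the input data units
    operator: str         # processing operator type
    output_id: str        # identifier of the output data unit
    timestamp: datetime


def slice_by_window(events, start, end):
    """Return DAG edges (input_id, output_id, operator) for start <= timestamp < end."""
    return [
        (input_id, event.output_id, event.operator)
        for event in events
        if start <= event.timestamp < end
        for input_id in event.input_ids
    ]


# Hypothetical events emitted by two processing nodes.
events = [
    LineageEvent(("orders", "sku_master"), "join", "room_product_assoc", datetime(2026, 6, 18, 9, 0)),
    LineageEvent(("room_product_assoc",), "aggregate", "live_gmv", datetime(2026, 6, 18, 10, 0)),
]
print(slice_by_window(events, datetime(2026, 6, 18, 8, 0), datetime(2026, 6, 18, 9, 30)))
```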
[0050] The context-aware caching strategy is divided into three categories based on the data update frequency: static caching, near real-time caching, and streaming caching. Static caching is suitable for historical dimension tables, near real-time caching is suitable for daily updated summary metrics, and streaming caching is suitable for event stream data updated at the second level.
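The three-way classification can be sketched as a function of the data's update interval. The numeric cutoffs below are assumptions chosen for illustration; the text gives only qualitative guidance (historical dimension tables, daily updated metrics, second-level event streams).

```python
from typing import Optional

# Assumed cutoffs, not specified in the text.
STREAMING_MAX_INTERVAL_S = 60            # updates arriving within about a minute
NEAR_REAL_TIME_MAX_INTERVAL_S = 86_400   # updates arriving at least daily


def cache_category(update_interval_s: Optional[float]) -> str:
    """Map an update interval in seconds (None = never updated) to a cache category."""
    if update_interval_s is None:
        return "static"
    if update_interval_s <= STREAMING_MAX_INTERVAL_S:
        return "streaming"
    if update_interval_s <= NEAR_REAL_TIME_MAX_INTERVAL_S:
        return "near_real_time"
    return "static"


print(cache_category(1), cache_category(86_400), cache_category(None))
```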
[0051] The intelligent scheduling engine has a built-in deep learning-based reuse prediction model. This model takes historical task characteristics, data access patterns and resource load as inputs and outputs the reuse probability of each intermediate data unit. When the reuse probability is greater than or equal to a preset probability threshold, the reuse path is scheduled first instead of being recalculated.
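The threshold rule can be sketched independently of the prediction model itself. In the sketch below the probabilities are supplied directly, standing in for the output of the deep learning model, and the default threshold of 0.85 is borrowed from the livestream example in the description; the function name and data unit names are hypothetical.

```python
def split_by_reuse(probabilities, threshold=0.85):
    """Partition data units into (reuse, recompute) lists by predicted reuse probability.

    A unit is scheduled for reuse when its probability is >= threshold.
    """
    reuse, recompute = [], []
    for unit, probability in probabilities.items():
        if probability >= threshold:
            reuse.append(unit)
        else:
            recompute.append(unit)
    return reuse, recompute


# Stand-in model outputs; a real deployment would obtain these from the model.
predicted = {"room_product_assoc": 0.897, "hourly_gmv": 0.40}
print(split_by_reuse(predicted))
```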
[0052] The end-to-end analysis chain is achieved through a unified execution framework, which provides standardized interfaces to connect with various data sources and computing engines, ensuring that the entire chain from raw data to final analysis results is executed in a single transaction context with a transaction isolation level of repeatable read.
[0053] The unified data semantic layer supports dynamic semantic expansion. When a new business scenario is added, new entity definitions and rule sets are uploaded through the semantic registration interface. The system automatically verifies their compatibility with the existing semantic model and incorporates them into the unified semantic space after passing the verification.
[0054] The data lineage map supports reverse impact analysis. When a basic data source changes, the system automatically identifies all affected downstream analysis tasks and triggers corresponding data recalculation or cache invalidation mechanisms.
[0055] In the context-aware caching strategy, the cache granularity is dynamically adjusted according to the coverage of the query predicate. For highly selective queries, the cache granularity is refined to the level of a single record; for aggregate queries, the cache granularity is partitioned according to the combination of dimensions.
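One way to picture the granularity rule is as a function of whether the query aggregates and how small a fraction of rows its predicate matches. The selectivity cutoff, the returned labels, and the "default" fallback for non-aggregate queries that are not highly selective are all assumptions of this sketch; the text covers only the two named cases.

```python
def cache_granularity(is_aggregate, group_by_dims, matched_fraction, selective_cutoff=0.01):
    """Return (granularity, partition_key) for a query.

    matched_fraction is the share of rows the predicate is expected to match.
    """
    if is_aggregate:
        # Aggregate queries: partition the cache by the combination of dimensions.
        return ("dimension_partition", tuple(sorted(group_by_dims)))
    if matched_fraction <= selective_cutoff:
        # Highly selective queries: cache at single-record level.
        return ("record", ())
    # Not covered by the text; left to a default policy in this sketch.
    return ("default", ())


print(cache_granularity(True, ["region", "category"], 0.6))
print(cache_granularity(False, [], 0.0001))
```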
[0056] In this embodiment, the present invention achieves semantic consistency, processing traceability, and execution efficiency throughout the entire business data analysis process through the collaborative design of a unified data semantic layer, dynamic data lineage tracking, context-aware caching strategy, intelligent scheduling engine, and unified execution framework. By constructing a layered unified data semantic layer and supporting dynamic semantic expansion, new business scenarios can be quickly incorporated into the unified semantic space without affecting the existing semantic system, reducing the complexity of semantic maintenance during system evolution. By employing a lightweight log injection dynamic data lineage tracking mechanism combined with reverse impact analysis, accurate tracking of data generation and evolution processes is achieved. When basic data changes, affected downstream analysis tasks can be identified in a timely manner, triggering corresponding recalculation or cache invalidation operations, thereby improving the reliability of data analysis results. Through the synergistic effect of a context-aware caching strategy and a deep learning-based reuse prediction model, the cache granularity and execution path can be dynamically selected based on data update frequency, query predicate characteristics, and resource load. This reduces redundant calculations and unnecessary data access while ensuring transaction consistency and isolation, comprehensively improving the execution efficiency and stability of the business data analysis system under complex and multi-scenario conditions.
[0057] Example 5 A business data analysis system based on dynamic data reuse; please refer to Figure 1. Specifically, it includes a semantic modeling module, a lineage tracing module, a cache management module, an intelligent scheduling module, and a unified execution framework module. The semantic modeling module is used to construct a unified semantic model covering entities, attributes, relationships and business rules, and dynamically binds analysis tasks and data semantics through a context-aware mechanism. The lineage tracing module is used to record the generation source, transformation logic and dependency relationship of each data unit in real time during the data processing flow, forming a traceable data lineage map. The cache management module is used to dynamically select the cache granularity and storage location based on the business context, query pattern, and data timeliness requirements of the current analysis task. The intelligent scheduling module is used to automatically identify reusable data units, generate the optimal execution path, and dynamically adjust the scheduling strategy during runtime based on the data lineage graph and task dependency relationship. The unified execution framework module is used to seamlessly integrate the data access, cleaning, modeling, calculation and visualization processes, and ensures data consistency and timeliness through the unified execution framework.
[0058] In this embodiment, the present invention constructs a comprehensive system architecture for business data analysis by synergistically integrating a semantic modeling module, a lineage tracing module, a cache management module, an intelligent scheduling module, and a unified execution framework module. The semantic modeling module builds a unified semantic model and introduces a context-aware mechanism, enabling analysis tasks to accurately match data semantics during execution and reducing semantic ambiguity arising from multi-source data. The lineage tracing module records the entire data processing process in real time, forming a traceable data lineage map, making data sources and transformation relationships clearly traceable, thereby improving the reliability and auditability of analysis results. Through the synergistic effect of the cache management module and the intelligent scheduling module, the system can automatically identify reusable data units and generate optimized execution paths based on business context, query patterns, and task dependencies, reducing redundant calculations and resource consumption while ensuring data consistency and timeliness. The unified execution framework module provides unified management of data access, processing, and result output, avoiding data inconsistency issues caused by fragmented multi-system architectures, and overall improving the execution efficiency, stability, and scalability of the business data analysis system in complex business scenarios.
[0059] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A comprehensive business data analysis method based on dynamic data reuse, characterized in that it includes the following steps:
Step 1: Construct a unified data semantic layer to standardize and model multi-source heterogeneous business data from transaction records, user behavior, supply chain information and external market dynamics, establish a unified semantic model covering entities, attributes, relationships and business rules, and dynamically bind analysis tasks and data semantics through a context-aware mechanism;
Step 2: Perform dynamic data lineage tracing, record the generation source, transformation logic and dependency of each data unit in real time in the data processing flow, form a traceable data lineage map, and identify reusable intermediate state data based on the map;
Step 3: Implement a context-aware caching strategy that, based on the business context, query pattern, and data timeliness requirements of the current analysis task, dynamically selects the cache granularity and storage location to intelligently manage the cache of processed or intermediate data;
Step 4: Build an intelligent scheduling engine that automatically identifies reusable data units based on data lineage graphs and task dependencies, generates the optimal execution path, and dynamically adjusts the scheduling strategy at runtime to respond to business changes;
Step 5: Achieve end-to-end analysis connectivity by seamlessly integrating data access, cleaning, modeling, computation, and visualization, ensuring data consistency and timeliness through a unified execution framework, and supporting the reuse and collaboration of analysis results across business scenarios.
2. The integrated business data analysis method based on dynamic data reuse according to claim 1, characterized in that: The unified data semantic layer adopts a layered modeling structure, including a basic layer, a business layer, and an application layer. The basic layer defines general data types and metadata specifications, the business layer encapsulates industry-specific entities and rules, and the application layer maps the semantic requirements of specific analysis scenarios. The three layers are dynamically associated through a semantic mapping table.
3. The integrated business data analysis method based on dynamic data reuse according to claim 1, characterized in that: The dynamic data lineage tracking adopts a lightweight log injection mechanism, which automatically records the input data identifier, processing operator type, output data identifier and timestamp at each data processing node. The lineage information is stored in the form of a directed acyclic graph and supports graph slicing by time window or business dimension.
4. The integrated business data analysis method based on dynamic data reuse according to claim 1, characterized in that: The context-aware caching strategy is divided into three categories based on the data update frequency: static caching, near real-time caching, and streaming caching. Static caching is suitable for historical dimension tables, near real-time caching is suitable for daily updated summary metrics, and streaming caching is suitable for event stream data updated at the second level.
5. The integrated business data analysis method based on dynamic data reuse according to claim 1, characterized in that: The intelligent scheduling engine has a built-in deep learning-based reuse prediction model. This model takes historical task characteristics, data access patterns and resource load as inputs and outputs the reuse probability of each intermediate data unit. When the reuse probability is greater than or equal to a preset probability threshold, the reuse path is scheduled first instead of being recalculated.
6. The integrated business data analysis method based on dynamic data reuse according to claim 1, characterized in that: The end-to-end analysis chain is achieved through a unified execution framework, which provides standardized interfaces to connect with various data sources and computing engines, ensuring that the entire chain from raw data to final analysis results is executed in a single transaction context with a transaction isolation level of repeatable read.
7. The integrated business data analysis method based on dynamic data reuse according to claim 2, characterized in that: The unified data semantic layer supports dynamic semantic expansion. When a new business scenario is added, new entity definitions and rule sets are uploaded through the semantic registration interface. The system automatically verifies their compatibility with the existing semantic model and incorporates them into the unified semantic space after passing the verification.
8. The integrated business data analysis method based on dynamic data reuse according to claim 3, characterized in that: The data lineage map supports reverse impact analysis. When a basic data source changes, the system automatically identifies all affected downstream analysis tasks and triggers corresponding data recalculation or cache invalidation mechanisms.
9. The integrated business data analysis method based on dynamic data reuse according to claim 4, characterized in that: In the context-aware caching strategy, the cache granularity is dynamically adjusted according to the coverage of the query predicate. For highly selective queries, the cache granularity is refined to the level of a single record; for aggregate queries, the cache granularity is partitioned according to the combination of dimensions.
10. A business data analysis system based on dynamic data reuse, applied to the business data analysis method based on dynamic data reuse as described in any one of claims 1 to 9, characterized in that: It includes a semantic modeling module, a lineage tracing module, a cache management module, an intelligent scheduling module, and a unified execution framework module; The semantic modeling module is used to construct a unified semantic model covering entities, attributes, relationships and business rules, and dynamically binds analysis tasks and data semantics through a context-aware mechanism; The lineage tracing module is used to record the generation source, transformation logic and dependency relationship of each data unit in real time during the data processing flow, forming a traceable data lineage map; The cache management module is used to dynamically select the cache granularity and storage location based on the business context, query pattern, and data timeliness requirements of the current analysis task; The intelligent scheduling module is used to automatically identify reusable data units, generate the optimal execution path, and dynamically adjust the scheduling strategy during runtime based on the data lineage graph and task dependency relationship; The unified execution framework module is used to seamlessly integrate the data access, cleaning, modeling, calculation and visualization processes, and ensures data consistency and timeliness through the unified execution framework.