A multi-modal metadata driven intelligent scheduling method and system

By using an intelligent scheduling method driven by multimodal metadata, a global relationship graph is constructed and strategies are dynamically adjusted. This solves the problems of poor scalability, complex manual orchestration, and suboptimal resource utilization in large-scale data scheduling, and realizes automated and intelligent data link management, thereby improving system efficiency and resource utilization.

CN122243158A · Pending · Publication Date: 2026-06-19 · 联通数智医疗科技有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
联通数智医疗科技有限公司
Filing Date
2026-02-06
Publication Date
2026-06-19


Abstract

This invention discloses an intelligent scheduling method and system driven by multimodal metadata, comprising: acquiring multimodal metadata, wherein the metadata includes at least a data directory describing data modality attributes, data tables with logical storage structures, data lineage representing the transformation relationships between data tables, and workflow definitions composed of data tables and data lineages; constructing a global relationship graph with data tables as nodes and data lineages as directed edges based on the metadata; automatically dividing the global relationship graph into multiple sub-relationship graphs without mutual dependencies using a graph algorithm; creating or updating a corresponding workflow for each sub-relationship graph; scheduling and executing the workflows, and continuously monitoring predefined indicators during execution, dynamically adjusting the execution strategy based on the monitoring indicators. This invention frees data engineers from tedious and error-prone manual orchestration work, achieving full-process automation of data link management.

Description

Technical Field

[0001] This invention relates to the technical field of data development and scheduling, and in particular to an intelligent scheduling method and system driven by multimodal metadata.

Background Technology

[0002] In the traditional field of data development and scheduling, workflow orchestration-based implementation schemes are commonly used. The core process typically involves: first, defining data processing operations such as data acquisition, cleaning, and filtering as independent tasks (jobs); then, data developers manually orchestrating these tasks visually on a workflow canvas based on business logic, thus forming a complete data processing chain; finally, configuring timed or periodic scheduling strategies for the workflow to trigger execution.

[0003] However, while the above solutions work effectively in lightweight scenarios with few tasks and simple workflows, their limitations become increasingly apparent in complex, large-scale data projects (such as systems with over a thousand data tasks per day). The main issues are as follows. Poor scalability: When the number of tasks to be orchestrated is large (e.g., more than 50), the front-end visual canvas becomes extremely complex, resulting in excessive browser memory usage and decreased rendering performance. When the number of tasks reaches the hundreds, the interface is prone to lag or even crash.

[0004] Manual orchestration is complex and error-prone: Breaking a large workflow down into multiple processes to avoid UI performance issues relies heavily on the professional experience and meticulous checking of data development engineers. Manual partitioning is prone to orchestration errors such as duplicate task definitions and missing or conflicting dependencies, resulting in huge maintenance costs.

[0005] Weak knowledge transfer and system maintainability: The orchestration logic of workflows relies heavily on engineers' personal understanding of specific business processes. Once the original engineer leaves and documentation is lost, the successor will find it difficult to understand and maintain the existing complex data links, resulting in poor system maintainability.

[0006] Inefficient resource utilization: Data developers often have limited understanding of the underlying computing cluster's resource status. Their orchestrated workflows mainly focus on the correctness of business logic, making it difficult to optimize task resource allocation based on the actual cluster load. This can easily lead to waste or bottlenecks in computing resources (such as CPU and memory).

[0007] In summary, existing technical solutions have significant shortcomings in terms of automation, maintainability, resource efficiency, and ease of operation when facing large-scale and complex data scheduling scenarios.

Summary of the Invention

[0008] To overcome the shortcomings of existing technologies, the present invention aims to propose an intelligent scheduling method and system based on multimodal metadata, which can automatically understand global data relationships, intelligently divide scheduling units, and dynamically optimize execution strategies to reduce the cost of manual intervention and improve the overall efficiency and reliability of the system.

[0009] To achieve the objectives of this invention, the following technical solution is adopted. One objective of this invention is to provide an intelligent scheduling method driven by multimodal metadata, comprising the following steps: obtaining multimodal metadata, which includes at least a data directory describing data modality attributes, data tables with logical storage structure, data lineage representing the conversion relationship between data tables, and workflow definitions consisting of data tables and data lineage; constructing, based on the metadata, a global relationship graph with data tables as nodes and data lineage as directed edges; automatically dividing the global relationship graph into multiple sub-relationship graphs that have no mutual dependencies using graph algorithms; creating or updating a corresponding workflow for each of the sub-relationship graphs; and scheduling and executing the workflows, continuously monitoring predefined indicators during execution, and dynamically adjusting the execution strategy based on the monitored indicators.

[0010] In the above technical solution, a complete closed loop from metadata aggregation to intelligent workflow scheduling and dynamic optimization is set up to free data engineers from heavy and error-prone manual orchestration work, thereby realizing full-process automation of data link management.

[0011] Furthermore, the multimodal metadata includes metadata corresponding to structured data, semi-structured data, and unstructured data; the data directory is used to abstract the access formats and attributes of different modal data.

[0012] The above technical solution achieves unified management of heterogeneous data sources, laying the foundation for building a knowledge graph covering the entire enterprise data domain, and improving the versatility and applicability of the method.

[0013] Furthermore, the steps for constructing the global relationship graph include: mapping each data table to a node, and mapping each data lineage representing the direction of data transformation to a directed edge pointing from the source data table node to the target data table node, thereby forming a directed graph structure.

[0014] In the above technical solution, by transforming business logic (lineage relationships) into a computable graph structure, subsequent automated algorithms such as cycle detection, dependency analysis, and graph partitioning can be implemented, providing core data model support for intelligent scheduling.

[0015] Furthermore, the partitioning using graph algorithms includes the following sub-steps: performing cycle detection on the global relationship graph, and issuing an alarm if a cycle is found; performing topological sorting on the acyclic global relationship graph, or on the global relationship graph whose cycles have been handled, to determine the dependency order between nodes; detecting strongly connected components in the global relationship graph; and extracting the subgraph corresponding to each strongly connected component as an independent sub-relationship graph.

[0016] In the above technical solution, a systematic graph theory algorithm is used to automatically decompose complex dependencies, replacing the error-prone manual partitioning work. This not only ensures the logical correctness of the partitioning and avoids task conflicts, but also allows each sub-workflow to be scheduled independently and in parallel, greatly improving processing efficiency and system reliability.

[0017] Furthermore, the specific logic for creating or updating workflows for each sub-relationship graph includes: traversing the nodes and edges in the sub-relationship graph, generating a data processing link that has no external dependencies and can run independently, and defining it as a workflow.

[0018] The above technical solution realizes the automatic transformation from "data-dependent view" to "executable task flow", generates a workflow object that can be directly understood and executed by the scheduling engine, and completes the final output of intelligent orchestration.

[0019] Furthermore, the monitoring metrics include at least one of the following: cluster computing resource usage, resources required for task declaration, actual task execution time, task priority, workflow execution time, workflow execution status, and remaining data freshness.

[0020] In the above technical solution, by establishing a multi-dimensional monitoring indicator system oriented towards efficiency and quality, accurate data input is provided for subsequent dynamic scheduling decisions, enabling the system to transform from experience-driven to data-driven.

[0021] Furthermore, the dynamic adjustment of execution strategies based on monitoring indicators includes: dynamically adjusting the concurrency of tasks in the workflow, computing resource allocation parameters, error retry strategies, or execution priorities according to cluster resource utilization and task resource requirements.

[0022] The above technical solution realizes flexible resource allocation and intelligent fault handling, significantly improves the overall utilization of cluster resources and the robustness of data links in the face of anomalies, and ensures the SLA (Service Level Agreement) of data services.

[0023] Furthermore, the method for calculating the remaining data freshness is as follows: obtain the preset data freshness period from the data table definition, and calculate the data freshness period minus the difference between the current time and the most recent data update time.

[0024] In the above technical solution, by quantifying the business concept of data timeliness into a calculable scheduling indicator, the system can proactively and predictively ensure the timely update of key data, and better support real-time or near-real-time business scenarios.

[0025] Furthermore, when scheduling workflows, workflows with lower remaining data freshness and / or higher task priority are prioritized for execution.

[0026] The above technical solution achieves optimized scheduling under multiple objectives (business priority, data timeliness), ensuring that limited cluster resources are always used to execute the most urgent and important data processing tasks, thereby maximizing the business value of data output.

[0027] The second objective of this invention is to provide an intelligent scheduling system driven by multimodal metadata, wherein the system comprises: a metadata service module, used to obtain multimodal metadata, which includes at least a data directory describing data modality attributes, data tables with logical storage structure, data lineage representing the conversion relationship between data tables, and workflow definitions composed of data tables and data lineage; a scheduling engine module, used to construct a global relationship graph based on the metadata, with data tables as nodes and data lineage as directed edges, to automatically divide the global relationship graph into multiple sub-relationship graphs that have no mutual dependencies using graph algorithms, and to create or update a corresponding workflow for each sub-relationship graph; and a monitoring and execution module, used to schedule and execute the workflows, continuously monitor predefined indicators during execution, and dynamically adjust the execution strategy based on the monitored indicators.

[0028] Compared with the prior art, the beneficial effects of the present invention are as follows: This invention proposes an intelligent scheduling method and system based on multimodal metadata. By uniformly aggregating and managing metadata for all modalities (structured, semi-structured, and unstructured) data, a global data lineage graph is constructed. Graph algorithms are then used to automatically and intelligently divide complex data dependency networks into multiple independently schedulable sub-workflows. Furthermore, during scheduling and execution, the system dynamically adjusts execution strategies and parameters by real-time monitoring of task and resource metrics, thereby achieving automated data link orchestration, intelligent scheduling execution, and efficient resource utilization. This effectively solves the technical problems of high cost, error susceptibility, poor maintainability, and low resource utilization associated with manual workflow orchestration in large-scale data scenarios.

Attached Figure Description

[0029] Figure 1 is a flowchart of an intelligent scheduling method driven by multimodal metadata, provided in an embodiment of this application; Figure 2 is a schematic diagram of the principle of intelligent scheduling driven by multimodal metadata, provided in an embodiment of this application; Figure 3 is a global relationship graph based on data tables and data lineage, provided in an embodiment of this application; Figure 4 is a flowchart of creating or updating a workflow for a sub-relationship graph, provided in an embodiment of this application; Figure 5 is a schematic diagram of the principle of dynamically adjusting execution strategies based on monitoring indicators, provided in an embodiment of this application; Figure 6 is a schematic diagram of the structure of an intelligent scheduling system driven by multimodal metadata, provided in an embodiment of this application.

Detailed Implementation

[0030] To facilitate understanding of the present invention, a more complete description is given below with reference to the accompanying drawings. Preferred embodiments of the invention are shown in the drawings. However, the invention can be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the invention will be thorough and complete.

[0031] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.

[0032] Example 1: This embodiment provides an intelligent scheduling method driven by multimodal metadata. As shown in Figure 1 and Figure 2, the method includes the following steps: Step S1: Obtain multimodal metadata, which includes at least a data directory describing the data modality attributes, data tables with logical storage structure, data lineage representing the conversion relationship between data tables, and workflow definitions composed of data tables and data lineage; Step S2: Based on the metadata, construct a global relationship graph with data tables as nodes and data lineage as directed edges; Step S3: Use graph algorithms to automatically divide the global relationship graph into multiple sub-relationship graphs that have no mutual dependencies; Step S4: Create or update the corresponding workflow for each of the sub-relationship graphs; Step S5: Schedule and execute the workflows, continuously monitor predefined indicators during execution, and dynamically adjust the execution strategy based on the monitored indicators.

[0033] In a preferred embodiment, in step S1, the multimodal metadata includes metadata corresponding to structured data, semi-structured data, and unstructured data; the data directory is used to abstract the access formats and attributes of different modal data.

[0034] Specifically, the metadata service actively collects or passively receives metadata and aggregates all of it within the metadata service. The metadata comprises the following types:

1. Data Directory: Describes the attributes of data of one modality. A data directory contains multiple data tables. A data directory can be of various modalities, including structured data (such as MySQL), unstructured file sets (such as folders on a server's hard drive), and semi-structured data queues (such as Kafka). In every case, the final abstracted form is a two-dimensional table, where columns represent metadata attribute values and each row represents one record. See Tables 1 and 2 below for details.

[0035] Table 1: Data Catalog Table

Table 2: Data Directory Attribute Table

2. Data Table: A two-dimensional grid logical structure used to store data. It consists of rows and columns, where columns represent data fields and rows represent specific records of data. The specific logical structure depends on the data format defined in the data catalog.

[0036] 3. Data lineage: This can be viewed as the relationship between data tables. This relationship is many-to-one and directional. For example, the lineage of merging table A and table B into table C is A+B->C.

[0037] 4. Workflow: It can be viewed as a data link that runs independently without external dependencies, consisting of data tables and data lineage.

[0038] Metadata services aggregate all metadata information and persist it (such as in files or databases).
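The metadata kinds listed above can be modelled as plain records. The following is a minimal sketch, not the schema of the invention: the class and field names (`DataCatalog`, `DataTable`, `Lineage`, `freshness_seconds`) and the sample catalog `orders_db` are illustrative assumptions, and JSON stands in for whichever file or database format is actually used for persistence.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DataTable:
    name: str
    columns: List[str]
    freshness_seconds: Optional[int] = None  # assumed field: preset freshness period


@dataclass
class DataCatalog:
    name: str
    modality: str  # e.g. "structured", "semi-structured", "unstructured"
    tables: List[DataTable] = field(default_factory=list)


@dataclass
class Lineage:
    sources: List[str]  # many-to-one: several source tables ...
    target: str         # ... feed exactly one target table


def dump_metadata(catalogs, lineages):
    """Serialize the aggregated metadata to a JSON string for persistence."""
    return json.dumps({
        "catalogs": [asdict(c) for c in catalogs],
        "lineages": [asdict(l) for l in lineages],
    })


catalog = DataCatalog(
    "orders_db", "structured",
    [DataTable("A", ["id"]), DataTable("B", ["id"]), DataTable("C", ["id"])],
)
merged = Lineage(sources=["A", "B"], target="C")  # the A+B->C example
payload = json.loads(dump_metadata([catalog], [merged]))
```

Keeping lineage as a separate record that refers to tables by name lets the same table appear in many lineage relationships without duplication.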

[0039] Understandably, the technical solution in step S1 enables unified management of heterogeneous data sources, laying the foundation for building a knowledge graph covering the entire enterprise data domain and improving the versatility and applicability of the method.

[0040] In a preferred embodiment, step S2 includes the following steps for constructing the global relationship graph: mapping each data table to a node, and mapping each data lineage representing the direction of data transformation to a directed edge pointing from the source data table node to the target data table node, thereby forming a directed graph structure.

[0041] Specifically, the scheduling engine retrieves all metadata from the metadata service via an interface and constructs a relationship graph. The relationship graph has two core data structures: nodes and edges. In this case, data tables are used as nodes and data lineage is used as edges, as shown in Figure 3.

[0042] For example, the scheduling engine traverses all data lineage records. For each "data table A -> data table B" lineage relationship, it creates or locates two nodes representing data table A and data table B in the relationship graph, and then creates a directed edge from node A to node B. Ultimately, all data tables and their transformation relationships constitute a directed graph, as shown in Figure 3. This graph visually illustrates the source, processing, and final output of the data, and serves as the foundational data structure for subsequent automated analysis.
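The traversal described above can be sketched in a few lines. This is an illustrative sketch only: the function name `build_graph` and the representation of a lineage record as a `(source, target)` pair are assumptions, and a many-to-one lineage such as A+B->C is assumed to have been expanded into one pair per source table.

```python
from collections import defaultdict


def build_graph(lineage_records):
    """Build a directed graph from (source_table, target_table) pairs.

    Returns (nodes, edges): a set of table names and a dict mapping each
    source table to the set of tables derived from it.
    """
    nodes = set()
    edges = defaultdict(set)
    for source, target in lineage_records:
        nodes.update((source, target))   # create or locate both nodes
        edges[source].add(target)        # directed edge source -> target
    return nodes, dict(edges)


# Hypothetical lineage: A+B->C expands to two edges, followed by C->D.
nodes, edges = build_graph([("A", "C"), ("B", "C"), ("C", "D")])
```

Tables that only ever appear as targets (here `D`) are present in `nodes` but have no entry in `edges`, which is how terminal outputs can be recognised.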

[0043] It can be understood that, by transforming business logic (lineage relationships) into a computable graph structure, subsequent automated algorithms such as cycle detection, dependency analysis, and graph partitioning can be implemented, providing core data model support for intelligent scheduling.

[0044] In a preferred embodiment, in step S3, the partitioning using a graph algorithm includes the following sub-steps: Step S31: Perform cycle detection on the global relationship graph; if a cycle exists, issue an alarm; Step S32: Perform topological sorting on the acyclic global relationship graph, or on the global relationship graph whose cycles have been handled, to determine the dependency order between nodes; Step S33: Detect strongly connected components in the global relationship graph; Step S34: Extract the subgraph corresponding to each strongly connected component as an independent sub-relationship graph.

[0045] Specifically, S31 (Cycle Detection): Depth-First Search (DFS) is used to detect whether a cycle exists in the graph. If a cycle exists (e.g., A->B, B->C, C->A), there is a logical infinite loop in the data processing flow. The system issues an alarm and stops the automatic partitioning, prompting data developers to intervene and check.
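A DFS-based cycle check of the kind named in S31 can be sketched as follows. The three-colour marking is a standard way to implement it and is an assumption here, as are the function name `find_cycle` and the wording of the alert message.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    `edges` maps each node to an iterable of its successors.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the DFS path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                      # back edge: cycle found
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(edges):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


cycle = find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
if cycle:
    print("ALERT: cyclic lineage, automatic partitioning stopped:", " -> ".join(cycle))
```

Returning the offending path rather than a bare boolean gives the data developer something concrete to inspect. The recursion is fine for a sketch; a very deep lineage chain would need an explicit stack to stay within Python's recursion limit.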

[0046] S32 (Topological Sort): Performs a topological sort on an acyclic graph, resulting in a linear sequence of node operations. This sequence clarifies the order in which all data tables are processed, providing a reference for understanding global dependencies.
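The text does not say which topological sorting algorithm is used. As one possibility, the sketch below uses Kahn's algorithm (repeatedly emitting nodes whose in-degree has dropped to zero); the function name `topological_order` and the sorted tie-breaking are assumptions made to keep the output deterministic.

```python
from collections import deque


def topological_order(edges):
    """Kahn's algorithm. `edges` maps node -> iterable of successors.

    Raises ValueError if the graph contains a cycle.
    """
    indegree = {}
    for src, targets in edges.items():
        indegree.setdefault(src, 0)
        for dst in targets:
            indegree[dst] = indegree.get(dst, 0) + 1

    # Start with tables that have no upstream lineage.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dst in sorted(edges.get(node, ())):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; no topological order exists")
    return order


order = topological_order({"A": ["C"], "B": ["C"], "C": ["D"]})
print(order)  # ['A', 'B', 'C', 'D']
```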

[0047] S33 (Strongly Connected Component Detection): Using Tarjan's or Kosaraju's algorithm, find the strongly connected components (SCCs) of the graph, that is, its maximal strongly connected subgraphs. Nodes within an SCC are mutually reachable, representing a tightly coupled set of data transformation steps that must be processed together.
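A compact version of Tarjan's algorithm, one of the two algorithms named in S33, is sketched below. The function name and the graph representation are assumptions. Note that in a graph that has already passed cycle detection every SCC is a single node; the sample graph contains a cycle only so that the output is non-trivial.

```python
def strongly_connected_components(edges):
    """Tarjan's algorithm. Returns a list of SCCs, each a set of nodes."""
    index = {}      # discovery order of each node
    lowlink = {}    # smallest index reachable from the node's DFS subtree
    stack, on_stack = [], set()
    result = []
    counter = [0]

    def visit(node):
        index[node] = lowlink[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in edges.get(node, ()):
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:      # node is the root of an SCC
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            result.append(component)

    all_nodes = set(edges) | {n for targets in edges.values() for n in targets}
    for node in sorted(all_nodes):
        if node not in index:
            visit(node)
    return result


sccs = strongly_connected_components({"A": ["B"], "B": ["C"], "C": ["A", "D"]})
```

Here `sccs` contains the two components `{"A", "B", "C"}` and `{"D"}`.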

[0048] S34 (Subgraph Extraction): Extract each detected strongly connected component, as well as the independent nodes (and their associated edges) that do not belong to any SCC, to form multiple subgraphs. Each subgraph is tightly dependent internally, but there is no dependency between subgraphs.
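One reading of "no dependency between subgraphs" is that no lineage edge crosses a sub-graph boundary. The sketch below produces groups with that property by computing weakly connected components with a union-find structure. This is a substituted technique shown for illustration, not the SCC-based extraction that S34 names, and `partition_independent` is a hypothetical helper.

```python
def partition_independent(lineage_pairs, isolated=()):
    """Group tables so that no lineage edge crosses between groups.

    Uses union-find over the edges, ignoring direction (weakly connected
    components). `isolated` lists tables that have no lineage at all.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for src, dst in lineage_pairs:
        parent[find(src)] = find(dst)       # merge the two groups
    for table in isolated:
        find(table)                         # register as its own group

    groups = {}
    for table in parent:
        groups.setdefault(find(table), set()).add(table)
    return sorted(groups.values(), key=lambda g: sorted(g))


subgraphs = partition_independent(
    [("A", "C"), ("B", "C"), ("X", "Y")], isolated=["Z"])
print(subgraphs)
```

With this sample input the three groups are {A, B, C}, {X, Y} and {Z}, each of which could become a workflow that runs without waiting on any other group.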

[0049] Understandably, the use of systematic graph theory algorithms to automatically decompose complex dependencies replaces the error-prone manual partitioning process. This not only ensures the logical correctness of the partitioning and avoids task conflicts, but also allows each sub-workflow to be scheduled independently and in parallel, greatly improving processing efficiency and system reliability.

[0050] In a preferred embodiment, in step S4, the specific logic for creating or updating a workflow for each sub-relationship graph includes: traversing the nodes and edges in the sub-relationship graph, generating a data processing link that has no external dependencies and can run independently, and defining it as a workflow.

[0051] Specifically, as shown in Figure 4, the scheduling engine traverses all nodes (data tables) and edges (lineage) of a sub-graph. Based on the dependencies indicated by the edges, the engine connects the scattered task nodes (such as SQL queries and Jar package jobs) in the correct execution order, forming a workflow definition with a linear or directed acyclic graph (DAG) structure. This workflow definition contains all the information required for task execution (such as scripts and parameters) and ensures that the workflow can run independently without depending on any data tables outside the subgraph.
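The conversion from a sub-graph to a workflow definition might look like the following sketch. The dictionary layout, the `task_for` mapping, and the names `build_workflow`, `wf_orders` and `merge_a_b.sql` are all illustrative assumptions; ordering is delegated to `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def build_workflow(name, tables, lineage_pairs, task_for):
    """Turn one sub-graph into a workflow definition (a plain dict).

    `task_for` maps a table name to the task (script, parameters) that
    produces it. Raises ValueError if an edge touches a table outside
    the sub-graph, since the workflow must run without external inputs.
    """
    tables = set(tables)
    upstream = {t: set() for t in tables}
    for src, dst in lineage_pairs:
        if src not in tables or dst not in tables:
            raise ValueError(f"edge {src}->{dst} depends on a table outside the sub-graph")
        upstream[dst].add(src)

    # static_order() yields every table after all of its upstream tables.
    order = list(TopologicalSorter(upstream).static_order())
    return {
        "name": name,
        "tasks": [
            {"table": t, "task": task_for.get(t), "depends_on": sorted(upstream[t])}
            for t in order
        ],
    }


workflow = build_workflow(
    "wf_orders",
    tables=["A", "B", "C"],
    lineage_pairs=[("A", "C"), ("B", "C")],
    task_for={"C": {"script": "merge_a_b.sql"}},   # hypothetical task
)
```

Source tables such as `A` and `B` carry no task in this sketch because nothing inside the sub-graph produces them; only `C` has a script attached.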

[0052] Understandably, this achieves an automated transformation from a "data-dependent view" to an "executable task flow," generating workflow objects that can be directly understood and executed by the scheduling engine, thus completing the final output of intelligent orchestration.

[0053] In a preferred embodiment, in step S5, the monitoring indicators include at least one of the following: cluster computing resource usage, resources required for task declaration, actual task execution time, task priority, workflow execution time, workflow execution status, and remaining data freshness.

[0054] Specifically, the key monitoring indicators and their sources are as follows: Cluster computing resources: obtained by calling the YARN REST API or Kubernetes Metrics API, including the CPU and memory usage of each node.

[0055] Task declaration resources: parsed from task definitions (such as Spark job configurations).

[0056] Task / workflow execution time: calculated in real time from the execution logs of the scheduling engine.

[0057] Task priority: Specified by the user when defining the task, or configured in the metadata according to business importance.

[0058] Workflow execution status: provided directly by the scheduling engine instance (e.g., running, successful, failed).

[0059] Remaining data freshness: Calculated based on the refresh cycle defined in the data table (e.g., updated every 3600 seconds) and the time of the most recent successful data update; the formula is: Remaining freshness = Defined cycle - (Current time - Data update time).

[0060] Understandably, by establishing a multi-dimensional monitoring indicator system oriented towards efficiency and quality, accurate data input is provided for subsequent dynamic scheduling decisions, enabling the system to shift from experience-driven to data-driven.

[0061] In a preferred embodiment, in step S5, the dynamic adjustment of the execution strategy based on monitoring indicators includes: dynamically adjusting the concurrency of tasks in the workflow, computational resource allocation parameters, error retry strategies, or execution priorities according to cluster resource utilization and task resource requirements.

[0062] Specifically, as shown in Figure 5, the scheduling decision logic continuously analyzes the monitored indicators and adjusts the strategy as follows: When the overall cluster resources are detected to be idle while some high-priority workflows are queued, the scheduling decision-maker increases the number of these workflows that execute concurrently.

[0063] When a task fails multiple times due to insufficient memory, the system will attempt to automatically increase its requested memory resource quota based on historical data during the next scheduling.

[0064] For failed tasks, the system dynamically determines the number of retries and the retry interval based on the error type (such as network timeout or temporary data unavailability), rather than using a fixed strategy.

[0065] When the system detects a sharp drop in the "remaining data freshness" of a data table (meaning it is about to expire), it will temporarily increase the priority of all upstream workflows to ensure that the data is updated in a timely manner.
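The four adjustment rules above can be sketched as a single function. Every threshold, multiplier, and key name below (`cluster_utilization`, `oom_failures`, the 1.5x memory increase, the 600-second freshness threshold, and so on) is an illustrative assumption; in particular, the text derives the new memory quota from historical data, which this sketch replaces with a fixed factor.

```python
def adjust_strategy(strategy, metrics):
    """Return a new strategy dict adjusted from monitored metrics."""
    s = dict(strategy)

    # Idle cluster with queued high-priority work: raise concurrency.
    if metrics["cluster_utilization"] < 0.5 and metrics["queued_high_priority"] > 0:
        s["concurrency"] = strategy["concurrency"] + 1

    # Repeated out-of-memory failures: request more memory next time
    # (fixed 1.5x factor assumed in place of a history-based estimate).
    if metrics["oom_failures"] >= 2:
        s["memory_mb"] = int(strategy["memory_mb"] * 1.5)

    # Retry count and interval depend on the error type, not a fixed rule.
    retry_by_error = {
        "network_timeout": {"max_retries": 5, "interval_s": 10},
        "data_unavailable": {"max_retries": 3, "interval_s": 300},
    }
    s["retry"] = retry_by_error.get(
        metrics.get("last_error"), {"max_retries": 1, "interval_s": 60})

    # Freshness about to expire: temporarily boost priority.
    if metrics["remaining_freshness_s"] < 600:
        s["priority"] = strategy["priority"] + 10
    return s


adjusted = adjust_strategy(
    {"concurrency": 2, "memory_mb": 4096, "priority": 5},
    {"cluster_utilization": 0.3, "queued_high_priority": 4, "oom_failures": 2,
     "last_error": "network_timeout", "remaining_freshness_s": 120},
)
```

Returning a new dictionary instead of mutating the input keeps the previous strategy available for comparison or rollback.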

[0066] Understandably, this achieves flexible resource allocation and intelligent fault handling, significantly improving the overall utilization of cluster resources and the robustness of data links in the face of anomalies, and ensuring the SLA (Service Level Agreement) of data services.

[0067] In a preferred embodiment, in step S5, the method for calculating the remaining data freshness is as follows: obtain a preset data freshness period from the data table definition, and calculate the data freshness period minus the difference between the current time and the most recent data update time.

[0068] For example, suppose the data table "Daily Active Users Report" defines its data freshness period as 3600 seconds (i.e., 1 hour). The latest successful data update time for this table is today at 10:00:00. When making a scheduling decision at the current time 10:30:00, the remaining freshness is calculated as: 3600 - (10:30:00 - 10:00:00) = 3600 - 1800 = 1800 seconds. This means that the data has 1800 seconds (30 minutes) of validity remaining. The scheduler can use this value for priority ranking.
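The worked example above translates directly into code. In this sketch the function name `remaining_freshness` is an assumption and the calendar date is arbitrary; only the 3600-second period and the 10:00:00 and 10:30:00 times come from the example.

```python
from datetime import datetime


def remaining_freshness(period_seconds, last_update, now):
    """Remaining freshness = period - (now - last successful update), in seconds."""
    elapsed = (now - last_update).total_seconds()
    return period_seconds - elapsed


remaining = remaining_freshness(
    3600,
    last_update=datetime(2026, 1, 1, 10, 0, 0),   # date is arbitrary
    now=datetime(2026, 1, 1, 10, 30, 0),
)
print(remaining)  # 1800.0
```

A negative result means the freshness period has already elapsed, which a scheduler would normally treat as the most urgent case.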

[0069] Understandably, by quantifying the business concept of data timeliness into a calculable scheduling indicator, the system can proactively and predictively ensure the timely updating of critical data, thus better supporting real-time or near-real-time business scenarios.

[0070] In a preferred embodiment, in step S5, when scheduling workflows, workflows with lower remaining data freshness and / or higher task priority are executed first.

[0071] For example, the scheduler maintains a queue of workflows to be executed. At each scheduling decision, it calculates a comprehensive scheduling score for each workflow in the queue. For instance, a simplified formula could be used: Score = α × Task priority + β × (1 / Remaining data freshness), where α and β are weighting coefficients. A higher score indicates greater scheduling urgency. The scheduler executes the workflow instance with the highest score first. In this way, the system can automatically balance business importance and data timeliness requirements.
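Reading the simplified formula as a weighted sum of the task priority and the reciprocal of the remaining data freshness, the ranking step can be sketched as below. The default weights, the workflow names, and the clamping of freshness to a minimum of one second (to avoid dividing by zero once data has expired) are assumptions, not part of the text.

```python
def scheduling_score(priority, remaining_freshness_s, alpha=1.0, beta=1000.0,
                     min_freshness_s=1.0):
    """Score = alpha * priority + beta * (1 / remaining freshness).

    Remaining freshness is clamped to `min_freshness_s` so that expired or
    nearly expired data gives a large finite score instead of dividing by zero.
    """
    freshness = max(remaining_freshness_s, min_freshness_s)
    return alpha * priority + beta / freshness


queue = [
    {"name": "wf_report", "priority": 5, "remaining_freshness_s": 1800},
    {"name": "wf_audit", "priority": 8, "remaining_freshness_s": 86400},
    {"name": "wf_alerts", "priority": 3, "remaining_freshness_s": 60},
]
ranked = sorted(
    queue,
    key=lambda wf: scheduling_score(wf["priority"], wf["remaining_freshness_s"]),
    reverse=True,
)
print([wf["name"] for wf in ranked])  # ['wf_alerts', 'wf_audit', 'wf_report']
```

With these weights the low-priority `wf_alerts` still runs first because its data is one minute from expiry, which is the balancing behaviour the paragraph describes.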

[0072] Understandably, this achieves optimized scheduling under multiple objectives (business priority, data timeliness), ensuring that limited cluster resources are always used to execute the most urgent and important data processing tasks, thereby maximizing the business value of data output.

[0073] In this embodiment, a complete closed loop from metadata aggregation to intelligent workflow scheduling and dynamic optimization is set up to free data engineers from heavy and error-prone manual orchestration work, thereby achieving full-process automation of data link management.

[0074] Example 2: Based on the method described in the previous embodiment, this embodiment provides an intelligent scheduling system driven by multimodal metadata. As shown in Figure 6, the system includes: a metadata service module, used to obtain multimodal metadata, which includes at least a data directory describing data modality attributes, data tables with logical storage structure, data lineage representing the conversion relationship between data tables, and workflow definitions composed of data tables and data lineage; a scheduling engine module, used to construct a global relationship graph based on the metadata, with data tables as nodes and data lineage as directed edges, to automatically divide the global relationship graph into multiple sub-relationship graphs that have no mutual dependencies using graph algorithms, and to create or update a corresponding workflow for each sub-relationship graph; and a monitoring and execution module, used to schedule and execute the workflows, continuously monitor predefined indicators during execution, and dynamically adjust the execution strategy based on the monitored indicators.

[0075] Specifically, the Metadata Service Module acts as the "data map" creator for the system, responsible for the unified modeling, collection, storage, and external service of the enterprise's full-domain data assets.

[0076] Unified Metamodel Definition: Provides a standardized data schema for core metadata objects (data directory, data table, data lineage, task definition), supporting dynamic API extension.

[0077] Multi-mode data collection: For known data sources (such as MySQL, Hive, Kafka, S3), dedicated connectors are deployed to periodically pull metadata and lineage relationships via JDBC, API, etc. In addition, standardized RESTful APIs or message queues (such as Kafka) are provided so that data processing tools (such as ETL tools, Spark jobs, Flink jobs) can actively report task execution logs and generated data lineage at runtime, capturing lineage relationships accurately and in real time. A built-in SQL parser and code analysis engine (used to parse Spark and Flink job logic) can also automatically parse field-level lineage from task scripts and associate it with table-level lineage to form a more granular knowledge graph.

[0078] Versioning and Change Auditing: All metadata changes are versioned, supporting metadata snapshots and rollbacks, and providing a complete change audit log to facilitate tracking the evolution history of the data chain.

[0079] Technical implementation example: Storage: Graph databases (such as Neo4j and JanusGraph) are used to store the lineage graph, leveraging their relationship query capabilities; relational databases (such as MySQL) or document stores (such as Elasticsearch) are used to store attribute-type metadata, supporting complex queries and full-text search.

[0080] Service-oriented architecture: Deployed as a standalone Spring Boot or gRPC microservice, providing a high-performance, highly available metadata query API.

[0081] The Scheduling Engine Module acts as the "intelligent brain" of the system, responsible for decision-making and analysis.

[0082] Relationship graph construction: Periodically, or in response to events (by listening for metadata change messages), the module pulls full or incremental metadata from the metadata service and dynamically builds and maintains a global data relationship graph in memory.
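
As a non-limiting illustration, the in-memory graph can be sketched as an adjacency map from each source table to its downstream tables. The function names, table names, and the shape of the change event (lists of added and removed lineage pairs) are assumptions made for this sketch and are not taken from the specification.

```python
def build_relationship_graph(tables, lineages):
    """Full build: return {table: set of downstream tables}."""
    graph = {table: set() for table in tables}
    for source, target in lineages:
        # Tables that appear only in lineage records still become nodes.
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    return graph


def apply_lineage_change(graph, added=(), removed=()):
    """Incremental update driven by a (hypothetical) metadata change event."""
    for source, target in added:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    for source, target in removed:
        graph.get(source, set()).discard(target)
    return graph


# Hypothetical table names, used only for illustration.
tables = ["ods_orders", "dwd_orders", "ads_report"]
lineages = [("ods_orders", "dwd_orders"), ("dwd_orders", "ads_report")]
graph = build_relationship_graph(tables, lineages)
```

A plain dictionary keeps the sketch self-contained; a graph library such as those named in the following paragraph would serve the same role at scale.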

[0083] Graph algorithm computing engine: Integrates or encapsulates high-performance graph computing libraries (such as JGraphT, Apache Spark GraphX) to execute algorithms such as cycle detection, topological sorting, and strongly connected component (SCC) detection as described in Example 4. For ultra-large-scale graphs (more than 100,000 nodes), sharding or parallel graph computing frameworks can be used for processing.
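
As a non-limiting illustration, cycle detection and topological sorting can be combined in a single pass using Kahn's algorithm. The sketch below uses plain Python rather than the libraries named above, assumes every table appears as a key of the adjacency map, and omits strongly connected component detection. The function name is hypothetical.

```python
from collections import deque


def topological_order(graph):
    """Return nodes in dependency order; raise ValueError if a cycle exists.

    `graph` maps each node to the set of its downstream nodes, and every
    node must appear as a key (an assumption of this sketch).
    """
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1

    # Start from nodes with no upstream dependencies; sort for determinism.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in sorted(graph[node]):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(graph):
        # Nodes never released lie on a cycle or downstream of one.
        placed = set(order)
        blocked = sorted(n for n in graph if n not in placed)
        raise ValueError(f"cycle detected; blocked nodes: {blocked}")
    return order
```

In a scheduler, the `ValueError` would correspond to the alarm raised when a cycle is found.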

[0084] Intelligent workflow orchestration: Subgraph to Workflow: Automatically converts each subgraph output by the algorithm into a standard DAG (Directed Acyclic Graph) workflow definition. This process includes: identifying entry nodes (without upstream dependencies) and exit nodes (without downstream dependencies) in the subgraph, determining the critical path, and setting default task execution parameters.
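
As a non-limiting illustration, the conversion step can be sketched as follows: entry and exit nodes are read off the subgraph, and the critical path is taken to be the longest path by estimated task duration. The per-node duration estimates, the function name, and the returned dictionary layout are assumptions of this sketch; the specification does not state how the critical path is weighted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def workflow_from_subgraph(subgraph, durations):
    """Derive entry nodes, exit nodes, and the critical path of a DAG.

    `subgraph` maps node -> set of downstream nodes; `durations` maps
    node -> estimated run time (hypothetical input for this sketch).
    """
    preds = {node: set() for node in subgraph}
    for source, targets in subgraph.items():
        for target in targets:
            preds[target].add(source)

    entries = sorted(n for n in subgraph if not preds[n])
    exits = sorted(n for n in subgraph if not subgraph[n])

    finish, best_pred = {}, {}
    for node in TopologicalSorter(preds).static_order():
        start, via = 0, None
        for pred in sorted(preds[node]):
            # Keep the predecessor that finishes last.
            if via is None or finish[pred] > start:
                start, via = finish[pred], pred
        finish[node] = start + durations[node]
        best_pred[node] = via

    # Walk back from the exit node that finishes last.
    node = max(exits, key=lambda n: finish[n])
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    path.reverse()
    return {"entries": entries, "exits": exits,
            "critical_path": path, "duration": total}
```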

[0085] Workflow templates and parameter injection: Workflow template functionality is supported. For subgraphs with similar structures, templates can be reused, and only specific dynamic parameters such as data sources and table names need to be injected.
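
As a non-limiting illustration, template reuse with parameter injection can be sketched with Python's standard `string.Template`. The template layout, the task commands, and the parameter names (`table`, `source`, `target`) are hypothetical and serve only to show the mechanism.

```python
from string import Template

# Hypothetical template shared by structurally similar subgraphs.
SYNC_TEMPLATE = {
    "name": "sync_${table}",
    "tasks": [
        {"id": "extract",
         "command": "extract --source ${source} --table ${table}",
         "depends_on": []},
        {"id": "load",
         "command": "load --target ${target} --table ${table}",
         "depends_on": ["extract"]},
    ],
}


def render(template, params):
    """Return a copy of `template` with ${...} placeholders filled in.

    Template.substitute raises KeyError when a parameter is missing,
    so incomplete injections fail early instead of producing a bad workflow.
    """
    if isinstance(template, str):
        return Template(template).substitute(params)
    if isinstance(template, list):
        return [render(item, params) for item in template]
    if isinstance(template, dict):
        return {key: render(value, params) for key, value in template.items()}
    return template


workflow = render(SYNC_TEMPLATE,
                  {"table": "orders", "source": "mysql_prod", "target": "hive_dw"})
```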

[0086] Dynamic strategy decision-making: A rule engine or lightweight machine learning model is introduced whose input is real-time monitoring metrics and whose output is scheduling instructions (such as priority adjustment, resource reallocation, and retry policy changes). Decision rules are configurable, for example: "IF remaining data freshness < 300 seconds AND cluster CPU idle rate > 20% THEN set workflow priority to the highest".
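
As a non-limiting illustration, the example rule above can be sketched as a small configurable rule list. Remaining data freshness is computed as stated in claim 8 (the freshness period minus the time elapsed since the most recent update). The metric keys, the instruction format, and all names are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


def remaining_freshness(freshness_period_s, last_update_ts, now_ts):
    """Freshness period minus (current time - most recent update time)."""
    return freshness_period_s - (now_ts - last_update_ts)


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    instruction: dict  # hypothetical scheduling instruction format


RULES = [
    Rule(
        name="boost_nearly_stale_data",
        # IF remaining freshness < 300 s AND CPU idle rate > 20 %
        condition=lambda m: m["remaining_freshness_s"] < 300
        and m["cpu_idle_rate"] > 0.20,
        # THEN set workflow priority to the highest
        instruction={"action": "set_priority", "value": "highest"},
    ),
]


def decide(metrics, rules=RULES):
    """Return the instructions of every rule whose condition holds."""
    return [rule.instruction for rule in rules if rule.condition(metrics)]


metrics = {
    "remaining_freshness_s": remaining_freshness(3600, last_update_ts=1000, now_ts=4400),
    "cpu_idle_rate": 0.35,
}
instructions = decide(metrics)
```

Keeping rules as data rather than hard-coded branches is one way to make them configurable, as the paragraph above requires.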

[0087] Technical implementation example: Core services: Can be developed using Scala / Java and run as a persistent process.

[0088] Caching and Performance: Use caching tools such as Redis to store hot metadata and snapshots of the constructed relationship graph to accelerate scheduling decisions.

[0089] High availability: Employ master-slave or cluster deployment, and use ZooKeeper / Etcd for leader election, ensuring seamless failover when a single node fails.

[0090] Monitoring & Execution Module: As the "limbs" of the system, this module is responsible for precise execution and real-time feedback.

[0091] Workflow execution: Multi-engine adaptation: A unified executor interface is abstracted, allowing the backend to adapt to various task execution engines, such as Apache Airflow, DolphinScheduler, and Kubernetes Jobs, or to submit directly to YARN and Spark. The executor is responsible for translating the workflow DAG issued by the scheduling engine into a format that the target engine can recognize and then submitting it.
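
As a non-limiting illustration, the unified executor interface can be sketched as an abstract base class with one adapter per target engine. The class names, method names, and payload layout are hypothetical; the in-memory adapter merely records submissions and does not call any real engine API.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Unified interface; one subclass per target engine (hypothetical)."""

    @abstractmethod
    def translate(self, workflow: dict) -> dict:
        """Convert the scheduler's workflow DAG into the engine's format."""

    @abstractmethod
    def submit(self, payload: dict) -> str:
        """Submit the translated payload and return a run identifier."""

    def run(self, workflow: dict) -> str:
        return self.submit(self.translate(workflow))


class InMemoryExecutor(Executor):
    """Stand-in adapter that records submissions instead of calling an engine."""

    def __init__(self):
        self.submitted = []

    def translate(self, workflow):
        return {
            "job_name": workflow["name"],
            "steps": [
                {"name": task["id"], "after": list(task["depends_on"])}
                for task in workflow["tasks"]
            ],
        }

    def submit(self, payload):
        self.submitted.append(payload)
        return f"{payload['job_name']}-{len(self.submitted)}"
```

The scheduling engine depends only on `Executor.run`, so adding a new engine means adding a subclass rather than changing scheduling logic.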

[0092] Lifecycle management: Responsible for starting, pausing, resuming, and stopping workflow instances.

[0093] Unified indicator collection: Multi-layer data collection: Metrics are collected from the infrastructure layer (K8s, YARN), task layer (logs and metrics of each computing engine), and business layer (data table update time, data quality rule verification results) through agents or API integration.

[0094] Real-time stream processing: The collected metric data is sent to a real-time stream processing pipeline (such as Apache Flink) for window aggregation, anomaly detection (such as a sudden increase in task latency), and real-time alerts.
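
As a non-limiting illustration, window aggregation and detection of a sudden latency increase can be sketched with a fixed-size sliding window. This is plain Python, not an Apache Flink job; the window size, the threshold factor, and the class name are assumptions of this sketch.

```python
from collections import deque
from statistics import mean


class LatencySpikeDetector:
    """Flag a sample that exceeds `factor` times the sliding-window mean."""

    def __init__(self, window_size=5, factor=3.0):
        self.window = deque(maxlen=window_size)
        self.factor = factor

    def observe(self, latency_s):
        # Only judge once the window is full, to avoid noisy early alerts.
        is_spike = (
            len(self.window) == self.window.maxlen
            and latency_s > self.factor * mean(self.window)
        )
        # The sample joins the window either way, so a sustained shift
        # in latency eventually becomes the new baseline.
        self.window.append(latency_s)
        return is_spike


detector = LatencySpikeDetector(window_size=3, factor=2.0)
flags = [detector.observe(x) for x in [10, 12, 11, 40, 11]]
```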

[0095] Strategy Execution: Receives instructions from the scheduling engine's decision-maker and translates them into specific operations. For example: Dynamic Parameter Adjustment: Adjusts resources before task startup by modifying task submission parameters (such as `spark.executor.memory`). Elastic Scaling: Works with Kubernetes HPA or YARN Resource Manager to automatically scale the compute resource pool based on queue load.
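
As a non-limiting illustration, dynamic parameter adjustment before task startup can be sketched as a function that rewrites `spark.executor.memory` and renders the result as `--conf key=value` arguments for `spark-submit`. The capping policy (grant the requested memory, limited by what the cluster currently has free) and the function names are assumptions of this sketch.

```python
def adjust_submit_conf(base_conf, requested_memory_gb, free_memory_gb,
                       min_memory_gb=1):
    """Return a copy of the configuration with executor memory adjusted.

    Hypothetical policy: grant the requested memory, capped by the free
    memory reported by monitoring, but never below `min_memory_gb`.
    Memory values are whole gigabytes in this sketch.
    """
    conf = dict(base_conf)
    granted = max(min_memory_gb, min(requested_memory_gb, free_memory_gb))
    conf["spark.executor.memory"] = f"{granted}g"
    return conf


def to_spark_submit_args(conf):
    """Render a configuration as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = adjust_submit_conf({"spark.executor.cores": "2"},
                          requested_memory_gb=8, free_memory_gb=4)
args = to_spark_submit_args(conf)
```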

[0096] Technical implementation example: Observability Stack: Integrates Prometheus (metrics storage), Grafana (dashboard), and ELK (log analysis) to provide comprehensive system observability.

[0097] Event-driven architecture: Message queues (such as Apache Pulsar and RocketMQ) are used extensively for decoupling within and between modules. For example, task status update events and resource metric events are published to the message bus for relevant modules to subscribe to and consume.
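
As a non-limiting illustration, the decoupling provided by the message bus can be sketched with an in-process publish/subscribe class. This stands in for a real message queue and does not use any Apache Pulsar or RocketMQ API; the topic names and event fields are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe bus (stand-in for a message queue)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Deliver the event to every subscriber; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
received = []
# The scheduling engine subscribes to task status updates (hypothetical topic).
bus.subscribe("task.status", received.append)
delivered = bus.publish("task.status", {"task": "load", "status": "SUCCESS"})
```

The publisher never references the subscriber directly, which is the property that lets modules evolve independently.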

[0098] System Interaction Flow and Advantages: Closed-Loop Feedback Mechanism: The system forms an intelligent closed loop of "perception-decision-execution-re-perception". Monitoring data from the execution module is fed back to the scheduling engine in real time, and the engine optimizes subsequent decisions accordingly, enabling the system to have continuous learning and adaptive capabilities.

[0099] In this embodiment, the three modules of the system described above not only achieve a qualitative leap in data scheduling from "manual orchestration" to "fully automated intelligent scheduling," but also construct a highly available, scalable, easily integrated, and self-optimizing enterprise-level intelligent data scheduling platform through advanced microservice architecture, event-driven design, and unified observability technologies. It significantly reduces the operational complexity and technical barriers of big data platforms, improves the agility of data development, the timeliness of data output, and the utilization efficiency of cluster resources, providing solid and reliable underlying support for data-driven businesses.

[0100] The above description is merely an embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.

Claims

1. An intelligent scheduling method based on multimodal metadata-driven scheduling, characterized in that the method includes the following steps: obtaining multimodal metadata, which includes at least a data directory describing data modality attributes, data tables with logical storage structure, data lineage representing the conversion relationship between data tables, and workflow definitions consisting of data tables and data lineage; constructing, based on the metadata, a global relationship graph with data tables as nodes and data lineage as directed edges; automatically dividing the global relationship graph into multiple sub-relationship graphs that do not have mutual dependencies using graph algorithms; creating or updating a corresponding workflow for each of the sub-relationship graphs; and scheduling and executing the workflows, continuously monitoring predefined metrics during execution, and dynamically adjusting the execution strategy based on the monitored metrics.

2. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 1, characterized in that, The multimodal metadata includes metadata corresponding to structured data, semi-structured data, and unstructured data; the data directory is used to abstract the access formats and attributes of different modal data.

3. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 2, characterized in that, The steps for constructing the global relationship graph include: mapping each data table to a node, and mapping each data lineage representing the direction of data transformation to a directed edge from the source data table node to the target data table node, thereby forming a directed graph structure.

4. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 3, characterized in that the partitioning using graph algorithms includes the following sub-steps: performing cycle detection on the global relationship graph, and issuing an alarm if a cycle is found; performing topological sorting on the acyclic global relationship graph, or on the global relationship graph whose cycles have been processed, to determine the dependency order between nodes; detecting strongly connected components in the global relationship graph; and extracting the subgraph corresponding to each strongly connected component as an independent sub-relationship graph.

5. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 4, characterized in that, The specific logic for creating or updating workflows for each sub-relationship graph includes: traversing the nodes and edges in the sub-relationship graph, generating a data processing link that has no external dependencies and can run independently, and defining it as a workflow.

6. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 1, characterized in that, The monitoring metrics include at least one of the following: cluster computing resource usage, resources required for task declaration, actual task execution time, task priority, workflow execution time, workflow execution status, and remaining data freshness.

7. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 6, characterized in that, The dynamic adjustment of execution strategies based on monitoring indicators includes: dynamically adjusting the concurrency of tasks in the workflow, computing resource allocation parameters, error retry strategies, or execution priorities according to cluster resource utilization and task resource requirements.

8. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 6, characterized in that the remaining data freshness is calculated as follows: obtain the preset data freshness period from the data table definition, and subtract from it the time elapsed since the most recent data update, that is, remaining data freshness = data freshness period - (current time - most recent data update time).

9. The intelligent scheduling method based on multimodal metadata-driven scheduling according to claim 8, characterized in that, When scheduling workflows, prioritize workflows with lower remaining data freshness and / or higher task priority.

10. A multimodal metadata-driven intelligent scheduling system, wherein the system is based on the method described in any one of claims 1-9, characterized in that the system includes: a metadata service module, used to obtain multimodal metadata, which includes at least a data directory describing data modality attributes, data tables with logical storage structure, data lineage representing the conversion relationship between data tables, and workflow definitions composed of data tables and data lineage; a scheduling engine module, used to construct a global relationship graph based on the metadata, with data tables as nodes and data lineage as directed edges, to automatically divide the global relationship graph into multiple sub-relationship graphs that do not have mutual dependencies using graph algorithms, and to create or update a corresponding workflow for each sub-relationship graph; and a monitoring and execution module, used to schedule and execute the workflows, continuously monitor predefined indicators during execution, and dynamically adjust the execution strategy based on the monitoring indicators.