An agent-oriented bioinformatics analysis method
By establishing a registry of bioinformatics tools and automating data format conversion, the agent can automatically perform multi-step bioinformatics analysis without needing to understand the internal details of the tools. This solves the problem of format incompatibility between tools and improves the reliability and credibility of the analysis process.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BLOOMAGE ENGINE BIOTECHNOLOGY (TIANJIN) CO LTD
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing intelligent agents cannot automatically handle format incompatibility issues between tools in bioinformatics analysis, resulting in a fragile, error-prone, and unreliable analysis process.
A registry for bioinformatics tools is established to record the input and output formats and execution conditions of the tools. By comparing the user data format with the registry, a format conversion tool is automatically invoked to standardize the data. Based on the execution conditions, the confidence level of the task score is determined, and data lineage is constructed to trace the reasons for failure.
This enables intelligent agents to automate and reliably perform multi-step bioinformatics analysis without needing to understand the internal details of the tools, improving process robustness and reliability, and ensuring the verifiability and traceability of analysis results.
Figure CN122201456A_ABST
Description
Technical Field
[0001] This invention belongs to the field of bioinformatics analysis technology, specifically relating to a bioinformatics analysis method for intelligent agents.
Background Technology
[0002] With the rapid development of artificial intelligence technology, intelligent agents have shown great potential in scientific research, especially in automated analysis within the field of bioinformatics. Bioinformatics analysis tasks, such as predicting the three-dimensional structure of protein sequences and further docking them with small molecules, typically involve the cascaded execution of multiple heterogeneous and specialized computational tools. These processes are highly complex: the tools come from diverse sources, are developed by different teams, have varying input and output data formats, and the behavioral semantics of the tools themselves are often not explicitly described.
[0003] Researchers rely on manually writing scripts or using workflow engines to orchestrate these tools. However, these methods require users to have in-depth knowledge of the technical details and formatting specifications of each tool, resulting in rigid processes that are difficult to flexibly respond to dynamic task requirements.
[0004] Therefore, intelligent agents have been attempted to automate task planning and tool invocation. While these agents excel at understanding user intent and generating high-level plans, they face significant challenges when actually invoking specific bioinformatics tools.
[0005] Due to a lack of structured understanding of the specific characteristics of bioinformatics tools, agents struggle to automatically handle inherent format incompatibility issues between tools, and are unable to comprehend the non-deterministic results or confidence information that may arise during tool execution. This results in agent-driven analysis processes that are fragile and prone to errors, and once an error occurs, the root cause is difficult to trace, and the entire analysis process cannot be reliably reproduced.
[0006] Therefore, there is an urgent need in existing technologies to build a supporting framework that enables intelligent agents to reliably and automatically execute multi-step bioinformatics analysis processes without delving into the internal implementation details of each tool, while ensuring the verifiability and traceability of the entire process.
Summary of the Invention
[0007] This invention provides a bioinformatics analysis method for intelligent agents, enabling intelligent agents to automatically plan, adaptively execute, and perform full-process reliable quantification of bioinformatics analysis tasks of various types and steps without needing to understand the internal implementation details of the tools.
[0008] The technical solution adopted in this invention is as follows: A bioinformatics analysis method for intelligent agents, comprising: pre-establishing a bioinformatics tool registry, where each tool registration includes an input data format, execution conditions, and an output data format. The method further includes: receiving a bioinformatics task description and comparing the data format of the biological information in it against the bioinformatics tool registry; and, in response to the data format comparison result, performing data standardization so that the bioinformatics task can be executed, and determining the task score confidence level based on the execution conditions.
[0009] The bioinformatics analysis method for intelligent agents of this invention also has the following additional technical features: Comparing the data format of the biological information includes: comparing the data format of the biological information in the bioinformatics task description with the input data format recorded in the bioinformatics tool registry. If the formats match, the biological information is used directly to perform the bioinformatics task; if they do not match, a format conversion tool is invoked to convert the biological information into the corresponding input data format for use in performing the bioinformatics task.
[0010] Performing bioinformatics tasks specifically involves one of the following: executing a first bioinformatics task based on any one item of the biological information; or, where the bioinformatics task description includes multiple types of biological information, executing a second bioinformatics task based on the standardized biological information; or, where the bioinformatics task description includes multiple types of biological information, executing the first bioinformatics task based on any one item of the biological information and then executing a second bioinformatics task between the items of biological information, based either on the standardized biological information together with the execution result of the first bioinformatics task, or on the execution result of the first bioinformatics task alone.
[0011] The task score confidence level is determined based on the execution conditions as follows: The execution conditions include standardization processing, a first bioinformatics task, and a second bioinformatics task. For standardization processing, no task score confidence adjustment is performed; for the first bioinformatics task, confidence information for the task result is generated; for the second bioinformatics task, a comprehensive bioinformatics score is generated. When only standardization processing is performed, the task score confidence is set to its maximum value; otherwise, the task score confidence is determined based on the task result confidence information and/or the comprehensive bioinformatics score.
[0012] After the bioinformatics tasks are performed, the method also includes: constructing a data lineage for the task execution based on the tools in the bioinformatics tool registry and the input-output relationships of all the tools; and adjusting the task score confidence level based on the length of the data lineage.
[0013] The data lineage for task execution is established as follows: Based on the input and output data format of the tool, create corresponding data nodes and tool execution nodes, and establish directed edges from the input data node to the tool execution node and from the tool execution node to the output data node; Based on the input-output relationships of all tools, the tool execution nodes are connected to obtain a lineage diagram.
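The lineage construction in [0013] can be sketched as a small directed graph. This is a minimal illustration only, assuming each tool execution is available as a (tool, inputs, outputs) record; the names `ToolRun`, `build_lineage`, and `upstream_tools`, as well as the file identifiers, are hypothetical and do not come from the patent text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRun:
    """One tool execution (hypothetical record shape, not defined by the source)."""
    tool: str
    inputs: tuple
    outputs: tuple


def build_lineage(runs):
    """Create directed edges: input data -> tool execution -> output data."""
    edges = defaultdict(set)
    for run in runs:
        tool_node = f"tool:{run.tool}"
        for data_id in run.inputs:
            edges[f"data:{data_id}"].add(tool_node)
        for data_id in run.outputs:
            edges[tool_node].add(f"data:{data_id}")
    return edges


def upstream_tools(edges, data_id):
    """Trace back from a data node to every tool execution it depends on."""
    parents = defaultdict(set)
    for src, dsts in edges.items():
        for dst in dsts:
            parents[dst].add(src)
    seen, stack, tools = set(), [f"data:{data_id}"], []
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
                if parent.startswith("tool:"):
                    tools.append(parent[len("tool:"):])
    return sorted(tools)


runs = [
    ToolRun("smiles_to_mol2", ("ligand.smi",), ("ligand.mol2",)),
    ToolRun("structure_prediction", ("protein.fasta",), ("protein.bcif",)),
    ToolRun("docking", ("protein.bcif", "ligand.mol2"), ("poses.sdf",)),
]
lineage = build_lineage(runs)
print(upstream_tools(lineage, "poses.sdf"))
# ['docking', 'smiles_to_mol2', 'structure_prediction']
```

Tool execution nodes become connected through shared data nodes, so tracing back from a failed or low-confidence output reaches every upstream tool that contributed to it.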
[0014] The execution conditions also include failure modes: the failure modes include at least failure causes arising from an incorrect input data format. Based on the unique identifier of a failure cause, the cause of failure is located by tracing back through the data lineage, and the recoverable adjustment conditions are determined according to the type of failure cause.
[0015] The bioinformatics tasks include at least one of protein structure prediction, small molecule structure docking, genome sequence analysis, and bioactivity analysis.
[0016] Based on standardized protein structure data, an AI-based protein structure prediction tool is invoked to perform the protein structure prediction task. The predicted protein structure, along with the standardized small molecule ligand, is input into the ligand docking tool to perform the docking task. For genome sequence analysis, genomic sequence data is assembled and aligned against a reference genome. For bioactivity analysis, bioactivity data is modeled and analyzed.
[0017] The second aspect of this invention provides a bioinformatics analysis device for intelligent agents, comprising: a registration module, used to pre-establish a bioinformatics tool registry, where each tool registration includes an input data format, execution conditions, and an output data format; a parsing and comparison module, used to receive the bioinformatics task description and compare the data format of the biological information against the bioinformatics tool registry; and an execution engine module, used to standardize data in response to the data format comparison results so that the bioinformatics tasks can be performed, and to determine the task score confidence level based on the execution conditions.
[0018] Due to the adoption of the above technical solution, the beneficial effects achieved by this invention are as follows: 1. This invention structures the input data format, execution conditions, and output data format of tools through a registry, transforming the originally implicit and heterogeneous tool interfaces into a unified and queryable standardized contract. This breaks down the barriers between intelligent agents and heterogeneous tools, allowing agents to obtain tool usage specifications by querying the registry without pre-programming or learning the internal implementation details of each tool. This significantly lowers the technical threshold for agents to integrate and invoke bioinformatics tools.
[0019] Furthermore, by automating the comparison between user-submitted bioinformatics data formats and the input formats of target tools in the registry, pre-emptive diagnostics of data compliance are achieved. Potential format conflicts can be accurately identified before task execution, reducing runtime errors and significantly improving the robustness and predictability of the analysis process.
[0020] By incorporating format conversion tools into the registry for unified management, the system automatically calls the matching conversion tool to standardize the data when comparison results are inconsistent. This fully automates and streamlines the format adaptation process, which previously required manual intervention. This not only eliminates process interruptions caused by format incompatibility but also enables the system to adaptively handle diverse input data.
[0021] Furthermore, by incorporating the tool's execution conditions into the registration scope, semantic attributes such as whether the tool's execution is non-deterministic and whether it outputs confidence information can be explicitly expressed. Consequently, during task execution, the confidence level of the task score can be automatically captured, integrated, and calculated based on these registered execution conditions. This transforms the analysis results from isolated numerical values or files into standardized output objects with explicit confidence indicators, greatly enhancing the reliability and practical value of the automated analysis process.
Attached Figure Description
[0022] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings: Figure 1 is a flowchart illustrating the agent-oriented bioinformatics analysis method according to one embodiment of the present invention.
Detailed Implementation
[0023] To more clearly illustrate the overall concept of the present invention, a detailed description will be provided below with reference to the accompanying drawings and examples.
[0024] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and therefore the scope of protection of the invention is not limited to the specific embodiments disclosed below.
[0025] As shown in Figure 1, a bioinformatics analysis method for intelligent agents includes: S100: A bioinformatics tool registry is pre-established, where each tool registration includes an input data format, execution conditions, and an output data format.
[0026] The core objective of this step is to build a structured tool knowledge base that can be understood and queried by intelligent agents. Its goal is to abstract and standardize bioinformatics computing tools from diverse sources and with varying interfaces into explicit services, thereby shielding the heterogeneity and complexity of the underlying tools.
[0027] Registration involves more than just recording tool names and paths. For each tool to be integrated (e.g., a protein structure prediction tool), it needs to be defined in a machine-readable, structured format, clearly declaring the core content, including: The input port explicitly specifies the data format for each input parameter. The output port explicitly specifies the data format for the tool's output results.
[0028] Specifically, the data format of each input parameter includes, but is not limited to, data type, format, and constraints. The data format of each output parameter includes, but is not limited to, data type, format, and core metrics.
[0029] The constraints include, but are not limited to, parameter types, default values, value ranges, and scenario adaptation suggestions.
[0030] Specifically, the protein sequence format input to the protein structure prediction tool can be FASTA or FASTQ, and the output protein structure format is BCIF. Small molecule formats can be mol2, smiles, or sdf; The standardized small molecule format for ligand docking tools can be mol2.
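A registry entry of the kind described in [0027] to [0030] might look like the following. This is a sketch under stated assumptions, not the patent's actual schema: the class `ToolContract`, its field names, the port names, and the helper `accepts` are hypothetical, and the docking tool's output format is assumed because the text does not state it. The FASTA/FASTQ, BCIF, and mol2 formats are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical machine-readable registry entry for one tool."""
    name: str
    input_formats: dict   # port name -> tuple of accepted formats
    output_formats: dict  # port name -> produced format
    execution_semantics: str
    deterministic: bool
    emits_confidence: bool


REGISTRY = {
    "structure_prediction": ToolContract(
        name="structure_prediction",
        input_formats={"sequence": ("fasta", "fastq")},
        output_formats={"structure": "bcif"},
        execution_semantics="first_task",
        deterministic=False,
        emits_confidence=True,
    ),
    "ligand_docking": ToolContract(
        name="ligand_docking",
        input_formats={"structure": ("bcif",), "ligand": ("mol2",)},
        output_formats={"poses": "sdf"},  # assumption: not stated in the text
        execution_semantics="second_task",
        deterministic=False,              # assumption
        emits_confidence=False,           # assumption
    ),
}


def accepts(tool_name, port, data_format):
    """Return True if the registered tool accepts this format on this port."""
    contract = REGISTRY[tool_name]
    return data_format.lower() in contract.input_formats.get(port, ())


print(accepts("structure_prediction", "sequence", "FASTA"))  # True
print(accepts("ligand_docking", "ligand", "smiles"))         # False
```

Keeping the entries as plain declarative data is what lets an agent query usage rules without inspecting the tool itself.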
[0031] The core purpose of this step is to provide a standard for pre-validating parameters, so as to avoid tool execution failure or distorted results due to improper parameter configuration.
[0032] Specifically, the execution conditions include execution semantics, such as data standardization and bioinformatics tasks. Different execution semantics exhibit varying degrees of determinism and idempotency. The division of execution semantics can be used to determine the confidence level of task scoring.
[0033] All the tool contracts defined above are persistently stored in a centralized database or configuration center, forming a globally accessible bioinformatics tool registry.
[0034] This step addresses the lack of a unified cognitive standard when agents invoke tools. By pre-registering, the usage specifications of the tools are made explicit and standardized. This allows agents to automate and standardize tool invocation and process assembly simply by querying the registry, without needing to understand the internal implementation of the tools.
[0035] The method further includes: S200: Receive the bioinformatics task description and compare the data format of the biological information in it against the bioinformatics tool registry.
[0036] This step aims to receive high-level task instructions from users and, based on the tool's knowledge base, automatically verify and diagnose the compliance of the format of the core input data required for the task, providing a basis for decision-making in subsequent dynamic planning of execution paths.
[0037] Receive standardized task requests (e.g., JSON format) submitted by the agent. Parse the request and extract bioinformatics parameters. For example, in a sequence-based protein-ligand docking scenario, the core is to extract: The first attribute is the protein sequence data and its format (such as FASTA) provided by the user. The second attribute is the small molecule data provided by the user and its format (such as smiles).
[0038] Based on the task objectives, the key tools required to perform the task (e.g., protein structure prediction tools and molecular docking tools) are retrieved from the bioinformatics tool registry. Then, the extracted bioinformatics data format is compared item by item with the input data format requirements declared by these target tools in the registry.
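The item-by-item comparison in S200 can be illustrated with a short function. This is a sketch only: the function name `compare_formats`, the port names, and the three status labels are assumptions made for illustration, not terms from the patent.

```python
def compare_formats(provided, required):
    """Compare user-supplied formats with registered input formats, port by port.

    provided: port name -> format declared in the task description
    required: port name -> collection of formats the target tool accepts
    """
    report = {}
    for port, accepted in required.items():
        actual = provided.get(port)
        if actual is None:
            report[port] = "missing"
        elif actual.lower() in accepted:
            report[port] = "match"
        else:
            report[port] = "mismatch"
    return report


# Sequence-based protein-ligand docking request, as in the example above.
provided = {"sequence": "FASTA", "ligand": "smiles"}
required = {"sequence": {"fasta", "fastq"}, "ligand": {"mol2"}}
print(compare_formats(provided, required))
# {'sequence': 'match', 'ligand': 'mismatch'}
```

A "mismatch" entry is the signal that a conversion step is needed before the target tool can run.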
[0039] This step enables the system to detect format differences between data and tools. Through automated comparison, it can accurately identify whether the biological information entered by the user can be directly processed by the tool. This avoids runtime errors caused by implicit format mismatches, ensuring process robustness.
[0040] S300: In response to the data format comparison results, perform data standardization so that the bioinformatics tasks can be executed, and determine the task score confidence level based on the execution conditions.
[0041] If the data format submitted by the user does not match the registration input format of the target tool, an automated governance process will be initiated: Search the bioinformatics tools registry for registered format conversion tools that possess the corresponding conversion capabilities. These conversion tools themselves are standardized services governed by contracts, and their registration information clearly records the source and target formats.
[0042] The format conversion tool is automatically invoked, taking the original data as input, performing conversion calculations, and generating a standardized data object that conforms to the input contract of the target tool.
[0043] After completing the necessary data standardization, the corresponding bioinformatics task is executed according to the type described in the bioinformatics task description. The confidence level is determined based on the execution conditions registered for the task being executed.
[0044] For example, in response to the first attribute comparison result, protein structure prediction is performed, and in response to the second attribute comparison result, ligand standardization is performed to perform docking of the predicted protein structure with the standardized ligand.
[0045] For the first attribute (protein sequence format), if the alignment results are consistent (e.g., the user provides FASTA and the prediction tool requires FASTA), then the protein structure prediction tool (such as Boltz) is directly invoked. If they are inconsistent, an automatic decision is made: first, a protein sequence format conversion tool registered in the registry that converts the current format to the target format is invoked; after the conversion is complete, the structure prediction tool is then invoked.
[0046] The logic is the same for the second attribute (small molecule format). For example, if the user provides "smiles" but requires "mol2" for the tool, the registered "smiles→mol2" conversion tool will be automatically inserted and called to complete the standardization.
[0047] After completing all necessary formatting, a standardized protein structure file (e.g., .bcif format) and a standardized small molecule file (e.g., mol2 format) will be obtained. These two standardized data objects will then be automatically input into a molecular docking tool (e.g., Gnina) to perform the final docking calculations.
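The automatic insertion of conversion steps described in [0045] to [0047] can be sketched as a small planning function. This is an illustration under assumptions: the converter table `CONVERTERS`, the function `plan_steps`, and the tool names are hypothetical stand-ins for registry lookups.

```python
# Hypothetical table of registered converters: (source, target) -> tool name.
CONVERTERS = {
    ("smiles", "mol2"): "smiles_to_mol2",
}


def plan_steps(data_format, accepted_formats, target_tool):
    """Return the tool names to run, inserting a converter when formats differ."""
    data_format = data_format.lower()
    if data_format in accepted_formats:
        return [target_tool]
    for target_format in accepted_formats:
        converter = CONVERTERS.get((data_format, target_format))
        if converter is not None:
            return [converter, target_tool]
    raise LookupError(
        f"no registered converter from {data_format!r} to any of {accepted_formats!r}"
    )


protein_plan = plan_steps("FASTA", ("fasta", "fastq"), "structure_prediction")
ligand_plan = plan_steps("smiles", ("mol2",), "ligand_docking")
print(protein_plan + ligand_plan)
# ['structure_prediction', 'smiles_to_mol2', 'ligand_docking']
```

The protein branch runs directly because FASTA is accepted, while the ligand branch gains a conversion step ahead of docking. Raising an error when no converter is registered is a design choice of this sketch; the text does not say how that case is handled.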
[0048] The core value of this step lies in its intelligence and automation, dynamically adjusting the execution path (automatically inserting conversion steps) based on the actual situation (input format). This enables the agent to seamlessly drive complex, multi-step bioinformatics workflows, while users only need to focus on what data to provide and what results to expect, without needing to understand the intermediate format conversions. This solves the problem of workflow interruptions caused by the agent's inability to autonomously handle format incompatibility between tools, significantly improving the success rate and reliability of automated analysis.
[0049] As a preferred embodiment of the present invention, comparing the data format of the biological information includes: comparing the data format of the biological information in the bioinformatics task description with the input data format recorded in the bioinformatics tool registry. If the formats match, the biological information is used directly to perform the bioinformatics task; if they do not match, a format conversion tool is invoked to convert the biological information into the corresponding input data format for use in performing the bioinformatics task.
[0050] This implementation method is based on format contracts to achieve automated decision-making and execution. It aims to enable the system to intelligently determine whether the raw protein sequence data provided by the user can be directly used for downstream analysis tools. If it is not usable, it will automatically trigger the data preprocessing process, thereby ensuring the smoothness and reliability of the entire analysis chain without the need for manual intervention in the details of format conversion.
[0051] The cornerstone of this step is the established bioinformatics tool registry. The actual format of the input data carried in the task description is automatically compared with the registered input format of the retrieved tools.
[0052] Path 1 (Matching): If the actual format matches the registered format, the data is determined to be usable directly. Subsequently, the user-provided sequence data is used as input to initiate the bioinformatics task.
[0053] Path 2 (Mismatch): If the actual format does not match the registered format (e.g., the user provides the Clustal multiple sequence alignment format, while the tool requires FASTA), no error is reported and the process is not interrupted. Instead, an automatic governance process is initiated. A pre-registered dedicated format conversion tool, designed to convert the current format (e.g., Clustal) to the target format (e.g., FASTA), is retrieved from the same tool registry. This conversion tool is invoked as a separate, governed service to complete the format standardization. Afterward, the converted, contract-compliant FASTA data is used to perform the bioinformatics task.
[0054] Each tool invocation is a governed, standardized event. An immutable execution log is generated for each invocation, recording the tool identifier, unique identifiers for input and output data, start / end timestamps, and status. This ensures that every step of the operation is traceable, even in complex paths with automatic transformations.
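An immutable execution record like the one described in [0054] could be modeled as a frozen dataclass. This is a sketch under assumptions: the class `ExecutionRecord`, the helper names, and the use of a SHA-256 content hash as the unique data identifier are illustrative choices, since the text does not specify how identifiers are generated.

```python
import hashlib
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExecutionRecord:
    """One log entry per tool invocation; frozen so it cannot be edited later."""
    tool_id: str
    input_ids: tuple
    output_ids: tuple
    started_at: str
    ended_at: str
    status: str


def content_id(payload: bytes) -> str:
    """Assumed identifier scheme: SHA-256 of the data content."""
    return hashlib.sha256(payload).hexdigest()


def run_logged(tool_id, func, payload: bytes):
    """Run a tool function on bytes and return (result, execution record)."""
    started = datetime.now(timezone.utc).isoformat()
    try:
        result = func(payload)
        status, outputs = "success", (content_id(result),)
    except Exception:
        result, status, outputs = None, "failed", ()
    ended = datetime.now(timezone.utc).isoformat()
    record = ExecutionRecord(
        tool_id, (content_id(payload),), outputs, started, ended, status
    )
    return result, record


# bytes.upper stands in for a real tool here.
result, record = run_logged("uppercase_demo", bytes.upper, b">seq1\nmkv")
print(record.status, record.tool_id)

try:
    record.status = "edited"
except FrozenInstanceError:
    print("record is immutable")
```

Freezing only prevents in-process mutation; a production log would also need append-only storage, which is outside this sketch.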
[0055] In a preferred embodiment of the present invention, performing a bioinformatics task specifically involves one of the following: executing a first bioinformatics task based on any one item of the biological information; or, where the bioinformatics task description includes multiple types of biological information, executing a second bioinformatics task based on the standardized biological information; or, where the bioinformatics task description includes multiple types of biological information, executing the first bioinformatics task based on any one item of the biological information and then executing a second bioinformatics task between the items of biological information, based either on the standardized biological information together with the execution result of the first bioinformatics task, or on the execution result of the first bioinformatics task alone.
[0056] Bioinformatics analysis tasks are highly diverse, ranging from the analysis of intrinsic attributes of a single data type (such as predicting structure from a sequence), to the joint computation of interactions between multiple data types (such as molecular docking), to complex workflows that combine both sequentially. This step establishes differentiated and standardized execution paradigms for different types of tasks by distinguishing the first bioinformatics task, the second bioinformatics task, and their combinations, enabling precise matching of task types and execution strategies.
[0057] Example 1: Independent execution of the first bioinformatics task.
[0058] Users only need to complete a single type of bioinformatics analysis, such as predicting only the three-dimensional structure of a protein without subsequent docking; or although the user's task description contains multiple types of bioinformatics, only one type needs to be processed in the current execution phase.
[0059] The bioinformatics task description is parsed to identify the first type of bioinformatics task to be performed (such as protein structure prediction). Based on this task type, registered tools with the corresponding functions are retrieved from the bioinformatics tool registry.
[0060] Read the corresponding bioinformatics data object (such as a protein sequence file in FASTA format) that has been standardized by the preceding steps from persistent storage.
[0061] The tool is automatically scheduled and executed according to the calling contract (Docker image address, command line template, resource quota, etc.) in the tool registry, and standardized input data is passed to the tool process.
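One way to turn a calling contract into a concrete invocation is to render it into a `docker run` argument list. This is a sketch only: the contract keys, the image address, and the tool's command-line template are hypothetical, and volume mounts are omitted. The `docker run` options used (`--rm`, `--cpus`, `--memory`) are standard Docker CLI flags.

```python
import shlex


def build_docker_argv(contract, params):
    """Render a registry calling contract into a `docker run` argument list."""
    # Quote each value so paths containing spaces survive template substitution.
    quoted = {key: shlex.quote(str(value)) for key, value in params.items()}
    command = contract["command_template"].format(**quoted)
    return [
        "docker", "run", "--rm",
        "--cpus", str(contract["cpus"]),
        "--memory", contract["memory"],
        contract["image"],
        *shlex.split(command),
    ]


contract = {
    "image": "registry.example.org/tools/structure-predictor:1.0",    # hypothetical
    "command_template": "predict --input {input} --output {output}",  # hypothetical
    "cpus": 4,
    "memory": "8g",
}
argv = build_docker_argv(
    contract, {"input": "/data/my protein.fasta", "output": "/data/out"}
)
print(argv)
```

Building an argument list rather than a shell string lets the scheduler pass it to a process launcher without a shell, which avoids quoting problems with user-supplied paths.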
[0062] After the tool completes its execution, the system captures the output file according to the tool's registered output specifications, parses and extracts the core scientific results (protein three-dimensional structure coordinates), and always captures the confidence information generated by the tool.
[0063] Example 2: Independent execution of the second bioinformatics task.
[0064] Users already possess standardized multi-type bioinformatics data (e.g., standardized protein structures and standardized ligand files have been prepared in advance), and only need to perform collaborative analysis tasks between bioinformatics, without needing to perform a preliminary first task.
[0065] Read two or more standardized bioinformatics data objects from persistent storage. Taking molecular docking as an example, load both standardized protein structure data (e.g., BCIF format) and standardized small molecule ligand data (e.g., MOL2 format). The registered ligand docking tool is then invoked, with the two standardized data objects above as joint inputs, to perform the molecular docking calculation.
[0066] Example 3: Serial execution of complex tasks.
[0067] The user task description includes multiple types of bioinformatics (protein sequences + small molecule SMILES), and the task objective requires completing the entire process from raw data to final collaborative analysis results—that is, first performing the first bioinformatics task to generate intermediate results, and then performing the second bioinformatics task based on the intermediate results and another type of standardized data.
[0068] The first task (protein structure prediction) is performed on the protein sequence data. This step produces an intermediate result—a normalized three-dimensional protein structure file.
[0069] The output data object identifier of the first task is explicitly recorded in the execution context, serving as a key input dependency for the subsequent second task.
[0070] Load the results of the first task (protein structure) and the standardized ligand data, input both into the ligand docking tool, and execute the second task (molecular docking).
[0071] Specifically, the task score confidence level is determined based on the execution conditions as follows: The execution conditions include standardization processing, a first bioinformatics task, and a second bioinformatics task. For standardization processing, no task score confidence adjustment is performed; for the first bioinformatics task, confidence information for the task result is generated; for the second bioinformatics task, a comprehensive bioinformatics score is generated. When only standardization processing is performed, the task score confidence is set to its maximum value; otherwise, the task score confidence is determined based on the task result confidence information and/or the comprehensive bioinformatics score.
[0072] In automated bioinformatics analysis workflows, the scientific attributes and sources of uncertainty differ fundamentally across different task types: data standardization tasks (such as format conversion) are typically deterministic processes, with no scientifically significant uncertainty in their results; however, AI-based prediction tasks (such as protein structure prediction) possess inherent uncertainty, requiring the tool to explicitly provide or the system to infer the output confidence level; and for multi-data collaborative analysis tasks (such as molecular docking), the reliability of the final score depends not only on the accuracy of the tool's algorithm but also on the quality of the upstream input data. Existing technologies often treat all tasks the same or focus only on the confidence level of a single task, lacking differentiated modeling and a unified synthesis mechanism for the confidence levels of multiple task types.
[0073] Based on the three major task categories—standardized processing, first bioinformatics task, and second bioinformatics task—defined in the execution conditions, differentiated confidence determination rules are customized for each category to ensure that the evaluation strategy matches the scientific characteristics of the task itself.
[0074] For the first task, actively capture the native confidence output of the tool or infer the confidence based on the execution semantics; for the second task, based on the obtained tool output score, further introduce the confidence information of upstream tasks for comprehensive calibration, and generate a comprehensive bioinformatics score confidence that reflects the uncertainty of the entire link.
[0075] For single-task scenarios that only perform standardized processing, the system directly assigns a full confidence score to avoid introducing unnecessary uncertainty reduction. For complex task scenarios, the system outputs a comprehensive, objective, and interpretable final confidence index through weighted synthesis of confidence information.
[0076] Furthermore, taking protein-ligand docking tasks as an example, format conversion tools (such as SMILES→MOL2) are idempotent: converting the same data repeatedly yields the same result. Variant detection tools (such as GATK) are non-idempotent: repeated execution may produce different results because of random variables in the process. Likewise, a format conversion tool is deterministic, while an AI-driven structure prediction tool is non-deterministic: the same sequence may yield slightly different results across runs due to random initialization.
[0077] Specifically, when performing standardized processing tasks such as data format conversion (e.g., when a user only requests that SMILES be converted to MOL2), the task is determined to be a deterministic operation, and there is a strict and reversible mapping relationship between its output and the input data, and there is no uncertainty in a scientific sense.
[0078] Therefore, no confidence adjustment is performed on the task scores, and no confidence reduction factor is added to the transformation results. The task score confidence is set to full, and the output is explicitly marked with `confidence: 1.0`, indicating that the standardized data can be directly used for downstream analysis, and its format compliance is completely reliable.
[0079] When performing a first bioinformatics task such as protein structure prediction, the following hierarchical strategy is adopted based on the execution conditions in the tool registry. If the tool natively supports confidence score output (e.g., Boltz outputs pLDDT scores), this confidence information is directly captured and used as the confidence of the task result.
[0080] If the tool does not support confidence output but its execution conditions are declared as non-deterministic, a confidence value is estimated from a preset global default confidence (e.g., 0.85) or from a dynamic confidence model based on historical execution statistics.
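The hierarchical strategy for a first task can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ExecutionConditions` fields, the function name, and the use of a single `historical_estimate` number in place of a dynamic confidence model are all assumptions. The final branch returns 1.0 for deterministic tools, following the treatment of deterministic tools described in this specification.

```python
from dataclasses import dataclass
from typing import Optional

# Example global default taken from the text (0.85).
DEFAULT_NONDETERMINISTIC_CONFIDENCE = 0.85


@dataclass(frozen=True)
class ExecutionConditions:
    """Hypothetical subset of a registry entry's execution conditions."""
    deterministic: bool
    native_confidence: bool  # True if the tool reports its own confidence


def first_task_confidence(
    cond: ExecutionConditions,
    native_value: Optional[float] = None,
    historical_estimate: Optional[float] = None,
) -> float:
    """Pick a confidence value for a first bioinformatics task."""
    # Level 1: the tool reports its own confidence, so capture it directly.
    if cond.native_confidence and native_value is not None:
        return native_value
    # Level 2: no native confidence, but the tool is declared non-deterministic.
    if not cond.deterministic:
        if historical_estimate is not None:
            return historical_estimate  # stand-in for a dynamic model
        return DEFAULT_NONDETERMINISTIC_CONFIDENCE
    # Deterministic tool without native confidence: full confidence.
    return 1.0
```

For example, a non-deterministic tool with no native confidence and no historical statistics falls back to the default of 0.85.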
[0081] When performing a second bioinformatics task such as molecular docking, the determination of the confidence level is a comprehensive synthesis process, which specifically includes the following. First, obtain the raw scores and confidence levels output by the tool: after calling the docking tool, parse the output file according to the registry specification to extract the comprehensive bioinformatics score, such as the binding affinity score. Second, retrieve the execution records of the first bioinformatics task on which this second task depends and extract its task result confidence information. If the tool is a deterministic tool (such as a traditional homology modeling tool), the confidence level can be set to 1.0.
[0082] Based on a pre-defined synthesis model, the confidence of the task result and the confidence of the upstream data are weighted and fused to generate the final task score confidence. Example synthesis models include: the multiplicative model (suitable for scenarios where upstream and downstream uncertainties are independent and cumulative); the weighted average model (suitable for scenarios where upstream and downstream uncertainties can be linearly compensated); and the minimum model (suitable for scenarios with significant bottleneck effects).
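The three example synthesis models can be written as small functions. The sketch below is illustrative only: the function names, the `task_weight` parameter, and its default of 0.5 are assumptions, since the text does not specify weights.

```python
def fuse_multiplicative(task_conf: float, upstream_conf: float) -> float:
    """Independent, cumulative uncertainties: multiply the confidences."""
    return task_conf * upstream_conf


def fuse_weighted_average(task_conf: float, upstream_conf: float,
                          task_weight: float = 0.5) -> float:
    """Linearly compensating uncertainties: convex combination."""
    if not 0.0 <= task_weight <= 1.0:
        raise ValueError("task_weight must lie in [0, 1]")
    return task_weight * task_conf + (1.0 - task_weight) * upstream_conf


def fuse_minimum(task_conf: float, upstream_conf: float) -> float:
    """Bottleneck effect: the weakest link determines the result."""
    return min(task_conf, upstream_conf)


# Hypothetical lookup so a registry entry could name its model as a string.
SYNTHESIS_MODELS = {
    "multiplicative": fuse_multiplicative,
    "weighted_average": fuse_weighted_average,
    "minimum": fuse_minimum,
}


def fuse(model: str, task_conf: float, upstream_conf: float) -> float:
    """Fuse a task confidence with its upstream confidence."""
    return SYNTHESIS_MODELS[model](task_conf, upstream_conf)
```

With a docking confidence of 0.8 and an upstream structure confidence of 0.5, the multiplicative model gives 0.4, the equal-weight average gives 0.65, and the minimum model gives 0.5.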
[0083] Bioinformatics tools are diverse, some outputting deterministic results and others probabilistic predictions. Existing technologies often employ a single confidence level processing strategy, leading to distorted evaluation results. This step addresses the problem of inconsistent evaluation scales by using task-type-aware differentiated confidence rules. This ensures that evaluation criteria are scientifically matched to task characteristics, preventing unnecessary reductions for deterministic tasks, preserving inherent uncertainty for predictive tasks, and enabling uncertainty propagation for collaborative tasks.
[0084] In a preferred embodiment of the present invention, after performing the bioinformatics task, the method further includes: based on the tools in the bioinformatics tool registry, and considering the input-output relationships of all tools, constructing a data lineage relationship for task execution; and adjusting the docking confidence level based on the length of the data lineage relationship.
[0085] The core objectives of this implementation method are twofold: first, to achieve explicit and structured recording of the entire analysis process, transforming the complex dependencies of all tool calls, data generation, and consumption during a single task execution into a queryable and traceable data lineage graph; second, to treat the topological complexity of the process (represented by the length of lineage relationships) as a novel quality influencing factor, enabling refined calibration of the confidence level of the final results. The aim is to transcend the trust assessment of single tool results and establish a comprehensive result reliability measurement system based on end-to-end observability and process architecture awareness.
[0086] Iterate through all tool execution instances that are invoked during this task execution and are recorded in the bioinformatics tool registry.
[0087] The data lineage for task execution is established as follows: Based on the input and output data format of the tool, create corresponding data nodes and tool execution nodes, and establish directed edges from the input data node to the tool execution node and from the tool execution node to the output data node; Based on the input-output relationships of all tools, the tool execution nodes are connected to obtain a lineage diagram.
[0088] For each unique data object generated during task execution (whether it's raw input, intermediate files, or the final result), a corresponding data node is created in the lineage graph based on the unique identifier assigned during persistent storage. This node can be associated with metadata such as data format, creation time, and size.
[0089] For each successful tool invocation, a tool execution node is created in the lineage graph based on its unique run_id (e.g., run_101). This node is associated with the tool identifier, version, parameters, and execution status of this invocation.
[0090] Two directed edges are established following a defined causal relationship. The edge pointing from the input data node to the tool execution node indicates that the data object is consumed by this tool execution as its input material.
[0091] The edge pointing from the tool execution node to the output data node indicates that the data object was generated by this tool execution and is its output.
[0092] This step establishes a standardized triplet structure of input → process → output for each tool call, which is the basic unit of lineage.
[0093] The input data nodes of downstream tool execution nodes are necessarily the output data nodes of one or more upstream tool execution nodes. By identifying and matching these shared data nodes, individual triples can be automatically connected.
[0094] After traversing and connecting all relevant tool execution records for this task, a complete lineage diagram is generated. This diagram, starting from the original input data node and ending with the final result data node, clearly shows how the data flows through a series of processing steps and gradually evolves into the final result.
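The construction described above can be sketched as a pass over execution records. This is a minimal illustration under stated assumptions: the `RunRecord` type, its field names, and the data identifiers in the usage note are hypothetical, and a real system would attach the metadata mentioned above (format, version, parameters, status) to each node.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical standardized execution record for one tool invocation."""
    run_id: str
    tool: str
    inputs: Tuple[str, ...]   # identifiers of consumed data objects
    outputs: Tuple[str, ...]  # identifiers of produced data objects


def build_lineage(records: Iterable[RunRecord]):
    """Return (data_nodes, run_nodes, edges) for a set of execution records."""
    data_nodes: Set[str] = set()
    run_nodes: Set[str] = set()
    edges: Set[Edge] = set()
    for rec in records:
        run_nodes.add(rec.run_id)
        for data_id in rec.inputs:      # input data node -> tool execution node
            data_nodes.add(data_id)
            edges.add((data_id, rec.run_id))
        for data_id in rec.outputs:     # tool execution node -> output data node
            data_nodes.add(data_id)
            edges.add((rec.run_id, data_id))
    return data_nodes, run_nodes, edges


def original_inputs(data_nodes: Set[str], edges: Set[Edge]) -> Set[str]:
    """Data nodes that no tool execution produced, i.e. user-submitted data."""
    produced = {dst for _, dst in edges if dst in data_nodes}
    return data_nodes - produced
```

Because a downstream record lists the same data identifier that an upstream record produced, the triples connect automatically through the shared data node; no separate linking step is needed. For a chain in which `run_101` turns `seq_raw` into `seq_fasta` and `run_102` turns `seq_fasta` into `structure_bcif`, `original_inputs` returns only `seq_raw`.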
[0095] The construction of the lineage graph relies directly on standardized execution records (run_id, input / output data ID), which combines dynamic process execution with static data lineage tracing.
[0096] Furthermore, in this embodiment, the length of the data lineage can be precisely defined as the number of tool execution nodes that need to be traversed from the user-submitted original input data node (such as the original protein sequence or the original small molecule) to the final docking result data node. For example, a process containing four steps, "format conversion A, structure prediction B, format conversion C, and docking D," has a lineage length of 4.
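One way to compute this length from execution records is sketched below. The record layout (dictionaries with `inputs` and `outputs` keys) is hypothetical, and for branching processes the sketch takes the longest path back to any original input, which is an assumption; the text only defines the length for a single path.

```python
from typing import Dict, List


def lineage_length(records: List[Dict], final_data_id: str) -> int:
    """Count tool executions on the longest path ending at final_data_id.

    Assumes the records form an acyclic graph and that each data object
    is produced by at most one tool execution.
    """
    # Map every produced data object to the record that produced it.
    producer = {out: rec for rec in records for out in rec["outputs"]}

    def depth(data_id: str) -> int:
        rec = producer.get(data_id)
        if rec is None:
            return 0  # user-submitted original input: no tool execution
        # One execution for this step plus the deepest upstream branch.
        return 1 + max((depth(i) for i in rec["inputs"]), default=0)

    return depth(final_data_id)


# The four-step example from the text: A -> B -> C -> D.
FOUR_STEP = [
    {"run_id": "A", "inputs": ["seq_raw"], "outputs": ["seq_fasta"]},
    {"run_id": "B", "inputs": ["seq_fasta"], "outputs": ["structure"]},
    {"run_id": "C", "inputs": ["structure"], "outputs": ["structure_conv"]},
    {"run_id": "D", "inputs": ["structure_conv"], "outputs": ["docking"]},
]
```

Calling `lineage_length(FOUR_STEP, "docking")` yields 4, matching the example in the text.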
[0097] A confidence adjustment function is predefined. This function takes the original docking confidence (such as the integrated protein structure confidence value obtained in the previous step) and the lineage length of the current task as input, and outputs a final adjusted confidence.
[0098] The adjustment function can reflect the principle that the longer the process, the greater the potential for accumulated uncertainty. For example, a decay factor can be used: Final confidence = Original docking confidence * (decay coefficient)^(lineage length), where the decay coefficient is a constant slightly less than 1 (e.g., 0.98). This means that with each additional processing step, the final confidence will be slightly reduced.
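The decay formula above translates directly into code. The function and parameter names in this sketch are illustrative; the default decay coefficient of 0.98 is the example value from the text.

```python
def adjust_confidence(original: float, lineage_length: int,
                      decay: float = 0.98) -> float:
    """Final confidence = original confidence * decay ** lineage_length."""
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must lie in (0, 1]")
    if lineage_length < 0:
        raise ValueError("lineage_length must be non-negative")
    return original * decay ** lineage_length
```

For an original docking confidence of 0.90 and a lineage length of 4, the result is 0.90 × 0.98⁴, or about 0.83, so each extra processing step costs roughly two percent of the remaining confidence.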
[0099] Using lineage length as a confidence adjustment parameter quantitatively correlates the complexity of engineering processes with the credibility of scientific results. This reflects the common understanding that the more processing steps involved, the greater the chance of potential errors or information loss, making the confidence assessment more rigorous and comprehensive.
[0100] Furthermore, when the confidence level of a result is low, a lineage diagram can be used for diagnosis. If the lineage diagram shows a lengthy process (e.g., due to repeated format conversions), it indicates that there is room for optimization in the process design, thereby driving users or the system to choose more integrated tools or more direct process paths in the future.
[0101] In addition, the execution conditions also include failure modes. The failure modes include at least a failure cause of incorrect input data format. Based on the unique identifier of the failure cause, the cause of failure is located by tracing back through the data lineage, and the recoverable adjustment conditions are determined according to the type of failure cause.
[0102] In automated bioinformatics analysis systems, issues such as incompatible data formats, out-of-bounds parameters, insufficient resources, internal tool errors, and algorithm convergence failures occur frequently. Current technologies handle these failures in extremely rudimentary ways: either simply throwing an exception to terminate the process, or overwhelming error logs with large amounts of text output, requiring manual review and troubleshooting.
[0103] By explicitly declaring failure modes in the tool registry, including the failure cause type, unique identifier, and recoverability identifier, this invention turns failures into predictable, categorizable, and manageable structured information. When a process fails, instead of relying on manual log review, the system uses the pre-constructed data lineage graph to traverse backwards along directed edges, accurately locating the root cause node and its corresponding unique failure cause identifier.
[0104] When a complex task (such as the entire protein-ligand docking process) fails at a certain stage, the failure backtracking engine is activated. First, the execution node that directly caused the failure is located, and its unique identifier for the failure reason is obtained. The constructed data lineage graph is accessed, and starting from the failed node, the process is traversed backwards along the directed edges, i.e., tracing back from the tool execution node to the input data node, and then tracing back from the input data node to the upstream tool execution node, recursively.
[0105] During the reverse traversal, the execution status and failure records of each upstream node are checked. Two scenarios are possible: Scenario A: The direct cause is the root cause. For example, the direct cause of ligand docking failure is an incorrect input ligand format, where the ligand data is the user-submitted SMILES string without standardization. In this case, the failure node itself is the root cause node.
[0106] Scenario B: The root cause lies upstream. For example, the apparent cause of ligand docking failure is an incorrect ligand format. However, further investigation reveals that although the upstream ligand normalization step shows a successful status, its output MOL2 file is missing hydrogen atoms due to a version defect in the conversion tool, so the docking tool fails to parse the file. In this case, the root cause is the upstream normalization processing node.
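The backward traversal and the two scenarios can be sketched as follows. Everything here is a hypothetical illustration: the `Run` type, the failure identifiers such as `E_MISSING_HYDROGENS`, and in particular the `output_defects` field, which assumes that some output validator has recorded defects found in the outputs of runs that nominally succeeded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Run:
    """Hypothetical execution record carrying status and failure details."""
    run_id: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    status: str = "success"
    failure_id: Optional[str] = None
    # Defects found in outputs of a nominally successful run (assumption).
    output_defects: Dict[str, str] = field(default_factory=dict)


def find_root_cause(runs: List[Run], failed_run_id: str) -> Tuple[str, str]:
    """Walk the lineage backwards; return (root_cause_run_id, failure_id)."""
    by_id = {r.run_id: r for r in runs}
    producer = {out: r for r in runs for out in r.outputs}

    def visit(run: Run, cause_id: str) -> Tuple[str, str]:
        for data_id in run.inputs:
            upstream = producer.get(data_id)
            if upstream is None:
                continue  # user-submitted data: nothing further upstream
            if upstream.status == "failed":
                return visit(upstream, upstream.failure_id or "UNKNOWN")
            defect = upstream.output_defects.get(data_id)
            if defect is not None:
                return visit(upstream, defect)
        # No faulty upstream step found: this node is the root cause.
        return run.run_id, cause_id

    failed = by_id[failed_run_id]
    return visit(failed, failed.failure_id or "UNKNOWN")
```

In Scenario A the docking run consumes an unconverted user input, so no upstream producer exists and the failed node itself is returned. In Scenario B the conversion run carries a recorded defect on its MOL2 output, so the traversal moves one step upstream and reports that run instead.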
[0107] Based on the recoverable adjustment conditions preset for the failure cause type, the system can automatically determine whether the current failure is recoverable and which adjustment strategy is needed (such as format conversion, parameter correction, resource expansion, or tool switching), and then automatically perform the recovery operation or generate recovery suggestions.
[0108] When the failure is due to an incorrect input data format and is recoverable, the corresponding format conversion tool is automatically inserted before the failure node, and the toolchain is re-planned and re-executed.
[0109] When the failure is due to parameter-related issues such as algorithm non-convergence or insufficient resources, the parameter configuration will be automatically corrected according to the parameter adjustment strategy declared in the registry, and the execution will be retried.
[0110] When the failure is due to the current tool being unsuitable, the registry is checked for an alternative tool with the same functionality but different execution conditions. The tool is then automatically switched and the process is repeated. For example, when GPU memory is insufficient, the tool is switched from Boltz to the CPU version of AlphaFold.
[0111] When the failure is due to an unrecoverable reason (such as exceeding the sequence length limit or severe data corruption), a structured failure report is generated, which includes the failure reason identifier, root cause location, explanation of unrecoverable reasons, and manual handling suggestions, and is presented to the user through the intelligent agent.
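The recovery decision described in the preceding paragraphs amounts to a lookup from failure cause type to strategy, with a structured report as the fallback. In this sketch the cause-type labels, strategy names, and report fields are hypothetical placeholders chosen to mirror the text, not identifiers from any real registry.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical failure-mode declaration from a tool registry entry."""
    failure_id: str
    cause_type: str    # e.g. "input_format", "parameter", "tool_unsuitable"
    recoverable: bool
    explanation: str = ""


# Strategy per cause type, following the cases listed in the text.
STRATEGY_BY_CAUSE: Dict[str, str] = {
    "input_format": "insert_format_conversion",
    "parameter": "correct_parameters_and_retry",
    "tool_unsuitable": "switch_to_alternative_tool",
}


def plan_recovery(mode: FailureMode, root_cause_run_id: str) -> Dict[str, str]:
    """Return a recovery action, or a structured failure report."""
    strategy = STRATEGY_BY_CAUSE.get(mode.cause_type)
    if mode.recoverable and strategy is not None:
        return {"action": strategy, "target_run": root_cause_run_id}
    # Unrecoverable (or unknown) cause: report instead of retrying.
    return {
        "action": "report",
        "failure_id": mode.failure_id,
        "root_cause": root_cause_run_id,
        "explanation": mode.explanation or "no recovery strategy declared",
        "suggestion": "manual handling required",
    }
```

Keeping the mapping as data instead of branching logic means a new failure cause type can be supported by extending the table, which fits the registry-driven design the text describes.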
[0112] As a preferred embodiment of the present invention, the bioinformatics task includes at least one of protein structure prediction, small molecule structure docking, genome sequence analysis, and bioactivity analysis.
[0113] This invention constructs a general execution governance framework for multimodal biological data that is compatible with heterogeneous analysis tools and can be widely applied to various bioinformatics analysis tasks.
[0114] Based on standardized protein structure data, an AI-based protein structure prediction tool is invoked to perform protein structure prediction tasks.
[0115] Register an AI-based protein structure prediction tool (such as Boltz), specifying the input format (FASTA), output format (BCIF), and execution conditions.
[0116] The system compares the protein sequence format submitted by the user with FASTA, and automatically calls the sequence format conversion tool when there is a discrepancy.
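Registration and format comparison can be sketched together. The `ToolEntry` fields, the `GENBANK` source format, and the converter name `sequence_format_converter` are hypothetical; only the Boltz entry's FASTA input and BCIF output come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToolEntry:
    """Hypothetical registry entry: formats plus execution conditions."""
    name: str
    input_formats: Tuple[str, ...]
    output_format: str
    deterministic: bool
    native_confidence: bool


REGISTRY: Dict[str, ToolEntry] = {
    "boltz": ToolEntry("boltz", ("FASTA",), "BCIF",
                       deterministic=False, native_confidence=True),
}

# (source format, target format) -> converter tool; names are placeholders.
CONVERTERS: Dict[Tuple[str, str], str] = {
    ("GENBANK", "FASTA"): "sequence_format_converter",
}


def plan_steps(tool_name: str, user_format: str) -> List[str]:
    """Return the tool chain needed to run tool_name on user_format data."""
    entry = REGISTRY[tool_name]
    if user_format in entry.input_formats:
        return [tool_name]               # formats match: use data directly
    for target in entry.input_formats:
        converter = CONVERTERS.get((user_format, target))
        if converter is not None:
            return [converter, tool_name]  # insert conversion step first
    raise ValueError(f"no conversion from {user_format} for {tool_name}")
```

Here `plan_steps("boltz", "FASTA")` returns just the prediction tool, while `plan_steps("boltz", "GENBANK")` prepends the conversion step. The agent only consults the declared formats and never inspects the tool's internals.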
[0117] The pLDDT score output by the tool is captured and used as the confidence level of the task score.
[0118] Specifically, this involves using an artificial intelligence (AI)-based protein structure prediction tool, which receives a protein sequence and outputs the protein's three-dimensional structure, including its three-dimensional structural coordinates.
[0119] By invoking advanced AI prediction tools, the input one-dimensional protein sequence information is automatically and accurately transformed into a three-dimensional spatial structure. This result, along with its inherent quality assessment information, is then encapsulated into a standardized data object that can be directly used in downstream processes for subsequent molecular docking.
[0120] The AI-based protein structure prediction tool invoked in this step is a service that has been pre-registered in the bioinformatics tool registry. Its registration information explicitly specifies that the input is a protein sequence in a specific format (such as FASTA), and the output is a structure file containing atomic-level three-dimensional coordinates (such as .bcif or .pdb format). According to this contract, after passing format verification, a runtime instance of the tool is automatically scheduled and launched (e.g., in a dedicated container or computing environment), and the well-governed, standardized protein sequence data is accurately delivered to the tool.
[0121] After the AI prediction tool finishes running, its output is not simply treated as an unstructured file. Instead, the system actively captures and parses the generated structure file according to the output specifications defined in the registry. The output protein 3D structure is a standardized data object containing the 3D coordinates of all atoms.
[0122] In addition, it also captures structural confidence information (e.g., global pLDDT score or local confidence for each residue) that is included in the tool output. This information is a quantitative assessment of the uncertainty of the AI prediction tool's own prediction results.
[0123] The very nature of AI predictions dictates that their results are inherently uncertain. By not only acquiring structural coordinates but also forcibly capturing their confidence level indicators, this uncertainty is propagated throughout the process (e.g., for subsequent adjustments to the docking confidence level), enhancing the scientific rigor of the entire analysis chain.
[0124] The predicted protein structure, along with standardized small molecule ligands, is input into the ligand docking tool to perform the docking task.
[0125] Register a molecular docking tool (such as Gnina), specifying the input format (protein structure BCIF, small molecule MOL2), output format (JSON), and execution conditions (output docking score and confidence level).
[0126] The system compares the user-submitted small molecule formats with MOL2, and automatically calls conversion tools such as SMILES→MOL2 when there is a discrepancy.
[0127] The docking score and tool confidence are captured and combined with the upstream protein structure prediction confidence for comprehensive calibration.
[0128] This includes inputting the predicted protein structure, along with the standardized ligand, into a ligand docking tool, performing docking calculations, and obtaining docking results that include binding affinity scores.
[0129] This embodiment aims to ensure that the input data (protein structure and ligand) used for docking calculations have been rigorously verified and standardized through the aforementioned steps, and that docking is performed in a controlled and traceable manner to ultimately obtain structured results that include key scientific indicators (binding affinity scores).
[0130] The input to this step is not raw or arbitrary data, but rather two standardized data objects obtained after processing by upstream steps: a predicted protein structure, from the protein structure prediction step, which is already a 3D structure file in a specific format (e.g., bcif) with confidence information; and a standardized ligand, from the ligand standardization step, which has been converted into a small molecule 3D structure file in the format required by the docking tool (e.g., mol2).
[0131] The ligand docking tool is automatically scheduled and executed according to the calling contract in the tool registry. This process takes place in a controlled environment, with the system monitoring its execution status. The docking tool receives the two standardized files mentioned above and executes its internal molecular interaction simulation and conformation search algorithms.
[0132] After the docking tool completes its operation, the system does not simply collect the output files. Instead, it actively parses and extracts key scientific result data according to the output specifications declared in the tool's registry. The core output is the docking result, which includes a binding affinity score. This score (e.g., -9.1 kcal/mol) quantifies the theoretical binding strength between the protein and ligand in numerical form.
[0133] In addition, the results typically include other structured information parsed from the tool output, such as the optimal binding conformation coordinates and the list of interacting amino acid residues. All of this information is encapsulated by the system into a structured data object (such as JSON format).
[0134] Genomic sequence data is assembled and aligned to a reference genome. Users submit raw sequencing data (e.g., in FASTQ format) and request the completion of a genomic analysis workflow, from sequence alignment to variant detection.
[0135] Register the following tools and their contracts in advance in the bioinformatics tool registry: alignment tools (such as BWA-MEM), input format FASTQ, output format BAM, execution condition is deterministic tool, and failure modes include missing reference genome index, incorrect input quality value format, etc.
[0136] A sorting and deduplication tool (such as SAMtools) takes BAM as input and outputs coordinate-sorted BAM; its execution condition is deterministic tool.
[0137] A variant detection tool (such as GATK HaplotypeCaller) takes BAM as input and outputs VCF. Its execution condition is nondeterministic tool (some algorithms involve random sampling), and it can output a variant quality (QUAL) value as confidence information.
[0138] The system receives the task description. It compares the user-submitted sequencing data format (FASTQ) with the input format (FASTQ) of the alignment tool. If they match, the data is used directly. If the user provides FASTA or other formats, the system automatically calls a format conversion tool (such as seqret) for standardization.
[0139] The system sequentially performs the alignment, sorting and deduplication, and variant detection steps. For the variant detection tool, the system captures the output QUAL value (the quality score of each variant) and maps it to a confidence index in the range of 0-1.
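The text does not state how QUAL is mapped to the 0-1 range. One plausible mapping, shown here as an assumption, uses the fact that the VCF QUAL field is Phred-scaled: QUAL = -10·log10(P), where P is the probability that the call is wrong, so the confidence can be taken as 1 - P. The function name is illustrative.

```python
def qual_to_confidence(qual: float) -> float:
    """Map a Phred-scaled QUAL value to a confidence in [0, 1).

    Inverts QUAL = -10 * log10(P_error) and returns 1 - P_error.
    """
    if qual < 0:
        raise ValueError("QUAL must be non-negative")
    p_error = 10.0 ** (-qual / 10.0)
    return 1.0 - p_error
```

Under this mapping QUAL 10 corresponds to a confidence of 0.9 and QUAL 30 to 0.999. Other mappings, such as capping and linear rescaling, would also satisfy the text's requirement of a 0-1 index.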
[0140] For complex tasks (such as tumor-normal paired analysis), the confidence scores of tumor sample variants can be combined with the information of normal sample controls based on the execution conditions in the registry to generate the confidence scores of the final variant scores.
[0141] Modeling and analyzing bioactivity data. Users submit compound structures (e.g., in SDF format) and request predictions of their bioactivity (e.g., IC50, EC50) or toxicity properties against specific targets.
[0142] Register cheminformatics tools (such as RDKit and OpenBabel) for molecular structure standardization and fingerprint calculation. Input formats include SDF and SMILES, and the output format is a structured fingerprint vector. The execution condition is deterministic tool.
[0143] Register a machine learning prediction model (such as a random forest or graph neural network model), with molecular fingerprint or molecular graph as input and activity prediction value (such as pIC50) and prediction confidence interval as output. The execution condition is nondeterministic tool (the model inference has inherent errors), and declare the confidence output capability.
[0144] The system receives the task description and compares the molecular format submitted by the user with the input format required by the model (such as a specific fingerprint type or graph representation). If they are inconsistent, it automatically calls a conversion tool (such as SDF→ECFP4 fingerprint, SMILES→molecular graph) to standardize the data.
[0145] If the user only provides the SMILES string, the system will automatically call the structure generation tool to convert it into 3D SDF format and add a semantic loss coefficient for the conversion.
[0146] Call the machine learning model to perform activity prediction and output the pIC50 value and the model's built-in confidence (such as standard deviation and dropout uncertainty).
[0147] If the model does not directly output confidence scores, confidence scores are generated based on the preset confidence estimation rules in the registry (such as based on the training set RMSE).
[0148] For complex processes involving molecular structure generation, the confidence level of structure generation and the confidence level of model prediction are multiplied together to obtain the comprehensive confidence level of the final activity score.
[0149] This preferred embodiment also supports the free combination of the above-mentioned various task types, and there are no restrictions on this.
[0150] A second aspect of the present invention provides a bioinformatics analysis device for intelligent agents, comprising: The registration module is used to pre-establish a registry for bioinformatics tools, where tool registration includes input data format, execution conditions, and output data format. The parsing and comparison module is used to receive the bioinformatics task description and compare the data format of the bioinformatics information according to the bioinformatics tool registry. The execution engine module is used to standardize data in response to the data format comparison results in order to perform bioinformatics tasks and determine the task score confidence level based on the execution conditions.
[0151] Therefore, the device can achieve any of the effects of the bioinformatics analysis method for intelligent agents described above, which will not be elaborated here.
[0152] For any parts not mentioned in this invention, existing technologies can be used or referenced.
[0153] The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.
[0154] The above description is merely an embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the present invention should be included within the scope of the claims of the present invention.
Claims
1. A bioinformatics analysis method for intelligent agents, characterized in that it includes: pre-establishing a bioinformatics tool registry, where tool registration includes input data format, execution conditions, and output data format; the method further includes: receiving the bioinformatics task description and comparing the data format of the biological information according to the bioinformatics tool registry; and, in response to the data format comparison result, performing data standardization in order to perform the bioinformatics task, and determining the task score confidence level based on the execution conditions.
2. The method according to claim 1, characterized in that, Comparing data formats in bioinformatics, including: The data format of the biological information in the description of the bioinformatics task is compared with the data format entered in the bioinformatics tool registry. If the comparison matches, it can be used to perform bioinformatics tasks; If the comparison is inconsistent, a format conversion tool is invoked to convert the biological information into the corresponding input data format for use in performing bioinformatics tasks.
3. The method according to claim 1, characterized in that performing bioinformatics tasks specifically includes: based on any one of the biological information, executing a first bioinformatics task; or, where the bioinformatics task description includes multiple types of biological information, executing a second bioinformatics task based on the standardized biological information; or, where the bioinformatics task description includes multiple types of biological information, executing a first bioinformatics task based on any one of the biological information, and executing a second bioinformatics task between the biological information based on the standardized biological information and the execution result of the first bioinformatics task, or based on the execution result of the first bioinformatics task.
4. The method according to claim 3, characterized in that, The task scoring confidence level is determined based on the aforementioned execution conditions, specifically as follows: The execution conditions include standardized processing, a first bioinformatics task, and a second bioinformatics task; For the standardization process, no task score confidence adjustment is performed; for the first bioinformatics task, confidence information is generated; for the second bioinformatics task, a comprehensive bioinformatics score is generated. When only the standardization process is performed, the maximum score for the task rating confidence is set; Otherwise, the confidence level of the task score is determined based on the confidence information of the task results and / or the comprehensive score of the bioinformatics.
5. The method according to claim 1, characterized in that, after performing bioinformatics tasks, the method further includes: based on the tools in the bioinformatics tool registry, and considering the input-output relationships of all tools, constructing a data lineage relationship for task execution; and adjusting the confidence level of the task score based on the length of the data lineage relationship.
6. The method according to claim 5, characterized in that, The data lineage for task execution is established as follows: Based on the input and output data format of the tool, create corresponding data nodes and tool execution nodes, and establish directed edges from the input data node to the tool execution node and from the tool execution node to the output data node; Based on the input-output relationships of all tools, the tool execution nodes are connected to obtain a lineage diagram.
7. The method according to claim 5, characterized in that the execution conditions also include failure modes; the failure modes include at least a failure cause of incorrect input data format; and, based on the unique identifier of the failure cause, the cause of failure is located by tracing back through the data lineage, and the recoverable adjustment conditions are determined according to the type of failure cause.
8. The method according to claim 1, characterized in that, The bioinformatics tasks include at least one of protein structure prediction, small molecule structure docking, genome sequence analysis, and bioactivity analysis.
9. The method according to claim 8, characterized in that, Based on standardized protein structure data, an AI-based protein structure prediction tool is invoked to perform protein structure prediction tasks. The predicted protein structure, along with standardized small molecule ligands, is input into the ligand docking tool to perform the docking task. The assembling and alignment of genomic sequence data with a reference genome; Modeling and analyzing bioactivity data.
10. A bioinformatics analysis device for intelligent agents, characterized in that it includes: a registration module, used to pre-establish a registry for bioinformatics tools, where tool registration includes input data format, execution conditions, and output data format; a parsing and comparison module, used to receive the bioinformatics task description and compare the data format of the biological information according to the bioinformatics tool registry; and an execution engine module, used to standardize data in response to the data format comparison results in order to perform bioinformatics tasks, and to determine the task score confidence level based on the execution conditions.