System and method comprising foundation model
The foundation model addresses the inefficiencies in material research by processing multimodal data to predict chemical reactions and generate detailed molecular structure descriptions, improving research efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- LG MANAGEMENT DEV INST CO LTD
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-25
Smart Images

Figure KR2025021227_25062026_PF_FP_ABST
Abstract
Description
System and method including foundation model
[0001] The present invention relates to a system and method including a foundation model, and more specifically, to a system and method including a foundation model that learns one-dimensional text and two-dimensional graph phenotypes to perform molecular-level tasks.
[0002] Recently, artificial intelligence (AI) technology has been attracting attention from society as it demonstrates cutting-edge development. Artificial intelligence refers to computers performing unique human intellectual abilities with high proficiency, encompassing concepts such as "a computer brain that executes tasks within the realm of human intelligence," "the engineering and science of creating intelligent machines," and "a system of algorithms designed to think, perceive, and act like humans."
[0003] Artificial intelligence is being introduced as a technology that will provide highly integrated smart spaces when utilized alongside augmented reality, the Internet of Things, edge computing, and digital twins, and is being emphasized as a core new technology that will lead the era of the Fourth Industrial Revolution. Furthermore, AI is attracting attention as a next-generation growth engine capable of evolving industrial ecosystems beyond standardized problem-solving, and is being actively applied not only in IT, healthcare, agriculture, energy, automobiles, and robotics, but also in knowledge service industries such as retail, finance, law, education, real estate, advertising, and telecommunications. In other words, AI is preparing for a new era by combining with all existing systems, ranging from industries that aim to improve convenience or standards in daily life to the entire spectrum of culture and arts in our society.
[0004] Meanwhile, various efforts are being made to utilize artificial intelligence to solve diverse scientific problems in natural science fields such as physics, chemistry, and biology. Specifically, attempts to apply AI in areas such as designing new materials, developing new drugs, and predicting the physical properties of novel materials are continuing. These studies are expected to play a significant role in future technological advancement and innovation. (Republic of Korea Published Patent No. 10-2024-0011349 (Jan. 26, 2024))
[0005] The present invention aims to provide a foundation model-based system for processing multimodal data that can solve the time and cost issues associated with material research and development and increase the efficiency of material research and development.
[0006] In addition, the present invention aims to provide a model capable of predicting various types of chemical reaction results by utilizing multimodal data including one-dimensional text data and two-dimensional graph data.
[0007] In addition, the aim is to provide a system and method that automatically generates natural language descriptions for large-scale molecular databases that are highly informative and accurately reflect the unique chemical characteristics of each molecule.
[0008] In particular, the purpose is to provide a system and method capable of constructing an explanation centered on statistically sparse substructures among the substructures existing within a specific molecule.
[0009] A system according to one embodiment of the present invention comprises: at least one processor; and at least one memory for storing instructions, information, or an artificial intelligence model executed by the at least one processor; wherein the instructions, information, or artificial intelligence model executed by the at least one processor may include a data input unit that receives multimodal data including one-dimensional text representation data of a molecule, two-dimensional graph representation data of a molecule, and a natural language-based task instruction sequence; a graph processing unit that includes a hybrid graph encoder that extracts local and global features of a molecular structure from the two-dimensional graph representation data to generate a molecular graph embedding, and a crossmodal bridge that aligns the molecular graph embedding and a text embedding generated from the one-dimensional text representation data; and a foundation model that receives the aligned molecular graph embedding and the text embedding and learns from them.
[0010] Here, the hybrid graph encoder may include a graph encoder that captures the local structure of the molecular graph; and a graph sequence encoder that captures the global context of the molecular graph.
[0011] In addition, the crossmodal bridge can extract and summarize information highly relevant to the text embeddings from the molecular graph embeddings using a learnable query.
[0012] In addition, the large-scale language model performs a molecular-level task according to the task command sequence and generates a result, and the molecular-level task may include at least one of predicting chemical reactions, predicting molecular properties, and generating a natural language description of a molecular structure.
[0013] Additionally, the foundation model of the system further includes a molecular structure description generation model that generates a natural language description of a target molecular structure, and the molecular structure description generation model may include: a sparse substructure search unit that calculates a sparsity score for each substructure by analyzing the frequency of occurrence of substructures included in each of a plurality of molecules within a large-scale molecular database; and a molecular structure description generation unit that generates a final natural language description including a description of at least one sparse substructure sampled based on the sparsity score among the substructures included in the target molecular structure.
[0014] In addition, the above-mentioned rare substructure search unit can analyze the frequency of occurrence of the above substructures using molecular fingerprints.
[0015] In addition, the above molecular fingerprint may be a MACCS (Molecular Access System) key.
[0016] In addition, the molecular structure description generation unit may search for pre-generated natural language description documents for each of the sampled sparse substructures and integrate the searched natural language description documents into a single document using the large-scale language model to generate the final natural language description.
[0017] In addition, the molecular structure description generation model may further include a substructure document generation unit that generates natural language description documents describing the chemical characteristics and effects of each predefined substructure.
[0018] A computerized learning method according to an embodiment of the present invention may include: a step of pre-training a hybrid graph encoder to receive two-dimensional molecular graph data, predict the functional groups of a molecule, and restore the original one-dimensional molecular text; a step of pre-training a crossmodal bridge to convert a molecular graph embedding generated by the hybrid graph encoder so that the large-scale language model can understand it, while keeping the weights of the pre-trained hybrid graph encoder and the large-scale language model frozen; and a step of fine-tuning the entire foundation model including the pre-trained hybrid graph encoder, the crossmodal bridge, and the large-scale language model.
[0019] Here, the fine-tuning step can damage a portion of the one-dimensional molecular text data to induce the foundation model to rely more on the two-dimensional molecular graph data for learning.
[0020] In addition, the fine-tuning step may damage a part of the one-dimensional molecular text data by replacing some tokens in the token sequence of the one-dimensional molecular text data with random tokens.
[0021] In contrast, the fine-tuning step can learn to maximize the probability of generating a result when the correct molecular graph is input and minimize the probability of generating a result when the incorrect molecular graph is input by using preference pairs consisting of a correct molecular graph and an incorrect molecular graph in which the substructure of the correct molecular graph is modified.
[0022] In addition, the fine-tuning step may update the foundation model using a total loss function that sums the loss function of a learning process that damages a part of the one-dimensional molecular text data by replacing some tokens in the token sequence of the one-dimensional molecular text data with random tokens, and the loss function of a learning process that maximizes the probability of generating a result when the correct molecular graph is input and minimizes the probability of generating a result when the incorrect molecular graph is input, using preference pairs composed of a correct molecular graph and an incorrect molecular graph in which the substructure of the correct molecular graph is modified.
[0023] Additionally, a computerized method according to an embodiment of the present invention may include: a step of inputting multimodal data including one-dimensional text representation data of a molecule, two-dimensional graph representation data, and a task instruction sequence; a step of generating a molecular graph embedding from the two-dimensional graph representation data using a hybrid graph encoder; a step of aligning the molecular graph embedding with a text embedding generated from the one-dimensional text representation data and the task instruction sequence using a crossmodal bridge; and a step of generating a result of a molecular unit task according to the task instruction sequence based on the aligned molecular graph embedding and the text embedding using a large-scale language model.
[0024] Here, the result generation step may include: a step of identifying a plurality of substructures included in the target molecular structure to be analyzed when the task command sequence directs the generation of a natural language description of the molecular structure; a step of sampling at least one sparse substructure based on the sparse score of each of the identified plurality of substructures; and a step of generating a final natural language description of the target molecular structure by integrating the description of the sampled sparse substructure.
[0025] In addition, the step of identifying the substructures can identify the plurality of substructures by calculating the molecular fingerprint of the target molecular structure.
[0026] In addition, the above scarcity score can be calculated inversely proportional to the frequency of occurrence of the above substructure within a large molecular database.
[0027] Additionally, the final natural language description generation step may include: a step of searching for pre-generated natural language description documents for each of the sampled sparse substructures; and a step of inputting the searched documents into a large-scale language model to integrate them into a single consistent document.
[0028] A system according to an embodiment of the present invention may include: a server computing system equipped with the system according to claim 1; and a user computing device that transmits a request including multimodal data and a task instruction sequence to the server computing system through the user computing device and receives a result generated by a large-scale language model of the server computing system.
[0029] In a custom integrated circuit comprising a memory in which information and instructions are stored according to an embodiment of the present invention and a functional block including at least one processor requesting access to said memory, the memory may store instructions or information including: an operation of inputting multimodal data including one-dimensional text representation data of a molecule, two-dimensional graph representation data, and a task instruction sequence; an operation of generating a molecule graph embedding from said two-dimensional graph representation data using a hybrid graph encoder; an operation of aligning said molecule graph embedding with a text embedding generated from said one-dimensional text representation data and said task instruction sequence using a crossmodal bridge; and an operation of generating a result of a molecule unit task according to said task instruction sequence based on said aligned molecule graph embedding and said text embedding using a large-scale language model.
[0030] According to embodiments of the present invention, a system capable of processing various types of chemical reaction predictions and various molecular unit tasks can be provided by processing multimodal data including one-dimensional text data, two-dimensional graph data, and text data of various molecular unit tasks.
[0031] In addition, it is possible to automatically generate descriptive text with high informational value for any molecular structure, thereby building a high-quality molecular structure-description dataset.
[0032] In addition, since the explanation is generated based on rare substructure information that distinguishes it from other molecules, the unique and core chemical characteristics of each molecule can be effectively explained.
[0033] In addition, by generating explanatory text based solely on the substructures existing within the actual molecule through molecular fingerprinting, hallucinatory phenomena during information generation can be fundamentally prevented.
[0034] FIG. 1 is a schematic diagram of an electronic device according to one embodiment of the present disclosure.
[0035] FIG. 2 is a schematic diagram of a foundation model according to an embodiment of the present invention.
[0036] FIG. 3 is a schematic diagram illustrating a method for learning a foundation model according to an embodiment of the present invention.
[0037] FIG. 4 is a schematic diagram of a molecular information analysis system including a foundation model of an embodiment of the present invention.
[0038] FIG. 5 is a drawing showing a sequential multi-agent according to one embodiment of the present disclosure.
[0039] FIG. 6 is a drawing showing a supervisory agent according to one embodiment of the present disclosure.
[0040] FIG. 7 is a diagram showing a hierarchical agent system according to one embodiment of the present disclosure.
[0041] FIG. 8 is a drawing showing a multi-agent discussion type system according to one embodiment of the present disclosure.
[0042] FIG. 9 is a diagram showing a Mixture-of-AI Agents system according to one embodiment of the present disclosure.
[0043] FIG. 10 is a drawing showing a ReAct agent system according to one embodiment of the present disclosure.
[0044] FIG. 11 is a drawing showing a CodeAct agent system according to one embodiment of the present disclosure.
[0045] FIG. 12 is a drawing showing a modern tool-using agent system according to one embodiment of the present disclosure.
[0046] FIG. 13 is a drawing showing a self-reflective agent system according to one embodiment of the present disclosure.
[0047] FIG. 14 is a drawing showing a multi-agent workflow system according to one embodiment of the present disclosure.
[0048] FIG. 15 is a diagram illustrating an Agentic RAG (Retrieval-Augmented Generation) system according to one embodiment of the present disclosure.
[0049] FIG. 16 is a drawing showing a Multi-Agent Debate (MAD) system according to one embodiment of the present disclosure.
[0050] FIG. 17 is a diagram showing an A2A (Agent2Agent) protocol system according to one embodiment of the present disclosure.
[0051] FIG. 18 is a drawing illustrating an Agentic RAG (search-based generation) system according to one embodiment of the present disclosure.
[0052] FIG. 19 is a schematic diagram of an AI agent system according to one embodiment of the present disclosure.
[0053] FIG. 20 is a schematic diagram of a large-scale language model (LLM) chatbot according to one embodiment of the present disclosure.
[0054] FIG. 21 is a schematic diagram of a Robotic Process Automation (RPA) system according to one embodiment of the present disclosure.
[0055] FIG. 22 is a schematic diagram of a Retrieval-Augmented Generation (RAG) system according to one embodiment of the present disclosure.
[0056] FIG. 23 is a schematic diagram of a Learning-Augmented Mechanism (LAM) according to one embodiment of the present disclosure.
[0057] FIG. 24 is a diagram showing an AI agent memory structure according to one embodiment of the present disclosure.
[0058] FIG. 25 is a drawing showing a GPT (General Pretrained Transformer) model according to one embodiment of the present disclosure.
[0059] FIG. 26 is a drawing showing a Mixture of Experts (MoE) model according to one embodiment of the present disclosure.
[0060] FIG. 27 is a drawing showing a Large Reasoning Model (LRM) according to one embodiment of the present disclosure.
[0061] FIG. 28 is a drawing showing a Vision Language Model (VLM) according to one embodiment of the present disclosure.
[0062] FIG. 29 is a drawing showing a Small Language Model (SLM) according to one embodiment of the present disclosure.
[0063] FIG. 30 is a drawing showing a Large Action Model (LAM) according to one embodiment of the present disclosure.
[0064] FIG. 31 is a drawing showing a Hierarchical Reasoning Model (HRM) according to one embodiment of the present disclosure.
[0065] FIG. 32 is a drawing showing a ToolFormer (Tools-trained Model) according to one embodiment of the present disclosure.
[0066] FIGS. 33 to 38 are drawings illustrating vulnerabilities of an MCP according to one embodiment of the present disclosure.
[0067] FIG. 39 is a diagram illustrating a context engineering structure in an AI agent system according to one embodiment of the present disclosure.
[0068] To clarify the technical concept of the present disclosure, embodiments of the present invention will be described in detail with reference to the attached drawings. In describing the present disclosure, detailed descriptions of related known functions or components will be omitted if it is determined that such detailed descriptions would unnecessarily obscure the essence of the present disclosure. Components having substantially the same functional configuration in the drawings have been assigned the same reference numerals and symbols as much as possible, even if they are shown in different drawings. For convenience of explanation, devices and methods will be described together where necessary. Each operation of the present disclosure does not necessarily need to be performed in the order described and may be performed in parallel, selectively, or individually.
[0069] The terms used in the embodiments of this disclosure have been selected to be as widely used and general as possible, taking into account the functions of this disclosure; however, these terms may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Additionally, in specific cases, terms have been selected at the applicant's discretion, and in such cases, their meanings will be described in detail in the description of the relevant embodiments. Therefore, terms used in this specification should be defined not merely by their names, but based on their meanings and the overall content of this disclosure.
[0070] Throughout this disclosure, singular expressions may include plural expressions unless the context clearly indicates otherwise. Terms such as “comprising” or “having” are intended to specify the presence of features, numbers, steps, actions, components, parts, or combinations thereof, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof. That is, throughout this disclosure, when a part is described as “comprising” a certain component, it means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.
[0071] Expressions such as "at least one" modify the entire list of components and do not modify the components of the list individually. For example, "at least one of A, B, and C" and "at least one of A, B, or C" refer to only A, only B, only C, both A and B, both B and C, both A and C, all of A, B, and C, or any combination thereof.
[0072] Additionally, terms such as “...part,” “...module,” etc., as described in this disclosure refer to a unit that processes at least one function or operation, and may be implemented in hardware or software, or a combination of hardware and software.
[0073] Throughout the entire disclosure, when a part is described as being “connected” to another part, this includes not only cases where they are “directly connected” but also cases where they are “electrically connected” with other elements interposed between them. Furthermore, when a part is described as “comprising” a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.
[0074] As used throughout this disclosure, the expression “configured to” may be replaced, depending on the context, with, for example, “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of.” The term “configured to” may not necessarily mean only “specifically designed to” in hardware. Instead, in some situations, the expression “system configured to” may mean that the system is “capable of” together with other devices or components. For example, the phrase “a processor configured (or set) to perform A, B, and C” may mean a dedicated processor for performing said operations (e.g., an embedded processor), or a generic-purpose processor (e.g., a CPU or an application processor) capable of performing said operations by executing one or more software programs stored in memory.
[0075] Artificial intelligence (AI) is a field of computer science and information technology that studies methods to enable computers to perform thinking, learning, and self-development—tasks achievable by human intelligence—and refers to the ability of computers to mimic intelligent human behavior. Furthermore, AI does not exist in isolation but is closely related, directly or indirectly, to many other fields of computer science. Particularly in the modern era, there are very active attempts to introduce AI elements into various sectors of information technology and utilize them to solve problems within those fields.
[0076] Machine learning is a field of artificial intelligence that enables computers to learn without explicit programming. Specifically, machine learning can be defined as a technology that studies and builds systems and algorithms capable of learning, making predictions, and improving their own performance based on empirical data. Rather than executing strictly defined static program commands, machine learning algorithms adopt an approach of constructing specific models to derive predictions or decisions based on input data.
[0077] Many machine learning algorithms have been developed to address how to classify data in machine learning. Representative examples include Decision Trees, Bayesian Networks, Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). A Decision Tree is an analytical method that performs classification and prediction by plotting decision rules in a tree structure. A Bayesian Network is a model that represents the probabilistic relationships (conditional independence) between multiple variables in a graph structure. Bayesian Networks are suitable for data mining through unsupervised learning. Support Vector Machines are supervised learning models for pattern recognition and data analysis, primarily used for classification and regression analysis. Artificial Neural Networks model the operating principles of biological neurons and the relationships between them; they are information processing systems in which multiple neurons, referred to as nodes or processing elements, are connected in a layered structure.
[0078] Artificial neural networks are models used in machine learning, serving as statistical learning algorithms in machine learning and cognitive science that draw inspiration from biological neural networks (particularly the brain within the animal central nervous system). Specifically, an artificial neural network can refer to a model in which artificial neurons (nodes), forming a network through the connection of synapses, change the strength of these connections through learning to possess problem-solving capabilities.
[0079] An artificial neural network may include multiple layers, and each layer may include multiple neurons. Additionally, an artificial neural network may include synapses connecting neurons. An artificial neural network can generally be defined by the following three factors: ㉠ connection patterns between neurons of different layers, ㉡ a learning process that updates the weights of the connections, and ㉢ an activation function that generates an output value from a weighted sum of inputs received from the previous layer.
[0080] Artificial neural networks may include, but are not limited to, network models such as Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Bidirectional Recurrent Deep Neural Networks (BRDNN), Multilayer Perceptrons (MLP), and Convolutional Neural Networks (CNN).
[0081] Artificial neural networks are classified into single-layer neural networks and multi-layer neural networks depending on the number of layers. A typical single-layer neural network consists of an input layer and an output layer. Additionally, a typical multi-layer neural network consists of an input layer, one or more hidden layers, and an output layer.
[0082] The input layer is a layer that receives external data, and the number of neurons in the input layer is equal to the number of input variables. The hidden layer is located between the input layer and the output layer, receives signals from the input layer, extracts features, and transmits them to the output layer. The output layer receives signals from the hidden layer and outputs an output value based on the received signals. Input signals between neurons are multiplied by their respective connection strengths (weights) and then summed; if this sum is greater than the neuron's threshold, the neuron is activated and outputs the value obtained through the activation function.
[0083] Meanwhile, a deep neural network containing multiple hidden layers between the input layer and the output layer can be a representative artificial neural network that implements deep learning, a type of machine learning technique. Meanwhile, the term 'deep learning' may be used interchangeably with the term 'deep learning,' and the term 'learning' may be used interchangeably with 'training.'
[0084] The machine learning workflow consists of a series of processes involving collecting data for learning and validation, modeling, and training the model, and may include the processes of collecting training data, checking and exploring data, data preprocessing and cleaning, modeling, and training.
[0085] 1. Collect Training Data
[0086] Training data applied to the training of the learning model of this specification may be generated using data collected from a plurality of samples. In this specification, at least one different type of training data set may be used to train the learning model, and each training data may further include one or more experimental results used as feature labels. At least a portion of the training data set may be used to train the learning model, and another portion may be used to validate the learned learning model.
[0087] The data used in the graph model of one embodiment of the present invention may have the SMILES (Simplified Molecular-Input Line-Entry System) format, which is commonly used to represent the chemical formula of a molecule as a string. SMILES is a notation that expresses molecular structures in the form of strings, enabling the application of molecular structures to various machine learning and deep learning algorithms. Generally, SMILES can consist of atoms, bonds, rings, aromaticity, and branches. In SMILES notation, each atom is represented by its corresponding element symbol. For example, carbon can be represented as C, nitrogen as N, oxygen as O, and chlorine as Cl, while the hydrogen atom H can be omitted. Bonds are represented by eight symbols: ".", "-", "=", "#", "$", ":", " / ", and " / ". For example, a double bond can be represented as "=", a triple bond as "#", and a quadruple bond as "$". A ring is represented in a molecular structure by breaking a bond at any arbitrary point and numbering the two atoms at that broken point. Aromaticity refers to carbon compounds containing aromatic rings that form a stable structure by bonding in a planar ring shape; these aromatic rings are represented in the same way as the aforementioned rings, but the B, C, N, O, P, and S atoms contained within them are written in lowercase. Molecular branches are indicated by parentheses. The first atom inside the parentheses and the first atom after the parentheses end can be connected to the same atom. Ambiguity can arise because the same molecular structure can have SMILES representations written in different ways. Tools such as the RDKit library can be utilized to resolve this ambiguity.By removing duplicate compounds and unclear structural forms from the collected dataset, a dataset for the final graphing of molecules can be obtained. This dataset of SMILES representation can be represented as an adjacency matrix or an adjacency list so as to be expressed in a graph format represented by nodes and edges. The adjacency matrix and the adjacency list represent the connectivity relationships of the graph as a two-dimensional array and a list, respectively.
[0088] Data used for text embedding in one embodiment of the present invention may include a text description of a crystal structure. In one embodiment, the text description data may be obtained through a Robocrystallographer package that generates a text description similar to the way an actual crystallographer analyzes a structure. When generating a text description of a crystal structure, the Robocrystallographer package indicates symmetry, local environment, and extended connectivity, and this package may include utilities for identifying molecule names, component orientation, heterostructure information, etc. For example, in one embodiment, the Robocrystallographer used can output text such as "SnO2 is rutile structured and crystallizes in the tetragonal P4_2 / mnm space group. The structure is three-dimensional. Sn(1) is bonded to six equivalent O(1) atoms to form a mixture of edge and corner-sharing SnO6 octahedra. The corner-sharing octahedral tilt angles are 51°. All Sn(1)-O(1) bond lengths are 2.09Å. O(1) is bonded in a trigonal planar geometry to three equivalent Sn(1) atoms." when SnO2 is given as input.These text descriptions may include extensive information, including global characteristics (e.g., space group and crystal type), local details (e.g., bond length and coordination environment), and semi-global characteristics (e.g., connectivity and structural arrangement).
[0089] 2. Data Inspection and Exploration
[0090] Once training data for training a learning model is collected, the collected training data can be examined and explored regarding its structure, noise data, and data cleaning methods for machine learning applications.
[0091] This stage of data inspection and exploration is called Exploratory Data Analysis (EDA), which can be described as the process of observing and understanding collected data from various angles. Before training the data, independent variables, dependent variables, variable types, and data types are examined using visualizations such as graphs and statistical tests, allowing the characteristics of the data and inherent structural relationships to be identified in advance. Through this EDA, examining the distribution and values of the data enables a better understanding of the phenomena represented by the data and the discovery of potential problems. Furthermore, by examining the data from various angles, diverse patterns that might not have been identified during the problem definition stage can be discovered, allowing for the modification of existing hypotheses or the formulation of new ones. Exploratory data analysis can broadly encompass the process of searching for data outliers and analyzing the relationships between data attributes.
[0092] The process of detecting outliers involves verifying whether the data contains them and can include sampling, statistical, and visualization methods. Sampling methods involve drawing random samples from the data to identify overall trends and anomalies in the data values. Statistical methods may utilize summary statistics, such as the mean, median, and mode to identify the center of the data, or range and variance to check the dispersion. Visualization methods utilize probability density functions, histograms, dot plots, word clouds, time series charts, and maps to determine which statistical indicators are appropriate for the individual attributes of the collected data. However, when using statistical indicators, caution should be exercised regarding the use of statistical indicators: while the mean reflects all data values within a set and is therefore affected by outliers, the median uses only the single value in the middle, allowing for representative results even in the presence of outliers.
[0093] The process of analyzing relationships between data attributes involves identifying combinations of attributes within the data that possess meaningful correlations. Relationship analysis can be conducted differently depending on the combination of attributes between qualitative attributes (Categorical Variables; Qualitative), which cannot be expressed numerically but can be arbitrarily quantified, and quantitative attributes (Numeric Variables; Quantitative), which can be quantified. Categorical-categorical relationships can display the number of values corresponding to each pair of attribute values using cross-tabulation tables or mosaic plots; Numeric-categorical relationships can be visually represented through box plots or by observing statistical values by category (mean, median, etc.); and Numeric-numeric relationships can analyze the association between two attributes using correlation coefficients. It can be confirmed that a correlation coefficient of -1 indicates a negative correlation where the two attributes change in opposite directions, 0 indicates no correlation, and 1 indicates a positive correlation where the two attributes always change in the same direction. The relationship between two attributes with a correlation coefficient can also exhibit various aspects, which can be visually represented using a scatter plot.
[0094] 3. Data Preprocessing and Cleansing
[0095] Data that has completed inspection and exploration undergoes data preprocessing to transform it into a format suitable for machine learning training models. Data preprocessing involves cleaning the data and converting it into a form that the model can understand; it generally includes handling missing data, outlier removal, data scaling, categorical data encoding, feature selection and extraction, and data transformation. The detailed processes of data preprocessing may be performed in whole or in part selectively, and a separate machine learning model may be used for this purpose.
[0096] Handling Missing Data is the process of handling missing values when they exist in the data; these values can be displayed as NaN (Not a Number) or empty, or deleted. Filling in or deleting missing values improves data completeness, and values such as the mean, median, or mode may be used when filling in missing values.
[0097] Outlier removal is the process of eliminating outliers, which are values that deviate from typical data patterns. Since outliers can degrade model performance, they must be removed or replaced; this involves identifying outliers and deleting the corresponding rows or columns or replacing them with other values.
[0098] Data scaling is the process of adjusting the size of data; through data scaling, the range of the data is adjusted, which can improve model performance or accelerate convergence. Data scaling allows data characteristics to be aligned within a similar range, and generally, standardization and normalization can be applied. Standardization is a method of transforming data into a distribution with a mean of 0 and a standard deviation of 1; it is primarily performed using the mean and standard deviation, and the standardized value z is It can be denoted as (where x is the original value, μ is the mean, and σ is the standard deviation). Normalization is a method of transforming the range of data to [0,1] or [-1,1], primarily using minimum and maximum values to transform the data, and the normalized value x norm silver It can be expressed as (x is the original value, xmin is the minimum value, xmax is the maximum value).
[0099] Categorical Data Encoding is the process of converting categorical variables, which are represented as string or integer values and cannot be directly input into a model, into a numeric type that can be input. Generally, one-hot encoding or label encoding can be used to convert categorical variables into numeric types.
[0100] Feature selection and extraction is intended to improve the performance of a model by selecting the most useful features for model training or extracting new features. Through this process, the complexity of the model can be reduced and overfitting can be prevented.
[0101] Data transformation involves converting data to extract new information or enable a model to understand it better, and may include the tokenization of text data or the preprocessing of image data. Through data transformation, model performance can be improved by extracting useful features from original data or converting data into an appropriate format.
[0102] Through data preprocessing as described above, it is possible to achieve the effects of improving the performance and ensuring the stability of machine learning models.
[0103] Meanwhile, when training a learning model according to one embodiment of the present invention, a process of preprocessing information written in natural language and a process of training a large-scale language model based on the preprocessed data may be performed.
[0104] 3-1. Text Preprocessing for Large-Scale Language Models
[0105] If the collected data has not been preprocessed according to the requirements, tokenization, cleaning, and normalization can be performed to suit the intended use of the data.
[0106] Tokenization refers to the process of dividing given data into units called tokens, and these token units can generally be defined as meaningful units. Tokenization can broadly include word tokenization and sentence tokenization.
[0107] Tokenization refers to the process of dividing given data into units called tokens, and these token units can generally be defined as meaningful units. Tokenization can broadly include word tokenization and sentence tokenization.
[0108] Word tokenization refers to the case where the standard for tokens is a word; here, a word can include not only individual words but also phrases or meaningful strings. Word tokenization means separating words based on spaces or punctuation marks, such as periods, commas, question marks, semicolons, and exclamation marks. However, since removing all punctuation or special characters during the tokenization process can cause tokens to lose their meaning, precise algorithms may be required. For instance, if a word itself contains punctuation or uses special characters with meaning, simply removing them may not be sufficient. Therefore, tokenization rules such as Penn Treebank Tokenization rules may be applied during the process.
[0109] Sentence tokenization refers to the process of dividing text into sentence units. Typically, if data is unrefined, the corpus is not organized into sentences, so sentence tokenization may be necessary to suit the intended use. Various rules for this sentence tokenization can be defined depending on the language used and how special characters are utilized within the corpus.
[0110] The process of classifying tokens according to their purpose is called tokenization, and before and after tokenization, text data undergoes cleaning and normalization tailored to its intended use. Cleaning involves removing noise data, while normalization involves consolidating words with different representations into a single word.
[0111] Cleansing is sometimes performed prior to tokenization to exclude elements that interfere with the process, but it can also be repeatedly carried out after tokenization to remove noise that remains. The noise data removed during cleansing consists of meaningless characters; methods for eliminating unnecessary words include stopword removal, as well as removing infrequent and short words.
[0112] Normalization tasks include the consolidation of words with different spellings and case consolidation based on rules. Case consolidation is a normalization method that can reduce the number of words in English-speaking languages; since uppercase letters are used only in specific situations, such as at the beginning of a sentence, and most text is written in lowercase, case consolidation can mostly be accomplished through the conversion of uppercase letters to lowercase.
[0113] To process natural language in computing systems, a preprocessing step of converting text into numerical values is required; for this purpose, each word in the text is mapped to a unique integer. This mapping process can utilize techniques such as integer encoding, padding, and one-hot encoding.
[0114] Integer encoding is a method of assigning integers to words. It involves creating a vocabulary sorted by frequency and assigning integers sequentially from lowest to highest frequency. Integer encoding performs sentence tokenization on text data containing multiple sentences, and simultaneously conducts word tokenization through cleaning and normalization processes. During this process, words are converted to lowercase to standardize the word count, and words can be removed based on stop words or word length. Through this, words can be recorded as keys and their frequencies as values. Integer encoding is performed by sorting words within the text in order of frequency and assigning integers to the words with the highest frequencies.
[0115] Padding is a process used to arbitrarily equalize the lengths of sentences of different lengths within a text. Computing systems can perform parallel operations by grouping sentences of equal length into a single matrix. In other words, to perform parallel operations, the lengths of sentences can be equalized by arbitrarily filling the integer encoding results of sentences of different lengths within the text with '0's. That is, the longest sentence is identified from the set of integer-encoded words, and "0"s can be added to the integer matrix corresponding to the length of that longest sentence. The computing system can proceed with parallel processing by recognizing sentences of equal length as a single matrix, and in this process, the "0" words, which are perceived as meaningless, can be ignored. Adjusting the size (shape) of data by filling it with specific values in this manner is called padding, and when the number "0" is used to adjust length, it is referred to as zero padding.
[0116] One-hot encoding is a vector representation method in which the size of the word set is used as the dimension of the vector, and a value of 1 is assigned to the index of the word to be represented and 0 to other indices; the vector represented in this way is called a one-hot vector. One-hot encoding consists of integer encoding and index assignment processes. After integer encoding is performed to assign a unique integer to each word, the unique integer of the word to be represented is considered as the index, and a "1" is assigned to that position, while a "0" is assigned to the index positions of other words. However, one-hot encoding has the disadvantage that the space required to store the vector increases (increase in vector dimension) as the number of words increases, and it is also impossible to verify similarity between words. To address these drawbacks, techniques that vectorize in a multi-dimensional space by reflecting the latent meaning of words are available. These include count-based vectorization methods such as LSA (Latent Semantic Analysis); prediction-based vectorization methods such as NNLM, RNNLM, Word2Vec, and FastText; and the GloVe method, which uses both count-based and prediction-based approaches.
[0117] Meanwhile, in order for a computer to understand and process text, it must be appropriately converted into numbers. Since the performance of natural language processing varies significantly depending on how words are represented, many techniques have been proposed to quantify words. Currently, word embedding, which vectorizes each word through artificial neural network learning, is the most widely used method.
[0118] Word embedding is a method of representing words as vectors, converting words into dense representations. The result derived through the word embedding process is called a dense vector or embedding vector. Word embedding methodologies such as LSA, Word2Vec, FastText, and Glove have been proposed.
[0119] 4. Modeling and Training
[0120] Artificial neural networks can be trained using training data. Here, training refers to the process of determining the parameters of an artificial neural network using training data to achieve objectives such as classifying, regressing, or clustering input data. Typical examples of artificial neural network parameters include weights assigned to synapses or biases applied to neurons.
[0121] An artificial neural network trained on training data can classify or cluster input data according to the patterns of the input data. Meanwhile, an artificial neural network trained using training data may be referred to as a trained model in this specification.
[0122] The following explains the learning methods of artificial neural networks. The learning methods of artificial neural networks can be broadly classified into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
[0123] Supervised learning is a method of machine learning designed to infer a function from training data. Among the functions inferred in this way, outputting a continuous value is called regression, and predicting and outputting the class of an input vector is called classification.
[0124] In supervised learning, an artificial neural network is trained with labels for the training data. Here, a label refers to the correct answer (or result value) that the artificial neural network must infer when training data is input into the artificial neural network. In this specification, the correct answer (or result value) that the artificial neural network must infer when training data is input is referred to as a label or labeling data. Furthermore, in this specification, setting labels on the training data for the training of the artificial neural network is referred to as labeling the training data. In this case, the training data and the corresponding labels constitute a single training set, and can be input to the artificial neural network in the form of a training set.
[0125] Meanwhile, training data represents multiple features, and labeling the training data implies that labels are attached to the features represented by the training data. In this case, the training data can represent the features of the input object in the form of a vector. An artificial neural network can infer a function regarding the association between the training data and the labeled data by utilizing the training data and the labeled data. Furthermore, the parameters of the artificial neural network can be determined (optimized) through the evaluation of the function inferred by the network.
[0126] Unsupervised learning is a type of machine learning in which no labels are provided for the training data. Specifically, unsupervised learning can be a learning method that trains an artificial neural network to find and classify patterns within the training data itself, rather than focusing on the relationship between the training data and its corresponding labels. Examples of unsupervised learning include clustering and Independent Component Analysis.
[0127] Examples of artificial neural networks that utilize unsupervised learning include Generative Adversarial Networks (GANs) and Autoencoders (AEs).
[0128] Generative Adversarial Networks (GANs) are machine learning methods in which two distinct artificial intelligence models—a generator and a discriminator—compete to improve performance. In this context, the generator is a model that creates new data, capable of generating new data based on original data. The discriminator, on the other hand, is a model that recognizes data patterns, performing the role of distinguishing whether input data is original data or new data generated by the generator. Furthermore, the generator learns by receiving input data that failed to deceive the discriminator, while the discriminator learns by receiving input data that was deceived by the generator. Consequently, the generator can evolve to deceive the discriminator as effectively as possible, and the discriminator can evolve to better distinguish between original data and data generated by the generator.
[0129] An autoencoder is a neural network that aims to reproduce the input itself as the output. An autoencoder includes an input layer, at least one hidden layer, and an output layer. In this case, since the number of nodes in the hidden layer is less than the number of nodes in the input layer, the dimensionality of the data is reduced, and accordingly, compression or encoding is performed. Additionally, the data output from the hidden layer enters the output layer. In this case, since the number of nodes in the output layer is greater than the number of nodes in the hidden layer, the dimensionality of the data is increased, and accordingly, decompression or decoding is performed.
[0130] Meanwhile, an autoencoder represents input data as hidden layer data by adjusting the connection strengths of neurons through learning. In the hidden layer, information is represented with fewer neurons than in the input layer, and the fact that input data can be reproduced as output implies that the hidden layer has discovered and represented hidden patterns from the input data.
[0131] Semi-supervised learning is a type of machine learning that refers to a learning method utilizing both labeled and unlabeled training data. One technique within semi-supervised learning involves inferring labels from unlabeled training data and then performing learning using those inferred labels; this method can be particularly useful when the cost of labeling is high.
[0132] Reinforcement learning is a theory that states that if an agent is provided with an environment where it can determine the best action to take at every moment, it can find the optimal path through experience alone, without relying on data. Reinforcement learning is primarily executed via a Markov Decision Process (MDP). To explain the MDP, first, an environment is provided containing the information necessary for the agent to take its next action; second, the agent's behavior within that environment is defined; third, rewards are determined for success and penalties for failure; and fourth, the optimal policy is derived through repeated experience until future rewards reach their peak.
[0133] The structure of an artificial neural network is determined by the configuration of the model, activation function, loss function or cost function, learning algorithm, optimization algorithm, etc., and hyperparameters are set in advance before learning, and model parameters are set through learning thereafter, so the content can be determined.
[0134] For example, factors determining the structure of an artificial neural network may include the number of hidden layers, the number of hidden nodes included in each hidden layer, the input feature vector, the target feature vector, etc.
[0135] Hyperparameters include various parameters that must be initially set for training, such as the initial values of model parameters. Model parameters, on the other hand, include various parameters intended to be determined through training. For example, hyperparameters may include initial values for inter-node weights, initial values for inter-node bias, mini-batch size, number of training iterations, and learning rate. Additionally, model parameters may include inter-node weights and inter-node bias.
[0136] A loss function can be used as an indicator (criterion) to determine optimal model parameters during the learning process of an artificial neural network. In an artificial neural network, learning refers to the process of manipulating model parameters to reduce the loss function, and the objective of learning can be viewed as determining model parameters that minimize the loss function. The loss function can primarily be the Mean Squared Error (MSE) or the Cross Entropy Error (CEE), but the present invention is not limited thereto. The Cross Entropy Error can be used when the correct label is one-hot encoded. One-hot encoding is an encoding method in which the correct label value is set to 1 only for neurons corresponding to the correct answer, and the correct label value is set to 0 for neurons that are not the correct answer.
[0137] In machine learning or deep learning, learning optimization algorithms can be used to minimize the loss function, and learning optimization algorithms include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, Nesterov Accelerate Gradient (NAG), Adagrad, AdaDelta, RMSProp, Adam, Nadam, etc.
[0138] Gradient Descent is a technique that adjusts model parameters in a direction that reduces the loss function value by considering the gradient of the loss function from the current state. The direction in which model parameters are adjusted is called the step direction, and the magnitude of the adjustment is called the step size. In this context, the step size can refer to the learning rate. Gradient Descent obtains the gradient by taking the partial derivative of the loss function with respect to each model parameter, and updates the model parameters by changing them in the direction of the obtained gradient by the learning rate.
[0139] Stochastic Gradient Descent is a technique that divides training data into mini-batches and performs gradient descent on each mini-batch to increase the frequency of gradient descent.
[0140] Adagrad, AdaDelta, and RMSProp are techniques that improve optimization accuracy in SGD by adjusting the step size. In SGD, Momentum and NAG are techniques that improve optimization accuracy by adjusting the step direction. Adam is a technique that improves optimization accuracy by combining Momentum and RMSProp to adjust both the step size and the step direction. Nadam is a technique that improves optimization accuracy by combining NAG and RMSProp to adjust both the step size and the step direction.
[0141] The learning speed and accuracy of artificial neural networks are characterized by being heavily dependent on hyperparameters, as well as the network structure and the type of learning optimization algorithm. Therefore, to obtain a good learning model, it is important to set appropriate hyperparameters in addition to determining a suitable network structure and learning algorithm.
[0142] Typically, hyperparameters are experimentally set to various values while training the artificial neural network, and then set to the optimal value that provides stable training speed and accuracy based on the training results.
[0143] FIG. 1 is a schematic diagram of an electronic device according to one embodiment of the present disclosure, FIG. 2 is a schematic diagram of a foundation model according to one embodiment of the present invention, FIG. 3 is a schematic diagram showing a learning method of a foundation model according to one embodiment of the present invention, and FIG. 4 is a schematic diagram of a molecular information analysis system including a foundation model according to one embodiment of the present invention. FIG. 5 is a diagram showing a sequential multi-agent according to one embodiment of the present disclosure, FIG. 6 is a diagram showing a supervisory agent according to one embodiment of the present disclosure, FIG. 7 is a diagram showing a hierarchical agent system according to one embodiment of the present disclosure, FIG. 8 is a diagram showing a multi-agent discussion system according to one embodiment of the present disclosure, FIG. 9 is a diagram showing a mixture-of-AI agents system according to one embodiment of the present disclosure, and FIG. 10 is a diagram showing a ReAct agent system according to one embodiment of the present disclosure.FIG. 11 is a diagram showing a CodeAct agent system according to one embodiment of the present disclosure, FIG. 12 is a diagram showing a modern tool-using agent system according to one embodiment of the present disclosure, FIG. 13 is a diagram showing a self-reflective agent system according to one embodiment of the present disclosure, FIG. 14 is a diagram showing a multi-agent workflow system according to one embodiment of the present disclosure, FIG. 15 is a diagram showing an Agentic RAG (Retrieval-Augmented Generation) system according to one embodiment of the present disclosure, FIG. 16 is a diagram showing a MAD (Multi-Agent Debate) system according to one embodiment of the present disclosure, FIG. 17 is a diagram showing an A2A (Agent2Agent) protocol system according to one embodiment of the present disclosure, FIG. 18 is a diagram showing an Agentic RAG (Retrieval-Augmented Generation) system according to one embodiment of the present disclosure, FIG. 19 is a schematic diagram of an AI agent system according to one embodiment of the present disclosure, and FIG. 20 is a schematic diagram of a large-scale language model (LLM) chatbot according to one embodiment of the present disclosure.FIG. 21 is a schematic diagram of a Robotic Process Automation (RPA) system according to one embodiment of the present disclosure, FIG. 22 is a schematic diagram of a Retrieval-Augmented Generation (RAG) system according to one embodiment of the present disclosure, FIG. 23 is a schematic diagram of a Learning-Augmented Mechanism (LAM) according to one embodiment of the present disclosure, FIG. 24 is a diagram showing an AI agent memory structure according to one embodiment of the present disclosure, FIG. 25 is a diagram showing a General Pretrained Transformer (GPT) model according to one embodiment of the present disclosure, FIG. 26 is a diagram showing a Mixture of Experts (MoE) model according to one embodiment of the present disclosure, FIG. 27 is a diagram showing a Large Reasoning Model (LRM) according to one embodiment of the present disclosure, FIG. 28 is a diagram showing a Vision Language Model (VLM) according to one embodiment of the present disclosure, FIG. 29 is a diagram showing a Small Language Model (SLM) according to one embodiment of the present disclosure, and FIG. 30 is one This is a drawing showing a Large Action Model (LAM) according to an embodiment. FIG. 31 is a drawing showing a Hierarchical Reasoning Model (HRM) according to an embodiment of the present disclosure, FIG. 32 is a drawing showing a ToolFormer (Tools-trained Model) according to an embodiment of the present disclosure, FIG. 33 to FIG. 38 are drawings showing vulnerabilities of an MCP according to an embodiment of the present disclosure, and FIG. 39 is a drawing illustrating a context engineering structure in an AI agent system according to an embodiment of the present disclosure.
[0144] As illustrated in FIG. 1, an electronic device (100) according to embodiments of the present invention may include a processor (110), memory (120), a communication unit (130), and an input / output unit (140). The electronic device (100) is a basic configuration for performing a computing environment, and the electronic device (100) may be implemented with some other components additionally or substantially in other embodiments, implemented as a single or multiple entity, or implemented as only some of the disclosed configurations. Internal or external components of the electronic device (100), or at least some of the components, may transmit or receive data or signals by being connected to each other through a BUS, GPIO (General Purpose Input / Output), SPI (Serial Peripheral Interface), or MIPI (Mobile Industry Processor Interface), etc.
[0145] The processor (110) may mean a set of one or more processors unless the context clearly indicates otherwise, and may control components of the processor (110) and electronic device (100) by running software (e.g., instructions, programs, etc.) stored in memory (120). Additionally, the processor (110) may perform various operations such as computation, processing, data generation or processing, and may read data from memory (120) or store it in memory (120). The processor (110) may be composed of at least one core and may include processors for data analysis, machine learning (ML), or deep learning (DL), such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a General Purpose Graphics Processing Unit (GPGPU), or a Tensor Processing Unit (TPU). The processor (110) can read software stored in memory (120) and perform data processing for machine learning (or deep learning) of the present invention. According to one embodiment of the present disclosure, the processor (110) can perform operations for learning a neural network. The processor (110) can perform calculations for learning a neural network, such as processing input data for learning in deep learning, extracting features from input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPU, GPGPU, and TPU of the processor (110) can process the learning of the neural network model. For example, the CPU and GPGPU can together process the learning of the neural network model and data classification using the neural network model.In addition, in one embodiment of the present disclosure, at least one processor (110) of the electronic device (100) can be used together to process the learning of a neural network model and the classification of data using a neural network model.
[0146] Memory (120) is intended to store various data, and the data may include software (e.g., instructions, programs, etc.) which is acquired, processed, or used by at least one component of the electronic device (100). Unless otherwise clearly expressed in the context, memory (120) may mean a set of one or more memories and may include at least one type of storage medium among flash memory type, hard disk type, multimedia card micro type, card type memory (e.g., SD or XD memory, etc.), RAM, SRAM (Static Random Access Memory), ROM, EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, magnetic disk, optical disk, and web storage that performs storage functions on the internet. The instruction, program, or software stored in the memory (120) may be used to refer to an operating system, an application, or middleware that provides various functions to an application to enable the application to utilize the components of the electronic device (100) for controlling the components of the electronic device (100). In one embodiment, when the processor (110) performs a specific operation, the memory (120) may store instructions that are performed by the processor (110) and correspond to the specific operation.
[0147] The communication unit (130) performs wireless or wired communication between the electronic device (100) and another device (e.g., a user terminal or another server), and the communication unit (130) may use wireless communication systems according to methods such as eMBB, URLLC, MMTC, LTE, LTE-A, NR, UMTS, GSM, CDMA, WCDMA, TDMA, FDMA, OFDMA, SCFDMA, WiBro, WiFi, Bluetooth, NFC, GPS, or GNSS. In addition, the communication unit (130) can use various wired communication systems such as USB, HDMI, RS-232 (Recommended Standard-232), POTS (Plain Old Telephone Service), Public Switched Telephone Network (PSTN), xDSL (x Digital Subscriber Line), RADSL (Rate Adaptive DSL), MDSL (Multi Rate DSL), VDSL (Very High Speed DSL), UADSL (Universal Asymmetric DSL), HDSL (High Bit Rate DSL), and Local Area Network (LAN). In one embodiment of the present invention, the communication unit (130) can be configured regardless of the mode of communication, such as wired or wireless, and can be configured with various communication networks such as a Personal Area Network (PAN) and a Wide Area Network (WAN). In addition, the communication network may be the known World Wide Web (WWW), and may also utilize wireless transmission technologies used for short-range communication, such as Infrared Data Association (IrDA) or Bluetooth.
[0148] The input / output unit (140) may be configured to be divided into an input unit and an output unit, but alternatively, the input / output unit (140) may have an integrated configuration. The input unit serves as a means for data input and can be composed of various types. For example, the input unit may be configured to receive user input. The input unit may be configured to receive user input from a user terminal. Here, "receiving input" may mean receiving an input signal (or selection signal) corresponding to the user's input based on input made by the user through the input unit configuration provided in the user terminal. The input unit may also be referred to as a user interface module. The input unit may include a touch screen, a computer mouse, a keyboard, a keypad, a touchpad, a trackball, a joystick, a voice recognition module, or other similar devices. However, the present invention does not limit the type of input unit. Furthermore, in the present invention, the input unit does not necessarily mean a hardware means, but can be understood as a channel for receiving input from a user. Here, user input may include documents, text, images (or video), etc. Next, the output unit can output information through an output unit configuration (e.g., a display unit, a touch screen, etc.) provided in a user terminal or computing device. The output unit does not necessarily refer to a hardware means, but can be understood as a channel for outputting results to the user.
[0149] An electronic device (100) according to an embodiment of the present invention may execute software that configures a system including a foundation model or a method for learning a foundation model. In addition, the electronic device (100) may perform various molecular unit tasks based on user queries and provide the results to the user.
[0150] 1. Composition of the foundation model
[0151] Referring to FIG. 2, a foundation model (200) according to embodiments may be configured to include a data input unit (210), a graph processing unit (220), a large-scale language model (230; LLM), and a molecular structure description generation model (300).
[0152] The data input unit (210) is configured to receive multimodal data based on user input through the input / output unit (140) or information stored in the memory (120). The multimodal data input through the data input unit (210) of one embodiment may include molecular text data in a 1D phenotype representing the sequence of molecules, molecular graph data in a 2D phenotype representing the molecular structure, and a natural language-based task command sequence defining a task performed in the foundation model (200).
[0153] Molecular text data in 1D phenotypes can be text data in SELFIES (Self-Referencing Embedded Strings) or SMILES (Simplified Molecular-Input Line-Entry System) phenotypes, which represent molecular structures as strings. SMILES is the most widely used molecular structure notation; it is a phenotype that views molecular structures as tree or graph structures and linearizes them using ASCII characters to represent atoms and bonds by traversing them. SMILES displays atomic symbols and their bonds, but single bonds can be omitted, double bonds represented by "=", and triple bonds by "#"; branches are represented by parentheses (); and rings are represented by assigning the same number to the two atoms forming the ring. SELFIES is a new molecular notation proposed to overcome the limitations of SMILES and is particularly optimized for molecular generation tasks in AI models. Like SMILES, SELFIES represents molecular structures as strings, but its grammatical rules are much stricter and more robust. SELFIES strings are represented by enclosing tokens in "[]" (e.g., [C], [O], [Branch1], [Ring1]). These SELFIES phenotypes encode molecular structures in a self-referencing manner, so that a grammatically valid SELFIES string can always be decoded into a valid molecular structure. That is, a valid SELFIES string can always be converted into a chemically valid molecule. In one embodiment of the present invention, molecular text data is molecular text data of the SELFIES phenotype, but is not limited thereto.
[0154] Molecular graph data in 2D representation is data that represents molecular structures as a graph data structure. A molecular graph can be constructed by including nodes and edges. Nodes represent atoms constituting a molecule, and each node may contain attributes of the corresponding atom (e.g., atomic number, atomic weight, charge, number of hydrogen atoms surrounding the atom, etc.). Edges are lines connecting nodes that represent chemical bonds between atoms, and edges may contain information such as single, double, and triple bonds. Molecular graphs can clearly represent complex structures or branches, such as benzene rings; compared to linear string-based representations like SMILES or SELFIES, they can directly store spatial information of 2D molecular structures, thereby preserving molecular structural information. Furthermore, chemically similar molecules tend to exhibit similar graph structures, making it easier for AI models to learn the relationships between molecules. Additionally, since various attributes of atoms (nodes) and bonds (edges) can be captured in feature vectors, the complex chemical characteristics of molecules can be represented more richly.
[0155] A task command sequence may be a natural language text-based command that defines a task performed in the foundation model (200). That is, the task command sequence may be a set of commands for various tasks or tasks that the user will perform on the foundation model. In one embodiment, various tasks can be understood and performed through a foundation model, that is, a single general-purpose model, rather than designing a separate model corresponding to each task to execute various tasks at the molecular level. To this end, in one embodiment, various commands that can be performed at the molecular level may be input as a data set of the model. Various tasks at the molecular level may include chemical reaction prediction, translation, molecular property prediction, etc. Task command sequences may differ for molecular level tasks. A sequence for a molecular property prediction task may include commands that cause the model to derive target properties for an input molecule, such as, for example, "What is the HOMO energy of this molecule?" or "Does this molecule have known side effects?" Additionally, a sequence for chemical reaction prediction may include commands such as, when a part of a target chemical formula is input, the model completing the entire formula, predicting the result molecule when the input molecule acts as a reactant, predicting the reaction molecule in a chemical reaction equation where the given molecule is the result, or predicting the molecule acting as a mediator when the reactants and result molecules are given.Additionally, the sequence for generating a description may include commands such as generating the remaining information based on one side of the input information when a molecule and a text description are given, such as "Please provide a detailed description of the molecular structure." or "Can you create a molecule based on this structural description?", generating a text description of a molecule when 1D molecular text is input, or receiving a text description of a target molecule and generating 1D molecular text (SMILES or SELFIES) for the target molecule. The task command sequences for molecular unit tasks as described above are disclosed only as examples, and the embodiments are not limited thereto.
[0156] With reference to FIG. 2, text-based data, i.e., molecular text data and task command sequences, can be tokenized through a tokenizer (240), and the tokenized data can be converted into text embeddings among input embeddings (250) through a text embedding layer of a large-scale language model (230). That is, in one embodiment, among the multimodal data input through the data input unit (210), while 2D molecular graph data is processed in the graph processing unit (220), text-based data can be processed by the large-scale language model (230) and converted into input embeddings (250).
[0157] The graph processing unit (220) receives 2D molecular graph data and converts it into a graph embedding among the input embeddings (250) that are inputs to a large-scale language model (230). In one embodiment, the graph processing unit (220) may include a hybrid graph encoder (221) and a crossmodal bridge (224).
[0158] In one embodiment of the present invention, the hybrid graph encoder (221) combines two different neural network architectures. The hybrid graph encoder (221) may be configured to include a graph encoder (222) that captures the local structure of a molecular graph and a graph sequence encoder (223) that captures the global context of a molecular graph.
[0159] The graph encoder (222) can be configured to extract features of each atom and bond through a graph representation of the molecular structure and to generate an embedding vector based thereon. That is, the graph encoder (222) can learn features of small units, such as direct connections between atoms and functional groups. In one embodiment, the graph encoder (222) utilizes a Graph Isomorphism Network (GINE) designed to distinguish graph isomorphisms, but is not limited thereto. A 2D molecular graph consists of nodes and edges, and in the molecular graph, atoms can be represented as nodes and bonds between atoms as edges. For example, a water molecule is represented by the molecular formula “H2O,” where the atoms consist of H, H, and O. In this case, there are three nodes, and the atoms H, H, and O can be represented as nodes n1, n2, and n3, respectively. In addition, there may be two bonds between atoms, OH and OH, and these bonds can be represented as edges (e1, e2). In this way, when converting a molecular structure into a molecular graph, the unique positional and topological relationships of the molecular structure can be preserved, enabling more accurate learning and prediction. In one embodiment, the molecular structure can be converted into a molecular graph and input into a graph encoder (222). Meanwhile, in one embodiment, the graph encoder (222) can embed the molecular structure into a vector using the molecular graph. The graph embedding may include atomic type, bonding information, charge information, hybridization information, directionality information, etc., and additionally, a vector regarding the bonding information of the molecule can be embedded and utilized for analysis. GINE, which is the graph encoder (222) of one embodiment, can utilize unique information possessed by the edges as well as the nodes of the graph, such as the type of bond between atoms in the molecular structure, together in model learning.That is, the graph encoder (222) of one embodiment can better capture the local structure of the molecular graph by focusing more on the edge characteristics of the graph compared to the existing GNN. The graph encoder (222) can update the embedding vector by receiving information about each node's neighboring nodes and the edges connecting them using a message passing method. That is, the graph encoder (222) can output node embeddings for all nodes of the molecular graph.
[0160] The graph sequence encoder (223) recognizes the nodes and edges constituting the molecular graph as independent tokens and receives a graph sequence composed of these tokens as input, thereby capturing the global characteristics of the molecular graph structure. In one embodiment, the graph sequence encoder (233) may utilize a transformer-based TokenGT (Tokenized Graph Transformer) to process the graph sequence composed of tokens, but is not limited thereto. The graph sequence encoder (233) of one embodiment may generate (preprocess) a graph sequence by performing tokenization of the molecular graph. During the molecular graph tokenization process, each node (atom) and edge (bond) is converted into a token embedding containing its own unique characteristics and structural position information within the graph, and a graph sequence may be generated by arranging these token embeddings in order. The graph sequence encoder (233) of one embodiment may generate an embedding vector by capturing the structure and characteristics of the molecular graph using the graph sequence of the transformed molecular graph. The graph sequence encoder (233) can learn the structure and relationships of the entire molecular graph through the self-attention of the transformer and output a graph embedding for the input token sequence.
[0161] The hybrid graph encoder (221) can combine the node embeddings of the molecular graph, which are the output of the graph encoder (222), and the graph embeddings of the graph sequence encoder (223) with other modalities (text embeddings) to provide them as input embeddings (250) of a large-scale language model (230).
[0162] Meanwhile, different modalities, namely text embeddings and molecular graph embeddings (node embeddings + graph embeddings), need to be connected to enable a deeper and richer understanding in a large-scale language model (230). That is, in order for a large-scale language model to process structural information (molecular graph embeddings) and semantic information (text embeddings) about molecules from a single integrated perspective, it is desirable for the information to be connected. In other words, molecular graph embeddings represent a graph structure consisting of nodes and their relationships (edges) in a low-dimensional vector space, and this vector contains structural information such as the topological characteristics of the nodes within the graph. However, molecular graph embeddings alone have limitations in that it is difficult to know information about the specific content contained in each node. Meanwhile, text embeddings convert text data such as words or sentences into numeric vectors, and these vectors capture the semantic similarity of the text. However, text embeddings alone have limitations in grasping complex relationships or structures that may exist between text data. To overcome the limitations between such multimodal embeddings and to efficiently utilize multimodal data, it is necessary to connect the two embeddings. As a method of connecting multimodal embeddings, there are two main methods: one that utilizes a shallow linear layer in the projection head of a hybrid graph encoder (221), and another that utilizes a Q-Former (Querying Transformer) that efficiently queries and summarizes only the information related to text embeddings (currently needed information) among the information extracted by the hybrid graph encoder, and then embeds this summarized information. In one embodiment of the present invention, a Q-Former is utilized as a cross-modal bridge to connect multimodal embeddings, but it is not limited thereto.
[0163] The crossmodal bridge (224) acts as an intelligent bridge between the hybrid graph encoder (221) and the large-scale language model (230), and the Q-Former in one embodiment can receive three inputs. The crossmodal bridge (224) can receive, as a key, a molecular graph embedding (node embedding, graph embedding; target for reference by the Q-Former) generated through the hybrid graph encoder (221), learnable queries which are learnable vectors as the empty questionnaire of the Q-Former, and text embeddings to provide direction to focus only on relevant parts of the entire graph. In one embodiment, the crossmodal bridge (224) can cross-learn the molecular graph embedding and the text embedding through self-attention and cross-attention. Specifically, the crossmodal bridge (224) can set the search direction of the learnable query by associating the empty learnable query with the text embedding through self-attention processing. That is, the learnable query can be set to focus on the information that needs to be verified in the molecular graph embedding by being guided by the text embedding. Subsequently, the crossmodal bridge (224) can learn the graph embedding based on the learnable query to extract and learn the graph embedding that is highly relevant to the text embedding. Such a crossmodal bridge (224) can filter out unnecessary information by focusing only on the parts of the entire graph embedding that are highly relevant to the text embedding, compress and summarize the graph information to match the intent of the query, i.e., the text embedding, and then output it as an input embedding that is the input to the large-scale language model (230). That is, the crossmodal bridge (224) [requires] the query of Nq representing the input molecule and dimensional adaptive molecular graph embeddings Cross-attention between them can be performed. Through this cross-modal cross-attention, the cross-modal bridge (224) can learn queries representing molecular structures such as functional groups of the molecule and backbone structures of the molecule. In one embodiment of the invention, 32 learnable queries were used to perform various molecular tasks including molecule generation, but alternatively, 8 queries may be used. Meanwhile, in one embodiment, the text embedding layer of the cross-modal bridge (224) and the large-scale language model (230) may be referred to as a lightweight model.
[0164] As described above, text embeddings and molecular graph embeddings related to text embeddings can be combined to form input embeddings (250) of a large-scale language model (230).
[0165] A Large Language Model (230; LLM; Large Language Model) is an artificial intelligence model trained with a vast amount of text data and can perform various language-related tasks such as understanding, summarizing, generating, and translating text information. In one embodiment, the Large Language Model (230) can learn through input embeddings (250), which are information extracted and summarized from molecular graphs and text (SELFIES, task instructions), to finally interpret and generate answers that match user instructions (Task Instruction). In one embodiment, the Large Language Model (230) can perform reasoning to solve a given molecular-level task by synthesizing and processing compressed and refined molecular graph embeddings received through Q-Former and natural language instructions input by the user. In one embodiment, the Large Language Model (230) may utilize a Galactica model specialized in scientific fields or text representations of molecules, but is not limited thereto.
[0166] A molecular structure description generation model (300) is intended to generate a natural language description describing a specific molecular structure by emphasizing information on sparse substructures of a specific molecular structure. A molecular structure description generation model (300) according to one embodiment of the present invention may be configured to include a sparse substructure search unit (210), a substructure document generation unit (320), and a molecular structure description generation unit (330).
[0167] The rare substructure search unit (310) is intended to receive all molecular information included in a large-scale molecular database and analyze the frequency of occurrence of the molecular substructures. That is, the rare substructure search unit (310) can analyze the frequency of occurrence of predefined substructures for all molecules and calculate the rarity score of each substructure.
[0168] In one embodiment of the present invention, input data input to the rare substructure search unit (310) may be obtained from PubChem as a large-scale molecular database, but is not limited thereto and various molecular data may be used. The PubChem database includes a set of specific substructures that are predefined for fast searching and classification, and this set is named structural keys or molecular fingerprints. The molecular fingerprints of PubChem use binary vectors that indicate with 0s and 1s what substructures each molecular structure has for efficient searching of molecules. That is, the molecular fingerprint is represented by N bit vectors consisting of 0s and 1s, and multiple molecular fingerprints may exist depending on the number and type of substructures.
[0169] A rare substructure search unit (310) of one embodiment of the present invention may use MACCS keys (Molecular Access System keys) molecular fingerprints to identify whether a molecule substructure exists within a dataset. MACCS keys are molecular fingerprints that represent the existence of 166 major chemical features and substructures as a 166-bit vector, where each bit may indicate the existence of a specific chemical structural element.
[0170] The sparse substructure search unit (310) can apply MACCS keys fingerprints to all molecules in the database to count how often each of the 166 bits (each substructure) appears as '1' (i.e., how often it appears) in all molecules. FIG. 3 is a distribution histogram of the 166 substructures for all molecules. Subsequently, the sparse substructure search unit (310) can calculate a scarcity score corresponding to the frequency of appearance of a specific substructure. That is, the scarcity score (Ri) of a specific substructure i calculated by the sparse substructure search unit (310) can be calculated using the following mathematical formula.
[0171]
[0172] Here f i represents the total frequency of occurrence of substructure i, and the sparsity score R i It has a value between 0 and 1, and it can be derived that the closer it is to 1, the sparser the substructure. In other words, the sparsity score can be calculated inversely proportional to the frequency of occurrence of the substructure within a large molecular database.
[0173] The substructure document generation unit (320) is intended to generate a natural language description document for each substructure, such as the 166 defined in the MACCS keys, that describes the chemical definition, characteristics, role and effect within the molecule, etc. of the substructure.
[0174] The substructure document generation unit (320) may use a large-scale language model (230). That is, natural language descriptions describing the 166 substructures defined in MACCS keys may be generated by utilizing a large-scale language model (230), such as GPT-4o. For example, the substructure document generation unit (320) can generate a detailed natural language explanation document regarding the instability, short half-life, physicochemical properties, etc. of the superheavy elements, which is the substructure, when the name of the substructure is entered along with a system prompt such as “You are a chemistry expert. The user will ask about possible substructures within a molecule, and your task is to write a brief explanation, in five sentences or fewer, detailing the effects and properties of that substructure,” for a substructure of a molecular graph entered by a user or a substructure entered via a text prompt, such as “You are a chemistry expert. The user will ask about possible substructures within a molecule, and your task is to write a brief explanation, in five sentences or fewer, detailing the effects and properties of that substructure.” Alternatively, the substructure document generation unit of another embodiment of the present invention may build a database of natural language explanation documents for substructures by extracting descriptions for each substructure from chemistry-related websites or papers using web crawling. The description documents for the 166 substructures generated in this way can be retrieved and used to generate description text for specific molecular structures entered by the user.
[0175] As described above, the sparse substructure search unit (310) and the substructure document generation unit (320) can perform analysis and operation on all molecular structures stored in the database once. That is, it may be desirable for the operation of the sparse substructure search unit (310) and the substructure document generation unit (320) to be performed only once, i.e., pre-learned, until molecular structures are added to the database.
[0176] The molecular structure description generation unit (330) is for generating and outputting description text for a target molecular structure, and can be executed when a target molecular structure (molecular graph or molecular text) for which description text is to be obtained is input by a user through the input / output unit (140). For the target molecular structure received from the user, the molecular structure description generation unit (330) can perform functions such as obtaining a molecular fingerprint, sampling substructures, and integrating the generated natural language description document.
[0177] In one embodiment, the molecular structure description generation unit (330) can obtain a vector by calculating a molecular fingerprint for a target molecular structure input by a user to obtain a natural language description of the molecular structure. That is, the molecular structure description generation unit (330) can obtain a 166-bit vector by applying a MACCS key fingerprint to the target molecular structure. In the vector of the target molecular structure thus obtained, bits with a value of '1' may represent substructures included in the target molecular structure. The molecular structure description generation unit (330) can search a database for natural language descriptions corresponding to the substructures included in the target molecular structure, i.e., bits with a value of '1'. At this time, the molecular structure description generation unit (330) can select some of the natural language description documents corresponding to all substructures by sampling. That is, the molecular structure description generation unit (330) can select natural language description documents to describe the target molecular structure by selecting some of the entire substructures of the target molecular structure. At this time, the molecular structure description generation unit (330) may use the sparsity score calculated by the sparsity substructure search unit (310) as a sampling weight. The molecular structure description generation unit (330) may select natural language description documents for sparser substructures. That is, among the natural language description documents describing all substructures stored in the database, natural language description documents corresponding to the target substructure may be searched, and then sparser natural language description documents may be sampled based on the sparsity score assigned to the searched natural language description documents. In one embodiment of the present invention, the molecular structure description generation unit (330) may sample five natural language description documents based on the sparsity score. The number of documents to be sampled may be freely changed.
[0178] The molecular structure description generation unit (330) can integrate five sampled substructure natural language description documents. In one embodiment, the molecular structure description generation unit (330) can utilize a large-scale language model (230) to integrate natural language descriptions. Five sampled natural language description documents are given as input to the large-scale language model (230), and the large-scale language model (230) may also be given a system prompt such as “Your task is to consolidate descriptions of substructures within a molecule to create a new document that explains the molecule as whole. The user will provide information about each substructure in individual paragraph, and you integrate these descriptions into a cohesive document that characterizes the molecule in 5 sentences or fewer. Avoid general explanations and focus on unique features to craft a distinct description of the molecule.” Large-scale language models can use system prompts and five input natural language description documents to organically combine information from each substructure to generate a single, consistent, and logical complete molecular description text, as if written by an expert.
[0179] For the target molecular structure input by the user derived through the configuration described above, natural language description text generated using sparse substructures can be provided to the user through the input / output unit (140) of the system.
[0180] 2. Learning the Foundation Model
[0181] Meanwhile, the system of an embodiment of the present invention, as described above, may implement a multi-stage learning method to prevent the large-scale language model (230) from learning by focusing on molecular text (SELFIES) data rather than molecular graph data during the learning of the foundation model (200), that is, to maximize the utilization of graph information. The learning method may be a training method in which noise is intentionally input into one side of the data to learn the other side better, or a preference is set to focus more on one side of the data, in order to avoid the model relying only on specific information. In an embodiment of the present invention, the learning method of multimodal data of a system including a foundation model prevents the model from relying excessively on 1D text (SELFIES) information and induces it to utilize 2D molecular graph information more actively, thereby maximizing the utilization of 2D molecular graph information and resolving the graph bypass phenomenon that relies only on 1D text information.
[0182] Referring to FIG. 3, a method for training a foundation model according to one embodiment of the present invention may include the step of inputting a multimodal dataset (S100); a pre-training step of a hybrid graph encoder (S200); a pre-training step of a crossmodal bridge for modality alignment (S300); and a fine-tuning step of a foundation model (S400).
[0183] The step of inputting a multimodal dataset (S110) may input a multimodal dataset including molecular text data, molecular graph data, and a natural language-based task command sequence defining a task into the foundation model (200). In one embodiment, the molecular text data may be text data of the SELFIES expression type representing the structure of a molecule as a string, the molecular graph data may be data representing the molecular structure as a graph data structure including nodes and edges, and the task command sequence may be natural language text command data defining a task performed in the foundation model (200). The task command sequence defines various tasks at the molecular level and may include natural language sequences related to chemical reaction prediction, description generation generating text descriptions about molecular structures, etc., and molecular attribute prediction. Text-based data (molecular text, task command) input into the foundation model (200) may be converted into a numeric token sequence through a tokenizer (240).
[0184] The pre-training step (S200) of the hybrid graph encoder can pre-train the hybrid graph encoder (221) to have sufficient prior knowledge in the domain before integrating the hybrid graph encoder (221) and the large-scale language model (230).
[0185] The pre-training step (S210) of the hybrid graph encoder learns to effectively represent molecular structures as vectors, and in order to learn the ability to effectively represent molecular structures as vectors, the process of predicting functional groups of molecular structures that determine the chemical properties and reactivity of molecules and the process of restoring the original one-dimensional molecular text (e.g., SELFIES) using graph embedding information can be repeated.
[0186] Pre-training for functional group prediction of the hybrid graph encoder can be performed based on graph embeddings generated by the graph encoder (222) and the graph sequence encoder (223) of the hybrid graph encoder (221), respectively. The generated graph embeddings are input into a separate Multi-Layer Perceptron (MLP) to predict the functional groups of the molecular structure. Through such pre-training for functional group prediction of the hybrid graph encoder (221), the hybrid graph encoder (221) can accurately predict functional groups, which are key factors determining the chemical properties and reactivity of a molecule, thereby enabling the model to precisely understand the local chemical characteristics of the molecule.
[0187] Pre-training for one-dimensional molecular text restoration of a hybrid graph encoder can be performed by training it to restore the original one-dimensional SELFIES string using only the graph embedding information generated by the graph encoder (222) and the graph sequence encoder (223) of the hybrid graph encoder (221), respectively. Through this process, the hybrid graph encoder (221) can enhance its ability to preserve global structural information of the entire molecule without loss. Pre-training for text restoration can be performed by inputting the graph embeddings generated by the hybrid graph encoder into a separate decoder. Specifically, the graph encoder (222) (GINE) or the graph sequence encoder (223) (TokenGT) of the hybrid graph encoder (221) can generate a graph embedding vector that compresses the entire structural information of the molecule from the input 2D molecular graph, and the graph embedding vector thus generated can be input into a separate decoder, such as a decoder like GPT-2. The decoder can perform the task of reconstructing the SELFIES string of the original molecule one token at a time in order using only graph embedding information. Through such prior training, the hybrid graph encoder (221) can be trained so that the graph encoder (222) and the graph sequence encoder (223) generate graph embeddings containing enough information to enable reconstruction.
[0188] The hybrid graph encoder (221) can be optimized using an integrated loss function that sums the respective loss functions for performing these two tasks, and through this, the hybrid graph encoder (221) can learn a high-quality molecular representation that includes both local and global features of the molecule.
[0189] The pre-training step (S300) of the crossmodal bridge for modality alignment can align molecular graph embeddings and text embedding spaces by pre-training only the crossmodal bridge (224) (Q-Former) while the main body of the large-scale language model (230) is frozen. Specifically, in the pre-training step (S300) of the crossmodal bridge, the crossmodal bridge (224), which acts as an intelligent bridge connecting heterogeneous modalities such as the hybrid graph encoder (221) and the large-scale language model (230), can be pre-trained. Here, the weights of the pre-trained hybrid graph encoder (221) and the large-scale language model (230) are frozen so that they are not changed, and only the parameters of the crossmodal bridge (224) can be updated. The crossmodal bridge (224) receives molecular graph embeddings from the hybrid graph encoder (221) and can compress and refine this information into a fixed-length token sequence through a learnable query. Through this process, the converted graph information tokens and text tokens are input together into a large-scale language model (230) to predict the correct answer, and through the loss generated at this time, the crossmodal bridge (224) can learn how to convert the graph information in a way that the large-scale language model (230) can best understand.
[0190] The fine-tuning step (S400) of the foundation model can fine-tune the entire foundation model (200), that is, the entire foundation model (200) including the large-scale language model (230), to resolve the graph neglect phenomenon. The fine-tuning step may implement at least one of the following two methods, and in one embodiment of the present invention, both methods may be used. The fine-tuning step (S400) may include a damage-based learning step (S410) and a graph preference enhancement step (S420).
[0191] The damage-based learning step (S410) generates damaged text data by replacing a portion of the input 1D molecular text (SELFIES) data with random tokens, and as the model is trained through this, the foundation model (200) can be forced to rely more on 2D molecular graph information. That is, in this step, a portion of the molecular text data can be damaged by replacing some tokens in the token sequence of the tokenized molecular text data with other random tokens. Due to the damaged molecular text data, the foundation model (200) is unable to rely solely on the incomplete SELFIES information and can be trained to focus more on the 2D molecular graph data input together.
[0192] The graph preference enhancement step (S420) prepares a pair of an original 2D molecular graph and an incorrect molecular graph generated by modifying the substructure of the original 2D molecular graph, and can further optimize a preference loss function based on preference difference so that when the foundation model (200) receives the original molecular graph, the probability of generating the correct answer is increased, and when the incorrect graph is received, the probability of generating the correct answer is decreased. That is, in this step, a preference pair consisting of a correct molecular graph and an incorrect molecular graph in which the structural features of the substructure of the graph are intentionally modified is generated, and a preference optimization objective function can be used so that the foundation model (200) maximizes the probability of generating a result when the correct molecular graph is input and minimizes the probability of generating a result when the incorrect molecular graph is input.
[0193] A foundation model (200) trained using a total loss function that combines the respective loss functions of damage-based learning and graph preference enhancement learning can utilize 2D graph structure information in a preferential and in-depth manner regardless of the molecular task given.
[0194] 3. Inference of the Foundation Model
[0195] After repeating such learning steps, the final foundation model (200), in which the updated weights are stored, can process molecular-level tasks by receiving prompts from the user. That is, an inference step using the trained foundation model (200) may be further included. In the inference step, the user can perform desired molecular-level tasks through the trained model, and in this inference step, molecular text data or molecular graph data may not be damaged or altered. In the inference step, various molecular-level tasks are input as prompts by the user, and the foundation model (200) can understand the prompts, infer, and provide answers corresponding to the tasks. At this time, various molecular-level tasks may include chemical reaction prediction, explanation generation, molecular property prediction, etc. A sequence of task commands corresponding to the molecular-level tasks may be input as a prompt.
[0196] In the task of predicting molecular properties, the model can derive target properties for a molecule entered via a prompt along with commands such as "What is the HOMO energy of this molecule?" or "Does this molecule have any known side effects?" Additionally, in the task of predicting chemical reactions, when a part of a target chemical formula is entered along with a command such as "Give me possible reactants when the following products are given," the model can provide an answer to the user by completing the entire chemical formula, predicting the result molecule when the input molecule acts as a reactant, predicting the reaction molecule in a chemical reaction equation where the given molecule becomes the product, or predicting the molecule acting as a mediator when reactants and product molecules are given.
[0197] In addition, for the description generation task, when a molecular graph or molecular text description is entered as a prompt along with commands such as "Please provide a detailed description of the molecular structure" or "Can you generate a molecule based on this structural description?", the remaining information can be generated based on one side of the entered information, or when 1D molecular text is entered, a text description of the molecule can be generated, or a text description of the target molecule can be entered and 1D molecular text (SMILES or SELFIES) of the target molecule can be generated and provided as an answer.
[0198] In one embodiment of the present invention, the description generation process can generate a natural language description describing the input molecular structure by focusing more on the substructures of the input molecular structure. That is, the description generation process of one embodiment can calculate a molecular fingerprint for a target molecular structure input by a user. In one embodiment, a MACCS keys fingerprint can be calculated for the target molecular structure to identify a list of substructures included in the target molecular structure. Subsequently, some of the identified substructures for the input target molecular structure can be sampled. Specifically, substructures corresponding to bits with a value of '1' in a 166-bit vector obtained by applying the MACCS keys fingerprint to the target molecular structure can be identified. Subsequently, among the identified substructures, a preset number, for example, five substructures, can be selected using the sparsity score for the substructures as a sampling weight. That is, sampling can be performed by assigning weights such that the probability of sampling increases as the sparsity score increases, i.e., the sparser the substructure. Next, natural language description documents corresponding to the sampled sparse substructures can be retrieved from a pre-existing database, and the retrieved multiple natural language description documents can be integrated into a single coherent text to generate a final text describing the entire target molecular structure. In one embodiment, this integration process may utilize a large-scale language model (LLM). For example, while inputting five sampled natural language description documents into the LLM, a system prompt such as, "Your task is to integrate descriptions of substructures within the molecule to create a new document describing the entire molecule. If the user provides information about each substructure in individual paragraphs, you must integrate these descriptions into a coherent document that characterizes the molecule in five sentences or less. Avoid general descriptions and focus on unique features to create a unique description of the molecule," may be provided to the LLM.In this case, LLM goes beyond simple text summarization or merging to generate an organic description that includes the characteristics of the entire molecule by considering the interactions and synergies between information on each substructure. Through this, LLM can organically combine information from each substructure to produce a logical and highly complete text description of the molecule.
[0199] The final molecular description text regarding the target molecular structure, focusing on the sparse substructures generated in this way, can be provided to the user through the system's input / output section or by saving it as a file.
[0200] As described above, embodiments of the present invention can provide a new integrated multimodal foundation model that exhibits excellent performance and generalization ability in various molecule-related tasks by combining a new learning strategy that forces the maximum utilization of two-dimensional graph structure information of molecules with a method that automatically generates large-scale informative natural language descriptions that emphasize unique and sparse features of each molecule.
[0201] 4. Foundation Model-Based Molecular Information Analysis System
[0202] A foundation model-based molecular information analysis system (400) according to one embodiment of the present invention includes a user computing device (410) and a server computing system (420), and these components are connected to communicate through a service environment (e.g., an internet site) to provide services such as molecular information analysis to the user computing device (420).
[0203] A user computing device (410) is a client terminal for a user to request a molecular information analysis service and receive the results, and may include any type of computing device capable of connecting to the internet, such as a smartphone, tablet PC, or desktop computer. The user computing device (410) may include a user input unit (411) (e.g., touchscreen, keyboard) for receiving user input and a display (412) for outputting analysis results received from the server. The user may access the service through a web browser or a dedicated application and input a task command to be performed (e.g., "What is the HOMO energy of this molecule?") in the form of a prompt, along with information about the molecule to be analyzed (e.g., text of SMILES, SELFIES) or a molecular structure graph file. The input request data may be transmitted to a server computing system (420) through a communication unit.
[0204] A server computing system (420) includes a high-performance processor such as a central processing unit (CPU) and a graphics processing unit (GPU), and a large-capacity memory, and can load and execute a foundation model (200) according to one embodiment of the present invention. A foundation model (200) according to one embodiment may include a data input unit (210) that receives multimodal data including 1D molecular text, 2D molecular graph, and task command sequence from a user computing device, a graph processing unit (220) that includes a hybrid graph encoder (221) for extracting both local and global features of the 2D molecular graph data and a crossmodal bridge (224) that aligns the extracted graph information with text information, and a large-scale language model (230) that comprehensively understands the molecular text and the graph embedding converted from the graph processing unit, performs inference according to the user's task command, and generates a final result. In addition, when a request for an explanation generation task is made, the foundation model (200) can identify statistically sparse substructures based on molecular fingerprints for molecular data input by the user, and integrate explanations that have already been written for the substructures and built in the database (421) to generate and provide to the user natural language explanation documents with high informational value. That is, the foundation model-based molecular information analysis system may include a database (421) that stores frequency analysis of sparse substructures and natural language explanation documents for each substructure.
[0205] A method for providing services of a foundation model-based molecular information analysis system according to one embodiment of the present invention may include a user request step, an information processing step, a response step, and an output step.
[0206] In the user request stage, the user can send a request to the server by inputting molecular information to be analyzed and task commands through the user computing device. In the information processing stage, for the received request, the foundation model sends the multimodal input to the respective processing units to convert it into embeddings, and performs inference corresponding to the user's request (chemical reaction prediction, molecular property prediction, description generation, etc.) using a large-scale language model. In the response stage, after the inference is completed, the foundation model generates results (e.g., predicted property values, generated molecular text, natural language description). In the output stage, the server computing system transmits the generated results to the user computing device via the communication unit, and the user can view the analysis results on their device screen.
[0207] According to a foundation model-based molecular information analysis system according to one embodiment of the present invention as described above, by providing a high-performance multimodal foundation model requiring complex and extensive computation in the form of a service from a central server, general researchers or students can easily utilize state-of-the-art molecular analysis tools through the web or applications without expensive computing equipment.
[0208] This can lower entry barriers for research on new materials and new drugs, and significantly reduce the time and cost required for research and development, thereby contributing to the development of the related industrial ecosystem.
[0209] In particular, by providing a function to generate explanatory text centered on the unique and rare chemical characteristics of molecules, users can gain a deep and intuitive understanding of molecular structures, which can be very useful in the fields of education and research.
[0210] Meanwhile, one embodiment of the present invention may be implemented as an Application Specific Integrated Circuit (ASIC) manufactured to suit specific application fields and special functions of devices.
[0211] Custom integrated circuits are also referred to as custom semiconductors. Unlike standard semiconductors, which have fixed specifications and can be applied to any electronic product or application as long as certain requirements are met, custom semiconductors are used for specific products or functions and are integrated circuits manufactured by semiconductor companies to meet specific orders. In other words, custom semiconductors are designed and manufactured to perform only the functions necessary for a specific device or feature. Custom semiconductors are broadly classified according to their design method into Full Custom ICs, which design and manufacture circuits from scratch to meet user requirements, and Semi-Custom ICs, which design and manufacture circuits using parts of a standardized design.
[0212] Application-specific semiconductors are primarily used in communication systems, high-performance computing systems, consumer electronics, automobiles, industrial automation, medical devices, the military, and the aerospace industry; recently, they are being applied to AI semiconductors that execute the large-scale computations required for AI implementation with high performance and power efficiency.
[0213] Application-specific semiconductors (ASICs) are used as core components in communication systems, such as network routers, switches, and modems, performing data packet processing, protocol conversion, and signal processing to provide high throughput and low latency. In high-performance computing systems, ASICs serve as key components for high-speed and parallel processing, while in consumer electronics—including digital cameras, smartphones, tablets, and game consoles—ASICs provide high-performance and low-power solutions required to perform specific functions. In the automotive industry, ASICs are used to control various electronic systems within vehicles, and in industrial automation systems, they provide solutions for high-precision control and high-performance processing.
[0214] An application-specific integrated circuit to which an embodiment of the present invention is applied includes a memory in which an individual memory interface (I / F) is implemented, and may include a plurality of function blocks that request memory access. Each function block may be a Direct Memory Access (DMA) function block, a processor, a video processor, a cache controller, a decompression block, or a data path block. The basic configuration of the application-specific integrated circuit may include a transistor that amplifies or switches an electrical signal, a logic gate which is a circuit that performs a logical function by combining transistors, a memory cell that stores data, an analog circuit which is a circuit that processes continuous voltage or current by combining transistors, and an Intellectual Property Core (IP Core) such as a microprocessor, DSP, or graphics core that is pre-designed to perform a specific function.
[0215] The ASIC may include an individual memory I / F that interfaces with individual memory and an embedded memory I / F that interfaces with embedded memory. The individual memory I / F is connected to each function block to receive memory access signals (e.g., control signals, address signals, and data signals) and, based on these input signals, can generate signals to control the individual memory. The embedded memory I / F is connected to each function block to receive memory access signals (e.g., control signals, address signals, and data signals) and, based on these input signals, can generate modified memory access signals to control the embedded memory. The individual memory I / F and the embedded memory I / F are designed within the memory control block of the ASIC to provide a memory control structure that can be flexibly applied to both the individual memory and the embedded memory.
[0216] Additionally, an application-specific integrated circuit (ASIC) for an artificial neural network (ANN) is composed of multiple neurons arranged in an array and multiple synapse circuits, each neuron being composed of a register, a microprocessor, and at least one input, and each synapse circuit being configured to include memory for storing synapse weights. Here, each neuron of the ASIC may be connected to at least one other neuron through one of the multiple synapse circuits.
[0217] Although the present disclosure has been described as generally being implementable by a computing device, a person skilled in the art will be well aware that the present disclosure may be implemented in combination with computer-executable instructions and / or other program modules that can be executed on one or more computers and / or as a combination of hardware and software.
[0218] The embodiments of the present disclosure as described above can be performed by an artificial intelligence agent.
[0219] [Correction pursuant to Rule 91 12.02.2026] The agent's workflow may include the types of FIGS. 40a to 40c.
[0220] [Correction pursuant to Rule 91 12.02.2026]<Deleted>
[0221] In one embodiment, most of the structure may consist of a hierarchical structure of "Input → Plan / Evaluate / Branch → Output". Additionally, due to the agentic nature, complex problem solving may be possible through patterns such as iteration, parallelism, and cooperation, rather than a single large-scale language model (LLM) call. Furthermore, since each structure is designed to suit a specific business purpose, selecting the optimal structure according to the purpose is important. For example, a workflow such as [Table 2] may be recommended depending on the business purpose.
[0222] Business Purpose Recommendation Workflow Conversational AI (e.g., Chatbots, Consultation) Prompt Chaining, Routing Real-time Analysis and Evaluation Evaluator-Optimizer, Reflection Complex Business Planning and Automation Plan and Execute, Rewoo Autonomous Behavior-based Systems Autonomous Workflow Large-scale Parallel Processing or Synthesis Parallelization, Orchestrator-Worker
[0223] FIG. 5 is a drawing showing a sequential multi-agent according to one embodiment of the present disclosure.
[0224] Referring to FIG. 5, a sequential multi-agent may include a user agent, a write agent, a style agent, etc. In one embodiment, the sequential agents communicate sequentially and can perform a single task in order. Additionally, each agent can receive the result of the previous step and perform the next task. For example, the user agent may obtain user input from the user to generate a first processing result, the write agent may generate a second processing result including the written text by generating the first processing result from the user agent, and the style agent may generate a third processing result by receiving the second processing result from the write agent and applying a style. Accordingly, the third processing result may finally be output. Such a sequential multi-agent has a linear flow and is suitable for processing a single task. For example, the sequential multi-agent can be used in the field of creative writing.
[0225] FIG. 6 is a drawing showing a supervisory agent according to one embodiment of the present disclosure.
[0226] Referring to FIG. 6, a supervised agent may refer to an agent in which a centrally located Supervisory Large Language Model (LLM) coordinates the entire process. In one embodiment, the supervised agent may direct necessary tasks to appropriate agents and aggregate results in response to user requests. That is, the supervised agent may manage communication between agents. Accordingly, the supervised agent may enable flexible task distribution. For example, the supervised agent may request research from a research agent and request calculations from a mathematics agent. Such a supervised agent may be used in fields such as deep research.
[0227] FIG. 7 is a diagram showing a hierarchical agent system according to one embodiment of the present disclosure.
[0228] Referring to FIG. 7, a hierarchical agent system can refer to a system in which a meta-agent controls and coordinates lower-level agents. For example, the meta-agent can obtain user input from a user, request tasks from research agents, data analysis agents, etc., receive task results from each agent, and generate outputs. This hierarchical agent system has a hierarchical control structure, allowing tasks to be divided and managed in a more complex manner. In other words, a meta-agent acting as an intermediate manager can be utilized. This hierarchical agent system is suitable for complex systems or coding, and can be used as a coding agent, etc.
[0229] FIG. 8 is a drawing showing a multi-agent discussion type system according to one embodiment of the present disclosure.
[0230] Referring to Fig. 8, a multi-agent discussion system is a system in which multiple agents present different opinions and select the most appropriate result by voting or evaluating it, thereby deriving an optimal solution based on discussion. That is, multiple agents perform discussion and evaluation based on user input and can output the optimal answer among them. Such a multi-agent discussion system can make the best choice by comparing various perspectives in a competitive structure and can be used in fields such as world simulation.
[0231] FIG. 9 is a diagram showing a Mixture-of-AI Agents system according to one embodiment of the present disclosure.
[0232] Referring to FIG. 9, in a hybrid AI agent system, multiple agents perform parallel processing layer by layer, and an aggregator can integrate the results at the end. For example, a first agent and a second agent may perform a process in parallel at the first layer, and then the first agent and the second agent may perform a process in parallel at the second layer, and an aggregator may synthesize these to generate an output. In one embodiment, the hybrid AI agent system may use a multi-stage approach for complex problems. The hybrid AI agent system has a hierarchical and parallel structure and is characterized by the distribution and combination of expertise, so it can be used for medical research, etc.
[0233] FIG. 10 is a drawing showing a ReAct agent system according to one embodiment of the present disclosure.
[0234] Referring to FIG. 10, ReAct is a compound word of "Reason + Act," and a ReAct agent can refer to an agent that solves problems by repeating reasoning (Reason) and action (Act). For example, if a user asks, "What is the weather like in New York these days?", a large-scale language model (LLM) can interpret the meaning of the query, search for the current weather in New York through a search engine's search tool, summarize the results, and deliver them back to the user. Such a ReAct agent can be used in AI chatbots, etc. According to one embodiment, the ReAct agent has excellent tool usage capabilities and can generate more accurate responses through the repetition of reasoning and action.
[0235] FIG. 11 is a drawing showing a CodeAct agent system according to one embodiment of the present disclosure.
[0236] Referring to Fig. 11, the CodeAct agent can handle more flexible and complex logic by executing Python code instead of JSON. For example, when it receives input from a user such as "Analyze sales data for the last 3 months," a large-scale language model analyzes the request, and Pandas can be used for Python code. Subsequently, tasks such as loading CSV files, performing statistical calculations, generating graphs, and creating summary reports can be performed. Because the CodeAct agent is code-based, it is strong in handling complex calculations and logic, and the large-scale language model (LLM) can directly program and execute.
[0237] FIG. 12 is a drawing showing a modern tool-using agent system according to one embodiment of the present disclosure.
[0238] Referring to Fig. 12, the agent using modern tools can easily utilize various SaaS tools or APIs (e.g., AWS, Brave Search, etc.) through a Multi-Channel Processing (MCP) server. For example, when the agent receives a text request from a user saying "Stop my AWS EC2 instance," it can call the AWS API through the MCP server and return a message indicating successful stop. The agent using modern tools can be used in developer IDE-integrated AI, etc., and has the advantage of enabling tool control with almost no code and facilitating easy integration with various cloud or web functions.
[0239] FIG. 13 is a drawing showing a self-reflective agent system according to one embodiment of the present disclosure.
[0240] Referring to Fig. 13, a self-reflective agent system can refer to a metacognitive mechanism in which a large-scale language model evaluates and modifies its own responses. For example, upon receiving input from a user such as "Write a cover letter that fits my resume," the system may generate a draft using a large-scale language model (LLM), check for logical and contextual errors using a Critique large-scale language model (LLM), and generate a final output after iterative modification and improvement. The self-reflective agent system can automatically improve quality and incrementally enhance performance through a feedback loop.
[0241] FIG. 14 is a drawing showing a multi-agent workflow system according to one embodiment of the present disclosure.
[0242] Referring to Fig. 14, a multi-agent workflow system can refer to a system in which multiple specialized agents cooperate to perform a single task. For example, upon receiving input from a user such as "Please write a startup market research report," the first agent can collect the latest market trends, the second agent can analyze competitors, and the third agent can summarize investment trends. An aggregator can synthesize the information generated by the first, second, and third agents to generate a report. The multi-agent workflow system improves accuracy through a cooperative structure and can divide and process complex tasks.
[0243] FIG. 15 is a drawing illustrating an Agentic search-based generation (RAG; Retrieval-Augmented Generation) system according to one embodiment of the present disclosure.
[0244] Referring to Fig. 15, an Agentic Retrieval-Augmented Generation (RAG) system can be described as a system in which an AI retrieves information from an external database or search engine in real time and generates a response based on it. For example, when a query such as "What are the major issues of the 2024 US presidential election?" is obtained from a user, the AI performs a vector DB / web search (Google, News, etc.), extracts relevant articles and summaries, and can write an explanation using that information. Such an Agentic Retrieval-Augmented Generation (RAG) system can provide the latest information and generate an accurate response that fits the context.
[0245] FIG. 16 is a drawing showing a Multi-Agent Debate (MAD) system according to one embodiment of the present disclosure.
[0246] Referring to FIG. 16, a Multi-Agent Debate (MAD) system may refer to a system in which multiple small language models derive an answer through discussion. An aggregator, for example, an aggregated large language model (LLM), can receive a query from a user and combine the opinions of multiple small language models to generate a final answer. In one embodiment, when a user inputs a query, the aggregator generates an initial answer, and multiple Small Language Models (SLMs) can present different answers and refute each other. For example, when the first SLM presents an answer such as "This is the answer," the second SLM presents "No, this is the answer. I verified it this way," and the third SLM presents "I think it is almost correct, but there are these points," the aggregator can make an intermediate judgment based on the discussion content. Based on this first discussion and previous judgment, a more refined second discussion can proceed. Some models may utilize tools (search, vector database, etc.) to verify facts. After repeating various discussions in this manner, the aggregator can determine the most appropriate response as the final Verdict and deliver it to the user. In one embodiment, various SLMs can participate in discussions and verification with each other.
[0247] FIG. 17 is a diagram showing an A2A (Agent2Agent) protocol system according to one embodiment of the present disclosure.
[0248] Referring to Fig. 17, the A2A (Agent2Agent) protocol enables communication without sharing data with each other, allows for task distribution and negotiation among multiple agents, and enables each agent to maintain shared context and state information. In the example of Fig. 13, the first AI agent (AI Agent 1) can primarily perform local-based file or search tasks and can connect to various MCP servers via the MCP Protocol. In the example of Fig. 16, the second AI agent (AI Agent 2) can primarily handle cloud and communication tasks and can connect to various MCP servers via the MCP Protocol. MCP allows for communication by separating each function (file access, search, cloud, etc.) into separate servers. Additionally, A2A has the advantage of high security because it enables direct communication between agents without sharing data. Each agent can operate independently using its own large-scale language model, framework, and database.
[0249] FIG. 18 is a drawing illustrating an Agentic search-based generation (RAG; Retrieval-Augmented Generation) system according to one embodiment of the present disclosure.
[0250] Referring to FIG. 18, an Agentic Retrieval-Augmented Generation (RAG) system may refer to a system that extracts data from a website, stores it in a vector database, searches for similar information in response to a user query, and generates a response through a Large Language Model (LLM). In particular, this structure can support advanced question-answering by including agent functions (Memory, Tools, Planning, etc.). In one embodiment, the Agentic Retrieval-Augmented Generation (RAG) system may include a data extraction step, a search step, and a generation step. The data extraction step may include a step of extracting data from a specified website (e.g., GitHub, Hacker News, etc.) or web content (website content in various formats such as text, images, audio, video, etc.). Additionally, the data extraction step may include a preprocessing and storage step. The preprocessing and storage steps may include a step of extracting text and metadata from the content, a step of chunking the text into small units, a step of vectorizing each piece through an embedding model, and a step of storing the vectorized data in a vector database (Vector DB).
[0251] In one embodiment, the search step of the Agentic Search-based Generation (RAG) pipeline may include the steps of: inputting a user query; performing embedding after query rewriting; searching for similarity in a vector database; configuring the search results into a context; and ranking the context based on relevance.
[0252] In one embodiment, the generation step of an Agentic search-based generation (RAG) pipeline may include: a step of generating an input by combining a user query and a searched context; a step in which a large-scale language model (LLM) generates a response—including agentic elements such as memory functions, tool calls, and planning functions; and a step of providing the generated response to a user.
[0253] In one embodiment, website content is collected and textified, text segmentation and embedding are performed, stored in a vector database, user query embeddings and similar content are searched, relevant context is constructed, Large Language Model (LLM) input is expanded, and Large Language Model (LLM)-based response is generated and delivered to the user.
[0254] In one embodiment, the Agentic Search-Based Generation (RAG) system is a Search-Based Generation (RAG) system capable of generating precise and contextually relevant answers based on its architecture, and capable of handling complex queries through memory storage, calling external tools, and planning. It also supports various forms of content such as images, audio, and video in addition to text, enabling multimodal input processing, and can improve search accuracy by including query rewriting and ranking functions.
[0255] In one embodiment, an Agentic search-based generation (RAG) system can collect real-time information from websites and utilize it for question and answer, and can be applied to various services such as technical support, search engine enhancement, and personal assistant services, and can be utilized for complex task automation through Agentic components.
[0256] FIG. 19 is a schematic diagram of an AI agent system according to one embodiment of the present disclosure.
[0257] Referring to FIG. 19, based on system prompts and user prompts, the AI system can generate a final response by performing actions such as formulating a plan, calling a tool, storing it in memory, and collecting feedback. This AI agent system features an autonomous planning and execution structure and can dynamically select and execute tools. In addition, the AI agent system can be used for complex multi-step tasks, business automation, research, etc., as it enables continuous learning and improvement through a feedback loop.
[0258] An agent system according to one embodiment of the present disclosure may be an advanced agent system that goes beyond a simple input-output structure by incorporating complex functions such as memory, reasoning, tool integration, and planning.
[0259] FIG. 20 is a schematic diagram of a large-scale language model (LLM) chatbot according to one embodiment of the present disclosure.
[0260] Referring to FIG. 20, a large-scale language model (LLM) can take user input as input and output a response. This system consists of a single large-scale language model (LLM) call and does not involve external tool calls or complex workflows, making it suitable for simple question and answer, FAQ, and customer service. However, this system may lack the ability to maintain context or perform repetitive tasks.
[0261] FIG. 21 is a schematic diagram of a Robotic Process Automation (RPA) system according to one embodiment of the present disclosure.
[0262] Referring to Fig. 21, a tool can be executed according to predefined fixed rules using user input. In this case, the Large Language Model (LLM) may be limited to a secondary role. Such an RPA system is suitable for repetitive automation and simple back-office tasks, but it may have limitations in flexible flow control or advanced decision-making.
[0263] FIG. 22 is a schematic diagram of a search-based generation (RAG) system according to one embodiment of the present disclosure.
[0264] Referring to Fig. 22, user input is embedded and converted into a vector DB, and an augmented prompt is processed by a Large Language Model (LLM) by searching for relevant information, and a final response can be output. This search-based generation (RAG) system is suitable for fields such as generating correct answers using external knowledge (documents, etc.), improving the accuracy of search-based Large Language Models (LLM), QA systems, document search, and summarization, but there are limitations in processing information outside the scope of the searched information.
[0265] FIG. 23 is a schematic diagram of a Learning-Augmented Mechanism (LAM) according to one embodiment of the present disclosure.
[0266] Referring to Fig. 23, the LAM system can perform a process in which a large-scale language model (LLM) based on training data processes user input to execute a tool, thereby performing a task and outputting a response. The LAM system uses a model trained using tool usage data, enables learning and execution based on actual behavior, and is suitable for automating repeatable GUI tasks. However, since the LAM system requires a training process, there are limitations to generalization.
[0267] These AI systems are summarized as shown in [Table 3] below.
[0268] Item Structure Key Features Limitations Large-scale Language Model Chatbot: Single Large-scale Language Model (LLM) call, generates simple responses, lacks context retention RPA: Rule-based + fixed tool, repetitive automation, lacks flexibility Search-based Generation (RAG): Search-based + prompt augmentation, utilizes external knowledge, lacks information other than search LAM: Behavior Learning-based GUI Automation, requires learning, limited versatility AI Agents: Planning + memory + tools + repetition, performs autonomous tasks, complex structure, consumes resources
[0269] FIG. 24 is a diagram showing an AI agent memory structure according to one embodiment of the present disclosure.
[0270] Referring to FIG. 24, the process of an AI agent generating a response to a query by utilizing memory is illustrated. Memory is divided into short-term and long-term, and each can support complex decision-making and task execution through various types of memory. In one embodiment, memory may consist of short-term memory and long-term memory. Short-term memory is a temporary memory space activated during work, allowing focus on the currently ongoing workflow (task execution). Short-term memory may include Working Memory, which manages reasoning and task flow per workflow, and Cache Memory, which provides rapid access to frequently used data and result values. Long-term memory is knowledge and experience-based memory that is continuously preserved, and may include Episodic Memory, which stores events or incidents manually saved in a specific workflow; Semantic Memory, which stores conceptual and factual knowledge (e.g., "Paris is the capital of France"); and Procedural Memory, which stores methods of task execution or procedural knowledge (e.g., "How to reset a server").
[0271] In one embodiment, all memory can operate in conjunction with a large-scale language model framework through a central memory controller. An input query can go through a process via the framework of 1) query analysis and memory referencing, 2) retrieving relevant information from memory if necessary, 3) the large-scale language model generating a response through a decision procedure, and 4) delivering the response result to the user.
[0272] In one embodiment, the MCP server is responsible for interfacing with external knowledge and tools and may include a vector database which is a search-based embedding vector repository, a Semantic Database which is a conceptual knowledge base database, a third-party API integration unit, etc.
[0273] In other words, the process can be handled as follows.
[0274]
[0275] One embodiment of the present disclosure can systematically integrate design elements essential for implementing an agent system that can continuously learn and make decisions according to the situation, as shown in [Table 4] below.
[0276] Item Description Integration of Memory Structures Combines short-term (Working / Cache) and long-term (Episodic / Semantic / Procedural) memory to enhance context awareness and learning. Includes decision logic for prompt adjustment and response generation based on a framework linked to memory. External Information Extensibility Supports information enrichment and real-time integration via databases and APIs. User-Customized Responses Capable of generating responses that reflect past experiences and current context.
[0277] Next, I would like to explain the types of large-scale language models utilized by AI agent systems. Each model has a specific processing method and role, and a suitable model can be selected and utilized depending on the nature of the task.
[0278] FIG. 25 is a drawing showing a GPT (General Pretrained Transformer) model according to one embodiment of the present disclosure.
[0279] Referring to Fig. 25, the GPT model is a general-purpose large-scale language model pre-trained on a large-scale text corpus, which can tokenize and embedding input prompts, generate hidden states through a transformer layer, calculate logits and probabilities for the next token, and sequentially sample or select tokens to generate text. This GPT model has the advantage of high generality.
[0280] FIG. 26 is a drawing showing a Mixture of Experts (MoE) model according to one embodiment of the present disclosure.
[0281] Referring to Fig. 26, the Mixture of Experts (MoE) model is a decentralized model structure in which only some experts (sub-networks) selected according to the input are activated. The input is tokenized and embedded, a gating network selects a top expert sub-model per token, the outputs of the selected experts are merged (weighted average or aggregated), and decoding is performed based on the merged result. This MoE model has the advantage of high computational efficiency relative to the number of parameters.
[0282] FIG. 27 is a drawing showing a Large Reasoning Model (LRM) according to one embodiment of the present disclosure.
[0283] Referring to FIG. 27, the LRM is a model capable of processing a chain of thought for complex reasoning, performing input and context tokenization, internally generating a reasoning path, evaluating or regenerating possible logical paths, and determining and outputting a final logical answer. Such an LRM may be suitable for solving high-difficulty problems, logic-based question answering, etc.
[0284] FIG. 28 is a drawing showing a Vision Language Model (VLM) according to one embodiment of the present disclosure.
[0285] Referring to FIG. 28, VLM is a multimodal model that integrally understands and processes images and text. It can perform image encoding, perform text tokenization, combine both modalities into an integrated embedding, and generate a response by inferring based on the integrated representation. VLM can be advantageous for generating explanations and answering questions that include visual information.
[0286] FIG. 29 is a drawing showing a Small Language Model (SLM) according to one embodiment of the present disclosure.
[0287] Referring to FIG. 29, the SLM is a large-scale language model with a lightweight structure that can be used in environments with limited computational resources. It performs input tokenization, projects embedded tokens into a low-dimensional space, passes them through a simplified transformer layer, calculates token probabilities, and generates output. Such an SLM can be suitable for edge devices, on-device AI environments, etc.
[0288] FIG. 30 is a drawing showing a Large Action Model (LAM) according to one embodiment of the present disclosure.
[0289] Referring to FIG. 30, LAM is a model trained to perform actions in a real environment, capable of tokenizing and embedding task descriptions and environmental states as inputs, planning a sequence of actions (based on Chain-of-Thought), executing actions and making API calls within the environment, and monitoring results. LAM can perform modification and iteration procedures as necessary. In one embodiment, LAM can be used in robot control, game agents, automation systems, etc.
[0290] FIG. 31 is a drawing showing a Hierarchical Reasoning Model (HRM) according to one embodiment of the present disclosure.
[0291] Referring to Fig. 31, HRM is a hierarchical model that processes inference by dividing it into high-level planning (H-layer) and low-level computation (L-layer). It can derive a final result by performing high-level planning after input encoding, performing detailed operations and iterations at each step, and constructing an iterative feedback loop until convergence. HRM can be effective for complex multi-step planning and inference tasks.
[0292] FIG. 32 is a drawing showing a ToolFormer (Tools-trained Model) according to one embodiment of the present disclosure.
[0293] Referring to Fig. 32, ToolFormer is a large-scale language model trained to use various external tools. It starts based on a pre-trained large-scale language model (LLM), samples examples of tool calls, determines valid tool usage through test and evaluation, and performs fine-tuning with filtered data. ToolFormer has advantages in integration with calculators, searches, API calls, etc.
[0294] [Correction pursuant to Rule 91 12.02.2026] The four major types of artificial intelligence systems—basic large-scale language model (LLM) workflow, search-based generative (RAG), single AI agent, and multi-agent based Agentic AI—are compared and summarized in terms of structure, function, characteristics, and use cases as shown in Figure 41 and the following [Table 6].
[0295] [Correction pursuant to Rule 91 12.02.2026]<Deleted>
[0296] Item Large-scale Language Model Workflow Search-based Generation (RAG) AI Agent Agentic AI Functionality Input-based next token prediction Answer search and augmentation through external knowledge Autonomous execution + component combination Autonomous tasks based on multi-agent collaboration Representative Use Cases Text generation, summarization Accurate question answering tools and planning Workflow Large-scale tasks, collaboration-based problem solving Strengths Fast, simple, easy to deploy Improved accuracy through external knowledge Planning + reasoning-based automation Flexible division of labor, solving complex problems Weaknesses Limited contextual understanding Sensitivity to data quality Requires clear goals and tool access Increased design and control complexity Examples Chatbot, email generator Graph RAG, Modular RAG ReAct Agent, Rewoo Agent CUA (Computer Using Agent), Embodied Agents
[0297] FIGS. 33 to 38 are drawings illustrating vulnerabilities of an MCP according to one embodiment of the present disclosure.
[0298] Referring to Fig. 33, a command injection problem can occur in the MCP. That is, hidden commands (intended meanings) can be inserted into the prompt entered by the user to induce an agent to operate on the MCP server without authorization. For example, the agent can gain unauthorized access to external resources such as drant or supabase.
[0299] Referring to Fig. 34, tool addiction problems can occur in MCPs. By including a tool with malicious code inserted into the MCP, it can be induced to produce incorrect results for specific tasks or perform intended actions. For example, an attacker can gain access to a service such as Slack and steal API keys or personal information.
[0300] Referring to Figure 35, server-sent event issues can occur in MCP. Since the Server-Sent Events (SSE) method transmits data in segments, the connection must remain open for a long time, which can cause latency and security issues. For example, when transmitting to Slack, drant, stride, etc., the connection may remain open for a long time, which can lead to security risks.
[0301] Referring to Fig. 36, privilege escalation issues can occur in MCP. A malicious tool can intercept or overwrite calls to other trusted tools, thereby stealing the privileges of the tools trusted by the user. For example, an attacker can override tools such as Slack within the MCP server to connect to a malicious server.
[0302] Referring to Fig. 37, persistent context issues can occur in MCP. MCP records and maintains context throughout a user's session. This can lead to context tampering. For example, there is a risk that session contexts linked with tools such as AWS, Kagi, and Notion may be tampered with.
[0303] Referring to Fig. 38, a problem of server data theft may occur in MCP. If a tool server connected to the MCP server is hacked, data and passwords from other servers can be stolen or controlled. For example, a client may be able to access user data from another server through a malicious MCP tool.
[0304] FIG. 39 is a diagram illustrating a context engineering structure in an AI agent system according to one embodiment of the present disclosure.
[0305] Referring to Fig. 39, a flow and memory structure can be used to effectively configure and utilize context in an AI Agent system. User input may include input queries or requests provided by the user to the system. The agent may be a central component that plans tasks and coordinates execution based on given inputs and context. Additionally, Search-based Generation (RAG) may include a knowledge retrieval component that retrieves relevant documents through long-term memory-based vector search. Action Tools may perform functions such as calling various external tools for code execution, document lookup, time checking, and calendar processing. Long-Term memory includes memory based on persistent knowledge stored through an MCP server and database, and Short-Term memory may store prompt components containing the context of the current conversation session (input, reasoning, tool usage history, etc.).
[0306] In one embodiment, a user may input a query or request into the system. Based on user input, the agent may formulate a plan and coordinate tasks according to the current context and purpose. The Search-based Generation (RAG) system may perform a search on the vector DB if necessary and collect relevant information or documents. Action Tools may call external tools according to the requested task to retrieve execution results. The Prompt Generation and Update Unit may generate or update prompts based on the collected information (search, tools, results, inference, etc.). A final response may be generated and delivered to the user following these procedures. All contextual elements may be stored in the 'Chat History' within short-term memory. For example, information related to input, tools, usage, inference, etc., may be stored in short-term memory. Specific contexts, results, etc., may be added to long-term memory (e.g., MCP server / DB) to ensure reusability.
[0307] In one embodiment, user input includes initial input such as user queries or instructions, and tools may include external systems, APIs, calculators, document tools, etc. Agents may perform internal judgment, planning, state management, etc. as inference. Search-based generation (RAG) context includes document-based knowledge through vector search, and user children may include user settings, profiles, IDs, etc. Conversation history may include records of previous queries and responses.
[0308] According to one embodiment, context-based accuracy can be improved by utilizing both short-term memory (prompt) and long-term memory (DB). Additionally, a dynamic prompt can be configured by integrating not only user input but also tool usage results, search documents, and reasoning content. Furthermore, memory layers can be separated. For example, short-term memory is maintained during the session, while long-term memory can be used for long-term strategic iterations. An agent is responsible for the decision-making and execution of each step and can play a key role in coordinating the overall flow.
[0309] In one embodiment, an agentic AI system centered on user input, including search (RAG), tool calling, internal reasoning, and memory linkage, may be utilized. The system aims to provide sophisticated responses and perform tasks tailored to the situational context by sophisticatedly structuring and utilizing various contextual information generated during interaction with the user. In one embodiment, user input includes various forms of user requests such as text, voice, and images; the agent module performs planning, determines whether to search or execute a tool, and generates responses based on user requests; and the search-based generation (RAG) module performs vector-based similar document searches and can be used for external knowledge and context reinforcement. Action tools can call various functional tools such as external APIs, calculators, calendars, search engines, and databases. In one embodiment, the prompt engine can construct a final prompt by integrating various contextual elements. Short-Term Memory (STM) is a temporary storage for maintaining context and constructing prompts within a session, and Long-Term Memory (LTM) is a persistent memory structure based on an MCP server and database, which can be used for the agent's long-term learning and the utilization of accumulated experience.
[0310] The context processing pipeline can be as follows.
[0311] (1) User Input → (2) Agent Module → (3) If necessary → RAG Module → Search and convert related documents → (4) If necessary → Call Action Tools and obtain results → (5) Prompt Engine: Integrate contextual information (Input + Search + Tool results + Inference) → (6) Generate response via LLM → (7) Store the entire current context in Short-Term Memory → (8) Transfer and accumulate important information in Long-Term Memory → (9) Deliver response to the user (Answer)
[0312] In one embodiment, the prompt may be composed of user input corresponding to a query or command, tool usage results including API call results, calculation results, etc., search-based context including documents retrieved from search-based generation (RAG), agent reasoning including internal reasoning and planning, user information including preferences, ID, status, etc., conversation history including the previous conversation context, etc.
[0313] In one embodiment, the prompt may be deleted based on priority when the maximum prompt length is exceeded. For example, the priority may be configured in the order of agent inference > search context > tool results > user information > past conversation history. In one embodiment, result values after a tool call may be inserted into the prompt in the form of a "contextual tag." Additionally, the search context may be inserted along with a summary and confidence score, rather than the original text.
[0314] In one embodiment, short-term memory is intended to maintain the entire context within a session and may store user input, prompt components, reasoning processes, tool usage results, etc. Short-term memory is deleted upon session termination, but important information may be transferred to long-term memory.
[0315] In one embodiment, the long-term memory may be composed of an MCP server (Agent Metadata) and a domain knowledge DB (Structured Knowledge), etc. The long-term memory may be updated when the Add to memory command is executed or when automatic saving conditions are satisfied. For example, information such as "User A prefers tools related to 'data visualization'" and "On August 7, 2025, the 'Search-based Generation (RAG) + Tools' path was used in the 'Context Engineering' flow" may be stored in the long-term memory.
[0316] In one embodiment, the agent can determine whether to call a tool, the necessity of searching, and the possibility of repeated calls. Additionally, the agent can perform priority-based reasoning. For example, priorities may proceed in the order of user goal → environment state → available resources → execution strategy. In one embodiment, the agent can support parallel calls to multiple tools and support feedback-based iterative execution after execution.
[0317] For example, in the case of multimodal question and answer, when a user image is uploaded, a corresponding description is generated, and a date corresponding to the user image can be calculated via a tool call.
[0318] As another example, in the case of report generation, the search and summarization process proceeds based on user instructions, and templates can be inserted and edited.
[0319] As another example, in the case of automated schedule coordination, the calendar API is invoked based on natural language requests, and schedule recommendations can be provided.
[0320] According to one embodiment, a multi-agent-based Agentic AI extension structure may be supported. Additionally, according to one embodiment, a prompt dynamic optimization (auto-slimming) algorithm may be implemented. Furthermore, a memory vectorization-based summary storage module may be constructed, and user-specific customized context weighting profiling may be performed.
[0321] Those skilled in the art of the present disclosure will understand that information and signals may be represented using any various different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced in the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
[0322] Those skilled in the art will understand that the various exemplary logic blocks, modules, processors, means, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented by electronic hardware, various forms of programs or design code (referred to herein as software for convenience), or a combination of all such. To clearly illustrate this interoperability between hardware and software, various exemplary components, blocks, modules, circuits, and steps have been generally described above in relation to their functions. Whether such functions are implemented as hardware or software depends on the design constraints imposed on the specific application and the overall system. Those skilled in the art may implement the functions described in various ways for each specific application, but such implementation decisions should not be interpreted as being outside the scope of this disclosure.
[0323] The various embodiments presented herein may be implemented as methods, devices, or articles manufactured using standard programming and / or engineering techniques. The term "article manufactured" includes a computer program, a carrier, or a medium accessible from any computer-readable storage device. For example, computer-readable storage media include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips, etc.), optical discs (e.g., CDs, DVDs, etc.), smart cards, and flash memory devices (e.g., EEPROMs, cards, sticks, key drives, etc.). Additionally, the various storage media presented herein include one or more devices and / or other machine-readable media for storing information.
[0324] It should be understood that the specific order or hierarchy of steps in the presented processes is an example of exemplary approaches. It should be understood that the specific order or hierarchy of steps in the processes may be rearranged within the scope of this disclosure based on design priorities. The appended method claims provide elements of various steps in a sample order, but do not imply being limited to the specific order or hierarchy presented.
[0325] Description of the presented embodiments is provided so that a person skilled in the art may use or practice the present disclosure. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be applied to other embodiments without departing from the scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments presented herein, but should be interpreted in the broadest possible scope consistent with the principles and novel features presented herein.
Claims
1. As a system, At least one processor; and It includes at least one memory for storing instructions, information, or artificial intelligence models executed on at least one processor; and The instructions, information, or artificial intelligence model executed by at least one processor mentioned above are A data input unit that receives multimodal data including one-dimensional text representation data of a molecule, two-dimensional graph representation data of a molecule, and a natural language-based task command sequence; A graph processing unit comprising a hybrid graph encoder that extracts local and global features of a molecular structure from the above 2D graph representation data to generate a molecular graph embedding, and a crossmodal bridge that aligns the molecular graph embedding and a text embedding generated from the above 1D text representation data; and A foundation model comprising: a large-scale language model that is trained by receiving the above-mentioned aligned molecular graph embeddings and the above-mentioned text embeddings as inputs; System.
2. In claim 1, the hybrid graph encoder is, A graph encoder that captures the local structure of a molecular graph; and a graph sequence encoder that captures the global context of a molecular graph, comprising System.
3. In claim 1, the crossmodal bridge is, Extracting and summarizing information highly relevant to the text embeddings from the molecular graph embeddings using a learnable query, System.
4. In Claim 1, The above-described large-scale language model performs molecular-level tasks according to the above-described task instruction sequence and generates results, wherein the molecular-level tasks include at least one of predicting chemical reactions, predicting molecular properties, and generating natural language descriptions of molecular structures. System.
5. In Claim 1, It further includes a molecular structure description generation model that generates natural language descriptions of target molecular structures, and The above molecular structure description generation model includes: a rare substructure search unit that calculates a scarcity score for each substructure by analyzing the frequency of occurrence of substructures included in each of a plurality of molecules within a large-scale molecular database; and A molecular structure description generation unit that generates a final natural language description including a description of at least one sparse substructure sampled based on the sparse score among the substructures included in the target molecular structure; System.
6. In claim 5, the rare substructure search unit, Analyzing the frequency of occurrence of the above substructures using molecular fingerprints, System.
7. In Claim 6, The above molecular fingerprint is a MACCS (Molecular Access System) key, System.
8. In Claim 5, The above molecular structure description generation unit is, For each of the above-mentioned sparse substructures, search for pre-generated natural language description documents, and Integrating the above-searched natural language description documents into a single document using the above-described large-scale language model to generate the above-described final natural language description, System.
9. In Claim 5, The above molecular structure description generation model is, A substructure document generation unit that generates natural language description documents explaining the chemical characteristics and effects of each predefined substructure; further comprising System.
10. In computerized learning methods, A step of pre-training a hybrid graph encoder to receive 2D molecular graph data as input, predict the functional groups of the molecule, and restore the original 1D molecular text; A step of pre-training a crossmodal bridge that converts molecular graph embeddings generated by the hybrid graph encoder so that the large-scale language model can understand them, while keeping the weights of the pre-trained hybrid graph encoder and the large-scale language model frozen; and The method comprising the step of fine-tuning the entire foundation model, including the pre-trained hybrid graph encoder, the crossmodal bridge, and the large-scale language model; method.
11. In Claim 10, The above fine-tuning step is, Damages a portion of the one-dimensional molecular text data to induce the foundation model to rely more on the two-dimensional molecular graph data for learning, method.
12. In Claim 11, The above fine-tuning step is, Damages a part of the one-dimensional molecular text data by replacing some tokens in the token sequence of the one-dimensional molecular text data with random tokens, method.
13. In Claim 10, The above fine-tuning step is, Learning to maximize the probability of generating a result when the correct molecular graph is input and minimize the probability of generating a result when the incorrect molecular graph is input, using preference pairs composed of a correct molecular graph and an incorrect molecular graph in which the substructure of the correct molecular graph is modified. method.
14. In Claim 10, The above fine-tuning step is, Updating the foundation model using a total loss function that sums the loss function of a learning process that damages a part of the one-dimensional molecular text data by replacing some tokens in the token sequence of the one-dimensional molecular text data with random tokens, and the loss function of a learning process that maximizes the probability of generating a result when the correct molecular graph is input and minimizes the probability of generating a result when the incorrect molecular graph is input, using preference pairs composed of a correct molecular graph and an incorrect molecular graph in which the substructure of the correct molecular graph is modified. method.
15. As a computerized method, A step of inputting multimodal data including one-dimensional text representation data of a molecule, two-dimensional graph representation data, and a task command sequence; A step of generating molecular graph embeddings from the two-dimensional graph representation data using a hybrid graph encoder; A step of aligning the molecular graph embedding with the text embedding generated from the one-dimensional text phenotype data and the task instruction sequence using a crossmodal bridge; and A step comprising generating a result of a molecular unit task according to the task instruction sequence based on the aligned molecular graph embeddings and the text embeddings using a large-scale language model; method.
16. In Claim 15, The above result generation step is, If the above task command sequence directs the generation of a natural language description of a molecular structure, A step of identifying multiple substructures included in the target molecular structure to be analyzed; A step of sampling at least one sparse substructure based on the sparse score of each of the plurality of substructures identified above; and A step comprising: integrating descriptions of the sampled sparse substructures to generate a final natural language description of the target molecular structure; method.
17. In Claim 16, The above step of identifying substructures is, Identifying the plurality of substructures by calculating the molecular fingerprint of the above target molecular structure, method.
18. In Claim 16, The above scarcity score is, Calculated inversely proportional to the frequency of occurrence of the above substructure within a large-scale molecular database, method.
19. In Claim 16, The above final natural language description generation step is, A step of searching for pre-generated natural language description documents for each of the above-mentioned sampled sparse substructures; and A step comprising inputting the above-mentioned retrieved documents into a large-scale language model to integrate them into a single consistent document; method.
20. A server computing system equipped with the system according to claim 1; and A user computing device that transmits a request including multimodal data and a task instruction sequence to a server computing system via the user computing device, and receives a result generated by a large-scale language model of the server computing system; System.
21. An application-specific integrated circuit comprising a functional block including a memory in which information and instructions are stored and at least one processor that requests access to said memory, The memory stores instructions or information comprising: an operation of inputting multimodal data including one-dimensional text representation data of a molecule, two-dimensional graph representation data, and a task instruction sequence; an operation of generating a molecular graph embedding from the two-dimensional graph representation data using a hybrid graph encoder; an operation of aligning the molecular graph embedding with a text embedding generated from the one-dimensional text representation data and the task instruction sequence using a crossmodal bridge; and an operation of generating a result of a molecular unit task according to the task instruction sequence based on the aligned molecular graph embedding and the text embedding using a large-scale language model. Custom Integrated Circuit.