Natural language interface for interacting with different application programming interface systems and applications
By converting natural language queries into API calls using a system of multiple language models, the integration challenges of cloud API interaction are addressed, enabling seamless interaction and efficient integration between different cloud services while adapting to the unique differences and specifications of various APIs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NVIDIA CORP
- Filing Date
- 2025-12-25
- Publication Date
- 2026-06-30
AI Technical Summary
Existing cloud API interaction solutions are difficult to adapt to the unique differences and complexities of different cloud providers, resulting in integration difficulties, performance issues, and insufficient flexibility, and are unable to effectively handle various scenarios.
The system uses multiple language models to convert natural language queries into API calls, determines the best service through the planner node, generates structured interaction paths through the tool node, processes API calls using API specifications and language models, and converts responses into natural language through the response generation node.
It enables seamless interaction between different cloud services, ensuring performance and flexibility, and can efficiently integrate multiple cloud-based services, adapting to the unique differences and specifications of various APIs.
Figure CN122308993A_ABST
Description
Background Technology
[0001] Generally, Application Programming Interfaces (APIs) are used to interact with cloud delivery services and applications. APIs provide a standardized way for clients, applications, and / or other systems to communicate with the cloud resources they require. Because APIs typically follow common web standards (such as REST, GraphQL, etc.) and return data in predictable formats (such as JSON, XML, etc.), cloud delivery services can be accessed from a variety of applications and / or platforms. For example, APIs can allow developers to perform various tasks such as data storage and / or retrieval, configuring virtual machines, database management, and / or resource scaling, all potentially without directly managing the underlying infrastructure of the cloud delivery service. Furthermore, by sending requests to these APIs, applications can utilize cloud delivery resources in real time, allowing them to benefit from the flexibility, scalability, and cost-effectiveness offered by cloud computing.
[0002] However, interacting with a variety of different cloud-based services is often challenging. For example, different cloud providers may offer different sets of APIs with unique authentication mechanisms, data formats, and protocols. This diversity makes it difficult for developers to integrate multiple cloud-based services into a single, cohesive system, especially when each API may have its own nuances and limitations (e.g., authentication mechanisms, rate limiting, data formats, versioning, error handling, latency, etc.). While some solutions attempt to address the challenges of interacting with various cloud APIs, existing solutions often fall short in key areas. For instance, some existing solutions may rely on a single large language model (LLM) call to manage the complexity of API interactions. However, relying on a single LLM call can lead to performance issues, including but not limited to low accuracy, hallucination, high latency, and non-reproducibility. Furthermore, customization is often limited in these existing systems because they tend to be "one-size-fits-all" solutions that may not adapt well to specific needs or complex workflows. Therefore, when dealing with real-world APIs (which can be complex and have stringent security, rate limits, and / or other requirements), most existing solutions may not be robust or flexible enough to effectively handle a variety of scenarios, making it difficult to create reliable, scalable integrations between different cloud-based services.
Summary of the Invention
[0003] Embodiments of this disclosure relate to a natural language interface for interacting with various application programming interface (API) systems and applications. The disclosed systems and methods may use one or more language models to translate natural language queries into API calls for interaction with various cloud-based services or applications.
[0004] For example, based at least on received input data representing a natural language query, the systems of this disclosure can perform multiple language model calls to (among other things) determine one or more optimal services (e.g., cloud-based services or applications) to be called in response to the query, and generate one or more API calls to be sent to one or more APIs of one or more services (e.g., cloud APIs) (this may include identifying one or more optimal API endpoints to use and / or properly formatting one or more API calls to include necessary parameters and / or other information). In some examples, to determine the optimal service and / or generate one or more API calls, one or more systems may apply multiple API specifications associated with one or more APIs for one or more services to one or more language models. For example, one or more systems may simultaneously apply API specifications during training and / or when making language model calls to determine the optimal service and / or generate one or more API calls. Furthermore, in some examples, one or more systems may use one or more language models to convert one or more responses to one or more API calls back to natural language responses to the query (e.g., from JSON or XML to natural language).
[0005] Compared to traditional systems, in some embodiments, the system of this disclosure is able to seamlessly interact with various services and applications by adapting to the unique differences, nuances, and specifications of their respective APIs. For example, by using language models to process API specifications and natural language queries together, the system of this disclosure can generate API calls for a variety of different APIs, enabling developers and / or other customers to integrate multiple cloud-based services into a single, cohesive system regardless of API diversity. Furthermore, by executing multiple language model calls for each query to decompose the solution into multiple smaller tasks (e.g., service classification, API classification, parameter population, API call execution, and response generation), the system of this disclosure ensures better control and easier execution of specific tasks. Moreover, by customizing different parts of the pipeline with configuration files and adding customizable rules for each stage, the system of this disclosure ensures reliable performance across various systems.
Brief Description of the Drawings
[0006] The following describes in detail, with reference to the accompanying drawings, the system and method for using a natural language interface to interact with different application programming interface (API) systems and applications, wherein:
[0007] Figure 1 is a data flow diagram illustrating an example of a process, according to some embodiments of the present disclosure, that can be executed by an application to provide a natural language interface for interacting with different APIs of different services;
[0008] Figure 2 is a block diagram illustrating example details associated with an agent, according to some embodiments of the present disclosure;
[0009] Figure 3 is a data flow diagram illustrating an example of a process, according to some embodiments of the present disclosure, performed at least in part by a planner node to select one or more optimal services in response to a query;
[0010] Figure 4 is a data flow diagram illustrating an example of a process performed by an API tool to convert a query into an API call, according to some embodiments of the present disclosure;
[0011] Figure 5 is a data flow diagram illustrating an example of a process, according to some embodiments of the present disclosure, performed at least in part by a response node to convert a response to an API call into output data that at least represents a natural language message;
[0012] Figure 6 is a block diagram illustrating an example of a system capable of performing one or more processes described herein, according to some embodiments of the present disclosure;
[0013] Figure 7 is a flowchart illustrating an example of a method, according to some embodiments of the present disclosure, that can be executed to provide a natural language interface for interacting with an API of a service;
[0014] Figure 8 is a flowchart illustrating an example of a method for converting a natural language query into an API call and converting a response to the API call back into a natural language message, according to some embodiments of the present disclosure;
[0015] Figure 9A is a block diagram of an example generative language model system suitable for implementing at least some embodiments of the present disclosure;
[0016] Figure 9B is a block diagram of an example generative language model including a transformer encoder-decoder, suitable for implementing at least some embodiments of the present disclosure;
[0017] Figure 9C is a block diagram of an example generative language model including a decoder-only transformer architecture, suitable for implementing at least some embodiments of the present disclosure;
[0018] Figure 10 is a block diagram of an example computing device suitable for implementing at least some embodiments of the present disclosure; and
[0019] Figure 11 is a block diagram of an example data center suitable for implementing at least some embodiments of the present disclosure.
Detailed Description
[0020] Systems and methods related to natural language interfaces for interacting with various application programming interface (API) systems and applications are disclosed. For example, one or more systems may receive input data representing queries (e.g., natural language queries) from a client device. In some examples, the client device may be executing an instance of a user interface, and the input data may be received via the user interface executing on the client device. For example, a user of the client device may use the user interface to input queries (e.g., by speaking or saying the query, by typing the query, writing the query, etc.). In some examples, the input data may include text data representing the query. Furthermore, in some cases, the input data may include multimodal data (e.g., a combination of two or more of text data, audio data, image data, video data, etc.). As described herein, in various examples, a query may include a request for information (e.g., “What’s the weather like today?”, “How many people are in the building?”, etc.), a request to control a machine or device or perform an action (e.g., “Turn up the thermostat”, “Lock the door”, etc.), or one or more other requests or queries.
[0021] In some cases, one or more systems may contain multiple AI agents (e.g., LLM-driven agents or any other type of agent), which may contain multiple nodes. These nodes may be configured to perform various functions on behalf of one or more systems and / or agents to generate responses to queries. For example, one or more systems may contain planner nodes, tool nodes, response generation nodes, and / or any other nodes. These nodes may incorporate or utilize one or more language models to understand context, generate human-like responses, process natural language, and / or integrate with other tools to achieve comprehensive functionality.
[0022] For example, a planner node can use one or more language models to determine the best service or application (e.g., a cloud-based service) to invoke to generate a response to a query, and the planner node can map a query (or a subquery within a query) to the best service. In some examples, a planner node can use one or more language models to process the input data representing the query and information related to one or more services. Such information can include, but is not limited to, information describing the capabilities, functions, tools, and / or uses of a service. In this way, a planner node can use one or more language models to determine the best service that should be invoked to respond to a query. For example, if the query asks for the weather at a location, the planner node can use one or more language models to determine that a weather service, which is likely to know the weather conditions at that location, should be selected and invoked.
[0023] In some examples, a query may contain multiple queries, and the planner node can use one or more language models to break down the query into simpler queries (e.g., subqueries) and identify the best service to route to each different query. For example, if a query contains a first query asking about the weather at a location and a second query asking about the time of a football match, the planner node can use a language model to break down the query into the first and second queries and determine that the best service for the first query is the weather service, while the best service for the second query is the sports service and / or the television service.
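The decomposition and routing step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the service catalog, the prompt wording, the JSON reply format, and the names `SERVICES`, `build_planner_prompt`, `plan`, and `fake_llm` are all assumptions, and `fake_llm` stands in for a real language model call.

```python
import json
from typing import Callable, Dict, List

# Hypothetical service catalog; names and descriptions are illustrative only.
SERVICES: Dict[str, str] = {
    "weather": "Current conditions and forecasts for a location.",
    "sports": "Schedules and scores for sporting events.",
}


def build_planner_prompt(query: str, services: Dict[str, str]) -> str:
    """Combine the query with the service descriptions for one model call."""
    catalog = "\n".join(f"- {name}: {desc}" for name, desc in services.items())
    return (
        "Split the query into subqueries and choose one service for each.\n"
        f"Services:\n{catalog}\n"
        f"Query: {query}\n"
        'Reply with JSON: [{"subquery": "...", "service": "..."}]'
    )


def plan(query: str, llm: Callable[[str], str],
         services: Dict[str, str] = SERVICES) -> List[Dict[str, str]]:
    """Return subquery-to-service mappings, keeping only known services."""
    mappings = json.loads(llm(build_planner_prompt(query, services)))
    return [m for m in mappings if m.get("service") in services]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real language model; returns a canned plan."""
    return json.dumps([
        {"subquery": "What is the weather in Boise?", "service": "weather"},
        {"subquery": "When does the football match start?", "service": "sports"},
        {"subquery": "Book a table", "service": "restaurants"},  # not in catalog
    ])


mappings = plan("What is the weather in Boise and when does the match start?", fake_llm)
```

Filtering the model's reply against the catalog is one simple way to discard a mapping to a service that does not exist; the source does not say how such cases are handled.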
[0024] In some examples, services may include cloud-delivered services, on-premises services, or any other services delivered or hosted on any type of infrastructure. In some examples, services may include, but are not limited to: weather services (e.g., weather forecasting services, environmental / climate monitoring services, etc.), analytics services (e.g., website or user analytics services, business intelligence and data visualization services, performance and user behavior tracking services, etc.), payment processing services (e.g., online payment gateways, e-commerce checkout solutions, digital wallets and mobile payment systems, etc.), cloud storage services (e.g., file storage and sharing, backup and disaster recovery, data archiving services, etc.), content delivery network services, authentication and identity services (e.g., SSO solutions, multi-factor authentication services, etc.), email and messaging services, cloud computing services (e.g., Software as a Service (SaaS), Platform as a Service (PaaS), etc.), machine learning and AI services (e.g., language processing services, image and video recognition services, etc.), geolocation and mapping services, video and streaming services, social media services, customer support and help desk services, e-commerce services, database services, cybersecurity services, and telecommunications services. In various examples, these services may include APIs based on Representational State Transfer (REST), which follow the design principles of the REST architectural style, allowing services to exchange data with each other and / or with other systems via web URLs and / or other forms using standard HTTP methods such as GET, POST, PUT, and / or DELETE. In at least one example, these services may include an autonomous machine deployment or management service, and API calls can be submitted to this service to cause one or more machines (e.g., a group of autonomous machines) to perform one or more operations. 
For example, queries such as requesting a ride, requesting assistance, or requesting the delivery of one or more items can be submitted, and API calls can be generated based on the queries and submitted to such an autonomous machine deployment service, which can dispatch machines (e.g., autonomous machines or vehicles) to specific locations (e.g., the location indicated in the query or the user's current location).
[0025] As described herein, in some cases, one or more systems may also contain tool nodes, and these tool nodes can manage multiple AI-based agent tools. In some examples, each of the multiple tools may correspond to or be associated with a specific one of the services, thereby creating a structured interaction path for executing API calls. Therefore, although the planner node has been described above as mapping queries to the optimal service, the planner node can equivalently be described as mapping queries to one or more tools corresponding to one or more services. In some examples, the tool node may make multiple calls to one or more language models to perform query decomposition (if necessary), determine the appropriate API endpoint for the API call, and / or generate the API call itself (e.g., populate parameters and / or other information), among other things.
[0026] For example, tool nodes and / or tools can use one or more language models to perform query decomposition. As an example, if the query is "What's the weather like in California? What's the weather like in Idaho?", tool nodes and / or tools can use one or more language models to decompose the query into two separate queries: the first query asks "What's the weather like in California?" and the second query asks "What's the weather like in Idaho?". In this way, by decomposing the query, tool nodes and / or tools can generate API calls for each of these queries and submit these API calls separately to the backend service. Therefore, while the planner node can decompose the query to determine the best service to use in response, the tool nodes and / or tools can decompose the query to determine which API calls should be made for the best service selected by the planner node.
[0027] In some examples, tool nodes and / or tools may use one or more language models (e.g., by invoking one or more language models) to classify each query into the API to be invoked (or a chain of APIs for complex queries). For example, based on the use of one or more language models to process the query and the API specification associated with the service, tool nodes and / or tools may determine which API endpoint(s) to use to make API calls to the service. In some examples, one or more language models may select one or more API endpoints based on the specific functionality provided by the one or more API endpoints and whether that functionality matches the operations required to be performed in response to a query (such as retrieving data, updating resources, creating new entries, etc.). This selection process may involve one or more language models consulting the API specification (e.g., through augmentation) to understand the available endpoints, their HTTP methods (e.g., GET, POST, PUT, DELETE, etc.), required parameters, and response format. For example, the API specification associated with the service may indicate one or more endpoints (e.g., URL addresses, etc.) corresponding to the API used for the service, the functionality of the one or more endpoints, the parameters to be included in the API call payload, the format of the API request or response, or any other information associated with the API of the service.
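As a rough sketch of the endpoint selection described above, the example below renders a simplified API specification into a prompt and checks the model's choice against that specification. The specification layout, the host `weather.example.com`, the reply format, and the names `API_SPEC`, `describe_endpoints`, and `select_endpoint` are hypothetical; a real deployment would more likely consume a full specification document and call an actual model.

```python
from typing import Callable, Dict

# Hypothetical, simplified API specification for an imaginary weather service.
API_SPEC: Dict = {
    "base_url": "https://weather.example.com",
    "endpoints": [
        {"method": "GET", "path": "/v1/current",
         "description": "Current conditions for a location.",
         "required": ["location"]},
        {"method": "GET", "path": "/v1/forecast",
         "description": "Daily forecast for a location.",
         "required": ["location", "days"]},
    ],
}


def describe_endpoints(spec: Dict) -> str:
    """Render each endpoint as one line the model can choose from."""
    return "\n".join(
        f'{e["method"]} {e["path"]}: {e["description"]} '
        f'(required: {", ".join(e["required"])})'
        for e in spec["endpoints"]
    )


def select_endpoint(query: str, spec: Dict, llm: Callable[[str], str]) -> Dict:
    """Ask the model for "<METHOD> <path>" and check it against the spec."""
    prompt = (
        "Choose the endpoint that answers the query. "
        "Reply with '<METHOD> <path>' only.\n"
        f"{describe_endpoints(spec)}\nQuery: {query}"
    )
    choice = llm(prompt).strip()
    for endpoint in spec["endpoints"]:
        if f'{endpoint["method"]} {endpoint["path"]}' == choice:
            return endpoint
    raise ValueError(f"model chose an endpoint not in the spec: {choice!r}")


endpoint = select_endpoint(
    "Will it rain in Idaho over the next three days?",
    API_SPEC,
    llm=lambda prompt: "GET /v1/forecast\n",  # stand-in for a real model
)
```

Rejecting any choice that is absent from the specification keeps a hallucinated endpoint from reaching the backend service.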
[0028] In some examples, tool nodes and / or tools may also invoke one or more language models to generate API calls. For instance, a tool node and / or tool may apply at least a portion of the input data and the API specification associated with the service to one or more language models. One or more language models may process this input and generate text data representing the API call. That is, one or more language models may analyze the query and API specification to generate a payload for each API call, at least by specifying the endpoint, using HTTP methods (e.g., GET, POST, PUT, DELETE), and populating the parameters and / or other data payload for the API call. The tool node and / or tool can then execute the API call by sending the API call to the API endpoint for the service.
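The call-generation step might look like the sketch below, which assumes the model returns the parameters as a JSON object. The endpoint entry, `BASE_URL`, and `build_api_call` are illustrative names, and the lambda stands in for a language model. Only Python standard-library calls are used, and the request is constructed but never sent.

```python
import json
from typing import Callable, Dict
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint entry taken from an imaginary API specification.
BASE_URL = "https://weather.example.com"
ENDPOINT = {"method": "GET", "path": "/v1/forecast", "required": ["location", "days"]}


def build_api_call(query: str, endpoint: Dict, llm: Callable[[str], str]) -> Request:
    """Have the model fill in parameters, validate them, and build the request."""
    prompt = (
        f"Fill in the parameters {endpoint['required']} for the query below. "
        f"Reply with a JSON object only.\nQuery: {query}"
    )
    params = json.loads(llm(prompt))
    missing = [name for name in endpoint["required"] if name not in params]
    if missing:
        raise ValueError(f"model omitted required parameters: {missing}")
    url = f"{BASE_URL}{endpoint['path']}"
    if endpoint["method"] == "GET":
        # GET parameters travel in the query string.
        return Request(f"{url}?{urlencode(params)}", method="GET")
    # Other methods carry the parameters as a JSON body.
    return Request(url, data=json.dumps(params).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method=endpoint["method"])


request = build_api_call(
    "Will it rain in Boise over the next three days?",
    ENDPOINT,
    llm=lambda prompt: '{"location": "Boise", "days": 3}',  # stand-in model
)
# The request is only constructed here; urllib.request.urlopen(request) would send it.
```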
[0029] In some examples, at least based on executing an API call, tool nodes and / or tools can receive responses to API calls from backend services and / or the APIs of backend services. In some cases, responses to API calls may be in non-natural language formats. For example, a response may contain text data representing a JSON, XML, or any other structured response. Responses to API calls can be forwarded to response generation nodes of one or more systems. Response generation nodes can use one or more language models to convert API responses from structured formats to natural language formats. For example, a response generation node can use one or more language models to process the query, API response, and API specification as input. Based on processing these inputs using one or more language models, the language models can generate output data that at least represents a natural language response to the query. For example, the API specification may outline how the data returned by the API will be structured and formatted (e.g., whether the data is returned as a JSON object, array, or XML document, and the key-value hierarchy, the type of data associated with each key (e.g., string, number, boolean value), etc.). In some examples, the output data may represent a multimodal response to the query. For example, the output data may contain a combination of two or more of the following: text data, audio data, image data, video data, and / or other data. As an example, the output data may contain an image or video and text data representing a natural language description of what is depicted in the image or video.
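A minimal sketch of the response generation step follows, assuming the three inputs named above (the query, the API response, and the response format from the API specification) are combined into a single prompt. The schema, the fallback message, and the function names are assumptions, and the lambda stands in for a language model.

```python
import json
from typing import Callable, Dict


def build_response_prompt(query: str, api_response: Dict, response_schema: Dict) -> str:
    """Combine the query, the structured response, and its schema into one prompt."""
    return (
        "Answer the query in plain language using only the API response.\n"
        f"Query: {query}\n"
        f"Response schema: {json.dumps(response_schema)}\n"
        f"API response: {json.dumps(api_response)}"
    )


def generate_response(query: str, raw_response: str, response_schema: Dict,
                      llm: Callable[[str], str]) -> str:
    """Convert a structured API response into a natural language message."""
    try:
        api_response = json.loads(raw_response)
    except json.JSONDecodeError:
        # Assumed fallback; the source does not describe error handling here.
        return "The service returned a response that could not be read."
    return llm(build_response_prompt(query, api_response, response_schema)).strip()


SCHEMA = {"temperature_c": "number", "conditions": "string"}  # hypothetical
answer = generate_response(
    "What's the weather like in Boise?",
    '{"temperature_c": 21, "conditions": "sunny"}',
    SCHEMA,
    llm=lambda prompt: " It is sunny and 21 degrees Celsius in Boise. ",
)
```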
[0030] One or more systems may then send output data to a client device. In some examples, the output data may be presented by the client device via an instance of a user interface executed on the client device. In other words, by sending output data to the client device, one or more systems may enable the client device and / or user interface to present a response to the query. In some examples, making the response presentable may include, but is not limited to, outputting audio data of the response using one or more speakers of the client device, outputting visual data of the response (e.g., image data, text data, analytics data, etc.) using the display of the client device, or outputting any other data using any other component or method of the client device.
[0031] In some examples, AI-based agent tools can include visualization tools. In some cases, visualization tools (or services) can be configured to generate one or more charts or graphs based on input queries. In some cases, visualization tools can operate in two phases. The first phase (e.g., data transformation) can efficiently convert API JSON data into (X,Y) values. Taking the user's query and API JSON as input, this step can use an LLM to generate the (X,Y) data points needed for plotting. The second phase (e.g., code generation) can efficiently convert the (X,Y) values into chart code. For example, the visualization tool can use another LLM call to determine the appropriate chart type and generate the corresponding code for visualization. Once the chart code is generated, the response node can combine it with a natural language interpretation. As an example, when a user enters a query such as "Please plot the number of boxes we were able to pack at conveyor belt A over the past week," the planner node can determine to call the visualization tool, which can then generate output such as "This is a chart of the box counts I found over the past week: [chart]" (where the graph (e.g., as an image, etc.) is inserted in the position of "[chart]"). The planner node can then send the combined output to the UI, which displays the natural language response and rendered charts in a chat window, providing a seamless and information-rich user experience.
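The two phases of the visualization tool can be illustrated with deterministic stand-ins. In the description above both phases are LLM calls; here `to_points` and `to_chart_code` are hypothetical functions that show only the data shapes passed between the phases. The record keys, the ten-point threshold for choosing a bar chart, and the use of matplotlib in the generated code are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[str, float]


def to_points(api_json: List[Dict], x_key: str, y_key: str) -> List[Point]:
    """Phase 1 (data transformation): extract (X, Y) pairs from API JSON records."""
    return [(str(record[x_key]), float(record[y_key])) for record in api_json]


def to_chart_code(points: List[Point], title: str) -> str:
    """Phase 2 (code generation): emit plotting code for the (X, Y) values."""
    # Assumed rule of thumb: few points read well as bars, many as a line.
    kind = "bar" if len(points) <= 10 else "plot"
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (
        "import matplotlib.pyplot as plt\n"
        f"plt.{kind}({xs!r}, {ys!r})\n"
        f"plt.title({title!r})\n"
        "plt.savefig('chart.png')\n"
    )


# Hypothetical API JSON for boxes packed per day at a conveyor belt.
records = [{"day": "Mon", "boxes": 120}, {"day": "Tue", "boxes": 135}]
points = to_points(records, "day", "boxes")
chart_code = to_chart_code(points, "Boxes packed at conveyor belt A")
```

The generated string is chart code only; rendering it and inserting the image at the "[chart]" position would be left to the response node and the user interface.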
[0032] Furthermore, in some examples, AI-based agent tools can include predictive tools. In some cases, predictive tools (or services) can be configured to make predictions based on historical data or otherwise forecast future trends. In some cases, a predictive tool can perform two main functions: data transformation and predictive modeling. During data transformation, the predictive tool can convert API JSON data into (X,Y) values. For example, using the user's query and the API JSON as input, this step can use an LLM to generate relevant (X,Y) data points for further analysis. During predictive modeling, the predictive tool can use the (X,Y) values to predict future trends. By applying a time series forecasting model (e.g., an algorithmic model, a machine learning model, etc.), this step can calculate the value Y' for a given value X = X', thereby answering the part of the user's query that asks about future data. As an example, when a user enters a query such as "Based on the number of orders we received in the past 3 months, how many orders do we expect in the next few weeks?", the planner node can determine to invoke the predictive tool, which can ultimately produce an output such as "Based on previous patterns, we expect approximately 40 orders this week."
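As one possible instance of the predictive modeling function, the sketch below fits a least-squares line to the (X,Y) values and evaluates it at X'. The description above allows any time series forecasting model, so the choice of a linear trend, the order counts, and the names `fit_line` and `forecast` are assumptions made for illustration.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct X values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(points: List[Tuple[float, float]], x_new: float) -> float:
    """Predict Y' for a given X = X' from the fitted linear trend."""
    slope, intercept = fit_line(points)
    return slope * x_new + intercept


# Hypothetical weekly order counts as (week number, orders).
history = [(1, 30), (2, 32), (3, 34), (4, 36)]
expected_orders = forecast(history, 6)  # 40.0 for this perfectly linear history
```

With real order data the trend would rarely be exactly linear, and a seasonal or learned model may fit better; the linear fit is used here only because it is short and easy to check by hand.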
[0033] In some examples, one or more systems may determine to invoke multiple different tools and combine multiple different outputs from these tools. For example, consider the same example query above, “Based on the number of orders we received in the past 3 months, how many orders do we expect in the next few weeks?”. Based on this query, the planner node may decide to invoke both the visualization tool and the forecasting tool to provide a more reliable response. Based on the responses from the visualization and forecasting tools, the response node can generate relevant natural language output that enhances the chart code obtained from the visualization tool. For example, the response might include something like, “Based on these previous patterns [chart], we expect approximately 40 orders this week” (where the chart (e.g., as an image, etc.) is inserted in the “[chart]” position).
[0034] The systems and methods described herein can be used for a variety of purposes, including but not limited to: machine control, machine motion, machine driving, synthetic data generation, model training, perception, augmented reality, virtual reality, mixed reality, robotics, security and surveillance, simulation and digital twins, autonomous or semi-autonomous machine applications, deep learning, environmental simulation, object or participant simulation and / or digital twins, data center processing, conversational AI, light transport simulation (e.g., ray tracing, path tracing, etc.), collaborative content creation of 3D assets, cloud computing, video management, operations center supervision and control, and / or any other suitable application.
[0035] The disclosed embodiments can be included in a variety of different systems, such as automotive systems (e.g., control systems for autonomous or semi-autonomous machines, perception systems for autonomous or semi-autonomous machines), systems implemented using robots, aerial systems, medical systems, marine systems, smart area monitoring systems, systems for performing deep learning operations, systems for performing simulation operations, systems for performing digital twin operations, systems implemented using edge devices, systems implementing language models (e.g., large language models (LLM), small language models (SLM), visual language models (VLM), and / or multimodal language models), systems implementing one or more multimodal language models, systems using or deploying one or more inference microservices, systems including deploying one or more machine learning models and OS-level virtualization packages (e.g., containers) in services or microservices, systems containing one or more virtual machines (VMs), systems for performing synthetic data generation operations, systems implemented at least partially in a data center, systems for performing conversational AI operations, systems for performing light transport simulations, systems for performing collaborative content creation of 3D assets, systems for performing generative AI operations, systems implemented at least partially using cloud computing resources, and / or other types of systems.
[0036] Referring to Figure 1, Figure 1 is a data flow diagram illustrating an example of a process 100, executable by an application, to provide a natural language interface for interacting with an API of a cloud-based service, according to some embodiments of this disclosure. It should be understood that this and other arrangements described herein are merely illustrative examples. Other arrangements and elements (e.g., machines, interfaces, functions, sequences, functional groupings, etc.) may be used in addition to or instead of the arrangements and elements shown, and certain elements may be omitted entirely. Furthermore, many of the elements described herein are functional entities that may be implemented as discrete or distributed components, or in combination with other components, and may be implemented in any suitable combination and location. The various functions performed by the entities described herein can be performed by hardware, firmware, and / or software. For example, various functions may be performed using one or more processors that execute instructions stored in one or more memories. For example, in some embodiments, the systems and methods described herein may be implemented using one or more generative language models (e.g., as described with respect to Figures 9A-9C), one or more computing devices or components thereof (e.g., as described with respect to Figure 10), and / or one or more data centers or components thereof (e.g., as described with respect to Figure 11).
[0037] Process 100 can be implemented using (among additional or alternative components) client device 102, application 104 (which may include planner node 106, tool node 108, one or more API tools 110(1)-110(N) (where “N” can represent any number) and response node 114), and one or more services 112(1)-112(N). As a brief overview of process 100, application 104 may receive input data 116 from client device 102, and application 104 may use planner node 106 to process input data 116 and determine one or more queries in input data 116 and one or more mappings 118 of those queries to one or more API tools 110. Tool node 108 may use one or more mappings 118 to map one or more queries from input data 116 to the corresponding API tool in one or more API tools 110. One or more API tools 110 may generate and make one or more API calls 120 to one or more services 112, and one or more services 112 may generate one or more API responses 122 and send the responses back to one or more API tools 110. These API responses 122 may be sent back to planner node 106, and planner node 106 may use the responses to determine whether to call more tools or use response node 114 (also referred to herein as the "response generation node") to generate output data 124, at least based on using a language model to convert the one or more API responses 122 into a natural language description or other message. Output data 124 may be sent to client device 102, and client device 102 may use output data 124 to render a response to a query contained in input data 116.
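The control flow of process 100 can be sketched as a loop in which the planner either requests more tool calls or hands over to the response node. The function `run_process`, the callables passed to it, and the `max_rounds` cap are hypothetical; the planner, the tool, and the response node are replaced by simple stand-ins that would each wrap language model calls in practice.

```python
from typing import Callable, Dict, List, Tuple

Mapping = Tuple[str, str]  # (subquery, name of the API tool to call)


def run_process(
    query: str,
    planner: Callable[[str, List[str]], List[Mapping]],
    tools: Dict[str, Callable[[str], str]],
    respond: Callable[[str, List[str]], str],
    max_rounds: int = 3,  # assumed safety cap, not taken from the source
) -> str:
    """Route the query through planner, tools, and response generation."""
    api_responses: List[str] = []
    for _ in range(max_rounds):
        # The planner sees earlier API responses and may request more tool calls.
        mappings = planner(query, api_responses)
        if not mappings:
            break  # nothing more to call; hand over to the response node
        for subquery, tool_name in mappings:
            api_responses.append(tools[tool_name](subquery))
    return respond(query, api_responses)


# Stand-ins for the planner node, one API tool, and the response node.
def planner(query: str, api_responses: List[str]) -> List[Mapping]:
    return [] if api_responses else [(query, "weather")]


tools = {"weather": lambda subquery: '{"conditions": "sunny"}'}


def respond(query: str, api_responses: List[str]) -> str:
    return f"Found {len(api_responses)} API response(s): " + "; ".join(api_responses)


output = run_process("What's the weather like today?", planner, tools, respond)
```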
[0038] In some examples, client device 102 may include any type of computing device, such as a desktop computer, laptop computer, server computer, smartphone, tablet computer, or any other computing device. Client device 102 may execute an instance of a user interface and may receive input data 116 as input from a user of client device 102 via that user interface. For example, a user of client device 102 may input a query using the user interface (e.g., by speaking or saying the query, by typing the query, by writing the query, by expressing the query using sign language, etc.). Therefore, in some cases, input data 116 may contain multimodal data representing the query. For example, input data 116 may contain text data, audio data, video data, image data, and / or combinations of other data representing the query. As described herein, in various examples, a query may include a request for information (e.g., “What’s the weather like today?”, “How many people are in the building?”, etc.), a request to control a machine or device or perform an action (e.g., “Turn up the thermostat”, “Lock the door”, etc.), or any other one or more requests or one or more queries.
[0039] As shown in the figure, application 104 (in some examples, it may represent an agent (e.g., an "LLM agent") or a group of agents) may contain multiple nodes. In some cases, each node (e.g., planner node 106, tool node 108, response node 114, etc.) may be configured to perform various different functions on behalf of application 104 to generate responses to one or more queries contained in input data 116. For example, each node may be an AI-driven system designed to autonomously perform tasks by leveraging one or more language models and / or other machine learning models, along with various integrated tools and / or resources. An example node may include a complex language model capable of understanding and generating human-like text, enabling it to engage in conversation, answer questions, or generate content. Nodes / agents may also connect to external tools, such as one or more API tools 110(1)-110(N) for data retrieval, web search functions, or databases, thereby extending their functionality beyond simple text processing. These combined elements enable application 104 to perform a variety of complex tasks effectively and efficiently.
[0040] For example, Figure 2 is a block diagram illustrating example details 200 associated with agent 202, according to some embodiments of the present disclosure. In various examples, agent 202 may correspond to the example application 104 of Figure 1. As shown in the example of Figure 2, agent 202 may include: one or more nodes 204, which may correspond to any of the planner node 106, tool node 108, and / or response node 114 of the example of Figure 1; memory 206; one or more models 208; and one or more tools 210. Although shown separately in the example of Figure 2, one or more models 208 may run inside one or more nodes 204 or be executed using one or more nodes 204.
[0041] In some cases, memory 206 can be used as a repository for the internal records of agent 202 and / or for interactions between the agent and other agents, users, clients, APIs, services, etc. Memory 206 may contain short-term and / or long-term memory. In some examples, short-term memory can be used as a ledger of the actions and thoughts processed by agent 202 in handling a specific query or task, essentially capturing the agent's "thoughts." In contrast, long-term memory can be used as a log recording ongoing interactions and events between agent 202 and other agents and / or users, including conversation history that can last for weeks or months.
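As a rough illustration of the short-term and long-term split described above, the following Python sketch models memory 206 as two lists. The class name `AgentMemory` and its methods are hypothetical and are not taken from the disclosure; a real agent would likely persist long-term memory outside the running process.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentMemory:
    """Hypothetical stand-in for memory 206 (names are assumptions)."""

    # Ledger of actions and thoughts for the task currently being handled.
    short_term: List[str] = field(default_factory=list)
    # Log of completed interactions kept across tasks.
    long_term: List[str] = field(default_factory=list)

    def note(self, thought: str) -> None:
        """Record one step taken while handling the current query."""
        self.short_term.append(thought)

    def finish_task(self, summary: str) -> None:
        """Archive a summary of the task and clear the per-task ledger."""
        self.long_term.append(summary)
        self.short_term.clear()


memory = AgentMemory()
memory.note("user asked about the weather in California")
memory.note("selected the weather tool")
memory.finish_task("answered a weather query")
```

After `finish_task`, the per-task ledger is empty while the summary remains in the long-term log.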
[0042] As described herein, one or more models 208 may contain one or more language models, which serve as the core engine for understanding and generating human-like text. One or more models 208 can process input by analyzing the context and intent behind the query and leverage extensive training on a variety of text data to produce coherent and context-sensitive responses. By utilizing advanced algorithms found in transformer architectures, one or more models 208 can capture subtle meanings and relationships between words, enabling them to handle complex language tasks such as dialogue, summarization, and translation. Essentially, one or more models 208 enable agent 202 to engage in meaningful interactions, adapt to different contexts, and provide informative answers, while continuously learning from its interactions to improve future performance. In some examples, agent 202 may provide service specifications and / or API specifications as input to one or more models 208. For example, agent 202 may be configured to supply these specifications to one or more models 208 together with each request or call submitted to one or more models 208. In this way, agent 202 does not need to explicitly train one or more models 208 to generate responses in a specific format, but can instead “show” one or more models 208 examples of the output it expects to receive. For example, by sending an API specification along with a query to one or more models 208, one or more models 208 can use the API specification to determine how to correctly generate API calls in response to the query.
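One way to picture sending an API specification along with a query is as plain prompt construction. The sketch below is only an assumption about how such a prompt could be assembled; the specification fields, the instruction wording, and the function name `build_prompt` are all hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical API specification; the fields are illustrative only.
WEATHER_SPEC: Dict[str, Any] = {
    "endpoint": "/forecast",
    "method": "GET",
    "parameters": {"city": "string", "days": "integer"},
}


def build_prompt(api_spec: Dict[str, Any], query: str) -> str:
    """Place the API specification next to the query in one prompt string."""
    return (
        "Use the API specification below to produce an API call.\n"
        f"API specification:\n{json.dumps(api_spec, indent=2)}\n"
        f"Query: {query}\n"
        "Answer with JSON containing 'method', 'endpoint', and 'parameters'."
    )


prompt = build_prompt(WEATHER_SPEC, "What's the weather like in California?")
```

Because the specification travels with every request, the expected output format can change by editing the specification rather than by retraining a model.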
[0043] While many of the examples described herein relate to the use of language models, particularly large language models, this is not intended to be limiting. For example, and without limitation, any of the various machine learning models and / or neural networks described herein can include any type of machine learning model, such as one or more machine learning models using linear regression, logistic regression, decision trees, support vector machines (SVM), Naive Bayes, k-nearest neighbors (KNN), K-means clustering, random forests, dimensionality reduction algorithms, gradient boosting algorithms, neural networks (e.g., autoencoder neural networks, artificial neural networks (ANN), convolutional neural networks (CNN), recurrent neural networks (RNN), perceptrons, long short-term memory (LSTM) networks, multilayer perceptron (MLP) networks, deep stacked networks (DSN), generative pre-trained transformer (GPT) models or networks, feedforward networks, radial basis function ANNs, self-organizing maps (SOMs), Kohonen maps, Hopfield networks, Boltzmann machines, deep belief neural networks, deconvolutional neural networks, generative adversarial networks (GANs), liquid state machines, modular neural networks, sequence-to-sequence models, networks using transformer architectures, diffusion models (e.g., diffusion probability models, score-based generative models, etc.), neural radiance field (NeRF) models, models with encoder-only architectures, models with decoder-only architectures, models with encoder-decoder architectures, generative machine learning models, language models, large language models (LLMs), small language models (SLMs), visual language models (VLMs), multimodal language models (MMLMs), etc.), and / or other types of machine learning models.
[0044] One or more tools 210 may represent or contain a defined executable workflow that enables agent 202 to perform a variety of tasks efficiently. In some examples, the one or more tools 210 may contain specialized third-party APIs designed to enhance the capabilities of agent 202. For example, one or more tools 210 of agent 202 may contain a retrieval-augmented generation (RAG) pipeline for providing context-aware responses. Furthermore, agent 202 may use one or more tools 210 to access external APIs to search for information online, retrieve real-time data from services such as weather APIs, or interact with instant messaging platforms. By leveraging one or more of its tools 210, agent 202 can extend its functionality, enabling it to handle a variety of queries and tasks with greater accuracy and relevance.
[0045] Referring back to Figure 1, for example, process 100 may include: planner node 106 receiving input data 116 representing a query and determining one or more mappings 118. In other words, planner node 106 may determine which of one or more services 112 (and / or which of one or more API tools 110) to invoke in response to the query. In some examples, planner node 106 may use one or more language models to analyze input data 116 and determine one or more mappings 118. Furthermore, in some cases, planner node 106 may use one or more language models to decompose input data 116 into multiple subqueries (e.g., smaller, simpler queries than the entire query in input data 116). For example, if input data 116 contains the query “What’s the weather like in California? When does the game start?”, planner node 106 may use one or more language models to break this query down into at least a first query “What’s the weather like in California?” and a second query “When does the game start?”. In this example, planner node 106 can also use one or more language models to determine which one or more services and / or one or more API tools (e.g., weather service and sports service) should be invoked for different queries.
[0046] For example, Figure 3 is a data flow diagram illustrating an example of a process 300 that, according to some embodiments of the present disclosure, can be at least partially executed by planner node 106 to select one or more optimal services for responding to a query. As shown, process 300 may include: planner node 106 receiving input data 116 and generating a request or invocation to one or more language models 306. The request may include request data 302, which may include input data 116 and service information 304 associated with a service mapped to API tool 110. In some examples, service information 304 may include API specifications associated with the service and / or other information indicating at least the capabilities or functionalities of the service. One or more language models 306 may process request data 302 and determine one or more mappings 118, and planner node 106 may forward one or more mappings 118 to tool node 108. One or more mappings 118 may indicate which API tools 110 (and ultimately which services) to forward query data 308 to, where the query data may contain one or more queries from the input data 116. For example, in the example of Figure 3, one or more mappings 118 may instruct tool node 108 to forward query data 308 to the first API tool 110(1) and the third API tool 110(3), but not to the second API tool 110(2) and / or the Nth API tool 110(N).
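The selection step of process 300 can be sketched as follows, under stated assumptions: the service information is reduced to one-line capability summaries, the tool names are invented, and `stub_language_model` is a keyword matcher standing in for one or more language models 306 so that the example runs without any model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical service information: tool name -> capability summary.
SERVICE_INFO: Dict[str, str] = {
    "weather_tool": "Current conditions and forecasts for a location.",
    "sports_tool": "Game schedules and scores.",
    "calendar_tool": "Create and list calendar events.",
}


def build_request_data(input_text: str, service_info: Dict[str, str]) -> str:
    """Combine the user input with the service information in one request."""
    return json.dumps({"query": input_text, "services": service_info})


def stub_language_model(request_data: str) -> str:
    """Keyword-matching stand-in for a language model; returns JSON text."""
    request = json.loads(request_data)
    text = request["query"].lower()
    keywords = {
        "weather_tool": "weather",
        "sports_tool": "game",
        "calendar_tool": "meeting",
    }
    selected = [tool for tool in request["services"] if keywords[tool] in text]
    return json.dumps({"tools": selected})


def plan(input_text: str, model: Callable[[str], str] = stub_language_model) -> List[str]:
    """Return the names of the tools that the query should be forwarded to."""
    raw = model(build_request_data(input_text, SERVICE_INFO))
    tools = json.loads(raw)["tools"]
    # Discard any tool name that is not actually registered.
    return [tool for tool in tools if tool in SERVICE_INFO]


mappings = plan("What's the weather like in California? When does the game start?")
```

Here `mappings` selects the weather and sports tools and leaves the calendar tool out. Filtering the result against the registered names is a defensive choice, since a real model may return names that do not exist.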
[0047] Referring back to Figure 1, for example, process 100 may include: tool node 108 forwarding a query to one or more API tools 110 using one or more mappings 118. In some examples, each of the one or more API tools 110 may correspond to or be associated with a specific service in one or more services 112, thereby creating a structured interaction path for executing one or more API calls 120. In some examples, tool node 108 may be configured as a "dispatcher" node, which may be responsible for invoking one or more API tools 110 selected by the language model in planner node 106. In some examples, tool node 108 and / or one or more API tools 110 may make multiple invocations to one or more language models to perform query decomposition (if necessary), determine the appropriate API endpoint for the API call, and / or generate the API call itself (e.g., fill in parameters and / or other information), among other things.
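A "dispatcher" of this kind can be approximated by a lookup table from tool names to callables. The sketch below is illustrative only: the tool functions return placeholder strings instead of calling real services, and all names are assumptions.

```python
from typing import Callable, Dict, List

ApiTool = Callable[[str], str]


def weather_tool(query: str) -> str:
    """Placeholder for an API tool tied to a weather service."""
    return f"[weather service] {query}"


def sports_tool(query: str) -> str:
    """Placeholder for an API tool tied to a sports service."""
    return f"[sports service] {query}"


TOOLS: Dict[str, ApiTool] = {
    "weather_tool": weather_tool,
    "sports_tool": sports_tool,
}


def dispatch(query: str, mappings: List[str], tools: Dict[str, ApiTool]) -> Dict[str, str]:
    """Forward the query to each mapped tool and collect the results."""
    results: Dict[str, str] = {}
    for name in mappings:
        tool = tools.get(name)
        if tool is None:
            continue  # Unknown tool names are skipped rather than raising.
        results[name] = tool(query)
    return results


results = dispatch(
    "What's the weather like in California?",
    ["weather_tool", "unknown_tool"],
    TOOLS,
)
```

Only the registered weather tool is invoked; the unregistered name is ignored.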
[0048] For example, Figure 4 is a data flow diagram illustrating an example of a process 400 performed, at least in part, by an API tool 110 of tool node 108 to translate a query into an API call, according to some embodiments of the present disclosure. Process 400 may include: tool node 108 forwarding query data 308 to API tool 110. In some examples, tool node 108 may determine whether to forward query data 308 to API tool 110 based at least on one or more mappings 118. In some examples, query data 308 may include one or more portions of input data 116 and / or one or more queries contained in the input data 116. For example, query data 308 may include text data representing a natural language query that API tool 110 can respond to by interacting with its mapped service 112.
[0049] Upon receiving query data 308, API tool 110 can use query decomposition component 402 to determine whether the query represented by query data 308 needs to be decomposed into multiple queries. To do this, query decomposition component 402 can invoke one or more language models 408 (which may be the same as or different from one or more language models 306) to process query data 308. For example, if the query is "What is the weather like in California? What is the weather like in Idaho?", query decomposition component 402 and / or one or more language models 408 can decompose the query into two independent queries: the first query asks "What is the weather like in California?", and the second query asks "What is the weather like in Idaho?". In this way, by decomposing the query, API tool 110 can generate separate API calls 120 for each of these queries and submit these API calls 120 to their respective mapped backend services 112.
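The decomposition step can be illustrated with a deliberately simple stand-in. In the sketch below, a regular expression that splits on question marks takes the place of the language model call made by query decomposition component 402; this substitution is an assumption made so the example is runnable, and the function name is hypothetical.

```python
import re
from typing import List


def stub_decompose(query: str) -> List[str]:
    """Split a compound query into separate questions.

    A real system would ask a language model; splitting on question
    marks is a stand-in used only for illustration.
    """
    parts = re.findall(r"[^?]+\?", query)
    return [part.strip() for part in parts] or [query.strip()]


subqueries = stub_decompose(
    "What is the weather like in California? What is the weather like in Idaho?"
)
```

Each element of `subqueries` can then be turned into its own API call.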
[0050] In some examples, API tool 110 may use API classification component 404 to determine which APIs of service 112 to invoke (e.g., which of API endpoints 414(1)-414(N)). For example, API classification component 404 may invoke or use one or more language models 408 to process query data 308 (or decomposed query data) and API specifications 410 associated with the APIs of service 112. By doing so, API classification component 404 and / or one or more language models 408 may determine which one or more API endpoints 414 to use for one or more API calls 120 to service 112. In some examples, API classification component 404 and / or one or more language models 408 may select one or more API endpoints 414 based on the specific functionality provided by one or more API endpoints 414 and whether that functionality matches the operations required to be performed in response to a query (such as retrieving data, updating resources, creating new entries, etc.). This selection process may involve consulting one or more language models 408 to examine the API specification 410 to understand the available API endpoints 414, their HTTP methods (e.g., GET, POST, PUT, DELETE, etc.), required parameters, and response formats. For example, the API specification 410 associated with service 112 may indicate one or more endpoints (e.g., URL addresses, etc.) corresponding to the API used for the service, the functionality of one or more API endpoints 414, the parameters to be included in the payload of one or more API calls 120, the format of the API request or response, or any other information associated with the API of service 112.
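Endpoint selection against a specification can be sketched as a scoring problem. In the example below, word overlap between the query and each endpoint description substitutes for the judgment of one or more language models 408; the specification layout, paths, and function names are hypothetical.

```python
import re
from typing import Any, Dict, Set

# Hypothetical API specification listing the endpoints of one service.
API_SPEC: Dict[str, Any] = {
    "endpoints": [
        {"path": "/current", "method": "GET",
         "description": "current weather conditions for a city",
         "parameters": ["city"]},
        {"path": "/forecast", "method": "GET",
         "description": "multi-day forecast for a city",
         "parameters": ["city", "days"]},
        {"path": "/alerts", "method": "POST",
         "description": "subscribe to severe weather alerts",
         "parameters": ["city", "email"]},
    ]
}


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_endpoint(query: str, spec: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the endpoint whose description shares the most words with the query."""
    query_tokens = tokens(query)
    return max(
        spec["endpoints"],
        key=lambda endpoint: len(query_tokens & tokens(endpoint["description"])),
    )


endpoint = classify_endpoint("What is the forecast for Boise this week?", API_SPEC)
```

For this query the `/forecast` entry wins because its description shares two words with the query. Word overlap is crude; it is used here only to keep the example deterministic.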
[0051] In some examples, API tool 110 may also use parameter population component 406 to generate one or more API calls 120 (e.g., API request payloads). For example, parameter population component 406 may apply query data 308 (or a portion thereof) along with API specification 410 associated with service 112 to one or more language models 408. One or more language models 408 may process these inputs and generate text data representing one or more API calls 120. That is, one or more language models 408 may analyze query data 308 and API specification 410 to generate a payload for one or more API calls 120 by at least specifying one or more API endpoints 414 to be used, using HTTP methods (e.g., GET, POST, PUT, DELETE), and populating parameters and / or other data payloads for one or more API calls 120. In some examples, API tool 110 may then use API call execution component 412 to execute one or more API calls 120 by sending a request / message containing a payload to one or more API endpoints 414 of service 112.
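Parameter population and request construction might look like the following sketch. The base URL, the endpoint entry, and the fixed model output are assumptions; `stub_language_model` returns a canned JSON string in place of one or more language models 408, and the request is built with the standard `urllib` modules but deliberately not sent.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict

BASE_URL = "https://api.example.com"  # Hypothetical service address.
ENDPOINT: Dict[str, Any] = {
    "path": "/forecast",
    "method": "GET",
    "parameters": ["city", "days"],
}


def stub_language_model(prompt: str) -> str:
    """Fixed output standing in for a real language model call."""
    return '{"city": "Boise", "days": 7, "units": "metric"}'


def populate_parameters(query: str, endpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Ask the (stub) model for parameter values, then validate them."""
    prompt = f"Endpoint: {json.dumps(endpoint)}\nQuery: {query}"
    proposed = json.loads(stub_language_model(prompt))
    # Keep only the parameters that the specification declares.
    return {k: v for k, v in proposed.items() if k in endpoint["parameters"]}


def build_request(endpoint: Dict[str, Any], params: Dict[str, Any]) -> urllib.request.Request:
    """Assemble the HTTP request for the chosen endpoint without sending it."""
    url = f"{BASE_URL}{endpoint['path']}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(url, method=endpoint["method"])


params = populate_parameters("What is the forecast for Boise this week?", ENDPOINT)
request = build_request(ENDPOINT, params)
# Executing the call would be a separate step, e.g. urllib.request.urlopen(request).
```

Dropping the undeclared `units` key shows why generated parameters are worth checking against the specification before a call is executed.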
[0052] While this is only one example of how tool node 108 and / or one or more API tools 110 generate one or more API calls 120 from a natural language query, in additional or alternative examples, tool node 108 and / or one or more API tools 110 may use any other methods or processes to convert a natural language query into one or more API calls 120. For example, instead of using multiple components to make multiple calls to one or more language models 408, API tool 110 may make a single call to one or more language models 408 to generate one or more API calls 120, or may make more calls to one or more language models 408 than described above with respect to the example of Figure 4.
[0053] Referring back to Figure 1, for example, process 100 may include: one or more API tools 110 executing one or more API calls 120 for one or more services 112 and receiving one or more API responses 122 returned from one or more services 112. One or more API tools 110 may then forward one or more API responses 122 to planner node 106. In any example, planner node 106 may use one or more API responses 122 to determine whether to call another tool or call response node 114 to generate a response to the query. For example, planner node 106 may use a language model to process one or more API responses 122 and / or any other information available to it to determine whether a response can be generated, in which case planner node 106 may forward the response or other information to response node 114, or whether another tool call is needed to obtain any additional, necessary information for the response. In some cases, one or more API responses 122 to one or more API calls 120 may be in a format that is not natural language. For example, one or more API responses 122 may contain text data representing JSON, XML, or any other structured response. However, as described herein, response node 114 may use one or more language models to convert one or more API responses 122 from a structured format to a natural language format / message.
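The planner's choice between calling another tool and handing off to the response node amounts to a bounded loop. The sketch below assumes a `decide` callable that returns either a tool name or the string "respond"; that protocol, the step limit, and all function names are hypothetical, and the stubs replace the language model calls.

```python
from typing import Callable, Dict, List


def run_agent(
    query: str,
    decide: Callable[[str, List[str]], str],
    tools: Dict[str, Callable[[str], str]],
    respond: Callable[[str, List[str]], str],
    max_steps: int = 5,
) -> str:
    """Call tools until the planner decides it can respond (or steps run out)."""
    api_responses: List[str] = []
    for _ in range(max_steps):
        action = decide(query, api_responses)
        if action == "respond":
            break
        api_responses.append(tools[action](query))
    return respond(query, api_responses)


def decide(query: str, api_responses: List[str]) -> str:
    """Stub planner: one weather lookup is treated as sufficient."""
    return "weather_tool" if not api_responses else "respond"


def weather_tool(query: str) -> str:
    """Stub API tool returning a structured (JSON) response."""
    return '{"condition": "sunny"}'


def respond(query: str, api_responses: List[str]) -> str:
    """Stub response node."""
    return f"Answer based on {len(api_responses)} API response(s)."


answer = run_agent(
    "What's the weather like in California?",
    decide,
    {"weather_tool": weather_tool},
    respond,
)
```

The `max_steps` bound is a design choice added for the sketch so that a planner that never chooses to respond cannot loop indefinitely.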
[0054] For example, Figure 5 is a data flow diagram illustrating an example of a process 500, at least partially performed by a response node 114 according to some embodiments of the present disclosure, to convert a response to an API call into output data that at least represents a natural language message. Process 500 includes: one or more API tools 110 receiving one or more API responses 122 from one or more API endpoints 414 of one or more services 112, and forwarding the one or more API responses 122 to a planner node 106. Although not shown, the planner node 106 may use a language model to determine whether it has obtained all the information required to respond to the query, or whether it needs to invoke another tool or obtain any additional information by invoking tool node 108. In the example of Figure 5, planner node 106 determines that it has all the information needed to form a response to the query and invokes response node 114 to generate the response (e.g., it no longer loops to continue trying to determine how to respond to the query). Response node 114 may generate request data 502 and apply request data 502 as input to one or more language models 504, which may be the same as or different from language models 306 and / or 408. In some examples, request data 502 may include one or more of input data 116, API response 122, and / or API specification 410. Language model 504 may process request data 502 and generate one or more natural language messages 506.
[0055] In some examples, the natural language message 506 may contain text representing a natural language message that interprets the content of the API response 122. For example, again suppose the query is "What's the weather like in California?". In this scenario, the API response 122 may contain text representing a JSON-formatted response (or other structured data format response), the content of which may resemble the following:
[0056]
[0057] However, at least based on using language model 504 to process request data 502, including API specification 410 (which may explain the meaning of the aforementioned keys, fields, and values), natural language message 506 could contain information like: “Los Angeles is currently experiencing sunny and mild weather with a high of 72 degrees Fahrenheit. Similar weather is expected this week, with daily highs reaching the mid-70s to the low-80s Fahrenheit.” As another example, natural language message 506 could contain information like: “Los Angeles, California is experiencing sunny weather with a temperature of 72 degrees Fahrenheit, humidity of 60%, and wind speeds of 8 mph.”
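The conversion from a structured response to a sentence can be illustrated with a stand-in. The payload below is hypothetical (the example payload referenced above is not reproduced in this text), its field names are invented, and a fixed template replaces language model 504 so that the output is deterministic.

```python
import json

# Hypothetical structured API response; field names are assumptions.
API_RESPONSE = json.dumps({
    "location": {"city": "Los Angeles", "state": "California"},
    "current": {"condition": "sunny", "temp_f": 72, "humidity_pct": 60, "wind_mph": 8},
})


def stub_language_model(response_text: str) -> str:
    """Template-based stand-in for a model that verbalizes the response."""
    data = json.loads(response_text)
    place, now = data["location"], data["current"]
    return (
        f"{place['city']}, {place['state']} is experiencing {now['condition']} weather "
        f"with a temperature of {now['temp_f']} degrees Fahrenheit, "
        f"humidity of {now['humidity_pct']}%, and wind speeds of {now['wind_mph']} mph."
    )


message = stub_language_model(API_RESPONSE)
```

A template only works for one known response shape; the point of supplying the API specification to a language model is that the same step can handle response formats it has no template for.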
[0058] In some examples, response node 114 may receive natural language message 506 returned from language model 504 and include one or more portions of natural language message 506 in output data 124. For example, response node 114 may use natural language message 506 to generate output data 124. In some examples, output data 124 may contain multimodal data. For example, continuing with the above examples, in addition to containing natural language message 506 describing weather conditions, output data 124 may also contain visual data (e.g., weather forecast maps, news forecast video clips, etc.), audio data, and / or other data.
[0059] Referring back to Figure 1, for example, process 100 may include: application 104 sending output data 124 to client device 102. In some examples, output data 124 may be presented by client device 102 via an instance of a user interface executed on client device 102. In other words, by sending output data 124 to the client device, application 104 may cause client device 102 and / or the user interface to present a response to a query. In some examples, presenting a response may include, but is not limited to: outputting audio data of the response using one or more speakers of client device 102; outputting visual data of the response (e.g., image data, video data, text data, analytics data, etc.) using the display of client device 102; and / or outputting any other data using any other component or manner of client device 102.
[0060] Referring now to Figure 6, Figure 6 is a block diagram illustrating an example of a system 602 that can perform one or more processes described herein, according to some embodiments of the present disclosure. As shown, system 602 (which may represent and / or include example computing device 1000 and / or example data center 1100) may include one or more processors 604 (which may resemble and / or include CPU 1006 and / or GPU 1008) and memory 606 (which may resemble and / or include memory 1004). For example, memory 606 may store one or more of application 104, planner node 106, tool node 108, one or more API tools 110, response node 114, and / or one or more language models 608. Furthermore, one or more processors 604 may execute application 104, planner node 106, tool node 108, one or more API tools 110, response node 114, and / or one or more language models 608 to perform one or more processes described herein.
[0061] In some examples, system 602 may communicate with one or more client devices 102 and / or one or more services 112 via one or more networks 610. For example, system 602 may receive input data representing natural language queries from one or more client devices 102 and execute application 104 (and / or components thereof) using one or more processors 604, and / or use one or more language models 608 to translate the natural language queries into one or more API calls to be sent to one or more services 112. System 602 may also receive responses to API calls from one or more services 112 and translate the responses into natural language messages to be sent back to one or more client devices 102.
[0062] Referring now to Figure 7 and Figure 8, each block of the methods 700 and 800 described herein includes a computational process that can be executed using any combination of hardware, firmware, and / or software. For example, various functions can be performed using one or more processors executing instructions stored in one or more memories. These methods can also be embodied as computer-usable instructions stored on a computer storage medium. These methods can be provided by standalone applications, services, or managed services (standalone or in combination with other managed services), provided as microservices via application programming interfaces (APIs), or provided as plug-ins to another product, among others. Furthermore, methods 700 and 800 are described, by way of example, with respect to the system of Figure 1. However, these methods may be performed additionally or alternatively by any system or any combination of systems, including but not limited to the systems described herein.
[0063] Figure 7 is a flowchart illustrating an example of a method 700, executable to provide a natural language interface for interacting with an API of a service, according to some embodiments of the present disclosure. Method 700 includes, at block B702, obtaining, at a planner node, input data representing one or more queries sent by a client device. For example, planner node 106 may obtain input data 116 representing one or more queries sent by client device 102.
[0064] Method 700, at block B704, includes: mapping the one or more queries to one or more tool nodes, based at least on the planner node processing, using one or more first language models, the input data and information associated with multiple services. For example, planner node 106 may use one or more first language models to process input data 116 and information associated with one or more services 112, and map one or more queries to one or more API tools 110. For example, if the input data contains a first query asking for information about the weather and a second query asking for information about sports teams, the first query may be mapped or routed to a first API tool corresponding to the weather-related service or application, while the second query may be mapped or routed to a second API tool corresponding to the sports-related service or application.
[0065] Method 700, at block B706, includes: generating first text data representing one or more API calls to one or more services, based at least on the one or more tool nodes processing, using one or more second language models, at least a portion of the input data and one or more API specifications associated with the one or more services. For example, one or more API tools 110 (and / or tool node 108) may generate first text data representing one or more API calls 120 to one or more services 112, based at least on processing, using one or more second language models, at least a portion of input data 116 and the API specifications associated with one or more services 112. In some examples, the second language model may be the same as or different from the first language model. Furthermore, in some cases, one or more API tools 110 may perform multiple calls to one or more second language models to generate one or more API calls 120. For example, one or more API tools 110 may make a first call to one or more second language models to decompose or simplify a query, a second call to one or more second language models to classify one or more API endpoints for one or more services 112, and a third call to one or more second language models to populate parameters or payloads for one or more API calls 120 and / or requests.
[0066] Method 700, at block B708, includes: obtaining, at a response generation node, second text data representing one or more responses from the one or more services, based at least on the one or more tool nodes executing the one or more API calls. For example, response node 114 may obtain second text data representing one or more API responses 122, based at least on one or more API tools 110 executing one or more API calls 120. In some examples, the second text data may be in a structured data format, such as JSON, XML, or any other structured data format.
[0067] Method 700, at block B710, includes: generating output data representing one or more responses to the one or more queries, based at least on the response generation node processing, using one or more third language models, the second text data and the one or more API specifications. For example, response node 114 may generate output data 124 based at least on processing the second text data (e.g., one or more API responses 122) and one or more API specifications using one or more third language models. In some examples, the third language model may be the same as or different from one or more of the first language model and / or the second language model. In some examples, response node 114 may use one or more third language models to convert one or more API responses 122 into natural language messages, and one or more portions of the natural language messages may be included in the output data 124.
[0068] Method 700 at block B712 includes sending output data to a client device for presentation via an instance of a user interface executed by the client device. For example, response node 114 and / or application 104 may send output data 124 to client device 102. In some examples, client device 102 may execute an instance of a user interface, and this user interface may use output data 124 to render a response to the original query (e.g., by displaying a natural language message on the screen, by outputting audio data representing the utterances of the natural language message, etc.).
[0069] Figure 8 is a flowchart illustrating an example of a method 800 for converting a natural language query into an API call and converting the response to the API call back into a natural language reply to the query, according to some embodiments of the present disclosure. Method 800 includes, at block B802, obtaining input data representing the query. For example, a planner node 106 of application 104 may obtain input data 116 representing the query.
[0070] Method 800, at block B804, includes: generating, using one or more language models and based at least on a portion of the input data and one or more API specifications associated with one or more services, first text data representing one or more API calls to the one or more services. For example, tool node 108 of application 104 may use one or more API tools 110 (which may perform one or more calls to one or more language models) to generate first text data representing one or more API calls 120.
[0071] Method 800, at block B806, includes: receiving, based at least on the one or more services executing the one or more API calls, second text data representing one or more responses to the one or more API calls. For example, response node 114 may receive second text data representing responses to one or more API calls 120.
[0072] Method 800, at block B808, includes: generating output data representing one or more responses to the query, using one or more language models and based at least on the second text data and the one or more API specifications. For example, response node 114 may generate output data 124 representing one or more responses to a query, based at least on processing the second text data and one or more API specifications using one or more language models.
[0073] The systems and methods described herein can be used for a variety of purposes, including, but not limited to: machine (e.g., robots, vehicles, construction machinery, warehouse vehicles / machines, autonomous, semi-autonomous and / or other types of machines) control, machine motion, machine driving, synthetic data generation, model training (e.g., using real data, augmented data and / or synthetic data, such as synthetic data generated using simulation platforms or systems, synthetic data generation techniques, such as, but not limited to, the techniques described herein), perception, augmented reality (AR), virtual reality (VR), mixed reality (MR), robotics, security and surveillance (e.g., in smart city implementations), autonomous or semi-autonomous machine applications, deep learning, environmental simulation, object or participant simulation and / or digital twins, data center processing, conversational AI, light transport simulation (e.g., ray tracing, path tracing, etc.), distributed or collaborative content creation of 3D assets (e.g., using Universal Scene Description (USD) data, such as OpenUSD and / or other data types), cloud computing, generative artificial intelligence (e.g., using one or more diffusion models, transformer models, etc.) and / or any other suitable application.
[0074] The disclosed embodiments can be included in a variety of different systems, such as automotive systems (e.g., control systems for autonomous or semi-autonomous machines, perception systems for autonomous or semi-autonomous machines), systems implemented using robots or robotic platforms, aerial systems, medical systems, marine systems, smart area monitoring systems, systems for performing deep learning operations, systems for performing simulation operations (e.g., in driving or vehicle simulations, robot simulations, smart city or surveillance simulations, etc.), systems for performing digital twin operations (e.g., in conjunction with collaborative content creation platforms or systems, such as, but not limited to, NVIDIA's OMNIVERSE and / or other platforms, systems, or services using USD or OpenUSD data types), systems implemented using edge devices, systems containing one or more virtual machines (VMs), systems for performing synthetic data generation operations (e.g., using one or more neural radiance fields (NeRFs), Gaussian splat techniques, diffusion models, transformer models, etc.), systems that are at least partially implemented in a data center, systems for performing conversational AI operations, systems that implement one or more language models (such as one or more large language models (LLM), one or more small language models (SLM), one or more visual language models (VLM), one or more multimodal language models, etc.), systems for performing light transport simulations, systems for performing collaborative content creation of 3D assets (e.g., using Universal Scene Description (USD) data (such as OpenUSD), computer-aided design (CAD) data, 2D and / or 3D graphics or design data and / or other data types), systems that are at least partially implemented using cloud computing resources, and / or other types of systems.
[0075] Example language model
[0076] In at least some embodiments, language models such as Large Language Models (LLM), Visual Language Models (VLM), Multimodal Language Models (MMLM), and / or other types of generative artificial intelligence (AI) can be implemented. These models may be able to understand, summarize, translate, and / or otherwise generate text (e.g., natural language text, code, etc.), images, videos, computer-aided design (CAD) assets, OMNIVERSE and / or METAVERSE file information (e.g., USD formats such as OpenUSD), and / or the like based on context provided in input prompts or queries. In embodiments, these language models may be considered “large” because they are trained on massive datasets and have architectures with a large number of learnable network parameters (weights and biases)—e.g., millions or billions of parameters. LLM / SLM / VLM / MMLM / etc. can be implemented for summarizing textual data, analyzing data (e.g., text, images, videos, etc.), extracting insights from data (e.g., text, images, videos, etc.), and generating new text / images / videos / etc. in a user-specified style, tone, and / or format. In some embodiments, the LLM / SLM / VLM / MMLM / etc. disclosed herein may be specifically designed for text processing. In other embodiments, a multimodal LLM may be implemented to accept, understand, and / or generate text and / or other types of content, such as images, audio, 2D and / or 3D data (e.g., USD format) and / or video. For example, a Visual Language Model (VLM) or more specifically a Multimodal Language Model (MMLM) may be implemented to accept images, video, audio, text, 3D designs (e.g., CAD) and / or other input data types and / or generate or output images, video, audio, text, 3D designs and / or other output data types.
[0077] Various types of LLM / SLM / VLM / MMLM / etc. architectures can be implemented in various embodiments. For example, different architectures can be implemented using different techniques to understand and generate outputs (e.g., text, audio, video, images, 2D and / or 3D design or asset data, etc.). In some embodiments, architectures such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) can be used, while in other embodiments, transformer architectures (e.g., architectures relying on self-attention and / or cross-attention (e.g., between contextual data and textual data) mechanisms) can be used to understand and recognize relationships between words or tokens and / or contextual data (e.g., other text, video, images, design data, USD, etc.). One or more generative processing pipelines including LLM / SLM / VLM / MMLM / etc. may also include one or more diffusion blocks (e.g., denoising blocks). The LLM / SLM / VLM / MMLM / etc. of this disclosure may include encoder and / or decoder blocks. For example, discriminative or encoder-only models (e.g., BERT (Bidirectional Encoder Representations from Transformers)) can be implemented for tasks involving language understanding (e.g., classification, sentiment analysis, question answering, and named entity recognition). As another example, generative or decoder-only models (e.g., GPT (Generative Pretrained Transformer)) can be implemented for tasks involving language and content generation (e.g., text completion, story generation, and dialogue generation). LLM / SLM / VLM / MMLM / etc. that include both encoder and decoder components (e.g., T5 (Text-to-Text Transfer Transformer)) can be implemented to understand and generate content, such as for translation and summarization. These examples are not intended to be limiting, and any architecture type (including, but not limited to, those described herein) can be implemented depending on the specific implementation and the task performed using LLM / SLM / VLM / MMLM / etc.
[0078] In various embodiments, unsupervised learning can be used to train LLM / SLM / VLM / MMLM / etc., where LLM / SLM / VLM / MMLM / etc. learn patterns from a large amount of unlabeled text / audio / video / image / design / USD / etc. data. Due to extensive training, in these embodiments, the model may not require task-specific or domain-specific training. An LLM / SLM / VLM / MMLM / etc. extensively pre-trained on a large amount of unlabeled data can be referred to as a foundation model and can excel at various tasks, such as question answering, summarizing, filling in missing information, translation, and image / video / design / USD / data generation. Some LLM / SLM / VLM / MMLM / etc. can be customized for specific use cases using techniques such as prompt tuning, fine-tuning, retrieval-augmented generation (RAG), adding adapters (e.g., custom neural networks and / or neural network layers that tune or adjust prompts or tokens to bias the language model towards a specific task or domain), using models optimized for specific tasks and / or domains, and / or other fine-tuning or customization techniques.
[0079] In some embodiments, the LLM / SLM / VLM / MMLM / etc. disclosed herein can be implemented using various model alignment techniques. For example, in some embodiments, guardrails can be implemented to identify incorrect or unwanted inputs (e.g., prompts) and / or outputs of the model. In this process, the system can use guardrails and / or other model alignment techniques to prevent the processing of specific unwanted inputs using LLM / SLM / VLM / MMLM / etc., and / or to prevent the output or presentation of information generated by LLM / SLM / VLM / MMLM / etc. (e.g., displays, audio outputs, etc.). In some embodiments, one or more additional models (or layers thereof) can be implemented to identify problems with the model's inputs and / or outputs. For example, these "protective" models can be trained to identify "safe" or otherwise okay or desired inputs and / or outputs and / or "unsafe" or otherwise unwanted inputs and / or outputs for a particular application / implementation. Therefore, the LLM / SLM / VLM / MMLM / etc. disclosed herein are unlikely to output language / text / audio / video / design data / USD data / etc. that may be offensive, vulgar, inappropriate, insecure, out of scope, and / or unwanted for a particular application / implementation.
[0080] In some embodiments, LLM / SLM / VLM / MMLM / etc. can be configured or able to access or use one or more plugins, application programming interfaces (APIs), databases, data stores, repositories, etc. For example, for certain tasks or operations where the model is not ideally suited, the model may have instructions for accessing one or more plugins (e.g., third-party plugins) to help process the current input (e.g., as a result of training, and / or based on instructions in a given prompt). In such an example, when at least part of the prompt relates to restaurants or weather, the model can access one or more restaurant or weather plugins (e.g., via one or more APIs) to retrieve relevant information. Another example is that if at least part of the response requires mathematical computation, the model can access one or more mathematical plugins or APIs to help solve the problem, and then the response from the plugins and / or APIs can be used in the model's output. This process can be repeated (e.g., recursively) an arbitrary number of iterations, using any number of plugins and / or APIs, until a response to each query / question / request / process / action / etc. can be generated in response to the input prompt. Therefore, models can rely not only on their own knowledge gained from training on large datasets, but also on the expertise or optimized properties of one or more external resources (such as APIs, plugins, etc.).
[0081] In some embodiments, multiple language models (e.g., LLM / SLM / VLM / MMLM / etc., multiple instances of the same language model, and / or multiple prompts provided to the same language model or instances of the same language model) can be implemented, executed, or accessed (e.g., using one or more plugins, user interfaces, APIs, databases, data stores, repositories, etc.) to provide output in response to the same query or in response to separate parts of a query. In at least one embodiment, the same input query and prompts (e.g., a set of constraints, conditioners, etc.) can be provided to multiple language models (e.g., language models with different architectures, language models trained on different (e.g., updated) data corpora). In one or more embodiments, the language models can be different versions of the same foundation model. In one or more embodiments, at least one language model can be instantiated as multiple agents, for example, by providing more than one prompt to constrain, guide, or otherwise influence the style, content, or persona of the provided output. In one or more exemplary non-limiting embodiments, the same language model can be asked to provide output corresponding to different roles, perspectives, personas, or different knowledge bases, as defined by the provided prompts.
[0082] In any such embodiment, the outputs of two or more (e.g., each) language models, two or more versions of at least one language model, two or more instantiated agents of at least one language model, and / or two or more prompts provided to at least one language model can be further processed, such as aggregated, compared, or filtered, or used to determine (and provide) a consensus response. In one or more embodiments, the output from one language model (or version, instance, or agent) can be provided as input to another language model for further processing and / or validation. In one or more embodiments, the language model can be asked to generate or otherwise obtain output about the input source material, wherein the output is associated with the input source material. This association may include, for example, generating captions or text portions embedded (e.g., as metadata) within the input source text or image. In one or more embodiments, the output of the language model can be used to determine the validity of the input source material for further processing or inclusion in a dataset. For example, the language model can be used to evaluate the presence (or absence) of a target word in a text portion or the presence (or absence) of an object in an image, wherein the text or image is annotated to indicate such presence (or absence). Alternatively, by way of example and not limitation, the determination from the language model can be used to decide whether the source material should be included in a curated dataset.
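As a minimal sketch of determining a consensus response from several model outputs, the following Python example applies a simple majority vote. The function name `consensus` and the normalization step are illustrative assumptions; the disclosure does not prescribe a particular aggregation rule.

```python
from collections import Counter

def consensus(outputs: list[str]) -> str:
    """Return the most frequent answer after light normalization (assumed rule)."""
    # Normalize so that trivially different spellings count as the same answer.
    normalized = [o.strip().lower() for o in outputs]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

# Outputs from three hypothetical models, versions, or agents.
print(consensus(["Newton", "newton ", "Einstein"]))
```

Other aggregation strategies, such as passing all outputs to a further language model for validation, fit the same interface.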
[0083] Figure 9A is a block diagram of an example generative language model system 900 suitable for implementing at least some embodiments of this disclosure. In the example shown in Figure 9A, the generative language model system 900 includes a retrieval-augmented generation (RAG) component 992, an input processor 905, a tokenizer 910, an embedding component 920, a plugin / API 995, and a generative language model (LM) 930 (which may include LLM, SLM, VLM, multimodal LM, etc.).
[0084] At a high level, the input processor 905 can receive input 901, which includes text and / or other types of input data (e.g., audio data, video data, image data, sensor data (e.g., LiDAR, RADAR, ultrasound, etc.), 3D design data, CAD data, Universal Scene Description (USD) data (e.g., OpenUSD, etc.)), depending on the architecture of the generative LM 930 (e.g., LLM / SLM / VLM / MMLM, etc.). In some embodiments, input 901 includes plain text in the form of one or more sentences, paragraphs, and / or documents. Additionally or alternatively, input 901 may include numerical sequences, pre-computed embeddings (e.g., word or sentence embeddings), and / or structured data (e.g., tabular format, JSON, or XML). In some implementations in which the generative LM 930 is capable of handling multimodal input, input 901 can combine text (or can omit text) with image data, audio data, video data, design data, USD data, and / or other types of input data (e.g., but not limited to the data described herein). Taking raw input text as an example, input processor 905 can prepare the raw input text in various ways. For example, input processor 905 can perform various types of text filtering to remove noise (e.g., special characters, punctuation marks, HTML tags, stop words, portions of images, portions of audio, etc.) from relevant text content. In examples involving stop words (common words that often have little semantic meaning), input processor 905 can remove stop words to reduce noise and allow the generative LM 930 to focus on more meaningful content. Input processor 905 can apply text normalization, for example, by converting all characters to lowercase, removing accent marks, and / or handling special cases (such as contractions or abbreviations) to ensure consistency. These are just a few examples; other types of input processing can be applied.
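The text filtering and normalization steps described above can be sketched as follows. This is an illustrative example only: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions, and a production input processor would likely use a fuller, language-specific stop-word list.

```python
import re
import unicodedata

# Hypothetical stop-word list, kept short for illustration.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}

def preprocess(raw_text: str) -> str:
    """Filter and normalize raw text before tokenization."""
    # Strip HTML tags.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # Lowercase, then remove accent marks via Unicode decomposition.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace special characters and punctuation with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop stop words and collapse whitespace.
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(preprocess("<p>The Café is OPEN to visitors!</p>"))  # cafe open visitors
```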
[0085] In some embodiments, RAG component 992 (which may include one or more RAG models, and / or may be performed using generative LM 930 itself) may be used to retrieve additional information to be used as part of input 901 or a prompt. RAG can be used to enhance input to LLM / SLM / VLM / MMLM / etc. with external knowledge to make the answer to a specific question or query or request more relevant, such as when specific knowledge is required. RAG component 992 may obtain this additional information (e.g., grounding information such as grounding text / images / videos / audio / USD / CAD / etc.) from one or more external sources, and then feed it along with the prompt to LLM / SLM / VLM / MMLM / etc. to improve the accuracy of the model's response or output.
[0086] For example, in some embodiments, input 901 may be generated using the query or model input (e.g., a question, a request, etc.) in addition to data retrieved using RAG component 992. In some embodiments, input processor 905 may analyze input 901 and communicate with RAG component 992 (or in embodiments, RAG component 992 may be part of input processor 905) to identify relevant text and / or other data to provide to generative LM 930 as additional context or as an information source from which to identify the response, answer, or output 990. For example, when the input indicates that a user is interested in the required tire pressure for a particular make and model of vehicle, RAG component 992 may use a RAG model, for example, to perform a vector search in the embedding space to retrieve tire pressure information or its corresponding text from a digital (embedded) version of the owner's manual for that particular vehicle make and model. Similarly, when a user returns to a chatbot related to a specific product sale or service, the RAG component 992 can retrieve previously stored conversation history (or at least a summary of it) and provide the previous conversation history, along with the current inquiry / request, to the generative LM 930 as part of input 901.
[0087] RAG component 992 can use various RAG techniques. For example, it can use naive RAG, in which documents are indexed, chunked, and applied to an embedding model to generate embeddings corresponding to the chunks. User queries can also be applied to this embedding model and / or another embedding model of the RAG component 992, and the embeddings of the chunks can be compared with the embedding of the query to identify the embeddings most similar / relevant to the query. The chunks corresponding to these most similar / relevant embeddings can be provided to the generative LM 930 to generate output.
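The naive RAG flow (chunk, embed, compare, retrieve) can be sketched as below. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and all names (`chunk`, `retrieve`, `manual`) are hypothetical; the comparison uses cosine similarity, which is one common choice rather than a requirement of the disclosure.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 9) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

manual = ("recommended tire pressure is 35 psi front and rear "
          "engine oil should be changed every 10000 kilometers "
          "the spare wheel is stored under the cargo floor")
query = "what is the recommended tire pressure"
context = retrieve(query, chunk(manual))
# The retrieved chunk is supplied to the language model together with the query.
prompt = f"Context: {context[0]}\nQuestion: {query}"
print(prompt)
```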
[0088] In some embodiments, more advanced RAG techniques can be used. For example, chunks can undergo pre-retrieval processes (e.g., routing, rewriting, metadata analysis, expansion, etc.) before being passed to the embedding model. Furthermore, post-retrieval processes (e.g., re-ranking, prompt compression, etc.) can be performed on the output of the embedding model before generating the final embedding, which is then used for comparison with the input query.
[0089] As a further example, modular RAG techniques can be used, such as those similar to Naive RAG and / or Advanced RAG, but also including features such as hybrid search, recursive retrieval and query engines, StepBack methods, subqueries and hypothetical document embeddings.
[0090] As another example, Graph RAG can use a knowledge graph as a source of context or factual information. Graph RAG can be implemented using a graph database as a source of contextual information sent to LLM / SLM / VLM / MMLM / etc. Instead of (or in addition to) providing the model with data chunks extracted from larger documents (which may result in a lack of context, factual correctness, linguistic accuracy, etc.), Graph RAG can provide structured entity information to LLM / SLM / VLM / MMLM / etc. by combining structured entity text descriptions with their many attributes and relationships, thereby giving the model deeper insights. In implementing Graph RAG, the systems and methods described herein can use a graph as a content store, extract relevant document chunks, and ask LLM / SLM / VLM / MMLM / etc. to use them to answer questions. In such embodiments, the knowledge graph may contain relevant textual content and metadata about the knowledge graph, and / or may be integrated with a vector database. In some embodiments, Graph RAG can use graphs as subject matter experts, where descriptions of concepts and entities relevant to the query / prompt can be extracted and passed to the model as semantic context. These descriptions may include relationships between concepts. In other examples, the graph can be used as a database, where a portion of a query / prompt can be mapped to a graph query, the graph query can be executed, and LLM / SLM / VLM / MMLM / etc. can aggregate the results. In such examples, the graph can store relevant factual information, and a natural language (NL) to graph query tool and entity linking can be used to query it. In some embodiments, Graph RAG (e.g., using a graph database) can be combined with standard (e.g., vector database) RAG and / or other RAG types to benefit from a variety of approaches.
[0091] In any embodiment, the RAG component 992 can implement plugins, APIs, user interfaces, and / or other functions to perform RAG. For example, LLM / SLM / VLM / MMLM / etc. can use graph RAG plugins to run queries on knowledge graphs to extract relevant information to feed into the model, and can use standard or vector RAG plugins to run queries on vector databases. For example, the graph database can be accessed through the plugin's REST (Representational State Transfer) interface, thus decoupling the graph database from the vector database and / or embedding models.
[0092] The tokenizer 910 can segment (e.g., processed) text data into smaller units (tokens) for subsequent analysis and processing. Depending on the implementation, the tokens can represent individual words, sub-words, characters, portions of audio / video / images, etc. Word-based tokenization divides the text into individual words, treating each word as a separate token. Sub-word tokenization breaks words down into smaller meaningful units (e.g., prefixes, suffixes, stems), enabling the generative LM 930 to understand morphological changes and process out-of-vocabulary words more effectively. Character-based tokenization represents each character as a separate token, enabling the generative LM 930 to process text at a fine-grained level. The choice of tokenization strategy can depend on factors such as the language being processed, the task at hand, and / or the characteristics of the training dataset. Therefore, the tokenizer 910 can transform (e.g., processed) text into a structured format according to the tokenization scheme implemented in a particular embodiment.
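The three tokenization strategies can be contrasted with a short sketch. The sub-word function uses a greedy longest-match split against a small hypothetical vocabulary (`VOCAB`); this is only one of several possible sub-word schemes and is an assumption made for illustration.

```python
def word_tokenize(text: str) -> list[str]:
    """Word-based tokenization: each whitespace-separated word is a token."""
    return text.split()

def char_tokenize(text: str) -> list[str]:
    """Character-based tokenization: each character is a token."""
    return list(text)

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split of one word into known sub-word units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

VOCAB = {"un", "break", "able", "token", "ize", "s"}  # hypothetical vocabulary

print(word_tokenize("who discovered gravity"))  # ['who', 'discovered', 'gravity']
print(subword_tokenize("unbreakable", VOCAB))   # ['un', 'break', 'able']
print(char_tokenize("hi"))                      # ['h', 'i']
```

The character fallback is what lets the sub-word scheme handle out-of-vocabulary words without failing.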
[0093] Embedding component 920 can use any known embedding technique to transform discrete tokens into semantically meaningful (e.g., dense, continuous vector) representations. For example, embedding component 920 can use pre-trained word embeddings (e.g., Word2Vec, GloVe, or FastText), one-hot encoding, Term Frequency-Inverse Document Frequency (TF-IDF) encoding, one or more embedding layers of a neural network, and / or others.
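To illustrate the difference between a sparse one-hot encoding and a dense embedding lookup, consider the following sketch. The vocabulary, the 3-dimensional `EMBEDDING_TABLE` values, and the zero vector for unknown tokens are all made up for illustration; in practice the table would be learned or loaded from pre-trained embeddings.

```python
def one_hot(token: str, vocab: list[str]) -> list[float]:
    """Sparse encoding: a 1.0 at the token's vocabulary index, 0.0 elsewhere."""
    return [1.0 if t == token else 0.0 for t in vocab]

# Hypothetical dense embedding table (values are arbitrary, not trained).
EMBEDDING_TABLE = {
    "who": [0.2, -0.1, 0.7],
    "discovered": [0.5, 0.3, -0.2],
    "gravity": [-0.4, 0.9, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for tokens missing from the table

def embed(tokens: list[str]) -> list[list[float]]:
    """Dense encoding: look up one continuous vector per token."""
    return [EMBEDDING_TABLE.get(t, UNKNOWN) for t in tokens]

vocab = ["who", "discovered", "gravity"]
print(one_hot("gravity", vocab))   # [0.0, 0.0, 1.0]
print(embed(["who", "gravity"]))   # [[0.2, -0.1, 0.7], [-0.4, 0.9, 0.1]]
```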
[0094] In some implementations where input 901 includes image data / video data, etc., input processor 905 may resize the data to a standard size compatible with the format of the corresponding input channel and / or normalize pixel values to a common range (e.g., 0 to 1) to ensure consistent representation, and embedding component 920 may encode the image data using any known technique (e.g., using one or more convolutional neural networks (CNNs) to extract visual features). In some implementations where input 901 includes audio data, input processor 905 may resample the audio file to a consistent sampling rate for uniform processing, and embedding component 920 may use any known technique to extract and encode audio features, such as in the form of a spectrogram (e.g., a Mel spectrogram). In some implementations where input 901 includes video data, input processor 905 may extract frames or apply resizing to extracted frames, and embedding component 920 may extract features such as optical flow embeddings or video embeddings and / or encode temporal information or frame sequences. In some implementations where input 901 includes multimodal data, the embedding component 920 can use techniques such as early fusion (concatenation), late fusion (sequential processing), and attention-based fusion (e.g., self-attention, cross-attention) to fuse representations of different types of data (e.g., text, images, audio, USD, video, design, etc.).
[0095] The generative LM 930 and / or other components of the generative LM system 900 may use different types of neural network architectures depending on the implementation. For example, a transformer-based architecture (e.g., the architecture used in models such as GPT) may be implemented, and it may include a self-attention mechanism that weights the importance of different words or tokens in the input sequence and / or a feedforward network that processes the output of the self-attention layer, applying a nonlinear transformation to the input representation and extracting higher-level features. Some non-limiting example architectures include transformers (e.g., encoder-decoder, decoder-only, multimodal), RNNs, LSTMs, fusion models, diffusion models, cross-modal embedding models that learn a joint embedding space, graph neural networks (GNNs), hybrid architectures that combine different types of architectures, adversarial networks (such as generative adversarial networks (GANs) or adversarial autoencoders (AAEs)) for joint distribution learning, etc. Therefore, depending on the implementation and architecture, the embedding component 920 can apply the encoded representation of the input 901 to the generative LM 930, and the generative LM 930 can process the encoded representation of the input 901 to generate an output 990, which may include response text and / or other types of data.
[0096] As described herein, in some embodiments, the generative LM 930 may be configured to access or use (or be able to access or use) plugins / APIs 995 (which may include one or more plugins, application programming interfaces (APIs), databases, data stores, repositories, etc.). For example, for certain tasks or operations where the generative LM 930 is not ideally suited, the model may have instructions (e.g., as a result of training, and / or based on instructions in a given prompt, such as instructions retrieved using RAG component 992) to access one or more plugins / APIs 995 (e.g., third-party plugins) to help process the current input. In such an example, when at least part of the prompt is related to a restaurant or weather, the model may access one or more restaurant or weather plugins (e.g., via one or more APIs), sending at least part of the prompt related to a particular plugin / API 995 to the plugin / API 995, which can process the information and return an answer to the generative LM 930, which can then use the response to generate output 990. This process can be repeated (e.g., recursively) an arbitrary number of iterations and repeated using any number of plugins / APIs 995 until an output 990 that resolves each query / question / request / process / action / etc. from input 901 is generated. Therefore, the model can rely not only on its own knowledge acquired from training on a large dataset and / or from data retrieved using the RAG component 992, but also on the expertise or optimized properties of one or more external resources (e.g., plugins / APIs 995).
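The iterative plugin / API loop can be sketched as follows. Everything here is a stand-in: `stub_model` imitates a generative LM that either requests a plugin call or returns a final answer, `PLUGINS` is a hypothetical registry, and the `max_iterations` cap is an added assumption that keeps the loop finite (the text allows an arbitrary number of iterations).

```python
from typing import Callable

# Hypothetical plugin registry: plugin name -> callable taking one string argument.
PLUGINS: dict[str, Callable[[str], str]] = {
    "weather": lambda city: f"Sunny in {city}",
    "math": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}

def stub_model(prompt: str, tool_results: list[str]) -> dict:
    """Stand-in for the generative LM: request a plugin call or give a final answer."""
    if "weather" in prompt and not tool_results:
        return {"type": "call", "plugin": "weather", "argument": "Paris"}
    return {"type": "final", "text": f"Answer based on: {tool_results}"}

def run(prompt: str, max_iterations: int = 5) -> str:
    """Alternate between the model and plugins until the model produces an output."""
    tool_results: list[str] = []
    for _ in range(max_iterations):
        step = stub_model(prompt, tool_results)
        if step["type"] == "final":
            return step["text"]
        # Send the relevant part of the request to the plugin and keep its answer.
        result = PLUGINS[step["plugin"]](step["argument"])
        tool_results.append(result)
    return "No answer within the iteration limit"

print(run("What is the weather in Paris?"))
```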
[0097] Figure 9B is a block diagram of an example implementation in which the generative LM 930 includes a transformer encoder-decoder. For example, suppose the input text (e.g., “Who discovered gravity”) is tokenized (e.g., by the tokenizer 910 of Figure 9A) into tokens such as words, and each token is encoded (e.g., by the embedding component 920 of Figure 9A) into a corresponding embedding (e.g., of size 512). Since these token embeddings do not typically represent the position of the tokens in the input sequence, positional encoding can be added to each token embedding using any known technique to encode the order relation and context of the tokens in the input sequence. Thus, the (e.g., resulting) embeddings can be applied to one or more encoders 935 of the generative LM 930.
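As one example of a known positional encoding technique, the sketch below uses fixed sinusoidal encodings, where even dimensions use a sine and odd dimensions a cosine of a position-dependent angle. The choice of the sinusoidal scheme, and the function names, are assumptions for illustration; the text permits any known technique.

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal encoding: PE[pos][2k] = sin(angle), PE[pos][2k+1] = cos(angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the positional encoding element-wise to each token embedding."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(row, pe_row)] for row, pe_row in zip(embeddings, pe)]

# Three tokens ("Who", "discovered", "gravity") with toy 4-dimensional embeddings.
tokens = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(add_positions(tokens)[0])  # [0.0, 1.0, 0.0, 1.0]
```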
[0098] In the example implementation, encoder 935 forms an encoder stack, where each encoder includes a self-attention layer and a feedforward network. In the example transformer architecture, each token (e.g., a word) flows through a separate path. Therefore, each encoder can accept a sequence of vectors, pass each vector through the self-attention layer, then through the feedforward network, and then up to the next encoder in the stack. Any known self-attention technique can be used. For example, to compute a self-attention score for each token (word), a query vector, a key vector, and a value vector can be created for each token. The self-attention output for a token can be computed by taking the dot product of its query vector with the key vector of each token, normalizing the resulting scores, multiplying each normalized score by the corresponding value vector, and summing the weighted value vectors. The encoder can apply multi-head attention, where the attention mechanism is applied multiple times in parallel with different learned weight matrices. Any number of encoders can be cascaded to generate a context vector encoding the input. Attention projection layer 940 can transform the context vector into attention vectors (keys and values) for decoder 945.
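The self-attention computation can be sketched in plain Python as follows. The normalization is assumed here to be the common combination of scaling by the square root of the key dimension followed by a softmax; the query, key, and value vectors are passed in directly, whereas a real model would derive them from token embeddings using learned weight matrices.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (single head)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot product of this token's query with every token's key, then scale.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Identical keys give equal weights, so the output is the mean of the values.
out = self_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
print(out)  # [[1.0, 1.0]]
```

Multi-head attention would repeat this computation in parallel with different projections and concatenate the results.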
[0099] In the example implementation, decoder 945 forms a decoder stack, where each decoder includes a self-attention layer, an encoder-decoder attention layer that uses attention vectors (keys and values) from the encoder to focus on relevant parts of the input sequence, and a feedforward network. Similar to encoder 935, in the example transformer architecture, each token (e.g., a word) flows through a separate path in decoder 945. During the first pass, decoder 945, classifier 950, and generation mechanism 955 can generate a first token, and generation mechanism 955 can apply the generated token as input during a second pass. This process can be repeated cyclically, generating tokens (e.g., words), appending them to the output of the previous pass, and applying the token embeddings of the combined sequence, with positional encoding, as input to decoder 945 in subsequent passes, generating one token at a time (called autoregression) until a symbol or token indicating the end of the response is predicted. In each decoder, the self-attention layer is typically restricted to attending only to earlier positions in the output sequence by applying a masking technique (e.g., setting future positions to negative infinity) before the softmax operation. In the example implementation, the encoder-decoder attention layer operates similarly to the (e.g., multi-head) self-attention operation in encoder 935, except that it creates its queries from the layers below it and obtains keys and values (e.g., matrices) from the output of encoder 935.
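The masking step can be illustrated with a small sketch: scores for future positions are set to negative infinity before the softmax, so their attention weights become exactly zero. The function name and the uniform toy scores are assumptions for illustration.

```python
import math

def causal_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """scores[i][j] is the raw score of position i attending to position j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Mask future positions (j > i) with negative infinity.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)  # always finite, because position i may attend to itself
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

w = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
print(w[0])  # [1.0, 0.0, 0.0]
print(w[1])  # [0.5, 0.5, 0.0]
```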
[0100] Therefore, decoder 945 can output some decoded (e.g., vector) representation of the input applied during a particular pass. Classifier 950 can include a multi-class classifier comprising one or more neural network layers that project the decoded (e.g., vector) representation onto corresponding dimensions (e.g., one dimension for each supported word or token in the output vocabulary), and a softmax operation that transforms the resulting logits into probabilities. Thus, generation mechanism 955 can select or sample words or tokens based on corresponding predicted probabilities (e.g., selecting the word with the highest predicted probability) and append each to the output of the previous pass, thereby generating each word or token sequentially. Generation mechanism 955 can repeat this process, triggering successive decoder inputs and corresponding predictions until a symbol or token representing the end of the response is selected or sampled, at which point generation mechanism 955 can output the generated response.
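The autoregressive generation loop can be sketched with a stand-in for the decoder and classifier layers. Here `stub_decoder` returns scripted logits over a hypothetical three-token vocabulary, and the generation step uses greedy selection (highest probability); sampling from the distribution would be an equally valid choice under the description above.

```python
import math

VOCAB = ["<eos>", "isaac", "newton"]  # hypothetical output vocabulary

def softmax(logits: list[float]) -> list[float]:
    """Transform logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stub_decoder(generated: list[str]) -> list[float]:
    """Stand-in for the decoder and projection layers: one logit per vocabulary entry."""
    script = {0: [0.1, 3.0, 0.2], 1: [0.1, 0.2, 3.0]}
    # After two tokens, the scripted logits favor the end-of-response token.
    return script.get(len(generated), [3.0, 0.1, 0.1])

def generate(max_tokens: int = 10) -> list[str]:
    """Generate one token per pass until the end-of-response token is selected."""
    generated: list[str] = []
    for _ in range(max_tokens):
        probs = softmax(stub_decoder(generated))
        token = VOCAB[probs.index(max(probs))]  # greedy selection
        if token == "<eos>":
            break
        generated.append(token)  # append to the output of the previous pass
    return generated

print(generate())  # ['isaac', 'newton']
```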
[0101] Figure 9C is a block diagram of an example implementation in which the generative LM 930 includes a decoder-only transformer architecture. For example, the decoder 960 of Figure 9C can operate similarly to the decoder 945 of Figure 9B, except that each decoder 960 of Figure 9C omits the encoder-decoder attention layer (because there is no encoder in this implementation). Therefore, decoders 960 can form a decoder stack, where each decoder includes a self-attention layer and a feedforward network. Furthermore, instead of encoding the input sequence, a symbol or token indicating the end of the input sequence (or the beginning of the output sequence) can be appended to the input sequence, and the resulting sequence (e.g., corresponding embeddings with positional encoding) can be applied to decoder 960. Similar to decoder 945 of Figure 9B, each token (e.g., a word) can flow through a separate path in decoder 960, and decoder 960, classifier 965, and generation mechanism 970 can use autoregression to generate one token at a time sequentially until a symbol or token indicating the end of the response is predicted. Classifier 965 and generation mechanism 970 can operate similarly to the classifier 950 and the generation mechanism 955 of Figure 9B, wherein the generation mechanism 970 selects or samples each successive output token based on the corresponding predicted probability and appends it to the output of the previous iteration, generating each token sequentially until a symbol or token representing the end of the response is selected or sampled. The architectures described herein are merely examples, and other suitable architectures may be implemented within the scope of this disclosure.
[0102] Example computing device
[0103] Figure 10 is a block diagram of an example computing device 1000 suitable for implementing some embodiments of the present disclosure. The computing device 1000 may include an interconnect system 1002 directly or indirectly coupled to the following devices: a memory 1004, one or more central processing units (CPUs) 1006, one or more graphics processing units (GPUs) 1008, a communication interface 1010, input / output (I / O) ports 1012, input / output components 1014, a power supply 1016, one or more presentation components 1018 (e.g., one or more displays), and one or more logic units 1020. In at least one embodiment, one or more computing devices 1000 may include one or more virtual machines (VMs), and / or any component thereof may include virtual components (e.g., virtual hardware components). As a non-limiting example, one or more GPUs 1008 may include one or more vGPUs, one or more CPUs 1006 may include one or more vCPUs, and / or one or more logic units 1020 may include one or more virtual logic units. Accordingly, one or more computing devices 1000 may include discrete components (e.g., a full GPU dedicated to computing device 1000), virtual components (e.g., a portion of a GPU dedicated to computing device 1000), or a combination thereof.
[0104] Although the various boxes of Figure 10 are shown as connected by lines via the interconnect system 1002, this is not intended to be limiting and is merely for clarity. For example, in some embodiments, a presentation component 1018 (such as a display device) may be considered an I / O component 1014 (e.g., if the display is a touchscreen). As another example, the CPU 1006 and / or the GPU 1008 may include memory (e.g., memory 1004 may represent a storage device in addition to the memory of the GPU 1008, the CPU 1006, and / or other components). Thus, the computing device of Figure 10 is merely illustrative. No distinction is made between categories such as "workstation," "server," "laptop computer," "desktop computer," "tablet computer," "client device," "mobile device," "handheld device," "game console," "electronic control unit (ECU)," "virtual reality system," and / or other device or system types, as all are contemplated within the scope of the computing device of Figure 10.
[0105] Interconnect system 1002 may represent one or more links or buses, such as address buses, data buses, control buses, or combinations thereof. Interconnect system 1002 may include one or more bus or link types, such as an Industry Standard Architecture (ISA) bus, an Extended Industry Standard Architecture (EISA) bus, a Video Electronics Standards Association (VESA) bus, a Peripheral Component Interconnect (PCI) bus, a Peripheral Component Interconnect Express (PCIe) bus, and / or another type of bus or link. In some embodiments, there is a direct connection between components. For example, CPU 1006 may be directly connected to memory 1004. Further, CPU 1006 may be directly connected to GPU 1008. In cases where there is a direct connection or point-to-point connection between components, interconnect system 1002 may include a PCIe link to perform that connection. In these examples, a PCI bus is not required in computing device 1000.
[0106] The memory 1004 may include any of a variety of computer-readable media. The computer-readable media may be any available medium accessible by the computing device 1000. The computer-readable media may include volatile and non-volatile media, as well as removable and non-removable media. By way of example and not limitation, the computer-readable media may include computer storage media and communication media.
[0107] Computer storage media may include volatile and non-volatile media and / or removable and non-removable media implemented using any method or technology for storing information such as computer-readable instructions, data structures, program modules, and / or other data types. For example, memory 1004 may store computer-readable instructions (e.g., representing programs and / or program elements, such as operating systems). Computer storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic tape cassettes, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible by computing device 1000. As used herein, computer storage media does not include the signal itself.
[0108] Communication media can embody computer-readable instructions, data structures, program modules, and / or other data types in a modulated data signal (such as a carrier wave or other transport mechanism) and include any information delivery medium. The term "modulated data signal" can refer to a signal that has one or more of its characteristics set or changed in a manner that encodes information in the signal. By way of example and not limitation, communication media can include wired media (such as a wired network or direct wired connection) and wireless media (such as acoustic, RF, infrared, and other wireless media). Any combination of the foregoing should also be included within the scope of computer-readable media.
[0109] CPU 1006 may be configured to execute at least some of computer-readable instructions to control one or more components of computing device 1000 to perform one or more of the methods and / or processes described herein. Each CPU 1006 may include one or more cores (e.g., 1, 2, 4, 8, 28, 72, etc.) capable of processing multiple software threads simultaneously. CPU 1006 may include any type of processor and may include different types of processors depending on the type of computing device 1000 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers). For example, depending on the type of computing device 1000, the processor may be an advanced RISC machine (ARM) processor implemented using Reduced Instruction Set Computing (RISC) or an x86 processor implemented using Complex Instruction Set Computing (CISC). In addition to one or more microprocessors or supplemental coprocessors such as math coprocessors, computing device 1000 may also include one or more CPUs 1006.
[0110] In addition to or as an alternative to the CPU 1006, one or more GPUs 1008 may be configured to execute at least some of the computer-readable instructions to control one or more components of computing device 1000 to perform one or more of the methods and / or processes described herein. One or more GPUs 1008 may be integrated GPUs (e.g., integrated with one or more of the CPUs 1006) and / or one or more GPUs 1008 may be discrete GPUs. In embodiments, one or more GPUs 1008 may be a coprocessor of one or more CPUs 1006. GPUs 1008 may be used by computing device 1000 to render graphics (e.g., 3D graphics) or perform general-purpose computing. For example, GPUs 1008 may be used for general-purpose computing on GPUs (GPGPU). GPUs 1008 may include hundreds or thousands of cores capable of processing hundreds or thousands of software threads simultaneously. GPUs 1008 may generate pixel data for output images in response to rendering commands (e.g., rendering commands received from CPU 1006 via a host interface). GPU 1008 may include graphics memory, such as display memory, for storing pixel data or any other suitable data, such as GPGPU data. Display memory may be included as part of memory 1004. GPU 1008 may include two or more GPUs operating in parallel (e.g., via a link). The link may directly connect the GPUs (e.g., using NVLink) or may connect the GPUs via a switch (e.g., using NVSwitch). When combined, each GPU 1008 may generate pixel data or GPGPU data for different portions of an output or for different outputs (e.g., a first GPU for a first image and a second GPU for a second image). Each GPU may include its own memory or may share memory with other GPUs.
[0111] In addition to or as an alternative to the CPU 1006 and / or the GPU 1008, one or more logic units 1020 may be configured to execute at least some of the computer-readable instructions to control one or more components of computing device 1000 to perform one or more of the methods and / or processes described herein. In embodiments, one or more CPUs 1006, one or more GPUs 1008, and / or one or more logic units 1020 may execute any combination of methods, processes, and / or portions thereof, discretely or jointly. One or more logic units 1020 may be part of and / or integrated into one or more CPUs 1006 and / or GPUs 1008, and / or one or more logic units 1020 may be discrete components or otherwise external to CPUs 1006 and / or GPUs 1008. In embodiments, one or more logic units 1020 may be coprocessors of one or more CPUs 1006 and / or one or more GPUs 1008.
[0112] Examples of the logic unit 1020 include one or more processing cores and / or components thereof, such as data processing units (DPUs), tensor cores (TCs), tensor processing units (TPUs), pixel vision cores (PVCs), vision processing units (VPUs), graphics processing clusters (GPCs), texture processing clusters (TPCs), streaming multiprocessors (SMs), tree traversal units (TTUs), artificial intelligence accelerators (AIAs), deep learning accelerators (DLAs), programmable vision accelerators (PVAs) (which may include one or more direct memory access (DMA) systems, one or more vision or vector processing units (VPUs), and one or more pixel processing engines (PPEs), e.g., including a 2D array of processing elements that each communicate north, south, east, and west with one or more other processing elements in the array), one or more decoupled accelerators or units (e.g., decoupled lookup table (DLUT) accelerators or units), optical flow accelerators (OFAs), field-programmable gate arrays (FPGAs), neuromorphic chips, quantum processing units (QPUs), associative processing units (APUs), arithmetic logic units (ALUs), application-specific integrated circuits (ASICs), floating-point units (FPUs), input / output (I / O) elements, peripheral component interconnect (PCI) or peripheral component interconnect express (PCIe) elements, and so on.
[0113] The communication interface 1010 may include one or more receivers, transmitters, and / or transceivers that allow the computing device 1000 to communicate with other computing devices via electronic communication networks (including wired and / or wireless communications). The communication interface 1010 may include components and functions that allow communication via any of a plurality of different networks, such as wireless networks (e.g., Wi-Fi, Z-Wave, Bluetooth, Bluetooth LE, ZigBee, etc.), wired networks (e.g., communication via Ethernet or InfiniBand), low-power wide area networks (e.g., LoRaWAN, SigFox, etc.), and / or the Internet. In one or more embodiments, the logic unit 1020 and / or the communication interface 1010 may include one or more data processing units (DPUs) for directly transmitting data received via the network and / or via the interconnect system 1002 to one or more GPUs 1008 (e.g., the memory of one or more GPUs 1008).
[0114] The I / O ports 1012 may allow the computing device 1000 to be logically coupled to other devices including the I / O components 1014, the presentation components 1018, and / or other components, some of which may be built into (e.g., integrated into) the computing device 1000. Illustrative I / O components 1014 include a microphone, mouse, keyboard, joystick, gamepad, game controller, satellite dish, scanner, printer, wireless device, etc. The I / O components 1014 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by the user. In some instances, the inputs may be passed to an appropriate network element for further processing. The NUI may implement any combination of speech recognition, stylus recognition, facial recognition, biometric recognition, on-screen and near-screen gesture recognition, air gestures, head and eye tracking, and touch recognition associated with a display of the computing device 1000 (as described in more detail below). The computing device 1000 may include depth cameras, such as stereo camera systems, infrared camera systems, RGB camera systems, touchscreen technology, and combinations thereof, for gesture detection and recognition. Additionally, the computing device 1000 may include an accelerometer or gyroscope (e.g., as part of an inertial measurement unit (IMU)) that allows motion detection. In some examples, the computing device 1000 may use the output of the accelerometer or gyroscope to render immersive augmented reality or virtual reality.
[0115] The power supply 1016 may include a hardwired power supply, a battery power supply, or a combination thereof. The power supply 1016 may provide power to the computing device 1000 to allow the components of the computing device 1000 to operate.
[0116] One or more presentation components 1018 may include displays (e.g., monitors, touchscreens, television screens, head-up displays (HUDs), other display types, or combinations thereof), speakers, and / or other presentation components. Presentation component 1018 may receive data from other components (e.g., GPU 1008, CPU 1006, DPU, etc.) and output data (e.g., as images, videos, sounds, etc.).
[0117] Example Data Center
[0118] Figure 11 shows an example data center 1100 that can be used in at least one embodiment of this disclosure. The data center 1100 may include a data center infrastructure layer 1110, a framework layer 1120, a software layer 1130, and / or an application layer 1140.
[0119] As shown in Figure 11, the data center infrastructure layer 1110 may include a resource coordinator 1112, grouped computing resources 1114, and node computing resources (“nodes CR”) 1116(1)-1116(N), where “N” represents any positive integer. In at least one embodiment, the nodes CR 1116(1)-1116(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including DPUs, accelerators, field-programmable gate arrays (FPGAs), graphics processors or graphics processing units (GPUs), etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid-state or disk drives), network input / output (“NW I / O”) devices, network switches, virtual machines (“VMs”), power modules, and / or cooling modules, etc. In some embodiments, one or more of the nodes CR 1116(1)-1116(N) may correspond to a server having one or more of the aforementioned computing resources. In addition, in some embodiments, the nodes CR 1116(1)-1116(N) may include one or more virtual components, such as vGPUs, vCPUs, etc., and / or one or more of the nodes CR 1116(1)-1116(N) may correspond to a virtual machine (VM).
[0120] In at least one embodiment, the grouped computing resources 1114 may include individual groups of nodes CR 1116 housed within one or more racks (not shown) or within a plurality of racks in data centers (also not shown) located in different geographical locations. The individual groups of nodes CR 1116 within the grouped computing resources 1114 may include grouped computing, networking, memory, or storage resources, which may be configured or allocated to support one or more workloads. In at least one embodiment, a plurality of nodes CR 1116, including CPUs, GPUs, DPUs, and / or other processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. One or more racks may also include any number of power modules, cooling modules, and / or network switches in any combination.
[0121] Resource coordinator 1112 may configure or otherwise control one or more nodes CR 1116(1)-1116(N) and / or grouped computing resources 1114. In at least one embodiment, resource coordinator 1112 may include a Software Design Infrastructure (“SDI”) management entity for data center 1100. Resource coordinator 1112 may include hardware, software, or some combination thereof.
[0122] In at least one embodiment, as shown in Figure 11, framework layer 1120 may include a job scheduler 1128, a configuration manager 1134, a resource manager 1136, and / or a distributed file system 1138. Framework layer 1120 may include a framework for supporting software 1132 of software layer 1130 and / or one or more applications 1142 of application layer 1140. Software 1132 or application 1142 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. Framework layer 1120 may be, but is not limited to, a type of free and open-source software web application framework that can use distributed file system 1138 for large-scale data processing (e.g., "big data"), such as Apache Spark™ (hereinafter referred to as "Spark"). In at least one embodiment, job scheduler 1128 may include Spark drivers to facilitate the scheduling of workloads supported by various layers of data center 1100. Configuration manager 1134 may be able to configure different layers, such as software layer 1130 and framework layer 1120 including Spark and distributed file system 1138 for supporting large-scale data processing. Resource manager 1136 may be able to manage cluster or group computing resources mapped to or allocated for supporting distributed file system 1138 and job scheduler 1128. In at least one embodiment, cluster or group computing resources may include grouped computing resources 1114 at data center infrastructure layer 1110. Resource manager 1136 may coordinate with resource coordinator 1112 to manage these mapped or allocated computing resources.
[0123] In at least one embodiment, the software 1132 included in software layer 1130 may include software used by at least portions of the nodes CR 1116(1)-1116(N), grouped computing resources 1114, and / or the distributed file system 1138 of framework layer 1120. One or more types of software may include, but are not limited to, internet web search software, email virus scanning software, database software, and streaming video content software.
[0124] In at least one embodiment, the application 1142 included in the application layer 1140 may include one or more types of applications used by at least portions of the nodes CR 1116(1)-1116(N), grouped computing resources 1114, and / or the distributed file system 1138 of the framework layer 1120. One or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications (including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), and / or other machine learning applications used in combination with one or more embodiments).
[0125] In at least one embodiment, any of the configuration manager 1134, resource manager 1136, and resource coordinator 1112 can implement any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. Self-modification actions can relieve the data center operator of data center 1100 from making potentially poor configuration decisions and may help avoid underutilized and / or poorly performing portions of the data center.
[0126] According to one or more embodiments described herein, data center 1100 may include tools, services, software, or other resources for training one or more machine learning models or using one or more machine learning models to predict or infer information. For example, one or more machine learning models may be trained by calculating weight parameters according to a neural network architecture using the software and / or computing resources described above with respect to data center 1100. In at least one embodiment, a trained or deployed machine learning model corresponding to one or more neural networks may be used to infer or predict information using the resources described above with respect to data center 1100 by using weight parameters calculated through one or more training techniques (such as, but not limited to, those described herein).
[0127] In at least one embodiment, the data center 1100 may use a CPU, application-specific integrated circuit (ASIC), GPU, FPGA, and / or other hardware (or corresponding virtual computing resources) to perform training and / or inference using the aforementioned resources. Furthermore, one or more of the aforementioned software and / or hardware resources may be configured to allow users to train or execute information inference services, such as image recognition, speech recognition, or other artificial intelligence services.
[0128] Example network environment
[0129] A network environment suitable for implementing embodiments of this disclosure may include one or more client devices, servers, network attached storage (NAS), other back-end devices, and / or other device types. Client devices, servers, and / or other device types (e.g., each device) may be implemented on one or more instances of the computing device 1000 of Figure 10; for example, each device may include similar components, features, and / or functions of computing device 1000. Furthermore, where backend devices (e.g., servers, NAS, etc.) are implemented, the backend devices may be included as part of a data center 1100, an example of which is described in more detail herein with respect to Figure 11.
[0130] Components of a network environment can communicate with each other via one or more networks, which may be wired, wireless, or both. A network can include multiple networks or a network of networks. For example, a network may include one or more wide area networks (WANs), one or more local area networks (LANs), one or more public networks (such as the Internet and / or the Public Switched Telephone Network (PSTN)), and / or one or more private networks. Where the network includes a wireless telecommunications network, components such as base stations, communication towers, or even access points (and other components) can provide wireless connectivity.
[0131] A compatible network environment may include one or more peer-to-peer network environments—in which case the network environment may not include a server—and one or more client-server network environments—in which case the network environment may include one or more servers. In a peer-to-peer network environment, the functionality described herein with respect to one or more servers can be implemented on any number of client devices.
[0132] In at least one embodiment, the network environment may include one or more cloud-based network environments, distributed computing environments, combinations thereof, etc. The cloud-based network environment may include a framework layer, job scheduler, resource manager, and distributed file system implemented on one or more servers, which may include one or more core network servers and / or edge servers. The framework layer may include a framework for software supporting the software layer and / or one or more applications supporting the application layer. The software or application may respectively include web-based service software or applications. In embodiments, one or more client devices may use web-based service software or applications (e.g., by accessing the service software and / or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, free and open-source software web application frameworks, such as those used for large-scale data processing (e.g., "big data") using distributed file systems.
[0133] A cloud-based network environment can provide cloud computing and / or cloud storage for any combination of the computing and / or data storage functions (or one or more portions thereof) described herein. Any of these different functions may be distributed across multiple locations from a central or core server (e.g., across one or more data centers distributed across states, regions, countries, globally, etc.). If the connection to the user (e.g., client device) is relatively close to the edge server, the core server may assign at least a portion of the functionality to the edge server. A cloud-based network environment can be private (e.g., limited to a single organization), public (e.g., available to many organizations), and / or a combination thereof (e.g., a hybrid cloud environment).
[0134] One or more client devices may include at least some of the components, features, and functions of the one or more example computing devices 1000 described herein with respect to Figure 10. By way of example and not limitation, the client device may be embodied as a personal computer (PC), laptop computer, mobile device, smartphone, tablet computer, smartwatch, wearable computer, personal digital assistant (PDA), MP3 player, virtual reality headset, global positioning system (GPS) or device, video player, camera, surveillance equipment or system, vehicle, ship, spacecraft, virtual machine, drone, robot, handheld communication device, hospital equipment, gaming device or system, entertainment system, vehicle computer system, embedded system controller, remote control, electrical appliance, consumer electronics device, workstation, edge device, any combination of these described devices, or any other suitable device.
[0135] This disclosure can be described in the general context of computer code or machine-usable instructions (including computer-executable instructions, such as program modules) that are executed by a computer or other machine (such as a personal data assistant or other handheld device). Typically, a program module, including routines, programs, objects, components, data structures, etc., refers to code that performs a specific task or implements a specific abstract data type. This disclosure can be implemented in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, and more specialized computing devices. This disclosure can also be implemented in a distributed computing environment where tasks are performed by remote processing devices linked via a communication network.
[0136] As used herein, the phrase “and / or” relating to two or more elements should be interpreted as meaning only one element, or a combination of elements. For example, “element A, element B, and / or element C” can include only element A, only element B, only element C, element A and element B, element A and element C, element B and element C, or element A, B, and C. Furthermore, “at least one of element A or element B” can include at least one of element A, at least one of element B, or at least one of element A and at least one of element B. Additionally, “at least one of element A and element B” can include at least one of element A, at least one of element B, or at least one of element A and at least one of element B.
[0137] The subject matter of this disclosure is described herein with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this disclosure. Rather, the inventors have contemplated that the claimed subject matter may also be embodied in other ways, in combination with other current or future techniques, to include different steps or combinations of steps similar to those described in this document. Furthermore, although the terms “step” and / or “block” may be used herein to refer to different elements of the methods employed, such terms should not be construed as implying any particular order among or between the various steps disclosed herein, unless and except when the order of individual steps is explicitly described.
[0138] Example paragraphs
[0139] A: A method comprising: obtaining, at a planner node associated with an application that provides a natural language interface for a client device to communicate with a plurality of services, input data representing one or more queries sent by the client device; mapping, based at least on the planner node using one or more first language models to process the input data and information associated with the plurality of services, the one or more queries to one or more tool nodes associated with the application, wherein each of the one or more tool nodes is mapped to one or more services of the plurality of services; generating, based at least on the one or more tool nodes using one or more second language models to process at least a portion of the input data and one or more application programming interface (API) specifications associated with the one or more services, first text data representing one or more API calls for the one or more services; obtaining, at a response generation node associated with the application and based at least on the one or more tool nodes executing the one or more API calls, second text data representing one or more responses from the one or more services; generating, based at least on the response generation node using one or more third language models to process at least the second text data and the one or more API specifications, output data representing one or more responses to the one or more queries; and sending the output data to the client device for rendering using the client device.
[0140] B: The method described in paragraph A further includes: using at least one or more second language models to process at least a portion of the input data and the one or more API specifications based on at least one or more tool nodes; and determining at least: one or more API endpoints associated with the one or more services; and one or more parameters to be included in the one or more API calls, wherein the generation of the first text data representing the one or more API calls is based at least on the determination of the one or more API endpoints and the determination of the one or more parameters.
[0141] C: The method as described in any one of paragraphs A and B further includes: processing the input data using the one or more first language models at least based on the planner node, decomposing the one or more queries into at least a first query and a second query, wherein mapping the one or more queries to the one or more tool nodes includes: mapping the first query to a first tool node and mapping the second query to a second tool node.
[0142] D: The method as described in any one of paragraphs A-C further includes: processing the input data using the one or more second language models based at least on the one or more tool nodes, decomposing the input data into at least a first part of the input data and a second part of the input data, the first part corresponding to a first query in the one or more queries, and the second part corresponding to a second query in the one or more queries, wherein generating the first text data representing the one or more API calls includes: generating the first part of the first text data representing the first API call corresponding to the first query; and generating the second part of the first text data representing the second API call corresponding to the second query.
[0143] E: The method as described in any one of paragraphs A-D, wherein the one or more responses comprise one or more multimodal responses, and the output data comprises a combination of two or more of text data, image data, video data, or audio data.
[0144] F: The method as described in any one of paragraphs A-E, wherein the information associated with the plurality of services indicates at least one of one or more tools, one or more capabilities, or one or more functions of each of the plurality of services.
[0145] G: The method as described in any one of paragraphs A-F, wherein the one or more API specifications associated with the one or more services indicate at least: one or more endpoints corresponding to one or more APIs for the one or more services; one or more functions of the one or more endpoints; one or more parameters to be included in one or more requests sent to the one or more endpoints; and one or more formats of the one or more requests.
[0146] H: The method as described in any one of paragraphs A-G, wherein the generation of the output data is further based on the response generation node using the one or more third language models to at least process the input data, the second text data, and response format information from the one or more API specifications.
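The flow recited in paragraphs A-H can be sketched, under heavy simplification, as three cooperating functions. In the Python sketch below, keyword matching and string templates stand in for the first, second, and third language models, and a local function stands in for the remote services. The service names, endpoints, the `ApiSpec` fields, the `context` dictionary of already-extracted parameter values, and the canned responses are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ApiSpec:
    """Hypothetical, minimal API specification for one service."""
    endpoint: str          # endpoint path
    method: str            # request format (HTTP verb)
    parameters: list       # parameter names the request must carry
    response_format: str   # template describing how to phrase a response


SPECS = {
    "weather": ApiSpec("/v1/weather", "GET", ["city"],
                       "It is {temp_c} degrees Celsius in {city}."),
    "storage": ApiSpec("/v1/buckets", "GET", ["owner"],
                       "{owner} has {count} storage buckets."),
}
# Information about what each service can do, as seen by the planner.
SERVICE_INFO = {"weather": ["weather", "temperature"],
                "storage": ["bucket", "storage"]}


def planner_node(query):
    """Stand-in for the first language model: map a query to tool nodes."""
    text = query.lower()
    return [name for name, keywords in SERVICE_INFO.items()
            if any(keyword in text for keyword in keywords)]


def tool_node(service, context):
    """Stand-in for the second language model: pick the endpoint and
    parameters from the API specification and emit the call as text."""
    spec = SPECS[service]
    params = {name: context[name] for name in spec.parameters}
    return json.dumps({"method": spec.method, "endpoint": spec.endpoint,
                       "params": params}, sort_keys=True)


def execute_api_call(call_text):
    """Fake service backend returning a canned JSON response as text."""
    call = json.loads(call_text)
    if call["endpoint"] == "/v1/weather":
        return json.dumps({"city": call["params"]["city"], "temp_c": 21})
    return json.dumps({"owner": call["params"]["owner"], "count": 3})


def response_generation_node(service, response_text):
    """Stand-in for the third language model: phrase the raw response
    using response format information from the API specification."""
    return SPECS[service].response_format.format(**json.loads(response_text))


def answer(query, context):
    """Run planner -> tool node -> API call -> response generation."""
    replies = []
    for service in planner_node(query):
        call_text = tool_node(service, context)
        response_text = execute_api_call(call_text)
        replies.append(response_generation_node(service, response_text))
    return " ".join(replies)
```

For instance, `answer("What is the weather, and how many buckets do I have?", {"city": "Oslo", "owner": "alice"})` routes the query to both hypothetical tool nodes and joins the two phrased responses. In an actual system the parameter values would be extracted from the query by a language model rather than supplied in `context`.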
[0147] I: A system comprising: one or more processors configured to: obtain input data representing a query from a computing device; generate first text data representing one or more API calls to one or more services using one or more language models and based at least on a portion of the input data and one or more application programming interface (API) specifications associated with the one or more services; receive second text data representing one or more responses to the one or more API calls from the one or more services and based at least on the execution of the one or more API calls; generate output data representing one or more responses to the query using the one or more language models and based at least on the second text data and the one or more API specifications; and transmit the output data to the computing device.
[0148] J: The system as described in paragraph I, wherein the one or more processors are further configured to: map the query to the one or more services using the one or more language models and based at least on the input data and information associated with a plurality of services; obtain the one or more API specifications associated with the one or more services based at least on the mapping; and apply the one or more API specifications and the portion of the input data to the one or more language models.
[0149] K: The system as described in any one of paragraphs I or J, wherein the one or more processors are further configured to: determine, using the one or more language models and based at least on the portion of the input data and the one or more API specifications, at least: one or more API endpoints associated with the one or more services; and one or more parameters to be included in the one or more API calls, wherein the generation of the first text data is based at least on the determination of the one or more API endpoints and the one or more parameters.
[0150] L: The system as described in any one of paragraphs I-K, wherein the one or more processors are further configured to: decompose the query into at least a first query and one or more second queries using the one or more language models, wherein the generation of the first text data includes: generating, using the one or more language models and based at least on the one or more API specifications and a first portion of the input data corresponding to the first query, one or more first portions of the first text data representing one or more first API calls to the one or more services; and generating, using the one or more language models and based at least on the one or more API specifications and a second portion of the input data corresponding to the one or more second queries, one or more second portions of the first text data representing one or more second API calls to the one or more services.
[0151] M: The system as described in any one of paragraphs I-L, wherein the one or more processors are further configured to: determine, using the one or more language models and based at least on the input data and the one or more API specifications, an order of an API call sequence for a series of services among a plurality of services; and execute the API call sequence by sending one or more portions of the first text data to the series of services based at least on the order.
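Paragraph M leaves the ordering decision to the language models. The sketch below illustrates only the execution side, and it swaps in an assumption: the model is taken to have already produced, for each planned API call, the list of calls it depends on, so that an order can be derived with a topological sort from the Python standard library. The call names, call texts, and `send` callback are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical planner output: each planned call lists the calls it depends on.
planned_calls = {
    "geocode": {"depends_on": [], "text": "GET /v1/geocode?city=Paris"},
    "forecast": {"depends_on": ["geocode"], "text": "GET /v1/forecast?lat={lat}&lon={lon}"},
    "summarize": {"depends_on": ["forecast"], "text": "POST /v1/summary"},
}


def call_order(calls: Dict[str, dict]) -> List[str]:
    """Return an order in which every call follows the calls it depends on."""
    graph = {name: set(call["depends_on"]) for name, call in calls.items()}
    return list(TopologicalSorter(graph).static_order())


def execute_in_order(calls: Dict[str, dict],
                     send: Callable[[str, str, Dict[str, str]], str]) -> Dict[str, str]:
    """Send each call text to its service in order, passing along earlier results."""
    results: Dict[str, str] = {}
    for name in call_order(calls):
        results[name] = send(name, calls[name]["text"], results)
    return results


order = call_order(planned_calls)

# Stub sender that records which earlier results were available to each call.
results = execute_in_order(
    planned_calls,
    lambda name, text, earlier: f"{name} ran after {sorted(earlier)}",
)
```

If the language model returns an explicit ordered list instead of dependencies, `call_order` can be replaced by that list and `execute_in_order` is unchanged.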
[0152] N: The system as described in any one of paragraphs I-M, wherein the one or more processors are further configured to: send a first request to the one or more language models to identify, among a plurality of services, the one or more services to be invoked to generate the one or more responses to the query; and send a second request to the one or more language models to generate the first text data representing the one or more API calls, the one or more API calls being based at least on the one or more API specifications and including at least: one or more endpoint identifiers corresponding to one or more APIs for the one or more services; and one or more parameters indicating one or more operations to be performed by at least one of the one or more APIs or the one or more services, wherein generating the first text data using the one or more language models is based at least on sending the first request and the second request.
[0153] O: The system as described in any one of paragraphs I-N, wherein, based at least on the execution of the one or more API calls, the one or more services are to: convert the first text data into a plurality of values using one or more second language models and based at least on the query; and generate, using one or more machine learning models and based at least on the plurality of values, at least one of a visual representation or a plurality of predicted values associated with the plurality of values.
[0154] P: The system as described in any one of paragraphs I-O, wherein the one or more processors are further configured to: determine a ranking of each of a plurality of services using the one or more language models and based at least on the input data and information associated with the plurality of services; select the one or more services from the plurality of services based at least on one or more first rankings of the one or more services being higher than one or more second rankings of one or more second services; and send the first text data representing the one or more API calls to one or more API endpoints associated with the one or more services.
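The selection step in paragraph P can be illustrated with a short sketch. It assumes that the language models have already assigned each service a numeric relevance score for the query; the scores, the service names, and the `top_k` parameter are hypothetical.

```python
from typing import Dict, List


def select_services(scores: Dict[str, float], top_k: int = 1) -> List[str]:
    """Rank services by score (highest first) and keep the top_k of them."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Hypothetical scores produced by a language model for one query.
scores = {"weather": 0.92, "calendar": 0.15, "maps": 0.40}

selected = select_services(scores)           # single best service
selected_two = select_services(scores, 2)    # two highest-ranked services
```

Only the selected services would then receive the generated API calls; lower-ranked services are not invoked.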
[0155] Q: The system as described in any one of paragraphs I-P, wherein the system is included in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation of 3D assets; a system for performing one or more deep learning operations; a system implemented using edge devices; a system implemented using robots; a system for performing one or more generative AI operations; a system for performing operations using large language models; a system for performing operations using small language models; a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multimodal language models; a system for using or deploying one or more inference microservices; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system comprising one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
[0156] R: One or more processors, comprising: processing circuitry to: route one or more natural language queries to one or more tools mapped to one or more services of a plurality of services, based at least on using one or more language models to process at least the one or more natural language queries and data indicating one or more specifications associated with the one or more services; and generate one or more API calls for the one or more services, based at least on using the one or more language models to process at least the one or more natural language queries and one or more application programming interface (API) specifications associated with the one or more services.
[0157] S: The one or more processors as described in paragraph R, wherein the one or more services include one or more Representational State Transfer (REST) API services.
[0158] T: The one or more processors as described in any one of paragraphs R or S, wherein the one or more processors are included in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation of 3D assets; a system for performing one or more deep learning operations; a system implemented using edge devices; a system implemented using robots; a system for performing one or more generative AI operations; a system for performing operations using large language models; a system for performing operations using small language models; a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multimodal language models; a system for using or deploying one or more inference microservices; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system comprising one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
Claims
1. A method comprising: obtaining, at a planner node associated with an application that provides a natural language interface for a client device to communicate with a plurality of services, input data representing one or more queries sent by the client device; mapping, based at least on the planner node using one or more first language models to process the input data and information associated with the plurality of services, the one or more queries to one or more tool nodes associated with the application, wherein each of the one or more tool nodes is mapped to one or more services of the plurality of services; generating, based at least on the one or more tool nodes using one or more second language models to process at least a portion of the input data and one or more application programming interface (API) specifications associated with the one or more services, first text data representing one or more API calls to the one or more services; obtaining, at a response generation node associated with the application and based at least on the one or more tool nodes causing execution of the one or more API calls, second text data representing one or more responses from the one or more services; generating, based at least on the response generation node using one or more third language models to process at least the second text data and the one or more API specifications, output data representing one or more responses to the one or more queries; and sending the output data to the client device for presentation using the client device.
2. The method of claim 1, further comprising: determining, based at least on the one or more tool nodes using the one or more second language models to process the at least a portion of the input data and the one or more API specifications, at least: one or more API endpoints associated with the one or more services; and one or more parameters to be included in the one or more API calls, wherein the generating of the first text data representing the one or more API calls is based at least on the determining of the one or more API endpoints and the one or more parameters.
3. The method of claim 1, further comprising: decomposing, based at least on the planner node using the one or more first language models to process the input data, the one or more queries into at least a first query and a second query, wherein the mapping of the one or more queries to the one or more tool nodes includes mapping the first query to a first tool node and mapping the second query to a second tool node.
4. The method of claim 1, further comprising: decomposing, based at least on the one or more tool nodes using the one or more second language models to process the input data, the input data into at least a first portion and a second portion, wherein the first portion corresponds to a first query of the one or more queries and the second portion corresponds to a second query of the one or more queries, and wherein the generating of the first text data representing the one or more API calls includes: generating a first portion of the first text data representing a first API call corresponding to the first query; and generating a second portion of the first text data representing a second API call corresponding to the second query.
5. The method of claim 1, wherein the one or more responses include one or more multimodal responses, and the output data includes a combination of two or more of text data, image data, video data, or audio data.
6. The method of claim 1, wherein the information associated with the plurality of services indicates at least one of one or more tools, one or more capabilities, or one or more functions of each of the plurality of services.
7. The method of claim 1, wherein the one or more API specifications associated with the one or more services indicate at least: one or more endpoints corresponding to one or more APIs for the one or more services; one or more functions of the one or more endpoints; one or more parameters to be included in one or more requests sent to the one or more endpoints; and one or more formats of the one or more requests.
8. The method of claim 1, wherein the generating of the output data is further based at least on the response generation node using the one or more third language models to process at least the input data, the second text data, and response format information from the one or more API specifications.
9. A system comprising: one or more processors to: obtain input data representing a query from a computing device; generate, using one or more language models and based at least on a portion of the input data and one or more application programming interface (API) specifications associated with one or more services, first text data representing one or more API calls to the one or more services; receive, from the one or more services and based at least on execution of the one or more API calls, second text data representing one or more responses to the one or more API calls; generate, using the one or more language models and based at least on the second text data and the one or more API specifications, output data representing one or more responses to the query; and send the output data to the computing device.
10. The system of claim 9, wherein the one or more processors are further to: map, using the one or more language models and based at least on the input data and information associated with a plurality of services, the query to the one or more services; obtain, based at least on the mapping, the one or more API specifications associated with the one or more services; and apply the one or more API specifications and the portion of the input data to the one or more language models.
11. The system of claim 9, wherein the one or more processors are further to: determine, using the one or more language models and based at least on the portion of the input data and the one or more API specifications, at least: one or more API endpoints associated with the one or more services; and one or more parameters to be included in the one or more API calls, wherein the generation of the first text data is based at least on the determination of the one or more API endpoints and the one or more parameters.
12. The system of claim 9, wherein the one or more processors are further to: decompose, using the one or more language models, the query into at least a first query and one or more second queries, wherein the generation of the first text data includes: generating, using the one or more language models and based at least on the one or more API specifications and a first portion of the input data corresponding to the first query, one or more first portions of the first text data representing one or more first API calls to the one or more services; and generating, using the one or more language models and based at least on the one or more API specifications and one or more second portions of the input data corresponding to the one or more second queries, one or more second portions of the first text data representing one or more second API calls to the one or more services.
13. The system of claim 9, wherein the one or more processors are further to: determine, using the one or more language models and based at least on the input data and the one or more API specifications, an order of an API call sequence for a series of services among a plurality of services; and execute the API call sequence by sending one or more portions of the first text data to the series of services based at least on the order.
14. The system of claim 9, wherein the one or more processors are further to: send a first request to the one or more language models to identify the one or more services to be invoked to generate the one or more responses to the query; and send a second request to the one or more language models to generate the first text data representing the one or more API calls, the one or more API calls being based at least on the one or more API specifications and including at least: one or more endpoint identifiers corresponding to one or more APIs for the one or more services; and one or more parameters indicating one or more operations to be performed by at least one of the one or more APIs or the one or more services, wherein the generation of the first text data using the one or more language models is based at least on sending the first request and the second request.
15. The system of claim 9, wherein, based at least on the execution of the one or more API calls, the one or more services are to: convert, using one or more second language models and based at least on the query, the first text data into a plurality of values; and generate, using one or more machine learning models and based at least on the plurality of values, at least one of a visual representation or a predicted value associated with the plurality of values.
16. The system of claim 9, wherein the one or more processors are further to: determine, using the one or more language models and based at least on the input data and information associated with a plurality of services, a ranking of each of the plurality of services; select the one or more services from the plurality of services based at least on one or more first rankings of the one or more services being higher than one or more second rankings of one or more second services; and send the first text data representing the one or more API calls to one or more API endpoints associated with the one or more services.
17. The system of claim 9, wherein the system is included in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation of 3D assets; a system for performing one or more deep learning operations; a system implemented using edge devices; a system implemented using robots; a system for performing one or more generative AI operations; a system for performing operations using large language models; a system for performing operations using small language models; a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multimodal language models; a system for using or deploying one or more inference microservices; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system comprising one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
18. One or more processors, comprising: processing circuitry to: route, based at least on using one or more language models to process at least one or more natural language queries and data indicating one or more specifications associated with one or more services of a plurality of services, the one or more natural language queries to one or more tools mapped to the one or more services; and generate, based at least on using the one or more language models to process at least the one or more natural language queries and one or more application programming interface (API) specifications associated with the one or more services, one or more API calls for the one or more services.
19. The one or more processors of claim 18, wherein the one or more services include one or more representational state transfer (REST) API services.
20. The one or more processors of claim 18, wherein the one or more processors are included in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation of 3D assets; a system for performing one or more deep learning operations; a system implemented using edge devices; a system implemented using robots; a system for performing one or more generative AI operations; a system for performing operations using large language models; a system for performing operations using small language models; a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multimodal language models; a system for using or deploying one or more inference microservices; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system comprising one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.