System and method for building and augmenting a semantic data layer for intelligent applications
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- JEDIFY LTD
- Filing Date
- 2025-12-05
- Publication Date
- 2026-06-11
Abstract
Description
JEDIFY-0005USPCT
System and method for building and augmenting a semantic data layer for intelligent applications
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to U.S. Provisional Serial No. 63 / 728,792, filed December 6, 2024, the disclosure of which is hereby incorporated by reference in its entirety as if fully set forth herein.
[0002] The present application is also related to PCT Serial No. PCT / US25 / 34905, filed June 24, 2025, and U.S. Provisional Serial No. 63 / 663,430, filed June 24, 2024, the disclosures of each of which are hereby incorporated by reference in their entirety as if fully set forth herein.
FIELD
[0003] The present disclosure relates generally to devices, systems and computer-implemented methods for building and augmenting a semantic data layer from heterogeneous data sources (including business-intelligence tools, transformation frameworks, data warehouse platforms, and unstructured documentation) and for validating the semantic integrity of analytical outputs generated therefrom and retrieving desired data.
BACKGROUND
[0004] Modern enterprises, such as organizations, accumulate and store data across disparate repositories: cloud data warehouses, relational databases, transformation projects (e.g., Extract-Load-Transform or “dbt” models), business-intelligence (BI) workbooks, and ad-hoc knowledge bases such as wikis or intranet sites. Conventional extract / transform / load (ETL) or catalog products enumerate these sources but fail to capture the semantic intent that links them, e.g., how a column labeled sales_amt inside a warehouse fact table maps to a KPI called “Gross Sales” defined in a BI dashboard or to a metric explained in marketing collateral. Absent such linkage, a technological problem arises in that there exists a gap between input text, sometimes in the form of business questions, and data. Business questions must be translated and converted into technical queries, resulting in technological problems such as risking errors in data retrieval and drift as schemas evolve. There is therefore a need for a unified system that (i) connects dynamically to a plurality of source types, (ii) constructs an application-ready semantic model, and (iii) continually augments and validates that model as source definitions change.
[0005] Large Language Models (LLMs) are complex Natural Language Processing (NLP) models that, in certain general-purpose applications, have proved to have superior zero-shot performance. The use of LLMs, however, for free text retrieval methods can provide poor results, because of the above-described gap between the input text and the data source. This is because LLMs have little to no context for how the data contained within the data source is organized.
[0006] Furthermore, even when technical approaches enable LLMs to access data sources, they fail to provide adequate business context. Business logic must be repeatedly defined in every prompt (e.g., "revenue = closed deals where status='won'"), leading to definition drift across different AI agents and applications. This results in inconsistent outputs, token inefficiency, and exponential integration complexity when multiple data sources and AI agents interact.
[0007] One approach to bridge the gap between input text and data source is to inject a technical topology of the data / schema into the LLM. This approach still lacks contextualization of the data and concepts within the data source.
BRIEF SUMMARY
[0008] The present application overcomes the disadvantages of the prior art by providing a system and method for retrieving desired data from a data source based upon a natural language text input using an autonomous semantic layer. The system and method for retrieving desired data from a data source based upon a natural language text input using an autonomous semantic layer represents an improvement in the function of a computer and, in particular, a solution to the technological problem that exists in bridging the gap between input text and data sources, by providing relevant and contextualized data in response to textual input. The autonomous semantic layer advantageously integrates with existing data sources, providing an entity or enterprise (such as an organization, such as a business or a department within a business, such as revenue operations, marketing, operations, etc., or a data team authorized to access or with access to such data) with an intuitive, user-friendly interface for extracting meaningful insights from complex data.
[0009] The semantic layer serves as a contextual layer that can be accessed by one or more Artificial Intelligence (AI) applications, one or more AI agents, and / or one or more tools (including via Model Context Protocol (MCP) implementations), ensuring consistent business definitions and eliminating the need to repeat context in every prompt. This enables token-efficient queries and linear scaling as new data sources or AI applications are added.
[0010] One aspect of the present disclosure provides a computer-implemented method for retrieving desired data from a data source, comprising: receiving a first query, the first query being a request for desired data from a data source; generating a computational code, based upon the first query, to retrieve the desired data from the data source.
[0011] In one example, generating the computational code further comprises: generating a graphical representation of one or more concepts; generating a second computational code that associates the one or more concepts with data from the data source; and identifying a subset of relevant concepts from the one or more concepts.
[0012] In one example, the graphical representation comprises at least one node and at least one edge.
[0013] In one example, the node represents at least one concept and the edge represents at least one concept connection.
[0014] In one example, generating the graphical representation of one or more concepts further comprises: identifying the one or more concepts; and identifying one or more concept connections.
[0015] In one example, the first query includes complex prompts comprising multiple related queries and / or conversational interactions.
[0016] In one example, the first query originates from: one or more Artificial Intelligence (AI) applications; one or more AI agents; and / or one or more tools.
[0017] In one example, the one or more tools implement a Model Context Protocol (MCP) to communicate the first query.
[0018] In one example, a model is configured to identify the subset of relevant concepts from the one or more concepts.
[0019] In one example, the model comprises one of: a classifier, or a bi-modal embedding space.
[0020] In one example, the classifier is trained using the one or more concepts.
[0021] In one example, the bi-modal embedding space is trained using a first encoder and a second encoder.
[0022] In one example, the first encoder embeds the first query into a vector space and the second encoder embeds the one or more business concepts into the vector space.
[0023] In one example, the computational code comprises Python, SQL, or an API call.
[0024] Another aspect of the disclosure provides a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform a method for retrieving desired data from a data source, the method comprising: receiving a first query, the first query being a request for desired data from a data source; generating a computational code, based upon the first query, to retrieve the desired data from the data source.
[0025] Another aspect of the disclosure provides one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform a method for retrieving desired data from a data source, the method comprising: receiving a first query, the first query being a request for desired data from a data source; generating a computational code, based upon the first query, to retrieve the desired data from the data source.
[0026] Another aspect of the disclosure provides a computer-implemented method, comprising: receiving, via a user interface, uniform-resource locators (URLs) that describe one or more websites in order to capture business context expressed in natural language; receiving connection credentials for a plurality of heterogeneous data sources; profiling the connected plurality of heterogeneous data sources to generate an initial data catalog and to retrieve candidate metrics; receiving a textual use-case definition that scopes organizational intent; generating, by a language model, one or more business questions derived from the textual use-case definition, and iteratively refining the generated one or more business questions based on user feedback to capture domain-specific terminology; inferring concrete definitions and formulae from the one or more business questions and the initial data catalog; responsive to a build command, executing per-source processing pipelines to enrich semantic objects; unifying semantic objects across sources via similarity detection and conflict resolution; and validating and confirming a unified semantic model.
[0027] In one example, the plurality of heterogeneous data sources comprises at least one of: one or more business intelligence (BI) tools; one or more data-transformation services; one or more data warehouses; and / or one or more distributed document repositories.
[0028] In one example, the initial data catalog includes one or more of: table schemas, column names, data types, and / or statistical distributions.
[0029] In one example, the candidate metrics are identified by numeric columns and potential aggregation patterns.
[0030] In one example, unifying semantic objects includes: (i) identifying potential matches where detected similarity scores exceed a threshold; (ii) applying rule-based conflict resolution strategies; (iii) prompting users to manually resolve ambiguous cases through the user interface; and / or (iv) creating unified semantic entities that map to multiple physical columns across sources, storing the mappings in metadata tables.
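The threshold-based matching of items (i), (iii), and (iv) can be illustrated with a minimal sketch. It assumes name-based string similarity from Python's standard `difflib` as a stand-in for whatever similarity detection the system uses; the function names, the two thresholds, and the object fields are hypothetical, and the rule-based conflict resolution of item (ii) is not modeled.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a learned similarity score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def unify(objects, match_threshold=0.8, review_threshold=0.6):
    """Merge semantic objects whose names are similar enough (thresholds are assumed values)."""
    entities, review = [], []
    for obj in objects:
        best, best_score = None, 0.0
        for entity in entities:
            score = similarity(obj["name"], entity["name"])
            if score > best_score:
                best, best_score = entity, score
        if best is not None and best_score >= match_threshold:
            # Confident match: map one unified entity to another physical column.
            best["mappings"].append((obj["source"], obj["column"]))
        else:
            if best is not None and best_score >= review_threshold:
                # Ambiguous: queue the pair for manual resolution by a user.
                review.append((obj["name"], best["name"]))
            entities.append({"name": obj["name"],
                             "mappings": [(obj["source"], obj["column"])]})
    return entities, review

objects = [
    {"name": "gross_sales", "source": "warehouse", "column": "fact_sales.sales_amt"},
    {"name": "Gross Sales", "source": "bi_dashboard", "column": "Gross Sales"},
    {"name": "customer_id", "source": "warehouse", "column": "dim_customer.id"},
]
entities, review = unify(objects)
```

In this sketch the warehouse column and the dashboard KPI collapse into a single entity with two mappings, while `customer_id` remains a separate entity.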
[0031] In one example, validating and confirming the unified semantic model comprises one or more of: (i) detecting ambiguous definitions where multiple conflicting interpretations exist using entropy measures on definition variations; (ii) generating sample queries against the unified semantic model and comparing results to expected outcomes; and / or (iii) presenting validation results to users for confirmation through the user interface.
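One possible reading of the entropy measure in item (i) is Shannon entropy over the distinct definition variants collected for a single semantic entity. The sketch below assumes that reading; the normalization step and the threshold value are illustrative assumptions, not details taken from the disclosure.

```python
import math
from collections import Counter

def definition_entropy(definitions):
    """Shannon entropy (in bits) of the distribution of definition variants."""
    # Normalize case and whitespace so trivially different spellings count as one variant.
    normalized = [" ".join(d.lower().split()) for d in definitions]
    counts = Counter(normalized)
    total = len(normalized)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_ambiguous(definitions, threshold_bits=0.5):
    """Flag an entity whose definitions disagree; the threshold is an assumed value."""
    return definition_entropy(definitions) > threshold_bits

consistent = ["SUM(amount)", "sum(amount)", "SUM(amount)"]
conflicting = ["SUM(amount)", "SUM(amount) - SUM(refunds)", "SUM(net_amount)"]
```

An entity whose sources all agree has zero entropy, while three equally frequent conflicting formulas give log2(3) bits and would be surfaced for user confirmation.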
[0032] In one example, the per-source processing pipelines include one or more of: (i) clustering similar columns across sources using embedding vectors generated by a Language Model Service and k-means or hierarchical clustering algorithms; (ii) labeling clusters with semantic types using classification models trained on labeled examples; (iii) analyzing relationships between entities by detecting foreign key relationships and join patterns in query logs; and / or (iv) distilling essential attributes by removing redundant or low-information columns based on variance and correlation analysis.
[0033] Another aspect of the disclosure provides a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the method or methods described above.
[0034] Another aspect of the disclosure provides one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the method or methods described above.
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] A more complete appreciation of the subject matter of the present disclosure and the various advantages thereof can be realized by reference to the following detailed description in which reference is made to the accompanying drawings in which:
[0036] Fig. 1 is a block diagram of a computing system according to one or more aspects of the disclosure;
[0037] Fig. 2 is a flow chart depicting a method of retrieving desired data from a data source;
[0038] Fig. 3 is a flow chart depicting a method of generating a computational code to satisfy the first query;
[0039] Fig. 3A depicts an example graphical representation;
[0040] Fig. 3B depicts a schematic diagram of a model;
[0041] Fig. 4 is a flow chart that depicts a method for graphically representing one or more concepts;
[0042] Fig. 5 is a method of training and operating a bi-modal embedding space; and
[0043] Figs. 5A-5C are schematic diagrams of training and operating a bi-modal embedding space.
[0044] Fig. 6 is a block diagram of an onboarding pipeline that constructs a semantic data layer from heterogeneous data sources.
[0045] Fig. 7 is a schematic of a connector architecture configured for live schema-drift monitoring and incremental semantic augmentation.
DETAILED DESCRIPTION
[0046] Fig. 1 is a block diagram showing a computing system 100, according to one or more aspects of the disclosure, that may implement one or more techniques described in the present application.
[0047] As shown, the system 100 can include at least one computing device 105. The at least one computing device 105 can be any type of computing device, such as a personal computer, laptop computer, mobile device, tablet computer, wearable device, AR / VR headset, etc. As shown, the computing device 105 can include one or more processor(s) 110 that communicate with one or more devices, modules, or components via a bus, such as one or more memory modules 120, and / or any other components typically present in general purpose computers. The memory 120 can be a tangible non-transitory computer-readable medium and can store information accessible by the processor(s) 110, such as program instructions 122 (e.g., computation code) that may be retrieved and executed by the processor and / or data 124 that may be retrieved, manipulated, or stored by the processor(s) 110 or any other component of the computing device 105. The processor 110 can be any type of processor, such as one or more CPU(s), GPU(s), TPU(s), and / or NPU(s). The memory 120 can be any type of memory, such as volatile or non-volatile types of memory. In particular, the memory can include one or more of the following: ROM, such as Mask ROM, PROM, EPROM, EEPROM; NVRAM, such as Flash memory; Early stage NVRAM, such as nvSRAM, FeRAM, MRAM, or PRAM, or any other type, such as CBRAM, SONOS, RRAM, Racetrack memory, NRAM, Millipede memory, or FJG. Any of the methods, routines, procedures, steps, blocks, etc., discussed herein in the present disclosure can be implemented as a set of program instructions stored in the memory 120, such that, when executed by the processor 110, can cause the processor 110 to perform the corresponding method, routine, procedure, step, block, etc.
[0048] Although depicted as single elements in Fig. 1, the processor(s) 110 and memory 120 can respectively comprise one or more processors and / or one or more memory elements that can communicate by a wired / wireless connection. In this regard, the processor 110 can comprise a plurality of processors (e.g., one or more CPU(s), GPU(s), TPU(s), and / or NPU(s)) that can cooperate to execute program instructions. Moreover, the memory 120 can comprise a plurality of memories that can cooperate to store instructions and / or data.
[0049] The computing device 105 can also accept user input according to a number of methods, such as by a mouse, keyboard, trackpad, touchscreen interface, gesture, microphone, or the like (not shown). The computing device 105 can also be operably connected to a display (not shown).
[0050] The computing device 105 can also include one or more components to allow for wired or wireless communication via link 128 with any other computing device, such as server cloud 130 or computing device 140. The server cloud 130 can comprise one or more server computing devices, where such server computing devices can include similar components to those set forth with respect to device 105. Similarly, computing device 140 can include a processor 142, memory 144, instructions 146, and data 148, like the features set forth with respect to device 105 and can be connected directly or indirectly to computing device 105 via link 132 and server cloud 130.
[0051] Fig. 2 is a flow chart 200 depicting a method of retrieving desired data from a data source by an autonomous semantic layer (also referred to as “the system”). In this regard, the autonomous semantic layer acts between the user and the data source in order to receive the request from the user and to retrieve the desired data from the data source.
[0052] At block 205, a first query is received. The first query may be provided by a user of a computing device, such as any of computing devices 105 and / or 140. The first query can be a natural language input of any type, for example a text query or an audio query. In the example of a text-based query, the query can be text entered directly by the user, or can be a file in the form of a PDF, worksheet, .csv file, Excel file, or any type of file. In the example of an audio query, the query can be in the form of a spoken query, a pre-recorded audio query, or an audio result of a text-to-speech input.
[0053] In some examples, the first query is a request or command made by the user, such as a request or command to retrieve data from a data source. For example, the first query may be a request for data from a data source, such as “What is the monthly trend in company signups in the last three months?”
[0054] In some embodiments, the system handles complex prompts containing multiple related queries or conversational interactions. The semantic layer maintains context across multiple queries, enabling AI agents to access consistent business definitions without repeating context in each prompt. For example, a prompt such as "Show me Q4 revenue by product, then compare it to Q3 and identify which products grew fastest" would be decomposed into multiple semantic operations while maintaining consistent definitions of "revenue" and "product" throughout.
[0055] In some embodiments, the semantic layer can be accessed (via the first query at block 205) by multiple distinct applications, AI agents, or tools concurrently. This architecture enables a single semantic layer to serve multiple consumers (N applications accessing M data sources through 1 semantic layer), providing consistent business context across all applications while avoiding redundant context definitions. The semantic layer may expose interfaces compatible with Model Context Protocol (MCP) or Software Development Kits (SDKs) to enable integration with various AI applications and agents.
[0056] At block 207, relevant semantic entities and business context are identified from the semantic layer by employing the onboarding pipeline of Figure 6. Specifically, the system: (i) receives the query text from block 205; (ii) generates an embedding vector for the query using the Language Model Service 640; (iii) performs a similarity search against semantic entities stored in Semantic Fusion 680 using vector similarity (e.g., cosine similarity) to identify entities whose descriptions, names, or associated business questions have high semantic similarity to the query; (iv) retrieves the top-k most relevant entities (e.g., k=5-10) along with their associated metadata including calculation formulas, source mappings, and business context extracted during onboarding; (v) constructs a context package containing the relevant semantic entities and their relationships (e.g., if query mentions "revenue by customer," retrieve both "Revenue" and "Customer" entities plus their join relationship); and (vi) passes this context package to block 210 for computational code generation. This identification process leverages the enriched semantic objects created by the Clustering / Labeling / Analyzing / Distilling Pod 662 and unified by the Unifier Service 670, enabling the system to understand which data and business logic are relevant to the query without requiring the user to specify table names, column names, or calculation rules.
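Steps (ii) through (v) above can be sketched as follows. This is a minimal illustration only: the bag-of-words `embed` function is a toy stand-in for the embedding produced by the Language Model Service 640, and the entity store, join table, and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a learned embedding)."""
    return Counter(text.lower().replace(",", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical semantic entities with metadata captured during onboarding.
ENTITIES = {
    "Revenue": {"description": "revenue total amount of closed won deals",
                "formula": "SUM(deals.amount)"},
    "Customer": {"description": "customer account that purchases products",
                 "formula": None},
    "Signup": {"description": "signup event when a company registers",
               "formula": "COUNT(signups.id)"},
}
# Hypothetical join relationships between pairs of entities.
JOINS = {frozenset({"Revenue", "Customer"}): "deals.customer_id = customers.id"}

def build_context(query, k=2):
    """Return the top-k entities most similar to the query, plus joins among them."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(name + " " + e["description"])), name)
                     for name, e in ENTITIES.items()), reverse=True)
    top = [name for score, name in scored[:k] if score > 0]
    joins = [cond for pair, cond in JOINS.items() if pair <= set(top)]
    return {"entities": {n: ENTITIES[n] for n in top}, "joins": joins}

context = build_context("revenue by customer")
```

For the query "revenue by customer", the resulting context package holds the "Revenue" and "Customer" entities together with their join condition, which would then be handed to block 210.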
[0057] At block 210, a computational code is generated to satisfy the first query. Where the first query is a request or command to retrieve desired data from a data source, the computational code can be a computational code capable of retrieving the data from the data source. The computational code can be generated by a model. The model can be any type of generative text-to-code model capable of generating computation code from text. The model can be a generally available general purpose large language model (LLM), such as open source or the commercially available GPT-4 Turbo or GPT-4 developed by OpenAI, or Claude developed by Anthropic. In another example, the LLM can be a special purpose model trained for the task of text-to-code generation.
[0058] The semantic layer architecture provides significant computational efficiency advantages. By encapsulating business context in semantic entities rather than requiring context repetition in each query prompt, the system reduces token consumption in LLM-based applications. For example, a query can reference a semantic entity "Revenue" (consuming minimal tokens) rather than including the full business logic definition ("sum of closed deals where status='won' and amount > 0, excluding refunds...") in every prompt, resulting in an order of magnitude reduction in token usage.
[0059] The computational code can be any type of code executable by a processor, such as BASIC, Fortran, C, C++, Python, and / or SQL, or any other type of formal language. In another example, the computation code can include one or more API calls to retrieve the desired data from the data source.
[0060] At block 215, the desired data is retrieved from the data source by executing the computational code. The desired data can then be provided to the user in any format.
[0061] Fig. 3 is a flow chart depicting a method 300 of generating a computational code to satisfy the first query.
[0062] At block 305, a graphical representation of one or more concepts is generated. The graphical representation can be a visualization in any dimensional space, such as an n-dimensional space (e.g., 1D, 2D, 3D, or any dimension higher than 3D) that represents a structure and context of one or more concepts within the scope of an entity’s operational and data ecosystem in order to capture the domain of the entity. In one example, the entity is an organization, such as a business or a department within a business, such as revenue operations, marketing, operations, etc. The one or more concepts can include any type of data (e.g., a single data point or a set of data), category, parameter, dimension, measure, and / or mathematical function relevant to the entity.
[0063] Fig. 3A depicts an example graphical representation 300a. The graphical representation 300a can include one or more visual elements that represent the concepts and connections (e.g., relationships, and / or associations) among or between the one or more concepts. For example, the one or more concepts (or the concept data described below) can be represented as nodes 305a, 310a, 315a, and / or 320a and the connections can be represented as edges 325a, 330a, and / or 335a that connect the one or more nodes 305a-320a.
[0064] Each of the concepts can have concept data associated with the concept. The concept data can be stored at the data source. For example, an exemplary concept can be “offer” (e.g., offer for employment). The “offer” concept can have one or more concept data associated therewith, such as employee identification number, employment application identification number, start date, probation period, bonus amount, invoice identification number, date of last update, creation date, currency of bonus, and status. Each of the concepts can also have one or more concept connections associated with the concept. For the example concept of “offer,” the concept “offer” can be related to other concepts, such as “offer to fulfillment ratio,” a mathematical calculation concept, “average offer value,” a mathematical calculation concept, and / or “Application,” a connected concept (e.g., application that led to offer).
[0065] At block 310, a computational code is generated that associates at least one of the one or more concepts with associated concept data stored in the data source. This can be generated by utilizing existing knowledge of the entity, available in its data sources. The computation code can be any type of code executable by a processor, such as BASIC, Fortran, C, C++, Python, and / or SQL, or any other type of formal language. In another example, the computation code can include one or more API calls to retrieve data relevant to a concept.
[0066] For the example concept of “offer,” the computational code can be SQL code to retrieve the concept data for “offer” from a data source, as shown below:
SELECT o.id, o.application_id, o.start_date, o.probation_period, o.bonus_amount, o.invoice_id, o.updated_at, o.created_at, o.bonus_currency, o.status
FROM data_source o
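A query of this kind could plausibly be assembled from a stored mapping between a concept and its columns. The sketch below is illustrative only: the helper name `concept_select_sql` is hypothetical, the table name `data_source` is a placeholder, and the identifier check is an assumed safeguard rather than part of the disclosure.

```python
def concept_select_sql(table, alias, columns):
    """Build a SELECT statement that retrieves a concept's data columns."""
    # Reject anything that is not a plain identifier so names are never
    # interpolated into SQL unchecked.
    for ident in (table, alias, *columns):
        if not ident.isidentifier():
            raise ValueError(f"unsafe identifier: {ident!r}")
    column_list = ", ".join(f"{alias}.{column}" for column in columns)
    return f"SELECT {column_list} FROM {table} {alias}"

# Columns assumed to back the "offer" concept.
OFFER_COLUMNS = [
    "id", "application_id", "start_date", "probation_period", "bonus_amount",
    "invoice_id", "updated_at", "created_at", "bonus_currency", "status",
]
sql = concept_select_sql("data_source", "o", OFFER_COLUMNS)
```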
[0067] At block 315, a model operates to identify a subset of relevant concepts from the one or more concepts based upon a text input. A schematic diagram of the model is shown in Fig. 3B. As shown, for a given query as an input, the model 305b operates to select a subset of relevant concepts. These relevant concepts can be used to retrieve the concept data that is stored at the data source. In this regard, the desired concept data is defined using the ontology and / or taxonomy defined by the graphical representation, as the graphical representation was generated using connected concepts and each node represents concept data.
[0068] The model can be trained using the one or more concepts, concept data, concept connections, and any other publicly available data. The model can be any type of model, and in one example is a classifier. In another example, the model is a bi-modal embedding space.
[0069] In the example of a classifier, the classifier can be any type of classifier, such as supervised, semi-supervised, or unsupervised. In the example of supervised, the supervised classifier can be trained using labeled datasets. In the example of semi-supervised, any combination of labeled or unlabeled training data can be provided. For example, the training data can be labeled only, unlabeled only, or both labeled and unlabeled.
[0070] In some instances, training and / or operating the classifier can be computationally expensive, for example when there are a large number of concepts and / or queries available. In this scenario, it may be advantageous to implement the bi-modal embedding space.
[0071] At block 320, a semantic parser generates a computational code that retrieves the desired data from the data source. The computation code can be tested and / or validated before being executed. Further, feedback collected during the testing and / or validation can be implemented. The computational code can be any type of code executable by a processor, such as BASIC, Fortran, C, C++, Python, and / or SQL, or any other type of formal language. In another example, the computation code can include one or more API calls to retrieve the desired data from the data source.
[0072] Fig. 4 is a flow chart 400 that depicts a method for graphically representing one or more concepts.
[0073] At block 405, one or more concepts are identified. The concepts can be identified from one or more data sources. The data sources can be any type of data source, publicly available, privately stored, or otherwise. For example, the data source can be any one of a confluence (e.g., web-based collaboration and knowledge management software), an internet website, documentation, logs, business intelligence (BI) dashboards, API, a log of operations, a query log, an API log of operations, etc. The concepts can be identified from the one or more data sources by parsing the one or more data sources to identify concepts. The identification can be performed in an automatic (e.g., without user input) manner. In this regard, the identification can incorporate one or more LLMs, classifiers, automated annotators, clustering techniques, and / or other data mining techniques.
[0074] At block 410, one or more connections or relations are identified between or among the one or more identified concepts. With the concepts identified, each concept can be annotated or augmented with semantics to contextualize the schema / data. For example, once a concept is identified and supporting data has been collected, such as relations and connections, the concept node is extended and / or augmented to include semantics and context of all the identified information. The concept data and / or concept connections can be extracted from the identified schema.
[0075] At block 415, the graphical representation is generated, with the concepts represented as nodes and the concept connections represented as edges between nodes.
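The node-and-edge structure produced by blocks 405 to 415 can be sketched as a small in-memory graph. The class below is a hypothetical illustration populated with the "offer" example; the class name, field layout, and relation labels are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptGraph:
    """Concepts as nodes (name -> concept data) and concept connections as edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, target, relation) triples

    def add_concept(self, name, data=()):
        self.nodes[name] = list(data)

    def connect(self, source, target, relation):
        # Only previously identified concepts may be connected.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both concepts must be added before connecting them")
        self.edges.append((source, target, relation))

    def neighbors(self, name):
        """Concepts directly connected to the given concept, in sorted order."""
        found = set()
        for source, target, _relation in self.edges:
            if source == name:
                found.add(target)
            elif target == name:
                found.add(source)
        return sorted(found)

graph = ConceptGraph()
graph.add_concept("offer", ["start_date", "bonus_amount", "status"])
graph.add_concept("application")
graph.add_concept("average offer value")
graph.add_concept("offer to fulfillment ratio")
graph.connect("application", "offer", "led to")
graph.connect("offer", "average offer value", "calculated from")
graph.connect("offer", "offer to fulfillment ratio", "calculated from")
```

Here `graph.neighbors("offer")` returns the three concepts connected to "offer", which is the kind of lookup a retrieval step could use to pull in related concepts.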
[0076] Fig. 5 is a method of training and operating a bi-modal embedding space.
[0077] At block 505, a bi-modal embedding space is trained. This can be seen in the schematic diagram of Fig. 5A.
[0078] As shown, the bi-modal embedding space includes a first encoder 505a and a second encoder 510a. The first query can be provided to the first encoder 505a as an input and the one or more concepts can be provided to the second encoder 510a as an input. The first and second encoders can operate on the inputs to embed the inputs into a vector space and identify a similarity between the inputs.
[0079] At block 510, one or more concepts are embedded and / or stored. The concept can be embedded for identifying a similarity (e.g., as described above), while the concept can be stored for easier retrieval of the full concept data structure.
[0080] At block 515, one or more concepts are generated based upon an input text. In one example, unseen text is embedded to the bi-modal vector space and using the similarity identification, the relevant stored concepts can be retrieved. In some examples, re-ranking is applied to tighten the retrieval.
[0081] Figure 6 illustrates an onboarding pipeline for multi-source semantic layer construction. The system constructs and augments the semantic data layer through the following steps or blocks:
[0082] The onboarding pipeline can include Input Collection, for example at blocks 600 and / or 602. The Source & Context Input Panel 600 (e.g., a user interface) receives one or more uniform-resource locators (URLs) that describe one or more websites, with the URLs pointing to documentation repositories (e.g., Confluence, Notion, SharePoint) to capture business context expressed in natural language. The Use Case Editor 602 receives a textual use-case definition that scopes organizational intent, which may be entered as free-form text describing business objectives (e.g., "track customer retention metrics for SaaS products").
[0083] The onboarding pipeline can include Source Connection and Profiling, for example at blocks 610, 620, and / or 630. The system receives connection credentials for a plurality of heterogeneous data sources through Connector Modules 610 (one per source), which authenticate and establish connections to one or more BI tools, one or more data-transformation services, one or more data warehouses, and / or one or more distributed document repositories. The Business Context Crawler 620 automatically (e.g., free of user input) crawls the provided URLs using web scraping techniques and natural language processing to extract business terminology, definitions, and contextual information from documentation. Simultaneously, the Schema Profiler 630 queries and profiles (e.g., automatically) connected data sources using database introspection APIs (e.g., INFORMATION_SCHEMA queries for SQL databases, API metadata endpoints for SaaS platforms) to generate an initial data catalog containing table schemas, column names, data types, and statistical distributions, and to retrieve candidate metrics by identifying numeric columns and potential aggregation patterns.
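The profiling performed by the Schema Profiler 630 can be illustrated against an in-memory SQLite database. This is a sketch under stated assumptions: SQLite exposes schema metadata through `sqlite_master` and `PRAGMA table_info` rather than INFORMATION_SCHEMA, so those are swapped in here; the function names, the `deals` table, and the rule that skips key-like columns are hypothetical.

```python
import sqlite3

NUMERIC_TYPES = {"INTEGER", "REAL", "NUMERIC"}

def profile(conn):
    """Build a data catalog: per table, column names, types, and basic statistics."""
    catalog = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        columns = []
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, ctype, _notnull, _default, pk in conn.execute(
                f'PRAGMA table_info("{table}")'):
            column = {"name": name, "type": ctype.upper(), "primary_key": bool(pk)}
            if column["type"] in NUMERIC_TYPES:
                low, high, avg = conn.execute(
                    f'SELECT MIN("{name}"), MAX("{name}"), AVG("{name}") '
                    f'FROM "{table}"').fetchone()
                column["stats"] = {"min": low, "max": high, "avg": avg}
            columns.append(column)
        catalog[table] = columns
    return catalog

def candidate_metrics(catalog):
    """Propose aggregations over numeric columns that do not look like keys."""
    metrics = []
    for table, columns in catalog.items():
        for column in columns:
            is_key = column["primary_key"] or column["name"].endswith("_id")
            if "stats" in column and not is_key:
                metrics += [f"{agg}({table}.{column['name']})" for agg in ("SUM", "AVG")]
    return metrics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (id INTEGER PRIMARY KEY, customer_id INTEGER, "
             "amount REAL, status TEXT)")
conn.executemany("INSERT INTO deals VALUES (?, ?, ?, ?)",
                 [(1, 10, 100.0, "won"), (2, 11, 50.0, "lost")])
catalog = profile(conn)
metrics = candidate_metrics(catalog)
```

In this example only `amount` qualifies as a candidate metric, because `id` and `customer_id` are treated as identifiers rather than measures.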
[0084] The onboarding pipeline can include Question Generation and Refinement, for example at block 640. The Language Model Service 640 receives the use-case definition and generates business questions derived from the use-case using a large language model (LLM) prompted with the use-case text and examples of well-formed business questions. The system iteratively refines the questions based on user feedback captured through the Use Case Editor 602, employing reinforcement learning from human feedback (RLHF) techniques to capture domain-specific terminology and improve question relevance.
[0085] The onboarding pipeline can include Metric Synthesis and Definition Inference, for example at block 650. The Metric Synthesizer 650 infers concrete definitions and formulae from the business questions and the data catalog by: (i) parsing the business questions using natural language understanding to extract entities, operations, and filters; (ii) matching extracted entities to columns in the data catalog using embedding-based similarity search; (iii) generating candidate SQL expressions or calculation formulas based on the matched columns and operations; and (iv) ranking candidates using a scoring function that considers semantic similarity, data type compatibility, and statistical properties.
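By way of non-limiting illustration, steps (ii) through (iv) of the Metric Synthesizer 650 may be sketched as follows. The sketch assumes a toy embedding built from character-trigram counts in place of a learned embedding model, and an arbitrary 0.7 / 0.3 weighting between semantic similarity and data type compatibility; statistical properties are omitted for brevity. The names embed, cosine, and rank_candidates and the example catalog are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-trigram counts (stand-in for a learned model)."""
    padded = f"  {text.lower().replace('_', ' ')} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(entity, operation, catalog):
    """Score every column for an extracted entity; best candidate first."""
    needs_numeric = operation in {"SUM", "AVG"}
    scored = []
    for col in catalog:
        similarity = cosine(embed(entity), embed(col["name"]))
        type_ok = 1.0 if (col["numeric"] or not needs_numeric) else 0.0
        score = 0.7 * similarity + 0.3 * type_ok  # assumed weights
        scored.append((round(score, 3), f'{operation}({col["name"]})'))
    return sorted(scored, reverse=True)


catalog = [
    {"name": "sales_amount", "numeric": True},
    {"name": "sales_region", "numeric": False},
    {"name": "customer_id", "numeric": True},
]
ranked = rank_candidates("sales amount", "SUM", catalog)
```

Here the entity "sales amount" extracted from a business question resolves to the candidate expression SUM(sales_amount) as the top-ranked formula.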
[0086] The onboarding pipeline can include Processing Pipeline Execution, for example at blocks 660, 662, and / or 664. Responsive to a build command received through the user interface, the Orchestration Engine 660 executes per-source processing pipelines, including the following. The Clustering / Labeling / Analyzing / Distilling Pod 662 enriches semantic objects by: (i) clustering similar columns across sources using embedding vectors generated by the Language Model Service 640 and k-means or hierarchical clustering algorithms; (ii) labeling clusters with semantic types (e.g., "customer identifier," "revenue amount") using classification models trained on labeled examples; (iii) analyzing relationships between entities by detecting foreign key relationships and join patterns in query logs; and (iv) distilling essential attributes by removing redundant or low-information columns based on variance and correlation analysis.
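By way of non-limiting illustration, the distilling of step (iv) may be sketched as follows. The sketch assumes that each column is available as a list of numeric sample values, and it treats a column as low-information when its variance is near zero and as redundant when its absolute Pearson correlation with an already retained column exceeds a threshold. The function names and both threshold values are hypothetical.

```python
import math
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def distill(columns, min_variance=1e-9, max_corr=0.95):
    """columns: {name: [numeric values]}. Returns retained names in input order."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # low information: (near-)constant column
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # redundant with a column that was already retained
        kept.append(name)
    return kept


columns = {
    "sales": [10, 20, 30, 40],
    "sales_cents": [1000, 2000, 3000, 4000],  # perfectly correlated with sales
    "flag": [1, 1, 1, 1],                     # constant
    "qty": [3, 1, 4, 1],
}
essential = distill(columns)
```

Because the first column of a correlated pair is the one retained, the input order acts as an implicit priority in this sketch.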
[0087] The Similarity Pod 664 computes similarity scores between semantic objects using cosine similarity of embedding vectors and string similarity metrics (e.g., Levenshtein distance, Jaccard similarity) applied to column names and descriptions.
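By way of non-limiting illustration, the measures named for the Similarity Pod 664 may be sketched as follows. The sketch assumes that embedding vectors are supplied as plain lists of floats and that the three measures are combined with equal weights; the disclosure does not specify how the measures are combined, so the similarity function is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Edit distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity over underscore- or space-delimited name tokens."""
    ta = set(a.lower().replace("_", " ").split())
    tb = set(b.lower().replace("_", " ").split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def similarity(name_a, vec_a, name_b, vec_b):
    """Equal-weight blend of the three measures (assumed weighting)."""
    longest = max(len(name_a), len(name_b), 1)
    lev_sim = 1 - levenshtein(name_a, name_b) / longest
    return (cosine_similarity(vec_a, vec_b) + lev_sim + jaccard(name_a, name_b)) / 3


score = similarity("sales_amt", [0.9, 0.1], "sales_amount", [0.8, 0.2])
```

Levenshtein distance is converted to a similarity by normalizing against the longer name so that all three measures fall in the range 0 to 1 before blending.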
[0088] The onboarding pipeline can include Unification and Conflict Resolution, for example at block 670. The Unifier Service 670 unifies semantic objects across sources via similarity detection and conflict resolution by: (i) identifying potential matches where similarity scores exceed a threshold (e.g., 0.8); (ii) applying rule-based conflict resolution strategies (e.g., preferring sources marked as authoritative, selecting the most recently updated definition); (iii) prompting users to manually resolve ambiguous cases through the user interface; and / or (iv) creating unified semantic entities that map to multiple physical columns across sources, storing the mappings in metadata tables.
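By way of non-limiting illustration, the unification performed by the Unifier Service 670 may be sketched as follows. The sketch assumes that pairwise similarity scores have already been computed, applies the example threshold of 0.8, prefers authoritative sources and then the most recently updated definition, and defers to manual review when those rules cannot decide. The SemanticObject class, its fields, and the shape of the returned mappings are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SemanticObject:
    name: str
    source: str
    definition: str
    updated: date
    authoritative: bool = False


def resolve(a, b):
    """Rule-based resolution: authoritative source first, then latest update."""
    return max((a, b), key=lambda o: (o.authoritative, o.updated))


def unify(scored_pairs, threshold=0.8):
    """scored_pairs: iterable of (object_a, object_b, similarity_score)."""
    unified, needs_review = [], []
    for a, b, score in scored_pairs:
        if score <= threshold:
            continue  # not a potential match
        conflict = a.definition != b.definition
        tie = (a.authoritative, a.updated) == (b.authoritative, b.updated)
        if conflict and tie:
            needs_review.append((a, b))  # rules cannot decide; ask the user
            continue
        winner = resolve(a, b)
        unified.append({
            "entity": winner.name,
            "definition": winner.definition,
            "mappings": [f"{a.source}.{a.name}", f"{b.source}.{b.name}"],
        })
    return unified, needs_review


warehouse = SemanticObject(
    "sales_amt", "warehouse", "SUM(sales_amt)", date(2025, 1, 10), authoritative=True
)
dashboard = SemanticObject(
    "Gross Sales", "bi", "SUM(sales_amt) - SUM(refunds)", date(2025, 3, 2)
)
unified, needs_review = unify([(warehouse, dashboard, 0.86)])
```

In this example the warehouse definition prevails because its source is marked authoritative, even though the dashboard definition was updated more recently.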
[0089] The onboarding pipeline can include Validation and Persistence, for example at blocks 680 and / or 690. The system validates and confirms the unified semantic model by: (i) detecting ambiguous definitions where multiple conflicting interpretations exist using entropy measures on definition variations; (ii) generating sample queries against the semantic model and comparing results to expected outcomes; and / or (iii) presenting validation results to users for confirmation through the user interface.
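By way of non-limiting illustration, the entropy measure of step (i) may be sketched as follows. The sketch assumes Shannon entropy computed over the distribution of distinct definitions collected for one semantic entity, with definitions compared after lowercasing and whitespace normalization. The 0.5-bit cutoff and the function names are hypothetical.

```python
import math
from collections import Counter


def definition_entropy(definitions):
    """Shannon entropy, in bits, over the distinct normalized definitions."""
    counts = Counter(" ".join(d.lower().split()) for d in definitions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_ambiguous(definitions, max_bits=0.5):
    """Flag an entity whose definitions disagree more than the assumed cutoff."""
    return definition_entropy(definitions) > max_bits


consistent = ["SUM(sales_amt)", "sum(sales_amt)", "SUM(sales_amt)"]
conflicting = ["SUM(sales_amt)", "SUM(sales_amt) - SUM(refunds)"]
flags = (is_ambiguous(consistent), is_ambiguous(conflicting))
```

An entropy of zero means every source agrees on the definition, while two equally frequent competing definitions yield one bit.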
[0090] The confirmed semantic model is persisted in Semantic Fusion 680, a graph database or relational database storing semantic entities, their attributes, relationships, calculation logic, and source mappings. The Drift Detector 690 monitors subsequent drift by continuously comparing incoming schema changes from Connector Modules 610 against the stored semantic model and triggering alerts when discrepancies are detected (as further detailed in Figure 7).
[0091] This onboarding process enables the system to bootstrap the semantic layer from existing enterprise resources (documentation, data schemas, tribal knowledge) and continuously refine it based on actual usage patterns and business context, ensuring the semantic entities remain aligned with organizational understanding and business logic.
[0092] Figure 7 illustrates the connector architecture with schema-drift monitoring and semantic output. Each Connector Shell 710 operates independently for a specific data source and comprises:
[0093] The connector architecture can include Authentication and Schema Monitoring, for example at blocks 712 and / or 714. The Auth Handler 712 manages authentication credentials using OAuth 2.0, API keys, or database connection strings, refreshing tokens as needed. The Schema Listener 714 periodically queries the data source's metadata (e.g., every 5 minutes, configurable) using database introspection APIs to detect schema changes such as added columns, removed columns, data type changes, or constraint modifications.
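By way of non-limiting illustration, the comparison performed by the Schema Listener 714 on each polling cycle may be sketched as follows. The sketch assumes that each metadata query yields a snapshot of one table as a mapping from column name to data type; the function name diff_schema and the snapshot shape are hypothetical, and constraint modifications are omitted.

```python
def diff_schema(previous, current):
    """Compare two {column_name: data_type} snapshots of the same table."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "type_changed": sorted(c for c in shared if previous[c] != current[c]),
    }


before = {"order_id": "INTEGER", "sales_amt": "REAL", "region": "TEXT"}
after = {"order_id": "INTEGER", "sales_amt": "TEXT", "coupon_code": "TEXT"}
changes = diff_schema(before, after)
```

A listener built on this sketch would store the latest snapshot after each cycle and publish any non-empty difference for downstream processing.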
[0094] The connector architecture can include Change Data Capture, for example at blocks 716 and / or 718. The Change-Data-Capture Streamer 716 monitors data changes using database-specific CDC mechanisms (e.g., MySQL binlog, PostgreSQL logical replication, API webhooks for SaaS platforms) or by periodically polling for changes based on timestamp columns. The Heartbeat-Metrics Emitter 718 emits metrics about connector health, query latency, error rates, and data freshness to monitoring systems, enabling operational visibility.
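By way of non-limiting illustration, the timestamp-based polling alternative may be sketched as follows. The sketch assumes that each row carries an updated_at timestamp and that the streamer remembers a watermark between polls; the function name poll_changes and the row shape are hypothetical.

```python
from datetime import datetime


def poll_changes(rows, watermark):
    """Return rows modified after `watermark` together with the new watermark."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark


rows = [
    {"id": 1, "updated_at": datetime(2025, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2025, 1, 1, 9, 7)},
    {"id": 3, "updated_at": datetime(2025, 1, 1, 9, 12)},
]
changed, watermark = poll_changes(rows, datetime(2025, 1, 1, 9, 5))
unchanged, same_watermark = poll_changes(rows, watermark)
```

Polling of this kind does not observe deleted rows, and the strict comparison can miss a row written with the same timestamp as the watermark, which is one reason log-based CDC mechanisms may be preferred where available.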
[0095] The connector architecture can include Event Processing and Drift Detection, for example at blocks 720, 730, and / or 740. Schema changes and data changes are published to an Event Queue 720 (e.g., Kafka, RabbitMQ, AWS SQS) for asynchronous processing. The Schema Drift Analyzer 730 consumes events from the queue and compares detected schema changes against the semantic model stored in Semantic Fusion 680. When drift is detected (i.e., a schema change that affects a semantic entity), the analyzer: (i) calculates the impact by identifying which semantic entities reference the changed schema elements; (ii) determines severity based on whether the change breaks existing queries or merely adds new fields; and (iii) writes detailed drift information to Drift Logs & Alerts 740, which may trigger notifications to administrators or, optionally, initiate an Auto Revise process (shown with dotted line) that uses the Language Model Service 640 to automatically propose updates to affected semantic entities.
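By way of non-limiting illustration, steps (i) and (ii) of the Schema Drift Analyzer 730 may be sketched as follows. The sketch assumes an in-process queue as a stand-in for the Event Queue 720, a semantic model represented as a mapping from entity name to the set of physical columns it references, and a two-level severity scale. The event shape, the severity labels, and the function name analyze_drift are hypothetical.

```python
import queue

BREAKING_KINDS = {"removed", "type_changed"}


def analyze_drift(event, semantic_model):
    """event: {"kind": ..., "column": "table.column"}."""
    # Impact: every semantic entity that references the changed column.
    affected = sorted(
        name for name, cols in semantic_model.items() if event["column"] in cols
    )
    # Severity: breaking only if a referenced column was removed or retyped.
    breaking = event["kind"] in BREAKING_KINDS and bool(affected)
    return {
        "column": event["column"],
        "kind": event["kind"],
        "affected_entities": affected,
        "severity": "breaking" if breaking else "non_breaking",
    }


semantic_model = {
    "Gross Sales": {"orders.sales_amt"},
    "Net Sales": {"orders.sales_amt", "orders.refund_amt"},
    "Customer Count": {"customers.customer_id"},
}

events = queue.Queue()  # in-process stand-in for a message broker
events.put({"kind": "removed", "column": "orders.refund_amt"})
events.put({"kind": "added", "column": "orders.coupon_code"})

drift_log = []
while not events.empty():
    drift_log.append(analyze_drift(events.get(), semantic_model))
```

In this sketch the removed column produces a breaking entry that names Net Sales as the affected entity, while the added column is logged as non-breaking with no affected entities.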
[0096] The connector architecture can include Semantic Object Output, for example at block 750. Each connector produces Semantic Object Output 750 containing enriched metadata about the source's schema, sample data values, statistical distributions, and mappings to semantic entities. This output feeds back into the Unifier Service 670 to maintain an up-to-date unified semantic model.
[0097] The connector architecture enables continuous synchronization between physical data sources and the semantic layer, ensuring that semantic entities remain accurate as underlying systems evolve.
[0098] Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.
Claims
1. A computer-implemented method for retrieving desired data from a data source, comprising: receiving a first query, the first query being a request for desired data from a data source; generating a computational code, based upon the first query, to retrieve the desired data from the data source.
2. The method of claim 1, wherein generating the computational code further comprises: generating a graphical representation of one or more concepts; generating a second computational code that associates the one or more concepts with data from the data source; and identifying a subset of relevant concepts from the one or more concepts.
3. The method of claim 2, wherein the graphical representation comprises at least one node and at least one edge.
4. The method of claim 3, wherein the node represents at least one concept and the edge represents at least one concept connection.
5. The method of claim 2, wherein generating the graphical representation of one or more concepts further comprises: identifying the one or more concepts; and identifying one or more concept connections.
6. The method of claim 2, wherein the first query includes complex prompts comprising multiple related queries and / or conversational interactions.
7. The method of claim 6, wherein the first query originates from: one or more Artificial Intelligence (AI) applications; one or more AI agents; and / or one or more tools.
8. The method of claim 7, wherein the one or more tools implement a Model Context Protocol (MCP) to communicate the first query.
9. The method of claim 2, wherein a model is configured to identify the subset of relevant concepts from the one or more concepts.
10. The method of claim 9, wherein the model comprises one of: a classifier, or a bi-modal embedding space.
11. The method of claim 10, wherein the classifier is trained using the one or more concepts.
12. The method of claim 10, wherein the bi-modal embedding space is trained using a first encoder and a second encoder.
13. The method of claim 12, wherein the first encoder embeds the first query into a vector space and the second encoder embeds the one or more business concepts into the vector space.
14. The method of claim 1, wherein the computational code comprises Python, SQL, or an API call.
15. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the method of claim 1.
16. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the method of claim 1.
17. A computer-implemented method, comprising: receiving, via a user interface, uniform-resource locators (URLs) that describe one or more websites in order to capture business context expressed in natural language; receiving connection credentials for a plurality of heterogeneous data sources; profiling the connected plurality of heterogeneous data sources to generate an initial data catalog and to retrieve candidate metrics; receiving a textual use-case definition that scopes organizational intent; generating, by a language model, one or more business questions derived from the textual use-case definition, and iteratively refining the generated one or more business questions based on user feedback to capture domain-specific terminology; inferring concrete definitions and formulae from the one or more business questions and the initial data catalog; responsive to a build command, executing per-source processing pipelines to enrich semantic objects; unifying semantic objects across sources via similarity detection and conflict resolution; and validating and confirming a unified semantic model.
18. The method of claim 17, wherein the plurality of heterogeneous data sources comprises at least one of: one or more business intelligence (BI) tools; one or more data-transformation services; one or more data warehouses; and / or one or more distributed document repositories.
19. The method of claim 17, wherein the initial data catalog includes one or more of: table schemas, column names, data types, and / or statistical distributions.
20. The method of claim 17, wherein the candidate metrics are identified by numeric columns and potential aggregation patterns.
21. The method of claim 17, wherein unifying semantic objects includes: (i) identifying potential matches where detected similarity scores exceed a threshold; (ii) applying rule-based conflict resolution strategies; (iii) prompting users to manually resolve ambiguous cases through the user interface; and / or (iv) creating unified semantic entities that map to multiple physical columns across sources, storing the mappings in metadata tables.
22. The method of claim 17, wherein validating and confirming the unified semantic model comprises one or more of: (i) detecting ambiguous definitions where multiple conflicting interpretations exist using entropy measures on definition variations; (ii) generating sample queries against the unified semantic model and comparing results to expected outcomes; and / or (iii) presenting validation results to users for confirmation through the user interface.
23. The method of claim 17, wherein the per-source processing pipelines include one or more of: (i) clustering similar columns across sources using embedding vectors generated by a Language Model Service and k-means or hierarchical clustering algorithms; (ii) labeling clusters with semantic types using classification models trained on labeled examples; (iii) analyzing relationships between entities by detecting foreign key relationships and join patterns in query logs; and / or (iv) distilling essential attributes by removing redundant or low-information columns based on variance and correlation analysis.
24. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the method of claim 17.
25. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the method of claim 17.