A system and a method for unstructured synthetic data generation
The system generates unstructured synthetic data using a transformer-based model to address data quality issues, ensuring semantic and statistical compatibility, thereby enhancing machine learning model performance.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- PRIVASAPIEN TECH PTE LTD
- Filing Date
- 2025-12-15
- Publication Date
- 2026-06-25
Abstract
Description
[0001] A SYSTEM AND A METHOD FOR UNSTRUCTURED SYNTHETIC DATA GENERATION
[0002] EARLIEST PRIORITY DATE:
[0003] This Application claims priority from a provisional patent application filed in India having Patent Application No. 202441101483, filed on December 20, 2024, and titled “A SYSTEM AND A METHOD FOR UNSTRUCTURED SYNTHETIC DATA GENERATION”.
[0004] FIELD OF INVENTION
[0005] The present invention relates to the field of data analytics and database. More particularly, the present invention relates to a system and a method for unstructured synthetic data generation.
[0006] BACKGROUND
[0007] Machine learning systems have become central to modern data-driven operations, enabling automated decision-making, prediction, pattern discovery, and large-scale information processing across industries. These systems typically rely on substantial volumes of real-world data to achieve high performance, yet obtaining such data often presents significant challenges. Many domains such as healthcare, finance, logistics, and customer analytics handle sensitive or regulated information, creating barriers to openly collecting, sharing, or reusing datasets due to privacy, compliance, and confidentiality concerns. As a result, organizations frequently struggle to access high-quality training data without compromising security or exposing identifiable information.
[0008] In addition to privacy limitations, real-world datasets often suffer from issues such as imbalance, incompleteness, inconsistency, bias, and structural irregularities. These shortcomings reduce model accuracy, weaken fairness outcomes, and complicate downstream data processing. Heterogeneous data sources further compound the problem, as textual, numerical, categorical, and image-based data require different preprocessing steps, encoding methods, and normalization techniques. Ensuring compatibility among these diverse formats can demand extensive manual effort and domain expertise.
[0009] Another significant challenge lies in the computational demands of training advanced machine learning architectures, particularly when dealing with large, unstructured datasets. The need for repeated training cycles, fine-tuning, and performance evaluation places strain on hardware resources and increases operational cost. Without access to representative and well-structured data, these models risk producing unreliable, biased, or low-quality outputs.
[0010] Hence, there is a need for an improved system and method for unstructured synthetic data generation to address the aforementioned issue(s).
[0011] OBJECTIVES OF THE INVENTION
[0012] The primary objective of the invention is to enable the generation of unstructured synthetic data that accurately reflects the semantic, statistical, and contextual characteristics of real-time input data while preventing exposure of sensitive or identifiable information.
[0013] Another objective of the invention is to provide an adaptive data-processing framework that analyses, encodes, and segregates diverse data modalities such as textual, numerical, and categorical data in a structured and scalable manner to ensure consistent preparation for multimodal model training.
[0014] Yet another objective of the invention is to ensure that conditional textual synthetic data generated using contextual cues remains coherent, contextually aligned, and compatible with corresponding non-textual synthetic elements to produce a unified synthetic dataset of high fidelity.
[0015] A further objective of the invention is to incorporate robust validation, refinement, and alignment mechanisms that preserve semantic meaning, maintain statistical compatibility, and ensure that the resulting synthetic dataset is reliable for analytical, operational, and machine-learning applications.
[0016] Still another objective of the invention is to provide a fine-tuned, transformer-based synthetic data generation system capable of maintaining fairness, reducing bias, and improving overall quality and realism through iterative evaluation and optimization.
[0017] SUMMARY
[0018] In accordance with an embodiment of the present disclosure, a system for unstructured synthetic data generation is disclosed. The system includes a processor. The system includes a memory coupled to the processor, wherein the memory comprises instructions that, when executed by the processor, cause the processor to: receive real-time input data from a plurality of sources, wherein the real-time input data comprises textual data and non-textual data; analyse a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data; encode the real-time input data based on the at least one encoding technique, wherein the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively; segregate the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups; train a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data; fine-tune the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model; generate conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance; combine the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset; process the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data; and output the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity.
In accordance with an embodiment of the present disclosure, a method for unstructured synthetic data generation is disclosed. The method includes receiving, by a processor, real-time input data from a plurality of sources, wherein the real-time input data comprises textual data and non-textual data. The method includes analysing, by the processor, a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data. The method includes encoding, by the processor, the real-time input data based on the at least one encoding technique, wherein the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively. The method includes segregating, by the processor, the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups. The method includes training, by the processor, a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data. The method includes fine-tuning, by the processor, the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model. The method includes generating, by the processor, conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance. The method includes combining, by the processor, the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset. The method includes processing, by the processor, the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data. The method includes outputting, by the processor, the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity.
To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. It is to be appreciated that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.
[0019] BRIEF DESCRIPTION OF THE DRAWINGS
[0020] The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:
[0021] FIG. 1 illustrates a network environment of a system for unstructured synthetic data generation in accordance with an embodiment of the present disclosure;
[0022] FIG. 2 illustrates a schematic diagram of a user device of FIG. 1, in accordance with an example implementation of the present subject matter;
[0023] FIG. 3 illustrates a schematic diagram of a system for unstructured synthetic data generation of FIG. 1, in accordance with an embodiment of the present disclosure;
[0024] FIG. 4 (a) is a flow chart representing the steps involved in a method for unstructured synthetic data generation, in accordance with an embodiment of the present disclosure; and
[0025] FIG. 4 (b) illustrates continued steps of the method of FIG. 4 (a) in accordance with an embodiment of the present disclosure.
[0026] Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.
DETAILED DESCRIPTION
[0027] For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended. Such alterations and further modifications in the illustrated system, and such further applications of the principles of the disclosure as would normally occur to those skilled in the art are to be construed as being within the scope of the present disclosure.
[0028] The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or subsystems or elements or structures or components preceded by "comprises... a" does not, without more constraints, preclude the existence of other devices, sub-systems, elements, structures, components, additional devices, additional sub-systems, additional elements, additional structures or additional components. Appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment.
[0029] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.
[0030] In the following specification and the claims, reference will be made to a number of terms, which shall be defined to have the following meanings. The singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.
[0031] FIG. 1 illustrates a network environment of a system for unstructured synthetic data generation in accordance with an embodiment of the present disclosure. Referring to FIG. 1, a user device (104) corresponding to a user (108) may be communicatively coupled to a system (102). The user (108) may access the system (102) over a network (106). Examples of the user device (104) include, but are not limited to, a mobile phone, desktop computer, portable digital assistant (PDA), smart phone, tablet, ultra-book, netbook, laptop, multi-processor system, microprocessor-based or programmable consumer electronic system, or any other communication device that a user may use. It will be appreciated that the system (102) may be presented to the user on the user device (104) as a web application accessed through a browser, through a software application on the user device, or, particularly for smartphones, through a mobile application installed on the smartphone. It will be appreciated that, within the context of the disclosure herein, a web application refers to a utility implemented on a networked computing system accessible by the user device over the Internet (e.g. through browsers) wherein the bulk of the processing takes place at the networked computing system, mobile applications refer to applications installed on smartphones that may communicate with a networked computing system, and a “software” application refers generally to applications other than web browsers installed on other types of user device that may communicate with a networked computing system over the network (106).
[0032] The network (106) may be a single communication network or a combination of multiple communication networks and may use a variety of different communication protocols. The network (106) may be a wireless network, a wired network, or a combination thereof. Examples of such individual personalized networks include, but are not limited to, Global System for Mobile Communication (GSM) network, Universal Mobile Telecommunications System (UMTS) network, Personal Communications Service (PCS) network, Time Division Multiple Access (TDMA) network, Code Division Multiple Access (CDMA) network, Next Generation Network (NGN), and Public Switched Telephone Network (PSTN). Depending on the technology, the personalized network (106) may include various network entities, such as gateways and routers; however, such details have been omitted for the sake of brevity of the present description.
[0033] The system (102) may have a homepage that is presented to the user (108) accessing a top-level web address for web applications presented to the user (108) in a browser or a welcome screen for software and mobile applications. The homepage may include links to a user log-in interface or general information about the system (102) and the option to register as user (108). It will be appreciated that the presentation of a homepage may not be necessary, for example, if a user bypasses it by directly inputting a web address corresponding to a user log-in page, or if a separate mobile application is designed for users.
[0034] A new or unregistered user can access the user log-in interface, fill out the log-in information corresponding to the user's account, and indicate that the user wishes to sign in. It will be appreciated that any conventional registration and log-in techniques for web applications, software applications, and mobile applications may be used, whichever is appropriate for the user. While registering, the user may be prompted to provide a username and corresponding user credentials including, but not limited to, a password, geographical location, and contact information. Upon receipt of the foregoing information, a corresponding user profile may be created and stored in a respective database of the system (102).
[0035] In accordance with an embodiment of the present disclosure, a system for unstructured synthetic data generation is disclosed. The system includes a processor. The system includes a memory coupled to the processor, wherein the memory comprises instructions that, when executed by the processor, cause the processor to: receive real-time input data from a plurality of sources, wherein the real-time input data comprises textual data and non-textual data; analyse a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data; encode the real-time input data based on the at least one encoding technique, wherein the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively; segregate the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups; train a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data; fine-tune the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model; generate conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance; combine the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset; process the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data; and output the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity.
[0036] It may be noted that the foregoing system is an exemplary system and may be implemented as computer executable instructions in any computing or processing environment, including in digital electronic circuitry or in computer hardware, firmware, device driver, or software. As such, the system is not limited to any specific hardware or software configuration.
[0037] FIG. 2 illustrates a schematic diagram of a user device, in accordance with an example implementation of the present subject matter. Referring to FIG. 2, the user device (104) may comprise a processor(s) (202), a memory(s) (204) coupled to and accessible by the processor(s) (202), and an interface (210) coupled to the memory(s) (204). The user device (104) disclosed herein may be the same as the user device (104) described in FIG. 1. The functions of various elements shown in the figures, including any functional blocks labelled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" should not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), and field programmable gate array (FPGA). Other hardware, standard and / or custom, may also be coupled to the processor(s) (202). The user device (104) may further include a display (206) in addition to other components such as, but not limited to, keyboard, sensors, logic circuits etc. Further, the user device (104) may include data (208) that may be stored, utilized or generated during the operation of the user device (104).
[0038] The memory(s) (204) may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM), and / or non-volatile memory (e.g., Erasable Programmable read-only memory, i.e. EPROM, flash memory, etc.). The memory(s) (204) may be an external memory, or internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. The user device (104) may further include an interface (210) that may allow the connection or coupling of the user device (104) with one or more other devices, through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, Wi-Fi), for example, for connecting to the system shown in FIG. 1. The interface may also enable intercommunication between different logical as well as hardware components of the user device (104).
[0039] FIG. 3 illustrates a schematic diagram of a system for unstructured synthetic data generation of FIG. 1, in accordance with an embodiment of the present disclosure. Referring to FIG. 3, the system (102) includes a processor(s) (302), a memory(s) (304) coupled to and accessible by the processor(s) (302), and database (346) coupled to the memory(s) (304).
[0040] The system (102) disclosed herein is the same as the system (102) described in FIG. 1. The functions of various elements shown in the figures, including any functional blocks labelled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" should not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), and field programmable gate array (FPGA). Other hardware, standard and / or custom, may also be coupled to the processor(s) (302). The system (102) may further include other components such as, but not limited to, keyboard, sensors, logic circuits, input / output interfaces etc. Further, the system (102) may include data that may be stored, utilized or generated during the operation of the computer implemented system (102).
[0041] The memory(s) (304) may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM), and / or non-volatile memory (e.g., Erasable Programmable read-only memory, i.e. EPROM, flash memory, etc.). The memory(s) (304) may be an external memory, or internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. The system (102) may further include the user interface (348) that may allow the connection or coupling of the system (102) with one or more other devices, through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, Wi-Fi), for example, for connecting to the user device (104) as shown in FIG. 1. The user interface (348) may also enable intercommunication between different logical as well as hardware components of the system (102).
[0042] The system (102) may be provided with a database (346) to store the real-time input data, the encoded textual data, the encoded non-textual data, the conditional textual synthetic data, the non-textual synthetic data, the unified synthetic dataset, the evaluation metrics and a validation result. In an example implementation of the system (102) including one or more servers, the databases may be local to the server or may be remote to the server. It may be noted that the data in the databases may be stored as a table or may be pre-stored as a mapping with the other. This application is not limited thereto.
[0043] The system (102) may include module(s). The module(s) may include a receiving module (306), an encoder type predictor module (308), a segregator module (310), a training module (312), a conditional textual generator module (314), an aggregator module (316), a synthetic data generation module (318), a utility and requirement definition module (320), and an evaluator module (340). In one example, the module(s) may be implemented as a combination of hardware and firmware. In an example described herein, such combinations of hardware and firmware may be implemented in several different ways. For example, the firmware for module(s) may be processor (302) executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the module(s) may include a processing resource (for example, implemented as either single processor or combination of multiple processors), to execute such instructions. Further, the hardware for the module(s) may include communication apparatuses, control circuitries involving electrical and electronics components, sensors, and interface devices, which may be in communication with each other for multi-directional communication therebetween.
[0044] Further, the system (102) includes data. The data may include data that is either stored or generated as a result of functions implemented by the system. It may be further noted that information stored and available in data may be utilized by the module(s) for performing various functions by the system. In an example, data may include real-time input data (322), textual data (324), non-textual data (326), numerical data (328), categorical data (330), contextual cues (332), encoded textual and non-textual data (334), a unified synthetic dataset (336), and an evaluation metric (338). It may be noted that such examples of the various functions are only indicative. The present approaches may be applicable to other examples without deviating from the scope of the present subject matter.
[0045] In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processing resource, implement the functionalities of the module(s). In such examples, the system (102) may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions. In other examples of the present subject matter, the machine-readable storage medium may be located at a different location but accessible to the system (102) and the processor(s) (302).
[0046] In operation, the receiving module (306) is configured to receive real-time input data from a plurality of sources. The plurality of sources may include databases, application programming interfaces (APIs), networked sensors, data repositories, or external computing systems configured to provide real-time feeds. The receiving module (306) is further configured to identify and classify the received data based on inherent characteristics such as modality, structure, and contextual relevance. The classification enables the system (102) to distinguish between textual and non-textual forms of input, thereby forming an organized data intake structure. The textual data corresponds to linguistic or language-based information, while the non-textual data corresponds to quantitative, categorical, or perceptual content. The real-time reception mechanism employs synchronized data channels to ensure temporal coherence, allowing the system (102) to process diverse datasets with minimal latency. The real-time input data is processed in a structured pipeline that maintains data integrity, temporal sequencing, and semantic correlation across multiple modalities to ensure uniform readiness for encoding and subsequent transformation.
[0047] By way of example, the plurality of sources includes, but is not limited to, enterprise databases containing transaction logs, social media streams containing user-generated text, medical repositories containing diagnostic reports, and sensor networks providing environmental metrics.
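By way of a non-limiting illustration, the intake classification described above may be sketched as follows. The function name `classify_record`, the field names, and the rule that a multi-word string counts as textual data are assumptions made only for this sketch and are not taken from the disclosure; an actual receiving module (306) may use any suitable classification logic.

```python
def classify_record(record):
    """Split one incoming record into textual and non-textual fields.

    Assumed rule (illustrative only): a string with more than one
    whitespace-separated word is treated as textual data; numbers and
    single-token strings (e.g. category labels) are treated as
    non-textual data.
    """
    textual, non_textual = {}, {}
    for field, value in record.items():
        if isinstance(value, str) and len(value.split()) > 1:
            textual[field] = value
        else:
            non_textual[field] = value
    return textual, non_textual


# Hypothetical record combining a clinical note with structured attributes.
record = {
    "note": "patient reports mild fever since yesterday",
    "temperature_c": 38.2,
    "gender": "F",
}
textual, non_textual = classify_record(record)
```

In this sketch the note is routed to the textual group, while the temperature reading and the gender label are routed to the non-textual group for separate encoding.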
[0048] In one embodiment, the real-time input data comprises textual data and non-textual data. The textual data includes, but is not limited to, descriptive records, narrative text, or chat transcripts, while the non-textual data includes, but is not limited to, numerical measurements, categorical attributes, or image-based inputs.
[0049] In one embodiment, the encoder type predictor module (308) is configured to analyse a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data. The encoder type predictor module (308) performs an assessment of the incoming real-time input data by identifying intrinsic characteristics including format, dimensionality, distribution patterns, semantic density, and structural composition. For the textual data, the system examines linguistic attributes such as token boundaries, vocabulary complexity, contextual dependencies, and syntactic patterns. For the non-textual data, the encoder type predictor module (308) evaluates numerical ranges, categorical groupings, image pixel distributions, or feature scales associated with the underlying modality. This structural analysis enables the encoder type predictor module (308) to classify the real-time input data into modality-specific categories, thereby allowing the processor to select an encoding technique that is compatible with the corresponding data type. The encoding technique is selected from a plurality of predefined encoding methods, wherein the selection is based on the assessed structural characteristics of the textual data and the non-textual data, ensuring that the encoded representations preserve essential meaning, statistical properties, and contextual relevance.
[0050] By way of example, the real-time input data includes, but is not limited to, textual clinical notes, chat transcripts, descriptive reports, numerical sensor readings, categorical patient attributes, and a plurality of facial images captured through a camera installed within a monitoring or observational environment. The encoder type predictor module (308) may examine each clinical note to identify tokenizable sentence structures or embedded medical terminology, thereby determining whether tokenization or word embeddings are suitable for encoding. Similarly, the encoder type predictor module (308) may assess each numerical sensor reading to identify statistical variance and range, thereby determining whether normalization or feature scaling is appropriate.
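As a minimal sketch of how such a structural analysis might map a field to one of the predefined encoding techniques, consider the following. The function `choose_encoding`, the `max_categories` threshold, and the returned technique labels are hypothetical and serve only to illustrate the selection step; the disclosure does not prescribe these rules.

```python
def choose_encoding(values, max_categories=10):
    """Return an encoding technique label for one field of input data.

    Illustrative heuristic (an assumption, not the disclosed method):
      - all-numeric fields            -> "normalization"
      - few distinct one-word labels  -> "one_hot"
      - anything else (free text)     -> "tokenization"
    """
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return "normalization"

    as_text = [str(v) for v in values]
    has_multiword = any(len(s.split()) > 1 for s in as_text)
    if not has_multiword and len(set(as_text)) <= max_categories:
        return "one_hot"
    return "tokenization"


# Hypothetical fields resembling the examples in the description.
sensor_readings = [36.6, 38.2, 37.1]
gender_labels = ["F", "M", "F"]
clinical_notes = ["patient reports mild fever", "no symptoms observed"]

selected = {
    "sensor_readings": choose_encoding(sensor_readings),
    "gender_labels": choose_encoding(gender_labels),
    "clinical_notes": choose_encoding(clinical_notes),
}
```

A fuller implementation would likely also inspect distribution patterns and vocabulary complexity, as the description indicates; the heuristic here looks only at value type and label cardinality.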
[0051] In one embodiment, the encoder type predictor module (308) is configured to encode the real-time input data based on the at least one encoding technique. The encoder type predictor module (308) applies the selected encoding technique to each classified portion of the real-time input data, ensuring that linguistic constructs, numerical values, categorical attributes, and other modality-specific characteristics are transformed into structured numerical representations. For the textual data, the encoding preserves contextual relationships and semantic dependencies by converting words, characters, and sub-word units into identifiable vector spaces. For the non-textual data, the encoding transforms numerical ranges, categorical identifiers, and feature values into normalized or scaled numeric forms that maintain proportional integrity and pattern stability. The encoding stage ensures that every data unit, regardless of modality, is represented in a consistent, machine-interpretable format that enables uniform downstream processing and accurate synthetic data generation.
[0052] In one embodiment, the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively.
[0053] By way of example, the encoding includes, but is not limited to, converting patient symptom descriptions into tokenized sequences, representing categorical gender attributes through one-hot encoding, normalizing temperature readings, or applying feature scaling to sensor outputs captured by a monitoring device.
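By way of a non-limiting illustration, the encoding techniques named above may be sketched as follows. The sketch is a minimal one under stated assumptions: the function names `tokenize`, `one_hot`, and `min_max_normalize`, the sample values, and the choice of min-max scaling as the normalization are hypothetical and are not prescribed by the present disclosure.

```python
import re


def tokenize(text):
    # Lowercase the text and split it into alphanumeric word tokens
    # (a deliberately simple tokenizer, assumed for illustration).
    return re.findall(r"[a-z0-9]+", text.lower())


def one_hot(value, categories):
    # One position per known category; 1 marks the matching category.
    return [1 if value == c else 0 for c in categories]


def min_max_normalize(values):
    # Rescale values to the range [0, 1]; a constant column maps to zeros.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample inputs.
tokens = tokenize("Patient reports mild fever and headache.")
gender = one_hot("female", ["female", "male", "other"])  # [1, 0, 0]
temps = min_max_normalize([36.5, 37.0, 38.5])            # [0.0, 0.25, 1.0]
```

Each function returns a plain list, so encoded textual data and encoded non-textual data remain separately inspectable before any later processing stage.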
[0054] In another embodiment, at least one encoding technique is selected dynamically based on data characteristics comprising dimensionality, sparsity, variance, and linguistic features of the real-time input data. The encoder type predictor module (308) evaluates each segment of the real-time input data to determine modality-specific and structure-specific indicators, including size of feature space, presence of missing values, distribution spread, token density, semantic dependency depth, and categorical granularity. Based on this evaluation, the encoder type predictor module (308) selects an encoding technique that aligns with the identified characteristics, ensuring that each portion of the real-time input data is represented in a numerically stable and context-appropriate manner. High-dimensional or sparse datasets may trigger dimensionality-aware encodings such as feature scaling or sparse-matrix normalization, while linguistically complex text may trigger embeddings or character-level encodings.
[0055] By way of example, the dynamic selection includes, but is not limited to, choosing word embeddings for real-time clinical narratives exhibiting dense linguistic structure, applying normalization to fluctuating sensor readings exhibiting high variance, selecting one-hot encoding for categorical operational statuses, or applying feature scaling to numerical measurements captured through environmental monitoring devices.
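By way of a non-limiting illustration, the dynamic selection may be sketched as a small rule-based function. The function name `select_encoding`, the variance ratio of 0.5, the token-count cutoff of 3, and the cardinality cutoff of 10 are assumptions chosen only for illustration; the disclosure does not specify particular thresholds or a rule-based selector.

```python
import statistics


def select_encoding(column):
    """Return an encoding name for one non-empty column of raw values.

    The rules and thresholds below are hypothetical.
    """
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    )
    if numeric:
        mean = statistics.fmean(column)
        spread = statistics.pstdev(column)
        # High spread relative to the mean suggests normalization.
        if mean != 0 and spread / abs(mean) > 0.5:
            return "normalization"
        return "feature_scaling"

    texts = [str(v) for v in column]
    avg_tokens = sum(len(t.split()) for t in texts) / len(texts)
    if avg_tokens > 3:
        # Dense free text: prefer word embeddings.
        return "word_embeddings"
    if len(set(texts)) <= 10:
        # Few distinct short labels: treat as categorical.
        return "one_hot"
    return "character_level"


# Hypothetical columns.
status_choice = select_encoding(["ok", "fault", "ok", "idle"])
note_choice = select_encoding(
    ["patient reports persistent mild fever since yesterday"]
)
sensor_choice = select_encoding([1.0, 50.0, 100.0])
```

A rule table of this kind keeps the selection auditable; a learned classifier over the same indicators would be an alternative design.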
[0056] In one embodiment, the segregator module (310) is configured to segregate the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups. The segregator module (310) examines structural indicators within the encoded data representations, such as linguistic sequencing patterns, contextual embedding characteristics, numeric distributions, categorical mappings, and scale-dependent feature values. Through this structural examination, the segregator module (310) classifies the encoded data into modality-specific groups, ensuring that each processing group corresponds to a consistent data type. The segregation allows the segregator module (310) to isolate linguistic constructs from quantitative or categorical information, enabling tailored downstream transformations for each group. The plurality of distinct processing groups further enables differential model processing pathways, ensuring that each data modality receives optimized and context-appropriate treatment during later stages of synthetic generation.
[0057] By way of example, the segregation includes, but is not limited to, separating tokenized symptom descriptions into a textual group, assigning normalized temperature values to a numerical group, and allocating encoded diagnostic categories into a categorical group derived from records captured by clinical monitoring systems.
[0058] In another embodiment, the segregating comprises classifying attributes into groups corresponding to linguistic features, numerical features, and categorical features. The segregator module (310) analyses each encoded attribute of the real-time input data to determine its inherent modality and assigns the attribute to an appropriate processing group. Linguistic features are identified through patterns such as token sequences, semantic embeddings, syntactic markers, or contextual dependencies present within the encoded textual data. Numerical features are recognized through continuous values, magnitude ranges, temporal fluctuations, and statistical signatures present in encoded quantitative data. Categorical features are detected through discrete identifiers, label sets, class memberships, or symbolic indicators embedded within the encoded non-textual data. By organizing the encoded attributes into these groups, the segregator module (310) ensures that each group receives transformations, refinements, and model-processing operations tailored to its modality.
[0059] By way of example, the groups include, but are not limited to, a linguistic group containing tokenized clinical descriptions, a numerical group containing normalized temperature readings, and a categorical group containing encoded diagnosis or status identifiers originating from monitored patient or operational systems.
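By way of a non-limiting illustration, the classification of encoded attributes into linguistic, numerical, and categorical groups may be sketched as follows. The sketch assumes, purely for illustration, that encoded text arrives as a list of string tokens, encoded categories as one-hot vectors, and encoded numerical features as scalars; the function name `segregate` and the field names are hypothetical.

```python
def segregate(encoded_record):
    """Assign each encoded attribute to a modality-specific group."""
    groups = {"linguistic": {}, "numerical": {}, "categorical": {}}
    for name, value in encoded_record.items():
        is_list = isinstance(value, list) and len(value) > 0
        if is_list and all(isinstance(v, str) for v in value):
            # A sequence of string tokens is treated as linguistic.
            groups["linguistic"][name] = value
        elif is_list and set(value) <= {0, 1} and sum(value) == 1:
            # A vector with exactly one 1 is treated as one-hot categorical.
            groups["categorical"][name] = value
        elif isinstance(value, (int, float)):
            # A bare scalar is treated as a numerical feature.
            groups["numerical"][name] = value
        else:
            raise ValueError(f"unrecognized encoding for attribute {name!r}")
    return groups


# Hypothetical encoded record.
record = {
    "symptoms": ["mild", "fever"],
    "temperature": 0.25,
    "diagnosis": [0, 1, 0],
}
groups = segregate(record)
```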
[0060] In another embodiment, the utility and requirement definition module (320) is configured to identify cross-domain relations among the encoded textual data and the encoded non-textual data prior to training the transformer-based model. The utility and requirement definition module (320) examines the encoded representations to detect semantic, contextual, statistical, and categorical linkages that exist across the textual and non-textual modalities. The utility and requirement definition module (320) analyses linguistic cues, entity references, descriptive patterns, and contextual anchors within the encoded textual data and correlates them with numerical trends, categorical identifiers, or temporal patterns present within the encoded non-textual data. Through this examination, the utility and requirement definition module (320) establishes cross-domain mappings that reveal how descriptive language, contextual statements, or narrative content align with quantitative or categorical features. The cross-domain relations form an integrated relational map used to ensure that the transformer-based model learns multimodal dependencies rather than treating each modality in isolation. The utility and requirement definition module (320) utilizes similarity scoring, correlation analysis, embedding-distance computation, and pattern clustering to detect shared associations, enabling the model to capture richer contextual understanding during training.
[0061] By way of example, the cross-domain relations include, but are not limited to, identifying alignment between symptom descriptions and numerical severity scores, linking textual shipment-delay notes with categorical delay statuses, or correlating descriptive environmental conditions with temperature or humidity measurements captured through monitoring devices.
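By way of a non-limiting illustration, one of the named techniques, correlation analysis, may be sketched as a Pearson correlation between a textual indicator and a numerical severity score. The function name `pearson`, the keyword "severe", and the sample notes and scores are hypothetical and serve only to illustrate how a cross-domain relation might be quantified.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient; returns 0.0 when either input
    # has zero variance, where the coefficient is undefined.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


# Hypothetical symptom notes paired with numerical severity scores.
notes = ["severe chest pain", "mild cough", "severe headache", "no complaints"]
severity = [9.0, 3.0, 8.0, 1.0]

# Textual indicator: 1.0 when the note contains the word "severe".
has_severe = [1.0 if "severe" in note.split() else 0.0 for note in notes]

# A coefficient near 1 suggests the keyword tracks the severity score.
relation_strength = pearson(has_severe, severity)
```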
[0062] In one embodiment, the training module (312) is configured to train a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data. The training module (312) receives the encoded textual data and the encoded non-textual data from prior processing stages and prepares them for multimodal training by aligning their structures within a shared training pipeline. The training module (312) feeds both modalities into the transformer-based model, enabling the transformer-based model to learn semantic relationships within the encoded textual data while simultaneously learning statistical patterns, value distributions, and categorical dependencies within the encoded non-textual data. The training process incorporates positional information, attention mechanisms, and modality-aware embedding structures to capture cross-modal relationships that exist between textual expressions and associated numerical or categorical indicators. Through repeated exposure to the encoded representations, the transformer-based model learns latent patterns such as linguistic cues aligned with quantitative changes, recurring correlations among categorical features, and contextual behaviours embedded within the real-time input data. The training module (312) updates the transformer-based model parameters throughout the training process using gradient-based optimization techniques, ensuring that the model captures both modality-specific and multimodal dependencies necessary for generating accurate synthetic data.
[0063] By way of example, the transformer-based model learns, but is not limited to learning, relationships between descriptive clinical notes and corresponding vital-sign ranges, correlations between shipment-delay text entries and categorical delay statuses, or associations between environmental condition descriptions and numerical sensor readings originating from monitoring devices.
[0064] In one embodiment, the training module (312) is configured to fine-tune the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model. The training module (312) computes a plurality of validation statistics on held-out or cross-validation sets and derives feedback signals therefrom. The training module (312) adjusts the transformer-based model parameters and training hyperparameters responsively, including learning rate schedules, loss weighting factors, regularization strength, attention-head pruning, and checkpoint selection, where adjustments are driven by metric-derived objectives. The fine-tuning operates as an iterative closed-loop process in which the processor evaluates the synthetic data output against the plurality of evaluation metrics, updates the transformer-based model using gradient-based or meta-optimization procedures, and repeats evaluation until convergence criteria corresponding to predefined metric thresholds are satisfied. The training module (312) further records metric histories and model states to enable reproducibility, rollback, and audit of tuning decisions.
[0065] By way of example, the plurality of evaluation metrics includes, but is not limited to, perplexity and token-level accuracy for measuring textual coherence; Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), or embedding-similarity scores for assessing contextual relevance; Kullback-Leibler (KL) divergence, Wasserstein distance, or distributional-overlap measures for evaluating statistical fidelity; Fréchet Inception Distance (FID) or Inception Score (IS) for determining image realism; and fairness, calibration, and privacy metrics such as demographic parity, calibration error, and differential-privacy epsilon for assessing ethical and regulatory compliance.
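By way of a non-limiting illustration, two of the named metrics, perplexity and Kullback-Leibler (KL) divergence, may be computed as sketched below using their standard definitions. The function names, the smoothing constant `eps`, and the sample distributions are assumptions introduced for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions over the same bins.
    # eps guards against division by zero when q has an empty bin.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def perplexity(token_probs):
    # Exponential of the mean negative log-likelihood that the model
    # assigned to each observed token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical histograms of a categorical field in source and synthetic data.
source_hist = [0.7, 0.2, 0.1]
synthetic_hist = [0.6, 0.3, 0.1]
fidelity_gap = kl_divergence(source_hist, synthetic_hist)

# Hypothetical per-token probabilities for one generated sentence.
text_perplexity = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values of both quantities indicate closer agreement: a KL divergence of zero means the two histograms coincide, and a perplexity of 4 corresponds to a model that is, on average, as uncertain as a uniform choice among four tokens.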
[0066] In another embodiment, the training module (312) is configured to apply one or more fine-tuning parameters based on the plurality of evaluation metrics comprising at least one of accuracy, calibration, bias reduction, and contextual alignment. The training module (312) evaluates intermediate model outputs against the plurality of evaluation metrics and, based on the evaluation results, adjusts tuning parameters such as learning rate, regularization strength, loss-weight distributions, gradient thresholds, and attention-weight allocations. The training module (312) further modifies decoding configurations, layer sensitivities, and embedding adjustments to correct underperformance detected through the evaluation metrics. Accuracy-driven updates prioritize correctness and consistency, calibration-driven updates correct probability imbalances or confidence distortions, bias-reduction adjustments mitigate representational or distributional skew, and contextual-alignment updates strengthen semantic coherence between generated outputs and contextual cues.
[0067] By way of example, the fine-tuning includes, but is not limited to, adjusting loss weights when accuracy drops on a validation set, correcting miscalibrated confidence levels in categorical predictions, reducing skew in outputs that disproportionately overrepresent a category, or enhancing contextual adherence when generated text deviates from provided keywords or topics.
[0068] In another embodiment, the contextual cues comprise at least one of user-provided topics, metadata, ontology references, and task directives guiding the conditional textual synthetic data generation. The training module (312) receives the contextual cues and interprets them as semantic anchors that define the intended direction, specificity, tone, or thematic boundaries of the conditional textual synthetic data. The training module (312) transforms each contextual cue into a structured representation compatible with the transformer-based model, including vector embeddings, control tokens, prompt templates, or soft-guidance signals. The structured representations influence attention weighting, token selection, and contextual propagation during generation so that the resulting conditional textual synthetic data remains aligned with the provided contextual cues. The training module (312) further ensures that contextual cues derived from metadata or ontology references are consistently mapped to domain-specific terminology and structural patterns, enabling the generated text to maintain domain accuracy and semantic continuity.
[0069] By way of example, the contextual cues include, but are not limited to, a user-provided topic such as “post-operative recovery,” metadata indicating a timestamp or source system, ontology references specifying clinical terminology links, or task directives instructing the system to produce a concise summary, a detailed explanation, or an event-focused narrative.
[0070] In one embodiment, the conditional textual generator module (314) is configured to generate conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance. The conditional textual generator module (314) receives the contextual cues corresponding to a target generation goal and converts the cues into context vectors that are aligned with the encoder and decoder spaces of the transformer-based model. The conditional textual generator module (314) conditions the transformer-based model on the context vectors by inserting control tokens, prompt templates, or soft-prompt embeddings into the model input and by adjusting attention masks to prioritize context-relevant tokens during autoregressive or sequence-to-sequence generation. The processor further applies context-aware decoding strategies, including but not limited to top-k sampling, nucleus sampling, and temperature scheduling, to balance creativity and fidelity. The conditional textual generator module (314) enforces hard or soft constraints derived from the contextual cues using constrained decoding, beam scoring, or guided attention to preserve required facts or topical focus.
[0071] By way of example, the contextual cues include, but are not limited to, keywords such as “post-operative pain,” “cold chain failure,” or “customer churn,” and topics such as “cardiology follow-up,” “logistics exception,” or “monthly retention analysis.” The system conditions on a user-provided ontology entry for “post-operative pain” and uses a prompt template that instructs the transformer-based model to generate a 2-3 sentence clinical summary tied to recent vital-sign readings.
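By way of a non-limiting illustration, the top-k sampling and nucleus sampling strategies named above may be sketched over a toy next-token distribution. The function names, the candidate tokens, and their probabilities are hypothetical; in practice the distribution would be produced by the transformer-based model conditioned on the contextual cues.

```python
import random


def top_k_filter(probs, k):
    # Keep the k most probable tokens and renormalize their probabilities.
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}


def nucleus_filter(probs, p):
    # Keep the smallest set of top-ranked tokens whose cumulative
    # probability reaches p, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, rng):
    # Draw one token according to the (filtered) distribution.
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution after the cue "post-operative pain".
next_token = {"pain": 0.5, "fever": 0.3, "stable": 0.15, "banana": 0.05}

top2 = top_k_filter(next_token, 2)
nucleus = nucleus_filter(next_token, 0.9)
choice = sample(nucleus, random.Random(0))
```

Both filters discard the low-probability, off-topic candidate before sampling, which is one way such decoding strategies trade diversity against fidelity to the cue.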
[0072] In one embodiment, the aggregator module (316) is configured to combine the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset. The aggregator module (316) receives the conditional textual synthetic data generated based on contextual cues and the non-textual synthetic data generated from encoded numerical or categorical patterns. The aggregator module (316) evaluates each modality to determine structural requirements, including field formats, schema arrangements, dimensional consistency, and relational dependencies. The aggregator module (316) applies alignment logic that maps textual references, semantic cues, and narrative context within the conditional textual synthetic data to the corresponding numerical ranges, categorical identifiers, or statistical attributes of the non-textual synthetic data. This alignment ensures that both modalities coexist meaningfully within a unified schema. The aggregator module (316) further standardizes naming conventions, synchronizes timestamps or record identifiers, and ensures that each synthetic entry maintains coherent relationships across modalities.
[0073] By way of example, the combining includes, but is not limited to, aligning a synthetic clinical note describing “persistent mild fever” with synthetic temperature values slightly above the normal range, pairing a synthetic logistics status note referencing “inventory shortages” with reduced numerical stock counts, or matching a synthetic workplace incident description with categorical severity labels that reflect the narrative context.
[0074] In another embodiment, the combining comprises applying alignment rules configured to maintain coherence between semantic content of the conditional textual synthetic data and quantitative context of the non-textual synthetic data. The aggregator module (316) evaluates the conditional textual synthetic data to extract semantic indicators, contextual anchors, descriptive expressions, and topic-relevant cues, and simultaneously analyses the non-textual synthetic data to assess numerical distributions, categorical values, temporal patterns, or other quantitative characteristics. The aggregator module (316) applies alignment rules that map linguistic references in the conditional textual synthetic data to corresponding quantitative or categorical features in the non-textual synthetic data, ensuring that descriptive statements are contextually consistent with associated numerical or categorical representations. The alignment rules may incorporate semantic-numerical correlation thresholds, context-matching heuristics, feature-aware validation logic, and multimodal embedding comparisons to ensure that textual meaning and quantitative context reinforce one another.
[0075] By way of example, the alignment rules include, but are not limited to, ensuring that a synthetic text description referencing “high fever” corresponds to an elevated temperature value, confirming that descriptive notes mentioning “low inventory” match reduced stock counts, or verifying that narrative statements about “delayed shipment” align with categorical labels representing a delay status.
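By way of a non-limiting illustration, alignment rules of the kind described above may be sketched as a table that maps a textual phrase to a field and a check on that field. The rule table, the field names `temperature_c`, `stock_count`, and `status`, and the numerical cutoffs (39.0 degrees Celsius, 10 units) are assumptions chosen for illustration and are not values taken from the present disclosure.

```python
# Hypothetical rules: phrase -> (field name, predicate on the field value).
ALIGNMENT_RULES = {
    "high fever": ("temperature_c", lambda v: v >= 39.0),
    "low inventory": ("stock_count", lambda v: v <= 10),
    "delayed shipment": ("status", lambda v: v == "DELAYED"),
}


def violations(record):
    """Return the phrases whose quantitative counterpart is missing or inconsistent."""
    text = record["text"].lower()
    failed = []
    for phrase, (field, check) in ALIGNMENT_RULES.items():
        if phrase not in text:
            continue
        # A referenced field that is absent also counts as a violation.
        if field not in record or not check(record[field]):
            failed.append(phrase)
    return failed


# Hypothetical synthetic records.
consistent = {"text": "Patient presents with high fever.", "temperature_c": 39.4}
inconsistent = {"text": "High fever noted overnight.", "temperature_c": 36.8}

consistent_result = violations(consistent)      # []
inconsistent_result = violations(inconsistent)  # ["high fever"]
```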
[0076] In one embodiment, the aggregator module (316) is configured to process the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data. The aggregator module (316) evaluates the unified synthetic dataset by examining semantic alignment, contextual continuity, and linguistic coherence within the conditional textual synthetic data, while simultaneously assessing statistical distributions, variance stability, categorical frequencies, and numerical correlations within the non-textual synthetic data. The aggregator module (316) applies modality-specific refinement procedures, including semantic consistency scoring for the textual portion and distribution-preservation adjustments for the non-textual portion. The aggregator module (316) further performs cross-modality checks to ensure that textual descriptions, references, or contextual narratives remain aligned with corresponding numerical or categorical values.
[0077] By way of example, the processing includes, but is not limited to, verifying that a synthetic clinical summary referencing “elevated temperature” aligns with a numerical temperature value within an elevated range, ensuring that a synthetic logistics report describing “delayed shipment” corresponds to a categorical status representing delay, and confirming that descriptive sales text mentioning “increasing demand” aligns with upward-trending numerical sales indicators.
[0078] In another embodiment, the evaluator module (340) is configured to validate the unified synthetic dataset against one or more predefined statistical thresholds and linguistic thresholds to ensure representational fidelity. The processor performs a structured validation sequence in which the unified synthetic dataset is examined for both statistical correctness and semantic coherence. The evaluator module (340) evaluates the non-textual synthetic data using predefined statistical thresholds that may correspond to distributional similarity, variance stability, correlation preservation, categorical proportionality, temporal regularity, or other quantitative indicators derived from the characteristics of the real-time input data. In parallel, the evaluator module (340) validates the conditional textual synthetic data using predefined linguistic thresholds, which may include semantic similarity scoring, contextual relevance assessment, syntactic integrity evaluation, and coherence-based checks. The evaluator module (340) compares the unified synthetic dataset against these thresholds to determine whether each synthetic record adheres to expected multimodal behaviour. Records that fail to satisfy either statistical or linguistic thresholds are flagged for refinement, recalibration, or regeneration, ensuring that only fidelity-compliant outputs proceed to downstream use.
[0079] By way of example, the validation includes, but is not limited to, confirming that synthetic temperature readings fall within statistically realistic ranges, verifying that categorical distributions resemble proportions seen in source data, and ensuring that synthetic textual content referencing “increasing values” is linguistically consistent with upward numerical trends embedded in the same synthetic record.
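By way of a non-limiting illustration, validation against predefined statistical thresholds may be sketched as a range check on numerical values combined with a comparison of categorical proportions. The use of total variation distance as the proportionality measure, the threshold `max_tv`, the temperature range, and all function names are assumptions made for illustration only.

```python
from collections import Counter


def proportions(labels):
    # Relative frequency of each categorical label.
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def total_variation(p, q):
    # Half the sum of absolute differences between two distributions.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def validate(source_labels, synthetic_labels, synthetic_temps,
             max_tv=0.1, temp_range=(34.0, 42.0)):
    # Pass only if categorical proportions are close to the source and
    # every synthetic temperature lies within the assumed realistic range.
    tv = total_variation(proportions(source_labels),
                         proportions(synthetic_labels))
    low, high = temp_range
    in_range = all(low <= t <= high for t in synthetic_temps)
    return {"tv": tv, "in_range": in_range, "passes": tv <= max_tv and in_range}


# Hypothetical source and synthetic samples.
source = ["ok"] * 8 + ["delayed"] * 2
good = validate(source, ["ok"] * 8 + ["delayed"] * 2, [36.6, 37.2, 38.9])
skewed = validate(source, ["ok"] * 5 + ["delayed"] * 5, [36.6, 37.2])
out_of_range = validate(source, ["ok"] * 8 + ["delayed"] * 2, [36.6, 55.0])
```

Records or batches for which `passes` is false would correspond to the outputs flagged for refinement, recalibration, or regeneration.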
[0080] In another embodiment, the aggregator module (316) is configured to normalize embeddings corresponding to the encoded textual data and the encoded non-textual data through multi-dimensional consistency mapping prior to the output of the unified synthetic dataset. The evaluator module (340) generates embedding representations for each modality and evaluates their dimensional scales, vector magnitudes, spatial distributions, and relational proximities within a shared embedding space. The evaluator module (340) applies normalization techniques to reduce modality-induced imbalance, ensuring that the encoded textual data and the encoded non-textual data occupy harmonized vector regions that support coherent multimodal interpretation. The multi-dimensional consistency mapping aligns semantic embeddings from textual constructs with numerical, categorical, or feature-based embeddings from non-textual constructs by adjusting vector scales, applying distribution-matching transformations, and enforcing similarity constraints derived from prior cross-domain relationships.
[0081] By way of example, the normalization includes, but is not limited to, adjusting vector scales of token embeddings derived from clinical descriptions to match normalized ranges of sensor-measurement embeddings, equalizing categorical-attribute embeddings to align with distributional patterns of encoded status indicators, or harmonizing embeddings derived from descriptive event logs with embeddings representing numerical trends captured by environmental monitoring systems.
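By way of a non-limiting illustration, the adjustment of vector scales may be sketched as unit-length (L2) normalization, after which embeddings of very different magnitudes become directly comparable. Unit-length scaling is an assumed stand-in for one step of the multi-dimensional consistency mapping, which the disclosure does not define in formulaic terms; the vectors shown are hypothetical.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    # Dot product of the two unit-length vectors.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Hypothetical embeddings with very different magnitudes.
text_embedding = [30.0, 40.0]    # e.g. derived from a clinical description
sensor_embedding = [0.3, 0.4]    # e.g. derived from a normalized reading

text_unit = l2_normalize(text_embedding)
sensor_unit = l2_normalize(sensor_embedding)
similarity = cosine_similarity(text_embedding, sensor_embedding)
```

After normalization the two vectors point in the same direction with the same length, so magnitude differences introduced by the modality no longer dominate the comparison.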
[0082] In another embodiment, the aggregator module (316) is configured to perform iterative differential refinement on the unified synthetic dataset to increase semantic accuracy and structural balance between the conditional textual synthetic data and the non-textual synthetic data. The aggregator module (316) evaluates the unified synthetic dataset through successive refinement cycles in which semantic coherence, contextual precision, numerical proportionality, and categorical consistency are assessed at each iteration. During each cycle, the processor identifies semantic deviations, contextual drift, numerical inconsistencies, or categorical mismatches that may weaken the structural or interpretive integrity of the unified synthetic dataset. The aggregator module (316) applies differential refinement operations such as selective correction of textual segments, recalibration of numerical values, redistribution of categorical frequencies, or fine-grained alignment of modality relationships to iteratively adjust and improve the multimodal output. The iterative adjustments ensure that the conditional textual synthetic data remains semantically accurate and contextually aligned, while the non-textual synthetic data retains stable statistical properties, realistic ranges, and modality-appropriate patterns. With each refinement pass, the aggregator module (316) narrows deviations from desired semantic and statistical thresholds, ultimately producing a balanced synthetic representation that harmonizes both modalities.
[0083] By way of example, the iterative differential refinement includes, but is not limited to, adjusting a synthetic narrative referencing “improving condition” to ensure numerical indicators reflect upward trends, correcting categorical labels that conflict with descriptive text, or refining contextual segments so that descriptive statements align with synthetic sensor values captured from monitoring systems.
[0084] In another embodiment, the synthetic data generation module (318) is configured to generate synthetic data based on the real-time input data wherein the synthetic data mimics the structure of the real-time input data. The synthetic data generation module (318) receives structurally analysed and modality-specific encoded representations of the real-time input data and utilizes these representations as training references for transformer-based generation. The processor identifies structural patterns, contextual relationships, statistical distributions, and categorical dependencies present within the real-time input data and employs these learned characteristics to guide the generation of synthetic data. The synthetic data produced by the synthetic data generation module (318) maintains coherence with the organizational form, semantic arrangement, and statistical behaviours inherent in the real-time input data, ensuring that each generated record reflects realistic relationships between linguistic content and quantitative or categorical values. By way of example, the synthetic data includes, but is not limited to, synthetic clinical summaries reflecting the structure of corresponding patient narratives, synthetic sensor readings reflecting the temporal patterns of captured measurements, or synthetic categorical labels preserving the distribution of operational statuses derived from monitoring devices within a controlled environment.
[0085] In one embodiment, the aggregator module (316) is configured to output the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity. The processor prepares the unified synthetic dataset for output by organizing the conditional textual synthetic data and the non-textual synthetic data into a unified, machine-readable structure that preserves the alignment established during earlier processing stages. The aggregator module (316) maintains correspondence between textual expressions and their associated numerical or categorical attributes through consistent indexing, schema alignment, and modality-aware formatting. The unified synthetic dataset is arranged such that each synthetic record contains semantically coherent textual descriptions and statistically consistent non-textual values, ensuring compatibility with downstream analytical, operational, or machine-learning workflows. The aggregator module (316) may convert the unified synthetic dataset into one or more output formats suitable for storage, visualization, or deployment, while preserving the contextual dependencies and statistical properties that characterize the synthesized information.
[0086] By way of example, the output includes, but is not limited to, structured clinical records containing synthetic narrative summaries aligned with synthetic vital-sign values, synthetic warehouse logs containing descriptive status text aligned with numerical inventory counts, and synthetic customer reports containing contextual statements paired with categorical churn indicators.
[0087] In another embodiment, the user interface (348) is configured to present the unified synthetic dataset to a user (108, FIG. 1) operating a user device (104, FIG. 1). In another embodiment, the user interface (348) receives user feedback associated with one or more of refinement, evaluation, or selection of the unified synthetic dataset. The user interface (348) transmits the unified synthetic dataset to the user device (104, FIG. 1) in a structured format that displays both the conditional textual synthetic data and the non-textual synthetic data in an interpretable arrangement. Through the user interface (348), the processor (302) enables the user (108, FIG. 1) to review semantic content, numerical values, and categorical indicators associated with each synthetic record and to provide feedback for improving coherence, accuracy, or contextual suitability. The interaction includes, but is not limited to, presenting a plurality of synthetic symptom descriptions, a plurality of synthetic sensor readings, and a plurality of facial-image-derived indicators captured by a camera positioned within a monitoring environment.
[0088] Consider a non-limiting example where the system (102) is deployed within a hospital network to enable secure, multimodal data generation for clinical analytics, medical research, and model development without exposing real patient information. The receiving module (306) receives real-time input data from a plurality of sources corresponding to electronic health records, monitoring devices, laboratory information systems, and clinician-generated notes, wherein the real-time input data comprises textual data and non-textual data. The textual data includes physician observations, triage notes, discharge summaries, and symptom descriptions, while the non-textual data includes numerical vital-sign measurements, categorical diagnosis codes, medication identifiers, laboratory ranges, and device-captured readings. The utility and requirement definition module (320) analyses a structure of the real-time input data to determine at least one encoding technique corresponding to each modality, and the encoder type predictor module (308) encodes the data into encoded textual data and encoded non-textual data, enabling consistent downstream transformation.
[0089] The segregator module (310) segregates the encoded textual data and the encoded non-textual data by identifying features differentiating linguistic constructs from numerical or categorical portions, thereby forming a plurality of distinct processing groups. Prior to training, the utility and requirement definition module (320) identifies cross-domain relations among symptom descriptions, device measurements, and coded findings to establish clinically meaningful associations. The training module (312) trains a transformer-based model using these encoded representations, and the training module (312) further fine-tunes the model based on a plurality of evaluation metrics comprising at least one of accuracy, calibration, bias reduction, and contextual alignment to ensure medically reliable synthetic output.
[0090] During generation, the conditional textual generator module (314) produces conditional textual synthetic data based on contextual cues such as user-provided topics (“respiratory distress,” “post-operative monitoring”), metadata (timestamps, ward identifiers), ontology references (clinical terminology mappings), and task directives (summaries, assessments). The synthetic data generation module (318) generates corresponding non-textual synthetic data aligned with numerical and categorical patterns. The aggregator module (316) then combines the conditional textual synthetic data and the non-textual synthetic data by applying alignment rules that maintain coherence between semantic content and quantitative context, such as ensuring that “elevated heart rate” corresponds to a numerically high synthetic pulse value. The evaluator module (340) validates the unified synthetic dataset against predefined statistical thresholds and linguistic thresholds, guaranteeing that numerical ranges, distribution properties, and domain-specific textual structures remain realistic.
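As a non-limiting illustration, an alignment rule of the kind described above may be sketched as a phrase paired with a numeric predicate. The rule structure, the field name `pulse_bpm`, and the 100 bpm cut-off are assumptions introduced only for this sketch and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlignmentRule:
    """Pairs a textual phrase with a check on one numeric field (hypothetical structure)."""
    phrase: str
    field: str
    predicate: Callable[[float], bool]


# Illustrative rule set; the 100 bpm cut-off is an assumption, not a value from the disclosure.
RULES: List[AlignmentRule] = [
    AlignmentRule("elevated heart rate", "pulse_bpm", lambda v: v > 100),
]


def violated_rules(text: str, values: Dict[str, float]) -> List[str]:
    """Return the phrases found in `text` whose paired numeric value is missing or fails the rule."""
    lowered = text.lower()
    violations = []
    for rule in RULES:
        if rule.phrase in lowered:
            value = values.get(rule.field)
            if value is None or not rule.predicate(value):
                violations.append(rule.phrase)
    return violations
```

A record whose text mentions "elevated heart rate" while its synthetic pulse value is 72 would be reported as a violation, so that the combining step can regenerate or adjust one of the two modalities.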
[0091] Before outputting the dataset, the aggregator module (316) normalizes embeddings corresponding to the encoded textual data and the encoded non-textual data through multi-dimensional consistency mapping, ensuring that both modalities share harmonized representation spaces. The aggregator module (316), in communication with the evaluator module (340), further performs iterative differential refinement on the unified synthetic dataset to increase semantic accuracy and structural balance across modalities. The resulting unified synthetic dataset is then output as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity.
[0092] FIG. 4 is a flow chart representing the steps involved in a method for unstructured synthetic data generation, in accordance with an embodiment of the present disclosure; FIG. 4 (b) illustrates continued steps of the method of FIG. 4 (a) in accordance with an embodiment of the present disclosure.
[0093] The method (400) includes receiving, by a processor, real-time input data from a plurality of sources, wherein the real-time input data comprises textual data and non-textual data in step 405. The processor classifies the real-time input data upon reception by identifying modality-specific characteristics that differentiate linguistic expressions from numerical or categorical patterns. The plurality of sources includes, but is not limited to, clinical systems providing a plurality of pain levels, monitoring devices capturing a plurality of facial images, and a camera positioned within a patient area to record real-time visual information.
[0094] The method (400) includes analysing, by the processor, a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data in step 410. The processor identifies linguistic patterns, numerical ranges, and categorical formations to classify each portion for modality-appropriate encoding. The analysis includes, but is not limited to, examining a plurality of symptom descriptions, a plurality of sensor measurements, and a plurality of facial images captured by a camera installed within a monitoring area.
[0095] The method (400) includes encoding, by the processor, the real-time input data based on the at least one encoding technique wherein the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively in step 415. The processor assigns each data portion to its modality-specific encoding pathway to preserve linguistic structure or numerical and categorical integrity. The encoding includes, but is not limited to, converting a plurality of symptom descriptions into token sequences, normalizing a plurality of sensor readings, and scaling a plurality of facial-image-derived metrics captured by a camera installed within a monitoring area.
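As a non-limiting illustration, three of the encoding techniques named in step 415 may be sketched as follows: word-level tokenization for textual data, and min-max normalization and one-hot encoding for non-textual data. The function names and the choice of min-max scaling as the normalization are assumptions made for this sketch; the disclosure does not fix a particular implementation.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase word-level tokenization (one possible textual encoding)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale numeric values into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels: List[str]) -> List[List[int]]:
    """One-hot encode categorical labels using alphabetically sorted categories."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]
```

For instance, a symptom description would pass through `tokenize`, a column of pulse readings through `min_max_normalize`, and a column of diagnosis codes through `one_hot`.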
[0096] The method (400) includes segregating, by the processor, the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups in step 420. The processor distinguishes linguistic embeddings, numerical values, and categorical indicators to assign each encoded element to a modality-specific group. The segregation includes, but is not limited to, allocating a plurality of tokenized symptom descriptions to a textual group, assigning a plurality of normalized sensor readings to a numerical group, and grouping a plurality of facial-image-derived categorical attributes captured by a camera positioned in a monitoring area.
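As a non-limiting illustration, the formation of textual, numerical, and categorical processing groups may be sketched with a simple type-based heuristic. For readability the sketch operates on raw field values rather than on encoded representations, and the heuristic itself (numbers are numerical, multi-word strings are textual, everything else is categorical) is an assumption introduced here, not the segregation logic of the disclosure.

```python
from typing import Any, Dict


def segregate(record: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Assign each field of a record to a textual, numerical, or categorical group."""
    groups: Dict[str, Dict[str, Any]] = {
        "textual": {},
        "numerical": {},
        "categorical": {},
    }
    for name, value in record.items():
        # bool is a subclass of int in Python, so it is excluded from the numerical group.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            groups["numerical"][name] = value
        elif isinstance(value, str) and len(value.split()) > 1:
            groups["textual"][name] = value
        else:
            groups["categorical"][name] = value
    return groups
```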
[0097] The method (400) includes training, by the processor, a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data in step 425. The processor feeds modality-specific encoded representations into the transformer-based model, enabling the model to recognize linguistic relationships, numerical trends, and categorical behaviours. The training includes, but is not limited to, learning associations between a plurality of encoded symptom descriptions, a plurality of normalized sensor measurements, and a plurality of facial-image-derived indicators captured by a camera positioned within a monitoring environment.
[0098] The method (400) includes fine-tuning, by the processor, the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model in step 430. The processor applies tuning adjustments derived from accuracy measurements, calibration checks, bias-reduction indicators, and contextual-alignment scores to refine model behaviour across all modalities. The fine-tuning includes, but is not limited to, evaluating a plurality of encoded symptom descriptions, a plurality of normalized sensor readings, and a plurality of facial-image-derived indicators captured by a camera positioned within a monitoring environment.
[0099] The method (400) includes generating, by the processor, conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance in step 435. The processor interprets the contextual cues as semantic guides and integrates them into the generation pathway so that the resulting textual output aligns with the intended thematic direction. The generating includes, but is not limited to, producing synthetic sentences aligned with a plurality of clinical keywords, constructing narrative segments linked to a plurality of operational topics, and generating context-driven descriptions associated with a plurality of facial-image-derived indicators captured by a camera positioned within a monitoring environment.
[0100] The method (400) includes combining, by the processor, the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset in step 440. The processor evaluates semantic cues within the conditional textual synthetic data and matches them with numerical or categorical characteristics present in the non-textual synthetic data to ensure structural and contextual harmony. The combining includes, but is not limited to, aligning a plurality of synthetic symptom descriptions with a plurality of synthetic sensor readings, pairing a plurality of topic-driven textual segments with a plurality of categorical status values, and associating narrative elements with a plurality of facial-image-derived indicators captured by a camera positioned within a monitoring environment.
[0101] The method (400) includes processing, by the processor, the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data in step 445. The processor verifies that narrative expressions, contextual segments, and descriptive cues remain aligned with numerical ranges, categorical proportions, and modality-specific characteristics embedded within the non-textual synthetic data.
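As a non-limiting illustration, one way to check statistical compatibility of a numerical column is to compare the mean and standard deviation of the synthetic values against those of reference values. The function name, the use of relative gaps, and the default tolerances of 0.1 are assumptions made for this sketch; the disclosure does not specify which statistics or thresholds are used.

```python
from statistics import mean, pstdev
from typing import Sequence


def within_statistical_threshold(
    real: Sequence[float],
    synthetic: Sequence[float],
    max_mean_gap: float = 0.1,
    max_std_gap: float = 0.1,
) -> bool:
    """Return True when the relative gaps in mean and standard deviation are both within tolerance."""
    real_mean, real_std = mean(real), pstdev(real)
    # Fall back to an absolute gap when the reference statistic is zero.
    mean_gap = abs(mean(synthetic) - real_mean) / (abs(real_mean) or 1.0)
    std_gap = abs(pstdev(synthetic) - real_std) / (real_std or 1.0)
    return mean_gap <= max_mean_gap and std_gap <= max_std_gap
```

A synthetic pulse column that tracks the reference distribution passes the check, whereas a column shifted far from the reference mean fails it and can be flagged for regeneration.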
[0102] The method (400) includes outputting, by the processor, the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity in step 450. The processor formats the unified synthetic dataset into an organized representation in which contextual textual elements correspond coherently with numerical values and categorical attributes to form an interpretable multimodal record. The outputting includes, but is not limited to, delivering a plurality of synthetic symptom descriptions aligned with a plurality of synthetic sensor readings, associating a plurality of context-driven narrative segments with a plurality of categorical status indicators, and pairing descriptive elements with a plurality of facial-image-derived indicators captured by a camera positioned within a monitoring environment.
[0103] Thus, various embodiments of the system and method for unstructured synthetic data generation provide several benefits. By analysing the structure of the real-time input data and selecting at least one encoding technique corresponding to each of the textual data and the non-textual data, the system ensures that each modality is processed in a manner that preserves its contextual and statistical characteristics. The ability to segregate the encoded textual data and the encoded non-textual data into distinct processing groups enhances processing efficiency and prevents cross-modality interference. Through the application of fine-tuning parameters based on the plurality of evaluation metrics, the transformer-based model achieves improved quality, realism, and contextual consistency. The system further ensures that the conditional textual synthetic data remains coherent with the quantitative context of the non-textual synthetic data by applying alignment rules during combining. Validation of the unified synthetic dataset against predefined statistical thresholds and linguistic thresholds strengthens representational fidelity, while multi-dimensional consistency mapping supports normalized embeddings across modalities. The iterative differential refinement performed by the processor provides additional enhancement by increasing semantic accuracy and structural balance, resulting in a unified synthetic dataset that retains semantic integrity and statistical fidelity suitable for downstream analytical, operational, or machine-learning applications.
Claims
WE CLAIM:
1. A system for unstructured synthetic data generation, comprising: a processor; and a memory coupled to the processor, wherein the memory comprises instructions that, when executed by the processor, cause the processor to: receive real-time input data from a plurality of sources, wherein the real-time input data comprises textual data and non-textual data; analyse a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data; encode the real-time input data based on the at least one encoding technique wherein the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively; segregate the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups; train a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data; fine-tune the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model; generate conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance; combine the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset; process the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data; and output the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity.
2. The system as claimed in claim 1, to cause the processor to generate synthetic data based on the real-time input data wherein the synthetic data mimics the structure of the real-time input data.
3. The system as claimed in claim 1, wherein the at least one encoding technique is selected dynamically based on data characteristics comprising dimensionality, sparsity, variance, and linguistic features of the real-time input data.
4. The system as claimed in claim 1, to cause the processor to identify cross-domain relations among the encoded textual data and the encoded non-textual data prior to training the transformer-based model.
5. The system as claimed in claim 1, wherein the segregating comprises classifying attributes into groups corresponding to linguistic features, numerical features, and categorical features.
6. The system as claimed in claim 1, to cause the processor to apply fine-tuning parameters based on the plurality of evaluation metrics comprising at least one of accuracy, calibration, bias reduction, and contextual alignment.
7. The system as claimed in claim 1, wherein the combining comprises applying alignment rules configured to maintain coherence between semantic content of the conditional textual synthetic data and quantitative context of the non-textual synthetic data.
8. The system as claimed in claim 1, to cause the processor to validate the unified synthetic dataset against predefined statistical thresholds and linguistic thresholds to ensure representational fidelity.
9. The system as claimed in claim 1, wherein the contextual cues comprise at least one of user-provided topics, metadata, ontology references, and task directives guiding the conditional textual synthetic data generation.
10. The system as claimed in claim 1, to cause the processor to normalize embeddings corresponding to the encoded textual data and the encoded non-textual data through multi-dimensional consistency mapping prior to the output of the unified synthetic dataset.
11. The system as claimed in claim 1, to cause the processor to perform iterative differential refinement on the unified synthetic dataset to increase semantic accuracy and structural balance between the conditional textual synthetic data and the non-textual synthetic data.
12. The system as claimed in claim 1, to cause the processor to: present the unified synthetic dataset to a user device via a user interface, wherein the user interface receives a user feedback associated with one or more refinement, evaluation, or selection of the unified synthetic dataset.
13. A method for unstructured synthetic data generation, comprising: receiving, by a processor, real-time input data from a plurality of sources, wherein the real-time input data comprises textual data and non-textual data; analysing, by the processor, a structure of the real-time input data to determine at least one encoding technique corresponding to each of the textual data and the non-textual data; encoding, by the processor, the real-time input data based on the at least one encoding technique wherein the encoding comprises at least one of tokenization, word embeddings, and character-level encoding for the textual data, and at least one of normalization, one-hot encoding, and feature scaling for the non-textual data to generate encoded textual data and encoded non-textual data respectively; segregating, by the processor, the encoded textual data and the encoded non-textual data by identifying patterns and structures differentiating the textual data from numerical and categorical portions of the non-textual data, thereby forming a plurality of distinct processing groups; training, by the processor, a transformer-based model using the encoded textual data and the encoded non-textual data to learn distributions, dependencies, and patterns corresponding to the real-time input data; fine-tuning, by the processor, the transformer-based model based on a plurality of evaluation metrics to improve quality, realism, and contextual consistency of synthetic data output by the transformer-based model; generating, by the processor, conditional textual synthetic data based on contextual cues comprising at least one of keywords and topics to ensure coherence and contextual relevance; combining, by the processor, the conditional textual synthetic data and non-textual synthetic data by aligning corresponding formats and structures to produce a unified synthetic dataset; processing, by the processor, the unified synthetic dataset to preserve semantic meaning of the conditional textual synthetic data and to maintain statistical compatibility of the non-textual synthetic data; and outputting, by the processor, the unified synthetic dataset as a structured fusion of textual and non-textual information retaining semantic integrity and statistical fidelity.