A system and a method for privacy preserved and fair synthetic data with proofs

The system generates privacy-preserving and fair synthetic data by reconciling schemas, training generative models, and providing mathematical proofs to address privacy and fairness concerns, enhancing data ingestion efficiency and compliance.

WO2026132993A1 · PCT designated stage · Publication Date: 2026-06-25 · PRIVASAPIEN TECH PTE LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: PRIVASAPIEN TECH PTE LTD
Filing Date: 2025-12-10
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Existing data systems struggle to create high-quality synthetic data that preserves both privacy and fairness: sensitive information must remain protected, bias in the source data must not carry over, and complex data-ingestion workflows and regulatory compliance add further burden.

Method used

A system and method for generating privacy-preserving and fair synthetic data through schema reconciliation, encoding, training generative models, and providing mathematical proofs to validate similarity, accuracy, and fairness, while integrating fairness-awareness throughout the data-generation process.

Benefits of technology

Enables the creation of high-quality synthetic data that maintains privacy and fairness, reduces manual effort in data ingestion, and provides transparent governance and compliance, ensuring statistical and structural consistency with the original dataset.

✦ Generated by Eureka AI based on patent content.


Abstract

A system and a method for privacy preserved and fair synthetic data with proofs are disclosed. The system includes a processor and a memory with instructions to: receive seed data with differing schemas; combine samples into a schema-consistent dataset by reconciling names, types and missing-value semantics; encode columns by characteristics to optimize representation; predict training requirements (batch size, learning rate, convergence) to optimize training; convert encoded data to a training format; train generative models to produce synthetic records preserving statistical relations; sample the models to produce records that maintain mathematical consistency with key statistics; generate proofs validating fairness and accuracy between synthetic and real records; assess privacy and fairness risks via attribute sensitivity and representation ratios and use the results to guide parameters and sampling; and present, via a user interface, job statuses, fairness indices, privacy scores, utility metrics and volumes to a user for governance.

Description

[0001] A SYSTEM AND A METHOD FOR PRIVACY PRESERVED AND FAIR SYNTHETIC DATA WITH PROOFS

[0002] EARLIEST PRIORITY DATE:

[0003] This Application claims priority from a provisional patent application filed in India having Patent Application No. 202441101475, filed on December 20, 2024, and titled “SYSTEM AND METHOD FOR PRIVACY PRESERVED AND FAIR SYNTHETIC DATA WITH PROOFS”.

[0004] FIELD OF INVENTION

[0005] The present invention relates to the field of data analytics and databases. More particularly, the present invention relates to a system and a method for privacy preserved and fair synthetic data with proofs.

[0006] BACKGROUND

[0007] The increasing use of data-driven systems across industries such as healthcare, finance, telecommunications, and public services has created a significant dependence on large, high-quality datasets. These datasets are often sourced from multiple platforms, legacy systems, and external partners, resulting in substantial variations in schema design, attribute naming conventions, and data-quality standards. Before such information can be used for analytics or model development, organizations typically expend considerable effort reconciling inconsistent structures, resolving missing values, and standardizing formats.

[0008] At the same time, the increased sensitivity of personal and regulated information has heightened the need for strong privacy protections. Many datasets contain identifiers, demographic attributes, or behavioural traces that could expose individuals to re-identification or profiling risks if used directly. This has led regulatory bodies and internal governance teams to impose strict requirements on how data may be stored, shared, or analysed.

[0009] Another challenge arises from fairness considerations. Real-world datasets often exhibit imbalance, under-representation, or biased patterns that can propagate into downstream analytical systems. When such data are used to train machine-learning models, they may unintentionally produce unequal outcomes across demographic or protected groups. As organizations seek to deploy automated decision systems responsibly, the need to mitigate such risks has become increasingly important.

[0010] Hence, there is a need for an improved system and method for privacy preserved and fair synthetic data with proofs to address the aforementioned issue(s).

[0011] OBJECTIVES OF THE INVENTION

[0012] A primary objective of the invention is to enable the creation of high-quality synthetic tabular data that faithfully reflects the statistical and structural characteristics of the original dataset while ensuring that sensitive information remains protected and cannot be traced back to any individual.

[0013] Another objective of the invention is to integrate fairness-awareness throughout the data-generation process so that the resulting synthetic data does not inherit or amplify bias present in the original dataset. This includes identifying fairness risks, adjusting model behaviour, and validating fairness outcomes through measurable indicators.

[0014] Yet another objective of the invention is to provide mathematically grounded validation of the generated synthetic data by producing formal proofs and supporting metrics that demonstrate similarity, accuracy, fairness alignment, and privacy compliance. This helps organizations build confidence in the use of synthetic data for analytics, model development, and regulatory reporting.

[0015] A further objective of the invention is to unify complex data-ingestion workflows by allowing users to upload, reconcile, and prepare datasets with differing schemas in a seamless manner, reducing manual effort and improving the reliability of downstream processes.

[0016] Still another objective of the invention is to offer clear visibility and governance over synthetic-data generation through a user interface that tracks creation jobs, displays key metrics, and maintains transparent records for audit, monitoring, and decision-making.

[0017] SUMMARY

[0018] In accordance with an embodiment of the present disclosure, a system for privacy preserved and fair synthetic data with proofs is disclosed. The system includes a processor. The system includes a memory coupled to the processor, wherein the memory comprises instructions that, when executed by the processor, cause the processor to: receive a plurality of seed data samples comprising differing schemas from one or more data sources; combine the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, data types, and missing-value semantics; encode the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation; predict one or more training requirements corresponding to the schema-consistent dataset, wherein the one or more training requirements comprise a batch size, a learning rate, and convergence criteria, wherein the predicted one or more training requirements are selected to optimize a training pipeline; convert the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements; train one or more generative models using the training-ready format, wherein the one or more generative models are configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset; generate a plurality of synthetic data samples by sampling from the trained generative models, wherein the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset; generate one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset; assess a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilize the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples; and present, via a user interface, one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user operating a user device.

[0019] In accordance with an embodiment of the present disclosure, a method for privacy preserved and fair synthetic data with proofs is disclosed. The method includes receiving, by a processor, a plurality of seed data samples comprising differing schemas from one or more data sources. The method includes combining, by the processor, the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, data types, and missing-value semantics. The method includes encoding, by the processor, the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation. The method includes predicting, by the processor, one or more training requirements corresponding to the schema-consistent dataset, wherein the one or more training requirements comprise a batch size, a learning rate, and convergence criteria, wherein the predicted one or more training requirements are selected to optimize a training pipeline. The method includes converting, by the processor, the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements. The method includes training, by the processor, one or more generative models using the training-ready format, wherein the one or more generative models are configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset. The method includes generating, by the processor, a plurality of synthetic data samples by sampling from the trained generative models, wherein the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset. The method includes generating, by the processor, one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset. The method includes assessing, by the processor, a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilizing the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples. The method includes presenting, by the processor, via a user interface, one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user operating a user device.
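By way of a non-limiting illustration, the representation-ratio analysis referred to in the assessing step may be sketched as follows. The disclosure does not specify how the ratios are computed; the function names, the 0.2 threshold, and the sample records below are assumptions made solely for this sketch.

```python
from collections import Counter


def representation_ratios(records, attribute):
    """Return each group's share of the records for one attribute."""
    counts = Counter(record.get(attribute) for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(ratios, min_share=0.2):
    """List groups whose share is below min_share (a hypothetical threshold)."""
    return sorted((g for g, share in ratios.items() if share < min_share), key=str)


def max_ratio_gap(real_ratios, synthetic_ratios):
    """Largest absolute difference in group share between real and synthetic data."""
    groups = set(real_ratios) | set(synthetic_ratios)
    return max(
        abs(real_ratios.get(g, 0.0) - synthetic_ratios.get(g, 0.0)) for g in groups
    )


# Hypothetical records: 7 of group "M", 2 of group "F", 1 of group "X".
real = [{"gender": "M"}] * 7 + [{"gender": "F"}] * 2 + [{"gender": "X"}]
synthetic = [{"gender": "M"}] * 6 + [{"gender": "F"}] * 3 + [{"gender": "X"}]

real_ratios = representation_ratios(real, "gender")
synthetic_ratios = representation_ratios(synthetic, "gender")
flagged = underrepresented(real_ratios)
gap = max_ratio_gap(real_ratios, synthetic_ratios)
```

In this sketch, the flagged groups could inform sampling of the synthetic records, and the gap between real and synthetic shares is one simple quantity that a similarity check might bound.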

[0020] To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. It is to be appreciated that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.

[0021] BRIEF DESCRIPTION OF THE DRAWINGS

[0022] The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:

[0023] FIG. 1 illustrates a network environment of a system for privacy preserved and fair synthetic data with proofs in accordance with an embodiment of the present disclosure;

[0024] FIG. 2 illustrates a schematic diagram of a user device of FIG. 1, in accordance with an example implementation of the present subject matter;

[0025] FIG. 3 illustrates a schematic diagram of a system for privacy preserved and fair synthetic data with proofs of FIG. 1, in accordance with an embodiment of the present disclosure;

[0026] FIG. 4 (a) is a flow chart representing the steps involved in a method for privacy preserved and fair synthetic data with proofs, in accordance with an embodiment of the present disclosure; and FIG. 4 (b) illustrates continued steps of the method of FIG. 4 (a) in accordance with an embodiment of the present disclosure.

[0027] Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.

[0028] DETAILED DESCRIPTION

[0029] For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended. Such alterations and further modifications in the illustrated system, and such further applications of the principles of the disclosure as would normally occur to those skilled in the art are to be construed as being within the scope of the present disclosure.

[0030] The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or subsystems or elements or structures or components preceded by "comprises... a" does not, without more constraints, preclude the existence of other devices, sub-systems, elements, structures, components, additional devices, additional sub-systems, additional elements, additional structures or additional components. Appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.

[0031] In the following specification and the claims, reference will be made to a number of terms, which shall be defined to have the following meanings. The singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.

[0032] FIG. 1 illustrates a network environment of a system for privacy preserved and fair synthetic data with proofs in accordance with an embodiment of the present disclosure.

[0033] Referring to FIG. 1, a user device (104) corresponding to a user (108) may be communicatively coupled to a system (102). The user (108) may access the system (102) over a network (106). Examples of the user device (104) include, but are not limited to, a mobile phone, desktop computer, portable digital assistant (PDA), smart phone, tablet, ultra-book, netbook, laptop, multi-processor system, microprocessor-based or programmable consumer electronic system, or any other communication device that a user may use. It will be appreciated that the system (102) may be presented to the user on the user device (104) as a web application accessed through a browser, through a software application on the user device, or, particularly for smartphones, through a mobile application installed on the smartphone. It will be appreciated that, within the context of the disclosure herein, a web application refers to a utility implemented on a networked computing system accessible by the user device over the Internet (e.g. through browsers) wherein the bulk of the processing takes place at the networked computing system, mobile applications refer to applications installed on smartphones that may communicate with a networked computing system, and a “software” application refers generally to applications other than web browsers installed on other types of user device that may communicate with a networked computing system over the network (106).

[0034] The network (106) may be a single communication network or a combination of multiple communication networks and may use a variety of different communication protocols. The network (106) may be a wireless network, a wired network, or a combination thereof. Examples of such individual networks include, but are not limited to, Global System for Mobile Communication (GSM) network, Universal Mobile Telecommunications System (UMTS) network, Personal Communications Service (PCS) network, Time Division Multiple Access (TDMA) network, Code Division Multiple Access (CDMA) network, Next Generation Network (NGN), and Public Switched Telephone Network (PSTN). Depending on the technology, the network (106) may include various network entities, such as gateways and routers; however, such details have been omitted for the sake of brevity of the present description.

[0035] The system (102) may have a homepage that is presented to the user (108) accessing a top-level web address for web applications presented to the user (108) in a browser or a welcome screen for software and mobile applications. The homepage may include links to a user log-in interface or general information about the system (102) and the option to register as user (108). It will be appreciated that the presentation of a homepage may not be necessary, for example, if a user bypasses it by directly inputting a web address corresponding to a user log-in page, or if a separate mobile application is designed for users.

[0036] A new or unregistered user can access the user log-in interface, fill out the log-in information corresponding to the user's account, and indicate that the user wishes to sign in. It will be appreciated that any conventional registration and log-in techniques for web applications, software applications, and mobile applications may be used, whichever is appropriate for the user. While registering, the user may be prompted to provide a username and corresponding user credentials including, but not limited to, a password, geographical location, and contact information. Upon receipt of the foregoing information, a corresponding user profile may be created and stored in a respective database of the system (102).

[0037] In accordance with an embodiment of the present disclosure, a system for privacy preserved and fair synthetic data with proofs is disclosed. The system includes a processor. The system includes a memory coupled to the processor, wherein the memory comprises instructions that, when executed by the processor, cause the processor to: receive a plurality of seed data samples comprising differing schemas from one or more data sources; combine the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, data types, and missing-value semantics; encode the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation; predict one or more training requirements corresponding to the schema-consistent dataset, wherein the one or more training requirements comprise a batch size, a learning rate, and convergence criteria, wherein the predicted one or more training requirements are selected to optimize a training pipeline; convert the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements; train one or more generative models using the training-ready format, wherein the one or more generative models are configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset; generate a plurality of synthetic data samples by sampling from the trained generative models, wherein the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset; generate one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset; assess a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilize the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples; and present, via a user interface, one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user operating a user device.

[0038] It may be noted that the foregoing system is an exemplary system and may be implemented as computer executable instructions in any computing or processing environment, including in digital electronic circuitry or in computer hardware, firmware, device driver, or software. As such, the system is not limited to any specific hardware or software configuration.

[0039] FIG. 2 illustrates a schematic diagram of a user device, in accordance with an example implementation of the present subject matter. Referring to FIG. 2, the user device (104) may comprise a processor(s) (202), a memory(s) (204) coupled to and accessible by the processor(s) (202), and an interface (210) coupled to the memory(s) (204). The user device (104) disclosed herein may be the same as the user device (104) described in FIG. 1. The functions of various elements shown in the figures, including any functional blocks labelled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" should not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), and field programmable gate array (FPGA). Other hardware, standard and / or custom, may also be coupled to the processor(s) (202). The user device (104) may further include a display (206) in addition to other components such as, but not limited to, keyboard, sensors, logic circuits etc. Further, the user device (104) may include data (208) that may be stored, utilized or generated during the operation of the user device (104).

[0040] The memory(s) (204) may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM), and / or non-volatile memory (e.g., Erasable Programmable read-only memory, i.e. EPROM, flash memory, etc.). The memory(s) (204) may be an external memory, or internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. The user device (104) may further include an interface (210) that may allow the connection or coupling of the user device (104) with one or more other devices, through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, Wi-Fi), for example, for connecting to the system shown in FIG. 1. The interface may also enable intercommunication between different logical as well as hardware components of the user device (104).

[0041] FIG. 3 illustrates a schematic diagram of a system for privacy preserved and fair synthetic data with proofs of FIG. 1, in accordance with an embodiment of the present disclosure. Referring to FIG. 3, the system (102) includes a processor(s) (302), a memory(s) (304) coupled to and accessible by the processor(s) (302), and a database (346) coupled to the memory(s) (304).

[0042] The system (102) disclosed herein is the same as the system (102) described in FIG. 1. The functions of various elements shown in the figures, including any functional blocks labelled as "processor(s)", may be provided through the use of dedicated hardware as well as hardware capable of executing instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" should not be construed to refer exclusively to hardware capable of executing instructions, and may implicitly comprise, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), and field programmable gate array (FPGA). Other hardware, standard and / or custom, may also be coupled to the processor(s) (302). The system (102) may further include other components such as, but not limited to, keyboard, sensors, logic circuits, input / output interfaces etc. Further, the system (102) may include data that may be stored, utilized or generated during the operation of the computer implemented system (102).

[0043] The memory(s) (304) may be a computer-readable medium, examples of which comprise volatile memory (e.g., RAM), and / or non-volatile memory (e.g., Erasable Programmable read-only memory, i.e. EPROM, flash memory, etc.). The memory(s) (304) may be an external memory, or internal memory, such as a flash drive, a compact disk drive, an external hard disk drive, or the like. The system (102) may further include the user interface (348) that may allow the connection or coupling of the system (102) with one or more other devices, through a wired (e.g., Local Area Network, i.e., LAN) connection or through a wireless connection (e.g., Bluetooth®, Wi-Fi), for example, for connecting to the user device (104) as shown in FIG. 1. The user interface (348) may also enable intercommunication between different logical as well as hardware components of the system (102).

[0044] The system (102) may be provided with a database (346) to store one or more mathematical proofs, corresponding validation metadata, similarity metrics, fairness indices, privacy differentials, metric definitions, computation parameters, confidence intervals, statistical bounds, sampling configurations, dataset lineage information, threshold values used in validation, training-parameter configurations, timestamps of proof generation, and model version identifiers. In an example implementation of the system (102) including one or more servers, the databases may be local to the server or remote from the server. It may be noted that the data in the databases may be stored as a table or may be prestored as a mapping of one item to another. This application is not limited thereto.
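As a non-limiting illustration of the kind of record the database (346) may hold, the following sketch groups several of the items listed above into a single entry. The class name, field names, the pass rule, and the example values are assumptions made for this sketch; the disclosure does not define a storage schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProofRecord:
    """Hypothetical database entry for one validation proof."""

    metric_name: str            # which fairness or accuracy metric was validated
    metric_definition: str      # human-readable definition of the metric
    real_value: float           # metric computed on the schema-consistent dataset
    synthetic_value: float      # metric computed on the synthetic data samples
    threshold: float            # threshold value used in validation
    confidence_interval: tuple  # (lower, upper) bound for the synthetic value
    model_version: str          # model version identifier
    dataset_lineage: list       # names of the contributing seed data sources
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def passes(self) -> bool:
        """Assumed rule: real and synthetic values differ by at most the threshold."""
        return abs(self.real_value - self.synthetic_value) <= self.threshold


record = ProofRecord(
    metric_name="demographic_parity_gap",
    metric_definition="difference in positive-outcome rate between two groups",
    real_value=0.12,
    synthetic_value=0.10,
    threshold=0.05,
    confidence_interval=(0.08, 0.12),
    model_version="model-v1",
    dataset_lineage=["crm.csv", "billing.json"],
)

# Serialise to JSON, for example for storage as a table row or audit export.
serialised = json.dumps(asdict(record))
```

Keeping the threshold, lineage, and model version alongside each metric value is what would allow a stored result to be re-checked later for audit purposes.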

[0045] The system (102) may include module(s). The module(s) may include a receiving module (306), a customized configuration module (308), a training requirement prediction module (310), a trainer module (312), a sampler module (314), an evaluator module (316), an assessment module (318), a destination data ingestion module (320), and a schema aggregator module (336). In one example, the module(s) may be implemented as a combination of hardware and firmware. In an example described herein, such combinations of hardware and firmware may be implemented in several different ways. For example, the firmware for module(s) may be processor (302) executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the module(s) may include a processing resource (for example, implemented as either single processor or combination of multiple processors), to execute such instructions. Further, the hardware for the module(s) may include communication apparatuses, control circuitries involving electrical and electronics components, sensors, and interface devices, which may be in communication with each other for multidirectional communication therebetween.

[0046] Further, the system (102) includes data. The data may include data that is either stored or generated as a result of functions implemented by the system. It may be further noted that information stored and available in the data may be utilized by the module(s) for performing various functions by the system. In an example, the data may include seed data samples data (322), a schema-consistent dataset (324), column data (326), a synthetic data sample (328), statistical data (330), mathematical data (332), and metrics data (334). It may be noted that such examples of the various functions are only indicative. The present approaches may be applicable to other examples without deviating from the scope of the present subject matter.

[0047] In the present examples, the non-transitory machine-readable storage medium may store instructions that, when executed by the processing resource, implement the functionalities of the module(s). In such examples, the system (102) may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions. In other examples of the present subject matter, the machine-readable storage medium may be located at a different location but accessible to the system (102) and the processor(s) (302).

[0048] In operation, the receiving module (306) is configured to receive a plurality of seed data samples comprising differing schemas from one or more data sources. The receiving module (306) is configured to classify the plurality of seed data samples according to their respective schema characteristics, data formats, and attribute structures prior to integration. Each of the plurality of seed data samples may include multiple columns or fields representing heterogeneous data types, including categorical, numerical, or textual information. The processor (302) initiates a schema recognition and validation process to identify inconsistencies, missing attributes, and conflicting data types across the plurality of seed data samples.

[0049] The one or more data sources include, but are not limited to, enterprise databases, structured data repositories, relational database management systems, cloud-based storage platforms, or user-curated CSV and JSON files.

[0050] In another embodiment, the plurality of seed data samples comprise at least one of structured tabular data, semi-structured tabular data, or tabular data derived from unstructured sources. In this embodiment, the receiving module (306) classifies the plurality of seed data samples into one or more schema categories based on inherent formatting, attribute organization, and metadata availability. Structured tabular data are classified as datasets with clearly defined rows, columns, and data types, such as relational database tables or standardized spreadsheets. Semi-structured tabular data are classified as datasets that contain discernible tabular patterns but include flexible attribute structures, nested fields, or partially defined schemas, such as JSON tables, CSV files with irregular column alignment, or log data flattened into key-value tables. Tabular data derived from unstructured sources are classified as datasets reconstructed from raw, free-form inputs such as text documents, PDFs, or images, where extraction methods convert unstructured inputs into a tabular arrangement with inferred column names, aligned data types, and serialized values. Once classified, the processor adjusts ingestion, schema detection, and reconciliation rules to ensure that all types of seed data samples, regardless of structural rigidity, are integrated into a unified schema-consistent dataset for subsequent encoding, training requirement prediction, and synthetic data generation.

[0051] In one embodiment, the schema aggregator module (336) is configured to combine the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, data types, and missing-value semantics. The schema aggregator module (336) initiates a schema-alignment procedure wherein each attribute of the plurality of seed data samples is analysed to determine its semantic meaning, structural formatting, and corresponding data type. The schema aggregator module (336) first classifies each seed data sample according to its inherent schema definition, such as column identifiers, permitted value ranges, encoding standards, and logical groupings. Once classified, the schema aggregator module (336) performs an attribute-matching operation in which attributes referring to the same conceptual information are aligned, even if labelled differently across the plurality of seed data samples. This reconciliation process further includes converting mismatched data types into a unified format, resolving inconsistencies such as varying date representations, categorical encodings, and numerical scaling differences. The schema aggregator module (336) additionally evaluates missing-value semantics to determine whether absent entries indicate null values, unknown states, not-applicable conditions, or intentionally suppressed information, and harmonizes these representations into a consistent structure. Example of schema reconciliation includes, but is not limited to, harmonizing attributes such as “DOB,” “Birth Date,” and “DateOfBirth” into a single attribute representation, standardizing numerical fields stored as strings into integer or floating-point formats, or unifying categorical encodings across datasets by mapping distinct label sets into a consolidated category list. Similarly, missing-value harmonization includes, but is not limited to, interpreting placeholder values such as “N/A,” “NULL,” or “—” and converting them into a systematically defined missing-value token within the schema-consistent dataset.
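By way of non-limiting illustration only, the attribute-name, data-type, and missing-value reconciliation described above may be sketched as follows. The alias map, the placeholder set, the choice of `None` as the missing-value token, and all function names are hypothetical assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import Any

# Hypothetical alias map: keys are source column names lowercased with
# every non-alphanumeric character removed.
COLUMN_ALIASES = {
    "dob": "date_of_birth",
    "birthdate": "date_of_birth",
    "dateofbirth": "date_of_birth",
}

# Placeholder spellings treated as missing (an assumed, illustrative set).
MISSING_PLACEHOLDERS = {"", "n/a", "na", "nan", "null", "none", "-", "—"}
MISSING = None  # single missing-value token used in the unified dataset


def canonical_name(name: str) -> str:
    """Map a source column name to its canonical attribute name."""
    key = "".join(ch for ch in name.lower() if ch.isalnum())
    return COLUMN_ALIASES.get(key, key)


def canonical_value(value: Any) -> Any:
    """Harmonize missing placeholders and convert numeric strings to numbers."""
    if value is None:
        return MISSING
    if isinstance(value, str):
        text = value.strip()
        if text.lower().replace(" ", "") in MISSING_PLACEHOLDERS:
            return MISSING
        for cast in (int, float):
            try:
                return cast(text)
            except ValueError:
                continue
        return text
    return value


def reconcile(samples: list[list[dict]]) -> list[dict]:
    """Merge rows from differently shaped samples into one consistent schema."""
    rows = [
        {canonical_name(k): canonical_value(v) for k, v in row.items()}
        for sample in samples
        for row in sample
    ]
    columns = sorted({col for row in rows for col in row})
    # Every row receives every column; absent attributes become MISSING.
    return [{col: row.get(col, MISSING) for col in columns} for row in rows]
```

In this sketch, a row carrying “DOB” and a row carrying “Birth Date” both end up with a `date_of_birth` column, and a value such as “N / A” becomes the missing-value token.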

[0052] In one embodiment, the customised configuration module (308) is configured to encode the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation. The customised configuration module (308) classifies each column of the schema-consistent dataset according to its intrinsic data characteristics, including whether the column comprises categorical values, numerical values, textual entries, temporal values, binary indicators, or mixed-type information. The customised configuration module (308) identifies the statistical distribution, value cardinality, semantic type, and data density of each column to determine a corresponding encoding approach suitable for downstream model training. Once classified, the customised configuration module (308) selects an encoding technique that aligns with the characteristics of the column and the required synthetic data generation format, ensuring that the encoded representation preserves key relationships and structural patterns present in the original data. The encoding process may include transforming categorical values into numerical representations, normalizing or scaling numerical values, converting textual fields into vectorized formats, or applying embeddings that capture semantic similarity across values.

[0053] Example of encoding includes, but is not limited to, applying one-hot encoding or ordinal encoding for low- and high-cardinality categorical columns, utilizing min-max scaling, processing textual columns using tokenization or word-embedding techniques, and encoding temporal data using cyclical representations or timestamp decompositions. Column data characteristics include, but are not limited to, categorical labels and numerical ranges. In another embodiment, the customised configuration module (308) is configured to encode a plurality of categorical columns, numerical columns, and textual columns of the schema-consistent dataset corresponding to the column data characteristic and a required type or format of synthetic data generation as defined by the one or more encoding configurations. The customised configuration module (308) classifies each of the plurality of categorical columns, numerical columns, and textual columns of the schema-consistent dataset according to its intrinsic data characteristic, where the categorical columns are identified by discrete label sets or enumerated values, numerical columns are identified by continuous or integer value ranges, and textual columns are identified by free-form or semi-structured linguistic content. After classification, the processor selects and applies one or more encoding configurations tailored to the specific column type and the synthetic-data generation requirements, ensuring that the encoded representation preserves both statistical and semantic fidelity. For categorical columns, the processor (302) may convert labels into numerical representations that retain class identity; for numerical columns, the processor may normalize or scale values to ensure comparability across varying magnitudes; and for textual columns, the processor (302) may tokenize, vectorize, or embed text into fixed-dimension numerical vectors.

[0054] Example of the encoding of the plurality of categorical columns, numerical columns, and textual columns includes, but is not limited to, transforming categorical values using one-hot encoding, ordinal encoding, or learned embeddings, and converting numerical values through min-max scaling.
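By way of non-limiting illustration only, three of the encodings named above (one-hot encoding, min-max scaling, and a cyclical representation of a temporal value) may be sketched as follows. The function names and the choice of month as the temporal feature are assumptions of this sketch.

```python
import math


def one_hot(values: list[str]) -> tuple[list[str], list[list[int]]]:
    """One-hot encode a categorical column; categories are sorted for stability."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_scale(values: list[float]) -> list[float]:
    """Scale a numerical column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cyclical_month(month: int) -> tuple[float, float]:
    """Encode a month (1-12) as a point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)
```

The cyclical form places December next to January, which a plain integer encoding of the month would not do.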

[0055] In one embodiment, the training requirement prediction module (310) is configured to predict one or more training requirements corresponding to the schema-consistent dataset. The training requirement prediction module (310) evaluates the schema-consistent dataset to identify its structural complexity, distributional characteristics, feature dimensionality, data volume, and sparsity levels. Based on this evaluation, the processor classifies the dataset into one or more training readiness categories that correspond to computational intensity, model sensitivity, and optimization behaviour. Once classified, the training requirement prediction module (310) analyses historical training requirements, model architecture constraints, and statistical indicators extracted from the schema-consistent dataset to determine a suitable batch size that balances computational efficiency and gradient stability. Similarly, the training requirement prediction module (310) selects a learning rate aligned with the expected variation in the data and the convergence behaviour of the generative model.

[0056] Example of the one or more training requirements includes, but is not limited to, assigning a batch size proportional to dataset volume, selecting an adaptive learning rate based on gradient variance analysis, or defining convergence criteria that incorporate loss stabilization, fairness metric thresholds, or structural similarity indicators. Dataset characteristics include, but are not limited to, feature dimensionality, categorical-to-numerical ratio, attribute cardinality, and variance distribution.

[0057] In one embodiment, the one or more training requirements comprises a batch size, a learning rate, and convergence criteria. The training requirement prediction module (310) additionally determines convergence criteria including maximum epochs, early-stopping thresholds, loss-decrease tolerances, and fairness or privacy stability indicators.
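By way of non-limiting illustration only, the batch size, learning rate, and convergence criteria may be carried in a single record, and the maximum-epoch, early-stopping, and loss-decrease-tolerance criteria may be applied as sketched below. The field names `patience` and `min_delta` and the function name are hypothetical; fairness or privacy stability indicators are omitted from this sketch.

```python
from dataclasses import dataclass


@dataclass
class TrainingRequirements:
    batch_size: int
    learning_rate: float
    max_epochs: int
    patience: int      # epochs without improvement tolerated before stopping
    min_delta: float   # smallest loss decrease that counts as an improvement


def stopping_epoch(losses: list[float], req: TrainingRequirements) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses[: req.max_epochs], start=1):
        if best - loss > req.min_delta:
            best = loss   # meaningful improvement: reset the stale counter
            stale = 0
        else:
            stale += 1
            if stale >= req.patience:
                return epoch   # early stop: loss has stabilized
    return min(len(losses), req.max_epochs)
```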

[0058] In one embodiment, the predicted one or more training requirements are selected to optimize a training pipeline. The training requirement prediction module (310) evaluates the operational characteristics of the training pipeline, including computational resource availability, expected model complexity, gradient stability requirements, and data processing throughput. The system classifies the training pipeline according to performance constraints such as memory bandwidth, processor load, and model convergence patterns. Based on this classification, the predicted batch size, learning rate, and convergence criteria are selected to reduce training time, improve loss stabilization, and ensure consistent fairness and privacy behaviour throughout the generative model’s learning process. The training requirement prediction module (310) further optimizes the training pipeline by adapting the predicted training requirements in response to real-time performance indicators such as gradient fluctuations, epoch-level learning stagnation, or processing bottlenecks. Example of optimizing the training pipeline includes, but is not limited to, selecting a smaller batch size when memory availability is limited, increasing the batch size when the dataset demonstrates low variance and supports stable updates, or adjusting the learning rate when gradient oscillation is detected.

[0059] In one embodiment, the trainer module (312) is configured to convert the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements. The trainer module (312) evaluates the encoded schema-consistent dataset to determine its compatibility with the predicted batch size, the predicted learning rate, and the predicted convergence criteria. The trainer module (312) classifies the encoded schema-consistent dataset into structural segments based on row distribution, feature dimensionality, and encoding density to ensure that each segment aligns with the expected input specifications of the generative model. Once classified, the trainer module (312) performs a formatting operation in which the encoded data is reorganized, partitioned, or reshaped into arrays, tensors, or batches that conform to the model’s input architecture. The conversion includes ensuring that all encoded values are normalized, type-consistent, and ordered according to the training pipeline’s requirements. The trainer module (312) further aligns the training-ready format with the predicted one or more training requirements by adjusting data chunk size, reshaping multi-dimensional arrays, and preparing standardized data loaders that regulate how the generative model receives data during each training iteration.

[0060] Example of converting the encoded dataset includes, but is not limited to, batching the encoded rows into groups defined by the predicted batch size, formatting the encoded columns into feature vectors compatible with a neural network input layer, or generating time-aligned tensors for sequential models. The training-ready format includes, but is not limited to, pre-shuffled training batches, memory-optimized tensor structures, and model-aligned feature arrays.
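By way of non-limiting illustration only, batching encoded rows into pre-shuffled groups of the predicted batch size may be sketched as follows. The function name and the seeded shuffle are assumptions of this sketch; a practical system might instead rely on the data-loader facilities of its training framework.

```python
import random


def make_batches(encoded_rows, batch_size, seed=0):
    """Shuffle encoded rows once and cut them into batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = list(range(len(encoded_rows)))
    random.Random(seed).shuffle(order)  # reproducible pre-shuffle
    return [
        [encoded_rows[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]
```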

[0061] In one embodiment, the training requirement prediction module (310) is configured to train one or more generative models using the training-ready format. The training requirement prediction module (310) classifies the training-ready format according to model type compatibility and feature topology, identifying whether the format corresponds to vector-based, sequence-based, or matrix-based model inputs and selecting a corresponding generative architecture accordingly. The training requirement prediction module (310) then initializes the one or more generative models with architecture parameters that match the feature dimensionality and data distributions of the training-ready format and configures training routines based on the predicted one or more training requirements. During training, the training requirement prediction module (310) applies iterative optimization to adjust model parameters using loss functions that combine fidelity objectives and fairness- and privacy-aware regularizers, where the fidelity objectives quantify preservation of marginal distributions, joint distributions, and conditional dependencies, and the regularizers penalize deviations from fairness thresholds and privacy budgets.

[0062] In one embodiment, the one or more generative models is configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset.

[0063] Example of the one or more generative models includes, but is not limited to, a variational autoencoder configured with encoder and decoder networks to capture joint feature distributions. Example architectures include, but are not limited to, feedforward neural networks, transformer-based encoders, and recurrent layers for sequential features. The fidelity objectives include, but are not limited to, Kullback-Leibler divergence measures, Wasserstein distances, and reconstruction error metrics, and the fairness- and privacy-aware regularizers include, but are not limited to, statistical parity penalties and equalized odds penalties.
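By way of non-limiting illustration only, a Kullback-Leibler fidelity term over a categorical column and a statistical parity penalty may be computed and combined as sketched below. The additive smoothing constant, the hinge form of the penalty, and the `weight` and `tolerance` parameters are assumptions of this sketch, not the loss actually used by the disclosed system.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """KL(real || synthetic) between empirical category distributions."""
    cats = set(real) | set(synthetic)
    rc, sc = Counter(real), Counter(synthetic)
    total = 0.0
    for cat in cats:
        # Smoothing keeps q > 0 when a category is absent from one side.
        p = (rc[cat] + smoothing) / (len(real) + smoothing * len(cats))
        q = (sc[cat] + smoothing) / (len(synthetic) + smoothing * len(cats))
        total += p * math.log(p / q)
    return total


def statistical_parity_gap(groups, outcomes):
    """Largest difference in positive-outcome rate between any two groups."""
    positives, counts = Counter(), Counter()
    for group, outcome in zip(groups, outcomes):
        counts[group] += 1
        positives[group] += 1 if outcome else 0
    rates = [positives[g] / counts[g] for g in counts]
    return max(rates) - min(rates)


def combined_loss(fidelity, parity_gap, weight=1.0, tolerance=0.0):
    """Fidelity term plus a penalty on the parity gap above a tolerance."""
    return fidelity + weight * max(0.0, parity_gap - tolerance)
```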

[0064] In another embodiment, the training requirement prediction module (310) is configured to partition the training-ready format into a plurality of training batches based on the predicted one or more training requirements. The training requirement prediction module (310) classifies the training-ready format according to internal data organization, feature dimensionality, encoded value density, and the computational constraints associated with the predicted batch size. The training requirement prediction module (310) then segments the training-ready format into discrete training batches, each batch comprising a subset of rows and their corresponding encoded feature representations. The partitioning process ensures that each batch contains a statistically representative mix of categorical, numerical, and textual features derived from the schema-consistent dataset, thereby maintaining consistency across training iterations. The training requirement prediction module (310) additionally considers the predicted learning rate and convergence criteria when determining batch boundaries, ensuring that each batch contributes stable gradient updates and supports efficient model optimization.

[0065] Example of the plurality of training batches includes, but is not limited to, fixed-size batches derived from random shuffling of encoded rows, adaptive-size batches configured according to dataset variance or sparsity, and stratified batches constructed to maintain proportional subgroup representation.
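By way of non-limiting illustration only, stratified batches that maintain proportional subgroup representation may be built by dealing the rows of each subgroup round-robin across the batches, as sketched below. The function name, the `group_of` callback, and the round-robin strategy are assumptions of this sketch.

```python
import random
from collections import defaultdict


def stratified_batches(rows, group_of, num_batches, seed=0):
    """Distribute each subgroup's rows evenly across num_batches batches."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[group_of(row)].append(row)
    batches = [[] for _ in range(num_batches)]
    slot = 0
    for key in sorted(by_group):       # fixed group order for reproducibility
        members = by_group[key]
        rng.shuffle(members)           # randomize order within the subgroup
        for row in members:
            batches[slot % num_batches].append(row)
            slot += 1
    return batches
```

With this dealing scheme, the number of rows from any one subgroup differs by at most one between any two batches.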

[0066] In one embodiment, the sampler module (314) is configured to generate a plurality of synthetic data samples by sampling from the trained generative models. The sampler module (314) classifies the trained generative models according to their internal sampling mechanisms, latent-space structures, and output dimensionality to ensure that each trained generative model is sampled in a manner aligned with the schema-consistent dataset. Once classified, the sampler module (314) initiates a sampling procedure in which latent variables, noise vectors, or encoded feature seeds are provided as inputs to the trained generative models. The trained generative models then transform these sampling inputs through their learned weights and structural mappings to produce synthetic outputs in the form of tabular data entries. Each generated synthetic data sample preserves the feature ordering, dimensionality, and data-type alignment defined by the schema-consistent dataset.

[0067] Example of sampling from the trained generative models includes, but is not limited to, generating latent vectors drawn from a normal distribution for a variational autoencoder, or applying stepwise noise-removal iterations for a diffusion-based model. Sampling inputs include, but are not limited to, latent noise seeds, encoded categorical embeddings, or temporal context signals. The plurality of synthetic data samples includes, but is not limited to, synthetic customer records, synthetic financial entries, synthetic healthcare observations, or synthetic operational logs. In one embodiment, the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset. In this embodiment, the sampler module (314) evaluates each synthetic data sample produced by the trained generative models and aligns the structural and statistical characteristics of the synthetic data sample with the corresponding attributes of the schema-consistent dataset. The system first classifies the schema-consistent dataset according to its statistical properties, including marginal distributions, inter-feature correlations, conditional dependencies, variance ranges, central-tendency values, and feature-interaction patterns. Once classified, the sampler module (314) performs a mathematical consistency validation in which the plurality of synthetic data samples is compared against the schema-consistent dataset to confirm that the synthetic values fall within statistically reasonable boundaries. The comparison includes verifying that numerical columns reflect similar distributional shapes, categorical columns reproduce category frequency proportions, and relational constraints such as monotonicity, conditional logic, or dependency structures are preserved.

[0068] Example of the one or more statistical properties includes, but is not limited to, mean values, median values, variance ranges, covariance matrices, correlation coefficients, conditional probability tables, and category-distribution frequencies. Correspondence between the plurality of synthetic data samples and the schema-consistent dataset includes, but is not limited to, matching numerical distribution patterns, reproducing categorical proportions, and preserving multivariate dependency structures. Mathematical consistency includes, but is not limited to, satisfying statistical distance thresholds such as Kullback-Leibler divergence limits or Wasserstein alignment thresholds.
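By way of non-limiting illustration only, a mathematical consistency check on a single numerical column may compare the one-dimensional Wasserstein-1 distance and the difference in means against thresholds, as sketched below. The threshold parameters `max_w1` and `max_mean_gap` and the function names are hypothetical; the distance is computed as the area between the two empirical cumulative distribution functions.

```python
import statistics


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1-D distributions."""
    xs, ys = sorted(a), sorted(b)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Advance each pointer past all values <= left to get the CDF at left.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        distance += abs(i / len(xs) - j / len(ys)) * (right - left)
    return distance


def is_consistent(real, synthetic, max_w1, max_mean_gap):
    """True when both the W1 distance and the mean gap are within bounds."""
    mean_gap = abs(statistics.mean(real) - statistics.mean(synthetic))
    return wasserstein_1d(real, synthetic) <= max_w1 and mean_gap <= max_mean_gap
```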

[0069] In one embodiment, the evaluator module (316) is configured to generate one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset. The evaluator module (316) classifies the plurality of fairness metrics and the plurality of accuracy metrics according to their mathematical form and intended validation purpose, for example by grouping metrics into distributional-distance metrics, hypothesis-test metrics, and constraint-satisfaction metrics. The evaluator module (316) then computes metric values for the plurality of synthetic data samples and for the corresponding real data and constructs formal statements of similarity or bounded deviation that relate the computed metric values. The one or more mathematical proofs are derived by applying one or more analytical techniques including, but not limited to, statistical hypothesis testing, concentration inequalities, distributional distance bounds, and compositional privacy analysis. The evaluator module (316) generates proof artifacts by combining empirical estimates with analytical bounds to produce a formal validation that the plurality of synthetic data samples satisfy the one or more fairness metrics and the one or more accuracy metrics relative to the corresponding real data.

[0070] Example of the one or more mathematical proofs includes, but is not limited to, a statistical-hypothesis proof that the difference in a fairness index between the plurality of synthetic data samples and the corresponding real data is not statistically significant at a predefined significance level, a distributional-distance proof that the Wasserstein distance or Kullback-Leibler divergence between marginal or joint distributions of selected attributes is below a predefined threshold, a constraint-satisfaction proof that logical or relational constraints present in the schema-consistent dataset are preserved within tolerance bounds in the plurality of synthetic data samples, and a privacy-composition proof that the applied privacy-preserving transformations satisfy a stated differential-privacy budget. The one or more mathematical proofs further include computed fairness-stability certificates corresponding to changes in sampling probabilities or fairness-correction steps.
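By way of non-limiting illustration only, a statistical-hypothesis statement of the kind described above may be produced with a two-sided, two-proportion z-test on the positive-outcome rate of a protected group in the real and synthetic data. The choice of test, the record fields, and the function names are assumptions of this sketch.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(pos_a, n_a, pos_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # all outcomes identical on both sides: no difference
    z = (pos_a / n_a - pos_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fairness_similarity_statement(real_pos, real_n, synth_pos, synth_n, alpha=0.05):
    """Build a record stating whether the two rates differ significantly."""
    p_value = two_proportion_p_value(real_pos, real_n, synth_pos, synth_n)
    return {
        "claim": "positive-outcome rates do not differ significantly",
        "real_rate": real_pos / real_n,
        "synthetic_rate": synth_pos / synth_n,
        "p_value": p_value,
        "alpha": alpha,
        "holds": p_value >= alpha,
    }
```

A non-significant result indicates only that no difference was detected at the chosen level; an equivalence test with an explicit margin would give a stronger statement of similarity.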

[0071] In another embodiment, the evaluator module (316) is configured to compute the one or more mathematical proofs by calculating one or more similarity metrics, one or more fairness indices, and one or more privacy differentials between the plurality of synthetic data samples and the corresponding real data. The evaluator module (316) classifies the plurality of similarity metrics, fairness indices, and privacy differentials according to their mathematical structure and validation purpose. The similarity metrics are classified as statistical-distance measures, distribution-alignment measures, and relational-preservation measures. The fairness indices are classified as group-based disparity measures, outcome-balance measures, and conditional-dependence fairness measures. The privacy differentials are classified as re-identification risk measures, attribute-sensitivity leakage measures, and differential-privacy deviation measures. Once classified, the evaluator module (316) computes each metric by comparing the statistical, structural, and dependency patterns present in the plurality of synthetic data samples against the corresponding real data extracted from the schema-consistent dataset. The evaluator module (316) then aggregates these computed values into a structured proof construct that details how closely the synthetic data align with the real data across accuracy, fairness, and privacy dimensions.

[0072] The corresponding real data includes, but is not limited to, statistical summaries, subgroup partitions, and attribute-level frequency tables derived from the schema-consistent dataset.

[0073] In another embodiment, the evaluator module (316) is configured to store the one or more mathematical proofs and corresponding validation metadata in a validation repository for subsequent verification and audit. The evaluator module (316) classifies the one or more mathematical proofs according to proof type, metric category, and validation purpose, where proof type includes similarity proofs, fairness proofs, privacy proofs, and composite proofs derived from multiple validation dimensions. The evaluator module (316) also classifies corresponding validation metadata, including metric definitions, computation parameters, confidence intervals, statistical bounds, sampling configurations, and dataset lineage markers associated with the schema-consistent dataset and the plurality of synthetic data samples. Once classified, the evaluator module (316) formats the mathematical proofs and the validation metadata into a structured, machine-readable format suitable for long-term retention, ensuring consistency, traceability, and auditability.
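By way of non-limiting illustration only, a proof and its validation metadata may be serialized into a machine-readable record that carries a digest, so that later tampering can be detected during audit. The record fields, the use of JSON, the SHA-256 digest, and the function names are assumptions of this sketch; the disclosure does not specify a storage format.

```python
import hashlib
import json


def _canonical(body: dict) -> bytes:
    """Serialize deterministically so the digest is reproducible."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_proof_record(proof_type, metric, value, threshold, lineage):
    """Build a record stating that `metric` is within `threshold`."""
    body = {
        "proof_type": proof_type,   # e.g. similarity, fairness, privacy
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "holds": value <= threshold,
        "lineage": lineage,         # dataset and model identifiers
    }
    return {**body, "digest": hashlib.sha256(_canonical(body)).hexdigest()}


def verify_proof_record(record: dict) -> bool:
    """Recompute the digest and compare it with the stored one."""
    body = {k: v for k, v in record.items() if k != "digest"}
    return hashlib.sha256(_canonical(body)).hexdigest() == record.get("digest")
```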

[0074] In one embodiment, the assessment module (318) is configured to assess a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilize the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples. The assessment module (318) classifies attributes of the schema-consistent dataset according to sensitivity categories and semantic importance, where the attribute sensitivity levels correspond to degrees of personal identifiability, regulatory sensitivity, or operational criticality. The assessment module (318) then computes representation ratios for one or more protected and non-protected subgroups to quantify relative presence, under-representation, or over-representation across the schema-consistent dataset. The assessment module (318) further derives a plurality of risk indicators that include, without limitation, re-identification risk scores, subgroup performance gaps, distributional skew measures, attribute correlation leak indicators, and outlier exposure metrics. Based on the computed risk indicators, the processor generates an assessment that maps identified risks to corrective actions and parameter adjustments, wherein the corrective actions include, but are not limited to, tuning model regularizers that penalize unfair error distributions, adjusting sampling probabilities to correct representation imbalance, modifying privacy-budget allocations and noise scales for differential privacy, constraining conditional sampling to respect logical protections, and selecting model capacity or architecture changes to mitigate model bias. The assessment module (318) applies the assessment to set or adapt one or more training and sampling parameters of the trained generative models, including learning-objective weights, sampling-temperature or latent-space priors, fairness-aware loss coefficients, and privacy-preserving noise parameters, and records the applied parameterization together with rationale metadata for traceability.

[0075] Example of the plurality of privacy and fairness risks and corresponding assessments includes, but is not limited to, attribute sensitivity levels such as direct identifiers (for example, national identifiers, full names), quasi-identifiers (for example, postal codes, dates of birth), and sensitive attributes (for example, health conditions, financial status).

[0076] In another embodiment, the assessment of privacy and fairness risks comprises determining attribute sensitivity levels, computing representation ratios for protected groups, and flagging imbalance conditions based on predefined thresholds. The assessment module (318) classifies attributes of the schema-consistent dataset into sensitivity categories, where each category corresponds to a defined level of identifiability, confidentiality, or regulatory significance. Attribute sensitivity levels are determined by evaluating whether an attribute is a direct identifier, a quasi-identifier, or a sensitive attribute whose disclosure may result in elevated privacy or fairness risk. The assessment module (318) additionally identifies protected groups by analysing demographic or regulatory classifications associated with one or more sensitive attributes and computes representation ratios for those protected groups by comparing subgroup counts, proportional distributions, and cross-attribute interactions across the schema-consistent dataset. Once the representation ratios are computed, the processor applies a set of predefined thresholds corresponding to fairness requirements, regulatory constraints, or internal governance rules to detect imbalance conditions such as under-representation, over-representation, or skewed conditional relationships. When such imbalance conditions are flagged, the assessment process annotates each flagged attribute or subgroup with a risk severity indicator and provides the resulting assessment for use in subsequent training-parameter adjustment, sampling correction, or privacy-enhancement actions.

[0077] Example of determining attribute sensitivity levels, computing representation ratios, and flagging imbalance conditions includes, but is not limited to, classifying attributes such as national identifiers, full names, and home addresses as direct identifiers, classifying attributes such as zip code or birthdate as quasi-identifiers, and classifying attributes such as race and gender as sensitive attributes.
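By way of non-limiting illustration only, attribute sensitivity lookup, representation-ratio computation, and threshold-based flagging may be sketched as follows. Zip code and birthdate are treated here as quasi-identifiers, consistent with paragraph [0075]. The attribute keys, the reference shares, and the `lower` and `upper` thresholds are hypothetical values chosen for this sketch.

```python
from collections import Counter

# Hypothetical lookup following the examples given in the description.
SENSITIVITY = {
    "national_id": "direct_identifier",
    "full_name": "direct_identifier",
    "home_address": "direct_identifier",
    "zip_code": "quasi_identifier",
    "birthdate": "quasi_identifier",
    "race": "sensitive_attribute",
    "gender": "sensitive_attribute",
}


def sensitivity_level(attribute):
    """Return the sensitivity category of an attribute name."""
    return SENSITIVITY.get(attribute, "non_sensitive")


def representation_ratios(values):
    """Share of rows belonging to each subgroup."""
    counts = Counter(values)
    return {group: count / len(values) for group, count in counts.items()}


def flag_imbalance(ratios, reference, lower=0.8, upper=1.25):
    """Flag subgroups whose observed share deviates from a reference share."""
    flags = {}
    for group, expected in reference.items():
        if expected <= 0:
            continue  # no meaningful reference share for this subgroup
        relative = ratios.get(group, 0.0) / expected
        if relative < lower:
            flags[group] = "under-represented"
        elif relative > upper:
            flags[group] = "over-represented"
    return flags
```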

[0078] In another embodiment, the assessment module (318) is configured to apply a fairness correction during generation of the plurality of synthetic data samples by iteratively adjusting sampling probabilities of values of one or more sensitive attributes. The assessment module (318) classifies one or more sensitive attributes within the schema-consistent dataset based on attribute-sensitivity levels and fairness relevance, identifying attributes such as demographic indicators, regulated categories, or protected-class labels. After classification, the assessment module (318) evaluates representation ratios, subgroup disparities, and conditional outcome imbalances associated with the one or more sensitive attributes. The processor then applies a fairness-correction loop in which sampling probabilities are iteratively modified during synthetic-data generation. The iterative adjustment ensures that subgroups associated with the one or more sensitive attributes receive sampling probabilities tailored to correct underrepresentation, mitigate bias amplification, or enforce fairness constraints such as statistical parity or equalized odds alignment. During each iteration, the assessment module (318) monitors fairness indices computed from partially generated synthetic batches, adjusts sampling weights accordingly, and continues refinement until the plurality of synthetic data samples satisfy predefined fairness thresholds while retaining statistical coherence with the schema-consistent dataset.

[0079] Example of applying fairness correction includes, but is not limited to, increasing sampling probabilities for under-represented subgroups, decreasing sampling probabilities for over-represented subgroups, and applying constraint-based sampling where values of one or more sensitive attributes are generated subject to fairness-aware probabilistic rules.
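By way of non-limiting illustration only, the fairness-correction loop may be sketched as an iteration that generates a partial batch, measures subgroup shares, and moves the sampling probabilities part of the way toward closing the gap. In this sketch the generator is reduced to a direct categorical draw over a single sensitive attribute; the `step`, `tolerance`, and `n_per_round` parameters and the function name are hypothetical.

```python
import random
from collections import Counter


def corrected_sampling(base_probs, target, n_per_round=5000, tolerance=0.02,
                       step=0.5, max_rounds=50, seed=0):
    """Adjust sampling probabilities until generated shares match the target."""
    rng = random.Random(seed)
    groups = sorted(base_probs)
    probs = dict(base_probs)
    share = {g: 0.0 for g in groups}
    for _ in range(max_rounds):
        # Generate a partial synthetic batch with the current probabilities.
        batch = rng.choices(groups, weights=[probs[g] for g in groups], k=n_per_round)
        counts = Counter(batch)
        share = {g: counts[g] / n_per_round for g in groups}
        if max(abs(share[g] - target[g]) for g in groups) <= tolerance:
            return probs, share   # fairness threshold satisfied
        # Raise under-represented groups, lower over-represented ones.
        raw = {g: max(1e-9, probs[g] + step * (target[g] - share[g])) for g in groups}
        total = sum(raw.values())
        probs = {g: raw[g] / total for g in groups}
    return probs, share           # best effort after max_rounds
```

A practical system would measure the fairness index on model output rather than on the draw itself, and would also check that the correction does not break statistical coherence with the schema-consistent dataset.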

[0080] In another embodiment, the assessment module (318) is configured to apply at least one privacy-preserving transformation to the plurality of synthetic data samples selected from the group consisting of differential privacy noise addition, anonymization, feature masking, and attribute generalization. In this embodiment, the assessment module (318) classifies the plurality of synthetic data samples according to privacy-sensitivity levels associated with each attribute, including direct identifiers, quasi-identifiers, and sensitive attributes. Based on this classification, the assessment module (318) determines which privacy-preserving transformation or combination of transformations is required to reduce re-identification risk, attribute-inference risk, or linkage-attack feasibility. The processor then applies one or more privacy-preserving transformations in accordance with predefined privacy policies, dataset characteristics, and synthetic-data utility requirements. When applying differential-privacy noise addition, the processor injects calibrated statistical noise into numerical or categorical features to satisfy a designated privacy budget. When applying anonymization, the processor removes or suppresses identifying values to prevent direct identity linkage. When applying feature masking, the processor replaces specific attribute values with placeholder tokens, masked patterns, or nullified entries. When applying attribute generalization, the processor replaces fine-grained attribute values with broader categories or aggregated representations to reduce identifiability while preserving statistical structure.
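By way of non-limiting illustration only, the four transformations may be sketched as follows. The Laplace noise is calibrated as sensitivity divided by the privacy budget epsilon and is drawn as the difference of two exponential variates; the ten-year age band, the masking token, and all function names are assumptions of this sketch. Whether a given use of noise actually satisfies a differential-privacy guarantee depends on the sensitivity analysis of the whole pipeline, which this sketch does not perform.

```python
import random


def add_laplace_noise(value, sensitivity, epsilon, rng):
    """Add Laplace(0, sensitivity / epsilon) noise to a numerical value."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace.
    return value + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def suppress(row, direct_identifiers):
    """Anonymization: drop direct-identifier attributes from a record."""
    return {k: v for k, v in row.items() if k not in direct_identifiers}


def mask(value, keep=0, token="*"):
    """Feature masking: keep the first `keep` characters, mask the rest."""
    text = str(value)
    return text[:keep] + token * max(0, len(text) - keep)


def generalize_age(age, width=10):
    """Attribute generalization: replace an exact age with an age band."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```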

[0081] In one embodiment, the processor (302) presents, via a user interface (348), one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user (108, FIG. 1) operating a user device (104, FIG. 1). The user interface (348) classifies creation jobs according to job type, job priority, and job lineage, where job type includes ingestion, schema reconciliation, encoding, training, sampling, proof-generation, and export; job priority includes urgent, normal, and low; and job lineage traces input seed datasets, model version, and parameter sets. The user interface (348) then displays a job list in which each creation job is represented by a job identifier, a human-readable job name, and a current status indicator selected from a plurality of statuses including queued, running, paused, completed, failed, and cancelled. For each listed job the user interface (348) presents one or more fairness indices corresponding to selected protected attributes and fairness measures (for example, statistical parity, equalized odds, and disparate impact), one or more privacy protection scores corresponding to applied privacy transformations and privacy budgets (for example, differential-privacy ε, anonymization score, or k-anonymity estimate), one or more data utility metrics corresponding to analytical fidelity (for example, distributional similarity scores, downstream-model performance, and reconstruction error), and a generation volume metric indicating the number of synthetic rows produced and the data size.
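
The job record described above may, for illustration, be modelled as a simple data structure. The class names, field names, and the row format below are assumptions; only the enumerated job types, priorities, and statuses come from the description.

```python
from dataclasses import dataclass, field
from enum import Enum

class JobType(Enum):
    INGESTION = "ingestion"
    SCHEMA_RECONCILIATION = "schema reconciliation"
    ENCODING = "encoding"
    TRAINING = "training"
    SAMPLING = "sampling"
    PROOF_GENERATION = "proof-generation"
    EXPORT = "export"

class JobPriority(Enum):
    URGENT = "urgent"
    NORMAL = "normal"
    LOW = "low"

class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

@dataclass
class CreationJob:
    job_id: str
    name: str
    job_type: JobType
    priority: JobPriority
    status: JobStatus
    lineage: dict = field(default_factory=dict)           # seed datasets, model version, parameter set
    fairness_indices: dict = field(default_factory=dict)  # e.g. {"statistical_parity": ...}
    privacy_scores: dict = field(default_factory=dict)    # e.g. {"epsilon": ...}
    utility_metrics: dict = field(default_factory=dict)   # e.g. {"similarity": ...}
    rows_generated: int = 0
    size_bytes: int = 0

def job_list_row(job):
    """Render one line of the job list: identifier, name, status, and volume."""
    return f"{job.job_id} | {job.name} | {job.status.value} | rows={job.rows_generated}"
```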

[0082] In another embodiment, the destination data ingestion module (320) is configured to inject the plurality of synthetic data samples and associated validation metadata to one or more configured destination environments for a downstream usage. The destination data ingestion module (320) classifies the one or more configured destination environments according to environment type, data-consumption purpose, and integration protocol. Environment types include analytical environments, machine-learning pipelines, governance platforms, secure data vaults, and external partner systems. The destination data ingestion module (320) then prepares the plurality of synthetic data samples and the associated validation metadata for injection by formatting them into structures compatible with the selected destination environments, ensuring that schema definitions, field mappings, and metadata annotations align with downstream requirements. The destination data ingestion module (320) additionally performs a transmission-readiness check to validate that privacy-preserving transformations, fairness-correction results, and mathematical-proof artifacts are intact and complete. Once validated, the destination data ingestion module (320) injects the plurality of synthetic data samples and the associated validation metadata through a secure data-transfer mechanism, which may include batch uploads, streaming transfers, API-based delivery, or connector-based ingestion.
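
One possible form of the transmission-readiness check is sketched below. The metadata keys and the function name are hypothetical, and the check shown covers only completeness of the validation artifacts and agreement of each record with the destination schema.

```python
# Hypothetical metadata keys standing in for the three artifact kinds named above.
REQUIRED_ARTIFACTS = ("privacy_transformations", "fairness_correction", "proofs")

def readiness_issues(records, metadata, expected_columns):
    """Return a list of problems; an empty list means the payload may be sent."""
    issues = []
    for key in REQUIRED_ARTIFACTS:
        if not metadata.get(key):
            issues.append(f"missing or empty metadata: {key}")
    expected = set(expected_columns)
    for index, row in enumerate(records):
        if set(row) != expected:
            issues.append(f"row {index} does not match destination schema")
    return issues
```

Returning the full list of issues, rather than failing on the first one, lets a caller report every defect before any transfer is attempted.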

[0083] A non-limiting example of deploying the system (102) for privacy-preserved and fair synthetic data with proofs is its usage within a healthcare-analytics enterprise that processes sensitive patient information collected from multiple hospitals, diagnostic centres, and clinical laboratories. The plurality of seed data samples originates from electronic health-record systems, insurance-claims repositories, laboratory-information systems, and patient-intake platforms operating at different facilities. The plurality of seed data samples comprises differing schemas, including structured medical tables, semi-structured laboratory exports, and tabular reconstructions of unstructured physician notes. The receiving module (306) obtains the plurality of seed data samples from the one or more data sources, and the schema aggregator module (336) prepares them for downstream processing.

[0084] Once received, the processor (302), through the schema aggregator module (336), combines the plurality of seed data samples into a schema-consistent dataset by reconciling attribute names, harmonizing medical-code systems, aligning date-time formats, and standardizing missing-value semantics. The processor (302) then encodes categorical medical attributes such as diagnosis codes, medication categories, lab-result indicators, and procedural classifications using the customized configuration module (308), which applies one or more encoding configurations selected for medical-analytics suitability. Numerical columns such as blood-pressure readings, laboratory values, and cost metrics are normalized, while textual columns such as short physician descriptions or triage notes are embedded into numerical vectors.

[0085] Based on the characteristics of the schema-consistent dataset, the processor predicts one or more training requirements using the training requirement prediction module (310) by determining a batch size aligned with the dataset volume, a learning rate optimized for convergence stability, and convergence criteria sensitive to clinical-data variability. The encoded dataset is converted into a training-ready format and is then consumed by the trainer module (312), which trains one or more generative models capable of learning clinical distributions, comorbidity patterns, demographic relationships, and treatment pathways present in the real data.

[0086] Through sampling from the trained generative models, the sampler module (314) generates a plurality of synthetic patient records that correspond to the schema-consistent dataset and maintain mathematical consistency with real clinical distributions. The plurality of synthetic patient records preserves statistical properties such as lab-value ranges, diagnosis-treatment correlation patterns, and demographic distributions without revealing any identifiable information attributable to actual individuals.

[0087] The processor then generates one or more mathematical proofs using the evaluator module (316), validating, through similarity metrics, that the distribution of synthetic laboratory values approximates the real data; validating, through fairness indices, that demographic groups receive proportionate representation; and validating, through privacy differentials, that re-identification risk remains below predefined thresholds. The processor (302) also assesses a plurality of privacy and fairness risks using the assessment module (318) by identifying sensitive attributes such as age, gender, ethnicity, and critical-health indicators, computing representation ratios, and flagging imbalance conditions relevant to clinical fairness and patient equity.

[0088] Fairness correction is applied during synthetic-data generation by iteratively adjusting sampling probabilities through the assessment module (318) to ensure that demographic groups, rare-disease categories, and under-represented populations are properly synthesized without statistical suppression or overemphasis. Privacy-preserving transformations such as differential-privacy noise addition and attribute generalization are applied to sensitive clinical values to further reduce identifiability.

[0089] After generation and validation, the plurality of synthetic patient records and the corresponding validation metadata are injected into one or more configured destination environments through the destination data ingestion module (320), such as a hospital analytics dashboard, a machine-learning development workspace for building predictive models, a compliance-verification environment for regulatory oversight, or an external research-collaboration platform used by partnered universities. The user interface (348) displays creation-job statuses, fairness indices, privacy-protection scores, data-utility metrics, and generation volumes for transparent governance and operational confidence.

[0090] FIG. 4 (a) is a flow chart representing the steps involved in a method for privacy preserved and fair synthetic data with proofs, in accordance with an embodiment of the present disclosure; FIG. 4 (b) illustrates continued steps of the method of FIG. 4 (a) in accordance with an embodiment of the present disclosure.

[0091] The method (400) includes receiving, by a processor, a plurality of seed data samples comprising differing schemas from one or more data sources in step 405. Example of the plurality of seed data samples includes, but is not limited to, structured database tables, semi-structured CSV or JSON files, and tabular outputs reconstructed from unstructured text.

[0092] The method (400) includes combining, by the processor, the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, datatypes, and missing-value semantics in step 410. The processor aligns column identifiers, standardizes data types (for example, converting text-based numbers to numeric format), and harmonizes missing-value tokens (such as “N/A” or null) into a unified representation. The method (400) includes encoding, by the processor, the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation in step 415. Example of encoding includes, but is not limited to, applying one-hot or ordinal encoding for categorical values, normalization for numerical ranges, and tokenization or embeddings for textual content. It must be noted that selecting encoding approaches based on column characteristics ensures compatibility with generative models and maintains representational fidelity.
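
Steps 410 and 415 may be illustrated with the following minimal sketch, which operates on plain dictionaries. The alias table, the list of missing-value tokens, and all function names are assumptions made for illustration; a deployed system would likely draw such mappings from configuration.

```python
# Assumed missing-value tokens, compared after removing whitespace and lowercasing.
MISSING_TOKENS = {"", "n/a", "na", "null", "none"}
# Hypothetical column aliases mapping source names to one canonical name.
COLUMN_ALIASES = {"sex": "gender", "dob": "birth_date", "date_of_birth": "birth_date"}

def reconcile(record):
    """Align column names, convert numeric text to floats, and unify missing values."""
    clean = {}
    for name, value in record.items():
        key = name.strip().lower().replace(" ", "_")
        key = COLUMN_ALIASES.get(key, key)
        if value is None or "".join(str(value).split()).lower() in MISSING_TOKENS:
            clean[key] = None  # unified missing-value representation
            continue
        if isinstance(value, str):
            text = value.strip()
            try:
                value = float(text)  # text-based number to numeric format
            except ValueError:
                value = text
        clean[key] = value
    return clean

def one_hot(values):
    """One-hot encode a categorical column; missing values become all-zero rows."""
    categories = sorted({v for v in values if v is not None})
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def min_max(values):
    """Scale a numerical column to [0, 1], leaving missing values as None."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [None if v is None else (v - low) / span for v in values]
```

For example, `reconcile({"Sex": "F", "Age": "42"})` and `reconcile({"gender": "M", "age": " N / A "})` produce records with the same two keys, `gender` and `age`, with the second age mapped to `None`.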

[0093] The method (400) includes predicting, by the processor, one or more training requirements corresponding to the schema-consistent dataset, wherein the one or more training requirements comprises a batch size, a learning rate, and convergence criteria, wherein the predicted one or more training requirements are selected to optimize a training pipeline in step 420. The processor analyses dataset volume, feature distribution, and encoding density to determine configurations that support stable and efficient training.
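
The disclosure does not specify how the training requirements are predicted. Purely as an illustrative assumption, a rule-based predictor could look like the sketch below, in which the thresholds, the 1% batch-size rule, and the linear learning-rate scaling are hypothetical heuristics rather than the claimed method.

```python
from dataclasses import dataclass

@dataclass
class TrainingRequirements:
    batch_size: int
    learning_rate: float
    max_epochs: int
    patience: int      # epochs without improvement before stopping
    min_delta: float   # smallest loss decrease that counts as improvement

def predict_requirements(n_rows, n_encoded_features, base_lr=1e-3, base_batch=64):
    """Heuristically derive training settings from dataset volume and encoding width."""
    # Batch size: largest power of two not exceeding about 1% of the rows,
    # clamped to the range [16, 1024].
    raw = max(1, n_rows // 100)
    batch = 2 ** (raw.bit_length() - 1)
    batch = min(1024, max(16, batch))
    # Linear scaling heuristic: learning rate grows in proportion to batch size.
    learning_rate = base_lr * batch / base_batch
    # Wider encodings are given a larger epoch budget before training is abandoned.
    max_epochs = 100 if n_encoded_features < 100 else 200
    return TrainingRequirements(batch, learning_rate, max_epochs, patience=10, min_delta=1e-4)
```

Here `patience` and `min_delta` together play the role of the convergence criteria: training stops when the loss has not improved by at least `min_delta` for `patience` consecutive epochs.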

[0094] The method (400) includes converting, by the processor, the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements in step 425. The processor restructures the encoded schema-consistent dataset into tensors, arrays, or batches that align with the chosen batch size, learning rate expectations, and convergence settings.
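
A minimal sketch of the batching part of step 425 is given below, assuming the encoded dataset is already a list of numeric rows. The function name and the `drop_last` option are illustrative assumptions.

```python
def to_batches(rows, batch_size, drop_last=False):
    """Partition encoded rows into consecutive batches of at most batch_size rows."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the final short batch when fixed-size batches are required
    return batches
```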

[0095] The method (400) includes training, by the processor, one or more generative models using the training-ready format, wherein the one or more generative models is configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset in step 430. The processor iteratively updates model parameters to learn distributions, correlations, and feature interactions present in the original data.

[0096] The method (400) includes generating, by the processor, a plurality of synthetic data samples by sampling from the trained generative models, wherein the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset in step 435. Example of the one or more statistical properties includes, but is not limited to, mean ranges, variance bounds, category frequencies, or correlation patterns. It must be noted that maintaining mathematical consistency ensures analytical usefulness without exposing real individuals.
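
One way to check the statistical properties named above is sketched here for a single numerical column and a single categorical column. The tolerance values and function names are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pvariance

def numeric_consistent(real, synth, mean_tol=0.05, var_tol=0.10):
    """True if the synthetic mean and variance lie within relative tolerances of the real ones."""
    mean_real, mean_synth = mean(real), mean(synth)
    var_real, var_synth = pvariance(real), pvariance(synth)
    mean_ok = abs(mean_synth - mean_real) <= mean_tol * max(abs(mean_real), 1e-12)
    var_ok = abs(var_synth - var_real) <= var_tol * max(var_real, 1e-12)
    return mean_ok and var_ok

def max_frequency_gap(real, synth):
    """Largest absolute difference in relative category frequency between the two columns."""
    freq_real, freq_synth = Counter(real), Counter(synth)
    categories = set(freq_real) | set(freq_synth)
    return max(abs(freq_real[c] / len(real) - freq_synth[c] / len(synth)) for c in categories)
```

A synthetic column would be accepted when `numeric_consistent` returns `True` or when `max_frequency_gap` falls below a chosen threshold, depending on the column type.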

[0097] The method (400) includes generating, by the processor, one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset in step 440. The processor computes one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, the similarity for both datasets and constructs formal proof statements using statistical tests, distributional-distance bounds, and confidence-interval analyses. Example of the one or more mathematical proofs includes, but is not limited to, hypothesis-test proofs (p-values and CIs), KL- or Wasserstein-distance bounds, and constraint-satisfaction certificates for fairness indices.
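
The distance-bound examples above may be illustrated with the following sketch, which computes an empirical KL divergence for a categorical column and an empirical one-dimensional Wasserstein-1 distance for a numerical column, then records whether each value falls under a chosen bound. The function names, the smoothing constant, and the certificate layout are assumptions; the sketch records an empirical check and does not by itself constitute a formal proof.

```python
import math
from collections import Counter

def kl_divergence(real, synth, smoothing=1e-9):
    """Empirical KL(real || synth) over category frequencies, with additive smoothing."""
    categories = set(real) | set(synth)
    count_real, count_synth = Counter(real), Counter(synth)
    total = 0.0
    for c in categories:
        p = (count_real[c] + smoothing) / (len(real) + smoothing * len(categories))
        q = (count_synth[c] + smoothing) / (len(synth) + smoothing * len(categories))
        total += p * math.log(p / q)
    return total

def wasserstein_1d(real, synth):
    """Empirical Wasserstein-1 distance between two equal-size one-dimensional samples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(real), sorted(synth))) / len(real)

def certificate(metric, value, bound):
    """Record whether a computed distance satisfies its predefined bound."""
    return {"metric": metric, "value": value, "bound": bound, "holds": value <= bound}
```

For equal-size samples, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values, which is why sorting both samples suffices here.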

[0098] The method (400) includes assessing, by the processor, a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilizing the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples in step 445. The processor first classifies attributes of the schema-consistent dataset into sensitivity categories corresponding to direct identifiers, quasi-identifiers, and sensitive attributes, and then computes representation ratios for one or more protected and non-protected subgroups to quantify proportional presence and subgroup sparsity. The processor derives risk indicators including re-identification scores, subgroup performance gaps, distributional skew measures, and correlation-leak indicators. Based on the derived risk indicators, the processor maps identified risks to corrective actions and configures model and sampling parameters accordingly, wherein the configured parameters include but are not limited to sampling probabilities, fairness-regularizer weights, privacy-noise scales, and conditional-sampling constraints.
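
A simplified sketch of step 445 follows. The sensitivity table, the thresholds, and the action names are hypothetical, and the smallest quasi-identifier group size is used here as a stand-in re-identification indicator; the disclosure does not state how its re-identification scores are computed.

```python
from collections import Counter

# Hypothetical classification of attributes into sensitivity categories.
SENSITIVITY = {
    "patient_id": "direct_identifier",
    "zip_code": "quasi_identifier",
    "birth_year": "quasi_identifier",
    "diagnosis": "sensitive",
}

def representation_ratios(rows, attribute):
    """Share of rows belonging to each subgroup of the given attribute."""
    counts = Counter(row[attribute] for row in rows)
    return {group: n / len(rows) for group, n in counts.items()}

def flag_imbalance(ratios, min_share=0.10):
    """Subgroups whose share falls below the predefined threshold."""
    return sorted(group for group, share in ratios.items() if share < min_share)

def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def corrective_actions(rows, protected, min_share=0.10, min_k=2):
    """Map detected risks to corrective actions for the model and the sampler."""
    actions = {}
    flagged = flag_imbalance(representation_ratios(rows, protected), min_share)
    if flagged:
        actions["boost_sampling_for"] = flagged
    quasi = [a for a, level in SENSITIVITY.items() if level == "quasi_identifier"]
    if smallest_group_size(rows, quasi) < min_k:
        actions["generalize"] = quasi
    return actions
```

The returned dictionary can then drive the parameters listed above, for example by raising sampling probabilities for the flagged subgroups or by generalizing the listed quasi-identifiers.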

[0099] The method (400) includes presenting, by the processor, via a user interface, one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user operating a user device in step 450. Example of the presented information includes, but is not limited to, job status badges, subgroup fairness-index values, differential-privacy scores, similarity-based utility metrics, and total synthetic-row counts.

[0100] Thus, various embodiments of the system and method for privacy preserved and fair synthetic data with proofs provide several benefits. By receiving a plurality of seed data samples with differing schemas and converting them into a schema-consistent dataset, the system reduces the manual burden of data reconciliation and ensures that heterogeneous data sources can be processed reliably. The encoding of categorical, numerical, and textual columns according to column data characteristics improves model readiness and supports accurate generative training. Predicting one or more training requirements allows the system to optimize the training pipeline automatically, resulting in stable convergence and improved synthetic-data quality. Generating a plurality of synthetic data samples that correspond to the schema-consistent dataset ensures that statistical properties and relational patterns are preserved without exposing sensitive information. The generation of one or more mathematical proofs provides assurance that the synthetic data meets defined accuracy, fairness, and privacy expectations, enabling auditability and regulatory confidence. Assessing a plurality of privacy and fairness risks before and during data generation ensures that representation ratios, sensitive attributes, and imbalance conditions are addressed systematically. Applying fairness correction and privacy-preserving transformations during generation further strengthens responsible-data practices. Finally, presenting creation jobs, validation metrics, and data-utility information through a user interface offers transparency, governance, and ease of adoption, supporting seamless integration of synthetic data into downstream environments.

Claims

WE CLAIM:

1. A system for privacy preserved and fair synthetic data with proofs, comprising: a processor; a memory coupled to the processor, wherein the memory comprises instructions that, when executed by the processor, cause the processor to: receive a plurality of seed data samples comprising differing schemas from one or more data sources; combine the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, data types, and missing-value semantics; encode the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation; predict one or more training requirements corresponding to the schema-consistent dataset, wherein the one or more training requirements comprises a batch size, a learning rate, and convergence criteria, wherein the predicted one or more training requirements are selected to optimize a training pipeline; convert the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements; train one or more generative models using the training-ready format, wherein the one or more generative models is configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset; generate a plurality of synthetic data samples by sampling from the trained generative models, wherein the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset; generate one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset; assess a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilize the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples; and present, via a user interface, one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user operating a user device.

2. The system as claimed in claim 1, wherein the plurality of seed data samples comprise at least one of structured tabular data, semi-structured tabular data, or tabular data derived from unstructured sources.

3. The system as claimed in claim 1, to cause the processor to encode a plurality of categorical columns, numerical columns, and textual columns of the schema-consistent dataset corresponding to the column data characteristic and a required type or format of synthetic data generation as defined by the one or more encoding configurations.

4. The system as claimed in claim 1, to cause the processor to partition the training-ready format into a plurality of training batches based on the predicted one or more training requirements.

5. The system as claimed in claim 1, to cause the processor to compute the one or more mathematical proofs by calculating one or more similarity metrics, one or more fairness indices, and one or more privacy differentials between the plurality of synthetic data samples and the corresponding real data.

6. The system as claimed in claim 1, wherein the assessment of privacy and fairness risks comprises determining attribute sensitivity levels, computing representation ratios for protected groups, and flagging imbalance conditions based on predefined thresholds.

7. The system as claimed in claim 1, to cause the processor to store the one or more mathematical proofs and corresponding validation metadata in a validation repository for subsequent verification and audit.

8. The system as claimed in claim 1, to cause the processor to apply a fairness correction during generation of the plurality of synthetic data samples by iteratively adjusting sampling probabilities of values of one or more sensitive attributes.

9. The system as claimed in claim 1, to cause the processor to apply at least one privacy-preserving transformation to the plurality of synthetic data samples selected from the group consisting of differential privacy noise addition, anonymization, feature masking, and attribute generalization.

10. The system as claimed in claim 1, to cause the processor to inject the plurality of synthetic data samples and associated validation metadata to one or more configured destination environments for a downstream usage.

11. A method for privacy preserved and fair synthetic data with proofs, comprising: receiving, by a processor, a plurality of seed data samples comprising differing schemas from one or more data sources; combining, by the processor, the plurality of seed data samples into a schema-consistent dataset to ensure consistency across the plurality of seed data samples by reconciling attribute names, data types, and missing-value semantics; encoding, by the processor, the schema-consistent dataset based on column data characteristics and encoding suitability to optimize data representation; predicting, by the processor, one or more training requirements corresponding to the schema-consistent dataset, wherein the one or more training requirements comprises a batch size, a learning rate, and convergence criteria, wherein the predicted one or more training requirements are selected to optimize a training pipeline; converting, by the processor, the encoded schema-consistent dataset into a training-ready format based on the predicted one or more training requirements; training, by the processor, one or more generative models using the training-ready format, wherein the one or more generative models is configured to generate synthetic tabular data that preserve statistical and relational characteristics of the schema-consistent dataset; generating, by the processor, a plurality of synthetic data samples by sampling from the trained generative models, wherein the plurality of synthetic data samples corresponds to the schema-consistent dataset and maintains a mathematical consistency with respect to one or more statistical properties of the schema-consistent dataset; generating, by the processor, one or more mathematical proofs that validate, for one or more fairness metrics and one or more accuracy metrics, a similarity between the plurality of synthetic data samples and corresponding real data of the schema-consistent dataset; assessing, by the processor, a plurality of privacy and fairness risks associated with the schema-consistent dataset by analysing attribute sensitivity levels and representation ratios, and utilize the assessment to guide parameters of the trained generative models and the sampling of the plurality of synthetic data samples; and presenting, by the processor, via a user interface, one or more creation jobs and corresponding statuses, fairness indices, privacy protection scores, data utility metrics, and generation volumes for monitoring and governance of the plurality of synthetic data samples to a user operating a user device.