Method for generating synthetic data
Tensor networks with differential privacy mechanisms address the challenge of generating synthetic data that maintain privacy and utility, effectively capturing real-world data patterns.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BUNDESDRUCKEREI GMBH
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-18
Abstract
Description
[0001] Bundesdruckerei GmbH, et al.
[0002] B89594WO
[0003] Method for generating synthetic data
[0004] The present disclosure refers to a method for generating synthetic data, a system for generating synthetic data, a corresponding computer program product, and a computer-readable medium.
[0005] Background
[0006] Synthetic data generation has emerged as a crucial endeavor in the realm of artificial intelligence, addressing the challenges posed by data scarcity and the need for diverse datasets in training machine learning models. One of the main challenges in leveraging synthetic data is ensuring that it accurately captures the underlying patterns and distributions of real-world data. An overview of synthetic data generation in the context of machine learning is provided in Lu et al., Journal of LaTeX files, Vol. 14, No. 8, 2021. An introduction to machine learning can for example be found in A. Jung, Machine Learning: The Basics, Springer, Singapore, 2022.
[0007] In the article Han et al., Unsupervised Generative Modeling Using Matrix Product States, ArXiv e-prints, arXiv:1709.01662 (2018), a generative model using matrix product states is described, which allows dynamically adjusting dimensions of the tensors and offers a sampling approach for generative tasks. The method is applied to generative modeling of several standard datasets including the Bars and Stripes, random binary patterns, and the MNIST handwritten digits.
[0008] In many industries, the use of real data is restricted due to privacy concerns, which limits the ability to perform comprehensive data analysis and machine learning model training. While synthetic data offers numerous benefits, including scalability, accessibility, and versatility, its use also raises significant concerns regarding privacy risks. Unlike real-world data, synthetic data are not derived from actual observations but are instead generated by algorithms trained on existing data. As a result, synthetic data may inadvertently disclose sensitive information about individuals or entities present in the original data, posing potential threats to privacy and confidentiality. An overview of differential privacy in view of machine learning is available in Abadi et al., Deep Learning with Differential Privacy, 23rd ACM Conference on Computer and Communications Security, 2016.
[0009] Known synthetic data generation approaches make use of, e.g., CTGAN (Conditional Tabular Generative Adversarial Network), Gaussian copulas, Bayesian Networks (BN-Co), and Bayesian Network Identities (BN-Id). However, traditional methods either compromise privacy or result in synthetic data with comparatively low utility.
BOEHMERT & BOEHMERT
[0012] Summary
[0013] It is an object of the present disclosure to provide a method for generating synthetic data that preserve privacy while still being representative of the original data.
[0014] For solving the problem, a computer-implemented method for generating synthetic data is provided according to independent claim 1. Moreover, a system for generating synthetic data, a computer program product, and a computer-readable medium are provided. Further embodiments are disclosed in dependent claims.
[0015] The method according to the invention addresses the above issues by making use of tensor networks to generate synthetic data with built-in differential privacy mechanisms. The method may allow for the generated data to protect individual privacy without sacrificing performance on key utility metrics, enabling broader and safer use of synthetic data across various applications.
[0016] According to an aspect of the invention, the method for generating synthetic data comprises: - providing a tensor network including a plurality of tensors and a training dataset including at least one training data string;
[0017] - training the tensor network with respect to the training dataset by gradient descent, comprising the following steps:
[0018] - determining a tensor network gradient from the tensor network, the tensor network gradient being evaluated using the training dataset;
[0019] - applying noise to the tensor network gradient; and
[0020] - adjusting the tensor network based on the tensor network gradient;
[0021] - generating, from the tensor network, synthetic data including a synthetic data string, wherein:
[0022] - each component to be sampled of the synthetic data string is generated according to a sample probability which is a marginal probability for the component or a conditional probability conditioned on at least one value of a further component of the synthetic data string and
[0023] - at least one of determining the marginal probability and determining the conditional probability comprises separating, for the component, a corresponding partial tensor network from the tensor network and determining a squared norm of the corresponding partial tensor network.
[0026] Further, a data processing system comprising means for carrying out the method is provided. In addition, a computer program product is provided that comprises instructions which, when the program is executed by a computer, cause the computer to carry out the method. Moreover, a computer-readable medium is provided that comprises instructions which, when executed by a computer, cause the computer to carry out the method.
[0027] The tensor network may, e.g., be an MPS (matrix product state) tensor network, a tensor train, an MPO (matrix product operator) tensor network, a tree tensor (TT) network, or a MERA (multi-scale entanglement renormalization ansatz) tensor network (cf. Vidal, arXiv:cond-mat / 0512165). The tensor network may be a chain-like tensor network, such as MPS, TT, or MPO (cf. Schollwoeck, arXiv: 1008.3477).
[0028] In case of an MPS tensor network, the tensor network may have a left-canonical form (left-normalized), a right-canonical form (right-normalized), or a mixed-canonical form.
[0029] The tensor network may comprise at least one tensor with tensor rank (tensor order) of at least 3. In particular, the tensor network may comprise at least two, five, or ten tensors with tensor rank (tensor order) of at least 3. The tensor network may comprise at least two, five, or ten tensors with tensor rank (tensor order) of at least 4.
[0030] The training dataset may comprise a binary-valued training data string and / or an integer-valued training data string.
[0031] A dataset in the sense of the present application (in particular, the training dataset) may refer to a structured collection of data points that usually relate to a specific topic and / or have been collected or measured for a specific purpose. The dataset may in particular be related to and / or be indicative of a predefined set of attributes, wherein individual datapoints each contain values for at least a subset of the attributes of the dataset. The values can be measured values, numbers, text data, and / or other values that describe the respective datapoint.
[0032] The training dataset can be partitioned into batches for training and / or into mini-batches for each gradient descent step.
[0035] The synthetic data may be an artificial dataset generated based on the training dataset, generated using the method with the aim of reflecting statistical properties of the training data set.
[0036] The method may further comprise preprocessing the training data set, the preprocessing including at least one of:
[0037] - discretizing a continuous data string to a discretized data string,
[0038] - mapping a discretized data string to a binary-valued or integer-valued training data string, and
[0039] - mapping a categorical data string to a binary-valued or integer-valued training data string.
[0040] As a result, the training data strings may be both indicative of original attributes and usable for the method.
[0041] Discretizing the continuous data string may comprise binning the continuous data string. Data bins for the binning may each have the same length. Alternatively, the data bins may have varying lengths, e.g., adjusted lengths based on the structure of the continuous data. The mapping may be carried out according to a one-to-one map. Mapping to a binary-valued string may comprise one-hot encoding.
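The preprocessing described above can be illustrated with a minimal sketch, assuming NumPy, equal-length bins, and one-hot encoding. The function names `discretize_equal_width` and `one_hot` and the example height values are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def discretize_equal_width(values, n_bins):
    """Map continuous values to integer bin indices 0..n_bins-1 (bins of equal length)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Only the interior edges are passed so that the maximum falls into the last bin.
    return np.digitize(values, edges[1:-1])

def one_hot(indices, n_values):
    """Map integer values to binary-valued strings (one row per value)."""
    return np.eye(n_values, dtype=int)[np.asarray(indices)]

heights_cm = [150.0, 162.5, 171.0, 188.0, 199.9]  # illustrative continuous data string
bins = discretize_equal_width(heights_cm, n_bins=5)
encoded = one_hot(bins, n_values=5)
```

Both maps are one-to-one on the bin level, so a sampled synthetic string could be mapped back to a bin (and, e.g., to the bin centre) by inverting them.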
[0042] The method may further comprise mapping the synthetic data string (binary-valued or integer-valued) to a discretized, continuous, or categorical synthetic data string.
[0043] The at least one training data string and / or a component thereof and / or a plurality of training data strings may be indicative of at least one of image data (e.g., for image reconstruction and / or image denoising), physiological data, medical data, ethnicity, social data, personal movement data, and address data. The physiological data may, e.g., comprise height and / or weight. The social data may comprise, e.g., family relationships. The personal movement data may comprise GPS data.
[0044] The method may comprise using the generated synthetic data for simulation and / or statistical analysis, in particular based on at least one of physiological data, medical data (e.g., synthetic patient records), ethnicity, social data (e.g., income and / or education level and / or occupation), personal movement data, and address data. For example, synthetic patient records may be used for evaluating new medication across various demographics. Further, creating synthetic datasets reflecting different ethnic backgrounds or social data may facilitate assessing healthcare access disparities.
[0047] Synthetic movement data may in particular be used for at least one of urban planning, traffic flow optimization, crowd flow simulation (e.g., during large events), epidemiological simulations (e.g., potential spread of infectious diseases) and physical activity analysis and / or simulation.
[0048] Thus, the method may comprise carrying out, based on the (generated) synthetic data, at least one of urban planning simulation, traffic flow simulation, crowd flow simulation, epidemiological simulation, and physical activity simulation. Further, the method may also comprise image reconstruction and / or image denoising based on the synthetic data.
[0049] The method may comprise assessing the synthetic data using at least one metric, e.g., a fidelity metric or a quality metric, in particular an F1 score.
[0050] The tensor network may be adjusted as a result of maximum likelihood estimation with respect to the training dataset. To this end, the tensor network gradient may be determined from a negative log-likelihood function of a norm of the tensor network, in particular a (normalized) squared norm of the tensor network.
[0051] For example, the negative log-likelihood function may be $\mathcal{L} = -\frac{1}{|T|}\sum_{v \in T} \ln P(v)$, with $P(v) = |\Psi(v)|^2 / Z$, $\Psi(v)$ being a tensor network wave function, and $Z$ being a normalization factor.
[0054] The tensor network gradient may be represented by
$$\frac{\partial \mathcal{L}}{\partial A^{(k,k+1)\,w_k w_{k+1}}_{i_{k-1} i_{k+1}}} = \frac{Z'}{Z} - \frac{2}{|T|}\sum_{v \in T}\frac{\Psi'(v)}{\Psi(v)},$$
with $\Psi(v) = \mathrm{Tr}\left(A^{(1)v_1} A^{(2)v_2} \cdots A^{(N)v_N}\right)$, derivative $\Psi'(v)$, normalization factor $Z$, derivative normalization factor $Z'$, and training set size $|T|$.
[0059] The steps for training the tensor network may be repeated. In other words, the method may comprise training the tensor network with respect to the training dataset by gradient descent, comprising repeating the steps as defined in claim 1. The steps for training the tensor network may be repeated until a convergence criterion (for example, a gradient norm being below a predefined threshold) is met.
[0062] Alternatively, the steps for training the tensor network may be repeated a predetermined number of times. For example, a number of gradient descent steps may be smaller than a number of employed batches. By minimizing the number of gradient descent steps, the amount of required noise may be reduced, thus better preserving data utility while maintaining strong privacy guarantees.
[0063] The tensor network may be trained by stochastic gradient descent. Preferably, the training may comprise determining mini-batch gradients (or batch gradients) from the tensor network, the mini-batch gradients (or the batch gradients) being evaluated using multiple samples from the training dataset, and determining the tensor network gradient as an average of the mini-batch gradients (or the batch gradients). For each (stochastic) gradient descent step, the training dataset may be split into corresponding mini-batches. A mini-batch may correspond to a subset of a batch.
[0064] Applying noise to the tensor network gradient may comprise entrywise adding random noise to the tensor network gradient.
[0065] The noise may be applied to the tensor network gradient for each (stochastic) gradient step (iteration). Alternatively, the noise may be applied to the tensor network gradient for at least one (stochastic) gradient step.
[0066] The random noise may include at least one of Gaussian noise and Laplacian noise. In particular, the random noise may include both Gaussian noise and Laplacian noise. For example, both Laplacian noise and Gaussian noise may be added to the tensor network gradient(s).
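Entrywise noise addition can be sketched as follows, assuming the gradient is held as a NumPy array. The helper name `add_noise` and its parameters are hypothetical; the disclosure does not prescribe a particular implementation.

```python
import numpy as np

def add_noise(gradient, scale, kind="gaussian", rng=None):
    """Add independent random noise to every entry of the gradient array."""
    rng = np.random.default_rng() if rng is None else rng
    if kind == "gaussian":
        noise = rng.normal(loc=0.0, scale=scale, size=gradient.shape)   # scale = standard deviation
    elif kind == "laplace":
        noise = rng.laplace(loc=0.0, scale=scale, size=gradient.shape)  # scale = Laplace scaling parameter
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return gradient + noise

# Illustrative order-4 gradient with arbitrary dimensions.
gradient = np.zeros((3, 3, 2, 2))
noisy = add_noise(gradient, scale=0.1, kind="laplace", rng=np.random.default_rng(0))
```

Adding both noise types, as mentioned above, would amount to calling the helper twice with the respective `kind`.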
[0067] A noise parameter of the random noise may be determined from a differential privacy parameter or two differential privacy parameters. The differential privacy parameter(s) may be predetermined, e.g., based on a desired degree of privacy protection. The differential privacy parameter(s) may be determined based on an equation of moment for differential privacy. In general, the noise parameter may be determined from at least one of a differential privacy parameter, a model sensitivity value, a sampling ratio per lot, and a number of training iterations.
[0070] The noise parameter $\sigma$ may for example be a standard deviation (for Gaussian noise) or a scaling parameter (for Laplacian noise). The standard deviation $\sigma$ may, e.g., be determined as
$$\sigma = c \cdot q \cdot \frac{\sqrt{t \log(1/\delta)}}{\varepsilon},$$
with differential privacy parameters $\varepsilon$ and $\delta$, model sensitivity value $c$, sampling ratio per lot $q$, and number of training iterations $t$. On the other hand, the scaling parameter $\sigma$ may be determined as
$$\sigma = \frac{\Delta f}{\varepsilon},$$
with differential privacy parameter $\varepsilon$ and model sensitivity value $\Delta f$.
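As a rough numerical illustration, the two noise parameters can be computed as below. The sketch assumes the Gaussian formula takes the form used in Abadi et al. (standard deviation proportional to the square root of t log(1/δ), divided by ε) and that the Laplace scale is the sensitivity divided by ε. The function names and the example values are hypothetical.

```python
import math

def gaussian_sigma(c, q, t, epsilon, delta):
    """Standard deviation for Gaussian noise (assumed form: c * q * sqrt(t * log(1/delta)) / epsilon)."""
    return c * q * math.sqrt(t * math.log(1.0 / delta)) / epsilon

def laplace_scale(sensitivity, epsilon):
    """Scaling parameter for Laplacian noise: sensitivity divided by epsilon."""
    return sensitivity / epsilon

# Illustrative values only: smaller epsilon (stronger privacy) yields more noise.
sigma_strict = gaussian_sigma(c=1.0, q=0.01, t=1000, epsilon=0.5, delta=1e-5)
sigma_loose = gaussian_sigma(c=1.0, q=0.01, t=1000, epsilon=5.0, delta=1e-5)
```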
[0078] In general, the noise parameter may be determined such that a predetermined degree of differential privacy for the synthetic data is achieved. The method may comprise providing at least one value indicating the predetermined degree of differential privacy for the synthetic data, e.g., via user input.
[0079] The training of the tensor network may further comprise constraining a tensor network gradient norm (a norm of the tensor network gradient) below a predefined threshold (clipping). For example, in case the tensor network gradient norm is above the predefined threshold, the tensor network gradient may be renormalized such that the tensor network gradient norm is equal to the predefined threshold.
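The clipping step can be sketched as follows, assuming the norm is the Euclidean norm taken over all entries of the gradient array (the text does not fix a particular norm). The name `clip_gradient` is hypothetical.

```python
import numpy as np

def clip_gradient(gradient, threshold):
    """Rescale the gradient so that its norm does not exceed the threshold."""
    norm = np.linalg.norm(gradient)  # 2-norm over all entries of the array
    if norm > threshold:
        return gradient * (threshold / norm)
    return gradient

large = np.full((2, 2, 2), 1.0)           # norm sqrt(8), above the threshold
clipped = clip_gradient(large, threshold=1.0)
small = clip_gradient(0.1 * large, threshold=1.0)  # norm below the threshold, unchanged
```

Clipping bounds the influence of any single update, which is what allows the sensitivity value in the noise formulas to be treated as a known constant.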
[0080] The synthetic data string may comprise at least one predetermined component (i.e., a component not to be sampled). For example, at least one component of the synthetic data string may be known and / or fixed. The at least one predetermined component may be at the beginning and / or at the end and / or in the middle of the synthetic data string.
[0081] The training dataset may include a plurality of training data strings. The training data string or each of the training data strings may comprise a plurality of (training data string) components (elements). The synthetic data (synthetic dataset) may include a plurality of synthetic data strings. The synthetic data string or each of the synthetic data strings may comprise a plurality of (synthetic data string) components (elements). The training data string(s) and / or synthetic data string(s) may be a vector (vectors), preferably (each) comprising a plurality of vector components.
[0084] The tensor network gradient may be evaluated at the training data string, at a plurality of training data strings, or from a (sampled) subset of the training dataset.
[0085] Each tensor of the tensor network may correspond to a component (element) of the synthetic data string and / or a component (element) of at least one or each of the training data strings. Adjusting the tensor network may comprise adjusting the tensor network parameters.
[0086] The conditional probability (according to which a component to be sampled is generated) can be a quotient of two marginal probabilities (for components of the synthetic data string). In particular, the conditional probability (here: $P(v_{k-1} \mid v_k, \dots)$) can be a marginal probability ($P(v_{k-1}, v_k, \dots)$) for the component (i.e., $k-1$) and at least the further component (i.e., $k$) divided by a marginal probability ($P(v_k, \dots)$) for at least the further component ($k$). The conditional probability and / or the marginal probabilities may or may not be normalized. In the latter case, the method may comprise normalizing the conditional probability and / or the marginal probabilities.
[0087] The marginal probability ($P(v_{k-1}, v_k, \dots)$) for the component ($k-1$) and at least the further component ($k$) may be determined by separating (determining), for the component ($k-1$), a corresponding partial tensor network from the tensor network and determining a squared norm $|x_{v_{k-1}, v_k, \dots}|^2$ of the corresponding partial tensor network. The corresponding partial tensor network may comprise (at least) a tensor (i.e., $A^{(k-1)}$) corresponding to the component ($k-1$) and a further tensor ($A^{(k)}$) corresponding to the component ($k$).
[0092] The marginal probability ($P(v_k, \dots)$) for at least the further component (here: $k$) may also be determined by separating, for the component ($k$), a corresponding (further) partial tensor network from the tensor network and determining a (further) squared norm $|x_{v_k, \dots}|^2$ of the corresponding (further) partial tensor network. The corresponding partial tensor network may comprise (at least) the further tensor ($A^{(k)}$) corresponding to the component ($k$).
[0096] In case of further components $k$, […] the (further) squared norm corresponds to: […]
[0100] Determining the marginal probability and / or determining the conditional probability may comprise normalizing the squared norm and / or the further squared norm. In particular, determining the marginal probability and / or determining the conditional probability may comprise dividing the squared norm and / or the further squared norm by a normalization factor ($Z$). The normalization factor may be determined by contracting the tensor network: $Z = \sum_{v_N} |x_{v_N}|^2$.
[0102] In case the sample probability (itself) is a marginal probability ($P(v_N)$) for the component ($N$), the marginal probability may also be determined as a sum over remaining component indices ($v_1, v_2, \dots, v_{N-1}$): $P(v_N) = \sum_{v_1, v_2, \dots, v_{N-1}} P(v)$. Equivalently, the marginal probability may be determined using a (normalized) squared norm $|x_{v_N}|^2 / Z$ of the corresponding partial tensor network.
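The component-by-component sampling can be illustrated with a small sketch. It assumes a real-valued MPS with open boundaries (outermost bond dimension 1) stored as NumPy arrays of shape (left bond, physical, right bond), which is a storage convention chosen here for convenience. All function names are hypothetical. Because the sketch does not assume a canonical form, the squared norm of each partial network is weighted by a left environment matrix; for a left-canonical MPS that matrix is the identity and the weight reduces to a plain squared norm.

```python
import numpy as np

def left_environments(tensors):
    """envs[k]: sites before site k contracted with their copies, physical indices summed."""
    envs = [np.ones((1, 1))]
    for A in tensors:  # A has shape (D_left, d, D_right)
        envs.append(np.einsum("ab,avc,bvd->cd", envs[-1], A, A))
    return envs

def site_probabilities(A, env_left, right_vec):
    """Sample probabilities for one site, conditioned on the values already fixed to its right."""
    x = np.einsum("avb,b->va", A, right_vec)            # partial network, one row per value v
    weights = np.einsum("va,ab,vb->v", x, env_left, x)  # unnormalized marginal for each value v
    return weights / weights.sum(), x                   # quotient of two marginals

def sample_string(tensors, rng):
    """Draw one synthetic data string, last component first."""
    envs = left_environments(tensors)
    right_vec = np.ones(1)
    sample = []
    for k in range(len(tensors) - 1, -1, -1):
        probs, x = site_probabilities(tensors[k], envs[k], right_vec)
        v = rng.choice(len(probs), p=probs)
        sample.append(int(v))
        right_vec = x[v]  # keep the sampled value fixed for the remaining sites
    return sample[::-1]

rng = np.random.default_rng(0)
dims = [1, 3, 3, 1]  # illustrative bond dimensions for three binary components
tensors = [rng.normal(size=(dims[k], 2, dims[k + 1])) for k in range(3)]
synthetic = [sample_string(tensors, rng) for _ in range(5)]
```

Each draw costs one pass over the chain, so no sum over the exponentially many strings is needed.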
[0105] In an example, a computer-implemented method for generating synthetic data may be provided, comprising at least one of: providing a tensor network, preferably including a plurality of tensors, and a training dataset, preferably including at least one training data string; training the tensor network with respect to the training dataset by gradient descent. The training may preferably comprise at least one of the following steps: determining a (tensor network) gradient from the tensor network, the (tensor network) gradient being evaluated using the training dataset; applying noise to the (tensor network) gradient; and adjusting the tensor network based on the (tensor network) gradient. In addition or alternatively, the method may further comprise generating, from the tensor network, synthetic data, preferably including a synthetic data string. In addition or alternatively, each component, which is to be sampled, of the synthetic data string may be generated according to a (sample) probability which preferably is a marginal probability (for the component) or a conditional probability conditioned on at least one value of a further component of the synthetic data string. In addition or alternatively, at least one of determining the marginal probability and determining the conditional probability may comprise separating and / or determining, for the component, a corresponding partial tensor network from the tensor network and / or determining a squared norm of the corresponding partial tensor network.
[0108] In addition or alternatively, the generating of the synthetic data comprises, for each component to be sampled of the synthetic data string, successively contracting the tensor network up to a tensor associated with the synthetic datapoint.
[0109] Providing the tensor network may be carried out in a provision module or provision unit. Training the tensor network may be carried out in a training module or training unit. Determining the tensor network gradient may be carried out in a gradient submodule or gradient subunit. Applying noise may be carried out in a noise submodule or noise subunit. Adjusting the tensor network based on the tensor network gradient may be carried out in an updating submodule or updating subunit. Generating the synthetic data may be carried out in a data generation module or data generation unit.
[0110] At least one of the data processing system, the computer program, and the computer-readable medium may comprise the provision module, the training module, and the data generation module. At least one of the data processing system, the computer program, and the computer-readable medium may comprise the provision unit, the training unit, and the data generation unit. Each module or unit may be provided in a separate data processing device or in a shared data processing device. Each module or unit may also be distributed over a plurality of data processing devices (separately or shared). The training module (unit) may comprise the gradient submodule (subunit), the noise submodule (subunit), and the updating submodule (subunit). Each (sub-)module or (sub-)unit may comprise an input / output interface for data communication with the other (sub-)modules or (sub-)units.
[0111] The embodiments described above in connection with the method for generating synthetic data may be provided correspondingly for at least one of the data processing system, the computer program product, and the computer-readable medium.
[0112] Brief description of drawings
[0113] In the following, embodiments, by way of example, are described with reference to drawings in which:
[0115] Fig. 1 shows graphical representations for illustrating tensors and tensor operations;
[0116] Fig. 2 shows an overview flowchart of the method for generating synthetic data according to the invention;
[0117] Fig. 3 shows a plot illustrating the performance of the method for generating synthetic data in comparison with other methods;
[0118] Fig. 4 shows a plot illustrating the performance of the method using Gaussian noise injection for different classification algorithms;
[0119] Fig. 5 shows a plot illustrating the performance of the method using Laplacian noise injection for different classification algorithms;
[0120] Fig. 6 shows a plot illustrating the performance of the method in comparison with alternative synthetic data generation; and
[0121] Fig. 7 shows a graphical representation of a data processing system for carrying out the method.
[0122] Detailed description
[0123] Various embodiments of the disclosed method are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only.
[0124] Tensor networks
[0125] Since the method according to the invention makes use of tensor networks, a brief overview of tensor networks and tensors is provided below. Tensor networks or tensor network states represent a class of variational wave functions generally used in the study of many-body quantum systems (cf. Jacob Biamonte, Quantum Tensor Networks in a Nutshell, arXiv:1708.00006v1).
[0126] A tensor is a multi-dimensional array of numbers (e.g., complex numbers, integers or binary values) and may be denoted by the symbol $(T_{\alpha\beta\gamma\dots})_{\alpha \in I_1, \beta \in I_2, \gamma \in I_3, \dots}$ (with index sets $I_1$, $I_2$, $I_3$, etc.) or short: $T_{\alpha\beta\gamma\dots}$. The subscripts $\alpha, \beta, \gamma, \dots$ denote different tensor dimensions and the number of these dimensions corresponds to the tensor rank (tensor order) of the tensor.
[0129] Tensors can be represented by tensor diagrams as shown in the upper half of Fig. 1. A tensor of rank zero, a scalar, is represented by a single vertex, a vector (rank 1) additionally includes an edge / a “leg” corresponding to its single index $\alpha$, a matrix (rank 2) includes two legs corresponding to its two indices $\alpha$ and $\beta$, and so on.
[0132] Tensor diagrams may also represent tensor contractions, which include sums of products of scalar components of one or more tensors and may be understood as a generalization of a matrix product and a trace of a matrix, which generally results in a tensor rank reduction. For example, contraction of two rank-2 tensors, i.e., matrices $R_{\alpha\beta}$ and $S_{\beta\gamma}$, along the dimension (index) $\beta$ can be represented diagrammatically by connecting the two tensors, as is shown in the lower half of Fig. 1. The matrices $R$ and $S$, multiplied with one another by summing over index $\beta$ (tracing out index $\beta$: $Q_{\alpha\gamma} = \mathrm{Tr}_{\beta}(R_{\alpha\beta} S_{\beta\gamma}) = \sum_{\beta} R_{\alpha\beta} S_{\beta\gamma}$), share the same leg corresponding to index $\beta$. Both the product of the separate matrices $R$ and $S$ and the combined resulting matrix $Q$ have two free indices $\alpha$ and $\gamma$ (degrees of freedom), signified by the two corresponding legs shown in the lower half of Fig. 1.
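The contraction of two matrices along a shared index can be written directly with `numpy.einsum`; the sketch below uses arbitrary example matrices and letters `a`, `b`, `c` standing in for the three indices of the example.

```python
import numpy as np

R = np.arange(6.0).reshape(2, 3)   # indices (a, b)
S = np.arange(12.0).reshape(3, 4)  # indices (b, c)

# Summing over the shared index b leaves the two free indices a and c.
Q = np.einsum("ab,bc->ac", R, S)

# A trace is the contraction of a matrix's two indices with each other (rank 2 -> rank 0).
trace = np.einsum("aa->", np.eye(3))
```

Two rank-2 tensors carry four indices in total; the contraction removes the shared pair, so the result `Q` is again of rank 2.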
[0133] A tensor network may comprise a plurality of tensors connected with each other by tensor contractions. In this sense, the diagram including R and S shown in Fig. 1 represents a simple example of a tensor network including two tensors R and S.
[0134] Tensor networks include, e.g., matrix product states (MPS), matrix product operators (MPO), tree tensor networks, and MERA (multi-scale entanglement renormalization ansatz).
[0135] The probabilistic interpretation of quantum mechanics naturally suggests modelling a data distribution with a quantum state, and quantum physicists and chemists have developed many efficient classical representations of quantum wavefunctions. A number of these developed representations and algorithms can be adopted for efficient probabilistic modeling. In case of using a matrix product state / MPS, an $N$-dimensional wave function is parametrized as follows:
$$\Psi(v_1, v_2, \dots, v_N) = \mathrm{Tr}\left(A^{(1)v_1} A^{(2)v_2} \cdots A^{(N)v_N}\right)$$
[0139] Each $A^{(k)v_k}$ at (tensor network) site $k$ with “physical index” $v_k$ fixed is a $D_{k-1}$ by $D_k$ matrix (“bond dimensions” $D_{k-1}$ and $D_k$). The physical index $v_k$ ranges over $d_k$ values: $v_k \in \{0, 1, \dots, d_k - 1\}$ for each $k \in \{1, \dots, N + 1\}$. Hence, for each site $k$, the tensor $A^{(k)}$ has an additional dimension to account for the possible values of $v_k$: $D_{k-1} \times D_k \times d_k$. For varying $v_k$, the tensor $A^{(k)}$ corresponds to multiple $D_{k-1}$ by $D_k$ matrices, i.e., a rank-3 tensor. Further, $D_0 = D_N$ is demanded to close the trace. For a Hilbert space of dimension $2^N$, there are $2\sum_{k=1}^{N} D_{k-1} D_k$ tensor network parameters on the right-hand side of the equation; for Hilbert space dimension $N^N$, there are $N\sum_{k=1}^{N} D_{k-1} D_k$ parameters.
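Evaluating the MPS wave function for one data string amounts to picking one matrix per site and taking the trace of their product. The following sketch assumes NumPy arrays of shape (left bond, physical, right bond), which reorders the axes relative to the text for convenience; the function names and the dimensions are hypothetical.

```python
import numpy as np

def mps_amplitude(tensors, v):
    """Trace of the product of the matrices selected by the physical indices in v."""
    M = np.eye(tensors[0].shape[0])
    for A, vk in zip(tensors, v):
        M = M @ A[:, vk, :]   # fix the physical index, leaving a bond-by-bond matrix
    return np.trace(M)        # the first and last bond dimensions must match to close the trace

def parameter_count(tensors):
    """Total number of entries across all site tensors."""
    return sum(A.size for A in tensors)

rng = np.random.default_rng(1)
N, D, d = 4, 3, 2  # illustrative: four binary components, all bond dimensions equal to 3
tensors = [rng.normal(size=(D, d, D)) for _ in range(N)]
value = mps_amplitude(tensors, (0, 1, 1, 0))
```

For binary components the count equals twice the sum of the products of adjacent bond dimensions, in line with the parameter formula above.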
[0142] The representational power of MPS can be expressed using the von Neumann entanglement entropy of the quantum state, which in case of a bipartite system $(A, B)$ is defined as $S = -\mathrm{Tr}(\rho_A \ln \rho_A)$. Here, the variables are partitioned into two groups $v = (v_A, v_B)$, and $\rho_A = \sum_{v_B} \Psi(v_A, v_B)\,\Psi(v'_A, v_B)$ represents the reduced density matrix of the subsystem A (with subsystem B having been traced out). The entanglement entropy sets a lower bound for the bond dimension at the division: $S \le \ln(D_k)$. Any probability distribution of an $N$-bit system can be described by an MPS as long as its bond dimensions are free from any restriction. The inductive bias using MPS with limited bond dimensions comes from dropping off the minor components of the entanglement spectrum. Therefore, as the bond dimension increases, an MPS enhances its ability to parameterize complicated functions. Long-range interactions may be captured using, e.g., tree tensor networks.
[0143] Method for generating synthetic data
[0144] With regard to the method for generating synthetic data according to the invention, Fig. 2 shows an overview flowchart. In a first step 20, to be carried out in the provision module, a tensor network including a plurality of tensors and a training dataset including multiple training data strings are provided. In a second step 21, to be carried out in the training module, the tensor network is trained with respect to the training dataset by gradient descent. In particular, the following steps are repeated: in the gradient submodule, a tensor network gradient is determined from the tensor network and evaluated using the training dataset; in the noise submodule, noise is applied to the tensor network gradient; and in an updating submodule, the tensor network is correspondingly updated based on the tensor network gradient.
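The repeated training steps (determine gradient, apply noise, update) can be sketched as one function. This is a simplified illustration, assuming parameters and gradients are flat NumPy arrays, Gaussian noise, and optional norm clipping before the noise is added. The toy quadratic objective merely stands in for the tensor network gradient, and the name `dp_gradient_step` and all numeric values are hypothetical.

```python
import numpy as np

def dp_gradient_step(params, minibatch_grads, learning_rate, clip_threshold, noise_std, rng):
    """One noisy update: average the mini-batch gradients, clip, add noise, descend."""
    grad = np.mean(minibatch_grads, axis=0)                      # average of mini-batch gradients
    norm = np.linalg.norm(grad)
    if norm > clip_threshold:                                    # constrain the gradient norm
        grad = grad * (clip_threshold / norm)
    grad = grad + rng.normal(0.0, noise_std, size=grad.shape)    # entrywise Gaussian noise
    return params - learning_rate * grad                         # gradient descent update

rng = np.random.default_rng(0)
params = np.array([3.0, 4.0])
for _ in range(100):
    # Toy stand-in: gradient of |params|^2, evaluated on two identical "mini-batches".
    grads = [2.0 * params, 2.0 * params]
    params = dp_gradient_step(params, grads, learning_rate=0.05,
                              clip_threshold=1.0, noise_std=0.01, rng=rng)
```

In the method itself, `params` would correspond to the entries of the tensor being updated and the gradients would come from the negative log-likelihood.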
[0145] Using the thus trained tensor network, synthetic data are generated (third step 22, to be carried out in the data generation module). The synthetic data include a synthetic data string (sample). Each component to be sampled of the synthetic data string is generated according to a sample probability which is a marginal probability for the component or a conditional probability conditioned on at least one value of a further component of the synthetic data string. At least one of determining the marginal probability and determining the conditional probability comprises separating, for the component, a corresponding partial tensor network from the tensor network and determining a squared norm of the corresponding partial tensor network.
[0148] The goal of unsupervised generative modelling is to model the joint probability distribution of given data. With the trained tensor network (model), new samples from the learned probability distribution can be generated.
[0149] In the following, exemplary embodiments of the method are described in further detail.
[0150] Training
[0151] The input training dataset $T$ comprises multiple training data strings. These may be binary strings $v \in V = \{0,1\}^{\otimes N}$, which are potentially repeated and can be mapped to basis vectors of a Hilbert space of dimension $2^N$. Alternatively, the training data set $T$ comprises binary and non-binary (e.g., integer-valued) strings $v \in V = \{0, 1, \dots, N\}^N$, which are not repeated and can be mapped to basis vectors of a Hilbert space of dimension $N^N$. The training data set $T$ may also comprise non-binary strings only.
[0152] In case an MPS representation of the wave function $\Psi(v)$ is used, training can be carried out by adjusting the parameters of the wave function $\Psi(v)$ (here: the tensor network parameters) such that its probability distribution $P(v)$, which is represented by Born’s rule, i.e., $P(v) = |\Psi(v)|^2 / Z$ (with normalization factor $Z$), is as close as possible to the data distribution.
[0153] For training, maximum likelihood estimation (MLE) can be employed, which defines a (negative) log-likelihood function (NLL) and optimizes it by adjusting the parameters of the model (here: tensor network parameters). In the MPS case, the negative log-likelihood function $\mathcal{L}$ is defined as:
[0154] $$\mathcal{L} = -\frac{1}{|T|} \sum_{v \in T} \ln P(v),$$
[0157] where $|T|$ denotes the size of the training dataset. Minimizing the NLL reduces the dissimilarity between the model probability distribution $P(v)$ and the empirical distribution defined by the training dataset T. Minimizing $\mathcal{L}$ is equivalent to minimizing the Kullback-Leibler divergence between the two distributions.
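By way of a non-limiting illustration, the quantities $\Psi(v)$, $Z$ and the NLL may be evaluated for a small MPS as sketched below. The sketch assumes open boundaries (bond dimension 1 at both ends, so that the trace runs over a 1x1 matrix) and tensors stored with the index order (left bond, physical index, right bond); the function names are hypothetical and not taken from the disclosure.

```python
import numpy as np

def amplitude(tensors, v):
    """Psi(v) = Tr(A1[v1] A2[v2] ... AN[vN]) for tensors of shape (D_left, d, D_right)."""
    m = np.eye(tensors[0].shape[0])
    for A, v_k in zip(tensors, v):
        m = m @ A[:, v_k, :]
    return np.trace(m)

def partition_function(tensors):
    """Z = sum over all strings of |Psi(v)|^2, contracted site by site (open boundaries assumed)."""
    env = np.ones((1, 1))
    for A in tensors:
        # Contract the running environment with A and its complex conjugate
        # over the shared physical index; the cost is linear in the number of sites.
        env = np.einsum('ab,avc,bvd->cd', env, A, A.conj())
    return float(np.real(env[0, 0]))

def negative_log_likelihood(tensors, data):
    """L = -(1/|T|) * sum_{v in T} ln P(v), with P(v) = |Psi(v)|^2 / Z."""
    Z = partition_function(tensors)
    log_p = [np.log(np.abs(amplitude(tensors, v)) ** 2 / Z) for v in data]
    return -float(np.mean(log_p))
```

For an MPS that assigns equal amplitude to every string of length 3, each string has probability 1/8, so the NLL of any dataset equals 3 ln 2.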
[0159] The MPS (in particular, in a canonical form) may be used to differentiate the NLL with respect to the components of an order-4 tensor $A^{(k,k+1)}$, which is obtained by contracting two adjacent tensors $A^{(k)}$ and $A^{(k+1)}$. The gradient reads:
[0160] $$\frac{\partial \mathcal{L}}{\partial A^{(k,k+1)\, w_k w_{k+1}}_{i_{k-1} i_{k+1}}} = \frac{Z'}{Z} - \frac{2}{|T|} \sum_{v \in T} \frac{\Psi'(v)}{\Psi(v)},$$
[0163] wherein $\Psi(v) = \Psi(v_1, v_2, \dots, v_N) = \mathrm{Tr}\left(A^{(1) v_1} A^{(2) v_2} \cdots A^{(N) v_N}\right)$, $\Psi'(v)$ denotes the derivative of the MPS with respect to the tensor element $A^{(k,k+1)\, w_k w_{k+1}}_{i_{k-1} i_{k+1}}$ of $A^{(k,k+1)}$, and $Z' = 2 \sum_{v \in V} \Psi'(v) \Psi(v)$. For each k, the corresponding tensor $A^{(k,k+1)}$ has dimension $D_{k-1} \times D_k \times d_k \times d_{k+1}$. Although $Z$ and $Z'$ formally involve summations over an exponentially large number of terms, they are tractable in the MPS model via efficient contraction schemes. In particular, if the MPS is in a mixed-canonical form, $Z'$ can be significantly simplified to $2 A^{(k,k+1)\, w_k w_{k+1}}_{i_{k-1} i_{k+1}}$. After gradient descent, the merged order-4 tensor is decomposed into two order-3 tensors, and then the procedure is repeated for each pair of adjacent tensors.
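The merging of two adjacent tensors, the gradient step on the merged tensor, and the subsequent decomposition may, purely as an illustrative sketch, look as follows. The sketch assumes the index order (left bond, physical, physical, right bond) for the merged tensor and uses a truncated singular value decomposition for the split; the gradient is taken as given, and the function names, `max_bond` and `cutoff` are hypothetical.

```python
import numpy as np

def merge_pair(A, B):
    """Contract two adjacent order-3 tensors over their shared bond into an order-4 tensor."""
    return np.einsum('lvm,mwr->lvwr', A, B)

def split_pair(M, max_bond, cutoff=1e-12):
    """Split an order-4 tensor back into two order-3 tensors via a truncated SVD."""
    D_left, d1, d2, D_right = M.shape
    U, S, Vh = np.linalg.svd(M.reshape(D_left * d1, d2 * D_right), full_matrices=False)
    # Keep singular values above the cutoff, capped at max_bond (at least one).
    # This is where the bond dimension can grow or shrink during training.
    keep = max(1, min(max_bond, int(np.sum(S > cutoff))))
    A = U[:, :keep].reshape(D_left, d1, keep)            # left-canonical by construction
    B = (S[:keep, None] * Vh[:keep]).reshape(keep, d2, D_right)
    return A, B

def two_site_step(A, B, grad, learning_rate, max_bond):
    """One gradient-descent update of the merged tensor followed by re-splitting."""
    M = merge_pair(A, B) - learning_rate * grad
    return split_pair(M, max_bond)
```

With a zero gradient and a sufficiently large `max_bond`, the split reproduces the merged tensor; a smaller `max_bond` truncates the bond and thereby limits the model capacity at that bond.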
[0168] The above procedure may be considered as similar to DMRG (density matrix renormalization group) with a two-site update, which allows bond dimensions to be adjusted dynamically during optimization and computational resources to be allocated to important bonds which represent essential features of the input data. However, the loss function in classic DMRG is usually the energy, while the above procedure makes use of the NLL, which is a function of data. Further, when processing a large amount of data, the landscape of the loss function is typically very complicated, so that modern optimizers developed in machine learning, such as stochastic gradient descent and adaptive learning rate techniques, may be useful in the above procedure. Since the ultimate goal of learning is optimizing the performance on test datasets, there is no need to find the optimal parameters minimizing the loss on the training dataset. One usually stops training before reaching an actual minimum to prevent overfitting.
[0169] The above procedure is data-oriented. It is straightforward to parallelize over the samples since the operations applied to them are identical and independent. For example, parallelization may take place over the so-called “batch” dimension. As a concrete example using the full MNIST dataset, the GPU implementation of the MPS algorithm is at least 100 times faster than the CPU implementation.
[0171] To each tensor network gradient determined as set out above, noise is applied (injected). In the context of differential privacy, noise injection techniques play a crucial role in ensuring the privacy of sensitive data while still allowing for meaningful analysis. Injecting noise within the training process may involve Gaussian noise and Laplacian noise.
[0172] Gaussian noise injection comprises adding random noise sampled from a Gaussian distribution to the gradient (for each gradient descent step) or, during a stochastic gradient descent step, to the multiple gradients. This noise effectively masks individual data points, making it harder for adversaries to infer sensitive information. Gaussian noise injection is particularly suitable for continuous data or scenarios where the distribution of the data is approximately normal. However, Gaussian noise tends to spread uniformly across all dimensions, which may not be optimal for preserving privacy in certain cases, especially when dealing with high-dimensional data.
[0173] On the other hand, Laplacian noise injection involves adding noise sampled from a Laplace distribution to the gradient or gradients. Unlike Gaussian noise, Laplacian noise has heavier tails, which means it can better preserve the sparsity and structure of the data. Laplacian noise injection is well-suited for scenarios where the data is sparse or has outliers, as it provides stronger privacy guarantees while still allowing for accurate model training. Additionally, Laplacian noise injection can be more computationally efficient compared to Gaussian noise, especially in high-dimensional settings.
[0174] Choosing between Gaussian and Laplacian noise injection depends on various factors, including the distribution and characteristics of the data, the desired level of privacy, and computational considerations. In practice, a combination of both noise injection techniques may be used to achieve a balance between privacy and utility, leveraging the strengths of each method for different types of data and applications.
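As a minimal, non-limiting sketch, entrywise noise injection into a tensor network gradient may be written as follows. The preceding norm clipping reflects the optional constraint of the gradient norm below a predefined threshold; the function names and the `kind` switch are hypothetical, and the noise scale is taken as given.

```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale the gradient so that its Euclidean norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)  # 2-norm of the flattened tensor
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

def add_noise(grad, scale, rng, kind="gaussian"):
    """Add independent random noise to every entry of the gradient tensor."""
    if kind == "gaussian":
        noise = rng.normal(loc=0.0, scale=scale, size=grad.shape)
    elif kind == "laplace":
        noise = rng.laplace(loc=0.0, scale=scale, size=grad.shape)
    else:
        raise ValueError("kind must be 'gaussian' or 'laplace'")
    return grad + noise
```

In a training loop, `add_noise(clip_gradient(grad, max_norm), scale, rng)` would replace the raw gradient before the parameter update.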
[0175] For Gaussian noise, the standard deviation $\sigma$ used may be $\sigma = c \cdot \frac{q}{\varepsilon} \cdot \sqrt{t \log(1/\delta)}$. For Laplacian noise, the scaling parameter $\sigma$ may be determined as $\sigma = \Delta f / \varepsilon$. Here, $q$ denotes the sampling ratio per lot, where $q = L / N$. The ratio represents the proportion of data sampled for each step of the algorithm relative to the total dataset size. Further, $t$ represents the number of steps or iterations performed during the training process, and $\varepsilon$ and $\delta$ are the differential privacy parameters for achieving $\varepsilon$-differential privacy or $(\varepsilon, \delta)$-differential privacy. In particular, $\varepsilon$ controls the level of privacy protection, while $\delta$ represents the probability of failure in providing privacy. $\Delta f$ and $c$ represent the sensitivity of the model, indicating the maximum amount that the output of a function (e.g., loss or gradient) can change when a single data point is added or removed from the dataset.
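The determination of the noise parameters from the differential privacy parameters may be sketched as follows. The sketch assumes that the Gaussian standard deviation reads $c \cdot q \cdot \sqrt{t \log(1/\delta)} / \varepsilon$ and that the Laplacian scale is the sensitivity divided by $\varepsilon$; the function names are hypothetical.

```python
import math

def gaussian_sigma(c, q, epsilon, t, delta):
    """Standard deviation for Gaussian noise, assuming sigma = c * q * sqrt(t * log(1/delta)) / epsilon."""
    return c * q * math.sqrt(t * math.log(1.0 / delta)) / epsilon

def laplace_scale(sensitivity, epsilon):
    """Scale parameter for Laplacian noise, assuming scale = sensitivity / epsilon."""
    return sensitivity / epsilon

# Example: lot size L = 100 out of N = 10000 records gives q = 0.01.
sigma = gaussian_sigma(c=1.0, q=100 / 10000, epsilon=1.0, t=100, delta=1e-5)
scale = laplace_scale(sensitivity=1.0, epsilon=2.0)
```

Both parameters shrink as $\varepsilon$ grows, i.e., weaker privacy protection corresponds to less injected noise.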
[0178] Generative sampling
[0179] Subsequent to training the tensor network, synthetic data samples (synthetic data strings) can be generated independently.
[0180] Certain tensor networks such as MPS are useful in that the partition function / normalization factor can be exactly computed with complexity linear in the system size. In case of certain chain-like tensor networks such as MPS, sampling can be carried out bit by bit from one end of the tensor network to the other.
[0181] The sampling process may, for example, proceed sequentially from the right end of the MPS to the other, generating each variable based on the conditional probability given the already determined variables. In detail, a synthetic data string may be sampled as follows. The processing starts at one end of the synthetic data string, for example its last ($N$-th) component (or last bit in the binary case). The $N$-th bit is directly sampled via the marginal probability $P(v_N) = \sum_{v_1, v_2, \dots, v_{N-1}} P(v)$, which can be carried out straightforwardly if all tensors except $A^{(N)}$ have, e.g., been gauged to be left-canonical. This is because
[0183] $$P(v_N) = |x^{v_N}|^2 / Z,$$
[0185] wherein $x^{v_N}_{i_{N-1}} = A^{(N) v_N}_{i_{N-1}}$ and the normalization factor is $Z = \sum_{v_N \in \{0,1\}} |x^{v_N}|^2$. Given the value of the $N$-th bit, the $(N-1)$-th bit can be sampled. More generally, given the bit values $v_k, v_{k+1}, \dots, v_N$, the $(k-1)$-th bit $v_{k-1}$ is sampled according to the conditional probability
[0186] $$P(v_{k-1} \mid v_k, \dots, v_N) = \frac{P(v_{k-1}, v_k, \dots, v_N)}{P(v_k, v_{k+1}, \dots, v_N)}.$$
[0189] As a result of the canonical condition, the marginal probability $P(v_k, v_{k+1}, \dots, v_N)$ in the denominator can be expressed as:
[0192] $$P(v_k, v_{k+1}, \dots, v_N) = \frac{|x^{v_k, v_{k+1}, \dots, v_N}|^2}{Z},$$
[0194] wherein
[0196] $$x^{v_k, v_{k+1}, \dots, v_N}_{i_{k-1}} = \sum_{i_k, i_{k+1}, \dots, i_{N-1}} A^{(k) v_k}_{i_{k-1} i_k} A^{(k+1) v_{k+1}}_{i_k i_{k+1}} \cdots A^{(N) v_N}_{i_{N-1}}$$
[0198] has been established since the $k$-th bit is sampled. Schematically, the squared norm is the contraction of this vector with its complex conjugate over the index $i_{k-1}$ (tensor network diagram not reproduced).
[0200] Multiplying the matrix $A^{(k-1) v_{k-1}}$ onto $x^{v_k, v_{k+1}, \dots, v_N}$ from the left, and calculating the squared norm of the resulting vector,
[0204] $$x^{v_{k-1}, v_k, \dots, v_N}_{i_{k-2}} = \sum_{i_{k-1}} A^{(k-1) v_{k-1}}_{i_{k-2} i_{k-1}} x^{v_k, v_{k+1}, \dots, v_N}_{i_{k-1}},$$
[0205] yields the marginal probability
[0206] $$P(v_{k-1}, v_k, \dots, v_N) = \frac{|x^{v_{k-1}, v_k, \dots, v_N}|^2}{Z}$$
[0208] in the numerator of the conditional probability. Hence, the conditional probability can be determined as:
[0210] $$P(v_{k-1} \mid v_k, \dots, v_N) = \frac{|x^{v_{k-1}, v_k, \dots, v_N}|^2}{|x^{v_k, v_{k+1}, \dots, v_N}|^2}.$$
[0211]
[0212] The sampling may thus include the following steps:
[0213] • For the current variable $v_k$, compute the unnormalized probabilities for all possible values $v_k = 0, 1, \dots, d_k - 1$. The unnormalized probabilities are proportional to the squared norms of the resulting vectors after contracting the MPS tensors up to variable $v_k$. Normalize the unnormalized probabilities to ensure that they sum to 1.
[0214] • Determine a cumulative distribution function (CDF) based on the (normalized) probabilities.
[0215] • Generate a random number $r$ in the interval [0,1) and select $v_k$ such that $r$ falls within the corresponding interval in the CDF.
[0217] • Set the value of $v_k$ and update the vector for the next iteration by contracting with the selected tensor $A^{(k) v_k}$.
[0218] • Repeat the above for all the variables from k = N to k = 1.
[0219] This way, all bit values of the synthetic data string are successively drawn using the conditional probabilities given all the precedingly determined component values. The resulting synthetic data string corresponds to a sample strictly obeying the probability distribution of the MPS.
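The sampling steps above can be sketched, without limitation, as follows. The sketch assumes an open-boundary MPS with tensors of shape (left bond, physical, right bond) and bond dimension 1 at both ends; the helper `left_canonicalize`, which gauges all tensors except the last to be left-canonical by a QR sweep, is an illustrative choice and not prescribed by the disclosure.

```python
import numpy as np

def left_canonicalize(tensors):
    """QR sweep from left to right; every tensor but the last becomes left-canonical."""
    out = []
    carry = np.eye(tensors[0].shape[0])
    for A in tensors[:-1]:
        A = np.tensordot(carry, A, axes=(1, 0))          # absorb the previous R factor
        D_left, d, D_right = A.shape
        Q, R = np.linalg.qr(A.reshape(D_left * d, D_right))
        out.append(Q.reshape(D_left, d, Q.shape[1]))
        carry = R
    out.append(np.tensordot(carry, tensors[-1], axes=(1, 0)))
    return out

def sample_string(tensors, rng):
    """Draw one string, sampling components from k = N down to k = 1."""
    n = len(tensors)
    x = np.ones(1)                                       # right boundary vector
    sample = [0] * n
    for k in range(n - 1, -1, -1):
        # candidates[v] is the vector obtained by contracting A^{(k) v} with x.
        candidates = np.einsum('ldr,r->dl', tensors[k], x)
        weights = np.sum(np.abs(candidates) ** 2, axis=1)  # squared norms
        probs = weights / weights.sum()                  # conditional probabilities
        cdf = np.cumsum(probs)
        r = rng.random()                                 # uniform in [0, 1)
        v = min(int(np.searchsorted(cdf, r, side='right')), len(probs) - 1)
        sample[k] = v
        x = candidates[v]                                # vector for the next iteration
    return sample
```

A call such as `sample_string(left_canonicalize(mps), np.random.default_rng(0))` returns one synthetic data string as a list of integers; repeated calls yield independent samples.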
[0220] This sampling approach is not limited to generating samples from scratch in a sequential order. Inference tasks can also be carried out when part of the components are provided or known. In this case, a canonical form may not be necessary or useful, e.g., if there is a segment of unknown components between given / known components. Nevertheless, the marginal probabilities are still tractable because one can also contract ladder-shaped tensor networks efficiently.
[0221] Evaluation
[0222] The underlying dataset employed for evaluation / assessment is derived from the 1996 U.S. Census Bureau data, designated as the Adult Dataset. This dataset encompasses 48842 entries for six continuous and eight nominal attributes. Notably, approximately 7% of the entries contain missing values across various features. The following attributes are included in the Adult Dataset: Age: Continuous, from 17 to 90; Workclass: Categorical (8 categories); Fnlwgt (final weight): Continuous, from 12285 to 1490400; Education: Categorical (16 categories); Education-num: Continuous, from 1 to 16; Marital-status: Categorical (7 categories); Occupation: Categorical (14 categories); Relationship: Categorical (6 categories); Race: Categorical (5 categories); Sex: Categorical (2 categories); Capital-gain: Continuous, from 0 to 99999; Capital-loss: Continuous, from 0 to 4356; Hours-per-week: Continuous, from 1 to 99; Native-country: Categorical (41 categories); Class: Categorical (2 categories).
[0223] For assessing the synthetic data generated according to the proposed method, quality metrics and fidelity metrics may be employed. Fidelity metrics assess how well the synthetic data capture mathematical properties of the real data, while quality metrics gauge how well machine learning models perform on the real data when trained on the synthetic data.
[0226] Fidelity, in other words accuracy, precision or realism, is one goal for synthetic data. For evaluating the synthetic data, the Python library Synthetic Data Metrics (SDMetrics) is used. Various metrics are employed to measure the fidelity of the synthetic data, some of which focus on categorical data columns while others focus on continuous data columns.
[0227] For categorical features, the following metrics can be used. Category coverage measures the extent to which all possible categories or classes in a dataset are represented in the synthetic dataset. For instance, in a dataset with categorical variables like colors (e.g., red, green, blue), category coverage may assess whether the synthetic data include examples of all these colors. High category coverage may ensure that the synthetic data are comprehensive and representative of all categorical variations in the original data.
[0228] Further, the total variation distance quantifies the difference between the category frequencies of a column in the original (training) dataset and those of the same column in the synthetic data. In the context of synthetic data, it can thus be measured how closely the synthetic data reproduce the inherent variability of the original data across categories.
[0229] A chi-squared test can be applied to assess whether the distribution of categorical variables in the synthetic data matches that of the original dataset. This way, the replication of frequencies and relationships of categories may be tracked. Moreover, with contingency similarity, the similarity in a joint distribution of two or more categorical variables between the original dataset and the synthetic data may be evaluated, generally comprising analyzing contingency tables (cross-tabulations) that show the frequency distribution of variables. A high contingency similarity indicates that the synthetic data accurately preserves the relationships and interactions among multiple variables.
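For illustration only, three of the categorical metrics may be computed along the following lines. The sketch does not call SDMetrics; it assumes plain Python sequences as columns, takes the total variation score as one minus the total variation distance between category frequencies, and treats contingency similarity as the same score applied to pairs of values. All function names are hypothetical.

```python
from collections import Counter

def category_coverage(real, synthetic):
    """Fraction of the categories in the real column that also occur in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)

def tv_complement(real, synthetic):
    """1 minus the total variation distance between the two category frequency distributions."""
    f_real, f_syn = Counter(real), Counter(synthetic)
    categories = set(f_real) | set(f_syn)
    tvd = 0.5 * sum(
        abs(f_real[c] / len(real) - f_syn[c] / len(synthetic)) for c in categories
    )
    return 1.0 - tvd

def contingency_similarity(real_pairs, synthetic_pairs):
    """Same score on the joint distribution of two columns, given as (value_a, value_b) tuples."""
    return tv_complement(real_pairs, synthetic_pairs)
```

All three scores lie between 0 and 1, with 1 indicating that the synthetic column (or pair of columns) matches the real one with respect to the measured property.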
[0230] For continuous features, the following metrics may be used. Boundary adherence refers to how well the synthetic data respect minimum and maximum values observed in the original dataset. This way, it may be tracked that logical or actual bounds of the data are not exceeded, such as age being between 0 and a plausible maximum such as 120.
[0231] Further, range coverage measures the extent to which the synthetic data span the entire range of values found in the original dataset. Thus, it can be checked whether the synthetic data capture the entire spectrum of data from the lowest to the highest value, including all intermediate variations. Moreover, the Kolmogorov-Smirnov (KS) test is a non-parametric test used to compare two distributions and determine if they are significantly different. In synthetic data generation, the KS test can assess how closely the distribution of the synthetic data matches that of the original data.
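The continuous-feature metrics may be illustrated by the following sketch, which again does not call SDMetrics. The particular definition of range coverage (one minus the uncovered fraction at the lower and upper end) is an assumption made for this example, and the function names are hypothetical.

```python
import numpy as np

def boundary_adherence(real, synthetic):
    """Fraction of synthetic values inside [min(real), max(real)]."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    inside = (synthetic >= real.min()) & (synthetic <= real.max())
    return float(inside.mean())

def range_coverage(real, synthetic):
    """1 minus the share of the real value range that the synthetic data fail to reach."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    span = real.max() - real.min()
    missed_low = max((synthetic.min() - real.min()) / span, 0.0)
    missed_high = max((real.max() - synthetic.max()) / span, 0.0)
    return float(max(1.0 - (missed_low + missed_high), 0.0))

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

A KS statistic of 0 means the two empirical distributions coincide, so a similarity score between 0 and 1 can be obtained as one minus the statistic.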
[0234] In addition to fidelity metrics, the quality of synthetic data may be further evaluated by training various classification machine learning models on this data and assessing their performance using a test set derived from real data. This process enables a comparative analysis of model performance between those trained on synthetic data versus those trained on real data.
[0235] For the plots according to Figs. 3 to 6 discussed below, the binary income feature is predicted, utilizing the F1 score as the primary evaluation metric. In case of an imbalanced dataset, where performance metrics like accuracy can be misleading, the F1 score may also be employed as a quality metric. The F1 score is defined as
[0236] $$F_1 = \frac{2 \cdot (\mathrm{Precision} \cdot \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}},$$
[0240] with $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, true positive TP, false positive FP, and false negative FN.
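As a short worked example of the above definition, the F1 score may be computed from the confusion counts as follows; returning 0 when there are no true positives is a convention assumed here to avoid a division by zero.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * Precision * Recall / (Precision + Recall) from confusion counts."""
    if tp == 0:
        return 0.0  # assumed convention: no true positives gives a score of 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 6 true positives, 2 false positives, 4 false negatives:
# Precision = 6/8 = 0.75, Recall = 6/10 = 0.6, F1 = 0.9 / 1.35 = 2/3.
example = f1_score(tp=6, fp=2, fn=4)
```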
[0241] Different classification algorithms are employed for evaluating the synthetic data, including Random Forest (RF - cf., e.g., Gilles Louppe, arXiv:1407.7502v3), k-nearest neighbors (KNN - cf. Padraig Cunningham et al., arXiv:2004.04523v2), Gradient Boosting (GB - cf. Hung-Hsuan Chen, arXiv:2410.05623v2), and support vector machines (SVM - cf. Fabrice Rossi et al., arXiv:0705.0209v1). For evaluation, the original dataset may be partitioned into (first) training subsets (the training dataset), validation subsets, and test subsets, preferably with proportions of 60%, 20%, and 20%, respectively. Concurrently, the synthetic data may be divided into second training and validation sets in an 80-20% split. The classifiers are first trained using the training and validation sets and subsequently evaluated on the test set derived from the real data. The outcomes from this evaluation serve as a benchmark for assessing the classifiers trained on synthetic data. Following the benchmarking, the same four classifiers are trained using the second training and validation sets. Their performance is then tested using the real data (first) test set. This allows for a direct comparison of performance metrics between models trained on synthetic versus real data, providing insights into the efficacy of the synthetic data in replicating real-world data scenarios and maintaining model accuracy.
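The partitioning described above may, as a non-limiting sketch, be realized by shuffling row indices and cutting them according to the stated proportions. The helper name `split_indices` and the example row counts are hypothetical; training and scoring of the classifiers are not shown.

```python
import numpy as np

def split_indices(n_rows, fractions, rng):
    """Shuffle row indices and cut them into consecutive parts of the given fractions."""
    order = rng.permutation(n_rows)
    cuts = np.cumsum([int(round(f * n_rows)) for f in fractions[:-1]])
    return np.split(order, cuts)

rng = np.random.default_rng(0)
# Real data: 60% training, 20% validation, 20% test.
real_train, real_val, real_test = split_indices(1000, (0.6, 0.2, 0.2), rng)
# Synthetic data: 80% training, 20% validation; testing always uses real_test.
syn_train, syn_val = split_indices(1000, (0.8, 0.2), rng)
```

Each classifier would then be fitted once on the real training and validation rows and once on the synthetic ones, with both variants scored on the same real test rows.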
[0244] Fig. 3 shows a plot illustrating the performance of the method for generating synthetic data in comparison with other methods. In this example, an MPS tensor network (TNN) has been employed. Further synthetic data have been generated using CTGAN (conditional tabular generative adversarial networks), VAE (variational autoencoder), and TVAE (tabular VAE).
[0245] The different columns of the figure correspond to the employed classification algorithms, namely Random Forest (RF), k-nearest neighbors (KNN), Gradient Boosting (GB), and support vector machines (SVM). For reference, no noise has been injected in this case. To determine the fidelity values (“Fid”), nine different metrics were employed, namely Boundary Adherence, Range Coverage, Category Coverage, Kolmogorov-Smirnov statistic, Total Variation Distance, Contingency Similarity, Pearson Correlation, Chi-squared Test, and Cramér’s V. The values range from 0 to 1, comparing the synthetic data with the real data. The fidelity in [%] corresponds to the overall score of the metrics (real data have 1). The metrics are used to measure statistical characteristics of the data form, e.g., dispersion.
[0246] Fig. 4 shows a plot illustrating the performance of the method using Gaussian noise injection for different classification algorithms. Different levels of privacy protection are represented by different values 1, 2, 5, and 10 for the differential privacy parameter $\varepsilon$. Higher values for $\varepsilon$ correspond to lower noise injection. Hence, better performance with higher values for $\varepsilon$ is expected. Fig. 5 shows a corresponding plot for Laplacian noise injection.
[0247] In Fig. 6, the method for generating synthetic data including noise injection (Gaussian and Laplacian noise) is compared with synthetic data generation using PrivBayes (Private Data Release via Bayesian Networks). The differential privacy parameter $\varepsilon$ has been fixed to 20 for each case.
[0249] Hardware setup
[0250] Fig. 7 shows a data processing system comprising means for carrying out the method. The data processing system may comprise a data processing device 70 (computer) or a plurality of data processing devices (not shown). The data processing device 70 has a processor 71 and a storage medium 72 (computer-readable medium). The processor may be, for example, a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA) and / or an application specific integrated circuit (ASIC), or may comprise one or more of these elements. The storage medium 72 may comprise a non-volatile memory and / or a volatile memory. The storage medium 72 may include instructions that may be executed by the processor 71 to provide the functionality described herein. The data processing system and / or the data processing device 70 is adapted to execute the method for generating synthetic data according to one of the embodiments described herein. In one example, a computer program is stored on the storage medium 72 for this purpose.
[0251] The features disclosed in this description, the drawings, and / or the claims may be material for the realization of various embodiments, taken in isolation or in various combinations thereof.
[0253] Reference signs
[0254] 20, 21, 22 method steps
[0255] 70 data processing device
[0256] 71 processor
[0257] 72 storage medium
Claims
1. A computer-implemented method for generating synthetic data, comprising:
- providing a tensor network including a plurality of tensors and a training dataset including at least one training data string;
- training the tensor network with respect to the training dataset by gradient descent, comprising the following steps:
  - determining a tensor network gradient from the tensor network, the tensor network gradient being evaluated using the training dataset;
  - applying noise to the tensor network gradient; and
  - adjusting the tensor network based on the tensor network gradient;
- generating, from the tensor network, synthetic data including a synthetic data string, wherein:
  - each component to be sampled of the synthetic data string is generated according to a sample probability which is a marginal probability for the component or a conditional probability conditioned on at least one value of a further component of the synthetic data string and
  - at least one of determining the marginal probability and determining the conditional probability comprises separating, for the component, a corresponding partial tensor network from the tensor network and determining a squared norm of the corresponding partial tensor network.
2. The method according to claim 1, wherein the tensor network is a matrix product state tensor network, a tensor train, a matrix product operator tensor network, a tree tensor network, or a multi-scale entanglement renormalization ansatz tensor network.
3. The method according to claim 1 or 2, wherein the training dataset comprises a binary-valued training data string and / or an integer-valued training data string.
4. The method according to any of the preceding claims, further comprising preprocessing the training dataset, including at least one of:
- discretizing a continuous training data string to a discretized training data string,
- mapping the discretized training data string to a binary-valued or an integer-valued training data string, and
- mapping a categorical training data string to a binary-valued or an integer-valued training data string.
5. The method according to any of the preceding claims, wherein the at least one training data string is indicative of at least one of physiological data, medical data, ethnicity, social data, personal movement data, and address data.
6. The method according to any of the preceding claims, wherein the tensor network gradient is determined from a negative log-likelihood function of a norm of the tensor network.
7. The method according to any of the preceding claims, wherein the tensor network is trained by stochastic gradient descent, preferably comprising:
- determining mini-batch gradients from the tensor network, the mini-batch gradients being evaluated using multiple samples from the training dataset, and determining the tensor network gradient as an average of the mini-batch gradients.
8. The method according to any of the preceding claims, wherein applying noise to the tensor network gradient comprises entrywise adding random noise to the tensor network gradient.
9. The method according to claim 8, wherein the random noise includes at least one of Gaussian noise and Laplacian noise.
10. The method according to claim 8 or 9, wherein a noise parameter of the random noise is determined from a differential privacy parameter.
11. The method according to any of the preceding claims, wherein the training of the tensor network further comprises constraining a tensor network gradient norm below a predefined threshold.
12. The method according to any of the preceding claims, wherein the synthetic data string comprises at least one predetermined component.
13. A data processing system comprising means for carrying out the method according to any of the preceding claims.
14. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to any of claims 1 to 12.
15. A computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to any of claims 1 to 12.