Non-uniform regularization in artificial neural networks for adaptive scaling
By introducing a stochastic bottleneck architecture and a non-uniform dropout technique into the autoencoder, the invention addresses the inflexibility of existing technologies in adjusting the size and latent dimensionality of neural networks, achieving efficient and seamless dimensionality reduction and lower computational complexity across different downstream applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MITSUBISHI ELECTRIC CORP
- Filing Date
- 2021-02-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to flexibly adjust the latent dimension to meet the distortion requirements of different downstream applications without modifying the trained autoencoder model. This leads to multiple tedious retraining sessions, and the challenge of determining the size of the neural network remains unresolved.
A stochastic bottleneck architecture and a non-uniform dropout technique achieve flexible regularization and adaptive dimensionality reduction by applying a non-uniform dropout rate across the depth and width directions in the intermediate layers of the neural network. A random number generator randomly modifies the output signals of the neuron nodes, and adaptive truncation reduces computational complexity in the downstream testing phase.
The latent dimension can be adjusted flexibly without retraining the neural network, reducing computational complexity while maintaining high-quality data reconstruction and adapting to the distortion requirements of different downstream applications.
Figure CN115485697B_ABST
Abstract
Description
Technical Field
[0001] This invention generally relates to artificial neural network systems and methods for designing neural networks, and more specifically, to non-uniform dropout in neural networks for achieving flexible regularization and adaptive dimensionality reduction.
Background Technology
[0002] Feature extraction and dimensionality reduction are crucial before data analysis and communication. In many real-world applications, raw data measurements (e.g., audio / speech, images, videos, and biosignals) often have very high dimensionality. Properly handling high dimensionality often requires applying dimensionality reduction techniques that transform the raw data into meaningful, reduced-dimensional feature representations. These representations should reduce the dimensionality to the minimum required to capture the salient properties of the data. Dimensionality reduction is important in many machine learning and artificial intelligence applications because of the need to mitigate the so-called curse of dimensionality (problematic phenomena in data analysis that grow rapidly worse, often exponentially, as the dimensionality increases). To date, numerous dimensionality reduction algorithms have been developed, such as Principal Component Analysis (PCA), Kernel PCA, Isomap, Maximum Variance Unfolding, Diffusion Maps, Locally Linear Embedding, Laplacian Eigenmaps, Local Tangent Space Analysis, Sammon Mapping, Locally Linear Coordination, and Manifold Charting. Over the past few decades, latent representation learning based on artificial neural networks (ANNs) called autoencoders (AEs) has been widely used for dimensionality reduction because this non-linear technique has shown superior real-world performance compared to classical linear techniques such as PCA.
[0003] One of the challenges in dimensionality reduction is determining the optimal latent dimension that adequately captures the data features required for a specific application. While some regularization techniques, such as sparse AEs (SAEs) and rate-distortion AEs, can help tune the effective dimension, there is currently no adaptive approach that allows seamless tuning of the latent dimension to accommodate the different distortion requirements of different downstream applications without modifying the trained AE model. Some existing work imposes conditional AE training and stepwise stacking on hierarchical architectures. However, these earlier methods require multiple tedious retraining sessions. Therefore, unlike linear PCA, which provides hierarchically ordered latent variables, existing AEs typically suffer from the drawback that all latent variables are equally important.
[0004] Similar to determining the size of the latent variables, determining the size of an artificial neural network is challenging because overly shallow and narrow networks do not work well, while unnecessarily deep and wide networks require unrealistically large amounts of training data to function. Dropout can be used to effectively regularize overcomplete networks to prevent overfitting. Using a stochastic depth approach with higher dropout in deeper layers can be an effective way to let overly deep neural networks self-organize their effective depth. However, determining the network width still requires trial and error by the designer.
[0005] Therefore, it is necessary to develop a neural network system for achieving flexible regularization and adaptive dimensionality reduction, and a method for designing neural networks.
Summary of the Invention
[0006] According to some embodiments of the present invention, a system for flexible regularization and adaptive scaling of an artificial neural network is provided. The system includes: a memory storing the artificial neural network and training data; a processor and an interface for submitting signals and training data to the neural network, which comprises a series of layers, each layer comprising a set of neuron nodes, wherein pairs of neuron nodes from neighboring layers are interconnected with multiple trainable parameters to pass signals from the previous layer to the next; a random number generator for regularization by randomly modifying the output signals of each neuron node according to a multidimensional distribution across the layer depth direction and the node width direction of the neural network, wherein at least one layer has a non-identical profile across its neuron nodes; a training operator for updating the neural network parameters using the training data such that the output of the neural network provides better values in multiple objective functions; and an adaptive truncation operator for pruning the outputs of neuron nodes in each layer to obtain a compressed-size neural network, reducing computational complexity at runtime in the downstream testing phase for any new incoming data.
[0007] This invention provides a method for designing rateless autoencoders (AEs) capable of flexible dimensionality reduction. The method is based on the understanding that traditional PCA possesses the rateless property, which refers to the ability to adjust to any dimension by simply adding or discarding ordered principal components. The method and system in this invention employ a stochastic bottleneck architecture that uses non-uniform dropout to create ordered latent variables. Specifically, within each layer, non-uniform regularization is used to train an overcomplete artificial neural network so that the upper hidden nodes are prioritized for learning the most dominant features in these intermediate layers.
[0008] Unlike traditional AEs with deterministic bottlenecks in the intermediate layers, some implementations of the proposed architecture use probabilistically pruned bottlenecks to achieve adaptive dimensionality reduction. This allows end-users to freely adjust computational complexity. The invention also provides implementations that achieve this rateless property through a specific dropout mechanism called tail dropout, which discards consecutive neurons at the tail of the latent space according to a specific probability distribution. Some implementations also describe architectures that integrate linear PCA into nonlinear AEs to provide better performance. This invention enables end-users to flexibly change the dimensionality while achieving excellent distortion performance across the entire dimensionality range.
[0009] Some implementations of non-uniform regularization use a monotonically increasing dropout rate across hidden nodes in intermediate hidden layers, which effectively downsizes overparameterized neural networks. Another implementation uses a multidimensional dropout rate profile with a non-uniform dropout rate across both the depth and width directions to effectively reduce overparameterized depth and width without deterministically specifying these hyperparameters. This method and system allow for flexible tuning of neural network depth and width parameters without requiring retraining for specific sizes.
[0010] Some implementations apply dropout simultaneously to consecutive neuron nodes at a specific dropout rate. Some implementations use a regularization technique called tail dropout, in which all consecutive neuron nodes from a randomly chosen node to the last node are dropped together. Another implementation drops neuron nodes simultaneously in multiple dimensions, such as two-dimensional (2D) bottom dropout across the depth and width directions.
[0011] Some implementations use dropout distributions optimized across the depth and width or channel directions in the sense of multi-objective optimization. The distribution profile can be parameterized with a few hyperparameters that specify the 2D dropout rate, for example using exponential, Lorentzian, multinomial, sigmoid, power, geometric, Poisson, or Wigner distributions. This allows for minimal distortion when the user prunes neuron nodes at any intermediate layer, regardless of the number of pruned nodes. Because the prepared neural network can be downsized in this way, its computational complexity can be reduced for any downstream use case.
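As a rough illustration of a parameterized 2D dropout rate profile, the following NumPy sketch builds a rate table that grows monotonically along both the depth and width directions. The exponential form, the function name `dropout_rate_profile_2d`, and the hyperparameters `depth_decay`, `width_decay`, and `max_rate` are assumptions made for this example; the patent text does not specify them.

```python
import numpy as np

def dropout_rate_profile_2d(num_layers, width, depth_decay=1.0, width_decay=2.0, max_rate=0.9):
    """Return p[layer, node]: a dropout rate that grows with depth and width.

    Hypothetical exponential profile (an assumption, not the patent's formula):
        p = max_rate * (1 - exp(-(depth_decay * d + width_decay * w)))
    where d and w are layer and node positions normalized to [0, 1].
    """
    d = np.arange(num_layers) / max(num_layers - 1, 1)
    w = np.arange(width) / max(width - 1, 1)
    exponent = depth_decay * d[:, None] + width_decay * w[None, :]
    return max_rate * (1.0 - np.exp(-exponent))

# Example: 3 intermediate layers, 5 nodes each.
profile = dropout_rate_profile_2d(num_layers=3, width=5)
print(np.round(profile, 3))
```

In this sketch the first node of the first layer is never dropped, and every other node is dropped more often the deeper and the further toward the tail it sits. Other profiles named in the text could be swapped in by changing the formula.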
[0012] Some implementations use the variational principle with random sampling in intermediate layers to enable users to utilize generative models. This approach is compatible with fully connected layers, convolutional layers, pooling / unpooling layers, skip connections, loop feedback, recurrent feedback, Inception modules, semi-supervised conditioning, etc. Another implementation uses random noise injection with non-uniform variance across width and depth as an alternative to dropout regularization.
[0013] Some implementations use mean squared error (MSE) as the loss function to be minimized for a stochastic bottleneck neural network. For a more perceptual loss function, structural similarity (SSIM) can be used instead. The objective function may also include combinations of cross-entropy, negative log-likelihood, absolute error, cross-covariance, clustering loss, KL divergence, hinge loss, Huber loss, negative sampling, and triplet loss. Data-centric perceptual loss can be measured with a generative model learned through adversarial training. For classification tasks, a cross-entropy loss function is used. Multi-task optimization using multiple loss functions is also applied. In some implementations, complementary dropout of neurons toward two different branches is employed to achieve nondeterministic soft disentanglement. Another implementation uses multiple different dropout rate profiles for common neuron nodes, and the outputs of surviving neurons are fed into multiple branches of the neural network, for example, using a monotonically increasing profile for the first branch, a monotonically decreasing profile for the second branch, and a sinusoidal profile for the last branch, to achieve specific priorities among latent variables in different domains.
[0014] The currently disclosed embodiments will be further described with reference to the accompanying drawings. The drawings shown are not necessarily to scale, but rather focus on illustrating the principles of the currently disclosed embodiments.
Attached Figure Description
[0015] Figure 1A shows a conventional AE architecture, which cascades two deterministic neural networks, an encoder and a decoder, with a bottleneck architecture (i.e., a smaller number of neurons in the intermediate layers).
[0016] Figure 1B shows the sparse AE architecture of the prior art.
[0017] Figure 1C shows the concept of stochastic width according to an embodiment of the invention, wherein the dropout rate is not constant; for example, it gradually increases across the network width.
[0018] Figure 1D is an example flowchart illustrating the steps of a flexible dimensionality reduction method according to an embodiment of the present invention.
[0019] Figure 2A shows a conventional (prior art) approach that increases the dropout rate across layer depth for self-adjustment of the network depth.
[0020] Figure 2B shows an example of stochastic width regularization (independent) using independent, non-identically distributed dropout according to an embodiment of the present invention.
[0021] Figure 2C shows an embodiment of stochastic width regularization (tail drop), in which tail dropping realizes a non-uniform dropout rate.
[0022] Figure 2D shows an example of a test dropout distribution according to an embodiment of the present invention.
[0023] Figure 3A shows a combined nonlinear AE method for flexible dimensionality reduction according to an embodiment of the present invention.
[0024] Figure 3B shows a variant of Figure 3A according to an embodiment of the present invention.
[0025] Figure 4 shows an embodiment of the system according to the present invention.
[0026] Figure 5A shows an example of images reconstructed using a deterministic sparse AE method (prior art) for downstream dimensionality reduction.
[0027] Figure 5B shows an example of images reconstructed using a stochastic bottleneck AE method for downstream dimensionality reduction according to an embodiment of the present invention.
[0028] Figure 6 shows an embodiment of complementary dropout for soft-disentangled latent representations under different objective functions, according to an embodiment of the present invention.
Detailed Implementation
[0029] Figure 1A shows a conventional AE (prior art) architecture 10, which employs two deterministic neural networks, an encoder 20 and a decoder 40, with a bottleneck architecture. The encoder receives raw data, such as a digital video signal, at an input layer 21 and generates dimensionality-reduced latent variables 30 through a hidden layer 22. The decoder feeds the latent variables 30 through a hidden layer 41 to reproduce the data at an output layer 42. The width of the network narrows between the encoder and decoder, forming a bottleneck; more specifically, the number of neurons in the intermediate latent layers is smaller than in the input and output layers. This forces the network to learn to transform the data into a low-dimensional latent space (represented by the variables 30 at the bottleneck) and then reconstruct the data from the low-dimensional representation.
[0030] AEs have shown great potential in learning the low-dimensional latent variables that describe the underlying nonlinear manifold of a dataset. An AE is an artificial neural network with the bottleneck architecture shown in Figure 1A, which transforms N-dimensional data into an M-dimensional latent representation (M≤N) via an encoder network; that is, the number of nodes in the input and output layers is N, and the number of nodes in the intermediate layer is M. The latent variables should contain sufficient features to reconstruct the original data through the decoder network.
[0031] AEs are often used in unsupervised learning applications where the data lacks specific labels for analysis, but the user wants to learn the underlying representations. Once the encoder and decoder networks are learned, the decoder network can also be used to synthetically generate virtual data with a distribution close to that of real-world data. To generate random synthetic data, the latent nodes often use the variational principle, where the latent variables indicate parameter values (e.g., mean and variance for a normal distribution) that specify the distribution of the random number generator.
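The role of latent nodes that output distribution parameters can be sketched as follows. This is a minimal NumPy illustration that assumes a normal distribution parameterized by a mean and a log-variance; the function name `sample_latent` and the log-variance parameterization are choices made for this example, not details taken from the patent.

```python
import numpy as np

def sample_latent(mean, log_var, rng):
    """Draw z ~ Normal(mean, exp(log_var)) from parameters emitted by latent nodes.

    The log-variance parameterization is an assumption of this sketch; it keeps
    the variance positive without constraining the network output.
    """
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mean.shape)  # random number generator output
    return mean + std * eps

rng = np.random.default_rng(0)
mean = np.array([0.0, 1.0, -2.0])               # hypothetical encoder outputs
log_var = np.log(np.array([1.0, 0.25, 4.0]))    # variances 1, 0.25, 4

# Many draws, to check that the samples follow the requested distribution.
samples = np.stack([sample_latent(mean, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

Feeding such samples to a trained decoder is what lets the decoder act as a generator of synthetic data.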
[0032] From the original data $x \in \mathbb{R}^{N}$, the encoder network generates the corresponding latent representation $z \in \mathbb{R}^{M}$ with reduced dimension M as $z = f_{\theta}(x)$, where θ represents the encoder network parameters, i.e., the weights, biases, and any other learned variables in the encoder network. The latent variables should adequately capture the statistical geometry of the data manifold so that the decoder network can reconstruct the data as $x' = g_{\phi}(z)$, where φ represents the decoder network parameters and $x' \in \mathbb{R}^{N}$. The encoder and decoder pair $(f_{\theta}, g_{\phi})$ is jointly trained to minimize the reconstruction loss (i.e., distortion), as given by the following equation:
[0033] $$\min_{\theta,\phi} \; \mathbb{E}_{x}\left[ L\big(x, g_{\phi}(f_{\theta}(x))\big) \right]$$
[0034] The loss function L(x,x′) is chosen to quantify the distortion between x and x′ (e.g., MSE or SSIM). The neural network can be updated using methods such as stochastic gradient descent, adaptive momentum (Adam), AdaGrad, AdaBound, Nesterov accelerated gradient, or root mean square propagation (RMSProp).
[0035] AE is also called nonlinear PCA (NLPCA) for the following reason. If we consider the simplified case where there is no nonlinear activation in the AE model, the encoder and decoder functions reduce to simple affine transformations. Specifically, the encoder becomes $f_{\theta}(x) = Wx + b$, where the trainable parameters θ are the linear weight $W \in \mathbb{R}^{M \times N}$ and the bias $b \in \mathbb{R}^{M}$. Similarly, the decoder becomes $g_{\phi}(z) = W'z + b'$, where $\phi = \{W', b'\}$ with $W' \in \mathbb{R}^{N \times M}$ and $b' \in \mathbb{R}^{N}$. If the distortion metric is MSE and the data follows a multivariate Gaussian distribution, the optimal linear AE coincides with classical PCA according to the Karhunen-Loève theorem.
[0036] For example, suppose we have Gaussian data $x \sim \mathrm{Normal}(m, C)$ with mean $m \in \mathbb{R}^{N}$ and covariance $C \in \mathbb{R}^{N \times N}$. The covariance has the eigendecomposition $C = \Phi \Lambda \Phi^{T}$, where $\Phi \in \mathbb{R}^{N \times N}$ is a unitary eigenvector matrix and $\Lambda = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_N] \in \mathbb{R}^{N \times N}$ is a diagonal matrix of the ordered eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N \ge 0$. For PCA, the encoder uses the M principal eigenvectors $\Phi I_{N,M}$ to project the data onto an M-dimensional latent subspace, with $W = I_{M,N} \Phi^{T}$ and $b = -Wm$, where $I_{M,N} \in \mathbb{R}^{M \times N}$ denotes an incomplete identity matrix whose diagonal elements are equal to 1 and whose other elements are zero. The decoder uses the transposed projection, with $W' = \Phi I_{N,M}$ and $b' = m$. The MSE distortion is given by the following formula.
[0037] $$L_{\mathrm{MSE}} = \mathbb{E}\left[\lVert x - x' \rVert^{2}\right] = \mathbb{E}\left[\lVert (I_{N} - W'W)(x - m) \rVert^{2}\right] = \sum_{n=M+1}^{N} \lambda_{n}$$
[0038] Because the eigenvalues are sorted, the distortion degrades gracefully as principal components are removed in the corresponding order, starting from the smallest. Of course, if an incorrect order is used (e.g., reversed), the MSE will be worse.
[0039] One of the advantages of classical PCA is its graceful rateless property, which stems from the ordered principal components. Similar to rateless channel coding (e.g., fountain codes), PCA does not require a predetermined compression ratio M / N for dimensionality reduction (it can instead be computed with the full dimension M = N), and the latent dimension can be freely adjusted later according to the downstream application. More specifically, a PCA encoder and decoder learned for dimension M can be universally applied to any lower-dimensional PCA with a latent size L ≤ M without any modification to the PCA model, simply by discarding the D smallest principal components (D = M − L) of $z = [z_1, z_2, \ldots, z_M]^{T}$; that is, the tail variables are set to zero, $z_m = 0$, for all m ∈ {L+1,...,M}.
[0040] The rateless property is highly beneficial in practical applications because the optimal latent dimension is often unknown beforehand. Instead of training multiple encoder-decoder pairs for different compression rates, a common PCA model can cover all rates L / N (1≤L≤M) by simply discarding tail components while still achieving optimal distortion. For example, a data server can publish a large-scale high-dimensional dataset along with a PCA model trained with a reduced dimension M for a specific application. However, for various other applications (e.g., different analyses), even further reduced dimensions can still meet and / or improve the learning performance of the final task. Even for end users who require fewer latent variables in various applications, the optimal rate-distortion tradeoff (under the Gaussian data assumption) is achieved by simply discarding the least-principal components without updating the PCA model.
[0041] However, for real-world datasets, traditional PCA performs poorly compared to nonlinear dimensionality reduction techniques. Using nonlinear activation functions (e.g., the rectified linear unit (ReLU) or sigmoid), AEs can better learn the inherent nonlinearity of the underlying latent representations of the data. However, existing AEs do not easily achieve the rateless property because the learned latent variables are generally equally important. Therefore, multiple AEs need to be trained and deployed for different target dimensions. This drawback still applies to the progressive dimensionality reduction methods employed by stacked and hierarchical AEs, which require multiple training iterations and readjustment for different dimensions. This invention provides an efficient method for achieving rateless AEs adaptable to any compression rate using a stochastic bottleneck.
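The rateless use of PCA described above can be checked numerically. The following NumPy sketch fits PCA once at full dimension (M = N) on synthetic data and then truncates the latent vector to several sizes L without refitting. The synthetic data, the variable names, and the helper functions `encode`, `decode`, `truncate`, and `distortion` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8  # data dimension; PCA is fitted with full latent size M = N

# Synthetic correlated data (an assumption for illustration only).
scales = np.linspace(3.0, 0.5, N)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
X = (rng.standard_normal((5000, N)) * scales) @ Q.T + 1.5

# Fit PCA once: mean m, eigenvectors Phi, eigenvalues sorted in descending order.
m = X.mean(axis=0)
C = np.cov(X, rowvar=False, bias=True)
eigvals, Phi = np.linalg.eigh(C)          # eigh returns ascending order
eigvals, Phi = eigvals[::-1], Phi[:, ::-1]

def encode(x):
    return (x - m) @ Phi                  # z = W x + b with W = Phi^T, b = -W m

def decode(z):
    return z @ Phi.T + m                  # x' = W' z + b' with W' = Phi, b' = m

def truncate(z, L):
    z = z.copy()
    z[:, L:] = 0.0                        # set tail variables z_m = 0 for m > L
    return z

def distortion(L):
    X_hat = decode(truncate(Z, L))
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

Z = encode(X)
for L in (8, 4, 2):
    # The empirical MSE matches the sum of the discarded (tail) eigenvalues.
    print(L, distortion(L), eigvals[L:].sum())
```

The same fitted model serves every latent size; only the number of zeroed tail variables changes.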
[0042] Figure 1B shows another prior art technique that uses random pruning with an AE architecture called Sparse AE (SAE) 50, where instead of a deterministic network, a random number generator 436 is used to randomize the encoder and decoder using dropout, with some nodes being randomly set to zero during network computation. Unlike conventional AEs with deterministic bottleneck architectures, SAE employs a probabilistic bottleneck with an effective dimensionality that is randomly reduced through dropout. For example, the SAE encoder generates an M-dimensional variable z whose elements are randomly dropped with a common probability p, resulting in an effective latent dimension L = (1-p)M. Although the number of nodes in each layer remains fixed (without applying a deterministic bottleneck), the effective width of the network gradually and randomly narrows by increasing the dropout rate 90 for layers 70 closer to the central latent variable layer. The main benefit of dropout is preventing overfitting in overcomplete neural networks. Computational complexity is also reduced once the latent variables are randomly dropped. While SAE adapts better than deterministic AEs to further dimensionality reduction by dropping latent variables, the latent variables are still trained to be equally important for data reconstruction, which limits its flexibility in achieving ratelessness.
[0043] Several existing AE variants, including the conventional AE in Figure 1A, the SAE in Figure 1B, the variational AE (VAE), the rate-distortion AE, and the compressive AE, do not inherently achieve ratelessness (the ability to flexibly further reduce the dimensionality of the latent representation) because all latent variables are essentially equally important for data reconstruction. Reconstruction performance deteriorates rapidly when some components of the latent representation are discarded.
[0044] Therefore, in our invention, as shown in Figure 1C, the system introduces the concept of a stochastic bottleneck, where the dropout rate gradually increases across the width direction of the network 140, in addition to across the depth direction 141. This is specifically applied to the latent variable layer 120, with the effect of prioritizing the latent variables with the lowest dropout rates. These components are analogous to the most-principal components in linear dimensionality reduction by PCA. The components with the highest dropout rates are the least-principal nonlinear latent variables. This allows users to employ a learned AE model as a flexible dimensionality reduction method: they can apply the encoder to the data to generate latent representations and then flexibly drop components, starting from the least-principal latent variables. Data reconstruction performance then degrades gracefully while a single learned AE model is used for adaptive dimensionality reduction, achieving distortion comparable to traditional AE models fine-tuned to a specific desired dimension.
[0045] The method and system of this invention provide a new family of AEs that do not require determining the size of the bottleneck architecture to achieve rateless properties for seamless dimensionality reduction. This method can be viewed as an extended version of SAEs, similar in its overcomplete architecture, but employing different dropout distributions across network width, depth, or channels. This aspect of our method is key to achieving near-optimal distortion while allowing for flexible compression rates for dimensionality reduction.
[0046] Figure 1D shows an example flowchart illustrating the steps of a flexible dimensionality reduction method according to an embodiment of the present invention. First, the method uses an autoencoder architecture suitable for the data being processed, with a latent variable size equal to the maximum dimension 151. Next, the method trains the autoencoder while applying a non-uniform dropout rate across the width and depth directions in the intermediate layers, including at least the latent representation layer 152. Then, the method applies the trained encoder to any newly incoming data sample 153 to generate a low-dimensional latent representation. For downstream applications, adaptive truncation 435 for each user in the system can adaptively truncate the latent variables to the desired dimension for further compression as required by each application 154. This reduces computational complexity. Finally, the trained decoder is applied to the compressed latent variables to reproduce the original data without introducing too much distortion 155.
[0047] Some implementations use the variational principle with random distributions in intermediate layers to enable users to utilize generative models. The method of this invention is compatible with fully connected layers, convolutional layers, skip connections, loop feedback, recurrent feedback, Inception modules, and semi-supervised conditioning. Another implementation uses random noise injection with non-uniform variance across width and depth as an alternative to dropout regularization.
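One way to see why adaptive truncation reduces computational complexity is that a decoder layer fed with L kept latent variables only needs the matching L columns of its weight matrix. The NumPy sketch below compares this sliced computation with the equivalent zero-padded one. The names `adaptive_truncate` and `decode_first_layer`, the layer sizes, and the random weights standing in for a trained decoder are all assumptions of this example.

```python
import numpy as np

def adaptive_truncate(z, L):
    """Keep the first L (head) latent variables and discard the tail."""
    return z[..., :L]

def decode_first_layer(z_head, W, b):
    """First decoder layer using only the weight columns of the kept latents."""
    L = z_head.shape[-1]
    return z_head @ W[:, :L].T + b       # costs H*L multiplies instead of H*M

rng = np.random.default_rng(0)
M, H = 64, 128                           # latent size and hidden width (hypothetical)
W = rng.standard_normal((H, M))          # stand-in for trained decoder weights
b = rng.standard_normal(H)
z = rng.standard_normal((10, M))         # stand-in for encoder outputs

L = 16                                   # dimension requested by a downstream user
z_head = adaptive_truncate(z, L)
h_small = decode_first_layer(z_head, W, b)

# Reference: zero-pad the tail and run the full-size layer.
z_padded = np.zeros_like(z)
z_padded[:, :L] = z_head
h_full = z_padded @ W.T + b

print(np.allclose(h_small, h_full))
```

Because the two results agree, the truncated latent vector can be stored, transmitted, and decoded at the reduced size without touching the trained parameters.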
[0048] Stochastic bottleneck implementation
[0049] The method of this invention employs a stochastic bottleneck, which imposes a non-uniform dropout rate distribution that varies across the network's width, depth, and channel directions, as shown in Figure 1C. In some implementations, a probabilistic bottleneck is achieved using a stochastic width method, which employs a monotonically increasing dropout rate from the head (upper) latent variable neurons to the tail (lower) nodes, in order to force the latent variables to be ordered by importance in a manner similar to PCA. By concentrating more important features in the head nodes, this method is able to achieve adequate data reconstruction even if some of the least important dimensions (analogous to the least-principal components) are later discarded by other users in the downstream system.
[0050] Dropout techniques have been widely used for regularization of overparameterized deep neural networks. The purpose of dropout is to improve generalization performance by preventing activations from becoming strongly correlated, which could lead to overfitting. In standard dropout implementations, network activations are dropped (by setting the activation of that neuron node to zero) with independent probability p during training (and, for some implementations, testing). Recent theories offer a feasible interpretation of dropout as an approximation of Bayesian inference.
[0051] In some embodiments, the methods and systems of the present invention employ other related regularization methods in addition to regular dropout; for example, DropConnect, DropBlock, stochastic depth, DropPath, ShakeDrop, spatial dropout, Zoneout, shake-shake regularization, and data-driven drop. To accommodate the rateless nature of stochastic bottleneck AE architectures, another embodiment introduces an additional regularization mechanism called TailDrop as an implementation of stochastic width.
[0052] Figures 2A to 2D further illustrate the concept of stochastic width and some specific implementations, including tail dropout. The stochastic bottleneck uses non-uniform dropout to adjust the importance of individual neurons, as explained for Figure 1C. This regularization technique is an extended version of the stochastic depth used in deep residual networks. As shown in Figure 2A, the existing stochastic depth technique drops entire layers with a higher probability 206 for deeper layers, thus limiting the effective network depth, while shallower layers dominate training. Similar to, but different from, stochastic depth in the depth direction, the non-uniform dropout shown in Figure 2B is performed with a monotonically increasing rate 212 across the width direction 211 (stochastic width), wherein individual neurons within the same intermediate layer are dropped independently at increasing rates. In some embodiments, as shown in Figure 2C, a monotonically increasing dropout rate can also be achieved by dropping consecutive nodes at the tail 223, which we call tail dropout. As shown in Figure 2D, tail dropout can realize a desired dropout rate by adjusting the probability distribution of the tail drop length, for example using Poisson, Laplace, exponential, sigmoid, Lorentzian, multinomial, or Wigner distribution profiles. Under a model-based approach with a nonlinear eigenspectrum assumption, some implementations use the power cumulative distribution function $\Pr(D < \tau M) = \tau^{\beta}$, of order... (τ represents the compression ratio.)
[0053] Figure 5A illustrates image data reconstructed from a handwritten digit dataset using a conventional AE. If the AE is trained with a deterministic bottleneck architecture, the conventional AE degrades image quality when the user discards latent variables to achieve a low-dimensional representation ranging from 64 variables down to 4. Figure 5B demonstrates the graceful performance of the stochastic bottleneck of the present invention; even when downstream users reduce the dimensionality, the high-quality reconstructed images from the stochastic AE are still retained without retraining.
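A minimal sketch of stochastic width with independent dropout, assuming a linearly increasing rate from the head node to the tail node. The linear profile, the function names, and the values of `p_min` and `p_max` are illustrative assumptions; the text only requires that the rate increase monotonically. The sketch also omits the rescaling of surviving activations that many dropout implementations apply.

```python
import numpy as np

def increasing_dropout_rates(M, p_min=0.0, p_max=0.9):
    """Hypothetical linear profile: head node rate p_min, tail node rate p_max."""
    return np.linspace(p_min, p_max, M)

def stochastic_width_mask(rates, batch_size, rng):
    """Independent Bernoulli keep-mask; node m is dropped with probability rates[m]."""
    u = rng.random((batch_size, rates.size))   # uniform in [0, 1)
    return (u >= rates).astype(float)          # keep when u >= rate

rng = np.random.default_rng(0)
M = 16
rates = increasing_dropout_rates(M)
mask = stochastic_width_mask(rates, batch_size=20000, rng=rng)

z = rng.standard_normal((20000, M))            # stand-in for latent activations
z_dropped = z * mask                           # dropped nodes are set to zero

empirical_rates = 1.0 - mask.mean(axis=0)
print(np.round(empirical_rates, 2))
```

During training, head nodes survive in almost every pass and therefore carry the features the decoder can rely on, while tail nodes are present only occasionally.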
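Tail dropout with a power-law tail length can be sketched with inverse-transform sampling: if U is uniform on [0, 1), then τ = U^(1/β) satisfies Pr(τ < t) = t^β, which matches the power cumulative distribution function above at the grid points t = k/M after rounding down. The value β = 2.0 and the function names are arbitrary choices for this illustration, not values taken from the patent.

```python
import numpy as np

def sample_tail_drop_length(M, beta, size, rng):
    """Sample the number D of dropped tail nodes so that Pr(D < k) = (k / M) ** beta."""
    tau = rng.random(size) ** (1.0 / beta)   # inverse-transform sampling
    return np.floor(tau * M).astype(int)     # D in {0, ..., M - 1}

def tail_drop_mask(D, M):
    """Keep the first M - D nodes and drop the last D consecutive nodes."""
    idx = np.arange(M)
    return (idx[None, :] < (M - D)[:, None]).astype(float)

rng = np.random.default_rng(0)
M, beta = 16, 2.0                            # beta is an arbitrary illustrative value
D = sample_tail_drop_length(M, beta, size=50000, rng=rng)
mask = tail_drop_mask(D, M)

# Resulting per-node dropout rate: node m is dropped when D >= M - m.
empirical_rates = 1.0 - mask.mean(axis=0)
expected_rates = 1.0 - ((M - np.arange(M)) / M) ** beta
print(np.round(empirical_rates, 3))
```

Each mask row is a run of ones followed by a run of zeros, so the per-node dropout rate increases monotonically toward the tail, and the head node is never dropped in this sketch.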
[0054] Model embedding stochastic bottleneck
[0055] Figure 3A describes a method that combines the nonlinear AE method for flexible dimensionality reduction described above with another model (e.g., the linear flexible dimensionality reduction known as PCA). The “Nonlinear Reduction (Encoder)” module 316 corresponds to the encoder of the AE as described above, the “Nonlinear Reconstruction (Decoder)” module 320 corresponds to the decoder of the AE as described above, and “NL-E Latent” 317 refers to the latent variables output by the encoder.
[0056] “PCA Reduction” 312 and “PCA Reconstruction” 314 are standard projection and data reconstruction transformations learned through the standard PCA method. “PCA Latent” 313 is a latent variable vector generated through the PCA projection transformation, and “PCA Output” 315 is a data reconstruction generated through the PCA data reconstruction transformation. Along the top path of the figure, data samples are processed through standard PCA reduction projection and reconstruction transformations; however, the intermediate “PCA Latent” 313 and the final “PCA Output” 315 are integrated into the bottom path, which uses the randomized AE to process the data.
[0057] In the bottom path, data samples are processed by “Nonlinear Reduction (Encoder)” 316 to obtain “NL-E Latent” 317. However, instead of being fed directly into “Nonlinear Reconstruction (Decoder)” 320, the “NL-E Latent” is combined with the “PCA Latent” via “Latent Combination Operation” 318 (e.g., element-wise addition, product, or concatenation) to obtain “Combined Latent” 319, which is then fed into “Nonlinear Reconstruction (Decoder)” 320. The “Nonlinear Reconstruction (Decoder)” is also (optionally) modified to take “PCA Output” 315 as input, and generates “NL-D Output” 321 (typically corresponding to the reconstruction of the data). However, in our process, the “NL-D Output” is then combined with the “PCA Output” via “Output Combination Operation” 322 (e.g., element-wise addition) to obtain the final data reconstruction 323.
[0058] Figure 3B depicts another variation of the implementation shown in Figure 3A. Instead of combining the “NL-D Output” 321 with the “PCA Output” 315, the output of the “Nonlinear Reconstruction (Decoder)” 360 is used directly as the final data reconstruction 361.
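The data flow of Figures 3A and 3B can be sketched as below. The four transforms are passed in as plain callables. Element-wise addition is used for both combination operations, which is only one of the options the text lists, and the decoder is assumed to take the PCA output as a second argument in both variants. The function names and the `combine_output` flag are hypothetical.

```python
def add(a, b):
    """Element-wise addition of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def hybrid_reconstruct(x, pca_reduce, pca_reconstruct, nl_encode, nl_decode,
                       combine_output=True):
    """PCA-assisted autoencoder pass; the numerals refer to Figure 3A."""
    pca_latent = pca_reduce(x)                   # "PCA Latent" 313
    pca_output = pca_reconstruct(pca_latent)     # "PCA Output" 315
    nl_latent = nl_encode(x)                     # "NL-E Latent" 317
    combined = add(nl_latent, pca_latent)        # combination 318 gives 319
    nl_output = nl_decode(combined, pca_output)  # "NL-D Output" 321
    if combine_output:
        return add(nl_output, pca_output)        # Figure 3A: combination 322 gives 323
    return nl_output                             # Figure 3B: decoder output used directly


# Toy stand-ins for trained transforms, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
toy = dict(
    pca_reduce=lambda v: v[:2],
    pca_reconstruct=lambda z: z + [0.0, 0.0],
    nl_encode=lambda v: [0.5, 0.5],
    nl_decode=lambda z, side: [0.0, 0.0, z[0], z[1]],
)
out_3a = hybrid_reconstruct(x, **toy)
out_3b = hybrid_reconstruct(x, combine_output=False, **toy)
```

In the Figure 3A variant the nonlinear path only has to model what the PCA reconstruction misses, because the PCA output is added back at the end.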
[0059] The embodiments of the present invention described above can be implemented in any of a variety of ways. For example, the embodiments can be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can execute on any suitable processor or set of processors (whether located in a single computer or distributed across multiple computers). These processors can be implemented as integrated circuits, with one or more processors in an integrated circuit assembly. However, the processors can be implemented using circuits of any suitable format.
[0060] Figure 4 shows a block diagram of an apparatus 400 for controlling a system comprising multiple signal sources causing multiple events, according to some embodiments. An example of the system is a manufacturing production line. Apparatus 400 includes a processor 420 configured to execute stored instructions and a memory 440 storing instructions executable by the processor. Processor 420 may be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. Memory 440 may include random access memory (RAM), read-only memory (ROM), flash memory, or any other suitable memory system. Memory 440 is configured to load computer-executable instructions (programs) stored in storage device 430 into apparatus 400, and processor 420 executes the computer-executable instructions. Storage device 430 includes computer-executable instructions, including neural network 431, linear PCA 432, trainer / training data 433, operational data 434, adaptive truncation 435, and random number generator 436. Processor 420 is connected to one or more input and output devices via bus 406.
[0061] These instructions implement methods for detecting and / or diagnosing anomalies in multiple events of the system. Apparatus 400 is configured to detect object anomalies using the neural network 431. This neural network is referred to herein as a structurally connected neural network. The neural network 431 is trained to diagnose the control state of the system. For example, the neural network 431 can be trained offline by a trainer (training operator) 433 using training data, and then used online to diagnose anomalies from the system's operational data 434.
[0062] Examples of operational data include signals collected from signal sources during system operation, such as system events. Examples of training data include signals collected from signal sources over a period of time. This period can be a time interval before operation / production begins and / or during system operation.
[0063] Multi-task and adversarial learning with adaptive scaling
[0064] The above implementations focus on AE architectures for unsupervised learning to reduce dimensionality when the dataset is redundant and unlabeled. To this end, randomized AEs are trained to minimize distortion metrics, including (but not limited to) mean squared error (MSE) or structural similarity (SSIM). Some implementations use adversarial training to minimize more perceptual distortion, making the decoder output difficult to distinguish from the original data.
[0065] Another implementation uses multiple objective functions to train a randomized AE given conditional labels and perturbation variables. This method of the present invention stochastically disentangles latent variables; for example, surviving head neurons are fed into one decoder network to maximize SSIM, while the complementary tail neurons are fed into another decoder network to minimize MSE with respect to the perturbation variables. Figure 6 shows an example of this implementation, in which the encoder generates latent variables 610 that are randomly and non-uniformly dropped. The surviving latent variables 650 are fed into one neural network, the adversary classifier 620, while the remaining dropped latent variables 660 are fed into another neural network, the interfering classifier 630. This non-uniform complementary dropout method, known as Swap Out, gives users more interpretable latent variables and flexible interpretability, allowing the tradeoff between distortion and transferability to be adjusted through soft disentanglement. The softly disentangled latent variables 610 are later pruned by the user using the adaptive truncation 435 and used, with high transferability, in other neural networks 640 for different tasks during the testing phase. In some implementations, multiple different dropout profiles with anisotropic functions are used for specific loss functions to disentangle the neuron nodes of the intermediate layers.
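The complementary split behind Swap Out can be sketched as below. The cut position is drawn with the same power-law tail drop assumed earlier for illustration, and each part is zero-padded to the full latent size so both downstream networks see fixed-size inputs. The padding, the cut distribution, and the name `swap_out` are assumptions, not details given in the source.

```python
import random


def swap_out(latent, beta, rng):
    """Split one latent vector into complementary head and tail parts.

    The surviving head would go to one network (620 in Figure 6) and the
    dropped tail to the other (630). Assumption: each part is zero-padded
    to the original length.
    """
    width = len(latent)
    drop = int(width * rng.random() ** (1.0 / beta))  # 0 <= drop <= width - 1
    keep = width - drop
    head = list(latent[:keep]) + [0.0] * drop
    tail = [0.0] * keep + list(latent[keep:])
    return head, tail


rng = random.Random(0)
latent = [0.1 * (i + 1) for i in range(8)]
survived, dropped = swap_out(latent, 1.0, rng)
```

The two parts never overlap and together recover the full latent vector, so each latent variable contributes to exactly one of the two objectives in any given training step.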
[0066] Some embodiments of the present invention use random widths in more general neural network applications such as image classification and data regression. Specifically, no decoder block or bottleneck is required. For conventional feedforward multilayer perceptron architectures, random widths are used in each layer so that the user can adaptively change the network size after training. This solves the problem of current neural network designs requiring predetermined network sizes (i.e., neuron size (width), layer size (depth), and channels). The non-uniform dropout rate of each layer in the depth and width directions allows for adaptive scaling of the network size without knowing the optimal network size. The system can consider very deep and wide networks during the training phase, and then the user in the system can adaptively scale down the network architecture later in the testing phase for classification or regression applications.
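Test-time scaling of a layer trained with random width might be sketched as follows for a two-layer perceptron. The sketch assumes that training has ordered the hidden neurons by importance, so keeping the first `keep` of them amounts to slicing the weight matrices. The helper names are hypothetical, and weights are stored as lists of rows.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def dense(weights, bias, x):
    """Affine layer; `weights` holds one row per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(w1, b1, w2, b2, x):
    return dense(w2, b2, relu(dense(w1, b1, x)))


def shrink_hidden(w1, b1, w2, keep):
    """Keep the first `keep` hidden neurons.

    Removes the trailing rows of w1 and b1 and the matching columns of w2.
    """
    return w1[:keep], b1[:keep], [row[:keep] for row in w2]


def multiply_count(w1, w2):
    """Number of multiplications in one forward pass."""
    return sum(len(r) for r in w1) + sum(len(r) for r in w2)


# Toy network: 3 inputs, 4 hidden neurons, 2 outputs.
w1 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.1], [0.05, 0.2, -0.4]]
b1 = [0.0, 0.1, -0.1, 0.2]
w2 = [[0.3, -0.2, 0.1, 0.4], [-0.1, 0.2, 0.3, -0.3]]
b2 = [0.0, 0.05]

s1, sb1, s2 = shrink_hidden(w1, b1, w2, keep=2)
small_out = forward(s1, sb1, s2, b2, [1.0, 2.0, 3.0])
```

In this toy case the multiplication count falls from 20 to 10, and the shrunken network returns the same output as the full network with its last two hidden activations set to zero.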
[0067] Another implementation applies tail drop simultaneously in adjacent layers, which is called side drop. The profile used to determine the boundaries of dropped neurons across layers is designed as a 2D or 3D continuous function profile (e.g., a polynomial function).
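One possible reading of side drop is sketched below: a single random draw, shared by all layers, scales a polynomial profile that sets the surviving fraction of each layer. The source states only that the boundary follows a continuous 2D or 3D profile such as a polynomial. The shared draw `u`, the normalised depth `t`, and the function name are assumptions made for illustration.

```python
def side_drop_masks(widths, coeffs, u):
    """Tail-drop masks for several layers with one shared boundary profile.

    widths: number of neurons in each layer.
    coeffs: polynomial coefficients, lowest order first, giving the kept
            fraction as a function of normalised depth t in [0, 1].
    u:      shared random draw in [0, 1); larger values drop more nodes.
    """
    last = max(len(widths) - 1, 1)
    masks = []
    for layer, width in enumerate(widths):
        t = layer / last
        frac = sum(c * t ** p for p, c in enumerate(coeffs))
        frac = min(1.0, max(0.0, frac * (1.0 - u)))
        keep = max(1, int(frac * width))  # always keep at least one node
        masks.append([1] * keep + [0] * (width - keep))
    return masks


# Kept fraction falls linearly from 1.0 in the first layer to 0.5 in the last.
masks = side_drop_masks([8, 8, 8], [1.0, -0.5], u=0.0)
kept = [sum(m) for m in masks]
```

Because all layers share one draw, the boundary between surviving and dropped nodes moves smoothly from layer to layer instead of being sampled independently per layer.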
[0068] Furthermore, embodiments of the present invention can be specifically implemented as a method, examples of which have been provided. Actions performed as part of this method can be ordered in any suitable manner. Therefore, embodiments can be constructed that perform actions in a different order than those shown, which may include performing some actions simultaneously, although they are shown as sequential actions in the illustrative embodiments.
[0069] Furthermore, this invention provides a novel method and system for implementing rateless autoencoders, which enable flexible latent dimensions that can be seamlessly adjusted for different distortion and dimensionality requirements. Instead of a deterministic bottleneck architecture, the proposed invention uses an overcomplete representation with stochastic regularization by non-identical dropout. Unlike the prior art, the neural network employs a multi-dimensional non-uniform dropout rate across network width, channels, and depth, so that neurons become ordered by importance. The method with a random bottleneck framework enables seamless rate adaptation with high reconstruction performance without optimizing a predetermined latent dimension during training. In some implementations, the non-uniform regularization method is applied to data classification or regression with multiple different objective functions for multi-task and adversarial learning. This method allows for adaptive scaling of the size of general artificial neural networks; i.e., depth and width are self-adjusting during training, and downstream users can seamlessly scale down non-uniformly regularized trained networks to reduce computational complexity during testing.
[0070] The use of ordinal numbers such as “first” and “second” to modify claim elements in claims does not imply any priority or order of one claim element over another, or the temporal order of the execution of method actions. Rather, it serves only as a label to distinguish one claim element with a specific name from another element with the same name (but using ordinal numbers), thus differentiating claim elements.
[0071] Although the invention has been described by way of example of preferred embodiments, it will be understood that various other adjustments and modifications may be made within the spirit and scope of the invention. Therefore, the appended claims are intended to cover all such variations and modifications that fall within the true spirit and scope of the invention.
Claims
1. A system for flexible regularization and adaptive scaling of artificial neural networks, the system comprising:
an interface configured to receive and submit signals;
a memory configured to store an artificial neural network and training data, a linear PCA, a training operator, an adaptive truncation, and a random number generator; and
a processor, connected to the interface and the memory, configured to submit the signal and the training data to the artificial neural network comprising a series of layers, wherein each layer comprises a set of neurons, wherein a pair of neurons from neighboring layers is interconnected with a plurality of trainable parameters to pass the signal from the previous layer to the next layer, wherein the processor is configured to execute:
the random number generator, configured to modify the output signal of each neuron node in a random manner to perform regularization, following a multidimensional distribution across the layer depth direction and node width direction of the artificial neural network, wherein at least one layer has a non-identical dropout rate profile across its neuron nodes;
the training operator, configured to update the parameters of the artificial neural network using the training data, such that the output of the artificial neural network provides better values across multiple objective functions; and
the adaptive truncation, configured to prune the outputs of the neuron nodes in each layer of the compressed artificial neural network to reduce computational complexity during downstream testing phases for any new incoming data;
wherein the artificial neural network comprises multiple cascaded neural network blocks forming at least an encoder network and a decoder network, wherein a random bottleneck with a small number of neurons in at least one intermediate layer represents adaptive low-dimensional latent variables with a non-identical dropout rate, thereby allowing rateless feature extraction of the encoder network and flexible data reconstruction of the decoder network.
2. The system according to claim 1, wherein node outputs are randomly discarded according to a monotonically increasing dropout rate profile by simultaneously and randomly truncating consecutive nodes in the lower tail segment, while consecutive nodes in the upper head segment survive to be used for training the parameters of the artificial neural network.
3. The system according to claim 2, wherein the discarded nodes and the surviving nodes are fed complementarily into separate artificial neural networks to seamlessly disentangle the extracted features, so that the upper and lower nodes have different importance in two objective functions for multi-task and adversarial optimization, thereby realizing a transferable latent representation.
4. The system according to claim 1, wherein the encoder network and the decoder network integrate linear projection feature extraction with a linear principal component analysis (PCA) encoder and decoder to allow model-assisted adaptive dimensionality reduction.
5. The system according to claim 1, wherein the network depth and width are adaptively scaled based on random depth and width, wherein, during training, deeper and wider layers are dropped with a higher probability, enabling downstream systems to adjust the size of the artificial neural network without retraining.
6. The system according to claim 1, wherein a combination of multiple parameter functions based on polynomial, exponential, power, Poisson, Wigner, and Laplace functions is used with specific weights to specify a multidimensional regularization profile across network depth and width.
7. The system according to claim 1, wherein the random number generator employs a combination of dropout, swap out, zone out, block out, drop connect, noise injection, side drop, tail drop, and jitter.
8. The system according to claim 1, wherein the artificial neural network employs a combination of convolutional layers, recurrent feedback, loop connections, skip connections, inception, and activation.
9. The system according to claim 1, wherein the objective function employs a combination of mean squared error, cross-entropy, structural similarity, negative log-likelihood, absolute error, cross-covariance, clustering loss, divergence, hinge loss, Huber loss, negative sampling, and triplet loss.
10. The system according to claim 1, wherein the training operator uses a combination of stochastic gradient descent, adaptive momentum, AdaGrad, AdaBound, Nesterov accelerated gradient, and root mean square propagation to optimize the trainable parameters of the artificial neural network.
11. The system according to claim 1, wherein variational random sampling is used to construct a generative model.
Citation Information
Patent Citations
Stochastic categorical autoencoder network (US20190095798A1)
Dynamic adaptation of deep neural networks (US20200134461A1)