Text-to-haptic signal controllable generation method and system based on perceptual decoupling
Using a perceptual decoupling method, a text description is separated into sensory, emotional, and associative dimensions. Tactile signals are then generated with a native tactile encoder and a conditional diffusion generation module. This addresses two shortcomings of existing technologies, namely the neglected difference in frequency range between tactile and audio signals and the lack of explicit multidimensional modeling of tactile perception, and it improves the delicacy and accuracy of the generated tactile signals.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIANGTAN UNIV
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-16
AI Technical Summary
Existing tactile signal generation technologies ignore the difference in frequency range between tactile and audio signals and lack explicit methods for modeling the multiple dimensions of tactile perception, which makes it difficult to produce satisfactory synthesis results for complex textures.
We employ a perceptual decoupling approach, which separates text descriptions into sensory, emotional, and associative dimensions through a hybrid expert text encoder. We then utilize a native tactile encoder and a conditional diffusion generation module to generate tactile signals, achieving cross-modal semantic alignment and multi-dimensional decoupling. Finally, we combine multi-task training and loss functions to optimize model performance.
It enables independent and controllable adjustment of sensation, emotion, and association, improving the delicacy and accuracy of generated signals and adapting to the generation of complex tactile textures.
Figure CN122219778A_ABST
Description
Technical Field
[0001] This application relates to the field of embodied intelligence, and in particular to a method and system for controllable generation of text-to-tactile signals based on perceptual decoupling.
Background Technology
[0002] With the advancement of virtual reality and augmented reality technologies, haptic feedback has become a crucial element in building immersive interactive systems. The synergistic effect of bimodal feedback, combining vision and touch, can significantly enhance the user's sense of presence in a virtual environment. Precision applications such as remote surgery and robotic teleoperation rely heavily on this, as the mechanical information transmitted through the haptic channel directly affects the precision and safety of the operation. However, the maturity of haptic signal generation technology lags significantly behind that of vision and speech technologies, remaining in a relatively early stage of development.
[0003] Current tactile generation methods can be broadly categorized into three types. Physical model methods establish mathematical models based on physical principles; they require extremely precise physical parameters, and their generalization ability becomes inadequate when the application scenario changes. Parametric methods combine preset waveform templates, such as sine waves and pulse sequences, into tactile effects. While concise, this approach has limited expressive power and struggles to deliver satisfactory synthesis results for complex textures. Data-driven methods use deep learning to learn the mapping between text and touch from large-scale data. They are more flexible than the previous two types but face two pressing issues. First, most works directly borrow audio generation frameworks to process tactile signals, ignoring the fundamental difference in frequency range between the two types of signals. The effective frequency range of tactile signals is concentrated between 10 and 500 Hz, while audio signals span 20 Hz to 20 kHz, so the upper limits differ by more than an order of magnitude. Encoders optimized for wideband audio often extract features with insufficient accuracy in the key frequency bands when processing narrowband tactile signals. Second, existing methods lack explicit modeling techniques for the multidimensionality of tactile perception.
[0004] Therefore, a new method and system for controllable generation of text-to-tactile signals based on perceptual decoupling is needed.
Summary of the Invention
[0005] According to a first aspect of the present invention, a method for controllable generation of text-to-tactile signals based on perceptual decoupling is provided. The method includes: in response to receiving descriptive text of a tactile signal to be generated, inputting the descriptive text into a language model of a hybrid expert text encoder to generate a basic semantic representation vector, wherein the descriptive text includes sensory dimension text, emotional dimension text, and associative dimension text; inputting the basic semantic representation vector into a sensory expert module, an emotional expert module, and an associative expert module of the hybrid expert text encoder to obtain a sensory dimension expert feature vector, an emotional dimension expert feature vector, and an associative dimension expert feature vector, respectively; generating a global text feature vector based on the multi-dimensional expert feature vectors, wherein the multi-dimensional expert feature vectors include the sensory dimension expert feature vector, the emotional dimension expert feature vector, and the associative dimension expert feature vector; generating a global text conditional prediction term based on the global text feature vector and initial noise; generating a dimensional expert fusion prediction term based on the multi-dimensional expert feature vectors and the initial noise; and generating a tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and an unconditional baseline prediction term, wherein the unconditional baseline prediction term is generated based on the initial noise.
[0006] Optionally, in the method according to the present invention, the hybrid expert text encoder further includes a routing allocation module, wherein generating a global text feature vector based on multi-dimensional expert feature vectors includes: inputting the basic semantic representation vector into the routing allocation module to obtain a three-dimensional weight vector; and generating a global text feature vector based on the sensory dimension expert feature vector, the emotional dimension expert feature vector, the associative dimension expert feature vector, and the three-dimensional weight vector.
[0007] Optionally, in the method according to the present invention, generating a global text conditional prediction term based on the global text feature vector and the initial noise includes: inputting the global text feature vector and the initial noise into a conditional diffusion generation module to obtain the global text conditional prediction term.
[0008] Optionally, in the method according to the present invention, the step of generating dimensional expert fusion prediction terms based on the multi-dimensional expert feature vectors and the initial noise includes: generating sensory dimension prediction terms, emotional dimension prediction terms, and associative dimension prediction terms based on the sensory dimension expert feature vectors, emotional dimension expert feature vectors, and associative dimension expert feature vectors and the initial noise; and determining dimensional expert fusion prediction terms based on the sensory dimension prediction terms, emotional dimension prediction terms, associative dimension prediction terms, and network weights output by the conditional diffusion generation module.
[0009] Optionally, in the method according to the present invention, generating a tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and the unconditional baseline prediction term includes: generating a global text guidance term based on the global text conditional prediction term and the unconditional baseline prediction term; generating a dimensional expert guidance term based on the dimensional expert fusion prediction term and the global text conditional prediction term; and generating a tactile signal based on the unconditional baseline prediction term, the global text guidance term, the dimensional expert guidance term, the global guidance weight of the global text guidance term, and the expert guidance weight of the dimensional expert guidance term, wherein the global guidance weight controls the degree to which the tactile signal fits the overall semantics of the descriptive text, and the expert guidance weight controls the expression strength of the tactile signal on the multi-dimensional expert feature vector.
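The combination of the three prediction terms can be illustrated with a minimal sketch. It assumes the usual classifier-free guidance form, in which each guidance term is the difference between two predictions; the text only states which terms each guidance term is based on, so the subtraction, the function name `hierarchical_guidance`, and the example weights are illustrative assumptions.

```python
def hierarchical_guidance(eps_uncond, eps_global, eps_expert, w_global, w_expert):
    """Combine three noise predictions element-wise (assumed difference form)."""
    combined = []
    for e_u, e_g, e_x in zip(eps_uncond, eps_global, eps_expert):
        global_term = e_g - e_u  # global text guidance term
        expert_term = e_x - e_g  # dimensional expert guidance term
        combined.append(e_u + w_global * global_term + w_expert * expert_term)
    return combined


# Toy predictions for a three-sample signal (hypothetical values).
eps_u = [0.0, 0.1, -0.2]  # unconditional baseline prediction term
eps_g = [0.2, 0.1, 0.0]   # global text conditional prediction term
eps_x = [0.3, 0.0, 0.1]   # dimensional expert fusion prediction term
guided = hierarchical_guidance(eps_u, eps_g, eps_x, w_global=3.0, w_expert=1.5)
print(guided)
```

Under this assumed form, setting both weights to zero returns the unconditional baseline, and setting the global guidance weight to 1 with the expert guidance weight at 0 returns the global text conditional prediction unchanged.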
[0010] Optionally, the method according to the present invention further includes: determining a clipping boundary of the tactile signal based on the tactile signal; and setting the signal amplitude of the tactile signal based on the clipping boundary.
[0011] According to a second aspect of the present invention, a controllable text-to-tactile signal generation system based on perceptual decoupling is provided. The system includes: a hybrid expert text encoder comprising: a language model for generating a basic semantic representation vector based on a received descriptive text of a tactile signal to be generated, the descriptive text including sensory dimension text, emotional dimension text, and associative dimension text; a sensory expert module, an emotional expert module, and an associative expert module for processing the basic semantic representation vector and outputting sensory dimension expert feature vectors, emotional dimension expert feature vectors, and associative dimension expert feature vectors, respectively; the hybrid expert text encoder is further configured to generate a global text feature vector based on the multi-dimensional expert feature vectors, the multi-dimensional expert feature vectors including sensory dimension expert feature vectors, emotional dimension expert feature vectors, and associative dimension expert feature vectors; and a conditional diffusion generation module for generating a global text conditional prediction term based on the global text feature vectors and initial noise, generating a dimensional expert fusion prediction term based on the multi-dimensional expert feature vectors and the initial noise, and generating a tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and an unconditional baseline prediction term, the unconditional baseline prediction term being generated based on the initial noise.
[0012] According to a third aspect of the present invention, a computing device is provided, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method described above.
[0013] According to a fourth aspect of the present invention, a computer-readable storage medium is provided having a computer program or instructions stored thereon, which, when executed by a processor, implement the steps of the above-described method.
[0014] According to a fifth aspect of the present invention, a computer program product is provided, comprising a computer program or instructions that, when executed by a processor, implement the steps of the above-described method.
[0015] This application enables independent and controllable adjustment of sensation, emotion, and association, thereby improving the subtlety of the generated signal.
Description of the Drawings
[0016] To achieve the foregoing and related objectives, certain illustrative aspects are described herein in conjunction with the following description and accompanying drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The foregoing and other objectives, features, and advantages of this disclosure will become more apparent from the following detailed description, taken in conjunction with the accompanying drawings. Throughout this disclosure, the same reference numerals generally refer to the same parts or elements.
[0017] Figure 1 shows a schematic diagram of a method for controllable generation of text-to-tactile signals based on perceptual decoupling according to an embodiment of the present invention.
Detailed Implementation
[0018] Exemplary embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. The same reference numerals generally refer to the same parts or elements.
[0019] According to one implementation, a controllable text-to-tactile signal generation system based on perceptual decoupling (hereinafter referred to as the system) includes a dimensional decoupling and alignment module and a conditional diffusion generation module.
[0020] The dimensional decoupling and alignment module is used to achieve cross-modal semantic alignment and multi-dimensional decoupling. Cross-modal semantic alignment establishes an effective mapping relationship between the discrete hierarchical semantic structure of text and the continuous temporal features of tactile waveforms. Multi-dimensional decoupling separates the three intertwined dimensions of sensation, emotion, and association in human tactile perception. The dimensional decoupling and alignment module contains a native tactile encoder and a hybrid expert text encoder. The two achieve cross-modal alignment through a contrastive learning framework, while decoupling between dimensions is achieved through orthogonal constraints.
[0021] Traditional solutions often employ pre-trained audio encoders (such as Audio Spectrogram Transformer, AST) to process tactile signals. However, touch and audio are fundamentally different at the physical level. Tactile signals have a frequency range of approximately 10 to 500 Hz. The peak sensitivity of the FA2 type receptor (Pacinian corpuscle) in human skin is approximately 250 Hz, while the FA1 type receptor (Meissner corpuscle) primarily responds to low-frequency vibrations below 50 Hz. Audio encoders are typically optimized for a wide frequency range of 20 Hz to 20 kHz. Directly applying them to tactile signals introduces domain bias, especially resulting in insufficient feature extraction accuracy in the critical frequency band of 10 to 500 Hz.
[0022] Based on this, this application designs a native tactile encoder that directly processes the original tactile waveform signal, avoiding the information loss caused by time-frequency (spectrogram) conversion. The encoder consists of three parts: a multi-scale convolutional tokenizer, a Transformer encoder, and a perceptual band pooling layer.
[0023] Different frequency components in tactile signals carry different levels of perceptual information. A single-scale convolutional kernel cannot effectively capture both the high-frequency transients and the low-frequency envelope features of a tactile signal, and the multi-scale convolutional tokenizer addresses this problem. It uses four parallel one-dimensional convolutions with kernel sizes of 3, 5, 7, and 11 and a uniform stride of 4. The small kernel (size 3) has the narrowest receptive field and is adept at capturing high-frequency transient details, such as the short-duration peaks in rapid taps or pulses. The medium kernels (sizes 5 and 7) cover a moderate temporal span and are suitable for extracting mid-frequency texture features, such as the periodic vibration patterns generated by friction on material surfaces. The large kernel (size 11) has the widest receptive field and can perceive the slow fluctuations of the low-frequency envelope, such as an overall contour in which the amplitude gradually rises and falls over time. The four convolutional branches process the input waveform in parallel, and each outputs a feature sequence at its own scale. These sequences are concatenated along the channel dimension and mapped to a unified hidden dimension (which can be set to 256 in this embodiment) through a linear projection layer. The resulting multi-scale temporal feature sequence is the input of the subsequent Transformer encoder.
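The tokenizer described above can be sketched in plain Python. The kernel sizes, stride, and 256-dimensional projection come from the description; the symmetric zero padding (used so that all four branches produce sequences of equal length), the number of channels per branch, and the random weights standing in for learned parameters are assumptions made only for illustration.

```python
import random

KERNEL_SIZES = (3, 5, 7, 11)  # from the description
STRIDE = 4                    # from the description


def conv1d(signal, kernel, stride):
    """Strided 1-D convolution; symmetric zero padding is an assumption."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for start in range(0, len(signal), stride):
        window = padded[start:start + len(kernel)]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


def tokenize(signal, channels_per_branch=2, hidden_dim=256, seed=0):
    """Return one hidden_dim-sized token per STRIDE input samples."""
    rng = random.Random(seed)  # random weights stand in for learned ones
    channels = []
    for k in KERNEL_SIZES:
        for _ in range(channels_per_branch):
            kernel = [rng.gauss(0.0, 1.0 / k) for _ in range(k)]
            channels.append(conv1d(signal, kernel, STRIDE))
    n_feat = len(channels)
    proj = [[rng.gauss(0.0, n_feat ** -0.5) for _ in range(n_feat)]
            for _ in range(hidden_dim)]
    tokens = []
    for t in range(len(channels[0])):
        feat = [ch[t] for ch in channels]  # concatenation along channels
        tokens.append([sum(w * f for w, f in zip(row, feat)) for row in proj])
    return tokens


waveform = [((i * 37) % 11 - 5) / 5.0 for i in range(256)]  # toy signal
tokens = tokenize(waveform)
print(len(tokens), len(tokens[0]))
```

With a stride of 4, the 256-sample toy waveform yields 64 tokens of 256 dimensions each, which is the sequence a Transformer encoder would then consume.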
[0024] The Transformer encoder outputs a long-range temporal feature sequence based on the multi-scale temporal feature sequence. Tactile signals exhibit long-range temporal dependencies, such as the rhythmic pattern of a periodic vibration or the gradual evolution of a vibration over time, and simple local convolutions cannot adequately model these long-distance relationships. This application employs a Transformer encoder with a Pre-LayerNorm architecture containing several layers (e.g., two layers), each consisting of a multi-head self-attention mechanism (e.g., eight attention heads) and a feedforward network. The multi-head self-attention mechanism enables the model to capture the correlation between any two time steps in the feature sequence across different representation subspaces, effectively modeling long-range dependencies. The feedforward network performs a non-linear transformation on the self-attention output, enhancing the expressive power of the features. Each submodule is equipped with a residual connection to ensure training stability.
[0025] The perceptual band pooling layer outputs a tactile feature vector based on the long-range temporal feature sequence. This layer addresses the significant differences in the sensitivity of human tactile receptors at different frequencies. To match the feature representation output by the Transformer encoder with human tactile perception characteristics, the perceptual band pooling layer divides the feature sequence output by the Transformer encoder into several frequency bands (eight in this embodiment) and allocates Gaussian weights centered on the peak sensitivity frequency of 250 Hz. Bands closer to the sensitive range of human tactile receptors are assigned higher weights, while bands farther away receive correspondingly lower weights. The features within each band are weighted and summed to form the final tactile feature vector. This setup biases the encoder output towards the frequency bands with the highest information density in human perception, thereby improving the correlation between the features and the perception process.
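A minimal sketch of the Gaussian band weighting follows. The eight bands and the 250 Hz centre come from the description; the band centre frequencies (evenly spaced over 10 to 500 Hz), the Gaussian width `sigma_hz`, the normalization of the weights, and the assumption that one feature vector per band is already available are all illustrative choices that the text does not specify.

```python
import math


def gaussian_band_weights(centers_hz, peak_hz=250.0, sigma_hz=100.0):
    """Gaussian weights centred on the peak sensitivity, normalized to sum to 1."""
    raw = [math.exp(-0.5 * ((c - peak_hz) / sigma_hz) ** 2) for c in centers_hz]
    total = sum(raw)
    return [r / total for r in raw]


def band_pool(band_features, weights):
    """Weighted sum of per-band feature vectors into one tactile feature vector."""
    dim = len(band_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, band_features))
            for d in range(dim)]


n_bands = 8
low_hz, high_hz = 10.0, 500.0
width = (high_hz - low_hz) / n_bands
centers = [low_hz + (i + 0.5) * width for i in range(n_bands)]  # assumed layout
weights = gaussian_band_weights(centers)

band_features = [[float(i), 1.0] for i in range(n_bands)]  # toy per-band features
tactile_vector = band_pool(band_features, weights)
print([round(w, 3) for w in weights])
```

Under these assumptions the band whose centre lies nearest 250 Hz receives the largest weight, and the lowest and highest bands contribute least to the pooled vector.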
[0026] When humans describe tactile experiences in words, they often involve multiple perceptual levels within the same sentence. Take "the smooth and pleasant feel of silk" as an example: "smooth" is a physical sensory description, "pleasant" carries a clear emotional connotation, and "silk" refers to an associated scenario related to a specific material. Traditional single-encoder architectures encode the text uniformly and output one global feature vector, failing to explicitly separate and express these three intertwined kinds of semantic information. Consequently, subsequent generative models cannot control each dimension independently.
[0027] The hybrid expert text encoder adopts a Mixture of Experts (MoE) architecture to achieve dimensional decoupling of text semantics. Its overall design includes a basic semantic extraction module, a routing allocation module, and a multi-dimensional expert module.
[0028] The basic semantic extraction module includes a pre-trained language model, specifically the T5 language model (Text-to-Text Transfer Transformer), and an average pooling module. The T5 language model outputs a token-level feature sequence, and the average pooling module averages this sequence over the tokens to obtain the basic semantic representation vector, which summarizes the semantic information of the text description. The T5 model is pre-trained on large-scale text corpora and possesses powerful natural language understanding capabilities, effectively capturing the complex semantics that describe feelings, emotions, and associations in text.
[0029] The routing allocation module dynamically allocates the weights of the expert feature vectors in each dimension based on the semantic content of the input text. The module performs a linear transformation on the basic semantic representation vector and then feeds the result into a Softmax function for normalization, outputting a three-dimensional weight vector. Its three components are the routing weights of the expert feature vectors in the sensation, emotion, and association dimensions, respectively, and they always sum to 1. Specifically, when the text description focuses on physical attributes (such as the words "rough," "smooth," and "high-frequency vibration"), the routing weight of the expert feature vector in the sensation dimension increases; when the text focuses on emotional experience (such as "pleasure," "tension," and "relaxation"), the routing weight of the expert feature vector in the emotion dimension increases accordingly; when the text points to scene associations (such as the images "heartbeat," "engine," and "raindrops"), the expert feature vector in the association dimension receives a higher routing weight.
[0030] The multi-dimensional expert modules include a sensory expert module (S), an emotional expert module (E), and an associative expert module (A). Each of these three expert modules can be implemented as an independent two-layer feedforward network. Each expert module takes the basic semantic representation vector as input, performs a nonlinear transformation through the two-layer feedforward network, and outputs a dimension-specific feature relevant only to its own dimension. This dimension-specific feature is then added residually to the original basic semantic representation vector to obtain the dimension-enhanced semantic feature, ensuring that no global semantic information is lost. Finally, an independent projection layer maps the dimension-enhanced semantic feature to a unified embedding space dimension, yielding the expert feature vector for that dimension.
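The routing step (a linear transformation followed by Softmax) can be sketched as follows. The toy four-dimensional input and the hand-picked weight matrix are hypothetical; in practice the weights would be learned.

```python
import math


def softmax(logits):
    """Numerically stable Softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(base_vector, weight_rows, bias):
    """Linear layer followed by Softmax: one routing weight per expert."""
    logits = [sum(w * x for w, x in zip(row, base_vector)) + b
              for row, b in zip(weight_rows, bias)]
    return softmax(logits)


# Hypothetical 4-dimensional basic semantic representation vector.
base = [0.9, -0.3, 0.1, 0.5]
W = [[1.0, 0.0, 0.0, 0.0],   # sensory expert row
     [0.0, 1.0, 0.0, 0.0],   # emotional expert row
     [0.0, 0.0, 1.0, 0.0]]   # associative expert row
b = [0.0, 0.0, 0.0]
routing_weights = route(base, W, b)
print(routing_weights, sum(routing_weights))
```

Because Softmax normalizes the three logits, the routing weights are positive and sum to 1, and the expert with the largest logit (here the sensory expert) receives the largest share.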
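One expert module can be sketched as a two-layer feedforward network with a residual connection and a projection layer. The ReLU activation, the parameter names, and the toy sizes are assumptions; the text specifies only the two-layer structure, the residual addition, and the independent projection.

```python
def linear(x, rows, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(rows, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def expert_forward(base, params):
    """Two-layer FFN, residual addition to the base vector, then projection."""
    hidden = relu(linear(base, params["w1"], params["b1"]))  # ReLU is an assumption
    dim_feature = linear(hidden, params["w2"], params["b2"])
    enhanced = [b + f for b, f in zip(base, dim_feature)]    # residual connection
    return linear(enhanced, params["w_proj"], params["b_proj"])


# Toy parameters for a 2-dimensional base vector (hypothetical values).
toy_params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[0.5, 0.0], [0.0, 0.5]], "b2": [0.0, 0.0],
    "w_proj": [[1.0, 0.0], [0.0, 1.0]], "b_proj": [0.0, 0.0],
}
print(expert_forward([2.0, -1.0], toy_params))
```

The residual addition means that even an expert whose feedforward branch outputs zeros still passes the full basic semantic representation through to the projection layer.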
[0031] The final global text feature vector is formed by weighting and combining multi-dimensional expert feature vectors (including three dimensions) according to routing weights (the weights of corresponding dimensions in the three-dimensional weight vector). The global text feature vector retains complete textual semantic information while allowing for the separation and extraction of dimensional information through the division of labor among experts in each dimension. A soft routing strategy is employed when weighting and combining multi-dimensional expert feature vectors according to routing weights to form the global text feature vector. The soft routing strategy preserves the output information of all experts, allowing for a reasonable degree of semantic correlation between dimensions. This design consideration stems from the actual laws of tactile perception, because the three dimensions are not completely orthogonal at the perceptual level, and there is an objective degree of semantic overlap between them. For example, the description of "rain sound" includes scene associations in the associative dimension, while also carrying high-frequency sensory features and a soothing emotional mapping.
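The soft routing combination amounts to a weighted sum of the three expert feature vectors. The sketch below uses hypothetical three-dimensional expert vectors and routing weights to show that every expert contributes to the global text feature vector.

```python
def soft_combine(expert_vectors, routing_weights):
    """Weighted sum of expert feature vectors (soft routing keeps every expert)."""
    dim = len(expert_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(routing_weights, expert_vectors))
            for d in range(dim)]


# Hypothetical expert feature vectors and routing weights.
sensory = [1.0, 0.0, 0.0]
emotional = [0.0, 1.0, 0.0]
associative = [0.0, 0.0, 1.0]
global_text_vector = soft_combine([sensory, emotional, associative],
                                  [0.6, 0.3, 0.1])
print(global_text_vector)
```

A hard (top-1) routing strategy would instead keep only the sensory vector in this example and discard the emotional and associative contributions, which is the information loss that soft routing avoids.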
[0032] The native tactile encoder and the hybrid expert text encoder are trained jointly under a multi-task design with four complementary optimization objectives, which achieve cross-modal alignment, hard sample discrimination, dimensional decoupling, and expert load balancing, respectively. Each training task corresponds one-to-one with a loss function, as follows:
- Cross-modal alignment task: a haptic-text contrastive loss aligns text and haptic features semantically in the joint embedding space.
- Fine-grained matching task: a haptic-text matching loss improves the model's ability to distinguish semantically similar samples.
- Dimensional decoupling task: an orthogonal decoupling loss constrains the expert feature vectors of the sensation, emotion, and association dimensions to be independent.
- Expert load balancing task: a load balancing loss ensures that the expert feature vectors of the three dimensions are fully utilized, avoiding routing bias.
The learning process of the hybrid expert text encoder essentially involves continuously optimizing the accuracy and semantic expressive power of the text feature vectors, so that they are highly aligned with the semantics of the target haptic perception and so that the sensation, emotion, and association dimensions are orthogonally decoupled and unmixed in the feature space, thereby providing controllable and adjustable conditional guidance for haptic signal generation.
[0033] The haptic-text contrastive loss is based on the InfoNCE framework. Within each training batch, paired haptic feature vectors and global text feature vectors are treated as positive sample pairs, while all other unpaired combinations within the batch are treated as negative sample pairs. A temperature parameter $\tau$ (in this embodiment, $\tau = 0.5$) scales the cosine similarity used to calculate the contrastive loss. The optimization pulls positive sample pairs closer together in the embedding space and pushes negative sample pairs further apart.
[0034] $$\mathcal{L}_{\mathrm{con}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\left(\mathrm{sim}(h_i, t_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(h_i, t_j)/\tau\right)} + \log\frac{\exp\left(\mathrm{sim}(t_i, h_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(t_i, h_j)/\tau\right)}\right]$$
where $\mathcal{L}_{\mathrm{con}}$ is the haptic-text contrastive loss; $N$ is the number of samples in the current training batch; $h_i$ and $h_j$ are the tactile feature vectors extracted by the tactile encoder from the $i$-th and $j$-th samples; $t_i$ and $t_j$ are the global text feature vectors extracted by the MoE text encoder from the $i$-th and $j$-th samples; $(h_i, t_i)$ and $(h_j, t_j)$ constitute positive sample pairs, that is, paired tactile and textual representations from the same sample, while the remaining unpaired combinations within the batch, $(h_i, t_j)$ and $(h_j, t_i)$ with $i \neq j$, form negative sample pairs; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; and $\tau$ is the temperature parameter, which controls the sharpness of the similarity distribution. Lower temperatures make the model more sensitive to the distinction between positive and negative samples, while higher temperatures help training converge smoothly. The loss is calculated symmetrically in the haptic-to-text and text-to-haptic directions, ensuring bidirectional alignment across modalities.
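A symmetric InfoNCE loss of this kind can be sketched in plain Python. The temperature of 0.5 and the use of cosine similarity follow the description; averaging the two directions with equal weight and averaging over the batch are assumptions, since the exact normalization is not stated in the text.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(row, index):
    """Log of the Softmax probability of row[index], computed stably."""
    m = max(row)
    log_norm = m + math.log(sum(math.exp(z - m) for z in row))
    return row[index] - log_norm


def contrastive_loss(haptic, text, tau=0.5):
    """Symmetric InfoNCE over a batch of paired haptic and text vectors."""
    n = len(haptic)
    sim = [[cosine(h, t) / tau for t in text] for h in haptic]
    h2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    t2h = -sum(log_softmax_at([sim[j][i] for j in range(n)], i)
               for i in range(n)) / n
    return 0.5 * (h2t + t2h)  # equal weighting of both directions (assumed)


haptic_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(haptic_batch, aligned_text))
print(contrastive_loss(haptic_batch, swapped_text))
```

The aligned batch yields a much smaller loss than the batch in which the text vectors are swapped, which is the behaviour the contrastive objective rewards.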
[0035] A haptic-text matching loss is used as a necessary supplement. This loss is implemented as a binary classification task and operates in conjunction with a hard negative sample mining strategy.
[0036] $$\mathcal{L}_{\mathrm{match}} = \mathbb{E}_{(\hat{h},\hat{t})\sim\mathcal{P}}\left[\mathrm{BCE}\left(\sigma(\hat{h}^{\top}\hat{t}),\,1\right)\right] + \mathbb{E}_{(\hat{h},\hat{t})\sim\mathcal{N}_{\mathrm{hard}}}\left[\mathrm{BCE}\left(\sigma(\hat{h}^{\top}\hat{t}),\,0\right)\right]$$
where $\mathcal{L}_{\mathrm{match}}$ is the haptic-text matching loss; $\hat{h}$ and $\hat{t}$ are the representation vectors of the tactile features and text features, respectively, after projection by the matching head; $\sigma(\cdot)$ is the Sigmoid activation function; $\mathrm{BCE}(\cdot,\cdot)$ is the binary cross-entropy loss function; $\mathcal{P}$ is the set of positive sample pairs, containing truly paired tactile-text samples; $\mathcal{N}_{\mathrm{hard}}$ is the set of hard negative sample pairs, each formed by a positive sample and the unpaired sample with the highest similarity to it in the current batch; $\mathbb{E}_{(\hat{h},\hat{t})\sim\mathcal{P}}$ is the expectation over pairs sampled from the set of positive sample pairs; and $\mathbb{E}_{(\hat{h},\hat{t})\sim\mathcal{N}_{\mathrm{hard}}}$ is the expectation over pairs sampled from the set of hard negative sample pairs. In the current batch, the unpaired sample with the highest similarity to each positive sample is selected as its hard negative sample, and the binary cross-entropy loss forces the model to distinguish these subtle semantic differences more accurately.
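Hard negative mining combined with a binary cross-entropy matching loss can be sketched as follows. Using the dot product of the projected vectors as the matching score, mining the hard negative among the text vectors only, and averaging over the batch are illustrative assumptions; the text fixes only the Sigmoid, the binary cross-entropy, and the highest-similarity selection rule.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one prediction; prob is clamped for safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hardest_negative(i, haptic, text):
    """Index of the unpaired text vector most similar to haptic sample i."""
    candidates = [j for j in range(len(text)) if j != i]
    return max(candidates, key=lambda j: dot(haptic[i], text[j]))


def matching_loss(haptic, text):
    """Positive pairs labelled 1, mined hard negatives labelled 0 (batch >= 2)."""
    n = len(haptic)
    pos = sum(bce(sigmoid(dot(haptic[i], text[i])), 1) for i in range(n))
    neg = sum(bce(sigmoid(dot(haptic[i], text[hardest_negative(i, haptic, text)])), 0)
              for i in range(n))
    return pos / n + neg / n


haptic_batch = [[2.0, 0.0], [0.0, 2.0]]
text_batch = [[2.0, 0.0], [0.0, 2.0]]
print(matching_loss(haptic_batch, text_batch))
```

Restricting the negative term to the most similar unpaired sample concentrates the training signal on the cases that a purely contrastive objective is most likely to confuse.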
[0037] Pure contrastive loss only distinguishes positive pairs from the effectively random negative samples within a batch, and may neglect hard negative samples that are semantically similar to positive samples but are not true pairs; the matching loss with hard negative sample mining covers this gap. An orthogonal decoupling loss is additionally employed so that the feature representations output by the three experts are as independent as possible in the vector space. Specifically, for any two different expert feature matrices output in the current batch (e.g., the sensory expert output matrix $F_S$ and the emotional expert output matrix $F_E$), the Frobenius norm of their inner product matrix is calculated.
[0038] $$\mathcal{L}_{\mathrm{orth}} = \sum_{(i,j)\in\{(S,E),\,(S,A),\,(E,A)\}} \left\| F_i^{\top} F_j \right\|_F$$
where $\mathcal{L}_{\mathrm{orth}}$ is the orthogonal decoupling loss; $F_i$ is the feature matrix output by the expert of dimension $i$ in the current batch; $F_j$ is the feature matrix output by the expert of dimension $j$ in the current batch; $F_i, F_j \in \mathbb{R}^{B \times D}$, where $B$ is the batch size and $D$ is the feature dimension; and, as $i$ and $j$ take different values, $F_S$, $F_E$, and $F_A$ are the output feature matrices of the three experts for the sensory, emotional, and associative dimensions, respectively. The Frobenius norm measures how far the inner product of two expert feature matrices deviates from orthogonality: it is zero when the two expert outputs are perfectly orthogonal, and it grows as the feature overlap increases. The summation covers all dimension combinations $(S,E)$, $(S,A)$, and $(E,A)$, imposing orthogonal constraints on all three pairs of experts. Minimizing the sum of the orthogonal deviations over $(S,E)$, $(S,A)$, and $(E,A)$ forces each expert to learn non-redundant dimensional features.
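The pairwise Frobenius-norm penalty can be sketched as below. The sketch assumes that the "inner product matrix" of two batch-by-dimension feature matrices is the transpose of the first multiplied by the second, and that the norm is used unsquared; both are interpretations of the text rather than confirmed details.

```python
import math


def cross_product(a, b):
    """Return A^T B for row-major matrices of shape (batch, dim)."""
    rows, dim_a, dim_b = len(a), len(a[0]), len(b[0])
    return [[sum(a[n][p] * b[n][q] for n in range(rows)) for q in range(dim_b)]
            for p in range(dim_a)]


def frobenius(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))


def orthogonal_loss(sensory, emotional, associative):
    """Sum of Frobenius norms over the three expert pairs (S,E), (S,A), (E,A)."""
    pairs = [(sensory, emotional), (sensory, associative), (emotional, associative)]
    return sum(frobenius(cross_product(a, b)) for a, b in pairs)


# Toy batch of two samples with two features per expert (hypothetical values).
F_S = [[1.0, 0.0], [0.0, 0.0]]
F_E = [[0.0, 0.0], [0.0, 1.0]]
F_A = [[0.0, 0.0], [0.0, 0.0]]
print(orthogonal_loss(F_S, F_E, F_A))  # decoupled experts
print(orthogonal_loss(F_S, F_S, F_S))  # fully redundant experts
```

Under this interpretation the loss vanishes when the feature columns of different experts are uncorrelated across the batch, and it is strictly positive when the experts output identical features.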
[0039] The load balancing loss addresses the risk of routing collapse in the MoE architecture, in which the routing network converges to a degenerate state where most samples are assigned to a single expert and the other experts receive no effective training.
[0040] $$\mathcal{L}_{\mathrm{bal}} = K \sum_{k=1}^{K} f_k \, p_k$$
where $\mathcal{L}_{\mathrm{bal}}$ is the load balancing loss; $K$ is the total number of experts (in this embodiment $K = 3$, corresponding to the experts of the sensation, emotion, and association dimensions); $f_k$ is the proportion of samples in the current batch that the routing network assigns to the $k$-th expert, obtained by normalizing discrete counts; and $p_k$ is the average soft probability that the routing network outputs for the $k$-th expert, which is continuously differentiable. The load balancing loss is introduced to counteract the tendency towards routing collapse. It acts as a regularization constraint by monitoring the distribution of routing activations for each expert over the training batch, and it imposes a penalty when the deviation between an expert's actual average routing weight and the ideal uniform weight (one-third) exceeds a reasonable range.
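A load balancing loss built from the assignment fractions and the mean routing probabilities can be sketched as follows. Counting each sample towards its highest-probability expert (top-1 assignment) and scaling the sum by the number of experts are assumptions modelled on common MoE practice; the text only says that the fractions come from normalized discrete counts.

```python
def load_balance_loss(router_probs):
    """router_probs: one list of per-expert routing probabilities per sample."""
    n = len(router_probs)
    k = len(router_probs[0])
    counts = [0] * k
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1  # discrete top-1 assignment (assumed)
    fractions = [c / n for c in counts]
    mean_probs = [sum(probs[e] for probs in router_probs) / n for e in range(k)]
    return k * sum(f * p for f, p in zip(fractions, mean_probs))


balanced = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
collapsed = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
print(load_balance_loss(balanced))
print(load_balance_loss(collapsed))
```

With this scaling the loss is close to 1 when the three experts are used evenly and rises when routing collapses onto a single expert, so minimizing it pushes the router towards the uniform one-third allocation.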
[0041] The four loss functions mentioned above are weighted and summed according to their respective weight coefficients to form the final multi-task training objective.
[0042] The conditional diffusion generation module receives text features and dimensional expert features from the alignment module as conditional signals, driving the diffusion process to gradually recover the target tactile signal from random Gaussian noise. In the conditional injection stage, this module uses Feature-wise Linear Modulation (FiLM) to embed the conditional information; in the guidance and control stage, a hierarchical classifier-free guidance mechanism is used to achieve hierarchical control of global semantics and dimensional details.
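Feature-wise Linear Modulation scales and shifts each feature channel with values derived from the condition vector. The sketch below assumes that the per-channel scale and shift come from two linear layers applied to the condition vector; the layer shapes and names are hypothetical.

```python
def linear(x, rows, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(rows, bias)]


def film(features, gamma, beta):
    """features: one list of time samples per channel; gamma, beta: per channel."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


def film_from_condition(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Derive the scale and shift from the condition vector, then modulate."""
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return film(features, gamma, beta)


features = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two time steps
modulated = film(features, gamma=[2.0, 0.0], beta=[0.0, 1.0])
print(modulated)
```

Because the modulation is applied per channel and broadcast over time, the same condition vector influences every time step of the waveform features without changing the sequence length.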
[0043] The conditional diffusion generation module gradually recovers the target tactile signal from random noise based on the text and dimensional conditions provided by the alignment module. In this embodiment, the conditional diffusion generation module is adapted from the DiffWave architecture. DiffWave is a diffusion model for waveform generation whose core is a feedforward bidirectional dilated convolutional architecture inspired by WaveNet. Specifically, the conditional diffusion generation module consists of an input projection layer, multiple residual blocks, a skip connection aggregation layer, and an output projection layer. The input projection layer projects the noisy signal onto the residual channel dimension (set to 64 in this embodiment). The residual block is the core computational unit of the network, integrating a bidirectional dilated convolutional layer, a gated activation function, and a conditional modulation module. The residual blocks are organized into multiple groups, with the dilation rate within each group increasing exponentially (e.g., 1, 2, 4, 8). This gives the network a progressively expanding receptive field without increasing the number of parameters, thereby covering information at different scales, from short-term local features to long-term global structures. Each residual block also outputs a skip connection signal. These signals are accumulated at the skip connection aggregation layer at the end of the network and then converted into the final noise prediction by the output projection layer.
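To make the receptive-field argument concrete, the short calculation below applies the standard formula for a stack of stride-1 dilated convolutions. The kernel size of 3 and the choice of three groups are assumptions for illustration; the application only states the dilation pattern 1, 2, 4, 8.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of stride-1 dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

group = [1, 2, 4, 8]                          # dilation rates quoted in the text
one_group = receptive_field(3, group)         # 31 samples
three_groups = receptive_field(3, group * 3)  # 91 samples
```

Each layer has the same parameter count regardless of its dilation rate, so the wider temporal context comes from the dilation pattern rather than from larger kernels.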
[0044] The conditional diffusion generation module refines the generated results through an iterative denoising process, ensuring stable training and reliable generation quality. Its non-autoregressive nature supports parallel processing of the entire waveform, making it suitable for the one-dimensional time-series waveform characteristics of tactile signals. The diffusion process consists of two stages: forward noise addition and backward denoising. In the forward stage, Gaussian noise is progressively added to the original tactile signal according to a preset noise schedule. After $T$ steps ($T = 1000$ in this embodiment), the signal degenerates into approximately pure Gaussian white noise. The backward stage forms the core of generation. The model starts from pure noise and, at each step, predicts the noise added during the forward process based on the current noisy signal, the time-step encoding, and the condition vector. The predicted noise is subtracted to obtain a cleaner signal estimate. After $T$ iterations, the final tactile waveform is output. The training objective is to predict the noise added at each step of the forward process, and the loss function of the diffusion process is the mean squared error of the noise prediction: $\mathcal{L}_{diff} = \mathbb{E}_{t, x_0, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2 \right]$, where $\mathcal{L}_{diff}$ is the diffusion process loss function; $\epsilon$ is the noise actually added in the forward process; $\epsilon_\theta(x_t, t, c)$ is the noise predicted by the model given the noisy signal $x_t$, the time step $t$, and the condition $c$; $\mathbb{E}_{t, x_0, \epsilon}$ denotes the expectation over the time step $t$, the original signal $x_0$, and the noise $\epsilon$; and $\|\cdot\|^2$ denotes the squared norm.
[0045] According to one implementation, feature-wise linear modulation can be used to solve the problem of chaotic condition injection and achieve effective condition injection. A common approach in traditional schemes is simple concatenation, where the condition vector is directly concatenated with the noisy signal and then fed into the network. This works in scenarios with a single condition, but in multi-condition scenarios that require simultaneous injection of global textual conditions and dimensional expert conditions, it easily leads to condition conflicts. Condition information from different sources becomes mixed after concatenation, making it difficult for the network to distinguish the contribution of each condition, and some condition signals may be masked or even ignored. This application places a dedicated multilayer perceptron (MLP) in each residual block, mapping the condition vector to two sets of parameters: the scaling factor γ and the translation β. The dimensions of these two sets of parameters match the channel dimension of the hidden layer features within the residual block. Condition injection is performed as an element-wise affine transformation of the hidden layer features, as shown in the following formula: $h' = \gamma(c) \odot h + \beta(c)$, where $h'$ is the hidden layer feature after conditional modulation, i.e., the output obtained by applying the affine transformation to the original hidden layer feature $h$ under the control of the condition vector $c$; $h$ is the original hidden layer feature within the residual block after dilated convolution and gated activation; $c$ is the condition vector combining the global text feature vector and the dimensional expert feature vector; $\gamma(c)$ and $\beta(c)$ are the parameter vectors predicted by the MLP from the condition vector; and $\odot$ denotes element-wise multiplication.
The core advantage of the mechanism is that, through the affine transformation, the conditional information applies independent scaling and offset control to each channel of the hidden layer features, achieving fine-grained conditional influence rather than the coarse-grained mixing caused by simple concatenation; at the same time, the number of parameters in the MLP is much smaller than that of the entire residual block, and the computational cost is negligible.
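The per-channel affine modulation can be sketched as follows. A single dense layer stands in for the per-block MLP, and all weights, shapes, and names are made-up illustrations rather than values from the application.

```python
def linear(x, weight, bias):
    """Single dense layer y = W x + b, standing in for the per-block MLP."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def film(hidden, gamma, beta):
    """Scale and shift each channel of `hidden` (channels x time)."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(hidden, gamma, beta)]

cond = [1.0, -1.0]                                    # toy condition vector
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 0.5]], [1.0, 1.0]
w_beta, b_beta = [[0.0, 0.0], [0.0, -1.0]], [0.0, 0.0]

gamma = linear(cond, w_gamma, b_gamma)   # one scale per channel
beta = linear(cond, w_beta, b_beta)      # one shift per channel
hidden = [[1.0, 2.0], [3.0, 4.0]]        # 2 channels x 2 time steps
modulated = film(hidden, gamma, beta)
```

Because the scale and shift are broadcast along the time axis, the condition changes how strongly each channel contributes without being mixed into the signal samples themselves, which is the contrast with concatenation drawn in the text.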
[0046] According to one implementation, hierarchical classifier-free guidance is used to solve the dimensionality control problem, achieving fine-grained dimensional controllability. Standard classifier-free guidance (CFG) works by amplifying the difference between conditional and unconditional predictions during denoising to enhance the control of conditions, but natively only supports the adjustment of a single condition. The generation in this application requires simultaneous conditional constraints at two levels: global textual semantics and dimension-specific details. These two levels differ in their control granularity: global textual semantics determines the overall perceptual direction of the generated signal (e.g., "a regular heartbeat"), while dimensional expert details control the specific expression intensity of each dimension (e.g., the physical rhythm characteristics of the heartbeat, the reassuring emotion evoked by the heartbeat, and the medical scenarios associated with the heartbeat). Specifically, the conditions are decomposed into two levels for control. In each denoising step, the model sequentially executes the following four computational steps.
[0047] In the first step, the unconditional baseline prediction term is calculated. The current noisy signal $x_t$ and the time step $t$ are input into the diffusion network without any conditional signal (the condition input is set to empty), yielding the unconditional baseline prediction term $\epsilon_{uncond} = \epsilon_\theta(x_t, t, \varnothing)$. This prediction term reflects the noise estimate made by the model based solely on the prior statistical regularities of the signal itself, and serves as the reference baseline for the subsequent guidance calculations.
[0048] In the second step, the global text guidance term is calculated. The current noisy signal $x_t$, the time step $t$, and the global text feature vector $c_{text}$ are fed into the diffusion network to obtain the global text conditional prediction term $\epsilon_{text} = \epsilon_\theta(x_t, t, c_{text})$. The global text guidance term is defined as the difference between $\epsilon_{text}$ and the unconditional baseline prediction term $\epsilon_{uncond}$; physically, it represents the correction to the noise prediction relative to the unconditional baseline once the text condition is introduced. The global guidance weight $\lambda_g$ scales the magnitude of this correction direction.
[0049] In the third step, the dimensional expert guidance term is calculated. The current noisy signal $x_t$ and the time step $t$ are combined with each of the three expert feature vectors in turn and fed into the diffusion network to obtain the sensory dimension prediction term $\epsilon_s$, the emotional dimension prediction term $\epsilon_e$, and the associative dimension prediction term $\epsilon_a$. The three are weighted and combined according to the routing weights $w_s$, $w_e$, $w_a$ output by the routing network to obtain the dimensional expert fusion prediction term $\epsilon_{expert}$: $\epsilon_{expert} = w_s \epsilon_s + w_e \epsilon_e + w_a \epsilon_a$. The dimensional expert guidance term is the difference between $\epsilon_{expert}$ and the global text conditional prediction term $\epsilon_{text}$; it indicates the direction of further correction, on top of the global text guidance, once the dimensional expert conditions are introduced. The expert guidance weight $\lambda_e$ scales this correction direction.
[0050] In the fourth step, the comprehensive denoising prediction is computed. The above terms are combined in a weighted sum: $\hat{\epsilon} = \epsilon_{uncond} + \lambda_g \left( \epsilon_{text} - \epsilon_{uncond} \right) + \lambda_e \left( \epsilon_{expert} - \epsilon_{text} \right)$, where $\hat{\epsilon}$ is the final prediction used to remove noise in the current denoising step, $\epsilon_{uncond}$ is the unconditional baseline prediction term, $\epsilon_{text}$ is the global text conditional prediction term, and $\epsilon_{expert}$ is the dimensional expert fusion prediction term. By adjusting the global guidance weight $\lambda_g$ and the expert guidance weight $\lambda_e$, one can separately control how closely the tactile signal fits the overall semantics of the descriptive text and how strongly the tactile signal expresses the multi-dimensional expert feature vectors. This hierarchical guidance design decouples global semantic control from dimensional detail control: users can adjust the expression intensity of a given dimension individually without interfering with the overall semantic direction, and can also adjust the guidance strength of the global semantics while keeping the dimensional details stable.
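The four steps above can be sketched as a single combination function. The diffusion network itself is not modelled here: the three kinds of noise prediction are passed in as plain lists, and the function and argument names are hypothetical. The combination rule (unconditional baseline, plus a scaled text correction, plus a scaled expert correction relative to the text prediction) follows the differences described in the text.

```python
def hierarchical_guidance(eps_uncond, eps_text, eps_experts, route_weights,
                          w_global, w_expert):
    """Combine noise predictions with two independent guidance weights."""
    # Fuse the sensory / emotional / associative predictions by routing weight.
    eps_fused = [sum(w * eps[k] for w, eps in zip(route_weights, eps_experts))
                 for k in range(len(eps_uncond))]
    # Baseline + global text correction + dimensional expert correction.
    return [u + w_global * (t - u) + w_expert * (f - t)
            for u, t, f in zip(eps_uncond, eps_text, eps_fused)]

eps_uncond = [0.0, 0.0]
eps_text = [1.0, 2.0]
eps_experts = [[2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]   # toy per-dimension predictions
route_weights = [0.5, 0.25, 0.25]

text_only = hierarchical_guidance(eps_uncond, eps_text, eps_experts,
                                  route_weights, w_global=1.0, w_expert=0.0)
full = hierarchical_guidance(eps_uncond, eps_text, eps_experts,
                             route_weights, w_global=1.0, w_expert=1.0)
stronger_text = hierarchical_guidance(eps_uncond, eps_text, eps_experts,
                                      route_weights, w_global=2.0, w_expert=0.0)
```

Setting the expert weight to zero reduces the rule to ordinary single-condition classifier-free guidance on the text, which is one way to see that the two weights act on separate correction directions.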
[0051] Specifically, an adaptive dynamic threshold mechanism is designed to address the following issue in tactile signal generation: when high guidance weights (for example, 3.0 and 4.5) are used to enhance semantic alignment, the amplitude of the generated signal may exceed the physically valid range (e.g., the normalized [-1,1] interval), leading to distortion.
[0052] Traditional methods employ fixed hard truncation (forcibly clipping values outside the range to the boundary value). However, this introduces discontinuities at the truncation point, generating high-frequency artifacts, disrupting signal smoothness, and reducing generation quality. In this application, during each denoising step, a preset high percentile (the 99.5th percentile in this embodiment) of the absolute values of all generated signals in the current batch is first calculated as the dynamic truncation boundary. The core value of this mechanism is that the truncation boundary adapts to the actual statistical distribution of the current batch of signals and automatically matches a reasonable clipping range under different guidance intensities and generation conditions, avoiding the drawback of a fixed threshold being too lenient or too strict under certain operating conditions. Compared with hard truncation, the dynamic threshold strategy retains 99.5% of the amplitude information in the signal and applies smoothing constraints only to extreme outliers, thereby effectively reducing total harmonic distortion and significantly improving signal fidelity.
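A minimal sketch of percentile-based thresholding is given below. The application specifies the 99.5th percentile boundary but not the exact treatment of samples beyond it, so this sketch assumes one common variant: clip to the boundary, then rescale by the boundary so the result lies in [-1, 1]. The nearest-rank percentile, the floor of 1.0 on the boundary, and all names are likewise assumptions.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]

def dynamic_threshold(batch, q=99.5, floor=1.0):
    """Clip a batch of signals at a batch-dependent boundary, then rescale."""
    magnitudes = [abs(v) for signal in batch for v in signal]
    # Never shrink the valid range below the normalized interval [-1, 1].
    boundary = max(percentile(magnitudes, q), floor)
    clipped = [[max(-boundary, min(boundary, v)) / boundary for v in signal]
               for signal in batch]
    return clipped, boundary

# 1000 toy samples: mostly small, a few moderately large, one extreme outlier.
signal = [0.5] * 990 + [-2.0] * 9 + [10.0]
out, boundary = dynamic_threshold([signal])
```

Here the boundary settles at the magnitude of the moderately large samples, so only the single extreme outlier is actually clipped, while the rest of the waveform keeps its relative shape.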
[0053] According to one implementation, the overall training is divided into two stages: the first stage trains the dimensional decoupling and alignment module, and the second stage trains the conditional diffusion generation module. There is a sequential dependency between the two stages, with the second stage using the frozen output of the first stage as a fixed conditional input.
[0054] The first stage trains the dimensional decoupling and alignment module. Its training goal is to enable the native haptic encoder and the hybrid expert text encoder to achieve cross-modal semantic alignment in the joint embedding space, while simultaneously decoupling the features of the three dimensions of sensation, emotion, and association.
[0055] The training dataset is a tactile-text pairing dataset, where each sample contains a raw tactile waveform signal and its corresponding natural language text description. The text description simultaneously includes sensory dimension annotations (such as "roughness," "high-frequency vibration"), emotional dimension annotations (such as "pleasure," "tension"), and associative dimension annotations (such as "heartbeat," "engine vibration"). Before formal training begins, the tactile signals are normalized and the text descriptions are tokenized and cleaned. The dataset is divided into three non-overlapping subsets, a training set, a validation set, and a test set, according to a preset ratio (an 8:1:1 split is used in this embodiment). In real-world scenarios, datasets often exhibit unbalanced class distributions; for example, the number of "comfort" samples in the emotional dimension may far exceed the number of "anxiety" samples. To address this issue, this application adopts a weighted sampling strategy, assigning higher sampling probabilities to minority class samples when constructing each training batch, thereby ensuring sufficient training coverage for each dimension and class.
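One simple way to realize the weighted sampling described above is to weight each sample by the inverse frequency of its class. The sketch below assumes a single class label per sample and sampling with replacement; the application does not specify the weighting formula, so the inverse-frequency choice and the names are illustrative.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Give every class the same total sampling mass."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

def sample_batch(indices, labels, batch_size, rng):
    """Draw sample indices with replacement, favouring minority classes."""
    return rng.choices(indices, weights=inverse_frequency_weights(labels),
                       k=batch_size)

labels = ["comfort"] * 9 + ["anxiety"]      # 9:1 imbalance in the toy data
rng = random.Random(0)
batch = sample_batch(list(range(len(labels))), labels, 2000, rng)
anxiety_share = sum(labels[i] == "anxiety" for i in batch) / len(batch)
```

With inverse-frequency weights the minority class is drawn roughly as often as the majority class, even though it makes up only a tenth of the toy dataset.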
[0056] During module initialization, the parameters of the native haptic encoder's multi-scale convolutional tokenizer, Transformer encoder, and perceptual band pooling layer are randomly initialized. The backbone network of the hybrid expert text encoder is loaded with pre-trained T5 model weights, and the parameters of the routing network and the three dimensional expert networks (sensory expert, emotional expert, and associative expert) are randomly initialized.
[0057] In the multi-task loss function, the optimization directions of the individual loss terms are not entirely consistent. If all loss terms are activated simultaneously at the beginning of training, conflicts between gradient signals can easily degrade convergence quality. This application therefore employs an experimentally verified three-stage progressive training strategy, activating each loss term at a predetermined point in training and systematically unfreezing the parameters of specific encoder layers. In the first stage of progressive training (corresponding to epochs 1 to 3), only the haptic-text contrastive loss is activated (…). This loss is based on the InfoNCE framework and uses cosine similarity scaled by a temperature parameter as the metric, maximizing the similarity of positive sample pairs (i.e., paired tactile features and text features) within each training batch while minimizing the similarity of negative sample pairs (i.e., unpaired combinations). In this stage, all parameters of the T5 backbone network are frozen and do not participate in gradient updates; only the native tactile encoder, routing assignment module, and multi-dimensional expert module are trained. The purpose is to establish the most basic alignment between tactile features and text features without disturbing the general semantic knowledge accumulated by the pre-trained language model.
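The contrastive objective of the first stage can be sketched as below. This is a generic InfoNCE computation on toy vectors, assuming a symmetric loss (haptic-to-text and text-to-haptic) and a temperature of 0.07; neither the symmetry nor the temperature value is stated in the application.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def info_nce(haptic, text, tau=0.07):
    """Symmetric InfoNCE over temperature-scaled cosine similarities."""
    logits = [[cosine(h, t) / tau for t in text] for h in haptic]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

haptic = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(haptic, [[1.0, 0.0], [0.0, 1.0]])      # pairs match
mismatched = info_nce(haptic, [[0.0, 1.0], [1.0, 0.0]])   # pairs swapped
```

The loss is near zero when each tactile embedding is closest to its own text embedding and large when the pairing is swapped, which is the alignment signal the first stage relies on.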
[0058] In the second stage of progressive training (corresponding to epochs 4 to 8), after the contrastive loss from the first stage has fully converged, the haptic-text matching loss and the load balancing loss are additionally activated. At the same time, the last few layers of the T5 backbone network (the last 3 layers in this embodiment) are unfrozen and included among the trainable parameters.
[0059] The haptic-text matching loss operates as a binary classification matching task, combined with a hard negative sample mining strategy. For each positive sample pair in the current batch, the unpaired sample with the highest similarity to it is selected as a hard negative sample within the batch. The binary cross-entropy loss then forces the model to identify more accurately these sample pairs that are semantically very similar but actually mismatched, compensating for the inherent weakness of the contrastive loss in discriminating hard negative samples.
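In-batch hard negative mining can be sketched as follows. For simplicity the toy similarity scores are reused as match probabilities, whereas a real system would typically pass each pair through a separate matching head; that shortcut and all names are assumptions.

```python
import math

def hardest_negatives(sim):
    """For each row i, return the column j != i with the highest similarity."""
    return [max((j for j in range(len(row)) if j != i), key=row.__getitem__)
            for i, row in enumerate(sim)]

def binary_cross_entropy(prob, label):
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

# sim[i][j]: toy match score between tactile sample i and text j.
sim = [[0.90, 0.80, 0.10],
       [0.20, 0.95, 0.70],
       [0.60, 0.30, 0.90]]
negatives = hardest_negatives(sim)

# One positive (label 1) and one hard negative (label 0) per sample.
matching_loss = sum(binary_cross_entropy(sim[i][i], 1) +
                    binary_cross_entropy(sim[i][j], 0)
                    for i, j in enumerate(negatives)) / (2 * len(sim))
```

Selecting the most similar unpaired text for each tactile sample concentrates the binary loss on the confusable cases instead of on negatives the model already separates easily.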
[0060] The purpose of the load balancing loss is to monitor the distribution of weights that the routing network assigns to the three experts. When the actual average routing weight of an expert deviates too far from the ideal uniform value (i.e., 1/3), this loss term applies a regularization penalty to prevent the routing network from slipping into a degenerate mode that concentrates a large number of samples on a single expert, ensuring that all three experts receive sufficient training gradients. The reason for unfreezing several layers at the tail of the T5 backbone network at this stage is to allow the pre-trained language model to undergo appropriate domain fine-tuning on the particular phrasing of tactile text, thereby improving its understanding of the three semantic components of sensation, emotion, and association in tactile descriptions.
[0061] In the third stage of progressive training (corresponding to epochs 9 to 15), after the matching loss and balancing loss from the second stage have stabilized, the orthogonal decoupling loss is additionally activated on top of all previously activated loss terms. The unfrozen state of the T5 backbone network remains unchanged from the previous stage. The orthogonal decoupling loss is calculated as follows: compute the inner product matrix of the feature matrices output by any two experts from different dimensions in the current batch, and take its Frobenius norm as the quantitative index of orthogonal deviation. Then minimize the sum of the orthogonal deviations of all dimension pairs (sensation-emotion, emotion-association, sensation-association) so that the features output by the experts tend toward mutual orthogonality in the vector space, thereby reducing feature redundancy between dimensions.
[0062] The decision to activate orthogonal loss only in the third stage is based on clear engineering considerations. If orthogonal constraints are applied too early, before cross-modal alignment is established, the orthogonal loss might push expert features in directions that, while orthogonal, lack semantic meaning, thus hindering subsequent alignment. Introducing orthogonal constraints only after the first two stages have established a solid cross-modal alignment relationship and a basic division of labor among experts allows decoupling to proceed on a meaningful semantic basis.
[0063] The four loss functions mentioned above are finally weighted and summed according to their respective weight coefficients to form the multi-task joint training objective for this stage.
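The staged activation of loss terms can be captured in a small schedule table. The epoch boundaries and the number of unfrozen T5 layers follow the embodiment described above; the loss names, the toy loss values, and the weight coefficients are hypothetical placeholders.

```python
STAGES = [
    # (last epoch of stage, active loss terms, unfrozen T5 tail layers)
    (3, ("contrastive",), 0),
    (8, ("contrastive", "matching", "balance"), 3),
    (15, ("contrastive", "matching", "balance", "orthogonal"), 3),
]

def stage_for_epoch(epoch):
    """Return (active loss names, unfrozen layer count) for a 1-based epoch."""
    if epoch < 1:
        raise ValueError("epochs are numbered from 1")
    for last_epoch, losses, unfrozen in STAGES:
        if epoch <= last_epoch:
            return losses, unfrozen
    raise ValueError("epoch beyond the 15-epoch schedule")

def total_loss(values, weights, epoch):
    """Weighted sum over the loss terms that are active in this epoch."""
    losses, _ = stage_for_epoch(epoch)
    return sum(weights[name] * values[name] for name in losses)

# Made-up per-term loss values and weight coefficients.
values = {"contrastive": 2.0, "matching": 1.0, "balance": 0.5, "orthogonal": 4.0}
weights = {"contrastive": 1.0, "matching": 1.0, "balance": 0.5, "orthogonal": 0.25}
```

Calling `total_loss(values, weights, epoch)` at epochs 2, 5, and 10 shows the objective growing as terms are switched on, while inactive terms contribute nothing to the gradient.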
[0064] In the training and convergence determination procedure, the execution flow for each training batch is as follows: the tactile signal is fed into the native tactile encoder to extract the tactile feature vector, and the descriptive text is fed into the hybrid expert text encoder to extract the global text feature vector and the expert feature vectors of the three dimensions. The current training stage determines which loss terms are active. The weighted sum of the active loss terms is calculated as the total loss for the current batch, and the trainable parameters are updated through gradient backpropagation. The optimizer is AdamW, configured with an appropriate learning rate and weight decay coefficient. A learning rate warmup strategy is employed at the beginning of training to avoid parameter oscillations caused by excessively large initial gradients.
[0065] After each complete training epoch, the model's cross-modal retrieval performance is evaluated on the validation set. Training is considered converged when the validation set performance no longer shows substantial improvement over several consecutive epochs. Upon convergence, all parameters of the dimensionality-decoupled aligned model (including the native haptic encoder, T5 backbone network, routing assignment module, and multi-dimensional expert module) are frozen for use as a fixed feature extractor in the second stage.
[0066] The second stage trains the conditional diffusion generation module. Its goal is to enable the module to learn to gradually remove noise, starting from random Gaussian noise, and recover a tactile signal waveform that meets the expected semantic constraints, based on the text conditions and dimensional conditions output by the first-stage dimensional decoupling and alignment module.
[0067] Before training begins, the dimensional decoupling and alignment module trained in the first stage is loaded first. All its internal parameters are frozen so that it no longer participates in gradient updates; from then on, this module only performs conditional feature extraction. Next, the conditional diffusion generation network based on the DiffWave architecture is loaded, and all its network parameters are initialized. The core of DiffWave is a feedforward bidirectional dilated convolutional architecture inspired by WaveNet. The network consists of an input projection layer, multiple residual blocks, and a skip connection aggregation layer. The residual blocks are organized into several groups, with the dilation rate within each group increasing exponentially (e.g., 1, 2, 4, 8), allowing the network to expand its receptive field layer by layer without increasing the number of parameters. In addition, the core parameters of the diffusion process need to be configured, including the total number of denoising steps $T$ ($T = 1000$ in this embodiment) and the noise scheduling strategy.
[0068] During conditional feature extraction, a batch of training samples is randomly selected from the training set. The text descriptions of all samples within the batch are input into a frozen hybrid expert text encoder to obtain the global text feature vector and expert feature vectors for three dimensions (sensation, emotion, and association) along with their routing weights. The global text feature vector serves as the global guiding condition for the diffusion model, while the expert feature vectors for different dimensions serve as local dimensional guiding conditions.
[0069] During forward noise addition, the forward diffusion process is applied to the real tactile signal $x_0$ of each sample within the batch: following the noise schedule, Gaussian noise is added gradually from time step $t = 0$ to $t = T$ to obtain the noisy signal at each time step. The forward process has a closed-form solution, so the noisy signal at any time step can be sampled in a single step without step-by-step iteration.
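The closed-form forward sampling mentioned above can be sketched with the standard DDPM relation, in which the noisy signal is a mix of the clean signal and Gaussian noise weighted by the cumulative product of the per-step retention factors. The linear noise schedule and its endpoints (1e-4 to 0.02) are assumptions; the application only fixes the number of steps at 1000.

```python
import math
import random

def cumulative_alphas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) under a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, step, alpha_bars, rng):
    """Jump straight to noise level `step` (0-based index into alpha_bars)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    keep = math.sqrt(alpha_bars[step])          # weight of the clean signal
    spread = math.sqrt(1.0 - alpha_bars[step])  # weight of the noise
    return [keep * x + spread * e for x, e in zip(x0, noise)], noise

def mse(a, b):
    """Mean squared error, the form of the noise-prediction training loss."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

alpha_bars = cumulative_alphas(1000)
rng = random.Random(0)
x0 = [math.sin(0.1 * n) for n in range(64)]     # toy "tactile" waveform
x_t, noise = q_sample(x0, 499, alpha_bars, rng)
```

During training, a network would receive `x_t`, the step index, and the condition features, and `mse` would compare its output with `noise`; the final entry of `alpha_bars` is close to zero, which matches the statement that the signal ends up as nearly pure Gaussian noise.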
[0070] During conditional injection and noise prediction, a time step $t$ is randomly sampled, and the noisy signal $x_t$ at that time step, the time-step encoding, and the conditional features are input into the diffusion network. Conditional features are injected into each residual block of the network via the FiLM mechanism: in each residual block, a dedicated multilayer perceptron (MLP) maps the condition vector to a scaling factor γ and a translation β, performing an element-wise affine transformation on the hidden layer features within the residual block after dilated convolution and gated activation. The diffusion network outputs the predicted value $\epsilon_\theta(x_t, t, c)$ of the noise added at that time step, where $c$ denotes the conditional features.
[0071] During loss calculation and parameter updates, the mean squared error loss value of the noise prediction is backpropagated. Since all parameters of the alignment model are frozen, gradient updates only apply to the parameters of the diffusion network itself. The optimizer also uses AdamW.
[0072] Conditional random dropout is applied during training. The hierarchical classifier-free guidance mechanism requires the conditional diffusion generation module to have both conditional and unconditional generation capabilities. To meet this prerequisite, during training the conditions of some samples are randomly set to empty vectors with a certain probability (10% to 20% in this embodiment), so that the model learns to predict noise from signal priors even when conditional input is absent.
[0073] The conditional discarding operation is implemented at two levels. The first level discards all conditions, that is, both text conditions and dimensional conditions are set to empty. The second level discards only the dimensional expert conditions, retaining the global text conditions. Through this hierarchical discarding strategy, the model has the ability to compute three types of outputs during the inference phase: unconditional prediction, prediction based solely on global text conditions, and prediction based on all conditions, laying the training foundation for the operation of the hierarchical guidance mechanism.
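The two-level dropout can be sketched as a single random draw per sample. The embodiment gives an overall dropout probability of 10% to 20% but does not say how it is split between the two levels, so the 10% per level used here, the use of `None` for an empty condition, and the names are assumptions.

```python
import random

def drop_conditions(text_cond, expert_cond, rng,
                    p_drop_all=0.1, p_drop_expert=0.1):
    """Return the (text, expert) conditions to use for one training sample."""
    r = rng.random()
    if r < p_drop_all:
        return None, None                 # level 1: fully unconditional
    if r < p_drop_all + p_drop_expert:
        return text_cond, None            # level 2: global text only
    return text_cond, expert_cond         # all conditions kept

rng = random.Random(0)
draws = [drop_conditions("text", "experts", rng) for _ in range(10000)]
share_uncond = sum(d == (None, None) for d in draws) / len(draws)
share_text_only = sum(d == ("text", None) for d in draws) / len(draws)
```

Training on all three cases is what lets the same network later produce the unconditional, text-only, and fully conditioned predictions needed by hierarchical guidance.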
[0074] In addition, to better adapt to the low-frequency characteristics of tactile signals, a gating loss can optionally be introduced as an auxiliary objective for optimizing the diffusion model. The purpose of this loss is to encourage the model to maintain a reasonable distribution of expert weights in each dimension during the denoising process, preventing the guidance signal of any dimension from being improperly compressed or over-amplified during training iterations.
[0075] The training process involves continuous iteration with different training batches. After each complete training epoch, the generation quality is systematically evaluated on the validation set. The evaluation process is as follows: the text descriptions of each sample in the validation set are input into the frozen alignment model to obtain conditional features; the diffusion model generates tactile signals from random noise; and the quality metric between the generated signal and the real signal is calculated. When the metric on the validation set shows no substantial improvement for several consecutive epochs, the training is considered to have converged. At this point, the parameter state of the diffusion network is saved as the final conditional diffusion generation module.
[0076] Figure 1 shows a schematic diagram of a controllable text-to-tactile signal generation method based on perceptual decoupling according to an embodiment of this application. As shown in Figure 1, the method includes step 110: in response to receiving descriptive text of the tactile signal to be generated, the descriptive text is input into the language model of the hybrid expert text encoder to generate a basic semantic representation vector. The descriptive text includes sensory dimension text, emotional dimension text, and associative dimension text. According to one implementation, the descriptive text can be input by the user or automatically generated by the device. The descriptive text includes text of multiple dimensions, each describing information of one dimension. For example, sensory dimension text includes "rough," "smooth," "high-frequency vibration," etc.; emotional dimension text includes "pleasure," "tension," "relaxation," etc.; and associative dimension text includes "heartbeat," "engine," "raindrop," etc. The language model can be implemented as a T5 language model (Text-to-Text Transfer Transformer) for semantic encoding of the text, outputting the basic semantic representation vector. Subsequently, step 120 is executed: the basic semantic representation vector is input into the sensory expert module, emotional expert module, and associative expert module of the hybrid expert text encoder to obtain the sensory dimension expert feature vector, the emotional dimension expert feature vector, and the associative dimension expert feature vector, respectively. According to one implementation, the hybrid expert text encoder includes a multi-dimensional expert module, which comprises a sensory dimension expert module, an emotional dimension expert module, and an associative dimension expert module. Each expert module performs dimension-specific feature mapping on the basic semantic representation vector, outputting the sensory dimension expert feature vector, the emotional dimension expert feature vector, and the associative dimension expert feature vector, respectively.
[0077] Subsequently, step 130 is executed, generating a global text feature vector based on the multi-dimensional expert feature vectors, which include the sensory dimension expert feature vector, the emotional dimension expert feature vector, and the associative dimension expert feature vector. This includes: inputting the basic semantic representation vector into the routing allocation module to obtain a three-dimensional weight vector; and generating the global text feature vector based on the sensory dimension expert feature vector, the emotional dimension expert feature vector, the associative dimension expert feature vector, and the three-dimensional weight vector. The routing allocation module predicts the routing weights $w_s$, $w_e$, $w_a$ of the three dimensional experts based on the semantic content of the text; these are the routing weights for the expert feature vectors of the sensory, emotional, and associative dimensions, respectively. Combining $w_s$, $w_e$, $w_a$ yields the three-dimensional weight vector. The expert feature vectors of each dimension are then weighted by their routing weights and aggregated into the global text feature vector. The allocation of routing weights is adaptive. When the text focuses on describing physical properties (such as "rough high-frequency vibration"), the routing weight $w_s$ of the sensory dimension expert feature vector automatically increases; when it focuses on portraying emotional experience (such as "relaxing"), the routing weight $w_e$ of the emotional dimension expert feature vector increases accordingly; and when it focuses on expressing scene associations (such as "heartbeat"), the routing weight $w_a$ of the associative dimension expert feature vector receives a higher allocation.
The adaptive allocation process of routing weights requires no manual user intervention. The routing allocation module automatically adjusts the weight ratio of the three dimensions of experts based on the semantic content of the input text. When the text description involves multiple dimensions simultaneously (for example, a relaxing rough vibration simultaneously touches on both the emotional and sensory dimensions), the routing allocation module automatically assigns higher weights to the relevant dimensions, achieving multi-dimensional collaborative control.
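The routing and fusion step can be sketched as a softmax over three router scores followed by a weighted sum of the expert feature vectors. The softmax normalization, the toy scores, and the names are assumptions; the application states only that the routing weights are predicted from the text and used to weight the expert vectors.

```python
import math

def softmax(logits):
    """Normalize router scores into weights that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_experts(expert_vectors, weights):
    """Weighted sum of the sensory, emotional, and associative vectors."""
    dim = len(expert_vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, expert_vectors))
            for k in range(dim)]

# Toy router scores for a text that mostly describes physical properties.
route_weights = softmax([2.0, 0.0, 0.0])       # sensory, emotional, associative
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy expert feature vectors
global_text = fuse_experts(experts, route_weights)
```

Because the weights are normalized, raising the score of one dimension necessarily lowers the relative share of the others, which is how a sensory-heavy description ends up dominated by the sensory expert's features.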
[0078] According to one embodiment, initial noise is also generated in advance: random Gaussian noise of the same length as the target tactile signal is generated as the initial input to the reverse denoising process. Generating the tactile signal with the diffusion network of the conditional diffusion generation module includes progressive reverse denoising of the initial noise: starting from the maximum time step $t = T$, the reverse denoising operation is executed step by step. Within each denoising step, multi-level conditional information is fused through the hierarchical classifier-free guidance mechanism.
[0079] According to one implementation, an unconditional baseline prediction term can be generated based on the initial noise: the current noisy signal $x_t$ and the time step $t$ are input into the diffusion network with no input provided at the condition end, yielding the model's baseline noise prediction in the unconditional state.
[0080] Subsequently, step 140 is executed, generating a global text conditional prediction term based on the global text feature vector and the initial noise; specifically, the global text feature vector and the initial noise are input into the conditional diffusion generation module to obtain the global text conditional prediction term. According to one implementation, the current noisy signal $x_t$, the time step $t$, and the global text feature vector $c_{text}$ can be fed into the diffusion network to obtain the global text conditional prediction term $\epsilon_{text}$.
[0081] Subsequently, step 150 is executed: a dimensional expert fusion prediction term is generated based on the multi-dimensional expert feature vectors and the initial noise. This includes: generating a sensory dimension prediction term, an emotional dimension prediction term, and an associative dimension prediction term based on the sensory dimension expert feature vector, the emotional dimension expert feature vector, the associative dimension expert feature vector, and the initial noise; and determining the dimensional expert fusion prediction term based on the sensory dimension prediction term, the emotional dimension prediction term, the associative dimension prediction term, and the network weights output by the conditional diffusion generation module. According to one implementation, the expert feature vectors of the sensory, emotional, and associative dimensions can each be fed into the diffusion network together with the current noisy signal and the time step to obtain the noise prediction under each dimension's condition, namely the sensory dimension prediction term, the emotional dimension prediction term, and the associative dimension prediction term. These are then weighted and combined according to the routing weights to generate the dimensional expert fusion prediction term.
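The weighted combination in step 150 can be sketched as follows. This assumes the fusion is a plain weighted sum of the three per-dimension noise predictions; the function name `fuse_expert_predictions` and the toy values are illustrative only.

```python
import numpy as np

def fuse_expert_predictions(eps_sensory, eps_emotional, eps_associative, weights):
    """Combine three per-dimension noise predictions into one fused prediction.

    Assumption: a weighted sum using the three routing weights.
    """
    preds = np.stack([eps_sensory, eps_emotional, eps_associative])  # (3, L)
    return np.tensordot(weights, preds, axes=1)                      # (L,)

# Toy predictions for a signal of length 4.
eps_sensory = np.full(4, 1.0)
eps_emotional = np.full(4, 2.0)
eps_associative = np.full(4, 3.0)
route_w = np.array([0.5, 0.3, 0.2])

eps_expert = fuse_expert_predictions(eps_sensory, eps_emotional, eps_associative, route_w)
# Each element equals 0.5*1 + 0.3*2 + 0.2*3 = 1.7
```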
[0082] Finally, step 160 is executed, generating a tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and the unconditional baseline prediction term. This includes: generating a global text guidance term based on the global text conditional prediction term and the unconditional baseline prediction term; generating a dimensional expert guidance term based on the dimensional expert fusion prediction term and the global text conditional prediction term; and generating a tactile signal based on the unconditional baseline prediction term, the global text guidance term, the dimensional expert guidance term, the global guidance weight of the global text guidance term, and the expert guidance weight of the dimensional expert guidance term. The global guidance weight controls the degree to which the tactile signal fits the overall semantics of the descriptive text, and the expert guidance weight controls the expression strength of the tactile signal on the multi-dimensional expert feature vector. According to one implementation, a weighted sum is performed according to the hierarchical guidance formula to obtain the comprehensive denoising prediction of the current step. This hierarchical guidance structure decouples global semantic control from dimensional detail control in operation. Users can independently adjust the expression intensity of a certain dimension without interfering with the overall semantic direction, and can also adjust the semantic fit while maintaining the stability of dimensional details.
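The hierarchical guidance formula itself did not survive in this text. The sketch below shows one form consistent with the description, assuming each guidance term is a difference of two predictions, as in standard classifier-free guidance. The names `w_global` and `w_expert` stand for the global guidance weight and the expert guidance weight and are hypothetical.

```python
import numpy as np

def hierarchical_guidance(eps_uncond, eps_global, eps_expert, w_global, w_expert):
    """Combine the three predictions into one denoising prediction.

    Assumed form (not confirmed by the source):
      global text guidance term  = eps_global - eps_uncond
      dimensional expert term    = eps_expert - eps_global
    """
    global_term = eps_global - eps_uncond
    expert_term = eps_expert - eps_global
    return eps_uncond + w_global * global_term + w_expert * expert_term

eps_u = np.array([0.0, 0.0])
eps_g = np.array([1.0, 2.0])
eps_e = np.array([1.5, 1.0])

only_global = hierarchical_guidance(eps_u, eps_g, eps_e, w_global=1.0, w_expert=0.0)
both = hierarchical_guidance(eps_u, eps_g, eps_e, w_global=1.0, w_expert=1.0)
```

Under this form, setting the expert weight to zero reduces the result to ordinary text-conditioned guidance, while changing the expert weight leaves the global term untouched. This matches the decoupling property described in the paragraph.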
[0083] According to one embodiment, the intermediate signal generated from the initial noise is further processed: it can be processed iteratively multiple times, with steps 140-160 executed in each iteration. In each iteration, the time step is decreased by 1 until it reaches 0. At this point, the noise in the noisy signal has been gradually removed, and the resulting output is the final generated tactile signal.
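The iteration can be sketched as a loop that starts from Gaussian noise and counts the time step down to zero. The source does not specify the sampler or the noise schedule, so this sketch swaps in a deterministic DDIM-style update with a linear beta schedule purely for illustration. `predict_noise` is a hypothetical callable standing in for the guided prediction of steps 140-160.

```python
import numpy as np

def reverse_denoise(predict_noise, length, num_steps=50, seed=0):
    """Run a reverse denoising loop from the maximum time step down to 0.

    Assumptions: linear beta schedule and a DDIM-style update (eta = 0);
    neither is stated in the source.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)            # alpha_bar[t-1] for t = 1..T
    x = rng.standard_normal(length)                # initial Gaussian noise
    for t in range(num_steps, 0, -1):              # t = T, T-1, ..., 1
        eps = predict_noise(x, t)                  # guided noise prediction
        a_t = alpha_bar[t - 1]
        a_prev = alpha_bar[t - 2] if t > 1 else 1.0
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)      # clean-signal estimate
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # move to step t-1
    return x

# Toy stand-in for the diffusion network: always predicts zero noise.
toy_predictor = lambda x, t: np.zeros_like(x)
signal = reverse_denoise(toy_predictor, length=128)
```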
[0084] In step 160, adjusting the global guidance weight controls how well the signal fits the overall semantics of the text. Increasing the global guidance weight enhances semantic consistency, but a value that is too high may introduce signal distortion due to over-guidance; appropriately reducing it increases the diversity of the generated signals, at the cost of a corresponding decrease in semantic fit. Adjusting the expert guidance weight controls the expressive strength of specific details in each dimension. When the expert guidance weight is increased, the textures become clearer and sharper, the emotional characteristics become more vivid and intense, and the associative connections become more explicit and specific; when it is decreased, the details in each dimension tend toward a balanced, harmonious integration.
[0085] According to one embodiment, this application determines the clipping boundary of the tactile signal based on the tactile signal; and sets the signal amplitude of the tactile signal based on the clipping boundary. In each iteration, this application determines the clipping boundary of the tactile signal based on an intermediate signal; and sets the signal amplitude of the intermediate signal based on the clipping boundary.
[0086] After each denoising step completes the signal update, an adaptive dynamic threshold mechanism is used to constrain the signal amplitude. Specifically, a high percentile (e.g., the 99.5th) of the absolute values of all generated signals in the current batch is calculated and used as the dynamic clipping boundary. Only extreme outliers exceeding the boundary are smoothly truncated, preserving most of the amplitude information in the signal. Compared to fixed hard truncation (forcibly truncating values exceeding the range to a fixed boundary), the dynamic threshold's boundary adapts to the actual signal distribution, avoiding the introduction of discontinuities and high-frequency artifacts at the truncation point. When the guidance weights are high (that is, when the global guidance weight and the expert guidance weight take larger values to enhance semantic alignment), the amplitude of the generated signal fluctuates more, making the adaptive nature of this mechanism particularly crucial.
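A minimal sketch of the dynamic threshold follows. The source describes a smooth truncation whose exact form is not given, so this sketch substitutes a plain clip at the dynamic boundary as a simplification. The function name `dynamic_clip` is hypothetical.

```python
import numpy as np

def dynamic_clip(batch, percentile=99.5):
    """Clip a batch of signals at a percentile of their absolute values.

    Simplification: uses a plain clip at the dynamic boundary, whereas the
    source describes a smooth truncation of unspecified form.
    """
    boundary = np.percentile(np.abs(batch), percentile)
    return np.clip(batch, -boundary, boundary), boundary

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 250))
batch[0, 0] = 100.0                      # inject one extreme outlier

clipped, boundary = dynamic_clip(batch)
changed = int(np.sum(clipped != batch))  # number of samples that were truncated
```

Only the few samples above the percentile boundary are altered, so the bulk of the amplitude distribution passes through unchanged.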
[0087] It should be noted that the storage medium (computer-readable medium) described above in this invention can be a computer-readable signal medium, a non-transitory computer-readable storage medium, or any combination thereof. A non-transitory computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a non-transitory computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.
[0088] In this invention, a non-transitory computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a non-transitory computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.
[0089] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.
[0090] The above description is merely a partial embodiment of the present invention and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure in this invention is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-disclosed concept. For example, technical solutions formed by substituting the above features with (but not limited to) technical features with similar functions disclosed in this invention.
[0091] Furthermore, although the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in sequential order. Multitasking and parallel processing may be advantageous in certain environments. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the invention. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.
[0092] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.
Claims
1. A method for controllable generation of text-to-tactile signals based on perceptual decoupling, the method comprising: in response to receiving a descriptive text for generating a tactile signal, inputting the descriptive text into a language model of a hybrid expert text encoder to generate a basic semantic representation vector, wherein the descriptive text includes sensory dimension text, emotional dimension text, and associative dimension text; inputting the basic semantic representation vector into the sensory expert module, emotional expert module, and associative expert module of the hybrid expert text encoder to obtain the sensory dimension expert feature vector, emotional dimension expert feature vector, and associative dimension expert feature vector, respectively; generating a global text feature vector based on multi-dimensional expert feature vectors, wherein the multi-dimensional expert feature vectors include the sensory dimension expert feature vector, the emotional dimension expert feature vector, and the associative dimension expert feature vector; generating a global text conditional prediction term based on the global text feature vector and the initial noise; generating a dimensional expert fusion prediction term based on the multi-dimensional expert feature vectors and the initial noise; and generating a tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and the unconditional baseline prediction term, wherein the unconditional baseline prediction term is generated based on the initial noise.
2. The method as described in claim 1, wherein the hybrid expert text encoder further includes a routing allocation module, and generating the global text feature vector based on the multi-dimensional expert feature vectors includes: inputting the basic semantic representation vector into the routing allocation module to obtain a three-dimensional weight vector; and generating the global text feature vector based on the sensory dimension expert feature vector, the emotional dimension expert feature vector, the associative dimension expert feature vector, and the three-dimensional weight vector.
3. The method as described in claim 2, wherein generating the global text conditional prediction term based on the global text feature vector and the initial noise includes: inputting the global text feature vector and the initial noise into a conditional diffusion generation module to obtain the global text conditional prediction term.
4. The method as described in claim 3, wherein generating the dimensional expert fusion prediction term based on the multi-dimensional expert feature vectors and the initial noise includes: generating a sensory dimension prediction term, an emotional dimension prediction term, and an associative dimension prediction term based on the sensory dimension expert feature vector, the emotional dimension expert feature vector, the associative dimension expert feature vector, and the initial noise; and determining the dimensional expert fusion prediction term based on the sensory dimension prediction term, the emotional dimension prediction term, the associative dimension prediction term, and the network weights output by the conditional diffusion generation module.
5. The method according to any one of claims 1-4, wherein generating the tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and the unconditional baseline prediction term includes: generating a global text guidance term based on the global text conditional prediction term and the unconditional baseline prediction term; generating a dimensional expert guidance term based on the dimensional expert fusion prediction term and the global text conditional prediction term; and generating the tactile signal based on the unconditional baseline prediction term, the global text guidance term, the dimensional expert guidance term, the global guidance weight of the global text guidance term, and the expert guidance weight of the dimensional expert guidance term, wherein the global guidance weight controls the degree to which the tactile signal fits the overall semantics of the descriptive text, and the expert guidance weight controls the expression strength of the tactile signal on the multi-dimensional expert feature vectors.
6. The method according to any one of claims 1-4, further comprising: determining the clipping boundary of the tactile signal based on the tactile signal; and setting the signal amplitude of the tactile signal according to the clipping boundary.
7. A system for controllable generation of text-to-tactile signals based on perceptual decoupling, the system comprising: a hybrid expert text encoder, including: a language model for generating a basic semantic representation vector based on a received descriptive text for generating a tactile signal, wherein the descriptive text includes sensory dimension text, emotional dimension text, and associative dimension text; and a sensory expert module, an emotional expert module, and an associative expert module for processing the basic semantic representation vector and outputting a sensory dimension expert feature vector, an emotional dimension expert feature vector, and an associative dimension expert feature vector, respectively; wherein the hybrid expert text encoder is further used to generate a global text feature vector based on multi-dimensional expert feature vectors, the multi-dimensional expert feature vectors including the sensory dimension expert feature vector, the emotional dimension expert feature vector, and the associative dimension expert feature vector; and a conditional diffusion generation module for generating a global text conditional prediction term based on the global text feature vector and the initial noise, generating a dimensional expert fusion prediction term based on the multi-dimensional expert feature vectors and the initial noise, and generating a tactile signal based on the global text conditional prediction term, the dimensional expert fusion prediction term, and the unconditional baseline prediction term, wherein the unconditional baseline prediction term is generated based on the initial noise.
8. A computing device, comprising: one or more processors; and a memory for storing one or more computer programs, characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1-6.
9. A computer-readable storage medium having a computer program or instructions stored thereon, characterized in that, when the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1-6.
10. A computer program product, comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the steps of the method according to any one of claims 1-6.