A continuous vector discretization representation method, system and application
By employing an attention mechanism and a continuous vector discretization representation method based on binary spherical quantization, the problems of low parameter efficiency, poor compatibility, and unstable training in medical image processing are solved, achieving efficient and accurate image and video processing. This method is applicable to fields such as medical image storage and transmission, AI generation, and autonomous driving perception.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING QIANZI MEIER BIOTECHNOLOGY CO LTD
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies in medical image processing suffer from problems such as low parameter efficiency, poor image-video compatibility, imbalance between reconstruction quality and computational efficiency, and unstable training, making it difficult to achieve efficient and accurate image and video processing.
A continuous vector discretization representation method based on an attention mechanism and binary spherical quantization is adopted. Features are extracted by an attention encoder, discretized by a binary spherical quantization module, and decoded and reconstructed by an attention decoder, building an implicit codebook and enabling unified processing of images and videos.
It achieves lightweight models, accurate image reconstruction, and fast processing speed. It can process images and videos in a unified manner, reduce computational complexity, and is suitable for fields such as medical image storage and transmission, AI generation, and autonomous driving perception.
Figure CN122290908A_ABST
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and biomedical engineering technology, specifically to a continuous vector discretization representation method, system, and application based on an attention mechanism and binary spherical quantization that can efficiently process medical images (including diagnostic images, surgical videos, monitoring dynamic images, etc.).
Background Technology
[0002] With the widespread adoption of multimodal medical devices, efficient processing of medical images has become a key requirement in the medical field. This encompasses long-term compressed storage of medical images, real-time transmission of high-definition images during remote diagnosis, and accurate analysis and reconstruction of images in AI-assisted diagnosis. The current mainstream approach is to borrow logic from natural language processing to convert continuous medical images or video signals into discrete "visual symbols," and then achieve image processing through a complete process of "encoding → symbol conversion → decoding."
[0003] The existing technology mainly suffers from the following types of defects: 1. The contradiction between "bulky but effective" and "lightweight but hard to use": Dictionary matching techniques (such as the VQ-VAE series) rely on a pre-trained "fixed dictionary" (codebook). While the processing effect is relatively accurate, the larger the "dictionary," the slower the search speed, and it is prone to confusion when encountering a large amount of medical data (i.e., dictionaries trained on small datasets are difficult to adapt to images of rare cases). In addition, this type of technique is prone to situations where entries in the dictionary are never used, and balancing "accurate matching" against "good image reconstruction" is extremely difficult.
[0004] Dictionary-free lightweight techniques (such as LFQ): These techniques do not train a dictionary and directly perform a crude binary classification of intermediate features as either "positive" or "negative." While fast, they do not normalize the intermediate features, which leads to uncontrollable errors during transformation and can blur lesions. Furthermore, the computational load increases dramatically when processing complex medical data.
[0005] 2. Image and video processing are not universally applicable: In existing technologies, models for processing images cannot identify the movement of lesions and surgical trajectories in videos; while specialized video processing technologies (such as "spatiotemporal dedicated devices" using 3DCNN) are bulky, consume a lot of computing power, and cannot flexibly process images of varying lengths (such as short ultrasound image sequences and long intensive care videos).
[0006] 3. Image quality and processing speed are mutually exclusive: To restore clarity, the "dictionary" needs to be enlarged, which leads to storage pressure and reduced processing speed; adding "time-aware" capabilities significantly increases the amount of computation; traditional compression standards (H.264, HEVC) do not restore the quality as well as AI methods; AI compression methods (VCT, DVC) are too complex and have slow encoding and decoding speeds, making it difficult to meet real-time requirements.
[0007] 4. Training difficulties and gradient propagation interruption: When converting continuous signals into discrete symbols, existing technologies encounter the problem of "gradient propagation interruption." LFQ technology suffers from large errors, making it difficult for the model to learn stable patterns; dictionary technology faces learning difficulties due to the mismatch between continuous features and discrete codebooks.
[0008] Therefore, there is an urgent need for a continuous vector discretization technique that can balance lightweight models, accurate image restoration, fast processing speed, unified processing of images and videos, and stable training.
Summary of the Invention
[0009] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.
[0010] The technical problem to be solved by the present invention is to provide a continuous vector discretization representation method and system based on attention mechanism and binary spherical quantization, so as to solve the problems of low parameter efficiency, poor image and video compatibility, imbalance between reconstruction quality and computational efficiency, and unstable training in the prior art.
[0011] To solve the above-mentioned technical problems, the present invention provides the following technical solution: a continuous vector discretization representation method, comprising the following steps: S1, acquiring a continuous image signal to be processed, the continuous image signal including a single frame image or multiple frames of video; S2, extracting features from the continuous image signal using a pre-trained attention encoder to obtain a high-dimensional feature vector; wherein, the attention encoder is configured with a block causal masking mechanism to uniformly process the spatiotemporal features of static images and dynamic videos; S3, discretizing the high-dimensional feature vector through a binary spherical quantization module to obtain a binary discrete code; wherein, the discretization process includes: projecting the high-dimensional feature vector to a low-dimensional space and performing spherical normalization, performing binary quantization using a learnable hyperplane to construct an implicit codebook, and projecting the binary code back into the feature space; S4, decoding and reconstructing the quantized feature vector processed by the BSQ module using a pre-trained attention decoder to obtain the reconstructed continuous image signal.
[0012] As a preferred embodiment of the continuous vector discretization representation method described in this invention, step S3 involves discretizing the high-dimensional feature vector using a binary spherical quantization module, specifically including: S31, dimensionality reduction: projecting the high-dimensional feature vector z of dimension in_dim onto a low-dimensional space of dimension latent_dim through a linear layer to obtain vector v, with the formula: v = Linear(z); S32, spherical normalization: calculating the L2 norm of vector v, dividing v by its L2 norm to obtain the unit vector u mapped onto the unit hypersphere, with the formula: u = v / ||v||_2; S33, binary quantization: using the learnable hyperplane parameters of dimension latent_dim×k_bits, projecting the unit vector u onto a k_bits dimensional space to obtain the projection vector projections; performing sign function processing on the projection vector projections to obtain the binary vector binary_code; the calculation formula is: Projections = u · hyperplanes, binary_code = sign(Projections); S34, Normalization and Implicit Codebook Construction: Divide the binary vector binary_code by the square root of k_bits to obtain the normalized binary code normalized_binary, with the formula: normalized_binary = binary_code / √(k_bits); S35, Feature Reconstruction: Map the normalized binary code back to the latent_dim dimensional feature space through hyperplane back projection to obtain the quantized reconstruction vector quantized_vec.
[0013] As a preferred embodiment of the continuous vector discretization representation method described in this invention, in step S33, the sign function is applied to each element of the projection vector projections according to the following rule: an element greater than 0 maps to 1; an element less than 0 maps to -1; an element equal to 0 also maps to 1.
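The zero-handling rule matters in practice because common library sign functions (torch.sign, for example) return 0 for a zero input, which would introduce a third symbol into the code. A minimal sketch of the rule in plain Python follows; the function name binary_sign and the sample values are illustrative assumptions, not names taken from the invention's code.

```python
def binary_sign(x: float) -> int:
    """Return +1 for x >= 0 and -1 for x < 0, so that zero maps to +1."""
    return 1 if x >= 0 else -1

# Hypothetical projection values, including an exact zero.
projections = [0.7, -0.2, 0.0, -3.5]
binary_code = [binary_sign(p) for p in projections]
print(binary_code)  # [1, -1, 1, -1]
```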
[0014] As a preferred embodiment of the continuous vector discretization representation method described in this invention, in step S2, feature extraction of the continuous image signal is performed using an attention encoder, specifically including: if the input is a single-frame image, the image is segmented into non-overlapping image blocks, flattened into a one-dimensional sequence, and then transformed into a high-dimensional feature vector through linear projection and directly input into the attention encoder; if the input is a multi-frame video, the multi-frame video is segmented into image blocks of the same size in chronological order, flattened, and then spatiotemporal position encoding is added before inputting into the attention encoder; the spatiotemporal position encoding is formed by superimposing spatial position encoding and temporal position encoding.
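The block segmentation step can be sketched as follows. This is a minimal illustration in plain Python on a single-channel image stored as a list of rows; the function names patchify and patchify_video, the row-major patch order, and the toy 4×4 image are assumptions made for the example, and the linear projection that follows is omitted.

```python
def patchify(image, p):
    """Split an H x W image (list of rows) into non-overlapping p x p blocks.
    Each block is flattened to a list of p*p values; blocks are returned in
    row-major order."""
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append([image[top + dy][left + dx]
                            for dy in range(p) for dx in range(p)])
    return patches

def patchify_video(frames, p):
    """Apply the same segmentation to every frame, in chronological order."""
    return [patch for frame in frames for patch in patchify(frame, p)]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
tokens = patchify(image, 2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```

A single image and a video share the same token layout under this scheme: a video simply contributes one group of blocks per frame.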
[0015] As a preferred embodiment of the continuous vector discretization representation method described in this invention, in step S4, the quantized feature vector is decoded and reconstructed using an attention decoder. Specifically, the decoder adopts an attention model structure symmetrical to the encoder, receives the quantized feature vector, and maps the discrete features back to the pixel space through linear projection, multi-layer attention calculation, and multi-layer perceptron head processing. For video input, the inter-frame motion continuity is maintained through spatiotemporal position coding, and the video is reassembled into a complete video sequence to avoid inter-frame flickering or motion distortion.
[0016] As a preferred embodiment of the continuous vector discretization representation method described in this invention, the binary spherical quantization module uses a straight-through estimator (STE) for gradient propagation during training: during backpropagation, the gradient with respect to the discretized binary code is assigned directly to the continuous features before normalization, so as to ensure smooth gradient propagation.
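The idea of copying the gradient across the non-differentiable sign step can be illustrated with a hand-written forward and backward pass. This is a sketch under stated assumptions: the function names, the squared-error toy objective, and the learning rate are illustrative and do not come from the invention's code. In PyTorch the same effect is commonly obtained with the idiom `x + (q - x).detach()`.

```python
def quantize_forward(x):
    """Forward pass: hard binary quantization, with zero mapped to +1."""
    return [1.0 if v >= 0 else -1.0 for v in x]

def quantize_backward(grad_output):
    """Backward pass of the straight-through estimator: the quantizer is
    treated as the identity, so the incoming gradient is copied unchanged."""
    return list(grad_output)

# Toy objective (assumption): push the binary code toward a target code.
x = [0.3, -0.4]            # continuous features
target = [-1.0, -1.0]      # desired binary code
learning_rate = 0.1

q = quantize_forward(x)                                   # [1.0, -1.0]
loss_before = sum((a - b) ** 2 for a, b in zip(q, target))
grad_q = [2 * (a - b) for a, b in zip(q, target)]         # dLoss/dq
grad_x = quantize_backward(grad_q)                        # copied to x
x = [v - learning_rate * g for v, g in zip(x, grad_x)]    # gradient step
loss_after = sum((a - b) ** 2
                 for a, b in zip(quantize_forward(x), target))
print(loss_before, loss_after)  # 4.0 0.0
```

Without the copy, the derivative of the sign function is zero almost everywhere, so x would never receive an update and the first bit could not flip.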
[0017] Additionally, the present invention provides a continuous vector discretization representation system, comprising: a signal acquisition module for acquiring a continuous image signal to be processed, the continuous image signal including a single frame image or multiple frames of video; a feature encoding module for extracting features from the continuous image signal using a pre-trained attention encoder to obtain a high-dimensional feature vector; the attention encoder is configured with a block causal masking mechanism; a quantization discretization module for discretizing the high-dimensional feature vector using a binary spherical quantization module to obtain a binary discrete code; the quantization discretization module includes a dimensionality reduction unit, a spherical normalization unit, a binary quantization unit, and a feature reconstruction unit; and a decoding and reconstruction module for decoding and reconstructing the quantized feature vector processed by the quantization discretization module using a pre-trained attention decoder to obtain the reconstructed continuous image signal.
[0018] Additionally, the present invention also provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described above.
[0019] Additionally, the present invention provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the methods described above.
[0020] This invention provides a method for discretizing continuous vectors, which has the following advantages: By constructing an implicit codebook through "spherical normalization + binary quantization", there is no need to store a huge dictionary, the vector volume can be compressed by a factor of 100, and the quantization error is strictly bounded, accurately preserving the details and texture of the lesion. By using attention mechanisms and block causal masks, the same model can efficiently process single-frame images and videos of arbitrary length, significantly reducing deployment and maintenance costs. The computational complexity is reduced from exponential to linear, and the processing speed is 2.4 times that of the current best technology, meeting the needs of real-time scenarios such as live surgery broadcasts. It solves the problems of dead codebook entries and gradient interruption, converges quickly, and can run on ordinary medical equipment (such as ultrasound machines in primary hospitals) without complicated tuning. It is not only suitable for medical image storage and transmission, but can also be applied to fields such as AI generation and autonomous driving perception, and has significant economic and social value.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein: Figure 1 is the overall architecture flowchart of the "attention model-BSQ-encoder-decoder" provided by this invention.
Detailed Implementation
[0022] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.
[0023] To address the problems of low parameter efficiency, poor image-video compatibility, imbalance between reconstruction quality and computational efficiency, and unstable training in existing continuous image signal discretization techniques, this invention proposes a unified discretization scheme based on an "attention model" and binary spherical quantization (BSQ). Through an end-to-end architecture of "feature encoding → spherical quantization → decoding and reconstruction," combined with an efficient gradient transfer mechanism and flexible adaptation design, the core objectives of "lightweight, high fidelity, wide compatibility, and easy training" are achieved.
[0024] Specifically, a method for discretizing continuous vectors includes the following steps:
S1. Acquire the continuous image signal to be processed, which includes a single frame image or multiple frames of video.
S2. Use a pre-trained attention encoder to extract features from the continuous image signal to obtain high-dimensional feature vectors. The attention encoder is equipped with a block causal masking mechanism to uniformly process the spatiotemporal features of static images and dynamic videos.
S3. Discretize the high-dimensional feature vector using the binary spherical quantization module to obtain the binary discrete code. The discretization process includes: projecting the high-dimensional feature vectors to a low-dimensional space and performing spherical normalization, using a learnable hyperplane for binary quantization to construct an implicit codebook, and projecting the binary code back into the feature space.
S4. Use the pre-trained attention decoder to decode and reconstruct the quantized feature vector processed by the BSQ module to obtain the reconstructed continuous image signal.
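The text does not spell out the layout of the block causal mask. A common reading, assumed here, is that every token may attend to all tokens of its own frame and of earlier frames, but not to later frames; a single image is then one block with full attention. The sketch below builds such a mask in plain Python; the function name and the toy sizes are illustrative.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask where entry [i][j] is True if token i may attend to
    token j, i.e. if token j belongs to the same frame or an earlier one."""
    n = num_frames * tokens_per_frame
    return [[(j // tokens_per_frame) <= (i // tokens_per_frame)
             for j in range(n)]
            for i in range(n)]

# Two frames with two tokens each: the first frame sees only itself,
# the second frame sees both frames.
mask = block_causal_mask(num_frames=2, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# A single image is one block, so every token sees every other token.
image_mask = block_causal_mask(num_frames=1, tokens_per_frame=4)
```

Under this assumption the encoder weights are unchanged between images and videos; only the mask shape differs.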
[0025] Step S3 involves discretizing the high-dimensional feature vector using a binary spherical quantization module, specifically including:
S31. Dimensionality reduction: Project the high-dimensional feature vector z of dimension in_dim to the low-dimensional space of dimension latent_dim through a linear layer to obtain the vector v, with the formula: v = Linear(z).
S32. Spherical normalization: Calculate the L2 norm of vector v and divide v by its L2 norm to obtain the unit vector u mapped onto the unit hypersphere. The formula is: u = v / ||v||_2.
S33. Binary quantization: Using the learnable hyperplane parameters of dimension latent_dim×k_bits, the unit vector u is projected onto the k_bits-dimensional space to obtain the projection vector projections; the projection vector projections is processed by a sign function to obtain the binary vector binary_code. The calculation formula is: projections = u · hyperplanes, binary_code = sign(projections).
S34. Normalization and implicit codebook construction: Divide the binary vector binary_code by the square root of k_bits to obtain the normalized binary code normalized_binary, with the formula: normalized_binary = binary_code / √(k_bits).
S35. Feature reconstruction: The normalized binary code is mapped back to the latent_dim-dimensional feature space by back projection through the hyperplanes to obtain the quantized reconstruction vector quantized_vec.
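Steps S31 to S35 can be traced end to end with a small standard-library sketch. Several details are assumptions made for illustration: the linear layer is written without a bias term, the weights are random rather than learned, the sizes are toy values, and the back projection in S35 is taken to be multiplication by the transposed hyperplane matrix, which is the simplest map with the required dimensions but is not confirmed by the text.

```python
import math
import random

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def bsq_quantize(z, proj_down, hyperplanes):
    v = matvec(z, proj_down)                                 # S31: v = Linear(z)
    norm = math.sqrt(sum(a * a for a in v)) or 1e-12         # guard against a zero vector
    u = [a / norm for a in v]                                # S32: u = v / ||v||_2
    projections = matvec(u, hyperplanes)                     # S33: u . hyperplanes
    binary_code = [1 if p >= 0 else -1 for p in projections]  # sign, with sign(0) = 1
    k_bits = len(binary_code)
    normalized_binary = [b / math.sqrt(k_bits) for b in binary_code]  # S34
    # S35 (assumption): back projection through the transposed hyperplanes.
    hyperplanes_t = [list(col) for col in zip(*hyperplanes)]
    quantized_vec = matvec(normalized_binary, hyperplanes_t)
    return u, binary_code, normalized_binary, quantized_vec

random.seed(0)
in_dim, latent_dim, k_bits = 8, 4, 6      # toy sizes, not the values used in the text
proj_down = [[random.gauss(0, 1) for _ in range(latent_dim)] for _ in range(in_dim)]
hyperplanes = [[random.gauss(0, 1) for _ in range(k_bits)] for _ in range(latent_dim)]
z = [random.gauss(0, 1) for _ in range(in_dim)]

u, binary_code, normalized_binary, quantized_vec = bsq_quantize(z, proj_down, hyperplanes)
print(binary_code, len(quantized_vec))
```

Both u and normalized_binary have unit L2 norm: the first by construction in S32, the second because k_bits entries of magnitude 1/√(k_bits) always sum to a squared norm of 1.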
[0026] Furthermore, in step S33, the sign function is applied to each element of the projection vector projections according to the following rule: an element greater than 0 maps to 1; an element less than 0 maps to -1; an element equal to 0 also maps to 1.
[0027] Furthermore, in step S2, the attention encoder is used to extract features from the continuous image signal. Specifically, if the input is a single-frame image, the image is segmented into non-overlapping image blocks, flattened into a one-dimensional sequence, and then transformed into a high-dimensional feature vector through linear projection and directly input into the attention encoder. If the input is a multi-frame video, the multi-frame video is segmented into image blocks of the same size in chronological order, flattened, and then spatiotemporal position coding is added before inputting into the attention encoder. The spatiotemporal position coding is formed by superimposing spatial position coding and temporal position coding.
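The superposition of spatial and temporal position encodings can be sketched as an element-wise sum, which is the interpretation assumed here. The function name, the two-dimensional toy encodings, and the token ordering (frame by frame, blocks in spatial order) are illustrative choices.

```python
def spatiotemporal_encoding(spatial, temporal):
    """Add the temporal vector of each frame to the spatial vector of each
    block position. Output order: frame by frame, blocks in spatial order."""
    return [[s + t for s, t in zip(spatial_vec, temporal_vec)]
            for temporal_vec in temporal
            for spatial_vec in spatial]

spatial = [[0.1, 0.2], [0.3, 0.4]]     # 2 block positions, 2-dim encodings
temporal = [[0.0, 0.0], [1.0, 1.0]]    # 2 frames; the first frame adds nothing
tokens_pe = spatiotemporal_encoding(spatial, temporal)
print(len(tokens_pe))  # 4
```

When a temporal vector is all zeros, the tokens of that frame carry exactly the spatial encoding, so a single image behaves like a one-frame video.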
[0028] Furthermore, in step S4, the attention decoder is used to decode and reconstruct the quantized feature vector. Specifically, the decoder adopts an attention model structure symmetrical to the encoder, receives the quantized feature vector, and maps the discrete features back to the pixel space through linear projection, multi-layer attention calculation and multi-layer perceptron head processing. For video input, the spatiotemporal position coding is used to maintain the continuity of motion between frames and reassemble it into a complete video sequence to avoid inter-frame flickering or motion distortion.
[0029] Specifically, the binary spherical quantization module uses a straight-through estimator (STE) for gradient propagation during training: during backpropagation, the gradient with respect to the discretized binary code is assigned directly to the continuous features before normalization to ensure smooth gradient propagation.
[0030] Additionally, based on the above method description, this invention also provides a continuous vector discretization representation system, comprising: The signal acquisition module is used to acquire continuous image signals to be processed, including single-frame images or multiple-frame videos. The feature encoding module is used to extract features from continuous image signals using a pre-trained attention encoder to obtain high-dimensional feature vectors; the attention encoder is equipped with a block causal masking mechanism. The quantization and discretization module is used to discretize high-dimensional feature vectors through the binary spherical quantization module to obtain binary discrete codes. The quantization and discretization module includes a dimensionality reduction unit, a spherical normalization unit, a binary quantization unit, and a feature reconstruction unit. The decoding and reconstruction module is used to decode and reconstruct the quantized feature vectors processed by the quantization and discretization module using a pre-trained attention decoder, so as to obtain the reconstructed continuous image signal.
[0031] It should be noted that the core architecture of this invention consists of an "attention model" and a "BSQ quantization bottleneck layer," and the overall process is shown in Figure 1.
[0032] First, an attention-based encoder transforms the continuous image signal into a high-dimensional feature vector. Then, a BSQ quantization layer maps the high-dimensional features into a low-dimensional spherical binary discrete code, eliminating the need for an explicit codebook. Finally, an attention-based decoder reconstructs the discrete code into a continuous image consistent with the original signal. This framework natively supports single-frame image and multi-frame video input without modifying the core architecture; it adapts to different data types solely through a masking mechanism.
[0033] The core is to deeply integrate "spherical normalization" and "binary quantization" to construct an implicit codebook; to achieve unified modeling of spatiotemporal features through the block causal mask of the "attention model"; and to use a straight-through estimator (STE) to ensure smooth gradient propagation, while designing a dedicated loss function to balance reconstruction quality and codebook utilization.
[0034] It should be noted that the core module for discretization of continuous vectors is as follows: BSQ is the core of this invention in reconciling the competing goals of "high parameter efficiency, low distortion, and easy training," and corresponds to the CorrectBSQSTE and BSQDiscreteEncoder classes in the code. Its core logic is to "project high-dimensional features onto a unit hypersphere, then perform binary quantization, and finally reconstruct the original feature space." The specific steps are as follows: (1) Feature projection and spherical normalization: First, the high-dimensional feature vector output by the "attention model" encoder is reduced in dimensionality and normalized, so that the feature distribution lies on the hypersphere and unbounded quantization errors are avoided. This is reflected in the `forward` function of `BSQDiscreteEncoder`. Dimensionality reduction: The input feature dimension is reduced from in_dim (e.g., 512) to a low dimension latent_dim (e.g., 18) using a linear layer proj_down, resulting in a vector v. The formula is:
[0035] v = Linear(z), where z is the high-dimensional feature output by the encoder. Spherical normalization: Calculate the L2 norm of v (i.e., the total length of the vector) and divide v by the norm to obtain the unit vector u, as shown in the formula: u = v / ||v||_2.
[0036] This step solves the problem of unbounded error caused by the lack of normalization in existing LFQ technology—by normalizing, all features are constrained to the "unit hypersphere," making the "total intensity" of each feature equal, thus providing a stable distribution basis for subsequent binary quantization.
[0037] (2) Spherical binary quantization and implicit codebook construction: Binary quantization of features is performed on the unit hypersphere, eliminating the need to pre-train an explicit codebook. Instead, an implicit codebook is constructed automatically from the spherical distribution, corresponding to the `forward` function of `CorrectBSQSTE`. Hyperplane projection: The normalized u is projected onto a k_bits-dimensional space through learnable hyperplanes (with dimensions latent_dim × k_bits) to obtain the projection vector, calculated as: projections = u · hyperplanes (the matrix multiplication is implemented using torch.einsum("bl, lk -> bk") in the code). Binarization: The sign of each element of the projection vector is taken to obtain the binary vector `binary_code`, according to the rule binary_code = sign(projections), with sign(0) mapped to 1 (to avoid the distribution shift caused by zero values). Spherical constraint normalization: To ensure that the binary vector still lies on the unit hypersphere, binary_code is divided by the square root of k_bits to obtain the normalized binary code normalized_binary, as shown in the following formula: normalized_binary = binary_code / √(k_bits). The core innovation of this step is the "implicit codebook": instead of storing a large-scale codebook as traditional methods do, the combination of "sphere + binary" automatically yields an implicit codebook of size 2^k_bits (e.g., when k_bits = 18, the codebook size reaches 262144). The codebook grows exponentially with the quantization dimension k_bits, which guarantees expressive power while requiring no additional storage, resolving the scalability problem of explicit codebooks.
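Because every code word is just a pattern of k_bits signs, a code can be stored as an integer index and expanded again on demand, with no lookup table. The sketch below illustrates this for k_bits = 18; the function names and the most-significant-bit-first ordering are assumptions chosen for the example.

```python
def code_to_index(binary_code):
    """Pack a +1/-1 code into an integer, most significant bit first
    (+1 is stored as bit 1, -1 as bit 0)."""
    index = 0
    for bit in binary_code:
        index = (index << 1) | (1 if bit > 0 else 0)
    return index

def index_to_code(index, k_bits):
    """Expand an integer index back into its +1/-1 code."""
    return [1 if (index >> (k_bits - 1 - i)) & 1 else -1
            for i in range(k_bits)]

k_bits = 18
codebook_size = 2 ** k_bits
print(codebook_size)  # 262144

code = [1, -1, 1] + [-1] * 15          # hypothetical 18-bit code
idx = code_to_index(code)
print(idx)  # 163840
```

The round trip through index_to_code recovers the original code, so the 262144 entries never need to be materialized.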
[0038] (3) Quantized feature reconstruction: To achieve end-to-end training, the binary discrete code needs to be reconstructed into a vector with the same dimension as the original features for subsequent processing by the decoder. This is achieved by back projection through the hyperplanes, which maps `normalized_binary` back to the `latent_dim`-dimensional feature space and yields the quantized reconstruction vector `quantized_vec`.
[0039] It should be noted that the "attention model" encoder-decoder is designed as follows: To achieve unified discretization of images and videos, this invention employs an "attention model" as the core of the encoder-decoder architecture, corresponding to the underlying architecture that BSQLayer2D depends on in the code. (1) Feature encoding: unified input processing of images / videos. Image input: A single frame image is segmented into 1×p×p (e.g., 1×8×8) non-overlapping patches, flattened into a one-dimensional sequence, and then transformed into a high-dimensional feature vector through linear projection, which is directly input into the "attention model" encoder. Video input: The multi-frame video is divided into patches of the same size in chronological order and flattened, "spatiotemporal position encoding" (composed of spatial position encoding and temporal position encoding) is added, and the result is input into the encoder without modifying the network structure.
[0040] (2) Decoding and reconstruction: high-fidelity restoration design. The decoder employs an attention model structure symmetrical to the encoder. It receives the BSQ-quantized feature vectors and maps the discrete features back to the pixel space through a process of "linear projection → multi-layer attention → MLP head". For images: the decoded features are directly reconstructed into a continuous signal with the same size as the original image. For video: spatiotemporal position encoding is used to maintain the continuity of motion between frames, and the frames are reassembled into a complete video sequence, avoiding inter-frame flickering or motion distortion.
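The final reassembly of decoded blocks into an image of the original size can be sketched as the inverse of block segmentation. This is a minimal illustration on a single-channel image; the function name, the row-major block order, and the toy values are assumptions, and the attention layers and MLP head that produce the decoded blocks are not modeled.

```python
def unpatchify(patches, h, w, p):
    """Place flattened p x p blocks (row-major block order) back into an
    H x W image stored as a list of rows."""
    image = [[0] * w for _ in range(h)]
    blocks_per_row = w // p
    for idx, patch in enumerate(patches):
        top = (idx // blocks_per_row) * p
        left = (idx % blocks_per_row) * p
        for k, value in enumerate(patch):
            image[top + k // p][left + k % p] = value
    return image

# Four decoded 2x2 blocks (hypothetical values) reassembled into a 4x4 image.
decoded = [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
restored = unpatchify(decoded, h=4, w=4, p=2)
for row in restored:
    print(row)
```

For video, the same routine would be applied to each frame's group of blocks in chronological order.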
[0041] To address this requirement, the following formula is used in its implementation:
[0042] where the variable in the formula is the quantized feature vector quantized_vec.
[0043] Additionally, based on the above method description, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described above.
[0044] Additionally, based on the above method description, the present invention also provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the method described above.
[0045] This invention is widely used as a vector storage scheme in multimodal tongue and facial diagnosis instruments. Its core principle is to efficiently discretize and store the feature vectors of images / videos acquired during tongue and facial diagnosis using binary spherical quantization (BSQ) technology, while ensuring the accuracy of the reconstructed features and preventing the loss of crucial medical information such as tongue coating texture, facial pigmentation, and lesion morphology. The following detailed embodiments, combined with application scenarios, allow those skilled in the art to directly reproduce the invention.
[0046] I. Best Practice Implementation: Multimodal Tongue and Facial Diagnosis Instrument (Professional Medical Version) This embodiment is applicable to professional equipment in hospitals and health check centers, which need to process static tongue diagnosis images and dynamic facial diagnosis videos (such as facial muscle movements and tongue extension and contraction videos) simultaneously. It requires high reconstruction accuracy, supports video input of arbitrary length, and storage / transmission efficiency that meets the needs of remote diagnosis and treatment.
[0047] (i) Hardware configuration (can be directly purchased and deployed) Acquisition module: High-definition industrial camera (resolution 3840×2160, supports 24 FPS video recording), supplementary light (color temperature 5500K, to avoid color distortion); Computing module: GPU server (4×NVIDIA 4090 24GB GPU + AMD Ryzen Threadripper PRO 5975WX 32-Core CPU); Storage module: SSD hard drive (2TB capacity, used to store the discretized binary feature vectors); Transmission module: Gigabit Ethernet interface (supports real-time transmission of feature vectors during remote diagnosis and treatment).
[0048] (ii) Software environment (version clearly defined, ready for direct installation and configuration) Basic framework: Python 3.9, PyTorch 2.0.1, CUDA 11.8; Dependencies: torchvision 0.15.2, numpy 1.24.3, Pillow 9.5.0, scipy 1.10.1.
[0049] (iii) Data preparation (adapted for medical scenarios, can be directly replicated) Dataset: 100,000 static tongue diagnosis images (with 20 label categories such as normal tongue coating, yellow coating, white coating, and tongue ulcers) and 50,000 dynamic facial diagnosis videos (each video has 10-30 frames, with 15 label categories such as facial pigmentation, wrinkles, and eyelid edema). The images and videos are labeled with the doctor's diagnosis results. Preprocessing: Static images: the shorter side is scaled to 256 pixels (Lanczos interpolation), the image is center-cropped to 256×256, and the RGB channels are normalized to [0,1]; Dynamic video: each video is divided into non-overlapping patches of size 1×8×8, flattened into a one-dimensional sequence, and spatiotemporal position encoding is added (spatial position encoding superimposed with zero-initialized temporal position encoding). Feature extraction: A pre-trained model is used to extract high-dimensional feature vectors (512 dimensions) from the image / video, which are used as the input vector z of this invention.
[0050] (iv) Deployment and application process (can be directly integrated into the device) Feature discretization: Tongue diagnosis images / facial diagnosis videos are acquired in real time → the 512-dimensional feature vector z is extracted with ViT-Base → z is processed by BSQDiscreteEncoder: dimensionality reduction to 18 dimensions → spherical normalization → 36-bit binary quantization → generation of the binary discrete code (36 bits / vector); Discrete code storage: Each code is named according to "Patient ID-Collection Time-Feature Type" and stored on the SSD; Feature reconstruction and diagnosis: The discrete code is read → BSQDiscreteEncoder back projection reconstructs the 18-dimensional latent variable through the hyperplanes → the proj_up layer maps it back to 512-dimensional features → the features are input into a medical diagnostic model (such as a lesion recognition CNN). Dynamic video processing: Block causal masking takes effect automatically and supports videos of any length (10 frames to 1 hour); the reconstructed video has an FVD of ≤6.2 (UCF-101 standard), shows no inter-frame flicker, and clearly restores tongue and facial images; Remote transmission: Discrete codes are transmitted via Gigabit Ethernet. The transmission time for the features of 100 patients (approximately 45 KB) is ≤1 ms, meeting the real-time requirements of remote diagnosis and treatment.
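The block causal masking mentioned above is not spelled out in this paragraph. One common construction, shown here as a hedged sketch, lets every token attend to all tokens in its own frame and in earlier frames, but not in later frames. The function name `block_causal_mask` and the boolean list-of-lists layout are illustrative assumptions; a practical implementation would typically build the mask as a tensor.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j."""
    n = num_frames * tokens_per_frame
    # Frame index of each token in the flattened, time-major sequence.
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[j] <= frame_of[i] for j in range(n)] for i in range(n)]


# Hypothetical example: 3 frames with 2 tokens each.
mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With a single frame the mask is entirely True, so a static image is processed with full attention through the same code path as video.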
[0051] (v) Effect verification (can be directly reproduced in testing) Reconstruction quality: After tongue diagnosis image reconstruction, the recognition accuracy for tongue coating texture and ulcer spots is ≥98.5%; after facial diagnosis video reconstruction, the position deviation of facial pigmentation spots is ≤1 pixel; Storage efficiency: Compared with the traditional VQ-VAE solution (16K codebook), storage usage is reduced by 90%, and codebook utilization is 99.8% (no "dead" codebook entries); Speed: Discretization and storage of a single image takes ≤5 ms and processing of a single 30-frame video takes ≤120 ms, meeting the real-time acquisition requirements of the device.
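The storage and transmission figures depend on how each 36-bit code is serialized, which the text does not specify. The sketch below shows one plausible layout, assuming each vector is packed into 5 bytes (36 bits rounded up to a byte boundary, with +1 stored as bit 1 and -1 as bit 0). `pack_code`, `unpack_code`, and `ideal_transfer_ms` are hypothetical helper names, and the transfer estimate assumes an ideal 1 Gbit/s line rate with no protocol overhead or latency and 1 KB = 1000 bytes.

```python
from typing import List

K_BITS = 36  # bits per vector, as stated in the deployment process


def pack_code(signs: List[int]) -> bytes:
    """Pack +1/-1 signs into big-endian bytes (+1 -> bit 1, -1 -> bit 0)."""
    if any(s not in (1, -1) for s in signs):
        raise ValueError("signs must be +1 or -1")
    value = 0
    for s in signs:
        value = (value << 1) | (1 if s > 0 else 0)
    return value.to_bytes((len(signs) + 7) // 8, "big")


def unpack_code(data: bytes, k_bits: int) -> List[int]:
    """Inverse of pack_code: recover the +1/-1 signs, first sign first."""
    value = int.from_bytes(data, "big")
    return [1 if (value >> (k_bits - 1 - i)) & 1 else -1 for i in range(k_bits)]


def ideal_transfer_ms(num_bytes: int, bits_per_second: float = 1e9) -> float:
    """Serialization time in milliseconds at an ideal line rate (assumption)."""
    return num_bytes * 8 / bits_per_second * 1000.0


signs = [1, -1] * 18                 # an arbitrary 36-bit example code
packed = pack_code(signs)
restored = unpack_code(packed, K_BITS)
print(len(packed), restored == signs)
print(ideal_transfer_ms(45_000))     # payload size taken from the text (about 45 KB)
```

Under these assumptions the 45 KB payload takes roughly 0.36 ms to serialize onto the wire, which is consistent with the stated bound of ≤1 ms; real transfers add protocol and network latency on top.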
[0052] The core of this invention lies in the adaptation of BSQ technology to medical feature vectors. Through clear parameter configuration, training process and deployment steps, technicians in the relevant technical field can integrate this discretization scheme into various multimodal medical devices without additional exploration, so as to achieve efficient storage, transmission and accurate reconstruction.
[0053] In addition, to fully implement the technical solution of this invention, the following complete implementation code is provided:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function
from typing import Optional


class CorrectBSQSTE(Function):
    """Sequence: Projection → Binarization → Reconstruction"""

    @staticmethod
    def forward(ctx, u: torch.Tensor, hyperplanes: torch.Tensor,
                use_tanh: bool = False) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            u: Normalized latent vector [batch, latent_dim]
            hyperplanes: Learnable hyperplanes [latent_dim, k_bits]
            use_tanh: Whether to use tanh for activation
        """
        if use_tanh:
            u_activated = torch.tanh(u)
        else:
            u_activated = u
        # 3.1: Project onto the hyperplanes
        projections = torch.einsum("bl, lk -> bk", u_activated, hyperplanes)  # [batch, k_bits]
        # Sign binarization (zero is mapped to +1)
        binary_code = torch.sign(projections)
        binary_code = torch.where(binary_code == 0, torch.ones_like(binary_code), binary_code)
        # Hyperspherical constraint: scale so that the code has unit L2 norm
        k_bits = hyperplanes.shape[-1]
        normalized_binary = binary_code / torch.sqrt(
            torch.tensor(k_bits, dtype=u.dtype, device=u.device))
        # Reconstruct the quantized vector
        quantized_vec = torch.einsum("bk, lk -> bl", normalized_binary, hyperplanes)
        ctx.save_for_backward(u_activated, hyperplanes, binary_code)
        ctx.use_tanh = use_tanh
        ctx.k_bits = k_bits
        return normalized_binary, quantized_vec

    @staticmethod
    def backward(ctx, grad_binary: torch.Tensor, grad_quantized: torch.Tensor):
        # Note: grad_quantized is not used by this estimator.
        u_activated, hyperplanes, binary_code = ctx.saved_tensors
        use_tanh = ctx.use_tanh
        k_bits = ctx.k_bits
        # Straight-through estimator (STE) gradient
        if use_tanh:
            # u_activated already holds tanh(u), so d tanh(u)/du = 1 - u_activated^2
            grad_activation = 1 - u_activated ** 2
            grad_u = torch.einsum("bk, lk -> bl", grad_binary, hyperplanes) * grad_activation
        else:
            grad_u = torch.einsum("bk, lk -> bl", grad_binary, hyperplanes)
        # Hyperplane gradient
        grad_hyperplanes = torch.einsum("bl, bk -> lk", u_activated, grad_binary)
        return grad_u, grad_hyperplanes, None


def correct_bsq_ste(u: torch.Tensor, hyperplanes: torch.Tensor, use_tanh: bool = False):
    return CorrectBSQSTE.apply(u, hyperplanes, use_tanh)


class BSQDiscreteEncoder(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 18, k_bits: int = 12,
                 use_tanh: bool = False, training: bool = True):
        super().__init__()
        self.latent_dim = latent_dim
        self.k_bits = k_bits
        self.use_tanh = use_tanh
        self.eps = 1e-6
        self.training = training
        self.proj_down = nn.Linear(in_dim, latent_dim)
        # 3.1: Learnable hyperplanes [latent_dim, k_bits]
        self.hyperplanes = nn.Parameter(torch.randn(latent_dim, k_bits))
        nn.init.orthogonal_(self.hyperplanes)
        # Quantization loss hyperparameters
        self.beta = 0.05
        self.gamma0 = 1.0
        self.gamma = 1.1
        self.zeta = 0.05

    def forward(self, z: torch.Tensor) -> tuple[
            torch.Tensor, torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
        # 1. Projection → Normalization
        v = self.proj_down(z)
        v_norm = torch.norm(v, p=2, dim=-1, keepdim=True)
        u = v / (v_norm + self.eps)
        # 2. BSQ quantization (3.1 correct order)
        binary_code, quantized_vec = correct_bsq_ste(u, self.hyperplanes, use_tanh=self.use_tanh)
        # 3. Calculate the quantization loss
        quant_loss = None
        if self.training:
            # Commitment loss: distance between the latent vector and the reconstructed vector
            commitment_loss = self.beta * F.mse_loss(u, quantized_vec)
            # Entropy penalty
            dot_product = torch.einsum("bl, lk -> bk", u, self.hyperplanes)
            prob = torch.sigmoid(dot_product / self.zeta)
            entropy = -prob * torch.log(prob + 1e-10) \
                - (1 - prob) * torch.log((1 - prob) + 1e-10)
            entropy_penalty = self.gamma0 * torch.mean(entropy) \
                - self.gamma * torch.var(entropy, dim=-1).mean()
            quant_loss = commitment_loss + entropy_penalty
        # 4. Theoretical magnitude of each discrete code entry: 1 / sqrt(k_bits)
        #    (about 0.2887 for the default k_bits = 12)
        expected_val = 1.0 / torch.sqrt(
            torch.tensor(self.k_bits, dtype=z.dtype, device=z.device))
        if self.training:
            # Validate constraints
            assert torch.allclose(torch.abs(binary_code), expected_val, atol=1e-5)
            binary_norm = torch.norm(binary_code, p=2, dim=-1)
            assert torch.allclose(binary_norm, torch.ones_like(binary_norm), atol=1e-5)
            return binary_code, quantized_vec, expected_val, quant_loss
        else:
            return binary_code, quantized_vec, None, None


# -------------------------- Corrected BSQLayer2D (returns discrete code + loss during training) --------------------------
class BSQLayer2D(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 18, use_tanh: bool = False,
                 training: bool = True):
        super().__init__()
        self.training = training
        self.latent_dim = latent_dim
        self.encoder = BSQDiscreteEncoder(
            in_dim=in_dim, latent_dim=latent_dim, use_tanh=use_tanh, training=training
        )
        # Fix: the decoder should map latent_dim back to in_dim
        self.proj_up = nn.Linear(latent_dim, in_dim)  # Decoding layer

    def forward(self, z: torch.Tensor) -> tuple[
            torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
        # Call the encoder to obtain discrete code + quantized vector + expected_val + quantization loss
        discrete_code, quantized_vec, expected_val, quant_loss = self.encoder(z)
        # Training mode: decode and reconstruct using the quantized vector
        if self.training:
            hat_z = self.proj_up(quantized_vec)
            return hat_z, expected_val, quant_loss
        # Non-training mode: return the decoded reconstruction, the discrete code, and None
        else:
            hat_z = self.proj_up(quantized_vec)
            return hat_z, discrete_code, None


# -------------------------- 4. Correct test function: fix return value unpacking order --------------------------
def run_bsq_layer_2d(use_tanh: bool = False, training: bool = True):
    """
    - Training mode: verify expected_val, quantization loss (numerical / non-negative), gradient propagation;
    - Non-training mode: verify discrete code properties, decoding dimension, and gradient propagation.
    """
    # Experiment configuration (aligned with user requirements: batch=12, in_dim=512, latent_dim=18)
    batch_size = 12
    in_dim = 512
    latent_dim = 18
    z = torch.randn(batch_size, in_dim, requires_grad=True)  # Input [12, 512]
    # Initialize the BSQ layer (training / inference state kept in sync)
    bsq_2d = BSQLayer2D(
        in_dim=in_dim, latent_dim=latent_dim, use_tanh=use_tanh, training=training
    )
    # Forward propagation: obtain output and quantization loss
    hat_z, output, quant_loss = bsq_2d(z)  # Fix: correctly unpack the three return values
    # ------------ Verification 1: Core constraints of training mode (quantization loss) ---------------
    expected_val_theory = 1.0 / torch.sqrt(
        torch.tensor(bsq_2d.encoder.k_bits, dtype=z.dtype, device=z.device))
    if training:
        # Verification 1.1: expected_val
        assert output.ndim == 0, f"Training mode return value error! Scalar required, actual {output.ndim}D"
        # Verification 1.2: quantization loss is valid (non-negative, reasonable value)
        assert quant_loss is not None, "The quantization loss in training mode is None"
        assert quant_loss >= 0, \
            f"Quantization loss is abnormal! The value is {quant_loss.item():.4f}, it needs to be ≥0"
        assert 0 <= quant_loss.item() <= 1.0, \
            f"Quantization loss value is abnormal! {quant_loss.item():.4f}, must be in the range of 0-1"
        # Verification 1.3: quantization loss participates in gradient propagation (end-to-end training)
        # Simulated total loss: reconstruction loss (L_coarse / L_fine) + quantization loss
        # (Note: the MSE of u is used here to approximate the reconstruction loss;
        #  in practice, it needs to be combined with the decoder.)
        u = bsq_2d.encoder.proj_down(z)  # Reproduce the encoder intermediate value u
        u = u / (torch.norm(u, p=2, dim=-1, keepdim=True) + bsq_2d.encoder.eps)
        recon_loss = F.mse_loss(u, torch.zeros_like(u))  # Simulated reconstruction loss (example)
        total_loss = recon_loss + quant_loss  # Total loss formula (λ=1)
        total_loss.backward()
        # Verify gradient propagation to encoder parameters (hyperplanes, projection layer)
        assert bsq_2d.encoder.hyperplanes.grad is not None, \
            "The hyperplane gradient is None; the quantization loss is not being optimized"
        assert bsq_2d.encoder.proj_down.weight.grad is not None, \
            "The projection layer gradient is None; training is not possible"
        print(f"Training mode verification: expected_val={output:.4f}, "
              f"quantization loss={quant_loss.item():.4f}")
    # ---------------- Verification 2: Core constraints in non-training mode --------------------
    else:
        # Verification 2.1: decoding dimension is consistent with the input
        assert hat_z.shape == z.shape, \
            f"Decoding dimension error! Requires {z.shape}, actual is {hat_z.shape}"
        # Verification 2.2: properties of the discrete code
        discrete_code, _, _, _ = bsq_2d.encoder(z)  # Fix: correctly obtain the discrete code
        assert discrete_code.shape == (batch_size, bsq_2d.encoder.k_bits), \
            f"Discrete code dimension error! Requires {[batch_size, bsq_2d.encoder.k_bits]}, " \
            f"actual is {discrete_code.shape}"
        discrete_code_abs_mean = torch.abs(discrete_code).mean()
        assert torch.allclose(discrete_code_abs_mean, expected_val_theory, atol=1e-5), \
            "Discrete code value error!"
        discrete_code_norm_mean = torch.norm(discrete_code, p=2, dim=-1).mean()
        assert torch.allclose(discrete_code_norm_mean, torch.tensor(1.0), atol=1e-5), \
            f"Discrete code hyperspherical error! The L2 norm mean {discrete_code_norm_mean:.4f} " \
            f"needs to be approximately 1.0"
        # Verification 2.3: gradient propagation is normal
        recon_loss = F.mse_loss(hat_z, z)
        recon_loss.backward()
        assert z.grad is not None, "Gradient propagation in non-training mode failed"
    # ---------------------- Output test results -------------------------
    activation_info = "Enable tanh activation" if use_tanh else "No activation"
    training_info = "Training mode (returns expected_val + quantization loss)" if training \
        else "Non-training mode (returns discrete code + decoded value)"
    print("=" * 80)
    print(f"BSQ layer test passed ({activation_info} | {training_info}, input [12,512])!")
    if not training:
        print(f"1. Discrete code dimension: {discrete_code.shape}")
        print(f"2. Discrete code value: ±{discrete_code_abs_mean:.4f} (≈±{expected_val_theory:.4f})")
        print(f"3. Decoding dimension: {hat_z.shape} (same as input)")
    else:
        print(f"1. expected_val: {output:.4f}")
        print(f"2. Quantization loss: {quant_loss.item():.4f} (commitment loss + entropy penalty)")
        print("3. Gradient propagation: both hyperplane and projection layers have gradients "
              "(supporting end-to-end training)")
    print("=" * 80)


# ---------------------- Main function: testing multiple scenarios -----------------------
if __name__ == '__main__':
    # Scenario 1: No activation + training mode (verifying quantization loss and constraints)
    print("[Scenario 1: No activation | Training mode (with quantization loss)]")
    run_bsq_layer_2d(use_tanh=False, training=True)
    # Scenario 2: Enable tanh + training mode (engineering optimization, quantization loss logic remains unchanged)
    print("\n[Scenario 2: Enabling tanh | Training mode (with quantization loss)]")
    run_bsq_layer_2d(use_tanh=True, training=True)
    # Scenario 3: No activation + non-training mode (discrete code + decoding)
    print("\n[Scenario 3: No activation | Non-training mode (discrete code + decoding)]")
    run_bsq_layer_2d(use_tanh=False, training=False)
    print("\n[Quick Verification: Quantization Loss in Training Mode]")
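A small pure-Python sketch, independent of the PyTorch listing above, illustrates why the normalized binary code always lies on the unit hypersphere and how large the implicit codebook is for 36 bits. It is a simplification for illustration only: the learnable hyperplane projection is omitted (the sign is taken directly on the normalized vector), and `bsq_quantize` is a hypothetical helper name.

```python
import math
from typing import List


def bsq_quantize(v: List[float]) -> List[float]:
    """Normalize v to unit length, take signs (0 maps to +1), scale by 1/sqrt(k)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("input vector must be non-zero")
    u = [x / norm for x in v]
    k = len(u)
    return [(1.0 if x >= 0 else -1.0) / math.sqrt(k) for x in u]


code = bsq_quantize([0.3, -1.2, 0.0, 2.5])
code_norm = math.sqrt(sum(c * c for c in code))
print(code, code_norm)

# With k bits there are 2**k possible sign patterns, so no codebook table
# has to be stored: the codebook is implied by the bit pattern itself.
implicit_codebook_size = 2 ** 36
print(implicit_codebook_size)
```

Every entry has magnitude 1/√k, so the squared entries sum to k · (1/k) = 1 regardless of the sign pattern, which is the constraint the assertions in the listing check.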
Claims
1. A method for discretizing continuous vectors, characterized in that it includes the following steps:
S1. Acquire the continuous image signal to be processed, which includes a single-frame image or a multi-frame video;
S2. Use a pre-trained attention encoder to extract features from the continuous image signal to obtain high-dimensional feature vectors; the attention encoder is equipped with a block causal masking mechanism to uniformly process the spatiotemporal features of static images and dynamic videos;
S3. Discretize the high-dimensional feature vector using the binary spherical quantization (BSQ) module to obtain the binary discrete code; the discretization process includes: projecting the high-dimensional feature vectors to a low-dimensional space and performing spherical normalization, using a learnable hyperplane for binary quantization to construct an implicit codebook, and projecting the binary code back into the feature space;
S4. Use the pre-trained attention decoder to decode and reconstruct the quantized feature vector processed by the BSQ module to obtain the reconstructed continuous image signal.
2. The continuous vector discretization representation method according to claim 1, characterized in that step S3, discretizing the high-dimensional feature vector using a binary spherical quantization module, specifically includes:
S31. Dimensionality reduction: Project the high-dimensional feature vector z of dimension in_dim to the low-dimensional space of dimension latent_dim through a linear layer to obtain the vector v, with the formula: v = Linear(z);
S32. Spherical normalization: Calculate the L2 norm of vector v and divide v by its L2 norm to obtain the unit vector u mapped to the unit hypersphere, with the formula: u = v / ||v||_2;
S33. Binary quantization: Using the learnable hyperplane parameters of dimension latent_dim×k_bits, project the unit vector u onto the k_bits-dimensional space to obtain the projection vector projections; process the projection vector projections with a sign function to obtain the binary vector binary_code, with the formulas: projections = u · hyperplanes, binary_code = sign(projections);
S34. Normalization and implicit codebook construction: Divide the binary vector binary_code by the square root of k_bits to obtain the normalized binary code normalized_binary, with the formula: normalized_binary = binary_code / √(k_bits);
S35. Feature reconstruction: Map the normalized binary code back to the latent_dim-dimensional feature space by back projection onto the hyperplane to obtain the quantized reconstruction vector quantized_vec.
3. The continuous vector discretization representation method according to claim 2, characterized in that, in step S33, the rule applied by the sign function to the projection vector projections is as follows: if projections is greater than 0, the value is 1; if projections is less than 0, the value is -1; if projections is equal to 0, the value is 1.
4. The continuous vector discretization representation method according to claim 3, characterized in that, in step S2, the attention encoder is used to extract features from the continuous image signal. Specifically, if the input is a single-frame image, the image is segmented into non-overlapping image blocks, flattened into a one-dimensional sequence, transformed into high-dimensional feature vectors through linear projection, and directly input into the attention encoder. If the input is a multi-frame video, the multi-frame video is segmented into image blocks of the same size in chronological order and flattened, and spatiotemporal position coding is added before input into the attention encoder. The spatiotemporal position coding is formed by superimposing spatial position coding and temporal position coding.
5. The continuous vector discretization representation method according to claim 4, characterized in that, in step S4, the attention decoder is used to decode and reconstruct the quantized feature vector. Specifically, the decoder adopts an attention model structure symmetrical to the encoder, receives the quantized feature vector, and maps the discrete features back to the pixel space through linear projection, multi-layer attention calculation, and multi-layer perceptron head processing. For video input, the spatiotemporal position coding is used to maintain inter-frame motion continuity, and the output is reassembled into a complete video sequence to avoid inter-frame flicker or motion distortion.
6. The continuous vector discretization representation method according to claim 5, characterized in that the binary spherical quantization module uses a straight-through estimator (STE) for gradient propagation during training. During backpropagation, it directly assigns the gradient of the discretized binary code to the continuous features before normalization to ensure smooth gradient propagation.
7. A continuous vector discretization representation system, characterized in that it includes:
a signal acquisition module, used to acquire continuous image signals to be processed, including single-frame images or multi-frame videos;
a feature encoding module, used to extract features from the continuous image signals using a pre-trained attention encoder to obtain high-dimensional feature vectors, the attention encoder being equipped with a block causal masking mechanism;
a quantization and discretization module, used to discretize the high-dimensional feature vectors through the binary spherical quantization module to obtain binary discrete codes, the quantization and discretization module including a dimensionality reduction unit, a spherical normalization unit, a binary quantization unit, and a feature reconstruction unit;
a decoding and reconstruction module, used to decode and reconstruct the quantized feature vectors processed by the quantization and discretization module using a pre-trained attention decoder, so as to obtain the reconstructed continuous image signal.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: The processor executes the program to implement the method as described in any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that: When the program is executed by the processor, it implements the method as described in any one of claims 1 to 6.