Wildlife few-shot recognition method based on variational autoencoder and cross-domain alignment

By generating multimodal samples adapted to the target domain through variational autoencoders and cross-domain alignment techniques, the problems of sample scarcity and low cross-domain recognition accuracy in wildlife identification are solved, and efficient identification in complex environments is achieved.

CN122244902A · Pending · Publication Date: 2026-06-19 · SHAANXI TIANRUN BIOTECHNOLOGY ENGINEERING CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHAANXI TIANRUN BIOTECHNOLOGY ENGINEERING CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for wildlife identification suffer from problems such as poor model generalization ability due to scarce samples, low cross-domain recognition accuracy, and weak anti-interference ability of single modes, especially with poor recognition performance under different seasons and lighting conditions.

Method used

We employ a variational autoencoder-based approach with cross-domain alignment, using a three-dimensional latent vector separation algorithm and a cross-attention mechanism to generate multimodal samples adapted to the target domain. We then combine this with a multi-dimensional quality assessment system to select high-quality augmented samples and train a classification model for recognition.

Benefits of technology

Generating high-quality virtual samples under limited sample conditions improves cross-domain recognition accuracy and robustness, reduces reliance on manual annotation, and enhances the model's recognition ability in complex environments.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a method for few-shot identification of wild animals based on variational autoencoders and cross-domain alignment, belonging to the field of image recognition technology. The method involves feature extraction and parameter matching of original multimodal samples to obtain standardized multimodal samples, domain feature vectors, and species-specific weights; extracting features from the standardized multimodal samples to construct a three-dimensional latent vector; calculating perturbation coefficients based on the domain feature vectors and species-specific weights to obtain an enhanced latent vector adapted to the target domain; aligning and fusing image features and voiceprint features in the enhanced latent vector using a cross-attention mechanism to decode and generate multimodal generated samples; screening the multimodal generated samples using a multi-dimensional quality assessment system; and using the classification model corresponding to the screened high-quality augmented samples to perform cross-domain identification of wild animals, obtaining the identification results. This invention solves the problem of reliance on manual annotation in few-shot scenarios.

Description

Technical Field

[0001] This invention relates to the field of image recognition technology, and in particular to a method for few-shot identification of wild animals based on variational autoencoders and cross-domain alignment.

Background Technology

[0002] Wildlife species identification is one of the core technologies for field ecological monitoring and biodiversity conservation. However, field scenarios present several pain points: scarce samples (low observation frequency for some species), large cross-domain differences (heterogeneous data distributions caused by different seasons / lighting / shooting equipment), and weak anti-interference capability of a single modality (relying solely on images is easily affected by occlusion). Existing technologies struggle to meet the need for accurate identification.

[0003] Traditional wildlife identification relies on large, manually labeled datasets, but the insufficient sample size for rare species leads to poor model generalization. Existing few-shot identification techniques often focus on single-modal data and fail to adapt to cross-environmental differences in the wild, resulting in a sharp drop in identification accuracy under different seasons / lighting conditions. Furthermore, without collaborative use of multimodal information, single image or voiceprint data is easily affected by environmental interference, so robustness in complex scenarios cannot be guaranteed.

Existing technology:
1. Data source: Relies on manually labeled wildlife image datasets with few samples (≤10 samples per species).
2. Model training: A meta-learning framework is used to train the classification model, learning the "sample-category" mapping relationship.
3. Recognition logic: After a new sample is input, the recognition result is output based on meta-learning category matching rules.
4. Limitations: Only supports single-modal image input and does not handle cross-seasonal / lighting domain differences.

[0004] Existing technologies suffer from the following problems:
1. Reliance on manually labeled samples: It is difficult to obtain labeled samples of scarce species, resulting in high model training costs and poor universality.
2. Weak cross-domain generalization ability: Data distribution varies greatly under different seasons / light conditions, leading to a decrease in recognition accuracy of ≥40%.
3. Poor anti-interference capability of single modality: Relying solely on image data, recognition fails when encountering vegetation occlusion or low-light nighttime scenes.
4. Lack of sample augmentation mechanism: With few samples, the model has a high risk of overfitting, limiting its generalization ability.

Summary of the Invention

[0005] To address the aforementioned shortcomings in existing technologies, this invention provides a method for identifying wild animals with few samples based on variational autoencoders and cross-domain alignment, which solves the problem of reliance on manual annotation in scenarios with few samples.

[0006] To achieve the aforementioned objectives, the technical solution adopted by this invention is as follows: a method for few-shot identification of wild animals based on variational autoencoders and cross-domain alignment, comprising:
S1. Use target domain environmental parameters and species labels to extract features and match parameters of the original multimodal samples to obtain standardized multimodal samples, domain feature vectors and species-specific weights;
S2. Use a dual-branch attention network to extract features from the standardized multimodal samples, and use a three-dimensional latent vector separation algorithm to combine the domain feature vectors and species-specific weights to construct a three-dimensional latent vector containing biological features, domain features and species-specific features;
S3. Calculate the perturbation coefficient based on the domain feature vector and species-specific weights, and perform differentiated perturbation on each feature in the three-dimensional latent vector in combination with the target domain distribution information to obtain an enhanced latent vector adapted to the target domain;
S4. Use the cross-attention mechanism to align and fuse the image features and voiceprint features in the enhanced latent vector, and decode to obtain multimodal generated samples;
S5. Use a preset multi-dimensional quality assessment system to screen the multimodal generated samples, and use the classification model corresponding to the screened high-quality augmented samples to perform cross-domain identification of wild animals, obtain the identification results, and complete the few-shot identification of wild animals; wherein the classification model is obtained by training on the high-quality augmented samples.

[0007] The beneficial effects of this invention are as follows: The generative augmentation process constructed in this invention utilizes three-dimensional latent vector separation and cross-attention decoding techniques to generate a large number of high-quality virtual samples with target domain environmental features, even with only a small number of source domain samples. This directly reduces the dependence on large-scale manually labeled data, enabling the model to be trained based on high-quality augmented samples in S5, effectively solving the problems of difficulty in obtaining scarce wild animal samples and high labeling costs.

[0008] A three-dimensional latent vector separation algorithm is proposed, which explicitly decouples biological features, domain features, and species-specific features. Furthermore, a differential perturbation strategy incorporating target domain distribution information is introduced in S3. This design enables the generated samples to accurately simulate the specific environment of the target domain (such as lighting and background), thereby forcing the classification model to recognize the feature distribution of the target domain during the training phase. This significantly improves its cross-domain recognition accuracy under different seasons and shooting devices.

[0009] To address the common problem of single-modal failure in field environments (such as unclear images at night or interference from environmental noise), this invention introduces a cross-attention mechanism. This mechanism can dynamically adjust the fusion weights of image and voiceprint features based on domain features, achieving complementary advantages. For example, it automatically increases the weight of voiceprint features under low-light conditions, thereby ensuring that the classification model trained in S5 still has high robustness in complex environments.

[0010] Unlike existing technologies that rely on fixed thresholds or manual parameter tuning, this invention achieves dynamic control of perturbation intensity by calculating a perturbation coefficient based on domain feature vectors and species-specific weights. This means that the system can automatically adjust the feature distribution of generated samples according to changes in input environmental parameters (such as the transition from day to night), achieving adaptive adaptation to different monitoring scenarios without manual intervention.

[0011] A multi-dimensional quality assessment system was constructed, rigorously screening generated samples from four dimensions: image structure, voiceprint similarity, species fidelity, and domain fit. This ensures that the training data input to the classification model is both diverse (preventing overfitting) and has high confidence (preventing noise interference), enabling the classification model to learn robust discrimination boundaries even with few samples, significantly reducing the risk of overfitting.

[0012] Further, S2 includes:
Image data from the standardized multimodal samples is input into a residual network containing a spatial attention mechanism to extract image feature vectors;
Voiceprint data from the standardized multimodal samples is input into a bidirectional long short-term memory network containing a channel attention mechanism to extract voiceprint feature vectors; wherein the residual network containing a spatial attention mechanism and the bidirectional long short-term memory network containing a channel attention mechanism together form the dual-branch attention network, and the image feature vector and the voiceprint feature vector are the features of the standardized multimodal samples;
The image feature vector and the voiceprint feature vector are weighted and concatenated to obtain the multimodal fusion feature;
A mapping network containing fully connected layer weights is constructed. The multimodal fusion features are input into the mapping network, and an adaptation vector composed of the domain feature vector and the species-specific weight is embedded during the mapping process. The domain feature vector and the species-specific weight are combined through the three-dimensional latent vector separation algorithm, and the mapping network outputs explicitly separated three-dimensional latent vectors representing biological features, domain features and species-specific features.

[0013] Furthermore, the three-dimensional latent vector is given by an expression (the formula itself is not reproduced in this text) whose terms are: the three-dimensional latent vector; the activation function; the two weight matrices of the fully connected layers; the fusion features; the two bias terms; the concatenation (splicing) operation; the domain feature vector; and the species-specific weight.

[0014] Furthermore, the calculation of the perturbation coefficient based on the domain feature vector and species-specific weights in S3 includes: The illumination intensity, background complexity, and voiceprint signal-to-noise ratio in the domain feature vector are normalized to obtain the normalized domain parameters. Using the species-specific weight as the base gain coefficient, the normalized domain parameters are weighted and summed to output a dynamically adapted perturbation coefficient, which controls the intensity of subsequent perturbations.
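The coefficient calculation described above can be sketched in a few lines. This is a minimal illustration only, not the patent's formula: the function name, the per-parameter mixing weights (0.4, 0.3, 0.3), and the choice to multiply the weighted sum by the species-specific weight are all assumptions.

```python
def perturbation_coefficient(light, background, snr, species_weight,
                             mix=(0.4, 0.3, 0.3)):
    """Weighted sum of normalized domain parameters, scaled by the species weight.

    light, background, snr: domain parameters already normalized to [0, 1].
    species_weight: base gain coefficient ([0.8, 1.2] in the description).
    mix: hypothetical weights for the three domain parameters (assumed).
    """
    for name, value in (("light", light), ("background", background), ("snr", snr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1], got {value}")
    weighted = mix[0] * light + mix[1] * background + mix[2] * snr
    return species_weight * weighted


# Same species and scene, daytime versus night-time illumination.
alpha_day = perturbation_coefficient(0.9, 0.5, 0.7, species_weight=1.1)
alpha_night = perturbation_coefficient(0.1, 0.5, 0.7, species_weight=1.1)
```

Under these assumed weights the coefficient changes automatically when the illumination input changes, which is the adaptive behavior the description attributes to the perturbation coefficient.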

[0015] Furthermore, in step S3, differential perturbation is performed on each feature in the three-dimensional latent vector based on the target domain distribution information to obtain an enhanced latent vector adapted to the target domain, specifically including:
Obtain the mean vector from the distribution information of the target domain;
Calculate the perturbation variance corresponding to each feature using the perturbation coefficient;
Based on the mean vector and perturbation variance, apply a weak-intensity Gaussian random perturbation with zero mean and variance controlled by the perturbation coefficient to the biological feature part of the three-dimensional latent vector; apply a directional perturbation with mean matching the target domain distribution information and variance controlled by the perturbation coefficient to the domain feature part of the three-dimensional latent vector; and apply a micro-intensity random perturbation with zero mean and variance controlled by the perturbation coefficient to the species-specific feature part of the three-dimensional latent vector, to obtain the perturbed features of each part;
Use a mapping network to collaboratively fuse the perturbed features and output an enhanced latent vector adapted to the target domain.
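A small sketch of the three perturbation regimes may help. It is written under stated assumptions: the relative noise scales (0.10, 0.50, 0.01), the function name, and the literal reading that the domain noise is drawn with mean equal to the target-domain mean vector are all hypothetical, and the final mapping-network fusion is replaced by plain concatenation.

```python
import random


def perturb_latent(bio, domain, species, target_mean, alpha, seed=None):
    """Apply differentiated Gaussian perturbations to the three latent parts.

    bio, domain, species: the three parts of the latent vector (lists of floats).
    target_mean: mean vector of the target-domain distribution (same length as domain).
    alpha: perturbation coefficient controlling all three noise scales.
    """
    rng = random.Random(seed)
    # Hypothetical relative intensities: weak / directional / micro.
    sd_bio, sd_dom, sd_spe = 0.10 * alpha, 0.50 * alpha, 0.01 * alpha

    # Biological part: zero-mean, weak-intensity noise.
    bio_p = [x + rng.gauss(0.0, sd_bio) for x in bio]
    # Domain part: noise whose mean matches the target-domain mean (assumed reading).
    dom_p = [x + rng.gauss(m, sd_dom) for x, m in zip(domain, target_mean)]
    # Species-specific part: zero-mean, micro-intensity noise.
    spe_p = [x + rng.gauss(0.0, sd_spe) for x in species]

    # The description fuses the parts with a mapping network; here they are
    # simply concatenated to keep the sketch short.
    return bio_p + dom_p + spe_p


enhanced = perturb_latent([1.0, 2.0], [0.5], [3.0], [0.25], alpha=1.0, seed=1)
```

With `alpha=0.0` the noise vanishes, so only the domain part moves (by the target mean), which makes the directional nature of that perturbation easy to check.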

[0016] Furthermore, the perturbation coefficient and the enhanced latent vector are each given by an expression (the formulas themselves are not reproduced in this text). The terms of the perturbation coefficient expression are: the perturbation coefficient; the normalized illumination intensity; the species-specific weight; the normalized background complexity; and the normalized voiceprint signal-to-noise ratio. The terms of the enhanced latent vector expression are: the enhanced latent vector; the mapping network weight matrix; the biological feature vector; the weak-intensity Gaussian random perturbation of the biological features; the domain feature vector; the directional perturbation of the domain features; the species-specific feature vector; and the micro-intensity random perturbation of the species-specific features.

[0017] Further, S4 includes:
The enhanced latent vectors are mapped through a fully connected layer and processed by adaptive batch normalization to obtain intermediate layer features;
A domain feature vector is introduced, and the similarity between the intermediate layer features and the decoder hidden layer states is adjusted element-wise using the domain feature vector to obtain the adjustment result;
Based on the adjustment result, normalization is performed using the biological feature dimension as a benchmark to obtain the cross-attention weights of the image feature branch and the voiceprint feature branch;
Based on the cross-attention weights, the image data and voiceprint data generated by the decoder are weighted and fused to obtain multimodal generated samples.
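One way to picture the attention-weight computation in S4 is a domain-gated, scaled dot-product score per branch followed by a softmax over the two branches. The sketch below is only an interpretation: the per-branch key vectors, the gate having the same length as the features, the division by the square root of the biological feature dimension, and the function name are assumptions not stated in the text.

```python
import math


def branch_attention(query, image_key, voice_key, domain_gate, d_bio=256):
    """Return (image_weight, voiceprint_weight), which sum to 1.

    query: intermediate-layer features derived from the enhanced latent vector.
    image_key, voice_key: decoder hidden states of each branch (assumed per-branch).
    domain_gate: element-wise adjustment derived from the domain feature vector.
    d_bio: biological feature dimension used as the normalization benchmark.
    """
    def score(key):
        # Element-wise gated similarity, scaled by sqrt(d_bio).
        s = sum(q * k * g for q, k, g in zip(query, key, domain_gate))
        return s / math.sqrt(d_bio)

    s_img, s_voc = score(image_key), score(voice_key)
    # Numerically stable two-way softmax.
    m = max(s_img, s_voc)
    e_img, e_voc = math.exp(s_img - m), math.exp(s_voc - m)
    total = e_img + e_voc
    return e_img / total, e_voc / total


w_img, w_voc = branch_attention([1.0, 1.0], [2.0, 2.0], [1.0, 1.0], [1.0, 1.0])
```

The decoder outputs of the two branches would then be combined using `w_img` and `w_voc` as fusion weights.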

[0018] Furthermore, the loss function of the classification model is defined by a set of expressions (the formulas themselves are not reproduced in this text). The terms they define are as follows:
Overall terms: the loss function of the classification model; the reconstruction loss; the weighting coefficients; the KL divergence loss; the cross-domain alignment loss; the multimodal fusion loss; and the species feature fidelity loss.
Reconstruction terms: the pixel-level mean square error loss of the image; the mean square error loss at the voiceprint feature level; the image height; the image width; the pixel value of the original image at spatial location (i,j) and channel k; the pixel value of the generated image at spatial location (i,j) and channel k; the number of channels of the voiceprint features; the number of time steps of the voiceprint features; the value of the original voiceprint feature at position (i,j); and the corresponding value of the generated voiceprint feature.
KL divergence terms: the total number of samples in the training batch; the mean of the three-dimensional latent vector of the nth sample; and the standard deviation of the three-dimensional latent vector of the nth sample.
Cross-domain alignment terms: the maximum mean discrepancy (MMD) index; the source domain feature set; the generated-sample domain feature set; the number of samples in the source domain feature set; the Gaussian kernel feature mapping function; the number of samples in the generated-sample domain feature set; the Gaussian kernel feature map of the corresponding sample; and the squared L2 norm.
Multimodal fusion terms: the image attention weight of the nth sample; and the voiceprint attention weight of the nth sample.
Species fidelity terms: the species-specific latent vector of the nth original sample; and the species-specific latent vector of the nth generated sample.
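For orientation, the textbook forms of three of the named loss terms are written out here. These are the standard definitions of mean square error, the KL divergence of a diagonal Gaussian from a standard normal, and the squared maximum mean discrepancy. They are consistent with the quantities listed in [0018] but are not taken from the patent's own formulas. All symbols are assumed notation, including the image channel count C, the voiceprint dimensions C_v and T, the latent index d, and the set names.

```latex
% Pixel-level image MSE over height H, width W and C channels (C is assumed notation)
\mathcal{L}_{\mathrm{img}} = \frac{1}{HWC}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{k=1}^{C}
    \left(x_{ijk}-\hat{x}_{ijk}\right)^{2}

% Voiceprint feature MSE over C_v channels and T time steps
\mathcal{L}_{\mathrm{voc}} = \frac{1}{C_{v}T}\sum_{i=1}^{C_{v}}\sum_{j=1}^{T}
    \left(v_{ij}-\hat{v}_{ij}\right)^{2}

% KL divergence of N(\mu_n, \sigma_n^2) from N(0, I), averaged over a batch of N samples
\mathcal{L}_{\mathrm{KL}} = -\frac{1}{2N}\sum_{n=1}^{N}\sum_{d}
    \left(1+\log\sigma_{n,d}^{2}-\mu_{n,d}^{2}-\sigma_{n,d}^{2}\right)

% Squared MMD between source-domain features D_s and generated-sample features D_g,
% with \phi the Gaussian-kernel feature map
\mathrm{MMD}^{2}(D_{s},D_{g}) = \left\|\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\phi\!\left(d_{i}^{s}\right)
    -\frac{1}{n_{g}}\sum_{j=1}^{n_{g}}\phi\!\left(d_{j}^{g}\right)\right\|^{2}
```

How these terms are weighted and combined with the multimodal fusion loss and the species fidelity loss is not recoverable from the text, so no combined expression is given.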

[0019] Further, S5 includes:
Calculate the image structural similarity and voiceprint cosine similarity between the multimodal generated samples and the original multimodal samples;
Calculate the Euclidean distance between the species-specific features of the multimodal generated samples and the species-specific features of the original samples, and use it as the species feature fidelity index;
Calculate the Euclidean distance between the domain feature vector of the multimodal generated sample and the target domain feature vector corresponding to the target domain environmental parameters, and use it as the domain fit index; wherein image structural similarity, voiceprint cosine similarity, species feature fidelity and domain fit together constitute the multi-dimensional quality assessment system;
Calculate a comprehensive score based on image structural similarity, voiceprint cosine similarity, species feature fidelity, and domain fit. When the comprehensive score is higher than a preset threshold, the multimodal generated sample is kept as a high-quality augmented sample;
Use the classification model corresponding to the screened high-quality augmented samples to carry out cross-domain identification of wild animals, obtain the identification results, and complete the few-shot identification of wild animals.
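The screening step can be illustrated with a small scoring function. This is a sketch under assumptions, not the patent's scoring formula: the equal weights, the conversion of each Euclidean distance into a closeness value by dividing by the larger of the two vector norms, the threshold of 0.8, and all function names are hypothetical. The SSIM and cosine similarity values are taken as precomputed inputs.

```python
import math


def _norm(a):
    return math.sqrt(sum(x * x for x in a))


def _distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closeness(a, b):
    """Map a Euclidean distance to [0, 1]; 1.0 means the vectors are identical."""
    scale = max(_norm(a), _norm(b), 1e-12)
    return max(0.0, 1.0 - _distance(a, b) / scale)


def composite_score(ssim, cos_sim, spe_orig, spe_gen, dom_gen, dom_target,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four quality dimensions into one score (weights are assumed)."""
    terms = (
        ssim,                            # image structural similarity
        cos_sim,                         # voiceprint cosine similarity
        closeness(spe_orig, spe_gen),    # species feature fidelity
        closeness(dom_gen, dom_target),  # domain fit
    )
    return sum(w * t for w, t in zip(weights, terms))


def keep_sample(score, threshold=0.8):
    """Keep a generated sample only if its score exceeds the preset threshold."""
    return score > threshold


score = composite_score(0.9, 0.85, [1.0, 0.0], [0.9, 0.1], [0.2, 0.5], [0.2, 0.5])
```

Generated samples for which `keep_sample` returns `False` would be discarded before the classification model is trained.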

[0020] Furthermore, the comprehensive score is given by an expression (the formula itself is not reproduced in this text) whose terms are: the comprehensive score; the image structural similarity; the voiceprint cosine similarity; the original species-specific features; the generated species-specific features; the maximum-value function; the L2 norm; the domain feature vector of the generated sample; and the domain feature vector of the target domain.

Attached Figure Description

[0021] This specification will be further described by way of exemplary embodiments, which will be described in detail with reference to the accompanying drawings. These embodiments are not limiting; in these embodiments, the same reference numerals denote the same structures, wherein:
Figure 1 is an exemplary flowchart of a few-shot identification method for wild animals based on variational autoencoders and cross-domain alignment, according to some embodiments of this specification;
Figure 2 is an exemplary flowchart of the hierarchical structure and data flow according to some embodiments of this specification;
Figure 3 is an exemplary flowchart of virtual domain expansion according to some embodiments of this specification;
Figure 4 is an exemplary flowchart of multimodal collaborative recognition logic according to some embodiments of this specification.

Detailed Implementation

[0022] The specific embodiments of the present invention are described below to enable those skilled in the art to understand the present invention. However, it should be understood that the present invention is not limited to the scope of the specific embodiments. For those skilled in the art, various changes are obvious as long as they are within the spirit and scope of the present invention as defined and determined by the appended claims. All inventions utilizing the concept of the present invention are protected.

[0023] Example. Figure 1 is an exemplary flowchart of a few-shot identification method for wild animals based on variational autoencoders and cross-domain alignment, according to some embodiments of this specification. As shown in Figure 1, the process includes the following steps. In some embodiments, the process may be executed by a processor.

[0024] The overall system architecture of this invention is designed around "precise control of multimodal features + enhanced domain adaptability" and employs a modular, layered architecture. It comprises five functional modules arranged sequentially along the data flow. The modules connect through standardized data interfaces, forming a complete technical chain of "preprocessing-encoding-perturbation-decoding-screening." The specific architecture is as follows:
1. Core Architecture Logic: Based on the core logic of "feature separation-targeted enhancement-collaborative fusion-quality control," this architecture explicitly separates biological features, domain features, and species-specific features, enabling independent control of features across different dimensions. This resolves the contradiction between species distortion and domain adaptability that feature coupling causes in traditional generative models.
2. Module Composition and Interaction:
First Layer: Multimodal Sample Preprocessing Module. Serving as the system's data input, it receives raw samples and environmental parameters, performs data cleaning, feature extraction, and parameter matching, and provides a standardized data foundation for subsequent processing.
Second Layer: 3D Separation Encoder Module. Receives the preprocessed standardized data, extracts multimodal features through a dual-branch attention network, and then obtains 3D latent vectors using a dedicated separation algorithm, achieving explicit splitting of feature dimensions.
Third Layer: Species Adaptive Perturbation Module. Based on the 3D latent vectors and target domain distribution information, it performs directional perturbation operations, enhancing domain adaptability while protecting core species features.
Fourth Layer: Multimodal Cross-Attention Decoder Module. Receives enhanced latent vectors and achieves deep alignment and fusion of multimodal features through an intermediate-layer cross-attention mechanism, decoding and generating multimodal samples.
Fifth Layer: Intelligent Screening Module. Performs multi-dimensional quality checks on the generated samples, removes invalid samples, and outputs high-quality augmented samples, providing data support for subsequent cross-domain alignment layers.
3. Data Flow and Control Flow: The data flow is unidirectional, following the path "original sample → standardized data → 3D latent vector → enhanced latent vector → generated sample → high-quality augmented sample". The control flow adaptively adjusts each module through domain feature vectors and species-specific weights, ensuring that the generated sample conforms to the species characteristics and is adapted to the target domain environment.

[0025] S1. Use the target domain environmental parameters and species labels to extract features and match parameters of the original multimodal samples to obtain standardized multimodal samples, domain feature vectors and species-specific weights.

[0026] Raw multimodal samples refer to unprocessed wildlife data collected directly from the source domain. For example, raw multimodal samples may include RGB or infrared images (single-channel or three-channel) with a resolution of 640×480, and the corresponding raw audio data before Mel-frequency cepstral coefficient (MFCC) conversion.

[0027] Target domain environmental parameters refer to data that characterize the physical environment of the target observation area.

[0028] In some embodiments, environmental parameters of the target domain can be obtained through a geographic information system (GIS) or on-site sensors, specifically including latitude and longitude, altitude, and vegetation cover type (such as coniferous forest, shrubland, etc.).

[0029] Standardized multimodal samples refer to data obtained after denoising, normalizing, and feature fitting of the original samples.

[0030] In some embodiments, the processor can process the image by Gaussian filtering and FADC adaptive frequency sensing convolution, and process the voiceprint by wavelet decomposition to denoise, finally obtaining a standardized 640×480×3 image tensor and a 128×50 voiceprint feature matrix.

[0031] A domain feature vector is a low-dimensional vector that explicitly represents the characteristics of the data acquisition environment. In some embodiments, the domain feature vector specifically includes:
Illumination intensity (L): obtained by statistical analysis of the image brightness histogram and normalized to [0,1];
Background complexity (B): calculated based on the image edge density and normalized to [0,1];
Voiceprint signal-to-noise ratio (S): estimated by signal energy ratio and normalized to [0,1].
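The three components of the domain feature vector can be approximated with simple statistics. The sketch below is illustrative only: mean brightness stands in for the histogram analysis, a finite-difference gradient threshold stands in for the edge detector, the ratio signal / (signal + noise) is one possible way to map an energy ratio into [0,1], and the function names and the threshold of 30 are assumptions.

```python
def illumination(gray):
    """Mean brightness of an 8-bit grayscale image (list of rows), scaled to [0, 1]."""
    pixels = [p for row in gray for p in row]
    return sum(pixels) / (255.0 * len(pixels))


def edge_density(gray, threshold=30):
    """Fraction of positions whose local gradient magnitude exceeds the threshold."""
    h, w = len(gray), len(gray[0])
    if h < 2 or w < 2:
        raise ValueError("image must be at least 2x2")
    edges = 0
    for i in range(h - 1):
        for j in range(w - 1):
            gx = abs(gray[i][j + 1] - gray[i][j])  # horizontal difference
            gy = abs(gray[i + 1][j] - gray[i][j])  # vertical difference
            if gx + gy > threshold:
                edges += 1
    return edges / ((h - 1) * (w - 1))


def snr_score(signal_energy, noise_energy):
    """Map the signal/noise energy ratio into [0, 1] as signal / (signal + noise)."""
    total = signal_energy + noise_energy
    return signal_energy / total if total > 0 else 0.0


def domain_feature_vector(gray, signal_energy, noise_energy):
    """Return [L, B, S] for one image and its paired audio clip."""
    return [illumination(gray), edge_density(gray),
            snr_score(signal_energy, noise_energy)]


flat = [[128] * 4 for _ in range(4)]
vec = domain_feature_vector(flat, signal_energy=3.0, noise_energy=1.0)
```

A uniformly gray image gives zero background complexity, while a high-contrast checkerboard gives the maximum value, which matches the intended meaning of B.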

[0032] Species-specific weights are scalar parameters assigned based on the salience of species morphological characteristics.

[0033] In some embodiments, the weight is obtained based on species feature library matching and has a value range of [0.8, 1.2]. For example, the giant panda with significant features is assigned a value of 1.1, and the tree frog with obvious mimicry features is assigned a value of 0.9.

[0034] S2. Features of standardized multimodal samples are extracted using a dual-branch attention network, and a three-dimensional latent vector containing biological features, domain features, and species-specific features is constructed by combining domain feature vectors and species-specific weights through a three-dimensional latent vector separation algorithm.

[0035] A dual-branch attention network is a neural network module used to extract image and voiceprint features in parallel.

[0036] In some embodiments, the dual-branch attention network includes: Image branch: An improved ResNet-18 architecture with the end fully connected layers removed is adopted, and a spatial attention mechanism is embedded in the 2nd and 3rd residual blocks; Voiceprint branch: A bidirectional LSTM network with 256 hidden units is adopted, combined with a channel attention mechanism.

[0037] The three-dimensional latent vector separation algorithm refers to the computational process of explicitly decoupling fused features into vectors of different attributes through specific network structure design.

[0038] In some embodiments, the three-dimensional latent vector separation algorithm uses a two-level fully connected layer for mapping and embeds an adaptation vector composed of domain feature vectors and species-specific weights during the mapping process, thereby separating biological features, domain features, and species-specific features.

[0039] A three-dimensional latent vector is a hidden layer representation vector with a dimension of 384, output by a separation algorithm.

[0040] In some embodiments, the three-dimensional latent vector includes: biological features: 256 dimensions, representing the basic biological morphology of the species; domain features: 64 dimensions, representing domain information such as ambient light and background; and species-specific features: 64 dimensions, representing key identification features unique to the species.
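The 256 / 64 / 64 partition can be made concrete with a small helper. The ordering of the three segments inside the 384-dimensional vector (biological, then domain, then species-specific) and the names used are assumptions, since the text gives only the sizes.

```python
BIO_DIM, DOMAIN_DIM, SPECIES_DIM = 256, 64, 64
LATENT_DIM = BIO_DIM + DOMAIN_DIM + SPECIES_DIM  # 384


def split_latent(z):
    """Split a 384-dimensional latent vector into its three parts.

    Assumed layout: [biological | domain | species-specific].
    """
    if len(z) != LATENT_DIM:
        raise ValueError(f"expected {LATENT_DIM} values, got {len(z)}")
    bio = z[:BIO_DIM]
    domain = z[BIO_DIM:BIO_DIM + DOMAIN_DIM]
    species = z[BIO_DIM + DOMAIN_DIM:]
    return bio, domain, species


bio, domain, species = split_latent(list(range(LATENT_DIM)))
```

Keeping the parts addressable by slice is what allows each part to be perturbed independently later in the pipeline.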

[0041] In some embodiments, the processor can input image data from the standardized multimodal samples into a residual network containing a spatial attention mechanism to extract image feature vectors; input voiceprint data from the standardized multimodal samples into a bidirectional long short-term memory network containing a channel attention mechanism to extract voiceprint feature vectors; weight and concatenate the image feature vectors and voiceprint feature vectors to obtain multimodal fusion features; construct a mapping network containing fully connected layer weights, input the multimodal fusion features into the mapping network, and embed an adaptation vector composed of domain feature vectors and species-specific weights during the mapping process; and output, through the mapping network, a three-dimensional latent vector that explicitly separates biological features, domain features, and species-specific features.

[0042] Residual networks are convolutional neural network structures used to extract deep features from images.

[0043] In some embodiments, the network is an improvement on the ResNet-18 architecture, removing the fully connected layers at the end of the original network and retaining the residual block structure to solve the gradient vanishing problem, and is used to output high-dimensional spatial feature maps.

[0044] Image feature vectors refer to image representation data after Global Average Pooling (GAP) processing.

[0045] In some embodiments, the vector is obtained by spatial dimension compression of the feature map output by the residual network, with a dimension of 128, representing the core visual information of the input image.

[0046] Bidirectional Long Short-Term Memory (Bi-LSTM) is a variant of recurrent neural network used to process time series data.

[0047] In some embodiments, the network comprises two LSTM layers, one forward and one backward, with the number of hidden units set to 256, for capturing temporal dependencies in the voiceprint data.

[0048] Voiceprint feature vectors refer to audio representation data after temporal compression.

[0049] In some embodiments, the vector is obtained by channel attention weighting and global pooling of the hidden state sequence output by Bi-LSTM, and has a dimension of 512.

[0050] Multimodal fusion features refer to the joint representation of image and voiceprint features in the feature space.

[0051] In some embodiments, the processor obtains multimodal fusion features by weighted concatenation, with a dimension of 640.
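The dimension arithmetic (128 image dimensions plus 512 voiceprint dimensions gives 640) can be checked with a minimal weighted concatenation. The scalar per-branch weights and the function name are assumptions; the text does not say how the weights are chosen.

```python
def weighted_concat(image_vec, voice_vec, w_image=0.5, w_voice=0.5):
    """Scale each branch by its weight, then concatenate image first, voiceprint second."""
    return [w_image * x for x in image_vec] + [w_voice * x for x in voice_vec]


# 128-dimensional image features and 512-dimensional voiceprint features.
fused = weighted_concat([1.0] * 128, [2.0] * 512, w_image=0.4, w_voice=0.6)
```

The resulting vector has 640 entries, matching the stated dimension of the multimodal fusion features.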

[0052] Mapping networks refer to fully connected neural networks used to map fused features to a three-dimensional latent space.

[0053] In some embodiments, the network comprises two fully connected layers (with dimensions of 512 and 384, respectively) and introduces a nonlinear transformation through the ReLU activation function to decouple and generate three-dimensional latent vectors.

[0054] An adaptation vector is a control vector used to inject prior environmental and species information into a mapping network.

[0055] In some embodiments, the adaptation vector is formed by concatenating the domain feature vector and the weighted domain feature vector, and is directly embedded in the operation process of the mapping network.

[0056] In some embodiments, the expression for the three-dimensional latent vector is: [formula not reproduced in this text], where the defined quantities are, in order: the three-dimensional latent vector; the activation function; the two weight matrices of the fully connected layers; the fusion features; the two bias terms; the concatenation (splicing) operation; the domain feature vector; and the species-specific weights.

[0057] In some embodiments, Figure 2 shows an exemplary flowchart illustrating the composition and data flow of each level of this invention. The system is divided into five levels: the first level performs multimodal sample preprocessing, outputting standardized samples and control parameters; the second level decouples features into biological features, domain features, and species-specific features through a dual-branch attention network and a three-dimensional latent vector separation algorithm; the third level performs species-adaptive perturbation; the fourth level generates samples through cross-attention decoding; and the fifth level performs quality screening. The data flow is unidirectional, from original samples → standardized data → three-dimensional latent vectors → enhanced latent vectors → generated samples → high-quality augmented samples, and is dynamically controlled by the domain feature vectors and species weights.

[0058] Image branch: An improved ResNet-18 architecture is adopted (the last two fully connected layers are removed), and a spatial attention mechanism is embedded in the 2nd and 3rd residual blocks. The layer details are as follows:

[0059] Voiceprint branch: Bidirectional LSTM + channel attention mechanism, with the following hierarchical details:

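[0060] The image feature vector and the voiceprint feature vector are concatenated with weights in a ratio of 0.6:0.4 to obtain the 640-dimensional fusion features. The formula is: [formula not reproduced in this text].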
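The weighted concatenation can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `weighted_concat` is hypothetical, and assigning 0.6 to the image branch and 0.4 to the voiceprint branch is an assumption, since the text gives only the ratio. The 128- and 512-dimensional inputs follow the dimensions stated for the two branches.

```python
import random

IMG_DIM, VOICE_DIM = 128, 512   # branch dimensions stated in the text
W_IMG, W_VOICE = 0.6, 0.4       # ASSUMPTION: 0.6 applies to the image branch


def weighted_concat(img_vec, voice_vec, w_img=W_IMG, w_voice=W_VOICE):
    """Scale each modality by its weight, then concatenate the two vectors."""
    if len(img_vec) != IMG_DIM or len(voice_vec) != VOICE_DIM:
        raise ValueError("unexpected feature dimension")
    return [w_img * x for x in img_vec] + [w_voice * x for x in voice_vec]


# Random stand-ins for the outputs of the two feature-extraction branches.
rng = random.Random(0)
img = [rng.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
voice = [rng.gauss(0.0, 1.0) for _ in range(VOICE_DIM)]

fused = weighted_concat(img, voice)
print(len(fused))  # 128 + 512 = 640
```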

[0061] Three-dimensional latent vector separation algorithm (self-designed): this algorithm uses two fully connected layers for the mapping and embeds the adaptation vector constructed from the domain features and species weights, explicitly separating the three types of features. The formula is as follows: [formula not reproduced in this text], where the weight matrices are those of the fully connected layers, the additive terms are the bias terms, and σ is the ReLU activation function. The concatenation operation ultimately yields a 384-dimensional three-dimensional latent vector, of which 256 dimensions characterize the biological features, 64 dimensions characterize the domain features, and 64 dimensions characterize the species-specific features.
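A rough sketch of the two-layer mapping and the 256/64/64 split is given below. Because the formula itself is not available, several points are assumptions: the adaptation vector is appended to the input of the first layer, the domain feature vector has three components (illumination, background complexity, voiceprint signal-to-noise ratio), the species-specific weight is a scalar, and ReLU is applied after both layers. All function names are hypothetical, and the random weights stand in for trained parameters.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer: one dot product per output unit, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def init_layer(rng, n_out, n_in):
    """Random weights (stand-in for trained parameters) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def map_to_latent(fused, domain_vec, species_w, layer1, layer2):
    # ASSUMPTION: the adaptation vector is injected at the input of layer 1.
    adapt = list(domain_vec) + [species_w]
    hidden = relu(linear(list(fused) + adapt, *layer1))   # 512 units
    return relu(linear(hidden, *layer2))                  # 384 units


def split_latent(z):
    """Slice the 384-d vector into biological, domain, and species parts."""
    return z[:256], z[256:320], z[320:384]


rng = random.Random(0)
fused = [rng.gauss(0.0, 1.0) for _ in range(640)]
domain_vec = [0.4, 0.7, 0.5]    # hypothetical normalized domain parameters
species_w = 0.8                 # hypothetical species-specific weight

layer1 = init_layer(rng, 512, 640 + len(domain_vec) + 1)
layer2 = init_layer(rng, 384, 512)

z = map_to_latent(fused, domain_vec, species_w, layer1, layer2)
z_bio, z_dom, z_spe = split_latent(z)
print(len(z_bio), len(z_dom), len(z_spe))  # 256 64 64
```

Slicing a single output vector is the simplest way to obtain three explicitly separated parts; whether the original design instead uses separate output heads cannot be determined from the text.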

[0062] S3. Calculate the perturbation coefficient based on the domain feature vector and species-specific weights, and perform differentiated perturbation on each feature in the three-dimensional latent vector in combination with the target domain distribution information to obtain an enhanced latent vector adapted to the target domain.

[0063] The perturbation coefficient is a dynamic scalar used to control the magnitude of perturbations in the latent space.

[0064] In some embodiments, the perturbation coefficient is calculated by weighting the domain feature vector with species-specific weights, and is used to balance the species fidelity and environmental adaptability of the generated samples.

[0065] Target domain distribution information refers to parameters that characterize the statistical regularities of target environmental data.

[0066] In some embodiments, this information is obtained by pre-statistical analysis of a small number of samples collected from the target domain, specifically represented as the mean vector of the target domain features.

[0067] An enhanced latent vector is a vector obtained by applying differential perturbations to a three-dimensional latent vector.

[0068] In some embodiments, the enhanced latent vector is obtained by applying weak perturbations to biological features, directional perturbations to domain features, and micro perturbations to species features, and then fusing them through a mapping network, aiming to simulate the feature distribution in the target domain environment.

[0069] In some embodiments, the processor can normalize the illumination intensity, background complexity, and voiceprint signal-to-noise ratio in the domain feature vector to obtain normalized domain parameters. Using the species-specific weights as the base gain coefficient, the processor then computes a weighted sum of the normalized domain parameters and outputs a dynamically adapted perturbation coefficient that controls the intensity of subsequent perturbations.

[0070] Normalized domain parameters refer to environmental characteristic values after numerical scaling.

[0071] In some embodiments, the processor maps the illumination intensity, background complexity, and voiceprint signal-to-noise ratio to the [0,1] interval, respectively, to eliminate the influence of dimensional differences on the calculation of the perturbation coefficient.
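The normalization and weighted summation can be illustrated with the sketch below. The text specifies only that the three parameters are mapped to [0,1] and that the species-specific weight acts as a base gain. The min-max scaling, the value ranges in `RANGES`, and the mixing weights `MIX` are therefore hypothetical choices made for illustration.

```python
def min_max(value, lo, hi):
    """Map a raw value to [0, 1], clipping inputs outside [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


# ASSUMPTION: hypothetical physical ranges and mixing weights.
RANGES = {"lux": (0.0, 1000.0), "bg": (0.0, 1.0), "snr_db": (0.0, 40.0)}
MIX = (0.5, 0.3, 0.2)


def perturbation_coefficient(lux, bg_complexity, snr_db, species_w,
                             ranges=RANGES, mix=MIX):
    """Species weight as base gain times a weighted sum of normalized parameters."""
    light = min_max(lux, *ranges["lux"])
    background = min_max(bg_complexity, *ranges["bg"])
    snr = min_max(snr_db, *ranges["snr_db"])
    return species_w * (mix[0] * light + mix[1] * background + mix[2] * snr)


alpha = perturbation_coefficient(lux=500.0, bg_complexity=0.5,
                                 snr_db=20.0, species_w=0.8)
print(round(alpha, 6))
```

Every normalized parameter is 0.5 in this example, so the weighted sum is 0.5 and the coefficient is the species weight scaled by that value.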

[0072] In some embodiments, the processor can obtain the mean vector from the target domain distribution information and calculate the perturbation variance corresponding to each feature using the perturbation coefficient. Based on the mean vector and the perturbation variance, the processor applies three perturbations to the three-dimensional latent vector: a weak-intensity Gaussian random perturbation with zero mean and variance controlled by the perturbation coefficient, applied to the biological feature part; a directional perturbation whose mean matches the target domain distribution information and whose variance is controlled by the perturbation coefficient, applied to the domain feature part; and a micro-intensity random perturbation with zero mean and variance controlled by the perturbation coefficient, applied to the species-specific feature part. This yields the perturbed features. A mapping network then fuses the perturbed features and outputs an enhanced latent vector adapted to the target domain.
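The three differentiated perturbations might look like the sketch below. It takes the description literally: the domain part receives noise whose mean is the target domain mean vector, and the other two parts receive zero-mean noise. The per-part scale factors in `scales` and the choice to set each standard deviation to the perturbation coefficient times a scale are assumptions, and the final mapping-network fusion is replaced by plain concatenation to keep the example short.

```python
import random


def perturb_latent(z_bio, z_dom, z_spe, target_mean, alpha, rng,
                   scales=(0.1, 1.0, 0.01)):
    """Apply weak, directional, and micro perturbations to the three parts.

    ASSUMPTION: standard deviation = alpha * scale for each part, with
    hypothetical scales chosen so that species noise << biological noise.
    """
    sd_bio, sd_dom, sd_spe = (alpha * s for s in scales)
    bio = [v + rng.gauss(0.0, sd_bio) for v in z_bio]          # weak, zero mean
    dom = [v + rng.gauss(m, sd_dom)                            # mean = target mean
           for v, m in zip(z_dom, target_mean)]
    spe = [v + rng.gauss(0.0, sd_spe) for v in z_spe]          # micro, zero mean
    # The mapping-network fusion step is omitted; parts are concatenated.
    return bio + dom + spe


rng = random.Random(42)
z_bio = [0.5] * 256
z_dom = [0.0] * 64
z_spe = [1.0] * 64
target_mean = [0.2] * 64    # hypothetical target-domain mean vector

enhanced = perturb_latent(z_bio, z_dom, z_spe, target_mean, alpha=0.4, rng=rng)
no_noise = perturb_latent(z_bio, z_dom, z_spe, target_mean, alpha=0.0, rng=rng)
print(len(enhanced))  # 384
```

With a coefficient of zero, only the directional shift remains, which makes the role of the target domain mean easy to see.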

[0073] The mean vector refers to the central parameter used to guide directional perturbations.

[0074] In some embodiments, the vector is taken directly from the target domain distribution information to ensure that the generated domain features are statistically similar to the target environment.

[0075] Perturbation variance is a parameter that controls the degree of dispersion of random noise. In some embodiments, this parameter is positively correlated with the perturbation coefficient and is used to dynamically adjust the magnitude of change in different feature dimensions.

[0076] The perturbed features refer to feature vectors that have been superimposed with noise in the latent space. For example, the perturbed features may include: biological features superimposed with weak Gaussian noise, domain features superimposed with directional perturbations, and species features superimposed with micro-noise.

[0077] In some embodiments, the expression for the perturbation coefficient is: [formula not reproduced in this text]; the expression for the enhanced latent vector is: [formula not reproduced in this text]. The defined quantities are, in order: the perturbation coefficient; the normalized illumination intensity; the species-specific weights; the normalized background complexity; the normalized voiceprint signal-to-noise ratio; the enhanced latent vector; the mapping network weight matrix; the biological feature vector; the weak-intensity Gaussian random perturbation of the biological features; the domain feature vector; the directional perturbation of the domain features; the species-specific feature vector; and the micro-intensity random perturbation of the species-specific features.

[0078] In some embodiments, Figure 3 shows an exemplary flowchart of the virtual domain augmentation method of the present invention. In the latent space, the system applies differentiated processing to different parts of the three-dimensional latent vector based on the calculated perturbation coefficient: weak perturbations are applied to biological features to preserve the animal's intrinsic structure, directional perturbations matching the distribution of the target domain are applied to domain features to simulate environmental shift, and micro-perturbations are applied to species-specific features to prevent the loss of key features. The enhanced latent vector adapted to the target domain is then output after fusion by a mapping network.

[0079] S4. Use the cross-attention mechanism to align and fuse the image features and voiceprint features in the enhanced latent vectors, and decode to generate multimodal generated samples.

[0080] Multimodal generated samples refer to the amplified data obtained by the decoder. For example, multimodal generated samples may include a 640×480×3 image and a 128×50 voiceprint feature sequence generated by fusion through a cross-attention mechanism.

[0081] In some embodiments, the processor can map the enhanced latent vector through a fully connected layer and apply adaptive batch normalization to obtain the intermediate layer features. It then introduces the domain feature vector and uses it to adjust, element-wise, the similarity between the intermediate layer features and the hidden layer state of the decoder, obtaining the adjustment result. Based on the adjustment result, a normalization calculation referenced to the biological feature dimension yields the cross-attention weights of the image feature branch and the voiceprint feature branch. Finally, based on the cross-attention weights, the image data and voiceprint data generated by the decoder are weighted and fused to obtain the multimodal generated samples.

[0082] Intermediate layer features refer to the data representations in the hidden layer state of the decoder network.

[0083] In some embodiments, the intermediate layer features are obtained by mapping the enhanced latent vectors through a fully connected layer (dimension 4096), and have not yet been recovered into specific image or voiceprint data.

[0084] The adjustment result refers to the feature similarity matrix after modulation by the domain feature vector.

[0085] In some embodiments, the processor uses the domain feature vector to perform element-wise multiplication adjustment on the inner product of the intermediate layer features and the hidden layer state to obtain the adjustment result, thereby introducing the influence of the environment on modality alignment.

[0086] Cross-attention weights are coefficients used to dynamically balance the contributions of image and voiceprint decoding.

[0087] In some embodiments, the cross-attention weights are calculated based on the similarity between intermediate layer features and the output of the adaptive BatchNorm, and are obtained by adjusting the domain feature vectors element by element.

[0088] In some embodiments, Figure 4 shows an exemplary flowchart of the multimodal collaborative recognition logic of this invention. It illustrates how image features and voiceprint features are dynamically weighted and fused with domain feature vectors under a cross-attention mechanism, and how a multi-dimensional quality assessment system ultimately selects high-quality samples for classification model training.

[0089]

[0090] To solve the multimodal alignment problem, the attention weight calculation formula is as follows: [formula not reproduced in this text], where the defined quantities are, in order: the dimension of the biological feature vector (fixed at 256); element-wise multiplication; the domain feature vector D; the square root of the biological feature dimension, used to normalize the similarity calculation; the inner product of the intermediate layer features and the hidden layer state of the decoder; the feature vector of the j-th hidden state of the decoder; the transpose of the intermediate layer feature vector; the Softmax activation function, which normalizes the result into a probability distribution (with the weights summing to 1); the attention weight of the voiceprint feature branch; and the attention weight of the image feature branch. The formula for generating samples is: [formula not reproduced in this text], where the defined quantities are, in order: the voiceprint sample generated by the decoder (a 128×50 feature matrix); the image sample generated by the decoder (a 640×480×3 tensor); the fusion weight of the voiceprint features; the cross-attention weight (the fusion weight of the image features), with the two weights summing to 1; and the multimodal generated sample (fused image plus augmented voiceprint sample).
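One possible reading of the attention weight calculation is sketched below: an inner product between the intermediate layer features and each decoder hidden state, an element-wise gate derived from the domain feature vector, scaling by the square root of the biological feature dimension (256), and a Softmax over the two branches. How the domain feature vector is shaped to match the similarity scores is not recoverable from the text, so the per-branch `domain_gate` here is a hypothetical stand-in, as are the function names and the toy inputs.

```python
import math

D_BIO = 256  # biological-feature dimension stated in the text


def softmax(xs):
    """Numerically stable softmax; outputs sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(mid_feat, hidden_states, domain_gate):
    """Return one weight per branch (image first, voiceprint second).

    ASSUMPTION: domain_gate holds one modulation factor per hidden state.
    """
    scores = [sum(q * h for q, h in zip(mid_feat, state))
              for state in hidden_states]
    gated = [s * g for s, g in zip(scores, domain_gate)]
    return softmax([g / math.sqrt(D_BIO) for g in gated])


mid = [0.1] * 8          # toy intermediate-layer feature
h_img = [1.0] * 8        # toy decoder hidden state, image branch
h_voice = [0.5] * 8      # toy decoder hidden state, voiceprint branch

w_img, w_voice = cross_attention_weights(mid, [h_img, h_voice],
                                         domain_gate=[1.0, 1.0])
balanced = cross_attention_weights(mid, [h_img, h_img], domain_gate=[1.0, 1.0])
print(w_img > w_voice, abs(w_img + w_voice - 1.0) < 1e-12)
```

Since a 640×480×3 image and a 128×50 voiceprint matrix cannot be summed directly, the weights would presumably scale each decoder output separately before the two are packaged as one multimodal sample.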

[0091] S5. Use a pre-set multi-dimensional quality assessment system to screen multimodal generated samples, and use the classification model corresponding to the screened high-quality amplified samples to perform cross-domain identification of wild animals, obtain the identification results, and complete the identification of wild animals with few samples; wherein, the classification model is obtained by training based on high-quality amplified samples.

[0092] A multidimensional quality assessment system refers to a set of quantitative indicators used to screen generated samples. For example, a multidimensional quality assessment system may include image structural similarity (SSIM), cosine similarity of voiceprints (CosSim), species feature fidelity (Euclidean distance between feature vectors), and domain fit (Euclidean distance between domain vectors).

[0093] High-quality amplification samples refer to generated samples that have passed the screening of a multi-dimensional quality assessment system.

[0094] In some embodiments, a sample is considered a high-quality sample and added to the training set only if its overall score is greater than 0.72.

[0095] A classification model is a deep neural network system used for feature learning and category prediction of multimodal data on wild animals. Structurally, a classification model consists of the following four levels: Input layer: Receives standardized multimodal samples, including image tensors and voiceprint feature sequences.

[0096] Feature Extraction Backbone: Image Branch: Employs deep convolutional neural networks (such as ResNet-50 or EfficientNet) to extract high-dimensional visual features, focusing on the animal's texture, contour, and body posture information. Voiceprint Branch: Employs recurrent neural networks (such as bidirectional LSTM) or one-dimensional convolutional networks (1D-CNN) to extract temporal voiceprint features, focusing on the frequency variations and rhythmic patterns of calls.

[0097] Multimodal fusion layer: Through feature concatenation or attention weighting mechanisms, image feature vectors and voiceprint feature vectors are mapped to a unified joint feature space to achieve modal complementarity.

[0098] The classification output layer consists of fully connected layers and a softmax activation function. It maps the fused features to the probability distribution of each species category and outputs the category with the highest confidence as the recognition result.

[0099] In some embodiments, the training process of the classification model specifically includes the following steps:
Step 1: Constructing a hybrid augmentation dataset. Using the aforementioned VAE generative model, high-quality augmented samples adapted to the target domain environment (such as winter snow scenes or low-light nighttime) are generated for species with scarce samples in the source domain (such as golden monkeys and giant pandas). The selected high-quality augmented samples are mixed with the original source domain samples at a ratio of 4:1 to 9:1 to construct a hybrid training set containing rich environmental changes and species characteristics.
Step 2: Model initialization and parameter configuration. A multimodal classification network is built. The image branch is loaded with weights pre-trained on a large visual database (ImageNet) to accelerate convergence, and the voiceprint branch is initialized using Xavier initialization. Optimizer: the AdamW optimizer is selected, with an initial learning rate of 1e-4 and a weight decay coefficient of 1e-3. Learning rate strategy: a cosine annealing strategy is adopted, smoothly decaying the learning rate from 1e-4 to 1e-6 over 100 training epochs. Batch size: set to 32 or 64, adjusted according to GPU memory.
Step 3: Supervised training and loss calculation. The mixed training set is input into the classification model, and the cross-entropy loss between the model's predicted output and the real species label is calculated. To improve the model's robustness in cross-domain scenarios, label smoothing (smoothing factor set to 0.1) is introduced into the loss function to prevent the model from overfitting.
Step 4: Model iteration and early stopping. The model's accuracy and F1 score are monitored on the validation set (containing a small number of real samples from the target domain). If the validation set loss does not decrease within 10 consecutive epochs, early stopping is triggered, and the current optimal model parameters are saved as the final classification model.
Step 5: Cross-domain inference and testing. The trained classification model is deployed to the target domain monitoring device. The system takes real-time multimodal data collected from the target domain as input and outputs the species identification results of wild animals directly through forward propagation, thereby achieving high-precision identification of scarce samples in complex environments.
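The numeric settings in Steps 2 to 4 can be made concrete with a small framework-free sketch. It shows a cosine annealing schedule from 1e-4 to 1e-6 over 100 epochs, a label-smoothed cross-entropy with factor 0.1, and an early-stopping counter with a patience of 10 epochs. The function and class names are hypothetical, and the uniform form of label smoothing used here is an assumption, since the text gives only the smoothing factor.

```python
import math


def cosine_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=1e-6):
    """Cosine annealing: lr_max at epoch 0, lr_min at epoch total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


def smoothed_cross_entropy(probs, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution.

    ASSUMPTION: uniform smoothing, q = (1 - eps) * one_hot + eps / K.
    """
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = eps / k + ((1.0 - eps) if i == target else 0.0)
        loss -= q * math.log(p)
    return loss


class EarlyStopping:
    """Signal a stop after `patience` epochs without a lower validation loss."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


lr_start, lr_end = cosine_lr(0), cosine_lr(100)
loss_plain = smoothed_cross_entropy([0.7, 0.2, 0.1], target=0, eps=0.0)
loss_smooth = smoothed_cross_entropy([0.7, 0.2, 0.1], target=0, eps=0.1)

stopper = EarlyStopping(patience=10)
signals = [stopper.step(1.0) for _ in range(11)]  # loss never improves after epoch 1
print(lr_start, lr_end, signals[-1])
```

In a real training loop these pieces would normally be supplied by the deep learning framework in use; the sketch only spells out the arithmetic behind the stated hyperparameters.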

[0100] The identification result refers to the category label output by the classification model after predicting the input wildlife data.

[0101] In some embodiments, the identification result is the species category with the maximum value in the predicted probability distribution (such as "giant panda" or "golden monkey").

[0102] In some embodiments, the loss function of the classification model is given by a set of expressions: [formulas not reproduced in this text]. The defined quantities are, in order: the loss function of the classification model; the reconstruction loss; the weighting coefficients; the KL divergence loss; the cross-domain alignment loss; the multimodal fusion loss; the species feature fidelity loss; the pixel-level mean square error loss of the image; the mean square error loss at the voiceprint feature level; the image height; the image width; the pixel value of the original image at spatial location (i,j) and channel k; the pixel value of the generated image at spatial location (i,j) and channel k; the number of channels of the voiceprint features; the number of time steps of the voiceprint features; the value of the original voiceprint feature at position (i,j); the corresponding value of the generated voiceprint feature; the total number of samples in the training batch; the mean of the three-dimensional latent vector of the n-th sample; the standard deviation of the three-dimensional latent vector of the n-th sample; the maximum mean discrepancy (MMD) measure; the source domain feature set; the generated-sample domain feature set; the number of samples in the source domain feature set; the Gaussian kernel feature mapping function; the number of samples in the generated-sample domain feature set; the Gaussian kernel feature map of the corresponding sample; the squared L2 norm; the image attention weight of the n-th sample; the voiceprint attention weight of the n-th sample; the species-specific latent vector of the n-th original sample; and the species-specific latent vector of the n-th generated sample.
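For orientation, the textbook forms of several of the named loss terms are written out below. These are the standard definitions of mean square error, the Gaussian VAE KL divergence, and the maximum mean discrepancy, matched to the listed quantities. They are not recovered from the patent, and all symbols are notation assumed here. The multimodal fusion loss and the way the terms are weighted into the total loss are left out because their form cannot be inferred from the definitions alone.

```latex
\begin{align}
  % Pixel-level image reconstruction loss (3 colour channels assumed)
  \mathcal{L}_{\mathrm{img}} &= \frac{1}{3HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{3}
      \left( x_{ijk} - \hat{x}_{ijk} \right)^{2} \\
  % Voiceprint feature reconstruction loss over C channels and T time steps
  \mathcal{L}_{\mathrm{voice}} &= \frac{1}{CT} \sum_{i=1}^{C} \sum_{j=1}^{T}
      \left( v_{ij} - \hat{v}_{ij} \right)^{2} \\
  % KL divergence of N(mu_n, sigma_n^2) from N(0, I), averaged over the batch
  \mathcal{L}_{\mathrm{KL}} &= -\frac{1}{2N} \sum_{n=1}^{N} \sum_{d}
      \left( 1 + \log \sigma_{n,d}^{2} - \mu_{n,d}^{2} - \sigma_{n,d}^{2} \right) \\
  % Squared MMD between source-domain set S and generated-sample set G
  \mathrm{MMD}^{2}(S, G) &= \left\| \frac{1}{n_{s}} \sum_{i=1}^{n_{s}} \phi(s_{i})
      - \frac{1}{n_{g}} \sum_{j=1}^{n_{g}} \phi(g_{j}) \right\|_{2}^{2} \\
  % Species feature fidelity loss on the species-specific latent vectors
  \mathcal{L}_{\mathrm{spe}} &= \frac{1}{N} \sum_{n=1}^{N}
      \left\| z_{\mathrm{spe}}^{(n)} - \hat{z}_{\mathrm{spe}}^{(n)} \right\|_{2}^{2}
\end{align}
```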

[0103] In some embodiments, the processor can calculate the image structural similarity and the voiceprint cosine similarity between the multimodal generated sample and the original multimodal sample. It calculates the Euclidean distance between the species-specific features of the multimodal generated sample and those of the original sample as the species feature fidelity index, and the Euclidean distance between the domain feature vector of the multimodal generated sample and the target domain feature vector corresponding to the target domain environmental parameters as the domain fit index. A comprehensive score is then calculated from the image structural similarity, voiceprint cosine similarity, species feature fidelity, and domain fit. When the comprehensive score is higher than a preset threshold, the multimodal generated sample is kept as a high-quality augmented sample. Finally, the classification model corresponding to the screened high-quality augmented samples is used to perform cross-domain recognition of wild animals and obtain the recognition result, completing few-shot wildlife recognition.

[0104] Image structural similarity refers to an indicator that measures the degree of similarity between a generated image and the original image in terms of brightness, contrast, and structure.

[0105] In some embodiments, the closer the image structure similarity is to 1, the more realistic the generated image is in terms of preserving the appearance structure of the species.

[0106] Voiceprint cosine similarity is an index that measures the directional consistency between the generated voiceprint vector and the original voiceprint vector.

[0107] In some embodiments, the cosine value of the angle between two vectors is calculated to evaluate the fidelity of the voiceprint features.

[0108] The species characteristic fidelity index refers to a numerical value that quantifies the ability of generated samples to retain the core characteristics of a species in the latent space.

[0109] In some embodiments, the processor can obtain the species feature fidelity index by calculating the L2 norm distance between the species-specific feature vector of the generated sample and that of the original sample; the smaller the distance, the higher the fidelity.

[0110] Domain fit index refers to a numerical value that quantifies the degree of matching between generated samples and the target environment.

[0111] In some embodiments, the domain fit index is obtained by calculating the L2 norm distance between the domain feature vector of the generated sample and the target domain feature vector; the smaller the distance, the higher the fit.

[0112] The overall score refers to the weighted evaluation value used to ultimately determine the quality of the sample.

[0113] In some embodiments, the overall score is obtained by weighting and summing SSIM, cosine similarity of voiceprint, normalized species feature fidelity, and domain fit with weights of 0.3:0.3:0.2:0.2.

[0114] In some embodiments, the expression for the overall score is: [formula not reproduced in this text], where the defined quantities are, in order: the overall score; the image structural similarity; the voiceprint cosine similarity; the species-specific features of the original sample; the species-specific features of the generated sample; the function that takes the maximum value; the L2 norm; the domain feature vector of the generated sample; and the domain feature vector of the target domain.
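The scoring and screening step can be sketched as below, using the stated weights (0.3:0.3:0.2:0.2) and the stated threshold (0.72). The way the two Euclidean distances are turned into "higher is better" terms is an assumption: each distance is divided by the larger of the two vector norms and subtracted from 1, which is one plausible use of the maximum function and L2 norm mentioned in the definitions. The helper names are hypothetical.

```python
import math

WEIGHTS = (0.3, 0.3, 0.2, 0.2)   # SSIM, CosSim, species fidelity, domain fit
THRESHOLD = 0.72                 # screening threshold stated in the text


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def closeness(a, b):
    """ASSUMPTION: 1 - ||a - b|| / max(||a||, ||b||), clipped to [0, 1]."""
    denom = max(l2_norm(a), l2_norm(b))
    if denom == 0.0:
        return 1.0
    dist = l2_norm([x - y for x, y in zip(a, b)])
    return max(0.0, 1.0 - dist / denom)


def overall_score(ssim, cos_sim, spe_orig, spe_gen, dom_gen, dom_tgt,
                  weights=WEIGHTS):
    w_ssim, w_cos, w_spe, w_dom = weights
    return (w_ssim * ssim + w_cos * cos_sim
            + w_spe * closeness(spe_orig, spe_gen)
            + w_dom * closeness(dom_gen, dom_tgt))


def is_high_quality(score, threshold=THRESHOLD):
    return score > threshold


spe = [0.2, 0.4, 0.1]
dom = [0.5, 0.5, 0.5]
perfect = overall_score(1.0, 1.0, spe, spe, dom, dom)
borderline = overall_score(0.5, 0.5, spe, spe, dom, dom)
print(is_high_quality(perfect), is_high_quality(borderline))
```

In the second example both distance terms are perfect, yet the mediocre SSIM and cosine similarity keep the score near 0.70, below the threshold, so the sample would be rejected.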

Claims

1. A method for few-shot identification of wild animals based on variational autoencoders and cross-domain alignment, characterized in that the method includes:
S1. using target domain environmental parameters and species labels to extract features and match parameters of the original multimodal samples to obtain standardized multimodal samples, domain feature vectors and species-specific weights;
S2. using a dual-branch attention network to extract features from the standardized multimodal samples, and using a three-dimensional latent vector separation algorithm to combine the domain feature vectors and species-specific weights to construct a three-dimensional latent vector containing biological features, domain features and species-specific features;
S3. calculating the perturbation coefficient based on the domain feature vector and species-specific weights, and performing differentiated perturbation on each feature in the three-dimensional latent vector in combination with the target domain distribution information to obtain an enhanced latent vector adapted to the target domain;
S4. using the cross-attention mechanism to align and fuse the image features and voiceprint features in the enhanced latent vector, and decoding to obtain multimodal generated samples;
S5. using a pre-set multi-dimensional quality assessment system to screen the multimodal generated samples, and using the classification model corresponding to the screened high-quality augmented samples to perform cross-domain identification of wild animals and obtain the identification results, completing few-shot identification of wild animals; wherein the classification model is obtained by training based on the high-quality augmented samples.

2. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 1, characterized in that, S2 includes: Image data from standardized multimodal samples is input into a residual network containing a spatial attention mechanism to extract image feature vectors; Voiceprint data from standardized multimodal samples is input into a bidirectional long short-term memory network containing a channel attention mechanism to extract voiceprint feature vectors; wherein, the residual network containing a spatial attention mechanism and the bidirectional long short-term memory network containing a channel attention mechanism are dual-branch attention networks, and the image feature vector and the voiceprint feature vector are features of standardized multimodal samples; The image feature vector and the voiceprint feature vector are weighted and concatenated to obtain the multimodal fusion feature; A mapping network containing fully connected layer weights is constructed. Multimodal fusion features are input into the mapping network, and an adaptation vector composed of domain feature vectors and species-specific weights is embedded during the mapping process. The domain feature vectors and species-specific weights are combined through a three-dimensional latent vector separation algorithm, and the mapping network outputs explicitly separated three-dimensional latent vectors representing biological features, domain features and species-specific features.

3. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 2, characterized in that the expression for the three-dimensional latent vector is: [formula not reproduced in this text], where the defined quantities are, in order: the three-dimensional latent vector; the activation function; the two weight matrices of the fully connected layers; the fusion features; the two bias terms; the concatenation (splicing) operation; the domain feature vector; and the species-specific weights.

4. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 1, characterized in that calculating the perturbation coefficient in S3 based on the domain feature vector and species-specific weights includes: normalizing the illumination intensity, background complexity, and voiceprint signal-to-noise ratio in the domain feature vector to obtain the normalized domain parameters; and, using the species-specific weights as the base gain coefficient, computing a weighted sum of the normalized domain parameters to output a dynamically adapted perturbation coefficient used to control the intensity of subsequent perturbations.

5. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 1, characterized in that, In step S3, differential perturbations are performed on each feature in the three-dimensional latent vector based on the target domain distribution information to obtain an enhanced latent vector adapted to the target domain, specifically including: Obtain the mean vector from the distribution information of the target domain; The perturbation variance corresponding to each feature is calculated using the perturbation coefficient; Based on the mean vector and perturbation variance, a weak-intensity Gaussian random perturbation with zero mean and variance controlled by the perturbation coefficient is applied to the biological feature part of the three-dimensional latent vector; a directional perturbation with mean matching target domain distribution information and variance controlled by the perturbation coefficient is applied to the domain feature part of the three-dimensional latent vector; and a micro-intensity random perturbation with zero mean and variance controlled by the perturbation coefficient is applied to the species-specific feature part of the three-dimensional latent vector to obtain the perturbation features of each part. By using a mapping network to collaboratively fuse the perturbed features, an enhanced latent vector adapted to the target domain is output.

6. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 5, characterized in that the expression for the perturbation coefficient is: [formula not reproduced in this text]; and the expression for the enhanced latent vector is: [formula not reproduced in this text]; where the defined quantities are, in order: the perturbation coefficient; the normalized illumination intensity; the species-specific weights; the normalized background complexity; the normalized voiceprint signal-to-noise ratio; the enhanced latent vector; the mapping network weight matrix; the biological feature vector; the weak-intensity Gaussian random perturbation of the biological features; the domain feature vector; the directional perturbation of the domain features; the species-specific feature vector; and the micro-intensity random perturbation of the species-specific features.

7. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 1, characterized in that S4 includes: mapping the enhanced latent vector through a fully connected layer and applying adaptive batch normalization to obtain intermediate layer features; introducing a domain feature vector and using it to adjust, element-wise, the similarity between the intermediate layer features and the decoder hidden layer states to obtain the adjustment result; based on the adjustment result, performing a normalization calculation referenced to the biological feature dimension to obtain the cross-attention weights of the image feature branch and the voiceprint feature branch; and, based on the cross-attention weights, weighting and fusing the image data and voiceprint data generated by the decoder to obtain the multimodal generated samples.

8. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 1, characterized in that the loss function of the classification model is defined by the following expressions: [expressions not reproduced in this text]; where the symbols denote, in order: the loss function of the classification model; the reconstruction loss; the weighting coefficients; the KL divergence loss; the cross-domain alignment loss; the multimodal fusion loss; the species feature fidelity loss; the pixel-level mean squared error loss of the image; the mean squared error loss at the voiceprint feature level; the image height; the image width; the pixel value of the original image at spatial location (i,j) and channel k; the pixel value of the generated image at spatial location (i,j) and channel k; the number of channels of the voiceprint features; the number of time steps of the voiceprint features; the value of the original voiceprint feature at position (i,j); the corresponding value of the generated voiceprint feature; the total number of samples in the training batch; the mean of the three-dimensional latent vector of the n-th sample; the standard deviation of the three-dimensional latent vector of the n-th sample; the maximum mean discrepancy (MMD); the source domain feature set; the generated-sample domain feature set; the number of samples in the source domain feature set; the Gaussian kernel feature mapping function; the number of samples in the generated-sample domain feature set; the Gaussian kernel feature map of the corresponding sample; the square of the L2 norm; the image attention weight of the n-th sample; the voiceprint attention weight of the n-th sample; the species-specific latent vector of the n-th original sample; and the species-specific latent vector of the n-th generated sample.
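Several of the loss terms named in claim 8 have well-known textbook forms, and the sketch below uses those forms as an assumption about what the missing expressions contain: a mean squared error for reconstruction, the closed-form KL divergence of a diagonal Gaussian against a standard normal, a biased Gaussian-kernel estimate of squared MMD for cross-domain alignment, and a squared L2 distance for species feature fidelity. The multimodal fusion loss is not sketched because its form cannot be inferred from the definitions. Function names and the `bandwidth` parameter are hypothetical.

```python
import math

def mse(original, generated):
    """Mean squared error over flattened values (pixels or voiceprint features)."""
    return sum((x - y) ** 2 for x, y in zip(original, generated)) / len(original)

def kl_loss(means, stds):
    """Batch mean of KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims."""
    total = 0.0
    for mu, sigma in zip(means, stds):
        for m, s in zip(mu, sigma):
            total += -0.5 * (1.0 + math.log(s * s) - m * m - s * s)
    return total / len(means)

def gaussian_kernel(x, y, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(source, generated, bandwidth=1.0):
    """Biased estimate of squared MMD between two feature sets."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source) + mean_kernel(generated, generated)
            - 2.0 * mean_kernel(source, generated))

def fidelity_loss(orig_specs, gen_specs):
    """Batch mean of the squared L2 distance between species-specific vectors."""
    total = sum(sum((a - b) ** 2 for a, b in zip(o, g))
                for o, g in zip(orig_specs, gen_specs))
    return total / len(orig_specs)
```

The claim combines these terms through weighting coefficients, so a weighted sum of the individual losses is the natural reading, but the exact weights and grouping are not recoverable from this text.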

9. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 1, characterized in that S5 includes: calculating the image structural similarity and the voiceprint cosine similarity between the multimodal generated samples and the original multimodal samples; calculating the Euclidean distance between the species-specific features of the multimodal generated samples and the species-specific features of the original samples as a species feature fidelity index; calculating the Euclidean distance between the domain feature vector of the multimodal generated samples and the target domain feature vector corresponding to the target domain environmental parameters as a domain fit index, wherein the image structural similarity, voiceprint cosine similarity, species feature fidelity, and domain fit constitute the multi-dimensional quality assessment system; calculating a comprehensive score based on the image structural similarity, voiceprint cosine similarity, species feature fidelity, and domain fit; when the comprehensive score is higher than a preset threshold, taking the multimodal generated sample as a screened high-quality augmented sample; and performing cross-domain identification of wild animals using the classification model corresponding to the screened high-quality augmented samples to obtain the identification results, completing the few-shot identification of wild animals.
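Two of the indices in claim 9, the cosine similarity and the Euclidean distance, along with the threshold screening, are simple enough to show directly. The sketch below is illustrative only: the function names are hypothetical, the image structural similarity is assumed to be computed elsewhere, and the comprehensive scores passed to `screen` are assumed to be already available.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_distance(a, b):
    """Used for both the species feature fidelity and the domain fit indices."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def screen(samples, scores, threshold):
    """Keep only the generated samples whose score exceeds the preset threshold."""
    return [sample for sample, score in zip(samples, scores) if score > threshold]

# Usage with made-up identifiers and scores.
kept = screen(["gen_001", "gen_002"], [0.9, 0.4], threshold=0.5)
```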

10. The method for few-shot identification of wild animals based on variational autoencoder and cross-domain alignment according to claim 9, characterized in that the expression for the comprehensive score is: [expression not reproduced in this text]; where the symbols denote, in order: the comprehensive score; the image structural similarity; the voiceprint cosine similarity; the species-specific features of the original sample; the species-specific features of the generated sample; the maximum-value function; the L2 norm; the domain feature vector of the generated sample; and the domain feature vector of the target domain.
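Because the expression for the comprehensive score is missing, the following is only a guess at a form compatible with the listed ingredients: two similarities, two L2 distances, and a maximum-value function. It assumes each distance is converted to a bounded score with `max(0, 1 - distance)` and that the four terms are averaged with equal weights; both choices, and all names, are hypothetical.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def composite_score(ssim, voice_cos, spec_orig, spec_gen, dom_gen, dom_target,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four quality indices into one score (assumed form)."""
    # Distances become scores in [0, 1]; max() clips large distances to zero.
    fidelity = max(0.0, 1.0 - l2_distance(spec_orig, spec_gen))
    domain_fit = max(0.0, 1.0 - l2_distance(dom_gen, dom_target))
    terms = (ssim, voice_cos, fidelity, domain_fit)
    return sum(w * t for w, t in zip(weights, terms))

# Usage with made-up values: a generated sample is kept if it beats the threshold.
score = composite_score(0.8, 0.6, [0.0], [5.0], [0.0], [0.0])
is_high_quality = score > 0.5
```

Clipping with the maximum-value function keeps a single badly mismatched feature vector from driving the score negative, which is one plausible reason the claim lists that function among its symbols.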