A self-supervised pre-training method for wild animal recognition based on scene bias

By employing a scene-bias-based self-supervised pre-training method, we have solved the problems of small target localization, fine-grained feature modeling, and occlusion robustness in wildlife identification, achieving high-quality pre-trained feature representations and improving the model's recognition performance under field conditions.

CN122244904A · Pending · Publication Date: 2026-06-19 · BEIJING FORESTRY UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING FORESTRY UNIVERSITY
Filing Date
2026-03-18
Publication Date
2026-06-19

Figure CN122244904A_ABST

Abstract

This invention provides a self-supervised pre-training method for wildlife identification, relating to the fields of computer vision and deep learning. To address the small proportion of field images occupied by animals, high annotation costs, and the insufficient localization and feature-learning capabilities of existing methods, it proposes a multi-level representation learning framework based on natural scene priors. The method introduces three types of visual bias (texture regularity, contour closure, and color distribution) to construct an attention mechanism for unsupervised small-target localization. It employs a contrastive learning paradigm with local-global consistency constraints to establish a collaborative multi-scale feature representation. It uses geographic, temporal, and visual metadata to construct a hierarchical hard negative sampling strategy. Robustness is enhanced through dual-objective optimization that combines real-texture occlusion modeling with feature completion. The method pre-trains entirely on unlabeled data, reducing annotation dependence and providing an efficient solution for few-shot species identification, with application value in ecological conservation and species monitoring.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and deep learning, specifically to a scene-bias-based self-supervised pre-training method for wildlife recognition, used to learn effective feature representations from a large number of unlabeled field images.

Background Technology

[0002] Accurate identification of wild animals is of great significance for ecological protection and species monitoring. However, existing deep learning identification methods rely heavily on large-scale labeled data, and the labeling of wild animal images often requires specialized taxonomic knowledge, resulting in high labeling costs. At the same time, due to the uneven distribution of species, the number of rare species samples is usually less than a hundred, further exacerbating the problem of scarce labeled data and making it difficult for traditional supervised learning models to achieve good generalization performance.

[0003] In contrast to the scarcity of labeled data, field monitoring stations collect millions of unlabeled images annually, providing a rich but underutilized data resource for model training. Against this backdrop, self-supervised learning pre-trains on large-scale unlabeled data to learn discriminative general feature representations, enabling models to perform well in downstream wildlife identification tasks with only a small number of labeled samples. This provides a feasible and effective technical path to alleviate the shortage of labeled data. However, existing self-supervised methods are mainly designed for general datasets such as ImageNet, and applying them directly to pre-training on field images has the following limitations.

First, small target localization is difficult during the pre-training stage. Animals occupy a very small proportion of field images (typically 2%-5%), and the random cropping strategies used by existing methods during pre-training often miss the animal regions completely. The model then learns background features rather than animal features, so the pre-trained features lack an effective representation of animal regions. Solving this problem requires localizing animal regions without labels. Animals and backgrounds exhibit systematic differences in texture regularity, contour closure, and color distribution, which can serve as prior knowledge for localization, but existing methods have not used them.

[0004] Second, fine-grained feature modeling is insufficient during the pre-training stage. Species recognition relies on the collaboration between global morphology and local details. Existing methods perform contrastive learning only at the global level during pre-training and lack explicit constraints on local features and on their consistency with global features, which makes it difficult for pre-trained features to capture subtle differences between species. A local-global collaborative contrastive learning framework built on the key regions localized by scene bias could effectively strengthen fine-grained feature learning, but existing methods lack such a design.

[0005] Third, the pre-training stage lacks robustness to occlusion. Occlusion by vegetation is common in the field, yet existing methods do not model occlusion during pre-training. The resulting features are not robust to occluded scenes, and model performance degrades under occlusion. Using background textures to simulate realistic occlusion and introducing occlusion augmentation during pre-training could improve occlusion robustness, but existing methods lack such designs.

[0006] The common thread in the aforementioned problems is that existing methods do not fully utilize prior information from the field scene. Animals and backgrounds in field images exhibit systematic differences in texture, contour, and color; these scene biases can guide self-supervised learning even without annotations. Therefore, there is an urgent need to design scene-bias-guided self-supervised pre-training methods to obtain high-quality pre-trained models adapted to field animal recognition tasks.

Summary of the Invention

[0007] In view of the above-mentioned deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a scene bias-based self-supervised pre-training method for wildlife recognition, which can accurately locate small target animals, learn multi-scale collaborative feature representations, and enhance the discriminative power and robustness to occlusion of features during the pre-training stage.

[0008] To achieve the above objectives, this invention provides a self-supervised pre-training method for wildlife recognition based on scene bias, characterized by the following steps:
Step 1: Generate key regions based on natural scene bias. Use the visual differences in texture, contour, and color between the animal and the background as prior biases to locate the animal region without annotations, and output K key regions (K is the preset number of key regions) as input for subsequent contrastive learning;
Step 2: Apply contrastive learning with local set-global representation consistency. Construct a dual-branch encoder to extract global and local features respectively. Under the joint constraints of the global contrastive loss, the local contrastive loss, and the consistency contrastive loss, achieve collaborative learning of multi-scale features and output feature representations with global-local consistency;
Step 3: Apply metadata-based hierarchical hard negative sampling. Use the geographic coordinates and capture time of field images to construct hard negative sample pools, so that the model learns to distinguish samples that look similar but belong to different species, thereby improving the discriminative power of the features;
Step 4: Apply occlusion modeling based on real textures and dual-objective training. Collect real textures from the background of field images to simulate natural occlusion, and enhance the robustness of the model to occlusion by jointly optimizing the occlusion-invariant contrastive loss and the feature completion loss.

[0009] After pre-training through the above four steps, the final output is an encoder model suitable for small target scenarios in the wild. This model can be fine-tuned with a small number of labeled samples and used for downstream wildlife species identification tasks.

[0010] Furthermore, Step 1 specifically includes: A0. Extract a multi-scale feature pyramid from the input image using a convolutional neural network; A1. Calculate the spatial attention map and the three types of bias response maps for each scale; A2. Fuse the bias response maps and the spatial attention map according to learnable weights; A3. Fuse the attention maps across scales to obtain the final attention map; A4. Extract K key regions based on the final attention map.

[0011] Furthermore, Step 2 specifically includes: B0. Build a dual-branch encoder and generate two augmented views; B1. Calculate the global contrastive loss, local contrastive loss, and consistency contrastive loss; B2. Combine the three loss terms with weights to form the total contrastive learning loss.

[0012] Furthermore, steps A1 and A2 of Step 1 specifically include: C0. Apply a convolution to the feature map at each scale to obtain the spatial attention map; C1. Extract texture responses with a Gabor filter bank of 12 filters in total, covering the orientations [values missing] and frequencies [values missing]. Calculate the variance of the responses at each spatial location; a smaller variance indicates more ordered texture. Take the reciprocal of the variance and normalize it to obtain the texture bias response map; C2. Apply Canny edge detection (thresholds [values missing]) to the feature map, followed by a morphological closing operation ([structuring element missing]). Connected component analysis retains components with an area greater than 100 pixels; the probability that each spatial location belongs to a closed connected component is calculated and normalized to ... to obtain the contour bias response map; C3. Convert the input image from RGB to HSV color space. Given a preset animal color distribution and a preset background color distribution, calculate the probability that each pixel belongs to each distribution, take the log-likelihood ratio of the animal probability to the background probability, and normalize it to obtain the color bias response map; C4. Fuse the spatial attention map with the three bias response maps according to the fusion formula [formula missing], whose three coefficients are learnable weight parameters.
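The texture-bias computation in C1 can be illustrated with a minimal pure-Python sketch. It assumes the 12 Gabor responses at each location have already been computed; the function name `texture_bias_map`, the stabilizer `eps`, and the min-max normalization to [0, 1] are assumptions made for illustration, since the source does not preserve the exact normalization.

```python
from statistics import pvariance


def texture_bias_map(responses, eps=1e-6):
    """Map per-location filter responses to a texture-orderliness score.

    responses[y][x] is the list of filter responses at one location
    (12 values in the described filter bank). `eps` is a hypothetical
    stabilizer that avoids division by zero for perfectly flat responses.
    """
    # Reciprocal of the variance: ordered texture (low variance) scores high.
    inverse = [[1.0 / (pvariance(px) + eps) for px in row] for row in responses]

    # Min-max normalization to [0, 1] (assumed; the source only says "normalized").
    flat = [v for row in inverse for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in inverse]


# One row with two locations: constant responses vs. strongly varying responses.
demo = texture_bias_map([[[1.0] * 12, [0.0, 2.0] * 6]])
```

In this toy input the location with constant responses receives the highest score and the location with alternating responses the lowest, which matches the intent that ordered texture should attract attention.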

[0013] Furthermore, step B1 of Step 2 specifically includes: D0. The global contrastive loss takes the InfoNCE form [formula missing], defined in terms of the cosine similarity, a temperature parameter, and negative-sample features that include the other samples in the current batch and the historical features in the momentum queue (queue size 65536); D1. Local contrastive loss: establish the correspondence between the key regions of the two views, calculate the contrastive loss for each pair of corresponding regions, and average the loss over all region pairs; D2. The consistency contrastive loss requires the set representation aggregated from the local features to be similar to the global features.
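As an illustration of the InfoNCE form named in D0, the following sketch computes the loss for one query feature using the standard InfoNCE definition with cosine similarity. The exact formula and temperature value are not preserved in the source, so the function names and the default `temperature=0.2` are assumptions; in the described method the negatives would come from the current batch and the momentum queue.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.2):
    """Standard InfoNCE loss: -log softmax of the positive logit.

    `temperature` is a hypothetical default, not a value from the source.
    """
    pos = cosine(query, positive) / temperature
    logits = [pos] + [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - pos


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the two views of the same image agree and large when the query is closer to a negative than to its positive.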

[0014] Furthermore, Step 3 specifically includes: E0. Extract metadata such as geographic coordinates and capture time from the training set; E1. Construct three types of hard negative sample pools and sample from them proportionally; E2. Adopt a curriculum learning strategy that divides the training process into four stages.

[0015] Furthermore, the specific steps for constructing the three types of hard negative sample pools described in E1 are as follows: F0. Geographic proximity pool: calculate the distance from each image to the anchor image using the Haversine formula and keep images within 10 km. Because no labels are available, potential same-individual images are filtered out using a visual similarity threshold of 0.9, with sampling probability ...; F1. Temporal proximity pool: keep images whose seasonal difference is at most 1 month or whose time-of-day difference is at most 2 hours. As with the geographic pool, candidates are filtered by visual similarity, and the sampling probability is inversely proportional to the time difference; F2. Visual similarity pool: use a pre-trained CLIP model ([variant missing]) to extract 512-dimensional features, build a similarity index with the FAISS library, and retrieve the ... most similar images. Potential same-individual images are filtered out using geographic or temporal information, and the index is updated every 10 epochs; F3. Sample 256 negative samples for each anchor image, distributed as follows: 102 (40%) geographic, 77 (30%) temporal, 51 (20%) visual, and 26 (10%) random.
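The geographic filter in F0 can be sketched with the Haversine formula and the Earth radius R = 6371 km given in the detailed description. The dictionary layout (`id`, `lat`, `lon`) and the function names are hypothetical, and the visual-similarity exclusion (threshold 0.9) is left out because it depends on a feature extractor.

```python
import math

EARTH_RADIUS_KM = 6371.0  # R as stated in the detailed description


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geographic_candidates(anchor, images, max_km=10.0):
    """Return images within `max_km` of the anchor, excluding the anchor itself."""
    return [
        img for img in images
        if img["id"] != anchor["id"]
        and haversine_km(anchor["lat"], anchor["lon"], img["lat"], img["lon"]) <= max_km
    ]


anchor = {"id": "a", "lat": -2.33, "lon": 34.83}
pool = [
    anchor,
    {"id": "b", "lat": -2.28, "lon": 34.83},  # roughly 5.6 km away
    {"id": "c", "lat": -2.13, "lon": 34.83},  # roughly 22 km away
]
nearby = geographic_candidates(anchor, pool)
```

A brute-force scan like this is quadratic in the number of images; for a dataset of the size described, a spatial index would be a natural replacement, although the source does not specify one.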

[0016] Furthermore, Step 4 specifically includes: G0. Randomly select 1000 images from the training set, apply the attention mechanism from Step 1 to identify animal regions, and randomly crop texture blocks ([specification missing]) from the background region (attention response less than 0.3); G1. Use K-means clustering ([parameters missing]) to cluster the texture blocks into five categories (grass, branches, rocks, shrubs, and soil), and extract the primary color and transparency attributes (grass and branches [value missing]; rocks and soil [value missing]) to build a texture library of approximately 5,000 texture blocks; G2. Generate a smooth random field using Perlin noise (parameters [values missing]), threshold it to obtain a binary mask, and apply morphological operations to refine the boundary; G3. Calculate the vertical centroid of the animal from the attention map of Step 1 and sample the occlusion location according to prior statistics: with probability 0.6 the lower body is occluded ([region definition missing]), with probability 0.3 the upper body ([region definition missing]), and with probability 0.1 the occlusion is global; G4. Control the occlusion intensity: randomly select a target intensity among mild [range missing], moderate [range missing], and severe [range missing], and adjust the mask through dilation or erosion until the occlusion ratio is close to the target; G5. Overlay the texture according to its transparency attribute ([formula missing]); G6. Generate two different occluded versions with less than 20% overlap between their occluded areas, and calculate the occlusion-invariant contrastive loss ([formula missing]); G7. Construct a U-Net decoder (3 upsampling layers), maintain a momentum encoder (parameter update [formula missing]), and calculate the feature completion loss ([formula missing]); G8. Jointly optimize the two losses ([formula missing]).
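Steps G3 and G5 can be sketched as follows. The sampling probabilities (0.6, 0.3, 0.1) come from the text; the overlay formula is not preserved in the source, so the sketch assumes conventional alpha blending inside the mask, and all function and constant names are hypothetical.

```python
import random

# Occlusion-location prior from G3: lower body, upper body, global.
REGION_PROBS = (("lower", 0.6), ("upper", 0.3), ("global", 0.1))


def sample_occlusion_region(rng):
    """Draw one occlusion location according to the stated prior."""
    names = [name for name, _ in REGION_PROBS]
    weights = [weight for _, weight in REGION_PROBS]
    return rng.choices(names, weights=weights, k=1)[0]


def overlay_pixel(image_px, texture_px, mask, alpha):
    """Blend one RGB pixel with a texture pixel where the mask is set.

    Assumed blend: alpha * texture + (1 - alpha) * image. An alpha near 1
    models opaque occluders, a smaller alpha models semi-transparent ones.
    """
    if not mask:
        return image_px
    return tuple(alpha * t + (1.0 - alpha) * i for i, t in zip(image_px, texture_px))


rng = random.Random(0)
region = sample_occlusion_region(rng)
blended = overlay_pixel((100, 100, 100), (0, 200, 50), mask=1, alpha=0.5)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which is useful when two occluded versions of the same image must be compared.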

[0017] The beneficial effects of this invention are as follows: This invention designs a specialized self-supervised pre-training method tailored to the characteristics of field images, systematically addressing four key problems in the pre-training stage. First, it addresses small target localization through a natural-scene-bias attention mechanism, ensuring that contrastive learning focuses on animal regions rather than the background, so that pre-trained features effectively represent animal regions. Second, it addresses fine-grained feature learning through local set-global representation consistency constraints, allowing the pre-trained model to capture global morphology and local details simultaneously. Third, it improves the discriminative power of pre-trained features for similar species through metadata-driven hard negative sampling. Fourth, it enhances the robustness of pre-trained features to occluded field scenes through realistic texture occlusion modeling. By addressing these problems in the pre-training stage, this invention obtains high-quality pre-trained feature representations, thereby reducing the need for labeled data in downstream recognition tasks and maintaining good recognition performance with few samples, small targets, and occlusion.

Brief Description of the Drawings

[0018] Figure 1 is a schematic diagram of the overall process of the method of the present invention; Figure 2 is a schematic diagram of the key region generation process based on natural scene bias; Figure 3 is a schematic diagram of the dual-branch encoder and the three-level contrastive learning architecture; Figure 4 is a schematic diagram of the metadata-driven hard negative sampling strategy; Figure 5 is a schematic diagram of the occlusion modeling process based on real textures.

Detailed Implementation

[0019] The present invention will be further described below with reference to the embodiments.

[0020] This embodiment uses the Snapshot Serengeti dataset as the research object, learns effective feature representations from field images, and applies them to few-shot recognition tasks.

[0021] As shown in Figure 1, this embodiment provides a self-supervised pre-training method for wildlife recognition based on scene bias, including the following steps: Step 1: Prepare the data from the Snapshot Serengeti dataset. This dataset contains approximately 1.5 million wildlife images covering 48 species, and each image includes metadata such as geographic coordinates and capture time. The dataset is randomly divided into training, validation, and test sets in an 8:1:1 ratio. The training set is used for self-supervised pre-training, during which no species labels are used. The key region generation method based on natural scene bias is then used to locate animal regions in the images.
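The 8:1:1 random split described above can be sketched as follows. The function name `split_8_1_1` and the fixed seed are assumptions; the source does not say whether the split is stratified or grouped by camera site, so this sketch performs a plain shuffle.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test


train_ids, val_ids, test_ids = split_8_1_1(range(100))
```

Assigning the remainder to the test set guarantees that every image lands in exactly one partition even when the total is not divisible by ten.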

[0022] Step 2: Input the key region set obtained in Step 1 into the dual-branch encoder and train it with a contrastive learning framework based on local set-global representation consistency. The global branch extracts the feature representation of the entire image, while the local branch extracts the feature representation of each key region. Calculating the global contrastive loss, local contrastive loss, and consistency contrastive loss forces the model to learn global and local features simultaneously and keeps them consistent.

[0023] Step 3: Using metadata such as geographic coordinates and capture time in the dataset, construct three types of hard negative sample pools: geographic proximity, temporal proximity, and visual similarity. Combine the negative samples in proportion and train with a curriculum learning strategy.

[0024] Step 4: Collect natural occlusion textures from the background area to build a texture library, generate irregularly shaped occlusion regions, determine the occlusion position based on the distribution of key regions, and overlay textures according to their transparency attribute. Enhance the robustness of the model to occlusion by jointly optimizing the occlusion-invariant contrastive loss and the feature completion loss.

[0025] Furthermore, as shown in Figure 2, for key region generation based on natural scene bias, Step 1 includes: A0. Using ResNet-50 as the backbone network, the input image is uniformly scaled to ..., and multi-scale features are extracted from stages 1-4, with resolutions of [resolutions to be filled in]; A1. Apply the attention calculation and the bias response map calculation to the feature map of each stage; A2. Fuse the bias response maps with the spatial attention map; A3. Fuse across scales to obtain the final attention map; A4. Extract K key regions.

[0026] Furthermore, the specific steps for calculating the attention and bias response maps described in A1 are as follows: B0. Natural scene bias is the foundation of the first core innovation of this invention. Prior differences in visual features between animals and the background guide the attention toward the animal region; B1. Apply a convolution to the feature map of each stage to obtain the spatial attention map; B2. Use a Gabor filter bank with orientations [values missing] and frequencies [values missing], 12 filters in total. Apply the filter bank to the feature map and calculate the variance of the 12 responses at each spatial location. A small variance indicates ordered texture (animal fur), while a large variance indicates disordered texture (background vegetation). Take the reciprocal of the variance as an index of texture orderliness and normalize it to ... to obtain the texture bias response map; B3. Apply Canny edge detection (thresholds [values missing]) to the feature map to obtain a binary edge map, then apply a morphological closing operation ([size missing] circular structuring element) to connect broken edges. Perform connected component analysis on the closed edge map and retain connected regions with an area greater than 100 pixels as possible object contours. Calculate the probability that each spatial location belongs to a closed connected region and normalize it to ... to obtain the contour bias response map; B4. Convert the input image from RGB color space to HSV color space. Preset the animal color distribution [parameters missing] and the background color distribution (vegetation, sky, rocks). For the HSV value of each pixel, calculate its probability under each distribution, take the log-likelihood ratio of the animal probability to the background probability, and normalize it to ... to obtain the color bias response map; B5. Upsample the three types of bias response maps to the same resolution as the feature map, then fuse them with the spatial attention map ([formula missing]). The three fusion coefficients are learnable scalar weight parameters, initialized to [values missing] and optimized along with the other network parameters via gradient descent.
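The fusion in B5 can be illustrated with the sketch below. The original fusion formula is not preserved; the sketch assumes an additive combination of the spatial attention map and the three weighted bias maps followed by a sigmoid, which is consistent with claim 2 (fusion by learnable weights, then sigmoid normalization). The function names and the default weights of 1.0 are hypothetical, and in training the weights would be learned rather than fixed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_attention(attn, texture, contour, color, weights=(1.0, 1.0, 1.0)):
    """Fuse a spatial attention map with three bias maps of the same size.

    All maps are lists of rows of floats. Assumed form per location:
    sigmoid(attn + w_t * texture + w_c * contour + w_k * color).
    """
    w_t, w_c, w_k = weights
    fused = []
    for row_a, row_t, row_c, row_k in zip(attn, texture, contour, color):
        fused.append([
            sigmoid(a + w_t * t + w_c * c + w_k * k)
            for a, t, c, k in zip(row_a, row_t, row_c, row_k)
        ])
    return fused


zeros = [[0.0, 0.0]]
bias = [[0.0, 1.0]]
fused_map = fuse_attention(zeros, bias, bias, bias)
```

Locations where all three bias maps respond strongly end up with fused attention close to 1, while locations with no bias evidence stay at the neutral value 0.5.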

[0027] Furthermore, the specific steps of the multi-scale fusion and region extraction described in A3 and A4 are as follows: C0. Multi-scale attention fusion is a key step: attention maps at different scales capture information at different levels and need to be fused; C1. Upsample the attention maps from the four stages to a uniform resolution and calculate a confidence score for each scale. The confidence reflects how concentrated the high-response regions are; the higher the value, the more concentrated the attention at that scale; C2. Obtain the fusion weights by Softmax normalization of the confidence scores, and obtain the final attention map by weighted fusion; C3. Threshold the final attention map (threshold [value missing]) to obtain a binary map, then perform connected component analysis on it to identify all connected regions. Calculate the area of each connected region and apply a scale constraint: regions whose area fraction lies between 0.5% and 40% are retained, regions below 0.5% are discarded as noise, and regions above 40% are discarded as background; C4. Sort the remaining regions by area from largest to smallest and select the top K regions that meet the conditions. For each selected region, calculate the minimum bounding rectangle, extend its margins by 30%, and scale it uniformly to [size missing] as a key region output.
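The region filtering and selection in C3 and C4 can be sketched as follows, assuming connected-component analysis has already produced bounding boxes with pixel areas. The tuple layout, the function name, the interpretation of the area fraction as a share of the whole image, and the reading of "extend the margins by 30%" as 30% of the box size on each side are all assumptions.

```python
def select_key_regions(regions, image_w, image_h, k,
                       margin=0.30, min_frac=0.005, max_frac=0.40):
    """Filter regions by area fraction, keep the K largest, and pad their boxes.

    regions: list of (x0, y0, x1, y1, area_px) tuples.
    Returns padded boxes clipped to the image bounds.
    """
    image_area = image_w * image_h
    # Scale constraint: drop noise (< 0.5%) and background (> 40%).
    kept = [r for r in regions if min_frac <= r[4] / image_area <= max_frac]
    kept.sort(key=lambda r: r[4], reverse=True)

    boxes = []
    for x0, y0, x1, y1, _ in kept[:k]:
        dx = (x1 - x0) * margin
        dy = (y1 - y0) * margin
        boxes.append((max(0, x0 - dx), max(0, y0 - dy),
                      min(image_w, x1 + dx), min(image_h, y1 + dy)))
    return boxes


candidates = [
    (10, 10, 30, 30, 400),      # 4% of a 100x100 image: kept
    (0, 0, 5, 5, 25),           # 0.25%: discarded as noise
    (0, 0, 100, 100, 5000),     # 50%: discarded as background
    (50, 50, 90, 90, 1600),     # 16%: kept
]
key_boxes = select_key_regions(candidates, 100, 100, k=2)
```

Padding the box gives the local branch some surrounding context, and clipping keeps the crop inside the image; resizing each crop to the common key-region size would follow as a separate step.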

[0028] Furthermore, as shown in Figure 3, for contrastive learning based on local set-global representation consistency, Step 2 includes: D0. Construct a dual-branch encoder. The global branch uses ResNet-50 (stages 1-4) + GlobalAveragePooling + a projection head, Linear (2048→2048) → ReLU → Linear (2048→128), and outputs the global feature g (128-dimensional). The local branch uses ResNet-50 (stages 1-3 shared with the global branch, stage 4 independent) + GlobalAveragePooling + a projection head, and outputs K local features (each 128-dimensional); D1. Apply data augmentation to the input image to generate two views. The augmentation includes random color jitter (brightness [value missing], contrast [value missing], saturation [value missing], hue [value missing]), random Gaussian blur (probability 0.5), random horizontal flip (probability 0.5), and slight random rotation ([value missing] degrees); large-scale random cropping is avoided so that small targets are not lost; D2. Calculate the contrastive learning losses at the three levels; D3. Combine the three losses with weights ([formula missing]).

[0029] Furthermore, the specific steps for calculating the three-level contrastive learning loss described in D2 are as follows: E0. Three-level contrastive learning is the second core innovation of this invention. Constraints at the global, local, and consistency levels achieve collaborative learning of multi-scale features; E1. Global contrastive loss: the two views are passed through the global encoder to obtain their global features, and the loss takes the InfoNCE form [formula missing], where sim is the cosine similarity, together with a temperature parameter and negative-sample features that include the other samples in the current batch and the historical features in the momentum queue (queue size 65536); E2. Local contrastive loss: key regions are extracted by applying the attention mechanism from Step 1 to both views, and region correspondences are established using the spatial information of the attention maps. For each corresponding pair of local regions, local features are obtained through the local encoder and a contrastive loss is calculated; the losses of all successfully matched region pairs are then averaged; E3. Consistency contrastive loss: for each local feature of a view, calculate its mean cosine similarity to the local features of all negative samples in the current batch and the momentum queue, and derive its discriminative power from that value ([formula missing]). Combine the discriminative power with the attention intensity into an overall score ([formula missing]), apply Softmax normalization to obtain the aggregation weights, and aggregate the local features with these weights to obtain the set representation. The consistency contrastive loss ([formula missing]) is then calculated between the set representation and the global features.
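The aggregation at the end of E3 can be sketched as follows: per-region scores are Softmax-normalized into weights, and the local features are combined into a single set representation. How the overall score is computed from discriminative power and attention intensity is not preserved in the source, so the sketch takes the scores as given; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_local_features(local_feats, scores):
    """Weighted sum of K local feature vectors using softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(local_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, local_feats)) for d in range(dim)]


features = [[1.0, 0.0], [0.0, 1.0]]
uniform = aggregate_local_features(features, [0.0, 0.0])
peaked = aggregate_local_features(features, [10.0, 0.0])
```

With equal scores the set representation is the plain mean of the local features; with one dominant score it approaches the feature of that single region, so more discriminative regions contribute more to the consistency constraint.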

[0030] Furthermore, as shown in Figure 4, for hierarchical hard negative sampling based on metadata, Step 3 includes: F0. Metadata-driven hard negative sampling is the third core innovation of this invention; it uses geographic, temporal, and other metadata to construct hard negative sample pools; F1. Geographic proximity pool: extract the geographic coordinates of each image in the training set. For each anchor image, use the Haversine formula to calculate its distance along the Earth's surface to the other images ([formula missing]), where R = 6371 km. Images within 10 km form the geographic candidate set. Because no labels are available, a pre-trained CLIP model is used to extract features and calculate cosine similarity; a candidate whose similarity exceeds 0.9 is treated as potentially the same individual and excluded. Sampling probability: [formula missing]; F2. Temporal proximity pool: extract the capture timestamp and construct a temporal proximity pool along a seasonal dimension (month difference of at most 1 month) and a time-of-day dimension (hour difference of at most 2 hours). As with the geographic pool, candidates are filtered by the visual similarity threshold, and the sampling probability is inversely proportional to the time difference; F3. Visual similarity pool: use a pre-trained CLIP model ([variant missing]) to extract visual features (512 dimensions) from all training images and build a similarity index with the FAISS library. For each anchor image, retrieve the [number missing] most similar images and filter out potential same-individual images using geographic or temporal information. Every 10 training epochs, re-extract the features with the current model and update the index; F4. Fusion sampling: sample 256 negative samples for each anchor image, distributed as follows: 102 geographic (40%), 77 temporal (30%), 51 visual (20%), and 26 random (10%); F5. Curriculum learning strategy: the training is divided into 4 stages. Stage 1 (epochs 1-40) mainly uses negative samples from different orders, accounting for 70%; Stage 2 (epochs 41-100) mainly uses negative samples from different families within the same order, accounting for 60%; Stage 3 (epochs 101-160) mainly uses negative samples from different genera within the same family, accounting for 60%; Stage 4 (epochs 161-200) mainly uses negative samples from different species within the same genus, accounting for 70%.
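The sampling quotas and the four-stage schedule above can be expressed as a small lookup. The shares, totals, and epoch boundaries come from the text; the names `POOL_SHARES`, `STAGES`, `negative_quota`, and `curriculum_stage` are hypothetical, and the stage focus is written with the taxonomic rank "order".

```python
# Share of the 256 negatives drawn from each pool.
POOL_SHARES = {"geographic": 0.40, "temporal": 0.30, "visual": 0.20, "random": 0.10}

# (first_epoch, last_epoch, dominant negative type, share of that type)
STAGES = (
    (1, 40, "different orders", 0.70),
    (41, 100, "different families, same order", 0.60),
    (101, 160, "different genera, same family", 0.60),
    (161, 200, "different species, same genus", 0.70),
)


def negative_quota(total=256):
    """Round each pool's share of `total` to the nearest integer.

    For total=256 the rounded counts sum back to 256; other totals may
    need an explicit correction step, which is omitted here.
    """
    return {name: round(total * share) for name, share in POOL_SHARES.items()}


def curriculum_stage(epoch):
    """Return (stage_number, dominant_negative_type, share) for a 1-based epoch."""
    for number, (first, last, focus, share) in enumerate(STAGES, start=1):
        if first <= epoch <= last:
            return number, focus, share
    raise ValueError(f"epoch {epoch} is outside the 200-epoch schedule")


quota = negative_quota()
stage = curriculum_stage(120)
```

Rounding each share independently reproduces the counts 102, 77, 51, and 26 stated in F4, and the stage lookup makes the progression from easy (different orders) to hard (same genus) negatives explicit.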

[0031] Furthermore, as shown in Figure 5, for occlusion modeling based on real textures, Step 4 includes: G0. Realistic texture occlusion modeling is the fourth core innovation of this invention; it enhances the robustness of the model by simulating real occlusion scenes in the field; G1. Randomly select 1000 images from the training set, apply the attention mechanism from Step 1 to identify animal regions, and randomly crop texture blocks ([specification missing]) from the background region (attention response less than 0.3). Use K-means clustering ([parameters missing]) to cluster the texture blocks into 5 categories: grass, branches, rocks, shrubs, and soil. Each cluster center is manually examined and labeled. Extract the primary color and transparency attributes of each texture block (grass and branches [value missing]; rocks and soil [value missing]) to build a texture library of approximately 5,000 texture blocks; G2. Generate a smooth random field using Perlin noise with parameters [values missing]; the output size matches the input image (448×448). Threshold the Perlin noise map (threshold 0.5) to obtain a binary occlusion mask M, then apply morphological operations ([parameters missing]) to refine the boundary; G3. Use the fused attention map from Step 1 to determine the animal's location and calculate its vertical centroid. Sample the occlusion location according to prior statistics: with probability 0.6 the lower body is occluded ([region definition missing]), with probability 0.3 the upper body ([region definition missing]), and with probability 0.1 the occlusion is global. Perform a logical AND between the binary mask M and the selected region; G4. Control the occlusion intensity: calculate the current occlusion ratio, randomly select a target intensity among mild [range missing], moderate [range missing], and severe [range missing], and adjust the occluded area through dilation or erosion until the ratio is close to the target (error less than 0.05); G5. Randomly select texture blocks from the texture library and tile them to cover the occluded area. Overlay the textures according to their transparency attribute ([formula missing]), where α adjusts the texture transparency; G6. Generate two different occluded versions of the same original image, ensuring that the overlap of their occluded areas is less than 20%. Pass both through the encoder to obtain their features and calculate the occlusion-invariant contrastive loss ([formula missing]); G7. Construct a U-Net decoder with three upsampling layers and skip connections to the corresponding encoder layers. Maintain a momentum encoder whose parameters are updated by [formula missing]. Extract features from the complete image with the momentum encoder as the reconstruction target, reconstruct them from the occluded features with the decoder, and calculate the feature completion loss ([formula missing]); G8. Jointly optimize the two losses ([formula missing]). The decoder is used only during the pre-training phase and is discarded during fine-tuning for downstream tasks.
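Steps G7 and G8 can be illustrated with the sketch below. The update rule, the completion loss, and the joint objective are not preserved in the source, so the sketch assumes three conventional choices: an exponential moving average for the momentum encoder, mean squared error between reconstructed and target features, and a weighted sum of the two losses. The function names and the defaults `m=0.999` and `weight=1.0` are hypothetical.

```python
def ema_update(momentum_params, online_params, m=0.999):
    """Assumed momentum update: theta_m <- m * theta_m + (1 - m) * theta."""
    return [m * pm + (1.0 - m) * po for pm, po in zip(momentum_params, online_params)]


def completion_loss(reconstructed, target):
    """Assumed feature completion loss: mean squared error over feature entries."""
    return sum((r - t) ** 2 for r, t in zip(reconstructed, target)) / len(target)


def joint_loss(invariance_loss, completion, weight=1.0):
    """Assumed joint objective: invariance loss plus weighted completion loss."""
    return invariance_loss + weight * completion


# Toy values: the momentum parameters drift slowly toward the online ones.
momentum = ema_update([1.0, 0.0], [0.0, 1.0], m=0.9)
loss = joint_loss(0.5, completion_loss([1.0, 2.0], [1.0, 4.0]), weight=0.5)
```

A momentum close to 1 keeps the reconstruction target stable between steps, which is the usual motivation for pairing a momentum encoder with a completion objective.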

[0032] In summary, this embodiment achieves efficient self-supervised pre-training for wildlife recognition through four collaborative steps: key region generation based on natural scene bias, contrastive learning based on local set-global representation consistency, hierarchical hard negative sampling based on metadata, and occlusion modeling based on real textures with dual-objective training. The core idea of this method is to address, in light of the characteristics of field images, the four problems that self-supervised methods face in the pre-training stage: difficulty in small target localization, insufficient fine-grained feature modeling, limited negative sample discrimination, and lack of occlusion robustness. This results in high-quality pre-trained feature representations and improves the model's performance in downstream few-shot recognition tasks, in particular maintaining good recognition performance with small targets and strong occlusion.

[0033] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.

Claims

1. A self-supervised pre-training method for wildlife recognition based on scene bias, characterized in that it includes the following steps: Step 1: generate key regions based on natural scene bias; Step 2: employ contrastive learning based on consistency between local-set and global representations; Step 3: employ metadata-based hierarchical hard negative sampling; Step 4: employ occlusion modeling based on real textures and dual-objective training.

2. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 1, characterized in that step one specifically includes: A0. Extract a multi-scale feature pyramid from the input image to obtain feature maps at different resolutions, each identified by a scale index; A1. Calculate the spatial attention map for each scale, and calculate three types of bias response maps: a texture orderliness bias response map, which quantifies the difference in texture regularity between animal fur and background vegetation; a contour closure bias response map, which quantifies the difference between the closed outline of the animal and the scattered edges of the background; and a color distribution difference bias response map, which quantifies the separability of animal fur color from the background in color space; A2. Fuse the three types of bias response maps with the spatial attention map according to learnable weights, then apply the sigmoid activation function to normalize the result to the unit interval, obtaining the fused attention map for each scale, where the three fusion weights are learnable parameters; A3. Calculate the confidence score of the attention map at each scale, normalize the confidence scores to determine the fusion weights, and obtain the final attention map by weighted fusion; A4. Perform thresholding and connected component analysis on the final attention map, apply scale constraints, and select the K regions with the strongest response as the set of key regions, where K is the preset number of key regions.
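Steps A2 and A3 can be sketched as follows. The exact fusion formula is not reproduced in the text, so the additive form inside the sigmoid, the use of the mean response as a confidence score, and the assumption that all per-scale maps have already been resized to a common resolution are illustrative choices only. The names `fuse_scale` and `fuse_scales` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_scale(attn, tex, contour, color, w_tex, w_contour, w_color):
    """A2 (assumed additive form): weighted bias maps added to the attention map, then sigmoid."""
    return sigmoid(attn + w_tex * tex + w_contour * contour + w_color * color)

def fuse_scales(maps, confidences):
    """A3: normalize per-scale confidences into weights and take the weighted sum of the maps."""
    conf = np.asarray(confidences, dtype=float)
    weights = conf / conf.sum()
    return np.tensordot(weights, np.stack(maps), axes=1)

rng = np.random.default_rng(0)
# Two scales, already resized to 4x4 (assumption); the weights stand in for learned values.
maps = [
    fuse_scale(rng.normal(size=(4, 4)), rng.random((4, 4)),
               rng.random((4, 4)), rng.random((4, 4)), 0.5, 0.3, 0.2)
    for _ in range(2)
]
confidences = [m.mean() for m in maps]  # assumed confidence score: mean response
final_attention = fuse_scales(maps, confidences)
print(final_attention.shape)
```

Because each fused map lies strictly between 0 and 1 and the fusion weights sum to 1, the final attention map stays in the same range, which keeps a single threshold meaningful in A4.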

3. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 1, characterized in that step two specifically includes: B0. Construct a dual-branch encoder, with the global branch extracting the global feature representation and the local branch extracting the local feature representations of the key regions, where the key regions are obtained from step one, and the two branches share parameters at the lower layers and are optimized independently at the upper layers; B1. Apply data augmentation to the input image to generate two different views, and calculate the global contrastive loss, the local contrastive loss, and the consistency loss; B2. Calculate the aggregation weights by combining attention intensity and feature discrimination power, where the feature discrimination power is obtained by a monotonically decreasing transformation of the similarity between the local representation and the negative sample representations, so that the local regions that are more easily confused with the negative samples get higher aggregation weights; B3. Weight and aggregate the local representations to obtain a set representation, which is required to be consistent with the global representation; B4. Weight the three loss terms to form the total contrastive learning loss, where the three weights are preset hyperparameters.
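The three loss terms in B1 and their weighted sum in B4 might look like the sketch below. The claim does not give the loss formulas, so the InfoNCE form of the contrastive loss, the cosine-distance form of the consistency loss, the temperature, and the example weights are all assumptions. The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE contrastive loss (assumed form) for one query against N negatives."""
    q, p, n = l2_normalize(query), l2_normalize(positive), l2_normalize(negatives)
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    logits = logits - logits.max()  # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

def consistency_loss(set_repr, global_repr):
    """Assumed consistency term: cosine distance between set and global representations."""
    return float(1.0 - l2_normalize(set_repr) @ l2_normalize(global_repr))

def total_loss(l_global, l_local, l_cons, weights=(1.0, 1.0, 0.5)):
    """B4: weighted sum of the three terms; the default weights are placeholders."""
    return weights[0] * l_global + weights[1] * l_local + weights[2] * l_cons

negatives = [[0.0, 1.0], [0.5, 0.5]]
l_g = info_nce([1.0, 0.0], [1.0, 0.1], negatives)
l_c = consistency_loss([1.0, 2.0], [2.0, 4.0])
print(total_loss(l_g, l_g, l_c))
```

In this sketch the same function serves both the global and the local contrastive term; only the inputs (global features or key-region features) would differ.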

4. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 2, characterized in that the specific steps for calculating the three types of bias response maps as described in A1 are as follows: C0. For texture orderliness, use a multi-directional, multi-frequency filter bank to extract the texture response and calculate the variance of the response across directions at each spatial location, where a smaller variance indicates a more orderly texture; take the reciprocal of the variance and normalize it to ... to obtain the texture bias response map; C1. For contour closure, apply edge detection operators to obtain a binary edge map, perform morphological closing operations to connect broken edges, and use connected component analysis to retain connected components with an area greater than a threshold; calculate the probability that each spatial location belongs to a closed connected component and normalize it to obtain the contour bias response map; C2. For color distribution differences, convert the input image from RGB to HSV color space, preset the color distribution parameters for animals and background, calculate the probability of each pixel belonging to each distribution, and calculate the log-likelihood ratio of the animal probability to the background probability and normalize it to obtain the color bias response map.
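Step C2 can be illustrated with a small standard-library sketch. The claim only says that the color distribution parameters are preset, so the independent Gaussian model per HSV channel, the parameter values in `ANIMAL` and `BACKGROUND`, and the sigmoid used for normalization are assumptions. Hue is treated as a linear quantity here, which ignores its wrap-around.

```python
import colorsys
import math

# Hypothetical preset parameters: (mean, std) per HSV channel, all in [0, 1].
ANIMAL = ((0.08, 0.60, 0.60), (0.05, 0.20, 0.20))      # brownish fur
BACKGROUND = ((0.33, 0.60, 0.50), (0.05, 0.20, 0.20))  # green vegetation

def gaussian_log_likelihood(x, mean, std):
    """Log-likelihood under independent per-channel Gaussians (assumed model)."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))
        for xi, m, s in zip(x, mean, std)
    )

def color_bias_response(rgb_image, animal=ANIMAL, background=BACKGROUND):
    """Per-pixel log-likelihood ratio (animal vs. background), squashed to (0, 1)."""
    response = []
    for row in rgb_image:
        out_row = []
        for r, g, b in row:
            hsv = colorsys.rgb_to_hsv(r, g, b)
            llr = (gaussian_log_likelihood(hsv, *animal)
                   - gaussian_log_likelihood(hsv, *background))
            llr = max(-50.0, min(50.0, llr))  # clamp to avoid overflow in exp
            out_row.append(1.0 / (1.0 + math.exp(-llr)))
        response.append(out_row)
    return response

# One brown pixel and one green pixel, RGB values in [0, 1].
image = [[(0.6, 0.4, 0.2), (0.2, 0.6, 0.2)]]
print(color_bias_response(image))
```

Values above 0.5 mark pixels that the preset model considers more likely to be animal than background.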

5. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 3, characterized in that the specific steps for calculating the aggregation weights as described in B2 are as follows: D0. Obtain the attention intensity of each key region from the attention map in step one; D1. Calculate the mean cosine similarity between each local feature and the features of the negative samples in the current batch or queue; D2. Calculate the feature discrimination power through a monotonically decreasing transformation, so that the higher the similarity, the lower the discrimination power; D3. Calculate the overall score, and obtain the aggregation weights through Softmax normalization; D4. Obtain the set representation by weighted aggregation.
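Steps D0 to D4 can be sketched as follows. The decreasing transformation and the rule for combining attention intensity with discrimination power are not reproduced in the text; `1 - similarity` and a simple product are used here purely as assumptions, and the function names are hypothetical.

```python
import numpy as np

def aggregation_weights(attention, local_feats, negative_feats):
    """D1-D3: softmax over a score built from attention and discrimination power."""
    local = np.asarray(local_feats, dtype=float)
    neg = np.asarray(negative_feats, dtype=float)
    local = local / np.linalg.norm(local, axis=1, keepdims=True)
    neg = neg / np.linalg.norm(neg, axis=1, keepdims=True)
    mean_sim = (local @ neg.T).mean(axis=1)                   # D1: mean cosine similarity
    discrimination = 1.0 - mean_sim                           # D2: assumed decreasing transform
    score = np.asarray(attention, dtype=float) * discrimination  # D3: assumed product
    exp = np.exp(score - score.max())
    return exp / exp.sum()                                    # D3: softmax normalization

def aggregate(local_feats, weights):
    """D4: weighted sum of local features gives the set representation."""
    return np.asarray(weights) @ np.asarray(local_feats, dtype=float)

local_feats = [[1.0, 0.0], [0.0, 1.0]]   # two key regions
negative_feats = [[1.0, 0.0]]            # one negative, identical to region 0
weights = aggregation_weights([0.5, 0.5], local_feats, negative_feats)
print(weights, aggregate(local_feats, weights))
```

Under these assumed formulas, a region whose feature resembles the negatives ends up with a smaller weight than an equally attended region that does not.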

6. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 1, characterized in that step three specifically includes: E0. Extract metadata such as geographic coordinates and shooting time associated with each image; E1. Construct a geographically proximate hard negative sample pool by selecting samples from images whose geographic distance lies within a set radius, with the sampling probability inversely proportional to the distance; E2. Construct a temporally proximate hard negative sample pool by selecting samples from images within a set window based on seasonal or temporal differences, with the sampling probability inversely proportional to the time difference; E3. Construct a visually similar hard negative sample pool by using a pre-trained visual model to extract features and build a similarity index, which is updated regularly during training; E4. In the absence of labels, calculate the visual similarity of geographically or temporally adjacent sample pairs; if the similarity exceeds a set threshold, the pair is identified as potentially the same object and is excluded from the negative sample pool; E5. Sample negative samples from the three pools according to a preset ratio, adopting a curriculum learning strategy that proceeds from the largest difference in classification hierarchy to the smallest difference.
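The inverse-proportional sampling in E2 and the ratio-based draw in E5 can be sketched with the standard library. The pool names, the ratio values, and the small `eps` guard against division by zero are assumptions; the curriculum schedule of E5 is not modeled.

```python
import random

def inverse_weights(differences, eps=1e-6):
    """Sampling probabilities inversely proportional to a distance or time difference."""
    raw = [1.0 / (d + eps) for d in differences]
    total = sum(raw)
    return [r / total for r in raw]

def sample_negatives(pools, ratios, n, rng):
    """Draw about n negatives from the named pools according to preset ratios."""
    total = sum(ratios[name] for name in pools)
    drawn = []
    for name, pool in pools.items():
        k = min(round(n * ratios[name] / total), len(pool))
        drawn.extend(rng.sample(pool, k))
    return drawn

# Image ids stand in for samples; the three pools correspond to E1, E2 and E3.
pools = {
    "geo": list(range(0, 10)),
    "time": list(range(10, 20)),
    "visual": list(range(20, 30)),
}
ratios = {"geo": 2, "time": 1, "visual": 1}  # hypothetical preset ratio
negatives = sample_negatives(pools, ratios, n=8, rng=random.Random(0))
print(negatives)
print(inverse_weights([1.0, 2.0, 4.0]))  # e.g. time differences in days
```

The probabilities returned by `inverse_weights` could be passed as `weights` to `random.choices` when drawing from a single pool.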

7. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 6, characterized in that the specific steps for constructing the geographic proximity pool and label-free filtering as described in E1 and E4 are as follows: F0. Extract geographic coordinates from each image in the training set, and for each anchor image calculate the distance along the Earth's surface to the other images using the Haversine formula, in which the radius of the Earth appears as a constant; F1. Select the images whose distance is less than or equal to the radius threshold to form the geographic candidate set, where the radius threshold is set to 10 km; F2. In the unlabeled case, calculate the visual similarity between each image in the candidate set and the anchor image by using the pre-trained CLIP model to extract features and calculating the cosine similarity; images whose similarity exceeds the similarity threshold are identified as potentially the same object and excluded from the negative sample pool, where the similarity threshold is set to 0.9; F3. Set the sampling probability to an exponential function that decreases with the distance, where the parameter of the exponential function is set to 5 km.
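Steps F0, F1 and F3 can be illustrated with a short sketch. The Haversine formula is standard; the mean Earth radius of 6371 km and the decay form `exp(-d / scale)` are assumptions, since the claim's own formula is not reproduced in the text. The CLIP-based filter of F2 is omitted.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geo_sampling_probs(anchor, candidates, radius_km=10.0, scale_km=5.0):
    """F1 + F3: keep candidates within the radius, weight them by exp(-d / scale)."""
    kept = []
    for cand in candidates:
        d = haversine_km(anchor[0], anchor[1], cand[0], cand[1])
        if d <= radius_km:
            kept.append((cand, math.exp(-d / scale_km)))
    total = sum(w for _, w in kept)
    return [(cand, w / total) for cand, w in kept]

anchor = (40.0, 116.0)  # hypothetical camera-trap location (lat, lon)
candidates = [(40.01, 116.0), (40.05, 116.0), (41.0, 116.0)]
for cand, prob in geo_sampling_probs(anchor, candidates):
    print(cand, round(prob, 3))
```

The third candidate lies roughly 111 km from the anchor and is dropped by the 10 km radius; the nearer of the two remaining candidates receives the larger probability.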

8. The self-supervised pre-training method for wildlife recognition based on scene bias as described in claim 1, characterized in that step four specifically includes: G0. Collect natural occlusion textures from the background areas of outdoor images, extract texture attributes, and build a texture library; G1. Use procedural noise to generate a smooth random distribution, and threshold it to obtain an irregularly shaped occlusion region; G2. Use the fused attention map from step one to determine the animal's location, set the prior probability distribution of the occlusion location based on the statistical regularity of the parts, and generate the occlusion region within the animal region according to the prior probability; G3. Select texture blocks from the texture library and overlay them according to the texture transparency attribute, where semi-transparent textures are blended with the original image according to the transparency coefficient and opaque textures directly cover the original image; G4. Generate two different occluded versions of the same original image, ensuring that the overlap of the occluded regions is less than a set threshold, and after extracting features from each version using the encoder, calculate the occlusion-invariant contrastive loss; G5. Introduce a decoder network to reconstruct complete features from the occluded features, where the reconstruction target is provided by features extracted from the complete image by the momentum encoder, and calculate the feature completion loss; G6. Jointly optimize the occlusion-related losses and the contrastive learning loss, where the weights are preset hyperparameters; the decoder is used only during the pre-training phase.
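Steps G1 and G3 can be sketched as follows. The claim does not name a specific noise function, so bilinearly upsampled random values (simple value noise) stand in for it; the grid size, threshold, and function names are assumptions. The attention-guided placement of G2 is not modeled.

```python
import numpy as np

def smooth_noise_mask(shape, coarse=4, threshold=0.6, seed=0):
    """G1: threshold a smooth random field to get an irregular occlusion region."""
    rng = np.random.default_rng(seed)
    grid = rng.random((coarse, coarse))
    h, w = shape
    knots = np.arange(coarse)
    xs = np.linspace(0, coarse - 1, w)
    ys = np.linspace(0, coarse - 1, h)
    # Bilinear upsampling: interpolate along columns, then along rows.
    rows = np.array([np.interp(xs, knots, g) for g in grid])            # (coarse, w)
    field = np.array([np.interp(ys, knots, col) for col in rows.T]).T   # (h, w)
    return field > threshold

def overlay(image, texture, mask, alpha):
    """G3: blend texture into image inside the mask; alpha=1.0 means an opaque texture."""
    m = alpha * mask[..., None].astype(float)
    return (1.0 - m) * image + m * texture

rng = np.random.default_rng(1)
image = rng.random((32, 32, 3))     # stand-in for an RGB image in [0, 1]
texture = rng.random((32, 32, 3))   # stand-in for a texture block from the library
mask = smooth_noise_mask((32, 32))
occluded = overlay(image, texture, mask, alpha=0.5)
print(mask.mean(), occluded.shape)
```

Calling `smooth_noise_mask` with two different seeds gives two occlusion regions for the paired views of G4.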