A training method of a recognition model and a related device

By constructing a two-stage training architecture using a deep learning network and a multi-dimensional loss function, the problem of declining identification accuracy of individual cattle during long-term breeding was solved, achieving efficient and low-cost cattle identification.

CN122244898A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF POSTS & TELECOMM


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING UNIV OF POSTS & TELECOMM
Filing Date: 2026-01-26
Publication Date: 2026-06-19


IPC: G06V40/10; G06V10/764; G06V10/77; G06N5/04; G06V10/82; G06N3/0464; G06N3/094; G06V10/44; G06N3/045; G06V10/42; G06V10/54; G06V10/52; G06V10/774


AI Technical Summary

⚠Technical Problem

Existing technologies struggle to deliver contactless identification of individual cattle that combines feature robustness over long-term farming, pose alignment, and background noise suppression, which leads to lower recognition accuracy and higher costs.

⚗Method used

The method constructs a deep learning network that learns in both the spatial and temporal dimensions, designs a multi-dimensional loss function, and adopts a two-stage training architecture of pre-training and fine-tuning. Progressive temporal sampling is used during pre-training, and a multi-dimensional inference module generates pseudo-label samples that are used to optimize the recognition model.

🎯Benefits of technology

It significantly improves the accuracy and generalization performance of individual cattle identification, reduces annotation costs, adapts to long-term data collection and perspective changes, and reduces background noise interference.


Abstract

This disclosure provides a training method and related apparatus for a recognition model. The method includes: acquiring a source domain dataset; performing progressive temporal sampling on the source domain dataset to obtain sampled images; extracting features from the sampled images and training based on a multi-dimensional loss function to obtain a pre-trained recognition model; acquiring a target domain dataset; extracting features from the target domain dataset using the pre-trained recognition model and inputting the features into a multi-dimensional inference module to generate corresponding pseudo-label samples; and performing self-training and optimization on the pre-trained recognition model based on the pseudo-label samples and the target domain gallery data to determine the recognition model. This disclosure can overcome problems such as feature drift caused by individual growth and development, differing data distributions across collection points, and differing camera positions in long-cycle farming scenarios, and significantly improves the accuracy and generalization performance of individual recognition with only a very small number of labeled samples.


Description

Technical Field

[0001] This disclosure relates to the fields of computer vision and deep learning, and in particular to a method for training a recognition model and a related apparatus.

Background Technology

[0002] In the current livestock farming sector, individual cattle identification is the foundation for achieving precise management. The two existing solutions—physical identification-based and visual identification-based—are both insufficient to meet the application needs of real, complex, and long-cycle farming scenarios.

[0003] First, traditional technologies based on physical identification such as ear tags rely on manual operation and contact-based recognition, which has significant limitations in actual large-scale farming. These tags are easily damaged or detached due to frequent fighting and friction among cattle, leading to the loss of identification information and a break in the traceability chain. Furthermore, the installation, replacement, and reading of tags can easily trigger stress reactions in cattle, affecting not only animal welfare and growth performance but also posing safety hazards to staff, and resulting in high overall maintenance costs.

[0004] Secondly, although deep learning-based visual recognition technology has the potential for contactless authentication in theory, it still faces significant technical bottlenecks in the face of the real, dynamic, and complex environment of farms: it is difficult to cope with feature drift during long-term farming; it has weak adaptability to posture changes and perspective distortion; and the detection mechanism introduces background noise that interferes with recognition accuracy.

[0005] Therefore, how to identify individual cattle from a top-down view of the back, without physical contact, while remaining robust to long-term data acquisition, aligning poses, and suppressing background noise is a technical problem that urgently needs to be solved in the field of smart farming.

Summary of the Invention

[0006] In view of this, the purpose of this disclosure is to provide a training method and related apparatus for a recognition model, so as to solve or partially solve the problems raised in the background art.

[0007] To achieve the above objectives, this disclosure provides a method for training a recognition model, the method comprising:

[0008] obtaining a source domain dataset;
performing progressive temporal sampling on the source domain dataset to obtain sampled images;
extracting features from the sampled images and training based on a multi-dimensional loss function to obtain a pre-trained recognition model;
obtaining a target domain dataset, wherein the target domain dataset includes target domain gallery data and target domain query data;
extracting features from the target domain dataset using the pre-trained recognition model, and inputting the features into a multi-dimensional inference module to generate corresponding pseudo-label samples; and
performing self-training and optimization on the pre-trained recognition model based on the pseudo-label samples and the target domain gallery data to determine the recognition model.

[0009] Based on the same inventive concept, this disclosure also provides a training system for a recognition model, the system comprising:
a first acquisition module configured to acquire a source domain dataset;
a sampling module configured to perform progressive temporal sampling on the source domain dataset to obtain sampled images;
a training module configured to extract features from the sampled images and train based on a multi-dimensional loss function to obtain a pre-trained recognition model;
a second acquisition module configured to acquire a target domain dataset, wherein the target domain dataset includes target domain gallery data and target domain query data;
a generation module configured to extract features from the target domain dataset using the pre-trained recognition model and input the features into a multi-dimensional inference module to generate corresponding pseudo-label samples; and
an optimization module configured to perform self-training and optimization on the pre-trained recognition model based on the pseudo-label samples and the target domain gallery data to determine the recognition model.

[0010] Based on the same inventive concept, this disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement a training method for a recognition model as described in any of the above claims.

[0011] Based on the same inventive concept, this disclosure also provides a non-transitory computer-readable storage medium that stores computer instructions for causing the computer to execute any of the above-described training methods for a recognition model.

[0012] Based on the same inventive concept, this disclosure also provides a computer program product, including one or more computer programs which, when executed by one or more processors, implement any of the above-described training methods for a recognition model.

[0013] As can be seen from the above, the training method for the recognition model provided in this disclosure constructs a deep learning network that learns in both the spatial and temporal dimensions, and designs corresponding data processing and multi-dimensional loss function configurations to extract and model multi-dimensional features of the target individual, thereby training a model that can recognize the target individual over the long term. This model can be applied to long-term fine-grained management in the intelligent management systems of modern large-scale farms.

Description of the Drawings

[0014] To illustrate the technical solutions in one or more embodiments of this disclosure or in the prior art more clearly, the accompanying drawings used in the description of the embodiments or the prior art are briefly introduced below. The accompanying drawings described below illustrate only one or more embodiments of this disclosure. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0015] Figure 1 is a flowchart of a training method for a recognition model provided in an embodiment of this disclosure;
Figure 2 is a schematic diagram of a training architecture for a recognition model provided in an embodiment of this disclosure;
Figure 3 is a diagram of recognition results of the recognition model provided in an embodiment of this disclosure;
Figure 4 is a schematic diagram of the structure of a training system for a recognition model provided in an embodiment of this disclosure;
Figure 5 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this disclosure.

Detailed Description

[0016] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.

[0017] It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of this disclosure should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar words used in one or more embodiments of this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0018] Explanation of relevant technical terms:

Precision Livestock Farming (PLF): A farming model that uses information technology to monitor animal health, growth, and welfare.

[0019] Re-ID (Re-Identification): A technology that reconfirms an individual's identity at different times or in different scenarios by extracting individual characteristics.

[0020] Gradient Reversal Layer (GRL): A core module used in adversarial learning that reverses the sign of gradients during backpropagation, enabling the network to extract domain-invariant features.

[0021] Axis-Aligned Bounding Box (AABB): A rectangular target detection box with four sides parallel to the image coordinate axes.

[0022] Convolutional Neural Network (CNN): A type of deep neural network specifically designed for processing data with a grid structure, such as images.

[0023] Source Domain: refers to a known dataset used in the pre-training phase that contains a large number of complete annotations (such as identity labels, collection time, and spatial coordinates) to guide the model in learning general long-period and spatially consistent features.

[0024] Target Domain: refers to the dataset of the application scenario (such as a specific ranch) in which this invention is actually deployed. Its data distribution often differs from that of the source domain (domain shift), and it usually contains only a very small number of labeled samples.

[0025] Target Domain Gallery Data: A set of reference samples pre-entered in the target domain and labeled with real identities, which serve as a comparison benchmark or anchor point during the identification process.

[0026] Target Domain Query Data: Unlabeled image stream samples of the identity to be identified in the target domain. It is the processing object of the multi-dimensional inference module and is used to generate pseudo-labels and perform subsequent fine-tuning.

[0027] Mean Average Precision (mAP): A metric used to comprehensively evaluate the retrieval performance of a recognition model across all individual categories. It is calculated by taking the arithmetic mean of the average precision for each identity category.

[0028] Rank-k accuracy: A key metric for evaluating recognition performance. It refers to the percentage of queries for which the k most similar samples in the model's similarity ranking contain the correct individual. Rank-1 is the most commonly used, representing the probability that the single most similar sample is the correct individual.
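The two retrieval metrics defined above can be illustrated with a minimal sketch. It assumes each query already comes with a gallery ranking sorted by descending similarity, and it averages the average precision over queries rather than grouping by identity first; the function names `average_precision`, `rank_k_hit`, and `evaluate` are hypothetical and do not come from the disclosure.

```python
def average_precision(ranked_ids, true_id):
    """AP for one query; ranked_ids is sorted by descending similarity."""
    hits, precisions = 0, []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == true_id:
            hits += 1
            # Precision measured at each position where a correct match occurs.
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def rank_k_hit(ranked_ids, true_id, k):
    """True if the correct identity appears among the top-k ranked samples."""
    return true_id in ranked_ids[:k]


def evaluate(queries, k=1):
    """queries: list of (ranked_ids, true_id). Returns (mAP, Rank-k accuracy)."""
    aps = [average_precision(ranked, true_id) for ranked, true_id in queries]
    hits = [rank_k_hit(ranked, true_id, k) for ranked, true_id in queries]
    return sum(aps) / len(aps), sum(hits) / len(hits)


# Illustrative rankings only; these are not results from the disclosure.
example_queries = [
    (["cow_a", "cow_b", "cow_a"], "cow_a"),
    (["cow_b", "cow_a", "cow_c"], "cow_a"),
]
mean_ap, rank1 = evaluate(example_queries, k=1)
```

To follow the per-identity averaging described in [0027] literally, the per-query AP values would first be averaged within each identity and then across identities.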

[0029] Accuracy: refers to the proportion of cattle samples in the test dataset that the model classifies or identifies correctly, out of the total number of test samples. It is used to intuitively reflect the discriminative effectiveness of the model.

[0030] Vision Transformer: A deep learning network architecture based on the self-attention mechanism. By dividing an image into multiple patches and combining them with positional encoding, it establishes global dependencies between image regions, effectively capturing complex individual representations and exhibiting strong robustness to occlusion and background interference.

[0031] Triplet Loss: A loss function commonly used in metric learning. Its core idea is to reduce the distance between features of individuals with the same identity, while increasing the distance between features of individuals with different identities, thereby forming discriminative clusters in the feature space.
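As a small illustration of the idea, the following sketch computes the common margin-based form of triplet loss on plain Python lists. The Euclidean distance and the margin value of 0.3 are assumptions chosen for the example; the disclosure does not state which distance or margin it uses.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Same-identity sample is close, different-identity sample is far: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the loss grows with the violation of the margin.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```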

[0032] Cross-Entropy Loss (CE Loss) is a function that measures the difference between the model's predicted probability distribution and the actual identity label distribution. In this invention, this loss function guides the network to learn discriminative visual representations by supervising the model's classification results of individual cattle, thereby enabling individuals with different identities to have higher distinguishability in the feature space.

[0033] Maximum Mean Discrepancy Loss (MMD Loss) is a statistical metric used to evaluate the distance between two different probability distributions. In this invention, this loss function measures the statistical distance between the feature distributions of the source and target domains. By minimizing this difference, the model is forced to ignore scene-specific noise such as environment and lighting, thereby extracting cross-scene domain-invariant features.
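A minimal sketch of the quantity described here is the biased estimate of squared MMD with a Gaussian (RBF) kernel. The kernel choice and the bandwidth parameter `gamma` are assumptions made for illustration; the disclosure does not specify them.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


source_feats = [[0.0, 0.0], [1.0, 0.0]]
shifted_feats = [[5.0, 5.0], [6.0, 5.0]]
same_domain = mmd_squared(source_feats, source_feats)     # identical distributions
cross_domain = mmd_squared(source_feats, shifted_feats)   # clearly shifted distributions
```

Minimizing this value with respect to the feature extractor pulls the two feature distributions toward each other, which is the role the paragraph assigns to the MMD loss.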

[0034] In current precision livestock farming, visual recognition of cattle primarily relies on identification schemes based on deep metric learning, which typically use convolutional neural networks (CNNs) or Transformers to map cattle images into a low-dimensional embedding space. These schemes are driven jointly by triplet loss and cross-entropy loss: during training, features of the same animal captured in different scenes are pulled together while features of different individuals are pushed apart, so that discriminative visual representations are learned. Specific techniques include the omni-scale network (OSNet), which extracts local texture and global body shape features simultaneously; the Transformer-based re-identification model (TransReID), which uses a self-attention mechanism to improve robustness to occlusion; and adversarial learning methods that achieve unsupervised cross-domain adaptation through a domain discriminator and a maximum mean discrepancy loss, mitigating data distribution shifts between different farming environments.

[0035] However, these existing technologies still have limitations in real-world long-term livestock farming scenarios. First, existing recognition methods based on deep metric learning struggle to address feature drift over the months-long growth of cattle. Their loss functions model short-term static appearance and lack robust learning of how an individual changes over time, so feature consistency drops significantly during long-term monitoring. Second, existing models lack spatial alignment mechanisms, making it difficult to overcome perspective distortion caused by cattle movement and positional changes. Neither the OSNet convolutional network nor the Transformer architecture can automatically correct geometric distortion in top-down views, so the features of the same animal become scattered and recognition accuracy suffers severely. Third, existing technologies adapt poorly in few-shot settings, and cross-domain methods rely heavily on target domain annotations. Annotations for new cattle are scarce on real-world pastures, so existing solutions cannot iterate effectively on unlabeled data when only a few labels are available, which increases annotation costs and limits widespread adoption.

[0036] As described above, the related technology has the following problems. First, existing technologies struggle to address feature drift during long-term cattle farming. The farming cycle typically lasts several months, during which the morphology, weight, and coat texture of the cattle change dynamically. Most existing recognition algorithms are trained only on static features within short time windows and cannot learn from the feature changes that arise during long-term data collection, so recognition accuracy decreases under long-term monitoring. Second, existing technologies adapt poorly to posture changes and viewpoint distortion. In real-world overhead views, the postures of cattle are highly variable and their spatial positions are not fixed. Without an effective automatic alignment mechanism, the acquired images often suffer from severe viewpoint scaling and geometric distortion, so the inputs to the model have poor spatial consistency during feature extraction. Third, background noise introduced by the detection mechanism interferes with recognition accuracy. When slender and easily tilted cattle targets are processed from a top-down view, the traditional horizontal detection box, the axis-aligned bounding box (AABB), cannot be aligned with the main axis of the animal and therefore inevitably includes a large amount of pixel information from the ground, fences, or other cattle. This redundant noise greatly interferes with the model's extraction of key discriminative features of the individual (such as back markings and body contours).

[0037] Therefore, there is an urgent need to develop a technical solution for training a recognition model to achieve individual recognition from a cow's back perspective without physical contact. This solution should possess robustness against long-term data acquisition, pose alignment capabilities, and background noise suppression.

[0038] Based on some implementations of this disclosure, a method for training a recognition model is provided. In this method, a deep learning network that considers both spatial and temporal dimensions of learning is constructed, and corresponding data processing and multi-dimensional loss function configurations are designed. Multi-dimensional feature extraction and modeling are performed on cattle back images, training a model capable of long-term beef cattle recognition. This model can be applied to long-term, refined management in the intelligent management system of modern large-scale farms.

[0039] As shown in Figure 1, this disclosure provides a training method for a recognition model that employs a two-stage training architecture of pre-training and fine-tuning to achieve robust modeling of individual characteristics under long-term data collection. The method includes:

S101. Obtain the source domain dataset.

[0040] In some embodiments, a source domain dataset with complete annotation information is obtained. This source domain dataset includes back images of the target individuals, identity labels, collection timestamps, and spatial center coordinates.

[0041] In some embodiments, the source domain dataset refers to a known dataset used in the pre-training phase that contains a large number of complete annotations (such as identity labels, collection time, and spatial coordinates) to guide the model in learning general long-period and spatially consistent features.

[0042] S102. Perform progressive time sampling on the source domain dataset to obtain a sampled image.

[0043] In some embodiments, performing progressive temporal sampling on the source domain dataset to obtain sampled images includes the following. During the pre-training phase, progressive temporal sampling (PTS) is applied to the source domain dataset so that the recognition model learns gradually. The time sampling threshold for each training round is determined by the following formula: [formula not reproduced]

[0044] The quantities in this formula are: the maximum time deviation currently allowed for sampling; the total acquisition time span of the source domain data; the current training round number; and half of the preset total number of pre-training rounds. By introducing a curriculum that moves from easy to difficult, this mechanism enhances the recognition model's ability to model appearance changes over time.

[0045] In some embodiments, the progressive temporal sampling dynamically expands the sampling time window as the training process progresses, allowing the recognition model to be exposed to samples with smaller time spans in the early stages of training, and gradually introducing samples with larger time spans as training progresses, thereby guiding the recognition model to progressively learn and adapt to the temporal drift of the apparent characteristics of the target individual over a long period.

[0046] In some embodiments, this disclosure introduces a progressive time sampling mechanism in the time dimension, using only short time span samples in the early stage of pre-training, and gradually expanding the sampling time window as the training rounds increase, simulating the long-term breeding process of beef cattle, and guiding the model to learn the evolution of growth period appearance features from easy to difficult.
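The widening time window can be sketched as follows. Because the threshold formula is not reproduced in this text, the linear ramp below, which reaches the full acquisition span at half of the pre-training rounds, is only an assumed schedule built from the four quantities the description names. The function names and the use of days as the time unit are likewise hypothetical.

```python
def time_threshold(epoch, total_span_days, half_epochs):
    """Assumed schedule: the window grows linearly until half_epochs, then stays full."""
    return total_span_days * min(1.0, epoch / half_epochs)


def eligible_pairs(samples, max_gap_days):
    """samples: list of (identity, timestamp_in_days).

    Returns index pairs of same-identity samples whose time gap fits the window.
    """
    pairs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            (id_i, t_i), (id_j, t_j) = samples[i], samples[j]
            if id_i == id_j and abs(t_i - t_j) <= max_gap_days:
                pairs.append((i, j))
    return pairs


samples = [("cow1", 0.0), ("cow1", 10.0), ("cow1", 90.0), ("cow2", 5.0)]
early_window = time_threshold(epoch=5, total_span_days=120.0, half_epochs=50)
late_window = time_threshold(epoch=80, total_span_days=120.0, half_epochs=50)
early_pairs = eligible_pairs(samples, early_window)   # only short-gap pairs
late_pairs = eligible_pairs(samples, late_window)     # long-gap pairs are now allowed
```

Early rounds therefore see only pairs captured close together in time, and the 90-day pair for `cow1` enters training only after the window has widened.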

[0047] S103. Extract features from the sampled image and train a pre-trained recognition model based on a multi-dimensional loss function.

[0048] In some embodiments, the step of extracting features from the sampled image and training a pre-trained recognition model based on a multi-dimensional loss function includes: The sampled image is input into the recognition model to be trained for feature extraction, thereby obtaining a deep visual representation that takes into account both local details and global body shape.

[0049] Based on the contrastive learning loss, the spatially-aware alignment loss, and the cross-domain adversarial learning loss, the parameters of the recognition model to be trained are iteratively updated, and the updated model is used as the pre-trained recognition model, thereby aligning the feature distributions.

[0050] In some embodiments, this disclosure designs a spatially-aware alignment loss in the spatial dimension, explicitly utilizing the normalized center coordinates provided by object detection to constrain the model to produce more similar feature representations for samples of the same identity located in similar spatial positions (i.e., with similar shooting perspectives). Simultaneously, by combining cross-domain adversarial learning (through the gradient reversal layer (GRL) and the MMD loss), environmental and temporal domain-specific noise is removed, and long-term general features applicable to the target domain are extracted.

[0051] The spatially-aware alignment loss explicitly introduces location labels to constrain the recognition model to learn spatially consistent features. It is defined through a target similarity, whose calculation formula is: [formula not reproduced]

[0052] The corresponding loss function formula is: [formula not reproduced]

[0053] The symbols in these formulas denote: the identity labels of the two samples in a pair; an indicator function that equals 1 when the two samples belong to the same ID and 0 otherwise; the normalized coordinates of the center points of the two samples' bounding boxes; the cosine similarity between the two feature vectors; the target similarity between the two feature vectors; and the weighting coefficient of the loss term. This design forces the recognition model to extract highly consistent features from the same individual in similar spatial locations, thereby eliminating interference caused by viewpoint shift.
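One way to read these definitions is sketched below. Since the formulas themselves are not reproduced in this text, both the target similarity (an exponential decay in the distance between normalized box centers, with a hypothetical scale `sigma`) and the squared-error form of the loss are assumptions made for illustration, not the formulas of the disclosure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def target_similarity(center_i, center_j, sigma=0.5):
    """Assumed form: equals 1 for identical centers and decays as they move apart."""
    return math.exp(-math.dist(center_i, center_j) / sigma)


def spatial_alignment_loss(feats, ids, centers, weight=1.0, sigma=0.5):
    """Mean squared gap between cosine and target similarity over same-ID pairs."""
    total, count = 0.0, 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            if ids[i] != ids[j]:
                continue  # indicator function is 0 for different identities
            gap = cosine(feats[i], feats[j]) - target_similarity(centers[i], centers[j], sigma)
            total += gap * gap
            count += 1
    return weight * total / count if count else 0.0


# Same identity, same position, identical features: nothing to penalize.
aligned = spatial_alignment_loss(
    [[1.0, 0.0], [1.0, 0.0]], ["cow1", "cow1"], [(0.5, 0.5), (0.5, 0.5)])
# Same identity, same position, orthogonal features: maximal penalty under this form.
misaligned = spatial_alignment_loss(
    [[1.0, 0.0], [0.0, 1.0]], ["cow1", "cow1"], [(0.5, 0.5), (0.5, 0.5)])
```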

[0054] The total loss function in the pre-training phase is expressed as: [formula not reproduced]

[0055] In this formula, the coefficients are all weight parameters. The contrastive learning loss includes the triplet loss and the cross-entropy loss, and is used to learn discriminative features that distinguish the appearance textures of different target individuals. The spatially-aware alignment loss forces the recognition model to extract more consistent features from samples that are spatially close. The cross-domain adversarial learning loss includes the maximum mean discrepancy loss and the binary cross-entropy loss, and is used to align the feature distributions of the source domain dataset and the target domain dataset through the gradient reversal layer (GRL), so that the long-term features learned from the source domain dataset transfer to the target domain dataset while the impact of differing data distributions is reduced.
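The grouping of the five loss terms can be made concrete with a short sketch. The weighted sum and the way the three weights attach to the three groups are assumptions, because the total loss formula is not reproduced in this text. The parameter names `w_con`, `w_spa`, and `w_adv` are hypothetical.

```python
def pretraining_loss(l_tri, l_ce, l_spa, l_mmd, l_bce,
                     w_con=1.0, w_spa=1.0, w_adv=1.0):
    """Assumed weighted sum of the three loss groups named in the description."""
    contrastive = l_tri + l_ce      # triplet loss + cross-entropy loss
    adversarial = l_mmd + l_bce     # MMD loss + domain-discriminator BCE loss
    return w_con * contrastive + w_spa * l_spa + w_adv * adversarial


# Illustrative per-batch loss values; these are not results from the disclosure.
total = pretraining_loss(l_tri=0.4, l_ce=1.2, l_spa=0.3, l_mmd=0.1, l_bce=0.7,
                         w_con=1.0, w_spa=0.5, w_adv=0.1)
```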

[0056] By outputting pre-trained recognition model parameters with general discriminative power, the pre-trained recognition model initially possesses robustness in dealing with feature drift caused by perspective distortion and long-term data acquisition, serving as initial weights for subsequent fine-tuning stages.

[0057] In some embodiments, this disclosure guides the model to smoothly capture feature drift caused by growth by designing specific temporal sampling strategies, thereby ensuring the stability of individual representations throughout the entire breeding cycle.

[0058] In some embodiments, this disclosure introduces the spatially-aware alignment loss to explicitly use the positional information of the target individual in the image to constrain the feature distribution, forcing the recognition model to extract highly aligned feature vectors for samples of the same target individual that are spatially close, thus eliminating intra-class feature differences caused by viewpoint shift.

[0059] S104. Obtain the target domain dataset, which includes target domain gallery data and target domain query data.

[0060] In some embodiments, the target domain dataset includes target domain gallery data, consisting of a small number of target domain samples with identity labels, and target domain query data, consisting of a large number of target domain samples without identity labels.

[0061] In some embodiments, the target domain refers to the dataset of the application scenario (such as a specific ranch) in which this disclosure is actually deployed. Its data distribution often differs from that of the source domain (domain shift), and it usually contains only a very small number of labeled samples.

[0062] In some embodiments, the Target Domain Gallery Data refers to a set of reference samples pre-recorded in the target domain and labeled with real identities, which serves as a comparison benchmark or anchor point during the identification process.

[0063] In some embodiments, the Target Domain Query Data refers to unlabeled image stream samples of the identity to be identified in the target domain. It is the processing object of the multi-dimensional inference module and is used to generate pseudo-labels and perform subsequent fine-tuning.

[0064] S105. Use the pre-trained recognition model to extract features from the target domain dataset, and input the features into the multi-dimensional inference module to generate corresponding pseudo-label samples.

[0065] In some embodiments, the step of extracting features from the target domain dataset using the pre-trained recognition model and inputting the features into a multi-dimensional inference module to generate corresponding pseudo-label samples includes: The pre-trained recognition model is used to extract features from the target domain dataset, and multi-dimensional reasoning and spatiotemporal constraint reasoning are performed based on the features and the corresponding spatial coordinates and time information.

[0066] In some embodiments, the multi-dimensional inference module uses spatial proximity to the target domain gallery data to filter candidate identities, generates a corresponding initial pseudo-label for each unlabeled sample, and updates the dynamic centroid of each candidate identity according to the time decay weight.

[0067] During the fine-tuning phase, this disclosure proposes a multi-dimensional inference module designed to generate highly reliable pseudo-labels for unlabeled individuals (i.e., target domain query data) in long-cycle farming scenarios within the target domain, using the spatial coordinates and collection time information of samples in the target domain dataset. By modeling individual features and representing them dynamically, the feature drift caused by individual growth is effectively mitigated.

[0068] The step of performing multi-dimensional reasoning and spatiotemporal constraint reasoning based on the features and the corresponding spatial coordinates and time information includes:

S1051. Preprocess the target domain gallery data.

[0069] By initializing an empty dictionary of query records The query record dictionary uses individual identity (ID) as the key to store the feature vectors, spatial coordinates and timestamps of all query samples under that ID, laying the foundation for subsequent time series analysis.

[0070] The existing target domain library data with real identity labels are grouped by ID, and the spatial coordinate information of each sample is integrated to form a structured dataset containing features, identity labels, and spatial coordinates (i.e., the coordinates of the center of the detection box). This dataset serves as the authoritative reference library against which the system performs identity comparison.
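The two structures prepared in S1051 can be sketched as follows. This is a minimal illustration only, assuming each sample carries a feature vector, the center coordinates of its detection box, and a timestamp; the names `Sample`, `build_gallery`, and `query_history` are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Sample:
    """One observation: feature vector, detection-box center, capture time."""
    feature: List[float]
    xy: Tuple[float, float]
    timestamp: float


def build_gallery(labeled: List[Tuple[str, Sample]]) -> Dict[str, List[Sample]]:
    """Group labeled library (gallery) samples by identity (ID)."""
    gallery: Dict[str, List[Sample]] = defaultdict(list)
    for identity, sample in labeled:
        gallery[identity].append(sample)
    return dict(gallery)


# Query record dictionary: starts empty and is keyed by ID (hypothetical name).
query_history: Dict[str, List[Sample]] = defaultdict(list)

# Toy data, invented for illustration.
gallery = build_gallery([
    ("cow_01", Sample([1.0, 0.0], (0.20, 0.30), 0.0)),
    ("cow_02", Sample([0.0, 1.0], (0.80, 0.70), 0.0)),
    ("cow_01", Sample([0.9, 0.1], (0.25, 0.30), 1.0)),
])
```

Keying both structures by ID keeps the later per-candidate steps (spatial screening and centroid calculation) a simple dictionary lookup.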

[0071] S1052. Based on spatial proximity, the samples in the target domain query data are matched with the samples in the target domain library data to obtain an initial candidate set.

[0072] For each query sample to be identified in the target domain query data, a rapid screening is performed first instead of comparing the sample with all samples in the target domain library data. The distance between the spatial coordinates of the query sample and the coordinates of all gallery samples under each candidate ID is calculated, and the top-k gallery samples that are spatially closest to the query sample under each candidate ID are selected as the initial candidate set. This strategy rests on the assumption that observations of the same individual at nearby spatial locations are more similar in appearance.
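The spatial screening in S1052 might look like the sketch below. It assumes Euclidean distance between detection-box centers and a per-ID top-k selection; the function name `spatial_candidates` and the tuple layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# (feature vector, (x, y) center of the detection box)
GallerySample = Tuple[List[float], Tuple[float, float]]


def spatial_candidates(
    query_xy: Tuple[float, float],
    gallery: Dict[str, List[GallerySample]],
    k: int,
) -> Dict[str, List[GallerySample]]:
    """For each candidate ID, keep the k gallery samples nearest to the query."""
    candidates = {}
    for identity, samples in gallery.items():
        # Sort this ID's samples by Euclidean distance to the query position.
        ranked = sorted(samples, key=lambda s: math.dist(query_xy, s[1]))
        candidates[identity] = ranked[:k]
    return candidates


# Toy data, invented for illustration.
gallery = {
    "cow_01": [([1.0, 0.0], (0.1, 0.1)), ([0.9, 0.1], (0.9, 0.9))],
    "cow_02": [([0.0, 1.0], (0.5, 0.5))],
}
initial = spatial_candidates((0.2, 0.2), gallery, k=1)
```

Only the retained samples take part in the later feature comparison, which is what keeps the screening cheap relative to an exhaustive search.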

[0073] S1053. Based on the initial candidate set, calculate the dynamic centroid and perform feature matching.

[0074] For each candidate ID, the reference feature used for comparison is not a fixed gallery sample feature, but a dynamic centroid. The centroid is jointly determined by the gallery sample features and the historical query sample features of the candidate ID, with a time decay weight introduced.

[0075] The dynamic centroid is calculated as:

[0076] where the balance coefficient sets the relative contribution of the initial labeled features and the historical predicted features, and the query history is the set of historical query samples of the candidate ID.

[0077] The time weight of each historical sample is defined by the following exponential decay function:

[0078] where the time weighting factor controls the decay rate, the timestamp of each historical sample enters through the exponential function, and the timestamp set collects the timestamps of all historical query samples of the target individual associated with the given centroid. This mechanism assigns higher weight to recent samples, enabling the centroid to capture the latest appearance features of the target individual as it grows.
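One possible reading of the dynamic centroid is sketched below. The exact formula is an assumption: the centroid is taken as a convex combination of the mean gallery feature and a time-weighted mean of the historical query features, with weights `exp(-lam * (t_now - t))` normalized over the history. The names `alpha`, `lam`, and `dynamic_centroid` are hypothetical.

```python
import math
from typing import List, Tuple


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dynamic_centroid(
    gallery_feats: List[List[float]],
    history: List[Tuple[List[float], float]],  # (feature, timestamp)
    t_now: float,
    alpha: float = 0.5,  # assumed balance coefficient
    lam: float = 0.1,    # assumed decay-rate factor
) -> List[float]:
    base = mean_vector(gallery_feats)
    if not history:
        # No query history yet: fall back to the labeled gallery features.
        return base
    # Exponential time decay: recent samples receive larger weights.
    weights = [math.exp(-lam * (t_now - t)) for _, t in history]
    total = sum(weights)
    hist = [
        sum(w * f[d] for w, (f, _) in zip(weights, history)) / total
        for d in range(len(base))
    ]
    return [alpha * b + (1.0 - alpha) * h for b, h in zip(base, hist)]


centroid = dynamic_centroid([[1.0, 0.0]], [([0.0, 1.0], 10.0)], t_now=10.0)
```

With `alpha` close to 1 the centroid stays anchored to the labeled gallery features; smaller values let recent pseudo-labeled observations dominate, which is the behavior the time decay is meant to provide.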

[0079] In some embodiments, historical query samples closer to the current query time contribute more to the current centroid, while the influence of older samples diminishes accordingly, so the centroid reflects the individual's most recent appearance. Subsequently, the feature distance between the features of the current query sample and the dynamic centroid obtained in the previous step is calculated (cosine distance or Euclidean distance is typically used). Because the candidates are screened by spatial proximity and the centroid is weighted by time, this feature distance incorporates spatial proximity and temporal continuity, making it more robust than simple feature matching.

[0080] S1054. Based on the dynamic centroid and feature matching results, perform identity prediction on the samples in the target domain query data to determine the corresponding initial pseudo-labels.

[0081] The feature distances between the query sample and the dynamic centroids of all candidate IDs are compared, and the candidate ID with the smallest feature distance is determined as the predicted identity of the current query sample. Once the prediction is complete, the features, coordinates, and timestamp of the current query sample are added as a new data point to the query history of the predicted candidate ID. Each successful recognition thus enriches the system's reference data, so that subsequent recognition draws on richer and more recent context, forming a progressive learning mechanism with continuous self-optimization capability.
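The prediction and history update in S1054 can be illustrated as follows. The sketch assumes cosine distance and non-zero feature vectors; `predict_and_record` and the tuple layout of a history entry are hypothetical.

```python
import math
from typing import Dict, List, Tuple

HistoryEntry = Tuple[List[float], Tuple[float, float], float]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def predict_and_record(
    query_feat: List[float],
    query_xy: Tuple[float, float],
    query_t: float,
    centroids: Dict[str, List[float]],
    history: Dict[str, List[HistoryEntry]],
) -> str:
    """Pick the ID whose centroid is nearest, then log the query under it."""
    predicted = min(
        centroids, key=lambda i: cosine_distance(query_feat, centroids[i])
    )
    history.setdefault(predicted, []).append((query_feat, query_xy, query_t))
    return predicted


# Toy data, invented for illustration.
centroids = {"cow_01": [1.0, 0.0], "cow_02": [0.0, 1.0]}
history: Dict[str, List[HistoryEntry]] = {}
label = predict_and_record([0.9, 0.2], (0.3, 0.3), 5.0, centroids, history)
```

Because the history is updated immediately, the next centroid computed for the predicted ID already includes this observation.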

[0082] S1055. The initial pseudo-labels are processed based on the similarity matrix to generate corresponding pseudo-label samples.

[0083] In some embodiments, the initial pseudo-labels are filtered based on the similarity matrix to select highly reliable pseudo-label samples. Based on the pseudo-label samples and the target domain library data, the contrastive learning loss and the spatial awareness alignment loss are recalculated, thereby enabling the self-training and continuous optimization of the recognition model in the target domain.

[0084] The total fine-tuning loss function in the fine-tuning phase is expressed as:

[0085] where the terms are the loss on the target domain library data, the loss on the target domain query data based on the pseudo-label samples, and a hyperparameter for balancing the strength of self-supervised learning.

[0086] The loss function for the target domain library data and for the target domain query data based on the pseudo-label samples is expressed as:

[0087] where the coefficients are all weight parameters, and the three terms are the triplet loss, the cross-entropy loss, and the spatial awareness alignment loss.
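One form consistent with the term descriptions above is given below as a hedged sketch. All symbols are assumed for illustration: the subscripts g and q denote the library (gallery) data and the pseudo-labeled query data, λ is the self-supervision balance hyperparameter (assumed here to scale the query term), and β₁ to β₃ are the weight parameters.

```latex
\begin{aligned}
\mathcal{L}_{\text{ft}} &= \mathcal{L}_{g} + \lambda\,\mathcal{L}_{q}, \\
\mathcal{L}_{d} &= \beta_{1}\,\mathcal{L}_{\text{tri}}
                 + \beta_{2}\,\mathcal{L}_{\text{ce}}
                 + \beta_{3}\,\mathcal{L}_{\text{spa}},
\qquad d \in \{g, q\}.
\end{aligned}
```

Under this reading, setting λ to zero reduces fine-tuning to supervised training on the labeled library data alone.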

[0088] A similarity matrix between the query samples and the library samples is constructed, and the generated pseudo-labels are filtered using a preset similarity threshold, retaining only highly reliable pseudo-label sample pairs to ensure the stability of the fine-tuning process. Each element of the similarity matrix is defined as:

[0089] where the two vectors are the feature vector of a given query sample in the target domain query data and the feature vector of a given sample in the target domain library data. The formula evaluates the confidence of a prediction by calculating the cosine similarity between the query sample and the library sample. Only when the similarity satisfies the preset threshold is the corresponding pseudo-labeled sample retained in the filtered query set, which is used to calculate the loss function in the subsequent fine-tuning stage, thereby reducing the negative interference of low-reliability predictions on model training.
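The threshold filtering can be sketched as below. The retention rule is an assumption: a pseudo-label is kept when the query's highest cosine similarity to any library sample of the predicted ID is at least `tau`. The function names and `tau` are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    )


def similarity_matrix(
    query_feats: List[List[float]], gallery_feats: List[List[float]]
) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity of query i and library sample j."""
    return [[cosine_similarity(q, g) for g in gallery_feats] for q in query_feats]


def filter_pseudo_labels(
    sim: List[List[float]],
    pseudo_labels: List[str],
    gallery_labels: List[str],
    tau: float,
) -> List[int]:
    """Return indices of query samples whose pseudo-label passes the threshold."""
    kept = []
    for i, label in enumerate(pseudo_labels):
        scores = [sim[i][j] for j, g in enumerate(gallery_labels) if g == label]
        if scores and max(scores) >= tau:
            kept.append(i)
    return kept


# Toy data, invented for illustration.
sim = similarity_matrix([[1.0, 0.0], [0.6, 0.8]], [[1.0, 0.0], [0.0, 1.0]])
kept = filter_pseudo_labels(sim, ["a", "a"], ["a", "b"], tau=0.8)
```

In this toy case the second query is labeled "a" but resembles the "b" sample more closely, so it is dropped rather than passed to fine-tuning.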

[0090] In some embodiments, through iterative execution of the above steps, the multi-dimensional inference module can automatically correct and improve the quality of pseudo-labels through spatiotemporal constraints with minimal human intervention, significantly improving the generalization accuracy of the model under few-shot conditions.

[0091] S106. Based on the pseudo-label samples and the target domain library data, the pre-trained recognition model is self-trained and optimized to determine the recognition model.

[0092] In some embodiments, the step of self-training and optimizing the pre-trained recognition model based on the pseudo-label samples and the target domain library data to determine the recognition model includes: Based on the pseudo-label samples and the target domain library data, the contrastive learning loss and spatial awareness alignment loss are recalculated, and the parameters of the pre-trained recognition model are incrementally adjusted. By continuously iterating and performing spatiotemporal constraint reasoning and loss calculation, the pre-trained recognition model can be self-updated and its accuracy improved with minimal human intervention. Finally, a target individual recognition model optimized for a specific environment is obtained, achieving non-contact, feature-drift-resistant long-term fine management.
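The iterative loop described in S105 and S106 can be outlined as follows. This is a structural sketch only: `generate_pseudo_labels` and `fine_tune` are hypothetical callables standing in for the multi-dimensional inference module and the loss-driven parameter update, and the fixed round count is an assumption.

```python
from typing import Any, Callable, List


def self_train(
    model: Any,
    gallery: List[Any],
    queries: List[Any],
    generate_pseudo_labels: Callable[[Any, List[Any], List[Any]], List[Any]],
    fine_tune: Callable[[Any, List[Any], List[Any]], Any],
    rounds: int = 3,
) -> Any:
    """Alternate pseudo-label generation and fine-tuning for a few rounds."""
    for _ in range(rounds):
        pseudo = generate_pseudo_labels(model, gallery, queries)
        if not pseudo:
            # Nothing passed the reliability filter: stop early.
            break
        model = fine_tune(model, gallery, pseudo)
    return model


# Toy stand-ins: the "model" is just a counter of completed updates.
result = self_train(
    model=0,
    gallery=["g1"],
    queries=["q1", "q2"],
    generate_pseudo_labels=lambda m, g, q: list(q),
    fine_tune=lambda m, g, p: m + 1,
)
```

Regenerating pseudo-labels with the updated model in every round is what closes the loop: better features yield more reliable labels, which in turn drive the next update.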

[0093] In some embodiments, this disclosure addresses the performance bottleneck and high annotation costs associated with recognition under few-shot conditions in the target domain by providing an inference module with adaptive optimization capabilities. Existing re-identification technologies often rely on extensive manual annotation when facing new scenarios, making it difficult to achieve effective model self-updating with only a very small amount of initial data. Therefore, this disclosure utilizes a progressive spatiotemporal inference mechanism, leveraging the inherent temporal and spatial constraints of the target domain samples, to automatically generate and filter high-quality pseudo-labels during the fine-tuning stage. This enables the model to achieve closed-loop iteration under unsupervised or minimally supervised conditions, thereby significantly improving the generalization efficiency and robustness of individual identification in cross-scenario deployment.

[0094] This disclosure can overcome problems that existing recognition technologies face in long-term breeding scenarios: feature drift caused by the growth of target individuals, differing data distributions across collection points, and differing positions of the individual relative to the camera. It can significantly improve the accuracy and generalization performance of individual recognition with very few labeled samples.

[0095] It can be understood that this method can be executed by any apparatus, device, platform, or cluster of devices with computing and processing capabilities.

[0096] It should be noted that the methods of one or more embodiments of this disclosure can be executed by a single device, such as a computer or server. The methods of this embodiment can also be applied in a distributed scenario, where multiple devices cooperate to complete the process. In such a distributed scenario, one of these devices may execute only one or more steps of the methods of one or more embodiments of this disclosure, and the multiple devices will interact with each other to complete the method described.

[0097] It should be noted that the above description pertains to specific embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims may be performed in a different order than those shown in the embodiments and may still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired results. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0098] Figure 2 is a schematic diagram of a training architecture for a recognition model provided in an embodiment of the present disclosure. As shown in Figure 2, this disclosure provides a training method for a non-contact target individual recognition model based on a back view. Through a two-stage, multi-dimensional learning framework, it models changes in individual appearance in both the temporal and spatial dimensions, overcoming the performance bottleneck of traditional recognition methods over long time spans and under complex perspectives. Specifically, the pre-training stage uses source domain data and progressive temporal sampling, jointly optimizing contrastive learning, spatial awareness learning, and cross-domain adversarial learning so that the recognition model learns feature representations with temporal robustness and domain invariance. The fine-tuning stage, based on a small amount of labeled data and a large amount of unlabeled data from the target domain, generates and filters pseudo-labels using a multi-dimensional inference module. Through iterative optimization of the contrastive learning loss and the spatial awareness alignment loss, it achieves efficient adaptation and continuous improvement of the model in the target scene.

[0099] The beneficial effects of this disclosure are mainly reflected in the following aspects. First, it significantly enhances resistance to feature drift caused by long-term data collection. The breeding cycle of beef cattle is long, and individual appearance features evolve dynamically during fattening (feature drift), which reduces recognition accuracy. To address this, this disclosure introduces a progressive temporal sampling strategy in the pre-training stage, coupled with a cross-domain adversarial learning mechanism, guiding the model to capture long-term evolution patterns from easy to difficult and removing time-domain-specific interference. In the fine-tuning stage, the feature centroid is dynamically updated through the multi-dimensional inference module, so that the model tracks the latest appearance state of each individual in real time. As shown in Table 1, the multi-dimensional inference module improves recognition accuracy: when this module is removed, the model's mAP on the test set decreases by 1.54% and the Rank-1 accuracy decreases by 1.10%.

[0100] Secondly, it effectively eliminates feature inconsistencies caused by perspective distortion. This disclosure, by designing a spatially perceptual alignment loss function, explicitly incorporates image spatial coordinates into feature learning, forcing the model to extract highly aligned features for the same cow in similar spatial positions. This solves the perspective and radial distortion caused by the cow's positional changes under a top-down view, and eliminates the excessive dispersion of intra-class features in the embedding space due to different perspectives. Experimental data shows that this disclosure significantly outperforms traditional re-identification baseline models on the BECA-L long-term test set.

[0101] Third, it balances high recognition accuracy with lightweight deployment efficiency. As shown in Table 2, this disclosure generalizes well across backbone networks, and when combined with a lightweight network (such as OSNet x0.25) it achieves excellent performance with an extremely low parameter count (0.201M). The feature vectors generated by the model effectively distinguish between different cattle, as shown in Figure 3. The low parameter count allows the algorithm to be deployed on edge-side smart terminals with limited computing resources (such as surveillance camera boxes), meeting the needs of real-time ranch monitoring.

[0102] Table 1. Recognition Results of Different Recognition Methods

[0103] Table 2. Recognition Results of Different Base Models
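For reference, the Rank-1 accuracy and mAP metrics named in the results can be computed as in the sketch below. It assumes the definitions commonly used in re-identification (Rank-1 is the fraction of queries whose top-ranked library sample has the correct ID; mAP is the mean over queries of average precision); the disclosure does not state its evaluation protocol, and the function name is hypothetical.

```python
from typing import List, Tuple


def rank1_and_map(
    ranked_labels: List[List[str]], query_labels: List[str]
) -> Tuple[float, float]:
    """ranked_labels[i] lists library IDs sorted by similarity to query i."""
    hits, ap_sum = 0, 0.0
    for ranking, truth in zip(ranked_labels, query_labels):
        if ranking and ranking[0] == truth:
            hits += 1
        found, precision_sum = 0, 0.0
        for pos, label in enumerate(ranking, start=1):
            if label == truth:
                found += 1
                # Precision at each position where a correct match occurs.
                precision_sum += found / pos
        ap_sum += precision_sum / found if found else 0.0
    n = len(query_labels)
    return hits / n, ap_sum / n


# Toy rankings, invented for illustration.
rank1, mean_ap = rank1_and_map(
    [["a", "b", "a"], ["b", "a", "a"]], ["a", "a"]
)
```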

[0104] In the proposed two-stage, multi-dimensional learning framework, the core recognition model (backbone network) is highly flexible and extensible. This embodiment preferably uses the lightweight omni-scale feature extraction network OSNet (such as OSNet x0.25 or OSNet x1.0) to balance computational efficiency and accuracy. In practical applications, however, the recognition model of this disclosure can be replaced by other base recognition models depending on the available computing power and the accuracy requirements. These alternative models include, but are not limited to, conventional deep convolutional neural network architectures such as the ResNet series, ResNeXt, DenseNet, and MobileNetV3. Re-identification models based on the Vision Transformer architecture, such as Swin Transformer or ViT, can also be employed, using their global attention mechanism to capture richer individual representations.

[0105] Furthermore, in the implementation of feature metric learning, besides the combination of triplet loss and cross-entropy loss as emphasized in this disclosure, other advanced metric learning loss functions, such as ArcFace, CosFace, Circle Loss, or Center Loss, can be used as needed to further enhance the discriminative power between categories in the feature space. In the progressive temporal sampling module of the pre-training stage, the growth function of the time window, in addition to linear growth, can also be replaced with exponential growth, step-like growth, or other non-linear sampling logic based on the data distribution characteristics. In the multi-dimensional inference module of the fine-tuning stage, the calculation formula of the dynamic centroid and the definition of the time decay weights can also be adjusted, for example, replacing exponential decay with linear decay or density-based dynamic weight allocation. As long as the core logic still follows the pseudo-label generation and filtering mechanism under spatiotemporal constraints, it falls within the protection scope of this disclosure.
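The interchangeable growth and decay functions mentioned above can be illustrated with the sketch below. Every formula here is an assumption chosen only to show the shapes involved: each window function grows from 0 to the total collection span as the epoch approaches half of the pre-training rounds, and each decay function maps a sample's age to a weight in [0, 1]. All function and parameter names are hypothetical.

```python
import math


def linear_window(epoch: int, half_epochs: int, total_span: float) -> float:
    """Time window grows linearly and saturates at the full span."""
    return total_span * min(1.0, epoch / half_epochs)


def exponential_window(epoch: int, half_epochs: int, total_span: float) -> float:
    """Slow start, faster growth later; reaches the full span at half_epochs."""
    frac = min(1.0, epoch / half_epochs)
    return total_span * (2.0 ** frac - 1.0)


def step_window(
    epoch: int, half_epochs: int, total_span: float, steps: int = 4
) -> float:
    """Window widens in equal discrete jumps."""
    frac = min(1.0, epoch / half_epochs)
    return total_span * math.ceil(frac * steps) / steps


def exponential_decay(age: float, lam: float) -> float:
    """Weight of a historical sample that is `age` time units old."""
    return math.exp(-lam * age)


def linear_decay(age: float, horizon: float) -> float:
    """Weight falls linearly to zero at `horizon` and stays there."""
    return max(0.0, 1.0 - age / horizon)
```

Swapping one of these functions for another changes only how quickly harder (more time-separated) samples are admitted or how quickly old observations are forgotten; the surrounding sampling and centroid logic stays the same.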

[0106] Based on the same inventive concept, corresponding to any of the methods in the above embodiments, this disclosure also provides a training system for a recognition model. As shown in Figure 4, the system includes: a first acquisition module 401 configured to acquire the source domain dataset; a sampling module configured to perform progressive temporal sampling on the source domain dataset to obtain sampled images; a training module 402 configured to extract features from the sampled images and train a pre-trained recognition model based on a multi-dimensional loss function; a second acquisition module 403 configured to acquire a target domain dataset, wherein the target domain dataset includes target domain library data and target domain query data; a generation module 404 configured to extract features from the target domain dataset using the pre-trained recognition model and input the features into the multi-dimensional inference module to generate corresponding pseudo-label samples; and an optimization module 405 configured to perform self-training and optimization on the pre-trained recognition model based on the pseudo-label samples and the target domain library data to determine the recognition model.

[0107] For ease of description, the above system is described by dividing it into modules according to function. When implementing one or more embodiments of this disclosure, the functions of the modules can be implemented in one or more pieces of software and/or hardware.

[0108] The system described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0109] The non-contact individual identification technology based on back view proposed in this disclosure has applications far beyond beef cattle farming and has broad potential for cross-industry promotion.

[0110] Firstly, in the field of smart livestock farming and precision livestock farming (PLF), this disclosure is not only applicable to long-term monitoring of various beef cattle breeds, but can also be directly extended to tasks such as identification, behavioral monitoring, and health assessment of dairy cattle. Based on the stability and feature richness of the back view, this method can also be adapted to individual identification and dynamic management of other group-raised livestock such as pigs, sheep, and goats.

[0111] Secondly, in the field of financial credit and insurance supervision, this disclosure provides core technical support for "live asset collateral." By conducting high-frequency, contactless, and unique identification of the collateralized cattle, financial institutions can achieve dynamic supervision and real-time value assessment of the collateral, effectively resolving the difficulties in supervising and confirming the ownership of live assets, and contributing to the stable development of beef and dairy cattle production.

[0112] Furthermore, in the field of wildlife conservation and ecological monitoring, the spatiotemporal modeling mechanism disclosed herein can also assist researchers in achieving long-term population tracking and individual behavior analysis for rare species with distinctive dorsal features without interfering with the animals' natural behavior.

[0113] Figure 5 illustrates a more specific hardware structure of an electronic device provided in this embodiment, which may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another within the device via the bus 1050.

[0114] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this disclosure.

[0115] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this disclosure are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

[0116] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.

[0117] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB or an Ethernet cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth).

[0118] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.

[0119] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this disclosure, and not necessarily all the components shown in the figures.

[0120] The electronic devices described above are used to implement the corresponding methods in the foregoing embodiments and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0121] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0122] Based on the same inventive concept, corresponding to the training method of a recognition model in any of the above embodiments, this disclosure also provides a computer program product, which includes one or more computer programs. In some embodiments, the one or more computer programs are executable by one or more processors to cause the one or more processors to execute the training method of a recognition model. Corresponding to the execution entity for each step in each embodiment of the training method of a recognition model, the processor executing the corresponding step may belong to the corresponding execution entity. The computer program product of the above embodiments is used to cause the processor to execute the training method of a recognition model as described in any of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0123] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this disclosure, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of one or more embodiments of this disclosure as described above, which are not provided in detail for the sake of brevity.

[0124] Additionally, to simplify the description and discussion, and to avoid obscuring one or more embodiments of this disclosure, the provided drawings may or may not show well-known power / ground connections to integrated circuit (IC) chips and other components. Furthermore, the apparatus may be shown in block diagram form to avoid obscuring one or more embodiments of this disclosure, and this also takes into account the fact that the details of implementation of these block diagram apparatuses are highly dependent on the platform on which one or more embodiments of this disclosure will be implemented (i.e., these details should be fully understood by those skilled in the art). While specific details (e.g., circuitry) are set forth to describe exemplary embodiments of this disclosure, it will be apparent to those skilled in the art that one or more embodiments of this disclosure may be implemented without these specific details or with variations thereof. Therefore, these descriptions should be considered illustrative rather than restrictive.

[0125] Although this disclosure has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.

[0126] This disclosure includes one or more embodiments intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A method for training a recognition model, characterized in that the method includes: obtaining a source domain dataset; performing progressive temporal sampling on the source domain dataset to obtain sampled images; performing feature extraction on the sampled images and training, based on a multi-dimensional loss function, to obtain a pre-trained recognition model; obtaining a target domain dataset, wherein the target domain dataset includes target domain library data and target domain query data; extracting features from the target domain dataset using the pre-trained recognition model, and inputting the features into a multi-dimensional inference module to generate corresponding pseudo-label samples; and performing self-training and optimization on the pre-trained recognition model based on the pseudo-label samples and the target domain library data to determine the recognition model.

2. The method of claim 1, wherein, The source domain dataset includes the back image of the target individual, identity tag, collection timestamp, and spatial center coordinates.

3. The method of claim 1, wherein the step of progressively temporally sampling the source domain dataset to obtain sampled images includes: sampling the source domain dataset using progressive temporal sampling to obtain sampled images; wherein the temporal sampling threshold for each training epoch is given by a formula in which the quantities are: the maximum time deviation currently allowed for sampling; the total collection time length of the source domain data; the current training round number; and half of the preset total number of pre-training rounds.

4. The method according to claim 1, characterized in that the step of extracting features from the sampled images and training a pre-trained recognition model based on a multi-dimensional loss function includes: inputting the sampled images into the recognition model to be trained for feature extraction; and iteratively updating the parameters of the recognition model to be trained based on the contrastive learning loss, the spatial awareness alignment loss, and the cross-domain adversarial learning loss, and using the updated recognition model as the pre-trained recognition model; wherein the target similarity and the corresponding loss function are each given by a formula in which: the identity labels are the individual identity labels of the two samples; the indicator function takes the value 1 when the two samples belong to the same ID and 0 otherwise; the coordinates are the normalized center-point coordinates of the bounding boxes of the two samples; the cosine similarity and the target similarity are computed between the feature vectors; and the weighting coefficients weight the loss terms; and wherein the total loss function in the pre-training phase combines, with weight parameters: the contrastive learning loss, which includes a triplet loss and a cross-entropy loss; the spatial awareness alignment loss; and the cross-domain adversarial learning loss, which includes a maximum mean discrepancy loss and a binary cross-entropy loss.

5. The method according to claim 1, characterized in that, The step of extracting features from the target domain dataset using the pre-trained recognition model and inputting the features into the multi-dimensional inference module to generate corresponding pseudo-label samples includes: The pre-trained recognition model is used to extract features from the target domain dataset; Based on the features and the corresponding spatial coordinates and time information, multi-dimensional reasoning and spatiotemporal constraint reasoning are performed. The step of performing multi-dimensional reasoning and spatiotemporal constraint reasoning based on the features and the corresponding spatial coordinates and time information includes: The target domain library data is preprocessed; Based on spatial proximity, samples in the target domain query data are matched with samples in the target domain database data to obtain an initial candidate set; Based on the initial candidate set, dynamic centroid calculation and feature matching are performed. Based on the dynamic centroid and feature matching results, the identity of the samples in the target domain query data is predicted to determine the corresponding initial pseudo-labels; The initial pseudo-labels are processed based on the similarity matrix to generate corresponding pseudo-label samples.

6. The method according to claim 1, characterized in that, The step of self-training and optimizing the pre-trained recognition model based on the pseudo-label samples and the target domain library data to determine the recognition model includes: Based on the pseudo-label samples and the target domain library data, the contrastive learning loss and spatial awareness alignment loss are calculated, and the parameters of the pre-trained recognition model are incrementally adjusted to determine the recognition model.

7. A training system for a recognition model, characterized in that, The system includes: The first acquisition module is configured to acquire the source domain dataset; The sampling module is configured to perform progressive temporal sampling on the source domain dataset to obtain a sampled image; The training module is configured to extract features from the sampled images and train a pre-trained recognition model based on a multi-dimensional loss function. The second acquisition module is configured to acquire a target domain dataset; wherein the target domain dataset includes target domain library data and target domain query data; The generation module is configured to extract features from the target domain dataset using the pre-trained recognition model and input the features into the multi-dimensional inference module to generate corresponding pseudo-label samples. The optimization module is configured to perform self-training and optimization on the pre-trained recognition model based on the pseudo-label samples and the target domain library data, and determine the recognition model.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executed by the processor, characterized in that, When the processor executes the computer program, it implements the method as described in any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium, characterized in that, The non-transitory computer-readable storage medium stores computer instructions for causing the computer to perform the method according to any one of claims 1 to 6.

10. A computer program product, characterized in that, It includes one or more computer programs that, when executed by one or more processors, implement the method as described in any one of claims 1 to 6.