Method for identifying animal breed and estimating age from image
The method enhances breed identification and age estimation by using a CLIP model and joint detection, addressing inaccuracies in existing technologies through multimodal learning and environmental corrections, achieving precise and real-time results.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SOFTZEN
- Filing Date
- 2025-07-08
- Publication Date
- 2026-07-02
AI Technical Summary
Existing methods for identifying animal breeds and estimating age are inaccurate due to environmental factors and reliance on simple body measurements, leading to incomplete analysis and low accuracy, especially for mixed or rare breeds.
A method using a CLIP model for breed identification and a joint detection model for age estimation, combining image and text embeddings with deep learning-based pose estimation to analyze joint distances and ratios, correcting for environmental factors.
Improves breed identification accuracy and age estimation precision by leveraging multimodal learning and joint-based analysis, reducing errors from environmental and shooting factors, suitable for real-time applications.
Abstract
Description
How to identify animal breeds and estimate age from images
[0001] The present invention relates to an artificial intelligence (AI)-based image analysis technology, and more specifically, to a method for identifying the breed of an animal and estimating its age from an input animal image (or video frame).
[0002] With the recent advancements in computer vision and machine learning technologies, research on human face recognition, object detection, and behavior analysis is being widely conducted. Various technologies for analyzing images and videos of pets are also being developed, and their application areas are expanding, for instance, to identify pet breeds, perform face recognition, or detect abnormal behavior.
[0003] Many pet owners want to know the characteristics and precautions associated with the breeds of the animals they raise. Traditionally, the primary approach has been to classify pet breeds using general image classification models (e.g., CNN-based models) and simply provide information on breed-specific characteristics. However, this approach struggles with the wide variety of breeds and has difficulty accurately identifying mixed or rare breeds. Furthermore, the accuracy of classification models can be reduced by environmental factors such as lighting, background, and viewing angle.
[0004] Utilizing animal joint data allows for a more precise determination of body dimensions. In particular, joint detection technology enables the extraction of major joints (head, torso, legs, and tail) and the analysis of their positional relationships, regardless of the animal's posture. This joint information helps to assess an animal's growth and health status more accurately than simple body measurements (height, weight). Traditionally, methods involving dental condition or weight were used to estimate an animal's age, but these methods suffered from significant variations among individual animals and potential errors depending on the time of measurement.
[0005] First, relying solely on image classification makes it difficult to improve the accuracy of breed identification. Second, attempting to estimate age using simple dimensional information (e.g., weight, height) may result in low accuracy or require additional inspection equipment. Third, the age estimation results may not properly reflect the animal's actual growth variability or breed-specific characteristics.
[0006] The present invention has been devised to solve the aforementioned problems and aims to provide a method for identifying the breed of an animal (e.g., a pet) by analyzing its image and, furthermore, estimating the age of the animal.
[0007] Another objective of the present invention is to improve upon the incomplete analysis problems that arise from separating the breed identification process and the age estimation process in conventional technology or attempting to estimate age using only simple body measurement information, and to provide a method that enables simple and precise breed identification and age estimation using only image and video data.
[0008] For example, this technology provides for estimating an animal's age by comprehensively analyzing the similarity between an input image and text labels corresponding to each breed, and by comparing the distances and ratios between joints with standard body shape data. This approach has the advantage of enabling linkage with standard body shape databases by breed and age, and allows for more objective age estimation through joint-based analysis even when various external environmental factors (lighting, shooting angle, etc.) differ.
[0009] As a means for realizing the aforementioned task, the present invention provides a method for identifying animal breeds and estimating age from an input image performed by a computer device, comprising: a step of inputting an input image into a breed identification model and determining a breed identification result based on the similarity between an input image embedding and a text label embedding of a text label corresponding to each breed; a step of obtaining standard body shape data for the identified breed based on the breed identification result; a step of inputting the input image into a joint detection model to identify a plurality of joints and calculating the distance and ratio between the joints based on the names and locations of each joint; and a step of estimating the age of the animal by comparing the distance and ratio between the joints with the standard body shape data.
[0010] According to one feature, the breed identification model includes an image encoder that extracts an input image embedding from the input image and a text encoder that extracts text embeddings for the text labels corresponding to a plurality of breeds, and can determine the breed identification result by measuring the similarity between the input image embedding and the text embeddings.
[0011] According to another feature, the breed identification model is a CLIP (Contrastive Language-Image Pre-training) model having a multimodal learning structure, which generates image embeddings and text embeddings respectively using pre-trained parameters, calculates cosine similarity or similarity scores between the image embeddings and multiple breed text embeddings, and determines the breed identification result by using the text label of the text embedding having the highest similarity among them as the final breed.
[0012] According to another feature, the breed identification model inputs a plurality of training images and text labels corresponding to each of a plurality of breeds into an image encoder and a text encoder, respectively, and performs contrastive learning based on the similarity between the training image embedding of the training image and the text label embedding of the text label, thereby fine-tuning the model parameters to update the training image embedding so that the similarity between the text label embedding of the text label for the breed included in the training image and the training image embedding increases.
[0013] According to another feature, in the contrastive learning process, the text label may additionally include text data in which a description of the corresponding breed, or of a representative image of that breed, is added to the name of the breed included in each of the training images.
[0014] According to another feature, the joint detection model is a deep learning-based pose estimation model for identifying major joint locations including the torso, legs, head, and tail of an animal, and includes a DeepLabCut-based algorithm, calculates the coordinates of multiple joints within the input image, and can output the name and coordinate value of each joint.
[0015] According to another feature, the step of calculating the distance and ratio between the joints may include a preprocessing step of determining whether the input image is a side image of a target object and determining the direction between the foreleg joint and the hindleg joint.
[0016] According to another feature, the standard body shape data includes a table in which joint lengths and ratios measured for each age interval by breed are recorded, and the age of the animal can be estimated by deriving the closest age interval by measuring the Euclidean distance or cosine similarity between the distances and ratios between joints calculated from the joint detection model and the standard body shape data.
[0017] According to another feature, the input image is each frame of a video stream, and the step of estimating the age of the animal is performed for each of the input images, which is each frame of the video stream, and may include the step of determining the age of the animal by averaging or weighting the age estimation results for each frame.
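The per-frame aggregation described above can be illustrated with a minimal sketch. The function name `aggregate_frame_ages`, the sample values, and the choice of a per-frame confidence as the weight are assumptions made for illustration; the description states only that the per-frame results are averaged or weighted.

```python
def aggregate_frame_ages(ages_months, weights=None):
    """Combine per-frame age estimates (in months) into a single value.

    With no weights this is a plain average; with weights it is a
    weighted average (the weighting scheme itself is an assumption).
    """
    if not ages_months:
        raise ValueError("at least one per-frame estimate is required")
    if weights is None:
        weights = [1.0] * len(ages_months)
    if len(weights) != len(ages_months):
        raise ValueError("weights and estimates must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(a * w for a, w in zip(ages_months, weights)) / total


# Hypothetical estimates for three frames and a per-frame confidence.
frame_ages = [6.0, 6.0, 12.0]
frame_confidence = [0.9, 0.8, 0.3]

plain_average = aggregate_frame_ages(frame_ages)
weighted_average = aggregate_frame_ages(frame_ages, frame_confidence)
```

In this toy input the low-confidence outlier frame pulls the plain average up to 8 months, while the weighted result stays close to 6.9 months.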
[0018] According to another feature, the method for identifying the breed of an animal and estimating its age in the above image may further include the step of calculating a correction value based on ray tracing or camera parameters to correct for perspective distortion or errors due to the shooting angle for the positions of a plurality of joints calculated by the joint detection model, and applying the correction value to each joint position to reduce the estimation error that may occur when calculating the distance and ratio between the joints.
[0019] According to the present invention, an image of an animal (e.g., a pet) can be analyzed to identify the breed and estimate the age of the animal.
[0020] In addition, according to the present invention, the problem of incomplete analysis caused by separating the breed identification process and the age estimation process in the prior art or attempting to estimate age using only simple body measurement information is improved, and breed identification and age estimation can be performed simply and precisely using only image and video data.
[0021] FIG. 1 is a block diagram of a computer device for performing a method of identifying animal breeds and estimating age from an input image according to an embodiment of the present invention.
[0022] FIG. 2 is a flowchart of a method for determining the breed and age of an animal in an input image according to an embodiment of the present invention.
[0023] FIG. 3 is a schematic diagram of a breed identification model according to an embodiment of the present invention.
[0024] FIG. 4 is a diagram showing the results of breed identification according to an embodiment of the present invention.
[0025] FIG. 5 is a list of major joints for estimating the age of an animal according to an embodiment of the present invention.
[0026] FIG. 6 is an example diagram of the location of major joints for estimating the age of an animal according to an embodiment of the present invention.
[0027] FIGS. 7 and 8 are exemplary diagrams illustrating a process for estimating the age of an animal according to an embodiment of the present invention.
[0028] Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, the embodiments described below are merely intended to provide a detailed description sufficient for a person skilled in the art to easily practice the invention, and the scope of protection of the present invention is not limited to the embodiments described below. Meanwhile, in describing various embodiments of the present invention, the same reference numerals will be used for components having the same technical features.
[0029] FIG. 1 is a block diagram of a computer device for performing a method of identifying animal breed and estimating age in an input image according to an embodiment of the present invention. The computer device (100) shown in FIG. 1 is merely an example of a simplified configuration, and in an embodiment of the present invention, the computer device (100) may include other configurations for configuring a computer environment, and only some of the disclosed configurations may constitute the computer device (100).
[0030] The computer device (100) may include a processor (110), a memory (130), and a network unit (150). The processor (110) may be composed of one or more cores and may include a processor for data analysis and deep learning, such as a central processing unit (CPU), a general-purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computer device. The processor (110) may read a computer program stored in the memory (130) and perform data processing for machine learning according to an embodiment of the present invention. According to an embodiment of the present invention, the processor (110) may perform calculations for training a neural network, such as processing input data for training in deep learning (DL), extracting features from the input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor (110) can process the training of a network function. For example, the CPU and GPGPU can together process the training of a network function and data classification using the network function. In addition, in an embodiment of the present invention, processors of a plurality of computer devices can be used together to process the training of a network function and data classification using the network function. In addition, a computer program executed on a computer device according to an embodiment of the present invention may be a program executable on a CPU, GPGPU, or TPU.
[0031] According to an embodiment of the present invention, the memory (130) can store any type of information generated or determined by the processor (110) and any type of information received by the network unit (150). The memory (130) may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory, etc.), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk. The computer device (100) may operate in conjunction with web storage that performs the storage function of the memory (130) on the internet.
[0032] The network unit (150) according to an embodiment of the present invention may use various wired communication systems such as a Public Switched Telephone Network (PSTN), xDSL (x Digital Subscriber Line), RADSL (Rate Adaptive DSL), MDSL (Multi Rate DSL), VDSL (Very High Speed DSL), UADSL (Universal Asymmetric DSL), HDSL (High Bit Rate DSL), and a Local Area Network (LAN). In addition, the network unit (150) presented in this specification may use various wireless communication systems such as CDMA (Code Division Multiple Access), TDMA (Time Division Multiple Access), FDMA (Frequency Division Multiple Access), OFDMA (Orthogonal Frequency Division Multiple Access), SC-FDMA (Single Carrier-FDMA), and other systems. In the present invention, the network unit (150) can be configured regardless of the communication mode, such as wired or wireless, and can be configured as various communication networks, such as a Personal Area Network (PAN) or a Wide Area Network (WAN). In addition, the network may be the known World Wide Web (WWW), and may utilize wireless transmission technologies used for short-range communication, such as Infrared Data Association (IrDA) or Bluetooth. Through the network unit (150) of the present invention, the computer device (100) can communicate with other computer devices, for example with a data storage where data is stored, a cloud data storage, or a cloud computer system that provides computing power. The technologies described in this specification can be used not only in the networks mentioned above but also in other networks.
[0033] FIG. 2 is a flowchart of a method for determining the breed and age of an animal in an input image according to an embodiment of the present invention.
[0034] A computer device (100) can input an input image into a breed identification model and determine a breed identification result based on the similarity between the input image embedding and the text label embedding of a text label corresponding to each breed (S100).
[0035] FIG. 3 is a schematic diagram of a breed identification model according to an embodiment of the present invention. The breed identification model may be a CLIP (Contrastive Language-Image Pre-training) model having a multimodal learning structure. The CLIP model simultaneously trains two neural networks, an image encoder and a text encoder, and is trained so that their embedding vectors have high similarity when the image and text represent the same object. Through this multimodal learning structure, a computer device can compare image and text labels in a single common representation space.
[0036] The computer device (100) applies pre-learned parameters to the image encoder and text encoder of the CLIP model to convert the input image and text labels into embedding vectors, respectively. For example, if the input image is a photograph of a "Siberian Husky," the image encoder compresses the image into a feature vector to generate an image embedding. The text encoder converts multiple breed labels, such as "Siberian Husky," "Shiba Inu," and "Persian Cat," into vectors to generate a text embedding.
[0037] At this time, the pre-trained parameters provided by the CLIP model are trained with large-scale image-text pairs and possess generalized expressive power capable of handling various breeds and shooting situations. Therefore, the computer device (100) can improve breed identification accuracy with only a relatively small amount of additional training (fine-tuning).
[0038] The computer device (100) first provides an input image (10) to an image encoder (220) of a breed identification model (200). At this time, the input image may be a captured photo file, a video frame, or other digital image data. The computer device (100) performs preprocessing, such as background removal, resolution adjustment, and color correction, if necessary, and then inputs the image into the image encoder (220).
[0039] The image encoder (220) may have a multilayer neural network structure (Convolutional Neural Network, Transformer, etc.) and maps the input image (10) into a high-dimensional embedding. For example, the image encoder (220) generates an input image embedding (11) by summarizing geometric, color, and texture information through a feature extraction layer and compressing it into a vector of a certain dimension in a subsequent layer. This embedding numerically represents the main features of the image and is then ready to be compared with the text label embedding generated by the text encoder (210).
[0040] The computer device (100) has text labels (20) corresponding to multiple breeds, and these text labels (20) may be simple breed names (e.g., "Siberian Husky") or complex text including characteristics and descriptions of the breed. The computer device (100) inputs the text labels (20) into the text encoder (210) of the breed identification model (200) to generate text label embeddings (21).
[0041] The text encoder (210) may also utilize a multi-layer neural network and converts natural language sentences into embedding vectors. For example, the CLIP model includes a token-level encoder that processes word embeddings and sentence embeddings, and converts text labels (20) containing specific breed names into vectors of a certain dimension.
[0042] The computer device (100) calculates the similarity between the input image embedding (11) output from the image encoder (220) and the text label embedding (21) output from the text encoder (210). For example, cosine similarity can be calculated for embedding vectors mapped to the same dimension within the same model, or a similarity score based on a dot product can be obtained.
[0043] The computer device (100) finds the text label (20) having the highest similarity value by comparing each text label embedding (21), corresponding to one or more text labels (20) per breed, with the input image embedding (11). For example, if 10 or more dog breed labels and 6 or more cat breed labels have been learned in advance, the computer device (100) compares all breed label embeddings (21) with the input image embedding (11) to find the highest similarity.
[0044] The computer device (100) determines the breed name of the text label (20) having the highest value among the calculated similarity values as the final breed identification result. For example, if the similarity for the label "Welsh Corgi" is 0.95, the label "Beagle" is 0.70, and the label "Labrador Retriever" is 0.60, the computer device determines "Welsh Corgi" as the breed identification result.
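The similarity comparison and label selection in the preceding paragraphs can be sketched as follows. This is a minimal illustration and not the CLIP implementation itself: the three-dimensional vectors stand in for real encoder outputs, and the names `cosine_similarity` and `identify_breed` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_u * norm_v)


def identify_breed(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (assumed values, far smaller than real encoder outputs).
label_embeddings = {
    "Welsh Corgi": [0.9, 0.1, 0.2],
    "Beagle": [0.4, 0.8, 0.1],
    "Labrador Retriever": [0.1, 0.3, 0.9],
}
image_embedding = [0.8, 0.2, 0.3]

best_label, scores = identify_breed(image_embedding, label_embeddings)
```

With these toy vectors the "Welsh Corgi" label scores highest, mirroring the selection rule in the example above.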
[0045] This approach offers higher accuracy than conventional CNN-based simple classification. This is because the model is trained using contrastive learning for image embeddings and text label embeddings, enabling it to richly understand the multimodal relationships between images and text.
[0046] The breed identification model (200) used in the present invention inputs a training image and a corresponding text label (20) into the image encoder (220) and the text encoder (210), respectively, and updates the model parameters through contrastive learning based on the similarity between the two embedding vectors. For example, the model is trained to have high similarity for image-text pairs that match the corresponding breed and low similarity for pairs that do not. This process is called fine-tuning, and if training images and text labels covering various breeds and varied environments (lighting, angle, background, etc.) are provided, high accuracy can be achieved in actual service environments.
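One common way to express this contrastive objective is a symmetric cross-entropy over a batch similarity matrix, in which entry (i, j) is the similarity between training image i and text label j and the matching pairs lie on the diagonal. The sketch below assumes that formulation and a temperature of 0.07; neither is specified in this description, and the function names are hypothetical.

```python
import math


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss (assumed formulation)."""
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [[logits[r][c] for r in range(n)] for c in range(n)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


# Matching pairs score high on the diagonal in the first matrix only.
aligned = [[0.9, 0.1], [0.2, 0.8]]
misaligned = [[0.1, 0.9], [0.8, 0.2]]

loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
```

Minimizing this loss raises the similarity of matching image-text pairs and lowers that of mismatched pairs. Note that the diagonal assumption treats every other label in the batch as a negative, so a batch containing two images of the same breed would need separate handling.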
[0047] The text labels (20) used in the contrastive learning process may be not only simple breed names, but also sentences containing breed characteristics such as "Welsh Corgis have short legs and are small in size." Alternatively, text data describing a representative image of the breed may be added. Through this, the computer device (100) forms a more detailed multimodal embedding space between the image encoder (220) and the text encoder (210).
[0048] FIG. 4 is a diagram showing the results of breed identification according to an embodiment of the present invention.
[0049] The example in FIG. 4 shows the computer device (100) receiving an image of a pet (e.g., dog, cat), analyzing it through the breed identification model (200), and then displaying the identified breed name and the probability (similarity) of being classified as that breed. An image of a dog identified as a Bull Terrier is shown at the top of the figure, and an image of a cat identified as a Korean Shorthair is shown at the bottom. Each result is output as a "result" array together with a message of the form "message": "Breed analysis complete."
[0050] The computer device (100) receives an image of a dog or cat from a user or an external system. Before the input image is transmitted to the breed identification model (200) in the form of an input image (10), it may undergo preprocessing such as resolution adjustment, background removal, and color normalization. This preprocessing process helps the image encoder (220) generate image embeddings (11) more stably.
[0051] The computer device (100) inputs a preprocessed input image (10) into the image encoder (220) to generate an input image embedding (11) represented as a high-dimensional vector. The text encoder (210) converts multiple pre-registered breed labels (e.g., "Bull Terrier", "Jack Russell Terrier", "Doberman", etc.) into vectors. The computer device (100) calculates a cosine similarity or other similarity score between the generated input image embedding (11) and each text label embedding. As can be seen in the example at the top of FIG. 4, if the "Bull Terrier" label shows the highest similarity of 100.00%, the image is finally identified as the "Bull Terrier" breed. As shown in the example in FIG. 4, the computer device (100) can output the analysis results in JSON format or as a text message. In the example, "message" is displayed as "Breed analysis complete.", and breed names and similarity (accuracy) values are listed in the "result" array.
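A JSON payload of the kind shown in FIG. 4 could be assembled as in the sketch below. The keys "message" and "result" come from the description; the per-entry keys "breed" and "similarity", the `top_k` parameter, and the similarity values other than the 100.00% example are assumptions for illustration.

```python
import json


def format_breed_result(scores, top_k=3):
    """Serialize similarity scores (0.0 to 1.0) as a JSON string."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    payload = {
        "message": "Breed analysis complete.",
        "result": [
            # Similarity is reported as a percentage with two decimals.
            {"breed": name, "similarity": round(score * 100, 2)}
            for name, score in ranked
        ],
    }
    return json.dumps(payload, ensure_ascii=False)


payload = format_breed_result(
    {"Bull Terrier": 1.0, "Jack Russell Terrier": 0.12, "Doberman": 0.05}
)
decoded = json.loads(payload)
```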
[0052] The breed identification model of the present invention learns a wide range of multimodal characteristics (relationships between image embeddings and text embeddings) for various breeds by training on a large number of image-text pairs in the initial stage. However, when new breed labels are introduced or images of rare or mixed breeds (mixed-breed dogs and cats) are added later, it is difficult to guarantee optimal classification accuracy using only the existing model. In such cases, instead of completely discarding the previously trained model, its accumulated knowledge can be reused through transfer learning, thereby significantly reducing training time and data requirements while improving accuracy.
[0053] The computer device (100) is based on a breed identification model (image encoder, text encoder) that has already been pre-trained with a large dataset. For example, since the CLIP model has already learned how to generate image-text embeddings based on a wide range of image-text pairs, it possesses a relatively generalized ability to represent new breeds.
[0054] Additional training data is then secured, such as new breed labels, images of rare breeds, and examples of mixed-breed dogs and cats. This data must include an appropriate number of images (taken from various angles and environments) and the corresponding breed names (or comprehensive descriptions) in text format.
[0055] The computer device (100) maintains most of the layers of the existing model while fine-tuning only some parameters (e.g., upper layers, new breed label embeddings, etc.) to match the training data. Through this process, the model parameters are updated so that the "new breed A" image has high similarity to the "breed A" text label.
[0056] If necessary, existing layers (especially low-level feature extraction layers) can be frozen, and only the final embedding calculation or classification layer can be retrained.
[0057] On the text encoder side, new breed names (or additional descriptive text) can be tokenized and the encoder retrained so that the corresponding embeddings are properly mapped into the model's embedding space.
[0058] The computer device (100) evaluates the fine-tuned model using validation data (images and text related to the new breeds). When the accuracy reaches the target standard, the updated model is deployed to the actual breed identification service.
[0059] The computer device (100) can obtain standard body shape data for the identified breed based on the breed identification result (S200).
[0060] The computer device (100) determines the breed of the input animal image using a breed identification model from the previous step (S100). For example, a label such as "Siberian Husky" or "Persian cat" can be identified as the final breed. The computer device (100) can use this breed identification result as a key to look up standard body shape data.
[0061] The computer device (100) holds a standard body shape database stored in memory (internal DB) or on an external server. This database records information such as joint lengths, inter-joint ratios, and weight ranges, categorized by breed and age.
[0062] The database is configured to use the label corresponding to the breed identification result (e.g., "Bulldog", "Scottish Fold") as the DB identifier, or to search for the corresponding table using the label string as the key.
[0063] After confirming the breed identification result, the computer device (100) queries standard body shape data for the corresponding breed (S200). At this time, the computer device (100) loads joint lengths, joint ratios, growth curves, etc., recorded for each age interval, using the breed label as an index.
[0064] The computer device (100) retrieves a table in which standard body shape values are recorded for each month interval of the breed (e.g., 2 months, 3 months, 6 months, 12 months, 24 months, etc.). The standard body shape table may include average values, standard deviations, maximum and minimum ranges, etc., so as to be referenced when calculating Euclidean distance or cosine similarity.
[0065] The computer device (100) stores the retrieved standard body shape data in internal memory, a cache, or a separate temporary storage space. This data is then compared with the distances and ratios between joints calculated from a joint detection model in a subsequent step (S400) and used to estimate the age of the animal. For example, the computer device (100) can determine the closest age range in months by calculating the Euclidean distance or cosine similarity between the joint ratio values within the standard body shape data and the actual measured joint ratio values.
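The lookup and nearest-interval matching described in steps S200 and S400 can be sketched as follows. The table contents, the two ratio features, and the names `STANDARD_BODY_SHAPE` and `estimate_age_months` are hypothetical; a real table would hold measured values per breed and age interval, possibly with standard deviations and ranges.

```python
import math

# Hypothetical standard body shape table:
# breed label -> {age in months: [foreleg/body ratio, hindleg/body ratio]}
STANDARD_BODY_SHAPE = {
    "Welsh Corgi": {
        2: [0.55, 0.30],
        6: [0.70, 0.38],
        12: [0.80, 0.42],
        24: [0.82, 0.43],
    },
}


def estimate_age_months(breed, measured_ratios):
    """Return the age interval whose standard ratios are closest (Euclidean)."""
    table = STANDARD_BODY_SHAPE.get(breed)
    if table is None:
        raise KeyError(f"no standard body shape data for breed: {breed}")
    return min(table, key=lambda age: math.dist(measured_ratios, table[age]))


estimated_age = estimate_age_months("Welsh Corgi", [0.71, 0.37])
```

Euclidean distance is used here; cosine similarity, also mentioned above, could be substituted by selecting the interval with the highest similarity instead of the smallest distance.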
[0066] The computer device (100) can input an input image into a joint detection model to identify multiple joints and calculate the distance and ratio between joints based on the name and location of each joint (S300).
[0067] The computer device (100) automatically detects major joints in an animal image through a joint detection model. The joint detection model includes a deep learning-based pose estimation algorithm (e.g., DeepLabCut) and calculates the positions of the animal's torso, legs, head, tail, etc., in the form of coordinate values. For example, specific points such as the foreleg joint, hind leg joint, neck, and tail starting point can be tracked.
[0068] This model is configured to learn a number of animal images and ground truth coordinates of the corresponding joints in advance, and to infer the position of each joint when an arbitrary input image (or video frame) is given. A computer device (100) obtains the coordinates of multiple joints of an animal based on the inference results of this model. FIG. 5 is a list of major joints for estimating the age of an animal according to an embodiment of the present invention.
[0069] The computer device (100) can determine whether the input image properly captures the side of the animal before inputting the image into the joint detection model. For example, it checks whether the head, torso, and legs are aligned along a line. Because the accuracy of joint detection may decrease when the animal is viewed from a perspective other than the side, such as the front or rear, side-view images are selected during preprocessing.
[0070] The present invention distinguishes between the forelegs and hindlegs using bounding boxes or keypoints, determines the direction between the foreleg joints and the hindleg joints, and thereby determines the left-right orientation of the animal. Inputting an image preprocessed in this way into the joint detection model can improve the model's inference accuracy and consistency.
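A rough sketch of this preprocessing check is given below. It assumes image coordinates in which x increases to the right, and it uses a simple heuristic for the side-view test (horizontal separation of the legs relative to body height) with an arbitrary threshold. Both the heuristic and the function names are assumptions and are not taken from this description.

```python
def facing_direction(foreleg_xy, hindleg_xy):
    """Infer left-right orientation from foreleg and hindleg x-coordinates."""
    dx = foreleg_xy[0] - hindleg_xy[0]
    if dx > 0:
        return "right"  # forelegs are to the right of the hindlegs
    if dx < 0:
        return "left"
    return "unknown"


def looks_like_side_view(foreleg_xy, hindleg_xy, body_height, min_ratio=0.5):
    """Heuristic: in a side view the legs are far apart horizontally."""
    if body_height <= 0:
        raise ValueError("body_height must be positive")
    separation = abs(foreleg_xy[0] - hindleg_xy[0])
    return separation / body_height >= min_ratio


# Hypothetical keypoints in pixels.
foreleg = (300.0, 200.0)
hindleg = (120.0, 210.0)

direction = facing_direction(foreleg, hindleg)
is_side_view = looks_like_side_view(foreleg, hindleg, body_height=150.0)
```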
[0071] The computer device (100) provides the preprocessed input image to the joint detection model and obtains from the model coordinate values representing the positions of multiple joints. Each joint has a unique name (e.g., nose, foreleg, hind leg, base of ear, tail, etc.) along with (x, y) coordinates on the image.
[0072] DeepLabCut is an algorithm frequently used for animal pose estimation that can accurately identify body parts without separate markers for each joint. A computer device (100) receives a 'joint probability map' or 'keypoint coordinates' from the model and selects only joints with a confidence score above a certain standard to determine their names and coordinate values.
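The confidence-based selection can be sketched as follows. The code does not call DeepLabCut; it assumes the model output has already been converted into a list of (name, x, y, confidence) tuples, and the 0.6 threshold and the function name are illustrative assumptions.

```python
def select_confident_joints(detections, min_confidence=0.6):
    """Keep, per joint name, the highest-confidence candidate above a threshold.

    detections: iterable of (name, x, y, confidence) tuples.
    Returns a dict mapping joint name to (x, y).
    """
    best = {}
    for name, x, y, confidence in detections:
        if confidence < min_confidence:
            continue  # discard unreliable candidates
        if name not in best or confidence > best[name][2]:
            best[name] = (x, y, confidence)
    return {name: (x, y) for name, (x, y, _conf) in best.items()}


# Hypothetical candidates; "tail_base" has two candidates.
detections = [
    ("nose", 412.0, 180.5, 0.97),
    ("tail_base", 95.0, 210.0, 0.88),
    ("tail_base", 140.0, 260.0, 0.61),
    ("front_left_paw", 350.0, 400.0, 0.35),
]
joints = select_confident_joints(detections)
```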
[0073] The computer device (100) calculates the length or ratio between major parts using the detected multiple joint coordinates. For example, various measurements are possible, such as the distance between the foreleg joint and the hind leg joint, the length from the nose to the tail, and the leg length relative to the shoulder height.
[0074] The computer device (100) calculates the Euclidean distance between the (x, y) coordinates of each joint to quantify the length of the body part. The computer device (100) compares the ratio derived in this way with standard body shape data in the subsequent step (S400), the "age estimation" process, and uses it for estimating the age of the animal in months, etc.
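A minimal sketch of the length and ratio calculation, assuming joints are stored as a dictionary from name to (x, y) pixel coordinates. The joint names and helper function names are placeholders chosen for this example.

```python
import math


def joint_distance(joints, a, b):
    """Euclidean distance in pixels between two named joints."""
    return math.dist(joints[a], joints[b])


def length_ratio(joints, segment_a, segment_b):
    """Ratio of the length of segment_a to the length of segment_b.

    Each segment is a pair of joint names. Because this is a ratio of two
    pixel lengths, it does not depend on the overall image scale.
    """
    return joint_distance(joints, *segment_a) / joint_distance(joints, *segment_b)


joints = {
    "back_base": (0.0, 0.0),
    "front_left_thigh": (3.0, 4.0),
    "front_left_paw": (3.0, 14.0),
}
a_len = joint_distance(joints, "front_left_thigh", "back_base")       # 5.0
b_len = joint_distance(joints, "front_left_thigh", "front_left_paw")  # 10.0
ratio = length_ratio(
    joints,
    ("front_left_thigh", "back_base"),
    ("front_left_thigh", "front_left_paw"),
)
```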
[0075] FIG. 6 is an example diagram of the location of major joints for estimating the age of an animal according to an embodiment of the present invention.
[0076] The computer device (100) first acquires a photograph or video frame of an animal as an input image. Then, the computer device (100) performs preprocessing to remove backgrounds or unnecessary areas and adjusts the image resolution and aspect ratio to suit a pose estimation model. When the preprocessed image is input into a joint detection model (e.g., DeepLabCut), the model predicts the location of each part of the animal's body.
[0077] The joint detection model calculates multiple key points (joint candidates), and the computer device (100) selects a location with a high confidence score among these candidates to determine the (x, y) coordinates. In the example of FIG. 6, reference number 310 is the back base, which is joint number 20; reference number 320 is the front left thigh, which is joint number 25; reference number 330 is the front left paw, which is joint number 27; reference number 410 is the back end, which is joint number 21; reference number 420 is the back left thigh, which is joint number 32; and reference number 430 is the back left paw, which is joint number 31. For example, the back base (310), which is joint number 20, corresponds to the dorsal base of the animal, and the front left thigh (320), which is joint number 25, refers to the thigh area of the front leg. The front left paw (330), back left thigh (420), and back left paw (430) are identified in the same way. The model calculates the coordinates of these joints in the form of (x, y) on the image, and the computer device (100) records the coordinates in an internal data structure.
[0078] The computer device (100) clearly classifies the locations of major joints in the side image of the animal. In a subsequent step, the growth status, body shape characteristics, age, etc. of the animal can be estimated by calculating the distance and ratio between each joint.
[0079] In the joint coordinate extraction and length and ratio calculation steps, if perspective distortion is a concern, a camera parameter-based correction value is applied. When the computer device (100) detects the joints of an animal, it recognizes that a difference between the actual length and the length measured in image coordinates may occur due to the camera's shooting angle, optical distortion, distance, etc. For example, under perspective distortion a body part appears longer than it actually is when the animal is close to the camera and, conversely, shorter when photographed from a distant position. Such distortion causes errors in calculating the distance and ratio between joints.
[0080] The computer device (100) can obtain intrinsic parameters such as the camera's focal length, sensor size, and lens distortion coefficient, and extrinsic parameters such as the camera's position and orientation, in advance, or estimate them in real time at the time of shooting. These parameters can be calculated using a ray tracing technique or a camera calibration algorithm. The computer device (100) uses a ray tracing algorithm to estimate the distances and angles of the joints in actual 3D space and traces the process of projection onto the image plane in reverse. Through this, it determines how much each joint coordinate is distorted in the image and calculates a correction value.
[0081] The computer device (100) calculates a distortion amount (δx, δy) by applying camera parameters or ray tracing results to the (x, y) coordinates of the multiple joints calculated by the joint detection model. For example, if the image is taken at an angle, the actual distance between joints may appear shorter in the image, so a "perspective correction coefficient" is multiplied, or a mathematical transformation considering the "camera position and angle" is performed. The computer device (100) resets the coordinates by adding or subtracting the correction value (δx, δy) corresponding to each joint coordinate (x, y). Through this, a joint position that appears farther or closer than it actually is due to perspective distortion and camera tilt is corrected to be closer to the actual measurement.
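The following sketch shows one very simplified way a per-joint correction value (δx, δy) could be computed and applied. It is not the ray tracing or calibration procedure of the description. It assumes a weak-perspective camera rotated about the vertical axis by a known yaw angle, so that horizontal offsets from a reference column `x_ref` are foreshortened by cos(yaw). The function names, the yaw input, and `x_ref` are hypothetical.

```python
import math


def yaw_correction(joints, yaw_deg, x_ref):
    """Return per-joint (dx, dy) offsets that undo horizontal foreshortening.

    Assumption: horizontal offsets from x_ref were shrunk by cos(yaw),
    so they are stretched back by 1 / cos(yaw). Valid only for |yaw| < 90.
    """
    scale = 1.0 / math.cos(math.radians(yaw_deg))
    return {
        name: ((x - x_ref) * (scale - 1.0), 0.0)
        for name, (x, y) in joints.items()
    }


def apply_correction(joints, deltas):
    """Add each joint's (dx, dy) correction to its (x, y) coordinates."""
    return {
        name: (x + deltas[name][0], y + deltas[name][1])
        for name, (x, y) in joints.items()
    }


observed = {"front_left_thigh": (100.0, 50.0), "back_left_thigh": (150.0, 50.0)}
deltas = yaw_correction(observed, yaw_deg=60.0, x_ref=100.0)
corrected = apply_correction(observed, deltas)
```

With a 60 degree yaw the observed 50-pixel separation is stretched back to roughly 100 pixels. A real system would more likely derive the correction from calibrated intrinsic and extrinsic parameters.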
[0082] The computer device (100) calculates the distance and ratio between joints closer to the actual animal body shape by using the coordinate values corrected as above. For example, if the actual distance between the animal's front legs and hind legs is 40 cm, before correction it could be misidentified as 35 cm or 45 cm due to the shooting angle, but after correction it can be calculated as a value closer to the actual value (approximately 40 cm). This step helps to perform the subsequent comparison process with standard body shape data more accurately.
[0083] The computer device (100) can estimate the age of the animal by comparing the distance and ratio between the joints with the standard body shape data (S400).
[0084] The computer device (100) obtains standard body shape data for the identified breed based on the breed identification result in the preceding step (S200). This standard body shape data includes a table in which joint lengths or inter-joint ratios are recorded for each breed's age interval (e.g., 3 months, 6 months, 12 months, etc.), and average values and deviations (standard deviations), etc., may be specified for each age interval.
[0085] For example, as shown in FIG. 7, the computer device (100) uses coordinates calculated by the joint detection model to measure the length (B) between the front left thigh (320) and the front left paw (330) and the length (A) from the front left thigh (320) to the back base (310). As can be seen in the drawing, the ratio of A to B is approximately 0.91 in the puppy stage, whereas a value such as 1.21 may appear in the adult stage. Similarly, as shown in FIG. 8, the length (B) between the back left thigh (420) and the back left paw (430) and the length (A) from the back left thigh (420) to the back end (410) can be measured; the ratio of A to B is approximately 1.18 in the puppy stage, while a value such as 1.65 may appear in the adult stage.
[0086] The computer device (100) divides or combines the measured distances to calculate the animal's characteristic body shape ratio (A to B). For example, it can calculate the foreleg length (B) relative to the length (A) from the thigh to the back, or obtain the ratio between the foreleg and the hind leg.
[0087] The computer device (100) compares the inter-joint distance and ratio data with each month interval of the standard body shape data. At this time, Euclidean distance or cosine similarity, etc., may be used.
[0088] The computer device (100) determines the closest age range in the standard body shape data based on the calculated distance or similarity value. For example, it calculates whether the current joint ratio is closer to the puppy stage (0.9 to 1.1) or the adult stage (1.2 to 1.6) in the drawing, and reflects the result as the final age estimate. In some implementations, more detailed data for segments such as 3 months, 6 months, and 12 months may be used to obtain a more precise result, such as "approximately 5 to 6 months old."
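The comparison with standard body shape data can be sketched as a nearest-neighbor lookup. In this example the reference vectors reuse the foreleg and hind leg A-to-B ratios quoted in paragraph [0085] (0.91 and 1.18 for the puppy stage, 1.21 and 1.65 for the adult stage). The two-stage table layout, the labels, and the function names are assumptions for illustration, and a real table would hold finer month intervals per breed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two ratio vectors (ignores magnitude)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_age_interval(measured, table):
    """Return the label whose reference vector has the smallest Euclidean distance."""
    return min(table, key=lambda label: math.dist(measured, table[label]))


# (foreleg A/B ratio, hind leg A/B ratio) per stage; hypothetical table layout.
REFERENCE = {
    "puppy": (0.91, 1.18),
    "adult": (1.21, 1.65),
}

young = closest_age_interval((0.95, 1.25), REFERENCE)
grown = closest_age_interval((1.15, 1.60), REFERENCE)
```

Euclidean distance is used for the decision here because cosine similarity discards vector magnitude, which may make it less discriminative when all ratios grow together with age.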
[0089] In another embodiment of the present invention, the input image may be a frame of a video stream rather than a single photograph. In this case, the computer device (100) calculates the distance and ratio between joints for each frame and compares them with the standard body shape data to calculate an age estimate per frame. Once all frame results are collected, the computer device (100) determines the final age by averaging or weighting them. For example, since the joint ratio calculated for each frame of a video of an animal may differ slightly, multiple estimates are combined to obtain a consistent final result. This also mitigates errors caused by the shooting angle or temporary changes in the animal's posture.
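A minimal sketch of combining per-frame estimates by a plain or weighted average. The function name is hypothetical, and the choice of weights (for instance, the mean keypoint confidence of each frame) is an assumption, since the description does not state how the weights are obtained.

```python
def aggregate_age(frame_estimates, weights=None):
    """Combine per-frame age estimates (in months) into one value.

    With no weights this is the arithmetic mean; otherwise a weighted mean.
    """
    if not frame_estimates:
        raise ValueError("at least one frame estimate is required")
    if weights is None:
        weights = [1.0] * len(frame_estimates)
    if len(weights) != len(frame_estimates):
        raise ValueError("weights and estimates must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(a * w for a, w in zip(frame_estimates, weights)) / total


plain = aggregate_age([5.0, 6.0, 7.0])
weighted = aggregate_age([4.0, 8.0], weights=[3.0, 1.0])
```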
[0090] The present invention compares the similarity between input image embeddings and text label embeddings through a breed identification model having a multimodal learning structure such as CLIP (Contrastive Language-Image Pre-training). This enables improved identification accuracy for a wide range of breeds, including rare breeds and mixed-breed dogs and cats, compared to simple CNN classification methods.
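The embedding comparison can be illustrated without loading an actual CLIP model. The sketch below assumes the image encoder and text encoder have already produced embedding vectors; the three-dimensional toy vectors, the prompt strings, and the function names are made up for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def identify_breed(image_embedding, label_embeddings):
    """Pick the text label whose embedding is most similar to the image embedding.

    Returns the best label and the cosine similarity score for every label.
    """
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(emb)))
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for encoder outputs (hypothetical values).
labels = {
    "a photo of a poodle": [0.9, 0.1, 0.0],
    "a photo of a beagle": [0.0, 1.0, 0.3],
}
best_label, scores = identify_breed([1.0, 0.2, 0.0], labels)
```

Because the label set is just a dictionary of text embeddings, a new breed can be added by encoding one more text label, which matches the zero-shot style of use that a CLIP-type model allows.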
[0091] The present invention enables more precise and consistent body analysis than manual methods by automatically detecting the positions of multiple joints of an animal using a deep learning pose estimation algorithm (e.g., DeepLabCut). In particular, it reduces errors caused by environmental and shooting factors through the determination of side image conditions and correction based on camera parameters.
[0092] The present invention can perform the same analysis not only on simple still images but also on a frame-by-frame basis of video streams. By averaging or weighting the results for each frame, real-time monitoring and precise identification are possible even in actual environments where animals are moving.
[0093] The present invention is configured to retrieve standard body shape data specialized for a specific breed (joint lengths and ratios by age interval, standard deviation, etc.) based on breed identification results. Since it can reflect growth curves that differ for each breed, the accuracy of age estimation is further enhanced.
[0094] This invention derives the closest age range in months by comparing the distances and ratios calculated by a joint detection model with standard body shape data using Euclidean distance, cosine similarity, or the like. This significantly improves precision compared to existing methods that estimate age based on simple weight and height.
[0095] This invention offers high flexibility because, even when individual growth variations exist, it can determine age ranges through appropriate correction (contrastive learning) that takes breed-specific standard deviations into account when necessary.
[0096] The present invention enables real-time monitoring of a pet's health status or behavioral problems by combining age estimation with joint detection models and behavioral pattern analysis models (e.g., AI for detecting abnormal behavior). This can be applied to various fields, such as veterinary clinics, pet care services, and animal shelters.
[0097] The 3D avatar model (when additionally implemented) or AI analysis results generated in the present invention have great potential to be expanded into next-generation services such as metaverse platforms and NFT-based digital assetization. For example, a personalized digital twin of a pet can be created and implemented in a virtual environment, or growth process data can be recorded and managed in the form of an NFT.
[0098] Furthermore, even if new breed labels (text labels) and new standard body type data (e.g., unique mixed-breed dogs or cats) are added, they can be easily updated through the AI model structure of the present invention (transfer learning, additional fine-tuning, etc.). This enables continuous industrial and research applications.
[0099] Users can quickly receive results for pet breed identification and age estimation using only a smartphone camera or a simple image upload. This is feasible without specialized equipment or complex procedures, making it suitable for mass services.
[0100] This invention presents a comprehensive pet body shape analysis solution by combining AI models (breed identification, joint detection), standard body shape data, and image correction techniques. Unlike existing single-purpose systems (breed classification only, or weight-based age estimation only), this solution simultaneously achieves efficient automation and high accuracy.
[0101] As the number of pets increases, data-driven customized services are becoming increasingly important in industries such as medical and healthcare, insurance, and feed and supplies. This invention meets these market demands and provides differentiated technological value compared to existing solutions.
[0102] Although embodiments of the present invention have been described above, it is understood that those skilled in the art can make various modifications without departing from the scope of the claims of the present invention.
[0103] This invention is a technology that identifies animal breeds from input images and estimates age based on joint information, offering significant practicality in that it enables automated, image-based biometric analysis. In particular, by combining a multimodal CLIP model with a joint detection algorithm, both breed identification accuracy and age estimation precision can be improved. This invention can be applied to various industrial fields, such as pet healthcare, insurance, shelter management, and feed recommendations, and also boasts high public appeal as it allows for service delivery via mobile devices. Furthermore, the analysis structure based on perspective correction and standard body shape data provides stable inference results even in real-world environments, enabling future expansion into advanced services such as digital twins and behavioral analysis.
Claims
1. A method for identifying the breed of an animal and estimating its age from an input image, performed by a computer device, the method comprising:
a step of inputting the input image into a breed identification model and determining a breed identification result based on the similarity between the input image embedding and the text label embedding of a text label corresponding to each breed;
a step of obtaining standard body shape data including a table in which joint lengths and ratios measured for each age interval are recorded for the breed identified based on the breed identification result;
a step of inputting the input image into a joint detection model to identify multiple joints and calculating the distance and ratio between joints based on the name and location of each joint;
a step of calculating a ray tracing-based or camera parameter-based correction value to correct for perspective distortion or errors due to the shooting angle in the positions of the plurality of joints calculated by the joint detection model, and applying the correction value to each joint position to reduce the estimation error that may occur when calculating the distance and ratio between the joints; and
a step of estimating the age of the animal by measuring the Euclidean distance or cosine similarity between the distance and ratio between the joints and the standard body shape data to derive the closest age range in months,
wherein the breed identification model is a CLIP (Contrastive Language-Image Pre-training) model having a multimodal learning structure comprising an image encoder for extracting input image embeddings from the input image and a text encoder for extracting text embeddings for text labels corresponding to a plurality of breeds, wherein the model generates image embeddings and text embeddings respectively using pre-trained parameters, calculates cosine similarity or a similarity score between the image embeddings and the plurality of breed text embeddings, and determines the breed identification result by selecting the text label of the text embedding having the highest similarity among them as the final breed, and
wherein the breed identification model is tuned by inputting a plurality of training images and text labels corresponding to each of the plurality of breeds into the image encoder and the text encoder respectively, and performing contrastive learning based on the similarity between the training image embeddings of the training images and the text label embeddings of the text labels, thereby updating model parameters to increase the similarity between the text label embeddings of the text labels for breeds included in the training images and the training image embeddings.
2. The method of claim 1, wherein, in the contrastive learning process, the text label further includes, in addition to the name of the breed included in each of the training images, text data to which a representative image or description of the corresponding breed is added.
3. The method of claim 1, wherein the joint detection model is a deep learning-based pose estimation model for identifying major joint locations including the torso, legs, head, and tail of an animal, includes a DeepLabCut-based algorithm, and calculates the coordinates of a plurality of joints within the input image and outputs the name and coordinate value of each joint.
4. The method of claim 1, wherein the step of calculating the distance and ratio between the joints includes a preprocessing step of determining whether the input image is a side image of a target object and determining the direction between the foreleg joint and the hind leg joint.
5. The method of claim 1, wherein the input image is each frame of a video stream, the step of estimating the age of the animal is performed for each of the input images that are the frames of the video stream, and the method further includes a step of determining the age of the animal by averaging or weighting the age estimation results for each frame.
6. A computer program stored on a computer-readable storage medium, wherein the computer program performs a method for identifying the breed of an animal and estimating its age from an input image, the method comprising:
a step of inputting the input image into a breed identification model and determining a breed identification result based on the similarity between the input image embedding and the text label embedding of a text label corresponding to each breed;
a step of obtaining standard body shape data including a table in which joint lengths and ratios measured for each age interval are recorded for the breed identified based on the breed identification result;
a step of inputting the input image into a joint detection model to identify multiple joints and calculating the distance and ratio between joints based on the name and location of each joint;
a step of calculating a ray tracing-based or camera parameter-based correction value to correct for perspective distortion or errors due to the shooting angle in the positions of the plurality of joints calculated by the joint detection model, and applying the correction value to each joint position to reduce the estimation error that may occur when calculating the distance and ratio between the joints; and
a step of estimating the age of the animal by measuring the Euclidean distance or cosine similarity between the distance and ratio between the joints and the standard body shape data to derive the closest age range in months,
wherein the breed identification model is a CLIP (Contrastive Language-Image Pre-training) model having a multimodal learning structure comprising an image encoder for extracting input image embeddings from the input image and a text encoder for extracting text embeddings for text labels corresponding to a plurality of breeds, wherein the model generates image embeddings and text embeddings respectively using pre-trained parameters, calculates cosine similarity or a similarity score between the image embeddings and the plurality of breed text embeddings, and determines the breed identification result by selecting the text label of the text embedding having the highest similarity among them as the final breed, and
wherein the breed identification model is tuned by inputting a plurality of training images and text labels corresponding to each of the plurality of breeds into the image encoder and the text encoder respectively, and performing contrastive learning based on the similarity between the training image embeddings of the training images and the text label embeddings of the text labels, thereby updating model parameters to increase the similarity between the text label embeddings of the text labels for breeds included in the training images and the training image embeddings.
7. A computer device comprising:
one or more processors; and
a memory for storing instructions executable by the one or more processors,
wherein the one or more processors:
input an input image into a breed identification model to determine a breed identification result based on the similarity between the input image embedding and the text label embedding of a text label corresponding to each breed;
obtain standard body shape data including a table in which joint lengths and ratios measured for each age interval are recorded for the breed identified based on the breed identification result;
input the input image into a joint detection model to identify multiple joints, and calculate the distance and ratio between joints based on the name and location of each joint;
calculate a correction value based on ray tracing or camera parameters to correct for perspective distortion or errors due to the shooting angle in the positions of the multiple joints calculated by the joint detection model, and apply the correction value to each joint position to reduce the estimation error that may occur when calculating the distance and ratio between the joints; and
estimate the age of the animal by measuring the Euclidean distance or cosine similarity between the distance and ratio between the joints and the standard body shape data to derive the closest age range in months,
wherein the breed identification model is a CLIP (Contrastive Language-Image Pre-training) model having a multimodal learning structure comprising an image encoder for extracting input image embeddings from the input image and a text encoder for extracting text embeddings for text labels corresponding to a plurality of breeds, wherein the model generates image embeddings and text embeddings respectively using pre-trained parameters, calculates cosine similarity or a similarity score between the image embeddings and the plurality of breed text embeddings, and determines the breed identification result by selecting the text label of the text embedding having the highest similarity among them as the final breed, and
wherein the breed identification model is tuned by inputting a plurality of training images and text labels corresponding to each of the plurality of breeds into the image encoder and the text encoder respectively, and performing contrastive learning based on the similarity between the training image embeddings of the training images and the text label embeddings of the text labels, thereby updating model parameters to increase the similarity between the text label embeddings of the text labels for breeds included in the training images and the training image embeddings.