Method for identifying abnormal behavior of companion animal using image information

The method uses YOLO-based object detection, DeepLabCut-based pose estimation, and CLIP-based multimodal abnormal behavior identification, along with a text generation model, to accurately and flexibly identify abnormal pet behaviors and notify the user, addressing the limitations of existing technologies in pet behavior analysis.

WO2026142401A1 · PCT designated stage · Publication Date: 2026-07-02 · SOFTZEN


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SOFTZEN
Filing Date: 2025-10-10
Publication Date: 2026-07-02


IPC: G06V40/20; G06V20/52; G06V10/774; G06V10/34; G06V20/40; G06V10/82; G06N3/0895; G06N3/096; H04N7/18; A01K29/00

AI Tagging

Technology Topics

Radiology Zoology



AI Technical Summary

Technical Problem

Existing technologies struggle to accurately and consistently identify abnormal behaviors in pets such as aggression, anxiety, and stress in dogs and cats, particularly due to limitations in distinguishing between normal and abnormal behaviors across diverse breeds and individual animals, and relying solely on simple object recognition fails to capture dynamic and joint characteristics.

Method used

A method utilizing a YOLO-based object detection model to locate pets, a DeepLabCut-based pose estimation model to identify joint coordinates, and a CLIP-based multimodal abnormal behavior identification model to classify behaviors, combined with a text generation model for automatic annotation and notification, enabling real-time monitoring and risk-level assessment.

Benefits of technology

Enables precise identification and notification of abnormal pet behaviors in real-time, with enhanced data scalability and flexibility through partial fine-tuning techniques, allowing for accurate classification and automatic generation of behavior labels.




Abstract

A method for identifying an abnormal behavior of a companion animal by using image information comprises the steps of: acquiring a frame image of a frame unit of an inputted image stream; extracting a companion animal area image through an object detection model to detect an area in which a companion animal is located from the frame image; inputting the companion animal area image into a deep learning-based pose estimation model to calculate a plurality of joint coordinates of the companion animal and obtaining position information of each joint; inputting the frame image or the companion animal area image into an abnormal behavior identification model of a multimodal structure to classify the behavior of the companion animal; and recording the type and time of the abnormal behavior and providing a notification to a user if the classified behavior of the companion animal is an abnormal behavior.


Description

Method for Identifying Abnormal Pet Behavior Using Video Information

[0001] The present invention relates to a method for identifying abnormal behavior in pets by utilizing video information based on artificial intelligence.

[0002] With the rapid increase in households raising pets, the analysis of animal safety and behavior is emerging as a critical social issue. In particular, early detection of abnormal behaviors in dogs and cats, such as aggression, anxiety, and stress, allows owners to respond promptly and minimize damage or problems. Traditionally, owners observed their pets directly, or veterinarians and pet sitters analyzed their behavior visually. However, this approach is prone to missing subtle bodily signals and lacks accuracy and consistency because it relies on subjective judgment. While extensive video analysis research has been conducted on recognizing and tracking objects such as people or vehicles, automated technologies specialized in identifying abnormal behaviors in pets (dogs and cats) have been relatively lacking. For instance, simple CNN-based classification models exist, but they cannot reliably distinguish between normal and abnormal behaviors or effectively handle variations across diverse breeds and individual animals. Furthermore, simple object recognition alone makes it difficult to accurately identify abnormal behaviors that depend on dynamic and joint characteristics, such as "aggressive postures" or "fleeing behavior."

[0003] The present invention has been devised to solve the aforementioned problems and aims to provide a method for automatically monitoring the behavior of animals (e.g., companion animals) to detect whether there is abnormal behavior.

[0004] Another objective of the present invention is to provide a method that enables specific behavior classification and notification based on risk level by linking an object detection model, a pose estimation model, and a multimodal abnormal behavior identification model.

[0005] Another objective of the present invention is to propose an annotation method capable of automatically generating behavior labels using a text generation model, and to provide a method for enhancing data scalability and flexibility by enabling the easy incorporation of new abnormal behavior labels through partial fine-tuning (transfer learning) techniques.

[0006] As a means to realize the aforementioned task, the present invention provides a method for identifying abnormal behavior of a pet using image information, comprising: a step of acquiring a frame image in frame units of an input image stream; a step of extracting a pet area image through an object detection model to detect an area where the pet is located in the frame image; a step of inputting the pet area image into a deep learning-based pose estimation model to calculate a plurality of joint coordinates of the pet and acquiring position information of each joint; a step of inputting the frame image or the pet area image into a multimodal abnormal behavior identification model to classify the behavior of the pet; and a step of recording the type and time of the abnormal behavior and providing a notification to a user when the classified behavior of the pet is abnormal behavior.

[0007] According to one feature, the object detection model includes a YOLO-based neural network structure to track the location of a pet in the frame image, and can set the area within the bounding box detected in the frame image as the pet area image.

[0008] According to another feature, the deep learning-based pose estimation model includes a DeepLabCut-based animal pose estimation algorithm, can identify multiple joint coordinates of a pet, and output the name and coordinate value of each joint.

[0009] According to another feature, the abnormal behavior identification model of the multimodal structure described above is a CLIP-based model including an image encoder and a text encoder, which can convert multiple behavior text labels expressing abnormal pet behavior into text embeddings and determine the behavior text label with the highest similarity by comparing them with the image embeddings of the frame image or the pet area image.

[0010] According to another feature, the abnormal behavior identification model of the multimodal structure described above is fine-tuned by performing contrastive learning using a video of the pet's abnormal behavior and a text label describing the abnormal behavior, and can be trained to determine whether there is abnormal behavior based on the similarity between the image embedding and the text embedding of the behavior text label.
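The contrastive fine-tuning described here can be illustrated with a small sketch. The source does not state the loss function, so the symmetric cross-entropy objective below (the form commonly associated with CLIP-style training) is an assumption, as are the function name clip_contrastive_loss, the temperature value, and the toy two-dimensional embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair i is (image i, text i).

    Assumed objective: cross-entropy over image-to-text and text-to-image
    similarity rows, averaged. The temperature value is illustrative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # The correct class for row i is column i (the matching pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Toy embeddings (invented): image i is close to text i.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]

matched_loss = clip_contrastive_loss(images, texts)
mismatched_loss = clip_contrastive_loss(images, list(reversed(texts)))
```

Minimizing a loss of this kind pulls each abnormal-behavior image toward its own text label and away from the other labels in the batch, which is what makes the later similarity comparison meaningful.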

[0011] According to another feature, when the behavior of a classified pet is abnormal behavior, the step of recording the type and time of the abnormal behavior and providing a notification to the user may include recording the type of abnormal behavior, the time of occurrence, and a video stream containing the abnormal behavior, and providing a notification to the user in a different way based on the type of abnormal behavior.

[0012] According to another feature, the step of providing different types of notifications to the user based on the type of abnormal behavior may include providing an immediate notification when the risk level is above a predetermined level based on at least one criterion among the pet risk level, space risk level, or user risk level for the abnormal behavior, and providing notifications at predetermined time intervals when the risk level is below a predetermined level.
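The risk-based notification rule can be sketched as follows. This is a minimal illustration, not the claimed implementation: the Event and Notifier names, the integer risk scale, the threshold of 3, the 600-second interval, and the choice to treat a risk equal to the threshold as "above" are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    behavior: str
    timestamp: float  # seconds
    pet_risk: int     # hypothetical integer risk scores
    space_risk: int
    user_risk: int


@dataclass
class Notifier:
    threshold: int = 3         # assumed "predetermined level"
    interval_s: float = 600.0  # assumed "predetermined time interval"
    last_flush: float = 0.0
    pending: List[str] = field(default_factory=list)
    sent: List[Tuple[str, List[str]]] = field(default_factory=list)

    def handle(self, event: Event) -> None:
        # Immediate notification if at least one criterion reaches the threshold.
        if max(event.pet_risk, event.space_risk, event.user_risk) >= self.threshold:
            self.sent.append(("immediate", [event.behavior]))
        else:
            self.pending.append(event.behavior)
        # Low-risk events are delivered together once the interval has elapsed.
        if self.pending and event.timestamp - self.last_flush >= self.interval_s:
            self.sent.append(("digest", list(self.pending)))
            self.pending.clear()
            self.last_flush = event.timestamp


notifier = Notifier(threshold=3, interval_s=600.0)
notifier.handle(Event("excessive licking", 100.0, pet_risk=1, space_risk=0, user_risk=0))
notifier.handle(Event("biting a person", 200.0, pet_risk=2, space_risk=1, user_risk=4))
notifier.handle(Event("pacing", 700.0, pet_risk=1, space_risk=1, user_risk=0))
```

In this sketch the high-risk event is sent at once, while the two low-risk events are grouped into a single later notification.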

[0013] According to the present invention, a systematic method for automatically monitoring pet behavior to detect abnormal behavior in real time can be provided. Furthermore, according to the present invention, by linking an object detection model, a pose estimation model, and a multimodal abnormal behavior identification model, it is possible to enable specific behavior classification and notifications based on risk levels. Additionally, according to the present invention, an annotation method capable of automatically generating behavior labels using a text generation model is presented, and new abnormal behavior labels can be easily incorporated through partial fine-tuning (transfer learning) techniques, thereby enhancing data scalability and flexibility.

[0014] FIG. 1 is a block diagram of a computer device for performing a method of determining abnormal behavior of a pet in a video stream according to an embodiment of the present invention. FIG. 2 is a schematic diagram showing artificial intelligence models for determining abnormal behavior of a pet in a video stream according to an embodiment of the present invention. FIG. 3 is an example of data in an intermediate process for identifying abnormal behavior of a pet according to an embodiment of the present invention. FIG. 4 is an example of JSON-formatted data in an intermediate process for identifying abnormal behavior of a pet according to an embodiment of the present invention. FIG. 5 is a schematic diagram showing an abnormal behavior identification model according to an embodiment of the present invention. FIG. 6 is an example diagram of pet abnormal behavior identification according to an embodiment of the present invention. FIG. 7 is an example diagram of pet abnormal behavior identification according to an embodiment of the present invention. FIG. 8 is a flowchart of a method for determining abnormal behavior of a pet according to an embodiment of the present invention.

[0015] Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. These embodiments are merely intended to provide a detailed description sufficient for a person skilled in the art to easily practice the invention, and the technical concept and scope of protection of the present invention should not be interpreted as being limited only to the following embodiments.

[0016] FIG. 1 is a block diagram of a computer device for performing a method of determining abnormal behavior of a pet in a video stream according to an embodiment of the present invention. The computer device (100) may include a processor (110), memory (130), and a network unit (150). The processor (110) may be composed of one or more cores and may include a processor for data analysis and deep learning, such as a central processing unit (CPU), a general-purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computer device. The processor (110) may read a computer program stored in memory (130) and perform data processing for machine learning. The processor (110) may perform calculations for training a neural network, such as processing input data for deep learning, extracting features from input data, calculating errors, and updating the weights of the neural network using backpropagation. The memory (130) can store information of any form generated or determined by the processor (110) and information of any form received by the network unit (150). The network unit (150) can be configured regardless of the mode of communication, whether wired or wireless.

[0017] FIG. 2 is a schematic diagram showing artificial intelligence models for determining abnormal behavior of a pet in a video stream according to an embodiment of the present invention. To determine abnormal behavior of a pet in a video stream, an object detection model (200), a pose estimation model (300), an abnormal behavior identification model (400), and a text generation model (500) may be used.

[0018] A computer device (100) detects the location of a pet in each frame image in the form of a bounding box using an object detection model (200). The object detection model (200) can identify multiple kinds of objects in the image, such as people or things, but in the present invention it is trained specifically for pets (dogs, cats) or configured to ignore other objects if necessary. The object detection model (200) includes a YOLO-based neural network structure. YOLO divides the frame image into a grid and, for each grid cell, predicts candidate bounding boxes and the corresponding class probabilities in a single step.

[0019] The computer device (100) sets pets such as dogs and cats as the main class and prepares numerous training images corresponding to them. In the present invention, the YOLO model is trained or fine-tuned using a dataset that includes various species, backgrounds, angles, and lighting conditions. By doing so, the computer device (100) can accurately detect pet areas using an object detection model (200).

[0020] Whenever the computer device (100) receives a series of frame images, it inputs the frame images into the object detection model (200). The YOLO neural network infers the location (bounding box coordinates) and class (probabilities) of each pet in the entire frame with a single forward pass. Subsequently, only bounding boxes with a probability above a certain threshold are selected, and duplicate boxes are removed using non-maximum suppression (NMS) or a similar technique to produce the final pet area.
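The thresholding and duplicate removal step can be sketched in a few lines. This is an illustrative greedy non-maximum suppression, assuming boxes are given as (x_min, y_min, x_max, y_max, score) tuples; the tuple layout, the function names, the thresholds of 0.5, and the sample detections are assumptions rather than values from the source.

```python
from typing import List, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max, score)
Detection = Tuple[float, float, float, float, float]


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(dets: List[Detection], score_thresh: float = 0.5,
                   iou_thresh: float = 0.5) -> List[Detection]:
    """Drop low-confidence boxes, then greedily suppress overlapping duplicates."""
    candidates = sorted((d for d in dets if d[4] >= score_thresh),
                        key=lambda d: d[4], reverse=True)
    kept: List[Detection] = []
    for det in candidates:
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(det, k) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


detections = [
    (10, 10, 110, 110, 0.92),
    (12, 14, 108, 112, 0.80),    # near-duplicate of the first box
    (200, 50, 300, 180, 0.75),   # a second, separate pet
    (400, 400, 420, 420, 0.20),  # below the confidence threshold
]
final_boxes = filter_and_nms(detections)
```

Lowering score_thresh makes the detector less likely to miss a pet, while lowering iou_thresh suppresses duplicates more aggressively.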

[0021] The (x, y) coordinates, width, and height of the bounding box output by the object detection model (200) are used by the computer device (100) to accurately cut out the area where the pet is located. For example, the image patch cropped from (x_min, y_min) to (x_max, y_max) is called a pet area image. The computer device (100) can input the extracted pet area image into a pose estimation model (300) or directly input it into a multimodal abnormal behavior identification model (400). In this process, additional preprocessing such as background removal or resolution adjustment may be performed.
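As a rough illustration of the cropping step, the sketch below converts a box given as (x, y, width, height) into corner coordinates and slices the patch out of a frame. It assumes that (x, y) is the box center, which is common for YOLO-style outputs but is not stated in the source, and it represents the frame as a plain list of pixel rows; both function names are hypothetical.

```python
def xywh_to_corners(cx, cy, w, h, frame_w, frame_h):
    """Convert a center-format box to (x_min, y_min, x_max, y_max), clamped to the frame."""
    x_min = max(0, int(round(cx - w / 2)))
    y_min = max(0, int(round(cy - h / 2)))
    x_max = min(frame_w, int(round(cx + w / 2)))
    y_max = min(frame_h, int(round(cy + h / 2)))
    return x_min, y_min, x_max, y_max


def crop(frame, box):
    """Cut the patch from (x_min, y_min) to (x_max, y_max); max edges are exclusive."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in frame[y_min:y_max]]


# Toy frame, 6 pixels wide and 4 pixels high; each "pixel" stores its (row, col).
frame = [[(r, c) for c in range(6)] for r in range(4)]

box = xywh_to_corners(cx=3, cy=2, w=2, h=2, frame_w=6, frame_h=4)
pet_area_image = crop(frame, box)
```

Clamping to the frame size keeps the crop valid when a detected box extends past the image border.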

[0022] If multiple pets exist within a single frame, the object detection model (200) generates multiple bounding boxes. The computer device (100) generates independent area images for each bounding box and inputs each into subsequent steps (pose estimation, abnormal behavior identification). This enables accurate analysis even when multiple pets appear on a single screen.

[0023] The computer device (100) can generate multiple bounding boxes from a single frame image through an object detection model (200). For example, if two dogs and one cat appear simultaneously on one screen, the model (200) outputs three bounding boxes. The computer device (100) checks the coordinates and confidence of each bounding box to extract each pet area image.

[0024] The computer device (100) assigns a unique Tracking ID to each detected pet bounding box in each frame. For example, in the first frame, ID=1 is assigned to "Dog A", ID=2 to "Dog B", and ID=3 to "Cat C". Subsequently, in the next frame, the object detection model (200) generates new bounding box candidates.

[0025] The computer device (100) compares the bounding boxes (and IDs) of the previous frame with the bounding boxes of the current frame and matches identical objects based on a similarity score (which may utilize IoU, color histograms, joint coordinates, etc.). For example, if a bounding box is most similar to the "ID=1" bounding box of the previous frame in terms of position and size, "ID=1" is continuously assigned to that bounding box as well. If a newly appearing pet bounding box does not match the previous frame, a new Tracking ID is assigned. If the pet disappears from the screen or detection is missed, the ID is temporarily put into a dormant state. If detection is not resumed for a certain number of frames or longer, it is processed as "Tracking Termination" and is no longer considered a valid tracking target.
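The ID matching, dormant state, and tracking termination described above can be sketched with a simple greedy tracker. This is a simplified illustration that uses IoU as the only similarity score; the SimpleTracker class, its parameters, and the sample boxes are hypothetical, and trackers such as SORT add motion prediction that is omitted here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    box: Box
    missed: int = 0  # consecutive frames without a match; > 0 means dormant


class SimpleTracker:
    def __init__(self, iou_thresh: float = 0.3, max_missed: int = 5):
        self.iou_thresh = iou_thresh
        self.max_missed = max_missed
        self.tracks: Dict[int, Track] = {}
        self.next_id = 1

    def update(self, boxes: List[Box]) -> List[int]:
        """Return the Tracking ID assigned to each box, in input order."""
        assigned: List[int] = []
        unmatched = set(self.tracks)
        for box in boxes:
            best_id, best_iou = None, self.iou_thresh
            for tid in unmatched:
                score = iou(box, self.tracks[tid].box)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:
                best_id = self.next_id  # newly appearing pet
                self.next_id += 1
            else:
                unmatched.discard(best_id)
            self.tracks[best_id] = Track(box=box)  # a match resets the miss counter
            assigned.append(best_id)
        for tid in unmatched:
            self.tracks[tid].missed += 1           # dormant state
            if self.tracks[tid].missed > self.max_missed:
                del self.tracks[tid]               # tracking termination
        return assigned


tracker = SimpleTracker(iou_thresh=0.3, max_missed=1)
ids_f1 = tracker.update([(0, 0, 100, 100), (200, 200, 300, 300)])
ids_f2 = tracker.update([(205, 198, 305, 298), (5, 5, 105, 105)])
ids_f3 = tracker.update([(10, 10, 110, 110)])
dormant_after_f3 = 2 in tracker.tracks
ids_f4 = tracker.update([(12, 12, 112, 112)])
ids_f5 = tracker.update([(500, 500, 600, 600)])
```

In the second frame the boxes arrive in a different order but keep their IDs; the second pet then goes dormant for one frame and is dropped after the allowed number of misses.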

[0026] The computer device (100) continuously tracks the same object over multiple frames using a tracking ID. This allows for the accumulation of the movement trajectory of "Dog A" or the management of the position change of "Cat C" in a time series. This information can then be combined with a pose estimation model (300) or an abnormal behavior identification model (400) to precisely analyze what behaviors individual pets frequently exhibit, whether abnormal behaviors are repeated, etc.

[0027] If multiple pets within a frame exhibit abnormal behavior simultaneously, there is a possibility of confusing the perpetrators without tracking information. However, the computer device (100) clearly distinguishes between "ID=1 (Dog A)" and "ID=2 (Dog B)" through a Tracking ID and manages their respective joint coordinates or behavior labels separately. This is essential for identifying and recording abnormal behavior on an individual pet basis.

[0028] The computer device (100) records the time and type of abnormal behavior and provides notifications. In an environment with multiple pets, the user can distinguish and check the abnormal behavior of "ID=1 (dog A)", the abnormal behavior of "ID=3 (cat C)", etc. This allows for clear identification of which pet performed which behavior and accurate determination of the risk level or warning level.

[0029] In actual implementation, the computer device (100) can generate and update a Tracking ID by using YOLO as an object detection model (200) and then applying a bounding box tracking algorithm (e.g., SORT, DeepSORT). SORT performs simple tracking using IoU and Kalman filters, while DeepSORT further improves re-identification accuracy by utilizing visual features. When joint points (pose estimation results) and color and shape characteristics are used together during the process of determining the tracking ID, ID confusion can be reduced even in situations where multiple pets look similar to each other. For example, pets are distinguished by detailed features such as "foreleg length ratio, tail angle."

[0030] As the frame rate and the number of pets increase, the processing load of the tracking algorithm increases. The computer device (100) adjusts the tracking algorithm parameters (update period, distance measurement method, etc.) according to the state of the hardware resources (processor, GPU) to maintain a balance between real-time performance and accuracy.

[0031] In this way, the object detection model (200) assigns a tracking ID to track each pet in a situation where multiple pets appear simultaneously, and maintains the ID consistently across consecutive frames, thereby efficiently identifying abnormal behavior even in a multi-pet environment. Through this tracking function, the present invention distinguishes the movements of individual pets, clearly identifies the source of the abnormal behavior, and performs repetitive abnormal behavior analysis and risk assessment with greater precision.

[0032] The present invention allows the object detection model (200) to be updated using a transfer learning method for more accurate detection in new types of pets or environments. For example, detection performance is continuously improved by adding rare dog and cat breed datasets and partially fine-tuning the YOLO model. Sensitivity can be increased by adjusting thresholds so that pets are not missed for the purpose of identifying abnormal behavior. On the other hand, if false positives increase, NMS settings can be adjusted to optimize the results.

[0033] Throughout the invention, the object detection model (200) serves as a preprocessing stage that isolates only the regions in which a pet appears. This allows the remaining models (pose estimation model (300), abnormal behavior identification model (400)) to focus on the pet without unnecessary background information, thereby improving accuracy and processing speed. Since the YOLO-based neural network predicts both the object location and class with a single neural network inference, it is advantageous for real-time processing at a speed of tens of frames per second or more. This aligns with the objective of the present invention (real-time identification of abnormal pet behavior). The present invention is not limited to YOLO; other object detection algorithms such as Faster R-CNN and SSD may also be used, and a model that balances accuracy and throughput may be selected depending on the requirements.

[0034] The pose estimation model (300) is a model that estimates the pose of a pet by detecting multiple joints forming the body of the pet in an image of the pet. The computer device (100) detects multiple joints forming the body of the pet using the pose estimation model (300). The pose estimation model (300) includes a deep learning algorithm, particularly an animal pose estimation algorithm such as DeepLabCut, and finds the joint positions with relatively high accuracy regardless of the angle or background from which the input image was taken.

[0035] In the present invention, joint coordinates are calculated by selecting various parts of a pet (dog, cat), such as the head, ears, forelegs, hindlegs, and tail. Subsequently, the joint coordinates can be used as additional auxiliary information for an abnormal behavior identification model (400) and can also be linked to a text generation model (500) that automatically generates behavior description text describing the pet's behavior.

[0036] DeepLabCut is a markerless pose estimation algorithm for predicting the position corresponding to an animal joint, utilizing a Backbone CNN (such as a ResNet family) that has been pre-trained on a large dataset such as ImageNet. A computer device (100) fine-tunes the DeepLabCut model using training data to which animal joint labels (head, ears, forepaws, hind legs, tail, etc.) have been attached in advance.

[0037] The pose estimation model (300) generates a heatmap that visualizes the "probability of joint existence" for each pixel of the input image. The computer device (100) interprets this heatmap and selects the coordinates with the highest probability of each joint being located as the final joint point. In this process, only joints with a Confidence Score above a certain standard are selected, and joints below that standard are processed by complementary post-processing (interpolation, tracking).
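The heatmap interpretation step can be illustrated as follows. This sketch assumes one small 2D heatmap per joint, stored as nested lists of probabilities, and a confidence threshold of 0.5; the pick_joint helper, the joint names, and all numeric values are invented for illustration.

```python
def pick_joint(heatmap, min_conf=0.5):
    """Return the most probable (x, y) location, or None if confidence is too low."""
    best_score, best_xy = -1.0, None
    for y, row in enumerate(heatmap):
        for x, prob in enumerate(row):
            if prob > best_score:
                best_score, best_xy = prob, (x, y)
    if best_score < min_conf:
        return None  # left for post-processing such as interpolation or tracking
    return {"x": best_xy[0], "y": best_xy[1], "score": best_score}


# Toy 3x3 heatmaps (invented values), one per joint.
heatmaps = {
    "nose": [
        [0.0, 0.1, 0.0],
        [0.1, 0.9, 0.2],
        [0.0, 0.1, 0.0],
    ],
    "tail_base": [
        [0.1, 0.2, 0.1],
        [0.1, 0.3, 0.1],
        [0.0, 0.1, 0.0],
    ],
}

joints = {name: pick_joint(hm) for name, hm in heatmaps.items()}
```

Here the nose is accepted at its peak location, while the tail base falls below the threshold and is returned as None so that a later step can fill it in.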

[0038] The computer device (100) inputs the pet area image extracted through the object detection model (200) into the pose estimation model (300). The pose estimation model (300) extracts features through multiple layers within the network and outputs a heatmap for each joint. Based on the heatmap, (x, y) coordinates are calculated, and the calculated coordinates are transmitted to subsequent modules, such as the abnormal behavior identification model (400), in a structured form such as JSON. The joint coordinates calculated by the pose estimation model (300) can be used to more precisely distinguish "posture" or "movement" in the abnormal behavior identification model (400) of a multimodal structure. For example, characteristic postures such as the pet tilting its ears back or raising its tail high can be associated with abnormal behavior.

[0039] The computer device (100) can track the movement (dynamic change) of a pet by performing pose estimation in consecutive frames. For example, it determines in a time series how many frames a specific behavior lasts, how rapidly the tail angle changes, etc. This information helps distinguish between one-time and repetitive behaviors or contributes to risk classification.
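A time-series feature of this kind can be sketched from joint coordinates alone. The example below computes a tail angle per frame, the frame-to-frame change, and the longest run of frames in which the tail is raised; the joint names tail_base and tail_tip, the 60-degree threshold, and the coordinates are assumptions made for illustration.

```python
import math


def segment_angle_deg(base, tip):
    """Angle of the base-to-tip vector above the horizontal, in degrees.

    Image y coordinates grow downward, so the y difference is flipped.
    """
    dx = tip[0] - base[0]
    dy = base[1] - tip[1]
    return math.degrees(math.atan2(dy, dx))


def longest_run(flags):
    """Length of the longest streak of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Per-frame joint coordinates for one Tracking ID (invented values).
frames = [
    {"tail_base": (100, 100), "tail_tip": (120, 100)},
    {"tail_base": (100, 100), "tail_tip": (110, 80)},
    {"tail_base": (100, 100), "tail_tip": (100, 70)},
    {"tail_base": (100, 100), "tail_tip": (100, 72)},
]

angles = [segment_angle_deg(f["tail_base"], f["tail_tip"]) for f in frames]
changes = [abs(b - a) for a, b in zip(angles, angles[1:])]  # degrees per frame
raised = [angle >= 60.0 for angle in angles]                # hypothetical threshold
raised_frames = longest_run(raised)
```

The run length indicates how many frames a posture lasts, and the per-frame change indicates how rapidly the angle moves, which are the two quantities the paragraph mentions.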

[0040] The computer device (100) can accumulate and store pose estimation results (joint coordinates and confidence) and monitor model accuracy by comparing them with actual labels collected during the annotation process. If necessary, fine-tuning for new breeds and poses is also performed through transfer learning.

[0041] When an object detection model (200) detects multiple pets and assigns a Tracking ID to each bounding box, a pose estimation model (300) receives area images for each ID and calculates joint coordinates. Through this, the computer device (100) distinguishes and manages changes in the posture of individual pets. In situations where multiple pets overlap on one screen, the pose estimation model (300) may confuse the joint coordinates. The computer device (100) is configured to estimate the correct joint points by combining bounding boxes, joint heatmaps, and joint connection relationships (skeleton structures such as shoulder-leg). This reduces false positives and maintains high accuracy.

[0042] The computer device (100) may also display the pose estimation results (e.g., visualization of joint positions) in real time on a user interface (UI). This allows the guardian to intuitively check the pet's movements and facilitates inspection when abnormal behavior occurs.

[0043] The pose estimation model (300) quantifies detailed movements that are difficult to identify with simple object recognition into joint coordinates. Since the tail angle, ear position, body tilt, etc. are directly linked to various abnormal behavior indicators, they play a key role in the abnormal behavior identification process provided by the present invention.

[0044] Since algorithms similar to DeepLabCut are not limited to specific breeds or shooting environments, they can be extended across various backgrounds, lighting, and species. This ensures the broad applicability of the present invention.

[0045] Joint coordinates provide useful ground information for automatically generating behavioral description text in the text generation model (500) or classifying behavioral types in the abnormal behavior identification model (400). Therefore, the pose estimation model (300) is not merely a simple analysis tool, but functions as an important link in the entire pipeline of the present invention.

[0046] FIG. 3 is an example of intermediate process data for identifying abnormal behavior of a pet according to an embodiment of the present invention. FIG. 3 is an example of intermediate process data extracted by the aforementioned object detection model (200) and pose estimation model (300).

[0047] The computer device (100) sequentially applies each frame image acquired from the video stream to an object detection model (e.g., YOLO) and a pose estimation model (e.g., DeepLabCut) to produce pet area (bounding box) and joint coordinate information. This intermediate result data is ultimately used for identifying abnormal behavior or for annotation work. The table shown in FIG. 3 ("Example of annotation generation for dog / cat behavior pattern classification") specifies the following main fields and structure.

[0048] info field

[0049] video_name: Video filename (e.g., dog_run.mp4).

[0050] resolution: Image resolution (e.g., 1080×1920).

[0051] num_frame: Total number of frames.

[0052] num_keypoints: Number of joints detected in a frame (e.g., 15, 17, etc.).

[0053] keypoint_channels: Settings for joint coordinate format and recognition rate, etc.

[0054] version: Version of the annotation data format or model version.

[0055] category_id field

[0056] Identifies the pet behavior classification number. For example, behavior can be classified as "0: Normal behavior," "1: Abnormal behavior (aggression)," "2: Abnormal behavior (fear)," etc.

[0057] annotations field

[0058] frame_index: Frame number (0, 1, 2, …).

[0059] id: Identifier (Tracking ID) of the object (or pet). When tracking multiple objects, an individual ID is assigned.

[0060] person_id: Records the ID of the author (labeler) or an ID generated automatically by the system.

[0061] keypoints: Stores joint coordinates in the form (x, y, score). Includes the pixel coordinate values where each joint is located and the confidence level (Score), etc.

[0062] Through this structure, the pet's bounding box, joint coordinates, behavior labels, etc., are systematically recorded in JSON format within a single frame.

[0063] FIG. 4 is an example of JSON-formatted intermediate data for identifying abnormal behavior of a pet according to an embodiment of the present invention, recorded using the data structure shown in the example of FIG. 3.

[0064] The JSON example presented in FIG. 4 is structured such that for a video file named "dog-walkrun-004754.mp4", the video resolution (1080×1920), total number of frames (num_frame=506), number of joints (num_keypoints=15), and annotation array (annotations field) are recorded. The computer device (100) automatically saves the joint coordinates of the pet (dog), the corresponding behavior classification (category_id), and bounding box information for each frame, thereby utilizing them for an abnormal behavior identification model or data analysis thereafter.
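A record with this shape might be built and checked as in the sketch below. The file name, resolution, frame count, and keypoint count are the values quoted for FIG. 4; everything else, including the exact nesting, the top-level placement of category_id, the keypoint_channels layout, the version string, the person_id value, and the keypoint coordinates, is a placeholder assumption, and the validate helper is hypothetical.

```python
import json

NUM_KEYPOINTS = 15  # num_keypoints value quoted for FIG. 4

record = {
    "info": {
        "video_name": "dog-walkrun-004754.mp4",
        "resolution": [1080, 1920],
        "num_frame": 506,
        "num_keypoints": NUM_KEYPOINTS,
        "keypoint_channels": ["x", "y", "score"],  # assumed channel layout
        "version": "0.0-placeholder",              # placeholder, not from the source
    },
    "category_id": 0,  # 0: normal behavior in the example classification
    "annotations": [
        {
            "frame_index": 0,
            "id": 1,              # Tracking ID of the pet
            "person_id": "auto",  # placeholder labeler or system ID
            # Placeholder keypoints as (x, y, score); the values are invented.
            "keypoints": [[100.0 + 10 * k, 200.0 + 5 * k, 0.9]
                          for k in range(NUM_KEYPOINTS)],
        }
    ],
}


def validate(rec):
    """Return a list of consistency problems between info and annotations."""
    problems = []
    info = rec["info"]
    channels = len(info["keypoint_channels"])
    for ann in rec["annotations"]:
        if not 0 <= ann["frame_index"] < info["num_frame"]:
            problems.append(f"frame_index {ann['frame_index']} out of range")
        if len(ann["keypoints"]) != info["num_keypoints"]:
            problems.append(f"frame {ann['frame_index']}: wrong keypoint count")
        if any(len(kp) != channels for kp in ann["keypoints"]):
            problems.append(f"frame {ann['frame_index']}: malformed keypoint")
    return problems


encoded = json.dumps(record, indent=2)  # what would be written to disk
decoded = json.loads(encoded)
problems = validate(decoded)
```

A check like this catches mismatches between the header fields and the per-frame annotations before the data is used for training or analysis.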

[0065] The abnormal behavior identification model (400) is a deep learning model having a multimodal structure, and, for example, a CLIP (Contrastive Language-Image Pre-training) based algorithm may be applied. The abnormal behavior identification model (400) converts the pet area image (or the entire frame image) into an embedding vector and also converts text labels such as "baring teeth and growling" and "puffing up tail" into embedding vectors. The computer device (100) calculates the similarity (e.g., cosine similarity) between the image embedding and the text embedding and selects the behavior label showing the highest similarity as the final classification result.

[0066] The abnormal behavior identification model (400) is pre-trained using videos of abnormal pet behavior and corresponding text labels, or fine-tuned to suit the usage environment. For example, accuracy is improved by training with a dataset labeled with various forms of abnormal behavior such as aggression, fear, anxiety, and stress.

[0067] The computer device (100) analyzes a video of a pet's behavior in a multimodal manner using the abnormal behavior identification model (400). For example, a CLIP (Contrastive Language-Image Pre-training) based neural network structure may be applied, in which case the model includes both an image encoder (420) and a text encoder (410).

[0068] Hereinafter, an abnormal behavior identification model (400) is described in detail with reference to the drawings. FIG. 5 is a schematic diagram showing an abnormal behavior identification model according to an embodiment of the present invention. A computer device (100) divides a video stream (10) into frames and, if necessary, extracts a pet area image through an object detection model (200) and a pose estimation model (300). The abnormal behavior identification model (400) finally transmits the input frame (or area image) to an image encoder (420).

[0069] The computer device (100) inputs a text label (20) expressing abnormal pet behavior, such as "a dog growling and threatening behavior," into a text encoder (410). This label may be a type of abnormal behavior predefined in the present invention or a descriptive sentence automatically generated through a text generation model (500).

[0070] The image encoder (420) converts the input image into an image embedding (11) vector, and the text encoder (410) converts the behavior text label (20) into a text embedding (21) vector. The computer device (100) calculates the cosine similarity or dot product of the image embedding (11) and the text embedding (21) through a similarity determination (30).

[0071] The abnormal behavior identification model (400) utilizes a pre-trained backbone, such as ResNet or Vision Transformer (ViT), to extract visual features of a frame image as an image embedding (11). The computer device (100) obtains a vector of constant length from an image encoder (420), and this vector is called an "image embedding (11)".

[0072] The computer device (100) runs the image encoder (420) on every frame, or on frames sampled at regular intervals, to obtain image embeddings (11). The image embeddings (11) summarize the appearance of the pet (fur color, posture, facial expression, etc.) in the form of high-dimensional vectors and are used as key representations for identifying abnormal behavior.

[0073] The present invention may additionally reflect joint coordinates obtained through a pose estimation model (300) into an image. In some embodiments, joint information may be input into an encoder along with visual features of the image to increase the accuracy of the image embedding (11).

[0074] The computer device (100) inputs text labels (20) describing abnormal behaviors, such as "a dog growling and threatening" or "a cat arching its back," into a text encoder (410). The abnormal behavior identification model (400) tokenizes these text labels and generates sentence-level text embeddings (21).

[0075] In addition to standard labels, the present invention allows a text encoder (410) to input a sentence describing behavior automatically generated by a text generation model (500). This makes it easy to reflect new types of abnormal behavior.

[0076] The text encoder (410) is trained through multimodal contrastive learning so that correct image-text pairs have high similarity and incorrect pairs have low similarity. The computer device (100) applies partial fine-tuning as needed to precisely encode new behavior labels as well.

[0077] The computer device (100) performs a similarity determination (30) between an image embedding (11) and a text embedding (21). For example, the cosine similarity between vector v (image) and w (text) is calculated, and the label with the highest similarity becomes a candidate for the current frame action.

[0078] The computer device (100) calculates the similarity for all behavior labels ("normal behavior," "aggressive behavior," "fearful behavior," etc.) and determines the corresponding label as the final classification result if the maximum value is greater than or equal to a specific threshold. If it is less than the threshold, it is processed as "normal" or "unclassified."
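The similarity determination and threshold rule described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the three-dimensional embeddings are toy values standing in for encoder outputs, and the function names, the default threshold of 0.5, and the fallback label are hypothetical choices not specified in the text.

```python
import math


def cosine_similarity(v, w):
    """Cosine similarity between an image vector v and a text vector w."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0
    return dot / (norm_v * norm_w)


def classify(image_embedding, label_embeddings, threshold=0.5,
             fallback="unclassified"):
    """Pick the most similar label, or the fallback if below the threshold."""
    scores = {
        label: cosine_similarity(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    if scores[best_label] >= threshold:
        return best_label, scores[best_label]
    return fallback, scores[best_label]


# Toy text embeddings (assumed values, not real encoder outputs).
labels = {
    "normal behavior": [1.0, 0.0, 0.0],
    "aggressive behavior": [0.0, 1.0, 0.0],
    "fearful behavior": [0.0, 0.0, 1.0],
}

confident = classify([0.1, 0.9, 0.2], labels)
ambiguous = classify([1.0, 1.0, 1.0], labels, threshold=0.9)
```

In this toy setup the first image vector lies close to the "aggressive behavior" label and is accepted, while the second is equally similar to all labels, falls below the threshold, and is returned as the fallback.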

[0079] Video of abnormal behavior and text labels describing the behavior ("dog growling and threatening behavior") are input together to train the embeddings of both sides to be close. On the other hand, the combination of normal behavior ("sitting behavior") and abnormal behavior labels is trained to have a low similarity score to clearly distinguish between the classifications.
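The text does not state the loss function used to pull matching pairs together and push mismatched pairs apart. One common choice for CLIP-style training is a symmetric cross-entropy over the image-text similarity matrix, sketched below in plain Python. The temperature value of 0.07 and all function names are assumptions for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images matches pair i of texts."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(w) for w in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    image_to_text = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        _cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

With this loss, a batch whose image and text embeddings are correctly aligned yields a value near zero, whereas swapping the text labels within the batch yields a large value, which is the behavior the training described above relies on.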

[0080] The present invention allows the abnormal behavior identification model (400) to be updated using a partial fine-tuning technique when a "new abnormal behavior" label appears through the text generation model (500). This enables flexible expansion without retraining the entire model.

[0081] By directly comparing image embeddings (11) and text embeddings (21), more detailed behavior recognition is possible than conventional CNN-only classification. For example, it richly reflects the angle at which a pet shows its teeth, the condition of its tail, etc.

[0082] By matching sentences and images that are easy for humans to understand, such as "a dog growling and threatening behavior," the results of the abnormal behavior analysis are also intuitively conveyed to the user.

[0083] Multimodal models such as CLIP can handle real-time video frames and allow for the easy addition of new labels through partial fine-tuning. This supports the diversity of pets (breeds and behaviors) and forms an ecosystem that can be expanded in the future.

[0084] In the present invention, the text generation model (500) automatically generates text describing the behavior of a pet based on a given image frame or joint coordinate information, and can be used in the abnormal behavior identification model (400) or the process of constructing annotation data.

[0085] A computer device (100) uses a text generation model (500) to automatically generate sentences describing pet behavior (e.g., “dog growling and threatening behavior”). The present invention may use a structure trained to automatically generate text in a large-scale language model (e.g., GPT family) or an image+text multimodal model.

[0086] The text generation model (500) utilizes the pet image (or joint coordinates), behavior label candidates, environmental context (e.g., time, risk level), etc., as input information. In some embodiments, joint ratios and angles calculated by the pose estimation model (300) are passed as conditions to the text generation model (500) to generate more detailed descriptive sentences.

[0087] The text generation model (500) can use a Transformer structure and generates sentences based on keywords or context, such as "dog's body posture, tail angle." For example, the text generation model (500) can be composed of a large language model and an image recognition model. This automates the behavioral description sentences that were previously written manually by labelers or developers.

[0088] The computer device (100) initializes the internal parameters (e.g., vocabulary, tokenizer, network weights) of the text generation model (500) or loads an already trained model. The model is then partially fine-tuned on a corpus specialized in pet behavior so that it reliably generates domain-specific expressions such as "baring teeth" and "puffing up the tail."

[0089] In a video stream or a specific frame, the pet area, joint coordinates, (optional) temporary result of an abnormal behavior identification model (400), etc., calculated by the object detection model (200) and the pose estimation model (300) are transmitted to the text generation model (500). For example, information summarized in the form of keywords / attributes, such as "dog, foreleg angle 45 degrees, mouth opening large, growling=1," can be used as input. The text generation model (500) interprets the input information and generates natural language sentences, such as "the behavior of a dog opening its mouth and growling to warn" or "the behavior of a cat arching its back to express fear." The computer device (100) can post-process the generated text (organize synonyms, check spelling) or finalize it through user verification.
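The keyword / attribute summary passed to the text generation model (500) could be assembled along the following lines. This is only a sketch: the function name and the split into descriptive attributes and binary flags are assumptions, and the actual call to a text generation model is intentionally left out.

```python
def summarize_attributes(species, attributes, flags):
    """Join detection and pose results into a compact keyword string.

    attributes: (name, value) pairs rendered as "name value"
    flags:      (name, bool) pairs rendered as "name=1" or "name=0"
    """
    parts = [species]
    parts += ["%s %s" % (name, value) for name, value in attributes]
    parts += ["%s=%d" % (name, int(value)) for name, value in flags]
    return ", ".join(parts)


summary = summarize_attributes(
    "dog",
    [("foreleg angle", "45 degrees"), ("mouth opening", "large")],
    [("growling", True)],
)
```

Here `summary` reproduces the example input string from the paragraph above and would then be supplied to the text generation model as its conditioning input.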

[0090] The finally determined behavior sentence is recorded as "behavior label text" in JSON-based annotation data. In this process, by storing human-readable explanatory sentences together with existing numeric labels such as "category_id," the interpretability of the model is increased, and it can also be used to train an abnormal behavior identification model (400).

[0091] The text generation model (500) learns a large text corpus of the general domain and then further fine-tunes sentences related to pet behavior ("baring teeth...", "raising tail...") to enhance expertise. The computer device (100) collects a pet behavior dataset (e.g., video captions, behavior labels) and supplies it to the model, thereby enabling the production of more accurate and contextually appropriate expressions during automatic sentence generation. The present invention allows the text generation model (500) to be updated through partial fine-tuning when a new type of behavior (e.g., a specific behavior of a rare breed) appears. This facilitates the "generation of new behavior labels" and ensures that the entire process of identifying abnormal behavior operates scalably.

[0092] The computer device (100) can be configured to input a behavioral description sentence generated by the text generation model (500) into the text encoder (410) of the abnormal behavior identification model (400), so that the resulting text embedding (21) is compared with the image embedding (11). This allows new, unstructured behavioral sentences to be automatically reflected in the identification model.

[0093] When the object detection model (200) and the pose estimation model (300) extract visual information, the computer device (100) provides the result to the text generation model (500) as a summary factor. Then, the generated sentence is used again as a label in the abnormal behavior identification model (400) or saved as a dataset. Through this, the entire system is repeatedly trained, and the quality of the pet behavior description is gradually improved.

[0094] The text generation model (500) significantly reduces the burden of manual labeling by automatically describing pet behavior in sentences that are easy for humans to understand. In addition to fixed categories ("aggressive behavior," "fearful behavior"), it can freely describe newly observed behaviors, such as "shaking the body vigorously and showing excitement," and these descriptions can later be incorporated into the model through transfer learning. When abnormal behavior occurs, instead of simply marking "abnormal behavior = 1," the system provides specific sentences such as "dog growling and warning behavior detected," so that users can immediately understand the situation and respond. Since hundreds or thousands of behavior label sentences can be automatically generated and modified through the text generation model (500), the AI training data becomes richer, which helps the model learn variations across situations and breeds. The text generation model (500) performs the role of natural language generation for automatically describing pet behavior in the present invention and is organically linked with the object detection model (200), pose estimation model (300), and abnormal behavior identification model (400). Through this, the entire system can be continuously and flexibly expanded not only to identify abnormal behavior but also to discover new behavior labels and annotations.

[0095] FIGS. 6 and 7 are exemplary diagrams of identifying abnormal behavior in pets according to an embodiment of the present invention. FIG. 6 is an exemplary diagram showing the identification of abnormal behavior and user notification for a dog, and FIG. 7 is an exemplary diagram showing the identification of abnormal behavior and user notification for a cat.

[0096] A computer device (100) arranges frame images extracted from a video stream at regular intervals or groups them around a specific point in time for determining abnormal behavior and provides them to the user. In the drawing, scenes showing a pet exhibiting normal behavior (sitting posture) and abnormal behavior (showing teeth and growling) are exemplarily arranged. In the example, they are distinguished and described as 'Normal behavior: sitting behavior' and 'Abnormal behavior: showing teeth and growling behavior'. The present invention displays "normal behavior" and "abnormal behavior" in text and provides the corresponding frame images as thumbnails, as in the example in the drawing, thereby allowing the user to intuitively check the condition of the pet. Depending on the time when abnormal behavior occurs, the present invention sends an immediate push notification or provides a notification at regular intervals. This helps the user quickly identify the level of danger and respond when aggressive scenes occur consecutively across multiple frames, as in the example in the drawing.

[0097] FIG. 8 is a flowchart of a method for determining abnormal behavior of a pet according to an embodiment of the present invention. A computer device (100) can acquire frame images in frame units of an input video stream (S100).

[0098] The computer device (100) first receives a video stream. The video stream may be in various forms, for example, video transmitted in real time from a CCTV or IP camera, or a recorded video file. The present invention receives such a video source via the Internet or a local network and stores it in an internal memory or buffer.

[0099] The computer device (100) can receive real-time video from a camera via a streaming protocol (RTSP, HTTP, WebRTC, etc.). The computer device (100) also decodes and uses frames in the same way when playing file-based video (e.g., MP4, AVI).

[0100] The computer device (100) checks parameters such as video resolution (e.g., 1920×1080), frames per second (FPS), and color format (RGB, YUV, etc.), and configures an internal cache so that processing can be performed in a subsequent step. The computer device (100) divides the received video stream into frame images according to a fixed interval (frame rate). That is, one frame is extracted at every time unit (1 / FPS second) and loaded into memory in the form of a 2D image.

[0101] If the stream is a compressed video codec (H.264, H.265, etc.), the computer device (100) restores the original image for each frame through a decoder. It checks whether there is any frame loss depending on the decoding time and, if necessary, executes retry logic to secure a consistent frame sequence.

[0102] The computer device (100) assigns a unique number (frame_index) to each frame and records the time interval (timestamp) between frames. In a subsequent step (object detection, etc.), the time at which abnormal behavior occurs can be accurately tracked based on a specific frame number. The computer device (100) stores video information (bitrate, resolution), recording (or real-time) time, camera ID, etc. together or records them in annotation data (info field).
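The frame numbering and timestamping described above can be illustrated with a short Python sketch. It assumes a constant frame rate, so that frame k occurs at k / FPS seconds; the function name and the `sample_every` parameter (for processing only every n-th frame) are hypothetical.

```python
def frame_records(num_frames, fps, sample_every=1):
    """Assign a frame_index and a timestamp (seconds) to sampled frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if sample_every < 1:
        raise ValueError("sample_every must be at least 1")
    records = []
    for frame_index in range(0, num_frames, sample_every):
        records.append({
            "frame_index": frame_index,
            "timestamp": frame_index / fps,  # one frame every 1 / fps seconds
        })
    return records


# Three seconds of 30 FPS video, keeping one frame per second.
sampled = frame_records(num_frames=90, fps=30.0, sample_every=30)
```

Streams with a variable frame rate would instead need the presentation timestamps reported by the decoder, which this sketch does not cover.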

[0103] The computer device (100) receives CCTV footage installed in animal shelters or pet cafes in real-time stream. After extracting frames, it sends notifications whenever abnormal behavior occurs, allowing the administrator to take immediate action. Additionally, the computer device (100) can analyze whether abnormal behavior is occurring by sequentially acquiring frames while playing back existing recording files. Through this, it is also possible to process past video data offline.

[0104] Next, the computer device can extract a pet area image through an object detection model to detect the area where the pet is located in the frame image (S200).

[0105] The computer device (100) uses an object detection model to locate the area where the pet is located in each frame image. In one embodiment of the present invention, the object detection model includes a YOLO (You Only Look Once) based neural network structure, but various object detection algorithms (Faster R-CNN, SSD, etc.) can also be optionally applied.

[0106] The computer device (100) divides the frame image into a grid shape using the YOLO algorithm and predicts candidate bounding boxes and class probabilities for each grid cell. The present invention is primarily interested in pet classes such as "dog" and "cat," and if the probability is greater than or equal to a threshold, the corresponding bounding box is detected as a "pet area." The computer device (100) performs Non-Max Suppression (NMS) to handle overlaps among the detected bounding boxes. The box with the highest probability is maintained first, and overlapping boxes are removed to finally determine a clear area where the pet is located.
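The thresholding and Non-Max Suppression steps can be sketched as below. This is a generic, class-agnostic greedy NMS written in plain Python for illustration; the detection dictionary layout and both threshold defaults are assumptions, and a deployed YOLO pipeline would normally rely on the NMS routine of its own framework.

```python
def iou(a, b):
    """Intersection over Union of boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.5, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them."""
    candidates = sorted(
        (d for d in detections if d["score"] >= score_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    {"label": "dog", "score": 0.9, "box": (100, 150, 300, 400)},
    {"label": "dog", "score": 0.8, "box": (110, 160, 310, 410)},  # near-duplicate
    {"label": "cat", "score": 0.7, "box": (500, 500, 600, 600)},
    {"label": "dog", "score": 0.3, "box": (0, 0, 50, 50)},        # below threshold
]
pet_areas = non_max_suppression(detections)
```

In this example the second box overlaps the first with an IoU of about 0.84 and is removed, and the low-confidence box is discarded before suppression, leaving one dog area and one cat area.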

[0107] The bounding box generated by the YOLO model is represented as (x_min, y_min, x_max, y_max). The computer device (100) creates a pet area image by cutting out the corresponding coordinate interval from the original frame. For example, "the area cut out from (x_min=100, y_min=150) to (x_max=300, y_max=400)" becomes the pet image.
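The cropping step can be illustrated as follows. The sketch represents a frame as a list of pixel rows and treats x_max and y_max as exclusive bounds; both the representation and that convention are assumptions, since the text does not specify them.

```python
def crop_region(frame, box):
    """Cut the (x_min, y_min, x_max, y_max) region out of a row-major frame."""
    x_min, y_min, x_max, y_max = box
    height = len(frame)
    width = len(frame[0]) if frame else 0
    # Clamp to the frame so that boxes touching the border stay valid.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return [row[x_min:x_max] for row in frame[y_min:y_max]]


# Synthetic 400 x 500 frame whose pixel value encodes its own position.
WIDTH, HEIGHT = 400, 500
frame = [[y * WIDTH + x for x in range(WIDTH)] for y in range(HEIGHT)]

pet_image = crop_region(frame, (100, 150, 300, 400))
```

For the example box in the paragraph above, the resulting pet area image is 200 pixels wide and 250 pixels high.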

[0108] The computer device (100) performs preprocessing, such as background removal or resolution adjustment (resizing), as needed. For example, if the pose estimation model (300) requires input of a certain resolution, the pet area image is normalized to that resolution. Lighting correction or color conversion (RGB→Gray, etc.) can be applied optionally.

[0109] When multiple pets are detected within a single frame, the computer device (100) calculates an independent bounding box for each and extracts an image of the pet area. This enables accurate recognition even in an environment where multiple pets exist simultaneously.

[0110] The present invention can detect multiple pets in the same frame using an object detection model. A computer device (100) assigns a Tracking ID to each bounding box to track the same individual in the next frame (e.g., algorithms such as SORT, DeepSORT, etc.). Through this, the movement paths or behavioral patterns of multiple pets can be distinguished by individual. The computer device (100) matches the same individual by calculating similarity, such as the Intersection over Union (IoU) between the bounding box of the previous frame and the bounding box of the current frame. Newly appearing pets are issued a new Tracking ID, and pets that have disappeared are switched to a dormant state. Through this tracking process, the present invention reduces confusion and increases accuracy by applying an abnormal behavior identification model (400) to each individual in a multi-pet environment.
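The IoU-based matching of bounding boxes between consecutive frames can be sketched as below. This is a deliberately simplified greedy matcher, not SORT or DeepSORT (which additionally use motion prediction and optimal assignment). The class name, the IoU threshold of 0.3, and the fact that dormant tracks are never reactivated are assumptions of this sketch.

```python
from itertools import count


def iou(a, b):
    """Intersection over Union of boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Greedy frame-to-frame tracker based only on box overlap."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.active = {}    # tracking_id -> box in the previous frame
        self.dormant = {}   # tracking_id -> last known box of lost pets
        self._ids = count(1)

    def update(self, boxes):
        """Return one tracking ID per box of the current frame."""
        unmatched = dict(self.active)
        assigned = []
        for box in boxes:
            best_id, best_iou = None, self.iou_threshold
            for tid, prev_box in unmatched.items():
                overlap = iou(box, prev_box)
                if overlap >= best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:
                best_id = next(self._ids)   # newly appearing pet
            else:
                del unmatched[best_id]      # each track is matched once
            assigned.append(best_id)
        self.dormant.update(unmatched)      # pets that disappeared
        self.active = dict(zip(assigned, boxes))
        return assigned
```

A typical use is to call `update` once per frame with the boxes kept after Non-Max Suppression and to pass each returned ID along with the cropped pet area image to the later stages.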

[0111] YOLO-based models are easy to fine-tune for specific breeds (e.g., rare dog breeds) or environments (nighttime, indoors, etc.). This allows pet detection performance to be maintained or improved under various conditions.

[0112] Next, the computer device (100) can input an image of the pet area into a deep learning-based pose estimation model to calculate multiple joint coordinates of the pet and obtain position information of each joint (S300).

[0113] The computer device (100) obtains a bounding box in which a pet is located within a frame image through an object detection model (200) and crops the image of that area. The pet area image has unnecessary background removed so that the pose estimation model can predict joints more efficiently and accurately. If necessary, the computer device (100) normalizes the resolution, color format, etc. to match the pose estimation model (e.g., 256×256, RGB). If multiple pets are detected, each area image is input as an individual model inference.

[0114] The present invention provides an example of an animal pose estimation algorithm based on DeepLabCut, which predicts the location of keypoints of a pet through a deep learning model. DeepLabCut uses a backbone CNN (such as ResNet) that has been pre-trained on a large dataset such as ImageNet, and adopts a markerless pose estimation technique to estimate the location of animal joints. A computer device (100) fine-tunes the model using training images annotated with predefined pet joint labels (head, ears, forelegs, hindlegs, tail, etc.).

[0115] The pose estimation model generates a probability map (heatmap) for each joint in the input pet area image. The computer device (100) sets the pixel coordinates with the highest confidence in the probability map as the location of the corresponding joint. Only joints with a Confidence Score above a certain standard are finally selected, and joints with a lower score are supplemented through post-processing.
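Selecting a joint location from its probability map can be sketched as follows. The heatmap is represented as a small list of rows with toy values, and the function name and the minimum score of 0.5 are assumptions; the post-processing applied to low-confidence joints is not specified in the text and is therefore only flagged, not implemented.

```python
def keypoint_from_heatmap(heatmap, min_score=0.5):
    """Return the highest-confidence pixel of one joint's heatmap."""
    best_x, best_y, best_score = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_score:
                best_x, best_y, best_score = x, y, value
    return {
        "x": best_x,
        "y": best_y,
        "score": best_score,
        "accepted": best_score >= min_score,  # False -> needs post-processing
    }


# Toy 3 x 3 heatmaps (assumed values, not real model outputs).
clear_joint = keypoint_from_heatmap([
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.1],
    [0.0, 0.1, 0.0],
])
weak_joint = keypoint_from_heatmap([
    [0.1, 0.2],
    [0.3, 0.05],
])
```

The first joint is accepted at pixel (1, 1), while the second peaks at only 0.3 and is marked as not accepted so that a later step can supplement it.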

[0116] Finally, the joint position is calculated in the form of (x, y) coordinates, and the joint name (e.g., left front leg, right hind leg, etc.) is indexed and output together. The computer device (100) records this result in a structured form (annotations) such as JSON, or immediately transmits it to a subsequent step (abnormal behavior identification model (400)).

[0117] By specifically determining joint coordinates, postural characteristics such as whether the pet has its ears pulled back, its tail raised, or its legs bent can be quantitatively identified. This is closely related to the identification of abnormal behaviors (e.g., the state of the ears and tail during fear behavior, the angle of tooth exposure during aggressive behavior, etc.).
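One way to quantify such postural characteristics is to compute the angle formed at a joint by its two neighboring keypoints. The Python sketch below shows this calculation; the function name is hypothetical, and which joint triples are actually informative for a given behavior is not stated in the text.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("joint coordinates must be distinct")
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))
```

For example, passing the shoulder, elbow, and paw keypoints of a foreleg as `a`, `b`, and `c` would give a bend angle of roughly 180 degrees for a straight leg and a smaller value for a bent one.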

[0118] The computer device (100) can repeatedly perform pose estimation in consecutive frames to analyze, on the time axis, how long the pet maintained a specific abnormal behavior (aggression, anxiety) pose, whether the movement was sudden, etc. Through this, it is possible to distinguish between one-time warnings and repetitive abnormal behaviors, thereby enabling a more accurate risk assessment.
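The time-axis analysis can be illustrated by grouping consecutive frames that carry the same abnormal label into episodes and measuring their duration. In this sketch the per-frame labels are assumed to be already available, and the rule that two or more episodes count as "repetitive" is a hypothetical criterion chosen only for illustration.

```python
from itertools import groupby


def abnormal_episodes(frame_labels, fps, abnormal_labels):
    """Group consecutive frames with the same abnormal label into episodes."""
    episodes = []
    index = 0
    for label, group in groupby(frame_labels):
        length = len(list(group))
        if label in abnormal_labels:
            episodes.append({
                "label": label,
                "start_frame": index,
                "num_frames": length,
                "duration_s": length / fps,
            })
        index += length
    return episodes


def is_repetitive(episodes, min_episodes=2):
    """Hypothetical rule: several separate episodes count as repetitive."""
    return len(episodes) >= min_episodes


labels = (["normal"] * 3 + ["aggression"] * 6
          + ["normal"] * 2 + ["aggression"] * 3)
episodes = abnormal_episodes(labels, fps=3.0, abnormal_labels={"aggression", "anxiety"})
```

With the sample sequence above, two aggression episodes lasting 2.0 and 1.0 seconds are found, so the behavior would be treated as repetitive rather than as a one-time warning.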

[0119] If multiple pets are extracted individually by their Tracking IDs in the object detection model (200), the pose estimation model (300) independently calculates joint coordinates for the image inputs for each ID. The computer device (100) uses this to manage the poses of 'ID=1' (dog A) and 'ID=2' (dog B) separately, thereby preventing behavioral confusion.

[0120] Next, the computer device can classify the behavior of the pet by inputting a frame image or a pet area image into a multimodal abnormal behavior identification model (S400).

[0121] The abnormal behavior identification model of the present invention is designed with a multimodal (neural) structure composed of an image encoder and a text encoder. A computer device (100) converts a frame image (or pet area image) and each "behavior text label" into embedding vectors, and then measures their similarity to determine which behavior label best matches the image. In this embodiment, a CLIP (Contrastive Language-Image Pre-training) model is used as an example; by performing contrastive learning on image-text pairs, the model learns embeddings whose similarity reflects how well an image and a text match. This increases the mapping accuracy between pet behavior and text labels.

[0122] The present invention can express various types of abnormal behaviors such as aggression, fear, and anxiety using natural language labels such as "baring teeth and growling" or "cat arching its back," and can also add descriptive sentences automatically generated through a text generation model (500).

[0123] The computer device (100) inputs a frame image extracted from a video stream or a pet area image cut out through an object detection model (200) into an image encoder. The image encoder generates an image embedding vector (e.g., 512 dimensions) by utilizing a pre-trained CNN (e.g., ResNet) or Transformer (ViT) based network. The computer device (100) performs image encoding on every frame or every frame at regular intervals to obtain image embeddings. If necessary, accuracy can be improved by overlaying joint coordinate information obtained from a pose estimation model (300) onto the image or by providing additional information (channels) to the image encoder input.

[0124] In the present invention, "multiple behavior text labels expressing abnormal behavior" are defined, for example, "aggressive behavior (dog baring teeth)," "cat puffing up its tail," etc. In addition to a fixed list of labels, the computer device (100) can also utilize sentences automatically generated by a text generation model (500), thereby improving the ability to respond to new abnormal behaviors. The text encoder tokenizes the above sentences and outputs a final text embedding after word and sentence embedding. The CLIP model is trained by comparing a large number of image-text pairs in advance so that correct image-text pairs have high similarity and incorrect pairs have low similarity.

[0125] When training with videos of abnormal pet behavior and text labels describing such behavior, "aggression"-related features are reinforced between the image embeddings and the text embeddings. By applying a partial fine-tuning method, this invention allows newly observed abnormal behavior labels to be added to the model, thereby providing high system flexibility.

[0126] The computer device (100) compares an image embedding generated by an image encoder with a behavior text label embedding generated by a text encoder. For example, it measures similarity using a cosine similarity or dot product method. It calculates the similarity between each text embedding for various behavior labels ("normal behavior," "aggression," "fear," etc.) and the current image embedding, and the label with the highest value becomes the behavior classification result for the corresponding frame. By setting a threshold, if the similarity is above a certain level, it is determined as "abnormal behavior," and if it is below that level, it is classified as another label such as "normal."

[0127] The present invention performs real-time abnormal behavior classification by repeating the above process for each frame of a video stream. Depending on the frame rate and hardware (GPU) performance, it is possible to process tens to hundreds of frames per second.

[0128] The following describes the process for constructing annotation data for classifying abnormal behaviors of the aforementioned pets. Through this process, additional training data can be secured, which can increase the accuracy of the abnormal behavior identification model and enable response to various abnormal behaviors of pets of various breeds. By automatically configuring the annotation data required for abnormal behavior identification and training (learning) through this process, the present invention streamlines data labeling operations and allows for the flexible incorporation of new behavior labels.

[0129] A computer device (100) sequentially acquires frame images from a video stream (or pre-recorded video). This serves as the basic unit of annotation data, and subsequently, pet detection and behavior description are performed. The computer device (100) calculates the bounding box where the pet is located within each frame through an object detection model. Using the bounding box coordinates (x_min, y_min, x_max, y_max), an image of the pet's area is extracted from the original frame. If there are multiple pets, a separate area image is generated for each to distinguish them by ID.

[0130] The present invention takes an animal pose estimation algorithm such as DeepLabCut as an example and, when an image of a pet area is input, calculates keypoint coordinates and a score in the form of (x, y). Quantitative pose information of the pet is obtained by detecting multiple joints such as the head, ears, forelegs, hindlegs, and tail.

[0131] The computer device (100) stores a list of joint coordinates for each frame and pet in a structured manner, such as JSON. For example, keypoints: [[x₁, y₁, score₁], …]. By utilizing a Tracking ID, joint data can be managed so that they do not get mixed up even in a multi-pet environment. Although the purpose of the annotation stage is primarily for data accumulation, this joint information can also be utilized to achieve higher accuracy when actually detecting abnormal behavior.

[0132] The computer device (100) processes the corresponding frame image (optionally along with joint coordinates, background information, etc.) using a text generation model (e.g., a GPT-family model combining an image recognition model and a large language model). The model automatically generates sentences describing specific behaviors of pets ("a dog puffing out its tail and growling," "a cat arching its back"). By referencing video, image, and joint information, the text generation model creates the descriptive sentences referred to as behavior label texts in the present invention. For example, if there was previously only an 'aggressive' label, the model may suggest more detailed label sentences, such as "aggressive behavior of baring teeth and growling." The computer device (100) may display the generated behavior label texts in a UI so that the user can review and modify them. Accuracy and quality are ensured by saving only the texts confirmed after verification as training data.

[0133] The computer device (100) organizes training data by grouping the pet area image, joint coordinates, and generated behavior label text obtained through the above steps based on the same frame and ID. For example, it is stored in a JSON structure as {"bounding_box": ..., "keypoints": ..., "behavior_text": "a dog growling and warning"}. The final generated training data is used to train (or fine-tune) an abnormal behavior identification model (400). The present invention allows the model to learn the relationships between images, joints, and sentences through contrastive learning, and new abnormal behavior labels can be easily added. If necessary, "category_id (aggression, fear, etc.)", "frame_index", etc. are also recorded together to perform accuracy evaluation or model updates thereafter.
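The grouping of detection, pose, and text results by frame and ID can be sketched as follows. The keys "bounding_box", "keypoints", "behavior_text", and "frame_index" follow the text; the key `tracking_id`, the function name, and the rule that incomplete groups are dropped are assumptions made for this illustration.

```python
import json
from collections import defaultdict

REQUIRED = ("bounding_box", "keypoints", "behavior_text")


def build_training_records(detections, poses, texts):
    """Merge per-step outputs that share the same (frame_index, tracking_id)."""
    merged = defaultdict(dict)
    sources = ((detections, "bounding_box"),
               (poses, "keypoints"),
               (texts, "behavior_text"))
    for source, field in sources:
        for item in source:
            key = (item["frame_index"], item["tracking_id"])
            merged[key][field] = item[field]
    records = []
    for (frame_index, tracking_id), fields in sorted(merged.items()):
        # Keep only groups for which all three kinds of information exist.
        if all(name in fields for name in REQUIRED):
            records.append({"frame_index": frame_index,
                            "tracking_id": tracking_id, **fields})
    return records


detections = [
    {"frame_index": 0, "tracking_id": 1, "bounding_box": [100, 150, 300, 400]},
    {"frame_index": 0, "tracking_id": 2, "bounding_box": [10, 10, 50, 50]},
]
poses = [{"frame_index": 0, "tracking_id": 1, "keypoints": [[120.0, 180.0, 0.93]]}]
texts = [{"frame_index": 0, "tracking_id": 1,
          "behavior_text": "a dog growling and warning"}]

records = build_training_records(detections, poses, texts)
serialized = json.dumps(records)
```

In this example only the pet with ID 1 has a bounding box, keypoints, and a behavior text, so a single training record is produced and the incomplete entry for ID 2 is left out.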

[0134] The computer device (100) accumulates the automatically generated annotation data in the form of a DB or file. Whenever new pet behavior is observed in the future, the dataset is reinforced, and the model accuracy is improved through transfer learning.

[0135] The present invention automates the process of manually writing bounding boxes, joints, and behavior descriptions by combining object detection, pose estimation, and text generation models. This significantly reduces costs and time when building large-scale datasets. New behavior sentences proposed by the text generation model (e.g., special behaviors of rare breeds) can also be easily added to the training data, thereby increasing the scalability of the abnormal behavior identification model. Through the linkage between pet area images (visual information), joint coordinates (pose information), and behavior label text (linguistic description), the model performs multimodal contrastive learning more richly. As a result, the abnormal behavior identification model (400) is capable of more accurate and detailed classification.

[0136] The following describes additional training and updates for an abnormal behavior identification model using the aforementioned novel training data. In the present invention, the abnormal behavior identification model (e.g., based on multimodal CLIP) is not limited to fixed behavior labels but can flexibly improve the model through partial fine-tuning (transfer learning) when new abnormal behavior text labels or new image data are added.

[0137] The abnormal behavior identification model of the present invention is based on a CLIP (or a similar multimodal structure) that has been pre-trained with a large dataset. The computer device (100) reuses the image and text embedding capabilities of this model, but performs additional training (fine-tuning) when a new abnormal behavior label or a new frame image appears.

[0138] In the transfer learning process, the computer device (100) does not retrain the entire model, but partially fine-tunes only the upper layers or specific modules. By doing so, it preserves the general image-text mapping (low-level features) that has already been learned, while rapidly adapting to newly added behavior labels or image data. Through this partial fine-tuning method, the present invention extends the model to accurately identify rare cases, such as "aggressive behavior seen only in dogs of a specific breed." In other words, it avoids costly full retraining and maintains and improves model performance at a low cost.
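
The freezing idea can be illustrated without any deep learning framework. The sketch below assumes hypothetical parameter names such as `image_encoder.low.w` and `projection.w`, uses scalar parameters for brevity, and applies a plain gradient-descent step only to parameters whose names match a trainable prefix. A real CLIP fine-tune would achieve the same effect by disabling gradient updates for the frozen layers in whichever framework is used.

```python
def trainable_names(param_names, trainable_prefixes):
    """Return the parameter names that partial fine-tuning may update."""
    prefixes = tuple(trainable_prefixes)
    return {name for name in param_names if name.startswith(prefixes)}

def partial_sgd_step(params, grads, trainable, lr=0.5):
    """One gradient-descent step that leaves frozen parameters untouched."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

# Hypothetical scalar "layers": two lower encoder layers and one upper projection.
params = {"image_encoder.low.w": 1.0, "text_encoder.low.w": 2.0, "projection.w": 1.0}
grads = {"image_encoder.low.w": 0.5, "text_encoder.low.w": 0.5, "projection.w": 0.5}

# Only the upper projection layer is marked trainable; the encoders stay frozen.
trainable = trainable_names(params, ["projection."])
updated = partial_sgd_step(params, grads, trainable)
print(updated)
```

Selecting layers by name prefix is only one convenient convention; the choice of which layers to unfreeze remains a tuning decision.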

[0139] In the present invention, text describing a behavior that is automatically generated by the text generation model (500) (e.g., "The cat arches its back and puffs out its tail") or new labels added directly by the user can be reflected in the model training data. When a new text label and the corresponding video (frame images) or pose information are obtained, the computer device (100) performs partial fine-tuning on the existing model. At this time, the image encoder and text encoder are trained with contrastive learning so that the embedding of the "new abnormal behavior label" and the corresponding image embedding become close. The present invention updates the abnormal behavior identification model periodically (or on an event basis) as new labels and video data are collected. Through this, the model reflects the latest label knowledge and maintains its accuracy.

[0140] When new videos of abnormal behavior are collected in environments where pets have not previously been observed (e.g., at night or in swimming pools) or for specific dog or cat breeds, the computer device (100) uses this data for partial fine-tuning. For example, if there are sufficient example images, the model learns and identifies the "unique ear movements" of rare dog breeds. The computer device (100) trains in small mini-batches on the new frame images and the corresponding behavior labels (or text descriptions). By adjusting the learning rate and updating only the necessary layers (especially the top classification layer or the embedding transformation layer), identification performance for the new class (new behavior) is improved without significantly compromising existing classification capabilities. The updated abnormal behavior identification model (400) undergoes internal validation (measuring accuracy on the new dataset) and is then deployed as the final production (service) version. Users automatically receive abnormal behavior alerts based on the improved model.

[0141] As an example, the pet area image and joint coordinates are obtained from the video through the object detection model (200) and the pose estimation model (300), and behavior label text is generated using the text generation model (500). The computer device (100) evaluates the recognition performance of the existing model on a newly emerged behavior (e.g., "a slight attack showing only half of the teeth"). If the recognition is inaccurate, additional training is required. The computer device (100) sets only some layers of the CLIP model (the top layer or parts of the text and image encoders) to a trainable state and updates them through mini-batch training. Contrastive training is performed for a small number of epochs, using the new behavior label and the corresponding frame image as a positive pair and combinations with other labels as negative pairs. After training, the newly learned parameters are reflected in the abnormal behavior identification model (400), which can then identify the new label with high accuracy.
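
The positive and negative pairing described above corresponds to the symmetric contrastive objective commonly used with CLIP-style models. The following is a small pure-Python sketch of that loss, not the invention's actual training code: the two-dimensional embeddings are toy values, and the temperature of 0.07 is an assumed setting. Matching image and text at the same batch index form the positive pair; every other combination in the batch acts as a negative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def _mean_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    logits = [
        [cosine(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))

# Toy batch: image k should pair with text k.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(images, matched_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(loss_matched < loss_swapped)
```

Training lowers this loss, which pulls each new behavior label toward its frame image and pushes it away from the other labels in the batch.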

[0142] Rare abnormal behaviors or breed-specific behaviors that existing models could not classify can be added easily, which facilitates the refinement and continuous improvement of the invention. Since the model is updated rapidly through partial fine-tuning rather than by retraining all layers, training time and resources are significantly reduced. This enables agile responses even in real-time service environments (e.g., pet monitoring systems). Pet owners or managers can improve the model simply by uploading videos of new abnormal behaviors, thereby increasing system accuracy without requiring specialized knowledge.

[0143] Next, if the classified pet's behavior is abnormal, the computer device can record the type and time of the abnormal behavior and provide a notification to the user (S500).

[0144] The computer device (100) receives a classification result from the abnormal behavior identification model (400) that an abnormal behavior occurred in a specific frame (or frame interval). At this time, it records a specific abnormal behavior label, such as "a dog baring its teeth and growling" or "a cat puffing up its tail and acting erratically," along with the time of occurrence (timestamp).

[0145] Since abnormal behavior can have various types such as aggression, fear, anxiety, and stress, the computer device (100) stores the "type of abnormal behavior" in the form of a label or ID. For example, it can log structured data such as {"time": "2023-06-01 12:05:10", "behavior_type": "aggression", "frame_index": 230, ...}.

[0146] The computer device (100) extracts and stores a portion of the video stream around the time when abnormal behavior occurred (e.g., ±5 seconds) in the form of a video clip when necessary. This allows the situation to be reviewed later.
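
The logging and clip extraction steps can be sketched as follows. The record mirrors the structured example given in the description, while the function names, the frame rate, and the total frame count are hypothetical values chosen for illustration. The clip window is expressed as a frame range clamped to the bounds of the stream.

```python
import json
from datetime import datetime

def make_event_record(when, behavior_type, frame_index):
    """Build a structured log entry for one abnormal behavior event."""
    return {
        "time": when.strftime("%Y-%m-%d %H:%M:%S"),
        "behavior_type": behavior_type,
        "frame_index": frame_index,
    }

def clip_frame_range(event_frame, fps, total_frames, margin_s=5.0):
    """Return the (start, end) frame indices covering +/- margin_s seconds."""
    margin = round(margin_s * fps)
    start = max(0, event_frame - margin)
    end = min(total_frames - 1, event_frame + margin)
    return start, end

record = make_event_record(datetime(2023, 6, 1, 12, 5, 10), "aggression", 230)
log_line = json.dumps(record)

# Assumed stream properties: 30 frames per second, 10000 frames in total.
clip = clip_frame_range(record["frame_index"], fps=30, total_frames=10000)
print(log_line, clip)
```

Near the start or end of a recording the window is simply shortened, so the saved clip may be less than ten seconds long.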

[0147] The computer device (100) provides notifications in different ways depending on the type of abnormal behavior. For example, various channels such as push notifications, SMS, email, and in-app pop-ups can be used. The computer device (100) sends an immediate push notification when the behavior is determined to be "severe aggression," and may provide only summarized notifications (periodic notifications) at regular intervals for "mild anxiety behavior."

[0148] The present invention calculates the risk level by combining various criteria such as the risk level of the pet (breed characteristics, history of past attacks), the risk level of the space (narrow space, presence of children, etc.), and the risk level of the user (whether the user is absent). The computer device (100) provides an immediate notification (emergency notification) if the risk level is above a predetermined level, and a periodic notification (e.g., a notification of events over one hour) if the risk level is below that level, thereby reducing unnecessarily frequent notifications while responding quickly to emergency situations.
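
One simple way to combine the three criteria is a weighted sum compared against a threshold, as sketched below. The specification does not state how the criteria are combined, so the weights, the 0 to 1 score range, the threshold of 0.6, and the function names are all assumptions for illustration.

```python
# Hypothetical weights and threshold; the specification does not fix these values.
WEIGHTS = {"pet": 0.5, "space": 0.3, "user": 0.2}
THRESHOLD = 0.6

def combined_risk(pet, space, user):
    """Weighted sum of pet, space, and user risk scores, each in [0, 1]."""
    scores = {"pet": pet, "space": space, "user": user}
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} risk must be between 0 and 1")
    return sum(WEIGHTS[name] * score for name, score in scores.items())

def notification_mode(risk, threshold=THRESHOLD):
    """Immediate alert for high risk, periodic digest otherwise.

    Treating a risk exactly equal to the threshold as high is an assumption.
    """
    return "immediate" if risk >= threshold else "periodic"

# A breed with an attack history, a child nearby, and the owner absent.
high = combined_risk(pet=0.9, space=0.8, user=1.0)
# A calm pet in an open room with the owner at home.
low = combined_risk(pet=0.2, space=0.1, user=0.0)
print(notification_mode(high), notification_mode(low))
```

A rule-based or learned combination would fit the description equally well; the weighted sum is used here only because it is easy to inspect.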

[0149] Notifications may include the type of abnormal behavior (aggression, anxiety), the time of occurrence (frame or real-time), and short video capture images. Users can check detailed information on the app or web interface and immediately assess their pet's condition.

[0150] The computer device (100) records whether abnormal behavior occurs on a frame-by-frame basis and cumulatively analyzes how often and how continuously abnormal behavior occurs along the time axis. For example, if aggressive behavior occurs three or more times within 10 seconds, it can be considered "repetitive abnormal behavior." The computer device (100) may assign a relatively low warning level to one-time abnormal behavior, but if the behavior is repeated multiple times over a short period, it raises the warning level to adjust the "severity" upward, for example, "Level 1: One-time" → "Level 2: Repetitive" → "Level 3: Continuous occurrence." The method of user notification (push, SMS, etc.) may vary for each level. The present invention applies this logic for distinguishing between one-time and repetitive behavior in real time to convey dangerous situations accurately to guardians or administrators. For example, events can be subdivided into "dog growled briefly but stopped quickly (Level 1, mild)," "dog continued aggressive behavior for 1 minute (Level 3, severe)," etc.
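
The level logic can be sketched over a list of detection timestamps in seconds. The three-detections-in-10-seconds rule and the one-minute example come from the description; the 2-second maximum gap used to decide that detections belong to one continuous episode, the Level 0 return value for an empty list, and the function name are assumptions.

```python
def warning_level(timestamps, window_s=10.0, repeat_count=3,
                  max_gap_s=2.0, continuous_s=60.0):
    """Map abnormal-behavior detection times (seconds) to a warning level.

    0: no detection (assumed), 1: one-time, 2: repetitive, 3: continuous.
    """
    ts = sorted(timestamps)
    if not ts:
        return 0
    # Level 3: detections with no gap above max_gap_s that span continuous_s.
    run_start = ts[0]
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > max_gap_s:
            run_start = cur  # the episode was interrupted, start a new run
        if cur - run_start >= continuous_s:
            return 3
    # Level 2: repeat_count detections inside any window of window_s seconds.
    for i in range(len(ts) - repeat_count + 1):
        if ts[i + repeat_count - 1] - ts[i] <= window_s:
            return 2
    return 1

print(warning_level([5.0]))                           # single growl
print(warning_level([1.0, 4.0, 9.0]))                 # three events within 10 s
print(warning_level([float(t) for t in range(61)]))   # one detection per second for a minute
```

In a live system the same function could be applied to a rolling buffer of recent detections so that the level is re-evaluated as each frame arrives.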

[0151] When abnormal behavior occurs, the computer device (100) displays the "type of abnormal behavior," "time of occurrence," "risk level," etc., on an app or web dashboard. As shown in the examples in FIGS. 6 and 7, normal and abnormal behavior are presented as thumbnails or screenshots to convey the situation visually.

[0152] Even after the notification, the computer device (100) accumulates and stores abnormal behavior information in an internal DB. By utilizing this, changes in pet behavior patterns and the frequency of occurrence during specific time periods (night and day) can be analyzed over the long term. The administrator (user) can view "number of aggression occurrences in the last 24 hours," "average risk level," etc. on the statistics screen.
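
The statistics screen values mentioned above can be derived from the stored event log with a simple filter and average. In this sketch the `risk_level` field, the in-memory list standing in for the internal DB, and the function name are assumptions.

```python
from datetime import datetime, timedelta

def summarize(events, now, behavior_type="aggression", window=timedelta(hours=24)):
    """Count events of one type and average the risk level within a time window."""
    recent = [e for e in events if now - window <= e["time"] <= now]
    count = sum(1 for e in recent if e["behavior_type"] == behavior_type)
    average = sum(e["risk_level"] for e in recent) / len(recent) if recent else None
    return {"count": count, "average_risk_level": average}

# Hypothetical log entries; the oldest one falls outside the 24-hour window.
now = datetime(2023, 6, 2, 12, 0, 0)
events = [
    {"time": datetime(2023, 6, 2, 9, 0, 0), "behavior_type": "aggression", "risk_level": 3},
    {"time": datetime(2023, 6, 1, 18, 0, 0), "behavior_type": "anxiety", "risk_level": 1},
    {"time": datetime(2023, 5, 30, 8, 0, 0), "behavior_type": "aggression", "risk_level": 2},
]

stats = summarize(events, now)
print(stats)
```

The same filter can be narrowed to night or day hours to compare occurrence frequency across time periods.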

[0153] The present invention sends an emergency alert when high-risk abnormal behavior occurs, allowing guardians to physically separate or restrain the pet or calm it remotely via a device, thereby preventing safety accidents in advance. By systematically recording the type, timing, and repetitiveness of classified abnormal behaviors, the present invention enables reproducibility and verification, and allows for immediate action through user notifications. The notification method is varied by applying multiple risk criteria (pet characteristics, frequency of occurrence, etc.), which prevents false positives or excessive alerts while delivering effective notifications quickly in emergency situations. The distinction between one-time and repetitive abnormal behaviors enables early identification of habitual problem behaviors in pets and reduces damage or accidents by prompting guardians to respond as warning levels rise.

[0154] The present invention precisely extracts only the pet area excluding the background using an object detection model (e.g., YOLO). Subsequently, joint coordinates are calculated using a pose estimation model (DeepLabCut), and pet behavior is classified using a multimodal abnormal behavior identification model (CLIP, etc.). This combination of multi-stage AI models provides higher accuracy than conventional simple CNN classification techniques, allowing for detailed reflection of the pet's posture, facial expressions, and joint condition.

[0155] The present invention distinguishes pet behavior at a finer granularity by directly comparing images with text labels such as "dog baring teeth, growling, and in an aggressive posture," rather than using a simple "abnormal behavior = 1" method. Because it responds flexibly to varied behavioral expressions and newly discovered abnormal behavior labels, identification accuracy can be improved continuously.
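
The comparison between an image and a set of text labels reduces to picking the label whose embedding is most similar to the image embedding. The sketch below uses cosine similarity over made-up three-dimensional vectors; in practice the embeddings would come from the image and text encoders. The label "dog resting calmly" and the function name `classify` are assumptions added for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(image_emb, label_embs):
    """Return the best-matching label and the similarity score of every label."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings standing in for text encoder outputs.
label_embs = {
    "dog baring teeth, growling, and in an aggressive posture": [0.9, 0.1, 0.0],
    "dog resting calmly": [0.0, 0.2, 0.9],
}
# Toy embedding standing in for the image encoder output of one frame.
image_emb = [0.8, 0.2, 0.1]

best_label, scores = classify(image_emb, label_embs)
print(best_label)
```

Adding a new behavior then only requires adding one more entry to the label set, which is what makes this approach more flexible than a fixed-class classifier.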

[0156] Object detection models, pose estimation models, and abnormal behavior identification models analyze video streams of tens to hundreds of frames per second in real time through hardware acceleration, such as GPUs. By integrating an immediate notification function, guardians can quickly recognize and respond to abnormal behavior in their pets. When abnormal behavior is detected, the present invention sends a notification (push, SMS, email, etc.) and records the time of occurrence, frame, and behavior label. Users can view a thumbnail image on the user interface and take emergency measures if necessary. By differentiating the level of notification based on risk or recurrence, it immediately delivers necessary information while suppressing unnecessary excessive notifications.

[0157] This invention incorporates behavior labels automatically generated through a text generation model or newly collected video data into the model using a partial fine-tuning (transfer learning) method. Through this, "unique abnormal behaviors exhibited by rare dog breeds" or "newly discovered aggression patterns" can be identified with minimal training.

[0158] Training data (annotations) can be automatically constructed by combining information extracted from an object detection model (200) and a pose estimation model (300) with behavior labels automatically generated by a text generation model (500). This significantly reduces the labeling costs required for model expansion and enables a self-evolving system structure.

[0159] The present invention prevents unexpected accidents or injuries by classifying pets' aggression, fear, and stress behaviors in real time and notifying users. By cumulatively analyzing repetitive abnormal behaviors, it allows owners to promptly seek professional counseling or training, thereby improving the welfare and stability of the pet. The present invention can be used not only at home but also in veterinary clinics, animal shelters, and pet cafes. Real-time monitoring increases management efficiency, and the accumulated data can be widely utilized in behavioral research, healthcare, and the pet tech industry.

[0160] The present invention has high industrial applicability in that it can be applied to a home pet monitoring system, allowing guardians to check and respond to abnormal behaviors of pets in real time even while away from home, and to efficiently manage the condition of pets at animal hospitals, shelters, pet hotels, etc., and prevent safety accidents through the detection of early signs of abnormality.

Claims

1. A method for identifying abnormal behavior in pets using video information, the method comprising: a step of acquiring frame images in frame units of an input video stream; a step of extracting a pet area image through an object detection model to detect the area where the pet is located in the frame image; a step of inputting the pet area image into a deep learning-based pose estimation model to calculate multiple joint coordinates of the pet and obtaining position information of each joint; a step of classifying the behavior of the pet by inputting the frame image or the pet area image into a multimodal abnormal behavior identification model; and a step of recording the type and time of the abnormal behavior and providing a notification to the user when the classified behavior of the pet is abnormal behavior.

2. In Claim 1, A method characterized by the object detection model including a YOLO (You Only Look Once) based neural network structure to track the location of a pet in the frame image, and setting the area within the bounding box detected in the frame image as the pet area image.

3. In Claim 1, The deep learning-based pose estimation model described above includes a DeepLabCut-based animal pose estimation algorithm, and is characterized by identifying multiple joint coordinates of a pet and outputting the name and coordinate value of each joint.

4. In Claim 1, The above-described multimodal abnormal behavior identification model is a CLIP (Contrastive Language-Image Pre-training) based model including an image encoder and a text encoder, characterized by converting a plurality of behavior text labels expressing abnormal pet behavior into text embeddings and determining the behavior text label with the highest similarity by comparing them with the image embeddings of the frame image or the pet area image.

5. In Claim 4, A method characterized in that the above-described multimodal abnormal behavior identification model is fine-tuned by performing contrastive learning using a video of abnormal pet behavior and a text label describing the abnormal behavior, and is trained to determine whether there is abnormal behavior based on the similarity between the image embedding and the text embedding of the behavior text label.

6. In Claim 1, A method characterized by the step of recording the type and time of the abnormal behavior and providing a notification to a user when the classified behavior of a pet is abnormal behavior, wherein the step includes recording the type of abnormal behavior, the time of occurrence, and a video stream containing the abnormal behavior, and providing a notification to the user in a different manner based on the type of abnormal behavior.

7. In Claim 6, A method characterized in that the step of providing a notification to the user in a different manner based on the type of abnormal behavior includes providing an immediate notification when the risk level is above a predetermined level based on at least one criterion among the pet risk level, space risk level, or user risk level for the abnormal behavior, and providing a notification at predetermined time intervals when the risk level is below the predetermined level.

8. In Claim 1, A method characterized by further including a step of determining a warning level by distinguishing between one-time abnormal behavior and repetitive abnormal behavior by cumulatively analyzing the presence of abnormal behavior derived from each frame image along the time axis.

9. In Claim 1, A method further comprising the step of, for constructing annotation data for classifying abnormal behavior of the pet, extracting a pet area image through an object detection model for a frame image containing the pet, inputting the pet area image into a pose estimation model to obtain multiple joint coordinates and position information of the pet, and processing the frame image with a text generation model to generate behavior label text describing the behavior of the pet included in the frame, thereby generating training data including the pet area image, joint coordinates, and generated behavior label text.

10. In Claim 1, A method characterized by the above abnormal behavior identification model performing transfer learning with partial fine-tuning techniques when text labels or frame images regarding new abnormal behavior are added.

11. A computer program stored on a computer-readable storage medium, wherein the computer program performs the following methods for identifying abnormal behavior of a pet, the method comprising: a step of acquiring a frame image in frame units of an input video stream; a step of extracting a pet area image through an object detection model to detect an area where the pet is located in the frame image; a step of acquiring position information of each joint by inputting the pet area image into a deep learning-based pose estimation model to calculate multiple joint coordinates of the pet; a step of classifying the pet's behavior by inputting the frame image or the pet area image into a multimodal abnormal behavior identification model; and a step of recording the type and time of the abnormal behavior and providing a notification to a user when the classified behavior of the pet is abnormal behavior.

12. A computer device comprising: one or more processors; and a memory for storing instructions executable on the one or more processors, wherein the one or more processors acquire a frame image in frame units of an input video stream, extract a pet area image through an object detection model to detect an area where a pet is located in the frame image, input the pet area image into a deep learning-based pose estimation model to calculate multiple joint coordinates of the pet and acquire position information of each joint, input the frame image or the pet area image into a multimodal abnormal behavior identification model to classify the behavior of the pet, and if the classified behavior of the pet is abnormal behavior, record the type and time of the abnormal behavior and provide a notification to the user.