Method and system for detecting human interaction relationship based on monitoring scene
By constructing a dataset of human-object relationships and training a network to calculate the interaction relationships between pedestrians and objects, the problem of low detection efficiency in traditional methods is solved, and efficient detection of human-object interaction relationships is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INTELLIGENT INTER CONNECTION TECH CO LTD
- Filing Date
- 2023-05-16
- Publication Date
- 2026-06-12
Figure CN116682053B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to a method and system for detecting human interaction relationships in a surveillance scene.
Background Technology
[0002] In recent years, with the continuous acceleration of urbanization and the increasing number of motor vehicles, many problems such as urban traffic congestion and parking conflicts have arisen. To address this, an intelligent urban traffic management system has been developed by relying on artificial intelligence algorithms, cloud service platforms, smart hardware devices, and edge computing devices. This system is capable of collecting, processing, and feeding back traffic information in real time and with high accuracy.
[0003] By processing and analyzing video image data from monitored scenes, evidence of traffic violations such as running red lights and speeding can be captured and flagged. Such processing also supports guiding roadside parking, recording parking spaces, and providing real-time updates, predictions, and dissemination of traffic congestion information. However, a higher level of visual understanding is needed to address the interaction between people and objects in monitored scenes. Traditional methods, after detecting people and objects, analyze the relationships between them one by one. Because the types and numbers of targets in monitored scenes are enormous, analyzing people and objects pair by pair is extremely time-consuming and resource-intensive, resulting in low efficiency in detecting human-object interactions.
Summary of the Invention
[0004] The purpose of this invention is to solve the technical problem of low detection efficiency caused by analyzing people and objects one by one in traditional methods. To achieve the above objective, this invention provides a method and system for detecting human interaction relationships based on surveillance scenes.
[0005] This invention provides a method for detecting human interaction relationships in a surveillance scenario, comprising:
[0006] Construct a person relationship dataset, which includes multi-class object detection annotation data, pedestrian pose annotation data, and pedestrian bounding box annotation data;
[0007] The multi-class object detection network, pedestrian pose recognition network, and pedestrian envelope detection network are trained based on the aforementioned person relationship dataset to obtain the category and location of each object, pedestrian pose category, and pedestrian envelope location.
[0008] Based on the category and location of each target, obtain the category and location of the object, and based on the category and location of the object, obtain the object's bounding box position;
[0009] Calculate the distance between the pedestrian and the object based on the pedestrian's bounding box position and the object's bounding box position;
[0010] The interaction relationship between the pedestrian and the object is obtained based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category.
[0011] In one embodiment, obtaining the object's category and location based on the category and location of each target, and calculating the object's bounding box position based on the object's category and location, includes:
[0012] Based on the category and location of the object, calculate the two-dimensional distance between the center point and the corner point of the two-dimensional detection box, where the two-dimensional distance is the radius of the three-dimensional envelope.
[0013] Based on the object's category, the reference radius of the object is obtained. Based on the ratio of the radius of the three-dimensional envelope to the reference radius, the three-dimensional envelope corresponding to the reference radius is transformed proportionally to obtain the position of the object's envelope.
[0014] In one embodiment, calculating the distance between a pedestrian and an object based on the pedestrian's bounding box position and the object's bounding box position includes:
[0015] The pedestrian envelope position is transformed to obtain the pedestrian coordinate position in the world coordinate system;
[0016] The object's bounding box position is transformed to obtain the object's coordinate position in the world coordinate system.
[0017] The distance between the pedestrian and the object is calculated based on the pedestrian's coordinates and the object's coordinates.
[0018] In one embodiment, obtaining the human interaction relationship based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category includes:
[0019] Determine whether the distance between the pedestrian and the object is less than a distance threshold;
[0020] If so, then there is an interaction relationship between the pedestrian and the object, and the interaction relationship between the pedestrian and the object is obtained based on the pedestrian's posture category and the object's category.
[0021] In one embodiment, training a multi-class object detection network, a pedestrian pose recognition network, and a pedestrian bounding box detection network based on the person relationship dataset to obtain the category and location of each object, the pedestrian pose category, and the pedestrian bounding box location includes:
[0022] The monitored scene images in the aforementioned person relationship dataset are input into the multi-class target detection network, which outputs the predicted target category and the predicted target location.
[0023] Based on the predicted target category, the predicted target location, and the multi-category target detection annotation data, a multi-category target detection loss function is constructed.
[0024] The pedestrian images in the aforementioned person relationship dataset are input into the pedestrian pose recognition network and the pedestrian envelope detection network, respectively, and multiple predicted pedestrian pose key points and multiple predicted pedestrian envelope key points are output.
[0025] Calculate the predicted pedestrian center point for each pedestrian based on the predicted target position of each pedestrian, calculate the first offset between each predicted pedestrian posture key point and the predicted pedestrian center point, and calculate the second offset between each predicted pedestrian envelope key point and the predicted pedestrian center point.
[0026] Based on the predicted pedestrian center point, the first offset, the second offset, the pedestrian pose annotation data, and the pedestrian envelope annotation data, a pedestrian keypoint loss function is constructed.
[0027] Based on the multiple predicted pedestrian pose key points and the pedestrian pose annotation data, a pedestrian pose classification loss function is constructed;
[0028] The multi-class object detection network, the pedestrian pose recognition network, and the pedestrian envelope detection network are trained based on the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function to obtain the trained multi-class object detection network, the trained pedestrian pose recognition network, and the trained pedestrian envelope detection network.
[0029] In one embodiment, the present invention provides a human interaction relationship detection system based on a monitored scenario, comprising:
[0030] The dataset construction module is used to construct a person relationship dataset, which includes multi-class object detection annotation data, pedestrian pose annotation data, and pedestrian envelope annotation data.
[0031] The target pedestrian information acquisition module is used to train a multi-class target detection network, a pedestrian pose recognition network, and a pedestrian envelope detection network based on the person relationship dataset, so as to obtain the category and location of each target, the pedestrian pose category, and the pedestrian envelope location.
[0032] The object information acquisition module is used to acquire the category and location of the object based on the category and location of each target, and to obtain the object's bounding box position based on the object's category and location.
[0033] The person distance calculation module is used to calculate the distance between the pedestrian and the object based on the position of the pedestrian's bounding box and the position of the object's bounding box.
[0034] The human interaction relationship generation module is used to obtain the human interaction relationship based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category.
[0035] In one embodiment, the object information acquisition module includes:
[0036] The two-dimensional distance calculation module is used to calculate the two-dimensional distance between the center point and the corner point of the two-dimensional detection box according to the category and position of the object. The two-dimensional distance is the radius of the three-dimensional envelope.
[0037] The object envelope position acquisition module is used to obtain the reference radius of the object according to the object's category, and to perform a proportional transformation on the three-dimensional envelope corresponding to the reference radius according to the ratio of the three-dimensional envelope radius to the reference radius to obtain the object envelope position.
[0038] In one embodiment, the person distance calculation module includes:
[0039] The pedestrian coordinate transformation module is used to transform the position of the pedestrian envelope to obtain the pedestrian coordinate position in the world coordinate system;
[0040] The object coordinate transformation module is used to transform the position of the object's bounding box to obtain the object's coordinate position in the world coordinate system.
[0041] The relative distance calculation module is used to calculate the distance between the pedestrian and the object based on the pedestrian's coordinate position and the object's coordinate position.
[0042] In one embodiment, the human interaction relationship generation module includes:
[0043] The distance determination module is used to determine whether the distance between the pedestrian and the object is less than a distance threshold;
[0044] The relationship acquisition module is used, when the distance between the pedestrian and the object is less than the distance threshold, to determine that an interaction relationship exists between the pedestrian and the object, and to obtain the interaction relationship based on the pedestrian's posture category and the object's category.
[0045] In one embodiment, the target pedestrian information acquisition module includes:
[0046] The target detection module is used to input the monitored scene images in the person relationship dataset into the multi-class target detection network and output the predicted target category and predicted target location.
[0047] The first loss function construction module is used to construct a multi-class target detection loss function based on the predicted target category, the predicted target location, and the multi-class target detection annotation data;
[0048] The pedestrian detection module is used to input pedestrian images from the person relationship dataset into the pedestrian pose recognition network and the pedestrian envelope detection network respectively, and output multiple predicted pedestrian pose key points and multiple predicted pedestrian envelope key points;
[0049] The pedestrian location information calculation module is used to calculate the predicted pedestrian center point of each pedestrian based on the predicted target location of each pedestrian, calculate the first offset between each predicted pedestrian posture key point and the predicted pedestrian center point, and calculate the second offset between each predicted pedestrian envelope key point and the predicted pedestrian center point.
[0050] The second loss function construction module is used to construct a pedestrian keypoint loss function based on the predicted pedestrian center point, the first offset, the second offset, the pedestrian pose annotation data, and the pedestrian envelope annotation data.
[0051] The third loss function construction module is used to construct a pedestrian pose classification loss function based on the multiple predicted pedestrian pose key points and the pedestrian pose annotation data.
[0052] The model training module is used to train the multi-class object detection network, the pedestrian pose recognition network, and the pedestrian envelope detection network according to the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function, so as to obtain the trained multi-class object detection network, the trained pedestrian pose recognition network, and the trained pedestrian envelope detection network.
[0053] The aforementioned method and system for detecting human interaction relationships in surveillance scenarios utilize multi-category target detection, pedestrian pose detection, and pedestrian 3D bounding box detection to determine the category and location of each target, the pedestrian pose category, and the pedestrian bounding box location. Based on the object category and the 2D detection box location, the object's 3D bounding box location is obtained. Given the fixed camera position in a surveillance scenario, the positions of pedestrians and objects in the real scene can be projected using the pedestrian and object bounding box locations, and the actual distance between them can be calculated. Based on this actual distance, the interaction relationship between pedestrians and objects is detected using the pedestrian pose category and the object category, thus obtaining the human interaction relationship in the surveillance scenario. Therefore, the human interaction relationship detection method based on surveillance scenarios provided by this invention eliminates the need for individual combination analysis of people and objects, saving manpower, resources, and time, thereby improving the efficiency of human interaction relationship detection.
Attached Figure Description
[0054] Figure 1 is a flowchart illustrating the steps of the method for detecting human interaction relationships based on a monitoring scenario provided by the present invention.
[0055] Figure 2 is a schematic diagram of the structure of the human interaction relationship detection system based on a monitoring scenario provided by the present invention.
Detailed Implementation
[0056] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0057] Referring to Figure 1, this invention provides a method for detecting human interaction relationships in a surveillance scenario, comprising:
[0058] S10: Construct a person relationship dataset, which includes multi-class object detection annotation data, pedestrian pose annotation data, and pedestrian bounding box annotation data;
[0059] S20: Train the multi-class object detection network, pedestrian pose recognition network and pedestrian envelope detection network based on the person relationship dataset to obtain the category and location of each object, pedestrian pose category and pedestrian envelope location;
[0060] S30: Based on the category and location of each target, obtain the category and location of the object, and based on the category and location of the object, obtain the object's bounding box position;
[0061] S40: Calculate the distance between the pedestrian and the object based on the positions of the pedestrian's bounding box and the object's bounding box;
[0062] S50: Obtain the interaction relationship between the pedestrian and the object based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category.
[0063] In this embodiment, in S10, the person relationship dataset originates from images collected during surveillance scenes. Image frames from surveillance videos of different areas are collected, such as different cities, different streets, different time periods, and different seasons, including different surveillance scene videos and image data.
[0064] Multi-class object detection annotation data is generated by labeling multiple classes of objects in an image using a bounding box tool. These multi-class objects include, but are not limited to, different types of vehicles, pedestrians, non-motorized vehicles, different types of pets, roadblocks, strollers, baby carriages, backpacks, helmets, and suitcases. The annotation includes the object's class C and the bounding box position P = (X, Y, W, H) describing the object's location, where (X, Y) represents the coordinates of the center point of the bounding box, and (W, H) represents the width and height of the bounding box.
[0065] The pedestrian pose annotation data adopts the COCO human pose dataset annotation method, using 17 key points to describe the human pose. In one embodiment, the 17 key points include 0-nose tip, 1-left eye, 2-right eye, 3-left ear, 4-right ear, 5-left shoulder joint, 6-right shoulder joint, 7-left elbow joint, 8-right elbow joint, 9-left wrist joint, 10-right wrist joint, 11-left hip joint, 12-right hip joint, 13-left knee joint, 14-right knee joint, 15-left ankle joint, and 16-right ankle joint. The annotation information of each key point includes (v, x, y), where v represents the visibility of the key point, v=0 indicates invisible, v=1 indicates visible, and (x, y) represents the coordinates of the key point.
[0066] The pedestrian envelope annotation data is generated using eight key points that constitute the pedestrian envelope. Each key point is annotated with its type, coordinates, and visibility. The annotations include four key points where the pedestrian is in contact with the ground and four key points at the level of the pedestrian's head. In one embodiment, based on the position of the pedestrian's left foot, the key point on the left side in contact with the ground in the direction of the pedestrian's left foot is used as a reference and denoted as bf-1. Then, rotating clockwise, the other three ground-contact points are denoted as bf-2, bb-3, and bb-4. The key point on the left side in the direction of the pedestrian's head is denoted as tf-5. Rotating clockwise, the other three head-level key points are denoted as tf-6, tb-7, and tb-8. Visible points are denoted as 1, and invisible points as 0. The key point's category, coordinate position, and visibility are used as label data.
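The three kinds of annotation described above can be pictured as simple records. The following is a minimal sketch only: the class names, field names, and the `validate` and `box_to_corners` helpers are illustrative assumptions, not a storage format specified by this invention.

```python
from dataclasses import dataclass
from typing import List, Tuple

# 17 pose key points in the index order listed above (0 to 16).
POSE_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# 8 envelope key points: four ground-contact points, then four head-level points.
ENVELOPE_KEYPOINT_NAMES = ["bf-1", "bf-2", "bb-3", "bb-4",
                           "tf-5", "tf-6", "tb-7", "tb-8"]

Box = Tuple[float, float, float, float]  # P = (X, Y, W, H): center and size


@dataclass
class Keypoint:
    name: str   # key point type
    v: int      # visibility: 1 = visible, 0 = invisible
    x: float
    y: float


@dataclass
class ObjectAnnotation:
    category: str  # class C
    box: Box


@dataclass
class PedestrianAnnotation:
    box: Box
    pose: List[Keypoint]
    envelope: List[Keypoint]

    def validate(self) -> None:
        """Check key point order and visibility flags (hypothetical helper)."""
        if [k.name for k in self.pose] != POSE_KEYPOINT_NAMES:
            raise ValueError("pose must hold the 17 key points in index order")
        if [k.name for k in self.envelope] != ENVELOPE_KEYPOINT_NAMES:
            raise ValueError("envelope must hold the 8 key points in order")
        for k in self.pose + self.envelope:
            if k.v not in (0, 1):
                raise ValueError("visibility must be 0 or 1")


def box_to_corners(box: Box) -> Box:
    """Convert (X, Y, W, H) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)
```

Keeping the key point names in fixed lists makes the index order explicit, so a record with missing or reordered key points can be rejected before training.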
[0067] In S20, the multi-class object detection network includes a feature extraction network and an object detection network. The feature extraction network includes, but is not limited to, convolutional neural networks such as ResNet, MobileNet, and VGG. The object detection network includes, but is not limited to, multi-class detection methods based on YOLO, SSD, and CenterNet, and outputs the object's class and location information. The object's location information includes the coordinates of the center point of the object's bounding box, the object's length, and its width.
[0068] The pedestrian pose recognition network is used to identify the postures and movements of pedestrians. The pedestrian envelope detection network is used to obtain the 3D bounding boxes of pedestrians in the image coordinate system. Both the pedestrian pose recognition network and the pedestrian envelope detection network use keypoint regression, including a backbone network, a feature aggregation network, and a keypoint prediction network. The backbone network and the feature aggregation network use the same network structure. For the keypoint prediction network, the pedestrian pose recognition network and the pedestrian envelope detection network output different numbers of keypoints.
[0069] In S30, the category and location of an object are selected from multiple target categories and locations. An object can be understood as a target other than pedestrians. By manually annotating the 3D bounding boxes of different objects, a 3D bounding box detection model is trained. Then, based on this model, the 3D bounding box of the object, i.e., the object's bounding box location, is obtained.
[0070] Alternatively, the object's position can be represented by a 2D bounding box. Based on the object's 2D bounding box position and the dimensions of objects in various categories within a 3D bounding box knowledge base, the object's 3D bounding box is estimated, converted into its size within the monitored scene image, and thus the object's bounding box position is obtained.
[0071] In S40 and S50, both the pedestrian and object bounding box positions are 3D bounding boxes. Based on the coordinates of these 3D bounding boxes, the positions of the pedestrian and the object in the world coordinate system, and the distance between them, are calculated. By comparing the distance between the pedestrian and the object with a distance threshold, it can be determined whether an interaction exists between them: an interaction exists when the distance is less than the threshold. When an interaction exists, the pedestrian's action can be determined based on their pose category. For example, if the pedestrian's pose category indicates a kicking motion and the object's category is a soccer ball, the interaction is "a person kicking a soccer ball"; if the pedestrian's pose category indicates a hand-raising motion and the object's category is a basketball, the interaction is "a person playing basketball." Other interaction categories include a person riding a bicycle, a person walking a dog, a person holding a cat, etc.
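The decision in S50 can be sketched as a threshold test followed by a lookup keyed on the pose category and the object category. This is an illustration under assumptions: the pose category strings, the rule table, and the threshold value used in the usage note are hypothetical, and only the two example interactions come from the description above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule table: (pose category, object category) -> interaction.
INTERACTION_RULES: Dict[Tuple[str, str], str] = {
    ("kicking", "soccer ball"): "a person kicking a soccer ball",
    ("raising hand", "basketball"): "a person playing basketball",
}


def detect_interaction(
    distance: float,
    pose_category: str,
    object_category: str,
    distance_threshold: float,
    rules: Dict[Tuple[str, str], str] = INTERACTION_RULES,
) -> Optional[str]:
    """Return the interaction label, or None when no interaction is found."""
    # No interaction unless the pedestrian is closer than the threshold.
    if distance >= distance_threshold:
        return None
    # Within range: the label depends on the pose and the object category.
    return rules.get((pose_category, object_category))
```

For example, `detect_interaction(0.4, "kicking", "soccer ball", 1.0)` yields the soccer label, while the same pose and object at a distance of 3.0 yields `None`. Because the distance test runs first, pairs that are far apart are rejected without any pose or category analysis.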
[0072] This invention proposes a method for detecting human interaction relationships in surveillance scenarios. Based on multi-category target detection, pedestrian pose detection, and pedestrian 3D bounding box detection, it obtains the category and location of each target, the pedestrian pose category, and the pedestrian bounding box location. Based on the object category and the 2D detection box location, the object's 3D bounding box location is obtained. Given the fixed camera position in a surveillance scenario, the positions of pedestrians and objects in the real scene can be projected using the pedestrian and object bounding box locations, and the actual distance between them can be calculated. Based on this actual distance, the interaction relationship between pedestrians and objects is detected using the pedestrian pose category and the object category, thus obtaining the human interaction relationship in the surveillance scenario. Therefore, this invention's method for detecting human interaction relationships in surveillance scenarios eliminates the need for individual combination analysis of people and objects, saving manpower, resources, and time, thereby improving the efficiency of human interaction relationship detection.
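The projection from image positions to real-scene positions is not spelled out in the text. One common way to realize it for a fixed camera is a ground-plane homography obtained from a one-time calibration; the sketch below assumes that approach. The 3x3 matrix `h`, the choice of the mean of the four ground-contact key points as the reference position, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]


def apply_homography(h: Matrix3, point: Point) -> Point:
    """Map an image point to ground-plane coordinates with a 3x3 homography."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


def ground_center(envelope: List[Point]) -> Point:
    """Mean of the first four envelope key points (the ground-contact points)."""
    bottom = envelope[:4]
    return (sum(p[0] for p in bottom) / 4.0, sum(p[1] for p in bottom) / 4.0)


def world_distance(h: Matrix3, pedestrian_envelope: List[Point],
                   object_envelope: List[Point]) -> float:
    """Distance between pedestrian and object on the ground plane."""
    px, py = apply_homography(h, ground_center(pedestrian_envelope))
    ox, oy = apply_homography(h, ground_center(object_envelope))
    return math.hypot(px - ox, py - oy)
```

Averaging the four ground-contact points before projecting is a simplification; projecting each point and averaging afterwards would be an equally plausible reading.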
[0073] In one embodiment, S20, training the multi-class object detection network, the pedestrian pose recognition network, and the pedestrian bounding box detection network based on the person relationship dataset to obtain the category and location of each object, the pedestrian pose category, and the pedestrian bounding box location, includes:
[0074] S210: Input the monitored scene images from the person relationship dataset into the multi-class target detection network, and output the predicted target category and predicted target location;
[0075] S220: Construct a multi-class target detection loss function based on the predicted target category, predicted target location, and multi-class target detection annotation data;
[0076] S230: Input the pedestrian images in the person relationship dataset into the pedestrian pose recognition network and the pedestrian envelope detection network respectively, and output multiple predicted pedestrian pose key points and multiple predicted pedestrian envelope key points;
[0077] S240: Calculate the predicted pedestrian center point for each pedestrian based on the predicted target position for each pedestrian, calculate the first offset between each predicted pedestrian posture key point and the predicted pedestrian center point, and calculate the second offset between each predicted pedestrian envelope key point and the predicted pedestrian center point;
[0078] S250: Construct the pedestrian key point loss function based on the predicted pedestrian center point, first offset, second offset, pedestrian pose annotation data, and pedestrian envelope annotation data;
[0079] S260: Construct a pedestrian pose classification loss function based on multiple predicted pedestrian pose key points and pedestrian pose annotation data;
[0080] S270: Train the multi-class object detection network, the pedestrian pose recognition network, and the pedestrian envelope detection network based on the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function, to obtain the trained multi-class object detection network, the trained pedestrian pose recognition network, and the trained pedestrian envelope detection network.
[0081] In this embodiment, the multi-class object detection network outputs the category and location information of each target. Targets whose category is pedestrian are selected from the multi-class object detection results. Based on the coordinates of the pedestrian's 2D detection bounding box, the pedestrian is cropped from the overall image, and the cropped image is resized to form a pedestrian image. The coordinates of the 17 key points representing the pedestrian's pose and the 8 key points representing the pedestrian's 3D envelope are transformed in the same way, giving the key point coordinates in the cropped and resized pedestrian image.
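The coordinate transformation that accompanies the crop and resize can be sketched as a translation to the crop origin followed by per-axis scaling. The function name and the `out_size` parameter are assumptions; the text does not specify a target image size.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (X, Y, W, H): center and size


def transform_keypoints_to_crop(
    keypoints: List[Point], box: Box, out_size: Tuple[int, int]
) -> List[Point]:
    """Map key points from full-image coordinates into the resized crop.

    box is the pedestrian's 2D detection box in the full image;
    out_size is the (width, height) of the resized pedestrian image.
    """
    x, y, w, h = box
    x_min, y_min = x - w / 2, y - h / 2    # top-left corner of the crop
    scale_x = out_size[0] / w              # horizontal resize factor
    scale_y = out_size[1] / h              # vertical resize factor
    return [((kx - x_min) * scale_x, (ky - y_min) * scale_y)
            for kx, ky in keypoints]
```

A key point at the center of the detection box lands at the center of the resized image, which gives a quick sanity check on annotated data.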
[0082] The resized pedestrian images are input into the pedestrian pose recognition network and the pedestrian bounding box detection network, respectively. Both networks consist of three parts: a backbone network, a feature aggregation network, and a keypoint prediction network. The backbone network is responsible for feature extraction from the pedestrian images.
[0083] The backbone network stacks multiple convolutional combination layers, each consisting of a convolutional layer, a normalization layer, and an activation function layer. Downsampling is performed once in each convolutional combination operation. Suitable backbone networks include, but are not limited to, ResNet, VGG, and MobileNet. Normalization layers include, but are not limited to, instance normalization layers and adaptive instance normalization layers. Non-linear activation layers include, but are not limited to, ReLU and Leaky ReLU non-linear activation functions. The feature aggregation network aggregates high- and low-level features extracted from different layers in the backbone network, providing richer feature representations for subsequent keypoint detection. Fusing high- and low-level features in this way improves the accuracy of the subsequent keypoint detection tasks.
[0084] The keypoint prediction network takes the aggregated feature map as input and predicts the class and position of each of the 17 pedestrian pose keypoints (or the 8 pedestrian envelope keypoints). The pedestrian center point is calculated from the coordinates of each pedestrian's 2D detection box. The offset of each pedestrian pose keypoint (or pedestrian envelope keypoint) relative to the pedestrian center point is used as the regression target of the keypoint detection model. The width and height of the pedestrian are also included in the regression. Incorporating the pedestrian center point and the width and height constraints reduces the regression difficulty of the keypoint detection task and improves keypoint detection accuracy, addressing the difficulty caused by the large number of pedestrian targets in surveillance scenarios. The L1 distance is used to calculate the offset of each keypoint from the center point. The position of a keypoint is the position of the pedestrian center point plus the corresponding offset.
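The center-plus-offset representation can be sketched as follows. The function names are assumptions, and the loss shown is a plain mean absolute error over offset components, which is one straightforward reading of the L1 distance mentioned above.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def encode_offsets(center: Point, keypoints: List[Point]) -> List[Point]:
    """Regression targets: each key point expressed relative to the center."""
    cx, cy = center
    return [(x - cx, y - cy) for x, y in keypoints]


def decode_keypoints(center: Point, offsets: List[Point]) -> List[Point]:
    """Recover key point positions as center point plus offset."""
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]


def l1_offset_loss(predicted: List[Point], target: List[Point]) -> float:
    """Mean absolute error over all offset components."""
    total = 0.0
    for (pdx, pdy), (tdx, tdy) in zip(predicted, target):
        total += abs(pdx - tdx) + abs(pdy - tdy)
    return total / (2 * len(target))
```

Encoding and then decoding with the same center returns the original key points, so the representation loses no information while keeping the regression targets small and centered on each pedestrian.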
[0085] The overall loss function is formed by combining the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function: $L = \alpha_1 L_{OD} + \alpha_2 L_{Keypoint} + \alpha_3 L_{CLS}$.
[0086] α1, α2, and α3 are the weight coefficients of the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function, respectively, and can be set according to the actual situation.
[0087] $L_{OD}$ represents the multi-class object detection loss function, which includes two parts, object classification and location regression: $L_{OD} = \beta_1 L_{class} + \beta_2 L_{bbox}$. $L_{class}$ represents the classification loss function for the target and uses a weighted softmax classification loss function, which addresses the inaccurate identification of categories that occur infrequently, a problem caused by the large number of target categories in the human interaction relationship detection task. Common categories include pedestrians, vehicles, and non-motorized vehicles. Categories with low frequency of occurrence include shopping bags, water cups, pets, pianos, and camels. $L_{bbox}$ represents the regression loss function for the target detection bounding box and uses the smooth L1 loss function.
[0088] $L_{Keypoint}$ represents the pedestrian keypoint loss function, a regression loss function for the pedestrian pose keypoints and the pedestrian 3D envelope keypoints. It consists of a regression loss function for the pedestrian center point and a regression loss function for the first offset (or the second offset) between the pedestrian center point and the pedestrian pose keypoints (or the pedestrian envelope keypoints).
[0089] The pedestrian keypoint loss function is expressed as $L_{Keypoint} = \gamma_1 L_{reg} + \gamma_2 L_{offset}$. $L_{reg}$ represents the regression loss function of the pedestrian center point, and $L_{offset}$ represents the regression loss function of the first offset (or the second offset) between the pedestrian center point and the pedestrian pose keypoints (or the pedestrian envelope keypoints). Both parts of the pedestrian keypoint loss function use the L1 loss (mean absolute error). The pedestrian pose classification loss function $L_{CLS}$ uses the standard softmax classification loss function.
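The loss terms named above can be sketched in plain Python for a single sample. This is an illustration under assumptions: the function names, the per-class weight list, the smooth L1 transition point `beta`, and the default coefficients of 1.0 are all hypothetical, since the text leaves the weights to be set according to the actual situation.

```python
import math
from typing import Sequence


def weighted_softmax_ce(logits: Sequence[float], target: int,
                        class_weights: Sequence[float]) -> float:
    """Softmax cross-entropy scaled by the weight of the true class."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return class_weights[target] * (log_sum_exp - logits[target])


def smooth_l1(pred: Sequence[float], target: Sequence[float],
              beta: float = 1.0) -> float:
    """Smooth L1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def detection_loss(l_class: float, l_bbox: float,
                   beta1: float = 1.0, beta2: float = 1.0) -> float:
    return beta1 * l_class + beta2 * l_bbox


def keypoint_loss(l_reg: float, l_offset: float,
                  gamma1: float = 1.0, gamma2: float = 1.0) -> float:
    return gamma1 * l_reg + gamma2 * l_offset


def total_loss(l_od: float, l_keypoint: float, l_cls: float,
               alpha1: float = 1.0, alpha2: float = 1.0,
               alpha3: float = 1.0) -> float:
    """Weighted sum of detection, keypoint, and pose classification losses."""
    return alpha1 * l_od + alpha2 * l_keypoint + alpha3 * l_cls
```

Giving rare categories a class weight above 1 increases their contribution to the classification term, which is the effect the weighted softmax loss is meant to achieve for infrequent targets.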
[0090] Based on the trained multi-class target detection network, the trained pedestrian pose recognition network, and the trained pedestrian envelope detection network, the monitoring image under test is detected sequentially to obtain the category and location of each target, the pedestrian pose category, and the pedestrian envelope location, thereby realizing the subsequent detection steps S30 to S50 and obtaining the human interaction relationship corresponding to the monitoring image under test.
[0091] In one embodiment, S30, obtaining the category and location of the object based on the category and location of each target, and calculating the object's bounding box position based on the category and location of the object, includes:
[0092] S310: Calculate the two-dimensional distance between the center point and a corner point of the two-dimensional detection box according to the category and position of the object; this two-dimensional distance is taken as the radius of the three-dimensional envelope.
[0093] S320: Based on the object's category, obtain the object's reference radius; then, based on the ratio of the three-dimensional envelope radius to the reference radius, proportionally scale the three-dimensional envelope corresponding to the reference radius to obtain the object's envelope position.
[0094] In this embodiment, different categories of objects include, but are not limited to, different types of vehicles such as cars, buses, and trucks; non-motorized vehicles such as bicycles and electric vehicles; different types of pets such as cats and dogs; and other common objects such as suitcases and shopping bags. A 3D bounding box knowledge base for different categories of objects is constructed, including the object's category, reference radius, and 3D bounding box information. Based on the position of the object's 2D detection box, the distance from the center point of the 2D detection box to any corner point of the 2D detection box is used as the radius of the object's 3D bounding box. Based on the length and width of the object's 2D detection box position output in S20, the radius of the object's 3D bounding box can be calculated.
[0095] Based on the object's category, the corresponding reference radius and 3D envelope information are retrieved from the 3D envelope knowledge base. The size of the 3D envelope information retrieved from the knowledge base is then scaled proportionally based on the ratio of the 3D envelope radius to the reference radius, thereby obtaining the corresponding object envelope position.
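Steps S310 and S320 can be sketched as follows. The knowledge base contents, the units, and the representation of the 3D envelope as a `(length, width, height)` tuple are hypothetical values chosen for illustration; the source only states that the knowledge base stores a category, a reference radius, and 3D envelope information.

```python
import math

# Hypothetical knowledge base: category -> (reference radius, reference 3D size).
# All numbers are illustrative and do not come from the source.
ENVELOPE_KB = {
    "car": (100.0, (4.5, 1.8, 1.5)),
    "bicycle": (60.0, (1.7, 0.6, 1.1)),
}

def envelope_radius(box_width, box_height):
    """S310: distance from the 2D detection box center to any corner."""
    return math.hypot(box_width / 2.0, box_height / 2.0)

def scaled_envelope(category, box_width, box_height):
    """S320: scale the reference envelope by radius / reference radius."""
    ref_radius, ref_size = ENVELOPE_KB[category]
    ratio = envelope_radius(box_width, box_height) / ref_radius
    return tuple(dim * ratio for dim in ref_size)
```

For example, a detection box whose center-to-corner distance is half the reference radius yields an envelope half the reference size in every dimension.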
[0096] In one embodiment, S40, calculating the distance between the pedestrian and the object based on the pedestrian's bounding box position and the object's bounding box position includes:
[0097] S410: Perform coordinate transformation on the pedestrian envelope position to obtain the pedestrian coordinate position in the world coordinate system.
[0098] S420: Perform coordinate transformation on the object's bounding box position to obtain the object's coordinate position in the world coordinate system.
[0099] S430: Calculate the distance between the pedestrian and the object based on the pedestrian coordinate position and the object coordinate position.
[0100] In this embodiment, the positions of the pedestrian and object bounding boxes are transformed into their positions and distances in the world coordinate system. Based on the characteristic that the position of the surveillance camera remains unchanged, the coordinates of the 3D bounding boxes of the pedestrian and object obtained from the image coordinate system are transformed to obtain their coordinate positions in the world coordinate system, which can be represented as p1 = (x1, y1, z1) and p2 = (x2, y2, z2), respectively. The distance between the pedestrian and object is calculated based on their coordinate positions in the world coordinate system, and can be expressed as...
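The distance expression is elided in the source. One natural reading, assumed here rather than confirmed by the text, is the Euclidean distance between the world coordinates p1 and p2. A minimal Python sketch under that assumption:

```python
import math

def pedestrian_object_distance(p1, p2):
    """Euclidean distance between two 3D world-coordinate points.

    Assumption: the source does not state the metric; Euclidean distance
    is used here only as an illustration.
    """
    (x1, y1, z1), (x2, y2, z2) = p1, p2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
```

The transformation from image coordinates to world coordinates is not sketched here, because it depends on camera calibration details that the source does not give.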
[0101] In one embodiment, S50, obtaining the human interaction relationship based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category, includes:
[0102] S510: Determine whether the distance between the pedestrian and the object is less than the distance threshold.
[0103] S520: If so, an interaction relationship exists between the pedestrian and the object, and that interaction relationship is obtained according to the pedestrian posture category and the object category.
[0104] In this embodiment, the distance threshold can be set to 1 to 2 meters, with the specific value chosen according to actual conditions. If the distance between the pedestrian and the object is less than the distance threshold, the pedestrian and the object are considered to have an interaction relationship. If the distance is not less than the distance threshold, the pedestrian and the object are far apart and no interaction relationship exists. For pedestrians and objects that have an interaction relationship, the specific interaction relationship is obtained based on the pedestrian's posture category and the object's category.
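The decision in S510 and S520 can be sketched as a threshold test followed by a lookup. The lookup table, its labels, the fallback label, and the default threshold of 1.5 meters (a value inside the 1 to 2 meter range mentioned above) are hypothetical; the source does not describe how the posture category and object category are mapped to a relationship.

```python
# Hypothetical mapping: (pose category, object category) -> interaction label.
INTERACTIONS = {
    ("riding", "bicycle"): "rides",
    ("standing", "suitcase"): "carries",
}

def interaction(distance_m, pose, obj_category, threshold_m=1.5):
    """Return an interaction label, or None when no interaction exists."""
    if distance_m >= threshold_m:
        # S510: not less than the threshold, so no interaction relationship.
        return None
    # S520: the pair interacts; derive the label from pose and object category.
    return INTERACTIONS.get((pose, obj_category), "unspecified interaction")
```

Because the test is applied per pedestrian-object pair before any relationship lookup, distant pairs are discarded cheaply, which is consistent with the efficiency goal stated in the background.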
[0105] Referring to Figure 2, this invention provides a system for detecting human interaction relationships in a surveillance scene. The system includes a dataset construction module 10, a target pedestrian information acquisition module 20, an object information acquisition module 30, a person distance calculation module 40, and a person interaction relationship generation module 50. The dataset construction module 10 is used to construct a person relationship dataset, which includes multi-class object detection annotation data, pedestrian pose annotation data, and pedestrian bounding box annotation data. The target pedestrian information acquisition module 20 is used to train a multi-class object detection network, a pedestrian pose recognition network, and a pedestrian bounding box detection network based on the person relationship dataset to obtain the category and location of each target, the pedestrian pose category, and the pedestrian bounding box location.
[0106] The object information acquisition module 30 is used to acquire the object's category and location based on the category and location of each target, and to obtain the object's bounding box position based on the object's category and location. The person distance calculation module 40 is used to calculate the distance between the pedestrian and the object based on the pedestrian's bounding box position and the object's bounding box position. The person interaction relationship generation module 50 is used to obtain the person interaction relationship based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category.
[0107] In this embodiment, the description of the dataset construction module 10 can be found in the description of S10 in the above embodiment. The description of the target pedestrian information acquisition module 20 can be found in the description of S20 in the above embodiment. The description of the object information acquisition module 30 can be found in the description of S30 in the above embodiment. The description of the person distance calculation module 40 can be found in the description of S40 in the above embodiment. The description of the person interaction relationship generation module 50 can be found in the description of S50 in the above embodiment.
[0108] In one embodiment, the object information acquisition module 30 includes a two-dimensional distance calculation module and an object envelope position acquisition module. The two-dimensional distance calculation module calculates the two-dimensional distance between the center point and corner points of the two-dimensional detection box based on the object's category and location; the two-dimensional distance is the radius of the three-dimensional envelope. The object envelope position acquisition module obtains the object's reference radius based on its category, and proportionally transforms the three-dimensional envelope corresponding to the reference radius based on the ratio of the three-dimensional envelope radius to the reference radius to obtain the object envelope position.
[0109] In this embodiment, the description of the two-dimensional distance calculation module can be found in the description of S310 in the above embodiment. The description of the object envelope position acquisition module can be found in the description of S320 in the above embodiment.
[0110] In one embodiment, the person distance calculation module 40 includes a pedestrian coordinate transformation module, an object coordinate transformation module, and a relative distance calculation module. The pedestrian coordinate transformation module transforms the pedestrian's bounding box position to obtain the pedestrian's coordinate position in the world coordinate system. The object coordinate transformation module transforms the object's bounding box position to obtain the object's coordinate position in the world coordinate system. The relative distance calculation module calculates the distance between the pedestrian and the object based on their coordinate positions.
[0111] In this embodiment, the description of the pedestrian coordinate transformation module can be found in the description of S410 in the above embodiment. The description of the object coordinate transformation module can be found in the description of S420 in the above embodiment. The description of the relative distance calculation module can be found in the description of S430 in the above embodiment.
[0112] In one embodiment, the person interaction relationship generation module 50 includes a distance determination module and a relationship acquisition module. The distance determination module is used to determine whether the distance between the pedestrian and the object is less than a distance threshold. The relationship acquisition module is used to determine that an interaction relationship exists between the pedestrian and the object if the distance is less than the distance threshold, and to obtain the human interaction relationship based on the pedestrian's posture category and the object's category.
[0113] In this embodiment, the description of the distance determination module can be found in the description of S510 in the above embodiment. The description of the relationship acquisition module can be found in the description of S520 in the above embodiment.
[0114] In one embodiment, the target pedestrian information acquisition module 20 includes a target detection module, a first loss function construction module, a pedestrian detection module, a pedestrian location information calculation module, a second loss function construction module, a third loss function construction module, and a model training module. The target detection module is used to input the monitored scene images from the person relationship dataset into a multi-class target detection network, and output the predicted target category and predicted target location. The first loss function construction module is used to construct a multi-class target detection loss function based on the predicted target category, predicted target location, and multi-class target detection annotation data.
[0115] The pedestrian detection module takes pedestrian images from the person relationship dataset and inputs them into the pedestrian pose recognition network and the pedestrian envelope detection network, respectively, outputting multiple predicted pedestrian pose keypoints and multiple predicted pedestrian envelope keypoints. The pedestrian location information calculation module calculates the predicted pedestrian center point for each pedestrian based on their predicted target location, calculates the first offset between each predicted pedestrian pose keypoint and the predicted pedestrian center point, and calculates the second offset between each predicted pedestrian envelope keypoint and the predicted pedestrian center point. The second loss function construction module constructs a pedestrian keypoint loss function based on the predicted pedestrian center point, the first offset, the second offset, the pedestrian pose annotation data, and the pedestrian envelope annotation data.
[0116] The third loss function construction module is used to construct a pedestrian pose classification loss function based on multiple predicted pedestrian pose keypoints and pedestrian pose annotation data. The model training module is used to train the multi-class object detection network, pedestrian pose recognition network, and pedestrian envelope detection network based on the multi-class object detection loss function, pedestrian keypoint loss function, and pedestrian pose classification loss function, resulting in the trained multi-class object detection network, trained pedestrian pose recognition network, and trained pedestrian envelope detection network.
[0117] In this embodiment, the description of the target detection module can be found in the description of S210 in the above embodiment. The description of the first loss function construction module can be found in the description of S220 in the above embodiment. The description of the pedestrian detection module can be found in the description of S230 in the above embodiment. The description of the pedestrian location information calculation module can be found in the description of S240 in the above embodiment. The description of the second loss function construction module can be found in the description of S250 in the above embodiment. The description of the third loss function construction module can be found in the description of S260 in the above embodiment. The description of the model training module can be found in the description of S270 in the above embodiment.
[0118] In the various embodiments described above, the specific order or hierarchy of steps in the disclosed process is merely an example. It should be understood that, based on design preferences, the specific order or hierarchy of steps in the process may be rearranged without departing from the scope of this disclosure. The appended method claims present the elements of the various steps in an exemplary order and are not intended to be limited to the specific order or hierarchy presented.
[0119] Those skilled in the art will also understand that the various illustrative logical blocks, modules, and steps listed in the embodiments of the present invention can be implemented by electronic hardware, computer software, or a combination of both. To clearly demonstrate the interchangeability of hardware and software, the functions of the various illustrative components, modules, and steps described above have been generally described. Whether such functionality is implemented through hardware or software depends on the specific application and the overall system design requirements. Those skilled in the art can implement the described functions using various methods for each specific application, but such implementation should not be construed as exceeding the scope of protection of the embodiments of the present invention.
[0120] The various illustrative logic blocks or modules described in the embodiments of this invention can be implemented, or can perform the described functions, using a general-purpose processor, digital signal processor, application-specific integrated circuit (ASIC), field-programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. The general-purpose processor can be a microprocessor; alternatively, it can be any conventional processor, controller, microcontroller, or state machine. The processor can also be implemented using a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
[0121] The steps of the methods or algorithms described in the embodiments of this invention can be directly embedded in hardware, a software module executed by a processor, or a combination of both. The software module can be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium in the art. Exemplarily, the storage medium can be connected to the processor so that the processor can read information from and write information to the storage medium. Optionally, the storage medium can also be integrated into the processor. The processor and storage medium can be housed in an ASIC, which can be housed in a user terminal. Optionally, the processor and storage medium can also be housed in different components of the user terminal.
[0122] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for detecting human interaction relationships based on a monitoring scene, characterized in that it includes: Construct a person relationship dataset, which includes multi-class object detection annotation data, pedestrian pose annotation data, and pedestrian bounding box annotation data; Train a multi-class object detection network, a pedestrian pose recognition network, and a pedestrian envelope detection network based on the person relationship dataset to obtain the category and location of each target, the pedestrian pose category, and the pedestrian envelope location; Based on the category and location of each target, obtain the category and location of the object, and based on the category and location of the object, obtain the object's bounding box position; Calculate the distance between the pedestrian and the object based on the pedestrian's bounding box position and the object's bounding box position; Obtain the interaction relationship between the pedestrian and the object based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category; The step of calculating the distance between the pedestrian and the object based on the pedestrian's bounding box position and the object's bounding box position includes: Transform the pedestrian envelope position to obtain the pedestrian coordinate position in the world coordinate system; Transform the object's bounding box position to obtain the object's coordinate position in the world coordinate system; Calculate the distance between the pedestrian and the object based on the pedestrian's coordinate position and the object's coordinate position.
The step of obtaining the human interaction relationship based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category includes: Determine whether the distance between the pedestrian and the object is less than a distance threshold; If so, then there is an interaction relationship between the pedestrian and the object, and the interaction relationship between the pedestrian and the object is obtained according to the pedestrian posture category and the object category; The step of training a multi-class object detection network, a pedestrian pose recognition network, and a pedestrian bounding box detection network based on the person relationship dataset to obtain the category and location of each object, the pedestrian pose category, and the pedestrian bounding box location includes: The monitored scene images in the aforementioned person relationship dataset are input into the multi-class target detection network, which outputs the predicted target category and the predicted target location. Based on the predicted target category, the predicted target location, and the multi-category target detection annotation data, a multi-category target detection loss function is constructed. The pedestrian images in the aforementioned person relationship dataset are input into the pedestrian pose recognition network and the pedestrian envelope detection network, respectively, and multiple predicted pedestrian pose key points and multiple predicted pedestrian envelope key points are output. Calculate the predicted pedestrian center point for each pedestrian based on the predicted target position of each pedestrian, calculate the first offset between each predicted pedestrian posture key point and the predicted pedestrian center point, and calculate the second offset between each predicted pedestrian envelope key point and the predicted pedestrian center point. 
Based on the predicted pedestrian center point, the first offset, the second offset, the pedestrian pose annotation data, and the pedestrian envelope annotation data, a pedestrian keypoint loss function is constructed. Based on the multiple predicted pedestrian pose key points and the pedestrian pose annotation data, a pedestrian pose classification loss function is constructed; The multi-class object detection network, the pedestrian pose recognition network, and the pedestrian envelope detection network are trained based on the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function to obtain the trained multi-class object detection network, the trained pedestrian pose recognition network, and the trained pedestrian envelope detection network.
2. The method of claim 1, wherein the step of obtaining the object's category and location based on the category and location of each target, and calculating the object's bounding box position based on the object's category and location, includes: Based on the category and location of the object, calculate the two-dimensional distance between the center point and the corner point of the two-dimensional detection box, where the two-dimensional distance is the radius of the three-dimensional envelope. Based on the object's category, the reference radius of the object is obtained. Based on the ratio of the radius of the three-dimensional envelope to the reference radius, the three-dimensional envelope corresponding to the reference radius is transformed proportionally to obtain the position of the object's envelope.
3. A system for detecting human interaction relationships in a monitored scene, characterized in that it includes: The dataset construction module is used to construct a person relationship dataset, which includes multi-class object detection annotation data, pedestrian pose annotation data, and pedestrian envelope annotation data. The target pedestrian information acquisition module is used to train a multi-class target detection network, a pedestrian pose recognition network, and a pedestrian envelope detection network based on the person relationship dataset, so as to obtain the category and location of each target, the pedestrian pose category, and the pedestrian envelope location. The object information acquisition module is used to acquire the category and location of the object based on the category and location of each target, and to obtain the object's bounding box position based on the object's category and location. The person distance calculation module is used to calculate the distance between the pedestrian and the object based on the position of the pedestrian's bounding box and the position of the object's bounding box. The character interaction relationship generation module is used to obtain the character interaction relationship based on the distance between the pedestrian and the object, the pedestrian's posture category, and the object's category; The person distance calculation module includes: The pedestrian coordinate transformation module is used to transform the position of the pedestrian envelope to obtain the pedestrian coordinate position in the world coordinate system; The object coordinate transformation module is used to transform the position of the object's bounding box to obtain the object's coordinate position in the world coordinate system.
The relative distance calculation module is used to calculate the distance between the pedestrian and the object based on the pedestrian's coordinate position and the object's coordinate position. The character interaction relationship generation module includes: The distance determination module is used to determine whether the distance between the pedestrian and the object is less than a distance threshold; The relationship acquisition module is used to determine if there is an interaction relationship between the pedestrian and the object, and to obtain the interaction relationship between the pedestrian and the object based on the pedestrian's posture category and the object's category. The target pedestrian information acquisition module includes: The target detection module is used to input the monitored scene images in the person relationship dataset into the multi-class target detection network and output the predicted target category and predicted target location. The first loss function construction module is used to construct a multi-class target detection loss function based on the predicted target category, the predicted target location, and the multi-class target detection annotation data; The pedestrian detection module is used to input pedestrian images from the person relationship dataset into the pedestrian pose recognition network and the pedestrian envelope detection network respectively, and output multiple predicted pedestrian pose key points and multiple predicted pedestrian envelope key points; The pedestrian location information calculation module is used to calculate the predicted pedestrian center point of each pedestrian based on the predicted target location of each pedestrian, calculate the first offset between each predicted pedestrian posture key point and the predicted pedestrian center point, and calculate the second offset between each predicted pedestrian envelope key point and the predicted pedestrian center point. 
The second loss function construction module is used to construct a pedestrian keypoint loss function based on the predicted pedestrian center point, the first offset, the second offset, the pedestrian pose annotation data, and the pedestrian envelope annotation data. The third loss function construction module is used to construct a pedestrian pose classification loss function based on the multiple predicted pedestrian pose key points and the pedestrian pose annotation data. The model training module is used to train the multi-class object detection network, the pedestrian pose recognition network, and the pedestrian envelope detection network according to the multi-class object detection loss function, the pedestrian keypoint loss function, and the pedestrian pose classification loss function, so as to obtain the trained multi-class object detection network, the trained pedestrian pose recognition network, and the trained pedestrian envelope detection network.
4. The human interaction relationship detection system based on a monitoring scene according to claim 3, characterized in that the object information acquisition module includes: The two-dimensional distance calculation module is used to calculate the two-dimensional distance between the center point and the corner point of the two-dimensional detection box according to the category and position of the object, where the two-dimensional distance is the radius of the three-dimensional envelope. The object envelope position acquisition module is used to obtain the reference radius of the object according to the object's category, and to perform a proportional transformation on the three-dimensional envelope corresponding to the reference radius according to the ratio of the three-dimensional envelope radius to the reference radius to obtain the object envelope position.