Human-computer collaborative optimization method and system based on artificial intelligence

By analyzing surveillance videos to identify individuals in need of assistance and proactively providing help, the problem of robots passively waiting has been solved, and the utilization rate of robots in areas such as conference venues and hospitals has been improved.

CN122223643A · Pending · Publication Date: 2026-06-16 · 上海万怡医学科技股份有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: 上海万怡医学科技股份有限公司
Filing Date: 2026-03-09
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Robots deployed in areas such as conference venues and hospitals are underused because they typically wait passively for people to ask questions instead of proactively offering assistance.

Method used

Frames from surveillance cameras are analyzed to identify people's postures and facial expressions and to filter out those who need assistance. A robot is then controlled to move next to them and issue an assistance prompt, and it responds to their requests using artificial intelligence and a knowledge graph.

Benefits of technology

This enables robots to proactively identify and assist people in need, improving robot utilization, accuracy, and convenience.



Abstract

The application provides an artificial intelligence-based human-machine collaborative optimization method and system, and relates to the technical field of robots. The method comprises the following steps: sampling surveillance video frames to obtain images to be processed; obtaining the target region where each person is located; determining the posture information and expression information of each person; screening persons to be assisted based on the posture information and the expression information; selecting a target person to assist; controlling a robot to move to the position of the target person; issuing an assistance prompt message; receiving request information; and presenting matching assistance prompt information to assist the target person. According to the application, surveillance video frames can be analyzed to determine in real time which persons in a region need assistance, a target person to assist can be selected based on the positions of the robot and the persons to be assisted, and the robot can actively assist that person, so that people who need help are accurately screened and actively assisted, and the utilization rate of the robot is improved.

Description

Technical Field

[0001] This invention relates to the field of robotics, and in particular to an artificial intelligence-based human-machine collaborative optimization method and system.

Background Technology

[0002] In related technologies, robots can be placed in areas such as venues and hospitals to assist people, for example by answering specific questions or giving directions. However, robots are usually stationed at fixed locations and wait for people to ask for help, and few people know what the robots can do or where they are, so robot utilization is low.

Summary of the Invention

[0003] This invention provides a human-machine collaborative optimization method and system based on artificial intelligence, which can solve the technical problem of robots passively waiting for people and having low utilization rate in related technologies.

[0004] According to a first aspect of the present invention, an artificial intelligence-based human-machine collaborative optimization method is provided, comprising: sampling the surveillance video frames captured by a surveillance camera to obtain images to be processed; detecting multiple images to be processed to obtain the target region where each person is located in each image to be processed; determining, based on the target region, the posture information and expression information of each person in each image to be processed; selecting persons to be assisted from multiple persons based on the posture information, the expression information, and the target region; selecting a target person to assist based on the posture information and expression information of the persons to be assisted and the position information of their target regions in the images to be processed; controlling a moving component to move the robot to a preset position next to the target person to assist, and activating the robot's display device, audio device, and microphone; sending an assistance prompt message to the target person to assist through at least one of the display device and the audio device; and, upon receiving request information from the target person to assist through at least one of the microphone and the display device, obtaining assistance prompt information matching the request information based on a preset knowledge graph, and notifying the target person to assist through the display device or the audio device.

[0005] According to the present invention, determining the posture information and expression information of each person in each image to be processed based on the target region includes: performing face detection on the target region to obtain the region where the face is located; determining the rectangular region of the target region located below the lowest point of the face region as the region where the body is located; processing the face region through a first convolutional processing layer of an image recognition model to obtain facial feature information; processing the facial feature information through a first fully connected layer and a first activation layer to obtain the expression information of the person; processing the body region through a second convolutional processing layer of the image recognition model to obtain body feature information; and processing the body feature information through a second fully connected layer and a second activation layer to obtain the posture information of the person.

[0006] According to the present invention, selecting persons to be assisted from multiple persons based on the posture information, the expression information, and the target region includes: obtaining the facial feature information, body feature information, expression information, and posture information of the j-th person in the i-th image to be processed; processing the target region of the j-th person in the i-th image to be processed through a third convolutional processing layer of a trained personnel state recognition model to obtain region feature information; processing the region feature information, the expression information, the facial feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed through the trained personnel state recognition model to obtain the expression latent state information of the j-th person in the i-th image to be processed, wherein, when i=1, the latent state information of the j-th person in the (i-1)-th image to be processed is a zero vector; processing the region feature information, the posture information, the body feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed through the trained personnel state recognition model to obtain the posture latent state information of the j-th person in the i-th image to be processed; obtaining the latent state information of the j-th person in the i-th image to be processed based on the expression latent state information and the posture latent state information; when i=n, concatenating the region feature information, the posture information, and the expression information to obtain third concatenated information, where n is the number of images to be processed; processing the third concatenated information through a fifth multilayer perceptron layer of the trained personnel state recognition model to obtain a fifth weight matrix; multiplying the fifth weight matrix by the latent state information of the j-th person in the n-th image to be processed to obtain the state feature vector of the j-th person; processing the state feature vector of the j-th person through a sixth multilayer perceptron layer of the trained personnel state recognition model to obtain judgment probability information indicating whether the j-th person needs assistance; and selecting the persons to be assisted from the multiple persons based on the judgment probability information of each person.

[0007] According to the present invention, processing the region feature information, the expression information, the facial feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed through the trained personnel state recognition model to obtain the expression latent state information of the j-th person in the i-th image to be processed includes: concatenating the region feature information and the expression information to obtain first concatenated information; processing the first concatenated information through a first multilayer perceptron layer of the trained personnel state recognition model to obtain a first weight matrix; multiplying the first weight matrix by the latent state information of the j-th person in the (i-1)-th image to be processed to obtain first expression intermediate state information; processing the first concatenated information through a second multilayer perceptron layer of the trained personnel state recognition model to obtain a second weight matrix; multiplying the second weight matrix by the facial feature information to obtain second expression intermediate state information; and obtaining the expression latent state information of the j-th person in the i-th image to be processed based on the first expression intermediate state information and the second expression intermediate state information.

[0008] According to the present invention, processing the region feature information, the posture information, the body feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed through the trained personnel state recognition model to obtain the posture latent state information of the j-th person in the i-th image to be processed includes: concatenating the region feature information and the posture information to obtain second concatenated information; processing the second concatenated information through a third multilayer perceptron layer of the trained personnel state recognition model to obtain a third weight matrix; multiplying the third weight matrix by the latent state information of the j-th person in the (i-1)-th image to be processed to obtain first posture intermediate state information; processing the second concatenated information through a fourth multilayer perceptron layer of the trained personnel state recognition model to obtain a fourth weight matrix; multiplying the fourth weight matrix by the body feature information to obtain second posture intermediate state information; and obtaining the posture latent state information of the j-th person in the i-th image to be processed based on the first posture intermediate state information and the second posture intermediate state information.
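The two latent-state updates described above share one pattern: a concatenated context vector is passed through a perceptron to produce a weight matrix, which is then multiplied by either the previous latent state or the branch's feature vector. The following is a minimal pure-Python sketch of that pattern, not the patent's implementation. The vector sizes, the random weights, the tanh nonlinearity, the way the two intermediate states are combined (a tanh of their sum), and the way the expression and posture branches are fused (an average) are all assumptions, since the text only says the results are obtained "based on" those quantities. The names `WeightMLP`, `branch_update`, and `step` are hypothetical.

```python
import math
import random

Vector = list[float]
Matrix = list[list[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class WeightMLP:
    """Small perceptron whose flat output is reshaped into a rows x cols weight matrix."""

    def __init__(self, in_dim: int, hidden_dim: int, rows: int, cols: int,
                 rng: random.Random) -> None:
        self.rows, self.cols = rows, cols
        # Random, untrained weights: for illustration only.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(rows * cols)]

    def __call__(self, x: Vector) -> Matrix:
        hidden = [math.tanh(z) for z in matvec(self.w1, x)]
        flat = matvec(self.w2, hidden)
        return [flat[r * self.cols:(r + 1) * self.cols] for r in range(self.rows)]


def branch_update(context: Vector, features: Vector, h_prev: Vector,
                  mlp_state: WeightMLP, mlp_feat: WeightMLP) -> Vector:
    """One branch (expression or posture) of the latent-state update."""
    first = matvec(mlp_state(context), h_prev)    # weight matrix x previous latent state
    second = matvec(mlp_feat(context), features)  # weight matrix x branch feature vector
    return [math.tanh(a + b) for a, b in zip(first, second)]  # assumed combination rule


def step(region: Vector, expr: Vector, face_feat: Vector, posture: Vector,
         body_feat: Vector, h_prev: Vector, mlps: list[WeightMLP]) -> Vector:
    """Update the latent state of one person for one image."""
    ctx_expr = region + expr      # first concatenated information
    ctx_pose = region + posture   # second concatenated information
    h_expr = branch_update(ctx_expr, face_feat, h_prev, mlps[0], mlps[1])
    h_pose = branch_update(ctx_pose, body_feat, h_prev, mlps[2], mlps[3])
    return [(a + b) / 2 for a, b in zip(h_expr, h_pose)]  # assumed fusion of both branches


def random_vector(n: int, rng: random.Random) -> Vector:
    return [rng.random() for _ in range(n)]


rng = random.Random(0)
D, R, E, P, F, B = 4, 3, 2, 2, 5, 5  # hypothetical sizes: latent, region, expr, posture, face, body
mlps = [WeightMLP(R + E, 8, D, D, rng), WeightMLP(R + E, 8, D, F, rng),
        WeightMLP(R + P, 8, D, D, rng), WeightMLP(R + P, 8, D, B, rng)]

h = [0.0] * D  # zero vector used as the previous latent state when i = 1
for _ in range(3):  # three images to be processed, filled with random stand-in features
    h = step(random_vector(R, rng), random_vector(E, rng), random_vector(F, rng),
             random_vector(P, rng), random_vector(B, rng), h, mlps)
```

Because the previous latent state is a zero vector for the first image, the first intermediate state vanishes there and the update is driven entirely by the current facial and body features.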

[0009] According to the present invention, the training of the personnel state recognition model includes: obtaining the training facial feature information, training body feature information, training expression information, and training posture information of a trainee in the t-th training image, together with the training region feature information of the region where the trainee is located; processing the training region feature information, the training expression information, the training facial feature information, and the training latent state information of the trainee in the (t-1)-th training image through the personnel state recognition model to obtain the training expression latent state information of the trainee in the t-th training image; processing the training region feature information, the training posture information, the training body feature information, and the training latent state information of the trainee in the (t-1)-th training image through the personnel state recognition model to obtain the training posture latent state information of the trainee in the t-th training image; obtaining the training latent state information of the trainee in the t-th training image based on the training expression latent state information and the training posture latent state information; concatenating the training region feature information and the training expression information, and inputting the result into a seventh multilayer perceptron layer to obtain a seventh weight matrix; multiplying the seventh weight matrix by the training expression latent state information to obtain a training expression category feature vector; inputting the training expression category feature vector into a third fully connected layer and a third activation layer to obtain training expression category probability information; concatenating the training region feature information and the training posture information, and inputting the result into an eighth multilayer perceptron layer to obtain an eighth weight matrix; multiplying the eighth weight matrix by the training posture latent state information to obtain a training posture category feature vector; inputting the training posture category feature vector into a fourth fully connected layer and a fourth activation layer to obtain training posture category probability information; obtaining, based on the training latent state information, training judgment probability information indicating whether the trainee needs assistance; determining the loss function of the personnel state recognition model based on the training expression category probability information, the training posture category feature vector, the training judgment probability information, and the annotation information of the trainee; and training the personnel state recognition model based on the loss function to obtain the trained personnel state recognition model.

[0010] According to the present invention, determining the loss function of the personnel state recognition model based on the training expression category probability information, the training posture category feature vector, the training judgment probability information, and the annotation information of the trainee includes: determining the loss function LOSS of the personnel state recognition model according to a formula (the formula and its symbols are not reproduced in this text),

[0011] where the quantities in the formula are: the probability that the trainee's expression belongs to the k-th expression category, determined from the training expression category probability information of the t-th training image; the probability that the trainee's expression belongs to the k-th expression category, determined from the annotation information; the probability that the trainee's posture belongs to the s-th posture category, determined from the training posture category probability information of the t-th training image; the probability that the trainee's posture belongs to the s-th posture category, determined from the annotation information; the probability that the trainee needs assistance, determined from the training judgment probability information of the t-th training image; the probability that the trainee needs assistance, determined from the annotation information; N, the number of expression categories; M, the number of posture categories; the preset weights; and T, the number of training images that include the trainee. Here t≤T, k≤N, s≤M, and t, T, k, N, s, and M are all positive integers.
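The formula for LOSS does not survive in this text, so its exact form is unknown. Purely as an illustration, one form that is consistent with the listed quantities is a weighted sum of two categorical cross-entropy terms and one binary cross-entropy term, accumulated over the T training images. Everything in the block below is an assumption: the cross-entropy choice, the symbols \hat{p}, y, \alpha, \beta, and \gamma, and the placement of the weights are hypothetical and are not taken from the patent.

```latex
% Hypothetical reconstruction, NOT the patent's formula.
% \hat{p}^{e}_{t,k}: predicted probability of expression category k in training image t
% y^{e}_{k}:         annotated probability of expression category k
% \hat{p}^{s}_{t,s}: predicted probability of posture category s in training image t
% y^{s}_{s}:         annotated probability of posture category s
% \hat{p}^{a}_{t}:   predicted probability that the trainee needs assistance
% y^{a}:             annotated probability that the trainee needs assistance
% \alpha, \beta, \gamma: preset weights
\mathrm{LOSS} = -\sum_{t=1}^{T} \Big[
    \alpha \sum_{k=1}^{N} y^{e}_{k} \log \hat{p}^{e}_{t,k}
  + \beta  \sum_{s=1}^{M} y^{s}_{s} \log \hat{p}^{s}_{t,s}
  + \gamma \big( y^{a} \log \hat{p}^{a}_{t} + (1 - y^{a}) \log (1 - \hat{p}^{a}_{t}) \big)
\Big]
```

A form of this kind would supervise the expression branch, the posture branch, and the final assistance judgment at every training image, which matches the stated intent of training the three kinds of latent state information separately.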

[0012] According to the present invention, selecting a target person to assist based on the posture information and expression information of the persons to be assisted and the position information of their target regions in the images to be processed includes: determining the first spatial coordinate data of each person to be assisted based on the position information of that person's target region in the last image to be processed and the calibration parameters of the surveillance camera; obtaining the second spatial coordinate data of the robot; obtaining the assistance service index of each person to be assisted based on the first spatial coordinate data, the second spatial coordinate data, and the judgment probability information of that person; and selecting the target person to assist based on the assistance service index of each person to be assisted.
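The text does not say how the camera's calibration parameters turn an image position into spatial coordinates. A common approach, assumed here, is to treat the floor as a plane and map a pixel on that plane to floor coordinates with a 3x3 homography obtained during calibration. The sketch below also assumes the person's position is taken at the bottom-center of the bounding box; `foot_point`, `image_to_ground`, and the example matrix are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels, y grows downward
Matrix3 = list[list[float]]              # 3x3 image-to-floor homography (assumed calibration output)


def foot_point(box: Box) -> tuple[float, float]:
    """Bottom-center of the bounding box, used as the pixel where the person meets the floor."""
    x1, _, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)


def image_to_ground(h: Matrix3, u: float, v: float) -> tuple[float, float]:
    """Map pixel (u, v) to planar floor coordinates using homogeneous coordinates."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity under this homography")
    return (x / w, y / w)


# Illustrative homography: a pure scaling of 0.5 floor units per pixel.
H_EXAMPLE: Matrix3 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
person_xy = image_to_ground(H_EXAMPLE, *foot_point((100.0, 50.0, 200.0, 400.0)))
```

With a real camera the matrix would come from the calibration step, for example from four or more known floor points matched to their pixel positions.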

[0013] According to the present invention, obtaining the assistance service index of the person to be assisted based on the first spatial coordinate data, the second spatial coordinate data, and the judgment probability information of the person to be assisted includes: obtaining the planar coordinates of the first spatial coordinate data of the person to be assisted and the planar coordinates of the second spatial coordinate data, and calculating the spatial vector between the first spatial coordinate data and the second spatial coordinate data; identifying the number of obstacles and obtaining the set region where each obstacle is located in the space; and calculating the assistance service index of the person to be assisted based on the set regions, the spatial vector, and the judgment probability information of the person to be assisted.

[0014] According to a formula (the formula and its symbols are not reproduced in this text),

[0015] the assistance service index of the u-th person to be assisted is obtained, where the quantities in the formula are: the judgment probability information of the u-th person to be assisted; the planar coordinates of the first spatial coordinate data of the u-th person to be assisted; the planar coordinates of the second spatial coordinate data; the line connecting these two planar coordinates; the set region where the h-th obstacle is located in the space; and the number of obstacles. u, h, and the number of obstacles are all positive integers, and if is a conditional function.
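The formula for the assistance service index is missing from this text, so the following is only a sketch of an index with the stated ingredients: the judgment probability, the straight line between the robot and the person, and a conditional test of whether that line passes through each obstacle region. The scoring rule (probability divided by one plus the distance plus a fixed penalty per blocking obstacle), the penalty value, and the modelling of obstacle regions as axis-aligned rectangles are all assumptions, and the function names are hypothetical.

```python
import math

Point = tuple[float, float]
Rect = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed obstacle shape


def segment_hits_rect(p: Point, q: Point, rect: Rect) -> bool:
    """Slab (Liang-Barsky style) test: does segment p-q intersect the rectangle?"""
    (x0, y0), (x1, y1) = p, q
    xmin, ymin, xmax, ymax = rect
    t0, t1 = 0.0, 1.0  # parameter range of the segment still inside every slab
    for d, lo, hi in ((x1 - x0, xmin - x0, xmax - x0), (y1 - y0, ymin - y0, ymax - y0)):
        if d == 0:
            if lo > 0 or hi < 0:  # parallel to this slab and outside it
                return False
        else:
            ta, tb = lo / d, hi / d
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False
    return True


def service_index(prob: float, person: Point, robot: Point,
                  obstacles: list[Rect], penalty: float = 5.0) -> float:
    """Assumed scoring rule: higher need and easier access give a higher index."""
    distance = math.hypot(person[0] - robot[0], person[1] - robot[1])
    blocked = sum(1 for rect in obstacles if segment_hits_rect(robot, person, rect))
    return prob / (1.0 + distance + penalty * blocked)


def select_target(candidates: dict[str, tuple[float, Point]], robot: Point,
                  obstacles: list[Rect]) -> str:
    """Return the candidate with the largest index; candidates maps name -> (prob, position)."""
    return max(candidates,
               key=lambda name: service_index(candidates[name][0], candidates[name][1],
                                              robot, obstacles))


obstacles = [(4.0, -1.0, 6.0, 1.0)]
candidates = {"A": (0.9, (10.0, 0.0)), "B": (0.6, (0.0, 8.0))}
target = select_target(candidates, (0.0, 0.0), obstacles)
```

In this example person A has the higher probability of needing help, but the obstacle on the direct path lowers A's index below B's, so B is selected. This reflects the stated goal of weighing the degree of need against how easily the robot can reach the person.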

[0016] According to a second aspect of the present invention, an artificial intelligence-based human-machine collaborative optimization system is provided, comprising: a sampling module, used to sample the surveillance video frames captured by a surveillance camera to obtain images to be processed; a target region module, used to detect multiple images to be processed to obtain the target region where each person is located in each image to be processed; an information module, used to determine the posture information and expression information of each person in each image to be processed based on the target region; a to-be-assisted person module, used to select persons to be assisted from multiple persons based on the posture information, the expression information, and the target region; a target person module, used to select a target person to assist based on the posture information and expression information of the persons to be assisted and the position information of their target regions in the images to be processed; a movement module, used to control a moving component to move the robot to a preset position next to the target person to assist, and to activate the robot's display device, audio device, and microphone; an assistance prompt module, used to send an assistance prompt message to the target person to assist through at least one of the display device and the audio device; and a notification module, used, upon receiving request information from the target person to assist through at least one of the microphone and the display device, to obtain assistance prompt information matching the request information based on a preset knowledge graph and to notify the target person to assist through the display device or the audio device.

[0017] By adopting the above technical solution, the present invention can achieve the following technical effects: by analyzing surveillance video frames, the persons in a region who need assistance can be identified in real time, and a target person to assist can be selected based on the positions of the robot and the persons to be assisted, so that the robot proactively assists the target person. Persons who need help are thus accurately screened and proactively assisted, which improves robot utilization. When the expression latent state information is obtained, first concatenated information that reinforces the person's facial and expression features can be obtained. The latent state information is dynamically updated based on the first concatenated information of the current image to be processed, which improves the fit between the first expression intermediate state information and the person's expression state and expresses expression changes more accurately. In addition, the facial feature information can be reinforced through the first concatenated information so that it is better coordinated with the overall features; that is, the facial features are reinforced by the real-time overall features, which improves the accuracy with which the expression latent state information expresses expression states and expression changes. When the posture latent state information is obtained, second concatenated information that reinforces the person's body and posture features can be obtained. The latent state information is dynamically updated based on the second concatenated information of the current image to be processed, which improves the fit between the first posture intermediate state information and the person's posture state and expresses posture changes more accurately. In addition, the body feature information can be reinforced through the second concatenated information so that it is better coordinated with the overall features; that is, the body features are reinforced by the real-time overall features, which improves the accuracy with which the posture latent state information expresses postures and movements. During training, the expression latent state information, the posture latent state information, and the latent state information can each be trained. This improves the accuracy with which the expression latent state information represents expression categories, the posture latent state information represents posture categories, and the latent state information represents the person's state, thereby improving the overall performance of the personnel state recognition model. Improving the accuracy of expression and posture recognition in turn improves the accuracy of judging whether a person needs assistance, and makes training more efficient and targeted. When the assistance service index is determined, it can take into account both how much the person to be assisted needs help and how easily the robot can reach that person's location, which improves the convenience of the assistance service index.

Attached Figure Description

[0018] Figure 1 shows an exemplary flowchart of an artificial intelligence-based human-machine collaborative optimization method according to an embodiment of the present invention.

[0019] Figure 2 shows an exemplary schematic diagram of the screening of persons to be assisted according to an embodiment of the present invention.

[0020] Figure 3 shows an exemplary block diagram of an artificial intelligence-based human-machine collaborative optimization system according to an embodiment of the present invention.

Detailed Implementation

[0021] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0022] Figure 1 shows an exemplary flowchart of an artificial intelligence-based human-machine collaborative optimization method according to an embodiment of the present invention, the method comprising:
Step S1: Sample the surveillance video frames captured by the surveillance camera to obtain the images to be processed;
Step S2: Detect multiple images to be processed to obtain the target region where each person is located in each image to be processed;
Step S3: Based on the target region, determine the posture information and expression information of each person in each image to be processed;
Step S4: Based on the posture information, the expression information, and the target region, select the persons to be assisted from multiple persons;
Step S5: Select the target person to assist based on the posture information and expression information of the persons to be assisted and the position information of their target regions in the images to be processed;
Step S6: Control the moving component to move the robot to a preset position next to the target person to assist, and activate the robot's display device, audio device, and microphone;
Step S7: Send an assistance prompt message to the target person to assist through at least one of the display device and the audio device;
Step S8: Upon receiving request information from the target person to assist through at least one of the microphone and the display device, obtain assistance prompt information matching the request information according to the preset knowledge graph, and notify the target person to assist through the display device or the audio device.

[0023] According to an embodiment of the present invention, the artificial intelligence-based human-machine collaboration optimization method can analyze monitoring video frames to determine in real time the personnel in the area who need assistance, and select the target personnel for assistance based on the positions of the robot and the personnel to be assisted, thereby enabling the robot to actively assist the target personnel. This method can accurately screen the personnel who need help and actively provide assistance, thereby improving the utilization rate of the robot.

[0024] According to one embodiment of the present invention, in step S1, the surveillance camera is a high-definition camera capable of capturing images of a specific area (e.g., a meeting room, a hospital lobby, etc.), and the surveillance video has high clarity, capable of capturing the facial areas of people and their facial expressions. The surveillance video frames captured by the surveillance camera can be sampled, for example, once per second. This not only reduces the computational load but also allows for the analysis of the actions and behaviors of each person captured over a period of time using fewer images to be processed, thus improving analysis efficiency.
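Sampling at a fixed interval, as in step S1, amounts to choosing which frame indices of the stream to keep. The sketch below is a minimal illustration that works on frame counts only, without a video library. The function name `sample_frame_indices` and its parameters are hypothetical; the one-second default follows the example in the text.

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 1.0) -> list[int]:
    """Indices of the frames kept when taking one frame every `interval_s` seconds."""
    if total_frames < 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval_s must be > 0")
    indices: list[int] = []
    k = 0
    while True:
        idx = round(k * interval_s * fps)  # frame nearest to the k-th sampling instant
        if idx >= total_frames:
            break
        if not indices or idx != indices[-1]:  # skip repeats when interval_s * fps < 1
            indices.append(idx)
        k += 1
    return indices
```

For a 25 fps stream of 100 frames, `sample_frame_indices(100, 25.0)` keeps frames 0, 25, 50, and 75, so four images to be processed stand in for four seconds of video.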

[0025] According to one embodiment of the present invention, in step S2, multiple images to be processed (e.g., 5-10 images to be processed) at the current time and before the current time are detected to obtain the target area where the person is located in each image to be processed. For example, the target area where the person is located is determined using a model such as YOLO, that is, the area where the rectangular box containing the person in the image to be processed is located. It is also possible to identify persons who are present in multiple images to be processed. These persons may have been in almost the same position or have only moved slightly during the time period between the capture times of the 5-10 images to be processed, resulting in these persons being captured in multiple consecutive images to be processed. These persons may not know the route to their desired destination and are looking for road signs, or they may have some problems and are looking for solutions, such as people in a venue looking for a product exhibition area. These persons are potential people who need help. Persons who quickly pass through the area captured by the surveillance camera and cannot be present in multiple images to be processed simultaneously usually have a clear goal and know the location of their destination, so they can quickly pass through the area to their destination. These persons are not potential people who need help.
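The text does not specify how a person is recognized as the same individual across images. One simple possibility, assumed here, is to link bounding boxes between consecutive images by intersection over union (IoU) and to keep only the people whose box can be followed through every image in the window. The 0.5 threshold and the names `iou` and `lingering_people` are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lingering_people(frames: list[list[Box]], min_iou: float = 0.5) -> list[Box]:
    """Boxes in the last image that can be traced back through every earlier image.

    `frames` holds the detected person boxes of each sampled image, oldest first.
    """
    if not frames:
        return []
    result: list[Box] = []
    for box in frames[-1]:
        current = box
        traced = True
        for earlier in reversed(frames[:-1]):
            best = max(earlier, key=lambda e: iou(current, e), default=None)
            if best is None or iou(current, best) < min_iou:
                traced = False  # the person was absent or moved too far in this image
                break
            current = best
        if traced:
            result.append(box)
    return result
```

A person who stays near the same spot keeps a high overlap from image to image and is retained as a potential person in need of help, while someone walking quickly through the scene fails the overlap test and is dropped.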

[0026] According to one embodiment of the present invention, in step S3, image blocks in the target area can be analyzed to determine the posture information and facial expression information of the person, and then, based on the changes in the posture information and facial expression information of the person in multiple images to be processed, it can be determined whether the person needs assistance.

[0027] According to an embodiment of the present invention, determining the pose and expression information of each person in each image to be processed based on the target region includes: performing facial detection processing on the target region to obtain the region where the face is located; determining the rectangular region in the target region located below the lowest point of the region where the face is located as the region where the body is located; processing the region where the face is located through a first convolutional processing layer of an image recognition model to obtain facial feature information; processing the facial feature information through a first fully connected layer and a first activation layer to obtain the expression information of the person; processing the region where the body is located through a second convolutional processing layer of an image recognition model to obtain body feature information; and processing the body feature information through a second fully connected layer and a second activation layer to obtain the pose information of the person.

[0028] According to one embodiment of the present invention, face detection processing can be performed only on image blocks of the target region. For example, face detection processing can be performed using the YOLO model to obtain the region where the face is located. That is, the face is selected by using a small rectangular box, and the region where the small rectangular box is located is the region where the face is located. The rectangular region below the lowest point of the face region is the region where the body is located. The face region and the body region can be analyzed separately to obtain expression information and posture information. The face region can be processed by the first convolutional processing layer of an image recognition model (e.g., a convolutional neural network model) to obtain facial feature information. After processing by the first fully connected layer and the first activation layer, the expression information of the person can be obtained, such as the probability that the person's expression belongs to various expression categories. Similarly, the body region can be processed by the second convolutional processing layer of an image recognition model to obtain body feature information. This body feature information is then processed by the second fully connected layer and the second activation layer to obtain the posture information of the person, that is, the probability that the person's posture belongs to various posture categories.
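The body region described above can be derived directly from the two rectangles: it spans the width of the person's target region and runs from the lowest point of the face box to the bottom of the target region. The sketch below assumes image coordinates in which y increases downward, so the lowest point of the face is its largest y value. The function name `body_region` is hypothetical.

```python
from typing import Optional

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels, y grows downward


def body_region(target: Box, face: Box) -> Optional[Box]:
    """Rectangle of the target region lying below the lowest point of the face box.

    Returns None when the face box reaches the bottom of the target region,
    that is, when no body area is visible.
    """
    tx1, ty1, tx2, ty2 = target
    face_bottom = face[3]
    top = max(ty1, face_bottom)  # never start above the target region itself
    if top >= ty2:
        return None
    return (tx1, top, tx2, ty2)
```

For a target region `(10, 20, 110, 320)` and a face box `(40, 30, 80, 90)`, the body region is `(10, 90, 110, 320)`. The face box and this rectangle would then be cropped and passed to the first and second convolutional processing layers respectively.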

[0029] According to one embodiment of the present invention, in step S4, the behavior pattern of a person can be analyzed by using the posture information and facial expression information in multiple images to be processed, thereby determining whether the person needs assistance.

[0030] Figure 2 shows an exemplary schematic diagram of the screening of persons to be assisted according to an embodiment of the present invention.

[0031] According to an embodiment of the present invention, selecting persons to be assisted from multiple persons based on the posture information, the expression information, and the target region includes: acquiring facial feature information, body feature information, expression information, and posture information of the j-th person in the i-th image to be processed; processing the target region of the j-th person in the i-th image to be processed through the third convolutional processing layer of a trained personnel state recognition model to obtain region feature information; processing the region feature information, expression information, facial feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed through the trained personnel state recognition model to obtain the expression latent state information of the j-th person in the i-th image to be processed, wherein, when i=1, the latent state information of the j-th person in the (i-1)-th image to be processed is a zero vector; processing the region feature information, posture information, body feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed through the trained personnel state recognition model to obtain the posture latent state information of the j-th person in the i-th image to be processed; obtaining the latent state information of the j-th person in the i-th image to be processed based on the expression latent state information and the posture latent state information; when i=n, concatenating the region feature information, posture information, and expression information to obtain third concatenated information, where n is the number of images to be processed; processing the third concatenated information through the fifth multilayer perceptron layer of the trained personnel state recognition model to obtain a fifth weight matrix; multiplying the fifth weight matrix by the latent state information of the j-th person in the n-th image to be processed to obtain the state feature vector of the j-th person; processing the state feature vector of the j-th person through the sixth multilayer perceptron layer of the trained personnel state recognition model to obtain judgment probability information that the j-th person needs assistance; and selecting the persons to be assisted from the multiple persons based on the judgment probability information of each person.

[0032] According to an embodiment of the present invention, the facial feature information, body feature information, expression information, and posture information of the j-th person (a person who appears in multiple images to be processed) in each image to be processed can be obtained by the aforementioned method.

[0033] According to one embodiment of the present invention, the target region is processed by the third convolutional processing layer of the personnel state recognition model to obtain regional feature information. The regional feature information can be used to describe the overall state of the person when the i-th image to be processed is captured, to enhance the features of the face and body, and to associate with the features of the face and body, thereby describing the overall behavior and state of the person.

[0034] According to one embodiment of the present invention, a trained personnel state recognition model is used to process region feature information, expression information, facial feature information, and latent state information of the j-th person in the (i-1)-th image to obtain the latent expression state information of the j-th person in the i-th image to be processed. This process includes: concatenating region feature information and expression information to obtain first concatenated information; processing the first concatenated information through a first multilayer perceptron layer of the trained personnel state recognition model to obtain a first weight matrix; multiplying the first weight matrix with the latent state information of the j-th person in the (i-1)-th image to obtain first intermediate expression state information; processing the first concatenated information through a second multilayer perceptron layer of the trained personnel state recognition model to obtain a second weight matrix; multiplying the second weight matrix with facial feature information to obtain second intermediate expression state information; and obtaining the latent expression state information of the j-th person in the i-th image to be processed based on the first and second intermediate expression state information.

[0035] According to one embodiment of the present invention, facial expression information can represent the probability that a person's facial expression belongs to multiple facial expression categories, and can be represented in vector form, where each data point in the vector represents the probability that the person's facial expression belongs to one facial expression category. The facial expression information can be concatenated with regional feature information to obtain first concatenated information. This first concatenated information includes not only the overall features of the region where the person is located, but also the person's facial expression category information, and can be used to enhance the person's facial features and facial expression features.

[0036] According to one embodiment of the present invention, the first concatenated information can be processed through the first multilayer perceptron layer to obtain a first weight matrix, and the latent state information can be updated through the first weight matrix. The latent state information describes the overall behavior and state of a person. Because the first weight matrix is derived from the overall features of the target region in the current image, updating the latent state with it makes the updated latent state information match the person's state (facial state) in the current image more closely, i.e., describe the person's state and behavior more accurately. Since the first concatenated information can enhance facial features and expression features, multiplying its first weight matrix by the latent state information of the previous image to be processed enhances and updates the facial and expression features in the latent state information, yielding the first intermediate expression state information. Because this computation references the latent state information of the previous image to be processed, the first intermediate expression state information can describe how the person's expression changed between the capture of the previous image to be processed and the capture of the current one.

[0037] According to one embodiment of the present invention, the first concatenated information can be processed through the second multilayer perceptron layer to obtain a second weight matrix, which is then multiplied by the facial feature information to obtain the second intermediate expression state information. That is, the first concatenated information is determined using the overall features of the target region in the current image, and the second weight matrix derived from it, which can enhance the person's facial features and expression features, is used to enhance the current facial feature information on the basis of the current overall features, thereby strengthening the expression features. This makes the second intermediate expression state information more consistent with the person's state (overall state) in the current image. In other words, the overall features are used to enhance the facial features, making the facial features and overall features more consistent and reducing the probability of inconsistency between the expression recognition result and the posture recognition result (e.g., a calm expression but large body movements).

[0038] According to one embodiment of the present invention, the first intermediate expression state information and the second intermediate expression state information are added together to obtain the latent expression state information of the j-th person in the i-th image to be processed. The expression features of the person can be enhanced by the overall features of the target region, highlighting the changes in expression, enhancing the coordination between expression and overall features, and improving the accuracy of the latent expression state information in expressing expression state and expression changes.

[0039] In this way, first concatenated information that enhances a person's facial and expression features can be obtained. Based on the first concatenated information of the current image to be processed, the latent state information is dynamically updated, which improves the fit between the first intermediate expression state information and the person's expression state and expresses expression changes more accurately. Furthermore, the facial feature information can be enhanced through the first concatenated information, making it more coordinated with the overall features. That is, facial features are enhanced through real-time overall features, thereby improving the accuracy of the latent expression state information in representing expression states and changes.
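
The expression branch described above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: each "multilayer perceptron layer" is modeled as a hypothetical one-hidden-layer perceptron whose output is reshaped into a weight matrix, all dimensions are arbitrary, and the parameters are random rather than trained. The names `init_mlp`, `weight_matrix`, and `expression_latent_step` do not come from the original text.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim, out_size):
    """Random parameters for a hypothetical one-hidden-layer perceptron."""
    return (rng.normal(0.0, 0.1, (hidden_dim, in_dim)), np.zeros(hidden_dim),
            rng.normal(0.0, 0.1, (out_size, hidden_dim)), np.zeros(out_size))


def weight_matrix(mlp, x, shape):
    """Run the perceptron on x and reshape its output into a weight matrix."""
    w_in, b_in, w_out, b_out = mlp
    hidden = np.tanh(w_in @ x + b_in)
    return (w_out @ hidden + b_out).reshape(shape)


def expression_latent_step(region_feat, expr_probs, face_feat, prev_latent,
                           mlp1, mlp2, out_dim):
    """One update of the expression latent state for the current image."""
    # First concatenated information: region features + expression probabilities.
    concat1 = np.concatenate([region_feat, expr_probs])
    # First weight matrix acts on the latent state of the previous image.
    w1 = weight_matrix(mlp1, concat1, (out_dim, prev_latent.size))
    # Second weight matrix acts on the current facial features.
    w2 = weight_matrix(mlp2, concat1, (out_dim, face_feat.size))
    first_mid = w1 @ prev_latent    # first intermediate expression state
    second_mid = w2 @ face_feat     # second intermediate expression state
    return first_mid + second_mid   # the two are added together


rng = np.random.default_rng(0)
region_feat, face_feat = rng.normal(size=6), rng.normal(size=5)
expr_probs = np.array([0.1, 0.7, 0.2])
out_dim, latent_dim = 4, 8
mlp1 = init_mlp(rng, 9, 16, out_dim * latent_dim)
mlp2 = init_mlp(rng, 9, 16, out_dim * face_feat.size)
# For the first image (i = 1) the previous latent state is a zero vector.
first_frame = expression_latent_step(region_feat, expr_probs, face_feat,
                                     np.zeros(latent_dim), mlp1, mlp2, out_dim)
```

With a zero previous latent state, the first term vanishes and the result is driven entirely by the enhanced facial features, which matches the i=1 case in the text.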

[0040] According to one embodiment of the present invention, a trained personnel state recognition model is used to process region feature information, posture information, body feature information, and the latent state information of the j-th person in the (i-1)-th image to obtain the posture latent state information of the j-th person in the i-th image to be processed. This process includes: concatenating region feature information and posture information to obtain second concatenated information; processing the second concatenated information through a third multilayer perceptron layer of the trained personnel state recognition model to obtain a third weight matrix; multiplying the third weight matrix with the latent state information of the j-th person in the (i-1)-th image to obtain first intermediate posture information; processing the second concatenated information through a fourth multilayer perceptron layer of the trained personnel state recognition model to obtain a fourth weight matrix; multiplying the fourth weight matrix with body feature information to obtain second intermediate posture information; and obtaining the posture latent state information of the j-th person in the i-th image to be processed based on the first and second intermediate posture information.

[0041] According to one embodiment of the present invention, posture information can represent the probability that a person's posture belongs to multiple posture categories, and can be represented in vector form, where each data point in the vector represents the probability that the person's posture belongs to a certain posture category. Posture information can be concatenated with region feature information to obtain second concatenated information. This second concatenated information includes not only the overall features of the region where the person is located, but also the person's posture category information, and can be used to enhance the person's body features and posture features.

[0042] According to one embodiment of the present invention, the second concatenated information can be processed through the third multilayer perceptron layer to obtain a third weight matrix, and the latent state information can be updated through the third weight matrix. The latent state information describes the overall behavior and state of a person. Because the third weight matrix is derived from the overall features of the target region in the current image, updating the latent state with it makes the updated latent state information match the person's state (body state) in the current image more closely, i.e., describe the person's state and behavior more accurately. Since the second concatenated information can enhance body features and posture features, multiplying its third weight matrix by the latent state information of the previous image to be processed enhances and updates the body and posture features in the latent state information, yielding the first intermediate posture information. Because this computation references the latent state information of the previous image to be processed, the first intermediate posture information can describe how the person's posture changed between the capture of the previous image to be processed and the capture of the current one.

[0043] According to one embodiment of the present invention, the second concatenated information can be processed through the fourth multilayer perceptron layer to obtain a fourth weight matrix, which is then multiplied by the body feature information to obtain the second intermediate posture information. That is, the second concatenated information is determined using the overall features of the target region in the current image, and the fourth weight matrix derived from it, which can enhance the person's body features and posture features, is used to enhance the body feature information, thereby strengthening the posture features. This makes the second intermediate posture information more consistent with the person's state (overall state) in the current image. In other words, the overall features are used to enhance the body features, making the body features and overall features more consistent and reducing the probability of inconsistency between the expression recognition result and the posture recognition result (e.g., a calm expression but large body movements).

[0044] According to one embodiment of the present invention, the first posture intermediate state information and the second posture intermediate state information are added together to obtain the posture latent state information of the j-th person in the i-th image to be processed. The posture features of the person can be enhanced by the overall features of the target region, highlighting the changes in posture, enhancing the coordination between posture and overall features, and improving the accuracy of the posture latent state information in expressing posture and action.

[0045] In this way, second concatenated information that enhances a person's body and posture features can be obtained. Based on the second concatenated information of the current image to be processed, the latent state information is dynamically updated, which improves the fit between the first intermediate posture information and the person's posture state and represents posture changes more accurately. Furthermore, the second concatenated information can be used to enhance the body feature information, making it more coordinated with the overall features. That is, body features are enhanced through real-time overall features, thereby improving the accuracy of the posture latent state information in representing posture and movement.
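
For reference, the posture branch described above can be written compactly in equation form. The symbols are notation introduced here for illustration only and do not appear in the original text: $r_i$ is the region feature information, $p_i$ the posture information (posture category probabilities), $b_i$ the body feature information, $h_{i-1}$ the latent state information of the previous image, and $\mathrm{MLP}_3$, $\mathrm{MLP}_4$ the third and fourth multilayer perceptron layers, whose outputs are assumed to be reshaped into matrices of compatible size.

```latex
\begin{aligned}
c^{(2)}_i &= [\, r_i \,;\, p_i \,] && \text{(second concatenated information)} \\
W_3 &= \mathrm{MLP}_3\bigl(c^{(2)}_i\bigr), \quad W_4 = \mathrm{MLP}_4\bigl(c^{(2)}_i\bigr) && \text{(third and fourth weight matrices)} \\
h^{\mathrm{pose}}_i &= W_3\, h_{i-1} + W_4\, b_i, \quad h_0 = \mathbf{0} && \text{(posture latent state information)}
\end{aligned}
```

The expression branch has the same structure, with the expression information and facial feature information in place of $p_i$ and $b_i$ and the first and second multilayer perceptron layers in place of the third and fourth.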

[0046] According to one embodiment of the present invention, by concatenating the expression latent state information and the posture latent state information, the latent state information of the j-th person in the i-th image to be processed can be obtained, thereby updating the latent state information. The above processing can be performed iteratively until i=n, yielding the latent state information corresponding to the last image to be processed. Furthermore, the region feature information, posture information, and expression information can be concatenated to obtain third concatenated information, that is, feature information in which the person's facial and body features are enhanced by the overall features. The third concatenated information is then processed through the fifth multilayer perceptron layer to obtain a fifth weight matrix, which is multiplied by the latent state information of the j-th person in the n-th image to be processed to obtain the state feature vector of the j-th person. That is, the latent state information integrates posture features and expression features from multiple past moments, and the fifth weight matrix obtained from the third concatenated information further enhances these features to produce a state feature vector describing the person's state and behavior (e.g., blank expression, standing still). Whether the person needs assistance can then be determined based on the state feature vector.
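
The iteration over the n images and the final state feature vector can be sketched as below. The two branch updates and the fifth multilayer perceptron layer are passed in as callables; in this example they are replaced by random stand-in functions, so the numbers carry no meaning. The function name `final_state_vector`, the dictionary keys, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def final_state_vector(frames, expr_step, pose_step, weight_mlp5, latent_dim):
    """Iterate the latent state over all images, then form the state vector."""
    latent = np.zeros(latent_dim)  # latent state before the first image is zero
    for frame in frames:
        expr_latent = expr_step(frame, latent)   # expression latent state
        pose_latent = pose_step(frame, latent)   # posture latent state
        latent = np.concatenate([expr_latent, pose_latent])
    last = frames[-1]
    # Third concatenated information, built from the last image (i = n).
    third_concat = np.concatenate([last["region"], last["pose"], last["expr"]])
    w5 = weight_mlp5(third_concat)               # fifth weight matrix
    return w5 @ latent, latent


# Stand-in branch functions and perceptron (random, untrained, hypothetical).
rng = np.random.default_rng(0)
A_e, A_p = rng.normal(size=(2, 9)), rng.normal(size=(2, 9))
B = rng.normal(size=(12, 7))
expr_step = lambda f, h: np.tanh(A_e @ np.concatenate([f["region"], f["expr"], h]))
pose_step = lambda f, h: np.tanh(A_p @ np.concatenate([f["region"], f["pose"], h]))
weight_mlp5 = lambda c: (B @ c).reshape(3, 4)

frames = [{"region": rng.normal(size=3),
           "expr": np.array([0.7, 0.3]),
           "pose": np.array([0.4, 0.6])} for _ in range(5)]
state_vec, latent = final_state_vector(frames, expr_step, pose_step,
                                       weight_mlp5, latent_dim=4)
```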

[0047] According to one embodiment of the present invention, the sixth multilayer perceptron layer may include fully connected layers and activation layers (e.g., a network layer using the sigmoid activation function), which process the state feature vector to obtain the judgment probability information of the j-th person (i.e., the probability that the j-th person needs assistance). The judgment probability information of each person can be obtained in the same way, and persons whose probability is higher than or equal to a preset probability threshold can be selected as persons to be assisted.
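
The thresholding step can be illustrated with a small sketch. Here the sixth multilayer perceptron layer is reduced to a single linear unit followed by a sigmoid, which is a simplifying assumption; the weights, person identifiers, and the threshold value 0.5 are likewise illustrative.

```python
import math


def assistance_probability(state_vector, weights, bias):
    """Sigmoid of a linear score; a stand-in for the sixth perceptron layer."""
    logit = sum(w * x for w, x in zip(weights, state_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_persons_to_assist(probabilities, threshold):
    """Keep every person whose probability reaches the preset threshold."""
    return [pid for pid, p in probabilities.items() if p >= threshold]


weights, bias = [2.0, 1.0], 0.0
probabilities = {
    "person_1": assistance_probability([1.0, -0.5], weights, bias),
    "person_2": assistance_probability([0.0, 0.0], weights, bias),
    "person_3": assistance_probability([-1.0, 0.0], weights, bias),
}
to_assist = select_persons_to_assist(probabilities, threshold=0.5)
```

Because the comparison is "higher than or equal to", a person whose probability equals the threshold exactly (here `person_2`) is still selected.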

[0048] According to an embodiment of the present invention, the personnel state recognition model is trained before use. The training steps of the personnel state recognition model include: acquiring training facial feature information, training body feature information, training expression information, training posture information, and training region feature information of the area where the training person is located in the t-th training image; processing the training region feature information, training expression information, training facial feature information, and the training latent state information of the training person in the (t-1)-th training image through the personnel state recognition model to obtain the training expression latent state information of the training person in the t-th training image; processing the training region feature information, training posture information, training body feature information, and the training latent state information of the training person in the (t-1)-th training image through the personnel state recognition model to obtain the training posture latent state information of the training person in the t-th training image; obtaining the training latent state information of the training person in the t-th training image based on the training expression latent state information and the training posture latent state information; concatenating the training region feature information and the training expression information, and inputting the concatenated result into the seventh multilayer perceptron layer to obtain a seventh weight matrix; multiplying the seventh weight matrix by the training expression latent state information to obtain a training expression category feature vector; processing the training expression category feature vector through the third fully connected layer and the third activation layer to obtain training expression category probability information; concatenating the training region feature information and the training posture information, and inputting the concatenated result into the eighth multilayer perceptron layer to obtain an eighth weight matrix; multiplying the eighth weight matrix by the training posture latent state information to obtain a training posture category feature vector; processing the training posture category feature vector through the fourth fully connected layer and the fourth activation layer to obtain training posture category probability information; obtaining, based on the training latent state information, training judgment probability information that the training person needs assistance; determining the loss function of the personnel state recognition model based on the training expression category probability information, the training posture category feature vector, the training judgment probability information, and the annotation information of the training person; and training the personnel state recognition model using the loss function to obtain the trained personnel state recognition model.

[0049] According to one embodiment of the present invention, the methods for obtaining the training facial feature information, training body feature information, training expression information, training posture information, and training region feature information of the area where the training person is located are similar to those for obtaining the facial feature information, body feature information, expression information, posture information, and region feature information. The method for obtaining the training expression latent state information is similar to that for obtaining the expression latent state information, and the method for obtaining the training posture latent state information is similar to that for obtaining the posture latent state information. The method for obtaining the training latent state information of the training person in the t-th training image is similar to that for obtaining the latent state information of the j-th person in the i-th image to be processed. These methods are not described again here.

[0050] According to one embodiment of the present invention, the training expression latent state information is facial feature information enhanced through overall features. The expression category of the training person can be obtained from the training expression latent state information and compared with the training person's actual expression category during training, which reduces the error of the expression category derived from the training expression latent state information and improves its accuracy. Specifically, the training region feature information and the training expression information can be concatenated, and the concatenated result can be input into the seventh multilayer perceptron layer to obtain the seventh weight matrix. This matrix is then multiplied by the training expression latent state information, so that the facial feature information is further enhanced by the overall features of the training person and the probability distribution of the expression categories in the training image, yielding a training expression category feature vector. This vector is input into the third fully connected layer and the third activation layer (a network layer using the softmax function) to obtain the training expression category probability information, that is, the probability distribution over expression categories after the training expression information has been corrected through expression feature enhancement. Similarly, the training region feature information and the training posture information can be concatenated, and the concatenated result can be input into the eighth multilayer perceptron layer to obtain the eighth weight matrix. The eighth weight matrix is then multiplied by the training posture latent state information to obtain the training posture category feature vector, which is input into the fourth fully connected layer and the fourth activation layer to obtain the training posture category probability information, that is, the probability distribution over posture categories after the training posture information has been corrected through body and posture feature enhancement. Furthermore, the training judgment probability information that the training person needs assistance can be obtained based on the training latent state information, in a manner similar to the judgment probability information described above, which is not repeated here. Unlike in the processing described above, the training judgment probability information is obtained not only from the latent state information of the last training image but from the latent state information of each training image.

[0051] According to one embodiment of the present invention, determining the loss function of the personnel state recognition model based on the training expression category probability information, the training posture category feature vector, the training judgment probability information, and the annotation information of the training person includes: determining the loss function LOSS of the personnel state recognition model according to formula (1):

$$\mathrm{LOSS} = \sum_{t=1}^{T} w_t \left[ -\alpha \sum_{k=1}^{N} \hat{e}_k \log e_{t,k} - \beta \sum_{s=1}^{M} \hat{p}_s \log p_{t,s} - \gamma \left( \hat{a} \log a_t + (1-\hat{a}) \log (1-a_t) \right) \right] \quad (1)$$

where $e_{t,k}$ is the probability that the training person's expression belongs to the k-th expression category, determined based on the training expression category probability information of the t-th training image; $\hat{e}_k$ is the probability that the training person's expression belongs to the k-th expression category, determined based on the annotation information; $p_{t,s}$ is the probability that the training person's posture belongs to the s-th posture category, determined based on the training posture category probability information of the t-th training image; $\hat{p}_s$ is the probability that the training person's posture belongs to the s-th posture category, determined based on the annotation information; $a_t$ is the probability that the training person needs assistance, determined based on the training judgment probability information of the t-th training image; $\hat{a}$ is the probability that the training person needs assistance, determined based on the annotation information; N is the number of expression categories and M is the number of posture categories; $\alpha$, $\beta$, and $\gamma$ are preset weights; $w_t$ is the weight assigned to the total error of the t-th training image; T is the number of training images that include the training person; t≤T, k≤N, s≤M; and t, T, k, N, s, and M are all positive integers.

[0052] According to one embodiment of the present invention, in formula (1), when the training person's expression belongs to the k-th expression category, $\hat{e}_k = 1$; otherwise $\hat{e}_k = 0$. The first term is therefore the cross-entropy loss between the predicted probability $e_{t,k}$ and the labeled probability $\hat{e}_k$. Reducing this cross-entropy during training reduces the error between $e_{t,k}$ and $\hat{e}_k$, and thus the overall error of the training expression category probability information, making the expression category predicted by the model closer to the labeled expression category.

[0053] According to one embodiment of the present invention, when the training person's posture belongs to the s-th posture category, $\hat{p}_s = 1$; otherwise $\hat{p}_s = 0$. The second term is therefore the cross-entropy loss between the predicted probability $p_{t,s}$ and the labeled probability $\hat{p}_s$. Reducing this cross-entropy during training reduces the error between $p_{t,s}$ and $\hat{p}_s$, and thus the overall error of the training posture category probability information, making the posture category predicted by the model closer to the labeled posture category.

[0054] According to one embodiment of the present invention, when the training person needs assistance, $\hat{a} = 1$; otherwise $\hat{a} = 0$. The third term is therefore the cross-entropy loss between the predicted probability $a_t$ and the labeled value $\hat{a}$. Reducing this cross-entropy during training reduces the error between $a_t$ and $\hat{a}$, making the model's judgment of whether the training person needs assistance closer to the labeled result.

[0055] According to one embodiment of the present invention, the weighted sum of the above three cross-entropy loss functions, i.e., the bracketed term in formula (1), combines the errors in the model's prediction of the training person's expression and posture in the t-th training image with the error in judging whether the training person needs assistance. It represents the total error of the model's processing of the training person in the t-th training image. When t is small, few training images precede the t-th image, so few features are available for reference and for updating the latent state information; a smaller t therefore usually yields a higher total error, and a larger t usually yields a lower total error. Moreover, predictions at larger t matter more: since the error should in theory be smaller at larger t, any error remaining there calls for stronger correction to improve overall model accuracy. A weight $w_t$ is therefore assigned to the total error of the t-th training image, such that the weight is smaller when t is small and larger when t is large. The loss function of the personnel state recognition model is obtained from the total error and weight of each training image. The personnel state recognition model can be trained based on this loss function, with its parameters adjusted by gradient descent. The above training process can be executed iteratively over multiple training images to complete the training and obtain the trained personnel state recognition model.
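
The time-weighted loss described above can be sketched as follows. This is a hedged illustration rather than the patent's exact formula: it assumes the per-image weight is t/T (one simple choice that grows with t), assumes a binary cross-entropy for the assistance term, and uses hypothetical names (`sequence_loss`, `alpha`, `beta`, `gamma`) for the function and the three preset weights.

```python
import numpy as np


def sequence_loss(expr_pred, expr_label, pose_pred, pose_label,
                  assist_pred, assist_label,
                  alpha=1.0, beta=1.0, gamma=1.0, eps=1e-12):
    """Weighted sum over T training images of three cross-entropy terms.

    expr_pred: (T, N) predicted expression probabilities per image
    pose_pred: (T, M) predicted posture probabilities per image
    assist_pred: (T,) predicted probability of needing assistance per image
    expr_label, pose_label: one-hot label vectors; assist_label: 0.0 or 1.0
    """
    num_images = expr_pred.shape[0]
    # Assumption: weight t/T, smaller for early images and larger for late ones.
    time_weights = np.arange(1, num_images + 1) / num_images
    ce_expr = -(expr_label * np.log(expr_pred + eps)).sum(axis=1)
    ce_pose = -(pose_label * np.log(pose_pred + eps)).sum(axis=1)
    ce_assist = -(assist_label * np.log(assist_pred + eps)
                  + (1.0 - assist_label) * np.log(1.0 - assist_pred + eps))
    per_image = alpha * ce_expr + beta * ce_pose + gamma * ce_assist
    return float((time_weights * per_image).sum())


expr_label = np.array([1.0, 0.0])
pose_label = np.array([1.0, 0.0])
good, unsure = np.array([0.99, 0.01]), np.array([0.5, 0.5])
pose_pred = np.stack([good, good])
assist_pred = np.array([0.99, 0.99])
# Same expression mistake, made on the first image versus the last image.
loss_early_error = sequence_loss(np.stack([unsure, good]), expr_label,
                                 pose_pred, pose_label, assist_pred, 1.0)
loss_late_error = sequence_loss(np.stack([good, unsure]), expr_label,
                                pose_pred, pose_label, assist_pred, 1.0)
```

Under this weighting, an identical prediction error costs more when it occurs on a later image, which is the behavior the paragraph above asks of the weights.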

[0056] In this way, training can be performed separately on the latent state information of facial expressions, the latent state information of postures, and the latent state information during the training process. This improves the accuracy of the latent state information of facial expressions in representing facial expression categories, the accuracy of the latent state information of postures in representing posture categories, and the accuracy of the latent state information in representing the state of a person. This enhances the overall performance of the person state recognition model. By improving the accuracy of recognizing a person's facial expressions and postures, the model can further improve the accuracy of judging whether a person needs assistance, thus improving training efficiency and relevance.

[0057] According to one embodiment of the present invention, in step S5, a target assistant can be selected from multiple people who need assistance, for example, the person who needs the most assistance can be found and assisted.

[0058] According to one embodiment of the present invention, selecting a target assistant based on the posture and facial expression information of the person to be assisted, and the position information of the target area of the person to be assisted in the image to be processed, includes: determining first spatial coordinate data of the person to be assisted based on the position information of the target area of the person to be assisted in the last image to be processed and the calibration parameters of the monitoring camera; acquiring second spatial coordinate data of the robot; obtaining an assistance service index of the person to be assisted based on the first spatial coordinate data, the second spatial coordinate data and the judgment probability information of the person to be assisted; and selecting a target assistant based on the assistance service index of each person to be assisted.

[0059] According to one embodiment of the present invention, the position information of the centroid of the target area in the last image to be processed can be determined, and the position information can be mapped based on the calibration parameters of the camera to obtain the first spatial coordinate data of the person to be assisted in the area captured by the camera, and the second spatial coordinate data of the robot can be obtained in a similar manner.
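
One common way to map an image position to coordinates on the floor is a planar homography derived from the camera calibration. The sketch below assumes the calibration parameters have already been reduced to such a 3x3 matrix; the matrix values, the function names, and the box convention `(x_min, y_min, x_max, y_max)` are illustrative assumptions, and the original text does not specify the mapping used.

```python
import numpy as np


def box_centroid(box):
    """Centroid of a rectangular target area given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def pixel_to_plane(pixel_xy, homography):
    """Map a pixel position to planar coordinates with a 3x3 homography."""
    u, v = pixel_xy
    x, y, w = homography @ np.array([u, v, 1.0])
    return (x / w, y / w)  # divide out the homogeneous scale


# Hypothetical calibration: 1 pixel corresponds to 0.01 m, no perspective.
ground_homography = np.array([[0.01, 0.0, 0.0],
                              [0.0, 0.01, 0.0],
                              [0.0, 0.0, 1.0]])
person_xy = pixel_to_plane(box_centroid((100, 200, 300, 600)), ground_homography)
```

The robot's coordinates could be obtained the same way from its own target area in the image, or taken from its onboard localization if available.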

[0060] According to one embodiment of the present invention, obtaining the assistance service index of the person to be assisted based on the first spatial coordinate data, the second spatial coordinate data, and the judgment probability information of the person to be assisted includes: obtaining the assistance service index of the u-th person to be assisted according to formula (2) [formula image not reproduced]. The terms of formula (2) are: the judgment probability information of the u-th person to be assisted; the planar coordinates given by the first spatial coordinate data of the u-th person to be assisted; the planar coordinates given by the second spatial coordinate data of the robot; the line connecting these two sets of planar coordinates; the region occupied by the h-th obstacle in the space; and the number of obstacles. u, h, and the number of obstacles are all positive integers, and if is a conditional function.

[0061] According to one embodiment of the present invention, multiple obstacles may exist in the area captured by the camera. For example, each person can be regarded as an obstacle, as can items placed in the space and building structures in the space (e.g., pillars, stairs). Each obstacle occupies a certain region of the space. The coordinate range of the region occupied by a placed item or building structure is known. The region occupied by each person can be obtained from the camera and its calibration parameters: the spatial coordinate data of the centroid of each person's position in the image is determined in the same way as above, the x-axis and y-axis values of that data are taken as the planar coordinates of the centroid, and a circular region centered on those planar coordinates with a preset length as its radius is taken as the region occupied by that person. In formula (2), the conditional function takes the value 1 if the line connecting the robot and the u-th person to be assisted intersects the region of the h-th obstacle, and 0 otherwise; that is, it is 1 if the robot would have to pass through the h-th obstacle to reach the location of the u-th person to be assisted. The sum of the conditional function over all obstacles, divided by the number of obstacles, is the ratio of the number of obstacles the robot must pass through or avoid to the total number of obstacles, and represents the difficulty for the robot of reaching the location of the u-th person to be assisted. The corresponding term derived from this ratio therefore represents the ease with which the robot can reach that location. Together with the straight-line distance between the robot and the u-th person to be assisted, this gives a measure of how easily the robot can reach that person. Combined with the judgment probability information, the result can be used as the assistance service index of the u-th person to be assisted. The higher the assistance service index, the more the u-th person to be assisted needs help and the more convenient it is for the robot to reach that person's location.
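The image of formula (2) is not legible in this text. One form that is consistent with the verbal description above, offered only as a plausible reading and not as the formula of the application, is shown below. All symbols are assumed names: $A_u$ is the assistance service index, $p_u$ the judgment probability information, $(x_u, y_u)$ and $(x_r, y_r)$ the planar coordinates of the u-th person and the robot, $L_u$ the line connecting them, $O_h$ the region of the h-th obstacle, and $H$ the number of obstacles.

```latex
A_u \;=\; p_u \cdot
\frac{1 - \dfrac{1}{H}\displaystyle\sum_{h=1}^{H}
      \mathrm{if}\!\left(L_u \cap O_h \neq \varnothing,\; 1,\; 0\right)}
     {\sqrt{(x_u - x_r)^2 + (y_u - y_r)^2}}
```

Under this reading the index grows with the probability that the person needs help, shrinks with the fraction of obstacles on the direct path, and shrinks with the straight-line distance.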

[0062] In this way, the assistance service index combines the degree to which a person needs help with the ease with which the robot can reach that person's location, which makes the index more practical for deciding whom to assist.

[0063] According to one embodiment of the present invention, once the assistance service index has been determined for each person to be assisted, the person to be assisted with the maximum assistance service index can be selected as the target person to be assisted.
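The index computation and the maximum-value selection can be sketched as follows. Several points are assumptions of this sketch rather than statements of the application: every obstacle is approximated as a circle, the caller excludes a candidate's own circle from the obstacle list, and the probability, obstacle ratio, and distance are combined as probability × (1 − ratio) / distance. All function and field names are hypothetical.

```python
import math


def segment_hits_circle(a, b, center, radius):
    """True if the segment from a to b passes within `radius` of `center`."""
    ax, ay = a
    bx, by = b
    cx, cy = center
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0
    else:
        # Parameter of the closest point on the segment, clamped to [0, 1].
        t = max(0.0, min(1.0, ((cx - ax) * dx + (cy - ay) * dy) / seg_len_sq))
    nearest_x, nearest_y = ax + t * dx, ay + t * dy
    return math.hypot(cx - nearest_x, cy - nearest_y) <= radius


def assistance_index(prob, person_xy, robot_xy, obstacles):
    """Assumed form: probability * (1 - blocked ratio) / straight-line distance.

    `obstacles` is a list of ((x, y), radius) circles.
    """
    if obstacles:
        blocked = sum(1 for center, radius in obstacles
                      if segment_hits_circle(robot_xy, person_xy, center, radius))
        ease = 1.0 - blocked / len(obstacles)
    else:
        ease = 1.0
    distance = math.hypot(person_xy[0] - robot_xy[0], person_xy[1] - robot_xy[1])
    return prob * ease / max(distance, 1e-6)  # guard against zero distance


def select_target(candidates, robot_xy, obstacles):
    """Return the candidate dict with the maximum assistance service index."""
    return max(candidates,
               key=lambda c: assistance_index(c["prob"], c["xy"], robot_xy, obstacles))
```

For example, a nearby person with an unobstructed path can outrank a more distant person with a slightly higher probability whose path is blocked.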

[0064] According to one embodiment of the present invention, in step S6, the movable component can be controlled to move the robot to a preset position next to the location of the target person to be assisted, and the robot's display device, audio device, and microphone are activated. For example, the line connecting the preset position and the target person's location may be aligned with the direction in which the target person is facing, with the preset position 0.5 meters from the target person's location. In step S7, an assistance prompt message is sent to the target person to be assisted through at least one of the display device and the audio device, for example by displaying the text "What help do you need?" on the display device or by playing the same sentence through the audio device.
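A minimal sketch of computing such a preset position is given below. It assumes the target person's facing direction is available as an angle in the floor plane and that the robot should stop in front of the person and turn toward them; the function name, the angle convention, and the choice to return a heading are illustrative assumptions.

```python
import math


def approach_pose(person_xy, facing_rad, standoff_m=0.5):
    """Goal position `standoff_m` in front of the person, plus a robot heading.

    `facing_rad` is the person's facing direction in radians, measured from
    the x-axis of the floor plane (an assumed convention).
    """
    px, py = person_xy
    goal_x = px + standoff_m * math.cos(facing_rad)
    goal_y = py + standoff_m * math.sin(facing_rad)
    # Heading that points the robot back toward the person.
    heading = math.atan2(py - goal_y, px - goal_x)
    return (goal_x, goal_y), heading
```

With the person at (1, 2) facing along the positive x-axis, the goal is (1.5, 2) and the robot faces the negative x direction.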

[0065] According to one embodiment of the present invention, in step S8, the request information of the target person to be assisted can be received through at least one of the microphone and the display device. For example, when the target person asks a question aloud, the robot's processor can analyze the semantics of the question and query the knowledge graph for the corresponding answer. That answer is the assistance prompt information matching the request information, and it is then conveyed to the target person through the display device or the audio device.
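The lookup step can be illustrated with a toy knowledge graph stored as (subject, relation, object) triples. The text does not specify how the question's semantics are analyzed, so the keyword matching below is only a stand-in; the triples, relation names, and keywords are made-up sample data.

```python
# Hypothetical sample triples; a deployed graph would be site-specific.
TRIPLES = [
    ("registration desk", "located_at", "first floor lobby"),
    ("pharmacy", "located_at", "second floor, east wing"),
    ("pharmacy", "opens_at", "08:00"),
]

# Naive keyword cues standing in for real semantic analysis.
RELATION_KEYWORDS = {
    "located_at": ("where", "find", "location"),
    "opens_at": ("when", "open"),
}


def parse_request(text):
    """Extract a (subject, relation) pair from a question by keyword matching."""
    lowered = text.lower()
    subject = next((s for s, _, _ in TRIPLES if s in lowered), None)
    relation = next((rel for rel, words in RELATION_KEYWORDS.items()
                     if any(word in lowered for word in words)), None)
    return subject, relation


def answer(text):
    """Return the object of the matching triple, or None if nothing matches."""
    subject, relation = parse_request(text)
    for s, r, o in TRIPLES:
        if s == subject and r == relation:
            return o
    return None
```

A question that matches no triple returns `None`, which the robot could treat as a cue to ask the person to rephrase.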

[0066] According to embodiments of the present invention, the artificial intelligence-based human-machine collaborative optimization method can analyze surveillance video frames to determine, in real time, the persons in an area who need assistance, select a target person to be assisted based on the positions of the robot and of the persons to be assisted, and have the robot proactively assist the target person. Persons in need of help are thus accurately screened and proactively assisted, which improves the utilization rate of the robot. When the expression latent state information is obtained, first concatenated information that reinforces the person's facial and expression features can be obtained, and the latent state information is dynamically updated based on the first concatenated information of the current image to be processed. This improves the fit between the first expression intermediate state information and the person's expression state, so that changes in expression are represented more accurately. The first concatenated information can also be used to enhance the facial feature information so that it is better coordinated with the overall features; that is, enhancing the facial features with real-time overall features improves the accuracy with which the expression latent state information represents expression states and their changes. Likewise, when the posture latent state information is obtained, second concatenated information that reinforces the person's body and posture features can be obtained, and the latent state information is dynamically updated based on the second concatenated information of the current image to be processed. This improves the fit between the first posture intermediate state information and the person's posture state, so that changes in posture are represented more accurately. The second concatenated information can also be used to enhance the body feature information so that it is better coordinated with the overall features; that is, enhancing the body features with real-time overall features improves the accuracy with which the posture latent state information represents postures and actions. Furthermore, during training, the expression latent state information, the posture latent state information, and the latent state information that combines them can each be supervised separately. This improves the accuracy with which the expression latent state information represents expression categories, the posture latent state information represents posture categories, and the combined latent state information represents the person's state, which improves the overall performance of the person state recognition model. More accurate recognition of expressions and postures in turn makes the judgment of whether a person needs assistance more accurate, and makes training more efficient and better targeted. When the assistance service index is determined, it can take into account both the degree to which the person needs assistance and the ease with which the robot can reach the person's location, which makes the index more practical for deciding whom to assist.

[0067] Figure 3 shows an exemplary block diagram of an artificial intelligence-based human-machine collaborative optimization system according to an embodiment of the present invention. The system comprises: a sampling module, configured to sample the surveillance video frames captured by the surveillance camera to obtain images to be processed; a target area module, configured to detect the multiple images to be processed to obtain the target area in which each person is located in each image to be processed; an information module, configured to determine, based on the target area, the posture information and expression information of each person in each image to be processed; a to-be-assisted person module, configured to select persons to be assisted from the multiple persons based on the posture information, the expression information, and the target area; a target person module, configured to select a target person to be assisted based on the posture information and expression information of the persons to be assisted and the position information of their target areas in the images to be processed; a movement module, configured to control the movable component to move the robot to a preset position next to the location of the target person to be assisted, and to activate the robot's display device, audio device, and microphone; an assistance prompt module, configured to send an assistance prompt message to the target person to be assisted through at least one of the display device and the audio device; and a notification module, configured to, upon receiving request information from the target person to be assisted through at least one of the microphone and the display device, obtain assistance prompt information matching the request information based on a preset knowledge graph and convey it to the target person to be assisted through the display device or the audio device.

[0068] The present invention may be a method, an apparatus, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for performing various aspects of the invention.

[0069] Those skilled in the art should understand that the embodiments of the present invention described above and shown in the accompanying drawings are merely examples and do not limit the present invention. The objectives of the present invention have been fully and effectively achieved. The functions and structural principles of the present invention have been demonstrated and explained in the embodiments, and any variations or modifications may be made to the implementation of the present invention without departing from the stated principles.

[0070] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A human-machine collaborative optimization method based on artificial intelligence, characterized in that it comprises: sampling surveillance video frames captured by a surveillance camera to obtain images to be processed; detecting the multiple images to be processed to obtain the target area in which each person is located in each image to be processed; determining, based on the target area, the posture information and expression information of each person in each image to be processed; selecting persons to be assisted from the multiple persons based on the posture information, the expression information, and the target area; selecting a target person to be assisted based on the posture information and expression information of the persons to be assisted and the position information of their target areas in the images to be processed; controlling a movable component to move the robot to a preset position next to the location of the target person to be assisted, and activating the robot's display device, audio device, and microphone; sending an assistance prompt message to the target person to be assisted through at least one of the display device and the audio device; and, upon receiving request information from the target person to be assisted through at least one of the microphone and the display device, obtaining assistance prompt information matching the request information based on a preset knowledge graph, and conveying it to the target person to be assisted through the display device or the audio device.

2. The artificial intelligence-based human-machine collaborative optimization method according to claim 1, characterized in that determining, based on the target area, the posture information and expression information of each person in each image to be processed comprises: performing face detection on the target area to obtain the face area; taking the rectangular area of the target area located below the lowest point of the face area as the body area; processing the face area with a first convolutional processing layer of an image recognition model to obtain facial feature information; processing the facial feature information with a first fully connected layer and a first activation layer to obtain the expression information of the person; processing the body area with a second convolutional processing layer of the image recognition model to obtain body feature information; and processing the body feature information with a second fully connected layer and a second activation layer to obtain the posture information of the person.

3. The artificial intelligence-based human-machine collaborative optimization method according to claim 2, characterized in that selecting persons to be assisted from the multiple persons based on the posture information, the expression information, and the target area comprises: obtaining the facial feature information, body feature information, expression information, and posture information of the j-th person in the i-th image to be processed; processing the target area of the j-th person in the i-th image to be processed with a third convolutional processing layer of a trained person state recognition model to obtain regional feature information; processing, with the trained person state recognition model, the regional feature information, the expression information, the facial feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed to obtain the expression latent state information of the j-th person in the i-th image to be processed, wherein, when i=1, the latent state information of the j-th person in the (i-1)-th image to be processed is a zero vector; processing, with the trained person state recognition model, the regional feature information, the posture information, the body feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed to obtain the posture latent state information of the j-th person in the i-th image to be processed; obtaining the latent state information of the j-th person in the i-th image to be processed based on the expression latent state information and the posture latent state information; when i=n, concatenating the regional feature information, the posture information, and the expression information to obtain third concatenated information, where n is the number of images to be processed; processing the third concatenated information with a fifth multilayer perceptron network layer of the trained person state recognition model to obtain a fifth weight matrix; multiplying the fifth weight matrix by the latent state information of the j-th person in the n-th image to be processed to obtain a state feature vector of the j-th person; processing the state feature vector of the j-th person with the sixth layer of the trained person state recognition model to obtain judgment probability information indicating whether the j-th person needs assistance; and selecting the persons to be assisted from the multiple persons based on the judgment probability information of each person.

4. The artificial intelligence-based human-machine collaborative optimization method according to claim 3, characterized in that processing, with the trained person state recognition model, the regional feature information, the expression information, the facial feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed to obtain the expression latent state information of the j-th person in the i-th image to be processed comprises: concatenating the regional feature information and the expression information to obtain first concatenated information; processing the first concatenated information with a first multilayer perceptron network layer of the trained person state recognition model to obtain a first weight matrix; multiplying the first weight matrix by the latent state information of the j-th person in the (i-1)-th image to be processed to obtain first expression intermediate state information; processing the first concatenated information with a second multilayer perceptron network layer of the trained person state recognition model to obtain a second weight matrix; multiplying the second weight matrix by the facial feature information to obtain second expression intermediate state information; and obtaining the expression latent state information of the j-th person in the i-th image to be processed based on the first expression intermediate state information and the second expression intermediate state information.

5. The artificial intelligence-based human-machine collaborative optimization method according to claim 3, characterized in that processing, with the trained person state recognition model, the regional feature information, the posture information, the body feature information, and the latent state information of the j-th person in the (i-1)-th image to be processed to obtain the posture latent state information of the j-th person in the i-th image to be processed comprises: concatenating the regional feature information and the posture information to obtain second concatenated information; processing the second concatenated information with a third multilayer perceptron network layer of the trained person state recognition model to obtain a third weight matrix; multiplying the third weight matrix by the latent state information of the j-th person in the (i-1)-th image to be processed to obtain first posture intermediate state information; processing the second concatenated information with a fourth multilayer perceptron network layer of the trained person state recognition model to obtain a fourth weight matrix; multiplying the fourth weight matrix by the body feature information to obtain second posture intermediate state information; and obtaining the posture latent state information of the j-th person in the i-th image to be processed based on the first posture intermediate state information and the second posture intermediate state information.

6. The artificial intelligence-based human-machine collaborative optimization method according to claim 3, characterized in that the training of the person state recognition model comprises: obtaining the training facial feature information, training body feature information, training expression information, and training posture information of the trainee in the t-th training image, and the training regional feature information of the area in which the trainee is located; processing, with the person state recognition model, the training regional feature information, the training expression information, the training facial feature information, and the training latent state information of the trainee in the (t-1)-th training image to obtain the training expression latent state information of the trainee in the t-th training image; processing, with the person state recognition model, the training regional feature information, the training posture information, the training body feature information, and the training latent state information of the trainee in the (t-1)-th training image to obtain the training posture latent state information of the trainee in the t-th training image; obtaining the training latent state information of the trainee in the t-th training image based on the training expression latent state information and the training posture latent state information; concatenating the training regional feature information and the training expression information, and inputting the concatenation result into a seventh multilayer perceptron network layer to obtain a seventh weight matrix; multiplying the seventh weight matrix by the training expression latent state information to obtain a training expression category feature vector; inputting the training expression category feature vector into a third fully connected layer and a third activation layer to obtain training expression category probability information; concatenating the training regional feature information and the training posture information, and inputting the concatenation result into an eighth multilayer perceptron network layer to obtain an eighth weight matrix; multiplying the eighth weight matrix by the training posture latent state information to obtain a training posture category feature vector; inputting the training posture category feature vector into a fourth fully connected layer and a fourth activation layer to obtain training posture category probability information; obtaining, based on the training latent state information, training judgment probability information indicating whether the trainee needs assistance; determining the loss function of the person state recognition model based on the training expression category probability information, the training posture category feature vector, the training judgment probability information, and the annotation information of the trainee; and training the person state recognition model based on the loss function to obtain the trained person state recognition model.

7. The artificial intelligence-based human-machine collaborative optimization method according to claim 6, characterized in that determining the loss function of the person state recognition model based on the training expression category probability information, the training posture category feature vector, the training judgment probability information, and the annotation information of the trainee comprises: determining the loss function LOSS of the person state recognition model according to a formula [formula image not reproduced] whose terms are: the probability that the trainee's expression belongs to the k-th expression category, determined from the training expression category probability information of the t-th training image; the probability that the trainee's expression belongs to the k-th expression category, determined from the annotation information; the probability that the trainee's posture belongs to the s-th posture category, determined from the training posture category probability information of the t-th training image; the probability that the trainee's posture belongs to the s-th posture category, determined from the annotation information; the probability that the trainee needs assistance, determined from the training judgment probability information of the t-th training image; the probability that the trainee needs assistance, determined from the annotation information of the trainee; and the preset weights. N is the number of expression categories, M is the number of posture categories, and T is the number of training images that include the trainee; t≤T, k≤N, s≤M, and t, T, k, N, s, and M are all positive integers.

8. The artificial intelligence-based human-machine collaborative optimization method according to claim 3, characterized in that selecting the target person to be assisted based on the posture information and expression information of the persons to be assisted and the position information of their target areas in the images to be processed comprises: determining first spatial coordinate data of each person to be assisted based on the position information of that person's target area in the last image to be processed and the calibration parameters of the surveillance camera; obtaining second spatial coordinate data of the robot; obtaining an assistance service index for each person to be assisted based on the first spatial coordinate data, the second spatial coordinate data, and the judgment probability information of that person; and selecting the target person to be assisted based on the assistance service index of each person to be assisted.

9. The artificial intelligence-based human-machine collaborative optimization method according to claim 8, characterized in that obtaining the assistance service index of the person to be assisted based on the first spatial coordinate data, the second spatial coordinate data, and the judgment probability information of the person to be assisted comprises: obtaining the assistance service index of the u-th person to be assisted according to a formula [formula image not reproduced] whose terms are: the judgment probability information of the u-th person to be assisted; the planar coordinates given by the first spatial coordinate data of the u-th person to be assisted; the planar coordinates given by the second spatial coordinate data of the robot; the line connecting these two sets of planar coordinates; the region occupied by the h-th obstacle in the space; and the number of obstacles. u, h, and the number of obstacles are all positive integers, and if is a conditional function.

10. A human-machine collaborative optimization system based on artificial intelligence, characterized in that it comprises: a sampling module, configured to sample the surveillance video frames captured by the surveillance camera to obtain images to be processed; a target area module, configured to detect the multiple images to be processed to obtain the target area in which each person is located in each image to be processed; an information module, configured to determine, based on the target area, the posture information and expression information of each person in each image to be processed; a to-be-assisted person module, configured to select persons to be assisted from the multiple persons based on the posture information, the expression information, and the target area; a target person module, configured to select a target person to be assisted based on the posture information and expression information of the persons to be assisted and the position information of their target areas in the images to be processed; a movement module, configured to control the movable component to move the robot to a preset position next to the location of the target person to be assisted, and to activate the robot's display device, audio device, and microphone; an assistance prompt module, configured to send an assistance prompt message to the target person to be assisted through at least one of the display device and the audio device; and a notification module, configured to, upon receiving request information from the target person to be assisted through at least one of the microphone and the display device, obtain assistance prompt information matching the request information based on a preset knowledge graph and convey it to the target person to be assisted through the display device or the audio device.