Police practical combat digital human interactive AI auxiliary training method

By detecting and labeling reflective highlight areas in training image sequences, structural information is restored and continuous motion trajectories are constructed. This solves the problem of loss of motion structural information under fixed lighting conditions and achieves stability and continuity in motion recognition and interaction determination.

CN122200749A · Pending · Publication Date: 2026-06-12 · HANGZHOU LIDUN SECURITY TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HANGZHOU LIDUN SECURITY TECHNOLOGY CO LTD
Filing Date: 2026-05-09
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In a fixed indoor training room, specular reflections create localized high-brightness areas that erase key motion structure information, which makes motion recognition and interaction judgment more difficult. Existing technologies cannot achieve stable analysis and continuous expression of key motions under fixed lighting conditions.

Method used

By detecting and marking reflective highlight areas in the collected training image sequences of trainees, a reflective area feature sequence is generated. Based on these feature sequences, boundary extension and internal morphology compensation are performed to restore structural information. A continuous structural chain is constructed through cross-frame matching to generate an action structure trajectory sequence. Finally, the training results are output and the training parameters are updated in a closed loop.

Benefits of technology

It effectively solves the problem of texture and gradient loss caused by reflective highlights, ensures the integrity of action structure in time series, and improves the reliability of interaction judgment and training effect.

✦ Generated by Eureka AI based on patent content.

Figure CN122200749A_ABST

Abstract

The present disclosure provides a police combat digital human interactive AI-assisted training method, comprising: performing reflective highlight area detection and marking on the collected training image sequence of the trainee to generate a reflective area feature sequence; generating a structure restoration image sequence by boundary extension and internal morphology compensation, working inward from the real structure area outside the reflective area; extracting structural fragments of the trainee's key parts and matching them across frames to construct a continuous structure chain and generate a continuous action structure trajectory sequence; extracting action semantics and determining the contact relationship with the digital human to generate an interaction state sequence; and, based on a comparison of the interaction state sequence with the preset training target, outputting a training result index and updating the training parameters in a closed loop. The method addresses the loss of key action structure information caused by local reflective highlights under fixed lighting conditions and enables stable analysis and continuous expression of key actions.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence-assisted training technology, and in particular to a police combat digital human interactive AI-assisted training method.

Background Technology

[0002] In fixed indoor training rooms, lighting systems typically employ overhead surface light sources or side-mounted supplemental lighting strips, maintaining a stable illumination angle, intensity, and distribution throughout the space. When trainees perform dynamic movements, areas with different reflective properties, such as the forehead, bridge of the nose, backs of the hands, and metal parts of police equipment, form specular reflection paths with the light source in specific postures, producing locally bright areas in the image. These areas often appear as nearly pure white spots, with pixel grayscale values approaching the sensor's upper limit, so that texture information and edge gradients within them are completely lost.

[0003] The problem is not merely one of excessive brightness: it destroys structural information. For example, in law enforcement action training, when trainees perform actions such as pointing a weapon or rapid wrist control, the hand area is often at a critical angle between the light source and the camera. When the back of the hand or the surface of the weapon produces bright reflections, the detail originally used to identify finger opening and closing and grip posture is completely obscured, making it impossible to decompose the action structure. Similarly, in facial recognition and emotion judgment training, when trainees tilt their heads up or turn them to the side, bright bands easily form on the forehead and bridge of the nose. These bands cross the key areas around the eyebrows and eyes, breaking the local facial structure and affecting the judgment of gaze direction or expression changes. Furthermore, the problem is dynamically triggered but stable in its cause: the position of the reflective area moves with the action, but its formation always depends on a fixed relationship between the light source and the reflection angle. In continuous image frames, bright areas appear and disappear periodically in key areas, so the same action appears structurally discontinuous over time, which further increases the difficulty of action recognition and interaction judgment.

[0004] Therefore, there is an urgent need for a police combat digital human interactive AI-assisted training method that can solve the loss of key action structural information caused by local reflective highlights under fixed lighting conditions, so as to achieve stable analysis and continuous expression of key actions.

Summary of the Invention

[0005] In view of this, to solve the problems of the existing technology, this application provides a police combat digital human interactive AI-assisted training method.

[0006] In a first aspect, this disclosure provides a police combat digital human interactive AI-assisted training method, the method comprising:

[0007] S1. Perform reflective highlight region detection and labeling processing on the collected training image sequence of trainees to generate reflective region feature sequence;

[0008] S2. Based on the reflective region feature sequence, generate a structure restoration image sequence by boundary extension and internal morphology compensation, working inward from the real structural region outside the reflective region;

[0009] S3. Based on the structure restoration image sequence, extract structural segments of key parts of the trainee and perform cross-frame matching to construct a continuous structure chain and generate a continuous action structure trajectory sequence;

[0010] S4. Based on the action structure trajectory sequence, extract action semantics and determine the contact relationship with the digital human, and generate an interaction state sequence;

[0011] S5. Based on the comparison results between the interaction state sequence and the preset training target, output the training result index and update the training parameters in a closed loop.

[0012] Optionally, S1 includes:

[0013] The training image sequence is normalized from color to grayscale. In the obtained normalized grayscale image, a candidate region for reflective highlights is constructed based on joint detection of brightness and local gradient decay.

[0014] A spatial continuity constraint is applied to the candidate reflective bright areas, and discrete pseudo-bright fragments are removed to generate a reflective area mask;

[0015] The reflective area mask is uniformly encoded to generate a reflective area feature sequence that includes geometric attributes and temporal stability.

[0016] Optionally, S2 includes:

[0017] Based on the reflective region feature sequence, a boundary seed set is extracted from the outer expansion ring outside the reflective region;

[0018] Using the boundary seed set, the trend, curvature, and grayscale variation patterns carried by the boundary seeds are propagated into the interior of the reflective area to generate an extended boundary set;

[0019] Based on the extended boundary set, morphological compensation for grayscale and gradient propagation is performed on the interior of the reflective area to generate an internal morphological compensation image.

[0020] The internal morphology compensation image is then fused with the original non-reflective area through a boundary transition to generate a structure restoration image sequence.

[0021] Optionally, the morphological compensation for grayscale and gradient propagation within the reflective region based on the extended boundary set includes:

[0022] Based on the distance from the point to be compensated within the reflective area to the nearest extended boundary, and combined with the average grayscale information and average gradient information on the extended boundary, the compensation grayscale value of the point to be compensated is calculated. The compensation grayscale value includes an attenuation term for inheriting the boundary grayscale and a gradient term for preserving the structural continuity trend.

[0023] Optionally, S3 includes:

[0024] Candidate structural fragments of key parts of the trainee are extracted from the structure restoration image sequence to form a candidate structural fragment sequence. The key parts include the trainee's hands, face, and distal forearm region, which are directly related to action judgment.

[0025] The spatial location and morphological features of candidate structural segments between adjacent frames are matched to establish a set of candidate matching relationships;

[0026] Based on the candidate matching relationship set, a cross-frame continuous structure chain is constructed. A continuity confidence is calculated for each chain and compared with a preset confidence threshold; any chain whose continuity confidence is lower than the threshold is judged to be a pseudo chain and removed, leaving the continuous structure chains retained after screening.

[0027] The filtered and retained continuous structural chain is transformed into a sequence of motion structure trajectories containing the center position, main direction angle and equivalent scale changes of the corresponding key parts in each frame.

[0028] Optionally, the construction of the cross-frame continuous structure chain further includes:

[0029] For continuous structure chains with short-term gaps, linear interpolation is allowed to fill gaps of no more than 2 frames. The interpolation results are verified against the candidate structural segments of the corresponding frames in the structure restoration image sequence. When a candidate structural segment with similar geometric features exists near an interpolated position, the actually observed value of that segment replaces the interpolation result.
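
The gap-filling rule in [0029] can be illustrated with a minimal Python sketch. It assumes that a chain is a list of per-frame center positions with `None` marking a missing frame, and that "similar geometric features" is reduced to a position check within a hypothetical `radius`. The names `fill_gaps`, `candidates`, and `radius` are illustrative and do not come from the disclosure.

```python
import math
from typing import Dict, List, Optional, Tuple

Pos = Tuple[float, float]  # center position (x, y) of a key part in one frame


def fill_gaps(chain: List[Optional[Pos]],
              candidates: Dict[int, List[Pos]],
              max_gap: int = 2,
              radius: float = 3.0) -> List[Optional[Pos]]:
    """Fill gaps of at most `max_gap` frames by linear interpolation.

    If an observed candidate lies within `radius` of an interpolated
    position, the observed value replaces the interpolated one.
    """
    out = list(chain)
    i = 0
    while i < len(out):
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < len(out) and out[i] is None:
            i += 1
        gap = i - start
        # Skip gaps at the chain ends and gaps longer than allowed.
        if start == 0 or i == len(out) or gap > max_gap:
            continue
        (x0, y0), (x1, y1) = out[start - 1], out[i]
        for k in range(start, i):
            t = (k - start + 1) / (gap + 1)
            guess = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            near = [c for c in candidates.get(k, [])
                    if math.dist(c, guess) <= radius]
            # Prefer the closest real observation over the interpolated value.
            out[k] = min(near, key=lambda c: math.dist(c, guess)) if near else guess
    return out


chain = [(0.0, 0.0), None, None, (3.0, 3.0), None, None, None, (7.0, 7.0)]
filled = fill_gaps(chain, candidates={2: [(2.2, 1.9)]})
```

In this example the two-frame gap is filled (frame 2 takes the observed candidate), while the three-frame gap exceeds the limit and stays empty.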

[0030] Optionally, S4 includes:

[0031] Based on the motion structure trajectory sequence, a temporal decomposition is performed according to the comparison results of the motion segment activity of key parts in each frame with a preset activity threshold, and a set of motion temporal decomposition segments is generated.

[0032] For each action temporal decomposition segment, calculate its action direction intensity, and combine the direction evolution type and duration features to generate an action semantic candidate set;

[0033] By combining the action semantic candidate set with the spatial distance from the key parts of the trainee to the target parts of the digital human, the trend of that distance, and the number of frames for which contact lasts, a contact relationship determination result set is generated;

[0034] The interaction response priority value is calculated based on the action semantic candidate set and the contact relationship determination result set, and the interaction state sequence corresponding to the digital human interaction response is generated based on the priority value.

[0035] Optionally, S5 includes:

[0036] The interaction state sequence is aligned segment by segment with the preset training target to generate an interaction state alignment result set that includes single segment alignment score, duration deviation and order deviation.

[0037] Based on the interaction state alignment result set, a training result index set is calculated, which includes the trainee's action completion degree and the interaction matching degree with the digital human.

[0038] Based on the deviation between the training result index set and the preset target threshold, an adjustment suggestion set for the training parameters is generated;

[0039] The training parameters are smoothly updated based on the parameter adjustment suggestion set, and the training results for this round are output.

[0040] In a second aspect, this disclosure provides an electronic device including a memory and at least one processor, the memory storing a computer program, and the processor executing the computer program to implement the method of the first aspect described above.

[0041] In a third aspect, this disclosure provides a computer storage medium storing a computer program that, when executed, implements the method described in the first aspect.

[0042] Compared with the prior art, the present invention has the following beneficial effects:

[0043] (1) To address the loss of key action structure information caused by local reflective highlights under fixed indoor training room lighting, the method detects and marks reflective highlight areas, then restores structural boundaries and reconstructs local information by using the stable boundary outside the reflective area for boundary extension and morphology compensation, so that a continuous and recognizable contour structure can be restored in a single frame image. This effectively solves the problem of texture and gradient loss caused by reflection, and enables key details such as hand grip posture, finger opening and closing state, and facial expression to be stably resolved.

[0044] (2) To address the problem of discontinuous image frame structure caused by the periodic appearance and disappearance of reflective areas with changes in motion, a continuous structural chain spanning the time axis is constructed by cross-frame structural consistency constraints and continuous reconstruction of motion trajectories. Short-term gaps are then filled and verified, eliminating the break in motion structure caused by the intermittent appearance of reflective areas. This ensures the integrity of the same action in the time sequence and improves the reliability of interactive judgment.

[0045] (3) To address the accuracy of action recognition and the consistency of interactive responses after reflection repair, key action semantics are extracted and interaction states are generated, so that continuous action trajectories are transformed into action segments and contact relationships with clear semantics. Combined with training result feedback and closed-loop updating, training parameters are adaptively adjusted according to action completion and interaction matching degree, so that the digital human can generate reasonable interactive responses based on stable action sequences, which significantly improves the intelligence level and training effect of police combat digital human interactive training.

Attached Figure Description

[0046] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.

[0047] Figure 1 is a flowchart of the police combat digital human interactive AI-assisted training method provided in an embodiment of this disclosure.

[0048] Figure 2 is a flowchart illustrating cross-frame matching and continuous structure chain construction provided in an embodiment of this disclosure.

[0049] Figure 3 is a flowchart illustrating the action semantic extraction and interaction state generation process provided in an embodiment of this disclosure.

[0050] The accompanying drawings illustrate specific embodiments of this disclosure, which are described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concepts of this disclosure to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0051] The present disclosure will be further described below with reference to the accompanying drawings. The following embodiments are only used to illustrate the technical solutions of the present disclosure more clearly, and should not be used to limit the scope of protection of the present disclosure.

[0052] The components of the embodiments of the invention described and illustrated herein can typically be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.

[0053] In the following, the terms “comprising,” “having,” and their cognates, which may be used in various embodiments of the invention, are intended only to indicate a particular feature, number, step, operation, element, component, or combination thereof, and should not be construed as excluding the presence of one or more other features, numbers, steps, operations, elements, components, or combinations thereof, or the possibility of adding them.

[0054] Unless otherwise specified, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the invention pertain. Terms (such as those defined in commonly used dictionaries) shall be interpreted as having the same meaning as their contextual meaning in the relevant technical field and shall not be interpreted as having an idealized or overly formal meaning, unless clearly defined in the various embodiments of the invention.

[0055] Figure 1 shows a flowchart of the police combat digital human interactive AI-assisted training method provided in an embodiment of this disclosure. As shown in Figure 1, the method may include the following steps:

[0056] S1: Perform reflective highlight region detection and labeling processing on the collected training image sequence of trainees to generate reflective region feature sequence.

[0057] The acquired training image sequence undergoes reflective highlight region detection and labeling. Regions where pixel grayscale values are close to saturation and local gradients drop sharply are identified and, combined with spatial continuity constraints, form a set of reflective region masks. The position, area, and shape of the reflective regions in each frame are uniformly encoded to generate a reflective region feature sequence. This is specifically achieved through the following sub-steps.

[0058] S1.1: Perform a color-to-grayscale normalization transformation on the training image sequence to obtain a normalized grayscale image sequence.

[0059] In the interactive training of digital humans for police combat, the raw images captured by cameras are typically color images, with each pixel containing three color channels: red, green, and blue. To put the subsequent brightness discrimination and gradient-drop discrimination on a common scale, and to avoid bias in the reflection detection results caused by the different color channels, each frame is first converted to normalized grayscale. Specifically, for the pixel at coordinates $(x, y)$ in the t-th frame, the intensity values of its red, green, and blue channels are denoted $R_t(x, y)$, $G_t(x, y)$, and $B_t(x, y)$; all three are integers between 0 and 255. The normalized grayscale value $I_t(x, y)$ is then calculated using the following formula: $I_t(x, y) = \dfrac{0.299\,R_t(x, y) + 0.587\,G_t(x, y) + 0.114\,B_t(x, y)}{255}$. In this formula, the coefficients 0.299, 0.587, and 0.114 for the three channels are the classic weights reflecting the human eye's sensitivity to the brightness of different colors. Forming the weighted sum and dividing by 255 maps the result to the range 0 to 1. After this transformation, the original color image becomes a normalized grayscale image with values between 0 and 1. This places the subsequent highlight intensity discrimination and local gradient drop discrimination in the same scalar domain, preventing color deviations from interfering with reflection detection.

[0060] This sub-step yields a normalized grayscale image sequence for each frame, which serves as the basis for subsequent reflective candidate region detection.
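
As an illustration of sub-step S1.1, the following minimal Python sketch applies the weighted sum with the coefficients 0.299, 0.587, and 0.114 and divides by 255. The function name `normalize_gray` and the nested-list frame layout are assumptions made for the example, not part of the disclosure.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each an integer in 0..255


def normalize_gray(frame: List[List[Pixel]]) -> List[List[float]]:
    """Convert one RGB frame to normalized grayscale values in [0, 1]."""
    return [
        [(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for (r, g, b) in row]
        for row in frame
    ]


# A 2x2 frame: white, black, pure red, pure green.
frame = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (0, 255, 0)]]
gray = normalize_gray(frame)
```

A saturated white pixel maps to approximately 1.0 and a black pixel to 0.0, so a single threshold scale can be used in the later brightness test.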

[0061] S1.2: In the normalized grayscale image, construct the candidate region for reflective highlights based on the joint detection of brightness and local gradient attenuation.

[0062] The lighting system in the fixed indoor training room is stable. Reflective areas are not characterized by high brightness alone; more importantly, they are bright but lack texture: their internal grayscale is close to saturation and, although brightness may change abruptly at the border with the surrounding area, the local gradient inside the region decays rapidly, so that detail is almost completely lost. Therefore, after the normalized grayscale image sequence is obtained, both a brightness term and a gradient decay term are introduced to detect candidate reflective highlight areas. For each frame, a local neighborhood is defined centered on the current pixel coordinates. The radius r of this neighborhood is typically set to 3 to 7 pixels, which captures local texture features without introducing excessive background interference from an overly large neighborhood. Within this neighborhood, a local gradient decay value is defined for the center pixel. Its formula uses the following quantities: the total number of pixels within the neighborhood; the coordinates of each pixel in the neighborhood; the absolute value of the grayscale difference between each neighboring pixel and the center pixel; a brightness enhancement factor, with a value ranging from 1.5 to 3.0, used to amplify the effect of high grayscale areas; and a gradient suppression coefficient, ranging from 1.0 to 2.5, used to penalize regions with strong local texture. According to the formula, when the grayscale value of the center pixel is relatively high and the average grayscale difference within its neighborhood is small, indicating weak local texture and few structural details, the local gradient decay value increases significantly, which effectively highlights the reflective area. Conversely, if the neighborhood has rich texture and drastic grayscale changes, the average grayscale difference in the denominator is larger and the value is suppressed, which avoids misjudging normally textured areas as reflections.

[0063] Once the local gradient decay value has been calculated, the pixel is tested against preset brightness and response thresholds. The brightness threshold is usually set to 0.82 to 0.95 to select sufficiently bright pixels; the response threshold is usually set to 0.70 to 0.90 to select pixels with significant local gradient decay. Only pixels whose normalized grayscale value exceeds the brightness threshold and whose local gradient decay value exceeds the response threshold are marked as reflective highlight candidate pixels. For different target parts such as faces, hands, and metal parts of police equipment, different weight parameters can be set during calculation, but the definition of the response value remains the same so that the terminology stays consistent.

[0064] This sub-step obtains the positions of all reflective highlight candidate pixels that meet the conditions in each frame, thus forming a reflective highlight candidate region sequence.
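
The joint brightness and gradient-decay test of sub-step S1.2 can be sketched in Python as follows. The exact expression for the local gradient decay value is not reproduced in this text, so the sketch assumes one form that matches the description: the center grayscale raised to a brightness enhancement factor `alpha`, divided by one plus a gradient suppression coefficient `beta` times the mean absolute grayscale difference in the neighborhood. That form, the function names, and the default parameter values (chosen inside the stated ranges) are all assumptions.

```python
from typing import List

Gray = List[List[float]]  # normalized grayscale frame, values in [0, 1]


def decay_response(gray: Gray, x: int, y: int, r: int = 3,
                   alpha: float = 2.0, beta: float = 2.0) -> float:
    """ASSUMED form: I^alpha / (1 + beta * mean|I_neighbor - I_center|)."""
    h, w = len(gray), len(gray[0])
    center = gray[y][x]
    diffs = [
        abs(gray[v][u] - center)
        for v in range(max(0, y - r), min(h, y + r + 1))
        for u in range(max(0, x - r), min(w, x + r + 1))
        if (u, v) != (x, y)
    ]
    mean_diff = sum(diffs) / len(diffs) if diffs else 0.0
    return center ** alpha / (1.0 + beta * mean_diff)


def candidate_mask(gray: Gray, bright_thr: float = 0.9,
                   resp_thr: float = 0.8, **kw) -> List[List[bool]]:
    """Mark pixels that pass both the brightness and the response threshold."""
    return [
        [gray[y][x] > bright_thr and decay_response(gray, x, y, **kw) > resp_thr
         for x in range(len(gray[0]))]
        for y in range(len(gray))
    ]


# A flat saturated patch versus a bright but strongly textured checkerboard.
flat = [[1.0] * 5 for _ in range(5)]
checker = [[1.0 if (x + y) % 2 == 0 else 0.2 for x in range(5)] for y in range(5)]
flat_mask = candidate_mask(flat)
checker_mask = candidate_mask(checker)
```

Under these assumptions the flat saturated patch is marked everywhere, while the checkerboard is rejected because its neighborhood differences suppress the response.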

[0065] S1.3: Apply spatial continuity constraints to the candidate reflective bright areas and remove discrete pseudo-bright fragments to generate a reflective area mask.

[0066] Reflective highlight candidate regions obtained solely through pixel-level detection often contain scattered bright spots. These bright spots may originate from the edges of local light strip reflections, bright spots from clothing creases, etc., and are not true reflective interference areas but rather pseudo-brightness fragments. Therefore, it is necessary to perform spatial continuity constraint processing on the candidate regions to eliminate discrete noise and pseudo-brightness fragments, retaining only true reflective regions with moderate area, stable internal response, and relatively continuous boundaries.

[0067] In practice, an eight-neighbor connected component labeling algorithm is used to divide candidate pixels into connected regions, grouping adjacent candidate pixels into the same connected region. For the k-th candidate connected region in frame t, three quantities are measured: its area, the length of its circumscribed boundary, and the average of the candidate response values of all pixels within the region (the average value per unit area). A region validity criterion value is then defined from these quantities using three coefficients: an area weighting coefficient, with a value ranging from 0.5 to 1.5, which controls the degree to which area contributes to validity; a response weighting coefficient, ranging from 1.0 to 2.0, which controls the importance of the average response value; and a shape penalty coefficient, ranging from 0.5 to 1.5, which controls how strongly shape dispersion is penalized. The formula contains a dimensionless shape-dispersion term. The closer its value is to 1, the closer the region's shape is to a circle or ellipse and the more regular its boundary. When the region's boundary is too fragmented, too thin, or has sharp edges, this term increases and the validity criterion value decreases. Only connected regions whose validity criterion value exceeds a preset validity threshold are considered valid reflective regions.

[0068] After this sub-step, the reflective area mask for each frame is finally output. Each mask corresponds to a real reflective highlight area, and these masks will be used for subsequent encoding and restoration processing.
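
Sub-step S1.3 can be illustrated with a short Python sketch: a breadth-first eight-neighbor labeling followed by a validity filter. The labeling follows the text. The validity score is an assumed form, since the formula is not reproduced here: area and mean response raised to their weights, divided by a dispersion term taken as perimeter squared over 4π times area. The names `label_regions`, `perimeter`, and `validity`, the perimeter measure, and the threshold of 5.0 are hypothetical.

```python
import math
from collections import deque
from typing import List, Set, Tuple

Point = Tuple[int, int]  # (x, y)


def label_regions(mask: List[List[bool]]) -> List[Set[Point]]:
    """Group candidate pixels into 8-neighbor connected regions (BFS)."""
    h, w = len(mask), len(mask[0])
    seen: Set[Point] = set()
    regions: List[Set[Point]] = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or (x, y) in seen:
                continue
            region: Set[Point] = set()
            queue = deque([(x, y)])
            seen.add((x, y))
            while queue:
                cx, cy = queue.popleft()
                region.add((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h and mask[ny][nx]
                                and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
            regions.append(region)
    return regions


def perimeter(region: Set[Point]) -> int:
    """Boundary length counted as exposed 4-neighbor pixel edges."""
    return sum(
        (nx, ny) not in region
        for (x, y) in region
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
    )


def validity(region: Set[Point], mean_response: float,
             a: float = 1.0, b: float = 1.5, c: float = 1.0) -> float:
    """ASSUMED score: area^a * response^b / dispersion^c."""
    area = len(region)
    dispersion = perimeter(region) ** 2 / (4 * math.pi * area)
    return area ** a * mean_response ** b / dispersion ** c


# An 8x8 mask with a 4x4 block, one isolated pixel, and a thin 1x6 line.
mask = [[False] * 8 for _ in range(8)]
for y in range(4):
    for x in range(4):
        mask[y][x] = True
mask[0][6] = True
for x in range(1, 7):
    mask[6][x] = True

regions = label_regions(mask)
kept = [r for r in regions if validity(r, mean_response=1.0) > 5.0]
```

With this assumed score only the compact 4x4 block survives; the isolated pixel and the thin line are discarded as pseudo-bright fragments.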

[0069] S1.4: The reflective area mask is uniformly encoded to generate a reflective area feature sequence that includes geometric attributes and temporal stability.

[0070] After the reflective area mask for each frame is obtained, the regions are uniformly encoded to form a feature sequence that can be used directly in subsequent steps. The encoded content should at least include the region's center position, area, aspect ratio, principal orientation angle, and region stability index. For the k-th reflective region in frame t, the following quantities are recorded: its center coordinates, its area, its major-axis length, its minor-axis length, and the center distance between this region and the spatially matching region in the previous frame. The stability index of the region is then defined from these quantities. According to the formula, a region with a relatively large area and a shape that is not excessively elongated has a higher stability index; if, in addition, the position of the region changes little relative to the corresponding region in the previous frame, the index increases further. Conversely, if the region is elongated or its position changes significantly across frames, the stability index decreases accordingly. In this way, the temporal stability of reflective regions can be quantitatively evaluated.

[0071] After encoding, the features of each reflective region in each frame are arranged in chronological order to form a reflective region feature sequence. This sequence not only records the geometric and morphological attributes of each reflective region in a single frame, but also includes cross-frame correspondence and stability information, thus providing sufficient input for structural boundary restoration and local information reconstruction in step S2.
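
The encoding in sub-step S1.4 can be sketched as follows. The center, area, axis lengths, and principal orientation angle are computed from second-order central moments of the pixel coordinates (the equivalent-ellipse convention), which is a standard technique chosen for the example. The stability index uses an assumed form, area divided by elongation times one plus the center shift, that only mirrors the qualitative behavior described above. The name `encode_region` and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

Point = Tuple[int, int]  # (x, y)


def encode_region(pixels: Iterable[Point],
                  prev_center: Optional[Tuple[float, float]] = None) -> Dict[str, object]:
    """Encode one reflective region as geometric features plus a stability index."""
    pts = list(pixels)
    area = len(pts)
    cx = sum(x for x, _ in pts) / area
    cy = sum(y for _, y in pts) / area
    # Second-order central moments of the pixel coordinates.
    sxx = sum((x - cx) ** 2 for x, _ in pts) / area
    syy = sum((y - cy) ** 2 for _, y in pts) / area
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / area
    # Eigenvalues of the 2x2 covariance matrix give the axis variances.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_max, lam_min = half_trace + root, max(half_trace - root, 0.0)
    major = 4 * math.sqrt(lam_max)  # full axis lengths of the equivalent ellipse
    minor = 4 * math.sqrt(lam_min)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    shift = 0.0 if prev_center is None else math.hypot(cx - prev_center[0],
                                                      cy - prev_center[1])
    elongation = major / max(minor, 1e-6)
    # ASSUMED stability form: larger area raises it; elongation and shift lower it.
    stability = area / (elongation * (1.0 + shift))
    return {"center": (cx, cy), "area": area, "major": major, "minor": minor,
            "angle": angle, "stability": stability}


block = [(x, y) for x in range(4) for y in range(2)]  # a 4x2 pixel block
still = encode_region(block, prev_center=(1.5, 0.5))
moved = encode_region(block, prev_center=(1.5, 3.5))
```

The same block scores lower when its matched region in the previous frame was three pixels away, which is the behavior the stability index is meant to capture.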

[0072] In the technical solution of this disclosure, frame-by-frame detection and marking of reflective highlight areas in the training image sequence makes it possible to accurately identify the local bright areas generated by specular reflection under fixed lighting conditions. Pseudo-bright fragments are eliminated through spatial continuity constraints, and a stable reflective area mask and feature sequence are generated. This provides reliable basic information for subsequent structural compensation and effectively solves the difficulty of accurately locating reflective areas.

[0073] S2: Based on the reflective region feature sequence, generate a structure restoration image sequence by boundary extension and internal morphology compensation, working inward from the real structural region outside the reflective region.

[0074] Based on the reflective region feature sequence output in step S1, boundary extension and internal morphology compensation are performed on the reflective regions in each frame of the image. By propagating boundary information inward from the real structure preserved outside the reflective region, a structure restoration image sequence with continuous contours is generated. This is specifically achieved through the following sub-steps.

[0075] S2.1: Based on the reflective region feature sequence, extract the boundary seed set from the outer expansion ring outside the reflective region.

[0076] From the reflective region feature sequence generated in step S1, extract the stable outer boundary of each reflective region in each frame, and use these stable boundaries as the basis for subsequent boundary extension processing. Since the grayscale inside the reflective region is close to saturation and the texture information has been lost, it is impossible to obtain reliable structural information directly from the inside. Therefore, it is necessary to obtain boundary seeds from the real structural parts outside the reflective region that are not yet covered by highlights.

[0077] For the k-th reflective area in the t-th frame, an outer expansion ring is first constructed based on the center position, area, major-axis length, and minor-axis length given in step S1. This outer ring is a ring-shaped area extending outward from the boundary of the reflective area, typically 3 to 9 pixels wide. It is used to extract edge information that is not covered by highlights but is adjacent to the true contour of the target. The grayscale gradient intensity and the consistency of gradient direction are then calculated within this outer ring. Edge segments with high continuity are retained, while isolated short edges and sharp, noisy edges are removed, thus forming a boundary seed set.

[0078] Specifically, assume that a number of candidate edge points are detected within the outer ring. For the i-th candidate edge point, its gradient magnitude and gradient direction angle are recorded, and the average gradient direction angle of all candidate edge points within the ring is also computed. The boundary stability value of the outer ring of the reflective region is then defined from these quantities using two coefficients: a gradient enhancement coefficient, ranging from 1.2 to 2.5, which amplifies the influence of edge points with high gradient magnitudes; and a directional discrepancy penalty coefficient, ranging from 1.0 to 2.0, which penalizes edge points with inconsistent directions. Under this definition, the larger the gradient magnitudes and the more consistent the directions of the edge points within the outer ring, the stronger the continuity of the true structure outside that region, and the more suitable it is as a seed source for restoring the boundary of the reflective area.
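
The relationship described above can be illustrated with a minimal Python sketch. The functional form used here (mean of magnitude raised to the enhancement coefficient, multiplied by an exponential penalty on the angular deviation) is an assumption chosen only to satisfy the stated monotonic behavior; it is not the formula of the disclosure. The names `boundary_stability`, `alpha`, and `beta` are hypothetical.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

def circular_mean(angles):
    """Mean direction of a list of angles, robust to wrap-around."""
    return math.atan2(sum(math.sin(a) for a in angles),
                      sum(math.cos(a) for a in angles))

def boundary_stability(magnitudes, angles, alpha=1.5, beta=1.5):
    """Assumed form: average of g_i**alpha * exp(-beta * |theta_i - mean|).

    alpha: gradient enhancement coefficient (text gives 1.2 to 2.5).
    beta:  directional discrepancy penalty coefficient (1.0 to 2.0).
    """
    if not magnitudes:
        return 0.0
    mean_angle = circular_mean(angles)
    total = sum((g ** alpha) * math.exp(-beta * angle_diff(th, mean_angle))
                for g, th in zip(magnitudes, angles))
    return total / len(magnitudes)

# Edge points with identical direction score higher than scattered ones.
aligned = boundary_stability([1.0, 1.0, 1.0], [0.1, 0.1, 0.1])
scattered = boundary_stability([1.0, 1.0, 1.0], [0.0, 1.5, -1.5])
```

In this sketch the gradient magnitudes are assumed to be normalized so that values from different frames are comparable.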

[0079] Unlike conventional restoration methods that rely solely on single-frame edges, this step prioritizes the stability index from step S1.4 when extracting boundary seeds. Outer-ring edge points belonging to reflective areas with a higher stability index are considered more reliable, because these areas show less morphological change across consecutive frames; this improves the accuracy and anti-interference capability of the subsequent boundary extension. For areas with lower stability indices, the weight of their boundary seeds is reduced accordingly, or they are temporarily excluded from the extension processing of the current frame.

[0080] This sub-step yields a set of boundary seeds for each reflective region in each frame. These seed points represent stable and continuous real edge information outside the reflective region, providing a reliable starting point for subsequent inward extension.

[0081] S2.2: Using the boundary seed set, the trend, curvature and grayscale change rules carried by the boundary seeds are propagated into the interior of the reflective area to generate an extended boundary set.

[0082] After obtaining the boundary seed set, boundary extension processing is performed on each reflective region in each frame. Boundary extension utilizes the boundary seed set to propagate the trend, curvature, and grayscale variation patterns carried by the seed points into the reflective region, enabling the formation of a continuous, closed set of candidate boundaries within the reflective region that are consistent with the outer structure. Boundary extension needs to consider both the boundary curvature trend and distance attenuation characteristics.

[0083] Specifically, for the k-th reflective region in the t-th frame, consider its set of boundary seed points and a point p to be restored inside the reflective area. For the j-th seed point, three quantities are used: its Euclidean distance to p, the local curvature of the edge segment on which it lies, and its gradient magnitude. The boundary extension response value of the point p is defined from these quantities using three parameters: a distance decay scale parameter, ranging from 4 to 15 pixels, which controls how quickly the influence of a seed point decays as the distance increases; a gradient weight coefficient, ranging from 1.0 to 2.5; and a curvature penalty coefficient, ranging from 0.5 to 1.8. Under this definition, boundary seeds that are closer to the point to be restored, have clearer edges, and have more stable curvature contribute more to the extension of the inner boundary, whereas boundary seeds that bend excessively or oscillate at high frequency contribute less to the inner restoration because of their larger curvature.

[0084] During execution, the boundary seeds can first be divided into connected segments, and a local arc segment or polyline segment can be fitted to each segment; the response value is then calculated point by point. When the response value of a point to be restored exceeds a preset extension threshold, the point is included in the extended boundary set. This extension threshold is typically set to 1.1 to 1.6 times the mean response of the entire region, in order to filter out weak responses caused by noise while preserving the main boundary structure.
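
The response calculation and thresholding described in this sub-step can be sketched as follows. The exact response formula is not reproduced here; the form below (gradient term, exponential distance decay, curvature penalty) is an assumption consistent with the stated behavior, and the names `extension_response`, `extended_boundary`, `sigma`, `gamma`, and `lam` are hypothetical.

```python
import math

def extension_response(point, seeds, sigma=8.0, gamma=1.5, lam=1.0):
    """Assumed form: sum over seeds of
    grad**gamma * exp(-dist / sigma) / (1 + lam * |curvature|).

    sigma: distance decay scale in pixels (text gives 4 to 15).
    gamma: gradient weight coefficient (1.0 to 2.5).
    lam:   curvature penalty coefficient (0.5 to 1.8).
    Each seed is a tuple (x, y, gradient_magnitude, local_curvature).
    """
    px, py = point
    total = 0.0
    for sx, sy, grad, curv in seeds:
        dist = math.hypot(px - sx, py - sy)
        total += (grad ** gamma) * math.exp(-dist / sigma) / (1.0 + lam * abs(curv))
    return total

def extended_boundary(points, seeds, factor=1.3, **params):
    """Keep points whose response exceeds factor * mean response
    (the text gives a factor of 1.1 to 1.6)."""
    responses = {p: extension_response(p, seeds, **params) for p in points}
    if not responses:
        return set()
    threshold = factor * sum(responses.values()) / len(responses)
    return {p for p, r in responses.items() if r > threshold}

seeds = [(0.0, 0.0, 1.0, 0.0)]
kept = extended_boundary([(1, 0), (5, 0), (20, 0)], seeds)
```

With a single straight, strong seed at the origin, only the nearest candidate point survives the threshold in this example.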

[0085] This sub-step yields a set of extended boundaries within each reflective region. These extended boundaries extend inward from the outer seed point, forming the complete outline skeleton of the reflective region.

[0086] S2.3: Based on the extended boundary set, perform morphological compensation for grayscale and gradient propagation inside the reflective area to generate an internal morphological compensation image.

[0087] After completing the boundary extension, it is necessary to reconstruct local information within the reflective area enclosed by the extended boundary. For the face, hands, and equipment edges in a fixed indoor training room, their true form is usually characterized by strong boundary constraints and smooth internal grayscale gradients. Internal compensation should give equal importance to boundary consistency and neighborhood smoothness.

[0088] Specifically, for any point q to be compensated within the reflective area, the following quantities are used: the distance from q to the nearest extended boundary, the average gray value on the extended boundary, and the average gradient magnitude on the extended boundary. The compensation grayscale value of q is calculated from these quantities using four parameters: a scale parameter for the influence of boundary grayscale, ranging from 2 to 8 pixels, which controls the decay rate of boundary grayscale propagating inward; a gradient decay scale parameter, ranging from 3 to 12 pixels, which controls the decay rate of gradient information propagating inward; the boundary decay exponent u, ranging from 1.0 to 2.0; and the distance suppression exponent v, ranging from 1.0 to 2.5. The compensated grayscale value consists of two parts. The first part means that the region near the boundary mainly inherits the true grayscale of the boundary, with this contribution decaying gradually and smoothly as the distance increases. The second part introduces gradient information so that the internal compensation retains a structural continuity trend, avoiding morphological collapse of the restored result caused by sharp edges, internal voids, or excessively flat interiors.
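
A minimal sketch of such a two-part compensation is given below. The specific decay functions are assumptions (the original formula is not available in this text); only the two-part structure, the parameter ranges, and the exponents u and v come from the description. The function name and the clipping to the 8-bit range are hypothetical choices.

```python
import math

def compensated_gray(dist, boundary_gray, boundary_grad,
                     gray_scale=4.0, grad_scale=6.0, u=1.5, v=1.5):
    """Assumed form: boundary grayscale and boundary gradient terms,
    each decaying with distance from the nearest extended boundary.

    gray_scale: boundary grayscale influence scale in pixels (2 to 8).
    grad_scale: gradient decay scale in pixels (3 to 12).
    u, v:       boundary decay and distance suppression exponents.
    """
    gray_part = boundary_gray * math.exp(-((dist / gray_scale) ** u))
    grad_part = boundary_grad * math.exp(-((dist / grad_scale) ** v))
    # Clip to the 8-bit grayscale range (an assumption of this sketch).
    return min(255.0, max(0.0, gray_part + grad_part))

near = compensated_gray(0.0, 180.0, 20.0)
mid = compensated_gray(3.0, 180.0, 20.0)
far = compensated_gray(10.0, 180.0, 20.0)
```

In this sketch the compensated value is highest at the boundary and decreases monotonically toward the interior.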

[0089] After this sub-step, a morphological compensation image of the interior of each reflective area is obtained. The reflective area in this image no longer appears as a pure white void, but instead recovers a recognizable shape that is continuous with the surrounding structure.

[0090] S2.4: Perform boundary transition fusion between the internal morphology compensation image and the original non-reflective area to generate a structure restoration image sequence.

[0091] After obtaining the morphological compensation images of each reflective region, these compensation regions need to be fused with the non-reflective regions in the original image to generate a structurally restored image of the entire frame. Since there may be slight inconsistencies in grayscale distribution and boundary details between the compensation regions and the non-reflective regions of the original image, boundary transition fusion is required to ensure that the restored result remains visually and structurally continuous.

[0092] Specifically, a transition zone is constructed on both sides of the boundary of the reflective area, with a width typically set to 2 to 4 pixels. For any position s within the transition zone, let its original grayscale value be $I_o(s)$, its compensation grayscale value be $I_c(s)$, and the signed distance from s to the boundary of the reflective area be $d(s)$, where $d(s)$ is positive when s is located outside the reflective area and negative when s is located inside it. The fused grayscale value $I_f(s)$ is calculated as $I_f(s) = \frac{I_o(s)}{1 + e^{-d(s)/\sigma}} + \frac{I_c(s)}{1 + e^{d(s)/\sigma}}$, where $\sigma$ is the transition band smoothing scale parameter, ranging from 1 to 4 pixels. This formula uses the Sigmoid function to achieve a smooth, weighted transition from the original image to the compensated image. Outside the reflective area, $e^{-d(s)/\sigma}$ is much less than 1, so the first term dominates and the fusion result is close to the original grayscale. Inside the reflective region, $e^{d(s)/\sigma}$ is much less than 1, so the second term dominates and the fusion result is close to the compensation grayscale. Near the boundary, the two terms have equal weights, which achieves smooth splicing and avoids the artificial edges caused by hard splicing.
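
The Sigmoid-weighted transition can be sketched in a few lines of Python. The weighting below is an assumed concrete form that matches the behavior described (original grayscale dominates outside, compensation grayscale dominates inside, equal weights on the boundary); the function name `fuse_gray` and the parameter name `sigma` are hypothetical.

```python
import math

def fuse_gray(original, compensated, signed_dist, sigma=2.0):
    """Blend original and compensated grayscale with a sigmoid weight.

    signed_dist: positive outside the reflective area, negative inside.
    sigma:       transition band smoothing scale in pixels (1 to 4).
    """
    # Weight of the original image: near 1 far outside, near 0 far inside.
    w_original = 1.0 / (1.0 + math.exp(-signed_dist / sigma))
    return w_original * original + (1.0 - w_original) * compensated

on_boundary = fuse_gray(100.0, 200.0, 0.0)
far_outside = fuse_gray(100.0, 200.0, 50.0)
far_inside = fuse_gray(100.0, 200.0, -50.0)
```

A larger `sigma` widens the blend; a smaller value approaches a hard splice at the boundary.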

[0093] After the fusion is completed, a small-scale contour consistency check is performed on the entire frame image. If the outer contour of the restored region deviates too far in position from the original outer edge, for example by more than 5 pixels, the boundary extension threshold in sub-step S2.2 or the distance scale parameter in sub-step S2.3 is adjusted, and the local restoration is re-executed until the contour consistency meets the requirements.

[0094] Finally, the structure-reconstructed image sequence is output in chronological order. The reflective areas in each frame of the sequence are effectively repaired, providing high-quality basic image data for subsequent cross-frame structure consistency constraints and motion trajectory continuous reconstruction processing.

[0095] In the technical solution of this disclosure, a stable boundary seed is extracted from the outer side based on the feature sequence of the reflective region. The internal structure of the reflective region is restored through boundary extension and morphological compensation, generating a structure restoration image sequence with continuous contours. This step utilizes the real structure of the outer edge of the reflective region to propagate information inward, avoiding morphological distortion caused by direct filling. This allows key parts such as hands and faces to maintain recognizable continuous contours even under reflective interference, providing high-quality image data for motion analysis.

[0096] S3: Based on the structure recovery image sequence, extract structural segments of key parts of the trainee and perform cross-frame matching to construct a continuous structural chain and generate a continuous motion structure trajectory sequence.

[0097] Based on the structural reconstruction image sequence output in step S2, cross-frame matching of key parts of the trainee is performed by combining the spatial position changes of adjacent frames, which eliminates structural breaks caused by periodic reflections and forms a continuous motion structure trajectory sequence. Figure 2 shows a flowchart of the cross-frame matching and continuous structural chain construction process provided in this disclosure embodiment. As shown in Figure 2, this is achieved through the following sub-steps.

[0098] S3.1: Extract candidate structural fragments of key parts of the trainee from the structural recovery image sequence to form a candidate structural fragment sequence.

[0099] In the structural reconstruction image sequence obtained in step S2, the key parts that need to be tracked continuously across frames are first separated from the whole frame image to form a stable sequence of candidate structural segments. The key parts here mainly refer to the trainee's hands, face, and the distal forearm region directly related to action judgment. Since step S2 has already performed boundary restoration and local information reconstruction on the reflective highlight areas, this sub-step selects, from the reconstructed continuous contours, the local structures most worth matching across frames.

[0100] In practice, edge extraction and closed region search are first performed on the structural reconstruction image of each frame to obtain a series of candidate contours. Based on geometric and appearance features such as area range, aspect ratio, degree of boundary curvature variation, and local grayscale uniformity, candidate regions that match the local geometric features of the hand or face are selected. For the hand, the area range is typically set to 0.5% to 8% of the total frame area; for the face, the area range is typically set to 2% to 15% of the total frame area. The preferred aspect ratio range is 0.6 to 2.8 to cover local structures under different orientations and poses.
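
The area and aspect ratio screening described above can be sketched as a simple filter. The thresholds are taken from the text; the function name `classify_candidate` and the use of a bounding-box width and height for the aspect ratio are assumptions of this sketch. Other cues named in the text (boundary curvature variation, grayscale uniformity) are omitted.

```python
def classify_candidate(area, width, height, frame_area):
    """Return the key-part labels whose geometric ranges the region fits.

    Hand: 0.5% to 8% of the frame area. Face: 2% to 15% of the frame area.
    Aspect ratio (width / height) must lie in 0.6 to 2.8.
    """
    fraction = area / frame_area
    aspect = width / height
    if not 0.6 <= aspect <= 2.8:
        return []
    labels = []
    if 0.005 <= fraction <= 0.08:
        labels.append("hand")
    if 0.02 <= fraction <= 0.15:
        labels.append("face")
    return labels

frame_area = 1000 * 1000
small = classify_candidate(10_000, 100, 100, frame_area)    # 1% of frame
overlap = classify_candidate(50_000, 250, 200, frame_area)  # 5% of frame
large = classify_candidate(100_000, 300, 330, frame_area)   # 10% of frame
```

Because the hand and face area ranges overlap between 2% and 8%, a region in that band keeps both labels here and would need the appearance features to disambiguate.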

[0101] Finally, a sequence of candidate structural fragments for key parts is output in chronological order. Each fragment in the sequence is accompanied by features such as location, area, shape, and grayscale.

[0102] S3.2: Match the spatial location and morphological features of candidate structural segments between adjacent frames to establish a set of candidate matching relationships.

[0103] After obtaining the candidate structural fragment sequence of key parts, it is necessary to match the candidate structural fragments between adjacent frames to establish an initial cross-frame correspondence. The matching between adjacent frames is based on changes in position, area, shape, and local grayscale distribution.

[0104] S3.3: Construct a cross-frame continuous structure chain based on the matching relationship between adjacent frames, and calculate the continuity credibility of each chain to eliminate false chains.

[0105] Adjacent frame matching can only solve the local correspondence between one frame and the next, and cannot guarantee the overall continuity of the structural chain over a longer period of time. Therefore, this sub-step establishes continuous structural chains spanning multiple time points based on the set of candidate matching relationships between adjacent frames, and eliminates pseudo-continuous chains caused by local recovery residuals, short-term occlusion, reflection residues, or rapid hand movements. Specifically, for any candidate chain r spanning a number of consecutive frames, the following are recorded: the matching score on each frame, the center displacement between each pair of adjacent frames, and the change between successive center displacements. The continuity confidence of the chain is defined as a function that is positively correlated with the sum of the matching scores and negatively correlated with the degree of fluctuation of the displacement changes. Those skilled in the art will understand that any function satisfying these positive and negative correlations can be used to construct the continuity confidence. For ease of screening, a confidence threshold is preset. This threshold can be determined dynamically from the statistical distribution of the training data, for example as 0.3 to 0.6 times the mean confidence of all candidate chains, or fixed empirically at a value between 0.4 and 0.7.

[0106] In practical processing, all possible candidate structure chains can be constructed from the matching relationships between adjacent frames using depth-first search or dynamic programming. The continuity confidence of each chain is then used to judge its reliability: chains whose continuity confidence is below the preset threshold are classified as pseudo-chains and removed. If a chain is missing in a particular frame, and the continuity confidence of both the preceding and following segments is higher than a preset confidence threshold (e.g., higher than 0.7), a gap of no more than two frames may be reconnected. The position of the missing frame is linearly interpolated from the positions of the preceding and following frames, and its orientation angle and equivalent scale are linearly interpolated in the same way. The interpolation result is then validated using candidate structural segments from the corresponding frame of the image restored in step S2. During validation, the similarity between the candidate structural segments near the interpolated location and the interpolation result is calculated, considering geometric features such as area, aspect ratio, and boundary curvature changes. If a candidate structural segment has a similarity exceeding a preset threshold, the actual observed value of that segment replaces the interpolation result, which further improves the reliability of the reconnection and enhances robustness against short-term reflective residual disturbances.
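
A minimal sketch of the confidence scoring and short-gap reconnection is given below. Since the text allows any function with the stated correlations, the form used here (mean matching score divided by one plus the standard deviation of the displacement changes) is just one assumed choice; the mean is used in place of the sum so that chains of different length are comparable. All function names are hypothetical, and angle wrap-around is ignored in the interpolation.

```python
import math
from statistics import pstdev

def chain_confidence(scores, centers):
    """Higher for strong matches, lower for erratic frame-to-frame motion."""
    displacements = [math.dist(a, b) for a, b in zip(centers, centers[1:])]
    changes = [abs(b - a) for a, b in zip(displacements, displacements[1:])]
    fluctuation = pstdev(changes) if len(changes) > 1 else 0.0
    return (sum(scores) / len(scores)) / (1.0 + fluctuation)

def interpolate_gap(before, after, gap):
    """Linearly interpolate (x, y, angle, scale) over a gap of 1 or 2 frames."""
    if not 1 <= gap <= 2:
        raise ValueError("only gaps of one or two frames are reconnected")
    filled = []
    for i in range(1, gap + 1):
        t = i / (gap + 1)
        filled.append(tuple(b + t * (a - b) for b, a in zip(before, after)))
    return filled

smooth = chain_confidence([0.9] * 4, [(0, 0), (1, 0), (2, 0), (3, 0)])
jittery = chain_confidence([0.9] * 4, [(0, 0), (5, 0), (5, 0), (12, 0)])
filled = interpolate_gap((0.0, 0.0, 0.0, 10.0), (3.0, 3.0, 0.3, 13.0), gap=2)
```

The interpolated states would then be checked against candidate segments from the restored frame, as described above, before being accepted.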

[0107] The final output is a set of continuous structural chains across frames that are retained after filtering. Each chain corresponds to the continuous motion trajectory of a key part over a period of time.

[0108] S3.4: The continuous structural chain retained after filtering is transformed into a sequence of motion structure trajectories containing the center position, main direction angle and equivalent scale changes of the corresponding key parts in each frame.

[0109] After the above process of eliminating false chains, a set of valid cross-frame continuous structural chains is obtained. Based on this, these structural chains need to be further transformed into action structure trajectory sequences that can be directly used in subsequent action semantic extraction. Each continuous structural chain corresponds to a key part, such as the continuous movement trajectory of the hand or face; therefore, the parameters of the structural chain are the motion parameters of that key part. The action structure trajectory not only includes the temporal positional changes of the key part but also the changes in structural scale, direction, and local morphological trends. In this way, in the next step, the system can determine whether the trainee is reaching out, retracting, raising their head, tilting their head, pointing a weapon, or performing a control action based on these trajectories.

[0110] Specifically, for any continuous structural chain r, its center position, principal direction angle, and equivalent scale in the t-th frame are recorded. The equivalent scale can be obtained as the geometric mean of the principal axis length and the secondary axis length. The combined trajectory strength of the chain in frame t relative to the previous frame is then defined from the changes in these quantities using two coefficients: a direction change enhancement coefficient, ranging from 0.5 to 1.5, and a scale variation suppression coefficient, ranging from 0.5 to 1.5. This definition reflects that motion trajectories should consider not only positional movement but also directional swaying and structural scale changes, in order to represent the actual movement of key parts such as the hands and face more comprehensively. For example, for the same left-to-right movement, a movement accompanied by rotation is semantically completely different from a simple translation, and the combined trajectory strength can reflect this difference. In actual execution, the center positions, principal direction angles, and equivalent scales of the same key part across the entire time series can be organized into structured trajectory items in chronological order, and each trajectory is assigned attributes such as start time, end time, average velocity, and maximum deflection angle. Multiple local chains within the same motion cycle can be further merged based on spatial proximity and temporal overlap. For example, the motion trajectories of the two hands can be maintained separately, or the relative movements of the hands and face can be linked to form a more complete sequence of motion structure trajectories.
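
The equivalent scale and a possible trajectory strength can be sketched as follows. The geometric mean comes directly from the text; the strength formula is an assumption that only reproduces the described roles of the two coefficients (direction change enhances, scale variation suppresses). The names `equivalent_scale`, `trajectory_strength`, `a`, and `b` are hypothetical.

```python
import math

def equivalent_scale(principal_axis, secondary_axis):
    """Geometric mean of the principal and secondary axis lengths."""
    return math.sqrt(principal_axis * secondary_axis)

def trajectory_strength(prev, curr, a=1.0, b=1.0):
    """Assumed form: displacement * (1 + a*|d_angle|) / (1 + b*|ln(scale ratio)|).

    prev, curr: (x, y, principal_angle_rad, equivalent_scale) per frame.
    a: direction change enhancement coefficient (0.5 to 1.5).
    b: scale variation suppression coefficient (0.5 to 1.5).
    """
    (x0, y0, th0, s0), (x1, y1, th1, s1) = prev, curr
    displacement = math.hypot(x1 - x0, y1 - y0)
    d_angle = abs((th1 - th0 + math.pi) % (2 * math.pi) - math.pi)
    scale_change = abs(math.log(s1 / s0))
    return displacement * (1.0 + a * d_angle) / (1.0 + b * scale_change)

scale = equivalent_scale(9.0, 4.0)
translation = trajectory_strength((0, 0, 0.0, scale), (3, 4, 0.0, scale))
with_rotation = trajectory_strength((0, 0, 0.0, scale), (3, 4, 0.5, scale))
with_scale_jump = trajectory_strength((0, 0, 0.0, scale), (3, 4, 0.0, 2 * scale))
```

In this example the same displacement yields a higher strength when accompanied by rotation and a lower one when the scale jumps, which mirrors the translation versus rotation distinction drawn in the text.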

[0111] The final output sequence of action structure trajectories serves as the final result of step S3 and is directly used for the key action semantic extraction and interaction state generation processing in step S4.

[0112] In the technical solution of this disclosure, cross-frame matching and continuous structure chain construction are performed on the structure restoration image sequence. Key candidate segments in adjacent frames are associated as motion trajectories spanning the time axis. Linear interpolation of short-time gaps and verification of candidate structures further enhance the continuity of the chain. This step eliminates structural breaks caused by the periodic occurrence of reflections, forming a stable and continuous sequence of motion structure trajectories, ensuring the structural integrity of the same action over time.

[0113] S4: Based on the action structure trajectory sequence, extract action semantics and determine the contact relationship with the digital human, and generate an interaction state sequence.

[0114] Based on the action structure trajectory sequence output in step S3, the trainee's actions are decomposed and semantically mapped to extract the action type, direction, and contact relationship with the digital human, driving the digital human to generate interactive responses and forming a stable interactive state sequence. Figure 3 shows a flowchart of the action semantic extraction and interaction state generation process provided in this embodiment. As shown in Figure 3, this is achieved through the following sub-steps.

[0115] S4.1: Based on the action structure trajectory sequence, perform temporal decomposition according to the comparison results of the action segment activity of key parts in each frame with the preset activity threshold, and generate a set of action temporal decomposition segments.

[0116] The motion structure trajectory sequence output in step S3 is further decomposed from a continuous temporal representation into several motion temporal decomposition segments with clearly defined start and end boundaries. The motion structure trajectory sequence already includes the center position, principal orientation angle, equivalent scale, and overall trajectory intensity of key components in each frame; therefore, this sub-step focuses on determining when an action begins and ends. Segmentation can be performed based on trajectory intensity, rate of change of direction, and duration.

[0117] Specifically, for any continuous structural chain r from step S3, three quantities are used in frame t: its combined trajectory intensity, the absolute value of the change in the principal direction angle between adjacent frames, and the ratio of equivalent scale change between adjacent frames. The scale change ratio is defined as the larger scale divided by the smaller scale, and therefore takes a value greater than or equal to 1. The motion segment activity of the key part is defined from these quantities using three coefficients: a trajectory intensity enhancement coefficient, ranging from 1.0 to 2.0; a direction change enhancement coefficient, ranging from 0.5 to 1.5; and a scale fluctuation penalty coefficient, ranging from 0.5 to 1.5. Under this definition, consecutive frames with strong displacement, significant directional changes, and no excessively anomalous scale variations exhibit high motion segment activity, and higher activity is more likely to constitute a valid action segment. Conversely, if the trajectory intensity is weak, the direction change is small, or the scale jumps drastically, the activity is low. In engineering implementation, a sliding window segmentation strategy can be used, with a window length of preferably 4 to 10 frames. If the activity of several consecutive frames is above the preset activity threshold, the action segment is considered to have started; if the activity of several consecutive frames is below the threshold, the segment is considered to have ended. The preset activity threshold can be set according to the training project: a lower threshold is recommended for subtle hand control movements, usually 0.8 to 1.1 times the average activity of the entire sequence; a higher threshold is recommended for large-scale control movements, usually 1.1 to 1.5 times the average activity of the entire sequence.
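
The start and end rule can be sketched as a small run-length segmenter. The activity formula shown is an assumption consistent with the described roles of the three coefficients; the run length `run` stands in for the "several consecutive frames" of the text and is a hypothetical parameter, as are the function names.

```python
def segment_activity(strength, d_angle, scale_ratio, a=1.5, b=1.0, c=1.0):
    """Assumed form: strength**a * (1 + b*d_angle) / scale_ratio**c.

    scale_ratio is larger scale / smaller scale, so it is >= 1.
    """
    return (strength ** a) * (1.0 + b * d_angle) / (scale_ratio ** c)

def segment_actions(activity, threshold, run=3):
    """Return (start_frame, end_frame) pairs.

    A segment starts once `run` consecutive frames exceed the threshold
    and ends once `run` consecutive frames fall to or below it.
    """
    segments = []
    start = None
    above = below = 0
    for t, value in enumerate(activity):
        if value > threshold:
            above += 1
            below = 0
            if start is None and above >= run:
                start = t - run + 1
        else:
            below += 1
            above = 0
            if start is not None and below >= run:
                segments.append((start, t - run))  # last frame above threshold
                start = None
    if start is not None:
        # Sequence ended while a segment was open; drop trailing low frames.
        segments.append((start, len(activity) - 1 - below))
    return segments

found = segment_actions([0, 0, 2, 2, 2, 2, 0, 0, 0, 0], threshold=1, run=3)
open_ended = segment_actions([2, 2, 2, 0], threshold=1, run=3)
```

The threshold passed in would typically be a multiple of the mean activity of the whole sequence, chosen per training project as described above.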

[0118] This sub-step outputs a set of action temporal decomposition segments, where each segment contains at least a start frame, an end frame, the number of the key part to which it belongs, and the corresponding trajectory segment, providing clear boundaries for subsequent action semantic extraction.

[0119] S4.2: Calculate the action direction intensity for each action temporal decomposition segment, and generate an action semantic candidate set by combining the direction evolution type and duration features.

[0120] After the set of action temporal decomposition segments is formed, semantic features are extracted from each segment and an action semantic candidate set is established. This extraction is based on the start and end displacements, the deflection of the direction angle, and the duration of each segment. Action semantic candidates are descriptive quantities extracted from an action segment that are sufficient to distinguish action categories such as raising a hand, reaching out a hand, retracting a hand, tilting the head, turning the head, grabbing, pressing, and pointing. This sub-step focuses on constructing action type discrimination values and direction encoding values. For an action temporal decomposition segment z, the following quantities are used: the center coordinates in its start frame and end frame, its number of consecutive frames, the average principal orientation angle within the segment, and the total deflection of the principal direction angle, which is the absolute value of the difference between the principal direction angle of the end frame and that of the start frame. The action direction intensity is then defined from these quantities using a short segment suppression coefficient, ranging from 0.5 to 1.2. This definition reflects that the directionality of an action temporal decomposition segment should not be determined solely by the start and end displacements; the degree of directional deflection within the segment and its duration should also be considered, since otherwise jitter can easily be misjudged as a definite action. The larger the start and end displacements, the more pronounced the directional deflection, and the longer the duration, the greater the action direction intensity. The action direction intensity reflects the overall strength and significance of a movement, and is an important basis for distinguishing between subtle and large-amplitude movements.

[0121] At the implementation level, the displacement direction of the action segment can be further divided into basic direction categories such as horizontal left, horizontal right, vertical up, vertical down, diagonally approaching, and diagonally moving away; the duration can be divided into short-term triggered types (usually 3 to 10 frames), medium-term controlled types (usually 10 to 30 frames), and long-term sustained types (usually more than 30 frames); and the angle evolution can be divided into stable, rotational, and reciprocating types. Then, based on the action direction intensity and the above-mentioned basic features such as direction category, duration category, and angle evolution category, action semantic candidates are generated, such as right hand diagonally forward, head turning to the left, hand quickly retracting, and forearm continuously pressing forward.
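
The direction intensity and the duration categories can be illustrated with the sketch below. The intensity formula is an assumption that only reproduces the stated monotonic behavior (larger displacement, larger deflection, and longer duration all increase it); the duration boundaries follow the text, with the shared boundary value of 10 frames assigned to the shorter category as a further assumption. All names are hypothetical.

```python
import math

def direction_intensity(start, end, total_deflection, n_frames, eta=0.8):
    """Assumed form: displacement * (1 + deflection) * (1 - exp(-eta * n)).

    eta: short segment suppression coefficient (text gives 0.5 to 1.2).
    """
    displacement = math.hypot(end[0] - start[0], end[1] - start[1])
    duration_factor = 1.0 - math.exp(-eta * n_frames)
    return displacement * (1.0 + total_deflection) * duration_factor

def duration_category(n_frames):
    """Map a frame count to the duration classes named in the text."""
    if n_frames < 3:
        return None  # too short to be treated as an action segment here
    if n_frames <= 10:
        return "short_trigger"
    if n_frames <= 30:
        return "medium_control"
    return "long_sustained"

long_move = direction_intensity((0, 0), (3, 4), 0.0, 20)
short_move = direction_intensity((0, 0), (3, 4), 0.0, 3, eta=0.5)
turning = direction_intensity((0, 0), (3, 4), 0.6, 20)
```

A rule layer could then combine the intensity with the direction, duration, and angle evolution categories to produce candidates such as "hand quickly retracting".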

[0122] If a classification model is involved, a lightweight action temporal decomposition segment classification model can be used, but in this invention, a rule-based constraint plus classifier fine-tuning approach is more suitable. The input features of the classifier can include the length of the action temporal decomposition segment, average velocity, directional intensity, total angular deflection, and scale change rate. The classifier can be a support vector machine (SVM) or a shallow multilayer perceptron. The penalty coefficient in the SVM is preferably 0.5 to 5, and the number of hidden layer nodes in the multilayer perceptron is preferably 16 to 64.

[0123] This sub-step outputs a candidate set of action semantics, where each candidate contains an action type label, direction encoding, and corresponding confidence level.

[0124] S4.3: Combine the action semantic candidate set with the spatial distance, distance change trend and contact duration frame number from the key parts of the trainee to the target parts of the digital human to generate a contact relationship determination result set.

[0125] This sub-step combines the spatial distance between the trainee's key body parts and the target body parts of the digital human, the distance change trend, and the number of frames of contact duration to generate a set of contact relationship determination results. Contact relationships include not only contact that has already been made, but also rapid approach before contact, edge contact, and continuous pressing contact.

[0126] For a certain action semantic candidate z, the following quantities are used: the distance in the t-th frame between the center of the corresponding key part of the trainee and the center of the target part of the digital human; the change in this distance between two adjacent frames; and the contact duration, which is the number of consecutive frames that meet the contact distance threshold. In addition, the ratio between the contact area and the equivalent area of the key part is used; it is defined as the larger area divided by the smaller area, so its value is greater than or equal to 1.

[0127] The contact relationship strength is defined as a function that is negatively correlated with the average distance (that is, the average distance within the action temporal decomposition segment), positively correlated with the number of contact duration frames, and negatively correlated with the area ratio. Those skilled in the art will understand that any function satisfying these positive and negative correlations can be used to construct the contact relationship strength. The smaller the average distance, the more consecutive contact frames, and the closer the area ratio is to 1, the higher the contact relationship strength and the more likely it is that a real contact relationship has formed. If the contact area differs too much from the area of the key part, the overlap may only be a projection rather than real contact, so a larger area ratio reduces the strength.

[0128] In engineering implementation, the contact distance threshold can be set according to the proportion of the training image, usually set to 0.1 to 0.35 times the equivalent scale of the key parts. For different target parts such as the hand and the digital human forearm, the hand and the digital human shoulder, and the hand and the digital human wrist, different thresholds can be set to reflect the geometric characteristics of different movement types.
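
A minimal sketch of the contact duration count and one admissible contact strength function follows. Because the text permits any function with the stated correlations, the form below is only an assumed example; the names `contact_duration`, `contact_strength`, and `ratio` are hypothetical, and the contact duration is taken as the longest run of consecutive frames within the distance threshold.

```python
def contact_duration(distances, scale, ratio=0.25):
    """Longest run of consecutive frames with distance <= ratio * scale.

    ratio: contact distance threshold as a multiple of the key part's
    equivalent scale (text gives 0.1 to 0.35).
    """
    threshold = ratio * scale
    best = run = 0
    for d in distances:
        run = run + 1 if d <= threshold else 0
        best = max(best, run)
    return best

def contact_strength(distances, scale, area_ratio, ratio=0.25):
    """Assumed form: frames / ((1 + mean_distance / scale) * area_ratio).

    area_ratio is larger area / smaller area, so it is >= 1.
    """
    mean_distance = sum(distances) / len(distances)
    frames = contact_duration(distances, scale, ratio)
    return frames / ((1.0 + mean_distance / scale) * area_ratio)

distances = [10.0, 4.0, 2.0, 2.0, 2.0, 9.0]
frames = contact_duration(distances, scale=10.0)
matched = contact_strength(distances, scale=10.0, area_ratio=1.0)
mismatched = contact_strength(distances, scale=10.0, area_ratio=3.0)
```

Different `ratio` values could be supplied per target part (wrist, forearm, shoulder) to reflect the part-specific thresholds mentioned above.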

[0129] After this sub-step, the contact relationship determination result set is output, including the contact type such as already in contact, approaching, rubbing against, pressing, etc., the contact location such as wrist, forearm, shoulder, etc., the contact duration and the contact relationship intensity, which are then called by sub-step S4.4.

[0130] S4.4: Calculate the interaction response priority value based on the action semantic candidate set and the contact relationship determination result set, and generate the interaction state sequence corresponding to the digital human interaction response based on the priority value.

[0131] This sub-step comprehensively evaluates each candidate interaction state by calculating the interaction response priority value and selects the state with the highest priority value as the current output. Interaction states include retreating to avoid, raising an arm to block, turning the head to respond, and being suppressed. To ensure the stability of interaction state generation, this sub-step jointly models the action type, direction intensity, and contact relationship strength, and suppresses frequent state switching caused by short-term jitter.

[0132] For each action semantic candidate z, four quantities are used. The action type matching score is the degree of matching between the action type and the digital human response template, with a value ranging from 0 to 1. The directional strength and the contact relationship strength are the action direction intensity and the contact relationship strength obtained above. The switching cost is the cost of switching from the interaction state at the previous moment to the current candidate state. The switching cost can be predefined based on the degree of state difference. For example, the cost of switching from standing still to retreating is lower, while the cost of switching from leaning forward to attack to retreating is higher.

[0133] The interaction response priority value is defined as a function that is positively correlated with the matching score, the directional strength, and the contact relationship strength, and negatively correlated with the switching cost. Those skilled in the art will understand that any function satisfying these positive and negative correlations can be used to construct the interaction response priority value. The higher the matching score, the greater the directional strength, and the stronger the contact relationship, the higher the priority value. The greater the switching cost, the lower the priority value, which suppresses unreasonable state transitions.
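The priority calculation and per-frame selection can be sketched as follows. This is only one possible function with the stated correlations: the additive form, the `cost_weight` parameter, and the `Candidate` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate interaction state with its four input quantities (hypothetical names)."""
    state: str
    match_score: float         # action type matching score, 0 to 1
    direction_strength: float  # directional strength of the action
    contact_strength: float    # contact relationship strength
    switch_cost: float         # cost of switching from the previous state


def response_priority(c: Candidate, cost_weight: float = 1.0) -> float:
    """Assumed form: rises with the three strengths, falls with the switching cost."""
    reward = c.match_score + c.direction_strength + c.contact_strength
    return reward / (1.0 + cost_weight * c.switch_cost)


def select_state(candidates: list) -> str:
    """Return the state of the candidate with the highest priority value."""
    return max(candidates, key=response_priority).state
```

With two candidates that differ only in switching cost, the cheaper transition wins, which is the suppression behaviour described above.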

[0134] In engineering implementation, a mapping table can be pre-established to map action semantics to digital human response templates. For example, if the action is a rapid forward extension of the hand with high contact strength, it is preferentially mapped to the digital human blocking or retreating; if the action is a continuous pressing of the hand with sustained contact, it is preferentially mapped to the digital human being in a controlled state; if the action is a head turn without contact, it is mapped to the digital human's eye-following or speech response readiness state. For each frame, the interaction response priority value is calculated for each action semantic candidate, and the interaction state corresponding to the candidate with the largest priority value is taken as the candidate interaction state for that frame. Then, for the candidate interaction states of several consecutive frames, the majority-hold principle or a three-frame confirmation mechanism can be adopted, that is, the actual switch is performed only when the same interaction state is output in three consecutive frames, so as to reduce occasional jitter.
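The three-frame confirmation mechanism can be illustrated with a short Python sketch. The function name, the `initial_state` argument, and the state labels in the usage note are hypothetical; only the rule of switching after three consecutive identical candidates comes from the text above.

```python
def confirm_states(frame_candidates, initial_state, confirm_frames=3):
    """Switch the output state only after `confirm_frames` identical candidates in a row."""
    current = initial_state
    run_state, run_length = None, 0
    output = []
    for candidate in frame_candidates:
        # Track the length of the current run of identical per-frame candidates.
        if candidate == run_state:
            run_length += 1
        else:
            run_state, run_length = candidate, 1
        # Commit the switch only once the run is long enough.
        if run_state != current and run_length >= confirm_frames:
            current = run_state
        output.append(current)
    return output
```

For the per-frame candidates idle, block, idle, block, block, block, idle with an initial state of idle, the isolated block in the second frame is ignored and the output switches to block only at the sixth frame.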

[0135] After this sub-step, the interaction state sequence is finally formed. This sequence serves as the final output of step S4 and is used by step S5 for training result output and closed-loop feedback update processing.

[0136] In the technical solution of this disclosure, semantic features of motion and contact relationships are extracted from the motion structure trajectory sequence. The continuous motion trajectory is decomposed into motion segments with clear start and end boundaries, and an interactive state sequence is generated based on the direction, intensity, and spatial proximity to the target part of the digital human. This step realizes the mapping from the underlying motion trajectory to the high-level semantics, enabling the digital human to generate reasonable interactive responses based on the trainee's actual actions, thus improving the realism and effectiveness of training.

[0137] S5: Based on the comparison results between the interaction state sequence and the preset training target, output the training result index and update the training parameters in a closed loop.

[0138] The interaction state sequence output in step S4 is compared with the preset training target, and the training results, including action completion degree and interaction matching degree, are output. Based on this, the parameters of subsequent training scenarios are adjusted to achieve closed-loop feedback and continuous optimization in the training process. This is achieved through the following sub-steps.

[0139] S5.1: Align the interaction state sequence with the preset training target segment by segment to generate an interaction state alignment result set including single segment alignment score, duration deviation and order deviation.

[0140] The interaction state sequence generated in step S4 is aligned segment by segment with the preset training target to form an interaction state alignment result set that can be used for evaluation.

[0141] In practice, the start frame, end frame, state category, and duration of each state segment in the interaction state sequence are first read. Then, the target state order, duration range, and allowable deviation range in the preset training target are read. Sequence alignment is then performed on the two to obtain the alignment relationship, duration deviation, and order deviation between each state segment and the target state. For the i-th actual interaction state segment, the alignment uses the duration of that segment and the duration of its corresponding target state. The order deviation is defined as the absolute value of the difference between the actual state sequence number and the target state sequence number. For example, if the target state sequence number is 3 and the actual state sequence number is 4, the order deviation is 1. The category matching coefficient is 1 for a perfect match, 0.5 for a partial match, and 0 for no match.

[0142] The single-segment alignment score is defined as a function that is positively correlated with the category matching coefficient, negatively correlated with the duration deviation (that is, the absolute difference between the actual duration and the target duration), and negatively correlated with the order deviation. Those skilled in the art will understand that any function satisfying these positive and negative correlations can be used to calculate the single-segment alignment score. The closer the category match, the closer the duration, and the more correct the order of occurrence, the higher the single-segment alignment score.
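One function with these correlations is sketched below in Python. The specific form, which divides the category matching coefficient by one plus the relative duration deviation plus the order deviation, is an assumption chosen for illustration, as are the function and parameter names.

```python
def segment_alignment_score(category_coeff: float, actual_duration: float,
                            target_duration: float, actual_index: int,
                            target_index: int) -> float:
    """Illustrative single-segment alignment score in the range [0, 1].

    category_coeff is 1 for a perfect match, 0.5 for a partial match, 0 for no match.
    The normalisation of the duration deviation by the target duration is an assumption.
    """
    if target_duration <= 0:
        raise ValueError("target_duration must be positive")
    duration_deviation = abs(actual_duration - target_duration)
    order_deviation = abs(actual_index - target_index)
    penalty = 1.0 + duration_deviation / target_duration + order_deviation
    return category_coeff / penalty
```

A segment that matches the target category, duration, and position exactly scores 1; any duration or order deviation lowers the score, and a category mismatch gives 0.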

[0143] In engineering implementation, dynamic time warping can be used for state-level alignment. However, a finite-window sequential alignment method is better suited to this disclosure: the current state is only allowed to find a match within a range of no more than one position before or after the target state, which enforces the standardized order of the training process.
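A minimal sketch of finite-window sequential alignment is given below. It assumes a greedy, nearest-position-first matching strategy and exact category equality, neither of which is specified in the text; the function name and state labels are hypothetical.

```python
from typing import List, Optional, Tuple


def window_align(actual: List[str], target: List[str],
                 window: int = 1) -> List[Tuple[int, Optional[int]]]:
    """Pair each actual state with an unused target state at most `window` positions away.

    Returns (actual_index, target_index) pairs; target_index is None when no
    target of the same category lies inside the window.
    """
    used = set()
    pairs = []
    for i, state in enumerate(actual):
        lo, hi = max(0, i - window), min(len(target) - 1, i + window)
        # Try the nearest target positions first, preferring the earlier one on ties.
        candidates = sorted(range(lo, hi + 1), key=lambda j: (abs(j - i), j))
        match = next((j for j in candidates
                      if j not in used and target[j] == state), None)
        if match is not None:
            used.add(match)
        pairs.append((i, match))
    return pairs
```

Unlike dynamic time warping, this never matches a state to a target more than one position away, so a badly out-of-order action is reported as unmatched rather than stretched to fit.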

[0144] After this sub-step, the interaction state alignment result set is output. Each alignment result in the set includes the correspondence between the state segment and the target state, the single-segment alignment score, the duration deviation, and the order deviation.

[0145] S5.2: Based on the interaction state alignment result set, calculate a training result index set that includes the trainee's action completion degree and its interaction matching degree with the digital human.

[0146] After obtaining the interaction state alignment result set, the training process needs to be quantitatively evaluated to form a training result index set. This index set should include at least two core indicators: the trainee's action completion degree and the interaction matching degree between the trainee and the digital human. The action completion degree reflects whether the trainee completed the required action chain according to the training requirements, while the interaction matching degree reflects whether a correct interaction loop was formed between the digital human's response and the trainee's action. Therefore, indicators need to be constructed from four dimensions: state coverage, state persistence, state sequence, and state response consistency.

[0147] Suppose there are N target states in a certain training round. The following quantities are also defined: the number of successfully aligned states, the sum of all single-segment alignment scores, the total length of the actual interaction state sequence, and the total length of the target interaction state sequence. The length unit can be frames or seconds.

[0148] The trainee's action completion degree is defined as a function that is positively correlated with the state coverage ratio (that is, the proportion of the N target states that were successfully aligned), positively correlated with the average single-segment alignment score, and negatively correlated with the relative deviation of the total duration. Those skilled in the art will understand that any function satisfying these positive and negative correlations can be used to calculate the action completion degree. For example, it can be expressed as the product of the state coverage ratio and the average single-segment alignment score, divided by one plus the relative deviation of the total duration. The action completion degree takes a high value only when the state coverage ratio is high, the average single-segment alignment score is large, and the total duration deviation is small.
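The example expression above can be written as a short Python function. The names are illustrative, and two details are assumptions: the average score is taken over the successfully aligned segments, and the relative deviation of the total duration is normalised by the target length.

```python
def action_completion(n_target: int, n_aligned: int, score_sum: float,
                      actual_length: float, target_length: float) -> float:
    """Coverage ratio times mean alignment score, divided by (1 + relative duration deviation)."""
    if n_target <= 0 or target_length <= 0:
        raise ValueError("n_target and target_length must be positive")
    if n_aligned == 0:
        return 0.0  # nothing aligned, so nothing was completed
    coverage = n_aligned / n_target              # state coverage ratio
    mean_score = score_sum / n_aligned           # average single-segment alignment score
    relative_deviation = abs(actual_length - target_length) / target_length
    return coverage * mean_score / (1.0 + relative_deviation)
```

For instance, with 5 target states, 4 aligned states whose scores sum to 3.2, and an actual length of 330 frames against a target of 300, the coverage is 0.8, the mean score is 0.8, and the relative deviation is 0.1, giving 0.64 / 1.1, or roughly 0.58.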

[0149] Furthermore, a digital human response consistency coefficient is defined for the i-th successfully aligned state segment. This coefficient reflects the consistency between the digital human's actual response and the target response template, ranging from 0 to 1, with 1 indicating complete consistency. The interaction matching degree between the trainee and the digital human is defined as a function that is positively correlated with the average joint matching score and negatively correlated with the mean alignment-response difference. Those skilled in the art will understand that any function satisfying these positive and negative correlations can be used to calculate the interaction matching degree. Truly effective training interaction requires not only that the trainee's actions closely approximate the target, but also that the digital human's responses remain consistent with the expected training script; both must be achieved simultaneously.
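A possible reading of this definition is sketched below. It rests on two interpretive assumptions that the text does not confirm: the joint matching score of a segment is taken as the product of its single-segment alignment score and its response consistency coefficient, and the alignment-response difference is taken as the absolute difference between the two. The function name and the quotient form are likewise illustrative.

```python
def interaction_matching(alignment_scores, response_coeffs):
    """Mean joint score divided by (1 + mean absolute alignment-response difference)."""
    if len(alignment_scores) != len(response_coeffs):
        raise ValueError("one response coefficient is needed per aligned segment")
    n = len(alignment_scores)
    if n == 0:
        return 0.0  # no aligned segments, so no interaction to match
    pairs = list(zip(alignment_scores, response_coeffs))
    mean_joint = sum(s * r for s, r in pairs) / n          # high only if both are high
    mean_difference = sum(abs(s - r) for s, r in pairs) / n
    return mean_joint / (1.0 + mean_difference)
```

Under this reading, a segment where the trainee aligns well but the digital human responds inconsistently lowers the result twice: once through the product and once through the difference term.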

[0150] The final output training result index set should include at least the action completion degree, the interaction matching degree, the number of missed states, the number of out-of-order states, and the number of timeout segments.

[0151] S5.3: Based on the deviation between the training result index set and the preset target threshold, generate a set of adjustment suggestions for the training parameters.

[0152] After the training result index set is formed, a parameter adjustment suggestion set is generated from the evaluation results for subsequent updates to the training scenario parameters. These parameters include the interaction difficulty parameter, the target action tolerance parameter, the digital human response sensitivity parameter, and the state transition confirmation parameter of the training scenario. If the trainee's action completion degree is low but the interaction matching degree between the trainee and the digital human is high, the trainee's action execution is insufficient while the digital human's response logic is basically correct. In this case, the focus should be on lowering the action threshold or extending the allowed duration of the target action. If the action completion degree is high but the interaction matching degree is low, the action itself is well completed but the interaction state transition or response template fits poorly, and the focus should be on adjusting the response sensitivity and the state transition confirmation conditions. A target lower-limit threshold is set for each of the two indicators: the threshold for the action completion degree generally ranges from 0.75 to 0.95, and the threshold for the interaction matching degree generally ranges from 0.80 to 0.95. The overall deviation value Z is defined from the action completion degree, the interaction matching degree, and these two thresholds, using three coefficients: the action deviation enhancement coefficient, with a value ranging from 1.0 to 2.0; the interaction deviation enhancement coefficient, with a value ranging from 1.0 to 2.0; and the high-quality training suppression coefficient, with a value ranging from 0.5 to 1.2. Under this definition, when both the action completion degree and the interaction matching degree are close to their target values, the overall deviation Z decreases rapidly; when both are low, the overall deviation increases significantly, driving stronger parameter adjustments.

[0153] In engineering implementation, the overall deviation value Z can be divided into three levels: slight deviation (Z is usually less than 0.1), moderate deviation (Z is usually between 0.1 and 0.3), and significant deviation (Z is usually greater than 0.3). For slight deviation, only the digital human's response sensitivity is adjusted, preferably by 3% to 8% of the original parameters; for moderate deviation, both the target action tolerance and the number of status confirmation frames are adjusted, preferably by 5% to 15%; for significant deviation, the training script rhythm or interaction window length is further adjusted, preferably by 10% to 20%.
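The three deviation levels can be captured in a small lookup, sketched here in Python. The dictionary layout and parameter identifiers are hypothetical; the level boundaries and percentage ranges are the ones stated above, with the boundary values 0.1 and 0.3 assigned to the moderate level as an assumption.

```python
def adjustment_plan(z: float) -> dict:
    """Map the overall deviation value Z to a level, the parameters to adjust, and a range.

    The range is given as (minimum, maximum) fractions of the original parameter value.
    """
    if z < 0.1:
        return {"level": "slight",
                "parameters": ["response_sensitivity"],
                "range": (0.03, 0.08)}
    if z <= 0.3:
        return {"level": "moderate",
                "parameters": ["target_action_tolerance", "state_confirmation_frames"],
                "range": (0.05, 0.15)}
    # Significant deviation: the script rhythm or interaction window length is adjusted
    # in addition to the parameters handled at the lower levels.
    return {"level": "significant",
            "parameters": ["training_script_rhythm", "interaction_window_length"],
            "range": (0.10, 0.20)}
```

The returned entry corresponds to one item of the parameter adjustment suggestion set; the adjustment direction would still have to be chosen from which indicator fell short.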

[0154] The final output parameter adjustment suggestion set includes the name of the parameter to be adjusted, the suggested adjustment direction, and the adjustment range.

[0155] S5.4: Smoothly update the training parameters based on the parameter adjustment suggestion set and output the training results for this round.

[0156] The parameter adjustment suggestion set is applied to the training scheme update, and the training results of this round are output. First, the parameters of subsequent training scenarios are updated according to the parameter adjustment suggestion set. The updated content includes at least four categories: target action duration tolerance, state transition confirmation frame count, digital human response sensitivity, and interaction window length. Second, the training results of this round are output in structured form, including at least the trainee's action completion degree, the interaction matching degree between the trainee and the digital human, key error states that occurred during training, suggested improvement directions, and the updated parameter items. To prevent excessive oscillation of training parameters over multiple training rounds, this sub-step introduces a parameter update suppression factor. The updated parameter value is calculated from the current value of the parameter to be updated, its suggested target value, the overall deviation value Z, and an update smoothing coefficient with a value ranging from 0.05 to 0.30. When the overall deviation Z is small, the exponential term in the update is close to 1, the update amount is small, and the parameter is only slightly adjusted; when the overall deviation is large, the exponential term decreases and the parameter approaches the target value, while still retaining the smooth update characteristic and avoiding large one-time modifications that could destabilize training. In engineering implementation, the update smoothing coefficient should not be too small, otherwise the parameters will change too quickly; nor should it be too large, otherwise the closed-loop optimization effect will be weakened. A smaller value, preferably between 0.05 and 0.15, can be chosen for the action tolerance parameter so that it adapts quickly to the trainee's current ability; a larger value, preferably between 0.15 and 0.30, can be used for the digital human response sensitivity parameter to maintain interaction stability. The final output is the training results and the updated training scheme.
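The behaviour described above can be reproduced by the following Python sketch. The exact update formula is not recoverable from this text, so the form used here, in which the remaining gap to the target is scaled by exp(-Z / smoothing), is an assumption chosen only because it matches the three stated properties: a small Z gives an exponential term near 1 and a small update, a large Z moves the parameter toward the target, and a smaller smoothing coefficient makes the parameter change faster.

```python
import math


def smooth_update(current: float, target: float, z: float, smoothing: float) -> float:
    """Move `current` toward `target` by an amount that grows with the deviation Z.

    Assumed form (not confirmed by the disclosure):
        new = target + (current - target) * exp(-z / smoothing)
    """
    if smoothing <= 0:
        raise ValueError("smoothing must be positive")
    retain = math.exp(-max(z, 0.0) / smoothing)  # near 1 when Z is small
    return target + (current - target) * retain
```

With a current value of 10, a target of 12, and Z equal to 0.1, a smoothing coefficient of 0.05 moves the parameter most of the way to 12, while a coefficient of 0.30 moves it only a fraction of the way, which mirrors the guidance on choosing the coefficient per parameter.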

[0157] In the technical solution of this disclosure, the interaction state sequence is aligned with a preset training target, a training result index set including action completion degree and interaction matching degree is calculated, and parameter adjustment suggestions are generated based on index deviations. Parameters such as action tolerance and response sensitivity in subsequent training scenarios are then optimized through a smooth update method. This step forms a closed-loop feedback mechanism from evaluation to adjustment, enabling the training system to adapt to the trainee's ability level and continuously improve training effectiveness.

[0158] According to embodiments of this disclosure, an electronic device is also provided, which may include a processor, a communications interface, a memory, and a communication bus, wherein the processor, the communications interface, and the memory communicate with each other via the communication bus. The processor can invoke logical instructions stored in the memory to execute the methods provided in the above embodiments.

[0159] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, and can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0160] In another aspect, this disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the methods provided in the above embodiments.

[0161] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0162] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0163] It should be understood that the above embodiments are only used to illustrate the technical solutions of this disclosure, and not to limit them; although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure.

Claims

1. A police combat digital human interactive AI-assisted training method, characterized in that the method includes: S1. performing reflective highlight region detection and labeling on the collected training image sequence of trainees to generate a reflective region feature sequence; S2. based on the reflective region feature sequence, generating a structure restoration image sequence by extending boundaries and compensating internal morphology inward from the real structural region outside the reflective region; S3. based on the structure restoration image sequence, extracting structural segments of key parts of the trainee and performing cross-frame matching to construct a continuous structural chain and generate a continuous motion structure trajectory sequence; S4. based on the motion structure trajectory sequence, extracting action semantics, determining the contact relationship with the digital human, and generating an interaction state sequence; S5. based on the comparison results between the interaction state sequence and the preset training target, outputting the training result index and updating the training parameters in a closed loop.

2. The method according to claim 1, characterized in that S1 includes: normalizing the training image sequence from color to grayscale; in the obtained normalized grayscale image, constructing candidate reflective highlight regions based on joint detection of brightness and local gradient decay; applying a spatial continuity constraint to the candidate reflective highlight regions and removing discrete pseudo-highlight fragments to generate a reflective region mask; and uniformly encoding the reflective region mask to generate a reflective region feature sequence that includes geometric and temporal stability.

3. The method according to claim 1, characterized in that S2 includes: based on the reflective region feature sequence, extracting a boundary seed set from the outer expansion ring outside the reflective region; using the boundary seed set, propagating the trend, curvature, and grayscale variation patterns carried by the boundary seeds into the interior of the reflective region to generate an extended boundary set; based on the extended boundary set, performing morphological compensation for grayscale and gradient propagation on the interior of the reflective region to generate an internal morphology compensation image; and fusing the internal morphology compensation image with the original non-reflective region through a boundary transition to generate the structure restoration image sequence.

4. The method according to claim 3, characterized in that the morphological compensation for grayscale and gradient propagation on the interior of the reflective region based on the extended boundary set includes: calculating the compensation grayscale value of a point to be compensated based on the distance from that point within the reflective region to the nearest extended boundary, combined with the average grayscale information and average gradient information on the extended boundary, wherein the compensation grayscale value includes an attenuation term for inheriting the boundary grayscale and a gradient term for preserving the structural continuity trend.

5. The method according to claim 1, characterized in that S3 includes: extracting candidate structural segments of key parts of the trainee from the structure restoration image sequence to form a candidate structural segment sequence, wherein the key parts include the trainee's hands, face, and forearm distal region, which are directly related to action judgment; matching the spatial location and morphological features of candidate structural segments between adjacent frames to establish a set of candidate matching relationships; based on the set of candidate matching relationships, constructing cross-frame continuous structural chains, comparing the calculated continuity confidence of each chain with a preset confidence threshold, judging chains whose continuity confidence is lower than the threshold to be pseudo chains and removing them, and obtaining the continuous structural chains retained after screening; and transforming the retained continuous structural chains into a motion structure trajectory sequence containing the center position, main direction angle, and equivalent scale changes of the corresponding key parts in each frame.

6. The method according to claim 5, characterized in that the construction of the cross-frame continuous structural chains also includes: for continuous structural chains with short-term gaps, allowing linear interpolation to fill gaps of no more than 2 frames; verifying the interpolation results using candidate structural segments of the corresponding frames in the structure restoration image sequence; and, when a candidate structural segment with similar geometric features exists near the interpolated position, replacing the interpolation result with the actual observed value of that segment.

7. The method according to claim 1, characterized in that S4 includes: based on the motion structure trajectory sequence, performing temporal decomposition according to the comparison of the motion segment activity of key parts in each frame with a preset activity threshold, and generating a set of action temporal decomposition segments; for each action temporal decomposition segment, calculating its action direction intensity and combining the direction evolution type and duration features to generate an action semantic candidate set; combining the action semantic candidate set with the spatial distance, distance change trend, and number of contact duration frames from the key parts of the trainee to the target parts of the digital human to generate a contact relationship determination result set; and calculating the interaction response priority value based on the action semantic candidate set and the contact relationship determination result set, and generating the interaction state sequence corresponding to the digital human interaction response based on the priority value.

8. The method according to claim 1, characterized in that S5 includes: aligning the interaction state sequence segment by segment with the preset training target to generate an interaction state alignment result set that includes the single-segment alignment score, duration deviation, and order deviation; based on the interaction state alignment result set, calculating a training result index set that includes the trainee's action completion degree and the interaction matching degree with the digital human; based on the deviation between the training result index set and the preset target threshold, generating a parameter adjustment suggestion set for the training parameters; and smoothly updating the training parameters based on the parameter adjustment suggestion set and outputting the training results for this round.

9. An electronic device, characterized in that the electronic device includes a memory and at least one processor, the memory storing a computer program, and the processor executing the computer program to implement the police combat digital human interactive AI-assisted training method according to any one of claims 1-7.

10. A computer storage medium, characterized in that it stores a computer program which, when executed, implements the police combat digital human interactive AI-assisted training method according to any one of claims 1-7.