Spatial semantic and pose feature coupled complex scene action classification method
By injecting visual semantic features into skeleton nodes and constructing a multimodal hypergraph, the method addresses the difficulty traditional models have in accurately perceiving and logically discriminating multi-body interactive actions in complex scenes, and achieves high-precision action classification for multi-entity "human-object-environment" collaborative interactions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANHUI UNIVERSITY OF ARCHITECTURE
- Filing Date
- 2026-05-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to accurately perceive and logically discriminate collaborative interactions between multiple entities ("human-object-environment") in complex multi-entity interaction scenarios. Traditional dual-stream architectures suffer from semantic blind spots, cross-attention mechanisms struggle to handle non-paired mappings, and generative reasoning models lack physical constraints, leading to distorted action judgments.
By injecting visual semantic features into skeleton nodes to reconstruct the topology of graph convolutional networks, a multimodal hypergraph is constructed for high-order feature aggregation. Hyperedges are used to achieve unpaired aggregation of pose nodes and semantic nodes. Combined with the kinematic space convergence trend, a physical simulator is introduced for action reasoning.
It improves the accuracy and robustness of action classification in complex scenarios, enabling the perception of scene context interaction across physical distances and achieving accurate modeling and discrimination of complex collaborative interactions.
Figure CN122244561A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and more specifically, to a method for classifying actions in complex scenes by coupling spatial semantics and pose features.
Background Technology
[0002] Human motion classification technology in complex scenarios is a core research direction in the field of computer vision, widely used in intelligent security, industrial operation monitoring, and human-computer collaborative interaction. In real open working environments, personnel often need to frequently interact with various tools and complex facilities in three-dimensional space, resulting in highly concealed and varied collaborative motion patterns. Achieving accurate perception and logical discrimination of complex multi-body interactive actions is an essential technical requirement for visual understanding systems to advance towards industrial-grade high reliability.
[0003] Currently, mainstream solutions in the industry typically employ a two-stream network architecture based on visual images and skeletal keypoints for independent feature extraction, and rely on cross-attention mechanisms to achieve multimodal fusion. For the recognition of higher-order action logic, some cutting-edge technologies further introduce Visual Language Models (VLMs) to attempt to utilize the statistical correlations of massive pre-trained data for end-to-end action classification inference.
[0004] However, the above-mentioned solutions have significant technical limitations when dealing with complex multi-body interaction scenarios. First, in the early stages of feature extraction, the modalities of the traditional two-stream architecture are isolated from each other, leaving a serious "semantic blind spot" in the skeletal nodes. Furthermore, the graph network is limited to the static connections of human physiology, resulting in a rigid physical topology that cannot effectively perceive the context of the distant environment. Second, when facing concurrent human-tool-environment interactions, conventional cross-attention mechanisms have difficulty handling non-paired mappings, which can easily lead to computational divergence and failure to capture higher-order collaborative logic. Finally, existing generative inference models rely solely on the probability distribution of visual features and lack rigorous constraints from underlying physical laws. When faced with severe occlusion or unconventional interactions, they are prone to "physical hallucinations" that violate objective laws such as contact mechanics and center-of-gravity balance, seriously distorting the judgment of fine-grained dangerous actions.
Summary of the Invention
[0005] To overcome the aforementioned deficiencies of the prior art, embodiments of the present invention provide a method for classifying complex scene actions by coupling spatial semantics and pose features. The method reconstructs the dynamic graph network topology by injecting visual semantics into skeleton nodes, and constructs a multimodal hypergraph from kinematic spatial convergence trends to perform high-order feature aggregation. This addresses the problem that traditional models are limited by physiological topological structures and have difficulty aligning and modeling the collaborative interaction actions of multiple entities such as "human-object-environment" in complex scenes.
[0006] To achieve the above objectives, the present invention provides the following technical solution: A method for classifying actions in complex scenes by coupling spatial semantics and pose features includes the following steps: acquiring video frame sequences and skeletal keypoint sequences, and extracting visual semantic features from the video frame sequences; injecting the visual semantic features into the initial node features of a graph convolutional network constructed from the skeletal keypoint sequences, updating the adjacency matrix based on the semantic relevance of the initial node features and performing graph convolution calculation to obtain local pose features; constructing a multimodal hypergraph based on the kinematic properties of the local pose features and the spatial convergence trend of the visual semantic features, and using the hyperedges in the hypergraph to aggregate pose nodes and semantic nodes to obtain coupled features; performing action inference based on the coupled features and outputting action classification results.
[0007] The technical effects and advantages of the complex scene action classification method coupling spatial semantics and pose features of this invention are as follows. By directly injecting visual semantic features into the initial node features of a skeletal graph convolutional network and dynamically updating the adjacency matrix based on semantic relevance, this invention breaks through the topological limitation of traditional skeletal networks, which rely solely on rigid physiological connections. This gives the model the ability to perceive scene context interactions across physical distances. In addition, the invention constructs a multimodal hypergraph by analyzing the kinematic properties of the pose features and the spatial convergence trend of the visual semantics. It uses the high-order envelope property of hyperedges to achieve non-paired aggregation of pose nodes and semantic nodes, overcoming the difficulty of aligning and modeling the collaborative interaction features of multiple entities such as "human-object-environment" in complex work scenarios. Finally, it performs action inference based on the deeply fused high-order coupled features, improving the accuracy and robustness of action classification in complex scenarios.
Attached Figure Description
[0008] Figure 1 is a schematic flowchart of the method for classifying complex scene actions by coupling spatial semantics and pose features, provided in an embodiment of the present invention;
Figure 2 is a comparison heatmap of adjacency matrices provided in an embodiment of the present invention;
Figure 3 is a simulation diagram of three-dimensional spatial trajectories and the kinematic convergence trend provided in an embodiment of the present invention;
Figure 4 is a three-dimensional visualization of the solid bounding box and center-of-gravity balance determination provided in an embodiment of the present invention;
Figure 5 is a comparison chart of classification confusion matrices provided in an embodiment of the present invention.
Detailed Implementation
[0009] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.
[0010] Embodiment 1. As shown in Figure 1, this embodiment provides a method for classifying actions in complex scenes by coupling spatial semantics and pose features, comprising the following steps: S1, acquire a video frame sequence and a skeletal keypoint sequence, and extract the visual semantic features of the video frame sequence.
[0011] It should be noted that, after the data are acquired, the initial dimension of the skeletal keypoint sequence is expressed as $T_0 \times V \times 3$, where $T_0$ is the number of time steps of the original sequence, $V$ is the number of nodes in the human skeleton, and 3 corresponds to the (X, Y, Z) coordinate position in three-dimensional space. Preferably, these three-dimensional coordinates are uniformly transformed into the camera coordinate system, with the camera's optical center as the origin. In complex scenes, the sampling frequency and start time of video acquisition devices (such as RGB cameras) and motion capture devices (or pose estimation models) often differ, so the acquired video frame sequence and the skeletal keypoint sequence are asynchronous on the time axis. To ensure the spatiotemporal consistency of the multimodal data in subsequent physical and logical reasoning, as a preferred implementation, this step uses a timestamp-based interpolation algorithm to temporally align the heterogeneous data. For each given video frame timestamp $t_v$, the two adjacent timestamps $t_k$ and $t_{k+1}$ that bracket it are located in the skeletal keypoint sequence. If $t_{k+1} - t_k \le \Delta t_{\max}$ (where $\Delta t_{\max}$ is a preset maximum effective time difference threshold, e.g., 0.1 seconds), the virtual skeleton keypoint coordinates corresponding to timestamp $t_v$ are calculated by linear interpolation:
$$P(t_v) = P(t_k) + \frac{t_v - t_k}{t_{k+1} - t_k}\,\bigl(P(t_{k+1}) - P(t_k)\bigr) \quad (1)$$
where $P(t_v)$ is the calculated coordinate vector of the virtual skeleton keypoints at the video frame timestamp $t_v$, and $P(t_k)$ and $P(t_{k+1})$ are the skeletal keypoint coordinate frames at timestamps $t_k$ and $t_{k+1}$ in the original skeletal sequence. Linear interpolation not only achieves frame alignment but also keeps the skeletal motion trajectory continuous across the time series, providing a continuous data source for the subsequent calculation of velocity and acceleration vectors at successive time steps.
Through the above alignment operations, this step constructs a multimodal synchronization data sequence of consistent length and one-to-one correspondence in the time dimension.
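The timestamp alignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `interpolate_skeleton`, the nested-list data layout, and the choice to return `None` when no valid bracketing pair exists are all hypothetical, and the threshold is assumed to apply to the gap between the two bracketing skeleton timestamps.

```python
from bisect import bisect_right


def interpolate_skeleton(t_video, skel_times, skel_frames, max_gap=0.1):
    """Linearly interpolate skeleton joints at a video timestamp.

    skel_times: strictly increasing timestamps of the skeleton sequence.
    skel_frames: one frame per timestamp, each a list of [x, y, z] joints.
    max_gap: assumed meaning of the maximum effective time difference.
    Returns the interpolated frame, or None if alignment is not possible.
    """
    k = bisect_right(skel_times, t_video) - 1  # last index with time <= t_video
    if k < 0:
        return None  # video frame precedes the skeleton sequence
    if k == len(skel_times) - 1:
        # Only an exact hit on the final timestamp can be served.
        if t_video == skel_times[k]:
            return [list(joint) for joint in skel_frames[k]]
        return None
    t0, t1 = skel_times[k], skel_times[k + 1]
    if t1 - t0 > max_gap:
        return None  # bracketing samples are too far apart to trust
    w = (t_video - t0) / (t1 - t0)  # interpolation weight in [0, 1)
    return [
        [a + w * (b - a) for a, b in zip(j0, j1)]
        for j0, j1 in zip(skel_frames[k], skel_frames[k + 1])
    ]
```

Video frames for which the function returns `None` would have to be dropped or handled by some other policy; the source does not say which.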
[0012] Furthermore, visual semantic features are extracted from the temporally aligned video frame sequence. Since actions in complex scenes are constrained not only by the skeletal mechanics of the human body but also depend strongly on the three-dimensional spatial interaction between the human body, its surrounding environment, and the objects being manipulated, this embodiment uses a Swin-Base variant model pre-trained on the ImageNet-22K dataset as the basic extraction architecture. The aligned continuous video frame sequence is used as the network input, with an initial input tensor dimension of $T \times 3 \times H \times W$, where $T$ is the total number of frames in the aligned time series, 3 is the number of standard RGB color channels, and $H$ and $W$ are the initial pixel height and width of the input image, respectively. Since the input is a continuous sequence of video frames, the model treats the sequence as a batch of size $T$ and extracts the two-dimensional spatial features of each frame independently. Within the Swin Transformer network, the video frame sequence undergoes multiple stages of shifted-window local self-attention computation and hierarchical feature downsampling (Patch Merging). As the network depth increases, the receptive field of the visual features gradually expands and incorporates global contextual information, and a high-dimensional visual semantic feature map is output at a deep network layer. The tensor dimension of this feature map is $T \times H' \times W' \times C$, where $C$ is the number of feature channels used to characterize semantic richness, preferably set to 1024 in this embodiment.
[0013] In the aforementioned feature map dimensions, $T$ is kept equal to the number of time steps to preserve the temporal logic; $H'$ and $W'$ are the height and width of the feature map in the spatial dimension (set to 1/32 of the input resolution in this embodiment); and 1024 is the number of feature channels representing semantic richness. This high-dimensional visual semantic feature map not only encodes the appearance and texture attributes of the foreground subject; its fine-grained spatial grid structure ($H' \times W'$) also captures the layout of the environmental background and the underlying visual semantics of potential interactive objects, thereby mapping the real three-dimensional physical scene into a deep two-dimensional feature space.
[0014] By performing the above steps, this process effectively eliminates the temporal asynchrony between heterogeneous sensors and deeply mines the environmental and object semantics in the video global space. This process fundamentally solves the semantic fragmentation problem caused by spatiotemporal misalignment during multimodal feature fusion, providing an extremely accurate spatiotemporal reference and high-quality semantic material for subsequent feature injection and spatiotemporal co-coupling under a unified graph topology.
[0015] S2, the visual semantic features are injected into the node initial features of the graph convolutional network constructed from the skeletal keypoint sequence. The adjacency matrix is updated based on the semantic relevance of the node initial features and graph convolution is performed to obtain local pose features.
[0016] In this embodiment, injecting the visual semantic features into the initial node features of the graph convolutional network constructed from the skeletal keypoint sequence includes: S201, for the coordinates of each keypoint in the skeletal keypoint sequence, determine the corresponding local receptive field on the visual semantic features. It should be noted that, in the three-dimensional space of complex scenes, human actions often interact physically with specific environmental areas. To establish this spatial mapping, in this embodiment the three-dimensional coordinates (x, y, z) of each human skeletal keypoint are projected onto the two-dimensional image plane using the camera intrinsic and extrinsic parameter matrices, and then mapped onto the $H' \times W'$ feature-map grid according to the network scaling ratio. Using the mapped two-dimensional feature position as the center coordinate, a grid region of size $k \times k$ is defined on the multi-frame visual semantic features of dimension $T \times H' \times W' \times 1024$ extracted in step S1, and serves as the local receptive field window for that skeletal keypoint, thereby capturing the local environmental spatial information around the keypoint. Specifically, when mapping the projected 2D pixel coordinates to the feature map grid, the pixel coordinates are divided by the downsampling stride of the feature extraction network (32 in this embodiment) and rounded down to obtain the center index. To prevent the receptive field from running out of bounds when a skeletal point lies at the edge of the image or is occluded, this step introduces a boundary truncation and zero-padding mechanism: when the $k \times k$ window extends beyond the $H'$ or $W'$ boundary of the feature map, the out-of-range portion is filled with zero vectors, which keeps the dimensions of the multidimensional tensors strictly aligned.
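Step S201 can be illustrated with a small sketch. Several things here are assumptions rather than details from the source: the keypoint is taken to be already expressed in the camera coordinate system (so only a pinhole intrinsic projection with hypothetical parameters `fx`, `fy`, `cx`, `cy` is applied), the window size defaults to 3, and the function names and nested-list feature map layout are illustrative.

```python
import math


def project_to_grid(point_cam, fx, fy, cx, cy, stride=32):
    """Project a 3D camera-frame point to a (row, col) feature-grid index."""
    x, y, z = point_cam
    u = fx * x / z + cx  # pixel column
    v = fy * y / z + cy  # pixel row
    # Divide by the downsampling stride and round down, as in the text.
    return math.floor(v / stride), math.floor(u / stride)


def extract_window(feature_map, row, col, k=3):
    """Cut a k x k window centered at (row, col) from one frame's features.

    feature_map: H' x W' nested list whose cells are length-C vectors.
    Cells that fall outside the map are filled with zero vectors.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    half = k // 2
    window = []
    for r in range(row - half, row + half + 1):
        line = []
        for q in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= q < width:
                line.append(list(feature_map[r][q]))
            else:
                line.append([0.0] * channels)  # zero padding at the border
        window.append(line)
    return window
```

Because every window has the same k x k x C shape regardless of where the keypoint lands, the downstream pooling step never needs a special case for border joints.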
[0017] S202, spatial pooling is performed on the features within the local receptive field to obtain a semantic vector. After obtaining the local receptive field, this embodiment uses average pooling or max pooling to aggregate the feature matrix covered by the window and reduce its dimensionality. This operation compresses the spatially dispersed local features into a semantic vector of length $C$ (the number of feature channels, 1024 in this embodiment). This aggregation filters visual noise in the local area, so that the output semantic vector condenses the visual semantics of the local environment around the specific joint, for example scene-level context such as "hand holding a tool" or "foot touching a control panel".
[0018] S203, the semantic vector is concatenated with the initial features of the corresponding keypoint coordinates to obtain the initial features of the nodes in the graph convolutional network. During the construction phase of the graph convolutional network (GCN), the original initial features of the skeletal graph nodes typically contain only the geometric coordinates (x, y, z) of the keypoint in three-dimensional space and the confidence score output by the underlying pose estimation model. In this embodiment, the semantic vector of length $C$ (1024 in this embodiment) obtained for the joint is concatenated with the joint's original initial features along the channel dimension. Specifically, the concatenation joins the two feature vectors end to end to form a high-dimensional fused feature vector whose dimension equals the sum of the dimensions of the two vectors. This concatenated vector forms the fused node features, which serve as the input node attributes for subsequent processing in the GCN model. In this way, the skeletal structure of the physical space dimension and the scene environment of the visual semantic dimension are integrated at the level of the underlying architecture.
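Steps S202 and S203 amount to a pooling followed by a concatenation. The sketch below uses average pooling (the text also allows max pooling); the function names and the ordering of the concatenated fields are illustrative assumptions.

```python
def average_pool(window):
    """Average a k x k window of length-C vectors into one length-C vector."""
    cells = [cell for row in window for cell in row]
    count = len(cells)
    return [sum(values) / count for values in zip(*cells)]


def build_node_feature(coords, confidence, semantic_vec):
    """Concatenate (x, y, z), the pose confidence, and the semantic vector."""
    return list(coords) + [confidence] + list(semantic_vec)
```

With a 1024-channel semantic vector, each fused node feature would therefore have 3 + 1 + 1024 = 1028 entries.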
[0019] Furthermore, updating the adjacency matrix based on the semantic relevance of the initial node features includes: S204, calculate the feature similarity between the semantic vectors carried by any two nodes in the graph convolutional network. To evaluate how strongly different skeletal nodes in the network topology are associated in the semantic space, this invention applies the cosine similarity algorithm to the semantic vector part of the concatenated features of any two nodes. The calculation formula is as follows:
$$s_{ij} = \frac{f_i \cdot f_j}{\|f_i\|_2 \, \|f_j\|_2} \quad (2)$$
where $s_{ij}$ is the feature similarity score between node $i$ and node $j$; $f_i$ and $f_j$ are the semantic vectors of length $C$ (1024 in this embodiment) carried by node $i$ and node $j$, respectively; and $\|f_i\|_2$ and $\|f_j\|_2$ are the L2 norms (i.e., the magnitudes) of these two vectors. By computing the cosine of the angle between the two vectors in the multidimensional semantic feature space, this formula quantitatively measures the degree of semantic correlation between the local environments of the two physical joints.
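A minimal sketch of the cosine similarity in S204 follows. The small `eps` guard against all-zero vectors (which can arise from zero padding) is an added assumption, not something the source specifies.

```python
import math


def cosine_similarity(f_i, f_j, eps=1e-12):
    """Cosine of the angle between two semantic vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    # eps (assumed) avoids division by zero for all-zero vectors.
    return dot / max(norm_i * norm_j, eps)
```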
[0020] S205, if the feature similarity is greater than a preset similarity threshold, and the two nodes are not directly connected in the physical skeleton structure, a semantic connection edge is established between the two nodes in the adjacency matrix. It should be noted that, in collaborative actions in complex scenarios, parts of the human body that are far apart often interact with the same object. For example, in a construction scenario, the worker's "right hand" node and the distant hand node supporting the "scaffolding" are not directly connected in the topology of the human physiological skeleton, but when they take part in a collaborative operation, their extracted semantic vectors become similar and their feature similarity increases significantly. When the model determines that the feature similarity is greater than the preset similarity threshold (such as 0.85) and the two nodes lack a direct physical skeleton edge, the element at the corresponding row and column position in the dynamic adjacency matrix A of the GCN is changed from 0 to a non-zero value, thereby explicitly establishing a semantic connection edge across the physical structure within the graph.
[0021] As a preferred implementation, after establishing a semantic connection edge between the two nodes in the adjacency matrix, the following steps are included: S206, obtain the physical distance between the two nodes in three-dimensional space. To ensure that the newly established semantic connection does not violate the objective physical laws of the real three-dimensional world, spatial distance is further introduced as a physical constraint. In this embodiment, the Euclidean distance between the two connected nodes is calculated from the original three-dimensional coordinate data of the skeleton keypoints. The calculation formula is as follows:
$$d_{ij} = \|p_i - p_j\|_2 = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \quad (3)$$
where $d_{ij}$ is the physical distance between node $i$ and node $j$ in real three-dimensional space, and $p_i = (x_i, y_i, z_i)$ and $p_j = (x_j, y_j, z_j)$ are the absolute three-dimensional spatial coordinates of node $i$ and node $j$, respectively. This physical distance objectively reflects the span of the potential interaction on a real-world scale.
[0022] S207, calculate the weight value of the semantic connection edge based on a weighted combination of the feature similarity and the physical distance. To allocate the intensity of information transmission during graph convolution feature aggregation reasonably, this invention considers both the semantic prior and the physical decay effect of spatial distance, and dynamically assigns the connection weights of the newly created topological edges. To avoid network training instability caused by different units of measurement, this embodiment uses an exponential decay function to map the physical distance to the (0, 1] interval. The weight calculation formula is as follows:
$$A_{ij} = \alpha \, s_{ij} + \beta \, e^{-\gamma d_{ij}} \quad (4)$$
where $A_{ij}$ is the updated semantic connection edge weight in the adjacency matrix A of the graph convolutional network; $s_{ij}$ is the feature similarity between node $i$ and node $j$; $d_{ij}$ is their physical distance in three-dimensional space; $\alpha$ and $\beta$ are preset weight parameters; and $\gamma$ is the hyperparameter controlling the rate of physical distance decay (e.g., set to 0.1).
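Steps S205 to S207 can be combined into one adjacency update, sketched below. The weighting form (a linear combination of similarity and an exponentially decayed distance), the default values `alpha=0.5` and `beta=0.5`, the symmetric update, and the function name are assumptions for illustration; only the 0.85 threshold and the decay rate 0.1 appear as examples in the text.

```python
import math


def add_semantic_edges(adj, sims, coords, skeleton_edges,
                       sim_threshold=0.85, alpha=0.5, beta=0.5, gamma=0.1):
    """Return a copy of adj with weighted semantic edges added.

    adj: N x N adjacency matrix holding the physical skeleton edges.
    sims: N x N matrix of cosine similarities between semantic vectors.
    coords: list of (x, y, z) joint coordinates.
    skeleton_edges: set of (i, j) pairs that are physically connected.
    """
    n = len(adj)
    updated = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in skeleton_edges or (j, i) in skeleton_edges:
                continue  # keep existing physical connections untouched
            if sims[i][j] <= sim_threshold:
                continue  # not similar enough for a semantic edge
            distance = math.dist(coords[i], coords[j])
            # Assumed form: similarity plus distance decayed into (0, 1].
            weight = alpha * sims[i][j] + beta * math.exp(-gamma * distance)
            updated[i][j] = updated[j][i] = weight
    return updated
```

Since the decay term lies in (0, 1] and the similarity of a qualifying pair exceeds the threshold, the new weight stays on a scale comparable to the unit weights of the physical edges when `alpha + beta = 1`.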
[0023] Subsequently, graph convolution is performed based on the updated dynamic adjacency matrix A. To ensure the numerical stability of feature aggregation, the updated degree matrix $D$ is calculated, with diagonal elements $D_{ii} = \sum_j A_{ij}$ (where $A_{ij}$ are the elements of matrix A). The convolution formula is as follows:
$$H^{(l+1)} = \sigma\!\left(D^{-1/2} A \, D^{-1/2} H^{(l)} W^{(l)}\right) \quad (5)$$
where $H^{(l)}$ is the node feature input of the $l$-th layer (including the features concatenated from the original attributes and the visual semantics); $H^{(l+1)}$ is the output, i.e., the local pose feature; $W^{(l)}$ is the learnable weight parameter matrix of this layer; and $\sigma$ is a non-linear activation function (such as ReLU). Through this normalized calculation, the output is a local pose feature that incorporates high-order neighborhood environment information. The resulting feature distribution of the updated adjacency matrix is shown in Figure 2.
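One normalized graph convolution layer can be sketched in plain Python as follows. The symmetric normalization and the ReLU activation are assumptions consistent with the common GCN formulation; whether self-loops are added to A is not stated in the source, so the caller is assumed to supply an adjacency matrix that already contains whatever diagonal entries are intended.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, features, weight):
    """Compute ReLU(D^-1/2 A D^-1/2 H W) for one layer."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    norm_adj = [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
                for i in range(n)]
    projected = matmul(matmul(norm_adj, features), weight)
    return [[max(0.0, value) for value in row] for row in projected]
```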
[0024] Figure 2 compares the feature heatmaps of a traditional skeleton adjacency matrix and the semantically driven dynamic adjacency matrix. As the data distribution in Figure 2 shows, the traditional matrix on the left exhibits high response values (dark areas) only on the diagonal and between adjacent physiological joints (such as the elbow and wrist joints, whose node indices are close). In contrast, the dynamic adjacency matrix A on the right, calculated using the aforementioned weighting formula, not only retains the high weights of the physical connections but also shows secondary bright connection bands with weights greater than 0.68 between node pairs that are far apart in space but highly similar semantically (e.g., the "right wrist" at vertical-axis node index 7 and the distant "scaffolding support pole" at horizontal-axis node index 18). This visualization, generated from tensors extracted from a real network, demonstrates the feasibility and effectiveness of the dynamic topology mechanism of this invention for semantic interaction across physical distances and for autonomous focusing on the scene.
[0025] By performing the aforementioned feature mapping fusion and adjacency matrix dynamic update steps, this invention breaks the rigid network structure of traditional graph convolutional networks, which can only rely on physically adjacent skeletal joints for local information transmission. This dynamic topology perception mechanism, driven by both semantics and physical distance, effectively endows neural networks with the "scene perception" ability to overcome spatial physical limitations, autonomously focus, and aggregate remote environmental interaction objects, significantly improving the model's generalization and reasoning capabilities when facing complex interactive actions from the underlying logic.
[0026] S3. Construct a multimodal hypergraph based on the kinematic properties of the local pose features and the spatial convergence trend of the visual semantic features. Use the hyperedges in the hypergraph to aggregate pose nodes and semantic nodes to obtain coupled features.
[0027] In this embodiment, constructing a multimodal hypergraph based on the kinematic properties of the local pose features and the spatial convergence trend of the visual semantic features includes: S301, the local pose features are mapped to pose nodes in the hypergraph, and the visual semantic features are mapped to semantic nodes in the hypergraph. It should be noted that, in complex human-machine-environment interaction scenarios, a single action often involves the coordination of multiple body parts with the environment and tools. To overcome the structural limitation of traditional graph networks, whose edges can only connect two nodes, this embodiment introduces a multimodal hypergraph structure. Its node set $V$ contains two types of heterogeneous nodes. The first is the set $V_p$ of local pose nodes of the human skeleton output by the preceding graph convolutional network. The second is the set $V_s$ of semantic nodes, obtained by using an object detection algorithm (such as YOLOv8) to extract environmental entities (such as tools and consoles) and their features from the original video frames. Specifically, after extracting the environmental entity features, this embodiment either uses the aligned depth map acquired by a depth sensor (such as LiDAR or an RGB-D camera) or a pre-trained monocular depth model (such as DepthAnything) to obtain the Z-axis depth value of the target center point in the camera coordinate system. Then, using the camera intrinsic parameter matrix, the 2D pixel coordinates are back-projected to obtain the three-dimensional absolute coordinates of each semantic node. Together, these two sets constitute all nodes of the hypergraph, i.e., $V = V_p \cup V_s$.
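The back-projection of a detected object's center into camera coordinates can be sketched with the standard pinhole model. The intrinsic parameter names `fx`, `fy`, `cx`, `cy` and the function name are illustrative; the depth value is assumed to come from whichever depth source is used.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Recover camera-frame (x, y, z) from pixel (u, v) and its depth z."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

This is the inverse of the projection used when mapping skeletal keypoints onto the image plane, so pose nodes and semantic nodes end up in the same coordinate system.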
[0028] S302, calculate the motion trajectory vector of each pose node, and evaluate the convergence trend between the motion trajectory vector and each semantic node in three-dimensional space. When constructing a higher-order topology, relying solely on static spatial distances is not enough; the dynamic temporal evolution of the physical motion must also be introduced. This invention analyzes the changes in the three-dimensional coordinates of the pose node set $V_p$ across multiple consecutive frames on the timeline and extracts the spatiotemporal motion trajectory vectors. Subsequently, geometric and kinematic correlation calculations are performed between each trajectory vector and the absolute coordinates (for static environments) or dynamically evolving coordinates (for moving objects) of the semantic node set $V_s$ in three-dimensional space, to preliminarily determine whether a human joint and an environmental entity tend to approach each other in three-dimensional physical space and are likely to come into contact or interact.
[0029] S303, when the motion trajectory vector of at least one pose node and at least one semantic node exhibit the convergence trend, a hyperedge is generated that includes the pose nodes and the semantic nodes exhibiting the convergence trend. Unlike an edge in a regular graph, which can only connect two nodes, a hyperedge $e$ in a hypergraph has an unpaired envelope property and can connect any number of nodes simultaneously. For example, in the action scenario of "carrying heavy objects with both hands," this invention can generate a hyperedge that simultaneously envelops and connects the worker's "right-hand pose node," "left-hand pose node," and the "heavy tool semantic node" in the environment. This hyperedge, dynamically generated based on convergence trends, maps the many-to-many or many-to-one collaborative interaction logic of the real physical world.
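Hyperedge generation in S303 can be sketched as a grouping operation. The rule used here, one hyperedge per semantic node containing every pose node that converges on it, is an assumption; the source only requires that a hyperedge contain the pose and semantic nodes exhibiting the convergence trend. The function names and string node labels are illustrative.

```python
def build_hyperedges(convergent_pairs):
    """Group (pose_node, semantic_node) pairs into hyperedges.

    Returns one frozenset per semantic node, containing that node and
    all pose nodes converging on it.
    """
    groups = {}
    for pose_node, semantic_node in convergent_pairs:
        groups.setdefault(semantic_node, set()).add(pose_node)
    return [frozenset(poses | {semantic})
            for semantic, poses in groups.items()]


def incidence_matrix(nodes, hyperedges):
    """Build the |V| x |E| incidence matrix: 1 if the node is in the edge."""
    return [[1 if node in edge else 0 for edge in hyperedges]
            for node in nodes]
```

For the "carrying heavy objects with both hands" example, the pairs (right hand, heavy tool) and (left hand, heavy tool) collapse into a single three-node hyperedge.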
[0030] Further, evaluating the convergence trend between the motion trajectory vector and each of the semantic nodes in three-dimensional space includes: S304, extract the velocity and acceleration vectors of the pose node at successive time steps. To eliminate the high-frequency jitter noise produced by the underlying pose estimation model, this embodiment first applies a Kalman filter or an exponential moving average (EMA) algorithm to the original three-dimensional spatial coordinates for temporal smoothing and denoising. Then, based on the first-order and second-order discrete difference principle, the kinematic parameters are extracted from the smoothed coordinates using the following formulas:
$$v_t = \frac{p_t - p_{t-1}}{\Delta t} \quad (6)$$
$$a_t = \frac{v_t - v_{t-1}}{\Delta t} \quad (7)$$
where $p_t$ and $p_{t-1}$ are the absolute coordinates of the pose node in three-dimensional space at the current time step $t$ and the previous time step $t-1$, respectively; $\Delta t$ is the time interval between two frames; $v_t$ is the three-dimensional velocity vector extracted at the current time step; and $a_t$ is the three-dimensional acceleration vector extracted at the current time step.
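The smoothing and differencing in S304 can be sketched as follows, using the EMA option (the text also allows a Kalman filter). The smoothing factor `alpha=0.5` and the function names are assumptions; at least three consecutive positions are needed to obtain one acceleration sample.

```python
def ema_smooth(points, alpha=0.5):
    """Exponentially smooth a sequence of 3D points (alpha is assumed)."""
    smoothed = [list(points[0])]
    for point in points[1:]:
        previous = smoothed[-1]
        smoothed.append([alpha * c + (1.0 - alpha) * q
                         for c, q in zip(point, previous)])
    return smoothed


def velocity_and_acceleration(points, dt):
    """First and second discrete differences at the latest time step.

    points: at least three consecutive positions; the last is time t.
    Returns (v_t, a_t) as lists of three components.
    """
    p_prev2, p_prev, p_curr = points[-3], points[-2], points[-1]
    v_prev = [(b - a) / dt for a, b in zip(p_prev2, p_prev)]
    v_curr = [(b - a) / dt for a, b in zip(p_prev, p_curr)]
    a_curr = [(b - a) / dt for a, b in zip(v_prev, v_curr)]
    return v_curr, a_curr
```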
[0031] S305, based on the velocity vector and the acceleration vector, calculate the rate of change of the spatial distance from the pose node to the target semantic node. In complex movements, human limb movements are often non-uniform. To predict motion trends more accurately and to overcome the lag that arises when variable-speed motion is evaluated from a single velocity variable, this invention introduces the acceleration vector to apply kinematic look-ahead compensation to the instantaneous velocity. The look-ahead velocity obtained by superimposing the velocity vector and the acceleration vector, $v_{rel} + a_t \tau$, is projected onto the direction between the two nodes, giving the rate of change of spatial distance (i.e., the dynamic approach velocity):
$$r = (v_{rel} + a_t \tau) \cdot \hat{n}, \qquad \hat{n} = \frac{q_t - p_t}{\|q_t - p_t\|_2} \quad (8)$$
where $r$ is the rate of change of spatial distance taking acceleration into account, with a positive value indicating that the two nodes are approaching each other; $q_t$ and $p_t$ are the three-dimensional absolute spatial coordinates of the semantic node and the pose node at the current time step, respectively; $v_{rel}$ is the velocity vector of the pose node relative to the semantic node; $a_t$ is the three-dimensional acceleration vector of the pose node at the current time step; $\tau$ is a preset look-ahead time compensation coefficient (e.g., set to 0.1 seconds) that converts the acceleration into a predicted velocity increment over the near future; and $\hat{n}$ is the normalized direction vector pointing from the pose node to the semantic node.
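The look-ahead approach rate of S305 can be sketched as a dot product. The function name and the convention of returning 0.0 when the two nodes coincide are assumptions; the default `tau=0.1` follows the example value in the text.

```python
import math


def approach_rate(p_pose, p_sem, v_rel, a_pose, tau=0.1):
    """Project the look-ahead velocity onto the pose-to-semantic direction.

    A positive result means the pose node is closing in on the target.
    """
    diff = [s - p for s, p in zip(p_sem, p_pose)]
    distance = math.sqrt(sum(c * c for c in diff))
    if distance == 0.0:
        return 0.0  # direction undefined when the nodes coincide (assumed)
    direction = [c / distance for c in diff]
    look_ahead = [v + tau * a for v, a in zip(v_rel, a_pose)]
    return sum(l * n for l, n in zip(look_ahead, direction))
```

A hand moving at 1 m/s toward a target while accelerating at 10 m/s² gives a look-ahead rate of 2 m/s with the 0.1 s coefficient, twice what the velocity alone would suggest.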
[0032] S306, if the rate of change of spatial distance indicates that the pose node is accelerating towards the target semantic node, and the expected contact time is less than a preset time threshold, determining that a convergence trend exists between the two. Based on the above kinematic calculation, when the obtained rate of change of spatial distance satisfies $v_{\mathrm{app}} > 0$, the pose node is, under the positive superposition effect of the acceleration $\mathbf{a}_t$, accelerating towards the target semantic node. On this basis, this embodiment further introduces collision prediction logic based on physical constraints and defines the expected contact time between the two as:
$$T_{\mathrm{contact}} = \frac{d}{v_{\mathrm{app}} + \epsilon} \tag{9}$$
where $d$ is the Euclidean distance between the two in three-dimensional space, and $\epsilon$ is a small positive safety constant that prevents overflow when the denominator is zero or extremely small. If the calculated expected contact time $T_{\mathrm{contact}}$ is less than the sensitive time threshold set by the system (e.g., 0.5 seconds), it is determined from a strict kinematic perspective that the two are about to interact, i.e., a convergence trend exists, and this serves as the prerequisite for triggering generation of the hyperedge. The three-dimensional spatiotemporal evolution and kinematic parameter simulation results of this determination process are shown in Figure 3.
[0033] Figure 3 is a three-dimensional visualization simulation of the motion trajectories and convergence trends of pose nodes and target semantic nodes. As shown in Figure 3, in a three-dimensional coordinate system (XYZ space, unit: meters), the blue continuous scatter curve represents the motion trajectory of the worker's "right hand" pose node over consecutive time steps, and the red static cube represents the semantic node of the "suspended weight" in the environment. The green arrows extending outward from the pose node represent the instantaneous integrated velocity vector after look-ahead compensation, whose direction points to the red target semantic node. The yellow dashed cone region represents the three-dimensional spatial envelope of potential future physical interference, delineated from the expected contact time (the simulated value is 0.32 seconds, which is less than the 0.5-second threshold). The simulation shows that look-ahead spatiotemporal compensation based on velocity and acceleration can determine the anchor points of high-order interactive actions in complex three-dimensional geometric space before actual physical contact occurs, thus providing a rigorous kinematic basis for the accurate construction of hyperedges.
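The convergence test of steps S305 and S306 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the default `eps = 1e-6` is an assumed value for the safety constant (the embodiment does not state it here), and coincident nodes are treated as showing no approach because the direction vector is undefined.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def approach_rate(p_pose: Vec3, p_sem: Vec3, v_rel: Vec3, a_pose: Vec3,
                  tau: float = 0.1) -> Tuple[float, float]:
    """Return (approach rate, Euclidean distance); positive rate means approaching."""
    diff = tuple(s - p for s, p in zip(p_sem, p_pose))
    dist = math.sqrt(sum(c * c for c in diff))
    if dist == 0.0:
        return 0.0, 0.0  # direction undefined; assumption: report no approach
    direction = tuple(c / dist for c in diff)  # unit vector pose -> semantic
    look_ahead = tuple(v + tau * a for v, a in zip(v_rel, a_pose))
    rate = sum(l * n for l, n in zip(look_ahead, direction))
    return rate, dist


def convergence_trend(p_pose: Vec3, p_sem: Vec3, v_rel: Vec3, a_pose: Vec3,
                      tau: float = 0.1, t_threshold: float = 0.5,
                      eps: float = 1e-6) -> Tuple[bool, float]:
    """Return (trend exists, expected contact time in seconds)."""
    rate, dist = approach_rate(p_pose, p_sem, v_rel, a_pose, tau)
    if rate <= 0.0:
        return False, math.inf  # moving away or stationary
    t_contact = dist / (rate + eps)
    return t_contact < t_threshold, t_contact
```

For example, a hand 1 m from the target moving at 2 m/s with 10 m/s² of acceleration along the line of approach gives a look-ahead velocity of 3 m/s and an expected contact time of about 0.33 s, which is below the 0.5-second threshold.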
[0034] As a preferred embodiment, the step of aggregating pose nodes and semantic nodes using hyperedges in the hypergraph to obtain coupled features includes: S307, aggregating the features of all pose nodes and semantic nodes connected to the same hyperedge onto that hyperedge to generate hyperedge features. After the multimodal hypergraph is constructed, the Node-to-Edge message passing aggregation phase is performed. The hypergraph network performs weighted fusion of the features of all heterogeneous nodes belonging to the same hyperedge, and the hyperedge features are generated as follows:
$$\mathbf{h}_e^{(l)} = \sum_{v \in \mathcal{N}(e)} \mathbf{W}_{n2e}\,\mathbf{h}_v^{(l)} \tag{10}$$
where $\mathbf{h}_e^{(l)}$ is the feature representation of the generated hyperedge $e$; $\mathcal{N}(e)$ is the set of all nodes (including pose nodes and semantic nodes) connected to hyperedge $e$; $\mathbf{h}_v^{(l)}$ is the current hidden-layer feature tensor of node $v$; and $\mathbf{W}_{n2e}$ is the node-to-hyperedge aggregation weight matrix, used to measure the contribution of different nodes to the interaction event.
[0035] S308, updating the features of each node connected to the hyperedge using the hyperedge features. The network then performs the Edge-to-Node reverse distribution and feature update phase: the hyperedge broadcasts its aggregated high-order interaction information back to each connected node, completing one iteration of the node features. The update formula is as follows:
$$\mathbf{h}_v^{(l+1)} = \sigma\!\left(\mathbf{W}_{e2n}\,\mathbf{h}_e^{(l)}\right) \tag{11}$$
where $\mathbf{h}_v^{(l+1)}$ is the updated node feature output by the $(l+1)$-th network layer; $\mathbf{h}_e^{(l)}$ is the hyperedge feature output by the $l$-th hypergraph convolution; $\mathbf{W}_{e2n}$ is the hyperedge-to-node feature distribution weight matrix; and $\sigma$ is a non-linear activation function (such as ReLU or LeakyReLU) used to improve the model's ability to fit complex feature boundaries.
[0036] S309, after a preset number of iterations, concatenating the updated features of all nodes in the hypergraph to obtain the coupled features. To prevent the over-smoothing caused by excessive graph convolution (i.e., all node features becoming homogeneous and losing their distinctiveness), the number of hypergraph convolution iterations is typically set to 2 or 3 in this embodiment. Because the total number of nodes in the hypergraph can change dynamically in complex scenarios, this embodiment does not flatten the node features directly after the preset number of iterations (e.g., 2), since doing so would not guarantee a consistent input dimension for the subsequent network. Instead, Global Average Pooling or Global Max Pooling is applied along the graph node dimension, separately to all pose nodes and to all semantic nodes. The resulting fixed-length pooled features are then concatenated along the channel dimension to obtain the coupled features.
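Steps S307 to S309 can be sketched with plain Python lists as follows. This is an illustrative sketch, not the embodiment's implementation: the function names are hypothetical, a single shared weight matrix is assumed for each direction, a node that belongs to several hyperedges is assumed to sum their messages, and a node that belongs to no hyperedge is assumed to keep its features unchanged.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(x: Sequence[float]) -> Vector:
    return [max(0.0, v) for v in x]


def node_to_edge(features: List[Vector], hyperedges: List[List[int]],
                 w_n2e: Matrix) -> List[Vector]:
    """Node-to-Edge phase: sum the projected features of each hyperedge's members."""
    edge_features = []
    for members in hyperedges:
        acc = [0.0] * len(w_n2e)
        for v in members:
            acc = [a + m for a, m in zip(acc, matvec(w_n2e, features[v]))]
        edge_features.append(acc)
    return edge_features


def edge_to_node(features: List[Vector], hyperedges: List[List[int]],
                 edge_features: List[Vector], w_e2n: Matrix) -> List[Vector]:
    """Edge-to-Node phase: broadcast hyperedge features back and apply ReLU."""
    incoming: Dict[int, Vector] = {}
    for members, h_e in zip(hyperedges, edge_features):
        msg = matvec(w_e2n, h_e)
        for v in members:
            prev = incoming.get(v, [0.0] * len(msg))
            incoming[v] = [a + m for a, m in zip(prev, msg)]
    # Assumption: nodes outside every hyperedge keep their previous features.
    return [relu(incoming[v]) if v in incoming else list(f)
            for v, f in enumerate(features)]


def mean_pool(features: List[Vector], indices: Sequence[int]) -> Vector:
    dim = len(features[0])
    return [sum(features[i][k] for i in indices) / len(indices) for k in range(dim)]


def coupled_features(features: List[Vector], pose_idx: Sequence[int],
                     sem_idx: Sequence[int], hyperedges: List[List[int]],
                     w_n2e: Matrix, w_e2n: Matrix, iterations: int = 2) -> Vector:
    """Run the two phases, pool pose and semantic nodes separately, concatenate."""
    for _ in range(iterations):
        edge_features = node_to_edge(features, hyperedges, w_n2e)
        features = edge_to_node(features, hyperedges, edge_features, w_e2n)
    return mean_pool(features, pose_idx) + mean_pool(features, sem_idx)
```

Because pooling collapses the node dimension, the output length is always twice the feature dimension, regardless of how many pose or semantic nodes the scene contains.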
[0037] By performing the above steps, this invention uses a multimodal hypergraph driven by physical kinematics to replace the computationally intensive pairwise cross-attention mechanism of traditional networks. At the level of the underlying architecture, it directly realizes non-paired, structured coupling modeling of ternary ("human-tool-environment") and higher-order action logic in complex scenarios, providing a deeply aligned physical-semantic basis for the accuracy of subsequent action classification.
[0038] S4. Based on the coupling features, perform action reasoning and output the action classification result.
[0039] In this embodiment, the step of performing action reasoning based on the coupling features and outputting the action classification result includes: S401, inputting the coupling features into a pre-configured physical simulator to obtain a physical determination result characterizing the physical interaction state between entities. It should be noted that neural networks relying solely on numerical mapping in feature space and on visual probability distributions often produce judgments that defy common sense when facing complex occlusion scenarios. Therefore, this embodiment introduces a physical simulation mechanism based on the laws of classical mechanics before the final classification reasoning. The coupling features obtained in the preceding steps are passed as input to a pre-configured rigid-body physical simulator. Specifically, before receiving input, the physical simulator (e.g., the open-source Bullet physics engine) is pre-configured with global physical environment parameters aligned with the real-world working scenario, including a gravitational acceleration vector (e.g., ... m/s²), material friction coefficients, and collision restitution coefficients, to ensure that the mechanical calculations in the virtual space are highly consistent with the objective laws of the real world. This step uses spatial geometry and kinematics to pre-examine whether the action sequences contained in the multimodal features conform to real-world physical laws, and outputs a physical determination result that quantitatively represents the physical interaction state between entities (such as human limbs and tools, or feet and supporting surfaces).
[0040] S402, fusing the physical determination result with the coupling features and inputting the result into the visual language model for joint inference, and outputting the action classification result. After obtaining the hard quantitative evidence from the physical mechanics layer, this invention performs cross-modal deep fusion of the physical determination result and the original coupling features. The fused multidimensional information sequence is input into the Vision-Language Model (VLM) for joint final inference. To ensure the accuracy of cross-modal semantic understanding and the feasibility of the system, the visual language model in this embodiment adopts a large multimodal model based on the BLIP-2 (Bootstrapping Language-Image Pre-training 2) or InstructBLIP architecture; its text decoder can be an open-source large language model such as LLaMA, Vicuna, or Flan-T5, with 7B or 13B parameters. Owing to the zero-shot generalization ability and cross-modal understanding acquired through pre-training on massive image-text pairs, the model can overcome the limitation of traditional small models with fixed label sets, capture complex and rare action patterns, and finally output, in natural language, action description labels and classification results with fine-grained semantic information.
[0041] Further, the step of inputting the coupling features into a pre-configured physical simulator to obtain a physical determination result characterizing the physical interaction state between entities includes: S403, decoding the coupling features into skeleton bounding box parameters and object bounding box parameters in three-dimensional space. To transform the abstract high-level graph network coupling features back into three-dimensional geometric entity data that a physical simulator can read and parse directly, this embodiment uses a regression head built from a multilayer perceptron (MLP) to perform inverse spatial projection and decoupling mapping on the coupling features. Specifically, the regression head is an MLP with three fully connected layers whose hidden dimensions are configured as [512, 256, 128], with layer normalization between layers to accelerate convergence. For each human skeleton node represented in the feature tensor and its corresponding environmental interaction object, the regression network outputs the geometric data of its bounding box in a unified three-dimensional world coordinate system. The parameter set of each independent bounding box is decoded into three-dimensional center point coordinates and the corresponding three-dimensional extent parameters. During training, this embodiment uses a smooth L1 loss function and a 3D GIoU loss function to calculate the error between the decoded bounding box and the bounding box annotated in real physical space. The training hyperparameters are as follows: the AdamW optimizer is used with a preset initial learning rate and a batch size of 32, and the model is trained for 50 epochs on an action physics dataset containing 3D geometric annotations. The parameters of the regression network are optimized through backpropagation to ensure the metric-scale accuracy of the 3D geometric decoupling.
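The two loss terms named in S403 can be illustrated with the following sketch. It makes simplifying assumptions that are not taken from the embodiment: boxes are axis-aligned and encoded as `(cx, cy, cz, w, h, l)` with strictly positive extents, the smooth L1 threshold `beta` defaults to 1.0, and the two terms are combined with a hypothetical weight `giou_weight`. Oriented boxes would require a more involved intersection computation.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # assumed encoding: (cx, cy, cz, w, h, l), extents > 0


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Mean smooth L1 error: quadratic below beta, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def _bounds(box: Box) -> Tuple[List[float], List[float]]:
    lo = [box[i] - box[i + 3] / 2.0 for i in range(3)]
    hi = [box[i] + box[i + 3] / 2.0 for i in range(3)]
    return lo, hi


def _volume(lo: Sequence[float], hi: Sequence[float]) -> float:
    vol = 1.0
    for a, b in zip(lo, hi):
        vol *= max(0.0, b - a)  # empty overlap on any axis gives zero volume
    return vol


def giou_3d(a: Box, b: Box) -> float:
    """Generalized IoU of two axis-aligned 3D boxes, in the range (-1, 1]."""
    lo_a, hi_a = _bounds(a)
    lo_b, hi_b = _bounds(b)
    inter = _volume([max(x, y) for x, y in zip(lo_a, lo_b)],
                    [min(x, y) for x, y in zip(hi_a, hi_b)])
    union = _volume(lo_a, hi_a) + _volume(lo_b, hi_b) - inter
    hull = _volume([min(x, y) for x, y in zip(lo_a, lo_b)],
                   [max(x, y) for x, y in zip(hi_a, hi_b)])
    return inter / union - (hull - union) / hull


def box_loss(pred: Box, target: Box, giou_weight: float = 1.0) -> float:
    """Assumed combination: smooth L1 on parameters plus weighted (1 - GIoU)."""
    return smooth_l1(pred, target) + giou_weight * (1.0 - giou_3d(pred, target))
```

Unlike plain IoU, the GIoU term stays informative for disjoint boxes: it becomes more negative as the boxes move apart, so the loss still reflects how far apart they are.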
[0042] S404, using the physical simulator to calculate at least one of the contact stress, collision state, and center of gravity balance state between the skeleton bounding box and the object bounding box, to obtain the physical determination result.
[0043] After obtaining the precise 3D geometric bounding box parameters, this invention employs a lightweight rigid-body physics engine such as Bullet as the underlying architecture of the physical simulator. Specifically, the physical determination result is generated as follows: (1) Solving for collision state and contact stress: the physics engine uses the GJK (Gilbert-Johnson-Keerthi) distance algorithm and the Expanding Polytope Algorithm (EPA) to perform interference and collision detection between oriented bounding boxes (OBBs) and to obtain the minimum penetration depth $\delta$.
[0044] If the calculated minimum penetration depth satisfies $\delta > 0$, the real-time collision state is determined to be "collision occurred" (TRUE); otherwise, it is "no collision" (FALSE).
[0045] When a collision is determined to have occurred, this embodiment introduces a mechanical model based on the penalty function method, in which the contact stress $\sigma_c$ is calculated as follows:
$$\sigma_c = \frac{k\,\delta + c\,\dot{\delta}}{A} \tag{12}$$
where $\sigma_c$ is the contact stress scalar; $k$ is the preset material stiffness coefficient; $\delta$ is the penetration depth; $\dot{\delta}$ is the instantaneous rate of change of the penetration depth; $c$ is the damping coefficient; and $A$ is the equivalent contact area estimated from the intersection area of the bounding boxes.
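The collision flag and the penalty-method contact stress can be written as a short sketch. The function and parameter names are hypothetical, and the penetration depth is assumed to be supplied by the engine's GJK/EPA query rather than computed here.

```python
def collision_state(penetration_depth: float) -> bool:
    """TRUE when the boxes interpenetrate (positive minimum penetration depth)."""
    return penetration_depth > 0.0


def contact_stress(penetration_depth: float, penetration_rate: float,
                   stiffness: float, damping: float, contact_area: float) -> float:
    """Penalty-method stress: (stiffness * depth + damping * depth rate) / area.

    Returns 0.0 when there is no collision. contact_area must be positive.
    """
    if not collision_state(penetration_depth):
        return 0.0
    force = stiffness * penetration_depth + damping * penetration_rate
    return force / contact_area
```

For instance, a penetration of 0.01 with stiffness 1000, damping 5, penetration rate 0.2, and contact area 0.5 gives a penalty force of 11 and a stress of 22 in the corresponding units.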
[0046] (2) Determination of the center of gravity balance state: the simulator calculates the overall composite center of gravity of the system based on the preset biomechanical mass distribution parameters of each skeletal component, and obtains its two-dimensional projected coordinates on the bottom support surface. The two-dimensional projected coordinates of the composite center of gravity are calculated as follows:
$$\mathbf{P}_{\mathrm{CoG}} = \frac{\sum_{i=1}^{N} m_i\,\mathbf{p}_i}{\sum_{i=1}^{N} m_i} \tag{13}$$
where $\mathbf{P}_{\mathrm{CoG}}$ is the two-dimensional projected coordinate vector of the composite center of gravity of the human body system on the support plane (i.e., the absolutely horizontal ground, perpendicular to the vertically downward direction of gravity); $N$ is the total number of decoded human skeleton bounding boxes; $m_i$ is the preset biomechanical mass scalar of the $i$-th skeletal bounding box (this scalar can be configured according to the standard percentage of each limb segment in total body mass given in standard anthropometric parameter tables); and $\mathbf{p}_i$ is the projection of the three-dimensional center point of the $i$-th skeleton bounding box onto the target support plane, in absolute coordinates.
[0047] Furthermore, this embodiment determines the center of gravity balance state through the following geometric logic: if the calculated projected coordinates of the composite center of gravity fall outside the effective support polygon of the current action (such as the geometric convex hull formed by the contact surfaces of the feet), the physics engine determines, based on the laws of statics, that the current human posture is unstable. The hard numerical indicators output by these mechanical calculations, namely contact stress, collision interference state, and posture balance state, constitute the aforementioned physical determination result. The bounding box decoupling and center of gravity projection balance determination mechanism is shown in Figure 4.
[0048] Figure 4 is a three-dimensional visualization illustrating the spatial decoupling of the entity bounding boxes and the determination of the balance state of the composite center of gravity in this embodiment. As shown in Figure 4, in the virtual three-dimensional coordinate space reconstructed by the physical simulator, the high-dimensional coupling features have been decoded back into geometry: the human skeleton is mapped to a series of blue three-dimensional oriented bounding boxes (OBBs, such as the torso and limb bounding boxes) with independent mass attributes, while the scaffolding platform at the bottom is mapped to a gray static environment bounding box. A red solid sphere marks the overall composite center of gravity, calculated as the mass-weighted combination of all skeletal bounding boxes. Along the direction of gravitational acceleration (the negative Z-axis), this center of gravity is projected vertically onto the absolute ground plane to form a two-dimensional projection point (marked with a green cross in the figure). In addition, connecting the contact points at the bottom of the two foot bounding boxes generates the yellow polygonal area in the figure, which represents the effective support polygon in the mechanical sense. In this hazardous behavior simulation case (such as a worker leaning too far during unauthorized operations), the figure shows that the green projection point has crossed the boundary and lies completely outside the yellow effective support polygon. On this basis, the system determines that the static torque of the system is unbalanced, which provides solid evidence from spatial geometry and classical mechanics for the absolute physical constraint <POSTURE: UNSTABLE> that is subsequently injected into the large model.
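The center of gravity balance check can be sketched as below. The sketch assumes the Z-axis is vertical (as in Figure 4), so projection onto the support plane keeps the X and Y components, and it assumes the support polygon is convex with vertices listed in a consistent order. Function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point2 = Tuple[float, float]


def cog_projection(masses: Sequence[float],
                   centers: Sequence[Sequence[float]]) -> Point2:
    """Mass-weighted mean of box centers, projected onto the XY plane."""
    total = sum(masses)
    x = sum(m * c[0] for m, c in zip(masses, centers)) / total
    y = sum(m * c[1] for m, c in zip(masses, centers)) / total
    return (x, y)


def inside_convex_polygon(point: Point2, polygon: List[Point2]) -> bool:
    """True if point lies inside or on a convex polygon with ordered vertices."""
    sign = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The cross product tells on which side of edge i the point lies.
        cross = (x2 - x1) * (point[1] - y1) - (y2 - y1) * (point[0] - x1)
        if cross != 0:
            current = 1 if cross > 0 else -1
            if sign == 0:
                sign = current
            elif current != sign:
                return False  # point is on different sides of two edges
    return True


def posture_state(masses: Sequence[float], centers: Sequence[Sequence[float]],
                  support_polygon: List[Point2]) -> str:
    projected = cog_projection(masses, centers)
    return "SAFE" if inside_convex_polygon(projected, support_polygon) else "UNSTABLE"
```

A point inside a convex polygon lies on the same side of every edge, which is why a single change of sign in the cross products is enough to report instability.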
[0049] As a preferred embodiment, the step of fusing the physical determination result with the coupling features, inputting the result into a visual language model for joint reasoning, and outputting the action classification result includes: S405, converting the physical determination result into a logical string in a preset format. Considering that a large visual language model is essentially a generative architecture driven by natural language tokens, this embodiment constructs format conversion rules that threshold and discretize the continuous numerical state parameters output by the physics engine and convert them into a logical string in a preset format. For example, if the simulator calculates that the contact stress between the bounding box of the worker's hand and the bounding box of the high-voltage cable is greater than zero, the discretization condition is triggered and the preset logical string <CONTACT: TRUE> is generated. If, according to the aforementioned formula, the projected center of gravity of the system does not leave the safe area of the support polygon of the scaffolding platform, the logical string <POSTURE: SAFE> is generated.
[0050] S406, using the logical string as a prompt and concatenating it with the semantic tokens corresponding to the coupled features. This embodiment designs a prompt assembly mechanism based on strong physical logic constraints. First, a feature mapping layer (e.g., a multilayer perceptron structure) aligns the dimensions of the coupled features with the pre-trained text embedding space of the large visual language model, converting them into a continuous sequence of visual semantic tokens that the visual language model can parse natively. Specifically, the feature mapping layer uses a Q-Former module based on the Transformer architecture, which contains 32 learnable query tokens. Through cross-attention with the input coupled features, it extracts a fixed-length, semantically compressed visual feature sequence. A linear projection layer then maps this sequence to the text embedding dimension of the visual language model (e.g., 4096 dimensions), achieving alignment of the heterogeneous features. Subsequently, in the concatenation and reconstruction stage of the input sequence, the generated logical strings representing rigid physical facts, such as <CONTACT: TRUE>, are concatenated as an immutable "hard prompt" at the beginning of the visual semantic token sequence.
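Steps S405 and S406 can be illustrated with the sketch below. It is a schematic only: the function names are hypothetical, the string <CONTACT: FALSE> is an assumed counterpart to the strings given in the text, and the visual tokens are represented by placeholder strings rather than by embeddings produced by a Q-Former.

```python
from typing import List


def to_logic_strings(contact_stress: float, cog_inside_support: bool) -> List[str]:
    """Threshold the continuous simulator outputs into preset logical strings."""
    contact = "TRUE" if contact_stress > 0.0 else "FALSE"
    posture = "SAFE" if cog_inside_support else "UNSTABLE"
    return [f"<CONTACT: {contact}>", f"<POSTURE: {posture}>"]


def assemble_sequence(logic_strings: List[str], visual_tokens: List[str]) -> List[str]:
    """Place the hard prompt first, followed by the visual semantic tokens."""
    return list(logic_strings) + list(visual_tokens)


# Usage: 32 placeholder tokens stand in for the Q-Former query outputs.
placeholders = [f"<vis_{i}>" for i in range(32)]
sequence = assemble_sequence(to_logic_strings(12.5, False), placeholders)
```

In a real pipeline the logical strings would be tokenized and embedded by the language model's own embedding layer before being concatenated with the projected visual features.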
[0051] S407, inputting the concatenated sequence into the visual language model for prediction to obtain the action classification result. Finally, the complete sequence produced by multimodal physical-semantic concatenation and recombination is input into the visual language model for end-to-end processing. The large visual language model uses its autoregressive mechanism and multi-head self-attention modules to read and parse the sequence layer by layer. The relationship established in this step is as follows: because the logical string is at the beginning of the input sequence, the multi-head self-attention weight allocation within the visual language model preferentially assigns attention to the logical tokens representing objective physical facts when action labels are generated. This strong correlation constraint ensures that, even if the visual features present ambiguous interaction information, the model is constrained at inference time by the hard "veto power" of the physical determination, which suppresses action labels that violate physical common sense in the output probability distribution (for example, when the physical determination is non-contact, prediction of the "holding" action is blocked). This cuts off the path by which the model would make incorrect inferences based solely on visual co-occurrence probability, so that it outputs fine-grained action classification results with high logical reliability.
[0052] Figure 5 is a comparison of the classification confusion matrices of the traditional visual model and the physically constrained reasoning model of this invention in this embodiment. As can be seen from Figure 5, the confusion matrix of the traditional model on the left shows a large number of cross-category misjudgments among similar action categories that are highly prone to visual confusion, with a high density of erroneous responses in the off-diagonal region (error rate as high as 23%). In contrast, the confusion matrix on the right, obtained by applying the present invention, benefits from the hard boundary constraints that physical indicators such as collision penetration depth and center of gravity balance impose on the probability distribution of the large visual model. These constraints almost completely eliminate the aforementioned misjudgments among highly confusable actions, and the responses are concentrated on the main diagonal of the matrix. These results demonstrate the effectiveness of the joint inference mechanism of this invention in blocking AI "visual illusions" and in significantly improving the accuracy of classifying highly similar actions in complex scenes.
[0053] By executing the closed-loop steps of joint reasoning between the aforementioned physical simulation and the large model, this invention introduces a rigorous mechanical determination mechanism through a physics engine placed ahead of the large model. This addresses the core pain point of traditional large visual models, which, lacking constraints from real-world physical laws, are prone to "visual illusions" in which fictitious interactive actions are assumed merely because people and tools appear in the same scene. The mechanism provides deep learning with a real-world physical logic anchor at the level of the underlying architecture, significantly improving the accuracy and logical reliability of dangerous action recognition in complex construction scenarios.
[0054] The above formulas are all calculated on dimensionless numerical values. The formulas were obtained through software simulation on a large amount of collected data so as to approximate the real situation as closely as possible. The preset parameters in the formulas are set by those skilled in the art according to the actual situation.
[0055] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any other combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, in the form of a computer program product.
[0056] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0057] In addition, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module.
[0058] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0059] In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for classifying actions in complex scenes by coupling spatial semantics and pose features, characterized by comprising: obtaining a video frame sequence and a skeletal keypoint sequence, and extracting visual semantic features of the video frame sequence; combining the visual semantic features with the initial node features of a graph convolutional network constructed from the skeletal keypoint sequence, updating the adjacency matrix based on the semantic relevance of the initial node features, and performing graph convolution to obtain local pose features; constructing a multimodal hypergraph based on the spatial convergence trend between the kinematic attributes of the local pose features and the visual semantic features, and aggregating pose nodes and semantic nodes using the hyperedges in the hypergraph to obtain coupling features; and performing action reasoning based on the coupling features and outputting an action classification result.
2. The method according to claim 1, characterized in that combining the visual semantic features with the initial node features of the graph convolutional network constructed from the skeletal keypoint sequence includes: for the coordinates of each keypoint in the skeletal keypoint sequence, determining the corresponding local receptive field on the visual semantic features; performing spatial pooling on the features within the local receptive field to obtain a semantic vector; and concatenating the semantic vector with the initial features of the corresponding keypoint coordinates to obtain the initial node features of the graph convolutional network.
3. The method according to claim 1, characterized in that updating the adjacency matrix based on the semantic relevance of the initial node features includes: calculating the feature similarity between the semantic vectors carried by any two nodes in the graph convolutional network; and if the feature similarity is greater than a preset similarity threshold and the two nodes are not directly connected in the physical skeleton structure, establishing a semantic connection edge between the two nodes in the adjacency matrix.
4. The method according to claim 3, characterized in that, after the semantic connection edge is established between the two nodes in the adjacency matrix, the method further includes: obtaining the physical distance between the two nodes in three-dimensional space; and calculating the weight value of the semantic connection edge based on a weighted combination of the feature similarity and the physical distance.
5. The method according to claim 1, characterized in that constructing the multimodal hypergraph based on the spatial convergence trend between the kinematic attributes of the local pose features and the visual semantic features includes: mapping the local pose features to pose nodes in the hypergraph, and mapping the visual semantic features to semantic nodes in the hypergraph; calculating the motion trajectory vector of each pose node, and evaluating the convergence trend between the motion trajectory vector and each semantic node in three-dimensional space; and when the motion trajectory vector of at least one pose node and at least one semantic node exhibit the convergence trend, generating a hyperedge containing the pose node and the semantic node that exhibit the convergence trend.
6. The method according to claim 5, characterized in that evaluating the convergence trend between the motion trajectory vector and each of the semantic nodes in three-dimensional space includes: extracting the velocity vector and the acceleration vector of the pose node over consecutive time steps; calculating, based on the velocity vector and the acceleration vector, the rate of change of the spatial distance between the pose node and the target semantic node; and if the rate of change of the spatial distance indicates that the pose node is accelerating towards the target semantic node, and the expected contact time is less than a preset time threshold, determining that a convergence trend exists between the two.
7. The method according to claim 1, characterized in that aggregating pose nodes and semantic nodes using the hyperedges in the hypergraph to obtain the coupling features includes: aggregating the features of all pose nodes and semantic nodes connected to the same hyperedge onto that hyperedge to generate hyperedge features; updating the features of each node connected to the hyperedge using the hyperedge features; and after a preset number of iterations, concatenating the updated features of all nodes in the hypergraph to obtain the coupling features.
8. The method according to claim 1, characterized in that performing action reasoning based on the coupling features and outputting the action classification result includes: inputting the coupling features into a pre-configured physical simulator to obtain a physical determination result characterizing the physical interaction state between entities; and fusing the physical determination result with the coupling features, inputting the result into a visual language model for joint reasoning, and outputting the action classification result.
9. The method according to claim 8, characterized in that inputting the coupling features into the pre-configured physical simulator to obtain the physical determination result characterizing the physical interaction state between entities includes: decoding the coupling features into skeleton bounding box parameters and object bounding box parameters in three-dimensional space; and using the physical simulator to calculate at least one of the contact stress, the collision state, and the center of gravity balance state between the skeleton bounding box and the object bounding box to obtain the physical determination result.
10. The method according to claim 8, characterized in that fusing the physical determination result with the coupling features, inputting the result into the visual language model for joint reasoning, and outputting the action classification result includes: converting the physical determination result into a logical string in a preset format; using the logical string as a prompt and concatenating it with the semantic tokens corresponding to the coupling features; and inputting the concatenated sequence into the visual language model for prediction to obtain the action classification result.