Sign language recognition method and device, electronic equipment and storage medium

By extracting image features and performing semantic transformation in the sign language recognition model to generate semantic text, the problem of low flexibility and accuracy caused by the dependence on the dictionary in existing models is solved, and more efficient sign language text recognition is achieved.

CN115546895B · Active · Publication Date: 2026-06-16 · AGRICULTURAL BANK OF CHINA


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: AGRICULTURAL BANK OF CHINA
Filing Date: 2022-10-13
Publication Date: 2026-06-16



Abstract

The application discloses a sign language recognition method and device, electronic equipment, and a storage medium. The method comprises: obtaining a sign language video and extracting a plurality of video images from it; performing image feature extraction on the video images to obtain a feature map of each video image, and determining an action unit feature vector of each video image based on its feature map; performing semantic conversion on the action unit feature vector to obtain a semantic text of each video image; and determining a sign language recognition text of the sign language video according to the semantic texts of the plurality of video images. Because the technical scheme performs semantic conversion on the action unit feature vectors directly, it does not rely on a manually maintained sign language vocabulary and does not need to search a vocabulary for the sign language word that best matches the hand key point coordinates, which improves the flexibility and accuracy of the sign language recognition method.


Description

Technical Field

[0001] The embodiments of the present invention relate to the field of artificial intelligence, and in particular to a sign language recognition method, device, electronic device and storage medium.

Background Technology

[0002] Currently, when a sign language recognition model is used for sign language recognition, a human pose estimation model or a single-stage object detection model (e.g., YOLOv5) is used to obtain a feature map of each frame in the sign language video. Based on the feature map of each frame, the coordinates of the hand key points in that frame are determined. The sign language word that best matches the hand key point coordinates is then looked up in a sign language dictionary, and the matched words are fused to produce the recognized text output.

[0003] However, searching the sign language dictionary for the word that best matches the hand key point coordinates relies heavily on the dictionary, resulting in poor flexibility and low accuracy for the sign language recognition model.

Summary of the Invention

[0004] This invention provides a sign language recognition method, device, electronic device, and storage medium, which can improve the flexibility and accuracy of sign language recognition methods and solve the problem that existing sign language recognition models have poor flexibility and low accuracy due to their heavy reliance on sign language lexicons.

[0005] In a first aspect, embodiments of the present invention provide a sign language recognition method, the method comprising:

[0006] Acquire a sign language video and extract multiple frames of video images from the sign language video;

[0007] Image feature extraction is performed on the multiple video images to obtain a feature map of each video image, and the action unit feature vector of each video image is determined based on the feature map of each video image.

[0008] The semantic text of each video frame is obtained by semantically transforming the feature vector of the action unit.

[0009] The sign language recognition text of the sign language video is determined based on the semantic text of the multi-frame video images.

[0010] In a second aspect, embodiments of the present invention provide a sign language recognition device, the device comprising:

[0011] An image extraction module is used to acquire sign language videos and extract multiple frames of video images from the sign language videos;

[0012] The feature vector determination module is used to extract image features from the multi-frame video images to obtain a feature map of each frame video image, and to determine the action unit feature vector of each frame video image based on the feature map of each frame video image.

[0013] The semantic conversion module is used to perform semantic conversion on the feature vector of the action unit to obtain the semantic text of each frame of video image;

[0014] The text determination module is used to determine the sign language recognition text of the sign language video based on the semantic text of the multi-frame video images.
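The four modules listed above can be read as a simple composition of stages. The following is a minimal sketch of that structure, not the patented implementation: the class name `SignLanguageRecognizer`, its field names, and the choice to call the feature-vector stage once per frame are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class SignLanguageRecognizer:
    """Hypothetical wiring of the four modules; each stage is injected as a callable."""

    extract_images: Callable[[Any], Sequence[Any]]             # image extraction module
    determine_feature_vector: Callable[[Any], Sequence[float]]  # feature vector determination module
    convert_semantics: Callable[[Sequence[float]], str]         # semantic conversion module
    determine_text: Callable[[Sequence[str]], str]              # text determination module

    def recognize(self, video: Any) -> str:
        # Stage 1: split the video into frames.
        frames = self.extract_images(video)
        # Stage 2: one action unit feature vector per frame.
        vectors = [self.determine_feature_vector(frame) for frame in frames]
        # Stage 3: one semantic text per frame.
        texts = [self.convert_semantics(vector) for vector in vectors]
        # Stage 4: combine the per-frame texts into the recognition text.
        return self.determine_text(texts)
```

Injecting the stages as callables keeps the sketch independent of any particular network or preprocessing library; each callable can be replaced by a real model without changing `recognize`.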

[0015] In a third aspect, embodiments of the present invention also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the sign language recognition method as described in any of the embodiments of the present invention.

[0016] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the sign language recognition method as described in any of the embodiments of the present invention.

[0017] In this embodiment of the invention, a sign language video can be acquired, and multiple video frames can be extracted from the sign language video; image feature extraction can be performed on the multiple video frames to obtain a feature map of each video frame, and an action unit feature vector of each video frame can be determined based on the feature map of each video frame; the action unit feature vector can be semantically transformed to obtain the semantic text of each video frame; and the sign language recognition text of the sign language video can be determined based on the semantic text of the multiple video frames. In other words, the technical solution does not rely on a manually maintained sign language lexicon. Instead, based on the relationship between the action unit feature vectors corresponding to the feature maps of each frame and the semantic text, the action unit feature vector of each frame is semantically transformed to obtain the semantic text of that frame. This eliminates the need to search a lexicon for the sign language word that best matches the hand key point coordinates, thereby improving the flexibility and accuracy of the sign language recognition method. It solves the problem of poor flexibility and low accuracy in existing sign language recognition models due to their heavy reliance on sign language lexicons, and enables faster recognition of sign language text in sign language videos.

Brief Description of the Drawings

[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a schematic flowchart of the sign language recognition method provided in an embodiment of the present invention;

[0020] Figure 2 is a schematic diagram of the sign language recognition method provided in an embodiment of the present invention;

[0021] Figure 3 is another schematic diagram of the sign language recognition method provided in an embodiment of the present invention;

[0022] Figure 4 is a schematic diagram of a semantic converter provided in an embodiment of the present invention;

[0023] Figure 5 is a schematic diagram illustrating the semantic transformation of action unit feature vectors in the sign language recognition method provided in an embodiment of the present invention;

[0024] Figure 6 is a schematic diagram of generating a sign language recognition model provided in an embodiment of the present invention;

[0025] Figure 7 is another flowchart of the sign language recognition method provided in an embodiment of the present invention;

[0026] Figure 8 is a schematic diagram of the structure of the sign language recognition device provided in an embodiment of the present invention;

[0027] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0028] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, the accompanying drawings show only the parts relevant to the present invention, and not all of the structures.

[0029] Figure 1 is a flowchart of a sign language recognition method provided in an embodiment of the present invention. This method can be executed by the sign language recognition device provided in this embodiment, which can be implemented using software and/or hardware. In a specific embodiment, the device can be integrated into an electronic device, such as a computer or server. The following embodiments take the device integrated into an electronic device as an example. As shown in Figure 1, the method may specifically include the following steps:

[0030] Step 101: Acquire the sign language video and extract multiple video frames from it.

[0031] In one alternative implementation, the acquired sign language video can be segmented into frames to obtain multi-frame video images of the sign language video.

[0032] For example, Figure 2 is a schematic diagram of the sign language recognition method provided in an embodiment of the present invention. As shown in Figure 2, after the sign language video is acquired, the input sign language video is split into frames to obtain multi-frame video images.

[0033] Optionally, after extracting multiple video frames from the sign language video, the multiple video frames can be denoised and normalized to facilitate subsequent extraction of image features and ensure the availability of input data in the subsequent sign language recognition process. The input data may include feature maps of each video frame.

[0034] For example, as shown in Figure 2, noise reduction and normalization are performed on the multiple frames of video images.
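The preprocessing in Step 101 (frame extraction, denoising, normalization) can be sketched as follows. The patent does not name a sampling rate, a denoising filter, or a normalization range, so the every-`step`-th-frame sampling, the 3x3 mean filter, and the scaling to [0, 1] below are all illustrative assumptions, as are the function names. Frames are modeled as plain nested lists of grayscale intensities so the sketch needs no external library.

```python
from statistics import mean

Frame = list[list[float]]  # grayscale image: rows of pixel intensities


def sample_frames(frames: list[Frame], step: int) -> list[Frame]:
    """Keep every `step`-th frame, preserving acquisition order (assumed sampling rule)."""
    return frames[::step]


def denoise(frame: Frame) -> Frame:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    height, width = len(frame), len(frame[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                frame[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(mean(window))
        out.append(row)
    return out


def normalize(frame: Frame, max_value: float = 255.0) -> Frame:
    """Scale intensities to the range [0, 1], assuming 8-bit input."""
    return [[pixel / max_value for pixel in row] for row in frame]


def preprocess(frames: list[Frame], step: int = 2) -> list[Frame]:
    """Sample, denoise, then normalize, in that order."""
    return [normalize(denoise(frame)) for frame in sample_frames(frames, step)]
```

In practice a video library would supply the decoded frames; the point here is only the order of operations: sample, denoise, then normalize before feature extraction.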

[0035] Step 102: Extract image features from multiple video frames to obtain feature maps for each video frame, and determine the action unit feature vector for each video frame based on the feature maps of each video frame.

[0036] In this context, a feature map can be understood as an image that includes color features, texture features, shape features, and spatial relationship features. An action unit is a basic unit obtained by dividing the human body into the parts involved in production activities, primarily the worker's hands, fingers, forearms (or arms), torso, legs, feet, and head. In this embodiment, the action unit can be the hand in a sign language video. An action unit feature can be understood as a feature of such a basic unit, for example a hand feature. An action unit feature vector can be understood as the feature vector of the feature image of such a basic unit, for example the feature vector of the image of the hand key point positions.

[0037] In one alternative implementation, a convolutional neural network can be used to extract image features from each frame of video image to obtain a feature map for each frame. Then, based on the two-dimensional coordinate feature vector, three-dimensional coordinate feature vector, and three-dimensional reconstructed feature vector of each frame of video image, the action unit feature vector of each frame can be determined.

[0038] Among them, the two-dimensional coordinate feature vector can be understood as the coordinate feature vector of the feature map of the hand key point position in each frame of video image. The three-dimensional coordinate feature vector can be understood as the coordinate feature vector of the feature map of the hand key point position in each frame of video image in three-dimensional space; the three-dimensional reconstruction feature vector can be understood as the coordinate feature vector of the key point position on the surface of the three-dimensional reconstruction model of the hand obtained from the hand reconstruction in each frame of video image.

[0039] In one optional implementation, a two-dimensional coordinate feature vector corresponding to the hand key point position of each video frame can be obtained from the feature map of each video frame. The feature map of each video frame is then mapped to a feature vector, resulting in a feature vector for each video frame. Based on the feature vector and two-dimensional coordinate feature vector of each video frame, a three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame is determined. Based on the feature vector and three-dimensional coordinate feature vector of each video frame, a three-dimensional hand reconstruction model corresponding to the hand key point position of each video frame is reconstructed, and the three-dimensional reconstruction feature vector of each video frame is obtained from the three-dimensional hand reconstruction model. Finally, based on the two-dimensional coordinate feature vector, three-dimensional coordinate feature vector, and three-dimensional reconstruction feature vector of each video frame, the action unit feature vector of each video frame is determined.
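The data dependencies among these sub-steps can be made explicit with a short sketch. The function `action_unit_vector` and its five stage parameters are hypothetical names introduced here; each stage stands in for a network or model that the patent describes only at a high level.

```python
from typing import Callable

Vector = list[float]


def action_unit_vector(
    feature_map: Vector,
    estimate_2d: Callable[[Vector], Vector],
    to_feature_vector: Callable[[Vector], Vector],
    estimate_3d: Callable[[Vector, Vector], Vector],
    reconstruct: Callable[[Vector, Vector], Vector],
    fuse: Callable[[Vector, Vector, Vector], Vector],
) -> Vector:
    """Run the per-frame sub-steps in the order the text describes."""
    # 2D hand key point coordinates come straight from the feature map.
    kp2d = estimate_2d(feature_map)
    # The feature map is separately mapped to a feature vector.
    feat = to_feature_vector(feature_map)
    # 3D coordinates depend on both the feature vector and the 2D estimate.
    kp3d = estimate_3d(feat, kp2d)
    # The 3D reconstruction depends on the feature vector and the 3D estimate.
    recon = reconstruct(feat, kp3d)
    # All three coordinate vectors are fused into the action unit feature vector.
    return fuse(kp2d, kp3d, recon)
```

The sketch shows why the stages cannot be reordered: the 3D estimate consumes the 2D estimate, and the reconstruction consumes the 3D estimate.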

[0040] For example, as shown in Figure 2, a convolutional neural network can be used to extract image features from each frame of video image, obtaining a feature map for each frame. A two-dimensional coordinate feature vector is then estimated from the feature map of each frame. More specifically, as shown in Figure 3, the feature map of each video frame is input into a two-dimensional coordinate feature vector estimation network, thereby obtaining the two-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

[0041] Next, as shown in Figure 2, after each frame of video image has been denoised and normalized, feature vector mapping can be performed on the feature map of each frame to obtain the feature vector of each frame. Based on the feature vector and two-dimensional coordinate feature vector of each frame, the three-dimensional coordinate feature vector corresponding to the hand key point position of each frame is determined. As shown in Figure 3, the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame can be input into the three-dimensional coordinate feature vector estimation network, thereby obtaining the three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

[0042] After the three-dimensional coordinate feature vector of each video frame is obtained, as shown in Figure 2, the 3D reconstruction feature vector can be estimated based on the feature vector and the 3D coordinate feature vector of each video frame. More specifically, a 3D hand reconstruction model can be reconstructed based on the 3D coordinate feature vector and feature vector of each video frame, and the 3D reconstruction feature vector of each video frame can then be obtained from that model. For example, as shown in Figure 3, a parameterized 3D hand reconstruction model (e.g., the parameterized hand model MANO) is used. The 3D coordinate feature vector and feature vector of each frame of video image are input into the parameterized 3D hand reconstruction model to reconstruct the 3D hand, resulting in a 3D hand reconstruction model. This 3D hand reconstruction model can be a 3D mesh consisting of 778 vertices, from which a 778×3-dimensional 3D reconstruction feature vector of the 778 vertices can be obtained. This 3D reconstruction feature vector can be understood as the coordinate feature vector of the key point positions of the hand in the 3D hand reconstruction model.

[0043] Finally, as shown in Figure 3, the two-dimensional coordinate feature vector, three-dimensional coordinate feature vector, and three-dimensional reconstruction feature vector of each video frame can be fused to obtain the action unit feature vector of each video frame.
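As a small illustration of the 778-vertex reconstruction vector, the sketch below flattens a mesh of 778 (x, y, z) vertices into a single 2334-element vector. Row-major flattening and the name `flatten_mesh` are assumptions; the patent states only that a 778×3-dimensional vector is obtained and does not say how it is laid out.

```python
NUM_VERTICES = 778  # vertex count of the hand mesh stated in the text


def flatten_mesh(vertices: list[tuple[float, float, float]]) -> list[float]:
    """Flatten (x, y, z) vertices into one vector of length 778 * 3 (assumed layout)."""
    if len(vertices) != NUM_VERTICES:
        raise ValueError(f"expected {NUM_VERTICES} vertices, got {len(vertices)}")
    if any(len(vertex) != 3 for vertex in vertices):
        raise ValueError("every vertex must have exactly three coordinates")
    # Row-major order: x0, y0, z0, x1, y1, z1, ...
    return [coordinate for vertex in vertices for coordinate in vertex]
```

The explicit shape checks make a malformed mesh fail early instead of silently producing a vector of the wrong length for the later fusion step.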

[0044] Step 103: Semantically transform the action unit feature vector to obtain the semantic text of each video frame.

[0045] In one optional implementation, the action unit feature vector can be encoded and mapped to the hidden layer space to obtain the hidden layer space encoding information of the action unit feature vector. The hidden layer space encoding information is then decoded to the semantic text vector space to obtain the semantic text vector of the hidden layer space encoding information. The target text corresponding to the action unit feature vector is determined based on the semantic text vector, and the target text corresponding to the action unit feature vector is determined as the semantic text of each frame of video image.

[0046] Here, the hidden space can be understood as the space where the encoded action unit feature vectors are located; the hidden space encoding information can be understood as the encoding information of the action unit feature vectors in the hidden space; and the semantic text vector can be understood as the vector that expresses the semantics of the text.

[0047] In the semantic text vector space, each semantic text vector can have a corresponding semantic text. Therefore, the target text corresponding to the action unit feature vector can be determined based on the semantic text vector corresponding to the hidden space encoding information of the action unit feature vector.

[0048] For example, Figure 4 is a schematic diagram of a semantic converter provided in an embodiment of the present invention. As shown in Figure 4, the semantic converter includes an encoder, a decoder, and a CTC (Connectionist Temporal Classification) operator. The encoder encodes and maps the action unit feature vectors to the hidden layer space; the decoder decodes the hidden layer space encoding information to the semantic text vector space; and the CTC operator removes redundant text.

[0049] Figure 5 is a schematic diagram illustrating the semantic transformation of action unit feature vectors in the sign language recognition method provided in this embodiment of the invention. As shown in Figure 5, the action unit feature vector can be input into the semantic converter shown in Figure 4. The encoder maps the action unit feature vector to the hidden layer space to obtain the hidden layer space encoding information of the action unit feature vector; the decoder decodes the hidden layer space encoding information to the semantic text vector space to obtain the semantic text vector of the hidden layer space encoding information; the target text corresponding to the action unit feature vector is determined according to the semantic text vector, and the target text corresponding to the action unit feature vector is determined as the semantic text of each frame of video image.
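To make the encode, decode, and text-selection sequence concrete, here is a deliberately tiny sketch. It assumes a single linear layer with a tanh activation as the encoder, a single linear layer as the decoder, and an arg-max over a small output token list to pick the text. None of these choices comes from the patent, which does not specify the layer types or how text is read off the semantic text vector; the weight matrices and token list are hypothetical inputs.

```python
import math


def linear(weights: list[list[float]], vector: list[float]) -> list[float]:
    """Matrix-vector product; each row of `weights` yields one output component."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def encode(au_vector: list[float], enc_weights: list[list[float]]) -> list[float]:
    """Map the action unit feature vector into the hidden layer space (assumed tanh layer)."""
    return [math.tanh(value) for value in linear(enc_weights, au_vector)]


def decode(hidden: list[float], dec_weights: list[list[float]]) -> list[float]:
    """Map hidden layer encoding to a semantic text vector (one score per output token)."""
    return linear(dec_weights, hidden)


def to_text(semantic_vector: list[float], tokens: list[str]) -> str:
    """Pick the token with the highest score (assumed selection rule)."""
    best = max(range(len(tokens)), key=lambda index: semantic_vector[index])
    return tokens[best]
```

A trained model would replace the hand-written weights, and a sequence model would replace the single layers; the sketch only fixes the order of the three operations.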

[0050] Step 104: Determine the sign language recognition text of the sign language video based on the semantic text of multiple video images.

[0051] The sign language recognition text can include text composed of semantic text from multiple frames of video images in a sign language video.

[0052] In one optional implementation, the semantic text of multiple video frames can be stored sequentially in a preset document according to the acquisition order of each video frame. The semantic text of the multiple video frames stored in the preset document is then determined as the sign language recognition text of the sign language video. Finally, the sign language recognition text is output through the semantic converter shown in Figure 4.
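Storing per-frame texts in acquisition order and reading them back as one text can be sketched as below. The `(frame_index, text)` pair representation, the function name `assemble_text`, and the skipping of empty texts are assumptions for illustration; the "preset document" is modeled as a plain string.

```python
def assemble_text(frame_texts: list[tuple[int, str]], separator: str = "") -> str:
    """Join per-frame semantic texts in acquisition order, skipping empty entries."""
    # Sort by frame index so the result follows the order in which frames were acquired.
    ordered = sorted(frame_texts, key=lambda item: item[0])
    return separator.join(text for _, text in ordered if text)
```

For example, `assemble_text([(2, "you"), (0, "thank"), (1, "")], separator=" ")` returns `"thank you"`.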

[0053] In this embodiment, no manually maintained sign language lexicon is required. Based on the relationship between the action unit feature vectors (derived from the feature map of each video frame) and the semantic text, the action unit feature vector of each frame is semantically transformed to obtain the semantic text of that frame, and the sign language recognition text of the sign language video is then determined from the semantic texts of the multiple frames. This eliminates the need to search the lexicon for the sign language word that best matches the hand key point coordinates, thereby improving the flexibility and accuracy of the sign language recognition method. It solves the problem that existing sign language recognition models have poor flexibility and low accuracy due to their heavy reliance on sign language lexicons, and allows the sign language text in sign language videos to be recognized more quickly.

[0054] The sign language recognition method of this embodiment can also be used to generate a sign language recognition model. Specifically, referring to Figure 6, the original sign language video is input into a video preprocessing unit to obtain multiple video frames. These frames are then input into an action unit feature vector determination unit to determine the action unit feature vector for each frame. The action unit feature vector of each frame is then input into a semantic converter for semantic conversion to obtain the sign language text of the original sign language video. Finally, the sign language text of the original sign language video is input into a deep learning trainer for learning and training, ultimately resulting in a sign language recognition model. The sign language recognition model generated using the method provided in this embodiment simplifies the model structure, improves the accuracy and efficiency of the sign language recognition model, and solves the problem of low recognition efficiency caused by the complex model structure of existing sign language recognition models.

[0055] The sign language recognition method provided by the embodiments of the present invention is further described below. Figure 7 is another flowchart of the sign language recognition method provided in this embodiment of the invention. As shown in Figure 7, the method may specifically include the following steps:

[0056] Step 201: Acquire the sign language video and extract multiple video frames from the sign language video.

[0057] Step 202: Extract image features from multiple video frames to obtain feature maps for each video frame.

[0058] Step 203: Obtain the two-dimensional coordinate feature vector corresponding to the hand key point position in each video frame from the feature map of each video frame.

[0059] For example, a two-dimensional coordinate feature vector estimation network based on a residual network module (e.g., a 2D pose estimation network) can be used: the feature map of each video frame is input into the two-dimensional coordinate feature vector estimation network, which estimates the two-dimensional coordinate feature vector of that video frame.

[0060] Step 204: Map the feature map of each video frame to a feature vector to obtain the feature vector of each video frame.

[0061] For example, a convolutional network based on a residual network module can be used to map the feature map of each video image to a feature vector to obtain the feature vector of each video image. Since the convolutional network based on the residual network module has a large depth, and the greater the depth of the convolutional network, the higher the accuracy, using a convolutional network based on a residual network module to map the feature map of each video image to a feature vector can obtain the feature vector of each video image more accurately.

[0062] Step 205: Based on the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame, determine the three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

[0063] In one optional implementation, the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame can be input into a three-dimensional coordinate feature vector estimation network to estimate the three-dimensional coordinate feature vector, thereby obtaining the three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

[0064] For example, a three-dimensional coordinate feature vector estimation network (e.g., a gesture graph convolutional neural network based on graph convolution, graph pooling, and inverse graph pooling operations) can be used to input the two-dimensional coordinate feature vector of each frame of video image and the feature vector of each frame of video image into the three-dimensional coordinate feature vector estimation network to estimate the three-dimensional coordinate feature vector of each frame of video image, thereby obtaining the three-dimensional coordinate feature vector of each frame of video image.

[0065] The formula for the graph convolution operation in this embodiment is as follows:

[0066]

[0067] Here, $H^{l+1}$ can represent the output of the current graph convolutional layer, $H^{l}$ can represent the input of the current graph convolutional layer, $A$ can represent the adjacency matrix of the graph, $D$ can represent the degree matrix of the graph, and $W$ can represent the parameter matrix of the graph convolutional layer.
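The formula itself is not present in this text. For orientation only, a widely used graph convolution rule that involves exactly these quantities (layer input and output, adjacency matrix, degree matrix, parameter matrix) is the symmetric-normalized propagation rule shown below. Whether the patent uses this exact form, including the added self-loops and the choice of activation $\sigma$, is an assumption.

```latex
H^{l+1} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \, \tilde{A} \, \tilde{D}^{-\frac{1}{2}} \, H^{l} \, W^{l} \right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
```

Here $I$ is the identity matrix (self-loops), $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $\sigma$ is a nonlinearity such as ReLU.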

[0068] In this embodiment, a graph convolutional network is used to estimate the three-dimensional coordinate feature vector corresponding to the hand key point position in each frame of video image. Compared with using a convolutional neural network to estimate the three-dimensional coordinate feature vector corresponding to the hand key point position in each frame of video image, the three-dimensional coordinate feature vector can be estimated more efficiently and accurately, and the time for estimating the three-dimensional coordinate feature vector can be shortened.
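A single graph convolution layer over hand key points can be sketched in plain Python. The sketch assumes the widely used symmetric-normalized rule (self-loops added to the adjacency matrix, normalization by the inverse square root of the degrees, ReLU activation); the patent does not confirm these details, and the function names are illustrative. Nodes stand for hand key points and edges for the connections between them.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adjacency: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 H W), with D the degrees of A + I."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degree of each node in the self-looped graph, as an inverse square root.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Symmetric normalization of the adjacency matrix.
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)] for i in range(n)]
    # Aggregate neighbour features, then apply the layer's parameter matrix.
    out = matmul(matmul(norm, features), weights)
    # ReLU activation.
    return [[max(0.0, value) for value in row] for row in out]
```

With two connected nodes carrying features 1.0 and 3.0 and an identity weight, each node's output is the average 2.0, which shows how the layer mixes information between neighbouring key points.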

[0069] Step 206: Based on the feature vector of each video frame and the three-dimensional coordinate feature vector of each video frame, reconstruct a three-dimensional hand reconstruction model corresponding to the hand key point position of each video frame, and obtain the three-dimensional reconstruction feature vector of each video frame from the three-dimensional hand reconstruction model.

[0070] Step 207: Determine the action unit feature vector of each video frame based on the two-dimensional coordinate feature vector, three-dimensional coordinate feature vector, and three-dimensional reconstruction feature vector of each video frame.

[0071] In one optional implementation, dimensionality reduction can be performed on the two-dimensional coordinate feature vector, three-dimensional coordinate feature vector, and three-dimensional reconstruction feature vector of each video frame to obtain the first coordinate feature vector, second coordinate feature vector, and third coordinate feature vector of each video frame. Then, the first coordinate feature vector, second coordinate feature vector, and third coordinate feature vector of each video frame are concatenated to obtain the action unit feature vector of each video frame. This fuses the feature vectors of the hand key point positions in each video frame across different dimensions, resulting in a richer action unit feature vector for the hand key point positions in each video frame.
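The reduce-then-concatenate fusion can be sketched as follows. The patent does not say how dimensionality reduction is performed; a linear projection per input vector is assumed here, and the projection matrices `p2d`, `p3d`, and `prec` are hypothetical parameters.

```python
def project(vector: list[float], matrix: list[list[float]]) -> list[float]:
    """Linear projection: each matrix row produces one output component."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("projection matrix width must match vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(
    kp2d: list[float],
    kp3d: list[float],
    recon: list[float],
    p2d: list[list[float]],
    p3d: list[list[float]],
    prec: list[list[float]],
) -> list[float]:
    """Reduce each coordinate vector, then concatenate in a fixed order."""
    first = project(kp2d, p2d)      # first coordinate feature vector
    second = project(kp3d, p3d)     # second coordinate feature vector
    third = project(recon, prec)    # third coordinate feature vector
    return first + second + third   # action unit feature vector
```

Reducing each vector before concatenation keeps the much longer reconstruction vector from dominating the fused representation; the output length is simply the sum of the three reduced lengths.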

[0072] Step 208: Map the action unit feature vector encoding to the hidden layer space to obtain the hidden layer space encoding information of the action unit feature vector.

[0073] Step 209: Decode the hidden space encoded information into the semantic text vector space to obtain the semantic text vector of the hidden space encoded information.

[0074] Step 210: Determine the target text corresponding to the action unit feature vector based on the semantic text vector.

[0075] In one optional implementation, the initial text corresponding to the action unit feature vector is determined based on the semantic text vector; it is determined whether there is redundant text in the initial text corresponding to the action unit feature vector; when there is redundant text in the initial text corresponding to the action unit feature vector, the redundant text is deleted from the initial text corresponding to the action unit feature vector to obtain the target text corresponding to the action unit feature vector. This can avoid duplicate target text and reduce the duplication rate of target text.

[0076] For example, as shown in Figure 5, the initial text corresponding to the action unit feature vector is determined based on the semantic text vector. The CTC operator of the semantic converter shown in Figure 4 then determines whether there is redundant text in the initial text corresponding to the action unit feature vector. If there is, the CTC operator removes the redundant text from the initial text to obtain the target text corresponding to the action unit feature vector.
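The redundancy removal attributed to the CTC operator matches the standard CTC collapse rule used at decoding time: merge consecutive repeated labels, then drop blank labels. The sketch below implements that rule; the `"<blank>"` symbol and the function name `ctc_collapse` are assumptions, and the patent does not state that its operator follows exactly this rule.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed blank label separating genuine repeats


def ctc_collapse(tokens: list[str], blank: str = BLANK) -> list[str]:
    """Merge consecutive duplicates, then remove blank labels."""
    # groupby yields one key per run of identical consecutive tokens.
    merged = [token for token, _ in groupby(tokens)]
    return [token for token in merged if token != blank]
```

For example, `ctc_collapse(["hello", "hello", "<blank>", "hello", "world", "world", "<blank>"])` returns `["hello", "hello", "world"]`: the repeated frames of one sign collapse to a single word, while the blank preserves a genuine repetition of "hello".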

[0077] In this embodiment, the action unit feature vector is encoded and mapped to the hidden layer space to obtain the hidden layer space encoding information of the action unit feature vector. The hidden layer space encoding information is then decoded to the semantic text vector space to obtain the semantic text vector of the hidden layer space encoding information. This allows for a more accurate determination of the semantic text vector corresponding to the hidden layer space encoding information of the action unit feature vector. Furthermore, the target text corresponding to the action unit feature vector can be accurately determined based on the semantic text vector, eliminating the need to search for the sign language word that best matches the hand key point coordinates in the dictionary. This improves the flexibility and accuracy of the sign language recognition method.

[0078] Step 211: Determine the target text corresponding to the action unit feature vector as the semantic text of each frame of video image.

[0079] In this embodiment, no manually maintained sign language lexicon is required. Based on the relationship between the action unit feature vector corresponding to the feature map of each frame of video image and the semantic text, semantic transformation is performed on the action unit feature vector of each frame to obtain the semantic text of that frame. The sign language recognition text of the sign language video is then determined based on the semantic text of the multiple frames of video images. This eliminates the need to search the lexicon for the sign language word that best matches the coordinates of the hand key points, thereby improving the flexibility and accuracy of the sign language recognition method. It solves the problem that existing sign language recognition models have poor flexibility and low accuracy due to their heavy reliance on sign language lexicons, and allows the sign language text in sign language videos to be recognized more quickly.

[0080] Figure 8 is a schematic diagram of a sign language recognition device provided in an embodiment of the present invention. This device is suitable for executing the sign language recognition method provided in an embodiment of the present invention. As shown in Figure 8, the device may specifically include:

[0081] Image extraction module 401 is used to acquire sign language video and extract multiple frames of video images from the sign language video;

[0082] The feature vector determination module 402 is used to extract image features from the multi-frame video images to obtain a feature map of each frame of video image, and to determine the action unit feature vector of each frame of video image based on the feature map of each frame of video image;

[0083] The semantic conversion module 403 is used to perform semantic conversion on the feature vector of the action unit to obtain the semantic text of each frame of video image;

[0084] The text determination module 404 is used to determine the sign language recognition text of the sign language video based on the semantic text of the multi-frame video images.
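The way the four modules hand data to one another can be sketched as a simple composition. This is an assumption-laden illustration rather than the device itself: the class name `SignLanguageRecognizer`, the field names, and the stub functions are hypothetical, and the stubs stand in for the real frame extraction, feature networks, and semantic converter.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = Sequence[float]  # placeholder for a decoded video frame


@dataclass
class SignLanguageRecognizer:
    extract_frames: Callable[[object], List[Frame]]   # role of module 401
    feature_vector: Callable[[Frame], List[float]]    # role of module 402
    semantic_text: Callable[[List[float]], str]       # role of module 403
    merge_texts: Callable[[List[str]], str]           # role of module 404

    def recognize(self, video) -> str:
        frames = self.extract_frames(video)
        vectors = [self.feature_vector(frame) for frame in frames]
        texts = [self.semantic_text(vector) for vector in vectors]
        return self.merge_texts(texts)


def merge_without_repeats(texts):
    """Join per-frame texts, skipping consecutive repeats (an assumed merge rule)."""
    merged = []
    for text in texts:
        if not merged or merged[-1] != text:
            merged.append(text)
    return " ".join(merged)


# Stub stages so the sketch runs end to end; real stages would be trained models.
recognizer = SignLanguageRecognizer(
    extract_frames=lambda video: list(video),
    feature_vector=lambda frame: list(frame),
    semantic_text=lambda vector: "hello" if vector[0] > 0.5 else "world",
    merge_texts=merge_without_repeats,
)
print(recognizer.recognize([[0.9], [0.8], [0.1]]))  # hello world
```

Keeping each stage behind a plain callable mirrors the later remark that the division into functional modules is only an example and can be rearranged as needed.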

[0085] Optionally, the feature vector determination module 402 determines the action unit feature vector of each video frame based on the feature map of each video frame, including:

[0086] Obtain a two-dimensional coordinate feature vector corresponding to the hand key point position in each video frame from the feature map of each video frame;

[0087] The feature map of each video frame is mapped to a feature vector to obtain the feature vector of each video frame.

[0088] Based on the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame, a three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame is determined.

[0089] Based on the feature vector of each video frame and the three-dimensional coordinate feature vector of each video frame, a three-dimensional hand reconstruction model corresponding to the hand key point position of each video frame is reconstructed, and the three-dimensional reconstruction feature vector of each video frame is obtained from the three-dimensional hand reconstruction model.

[0090] Based on the two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstruction feature vector of each video frame, the action unit feature vector of each video frame is determined.

[0091] Optionally, the feature vector determination module 402 obtains a two-dimensional coordinate feature vector corresponding to the hand key point position in each frame of the video image from the feature map of each frame, including:

[0092] The feature map of each video frame is input into a two-dimensional coordinate feature vector estimation network to estimate the two-dimensional coordinate feature vector, thereby obtaining the two-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

[0093] Optionally, the feature vector determination module 402 determines a three-dimensional coordinate feature vector corresponding to the hand key point position in each video frame based on the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame, including:

[0094] The feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame are input into a three-dimensional coordinate feature vector estimation network to estimate the three-dimensional coordinate feature vector, thereby obtaining the three-dimensional coordinate feature vector corresponding to the position of the hand key point in each video frame.

[0095] Optionally, the feature vector determination module 402 determines the action unit feature vector of each video frame based on the two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstructed feature vector of each video frame, including:

[0096] The two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstructed feature vector of each video frame are subjected to dimensionality reduction processing to obtain the first coordinate feature vector, the second coordinate feature vector, and the third coordinate feature vector of each video frame.

[0097] The first coordinate feature vector, the second coordinate feature vector, and the third coordinate feature vector of each frame of video image are concatenated to obtain the action unit feature vector of each frame of video image.
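The two steps above (reduce each of the three vectors, then join them) can be sketched as follows. The reduction method is an assumption: average pooling over equal-sized chunks is used here only as a stand-in, since this passage does not say which dimensionality reduction is applied. The function names and sample values are hypothetical.

```python
def reduce_dimension(vector, target_size):
    """Average-pool a vector down to target_size entries (assumed stand-in)."""
    if len(vector) % target_size != 0:
        raise ValueError("vector length must be a multiple of target_size")
    chunk = len(vector) // target_size
    return [sum(vector[i:i + chunk]) / chunk for i in range(0, len(vector), chunk)]


def action_unit_vector(coords_2d, coords_3d, recon_3d, target_size=2):
    """Reduce the three per-frame vectors and concatenate them in order."""
    first = reduce_dimension(coords_2d, target_size)   # first coordinate feature vector
    second = reduce_dimension(coords_3d, target_size)  # second coordinate feature vector
    third = reduce_dimension(recon_3d, target_size)    # third coordinate feature vector
    return first + second + third


result = action_unit_vector(
    coords_2d=[1.0, 3.0, 5.0, 7.0],
    coords_3d=[0.0, 2.0, 4.0, 2.0, 6.0, 4.0],
    recon_3d=[1.0, 1.0, 3.0, 3.0],
)
print(result)  # [2.0, 6.0, 2.0, 4.0, 1.0, 3.0]
```

Reducing all three inputs to the same length before joining them keeps any single source (2D coordinates, 3D coordinates, or the reconstruction) from dominating the combined vector purely because of its size.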

[0098] Optionally, the semantic conversion module 403 is specifically used for:

[0099] The action unit feature vector is encoded and mapped to the hidden layer space to obtain the hidden layer space encoding information of the action unit feature vector;

[0100] Decode the hidden layer space encoded information into a semantic text vector space to obtain the semantic text vector of the hidden layer space encoded information;

[0101] The target text corresponding to the feature vector of the action unit is determined based on the semantic text vector;

[0102] The target text corresponding to the feature vector of the action unit is determined as the semantic text of each frame of video image.

[0103] Optionally, the semantic conversion module 403 determines the target text corresponding to the action unit feature vector based on the semantic text vector, including:

[0104] The initial text corresponding to the feature vector of the action unit is determined based on the semantic text vector;

[0105] Determine whether there is redundant text in the initial text corresponding to the feature vector of the action unit;

[0106] If there is redundant text in the initial text corresponding to the action unit feature vector, the redundant text is deleted from the initial text corresponding to the action unit feature vector to obtain the target text corresponding to the action unit feature vector.

[0107] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional modules is merely an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. The specific working process of the functional modules described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0108] The device in this embodiment does not rely on a manually maintained sign language lexicon. Based on the relationship between the action unit feature vectors corresponding to the feature maps of each frame of the sign language video and the semantic text, it performs semantic transformation on the action unit feature vectors of each frame of the video to obtain the semantic text of each frame. Then, based on the semantic text of multiple frames of video images, it determines the sign language recognition text of the sign language video. It does not require searching for the sign language word that best matches the coordinates of the hand key points in the lexicon, thereby improving the flexibility and accuracy of the sign language recognition method. It solves the problem that the existing sign language recognition models have poor flexibility and low accuracy due to their heavy reliance on the sign language lexicon, and can recognize the sign language text in sign language videos more quickly.

[0109] This invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the sign language recognition method provided in any of the above embodiments.

[0110] This invention also provides a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the sign language recognition method provided in any of the above embodiments.

[0111] Referring now to Figure 9, it shows a schematic diagram of the structure of a computer system 500 suitable for implementing an electronic device according to embodiments of the present invention. The electronic device shown in Figure 9 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.

[0112] As shown in Figure 9, the computer system 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 502 or programs loaded from storage section 508 into random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the computer system 500. The CPU 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

[0113] The following components are connected to the I/O interface 505: an input section 506 including a keyboard, mouse, etc.; an output section 507 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 508 including a hard disk, etc.; and a communication section 509 including a network interface card such as a LAN card, modem, etc. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 510 as needed so that computer programs read from it can be installed into the storage section 508 as needed. The computer system 500 also includes a graphics processing unit (GPU), not shown in Figure 9, which can be used for parallel computing, accelerated processing, and other image processing tasks.

[0114] In particular, according to the embodiments disclosed in this invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 509, and/or installed from removable medium 511. When the computer program is executed by central processing unit (CPU) 501, it performs the functions defined above in the system of this invention.

[0115] It should be noted that the computer-readable medium shown in this invention can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this invention, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media can also be any computer-readable medium other than computer-readable storage media, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.

[0116] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0117] The modules and/or units described in the embodiments of the present invention can be implemented in software or hardware. The described modules and/or units can also be housed in a processor; for example, a processor can be described as including an image extraction module, a feature vector determination module, a semantic conversion module, and a text determination module. The names of these modules do not necessarily constitute a limitation on the modules themselves.

[0118] In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device. The computer-readable medium carries one or more programs, which, when executed by the device, cause the device to:

[0119] Acquire a sign language video and extract multiple video frames from it; extract image features from the multiple video frames to obtain feature maps for each frame, and determine the action unit feature vector for each frame based on the feature maps; perform semantic transformation on the action unit feature vectors to obtain the semantic text for each frame; determine the sign language recognition text of the sign language video based on the semantic text of the multiple video frames.

[0120] According to the technical solution of this embodiment, without relying on a manually maintained sign language lexicon, the semantic text of each frame of the sign language video is obtained by semantically transforming the action unit feature vector corresponding to the feature map of each frame of the video image based on the relationship between the action unit feature vector and the semantic text. Then, the sign language recognition text of the sign language video is determined based on the semantic text of multiple frames of video images. There is no need to search for the sign language word that best matches the coordinates of the hand key points in the lexicon, thereby improving the flexibility and accuracy of the sign language recognition method. This solves the problem that the existing sign language recognition model has poor flexibility and low accuracy due to its heavy reliance on the sign language lexicon, and can recognize the sign language text in the sign language video more quickly.

[0121] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A sign language recognition method, characterized in that the method includes: Acquire a sign language video and extract multiple frames of video images from the sign language video; Image feature extraction is performed on the multiple video images to obtain a feature map of each video image, and the action unit feature vector of each video image is determined based on the feature map of each video image. The semantic text of each video frame is obtained by semantically transforming the feature vector of the action unit. The sign language recognition text of the sign language video is determined based on the semantic text of the multi-frame video images; The step of determining the action unit feature vector of each video frame based on the feature map of each video frame includes: Obtain a two-dimensional coordinate feature vector corresponding to the hand key point position in each video frame from the feature map of each video frame; The feature map of each video frame is mapped to a feature vector to obtain the feature vector of each video frame. Based on the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame, a three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame is determined. Based on the feature vector of each video frame and the three-dimensional coordinate feature vector of each video frame, a three-dimensional hand reconstruction model corresponding to the hand key point position of each video frame is reconstructed, and the three-dimensional reconstruction feature vector of each video frame is obtained from the three-dimensional hand reconstruction model. Based on the two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstruction feature vector of each video frame, the action unit feature vector of each video frame is determined.

2. The method according to claim 1, characterized in that, The step of obtaining a two-dimensional coordinate feature vector corresponding to the hand key point position in each frame of the video image from the feature map includes: The feature map of each video frame is input into a two-dimensional coordinate feature vector estimation network to estimate the two-dimensional coordinate feature vector, thereby obtaining the two-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

3. The method according to claim 1, characterized in that, The step of determining the three-dimensional coordinate feature vector corresponding to the hand key point position in each video frame based on the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame includes: The feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame are input into a three-dimensional coordinate feature vector estimation network to estimate the three-dimensional coordinate feature vector, thereby obtaining the three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame.

4. The method according to claim 1, characterized in that the step of determining the action unit feature vector of each video frame based on the two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstructed feature vector of each video frame includes: The two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstructed feature vector of each video frame are subjected to dimensionality reduction processing to obtain the first coordinate feature vector, the second coordinate feature vector, and the third coordinate feature vector of each video frame. The first coordinate feature vector, the second coordinate feature vector, and the third coordinate feature vector of each frame of video image are concatenated to obtain the action unit feature vector of each frame of video image.

5. The method according to claim 1, characterized in that, The step of semantically transforming the feature vector of the action unit to obtain the semantic text of each frame of video image includes: The action unit feature vector is encoded and mapped to the hidden layer space to obtain the hidden layer space encoding information of the action unit feature vector; Decode the hidden layer space encoded information into a semantic text vector space to obtain the semantic text vector of the hidden layer space encoded information; The target text corresponding to the feature vector of the action unit is determined based on the semantic text vector; The target text corresponding to the feature vector of the action unit is determined as the semantic text of each frame of video image.

6. The method according to claim 5, characterized in that, Determining the target text corresponding to the action unit feature vector based on the semantic text vector includes: The initial text corresponding to the feature vector of the action unit is determined based on the semantic text vector; Determine whether there is redundant text in the initial text corresponding to the feature vector of the action unit; If there is redundant text in the initial text corresponding to the action unit feature vector, the redundant text is deleted from the initial text corresponding to the action unit feature vector to obtain the target text corresponding to the action unit feature vector.

7. A sign language recognition device, characterized in that the device includes: An image extraction module is used to acquire sign language videos and extract multiple frames of video images from the sign language videos; The feature vector determination module is used to extract image features from the multi-frame video images to obtain a feature map of each frame video image, and to determine the action unit feature vector of each frame video image based on the feature map of each frame video image. The semantic conversion module is used to perform semantic conversion on the feature vector of the action unit to obtain the semantic text of each frame of video image; The text determination module is used to determine the sign language recognition text of the sign language video based on the semantic text of the multi-frame video images; The feature vector determination module determines the action unit feature vector of each video frame based on the feature map of each video frame, including: Obtain a two-dimensional coordinate feature vector corresponding to the hand key point position in each video frame from the feature map of each video frame; The feature map of each video frame is mapped to a feature vector to obtain the feature vector of each video frame. Based on the feature vector of each video frame and the two-dimensional coordinate feature vector of each video frame, a three-dimensional coordinate feature vector corresponding to the hand key point position of each video frame is determined. Based on the feature vector of each video frame and the three-dimensional coordinate feature vector of each video frame, a three-dimensional hand reconstruction model corresponding to the hand key point position of each video frame is reconstructed, and the three-dimensional reconstruction feature vector of each video frame is obtained from the three-dimensional hand reconstruction model. Based on the two-dimensional coordinate feature vector, the three-dimensional coordinate feature vector, and the three-dimensional reconstruction feature vector of each video frame, the action unit feature vector of each video frame is determined.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the sign language recognition method as described in any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by the processor, it implements the sign language recognition method as described in any one of claims 1 to 6.