Conversion device and program

The conversion device and program enhance sign language recognition accuracy by training the encoder using both positive and negative examples, leveraging a triplet loss function to improve the neural network's learning process.

JP7876362B2 · Active · Publication Date: 2026-06-19 · NIPPON HOSO KYOKAI

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: NIPPON HOSO KYOKAI
Filing Date: 2022-07-13
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Conventional sign language recognition technologies fail to effectively train the encoder using negative examples, limiting the accuracy of the neural network in recognizing sign language videos.

Method used

A conversion device and program in which a first encoder (video input) and a second encoder (word sequence input) are trained on both positive examples and pseudo-negative examples generated from the correct word sequences, using a triplet loss function.

Benefits of technology

Enhances the accuracy of sign language recognition by training the encoder to differentiate between correct and incorrect word sequences, improving the neural network's ability to recognize sign language videos.

✦ Generated by Eureka AI based on patent content.


Abstract

PROBLEM TO BE SOLVED: To enable machine learning that also uses negative examples (incorrect word strings) in a conversion device that converts video into a word string.

SOLUTION: A first encoder unit obtains a state vector by performing calculation with a neural network based on video feature quantities, and outputs the state vector. A second encoder unit obtains a state vector expressing semantic information by performing calculation with a neural network based on a given word string, and outputs the state vector. A learning data supply unit supplies, for learning, pairs of a video and the correct word string corresponding to the video. A negative example data generation unit generates an incorrect word string based on the correct word string supplied by the learning data supply unit. A loss calculation unit calculates a correct error and an incorrect error. A control unit performs control so that at least the neural network of the first encoder unit is trained by error backpropagation based on both the correct error and the incorrect error.

SELECTED DRAWING: Figure 1

Description

[Technical Field]

[0001] This invention relates to a conversion device and a program.

[Background Art]

[0002] Research is underway on technology that automatically recognizes what is being signed from video recordings of sign language. Such technology is expected to assist communication between people with hearing impairments and hearing people.

[0003] Patent Document 1 describes a configuration for automatic recognition of sign language videos. The device described in Patent Document 1 generates a sequence of symbols based on the video. The generated sequence of symbols is a sequence of sign language words. The technology described in Patent Document 1 uses a neural network to perform automatic recognition of sign language videos. Here, a method is introduced to shorten the path length of backpropagation in order to improve the learning efficiency of the neural network. Specifically, it is as follows.

[0004] A configuration consisting of an encoder and a decoder connected in series is used as a mechanism for automatically recognizing sign language videos. Both the encoder and decoder are implemented using neural networks. The encoder takes the feature quantities of the input video (sign language video) as input and outputs a state vector. The decoder takes the state vector output from the encoder as input and outputs an estimated word sequence (sign language word sequence, symbol sequence, label sequence). When training is performed in this configuration, based on the training data (pairs of input video and correct word sequences), backpropagation is performed in the decoder and encoder paths based on the error (loss) between the estimated word sequence obtained from the input video and the correct word sequence. This optimizes the internal parameters of the neural networks in both the encoder and decoder.

[0005] The technology described in Patent Document 1 introduces a different learning mechanism in addition to the learning using the above configuration (a configuration in which an encoder and a decoder are connected in series). The technology described in Patent Document 1 provides a second encoder in addition to the above encoder (an encoder that outputs a state vector based on the features of the input video; for convenience, this will be called the "first encoder"). The second encoder is equipped with a neural network that calculates a state vector based on a sequence of correct words (a sequence of sign language words, a sequence of symbols, a sequence of labels). The error (loss) between the state vector output from the first encoder and the state vector output from the second encoder is calculated, and backpropagation is performed based on that error to learn both the first and second encoders.

[0006] In other words, the technology described in Patent Document 1 performs learning based on the error between the output from the first encoder and the output from the second encoder, in addition to learning in the basic configuration (a configuration in which an encoder and a decoder are connected in series). That is, the technology described in Patent Document 1 introduces error backpropagation over a shorter path. By propagating errors over this shorter path, the technology described in Patent Document 1 enables encoder learning in which the effects of vanishing gradients are suppressed.

[Prior Art Documents]

[Patent Documents]

[0007] [Patent Document 1] Japanese Patent Publication No. 2021-099713

[Summary of the Invention]

[Problems to be Solved by the Invention]

[0008] However, the conventional technology has the following problems. Specifically, in the device shown in Patent Document 1, the input to the second encoder (an encoder additionally provided in the method of Patent Document 1; referred to as the "second encoder unit 60" in the embodiment of Patent Document 1) is always only a sequence of words of positive examples (correct data). For this reason, the technology of Patent Document 1 has the problem that it is not possible to train the first encoder (an encoder that takes video features as input and outputs a state vector; referred to as the "encoder unit 20" in the embodiment of Patent Document 1) using negative examples (incorrect data).

[0009] This invention was made in recognition of the above problem, and aims to provide a conversion device and program that build on the configuration described in Patent Document 1 and also enable learning using negative examples (sequences of incorrect words).

[Means for Solving the Problems]

[0010] [1] To solve the above problems, a conversion device according to an aspect of the present invention includes: a first encoder unit that obtains and outputs a state vector representing semantic information by performing calculations by a neural network based on video feature amounts; a decoder unit that estimates and outputs a word sequence by performing calculations by a neural network based on the state vector output by the first encoder unit; a second encoder unit that obtains and outputs a state vector representing semantic information by performing calculations by a neural network based on a given word sequence; a learning data supply unit that supplies, for learning, pairs of videos and correct word sequences corresponding to the videos; a negative example data generation unit that generates an incorrect word sequence based on the correct word sequence supplied by the learning data supply unit; a loss calculation unit that obtains a correct error, which is an error between a first state vector output by the first encoder unit based on the video feature amounts of the videos supplied by the learning data supply unit and a second state vector output by the second encoder unit based on the correct word sequences supplied by the learning data supply unit, and obtains an incorrect error, which is an error between the first state vector and a second state vector output by the second encoder unit based on the incorrect word sequence generated by the negative example data generation unit based on the correct word sequence supplied by the learning data supply unit; and a control unit that controls at least the learning of the neural network of the first encoder unit by error backpropagation based on both the correct error and the incorrect error.

[0011] According to the configuration of [1] above, the negative example data generation unit automatically generates incorrect answer word sequences. The loss calculation unit calculates not only the correct answer error based on the correct answer word sequence but also the incorrect answer error based on the incorrect answer word sequence. Then, the control unit performs learning of the neural network of the first encoder unit based on both the correct answer error and the incorrect answer error. That is, the learning of the first encoder unit can be performed based on the incorrect answer word sequence. When performing the learning of the neural network of the first encoder unit, the learning of the neural network of the second encoder unit may be performed based on the same error. And based on the state vector output by the first encoder unit after learning, the decoder unit can estimate a word sequence.

[0012] [2] Also, in one aspect of the present invention, in the conversion device of [1] above, the loss calculation unit obtains a combined error, which is an error based on both the obtained correct answer error and the incorrect answer error, and the control unit controls to perform learning by error backpropagation of the neural network of at least the first encoder unit based on the combined error. The larger the value of the correct answer error, the larger the value of the combined error, and the larger the value of the incorrect answer error, the smaller the value of the combined error.

[0013] According to the configuration of [2] above, the larger the value of the correct answer error, the larger the value of the combined error. Also, the larger the value of the incorrect answer error, the smaller the value of the combined error. By training the neural network of the first encoder unit based on such a combined error, the state vector output by the first encoder unit after learning approaches the information corresponding to the correct word sequence and moves farther away from the information corresponding to the incorrect word sequence.

[0014] [3] Also, as one aspect of the present invention, in the conversion device of [2] above, the combined error is the triplet error L_triplet obtained by the loss calculation unit according to the following formula (1):

$$L_{\mathrm{triplet}} = \max\left(d_{\mathrm{positive}} - d_{\mathrm{negative}} + \alpha,\ 0\right) \qquad (1)$$

(in formula (1), d_positive is the correct answer error, d_negative is the incorrect answer error, and α is an appropriately determined value).

[0015] In the configuration described in [3] above, the triplet error L_triplet is a specific example of the aforementioned combined error.

[0016] [4] In addition, in the conversion device of [2] or [3] described above, the control unit controls the first encoder unit to perform backpropagation of the neural network based on the correct answer error if the value of the correct answer error is greater than a predetermined threshold, and controls the first encoder unit to perform backpropagation of the neural network based on the combined error if the value of the correct answer error is less than or equal to the threshold.

[0017] According to the configuration described in [4] above, when the correct answer error is greater than the threshold, the neural network of the first encoder unit can be trained in a way that focuses solely on reducing the correct answer error. When the correct answer error is less than or equal to the threshold, the neural network of the first encoder unit can be trained in a way that simultaneously reduces the correct answer error and increases the incorrect answer error.
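The switching rule in [4] can be illustrated with a minimal Python sketch. It assumes the combined error takes the triplet form of formula (1), max(d_positive - d_negative + α, 0); the function name `select_loss` and its parameter names are hypothetical and do not come from the patent text.

```python
def select_loss(d_positive: float, d_negative: float,
                alpha: float, threshold: float) -> float:
    """Pick the loss to backpropagate, following configuration [4].

    Assumption: the combined error is the triplet form of formula (1).
    """
    if d_positive > threshold:
        # Correct answer error still large: train on it alone.
        return d_positive
    # Correct answer error small enough: use the combined (triplet) error.
    return max(d_positive - d_negative + alpha, 0.0)


print(select_loss(2.0, 0.5, alpha=1.0, threshold=1.0))   # 2.0
print(select_loss(0.5, 0.25, alpha=1.0, threshold=1.0))  # 1.25
```

In the first call only the correct answer error is used; in the second, the incorrect answer error also contributes through the combined error.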

[0018] [5] Another aspect of the present invention is a program for causing a computer to function as a conversion device comprising: a first encoder unit that obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on video features; a decoder unit that estimates and outputs a word sequence by performing calculations using a neural network based on the state vector output by the first encoder unit; a second encoder unit that obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on a given word sequence; a learning data supply unit that supplies pairs of video and corresponding correct word sequences for learning; a negative example data generation unit that generates an incorrect word sequence based on the correct word sequence supplied by the learning data supply unit; a loss calculation unit that calculates a correct error, which is the error between a first state vector output by the first encoder unit based on the video features of the video supplied by the learning data supply unit and a second state vector output by the second encoder unit based on the correct word sequence supplied by the learning data supply unit, and calculates an incorrect error, which is the error between the first state vector and a second state vector output by the second encoder unit based on the incorrect word sequence generated by the negative example data generation unit from the correct word sequence supplied by the learning data supply unit; and a control unit that controls the neural network of at least the first encoder unit to perform learning by backpropagation based on both the correct error and the incorrect error.

[Effects of the Invention]

[0019] According to the present invention, the first encoder unit can be trained using not only the correct answer error based on the correct word sequence, but also the incorrect answer error based on the incorrect word sequence. In other words, according to the present invention, the accuracy of the first encoder unit can be improved by training using negative examples as well.

[Brief Description of the Drawings]

[0020]
[Figure 1] A block diagram showing the schematic functional configuration of a conversion device according to an embodiment of the present invention.
[Figure 2] A block diagram illustrating the first encoder unit, the second encoder unit, and their respective input and output data in the same embodiment.
[Figure 3] A flowchart showing the processing procedure when the conversion device according to the same embodiment performs learning using the first pattern in learning mode.
[Figure 4] A flowchart showing the processing procedure when the conversion device according to the same embodiment performs learning using the second pattern in learning mode.
[Figure 5] A flowchart showing the processing procedure when the conversion device according to the same embodiment is operating in conversion execution mode.
[Figure 6] A block diagram showing an example of the internal configuration of the conversion device according to the same embodiment.

[Modes for Carrying Out the Invention]

[0021] Next, one embodiment of the present invention will be described with reference to the drawings. This embodiment is an improvement over the conversion device described in Patent Document 1. Specifically, this embodiment improves the learning effect of the encoder that generates a state vector based on video. To this end, the conversion device of this embodiment has a function to generate negative examples (sequences of incorrect words) based on positive examples (sequences of correct words) in the training data.

[0022] In this embodiment, the encoder is trained using not only the supplied positive examples but also the negative examples generated by the conversion device. Specifically, in this embodiment, the neural network is trained based on the triplet loss (triplet error) L_triplet calculated by the following equation (1).

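[0023]

$$L_{\mathrm{triplet}} = \max\left(d_{\mathrm{positive}} - d_{\mathrm{negative}} + \alpha,\ 0\right) \qquad (1)$$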

[0024] In equation (1), d_positive is the distance between the output from the neural network (the estimated value at that time) and the positive example. Also, d_negative is the distance between the output from the neural network (the estimated value at that time) and the negative example. α is a hyperparameter. α is also called the "margin". The value of α is given as appropriate. α may be a non-negative value. Also, the value of α may be made variable and obtained by machine learning. Also, max is a maximum-value function that returns the largest of its arguments.

[0025] That is, the triplet loss L_triplet calculated by equation (1) is an error (loss) that brings about a learning effect such that the output from the neural network becomes closer to the positive example and farther from the negative example.
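As a minimal sketch of this behavior, the following Python function computes a triplet loss under the assumption that equation (1) has the standard form max(d_positive - d_negative + α, 0), which matches the description of the margin α and the max function above. The function name and the sample values are illustrative only.

```python
def triplet_loss(d_positive: float, d_negative: float, alpha: float = 1.0) -> float:
    """Assumed form of equation (1): max(d_positive - d_negative + alpha, 0)."""
    return max(d_positive - d_negative + alpha, 0.0)


# Negative example is too close to the output: the loss is positive.
print(triplet_loss(0.5, 0.25, alpha=1.0))  # 1.25
# Negative example is farther than the positive one by more than the margin.
print(triplet_loss(0.5, 3.0, alpha=1.0))   # 0.0
```

The loss grows as d_positive grows and shrinks as d_negative grows, and it is clipped at zero once the negative example is at least α farther away than the positive example.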

[0026] The conversion device of the present embodiment automatically generates pseudo-negative example data in order to calculate d_negative. The present embodiment thereby aims to improve the learning of the encoder.

[0027] FIG. 1 is a block diagram showing a schematic functional configuration of the conversion device according to the present embodiment. As shown in the figure, the conversion device 1 includes an input unit 10, a first encoder unit 20, a decoder unit 30, an output unit 40, a first loss calculation unit 50, a second encoder unit 60, a second loss calculation unit 70, a learning data supply unit 80, a control unit 90, and a negative example data generation unit 110. Each of these functional units can be realized by, for example, a computer and a program. Also, each functional unit has a storage means as necessary. The storage means is, for example, a variable in a program or a memory allocated by the execution of the program. Also, as necessary, a non-volatile storage means such as a magnetic hard disk device or a solid-state drive (SSD) may be used. Also, at least some of the functions of each functional unit may be realized as a dedicated electronic circuit instead of a program.

[0028] The conversion device 1 having the above configuration takes video including sign language actions as input and outputs information about word sequences corresponding to the sign language actions shown in the video. The word sequence output by the conversion device 1 is a sequence of symbols representing the words to which the sign language actions correspond (sign language labels, also called sign language words or glosses). In other words, the conversion device 1 converts video of sign language into word sequences.

[0029] The conversion device 1 is configured to include a machine learning-capable model internally. The machine learning-capable model is, for example, a neural network. The conversion device 1 operates in either a learning mode or a conversion execution mode. In learning mode, the conversion device 1 trains the model based on training data. Specifically, the conversion device 1 optimizes the internal parameters of the model. For example, if the model is constructed using a neural network, methods such as backpropagation can be used to optimize the internal parameters. In conversion execution mode, the conversion device 1, based on the trained model, finds (predicts) a word sequence corresponding to an unknown input video and outputs the obtained word sequence.

[0030] When the conversion device 1 operates in learning mode, the first pattern of learning and the second pattern of learning are used in combination. The first pattern trains the encoder (the first encoder unit 20 described below) and the decoder (the decoder unit 30 described below) using only positive example learning data. The second pattern trains the first encoder (the first encoder unit 20) and the second encoder (the second encoder unit 60 described below) using both positive and negative example learning data. The decoder is not trained in the second pattern. However, even in the second pattern, if learning with negative example learning data is not considered effective, the first encoder and the second encoder are trained based only on positive example learning data. The procedures for the first and second patterns of learning will be explained later with reference to the flowcharts.

[0031] For example, when training the model, the conversion device 1 may alternate between training using the first pattern and training using the second pattern. Alternatively, the conversion device 1 may perform training using the first pattern multiple times, then perform training using the second pattern multiple times, and repeat this process thereafter.
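Both scheduling options described here can be expressed with one small generator, shown below as a hedged Python sketch. The function name `pattern_schedule`, the labels "first" and "second", and the repeat counts are hypothetical; the patent text does not prescribe any particular values.

```python
from itertools import cycle, islice


def pattern_schedule(n_first: int, n_second: int):
    """Repeat: the first pattern n_first times, then the second n_second times."""
    if n_first < 0 or n_second < 0 or n_first + n_second == 0:
        raise ValueError("repeat counts must be non-negative and not both zero")
    return cycle(["first"] * n_first + ["second"] * n_second)


# Strict alternation between the two patterns.
print(list(islice(pattern_schedule(1, 1), 4)))
# Several first-pattern steps, then several second-pattern steps, repeated.
print(list(islice(pattern_schedule(2, 3), 7)))
```

A training loop could draw one label per step from this schedule and dispatch to the corresponding learning procedure.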

[0032] The functions of each component of the conversion device 1 are as follows:

[0033] The input unit 10 acquires the input video to be converted from an external source and passes it to the first encoder unit 20. When the conversion device 1 is operating in conversion execution mode, the input unit 10 acquires the input video from an external source.

[0034] The first encoder unit 20 extracts the meaning of the video based on the video received from the input unit 10 or the learning data supply unit 80, and outputs a state vector containing the extracted meaning information. In other words, the state vector is semantic representation data that expresses meaning. Specifically, the first encoder unit 20 calculates and outputs a state vector representing semantic information by performing calculations using a neural network based on video features. When the conversion device 1 is operating in conversion execution mode, the first encoder unit 20 passes the output state vector to the decoder unit 30. When the conversion device 1 is operating in learning mode, in the learning of the first pattern, the first encoder unit 20 passes the output state vector to the decoder unit 30. In the learning of the second pattern, the first encoder unit 20 passes the output state vector to the second loss calculation unit 70.

[0035] The first encoder unit 20 is configured to include a neural network internally. The neural network in the first encoder unit 20 is machine learning capable. As an example, the first encoder unit 20 can be implemented using an RNN (Recurrent Neural Network).

[0036] Note that the RNN in the first encoder unit 20 does not receive the input video (sequence of frame images) directly. Instead, the RNN in the first encoder unit 20 receives image features extracted from the frame images using a CNN (Convolutional Neural Network).
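To make the data flow concrete, here is a minimal pure-Python sketch of a recurrent encoder that consumes one feature vector per frame and returns its final hidden state as the state vector. It assumes a simple Elman-style RNN with a tanh activation and assumes the per-frame features have already been extracted; the patent only states that an RNN is one possible implementation, so the cell type, the weight names `w_in`, `w_rec`, `bias`, and the use of the last hidden state are all illustrative assumptions.

```python
import math


def rnn_encode(frame_features, w_in, w_rec, bias):
    """Run an Elman-style RNN over per-frame features; return the last hidden state.

    frame_features: list of feature vectors, one per video frame
    w_in:  hidden_size x feature_size input weights
    w_rec: hidden_size x hidden_size recurrent weights
    bias:  hidden_size bias terms
    """
    hidden = [0.0] * len(bias)
    for x in frame_features:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden


# Two frames with 2-dimensional features, encoded into a 1-dimensional state.
state = rnn_encode([[1.0, 0.0], [0.0, 1.0]],
                   w_in=[[0.5, -0.5]], w_rec=[[0.1]], bias=[0.0])
print(len(state))  # 1
```

In practice the weights would be learned by backpropagation rather than fixed by hand as they are in this example.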

[0037] The decoder unit 30 calculates and outputs a word sequence based on the state vector output by the first encoder unit 20. The word sequence output by the decoder unit 30 is a sequence of vectors corresponding to each word. When the conversion device 1 is operating in conversion execution mode, the decoder unit 30 passes the word sequence to be output to the output unit 40. When the conversion device 1 is operating in learning mode (learning the first pattern), the decoder unit 30 passes the word sequence to be output to the first loss calculation unit 50. The decoder unit 30, like the first encoder unit 20, is also configured to include a neural network internally. The decoder unit 30 can be implemented using an RNN as an example. That is, the decoder unit 30 estimates and outputs a word sequence by performing calculations using a neural network based on the state vector output by the first encoder unit 20.

[0038] The output unit 40 outputs the word sequence to the outside when it receives it from the decoder unit 30. In other words, when the conversion device 1 is operating in conversion execution mode, the output unit 40 outputs the word sequence corresponding to the input video (the estimated word sequence which is the recognition result of the input video) to the outside.

[0039] The first loss calculation unit 50 calculates the loss (error) for training the neural network. Specifically, in training the first pattern, the first loss calculation unit 50 calculates the loss between the word sequence output by the decoder unit 30 and the correct word sequence supplied by the training data supply unit 80. The loss calculated by the first loss calculation unit 50 is used for backpropagation of errors in the neural networks of the decoder unit 30 and the first encoder unit 20, respectively.

[0040] The second encoder unit 60 extracts the meaning of the word sequence supplied by the training data supply unit 80 and outputs a state vector as a result. The word sequence data supplied by the training data supply unit 80 may be a correct word sequence or an incorrect word sequence. The state vector calculated by the second encoder unit 60 based on the correct word sequence is used to obtain d_positive in equation (1) above. The state vector calculated by the second encoder unit 60 based on the incorrect word sequence is used to obtain d_negative in equation (1) above. The second encoder unit 60, like the first encoder unit 20, is configured to include a neural network internally. The second encoder unit 60 can be implemented using an RNN as an example. In other words, the second encoder unit 60 obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on the word sequence provided by the training data supply unit 80.

[0041] The second loss calculation unit 70 calculates the loss for backpropagation of the neural network. Specifically, the second loss calculation unit 70 calculates the loss for the second pattern of learning. In other words, the second loss calculation unit 70 finds the loss between the state vector output by the first encoder unit 20 and the state vector output by the second encoder unit 60. The second loss calculation unit 70 also calculates the triplet loss L_triplet using equation (1) above, based on d_positive obtained from the correct word sequence and d_negative obtained from the incorrect word sequence. The second loss calculation unit 70 may also be simply referred to as the "loss calculation unit".

[0042] In other words, the second loss calculation unit 70 calculates the correct answer error (d_positive), which is the error between the first state vector output by the first encoder unit 20 based on the video features of the video (training data) supplied by the training data supply unit 80 and the second state vector output by the second encoder unit 60 based on the correct word sequence (training data) supplied by the training data supply unit 80. The second loss calculation unit 70 also calculates the incorrect answer error (d_negative), which is the error between the first state vector and the second state vector output by the second encoder unit 60 based on the incorrect word sequence generated by the negative example data generation unit 110 from the correct word sequence supplied by the training data supply unit 80.
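The two errors can be sketched in Python as distances between state vectors. The text does not fix a particular distance measure, so the Euclidean distance used here is an assumption, and the vector values and variable names are made up for illustration.

```python
import math


def distance(u, v):
    """Euclidean distance between two state vectors (an assumed metric)."""
    if len(u) != len(v):
        raise ValueError("state vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


video_state = [0.0, 0.0]      # first state vector (from the first encoder unit)
correct_state = [3.0, 4.0]    # second state vector for the correct word sequence
incorrect_state = [6.0, 8.0]  # second state vector for the incorrect word sequence

d_positive = distance(video_state, correct_state)
d_negative = distance(video_state, incorrect_state)
print(d_positive, d_negative)  # 5.0 10.0
```

Both errors share the same first state vector; only the second state vector differs between the correct and incorrect cases.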

[0043] Furthermore, the second loss calculation unit 70 calculates a combined error, which is an error based on both the correct answer error and the incorrect answer error. The combined error has the following properties. The larger the value of the correct answer error, the larger the value of the combined error. Also, the larger the value of the incorrect answer error, the smaller the value of the combined error. An example of the combined error is the triplet error L_triplet that the second loss calculation unit 70 calculates using formula (1) above.

[0044] The training data supply unit 80 supplies training data for the conversion device 1 to perform machine learning. The training data supply unit 80 supplies a pair of an input video and its correct word sequence as one item of positive example training data. The training data supply unit 80 also instructs the negative example data generation unit 110 to generate an incorrect word sequence based on that correct word sequence. The training data supply unit 80 supplies the pair of the input video and the incorrect word sequence as one item of negative example training data corresponding to the positive example training data. Because the training data supply unit 80 thus supplies both positive example training data (a correct word sequence) and negative example training data (an incorrect word sequence) for one input video, the second loss calculation unit 70 can calculate d_positive and d_negative described above and, based on them, the triplet loss L_triplet. In other words, the training data supply unit 80 supplies at least pairs of video and corresponding correct word sequences for learning purposes.

[0045] The incorrect word sequences supplied by the training data supply unit 80 are pseudo-negative examples mechanically generated based on the correct word sequences. These incorrect word sequences may also be called "incomplete correct word sequences."

[0046] The negative example data generation unit 110 generates negative example data based on instructions from the training data supply unit 80. Specifically, the negative example data generation unit 110 generates an incorrect word sequence based on the correct word sequence provided by the training data supply unit 80, and returns the incorrect word sequence to the training data supply unit 80 as negative example data.

[0047] An example of the processing method used by the negative example data generation unit 110 is as follows. The negative example data generation unit 110 receives a correct word sequence from the training data supply unit 80. This correct word sequence can be expressed as Word_1-···-Word_U. In other words, the correct word sequence is a sequence of U words. Each word from Word_1 to Word_U is a sign language label (symbol). The negative example data generation unit 110 generates an incorrect word sequence by replacing any m of the U words that make up the correct word sequence with other words. Note that 1 ≤ m ≤ U. The value of m may be 1 or 2, etc., or it may be a value randomly selected within the range of 1 ≤ m ≤ U. The replacement words used when replacing the m words may be, for example, randomly selected words. Alternatively, the replacement words may be limited to words that have grammatical properties similar to those of the original words.
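The replacement procedure can be sketched in Python as follows. This is a minimal illustration of the random-replacement variant only; the function name `make_incorrect_sequence`, the `vocabulary` argument, and the sample sign language labels are hypothetical and not taken from the patent.

```python
import random


def make_incorrect_sequence(correct, vocabulary, m=1, rng=None):
    """Replace m randomly chosen words of `correct` with different words."""
    if rng is None:
        rng = random.Random()
    if not 1 <= m <= len(correct):
        raise ValueError("m must satisfy 1 <= m <= U")
    incorrect = list(correct)  # leave the correct sequence untouched
    for i in rng.sample(range(len(correct)), m):
        # Only words that differ from the original can make the sequence incorrect.
        candidates = [w for w in vocabulary if w != correct[i]]
        if not candidates:
            raise ValueError("vocabulary has no alternative word")
        incorrect[i] = rng.choice(candidates)
    return incorrect


vocabulary = ["I", "YOU", "GO", "EAT", "HOME"]
correct = ["I", "GO", "HOME"]
negative = make_incorrect_sequence(correct, vocabulary, m=2, rng=random.Random(0))
print(negative)
```

Restricting `candidates` to words with grammatical properties similar to the original word would implement the alternative mentioned at the end of the paragraph; that would need part-of-speech information, which this sketch does not model.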

[0048] The control unit 90 controls the operation of the entire conversion device 1. Specifically, the control unit 90 controls whether the conversion device 1 operates in learning mode or conversion execution mode. The control unit 90 controls each functional unit to perform operations that depend on the operating mode at that time. In addition, when the conversion device 1 operates in learning mode, the control unit 90 controls whether to perform learning using the first pattern or the second pattern.

[0049] Furthermore, the control unit 90 specifically controls the machine learning procedure performed by the conversion device 1.

[0050] In other words, when performing the first pattern of learning, the control unit 90 controls the supply of video data included in the learning data to the first encoder unit 20, and causes the first encoder unit 20 and the decoder unit 30 to perform forward processing of the neural network. The control unit 90 also controls the supply of correct word sequences included in the learning data to the first loss calculation unit 50. Furthermore, the control unit 90 causes the decoder unit 30 and the first encoder unit 20 to perform backpropagation of errors in their respective neural networks based on the loss calculated by the first loss calculation unit 50.

[0051] Furthermore, when performing the second pattern of learning, the control unit 90 causes the negative example data generation unit 110 to generate incorrect word sequences. The control unit 90 also controls the supply of video data included in the training data to the first encoder unit 20, and the supply of the correct word sequences included in the training data and the incorrect word sequences generated by the negative example data generation unit 110 to the second encoder unit 60. The control unit 90 also causes the neural networks of the first encoder unit 20 and the second encoder unit 60 to perform forward processing. The control unit 90 also causes the second loss calculation unit 70 to calculate the loss. Depending on the situation, the second loss calculation unit 70 calculates whichever of the correct answer error (d_positive), the incorrect answer error (d_negative), and the triplet error (L_triplet) are required. The control unit 90 also controls the first encoder unit 20 and the second encoder unit 60 to perform backpropagation of errors in their respective neural networks based on the losses calculated by the second loss calculation unit 70.

[0052] Furthermore, the control unit 90 controls each pair included in the set of pairs of video and correct word sequences (training data) to be used sequentially for training. The control unit 90 also determines the termination conditions for neural network training and controls whether or not to terminate training. For example, the control unit 90 may terminate training when a predetermined number of processing cycles (a predetermined number of epochs) have been completed. Alternatively, the control unit 90 may determine, for example, whether or not the changes in the values of the neural network's internal parameters due to training have converged, and decide whether or not to terminate training based on that determination result.
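The two termination conditions described above (a fixed number of epochs, or convergence of the parameter changes) might be combined as in the following sketch. The names `train`, `train_step`, `max_epochs`, and `tol` are hypothetical; in particular, `train_step` is assumed to return the magnitude of the parameter change caused by one training pair, which the source does not specify.

```python
def train(pairs, train_step, max_epochs=10, tol=1e-4):
    """Use every pair in turn each epoch; stop on the epoch budget or on convergence.

    Returns the number of epochs actually run.
    """
    epochs_run = 0
    for epoch in range(max_epochs):
        # Largest parameter change observed during this epoch.
        max_change = 0.0
        for video, correct_words in pairs:
            max_change = max(max_change, train_step(video, correct_words))
        epochs_run = epoch + 1
        if max_change < tol:
            break  # parameter updates are judged to have converged
    return epochs_run
```

Either condition alone is also permitted by the text; setting `tol` to 0 leaves only the epoch budget active.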

[0053] Furthermore, the control unit 90 controls the neural network of at least the first encoder unit 20 to perform learning by backpropagation based on both the correct answer error and the incorrect answer error. The control unit 90 may also control the neural network of the second encoder unit 60 to perform learning by backpropagation based on both the correct answer error and the incorrect answer error. In addition, the control unit 90 may control the neural network of at least the first encoder unit 20 to perform learning by backpropagation based on the combined error.

[0054] Furthermore, the control unit 90 may control the neural network of the first encoder unit to perform backpropagation based on the correct answer error if the value of the correct answer error is greater than a predetermined threshold, and to perform backpropagation based on the combined error if the value of the correct answer error is less than or equal to the threshold. The specific procedure will be explained later with reference to the flowchart.

[0055] Figure 2 is a block diagram illustrating the first encoder unit 20 and the second encoder unit 60, and their respective input and output data. As shown in the figure, the first encoder unit 20 is configured to include a neural network 220 internally. The second encoder unit 60 is configured to include a neural network 260 internally.

[0056] The neural network 220 in the first encoder unit 20 takes information about the frame images contained in the input video as input and outputs a state vector containing information representing its meaning. As mentioned above, the CNN outputs feature quantities (feature vectors) that represent the features of the input frame image. The RNN then takes these feature quantities of the frame image as input and outputs a state vector. The neural network 260 in the second encoder unit 60 takes information about a sequence of words as input and outputs a state vector containing information representing its meaning. The sequence of words is data supplied from the training data supply unit 80 and is either a sequence of correct words or a sequence of incorrect words.

[0057] The second loss calculation unit 70 calculates the loss between the state vector output from the neural network 220 of the first encoder unit 20 and the state vector output from the neural network 260 of the second encoder unit 60. The second loss calculation unit 70 calculates d_positive, which is the loss corresponding to the correct word sequence supplied from the training data supply unit 80. In some cases, it also calculates d_negative, which is the loss corresponding to the incorrect word sequence supplied from the training data supply unit 80. Based on d_positive and d_negative, the second loss calculation unit 70 can calculate the triplet loss L_triplet using equation (1).

[0058] When performing backpropagation of neural networks 220 and 260, either the above d_positive or the triplet loss L_triplet is used. By performing this type of learning, the state vectors output by neural networks 220 and 260 approach the state vectors corresponding to the correct word sequences and move away from the state vectors corresponding to the incorrect word sequences.
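The relationship between d_positive, d_negative, and L_triplet can be illustrated with a small numerical sketch. Two assumptions are made here because this section does not reproduce equation (1): the loss between state vectors is taken to be the Euclidean distance, and the triplet loss is taken to have the commonly used hinge form max(d_positive - d_negative + α, 0). The function name `triplet_loss` and the sample vectors are likewise hypothetical.

```python
import math

def triplet_loss(video_state, positive_state, negative_state, alpha=1.0):
    """Hinge-form triplet loss over three state vectors (assumed form, see lead-in)."""
    # d_positive: video state vs. state vector of the correct word sequence.
    d_positive = math.dist(video_state, positive_state)
    # d_negative: video state vs. state vector of the incorrect word sequence.
    d_negative = math.dist(video_state, negative_state)
    return max(d_positive - d_negative + alpha, 0.0)

# Video state far from the positive and close to the negative: large loss.
far_from_positive = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], alpha=1.0)
# Video state on top of the positive and far from the negative: zero loss.
on_positive = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], alpha=1.0)
```

Under these assumptions the loss shrinks as the video state vector moves toward the positive example and grows as it moves toward the negative example, which matches the behavior the text describes.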

[0059] Next, the operation procedure of the conversion device 1 will be explained. The operation in learning mode (learning of the first pattern (Figure 3) and learning of the second pattern (Figure 4)) and the operation in conversion execution mode (Figure 5) will be explained with reference to the flowchart.

[0060] Figure 3 is a flowchart showing the processing procedure when the conversion device 1 performs learning using the first pattern in learning mode. This flowchart shows the processing corresponding to one pair of an input video and a correct word sequence. The processing procedure will be explained below in accordance with this flowchart.

[0061] In step S1, the training data supply unit 80 acquires one data set of input video and correct word sequence (correct example). The training data supply unit 80 supplies the input video included in this pair to the first encoder unit 20. The training data supply unit 80 also supplies the correct word sequence included in this pair to the first loss calculation unit 50.

[0062] In step S2, the training data supply unit 80 supplies the input video data acquired in step S1 to the first encoder unit 20. The first encoder unit 20 performs forward propagation processing of the neural network based on the input video data it receives. As a result, the first encoder unit 20 outputs a state vector. This state vector is passed to the decoder unit 30.

[0063] In step S3, the decoder unit 30 receives the state vector output from the first encoder unit 20 by the processing in step S2 and performs forward propagation processing of the neural network. As a result, the decoder unit 30 outputs an estimated word sequence. This estimated word sequence is estimated to be the word sequence corresponding to the original input video. The decoder unit 30 passes this estimated word sequence to the first loss calculation unit 50.

[0064] In step S4, the first loss calculation unit 50 calculates the loss between the estimated word sequence output from the decoder unit 30 and the correct word sequence supplied from the training data supply unit 80 (step S1). This loss forms the basis for the backpropagation process in steps S5 to S6.

[0065] In step S5, the conversion device 1 performs backpropagation of errors in the neural network of the decoder unit 30 based on the above-mentioned loss. This updates the values of the internal parameters of the decoder unit 30.

[0066] In step S6, the conversion device 1, following step S5, performs backpropagation of errors in the neural network of the first encoder unit 20. This updates the values of the internal parameters of the first encoder unit 20.

[0067] Figure 4 is a flowchart showing the processing procedure when the conversion device 1 performs learning using the second pattern in learning mode. This flowchart shows the processing corresponding to one set of input video, correct word sequence, and incorrect word sequence. The processing procedure will be explained below in accordance with this flowchart.

[0068] In step S11, the training data supply unit 80 acquires one pair (positive example) of an input video and a correct word sequence.

[0069] In step S12, the negative example data generation unit 110 generates an incorrect word sequence based on the correct word sequence obtained in step S11. The negative example data generation unit 110 then passes the generated incorrect word sequence data to the training data supply unit 80.

[0070] In step S13, the training data supply unit 80 supplies the input video data acquired in step S11 to the first encoder unit 20. The first encoder unit 20 performs forward propagation processing of the neural network based on the input video data it receives. As a result, the first encoder unit 20 outputs a state vector.

[0071] In step S14, the training data supply unit 80 supplies the correct word sequence data acquired in step S11 to the second encoder unit 60. The second encoder unit 60 performs forward propagation processing based on the correct word sequence data it receives. As a result, the second encoder unit 60 outputs a state vector.

[0072] In step S15, the second loss calculation unit 70 calculates the loss between the state vector output from the first encoder unit 20 and the state vector output from the second encoder unit 60. The loss calculated here is d_positive in equation (1) above.

[0073] In step S16, the control unit 90 determines whether the loss calculated in step S15 (the loss between the state vector output from the first encoder unit 20 and the state vector output from the second encoder unit 60) is greater than a predetermined threshold. If the loss is greater than the threshold (step S16: YES), the process jumps to step S20. If the loss is less than or equal to the threshold (step S16: NO), the process proceeds to the next step, S17.

[0074] The threshold value mentioned above can be set as appropriate. For example, the threshold value could be set to 0.01.

[0075] In other words, if the loss calculated in step S15 is greater than the predetermined threshold, the internal parameters of the neural network are adjusted using that loss (the loss based only on positive examples). If the loss calculated in step S15 is less than or equal to the threshold, the internal parameters of the neural network are adjusted using a triplet loss calculated based on both positive and negative examples.
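The branch at step S16 can be sketched as a small selection function. The names `select_training_loss` and `compute_triplet_loss` are hypothetical; `compute_triplet_loss` stands in for steps S17 to S19 and is only invoked when needed, so the forward pass for the negative example is skipped on the YES branch. The default threshold of 0.01 is the example value given in paragraph [0074].

```python
def select_training_loss(d_positive, compute_triplet_loss, threshold=0.01):
    """Return (name, value) of the loss to backpropagate, following step S16."""
    if d_positive > threshold:
        # Step S16: YES. Jump to step S20 and train on the positive example only.
        return "d_positive", d_positive
    # Step S16: NO. Steps S17 to S19 run and yield the triplet loss.
    return "L_triplet", compute_triplet_loss(d_positive)

# Example with a stand-in for steps S17 to S19 that returns a fixed value.
early = select_training_loss(0.2, lambda d: 0.5)
late = select_training_loss(0.005, lambda d: 0.5)
```

A plausible reading of this design is that the negative example is only introduced once the encoder already places the video close to the correct word sequence, so early training concentrates on the positive example.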

[0076] In step S17, the training data supply unit 80 supplies the incorrect word sequence (negative example) generated in step S12 to the second encoder unit 60. The second encoder unit 60 performs forward propagation processing based on the incorrect word sequence data it receives. As a result, the second encoder unit 60 outputs a state vector.

[0077] In step S18, the second loss calculation unit 70 calculates the loss between the state vector output from the first encoder unit 20 (step S13) and the state vector output from the second encoder unit 60 (step S17). The loss calculated here is d_negative in equation (1) above.

[0078] In step S19, the second loss calculation unit 70 calculates the triplet loss L_triplet using equation (1) above. As shown in equation (1), the closer the state vector output from the first encoder unit 20 is to the state vector calculated by the second encoder unit 60 based on the correct word sequence (positive example), the smaller the value of the triplet loss L_triplet becomes. Conversely, the closer the state vector output from the first encoder unit 20 is to the state vector calculated by the second encoder unit 60 based on the incorrect word sequence (negative example), the larger the value of the triplet loss L_triplet becomes.

[0079] Next, the process moves to the backpropagation process in steps S20 and S21. If it was determined in step S16 that the loss is greater than the threshold (step S16: YES), backpropagation is performed based on the loss d_positive calculated in step S15. If it was determined in step S16 that the loss is less than or equal to the threshold (step S16: NO), backpropagation is performed based on the triplet loss L_triplet calculated in step S19.

[0080] In step S20, the conversion device 1 performs backpropagation of errors in the neural network of the second encoder unit 60 based on the above-mentioned loss. As a result, the values of the internal parameters of the second encoder unit 60 are updated.

[0081] In step S21, the conversion device 1 performs backpropagation of errors in the neural network of the first encoder unit 20 based on the above-mentioned loss. As a result, the values of the internal parameters of the first encoder unit 20 are updated.

[0082] Figure 5 is a flowchart showing the processing procedure when the conversion device 1 is operating in conversion execution mode. The operation of the conversion device 1 in conversion execution mode is contingent on the neural network training being completed. The processing procedure will be explained below in accordance with this flowchart.

[0083] In step S51, the input unit 10 acquires the input video. The input unit 10 passes the frame images contained in the input video to the first encoder unit 20.

[0084] In step S52, the first encoder unit 20 performs forward propagation processing of the neural network based on the input video acquired in step S51. As a result, the first encoder unit 20 outputs a state vector. This state vector is passed to the decoder unit 30.

[0085] In step S53, the decoder unit 30 takes the state vector output from the first encoder unit 20 in step S52 as input and performs forward propagation processing of the neural network. As a result, the decoder unit 30 outputs a word sequence. This word sequence corresponds to the sign language actions shown in the input video. In other words, this word sequence is an estimated word sequence estimated to represent the content of those sign language actions. That is, this word sequence is the recognition result of the sign language actions shown in the input video. The decoder unit 30 passes this word sequence to the output unit 40.

[0086] In step S54, the output unit 40 outputs the word sequence output from the decoder unit 30 in step S53 to the outside as the conversion result (sign language recognition result).
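The data flow of steps S51 to S54 amounts to composing the first encoder unit and the decoder unit. The sketch below shows only that composition; `convert`, `toy_encoder`, and `toy_decoder` are hypothetical names, and the toy functions are placeholders that stand in for trained neural networks so that the example runs.

```python
from typing import Callable, List, Sequence

StateVector = List[float]

def convert(frames: Sequence[object],
            encoder: Callable[[Sequence[object]], StateVector],
            decoder: Callable[[StateVector], List[str]]) -> List[str]:
    """Run steps S52 and S53 and return the word sequence output in step S54."""
    state_vector = encoder(frames)   # S52: forward pass of first encoder unit 20
    return decoder(state_vector)     # S53: forward pass of decoder unit 30

# Toy stand-ins: real units would wrap the trained networks.
def toy_encoder(frames):
    return [float(len(frames))]

def toy_decoder(state_vector):
    return ["LABEL"] * int(state_vector[0])

words = convert(["frame0", "frame1"], toy_encoder, toy_decoder)
```

Note that the second encoder unit 60 does not appear here: according to the procedure above, it is used only during learning with the second pattern.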

[0087] The conversion device 1 may, for example, alternately repeat the learning of the first pattern and the learning of the second pattern. Alternatively, the conversion device 1 may repeatedly apply all the training data sequentially. The conversion device 1 may use existing machine learning methods using the training data.
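One way to alternate the two learning patterns is sketched below. The source does not state the granularity of the alternation, so this sketch assumes one full pass over the training data with the first pattern followed by one full pass with the second pattern; `alternate_patterns`, `first_pattern_step`, and `second_pattern_step` are hypothetical names for the procedures of Figure 3 and Figure 4.

```python
def alternate_patterns(pairs, first_pattern_step, second_pattern_step, epochs=1):
    """Alternate first-pattern and second-pattern learning over all training pairs."""
    for _ in range(epochs):
        # First pattern (Figure 3): encoder plus decoder, loss against correct words.
        for video, correct_words in pairs:
            first_pattern_step(video, correct_words)
        # Second pattern (Figure 4): both encoders, d_positive or triplet loss.
        for video, correct_words in pairs:
            second_pattern_step(video, correct_words)

# Record the call order with stand-in step functions.
log = []
alternate_patterns(
    [("v1", ["A"]), ("v2", ["B"])],
    lambda video, words: log.append(("first", video)),
    lambda video, words: log.append(("second", video)),
)
```

Alternating per pair rather than per pass would be an equally valid reading of the text.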

[0088] Figure 6 is a block diagram showing an example of the internal configuration of the conversion device 1. The conversion device 1 can be implemented using a computer. As shown in the figure, the computer includes a central processing unit 901, RAM 902, an input/output port 903, input/output devices 904 and 905, and a bus 906. The computer itself can be implemented using existing technology. The central processing unit 901 executes instructions contained in programs read from RAM 902 or elsewhere. According to each instruction, the central processing unit 901 writes data to RAM 902, reads data from RAM 902, and performs arithmetic and logical operations. RAM 902 stores data and programs. Each element contained in RAM 902 has an address and can be accessed using that address. RAM stands for "Random Access Memory". The input/output port 903 is a port through which the central processing unit 901 exchanges data with external input/output devices and the like. The input/output devices 904 and 905 exchange data with the central processing unit 901 via the input/output port 903. The bus 906 is a common communication channel used within the computer. For example, the central processing unit 901 reads and writes data in RAM 902 via the bus 906. Also, for example, the central processing unit 901 accesses the input/output port 903 via the bus 906.

[0089] Furthermore, at least some of the functions of the conversion device 1 in the embodiment can be realized by a computer and a program. In that case, the program for realizing these functions may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be loaded into a computer system and executed. Here, "computer system" includes an OS and hardware such as peripheral devices. Furthermore, "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, CD-ROMs, DVD-ROMs, and USB memory, and to storage devices such as hard disks built into a computer system. In other words, the "computer-readable recording medium" may be a non-transitory computer-readable recording medium. Moreover, "computer-readable recording medium" may also include media that hold a program temporarily and dynamically, such as a communication line used when transmitting the program via a network such as the Internet or a communication line such as a telephone line, and media that hold the program for a certain period of time, such as volatile memory inside a computer system that acts as a server or client in that case. Furthermore, the above program may realize only some of the functions described above, and may realize those functions in combination with a program already recorded in the computer system.

[0090] As explained above, according to this embodiment, the conversion device 1 performs learning that reduces d_positive (the correct answer error) and increases d_negative (the incorrect answer error). The conversion device 1 also automatically generates the incorrect word sequences used for such learning. In other words, the first encoder unit 20 can learn the difference between correct word sequences and incorrect word sequences, thereby improving the accuracy of video recognition.

[0091] The results of the demonstration experiment of the above embodiment are as follows. In the demonstration experiment, 6,000 pairs of sign language video (input video) and correct word sequences were prepared as training data, and the conversion device 1 was trained. In addition, 1,000 pairs of sign language video and correct word sequences were used as evaluation data, and the error rate of the estimated word sequence resulting from conversion based on sign language video was evaluated. It was confirmed that the error rate was 1.5% lower compared to the conventional technology (method described in Patent Document 1).
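The source does not define how the error rate of the estimated word sequence was measured. As an illustration only, the sketch below computes a word error rate, a common metric for comparing an estimated word sequence with a correct one, as the word-level edit distance divided by the length of the correct sequence. The function name `word_error_rate` and the sample labels are assumptions, and this is not claimed to be the metric used in the experiment.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the sequences, divided by len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j]: edit distance between the reference prefix so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the hypothesis
                            curr[j - 1] + 1,      # extra word in the hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)

# One substituted word and one missing word out of four reference words.
rate = word_error_rate(["A", "B", "C", "D"], ["A", "X", "C"])
```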

[0092] Although embodiments have been described above, the present invention can also be implemented in the following modified forms.

[0093] [Modified example] In the above embodiment, when learning using the second pattern (Figure 4), backpropagation based on the triplet error L_triplet is performed only if d_positive (the correct answer error) is less than or equal to a predetermined threshold (decision in step S16 in Figure 4). In a modified example, the decision in step S16 may be omitted, and backpropagation based on the triplet error L_triplet may always be performed.

[0094] While embodiments of this invention have been described in detail above with reference to the drawings, the specific configuration is not limited to these embodiments and also includes designs and the like that do not depart from the spirit of this invention.

[Industrial applicability]

[0095] The present invention can be used, for example, in processing such as extracting meaning from video or recognizing video. One example is its use in converting video containing linguistic content into other linguistic expressions. However, the scope of application of the present invention is not limited to those exemplified herein.

[Explanation of symbols]

[0096]
1 Conversion device
10 Input unit
20 First encoder unit
30 Decoder unit
40 Output unit
50 First loss calculation unit
60 Second encoder unit
70 Second loss calculation unit (loss calculation unit)
80 Training data supply unit
90 Control unit
110 Negative example data generation unit
220, 260 Neural network
901 Central processing unit
902 RAM
903 Input/output port
904, 905 Input/output device
906 Bus

Claims

1. A conversion device comprising: a first encoder unit that obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on video features; a decoder unit that estimates and outputs a word sequence by performing calculations using a neural network based on the state vector output by the first encoder unit; a second encoder unit that obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on a given word sequence; a training data supply unit that supplies, for training, pairs of a video and a correct word sequence corresponding to that video; a negative example data generation unit that generates an incorrect word sequence based on the correct word sequence supplied by the training data supply unit; a loss calculation unit that calculates (1) a correct answer error, which is the error between a first state vector output by the first encoder unit based on the video features of the video supplied by the training data supply unit and a second state vector output by the second encoder unit based on the correct word sequence supplied by the training data supply unit, and (2) an incorrect answer error, which is the error between the first state vector and a state vector output by the second encoder unit based on the incorrect word sequence that the negative example data generation unit generated from the correct word sequence supplied by the training data supply unit; and a control unit that controls the neural network of at least the first encoder unit to perform learning by backpropagation based on both the correct answer error and the incorrect answer error.

2. The conversion device according to claim 1, wherein the loss calculation unit calculates a combined error, which is an error based on both the correct answer error and the incorrect answer error; the control unit controls the neural network of at least the first encoder unit to perform learning by backpropagation based on the combined error; the larger the value of the correct answer error, the larger the value of the combined error; and the larger the value of the incorrect answer error, the smaller the value of the combined error.

3. The conversion device according to claim 2, wherein the combined error is L_triplet calculated by the loss calculation unit using formula (1): [Math 1] (where, in formula (1), d_positive is the correct answer error, d_negative is the incorrect answer error, and α is a value that can be determined as appropriate).

4. The control unit controls the neural network of the first encoder unit to perform backpropagation based on the correct answer error if the value of the correct answer error is greater than a predetermined threshold, and controls the neural network of the first encoder unit to perform backpropagation based on the combined error if the value of the correct answer error is less than or equal to the threshold. The conversion device according to claim 2.

5. A program for causing a computer to function as a conversion device comprising: a first encoder unit that obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on video features; a decoder unit that estimates and outputs a word sequence by performing calculations using a neural network based on the state vector output by the first encoder unit; a second encoder unit that obtains and outputs a state vector representing semantic information by performing calculations using a neural network based on a given word sequence; a training data supply unit that supplies, for training, pairs of a video and a correct word sequence corresponding to that video; a negative example data generation unit that generates an incorrect word sequence based on the correct word sequence supplied by the training data supply unit; a loss calculation unit that calculates (1) a correct answer error, which is the error between a first state vector output by the first encoder unit based on the video features of the video supplied by the training data supply unit and a second state vector output by the second encoder unit based on the correct word sequence supplied by the training data supply unit, and (2) an incorrect answer error, which is the error between the first state vector and a state vector output by the second encoder unit based on the incorrect word sequence that the negative example data generation unit generated from the correct word sequence supplied by the training data supply unit; and a control unit that controls the neural network of at least the first encoder unit to perform learning by backpropagation based on both the correct answer error and the incorrect answer error.