Audio-visual speech model based on a residual network and bidirectional gated recurrent units

A recurrent-unit and speech-model technology, applied in speech analysis, speech recognition, instruments, and related fields, that addresses the problem of low recognition accuracy.

Inactive Publication Date: 2018-09-28
SHENZHEN WEITESHI TECH

AI Technical Summary

Problems solved by technology

[0004] To address the low recognition accuracy under strong noise, the object of the present invention is to provide an audio-visual speech model based on a residual network and bidirectional gated recurrent units. In both the visual stream and the audio stream, temporal dynamics are modeled by a two-layer bidirectional gated recurrent unit (BGRU). The BGRU outputs of the two streams are then concatenated and passed to the classification layer for fusion, where their temporal dynamics are modeled jointly. The final output comes from a Softmax layer, which labels each frame; the sequence is then assigned the label with the highest average probability.
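The final labeling step described above can be illustrated with a minimal sketch. It assumes the classification layer emits one vector of raw class scores per frame; the function names `softmax` and `label_sequence` and the example scores are hypothetical and do not come from the patent text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one frame's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_sequence(frame_scores):
    """Pick the class whose mean per-frame probability is highest.

    frame_scores: list of frames, each a list of raw class scores
    (a hypothetical input format chosen for this sketch).
    Returns (best_class_index, mean_probabilities).
    """
    frame_probs = [softmax(frame) for frame in frame_scores]
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    # Average each class probability over all frames of the sequence.
    mean_probs = [
        sum(frame[c] for frame in frame_probs) / num_frames
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=lambda c: mean_probs[c])
    return best, mean_probs


# Three frames and three classes; the scores are made up for illustration.
scores = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [0.2, 3.0, 0.1]]
best_class, mean_probs = label_sequence(scores)
```

In this example the first frame favors class 0, but classes are compared by their probability averaged over all frames, so the sequence is labeled as class 1.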

Examples

Embodiment Construction

[0022] It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other. The present invention will be further described in detail below in conjunction with the drawings and specific embodiments.

[0023] Figure 1 is a system framework diagram of the audio-visual speech model based on a residual network and bidirectional gated recurrent units of the present invention. The model mainly comprises a visual stream, an audio stream, a classification layer, and audio-visual fusion.

[0024] The visual stream consists of a spatio-temporal convolution followed by a 34-layer residual network (ResNet-34) and a 2-layer bidirectional gated recurrent unit (BGRU); the identity-mapping version of the 34-layer network is used here. The main process is as follows: the residual network gradually reduces the spatio-temporal dimensions until the output at each time step becomes a one-dimensional tensor; finally, the output of the 34-layer residual...
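As a rough illustration of the identity mapping mentioned in paragraph [0024], the sketch below shows a single residual block whose shortcut passes the input through unchanged, so the output is x + F(x). It assumes that "identity map" refers to this identity shortcut. The two small linear maps stand in for the convolutions of a real ResNet-34 block, and all names and weights are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(weights, values):
    """Multiply a weight matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Identity-shortcut residual block: output = x + F(x).

    F is two linear maps with a ReLU in between. This is a simplification
    for brevity: a real ResNet block uses convolutions and normalization.
    """
    hidden = relu(linear(w1, x))
    residual = linear(w2, hidden)
    # The shortcut adds the unchanged input back onto the residual branch.
    return [xi + ri for xi, ri in zip(x, residual)]


x = [1.0, -2.0]

# With all-zero weights the residual branch contributes nothing,
# so the block reduces to the identity mapping.
zero = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zero, zero)

# With non-zero weights the block learns a correction on top of x.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
out = residual_block(x, w1, w2)
```

Because the shortcut is an identity, a block whose residual branch outputs zero leaves its input untouched, which is what makes stacking many such layers trainable.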

Abstract

The invention provides an audio-visual speech model based on a residual network and bidirectional gated recurrent units. The model mainly comprises a visual stream, an audio stream, a classification layer, and audio-visual fusion. Audio-visual fusion proceeds as follows: in each of the visual and audio streams, temporal dynamics are modeled by a two-layer bidirectional gated recurrent unit (BGRU); the BGRU outputs of the two streams are then concatenated and passed to the classification layer for fusion, where the temporal dynamics are modeled jointly; finally, a Softmax layer produces the output, labeling each frame, and the sequence is labeled according to the highest average probability. The model extracts features directly from pixels and audio waveforms at the same time and can recognize text on a large public in-context dataset. Under high noise, its classification accuracy is noticeably higher than that of traditional audio-visual speech models.
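The per-stream BGRU and the concatenation step can be sketched in plain Python. This is a minimal illustration under several assumptions: it uses one bidirectional layer per stream instead of two, omits biases, and fills the weights with untrained constants. The helper names (`gru_step`, `run_gru`, `bgru`, `make_params`) and the feature values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x, h, p):
    """One GRU step without biases; p maps names to weight matrices."""
    wz, uz = matvec(p["Wz"], x), matvec(p["Uz"], h)
    wr, ur = matvec(p["Wr"], x), matvec(p["Ur"], h)
    z = [sigmoid(a + b) for a, b in zip(wz, uz)]  # update gate
    r = [sigmoid(a + b) for a, b in zip(wr, ur)]  # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    wh, uh = matvec(p["Wh"], x), matvec(p["Uh"], gated_h)
    cand = [math.tanh(a + b) for a, b in zip(wh, uh)]  # candidate state
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(frames, p, hidden_size):
    """Run a GRU over a list of frames and return the state at every frame."""
    h = [0.0] * hidden_size
    outputs = []
    for x in frames:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs


def bgru(frames, p_fwd, p_bwd, hidden_size):
    """Bidirectional GRU: join forward and backward states frame by frame."""
    fwd = run_gru(frames, p_fwd, hidden_size)
    bwd = run_gru(frames[::-1], p_bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation


def make_params(input_size, hidden_size, scale):
    """Deterministic toy weights, not trained values."""
    w = [[scale] * input_size for _ in range(hidden_size)]
    u = [[scale] * hidden_size for _ in range(hidden_size)]
    return {"Wz": w, "Uz": u, "Wr": w, "Ur": u, "Wh": w, "Uh": u}


# Made-up per-frame features: 3 frames, 2 visual and 1 audio feature each.
video_feats = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]]
audio_feats = [[0.5], [0.2], [0.7]]

video_out = bgru(video_feats, make_params(2, 2, 0.5), make_params(2, 2, -0.5), 2)
audio_out = bgru(audio_feats, make_params(1, 2, 0.3), make_params(1, 2, -0.3), 2)

# Fusion input: concatenate the two streams' BGRU outputs at every frame.
fused = [v + a for v, a in zip(video_out, audio_out)]
```

Each stream yields a 4-dimensional vector per frame (2 forward plus 2 backward states), so every fused frame has 8 dimensions. In the model described here, this fused sequence is what the classification layer receives.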

Description

Technical field

[0001] The invention relates to the field of audio-visual speech recognition, and in particular to an audio-visual speech model based on a residual network and bidirectional gated recurrent units.

Background art

[0002] With the substantial improvement in personal computer performance, human-computer interaction technology has gradually shifted from computer-centered to human-centered interaction. Against this background, audio-visual speech recognition technology has developed rapidly. It is mainly used in telephone and communication systems, where people can conveniently query and extract relevant information from remote database systems through voice commands. It is also widely used in user-interactive machines, voice notepads, business self-service platforms, and other equipment, greatly reducing labor costs; in terms of public security criminal investi...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L15/06
CPC: G10L15/063, G10L2015/0631
Inventor: 夏春秋
Owner: SHENZHEN WEITESHI TECH