
Corpus extraction method and device and terminal equipment

A corpus extraction technology, applied in speech analysis, speech recognition, instruments, etc., that addresses the high cost of corpus extraction and achieves the effect of reducing that cost.

Inactive Publication Date: 2019-06-07
GUANGZHOU UNIVERSITY
8 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

In the existing technology, a corpus is built and collected to provide training and testing databases for a speech recognition system. The chosen speakers therefore need to cover different regions, ages, genders, and educational levels across the country, and the corpus has to be extracted in multiple recording environments to ensure the matching degree of subsequent speech recognition. As a result, the cost of corpus extraction is too high.

Examples

Embodiment Construction

[0036] The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments of the application. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by persons of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.

[0037] See Figure 1.

[0038] Figure 1 is a schematic flow chart of the corpus extraction method provided by an embodiment of the present application. As shown in Figure 1, the corpus extraction method includes steps S11 to S14, as follows:

[0039] Step S11, collecting audio and video data of the video material.

[0040] Step S12, using audio and video data that does not contain subtitle text data as the first processing data, after obtaining the sub...

Abstract

The invention discloses a corpus extraction method, device, and terminal equipment. The method comprises: collecting audio and video data; for audio and video data that does not contain subtitle text data, obtaining a subtitle region voice image and intercepting it according to a preset frame number to obtain a plurality of pieces of voice image data; converting the caption images in the pieces of voice image data into a plurality of texts, calculating the cosine value between every two of the texts, and merging the texts whose cosine values reach a threshold; and segmenting the first voice data corresponding to the caption images according to the merged text to obtain a corpus for each first text unit. Compared with the prior art, caption images in audio and video that lack caption files are converted into text and then matched with the voice data to extract the corpus. This avoids the need to extract corpus in multiple recording environments and thereby reduces the cost of corpus extraction.
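
The text-merging step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the character-count vector representation, the 0.9 threshold, and the simplification of comparing only consecutive samples (the abstract speaks of every pair of texts) are all assumptions made for the example. The input is assumed to be a list of (timestamp in seconds, recognized caption text) pairs, one per sampled frame.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the character-count vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def merge_captions(frames, threshold=0.9):
    """Merge consecutive (time_sec, text) samples whose texts are similar.

    Returns a list of (start_sec, end_sec, text) spans. The first text seen
    in a run is kept as the span's text (an assumption of this sketch).
    """
    spans = []
    for t, text in frames:
        if not text:
            continue  # frame with no recognized caption
        if spans and cosine_similarity(spans[-1][2], text) >= threshold:
            start, _, kept = spans[-1]
            spans[-1] = (start, t, kept)  # extend the current span
        else:
            spans.append((t, t, text))  # start a new span
    return spans


# Hypothetical OCR output: the third sample contains a recognition error.
frames = [
    (0.0, "hello world"),
    (0.5, "hello world"),
    (1.0, "hello worid"),
    (1.5, "good morning"),
    (2.0, "good morning"),
]
for span in merge_captions(frames):
    print(span)
```

Each resulting span gives a start time, an end time, and one caption text, which is the kind of information needed to cut the matching stretch out of the audio track. Here the end time is the timestamp of the last sample in the run; a real pipeline would likely extend it toward the next sample.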

Description

Technical field

[0001] The present application relates to the technical field of audio, video, and voice information retrieval, and in particular, to a corpus extraction method, device, and terminal equipment.

Background technique

[0002] In the automatic speech recognition system, the performance and robustness of the system depend to a large extent on whether there is enough corpus data in the process of modeling the recognition model, that is, the corpus data resource library is the key basic link of intelligent speech technology. The scale and quality of the corpus in the corpus data resource library largely determines the breadth and depth of various intelligent voice applications, and also greatly affects the user experience.

[0003] In the prior art, corpus is extracted by way of recording, so as to establish a corpus data resource library. However, when using the existing technology for corpus extraction, it was found that since the purpose of establishing and col...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00, G06K9/32, G06K9/46, G06F17/27, G10L15/26
Inventor: 周发升, 何伟宝, 詹逸, 陈渤, 杨敬慈, 皮樾, 李锦韬
Owner: GUANGZHOU UNIVERSITY