Corpus acquisition method and device

A corpus acquisition method and device, applied in the field of speech recognition, that address the long time period and high cost of obtaining voice samples, with the effects of improving the research and development cycle, saving labeling costs, and reducing the amount of labeling required.

Pending Publication Date: 2021-05-28
广州欢城文化传媒有限公司

AI Technical Summary

Problems solved by technology

[0004] The embodiments of the present application provide a method and device for acquiring corpus, which solve the technical problems of the long time period and high cost of obtaining speech samples for training in the prior art.



Embodiment Construction

[0044] To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the application are described clearly and completely below in conjunction with the accompanying drawings. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the scope of protection of this application.

[0045] Figure 1 is a method flowchart of one embodiment of the corpus acquisition method of the present application. As shown in Figure 1, the method includes:

[0046] 101. Obtain a voice sample;

[0047] It should be noted that this application can crawl the required voice samples from the cloud, because most of the voice samples crawled from the cloud are...


Abstract

The invention discloses a corpus acquisition method and device. The method comprises the following steps: acquiring a voice sample; filtering out truncated voice and invalid voice from the voice sample to obtain a qualified voice sample; performing speech recognition on the qualified voice sample multiple times to obtain a plurality of corresponding voice texts; comparing the plurality of voice texts to obtain a similarity score; if the similarity score is greater than a preset similarity threshold, taking the voice sample whose score exceeds the threshold as the voice sample to be labeled, and taking the voice text with the longest text content as the voice text to be labeled; and manually labeling the voice text to be labeled to obtain a labeled sample. The invention solves the technical problems of the long time period and high cost of obtaining voice samples for training in the prior art.
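
The comparison and selection steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say how the similarity score is computed, so the mean pairwise ratio from Python's `difflib.SequenceMatcher` is used here as a stand-in, and the function names `similarity_score` and `select_for_labeling` and the default threshold of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Optional, Tuple


def similarity_score(texts: List[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) across recognition results.

    Assumption: the source does not define the score; difflib's ratio
    is used here purely as an illustrative stand-in.
    """
    pairs = list(combinations(texts, 2))
    if not pairs:
        # Fewer than two transcripts: nothing to compare.
        return 0.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def select_for_labeling(
    texts: List[str], threshold: float = 0.8  # hypothetical threshold
) -> Optional[Tuple[float, str]]:
    """Return (score, longest transcript) if the score exceeds the threshold.

    Returns None when the recognition results disagree too much, in which
    case the sample is not passed on for manual labeling.
    """
    score = similarity_score(texts)
    if score > threshold:
        return score, max(texts, key=len)
    return None


if __name__ == "__main__":
    transcripts = ["turn on the light", "turn on the light", "turn on the lights"]
    print(select_for_labeling(transcripts))
```

Under this sketch, a sample whose recognition results largely agree is forwarded to manual labeling together with its longest transcript, while a sample with inconsistent results is dropped.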

Description

Technical field

[0001] The present application relates to the technical field of speech recognition, in particular to a method and device for acquiring corpus.

Background technique

[0002] With the rapid development of artificial intelligence, there are more and more data training tasks based on deep learning. To achieve better model quality, it is particularly important to obtain high-quality data sets at an early stage. For human-computer interaction to approach the accuracy of human communication, it is necessary to collect vertical-domain corpus as a data set for supervised learning of the recognition engine, so as to obtain a high-quality recognition model. In actual project development, voice data collection accounts for one-third of the entire project development cycle. To speed up project development, it is necessary to improve the efficiency of data labeling.

[0003] Looking at the voice research and development depa...


Application Information

Patent Type & Authority Applications (China)
IPC(8): G10L15/04, G10L15/06, G10L15/26, G10L25/51
CPC G10L15/04, G10L15/063, G10L15/26, G10L25/51
Inventor 马金龙, 熊佳, 汪暾, 罗箫, 焦南凯, 徐志坚, 谢睿, 陈光尧
Owner 广州欢城文化传媒有限公司