Corpus acquisition method and device

A corpus acquisition method and device, applied in the field of speech recognition, that addresses the high cost and long time needed to obtain voice samples, and that shortens the research and development cycle, saves labeling costs, and reduces the number of labels required.
CN112863490A · Pending · Publication Date: 2021-05-28 · 广州欢城文化传媒有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: 广州欢城文化传媒有限公司
Publication Date: 2021-05-28


Abstract

The invention discloses a corpus acquisition method and device. The method comprises the following steps: acquiring voice samples; filtering out truncated and invalid speech from the voice samples to obtain qualified voice samples; performing speech recognition multiple times on each qualified voice sample to obtain a plurality of corresponding voice texts; comparing the plurality of voice texts to obtain a similarity score; if the similarity score is greater than a preset similarity threshold, taking the voice sample as a voice sample to be labeled and taking the voice text with the longest content as the voice text to be labeled; and manually annotating the voice text to be labeled to obtain an annotated sample. The invention addresses the technical problem in the prior art that obtaining voice samples for training takes a long time and is costly.
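
The comparison and selection steps of the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the similarity score is computed, so the mean pairwise `difflib.SequenceMatcher` ratio is used as a stand-in, and the function names and the 0.8 threshold are hypothetical. The transcripts are assumed to come from several recognition passes over one qualified voice sample.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Optional


def similarity_score(texts: List[str]) -> float:
    """Mean pairwise similarity of the transcripts, in the range [0, 1].

    Assumption: the source does not name a metric; difflib's ratio is a stand-in.
    """
    if len(texts) < 2:
        return 0.0  # nothing to compare against
    pairs = list(combinations(texts, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def select_text_to_label(texts: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the longest transcript if the recognition passes agree closely enough.

    `threshold` is a hypothetical preset similarity threshold.
    Returns None when the sample should not be sent for manual annotation.
    """
    if similarity_score(texts) > threshold:
        return max(texts, key=len)
    return None


if __name__ == "__main__":
    # Hypothetical outputs of three recognition passes over one voice sample.
    transcripts = ["play some music", "play some music please", "play some music"]
    print(select_text_to_label(transcripts))
```

A sample whose transcripts disagree strongly returns `None` here, so it would not be queued for manual annotation.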

Description

Technical Field

[0001] The present application relates to the technical field of speech recognition, and in particular to a corpus acquisition method and device.

Background Art

[0002] With the rapid development of artificial intelligence, more and more training tasks are based on deep learning. To achieve better model quality, obtaining high-quality data sets at an early stage is particularly important. For human-computer interaction to approach the accuracy of human communication, corpora from vertical fields must be collected as data sets for supervised learning by the recognition engine, so as to obtain a high-quality recognition model. In actual project development, voice data collection accounts for one-third of the entire project development cycle. To speed up development, the efficiency of data labeling must be improved.

[0003] Looking at the voice research and development depa...

Claims
