Corpus selection and processing method, device, equipment and computer-readable storage medium

A corpus selection processing method, applied in the field of corpus screening, that addresses problems such as the selected corpus failing to meet sentence length distribution requirements.

Active Publication Date: 2020-01-14
北京海天瑞声科技股份有限公司

AI Technical Summary

Problems solved by technology

[0005] Embodiments of the present invention provide a corpus selection processing method, device, equipment, and computer-readable storage medium. They address the problem that, in a corpus selected by existing corpus selection methods, the sentence length distribution is far from that of the real corpus, so the selection does not meet the sentence length distribution requirements of corpus design.



Examples

Embodiment 1

[0031] Figure 1 is a flow chart of the corpus selection processing method provided by Embodiment 1 of the present invention; Figure 2 is a schematic diagram of the sentence length distribution of the corpus provided by the embodiment of the present invention. In a corpus selected by existing corpus selection methods, the sentence length distribution is far from that of the real corpus, so the selection does not meet the sentence length distribution requirements of corpus design. To address this problem, the embodiment of the present invention provides a corpus selection processing method.
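To make the notion of a sentence length distribution concrete, the following is a minimal sketch rather than the patented method. It assumes sentence length is the character count of each sentence, and it uses total variation distance to measure how far a selection drifts from the original corpus; both the helper names and the choice of distance are illustrative assumptions that do not come from the patent text.

```python
from collections import Counter


def length_distribution(sentences):
    """Return {sentence_length: fraction of sentences}.

    Assumption: length is measured with len(), i.e. characters per sentence.
    """
    counts = Counter(len(s) for s in sentences)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def total_variation(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint).

    Assumption: the patent excerpt does not name a distance; this is one common choice.
    """
    lengths = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in lengths)


# Toy corpora: each sentence is a string of the given length.
original = ["x" * n for n in (5, 5, 6, 6, 6, 7, 8, 8)]
balanced = ["x" * n for n in (5, 6, 6, 8)]   # roughly follows the original shape
skewed = ["x" * n for n in (8, 8, 8, 8)]     # only long sentences

p = length_distribution(original)
print(total_variation(p, length_distribution(balanced)))
print(total_variation(p, length_distribution(skewed)))
```

A selection such as `skewed` can satisfy a sentence count and a length range while still scoring far from the original distribution, which is the mismatch the embodiment sets out to avoid.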

[0032] The method in this embodiment is applied to a terminal device, which may be a mobile terminal such as a smart phone or a smart speaker, or may be a server, etc. In other embodiments, the method may also be applied to other devices. In this embodiment, a terminal device is taken as an example for schematic il...

Embodiment 2

[0047] Figure 3 is a flow chart of the corpus selection processing method provided by Embodiment 2 of the present invention. On the basis of Embodiment 1 above, in this embodiment the original sentence length distribution can be the sentence length distribution of the original corpus. Selecting from the original corpus, according to the original sentence length distribution, a corpus that meets the sentence number requirement and the sentence length requirement and whose sentence length distribution matches the original sentence length distribution, as the initial sentence length distribution model, includes: obtaining the ratio of the number of target sentences to the number of sentences in the original corpus; calculating the number of sentences of each target sentence length according to the ratio of the number of target sentences to the number of sentences in the original corpus; according to the number of sentences of each target sentence ...
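The ratio step described above can be sketched as follows. This is an illustrative reading, not the patent's exact procedure: the function name `scale_length_counts` is hypothetical, and the largest-remainder rounding used to make the per-length counts sum exactly to the target is an assumption, since the excerpt is truncated before it says how fractional counts are handled.

```python
from collections import Counter


def scale_length_counts(original_lengths, target_sentences):
    """Scale per-length sentence counts by target_sentences / len(original_lengths).

    Returns {sentence_length: number of sentences to select}.
    Rounding uses the largest-remainder rule (an assumption, not from the source).
    """
    counts = Counter(original_lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("original corpus is empty")

    scaled, remainders = {}, {}
    for length, count in counts.items():
        # Integer arithmetic avoids floating-point error in count * ratio.
        scaled[length], remainders[length] = divmod(count * target_sentences, total)

    # Hand out the sentences lost to flooring, largest remainder first.
    leftover = target_sentences - sum(scaled.values())
    by_remainder = sorted(remainders, key=lambda k: (-remainders[k], k))
    for length in by_remainder[:leftover]:
        scaled[length] += 1
    return dict(sorted(scaled.items()))


# Original corpus: 20 sentences of length 5, 30 of length 6, 50 of length 7.
lengths = [5] * 20 + [6] * 30 + [7] * 50
print(scale_length_counts(lengths, 10))
```

Because every count is multiplied by the same ratio, the scaled counts keep the shape of the original distribution while adding up to the requested number of sentences.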

Embodiment 3

[0143] Figure 5 is a schematic structural diagram of a corpus selection processing device provided in Embodiment 3 of the present invention. The corpus selection processing device provided in the embodiment of the present invention can execute the processing flow provided in the corpus selection processing method embodiment. As shown in Figure 5, the corpus selection processing device 30 includes an initial selection module 301 and a correction module 302.

[0144] Specifically, the initial selection module 301 is used to select from the original corpus, according to the original sentence length distribution, a corpus that meets the sentence number and sentence length requirements and matches the original sentence length distribution, as the initial sentence length distribution model.

[0145] The correction module 302 is used to correct the initial sentence length distribution model to obtain a final sentence length distribution model that meets the requirements...
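The excerpt does not show how the correction module adjusts the model, so the following is only a hypothetical sketch of one way such a correction could work. It assumes the model is a mapping from sentence length to sentence count, and it nudges one sentence at a time to an adjacent length until the total word count reaches the target. The function name `correct_model`, the greedy choice of which length to move, and the exact-match stopping rule are all assumptions.

```python
def correct_model(model, target_words, min_len, max_len):
    """Shift sentences between adjacent lengths until sum(length * count) == target_words.

    model: {sentence_length: sentence_count}. The sentence count is preserved,
    and every length stays within [min_len, max_len].
    Hypothetical greedy strategy: always move a sentence out of the most
    populated movable length, which changes the total by exactly one word.
    """
    model = dict(model)
    diff = target_words - sum(length * count for length, count in model.items())
    step = 1 if diff > 0 else -1

    for _ in range(abs(diff)):
        movable = [
            length for length, count in model.items()
            if count > 0 and min_len <= length + step <= max_len
        ]
        if not movable:
            raise ValueError("total word requirement cannot be met within the length bounds")
        source = max(movable, key=lambda length: (model[length], -length))
        model[source] -= 1
        model[source + step] = model.get(source + step, 0) + 1

    return {length: count for length, count in sorted(model.items()) if count > 0}


initial = {5: 2, 6: 3, 7: 5}  # 10 sentences, 63 words
final = correct_model(initial, target_words=65, min_len=5, max_len=8)
print(final)
```

Moving sentences only between neighbouring lengths keeps the corrected model close to the initial one, which is in the spirit of the stated goal that the final distribution stay close to the original sentence length distribution.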


Abstract

The embodiment of the invention provides a corpus selection processing method, device, and equipment, and a computer-readable storage medium. According to the method, a corpus that meets the sentence number requirement and the sentence length requirement and matches the original sentence length distribution is selected from the original corpus according to the original sentence length distribution and is used as an initial sentence length distribution model, so that the sentence length distribution of the initial model is consistent with or very close to the original sentence length distribution. The initial sentence length distribution model is then corrected to obtain a final sentence length distribution model that meets the total word number requirement, the sentence number requirement, and the sentence length requirement. The sentence length distribution of the final model is therefore close to the original sentence length distribution, and the requirement for sentence length distribution in corpus design is met.

Description

Technical field

[0001] Embodiments of the present invention relate to the technical field of corpus screening, and in particular to a corpus selection processing method, device, equipment, and computer-readable storage medium.

Background

[0002] In the fields of speech synthesis, speech recognition, and natural language processing, a large amount of text suited to the specific application scenario must be selected from a corpus as training data for model training. In current corpus design projects, the selected corpus usually has to meet the requirements for the number of sentences and the sentence length specified by the user. In some application scenarios, the user also has requirements for the total number of words in the selected corpus. For example, the sentence length of each sentence in the corpus is required to be controlled within 5-20, the number of sentences is 10,000, the total number of words is 150,000 and the fluctuation of the to...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/211
Inventor: 杨福星, 曹琼, 郝玉峰
Owner: 北京海天瑞声科技股份有限公司