Training data set generation method and device

A training data set generation technique, applied in the field of data processing, that addresses data waste, missing training data information, and the resulting degradation of model training, so as to meet training data requirements, avoid waste, and improve effectiveness.

Active Publication Date: 2021-08-27
BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD

AI Technical Summary

Problems solved by technology

Training data is usually collected by using crawler tools to grab data from webpages, which often contain both text and related pictures. Using only the text data or only the picture data not only wastes data but also leaves the training data incomplete, which affects the model training effect.



Embodiment Construction

[0068] To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

[0069] Embodiments of the present invention provide a method and device for generating a training data set. When training data is grabbed from a webpage, the text data in the body text of the webpage is extracted and, when the body text contains pictures, the pictures are also acquired and recognized to obtain picture text data. A training data set is then generated from the text data in the body text and the picture text data.
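The branch described in [0069] can be illustrated with a minimal Python sketch. It is not the patented implementation: the `PageBody` container, the `build_training_samples` function, and the `recognize` callable (standing in for whatever picture recognition engine is used) are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageBody:
    """Hypothetical container for one crawled page body."""
    text: str
    pictures: List[bytes] = field(default_factory=list)


def build_training_samples(
    body: PageBody,
    recognize: Callable[[bytes], str],
) -> List[str]:
    """Return the text samples contributed by one page body.

    `recognize` is an assumed picture-to-text function supplied by the caller.
    """
    samples = []
    if body.text.strip():
        samples.append(body.text.strip())
    # Recognition runs only when the body actually contains pictures;
    # with no pictures, the samples come from the text data alone.
    for picture in body.pictures:
        picture_text = recognize(picture).strip()
        if picture_text:
            samples.append(picture_text)
    return samples
```

Passing the recognizer in as a parameter keeps the sketch independent of any particular recognition library; a stub such as `lambda data: data.decode("utf-8")` is enough to exercise both branches.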

[0070] Figure 1 is a flowchart of a method for generating a training data set in an embodiment of the present invention. The method includes the following steps:

[0071] Step 101: grab the body text of the webpage.

[0072] Since the training data i...
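As a rough illustration of Step 101 and of detecting whether the grabbed content contains pictures, the sketch below uses Python's standard `html.parser` module. The source text for this step is truncated, so this is only one plausible reading: the `BodyExtractor` class and `extract_body` function are hypothetical, and the sketch assumes the page HTML has already been downloaded as a string and treats all visible text as body text rather than isolating the main content.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect visible text and <img> sources from an HTML string."""

    SKIPPED = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_sources = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_body(html: str):
    """Return (text, image_sources) for one HTML document."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.image_sources
```

An empty `image_sources` list corresponds to the case where no picture is detected, so only the text data would be used.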


Abstract

The invention discloses a method and device for generating a training data set. The method includes: grabbing the body text of a webpage; detecting whether the body text contains a picture; if so, acquiring the picture, recognizing the picture to obtain picture text data, and generating a training data set from the text data in the body text and the picture text data; if not, generating a training data set from the text data in the body text alone. The invention can improve the richness and completeness of training data.

Description

Technical field

[0001] The invention relates to the field of data processing, in particular to a method and device for generating a training data set.

Background art

[0002] Deep learning has enabled many applications of machine learning and expanded the scope of artificial intelligence. Its motivation is to build neural networks that simulate the human brain for analysis and learning, imitating the mechanisms of the human brain to interpret data such as images, sounds, and text. One of the core issues in deep learning is training data: deep learning requires a large amount of data, and the amount of training data plays a key role in how intelligent an artificial intelligence system becomes.

[0003] In the existing technology, training data is generally divided into two types, image data and text data, which are used in different directions of artificial intelligence. For example, text data is used in natural language processing ap...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06K9/62, G06F16/9532, G06K9/32
CPC: G06F16/9532, G06V20/635, G06F18/214
Inventor: 龚艳丽
Owner: BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD