Systems and methods for preparing data for use by machine learning algorithms

Machine learning and data-preparation technology, applied in the field of machine learning, addresses the problems of missing data, typographical errors, and pervasive defects in data integrity and quality, and achieves the effect of improving the accuracy and utility of a primary machine learning algorithm and improving performan...

Inactive Publication Date: 2020-12-24
NEURALSTUDIO SEZC

AI Technical Summary

Benefits of technology

The present invention provides improved systems and methods for preparing data for use in training a primary machine learning algorithm. Instances of missing or invalid data in the historical data are replaced with values generated by data-replacing models that reflect the context of other historical data values related to the output of interest. The resulting training data improves the accuracy and utility of the primary model machine learning algorithm. The invention also helps in identifying clusters of interest in the historical data and in predicting or classifying a phenomenon of interest. Overall, the invention makes the training process more efficient and effective for various applications.

Problems solved by technology

Clustering problems generally employ unsupervised learning algorithms that identify similarities in the data as the model output.
While the data operated on by the algorithm is critically important in machine learning, a machine learning algorithm itself does not directly address issues affecting the validity or integrity of the data it uses (for example, how invalid values for given data points should be handled). Some of the approaches used in the prior art to deal with such data issues independently of the machine learning algorithm are discussed further below.
When applying machine learning algorithms to data produced by typical “real world” data sources, problems with the integrity and quality of the data abound.
Such problems include, but are by no means limited to: data accidentally or deliberately omitted when forms are completed by humans, typographical errors that occur when humans transcribe forms and enter information into a computing system, and errors made when optical character recognition or voice recognition systems process raw data and convert it into a form suitable for use by machine learning algorithms.
Hardware problems can also cause errors when data moves from a source (such as a sensor) to a repository (for example, a database).
Sensors can fail and thus provide no data at all.
The conduit for the data can be “noisy”—electromagnetic interference, simple corrosion of wire terminal connectors, or a faulty or damaged cable can all introduce artefacts—such that the data that is placed in a repository is not an accurate reflection of the information originally produced and transmitted.
Despite the critical importance of data quality in the empirical model (machine learning) development process and significant advances in the empirical modeling algorithms themselves, and considering the huge quantities of raw data now generated every second in systems around the world, there has been little progress in improving the quality of historical data that is to be used by machine learning algorithms to develop primary models of a phenomenon of interest.
Likewise, the prior art has seen the same lack of progress in techniques for preparing new data to be used by a model after it has been placed in service.
Although easy to implement, this approach can have unacceptable ramifications when data quantity is also an issue (that is, when events of interest occur so infrequently that it is important to retain and utilize any data related to them).
The IDRE report discusses more complex replacement schemes, but they generally force an unwarranted assumption of linearity on the data in a particular series.
While these techniques can be moderately effective when a data series is missing a single data point, they are more likely to introduce errors into the final model when used to replace multiple invalid data values in a particular data series.
These approaches especially tend to obscure the context provided by the valid data values in the same record as the invalid data.
Accordingly, known approaches for replacing missing or invalid numeric data values in a particular data record generally fail to adequately take into account the logical and temporal relationships of valid data in the record.
The prior art is thus lacking a way of preparing data for a machine learning algorithm that accounts for missing or invalid data in a way that increases the ability of the model representing the algorithm to generate a more accurate and useful output when the model is placed in service.
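
The limitation attributed above to linear replacement schemes can be illustrated with a small sketch. It is not taken from the patent or from the IDRE report: the series (y = t^2), the helper names linear_fill and max_abs_error, and the gap positions are all hypothetical, chosen only to show that straight-line interpolation errs more when several consecutive values are missing than when one is.

```python
def linear_fill(series):
    """Fill None gaps by straight-line interpolation between the valid neighbours."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i - 1                      # last valid index before the gap
        end = i
        while end < n and filled[end] is None:
            end += 1                       # first valid index after the gap
        if start < 0 or end == n:
            raise ValueError("gap at the edge of the series cannot be interpolated")
        step = (filled[end] - filled[start]) / (end - start)
        for k in range(i, end):
            filled[k] = filled[start] + step * (k - start)
        i = end
    return filled


def max_abs_error(truth, filled):
    """Largest absolute difference between true and replaced values."""
    return max(abs(t - f) for t, f in zip(truth, filled))


# Hypothetical nonlinear series: y = t^2 for t = 0..8
truth = [float(t * t) for t in range(9)]

single_gap = list(truth)
single_gap[4] = None                       # one missing point

multi_gap = list(truth)
for idx in (3, 4, 5):
    multi_gap[idx] = None                  # three consecutive missing points

single_error = max_abs_error(truth, linear_fill(single_gap))
multi_error = max_abs_error(truth, linear_fill(multi_gap))
print(single_error, multi_error)           # prints: 1.0 4.0
```

On this series the single-point gap is off by 1.0, while the three-point gap is off by as much as 4.0, because the straight line ignores the curvature of the underlying data.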


Embodiment Construction

[0041] The description that follows assumes a thorough understanding of the basic theory and principles underlying what is commonly referred to as “machine learning.” A person skilled in the art of machine learning, neural networks, and related principles of mathematical modeling will readily understand it as describing examples of particular embodiments that illustrate various ways of implementing the present subject matter. Accordingly, certain details may be omitted as being unnecessary for enabling such a person to realize the embodiments described herein.

[0042] As those skilled in the art will recognize, in the description of the subject matter disclosed and claimed herein, control circuitry and components described and depicted in the various figures are meant to be exemplary of any electronic computing system capable of performing the functions ascribed to them. Such a computing system will typically include the necessary input/output interface devices and a central processing ...


Abstract

Historical data used to train machine learning algorithms can have thousands of records with hundreds of fields, and inevitably includes faulty data that affects the accuracy and utility of a primary model machine learning algorithm. To improve dataset integrity, the historical data is segregated into a clean dataset having no invalid data values and a faulty dataset having the invalid data values. The clean dataset is used to produce a secondary model machine learning algorithm trained to generate from plural complete data records a replacement value for a single invalid data value in a data record, and a tertiary model machine learning clustering algorithm trained to generate from plural complete data records replacement values for multiple invalid data values. Substituting the replacement data values for invalid data values in the faulty dataset creates augmented training data, which is combined with the clean data to train a more accurate and useful primary model.
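
The flow summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not name specific algorithms, so a k-nearest-neighbour average stands in for the secondary model and a small k-means routine stands in for the tertiary clustering model. All function names (segregate, secondary_fill, kmeans, tertiary_fill, prepare) and the sample records are hypothetical.

```python
import math


def is_valid(v):
    """Treat None and NaN as invalid data values (an assumption of this sketch)."""
    return v is not None and not (isinstance(v, float) and math.isnan(v))


def segregate(records):
    """Split records into a clean dataset and a faulty dataset."""
    clean = [r for r in records if all(is_valid(v) for v in r)]
    faulty = [r for r in records if not all(is_valid(v) for v in r)]
    return clean, faulty


def dist(a, b, fields):
    """Euclidean distance restricted to the given field indices."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in fields))


def secondary_fill(record, clean, k=2):
    """Stand-in secondary model: k-NN average for a single invalid value."""
    missing = [i for i, v in enumerate(record) if not is_valid(v)][0]
    valid = [i for i in range(len(record)) if i != missing]
    nearest = sorted(clean, key=lambda c: dist(record, c, valid))[:k]
    out = list(record)
    out[missing] = sum(c[missing] for c in nearest) / len(nearest)
    return out


def kmeans(clean, k, iters=10):
    """Stand-in tertiary model: plain k-means, seeded with the first k clean records."""
    centroids = [list(c) for c in clean[:k]]
    fields = range(len(clean[0]))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r in clean:
            j = min(range(k), key=lambda c: dist(r, centroids[c], fields))
            groups[j].append(r)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def tertiary_fill(record, centroids):
    """Replace multiple invalid values with those of the nearest cluster centroid."""
    valid = [i for i, v in enumerate(record) if is_valid(v)]
    best = min(centroids, key=lambda c: dist(record, c, valid))
    return [v if is_valid(v) else best[i] for i, v in enumerate(record)]


def prepare(records, k_clusters=2):
    """Return clean records plus augmented (repaired) faulty records."""
    clean, faulty = segregate(records)
    centroids = kmeans(clean, k_clusters)
    augmented = []
    for r in faulty:
        n_bad = sum(not is_valid(v) for v in r)
        augmented.append(secondary_fill(r, clean) if n_bad == 1
                         else tertiary_fill(r, centroids))
    return clean + augmented


# Hypothetical historical data: two obvious groups plus two faulty records.
records = [
    (1.0, 1.0, 2.0),
    (10.0, 10.0, 20.0),
    (1.2, 0.8, 2.0),
    (9.8, 10.2, 20.0),
    (1.1, None, 2.1),      # single invalid value  -> secondary model
    (10.1, None, None),    # multiple invalid values -> tertiary model
]
training = prepare(records)
for row in training:
    print(row)
```

The returned list is what would be handed to the primary model for training. The point of the design is that each replacement value is derived from the valid values in the same record and from complete records that resemble it, so the faulty records are repaired rather than discarded.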

Description

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. provisional application No. 62/620,059, filed Jan. 22, 2018, the entire contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

Field of the Invention

[0002] The present invention relates to machine learning, and more particularly, to systems and methods for improving the integrity and quality of data used in training and applying machine learning algorithms to increase the utility and accuracy of computer implementations and executions of such algorithms.

Description of Related Art

[0003] A mathematical model is a mathematical expression that describes a phenomenon with sufficient accuracy and consistency as to be useful in the real world. There are two basic forms of mathematical models. One is a “first-principles” model, which attempts to describe a phenomenon of interest on the basis of fundamental laws of physics, chemistry, biology, etc. The second is an “empirical” mod...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06N20/00, G06F16/22
CPC: G06N20/00, G06F16/22, G06N20/20, G06N3/084, G06N3/088, G06N5/022, G06N3/045
Inventor: COPPER, JACK
Owner: NEURALSTUDIO SEZC