Data traceability-based large-scale discrete feature mining method

A large-scale discrete feature mining method, applicable to data mining, database models, multi-dimensional databases, etc. It addresses the loss of model iteration efficiency, the inability to use offline-produced features directly in production models, and the large architectural differences between offline and online feature extraction. It unifies the offline data synchronization mechanism, keeps development and maintenance costs low, and makes model production efficient.

Status: Inactive
Publication Date: 2018-07-17
霍尔果斯智融未来信息科技有限公司

AI Technical Summary

Problems solved by technology

Existing approaches to keeping offline and online features consistent each have drawbacks.

[0004] 1. Online and offline systems both obtain data according to the time attributes of the different data, and models are built from offline features. This method can guarantee the efficiency of model production, but it has two disadvantages. First, because online data is updated in real time, timestamps alone cannot, even in theory, fully guarantee that online and offline data are consistent. As a result, the stability and interpretability of the risk control model cannot be guaranteed. Second, the architecture used for offline feature mining differs considerably from the architecture used for online feature extraction. The offline architecture needs additional mechanisms to keep data extraction as consistent as possible, and each new data source carries a corresponding development cost.

[0005] 2. Offline feature production is used only for testing, and the models used online are built from features dumped online. This method can guarantee the stability and interpretability of the risk control model, but it also has a disadvantage: features produced offline cannot be used directly in the production model, and all model production must wait for online features to accumulate, which costs model iteration efficiency.
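The first disadvantage of approach 1 can be illustrated with a small sketch. It shows one possible failure mode only, under the assumption that the online source overwrites records in place; the `MutableSource` class, the key, and the timestamps are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: int        # e.g. number of loan applications on record (hypothetical)
    updated_at: int   # timestamp of the last write


class MutableSource:
    """Hypothetical online data source that overwrites records in place."""

    def __init__(self):
        self._rows = {}

    def write(self, key, value, ts):
        self._rows[key] = Row(value, ts)

    def read(self, key):
        # What an online request sees at the moment it is served.
        return self._rows[key].value

    def read_as_of(self, key, ts):
        """Timestamp-based reconstruction. It only works if the row has not
        been modified after `ts`; an overwritten value is gone."""
        row = self._rows[key]
        return row.value if row.updated_at <= ts else None


source = MutableSource()
source.write("user-1", 2, ts=100)
online_value = source.read("user-1")                  # request served at ts=105
source.write("user-1", 5, ts=110)                     # later real-time update
offline_value = source.read_as_of("user-1", ts=105)   # cannot be recovered
```

Here the online request used the value 2, but offline research run after the update has no way to recover it from timestamps alone, so offline and online features can diverge.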

Examples


Embodiment 1

[0021] A large-scale discrete feature mining method with traceable data. Online requests and offline research use the same feature calculation lib. The raw data snapshots used for online feature calculation are saved in full through a cache, which ensures that the data used in offline research is consistent with the data used online at that time. When a new feature mining idea requires mining new features from earlier data, only the feature calculation lib needs to be updated, and a model can then be built from more data samples according to the large-scale discrete feature mining architecture.
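The idea in the paragraph above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `compute_features`, `snapshot_cache`, the raw data layout, and the feature names are all assumptions made for the example.

```python
import copy


def compute_features(raw):
    """Hypothetical shared feature calculation lib: a pure function of raw data."""
    calls = raw["call_records"]
    return {
        "call_count": len(calls),
        "distinct_callees": len({c["to"] for c in calls}),
    }


# Stand-in for the cache: order id -> raw data snapshot.
snapshot_cache = {}


def handle_online_request(order_id, raw):
    # Save the exact raw data used for this request, then compute features.
    snapshot_cache[order_id] = copy.deepcopy(raw)
    return compute_features(raw)


def replay_offline(order_id):
    # Offline research recomputes from the stored snapshot with the same lib.
    return compute_features(snapshot_cache[order_id])


raw = {"call_records": [{"to": "A"}, {"to": "B"}, {"to": "A"}]}
online = handle_online_request("order-1", raw)
raw["call_records"].append({"to": "C"})   # the source keeps changing afterwards
offline = replay_offline("order-1")
```

Because the offline path reads the saved snapshot rather than the live source, `offline` matches `online` even though the underlying data changed after the request.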

[0022] First, data acquisition must be kept separate from data calculation. The feature calculation lib takes raw data as its input and outputs features. The cache can be implemented on different storage media (such as mongo, redis, etc.) as needed. The data warehouse can be built on the hadoop system (including: hd...
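One way to picture this separation is the sketch below. It assumes a small storage interface that any backend could satisfy; `SnapshotStore`, `InMemoryStore`, `acquire`, `calculate`, and `serve` are hypothetical names, and the in-memory backend merely stands in for whichever storage medium is chosen.

```python
import json
from typing import Protocol


class SnapshotStore(Protocol):
    """Hypothetical interface; any storage medium could implement it."""

    def put(self, key: str, raw: dict) -> None: ...
    def get(self, key: str) -> dict: ...


class InMemoryStore:
    def __init__(self):
        self._data = {}

    def put(self, key, raw):
        # Serialize so that later mutation of `raw` cannot leak into the snapshot.
        self._data[key] = json.dumps(raw)

    def get(self, key):
        return json.loads(self._data[key])


def acquire(order: dict) -> dict:
    """Data acquisition: the only step that touches data sources (stubbed here)."""
    return {"device_id": order["device_id"], "logins": [1, 2, 3]}


def calculate(raw: dict) -> dict:
    """Data calculation: depends on nothing but the raw data passed in."""
    return {"login_count": len(raw["logins"])}


def serve(order: dict, store: SnapshotStore) -> dict:
    raw = acquire(order)
    store.put(order["order_id"], raw)
    return calculate(raw)


store = InMemoryStore()
features = serve({"order_id": "o1", "device_id": "d1"}, store)
replayed = calculate(store.get("o1"))
```

Since `calculate` never fetches data itself, swapping the storage backend or replaying a snapshot offline does not change the feature logic.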

Embodiment 2

[0024] As a preferred structure of Embodiment 1, the large-scale discrete feature mining framework includes an offline system and an online system. The offline system consists of a data warehouse, an offline feature mining system, and a model offline training system. The offline feature mining system loads the feature calculation lib to mine new features from the data warehouse, and the model offline training system uses the new features for model training. The online system is divided into three layers: a business layer, a feature layer, and a data storage layer. The business layer includes a business system, a risk control decision-making system, and an online estimation system. The business system sends the basic information of the order (including: order id, mobile phone number, device number, ID card number, etc.); the corresponding raw data is obtained from the data storage layer and processed through the feature processing system to obtain th...

Embodiment 3

[0026] As an application scheme of Embodiment 1, as shown in figure 1, it includes the following steps:
1. Build a risk control system based on Internet big data, including a data collection section, a data storage terminal, a risk control rule system, a feature calculation system, and a model estimation system.
2. Build an offline feature and model processing system, including a data warehouse, an offline feature mining system, and a model offline training system.
3. The online feature calculation system and the offline feature mining system use the same feature calculation lib.
4. All online data is stored in the offline data warehouse as snapshots.
5. To implement new feature mining, only the feature calculation lib needs to be updated; data mining and model building are then performed offline.
6. The newly produced model and the updated feature calculation lib can be launched at the same time, so that the new features are applied online.
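Steps 4 and 5 can be sketched as a backfill over stored snapshots. The example is illustrative only: the two lib versions, the `warehouse` list, the feature names, and the presence of labels alongside each snapshot are assumptions, not details from the patent.

```python
def feature_lib_v1(raw):
    """Hypothetical current feature calculation lib."""
    return {"order_amount": raw["amount"]}


def feature_lib_v2(raw):
    """Hypothetical updated lib: keeps the old features and adds a new one."""
    feats = feature_lib_v1(raw)
    # New feature idea: number of distinct devices seen for this applicant.
    feats["device_count"] = len(set(raw["devices"]))
    return feats


# Stand-in for the offline data warehouse: raw snapshots saved at request
# time, paired here with assumed training labels.
warehouse = [
    {"raw": {"amount": 1000, "devices": ["d1"]}, "label": 0},
    {"raw": {"amount": 5000, "devices": ["d1", "d2", "d2"]}, "label": 1},
]


def build_training_set(feature_lib, snapshots):
    # Recompute features for every historical sample with the given lib.
    return [(feature_lib(s["raw"]), s["label"]) for s in snapshots]


training_set = build_training_set(feature_lib_v2, warehouse)
```

The new feature is available for every historical sample immediately, without waiting for online features to accumulate, and the same `feature_lib_v2` can be deployed online together with the model trained on `training_set`.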

[0027] Compared with the prior art, the pre...

Abstract

The invention discloses a data traceability-based large-scale discrete feature mining method. The same feature calculation lib is used for online requests and offline research. The raw data snapshots used in online feature calculation are stored in full through a cache, which guarantees that the data used in offline research is consistent with the data used online at that time. When a new feature mining idea requires new features to be mined from previous data, only the feature calculation lib needs to be updated, and a model is built from more data samples according to the large-scale discrete feature mining architecture. The consistency of the data used in online and offline feature mining can be guaranteed; model production does not depend on online features; release can take place as soon as offline research is finished; model production efficiency is high; offline data synchronization mechanisms are unified; and development and maintenance costs are low.

Description

Technical field

[0001] The invention relates to a large-scale discrete feature mining method with traceable data, which can be widely used in the field of financial risk control based on machine learning technology.

Background technique

[0002] Generally, in a financial risk control system based on machine learning technology, feature production is divided into two parts: online and offline. Ensuring that the features calculated offline are consistent with those calculated online is a prerequisite for the stability and interpretability of the risk control model.

[0003] Currently, there are generally two approaches:

[0004] 1. Online and offline systems both obtain data according to the time attributes of the different data, and models are built from offline features. This method can guarantee the efficiency of model production, but it has the following two disadvantages: 1. Due to the real-time update characteristics of online data, the consistency of online and o...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06Q40/02
CPC: G06F16/283, G06F2216/03, G06Q40/03
Inventor: 郭安
Owner: 霍尔果斯智融未来信息科技有限公司