Data cleaning method for data files and data files processing method

A technology concerning data files and their processing methods, applied in the field of electrical digital data processing, special data processing applications, instruments, and the like. It addresses the problem that source data no longer meets the needs of statistical analysis, and achieves the effects of reducing processing burden, ensuring data integrity, and saving storage space.

Inactive Publication Date: 2015-02-18
BANK OF CHINA


Problems solved by technology

With the rapid development of the Bank of China's business, business processing is becoming more and more complex, business processes are constantly updated, and business systems are ever more closely interrelated. Senior management's demand for data statistics and analysis ...


Abstract

The invention discloses a data cleaning method for data files and a data file processing method. The data cleaning method comprises the following steps: S2, determining the cleaning content and cleaning rules of each data file according to the data definitions of the source-system data tables, the predetermined unified data requirements of the data download platform, and the unified data requirements of the analysis-class systems, and compiling the cleaning process steps; S4, generating a cleaning configuration file corresponding to each data file according to its cleaning content and cleaning rules; S6, cleaning each data file according to its cleaning configuration file, following the cleaning process steps. The beneficial effects of the disclosed data cleaning method are that data files from different source systems are extracted and uniformly cleaned, and are then presented and shared in a uniform manner, so that a uniform view is provided for subsequent data processing at all levels and the processing burden on the source systems is reduced.

Examples

Example Embodiment

[0037] The present invention will be described in detail below in conjunction with the accompanying drawings.
[0038] According to an embodiment of the present invention, a data cleaning method for data files is provided, wherein the data files come from various source systems and may also be called source files, source file data, or source-system data files. Figure 1 is a flowchart of a data cleaning method for data files according to an embodiment of the present invention; the data cleaning method includes:
[0039] Step S2: Determine the cleaning content and cleaning rules of each data file according to the data definitions of the source-system data tables, the predetermined unified data requirements of the data download platform, and the unified data requirements of the analysis systems, and compile the cleaning process steps. Specifically, the data download platform is the download intermediary through which the data files of each source system are obtained, and its requirements on data files may be various normative requirements. Since the cleaned data files are used by analysis systems, the data requirements of the analysis systems should also serve as one of the bases for determining the cleaning rules. "Analysis" here refers to data analysis; the data sources of such systems often come from multiple systems, and data of the same nature often have different formats and representations. Dates, for example, may appear as yyyy-mm-dd (four-digit year, month, day) in the data files of some source systems, as yyyymmdd in others, and as mm-dd-yy (month, day, two-digit year) in still others. Likewise, an empty value may be represented in some systems' data files by simply omitting the value, while other systems write the literal NULL. The unified data requirement is to unify these different formats and representations, for example unifying all dates into the yyyymmdd format for subsequent data processing;
[0040] Step S4: Generate a cleaning configuration file corresponding to each data file according to the cleaning content and cleaning rules of that data file. Alternatively, the cleaning configuration file may be set manually according to the cleaning rules and cleaning content;
[0041] Step S6: Clean each data file according to its cleaning configuration file, following the cleaning process steps, so as to ensure that the cleaned data meets the unified data requirements of the analysis systems.
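As an illustrative sketch of the format unification described in step S2 (not the patent's implementation; the function names and the century pivot for two-digit years are assumptions), the date and empty-value normalization might look like:

```python
import re

def normalize_date(value: str) -> str:
    """Normalize dates from the source formats named above to yyyymmdd."""
    v = value.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):        # yyyy-mm-dd
        return v.replace("-", "")
    if re.fullmatch(r"\d{8}", v):                    # already yyyymmdd
        return v
    m = re.fullmatch(r"(\d{2})-(\d{2})-(\d{2})", v)  # mm-dd-yy
    if m:
        mm, dd, yy = m.groups()
        century = "19" if int(yy) > 50 else "20"     # assumed pivot year
        return century + yy + mm + dd
    return "99991231"  # the sentinel the patent uses for abnormal dates

def normalize_empty(value: str) -> str:
    """Treat both a blank field and the literal NULL as empty."""
    return "" if value.strip() in ("", "NULL") else value
```

For example, `normalize_date("2010-12-31")` and `normalize_date("12-31-10")` would both yield `20101231` under these assumptions.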
[0042] Preferably, the cleaning content includes at least one of the following: a field-order matching check between the cleaned file and the data file; length, type, and non-null checks on primary-key fields; type checks on non-primary-key fields (which may include, but are not limited to, length checks and whitespace checks); type checks on non-primary-key date fields; handling of invisible characters; and checks on fields of specific types. All non-primary-key fields may also be allowed to be null. In addition, because the cleaning rules are general rules, it is difficult for them to cover the complicated business rules of practical applications; therefore business rules are not checked during cleaning, and only technical checks are performed. The correctness of specific business rules is guaranteed by the business systems, and that part of the checking is performed in the database.
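A minimal sketch of the technical (non-business) checks listed above, assuming hypothetical field metadata (`max_len`, a `NUMBER` type tag) not specified in the patent:

```python
def check_primary_key(value: str, max_len: int, ftype: str) -> bool:
    """Primary-key fields: non-null check, length check, type check."""
    if value.strip() == "":
        return False                              # non-null check fails
    if len(value) > max_len:
        return False                              # length check fails
    if ftype == "NUMBER" and not value.strip().isdigit():
        return False                              # type check fails
    return True

def clean_numeric(value: str) -> str:
    """Non-primary-key numeric fields: spaces are removed automatically."""
    return value.replace(" ", "")
```

A record would pass only if every primary-key field satisfies all three checks; numeric non-key fields are merely tidied rather than rejected.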
[0043] Preferably, step S6 may further write any error that occurs into a log, so as to provide a reference, such as for later retrieval, to administrators.
[0044] According to an embodiment of the present invention, a data file processing method is also provided, and the data file processing method includes:
[0045] Step Q1: First judge whether the read data file is in a checkable state and, if so, check whether the control information of the read data file is correct;
[0046] Step Q2: When the control information of the read data file is correct, clean the data according to the steps of the data cleaning method in the foregoing embodiment.
[0047] Preferably, before step Q1, the data file processing method may further include:
[0048] Step Q01: Store the data files from the source systems into the local database, in a compressed state, via the data download platform. For all operations on the data files, the local database must be successfully connected before the data files are read;
[0049] Step Q02: Obtain and decompress the data files to be cleaned by connecting to the local database. Because data files are very large, keeping them in a compressed state saves storage space and protects the integrity of the data files.
[0050] Preferably, after step Q2, the method further includes step Q3: loading the cleaned data file.
[0051] Preferably, step Q2 includes step Q21: compressing the cleaned data file;
[0052] and step Q3 then further includes step Q31: decompressing the compressed data file and loading the decompressed data file. Compressing the cleaned data files likewise serves the purposes of storage space and data integrity.
[0053] Preferably, in step Q31, loading the decompressed data file includes:
[0054] obtaining a control file corresponding to the decompressed data file; and/or
[0055] in the case of an acquisition failure, generating a corresponding control file for the decompressed data file according to the control-file content and format predetermined by the database. Here, the control file is an information file required by the database when loading a data file into a database table: each data file corresponds to one control file, which defines how the contents of the file are loaded into the database table, so the control file is necessary. The content and format of the control file are predetermined by the database used, and the fields of the data file correspond one-to-one to the fields of the database table; therefore, when the control file is missing, a control file corresponding to the data file can be generated automatically by a program.
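Since the embodiment mentions an Oracle database, automatic control-file generation might resemble emitting a minimal SQL*Loader-style control file. The sketch below is an assumption-laden illustration: the delimiter, the exact clause layout, and the function name are not specified by the patent, which states only that data-file fields map one-to-one onto table columns.

```python
def generate_control_file(data_file, table, fields):
    """Emit a minimal SQL*Loader-style control file for a data file whose
    fields map one-to-one onto the target table's columns.
    (Hypothetical: the '|' delimiter and layout are assumptions.)"""
    lines = [
        "LOAD DATA",
        "INFILE '%s'" % data_file,
        "INTO TABLE %s" % table,
        "FIELDS TERMINATED BY '|'",
        "(" + ", ".join(fields) + ")",
    ]
    return "\n".join(lines)
```

For instance, `generate_control_file("0100006D.k01", "C_SACTCAP", ["ACCT_NO", "BALANCE"])` would produce a five-line control file targeting the `C_SACTCAP` table mentioned later in the configuration example.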
[0056] Preferably, after step Q3, the data file processing method further includes:
[0057] traversing the data files in the local database and processing them according to steps Q1 to Q3. Traversal here means that all data files are cleaned, and the cleaning success rate can be counted and recorded.
[0058] Figure 2 is a general flowchart of a data file processing method according to an embodiment of the present invention, and Figures 3 to 5 are flowcharts of the checking, cleaning, and loading of data files within the process of Figure 2. The two functions of file checking and file loading are supporting functions related to file cleaning; that is, when processing data files, file checking is performed first, then data cleaning, and finally file loading. The process is described in detail below with reference to Figures 2 to 5. As shown in Figure 2, first judge whether there are files (i.e. data files) in the cleaning system. If there are, read the status of the current file. A set of statuses can be customized by the system: after each data file undergoes a different operation it is placed in a different status, so that each time a data file is read, the next operation is determined by its current status. After the operation is completed, the next record is read. If there is currently no file, read the file lists of the folders up_err (error files), up_back (returned files), and up_duplicate (duplicate files) in the database and determine whether these folders contain files. If not, exit; if so, judge whether each file has expired (the expiration period can be preset as needed: half a year, one year, or several years). If a file has expired, delete it, and then repeatedly read the file lists of up_err, up_back, and up_duplicate in the database until all expired files in the lists are deleted.
[0059] According to an embodiment of the present invention, six statuses are preferably defined for the current-state record of a data file; of course, increasing or decreasing the number of statuses also falls within the protection scope of the present invention (this text also describes some other file statuses), and status names need only distinguish the different statuses and are not necessarily limited to numbers. The six statuses may be 1000, 2000, 3000, 4000, 4200, and 5000. Because data files generally arrive at the cleaning system in a compressed state in order to save network traffic, the first step of file cleaning must be to decompress the compressed data files. The six statuses are handled as follows:
[0060] Status 1: When the status read for the data file is 1000, it indicates that the system process has scanned the data file in its compressed state. It is then judged whether the data file exists; if not, the status of the data file is set to 1XXX (the XXX part can be defined freely, as long as it is distinguished from other statuses) and the process exits with an error. If the file exists, the decompression program is called to decompress it;
[0061] Status 2: When the status read for the data file is 2000, it indicates that the system process has successfully decompressed the data file. It is then judged whether the data file exists; if not, the status is set to 2XXX and the process exits with an error. If it exists, the checking program is called to check the data file. Figure 3 is a flowchart of the steps of checking the decompressed data file according to an embodiment of the present invention. First, the checking program connects to the database (i.e. the local database described herein, in which the data file is temporarily stored after arriving at the cleaning system; preferably an Oracle database may be used, and multiple databases may also be set up). If the connection fails, the program ends. If it succeeds, the current data file is opened; if opening fails, the status of the data file is set to 2005 and the program ends. If the data file is opened successfully, its control information is read; if reading fails, the status is set to 2001 and the program ends. If reading succeeds, the file control information (information that accompanies the data file) is checked, and different types of checks can be performed according to the type of the control information. If a check fails, the program ends and a status corresponding to the check type is set. If the check succeeds, the file status is updated, for example from 2000 to 3000; if the update fails, the status is set to 2006 and the program ends. If the update succeeds, the file is closed, the database connection is disconnected, and the program ends;
[0062] Status 3: When the status read for the data file is 3000, it indicates that the system process has successfully checked the data file. It is then judged whether the data file exists; if not, the status is set to 3XXX and the process exits with an error. If it exists, the cleaning process steps can be called to clean the data file. Figure 4 is a flowchart of the steps of cleaning the successfully checked data file according to an embodiment of the present invention. First connect to the database; if the connection succeeds, obtain the file paths of the various logs; then obtain the configuration file required for file cleaning; then open the data file; and if that succeeds, check whether the format of the configuration file (a file with the suffix .frm) is correct. If any of these five operations fails, the program terminates. If the format check succeeds, a record is opened (if opening a record fails, the file is closed). The record is then cleaned: specifically, data that does not conform to the rules is converted, according to the configuration file, into data that meets the cleaning rules. If the conversion succeeds, the cleaned data is written into the cleaned file. If the conversion fails, the failure is written to the cleaning log, and the re-cleaning function is entered to clean the record again. If re-cleaning succeeds, it is written to the re-cleaning log, the number of re-cleaned records is recorded, and the re-cleaned data is written to the cleaned file; when the write succeeds, the next record is opened. If re-cleaning also fails, the record is skipped and the next record is opened. After the next record is opened successfully, the above checks and conversions are repeated record by record according to the obtained configuration file. If opening the next record fails, the file is closed, the database connection is disconnected, and after the disconnection succeeds the program ends. Once all files have been cleaned there is no next record, meaning cleaning is complete, and the program ends.
The specific rules of file cleaning are as follows: based on the name of the ODS (Operational Data Store, an optional part of a data warehouse architecture holding "subject-oriented, integrated, current or near-current, constantly changing" data) table corresponding to each file, find the configuration file name corresponding to the file, open the file to be cleaned, and check every record and every field according to the configuration file. The check results generally fall into three cases. Case 1: if the record meets the requirements of the configuration file (i.e. it satisfies the "data file content cleaning rules" herein — specifically, it conforms to the configuration file's definitions of primary key, data type, empty or non-empty, and so on), the record is written into the cleaned file. Case 2: if it does not meet the requirements of the configuration file, it is converted according to the format requirements of the configuration file. Case 3: if there is an exception — for example, a field is defined as a number in the configuration file but characters appear in the corresponding part of a row of the data file — then that row is recorded as an exception and cannot be converted; records that cannot be converted are thrown out. After cleaning, count the numbers of records cleaned, converted, and thrown out, and compute the cleaning success rate. Finally, the status of the cleaned data file is set to 4000 (not shown in Figure 4);
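The three check outcomes described above might be sketched as a single per-record classifier. This is only an illustration under stated assumptions: the per-field `spec` metadata, the function name, and the choice of converting a blank numeric field to "0" are inventions for the example, not part of the patent.

```python
def clean_record(fields, spec):
    """Classify one record into the three cases described above:
    ('ok', cleaned)        - case 1: record meets the configuration file
    ('converted', cleaned) - case 2: record was converted to conform
    ('rejected', None)     - case 3: exception, record is thrown out
    `spec` is a hypothetical list of (field_type, max_len) pairs."""
    out, converted = [], False
    for value, (ftype, max_len) in zip(fields, spec):
        v = value.strip()
        if ftype == "NUMBER" and not v.isdigit():
            if v == "":
                v, converted = "0", True  # assumed conversion for blanks
            else:
                return "rejected", None   # characters in a numeric field
        if len(v) > max_len:
            return "rejected", None
        out.append(v)
    return ("converted" if converted else "ok"), out
```

Counting how many records fall into each of the three return categories directly yields the cleaned/converted/thrown-out statistics and the success rate mentioned in the paragraph.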
[0063] Status 4: When the status read for the data file is 4200, it indicates that the system process has successfully cleaned the data file and compressed it. Specifically, a cleaned file may be loaded at once, or for some reason may not be loaded immediately; if it is not loaded immediately, the file is compressed in order to save disk space, and the status of the compressed data file is set to 4200;
[0064] Status 5: When the status read for the data file is 5000, it indicates that the system process needs to reload a data file whose status was 4200: after such a data file is decompressed, its status is set to 5000, and the loader can then be entered. Figure 5 is a flowchart of the steps of loading the cleaned data files (including data files loaded directly after successful cleaning and those that have passed through status 4200) according to an embodiment of the present invention. First connect to the database, then obtain the path of each log file, then obtain the control file required for loading; if that fails, automatically generate the control file. With the control file in place, delete the data of the current area in the current ODS table, then load the data file, and then check the loading log to determine whether loading succeeded. If it succeeded, update the file status (not shown in the figure; it is simply updated to another status indicating that the data file was loaded successfully), disconnect from the database, and end the program. If any loading step fails during these operations, a different status can be set and the program ended.
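The status-driven dispatch across the six statuses above can be sketched as a simple table-driven state machine. The handler names and their string return values are hypothetical (the patent defines only the numeric statuses and their meanings):

```python
# Hypothetical handlers; in a real system each would perform the operation.
def decompress(f): return "decompress " + f
def check(f):      return "check " + f
def clean(f):      return "clean " + f
def compress(f):   return "compress " + f
def load(f):       return "load " + f

STATUS_ACTIONS = {
    1000: decompress,  # compressed file scanned     -> decompress it
    2000: check,       # decompressed successfully   -> check control information
    3000: clean,       # checked successfully        -> run the cleaning steps
    4000: compress,    # cleaned, not loaded at once -> compress to save disk
    4200: decompress,  # cleaned and re-compressed   -> decompress for loading
    5000: load,        # decompressed after 4200     -> load into the database
}

def process(status: int, data_file: str) -> str:
    """Dispatch the next operation for a data file based on its status."""
    try:
        return STATUS_ACTIONS[status](data_file)
    except KeyError:
        raise ValueError("unknown file status %d" % status) from None
```

Error statuses such as 1XXX, 2001, 2005, or 2006 would simply be absent from the table, so an unexpected status raises instead of silently proceeding.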
[0065] The following describes, according to an embodiment of the present invention, the application of the file cleaning step of the present invention in practice:
[0066] The scheduler invokes the program cleandata.pc; the specific configuration information of this program includes:
[0067] Cleandata: /imrswork/bin/cleandata
[0068] Date: 20101231
[0069] Job execution serial number: 15608
[0070] Area Number: 40004 C_SACTCAP
[0071] Job numbering system name: BANCS
[0072] File name: 0100006D.k01
[0073] SYS_TABPARA table configuration
[0074] FRM, PRM file configuration
[0075] The above configuration files are predefined configuration files used when cleaning files and contain information such as the cleaning rules for each data item.
[0076] The cleaning rules for data file content include the following items:
[0077] 1. The order of the fields of the cleaned file is the same as that of the corresponding fields of the data file. If the corresponding ODS table is configured in the SYS_TABPARA table and an area number needs to be added during cleaning, the area number corresponding to the file is automatically added as the first field of each row during cleaning;
[0078] 2. Perform length, type, and non-null checks on primary-key fields;
[0079] 3. Check the type of non-primary-key numeric fields; if a field contains spaces, they are automatically removed;
[0080] 4. Check the type of non-primary-key date fields and convert them to YYYYMMDD format. If the value of a date field is blank or empty, or is some other abnormal date type, it is uniformly converted to "99991231";
[0081] 5. Character-type non-primary-key fields are not checked;
[0082] 6. All non-primary-key fields may be empty;
[0083] 7. Processing of invisible characters: delete garbled characters contained in each field (delete characters whose ASCII code is less than 32);
[0084] 8. For fields configured as special types in the configuration file, special checks are performed. For example, the TRMSTR type removes trailing spaces from character-type fields, and the DELTAB type deletes TAB characters from character-type fields.
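Rules 7 and 8 are mechanical enough to sketch directly (function names are assumptions; the TRMSTR and DELTAB type tags come from the patent):

```python
def strip_invisible(value: str) -> str:
    """Rule 7: delete characters whose ASCII code is less than 32."""
    return "".join(ch for ch in value if ord(ch) >= 32)

def apply_special_type(value: str, special: str) -> str:
    """Rule 8: special-type handling as named in the configuration file."""
    if special == "TRMSTR":          # remove trailing spaces
        return value.rstrip(" ")
    if special == "DELTAB":          # delete TAB characters
        return value.replace("\t", "")
    return value
```

Note that rule 7 already removes TABs (ASCII 9), so DELTAB matters mainly for fields where rule 7 is not applied.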
[0085] The cleaning rule for the file control information is: the last 14 lines of each downloaded file are defined as file control information, and these records are ignored and not cleaned.
[0086] The inspection principles of the cleaning process include: business rules are not checked during the cleaning inspection stage, and only technical checks are performed; the correctness of business rules is guaranteed by the business systems, and that part of the checking is carried out in the database.
[0087] The effect of the data file cleaning function of the technical solution of the present invention in analysis systems such as data warehouses is quite evident, mainly in that: 1. the different data of each source system are uniformly sorted, and the data files are presented in a uniform manner; and 2. a unified view is provided for subsequent data processing at all levels.
[0088] The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the protection scope of the present invention is defined by the claims. Those skilled in the art can make various modifications or equivalent replacements to the present invention within the spirit and protection scope of the present invention, and such modifications or equivalent replacements should also be deemed to fall within the protection scope of the present invention.

