Storage file filtering method and apparatus

A method for filtering files as they are imported into storage, applied in the field of data processing. It addresses duplicate files that fail to load and the low traversal efficiency of per-program duplicate checks, for which no effective solution had been proposed, and thereby reduces abnormal situations without reducing storage efficiency.

Active Publication Date: 2017-08-22
GUANGZHOU AIYOU INFORMATION TECH

AI Technical Summary

Problems solved by technology

In the prior art, each business program must check, when processing a file, whether that file has already been loaded; a duplicate file is not loaded.
The disadvantage is that every business program needs its own module for identifying duplicate files, and traversal is inefficient when the file comparison involves cluster files.
[0004] No effective solution to this problem of repeated data import has been proposed yet.


Examples

Embodiment 1

[0029] Embodiment 1 of the present invention provides a filtering method for storage files, implemented with the data warehouse tool Hive. As shown in the schematic flow chart of Figure 1, the method includes the following steps:

[0030] Step S11: use the data warehouse tool Hive to obtain the check code of the current storage file.

[0031] The method provided in this embodiment applies when Hive imports data; the current storage file is the source file to be imported. The check code is a value computed from each file that uniquely corresponds to its content, for example a checksum or the output of another algorithm such as a hash function. It shows whether the content of a file is the same as that of another file, regardless of whether the two file names are the same.
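A minimal sketch of a content-based check code follows. It assumes SHA-256 as the hash function and a local file system in place of Hive; the function name `file_check_code` and the sample file names are hypothetical and do not come from the patent. It illustrates that two files with different names but identical content produce the same check code.

```python
import hashlib
import os
import tempfile


def file_check_code(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a hex SHA-256 digest of the file's bytes.

    Only the content is hashed, so the file name has no influence.
    SHA-256 is an assumption; the source only says "checksum or hash".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _write(directory: str, name: str, data: bytes) -> str:
    """Helper for the demo: create a file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    a = _write(tmp, "orders_monday.csv", b"id,amount\n1,10\n")
    b = _write(tmp, "orders_copy.csv", b"id,amount\n1,10\n")
    c = _write(tmp, "orders_tuesday.csv", b"id,amount\n2,25\n")
    same_content_matches = file_check_code(a) == file_check_code(b)
    different_content_matches = file_check_code(a) == file_check_code(c)
```

Here `same_content_matches` is true although the two file names differ, and `different_content_matches` is false.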

[0032] Step S12, searching whether there is a check code consistent with the check code of the cur...

Embodiment 2

[0047] Embodiment 2 of the present invention provides a method for filtering files stored in a database, using a checksum as the check code for illustration. As shown in the schematic flow chart of Figure 2, the method includes the following steps:

[0048] Step S21: obtain the checksum of the current storage file through the Hive CheckSum function of the data warehouse tool.

[0049] First, the program uses Hive to execute the load operation and obtains the checksum of the current storage file (that is, the source file) through the Hive CheckSum function. The checksum must be calculated before the source file is stored in the database. The specific calculation process belongs to the prior art and is not repeated here.

[0050] Step S22: use the checksum of the current storage file as the check code of the current storage file.
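Steps S21 and S22 can be sketched as follows. This is only an illustration: it uses CRC-32 from Python's `zlib` module as a stand-in checksum, not the Hive CheckSum function named in the text, and the function names `crc32_checksum` and `check_code_from_checksum` are hypothetical.

```python
import io
import zlib
from typing import BinaryIO


def crc32_checksum(stream: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Step S21 (stand-in): compute a running CRC-32 over a byte stream."""
    crc = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Passing the previous value continues the checksum across chunks.
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def check_code_from_checksum(checksum: int) -> str:
    """Step S22: use the checksum itself as the check code (as 8 hex digits)."""
    return f"{checksum:08x}"


# Chunked and single-pass computation agree on the same content.
data = b"123456789"
chunked = crc32_checksum(io.BytesIO(data), chunk_size=4)
whole = crc32_checksum(io.BytesIO(data))
code = check_code_from_checksum(chunked)
```

A 32-bit checksum is cheap to compute but can collide on different content, so a longer hash may be preferable where false duplicates must be avoided.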

[0051] In step S23,...

Embodiment 3

[0069] Embodiment 3 of the present invention provides a filtering device for storage files, implemented with the data warehouse tool Hive. As shown in the schematic structural diagram of Figure 4, the device includes a check code acquisition module 410, a check code search module 420, a discarding module 430 and an import module 440. The functions of the modules are as follows:

[0070] The check code acquisition module 410 is used to obtain the check code of the current storage file using the data warehouse tool Hive;

[0071] The check code search module 420 is used to search the check codes pre-stored in the target directory for a check code consistent with that of the current storage file; the pre-stored check codes are the check codes of files already imported into the target directory;

[0072] The discarding module 430 is used to discard the current storage file if a consistent check code is found;

[0073] The import module 440 is configure...

Abstract

The present invention provides a storage file filtering method and apparatus, and relates to the technical field of data processing. The method comprises: obtaining a check code of a current storage file by using the data warehouse tool Hive; searching the check codes pre-stored in a target directory for a check code that matches the check code of the current storage file, wherein the pre-stored check codes are the check codes of files that have already been imported to the target directory; if a match is found, discarding the current storage file; and if not, importing the current storage file and writing its check code into the target directory. In the storage file filtering method provided by embodiments of the present invention, the check code of the current storage file is obtained and compared with the check codes already stored in the target directory. A matching check code in the target directory shows that the current storage file is a duplicate, so duplicate storage files are filtered without any modification of the service application.
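The flow described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: a local directory stands in for the Hive target directory, SHA-256 stands in for the check code algorithm, and the names `CODE_FILE`, `check_code`, `load_codes` and `filter_and_import` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Set

CODE_FILE = ".check_codes"  # hypothetical file holding the stored check codes


def check_code(path: Path) -> str:
    """Obtain the check code of the current storage file (SHA-256 assumed)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_codes(target_dir: Path) -> Set[str]:
    """Read the check codes of files already imported into the target directory."""
    code_file = target_dir / CODE_FILE
    if not code_file.exists():
        return set()
    return set(code_file.read_text().split())


def filter_and_import(source: Path, target_dir: Path) -> bool:
    """Import the file unless its check code is already stored.

    Returns True if the file was imported, False if it was discarded.
    """
    code = check_code(source)
    if code in load_codes(target_dir):
        return False  # duplicate content: discard the current storage file
    shutil.copy2(source, target_dir / source.name)  # stand-in for the import
    with open(target_dir / CODE_FILE, "a") as handle:
        handle.write(code + "\n")  # record the check code in the target directory
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "target"
    target.mkdir()
    first = root / "a.csv"
    first.write_bytes(b"id\n1\n")
    renamed_copy = root / "b.csv"
    renamed_copy.write_bytes(b"id\n1\n")  # same content, different name
    results = [
        filter_and_import(first, target),
        filter_and_import(renamed_copy, target),
        filter_and_import(first, target),
    ]
    imported_names = sorted(p.name for p in target.iterdir() if p.name != CODE_FILE)
```

Because the check happens at import time against the target directory, the calling program needs no duplicate-detection logic of its own. The sketch does not handle two different files that share a name, which a real import would have to resolve.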

Description

Technical field

[0001] The invention relates to the technical field of data processing, in particular to a method and device for filtering storage files.

Background

[0002] With the massive increase of data, a single computer can no longer store it all, so distributed clusters have received extensive attention. In a distributed cluster, data can be distributed to multiple computers for storage, and distributed computing can be implemented. Hadoop is a distributed system infrastructure. Users can develop distributed programs without knowing the underlying details of the distribution and make full use of clusters of inexpensive computers for high-speed computation and storage. Hive is a data warehouse tool for Hadoop. It can map structured data files into a data table, provide a complete structured query language (SQL, Structured Query Language) query function, and convert SQL statements into MapReduce tasks for execu...


Application Information

IPC(8): G06F17/30
CPC: G06F16/1734, G06F16/182, G06F16/2282, G06F16/27, G06F16/284
Inventor: 谭就, 王建成
Owner: GUANGZHOU AIYOU INFORMATION TECH