The following examples further illustrate the present invention, but the examples do not limit the present invention in any form. Unless otherwise specified, the reagents, methods and equipment used in the present invention are conventional in the technical field.
As shown in Figures 1-2, this embodiment discloses a method for cleaning Internet big data, which comprises the following steps:
S1. Use the data acquisition module 1 to log in to the target server over the HTTP protocol, and extract the required data with regular expressions, XPath expressions and JSONPath expressions. HTTP is a simple request-response protocol that usually runs over TCP; it specifies what messages a client may send to a server and what responses it receives. The headers of request and response messages are given in ASCII, and the message content has a MIME-like format. HTTP is an application-layer protocol: like other application-layer protocols, it serves a specific class of application, and its function is realized by programs running in user space. HTTP itself is a protocol specification recorded in documents; it is the implementations of HTTP that actually communicate over the protocol. A regular expression is a logical formula for operating on strings: pre-defined specific characters, and combinations of those characters, form a "pattern string" that expresses a filtering logic to be applied to other strings. An XPath expression belongs to the XML Path Language, a language used to locate a certain part of an XML document. A JSONPath expression is an analogous way to query a JSON document, modeled on XPath expressions. A JSON data structure is usually anonymous and does not necessarily have a root element, so JSONPath uses the abstract symbol $ to represent the outermost object, as in the expression $.store.book.title.
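The three extraction techniques of step S1 can be sketched as follows, using only standard-library facilities; the sample documents, field names and patterns are hypothetical, the standard library supports only a limited XPath subset, and the JSONPath expression is emulated here by plain key lookups rather than a full JSONPath engine:

```python
import re
import json
import xml.etree.ElementTree as ET

# Regular expression: pull a price field out of raw page text.
text = "Item: Widget, Price: 19.99 USD"
match = re.search(r"Price:\s*([\d.]+)", text)
price = match.group(1)

# XPath (the limited subset supported by the standard library):
# locate the title of the first book element in an XML fragment.
xml_doc = "<store><book><title>Guide</title></book></store>"
title = ET.fromstring(xml_doc).find("./book/title").text

# JSONPath-style access: the expression $.store.book.title maps to
# successive key lookups starting from the anonymous root object ($).
json_doc = json.loads('{"store": {"book": {"title": "Guide"}}}')
node = json_doc
for key in "store.book.title".split("."):  # walk the path after "$."
    node = node[key]
```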
S2. Use the crawler synchronization module 2 to synchronize the files in the OSS through a checksum algorithm, a transmission synchronization algorithm and a comparison algorithm. In the fields of data processing and data communication, a checksum algorithm sums a set of data items for verification purposes; the data items may be numbers, or other strings treated as numbers while the checksum is computed. The transmission synchronization algorithm copies the data synchronously during the transmission process, and the comparison algorithm is an algorithm for comparing items of data information.
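A minimal sketch of the checksum idea in step S2; the modulus and sample records are illustrative assumptions, and a production synchronizer would use a stronger digest:

```python
def checksum(data: bytes, modulus: int = 1 << 16) -> int:
    """Sum the byte values of a data block, wrapped to a fixed width.

    A file is considered unchanged when the source and destination
    checksums agree, so only mismatching files need to be re-sent.
    """
    return sum(data) % modulus

local = b"record-1,record-2"
remote = b"record-1,record-2"
needs_sync = checksum(local) != checksum(remote)  # contents match, no re-send
```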
S3. Use the data cleaning module 3 to process the data with the mean filling method, the hot-card filling method and the regression filling method, then package the processed data and insert it into the kafka queue of the KAFKA module 4. Step S3 comprises the following sub-steps: S31, through a distributed data collector configured for the specific task, actively obtain metadata from databases or files, or passively receive the metadata through an API; S32, through the distributed data collector, encapsulate the signature key, the obtained metadata, and the task configuration, including the correspondence between metadata fields and target data fields, the type correspondence and other information, into a task object recognizable by the distributed data processor program, and distribute it to specific machines and worker processes through the distributed task scheduling system of the distributed data processor to perform the cleaning work; S33, the data processor receives the task and parses the task object, first verifying whether the signature key is legal: if not, it discards the task and records a log; if legal, it proceeds to step S34; S34, through the distributed data processor, after the signature key verification passes, restore the metadata and the task configuration contained in the task object, and clean the data according to the correspondences in the configuration; S35, through the distributed data processor, classify the metadata according to the configuration and associate the metadata fields with the target data fields; S36, through the distributed data processor, after the field correspondences have been processed, process the metadata according to the requirements of the target data; S37, through the distributed data processor, perform type conversion on the data types that do not meet the target data requirements; S38, through the distributed data processor, normalize the format of the converted metadata as needed; S39, through the data storage, push the format-normalized metadata to the front-end UI, the back-end API, the message queue, or the database module 5 as needed. The mean filling method divides the attributes in an information table into numeric and non-numeric attributes, which are handled separately. If a null value is numeric, the missing attribute value is filled with the average of that attribute over all other objects; if it is non-numeric, then following the statistical principle of the mode, it is filled with the most frequent value of that attribute among the other objects that share the same decision attribute value as the object.
The hot-card filling method fills a variable containing missing values as follows: find the object in the database most similar to the incomplete one, and fill in the value of that similar object. Different problems may use different criteria for similarity; most commonly, a correlation coefficient matrix determines which variable (say, variable Y) is most correlated with the variable containing the missing value (say, variable X). All records are then sorted by the value of Y, and the missing value of variable X is replaced by the X value of the case immediately preceding it. The regression filling method assumes the y attribute is missing while the x attribute is known: a regression model is trained on the records whose data are complete, the known x value is substituted in, the y attribute is predicted, and the prediction fills the missing position.
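The three imputation methods can be sketched in a few lines; the similarity criterion in the hot-card sketch (distance on an auxiliary column) and the simple least-squares line in the regression sketch are illustrative assumptions, not the only choices the method permits:

```python
def mean_fill(column):
    """Mean filling: replace None entries with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

def hot_card_fill(xs, ys, i):
    """Hot-card filling: copy y from the complete row whose x is closest
    to row i's x (one possible similarity criterion)."""
    donor = min(
        (j for j in range(len(ys)) if ys[j] is not None),
        key=lambda j: abs(xs[j] - xs[i]),
    )
    return ys[donor]

def regression_fill(xs, ys):
    """Regression filling: fit y = a*x + b on the complete rows, then
    predict y wherever it is missing."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return [a * x + b if y is None else y for x, y in zip(xs, ys)]
```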
S4. Use the KAFKA module 4 to allocate the data reasonably to the server queues by means of an election algorithm, and transmit the data to the database module 5 over the network. The basic idea of the election algorithm is: when a process P finds that the coordinator no longer responds to requests, it concludes that the coordinator has failed and initiates an election to choose a new coordinator, namely the process with the largest process ID among the currently active processes. Three message types are sent during an election: an Election message, which indicates that an election is initiated; an Answer (Alive) message, which replies to an Election message; and a Coordinator (Victory) message, by which the election winner announces success to the participants. The events that trigger an election are: process P recovering from an error, or a leader failure being detected. The election proceeds as follows: if P has the largest ID, it sends a Victory message to everyone directly and becomes the new leader; otherwise it sends an Election message to all processes with IDs larger than its own. If P receives no Alive message after sending the Election message, P sends a Victory message to everyone and becomes the new leader. If P receives an Alive message from a process with a larger ID, P stops sending messages and waits for the Victory message (if no Victory message arrives within a certain period, it restarts the election). If P receives an Election message from a process with a smaller ID, it replies with an Alive message and then starts its own election. If P receives a Victory message, it regards the sender as the leader.
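The outcome of the election described above (commonly known as the bully algorithm) can be condensed into a sketch; the process IDs and liveness map are hypothetical, and the messaging rounds are collapsed into their net effect, that the highest-ID live process wins:

```python
def bully_election(process_ids, alive, initiator):
    """Return the winner of a bully-style election started by `initiator`.

    `alive` maps each process ID to whether it answers messages. The
    initiator sends Election to all higher IDs; any live higher process
    replies Alive and takes over, so after all rounds the Coordinator
    (Victory) message comes from the largest live ID.
    """
    live = [p for p in process_ids if alive[p]]
    higher = [p for p in live if p > initiator]
    if not higher:
        # No Alive reply arrives: the initiator declares Victory itself.
        return initiator
    # Otherwise higher processes restart the election among themselves
    # until the largest live ID sends the Victory message.
    return max(higher)
```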
S5. Use the database module 5 to monitor whether the data transmitted by the KAFKA module 4 contains SQL injection attacks, filtering and saving through the wallFilter, and use the filter-chain to extend the monitoring statistics. The wallFilter is a data interception and control algorithm; any algorithm that can detect SQL injection attacks in the data, filter them, and save the offending information may serve. The filter-chain is a control algorithm for data monitoring statistics; any algorithm that achieves the purpose of monitoring and collecting statistics on the data may serve.
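A minimal sketch of the wall-filter and filter-chain ideas of step S5; the suspicious patterns are hypothetical examples, and a production wall filter (such as the one in the Druid connection pool) parses the SQL rather than matching strings:

```python
import re

# Hypothetical injection signatures for illustration only.
SUSPICIOUS = [
    re.compile(r"(?i)\bunion\b.*\bselect\b"),   # UNION-based injection
    re.compile(r"(?i)\bor\b\s+1\s*=\s*1"),      # tautology injection
    re.compile(r"--"),                          # trailing comment trick
]

def wall_filter(sql: str) -> bool:
    """Return True when the statement looks like an injection attempt."""
    return any(p.search(sql) for p in SUSPICIOUS)

def run_chain(sql, filters):
    """Filter chain: each stage may reject the statement or pass it on,
    which is also a natural place to hang monitoring statistics."""
    for f in filters:
        if f(sql):
            return "blocked"
    return "allowed"
```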
Specifically, the Internet big data cleaning method employs a data acquisition module 1, a crawler synchronization module 2, a KAFKA module 4 and a database module 5, the data acquisition module 1 being electrically connected to the crawler synchronization module 2, the KAFKA module 4 and the database module 5 respectively; it is characterized in that it also comprises a data cleaning module 3. The data acquisition module 1 collects the target data, saves the collected data to the database module 5, and synchronizes it to the crawler synchronization module 2. The crawler synchronization module 2 regularly synchronizes the data to the local machine and then instructs the data cleaning module 3 to clean the data. The data cleaning module 3 comprises a distributed data collector, a distributed data processor and a data storage. The distributed data collector uses a distributed system to extract and receive data from a variety of sources quickly and in large batches, then pushes it to the distributed data processor for cleaning. The distributed data processor handles the metadata pushed by the distributed data collector, cleans and converts the different data according to its configuration, and pushes the cleaned data to the data storage. The data storage handles the cleaned data and stores it in the database module 5 according to business needs and usage scenarios. The KAFKA module 4 is used for publishing and subscribing to record streams; the database module 5 is used for real-time analysis and storage of the data. The data acquisition module 1 simulates a public business system logging in to the target server over the network, analyzes the routing rules of the target system, and saves the CSS, JS, pictures and page text information in the database module 5.
The crawler synchronization module 2 uses the OSS data synchronization interface to synchronize data from the OSS and sends a cleaning instruction to the data cleaning module 3. The data cleaning module 3 applies various transformations to the data, such as migration, compression, cleaning, scattering, sharding and blocking, and inserts it into the kafka distributed message queue for processing. The distributed data collector comprises an Extract unit that actively collects data and an API unit that passively receives data. The distributed data processor is deployed in a distributed manner and comprises a data verification and classification unit that verifies and classifies the data, a data combining unit that splits or splices the data, a type conversion unit that performs type verification and conversion on the data, and a format specification unit that normalizes the format of the data. The database module 5 composes SQL from the arrays passed in by the data acquisition module 1 and the data cleaning module 3, arranges it into optimal SQL, and filters out SQL attacks.
More specifically, in the embodiment of the present invention, the data collection module 1 logs in to the target server over the HTTP protocol and extracts the required data using techniques such as regular expressions, XPath expressions and JSONPath expressions. The crawler synchronization module 2 uses the checksum algorithm, the transmission synchronization algorithm and the comparison algorithm to synchronize the files in the OSS. The data cleaning module 3 processes the data with algorithms such as the mean filling method, the hot-card filling method and the regression filling method, then packs the processed data and inserts it into the kafka queue. In the embodiment of the present invention, the checksum algorithm, the transmission synchronization algorithm, the comparison algorithm, the mean filling method, the hot-card filling method and the regression filling method are all conventional data processing algorithms, used here to speed up the processing of the collected data information. The ETL data cleaner comprises the following steps:
Step 1. The E module (distributed data collector), according to the specific task configuration, actively obtains metadata from the database or file, or passively receives metadata through the API. Step 2. The E module, according to the specific task configuration, encapsulates the signature key, the obtained metadata, and the task configuration, including the correspondence between metadata fields and target data fields, the type correspondence and other information, into a task object recognizable by the distributed data processor program, and distributes it to specific machines and worker processes through the distributed task scheduling system of the distributed data processor to perform the cleaning work. Step 3. The T module (distributed data processor) receives the task and parses the task object, first verifying whether the signature key is legal: if not, it discards the task and records a log; if legal, it proceeds to step 4. Step 4. The T module, after the signature key verification passes, restores the metadata and task configuration contained in the task object; according to the correspondences in the configuration, the data is processed in steps 5, 6, 7 and 8. Step 5. The T module, after obtaining the metadata, classifies it according to the configuration and associates the metadata fields with the target data fields, then proceeds to step 6. Step 6. The T module, after the field correspondences have been processed, processes the metadata according to the requirements of the target data: if information is missing, it is completed by splicing; if multiple fields must be combined into one field, the fields are merged; if some information needs to be filtered out, it is filtered out.
Step 7. The T module, after the data has been split, spliced and otherwise processed in step 6, performs type conversion on the data types that do not meet the target data requirements. Step 8. The T module, after the type conversion in step 7, has data that largely meets the requirements, and the format is now normalized according to the needs of the data. For example, data provided to the front-end UI for display, data provided to other back-end API interfaces, data stored in a message queue, data stored in a relational database and data stored in a document database all have different format requirements. After the on-demand format normalization, the cleaning of the metadata is complete, and processing proceeds to step 9. Step 9. The L module (data storage) is the link responsible for landing the data; the specific destination varies with business needs. The storage is designed for these requirements and supports pushing to the front-end UI, pushing to the back-end API, pushing to a message queue, storing in a database, and so on. This module also supports plug-in extensions to provide more types of data landing services.
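The task-object flow of steps 2 through 9 can be sketched as follows; the HMAC signature, secret key, field-mapping format and cast table are all illustrative assumptions standing in for the unspecified signing and configuration schemes:

```python
import hashlib
import hmac
import json

SECRET = b"shared-signing-key"  # hypothetical key agreed with the scheduler
CASTS = {"int": int, "float": float, "str": str}

def sign(payload: dict) -> str:
    """Signature over a canonical JSON body (one possible signing scheme)."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def make_task(metadata, field_map):
    """E module (step 2): wrap metadata and its field mapping, e.g.
    {"age": ["user_age", "int"]}, into a signed task object."""
    payload = {"metadata": metadata, "field_map": field_map}
    return {"payload": payload, "signature": sign(payload)}

def process_task(task):
    """T module (steps 3-8): verify the signature, then map fields,
    convert types, and normalize the format."""
    if not hmac.compare_digest(sign(task["payload"]), task["signature"]):
        return None  # step 3: illegal signature key, discard (and log)
    meta = task["payload"]["metadata"]
    cleaned = {}
    for src, (dst, cast_name) in task["payload"]["field_map"].items():
        # steps 5-8: rename per the mapping, cast the type, trim the format
        cleaned[dst] = CASTS[cast_name](meta[src].strip())
    return cleaned  # step 9: hand off to the data storage (L module)
```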
The KAFKA module 4 is a kafka distributed message queue which uses the election algorithm to allocate data reasonably to the server queues and transmits the data to the druid database over the network.
The database module 5 is a druid database that uses the wallFilter to monitor whether the data transmitted by kafka contains SQL injection attacks, filtering and saving accordingly, and uses the filter-chain to extend the monitoring statistics. Because a data pool is used, many steps of opening and closing database links are saved: the application program reuses an existing database link instead of rebuilding a new one, which greatly increases the efficiency of the database and improves the speed of data transmission.
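The link-reuse idea behind the data pool can be sketched minimally; the class name and fixed pool size are illustrative assumptions, and a real pool also handles validation, timeouts and growth:

```python
import queue

class ConnectionPool:
    """Minimal pool sketch: hand out idle database links instead of
    opening a new one per request, and take them back on release."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())  # open `size` links up front

    def acquire(self):
        return self._idle.get()        # reuse an existing link

    def release(self, conn):
        self._idle.put(conn)           # return it for the next caller
```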
Obviously, in the embodiment of the present invention, the data collection module 1 first collects the data information, the crawler synchronization module 2 then synchronizes the data to the data cleaning module 3, and the data cleaning module 3 performs format standardization, type conversion, verification, classification, and splitting and splicing, effectively reclassifying, integrating and cleaning the data into each standardized database module 5; finally, the data is screened and displayed through the KAFKA module 4 and the database module 5. In this method the data is first gathered in the data collection module 1 and only then screened and cleaned. Compared with the existing approach of screening before collecting, the present invention's approach of collecting first and then screening and cleaning is more conducive to gathering all relevant target data: it avoids the loss of target data, effectively ensures that related or adjacent target data is stored for backup, and reduces the workload of the user's next data collection. Reclassifying, integrating and cleaning the data into each standardized database module 5 through the data cleaning module 3 improves the accuracy of data cleaning, overcomes the defect of low screening and cleaning efficiency caused by big data loss in the prior art, and achieves the purpose of screening and cleaning data quickly and accurately.
 The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the points that are different from other embodiments, and the same and similar parts between the various embodiments can be referred to each other.
 The above description of the disclosed embodiments enables those skilled in the art to practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.