A data deduplication method, electronic device, and storage medium
By using a Bloom filter with double-write and a monthly recurring mechanism, combined with Redis cluster management, the problems of high storage cost, low query efficiency, and poor system reliability in existing data deduplication schemes are solved, achieving efficient and accurate data deduplication, suitable for massive data scenarios.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: 北京中科闻歌科技股份有限公司
- Filing Date: 2025-06-20
- Publication Date: 2026-06-12
AI Technical Summary
Existing data deduplication solutions suffer from high storage costs, low query efficiency, and poor system reliability in scenarios with massive amounts of data, making it difficult to meet the requirements of web crawler systems for low power consumption, high reliability, and self-optimization.
The method employs a Bloom filter double-write and monthly recurring mechanism: at the end of each time window it writes simultaneously to the current and next window filters, which keeps data identification accurate, and it achieves efficient querying and data management through a Redis cluster.
The method effectively avoids duplicate data collection caused by window switching, ensures the accuracy of duplicate identification during data handover, reduces the false judgment rate, improves the system's load balancing and resource utilization efficiency, and supports efficient deduplication of massive amounts of data.
Figure CN120670684B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of big data processing, and in particular to a data deduplication method, electronic device, and storage medium.
Background Technology
[0002] With the exponential growth of internet information, web crawlers, as the core tool for data collection, face severe challenges in terms of efficiency and resource utilization. Data deduplication, as a critical step in crawler systems, directly affects the accuracy of data collection and system performance. Current mainstream deduplication technologies, based on their implementation principles, mainly include relational database index deduplication schemes, distributed key-value storage deduplication schemes, and traditional Bloom filter deduplication schemes.
[0003] Relational database index deduplication schemes, based on relational databases such as MySQL, achieve data uniqueness verification by creating unique indexes (such as URL hash indexes). When new data is added to the database, the index is used to quickly check for duplicate records. This scheme has a low technical threshold for implementation and is suitable for small-scale data scenarios, supporting complex queries and transaction management. However, in massive data scenarios, the index size expands linearly with the data volume, leading to a surge in disk I/O overhead and query response times degrading from milliseconds to seconds. Storage costs are also high, with the index space occupied by a single data record reaching 2 to 3 times the actual data size, and it is difficult to cope with the horizontal scaling needs of data scales exceeding hundreds of millions.
[0004] Distributed key-value storage deduplication solutions employ distributed systems such as HBase and Cassandra, using data fingerprints (e.g., MD5 hash values) as keys and storing them in a distributed cluster. Data sharding is achieved through a consistent hashing algorithm, supporting horizontal scaling. Theoretically, this solution can support petabyte-level data storage, possessing automatic fault tolerance and load balancing capabilities, making it suitable for distributed web crawler scenarios. However, the system architecture is complex, requiring solutions to issues such as data sharding consistency and cross-node query latency; operational costs are high, requiring a professional team to maintain cluster topology and data migration strategies; for high-frequency query scenarios, cross-node network overhead and cache invalidation issues still exist, typically limiting single-machine QPS to within tens of thousands.
[0005] Traditional Bloom filter deduplication schemes are based on a probabilistic data structure that combines a bit array with multiple hash functions. They use k hash functions to map each element to k positions in the bit array and set those positions to 1. During a query, an element is judged to possibly exist only if all k of its mapped positions are 1. This scheme is extremely space-efficient, requiring only a few MB of memory to store millions of data entries. Furthermore, the query time complexity is O(k), close to constant time, making it suitable for high-frequency deduplication scenarios. However, it cannot actively clean up expired data. As data is continuously written, the density of "1"s in the bitmap approaches 100%, and the false positive rate rises sharply. In actual tests, when the bitmap utilization exceeds 70%, the false positive rate can spike from an initial 0.1% to over 5%. In addition, this scheme lacks a robust backup and recovery mechanism. If a system failure causes the bitmap data to be lost, the entire set of data fingerprints must be rebuilt, potentially leading to repeated data collection by crawlers, wasting server resources, and causing abnormal access to the target website.
[0006] Therefore, existing data deduplication solutions essentially face a triangular contradiction between "storage efficiency, query performance, and system reliability": relational databases and distributed key-value stores struggle to overcome the bottlenecks of storage cost and query efficiency, while traditional Bloom filters are limited by the accumulation of false positives and data reliability issues. In scenarios involving the daily collection of hundreds of millions of data points, existing technologies can no longer meet the requirements of web crawler systems for deduplication modules that are "low-power, highly reliable, and self-optimizing."
Summary of the Invention
[0007] To address the aforementioned technical problems, the technical solution adopted by this invention is as follows:
[0008] According to a first aspect of the present invention, a data deduplication method is provided, the method comprising the following steps:
[0009] S100, determine the position of the current monitoring time within the time window: if it is in the first time period of the current time window, execute S200; if it is in the first unit time of the second time period of the current time window, execute S300; if it is in the second time period of the current time window but not the first unit time, execute S400; if it is in the first unit time of the next adjacent time window, execute S500; wherein, the first time period is the first M consecutive units of time of the current time window, the second time period is the last N consecutive units of time of the current time window, M and N are positive integers, and the total length of the current time window is M+N units of time.
[0010] S200: When the data URL to be processed within the unit time of the current monitoring time is received, a query operation is performed on the first Bloom filter corresponding to the current time window based on the data URL to obtain the query result; if the query result indicates that the data URL needs to be added, the data URL is added to the first Bloom filter and S100 is executed; the initial state of the first Bloom filter is the newly initialized Bloom filter.
[0011] S300: Create a newly initialized second Bloom filter as a pre-Bloom filter for the next adjacent time window, and execute S400.
[0012] S400: When the data URL to be processed within the unit time of the current monitoring time is received, a query operation is performed on the first Bloom filter corresponding to the current time window based on the data URL, and the query result is obtained. If the query result indicates that the data URL needs to be added, the data URL is added to the first Bloom filter and the second Bloom filter. Otherwise, the data URL is only added to the second Bloom filter, and S100 is executed.
[0013] S500: Release the resources occupied by the first Bloom filter, mark the second Bloom filter as the new first Bloom filter for data processing in subsequent time windows, and execute S200.
[0014] According to a second aspect of the present invention, an electronic device is provided, including a processor and a memory; the processor executes the steps of the method described in the first aspect of the present invention by invoking a program or instructions stored in the memory.
[0015] According to a third aspect of the present invention, a computer-readable storage medium is provided that stores a program or instructions that cause a computer to perform the steps of the method described in the first aspect of the present invention.
[0016] The present invention has at least the following beneficial effects:
[0017] The present invention provides a data deduplication method that effectively solves the problem of data omission when switching time windows by simultaneously writing the current window filter (first filter) and the next window filter (second filter) at the end of the time window. This allows the filters to correctly identify duplicate data at the beginning of the month and avoid duplicate collection caused by window switching.
[0018] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of a data deduplication method provided in an embodiment of the present invention.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0022] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of this invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
[0023] It should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the steps as sequential processes, many of these steps can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the steps can be rearranged. A process can be terminated when its operation is complete, but it may also have additional steps not included in the figures. A process can correspond to a method, function, procedure, subroutine, etc.
[0024] This invention provides a data deduplication method that aims to solve two problems: the continuously rising false positive rate caused by traditional Bloom filters' lack of support for element deletion, and the need for effective data backup and rapid recovery.
[0025] Furthermore, as shown in Figure 1, an embodiment of the present invention provides a data deduplication method comprising the following steps:
[0026] S100, determine the position of the current monitoring time within the time window: if it is in the first time period of the current time window, execute S200; if it is in the first unit time of the second time period of the current time window, execute S300; if it is in the second time period of the current time window but not the first unit time, execute S400; if it is in the first unit time of the next adjacent time window, execute S500; wherein, the first time period is the first M consecutive units of time of the current time window, the second time period is the last N consecutive units of time of the current time window, M and N are positive integers, and the total length of the current time window is M+N units of time.
[0027] In this embodiment of the invention, the time window is divided based on the natural month time cycle, and the duration of each unit of time is a complete natural day (24 hours). The monitoring time is set as a fixed time point within each unit of time. Specifically, the monitoring time is the end time of each natural day, 00:00:00 (i.e., the start time of the next day).
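The S100 dispatch for a natural-month window can be sketched as follows. This is only an illustration under stated assumptions: the function name `step_for` is hypothetical, the monitoring time is reduced to a calendar date, and the first day of a month is always treated as the rotation step S500 (a real deployment would skip the rotation for the very first window, when no earlier filter exists).

```python
import calendar
from datetime import date


def step_for(day: date, n: int) -> str:
    """Return the step label (S200/S300/S400/S500) for a given calendar day.

    Assumes one window per natural month and one unit of time per day,
    so the window length M + N equals the number of days in the month.
    """
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    if not 1 <= n < days_in_month:
        raise ValueError("N must leave at least one day for the first period")
    first_overlap_day = days_in_month - n + 1  # first day of the last N days
    if day.day == 1:
        return "S500"  # first unit of a new window: rotate, then go on to S200
    if day.day < first_overlap_day:
        return "S200"  # first period: only the first filter is used
    if day.day == first_overlap_day:
        return "S300"  # create the second filter, then go on to S400
    return "S400"      # rest of the second period: dual write
```

With N = 3 and a 30-day month, days 2 to 27 map to S200, day 28 to S300, and days 29 and 30 to S400.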
[0028] In this embodiment of the invention, the method is used to deduplicate the URLs of data collected by a web crawler. It is understood that a data URL (Uniform Resource Locator) is an address identifier used to locate and access network resources.
[0029] In this embodiment of the invention, N is positively correlated with the web crawler's collection period. Specifically, the longer the collection period, the larger the value of N; the shorter the collection period, the smaller the value of N. This correlation mechanism is designed to ensure the accuracy and efficiency of data deduplication. The web crawler's collection period refers to the time interval between two collections of the same target data. When the collection period is long, it means that the amount of data the crawler needs to process in a single collection task is relatively large, and the possibility and complexity of data duplication will increase accordingly. For example, if the crawler performs a full-network data collection once a month, the probability of overlap between old and new data is high when the collection task is nearing its end due to the large amount of accumulated data. At this time, setting N to a larger value (such as 5-7 days) and enabling the double Bloom filter for a longer period before the window ends can fully cover potential scenarios of data duplication and effectively avoid missed detections caused by window switching. Conversely, when the collection period is short, such as when the crawler collects data every day, the amount of data processed each time is relatively small, and the risk of data duplication is low. At this point, N can be set to a smaller value (such as 1 to 2 days), which can meet the need for smooth data transition, reduce the write burden of the double Bloom filter, and reduce system resource consumption.
[0030] Furthermore, dynamically adjusting the N value can adapt to changes in web crawler collection strategies. When the crawler adjusts its collection frequency or range due to business needs, the system can automatically or manually adjust the N value according to the new collection cycle, ensuring that the data deduplication mechanism is always operating at its optimal state. For example, during e-commerce promotional periods, the crawler may temporarily increase its collection frequency to obtain real-time product information. Increasing the N value at this time can better cope with the surge in data volume in a short period of time and ensure the accuracy of deduplication.
[0031] In an illustrative embodiment of the present invention, N can be determined from an empirical formula derived from historical data statistics. For example, by analyzing historical crawler data and the relationship between the data repetition rate and the value of N under different collection periods, an empirical formula can be obtained, such as N = rounddown(c × T) + 1, where rounddown() represents rounding down, c is a preset coefficient obtained by fitting, and T is the collection period in days. In one illustrative embodiment, c = 0.2.
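As a minimal sketch, the empirical formula can be evaluated directly. The function name `n_empirical` is hypothetical, and the default coefficient is the c = 0.2 given in the illustrative embodiment.

```python
import math


def n_empirical(period_days: float, c: float = 0.2) -> int:
    """N = rounddown(c * T) + 1, with T the collection period in days."""
    return math.floor(c * period_days) + 1
```

For a monthly crawl (T = 30) this gives N = 7, and for a daily crawl (T = 1) it gives N = 1.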
[0032] In another illustrative embodiment of the present invention, N can be determined based on a calculation model for estimating the amount of data. In one example, N = roundup(log(Q × R) / log(2)), where Q is the estimated amount of data for each data collection task, R is the historical repetition rate of the data, and roundup() represents rounding up.
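The volume-based model can be sketched in the same way. The function name `n_from_volume` is hypothetical, and the clamp to 1 when Q × R is at most 1 is an added assumption so that N stays a positive integer, as S100 requires; the source does not state how that case is handled.

```python
import math


def n_from_volume(q: float, r: float) -> int:
    """N = roundup(log(Q * R) / log(2)), i.e. the ceiling of log2(Q * R)."""
    expected_repeats = q * r
    if expected_repeats <= 1:
        return 1  # assumption: clamp so that N remains a positive integer
    return max(1, math.ceil(math.log2(expected_repeats)))
```

For example, Q = 1,000,000 items per task with a historical repetition rate R = 0.05 gives a ceiling of log2(50,000), which is 16.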
[0033] In another illustrative embodiment of the present invention, N can be determined based on interval-based dynamic association rules. For example, different rules for N values can be formulated according to different ranges of the collection period:
[0034] When the acquisition period T ≤ T1, set N = 1. T1 is the first acquisition period threshold, which can be set to 5 days.
[0035] When T1 ≤ T < T2, set N = [T / 5] + 1. T2 is the second acquisition period threshold, which can be set to 15 days. [ ] indicates rounding down.
[0036] When T2 ≤ T, set N = [T / 3] + 1, where T2 is the second acquisition period threshold defined above (15 days).
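The three interval rules can be written as one small function. This is a sketch with a hypothetical name (`n_by_interval`); it reads [ ] as integer (floor) division, and because the first two ranges both include T = T1 as stated, it applies the first rule at that boundary, which is an assumption.

```python
def n_by_interval(t: int, t1: int = 5, t2: int = 15) -> int:
    """Pick N from the collection period T (in days) using the interval rules."""
    if t <= t1:
        return 1             # short periods: one overlap day
    if t < t2:
        return t // 5 + 1    # medium periods
    return t // 3 + 1        # long periods
```

The thresholds are parameters so that the rule can be tuned without changing the logic.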
[0037] For example, when T=3 days, N=1 day; when T=10 days, N=[10 / 5]+1=3 days; when T=20 days, N=[20 / 3]+1=7 days. This interval division rule can be flexibly adjusted according to the actual business characteristics, making it easy to quickly determine the value of N in different collection cycle scenarios, while taking into account the accuracy of data deduplication and system resource consumption.
S200: When the data URL to be processed within the unit time of the current monitoring time is received, a query operation is performed on the first Bloom filter corresponding to the current time window based on the data URL to obtain the query result; if the query result indicates that the data URL needs to be added, the data URL is added to the first Bloom filter, and S100 is executed; the initial state of the first Bloom filter is the newly initialized Bloom filter.
[0038] S300: Create a newly initialized second Bloom filter as a pre-Bloom filter for the next adjacent time window, and execute S400.
[0039] S400: When a data URL to be processed within the current monitoring time unit is received, a query operation is performed on the first Bloom filter corresponding to the current time window based on the data URL, and the query result is obtained. If the query result indicates that the data URL needs to be added, the data URL is added to both the first and second Bloom filters; otherwise, if the query result indicates that the data URL does not need to be added, the data URL is only added to the second Bloom filter, and S100 is executed. In this way, the second Bloom filter will pre-store some data from the end of the previous month.
[0040] S500: Release the resources occupied by the first Bloom filter, mark the second Bloom filter as the new first Bloom filter for data processing in subsequent time windows, and execute S200.
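The interplay of S200 to S500 can be illustrated with a small in-memory model. This is a sketch, not the Redis-backed implementation described later: `TinyBloom` and `WindowedDedup` are hypothetical names, the filter sizes are toy values, and a salted BLAKE2b hash from the standard library stands in for the hash function.

```python
import hashlib


class TinyBloom:
    """Minimal in-memory Bloom filter (illustrative only)."""

    def __init__(self, m_bits: int = 1 << 16, k: int = 4) -> None:
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url: str):
        for seed in range(self.k):
            digest = hashlib.blake2b(
                url.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "big") % self.m

    def contains(self, url: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(url))

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p >> 3] |= 1 << (p & 7)


class WindowedDedup:
    def __init__(self) -> None:
        self.first = TinyBloom()  # filter of the current window
        self.second = None        # filter prepared for the next window

    def start_overlap(self) -> None:
        """S300: create the filter for the next window."""
        self.second = TinyBloom()

    def rotate(self) -> None:
        """S500: drop the old filter and promote the prepared one."""
        if self.second is None:
            self.second = TinyBloom()  # assumption: fall back to an empty filter
        self.first, self.second = self.second, None

    def is_new(self, url: str) -> bool:
        """S200/S400: query the first filter, then write as the phase requires."""
        new = not self.first.contains(url)
        if new:
            self.first.add(url)
        if self.second is not None:
            self.second.add(url)  # dual write while the second filter exists
        return new
```

Because every URL seen during the overlap is written to the second filter whether or not it is new, a URL collected late in one month is still recognized as a duplicate on the first day of the next.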
[0041] In this embodiment of the invention, the number of Bloom filters currently in use is determined based on the time period in which the current monitoring moment occurs. When the current monitoring moment falls within the last few days of the current time window, a second Bloom filter is activated, employing a dual-write and monthly cyclical mechanism for Bloom filters. Key advantages include the following. Accurate deduplication guarantee: the dual-write mechanism simultaneously writes to the current and next time window filters at the end of the time window, avoiding missed or false judgments due to window switching; this is particularly suitable for peak data periods at the end of the month, ensuring the accuracy of duplicate identification during data handover. Data continuity management: the monthly cyclical mechanism allows for seamless filter transitions, fully preserving historical data; this provides a continuous data foundation for trend analysis and forecasting, avoiding processing interruptions. Load balancing and stability: dual writes distribute the write pressure of a single filter, reducing the risk of performance degradation under large data volumes, and a monthly pre-initialization mechanism allows parameters to be adjusted in advance based on data estimates to cope with periodic traffic fluctuations. Efficient resource utilization: dual writes are only activated at the end of the window, with a single filter running most of the time, reducing memory and computing overhead, and monthly dynamic configuration of filter parameters avoids resource waste. Maintainability optimization: the periodic mechanism facilitates data archiving, cleaning, and system monitoring, and targeted filter management and status checks can be performed after each month's processing, improving operational efficiency.
[0042] Furthermore, in this embodiment of the invention, the query operation and the add operation are implemented based on a Redis cluster. The first Bloom filter and the second Bloom filter have the same structure. Each Bloom filter consists of multiple independent data blocks. The data blocks of each Bloom filter are divided according to a preset hash range. Each data block is uniformly mapped to different master nodes of the Redis cluster through a consistent hashing algorithm. The data blocks on each master node are stored in a BitMap structure to record the existence status of the data URL.
[0043] In practical implementation, the method of this invention can be deployed on a Redis cluster consisting of three servers, each configured with 16 cores, 32GB of memory, and 500GB of storage. The Redis cluster is configured with a 3-master, 3-slave architecture, i.e., three master nodes, each connected to one slave node. Each master node is associated with multiple hash slots, and each hash slot is associated with multiple hash bits. The Redis cluster is deployed in Redis Cluster mode, and three sentinel nodes are configured to monitor the cluster status. Redis version 7.0 or higher is selected, and the AOF persistence mechanism is enabled to ensure data security.
[0044] In this embodiment of the invention, the bit array size m and the number of hash functions k of each Bloom filter satisfy the following conditions: m = -n × ln(p) / (ln(2)²), k = m / n × ln(2); where n is the estimated data volume and p is the target false positive rate.
[0045] In one specific embodiment, the Bloom filter is configured as follows: the estimated monthly new data volume is 300 million URLs, the Bloom filter hashes 16 times, the false positive rate is set to 0.001%, and the initial size of the filter each month is approximately 3.2GB of memory space.
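As a rough check, the two sizing formulas can be evaluated for the inputs of this embodiment (n = 300 million URLs, p = 0.001%). The function name `bloom_parameters` is hypothetical, and k is returned unrounded so that the rounding choice stays with the caller.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, float]:
    """Return (m, k): bit array size and the unrounded number of hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = m / n * math.log(2)
    return m, k


m, k = bloom_parameters(300_000_000, 0.00001)
print(f"bits={m}, bitmap size={m / 8 / 1e9:.2f} GB, k={k:.1f}")
```

Under these inputs the formula yields a k of about 16.6, close to the 16 hash rounds stated above, and a raw bit array on the order of 0.9 GB. The source does not explain how its figure of approximately 3.2GB is derived, so the sketch does not attempt to reproduce it.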
[0046] Those skilled in the art should understand that the specific partitioning strategies for Bloom filter data blocks (such as by hash range, data popularity, etc.) and the mapping mechanism from data blocks to Redis cluster master nodes (such as consistent hashing algorithm) are all existing technologies that are widely used in the field and will not be elaborated here.
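For readers unfamiliar with the mapping mentioned here, a hash ring lookup might look like the sketch below. All names (`HashRing`, `node_for`, the `master-*` labels) are hypothetical, SHA-256 from the standard library stands in for the ring hash, and the use of virtual nodes is an added assumption that the source does not mention.

```python
import bisect
import hashlib


def _ring_hash(value: str) -> int:
    # Stand-in hash for the ring; any well-distributed hash would do.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 64) -> None:
        # Each master gets several virtual points to even out the distribution.
        self._points = sorted(
            (_ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def node_for(self, chunk_id: int) -> str:
        """Return the first node at or after the chunk's hash, wrapping around."""
        h = _ring_hash(f"chunk:{chunk_id}")
        i = bisect.bisect_left(self._keys, h) % len(self._keys)
        return self._points[i][1]


ring = HashRing(["master-1", "master-2", "master-3"])
owner = ring.node_for(7)
```

The sketch only illustrates the clockwise lookup. Redis Cluster itself assigns keys to hash slots by a CRC16 of the key, so how a client-side ring is combined with that mechanism is left open here.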
[0047] Furthermore, the step of performing a query operation on the first Bloom filter corresponding to the current time window based on the received data URL to obtain the corresponding query results specifically includes:
[0048] S10, using a preset hash function and k different seed values, perform hash calculation on the data URL to obtain k different hashes.
[0049] In this embodiment of the invention, the preset hash function is the MurmurHash3 function. The k seed values can be 0 to k-1. When calculating the hash value of the data URL, each URL is calculated using the MurmurHash3 function and a different seed value. For example, if k=16, mmh3.hash(url,0), mmh3.hash(url,1), ..., mmh3.hash(url,15) will be calculated to generate 16 hash values. Compared to implementing k different hash functions, this method has lower computational overhead while ensuring the uniformity of hash value distribution.
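The single-function, multi-seed approach can be sketched with the standard library alone. The embodiment names MurmurHash3 (the `mmh3.hash(url, seed)` calls above); to keep this example free of third-party packages, it substitutes a salted BLAKE2b hash, which is an assumption made for illustration only. The function name `seeded_hashes` is hypothetical.

```python
import hashlib


def seeded_hashes(url: str, k: int = 16) -> list[int]:
    """Derive k hash values from one hash function by varying the seed 0..k-1."""
    data = url.encode("utf-8")
    return [
        int.from_bytes(
            hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest(),
            "big",
        )
        for seed in range(k)
    ]
```

The same URL always yields the same k values, which is what allows the query and add operations to address the same bit positions.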
[0050] S11, based on the GETBIT command provided by the Redis cluster, map the k hash values to k positions in the bit array of the first Bloom filter, and obtain the status values of the k first mapping positions.
[0051] S11 may specifically include:
[0052] S1101, perform a modulo operation on each of the k hash values to ensure that each hash value maps to a valid range of the Bloom filter bit array, resulting in k valid indices. Here, valid index = hash value % bit array length, where % represents the modulo operation. The bit array length is the total number of bits in the Bloom filter (e.g., 2^24).
[0053] S1102, determine the data block ID to which each valid index belongs. Here, data block ID = valid index // data block size, where the data block size is a preset value and // represents integer division.
[0054] S1103, use a consistent hashing algorithm to determine the Redis master node ID corresponding to each data block ID. Specifically, a hash ring can be used: the data block ID is mapped onto the ring's node space with a hash function (such as MurmurHash), and the node ID is that of the nearest active master node in the clockwise direction.
[0055] S1104, for k valid indexes, execute the GETBIT command in batches through the Redis cluster's Pipeline. The command format is: GETBIT bloomfilter:window:{window ID}:chunk:{block ID} {valid index}. Here, {window ID} is the unique identifier of the current time window (e.g., 202506), and {block ID} is the block number calculated in S1102.
[0056] S1105, parse the return value (0 or 1) of the GETBIT command and convert it into a boolean array. 0 indicates that the position is not marked and the corresponding element definitely does not exist. 1 indicates that the position has been marked and the corresponding element may exist (there is a possibility of false positives).
[0057] S1106: Assemble the k state values into a boolean array of length k in hash order, which will serve as the basis for subsequent existence checks.
[0058] S12, based on the state values of the k first mapping positions, determine whether the data URL already exists in the first Bloom filter: if all state values are 1, generate a query result indicating that the data URL may already exist and does not need to be added; if at least one state value is 0, generate a query result indicating that the data URL does not exist and needs to be added.
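The index arithmetic of S1101 and S1102, the key format of S1104, and the decision of S12 can be sketched without a live Redis connection. The function names `plan_getbits` and `may_exist` are hypothetical; the functions only build the (key, offset) pairs that a client would send as GETBIT commands and interpret the returned bits. The offset is the valid index itself, following the command format given in S1104.

```python
def plan_getbits(hash_values, bit_array_len, chunk_size, window_id):
    """S1101-S1104: turn hash values into (key, offset) pairs for GETBIT."""
    commands = []
    for h in hash_values:
        index = h % bit_array_len       # S1101: valid index
        chunk_id = index // chunk_size  # S1102: data block ID
        key = f"bloomfilter:window:{window_id}:chunk:{chunk_id}"
        commands.append((key, index))   # offset follows the stated format
    return commands


def may_exist(bit_values) -> bool:
    """S12: the URL may already exist only if every probed bit is 1."""
    return all(bool(b) for b in bit_values)


# Example with a bit array of 2**24 bits split into blocks of 2**20 bits.
commands = plan_getbits([5, 1 << 20, 12345678], 1 << 24, 1 << 20, "202506")
```

A single 0 among the returned bits is enough to conclude that the URL has not been seen and needs to be added.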
[0059] Furthermore, the add operation is specifically implemented using the SETBIT command provided by the Redis cluster, which sets the values of the k hash positions corresponding to the data URL to 1, specifically including:
[0060] S1, using the same hash function and the same k seed values as the query operation, calculate the k hash values of the URL of the data to be added.
[0061] S2 converts each of the k hash values into a valid index in the bit array using a modulo operation.
[0062] S3, determine the data block ID based on the valid index.
[0063] S4 uses a consistent hashing algorithm to determine the Redis master node ID corresponding to each data block ID.
[0064] S5 maintains a buffer of URLs to be added, triggering batch processing according to the following rules: when the buffer accumulates more than 1000 URLs, or the waiting time reaches 50ms, merge the SETBIT operation by grouping by (master node ID, data block ID). Grouping example: {node1:{chunk1:[offset1,offset2],chunk2:[offset3]},node2:{chunk3:[offset4]}}.
[0065] S6, through the Redis cluster's Pipeline mechanism, performs the following operations in batches: SETBIT bloomfilter:window:{window ID}:chunk:{block ID} {valid index} 1.
[0066] S6, through the Redis cluster's Pipeline mechanism, performs batch SETBIT operations in groups: for each master node, an independent Pipeline connection is created; the execution command format is: SETBIT bloomfilter:window:{window ID}:chunk:{block ID} {valid index} 1.
[0067] S7 ensures atomicity for k SETBIT operations on the same data block through Redis transactions (MULTI/EXEC).
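The buffering rule and the grouping of S5 can be sketched as two small helpers. The names `should_flush` and `group_setbits` are hypothetical; the thresholds are the ones stated in S5, and sending the grouped operations through per-node pipelines and transactions (S6, S7) is left out.

```python
from collections import defaultdict

MAX_BUFFERED_URLS = 1000  # flush when more than this many URLs are buffered
MAX_WAIT_MS = 50          # or when the oldest entry has waited this long


def should_flush(buffered_urls: int, waited_ms: float) -> bool:
    """S5 trigger: size threshold exceeded or wait time reached."""
    return buffered_urls > MAX_BUFFERED_URLS or waited_ms >= MAX_WAIT_MS


def group_setbits(entries):
    """Group (node_id, chunk_id, offset) triples as {node: {chunk: [offsets]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for node_id, chunk_id, offset in entries:
        grouped[node_id][chunk_id].append(offset)
    return {node: dict(chunks) for node, chunks in grouped.items()}
```

Grouping by master node first means each node receives one batch, and grouping by data block within a node keeps all offsets for the same key together.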
[0068] In summary, the data deduplication method provided by the embodiments of the present invention has at least the following advantages:
[0069] (1) By utilizing the high performance characteristics of Redis in-memory database, it is possible to achieve millisecond-level query response time based on the GETBIT command, which can meet the high concurrency requirements of crawlers.
[0070] (2) High availability of the system is achieved through master-slave replication and sentinel mechanism of Redis cluster, ensuring 99.99% service availability and effectively avoiding the risk of single point of failure. When the master node fails, the sentinel mechanism can automatically promote the slave node to master node. The entire switching process is transparent to the business, and the data loss is less than or equal to 1 second. In addition, by configuring Redis's AOF persistence mechanism, even in extreme cases, the system can recover data from disk, ensuring the integrity and recoverability of deduplicated data.
[0071] (3) By employing a double-write Bloom filter and a monthly recurring mechanism, the method keeps the false positive rate extremely low. By pre-creating the filter for the following month at the end of the month, basic data is accumulated while historical invalid data is discarded, keeping the false positive rate below the preset 0.001%. This means that when processing 1 billion data entries, the number of false positives does not exceed 10,000. For massive data deduplication scenarios, this is a completely acceptable level, significantly better than the accumulated false positive rate of a traditional single Bloom filter after long-term operation.
[0072] (4) In practical application scenarios, Redis nodes can be dynamically added or removed according to business needs, theoretically supporting an unlimited number of data deduplication requirements. Whether it is the increase in data volume or the increase in concurrent requests, it is only necessary to simply expand the Redis cluster without large-scale modification of the application layer, which greatly reduces the complexity of system expansion.
[0073] (5) Compared to the complex solutions in the Hadoop ecosystem, Redis-based solutions have lower deployment and maintenance costs.
[0074] This invention also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the method described in this invention.
[0075] This invention also provides a computer-readable storage medium storing computer-executable instructions for performing the methods described in this invention.
[0076] It should be understood that the various forms of processes shown above can be used to reorder, add, or delete steps. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this invention can be achieved, and this is not limited herein.
[0077] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A data deduplication method, characterized in that the method includes the following steps:
S100: Determine the position of the current monitoring time within the time window: if it is in the first time segment of the current time window, execute S200; if it is in the first unit time of the second time segment of the current time window, execute S300; if it is in the second time segment of the current time window but not the first unit time, execute S400; if it is in the first unit time of the next adjacent time window, execute S500; wherein the first time segment is the first M consecutive units of time of the current time window, the second time segment is the last N consecutive units of time of the current time window, M and N are positive integers, and the total length of the current time window is M+N units of time;
S200: When a data URL to be processed within the unit time of the current monitoring time is received, perform a query operation on the first Bloom filter corresponding to the current time window based on the data URL to obtain a query result; if the query result indicates that the data URL needs to be added, add the data URL to the first Bloom filter and execute S100; the initial state of the first Bloom filter is a newly initialized Bloom filter;
S300: Create a newly initialized second Bloom filter as a preparatory Bloom filter for the next adjacent time window, and execute S400;
S400: When a data URL to be processed within the unit time of the current monitoring time is received, perform a query operation on the first Bloom filter corresponding to the current time window based on the data URL to obtain a query result; if the query result indicates that the data URL needs to be added, add the data URL to both the first Bloom filter and the second Bloom filter; otherwise, add the data URL only to the second Bloom filter; then execute S100;
S500: Release the resources occupied by the first Bloom filter, mark the second Bloom filter as the new first Bloom filter for data processing in subsequent time windows, and execute S200.
2. The method according to claim 1, characterized in that the query operation and the add operation are implemented based on a Redis cluster; the first Bloom filter and the second Bloom filter have the same structure; each Bloom filter consists of multiple independent data blocks, which are divided according to a preset hash range; each data block is evenly mapped to a different master node of the Redis cluster through a consistent hashing algorithm; and the data blocks on each master node are stored in a BitMap structure to record the existence status of the data URL.
3. The method according to claim 2, characterized in that the step of performing a query operation on the first Bloom filter corresponding to the current time window based on the received data URL to obtain the corresponding query result specifically includes:
S10: Using a preset hash function and k different seed values, perform hash calculation on the data URL to obtain k different hash values, where k is the number of hash functions in the Bloom filter;
S11: Using the GETBIT command provided by the Redis cluster, map the k hash values to k positions in the corresponding bit array of the first Bloom filter, and obtain the state values of the k first mapping positions;
S12: Based on the state values of the k first mapping positions, determine whether the data URL already exists in the first Bloom filter: if all state values are 1, generate a query result indicating that the data URL may already exist and does not need to be added; if at least one state value is 0, generate a query result indicating that the data URL does not exist and needs to be added.
4. The method according to claim 2, characterized in that the add operation is specifically implemented by using the SETBIT command provided by the Redis cluster to set the values of the k hash positions corresponding to the data URL to 1, where k is the number of hash functions in the Bloom filter.
5. The method according to claim 3, characterized in that the preset hash function is the MurmurHash3 function.
6. The method according to claim 1, characterized in that the method is used to deduplicate the URLs of data collected by web crawlers.
7. The method according to claim 1, characterized in that the time window is divided based on the natural month time cycle, and the duration of each unit of time is one complete natural day.
8. The method according to claim 6, characterized in that N is positively correlated with the web crawler's data collection cycle.
9. An electronic device, characterized in that it includes a processor and a memory; the processor executes the steps of the method as described in any one of claims 1 to 8 by invoking programs or instructions stored in the memory.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a program or instructions that cause a computer to perform the steps of the method as described in any one of claims 1 to 8.
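The flow of claims 1, 3, 4 and 7 can be illustrated with a minimal, single-process Python sketch. It is not the claimed implementation: an in-memory `bytearray` stands in for the Redis BitMap (so the bit tests and bit sets below play the role of GETBIT and SETBIT), BLAKE2b salted with a seed stands in for seeded MurmurHash3, and the class names, filter size, k = 7, and the 3-day tail (N) are all assumptions chosen for illustration.

```python
import calendar
import hashlib
from datetime import date


class BloomFilter:
    """In-memory stand-in for one Redis-backed filter (hypothetical sizes)."""

    def __init__(self, num_bits: int = 1 << 20, k: int = 7):
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url: str):
        # S10: k seeded hashes. BLAKE2b with a per-seed salt is used here
        # only as a stand-in for MurmurHash3 with k seed values.
        for seed in range(self.k):
            digest = hashlib.blake2b(
                url.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def might_contain(self, url: str) -> bool:
        # S11/S12: all k bits set means "may already exist".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(url))

    def add(self, url: str) -> None:
        # Claim 4: set the k mapped bits to 1.
        for p in self._positions(url):
            self.bits[p >> 3] |= 1 << (p & 7)


class WindowedDeduplicator:
    """Monthly window with double write during the last n_tail_days days."""

    def __init__(self, n_tail_days: int = 3):
        self.n = n_tail_days
        self.current = BloomFilter()   # first Bloom filter
        self.next = None               # second (preparatory) Bloom filter
        self.window = None             # (year, month) served by self.current

    def _in_tail(self, day: date) -> bool:
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        return day.day > days_in_month - self.n

    def is_new(self, url: str, day: date) -> bool:
        """Return True if the URL should be collected, False if probably seen."""
        key = (day.year, day.month)
        if self.window is None:
            self.window = key
        elif key != self.window:
            # S500: drop the old filter and promote the prepared one.
            self.current = self.next if self.next is not None else BloomFilter()
            self.next = None
            self.window = key
        if self._in_tail(day) and self.next is None:
            # S300: first tail day, create the preparatory filter.
            self.next = BloomFilter()
        new = not self.current.might_contain(url)
        if new:
            self.current.add(url)      # S200 / S400: add to the first filter
        if self.next is not None:
            self.next.add(url)         # S400: always written to the second filter
        return new
```

A URL seen only early in the month is forgotten after the switch, while any URL seen during the tail days is carried into the next window, which is what prevents duplicate collection at the window boundary. For example, with `d = WindowedDeduplicator()`, calling `d.is_new(url, date(2025, 6, 29))` and then `d.is_new(url, date(2025, 7, 1))` reports the URL as new the first time and as a probable duplicate the second.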