A method for data duplicate checking
A data and database technology applied to duplicate checking of big data. It addresses limited storage-system scalability, inefficient resource consumption, and complex metadata management and storage, and it mitigates high memory usage and slow processing speed while reducing connection time.
Example Embodiment
[0037] Example 1
[0038] As shown in Figure 1, in a first aspect, the present invention provides a data duplication processing method, including the following steps:
[0039] Step S100: For the massive resource data, read the records with a DBCursor cursor, calling .next() to read the data in the database in order and calling close() to close the connection once reading is finished;
[0040] Step S102: Traverse the resource data list in batches, use the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, store the signature as a String, and divide it into n segments, where n is a natural number;
[0041] Step S104: Use each segment of the simHash signature obtained above as the key, and the corresponding simHash plus heap number as the value;
[0042] Step S106: Check the target text for duplicates using the key-value entries of the hashMap, and determine whether the Hamming distance of the n segments is less than a set threshold; if it is le...
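Steps S102 to S106 can be illustrated with a minimal Java sketch. It is not the patented implementation: the whitespace tokenizer, the 64-bit FNV-1a token hash, the 64-bit fingerprint width, and all class and method names (SimHashSketch, hash64, simHash, split, hammingDistance) are assumptions made for illustration. The source only states that the signature is stored as a String, divided into n segments, and compared by Hamming distance against a threshold.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SimHashSketch {
    static final int BITS = 64; // assumed fingerprint width

    // Assumed token hash (64-bit FNV-1a); the source does not name one.
    public static long hash64(String token) {
        long h = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // simHash: every token votes +1 or -1 on each bit position;
    // positions with a positive total become 1-bits of the fingerprint.
    public static long simHash(String text) {
        int[] votes = new int[BITS];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            long h = hash64(token);
            for (int i = 0; i < BITS; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < BITS; i++) {
            if (votes[i] > 0) fingerprint |= (1L << i);
        }
        return fingerprint;
    }

    // Store the fingerprint as a fixed-width binary String (left-padded with zeros).
    public static String toBinaryString(long fingerprint) {
        String raw = Long.toBinaryString(fingerprint);
        StringBuilder sb = new StringBuilder(BITS);
        for (int i = raw.length(); i < BITS; i++) sb.append('0');
        return sb.append(raw).toString();
    }

    // Divide the signature into n equal segments (n must divide its length).
    public static List<String> split(String signature, int n) {
        if (n <= 0 || signature.length() % n != 0) {
            throw new IllegalArgumentException("n must divide the signature length");
        }
        int width = signature.length() / n;
        List<String> parts = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            parts.add(signature.substring(i * width, (i + 1) * width));
        }
        return parts;
    }

    // Hamming distance between two equal-length signature Strings.
    public static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("signatures must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) distance++;
        }
        return distance;
    }

    public static void main(String[] args) {
        String s1 = toBinaryString(simHash("the quick brown fox jumps over the lazy dog"));
        String s2 = toBinaryString(simHash("the quick brown fox jumped over the lazy dog"));
        System.out.println("segments: " + split(s1, 4));
        System.out.println("distance: " + hammingDistance(s1, s2));
    }
}
```

Splitting into n segments is what makes the hashMap lookup useful: if two signatures differ in fewer than n bit positions, at least one segment must be identical, so an exact match on a segment key is enough to find every candidate before the full Hamming distance is computed.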
Example Embodiment
[0052] Example 2
[0053] As shown in Figure 2, in a second aspect, the present invention provides a logic flow for data duplicate checking, including:
[0054] Data reading: use a DBCursor cursor to read the data in the resource database;
[0055] Data query: use the target resource data id to query whether the current resource already exists in the simHash table; if it does, the check of the current data ends; if not, compute the simHash fingerprint signature of the current data and segment the signature;
[0056] Traverse each simHash segment and look in the resource data signature list for segments matching the segmented simHash fingerprint of the target resource data: if there is no match, query the heap number information table and store the heap number incremented by 1 in the cache; if there is a match, take out the simHash list of all matching resources and calculate the Hamming distance to the current resource data one by one,...
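The lookup flow described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the patent's code: the in-memory maps stand in for the simHash table and the heap number information table, and the names DuplicateCheckFlowSketch, Entry, alreadyChecked, candidateDistances, and registerWithNewHeap are hypothetical. The sketch stops at computing the Hamming distances for matching resources, because the source text is truncated at that point.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateCheckFlowSketch {
    // Value stored under each segment key: the full signature plus its heap number.
    static final class Entry {
        final String signature;
        final int heapNumber;
        Entry(String signature, int heapNumber) {
            this.signature = signature;
            this.heapNumber = heapNumber;
        }
    }

    private final Map<String, String> signatureById = new HashMap<>();      // stand-in for the simHash table
    private final Map<String, List<Entry>> entriesBySegment = new HashMap<>();
    private final int segments;
    private int maxHeapNumber = 0;                                           // stand-in for the heap number table

    public DuplicateCheckFlowSketch(int segments) {
        this.segments = segments;
    }

    public static int hammingDistance(String a, String b) {
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) distance++;
        }
        return distance;
    }

    // Split a signature into equal segments (segment count must divide its length).
    private List<String> segmentsOf(String signature) {
        int width = signature.length() / segments;
        List<String> parts = new ArrayList<>(segments);
        for (int i = 0; i < segments; i++) {
            parts.add(signature.substring(i * width, (i + 1) * width));
        }
        return parts;
    }

    // Data query: has this resource id already been checked?
    public boolean alreadyChecked(String id) {
        return signatureById.containsKey(id);
    }

    // Match branch: every stored signature sharing a segment, with its Hamming distance.
    public Map<String, Integer> candidateDistances(String signature) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String segment : segmentsOf(signature)) {
            List<Entry> hits = entriesBySegment.getOrDefault(segment, Collections.<Entry>emptyList());
            for (Entry e : hits) {
                result.putIfAbsent(e.signature, hammingDistance(signature, e.signature));
            }
        }
        return result;
    }

    // No-match branch: assign the current maximum heap number + 1 and index the segments.
    public int registerWithNewHeap(String id, String signature) {
        int heap = ++maxHeapNumber;
        signatureById.put(id, signature);
        for (String segment : segmentsOf(signature)) {
            entriesBySegment.computeIfAbsent(segment, k -> new ArrayList<>())
                            .add(new Entry(signature, heap));
        }
        return heap;
    }

    public static void main(String[] args) {
        DuplicateCheckFlowSketch index = new DuplicateCheckFlowSketch(4);
        index.registerWithNewHeap("doc-a", "0000111100001111");
        System.out.println(index.candidateDistances("0000111100001110"));
    }
}
```

A caller would first test alreadyChecked(id), then call candidateDistances on the new signature, and fall back to registerWithNewHeap only when the candidate map is empty. What to do with candidates whose distance falls below the threshold is left to the caller here, since the source does not show that part of the flow.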