A text data collection method and device

A text data collection technology, applied in the fields of text database querying, unstructured text data retrieval, and electronic digital data processing. It addresses the problem that duplicate checking slows down text data collection, so that the accuracy and the efficiency of duplicate checking cannot both be achieved. Its stated effects are improved duplicate-checking accuracy, high duplicate-checking efficiency, and high practical value.

Status: Inactive; Publication Date: 2019-05-03
QILIN HESHENG NETWORK TECH INC
7 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

For large-scale databases, duplicate checking based on several different hash algorithms seriously reduces the collection efficiency of text data, making it impossible to achieve both accuracy and efficiency in duplicate checking.

Embodiment Construction

[0023] To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this application, without creative work, shall fall within the scope of protection of this application.

[0024] In the various embodiments of the present invention, it should be understood that the magnitude of the sequence numbers of the following processes does not imply an order of execution. The execution order of each process should be determined by its function and internal logic, rather than the implementation of the pr...

Abstract

The embodiment of the invention relates to a text data collection method and device. The collection method comprises the following steps: computing a first hash value from a text fragment of a set character length in a first target text using a first hash algorithm, and performing duplicate checking on a text database based on that first hash value; if the duplicate check does not hit, storing the first target text in the text database and configuring its text type in the text database as a first type; selecting a second target text from the texts in the text database whose text type is the first type, and performing duplicate checking on the text database based on a second hash value computed from the second target text using a second hash algorithm; and if this duplicate check misses, changing the text type of the second target text in the text database to a second type, or otherwise deleting the data corresponding to the second target text from the text database. With this method and device, text data can be efficiently checked for duplicates based on different hash algorithms during text data collection.
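The two-stage flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not name the hash algorithms, the fragment length, or the storage layout, so everything concrete here is an assumption made for illustration: MD5 over the first `FRAGMENT_LEN` characters as the first hash, SHA-256 over case- and whitespace-normalized full text as the second hash, and the in-memory `TextDatabase` class with its `collect` and `recheck` methods.

```python
import hashlib

FRAGMENT_LEN = 16  # assumed value for the "set character length"
TYPE_FIRST, TYPE_SECOND = "first", "second"


def first_hash(text):
    """Cheap hash over a fixed-length leading fragment (MD5 is an assumption)."""
    return hashlib.md5(text[:FRAGMENT_LEN].encode("utf-8")).hexdigest()


def second_hash(text):
    """Costlier hash over the whole text.

    Lowercasing and collapsing whitespace before hashing is an assumption,
    added so that the second check can catch duplicates the first one misses.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TextDatabase:
    """Hypothetical in-memory stand-in for the text database."""

    def __init__(self):
        self.rows = {}              # row id -> {"text": str, "type": str}
        self.first_hashes = set()   # first hash values of stored texts
        self.second_hashes = set()  # second hash values of confirmed texts
        self._next_id = 0

    def collect(self, text):
        """Stage 1: on a miss, store the text as first type and return its id.

        Returns None when the first-hash check hits (text is not stored).
        """
        h1 = first_hash(text)
        if h1 in self.first_hashes:
            return None
        self.first_hashes.add(h1)
        row_id = self._next_id
        self._next_id += 1
        self.rows[row_id] = {"text": text, "type": TYPE_FIRST}
        return row_id

    def recheck(self):
        """Stage 2: re-check every first-type text with the second hash.

        On a miss the text is promoted to second type; on a hit it is deleted.
        """
        pending = [rid for rid, row in self.rows.items()
                   if row["type"] == TYPE_FIRST]
        for rid in pending:
            h2 = second_hash(self.rows[rid]["text"])
            if h2 in self.second_hashes:
                del self.rows[rid]
            else:
                self.second_hashes.add(h2)
                self.rows[rid]["type"] = TYPE_SECOND
```

In this sketch `collect` would run inline during collection, while `recheck` could run later as a batch job, so the costlier second check does not slow down ingestion.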

Description

Technical field

[0001] The embodiments of the present application relate to the technical field of computer software, and in particular to a method and device for collecting text data.

Background

[0002] With the development of big data applications, people are increasingly aware of the value of data. To meet the growing demand for data, data acquisition technology is particularly important. Collecting text data from the network (such as news and microblog posts) is a common data collection method.

[0003] In existing text data collection methods, after text data is obtained, the database is generally checked for duplicates based on a hash value calculated by a hash algorithm, and the text data is saved to the database only if the check misses, so as to ensure the uniqueness of the text data in the database. At present, there is no single standard method in the industry for hash-based duplicate checking of text data. In order to ensu...
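The existing single-hash approach described in paragraph [0003] can be sketched as follows. This is a hedged illustration only: the function name `collect_if_new`, the choice of SHA-256, and the use of a Python set and list in place of a real database are all assumptions, since the source does not specify them.

```python
import hashlib


def collect_if_new(text, seen_hashes, database):
    """Save `text` only when its hash is not already known.

    `seen_hashes` (a set) and `database` (a list) are hypothetical stand-ins
    for the database's hash index and its stored texts.
    Returns True if the text was saved, False if the duplicate check hit.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False          # duplicate check hit: discard the text
    seen_hashes.add(digest)   # duplicate check missed: remember and save
    database.append(text)
    return True
```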

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/33, G06F17/27
Inventor: 贾太滨, 李涛
Owner: QILIN HESHENG NETWORK TECH INC