Network protected index data obtaining method based on OCR technology

The acquisition method, applied in the field of network communication, addresses the fixed content, low accuracy, and reduced efficiency of existing approaches. The data it acquires is accurate and can be obtained in batches, and the method has wide application value.

Active Publication Date: 2016-11-09
SHANDONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

However, the second type of method also has disadvantages. First, access permissions are strictly tiered: without higher permissions, the number of permitted interface calls is greatly reduced, and acquisition efficiency drops accordingly.
Second, the content that can be acquired is relatively fixed and lacks flexibility.
Third, the acquired data is mainly text, and most of it has to be crawled a second time.
However, the results of this method have low accuracy and contain some error.



Examples


Embodiment

[0065] A method for obtaining protected Baidu Index data from the network based on OCR technology, as shown in Figure 1. The specific steps are:

[0066] (1) Log in to the target data website;

[0067] (2) Locate and acquire the target data: use the automated testing tool Selenium WebDriver to simulate the operations a user performs on the data platform before the target data is displayed, for example logging in, entering search keywords, and setting the search time. Load the image of the target data, then simulate mouse movement to dynamically load, collect, and store the data values on the curve in that image;

[0068] (3) Preprocess the target data: preprocess the image of the target data;

[0069] (4) Recognize and store the target data using improved OCR technology:

[0070] a. Custom font samples: For characters that are prone to failure in recognition and fonts that are not commonly used, expand the segmentatio...
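The mouse-movement part of step (2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `hover_offsets` and `capture_curve_images`, the step size, the pause length, and the output file naming are all assumptions. It also assumes `driver` is an already logged-in Selenium WebDriver and `chart` is the element that displays the curve. Recent Selenium 4 releases measure the offsets passed to `move_to_element_with_offset` from the element's centre; older releases measured from the top-left corner, so the offsets would need adjusting there.

```python
import os
import time
from typing import List


def hover_offsets(width: int, step: int) -> List[int]:
    """X offsets, measured from the element's centre, that sweep its width."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    half = width // 2
    return list(range(-half, half + 1, step))


def capture_curve_images(driver, chart, out_dir: str, step: int = 10) -> List[str]:
    """Hover at each offset and save a screenshot of the chart element.

    Hypothetical helper: `driver` is a logged-in WebDriver and `chart` is the
    WebElement holding the curve; both are supplied by the caller.
    """
    # Imported here so hover_offsets() stays usable without Selenium installed.
    from selenium.webdriver.common.action_chains import ActionChains

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, dx in enumerate(hover_offsets(chart.size["width"], step)):
        # Simulate the user moving the mouse along the curve.
        ActionChains(driver).move_to_element_with_offset(chart, dx, 0).perform()
        time.sleep(0.3)  # assumed pause so the value for this point can render
        path = os.path.join(out_dir, f"point_{i:04d}.png")
        chart.screenshot(path)  # image is later preprocessed and passed to OCR
        paths.append(path)
    return paths
```

Keeping the offset calculation separate from the browser interaction makes the sweep easy to check on its own; for example, `hover_offsets(100, 25)` yields five evenly spaced positions from the left edge to the right edge of a 100-pixel-wide element.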



Abstract

The invention relates to a method for obtaining protected network index data based on OCR technology. An automated testing tool simulates the series of operations a user performs on the data platform before the index data is displayed, including logging in, entering search keywords, and setting the search time. Simulated mouse movement then dynamically displays and collects the values on the curve, and finally the improved OCR technology extracts the numerical values of the target data. Protected data obtained in this way is acquired efficiently, accurately, and in batches. It provides effective data support for public opinion analysis and data mining, offers a new approach to acquiring network big data, and supplies valuable information for commercial promotion, precision marketing, and market analysis. The method has important theoretical significance and wide application value.

Description

Technical field

[0001] The invention relates to a method for acquiring protected network index data based on OCR technology, and belongs to the technical field of network communication.

Background

[0002] OCR stands for Optical Character Recognition. It converts the text of bills, newspapers, books, manuscripts, and other printed materials into image information through optical input methods such as scanning, and then uses text recognition technology to convert that image information into a form a computer can use.

[0003] The process by which OCR recognizes characters in images can be summarized as image preprocessing, character feature extraction, and font dictionary comparison, which are the three core processes of OCR. Among them, character feature extraction is the most important. This process first performs line or word segmentation on the character sequence to be recogni...
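The three generic OCR stages named above (preprocessing, segmentation as part of feature extraction, and font dictionary comparison) can be illustrated with a toy example. This is only a sketch under stated assumptions, not the improved OCR of the invention: the 3x5 digit glyphs in `FONT`, the fixed threshold of 128, the blank-column segmentation rule, and the pixel-difference matching are all chosen for illustration, and `render` exists only to produce a test image.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink

# Hypothetical 3x5 font dictionary used only for this illustration.
_RAW = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "7": ["111", "001", "010", "010", "010"],
}
FONT: Dict[str, Bitmap] = {
    ch: [[int(c) for c in row] for row in rows] for ch, rows in _RAW.items()
}


def binarize(gray: List[List[int]], threshold: int = 128) -> Bitmap:
    """Preprocessing: pixels darker than the threshold become ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_columns(img: Bitmap) -> List[Bitmap]:
    """Segmentation: cut the image wherever a whole column has no ink."""
    has_ink = [any(row[x] for row in img) for x in range(len(img[0]))]
    glyphs, start = [], None
    for x, ink in enumerate(has_ink + [False]):  # sentinel closes last glyph
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            glyphs.append([row[start:x] for row in img])
            start = None
    return glyphs


def pixel_distance(a: Bitmap, b: Bitmap) -> int:
    """Number of differing pixels; differently sized glyphs get a large penalty."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return len(a) * len(a[0]) + len(b) * len(b[0])
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(img: Bitmap, font: Dict[str, Bitmap]) -> str:
    """Dictionary comparison: pick the closest font sample for each glyph."""
    return "".join(
        min(font, key=lambda ch: pixel_distance(glyph, font[ch]))
        for glyph in segment_columns(img)
    )


def render(text: str, font: Dict[str, Bitmap]) -> List[List[int]]:
    """Test helper: draw text as a grayscale image (0 = black, 255 = white)."""
    return [
        [v for ch in text for v in [0 if px else 255 for px in font[ch][y]] + [255]]
        for y in range(5)
    ]


print(recognize(binarize(render("107", FONT)), FONT))  # 107
```

Because matching picks the nearest sample rather than requiring an exact match, a glyph with a small amount of noise is still assigned to the closest dictionary entry, which is one reason the choice of font samples matters for recognition accuracy.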


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F21/62
CPC: G06F16/951, G06F21/6218, G06F2221/2107
Inventor: 曾庆田, 王松松, 李超, 段华, 赵中英
Owner: SHANDONG UNIV OF SCI & TECH