Web field distributed real time extraction system

A distributed, real-time, field-based extraction technology for distributed data integration methods and systems. It addresses the problem that deep data inside websites cannot be retrieved by traditional search engines, and it makes online queries more convenient.

Inactive Publication Date: 2016-10-05
天津询达数据科技有限公司

AI Technical Summary

Problems solved by technology

[0002] Traditional search engines such as Baidu and Google can only search shallow data. A large amount of data sits inside websites on the Internet, such as portals, forums, and post bars. This data is called the Deep Web, and traditional search engines cannot crawl it.
To overcome this limitation, the system extracts the required data from each website, integrates it into its own website, and then provides it to users for retrieval, which makes online queries convenient.


Examples


Embodiment Construction

[0042] 1. Database setting module:

[0043] (1) On the server-side database server, add a command that allows the IP address of the computer running the distributed crawler, together with a specific user name assigned to it, to write and update data remotely on that database server.
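Step (1) amounts to creating a restricted account bound to the crawler's IP address. The following is a minimal sketch, assuming a MySQL-style server (the source does not name the database product). The function name `build_grant_statements` and the account, IP, and database values are hypothetical.

```python
import re

# Conservative patterns so the generated SQL cannot be altered by odd input.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def build_grant_statements(user, crawler_ip, database, password):
    """Return MySQL-style statements that let user@crawler_ip write and update `database`.

    Assumption: MySQL syntax. The source only says "add a command".
    """
    if not _IDENT.match(user) or not _IDENT.match(database):
        raise ValueError("user and database must be simple identifiers")
    if not _IPV4.match(crawler_ip):
        raise ValueError("crawler_ip must be a dotted IPv4 address")
    # Escape backslashes and single quotes inside the password literal.
    escaped_pw = password.replace("\\", "\\\\").replace("'", "\\'")
    account = f"'{user}'@'{crawler_ip}'"
    return [
        f"CREATE USER {account} IDENTIFIED BY '{escaped_pw}';",
        f"GRANT INSERT, UPDATE ON `{database}`.* TO {account};",
    ]
```

Limiting the grant to `INSERT` and `UPDATE` mirrors the "write and update" wording of the step; a real deployment might need further privileges.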

[0044] (2) On the crawler computer, in the remote database login module, set the IP address and database port of the remote server to log in to, the name of the database, the user name (the specific user name that the server side permitted in item (1)), the password, and the name of the table to log in to.

[0045] (3) On the crawler computer, in the local database login module, set the IP address and database port of the local computer, the name of the database, the user name (usually root), and the password. The name of the...
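The remote and local login settings in items (2) and (3) share the same fields, so they can be held in one small record type. This is a sketch only: the class `DbLogin`, the helper `make_logins`, the default port 3306 (the MySQL default), and the database and table names are assumptions, not part of the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbLogin:
    """One set of database login settings for the crawler."""
    host: str
    port: int
    database: str
    user: str
    password: str
    table: str  # assumption: the table the crawler writes to


def make_logins(remote_host, remote_user, remote_password, local_password,
                database="deepweb", table="pages", port=3306):
    """Build the (remote, local) login pair described in items (2) and (3)."""
    # Remote login: uses the specific user name the server permitted in item (1).
    remote = DbLogin(remote_host, port, database, remote_user, remote_password, table)
    # Local login: the text notes the user name is usually root.
    local = DbLogin("127.0.0.1", port, database, "root", local_password, table)
    return remote, local
```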


Abstract

The invention provides a deep data mining method for extracting data from various websites. The method can automatically extract data from designated websites around the clock. Extraction rules are written in CSS/HTML or jQuery format. An automatic module extracts data from the websites stored in a list at a set frequency, and the number of extraction rounds can be fixed or left to repeat without limit. A distributed crawler module extracts from a website at a set frequency and supports multi-level extraction. Extracted content passes through a filtering layer, where missing values are compensated and extracted values are filtered. Finally, the result is stored in a local database or a remote database according to a set condition. The Web field distributed real-time extraction system thus achieves distributed, around-the-clock automatic data extraction and integration.
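The flow in the abstract (extract each listed site at a set frequency, compensate missing values, filter, store) can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: SQLite stands in for the local database, the `extract` callable stands in for the crawler, and every function, field, and table name here is hypothetical.

```python
import sqlite3
import time


def compensate_missing(record, defaults):
    """Return a record with the fields in `defaults`; absent or empty ones get the default."""
    return {key: (record.get(key) or default) for key, default in defaults.items()}


def passes_filter(record, banned_words):
    """Reject records whose title contains a banned word (case-insensitive)."""
    title = record["title"].lower()
    return not any(word in title for word in banned_words)


def run_extraction(sites, extract, conn, defaults, banned_words=(),
                   interval_seconds=0.0, max_rounds=1):
    """Extract every listed site once per round; max_rounds=None repeats without limit."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (site TEXT, title TEXT, body TEXT)")
    stored = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for site in sites:
            for raw in extract(site):
                # Filtering layer: compensate missing values, then filter.
                record = compensate_missing(raw, defaults)
                if passes_filter(record, banned_words):
                    conn.execute("INSERT INTO items VALUES (?, ?, ?)",
                                 (site, record["title"], record["body"]))
                    stored += 1
        conn.commit()
        rounds += 1
        # Wait between rounds, but not after the final one.
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_seconds)
    return stored
```

Passing `extract` as a parameter keeps the scheduling and filtering logic independent of how pages are fetched, so the same loop could run on each distributed crawler node with a different site list.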

Description

Technical field

[0001] The invention relates to a Deep Web data integration method and system, in particular to a distributed data integration method and system.

Background technique

[0002] Traditional search engines such as Baidu and Google can only search shallow data. A large amount of data sits inside websites on the Internet, such as portals, forums, and post bars. This data is called the Deep Web, and traditional search engines cannot retrieve it. To overcome this limitation, the system extracts the required data from each website, integrates it into its own website, and then provides it to users for retrieval, which makes online queries convenient.

Contents of the invention

[0003] The invention provides a distributed system and method for automatically extracting and integrating data from various web...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F9/44
Inventor: 刘挺, 孟小峰
Owner: 天津询达数据科技有限公司