Webpage data capturing and filtering method

A webpage data capturing and filtering method, applied in the fields of electronic digital data processing, special data processing applications, and instruments. It addresses three problems: the daily data update volume is huge, site-specific crawling programs cannot be updated at a realistic speed, and massive numbers of unknown websites cannot be crawled. The method achieves accurate data capture and filtering and avoids the work of producing and later maintaining site-specific programs.

Inactive Publication Date: 2012-07-11
WEIGOUSNGHAI CULTURE MEDIA
AI Technical Summary

Problems solved by technology

The required information is obtained by filtering the content of a webpage (for a BBS, for example, the article title, author, posting time, hit count, and reply count). A typical approach is to write regular expressions that follow the Html tag rules of the page, so that this information can be filtered out accurately. But when the area to be captured is very wide, for example nearly 80,000 forums, more than 200 large news websites, and many well-known search engines, blogs, and post bars, the amount of daily data updates is very large. It is impossible to create a tailor-made filtering program for each BBS and each section of each website. Even if such programs were produced at great cost, every revision of a website would require a matching, precise modification of its crawling program. This maintenance workload and update speed are obviously unrealistic.
[0023] It can be seen that a crawling program made by the existing general methods can usually capture data only from individual websites or a small number of websites if the captured data is to be filtered accurately. Because too many data matching programs would have to be produced and maintained, it is impossible to crawl massive numbers of unknown websites. It is therefore necessary to provide a new webpage data capturing and filtering method.
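
To make the maintenance problem concrete, here is a minimal Python sketch of the site-specific regular-expression filtering described above. The markup, class names, and the function name `extract_site_a` are hypothetical and chosen only for illustration; the source does not show any real pattern.

```python
import re

# Hypothetical pattern tailored to ONE forum's post-list markup (an assumption
# for illustration; real sites differ, which is exactly the problem).
SITE_A_ROW = re.compile(
    r'<tr class="post"><td class="title">(?P<title>.*?)</td>'
    r'<td class="author">(?P<author>.*?)</td></tr>'
)


def extract_site_a(html):
    """Return one dict per post row that matches the site-A pattern."""
    return [m.groupdict() for m in SITE_A_ROW.finditer(html)]


# The same information, marked up the way a second (hypothetical) forum might do it.
site_a = '<tr class="post"><td class="title">Hello</td><td class="author">amy</td></tr>'
site_b = '<li><span class="subject">Hello</span><em>amy</em></li>'

print(extract_site_a(site_a))  # finds the post
print(extract_site_a(site_b))  # finds nothing: site B needs its own pattern
```

The pattern is accurate for the page it was written for and useless for any other layout, so each new or redesigned site means another hand-written pattern.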

Embodiment Construction

[0040] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0041] Figure 1 is a schematic flow chart of the webpage data capturing and filtering method of the present invention; Figure 2 is a schematic flow chart of converting Html into an XML sequence table in the present invention; Figure 3 is a schematic diagram of the data flow for obtaining the data in a BBS article in the present invention.

[0042] Referring to Figure 1, the implementation process of the present invention is described in detail below, taking the capture of webpage data from a BBS article as an example:

[0043] Step S101: Obtaining the Html code of the webpage

[0044] First, use the OpenRead method of WebClient in C#.NET, passing the URL, to read the entire Html code of a forum article list.
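
Step S101 can be sketched in Python as a rough analogue of the `WebClient.OpenRead` call named above. This is not the patent's C# code: the function name `read_html`, the timeout, and the UTF-8 fallback are assumptions made for this sketch.

```python
from urllib.request import urlopen


def read_html(url, timeout=10.0):
    """Read the whole Html code behind `url` and return it as text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared by the response if any; UTF-8 is an assumed fallback.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

A call such as `read_html("https://forum.example/list")` (a placeholder address) would return the page source as one string, ready for the serialization step.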

[0045] Step S102: Serialize Html to XML

[0046] Continuing with Figure 2: as shown in step S201, first delete all the basic irrelevant codes in the overall Html c...
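
As a hedged illustration of step S102, the Python sketch below drops some markup and serializes the remaining text nodes into a flat XML table. The source text is cut off before it says what counts as irrelevant code or how the XML sequence table is laid out, so everything specific here is an assumption of this sketch: treating `script`, `style`, and comments as irrelevant, the `table`/`row` element names, and the `seq`, `tag`, and `class` attributes.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}  # assumption: content treated as irrelevant
VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags that never close


class SequenceTableBuilder(HTMLParser):
    """Turn each visible text node into one <row> of a flat XML table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("table")
        self._open = []      # stack of (tag, class attribute) for open elements
        self._skipping = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skipping += 1
        elif tag not in VOID:
            self._open.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self._skipping = max(0, self._skipping - 1)
            return
        # Pop back to the nearest matching open tag; tolerate sloppy Html.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skipping or not self._open:
            return
        tag, css = self._open[-1]
        attrs = {"seq": str(len(self.root) + 1), "tag": tag, "class": css}
        ET.SubElement(self.root, "row", attrs).text = text


def html_to_sequence_xml(html):
    """Return the flat XML table for `html` as a string."""
    builder = SequenceTableBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

Comments are dropped because `HTMLParser` routes them to a separate handler that this class does not override. Keeping the enclosing tag name and `class` on every row is one way to leave hints for a later keyword match.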

Abstract

The invention discloses a webpage data capturing and filtering method, which comprises the following steps: a) obtaining the webpage Html code; b) converting the webpage Html code into an XML (Extensible Markup Language) sequence table; and c) performing fuzzy matching on the XML sequence table by using information keywords to obtain the webpage data. In this method, the webpage Html code is first fully serialized into XML, and the webpage data in the XML file is then obtained by fuzzy filtering, so that massive amounts of webpage data can be captured and filtered quickly and more accurately.
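
Step c) can be illustrated with a small Python sketch of keyword-based fuzzy matching over an XML table. The abstract does not specify the matching rule, the keyword lists, or the table layout, so the `row` elements with a `class` attribute, the `FIELD_KEYWORDS` lists, the 0.75 threshold, and the use of `difflib` are all assumptions of this sketch.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical information keywords for each field to extract.
FIELD_KEYWORDS = {
    "title": ["title", "subject", "topic"],
    "author": ["author", "poster", "username"],
    "replies": ["replies", "reply", "comments"],
}


def best_field(label, threshold=0.75):
    """Return the field whose keywords best resemble `label`, or None."""
    label = label.lower()
    best, best_score = None, 0.0
    for field, keywords in FIELD_KEYWORDS.items():
        for keyword in keywords:
            if keyword in label:
                score = 1.0  # the keyword appears verbatim inside the label
            else:
                score = difflib.SequenceMatcher(None, keyword, label).ratio()
            if score > best_score:
                best, best_score = field, score
    return best if best_score >= threshold else None


def extract_fields(xml_text):
    """Map each field to the text of the first row whose label matches it."""
    found = {}
    for row in ET.fromstring(xml_text).iter("row"):
        field = best_field(row.get("class", ""))
        if field and field not in found:
            found[field] = (row.text or "").strip()
    return found


sample = """<table>
<row seq="1" tag="a" class="thread-subject">Hello world</row>
<row seq="2" tag="span" class="authr">amy</row>
<row seq="3" tag="td" class="nav">Home</row>
<row seq="4" tag="td" class="reply_count">12</row>
</table>"""

print(extract_fields(sample))
```

Because labels only need to resemble a keyword, one keyword list can cover sites that name the same field differently, which is the property the abstract relies on.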

Description

Technical field

[0001] The invention belongs to the technical field of computer databases, and in particular relates to a method for capturing and filtering webpage data.

Background technique

[0002] Web crawling and data extraction technology has a long history of development. Various technical means have been used to collect web content, and at each stage in the development of computing technology people have tried to use more advanced technologies and programming languages to build more powerful tools for acquiring website content. The existing common webpage data capture methods are as follows:

[0003] 1. Web crawling and data extraction using web crawler technology

[0004] A web crawler is also called a Web Spider, which is a very vivid name. If the Internet is compared to a spider web, then the Spider is a spider crawling around on that web. Web spiders find web pages through the link addresses contained in web pages. Starting from a certain page (usually the ...
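
The link-following behaviour of a web spider can be sketched in Python as below. The source text is cut off before it finishes describing the traversal, so the breadth-first order, the `max_pages` limit, and the injected `fetch` function are assumptions of this sketch rather than details from the patent.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first from `start_url`; return URLs in visit order.

    `fetch(url)` must return the page Html as a string, or None on failure.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

Passing `fetch` as a parameter keeps the traversal logic separate from network access, so the sketch can be exercised against an in-memory dictionary of pages.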

Application Information

IPC IPC(8): G06F17/30
Inventor 金炜杰
Owner WEIGOUSNGHAI CULTURE MEDIA