Directional web data extraction method

A web data extraction technology, applied in the fields of network technology and search engines, that addresses the inability of existing methods to capture webpage data directionally and their limited fields of application, and that achieves simple operation, a wide application range, and strong operability.

Active Publication Date: 2012-10-17

重庆超体科技有限公司

2 Cites, 0 Cited by


AI Technical Summary

Problems solved by technology

[0008] The first two methods above can be realized only if the webpage data acquirer and the website operator establish a commercial cooperation, which requires the acquirer to have strong commercial public relations capabilities. Both methods are limited by commercial cooperation and cannot provide targeted crawling of webpage data from sites other than business partners, so their field of application is very limited.


Examples


Embodiment 1

[0051] To provide users with a vegetable price information service, the server providing the service needs to directionally capture vegetable price data from a professional price information website. The web pages on that website are in HTML (HyperText Markup Language) format. The URL (Uniform Resource Locator) of the web page containing the vegetable price information is "http://www.feinno.com/commodity-price/016", and the vegetable prices on this page are presented in the table structure shown in Table 1:

[0052] Table 1

[0053]

[0054] In Table 1, the cell containing "vegetable price" is the header cell, and the other cells are data cells. Suppose a dedicated price quotation service for "green peppers" is to be provided to users. The server therefore needs to directionally capture the green pepper price data from the webpage and ignore all other data on the page; the method of the present invention is used f...
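
As a minimal sketch of this kind of targeted capture, the Python snippet below builds a regular expression from the structure of a table row and pulls out only the green pepper price. The HTML fragment is an assumed rendering of Table 1 (its actual markup is not reproduced here), the row values are the figures quoted in Embodiment 2, and `extract_price` is a hypothetical helper rather than the patent's own procedure.

```python
import re

# Hypothetical page source: an assumed rendering of Table 1 as an HTML table.
PAGE_SOURCE = """
<table>
  <tr><th colspan="2">vegetable price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""


def extract_price(source, item_name):
    """Return the text of the data cell that follows the cell holding item_name."""
    # Data matching model: a <td> holding the item name, then the next <td>.
    pattern = re.compile(
        r"<td[^>]*>\s*" + re.escape(item_name) + r"\s*</td>\s*"
        r"<td[^>]*>\s*(.*?)\s*</td>",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(source)
    return match.group(1) if match else None


green_pepper_price = extract_price(PAGE_SOURCE, "green pepper")
print(green_pepper_price)
```

Anchoring the pattern on the item name is what keeps the capture targeted: the other rows never match, so nothing else on the page has to be parsed or filtered afterwards.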

Embodiment 2

[0082] To provide users with a vegetable price information service, the server providing the service needs to capture vegetable price data from the webpage "http://www.feinno.com/commodity-price/016" of the price information website described in Embodiment 1. The vegetable prices on this page are presented as shown in Table 1. Here, users are to be given the vegetable names listed in the table together with the corresponding price of each vegetable. The webpage data to be captured is therefore the vegetable names "tomato", "green pepper", and "carrot" and the corresponding prices "3.50 yuan / 500 grams", "2.50 yuan / 500 grams", and "1.50 yuan / 500 grams". The method of the present invention is used for the directional capture; the flow is shown in Figure 1, and the specific method is as follows:

[0083] i) According to the data structure characteristics of the web page data to be captured in the w...
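
One way to picture capturing every name and price pair is a single pattern with two capture groups, applied across the whole page source. The sketch below is illustrative only: the HTML fragment is an assumed rendering of Table 1, and `ROW_PATTERN` and `extract_rows` are hypothetical names, not the steps set out in the patent.

```python
import re

# Hypothetical page source: an assumed rendering of Table 1 as an HTML table.
PAGE_SOURCE = """
<table>
  <tr><th colspan="2">vegetable price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""

# Data matching model for one data row: two adjacent <td> cells inside a <tr>.
# The header row uses <th>, so it does not match.
ROW_PATTERN = re.compile(
    r"<tr[^>]*>\s*<td[^>]*>\s*(.*?)\s*</td>\s*<td[^>]*>\s*(.*?)\s*</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)


def extract_rows(source):
    """Return (name, price) tuples for every two-cell data row in source."""
    return ROW_PATTERN.findall(source)


for name, price in extract_rows(PAGE_SOURCE):
    print(f"{name}: {price}")
```

Because the pattern encodes the row structure rather than any particular vegetable, rows added to the table later would be captured without changing the expression.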

Embodiment 3

[0117] On the price information website, the content of the page at the URL "http://www.feinno.com/commodity-price/016" has changed. The page now provides not only the vegetable price information shown in Table 1 but also a grain price information table with exactly the same structure as Table 1, as shown in Table 4:

[0118] Table 4

[0119]

[0120] Users are still to be provided with the vegetable names listed in the vegetable price information table (Table 1) and the corresponding price of each vegetable: the names "tomato", "green pepper", and "carrot" and their prices "3.50 yuan / 500g", "2.50 yuan / 500g", and "1.50 yuan / 500g".

[0121] From the structural characteristics of Table 1 and the syntax rules of HTML source code, it can be determined that the source code of the vegetable name and vegetable price data to be captured in the web file should at least contain the table tag...
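
When two tables share the same structure, a row pattern alone would also capture the grain rows. The sketch below illustrates one possible form of the step-by-step, multi-stage location mentioned in the abstract: first locate the table whose header cell reads "vegetable price", then match data rows only inside that table. Everything here is an assumption for illustration: the HTML markup, the grain rows (placeholders, since Table 4 is not reproduced), and the names `TABLE_PATTERN`, `HEADER_PATTERN`, `ROW_PATTERN`, and `extract_table_rows`.

```python
import re

# Hypothetical page source: two tables with identical structure. The grain
# rows are invented placeholders, since Table 4 is not reproduced here.
PAGE_SOURCE = """
<table>
  <tr><th colspan="2">vegetable price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
<table>
  <tr><th colspan="2">grain price</th></tr>
  <tr><td>rice</td><td>(placeholder price)</td></tr>
  <tr><td>flour</td><td>(placeholder price)</td></tr>
</table>
"""

FLAGS = re.IGNORECASE | re.DOTALL
# Stage 1: locate each table and capture its inner source (no nested tables assumed).
TABLE_PATTERN = re.compile(r"<table[^>]*>(.*?)</table>", FLAGS)
# Used to decide which located table is the wanted one.
HEADER_PATTERN = re.compile(r"<th[^>]*>\s*(.*?)\s*</th>", FLAGS)
# Stage 2: match two-cell data rows inside the selected table only.
ROW_PATTERN = re.compile(
    r"<tr[^>]*>\s*<td[^>]*>\s*(.*?)\s*</td>\s*<td[^>]*>\s*(.*?)\s*</td>\s*</tr>",
    FLAGS,
)


def extract_table_rows(source, header_text):
    """Return (name, value) rows from the first table whose header equals header_text."""
    for table_body in TABLE_PATTERN.findall(source):
        header = HEADER_PATTERN.search(table_body)
        if header and header.group(1) == header_text:
            return ROW_PATTERN.findall(table_body)
    return []


for name, price in extract_table_rows(PAGE_SOURCE, "vegetable price"):
    print(f"{name}: {price}")
```

Narrowing the source in stages keeps each expression simple; the row pattern is the same one that would work on a single-table page, and only the locating stage has to know how the two tables differ.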


Abstract

The invention provides a directional web data extraction method. The method comprises the following steps: analyzing the source code syntax rules of a web file on the basis of the data structure features with which the web data to be extracted is presented in that file, and then constructing, by means of a regular expression, a data matching model that carries those data structure features; and performing data matching on the source code of the web file and extracting the required web data from the matched portion of the source code, so that the problem of directional web data extraction is solved. The method uses the regular expression as its matching tool, which gives it strong operability for those skilled in the art and favors popularization and application of the method. For web data with more complicated data structure features and higher extraction difficulty, the invention further provides a directional extraction scheme that extracts the required web data step by step through multi-stage location, giving the method stronger adaptability and a wide application range.

Description

Technical field

[0001] The present invention relates to the fields of network technology and search engine technology, and in particular to a method for the directional capture of web data.

Background technique

[0002] With the rapid development of network technology, the World Wide Web has become the carrier that transmits the largest volume of information and data with the highest transmission efficiency.

[0003] Network spiders (also known as web crawlers or web robots) are software that captures web data from the World Wide Web according to established procedures, and they have become the main technology for obtaining information and data from the World Wide Web. The web data described here refers to the titles, text, images, links, tables, and other types of content that web pages use to provide users with valid information. For example, search engines such as Baidu, Google and other search service ...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F17/30

Inventor: 史寿伟李龙向涛李友良

Owner: 重庆超体科技有限公司
