Directional web data extraction method

A webpage data technology applied in the fields of network technology and search engines. It addresses the problems that existing approaches cannot provide directional crawling of webpage data and have limited fields of application, and it achieves simple operation, a wide application range, and savings in storage resources.

Active Publication Date: 2011-05-04
重庆超体科技有限公司

AI Technical Summary

Problems solved by technology

[0009] The first two methods mentioned above can be realized only if the webpage data acquirer reaches a commercial cooperation agreement with the website operator, which requires the acquirer to have strong commercial public relations capabilities. Both methods are therefore limited by commercial cooperation and cannot provide targeted crawling of webpage data from sites other than business partners, so their field of application is very limited.

Examples

Embodiment 1

[0055] In order to provide users with a vegetable price information service, the server that provides the service needs to capture vegetable price data from a professional price information website. The web files of the price information website are in HTML (HyperText Markup Language) format; the URL (Uniform Resource Locator) of the page containing vegetable price information is "http://www.feinno.com/commodity-price/016", and the vegetable prices on this page are presented in the table structure shown in Table 1:

[0056] Table 1

[0057]

[0058] In Table 1, the "Vegetable Price" cell is the header, and the other cells are data cells. Now a "green pepper" price quotation service needs to be provided to users. The server therefore needs to capture the green pepper price data from the webpage and ignore the other webpage data on the page. The process flow is shown in Figure 1, and the specific method is as follows:

[0059]...
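
The steps of this embodiment are cut off in this excerpt. As a rough illustration of the idea only, the following Python sketch captures a single targeted value with a regular expression and ignores everything else on the page. The table markup in SAMPLE_HTML and the function name extract_price are assumptions made for illustration, since the source code of Table 1 is not reproduced here; the names and prices are those quoted in Embodiment 2.

```python
import re
from typing import Optional

# Hypothetical page source: the real markup of Table 1 is not shown in this
# text, so this layout (one <th> header, name/price <td> pairs) is assumed.
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">Vegetable Price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""


def extract_price(html: str, name: str) -> Optional[str]:
    """Return the price cell that follows the cell holding `name`, if any."""
    # Matching model: a data cell containing the target name, immediately
    # followed by a second data cell whose content is captured.
    pattern = re.compile(
        r"<td>\s*" + re.escape(name) + r"\s*</td>\s*<td>\s*(.*?)\s*</td>",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    return match.group(1) if match else None


print(extract_price(SAMPLE_HTML, "green pepper"))  # 2.50 yuan / 500 grams
```

Only the cell adjacent to the requested name is returned, so the other rows never need to be stored.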

Embodiment 2

[0086] In order to provide users with a vegetable price information service, the server that provides the service needs to capture vegetable price data from the webpage "http://www.feinno.com/commodity-price/016" of the price information website described in Embodiment 1; the vegetable prices on this page are presented as shown in Table 1. This time, users are to be given all the vegetable names listed in the table together with the corresponding prices. Therefore, the vegetable names "tomato", "green pepper", "carrot" and the corresponding vegetable prices "3.50 yuan / 500 grams", "2.50 yuan / 500 grams", "1.50 yuan / 500 grams" are the webpage data to be captured. Directional capture is carried out using the method of the present invention; the flow chart is shown in Figure 1, and the specific method is as follows:

[0087] i) According to the data structure characteristics of the web file to be captured, the data matching model built b...
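
The data matching model of step i) is truncated in this excerpt, so the following Python sketch only suggests what such a model might look like. It assumes each data row of the table is a `<tr>` holding exactly two `<td>` cells (name, then price); SAMPLE_HTML, ROW_PATTERN and extract_rows are hypothetical names, not taken from the patent.

```python
import re
from typing import List, Tuple

# Hypothetical page source; the actual markup of Table 1 is assumed.
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">Vegetable Price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""

# Matching model built from the table's structural features: a row made of
# two data cells. The header row uses <th>, so it does not match.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>\s*(.*?)\s*</td>\s*<td>\s*(.*?)\s*</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)


def extract_rows(html: str) -> List[Tuple[str, str]]:
    """Return every (vegetable name, price) pair matched in the page source."""
    return ROW_PATTERN.findall(html)


for name, price in extract_rows(SAMPLE_HTML):
    print(name, "->", price)
```

Because the pattern has two capture groups, `findall` returns one tuple per matched row, which maps directly onto the name and price columns.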

Embodiment 3

[0121] On the price information website, the content of the webpage at the URL "http://www.feinno.com/commodity-price/016" has changed: the page now includes another food price information table with the same table structure as Table 1, as shown in Table 4:

[0122] Table 4

[0123]

[0124] Users are still to be provided with the vegetable price information table (as shown in Table 1), so the vegetable name data listed in Table 1 and the corresponding price data need to be directionally captured from the above webpage, namely the vegetable names "tomato", "green pepper", "carrot" and the three corresponding vegetable prices "3.50 yuan / 500 grams", "2.50 yuan / 500 grams", "1.50 yuan / 500 grams".

[0125] According to the table structure characteristics of Table 1 and the rules of HTML source code, the source code of the vegetable name data and vegetable price data in the web file can be determined.

[0136]

[0137]

[0138]

[0139]

[0140]

[0141]

[0142] ", Table Tag" ", Head tag" ", Data cell tag" "But the characteristics of only ...

Abstract

The invention provides a directional web data extraction method. The method comprises the following steps: analyzing the source code grammar rules of the web files according to the data structure features with which the web data to be extracted is presented in the web files, and then constructing, through a regular expression, a data matching model that carries those data structure features; and matching the model against the source code of the web files and extracting the required web data from the matched part of the source code, so that the problem of directional web data extraction is solved. In the method, the regular expression is used as the matching tool, which is highly operable for technical personnel in the field and helps the method to be popularized and applied. For web data with more complicated data structure features and higher extraction difficulty, the invention further provides a directional extraction scheme that extracts the web data step by step through multistage location, and therefore has stronger adaptability and a wide application range.
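
The multistage location mentioned in the abstract can be pictured with the following Python sketch. It is an assumption-laden illustration rather than the patented procedure: stage 1 locates the table whose header text matches, and stage 2 applies the row pattern only inside that table. The page source, the "Fruit Price" table with its row, and all identifiers are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical page holding two tables with the same structure.
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">Fruit Price</th></tr>
  <tr><td>apple</td><td>4.00 yuan / 500 grams</td></tr>
</table>
<table>
  <tr><th colspan="2">Vegetable Price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""

TABLE_PATTERN = re.compile(r"<table\b[^>]*>.*?</table>", re.IGNORECASE | re.DOTALL)
HEADER_PATTERN = re.compile(r"<th\b[^>]*>\s*(.*?)\s*</th>", re.IGNORECASE | re.DOTALL)
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>\s*(.*?)\s*</td>\s*<td>\s*(.*?)\s*</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)


def extract_from_table(html: str, header: str) -> List[Tuple[str, str]]:
    """Two-stage extraction: find the table by header, then match its rows."""
    # Stage 1: narrow the source code down to the table with the wanted header.
    for table in TABLE_PATTERN.findall(html):
        found = HEADER_PATTERN.search(table)
        if found and found.group(1) == header:
            # Stage 2: match data rows only inside the located fragment.
            return ROW_PATTERN.findall(table)
    return []


print(extract_from_table(SAMPLE_HTML, "Vegetable Price"))
```

Applying the row pattern to the whole page would also return the fruit row; restricting it to the located fragment is what keeps the extraction directional.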

Description

Technical field

[0001] The present invention relates to the fields of network technology and search engine technology, and in particular to a method for directional capture of web data.

[0002] Background technique

[0003] With the rapid development of network technology, the World Wide Web has become the information and data carrier with the largest transmission volume and the highest transmission efficiency, and acquiring information from it has become a hot topic.

[0004] Network spiders (also known as web crawlers or web robots), that is, software that captures web data from the World Wide Web according to established procedures, have become the main technology for obtaining information and data from the World Wide Web. The web data referred to here means the titles, text, images, links, tables and other types of data that are presented in web pages and provide users with valid information. For example, search engines such as Baidu, Google and other search...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 史寿伟李龙向涛李友良
Owner: 重庆超体科技有限公司