Directional web data extraction method

A webpage data technology, applied in the fields of network technology and search engines, that addresses the inability to capture webpage data directionally and the resulting limited field of application, and achieves simple operation, a wide application range, and strong operability.

Active Publication Date: 2012-10-17
重庆超体科技有限公司

AI Technical Summary

Problems solved by technology

[0008] The first two methods mentioned above work only if the webpage data acquirer and the website operator enter into commercial cooperation, which requires the acquirer to have strong commercial public relations capabilities. Because both methods depend on commercial cooperation, they cannot provide targeted crawling of webpage data from sites other than business partners, so their field of application is very limited.


Examples

Embodiment 1

[0051] To provide users with a vegetable price information service, the server that provides the service needs to perform targeted crawling of vegetable price data from a professional price information website. The web pages of the price information website are in HTML (HyperText Markup Language) format. The URL (Uniform Resource Locator) of the web page containing vegetable price information is "http://www.feinno.com/commodity-price/016", and the vegetable prices on this page are presented in the table structure shown in Table 1:

[0052] Table 1

[0053]

[0054] In Table 1, the cell containing "vegetable price" is the header cell, and the other cells are data cells. Suppose a dedicated price quotation service for "green peppers" is to be provided to users. The server then needs to perform targeted crawling of the green pepper price data from the webpage and ignore all other data on the page; the method of the present invention is used f...
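As a rough illustration of this kind of targeted extraction, the sketch below uses a regular expression to pull a single price out of a table. The HTML markup, the `SAMPLE_HTML` constant, and the `extract_price` helper are all assumptions made for the example; the source does not show the page's actual source code, and this is not necessarily the pattern the patent uses.

```python
import re
from typing import Optional

# Hypothetical markup for Table 1; the real page source is not given.
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">vegetable price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""


def extract_price(html: str, item: str) -> Optional[str]:
    """Return the text of the data cell that directly follows the cell naming `item`."""
    pattern = re.compile(
        # The data cell holding the item name ...
        r"<td[^>]*>\s*" + re.escape(item) + r"\s*</td>\s*"
        # ... followed by the data cell whose content we capture.
        r"<td[^>]*>\s*(.*?)\s*</td>",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    return match.group(1) if match else None


green_pepper_price = extract_price(SAMPLE_HTML, "green pepper")
```

Anchoring the pattern on the item name keeps every other cell on the page out of the result, which mirrors the goal of ignoring all data except the green pepper price.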

Embodiment 2

[0082] To provide users with a vegetable price information service, the server that provides the service needs to crawl vegetable price data from the webpage "http://www.feinno.com/commodity-price/016" of the price information website described in Embodiment 1. The vegetable prices on this webpage are presented as shown in Table 1. This time, users are to be given the vegetable names listed in the table together with the corresponding prices. The webpage data to be crawled are therefore the vegetable names "tomato", "green pepper", and "carrot" and the corresponding prices "3.50 yuan / 500 grams", "2.50 yuan / 500 grams", and "1.50 yuan / 500 grams". The method of the present invention is used for targeted crawling; the flow chart is shown in Figure 1, and the specific method is as follows:

[0083] i) According to the data structure characteristics of the web page data to be captured in the w...
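A data matching model for this embodiment might look like the following sketch, which captures every name and price pair in one pass. The markup in `TABLE_1_HTML`, the `ROW_PATTERN` expression, and the `extract_rows` helper are illustrative assumptions; the source text is truncated before it gives the actual regular expression.

```python
import re
from typing import List, Tuple

# Hypothetical markup for Table 1; the real page source is not given.
TABLE_1_HTML = """
<table>
  <tr><th colspan="2">vegetable price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500 grams</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500 grams</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500 grams</td></tr>
</table>
"""

# One table row made of exactly two data cells: (name, price).
# The header row uses <th>, so it does not match this pattern.
ROW_PATTERN = re.compile(
    r"<tr[^>]*>\s*"
    r"<td[^>]*>\s*(.*?)\s*</td>\s*"
    r"<td[^>]*>\s*(.*?)\s*</td>\s*"
    r"</tr>",
    re.IGNORECASE | re.DOTALL,
)


def extract_rows(html: str) -> List[Tuple[str, str]]:
    """Return (name, price) pairs for every two-cell data row in `html`."""
    return ROW_PATTERN.findall(html)


rows = extract_rows(TABLE_1_HTML)
```

Here the structural feature being matched is "a row with two data cells", so the header row is skipped without any special handling.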

Embodiment 3

[0117] On the price information website, the content of the page whose URL is "http://www.feinno.com/commodity-price/016" has changed. The page now provides not only the vegetable price information shown in Table 1 but also a grain price information table with exactly the same structure as Table 1, as shown in Table 4:

[0118] Table 4

[0119]

[0120] Users are still to be given the vegetable names listed in the vegetable price information table (Table 1) and the corresponding prices: the vegetable names "tomato", "green pepper", and "carrot" and their respective prices "3.50 yuan / 500g", "2.50 yuan / 500g", and "1.50 yuan / 500g".

[0121] According to the table structure characteristics and HTML source code syntax rules in Table 1, it can be determined that the source code of the vegetable name data and vegetable price data to be crawled in the web file should at least contain the table tag...
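When two tables share the same structure, a single row pattern would also match the grain rows, so the extraction can be done in stages: locate the right table first, then match rows only inside it. The sketch below is one possible reading of that multistage idea. The page markup, the grain table contents ("rice" and its price), and the `extract_table_rows` helper are hypothetical, since Table 4 and the page source are not reproduced in the source, and the table pattern assumes tables are not nested.

```python
import re
from typing import List, Tuple

# Hypothetical page source with two identically structured tables.
# The grain row is invented sample data; Table 4 is not shown in the source.
PAGE_HTML = """
<table>
  <tr><th colspan="2">vegetable price</th></tr>
  <tr><td>tomato</td><td>3.50 yuan / 500g</td></tr>
  <tr><td>green pepper</td><td>2.50 yuan / 500g</td></tr>
  <tr><td>carrot</td><td>1.50 yuan / 500g</td></tr>
</table>
<table>
  <tr><th colspan="2">grain price</th></tr>
  <tr><td>rice</td><td>2.00 yuan / 500g</td></tr>
</table>
"""

FLAGS = re.IGNORECASE | re.DOTALL
# Stage 1: isolate each table body (assumes no nested tables).
TABLE_PATTERN = re.compile(r"<table[^>]*>(.*?)</table>", FLAGS)
# Stage 2: read the header cell to decide whether this is the wanted table.
HEADER_PATTERN = re.compile(r"<th[^>]*>\s*(.*?)\s*</th>", FLAGS)
# Stage 3: capture (name, price) from two adjacent data cells.
CELL_PAIR_PATTERN = re.compile(
    r"<td[^>]*>\s*(.*?)\s*</td>\s*<td[^>]*>\s*(.*?)\s*</td>", FLAGS
)


def extract_table_rows(html: str, header_text: str) -> List[Tuple[str, str]]:
    """Return (name, price) pairs from the first table whose header equals `header_text`."""
    for table_body in TABLE_PATTERN.findall(html):
        header = HEADER_PATTERN.search(table_body)
        if header and header.group(1) == header_text:
            return CELL_PAIR_PATTERN.findall(table_body)
    return []


vegetable_rows = extract_table_rows(PAGE_HTML, "vegetable price")
```

Narrowing the search to one table before matching rows is what keeps the grain prices out of the vegetable results.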

Abstract

The invention provides a directional web data extraction method. The method comprises the following steps: analyzing the source code syntax rules of the web files based on the data structure features with which the web data to be extracted is presented in those files; constructing, through a regular expression, a data matching model that embodies those data structure features; matching the model against the source code of the web files; and extracting the required web data from the matched portion of the source code, thereby solving the problem of directional web data extraction. Because the method uses regular expressions as the matching tool, it is highly operable for those skilled in the art, which favors its popularization and application. For web data with more complicated data structure features and higher extraction difficulty, the invention further provides a directional extraction scheme that extracts the required web data step by step through multistage location, giving the method stronger adaptability and a wide application range.

Description

Technical field

[0001] The present invention relates to the fields of network technology and search engine technology, and in particular to a method for directional capture of web data.

Background technique

[0002] With the rapid development of network technology, the World Wide Web has become the information and data carrier with the largest transmission volume and the highest transmission efficiency, and it has become a hot topic.

[0003] Web spiders (also known as web crawlers or web robots), that is, software that captures web data from the World Wide Web according to established procedures, have become the main technology for obtaining information and data from the World Wide Web. The web data described here refers to titles, text, images, links, tables, and other types of data that are presented in web pages and provide users with valid information. For example, search engines such as Baidu, Google and other search service ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 史寿伟李龙向涛李友良
Owner: 重庆超体科技有限公司