An entity-based bottom-up web data extraction method

A bottom-up data extraction technology, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses the problems of reduced recall, neglected connections, and unsuitability for complex pages.

Active Publication Date: 2011-11-30
NORTHEASTERN UNIV

AI Technical Summary

Problems solved by technology

Later, the improved MDR II method uses tree structure information to locate nodes, but neither MDR nor MDR II can escape an excessive dependence on the page's DOM tree. When the attributes under a given tag change, neither method can guarantee extraction accuracy.
Therefore, this type of method suits pages with a simple structure but not complex pages.
[0006] In recent years, some studies have proposed new methods based on these typical technologies, but most of them deriv...

Examples

Embodiment

[0042] Step 1. Select the Web data page: select the popular air ticket booking website "Taobao Air Ticket" (http://ipiao.taobao.com/2010/home.htm?TBG=66409.71436.28&ad_id=&am_id=&cm_id=1400381961b2c34cffa7&pm_id=) as the data source. Select Shenyang as the origin of the flight, Shenzhen as the destination, and 2011/5/11 as the date, then click Search to open the ticket result page (see attached Figure 4). The HTML source code of this page is the input.

[0043] Step 2. Divide the text: after preprocessing the result page D, divide the text of D to obtain the text sequence S_list.
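
The text-division step can be pictured with a minimal sketch. The excerpt does not say how the page is preprocessed or split, so everything here is an assumption: the class `TextSegmenter`, the function `divide_text`, the choice to skip `script` and `style` content, and the sample markup are all hypothetical. It simply collects the visible text nodes of a page in document order using Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class TextSegmenter(HTMLParser):
    """Collect the visible text nodes of a page in document order."""

    SKIPPED = {"script", "style"}  # assumption: these carry no record data

    def __init__(self):
        super().__init__()
        self.segments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise runs of whitespace
        if text and self._skip_depth == 0:
            self.segments.append(text)


def divide_text(html):
    """Return the ordered list of text segments (a stand-in for the text sequence)."""
    parser = TextSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


# Hypothetical fragment of a result page, not taken from the patent.
sample = (
    "<table><tr><td>CZ6301</td><td>08:05</td><td> 11:50 </td></tr></table>"
    "<script>var x = 1;</script>"
)
print(divide_text(sample))  # ['CZ6301', '08:05', '11:50']
```

Keeping the segments in document order matters for the later steps, which look for repetition in the sequence of labels rather than in the DOM tree.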

[0044] Step 3. Label entity attributes: the extraction rules for the ticket-booking topic are defined as follows:

[0045] First-level rule set R1 / second-level rule set R2

Flight (F)
  \C{4,8}([\w\d]{6})?
  \C{2}Aviation\w{2}\d{4}

Time (T)
  \d{1,2}[:dot]\d{1,2}
  ([01][0-9])|(2[0-4])[:point]([0-5][0-9])|(60) ...
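
As a rough illustration of rule-based attribute labeling, the sketch below tags each text segment with the first attribute whose pattern matches the whole segment. It is not the patent's rule set: the notation above (`\C`, `[:dot]`) is not valid Python regex syntax, and the excerpt does not make clear how the two rule levels relate, so the patterns in `RULES`, the function names, and the sample segments are simplified, hypothetical stand-ins.

```python
import re

# Hypothetical, simplified patterns per attribute. They are NOT the
# patent's rules, whose notation cannot be used directly in Python.
RULES = {
    "F": [  # flight
        re.compile(r"[A-Z0-9]{2}\d{3,4}"),                # e.g. "CZ6301"
        re.compile(r"\w+ Airlines [A-Z0-9]{2}\d{3,4}"),   # airline name + code
    ],
    "T": [  # time
        re.compile(r"\d{1,2}[:.]\d{1,2}"),
        re.compile(r"(?:[01][0-9]|2[0-3])[:.][0-5][0-9]"),
    ],
}


def label_segment(segment):
    """Return the attribute label for a segment, or None if no rule matches."""
    for attribute, patterns in RULES.items():
        for pattern in patterns:
            if pattern.fullmatch(segment):
                return attribute
    return None


def label_sequence(segments):
    """Pair every segment with its label, preserving order."""
    return [(segment, label_segment(segment)) for segment in segments]


# Hypothetical segments, as a text-division step might produce them.
segments = ["CZ6301", "Southern Airlines CZ6301", "08:05", "Book now"]
print(label_sequence(segments))
```

Using `fullmatch` rather than `search` keeps a pattern from firing on a fragment of a longer, unrelated segment; whether the original method matches whole segments or substrings is not stated in the excerpt.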

Abstract

The invention provides an entity-based bottom-up method for extracting Web data and belongs to the field of network data management. The method comprises the following steps: selecting a Web data page; dividing the text; labeling entity attributes; extracting the repeated pattern of the attribute sequence; and simplifying the resulting pattern. With this method, structured data can be extracted from a wider range of complex Web pages, the over-dependence on page structure found in the prior art is effectively avoided, adaptability is good, and accuracy is high.
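
The last two steps listed in the abstract, finding a repeated pattern in the attribute sequence and simplifying it, can be pictured with a deliberately naive sketch. The excerpt does not describe the actual algorithm, so this is only one possible reading: it assumes the label sequence is an exact repetition of a single unit, and it simplifies by collapsing runs of the same label. The function names and the price label `P` are hypothetical.

```python
def shortest_repeating_unit(labels):
    """Return the shortest prefix whose exact repetition reproduces `labels`."""
    n = len(labels)
    for size in range(1, n + 1):
        if n % size == 0 and labels[:size] * (n // size) == labels:
            return labels[:size]
    return labels


def simplify(unit):
    """Collapse runs of one label, marking a repeated label with '+'."""
    simplified = []
    i = 0
    while i < len(unit):
        j = i
        while j < len(unit) and unit[j] == unit[i]:
            j += 1
        simplified.append(unit[i] + ("+" if j - i > 1 else ""))
        i = j
    return simplified


# Hypothetical label sequence for three flight records:
# flight, departure time, arrival time, price.
labels = ["F", "T", "T", "P"] * 3
unit = shortest_repeating_unit(labels)
print(unit)            # ['F', 'T', 'T', 'P']
print(simplify(unit))  # ['F', 'T+', 'P']
```

A real result page would contain noise segments and records with missing attributes, so an exact-repetition test like this one would need to be replaced by something more tolerant.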

Description

Technical field

[0001] The invention belongs to the field of network data management, and in particular relates to a bottom-up extraction method for Web data pages.

Background

[0002] As the amount of network information grows, web pages with a single structure are no longer sufficient to carry data, and the number of web pages with varied themes and complex structures continues to increase on today's Internet. While this broadens people's horizons, it also brings many problems to the application of Web data. The complexity of web pages and the amount of noise they contain are increasing day by day, and even pages with the same theme and data source differ greatly, making it more and more difficult to analyze and integrate the high-quality structured data in web pages effectively; the utilization rate of information has declined noticeably. Therefore, it is becoming increasingly important to extract information from complex and diverse Web pages and convert t...

Application Information

IPC(8): G06F17/30
Inventor 申德荣, 刘桐, 寇月, 聂铁铮, 于戈
Owner NORTHEASTERN UNIV