Webpage data extracting method based on extensible language query

A web page data extraction and language query technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problems of low extraction efficiency and the difficulty of representing extracted results without losing attributes.

Status: Inactive; Publication Date: 2011-03-09
NORTHEASTERN UNIV
2 Cites, 31 Cited by
AI Technical Summary

Problems solved by technology

[0003] Extracting data records from a page based on the HTML tag tree or the Document Object Model (DOM) tree is a relatively common method. Before extraction, the Web page is first converted into a DOM tree based on its tags; data are then extracted using the structural features of the DOM tree together with automatic or semi-automatic extraction rules. The page-structure-based method first specifies the structure of the part of the page that contains the data, and then finds similar parts in the page according to this structure as the extraction result. For pages with a simple structure this gives good results, but if the page's DOM tree has a complex structure and the data area contains too many noise nodes, the results are poor, and recognition of data in nested structures is not supported;
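The structure-based approach described in [0003] can be illustrated with a minimal Python sketch. It is not the method of any particular prior system: the `Node`, `TreeBuilder`, `signature`, and `find_records` names, the sample page, and the "largest group of identically structured siblings" heuristic are all assumptions made for illustration. The sketch builds a tag tree from the page and treats repeated sibling subtrees with the same structure as data records.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """One element of a minimal tag tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes well-nested tags and no void elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def find_records(node):
    """Return the largest group of same-signature siblings in the tree."""
    best = []
    counts = Counter(signature(c) for c in node.children)
    if counts:
        sig, n = counts.most_common(1)[0]
        if n > 1:
            best = [c for c in node.children if signature(c) == sig]
    for child in node.children:
        candidate = find_records(child)
        if len(candidate) > len(best):
            best = candidate
    return best


def leaf_texts(node):
    """Collect the text of every leaf element under a record node."""
    if not node.children:
        return [node.text]
    out = []
    for child in node.children:
        out.extend(leaf_texts(child))
    return out


# Illustrative page, not taken from the patent.
PAGE = """
<html><body>
<div><h1>Books</h1></div>
<ul>
<li><b>Title A</b><i>Author A</i></li>
<li><b>Title B</b><i>Author B</i></li>
<li><b>Title C</b><i>Author C</i></li>
</ul>
</body></html>
"""

builder = TreeBuilder()
builder.feed(PAGE)
records = [leaf_texts(r) for r in find_records(builder.root)]
print(records)
```

Because records are matched by exact structural signature, an extra element inside a single list item would change that item's signature and drop it from the group, which is the sensitivity to noise nodes that the paragraph above points out.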
[0004] Techniques that extract data based on visual information in web pages mainly exploit the positional conventions of web design, that is, where users habitually look for content, to extract data from the corresponding locations. ViDRE, from Microsoft Research Asia, proposes an extraction method based on visual features. This method helps to some extent, but, on the one hand, when the page has no obvious visual features, the efficiency of vision-based extraction is seriously reduced; on the other hand, the vision-based approach is suited to extracting data from a single page, and is very inefficient when extracting large amounts of data with the same structure from many different pages;
[0005] The above methods are only applicable to web pages with simple data structures. If the data in a web page are hierarchically related, the extracted results are difficult to represent or attributes go missing, so page content with complex data structures is difficult to handle. Secondly, these methods generate the extracted result data directly after initialization, so an attribute recognition error is difficult to correct in time. In addition, these methods operate relatively independently and are difficult to combine with existing database systems, so unified management of web page data is lacking.


Examples

Embodiment Construction

[0080] The present invention is described in further detail below with reference to the accompanying drawings and an embodiment:

[0081] Figure 1 shows a Web data page of an e-book selling website. The flow of the method of the present invention is shown in Figure 2, and the steps are as follows:

[0082] Step 1: Determine the data pattern S to which the data content extracted from the web page corresponds, where the data entity name E is "book", and the attribute names and attribute data types contained in the attribute set are shown in Table 1:

[0083] Table 1: Attribute names and attribute data types contained in the data entity "Book"

[0084]

Columns: attribute 1, attribute 2, attribute 3, attribute 4, attribute 5, attribute 6, attribute 7, attribute 8, attribute 9

name: book title, author, publishing house, published date, book introduction, original price, current price, D...
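As a minimal sketch, the data pattern S from Step 1 could be held in a small Python structure such as the one below. The class names `Attribute` and `DataPattern` are hypothetical and do not come from the patent. Only the seven attribute names visible in Table 1 are listed, and the data types are left unset because the table is truncated before they appear.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """One attribute of a data entity (hypothetical helper class)."""
    name: str
    # None means the data type is not visible in the truncated Table 1.
    data_type: Optional[str] = None


@dataclass
class DataPattern:
    """Data pattern S: an entity name E plus its attribute set."""
    entity_name: str
    attributes: List[Attribute] = field(default_factory=list)

    def attribute_names(self) -> List[str]:
        return [a.name for a in self.attributes]


# Attribute names as listed in Table 1; the remaining entries are cut off.
BOOK_PATTERN = DataPattern(
    entity_name="book",
    attributes=[
        Attribute(name)
        for name in (
            "book title",
            "author",
            "publishing house",
            "published date",
            "book introduction",
            "original price",
            "current price",
        )
    ],
)

print(BOOK_PATTERN.attribute_names())
```

Keeping the pattern separate from the page-specific paths means the same entity definition could be reused for other sites that sell books, with only the paths changing.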

Abstract

A webpage data extraction method based on extensible language query, belonging to the technical field of computer databases. The method comprises the following steps: determining the pattern structure that corresponds to the data content to be extracted from the Web page; locating the data area, the data units, and the attribute text in the Web page; labelling the semantics of the attribute text; generating the node path of the data unit; computing the path expressions used to extract the attribute values; generating the XML query statement for extracting the data; and extracting the data by means of the XML query statement. The method generates precise XML query statements, which guarantees their correctness; it is highly general and can be combined seamlessly with existing methods; and it can adapt to more complex query result output.
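The later steps of the abstract (node path, attribute path expressions, query generation, extraction) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented algorithm: the sample page, the labelled attribute texts, and the helper names `node_path`, `build_query`, and `extract` are invented for the example, and the data unit is assumed to have been located already. Since the Python standard library has no XQuery engine, the generated XQuery-style statement is only built as a string, and the extraction itself is run with the limited XPath support of `xml.etree.ElementTree` as a stand-in.

```python
import xml.etree.ElementTree as ET

# Illustrative well-formed page, not taken from the patent.
PAGE = """<html><body>
<ul>
<li><b>Title A</b><i>Author A</i><span>12.50</span></li>
<li><b>Title B</b><i>Author B</i><span>8.00</span></li>
</ul>
</body></html>"""

root = ET.fromstring(PAGE)
# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def node_path(node, stop):
    """Tag path from `stop` (exclusive) down to `node`, e.g. 'body/ul/li'."""
    steps = []
    while node is not stop:
        steps.append(node.tag)
        node = parent_of[node]
    return "/".join(reversed(steps))


# Assumed input: one located data unit and its semantically labelled texts.
sample_unit = root.find("./body/ul/li")
labels = {"book title": "Title A", "author": "Author A", "current price": "12.50"}

# Node path of the data unit, relative to the document root.
unit_path = node_path(sample_unit, root)

# Path expression of each attribute, relative to the data unit.
attr_paths = {}
for name, text in labels.items():
    for node in sample_unit.iter():
        if node is not sample_unit and (node.text or "").strip() == text:
            attr_paths[name] = node_path(node, sample_unit)


def build_query(root_tag, unit_path, attr_paths, entity="book"):
    """Assemble an XQuery-style FLWOR statement as text (not executed here)."""
    lines = [f"for $u in /{root_tag}/{unit_path}", f"return <{entity}>"]
    for name, path in attr_paths.items():
        tag = name.replace(" ", "_")
        lines.append(f"  <{tag}>{{ $u/{path}/text() }}</{tag}>")
    lines.append(f"</{entity}>")
    return "\n".join(lines)


def extract(root, unit_path, attr_paths):
    """Apply the same paths with ElementTree to every matching data unit."""
    rows = []
    for unit in root.findall("./" + unit_path):
        rows.append(
            {name: unit.findtext("./" + path) for name, path in attr_paths.items()}
        )
    return rows


query = build_query(root.tag, unit_path, attr_paths)
rows = extract(root, unit_path, attr_paths)
print(query)
print(rows)
```

Because the paths are derived from a single labelled sample and then applied to every unit that shares the node path, a labelling error can be corrected by editing the generated query rather than re-running the whole extraction.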

Description

Technical field

[0001] The invention belongs to the technical field of computer databases, and in particular relates to a method for extracting web page data based on an extensible language query.

Background art

[0002] With the continuous development of the Web and the rapid growth of data on it, the demand for Web data in various application fields is increasing. Although the Web contains a large amount of structured and semi-structured data, these data are mainly provided to users through browsers in the form of HyperText Markup Language (HTML), and are difficult to use directly in applications such as data mining and data integration. Efficiently and accurately extracting structured and semi-structured data from large numbers of Web pages is therefore becoming more and more important. Typical extraction methods for Web data fall into three main categories: methods based on the HTML tag tree or the Document Object Model (DOM) tree; methods b...

Application Information

IPC(8): G06F17/30
Inventor: 聂铁铮 (Nie Tiezheng), 于戈 (Yu Ge), 王波涛 (Wang Botao), 岳德君 (Yue Dejun)
Owner: NORTHEASTERN UNIV