Webpage data extracting method based on extensible language query

A Web page data extraction and language query technology, applied in the fields of electrical digital data processing, special data processing applications, and instruments. It addresses low extraction efficiency and the difficulty of representing extracted results without losing attributes.

Status: Inactive
Publication Date: 2011-03-09

NORTHEASTERN UNIV

2 Cites, 31 Cited by

Problems solved by technology

[0003] Extracting data records from a page based on its HTML tag tree or Document Object Model (DOM) tree is a relatively common approach. Before extraction, the Web page is converted into a DOM tree according to its tags; data is then extracted using structural features of the DOM tree together with automatic or semi-automatic extraction rules. A page-structure-based method first specifies the structure of the part of the page that contains the data, then finds the parts of the page that resemble this structure and returns them as the extraction result. This works well for pages with a simple structure, but when the DOM tree is complex and the data area contains many noise nodes, results are poor, and nested data structures cannot be recognized.
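
The structure-based approach described in [0003] can be illustrated with a minimal sketch. It is not the method of any cited system: the page fragment, the `signature` heuristic, and the `find_records` helper are all assumptions made for illustration, and the fragment is deliberately well-formed so that the Python standard library can parse it (real HTML usually needs a more tolerant parser).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <ul id="results">
    <li><b>Book A</b><span>10.00</span></li>
    <li><b>Book B</b><span>12.50</span></li>
    <li><b>Book C</b><span>8.00</span></li>
  </ul>
</body></html>
"""


def signature(node):
    """Structural signature: the node's tag plus the tags of its children."""
    return (node.tag, tuple(child.tag for child in node))


def find_records(root, min_repeat=2):
    """Return the largest group of sibling nodes sharing one signature.

    Leaf nodes are ignored, so simple noise such as a lone link is skipped.
    """
    best = []
    for parent in root.iter():
        counts = Counter(signature(c) for c in parent if len(c) > 0)
        for sig, n in counts.items():
            if n >= min_repeat and n > len(best):
                best = [c for c in parent if signature(c) == sig]
    return best


root = ET.fromstring(PAGE)
records = [
    [(child.text or "").strip() for child in rec]
    for rec in find_records(root)
]
print(records)
```

A heuristic this simple also shows the limitation noted above: it relies on records being direct siblings with identical child tags, so noise nodes inside the data area or nested records would defeat it.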

[0004] Techniques that extract data based on visual information in Web pages mainly exploit layout conventions, that is, where Web designs habitually place the content that users browse, and extract data from the corresponding locations. ViDRE, from Microsoft Research Asia, proposes an extraction method based on visual features. This approach helps to some extent, but it has two drawbacks. On the one hand, when a page has no obvious visual features, the efficiency of vision-based extraction drops sharply; on the other hand, vision-based extraction suits a single page, and is very inefficient for extracting large amounts of data that share one structure across different pages.

[0005] The above methods apply only to Web pages with simple data structures. First, if the data in a page is hierarchical, the extracted results are hard to represent or lose attributes, so page content with complex data structures is difficult to handle. Second, these methods generate the extracted result data directly after initialization, so an attribute recognition error is hard to correct in time. In addition, these methods operate relatively independently and are hard to combine with existing database systems, so Web page data lacks unified management.

Embodiment Construction

[0080] The present invention is described in further detail below with reference to the accompanying drawings and an embodiment:

[0081] Figure 1 shows a Web data page from a website that sells electronic books. The flow of the method of the present invention is shown in Figure 2, and its steps are as follows:

[0082] Step 1: Determine the data pattern S that corresponds to the data content to be extracted from the Web page. Here the data entity name E is "book", and the attribute names and attribute data types in the attribute set are listed in Table 1:

[0083] Table 1: Attribute names and attribute data types of the data entity "book"

[0084]

Attribute:  attribute 1 | attribute 2 | attribute 3 | attribute 4 | attribute 5 | attribute 6 | attribute 7 | attribute 8 | attribute 9
Name:       book title | author | publishing house | published date | book introduction | original price | current price | D...
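
As a rough illustration of Step 1, the data pattern S can be written down as a small data structure. This is only a sketch: the class names `Attribute` and `DataPattern` are hypothetical, only the seven attribute names legible in Table 1 are included, and every data type shown is an assumption, because the source table is truncated before the data types appear.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str  # assumed value; the data types are not visible in the source


@dataclass(frozen=True)
class DataPattern:
    entity: str                         # data entity name E
    attributes: Tuple[Attribute, ...]   # attribute set of the entity

    def attribute_names(self):
        return [a.name for a in self.attributes]


# Names come from Table 1; the types are illustrative guesses only.
BOOK = DataPattern(
    entity="book",
    attributes=(
        Attribute("book title", "string"),
        Attribute("author", "string"),
        Attribute("publishing house", "string"),
        Attribute("published date", "date"),
        Attribute("book introduction", "string"),
        Attribute("original price", "decimal"),
        Attribute("current price", "decimal"),
    ),
)

print(BOOK.entity, BOOK.attribute_names())
```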

Abstract

A Web page data extraction method based on extensible language query, belonging to the technical field of computer databases. The method comprises the following steps: determining the pattern structure that corresponds to the data content to be extracted from the Web page; locating the data area, the data units, and the attribute text in the Web page; labeling the semantics of the attribute text; generating the node path of the data units; computing the path expressions used to extract attribute values; generating the XML query statement for extracting the data; and extracting the data with that XML query statement. The method generates precise XML query statements, which ensures their correctness; it is highly general and can be combined seamlessly with existing methods; and it can accommodate more complex query result output.
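
The last three steps of the abstract (path expressions, query generation, extraction) can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure. The page fragment, the data unit path `UNIT_STEPS`, the attribute paths `ATTR_PATHS`, and the helpers `build_xquery` and `extract` are all hypothetical, and the locating and labeling steps are taken as already done. The generated string follows XQuery FLWOR syntax, but the Python standard library has no XQuery engine, so the sketch evaluates the same paths with the limited XPath support of `xml.etree.ElementTree` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
<div id="list">
  <div class="item"><h3>Book A</h3><p class="author">Ann</p><span class="price">10.00</span></div>
  <div class="item"><h3>Book B</h3><p class="author">Bob</p><span class="price">12.50</span></div>
</div>
</body></html>"""

# Assumed results of the locating and labeling steps:
# the data unit node path (relative to the root element) and
# one relative path expression per labeled attribute.
UNIT_STEPS = "body/div[@id='list']/div[@class='item']"
ATTR_PATHS = {
    "title": "h3",
    "author": "p[@class='author']",
    "price": "span[@class='price']",
}


def build_xquery(doc_uri, root_tag, unit_steps, attr_paths, entity="book"):
    """Assemble an XQuery FLWOR expression as text (it is not executed here)."""
    fields = "\n".join(
        f"    <{name}>{{data($u/{path})}}</{name}>"
        for name, path in attr_paths.items()
    )
    return (
        f'for $u in doc("{doc_uri}")/{root_tag}/{unit_steps}\n'
        f"return\n  <{entity}>\n{fields}\n  </{entity}>"
    )


def extract(root, unit_steps, attr_paths):
    """Evaluate the same paths with ElementTree as a stand-in for a query engine."""
    return [
        {name: (unit.findtext(path) or "").strip() for name, path in attr_paths.items()}
        for unit in root.findall(unit_steps)
    ]


root = ET.fromstring(PAGE)
query = build_xquery("page.xml", root.tag, UNIT_STEPS, ATTR_PATHS)
rows = extract(root, UNIT_STEPS, ATTR_PATHS)
print(query)
print(rows)
```

Keeping the data unit path separate from the per-attribute relative paths means a mislabeled attribute can be corrected by editing one path expression and regenerating the query, rather than rerunning the whole extraction.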

Description

Technical field

[0001] The invention belongs to the technical field of computer databases, and in particular relates to a method for extracting Web page data based on extensible language query.

Background

[0002] With the continuous development of the Web and the rapid growth of the data it holds, demand for Web data is increasing across application fields. Although the Web contains a large amount of structured and semi-structured data, this data is mainly delivered to users through browsers as HyperText Markup Language (HTML), and is difficult to use directly in applications such as data mining and data integration. How to extract structured and semi-structured data from large numbers of Web pages efficiently and accurately is therefore becoming more and more important. Typical extraction methods for Web data fall into three main categories: methods based on the HTML tag tree or Document Object Model (DOM) tree; methods b...

Application Information

IPC(8): G06F17/30

Inventors: 聂铁铮, 于戈, 王波涛, 岳德君

Owner: NORTHEASTERN UNIV
