Method and device for extracting webpage frame information

A page information extraction method, applied in the fields of special data processing applications, instruments, and electronic digital data processing. It addresses the problems that existing extraction cannot meet accuracy and information recall requirements, cannot be applied on a large scale, and incurs high labor costs.

Active Publication Date: 2012-12-26
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

The existing template-based method not only incurs high labor costs but also requires the mining objects to share the same structural characteristics in the page, so labor cost and the need for consistent page structure prevent it from being applied on a large scale.
When the number of mining objects is huge and the page structure varies, for example when obtaining geographic point-of-interest data for entities across the entire network, the existing template-based structured information extraction methods cannot meet the requirements for extraction accuracy and information recall rate.



Examples


Embodiment 1

[0119] Figure 1 is a flow chart of the page information extraction method provided in this embodiment. As shown in Figure 1, the method includes:

[0120] Step S101, obtaining webpages of the whole network.

[0121] A web crawler is used to crawl webpages on the Internet, recording at least the URL and the source code of each webpage. For example, the URL "http://www.hdhospital.com/OverView.aspx" is a page on the website of Beijing Haidian Hospital. The web crawler fetches this page, records its URL, and obtains the corresponding page source code (such as HTML code).
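Step S101 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the crawler described by the patent: the names PageRecord, fetch_html and crawl are hypothetical, and the fetch function is passed in as a parameter so that the record-keeping logic can be exercised without network access.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List
from urllib.request import urlopen


@dataclass
class PageRecord:
    url: str     # address the page was fetched from
    source: str  # raw page source (e.g. HTML)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it as text (requires network access)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str] = fetch_html) -> List[PageRecord]:
    """Fetch each URL and keep the (url, source) pair; skip pages that fail."""
    records = []
    for url in urls:
        try:
            records.append(PageRecord(url=url, source=fetch(url)))
        except OSError:
            # Unreachable page (urllib errors derive from OSError).
            # A real crawler would log and retry; this sketch just skips it.
            continue
    return records
```

A production crawler would also need link discovery, politeness limits and deduplication, none of which are shown here.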

[0122] Step S102, obtaining the home page or contact page of the website corresponding to the web page.

[0123] The site home page can be obtained by any one of methods A to C listed below, or by any combination of them:

[0124] Method A: Take the domain name address out of the URL of the web page, perform jump pro...
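The first part of Method A, taking the domain name address out of a page URL, might look like the sketch below. The function name site_root is hypothetical, and only the domain-extraction step is shown because the rest of Method A is cut off in this text.

```python
from urllib.parse import urlsplit


def site_root(page_url: str) -> str:
    """Return scheme://host/ for a page URL, as a candidate home page address."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    return f"{parts.scheme}://{parts.netloc}/"


# The example page from paragraph [0121] maps to its site root.
print(site_root("http://www.hdhospital.com/OverView.aspx"))
# prints http://www.hdhospital.com/
```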

Embodiment 2

[0155] Figure 3 is a flow chart of the page information extraction method provided in this embodiment. As shown in Figure 3, the method includes:

[0156] Step S301, acquiring webpages of the whole network.

[0157] This step is the same as step S101 in the first embodiment and is not repeated here.

[0158] Step S302, analyzing the web pages one by one.

[0159] The webpages of the whole network obtained in step S301 are analyzed one by one. For each page, either step S303 is executed and then step S307, or steps S304 to S306 are executed and then step S307.

[0160] Step S303, obtaining the home page or contact page of the website corresponding to the web page.

[0161] The process of this step is the same as that of step S102 in the first embodiment and is not repeated here. The obtained site home page or contact page is then added to the home page or contact page library.

[0162] Step S304, parsing the web page into a document object model tree, performing visual block processing on the...
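The first half of step S304, parsing page source into a document object model tree, can be illustrated with Python's standard html.parser module. This is a simplified sketch under assumptions: Node, TreeBuilder and parse_dom are hypothetical names, the void-tag list is partial, and the visual block processing is not shown because its description is cut off in this text.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag: str, parent: Optional["Node"] = None):
        self.tag = tag
        self.parent = parent
        self.children: List["Node"] = []
        self.text = ""  # text directly inside this element


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_dom(source: str) -> Node:
    """Build a simple element tree from HTML source."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Real-world pages are often malformed, so a production system would more likely rely on a tolerant HTML parser library than on this hand-rolled builder.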

Embodiment 3

[0217] Figure 6 is a schematic diagram of the page information extraction device provided in this embodiment. As shown in Figure 6, the device includes:

[0218] The web page acquisition module 601, configured to acquire webpages of the whole network.

[0219] A web crawler is used to crawl webpages on the Internet, recording at least the URL and the source code of each webpage. For example, the URL "http://www.hdhospital.com/OverView.aspx" is a page on the website of Beijing Haidian Hospital. The web crawler fetches this page, records its URL, and obtains the corresponding page source code (such as HTML code).

[0220] The site structure analysis module 602, configured to obtain the site home page or contact page corresponding to the web page, including:

[0221] The site home page obtaining sub-module 6021, configured to obtain the site home page corresponding to the web page.

[0222] The conta...



Abstract

The invention provides a method and a device for extracting webpage information. The method comprises: S1, acquiring webpages of the whole network; S2, acquiring the website home page or related page corresponding to each webpage; S3, extracting point-of-interest names and the corresponding address information from the website home page or related page; and S4, correlating the extracted point-of-interest names with the corresponding address information to obtain structured information. Compared with the prior art, the method exploits the organizational structure characteristics and the semantic characteristics of information about entity organizations on the Internet. Information related to an entity organization is extracted from the website home page or related page, and structured geographic location information is obtained by verifying, integrating and correlating data from multiple sources, which improves information accuracy. In addition, information recall can be carried out automatically for entity organizations across the whole Internet, which lowers labor cost and raises the efficiency of information recall.
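As a rough illustration of step S4, the sketch below groups extracted (name, address) pairs by name and keeps the address seen most often across sources. The majority-vote rule, the function name correlate and the record layout are assumptions made for this example; the text here does not specify how verification and integration are actually performed.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def correlate(observations: Iterable[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Merge (name, address) pairs into one structured record per name.

    Assumption: the address reported by the most sources wins.
    """
    by_name: Dict[str, Counter] = defaultdict(Counter)
    for name, address in observations:
        by_name[name.strip()][address.strip()] += 1

    records = []
    for name, counts in by_name.items():
        address, support = counts.most_common(1)[0]
        records.append({"name": name, "address": address, "support": support})
    return records
```

The support count gives a simple confidence signal: a record backed by a single page could be held back for further verification.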

Description

【Technical field】

[0001] The invention relates to the technical field of Internet information processing, in particular to a method and device for extracting page information.

【Background technique】

[0002] With the continuous development of the Internet and information technology, the Internet has become the main source of people's daily access to information. Since web pages are increasing exponentially every day, in order to enable users to quickly and accurately obtain the information they are interested in, information extraction is usually performed on these massive page data first. The task of information extraction is to structure the information contained in the text, so that people can obtain the information they need like querying a database. For example, the method of information extraction can be used to extract the name, address, contact number and other contact information of the entity contained in the webpage, and obtain the data of geographical points of ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 王松
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD