Method and device for extracting page information

A page information extraction method, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the shortcomings of existing approaches: failure to meet accuracy and information recall requirements, inability to scale, and high labor costs.

Active Publication Date: 2014-01-15
BEIJING BAIDU NETCOM SCI & TECH CO LTD
4 Cites, 30 Cited by

AI Technical Summary

Problems solved by technology

The existing approach not only incurs high labor costs but also requires the mining objects to share the same structural characteristics across pages. Labor costs and the need for consistent page structure therefore prevent it from being applied on a large scale.
When the number of mining objects is huge and page structures vary, for example when obtaining geographic point-of-interest data for entities across the entire web, existing template-based structured information extraction methods cannot meet the requirements for extraction accuracy and information recall.

Examples

Embodiment 1

[0109] Figure 1 is a flow chart of the page information extraction method provided in this embodiment. As shown in Figure 1, the method includes:

[0110] Step S101: obtain web pages from the whole network.

[0111] A web crawler crawls web pages on the Internet, capturing at least the URL and the source code of each page. For example, the URL "http://www.hdhospital.com/OverView.aspx" is a page on the website of Beijing Haidian Hospital. The web crawler fetches this page, records its URL, and obtains the corresponding page source code (such as HTML code).
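As an illustration of step S101, the following is a minimal sketch of a crawler that keeps the two items the text requires for each page: its URL and its source code. The names `CrawledPage`, `fetch_html`, and `crawl` are hypothetical, UTF-8 decoding is an assumption, and link discovery, politeness rules, and deduplication are left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List
from urllib.request import urlopen


@dataclass
class CrawledPage:
    url: str     # address the page was fetched from
    source: str  # raw page source, e.g. HTML code


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it (UTF-8 is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str] = fetch_html) -> List[CrawledPage]:
    """Fetch each URL and record the (url, source) pair; skip failures."""
    pages: List[CrawledPage] = []
    for url in urls:
        try:
            pages.append(CrawledPage(url=url, source=fetch(url)))
        except OSError:
            # urllib's network errors derive from OSError; skip this page.
            continue
    return pages
```

The `fetch` parameter is injectable so the recording logic can be exercised without network access, for example by passing a function that returns canned HTML.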

[0112] Step S102: parse the acquired web pages one by one into document object model (DOM) trees, and perform visual block processing on each web page according to the size and position of the page tags and the CSS information, obtaining the visual blocks of the web page.

[0113] The web pages acquired in step S101 are divided into blocks base...
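The parsing and blocking idea in step S102 can be sketched as follows. This is a simplified stand-in, not the method of the source: the source segments pages by the rendered size and position of tags plus CSS information, which requires a layout engine, whereas this sketch only builds a DOM-like tree with Python's standard `html.parser` and treats the deepest block-level elements as blocks. The `BLOCK_TAGS` set and all class and function names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Union

# Assumed stand-in for layout information: tags that usually render as blocks.
BLOCK_TAGS = {"div", "p", "table", "ul", "ol", "li", "td", "section", "header", "footer"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TEXT_IN = {"script", "style"}


class Node:
    def __init__(self, tag: str, parent: Optional["Node"] = None):
        self.tag = tag
        self.parent = parent
        self.children: List[Union["Node", str]] = []  # document order

    def text(self) -> str:
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(p for p in parts if p)


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip() and self.current.tag not in SKIP_TEXT_IN:
            self.current.children.append(data.strip())


def parse_dom(html: str) -> Node:
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def collect_blocks(node: Node, blocks: List[str]) -> bool:
    """Append the text of the deepest block-level elements under node."""
    emitted = False
    for child in node.children:
        if isinstance(child, Node):
            emitted = collect_blocks(child, blocks) or emitted
    if not emitted and node.tag in BLOCK_TAGS and node.text():
        blocks.append(node.text())
        return True
    return emitted


def visual_blocks(root: Node) -> List[str]:
    blocks: List[str] = []
    collect_blocks(root, blocks)
    return blocks
```

One known limitation of this sketch: loose text that sits beside a nested block inside the same parent is dropped. A layout-aware implementation would instead merge or split regions by their rendered geometry.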

Embodiment 2

[0164] Figure 5 is a flow chart of the page information extraction method provided in this embodiment. As shown in Figure 5, the method includes:

[0165] Step S501: obtain web pages from the whole network.

[0166] Step S502: analyze the web pages one by one.

[0167] The web pages obtained in step S501 are analyzed one by one. For each page, either steps S503 to S505 are executed and then step S507, or step S506 is executed and then step S507.

[0168] Step S503: parse the web page into a document object model (DOM) tree, and perform visual block processing on the web page according to the size and position of the page tags and the CSS information, obtaining the visual blocks of the web page.

[0169] Step S504: label the visual blocks based on their semantic features to obtain labeled blocks.
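A toy version of the labeling in step S504 might look like the sketch below. The source does not list which semantic features or labels it uses, so the label names and cue words in `LABEL_CUES` are purely hypothetical; the sketch only shows the shape of the step, in which each visual block receives a label derived from its text.

```python
from typing import Dict, Tuple

# Hypothetical cue words per label; the source does not specify its features.
LABEL_CUES: Dict[str, Tuple[str, ...]] = {
    "contact": ("address", "tel", "phone", "contact"),
    "copyright": ("copyright", "all rights reserved"),
    "navigation": ("home", "sitemap", "about us"),
}


def label_block(text: str) -> str:
    """Return the label whose cue words occur most often, or 'other'."""
    lowered = text.lower()
    scores = {
        label: sum(cue in lowered for cue in cues)
        for label, cues in LABEL_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

Substring matching is deliberately crude here (for example, "hotel" contains "tel"); a realistic labeler would tokenize the text and likely combine lexical cues with structural ones.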

[0170] Step S505: use the pre-built address information tree to analyze the text in the labeled blocks sentence by sentence, and identify the labeled blocks containing...
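To make the sentence-by-sentence analysis concrete, here is a minimal sketch under stated assumptions. The layout of the address information tree is not shown in this excerpt, so the nested-dictionary form of `ADDRESS_TREE`, its sample entries, and the `min_depth` threshold are all hypothetical. The idea illustrated is that a sentence counts as address-bearing when its place names follow a path from a coarse region down to a finer one.

```python
import re
from typing import Dict, List

# Hypothetical address information tree: each key is a place name and each
# value holds the finer-grained places beneath it.
ADDRESS_TREE: Dict[str, dict] = {
    "Beijing": {
        "Haidian District": {},
        "Chaoyang District": {},
    },
    "Shanghai": {
        "Pudong New Area": {},
    },
}


def split_sentences(text: str) -> List[str]:
    """Split on common Latin and CJK sentence punctuation and newlines."""
    return [s.strip() for s in re.split(r"[.;!?。；！？\n]+", text) if s.strip()]


def match_path(sentence: str, tree: Dict[str, dict]) -> List[str]:
    """Descend the tree as far as the sentence's place names allow."""
    path: List[str] = []
    level = tree
    while level:
        hit = next((name for name in level if name in sentence), None)
        if hit is None:
            break
        path.append(hit)
        level = level[hit]
    return path


def is_address_block(text: str,
                     tree: Dict[str, dict] = ADDRESS_TREE,
                     min_depth: int = 2) -> bool:
    """True if any sentence matches at least min_depth levels of the tree."""
    return any(len(match_path(s, tree)) >= min_depth for s in split_sentences(text))
```

Requiring two or more levels is one way to avoid flagging a sentence that merely mentions a city name; the threshold actually used by the source is not given in this excerpt.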

Embodiment 3

[0210] Figure 6 is a schematic diagram of the page information extraction device provided in this embodiment. As shown in Figure 6, the device includes:

[0211] The web page acquisition module 601 is configured to acquire web pages of the entire network.

[0212] A web crawler crawls web pages on the Internet, capturing at least the URL and the source code of each page.

[0213] For example, the URL "http://www.hdhospital.com/OverView.aspx" is a page on the website of Beijing Haidian Hospital. The web crawler fetches this page, records its URL, and obtains the corresponding page source code (such as HTML code).

[0214] The visual block processing module 602 is configured to parse the obtained web pages one by one into document object model (DOM) trees, perform visual block processing on the web pages according to the size and position of the page tags and the cascading style sheet information, ...

Abstract

The invention provides a method and device for extracting page information. The method includes the following steps: S1, web pages of the whole network are obtained; S2, the obtained web pages are parsed one by one into document object model trees, and visual block processing is performed on the web pages according to the size, position, and cascading style sheet information of the page tags to obtain visual blocks; S3, the visual blocks are labeled on the basis of semantic features to obtain labeled blocks; S4, a pre-built address information tree is used to analyze the text in the labeled blocks sentence by sentence, and the labeled blocks that contain address information are identified as address information blocks; S5, the names of points of interest and their corresponding address information are extracted from the address information blocks; S6, the extracted names and address information of the points of interest are associated to obtain structured information. Compared with the prior art, the method and device can automatically mine objects across the whole network that are huge in number and varied in structure, saving labor costs and improving both accuracy and recall.
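The data flow of steps S1 to S6 can be sketched as a small driver that wires the steps together. The individual steps are passed in as functions because their internals are not fully given here; the name `run_pipeline`, its parameter names, and the record fields `url`, `label`, `name`, and `address` are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Page = Tuple[str, str]  # (url, page source), i.e. the output of S1


def run_pipeline(
    pages: Iterable[Page],
    to_blocks: Callable[[str], List[str]],                    # S2: source -> visual blocks
    label: Callable[[str], str],                              # S3: block -> label
    is_address_block: Callable[[str], bool],                  # S4: address tree check
    extract_poi: Callable[[str], Optional[Tuple[str, str]]],  # S5: (name, address)
) -> List[Dict[str, str]]:
    """Run S2 to S6 over crawled pages and return structured records."""
    records: List[Dict[str, str]] = []
    for url, source in pages:
        labeled = [(label(block), block) for block in to_blocks(source)]
        for block_label, block in labeled:
            if not is_address_block(block):
                continue
            poi = extract_poi(block)
            if poi is None:
                continue
            name, address = poi
            # S6: associate the name with its address in one structured record.
            records.append(
                {"url": url, "label": block_label, "name": name, "address": address}
            )
    return records
```

Keeping each step behind a plain function signature means a crude heuristic can be swapped for a stronger implementation without changing the driver.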

Description

Technical field

[0001] The invention relates to the technical field of Internet information processing, and in particular to a method and device for extracting page information.

Background

[0002] With the continuous development of the Internet and information technology, the Internet has become the main source from which people obtain information in daily life. Because the number of web pages is growing exponentially, information extraction is usually performed on this massive page data first so that users can quickly and accurately obtain the information they are interested in. The task of information extraction is to structure the information contained in text, so that people can obtain the information they need as if querying a database. For example, information extraction can be used to extract the name, address, contact number, and other contact information of the entities contained in a web page, and obtain the data of geographical points of ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9537, G06F40/253, G06F40/30
Inventor: 王松
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD