Webpage information extraction method and device thereof

A webpage information extraction method, applied in special data processing applications, instruments, electrical digital data processing, etc., which addresses problems such as the inability to search accurately.

Status: Inactive; Publication Date: 2011-01-12
FUJITSU LTD
Cites: 0; Cited by: 90

AI Technical Summary

Problems solved by technology

However, this text search method cannot accurately find and extract information such as the title of a post and the person who posted it.

Embodiment Construction

[0023] Embodiments of the present invention will be described below with reference to the drawings. Elements and features described in one drawing or one embodiment of the present invention may be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that representation and description of components and processes that are not related to the present invention and known to those of ordinary skill in the art are omitted from the drawings and descriptions for the purpose of clarity.

[0024] FIG. 1 is a schematic flowchart showing a method for extracting web page information according to an embodiment of the present invention.

[0025] As shown in FIG. 1, the web page information extraction method may include the following steps 102-110.

[0026] In step 102, the source code of a web page in the website is acquired. In step 104, a Document Object Model (DOM) tree structure of the web page is establ...
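
As a rough illustration of steps 102 and 104, the following minimal Python sketch builds a simple DOM-like tree from page source using only the standard library. The `Node` and `DomBuilder` classes and the `build_dom` function are hypothetical names introduced here and do not come from the patent; the source is passed in as a string, and fetching it from a website (for example with `urllib.request.urlopen`) is left out.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the (hypothetical) DOM-like tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects while the parser walks the source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self._stack[-1])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self._stack[-1]
            node.text = f"{node.text} {text}".strip()


def build_dom(source):
    """Step 104 (sketch): turn acquired page source into a tree of nodes."""
    builder = DomBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Keeping an explicit stack of open elements means each new node is attached to whichever element is currently open, which is what gives the tree its parent and child structure.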

Abstract

The invention provides a webpage information extraction method and a device thereof. The method comprises the following steps: obtaining the source code of a webpage in a website; establishing a Document Object Model (DOM) tree structure of the webpage from the obtained source code, wherein the DOM tree structure of the webpage comprises one or more nodes; acquiring at least one template of the website, wherein the template also has a DOM tree structure; selecting, from the template, path information of the content to be extracted; and matching that path information against the nodes in the DOM tree structure of the webpage and, if the match succeeds, extracting the content information in the webpage that corresponds to the path information.
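
The steps in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: it assumes well-formed, XHTML-like markup so that the standard-library `xml.etree.ElementTree` can stand in for a DOM parser, it represents path information as a list of (tag, class) steps from the root, and it marks the content to be extracted in the template with a hypothetical `data-extract` attribute. None of these details are specified in the text above.

```python
import xml.etree.ElementTree as ET


def select_paths(template_root, marker="data-extract"):
    """Collect {field name: path} for template nodes carrying the marker.

    A path is a list of (tag, class) steps from the root down to the node.
    The marker attribute is an assumption made for this sketch.
    """
    paths = {}

    def walk(node, prefix):
        path = prefix + [(node.tag, node.get("class"))]
        if node.get(marker):
            paths[node.get(marker)] = path
        for child in node:
            walk(child, path)

    walk(template_root, [])
    return paths


def match_path(page_root, path):
    """Return the text of every page node reached by following the path."""

    def matches(node, step):
        tag, cls = step
        return node.tag == tag and (cls is None or node.get("class") == cls)

    if not matches(page_root, path[0]):
        return []
    frontier = [page_root]
    for step in path[1:]:
        frontier = [child for node in frontier for child in node
                    if matches(child, step)]
    return ["".join(node.itertext()).strip() for node in frontier]


def extract(page_source, template_source):
    """Build both trees, then extract content wherever a template path matches."""
    page_root = ET.fromstring(page_source)
    template_root = ET.fromstring(template_source)
    result = {}
    for field, path in select_paths(template_root).items():
        found = match_path(page_root, path)
        if found:  # extract only when the match succeeds
            result[field] = found
    return result


TEMPLATE = """<html><body><div class="post">
<h1 data-extract="title"/><span class="author" data-extract="author"/>
</div></body></html>"""

PAGE = """<html><body>
<div class="ad"><h1>Buy now</h1></div>
<div class="post"><h1>DOM question</h1><span class="author">alice</span><p>Body text</p></div>
<div class="post"><h1>Second thread</h1><span class="author">bob</span></div>
</body></html>"""

print(extract(PAGE, TEMPLATE))
```

In this example the headline inside the `ad` block is skipped because its parent does not match the `post` step of the template path, which is the kind of noise filtering that plain text search does not provide.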

Description

Technical field

[0001] The invention relates to information extraction, and in particular to a method and device for extracting web page information.

Background

[0002] With the rapid development of the Internet and electronic technology, people are no longer restricted by region and can conveniently exchange all kinds of information online. Because large numbers of users participate, the web pages of a website (for example, a forum) hold a large amount of useful information, which is valuable to individuals and enterprises alike. However, this information is published in a largely unstructured way. The web pages of a website contain both useful information and a great deal of distracting content, such as advertisements, and the genuinely useful information is often drowned out by this noise.

[0003] At present, a common method for extracting web page information is automatic text search technology. This technology cla...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 王新文, 王主龙, 于浩, 孟遥
Owner: FUJITSU LTD