Method and device for extracting webpage text content

The extraction method and device, applied in the field of webpage content extraction, address the low accuracy of existing webpage body content extraction and thereby improve extraction accuracy.

Active Publication Date: 2012-12-05
ALIBABA (CHINA) CO LTD

AI Technical Summary

Problems solved by technology

[0012] The embodiment of the present invention provides a method and device for extracting webpage text



Embodiment Construction

[0020] The main implementation principles, specific implementations, and corresponding beneficial effects of the technical solutions of the embodiments of the present invention are described in detail below with reference to the drawings.

[0021] Figure 1 is a flowchart of a method for extracting webpage body content in an embodiment of the present invention. The specific processing flow is as follows:

[0022] Step 11. Divide the web page whose body content needs to be extracted into content blocks.

[0023] A web page usually describes one or more topics through paragraphs of text. These paragraphs may also contain pictures and links, but such content is not the main body of the page and makes up a relatively small share compared with the body content.

[0024] Dividing a webpage into content blocks refers to dividing the webpage into multiple content blocks according to each container tag pair in the webpage. In othe...
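The block division in Step 11 can be illustrated with a minimal Python sketch. It is not the patented implementation: the set of container tags (`CONTAINER_TAGS`), the names `BlockSplitter` and `split_into_blocks`, and the choice to assign text to the innermost open container are all assumptions made here for illustration, since the source text is truncated before it lists them.

```python
from html.parser import HTMLParser

# Assumption: the source does not say which tags count as container tags.
CONTAINER_TAGS = {"div", "td", "p", "ul", "table"}


class BlockSplitter(HTMLParser):
    """Collects the text found inside each container tag pair."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open containers: (tag, [text pieces])
        self.blocks = []  # finished blocks, in the order their tags close

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.stack.append((tag, []))

    def handle_endtag(self, tag):
        # Close a block only when the end tag matches the innermost container.
        if tag in CONTAINER_TAGS and self.stack and self.stack[-1][0] == tag:
            _, pieces = self.stack.pop()
            text = " ".join(pieces)
            if text:  # skip containers that hold no text of their own
                self.blocks.append(text)

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack:
            # Text belongs to the innermost open container.
            self.stack[-1][1].append(data)


def split_into_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


page = (
    '<div><p>Hello world</p>'
    '<div><a href="#">Home</a> <a href="#">About</a></div></div>'
)
blocks = split_into_blocks(page)
```

Here the paragraph and the inner navigation `div` each become one block, while the outer `div` contributes none because it has no text of its own.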


Abstract

The invention discloses a method and a device for extracting webpage text content. The method includes: dividing a webpage whose text content is to be extracted into content blocks; for each content block, determining its link text length and non-link text length; determining the link text density of the block from those two lengths; and determining that the block is text content of the webpage when its link text density is not higher than a first specified threshold. The method and device address the low accuracy of webpage text content extraction in the prior art.
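The per-block test described in the abstract can be sketched in Python as follows. The abstract does not give the density formula or a threshold value, so two things here are assumptions: the density is taken as link text length divided by total text length, and the default threshold of 0.5 is a placeholder. The names `LinkTextCounter`, `link_text_density`, and `is_body_block` are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts non-whitespace characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.link_len = 0
        self.non_link_len = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        if self.anchor_depth:
            self.link_len += n
        else:
            self.non_link_len += n


def link_text_density(block_html):
    # Assumed formula: link length / (link length + non-link length).
    counter = LinkTextCounter()
    counter.feed(block_html)
    counter.close()
    total = counter.link_len + counter.non_link_len
    return counter.link_len / total if total else 0.0


def is_body_block(block_html, threshold=0.5):
    # "Not higher than" the threshold means body content.
    # The value 0.5 is a placeholder, not taken from the source.
    return link_text_density(block_html) <= threshold


nav = '<div><a href="/">Home</a> | <a href="/a">About</a></div>'
article = (
    '<div>This is a long paragraph of body text. '
    '<a href="/x">more</a></div>'
)
```

With these inputs the navigation block has a high density and is rejected, while the article block is kept. Note that in this sketch a block with no text at all gets density 0 and would pass the test, so a real pipeline would likely filter empty blocks first.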

Description

Technical field

[0001] The present invention relates to the field of Internet information processing technology, and in particular to a method and device for extracting webpage body content.

Background

[0002] With the rapid development of Internet technology, the information on web pages is becoming increasingly abundant. To make better use of it, people keep pursuing technologies that can effectively organize and use online information. However, web pages are not as neat and clean as traditional text: they contain a lot of noise content, such as scripts added to enhance user interaction, navigation links added to facilitate browsing, and advertising links added for commercial reasons.

[0003] Web page body extraction refers to the removal of text link advertisements, pictures, copyrights and other information irrelevant to the body text in the navigation bar and sidebar from the Hyper Text Mark-up La...


Application Information

IPC(8): G06F17/30
Inventor: 朱海军, 姜吉发
Owner: ALIBABA (CHINA) CO LTD