Webpage text content extracting method and device

The extraction method and device are applied in the field of webpage text content extraction. They address the low accuracy of webpage text content extraction in the prior art and improve extraction accuracy.

Active Publication Date: 2012-07-04
CHINA MOBILE COMM GRP CO LTD

AI Technical Summary

Problems solved by technology

[0012] Embodiments of the present invention provide a method and device for extracting webpage text content, to solve the prior-art problem of low accuracy when extracting webpage text content.


Detailed Description of the Embodiments

[0021] The main implementation principles, specific implementation modes, and corresponding beneficial effects of the technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0022] Figure 2 is a flowchart of a method for extracting webpage text content in an embodiment of the present invention. The specific processing flow is as follows:

[0023] Step 21: obtain two webpages belonging to the same hierarchical directory under the same site;

[0024] The embodiment of the present invention observes that different pages in the same hierarchical directory under the same site are usually generated from the same Hypertext Markup Language (HTML) template, so the structures of different webpages in the same hierarchical directory under the same site are identical or similar. For example, different pages of the same hierarch...
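One possible reading of "same hierarchical directory under the same site" is that two URLs share a host and a parent path. The text above does not spell out such a test, so the following Python sketch is only an assumption; the function name `same_site_directory` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def same_site_directory(url_a: str, url_b: str) -> bool:
    """Return True if both URLs share a host and a parent directory.

    Assumption: "same hierarchical directory" means an identical
    parent path; the source text does not define the test precisely.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Host names are case-insensitive, so compare them lowercased.
    if a.netloc.lower() != b.netloc.lower():
        return False
    # Compare the directory part of each path (everything before the file name).
    return posixpath.dirname(a.path) == posixpath.dirname(b.path)


# Hypothetical example URLs.
print(same_site_directory("http://example.com/news/2011/a.html",
                          "http://example.com/news/2011/b.html"))
```

A pair of pages that passes such a check would be a candidate input for Step 21.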


Abstract

The invention discloses a method and device for extracting webpage text content. The method comprises: acquiring two webpages that belong to the same hierarchical directory under the same site; for each acquired webpage, dividing the webpage into content blocks, determining the label (tag) density and/or link density of each content block, selecting the content blocks whose label density and/or link density meets the corresponding preset conditions, and extracting, from the selected content blocks, those whose text content is not consistent with the text content of the content blocks selected from the other webpage; and determining the extracted content blocks as the text content of the webpage. The technical scheme of the invention addresses the prior-art problem of low accuracy when extracting the text content of a webpage.
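The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not say how a page is divided into blocks, how the densities are defined, or what the preset conditions are. Here each block is assumed to be an HTML fragment that has already been split out, label density is assumed to be start tags per text character, link density is assumed to be the share of text inside `<a>` elements, and the preset conditions are assumed to be upper bounds. All names and threshold values are hypothetical.

```python
from html.parser import HTMLParser


class _BlockStats(HTMLParser):
    """Collects tag count, visible text, and linked-text length for one block."""

    def __init__(self):
        super().__init__()
        self.tags = 0          # number of start tags seen
        self.text = []         # non-empty text chunks
        self.link_chars = 0    # characters of text inside <a> elements
        self._in_link = 0      # nesting depth of open <a> elements

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self.text.append(stripped)
        if self._in_link:
            self.link_chars += len(stripped)


def block_stats(block_html):
    """Return (text, label_density, link_density) for one content block."""
    parser = _BlockStats()
    parser.feed(block_html)
    parser.close()
    text = " ".join(parser.text)
    n = max(len(text), 1)  # avoid division by zero for empty blocks
    return text, parser.tags / n, parser.link_chars / n


def select_blocks(blocks, max_label_density=0.05, max_link_density=0.3):
    """Keep the text of blocks under both (assumed) density thresholds."""
    selected = []
    for block in blocks:
        text, label_density, link_density = block_stats(block)
        if text and label_density <= max_label_density and link_density <= max_link_density:
            selected.append(text)
    return selected


def extract_main_text(blocks_a, blocks_b):
    """Return selected blocks of page A whose text does not appear among
    the selected blocks of page B (shared template text is dropped)."""
    other = set(select_blocks(blocks_b))
    return [text for text in select_blocks(blocks_a) if text not in other]
```

In this sketch a navigation bar made of links is rejected by the density thresholds, while a low-density block that is identical on both pages (a copyright footer, for example) is rejected by the cross-page comparison. Exact string equality stands in for "consistent" text content; a real system might use a looser similarity measure.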

Description

Technical field

[0001] The invention relates to the technical field of Internet information processing, and in particular to a method and device for extracting webpage text content.

Background

[0002] With the rapid development of Internet technology, the information on webpages is becoming more and more abundant. To make better use of this information, people keep pursuing technologies that can effectively organize and utilize online information. However, a webpage is not as neat and clean as traditional text: it contains a lot of noise content, such as scripts added to enhance user interaction, navigation links added to facilitate browsing, and advertising links added for commercial reasons. This noise content not only reduces the efficiency of webpage information retrieval but also lowers retrieval accuracy. The accurate extraction of webpage content can not only filter the interference of navigation inform...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 周奕周宇煜吴淑燕
Owner: CHINA MOBILE COMM GRP CO LTD