Web page body text extraction method and apparatus

The invention belongs to the field of web page text technology and concerns methods and devices for extracting web page body text. It addresses the time-consuming, labor-intensive nature of web page information extraction and aims for good generality together with fast, accurate extraction.

Active Publication Date: 2015-12-23
BEIJING CREATIVE & INTERACTIVE DIGITAL TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] In the prior art, a DOM (Document Object Model) tree is often parsed from the HTML web page, and it is tim


Examples


Example Embodiment

[0034] The embodiments of the present invention will be described in detail below in conjunction with the drawings.

[0035] Figure 1 shows a flowchart of a method for extracting web page body text provided by an embodiment of the present invention. Referring to Figure 1, the method includes:

[0036] S101: Extract the text in the title tag and the text in the h tag from the HTML source code of a web page.

[0037] Specifically, on some web pages the text in the title tag describes the website and has nothing to do with the body text, so it is first necessary to determine whether the text in the title tag is related to the actual body text. To do this, the text in the title tag is extracted from the HTML source code of the web page and denoted, for example, as Title 1, and the text in the h tag is extracted from the same source code and denoted, for example, as Title 2.
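Step S101 could be sketched roughly as follows. This is an illustration only, not the patented implementation: the class and function names are hypothetical, Python's standard `html.parser` is used as a convenient tokenizer, and the `similarity` helper uses `difflib.SequenceMatcher` as an assumed stand-in, since this excerpt does not say how the relatedness of Title 1 and Title 2 is measured.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collects the text of <title> and of <h1>..<h6> tags (hypothetical helper)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_text = ""     # "Title 1" in the description
        self.heading_texts = []  # candidates for "Title 2"
        self._current = None     # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when we are not already inside a captured tag.
        if self._current is None and (tag == "title" or tag in self.HEADING_TAGS):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so nested inline tags do not leave gaps.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title_text = text
            elif text:
                self.heading_texts.append(text)
            self._current = None


def extract_title_candidates(html):
    """Return (title text, list of heading texts) for an HTML string."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    parser.close()
    return parser.title_text, parser.heading_texts


def similarity(a, b):
    """Assumed similarity measure in [0, 1]; the source does not specify one."""
    return SequenceMatcher(None, a, b).ratio()
```

Given both texts, a caller might compare Title 1 against each heading candidate and treat the title tag as site-level description when every score is low; where that cut-off lies is a tuning choice this excerpt does not state.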

[0038] S102: Determine...



Abstract

The present invention provides a web page body text extraction method and system. The method comprises the following steps: extracting the text in the title tag and the text in the h tag of the web page's HTML source code; determining the body text title according to the text similarity between the text in the title tag and the text in the h tag; extracting the tag source code inside the body tag of the HTML source code; carrying out first extraction processing on the tag source code inside the body tag to acquire a first web page body text; determining a row block distribution function and extracting a text block according to that function; and carrying out second extraction processing on the text block to acquire a second web page body text. The method and apparatus generalize well across web pages and extract body text quickly and accurately, so that web page body text extraction proceeds smoothly.
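The abstract does not define the row block distribution function in this excerpt, so the following is only a sketch of one common reading of the idea: for each line of tag-stripped text, sum the non-whitespace character counts of that line and the next few lines, then keep the contiguous run of high-valued blocks around the peak. The function names, the block size of 3, and the threshold of 40 are all assumptions made for illustration.

```python
def block_distribution(lines, block_size=3):
    """For each line index i, total non-whitespace characters in lines i..i+block_size-1."""
    lengths = [len("".join(line.split())) for line in lines]
    return [sum(lengths[i:i + block_size]) for i in range(len(lengths))]


def extract_main_block(lines, block_size=3, threshold=40):
    """Return the lines around the densest block (assumed selection rule).

    Starting from the block with the highest value, the selection grows in
    both directions while neighbouring block values stay at or above the
    threshold. Both parameters are hypothetical tuning knobs.
    """
    dist = block_distribution(lines, block_size)
    if not dist or max(dist) < threshold:
        return ""
    peak = dist.index(max(dist))
    start = peak
    while start > 0 and dist[start - 1] >= threshold:
        start -= 1
    end = peak
    while end + 1 < len(dist) and dist[end + 1] >= threshold:
        end += 1
    # The block at index `end` covers lines end..end+block_size-1.
    selected = lines[start:end + block_size]
    return "\n".join(line.strip() for line in selected if line.strip())
```

In this sketch short navigation or copyright lines separated from the article by blank lines fall below the threshold and are left out, while consecutive long lines reinforce each other. A real page would likely need the threshold tuned to its language and layout.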

Description

Technical field

[0001] The invention relates to the field of computers, in particular to a method and device for extracting web page body text.

Background technique

[0002] With the rapid development of the Internet, the amount of information on the network is growing explosively, and ordinary users browse all kinds of information through web pages. A web page mainly carries two types of text: the body text that the page is meant to convey, and noise information that has nothing to do with the body text. Noise information includes website navigation, advertisements, copyright statements, related links, and similar content. Body text extraction means accurately and efficiently extracting the body text of a web page that also contains such noise information.

[0003] However, in the prior art, DOM (Document Object Model) trees are often parsed from HTML web pages, and it takes time and effort to extract webpage information based on the D...


Application Information

IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 朱国库, 蒋文保
Owner: BEIJING CREATIVE & INTERACTIVE DIGITAL TECH CO LTD