Web page body text extraction method and apparatus

A web page body text extraction technology, applied in the field of web page text extraction methods and devices, that addresses the time-consuming and labor-intensive nature of web page information extraction and achieves good generality with rapid and accurate extraction.

Active Publication Date: 2015-12-23

BEIJING CREATIVE & INTERACTIVE DIGITAL TECH CO LTD

Cites: 5; Cited by: 23

Problems solved by technology

[0003] In the prior art, a DOM (Document Object Model) tree is often parsed from the HTML web page, and extracting web page information based on the DOM tree structure is time-consuming and laborious for web pages of different categories and columns.

Embodiment Construction

[0034] Embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0035] Figure 1 shows a flow chart of a web page body text extraction method provided by an embodiment of the present invention. Referring to Figure 1, the method comprises:

[0036] S101. Extract the text in the title tag and the text in the h tag in the HTML source code of the webpage.

[0037] Specifically, the text in the title tag of some web pages describes the website rather than the body text, so it is first necessary to determine whether the text in the title tag is related to the actual body text. To this end, the text in the title tag is extracted from the HTML source code of the web page and marked, for example, as title 1, and the text in the h tag is extracted from the same source code and marked, for example, as title 2.
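Step S101 can be illustrated with a minimal Python sketch built on the standard library's `html.parser`. The class and function names (`TitleAndHeadingParser`, `extract_titles`, `best_heading`) are hypothetical and do not come from the patent. The comparison step is also an assumption: the abstract says the body text title is determined from the text similarity between the two texts, but this excerpt does not specify the similarity measure, so `difflib.SequenceMatcher` is used here only as a stand-in.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collects the text of <title> (title 1) and of <h1>..<h6> tags (title 2 candidates)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title_text = ""
        self.heading_texts = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when not already inside a title or heading.
        if self._current is None and (tag == "title" or tag in self.HEADING_TAGS):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title_text = text
            elif text:
                self.heading_texts.append(text)
            self._current = None


def extract_titles(html):
    """Return (title 1, list of title 2 candidates) for an HTML string."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    parser.close()
    return parser.title_text, parser.heading_texts


def best_heading(title_1, headings):
    """Return (heading, similarity) for the heading most similar to title_1.

    Assumption: SequenceMatcher.ratio() stands in for the unspecified
    similarity measure. Returns (None, 0.0) when there are no headings.
    """
    scored = [(h, SequenceMatcher(None, title_1, h).ratio()) for h in headings]
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))
```

Calling `extract_titles(page_source)` and passing the result to `best_heading` yields the heading closest to the title text together with its score. How that score is turned into a final choice of body text title is left open here, because the excerpt does not state the decision rule.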

[0038] S102...

Abstract

The present invention provides a web page body text extraction method and apparatus. The method comprises: extracting the text in the title tag and the text in the h tag of the web page HTML source code; determining a body text title according to the text similarity between the text in the title tag and the text in the h tag; extracting the tag source code within the body tag of the web page HTML source code; performing first extraction processing on the tag source code within the body tag to obtain a first web page body text; determining a row block distribution function and extracting a text block according to the row block distribution function; and performing second extraction processing on the text block to obtain a second web page body text. The web page body text extraction method and apparatus of the present invention offer good generality and allow extraction to be performed quickly and accurately, thereby ensuring that web page body text extraction proceeds smoothly.
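The row block distribution step named in the abstract can be sketched as follows. This excerpt does not define the function, so the sketch assumes one common formulation: for each line of tag-stripped text, sum the non-whitespace character counts of that line and the next few lines, then keep the contiguous run of high-valued blocks that contains the most text. The names `char_count`, `line_block_distribution`, and `extract_text_block`, and the parameters `block_width` and `threshold`, are hypothetical and chosen only for illustration.

```python
def char_count(line):
    """Number of non-whitespace characters in a line."""
    return len("".join(line.split()))


def line_block_distribution(lines, block_width=3):
    """For each line index i, total non-whitespace characters in
    lines[i : i + block_width] (assumed form of the distribution function)."""
    counts = [char_count(line) for line in lines]
    return [sum(counts[i:i + block_width]) for i in range(len(counts))]


def extract_text_block(lines, block_width=3, threshold=40):
    """Return the non-empty lines of the contiguous run of block start
    positions whose block value exceeds `threshold` and which holds the
    most characters. Both parameters are illustrative assumptions."""
    dist = line_block_distribution(lines, block_width)
    best, best_score = [], 0
    i = 0
    while i < len(dist):
        if dist[i] <= threshold:
            i += 1
            continue
        start = i
        # Advance to the end of this run of dense blocks.
        while i < len(dist) and dist[i] > threshold:
            i += 1
        run = lines[start:i]
        score = sum(char_count(line) for line in run)
        if score > best_score:
            best, best_score = run, score
    return [line for line in best if line.strip()]
```

The input is expected to be the page text after tags have been stripped but line breaks preserved, which is roughly what the first extraction processing would have to produce. One consequence of this simple run rule is that a short final line of body text followed by blank lines can be dropped. The second extraction processing mentioned in the abstract is not covered here.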

Description

Technical field

[0001] The invention relates to the field of computers, and in particular to a method and apparatus for extracting web page body text.

Background

[0002] With the rapid development of the Internet, the amount of information on the network is growing explosively, and users generally browse this information through web pages. A web page mainly contains two types of text: the body text that the page is meant to convey, and noise information unrelated to the body text. Noise information includes website navigation, advertisements, copyright statements, related links, and the like. Body text extraction aims to extract the body text of a web page accurately and efficiently from among this noise.

[0003] However, in the prior art, a DOM (Document Object Model) tree is often parsed from the HTML web page, and it takes time and effort to extract webpage information based on the D...

Application Information

IPC(8): G06F17/30

CPC: G06F16/9577

Inventor: 朱国库, 蒋文保

Owner: BEIJING CREATIVE & INTERACTIVE DIGITAL TECH CO LTD
