Method and device for extracting webpage text

A webpage text extraction technology, applied in the field of data processing, that addresses problems such as the slow speed of extracting webpage text.

Active Publication Date: 2018-02-23
SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] As the above description shows, the prior-art method for extracting webpage text needs to traverse the entire DOM tree, so extraction of the webpage text is relatively slow.

Embodiment Construction

[0049] To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative work, fall within the scope of protection of the present invention.

[0050] As shown in Figure 1, an embodiment of the present invention provides a method for extracting webpage text. The method may include the following steps:

[0051] Step 101: extracting all semantic blocks of the webpage to be extracted;

[0052] Step 102: evenly dividing the webpage to be extracted into multiple region blocks;

[0053] Step...

Abstract

The invention discloses a method and device for extracting the main text of a webpage. The method comprises: extracting all semantic blocks of the webpage to be extracted; evenly dividing the webpage into a plurality of region blocks; randomly selecting a region block a predetermined number of times; determining the semantic blocks in each selected region block; calculating the sample distribution probability of each semantic block; and taking each semantic block whose sample distribution probability is greater than or equal to a predetermined probability as a semantic block in which the main text is located. The method and device can increase the speed of extracting the main text of a webpage.
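The steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record does not say how semantic blocks are represented or how the sample distribution probability is computed, so the sketch assumes each block is an axis-aligned rectangle and takes the probability to be the fraction of random draws whose region block overlaps the semantic block. All names (`Block`, `split_into_regions`, `main_text_blocks`) and the default parameters are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Axis-aligned rectangle used for both semantic and region blocks (assumed model)."""
    name: str
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other: "Block") -> bool:
        # True when the two rectangles share a region of non-zero area.
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


def split_into_regions(page_w, page_h, rows, cols):
    """Evenly divide the page into rows * cols region blocks."""
    cell_w, cell_h = page_w / cols, page_h / rows
    return [Block(f"r{r}c{c}", c * cell_w, r * cell_h, cell_w, cell_h)
            for r in range(rows) for c in range(cols)]


def main_text_blocks(semantic_blocks, page_w, page_h, rows=4, cols=4,
                     draws=1000, threshold=0.5, seed=None):
    """Return (blocks at or above the threshold, probability per block name)."""
    rng = random.Random(seed)
    regions = split_into_regions(page_w, page_h, rows, cols)
    hits = {b.name: 0 for b in semantic_blocks}
    for _ in range(draws):                      # predetermined number of draws
        region = rng.choice(regions)            # randomly selected region block
        for b in semantic_blocks:               # semantic blocks in that region
            if b.overlaps(region):
                hits[b.name] += 1
    # Assumed definition: share of draws that landed on each semantic block.
    probs = {name: count / draws for name, count in hits.items()}
    selected = [b for b in semantic_blocks if probs[b.name] >= threshold]
    return selected, probs


if __name__ == "__main__":
    page = [
        Block("header", 0, 0, 1000, 100),
        Block("sidebar", 0, 100, 200, 900),
        Block("article", 250, 100, 750, 900),
    ]
    chosen, probs = main_text_blocks(page, 1000, 1000, seed=1)
    print([b.name for b in chosen], probs)
```

In this toy layout the article rectangle overlaps 12 of the 16 region blocks while the header and sidebar each overlap 4, so with enough draws only the article should clear a threshold of 0.5. The cost depends on the number of draws and semantic blocks rather than on the size of the DOM tree, which is the speed advantage the abstract claims.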

Description

Technical Field

[0001] The invention relates to the technical field of data processing, and in particular to a method and device for extracting webpage text.

Background

[0002] With the rapid development of webpage information resources, many webpages are generated every day. A webpage can include text information as well as advertisement information, so extracting the text from a webpage has become very important.

[0003] In the prior art, a DOM (Document Object Model) tree is first parsed from the HTML (HyperText Markup Language) webpage according to the nesting relationships between the tags in the webpage. The entire DOM tree is then traversed, and the position of the text is determined according to the distribution pattern of the text information in the DOM tree.

[0004] It can be seen from the above description that the method for extracting webpage text in the prior art needs to traverse the entire DOM tree, and the speed of e...
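For contrast, the prior-art approach described in the background can be sketched as a full traversal of the parsed tree. This is only an illustrative baseline: the record does not specify the "distribution rule", so the sketch assumes the simplest possible heuristic (pick the element holding the most direct text), and the names `Node`, `TreeBuilder` and `densest_node` are hypothetical. It uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from the tag nesting."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def densest_node(html):
    """Visit every node and return (node with most direct text, nodes visited)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, visited = builder.root, 0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        visited += 1
        if len(node.text) > len(best.text):
            best = node
        stack.extend(node.children)
    return best, visited
```

The `visited` counter makes the cost visible: every node in the tree is touched once, so the work grows with the size of the page regardless of where the text sits.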

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 李克学, 范莹, 戴鸿君, 王传国, 刘永
Owner: SHANDONG LANGCHAO YUNTOU INFORMATION TECH CO LTD