
Web page text extraction method and device

A web page text extraction method and device that addresses two problems of prior template-based extraction, namely wrongly identified text nodes and unfiltered impurity information, and thereby extracts webpage body text with higher accuracy.

Active Publication Date: 2020-11-24
BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD

AI Technical Summary

Problems solved by technology

When webpage text is extracted, the template type of the target webpage is determined, and the text is then taken from the target webpage's text node as indicated by the matching template. The text node, however, often contains impurity nodes that carry impurity information such as related articles or recommended subscriptions. Prior-art text extraction templates extract the content of the text node but do not filter out the impurity information of the impurity nodes inside it.
[0003] In addition, when the bottom of the webpage carries large disclaimers or other footnotes, or when the body consists mainly of pictures with little text, the text node identified by such prior-art text extraction templates is often wrong.



Examples


Third Embodiment

[0098] Referring to Figure 4, which is a flow chart of the third specific embodiment of the web page text extraction method of the present invention, the main steps of this embodiment are as follows:

[0099] Step S21: determine the text node of webpages under the same domain name. Specifically, obtain a plurality of sample webpages under that domain name, then compare the webpage structures of the sample webpages to determine the text node of webpages under that domain name;

[0100] Webpages under the same domain name generally have similar structures, so this embodiment can determine their text node from the webpage structure. As a specific example, the text node can be determined by comparing the webpage structures of multiple sample webpages, for example in the following manner: ar...


Abstract

The invention discloses a web page text extraction method and device. The method uses text extraction templates that contain text node information and impurity node information, wherein webpages of different domain names correspond to different text extraction templates. The method comprises: obtaining the text extraction template that matches a target webpage; obtaining the text node of the target webpage according to the text node information of that template; and removing from the text node the impurity nodes indicated by the impurity node information. Because the template contains both text node information and impurity node information, and the impurity nodes are removed during extraction, webpage text can be extracted with higher accuracy.
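The flow in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and not taken from the patent: the `ExtractionTemplate` class, the `TEMPLATES` table, the domain `news.example.com`, the use of ElementTree path expressions to locate nodes, and the requirement that the page be well-formed markup.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ExtractionTemplate:
    """Hypothetical template: where the text node is and which impurity nodes it holds."""
    text_node_path: str                                  # path from the page root
    impurity_paths: list = field(default_factory=list)   # paths relative to the text node


# One template per domain name. The entry below is invented for this example.
TEMPLATES = {
    "news.example.com": ExtractionTemplate(
        text_node_path=".//div[@id='article']",
        impurity_paths=[".//div[@class='related']", ".//div[@class='subscribe']"],
    ),
}


def remove_keeping_tail(parent, child):
    """Detach child, but keep the text that follows it (ElementTree stores it on child.tail)."""
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)
    if index > 0:
        siblings[index - 1].tail = (siblings[index - 1].tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def extract_body_text(url, page_source):
    # 1. Obtain the template that matches the target page (matched by domain name here).
    template = TEMPLATES[urlparse(url).netloc]
    # 2. Locate the text node named by the template.
    root = ET.fromstring(page_source)
    text_node = root.find(template.text_node_path)
    if text_node is None:
        return ""
    # 3. Remove the impurity nodes listed in the template from the text node.
    parents = {child: parent for parent in text_node.iter() for child in parent}
    for path in template.impurity_paths:
        for impurity in text_node.findall(path):
            remove_keeping_tail(parents[impurity], impurity)
    # 4. Whatever text remains is treated as the body text.
    return " ".join(t.strip() for t in text_node.itertext() if t.strip())


PAGE = """<html><body>
<div id="nav">Home</div>
<div id="article"><p>First paragraph.</p><div class="related">Related articles</div><p>Second paragraph.</p></div>
<div id="footer">Disclaimer</div>
</body></html>"""

print(extract_body_text("http://news.example.com/a/1.html", PAGE))
```

In this sketch the "Related articles" block sits inside the text node, so a template that only recorded the text node would return it along with the article; listing it as an impurity path is what keeps it out of the result.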

Description

Technical field

[0001] The present invention relates to the technical field of the Internet, and more specifically to a method and device for extracting webpage text.

Background

[0002] At present, webpage text is generally extracted with template-based methods. Existing text extraction templates are generated from a large number of webpages with similar structures by looking for the location of large blocks of text and finding the node most likely to hold the body text: the ratio of text length to total length is calculated, the node with the highest ratio is taken as the text node, and the text extraction template is generated from it. When webpage text is extracted, the template type of the target webpage is determined, and the text is then taken from the target webpage's text node as indicated by the matching template, but in the text node, there will be related article...
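The prior-art ratio heuristic described in the background can be sketched as follows. The source does not say which nodes are candidates or how "total length" is measured, so this sketch makes its own assumptions: candidates are the container tags in `CANDIDATE_TAGS`, a candidate's text length counts only text not owned by a nested candidate, and the total is the length of all text on the page. All names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed set of container tags that may serve as the text node.
CANDIDATE_TAGS = {"div", "article", "section", "td"}


def owned_text_length(node):
    """Length of the text under node, without descending into nested candidates."""
    length = len((node.text or "").strip())
    for child in node:
        if child.tag not in CANDIDATE_TAGS:
            length += owned_text_length(child)
        # Text that follows a child still belongs to this node.
        length += len((child.tail or "").strip())
    return length


def guess_text_node(page_source):
    """Return (node, ratio) for the candidate with the highest share of the page text."""
    root = ET.fromstring(page_source)
    total = sum(len(t.strip()) for t in root.itertext())
    candidates = [n for n in root.iter() if n.tag in CANDIDATE_TAGS]
    if total == 0 or not candidates:
        return None, 0.0
    best = max(candidates, key=lambda n: owned_text_length(n) / total)
    return best, owned_text_length(best) / total


SAMPLE = """<html><body>
<div id="nav"><a>Home</a><a>News</a></div>
<div id="article"><p>A long paragraph of body text that dominates this page.</p><p>More body text.</p></div>
<div id="footer">Short disclaimer.</div>
</body></html>"""

node, ratio = guess_text_node(SAMPLE)
print(node.get("id"), round(ratio, 2))
```

A heuristic of this kind picks whichever container holds the most text, which is why a page with a long footer disclaimer or a picture-heavy body can yield the wrong text node.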


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9535, G06F16/957
CPC: G06F16/9577
Inventor: 胡又欢
Owner: BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD