Web page content extraction method and device

A web page content extraction method and technology, applied in the computer field, that addresses the problems of excessive extracted noise and insufficient accuracy and comprehensiveness, and improves the accuracy and comprehensiveness of extraction.

Active Publication Date: 2021-01-01
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
9 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

If a single algorithm is applied indiscriminately to every page, the extraction tends to pick up too much noise and cannot meet the accuracy and comprehensiveness requirements of web page text content extraction.

Detailed Description of Embodiments

[0026] The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the relevant invention.

[0027] It should be noted that, provided there is no conflict, the embodiments in the present application and the features in those embodiments may be combined with each other. The present application is described in detail below with reference to the accompanying drawings and embodiments.

[0028] FIG. 1 shows an exemplary system architecture 100 to which embodiments of the web page content extraction method or the web page content extraction apparatus of the present application can be applied.

[0029] As shown in FIG. 1, the system architecture 10...

Abstract

This application discloses a web page content extraction method and device. The method includes: parsing the web page to be extracted to determine the HTML tags it contains; extracting HTML features of the web page from those HTML tags; importing the extracted HTML features into a pre-trained picture web page recognition model; and, in response to determining that the web page to be extracted is a picture web page, extracting the pictures in the web page together with the HTML tags corresponding to those pictures. This embodiment can apply different extraction strategies depending on the type of the web page to be extracted (for example, picture type versus non-picture type), improving the accuracy and comprehensiveness of web page content extraction.
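
The steps in the abstract can be illustrated with a minimal sketch using only the Python standard library. It is not the patented implementation: the feature set (`img`, `p`, `img_ratio`), the function names, and the threshold rule standing in for the pre-trained picture web page recognition model are all hypothetical choices made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Parses a page, counting HTML tags and recording <img> attributes."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.images = []  # one attribute dict per <img> tag

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "img":
            self.images.append(dict(attrs))


def html_features(tag_counts):
    """Hypothetical features: image count, paragraph count, image share."""
    total = sum(tag_counts.values()) or 1  # avoid division by zero
    return {
        "img": tag_counts["img"],
        "p": tag_counts["p"],
        "img_ratio": tag_counts["img"] / total,
    }


def is_picture_page(features, threshold=0.3):
    """Assumed stand-in for the pre-trained recognition model."""
    return features["img_ratio"] >= threshold


def extract(html):
    """Classifies the page, then extracts images only for picture pages."""
    collector = TagCollector()
    collector.feed(html)
    features = html_features(collector.tag_counts)
    if is_picture_page(features):
        return {"type": "picture", "images": collector.images}
    # A non-picture page would be handed to a different extraction strategy.
    return {"type": "non-picture", "images": []}


if __name__ == "__main__":
    sample = '<html><body><img src="a.jpg" alt="A"><p>caption</p></body></html>'
    print(extract(sample))
```

In a real system the threshold rule would be replaced by a trained classifier that consumes the same feature vector; the point of the sketch is only the order of operations: parse, featurize, classify, then choose the extraction strategy.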

Description

Technical Field

[0001] The present application relates to the field of computer technology, specifically to the field of Internet technology, and in particular to a method and device for extracting web page content.

Background

[0002] In Web data mining, extracting the content of web pages is usually a basic early step. Whether the text content of a web page can be extracted efficiently and accurately, and whether the approach can be readily extended to a wide range of websites, determines the effectiveness of subsequent data mining.

[0003] In the prior art, usually only a single extraction algorithm is used to extract the text content of a web page. A website has many sub-pages in many forms: the main body of a page may be text, pictures, or a mixture of the two, and the HTML tags used within a website also vary widely; some pages are list pages and the like, while others are content pages from which material needs to be extracted. If a single algo...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F40/143, G06K9/62
CPC: G06F16/951, G06F40/14, G06F18/2411
Inventors: 余婷婷, 胡飞
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD