Web page content extracting method and device

A web page text extraction technology, applied in the field of text crawling, that addresses large differences in web page structure, template failure, and the inability to guarantee timely template rule updates, and achieves high extraction accuracy and high text integrity.

Active Publication Date: 2016-06-29
NAT UNIV OF DEFENSE TECH +1
Cites: 3; Cited by: 7

AI Technical Summary

Problems solved by technology

The DOM-tree-based method parses the HTML file of the web page into a DOM tree and determines the position of the body text according to the distribution pattern of text information in the tree. However, because web pages differ greatly in structure and complexity, a specific distribution rule is effective only for some web pages, so its generalization ability is limited.
The template-rule-based method generally relies on manually configured extraction template rules for a specific website, so that the text can be quickly extracted from a specified area. However, irregular website updates often cause the template to fail, so that information can no longer be extracted effectively, and the extracted text often contains varying degrees of impurities (mainly non-standard scripts). Maintaining high extraction accuracy with this method requires a lot of manual input and careful selection of high-quality template rules, both of which are great challenges.
[0004] The existing method for extracting web page text based on the DOM tree structure has the following disadvantages: 1) constructing and traversing the DOM tree is inefficient and slow; 2) for irregular web pages, construction of the DOM tree may fail; 3) the text distribution rules the method relies on are not general, so it achieves good extraction accuracy only for some web pages.
[0005] The existing methods of extracting text based on template rules have the following disadvantages: 1) the manual workload of maintaining template rules is heavy; 2) the timeliness of template rule updates cannot be guaranteed; 3) the quality of template rules is difficult to guarantee.

Embodiment Construction

[0033] The drawings constituting a part of the present application are used to provide a further understanding of the present invention, and the exemplary embodiments and descriptions of the present invention are used to explain the present invention, and do not constitute an improper limitation of the present invention.

[0034] In this document, text slicing means cutting the HTML text of the web page into three types of text slices: start tag (BTag), text (Text), and end tag (ETag). Since this method extracts text automatically, it has natural advantages over manual template extraction.
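The slicing described in [0034] can be illustrated with a minimal Python sketch. The names `Slice`, `TAG_RE`, and `slice_html` are hypothetical and not taken from the patent; the regular expression is a simplification that treats every `<...>` run as a tag (comments and self-closing tags are counted as start tags here), so it only approximates what a production slicer would need.

```python
import re
from typing import List, NamedTuple


class Slice(NamedTuple):
    kind: str   # "BTag", "Text", or "ETag"
    value: str  # the raw tag, or the text found between two tags


# Assumption: anything between '<' and the next '>' is one tag.
TAG_RE = re.compile(r"<[^>]+>")


def slice_html(html: str) -> List[Slice]:
    """Cut HTML text into BTag / Text / ETag slices in document order."""
    slices: List[Slice] = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Text between the previous tag and this one (whitespace-only is dropped).
        text = html[pos:match.start()].strip()
        if text:
            slices.append(Slice("Text", text))
        tag = match.group()
        kind = "ETag" if tag.startswith("</") else "BTag"
        slices.append(Slice(kind, tag))
        pos = match.end()
    # Any text left after the last tag.
    tail = html[pos:].strip()
    if tail:
        slices.append(Slice("Text", tail))
    return slices


if __name__ == "__main__":
    for s in slice_html("<p>Hello <b>world</b></p>"):
        print(s.kind, repr(s.value))
```

Because the slices are produced in a single left-to-right pass over the string, no tree has to be built, which is the efficiency argument the description makes against DOM-based parsing.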

[0035] In order to avoid the inefficiency caused by DOM tree analysis while maintaining freedom of operation, and referring to Figure 1, one aspect of the present invention provides a method for extracting web page text, including the following steps:

[0036] Step S100: Slice and filter the HTML text of the web page to obtain a text slice list sliceList, repair irregular tags in the text slice list...
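The text of step S100 is truncated here, so the actual filtering and repair rules are not available. The following Python sketch only illustrates one plausible reading under stated assumptions: "filtering" is taken to mean dropping `script` and `style` elements, and "repairing irregular tags" is taken to mean dropping stray end tags and inserting missing ones. The names `NOISE_TAGS`, `VOID_TAGS`, `tag_name`, and `filter_and_repair` are all hypothetical.

```python
from typing import List, Tuple

Slice = Tuple[str, str]  # (kind, value), kind is "BTag", "Text", or "ETag"

NOISE_TAGS = {"script", "style"}  # assumption: elements treated as noise
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never have an end tag


def tag_name(tag: str) -> str:
    """Return the lowercase element name of a raw tag such as '<DIV id=x>'."""
    parts = tag.strip("<>/ ").split()
    return parts[0].rstrip("/").lower() if parts else ""


def filter_and_repair(slices: List[Slice]) -> List[Slice]:
    result: List[Slice] = []
    open_tags: List[str] = []  # stack of element names that are still open
    skipping = ""              # name of the noise element currently being dropped
    for kind, value in slices:
        name = tag_name(value) if kind != "Text" else ""
        if skipping:
            # Drop everything up to and including the matching end tag.
            if kind == "ETag" and name == skipping:
                skipping = ""
            continue
        if kind == "BTag" and name in NOISE_TAGS:
            if not value.endswith("/>"):
                skipping = name
            continue
        if kind == "BTag":
            result.append((kind, value))
            if name not in VOID_TAGS and not value.endswith("/>"):
                open_tags.append(name)
        elif kind == "ETag":
            if name not in open_tags:
                continue  # stray end tag without a start tag: drop it
            # Close any elements that were left open inside this one.
            while open_tags[-1] != name:
                result.append(("ETag", f"</{open_tags.pop()}>"))
            open_tags.pop()
            result.append((kind, value))
        else:
            result.append((kind, value))
    # Close whatever is still open at the end of the list.
    while open_tags:
        result.append(("ETag", f"</{open_tags.pop()}>"))
    return result
```

For example, a list containing `<div>`, a `<script>` element, `<p>`, the text `Hello`, `</div>`, and a stray `</span>` comes back as `<div>`, `<p>`, `Hello`, an inserted `</p>`, and `</div>`.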

Abstract

The invention provides a web page content extracting method and device. In the method, characteristic values are calculated for all text sections in web page HTML text that has undergone denoising preprocessing, and the text in the text section list that meets the requirements is aggregated multiple times according to different aggregation rules, finally yielding the web page content that meets the requirements. The method is simple and efficient and avoids the cumbersome process of manually writing extraction rules for the full text; automatic content extraction of relevant web pages can be achieved for a specific web page type.

Description

Technical field

[0001] The invention relates to the technical field of text crawling, in particular to a method and device for extracting the body text of a web page.

Background technique

[0002] With the rapid development of Internet technology, web pages have become an important source of information. However, unlike traditional text, web pages contain, in addition to valid information (the body text), a lot of invalid information (noise), such as website navigation links, advertising content and links, and copyright information. This greatly reduces the efficiency of obtaining valid information. To use the information on web pages effectively, a fast and accurate automatic text extraction method is needed to filter out invalid information and extract the text content that users really need. This is also a key factor affecting public opinion monitoring and analysis and the results of analysis based on Internet big data mining.

[0003] The existing web page body extraction methods ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 陈发君, 刘忠, 黄金才, 朱承, 修保新, 程光权, 陈超, 冯旸赫, 龙开亮, 孟果
Owner NAT UNIV OF DEFENSE TECH