Method for extracting webpage text based on label path and text punctuation ratio feature fusion

A webpage text extraction technique, applied in the field of webpage text extraction based on the fusion of label path and text punctuation ratio features. It addresses the problems that different templates must be established for different web pages, that templates are no longer applicable once the page structure changes, and that existing feature-based methods rely on numerous and complex features. It achieves a simple calculation method, more comprehensive extraction of the body text, and improved accuracy.

Inactive Publication Date: 2018-04-10
SOUTH CHINA AGRI UNIV

AI Technical Summary

Problems solved by technology

Typical text extraction methods have several problems:
1. Different templates need to be established for different web pages.
2. When the structure of a web page changes, the template is no longer applicable.
3. Using machine learning methods to construct templates requires a large number of training web pages.
4. The computational complexity may be too high for online systems.
5. Current feature-based extraction methods use features that are numerous and complex, or that are strongly tailored to particular pages.
6. Current extraction methods cannot handle both web pages containing a single body block and web pages containing multiple body blocks; that is, they are not suitable for most web page structures.



Examples


Embodiment Construction

[0046] The present invention will be further described below in conjunction with specific examples.

[0047] As shown in Figure 1, the webpage text extraction method based on label path and text punctuation ratio feature fusion provided by this embodiment comprises the following steps:

[0048] 1) Establish a DOM tree according to the HTML document and perform preprocessing;

[0049] In this step, the HTML document of the web page may not be syntactically well-formed. The HTML Tidy tool is first used to normalize the document's syntax, and an existing tool such as Jsoup is then used to convert the input HTML document directly into a DOM tree. The correspondence between the DOM tree and the HTML is shown in Figure 2. After conversion to a DOM tree, tags that are known not to contain body text, and tags that are invisible when the page is browsed, such as script tags and CSS style tags, can be deleted directly. Among them, the script tag co...
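The preprocessing described in this step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: it substitutes Python's standard-library `html.parser` for the HTML Tidy and Jsoup tools named above, and the names `LabelPathCollector`, `extract_text_nodes`, `SKIP_TAGS` and `VOID_TAGS`, as well as the dot-separated label path notation, are hypothetical choices made for the example.

```python
from html.parser import HTMLParser

# Tags assumed never to hold visible body text (illustrative list).
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LabelPathCollector(HTMLParser):
    """Collects (label_path, text) pairs, ignoring script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # open tags from the root to the current node
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.nodes = []       # collected (label_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Pop up to and including the matching open tag, which
        # tolerates unclosed tags in loosely written HTML.
        while self.stack:
            popped = self.stack.pop()
            if popped in SKIP_TAGS:
                self.skip_depth -= 1
            if popped == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.nodes.append((".".join(self.stack), text))


def extract_text_nodes(html):
    """Return the visible text nodes of `html` with their label paths."""
    parser = LabelPathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Each returned pair holds the chain of tags from the root down to a text node together with that node's text, which is the kind of per-path grouping the later feature computation would need.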



Abstract

The invention discloses a method for extracting webpage body text based on the fusion of label path and text punctuation ratio features. The method proposes a novel feature value by fusing a text punctuation ratio with label path features, and uses this value to extract the body text from a webpage. A text punctuation ratio feature is defined to measure the average sentence length under a label path; it is combined with the position of the label path and its internal complexity to give a more comprehensive feature value for judging whether content is body text. The method extracts webpage body text more accurately without constructing an extraction template, and it has a wide scope of application.
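As a rough illustration of the idea of a text punctuation ratio per label path, the sketch below divides the number of text characters under each path by the number of punctuation marks, which approximates an average sentence length. The exact formula, the punctuation set, the add-one smoothing and the function name `punctuation_ratio_by_path` are all assumptions made for this example; the abstract does not specify them, and the fusion with label path position and internal complexity is not shown.

```python
from collections import defaultdict

# Illustrative punctuation set (ASCII plus common Chinese marks); an assumption.
PUNCTUATION = set(",.;:!?，。；：！？、")


def punctuation_ratio_by_path(nodes):
    """Map each label path to (text characters) / (punctuation marks + 1).

    `nodes` is an iterable of (label_path, text) pairs.
    """
    chars = defaultdict(int)
    puncts = defaultdict(int)
    for path, text in nodes:
        chars[path] += len(text)
        puncts[path] += sum(1 for ch in text if ch in PUNCTUATION)
    # The +1 in the denominator is an assumed smoothing term so that
    # paths without any punctuation do not cause a division by zero.
    return {path: chars[path] / (puncts[path] + 1) for path in chars}


# Hypothetical input: two text nodes under one path, one under another.
example_nodes = [
    ("html.body.div.p", "One, two. Three!"),
    ("html.body.div.p", "Four."),
    ("html.body.ul.li", "Home"),
]
ratios = punctuation_ratio_by_path(example_nodes)
```

On realistic pages, paths holding long punctuated sentences would be expected to score higher than short navigation labels, though a practical threshold would have to be determined separately.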

Description

Technical field

[0001] The invention relates to the technical field of data mining and web page analysis, and in particular to a webpage text extraction method based on the fusion of label path and text punctuation ratio features.

Background art

[0002] With the rapid development of the Internet, it has become an important channel through which people obtain information, and people are also accustomed to publishing all kinds of information on it. As a result, the number of web pages and the amount of information on the Internet are growing very quickly. In addition to the body text, a web page also contains advertisements, navigation bars and other content unrelated to the body text. Text extraction methods emerged to let users obtain the useful information quickly.

[0003] A web page contains a lot of content, but only a limited part of it is really needed by users, and this part of the content is called the...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22, G06F17/30
CPC: G06F16/951, G06F40/14, G06F40/157
Inventor: 黎嘉朗, 古万荣, 田绪红, 毛宜军, 李吉平
Owner: SOUTH CHINA AGRI UNIV