Online Web news content extraction system

A technique for online Web news content extraction that addresses wrapper failure, high maintenance cost, and single-perspective design, with the effect of improving adaptability, versatility, and real-time performance.

Active Publication Date: 2016-07-06
HEFEI UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0011] First, many current web page extraction techniques assume that the pages to be extracted are generated from the same template. Existing wrappers therefore struggle to extract content from pages generated by unknown templates, and their versatility is poor.
Extracting content from a page with an unknown template requires building a new wrapper for that template. Any change to the template causes the wrapper to fail, and the cost of maintaining these templates online is extremely high.
Even when pages are generated from the same template, they still contain many non-template nodes, and these nodes differ from page to page. A wrapper generated from only some training pages cannot cover these differences, so it is not adequate for the extraction task on some web pages.
[0012] Second, many current web page extraction techniques are not suitable for online extraction tasks.
This type of method is simple in design and considers only a single perspective. It completely ignores the hierarchical structure of the HTML text, even though that hierarchy is closely related to how web content is distributed, so it is difficult to apply to the extraction of massive heterogeneous Web news pages.



Examples


Embodiment Construction

[0049] Referring to Figure 1, the online Web news content extraction method in this embodiment proceeds as follows:

[0050] Step 1: use an HTML parser to parse the Web news page to be extracted and obtain its DOM tree. Obtain the HTML text of the news page from its URL, and use JTidy to correct errors in the HTML text, including tag matching errors, tag writing errors, and HTML encoding errors. Then use the HTML parser HTMLParser to scan the characters of the HTML text one by one, analyze the structural hierarchy of the HTML text, and obtain the DOM tree of the Web news page.
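
As a rough illustration of Step 1, the sketch below builds a small DOM-like tree with Python's standard `html.parser` module. It is a stand-in for the JTidy and HTMLParser (Java) pipeline named above, not the patent's implementation. The names `Node`, `DomBuilder`, and `parse_html`, and the simple recovery rule for unclosed tags, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node: an element or a '#text' leaf."""

    def __init__(self, tag, parent=None, text=None):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a Node tree while the parser scans the HTML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Assumed recovery rule: close the nearest open element with this
        # tag name; ignore a closing tag that matches no open element.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text
            self.current.children.append(
                Node("#text", parent=self.current, text=text))


def parse_html(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse_html(
    "<html><body><div><p>Hello<br>world</p></div><p>unclosed</body></html>")
```

The unclosed `<p>` in the sample input is closed implicitly when `</body>` arrives, which loosely mirrors the kind of tag-matching repair the step assigns to JTidy.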

[0051] Step 2: traverse the DOM tree, visiting each node in turn, and construct the text node information sequence and the label (tag) path information sequence of the text nodes. Each element in the text node information sequence has two attributes, which a...
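
The traversal in Step 2 can be sketched as follows, assuming the markup has already been repaired into well-formed XHTML (the job Step 1 gives to JTidy) so that Python's `xml.etree.ElementTree` can parse it. Because the description of the sequence elements is cut off above, the fields used here (a tag path and the text itself) and the names `iter_text_nodes`, `build_sequences`, and `PAGE` are hypothetical choices for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def iter_text_nodes(element, prefix=""):
    """Yield (tag_path, text) pairs in document order."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    # Text directly after the opening tag belongs to this element.
    if element.text and element.text.strip():
        yield path, element.text.strip()
    for child in element:
        yield from iter_text_nodes(child, path)
        # Text after a child's closing tag still belongs to this element.
        if child.tail and child.tail.strip():
            yield path, child.tail.strip()


def build_sequences(xhtml):
    """Return the ordered text nodes and, per tag path, their positions."""
    root = ET.fromstring(xhtml)
    text_nodes = list(iter_text_nodes(root))
    path_index = defaultdict(list)
    for position, (path, _text) in enumerate(text_nodes):
        path_index[path].append(position)
    return text_nodes, dict(path_index)


# Assumed sample page: a navigation block followed by article paragraphs.
PAGE = """<html><body>
<div><a>Home</a><a>Sports</a></div>
<div><p>First paragraph of the story.</p><p>Second <b>bold</b> paragraph.</p></div>
</body></html>"""

text_nodes, path_index = build_sequences(PAGE)
for path, text in text_nodes:
    print(path, "|", text)
```

Grouping text nodes by tag path is what lets later steps score a path once and apply the score to every text node that shares it; navigation links and article paragraphs end up under different paths here.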


Abstract

The invention relates to an online Web news content extraction method. The method comprises the following steps: obtaining a DOM tree of the Web news page to be extracted; traversing the DOM tree to construct a text node information sequence and a label path information sequence; calculating a label path characteristic value sequence; fusing the label path characteristic value sequence by a weighted DS evidence theory to obtain a label path comprehensive characteristic value sequence; constructing a text node comprehensive characteristic value sequence; and extracting the text content of the Web news page according to the text node comprehensive characteristic value sequence. The invention furthermore discloses an online Web news content extraction system consisting of an analysis module, a calculation module, a fusion module, and an extraction module. The label path characteristics do not depend on web page templates and are diverse, and the whole extraction process involves only simple mathematical operations, so massive heterogeneous Web news pages can be extracted effectively online.
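
The fusion step can be illustrated with a minimal sketch of Dempster's rule of combination over the two-hypothesis frame {content, noise}. The abstract does not say how the weights enter, so this example assumes classical Shafer discounting as the weighting scheme; the feature values, the weights, and the names `discount` and `combine` are likewise assumptions, not the patent's formulas.

```python
from itertools import product

CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
THETA = CONTENT | NOISE  # full frame: "don't know"


def discount(mass, weight):
    """Shafer discounting (assumed weighting): scale each focal mass by
    `weight` and move the remaining belief to THETA."""
    out = {focal: weight * value
           for focal, value in mass.items() if focal != THETA}
    out[THETA] = 1.0 - weight + weight * mass.get(THETA, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    fused = {}
    conflict = 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            fused[intersection] = fused.get(intersection, 0.0) + va * vb
        else:
            conflict += va * vb  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: value / (1.0 - conflict)
            for focal, value in fused.items()}


# Two hypothetical characteristic values for one label path, read as
# support for "content", with assumed reliability weights 0.9 and 0.5.
m1 = discount({CONTENT: 0.8, NOISE: 0.2}, weight=0.9)
m2 = discount({CONTENT: 0.6, NOISE: 0.4}, weight=0.5)
fused = combine(m1, m2)
print(fused[CONTENT], fused[NOISE], fused[THETA])
```

Discounting keeps an unreliable feature from dominating: the lower its weight, the more of its mass moves to the "don't know" set before combination.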

Description

[0001] This application is a divisional application of the application filed on May 10, 2013 with application number 2013101732801, titled "A method and system for extracting online Web news content", whose applicant is Hefei University of Technology.

Technical field

[0002] The invention belongs to the field of network information processing, and in particular relates to an online Web news content extraction method and system.

Background

[0003] With the rapid development of the Internet, Web news pages have become the main platform for people to publish and obtain information, following traditional newspapers, radio, and television. At present, in addition to the main content, Web news pages also contain a large amount of information irrelevant to that content, such as navigation bars, advertisements, recommended links, and copyright statements. These noise data, which account for 40%-50% of the entire web page data, seriously affect the service quali...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/951, G06F16/986
Inventors: 吴共庆, 李莉, 徐喆昊, 胡学钢, 吴信东
Owner: HEFEI UNIV OF TECH