Method and system for extracting news webpage content using webpage label clustering

A technique for extracting webpage content by clustering webpage tags, applied in the field of electrical digital data processing. It addresses problems of existing extraction methods such as complex rules and the need for periodic monitoring of target websites.

Inactive Publication Date: 2011-12-28
BEIJING ZHONGSOU NETWORK TECH
Cites: 1; Cited by: 35

AI Technical Summary

Problems solved by technology

[0006] The disadvantage of writing templates manually is that it takes considerable human effort, and because target websites change, the cost of maintaining the templates is also high.
The disadvantage of the automatic template method is that the algorithm is complex, and it also requires periodic monitoring of the target website to keep the templates up to date.
Regardless of whether templates are generated manually or automatically, it is assumed that the data of t



Examples


Detailed Description of the Embodiments

[0022] The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present invention. It includes various details to facilitate understanding and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

[0023] Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

[0024] Figure 1 is a flowchart illustrating a method 100 for extracting news web page content according to an exemplary embodiment of the present invention.

[0025] As shown in Figure 1, method 100 begins at step 110. In step 110, the webpage content is pr...



Abstract

The invention provides a method and system for extracting news webpage content by using webpage tag clustering. The method includes: preprocessing the webpage content, including parsing the webpage content into a DOM tree and collecting statistics for each node of the DOM tree; deleting nodes of the DOM tree heuristically; deleting nodes of the DOM tree according to rules; and clustering the nodes of the DOM tree based on tag structure and deleting nodes accordingly, thereby generating a final DOM tree for output.
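The four stages named in the abstract can be illustrated with a minimal Python sketch. This is not the patented algorithm: the page excerpt does not disclose the actual heuristics, rules, or clustering criterion, so everything specific here is an assumption made for illustration. In particular, the `NOISE_TAGS` list, the 0.5 link-density threshold, the use of the root-to-node tag path as the clustering key, and all function names (`count_stats`, `prune`, `extract`) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}  # assumed list
BLOCK_TAGS = {"div", "ul", "ol", "table", "td", "li"}                # assumed list


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.texts = [], []
        self.text_len = self.link_len = 0  # per-subtree statistics

    def path(self):
        """Tag path from the root, e.g. 'root/html/body/div/p'."""
        parts, node = [], self
        while node is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (tags and text only)."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:  # found the matching open tag
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.texts.append(data.strip())


def count_stats(node, in_link=False):
    """Record text length and link-text length for every subtree."""
    in_link = in_link or node.tag == "a"
    own = sum(len(t) for t in node.texts)
    node.text_len = own
    node.link_len = own if in_link else 0
    for child in node.children:
        count_stats(child, in_link)
        node.text_len += child.text_len
        node.link_len += child.link_len


def prune(node, should_delete):
    """Delete every descendant for which should_delete(node) is true."""
    node.children = [c for c in node.children if not should_delete(c)]
    for child in node.children:
        prune(child, should_delete)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    count_stats(root)                            # 1. preprocessing
    prune(root, lambda n: n.tag in NOISE_TAGS)   # 2. heuristic deletion (assumed)
    count_stats(root)                            # deletions change subtree totals
    prune(root, lambda n: n.tag in BLOCK_TAGS and n.text_len > 0
          and n.link_len / n.text_len > 0.5)     # 3. rule deletion (assumed rule)
    clusters = defaultdict(list)                 # 4. cluster by tag structure
    for node in walk(root):
        if node.texts:
            clusters[node.path()].append(node)
    if not clusters:
        return ""
    best = max(clusters.values(),
               key=lambda nodes: sum(len(t) for n in nodes for t in n.texts))
    return "\n".join(" ".join(n.texts) for n in best)
```

Calling `extract(html)` on a page returns the text of the cluster with the most text and discards every other cluster. Keeping only one cluster is a deliberate simplification: text nested in inline elements such as links inside a paragraph falls into a different cluster and is dropped, which a real system would need to handle.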

Description

Technical Field

[0001] The present invention generally relates to the field of news webpage content extraction, and more specifically, to a method and system for extracting news webpage content by using webpage tag clustering.

Background

[0002] In the field of news (or information) search, news text extraction is an essential step, and the quality of the text extraction determines the quality of news search and the user experience.

[0003] At present, there are various methods for news text extraction, which can be divided into two categories according to whether templates are used: template-based (or wrapper-based) extraction and non-template-based extraction.

[0004] In template-based extraction, the template is first defined, and then a program is written to parse and execute the template to obtain the data. According to the template generation method, it can be divided into: manual template extraction and automatic templat...
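For contrast, template-based extraction as described in the background can be sketched in a few lines of Python. This is only an illustration of the general idea of defining a template and executing it; the `TEMPLATE` patterns, the class names `headline` and `article-body`, and the function `apply_template` are hypothetical and do not come from the patent.

```python
import re

# Hypothetical hand-written template for one specific site layout:
# each field maps to a regular expression with one capture group.
TEMPLATE = {
    "title": r'<h1 class="headline">(.*?)</h1>',
    "body": r'<div class="article-body">(.*?)</div>',
}


def apply_template(template, html):
    """Execute the template against a page; unmatched fields become None."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, html, flags=re.DOTALL)
        result[field] = match.group(1).strip() if match else None
    return result
```

If the site renames a class or restructures the page, the affected fields come back as `None` until someone updates the template. That is the maintenance cost the background attributes to template-based methods.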


Application Information

IPC(8): G06F17/30
Inventor 高勇, 王放, 许欢庆, 郭永福, 陈沛
Owner BEIJING ZHONGSOU NETWORK TECH