Method for acquiring news web page text information

A text information extraction method, classified under instruments, computing, and electrical digital data processing. It addresses the complexity and high cost of generating and maintaining wrappers, and it reduces manual intervention while improving extraction efficiency and accuracy.

Active Publication Date: 2006-06-14
新方正控股发展有限责任公司 +2

AI Technical Summary

Problems solved by technology

When information comes from many sources, many wrappers are required, so generating and maintaining the wrappers becomes a complex task.

Examples

Example Embodiment

[0048] The method of the present invention is further explained below in conjunction with the embodiments and the drawings.

[0049] Take, as an example, extracting the body text from 1,000 news pages, arranged in chronological order, from the sports channel of Sina News. As shown in Figure 1, a method for extracting the body text of a news web page includes the following steps:

[0050] (1) Use a third-party web page cleaning tool (for example, the tidy tool) to apply standardized preprocessing to the 1,000 web pages so that they conform to the Html language standard. Then, following the Html markup, parse the Html data of all the news pages to obtain the Html tree;
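
As an illustration only, the parsing half of step (1) could be sketched in Python roughly as follows. This is not the patent's implementation: the names Node, TreeBuilder, parse_html and text_paths are hypothetical, the tidy preprocessing is assumed to have already been applied to the input, and the standard library html.parser module stands in for whatever parser the inventors used. The sketch keeps the currently open tags on a stack, so each piece of text is attached to the innermost open tag, and it reports the chain of enclosing tags for every text fragment.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag in Html; they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the Html tree (hypothetical helper type)."""

    def __init__(self, tag, text=""):
        self.tag = tag        # tag name, or "#text" for a text node
        self.text = text      # only used by "#text" nodes
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree by keeping the currently open tags on a stack."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.stack = [self.root]   # top of the stack = innermost open tag

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse_html(html):
    """Parse an (already cleaned) Html string and return the tree root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_paths(node, prefix=()):
    """Return (tag path, text) pairs for every text node, in document order."""
    out = []
    for child in node.children:
        if child.tag == "#text":
            out.append(("/".join(prefix), child.text))
        else:
            out.extend(text_paths(child, prefix + (child.tag,)))
    return out
```

Calling text_paths(parse_html(page)) on a cleaned page yields one entry per text fragment together with the tags that enclose it, which is the kind of per-level information that later filtering steps could work from.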

[0051] When parsing the Html data of all the news pages and constructing the Html tree, the following method is used:

[0052] Because in the present invention, the Html tag with The effect is the same, so the present invention is based on with The situation is completely simil...

Abstract

The invention relates to a method for extracting text information from news web pages and belongs to the field of web page information analysis and processing. Existing techniques ordinarily use a wrapper to extract the data of interest from web pages, and acquiring the pattern-recognition knowledge that the wrapper needs is time-consuming, laborious work that demands considerable expertise. The invention uses a stack data structure to convert the hierarchy information of the web page data into vectors, constructs and parses an Html tree, then compresses the data at each layer of the Html tree and filters, refines, recognizes, and recombines the data to extract the required information. The invention is suited to extracting template-generated news information from the news web pages of a fixed website.

Description

Technical field

[0001] The invention belongs to the technical field of web page information analysis and processing, and in particular relates to a method for extracting the body text of news web pages.

Background technique

[0002] With the rapid development of the Internet, the amount of information on the Web grows at an alarming rate every day. Many companies need a wide range of information and usually collect it from the Internet on a large scale, so the collection of massive amounts of information has become a concern for every enterprise. Because current information processing technology targets content in plain text format, while information on the Web mainly exists as static Html, how to convert the Html information collected from the Web into valuable plain text that is convenient for subsequent processing has become an urgent technical problem.

[000...

Application Information

IPC(8): G06F17/30
Inventor: 舒文兵, 吴於茜, 肖建国
Owner 新方正控股发展有限责任公司