Method for acquiring news web page text information

A text information extraction method, applied in the fields of instruments, computing, and electric digital data processing. It addresses inaccurate and incomplete text extraction and low extraction efficiency, reducing manual intervention and improving efficiency.

Inactive Publication Date: 2010-05-26
NEW FOUNDER HLDG DEV LLC +2

AI Technical Summary

Problems solved by technology

[0010] (1) The text information extracted by this method is incomplete. The text of a news web page is found not only in table elements but also in div elements. In addition, a news item comprises both body text and a title, and this method does not extract the title.
[0011] (2) The text information extracted by this method is not accurate enough, and its efficiency is low. The threshold used to select candidate table nodes is hard to set, and its value strongly affects the extraction result, so an unsuitable threshold yields very inaccurate text. Even with a suitable threshold, only table nodes whose strings contain more Chinese characters than the threshold are taken as candidates, which is still not accurate enough. In addition, setting the threshold requires a large number of experiments, which also reduces extraction efficiency.
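The threshold sensitivity described above can be illustrated with a minimal sketch. This is not the criticized method's actual implementation: the function names `count_chinese_chars` and `select_candidates` and the three sample strings are hypothetical, and "Chinese character" is assumed here to mean a code point in the basic CJK Unified Ideographs range.

```python
from typing import List


def count_chinese_chars(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs range (assumption)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def select_candidates(table_texts: List[str], threshold: int) -> List[str]:
    """Keep only strings whose Chinese-character count exceeds the threshold."""
    return [t for t in table_texts if count_chinese_chars(t) > threshold]


# Hypothetical strings standing in for the text of three table nodes.
tables = [
    "首页 新闻 体育",                      # navigation bar, 6 Chinese characters
    "中国队今晚在主场击败对手",            # short news body, 12 Chinese characters
    "版权所有 未经许可不得转载 联系我们",  # page footer, 16 Chinese characters
]

for threshold in (5, 10, 14):
    kept = select_candidates(tables, threshold)
    print(threshold, len(kept))
```

With these sample strings, a threshold of 5 keeps everything, 10 keeps the body but also the footer, and 14 drops the body while keeping the footer. No single value isolates the body text, which is the weakness the paragraphs above describe.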



Examples


Embodiment Construction

[0050] The method of the present invention will be further illustrated below in conjunction with the embodiments and accompanying drawings.

[0051] Take as an example the extraction of text information from 1000 news web pages, arranged in chronological order, captured from the sports channel of Sina News. As shown in Figure 1, a method for extracting news web page text information includes the following steps:

[0052] (1) Use a third-party web page cleaning tool (for example, the tidy tool) to normalize the 1000 web pages so that they conform to the HTML language standard; then, according to the HTML tags, parse the HTML data of all the news web pages to obtain the HTML tree;
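Step (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the input is assumed to have already been normalized (for example by tidy), the standard-library `html.parser.HTMLParser` stands in for whatever parser the inventors used, and the names `Node`, `TreeBuilder`, `build_tree`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of the HTML tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree by keeping the chain of currently open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = "<html><body><div><h1>Title</h1><p>Body text</p></div></body></html>"
div = build_tree(page).children[0].children[0].children[0]
print([child.tag for child in div.children])
```

The top of the stack is always the element currently being filled, so each new tag or text run attaches to the right parent without any look-ahead.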

[0053] The HTML data of all the news web pages is parsed and the HTML tree is constructed as follows:

[0054] Since in the present invention, the Html tag and The effect is the same, so the present invention uses and The sit...


Abstract

The invention relates to a method for extracting text information from news web pages, belonging to the technical field of web page information analysis and processing. Existing techniques usually use a wrapper to extract the data of interest from web pages, and acquiring the information pattern recognition knowledge that the wrapper needs is time-consuming, laborious work that demands a high level of expertise. The invention uses a stack data structure to convert the hierarchy information of web page data into vectors, and constructs and parses an HTML tree; it then compresses the data at each level of the HTML tree and filters, refines, recognizes, and recombines the data to extract the required information. The invention is applicable to extracting template-generated news information from the news web pages of a fixed website.
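One possible reading of "converting hierarchy information into vectors with a stack" is sketched below. The abstract does not define the vector format, so this is an assumption: each text fragment is paired with the tuple of open tag names at the moment it is read. The names `PathRecorder`, `path_vectors`, and `VOID_TAGS` are hypothetical, and the standard-library `html.parser.HTMLParser` is used only for illustration.

```python
from html.parser import HTMLParser

# Elements without closing tags are skipped so they never stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathRecorder(HTMLParser):
    """Records, for every text fragment, the stack of open tags as a vector."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append((tuple(self.stack), text))


def path_vectors(html):
    recorder = PathRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.records


page = "<div><h1>Title</h1><div><p>First</p><p>Second</p></div></div>"
for path, text in path_vectors(page):
    print(path, text)
```

On template-generated pages, fragments that play the same role tend to share the same path, which is what makes a per-level comparison of such vectors plausible as a basis for filtering.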

Description

Technical field

[0001] The invention belongs to the technical field of web page information analysis and processing, and in particular relates to a method for extracting news web page text information.

Background

[0002] With the rapid development of the Internet, the amount of information on the Web is increasing at an alarming rate every day. Many companies need a wide range of information and usually collect it on a large scale from the Internet, so the collection of massive amounts of information has become a concern for every enterprise. Current information processing technology targets content in plain text format, whereas information on the Web mainly exists as static HTML. How to convert the HTML information collected from the Web into valuable text information that facilitates subsequent processing has therefore become an urgent technical problem. [000...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 舒文兵, 吴於茜, 肖建国
Owner: NEW FOUNDER HLDG DEV LLC