Method for extracting content of text based on HTML characteristics

An extraction method for HTML web pages, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems of extracting too much content, mixing in unnecessary material, and reduced accuracy of text clustering and text classification, and it achieves reduced workload, reduced system consumption, and improved analysis efficiency.

Inactive Publication Date: 2007-12-26
上海新纳广告传媒有限公司
Cites: 0; Cited by: 23

AI Technical Summary

Problems solved by technology

In these technologies, text content extraction is very important. If all of the text on a page is extracted, the result is too large and is mixed with unnecessary material such as advertisements and navigation information, which are of...




Detailed description of the embodiments

[0017] The present invention is further described below with reference to the accompanying drawing.

[0018] As shown in Figure 1, the HTML feature-based text content extraction method divides the web page layout into content modules and non-content modules. A content module holds the actual content of the web page, while a non-content module generally displays navigation information, banners, copyright notices, or advertisements. The goal of the present invention is to decompose the HTML web page and extract the content modules from it. Each decomposed module is given a score according to its position in the web page layout: the closer a module is to the focus of the user's sight, the higher its score, and the farther away, the lower its score. If [...] is too large, the module may be displaying advertisements or navigation information. The present invention provides a comprehensive score formula for module content: comprehensive score=posi...
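The scoring idea in paragraph [0018] can be sketched in Python as follows. This is only an illustration under stated assumptions: the source's comprehensive score formula is truncated, so `stand_in_score` below is a hypothetical placeholder and not the patent's formula. The `Module` fields, the position score scale, and the use of a link-text ratio (suggested by the abstract's mention of link text length) are likewise assumptions. The scoring function is passed in as a parameter so that the real formula could replace the placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Module:
    """One decomposed page module (all field names are hypothetical)."""
    name: str
    position_score: float   # assumed scale: higher = nearer the user's visual focus
    text_length: int        # characters of visible text in the module
    link_text_length: int   # characters of that text that sit inside hyperlinks


def link_ratio(m: Module) -> float:
    """Share of the module's text that is hyperlink text (0.0 for an empty module)."""
    return m.link_text_length / m.text_length if m.text_length else 0.0


def stand_in_score(m: Module) -> float:
    # ASSUMPTION: placeholder only. The patent's formula is truncated in the
    # source; this simply rewards position and penalizes link-heavy modules.
    return m.position_score * (1.0 - link_ratio(m))


def rank_modules(
    modules: List[Module],
    score_fn: Callable[[Module], float] = stand_in_score,
) -> List[Module]:
    """Return modules ordered from most to least content-like under score_fn."""
    return sorted(modules, key=score_fn, reverse=True)


if __name__ == "__main__":
    page = [
        Module("nav", 0.3, 8, 8),
        Module("body", 1.0, 19, 4),
        Module("footer", 0.2, 40, 10),
    ]
    for m in rank_modules(page):
        print(m.name, round(stand_in_score(m), 3))
```

Under this placeholder, a module made up entirely of link text scores zero regardless of its position, which matches the intuition that link-heavy modules tend to be navigation or advertising.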



Abstract

A method for extracting text content based on HTML features comprises: using tags to decompose an input HTML web page into multiple modules; continuing to decompose each resulting module for as long as it can be decomposed further, until no table remains; assigning each module a position score according to its position in the layout; calculating, for each module, the link character length and the length of the text inside its hyperlinks; and obtaining a comprehensive score for each module according to the formula.
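The decomposition step can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not the patented procedure: it assumes that modules are delimited by `<table>` elements, that a module is finished once it contains no nested table, and that text sitting directly in an outer table is ignored. The class name `TableModuleParser`, the helper `decompose`, and the choice to record total text length alongside link text length are all hypothetical.

```python
from html.parser import HTMLParser


class TableModuleParser(HTMLParser):
    """Collects text statistics for each innermost <table> (assumed module unit)."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one record per currently open <table>
        self.modules = []    # finished tables that contain no nested table
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                # The enclosing table can be decomposed further.
                self.stack[-1]["has_child"] = True
            self.stack.append({"text": 0, "link": 0, "has_child": False})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            module = self.stack.pop()
            if not module["has_child"]:
                self.modules.append(module)
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # Text is credited to the nearest enclosing table only.
            self.stack[-1]["text"] += n
            if self.link_depth:
                self.stack[-1]["link"] += n


def decompose(html):
    """Return (text_length, link_text_length) for each innermost table, in order."""
    parser = TableModuleParser()
    parser.feed(html)
    parser.close()
    return [(m["text"], m["link"]) for m in parser.modules]
```

Each returned pair gives a module's visible text length and the part of it that is link text, which are the kinds of quantities the abstract feeds into the comprehensive score.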

Description

Technical field

[0001] The invention relates to a text content extraction method, in particular to a text content extraction method based on HTML features.

Background

[0002] As search engines have developed, users have come to demand more of them, and the technical requirements on search engines have risen accordingly. Many new technologies have emerged, such as text clustering, text classification, and automatic summarization. In these technologies, text content extraction is very important. If all the content of a page is extracted, the result is too large and is mixed with unnecessary material such as advertisements and navigation information, which is often repeated and is not what the user is searching for. Furthermore, too much repetitive or unnecessary information will reduce the accuracy of text clustering and text classification, and will also add some unnecessary pr...


Application Information

IPC(8): G06F17/30
Inventor: 金骏, 胡创义
Owner: 上海新纳广告传媒有限公司