Method for extracting content of text based on HTML characteristics

An extraction method for HTML web pages, applied in special data processing applications, instruments, and electrical digital data processing. It addresses the problems of unnecessary content being mixed into the extracted text, reduced accuracy of text clustering and text classification, and too much content being extracted, and it achieves reduced workload, reduced system consumption, and improved analysis efficiency.

Status: Inactive | Publication Date: 2007-12-26

上海新纳广告传媒有限公司

0 Cites, 23 Cited by

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Problems solved by technology

In these technologies, text content extraction is very important. If all of the text on a page is extracted, the result is too large and is mixed with many unnecessary items, such as advertisements and navigation information, which are often repeated and are not what the user is searching for. Furthermore, too much repetitive or unnecessary information reduces the accuracy of text clustering and text classification, and also adds unnecessary processing in the word segmentation stage.

Embodiment Construction

[0017] The present invention is further described below in conjunction with the accompanying drawings.

[0018] As shown in Figure 1, the HTML feature-based text content extraction method divides the web page layout into content modules and non-content modules. A content module holds the actual content of the web page, while a non-content module generally displays information such as navigation, banners, copyright notices, or advertisements. The goal of the present invention is to decompose an HTML web page and extract the content modules from it. Each decomposed module is given a score according to its position in the web page layout: a module in the focus of the user's sight receives a higher score, and a module outside it receives a lower score. If the amount of link text in a module is too large, the module probably displays advertisements or navigation information. The present invention provides a comprehensive score formula for module content: comprehensive score=posi...
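The scoring idea in paragraph [0018] can be illustrated with a small sketch. The patent's own comprehensive score formula is truncated in this text, so the code below does not reproduce it. The position labels, the values in `POSITION_SCORES`, and the function `illustrative_score` are all hypothetical stand-ins, chosen only to show how a position score and a link-text penalty might be combined.

```python
# Hypothetical position scores: higher where the reader's eye tends to rest.
# These labels and values are assumptions, not taken from the patent.
POSITION_SCORES = {
    "center": 1.0,
    "left": 0.4,
    "right": 0.4,
    "top": 0.3,
    "bottom": 0.2,
}


def illustrative_score(position, text_len, link_text_len):
    """Combine a position score with a penalty for link-heavy modules.

    This is an illustrative stand-in, NOT the patent's truncated formula.
    """
    if text_len == 0:
        return 0.0
    link_ratio = link_text_len / text_len  # share of the text that sits inside links
    return POSITION_SCORES[position] * (1.0 - link_ratio)


def pick_content_module(modules):
    """Return the module with the highest illustrative score."""
    return max(
        modules,
        key=lambda m: illustrative_score(
            m["position"], m["text_len"], m["link_text_len"]
        ),
    )


# Made-up measurements for three modules of an imaginary page.
MODULES = [
    {"name": "nav", "position": "top", "text_len": 40, "link_text_len": 38},
    {"name": "article", "position": "center", "text_len": 900, "link_text_len": 45},
    {"name": "ads", "position": "right", "text_len": 60, "link_text_len": 60},
]

best = pick_content_module(MODULES)
print(best["name"])
```

With these made-up numbers the centrally placed, mostly link-free module scores highest, while the navigation bar and the advertisement block are pushed down by their link-text share.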

Abstract

A method for extracting text content based on HTML features. The method uses tags to decompose an input HTML web page into multiple modules; continues decomposing any resulting module that can be decomposed further without a table occurring; assigns each module a position score according to its position in the layout; and calculates the link character length of each module and the text length within each module's hyperlinks, in order to obtain a comprehensive score for each module according to the formula.
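The decomposition and measurement steps described in the abstract can be sketched as follows. This is a minimal illustration, assuming that modules correspond to `<table>` elements and that text belongs to the innermost open table; the class `ModuleSplitter`, the helper `split_modules`, and the sample page are hypothetical names and data, not taken from the patent.

```python
from html.parser import HTMLParser


class ModuleSplitter(HTMLParser):
    """Split a page into modules at <table> boundaries and measure their text.

    Assumption for illustration: each <table> is one module, and nested
    tables become separate modules of their own.
    """

    def __init__(self):
        super().__init__()
        self.modules = []     # finished modules, in the order their tables close
        self._stack = []      # tables that are currently open
        self._link_depth = 0  # greater than zero while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append({"text_len": 0, "link_text_len": 0, "links": 0})
        elif tag == "a":
            self._link_depth += 1
            if self._stack:
                self._stack[-1]["links"] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.modules.append(self._stack.pop())
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return  # ignore whitespace and text outside any table
        module = self._stack[-1]  # the innermost open table owns the text
        module["text_len"] += len(text)
        if self._link_depth:
            module["link_text_len"] += len(text)


def split_modules(html):
    """Parse an HTML string and return one measurement dict per module."""
    parser = ModuleSplitter()
    parser.feed(html)
    parser.close()
    return parser.modules


PAGE = """
<table><tr><td><a href="/">Home</a> <a href="/news">News</a></td></tr></table>
<table><tr><td>The body text of the article goes here.</td></tr></table>
"""

for module in split_modules(PAGE):
    print(module)
```

In the sample page, all of the first table's text sits inside links, which is typical of navigation, while the second table has no link text at all. Measurements like these are the inputs that a per-module score would then combine with a position score.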

Description

Technical field

[0001] The invention relates to a text content extraction method, in particular to a text content extraction method based on HTML features.

Background

[0002] With the development of search engines, users expect more and more of them, and the technical requirements placed on search engines keep rising. Many new technologies have emerged, such as text clustering, text classification, and automatic summarization. In these technologies, text content extraction is very important. If all of the text on a page is extracted, the result is too large and is mixed with many unnecessary items, such as advertisements and navigation information, which are often repeated and are not what the user is searching for. Furthermore, too much repetitive or unnecessary information reduces the accuracy of text clustering and text classification, and will also add some unnecessary pr...

Application Information

IPC(8): G06F17/30

Inventors: 金骏, 胡创义

Owner: 上海新纳广告传媒有限公司
