Method for extracting content of text based on HTML characteristics

An extraction method for HTML web pages, applied in special-purpose data processing, instruments, electrical digital data processing, and similar fields. It addresses the problem that extracting all of a page's text yields too much content, mixed with unneeded material, which harms the accuracy of text clustering and text classification. It achieves a reduced workload, lower system consumption, and improved analysis efficiency.
CN101093487A · Inactive · Publication Date: 2007-12-26 · 上海新纳广告传媒有限公司

Patent Information

Authority / Receiving Office: CN · China
Current Assignee / Owner: 上海新纳广告传媒有限公司
Publication Date: 2007-12-26
Estimated Expiration: Not applicable · inactive patent


Abstract

A method for extracting text content based on HTML features. Tags are used to decompose the input HTML web page into multiple modules; a decomposed module is decomposed further as long as it can still be split and no table occurs; each module is assigned a position score according to its position in the layout; the link character length of each module and the length of the text inside its hyperlinks are calculated; and an integrated score for each module is obtained from these values according to a formula.
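
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the scoring formula is not reproduced in this text, so the one used here (position score multiplied by the amount of non-link text) is a hypothetical stand-in. The set of tags that delimit modules, the position scores of 0.5 and 1.0, and the names `BLOCK_TAGS`, `ModuleSplitter`, and `score_modules` are likewise assumptions made for illustration. The sketch also assumes well-formed HTML with matching closing tags.

```python
from html.parser import HTMLParser

# Assumed set of tags that delimit a module; the patent text does not list them.
BLOCK_TAGS = {"div", "td", "p", "li"}


class ModuleSplitter(HTMLParser):
    """Attribute each piece of text to its innermost enclosing block tag."""

    def __init__(self):
        super().__init__()
        self.modules = []    # all modules, in document order
        self.stack = []      # modules whose tags are currently open
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            module = {"text_len": 0, "link_len": 0}
            self.modules.append(module)
            self.stack.append(module)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.stack.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        module = self.stack[-1]
        module["text_len"] += len(text)
        if self.link_depth:
            module["link_len"] += len(text)


def score_modules(html):
    """Return the non-empty modules of a page, each with a hypothetical score."""
    splitter = ModuleSplitter()
    splitter.feed(html)
    splitter.close()
    modules = [m for m in splitter.modules if m["text_len"] > 0]
    last = len(modules) - 1
    for i, m in enumerate(modules):
        # Hypothetical position score: first and last modules (typical spots
        # for navigation and footers) count half as much as the others.
        position = 0.5 if i in (0, last) else 1.0
        # Hypothetical combined score: position score times non-link text length.
        m["score"] = position * (m["text_len"] - m["link_len"])
    return modules


sample = (
    '<div><a href="/">Home</a><a href="/n">News</a></div>'
    "<div><p>A long paragraph of body text goes here.</p></div>"
    '<div><a href="/c">Contact</a></div>'
)
modules = score_modules(sample)
best = max(modules, key=lambda m: m["score"])
print(best)
```

In this sample the two link-only blocks score zero because all of their text sits inside hyperlinks, so the paragraph is selected as the likely body text. A real implementation would need to handle unclosed tags, scripts, and the table condition mentioned in the abstract.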

Description

Technical field

[0001] The invention relates to a text content extraction method, in particular to a text content extraction method based on HTML features.

Background technique

[0002] As search engines have developed, users have come to expect more of them, and the technical demands on search engines have risen accordingly. Many new technologies have emerged, such as text clustering, text classification, and automatic summarization. In all of these, text content extraction is very important. If all of a page's text is extracted, the result is too large and mixed with unneeded material, such as advertisements and navigation information, which is often repeated and is not what the user is searching for. Furthermore, too much repetitive or unnecessary information will reduce the accuracy of text clustering and text classification, and will also add some unnecessary pr...

Claims
