Method for extracting content of text based on HTML characteristics
Patent Information
- Authority / Receiving Office
- CN · China
- Current Assignee / Owner
- 上海新纳广告传媒有限公司
- Publication Date
- 2007-12-26
- Estimated Expiration
- Not applicable · inactive patent
Abstract
Description
Technical field
[0001] The invention relates to a text content extraction method, and in particular to a text content extraction method based on HTML features.
Background art
[0002] As search engines have developed, users have come to expect more and more of them, and the technical demands on search engines have risen accordingly. Many new technologies have emerged, such as text clustering, text classification, and automatic summarization. Text content extraction is very important to all of these technologies. If all the content of the text is extracted, the result is too large and is mixed with a lot of unnecessary material, such as advertisements and navigation information, which is often repeated and is not the target of the user's search. Furthermore, too much repetitive or unnecessary information will reduce the accuracy of text clustering and text classification, and will also add some unnecessary pr...