Webpage text extraction method based on maximum text density

A web page text extraction technique, applied in the field of information processing, that addresses the limited applicability and lack of generality of existing methods, avoids the time-consuming and labor-intensive acquisition of information pattern recognition knowledge, and improves extraction accuracy.

Inactive Publication Date: 2014-04-09
TONGJI UNIV

AI Technical Summary

Problems solved by technology

Methods based on DOM structure and web page segmentation mainly analyze HTML tags, but web pages are becoming increasingly complex and non-standard, so interpreting page content through HTML semantics alone is no longer adequate.
Template-based methods can only target a particular type of information source in a specific format, and acquiring the information pattern recognition knowledge needed to build the templates is time-consuming and laborious. Because web pages are becoming more diverse and customizable, such methods are not general.



Embodiment Construction

[0040] As shown in Figure 1, the specific steps of the web page text extraction method based on maximum text density are as follows:

[0041] 1. Web page preprocessing

[0042] (1) Character encoding problem

[0043] Common encodings include GBK (covering both Simplified and Traditional Chinese), GB2312 (Simplified Chinese), BIG-5 (Traditional Chinese), UTF-8, UTF-16, and Unicode. In an HTML document, the encoding is declared as follows:

[0044]

[0045]

[0046]

[0047] The charset attribute defines how the web page is encoded. To prevent garbled characters, the acquired web page file is converted from its default encoding to UTF-8 during the preprocessing stage. If the encoding information cannot be obtained from the web page, a conversion to UTF-8 is still attempted.
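The encoding step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `declared_charset` and `to_utf8`, the regular expression, and the fallback list of encodings are all assumptions introduced for the example.

```python
import codecs
import re

# Assumed pattern: finds charset=... inside a <meta> tag, covering both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def declared_charset(raw: bytes):
    """Return the codec name declared in a <meta> tag, or None."""
    match = _META_CHARSET.search(raw[:4096])
    if not match:
        return None
    name = match.group(1).decode("ascii", errors="ignore")
    try:
        return codecs.lookup(name).name  # normalized Python codec name
    except LookupError:
        return None  # the page names an encoding Python does not know


def to_utf8(raw: bytes, fallbacks=("utf-8", "gb18030", "big5")) -> bytes:
    """Re-encode a fetched page as UTF-8.

    The declared charset is tried first; the fallback order is an
    assumption chosen for Chinese-language pages.
    """
    declared = declared_charset(raw)
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

Trying the declared charset before any guess keeps the behavior predictable for well-formed pages, while the fallback loop covers pages that omit or misstate their encoding.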

[0048] (2) Web page standardization

[0049] Now the HTML code format on som...


Abstract

The invention relates to a web page text extraction method based on maximum text density. The method comprises the following steps: (1) preprocessing the web page, which covers handling the character encoding and standardizing the page; (2) parsing the web page into a DOM tree and extracting the tag text blocks according to specific tags; (3) calculating the text density of each tag text block; and (4) extracting the text: once all tag text blocks have been processed, they are sorted by their calculated text densities and the tag with the maximum text density is selected. That tag, together with the content of its nested sub-tags, forms the text block, and the main text is obtained after the tags are stripped. The method has low algorithmic complexity, is general, and works well on web pages with complex structures.
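Steps (2) to (4) of the abstract can be sketched as below. The density formula is not given in this excerpt, so the sketch assumes density = text characters / (text characters + markup characters) within a block. The candidate tag set, the `min_chars` threshold, and the names `Block`, `DensityParser`, and `extract_main_text` are likewise assumptions, and the input is assumed to be already standardized so that tags are properly nested.

```python
from html.parser import HTMLParser

CANDIDATE_TAGS = {"div", "td", "article", "section"}  # assumed "specific tags"
SKIP_TAGS = {"script", "style"}  # content that is never visible text


class Block:
    """One candidate tag together with everything nested inside it."""

    def __init__(self, tag):
        self.tag = tag
        self.text_parts = []
        self.markup_chars = 0

    @property
    def text(self):
        # Join the text fragments and collapse runs of whitespace.
        return " ".join(" ".join(self.text_parts).split())

    @property
    def density(self):
        # Assumed formula: share of the block's characters that are text.
        text_chars = len(self.text)
        total = text_chars + self.markup_chars
        return text_chars / total if total else 0.0


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_blocks = []    # stack of candidate blocks still open
        self.closed_blocks = []  # finished blocks, ready for ranking
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        for block in self.open_blocks:  # nested markup counts against parents
            block.markup_chars += len(self.get_starttag_text())
        if tag in CANDIDATE_TAGS:
            self.open_blocks.append(Block(tag))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        if (tag in CANDIDATE_TAGS and self.open_blocks
                and self.open_blocks[-1].tag == tag):
            self.closed_blocks.append(self.open_blocks.pop())
        for block in self.open_blocks:
            block.markup_chars += len(tag) + 3  # length of "</tag>"

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        for block in self.open_blocks:
            block.text_parts.append(data)


def extract_main_text(html, min_chars=20):
    """Return the text of the candidate block with the highest density."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [b for b in parser.closed_blocks if len(b.text) >= min_chars]
    if not candidates:
        return ""
    ranked = sorted(candidates, key=lambda b: b.density, reverse=True)
    return ranked[0].text
```

Under this assumed formula a navigation bar made of short links scores low because its markup outweighs its text, whereas a block of article paragraphs scores high. The `min_chars` threshold is there only to stop very short, markup-free fragments from winning.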

Description

Technical field

[0001] The present invention relates to Internet-based information processing, specifically network information extraction and application.

Background art

[0002] The World Wide Web has become an important source of information. Users usually view web pages directly in a browser, and many Internet-based information processing tasks (such as information search, data mining, and machine translation) also operate on the content of web pages. However, the main text of a web page is often surrounded by "web page noise" such as advertisement links, navigation bars, and copyright information. Extracting the main text of web pages accurately and efficiently has therefore become an important topic in network information extraction and application, with high application value and practical significance.

[0003] At p...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/986
Inventor: 蒋昌俊, 陈闳中, 闫春钢, 丁志军, 王鹏伟, 何源, 夏琳娟
Owner: TONGJI UNIV