Aggregated text density based webpage body text extraction method and apparatus

A webpage body text extraction technique for website content management, network data retrieval, special-purpose data processing, and similar applications. It aims to extract body text accurately and efficiently with a simple method.

Active Publication Date: 2016-07-06
NAT UNIV OF DEFENSE TECH
Cites: 5. Cited by: 7.

AI Technical Summary

Problems solved by technology

Existing problems: existing approaches turn a simple problem into a complicated one, which makes body text extraction cumbersome and hinders wide application.


Embodiment Construction

[0059] Tags are parsed and stored as units; paragraphs are then clustered with a text clustering algorithm, and the body text is finally generated. Existing problems: these approaches turn a simple problem into a complicated one, which makes body text extraction cumbersome and hinders wide application.

Summary of the invention

The purpose of the present invention is to provide a method and device for extracting webpage body text based on aggregated text density, in order to solve the technical problems of the prior art described in the background above. The ...


Abstract

The present invention provides a method and apparatus for extracting webpage body text based on aggregated text density. The method segments a page's text content by splitting the HTML at its tags, which effectively separates the different types of text on the page. No site-specific extraction rules need to be customized, so the method is highly general; no complex text mining is required, so it is simple, efficient, and accurate across many types of webpage body text.
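The tag-based segmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: this page does not give the density formula or the aggregation rule, so the code assumes that a block's density is simply its character count and that "aggregation" means picking the contiguous run of blocks whose total length most exceeds a per-block baseline. The names `TagSplitter`, `segment`, `densest_run`, `extract_body`, and the `baseline` parameter are all hypothetical.

```python
from html.parser import HTMLParser


class TagSplitter(HTMLParser):
    """Split page text into blocks at every tag boundary (hypothetical helper)."""

    SKIP = {"script", "style"}  # contents of these tags are treated as noise

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the buffered text as one block.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        self._flush()
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def segment(html):
    """Return the list of text blocks obtained by splitting the HTML at its tags."""
    parser = TagSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def densest_run(lengths, baseline=30):
    """Assumed aggregation rule: the contiguous run maximizing sum(length - baseline).

    Returns a half-open index range (start, end); (0, 0) if no block beats the baseline.
    """
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, n in enumerate(lengths):
        if cur_sum <= 0:  # a non-positive prefix never helps, so restart here
            cur_sum, cur_start = 0, i
        cur_sum += n - baseline
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract_body(html, baseline=30):
    """Segment by tag, then keep the densest contiguous run of blocks."""
    blocks = segment(html)
    start, end = densest_run([len(b) for b in blocks], baseline)
    return "\n".join(blocks[start:end])


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><script>var x = 1;</script></head><body>"
        '<div><a href="/">Home</a><a href="/n">News</a></div>'
        "<p>First long paragraph of body text that carries the actual article content.</p>"
        "<p>Second long paragraph that continues the article with more real sentences.</p>"
        "<div>Copyright 2016</div></body></html>"
    )
    print(extract_body(page))
```

Short blocks such as navigation links and copyright lines fall below the baseline and pull a run's score down, while consecutive long paragraphs reinforce each other. No site-specific rule is involved, which matches the generality claimed in the abstract. The baseline of 30 characters is an arbitrary illustrative value.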

Description

Technical field

[0001] The present invention relates to the technical field of web crawlers, and in particular to a method and device for extracting webpage body text based on aggregated text density.

Background

[0002] With the rapid development of social informatization, the Internet has become an important source of information. Users usually view web page content directly in a browser. In addition, many Internet-based information processing tasks (such as information retrieval, data mining, and machine translation) also operate on the information content of web pages, that is, on the page body. Besides useful information (such as the body content), however, most web pages also contain a lot of noise, such as site navigation, related links and advertisements, copyright notices, and script code. How to accurately and efficiently extract the text information of web pages, so t...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/313, G06F16/986
Inventor: 刘忠, 陈发君, 黄金才, 朱承, 修保新, 程光权, 陈超, 冯旸赫
Owner: NAT UNIV OF DEFENSE TECH