Text extraction method and apparatus

A node-width-based text extraction technology, applied in the computer field, that addresses the complexity of existing text extraction methods.

Active Publication Date: 2016-07-06
INSPUR QILU SOFTWARE IND
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

In short, the prior-art methods of extracting text are relatively complicated.

Examples

Detailed Description of the Embodiments

[0091] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments, without creative work, fall within the scope of protection of the present invention.

[0092] As shown in Figure 1, an embodiment of the present invention provides a method for extracting text, which may include the following steps:

[0093] S1: Obtain the CSS (Cascading Style Sheets) content in the webpage to be extracted;

[0094] S2: Determine the width of the node in the webpage to be extracted according to t...
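
Steps S1 and S2 can be sketched roughly as follows. This is a minimal illustration, not the patented procedure: the text above is truncated before it explains how widths are derived, so the sketch assumes (hypothetically) that the CSS lives in <style> elements, that rules use a single id or class selector, and that widths are given in pixels. The names StyleCollector, get_css_content, and widths_by_selector are invented for this example.

```python
import re
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collects the text of every <style> element (a stand-in for step S1)."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.css_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.css_chunks.append(data)


def get_css_content(html):
    """Return the concatenated contents of all <style> blocks in the page."""
    collector = StyleCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.css_chunks)


# Assumption: only simple rules such as "#main { width: 720px; }" are handled.
RULE_RE = re.compile(r"([#.][\w-]+)\s*\{([^}]*)\}")
# The lookbehind keeps "max-width" and "min-width" from matching.
WIDTH_RE = re.compile(r"(?<![\w-])width\s*:\s*(\d+)px")


def widths_by_selector(css):
    """Map each simple selector to its declared pixel width (a stand-in for step S2)."""
    widths = {}
    for selector, body in RULE_RE.findall(css):
        match = WIDTH_RE.search(body)
        if match:
            widths[selector] = int(match.group(1))
    return widths
```

A real page would also need external stylesheets, inline style attributes, and relative units to be resolved, none of which this sketch attempts.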

Abstract

Embodiments of the present invention provide a text extraction method and apparatus. The method comprises: acquiring the CSS (Cascading Style Sheets) content in a to-be-extracted webpage; determining, according to the CSS content, the width of each node in the to-be-extracted webpage; determining, according to those widths, a text node containing the text content; and, according to the text node, clearing the to-be-extracted webpage and extracting its text. The text extraction method and apparatus provided by the embodiments of the present invention are capable of implementing text extraction in a simpler way.
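
The last two stages of the abstract (choosing a text node from the node widths, then extracting its text) might look like the sketch below. The abstract does not say how width identifies the text node, so this example assumes, purely for illustration, that the widest node is the text node. The widths are taken as an already computed mapping from element id to pixel width. TextNodeExtractor and extract_text are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeExtractor(HTMLParser):
    """Collects visible text inside the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # > 0 while inside the chosen text node
        self.skip = 0    # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag in ("script", "style"):
                self.skip += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if tag in ("script", "style") and self.skip:
                self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html, widths_by_id):
    """Pick the widest node (an assumed rule) and return the text inside it."""
    text_node_id = max(widths_by_id, key=widths_by_id.get)
    extractor = TextNodeExtractor(text_node_id)
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


page = (
    '<div id="nav">Home | About</div>'
    '<div id="main"><p>First paragraph.</p><script>var x = 1;</script>'
    '<p>Second<br>paragraph.</p></div>'
    '<div id="ad">Buy now</div>'
)
print(extract_text(page, {"nav": 200, "main": 720, "ad": 160}))
# prints: First paragraph. Second paragraph.
```

Note that the sketch needs no density or count thresholds: once the widths are known, a single comparison selects the node, which is in the spirit of the "simpler way" the abstract claims.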

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for extracting text.

Background

[0002] Text extraction is a part of web page structuring and the basis for text mining of web pages. In the prior art, the text is extracted through a statistical method based on multiple filtering indicators such as text density ratio, text count, and punctuation marks. A web page contains both text and HTML (HyperText Markup Language) tag content. Existing methods assume that nodes containing the text hold a large amount of dense text. Because web pages can be written flexibly, the HTML tags of some web pages contain a large amount of CSS content while those of others contain little, so the text density ratio can vary widely.

[0003] In the existing method, various thresholds such as text density and text count need to be set, and different text density thresholds generally ne...
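
To see why the text density ratio used by the prior art varies so much, consider the small sketch below. It assumes one common definition, visible text characters divided by total characters of the HTML fragment, which the description above does not spell out. The names TextCounter and text_density are hypothetical.

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Counts visible text characters, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip:
            self.text_chars += len(data.strip())


def text_density(fragment):
    """Assumed definition: visible text characters / total fragment characters."""
    if not fragment:
        return 0.0
    counter = TextCounter()
    counter.feed(fragment)
    counter.close()
    return counter.text_chars / len(fragment)


plain = "<p>Hello world, this is text.</p>"
styled = ('<p style="font-size:14px;color:#333;line-height:1.6">'
          "Hello world, this is text.</p>")
print(text_density(plain) > text_density(styled))  # prints: True
```

The two fragments carry the same sentence, yet the inline CSS lowers the ratio of the second one considerably, so a single density threshold cannot suit both pages.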

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577, G06F16/986
Inventor: 毛立花, 孙海峰, 王传超
Owner: INSPUR QILU SOFTWARE IND