Website text extraction method and device

A website text extraction technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses the problem of inaccurate extraction of website text information.

Active Publication Date: 2018-08-24
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The embodiments of the present invention provide a method and device for extracting website text, so as to at least solve the technical problem of inaccurate extraction of website text information.



Examples

Embodiment Construction

[0023] To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, shall fall within the protection scope of the present invention.

[0024] It should be noted that the terms "first" and "second" in the description and claims of the present invention and in the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate ...

Abstract

The invention discloses a website text extraction method and device. The method comprises the following steps: extracting node information from each webpage of a website; classifying the node information as first node information or second node information, wherein the first node information is node information that includes webpage body text and the second node information is node information that does not include webpage body text; extracting the webpage body text from the first node information; according to the first node information, extracting again from a target webpage the node information that meets a preset condition, wherein the target webpage is a webpage from which second node information was extracted; and taking the text included in the re-extracted node information as the webpage body text of the target webpage. This resolves the technical problem that the text information of a website is not accurately extracted.
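The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how node information is classified or what the preset condition is, so the sketch assumes (hypothetically) that a page's longest-text node counts as first node information when its node path matches the most common such path on the site, and that the preset condition is "same node path". The function names, the `(node_path, text)` representation, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def longest_node(nodes):
    """Return the (node_path, text) pair with the longest text."""
    return max(nodes, key=lambda node: len(node[1]))


def extract_site_body_text(pages):
    """Map each page URL to its body text.

    `pages` maps a URL to a list of (node_path, text) pairs.
    Assumption: the longest node of a page is "first node information"
    if its path equals the most common longest-node path on the site;
    otherwise it is "second node information" and the page is re-extracted.
    """
    candidates = {url: longest_node(nodes) for url, nodes in pages.items()}
    path_counts = Counter(path for path, _ in candidates.values())
    common_path, _ = path_counts.most_common(1)[0]

    body = {}
    for url, (path, text) in candidates.items():
        if path == common_path:
            # First node information: use its text directly.
            body[url] = text
        else:
            # Second node information: extract again, keeping only the node
            # that meets the assumed preset condition (same path).
            matches = [t for p, t in pages[url] if p == common_path]
            body[url] = matches[0] if matches else None
    return body


# Hypothetical forum site: on /c a comment is longer than the post itself.
site = {
    "/a": [("div.post", "A long first post about extraction."), ("div.comment", "Nice.")],
    "/b": [("div.post", "Another reasonably long post body."), ("div.comment", "Agreed.")],
    "/c": [("div.post", "Short post."), ("div.comment", "A comment that is much longer than the post.")],
}
result = extract_site_body_text(site)
```

In this sample, the longest node on `/c` is a comment, so that page is re-extracted using the path learned from `/a` and `/b`, and its short post is returned as the body text.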

Description

Technical field

[0001] The invention relates to the field of website information extraction, in particular to a method and device for website text extraction.

Background technique

[0002] Body text extraction refers to extracting the body text of a web page and removing the other parts. In the Internet field, body text extraction is a very common and basic requirement.

[0003] Generally speaking, the body text is the area with the largest amount of text in a web page. Therefore, a common extraction method is to obtain the source code of the web page and find the child node with the longest plain text length; the content of this node is taken as the body text. For example, as shown in Figure 1, the boxed part contains the longest text content, so the content of the boxed part is used as the body text.

[0004] Sometimes, however, the body text does not contain the most text content. For example, in a forum website, there may be a comment with more text content than the body text. At thi...
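The longest-text baseline from the background section can be sketched as follows, using only Python's standard `html.parser`. This is an illustrative sketch under stated assumptions, not code from the patent: it assumes reasonably well-formed HTML, measures only the text held directly by each element (not by its descendants), and the class name `LongestTextFinder`, the helper `longest_text_node`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LongestTextFinder(HTMLParser):
    """Remember the element whose own (direct) text is longest."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.open = []  # stack of (tag, text fragments) for open elements
        self.best_tag, self.best_text = None, ""

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.open.append((tag, []))

    def handle_data(self, data):
        # Ignore text outside any element and the contents of script/style.
        if self.open and self.open[-1][0] not in ("script", "style"):
            self.open[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.open:
            return
        name, fragments = self.open.pop()
        text = " ".join(f for f in fragments if f)
        if len(text) > len(self.best_text):
            self.best_tag, self.best_text = name, text


def longest_text_node(html):
    """Return (tag, text) of the element with the longest direct text."""
    finder = LongestTextFinder()
    finder.feed(html)
    finder.close()
    return finder.best_tag, finder.best_text


article = (
    "<html><body><div>Home</div>"
    "<div>This is the long article body of the page.</div>"
    "<div>Nice!</div></body></html>"
)
forum = (
    "<div>Short post.</div>"
    "<div>A much longer comment than the post itself.</div>"
)
article_result = longest_text_node(article)
forum_result = longest_text_node(forum)
```

On the `article` sample the baseline picks the article text, but on the `forum` sample it picks the comment, which is the failure case that paragraph [0004] describes.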


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 曹志明
Owner: BEIJING GRIDSUM TECH CO LTD