Method and device for extracting webpage text

A webpage text extraction technology, applied in the computer field, that addresses problems such as difficulty in determining the beginning and end of the text, a low completeness rate, and a heavy manual workload, and that achieves the effect of reducing extraction time, reducing labor costs, and improving efficiency.

Active Publication Date: 2019-07-16
BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0004] In the process of realizing the present invention, the inventor found at least the following problems in the prior art: 1. Template-based extraction of webpage text requires manual participation, involves a heavy workload, and requires the template to be reconfigured whenever the structure of the webpage changes; 2. With extraction based on block text density, it is difficult to determine the beginning and end of the text, and the completeness rate is low; 3. Extraction based on visual webpage segmentation requires JavaScript and other engines, which is complex and time-consuming; 4. No method in the prior art is suitable for extracting text from all types of webpages.
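To make problem 2 concrete, the following is a minimal sketch of a line-level text density filter. It is not taken from the patent or from any specific prior-art system: the function names `line_density` and `dense_lines`, the regular expression used to strip tags, and the threshold of 0.6 are all assumptions chosen for illustration.

```python
import re

# Assumption: anything between "<" and ">" is treated as markup.
TAG_RE = re.compile(r"<[^>]+>")


def line_density(line):
    """Share of a line's characters that are visible text rather than markup."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped)
    return len(text) / len(stripped)


def dense_lines(html, threshold=0.6):
    """Keep the text of lines whose density reaches the (hypothetical) threshold."""
    kept = []
    for line in html.splitlines():
        if line_density(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return kept
```

In this sketch a short opening or closing paragraph such as `<p>End.</p>` has a low density and is discarded together with the navigation links, which is one way the beginning or end of the text can be lost.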



Embodiment Construction

[0049] Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present invention to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

[0050] Current methods for extracting the text of a web page fall short of expectations. Starting from the characteristics of web page text and weighing the advantages and disadvantages of the existing technology, the present invention provides an intelligent method for extracting the text of a web page, which can accurately and comple...


Abstract

The invention discloses a method and device for extracting a webpage text, and relates to the technical field of computers. A specific embodiment of the method comprises the following steps: constructing an access model according to a to-be-extracted webpage; calculating a similarity value between each unit area of the main body part and the feature part; screening a unit text area from the access model according to the similarity value and the first index value of each unit area; and determining the beginning and ending of the text of the to-be-extracted webpage according to the unit text area to obtain the complete text of the to-be-extracted webpage. According to the embodiment, the webpage text can be accurately and completely extracted, the labor cost is reduced, and the efficiency of extracting the webpage text is improved.
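The steps named in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented method: the abstract does not define the access model, the feature part, the unit areas, or the first index value, so the code assumes that the access model is a simple parse of the HTML, the feature part is the `<title>` text, each `<p>` element is a unit area, and the first index value is the share of non-link text in a unit area. The class `UnitAreaParser`, the function `extract_text`, and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class UnitAreaParser(HTMLParser):
    """Assumed "access model": the title text plus one record per <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""        # assumed feature part
        self.units = []        # one [text, link_chars] pair per assumed unit area
        self._in_title = False
        self._in_unit = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_unit = True
            self.units.append(["", 0])
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_unit = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_unit:
            self.units[-1][0] += data
            if self._in_link:
                self.units[-1][1] += len(data)


def extract_text(html, sim_threshold=0.1, index_threshold=0.5):
    """Screen unit areas, then return everything from the first to the last kept one."""
    parser = UnitAreaParser()
    parser.feed(html)
    title = parser.title.strip().lower()
    kept = []
    for i, (text, link_chars) in enumerate(parser.units):
        text = text.strip()
        if not text:
            continue
        # Similarity between the unit area and the assumed feature part.
        similarity = SequenceMatcher(None, title, text.lower()).ratio()
        # Hypothetical "first index value": share of characters outside links.
        index_value = 1 - link_chars / len(text)
        if similarity >= sim_threshold and index_value >= index_threshold:
            kept.append(i)
    if not kept:
        return ""
    # The first and last screened unit areas mark the beginning and end of the text.
    start, end = kept[0], kept[-1]
    parts = [unit[0].strip() for unit in parser.units[start:end + 1]]
    return "\n".join(part for part in parts if part)
```

Taking the span between the first and last screened unit areas, rather than only the screened areas themselves, is one possible reading of "determining the beginning and ending of the text"; it keeps short in-between paragraphs that would fail the screening on their own.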

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for extracting webpage text.

Background technique

[0002] With the rapid development of society, the Internet has gradually become the main platform for information release and acquisition, and the data on it has been growing exponentially. Internet data has covered all fields of the real world, such as economy, politics, and culture, and constitutes an important source of information for many applications. However, in addition to the text that people need, the content of the web page also has content that has nothing to do with the text, such as copyright information, advertisements, navigation bars, and decoration information, which are called noise information. How to shield the noise information and extract the text from the webpage has become a hot spot in current research.

[0003] At present, the method for extracting webpage text has the following...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/957, G06F16/9535
CPC: G06F16/9577, G06F16/9535
Inventor: 贾宝玉, 李杰, 周旭
Owner: BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD