A Method of Webpage Text Extraction

A webpage body-text extraction technology, applied in the computer field, that addresses problems such as low performance and spam content in extracted text, and achieves improved work efficiency, high analysis efficiency, and accurate results

Inactive Publication Date: 2017-01-18
北京中搜云商网络技术有限公司
Cites: 4; Cited by: 0

AI Technical Summary

Problems solved by technology

[0006] The disadvantage of manually written templates is that writing them takes a lot of human effort, and as the target website changes, the cost of maintaining them is also very high.
The disadvantage of automatic template generation is that the algorithm is complex, and the target website must still be monitored periodically so that the template keeps up with its changes.
Whether the template is generated manually or automatically, the underlying assumption is that the website's data is produced from a template. For some large websites this basically holds, although different entry points may use different templates. Many small and medium websites, however, are poorly templated: template extraction recovers only most of the information and is more likely to include spam.
Vision-based page segmentation algorithms are unsuitable for news search engine applications because of their complex rules and low performance.

Examples

Embodiment Construction

[0032] The specific embodiments of the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0033] A web page contains information such as the article title, source, publication time, body text, and author. It may also include a large number of advertisements, spam, and so on. In news web pages, the "longest string" mostly appears in the body text. The method uses this feature to find a paragraph in the body-text area and obtain its tag features, and then uses those tag features to search forward, backward, or in both directions for similar tag nodes. This process is referred to as "tag clustering".
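The tag clustering idea in [0033] can be illustrated with a minimal sketch. It assumes the page has already been flattened into a list of (tag signature, text) pairs in document order; the signature format, the function names, and the rule that clustering stops at the first non-matching neighbor are all assumptions made for illustration, not details taken from the patent.

```python
from typing import List, Tuple

# Hypothetical node representation: (tag signature, text content).
Node = Tuple[str, str]


def longest_text_index(nodes: List[Node]) -> int:
    """Index of the node holding the longest text (the seed paragraph)."""
    return max(range(len(nodes)), key=lambda i: len(nodes[i][1]))


def cluster_by_tag(nodes: List[Node], seed: int) -> Tuple[int, int]:
    """Expand backward and forward from the seed while neighbors share its signature.

    Returns the inclusive (start, end) indices of the clustered run.
    """
    signature = nodes[seed][0]
    start = seed
    while start > 0 and nodes[start - 1][0] == signature:
        start -= 1
    end = seed
    while end < len(nodes) - 1 and nodes[end + 1][0] == signature:
        end += 1
    return start, end


# Toy page: three body paragraphs surrounded by navigation and footer text.
nodes = [
    ("div.nav/a", "Home"),
    ("div.content/p", "Short opening paragraph."),
    ("div.content/p", "A much longer paragraph that plays the role of the "
                      "longest string in this toy page."),
    ("div.content/p", "Closing paragraph."),
    ("div.footer/span", "Copyright"),
]

seed = longest_text_index(nodes)
start, end = cluster_by_tag(nodes, seed)
body = [text for _, text in nodes[start:end + 1]]
```

Here the seed is the long middle paragraph, and the search in both directions picks up its two siblings while leaving the navigation and footer nodes out. A production version would likely need a looser similarity test than exact signature equality.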

[0034] A method for extracting the body text of a web page, which extracts the body content of news web pages by searching for the "longest string" to locate landmark nodes, comprises the following steps: I, deleting the negligible tags in the web page and the content within those tags; II, looking ...
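Step I can be sketched with Python's standard `html.parser` module. The set of negligible tags below (`script`, `style`, `noscript`, `iframe`) and the names `IGNORABLE`, `IgnorableStripper`, and `strip_ignorable` are assumptions for illustration; the source does not list which tags the method treats as negligible.

```python
from html.parser import HTMLParser

# Assumed set of negligible tags; the source does not enumerate them.
IGNORABLE = {"script", "style", "noscript", "iframe"}


class IgnorableStripper(HTMLParser):
    """Re-emits the HTML while dropping negligible tags, their content, and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a negligible element

    def handle_starttag(self, tag, attrs):
        if tag in IGNORABLE:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close nothing.
        if tag not in IGNORABLE and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in IGNORABLE:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_comment(self, data):
        pass  # comments are dropped as well


def strip_ignorable(html: str) -> str:
    parser = IgnorableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ('<div><script>track("ad");</script><p>Body text.</p>'
        '<style>p{}</style><!-- c --></div>')
cleaned = strip_ignorable(page)
```

Removing these elements first matters for the later steps, because inline scripts and style sheets can otherwise contain the longest run of characters on the page.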

Abstract

The invention provides a method for extracting the body text of a web page. The method comprises the following steps: I, preprocessing the web page; II, searching for the longest string in the web page; III, building a DOM tree and using it to find the node corresponding to the longest string; IV, determining a start node and an end node according to the tag of the node corresponding to the longest string; V, checking and filtering the start node and the end node; and VI, outputting the text of the filtered start node and the filtered end node. The method overcomes the shortcomings of template-based and block-segmentation techniques in news text extraction, finds seed paragraphs based on the longest string, and improves the efficiency and accuracy of web page text extraction.
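Steps II and III can be sketched as follows, using only Python's standard `html.parser`. Instead of a full DOM tree, this sketch tracks a stack of open tags and records the tag path of every text chunk, which is enough to map the longest string back to its position in the tree. The class name `TextNodeCollector`, the function `longest_string_node`, the `VOID` tag set, and the path format are assumptions for illustration, not the patent's data structures.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class TextNodeCollector(HTMLParser):
    """Records (tag path, text) for every non-empty text chunk in the page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def longest_string_node(html: str):
    """Return the (tag path, text) pair whose text is longest."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return max(collector.nodes, key=lambda node: len(node[1]))


page = (
    "<html><body><div id='nav'><a href='/'>Home</a></div>"
    "<div id='main'><h1>Title</h1>"
    "<p>First paragraph of the article, which is clearly the longest run of text.</p>"
    "<p>Second.</p></div></body></html>"
)
path, text = longest_string_node(page)
```

The returned path identifies the kind of node that holds the seed paragraph, which is the information step IV needs when it looks for the start and end nodes.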

Description

Technical field

[0001] The invention relates to a method in the field of computers, in particular to a method for extracting the body text of news web pages by searching for the "longest string" to locate landmark nodes.

Background

[0002] In the field of news (or information) search, news text extraction is an essential step, and the quality of the extracted text determines the quality of news search and the user experience.

[0003] At present, there are various news text extraction methods, which fall into two categories according to whether templates are used: template-based (or wrapper-based) extraction and non-template-based extraction.

[0004] Template-based extraction: first define the template, and then write a program to parse and execute the template to obtain data. According to how the template is generated, it can be divided into manual template extraction and automatic template extraction. Manual template extraction. For the extracted ta...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/80, G06F40/14
Inventor: 涂波
Owner: 北京中搜云商网络技术有限公司