Title Extraction Method and Device Based on Web Articles

A webpage title extraction technology, applied in the Internet field, that addresses the problems of low extraction efficiency and high cost

Active Publication Date: 2019-06-11
HANGZHOU DT DREAM TECH
9 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] For this reason, the first object of the present invention is to propose a title extraction method based on webpage articles, which addresses the high cost and low extraction efficiency of the prior art.

Embodiment Construction

[0086] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals designate the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary and are intended to explain the present invention and should not be construed as limiting the present invention.

[0087] The method and device for extracting titles based on webpage articles according to the embodiments of the present invention will be described below with reference to the accompanying drawings.

[0088] Figure 1 is a schematic flowchart of a method for extracting titles based on webpage articles provided by an embodiment of the present invention. As shown in Figure 1, the title extraction method based on webpage articles includes the following steps:

[0089] S101. Obtain the webpage code corresponding to the w...

Abstract

The invention provides a title extraction method and device based on a webpage article. The method comprises the following steps: obtaining the webpage code corresponding to the webpage article; constructing a DOM tree according to the rendered webpage code; adjusting the rendered webpage code according to the actual attribute values of elements in each node of the DOM tree; obtaining the leaf nodes in front of the body area in the DOM tree and taking them as title candidate nodes; calculating the feature score of each title candidate node according to the text content features in the title candidate node and the distance between the title candidate node and the body area; determining the title candidate node with the highest feature score as the title node, and determining the text content of the title node as the title of the webpage article. In this way, the title candidate nodes are determined from the position of the body area and the title is determined from the text content features of the candidates, so no wrapper needs to be built, extraction is fully automatic, cost is reduced, and extraction efficiency is increased.
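The candidate selection and scoring steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not say which text features are used or how distance is measured, so the tag bonus, length range, punctuation check, distance weight, and all function names below are assumptions. The body area is taken as already identified, the attribute-adjustment step is not modeled, and a well-formed snippet parsed with Python's `xml.etree.ElementTree` stands in for a rendered DOM tree.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical weights: the source does not specify features or their values.
TAG_BONUS = {"h1": 3.0, "h2": 2.0, "h3": 1.0}
DISTANCE_WEIGHT = 0.1


@dataclass
class Candidate:
    text: str
    tag: str
    distance: int  # document-order steps from the candidate to the body area


def title_candidates(root, body):
    """Collect text-bearing leaf nodes that precede the body area in document order."""
    order = list(root.iter())
    body_pos = order.index(body)
    result = []
    for pos, node in enumerate(order[:body_pos]):
        text = (node.text or "").strip()
        if len(node) == 0 and text:  # leaf node with non-empty text
            result.append(Candidate(text, node.tag, body_pos - pos))
    return result


def feature_score(c):
    """Assumed score: text features minus a penalty for distance from the body."""
    length_ok = 1.0 if 4 <= len(c.text) <= 80 else 0.0
    no_end_punct = 1.0 if c.text[-1] not in ".。!?！？" else 0.0
    text_score = TAG_BONUS.get(c.tag, 0.0) + length_ok + no_end_punct
    return text_score - DISTANCE_WEIGHT * c.distance


def extract_title(root, body):
    """Return the text of the highest-scoring candidate, or None if there is none."""
    candidates = title_candidates(root, body)
    if not candidates:
        return None
    return max(candidates, key=feature_score).text


PAGE = """
<html><body>
  <div id="nav"><a>Home</a><a>News</a></div>
  <h1>City opens new library</h1>
  <span>2019-06-11</span>
  <div id="content"><p>The library opened on Tuesday.</p></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
body = root.find(".//div[@id='content']")  # body area assumed already located
print(extract_title(root, body))
```

In this toy page the navigation links and the date are also candidates, since they are leaf nodes ahead of the body area; the heading wins only because of the assumed tag bonus. Real pages would need features tuned on actual data.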

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a method and device for extracting titles based on webpage articles.

Background art

[0002] There are currently two main methods of web page data extraction. The first requires building a special "wrapper" program to identify the data and convert it into a suitable format, such as XML or associative tables. This method requires the user to have computer and programming background knowledge, and when the format of the data source website changes, the wrapper needs to be modified. The second provides a friendly man-machine interface through which users can quickly create wrappers by clicking on the page, which lowers the threshold for use. However, the biggest problem with this method is that it is very inflexible: when the format of the data source website changes, the wrapper needs to be ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/22
CPC: G06F40/131, G06F40/14
Inventor: 张为
Owner: HANGZHOU DT DREAM TECH