
Method and device for extracting webpage content

A webpage content extraction technology, applied in the information field, that addresses problems such as inaccurate extraction of webpage text, inaccurate extraction of webpage titles, and incomplete extraction of the various elements of webpages.

Status: Inactive
Publication Date: 2013-04-24
盘古文化传播有限公司
5 Cites, 37 Cited by

AI Technical Summary

Problems solved by technology

[0003] However, when the existing technology is used to extract the webpage content, the title of the webpage is not extracted accurately, and the elements of the webpage are not fully extracted, which leads to inaccurate extraction of the text of the webpage.



Examples


Embodiment 1

[0028] An embodiment of the present invention provides a method for extracting web page content. As shown in Figure 1, the method includes:

[0029] Step 101, convert the HTML source code into a corresponding document tree structure, and determine the title of the web page according to the TITLE tag of the document tree structure.

[0030] A document object model (Document Object Model, DOM), which may also be called a document tree structure, may be obtained by parsing the HyperText Markup Language (HTML) source code of a web page. The document tree structure contains a lot of useful information that can be used for analysis and pattern matching. Text blocks can be obtained by analyzing the source code of the document tree structure with SAX. For example, in a web page with a DIV layout, the document tree structure is composed of multiple DIV blocks, and the DIV blocks are text blocks marked with DIV tags. As a container, the DI...
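Step 101 can be illustrated with a minimal sketch. It is not the patent's implementation: it assumes Python's standard `html.parser.HTMLParser` as the event-driven (SAX-style) parser, and the names `Node`, `TreeBuilder`, `build_tree`, `extract_title`, and `div_blocks` are hypothetical helpers introduced only for this example. It also assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class Node:
    """One element of the document tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text that appears directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simple document tree from parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_tree(html):
    """Convert HTML source code into a document tree and return its root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Yield every node with the given tag, in document order."""
    if node.tag == tag:
        yield node
    for child in node.children:
        yield from find_all(child, tag)


def collect_text(node):
    """Concatenate the text of a node and all of its descendants."""
    return node.text + "".join(collect_text(c) for c in node.children)


def extract_title(root):
    """Return the text of the first TITLE tag, or an empty string."""
    for node in find_all(root, "title"):
        return node.text.strip()
    return ""


def div_blocks(root):
    """Return the text of each DIV block (nested DIVs appear separately)."""
    return [collect_text(n).strip() for n in find_all(root, "div")]
```

A caller would first run `build_tree` on the page source, then pass the returned root to `extract_title` and `div_blocks`. A production system would more likely use a full HTML5-compliant parser, since real pages often contain unclosed or misnested tags that this sketch does not repair.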

Embodiment 2

[0139] An embodiment of the present invention provides a device for extracting web page content. As shown in Figure 4, the device includes: a conversion unit 401, a webpage title determination unit 402, a webpage element determination unit 403, a text block attribute determination unit 404, and a webpage full-text acquisition unit 405;

[0140] A conversion unit 401, configured to convert the HTML source code into a corresponding document tree structure;

[0141] A webpage title determining unit 402, configured to determine the webpage title according to the TITLE tag of the document tree structure;

[0142] A webpage element determination unit 403, configured to determine webpage elements in the webpage according to the webpage title, the webpage elements at least including website LOGO, page navigation, news release time, and news sources;

[0143] A text block attribute determination unit 404, configured to determine the attributes of each text block accor...


Abstract

The invention discloses a method and a device for extracting webpage content and relates to the technical field of information. The method and the device can accurately extract the webpage title and the various elements of a webpage when extracting its content. The method includes: converting the HTML (hypertext markup language) source code into a corresponding document tree structure, and determining the webpage title according to the TITLE tag of the document tree structure; determining the webpage elements in the webpage according to the webpage title; determining the attributes of each text block according to the webpage title and the density and word count of the text blocks of the document tree structure; and extracting the webpage title, the webpage elements, and the text blocks whose attribute is webpage body text, to obtain the full text of the webpage. The webpage elements at least include the website LOGO, page navigation, news publishing time, and news sources. The method and the device are applicable to webpage content extraction.

Description

Technical field

[0001] The present invention relates to the field of information technology, and in particular to a method and device for extracting web page content.

Background technique

[0002] In the existing technology, a SAX parser is used to parse the content of the text-area tags in the web page source code (for example, the <hn> tag, among other tags) into multiple text blocks, and preset indicators are calculated for each text block to determine whether the content of that block can be used as the body text. The preset indicators can include the number of words, the hyperlink density, and other indicators. For example, when the link density of the current text block is less than or equal to 0.333333, the link density of the previous text block is less than or equal to 0.555556, the number of words in the current text block is less than or equal to 16, the number of words in the next text block is less than or equal to 14, and, for the previous text block, the number of words in th...
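The indicators described in the background can be sketched as follows. This is an illustration under stated assumptions, not the cited technique itself: link density is assumed to be the number of words inside hyperlinks divided by the total number of words in the block, words are assumed to be whitespace-separated, and `TextBlock`, `block_from_text`, and `visible_conditions_hold` are hypothetical names. Only the four thresholds visible in the text are checked; the remaining conditions and the resulting decision are truncated in the source and are deliberately not reproduced.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Word counts for one text block (hypothetical helper type)."""

    num_words: int         # all words in the block
    num_linked_words: int  # words that sit inside hyperlinks

    @property
    def link_density(self):
        # Assumed definition: linked words / all words; 0.0 for an empty block.
        if self.num_words == 0:
            return 0.0
        return self.num_linked_words / self.num_words


def block_from_text(text, linked_text=""):
    """Build a TextBlock by whitespace word counting (an assumption)."""
    return TextBlock(len(text.split()), len(linked_text.split()))


def visible_conditions_hold(prev, curr, nxt):
    """Check only the four thresholds quoted in the background section.

    The source text is cut off before the remaining conditions and the
    outcome, so this function does not classify the block as body text
    or boilerplate.
    """
    return (
        curr.link_density <= 0.333333
        and prev.link_density <= 0.555556
        and curr.num_words <= 16
        and nxt.num_words <= 14
    )
```

Whitespace splitting does not count words correctly for languages written without spaces, such as Chinese; a character count or a word segmenter would be needed there.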


Application Information

IPC(8): G06F17/27
Inventor: 兰晶徐慎昆
Owner: 盘古文化传播有限公司