Method and device for extracting webpage content

A webpage content extraction technology, applied in the information field, that can solve problems such as inaccurate extraction of webpage titles, incomplete extraction of webpage elements, and inaccurate extraction of webpage text.

Inactive Publication Date: 2013-04-24

盘古文化传播有限公司

5 Cites, 37 Cited by

Problems solved by technology

[0003] However, when existing technology is used to extract webpage content, the webpage title is not extracted accurately and the webpage elements are not fully extracted, which leads to inaccurate extraction of the webpage text.

Examples

Embodiment 1

[0028] An embodiment of the present invention provides a method for extracting web page content. As shown in Figure 1, the method includes:

[0029] Step 101, convert the HTML source code into a corresponding document tree structure, and determine the title of the web page according to the TITLE tag of the document tree structure.

[0030] A document tree structure, also called a document object model (DOM), can be obtained by parsing the HyperText Markup Language (HTML) source code of a web page. The document tree structure contains a lot of useful information that can be used for analysis and pattern matching. Text blocks can be obtained by analyzing the source code of the document tree structure with SAX. For example, in a web page with a DIV layout, the document tree structure is composed of multiple DIV blocks, which are text blocks marked with DIV tags. As a container, the DI...
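Step 101 can be illustrated with a minimal sketch, which is not the patented implementation. It assumes Python's standard `html.parser` module as the event-based (SAX-style) parser. The names `Node`, `TreeBuilder`, `parse_tree`, `find_first`, and `page_title` are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the document tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text fragments found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simple document tree from parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <TITLE> arrives as "title".
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse_tree(html):
    """Convert HTML source code into a document tree and return its root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_first(node, tag):
    """Depth-first search for the first element with the given tag name."""
    if node.tag == tag:
        return node
    for child in node.children:
        found = find_first(child, tag)
        if found is not None:
            return found
    return None


def page_title(root):
    """Return the text of the TITLE element, or None if the page has none."""
    title = find_first(root, "title")
    return " ".join(title.text) if title is not None else None
```

Calling `page_title(parse_tree(source))` on a page's source returns the TITLE text. The same tree can then be walked to collect the DIV blocks described above.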

Embodiment 2

[0139] An embodiment of the present invention provides a device for extracting web page content. As shown in Figure 4, the device includes: a conversion unit 401, a webpage title determination unit 402, a webpage element determination unit 403, a text block attribute determination unit 404, and a webpage full-text acquisition unit 405.

[0140] A conversion unit 401, configured to convert the HTML source code into a corresponding document tree structure;

[0141] A webpage title determining unit 402, configured to determine the webpage title according to the TITLE tag of the document tree structure;

[0142] A webpage element determination unit 403, configured to determine webpage elements in the webpage according to the webpage title, the webpage elements at least including website LOGO, page navigation, news release time, and news sources;

[0143] A text block attribute determination unit 404, configured to determine the attributes of each text block accor...

Abstract

The invention discloses a method and a device for extracting webpage content and relates to the technical field of information. The method and the device can accurately extract the webpage title and the various elements of a webpage when extracting webpage content. The method includes: converting the HTML (hypertext markup language) source code into a corresponding document tree structure, and determining the webpage title according to the TITLE tag of the document tree structure; determining the webpage elements in the webpage according to the webpage title; determining the attributes of each text block according to the webpage title and the density and word count of the text blocks of the document tree structure; and extracting the webpage title, the webpage elements, and the text blocks whose attribute is body text, to acquire the full text of the webpage. The webpage elements at least include the website LOGO, page navigation, news publishing time, and news sources. The method and the device are applicable to webpage content extraction.

Description

Technical field

[0001] The present invention relates to the field of information technology, in particular to a method and device for extracting web page content.

Background technique

[0002] A SAX parser is used to parse the content of text-area tags in the web page source code (such as the <hn> tag and similar tags) into multiple text blocks, and preset indicators are calculated for each text block to determine whether the content of that text block can be used as the body text. The preset indicators can include the number of words, the hyperlink density, and other indicators. For example, the link density of the current text block is less than or equal to 0.333333, and the link density of the previous text block is less than or equal to 0.555556, and the number of words in the current text block is less than or equal to 16, and the number of words in the next text block is less than or equal to 14, and the previous When the number of words in th...
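The two indicators named in the background, word count and hyperlink density, can be sketched as follows. This is an illustration under stated assumptions, not the rule set of the prior art or of the invention. Link density is assumed here to mean the share of a block's words that sit inside `<a>` elements; `BLOCK_TAGS` is a guessed set of block-level tags; words are counted by splitting on whitespace, which would need replacing with a character count for Chinese text. `TextBlock`, `BlockIndicators`, and `text_blocks` are hypothetical names.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed set of tags that start a new text block; the source's own list did not survive.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class TextBlock:
    words: int = 0       # total words in the block
    link_words: int = 0  # words that appear inside <a> elements

    @property
    def link_density(self):
        # Assumed definition: linked words divided by all words in the block.
        return self.link_words / self.words if self.words else 0.0


class BlockIndicators(HTMLParser):
    """Splits a page into text blocks and counts words and linked words per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = TextBlock()
        self._anchor_depth = 0

    def _flush(self):
        # Close the current block, keeping it only if it contains text.
        if self._current.words:
            self.blocks.append(self._current)
        self._current = TextBlock()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.split())  # whitespace word count (an assumption)
        self._current.words += count
        if self._anchor_depth:
            self._current.link_words += count

    def close(self):
        super().close()
        self._flush()


def text_blocks(html):
    """Return the list of TextBlock indicators for an HTML string."""
    parser = BlockIndicators()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Each block's `link_density` and `words` values can then be compared with thresholds such as the 0.333333 quoted above. A navigation bar made up entirely of links has a density of 1.0 and would fail that comparison, while a paragraph with one short link would pass.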

Application Information

IPC(8): G06F17/27

Inventor: 兰晶徐慎昆

Owner: 盘古文化传播有限公司
