Method for extracting core content of webpage based on text-tag density

A core-content extraction method, applied in the Internet and communication fields, that addresses the poor versatility of existing approaches and improves extraction efficiency and accuracy while remaining simple.

Active Publication Date: 2016-10-26

BEIJING FORESTRY UNIVERSITY

4 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

Because a template depends heavily on the specific structure of a web page, it must be reconfigured and relearned whenever that structure changes, so its versatility is poor.

Embodiment Construction

[0040] The present invention will be described in detail below with reference to the drawings and embodiments.

[0041] The invention takes the source code of a webpage as input and outputs the core text of the webpage, including the title, keywords, description, and core content. Its focus is on acquiring the core content of the webpage.

[0042] As shown in Figure 1, the processing of the present invention includes four stages: webpage source code preprocessing, webpage core content range estimation, core content boundary determination, and deletion of remaining tags.

[0043] The present invention is specifically realized through the following technical solutions:

[0044] 1. Webpage source code preprocessing stage

[0045] The preprocessing stage extracts the core elements of the webpage, such as the title, keywords, and description, from the original webpage text, and deletes the tags in the webpage text that tend to interfere with extr...
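
The preprocessing stage described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the paragraph is truncated in this excerpt, so the list of removed tags (`NOISE_BLOCKS`), the helper `_meta`, and the function name `preprocess` are all assumptions. The sketch works on the raw source text with regular expressions and does not build a DOM tree.

```python
import re
from html import unescape

# Assumed noise containers; the excerpt does not list which tags the patent deletes.
NOISE_BLOCKS = ("script", "style", "noscript")


def _meta(html, name):
    """Return the content attribute of <meta name="..."> or '' if absent (hypothetical helper)."""
    for tag in re.findall(r"<meta\b[^>]*>", html, flags=re.I):
        if re.search(r'name\s*=\s*["\']?%s\b' % re.escape(name), tag, flags=re.I):
            m = re.search(r'content\s*=\s*(["\'])(.*?)\1', tag, flags=re.I | re.S)
            if m:
                return unescape(m.group(2)).strip()
    return ""


def preprocess(html):
    """Extract title/keywords/description and return them with the pending text."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, flags=re.I | re.S)
    elements = {
        "title": unescape(m.group(1)).strip() if m else "",
        "keywords": _meta(html, "keywords"),
        "description": _meta(html, "description"),
    }
    # Remove comments, then whole noise blocks, then the <head> section.
    pending = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    for tag in NOISE_BLOCKS + ("head",):
        pending = re.sub(r"<%s\b.*?</%s\s*>" % (tag, tag), "", pending, flags=re.I | re.S)
    return elements, pending
```

Each substitution is a single scan over the text, so the sketch stays in line with the linear processing cost the abstract claims; a production version would need to handle malformed or unclosed tags.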

Abstract

This invention relates to a method for extracting the core content of a webpage based on text-tag density. The method comprises four steps: preprocessing the webpage source code, estimating the range of the core content, determining the boundary of the core content, and deleting residual tags. The preprocessing step extracts core elements such as the title, summary, and description from the original webpage text and deletes the tags unrelated to the core content, yielding a pending text. The range-estimation step determines the general range of the core content of the webpage. The boundary-determination step separately determines the precise start and stop positions of the core content within the webpage text. The final step takes out the core content part and deletes its residual tags, yielding core content that is convenient to analyze and process. The method does not need to analyze the DOM (Document Object Model) structure of the webpage document, places no restriction on the theme or content of the webpage, and has linear complexity. It is applicable to extracting the core content of various kinds of webpages, webpage denoising, and similar applications.
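
To make the last three steps concrete, here is a rough sketch under stated assumptions. The excerpt does not give the patent's density formula or boundary criteria, so this example substitutes a simple stand-in: each line of the pending text is scored as visible characters minus a fixed cost per tag, and the highest-scoring contiguous run of lines is found with Kadane's maximum-subarray algorithm. That collapses range estimation and boundary determination into one linear pass. The names `line_scores`, `best_run`, `extract_core`, and the constant `tag_cost` are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")


def line_scores(pending, tag_cost=10):
    """Score each line: visible characters minus tag_cost per tag (assumed heuristic)."""
    scores = []
    for line in pending.splitlines():
        tags = len(TAG.findall(line))
        text = len(TAG.sub("", line).strip())
        scores.append(text - tag_cost * tags)
    return scores


def best_run(scores):
    """Kadane's algorithm: (start, end) of the max-sum contiguous run, end exclusive."""
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = s, i  # start a new run at this line
        else:
            cur_sum += s               # extend the current run
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract_core(pending, tag_cost=10):
    """Pick the densest run of lines, then delete the residual tags inside it."""
    lines = pending.splitlines()
    start, end = best_run(line_scores(pending, tag_cost))
    text = TAG.sub("", "\n".join(lines[start:end]))
    return "\n".join(l.strip() for l in text.splitlines() if l.strip())
```

Link-heavy lines such as navigation bars have many tags and little text, so they score negative and fall outside the selected run, while long paragraphs with few tags score positive. The value of `tag_cost` is a tuning choice with no basis in the source.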

Description

Technical field

[0001] The invention relates to the technical field of the Internet in the field of communication, in particular to a method with linear complexity for extracting the core content of a webpage text based on text-tag density.

Background

[0002] With the rapid development of the Internet, the World Wide Web (WWW) has become the largest Internet database in the world. How to effectively extract information from the World Wide Web has therefore become a new research direction. This involves collecting, processing, and extracting information from web pages at high speed.

[0003] In reality, however, a web page contains a lot of irrelevant information in addition to the text content related to its topic. This includes logos, advertisements, images, navigation, sidebars, and more. Although this information can assist people browsing the web, it is useless in most cases for many Internet applica...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/95, G06F16/9577

Inventors: 蒋东辰, 闫艺鑫

Owner: BEIJING FORESTRY UNIVERSITY
