Webpage text extracting method based on text tag feature mining

A web page body text extraction technique based on text tag feature mining. It addresses two problems: the inability to adapt to changes in web pages in real time, and high maintenance costs.

Active Publication Date: 2017-01-18

UNIV OF ELECTRONICS SCI & TECH OF CHINA

Cites: 3; Cited by: 12

Problems solved by technology

Traditional template-based web page text extraction requires a manually configured template for each website and cannot adapt in real time to structural changes in web pages, which raises later maintenance costs.

Embodiment Construction

[0073] The technical solution of the present invention will be further described below in conjunction with the accompanying drawings.

[0074] As shown in Figure 1, a web page text extraction method based on text tag feature mining includes the following steps:

[0075] S1. Perform web page tag preprocessing and HTML tag repair;

[0076] The extraction method is built around the body-text tag features of a web page. Because a page contains a large number of useless noise tags, these must be excluded before tag features are extracted: the script tags that carry JavaScript, the style tags that carry the page's style rules, and noscript tags; comment content; useless table and span tags and the li list tags inside them; and other noise tags such as text-formatting tags and line-break tags.
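The noise-tag exclusion in step S1 can be sketched as follows. This is a minimal illustration built on Python's standard `html.parser`, not the patent's implementation. The two tag sets, the class name `NoiseStripper`, and the function `strip_noise` are assumptions made for the example; the text names the categories of noise but not exact tag lists. HTML tag repair and the table, span, and li rule are left out because the text does not say how a "useless" tag is recognized.

```python
from html import escape
from html.parser import HTMLParser

# Assumed tag sets (hypothetical): the source names categories, not exact lists.
DROP_WITH_CONTENT = {"script", "style", "noscript"}   # removed with everything inside
UNWRAP_ONLY = {"b", "i", "u", "em", "strong", "font", "br"}  # tag removed, text kept


class NoiseStripper(HTMLParser):
    """Re-emits HTML while leaving out comments and the noise tags above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_WITH_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in UNWRAP_ONLY:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />: emit once, with no end tag.
        if self.skip_depth == 0 and tag not in UNWRAP_ONLY | DROP_WITH_CONTENT:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag not in UNWRAP_ONLY:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Character references were decoded by the parser; escape them again.
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def strip_noise(html_text):
    parser = NoiseStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


page = (
    "<div><script>var a = 1;</script><style>p{}</style><!-- c -->"
    "<p>Hello <b>world</b></p><noscript><p>no js</p></noscript></div>"
)
print(strip_noise(page))
```

Tracking a depth counter rather than a boolean keeps the sketch correct when one dropped element is nested inside another, for example a style tag inside a noscript tag.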

[0077] According to the requirements of the following text, two situations should be considered in the process o...

Abstract

The invention discloses a web page text extraction method based on text tag feature mining. The method comprises the following steps: S1, preprocessing web page tags and repairing HTML tags; S2, selecting and extracting HTML tag features; S3, clustering and mining the tag features and selecting the text cluster; S4, adjusting the tags in the text cluster empirically; S5, extracting the tag text of the text cluster. The method mines the tags in the web page source code, clusters them with a hierarchical clustering algorithm, extracts the cluster that contains the body-text tags, adjusts the tags in that cluster according to experience, and extracts the text according to the adjusted text cluster features. Compared with other news web page text extraction methods, this method is more general, more accurate, and easy to use, and it needs no special settings for specific web pages.
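Step S3 (hierarchical clustering of tag features, then selection of the text cluster) can be illustrated with a small pure-Python sketch. Everything specific here is an assumption made for the example, because the available text does not define the features, the linkage, or the selection rule: each tag is described by a hypothetical pair (normalized text length, link density), clusters are merged by single linkage until the closest pair is farther apart than a hypothetical `threshold`, and the text cluster is taken to be the one with the largest total text length.

```python
from math import dist


def agglomerative(points, threshold):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `threshold`.
    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def pick_text_cluster(clusters, points):
    """Assumed rule: the text cluster has the largest summed text-length feature."""
    return max(clusters, key=lambda c: sum(points[i][0] for i in c))


# Hypothetical feature vectors: (normalized text length, link density).
tags = ["p", "p", "p", "a", "a", "li"]
features = [(0.9, 0.0), (0.8, 0.05), (0.85, 0.0),
            (0.1, 1.0), (0.05, 1.0), (0.1, 0.9)]

clusters = agglomerative(features, threshold=0.3)
text_cluster = pick_text_cluster(clusters, features)
print(sorted(text_cluster), [tags[i] for i in sorted(text_cluster)])
```

With these made-up vectors the three paragraph-like tags end up in one cluster and the link-heavy tags in another. Real pages would need the features the patent actually selects in step S2.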

Description

Technical field

[0001] The invention belongs to the field of text extraction, and in particular relates to a web page text extraction method based on text tag feature mining.

Background art

[0002] With the rapid development of Web applications, people face the challenge of the "information explosion": information is extremely rich and spreads rapidly, yet knowledge remains scarce. The country's strong push for "Internet +" over the past two years is expected to bring people a better Internet experience. Given the variety of web pages, accurately and quickly extracting the main content of a web page has become an important and meaningful research direction. With technological innovation, the Web has gradually grown into a platform for content production and consumption. Numerous information sources in the form of HTML web pages have formed on the Internet, such as navigation bars, advertisements, recommended links,...

Application Information

IPC(8): G06F17/30

CPC: G06F16/9577

Inventor: 于富财文友枥陈西安袁进吴轶铭申洲汪辉鲁才

Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA
