Webpage text extracting method based on text tag feature mining

A web page body-text extraction technology based on text tag feature mining. It addresses two problems of existing approaches: the inability to adapt to web pages in real time and high maintenance costs.

Active Publication Date: 2017-01-18
UNIV OF ELECTRONICS SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

The traditional template-based web page text extraction not only needs to manually configure the templates of each webs...

Embodiment Construction

[0073] The technical solution of the present invention will be further described below in conjunction with the accompanying drawings.

[0074] As shown in Figure 1, a web page text extraction method based on text tag feature mining includes the following steps:

[0075] S1. Preprocess the web page tags and repair the HTML tags;

[0076] The extraction method is built around the features of the text tags in a web page. Because a page contains a large number of useless noise tags, the following must be removed before tag features are extracted: script tags (JavaScript code), style tags (which govern the page's layout and appearance), and noscript tags; comment tags; useless table and span tags and the li list tags inside them; and other noise tags such as text-formatting tags and line-break tags.
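The noise-tag removal in step S1 can be illustrated with a minimal Python sketch built on the standard library's html.parser. It is not the patent's implementation: the tag sets DROP_WITH_CONTENT and UNWRAP_ONLY, the class NoiseStripper, and the function strip_noise are hypothetical names chosen for illustration, and the sketch covers only some of the categories listed above (script, style, noscript, comments, formatting tags, and line breaks). The table, span, and li filtering and the HTML tag repair are left out.

```python
from html import escape
from html.parser import HTMLParser

# Assumed tag sets: the text names the categories, not an exact tag list.
DROP_WITH_CONTENT = {"script", "style", "noscript"}          # remove tag and contents
UNWRAP_ONLY = {"b", "i", "u", "em", "strong", "font", "br"}  # remove tag, keep text


class NoiseStripper(HTMLParser):
    """Rebuilds the markup while leaving out noise tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in UNWRAP_ONLY:
            parts = []
            for name, value in attrs:
                if value is None:  # boolean attribute such as "hidden"
                    parts.append(f" {name}")
                else:
                    parts.append(f' {name}="{escape(value, quote=True)}"')
            self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in UNWRAP_ONLY:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden: the base class ignores comments,
    # so comment content never reaches the output.


def strip_noise(html_text):
    """Return html_text with the assumed noise tags removed."""
    parser = NoiseStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = ('<div class="a"><script>var x = 1;</script><!-- ad -->'
            '<p>Hello <b>world</b><br></p><style>p{}</style></div>')
    print(strip_noise(page))
```

Streaming through a parser rather than using regular expressions keeps nested cases (for example a script element inside noscript) manageable with a single depth counter.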

[0077] According to the requirements of the following text, two situations should be considered in the process o...

Abstract

The invention discloses a web page text extraction method based on text tag feature mining. The method comprises the following steps: S1, preprocessing the web page tags and repairing the HTML tags; S2, selecting and extracting HTML tag features; S3, clustering the tag features and selecting the text cluster; S4, adjusting the tags in the text cluster empirically; S5, extracting the tag text of the text cluster. The method mines the tags in the web page source code, clusters them with a hierarchical clustering algorithm, extracts the cluster that contains the text tags, adjusts the tags in that cluster according to experience, and extracts the text according to the adjusted cluster features. Compared with other news web page text extraction methods, it offers greater generality, higher accuracy, and ease of use, and requires no special configuration for specific web pages.
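Step S3 (hierarchical clustering of tag features and selection of the text cluster) might look roughly like the following Python sketch. Everything specific here is an assumption made for illustration: the abstract does not say which features are used, which linkage is applied, or how the text cluster is chosen. The sketch uses two hypothetical features per tag (text length and punctuation count), average-linkage agglomerative clustering, and picks the cluster with the largest total text length.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters


def pick_text_cluster(points, n_clusters=2):
    """Assumed selection rule: the cluster with the most text overall."""
    clusters = agglomerate(points, n_clusters)
    return max(clusters, key=lambda c: sum(points[i][0] for i in c))


if __name__ == "__main__":
    # Hypothetical per-tag features: (tag name, text length, punctuation count).
    tags = [("p", 180, 9), ("p", 210, 11), ("a", 12, 0),
            ("li", 8, 0), ("p", 195, 10), ("a", 15, 0)]
    points = [(length, punct) for _, length, punct in tags]
    body = pick_text_cluster(points)
    print([tags[i][0] for i in sorted(body)])
```

In practice the features would likely need normalising so that one large-valued feature does not dominate the distance, and the number of clusters (or a distance cut-off) would have to be chosen for the data at hand.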

Description

Technical field

[0001] The invention belongs to the field of text extraction, and in particular relates to a web page text extraction method based on text tag feature mining.

Background

[0002] With the rapid development of Web applications, people face the challenge of the "information explosion": information is extremely abundant and spreads rapidly, yet useful knowledge remains scarce. The national "Internet +" initiative, strongly promoted over the past two years, is expected to bring people a better Internet experience. Given the variety of web pages, accurately and quickly extracting the main content of a web page has become an important and meaningful research direction. With technological innovation, the Web has gradually grown into a platform for content production and consumption. Numerous information sources in the form of HTML web pages have formed on the Internet, such as navigation bars, advertisements, recommended links,...

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 于富财文友枥陈西安袁进吴轶铭申洲汪辉鲁才
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA