Labeling method and device for web page topics

A webpage topic annotation technology, applied in the field of data processing, that addresses the low accuracy of webpage topic annotation and improves both the efficiency and the accuracy of labeling.

Active Publication Date: 2015-09-02
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a method and device for labeling webpage topics, to solve the prior art's problem of low accuracy in webpage topic labeling.



Examples


Embodiment Construction

[0024] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to illustrate the present invention, but not to limit the present invention.

[0025] This embodiment provides a method for labeling webpage topics. Figure 1 is a flowchart of the method for labeling a webpage topic according to an embodiment of the present invention. The steps of this embodiment are performed for each web page.

[0026] Step S110, based on the title and text of the web page, obtain the topic feature vector of the web page.

[0027] Because the title and the text of a web page differ in length and language style, this embodiment extracts the title and the text of the web page separately, constructs a title feature vector from the title, and constructs a text feature vector from the text...
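As a rough illustration of step S110, the sketch below builds one vector from the title and one from the text, each over the same vocabulary. The source passage is truncated before it says how the two vectors are weighted or combined, so the normalized term-frequency weighting and the simple concatenation used here are assumptions, as are the names `tokenize`, `tf_vector`, `topic_feature_vector`, and the sample vocabulary.

```python
import math
import re
from collections import Counter


def tokenize(s):
    """Lowercase the string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def tf_vector(tokens, vocabulary):
    """L2-normalized term-frequency vector over a fixed vocabulary.

    Normalizing keeps a long body text from dominating a short title
    purely because it contains more words (an assumption, not from the source).
    """
    counts = Counter(tokens)
    vec = [float(counts[term]) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def topic_feature_vector(title, text, vocabulary):
    """Build the title and text vectors separately, then combine them.

    Concatenation is an assumed combination rule; the source does not
    state how the two vectors are merged.
    """
    title_vec = tf_vector(tokenize(title), vocabulary)
    text_vec = tf_vector(tokenize(text), vocabulary)
    return title_vec + text_vec


# Hypothetical vocabulary and page, for illustration only.
vocabulary = ["football", "match", "stock", "market"]
vec = topic_feature_vector(
    "Football match report",
    "The match ended two to one after a late football goal.",
    vocabulary,
)
print(len(vec))
```

Keeping the two halves separate lets a downstream classifier weight title terms differently from body terms, which fits the stated motivation that titles and body text differ in length and style.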


Abstract

The invention discloses a method and device for labeling web page topics. The method comprises: obtaining a topic feature vector for a web page based on its title and body text; classifying the topic feature vector with a classifier trained in advance; and determining whether the topic feature vector belongs to any known type. If it does, the web page is labeled with that type; otherwise, the web page is marked as a web page to be labeled. The web pages to be labeled are then clustered, the type of each cluster is obtained through analysis, and each web page to be labeled is labeled with the type of the cluster it belongs to. By cascading a supervised classification method with an unsupervised clustering method, topics are acquired automatically from web pages and the web pages are labeled, which effectively improves the efficiency and accuracy of web page topic labeling.
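The cascade described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not name the classifier, the rejection rule, the clustering algorithm, or the cluster "analysis", so the nearest-centroid classifier with a cosine threshold, the greedy single-pass clustering, and the pick-the-heaviest-term naming rule are all stand-ins. The function names, thresholds, and sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(vec, centroids, threshold=0.8):
    """Return the best-matching known type, or None if none is similar enough.

    The threshold-based rejection is an assumed way to decide whether
    'a type the vector belongs to exists'.
    """
    best_type, best_sim = None, threshold
    for type_name, centroid in centroids.items():
        sim = cosine(vec, centroid)
        if sim >= best_sim:
            best_type, best_sim = type_name, sim
    return best_type


def cluster(vectors, threshold=0.8):
    """Greedy single-pass clustering (assumed algorithm).

    Each vector joins the first cluster whose seed vector is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def name_cluster(members, vectors, terms):
    """Hypothetical cluster analysis: the term with the largest summed weight."""
    totals = [sum(vectors[i][k] for i in members) for k in range(len(terms))]
    return terms[totals.index(max(totals))]


def label_pages(pages, centroids, terms):
    """Stage 1: supervised classification. Stage 2: cluster the rejected pages."""
    labels, pending = {}, []
    for url, vec in pages.items():
        page_type = classify(vec, centroids)
        if page_type is not None:
            labels[url] = page_type
        else:
            pending.append(url)  # a "web page to be labeled"
    pending_vecs = [pages[url] for url in pending]
    for members in cluster(pending_vecs):
        cluster_type = name_cluster(members, pending_vecs, terms)
        for i in members:
            labels[pending[i]] = cluster_type
    return labels


# Hypothetical data: two known types, plus two pages on an unseen topic.
terms = ["football", "stock", "recipe"]
centroids = {"sports": [1.0, 0.0, 0.0], "finance": [0.0, 1.0, 0.0]}
pages = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.1, 0.9, 0.0],
    "c": [0.0, 0.1, 0.9],
    "d": [0.1, 0.0, 0.8],
}
labels = label_pages(pages, centroids, terms)
print(labels)
```

In this toy run, pages "a" and "b" are accepted by the classifier, while "c" and "d" match no known type, fall through to clustering, and receive a label derived from their shared cluster. That is the point of the cascade: the classifier handles known topics, and clustering handles topics the classifier was never trained on.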

Description

Technical field

[0001] The present invention relates to the technical field of data processing, and in particular to a method and device for labeling webpage topics.

Background

[0002] Extracting and labeling web page topics by analyzing Internet web content is an important basis for applications such as Internet data management and mining. At present, topic labeling of web pages mostly relies on keyword matching: a web page is labeled by matching its title against a set of preset keywords. However, direct matching is too simple. If the keywords in the title of the web page change, the method can no longer label the topic accurately, and the accuracy of web page labeling cannot be guaranteed. Another approach to webpage topic labeling clusters the web pages and extracts keywords from the web pages grouped into one cluster to serve as the label of that cluster...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35, G06F16/374
Inventor: 李扬曦, 杜翠兰, 李睿, 佟玲玲, 翟羽佳, 王晶, 刘洋, 秦韬, 付戈
Owner: NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT