Method and device for identifying webpage categories

A webpage category identification technique, applied in the Internet field, that addresses the lack of methods for identifying webpage categories and makes webpage content easier to extract.

Inactive Publication Date: 2015-07-29
TENCENT TECH (SHENZHEN) CO LTD
Cites: 2, Cited by: 28

AI Technical Summary

Problems solved by technology

[0005] The purpose of the embodiments of the present invention is to provide a webpage category identification method that solves the problem that the prior art lacks such a method and cannot effectively identify webpage categories, thereby facilitating the extraction of webpage content, the analysis of user behavior, and better display of page content in mobile browsers.


Examples

Embodiment 1

[0023] Figure 1 shows the implementation flow of the method for identifying webpage categories provided by the first embodiment of the present invention, detailed as follows:

[0024] In step S101, the page features of the webpage to be identified are acquired.

[0025] Specifically, the webpage to be identified includes a webpage address, page information, and corresponding webpage source code information.

[0026] The page features of the webpage to be identified may be acquired before the webpage content is extracted or, when the terminal is a mobile terminal, before the page content is viewed through a mobile phone browser. Alternatively, when analyzing user behavior, the category of the webpage may be identified before or after the user obtains and views the webpage.

[0027] Specifically, the page features may include one or more of the following features: web page address features, web page title features, secondary navigation features, document object model DOM tr...
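As an illustration of step S101, the following is a minimal sketch of turning a webpage address and its source code into a feature dictionary. The source text only names the feature families (address features, title features, and so on); the concrete measurements used here (`url_depth`, `url_has_digits`, `title_length`) and the function name `extract_page_features` are hypothetical choices made for this example, not the features defined by the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TitleCollector(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_features(url, html):
    """Return a dict of illustrative (assumed) page features."""
    parser = _TitleCollector()
    parser.feed(html)
    path = urlparse(url).path
    segments = [seg for seg in path.split("/") if seg]
    return {
        # Address features (assumed): path depth and presence of digits.
        "url_depth": len(segments),
        "url_has_digits": any(ch.isdigit() for ch in path),
        # Title feature (assumed): length of the stripped title text.
        "title_length": len(parser.title.strip()),
    }
```

A feature dictionary of this shape can then be passed to whatever classifier is used in the later steps.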

Embodiment 2

[0060] Figure 2 shows the implementation flow of a method for identifying webpage categories provided by the second embodiment of the present invention, detailed as follows:

[0061] In step S201, web page samples marked with web page categories are obtained.

[0062] The webpage samples may be marked in advance as text pages or picture text pages by staff, based on experience. The webpage samples used for training may also be marked with other categories as needed, according to the specific content of the webpages.

[0063] In step S202, according to the category of the webpage and the page features of the webpage sample, a decision tree model is obtained through training with a classification regression algorithm.

[0064] As a preferred implementation manner, according to the webpage category and the page features of the webpage samples, a recursive method may be used to divide the samples into multiple sma...
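The training step S202 can be sketched as recursive partitioning of the labeled samples. The example below assumes a CART-style procedure with Gini impurity as the split criterion; the source text is truncated before it names the criterion, so the impurity measure, the `max_depth` stopping rule, and the dictionary layout of the tree are all assumptions of this sketch rather than the patent's method.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of category labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[feature] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, max_depth=5):
    """Recursively divide the samples into smaller subsets until they are pure."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        # Leaf classification node: store the majority category.
        return {"category": Counter(labels).most_common(1)[0][0]}
    feature, threshold = split
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], max_depth - 1),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], max_depth - 1),
    }
```

Here `rows` is a non-empty list of numeric feature dictionaries and `labels` holds the manually marked category of each sample, for example `"text"` or `"picture_text"`.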

Embodiment 3

[0074] Figure 3 shows a structural block diagram of a device for identifying webpage categories provided by the third embodiment of the present invention, detailed as follows:

[0075] The identification device of the webpage category described in the embodiment of the present invention includes:

[0076] A page feature acquiring unit 301, configured to acquire the page feature of the webpage to be identified;

[0077] A page feature loading unit 302, configured to load the page features according to a pre-generated decision tree model, where the decision tree model is generated by training on a plurality of sample webpages whose webpage categories have been determined;

[0078] The traversal search unit 303 is configured to recursively traverse the decision tree model, search for leaf classification nodes of the decision tree corresponding to the page features, and obtain the webpage category of the webpage to be identified from the leaf nodes.

[0079] Specifically,...
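The role of the traversal search unit 303 can be illustrated with a short recursive lookup. This is a sketch under assumptions: the tree is represented as nested dictionaries, each internal node tests one numeric feature against a threshold, and the hand-built `EXAMPLE_TREE`, its feature names, and its thresholds are hypothetical values invented for the example.

```python
# Hypothetical pre-generated decision tree model (values are made up).
EXAMPLE_TREE = {
    "feature": "title_length",
    "threshold": 10,
    "left": {"category": "picture_text"},
    "right": {
        "feature": "url_depth",
        "threshold": 1,
        "left": {"category": "picture_text"},
        "right": {"category": "text"},
    },
}


def classify(node, features):
    """Recursively traverse the tree and return the category at the leaf."""
    if "category" in node:
        # Leaf classification node reached: its category is the result.
        return node["category"]
    # Internal node: compare the page feature with the node's threshold.
    go_left = features[node["feature"]] <= node["threshold"]
    return classify(node["left" if go_left else "right"], features)
```

For instance, `classify(EXAMPLE_TREE, {"title_length": 25, "url_depth": 3})` follows the right branch twice and ends at the `"text"` leaf.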


Abstract

The invention belongs to the field of the Internet and provides a method and a device for identifying webpage categories. The method comprises: obtaining page features of a webpage to be identified; loading the page features according to a pre-generated decision tree model, which is generated by training on a plurality of sample webpages whose webpage categories have been determined; recursively traversing the decision tree model to search for the leaf classification node corresponding to the page features; and obtaining the webpage category of the webpage to be identified from that leaf node. Because the page features are loaded into a decision tree model trained on sample webpages of known categories, the webpage category corresponding to the leaf classification node can be found quickly and effectively. This facilitates the extraction of webpage content, the analysis of user behavior, and better display of page content in a mobile phone browser.

Description

Technical field

[0001] The invention belongs to the field of the Internet, and in particular relates to a method and device for identifying webpage categories.

Background technique

[0002] With the development of the mobile Internet, more and more users use mobile browsers to obtain and read various kinds of information, including text, pictures, video, audio, and the like. Because mobile browsers are easy to use, they bring great convenience to people's lives.

[0003] When a mobile browser is used to browse webpage content, the webpage can be identified in order to extract its content more conveniently, analyze user behavior, and better display the content in the mobile browser. An example is the recognition of text pages and picture text pages (it is agreed that when the proportion of the text content in the webpage to the entire webpage reaches a preset value, such as 60%, the page is judged to be a text page, or it can be determined by the webpage Judging the proportio...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 黄钰
Owner: TENCENT TECH (SHENZHEN) CO LTD