Method and device for determining webpage type

A technique for determining webpage type, applied in the computer field. It addresses the low efficiency, reduced accuracy, and difficulty of content analysis found in existing approaches, improves the efficiency and speed of type determination, and applies to a wide range of webpages.

Active Publication Date: 2013-04-03
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

[0004] Defect 1: The content of the webpage must be downloaded and analyzed. For massive data, this is inefficient and slow.
[0005] Defect 2: To improve their ranking in search engines, many websites artificially add a large number of category keywords to their webpages. This form of cheating greatly reduces the accuracy of determining the types of those webpages.
[0006] Defect 3: Webpages on the network take a large number of different forms, and this variety makes it difficult to analyze webpage content.

Examples

Embodiment 1

[0062] Analysis of user search behavior shows that, after a user submits a query, the webpage clicked in the search results usually reflects the user's needs; conversely, the queries for which a webpage was clicked reflect the type of that webpage. Based on this, the method provided by the invention, as shown in Figure 1, mainly includes the following steps:

[0063] Step 101: Obtain from the search log all queries for which the webpage to be identified was clicked.

[0064] In this embodiment of the present invention, all queries for which the webpage to be identified was clicked are collected. Because these queries reflect the type of the webpage to be identified, its feature vector is determined from them.
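
As an illustration of Step 101, the following is a minimal sketch in Python. It assumes the search log has already been reduced to (query, clicked URL) pairs; that record layout, the function name `queries_by_clicked_url`, and the example URLs are hypothetical and are not specified in the source.

```python
from collections import defaultdict


def queries_by_clicked_url(click_log):
    """Group queries by the URL that was clicked for them.

    click_log is an iterable of (query, clicked_url) pairs. This layout is
    an assumption; the source does not describe the log format.
    """
    grouped = defaultdict(list)
    for query, url in click_log:
        grouped[url].append(query)
    return dict(grouped)


# Hypothetical log entries, for illustration only.
click_log = [
    ("sichuan cuisine recipes", "http://example.com/a"),
    ("common recipes", "http://example.com/a"),
    ("train tickets", "http://example.com/b"),
]

queries = queries_by_clicked_url(click_log)
page_queries = queries["http://example.com/a"]
```

Here `page_queries` holds every query that led to a click on the page to be identified, which is the input from which its feature vector would be built.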

[0065] In addition, usually when a user clicks on a webpage after searching, it is largely influenced by the title of th...

Embodiment 2

[0084] Figure 2 is a flow chart of the method for obtaining the training corpus of a preset type provided in Embodiment 2 of the present invention. As shown in Figure 2, the method for obtaining the training corpus of a given type includes the following steps:

[0085] Step 201: Obtain the seed queries of the type.

[0086] The seed queries only need to fully reflect the needs of this type. Because the number of seed queries does not need to be large (usually a few dozen are enough), they can be configured manually.

[0087] Taking the recipe type as an example, the configured seed queries can be: recipes of home-cooked dishes, recipes of home-cooked dishes, recipes, common recipes, Sichuan cuisine recipes, etc. For ease of understanding, the two seed queries "recipes of home-cooked dishes" and "recipes of home-cooked dishes" are used as examples here.

[0088] Step 202: Obtain the clicked url corresponding to the seed query in the search ...

Embodiment 3

[0100] In this embodiment, the type of the webpage to be identified is determined by calculating the overlap rate between the feature vector of the webpage to be identified and the feature vectors of each preset type.

[0101] In this case, the feature vector of each preset type is obtained from that type's training corpus as follows: determine each n-gram of the training corpus, count the number of occurrences of each n-gram, and determine the weight of each n-gram from its number of occurrences. The weight of an n-gram may be the ratio of its number of occurrences to the total number of occurrences of all n-grams.
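
The weighting described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: it assumes word-level n-grams over whitespace-separated tokens (the original Chinese queries would need word segmentation or character n-grams), and the names `word_ngrams` and `type_feature_vector` are hypothetical.

```python
from collections import Counter


def word_ngrams(query, n):
    """Return the word-level n-grams of a whitespace-tokenized query."""
    tokens = query.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_feature_vector(corpus_queries, ns=(1, 2)):
    """Map each n-gram to (its occurrences) / (occurrences of all n-grams)."""
    counts = Counter()
    for query in corpus_queries:
        for n in ns:
            counts.update(word_ngrams(query, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Tiny hypothetical training corpus for a recipe type, unigrams only.
corpus = ["common recipes", "sichuan cuisine recipes"]
vector = type_feature_vector(corpus, ns=(1,))
```

With this corpus, "recipes" occurs twice among five unigram occurrences, so its weight is 2/5, and the weights of all n-grams sum to 1.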

[0102] When determining the n-grams of the training corpus, n-grams of coarser granularity, or even the entire query, can be used to prevent the ambiguity caused by too fine a granularity; for example, 3-grams or 4-grams can be used.
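
The overlap-rate comparison introduced at the start of this embodiment could look like the sketch below. The text available here does not give the exact definition of the overlap rate, so the definition used (the share of a type's total weight carried by n-grams that also occur for the page) is an assumption, as are the names `overlap_rate` and `classify` and the example vectors.

```python
def overlap_rate(page_ngrams, type_vector):
    """Assumed definition: fraction of the type's weight that is carried by
    n-grams also present in the page's feature set."""
    total = sum(type_vector.values())
    if total == 0:
        return 0.0
    shared = sum(w for gram, w in type_vector.items() if gram in page_ngrams)
    return shared / total


def classify(page_ngrams, type_vectors):
    """Return the best-scoring type and the score of every type."""
    scores = {t: overlap_rate(page_ngrams, v) for t, v in type_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical type vectors that use whole queries as coarse-grained features.
type_vectors = {
    "recipe": {"sichuan cuisine recipes": 0.5, "common recipes": 0.5},
    "travel": {"cheap train tickets": 1.0},
}
page_ngrams = {"common recipes", "recipes"}

best, scores = classify(page_ngrams, type_vectors)
```

In this example the page shares one of the two recipe features and none of the travel features, so the recipe type scores highest. The sketch always returns some type; a practical system would probably also require a minimum score before assigning one.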

[0103...


Abstract

The invention provides a method and a device for determining a webpage type. The method comprises the following steps: 1) obtaining from a search log all queries for which a webpage to be recognized was clicked; 2) determining the n-grams of the queries obtained in step 1 to form a feature vector of the webpage to be recognized, wherein n is one or more preset positive integers; and 3) determining the type of the webpage to be recognized based on the correlation between its feature vector and the feature vectors of preset types. The method and the device improve the efficiency and speed of determining the webpage type, resist cheating well, and have a wide scope of application.

Description

[Technical field]

[0001] The invention relates to the field of computer technology, in particular to a method and device for determining the type of a webpage.

[Background]

[0002] With the rapid development of network technology and the continuous enrichment of network information, users have become accustomed to obtaining the information they care about from the network through search engines. In search engine technology, demand analysis, search result ranking, and personalized search may all involve determining the type of a webpage. For example, in demand analysis, the search demand of a query can be determined by analyzing the types of the clicked webpages corresponding to the query in the search log; in search result ranking, the ranking in the search results is determined according to the consistency between the webpage type and the search demand of the query; in personalized search, by analyzing the types of web pages clicked an...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 黄际洲
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD