Method and system for language classification of sites

A site language classification technique, applied in instruments, network data indexing, and other database retrieval, that addresses problems such as the inability to identify languages at the site level.

Active Publication Date: 2015-04-29
NEW FOUNDER HLDG DEV LLC +2

AI Technical Summary

Problems solved by technology

[0006] The present invention provides a method and system for website language classification to solve the


Examples


Embodiment Construction

[0051] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

[0052] Embodiments of the present invention first propose a method for classifying site languages which, referring to Figure 1, includes:

[0053] Step 101: For each language, search using that language's preset search terms to obtain all page links corresponding to the language.

[0054] Step 102: Classify all page links according to the link addresses of...
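The grouping idea in Step 102 can be illustrated with a minimal sketch. It assumes that a "site" is identified by the host name of the link address, which the text does not state explicitly; the function name `group_links_by_site` is hypothetical and not taken from the patent.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_links_by_site(links):
    """Group page links by host name (assumption: one host = one site)."""
    sites = defaultdict(list)
    for link in links:
        # urlsplit(...).hostname is lowercased, or None when no host is present
        host = urlsplit(link).hostname
        if host:
            sites[host].append(link)
    return dict(sites)


if __name__ == "__main__":
    example_links = [
        "http://example.com/a",
        "http://example.com/b",
        "https://news.example.org/x",
    ]
    for site, pages in group_links_by_site(example_links).items():
        print(site, len(pages))
```

Links without a recognizable host are skipped here rather than assigned to a site; a real system might instead log or normalize them.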



Abstract

The invention provides a method and a system for language classification of sites. The method comprises the following steps: for each language, searching with the preset search terms of that language to obtain all webpage links corresponding to the language; classifying the webpage links according to their link addresses, wherein each class corresponds to one site; sampling some of the webpage links from the class corresponding to each site to form a sample set; generating a training model for the language according to the number and the language information of the webpage links in the sample set; classifying the webpage link set of the webpage resources to be detected by site to obtain each site to be detected; and obtaining a predicted language value for each site to be detected according to the language training model. Building on existing single-page language recognition technology, the invention provides a reasonable and efficient method for language classification of sites; the system framework is simple and easy to maintain, meeting the requirements of modern search engine technology.
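The sampling and site-level prediction steps can be sketched roughly as follows. This is not the patent's training model, whose details are not given in this excerpt; a simple majority vote over per-page language labels is swapped in as a stand-in. The function names, the `min_share` threshold, and the fixed random seed are all assumptions made for illustration, and the per-page labels are assumed to come from an existing single-page language recognizer.

```python
import random
from collections import Counter


def sample_links(site_links, k, seed=0):
    """Sample up to k links per site (hypothetical helper, seeded for repeatability)."""
    rng = random.Random(seed)
    return {
        site: rng.sample(links, min(k, len(links)))
        for site, links in site_links.items()
    }


def predict_site_language(page_languages, min_share=0.5):
    """Majority vote over page-level labels; a stand-in for the patent's model.

    Returns (language or None, share of the most common language).
    """
    if not page_languages:
        return None, 0.0
    lang, count = Counter(page_languages).most_common(1)[0]
    share = count / len(page_languages)
    # Only commit to a language when it covers at least min_share of the sample
    return (lang if share >= min_share else None), share


if __name__ == "__main__":
    labels = ["zh", "zh", "en", "zh"]  # assumed output of a page-level recognizer
    print(predict_site_language(labels))
```

Returning the share alongside the label lets a caller treat low-agreement sites (for example, multilingual sites) differently from sites dominated by one language.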

Description

Technical field

[0001] The invention relates to the technical field of computer Internet, in particular to a method and system for classifying the languages of websites.

Background technique

[0002] In modern search engine technology, site language provides important guidance for crawling and processing search engine resources. First, site language information is used for crawler scheduling: it can control the crawling load for resources in different languages, improve network bandwidth use and crawling efficiency, and allow resources in a target language to be collected to enrich the search results shown for that language. Second, site language information can also be used to guide the deletion and blocking of spam resources.

[0003] In the prior art, for a single webpage, there is already a relatively mature method of crawling webpage resources through a crawler system to complete the language identification of that single page. A site is a col...


Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 甘文杰, 于晓明, 杨建武, 张涛
Owner: NEW FOUNDER HLDG DEV LLC