Method and device for automatically constructing classification rules across languages

An automatic-construction, cross-language technology, applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc. It addresses the high construction cost and unbearable workload of building rules manually, and has the effect of reducing labor cost and workload.

Active Publication Date: 2014-02-12
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

However, the cost of manually constructing preliminary filtering rules is high. If there are many target languages, the construction cost multiplies and the workload becomes unbearable. Similar problems may exist for document classification rules other than preliminary filtering rules.



Examples


Embodiment 1

[0046] The existing preliminary filtering rules mainly include two types. One is the D rule, which is used to filter pages out: when the features of a page match the rule, the page is filtered out and does not enter the subsequent classifier stage. The other is the C rule, which is used to retain pages: when the features of a page hit the rule, the page is retained and enters the subsequent classifier stage, and a page that does not hit any rule is filtered out. Whichever type it is, a preliminary filtering rule can usually be regarded as a feature judgment expression, in which each judgment condition is one of the following two kinds: whether a certain feature contains a certain keyword, or whether the value of a certain feature is greater than (or less than) a certain value. The judgment conditions are related by "and" or "or", and parentheses may be used in the expression to change the priority of the logical operations. In any case, a feature jud...
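The feature judgment expression described in [0046] can be illustrated with a minimal Python sketch. The class names (`Contains`, `Compare`, `And`, `Or`), the helper functions, and the representation of a page as a dictionary of features are assumptions made for illustration only; the patent text does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contains:
    """Condition: the named feature contains the keyword (hypothetical type)."""
    feature: str
    keyword: str


@dataclass(frozen=True)
class Compare:
    """Condition: the named feature's value is greater/less than a value."""
    feature: str
    op: str  # ">" or "<"
    value: float


@dataclass(frozen=True)
class And:
    parts: tuple


@dataclass(frozen=True)
class Or:
    parts: tuple


def matches(expr, page):
    """Evaluate a feature judgment expression against a page's features."""
    if isinstance(expr, Contains):
        return expr.keyword in page.get(expr.feature, "")
    if isinstance(expr, Compare):
        value = page.get(expr.feature, 0)
        return value > expr.value if expr.op == ">" else value < expr.value
    if isinstance(expr, And):
        return all(matches(p, page) for p in expr.parts)
    if isinstance(expr, Or):
        return any(matches(p, page) for p in expr.parts)
    raise TypeError(f"unsupported expression: {expr!r}")


def kept_by_d_rules(page, d_rules):
    """D rules filter pages out: a page survives only if no D rule matches."""
    return not any(matches(rule, page) for rule in d_rules)


def kept_by_c_rules(page, c_rules):
    """C rules retain pages: a page survives only if some C rule matches."""
    return any(matches(rule, page) for rule in c_rules)
```

Nesting `And` and `Or` objects plays the role of the parentheses in the expression: `And((a, Or((b, c))))` corresponds to `a and (b or c)`.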

Embodiment 2

[0133] Figure 3 is a structural diagram of the device for automatically constructing classification rules across languages provided by Embodiment 2 of the present invention. As shown in Figure 3, the device may include: a rule transformation unit 300, a keyword determination unit 310, a candidate word determination unit 320, a candidate word selection unit 330 and a rule replacement unit 340.

[0134] The rule transformation unit 300 is configured to transform the classification rules of the source language to obtain one or more AND relationship rules, and to provide each AND relationship rule in turn, as the current AND relationship rule, to the keyword determination unit 310.

[0135] Specifically, by parsing the rule expression and applying the distributive law of logical operations, a rule can first be transformed into disjunctive normal form, and the disjunctive normal form can then be split into several AND relationship rules.
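The transformation in [0135] can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a rule is already parsed into nested tuples such as `("and", "A", ("or", "B", "C"))`, with plain strings standing for atomic judgment conditions, and it does not handle negation. The function name `to_dnf` is hypothetical.

```python
from itertools import product


def to_dnf(expr):
    """Return the rule in disjunctive normal form as a list of AND rules.

    Each AND rule is a list of atomic conditions; the outer list is their OR.
    """
    if isinstance(expr, str):
        # An atomic condition is already a single AND rule with one condition.
        return [[expr]]
    op, *args = expr
    branches = [to_dnf(arg) for arg in args]
    if op == "or":
        # OR of DNFs: concatenate the AND rules of every branch.
        return [conj for branch in branches for conj in branch]
    if op == "and":
        # AND of DNFs: distribute AND over OR by taking one AND rule
        # from each branch and merging their conditions.
        result = []
        for combo in product(*branches):
            merged = []
            for conj in combo:
                for atom in conj:
                    if atom not in merged:
                        merged.append(atom)
            result.append(merged)
        return result
    raise ValueError(f"unknown operator: {op!r}")
```

For example, `to_dnf(("and", "A", ("or", "B", "C")))` yields the two AND relationship rules `["A", "B"]` and `["A", "C"]`, each of which can then be processed independently.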

[0136] The keyword determination unit 310 is...


Abstract

The invention provides a method and a device for automatically constructing classification rules across languages. A classification rule of the source language is converted into one or more AND relationship rules, and the following steps are executed for each AND relationship rule: determining the keyword of each judgment condition in the current AND relationship rule; determining a target-language candidate word set for each keyword, the set including the target-language translations of the keyword, target-language words whose source-language translation contains the keyword as a substring, and the keyword itself; selecting, from each candidate word set, the candidate words whose document coverage meets a preset requirement as the target-language keywords corresponding to that keyword; and replacing each keyword in the current AND relationship rule with its target-language keywords, joined by an OR relationship, to obtain the AND relationship rule of the target language. With this method and device, classification rules only need to be constructed for one language, so labor cost and workload can be greatly reduced.
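The selection and replacement steps in the abstract can be sketched as below. Several things here are assumptions rather than details from the patent: the candidate word sets are supplied as ready-made input, "document coverage" is taken to be the fraction of target-language documents containing the word, the "preset requirement" is modeled as a simple minimum threshold, and all function names are hypothetical.

```python
def coverage(word, documents):
    """Fraction of documents containing the word (assumed coverage measure)."""
    if not documents:
        return 0.0
    return sum(word in doc for doc in documents) / len(documents)


def select_target_keywords(candidates, documents, min_coverage):
    """Keep the candidates whose coverage meets the assumed threshold."""
    return [w for w in candidates if coverage(w, documents) >= min_coverage]


def translate_and_rule(and_rule, candidate_sets, documents, min_coverage=0.1):
    """Build the target-language AND rule from a source-language AND rule.

    and_rule: list of (feature, keyword) judgment conditions, all ANDed.
    candidate_sets: keyword -> list of target-language candidate words.
    Returns a list of (feature, [target keywords]); the words inside each
    list are alternatives joined by OR, and the list items are ANDed.
    """
    target_rule = []
    for feature, keyword in and_rule:
        chosen = select_target_keywords(
            candidate_sets[keyword], documents, min_coverage
        )
        target_rule.append((feature, chosen))
    return target_rule
```

A condition whose candidates all fall below the threshold ends up with an empty list of alternatives in this sketch; how such a case should be handled is not stated in the text shown here.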

Description

【Technical field】

[0001] The invention relates to the technical field of computer applications, in particular to a method and device for automatically constructing classification rules across languages.

【Background art】

[0002] With the explosive growth in the number of webpages on the Internet, the need to quickly and accurately find the information users are interested in among massive numbers of webpages has led to text classification technology being applied in the field of information retrieval. Webpage classification is mainly done through machine learning models. Before classification based on machine learning models, it is first necessary to use preliminary filtering rules to eliminate webpages that are obviously not of the target type, so as to reduce the difficulty of classification and improve the classification effect. When classifying webpages online, every webpage first goes through the preliminary filtering, and the webpages that pass the preliminary filtering are entered into the clas...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35, G06F16/9535
Inventor: 刘峰, 牛正雨
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD