Industry dictionary generating method and device

An industry dictionary technology, applied to industry dictionary generation methods and devices, that addresses the time-consuming, laborious, and costly manual compilation of industry dictionaries, reducing production cost and improving efficiency.

Active Publication Date: 2011-08-31
QUNAR CAYMAN ISLANDS

AI Technical Summary

Problems solved by technology

From the above analysis, it can be seen that domain ontology library construction relies mainly on manually set rules or on training over a large-scale corpus. Manually set rules are fixed, so their recall rate is relatively low, while corpus training requires preparing a large corpus, which is time-consuming and labor-intensive.
In addition, the construction technology of domain ontology library also needs to establish the interconnection be



Examples


Embodiment approach

[0033] Specifically, an implementation manner of step 12 includes:

[0034] Step 121, preprocessing the document collection to obtain a word sequence collection;

[0035] Here, preprocessing mainly means performing word segmentation on each document in the document collection to obtain a series of words. English text has spaces between words that act as natural delimiters, but Chinese text has no explicit delimiter between words. So that the industry dictionary generation device can process Chinese documents automatically, each document must first be segmented into a series of words. The word segmentation may use a dictionary-based method or a statistics-based method. Since the accuracy of word segmentation has a certain impact on the quality of the final industr...
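As an illustration of the dictionary-based option mentioned in [0035], the following is a minimal sketch of forward maximum matching, one common dictionary-based segmentation technique. The patent text shown here does not say which algorithm is used, so the function name `forward_max_match`, the `max_len` parameter, and the sample dictionary are all assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary-based segmentation (hypothetical helper).

    At each position, take the longest substring (up to max_len characters)
    that appears in the dictionary; fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Illustrative dictionary and sentence, not taken from the patent.
sample_dictionary = {"机票", "预订", "酒店"}
print(forward_max_match("机票预订和酒店", sample_dictionary))
```

A statistics-based segmenter would replace the dictionary lookup with a learned model; the greedy version above is only meant to show what "a series of words" looks like for unsegmented Chinese text.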

Specific embodiment approach

[0054] Further, a specific implementation of step 13, which obtains the relevant candidate terms, includes:

[0055] Step 131: the industry dictionary generation device uses a statistical algorithm such as the chi-square test or information gain to calculate the correlation between each candidate term and the industry category to which it belongs; the chi-square test is preferred.
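For the information gain alternative named in step 131, a minimal sketch follows. It assumes a binary setup with document counts `a`, `b`, `c`, `d` (documents in the category containing the term, outside the category containing the term, in the category without the term, and outside the category without the term). These names and this two-class framing are assumptions for illustration; the patent text shown here does not specify how information gain is computed.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(a, b, c, d):
    """Information gain of a term with respect to a binary industry category.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents without the term
    d: out-of-category documents without the term
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_category = entropy([a + c, b + d])   # uncertainty about the category
    h_given_term = entropy([a, b])         # uncertainty when the term is present
    h_given_no_term = entropy([c, d])      # uncertainty when the term is absent
    return (h_category
            - ((a + b) / n) * h_given_term
            - ((c + d) / n) * h_given_no_term)
```

A term whose presence fully determines the category yields the maximum gain, and a term distributed identically inside and outside the category yields zero.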

[0056] The principle of the chi-square test is as follows: first assume that the two variables are independent (the null hypothesis), then observe the deviation between the observed values and the values expected under that hypothesis to decide whether the hypothesis holds. If the deviation is small enough, it is attributed to sampling error and the null hypothesis is accepted, that is, the two variables are considered independent; otherwise the null hypothesis is rejected, that is, the two variables are considered correlated. On the issue of calculating the correlation between candidate te...
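The source text is cut off before it gives its own formula, so the following is only a sketch of the standard chi-square statistic for a 2x2 contingency table, as commonly used for term and category correlation. The count names `a`, `b`, `c`, `d` and the function name `chi_square` are assumptions for this example.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term and a category from a 2x2 table.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents without the term
    d: out-of-category documents without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so no deviation can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Term concentrated in the category: large deviation from independence.
print(chi_square(40, 10, 10, 40))  # 36.0
# Term spread evenly: no deviation from independence.
print(chi_square(25, 25, 25, 25))  # 0.0
```

A larger value means a larger deviation from the independence assumption, so candidate terms could be ranked by this score or kept when it exceeds a chosen threshold; the threshold itself is not given in the text shown here.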



Abstract

The invention provides a method and a device for generating an industry dictionary. The method comprises: acquiring a document collection corresponding to initial industry terms; acquiring candidate terms from the document collection; performing industry relevance analysis on the candidate terms to obtain relevant candidate terms; performing co-occurrence analysis and association relation mining on the relevant candidate terms to generate industry vocabulary; and adding the industry vocabulary to the industry dictionary. With this technical scheme, industry dictionaries can be generated, which addresses problems in the prior art such as the high cost and low efficiency of having workers search for industry vocabulary.
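To make the co-occurrence analysis step more concrete, here is a minimal sketch that counts how often pairs of relevant candidate terms appear in the same document. The abstract does not describe the actual procedure, so document-level counting, the function name `cooccurrence_counts`, and the sample data are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count document-level co-occurrences for pairs of candidate terms.

    documents: iterable of word sequences (each document already segmented)
    terms: the relevant candidate terms to consider
    Returns a Counter keyed by alphabetically ordered term pairs.
    """
    term_set = set(terms)
    counts = Counter()
    for words in documents:
        # Keep each candidate term once per document, in a stable order.
        present = sorted(term_set.intersection(words))
        counts.update(combinations(present, 2))
    return counts


# Illustrative data, not taken from the patent.
docs = [["hotel", "flight", "visa"], ["hotel", "flight"], ["visa", "weather"]]
pairs = cooccurrence_counts(docs, ["hotel", "flight", "visa"])
print(pairs[("flight", "hotel")])  # 2
```

Pairs with high counts are candidates for the association relation mining that the abstract names as the next step.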

Description

Technical field

[0001] The invention relates to data mining technology, in particular to a method and device for generating an industry dictionary.

Background technique

[0002] An industry dictionary is a collection of the terms and idioms of a given industry, represented in the smallest language units, for example a machinery industry dictionary or a tourism industry dictionary. In the prior art, technologies similar to industry dictionaries include text classification feature selection technology and domain ontology (Domain Ontology) library construction technology.

[0003] Text classification feature selection is an important method for reducing the dimensionality of the feature space in a text classification system. It first segments the text in the training set, then counts the frequency of words in the training set, and then uses a feature selection algorithm to select some words as the features used to train the classifier. Among them, co...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 何伟平, 王名悠, 吴永强
Owner QUNAR CAYMAN ISLANDS