Classified corpus establishing method and system and server provided with system

A method for constructing a classification corpus, applied in the field of natural language processing. It addresses the difficulty of classifying text accurately when categories are numerous, and it reduces subjective human influence, shortens construction time, and lowers the degree of manual participation.

Active Publication Date: 2016-12-07
SHANGHAI ADVANCED RES INST CHINESE ACADEMY OF SCI

AI Technical Summary

Problems solved by technology

When there are more than 1000 text categories, even domain experts cannot accurately classify the text
Therefore, th


Examples

Embodiment 1

[0050] This embodiment provides a method for constructing a classification corpus, comprising the following steps:

[0051] Obtain the target data to be classified, and obtain category description data according to actual needs;

[0052] Calculate the text similarity between the target data to be classified and the determined category description data, and select the text similarity calculation method that yields the maximum accuracy;

[0053] Use the selected text similarity calculation method to calculate the similarity between the target data to be classified and the determined category description data, and assign each item of target data to the category with the maximum calculated similarity;

[0054] Perform deep matching on the classified target data and the determined category description data to obtain a first classifi...
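Steps [0052] and [0053] can be illustrated with a minimal sketch. The excerpt does not name the candidate similarity measures or say how accuracy is measured, so everything here is an assumption made for illustration: the two measures (`jaccard`, `cosine`), the small hand-labeled sample `LABELED` used to score them, the toy `CATEGORIES` descriptions, and all function names.

```python
import math
from collections import Counter

def tokens(text):
    # Whitespace tokenizer; real recruitment text would need proper word segmentation.
    return text.lower().split()

def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def classify(text, categories, sim):
    # Step [0053]: assign the text to the category with the maximum similarity.
    return max(((name, sim(text, desc)) for name, desc in categories.items()),
               key=lambda pair: pair[1])

def accuracy(sim, labeled, categories):
    hits = sum(classify(text, categories, sim)[0] == label for text, label in labeled)
    return hits / len(labeled)

def select_method(methods, labeled, categories):
    # Step [0052]: keep the similarity method with the highest accuracy.
    return max(methods, key=lambda name: accuracy(methods[name], labeled, categories))

# Hypothetical category descriptions and labeled sample.
CATEGORIES = {
    "software engineer": "design develop test software systems code",
    "accountant": "prepare financial statements audit accounts tax",
}
LABELED = [
    ("develop and test backend software code", "software engineer"),
    ("audit company accounts and prepare tax filings", "accountant"),
]
METHODS = {"jaccard": jaccard, "cosine": cosine}

best = select_method(METHODS, LABELED, CATEGORIES)
label, score = classify("write code for embedded software systems", CATEGORIES, METHODS[best])
print(best, label, round(score, 3))
```

Keeping the measures in a dictionary means further candidates can be compared by adding one entry, without changing the selection logic.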

Embodiment 2

[0071] Refer to Figure 2, a schematic flowchart of a method for constructing a classification corpus in another embodiment. As shown in Figure 2, the method specifically includes the following steps:

[0072] S1', obtain the target data to be classified through the web crawler system.

[0073] For example, the recruitment information of all domestic listed companies published on 51job, Zhaopin, ChinaHR and Liepin from August 2014 to August 2015 is obtained through the web crawler system. This recruitment information is then the target data to be classified.

[0074] S2', define the classification system according to actual needs to obtain category description data. In this embodiment, the "Occupational Classification Code of the People's Republic of China" is used as the basis for classification. There ar...

Embodiment 3

[0086] This embodiment provides a system 1 for constructing a classification corpus. Refer to Figure 3, a schematic diagram of the principle structure of the system in one embodiment. As shown in Figure 3, the system 1 includes: a data acquisition module 10, a category acquisition module 11, a first processing module 12, a first classification module 13, a second processing module 14, a selection module 15, a second classification module 16, a third processing module 17, a determination module 18, a third classification module 19, and a testing module 20.

[0087] The data acquisition module 10 is used to acquire target data to be classified through a web crawler system.

[0088] For example, the recruitment information of all domestic listed companies published on 51job, Zhaopin, ChinaHR and Liepin from August 2014 to August 2015 was obtained through the web crawler...

Abstract

The invention provides a method and system for establishing a classified corpus, and a server provided with the system. The method comprises: acquiring target data to be classified and acquiring category description data according to actual needs; selecting the text similarity calculation method that yields the maximum accuracy; classifying each item of target data into the category with the maximum similarity; filling the target data whose first classification matching degree falls within a first similarity range into a preset primary corpus; classifying the remaining target data with a selected, trained classifier; filling the target data whose second classification matching degree falls within a second similarity range into the preset primary corpus; and taking the preset primary corpus as the final corpus once it can no longer be enlarged. This reduces the cost of corpus establishment, the degree of manual intervention, and the time needed to establish the corpus.
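The iterative filling described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the bag-of-words `train` function is a toy stand-in for both the similarity matching and the trained classifier, and the names `build_corpus`, `first_range` and `second_range`, the threshold values, and the sample data are all hypothetical.

```python
from collections import defaultdict

def overlap(text, bag):
    """Fraction of the text's tokens found in a category's bag of words."""
    toks = text.lower().split()
    return sum(t in bag for t in toks) / len(toks) if toks else 0.0

def train(labeled):
    """Toy stand-in for a trained classifier: one bag of words per label."""
    bags = defaultdict(set)
    for text, label in labeled:
        bags[label].update(text.lower().split())

    def predict(text):
        # Return (label, score) for the best-scoring label.
        return max(((lbl, overlap(text, bag)) for lbl, bag in bags.items()),
                   key=lambda pair: pair[1])
    return predict

def in_range(score, bounds):
    low, high = bounds
    return low <= score <= high

def build_corpus(targets, categories, first_range=(0.5, 1.0), second_range=(0.6, 1.0)):
    corpus, remaining = [], []
    # First pass: match each target against the category descriptions.
    match = train([(desc, name) for name, desc in categories.items()])
    for text in targets:
        label, score = match(text)
        if in_range(score, first_range):
            corpus.append((text, label))
        else:
            remaining.append(text)
    # Later passes: retrain on the corpus, classify the rest, stop when nothing is added.
    while corpus and remaining:
        predict = train(corpus)
        added, still = [], []
        for text in remaining:
            label, score = predict(text)
            if in_range(score, second_range):
                added.append((text, label))
            else:
                still.append(text)
        if not added:
            break  # the corpus can no longer be enlarged
        corpus.extend(added)
        remaining = still
    return corpus, remaining

categories = {"software engineer": "develop software code", "accountant": "audit accounts tax"}
targets = ["develop software code python", "audit tax accounts ledger",
           "python code review", "code review notes", "gardening tips"]
corpus, remaining = build_corpus(targets, categories)
print(len(corpus), remaining)
```

In this toy run, "code review notes" only enters the corpus on the second retraining round, after "python code review" has widened the vocabulary of its category. That is the sense in which the corpus grows until no further item qualifies.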

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing and relates to a construction method and system, in particular to a method and system for constructing a classification corpus and a server provided with the system.

Background

[0002] In recent years, network technology has developed rapidly, and Internet data has become people's main source of information because it is updated quickly, covers a wide range, and is easy to access. According to statistics, the vast majority of network data exists in the form of text. How to use natural language processing technology to classify this text, so that users can find useful information more accurately and quickly, has become an important research question in the field of artificial intelligence. In response to this demand, a number of technologies with great practical value have emerged, such as information retrieval, data mining, and public opinion monit...

Application Information

IPC(8): G06F17/30
CPC: G06F16/355, G06F16/36
Inventor: 徐浩煜, 谷重阳, 封松林, 周晗, 李明齐
Owner: SHANGHAI ADVANCED RES INST CHINESE ACADEMY OF SCI