Multi-scale difficulty vector classification method for graded reading materials

A classification method and multi-scale technology, applied in the fields of text database clustering/classification, special data processing applications, instruments, etc. It addresses problems such as time-consuming feature construction, limited feature scope, and insufficient sentence information extraction, and achieves enhanced generalization, fast training, and a rich difficulty feature representation.

Active Publication Date: 2020-01-24
SOUTH CHINA UNIV OF TECH
Cites: 5; Cited by: 8

AI Technical Summary

Problems solved by technology

The current problem in this field is that obtaining rich sentence features requires a lot of time for feature construction and model learning. Most of the features used are limited to the vocabulary and syntax levels, and sentence information is not extracted comprehensively enough.



Embodiment

[0032] Figure 1 is a flowchart of the present invention. As shown in Figure 1, the multi-scale difficulty vector classification method for graded readers disclosed in this embodiment includes the following steps: data cleaning, sentence segmentation, word segmentation, word-level feature extraction, sentence-level feature extraction, multi-scale feature extraction, feature concatenation, GBDT model training, and feature importance analysis. The steps are as follows:

[0033] T1. Clean the original text data, which is in web HTML format, in advance, and then segment each sample into sentences. Chinese text can be segmented using the jieba tool, though the method is not limited to it. English data is taken as the example here. In the sentence and word segmentation layer at the bottom of Figure 2, the sentence abbreviated as "And it was... said" stands for the text "'And it was only 10 rubles for all this,' she said. 'I'm taking it back for the girls at the factory to try.'" After removing the HTML tags, it was divided into tw...
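The cleaning and segmentation in step T1 can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the tooling used in the patent: the regular expressions, the function names clean_html, split_sentences and tokenize, and the HTML wrapper around the sample text are all assumptions made for the example.

```python
import html
import re

# Assumed rule: anything between angle brackets is an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Closing quote characters that may follow sentence-final punctuation.
CLOSERS = "'\"\u2019\u201d"
# Assumed rule: a sentence ends at ., ! or ? (optionally followed by a
# closing quote) when whitespace comes next.
SENT_END_RE = re.compile(r"(?<=[.!?])\s+|(?<=[.!?][" + CLOSERS + r"])\s+")
# A word (optionally with an internal apostrophe) or one punctuation mark.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['\u2019][A-Za-z]+)?|[^\sA-Za-z0-9]")


def clean_html(raw: str) -> str:
    """Strip tags, decode entities, and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list:
    """Split cleaned text into sentences using the assumed boundary rule."""
    return [s for s in SENT_END_RE.split(text) if s]


def tokenize(sentence: str) -> list:
    """Split one sentence into word and punctuation tokens."""
    return WORD_RE.findall(sentence)


# Hypothetical HTML wrapper around the example sentence from the text.
SAMPLE = ("<p>'And it was only 10 rubles for all this,' she said.<br/>"
          "'I'm taking it back for the girls at the factory to try.'</p>")

sentences = split_sentences(clean_html(SAMPLE))
for sentence in sentences:
    print(sentence)
    print(tokenize(sentence))
```

A rule-based splitter like this misfires on abbreviations such as "Dr." and a dedicated tokenizer would normally be preferred. For Chinese text, the word segmentation step would be handled by a tool such as jieba instead.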


Abstract

The invention discloses a multi-scale difficulty vector classification method for graded reading materials. The method first constructs word matching features, context features, topic features, and the like to enrich the feature representation. These are combined with the most prominent features from previous research to obtain a lightweight yet comprehensive sentence difficulty vector, which is then input into a classifier such as a GBDT (gradient boosting decision tree). The method achieves very good results on both educational graded reading corpora and general corpora. The method simplifies the feature representation, so that sentence difficulty can be reflected by a vector of only 21 dimensions. It introduces multi-scale features, which enrich the difficulty feature representation and enhance model generalization. By incorporating newly used context information, it builds a difficulty vector representation system suitable for both the sentence level and the article level, and obtains good results on two data sets, one at each level. The classifier uses a gradient boosting tree, which trains quickly and yields a feature importance ranking.
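To make the last point concrete, the sketch below shows how a gradient boosting model can produce a feature importance ranking. It is a deliberately small stand-in, not the patent's classifier: it uses depth-1 trees (stumps) with squared loss on a numeric difficulty level, written in pure Python, and the two features and six samples are hypothetical toy data. In practice a library implementation of GBDT would be used.

```python
def fit_stump(X, residuals):
    """Return the single split with the largest squared-error reduction.

    A stump is (feature index, threshold, left value, right value).
    """
    n = len(X)
    total = sum(residuals)
    best_gain, best = 0.0, None
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot place a threshold between equal values
            right_sum = total - left_sum
            # Reduction in squared error from splitting into two leaf means.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if gain > best_gain:
                best_gain = gain
                best = (j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best, best_gain


def predict_stump(stump, x):
    j, threshold, left, right = stump
    return left if x[j] <= threshold else right


def fit_gbdt(X, y, n_rounds=20, lr=0.5):
    """Boost stumps on residuals; accumulate split gain per feature."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump, gain = fit_stump(X, residuals)
        if stump is None:
            break  # no split reduces the error any further
        stumps.append(stump)
        importance[stump[0]] += gain
        preds = [p + lr * predict_stump(stump, x) for p, x in zip(preds, X)]
    return (base, lr, stumps), importance


def predict(model, x):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)


# Hypothetical toy data: an uninformative value and a sentence length.
FEATURE_NAMES = ["noise", "sentence_length"]
X = [[0.3, 5.0], [0.9, 6.0], [0.1, 12.0], [0.7, 13.0], [0.4, 20.0], [0.8, 22.0]]
y = [1, 1, 2, 2, 3, 3]  # hypothetical difficulty levels

model, importance = fit_gbdt(X, y)
ranking = sorted(zip(FEATURE_NAMES, importance), key=lambda item: -item[1])
print(ranking)
print([round(predict(model, x)) for x in X])
```

Here importance is measured as the total error reduction contributed by splits on each feature, which is one common definition. Library implementations may normalize or define it differently.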

Description

Technical field

[0001] The invention relates to the technical field of clarity analysis in natural language processing, and in particular to a multi-scale difficulty vector classification method for graded readers.

Background

[0002] The task of difficulty vector classification is, given a text, to analyze it and either output a difficulty value or judge which level of reader the text suits. In education, this can guide the selection of graded corpora and textbook materials, and can quantitatively measure the difficulty and complexity of sentence comprehension. For general texts such as news, it can also be used to analyze the reading difficulty and degree of specialization of an article. This difficulty vector can make a more accurate measurement of the difficulty and complexity of text understanding, provide an important basis for sentence simplification and refinement, and also provide a reference for the selec...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F16/33, G06F40/289, G06F40/211, G06F40/216
CPC: G06F16/3343, G06F16/3344, G06F16/355
Inventor: 马千里, 陈海斌, 田帅
Owner: SOUTH CHINA UNIV OF TECH