Resume information extraction method based on cascading sequence annotation

A sequence labeling and information extraction technique, applied in fields such as instruments, text database querying, and unstructured text data retrieval. It addresses the problem that existing methods do not consider the block structure of resume text, and thereby resolves the resulting confusion in extraction.

Active Publication Date: 2020-11-20
THE 28TH RES INST OF CHINA ELECTRONICS TECH GROUP CORP

AI Technical Summary

Problems solved by technology

[0003] The current information extraction technology is only aimed at the extraction of shorter text fragments, and cannot hand



Embodiment Construction

[0051] As shown in Figure 1, the present invention provides a resume information extraction method based on cascading sequence annotation, comprising the following steps:

[0052] Step 1: use pdfminer to parse the resume file in PDF format, converting the rich-text resume into a plain-text representation;

[0053] Step 2, data labeling for training: use distantly supervised (remotely supervised) data for back-labeling, and merge similar items during the labeling process;

[0054] Step 3, resume block division: divide the resume into 4 blocks, and train a classifier to assign text to those blocks;

[0055] Step 4: use a two-layer sequence labeling model to extract information at the sentence level and at the short-text-fragment level.
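The back-labeling in Step 2 and the span output of a sequence labeler in Step 4 can be illustrated with a minimal sketch. It assumes a BIO tagging scheme, whitespace tokens, and exact matching of known field values; none of these details are stated in the source, and the function names `back_label` and `decode_bio` are hypothetical. The merging of similar items mentioned in Step 2 is not covered here.

```python
from typing import Dict, List, Tuple


def back_label(tokens: List[str], known: Dict[str, List[str]]) -> List[str]:
    """Project known field values onto a token sequence as BIO tags.

    `known` maps a field label to the tokens of a value that is already
    known for this resume (the distant-supervision signal, by assumption).
    """
    tags = ["O"] * len(tokens)
    for label, value in known.items():
        n = len(value)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            # Tag only exact matches that do not overlap an earlier match.
            window_free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == value and window_free:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return tags


def decode_bio(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn BIO tags into (label, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = "Graduated from Nanjing University with a Master degree".split()
known = {"school": ["Nanjing", "University"], "degree": ["Master"]}
tags = back_label(tokens, known)
spans = decode_bio(tags)
```

Under the same assumptions, `decode_bio` could also be run over a sequence of per-sentence tags, which is one way to picture the sentence-level layer sitting above the fragment-level layer.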

[0056] Step 1 includes:

[0057] PDF is a rich-text format, so it must first be parsed into ordinary plain text. The parsing process involves handling column division, section division, and line breaking. ...
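As a rough illustration of the line-breaking issue, the sketch below re-joins lines that were split in the middle of a sentence. The heuristic (join a line to the previous one unless the previous line ends in terminal punctuation or the new line starts like a list item) is an assumption for illustration, not the rule used in the patent, and the names `merge_broken_lines` and `pdf_to_lines` are hypothetical. The text extraction call assumes the pdfminer.six fork and its `extract_text` helper.

```python
import re
from typing import List

# Characters treated as ending a sentence or a label line (assumed set).
_TERMINAL = "。！？；：.!?;:"
# Lines that look like list items start a new logical line.
_BULLET = re.compile(r"^\s*(?:[-*•·]|\d+[.)、])\s*")


def pdf_to_lines(path: str) -> List[str]:
    """Extract raw text lines from a PDF using pdfminer.six."""
    # Imported lazily so the rest of the sketch runs without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path).splitlines()


def merge_broken_lines(lines: List[str]) -> List[str]:
    """Join lines that appear to have been broken mid-sentence."""
    merged: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            merged.append("")  # a blank line marks a paragraph boundary
            continue
        prev = merged[-1] if merged else ""
        if prev and prev[-1] not in _TERMINAL and not _BULLET.match(line):
            merged[-1] = prev + " " + line
        else:
            merged.append(line)
    return [m for m in merged if m]


raw_lines = [
    "Responsible for the design and",
    "implementation of the search module.",
    "",
    "- Led a team of five.",
]
clean_lines = merge_broken_lines(raw_lines)
```

A heuristic this simple would also glue an unpunctuated heading such as "Work Experience" onto the following line, so a real parser would need layout cues (font size, indentation, column position) in addition to punctuation.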


Abstract

The invention provides a resume information extraction method based on cascading sequence annotation. The method comprises the following steps: step 1, parsing the PDF resume with pdfminer and converting the original PDF into a multi-line text representation, a process that mainly resolves scrambled text order and incorrect line breaks; step 2, labeling the training data, using distantly supervised data for back-labeling and merging similar items during labeling; step 3, dividing the resume into blocks: for the sentences obtained through pdfminer, the block each sentence belongs to is determined by classifying that sentence; and step 4, extracting information at the sentence level and the short-text-fragment level with a two-layer sequence labeling model. The advantage of the method is that the resume block information is used for subsequent filtering, which effectively improves the recall rate without greatly reducing accuracy; through these four stages, resume information can be extracted effectively.

Description

Technical field

[0001] The invention relates to a resume information extraction method based on cascading sequence annotation.

Background technique

[0002] The key information extracted from a resume falls into four categories: attribute information, education experience, work experience, and project experience. Attribute information includes: name, date of birth, gender, telephone number, highest education level, place of origin, settled city and county, and political status. Education experience includes: graduate school, degree, and graduation time. Work experience includes: work unit, work content, position, and working hours. Project experience includes: project name, project responsibility, and project time. Among these 18 types of information, work content and project responsibility are extracted at the key-sentence level, while the other attributes are extracted as relatively short text fragments.

[0003] The current information extraction technology is only aimed at ...
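The 18 information types listed in the background section can be written down as a small schema, which makes the split between sentence-level and fragment-level targets explicit. This is only an illustrative sketch: the identifier spellings and the names `RESUME_FIELDS`, `SENTENCE_LEVEL`, and `extraction_level` are assumptions, not part of the patent text.

```python
from typing import Dict, List

# The four categories and their fields, as enumerated in the background.
RESUME_FIELDS: Dict[str, List[str]] = {
    "attribute_information": [
        "name", "date_of_birth", "gender", "telephone_number",
        "highest_education_level", "place_of_origin",
        "settled_city_and_county", "political_status",
    ],
    "education_experience": ["graduate_school", "degree", "graduation_time"],
    "work_experience": [
        "work_unit", "work_content", "position", "working_hours",
    ],
    "project_experience": [
        "project_name", "project_responsibility", "project_time",
    ],
}

# Fields the text says are extracted at the key-sentence level.
SENTENCE_LEVEL = {"work_content", "project_responsibility"}


def extraction_level(field: str) -> str:
    """Return which layer of the cascade a field belongs to."""
    return "sentence" if field in SENTENCE_LEVEL else "short_fragment"


total_fields = sum(len(fields) for fields in RESUME_FIELDS.values())
```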


Application Information

IPC(8): G06F16/33, G06F16/35, G06F40/205, G06F40/289, G06K9/62
CPC: G06F16/3344, G06F16/35, G06F40/205, G06F40/289, G06F18/214
Inventor: 徐建, 郭培胜, 徐琳, 李晓冬
Owner THE 28TH RES INST OF CHINA ELECTRONICS TECH GROUP CORP