Resume layout analysis algorithm fusing visual and textual characteristics

A layout analysis technique that combines visual and textual features, applied to resume parsing. It addresses the inability of existing methods to directly identify the semantic categories of paragraph units, and it reduces accumulated errors.

Pending Publication Date: 2019-10-29
苏州过来人科技有限公司

AI Technical Summary

Problems solved by technology

Semantic categories of different paragraph units in resumes cannot be directly identified

Examples

Specific Embodiment

[0050] In actual use, the text lines of the resume and their coordinates are first obtained through a PDF reading program or an OCR engine. The text of the i-th line is then encoded by a neural network to obtain the text embedding vector text_emb(i), and the image of the corresponding line is extracted to obtain the image embedding vector img_emb(i). Next, features such as font size and text length are extracted and normalized to obtain a feature vector. The text embedding vector, image embedding vector, and feature vector are aggregated to obtain the line embedding vector line_emb(i). Finally, a neural network performs sequence labeling on the line vector sequence [line_emb(i)] to obtain the semantic label of each line, from which the start and end line numbers of each semantic paragraph unit are derived.
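The last step above, deriving the start and end line numbers of each paragraph unit from the per-line labels, can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each line carries one plain category label (no BIO-style prefixes, which the source does not specify), the label names are hypothetical, and line numbers are 0-based and inclusive. Under this assumption two adjacent units of the same category would merge into one span.

```python
from itertools import groupby


def paragraph_spans(labels):
    """Group consecutive lines with the same label into paragraph units.

    Returns a list of (label, start_line, end_line) tuples with
    0-based, inclusive line numbers.
    """
    spans = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))  # number of consecutive lines sharing this label
        spans.append((label, index, index + length - 1))
        index += length
    return spans


# Hypothetical per-line output of the sequence labeling step.
line_labels = ["basic_info", "basic_info", "education",
               "education", "education", "work"]
spans = paragraph_spans(line_labels)
print(spans)
# [('basic_info', 0, 1), ('education', 2, 4), ('work', 5, 5)]
```

If the labeling model instead emits begin/inside tags, the grouping key would need to split on each begin tag rather than on a change of label.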

Abstract

The invention discloses a resume layout analysis algorithm fusing visual and textual characteristics. The analysis of a resume layout comprises the following steps: step 1, obtaining text lines and their coordinates from a PDF reading program or an OCR engine; step 2, encoding the text of the i-th line with a neural network to obtain a text embedding vector text_emb(i); step 3, extracting the image of the corresponding line to obtain an image embedding vector img_emb(i); step 4, extracting font size and text length features and normalizing them to obtain a feature vector; step 5, aggregating the vectors obtained in steps 2, 3 and 4 to obtain a line embedding line_emb(i); and step 6, performing sequence labeling on the line vector sequence [line_emb(i)] with a neural network. The resume is divided semantically by combining its visual features and text semantic features, and independent paragraph units are identified.
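Steps 4 and 5 can be illustrated with a short sketch. The source does not state which normalization or aggregation operator is used, so min-max scaling and plain concatenation are assumptions here, and the function names are hypothetical. The text and image embeddings are placeholder lists standing in for neural network outputs.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; returns all zeros if every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_line_embeddings(text_embs, img_embs, font_sizes, text_lengths):
    """Concatenate per-line text, image, and normalized feature vectors.

    Concatenation is an assumed aggregation; the source only says the
    three vectors are aggregated into line_emb(i).
    """
    norm_sizes = min_max_normalize(font_sizes)
    norm_lengths = min_max_normalize(text_lengths)
    return [
        list(t) + list(g) + [s, n]
        for t, g, s, n in zip(text_embs, img_embs, norm_sizes, norm_lengths)
    ]


# Placeholder embeddings for a two-line resume (not real model output).
text_embs = [[0.1, 0.2], [0.3, 0.4]]
img_embs = [[1.0], [2.0]]
line_embs = build_line_embeddings(text_embs, img_embs,
                                  font_sizes=[10, 20], text_lengths=[5, 15])
print(line_embs)
```

Each resulting row has the text dimensions first, then the image dimensions, then the two normalized layout features, which is the shape a sequence labeling model in step 6 would consume.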

Description

Technical field

[0001] The invention relates to the field of resume parsing, in particular to a resume layout analysis algorithm combining visual and textual features.

Background

[0002] Traditional vision-based layout analysis can distinguish layout regions such as pictures, tables, and paragraphs, but has difficulty identifying the semantic information of those regions. Resume parsing requires semantic analysis of the resume, and existing methods generally use text as the main basis for layout recognition, for example CN201810489651.X. Such methods do not exploit obvious visual features such as dividing lines, font size, and the size of blank areas.

[0003] Other methods extract simple visual features by rules. For example, CN201811613437.7 extracts visual features such as font size, whether the text is bold, font type, and line text length, and builds a classifier that distinguishes titles from body text. This method does not consider the text content, uses simp...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/20, G06F17/27
CPC: G06V10/22, G06F40/30
Inventor: 丁伟峰
Owner: 苏州过来人科技有限公司