Document content classification method, system and device and computer readable storage medium

A document content classification method, applied in the field of information retrieval, that addresses problems such as the need for wide corpus coverage, disordered document layout, and misjudgment during sorting, and achieves flexible typesetting, reduced impact of layout errors, and high error tolerance.

Pending Publication Date: 2022-08-05
四川医枢科技有限责任公司
Cites: 0; cited by: 1

AI Technical Summary

Problems solved by technology

Traditional document content classification is based on statistics or rules. The statistics-based method is a probabilistic, non-deterministic reasoning method learned from a large-scale corpus. Its shortcoming is that the coverage of the corpus must be wide enough to achieve good results.
The rule-based method formulates classification rules according to regularities in linguistics. It is a deterministic reasoning method, and updating it has certain limitations.
With the development of deep neural networks, most document content classification in recent years has been implemented as an NLP task. The basic approach is to perform word segmentation on the text, apply an embedding operation to extract the text's feature vectors, pass them through a series of convolution and pooling operations, and finally apply softmax (softmax logistic regression) to the output to obtain the classification result. The advantage of this method is that the model is simple and easy to train. The disadvantages are that the model parameters must be adjusted specifically according to the training results, and that for long documents the model cannot reflect the semantic features between word vectors.
In the prior art, judgment errors are prone to occur when sorting text, resulting in a disordered document layout.
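
The segmentation, embedding, convolution, pooling and softmax steps of the neural approach described above can be illustrated with a small forward pass. This is only a sketch under stated assumptions, not the model of any particular system: the weights are random and untrained, whitespace splitting stands in for word segmentation, and all function names and sizes (`classify`, `dim`, `window`, `num_kernels`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed(tokens, dim, rng):
    """Assign each distinct token a random vector (stand-in for a trained embedding)."""
    table = {}
    for token in tokens:
        if token not in table:
            table[token] = [rng.uniform(-1, 1) for _ in range(dim)]
    return [table[token] for token in tokens]


def conv1d(vectors, kernel):
    """Slide one kernel (window x dim weights) over the sequence and apply ReLU."""
    window, dim = len(kernel), len(vectors[0])
    outputs = []
    for start in range(len(vectors) - window + 1):
        total = sum(kernel[j][d] * vectors[start + j][d]
                    for j in range(window) for d in range(dim))
        outputs.append(max(0.0, total))
    return outputs


def classify(text, num_classes=3, dim=4, window=2, num_kernels=5, seed=0):
    """Return class probabilities for a text using random, untrained weights."""
    rng = random.Random(seed)
    tokens = text.lower().split()  # assumption: whitespace split as word segmentation
    if len(tokens) < window:
        raise ValueError("text is shorter than the convolution window")
    vectors = embed(tokens, dim, rng)
    kernels = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(window)]
               for _ in range(num_kernels)]
    # Max-over-time pooling keeps one feature per kernel.
    pooled = [max(conv1d(vectors, kernel)) for kernel in kernels]
    weights = [[rng.uniform(-1, 1) for _ in range(num_kernels)]
               for _ in range(num_classes)]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in weights]
    return softmax(logits)


probabilities = classify("the report lists results in a table")
```

Because pooling keeps only the strongest response of each fixed-width window, word order beyond the window width is lost, which is one way to read the limitation on long documents noted above.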

Examples

Detailed description of the embodiments

[0039] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

[0040] An embodiment of the present invention discloses a document content classification method. As shown in Figure 1, the method includes:

[0041] S11: Obtain a target document, convert the document into a picture format, and obtain a target picture corresponding to the target document.

[0042] Specifically, in order to use the image recognition technology to classify the content of the document, the document in the non-pictur...
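
Step S11 can be sketched as a small dispatch function. The source only states that the document is converted into a picture format, so the rest is assumption: the sketch assumes that inputs already in a picture format pass through unchanged, the suffix list is illustrative, and `rasterize` and `fake_rasterize` are hypothetical stand-ins for whatever renderer an implementation would actually use.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of suffixes treated as "already a picture"; not taken from the patent.
PICTURE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}


def to_target_pictures(document: Path,
                       rasterize: Callable[[Path], List[Path]]) -> List[Path]:
    """Return the picture paths for a document, rasterizing only when needed."""
    if document.suffix.lower() in PICTURE_SUFFIXES:
        return [document]
    return rasterize(document)


def fake_rasterize(document: Path) -> List[Path]:
    """Hypothetical renderer: pretends the document has two pages, writes nothing."""
    return [document.with_name(f"{document.stem}_page{i}.png") for i in (1, 2)]


pictures = to_target_pictures(Path("report.pdf"), fake_rasterize)
```

Passing the renderer in as a callable keeps the dispatch logic independent of any specific conversion library.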

Abstract

The invention discloses a document content classification method, system and device and a computer readable storage medium, and the method comprises the steps: converting a document into a picture format, and obtaining a target picture corresponding to a target document; extracting content features from the target picture by using a preset document content classification model, and performing region division on the target picture according to the content features to obtain a plurality of to-be-sorted segmented regions; utilizing a preset document layout analysis model to extract the text type of each segmented region, and sorting according to the text type of each segmented region to obtain a plurality of text regions with correct text sequences; and reordering each text region to obtain a recombined document. According to the method, the document is divided into a plurality of areas according to categories through image recognition, each area is typeset independently, typesetting is more flexible, errors between the areas do not seriously affect the whole, and finally overall sorting is performed to obtain a complete document.
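
The sorting and recombination stages of the abstract can be sketched as follows. This is a simplified illustration, not the patented implementation: the two preset models are left out and their outputs (segmented regions with a text type and recognized lines) are supplied by hand, and the `Region` fields, the `"two_column"` and `"single_column"` labels, and the reading-order rules are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Line = Tuple[float, float, str]  # (x, y, text) of one recognized text line


@dataclass
class Region:
    x: float          # left edge of the segmented region
    y: float          # top edge of the segmented region
    width: float
    text_type: str    # label a layout analysis model would assign (hypothetical values)
    lines: List[Line]


def sort_region(region: Region) -> List[str]:
    """Order the lines of one region according to its text type."""
    middle = region.x + region.width / 2

    def two_column_key(line: Line):
        # Left column first, then top to bottom within each column.
        return (line[0] >= middle, line[1], line[0])

    def single_column_key(line: Line):
        # Plain top-to-bottom, left-to-right order.
        return (line[1], line[0])

    key = two_column_key if region.text_type == "two_column" else single_column_key
    return [text for _, _, text in sorted(region.lines, key=key)]


def recombine(regions: List[Region]) -> str:
    """Sort each region independently, then order regions top-to-bottom, left-to-right."""
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    return "\n".join(text for region in ordered for text in sort_region(region))


title = Region(0, 0, 100, "single_column", [(10, 5, "Title")])
body = Region(0, 20, 100, "two_column",
              [(60, 25, "C"), (10, 40, "B"), (10, 25, "A"), (60, 40, "D")])
document = recombine([body, title])
```

Since each region is sorted on its own, a wrong text type for one region only scrambles the lines inside that region, which matches the error-tolerance argument made in the abstract.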

Description

Technical field

[0001] The present invention relates to the field of information retrieval, and in particular to a document content classification method, system, device and computer-readable storage medium.

Background

[0002] Document content classification labels and classifies information content under a given classification system, and is a research area within information retrieval. Its purpose is to help people manage and process text information more efficiently, and it is widely used in fields such as text filtering. Traditional document content classification is based on statistics or rules. The statistics-based method is a probabilistic, non-deterministic reasoning method learned from a large-scale corpus. Its shortcoming is that the coverage of the corpus must be wide enough to achieve good results. The rule-based method is to formulate certain classificatio...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06V20/62, G06V10/26, G06V30/148, G06V10/40, G06V30/18, G06V10/764, G06V10/80, G06V30/19
CPC: G06F18/241, G06F18/25
Inventor: 王明辉, 闾磊, 高阳, 黄甫毅, 樊淼淼
Owner: 四川医枢科技有限责任公司