
A PDF file category judgment method and a character extraction method

A method for determining the category of a PDF file, applied in the field of content recognition. It addresses three problems: PDF files cannot readily be used for secondary editing or automatic translation, existing tools cannot automatically determine the file category, and those tools lack universality across files. The method achieves rapid identification of the file category, efficient text extraction, and a higher degree of automation.

Active Publication Date: 2019-05-10
四川译讯信息科技有限公司
5 Cites, 1 Cited by

AI Technical Summary

Problems solved by technology

Both types of PDF file (text-based and image-type) preserve the integrity of the source file, but because they cannot be edited they are inconvenient in reprocessing scenarios such as secondary editing, automatic translation, and format reconstruction.
[0003] Some existing PDF text extraction tools, such as Apache PDFBox (developed by Apache) and iTextSharp, can extract text from text-based PDFs for secondary processing. However, such tools cannot automatically determine the type of the file: every input file is processed with the same extraction method, so the tools do not generalize across file types.

Examples


Embodiment Construction

[0038] All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.

[0039] Any feature disclosed in this specification (including any appended claims and the abstract) may, unless otherwise stated, be replaced by alternative features that are equivalent or serve a similar purpose. That is, unless expressly stated otherwise, each feature is only one example of a series of equivalent or similar features.

[0040] Referring to Figure 1, the present embodiment discloses a method for determining the category of a PDF file, that is, whether the PDF file is an image-type file or a text-type file. The determination includes the following steps:

[0041] A. Read the program that produced the PDF file. Based on the result, judge whether the PDF file is image-type or not; if it is not image-type, proceed to the next step.
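
Step A can be illustrated with a minimal Python sketch. It assumes the producing program is recorded in the `/Producer` entry of an unencrypted, uncompressed document information dictionary, and it uses a hypothetical keyword list (`SCANNER_HINTS`) to recognize scanning software; the specification excerpt does not say which producers indicate an image-type file, so both the function names and the keywords are illustrative only.

```python
import re
from typing import Optional

# Hypothetical keywords: the source does not list which producing
# programs indicate a scanned (image-type) PDF.
SCANNER_HINTS = ("scan", "twain")

# Matches a literal-string /Producer entry such as: /Producer (Some Tool 1.0)
# Simplification: escaped parentheses are handled, nested unescaped ones are not.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def read_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the /Producer string found in the raw bytes, or None."""
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    # Drop the backslash from escaped characters such as \( and \).
    raw = re.sub(rb"\\(.)", rb"\1", match.group(1), flags=re.DOTALL)
    return raw.decode("latin-1")


def looks_like_image_pdf(pdf_bytes: bytes) -> bool:
    """True if the producer name contains one of the assumed scanner keywords."""
    producer = read_producer(pdf_bytes)
    if producer is None:
        return False  # undecided at this step; later steps would continue
    lowered = producer.lower()
    return any(hint in lowered for hint in SCANNER_HINTS)
```

A `False` result here does not mean the file is text-based; it only means this step could not decide and the next step of the method should run. In practice a PDF library would be a more robust way to read the metadata than a regular expression over raw bytes.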

[0042] The produ...

Abstract

The invention discloses a PDF file category judgment method and a text extraction method. The category judgment method comprises a step of determining the category according to the program that produced the file, a step of determining the category according to the fonts in the file, a step of determining the category according to the document structure of the file, a step of determining the category according to the CMAP character mapping table, and a step of determining the category according to the picture information contained in the file. After the file category has been determined, the text extraction method corresponding to that category is selected to identify and extract the text in the file. The category judgment method works step by step, can accurately and quickly determine the type of any PDF file, and offers high judgment efficiency, low resource consumption, and high universality.
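
The step-by-step judgment described in the abstract can be sketched as a cascade in which each step either returns a verdict or defers to the next step. This is a minimal sketch under stated assumptions: the `features` dictionary, the two example checks, and the `"unknown"` fallback are all hypothetical placeholders, because the excerpt does not give the decision rule used by each step.

```python
from typing import Callable, Optional, Sequence

Verdict = Optional[str]          # "image", "text", or None when a step cannot decide
Check = Callable[[dict], Verdict]


def classify(features: dict, checks: Sequence[Check], default: str = "unknown") -> str:
    """Run the checks in order and return the first decisive verdict."""
    for check in checks:
        verdict = check(features)
        if verdict is not None:
            return verdict       # later, costlier checks are skipped
    return default


# Placeholder rules, assumed for illustration only.
def by_producer(features: dict) -> Verdict:
    return "image" if features.get("producer_is_scanner") else None


def by_fonts(features: dict) -> Verdict:
    return "image" if features.get("font_count", 0) == 0 else None


# Checks for document structure, CMAP table, and picture information
# would be appended to this list in the same way.
CHECKS = [by_producer, by_fonts]
```

Ordering the checks from cheapest to most expensive is one plausible reason a cascade of this kind would keep resource consumption low, since most files are decided before the later steps run.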

Description

Technical Field

[0001] The invention relates to the field of content recognition, in particular to a PDF file category judgment method and a text extraction method.

Background

[0002] PDF is a common file format today. It preserves the fonts, formatting, colors, and graphics of the source document, so the text in the document does not change during transmission or sharing, and the format does not support editing. Depending on how they are generated, PDF documents fall into two categories: PDF files converted directly from electronic files, that is, text-based PDFs; and PDF files produced by scanning non-electronic sources (pictures, photos taken with mobile phones, etc.), that is, image-type PDFs. Both types of PDF file preserve the integrity of the source file, but because they cannot be edited they are inconvenient in reprocessing scenarios such as secondary editing, automatic translation, and form...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35
Inventor: 马万炯陈俊周杨龙杰左林翼李剑
Owner: 四川译讯信息科技有限公司