A method for extracting text content of Chinese PDF files for network streaming

A stream-transmission and content-extraction technology, applied in transmission systems, data exchange networks, digital transmission systems, and similar fields. It addresses three problems: Chinese PDF text content cannot be supervised in real time, existing methods lack support for real-time Chinese text content extraction, and extraction overhead reduces the efficiency of real-time text content extraction. It aims to optimize space efficiency, improve throughput, and avoid delay jitter.

Active Publication Date: 2018-10-26
HARBIN ENG UNIV

AI Technical Summary

Problems solved by technology

Liu Lirong's research on key technologies for formatted-file content extraction and filtering addresses the extraction and filtering of formatted-file content in network traffic. However, it has a major deficiency: it cannot support real-time supervision of Chinese PDF text content transmitted in network data streams.
[0003] As described above, current methods for extracting the text content of PDF files transmitted in network data streams lack support for real-time extraction or for Chinese text content. Extracting the text content of Chinese PDF files from a network data stream places extremely high demands on the system and on the entire processing flow. These demands are mainly reflected in the following:
[0004] ① A large amount of PDF text content is transmitted concurrently in a real-time network data stream, so any extra overhead incurred while extracting PDF content from the data stream seriously reduces the efficiency of real-time text content extraction.
[0006] ③ The encoding of Chinese PDF files is relatively complex. Each PDF document has its own Chinese CID encoding, and the mapping between CID codes and Unicode codes has no fixed storage location in the file. Consequently, a large number of CID codes that appear before the mapping must be cached, and caching both these codes and the mapping itself incurs a large memory overhead.
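The caching problem in item ③ can be illustrated with a minimal sketch. It is not the patented method: the class `DeferredCidDecoder` and its methods are hypothetical names, and it assumes a single CID-to-Unicode mapping per document that may arrive after some CIDs have already been seen in the stream.

```python
from typing import Dict, List


class DeferredCidDecoder:
    """Hypothetical helper: buffers CIDs until a CID-to-Unicode mapping arrives."""

    def __init__(self) -> None:
        self.mapping: Dict[int, str] = {}
        self.pending: List[int] = []  # CIDs seen before any mapping was available
        self.output: List[str] = []   # decoded characters, in stream order

    def feed_cids(self, cids: List[int]) -> None:
        # Without a mapping the CIDs cannot be decoded yet, so they are cached.
        if not self.mapping:
            self.pending.extend(cids)
        else:
            self.output.extend(self._decode(cids))

    def feed_mapping(self, mapping: Dict[int, str]) -> None:
        # Once the mapping arrives, flush everything that was waiting for it.
        self.mapping.update(mapping)
        self.output.extend(self._decode(self.pending))
        self.pending.clear()

    def _decode(self, cids: List[int]) -> List[str]:
        # Unknown CIDs become U+FFFD (the Unicode replacement character).
        return [self.mapping.get(cid, "\ufffd") for cid in cids]

    def text(self) -> str:
        return "".join(self.output)


decoder = DeferredCidDecoder()
decoder.feed_cids([1, 2])                  # arrives before the mapping: cached
decoder.feed_mapping({1: "中", 2: "文"})    # mapping arrives: cache is flushed
print(decoder.text())
```

The `pending` list grows with every CID that precedes the mapping, which is the memory cost the passage describes.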

Embodiment Construction

[0049] The present invention will be further described below in conjunction with the accompanying drawings.

[0050] The invention describes a method for extracting text content in real time from Chinese PDF files transmitted over a network. The method supports extraction from static PDF files as well as extraction of Chinese PDF text content from network traffic. It comprises a fast method for locating text content streams and a data compression method for CID code conversion. By quickly identifying and judging the content stream type of the PDF file, the invention greatly improves the extraction speed of the text content of Chinese PDF files. In addition, by using red-black tree mapping, it reduces the memory overhead required for code conversion of Chinese text content at the cost of a small time delay.
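The two ideas in this paragraph can be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's implementation. `CidRangeMap` and `looks_like_text_stream` are hypothetical names. Python's standard library has no red-black tree, so a sorted list searched with `bisect` stands in for the ordered map; it gives the same logarithmic lookup. Storing contiguous CID runs as ranges is an assumed way of saving memory. The stream check relies only on the fact that PDF text objects are delimited by the `BT` and `ET` operators, and it expects an already decoded content stream.

```python
import bisect
from typing import List, Optional, Tuple


class CidRangeMap:
    """Hypothetical compact CID-to-Unicode map built from contiguous ranges."""

    def __init__(self, ranges: List[Tuple[int, int, int]]) -> None:
        # Each range is (first_cid, last_cid, first_unicode_code_point).
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, cid: int) -> Optional[str]:
        # Find the last range whose first CID is <= cid (O(log n)).
        i = bisect.bisect_right(self._starts, cid) - 1
        if i < 0:
            return None
        first, last, base = self._ranges[i]
        if cid > last:
            return None  # cid falls in a gap between ranges
        return chr(base + (cid - first))


def looks_like_text_stream(data: bytes) -> bool:
    """Heuristic: a decoded content stream with text contains BT and ET tokens."""
    tokens = data.split()
    return b"BT" in tokens and b"ET" in tokens


cid_map = CidRangeMap([(10, 12, 0x4E2D), (1, 1, 0x6587)])
print(cid_map.lookup(10), cid_map.lookup(1))
print(looks_like_text_stream(b"BT /F1 12 Tf <0001> Tj ET"))
```

One range entry covers an arbitrarily long run of consecutive CIDs, so the table size depends on the number of runs rather than the number of characters. The token check is a heuristic only: a string literal that happens to contain `BT` and `ET` as separate tokens would also match.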

[0051] The invention is oriented to extracting the real-time text content from the Chinese PDF file trans...

Abstract

The invention belongs to the technical field of network information processing and relates in particular to a method for extracting the text content of Chinese PDF files transmitted as network streams. The invention includes: step A, providing an interface to the application-layer logic; step B, extracting the text content of the Chinese PDF document from the network data parsed by the application-layer logic. The invention analyzes and optimizes each step of extracting the content of Chinese PDF files transmitted in a network data stream, optimizing the time efficiency and space efficiency of the entire system as far as possible while avoiding harmful phenomena such as delay jitter. This enables the method to operate in a high-traffic supervision system without affecting the overall operating efficiency of the system.

Description

Technical field

[0001] The invention belongs to the technical field of network information processing and relates in particular to a method for extracting the text content of Chinese PDF files transmitted as network streams.

Background

[0002] Because the PDF file format offers platform independence, a complete standard, and high-quality output, PDF files have become the main form of information sharing and transmission on the Internet. However, PDF files may also carry content that is not conducive to the healthy development of the Internet, which creates a need to inspect the content of PDF files in network data streams. There has been much research on PDF document text content extraction in China and abroad. In the research on the generation of PDF documents and their content extraction, Liu Ping and others introduced in detail the generation standards of PDF documents in ScienceWord software, and the extraction of text content after reverse analysis o...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/22, H04L12/823, H04L12/841, H04L47/32
CPC: H04L47/283, H04L47/32, G06F40/129, G06F40/151
Inventor: 王巍, 杨武, 苘大鹏, 玄世昌, 段茂涛
Owner: HARBIN ENG UNIV