Chinese PDF file text content extraction method oriented to network flow transmission

An extraction method for stream transmission, applied in the field of Chinese PDF file text content extraction. It addresses the inability of existing approaches to supervise Chinese PDF text content in real time, the lack of support for real-time Chinese text content extraction, and the extra overhead that degrades real-time extraction efficiency. It optimizes space efficiency, improves throughput, and avoids delay jitter.

Active Publication Date: 2016-08-10
HARBIN ENG UNIV

AI Technical Summary

Problems solved by technology

Liu Lirong's research on key technologies for formatted file content extraction and filtering addressed the extraction and filtering of formatted file content in network traffic. However, it has a major deficiency: it cannot provide real-time supervision of Chinese PDF text content transmitted in network data streams.
[0003] As described above, current methods for extracting the text content of PDF files transmitted in network data streams lack support for real-time extraction or for Chinese text content. Extracting the text content of Chinese PDF files transmitted in network data streams places extremely high demands on the system and on the entire processing flow. These demands are mainly reflected in the following:
[0004] ① A large amount of PDF file text content is transmitted concurrently in a real-time network data stream, so any extra overhead incurred while extracting PDF content from the stream seriously reduces the efficiency of real-time text content extraction.
[0006] ③ The encoding of Chinese PDF files is relatively complicated. Each PDF document has its own specific Chinese CID encoding, and the storage location of the CID-to-Unicode mapping within the file is not fixed. A large number of CID codes that appear before the mapping therefore need to be cached, and caching both these codes and the mapping itself brings a large memory overhead.
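The caching problem described in ③ can be pictured with a minimal sketch. It is an illustration only, not the patented method: the class name `DeferredCidDecoder` and its methods are hypothetical, and the CID-to-Unicode mapping is assumed to arrive as a plain dictionary at some unknown point in the stream.

```python
class DeferredCidDecoder:
    """Hypothetical helper: buffers CID codes seen before the mapping arrives."""

    def __init__(self):
        self.mapping = None   # CID -> Unicode character, unknown at first
        self.pending = []     # CID codes cached while the mapping is unknown
        self.output = []      # decoded characters, in arrival order

    def feed_cids(self, cids):
        """Accept CID codes from a text content stream."""
        if self.mapping is None:
            # Mapping not seen yet: the codes must be cached (memory cost).
            self.pending.extend(cids)
        else:
            self.output.extend(self._decode(cids))

    def feed_mapping(self, mapping):
        """Accept the CID-to-Unicode mapping and flush the cached codes."""
        self.mapping = dict(mapping)
        self.output.extend(self._decode(self.pending))
        self.pending.clear()

    def _decode(self, cids):
        # Unknown CIDs become U+FFFD (replacement character).
        return [self.mapping.get(cid, "\ufffd") for cid in cids]

    def text(self):
        return "".join(self.output)


decoder = DeferredCidDecoder()
decoder.feed_cids([1, 2])                      # arrives before the mapping
decoder.feed_mapping({1: "中", 2: "文", 3: "本"})
decoder.feed_cids([3])                         # arrives after the mapping
print(decoder.text())
```

The `pending` list grows with every code that precedes the mapping, which is the memory overhead the passage refers to.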



Examples


Embodiment Construction

[0049] The present invention will be further described below in conjunction with the accompanying drawings.

[0050] The invention describes a method for extracting text content in real time from Chinese PDF files transmitted over the network. The method supports extraction from static PDF files as well as extraction of Chinese PDF text content from network traffic. It includes a method for quickly locating text content streams and a data compression method for CID code conversion. By quickly identifying and judging the content stream type of the PDF file, the invention can greatly improve the extraction speed of the text content of Chinese PDF files. In addition, by using a red-black tree mapping technique, it reduces the memory overhead required for converting the encoding of Chinese text content at the cost of a small amount of added delay.
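A compact ordered mapping of this kind can be sketched as follows. This is a hedged illustration, not the patent's data structure: Python's standard library has no red-black tree, so a sorted list searched with `bisect` stands in for it (lookups are O(log n) as in a balanced tree, but insertion into a list is O(n)). The class `CidRangeMap` and the idea of storing runs of consecutive CIDs as a single range entry are assumptions made for the example.

```python
import bisect


class CidRangeMap:
    """Hypothetical range-compressed CID -> Unicode map (sorted-list stand-in)."""

    def __init__(self):
        self._starts = []   # first CID of each range, kept sorted
        self._ranges = []   # parallel list: (first_cid, last_cid, first_unicode)

    def add_range(self, first_cid, last_cid, first_unicode):
        """Store a run of consecutive CIDs as one entry instead of many."""
        i = bisect.bisect_left(self._starts, first_cid)
        self._starts.insert(i, first_cid)
        self._ranges.insert(i, (first_cid, last_cid, first_unicode))

    def lookup(self, cid):
        """Return the Unicode character for a CID, or None if unmapped."""
        i = bisect.bisect_right(self._starts, cid) - 1
        if i < 0:
            return None
        first_cid, last_cid, first_unicode = self._ranges[i]
        if cid > last_cid:
            return None
        return chr(first_unicode + (cid - first_cid))


cmap = CidRangeMap()
cmap.add_range(0x0010, 0x0012, 0x4E2D)   # three consecutive CIDs, one entry
cmap.add_range(0x0001, 0x0001, 0x6587)   # a single-CID range
print(cmap.lookup(0x0010), cmap.lookup(0x0001), cmap.lookup(0x0005))
```

Storing one entry per run rather than one per character is where the memory saving comes from; the extra search step on each lookup corresponds to the small delay the paragraph mentions.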

[0051] The invention is oriented to extracting the real-time text content from the Chinese PDF file trans...


Abstract

The present invention belongs to the technical field of network information processing and relates in particular to a Chinese PDF file text content extraction method oriented to network flow transmission. The method comprises: step A, providing an interface to the application layer logic; and step B, extracting Chinese PDF document text content from the network data parsed by the application layer logic. By analyzing and optimizing the steps involved in extracting the content of Chinese PDF files transmitted in network data flows, the method optimizes the whole system in terms of both time efficiency and space efficiency and avoids detrimental phenomena such as delay jitter, so that it can operate in a high-traffic surveillance system without affecting the overall operating efficiency of that system.
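The hand-off between steps A and B can be pictured with a small sketch. It is an assumption-laden illustration rather than the claimed interface: the class `PdfStreamScanner`, its `feed` method, and the callback are hypothetical names, and the scanner simply searches for the `stream` and `endstream` keywords, whereas a real parser would also honor the stream dictionary's `/Length` entry.

```python
class PdfStreamScanner:
    """Hypothetical incremental scanner fed by application-layer payload chunks."""

    START = b"stream"
    END = b"endstream"

    def __init__(self, on_stream):
        self._on_stream = on_stream   # called once per complete stream body
        self._buf = bytearray()       # holds only the not-yet-consumed data

    def feed(self, chunk):
        """Step A entry point: accept one reassembled payload chunk."""
        self._buf += chunk
        while True:
            start = self._buf.find(self.START)
            if start < 0:
                # Keep just enough tail to catch a keyword split across chunks.
                keep = len(self.START) - 1
                del self._buf[:max(0, len(self._buf) - keep)]
                return
            body_start = start + len(self.START)
            end = self._buf.find(self.END, body_start)
            if end < 0:
                # Body incomplete: drop everything before it and wait for more.
                del self._buf[:start]
                return
            # Step B hand-off: deliver the body with surrounding CR/LF trimmed
            # (a simplification; binary bodies would need /Length handling).
            self._on_stream(bytes(self._buf[body_start:end]).strip(b"\r\n"))
            del self._buf[:end + len(self.END)]


bodies = []
scanner = PdfStreamScanner(bodies.append)
scanner.feed(b"1 0 obj << >> str")            # keyword split across chunks
scanner.feed(b"eam\nBT (Hi) Tj ET\nends")
scanner.feed(b"tream endobj")
print(bodies)
```

Because consumed bytes are discarded as soon as a stream body is delivered, the buffer stays small regardless of file size, which suits processing data as it arrives rather than after the whole file is reassembled.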

Description

Technical field

[0001] The invention belongs to the technical field of network information processing and relates in particular to a method for extracting the text content of Chinese PDF files oriented to network stream transmission.

Background technique

[0002] Because the PDF file format is platform independent, fully standardized, and produces high-quality output, PDF files have become the main form of information sharing and transmission on the Internet. PDF files may, however, also carry content that is not conducive to the healthy development of the Internet, which creates the need to inspect the content of PDF files in network data streams. There has been much research on PDF document text content extraction at home and abroad. In the research on the generation of PDF documents and their content extraction, Liu Ping and others introduced in detail the generation standards of PDF documents in ScienceWord software, and the extraction of text content after reverse analysis o...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22, H04L12/823, H04L12/841, H04L47/32
CPC: H04L47/283, H04L47/32, G06F40/129, G06F40/151
Inventor: 王巍, 杨武, 苘大鹏, 玄世昌, 段茂涛
Owner: HARBIN ENG UNIV