Docx file text content extraction method and device

An extraction method and device applied in the field of docx file text content extraction. It addresses the problem that the execution speed of existing extraction approaches needs to be improved, and achieves faster text content extraction and faster parsing.

Pending Publication Date: 2021-03-19
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] However, when extracting docx file content in t



Example Embodiment

[0057] In order to make the above objects, features, and advantages of the present application clearer and easier to understand, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and specific embodiments.

[0058] To aid understanding of the technical solutions provided herein, the background of the present application is described first.

[0059] In studying traditional approaches to parsing docx documents, the inventors found that the common approaches are the Apache POI-based parsing method and the COM interface-based parsing method.

[0060] The Apache POI-based parsing method reads the docx file using the Java API provided by Apache POI to obtain the text content of the docx file. This method has the following deficiencies:

[0061] 1. Apache POI is a Java API and depends on a Java environment, so a JRE must be installed; in some special situations (such as resource-constrained environments) this requirement cannot be met;

[0062] 2. The docx file itself is a compressed format, and...


Abstract

The embodiment of the invention discloses a method and device for extracting the text content of a docx file. The method comprises the following steps: first obtaining any XML file in the docx file in which text content is stored; triggering SAX to load the XML file; extracting the text elements in the XML file; and obtaining the text content of the XML file from the extracted text elements. Because the XML file is parsed in SAX mode, only the text content needs to be parsed and the other content of the file does not, which increases the text content extraction speed. For a large XML file in particular, SAX loads, parses, and processes the data one part at a time, which avoids the long loading time incurred when the whole XML file is loaded at once and increases the parsing speed for large XML files.
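The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: it uses Python's standard `zipfile` and `xml.sax` modules as a stand-in for whatever SAX parser the device uses, and it assumes that the XML file holding the text is `word/document.xml` and that text elements are WordprocessingML `w:t` elements grouped by `w:p` paragraphs. The names `TextHandler`, `extract_text`, and `make_sample_docx` are hypothetical.

```python
import io
import zipfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# WordprocessingML namespace (assumption: text lives in w:t inside w:p).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class TextHandler(ContentHandler):
    """Collects character data found inside w:t, one string per w:p."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._parts = []
        self._in_text = False

    def startElementNS(self, name, qname, attrs):
        if name == (W_NS, "t"):
            self._in_text = True

    def characters(self, content):
        # Everything outside a text element is ignored.
        if self._in_text:
            self._parts.append(content)

    def endElementNS(self, name, qname):
        if name == (W_NS, "t"):
            self._in_text = False
        elif name == (W_NS, "p"):
            self.paragraphs.append("".join(self._parts))
            self._parts = []


def extract_text(docx_source, member="word/document.xml"):
    """Stream one XML member of a docx archive through a SAX parser."""
    handler = TextHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    with zipfile.ZipFile(docx_source) as archive:
        # The member is read and parsed in chunks, not loaded all at once.
        with archive.open(member) as stream:
            parser.parse(stream)
    return "\n".join(handler.paragraphs)


def make_sample_docx():
    """Builds a tiny in-memory archive for demonstration (hypothetical data)."""
    document_xml = (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<w:document xmlns:w="' + W_NS.encode("ascii") + b'"><w:body>'
        b"<w:p><w:r><w:t>Hello </w:t></w:r>"
        b"<w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>"
        b"<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        b"</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_text(io.BytesIO(make_sample_docx())))
```

Because the handler reacts only to text and paragraph elements, formatting markup such as `w:rPr` is skipped without building any in-memory tree, which is the property the abstract attributes to SAX-mode parsing.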

Description

Technical field

[0001] The present application relates to the technical field of file processing, in particular to a method and device for extracting the text content of a docx file.

Background

[0002] Due to the widespread use of the Windows operating system, the docx file supported by Office has become a widely used document format. In practical applications, users often exchange information through docx files, so docx files will inevitably carry sensitive information. To prevent the leakage of sensitive information, an anti-leakage device needs to monitor and audit the sensitive information in docx files. Because a docx file is unstructured and has a custom format, the text content of the docx file must be extracted before it can be audited for sensitive information.

[0003] However, in the prior art, the execution speed of docx file content extraction needs to be improved.

Summary of the invention

[0004] In view...


Application Information

IPC(8): G06F16/81, G06F40/117, G06F40/126, G06F40/169
CPC: G06F16/81
Inventor: 童陈敏
Owner: BEIJING GRIDSUM TECH CO LTD