Method and device for extracting document content

A document content extraction technology, applied in the field of document information processing, that addresses problems such as the inability of existing approaches to adapt to the diverse content forms of test questions.

Active Publication Date: 2019-05-03
TIANWEN DIGITAL MEDIA TECH HUNAN
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0003] Although there are technical means to solve this problem, the existing technologies mainly focus on the structure of the document (paragraphs, levels) rather than on its content; or require a specified document template for effective analysis; or require special tags to be added; or analyze and extract only specific content (such as tables) in the document. They cannot adapt to the diverse content forms of test questions.


Examples

Embodiment 1

[0020] Figure 1 is a schematic flowchart of a method for extracting document content provided by Embodiment 1 of the present invention. The method can be executed by an apparatus for extracting document content, which can be implemented in software and/or hardware and can usually be configured in a server that provides document content extraction services. As shown in Figure 1, the method includes:

[0021] S110. Search for content in the initial document that satisfies the set format conditions, and remove the format corresponding to the content to obtain a preprocessed document;

[0022] The initial document is the original document that the user needs to process, and it may be in any of various formats (such as a Word document, a PPT document, or a PDF document). Content satisfying the set format condition can be searched for in the initial document according to the preset format, and the content satisfying the set form...

Embodiment 2

[0032] Figure 2 is a schematic flowchart of a method for extracting document content provided by Embodiment 2 of the present invention. As shown in Figure 2, this embodiment is optimized on the basis of Embodiment 1: S210 is added before S110. Specifically, S210 is:

[0033] S210. Copy the document to be processed to a temporary directory to obtain an initial document, and set the status of the initial document to invisible.

[0034] The document to be processed is the document that the user uploaded and that the server received for processing. When the user sends an instruction, the server copies the successfully uploaded document to the temporary directory to obtain the initial document, and sets the document status of the initial document to invisible, so that other operations on the server do not interfere while the initial document is being processed.
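The staging step S210 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `stage_initial_document` and the directory layout are hypothetical, and because the source does not say how the "invisible" status is stored, the sketch assumes the POSIX convention of a leading dot in the file name.

```python
import shutil
import tempfile
from pathlib import Path


def stage_initial_document(uploaded: Path, temp_root: Path) -> Path:
    """Copy an uploaded document into a temporary directory (step S210).

    Assumption: "invisible" is modeled by a leading dot in the file name
    (the POSIX hidden-file convention). The source does not specify how
    the status is actually recorded.
    """
    temp_root.mkdir(parents=True, exist_ok=True)
    # A fresh working directory per document avoids name collisions.
    work_dir = Path(tempfile.mkdtemp(prefix="doc_", dir=temp_root))
    initial = work_dir / ("." + uploaded.name)
    shutil.copy2(uploaded, initial)  # the uploaded original is left untouched
    return initial
```

Working on a copy means that a failure during preprocessing never damages the uploaded file, which matches the stated purpose of isolating the initial document from other operations.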

[0035] Optionally, the method may also include before extracting the ...

Embodiment 3

[0041] This embodiment is optimized on the basis of the foregoing embodiments, and further elaborates on different processing methods for content satisfying the set format conditions during the preprocessing process.

[0042] The set format conditions can include multiple formats. Optionally, the content satisfying the set format condition includes superscript content, subscript content, and underlined content. Searching for content in the initial document that satisfies the set format conditions, and removing the format corresponding to the content, includes:

[0043] First, invoke the Word document advanced search tool to search for content that meets the set format conditions in the initial document;

[0044] As shown in Figure 3, the advanced search tool is built into Word.

[0045] Secondly, determine the target format specifier corresponding to the content satisfying the set format condition;

[0046] Each set format has a corresponding target format specifier. Exemplarily,...
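The steps of Embodiment 3 can be sketched in Python under stated assumptions. The source's own examples of format specifiers are truncated, so the HTML-like markers below are hypothetical, as are the `Run` class and the `strip_formats` function. The sketch also models the document as a plain list of text runs instead of calling Word's advanced search tool, and it assumes that formatted content is wrapped in its specifier so that the information survives once the format itself is removed.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A span of text with the three format flags named in Embodiment 3."""
    text: str
    superscript: bool = False
    subscript: bool = False
    underline: bool = False


# Hypothetical specifiers: the source's real examples are truncated.
SPECIFIERS = {
    "superscript": ("<sup>", "</sup>"),
    "subscript": ("<sub>", "</sub>"),
    "underline": ("<u>", "</u>"),
}


def strip_formats(runs: list[Run]) -> str:
    """Return plain text in which each formatted run is wrapped in its specifier."""
    parts = []
    for run in runs:
        text = run.text
        for name, (opening, closing) in SPECIFIERS.items():
            if getattr(run, name):  # this run satisfies the format condition
                text = f"{opening}{text}{closing}"
        parts.append(text)
    return "".join(parts)
```

For example, `strip_formats([Run("x"), Run("2", superscript=True)])` returns `"x<sup>2</sup>"`: the superscript format is gone, but the specifier still records where it was.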

Abstract

Embodiments of the invention disclose a document content extraction method and apparatus. The method comprises the steps of searching for contents which meet a set format condition in an initial document, and removing a format corresponding to the contents to obtain a preprocessed document; extracting target contents in the preprocessed document, performing conversion to form a format content file, adding a tag to the format content file to perform tagging, and deleting the target contents from the preprocessed document; and extracting residual contents in the preprocessed document after processing, storing the residual contents as a text file, and storing the text file and the format content file in a folder corresponding to the initial document. According to the method and the apparatus provided by the embodiments of the invention, the problems that only the specific contents in the document can be extracted and the contents diversified in format cannot be adapted in the prior art are solved; and by analyzing the document contents, the contents of different formats in the document are fully identified and extracted, so that the purposes of reducing document usage limitations of an electronic system and fully utilizing the contents in the existing document are achieved.
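The extraction flow described in the abstract can be sketched roughly as follows. This is an illustration under assumptions, not the patented implementation: the `Block` model, the `extract_document` function, the tag scheme (content kind plus a running index), and the file names are all hypothetical, and the conversion of target content into a particular file format is omitted.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Block:
    kind: str     # "text" for ordinary text; any other kind is target content
    content: str


def extract_document(blocks: list[Block], out_dir: Path) -> list[str]:
    """Split a preprocessed document into tagged content files plus a text file.

    Returns the tags of the format content files that were written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    tags = []
    residual = []
    for block in blocks:
        if block.kind == "text":
            residual.append(block.content)
            continue
        # Hypothetical tag scheme: content kind plus a running index.
        tag = f"{block.kind}_{len(tags) + 1}"
        (out_dir / f"{tag}.txt").write_text(block.content, encoding="utf-8")
        tags.append(tag)
    # Whatever was not extracted is the residual content.
    (out_dir / "content.txt").write_text("\n".join(residual), encoding="utf-8")
    return tags


if __name__ == "__main__":
    folder = Path(tempfile.mkdtemp()) / "quiz"
    doc = [Block("text", "Q1. Read the table."), Block("table", "a|b")]
    print(extract_document(doc, folder))
```

Keeping the residual text and the tagged files in one folder per initial document mirrors the storage layout the abstract describes.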

Description

Technical field

[0001] Embodiments of the present invention relate to the technical field of document information processing, and in particular to a method and device for extracting document content.

Background

[0002] With the popularization of education informatization and online education, more and more Word documents (hereinafter collectively referred to as documents) are disseminated and used through the network or a school's internal system. Word is a client-side document editing program. When the content of a document is used in an electronic system, it can only be displayed and edited by embedding specific controls. This mode of use is narrow and limited, and the content of the document cannot be fully utilized. Online education companies usually convert documents into content data when building question banks. If these conversions are performed manually, the efficiency is quite low for thousands of documents. How to use a compute...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/22, G06F17/24
CPC: G06F40/151, G06F40/166
Inventor: 祝杰
Owner: TIANWEN DIGITAL MEDIA TECH HUNAN