Method and device for extracting document content
A document content technique, applied in the field of document content extraction, that addresses problems such as the inability to adapt to the diverse content forms of test questions.
Examples
Embodiment 1
[0020] FIG. 1 is a schematic flowchart of a method for extracting document content provided by Embodiment 1 of the present invention. The method can be executed by an apparatus for extracting document content, which can be implemented in software and/or hardware and is usually configured in a server that provides document content extraction services. As shown in FIG. 1, the method includes:
[0021] S110. Search for content in the initial document that satisfies the set format conditions, and remove the format corresponding to the content to obtain a preprocessed document;
[0022] The initial document is the original document that the user needs to process, and it may be a document in any of various formats (such as a Word document, a PPT document, or a PDF document). Content satisfying the set format conditions can be searched for in the initial document according to the preset format, and the content satisfying the set form...
Embodiment 2
[0032] FIG. 2 is a schematic flowchart of a method for extracting document content provided by Embodiment 2 of the present invention. As shown in FIG. 2, this embodiment is optimized on the basis of Embodiment 1, with S210 added before S110. Specifically, S210 is:
[0033] S210. Copy the document to be processed to a temporary directory to obtain an initial document, and set the status of the initial document to invisible.
[0034] The document to be processed is the document that the user uploaded for processing and that the server has received. When the user sends an instruction, the server copies the successfully uploaded document to the temporary directory to obtain the initial document, and sets the document status of the initial document to invisible, so that processing of the initial document is not affected by other operations on the server.
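Step S210 can be illustrated with a minimal Python sketch. The names `InitialDocument` and `stage_document`, the `doc_extract_` directory prefix, and the representation of the "invisible" status as a boolean field are all assumptions made for illustration; the source does not specify how the status is stored.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InitialDocument:
    """Working copy of an uploaded document (hypothetical structure)."""
    path: Path     # location of the copy inside the temporary directory
    visible: bool  # assumed status flag; False means "invisible"


def stage_document(uploaded: Path) -> InitialDocument:
    """Copy the uploaded document to a temporary directory, as in S210."""
    # A fresh temporary directory keeps the copy apart from the upload.
    temp_dir = Path(tempfile.mkdtemp(prefix="doc_extract_"))
    working_copy = temp_dir / uploaded.name
    shutil.copy2(uploaded, working_copy)
    # The copy starts out invisible so that other operations skip it.
    return InitialDocument(path=working_copy, visible=False)
```

Because all later steps work on the copy, the uploaded document itself is left unmodified in this sketch.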
[0035] Optionally, the method may also include before extracting the ...
Embodiment 3
[0041] This embodiment is optimized on the basis of the foregoing embodiments, and further elaborates on different processing methods for content satisfying the set format conditions during the preprocessing process.
[0042] The set format conditions can include multiple formats. Optionally, the content satisfying the set format conditions includes superscript content, subscript content, and underlined content. Correspondingly, searching the initial document for content that satisfies the set format conditions and removing the format corresponding to the content includes:
[0043] First, invoke the Word document advanced search tool to search for content that meets the set format conditions in the initial document;
[0044] As shown in FIG. 3, the advanced search tool for Word documents is a tool built into Word.
[0045] Secondly, determine the target format character corresponding to the content satisfying the set format condition;
[0046] Each set format has a corresponding target format character. Exemplarily,...
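The mapping from a set format to its target format character can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the embodiment: the document is modeled as a list of hypothetical `Run` objects rather than a real Word file, the markers in `TARGET_MARKERS` are invented placeholders (the source text is truncated before it names the actual characters), and the sketch assumes the target format character is used to mark the content once its format has been removed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """A span of text with its format flags (hypothetical document model)."""
    text: str
    superscript: bool = False
    subscript: bool = False
    underline: bool = False


# Placeholder markers, one pair per set format; the real target format
# characters are not given in the source.
TARGET_MARKERS = {
    "superscript": ("<sup>", "</sup>"),
    "subscript": ("<sub>", "</sub>"),
    "underline": ("<u>", "</u>"),
}


def preprocess(runs: List[Run]) -> str:
    """Return plain text in which formatted content is wrapped in markers."""
    parts = []
    for run in runs:
        text = run.text
        # Wrap the text once for every set format that applies to this run.
        for fmt, (opening, closing) in TARGET_MARKERS.items():
            if getattr(run, fmt):
                text = f"{opening}{text}{closing}"
        parts.append(text)
    # The result carries no formatting, only the plain-text markers.
    return "".join(parts)
```

For example, `preprocess([Run("x"), Run("2", superscript=True)])` yields `x<sup>2</sup>`, so the superscript survives as plain text after the format itself is gone.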


