Template-based classification system of electronic official documents

A technology for classifying and grading electronic official documents, applied in the fields of electronic digital data processing, special data processing applications, and instruments. It addresses the poor applicability of existing approaches and the false positives that arise during sensitive word screening, and achieves strong applicability.

Active Publication Date: 2018-08-14
STATE GRID HEILONGJIANG ELECTRIC POWER CO LTD ELECTRIC POWER RES INST +2
Cites: 4; Cited by: 15

AI Technical Summary

Problems solved by technology

[0013] The invention addresses two problems in existing information security supervision means: a uniformly configured sensitive word base has poor applicability, and a screening process that matches only sensitive words produces many false positives.

Examples

Specific Embodiment 1

[0043] This embodiment is described with reference to Figure 1.

[0044] A template-based electronic document classification and grading system, including:

[0045] A sensitive word and stop word management module, which lets users configure sensitive words and stop words; users can set stop words according to Chinese usage habits;

[0046] Sensitive words are keywords or parameters in a file or page that the user considers confidential or possibly confidential;

[0047] Stop words are characters or words that the scanning module automatically ignores when indexing pages or processing search requests, in order to save space and improve search efficiency. Broadly, stop words include modal particles, adverbs, conjunctions, and the like; they usually have no clear meaning by themselves, and only have a certain effect when they are put into a co...
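The relationship between the two word lists can be illustrated with a minimal Python sketch. The class name `WordManager` and its methods are hypothetical and do not come from the patent; the sketch also assumes the text has already been split into tokens, which for Chinese text would require a word segmenter that is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class WordManager:
    """Hypothetical stand-in for the sensitive word and stop word management module."""

    sensitive_words: set[str] = field(default_factory=set)
    stop_words: set[str] = field(default_factory=set)

    def add_sensitive(self, *words: str) -> None:
        # User-configured words considered confidential or possibly confidential.
        self.sensitive_words.update(words)

    def add_stop(self, *words: str) -> None:
        # User-configured words to ignore during scanning.
        self.stop_words.update(words)

    def strip_stop_words(self, tokens: list[str]) -> list[str]:
        # Drop stop words before any matching, preserving token order.
        return [t for t in tokens if t not in self.stop_words]

    def find_sensitive(self, tokens: list[str]) -> list[str]:
        # Return the sensitive tokens that remain after stop word removal.
        kept = self.strip_stop_words(tokens)
        return [t for t in kept if t in self.sensitive_words]
```

Keeping both lists as sets makes each membership check constant time on average, which matters when many documents are scanned against a large word base.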

Specific Embodiment 2

[0054] The scanning module described in this embodiment includes a file scanning submodule and a URL scanning submodule:

[0055] The file scanning submodule provides full-text extraction for office documents such as Office series documents and PDF. For compressed files such as ZIP and RAR, it decompresses the archive, then determines the type of each file and extracts its text; nested compressed files are decompressed recursively;
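Recursive decompression of nested archives can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it handles only ZIP archives and plain `.txt` members using the Python standard library, it determines file type from the file name suffix, and the function name `extract_texts` and the `max_depth` guard are hypothetical. RAR, Office, and PDF handling would need third-party libraries and are omitted.

```python
import io
import zipfile


def extract_texts(name: str, data: bytes, max_depth: int = 5) -> dict[str, str]:
    """Return {path: text} for every .txt member, descending into nested ZIPs."""
    results: dict[str, str] = {}
    lowered = name.lower()
    if lowered.endswith(".zip"):
        if max_depth <= 0:
            return results  # stop descending to guard against deeply nested archives
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                inner = archive.read(info)
                # Recurse: the member may itself be an archive or a text file.
                results.update(
                    extract_texts(f"{name}/{info.filename}", inner, max_depth - 1)
                )
    elif lowered.endswith(".txt"):
        results[name] = data.decode("utf-8", errors="replace")
    return results
```

The depth limit is a design choice of this sketch; the source states only that nested archives are supported, without describing any limit.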

[0056] The URL scanning submodule scans the URL (Uniform Resource Locator) at a specified location and uses search engine crawler technology to crawl recursively up to a configured number of layers, extracting text from HTML pages and page attachments. For attachments, it supports office documents such as the Office series and PDF, as well as text extraction from compressed formats such as ZIP and RAR;
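A layer-limited crawl of this kind can be sketched with the Python standard library. The sketch below is an assumption-laden illustration: it uses an iterative breadth-first traversal rather than literal recursion, the `fetch` callable is a hypothetical stand-in for an HTTP client, attachments are not handled, and text inside `<script>` or `<style>` elements is not filtered out.

```python
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects text fragments and href targets from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url: str, fetch: Callable[[str], str], max_layers: int) -> dict[str, str]:
    """Breadth-first crawl; the start page is layer 0, its links are layer 1, and so on."""
    texts: dict[str, str] = {}
    frontier = [start_url]
    for _ in range(max_layers + 1):
        next_frontier: list[str] = []
        for url in frontier:
            if url in texts:
                continue  # already visited, avoids loops between pages
            parser = _PageParser()
            parser.feed(fetch(url))
            texts[url] = " ".join(parser.text_parts)
            # Resolve relative links against the current page URL.
            next_frontier.extend(urljoin(url, link) for link in parser.links)
        frontier = next_frontier
    return texts
```

Passing `fetch` as a parameter keeps the traversal logic separate from network access, so the layer limit can be exercised against an in-memory dictionary of pages.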

[005...

Specific Embodiment 3

[0058] The file scanning submodule in this embodiment encapsulates text extraction for different file types: a single interface handles content extraction for documents such as Office and PDF. When the URL scanning submodule extracts HTML content, it treats the text as UTF-8 by default.

[0059] The scanning module provides a unified multi-format text extraction interface that supports Office and PDF documents as well as HTML pages and their attachments. Text extraction methods differ between document types, and even between versions of the same type (for example, Office 2003 and Office 2007), so extracting each file type through its own interface would complicate the interface and reduce maintainability. For this reason, text extraction for the different file types is encapsulated, that is, only a single interface is pro...
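The idea of one entry point over many format-specific extractors can be sketched with a small registry. All names here (`register`, `extract_text`, the two sample extractors) are hypothetical; the sketch dispatches on file suffix, decodes as UTF-8, and strips HTML tags with a crude regular expression. Real Office and PDF extractors would be registered in the same way using third-party libraries.

```python
import re
from pathlib import Path
from typing import Callable

# Registry mapping a lowercase file suffix to its extractor function.
_EXTRACTORS: dict[str, Callable[[bytes], str]] = {}


def register(*suffixes: str):
    """Decorator that registers an extractor for one or more file suffixes."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        for suffix in suffixes:
            _EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register(".txt")
def _extract_plain(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


@register(".html", ".htm")
def _extract_html(data: bytes) -> str:
    # Crude tag stripping for illustration only; not a full HTML parser.
    text = re.sub(r"<[^>]+>", " ", data.decode("utf-8", errors="replace"))
    return " ".join(text.split())


def extract_text(filename: str, data: bytes) -> str:
    """Single entry point: callers never choose a format-specific extractor."""
    suffix = Path(filename).suffix.lower()
    try:
        extractor = _EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix!r}") from None
    return extractor(data)
```

With this layout, supporting a new format or a new version of an existing format means adding one registered function, while callers of `extract_text` stay unchanged.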

Abstract

The invention relates to a classification system for electronic official documents, in particular a template-based one. It is intended to solve two problems of prior-art information security supervision means: a sensitive word base that must be configured uniformly has poor applicability, and a screening process that matches only sensitive words produces many false positives. The template-based classification system comprises: a sensitive word and stop word management module, which provides setting operations for sensitive words and stop words; a scanning module, which extracts text from the file under detection; a template management module, which in an enterprise intranet environment supports selecting and exporting the templates and source files uploaded by superior departments, and in a non-enterprise intranet environment supports selecting and exporting templates only; and a confidentiality matching module, which performs sensitive word matching on the text according to the exported templates and judges paragraph similarity and full-text similarity. The system is used for classifying and managing electronic official documents.
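How sensitive word matching might be combined with paragraph and full-text similarity can be sketched as follows. The source does not say which similarity measure is used or how the three signals are combined, so everything here is an assumption: the sketch uses `difflib.SequenceMatcher` from the Python standard library, splits paragraphs on blank lines, and flags a document only when it contains a sensitive word and is also similar to the template. The function names and the `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed measure, not the patent's."""
    return SequenceMatcher(None, a, b).ratio()


def _paragraphs(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def match_document(text: str, template_text: str, sensitive_words: set[str],
                   threshold: float = 0.8) -> dict:
    # Step 1: plain sensitive word matching (the approach prone to false positives).
    hits = sorted(w for w in sensitive_words if w in text)
    # Step 2: full-text similarity against the template.
    full = similarity(text, template_text)
    # Step 3: best template match for each paragraph of the document.
    template_paras = _paragraphs(template_text)
    para_scores = [
        max((similarity(p, t) for t in template_paras), default=0.0)
        for p in _paragraphs(text)
    ]
    # Assumed decision rule: a word hit alone is not enough to flag a document.
    similar = full >= threshold or any(s >= threshold for s in para_scores)
    return {
        "hits": hits,
        "full_similarity": full,
        "paragraph_similarities": para_scores,
        "flagged": bool(hits) and similar,
    }
```

Requiring similarity in addition to a word hit is one way to reduce the false positives that word-only screening produces; other combination rules would fit the abstract equally well.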

Description

Technical field

[0001] The invention relates to a system for classifying and grading electronic official documents.

Background

[0002] In today's information-based society, the daily work of government departments at all levels, enterprises, and institutions is inseparable from computer systems. An organization's electronic documents are of many types and widely distributed; they are stored on various storage media, and websites also contain important information and work materials. Ensuring the security of this data has become a focus of information security work. The business data of governments and large enterprises and institutions is important basic data, and data leakage causes major economic losses and serious security risks to the country and to users. Therefore, when the headquarters distributes electronic documents, it is necessary to classify various electronic documents, ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/289
Inventor: 尚方冉庆辉韩冰张凯王孝余刘生
Owner: STATE GRID HEILONGJIANG ELECTRIC POWER CO LTD ELECTRIC POWER RES INST