File identification method and device, storage medium and electronic equipment

By acquiring file characteristics and system performance data, using the target model to calculate the number of concurrent processes and the identification time sequence, grouping and identifying files concurrently, and employing asynchronous processing and machine learning optimization, the problem of low efficiency in identifying large batches of files is solved, achieving efficient file identification and transaction processing.

CN116665234B · Active · Publication Date: 2026-06-16 · INDUSTRIAL AND COMMERCIAL BANK OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INDUSTRIAL AND COMMERCIAL BANK OF CHINA
Filing Date
2023-05-23
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In the field of agency trading in financial markets, existing technologies are inefficient when recognizing large batches of documents, especially on special trading days when there are many order documents. Synchronous recognition methods result in low efficiency for traders and may cause recognition timeouts.

Method used

By acquiring the number of features and system performance data of the files to be identified, the concurrency and identification time sequence are calculated using the target model. Files are identified in groups and concurrently. File identification tasks are processed asynchronously. Machine learning algorithms are used to adaptively adjust the priority of concurrent identification. A Kafka message queue is used to process the asynchronous identification results.

Benefits of technology

It improves document recognition efficiency, reduces the risk of recognition timeout, and enhances the operational efficiency of traders and the system's overall processing capabilities.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a file identification method and device, a storage medium and an electronic device. It relates to the field of financial technology and other related technical fields. The method comprises the following steps: obtaining N to-be-identified files, determining the number of to-be-identified features in each to-be-identified file to obtain N target numbers; inputting the N target numbers, N and system performance data into a target model to obtain a concurrent number of an identification engine and an identification time sequence of the to-be-identified files; dividing the N to-be-identified files into S groups of to-be-identified files according to the identification time sequence, wherein S is less than N, S is a positive integer, and the number of to-be-identified files contained in each group of to-be-identified files is less than or equal to the concurrent number; and identifying the S groups of to-be-identified files through the identification engine to obtain an identification result of each to-be-identified file. Through the application, the problem of low identification efficiency when a large number of files are identified in the related art is solved.

Description

Technical Field

[0001] This application relates to the fields of financial technology and other related technologies, and more specifically, to a document recognition method, apparatus, storage medium and electronic device.

Background Technology

[0002] In the field of agency trading in financial markets, overseas institutional investors need to authorize domestic traders, through authorization letters, to conduct transactions on their behalf within China. To improve trading efficiency, OCR (Optical Character Recognition) technology is used to extract transaction elements from authorization letters and automatically record transaction information. Specifically, OCR recognition employs a synchronous processing method: after the authorization letter document is uploaded to the authorization letter bookkeeping system, the system waits synchronously for the recognition result and then performs verification and other steps to achieve automatic bookkeeping.

[0003] However, on special trading days, due to the large number of orders, if each order document is submitted to the system individually before OCR recognition, the trader would have to perform the operation in the system once for each order document, which is inefficient. Furthermore, using a synchronous method to recognize a single order document each time cannot fully utilize the OCR engine's performance. In addition, an order document may contain multiple transactions to be recognized; using a synchronous waiting method for OCR recognition may result in situations where transaction elements cannot be extracted due to recognition timeouts.

[0004] There is currently no effective solution to the problem of low recognition efficiency when recognizing large batches of files in related technologies.

Summary of the Invention

[0005] The main objective of this application is to provide a document recognition method, apparatus, storage medium, and electronic device to solve the problem of low recognition efficiency when recognizing large batches of documents in related technologies.

[0006] To achieve the above objectives, according to one aspect of this application, a document recognition method is provided. The method includes: acquiring N files to be identified, determining the number of features to be identified in each file to obtain N target quantities, where N is a positive integer; determining the recognition engine used to identify the files to be identified, and acquiring the system performance data of the recognition engine; inputting the N target quantities, N, and system performance data into the target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be identified, where the concurrency is the number of threads of the recognition engine that concurrently identify files, and the recognition time sequence is a sequence of files to be identified ordered from low to high according to the time consumed by the recognition engine; the target model is trained by M sets of training samples, each set of training samples including a set of historical target quantities, historical number of files to be identified, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer; dividing the N files to be identified into S groups of files to be identified according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be identified in each group is less than or equal to the concurrency; and identifying the S groups of files to be identified by the recognition engine to obtain the recognition result of each file to be identified.

[0007] Optionally, obtaining N files to be identified includes: receiving files to be identified within a preset period, determining whether the number of files to be identified received within the preset period is greater than or equal to N; if the number of files to be identified received within the preset period is greater than or equal to N, generating a file identification task, and sending the file identification task to the identification engine. The file identification task includes a task number and N files to be identified, where the task number is a number representing the files to be identified that need to be identified concurrently.

[0008] Optionally, after obtaining the recognition result for each file to be recognized, the method further includes: sending the recognition result for each file to be recognized to a message queue; retrieving the recognition result from the message queue and parsing the recognition result to obtain the recognition content and task number corresponding to the recognition result; and storing the recognition content in the storage area corresponding to the task number, wherein the recognition result for each file recognition task is stored in the storage area corresponding to the task number.

[0009] Optionally, the target model is obtained by: acquiring the number of historical targets, the number of historical files to be identified, historical system performance data, historical concurrency, and historical identification time sequence; determining the number of historical targets, the number of historical files to be identified, historical system performance data, historical concurrency, and historical identification time sequence as the historical data training set; and inputting the historical data training set into the elastic network regression algorithm model for training to obtain the target model.

[0010] Optionally, identifying the S groups of files to be recognized through the recognition engine to obtain the recognition result for each file to be recognized includes: recognizing the Kth group of files to be recognized in ascending order of K, where K is less than or equal to S and K is a positive integer; and calling each thread in the recognition engine to recognize one file to be recognized in the Kth group, where the number of threads called by the recognition engine at the same time is less than or equal to the concurrency.

[0011] Optionally, determining the number of features to be identified in each file to be identified includes: determining the type of each file to be identified, and obtaining the number of features to be identified in the file to be identified from a preset correspondence list based on the type. The preset correspondence list contains multiple correspondences, and each correspondence contains a type of file to be identified and the number of features to be identified in that type of file.

[0012] Optionally, system performance data may include at least one of the following: CPU utilization, memory usage, and data read / write rate.

[0013] To achieve the above objectives, according to another aspect of this application, a document recognition apparatus is provided. The apparatus includes: an acquisition unit, configured to acquire N documents to be recognized and determine the number of features to be recognized in each document to obtain N target quantities, where N is a positive integer; a determination unit, configured to determine a recognition engine for recognizing the documents to be recognized and acquire system performance data of the recognition engine; an input unit, configured to input the N target quantities, N, and the system performance data into a target model to obtain the concurrency of the recognition engine and the recognition time sequence of the documents to be recognized, where the concurrency is the number of threads of the recognition engine that concurrently recognize documents, the recognition time sequence is the sequence obtained by ordering the documents to be recognized from low to high according to the time the recognition engine takes to recognize them, and the target model is trained from M sets of training samples, each set including a set of historical target quantities, a historical number of files to be identified, historical system performance data, a historical concurrency, and a historical recognition time sequence, where M is a positive integer; a partitioning unit, configured to divide the N files to be identified into S groups of files to be identified according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be identified in each group is less than or equal to the concurrency; and a recognition unit, configured to identify the S groups of files to be identified through the recognition engine and obtain the recognition result for each file to be identified.

[0014] This application employs the following steps: obtaining N files to be identified; determining the number of features to be identified in each file to obtain N target quantities, where N is a positive integer; determining the recognition engine used to identify the files to be identified, and obtaining the system performance data of the recognition engine; inputting the N target quantities, N, and the system performance data into the target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be identified, where the concurrency is the number of threads used by the recognition engine to concurrently identify files, the recognition time sequence is obtained by sorting the files to be identified in ascending order of the time the recognition engine takes to recognize them, and the target model is trained using M sets of training samples, each set including a set of historical target quantities, a historical number of files to be identified, historical system performance data, a historical concurrency, and a historical recognition time sequence, where M is a positive integer; dividing the N files to be identified into S groups according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be identified in each group is less than or equal to the concurrency; and identifying the S groups of files to be identified through the recognition engine to obtain the recognition result for each file. This solves the problem of low recognition efficiency when identifying large batches of files in related technologies. By obtaining the concurrency and the recognition time sequence for multiple files to be identified, and having the recognition engine identify the files concurrently on that basis, file recognition efficiency is improved.

Attached Figure Description

[0015] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:

[0016] Figure 1 is a flowchart of a document recognition method provided according to an embodiment of this application;

[0017] Figure 2 is a flowchart of an optional file recognition method provided according to an embodiment of this application;

[0018] Figure 3 is a schematic diagram of a document recognition device provided according to an embodiment of this application;

[0019] Figure 4 is a schematic diagram of an electronic device provided according to an embodiment of this application.

Detailed Implementation

[0020] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0021] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present application.

[0022] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0023] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties.

[0024] The present invention will now be described in conjunction with preferred implementation steps. Figure 1 is a flowchart of a document recognition method provided according to an embodiment of this application. As shown in Figure 1, the method includes the following steps:

[0025] Step S101: Obtain N files to be identified, and determine the number of features to be identified in each file to obtain N target quantities, where N is a positive integer.

[0026] Specifically, the document to be identified can be a power of attorney document, which refers to a document authorizing a domestic trader to conduct transactions on behalf of the investor within China. The features to be identified can be transaction elements within the power of attorney document that need to be identified, such as the transaction date, transaction location, and transaction type. The document identification method provided in this application supports uploading N power of attorney documents to be identified at once.

[0027] For example, the order bookkeeping system supports uploading multiple order documents at once, so traders can submit multiple order documents to be recognized simultaneously. One order document can contain multiple transactions. The trader selects N order documents to be recognized and uploads them. After uploading, the order recognition task is submitted to the OCR recognition engine.

[0028] Step S102: Determine the recognition engine used to identify the file to be identified, and obtain the system performance data of the recognition engine.

[0029] Specifically, the recognition engine can be an OCR recognition engine, and system performance data can include CPU utilization, memory usage, and data read / write speed. Since the system performance data of the recognition engine varies at different times, it is necessary to obtain the current system performance data of the recognition engine to determine the number of concurrent processing threads that the recognition engine can support during this file recognition.

[0030] Step S103: Input the N target quantities, N, and system performance data into the target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized. The concurrency is the number of threads of the recognition engine that recognize files concurrently, and the recognition time sequence is the sequence of files to be recognized ordered from low to high according to the time consumed by the recognition engine. The target model is trained by M sets of training samples. Each set of training samples includes a set of historical target quantities, historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer.

[0031] Specifically, the target model can be an elastic network regression model. The OCR recognition engine adaptively generates the concurrency level of the recognition engine and the recognition time sequence of N files based on the current system performance data, the number of files to be recognized (N), and the number of features in the files to be recognized. The recognition time sequence is a sequence arranged in ascending order of the time required for the OCR recognition engine to recognize each file.

[0032] It should be noted that the elastic net (ElasticNet) regression model is a model that can handle nonlinearly separable data. It can fit a surface suited to the data points and is applicable to situations where there is high collinearity between feature variables (in this application, the features to be identified), i.e., where two feature values have an approximately linear correlation. The model is also used to reduce the impact on model training of the correlation among indicators of the OCR recognition engine, such as CPU usage and memory usage.
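For reference, the elastic net objective in its commonly used form is given below. The application does not state the objective or its hyperparameters; the symbols $X$ (feature matrix), $y$ (targets), $n$ (number of samples), $\lambda$ (overall penalty strength) and $\alpha$ (mix between the L1 and L2 penalties) are notation assumed here for illustration only.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \,\lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

The L2 term is what stabilizes the coefficients when feature variables are highly collinear, while the L1 term can drive some coefficients to exactly zero.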

[0033] Step S104: Divide the N files to be identified into S groups of files to be identified according to the identification time sequence, where S is less than N, S is a positive integer, and the number of files to be identified in each group is less than or equal to the number of concurrent processes.

[0034] Specifically, the OCR recognition engine concurrently recognizes the commission documents in ascending order of recognition time. For example, suppose the files to be recognized, arranged according to the recognition time sequence, are files A, B, C, D, and E (5 files in total), and the concurrency is 2. The files are then divided into a first group containing files A and B, a second group containing files C and D, and a third group containing file E.
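The grouping in this example can be sketched as follows. This is a minimal illustration rather than the implementation in the application: the function name `split_into_groups` is hypothetical, and the input is assumed to be already ordered by the recognition time sequence.

```python
from typing import List, Sequence


def split_into_groups(ordered_files: Sequence[str], concurrency: int) -> List[List[str]]:
    """Split files, already ordered by the recognition time sequence, into
    consecutive groups holding at most `concurrency` files each."""
    if concurrency < 1:
        raise ValueError("concurrency must be a positive integer")
    return [
        list(ordered_files[start:start + concurrency])
        for start in range(0, len(ordered_files), concurrency)
    ]


# Files A..E with a concurrency of 2, as in the example above.
groups = split_into_groups(["A", "B", "C", "D", "E"], concurrency=2)
print(groups)  # [['A', 'B'], ['C', 'D'], ['E']]
```

Only the last group can be smaller than the concurrency, so the number of groups S is the ceiling of N divided by the concurrency.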

[0035] Step S105: The recognition engine identifies the S groups of files to be recognized, and the recognition result of each file to be recognized is obtained.

[0036] Specifically, the recognition engine processes one group of files from the S groups of files to be recognized at a time, according to the concurrency level. For example, it first recognizes files A and B from the first group of files to be recognized, then recognizes files C and D from the second group of files to be recognized, and finally recognizes file E from the third group of files to be recognized.

[0037] It should be noted that OCR refers to the process by which an electronic device examines printed characters, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using character recognition methods.

[0038] The file recognition method provided in this application involves: acquiring N files to be recognized, and determining the number of features to be recognized in each file to obtain N target quantities, where N is a positive integer; determining a recognition engine for recognizing the files to be recognized, and acquiring system performance data of the recognition engine; inputting the N target quantities, N, and the system performance data into a target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized, where the concurrency is the number of threads of the recognition engine that concurrently recognize files, the recognition time sequence is the sequence obtained by ordering the files to be recognized from low to high according to the time the recognition engine takes to recognize them, and the target model is trained from M sets of training samples, each set including a set of historical target quantities, a historical number of files to be identified, historical system performance data, a historical concurrency, and a historical recognition time sequence, where M is a positive integer; dividing the N files to be identified into S groups according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be identified in each group is less than or equal to the concurrency; and identifying the S groups of files to be identified through the recognition engine to obtain the recognition result for each file. This solves the problem of low recognition efficiency when recognizing large batches of files in related technologies. By obtaining the concurrency and the recognition time sequence for multiple files to be identified, and then having the recognition engine recognize the files concurrently on that basis, the efficiency of file recognition is improved.

[0039] When the number of files to be identified that have been received reaches N, a file recognition task is generated. Optionally, in the file recognition method provided in this application embodiment, obtaining N files to be identified includes: receiving files to be identified within a preset period, and determining whether the number of files to be identified received within the preset period is greater than or equal to N; if the number is greater than or equal to N, generating a file recognition task and sending the file recognition task to the recognition engine. The file recognition task includes a task number and the N files to be identified, where the task number identifies the files that need to be identified concurrently.

[0040] Specifically, the preset period can be half an hour. To improve document recognition efficiency, the power of attorney bookkeeping system receives documents to be recognized within the half hour. When the number of documents to be recognized reaches N, a document recognition task is generated, and a unique task number is added to the task. The task number can be used to distinguish the documents to be recognized in different document recognition tasks. During recognition of the power of attorney through OCR, the document recognition task is submitted asynchronously: after one or more document recognition tasks are submitted, the OCR recognition engine synchronously returns a unique task number, indicating that the document recognition task has been received. The task number then serves as the unique identifier for processing the recognition result. Document recognition efficiency is improved by supporting the submission of multiple documents to be recognized at one time and by supporting parallel recognition of multiple transactions in a single power of attorney document.
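A minimal sketch of this task-generation step is given below, under several assumptions not found in the application: the class and field names (`PeriodBuffer`, `RecognitionTask`, `batch_size`) are hypothetical, the task number is generated with a UUID, and the half-hour timer that resets the buffer at the end of each period is left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecognitionTask:
    task_number: str   # unique identifier returned to the submitter
    files: List[str]   # the N files recognized together under this task


@dataclass
class PeriodBuffer:
    """Collects the files received during the current preset period."""
    batch_size: int                                  # N in the text
    pending: List[str] = field(default_factory=list)

    def receive(self, file_name: str) -> Optional[RecognitionTask]:
        """Buffer one file; return a task once N files have accumulated."""
        self.pending.append(file_name)
        if len(self.pending) >= self.batch_size:
            n = self.batch_size
            batch, self.pending = self.pending[:n], self.pending[n:]
            # Assumption: any unique string works as a task number.
            return RecognitionTask(task_number=uuid.uuid4().hex, files=batch)
        return None


buffer = PeriodBuffer(batch_size=3)
outcomes = [buffer.receive(name) for name in ["a.pdf", "b.pdf", "c.pdf"]]
task = outcomes[-1]  # the third file completes the batch and yields a task
```

Returning the task number immediately, before any recognition has run, is what lets the caller continue without waiting and later match results to the task.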

[0041] After obtaining the recognition result, the recognition result is registered in the bookkeeping system. Optionally, in the document recognition method provided in this application embodiment, after obtaining the recognition result of each document to be recognized, the method further includes: sending the recognition result of each document to be recognized to a message queue; obtaining the recognition result from the message queue and parsing the recognition result to obtain the recognition content and task number corresponding to the recognition result; storing the recognition content in the storage area corresponding to the task number, wherein the recognition result of each document recognition task is stored in the storage area corresponding to the task number.

[0042] Specifically, the recognition result of each file to be recognized and its corresponding task number are sent to the Kafka message queue. The authorization bookkeeping system consumes the recognition results from the Kafka message queue in real time, that is, it listens to the messages in the Kafka message queue in real time. After consuming a message, it first parses the task number from the recognition result, then parses the recognition content, extracts and adaptively supplements the authorization element information in the recognition result, and finally writes the authorization element information to the storage area corresponding to the task number. By registering the recognition results in the storage area corresponding to the task number, the bookkeeping process is automated.
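The consume, parse and store flow can be illustrated with the sketch below. It makes no claim about the actual system: an in-process `queue.Queue` stands in for the Kafka topic, the JSON message layout with `task_number` and `content` fields is assumed, and a dictionary stands in for the per-task storage areas.

```python
import json
import queue
from collections import defaultdict
from typing import Dict, List


def drain_results(message_queue: "queue.Queue[str]",
                  storage: Dict[str, List[dict]]) -> int:
    """Consume every queued message and file its content under its task number."""
    consumed = 0
    while True:
        try:
            raw = message_queue.get_nowait()
        except queue.Empty:
            return consumed
        message = json.loads(raw)             # parse the recognition result
        task_number = message["task_number"]  # assumed field name
        storage[task_number].append(message["content"])
        consumed += 1


# Stand-in for the message queue, filled with three hypothetical results.
mq: "queue.Queue[str]" = queue.Queue()
mq.put(json.dumps({"task_number": "T1", "content": {"file": "A", "trade_type": "spot"}}))
mq.put(json.dumps({"task_number": "T2", "content": {"file": "C", "trade_type": "swap"}}))
mq.put(json.dumps({"task_number": "T1", "content": {"file": "B", "trade_type": "spot"}}))

storage: Dict[str, List[dict]] = defaultdict(list)
consumed = drain_results(mq, storage)
```

Because every message carries its task number, results from different tasks can arrive interleaved and still land in the right storage area.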

[0043] It should be noted that Kafka is an open-source stream processing platform developed by the Apache Software Foundation. It is a high-throughput, distributed publish-subscribe messaging system capable of handling all the action stream data of consumers on a website. Messages are pushed by upstream applications and buffered, waiting to be pushed to, or actively pulled by, downstream applications. It is widely used in distributed systems, primarily to address application decoupling, asynchronous messaging, and traffic shaping.

[0044] Before obtaining the number of concurrent processes of the recognition engine and the recognition time sequence of N files to be recognized, a target model needs to be trained. Optionally, in the file recognition method provided in this application embodiment, the target model is obtained in the following way: obtaining the number of historical targets, the number of historical files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence; determining the number of historical targets, the number of historical files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence as a historical data training set; inputting the historical data training set into the elastic network regression algorithm model for training to obtain the target model.

[0045] Specifically, the ElasticNet Regression machine learning algorithm is applied to train the target model and generate the concurrency parameters of the recognition engine. The historical system parameters of the OCR recognition engine (CPU utilization, memory usage, and IO metrics, i.e., the historical system performance data), the number of OCR tasks (i.e., the historical number of files to be recognized), and the number of elements to be extracted for each task (i.e., the historical target quantities) are preprocessed to form a historical data training set. This set is input into the ElasticNet Regression algorithm model to train a target model with specific adaptive parameters. In this way, the target model is trained by machine learning from the OCR recognition engine's system performance, the number of tasks to be recognized, the task features, and the historical data set. Based on this target model, the concurrency of the OCR engine and the processing priority of the tasks to be recognized are adaptively generated, ultimately achieving the ability to recognize multiple transactions concurrently. This solves the timeout problem of synchronous recognition and the low recognition efficiency when there are a large number of files to be recognized.
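The application does not disclose how the model is fitted. As a rough illustration only, the sketch below fits an elastic net by plain coordinate descent in pure Python. The function names, the hyperparameters `lam` and `l1_ratio`, the omission of an intercept, and the toy data are all assumptions; in practice a library implementation would normally be used.

```python
from typing import List


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink `value` toward zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X: List[List[float]], y: List[float], lam: float = 0.1,
                    l1_ratio: float = 0.5, n_iter: int = 200) -> List[float]:
    """Minimize (1/2n)*||y - Xw||^2 + lam*(l1_ratio*||w||_1
    + (1 - l1_ratio)/2 * ||w||_2^2) by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            denom = z + lam * (1.0 - l1_ratio)
            w[j] = soft_threshold(rho, lam * l1_ratio) / denom if denom > 0 else 0.0
    return w


# Hypothetical toy data: one feature per sample and an observed concurrency.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
w_plain = fit_elastic_net(X, y, lam=0.0)                 # no penalty: least squares
w_heavy = fit_elastic_net(X, y, lam=100.0, l1_ratio=1.0)  # strong L1: coefficient zeroed
```

In the setting described above, each row of `X` would hold the preprocessed historical features (performance metrics, file count, target quantities) and `y` the historical concurrency; how the recognition time sequence is produced by the model is not specified in the text and is not modeled here.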

[0046] Concurrent file recognition is achieved by using multiple threads in the recognition engine. Optionally, in the file recognition method provided in this application embodiment, identifying the S groups of files to be recognized through the recognition engine to obtain the recognition result of each file to be recognized includes: recognizing the Kth group of files to be recognized in ascending order of K, where K is less than or equal to S and K is a positive integer; and calling each thread in the recognition engine to recognize one file to be recognized in the Kth group, where the number of threads called by the recognition engine at the same time is less than or equal to the concurrency.

[0047] For example, suppose five groups of files to be identified are obtained by grouping the files according to the recognition time sequence. Recognition begins with the first group, followed by the second group, and so on, until the fifth group has been recognized. Within each group, each thread called in the recognition engine recognizes one file of that group, so the number of threads called simultaneously by the recognition engine equals the number of files contained in that group.
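The group-by-group processing can be sketched with a standard thread pool. This is an illustration under assumptions: `recognize_groups` and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for a call to a real recognition engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def recognize_groups(groups: List[List[str]], concurrency: int,
                     recognize: Callable[[str], str]) -> Dict[str, str]:
    """Recognize groups one after another; files inside a group run in parallel."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for group in groups:  # ascending order of K
            # Consuming pool.map() waits for every file in this group,
            # so group K+1 is not submitted before group K has finished.
            for name, text in zip(group, pool.map(recognize, group)):
                results[name] = text
    return results


def fake_ocr(name: str) -> str:
    """Placeholder for the recognition engine call (assumption)."""
    return f"text-of-{name}"


results = recognize_groups([["A", "B"], ["C", "D"], ["E"]],
                           concurrency=2, recognize=fake_ocr)
```

Since each group holds at most `concurrency` files, the pool never has more than that many files in flight at once.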

[0048] It should be noted that the efficiency of OCR recognition varies with the file size and the number of valid transaction elements contained in different authorization forms. To improve overall efficiency and fully utilize the resources of the recognition engine when processing multiple document recognition tasks simultaneously, the documents to be recognized are sorted by estimated recognition time before concurrent recognition. The recognition time of the current document is estimated from the recognition time of each type of authorization form in the historical recognition process, and the documents are then sorted in ascending order of time. This allows the OCR recognition engine to recognize transaction elements concurrently, reducing the waiting time in the parsing process and thus shortening the overall execution time. By adaptively calculating the concurrency and adjusting the order of the authorization forms to be recognized, i.e., determining the recognition time sequence, the manual costs of traders in the proxy authorization transaction process are reduced, and the overall processing efficiency of the system is improved.
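One simple way to estimate and sort by recognition time is sketched below. It is an assumption of this example, not something stated in the source, that the estimate is the mean of the historical times for each form type and that unknown types fall back to a default; the names `estimate_times` and `order_by_estimated_time` are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def estimate_times(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the historical recognition times (in seconds) recorded per form type."""
    return {form_type: mean(times) for form_type, times in history.items() if times}


def order_by_estimated_time(
    files: List[Tuple[str, str]],
    estimates: Dict[str, float],
    default_seconds: float,
) -> List[str]:
    """Return file names sorted from the lowest to the highest estimated time.

    `files` holds (file_name, form_type) pairs; types with no history use
    `default_seconds`.
    """
    ranked = sorted(files, key=lambda item: estimates.get(item[1], default_seconds))
    return [name for name, _ in ranked]


# Usage with made-up history values.
estimates = estimate_times({"A": [3.0, 5.0], "B": [1.0, 2.0]})
sequence = order_by_estimated_time(
    [("f1", "A"), ("f2", "B"), ("f3", "C")], estimates, default_seconds=2.0
)
```

Python's `sorted` is stable, so files with equal estimates keep their submission order.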

[0049] Optionally, in the file recognition method provided in this application embodiment, determining the number of features to be recognized in each file to be recognized includes: determining the type of each file to be recognized, and obtaining the number of features to be recognized in the file to be recognized from a preset correspondence list based on the type. The preset correspondence list contains multiple correspondences, and each correspondence contains a type of file to be recognized and the number of features to be recognized in that type of file.

[0050] Specifically, the number of features to be identified varies depending on the type of file to be identified. For example, file A requires identification of three transaction elements: transaction time, transaction amount, and transaction period. File B requires identification of two transaction elements: transaction time and transaction amount. A preset mapping list includes the number of features to be identified for each type of file. Determining the number of features for each file facilitates automatic adjustment of the recognition engine's concurrency.
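The preset correspondence list can be pictured as a simple lookup table. The sketch below mirrors the file A and file B example from the text; the dictionary name `FEATURES_BY_TYPE` and the function `target_quantities` are hypothetical.

```python
from typing import Dict, List, Tuple

# Preset correspondence list: file type -> transaction elements to be recognized.
FEATURES_BY_TYPE: Dict[str, Tuple[str, ...]] = {
    "A": ("transaction time", "transaction amount", "transaction period"),
    "B": ("transaction time", "transaction amount"),
}


def target_quantities(file_types: List[str]) -> List[int]:
    """Return one feature count per file; an unknown type raises KeyError."""
    return [len(FEATURES_BY_TYPE[file_type]) for file_type in file_types]


# Three files of types A, B, A yield the N = 3 target quantities.
counts = target_quantities(["A", "B", "A"])
```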

[0051] Optionally, in the file recognition method provided in the embodiments of this application, the system performance data includes at least one of the following: CPU utilization, memory usage, and data read / write rate.

[0052] Specifically, system performance data that affects file recognition efficiency can also include response time, and is not limited to CPU utilization, memory usage, and data read / write speed.

[0053] According to another embodiment of this application, an optional file recognition method is also provided. Figure 2 is a flowchart of an optional file recognition method provided according to an embodiment of this application. As shown in Figure 2, the method includes:

[0054] Traders simultaneously submit multiple proxy orders to be recognized in the order bookkeeping system, with each proxy order potentially containing multiple transactions. The order bookkeeping system submits recognition requests to the OCR recognition engine, which synchronously returns a unique request number to the order bookkeeping system; one request number corresponds to one file. The OCR recognition engine adaptively generates the recognition concurrency based on current system performance, the number of files to be recognized, and the characteristics of the files, and sorts the orders by estimated recognition time based on their historical recognition times. The OCR recognition engine recognizes the proxy orders concurrently in ascending order of estimated time and sends the recognition results and their corresponding request numbers to Kafka. The order bookkeeping system consumes the recognition results from Kafka in real time, parses and adjusts them, and automatically records them in the order bookkeeping system.

[0055] The optional document recognition method provided in this application introduces a Kafka message queue to send recognition results to Kafka when recognizing proxy authorization transactions using OCR, achieving asynchronous recognition capabilities. It also supports uploading multiple authorization documents at once and enabling parallel recognition. Furthermore, by introducing machine learning algorithms, the concurrent processing capacity of the OCR recognition engine is generated based on features such as the system parameters of the current OCR recognition engine, the number of documents to be recognized, and the number of transaction elements to be recognized in the documents. The recognition priority of the documents to be recognized can be adaptively adjusted, thereby improving the overall efficiency of batch recognition. The introduction of Kafka for asynchronous recognition solves the request timeout problem that occurs when synchronous recognition is used for a single authorization document containing multiple transactions.
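The asynchronous flow can be sketched as follows. To keep the example self-contained, an in-process `queue.Queue` stands in for the Kafka topic; a real deployment would use a Kafka client instead. The class `OcrEngine`, the function `consume`, and the message fields are assumptions made for illustration.

```python
import queue
import threading
import uuid
from typing import Dict, List


class OcrEngine:
    """Toy engine: returns a request number at once, publishes the result later."""

    def __init__(self, topic: "queue.Queue[dict]") -> None:
        self.topic = topic  # stand-in for a Kafka topic
        self._workers: List[threading.Thread] = []

    def submit(self, file_name: str) -> str:
        request_number = uuid.uuid4().hex  # one request number per file
        worker = threading.Thread(target=self._recognize, args=(request_number, file_name))
        worker.start()
        self._workers.append(worker)
        return request_number  # synchronous response; recognition continues in background

    def _recognize(self, request_number: str, file_name: str) -> None:
        text = f"recognized:{file_name}"  # placeholder for the real OCR output
        self.topic.put({"request_number": request_number, "result": text})

    def join(self) -> None:
        for worker in self._workers:
            worker.join()


def consume(topic: "queue.Queue[dict]", expected: int) -> Dict[str, str]:
    """Order bookkeeping side: read results and record them by request number."""
    ledger: Dict[str, str] = {}
    while len(ledger) < expected:
        message = topic.get(timeout=5)
        ledger[message["request_number"]] = message["result"]
    return ledger


topic: "queue.Queue[dict]" = queue.Queue()
engine = OcrEngine(topic)
request_numbers = [engine.submit(name) for name in ["x.pdf", "y.pdf"]]
ledger = consume(topic, expected=len(request_numbers))
engine.join()
```

Because `submit` returns before recognition finishes, the caller never waits on a long-running request, which is how the asynchronous design avoids the timeout seen with synchronous recognition.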

[0056] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.

[0057] This application also provides a file recognition device. It should be noted that the file recognition device of this application can be used to execute the file recognition method provided in this application. The file recognition device provided in this application is described below.

[0058] Figure 3 is a schematic diagram of a document recognition device provided according to an embodiment of this application. As shown in Figure 3, the device includes:

[0059] The acquisition unit 10 is used to acquire N files to be identified, determine the number of features to be identified in each file, and obtain N target counts, where N is a positive integer;

[0060] The determining unit 20 is used to determine the recognition engine used to identify the file to be identified, and to obtain the system performance data of the recognition engine;

[0061] Input unit 30 is used to input N target quantities, N, and system performance data into the target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized. The concurrency is the number of threads of the recognition engine that recognize files concurrently, and the recognition time sequence is the sequence of files to be recognized ordered from low to high according to the time consumed by the recognition engine. The target model is trained by M sets of training samples. Each set of training samples includes a set of historical target quantities, historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer.

[0062] The partitioning unit 40 is used to divide the N files to be recognized into S groups of files to be recognized according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be recognized in each group is less than or equal to the concurrency.

[0063] The recognition unit 50 is used to recognize the S groups of files to be recognized through the recognition engine and obtain the recognition result of each file to be recognized.

[0064] In the file recognition device provided in this application embodiment, the acquisition unit 10 acquires N files to be recognized, determines the number of features to be recognized in each file, and obtains N target quantities, where N is a positive integer; the determination unit 20 determines a recognition engine for recognizing the files to be recognized and acquires system performance data of the recognition engine; the input unit 30 inputs the N target quantities, N, and the system performance data into a target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized, where the concurrency is the number of threads of the recognition engine that recognize files concurrently, and the recognition time sequence is a sequence of files to be recognized ordered from low to high according to the time consumed when they are recognized by the recognition engine. The target model is trained by M sets of training samples, each set of training samples including a set of historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer. The partitioning unit 40 divides the N files to be recognized into S groups according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be recognized in each group is less than or equal to the concurrency. The recognition unit 50 recognizes the S groups of files to be recognized through the recognition engine and obtains the recognition result of each file. This solves the problem of low recognition efficiency when recognizing large batches of files in related technologies. By inputting the target quantities of the files to be recognized, the number of files, and the system performance data into the target model to obtain the concurrency and the recognition time sequence, the recognition engine recognizes the files concurrently based on the concurrency and the recognition time sequence, thereby improving file recognition efficiency.

[0065] Optionally, in the file recognition device provided in this application embodiment, the acquisition unit 10 includes: a receiving module, used to receive files to be recognized within a preset period, and determine whether the number of files to be recognized received within the preset period is greater than or equal to N; and a generation module, used to generate a file recognition task when the number of files to be recognized received within the preset period is greater than or equal to N, and send the file recognition task to the recognition engine, wherein the file recognition task includes a task number and N files to be recognized, and the task number is a number representing the files to be recognized that need to be recognized simultaneously.
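The receiving and generation modules can be sketched as a small batch collector. The source does not say what happens to files when fewer than N arrive within a period; this sketch assumes they are carried over to the next period. `BatchCollector`, `RecognitionTask`, and `close_period` are hypothetical names.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RecognitionTask:
    task_number: int        # identifies the files that must be recognized together
    files: Tuple[str, ...]  # the N files to be recognized


class BatchCollector:
    """Collect incoming files and emit a task once at least N have arrived."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be a positive integer")
        self.n = n
        self._pending: List[str] = []
        self._numbers = itertools.count(1)

    def receive(self, file_name: str) -> None:
        self._pending.append(file_name)

    def close_period(self) -> Optional[RecognitionTask]:
        """Call at the end of each preset period; returns a task or None."""
        if len(self._pending) < self.n:
            return None  # assumption: keep waiting files for the next period
        batch, self._pending = self._pending[:self.n], self._pending[self.n:]
        return RecognitionTask(next(self._numbers), tuple(batch))


collector = BatchCollector(n=2)
collector.receive("a")
first = collector.close_period()   # only one file so far, so no task yet
collector.receive("b")
collector.receive("c")
second = collector.close_period()  # two files are bundled; "c" stays pending
```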

[0066] Optionally, in the file recognition device provided in this application embodiment, the device further includes: a sending unit, used to send the recognition result of each file to be recognized to a message queue; a parsing unit, used to obtain the recognition result from the message queue and parse the recognition result to obtain the recognition content and task number corresponding to the recognition result; and a storage unit, used to store the recognition content in a storage area corresponding to the task number, wherein the recognition result of each file recognition task is stored in the storage area corresponding to the task number.
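The parsing and storage units can be sketched as follows. The JSON message layout with `task_number` and `content` fields is an assumption for this example, as are the names `parse_result` and `store_result`; the source does not specify a message format.

```python
import json
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


def parse_result(message: bytes) -> Tuple[int, Dict[str, str]]:
    """Split one queue message into its task number and recognition content."""
    payload = json.loads(message)
    return payload["task_number"], payload["content"]


def store_result(storage: DefaultDict[int, List[Dict[str, str]]], message: bytes) -> None:
    """Append the recognition content to the storage area of its task number."""
    task_number, content = parse_result(message)
    storage[task_number].append(content)


# Usage: two results for task 1 and one for task 2 land in separate areas.
storage: DefaultDict[int, List[Dict[str, str]]] = defaultdict(list)
for raw in (
    b'{"task_number": 1, "content": {"transaction amount": "100"}}',
    b'{"task_number": 2, "content": {"transaction amount": "250"}}',
    b'{"task_number": 1, "content": {"transaction time": "09:30"}}',
):
    store_result(storage, raw)
```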

[0067] Optionally, in the file recognition device provided in this application embodiment, the target model is obtained in the following way: acquiring historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequences; determining the historical target quantities, the historical number of files to be recognized, the historical system performance data, the historical concurrency, and the historical recognition time sequences as a historical data training set; and inputting the historical data training set into the elastic net regression algorithm model for training to obtain the target model.

[0068] Optionally, in the file recognition device provided in the embodiments of this application, the recognition unit 50 includes: a first recognition module, used to recognize the Kth group of files to be recognized in ascending order of K, wherein K is less than or equal to S and K is a positive integer; and a calling module, used to call each thread in the recognition engine to recognize one of the files to be recognized in the Kth group of files to be recognized, wherein the number of threads called by the recognition engine at the same time is less than or equal to the concurrency.

[0069] Optionally, in the file recognition device provided in this application embodiment, the acquisition unit 10 includes: a determination module, used to determine the type of each file to be recognized, and to obtain the number of features to be recognized of the file to be recognized from a preset correspondence list based on the type, wherein the preset correspondence list contains multiple correspondences, and each correspondence contains a type of file to be recognized and the number of features to be recognized of that type of file.

[0070] Optionally, in the file recognition device provided in this application embodiment, the system performance data includes at least one of the following: central processing unit utilization, memory occupancy rate, and data read / write rate.

[0071] The document recognition device includes a processor and a memory. The aforementioned acquisition unit 10, determination unit 20, input unit 30, division unit 40, and recognition unit 50 are all stored in the memory as program units. The processor executes the aforementioned program units stored in the memory to realize the corresponding functions.

[0072] The processor contains a kernel, which retrieves the corresponding program units from memory. One or more kernels can be configured, and file recognition efficiency can be improved by adjusting kernel parameters.

[0073] The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and / or non-volatile memory, such as read-only memory (ROM) or flash RAM, and the memory includes at least one memory chip.

[0074] This invention provides a computer-readable storage medium storing a program that, when executed by a processor, implements a file recognition method.

[0075] This invention provides a processor for running a program, wherein the program executes a file identification method during runtime.

[0076] Figure 4 is a schematic diagram of an electronic device provided according to an embodiment of this application. As shown in Figure 4, the electronic device 401 includes a processor, a memory, and a program stored in the memory and executable on the processor. When the processor executes the program, it performs the following steps: acquiring N files to be recognized, determining the number of features to be recognized in each file to obtain N target quantities, where N is a positive integer; determining a recognition engine for recognizing the files to be recognized, and acquiring system performance data of the recognition engine; inputting the N target quantities, N, and the system performance data into a target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized, where the concurrency is the number of threads of the recognition engine that recognize files concurrently, the recognition time sequence is a sequence of files to be recognized ordered from low to high according to the time consumed when they are recognized by the recognition engine, and the target model is trained using M sets of training samples, each set of training samples including a set of historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer; dividing the N files to be recognized into S groups according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files in each group is less than or equal to the concurrency; and recognizing the S groups of files to be recognized through the recognition engine to obtain the recognition result of each file. The devices mentioned in this application can be servers, PCs, tablets, mobile phones, etc.

[0077] This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining N files to be recognized, determining the number of features to be recognized in each file to obtain N target quantities, where N is a positive integer; determining a recognition engine for recognizing the files to be recognized, and obtaining system performance data of the recognition engine; inputting the N target quantities, N, and the system performance data into a target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized, where the concurrency is the number of threads of the recognition engine that recognize files concurrently, the recognition time sequence is a sequence of files to be recognized ordered from low to high according to the time consumed when they are recognized by the recognition engine, and the target model is trained from M sets of training samples, each set of training samples including a set of historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer; dividing the N files to be recognized into S groups of files to be recognized according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be recognized in each group is less than or equal to the concurrency; and recognizing the S groups of files to be recognized through the recognition engine to obtain the recognition result of each file.

[0078] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0079] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0080] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0081] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0082] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0083] Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0084] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0085] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0087] The above are merely embodiments of this application and are not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A file recognition method, characterized in that it comprises: obtaining N files to be recognized, determining the number of features to be recognized in each file to be recognized, and obtaining N target quantities, where N is a positive integer; determining the recognition engine used to recognize the files to be recognized, and obtaining the system performance data of the recognition engine; inputting the N target quantities, N, and the system performance data into the target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized, wherein the concurrency is the number of threads of the recognition engine that concurrently recognize files, the recognition time sequence is a sequence of files to be recognized ordered from low to high according to the time consumed when they are recognized by the recognition engine, the target model is an elastic net regression model trained by M sets of training samples, and each set of training samples includes a set of historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer; dividing the N files to be recognized into S groups of files to be recognized according to the recognition time sequence, where S is less than N, S is a positive integer, and the number of files to be recognized in each group is less than or equal to the concurrency; and recognizing the S groups of files to be recognized through the recognition engine to obtain the recognition result of each file to be recognized.

2. The method according to claim 1, characterized in that obtaining the N files to be recognized comprises: receiving files to be recognized within a preset period, and determining whether the number of files to be recognized received within the preset period is greater than or equal to N; and if the number of files to be recognized received within the preset period is greater than or equal to N, generating a file recognition task and sending the file recognition task to the recognition engine, wherein the file recognition task includes a task number and the N files to be recognized, and the task number is a number representing the files to be recognized that need to be recognized simultaneously.

3. The method according to claim 2, characterized in that, after obtaining the recognition result of each file to be recognized, the method further comprises: sending the recognition result of each file to be recognized to a message queue; obtaining the recognition result from the message queue, and parsing the recognition result to obtain the recognition content and the task number corresponding to the recognition result; and storing the recognition content in the storage area corresponding to the task number, wherein the recognition result of each file recognition task is stored in the storage area corresponding to its task number.

4. The method according to claim 1, characterized in that the target model is obtained in the following way: obtaining historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequences; determining the historical target quantities, the historical number of files to be recognized, the historical system performance data, the historical concurrency, and the historical recognition time sequences as a historical data training set; and inputting the historical data training set into the elastic net regression algorithm model for training to obtain the target model.

5. The method according to claim 1, characterized in that recognizing the S groups of files to be recognized through the recognition engine to obtain the recognition result of each file to be recognized comprises: recognizing the Kth group of files to be recognized in ascending order of K, where K is less than or equal to S and K is a positive integer; and calling each thread in the recognition engine to recognize one of the files to be recognized in the Kth group of files to be recognized, wherein the number of threads called simultaneously by the recognition engine is less than or equal to the concurrency.

6. The method according to claim 1, characterized in that determining the number of features to be recognized in each file to be recognized comprises: determining the type of each file to be recognized, and obtaining the number of features to be recognized of the file to be recognized from a preset correspondence list based on the type, wherein the preset correspondence list contains multiple correspondences, and each correspondence contains a type of file to be recognized and the number of features to be recognized of that type of file.

7. The method according to claim 1, characterized in that the system performance data includes at least one of the following: CPU utilization, memory usage, and data read / write rate.

8. A document recognition device, characterized in that it comprises: an acquisition unit, used to acquire N files to be recognized, determine the number of features to be recognized in each file to be recognized, and obtain N target quantities, where N is a positive integer; a determining unit, used to determine the recognition engine used to recognize the files to be recognized, and to obtain the system performance data of the recognition engine; an input unit, used to input the N target quantities, N, and the system performance data into the target model to obtain the concurrency of the recognition engine and the recognition time sequence of the files to be recognized, wherein the concurrency is the number of threads of the recognition engine that concurrently recognize files, the recognition time sequence is a sequence of files to be recognized ordered from low to high according to the time consumed when they are recognized by the recognition engine, the target model is an elastic net regression model trained by M sets of training samples, and each set of training samples includes a set of historical target quantities, the historical number of files to be recognized, historical system performance data, historical concurrency, and historical recognition time sequence, where M is a positive integer; a partitioning unit, used to divide the N files to be recognized into S groups of files to be recognized according to the recognition time sequence, wherein S is less than N, S is a positive integer, and the number of files to be recognized in each group is less than or equal to the concurrency; and a recognition unit, used to recognize the S groups of files to be recognized through the recognition engine and obtain the recognition result of each file to be recognized.

9. A non-volatile storage medium, characterized in that the non-volatile storage medium includes a stored program, wherein the program, when running, controls the device where the non-volatile storage medium is located to execute the file recognition method according to any one of claims 1 to 7.

10. An electronic device, characterized in that it includes one or more processors and a memory, the memory being used to store one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the file recognition method according to any one of claims 1 to 7.