File analysis method and device, computer readable storage medium and electronic device

By logically splitting large files and parsing the resulting segments in parallel threads, the method addresses the lack of support for multiple storage methods and the low parsing efficiency of existing approaches, achieving efficient file parsing.

CN115757291B · Active · Publication Date: 2026-06-23 · CHINA EVERBRIGHT BANK

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA EVERBRIGHT BANK
Filing Date: 2022-11-15
Publication Date: 2026-06-23

Application Information

IPC: G06F16/16

AI Tagging

Application Domain

File/folder operations

Technology Topics

Programming language; Term memory

Technical Efficacy Phrases

Improve analysis efficiency; Solve the low efficiency of analysis

Abstract

Embodiments of the present application provide a file parsing method and device, a computer readable storage medium and an electronic device. The file parsing method comprises: splitting a first file into a plurality of first logical segments according to a logical splitting task; in the case that the length of the first logical segment does not exceed a preset first threshold, reading the first logical segment into a memory through a read interface corresponding to the storage type of the first file; and parsing the first logical segment in the memory in a thread corresponding to each first logical segment according to the parsing type of the first file. Through the present application, the problem that the file parsing method in the related art cannot support multiple storage modes and has low file parsing efficiency is solved, and the effect of improving the file parsing efficiency is achieved.

Description

Technical Field

[0001] The present invention relates to the field of file processing technology, and more specifically, to a file parsing method, apparatus, computer-readable storage medium, and electronic device.

Background Technology

[0002] In the financial sector, batch processing is a common business process, generating many large files. Large file parsing is a frequent and crucial business scenario in batch processing. Excessive parsing time for large files can significantly increase the overall batch processing time, reducing efficiency. Currently, the main method for large file parsing is single-machine parsing, but single-machine resources limit parsing speed. Furthermore, single-machine parsing involves file splitting and merging, generating numerous temporary small files and requiring excessive opening and closing of file resources, further contributing to low parsing efficiency. In addition, the storage locations for batch files are diverse, including Network Attached Storage (NAS) and object storage, and the file content formats are varied. Traditional large file parsing methods cannot flexibly support and expand storage methods or multiple file content formats, limiting their application scenarios.

Summary of the Invention

[0003] This invention provides a file parsing method, apparatus, computer-readable storage medium, and electronic device to at least solve the problems of file parsing methods in the related art being unable to support multiple storage methods and having low file parsing efficiency.

[0004] According to an embodiment of the present invention, a file parsing method is provided, comprising: splitting a first file into multiple first logical segments according to a logical splitting task; if the length of the first logical segment does not exceed a preset first threshold, reading the first logical segment into memory through a read interface corresponding to the storage type of the first file; and parsing the first logical segments in memory in threads corresponding to each first logical segment according to the parsing type of the first file.

[0005] In an exemplary embodiment, if the size of the first file is greater than a preset second threshold, the first file is split into multiple first logical fragments according to a logical splitting task.

[0006] In an exemplary embodiment, the logical splitting task includes at least one of the following parameters: file path, file storage method, number of logical shards, and parsing type.

[0007] In one exemplary embodiment, splitting a first file into multiple first logical fragments according to a logical splitting task includes: splitting the first file into multiple first logical fragments according to the total length of the first file and the number of fragments of the logical fragments.

[0008] In an exemplary embodiment, after splitting the first file into multiple first logical segments according to the logical splitting task, the method further includes: recording the length, file path, and starting position of each first logical segment.

[0009] In one exemplary embodiment, the parsing type includes: custom type and template type.

[0010] In an exemplary embodiment, the first logical slice in memory is parsed in threads corresponding to each of the first logical slices according to the parsing type of the first file, including: parsing the first logical slice through the SPI (Service Provider Interface) mechanism when the parsing type is a custom type; and parsing the first logical slice through a format template corresponding to the content format of the first file when the parsing type is a template type.

[0011] In an exemplary embodiment, before parsing the first logical segment using a format template corresponding to the content format of the first file, the method further includes: loading the format template into a folder under a predetermined file path.

[0012] In an exemplary embodiment, after parsing the first logical slice in memory in the thread corresponding to each of the first logical slices, the process includes: setting a process ID and a thread ID for the second file obtained by parsing the first logical slice, and writing the second file into a file path corresponding to the unit number of the first logical slice, wherein the unit number is preset.

[0013] In one exemplary embodiment, writing the second file into the file path corresponding to the unit number includes: if the file path of the second file has not been created, creating the file path of the second file and then writing the second file; if the file path of the second file has been created, determining whether the second file is being opened for the first time based on the file handle of the second file.

[0014] In one exemplary embodiment, after determining whether the second file is being opened for the first time based on the file handle of the second file, the method further includes: if the second file is being opened for the first time, recording the second file in a cache unit; if the second file is not being opened for the first time, writing the second file into a file path corresponding to the unit number.

[0015] In one exemplary embodiment, after parsing the first logical slice in memory in the thread corresponding to each of the first logical slices, the method further includes one of the following: if parsing the first logical slice fails, re-initiating the parsing task for the first logical slice; if parsing the first logical slice succeeds, merging the second files with the same unit number but different process number and thread number into the same file according to the merging task.

[0016] In an exemplary embodiment, the method further includes: if the length of the first logical segment exceeds the first threshold, performing a second logical split on the first logical segment to obtain multiple second logical segments; reading the second logical segments into memory through a read interface corresponding to the storage type of the first file; and parsing the second logical segments in memory in a thread corresponding to the second logical segments according to the parsing type of the first file.

[0017] According to another embodiment of the present invention, a file parsing apparatus is provided, comprising: a splitting module, configured to split a first file into multiple first logical segments according to a logical splitting task; a reading module, configured to read the first logical segments into memory through a read interface corresponding to the storage type of the first file when the length of the first logical segment does not exceed a preset first threshold; and a parsing module, configured to parse the first logical segments in memory respectively in threads corresponding to each first logical segment according to the parsing type of the first file.

[0018] According to yet another embodiment of the present invention, a computer-readable storage medium is also provided, wherein a computer program is stored therein, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.

[0019] According to yet another embodiment of the present invention, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.

[0020] Through the above embodiments of the present invention, the first logical segment can be read through read interfaces corresponding to different storage types, meaning that the embodiments of the present invention can support multiple storage types. Simultaneously, each logical segment has a corresponding thread for file parsing, resulting in higher parsing efficiency. Therefore, the problems of file parsing methods in related technologies being unable to support multiple storage methods and having low file parsing efficiency can be solved, thereby improving file parsing efficiency.

Attached Figure Description

[0021] Figure 1 is a hardware structure block diagram of a computer terminal for the file parsing method according to an embodiment of the present invention;

[0022] Figure 2 is a flowchart of a file parsing method according to an embodiment of the present invention;

[0023] Figure 3 is a structural block diagram of a file parsing apparatus according to an embodiment of the present invention;

[0024] Figure 4 is a flowchart of a parsing method based on a distributed framework according to an embodiment of the present invention;

[0025] Figure 5 is a flowchart of the parsing process according to an embodiment of the present invention.

Detailed Implementation

[0026] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.

[0027] It should be noted that the terms "first," "second," etc., in the specification, claims, and drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0028] The method embodiments provided in this application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking running on a computer terminal as an example, Figure 1 is a hardware structure block diagram of a computer terminal for the file parsing method according to an embodiment of the present invention. As shown in Figure 1, the computer terminal may include one or more processors 102 (only one is shown in Figure 1; the processor 102 may include, but is not limited to, a microprocessor (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data. The computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in Figure 1 is for illustrative purposes only and does not limit the structure of the computer terminal described above. For example, the computer terminal may include more or fewer components than those shown in Figure 1, or have a configuration different from that shown in Figure 1.

[0029] The memory 104 can be used to store computer programs, such as application software programs and modules, like the computer program corresponding to the file parsing method in this embodiment of the invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the above-described method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to a computer terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0030] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider for the computer terminal. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module used for wireless communication with the Internet.

[0031] This embodiment provides a file parsing method that runs on the aforementioned computer terminal. Figure 2 is a flowchart of a file parsing method according to an embodiment of the present invention. As shown in Figure 2, the process includes the following steps:

[0032] Step S202: Split the first file into multiple first logical fragments according to the logical splitting task;

[0033] Step S204: If the length of the first logical segment does not exceed the preset first threshold, the first logical segment is read into memory through the read interface corresponding to the storage type of the first file.

[0034] Step S206: According to the parsing type of the first file, parse the first logical slice in memory in the thread corresponding to each of the first logical slices respectively.

[0035] In this embodiment, large files are split using logical splitting, which improves the efficiency of file splitting.

[0036] In an exemplary embodiment, if the size of the first file is greater than a preset threshold, the first file is split into multiple first logical fragments according to a logical splitting task.

[0037] In this embodiment, the logical splitting task includes at least one of the following parameters: file path, file storage method, number of logical shards, and parsing type.

[0038] In step S202 of this embodiment, the first file is split into multiple first logical fragments according to the total length of the first file and the number of logical fragments.

[0039] After step S202 in this embodiment, the method further includes: recording the length, file path, and starting position of each first logical segment.

[0040] In this embodiment, the parsing type includes: custom type and template type. When the parsing type is a template type, it supports user-provided file content format, thus enabling the file parsing method in this embodiment to support file parsing for multiple file types.

[0041] In step S206 of this embodiment, the following steps are included: when the parsing type is a custom type, the first logical fragment is parsed through the Service Provider Interface (SPI) mechanism; when the parsing type is a template type, the first logical fragment is parsed through a format template corresponding to the content format of the first file.

[0042] In this embodiment, before parsing the first logical segment using a format template corresponding to the content format of the first file, the method further includes: loading the format template into a folder under a predetermined file path.

[0043] Following step S206 in this embodiment, the process includes: setting a process ID and thread ID for the second file obtained by parsing the first logical segment, and writing the second file into the file path corresponding to the unit number of the first logical segment. In this embodiment, unit numbers are pre-set for each logical segment.

[0044] In one exemplary embodiment, writing the second file into the file path corresponding to the unit number includes: if the file path of the second file has not been created, creating the file path of the second file and then writing the second file; if the file path of the second file has been created, determining whether the second file is being opened for the first time based on the file handle of the second file.

[0045] In one exemplary embodiment, after determining whether the second file is being opened for the first time based on the file handle of the second file, the method further includes: if the second file is being opened for the first time, recording the second file in a cache unit; if the second file is not being opened for the first time, writing the second file into a file path corresponding to the unit number.

[0046] In this embodiment, after parsing the first logical segment in memory in the thread corresponding to the first logical segment, one of the following is further included: if parsing the first logical segment fails, re-initiating the parsing task for the first logical segment; if parsing the first logical segment succeeds, merging the second files with the same unit number but different process number and thread number into the same file according to the merging task.

[0047] In this embodiment, the method further includes: when the length of the first logical segment exceeds a first threshold, performing a second logical split on the first logical segment to obtain multiple second logical segments; reading the second logical segments into memory through the read interface corresponding to the storage type of the first file; and parsing the second logical segments in memory in the thread corresponding to the second logical segments according to the parsing type of the first file.

[0048] Through the above steps, the first logical slice can be read through the read interface corresponding to different storage types, meaning that this embodiment of the invention supports multiple storage types. Furthermore, each logical slice has a corresponding thread for file parsing, resulting in higher parsing efficiency. Therefore, it solves the problems of file parsing methods in related technologies not supporting multiple storage methods and having low file parsing efficiency, thus achieving the effect of improving file parsing efficiency.

[0049] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as read-only memory / random access memory (ROM / RAM), magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0050] This embodiment also provides a file parsing device for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and / or hardware that performs a predetermined function. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.

[0051] Figure 3 is a structural block diagram of a file parsing apparatus according to an embodiment of the present invention. As shown in Figure 3, the device includes: a splitting module 10, a reading module 20, and a parsing module 30.

[0052] Splitting module 10 is used to split the first file into multiple first logical fragments according to the logical splitting task;

[0053] The reading module 20 is used to read the first logical segment into memory through the read interface corresponding to the storage type of the first file when the length of the first logical segment does not exceed a preset first threshold.

[0054] The parsing module 30 is used to parse the first logical segments in memory respectively in threads corresponding to each of the first logical segments, according to the parsing type of the first file.

[0055] It should be noted that the above modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, but are not limited to: all the above modules are located in the same processor; or, the above modules are located in different processors in any combination.

[0056] To facilitate understanding of the technical solutions provided by this invention, detailed descriptions will be given below in conjunction with specific scenario embodiments.

[0057] This invention provides a method for fast parsing of large files for the financial industry. Figure 4 is a flowchart of the distributed framework-based parsing method according to an embodiment of the invention. As shown in Figure 4, the framework includes a scheduling module 41 and an execution module 42. Specifically, in this embodiment, a cluster approach is used, and the scheduling module 41 and the execution module 42 cooperate to achieve the task of quickly parsing large files.

[0058] The scheduling module 41's functions include: triggering tasks, scheduling the execution module 42 using routing and load balancing strategies, and monitoring the execution module 42. Specifically, the monitoring module is responsible for updating the list of healthy executors through a periodic liveness detection mechanism when new execution modules register or when an execution module fails.

[0059] As shown in Figure 4, the interaction process between the scheduling module 41 and the execution module 42 in the large file parsing process includes the following steps:

[0060] Step S401: Create a logical splitting task.

[0061] Specifically, the scheduling module 41 sends logical splitting tasks to the execution module 42. When sending logical splitting tasks, the specific communication transmission parameters include: file path, file storage method, number of logical file splits, and parsing type.

[0062] In this embodiment, large files are split using a logical splitting method, which has higher splitting efficiency.

[0063] Step S402: Return the specific list of shards.

[0064] Specifically, after receiving the logical splitting task, the execution module 42 first obtains the total length of the file; then, based on the number of splits, it obtains the length of a single logical split; based on the length of a single logical split, it performs logical splitting using Java's RandomAccessFile class, and records the length, file path, and starting position of each logical split after splitting; finally, it returns a list of all logical splits to the scheduling module 41.

[0065] In this embodiment, Java's RandomAccessFile class is used for logical splitting, which avoids generating intermediate temporary small files.
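As an illustration of step S402, the following is a minimal Java sketch of a logical split, assuming the file is divided into equal byte ranges with the last range absorbing the remainder. The class name `LogicalSplitSketch`, the `Segment` record, and the method names are hypothetical and do not come from the patent text; the text does not state how lines that straddle a range boundary are handled, and this sketch does not address that either.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class LogicalSplitSketch {
    /** One logical segment: no data is copied, only the path and byte range are recorded. */
    record Segment(String path, long start, long length) {}

    /** Divides [0, totalLength) into `count` contiguous ranges; the last range takes the remainder. */
    static List<Segment> split(String path, long totalLength, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        long base = totalLength / count;
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            long start = i * base;
            long length = (i == count - 1) ? totalLength - start : base;
            segments.add(new Segment(path, start, length));
        }
        return segments;
    }

    /** Reads the bytes of one segment by seeking to its start; no temporary file is created. */
    static byte[] read(Segment s) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(s.path(), "r")) {
            byte[] buf = new byte[Math.toIntExact(s.length())];
            raf.seek(s.start());
            raf.readFully(buf);
            return buf;
        }
    }
}
```

Because each `Segment` only records a path, a start offset, and a length, the list returned to the scheduling module stays small regardless of the file size.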

[0066] Step S403: Issue the file parsing task.

[0067] Specifically, the scheduling module 41 obtains all healthy execution modules 42 according to a separate monitoring and detection process, and distributes all shards evenly to the execution modules 42 for file parsing tasks through a load balancing routing strategy.
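The text does not name the load balancing strategy used in step S403. As one possible reading, the sketch below distributes segments to healthy executors in round-robin order, so that the number of segments per executor differs by at most one. Round-robin, the class name `RoundRobinDispatchSketch`, and the use of string identifiers for executors are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RoundRobinDispatchSketch {
    /** Assigns segment indexes 0..segmentCount-1 to the healthy executors in turn. */
    static Map<String, List<Integer>> assign(List<String> healthyExecutors, int segmentCount) {
        if (healthyExecutors.isEmpty()) {
            throw new IllegalStateException("no healthy executor available");
        }
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        for (String executor : healthyExecutors) {
            plan.put(executor, new ArrayList<>());
        }
        for (int i = 0; i < segmentCount; i++) {
            String target = healthyExecutors.get(i % healthyExecutors.size());
            plan.get(target).add(i);
        }
        return plan;
    }
}
```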

[0068] Step S404: Asynchronously return a message indicating successful reception of the logical fragment.

[0069] Specifically, after receiving a specific file parsing task, the execution module 42 adds all received logical slices to a queue and then informs the scheduling module 41 that the task has been successfully received, reducing the time spent on task distribution by the scheduling module. In this embodiment, the execution module 42 uses a thread pool to create a task for each logical slice, which is executed by a separate thread. In this embodiment, multiple logical slices can be executed concurrently, where a logical slice is equivalent to the first logical slice in the above embodiment.
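The thread-pool behaviour described above can be sketched as follows, assuming a fixed-size pool and a caller-supplied parsing function. The class `SegmentPoolSketch`, the method `parseAll`, and the choice of `Executors.newFixedThreadPool` are illustrative assumptions; the text only states that a thread pool creates one task per logical slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class SegmentPoolSketch {
    /** Submits one task per segment and collects the results in segment order. */
    static <S, R> List<R> parseAll(List<S> segments, Function<S, R> parser, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (S segment : segments) {
                Callable<R> task = () -> parser.apply(segment);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that segment is parsed
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for segments", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a segment failed to parse", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In the described flow the executor acknowledges receipt before parsing finishes; this sketch waits for the results only to keep the example self-contained.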

[0070] Step S405: If the fragment length is too large, perform a second logical split.

[0071] Specifically, the maximum value of the shard length is pre-configured. The execution module 42 determines the length of the first logical shard. To avoid excessive memory consumption, if the length exceeds the set maximum value, a second logical shard is performed.
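A minimal sketch of the second logical split in step S405 is shown below, assuming that an oversized segment is cut into consecutive ranges of at most the configured maximum length. The text does not specify how the secondary ranges are sized, so this chunking rule, the class `SecondarySplitSketch`, and the `Range` record are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SecondarySplitSketch {
    /** A byte range inside the original file. */
    record Range(long start, long length) {}

    /** Returns the range unchanged if it fits, otherwise consecutive ranges of at most maxLength. */
    static List<Range> splitIfTooLong(long start, long length, long maxLength) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        List<Range> out = new ArrayList<>();
        if (length <= maxLength) {
            out.add(new Range(start, length));
            return out;
        }
        long end = start + length;
        long offset = start;
        while (offset < end) {
            long len = Math.min(maxLength, end - offset);
            out.add(new Range(offset, len));
            offset += len;
        }
        return out;
    }
}
```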

[0072] Step S406: parse the logical fragments.

[0073] Specifically, as shown in Figure 5, the parsing process in each execution thread includes the following steps:

[0074] Step S501: The file is split into secondary logical segments to obtain secondary logical fragments, wherein the secondary logical fragments are equivalent to the second logical fragments in the above embodiment.

[0075] Specifically, if the length of the logical segment is less than the maximum value mentioned above, step S501 can be omitted.

[0076] Step S502: Determine whether the file fragmentation process is complete. If the process is not complete, proceed to step S503. If the process is complete, proceed to step S503'.

[0077] Step S503: Read the individual logical slices row by row and store them in memory.

[0078] Specifically, the secondary logical segments are read through the read interface. That is, the specific implementation class of the read interface is obtained through the Service Provider Interface (SPI) mechanism according to the storage type, and the secondary logical segment is read into memory row by row, with the record format being LinkedList<String>, that is, a linked list arranged by row. This embodiment supports Network Attached Storage (NAS), Secure File Transfer Protocol (SFTP) storage, and object storage.
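The idea of selecting a reader by storage type can be sketched as follows. To keep the example self-contained, an in-code map stands in for the SPI lookup described in the text, and only a local-file reader (used here for the NAS case) is provided; SFTP and object storage readers are omitted. The interface `SegmentReader`, the class names, the `"NAS"` key, and the assumption that a range starts and ends on line breaks are all hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Map;

class ReadInterfaceSketch {
    /** Hypothetical read interface: one implementation per storage type. */
    interface SegmentReader {
        LinkedList<String> readLines(String path, long start, int length);
    }

    /** Reads a byte range from a locally mounted file and splits it into rows. */
    static final class LocalFileReader implements SegmentReader {
        @Override
        public LinkedList<String> readLines(String path, long start, int length) {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                byte[] buf = new byte[length];
                raf.seek(start);
                raf.readFully(buf);
                String text = new String(buf, StandardCharsets.UTF_8);
                return new LinkedList<>(Arrays.asList(text.split("\n")));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Stand-in registry; the text obtains the implementation through SPI instead. */
    static final Map<String, SegmentReader> READERS = Map.of("NAS", new LocalFileReader());

    static SegmentReader readerFor(String storageType) {
        SegmentReader reader = READERS.get(storageType);
        if (reader == null) {
            throw new IllegalArgumentException("no reader registered for " + storageType);
        }
        return reader;
    }
}
```

Adding a storage type then only requires registering another `SegmentReader` implementation; the parsing code that consumes the `LinkedList<String>` is unaffected.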

[0079] Step S503': Release the file descriptor created by the write module, and the process ends.

[0080] Step S504: The parsing module parses the files corresponding to the unit number categories line by line.

[0081] Specifically, the parsing type is determined as either raw custom mode (RAW) or template mode based on the parameters passed in by the scheduling module 41 (e.g., parsing type parameter).

[0082] In custom mode, users need to implement the parsing interface themselves. The implementation, loaded through the SPI mechanism, parses the current line, determines the unit number of the current line, and either maps the unit number to a specific path or notifies the business module to use the data by sending the parsed content through a notification interface.
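In Java, the SPI mechanism is typically accessed through `java.util.ServiceLoader`. The sketch below shows how a user-supplied parser could be discovered this way; the interface `LineParser` and its method `unitIdOf` are hypothetical names, since the text does not give the signature of the parsing interface.

```java
import java.util.Optional;
import java.util.ServiceLoader;

class CustomParserSketch {
    /** Hypothetical parsing interface that a user implements in custom mode. */
    public interface LineParser {
        /** Returns the unit number that the given line belongs to. */
        String unitIdOf(String line);
    }

    /** Returns the first provider registered under META-INF/services, if any. */
    static Optional<LineParser> findProvider() {
        return ServiceLoader.load(LineParser.class).findFirst();
    }
}
```

A provider is registered by placing a file named after the interface's binary name under `META-INF/services/` on the classpath, with the implementing class name as its content. When no provider is registered, `findProvider()` returns an empty `Optional`.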

[0083] In template mode, the user places the format template of the file content in a specific folder under the agreed classpath. The parsing module loads the content of this format template into memory and parses the relevant key information. Finally, it parses the current line passed in by the read module according to this format template and returns the result as a HashMap<String key, Object value>, where the String key identifies the column and the Object value is the value at the intersection of the current row and that column.
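The text does not describe the format of the template file. The following sketch assumes, purely for illustration, that a template consists of a field delimiter and an ordered list of column names, and shows how one line could be turned into the HashMap described above. `TemplateParseSketch`, `FormatTemplate`, and the sample column names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.regex.Pattern;

class TemplateParseSketch {
    /** Assumed template shape: a delimiter and the ordered column names. */
    record FormatTemplate(String delimiter, List<String> columns) {}

    /** Maps each column name to the cell found at that position in the line. */
    static HashMap<String, Object> parseLine(FormatTemplate template, String line) {
        // Pattern.quote keeps delimiters such as "|" from being read as regex syntax;
        // the -1 limit preserves trailing empty cells.
        String[] cells = line.split(Pattern.quote(template.delimiter()), -1);
        if (cells.length != template.columns().size()) {
            throw new IllegalArgumentException(
                "expected " + template.columns().size() + " cells but found " + cells.length);
        }
        HashMap<String, Object> row = new HashMap<>();
        for (int i = 0; i < cells.length; i++) {
            row.put(template.columns().get(i), cells[i]);
        }
        return row;
    }
}
```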

[0084] The parsing module parses the data from the read module line by line, and finally records it into a LinkedList<HashMap<String key, Object value>>, in which each node of the linked list represents the parsed information of a single line. After parsing each line, the unit number (id) to which the line's content belongs is determined, and the file path to be written is obtained by mapping the unit number (id).

[0085] The data format produced by the parsing module is HashMap<String path, LinkedList<HashMap<String, Object>>>, where the key "path" of the outermost hash table represents the file path corresponding to different unit IDs, and the value is the parsed content to be recorded in that file, held as a LinkedList<HashMap<String, Object>>.

[0086] In this embodiment, after processing by the parsing module, all rows of the current logical shard are processed according to the unit ID. The important role of the unit ID is that the large file generated by the current batch processing flow is a collection of different business modules. After the parsing module finishes processing, it will classify and save the data of different business modules according to the unit ID. Subsequent batch processing business flows can then perform their respective business processing according to different files. The unit ID is used to distinguish between different business modules.
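The classification by unit ID can be sketched as a grouping step that builds the nested structure described above. The mapping from unit ID to file path used here (`baseDir/unitId/data.txt`), the class `UnitGroupingSketch`, and the parameter names are assumptions; the text only states that the path is obtained by mapping the unit ID.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;

class UnitGroupingSketch {
    /** Groups parsed rows by the output path derived from each row's unit ID. */
    static HashMap<String, LinkedList<HashMap<String, Object>>> groupByPath(
            List<HashMap<String, Object>> rows, String unitKey, String baseDir) {
        HashMap<String, LinkedList<HashMap<String, Object>>> byPath = new HashMap<>();
        for (HashMap<String, Object> row : rows) {
            // Hypothetical mapping rule from unit ID to output path.
            String path = baseDir + "/" + row.get(unitKey) + "/data.txt";
            byPath.computeIfAbsent(path, k -> new LinkedList<>()).add(row);
        }
        return byPath;
    }
}
```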

[0087] In this embodiment, multiple execution modules 42 can be deployed on the same physical machine. Each execution module 42 uses a thread pool to parse files concurrently. To avoid concurrent overwriting during the file writing process, the name of each file written after parsing is suffixed with a process ID and a thread ID. Subsequently, the file merging process merges the contents belonging to the same unit ID.
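One way to build such collision-free file names in Java is shown below. The suffix layout `<path>.<pid>.<threadId>` and the class `OutputNameSketch` are assumptions; the text states only that a process ID and a thread ID are appended.

```java
class OutputNameSketch {
    /** Appends the process ID and thread ID so concurrent writers never share a file. */
    static String partName(String unitPath, long pid, long threadId) {
        return unitPath + "." + pid + "." + threadId;
    }

    /** Convenience overload that uses the current process and thread. */
    static String partNameForCurrentThread(String unitPath) {
        return partName(unitPath, ProcessHandle.current().pid(), Thread.currentThread().getId());
    }
}
```

Since every thread of every executor process writes to its own file, no locking is needed while writing; the cost is the later merge step.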

[0088] Step S505: Determine if the folder containing the current file path exists. If it exists, proceed to step S506; otherwise, proceed to step S506'.

[0089] Step S506: Determine if the file is being opened for the first time. If it is, proceed to step S507; otherwise, proceed to step S508.

[0090] Specifically, the determination of whether a file is being opened for the first time is made by checking whether the file stream handle is open. This step is necessary because execution module 42 iterates through and parses multiple secondary logical segments, and this setting is implemented to avoid repeatedly opening and closing file resources.

[0091] During the execution module's loop processing of the secondary fragments, the file stream of the write module is opened only once. The open file stream handle is recorded in a cached manner, which can avoid the time-consuming impact on performance caused by opening and closing file resources multiple times during the loop processing of the secondary fragments.

[0092] Step S506': Create the path.

[0093] Specifically, all folders in the file path that do not yet exist are created, so that the required file path exists.

[0094] Step S507: Open the file and record it in the cache unit.

[0095] Specifically, if no file stream handle is open because the file is being used for the first time, a file handle is created and opened, and it is cached in a hash table in which the key is the file path and the value is the file handle.

[0096] Step S508: Write the corresponding line content to the output file.

[0097] Specifically, the data produced by the parsing module is traversed in a loop, and each line's content is written to the file at the corresponding path. In this embodiment, multiple execution modules can perform file parsing simultaneously, and a thread pool is used within each execution module to parse the file concurrently. To avoid concurrent overwriting during file writing, the name of each file written for the parsing module's output is suffixed with the process ID and thread ID, and the content with the same unit ID is subsequently merged by the file merging process.
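Steps S505 to S508 can be illustrated with the following minimal Java sketch. The class CachedFileWriter, its methods, and the openCount counter are hypothetical names introduced for illustration; the sketch assumes one instance per worker thread, so the cache itself is not synchronized.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

final class CachedFileWriter implements AutoCloseable {
    // Cache unit: key = file path, value = open file handle.
    private final Map<Path, BufferedWriter> handles = new HashMap<>();
    private int openCount = 0; // how many files were actually opened

    void writeLine(Path file, String line) throws IOException {
        Path parent = file.toAbsolutePath().getParent();
        if (parent != null && !Files.isDirectory(parent)) { // step S505
            Files.createDirectories(parent);                // step S506'
        }
        BufferedWriter writer = handles.get(file);          // step S506
        if (writer == null) {                               // first use of this file
            writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            handles.put(file, writer);                      // step S507
            openCount++;
        }
        writer.write(line);                                 // step S508
        writer.newLine();
    }

    int openCount() {
        return openCount;
    }

    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : handles.values()) {
            writer.close();
        }
        handles.clear();
    }
}
```

Keeping the handle in the map means that a file is opened once per worker, however many secondary segments contribute lines to it.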

[0098] In this embodiment, after the read, parsing, and write processing of the current secondary logical segment is completed, the same processing continues with the next secondary logical segment until all secondary logical segments have been processed. This signifies that the primary logical segment issued by the scheduling module has been fully processed in the specific thread of the current executor, and the processing result is reported to the scheduling module.

[0099] In this embodiment, the write module uses the storage type parameter passed by the scheduling module 41 to locate, through the write interface and the SPI mechanism, the write implementation class for that storage type. The data parsed by the parsing module is then recorded, according to unit ID and file path, in the corresponding file or database table.
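One way such a lookup could be written in Java is with java.util.ServiceLoader, the standard library's SPI facility. The sketch below is an assumption about how the write interface might look: the interface StorageWriter, its storageType() method, and the class StorageWriters are hypothetical and do not come from this description.

```java
import java.util.ServiceLoader;

/** Hypothetical write interface; one implementation per storage type. */
interface StorageWriter {
    String storageType();                 // for example "file" or "database"
    void write(String path, String line);
}

final class StorageWriters {
    /** Returns the registered implementation whose storageType() matches. */
    static StorageWriter forType(String storageType) {
        // ServiceLoader discovers implementations listed in a provider
        // configuration file under META-INF/services on the class path.
        for (StorageWriter writer : ServiceLoader.load(StorageWriter.class)) {
            if (writer.storageType().equals(storageType)) {
                return writer;
            }
        }
        throw new IllegalArgumentException(
                "No StorageWriter registered for storage type: " + storageType);
    }
}
```

Under this arrangement, supporting a new storage method would mean adding an implementation class and registering it, without changing the lookup code.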

[0100] The following example, using the template pattern parsing type in the parsing process, illustrates the file parsing process:

[0101] The JSON format template is as follows:

[0102] { [

[0104] "seq|serial number|String",

[0105] "date|order application time|Date:yyyy-MM-dd HH:MM:ss",

[0106] "amount|amount|Integer",

[0107] "bol|whether to postpone|Boolean" ]

[0109] }

[0110] Example of large file content that actually needs to be parsed:

[0111] 3456233xskde 2022-07-01 12:31:21 423 true

[0112] 45xkkd324222 2021-09-03 10:11:46 235 false

[0113] 35xkkd324111 2020-07-30 01:01:43 035 true

[0114] Each entry in the JSON list describes one column of the large file content. Taking "seq|serial number|String" as an example, the corresponding content is the first column: 3456233xskde, 45xkkd324222, and 35xkkd324111. In "seq|serial number|String", seq is the key in the HashMap generated by the parsing module (its value for the first line is 3456233xskde), "serial number" is a note describing what seq means, and String is the data type of seq.

[0115] The actual execution flow of the parsing module is as follows:

[0116] When the current line is "3456233xskde 2022-07-01 12:31:21 423 true", it is parsed as a HashMap, and the result is as follows:

[0117] result1: {seq:3456233xskde, date:2022-07-01 12:31:21, amount:423, bol:true}.

[0118] When the current line is "45xkkd324222 2021-09-03 10:11:46 235 false", it is parsed as a HashMap, and the result is as follows:

[0119] result2: {seq:45xkkd324222, date:2021-09-03 10:11:46, amount:235, bol:false}.

[0120] When the current line is "35xkkd324111 2020-07-30 01:01:43 035 true", it is parsed as a HashMap, and the result is as follows:

[0121] result3: {seq:35xkkd324111, date:2020-07-30 01:01:43, amount:035, bol:true}.
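The template-driven parsing walked through above can be sketched as follows. The class TemplateLineParser is hypothetical, and the sketch makes several assumptions that this description does not state: columns are separated by whitespace, a Date column consumes as many whitespace-separated tokens as its pattern contains, and the date pattern is written as yyyy-MM-dd HH:mm:ss, because in Java's DateTimeFormatter pattern syntax the letters MM denote the month and mm the minutes. Note also that Integer parsing turns the text 035 into the number 35.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TemplateLineParser {
    // Each element is {key, note, type}, split from "key|note|type".
    private final List<String[]> columns = new ArrayList<>();

    TemplateLineParser(List<String> templateEntries) {
        for (String entry : templateEntries) {
            columns.add(entry.split("\\|", 3));
        }
    }

    /** Parses one line into key/value pairs, in template order. */
    Map<String, Object> parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<String, Object> row = new LinkedHashMap<>();
        int pos = 0;
        for (String[] column : columns) {
            String key = column[0];
            String type = column[2];
            if (type.startsWith("Date:")) {
                String pattern = type.substring("Date:".length());
                int width = pattern.split(" ").length; // tokens used by the date
                String text = String.join(" ",
                        Arrays.copyOfRange(tokens, pos, pos + width));
                row.put(key, LocalDateTime.parse(text,
                        DateTimeFormatter.ofPattern(pattern)));
                pos += width;
            } else if (type.equals("Integer")) {
                row.put(key, Integer.valueOf(tokens[pos++]));
            } else if (type.equals("Boolean")) {
                row.put(key, Boolean.valueOf(tokens[pos++]));
            } else {
                row.put(key, tokens[pos++]); // String and any unknown type
            }
        }
        return row;
    }
}
```

A real template format would also need to carry the column delimiter; the whitespace rule here is chosen only so that the three sample lines parse as shown.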

[0122] During the parsing of each line, if the first character of each line represents the unit ID, the first and third lines belong to the same unit and need to be written to the same file, while the second line belongs to another unit and needs to be written to a different file. An example format for the mapping between unit ID and file name is: id_unit_test.

[0123] In addition, the process ID and thread ID need to be added during parsing. Assuming the file content lies in a single logical slice and is parsed by a particular thread of execution module A, with process ID 21477 and thread ID 678321, the file path to be written for the first and third lines is 3_unit_test_21477_678321, and the file path to be written for the second line is 4_unit_test_21477_678321. Once the file path for each line is known, the lines belonging to the same file are saved to a linked list whose data structure is LinkedList<HashMap<String, Object>>:

[0124] The linked list 3_unit_test_21477_678321 contains the following content: list1={result1, result3}

[0125] The linked list 4_unit_test_21477_678321 contains the following content: list2 = {result2};

[0126] Finally, the file name is appended to the save path (for example, /run/test), which generates the final HashMap<String path, LinkedList<HashMap<String, Object>>>.

[0127] The results are as follows:

[0128] {3_unit_test_21477_678321: list1, 4_unit_test_21477_678321: list2}. This result is used by the subsequent write module.
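The grouping in this example can be reproduced with a short Java sketch. The class UnitGrouping and its group method are hypothetical; the sketch assumes that the unit ID is the first character of the seq value and, like the result shown above, uses the bare file name as the map key without a save-path prefix.

```java
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class UnitGrouping {
    /** Groups parsed rows by target file name "<unitId>_unit_test_<pid>_<threadId>". */
    static Map<String, LinkedList<Map<String, Object>>> group(
            List<Map<String, Object>> rows, long pid, long threadId) {
        Map<String, LinkedList<Map<String, Object>>> byPath = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            String seq = String.valueOf(row.get("seq"));
            String unitId = seq.substring(0, 1); // assumption: first character is the unit ID
            String path = unitId + "_unit_test_" + pid + "_" + threadId;
            byPath.computeIfAbsent(path, key -> new LinkedList<>()).add(row);
        }
        return byPath;
    }
}
```

Called with the three parsed rows, process ID 21477, and thread ID 678321, this yields two entries: one list holding the first and third rows, and one holding the second.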

[0129] In this embodiment, the execution module adopts a template pattern, which facilitates providing users with a unified entry point. The template pattern also includes processing flows such as secondary logical sharding, a read module, a parsing module, and a write module.

[0130] Step S407: Reissue the file parsing task for the files that failed to be parsed.

[0131] Specifically, the scheduling module 41 receives the asynchronous parsing result from the execution module 42. If a file parsing task fails, the scheduling module re-issues the parsing task for that logical segment. If parsing succeeds, it saves the data, and this continues until all logical segments have been processed.
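The re-issuing behaviour can be sketched with a thread pool as follows. The class RetryingScheduler and the maxAttempts limit are assumptions made for illustration; this description does not say how many times a failed task is re-issued. For brevity the sketch waits for each attempt to finish, whereas the scheduling module described here receives results asynchronously.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class RetryingScheduler {
    /** Submits the task and resubmits it after a failure, up to maxAttempts times. */
    static <T> T runWithRetry(ExecutorService pool, Callable<T> task, int maxAttempts)
            throws InterruptedException, ExecutionException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ExecutionException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> result = pool.submit(task);
            try {
                return result.get();   // parsing succeeded
            } catch (ExecutionException e) {
                lastFailure = e;       // parsing failed: issue the task again
            }
        }
        throw lastFailure;             // every attempt failed
    }
}
```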

[0132] Step S408: Issue the file merging task.

[0133] Specifically, after the logical sharding process is completed, the scheduling module 41 issues a file merging task. After receiving the file merging task, the execution module 42 merges files with the same file prefix (e.g., unit number id) but different process ids and thread ids into the same file and removes the process id and thread id suffixes.

[0134] For example: After the thread pools of different execution modules 42 execute the relevant processes, the generated file is: 1_unit_test_34521_231.csv. The merged file is named: 1_unit_test.csv. The meanings of the relevant fields are: 1: unit id; 34521: process id of a specific executor; 231: id of the thread executing in this executor.
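A minimal Java sketch of this merge step is given below. The class ShardMerger, the regular expression, and the choice to append shard contents in sorted file-name order are assumptions for illustration; the sketch also leaves the per-thread files in place after merging, since this description does not say whether they are deleted.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ShardMerger {
    // <prefix>_<processId>_<threadId> followed by an optional extension.
    private static final Pattern SHARD =
            Pattern.compile("^(.+)_(\\d+)_(\\d+)(\\.[^.]+)?$");

    /** "1_unit_test_34521_231.csv" becomes "1_unit_test.csv". */
    static String mergedName(String shardName) {
        Matcher m = SHARD.matcher(shardName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a shard file name: " + shardName);
        }
        String extension = m.group(4) == null ? "" : m.group(4);
        return m.group(1) + extension;
    }

    /** Appends every shard in dir to the merged file that shares its prefix. */
    static void mergeAll(Path dir) throws IOException {
        List<Path> shards;
        try (Stream<Path> entries = Files.list(dir)) {
            shards = entries
                    .filter(p -> SHARD.matcher(p.getFileName().toString()).matches())
                    .sorted()
                    .collect(Collectors.toList());
        }
        for (Path shard : shards) {
            Path target = dir.resolve(mergedName(shard.getFileName().toString()));
            try (OutputStream out = Files.newOutputStream(target,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                Files.copy(shard, out); // streams the shard without loading it all
            }
        }
    }
}
```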

[0135] Embodiments of this invention employ a clustered approach and a distributed framework. The scheduling module manages the task execution process, including the logical splitting of large files, the routing and scheduling of executors, and the merging of small files belonging to the same unit. Asynchronous operation makes fuller use of machine resources and improves file parsing efficiency. The scheduling module uses a thread pool to process multiple secondary logical shards simultaneously, which improves execution efficiency. Furthermore, embodiments of this invention can use an SPI mechanism for different storage methods, so that the system can be flexibly extended to various file storage options. Users only need to implement the interface classes of the read, parsing, and write modules for a given storage method.

[0136] Furthermore, because file formats differ significantly between users, this invention proposes that the parsing module in the execution module provide both a raw custom method and a template method, so that various file formats can be handled flexibly. The raw custom method is intended for relatively uncommon file formats, for which users implement the file parsing interface themselves. With the template method, users provide a file format template, such as the column content and column delimiter of a CSV file; the file is then parsed according to the template format and saved to a specific storage location.

[0137] Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.

[0138] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0139] Embodiments of the present invention also provide an electronic device including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the steps in any of the above method embodiments.

[0140] In one exemplary embodiment, the electronic device may further include a transmission device and an input / output device, wherein the transmission device is connected to the processor and the input / output device is connected to the processor.

[0141] Specific examples in this embodiment can be found in the examples described in the above embodiments and exemplary implementations, and will not be repeated here.

[0142] It is obvious to those skilled in the art that the modules or steps of the present invention described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. They can be implemented using computer-executable program code, and thus can be stored in a storage device for execution by a computing device. In some cases, the steps shown or described can be performed in a different order than those described herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any particular combination of hardware and software.

[0143] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, or improvements made within the principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A file parsing method, characterized in that it includes: The first file is split into multiple first logical fragments based on the logical splitting task; If the length of the first logical segment does not exceed the preset first threshold, the first logical segment is read into memory through the read interface corresponding to the storage type of the first file; According to the parsing type of the first file, the first logical slice in memory is parsed in the thread corresponding to each of the first logical slices respectively; The parsing types include: custom types and template types. The step of parsing the first logical segments in memory in threads corresponding to each first logical segment according to the parsing type of the first file includes: parsing the first logical segments through the Service Provider Interface (SPI) mechanism when the parsing type is a custom type; and parsing the first logical segments through a format template corresponding to the content format of the first file when the parsing type is a template type.

2. The method according to claim 1, characterized in that, If the size of the first file is greater than a preset second threshold, the first file is split into multiple first logical fragments according to the logical splitting task.

3. The method according to claim 1, characterized in that, The logical splitting task includes at least one of the following parameters: file path, file storage method, number of logical shards, and parsing type.

4. The method according to claim 3, characterized in that, The first file is split into multiple first logical fragments based on the logical splitting task, including: Based on the total length of the first file and the number of logical fragments, the first file is split into multiple first logical fragments.

5. The method according to claim 1, characterized in that, After splitting the first file into multiple first logical fragments according to the logical splitting task, it also includes: Record the length, file path, and starting position of each first logical segment.

6. The method according to claim 1, characterized in that, Before parsing the first logical fragment using a format template corresponding to the content format of the first file, the process also includes: Load the format template into a folder under the predetermined file path.

7. The method according to claim 1, characterized in that, After parsing the first logical slice in memory in the thread corresponding to each of the first logical slices, the process includes: Set the process ID and thread ID for the second file obtained by parsing the first logical segment, and write the second file into the file path corresponding to the unit number according to the unit number of the first logical segment, wherein the unit number is preset.

8. The method according to claim 7, characterized in that, Writing the second file to the file path corresponding to the unit number includes: If the file path of the second file has not been created, the file path of the second file is created and the second file is then written; If the file path of the second file has already been created, determine whether the second file is being opened for the first time based on the file handle of the second file.

9. The method according to claim 8, characterized in that, After determining whether the second file is being opened for the first time based on its file handle, the process also includes: If the second file is being opened for the first time, the second file will be recorded in the cache unit; If the second file is not being opened for the first time, the second file will be written to the file path corresponding to the unit number.

10. The method according to claim 7, characterized in that, After parsing the first logical slice in memory in the thread corresponding to each of the first logical slices, the process further includes one of the following: If parsing the first logical segment fails, the parsing task for the first logical segment is restarted; If the first logical fragment is successfully parsed, the second files with the same unit number but different process number and thread number are merged into a single file according to the merge task.

11. The method according to claim 1, characterized in that, The method further includes: If the length of the first logical segment exceeds the first threshold, the first logical segment is split into multiple second logical segments. The second logical slice is read into memory through the read interface corresponding to the storage type of the first file; According to the parsing type of the first file, the second logical segment in memory is parsed in the thread corresponding to the second logical segment.

12. A file parsing device, characterized in that it includes: The splitting module is used to split the first file into multiple first logical fragments according to the logical splitting task; The reading module is used to read the first logical segment into memory through the read interface corresponding to the storage type of the first file, provided that the length of the first logical segment does not exceed a preset first threshold; The parsing module is used to parse the first logical segments in memory according to the parsing type of the first file in threads corresponding to each of the first logical segments; The parsing types include: custom types and template types. According to the parsing type of the first file, the file parsing device is further configured to parse the first logical fragment through the Service Provider Interface (SPI) mechanism when the parsing type is a custom type; and to parse the first logical fragment through a format template corresponding to the content format of the first file when the parsing type is a template type.

13. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein the computer program is configured to perform the method described in any one of claims 1 to 11 when executed.

14. An electronic device comprising a memory and a processor, characterized in that, The memory stores a computer program, and the processor is configured to run the computer program to perform the method described in any one of claims 1 to 11.
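
As a rough illustration of the splitting recited in claims 4, 5 and 11, the offsets of the logical fragments can be computed as in the Java sketch below. The class LogicalSplitter, the nested Fragment type, and both method names are hypothetical. The sketch assumes that the remainder bytes are spread over the leading fragments and that an over-long fragment is cut into pieces of at most the threshold length; the claims do not specify either rule, and aligning fragment boundaries with line ends is not covered here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LogicalSplitter {
    /** One logical fragment: file path, starting offset, and length. */
    static final class Fragment {
        final String path;
        final long start;
        final long length;

        Fragment(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Splits a file of totalLength bytes into count first logical fragments. */
    static List<Fragment> split(String path, long totalLength, int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        List<Fragment> fragments = new ArrayList<>();
        long base = totalLength / count;
        long extra = totalLength % count; // spread over the first fragments
        long start = 0;
        for (int i = 0; i < count; i++) {
            long length = base + (i < extra ? 1 : 0);
            fragments.add(new Fragment(path, start, length));
            start += length;
        }
        return fragments;
    }

    /** Cuts a fragment longer than threshold into second logical fragments. */
    static List<Fragment> splitIfTooLong(Fragment first, long threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        if (first.length <= threshold) {
            return Collections.singletonList(first);
        }
        List<Fragment> pieces = new ArrayList<>();
        for (long offset = 0; offset < first.length; offset += threshold) {
            long length = Math.min(threshold, first.length - offset);
            pieces.add(new Fragment(first.path, first.start + offset, length));
        }
        return pieces;
    }
}
```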