A method and related device for extracting content to be archived based on multi-keyword matching
By using a multi-keyword matching method to dynamically select target directories and identify file content, the problem of format limitations in existing technologies is solved, enabling efficient and accurate retrieval of multiple file types, thereby improving the work efficiency of archival staff and the archival user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU ELECTRIC POWER COMM NETWORK LTD
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies only support searching for text-based archival content, and cannot comprehensively retrieve archival information in multiple formats, resulting in low retrieval efficiency and complex operation, which affects the work efficiency of archival staff.
By using a multi-keyword matching method, the path and keywords are obtained from the search page, the target directory is dynamically selected, the content of different types of files is extracted, and a three-level sentence segmentation strategy and overlap detection algorithm are used to identify semantic units and generate the content to be archived.
It enables efficient and accurate searching of multiple document types, improves retrieval efficiency and accuracy, simplifies operation processes, and enhances the work efficiency of archival staff and the user experience of archives.
Figure CN122240560A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a method and related equipment for extracting archived content based on multi-keyword matching.
Background Technology
[0002] In related technologies, existing technologies only support searching for text-based content to be archived, making it impossible for users to comprehensively retrieve all relevant materials and difficult to adapt to increasingly complex archival information collection systems. Resources to be archived often exist in multiple formats (such as PDF, DOCX, XLSX, images, compressed files, etc.), and contain a large amount of unstructured or semi-structured data. Retrieval is limited by file type, and many existing technologies require manual file selection, increasing operational complexity and labor costs, thus affecting the retrieval experience and efficiency of archival staff in borrowing, compilation, appraisal, and other related tasks.
[0003] In summary, the technical problems existing in the relevant technologies need to be improved.
Summary of the Invention
[0004] The main objective of this application is to propose a method for extracting archived content based on multi-keyword matching. This method aims to solve the problems of format restrictions, cumbersome operation, and low efficiency in the retrieval of archived content in the prior art, and to achieve efficient, accurate, and convenient integrated retrieval of multiple archived contents, thereby improving retrieval efficiency and accuracy.
[0005] To achieve the above objectives, one aspect of this application proposes a method for extracting archived content based on multi-keyword matching, the method comprising: in response to a search operation, obtaining the path to be searched, the keywords to be searched, and a recursive search identifier from the search page, wherein the keywords to be searched include at least one keyword; if the recursive search identifier is enabled, using the path to be searched and its subdirectories as the target directory, and otherwise using the path to be searched as the target directory; extracting the contents of different types of files in the target directory; searching the file content using the keywords to be searched to obtain the content to be archived; and, in response to an export operation, displaying the content to be archived on the file viewing page.
[0006] In some embodiments, the method further includes: In response to a selection action, the search path is displayed on the search page; In response to an input action, the entered search keywords are displayed on the search page; In response to the selection operation, a recursive search control and a recursive search icon are displayed on the search page. The recursive search icon includes an enabled icon and a disabled icon.
[0007] In some embodiments, extracting the contents of different types of files in the target directory includes: Obtain the first file in the target directory, wherein the first file includes a file type; If the file type is a compressed file, then the first file is decompressed and its format is identified to obtain the decompressed first file; The corresponding document processor is invoked based on the file type, and the file content of the first file is extracted through the document processor.
[0008] In some embodiments, searching the file content using the keywords to be searched to obtain the content to be archived includes: adopting a three-level sentence segmentation strategy to identify sentence boundaries of the document content based on Chinese grammatical features, thereby obtaining semantic units, wherein the three-level sentence segmentation strategy includes a coarse punctuation segmentation strategy, a semantic unit subdivision strategy, and a phrase fine segmentation strategy; determining whether the keyword to be searched exists in the file content; if not, determining that the content to be archived has no matching result; if so, determining that the content to be archived has a matching result, and determining, from the semantic units, the current semantic unit centered on the keyword to be searched, wherein the current semantic unit belongs to the current file; merging adjacent and overlapping segments of the current semantic units using an overlap detection algorithm; and using the current semantic unit corresponding to each keyword to be searched at each position, together with the current file corresponding to that semantic unit, as the matching result, and saving the matching result to the content to be archived.
[0009] In some embodiments, the method further includes: If the content to be archived has matching results, then obtain the number of matching results for the content to be archived; If the number of matching results exceeds the normal mode threshold, then obtain the available system memory; The block size is determined based on the available memory of the system. The content to be archived is divided into blocks based on the block capacity to obtain the block result; The block results are used to generate independent temporary files.
[0010] In some embodiments, the content to be archived includes matching results, the matching results including the current semantic unit and the current file corresponding to the current semantic unit, and the display of the content to be archived on the file viewing page in response to the export operation includes: In response to the export operation, a hyperlink between the current semantic unit and the current file is displayed on the file viewing page; In response to the selection of the hyperlink, the current semantic unit of the current file is displayed on the document viewing page.
[0011] To achieve the above objectives, another aspect of this application proposes a device for extracting archived content based on multi-keyword matching, the device comprising: a search module, used to obtain, in response to a search operation, the path to be searched, the keywords to be searched, and the recursive search identifier from the search page, wherein the keywords to be searched include at least one keyword; a directory determination module, used to take the path to be searched and its subdirectories as the target directory if the recursive search identifier is enabled, and otherwise to take the path to be searched as the target directory; a content extraction module, used to extract the content of different types of files in the target directory; a result generation module, used to search the file content using the keywords to be searched to obtain the content to be archived; and a results display module, used to display the content to be archived on the file viewing page in response to the export operation.
[0012] To achieve the above objectives, another aspect of this application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the method described above.
[0013] To achieve the above objectives, another aspect of the embodiments of this application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the methods described above.
[0014] To achieve the above objectives, another aspect of the embodiments of this application proposes a computer program product, including a computer program that, when executed by a processor, implements the aforementioned method.
[0015] The embodiments of this application include at least the following beneficial effects: This application provides a method, apparatus, electronic device, storage medium, and program product for extracting archived content based on multi-keyword matching. In response to a search operation, this solution obtains the path to be searched, the keywords to be searched, and the recursive search identifier from the search page. If the recursive search identifier is enabled, the path to be searched and its subdirectories are used as the target directory; otherwise, the path to be searched is used as the target directory, which adapts to the hierarchical management characteristics of archives. Extracting the content of different types of files from the target directory handles multiple file formats and improves search efficiency. Searching the file content using the keywords to be searched yields the content to be archived, which facilitates convenient and quick searching of multiple types of file content and improves retrieval efficiency and accuracy. In response to an export operation, the content to be archived is displayed on the file viewing page. Through structured result display and quick location functions, users can quickly locate file archive content, significantly improving the efficiency of archival staff in receiving and archiving, enriching the form and content of archives, and enhancing the user experience of archives.
Attached Figure Description
[0016] Figure 1 is a flowchart of the method for extracting archived content based on multi-keyword matching provided in the embodiments of this application;
Figure 2 is a flowchart of the search input steps of the method for extracting archived content based on multi-keyword matching provided in the embodiments of this application;
Figure 3 is a flowchart of step S103 in Figure 1;
Figure 4 is a flowchart of step S304 in Figure 3;
Figure 5 is a flowchart of the method for extracting archived content based on multi-keyword matching provided in the embodiments of this application that further includes a block segmentation step;
Figure 6 is a flowchart illustrating a specific implementation of the method for extracting archived content based on multi-keyword matching provided in this application when applied to a file content search system;
Figure 7 is a document processing flowchart provided in an embodiment of this application;
Figure 8 is a schematic diagram of the search page provided in an embodiment of this application;
Figure 9 is a flowchart of the block processing provided in the embodiments of this application;
Figure 10 is a schematic diagram of the structure of the device for extracting archived content based on multi-keyword matching provided in the embodiments of this application;
Figure 11 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.
[0018] It is understood that the terms “first,” “second,” etc., as used in this application may describe various concepts, but unless otherwise stated, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another. For example, without departing from the scope of the embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word “if” as used herein may be interpreted as “at the time of…”, “when…”, or “in response to a determination.”
[0019] As used in this application, the terms "at least one", "multiple", "each", and "any" have the following meanings: "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to every one of the corresponding multiple; and "any" refers to any one of the multiple.
[0020] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0021] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained first. The nouns and terms involved in the embodiments of this application are subject to the following interpretations.
[0022] 1) Depth-first search (DFS) is an algorithm that prioritizes exploring the depth of a tree or graph. After visiting a node, DFS will explore its child nodes as deeply as possible until there are no more child nodes to visit.
[0023] 2) Breadth-First Search (BFS) is an algorithm that prioritizes exploring the breadth of a tree or graph. BFS first visits all directly connected nodes, and then expands outwards layer by layer.
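The difference between the two traversal orders defined above can be illustrated with a minimal sketch. The folder names in `TREE` are hypothetical and chosen only for illustration; they do not come from this application.

```python
from collections import deque

# Hypothetical directory tree: each key maps to its child folders.
TREE = {
    "archive": ["2024", "2025"],
    "2024": ["contracts", "reports"],
    "2025": ["minutes"],
    "contracts": [],
    "reports": [],
    "minutes": [],
}


def dfs(root):
    """Depth-first: follow one branch to its end before backtracking."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(TREE[node]))
    return order


def bfs(root):
    """Breadth-first: visit all nodes of one layer before the next layer."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(TREE[node])
    return order


print(dfs("archive"))
print(bfs("archive"))
```

With this tree, DFS reaches `contracts` and `reports` before it visits `2025`, whereas BFS visits `2024` and `2025` first and only then descends to the next layer.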
[0024] Among related technologies, existing technologies only support text-type file content search, which makes it impossible for users to comprehensively retrieve all relevant information. The retrieval is limited by file type, and many existing technologies require manual selection of file type, which increases search time and labor costs and affects the user's search experience.
[0025] In summary, the technical problems existing in the relevant technologies need to be improved.
[0026] In view of this, this application provides a method and related equipment for extracting archived content based on multi-keyword matching. This solution, in response to a search operation, obtains the search path, search keywords, and recursive search identifier from the search page. If the recursive search identifier is enabled, the search path and subdirectories are used as the target directory; otherwise, the search path is used as the target directory, adapting to the hierarchical management characteristics of archives. Extracting different types of file content from the target directory can handle multiple file formats, improving search efficiency. Searching for file content using the search keywords yields the content to be archived, facilitating convenient and quick searching of multiple file types and improving retrieval efficiency and accuracy. In response to an export operation, the content to be archived is displayed on the file viewing page. The structured result display and quick location function help users quickly locate file archive content, significantly improving the efficiency of archival staff in receiving and archiving, enriching the content of archives, and enhancing the user experience of archives.
[0027] The method for extracting archived content based on multi-keyword matching provided in this application relates to the field of computer technology. This method can be applied to a terminal, a server, or software running on either a terminal or a server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, or in-vehicle terminal, but is not limited to these. The server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The server can also be a node server in a blockchain network. The software can be an application implementing the method for extracting archived content based on multi-keyword matching, but is not limited to the above forms.
[0028] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0029] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user is obtained through pop-ups or redirection to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data required for the proper functioning of these embodiments acquired.
[0030] Figure 1 is an optional flowchart of the method for extracting archived content based on multi-keyword matching provided in the embodiments of this application. The method in Figure 1 may include, but is not limited to, steps S101 to S105.
[0031] Step S101: In response to the search operation, obtain the search path, search keywords and recursive search identifier from the search page.
[0032] Specifically, the search operation is triggered by clicking, double-clicking, checking, selecting from a drop-down box, or hovering over the search control. The search page includes the search content, which includes the path to be searched, the keywords to be searched, and the recursive search identifier. The keywords to be searched include at least one keyword.
[0033] In some embodiments, when a user triggers a search operation on the search interface, all search content can be obtained through controls on the front-end page, and the user-selected search path, the user-inputted search keywords, and the user-selected recursive search identifier can be extracted from the search content.
[0034] Optionally, the search path can be determined by selecting the file path.
[0035] In some embodiments, in response to a selection operation, the path to be searched is displayed on the search page; in response to an input operation, the entered keywords to be searched are displayed on the search page; and in response to a selection operation, a recursive search control and a recursive search identifier are displayed on the search page.
[0036] It should be noted that multiple keywords are supported for searching. Furthermore, simultaneous searching with multiple keywords is supported (returning all files containing the keywords).
[0037] In this embodiment, in response to a search operation, the path to be searched, the keywords to be searched, and the recursive search identifier are obtained from the search page. This helps to intuitively and clearly obtain the search information entered by the user, preparing for subsequent file content searches.
[0038] In step S102, if the recursive search flag is enabled, the path to be searched and its subdirectories are used as the target directory; otherwise, the path to be searched is used as the target directory.
[0039] In some embodiments, if the recursive search flag is enabled, the path to be searched and all its subdirectories are used as the target directory.
[0040] Optionally, the folder structure can be traversed using a depth-first search (DFS) or breadth-first search (BFS) algorithm.
[0041] Understandably, if the recursive search flag is disabled, the path to be searched will be used as the target directory.
[0042] In this embodiment, if the recursive search flag is enabled, the path to be searched and its subdirectories are used as the target directory; otherwise, the path to be searched is used as the target directory. This allows for dynamic selection of the search directory, improves the accuracy of the search, and prepares for subsequent file content searches.
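Step S102 can be sketched as follows. This is a minimal illustration rather than the implementation of this application: the function names `select_target_directories` and `list_files` are hypothetical, and the standard-library `os.walk` is used as one possible way to enumerate subdirectories.

```python
import os


def select_target_directories(search_path, recursive):
    """Return the directories that make up the target directory (step S102)."""
    if not recursive:
        # Recursive search flag disabled: only the path to be searched.
        return [search_path]
    # Recursive search flag enabled: os.walk yields search_path itself
    # and then every subdirectory beneath it.
    return [dirpath for dirpath, _dirs, _files in os.walk(search_path)]


def list_files(directories):
    """Collect the regular files that sit directly inside each directory."""
    files = []
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            full_path = os.path.join(directory, name)
            if os.path.isfile(full_path):
                files.append(full_path)
    return files
```

Calling `list_files(select_target_directories(path, True))` would return files from the whole subtree, while passing `False` restricts the result to files directly under `path`.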
[0043] Step S103: Extract the contents of different types of files in the target directory.
[0044] In some embodiments, a first file in the target directory is obtained. The first file includes a file type.
[0045] Optionally, if the file type is a compressed file, the first file is decompressed and its format is identified to obtain the decompressed first file.
[0046] Understandably, the corresponding document processor is called based on the file type, and the file content of the first file in the target directory is extracted through the document processor.
[0047] In this embodiment, the content of different types of files in the target directory is extracted, which can handle multiple file formats and improve search efficiency.
[0048] Step S104: Search the file content using the keywords to be searched to obtain the content to be archived.
[0049] It is understandable that the content to be archived is the content obtained from the search that corresponds to the keywords.
[0050] In some embodiments, a string matching algorithm is used to search for file content using keywords to be searched.
[0051] Optionally, a three-level sentence segmentation strategy is employed to identify sentence boundaries in the document content based on Chinese grammatical features, thereby obtaining semantic units. This three-level strategy includes a coarse punctuation segmentation strategy, a semantic unit subdivision strategy, and a phrase segmentation strategy.
[0052] Further, it is determined whether the keyword to be searched exists in the file content; if not, the content to be archived is determined to have no matching results.
[0053] Understandably, if the file content contains the keyword to be searched, then the content to be archived is determined to have a matching result. The current semantic unit centered on the keyword to be searched is determined from the semantic units, and the current semantic unit belongs to the current file. Adjacent segments are merged and overlaps are merged using an overlap detection algorithm. The current semantic unit corresponding to each keyword at each position and the current file corresponding to the current semantic unit are used as the matching result, and the matching result is saved to the content to be archived.
[0054] In this process, a complete semantic unit is retained for each occurrence of the search keyword, and semantic units will be merged if there is overlap.
[0055] In this embodiment, the content of the file is searched by the keywords to be searched to obtain the content to be archived, which is conducive to realizing convenient and fast searching of multiple types of file content and improving retrieval efficiency and accuracy.
[0056] In step S105, in response to the export operation, the content to be archived is displayed on the file viewing page.
[0057] Specifically, the export operation is triggered by clicking, double-clicking, checking, selecting from a drop-down list, or hovering over the export control. It is used to export the content to be archived as a result file and display the result file. The result file includes the content to be archived, the matching results, and the current semantic unit and the current file corresponding to the current semantic unit. The result file is used to display the content to be archived on the search page. The content to be archived includes the matching results, and the matching results include the document path of the current file, the document name of the current file, the keywords to be searched, the current semantic unit, and the archive information corresponding to the current semantic unit.
[0058] In some embodiments, in response to an export operation, the content to be archived is exported as a result file, and the content to be archived in the result file is displayed on the file viewing page.
[0059] Understandably, the exported result file will automatically open and display the intended archived content on the file viewing page.
[0060] The document path and document name of the current file can be displayed via hyperlinks.
[0061] Furthermore, in response to an export operation, a hyperlink between the current semantic unit and the current file is displayed on the file viewing page; in response to a hyperlink selection operation, the current semantic unit of the current file is displayed on the document viewing page.
[0062] In this embodiment, in response to the export operation, the content to be archived is displayed on the file viewing page, providing an intuitive display of the content to be archived. Through the structured result display and quick location function, it is beneficial for users to quickly locate the file archive content, greatly improving the work efficiency of archival staff in scenarios such as file retrieval, borrowing, and compilation, and is beneficial for users to view and operate the results, thus improving the user experience.
[0063] Steps S101 to S105 as illustrated in this embodiment of the application, in response to a search operation, obtain the path to be searched, the keywords to be searched, and the recursive search identifier from the search page; if the recursive search identifier is enabled, the path to be searched and its subdirectories are used as the target directory; otherwise, the path to be searched is used as the target directory, adapting to the hierarchical management characteristics of archives; extracting different types of file content from the target directory can handle multiple file formats and improve search efficiency; searching for file content using the keywords to be searched yields the content to be archived, facilitating convenient and quick searching of multiple types of file content and improving retrieval efficiency and accuracy; in response to an export operation, the content to be archived is displayed on the file viewing page, and the structured result display and quick location function help users quickly locate file archive content, significantly improving the efficiency of archival staff in receiving and archiving, enriching the form and content of archives, and enhancing the user experience of archives.
[0064] Referring to Figure 2, in some embodiments, the method for extracting archived content based on multi-keyword matching provided in this application further includes a search input step, which may include, but is not limited to, steps S201 to S203. Step S201: In response to the selection operation, the path to be searched is displayed on the search page.
[0065] In step S201 of some embodiments, in response to the directory viewing operation, candidate directories are displayed on the directory selection page.
[0066] Furthermore, in response to the selection of a search path, the search path is displayed on the search page.
[0067] In step S202, in response to the input operation, the entered keywords to be searched are displayed on the search page.
[0068] In step S202 of some embodiments, in response to an input operation, the user-inputted keywords to be searched are displayed on the search page.
[0069] It is understandable that the search term can be a single keyword or multiple keywords.
[0070] If there are multiple keywords to be searched, they are separated by delimiters such as “ ”, “;”, “、” and “and”.
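Splitting the entered text into individual keywords can be sketched as below. The delimiter set follows the paragraph above (space, “;”, “、”, and the word “and”); treating the full-width “;” as a delimiter too is an added assumption, and the function name `parse_keywords` is hypothetical.

```python
import re

# Delimiters named in the text: whitespace, ";", "、", and the word "and".
# Accepting the full-width ";" as well is an assumption of this sketch.
# "and" only counts as a delimiter when surrounded by whitespace, so a
# keyword such as "brand" is not split.
_DELIMITERS = re.compile(r"\s+and\s+|[\s;;、]+")


def parse_keywords(raw):
    """Split the raw search input into a list of non-empty keywords."""
    return [part for part in _DELIMITERS.split(raw.strip()) if part]


print(parse_keywords("contract;invoice、receipt and report"))
```

A single keyword with no delimiter is returned as a one-element list, which matches the requirement that the keywords to be searched include at least one keyword.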
[0071] In step S203, in response to the selection operation, a recursive search control and a recursive search icon are displayed on the search page.
[0072] Specifically, the recursive search flags include an enabled flag and a disabled flag.
[0073] In step S203 of some embodiments, in response to the selection operation of the recursive search control, a recursive search identifier is displayed on the search page.
[0074] Referring to Figure 3, in some embodiments, step S103 may include, but is not limited to, steps S301 to S303. Step S301: Obtain the first file in the target directory.
[0075] Specifically, the first file includes the file type.
[0076] In step S301 of some embodiments, all first files in the target directory are obtained.
[0077] Step S302: If the file type is a compressed file, the first file is decompressed and its format is identified to obtain the decompressed first file.
[0078] In step S302 of some embodiments, if the file type is a compressed file type, the first file is recursively decompressed until the file type is no longer a compressed file type.
[0079] Furthermore, the format of the first file is identified to determine the file type of the decompressed first file.
[0080] Step S303: Call the corresponding document processor according to the file type, and extract the file content of the first file through the document processor.
[0081] In step S303 of some embodiments, the corresponding document processor is called according to the file type, and the file content of the first file is extracted by the document processor.
[0082] Optionally, if the first file includes an image, then the text content in the image is extracted.
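Steps S301 to S303 can be sketched as a recursive unpack followed by a lookup in a processor registry. This is a simplified illustration under stated assumptions: only ZIP archives are handled (via the standard-library `zipfile` module), the format is identified by file extension, and the `PROCESSORS` registry contains only a plain-text processor. Processors for PDF, DOCX, XLSX, or image text extraction would need additional libraries and are not shown. All names here are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def extract_text(data):
    """Plain-text document processor: decode bytes as UTF-8."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry mapping a file extension to its document processor.
PROCESSORS = {".txt": extract_text, ".md": extract_text, ".csv": extract_text}


def extract_contents(name, data):
    """Yield (name, text) pairs, unpacking nested ZIP archives recursively."""
    suffix = PurePosixPath(name).suffix.lower()
    if suffix == ".zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Recurse so that archives inside archives are also unpacked.
                inner_name = f"{name}!/{info.filename}"
                yield from extract_contents(inner_name, archive.read(info))
        return
    processor = PROCESSORS.get(suffix)
    if processor is not None:
        yield name, processor(data)
    # Files without a registered processor are skipped in this sketch.
```

The `!/` separator in the yielded names is only a convention of this sketch for showing which archive a file came from.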
[0083] Referring to Figure 4, in some embodiments, step S104 may include, but is not limited to, steps S401 to S405. Step S401: A three-level sentence segmentation strategy is used to identify sentence boundaries in the document content based on Chinese grammatical features, thereby obtaining semantic units.
[0084] Specifically, the three-level sentence segmentation strategy includes a coarse punctuation segmentation strategy, a semantic unit subdivision strategy, and a phrase fine segmentation strategy.
[0085] In step S401 of some embodiments, punctuation marks are used to perform a preliminary rough segmentation of the sentence.
[0086] Furthermore, based on the coarse segmentation results, the semantic units separated by punctuation are further subdivided.
[0087] Optionally, semantic units can be segmented into phrases to extract meaningful semantic units.
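The three-level strategy of step S401 might be sketched as below. The application does not specify the punctuation assigned to each level or when a unit is subdivided further, so the three character sets and the `max_len` rule (descend one level only when a unit is longer than `max_len` characters) are assumptions made purely for illustration.

```python
import re

SENTENCE_END = "。！？!?"   # level 1: coarse punctuation segmentation (assumed set)
CLAUSE_END = "，；：,;:"    # level 2: semantic unit subdivision (assumed set)
PHRASE_END = "、"           # level 3: phrase fine segmentation (assumed set)


def _split_keep(text: str, marks: str) -> list[str]:
    """Split after any character in `marks`, keeping the mark with its segment."""
    parts = re.split(f"(?<=[{re.escape(marks)}])", text)
    return [p.strip() for p in parts if p.strip()]


def segment(text: str, max_len: int = 20) -> list[str]:
    """Return semantic units, descending a level only for over-long units."""
    units: list[str] = []
    for sentence in _split_keep(text, SENTENCE_END):
        if len(sentence) <= max_len:
            units.append(sentence)
            continue
        for clause in _split_keep(sentence, CLAUSE_END):
            if len(clause) <= max_len:
                units.append(clause)
            else:
                units.extend(_split_keep(clause, PHRASE_END))
    return units
```

Keeping each punctuation mark attached to the preceding segment means the units can be concatenated back into the original text, which is convenient when a matched unit is later shown in context.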
[0088] Step S402: Determine whether the keyword to be searched exists in the file content; if it does not, determine that the content to be archived has no matching results.
[0089] In step S402 of some embodiments, if the keyword to be searched does not appear anywhere in the file content, then the content to be archived is determined to have no matching results.
[0090] Step S403: If the content of the file contains the keyword to be searched, then the content to be archived is determined to have a matching result, and the current semantic unit centered on the keyword to be searched is determined from the semantic units.
[0091] Specifically, the current semantic unit belongs to the current file.
[0092] In step S403 of some embodiments, if the file content contains the keyword to be searched, then it is determined that the content to be archived has a matching result.
[0093] Optionally, each current semantic unit centered on the keyword to be searched can be determined from the semantic units of the document content.
[0094] Step S404: Adjacent and overlapping current semantic units are merged using an overlap detection algorithm.
[0095] In step S404 of some embodiments, current semantic units that partially or fully overlap one another are merged by the overlap detection algorithm.
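One common way to realize such overlap detection is classic interval merging, sketched below. Representing each current semantic unit as a half-open character span `(start, end)` within the file content is an assumption of this example, as is the function name `merge_spans`; the application does not disclose the internal representation.

```python
def merge_spans(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge half-open (start, end) spans that overlap or touch each other."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Partial overlap, full containment, or adjacency: extend the last span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_spans([(10, 20), (0, 5), (18, 30), (5, 8), (12, 15)]))
```

Sorting first means each span only needs to be compared with the most recently merged span, so the merge runs in O(n log n) time for n spans.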
[0096] Step S405: Take the current semantic unit corresponding to each keyword to be searched at each position and the current file corresponding to the current semantic unit as the matching result, and save the matching result to the content to be archived.
[0097] In step S405 of some embodiments, the current semantic unit corresponding to each keyword to be searched at each position and the current file corresponding to the current semantic unit are used as the matching result, and the matching result is saved to the content to be archived.
[0098] Referring to Figure 5, in some embodiments, the method for extracting archived content based on multi-keyword matching provided in this application further includes a block partitioning step, which may include, but is not limited to, steps S501 to S505. Step S501: If the content to be archived has matching results, then obtain the number of matching results for the content to be archived.
[0099] In step S501 of some embodiments, the number of matching results for the content to be archived is obtained.
[0100] Step S502: If the number of matching results exceeds the normal mode threshold, then obtain the available memory of the system.
[0101] In step S502 of some embodiments, it is determined whether the number of matching results exceeds the normal mode threshold. If so, the system enters the optimization mode and obtains the available memory.
[0102] Step S503: Determine the block size based on the available system memory.
[0103] Optionally, the default initial block size is 10,000 records.
[0104] In step S503 of some embodiments, the block size is dynamically reduced from the initial value to 5,000 or 2,000 records based on the available system memory.
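Step S503 can be illustrated with a small selection function. The block sizes of 10,000, 5,000, and 2,000 records come from the description, whereas the memory thresholds of 4 GiB and 1 GiB are placeholder assumptions, since the application does not state the cut-off values. The available memory is passed in as a parameter; in practice it could be obtained, for example, from `psutil.virtual_memory().available` if the third-party psutil package is installed.

```python
GIB = 1024 ** 3


def choose_block_size(available_bytes: int) -> int:
    """Pick the number of records per block from the available system memory."""
    if available_bytes >= 4 * GIB:   # assumed threshold: ample memory
        return 10_000                # default initial block size
    if available_bytes >= 1 * GIB:   # assumed threshold: moderate memory pressure
        return 5_000
    return 2_000                     # high memory pressure


if __name__ == "__main__":
    print(choose_block_size(2 * GIB))
```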
[0105] Step S504: Divide the content to be archived into blocks according to the block size to obtain the block result.
[0106] In step S504 of some embodiments, the content to be archived is divided into blocks according to the block size to obtain a block result. The block result includes multiple blocks.
[0107] Step S505: Generate independent temporary files from the block results.
[0108] In step S505 of some embodiments, each block of the block result is generated as an independent temporary file.
[0109] Furthermore, the independent temporary files are merged to obtain the content to be archived.
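Steps S504 and S505, together with the final merge, might look like the sketch below. Storing each matching result as one JSON line, and the function names `write_blocks` and `merge_blocks`, are assumptions chosen for the example; the application does not specify the temporary file format.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_blocks(records: Iterable[dict], block_size: int, workdir: Path) -> list[Path]:
    """Write records to independent temporary files, `block_size` records each."""
    paths: list[Path] = []
    block: list[dict] = []

    def flush() -> None:
        if not block:
            return
        with tempfile.NamedTemporaryFile(
            "w", encoding="utf-8", suffix=".jsonl", dir=workdir, delete=False
        ) as fh:
            for record in block:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(Path(fh.name))
        block.clear()

    for record in records:
        block.append(record)
        if len(block) >= block_size:
            flush()
    flush()  # write the final, possibly smaller, block
    return paths


def merge_blocks(paths: list[Path], out_path: Path) -> int:
    """Stream every temporary file into `out_path`; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    out.write(line)
                    count += 1
    return count
```

Only one block is held in memory at a time, and the merge copies line by line, so peak memory use stays roughly proportional to the block size rather than to the full result set.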
[0110] Figure 6 is a flowchart illustrating a specific implementation of the multi-keyword matching-based content extraction method provided in this application when applied to a file content search system. As shown in Figure 6, the method may include, but is not limited to, the following steps. Step 1: File processing.
[0111] Optionally, the system supports content extraction from multiple file formats.
[0112] The system can automatically select the processing method based on the file extension.
[0113] Optionally, compressed files are processed and their contents are extracted recursively (the files within the compressed package are automatically decompressed and their contents extracted).
[0114] In some embodiments, file processing includes the following steps: (1) intelligent format recognition is first performed based on the file extension and feature code, and the file is automatically routed to the corresponding dedicated processor (e.g., DOCX uses python-docx to parse paragraphs and tables, PDF uses PyMuPDF to extract text block coordinates, and Excel uses openpyxl to read cell values); (2) compressed files are handled by chained processing that combines recursive decompression with format recognition; (3) all extracted content passes through a unified cleaning pipeline (encoding conversion, redundancy filtering, and structure marking) and is then output as standardized text; (4) an exception retry mechanism and memory monitoring ensure processing stability, ultimately achieving efficient content extraction from Word, Excel, PDF, compressed files, and other formats.
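The format recognition in item (1) can be sketched under the assumption that the "feature code" refers to the signature bytes at the start of a file. The signatures used here are well known: PDF files begin with `%PDF-`, and ZIP archives, including the DOCX and XLSX containers, begin with `PK\x03\x04`. The function name `detect_format` and the returned labels are hypothetical.

```python
from pathlib import PurePath

# Leading signature bytes mapped to a coarse format family.
SIGNATURES = [(b"%PDF-", "pdf"), (b"PK\x03\x04", "zip-container")]
# DOCX and XLSX are ZIP containers, so the extension disambiguates them.
ZIP_BASED = {".docx": "docx", ".xlsx": "xlsx"}


def detect_format(filename: str, head: bytes) -> str:
    """Identify a file format from its leading bytes, then from its extension."""
    suffix = PurePath(filename).suffix.lower()
    for signature, family in SIGNATURES:
        if head.startswith(signature):
            if family == "zip-container":
                return ZIP_BASED.get(suffix, "zip")
            return family
    # No known signature: fall back to the extension alone.
    return suffix.lstrip(".") or "unknown"


if __name__ == "__main__":
    print(detect_format("report.docx", b"PK\x03\x04\x14\x00"))
```

Checking the signature before the extension lets a misnamed file (for example a PDF saved with a `.dat` extension) still reach the right processor. The returned label could then be used as the key for looking up the dedicated processor.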
[0115] For example, the file processing flowchart is shown in Figure 7. Step 2: Keyword search.
[0116] For example, a diagram of the search page is shown in Figure 8.
[0117] It should be noted that this embodiment supports simultaneous retrieval of multiple keywords (returning all files containing the keywords). Each time a keyword is found in a file, the retrieved information will be saved.
[0118] During retrieval, intelligent sentence segmentation is performed so that the context of each keyword is preserved.
[0119] In some embodiments, keyword retrieval includes the following steps: (1) a three-level sentence segmentation strategy (basic punctuation segmentation → semantic unit segmentation → phrase segmentation) is adopted and combined with Chinese grammatical features to intelligently identify sentence boundaries; (2) a dynamic keyword context expansion mechanism extends from the keyword in both directions to complete semantic units, and adjacent segments are merged through an overlap detection algorithm; (3) the final result retains the most relevant, high-quality contextual unit for each keyword at each position; (4) the processing incorporates a semantic integrity assessment (such as stop word ratio and position weight) to ensure that the output retains key information while avoiding redundancy.
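The dynamic context expansion in item (2) might be sketched as follows: starting from each keyword occurrence, the span is extended in both directions until a sentence boundary is reached. The boundary character set and the function name `expand_context` are assumptions of this example, and the semantic integrity assessment of item (4) is not modelled here.

```python
BOUNDARIES = "。！？!?\n"  # assumed sentence-boundary characters


def expand_context(text: str, keyword: str) -> list[tuple[int, int, str]]:
    """For each keyword occurrence, return (start, end, snippet) of its sentence."""
    hits: list[tuple[int, int, str]] = []
    if not keyword:
        return hits
    pos = text.find(keyword)
    while pos != -1:
        # Extend left until the previous character is a boundary (or text start).
        start = pos
        while start > 0 and text[start - 1] not in BOUNDARIES:
            start -= 1
        # Extend right until the span ends with a boundary (or text end).
        end = pos + len(keyword)
        while end < len(text) and text[end - 1] not in BOUNDARIES:
            end += 1
        hits.append((start, end, text[start:end]))
        pos = text.find(keyword, pos + 1)
    return hits
```

When a keyword occurs twice in the same sentence, the function returns the same span twice; those duplicates are the kind of fully overlapping units that an overlap detection pass would then merge.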
[0120] Optionally, hyperlinks can be added based on file paths to quickly open documents and view the searched content (i.e., the content to be archived).
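A hyperlink target can be derived from a file path with the Python standard library, as in the brief sketch below. The function name `file_hyperlink` is hypothetical, and how the link is rendered on the file viewing page is outside the scope of this example.

```python
from pathlib import Path


def file_hyperlink(path: str) -> str:
    """Turn a file path into a file:// URI suitable for use as a link target."""
    # resolve() makes the path absolute; as_uri() percent-encodes characters
    # that are not safe in a URI, such as spaces or Chinese characters.
    return Path(path).resolve().as_uri()


if __name__ == "__main__":
    print(file_hyperlink("archive/report 2026.docx"))
```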
[0121] Step 3: Export the results.
[0122] Optionally, the system supports block processing of large result sets.
[0123] The dynamic block partitioning strategy automatically adjusts the block size based on the available system memory.
[0124] For example, the initial block size is 10,000 records, which is dynamically adjusted to 5,000 or 2,000 records based on memory pressure. The block processing flowchart is shown in Figure 9.
[0125] In some embodiments, memory usage is monitored in real time, and block splitting is triggered when the threshold is exceeded; streaming writing is used to avoid memory accumulation; each block generates an independent temporary file, which is then merged.
[0126] Specifically, the result set supports two export modes (optimization mode and normal mode). 1. Optimization mode (recommended for large result sets). Applicable scenarios: result sets exceeding 10,000 records. Block processing: every 5,000 records are written to a temporary file. Memory mapping: disk caching is used instead of in-memory storage. Recoverable progress: after an abnormal interruption, processing can resume from the most recent block. Performance: moderate processing speed, low memory usage, and extremely high stability.
[0127] 2. Normal mode (suitable for small to medium-sized result sets). Applicable scenarios: result sets with fewer than 10,000 records. Batch processing: all data is loaded into memory at once. Real-time preview: progress can be viewed in real time during writing. Performance: fast processing speed and high memory usage, but lower stability; memory overflow may occur when the result set is too large.
[0128] Referring to Figure 10, this application also provides a device for extracting archived content based on multi-keyword matching, which can implement the above method. The device includes: a search module 1001, used to obtain the path to be searched, the keywords to be searched, and the recursive search identifier from the search page in response to a search operation; a directory determination module 1002, used to determine the path to be searched and its subdirectories as the target directory if the recursive search identifier is an enabled identifier, and otherwise to determine the path to be searched as the target directory; a content extraction module 1003, used to extract the content of different types of files in the target directory; a result generation module 1004, used to search the file content using the keywords to be searched to obtain the content to be archived; and a results display module 1005, used to display the content to be archived on the file viewing page in response to an export operation.
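The behaviour of the directory determination module 1002 can be illustrated with a short sketch. Representing the recursive search identifier as a boolean, and the function name `target_directories`, are assumptions made for this example.

```python
import os


def target_directories(search_path: str, recursive: bool) -> list[str]:
    """Return the directories to scan for a given recursive search identifier."""
    if not recursive:
        # Disabled identifier: only the path to be searched itself.
        return [search_path]
    # Enabled identifier: the path to be searched plus all of its subdirectories.
    return [root for root, _dirs, _files in os.walk(search_path)]


if __name__ == "__main__":
    for directory in target_directories(".", recursive=True):
        print(directory)
```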
[0129] It is understood that the content of the above method embodiments is applicable to the present device embodiments. The specific functions implemented by the present device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0130] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described method. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.
[0131] It is understood that the content of the above method embodiments is applicable to this electronic device embodiment. The specific functions implemented by this electronic device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0132] Referring to Figure 11, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes: a processor 1101, which can be implemented using a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application; a memory 1102, which can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM), and which can store the operating system and other applications, wherein, when the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1102 and is called by the processor 1101 to execute the methods described in the embodiments of this application; an input/output interface 1103, used to implement information input and output; a communication interface 1104, used to enable communication and interaction between this device and other devices, wherein communication can be achieved through wired means (such as USB or a network cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth); and a bus 1105, which transmits information between the various components of the device (e.g., the processor 1101, the memory 1102, the input/output interface 1103, and the communication interface 1104); wherein the processor 1101, the memory 1102, the input/output interface 1103, and the communication interface 1104 are communicatively connected to each other within the device via the bus 1105.
[0133] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method.
[0134] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.
[0135] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0136] It is understood that the content of the above method embodiments is applicable to the embodiments of this program product. The specific functions implemented by the embodiments of this program product are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0137] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0138] In the method, apparatus, electronic device, storage medium, and program product for extracting archived content based on multi-keyword matching provided in the embodiments of this application, the path to be searched, the keywords to be searched, and the recursive search identifier are obtained from the search page in response to a search operation. If the recursive search identifier is an enabled identifier, the path to be searched and its subdirectories are used as the target directory; otherwise, the path to be searched is used as the target directory, which adapts to the hierarchical management characteristics of archives. Extracting the content of different types of files from the target directory makes it possible to handle multiple file formats and improves search efficiency. Searching the file content using the keywords to be searched yields the content to be archived, which makes searching multiple types of file content convenient and quick and improves retrieval efficiency and accuracy. In response to an export operation, the content to be archived is displayed on the file viewing page; through structured result display and quick location functions, users can quickly locate the content to be archived, which significantly improves the efficiency of archival staff in receiving and archiving, enriches the form and content of archives, and enhances the user experience of archives.
[0139] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0140] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
[0141] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0142] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0143] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0144] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0145] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units described above is only a logical functional division, and in actual implementation there may be other division methods; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0146] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0147] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0148] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0149] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A method for extracting archived content based on multi-keyword matching, characterized in that the method includes the following steps: in response to a search operation, the path to be searched, the keywords to be searched, and the recursive search identifier are obtained from the search page, wherein the keywords to be searched include at least one keyword; if the recursive search identifier is an enabled identifier, the path to be searched and its subdirectories are used as the target directory; otherwise, the path to be searched is used as the target directory; the content of different types of files in the target directory is extracted; the file content is searched using the keywords to be searched to obtain the content to be archived; and in response to an export operation, the content to be archived is displayed on the file viewing page.
2. The method according to claim 1, characterized in that the method further includes: in response to a selection operation, the path to be searched is displayed on the search page; in response to an input operation, the entered keywords to be searched are displayed on the search page; and in response to a selection operation, a recursive search control and the recursive search identifier are displayed on the search page, wherein the recursive search identifier includes an enabled identifier and a disabled identifier.
3. The method according to claim 1, characterized in that extracting the content of different types of files in the target directory includes: obtaining the first file in the target directory, wherein the first file includes a file type; if the file type is a compressed file, decompressing the first file and identifying its format to obtain the decompressed first file; and invoking the corresponding document processor based on the file type, and extracting the file content of the first file through the document processor.
4. The method according to claim 1, characterized in that searching the file content using the keywords to be searched to obtain the content to be archived includes: adopting a three-level sentence segmentation strategy to identify sentence boundaries of the file content based on Chinese grammatical features, thereby obtaining semantic units, wherein the three-level sentence segmentation strategy includes a coarse punctuation segmentation strategy, a semantic unit subdivision strategy, and a phrase fine segmentation strategy; determining whether the keyword to be searched exists in the file content, and if it does not, determining that the content to be archived has no matching results; if the keyword to be searched exists in the file content, determining that the content to be archived has a matching result, and determining, from the semantic units, the current semantic unit centered on the keyword to be searched, wherein the current semantic unit belongs to the current file; merging adjacent and overlapping current semantic units using an overlap detection algorithm; and using the current semantic unit corresponding to each keyword to be searched at each position, together with the current file corresponding to the current semantic unit, as the matching result, and saving the matching result to the content to be archived.
5. The method according to claim 1, characterized in that the method further includes: if the content to be archived has matching results, obtaining the number of matching results for the content to be archived; if the number of matching results exceeds the normal mode threshold, obtaining the available system memory; determining the block size based on the available system memory; dividing the content to be archived into blocks based on the block size to obtain the block result; and generating independent temporary files from the block result.
6. The method according to claim 1, characterized in that the content to be archived includes matching results, which include the current semantic unit and the current file corresponding to the current semantic unit; and displaying the content to be archived on the file viewing page in response to the export operation includes: in response to the export operation, displaying a hyperlink between the current semantic unit and the current file on the file viewing page; and in response to the selection of the hyperlink, displaying the current semantic unit of the current file on the file viewing page.
7. A device for extracting archived content based on multi-keyword matching, characterized in that the device includes: a search module, used to obtain the path to be searched, the keywords to be searched, and the recursive search identifier from the search page in response to a search operation, wherein the keywords to be searched include at least one keyword; a directory determination module, used to determine the path to be searched and its subdirectories as the target directory if the recursive search identifier is an enabled identifier, and otherwise to determine the path to be searched as the target directory; a content extraction module, used to extract the content of different types of files in the target directory; a result generation module, used to search the file content using the keywords to be searched to obtain the content to be archived; and a results display module, used to display the content to be archived on the file viewing page in response to an export operation.
8. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method according to any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method according to any one of claims 1 to 6.