A document search method and apparatus

By generating multiple search packages and utilizing a binary search engine and inverted index technology, the problem of complex search results in existing technologies is solved, achieving efficient and accurate information filtering and user-interactive search.

CN116701737B · Active · Publication Date: 2026-06-23 · CHINA MOBILE QUANTONG SYST INTEGRATION CO LTD +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA MOBILE QUANTONG SYST INTEGRATION CO LTD
Filing Date: 2022-02-28
Publication Date: 2026-06-23

Application Information

Patent Timeline

28 Feb 2022: Application
23 Jun 2026: Publication (CN116701737B)

IPC: G06F16/951; G06F16/953

AI Tagging

Application Domain

Web data indexing; Web data querying

Technology Topics

Engineering; Data mining

Abstract

The application provides a document search method and device, electronic equipment and computer program product, and relates to the technical field of document search. The method comprises the following steps: generating a plurality of search packages according to search terms; and respectively searching each search package by using a binary search engine to obtain corresponding search results. The application effectively improves the search efficiency by generating a plurality of search packages from the search terms and independently searching each search package. Meanwhile, the binary search engine is used to search the byte sequence of the search terms in the form of a binary file, so that the search process is more efficient, and the efficiency and accuracy of the document search are effectively improved.

Description

Technical Field

[0001] This application relates to the field of document search technology, specifically to a document search method, apparatus, electronic device, and computer program product.

Background Technology

[0002] With the development of the internet, information resources on the internet are becoming increasingly abundant. Users can use information search technology to find the information they need. Typically, the information search process on the internet is based on the search terms entered by the user. Based on the search terms entered by the user, all pages or documents that contain those search terms can be found and ultimately displayed to the user.

[0003] In existing technologies, because the scope of an information search is often large, the final search results are numerous and tend to include many irrelevant results. The search results are not effectively integrated, so users must expend considerable effort to filter out useful information, resulting in low information search efficiency and accuracy.

Summary of the Invention

[0004] This application provides a document search method, apparatus, electronic device, and computer program product to solve the problem of low information search efficiency and accuracy in the prior art.

[0005] In a first aspect, embodiments of this application provide a document search method, including:

[0006] Several search packages are generated based on the obtained search terms; wherein each search package contains several output terms that have a set relevance to the search terms;

[0007] The search is performed on each of the search packages to obtain search results corresponding to each search package; wherein each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and then querying the inverted index corresponding to the byte sequence.

[0008] In one embodiment, the document search method further includes:

[0009] The search results are matched one-to-one with the search packages and fed back as initial search results, and the target initial search results that the user selects from the initial search results are obtained;

[0010] Based on the target initial search results, several search packages are regenerated, and the final retrieval results are obtained by searching based on the regenerated search packages.

[0011] In one embodiment, the step of searching the plurality of search packages respectively to obtain search results corresponding to each of the search packages includes:

[0012] The binary search engine is used to parse the search package to obtain the original byte sequence, and the original byte sequence is divided into several byte subsequences according to a preset fixed length;

[0013] The inverted index corresponding to each byte subsequence is queried to obtain several candidate file identifiers. Then, the intersection of all candidate file identifiers is taken as the search result corresponding to the search package.

[0014] In one embodiment, the document search method further includes:

[0015] Calculate the fuzziness of each of the search packages;

[0016] The initial search results are sorted and fed back based on the fuzziness of each search package.

[0017] In one embodiment, the step of regenerating several search packages based on the target initial search results, and performing a search based on the regenerated search packages to obtain the final retrieval results, includes:

[0018] Based on the target initial search results, several search packages are regenerated, and the fuzziness of each search package is updated;

[0019] Target search packages with fuzziness below a preset fuzziness threshold are selected, and the final retrieval results are obtained by searching based on the target search packages.

[0020] In one embodiment, generating several search packages based on the obtained search terms includes:

[0021] Obtain several output terms that have a set relevance to the search term, and generate several search packages based on the output terms.

[0022] In one embodiment, obtaining a plurality of output terms that have a predetermined relevance to the search term, and generating a plurality of search packages based on the output terms, includes:

[0023] Based on the search terms and the preset relevance threshold range, first-order terms whose relevance to the search terms is within the relevance threshold range are obtained from the pre-stored term library;

[0024] Based on the first-order terms and the relevance threshold range, second-order terms whose relevance to the first-order terms is within the relevance threshold range are obtained from the term library;

[0025] By aggregating terms that are at the same level and within the same relevance threshold range, several search packages are generated.

[0026] In one embodiment, the document search method further includes:

[0027] Training data is obtained from the document corpus, the relevance between each pair of terms in the training data is calculated, and the terms are stored in the term library based on the calculated relevance information.

[0028] In a second aspect, embodiments of this application also provide a document search method, including:

[0029] The search query request is parsed to obtain the original byte sequence, and the original byte sequence is divided into several byte subsequences of fixed length; wherein, the fixed length is determined according to the search query request;

[0030] The inverted index corresponding to each byte subsequence is queried to obtain several candidate file identifiers. Then, the file associated with the intersection of all candidate file identifiers is taken as the search result of the search query request.

[0031] In a third aspect, embodiments of this application provide a document search device, including:

[0032] The search package generation module is used to generate several search packages based on the obtained search terms; wherein each search package contains several output terms that have a set relevance to the search terms;

[0033] The result acquisition module is used to search the plurality of search packages respectively and obtain the search results corresponding to each of the search packages; wherein, each of the search results is obtained by the binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence.

[0034] The result filtering module is used to match the search results one-to-one with the search packages and feed them back as initial search results, and to obtain the target initial search results that the user selects from the initial search results;

[0035] The result update module is used to regenerate several search packages based on the target initial search results, and to perform a search based on the regenerated search packages to obtain the final retrieval results.

[0036] In a fourth aspect, embodiments of this application provide an electronic device, including a processor and a memory storing a computer program, wherein the processor executes the program to implement the steps of the document search method described in the first aspect.

[0037] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the steps of the document search method described in the first aspect.

[0038] The document search method, apparatus, electronic device, and computer program product provided in this application improve search efficiency by generating multiple search packages from search terms and searching them independently. Furthermore, by employing a binary search engine to perform binary file searches on the byte sequences of the terms, the search process becomes more efficient, effectively improving both the efficiency and accuracy of document searches.

Attached Figure Description

[0039] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0040] Figure 1 is a flowchart illustrating the document search method provided in an embodiment of this application;

[0041] Figure 2 is a schematic diagram of the structure of the document search system provided in the embodiments of this application;

[0042] Figure 3 is a schematic diagram of the term management module provided in an embodiment of this application;

[0043] Figure 4 is a schematic diagram of the workflow of the search module provided in an embodiment of this application;

[0044] Figure 5 is a schematic diagram of the structure of the document search device provided in the embodiments of this application;

[0045] Figure 6 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0047] Figure 1 is a flowchart illustrating the document search method. Referring to Figure 1, this application provides a document search method, which may include the following steps:

[0048] S1. Generate several search packages based on the obtained search terms; wherein each search package contains several output terms that have a set relevance to the search terms;

[0049] S2. Search each of the plurality of search packages to obtain search results corresponding to each of the search packages; wherein, each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence.

[0050] In this embodiment, multiple search packages are first generated based on the user-input search terms. Each search package contains several output terms with a set relevance to the search terms. Then, each search package is passed to a different search unit for searching. For each search unit, a binary search engine is used to parse the search package to obtain a byte sequence. Then, the total byte sequence is divided into multiple byte subsequences according to the length of the byte sequence corresponding to the inverted index. Finally, the binary file is searched according to the inverted index of these byte subsequences to obtain the search results corresponding to this search package.
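The dispatch of each search package to its own search unit can be illustrated with a minimal Python sketch. It is not the application's implementation: the names `search_all_packages` and `toy_search_unit`, the thread pool, and the three-document toy corpus are all assumptions made for illustration, and the toy unit matches plain substrings rather than byte sequences in an inverted index.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def search_all_packages(
    packages: Dict[str, Sequence[str]],
    search_unit: Callable[[Sequence[str]], List[int]],
    max_workers: int = 4,
) -> Dict[str, List[int]]:
    """Search every package independently and key each result by its package name."""
    names = list(packages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(search_unit, [packages[name] for name in names]))
    return dict(zip(names, results))


def toy_search_unit(terms: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for a search unit: ids of documents containing all terms."""
    corpus = {1: "binary search engine", 2: "inverted index", 3: "binary inverted index"}
    return sorted(doc_id for doc_id, text in corpus.items() if all(t in text for t in terms))
```

Because each result stays keyed by its package name, the one-to-one matching between search packages and search results described in the later steps falls out of the return value directly.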

[0051] The document search method provided in this application improves search efficiency by generating multiple search packages from search terms and searching them independently. Furthermore, by employing a binary search engine to perform binary file searches on the byte sequences of the terms, the search process becomes more efficient, effectively improving both the efficiency and accuracy of document searches.

[0052] In one embodiment, the document search method may further include the following steps:

[0053] S3. Match the search results one-to-one with the search packages and feed them back as initial search results, and obtain the target initial search results that the user selects from the initial search results;

[0054] S4. Based on the target initial search results, regenerate several search packages, and obtain the final retrieval results by searching based on the regenerated search packages.

[0055] Based on the above embodiments, after obtaining the search results corresponding to each search package, the search results are matched with the search packages and fed back to the interaction module. The user can select the one that best meets the search needs based on multiple search results through the interaction module. Then, the system redetermines the main term based on the selected target initial search results and regenerates multiple search packages for secondary search.

[0056] The document search method provided in this application allows users to select the main term for a secondary search based on the initial search results through an interactive module, and then update the search based on the updated main term, which greatly improves the accuracy of the final search results.

[0057] In one embodiment, the step of searching the plurality of search packages respectively to obtain search results corresponding to each of the search packages includes:

[0058] The binary search engine is used to parse the search package to obtain the original byte sequence, and the original byte sequence is divided into several byte subsequences according to a preset fixed length;

[0059] The inverted index corresponding to each byte subsequence is queried to obtain several candidate file identifiers. Then, the intersection of all candidate file identifiers is taken as the search result corresponding to the search package.

[0060] It should be noted that during the search process, the file identifiers of files containing these byte sequences as file content are obtained by querying the inverted index through byte sequence lookup. Then, the search results of the corresponding search package are determined based on the intersection of these file identifiers.

[0061] In the document search method provided in this application, after the file identifiers associated with the specified byte sequences are obtained, the intersection of these file identifiers is used as the search result of the search package, thereby further improving the accuracy of the search results.

[0062] In one embodiment, the document search method further includes:

[0063] Calculate the fuzziness of each of the search packages;

[0064] The initial search results are sorted and fed back based on the fuzziness of each search package.

[0065] It should be noted that by calculating the fuzziness attribute of each search package and sorting and displaying the search results according to the magnitude of fuzziness, users can more easily find the search results that best meet their search needs, thereby improving the efficiency and convenience of document search.
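As a small illustration of the sorting step, the Python sketch below orders per-package results by a fuzziness value. The function name `order_results_by_fuzziness` is hypothetical, the fuzziness values are supplied as inputs rather than computed, and ascending order (least fuzzy first) is an assumption, since the text only says results are sorted by the magnitude of fuzziness.

```python
from typing import Dict, List, Tuple


def order_results_by_fuzziness(
    results: Dict[str, List[int]],
    fuzziness: Dict[str, float],
) -> List[Tuple[str, List[int]]]:
    """Return (package name, result) pairs, least fuzzy package first (assumed order)."""
    return sorted(results.items(), key=lambda item: fuzziness[item[0]])
```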

[0066] In one embodiment, the step of regenerating several search packages based on the target initial search results, and performing a search based on the regenerated search packages to obtain the final retrieval results, includes:

[0067] Based on the target initial search results, several search packages are regenerated, and the fuzziness of each search package is updated;

[0068] Target search packages with fuzziness below a preset fuzziness threshold are selected, and the final retrieval results are obtained by searching based on the target search packages.

[0069] It should be noted that during the secondary search process, multiple search packages are generated based on the initial search results selected by the user, and the fuzziness of each search package is calculated. Some search packages with excessively high fuzziness are eliminated, and then the target search packages with fuzziness below the preset fuzziness threshold are used for secondary search. This process eliminates some search terms with low relevance, effectively reducing the computational load of the search and further improving the efficiency and accuracy of the search.
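The threshold filter used in the secondary search can be sketched in Python as follows. The function name `select_target_packages` and the dictionary representation of packages are hypothetical, and the fuzziness values are taken as given rather than computed here.

```python
from typing import Dict, List


def select_target_packages(
    packages: Dict[str, List[str]],
    fuzziness: Dict[str, float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Keep only packages whose fuzziness is strictly below the preset threshold."""
    return {name: terms for name, terms in packages.items() if fuzziness[name] < threshold}
```

Only the packages returned by this filter would then be passed on for the secondary search, which is where the reduction in computational load comes from.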

[0070] In one embodiment, generating several search packages based on the obtained search terms includes:

[0071] Obtain several output terms that have a set relevance to the search term, and generate several search packages based on the output terms.

[0072] In this embodiment, a term expander is first used to obtain multiple related output terms based on the user-input search terms, and then a search package generator is used to generate several search packages based on the output terms, ensuring the relevance between the output terms and the search terms, thereby effectively improving the accuracy of the search.

[0073] In one embodiment, obtaining a plurality of output terms that have a predetermined relevance to the search term, and generating a plurality of search packages based on the output terms, includes:

[0074] Based on the search terms and the preset relevance threshold range, first-order terms whose relevance to the search terms is within the relevance threshold range are obtained from the pre-stored term library;

[0075] Based on the first-order terms and the relevance threshold range, second-order terms whose relevance to the first-order terms is within the relevance threshold range are obtained from the term library;

[0076] By aggregating terms that are at the same level and within the same relevance threshold range, several search packages are generated.

[0077] In this embodiment, to appropriately expand the user-input search terms, multiple levels of terms can be generated when generating output terms. Output terms generated based on the search terms are called first-order terms, and output terms obtained using first-order terms as input terms are called second-order terms, and so on. The specific order of the output terms can be determined according to requirements. Then, the search package generator packages these output terms, grouping terms of the same order and the same relevance threshold range into one search package. It should be noted that not all terms of the same order will be in the same search package.

[0078] The document search method provided in this application expands the search terms appropriately and summarizes them into multiple search packages according to hierarchy and relevance range. This enables the search to find more related search results even when the user fails to enter accurate search terms, and allows for faster and more accurate search for the results the user wants.

[0079] In one embodiment, the document search method further includes:

[0080] Training data is obtained from the document corpus, the relevance between each pair of terms in the training data is calculated, and the terms are stored in the term library based on the calculated relevance information.

[0081] It should be noted that the term library used above stores and records the relevance between terms. The term relevance is calculated by a term relevance trainer using training data obtained from a document corpus. By pre-calculating term relevance on the training data and recording the term relevance information, this application provides a foundation for term expansion, thereby improving the efficiency and accuracy of subsequent searches.

[0082] On the other hand, embodiments of this application also provide a document search method, which can be executed by the binary search engine, including the following steps:

[0083] S1. Parse the search query request to obtain the original byte sequence, and divide the original byte sequence into several byte subsequences of fixed length; wherein, the fixed length is determined according to the search query request;

[0084] S2. Query the inverted index corresponding to each byte subsequence to obtain several candidate file identifiers, and then use the file associated with the intersection of all candidate file identifiers as the search result of the search query request.

[0085] In this embodiment of the application, during the search process, the file identifiers of files containing these byte sequences as file content are obtained by querying the inverted index through byte sequence lookup. Then, the search results of the corresponding search package are determined based on the intersection of these file identifiers.

[0086] The document search method provided in this application uses a binary search engine for partial search, which improves the efficiency of the search process. After the file identifiers associated with the specified byte sequences are obtained, the intersection of these file identifiers is used as the search result of the search package, thereby further improving the accuracy of the search results.

[0087] Based on the above scheme, and to facilitate a better understanding of the document search method provided in the embodiments of this application, the following detailed explanation is provided:

[0088] It should be noted that the embodiments of this application mainly solve the following problems:

[0089] Existing technologies have a wide search scope and complex search results. This application's embodiments will accurately segment search terms and perform distributed, precise search to solve the problem of low search efficiency.

[0090] Existing technologies lack interactivity, forcing users to manually filter through numerous search results. This application's embodiment categorizes and displays search results, allowing users to modify the main term based on the results and continue searching on top of the original search, thus solving the problems of low interactivity and high filtering costs.

[0091] Existing technologies lack a connection mechanism between search terms. This application's embodiments solve the search problem when users cannot provide precise search terms by incorporating a unique fuzziness algorithm mechanism.

[0092] Existing technologies perform overall searches on search terms, resulting in low search efficiency. This application's embodiments improve the efficiency of the search process by employing a binary engine for partial searches.

[0093] Referring to Figure 2, the search system used in this application includes a term management module, a search module, and an interaction module. The user inputs search terms through the interaction module. The term management module generates multiple search packages based on the input search terms, each containing multiple terms with similar meanings; terms in different search packages are related but distinct. Each search package has a fuzziness attribute that reflects the relevance between the terms in the package and the user-input term. Each search package is sent to a different search unit within the search module for searching, and the search results are displayed in the interaction module unit by unit. Based on the search results, the user selects a term from one of the search packages as the main term. The term management module then updates the fuzziness of the search packages, generating new search packages and/or canceling some older ones. Whether a search package is generated or canceled depends on whether its fuzziness attribute meets the preset search requirements.

[0094] The term management module and interaction module can be integrated into the user client, which can be installed on a personal computer or mobile device. The search module can be or includes a server or server cluster, multiple distributed server clusters, a mainframe, workstation, personal computer, tablet computer, personal digital assistant, cellular phone, media center, embedded system, or any other type of device. An implementation on a specific computing device is called a search unit.

[0095] Referring to Figure 3, the term management module includes a term library, a term relevance trainer, a term expander, a fuzziness calculation processor, and a search package generator. Specifically, the term library stores term information, the term relevance trainer obtains the relevance between two terms, the term expander obtains output terms that have a set relevance to the input terms, the fuzziness calculation processor calculates the fuzziness of each search package, and the search package generator aggregates highly relevant terms into a search package.

[0096] The term library will store and manage terms based on their relevance. When a new term needs to be added, the relevance between the original term and the new term will be obtained through the term relevance trainer, and the new term will be stored in the original term area with the highest relevance.
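The insertion rule can be sketched in Python under a simplifying assumption: each "area" of the term library is modeled as a list keyed by one existing anchor term. The function name `add_term`, the dictionary layout, and the `relevance` callable standing in for the term relevance trainer are all hypothetical.

```python
from typing import Callable, Dict, List


def add_term(
    library: Dict[str, List[str]],
    new_term: str,
    relevance: Callable[[str, str], float],
) -> str:
    """Store new_term in the area of the most relevant existing term; return that term."""
    if not library:
        # Empty library: the new term starts its own area.
        library[new_term] = [new_term]
        return new_term
    best = max(library, key=lambda anchor: relevance(anchor, new_term))
    library[best].append(new_term)
    return best
```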

[0097] The training data for the term relevance trainer comes from a document corpus. Relevance is calculated by statistically analyzing the probability that one term and another term appear in the same region of the same document. This probability is bidirectional. For example, the probability that term b appears in a document region containing term a is α, and the probability that term a appears in a document region containing term b is β. α and β are two different values. The relevance P between term a and term b is: P = e^(-α / β)α·β-1 ;

[0098] The term relevance trainer records the relevance between two terms and updates the relevance as the training data increases.
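The two directional probabilities α and β can be estimated from region-level co-occurrence counts, as in the following Python sketch. The class name `CooccurrenceCounter`, its methods, and the representation of a document region as a list of terms are assumptions. The sketch stops at α and β and does not compute the relevance P.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


class CooccurrenceCounter:
    """Running region-level counts, so estimates can be refreshed as data grows."""

    def __init__(self) -> None:
        self.term_regions: Counter = Counter()  # term -> regions containing it
        self.pair_regions: Counter = Counter()  # {a, b} -> regions containing both

    def add_region(self, terms: Iterable[str]) -> None:
        """Record one document region, counting each distinct term once."""
        unique = sorted(set(terms))
        self.term_regions.update(unique)
        self.pair_regions.update(frozenset(pair) for pair in combinations(unique, 2))

    def probabilities(self, a: str, b: str) -> Tuple[float, float]:
        """Return (alpha, beta): P(b in region | a in region), P(a in region | b in region)."""
        both = self.pair_regions[frozenset((a, b))]
        alpha = both / self.term_regions[a] if self.term_regions[a] else 0.0
        beta = both / self.term_regions[b] if self.term_regions[b] else 0.0
        return alpha, beta
```

Calling `add_region` for each newly ingested region keeps the counts current, which is one simple way to realize the update of relevance as the training data increases.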

[0099] The term expander may include an input interface, an output interface, and an expansion channel. A term and a threshold range are input through the input interface. The expansion channel retrieves terms from the term library whose relevance to the input term falls within the threshold range based on the threshold, and outputs the retrieved terms through the output interface.

[0100] By inputting one of the terms output by the output interface back into the input interface, a new output term can be obtained. It should be noted that the new output term will no longer contain terms that have already been output in the term expander.

[0101] The output term obtained by using the main term as the input term is called a first-order term, the output term obtained by using a first-order term as the input term is called a second-order term, and so on.
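The expander's behavior, including the rule that already-output terms are not output again, can be sketched in Python. The names `expand` and `expand_orders`, the `relevance` mapping keyed by (input term, output term), and the inclusive threshold range are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Relevance = Dict[Tuple[str, str], float]


def expand(term: str, relevance: Relevance, low: float, high: float, seen: Set[str]) -> List[str]:
    """Output terms whose relevance to `term` lies in [low, high] and were not output before."""
    out = []
    for (src, dst), p in relevance.items():
        if src == term and low <= p <= high and dst not in seen:
            out.append(dst)
            seen.add(dst)
    return out


def expand_orders(main_term: str, relevance: Relevance, low: float, high: float,
                  max_order: int) -> Dict[int, List[str]]:
    """Map each order (1 = first-order, 2 = second-order, ...) to its output terms."""
    seen = {main_term}
    orders: Dict[int, List[str]] = {}
    frontier = [main_term]
    for order in range(1, max_order + 1):
        next_frontier: List[str] = []
        for t in frontier:
            # Feed each term of the previous order back in as an input term.
            next_frontier.extend(expand(t, relevance, low, high, seen))
        if not next_frontier:
            break
        orders[order] = next_frontier
        frontier = next_frontier
    return orders
```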

[0102] The search package generator will group output terms that are at the same level and have high relevance into a search package. It should be noted that not all terms at the same level will be in the same search package. For example, if terms b and c are in the same relevance range relative to term a, but the relevance between terms b and c is not high enough to meet the condition of high relevance, then terms b and c will be divided into different search packages.
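One way to realize this grouping is a greedy pass over the terms of one order, sketched below in Python. The greedy strategy, the function name `build_packages`, the `mutual_relevance` callable, and the `min_mutual` cutoff are assumptions; the text does not specify how the grouping is computed.

```python
from typing import Callable, List


def build_packages(
    terms: List[str],
    mutual_relevance: Callable[[str, str], float],
    min_mutual: float,
) -> List[List[str]]:
    """Greedily group same-order terms so every pair within a package is highly relevant."""
    packages: List[List[str]] = []
    for term in terms:
        for package in packages:
            if all(mutual_relevance(term, member) >= min_mutual for member in package):
                package.append(term)
                break
        else:
            # No existing package fits, so the term starts a new one.
            packages.append([term])
    return packages
```

With terms b and c whose mutual relevance falls below the cutoff, the sketch places them in different packages, matching the example in the paragraph above.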

[0103] The fuzziness calculation processor is used to calculate the fuzziness of each search package. The fuzziness M of a search package consisting of n-order terms is:

[0104]

[0105] Here, P_i represents the relevance between an i-th order term and its corresponding input term.

[0106] Referring to Figure 4, the search module includes a binary search engine and an inverted index, which together enable searching of binary files. The inverted index can be built from a document corpus, which can be created by one service or entity and subsequently provided to other services and/or entities. Both the binary search engine and the inverted index can be implemented on one or more computing devices. Furthermore, the inverted index can be stored on the disk storage of the computing device.

[0107] The binary search engine is configured to receive search queries formed from the terms within a search package. From those terms, the binary search engine can generate fixed-length byte sequences: it can identify every possible contiguous byte sequence of a specific length contained in the query, where this length corresponds to the fixed length used by the inverted index.

[0108] For example, if a term generates the byte sequence "03 62 D1 34 12 00", then the binary search engine can determine the following sequences to search for: "03 62 D1 34", "62 D1 34 12", and "D1 34 12 00".
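The enumeration in this example amounts to a sliding window over the query bytes. A minimal Python sketch follows; the function name `sliding_subsequences` is hypothetical, and a window length of 4 is taken from the example rather than prescribed by the text.

```python
from typing import List


def sliding_subsequences(data: bytes, length: int) -> List[bytes]:
    """Return every contiguous byte subsequence of the given fixed length."""
    if length <= 0:
        raise ValueError("length must be positive")
    # A query shorter than the window yields no subsequences.
    return [data[i:i + length] for i in range(len(data) - length + 1)]


# The six-byte example with a window of 4 bytes gives three subsequences.
example = sliding_subsequences(bytes.fromhex("0362D1341200"), 4)
```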

[0109] After determining the fixed-length byte sequences, the binary search engine queries the inverted index for each byte sequence and obtains the file identifiers of the files that include these byte sequences as file content. The binary search engine can then take further action to verify the received byte sequence; this verification can be implemented by the binary search engine or by another component of the computing device.

[0110] The inverted index can specify fixed-length byte sequences, and for each specified byte sequence, the inverted index can also specify one or more file identifiers for the files containing that byte sequence as file content. The inverted index can be generated by the binary search engine, by another component of the computing device, or by other computing devices.

[0111] The inverted index can be generated or updated periodically from the document corpus. The inverted index can also be generated or updated in response to changes or additions to the document corpus. To build the inverted index, each fixed-length byte sequence encountered in a file of the document corpus is added to the set of byte sequences specified by the inverted index.

[0112] When a byte sequence is encountered, the generation component can determine whether that byte sequence has already been specified. If specified, the file identifier of the currently processed file is associated with the specified byte sequence. If not specified, it is added, and the file identifier of the currently processed file is associated with the added byte sequence.
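The index construction in [0111] and [0112] can be sketched with an in-memory dictionary. The function name, the integer file identifiers, and the two-file corpus are illustrative assumptions; the description mentions disk storage, which this sketch does not model.

```python
from collections import defaultdict
from typing import Dict, Mapping, Set


def build_inverted_index(
    corpus: Mapping[int, bytes], length: int
) -> Dict[bytes, Set[int]]:
    """Map each fixed-length byte sequence to the ids of files containing it."""
    index: Dict[bytes, Set[int]] = defaultdict(set)
    for file_id, content in corpus.items():
        for i in range(len(content) - length + 1):
            # A new sequence is added on first sight; a known sequence
            # simply gains the current file identifier.
            index[content[i:i + length]].add(file_id)
    return dict(index)


# Hypothetical corpus: file identifier -> file content.
corpus = {
    1: bytes.fromhex("0362D1341200"),
    2: bytes.fromhex("62D13412FF"),
}
index = build_inverted_index(corpus, 4)
```

Here the sequence "62 D1 34 12" occurs in both files, so its entry lists both identifiers.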

[0113] The binary search engine determines the fixed-length byte sequences corresponding to a search query. It queries the inverted index for each of these byte sequences and, in response, obtains the file identifiers associated with them. Upon obtaining the file identifiers associated with the byte sequences of the search query, the binary search engine determines the intersection of these results as the search result for the corresponding search package. For example, if the binary search engine searches for three byte sequences, and the first sequence is associated with file identifiers 1, 3, and 4, the second with file identifiers 1, 2, and 4, and the third with file identifiers 1, 4, and 30, then the intersection of the results includes file identifiers 1 and 4. The binary search engine will then return indications of the files associated with file identifiers 1 and 4 as the search results.
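The intersection step in [0113] can be reproduced with the numbers from the example. The byte-sequence keys below are placeholders, and `intersect_postings` is a hypothetical helper name.

```python
from typing import Dict, List, Set


def intersect_postings(
    index: Dict[bytes, Set[int]], sequences: List[bytes]
) -> Set[int]:
    """Return the file ids associated with every one of the byte sequences."""
    if not sequences:
        return set()
    # A sequence missing from the index contributes an empty set,
    # which makes the whole intersection empty.
    postings = [index.get(seq, set()) for seq in sequences]
    return set.intersection(*postings)


# Placeholder keys; the file identifiers are those given in [0113].
index = {
    b"AAAA": {1, 3, 4},
    b"BBBB": {1, 2, 4},
    b"CCCC": {1, 4, 30},
}
result = intersect_postings(index, [b"AAAA", b"BBBB", b"CCCC"])
```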

[0114] Binary search engines or other components can perform further validation operations on the files identified by the intersection of the results. For example, files associated with file identifiers 1 and 4 can be evaluated to ensure that the search query criteria are met before these file identifiers are returned as search results.
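One reason validation is useful: a file can contain every fixed-length window of a query without containing the query itself as one contiguous run. The sketch below assumes the "search query criteria" amount to a plain substring check, which the description does not specify; the function name and sample files are hypothetical.

```python
from typing import Dict, Iterable, List


def verify_candidates(
    query: bytes, candidates: Iterable[int], files: Dict[int, bytes]
) -> List[int]:
    """Keep only the candidate files that contain the full query bytes."""
    return sorted(fid for fid in candidates if query in files.get(fid, b""))


# With a window length of 4, the query b"ABCDE" has windows b"ABCD" and b"BCDE".
# File 4 contains both windows, but never the contiguous query.
files = {
    1: b"xxABCDExx",
    4: b"ABCDxxBCDE",
}
verified = verify_candidates(b"ABCDE", {1, 4}, files)
```

Both files would survive the intersection of window lookups, but only file 1 passes the check.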

[0115] The operations associated with a binary search engine include receiving a search query request, searching the inverted index for the byte sequence corresponding to the search query request, determining the intersection of the search results, and returning an indication of the file identified in the intersection. Specifically:

[0116] 1. An inverted index is generated from a document corpus by a system comprising one or more processors.

[0117] The generation process may include specifying at least a subset of the fixed-length byte sequences found in at least one file of the document corpus and, for each byte sequence in the subset, specifying the file identifiers of one or more files in the corpus that contain that byte sequence. The inverted index may be distributed across multiple computing devices.

[0118] 2. The search engine receives search query requests.

[0119] 3. The search engine determines multiple fixed-length byte sequences corresponding to the search query request.

[0120] 4. The search engine looks up each byte sequence in the inverted index, which specifies fixed-length byte sequences and, for each specified byte sequence, the file identifiers of the files that include it.

[0121] 5. The search engine determines the intersection of the search results.

[0122] 6. The search engine verifies whether the search results included in the intersection meet the search query criteria.

[0123] 7. The search engine responds to a search query request by returning an indication of the files associated with the file identifiers included in the intersection.
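Steps 1 to 7 above can be tied together in one small sketch. The class name `BinarySearchEngine`, the in-memory corpus, and the substring check used for step 6 are assumptions for illustration, not the implementation described in this application.

```python
from collections import defaultdict
from typing import Dict, List, Set


class BinarySearchEngine:
    """Toy engine over an in-memory corpus (file id -> file bytes)."""

    def __init__(self, corpus: Dict[int, bytes], length: int) -> None:
        self.corpus = corpus
        self.length = length
        self.index: Dict[bytes, Set[int]] = defaultdict(set)
        # Step 1: generate the inverted index from the document corpus.
        for file_id, content in corpus.items():
            for window in self._windows(content):
                self.index[window].add(file_id)

    def _windows(self, data: bytes) -> List[bytes]:
        n = self.length
        return [data[i:i + n] for i in range(len(data) - n + 1)]

    def search(self, query: bytes) -> List[int]:
        # Steps 2 and 3: receive the query and derive its byte sequences.
        sequences = self._windows(query)
        if not sequences:
            return []  # queries shorter than the fixed length are not handled
        # Step 4: look up each byte sequence in the inverted index.
        postings = [self.index.get(seq, set()) for seq in sequences]
        # Step 5: intersect the per-sequence results.
        candidates = set.intersection(*postings)
        # Step 6: verify the candidates (assumed here: substring match).
        verified = [fid for fid in candidates if query in self.corpus[fid]]
        # Step 7: return the identifiers of the matching files.
        return sorted(verified)


corpus = {
    1: bytes.fromhex("AA0362D1341200BB"),
    2: bytes.fromhex("62D13412"),
    3: bytes.fromhex("0362D134" "7A7A" "D1341200" "7A7A" "62D13412"),
}
engine = BinarySearchEngine(corpus, length=4)
hits = engine.search(bytes.fromhex("0362D1341200"))
```

File 3 contains all three windows of the query in separate places, so it reaches step 5 but is rejected in step 6; only file 1 is returned.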

[0124] It should be noted that the key points of the embodiments of this application are as follows:

[0125] 1) A fuzzy algorithm mechanism is used between terms to expand the search terms and divide the search package, so that each search unit can perform an independent and accurate search, which greatly improves the search efficiency.

[0126] 2) Each search unit first provides partial search results. Users can then modify the main term based on these partial results, which changes the search process accordingly. This creates a strong interaction between the user and the search, enabling them to retrieve the required documents more quickly.

[0127] 3) The search module extracts multiple byte sequences from the search terms and performs the search through a binary search engine, which improves the efficiency of the search process itself.

[0128] Compared with the prior art, the embodiments of this application have the following beneficial effects:

[0129] 1. The embodiments of this application expand the search terms into multiple search packages based on the relevance of the terms, and use a distributed search method to perform precise searches on each search package. Each search unit is targeted when searching for a search package, and each search unit operates independently, which greatly improves search efficiency.

[0130] 2. The search package in this application embodiment has a fuzziness attribute, which enables users to find the desired results even when they cannot provide precise search terms.

[0131] 3. The result feedback in this application embodiment corresponds to the search package. The search results are displayed according to the fuzziness attribute of the search terms, which makes it more time-saving and labor-saving for users to filter out effective results.

[0132] 4. In this embodiment of the application, a binary search engine is used for specific searches, and shorter byte sequences are extracted from the search terms, making the search process more efficient.
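The independent search units mentioned in item 1 above can be sketched with a thread pool, where each search package is handed to its own task. The thread pool stands in for the distributed search units, and `search_packages`, `toy_search_unit`, and the sample documents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def search_packages(
    packages: Dict[str, Sequence[str]],
    search_unit: Callable[[Sequence[str]], List[int]],
    max_workers: int = 4,
) -> Dict[str, List[int]]:
    """Run one independent search per package and collect results by package."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(search_unit, terms)
            for name, terms in packages.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Stand-in corpus and search unit, used only to make the sketch runnable.
documents = {
    1: "inverted index search",
    2: "binary search engine",
    3: "fuzzy term expansion",
}


def toy_search_unit(terms: Sequence[str]) -> List[int]:
    # Returns ids of documents that contain every term in the package.
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if all(term in text for term in terms)
    )


results = search_packages(
    {"pkg_a": ["search"], "pkg_b": ["binary", "engine"]},
    toy_search_unit,
)
```

Keeping results keyed by package preserves the one-to-one correspondence between search packages and search results that the result feedback relies on.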

[0133] The document search device provided in the embodiments of this application is described below. The document search device described below and the document search method described above may be read with reference to each other.

[0134] Referring to Figure 5, this application provides a document search device, including:

[0135] Search package generation module 1 is used to generate several search packages based on the obtained search terms; wherein, each search package contains several output terms that have a set relevance to the search terms;

[0136] Result acquisition module 2 is used to search the plurality of search packages respectively and obtain search results corresponding to each search package; wherein, each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence.

[0137] In one embodiment, the document search device further includes:

[0138] The result filtering module is used to match the search results with the search package one by one and feed them back to the interaction module as the initial search results, and to determine the target initial search results selected by the user through the interaction module.

[0139] The result update module is used to regenerate several search packages from the target initial search results using the term management module, and to perform a search based on the regenerated search packages to obtain the final retrieval results.
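The ordering and selection performed around these two modules can be sketched as follows. The fuzziness values are taken as precomputed inputs, the ascending sort order is an assumption, and the names `PackageResult`, `rank_by_fuzziness`, and `select_targets` are hypothetical; the comparison against a preset fuzziness threshold follows the claims.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PackageResult:
    package_id: str
    fuzziness: float      # assumed to be computed elsewhere
    file_ids: List[int]   # search result for this package


def rank_by_fuzziness(results: List[PackageResult]) -> List[PackageResult]:
    """Order initial results so that the least fuzzy package comes first."""
    return sorted(results, key=lambda r: r.fuzziness)


def select_targets(
    results: List[PackageResult], threshold: float
) -> List[PackageResult]:
    """Keep packages whose fuzziness is below the preset threshold."""
    return [r for r in results if r.fuzziness < threshold]


# Hypothetical initial results, one per search package.
initial = [
    PackageResult("p1", 0.7, [3]),
    PackageResult("p2", 0.2, [1, 4]),
    PackageResult("p3", 0.4, [2]),
]
ranked_ids = [r.package_id for r in rank_by_fuzziness(initial)]
target_ids = [r.package_id for r in select_targets(initial, threshold=0.5)]
```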

[0140] It is understood that the above-described device embodiments correspond to the method embodiments of this application. The document search device provided in the embodiments of this application can implement the document search method provided in any one of the method embodiments of this application.

[0141] Figure 6 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 6, the electronic device may include: a processor 610, a communication interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 can call a computer program in the memory 630 to execute the steps of a document search method, for example including:

[0142] S1. Generate several search packages based on the obtained search terms; wherein each search package contains several output terms that have a set relevance to the search terms;

[0143] S2. Search each of the plurality of search packages to obtain search results corresponding to each of the search packages; wherein, each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence.

[0144] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0145] In another aspect, embodiments of this application also provide a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the steps of the document search method provided in the above embodiments, for example including:

[0146] S1. Generate several search packages based on the obtained search terms; wherein each search package contains several output terms that have a set relevance to the search terms;

[0147] S2. Search each of the plurality of search packages to obtain search results corresponding to each of the search packages; wherein, each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence.

[0148] In another aspect, embodiments of this application also provide a processor-readable storage medium storing a computer program for causing a processor to execute the steps of the document search method provided in the above embodiments, for example including:

[0149] S1. Generate several search packages based on the obtained search terms; wherein each search package contains several output terms that have a set relevance to the search terms;

[0150] S2. Search each of the plurality of search packages to obtain search results corresponding to each of the search packages; wherein, each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence.

[0151] The processor-readable storage medium can be any available medium or data storage device that the processor can access, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO)), optical memory (e.g., CD, DVD, BD, HVD), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state drive (SSD)).

[0152] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0153] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0154] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A document search method, characterized in that it comprises: obtaining, from a pre-stored term library and based on the obtained search terms and a preset relevance threshold range, first-order terms whose relevance to the search terms is within the relevance threshold range; obtaining, from the term library and based on the first-order terms and the relevance threshold range, second-order terms whose relevance to the first-order terms is within the relevance threshold range; generating several search packages by aggregating terms that are of the same order and fall within the same relevance threshold range; searching each of the search packages to obtain search results corresponding to each search package, wherein each search result is obtained by a binary search engine parsing the search package to obtain a byte sequence and then querying the inverted index corresponding to the byte sequence; mapping the search results one-to-one to the search packages as the initial search results; calculating the fuzziness of each search package; sorting and feeding back the corresponding initial search results according to the fuzziness of each search package; obtaining the target initial search results selected by the user from the initial search results; regenerating several search packages based on the target initial search results, and updating the fuzziness of each search package; and selecting target search packages whose fuzziness is below a preset fuzziness threshold, and obtaining the final search results based on the target search packages; wherein the fuzziness M of a search package composed of n-order terms is: [formula omitted in source]; where P_i represents the relevance between an i-th order term and its corresponding input term.

2. The document search method according to claim 1, characterized in that the step of searching each of the search packages to obtain search results corresponding to each of the search packages includes: parsing the search package with the binary search engine to obtain an original byte sequence, and dividing the original byte sequence into several byte subsequences according to a preset fixed length; querying the inverted index corresponding to each byte subsequence to obtain several candidate file identifiers; and taking the intersection of all candidate file identifiers as the search result corresponding to the search package.

3. The document search method according to claim 1, characterized in that it further includes: obtaining training data from the document corpus, calculating the relevance between each pair of terms in the training data, and storing the terms in the term library based on the calculated relevance information.

4. A document search method, characterized in that it comprises: parsing the search query request to obtain an original byte sequence, and dividing the original byte sequence into several byte subsequences of fixed length, wherein the fixed length is determined according to the search query request; querying the inverted index corresponding to each byte subsequence to obtain several candidate file identifiers, and taking the files associated with the intersection of all candidate file identifiers as the search result of the search query request; matching the search results one-to-one with each search package corresponding to the search query request as the initial search results; calculating the fuzziness of each search package; sorting and feeding back the corresponding initial search results according to the fuzziness of each search package, and obtaining the target initial search results selected by the user from the initial search results; regenerating several search packages based on the target initial search results, and updating the fuzziness of each search package; and selecting target search packages whose fuzziness is below a preset fuzziness threshold, and obtaining the final search results based on the target search packages; wherein the search packages are generated in the following manner: obtaining, from a pre-stored term library and based on the obtained search terms and a preset relevance threshold range, first-order terms whose relevance to the search terms is within the relevance threshold range; obtaining, from the term library and based on the first-order terms and the relevance threshold range, second-order terms whose relevance to the first-order terms is within the relevance threshold range; and generating several search packages by aggregating terms that are of the same order and fall within the same relevance threshold range; and wherein the fuzziness M of a search package composed of n-order terms is: [formula omitted in source]; where P_i represents the relevance between an i-th order term and its corresponding input term.

5. A document search device, characterized in that it comprises: a search package generation module, used to obtain, from a pre-stored term library and based on the obtained search terms and a preset relevance threshold range, first-order terms whose relevance to the search terms is within the relevance threshold range; to obtain, from the term library and based on the first-order terms and the relevance threshold range, second-order terms whose relevance to the first-order terms is within the relevance threshold range; and to aggregate terms of the same order and the same relevance threshold range to generate several search packages; a result acquisition module, used to search each of the search packages and obtain the search results corresponding to each of the search packages, wherein each of the search results is obtained by the binary search engine parsing the search package to obtain a byte sequence and querying the inverted index corresponding to the byte sequence; a result filtering module, used to match the search results one-to-one with the search packages as the initial search results, calculate the fuzziness of each search package, sort and return the corresponding initial search results according to the fuzziness of each search package, and obtain the target initial search results selected by the user from the initial search results; and a result update module, used to regenerate several search packages based on the target initial search results, update the fuzziness of each search package, select target search packages whose fuzziness is below a preset fuzziness threshold, and obtain the final search results based on the target search packages; wherein the fuzziness M of a search package composed of n-order terms is: [formula omitted in source]; where P_i represents the relevance between an i-th order term and its corresponding input term.

6. An electronic device comprising a processor and a memory storing a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the document search method according to any one of claims 1 to 3.

7. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the document search method according to any one of claims 1 to 3.

Citation Information

Patent Citations

CN112835923A
US20180196943A1
