Data processing method and electronic device
By generating a target index for medium-sized object data and employing a head-and-tail double sampling technique, the problem of insufficient query efficiency and flexibility for medium-sized object data in the HBase database is solved, enabling efficient multi-dimensional queries and fast data location.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2026-05-27
- Publication Date
- 2026-06-26
Figure CN122285752A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data storage technology, and in particular to a data processing method and an electronic device.
Background Technology
[0002] HBase is a distributed, column-oriented open-source database. As a commonly used distributed database in the big data field, HBase plays a crucial role in supporting the storage of large amounts of data. However, for medium-sized object data ranging from tens of kilobytes (KB) to 10 megabytes (MB), such as document fragments, thumbnails, and structured attachments, the query process is limited by HBase's native retrieval architecture, which only supports single-point queries based on row keys. This results in low query efficiency and flexibility. Therefore, finding a way to query medium-sized object data that improves query efficiency and flexibility is a pressing technical problem.
Summary of the Invention
[0003] This application provides a data processing method and an electronic device to at least solve the problem of low query efficiency and query flexibility of medium-sized object data in the related art.
[0004] This application provides a data processing method, including: in response to a user's data write request, storing the written data corresponding to the data write request, and determining whether the written data is medium-sized object data, where medium-sized object data is object data whose data size is within a preset range; if the written data is determined to be medium-sized object data, extracting the metadata information corresponding to the written data and obtaining the key information corresponding to the written data, where the metadata information includes the file storage path of the written data, and the key information includes at least one of keywords and data fingerprint information; and generating, based on the metadata information and the key information, a target index corresponding to the written data, and writing the target index into a preset search and analysis engine. When the key information includes keywords, obtaining the key information corresponding to the written data includes: determining whether the parsing method corresponding to the written data is the first parsing method; if the parsing method is determined to be the first parsing method, obtaining the target data size corresponding to the written data, and determining the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset corresponding to the written data based on the target data size; and determining the target keywords corresponding to the written data based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset, and taking the target keywords as the key information.
[0005] This application also provides a data processing apparatus, including: a data storage and type determination module, used to respond to a user's data write request, store the written data corresponding to the data write request, and determine whether the written data is medium-sized object data, where medium-sized object data is object data whose data size is within a preset range; an information extraction module, used to extract the metadata information corresponding to the written data and obtain the key information corresponding to the written data when the written data is determined to be medium-sized object data, where the metadata information includes the file storage path of the written data, and the key information includes at least one of keywords and data fingerprint information; and an index generation and writing module, used to generate the target index corresponding to the written data based on the metadata information and the key information, and write the target index into a preset search and analysis engine. When the key information includes keywords, the information extraction module is specifically used to: determine whether the parsing method corresponding to the written data is the first parsing method; if the parsing method is determined to be the first parsing method, obtain the target data size corresponding to the written data, and determine the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset corresponding to the written data based on the target data size; and determine the target keywords corresponding to the written data based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset, and take the target keywords as the key information.
[0006] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of any of the above data processing methods.
[0007] This application also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above-described data processing methods.
[0008] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the above-described data processing methods.
[0009] This application achieves several advantages. The type of the written data is determined while the data in a user's data write request is being stored; when the data is medium-sized object data, the corresponding metadata and key information are extracted, and an index is generated from this metadata and key information. Furthermore, when the key information includes keywords and the parsing method is the first parsing method, the one-sided prefetch range and the head and tail sampling start offsets are dynamically determined based on the target data size. This allows target keywords to be extracted as key information, focusing on the high-value-density core fields of medium-sized object data. It reduces data parsing and storage overhead during index construction, shortens index generation time, and optimizes input/output efficiency during the query phase. Moreover, the index enables the required data to be located without traversing all data, saving time and computational resources. At the same time, the metadata and key information provide structured guidance for search analysis, offering precise and multi-dimensional query methods. This avoids the poor query efficiency and flexibility of single-point queries based solely on row keys, balancing the reliability of database data storage against improved query efficiency and flexibility.
Attached Figure Description
[0010] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0011] Figure 1 is a flowchart of a data processing method provided in an embodiment of this application; Figure 2 is a schematic diagram of the architecture of a data processing system provided in an embodiment of this application; Figure 3 is a flowchart of a method for determining key information provided in an embodiment of this application; Figure 4 is a schematic structural diagram of a data processing device provided in an embodiment of this application.
Detailed Implementation
[0012] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.
[0013] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.
[0014] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0015] HBase is a distributed, column-oriented open-source database. As a commonly used distributed database in the big data field, HBase plays a crucial role in supporting the storage of large amounts of data. Actual business applications involve large amounts of "medium-sized object data" (such as document fragments, thumbnails, and structured attachments) ranging from tens of KB to 10 MB. Storing such data directly in HBase's native key-value (KV) storage can lead to performance issues such as compaction storms. To solve this problem, HBase introduced a medium-sized object storage feature that separates metadata from the actual data. The descriptive information of medium-sized object data (such as filename, size, creation time, and checksum) is stored in ordinary HBase key-value pairs, while the actual object data is stored in a separate file directory and centrally managed. A dedicated compaction strategy and cache isolation mechanism keep medium-sized object data files from interfering with the storage performance of small key-value pairs, achieving efficient collaborative storage of both. However, the query capabilities of medium-sized object storage are still limited by the native HBase architecture, which only supports single-point queries based on row keys and cannot meet complex query needs such as multi-condition filtering, full-text search, fuzzy matching, and aggregation analysis. To address this issue, this application provides a data processing method, which is described below with reference to specific embodiments.
[0016] Figure 1 is a flowchart of a data processing method provided in an embodiment of this application. The method can be executed by an electronic device, such as a mobile phone, tablet computer, laptop computer, or desktop computer. Furthermore, the data processing method can be applied to the data processing system architecture shown in Figure 2. It can be understood that the data processing method provided in this embodiment can also be applied to other system architectures.
[0017] As shown in Figure 2, the data processing system architecture includes a database, a data synchronization unit, an index construction unit, an intelligent parsing unit, a preset search and analysis engine, and a content retrieval unit. These components interact via network communication.
[0018] In this embodiment, the database can be an HBase database. The database can respond to user data write requests and store the write data corresponding to those requests. The data synchronization unit, based on HBase's native replication mechanism, creates and manages secondary indexes of HBase tables. By synchronizing data through replication, it achieves near real-time awareness of data addition, deletion, and modification operations, and generates a write-ahead log file. The write data is then sent to the index construction unit via this log file. The index construction unit parses the write data in the write-ahead log file to obtain event information corresponding to the write data. This event information includes data operations, such as add, update, and delete operations. It also determines the data type to identify whether the data is medium-sized object data. If the write data is regular key-value pair data, it directly constructs a preset search and analysis engine write request based on the index mapping relationship to execute the data operation corresponding to the write data. If the write data is medium-sized object data, it extracts the metadata information of the write data and calls the intelligent parsing unit. The intelligent parsing unit parses the written data and extracts key information, returning this key information as the parsing result to the index construction unit. The index construction unit integrates the metadata and key information to obtain the index corresponding to the written data. It then converts the index into a file format corresponding to a preset search and analysis engine and writes it to that engine. When a user needs to perform a search, they input a search command into the content retrieval unit. This command includes search statements and / or search images, etc. 
The preset search and analysis engine performs an index query based on the search statements and / or search images in the search command, determines the row key corresponding to the search command, and sends the row key to the content retrieval unit. The content retrieval unit then sends the row key to the database, which performs a data query based on the row key and returns the original data to the content retrieval unit.
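The two-stage retrieval described above (index query first, then a row-key lookup in the database) can be sketched as follows. This is only an illustration: the dictionaries `search_index` and `database` and the functions `query_index` and `retrieve` are hypothetical in-memory stand-ins, not the interfaces of HBase or of any particular search engine.

```python
from typing import Dict, List

# Hypothetical stand-in for the search and analysis engine:
# maps a keyword to the row keys whose index entries contain it.
search_index: Dict[str, List[str]] = {
    "invoice": ["row-001", "row-007"],
    "contract": ["row-007"],
}

# Hypothetical stand-in for the database: maps a row key to the original data.
database: Dict[str, bytes] = {
    "row-001": b"invoice body ...",
    "row-007": b"contract with invoice annex ...",
}


def query_index(keywords: List[str]) -> List[str]:
    """Return the row keys whose index entries match every keyword."""
    if not keywords:
        return []
    matches = set(search_index.get(keywords[0], []))
    for keyword in keywords[1:]:
        matches &= set(search_index.get(keyword, []))
    return sorted(matches)


def retrieve(keywords: List[str]) -> List[bytes]:
    """Run the index query, then fetch the original data by row key."""
    return [database[rk] for rk in query_index(keywords) if rk in database]
```

The point of the split is that the multi-condition match happens entirely in the index, so the database only ever serves the single-point row-key lookups it natively supports.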
[0019] As shown in Figure 1, the data processing method provided in this embodiment may include the following steps:
S110. In response to the user's data write request, store the written data corresponding to the data write request, and determine whether the written data is medium-sized object data.
[0020] In this embodiment, a data write request can be understood as a user-inputted request instructing data to be written. The data write request carries the data to be written, along with its corresponding row key, column family, column qualifier, and timestamp information.
[0021] In this embodiment, medium-sized object data can be understood as object data whose size is within a preset range and which is stored using a separate storage architecture. Medium-sized object data is also known as a medium object, or MOB for short. For example, medium-sized object data can be data with a size between tens of KB and 10 MB.
[0022] Specifically, the electronic device can respond to a user's data write request by parsing the request to obtain the corresponding written data, writing the data to the write-ahead log file, and then writing it to the storage area corresponding to its column family, thereby storing the written data. Furthermore, the electronic device can determine the target data size of the written data and, based on the target data size, determine whether the written data is medium-sized object data. If the target data size is within the preset range, the written data is determined to be medium-sized object data; otherwise, it is determined to be regular key-value pair data.
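The size-based determination in step S110 can be sketched as below. The bounds are assumptions: the text only gives "tens of KB to 10 MB" as an example of the preset range, so the 10 KB lower bound and the names `MOB_MIN_BYTES`, `MOB_MAX_BYTES`, `is_medium_object`, and `classify` are illustrative.

```python
KB = 1024
MB = 1024 * KB

# Assumed bounds for the preset range; the lower bound in particular is
# illustrative, since the source only says "tens of KB".
MOB_MIN_BYTES = 10 * KB
MOB_MAX_BYTES = 10 * MB


def is_medium_object(data: bytes,
                     low: int = MOB_MIN_BYTES,
                     high: int = MOB_MAX_BYTES) -> bool:
    """True when the target data size falls inside the preset range."""
    return low <= len(data) <= high


def classify(data: bytes) -> str:
    """Label the written data as a medium-sized object or a regular key-value pair."""
    return "medium_object" if is_medium_object(data) else "regular_kv"
```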
[0023] S120. If it is determined that the data to be written is medium-sized object data, extract the metadata information corresponding to the data to be written and obtain the key information corresponding to the data to be written.
[0024] In this embodiment of the application, metadata information may include the file storage path of the written data, as well as its row key, column family, column qualifier, version, target data size, user-defined tags, and other information.
[0025] Key information includes at least one of keywords and data fingerprint information. Key information may also include information such as the verification code corresponding to the written data.
[0026] Specifically, when the electronic device determines that the written data is medium-sized object data, it parses the written data and extracts the corresponding metadata information. Simultaneously, it analyzes and identifies the written data to obtain key information corresponding to it. The specific implementation method for extracting the metadata information corresponding to the written data is similar to the implementation method for extracting data metadata information in related technologies, and will not be elaborated here.
[0027] In this embodiment, the electronic device can determine the parsing method corresponding to the written data when the key information includes keywords; if the parsing method is determined to be the first parsing method, the target data size corresponding to the written data is obtained, and the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset corresponding to the written data are determined based on the target data size; based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset, the target keywords corresponding to the written data are determined, and the target keywords are identified as key information. The first parsing method can be understood as a method for quickly parsing the written data, that is, using a "head-tail dual sampling" method for parsing.
[0028] S130. Generate the target index corresponding to the written data based on metadata information and key information, and write the target index into the preset search and analysis engine.
[0029] In the embodiments of this application, the preset search and analysis engine can be understood as an engine that stores the index of the written data.
[0030] Specifically, after obtaining the metadata and key information corresponding to the written data, the electronic device integrates the metadata and key information to generate the target index corresponding to the written data, and converts the target index into a file format corresponding to the preset search and analysis engine and writes it to the preset search and analysis engine.
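As a rough sketch of step S130, the metadata and key information can be merged into one document and serialized for the search and analysis engine. JSON is assumed here as the engine's format, and the field names (`file_path`, `keywords`, `fingerprint`) and the function `build_index_document` are hypothetical; the source does not specify them.

```python
import json
from typing import Any, Dict, List, Optional


def build_index_document(metadata: Dict[str, Any],
                         keywords: Optional[List[str]] = None,
                         fingerprint: Optional[str] = None) -> str:
    """Merge metadata and key information into one serialized index document."""
    # Key information must include at least one of keywords and a fingerprint.
    if not keywords and not fingerprint:
        raise ValueError("key information needs keywords or a fingerprint")
    # The metadata must carry the file storage path of the written data.
    if "file_path" not in metadata:
        raise ValueError("metadata must include the file storage path")
    document = dict(metadata)
    if keywords:
        document["keywords"] = list(keywords)
    if fingerprint:
        document["fingerprint"] = fingerprint
    # JSON is an assumed target format, not one named by the source.
    return json.dumps(document, sort_keys=True)
```

Keeping the row key inside the document is what lets a later index hit be turned back into a row-key lookup against the database.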
[0031] In this embodiment, the type of data being written can be determined during the storage of data in a user's data write request. When the data being written is medium-sized object data, metadata and key information corresponding to the data are extracted. An index is generated based on this metadata and key information. Furthermore, when the key information includes keywords and the parsing method is the first parsing method, the single-sided prefetch range and head and tail sampling offsets are dynamically determined based on the target data size. This allows for the extraction of target keywords as key information, accurately focusing on high-value-density core fields in medium-sized object data. This reduces the amount of data parsing and storage overhead during index construction, shortens index generation time, improves index generation speed, and optimizes input / output efficiency during the query phase. Moreover, the index allows for quick location of the required data without traversing all data, saving significant time and computational resources. Simultaneously, the metadata and key information provide structured guidance for search analysis, offering precise and multi-dimensional query methods. This avoids the poor query efficiency and flexibility issues associated with single-point queries based solely on row keys, achieving the technical effect of maintaining the reliability of database data storage while improving query efficiency and flexibility.
[0032] In some embodiments of this application, key information may include keywords.
[0033] Electronic devices can determine the parsing method corresponding to the written data, and execute different methods to extract key information from the written data according to the parsing method.
[0034] The method for obtaining the key information corresponding to the written data is described in detail below with reference to Figure 3.
[0035] Figure 3 is a flowchart of a method for determining key information provided in an embodiment of this application. As shown in Figure 3, the method may include the following steps:
S310. Determine whether the parsing method corresponding to the written data is the first parsing method.
[0036] In this embodiment of the application, the parsing method corresponding to the written data includes a first parsing method and a second parsing method. The first parsing method can be understood as a method for fast parsing of the written data; the second parsing method can be understood as a method for deep parsing of the written data, with the fast parsing method being faster than the deep parsing method.
[0037] In some embodiments of this application, the electronic device can determine whether the parsing method corresponding to the written data is the first parsing method based on the user's input operation, wherein the input operation is an operation used to indicate the data parsing method.
[0038] In other embodiments of this application, the electronic device can determine whether the parsing method corresponding to the written data is the first parsing method based on the attribute information corresponding to the written data, wherein the attribute information includes information for indicating the data parsing method.
[0039] In this embodiment of the application, when the parsing method is the first parsing method, steps S320 to S330 are executed; when the parsing method is not the first parsing method, that is, when the parsing method is the second parsing method, step S340 is executed.
[0040] S320. When the parsing method is the first parsing method, obtain the target data size corresponding to the written data, and determine the one-sided prefetch range, the head sampling start offset and the tail sampling start offset corresponding to the written data based on the target data size.
[0041] In this embodiment, the first parsing method can be understood as using a double-sampling approach at the beginning and end of the document to parse the written data. For example, key information such as the cover, video header, and invoice header at the beginning of the document, and the QR code, verification code, contract date, and signatory at the end of the document are extracted, while most of the information in the middle, which is of secondary importance, is not further parsed or extracted.
[0042] In this embodiment, determining the one-sided prefetch range, head sampling start offset, and tail sampling start offset corresponding to the written data based on the target data size can specifically include: calculating the product of the target data size and a preset coefficient to obtain a first value; obtaining a first comparison result between the first value and the target value, and determining the smaller value in the first comparison result as the one-sided prefetch range; obtaining a preset offset, and determining the preset offset as the head sampling start offset; calculating the difference between the target data size and the one-sided prefetch range to obtain a second value; and obtaining a second comparison result between the second value and the average value of the target data size (that is, half of the target data size, S / 2), and determining the larger value in the second comparison result as the tail sampling start offset.
[0043] In this embodiment, the preset coefficient can be understood as a data prefetching ratio coefficient. The value of the preset coefficient is generally between 0 and 1. For example, the preset coefficient can be 4%.
[0044] The preset offset can be a pre-set offset for the header sampling. For example, the preset offset can be zero.
[0045] The target value can be understood as a preset upper limit on the one-sided prefetch range. For example, the target value can be 1 (in MB, matching the units of the example in paragraph [0052]).
[0046] Specifically, when the parsing method is determined to be the first parsing method, the electronic device can obtain the preset coefficient, calculate the product of the target data size and the preset coefficient to obtain a first value, and obtain a first comparison result between the first value and the target value. When the first value is less than the target value, the first value is determined as the one-sided prefetch range; when the first value is greater than the target value, the target value is determined as the one-sided prefetch range; when the two are equal, either may be determined as the one-sided prefetch range. The electronic device then obtains the preset offset and determines the preset offset as the head sampling start offset. At the same time, the difference between the target data size and the one-sided prefetch range is calculated to obtain a second value, and a second comparison result is obtained between the second value and the average value of the target data size (that is, half of the target data size, S / 2). When the second value is greater than the average value of the target data size, the second value is determined as the tail sampling start offset; when the second value is less, the average value of the target data size is determined as the tail sampling start offset; when the two are equal, either may be determined as the tail sampling start offset.
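The calculation in step S320 can be sketched as a small function. Working in bytes with integer rounding is an assumption made for this sketch, as are the function name `sampling_params` and the default values, which mirror the examples given for the preset coefficient (4%), the target value (read as 1 MB), and the preset offset (zero).

```python
from typing import Tuple

MB = 1024 * 1024


def sampling_params(size: int,
                    coeff: float = 0.04,
                    cap: int = 1 * MB,
                    head_offset: int = 0) -> Tuple[int, int, int]:
    """Return (one-sided prefetch range, head start offset, tail start offset).

    size is the target data size in bytes; coeff, cap, and head_offset stand
    for the preset coefficient, target value, and preset offset.
    """
    # First value = size * coeff; the smaller of it and the cap is the range.
    prefetch = min(int(size * coeff), cap)
    # Second value = size - range; the larger of it and size / 2 is the tail offset.
    tail_offset = max(size - prefetch, size // 2)
    return prefetch, head_offset, tail_offset
```

The lower bound of half the data size on the tail offset only matters when the coefficient is large; it keeps the tail window from reaching back into the first half of the data.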
[0047] S330. Based on the unilateral prefetch range, the head sampling start offset, and the tail sampling start offset, determine the target keywords corresponding to the written data, and identify the target keywords as key information.
[0048] In this embodiment of the application, the target keyword corresponding to the written data is determined based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset. Specifically, this may include: determining the head sampling data based on the one-sided prefetch range and the head sampling start offset, and determining the tail sampling data based on the one-sided prefetch range and the tail sampling start offset; extracting at least one first keyword corresponding to the head sampling data and the tail sampling data, and determining the target keyword based on the at least one first keyword.
[0049] Specifically, after acquiring the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset, the electronic device determines the head sampling data as the portion of the written data that starts at the head sampling start offset and spans the one-sided prefetch range, and the tail sampling data as the portion that starts at the tail sampling start offset and spans the one-sided prefetch range. At least one first keyword is then extracted from each of the head sampling data and the tail sampling data, and the first keywords are pruned using keyword entropy pruning to obtain the target keywords. The specific method for extracting the first keywords from the head and tail sampling data is similar to keyword extraction methods in related technologies and will not be elaborated here. By adopting the "head-tail dual sampling" method for keyword extraction, the key semantic anchors of the entire text can be captured efficiently, avoiding noise from redundant information in the middle paragraphs. Sampling both ends captures the feature distribution of the head and the tail, ensuring that key information extraction is effective and comprehensive. While improving extraction accuracy, it also speeds up parsing of the written data and reduces the computational overhead, input/output load, and parsing pressure of full-text parsing, thereby improving the efficiency of keyword extraction and index generation.
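A minimal sketch of the head-tail dual sampling step follows. The slicing reflects the description above; the keyword extraction is a deliberately naive placeholder (lowercased ASCII words in first-seen order), since the source leaves the extraction method to related technologies. The names `head_tail_sample` and `candidate_keywords` are hypothetical.

```python
import re
from typing import List, Tuple


def head_tail_sample(data: bytes, prefetch: int,
                     head_offset: int, tail_offset: int) -> Tuple[bytes, bytes]:
    """Slice the head and tail windows; the middle of the data is never read."""
    head = data[head_offset:head_offset + prefetch]
    tail = data[tail_offset:tail_offset + prefetch]
    return head, tail


def candidate_keywords(head: bytes, tail: bytes) -> List[str]:
    """Placeholder extractor: unique lowercase ASCII words, in first-seen order."""
    text = (head + b" " + tail).decode("utf-8", errors="ignore").lower()
    seen: List[str] = []
    for word in re.findall(r"[a-z]+", text):
        if word not in seen:
            seen.append(word)
    return seen
```

For example, with 100 bytes of data that begin with `HEADER` and end with `FOOTER`, a prefetch range of 6, a head offset of 0, and a tail offset of 94 select exactly those two words and skip the 88 bytes in between.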
[0050] For example, let the target data size be S, and the preset coefficient be α, where α is 4%, or 0.04.
[0051] The one-sided prefetch range is $L = \min(\alpha \cdot S,\ 1)$; the head sampling start offset is $\mathrm{offset}_{\mathrm{head}} = 0$; the tail sampling start offset is $\mathrm{offset}_{\mathrm{tail}} = \max(S - L,\ S/2)$.
[0052] Taking a target data size of 10 MB as an example, the one-sided prefetch range is 0.4 MB; the head sampled data is [0, 0.4 MB]; and the tail sampled data is [9.6 MB, 10 MB].
[0053] In an embodiment of the present application, determining a target keyword based on at least one first keyword may specifically include: for each first keyword, determining a first frequency of the first keyword appearing in the written data and a second frequency of the first keyword appearing in the historical data; determining a score of the first keyword based on the first frequency and the second frequency; determining a second keyword among the first keywords whose score is greater than a preset score threshold, and determining the second keyword as the target keyword.
[0054] In an embodiment of the present application, historical data can be understood as all data existing in the database before the written data is stored in the database.
[0055] Specifically, the electronic device can, for each first keyword, determine the frequency of the first keyword appearing in the written data and the total number of words corresponding to the written data; determine the ratio between the frequency and the total number of words as the first frequency. Similarly, the second frequency is obtained according to the ratio of the frequency of the first keyword appearing in the historical data and the total number of words in the historical data.
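The frequency calculation can be sketched as a ratio of occurrences to total words. The function name `term_frequency` is hypothetical, and the sketch assumes the data has already been split into a list of words; the same function would be applied once to the written data (first frequency) and once to the historical data (second frequency).

```python
from collections import Counter
from typing import List


def term_frequency(keyword: str, words: List[str]) -> float:
    """Occurrences of keyword divided by the total number of words."""
    if not words:
        return 0.0
    return Counter(words)[keyword] / len(words)
```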
[0056] Further, determining the score of the first keyword based on the first frequency and the second frequency may specifically include: determining the information entropy corresponding to the first keyword based on the first frequency; determining the rarity corresponding to the first keyword based on the second frequency; calculating the product between the information entropy and the rarity, and determining the product as the score of the first keyword.
[0057] In an embodiment of the present application, the calculation formula of information entropy is $H(p_i) = -p_i \log_2 p_i$.
[0058] Here, $H(p_i)$ is the information entropy and $p_i$ is the first frequency.
[0059] The calculation formula of rarity is $M = (1 - DF(w_i))^2$.
[0060] Here, $M$ is the rarity and $DF(w_i)$ is the second frequency.
[0061] The higher the second frequency, the smaller the rarity, that is, the more common the keyword is. For example, if the keyword "of" appears in the historical data with a frequency of 99%, its rarity is (1 - 0.99)^2 = 0.0001.
[0062] The calculation formula of the score of the first keyword is $\theta = H(p_i) \cdot (1 - DF(w_i))^2$.
[0063] Specifically, after acquiring the first frequency and the second frequency, the electronic device obtains the information entropy of the first keyword based on the first frequency and the information entropy calculation formula; obtains the rarity of the first keyword based on the second frequency and the rarity calculation formula; calculates the product between the information entropy and the rarity, determines the product as the score of the first keyword, obtains the comparison result between the score of the first keyword and the preset score threshold, determines the second keyword among the first keywords whose score is greater than the preset score threshold based on the comparison result, and determines the second keyword as the target keyword.
[0064] For example, the preset score threshold can be 0.3.
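The scoring and threshold filtering described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the word-list inputs, and the treatment of a zero frequency as zero entropy are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Iterable, List


def entropy_term(p: float) -> float:
    """H(p) = -p * log2(p); taken as 0 when p is 0 (assumption)."""
    return 0.0 if p <= 0.0 else -p * math.log2(p)


def rarity(df: float) -> float:
    """M = (1 - DF)^2, where DF is the second frequency."""
    return (1.0 - df) ** 2


def keyword_score(p: float, df: float) -> float:
    """theta = H(p) * (1 - DF)^2."""
    return entropy_term(p) * rarity(df)


def select_target_keywords(
    words: List[str],
    history_words: List[str],
    candidates: Iterable[str],
    threshold: float = 0.3,
) -> List[str]:
    """Keep the candidates (first keywords) whose score exceeds the threshold."""
    counts, hist = Counter(words), Counter(history_words)
    total, hist_total = len(words), len(history_words)
    selected = []
    for kw in candidates:
        # First frequency: occurrences in the written data / total words.
        p = counts[kw] / total if total else 0.0
        # Second frequency: occurrences in the historical data / total words.
        df = hist[kw] / hist_total if hist_total else 0.0
        if keyword_score(p, df) > threshold:
            selected.append(kw)
    return selected
```

In this sketch a very common word such as "of" is rejected even when it is frequent in the written data, because its rarity is close to zero.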
[0065] In this embodiment, keyword determination can be achieved through quantitative filtering of information content, improving the accuracy of keyword determination. By using information entropy and rarity as measures of keyword uncertainty, high-frequency noise keywords with excessively uniform distribution and low distinguishability in the corpus, as well as long-tail low-frequency keywords with excessively sparse distribution and insufficient representativeness, can be effectively identified. Pruning against a preset threshold filters out keywords with low semantic information content and weak contribution to text recognition, thus retaining the keywords with the greatest information gain. Retaining only high-quality keywords for index generation reduces the size of the index, which reduces index storage overhead and subsequent retrieval pressure, improves retrieval accuracy and efficiency, and avoids retrieval deviations caused by noise keywords.
[0066] S340. When the parsing method corresponding to the written data is the second parsing method, the written data and the preset prompt words are input into the preset machine learning model, and the preset machine learning model outputs the third keyword corresponding to the written data, and the key information is determined based on the third keyword.
[0067] In this embodiment of the application, the preset prompt word is a pre-set prompt word used to indicate the keyword extraction rules.
[0068] The preset machine learning model can be any machine learning model used for text keyword extraction, and there are no restrictions here.
[0069] It should be noted that the specific implementation method of inputting the written data and preset prompt words into the preset machine learning model, and having the preset machine learning model output the third keyword corresponding to the written data, is similar to the specific implementation method of text keyword extraction based on machine learning models in related technologies, and will not be described in detail here.
[0070] The specific implementation method for determining key information based on the third keyword is similar to the specific implementation method for determining target keywords based on at least one first keyword described in this application, and will not be repeated here.
[0071] In this embodiment, when the parsing method corresponding to the written data is the second parsing method, the written data and preset prompt words are input into a preset machine learning model, and the preset machine learning model outputs the third keyword corresponding to the written data. Based on the third keyword, key information is determined. That is, by deeply parsing the syntax, semantics, contextual information, etc. of the written data, the implicit semantics and syntactic importance of the keywords can be accurately captured, effectively identifying high-quality keywords, significantly improving the accuracy and distinguishability of keywords, showing stronger robustness in complex scenarios, and thus improving the accuracy of keyword determination.
[0072] In other embodiments of this application, key information may include data fingerprint information.
[0073] Obtaining key information corresponding to the written data may specifically include: splitting the written data into fragments to obtain at least one data block corresponding to the written data; determining the data block fingerprint corresponding to each of the at least one data block; aggregating the data block fingerprints to obtain the data fingerprint information corresponding to the written data; and determining the data fingerprint information as key information.
[0074] Specifically, the electronic device can obtain a preset fragmentation threshold, compare the target data size corresponding to the written data with the preset fragmentation threshold, and if the target data size is greater than the preset fragmentation threshold, perform fragmentation processing on the written data based on the preset fragmentation threshold to obtain at least one data block. After obtaining at least one data block, the perceptual feature vector of each data block is extracted, and the perceptual feature vector is determined as the data block fingerprint of that data block. Then, the data block fingerprints are aggregated to obtain aggregated fingerprint information, which is determined as the data fingerprint information corresponding to the written data, and the data fingerprint information is determined as key information.
[0075] In some examples, the written data is divided into blocks to obtain a series of data blocks $\{b_0, b_1, \ldots, b_{m-1}\}$. The data block fingerprint corresponding to data block $b_j$ is a low-dimensional feature vector of data block $b_j$, such as a perceptual hash.
[0076] The specific calculation formula for aggregating data block fingerprints is as follows: [formula not reproduced in the source text].
[0077] In the formula, the result is the aggregated fingerprint information, each term is the data block fingerprint corresponding to data block $b_j$, and $m$ is the number of data blocks.
[0078] In this embodiment, because data blocks arranged in different orders yield different fingerprint information, the data fingerprint information extracted from the written data can be used to detect slight changes such as video transcoding, image rotation, and changes in video editing order, and to support functions such as file similarity search or image search, thereby improving the flexibility and accuracy of the search.
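The fragmentation, per-block fingerprinting, and aggregation steps can be sketched as follows. The aggregation formula is not reproduced in this text, so the sketch swaps in two stand-ins that are assumptions, not the method of this application: a coarse byte histogram in place of a perceptual feature vector, and a position-weighted mean in place of the aggregation formula. The position weighting is chosen only because it makes the result depend on block order.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Fragment data only when it is larger than the fragmentation threshold."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if len(data) <= block_size:
        return [data]
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_fingerprint(block: bytes, dims: int = 8) -> List[float]:
    """Stand-in low-dimensional feature vector: normalized byte histogram."""
    hist = [0] * dims
    for byte in block:
        hist[byte * dims // 256] += 1
    n = len(block) or 1
    return [count / n for count in hist]


def aggregate_fingerprints(fps: List[List[float]]) -> List[float]:
    """Hypothetical order-sensitive aggregation: weight block j by (j + 1)."""
    if not fps:
        raise ValueError("at least one block fingerprint is required")
    m = len(fps)
    weight_sum = m * (m + 1) / 2
    return [
        sum((j + 1) * fp[d] for j, fp in enumerate(fps)) / weight_sum
        for d in range(len(fps[0]))
    ]


def data_fingerprint(data: bytes, block_size: int) -> List[float]:
    """Fragment, fingerprint each block, then aggregate."""
    blocks = split_into_blocks(data, block_size)
    return aggregate_fingerprints([block_fingerprint(b) for b in blocks])
```

With this stand-in, swapping two blocks changes the aggregated vector, which mirrors the order sensitivity described in the embodiment.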
[0079] In some embodiments of this application, the key information may include a checksum.
[0080] Obtaining key information corresponding to the written data may specifically include: calculating a checksum for the target data in the written data to obtain a checksum corresponding to the target data; and determining the checksum as key information corresponding to the written data.
[0081] The target data can be all of the written data or a portion of the written data. For example, the target data can be the head sampling data and the tail sampling data of the written data.
[0082] In some examples, the checksum calculation for the target data in the written data can be performed based on a preset cyclic redundancy check (CRC) algorithm to obtain the checksum. For example, the preset CRC algorithm can be a 32-bit cyclic redundancy check (CRC32) algorithm.
[0083] In this embodiment, by obtaining the checksum corresponding to the written data, when a file with the same name is transmitted again, it can be determined whether the file has changed based on the checksum. If the checksums match, it means the file has not changed and no processing is required; if the checksums do not match, it means a change has occurred, triggering a file update, thus improving the accuracy of update processing determination. Simultaneously, by calculating the checksum only on a portion of the written data, the computational load is reduced, improving computational efficiency.
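A minimal sketch of the partial checksum, assuming the target data is a fixed-length head sample plus a fixed-length tail sample. The function names and the `sample_len` parameter are hypothetical; `zlib.crc32` is the CRC32 routine in the Python standard library.

```python
import zlib


def sample_checksum(data: bytes, sample_len: int) -> int:
    """CRC32 over the head and tail samples, or over all data if it is short."""
    if sample_len <= 0:
        raise ValueError("sample_len must be positive")
    if len(data) <= 2 * sample_len:
        target = data  # samples would overlap, so checksum everything
    else:
        target = data[:sample_len] + data[-sample_len:]
    return zlib.crc32(target) & 0xFFFFFFFF


def needs_update(stored_checksum: int, new_data: bytes, sample_len: int) -> bool:
    """True when the checksum of the re-transmitted file differs from the stored one."""
    return sample_checksum(new_data, sample_len) != stored_checksum
```

One consequence of checksumming only a portion of the data is that a change confined to the unsampled middle of the file leaves the checksum unchanged, so the reduced computation is traded against coverage.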
[0084] In some embodiments of this application, the key information may include keywords, data fingerprint information, and a checksum. Obtaining the key information corresponding to the written data may include: obtaining the keywords, the data fingerprint information, and the checksum corresponding to the written data; and determining the keywords, the data fingerprint information, and the checksum as the key information corresponding to the written data. The specific implementation methods for obtaining the keywords, the data fingerprint information, and the checksum are similar to those in the above embodiments of this application, and will not be repeated here.
[0085] In this embodiment of the application, writing the target index into a preset search and analysis engine may specifically include: determining the field type of each field in the target index and determining the target writing method corresponding to the field type; and writing each field in the target index into the preset search and analysis engine based on the target writing method.
[0086] Specifically, after obtaining the target index, the electronic device can identify each field in the target index and determine the field type of each field. When the field type is a text data field (such as file name, description, or tag), a full-text index is used, that is, the text data field is segmented and an inverted index is built to provide a foundation for subsequent keyword matching and fuzzy queries. When the field type is a structured data field (such as creation time, size, type, or category), a term index is used, that is, an exact-value index is built without segmentation to improve query efficiency. When the field type is geographic location information, a geospatial index is used to support geospatial queries. When more than a target percentage of the fields in the target index are term-type fields, no segmentation is required, which significantly reduces the storage cost of the search and analysis engine (for example, Elasticsearch). For example, the target percentage can be 60%.
[0087] In this embodiment, different writing methods can be used for different field types in the target index. This prevents index bloat and precision loss caused by useless word segmentation of structured fields such as numeric and enumerated fields, while also ensuring the flexibility of text content retrieval. Furthermore, by completing field type differentiation and binding during the writing phase, the appropriate index structure can be automatically matched during querying without runtime derivation, thereby further reducing query latency.
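The binding of field types to writing methods can be sketched as a mapping built before the index is written. The sketch assumes Elasticsearch as the search and analysis engine and uses its `text`, `keyword`, and `geo_point` field types. The category labels ("text", "structured", "geo") and function names are hypothetical, and the engine client call that would submit the mapping is intentionally left out.

```python
from typing import Dict

# Hypothetical category labels mapped to Elasticsearch field types.
FIELD_TYPE_BY_KIND = {
    "text": {"type": "text"},        # segmented, inverted full-text index
    "structured": {"type": "keyword"},  # exact-value term index, no segmentation
    "geo": {"type": "geo_point"},    # geospatial index
}


def build_mapping(field_kinds: Dict[str, str]) -> dict:
    """Build an index mapping body from {field name: category}."""
    properties = {}
    for name, kind in field_kinds.items():
        if kind not in FIELD_TYPE_BY_KIND:
            raise ValueError(f"unknown field kind for {name!r}: {kind!r}")
        properties[name] = dict(FIELD_TYPE_BY_KIND[kind])
    return {"properties": properties}


def term_fraction(field_kinds: Dict[str, str]) -> float:
    """Share of fields that use the term (exact-value) index."""
    if not field_kinds:
        return 0.0
    terms = sum(1 for kind in field_kinds.values() if kind == "structured")
    return terms / len(field_kinds)
```

Deciding the type once, at write time, is what lets a query pick the matching index structure without deriving it at run time.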
[0088] In this embodiment of the application, before writing the target index into the preset search and analysis engine, the data processing method may further include: obtaining the status information of the preset search and analysis engine; and determining the target time information for writing the target index into the preset search and analysis engine based on the status information.
[0089] Writing the target index into the preset search and analysis engine may specifically include: writing the target index into the preset search and analysis engine based on the target time information.
[0090] Determining the target time information for writing the target index to the preset search and analysis engine based on status information can specifically include: if the status information of the preset search and analysis engine is in a state that indicates the health of the preset search and analysis engine, the target time information is determined to be the current time, that is, the target index is directly written to the preset search and analysis engine; if the status information of the preset search and analysis engine is in a state that indicates the preset search and analysis engine is unhealthy, waiting is performed until it is determined that the status of the preset search and analysis engine has recovered to a healthy state, and the time when the status of the preset search and analysis engine has recovered to a healthy state is determined as the target time information, that is, the target index is written to the preset search and analysis engine in batches at the time when the status of the preset search and analysis engine has recovered to a healthy state.
[0091] For example, the status information indicating that the preset search and analysis engine is healthy can be health=green (normal), and the status information indicating that the preset search and analysis engine is unhealthy can be health=yellow (warning) or health=red (danger).
[0092] The specific implementation method of writing the target index into the preset search and analysis engine is similar to the specific implementation method of writing the target index into the preset search and analysis engine in the above embodiments of this application, and will not be described again here.
[0093] In this embodiment, the timing of writing the target index can be determined based on the state of the preset search and analysis engine. When the preset search and analysis engine is in an unhealthy state, a degradation circuit breaker mechanism is supported. That is, the target index is written to the preset search and analysis engine in batches after the preset search and analysis engine is restored to a healthy state, ensuring zero blocking of the target index writing link and thus significantly reducing the impact of the preset search and analysis engine failure on the index writing performance.
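The deferred writing behavior can be sketched as a small buffer in front of the engine. This is an illustrative sketch: the class name, the injected `get_health` and `bulk_write` callables, and the assumption that only "green" counts as healthy (as in Elasticsearch cluster health) are all hypothetical.

```python
from typing import Callable, Dict, List


class DeferredIndexWriter:
    """Writes immediately when the engine is healthy, otherwise buffers."""

    def __init__(
        self,
        get_health: Callable[[], str],
        bulk_write: Callable[[List[Dict]], None],
    ) -> None:
        self._get_health = get_health
        self._bulk_write = bulk_write
        self._pending: List[Dict] = []

    def submit(self, index_doc: Dict) -> bool:
        """Queue the document; return True if it reached the engine now."""
        self._pending.append(index_doc)
        return self.flush_if_healthy()

    def flush_if_healthy(self) -> bool:
        """Write all buffered documents in one batch if the engine is healthy."""
        if not self._pending or self._get_health() != "green":
            return False
        # Swap the buffer out before writing so the batch is not mutated later.
        batch, self._pending = self._pending, []
        self._bulk_write(batch)
        return True
```

The caller of `submit` never waits on the engine, which corresponds to the non-blocking write path described above. A periodic call to `flush_if_healthy` drains the buffer once the engine recovers.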
[0094] In this embodiment of the application, after writing the target index into a preset search and analysis engine, the data processing method may further include: receiving a user's search instruction; determining at least one target row key that matches the search instruction based on the search instruction and the preset search and analysis engine; obtaining the original data corresponding to the search instruction based on the at least one target row key; and returning the original data to the user.
[0095] In this embodiment of the application, the search instruction may include search statements and / or search images and other information.
[0096] Specifically, the electronic device matches the search statement and / or search image in the search instruction against the key information of each index in the preset search and analysis engine to obtain a matching value; it then determines at least one target row key corresponding to an index whose matching value is greater than a preset matching threshold; finally, based on the correspondence between the at least one target row key and the file paths of the medium-sized object data, it determines the target file path, extracts the original data from the target file path, and returns the original data to the user. This improves the efficiency and flexibility of data querying.
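The search flow can be sketched with in-memory stand-ins for the engine and the storage layer. Everything named here is hypothetical: the matching value is taken to be the share of query terms found among an index's keywords, and `read_file` stands for whatever reads the original data from a file path.

```python
from typing import Callable, Dict, List, Set


def match_value(query_terms: Set[str], keywords: Set[str]) -> float:
    """Hypothetical matching value: share of query terms present in the keywords."""
    if not query_terms:
        return 0.0
    return len(query_terms & keywords) / len(query_terms)


def search(
    query_terms: Set[str],
    keywords_by_row_key: Dict[str, Set[str]],
    path_by_row_key: Dict[str, str],
    read_file: Callable[[str], bytes],
    threshold: float = 0.5,
) -> List[bytes]:
    """Return the original data for every row key whose match exceeds the threshold."""
    target_row_keys = [
        row_key
        for row_key, keywords in keywords_by_row_key.items()
        if match_value(query_terms, keywords) > threshold
    ]
    # Resolve each target row key to its file path, then load the original data.
    return [read_file(path_by_row_key[row_key]) for row_key in target_row_keys]
```

Only the matched row keys are resolved to file paths, so the original data is read for the hits alone rather than by traversing all stored data.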
[0097] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.
[0098] Embodiments of this application also provide a data processing apparatus, which can be disposed within an electronic device and is understood as a part of the functional modules of the aforementioned electronic device.
[0099] As shown in Figure 4, the data processing device 400 may include a data storage and type determination module 410, an information extraction module 420, and an index generation and writing module 430.
[0100] The data storage and type determination module 410 can be used to, in response to a user's data write request, store the write data corresponding to the data write request and determine whether the write data is medium-sized object data, where medium-sized object data is object data whose data size is within a preset range. The information extraction module 420 can be used to, when it is determined that the written data is medium-sized object data, extract the metadata information corresponding to the written data and obtain the key information corresponding to the written data, where the metadata information includes the file storage path of the written data and the key information includes at least one of keywords and data fingerprint information. The index generation and writing module 430 can be used to generate the target index corresponding to the written data based on the metadata information and the key information, and to write the target index into the preset search and analysis engine. The information extraction module 420 can be specifically used to, when the key information includes keywords: determine whether the parsing method corresponding to the written data is the first parsing method; if the parsing method is determined to be the first parsing method, obtain the target data size corresponding to the written data and determine, based on the target data size, the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset corresponding to the written data; and determine the target keywords corresponding to the written data based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset, and determine the target keywords as the key information.
[0101] In this embodiment, the type of data being written can be determined during the storage of data in a user's data write request. When the data being written is medium-sized object data, metadata and key information corresponding to the data are extracted. An index is generated based on this metadata and key information. Furthermore, when the key information includes keywords and the parsing method is the first parsing method, the single-sided prefetch range and head and tail sampling offsets are dynamically determined based on the target data size. This allows for the extraction of target keywords as key information, accurately focusing on high-value-density core fields in medium-sized object data. This reduces the amount of data parsing and storage overhead during index construction, shortens index generation time, improves index generation speed, and optimizes input / output efficiency during the query phase. Moreover, the index allows for quick location of the required data without traversing all data, saving significant time and computational resources. Simultaneously, the metadata and key information provide structured guidance for search analysis, offering precise and multi-dimensional query methods. This avoids the poor query efficiency and flexibility issues associated with single-point queries based solely on row keys, achieving the technical effect of maintaining the reliability of database data storage while improving query efficiency and flexibility.
[0102] In some embodiments of this application, the information extraction module 420 may also be specifically used to: calculate the product of the target data size and the preset coefficient to obtain a first value, obtain a first comparison result between the first value and the target value, and determine the smaller value in the first comparison result as the one-sided prefetch range; obtain the preset offset and determine the preset offset as the head sampling start offset; and calculate the difference between the target data size and the one-sided prefetch range to obtain a second value, obtain a second comparison result between the second value and the average value of the target data size, and determine the larger value in the second comparison result as the tail sampling start offset.
[0103] In some embodiments of this application, the information extraction module 420 may also be specifically used to: determine head sampling data based on the one-sided prefetch range and the head sampling start offset, and determine tail sampling data based on the one-sided prefetch range and the tail sampling start offset; and extract at least one first keyword corresponding to the head sampling data and the tail sampling data, and determine the target keyword based on the at least one first keyword.
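The head-and-tail sampling computation can be sketched as follows. Several points are assumptions: the "average value of the target data size" is read as half of the data size, the target value is treated as an upper cap on the prefetch range, and the default coefficient, cap, and offset are hypothetical values chosen for illustration.

```python
from typing import Tuple


def sampling_plan(
    size: int,
    coefficient: float = 0.1,   # hypothetical preset coefficient
    cap: int = 64 * 1024,       # hypothetical target value (upper bound)
    head_offset: int = 0,       # hypothetical preset offset
) -> Tuple[int, int, int]:
    """Return (one-sided prefetch range, head start offset, tail start offset)."""
    # Smaller of size * coefficient and the target value.
    prefetch = min(int(size * coefficient), cap)
    # Larger of (size - prefetch) and half the size (assumed meaning of "average").
    tail_start = max(size - prefetch, size // 2)
    return prefetch, head_offset, tail_start


def sample_head_tail(data: bytes, **plan_args) -> Tuple[bytes, bytes]:
    """Slice the head and tail sampling data according to the plan."""
    prefetch, head_start, tail_start = sampling_plan(len(data), **plan_args)
    head = data[head_start:head_start + prefetch]
    tail = data[tail_start:tail_start + prefetch]
    return head, tail
```

Under this reading, the larger-value rule keeps the tail sample from starting before the midpoint, so a large coefficient cannot push the tail window back over most of the head window.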
[0104] In some embodiments of this application, the information extraction module 420 may also be specifically used to: determine, for each first keyword, the first frequency of the first keyword appearing in the written data and the second frequency of the first keyword appearing in the historical data; determine the score of the first keyword based on the first frequency and the second frequency; and identify, among the first keywords, a second keyword whose score is greater than a preset score threshold, and determine the second keyword as the target keyword.
[0105] In some embodiments of this application, the information extraction module 420 may also be specifically used to: determine the information entropy corresponding to the first keyword based on the first frequency; determine the rarity corresponding to the first keyword based on the second frequency; and calculate the product between the information entropy and the rarity, and determine the product as the score of the first keyword.
[0106] In some embodiments of this application, the information extraction module 420 may also be specifically used to input the written data and preset prompt words into a preset machine learning model when the parsing method corresponding to the written data is the second parsing method, and the preset machine learning model outputs the third keyword corresponding to the written data, and determines key information based on the third keyword.
[0107] In some embodiments of this application, the information extraction module 420 may also be specifically used to, when the key information includes data fingerprint information: perform fragmentation processing on the written data to obtain at least one data block corresponding to the written data; determine the data block fingerprint corresponding to each of the at least one data block; and aggregate the data block fingerprints to obtain the data fingerprint information corresponding to the written data, and determine the data fingerprint information as the key information.
[0108] In some embodiments of this application, the index generation and writing module 430 can be specifically used to: determine the field type of each field in the target index and determine the target writing method corresponding to the field type; and write each field in the target index into the preset search and analysis engine based on the target writing method.
[0109] In some embodiments of this application, the data processing apparatus 400 may further include a write timing determination module.
[0110] The write timing determination module can be used to: obtain the status information of the preset search and analysis engine before the target index is written into the preset search and analysis engine; and determine, based on the status information, the target time information for writing the target index into the preset search and analysis engine.
[0111] For a description of the features in the embodiment corresponding to the data processing device, please refer to the relevant description in the embodiment corresponding to the data processing method, which will not be repeated here.
[0112] Embodiments of this application also provide an electronic device, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above-described data processing method embodiments.
[0113] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above data processing method embodiments when it is run.
[0114] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0115] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above data processing method embodiments.
[0116] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above data processing method embodiments.
[0117] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0118] The data processing method and electronic device provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are only intended to help understand the method and core ideas of this application. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.
Claims
1. A data processing method, characterized by comprising: in response to a user's data write request, storing the write data corresponding to the data write request, and determining whether the write data is medium-sized object data, wherein the medium-sized object data is object data whose data size is within a preset range; if it is determined that the written data is the medium-sized object data, extracting the metadata information corresponding to the written data and obtaining the key information corresponding to the written data, wherein the metadata information includes the file storage path of the written data, and the key information includes at least one of keywords and data fingerprint information; and generating, based on the metadata information and the key information, a target index corresponding to the written data, and writing the target index into a preset search and analysis engine; wherein, when the key information includes keywords, obtaining the key information corresponding to the written data includes: determining whether the parsing method corresponding to the written data is the first parsing method; if the parsing method is determined to be the first parsing method, obtaining the target data size corresponding to the written data, and determining, based on the target data size, the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset corresponding to the written data; and determining the target keyword corresponding to the written data based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset, and determining the target keyword as the key information.
2. The method of claim 1, wherein determining the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset corresponding to the written data based on the target data size includes: calculating the product of the target data size and a preset coefficient to obtain a first value, obtaining a first comparison result between the first value and a target value, and determining the smaller value in the first comparison result as the one-sided prefetch range; obtaining a preset offset and determining the preset offset as the head sampling start offset; and calculating the difference between the target data size and the one-sided prefetch range to obtain a second value, obtaining a second comparison result between the second value and the average value of the target data size, and determining the larger value in the second comparison result as the tail sampling start offset.
3. The method of claim 1, wherein determining the target keyword corresponding to the written data based on the one-sided prefetch range, the head sampling start offset, and the tail sampling start offset includes: determining head sampling data based on the one-sided prefetch range and the head sampling start offset, and determining tail sampling data based on the one-sided prefetch range and the tail sampling start offset; and extracting at least one first keyword corresponding to the head sampling data and the tail sampling data, and determining the target keyword based on the at least one first keyword.
4. The method of claim 3, wherein determining the target keyword based on the at least one first keyword includes: for each first keyword, determining the first frequency of the first keyword appearing in the written data and the second frequency of the first keyword appearing in the historical data; determining the score of the first keyword based on the first frequency and the second frequency; and identifying, among the first keywords, a second keyword whose score is greater than a preset score threshold, and determining the second keyword as the target keyword.
5. The method of claim 4, wherein determining the score of the first keyword based on the first frequency and the second frequency includes: determining the information entropy corresponding to the first keyword based on the first frequency; determining the rarity corresponding to the first keyword based on the second frequency; and calculating the product between the information entropy and the rarity, and determining the product as the score of the first keyword.
6. The method of claim 1, further comprising: when the parsing method corresponding to the written data is the second parsing method, inputting the written data and preset prompt words into a preset machine learning model, the preset machine learning model outputting a third keyword corresponding to the written data, and determining the key information based on the third keyword.
7. The method of claim 1, wherein, when the key information includes data fingerprint information, obtaining the key information corresponding to the written data includes: fragmenting the written data to obtain at least one data block corresponding to the written data; determining the data block fingerprint corresponding to each of the at least one data block; and aggregating the data block fingerprints to obtain the data fingerprint information corresponding to the written data, and determining the data fingerprint information as the key information.
8. The method of claim 1, wherein writing the target index into the preset search and analysis engine includes: determining the field type of each field in the target index, and determining the target writing method corresponding to the field type; and writing each field in the target index into the preset search and analysis engine based on the target writing method.
9. The method of claim 1, wherein, before writing the target index into the preset search and analysis engine, the method further includes: obtaining the status information of the preset search and analysis engine; and determining, based on the status information, the target time information for writing the target index into the preset search and analysis engine.
10. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of the data processing method as described in any one of claims 1 to 9.