Method, device, equipment, medium and product for generating derived variables
By optimizing deserialization and data structures, the problem of low efficiency in generating derived variables in existing technologies has been solved, achieving fast and efficient generation of derived variables and improving data processing speed and storage efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JINGDONG TECH HLDG CO LTD
- Filing Date
- 2023-04-24
- Publication Date
- 2026-06-16
Figure CN116737809B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, medium, and product for generating derived variables.
Background Technology
[0002] The analysis of credit reports and the processing of derived variables are crucial for the use of credit report content.
[0003] In existing technologies, parsing credit reports and processing derived variables involves calculating each variable individually. This is done by reading and calculating from the original credit report data, which is inefficient. Furthermore, calculating individual variables does not consider the logical relationships between them, leading to duplicate data retrieval and calculation, resulting in high latency in calculating derived variables.
Summary of the Invention
[0004] This disclosure provides a method, apparatus, device, medium, and product for generating derived variables, which solves the defects of low efficiency and long time consumption in the prior art when generating derived variables, and realizes rapid and efficient generation of derived variables.
[0005] This disclosure provides a method for generating derived variables, including:
[0006] Obtain the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, wherein the data to be processed is obtained by deserializing the original credit report data;
[0007] Based on the relationships between the data to be processed, the data to be processed is stored in a pre-set first data structure;
[0008] Based on the data processing logic and the first data structure, the data to be processed is processed to generate the target derived variable.
[0009] According to a method for generating derived variables provided in this disclosure, the data processing logic includes: data processing functions and the logical relationships between each of the data processing functions;
[0010] The step of processing the data to be processed based on the data processing logic and the first data structure to generate the target derived variable includes:
[0011] For each of the data processing functions, the following derived variable generation process is performed:
[0012] Based on the logical relationship, the current data processing function is determined; the data to be processed corresponding to the current data processing function is extracted from the first data structure, and the data to be processed is processed using the current data processing function to obtain the current derived variable;
[0013] Repeat the process of generating the derived variable until all the data processing functions have been executed, and obtain the target derived variable.
[0014] According to the method for generating derived variables provided in this disclosure, after processing the data to be processed using the current data processing function to obtain the current derived variable, the method further includes:
[0015] The number of references to the data to be processed is determined, where the number of references characterizes whether the data to be processed can still be used to generate the target derived variable;
[0016] When the number of references is determined to be a preset value, it is determined that the data to be processed is no longer used to generate the target derived variable, and the data to be processed is deleted from the first data structure.
[0017] According to a method for generating derived variables provided in this disclosure, the step of obtaining the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, includes:
[0018] Determine the derived variable template corresponding to the target derived variable;
[0019] Obtain the target path corresponding to the derived variable template, and the data processing logic corresponding to the derived variable template;
[0020] Based on the target path, the data to be processed is obtained from a pre-set second data structure.
[0021] According to a method for generating derived variables provided in this disclosure, before obtaining the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, the method further includes:
[0022] Obtain the original credit report data;
[0023] Based on the segment information of the original credit report data, the original credit report data is deserialized into the second data structure.
[0024] According to a method for generating derived variables provided in this disclosure, the first data structure includes a number matrix, which includes a first data node and a second data node;
[0025] The second data structure includes a multi-branch tree, which includes a root object and child objects; the root object includes a data identifier, which corresponds to the derived variable template; the child object includes the data to be processed, and the parent object of the child object is the root object or another child object.
[0026] The step of storing the data to be processed into a pre-set first data structure based on the correlation between the data to be processed includes:
[0027] Determine the data type of the data stored in each of the sub-objects, wherein the data type includes: string type and non-string type;
[0028] The data stored in the first sub-object is organized in the first data node, where the first sub-object is a sub-object whose stored data is of the non-string type.
[0029] The data stored in the second sub-object is organized in the second data node, where the second sub-object is a sub-object that is subordinate to the first sub-object and whose stored data is of the string type.
[0030] According to a method for generating derived variables provided in this disclosure, the number matrix includes: a two-dimensional number matrix composed of at least two doubly linked lists;
[0031] The first data node includes: the data node of the i-th row of the doubly linked list in the number array; the second data node includes: the remaining data node after removing the first data node from all data nodes in the number array; where i is an integer greater than or equal to 1.
[0032] Organizing the data stored in the first sub-object within the first data node includes:
[0033] Organize the data stored in the first sub-object into the data nodes of the doubly linked list in the i-th row;
[0034] Organizing the data stored in the second sub-object within the second data node includes:
[0035] For each first data node of the doubly linked list in the i-th row, execute the following storage procedure:
[0036] Determine a second sub-object that has the subordinate relationship with the first sub-object of the first data node;
[0037] Determine the entries for the data stored in the second sub-object;
[0038] The data corresponding to each entry is organized into the second data node corresponding to the first data node.
[0039] According to a method for generating derived variables provided in this disclosure, the number matrix includes: a two-dimensional number matrix composed of at least two doubly linked lists;
[0040] The first data node includes: the data node of the j-th column of the doubly linked list in the number array; the second data node includes: the remaining data node after removing the first data node from all data nodes in the number array, where j is an integer greater than or equal to 1.
[0041] Organizing the data stored in the first sub-object within the first data node includes:
[0042] The data stored in the first sub-object is organized in the data nodes of the j-th column of the doubly linked list;
[0043] Organizing the data stored in the second sub-object within the second data node includes:
[0044] For each first data node of the doubly linked list in column j, execute the following storage procedure:
[0045] Determine a second sub-object that has the subordinate relationship with the first sub-object of the first data node;
[0046] Determine the entries for the data stored in the second sub-object;
[0047] The data corresponding to each entry is organized in a second data node corresponding to the first data node.
[0048] This disclosure also provides an apparatus for generating derived variables, comprising:
[0049] The acquisition module is used to acquire data to be processed corresponding to at least one target derived variable, and data processing logic corresponding to the target derived variable. The data to be processed is obtained by deserializing the original credit report data.
[0050] An organization module is used to store the data to be processed into a pre-set first data structure based on the relationships between the data to be processed;
[0051] The processing module is used to process the data to be processed based on the data processing logic and the first data structure to generate the target derived variable.
[0052] This disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method for generating derived variables as described above.
[0053] This disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method for generating derived variables as described above.
[0054] This disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the method for generating derived variables as described in any of the above.
[0055] The method, apparatus, device, medium, and product for generating derived variables disclosed herein acquire data to be processed corresponding to at least one target derived variable, and data processing logic corresponding to the target derived variable, where the data to be processed is obtained by deserializing the original credit report data. This disclosure therefore does not generate derived variables directly from the original credit report data but from the deserialized data, which provides an effective data foundation for the subsequent rapid generation of derived variables. Furthermore, this disclosure acquires and calculates the data to be processed for multiple target derived variables at the same time, avoiding the high computational latency caused by the repeated data acquisition and calculation that arises when each variable is acquired and calculated individually. Next, based on the relationships between the data to be processed, the data to be processed is stored in a pre-set first data structure; that is, this disclosure takes the relationships between the data into account when storing it, which provides an effective data foundation for subsequent rapid processing. Finally, based on the data processing logic and the first data structure, the data to be processed is processed to generate the target derived variables. Relying on a specific data structure in this way improves data operation and storage efficiency, solves the prior-art problems of low efficiency and long processing time when generating derived variables, and achieves rapid and efficient generation of derived variables.
Description of the Drawings
[0056] To more clearly illustrate the technical solutions in this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0057] Figure 1 is one of the flowcharts illustrating the method for generating derived variables provided in this disclosure;
[0058] Figure 2 is a schematic diagram of deserializing the original credit report data into a multi-way tree, as provided in this disclosure;
[0059] Figure 3 is another flowchart illustrating the method for generating derived variables provided in this disclosure;
[0060] Figure 4 is a schematic diagram of the number array provided in this disclosure;
[0061] Figure 5 is a schematic diagram of the structure of the derived variable generation framework provided in this disclosure;
[0062] Figure 6 is a schematic diagram of the structure of the device for generating derived variables provided in this disclosure;
[0063] Figure 7 is a schematic diagram of the structure of the electronic device provided in this disclosure.
Detailed Implementation
[0064] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this disclosure, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without creative effort fall within the protection scope of this disclosure.
[0065] The method for generating derived variables according to embodiments of this disclosure is described below with reference to Figures 1-3.
[0066] This invention provides a method for generating derived variables. This method can be applied to smart terminals, such as mobile phones, computers, and tablets, as well as servers. The following description uses the application of this method to a server as an example; however, it should be noted that this is merely illustrative and not intended to limit the scope of protection of this invention. Other descriptions in this invention's embodiments are also illustrative and not intended to limit the scope of protection of this invention, and will not be described in detail thereafter.
[0067] Specifically, credit reports play a crucial role in the data application of financial institutions. By parsing and processing raw credit report data, derived variables are obtained, providing an important data foundation for downstream model training or strategy development. However, with the rapid development of technology, the volume of raw credit report data is increasing, leading to high latency issues when parsing and processing raw credit report data online in real time.
[0068] The specific implementation of this method is shown in Figure 1:
[0069] Step 101: Obtain the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable.
[0070] The data to be processed is obtained by deserializing the original credit report data.
[0071] In one specific embodiment, the original credit reporting message data sent from upstream is acquired in real time to determine the information of each segment of the original credit reporting message data; based on the segment information of the original credit reporting message data, the original credit reporting message data is deserialized into a second data structure.
[0072] The data deserialized into the second data structure is defined as target credit data, which includes data to be processed.
[0073] The original credit report data includes one or more of the following: credit report data in Extensible Markup Language (XML) format, credit report data in a lightweight data exchange format (JavaScript Object Notation, JSON) format, and credit report data in Hyper Text Markup Language (HTML) format.
[0074] The second data structure includes a multi-branch tree, which consists of a root object and child objects. The root object includes a data identifier, which corresponds to a derived variable template. The child objects include data to be processed, and the parent object of each child object is either the root object or another child object.
[0075] The root object is the root node of the multi-branch tree, and the child objects include the intermediate nodes and leaf nodes of the multi-branch tree. The intermediate nodes store credit report data of array type or object type, and the leaf nodes store credit report data of string type.
[0076] Specifically, the segment information of the original credit reporting message data differs depending on the type of business. For example, if the business type includes personal credit reporting and corporate credit reporting, then the segment information for personal credit reporting will differ from that for corporate credit reporting.
[0077] The following uses personal credit reporting as an example to illustrate how the original credit reporting message data is deserialized into a multi-way tree:
[0078] For example, the information sections of the original credit report data include: A: basic personal information; B: overview of personal credit report; C: details of credit transaction information; etc.
[0079] A includes multiple sections of information, such as: D: Residential Information; E: Identity Information; F: Occupation Information; etc.
[0080] C also includes information from multiple sections, such as: G: non-revolving loan account; H: credit card account; etc.
[0081] E includes multiple sections of information, such as I: personal identification information; J: spouse identification information; etc.
[0082] Taking the above example, based on the information of each segment and the hierarchical relationships between them, the original credit report data is deserialized into a multi-way tree, as shown in Figure 2.
[0083] In Figure 2, the root node is denoted by root.
[0084] Specifically, during the deserialization operation, the target credit data corresponding to all nodes is organized in a specified format to facilitate the subsequent extraction, analysis, and calculation of this target credit data.
[0085] The specified format divides the data using a specified delimiter symbol. For example, I: personal identity information includes name, age, date of birth, school status, education, etc. When many fields are stored in leaf node I, they can be separated by a specified symbol such as "/", "*", or "@" to prevent the data from being confused.
[0086] That is, when a node needs to store multiple fields, the data is divided according to the specified format.
[0087] Of course, any symbol is acceptable, except for symbols that already exist in the target credit data, to prevent confusion caused by identical data and symbols.
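The delimiter rule described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `pick_delimiter`, the candidate order, and the sample field values are all assumptions introduced for this example.

```python
def pick_delimiter(fields, candidates=("/", "*", "@")):
    """Return the first candidate symbol that occurs in none of the fields."""
    for symbol in candidates:
        if not any(symbol in value for value in fields):
            return symbol
    raise ValueError("every candidate symbol already occurs in the data")


# Invented sample fields for leaf node I; the date already contains "/".
fields = ["Alice", "30", "1993/01/01"]
delimiter = pick_delimiter(fields)

# Pack the fields into one leaf value and recover them again.
packed = delimiter.join(fields)
unpacked = packed.split(delimiter)
```

Because "/" already appears in the date field, the sketch falls back to the next candidate, which is the situation the preceding paragraph warns about.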
[0088] By deserializing the original credit report data into a multi-way tree and exploiting the small height of such a tree to reduce the number of disk reads, this disclosure provides an effective data foundation for rapidly generating derived variables.
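As a rough illustration of the deserialization step, the sketch below builds a multi-way tree from an already-parsed nested dictionary. It assumes the XML, JSON, or HTML message has been parsed beforehand; the `Node` class, the `build_tree` function, and all leaf values are hypothetical, and only the segment letters come from the example above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Node:
    """One node of the multi-way tree (hypothetical class name)."""
    name: str
    value: Optional[str] = None  # set only on leaf nodes, which hold string data
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(name: str, payload: Any) -> Node:
    """Recursively convert a nested dict into a tree; other values become leaves."""
    if isinstance(payload, dict):
        node = Node(name)
        for key, sub_payload in payload.items():
            node.children[key] = build_tree(key, sub_payload)
        return node
    return Node(name, value=str(payload))


# Toy report reusing some segment letters from the example; values are invented.
report = {
    "A": {"D": "addr-1/addr-2", "E": {"I": "name/age", "J": "spouse-name"}, "F": "job-1"},
    "C": {"G": "loan-1", "H": "card-1"},
}
root = build_tree("root", report)
```

Intermediate nodes such as A and E carry only children, while leaves such as I carry a delimited string, mirroring the node roles described in paragraph [0075].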
[0089] In one specific embodiment, the specific implementation of acquiring the data to be processed and the data processing logic is shown in Figure 3:
[0090] Step 301: Determine the derived variable template corresponding to the target derived variable.
[0091] Specifically, the target derived variables required downstream are stored in advance, and the corresponding target derived variable template is determined based on the target derived variables; or the derived variable template required downstream is stored in advance, and the target derived variables required downstream can be obtained through the derived variable template.
[0092] The derived variable templates differ depending on the type of business.
[0093] Step 302: Obtain the target path corresponding to the derived variable template, and the data processing logic corresponding to the derived variable template.
[0094] Different derived variable templates correspond to different target paths and different data processing logic.
[0095] Specifically, a first correspondence between the template identifier of the derived variable template and the target credit data, and a second correspondence between the template identifier of the derived variable template and the data processing logic are pre-stored. Based on the first correspondence, the target derived variables in the derived variable template are matched with the target credit data. Based on the successful matching result, the target path of the derived variable template in the second data structure is determined; based on the second correspondence, the data processing logic corresponding to the derived variable template is determined.
[0096] The target path is the storage path that contains the data to be processed corresponding to the derived variable template.
[0097] Step 303: Based on the target path, obtain the data to be processed from the pre-set second data structure.
[0098] Specifically, after deserializing the original credit report data into a multi-way tree, the data to be processed corresponding to the leaf node can be obtained directly through the target path, for example, root / A / E / I, or the data to be processed corresponding to the intermediate node can be obtained directly through the target path, for example, root / A / E.
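A path lookup of this kind might look like the sketch below. It represents the tree as nested dictionaries for brevity; the function name `get_by_path` and the sample values are assumptions, not part of the disclosure.

```python
def get_by_path(tree: dict, path: str):
    """Follow a 'root / A / E'-style path through nested dicts."""
    parts = [part.strip() for part in path.split("/")]
    if parts[0] != "root":
        raise ValueError("path must start at root")
    node = tree
    for part in parts[1:]:
        node = node[part]  # raises KeyError if the segment does not exist
    return node


# Invented sample data for the A / E branch of the example tree.
tree = {"A": {"E": {"I": "name/age", "J": "spouse-name"}}}
leaf = get_by_path(tree, "root / A / E / I")   # data at a leaf node
inner = get_by_path(tree, "root / A / E")      # data at an intermediate node
```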
[0099] Specifically, the target path includes at least one storage path.
[0100] For example, the target paths are root / A / E / I and root / A / E / J. If the data to be processed is extracted from only one of the target paths each time, that is, extracted once from root / A / E / I and once from root / A / E / J, it can be seen that root / A / E is being retrieved repeatedly, which may cause data duplication, and the multiple extractions increase the data processing load on the server.
[0101] To address the issue of repeatedly retrieving the same path, the minimum closure path of multiple storage paths is obtained, and the data to be processed corresponding to that minimum closure path is extracted. For example, the minimum closure path of root / A / E / I and root / A / E / J is root / A / E, so root / A / E is used as the target path to retrieve the data to be processed. In this way, data to be processed from multiple storage paths can be obtained in a single extraction process.
[0102] The minimum closure path is the path that is contained in all of the multiple storage paths; in the example above, it is their common prefix root / A / E.
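One way to compute such a path is to take the longest common prefix of the storage paths, segment by segment. The sketch below assumes that reading of "minimum closure path", which matches the example given; the function name is hypothetical.

```python
def minimum_closure_path(paths):
    """Return the longest segment-wise common prefix of 'root / A / ...' paths."""
    split_paths = [[part.strip() for part in path.split("/")] for path in paths]
    common = []
    for segments in zip(*split_paths):
        if len(set(segments)) != 1:
            break  # the paths diverge at this depth
        common.append(segments[0])
    return " / ".join(common)


closure = minimum_closure_path(["root / A / E / I", "root / A / E / J"])
```

Retrieving the data once at `closure` then covers both original storage paths in a single extraction.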
[0103] Correspondingly, when storing the data to be processed in the array, the data to be processed can be decomposed based on the storage path.
[0104] This disclosure effectively reduces the occurrence of repeatedly retrieving multiple similar storage paths, thereby improving the efficiency of processing derived variables.
[0105] Step 102: Based on the relationships between the data to be processed, store the data to be processed into a pre-set first data structure.
[0106] The relationships between the various modules correspond to the relationships between the data to be processed.
[0107] Specifically, the first data structure includes a number array, which includes a first data node and a second data node; the second data structure includes a multi-branch tree, which includes a root object and child objects; the root object includes a data identifier, which corresponds to a derived variable template; the child object includes the data to be processed, and the parent object of the child object is the root object or another child object.
[0108] The data identifier is used to indicate the business type to which the original credit reporting message data belongs.
[0109] In one specific embodiment, the data type of the data stored in each sub-object is determined, including: string type and non-string type; the data stored in the first sub-object is organized in the first data node, where the first sub-object is a sub-object whose stored data is of non-string type; the data stored in the second sub-object is organized in the second data node, where the second sub-object is a sub-object that has a subordinate relationship with the first sub-object and whose stored data is of string type.
[0110] The subordinate relationships between sub-objects correspond to the hierarchical relationships between various modules.
[0111] Non-string types include array types and object types, etc.
[0112] In one specific embodiment, the number array includes: a two-dimensional number array composed of at least two doubly linked lists; the first data node includes: the data node of the i-th row of the doubly linked list in the number array; the second data node includes: the remaining data node after removing the first data node from all data nodes in the number array, where i is an integer greater than or equal to 1.
[0113] Specifically, the data stored in the first sub-object is organized into the data nodes of the doubly linked list in the i-th row; the following storage procedure is executed for each first data node of the doubly linked list in the i-th row:
[0114] Identify the second sub-object that has a subordinate relationship with the first sub-object of the first data node; identify the entries of the data stored in the second sub-object; organize the data corresponding to each entry into the second data node corresponding to the first data node.
[0115] The entries and fields correspond to each other.
[0116] The number of columns in the array is determined by the number of the first sub-objects in the multi-way tree, and the number of rows in the array is determined by the number of entries in the second sub-objects.
[0117] i can be any integer from 1 to the number of rows of the number matrix. The following explanation uses i = 1 as an example:
[0118] To illustrate clearly how the data to be processed in the multi-way tree is stored in the number array, Figure 4 is used as an example:
[0119] Let's take storing the data to be processed corresponding to root / A in a data array as an example:
[0120] Where A is an array type, E is an array type, I is a string type, J is a string type, D is a string type, and F is a string type.
[0121] For example, A is stored in the following way:
[0122] The storage method of E is
[0123] Since root / A / E / I and root / A / E / J share a common path, only the sub-objects corresponding to the differing parts need to be organized into the doubly linked list of the first row. Figure 4 can then be designed using the data in Table 1.
[0124]
[0125] Table 1 Target Credit Data
[0126] Table 1 shows 3 entries for residential information, 2 entries for occupational information, 5 entries for personal identification information, and 4 entries for spouse identification information. These counts are used as the example for designing Figure 4.
[0127] Because the numbers of entries differ, some second data nodes may have no data. In such cases, the corresponding second data node is simply set to empty.
[0128] On this basis, to select specific data quickly and speed up data extraction, a row number can be added, for example by storing row numbers in the first column or in an additional new column.
[0129] This disclosure extracts the data to be processed from the multi-way tree and organizes it into a number matrix composed of doubly linked lists to facilitate the subsequent processing and calculation of derived variables.
[0130] This disclosure uses a doubly linked list array to achieve bidirectional data access. From any data node, the target data node can be found based on its predecessor and successor, effectively improving the data retrieval speed.
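The following sketch shows one possible layout of such a number array, under stated assumptions: each column is a doubly linked list whose head cell holds the label and whose remaining cells hold the entries, and shorter columns are padded with empty cells. The `Cell` class, the helper names, and the placeholder values are hypothetical; only the entry counts (3, 2, 5, 4) follow the Table 1 description, and the actual layout in Figure 4 may differ.

```python
class Cell:
    """Data node with predecessor and successor links (hypothetical class)."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def build_column(header, entries, height):
    """Build a doubly linked list: header cell, then entries padded with None."""
    head = Cell(header)
    tail = head
    for value in list(entries) + [None] * (height - len(entries)):
        cell = Cell(value)
        cell.prev, tail.next = tail, cell
        tail = cell
    return head


def build_number_array(columns):
    """Map each label to the head of its column; all columns share one height."""
    height = max(len(values) for values in columns.values())
    return {name: build_column(name, values, height) for name, values in columns.items()}


def column_values(head):
    """Collect the entry values below the header cell."""
    values, cell = [], head.next
    while cell is not None:
        values.append(cell.value)
        cell = cell.next
    return values


data = {
    "D": ["d1", "d2", "d3"],
    "F": ["f1", "f2"],
    "I": ["i1", "i2", "i3", "i4", "i5"],
    "J": ["j1", "j2", "j3", "j4"],
}
array = build_number_array(data)
```

The `prev` and `next` links allow traversal in either direction from any cell, which is the bidirectional access the preceding paragraph refers to.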
[0131] In one specific embodiment, the number array includes: a two-dimensional number array composed of at least two doubly linked lists; the first data node includes: the data node of the j-th column of the doubly linked list in the number array; the second data node includes: the remaining data node after removing the first data node from all data nodes in the number array, where j is an integer greater than or equal to 1.
[0132] Specifically, the data stored in the first sub-object is organized into the data nodes of the j-th column of the doubly linked list; the following storage procedure is executed for each first data node of the j-th column of the doubly linked list:
[0133] Identify the second sub-object that has a subordinate relationship with the first sub-object of the first data node; identify the entries of the data stored in the second sub-object; organize the data corresponding to each entry in the second data node corresponding to the first data node.
[0134] Specifically, the first data node is the data node corresponding to the column in the array. For the specific implementation method, please refer to the implementation process of the first data node being the data node corresponding to the row in the array, and simply replace the row with the column.
[0135] Step 103: Based on the data processing logic and the first data structure, process the data to be processed to generate target derived variables.
[0136] In one specific embodiment, the data processing logic includes: data processing functions and the logical relationships between each data processing function. The specific implementation of generating the target derived variable is shown below:
[0137] For each data processing function, perform the following derived variable generation process:
[0138] Based on logical relationships, determine the current data processing function; extract the data to be processed corresponding to the current data processing function from the first data structure; use the current data processing function to process the data to be processed to obtain the current derived variable;
[0139] Repeat the process of generating derived variables until all data processing functions have been executed, and the target derived variables are obtained.
[0140] Here, a data processing function is an abstract function obtained by abstracting the code corresponding to each processing step, the logical relationship is the logical relationship between all processing steps, and the processing steps are the method steps that process multiple target derived variables at the same time.
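The loop over data processing functions can be sketched as below, assuming the logical relationships are expressed as "runs after" dependencies. The standard-library `graphlib.TopologicalSorter` (Python 3.9+) is used here as a stand-in for whatever ordering mechanism the disclosure intends; the function names, the shared `ctx` dictionary, and the sample data are invented for illustration.

```python
from graphlib import TopologicalSorter


def filter_queries(ctx):
    """Keep only query records from the institution type of interest."""
    ctx["filtered"] = [q for q in ctx["queries"] if q["org"] == "bank"]


def count_queries(ctx):
    """Derive a variable from the filtered records."""
    ctx["query_count"] = len(ctx["filtered"])


# Hypothetical registry of processing functions and their dependencies.
functions = {"filter": filter_queries, "count": count_queries}
depends_on = {"filter": set(), "count": {"filter"}}


def run_all(ctx):
    """Execute every function once, in an order that respects the dependencies."""
    for name in TopologicalSorter(depends_on).static_order():
        functions[name](ctx)
    return ctx


result = run_all({"queries": [{"org": "bank"}, {"org": "insurer"}, {"org": "bank"}]})
```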
[0141] This disclosure can process multiple target derived variables simultaneously. The processing of multiple target derived variables may be accomplished through a single data processing function or through multiple data processing functions.
[0142] For example, if a total of 50 target derived variables need to be processed, the first data processing function completes the processing of 10 target derived variables, the second data processing function completes the processing of the remaining 20 target derived variables, and the third data processing function completes the processing of all target derived variables.
[0143] For example, the target derived variable is the calculation of the number of queries for a certain institution in March, June, September, December, and 24 months. Analysis shows that these target derived variables are all filtered by institution type and institution name, and then the number of queries is calculated across different time spans.
[0144] Once the target derived variables are determined, the corresponding data to be processed is extracted from the multi-branch tree and organized into a data matrix. Then, by using data processing functions to aggregate and classify by query time and querying institution, all target derived variables can be calculated simultaneously.
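As a hedged illustration of the example above, the following sketch counts queries for several look-back windows in a single pass over the rows of a data matrix. The row layout `(institution_type, institution_name, query_date)`, the function names, and the sample data are assumptions made for the illustration and are not taken from this disclosure.

```python
from datetime import date

WINDOWS = (3, 6, 9, 12, 24)  # assumed look-back windows, in months


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def query_counts(rows, institution_type, institution_name, as_of):
    """Count queries per window in one pass over the data matrix rows."""
    counts = {w: 0 for w in WINDOWS}
    for row_type, row_name, query_date in rows:
        if (row_type, row_name) != (institution_type, institution_name):
            continue  # the filter is applied once and shared by every window
        age = months_between(query_date, as_of)
        for w in WINDOWS:
            if 0 <= age < w:
                counts[w] += 1
    return counts


# Made-up sample rows for illustration only.
rows = [
    ("bank", "A", date(2023, 3, 10)),
    ("bank", "A", date(2022, 11, 2)),
    ("bank", "A", date(2021, 6, 15)),
    ("bank", "B", date(2023, 4, 1)),
]
counts = query_counts(rows, "bank", "A", as_of=date(2023, 4, 24))
```

The point of the sketch is that the institution filter and the date arithmetic are performed once per row, instead of once per row for each of the five variables.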
[0145] This disclosure reduces the total time to compute all variables by simultaneously computing multiple target derived variables, enabling the server to respond quickly and improving the user experience.
[0146] In one specific embodiment, after processing the data to be processed using the current data processing function to obtain the current derived variable, the number of times the data to be processed is referenced is determined. The number of references is used to characterize whether the data to be processed can be used to generate the target derived variable. When the number of references is determined to be a preset value, it is determined that the data to be processed is no longer used to generate the target derived variable, and the data to be processed is deleted from the first data structure.
[0147] Specifically, after organizing the data to be processed in the multi-branch tree into a matrix, the row or column corresponding to the data in the matrix is determined and defined as the reference path. The number of times the data to be processed needs to be used when generating the target derived variable is determined, i.e., the reference count. After each data processing operation using the data processing function, the reference count is decremented by 1 until it reaches 0. This indicates that the data will not be used in subsequent generation of the target derived variable and can therefore be deleted, reducing the number of data access processes and greatly improving data processing speed.
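The reference counting described above can be sketched as follows. This is an assumption-laden illustration: the class `RefCountedStore`, its methods, and the use of a dictionary keyed by reference path are hypothetical, and the preset value is taken to be 0 as in paragraph [0147].

```python
class RefCountedStore:
    """Holds data to be processed together with its number of pending uses."""

    def __init__(self):
        self._data = {}
        self._refs = {}

    def put(self, path, values, ref_count):
        """Store data under a reference path with its expected number of uses."""
        self._data[path] = values
        self._refs[path] = ref_count

    def use(self, path):
        """Return the data for one processing step and decrement its count."""
        values = self._data[path]
        self._refs[path] -= 1
        if self._refs[path] == 0:  # preset value reached: no later function needs it
            del self._data[path]
            del self._refs[path]
        return values

    def __contains__(self, path):
        return path in self._data


store = RefCountedStore()
store.put("query_records", [1, 2, 3], ref_count=2)
first = store.use("query_records")
still_there = "query_records" in store  # one use is still pending
second = store.use("query_records")
gone = "query_records" not in store     # count reached 0, data was deleted
```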
[0148] In one specific embodiment, the first data structure includes: a number array consisting of at least two doubly linked lists; deleting the data to be processed from the first data structure, specifically, deleting the doubly linked list corresponding to the data to be processed from the first data structure.
[0149] This disclosure deletes the doubly linked list corresponding to the data to be processed that is no longer in use, thereby also removing the access relationships among the data and improving the generation efficiency of the target derived variable.
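A number array built from doubly linked lists, and the deletion of one list from it, might look like the sketch below. The classes `Node`, `DoublyLinkedList`, and `NumberArray`, and the choice to key each list by a reference path, are assumptions made for illustration rather than the structure defined in this disclosure.

```python
class Node:
    """One data node with links to its neighbours in the same list."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class DoublyLinkedList:
    def __init__(self, values=()):
        self.head = None
        self.tail = None
        for value in values:
            self.append(value)

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


class NumberArray:
    """Rows keyed by a hypothetical reference path, each a doubly linked list."""

    def __init__(self):
        self.rows = {}

    def add_row(self, path, values):
        self.rows[path] = DoublyLinkedList(values)

    def delete_row(self, path):
        # Dropping the list removes all of its nodes and links in one step
        del self.rows[path]


array = NumberArray()
array.add_row("query_records", ["2023-03-10", "2022-11-02"])
array.add_row("loan_records", ["open", "closed"])
array.delete_row("query_records")
remaining = {path: list(row) for path, row in array.rows.items()}
```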
[0150] Based on the aforementioned method for generating derived variables, this disclosure provides a framework for generating derived variables. Raw credit report data is input into this framework, and the raw credit report data is parsed and processed to output derived variables.
[0151] As shown in Figure 5, the framework includes: an input component (Creditreport) 501, a relationship extraction component (config) 502, a multi-branch tree storage component (N-tree) 503, a data extraction component (Extra) 504, a data alignment component (Align) 505, a data analysis component (Share Analysis) 506, a data sharing component (Data Share) 507, a reachability analysis component (Parallel Filter) 508, a data extraction component (Prepare Data) 509, a variable generation component (Calculate) 510, and an output component (result) 511.
[0152] Among them, input component 501 is used to input the original credit report data;
[0153] The relationship extraction component 502 is used to extract the logical relationships and mapping relationships between the original credit report data; based on the logical relationships and mapping relationships, the original credit report data is deserialized into a multi-branch tree;
[0154] The multi-branch tree storage component 503 is used to store the deserialized original credit report data using multiple storage paths;
[0155] The data extraction component 504 is used to extract the data to be processed corresponding to the target path from the multi-branch tree of the multi-branch tree storage component 503. That is, it obtains the data to be processed from multiple storage paths and no longer performs data extraction from individual storage paths, which improves data processing efficiency.
[0156] Data alignment component 505 is used to perform structured processing on the data to be processed;
[0157] Data analysis component 506 is used to determine similar storage paths in the target path and to determine the minimum closure path of similar storage paths, so as to avoid repeated extraction and repeated alignment of the data to be processed and improve data processing efficiency.
[0158] Data sharing component 507 is used to share available data between different nodes to avoid invalid data acquisition;
[0159] The reachability analysis component 508 is used to perform statistical analysis on the number of references to the data to be processed in the data matrix. When the number of references is 0, the corresponding data to be processed is deleted to reduce the amount of data and improve computational efficiency.
[0160] Data extraction component 509 is used to prepare the current data to be processed;
[0161] The variable generation component 510 is used to process the current data to be processed based on the current data processing function to obtain the target derived variable;
[0162] Output component 511 is used to output the target derived variable.
[0163] The above methods can be applied within this framework. The specific implementation content and the method for generating derived variables within the framework are consistent. Where there is overlap, it will not be described again.
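To make the roles of the multi-branch tree storage component 503, the data extraction component 504, and the data analysis component 506 more concrete, the following sketch extracts the data at several target paths from a tree in one traversal, descending through a shared path prefix only once. It is an illustration under stated assumptions: nested dictionaries stand in for the multi-branch tree, the function `extract` and the sample report are hypothetical, and the shared-prefix traversal is only one possible reading of avoiding repeated extraction for similar storage paths.

```python
def extract(tree, target_paths):
    """Collect the values at several target paths of a nested-dict tree.

    Paths that share a prefix descend through the shared nodes only once.
    """
    found = {}

    def walk(node, prefix, remaining):
        # `remaining` holds the path suffixes still to be matched below `node`
        by_head = {}
        for suffix in remaining:
            if not suffix:
                found[prefix] = node  # the whole path has been consumed
            else:
                by_head.setdefault(suffix[0], []).append(suffix[1:])
        for head, suffixes in by_head.items():
            if isinstance(node, dict) and head in node:
                walk(node[head], prefix + (head,), suffixes)

    walk(tree, (), [tuple(path) for path in target_paths])
    return found


# Made-up report content for illustration only.
report = {
    "report": {
        "queries": {"org": "bank A", "date": "2023-03-10"},
        "loans": {"count": 2},
    }
}
paths = [
    ("report", "queries", "org"),
    ("report", "queries", "date"),
    ("report", "loans", "count"),
]
extracted = extract(report, paths)
```

Paths that do not exist in the tree are simply absent from the result, so a caller can detect missing data by comparing the returned keys with the requested paths.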
[0164] The method for generating derived variables provided in this disclosure acquires the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, where the data to be processed is obtained by deserializing the original credit report data. This disclosure therefore does not generate derived variables directly from the original credit report, but from the deserialized data, which provides an effective data foundation for the subsequent rapid generation of derived variables. Furthermore, this disclosure acquires and calculates the data to be processed for multiple target derived variables simultaneously, avoiding the high computational latency caused by the repeated data acquisition and calculation that arises when each variable is acquired and calculated individually. Based on the relationships between the data to be processed, the data to be processed is stored in a pre-set first data structure. This disclosure thus fully considers the relationships between the data to be processed and stores the data based on these relationships, providing an effective data foundation for subsequent rapid processing. Finally, based on the data processing logic and the first data structure, the data to be processed is processed to generate the target derived variables. By relying on a specific data structure, this disclosure effectively improves data operation efficiency and storage efficiency, solves the problems of low efficiency and long processing time in the prior art when generating derived variables, and achieves fast and efficient generation of derived variables.
[0165] The apparatus for generating derived variables provided in the embodiments of this disclosure is described below. The apparatus described below and the method for generating derived variables described above may be referred to in correspondence with each other; overlapping content is not repeated. As shown in Figure 6, the apparatus includes:
[0166] The acquisition module 601 is used to acquire the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable. The data to be processed is obtained by deserializing the original credit report data.
[0167] The organization module 602 is used to store the data to be processed into a pre-set first data structure based on the relationship between the data to be processed;
[0168] The processing module 603 is used to process the data to be processed based on the data processing logic and the first data structure to generate target derived variables.
[0169] In one specific embodiment, the data processing logic includes: data processing functions and the logical relationships between each data processing function; the processing module 603 is specifically used to perform the following derived variable generation process for each data processing function: determining the current data processing function based on the logical relationship; extracting the data to be processed corresponding to the current data processing function from the first data structure, processing the data to be processed using the current data processing function to obtain the current derived variable; repeating the derived variable generation process until all data processing functions are completed to obtain the target derived variable.
[0170] In one specific embodiment, the processing module 603 is further configured to determine the number of times the data to be processed is referenced, where the number of references is used to characterize whether the data to be processed can be used to generate the target derived variable; when the number of references is determined to be a preset value, it is determined that the data to be processed is no longer used to generate the target derived variable, and the data to be processed is deleted from the first data structure.
[0171] In one specific embodiment, the acquisition module 601 is specifically used to determine the derived variable template corresponding to the target derived variable; acquire the target path corresponding to the derived variable template, and the data processing logic corresponding to the derived variable template; and acquire the data to be processed from the pre-set second data structure based on the target path.
[0172] In one specific embodiment, the acquisition module 601 is further configured to acquire the original credit report data; based on the segment information of the original credit report data, the original credit report data is deserialized into the second data structure.
[0173] In one specific embodiment, the first data structure includes a number array, which includes a first data node and a second data node; the second data structure includes a multi-branch tree, which includes a root object and child objects; the root object includes a data identifier, which corresponds to a derived variable template; the child objects include data to be processed, and the parent object of each child object is the root object or another child object; the organization module 602 is specifically used to determine the data type of the data stored in each child object, which includes string type and non-string type; organize the data stored in the first child object in the first data node, where the first child object is a child object whose stored data is of non-string type; organize the data stored in the second child object in the second data node, where the second child object is a child object that has a subordinate relationship with the first child object and whose stored data is of string type.
[0174] In one specific embodiment, the number array includes: a two-dimensional number array composed of at least two doubly linked lists; the first data node includes: the data node of the i-th row of the doubly linked list in the number array, and the second data node includes: all data nodes in the number array excluding the first data node, where i is an integer greater than or equal to 1; the organization module 602 is specifically used to organize the data stored in the first sub-object into the data node of the i-th row of the doubly linked list; for each first data node of the i-th row of the doubly linked list, the following storage procedure is performed: determining a second sub-object that has a subordinate relationship with the first sub-object of the first data node; determining the entries of the data stored in the second sub-object; and organizing the data corresponding to each entry into the second data node corresponding to the first data node.
[0175] In one specific embodiment, the number array includes: a two-dimensional number array composed of at least two doubly linked lists; the first data node includes: the data node of the j-th column of the doubly linked list in the number array, and the second data node includes: all data nodes in the number array excluding the first data node, where j is an integer greater than or equal to 1; the organization module 602 is specifically used to organize the data stored in the first sub-object into the data node of the j-th column of the doubly linked list; for each first data node of the j-th column of the doubly linked list, the following storage procedure is performed: determining a second sub-object that has a subordinate relationship with the first sub-object of the first data node; determining the entries of the data stored in the second sub-object; and organizing the data corresponding to each entry into the second data node corresponding to the first data node.
[0176] In one specific embodiment, the first data structure includes: a number array consisting of at least two doubly linked lists; the organization module 602 is further configured to delete the doubly linked list corresponding to the data to be processed from the first data structure.
[0177] Figure 7 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 7, the electronic device may include a processor 701, a communications interface 702, a memory 703, and a communication bus 704. The processor 701, communications interface 702, and memory 703 communicate with each other via the communication bus 704. The processor 701 can call logical instructions in the memory 703 to execute a method for generating derived variables. This method includes: acquiring data to be processed corresponding to at least one target derived variable, and data processing logic corresponding to the target derived variable, wherein the data to be processed is obtained by deserializing the original credit report data; storing the data to be processed into a pre-set first data structure based on the correlation between the data to be processed; and processing the data to be processed based on the data processing logic and the first data structure to generate the target derived variable.
[0178] Furthermore, the logical instructions in the aforementioned memory 703 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this disclosure, essentially, or the parts that contribute to the prior art, or parts of the technical solutions, can be embodied in the form of software products. These computer software products are stored in a storage medium and include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0179] On the other hand, this disclosure also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer is able to execute the method for generating derived variables provided by the above methods. The method includes: acquiring data to be processed corresponding to at least one target derived variable, and data processing logic corresponding to the target derived variable, wherein the data to be processed is obtained by deserializing the original credit report data; storing the data to be processed in a pre-set first data structure based on the correlation between the data to be processed; and processing the data to be processed based on the data processing logic and the first data structure to generate the target derived variable.
[0180] In another aspect, this disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon. When executed by a processor, the computer program performs the method for generating derived variables provided above. The method includes: acquiring data to be processed corresponding to at least one target derived variable, and data processing logic corresponding to the target derived variable, wherein the data to be processed is obtained by deserializing original credit report data; storing the data to be processed in a pre-set first data structure based on the correlation between the data to be processed; and processing the data to be processed based on the data processing logic and the first data structure to generate the target derived variable.
[0181] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0182] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0183] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit them. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure.
Claims
1. A method for generating derived variables, characterized in that it includes: obtaining the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, wherein the data to be processed is obtained by deserializing the original credit report data; storing the data to be processed in a pre-set first data structure based on the relationships between the data to be processed; and processing the data to be processed based on the data processing logic and the first data structure to generate the target derived variable; wherein the data processing logic includes: data processing functions and the logical relationships between the data processing functions; the step of processing the data to be processed based on the data processing logic and the first data structure to generate the target derived variable includes: for each of the data processing functions, performing the following derived variable generation process: determining the current data processing function based on the logical relationship; extracting the data to be processed corresponding to the current data processing function from the first data structure, and processing the data to be processed using the current data processing function to obtain the current derived variable; and repeating the derived variable generation process until all the data processing functions have been executed, to obtain the target derived variable; after processing the data to be processed using the current data processing function to obtain the current derived variable, the method further includes: determining the number of times the data to be processed is referenced, wherein the number of references is used to characterize whether the data to be processed can be used to generate the target derived variable; and when the number of references is determined to be a preset value, determining that the data to be processed is no longer used to generate the target derived variable, and deleting the data to be processed from the first data structure.
2. The method for generating derived variables according to claim 1, characterized in that the step of acquiring the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, includes: determining the derived variable template corresponding to the target derived variable; obtaining the target path corresponding to the derived variable template, and the data processing logic corresponding to the derived variable template; and obtaining the data to be processed from a pre-set second data structure based on the target path.
3. The method for generating derived variables according to claim 2, characterized in that, before acquiring the data to be processed corresponding to at least one target derived variable, and the data processing logic corresponding to the target derived variable, the method further includes: obtaining the original credit report data; and deserializing the original credit report data into the second data structure based on the segment information of the original credit report data.
4. The method for generating derived variables according to claim 2, characterized in that the first data structure includes a number array, which includes a first data node and a second data node; the second data structure includes a multi-branch tree, which includes a root object and sub-objects; the root object includes a data identifier, which corresponds to the derived variable template; the sub-objects include the data to be processed, and the parent object of each sub-object is the root object or another sub-object; the step of storing the data to be processed into a pre-set first data structure based on the correlation between the data to be processed includes: determining the data type of the data stored in each of the sub-objects, wherein the data type includes: string type and non-string type; organizing the data stored in the first sub-object in the first data node, where the first sub-object is a sub-object whose stored data is of the non-string type; and organizing the data stored in the second sub-object in the second data node, where the second sub-object is a sub-object that has a subordinate relationship with the first sub-object and whose stored data is of the string type.
5. The method for generating derived variables according to claim 4, characterized in that the number array includes: a two-dimensional number array consisting of at least two doubly linked lists; the first data node includes: the data node of the i-th row of the doubly linked list in the number array; the second data node includes: the remaining data nodes after removing the first data node from all data nodes in the number array, where i is an integer greater than or equal to 1; organizing the data stored in the first sub-object within the first data node includes: organizing the data stored in the first sub-object into the data nodes of the doubly linked list in the i-th row; organizing the data stored in the second sub-object within the second data node includes: for each first data node of the doubly linked list in the i-th row, executing the following storage procedure: determining a second sub-object that has the subordinate relationship with the first sub-object of the first data node; determining the entries of the data stored in the second sub-object; and organizing the data corresponding to each entry into the second data node corresponding to the first data node.
6. The method for generating derived variables according to claim 4, characterized in that the number array includes: a two-dimensional number array consisting of at least two doubly linked lists; the first data node includes: the data node of the j-th column of the doubly linked list in the number array; the second data node includes: the remaining data nodes after removing the first data node from all data nodes in the number array, where j is an integer greater than or equal to 1; organizing the data stored in the first sub-object within the first data node includes: organizing the data stored in the first sub-object into the data nodes of the j-th column of the doubly linked list; organizing the data stored in the second sub-object within the second data node includes: for each first data node of the doubly linked list in the j-th column, executing the following storage procedure: determining a second sub-object that has the subordinate relationship with the first sub-object of the first data node; determining the entries of the data stored in the second sub-object; and organizing the data corresponding to each entry into the second data node corresponding to the first data node.
7. The method for generating derived variables according to claim 1, characterized in that the first data structure includes: a number array consisting of at least two doubly linked lists; and deleting the data to be processed from the first data structure includes: deleting the doubly linked list corresponding to the data to be processed from the first data structure.
8. A device for generating derived variables, characterized in that it includes: an acquisition module, used to acquire data to be processed corresponding to at least one target derived variable, and data processing logic corresponding to the target derived variable, wherein the data to be processed is obtained by deserializing the original credit report data; an organization module, used to store the data to be processed into a pre-set first data structure based on the relationships between the data to be processed; and a processing module, used to process the data to be processed based on the data processing logic and the first data structure to generate the target derived variable; wherein the data processing logic includes: data processing functions and the logical relationships between the data processing functions; the processing module is specifically used to perform the following derived variable generation process for each data processing function: determine the current data processing function based on the logical relationships; extract the data to be processed corresponding to the current data processing function from the first data structure; process the data to be processed using the current data processing function to obtain the current derived variable; and repeat the derived variable generation process until all data processing functions have been executed, to obtain the target derived variable; the processing module is further configured to determine the number of times the data to be processed is referenced, wherein the number of references is used to characterize whether the data to be processed can be used to generate the target derived variable; and when the number of references is determined to be a preset value, to determine that the data to be processed is no longer used to generate the target derived variable, and to delete the data to be processed from the first data structure.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method for generating derived variables as described in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the method for generating derived variables as described in any one of claims 1 to 7.
11. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for generating derived variables as described in any one of claims 1 to 7.