A real-time user portrait providing method and device, and a storage medium
By using Oracle GoldenGate (OGG) source-side processes and a Replicat (R) process to capture database updates for the user profiling system, pushing the changes to Kafka, and using Flink to ingest the data and synchronize it to MySQL and ClickHouse, the invention addresses the inability of existing systems to meet diverse audience-targeting and real-time data needs. It supplies real-time user behavior data and improves business analysis capabilities and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN ZBANK CO LTD
- Filing Date
- 2022-11-30
- Publication Date
- 2026-06-26
AI Technical Summary
Existing user profiling systems cannot meet the diverse needs for audience targeting and real-time data, and cannot quickly and accurately provide financial enterprises with real-time user behavior data, resulting in insufficient business analysis capabilities.
By creating a basic data table for the user profiling system, using OGG source and R processes to monitor database updates, pushing data to Kafka, and using FLINK to build a Kafka connector for data access and incremental statistics, the data is finally synchronized to MySQL and CLICKHOUSE databases to achieve real-time data supply.
It enables multi-dimensional and multi-type targeted user filtering, enhances the business department's ability to grasp the core customer market, provides real-time business data with timely and complex calculation-based external indicators, reduces business complexity and costs, and improves user experience.
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing, and provides a method, apparatus, and storage medium for real-time user profiling.
Background Technology
[0002] The post-80s and post-90s generations, totaling 340 million people, are increasingly becoming the main consumers of financial institutions. However, their financial consumption habits are changing. They are unwilling to conduct business at financial branches and dislike passively receiving financial products and services. Young people spend most of their time on mobile internet and smartphones. On average, each person uses a smartphone for more than 3 hours a day, and young people may spend more than 4 hours. Browsing mobile phones has become the third most common lifestyle habit after work and sleep. Mobile apps have also become the customer entry point, service entry point, consumption entry point, and data entry point for all financial institutions.
[0003] Financial institutions need to leverage customer profiles to understand their customers, identify target customers, and reach them. As their business lines develop, the demands for both user profiles and real-time data are increasing. Regarding user profiles, there is a need for faster, more accurate, and more convenient tools for filtering customer groups and easy-to-use user analysis capabilities. For real-time data, there is a need for real-time responses to user behavior, and an increasing demand for real-time data in business scenarios such as algorithm features, statistical indicators, and business visualization.
[0004] To address the demand for real-time data, the data processing procedures of the existing user profiling system need to be revamped. This includes optimizing and adjusting the data extraction from the source system, the processing of profiling metrics, and the data synchronization process. Once the data is optimized, the data viewed by customers through the user profiling system will be up-to-date.
[0005] To address the lack of a provider for historical real-time data, the inability of existing user profiling systems to meet diverse audience targeting needs, and the business requirements for further audience analysis, a solution is proposed that transforms the original T-1 user profiling system into a real-time user profiling system by establishing real-time data integration, real-time data scheduling, and a real-time data quality center. This solution effectively addresses business pain points and meets business requirements.
Summary of the Invention
[0006] The purpose of this invention is to transform the original data provision from T-1 to real-time data provision to the user profiling system.
[0007] To achieve the above objectives, the present invention employs the following technical means:
[0008] A method for real-time user profiling includes the following steps:
[0009] Step 1: Create the basic data table for the profiling system and save the data synchronized from the upstream system;
[0010] Step 2: After creating the basic data tables, data needs to be synchronized from the upstream system. The OGG source server is deployed on the upstream system. The OGG source server and the upstream database run together. There are two processes running on the OGG source server: the E process (Extract) and the P process (pump). The E process is responsible for reading the Oracle archive logs and writing the database updates to the trail file (the trail file can be understood as a file format for data transmission unique to OGG). The P process is responsible for monitoring changes in the local trail file and pushing the updates to the trail file directory on the target end through the TCP/IP protocol.
[0011] Step 3: The target is deployed on a server in the same network segment as the OGG source. An R process (REPLICAT) runs on the target. The R process listens for local trail file updates and then sends the updates to KAFKA.
[0012] Step 4: Build a connector in FLINK to connect to KAFKA, facilitating the access of KAFKA data;
[0013] Step 5: FLINK reads the data, generates a timestamp, and inserts the data into the basic data table defined in Step 1 by defining a write stream.
[0014] Step 6: Perform incremental data statistics. Select the transaction data from the most recent ten minutes, based on the timestamp generated in Step 5, and compute the indicator data from it.
[0015] Step 7: Synchronize the indicator data obtained in Step 6 to the MySQL and CLICKHOUSE databases for the user profiling system to query the data in real time.
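The timestamping in Step 5 can be illustrated with a minimal sketch. It assumes that each change record arrives from KAFKA as a JSON object with `table`, `op_type`, and `after` fields; that layout is an assumption modeled on a common GoldenGate JSON output and is not specified in this document. The `ingest_ts` column name and the `stamp_record` helper are likewise hypothetical.

```python
import json
from datetime import datetime, timezone


def stamp_record(raw_message: str, now: datetime) -> dict:
    """Parse one change record and attach the time at which it was read."""
    record = json.loads(raw_message)
    # "after" holds the column values following the change (assumed layout).
    row = dict(record.get("after") or {})
    row["op_type"] = record.get("op_type")
    # Hypothetical column used later as the ten-minute filter condition.
    row["ingest_ts"] = now.strftime("%Y-%m-%d %H:%M:%S")
    return row


sample = json.dumps(
    {"table": "SCOTT.T", "op_type": "I", "after": {"ID": 1, "AMOUNT": 25.5}}
)
stamped = stamp_record(sample, datetime(2022, 11, 30, 9, 0, 0, tzinfo=timezone.utc))
print(stamped)
```

Passing the clock value in as an argument keeps the function deterministic, which makes the later ten-minute filter easy to test.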
[0016] In the above technical solution, step 4 specifically includes the following steps:
[0017] Step 4.1: A consumer thread is started to pull data from Kafka and store it in the Handover's next object;
[0018] Step 4.2: In a loop, repeatedly take the next object from the Handover;
[0019] Step 4.3: Record the current offset and update it to the thread pool. This is used to initialize the worker consumption settings and save the consumption information of all topics.
[0020] Step 4.4: Write the consumption progress data from the consumption information into a temporary object;
[0021] Step 4.5: Submit the offset data of the current batch checkpoint recorded in the temporary object to KAFKA;
[0022] Step 4.6: The Kafka message is transformed and converted into an object stream according to the specified type;
[0023] Step 4.7: Register the object stream as a temporary table;
[0024] Step 4.8: Concatenate multiple INSERT INTO statements to store the data from the temporary table into the system's basic data table.
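Steps 4.1 to 4.5 describe a single-slot exchange between a consumer thread and a processing loop, followed by an offset commit. The sketch below illustrates that pattern with the Python standard library only. The `Handover` class, the `run` function, and the in-memory batches are hypothetical stand-ins rather than FLINK's or KAFKA's actual classes, and the commit callback simply records what would be sent to KAFKA.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[str, int, int, str]  # (topic, partition, offset, value)


class Handover:
    """Single-slot exchange between a consumer thread and the main loop."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._next: Optional[List[Record]] = None
        self._closed = False

    def produce(self, batch: List[Record]) -> None:
        with self._cond:
            while self._next is not None:  # wait until the slot is free
                self._cond.wait()
            self._next = batch
            self._cond.notify_all()

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def poll_next(self) -> Optional[List[Record]]:
        with self._cond:
            while self._next is None and not self._closed:
                self._cond.wait()
            batch, self._next = self._next, None
            self._cond.notify_all()
            return batch  # None once the producer has closed and the slot is empty


def run(batches: List[List[Record]], commit: Callable[[dict], None]) -> List[str]:
    handover = Handover()

    def consumer_thread() -> None:  # Step 4.1: pull data and hand it over
        for batch in batches:
            handover.produce(batch)
        handover.close()

    worker = threading.Thread(target=consumer_thread)
    worker.start()
    current: Dict[Tuple[str, int], int] = {}  # latest offset per topic partition
    processed: List[str] = []
    while True:  # Step 4.2: loop on the Handover
        batch = handover.poll_next()
        if batch is None:
            break
        for topic, partition, offset, value in batch:
            processed.append(value)
            current[(topic, partition)] = offset  # Step 4.3: record the offset
        pending = dict(current)  # Step 4.4: copy progress into a temporary object
        commit(pending)  # Step 4.5: submit the recorded offsets
    worker.join()
    return processed


committed: List[dict] = []
values = run(
    [[("t", 0, 0, "a"), ("t", 0, 1, "b")], [("t", 0, 2, "c")]], committed.append
)
print(values, committed)
```

For simplicity the sketch commits the last processed offset after every batch; a real KAFKA client commits the next offset to read, and the method above commits at checkpoint boundaries rather than per batch.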
[0025] In the above technical solution, the incremental data statistics in step 6 are as follows:
[0026] Step 6.1: Directly retrieve data from the basic data table, add a timestamp condition, and retrieve data from the most recent ten minutes;
Step 6.2: Configure scheduling tasks to initialize user profile-related tasks every 10 minutes and process newly synchronized data.
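The timestamp condition of Step 6.1 can be sketched as a filtered aggregate query. The example uses an in-memory SQLite database purely as a stand-in for the basic data table; the table name `base_txn`, its columns, and the two indicators (transaction count and amount per user) are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base_txn (user_id INTEGER, amount REAL, ingest_ts TEXT)")
conn.executemany(
    "INSERT INTO base_txn VALUES (?, ?, ?)",
    [
        (1, 10.0, "2022-11-30 08:45:00"),  # older than ten minutes: excluded
        (1, 20.0, "2022-11-30 08:55:00"),
        (2, 5.0, "2022-11-30 08:58:30"),
        (1, 7.5, "2022-11-30 08:59:59"),
    ],
)


def recent_indicators(connection: sqlite3.Connection, now: str) -> list:
    """Aggregate only the rows stamped within the ten minutes before `now`."""
    query = """
        SELECT user_id, COUNT(*) AS txn_count, SUM(amount) AS txn_amount
        FROM base_txn
        WHERE ingest_ts > datetime(?, '-10 minutes') AND ingest_ts <= ?
        GROUP BY user_id
        ORDER BY user_id
    """
    return connection.execute(query, (now, now)).fetchall()


indicators = recent_indicators(conn, "2022-11-30 09:00:00")
print(indicators)  # [(1, 2, 27.5), (2, 1, 5.0)]
```

Using a half-open window (later than the start, up to and including the end) means that consecutive ten-minute runs do not count the same row twice.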
[0027] The present invention also provides a real-time user profile data supply device, comprising the following modules:
[0028] Basic Data Table Module: Creates the basic data tables for the profiling system and stores data synchronized from the upstream system;
[0029] Data transmission module: After creating the basic data tables, data needs to be synchronized from the upstream system. The OGG source server is deployed on the upstream system. The OGG source server and the upstream database run together. There are two processes running on the OGG source server: the E process (Extract) and the P process (pump). The E process is responsible for reading the Oracle archive logs and writing the database updates to the trail file (the trail file can be understood as a file format for data transmission unique to OGG). The P process is responsible for monitoring changes in the local trail file and pushing the updates to the trail file directory on the target end through the TCP/IP protocol.
[0030] Data update module: The target end is deployed on a server in the same network segment as the OGG source end. The R process (REPLICAT) running on the target end is the most important component. The main function of the R process is to listen for local trail file updates and then send the updates to KAFKA.
[0031] KAFKA Connector: Build a connector in FLINK to connect to KAFKA, facilitating the access of KAFKA data;
[0032] Data writing module: FLINK reads the data, generates timestamps, and inserts the data into the basic data table created by the basic data table module by defining a write stream.
[0033] Incremental statistics module: Performs incremental data statistics. The transaction data is obtained by taking the data from the most recent ten minutes based on the timestamp generated in step 5, and then obtaining the indicator data.
[0034] Data storage module: Synchronizes the indicator data obtained in step 6 to the MySQL and CLICKHOUSE databases for the user profiling system to query the data in real time.
[0035] The KAFKA connector in the above-mentioned device specifically includes the following steps:
[0036] Step 1: A consumer thread is started to pull data from Kafka and store it in the next object of Handover;
[0037] Step 2: In a loop, repeatedly take the next object from the Handover;
[0038] Step 3: Record the current offset and update it to the thread pool. This is used to initialize the worker consumption settings and save the consumption information for all topics.
[0039] Step 4: Write the consumption progress data from the consumption information into a temporary object;
[0040] Step 5: Submit the offset data of the current batch checkpoint recorded in the temporary object to KAFKA;
[0041] Step 6: The Kafka message is transformed and converted into an object stream according to the specified type;
[0042] Step 7: Register the object stream as a temporary table;
[0043] Step 8: Concatenate multiple INSERT INTO statements to store the data from the temporary table into the system's basic data table.
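Steps 6 to 8 of the connector (typed conversion, temporary table registration, and concatenated INSERT INTO statements) can be sketched as follows. SQLite stands in for the FLINK table environment, and the `Txn` type, the message layout, and the table names `tmp_txn`, `base_txn`, and `base_user_seen` are hypothetical.

```python
import json
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Txn:
    user_id: int
    amount: float
    ingest_ts: str


def to_txn(message: str) -> Txn:
    """Step 6: convert one raw message into an object of the specified type."""
    data = json.loads(message)
    return Txn(int(data["user_id"]), float(data["amount"]), str(data["ingest_ts"]))


messages = [
    '{"user_id": 1, "amount": "20.0", "ingest_ts": "2022-11-30 08:55:00"}',
    '{"user_id": 2, "amount": 5, "ingest_ts": "2022-11-30 08:58:30"}',
]
stream = [to_txn(m) for m in messages]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE base_txn (user_id INTEGER, amount REAL, ingest_ts TEXT);
    CREATE TABLE base_user_seen (user_id INTEGER, last_seen TEXT);
    CREATE TEMP TABLE tmp_txn (user_id INTEGER, amount REAL, ingest_ts TEXT);
    """
)
# Step 7: expose the object stream as a temporary table.
conn.executemany("INSERT INTO tmp_txn VALUES (?, ?, ?)", [astuple(t) for t in stream])

# Step 8: concatenate several INSERT INTO statements and submit them together.
statements = [
    "INSERT INTO base_txn SELECT user_id, amount, ingest_ts FROM tmp_txn",
    "INSERT INTO base_user_seen "
    "SELECT user_id, MAX(ingest_ts) FROM tmp_txn GROUP BY user_id",
]
conn.executescript(";\n".join(statements) + ";")

base_rows = conn.execute("SELECT * FROM base_txn ORDER BY user_id").fetchall()
seen_rows = conn.execute("SELECT * FROM base_user_seen ORDER BY user_id").fetchall()
print(base_rows, seen_rows)
```

Submitting the statements as one unit lets several basic data tables be filled from a single pass over the temporary table.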
[0044] In the above device, the incremental data statistics are performed in step 6 of the incremental statistics module as follows:
[0045] Step 1: Retrieve data directly from the basic data table, add a timestamp condition, and retrieve data from the most recent ten minutes;
Step 2: Configure a scheduling task to initialize user profile-related tasks every 10 minutes and process the newly synchronized data.
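The ten-minute scheduling in Step 2 can be sketched as a small calculation of run times and the data window each run covers. Aligning runs to ten-minute clock boundaries is an assumption made here for illustration; this document only states the ten-minute interval. The helper names `next_run` and `window_for` are hypothetical.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=10)


def next_run(after: datetime) -> datetime:
    """Return the first ten-minute boundary strictly later than `after`."""
    floored = after.replace(
        minute=after.minute - after.minute % 10, second=0, microsecond=0
    )
    return floored + INTERVAL


def window_for(run_time: datetime) -> tuple:
    """Return the (start, end) window of newly synchronized data for one run."""
    return run_time - INTERVAL, run_time


run = next_run(datetime(2022, 11, 30, 8, 53, 12))
start, end = window_for(run)
print(run, start, end)
```

Because each window ends exactly where the next one begins, every row is processed by exactly one run as long as the query treats one end of the window as exclusive.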
[0046] The present invention also provides a storage medium in which a processor executes a program to implement the above-described method for real-time data supply of user profiles.
[0047] Because the present invention employs the above-mentioned technical means, it has the following beneficial effects:
[0048] 1. The data supply to the user profiling system is transformed from T-1 to real time, enabling multi-dimensional and multi-type targeted user screening and helping business departments quickly grasp the core customer market.
[0049] 2. For real-time business data, it provides timely insight into trending topics and emerging potential. This accelerates the application of the business in production and consumption, thereby increasing the volume of high-quality content creation and users' ability to consume content.
[0050] It also provides externally displayed metrics based on real-time, complex calculations, enhancing the user experience. It also eliminates the need to calculate metrics via scripts in the business backend, reducing business complexity, saving costs, and improving efficiency.
[0051] 3. For real-time algorithm features, it provides real-time algorithm features based on creators, content, and consumers. Working with the algorithm team, this has achieved significant improvements in core metrics such as DAU, retention, and user payment in multiple projects.
Detailed Implementation
[0052] The embodiments of the present invention will be described in detail below. Although the present invention will be described and illustrated in conjunction with some specific embodiments, it should be noted that the present invention is not limited to these embodiments. On the contrary, any modifications or equivalent substitutions made to the present invention should be covered within the scope of the claims of the present invention.
[0053] Furthermore, to better illustrate the present invention, numerous specific details are set forth in the following detailed embodiments. Those skilled in the art will understand that the present invention can be practiced without these specific details.
[0054] A method for real-time user profiling includes the following steps:
[0055] Step 1: Create the basic data table for the profiling system and save the data synchronized from the upstream system;
[0056] Step 2: After creating the basic data tables, data needs to be synchronized from the upstream system. The OGG source server is deployed on the upstream system. The OGG source server and the upstream database run together. There are two processes running on the OGG source server: the E process (Extract) and the P process (pump). The E process is responsible for reading the Oracle archive logs and writing the database updates to the trail file (the trail file can be understood as a file format for data transmission unique to OGG). The P process is responsible for monitoring changes in the local trail file and pushing the updates to the trail file directory on the target end through the TCP/IP protocol.
[0057] As an example, the installation steps for the OGG source are described below:
[0058] Step 1: On the source database server, please navigate to the Oracle GoldenGate installation directory and execute the following steps after connecting with ggsci:
[0059] --Preserve archived logs for cap
[0060] DBLOGIN USERID ogg,PASSWORD ogg
[0061] REGISTER EXTRACT cap LOGRETENTION
[0062] Step 2: Create the Extract group cap
[0063] --Group names should be no more than 8 characters long and should not end with a number (but can start with a number), otherwise the report file may have problems; also try to avoid using keywords such as extract, ext, replicat, rep.
[0064] ADD EXTRACT cap, TRANLOG, BEGIN now
[0065] Step 3: Create a Local Trail for the cap group
[0066] ADD EXTTRAIL dirdat/lt, EXTRACT cap
[0067] Step 4: Edit the parameters for the cap group;
[0068] --Execute the EDIT PARAMS cap command, add the following parameters, and save and exit with :wq.
[0069] example:
[0070] EXTRACT cap
[0071] USERID ogg, PASSWORD ogg
[0072] TRANLOGOPTIONS LOGRETENTION ENABLED
[0073] EXTTRAIL dirdat/lt
[0074] TABLE scott.t;
[0075] Step 5: The parameters will be saved in the dirprm/cap.prm file.
[0076] TRANLOGOPTIONS DBLOGREADER
[0077] Step 6: Add additional logs to the replicated table;
[0078] DBLOGIN USERID ogg,PASSWORD ogg
[0079] ADD TRANDATA scott.t
[0080] Step 7: Configure the Data Pump;
[0081] -- Create a second EXTRACT group for the Pump, noting that the data source Local Trail is specified as lt.
[0082] ADD EXTRACT pumpkw, EXTTRAILSOURCE dirdat/lt, BEGIN now
[0083] Step 8: Create a Remote Trail for the pump group
[0084] Remote Trail names can only be two characters long, such as rt.
[0085] ADD RMTTRAIL dirdat/rt, EXTRACT pumpkw
[0086] Step 9: Edit parameters for the pump group;
[0087] --Execute the command `EDIT PARAMS pumpkw`, add the following parameters, and save and exit with `:wq`. Specify the location of the target database's GoldenGate instance. Example:
[0088] EXTRACT pumpkw
[0089] USERID ogg, PASSWORD ogg
[0090] RMTHOST 192.168.144.207, MGRPORT 7809
[0091] RMTTRAIL dirdat/rt
[0092] TABLE scott.t;
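The naming rules stated in the installation steps above (group names of at most 8 characters that do not end with a number and avoid keywords, and two-character trail names) can be checked before running the commands. The following is a minimal sketch of such a check; the helper names `check_group_name` and `check_trail_name` are hypothetical, and the rules encoded are only those given in this document, not a complete statement of GoldenGate's naming constraints.

```python
RESERVED = {"extract", "ext", "replicat", "rep"}


def check_group_name(name: str) -> list:
    """Return the list of rule violations for an Extract or Replicat group name."""
    if not name:
        return ["name is empty"]
    problems = []
    if len(name) > 8:
        problems.append("longer than 8 characters")
    if name[-1].isdigit():
        problems.append("ends with a digit")
    if name.lower() in RESERVED:
        problems.append("collides with a keyword")
    return problems


def check_trail_name(path: str) -> list:
    """Return rule violations for a trail path such as dirdat/rt."""
    prefix = path.rsplit("/", 1)[-1]
    if len(prefix) == 2:
        return []
    return ["trail name must be exactly two characters"]


for candidate in ("cap", "pumpkw", "extractor9", "rep"):
    print(candidate, check_group_name(candidate))
print("dirdat/rt", check_trail_name("dirdat/rt"))
```

The group names used in the example commands (`cap`, `pumpkw`) and the trail names (`lt`, `rt`) pass these checks.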
[0093] Step 3: The target is deployed on a server in the same network segment as the OGG source. An R process (REPLICAT) runs on the target. The R process listens for local trail file updates and then sends the updates to KAFKA.
[0094] The installation steps on the target device are as follows:
[0095] After connecting via ggsci in the Oracle GoldenGate installation directory on the target database server, perform the following steps:
[0096] Step 1: Create the CheckPoint table
[0097] DBLOGIN USERID ogg,PASSWORD ogg
[0098] ADD CHECKPOINTTABLE checkpoint
[0099] Step 2: Create a Replicat group, noting that the data source Remote Trail is set to rt, and that the checkpoint table is the one created in Step 1.
[0100] ADD REPLICAT wrt, EXTTRAIL dirdat/rt, BEGIN now, CHECKPOINTTABLE ogg.checkpoint
[0101] Step 3: Edit the parameters for the wrt group
[0102] Execute the command EDIT PARAMS wrt, add the following parameters, and save and exit with :wq.
[0103] example:
[0104] REPLICAT wrt
[0105] USERID ogg, PASSWORD ogg
[0106] HANDLECOLLISIONS
[0107] ASSUMETARGETDEFS
[0108] MAP scott.t, TARGET scott.t;
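The behavior attributed to the R process (listening for local trail file updates and forwarding them to KAFKA) can be illustrated with a position-tracking file reader. This is only a conceptual sketch: real OGG trail files use a proprietary binary format and Replicat keeps its own checkpoints, whereas the example uses a newline-delimited text file as a stand-in, and the `sent` list stands in for a KAFKA producer. The function name `read_new_records` and the file name are hypothetical.

```python
import os
import tempfile


def read_new_records(path: str, position: int) -> tuple:
    """Return the complete records appended after `position` and the new position."""
    with open(path, "rb") as handle:
        handle.seek(position)
        data = handle.read()
    # Only consume complete lines; a partial trailing line waits for the next pass.
    end = data.rfind(b"\n") + 1
    records = [line.decode("utf-8") for line in data[:end].splitlines()]
    return records, position + end


sent = []  # stands in for messages handed to a KAFKA producer

with tempfile.TemporaryDirectory() as tmp:
    trail = os.path.join(tmp, "rt000000")
    with open(trail, "wb") as handle:
        handle.write(b"insert id=1\nupdate id=1\n")

    checkpoint = 0
    records, checkpoint = read_new_records(trail, checkpoint)
    sent.extend(records)

    with open(trail, "ab") as handle:
        handle.write(b"delete id=1\ninsert id=")  # the last record is incomplete

    records, checkpoint = read_new_records(trail, checkpoint)
    sent.extend(records)

print(sent, checkpoint)
```

Persisting the position after each pass is what allows the reader to resume after a restart without resending records it has already forwarded.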
[0109] Step 4: Build a connector in FLINK to connect to KAFKA, facilitating the access of KAFKA data;
[0110] Step 5: FLINK reads the data, generates a timestamp, and inserts the data into the basic data table defined in Step 1 by defining a write stream.
[0111] Step 6: Perform incremental data statistics. Select the transaction data from the most recent ten minutes, based on the timestamp generated in Step 5, and compute the indicator data from it.
[0112] Step 7: Synchronize the indicator data obtained in Step 6 to the MySQL and CLICKHOUSE databases for the user profiling system to query the data in real time.
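Step 7 writes the same indicator data to two stores. A minimal sketch of that fan-out is shown below, using in-memory classes in place of real database clients. Treating the MySQL side as keyed (latest value per user and indicator) and the CLICKHOUSE side as append-only is an assumption made for illustration; this document does not describe the table designs. All class and function names are hypothetical.

```python
from typing import Dict, Iterable, List, Protocol, Tuple

Indicator = Tuple[int, str, float]  # (user_id, indicator_name, value)


class Sink(Protocol):
    def write(self, rows: Iterable[Indicator]) -> None: ...


class KeyedSink:
    """Keeps the latest value per key; stands in for a MySQL table with a primary key."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[int, str], float] = {}

    def write(self, rows: Iterable[Indicator]) -> None:
        for user_id, name, value in rows:
            self.rows[(user_id, name)] = value


class AppendSink:
    """Keeps every row; stands in for an append-oriented CLICKHOUSE table."""

    def __init__(self) -> None:
        self.rows: List[Indicator] = []

    def write(self, rows: Iterable[Indicator]) -> None:
        self.rows.extend(rows)


def sync(rows: Iterable[Indicator], sinks: List[Sink]) -> None:
    """Send one batch of indicator rows to every configured sink."""
    batch = list(rows)  # materialize once so each sink sees the same rows
    for sink in sinks:
        sink.write(batch)


mysql_like = KeyedSink()
clickhouse_like = AppendSink()
sync([(1, "txn_count", 2.0)], [mysql_like, clickhouse_like])
sync([(1, "txn_count", 3.0)], [mysql_like, clickhouse_like])
print(mysql_like.rows, clickhouse_like.rows)
```

Under these assumptions the keyed store answers point lookups for a user's current indicators, while the append-only store retains the history for analytical queries.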
[0113] In the above technical solution, step 4 specifically includes the following steps:
[0114] Step 4.1: A consumer thread is started to pull data from Kafka and store it in the Handover's next object;
[0115] Step 4.2: In a loop, repeatedly take the next object from the Handover;
[0116] Step 4.3: Record the current offset and update it to the thread pool. This is used to initialize the worker consumption settings and save the consumption information of all topics.
[0117] Step 4.4: Write the consumption progress data from the consumption information into a temporary object;
[0118] Step 4.5: Submit the offset data of the current batch checkpoint recorded in the temporary object to KAFKA;
[0119] Step 4.6: The Kafka message is transformed and converted into an object stream according to the specified type;
[0120] Step 4.7: Register the object stream as a temporary table;
[0121] Step 4.8: Concatenate multiple INSERT INTO statements to store the data from the temporary table into the system's basic data table.
[0122] In the above technical solution, the incremental data statistics in step 6 are as follows:
[0123] Step 6.1: Directly retrieve data from the basic data table, add a timestamp condition, and retrieve data from the most recent ten minutes;
Step 6.2: Configure scheduling tasks to initialize user profile-related tasks every 10 minutes and process newly synchronized data.
[0124] The present invention also provides a real-time user profile data supply device, comprising the following modules:
[0125] Basic Data Table Module: Creates the basic data tables for the profiling system and stores data synchronized from the upstream system;
[0126] Data transmission module: After creating the basic data tables, data needs to be synchronized from the upstream system. The OGG source server is deployed on the upstream system. The OGG source server and the upstream database run together. There are two processes running on the OGG source server: the E process (Extract) and the P process (pump). The E process is responsible for reading the Oracle archive logs and writing the database updates to the trail file (the trail file can be understood as a file format for data transmission unique to OGG). The P process is responsible for monitoring changes in the local trail file and pushing the updates to the trail file directory on the target end through the TCP/IP protocol.
[0127] Data update module: The target end is deployed on a server in the same network segment as the OGG source end. The R process (REPLICAT) running on the target end is the most important component. The main function of the R process is to listen for local trail file updates and then send the updates to KAFKA.
[0128] KAFKA Connector: Build a connector in FLINK to connect to KAFKA, facilitating the access of KAFKA data;
[0129] Data writing module: FLINK reads the data, generates timestamps, and inserts the data into the basic data table created by the basic data table module by defining a write stream.
[0130] Incremental statistics module: Performs incremental data statistics. The transaction data is obtained by taking the data from the most recent ten minutes based on the timestamp generated in step 5, and then obtaining the indicator data.
[0131] Data storage module: Synchronizes the indicator data obtained in step 6 to the MySQL and CLICKHOUSE databases for the user profiling system to query the data in real time.
[0132] The KAFKA connector in the above-mentioned device specifically includes the following steps:
[0133] Step 1: A consumer thread is started to pull data from Kafka and store it in the next object of Handover;
[0134] Step 2: In a loop, repeatedly take the next object from the Handover;
[0135] Step 3: Record the current offset and update it to the thread pool. This is used to initialize the worker consumption settings and save the consumption information for all topics.
[0136] Step 4: Write the consumption progress data from the consumption information into a temporary object;
[0137] Step 5: Submit the offset data of the current batch checkpoint recorded in the temporary object to KAFKA;
[0138] Step 6: The Kafka message is transformed and converted into an object stream according to the specified type;
[0139] Step 7: Register the object stream as a temporary table;
[0140] Step 8: Concatenate multiple INSERT INTO statements to store the data from the temporary table into the system's basic data table.
[0141] In the above device, the incremental data statistics are performed in step 6 of the incremental statistics module as follows:
[0142] Step 1: Retrieve data directly from the basic data table, add a timestamp condition, and retrieve data from the most recent ten minutes;
Step 2: Configure a scheduling task to initialize user profile-related tasks every 10 minutes and process the newly synchronized data.
[0143] The present invention also provides a storage medium in which a processor executes a program to implement the above-described method for real-time data supply of user profiles.
Claims
1. A method for real-time user profiling, characterized in that it includes the following steps: Step 1: Create the basic data table for the profiling system and save the data synchronized from the upstream system; Step 2: After creating the basic data tables, data needs to be synchronized from the upstream system. The OGG source server is deployed on the upstream system. The OGG source server and the upstream database run together. There are two processes running on the OGG source server: the E process and the P process. The E process is responsible for reading the Oracle archive logs and writing the database updates to the trail file. The P process is responsible for listening to changes in the local trail file and pushing the updates to the trail file directory on the target end through the TCP/IP protocol. Step 3: The target end is deployed on a server in the same network segment as the OGG source end. An R process runs on the target end. The R process listens for updates to the local trail file and then sends the updates to KAFKA. Step 4: Build a connector in FLINK to connect to KAFKA, facilitating the access of KAFKA data; Step 5: FLINK reads the data, generates a timestamp, and inserts the data into the basic data table defined in Step 1 by defining a write stream; Step 6: Perform incremental data statistics. The transaction data is obtained by taking the data from the most recent ten minutes based on the timestamp generated in Step 5, and then obtaining the indicator data. Step 7: Synchronize the indicator data obtained in Step 6 to the MySQL and CLICKHOUSE databases for the user profiling system to query the data in real time.
2. The real-time user profile data supply method according to claim 1, characterized in that: Step 4 specifically includes the following steps: Step 4.1: A consumer thread is started to pull data from Kafka and store it in the Handover's next object; Step 4.2: Loop through the next step of Handover; Step 4.3: Record the current offset and update it to the thread pool. This is used to initialize the worker consumption settings and save the consumption information of all topics. Step 4.4: Write the consumption progress data from the consumption information into a temporary object; Step 4.5: Submit the offset data of the current batch checkpoint recorded in the temporary object to KAFKA; Step 4.6: The Kafka message is transformed and converted into an object stream according to the specified type; Step 4.7: Register the object stream as a temporary table; Step 4.8: Concatenate multiple INSERT INTO statements to store the data from the temporary table into the system's basic data table.
3. The real-time user profile data supply method according to claim 1, characterized in that: The incremental data statistics in step 6 are as follows: Step 6.1: Directly retrieve data from the basic data table, add a timestamp condition, and retrieve data from the most recent ten minutes; Step 6.2: Configure scheduling tasks to initialize user profile-related tasks every 10 minutes and process newly synchronized data.
4. A real-time user profile data supply device, characterized in that it includes the following modules: Basic Data Table Module: Creates the basic data tables for the profiling system and stores data synchronized from the upstream system; Data transmission module: After creating the basic data tables, data needs to be synchronized from the upstream system. The OGG source is deployed on the upstream system. The OGG source and the upstream database run together. There are two processes running on the OGG source: the E process and the P process. The E process is responsible for reading the Oracle archive logs and writing the database updates to the trail file. The P process is responsible for listening to changes in the local trail file and pushing the updates to the trail file directory on the target end through the TCP/IP protocol. Data update module: The target end is deployed on a server in the same network segment as the OGG source end. An R process runs on the target end. The R process listens for updates to the local trail file and then sends the updates to KAFKA. KAFKA Connector: Build a connector in FLINK to connect to KAFKA, facilitating the access of KAFKA data; Data writing module: FLINK reads the data, generates timestamps, and inserts the data into the basic data table defined by the basic data table module by defining a write stream; Incremental statistics module: Performs incremental data statistics. The transaction data is obtained by taking the data from the most recent ten minutes based on the timestamp produced by the data writing module. Data storage module: Synchronizes the indicator data obtained from the incremental statistics module to the MySQL and CLICKHOUSE databases for the user profiling system to query the data in real time.
5. The real-time user profile data supply device according to claim 4, characterized in that: The specific implementation of KAFKA connectors includes the following steps: Step 1: A consumer thread is started to pull data from Kafka and store it in the next object of Handover; Step 2: Loop through the next step of Handover; Step 3: Record the current offset and update it to the thread pool. This is used to initialize the worker consumption settings and save the consumption information for all topics. Step 4: Write the consumption progress data from the consumption information into a temporary object; Step 5: Submit the offset data of the current batch checkpoint recorded in the temporary object to KAFKA; Step 6: The Kafka message is transformed and converted into an object stream according to the specified type; Step 7: Register the object stream as a temporary table; Step 8: Concatenate multiple INSERT INTO statements to store the data from the temporary table into the system's basic data table.
6. The real-time user profile data supply device according to claim 4, characterized in that: The incremental statistics module performs incremental data statistics as follows: Retrieve data directly from the basic data table, add a timestamp condition, and retrieve data from the most recent ten minutes; Configure and schedule tasks to initialize user profile-related tasks every 10 minutes and process newly synchronized data.
7. A storage medium, characterized in that a processor executes the program stored in the storage medium to implement the real-time user profile data supply method according to any one of claims 1 to 3.