Data set multiplexer for a data processing system

By generating and updating entries in the dataset catalog table using a dataset multiplexer, the problem of managing datasets in dynamic environments in data processing systems is solved, enabling efficient data analysis that can adapt to changes in data storage devices without modifying the application.

CN117015769B · Active · Publication Date: 2026-06-12 · AB INITIO TECHNOLOGY LLC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
AB INITIO TECHNOLOGY LLC
Filing Date
2022-01-31
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing data processing systems struggle to efficiently perform data analysis in dynamic environments when managing large datasets, especially when data storage devices change, requiring frequent application modifications to adapt to changes in physical dataset storage.

Method used

The dataset multiplexer generates and updates entries in the dataset catalog table, dynamically accesses and manages physical datasets, provides a format-independent interface for logical datasets, and automatically adapts to changes in data storage devices.

Benefits of technology

It enables the maintenance of high efficiency and reliability of data analysis without modifying the application when data storage devices change, reducing modification and testing costs and improving the system's flexibility and maintainability.

Generated by Eureka AI based on patent content.


Abstract

A data processing system having a dataset multiplexer enables applications to be written to specify access to a dataset as an operation on a logical dataset. During execution of an application by the data processing system, the operation to access the dataset is implemented by accessing an entry in a dataset catalog table of the logical dataset. The entry includes information for accessing a physical data source that stores the logical dataset, including a format conversion from the format of the physical data source to the format of the logical dataset. The entry in the catalog table can be created based on registration of a data source with the dataset multiplexer and can be automatically updated based on changes in storage of the dataset. Such maintenance of the catalog table can be partially or fully automated, such that the system automatically adapts to any changes in storage of the dataset without requiring modification of any applications.

Description

[0001] Cross-references to related applications

[0002] This application claims the benefit of U.S. Provisional Patent Application Serial No. 63/163,709, filed March 19, 2021, entitled “DATASET MULTIPLEXER FOR DATA PROCESSING SYSTEM,” and U.S. Provisional Patent Application Serial No. 63/143,898, filed January 31, 2021, entitled “DATASET MULTIPLEXER FOR DATA PROCESSING SYSTEM,” the entire contents of which are incorporated herein by reference.

Technical Field

[0003] Various aspects of this disclosure relate to techniques for efficiently operating data processing systems with large datasets that can be stored on any of a large number of data storage devices.

Background Technology

[0004] Modern data processing systems manage vast amounts of data within an enterprise. For example, a large organization may possess millions of datasets. This data can support multiple aspects of the enterprise's operations, making such a large collection of datasets invaluable to the enterprise. For instance, some datasets might support routine processes such as tracking customer account balances or sending account statements to customers. In other cases, processing data from one or more datasets can generate business insights, such as concluding that a requested transaction is fraudulent or that the enterprise faces a specific level of financial risk as a result of overall transactions in a particular geographic area. In still other cases, processing data from one or more datasets can generate technical insights, such as concluding that the enterprise is at risk of technical failure due to incorrect technical processes.

[0005] The physical storage of these datasets can be provided in any of a variety of ways. For example, datasets may be stored in a structured manner and managed by a database system within the enterprise. In this case, the dataset may be stored as one or more tables managed by the database. Alternatively, simple datasets can be stored in files accessible to the data processing system, such as .csv or .xml files or flat files. The computer storage devices on which the datasets (whether as files, database tables, or in some other format) reside can be physically implemented in any of a variety of forms, such as local to the data processing system, distributed throughout the enterprise, or distributed in a network cloud managed by a third party.

[0006] Enterprise architects can select physical storage for a dataset based on its expected characteristics, such as its size, required access time, retention period, and the impact of loss or corruption. Business considerations, such as storage price or reliance on third-party storage vendors, can also influence an enterprise's choice of physical storage. Therefore, data storage devices used within an enterprise to store datasets can take many different forms.

[0007] To support a wide range of functionalities, data processing systems can execute applications, whether for routine process execution or for extracting insights from datasets. Applications can be programmed to access data storage devices to read and write data.

Summary of the Invention

[0008] According to some aspects, a method executed by a data processing system enables efficient data analysis in a dynamic environment with multiple datasets by generating entries in a dataset catalog table and/or using entries in the dataset catalog table to access physical datasets in a data storage device. The data processing system can be configured to execute a data processing application programmed to access logical datasets. Each logical dataset includes a data schema independent of the format of the corresponding data in the physical dataset. The data processing system includes a dataset multiplexer configurable to provide the application with access to the physical datasets in the data storage device. The method includes creating a plurality of entries in the dataset catalog table, each entry associated with both a logical dataset and a physical dataset and having associated computer-executable instructions for accessing the physical dataset; receiving input that at least partially identifies a first logical dataset to be accessed in order to perform an operation, within the data processing application, that specifies access to a dataset; while performing the operation within the data processing application, invoking the computer-executable instructions to access the physical dataset associated with the entry in the dataset catalog table associated with the first logical dataset; and dynamically updating the entries in the dataset catalog table in response to an event indicating a change in the physical dataset associated with the logical dataset.
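
As an illustrative sketch only, the steps of this aspect (create entries, resolve a logical dataset through its entry, update the entry on a storage-change event) might be arranged as follows in Python. `CatalogEntry`, `DatasetCatalog`, `on_storage_changed`, and the sample identifiers are hypothetical names chosen for the example, and the lambdas merely stand in for the computer-executable instructions; nothing here is taken from an actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class CatalogEntry:
    """One catalog row: ties a logical dataset to a physical one."""
    logical_name: str                        # name the application uses
    physical_id: str                         # where the data currently lives
    access: Callable[[], Iterable[Record]]   # instructions for reaching it


class DatasetCatalog:
    """Hypothetical in-memory stand-in for the dataset catalog table."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.logical_name] = entry

    def read(self, logical_name: str) -> List[Record]:
        # The application names only the logical dataset; the entry's
        # access instructions are invoked on its behalf.
        return list(self._entries[logical_name].access())

    def on_storage_changed(self, logical_name: str, physical_id: str,
                           access: Callable[[], Iterable[Record]]) -> None:
        # Dynamic update: swap the physical side, keep the logical name.
        entry = self._entries[logical_name]
        entry.physical_id = physical_id
        entry.access = access


catalog = DatasetCatalog()
catalog.register(CatalogEntry(
    "customers", "file:/data/customers.csv",
    lambda: [{"id": 1, "name": "Ada"}]))
before = catalog.read("customers")

# The dataset moves to a database table; only the catalog entry changes.
catalog.on_storage_changed(
    "customers", "db:crm.customers",
    lambda: [{"id": 1, "name": "Ada"}])
after = catalog.read("customers")
```

Because the application side calls only `catalog.read("customers")`, the same call works unchanged before and after the move.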

[0009] According to one aspect, creating the plurality of entries in the dataset catalog table includes receiving information related to a first physical dataset stored in a first data storage device among the data storage devices, wherein the first physical dataset corresponds to a first logical dataset; generating a first program based on the information related to the first physical dataset, the first program including computer-executable instructions for accessing the first physical dataset from the first data storage device; and storing a link to the first program in the first entry of the dataset catalog table so that the data processing application can access the first physical dataset using the first program.

[0010] According to one aspect, generating the first program for accessing the first physical dataset from the first data storage device includes identifying the type of the first data storage device from received information; selecting a first program template for the type of the first data storage device; and filling the first program template with one or more values of one or more parameters of the first program template to generate the first program.
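
The three steps of this aspect (identify the device type, select a template, fill in its parameters) can be sketched with Python's standard `string.Template`. This is a minimal illustration under stated assumptions: the template texts, the `PROGRAM_TEMPLATES` table, the `generate_program` function, and the shape of the registration `info` dictionary are all hypothetical, and a generated "program" is reduced here to a line of text.

```python
from string import Template

# Hypothetical templates, keyed by storage-device type.
PROGRAM_TEMPLATES = {
    "csv_file": Template("read_csv(path='$path', delimiter='$delimiter')"),
    "db_table": Template("SELECT * FROM $schema.$table"),
}


def generate_program(info: dict) -> str:
    """Build the text of an access program from registration info."""
    device_type = info["device_type"]            # 1. identify the type
    template = PROGRAM_TEMPLATES[device_type]    # 2. select its template
    return template.substitute(info["params"])   # 3. fill in parameter values


program = generate_program({
    "device_type": "db_table",
    "params": {"schema": "crm", "table": "customers"},
})
```

`Template.substitute` raises `KeyError` when a parameter value is missing, which suits a registration step that should fail rather than emit a half-filled program.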

[0011] According to one aspect, receiving input that at least partially identifies the first logical dataset includes providing a user interface through which the user at least partially identifies the first logical dataset.

[0012] According to one aspect, invoking these computer-executable instructions includes enabling access to the entry in the dataset catalog table associated with the first logical dataset; and enabling access, based on information within the entry, to a data storage device storing a physical dataset corresponding to the first logical dataset.

[0013] According to one aspect, dynamically updating the entries in the dataset catalog table includes detecting an event indicating a change associated with the physical dataset corresponding to the first logical dataset; and modifying the entry in the dataset catalog table associated with the first logical dataset based on the detection of the event.

[0014] According to one aspect, modifying the entry in the dataset catalog table includes modifying the computer-executable instructions used to access the physical dataset corresponding to the first logical dataset.

[0015] According to some aspects, a method executed by a data processing system is provided for achieving efficient data analysis in a dynamic environment with multiple datasets by registering datasets in a dataset catalog table to facilitate access to multiple physical datasets in a data storage device. The data processing system can operate with the multiple physical datasets stored in these data storage devices. The data processing system includes a dataset multiplexer configurable to provide an application with access to a physical dataset among the multiple physical datasets stored in the data storage devices. The physical dataset corresponds to a logical dataset, which includes a data schema independent of the format of the corresponding data in the physical dataset. The method includes receiving information associated with a first physical dataset among the multiple physical datasets stored in a first data storage device, wherein the first physical dataset corresponds to a first logical dataset; generating a first program based on the information associated with the first physical dataset, including computer-executable instructions for accessing the first physical dataset from the first data storage device; and storing a link to the first program in a first object in an object library such that the application can access the first physical dataset using the first program.

[0016] According to one aspect, the method includes determining whether to modify the first program used to access the first physical dataset based on detecting an event indicating a change associated with the first physical dataset.

[0017] According to one aspect, the method includes, based on determining to modify the first program: generating a modified first program; and replacing the first program with the modified first program as the target of the link.

[0018] According to one aspect, generating the modified first program includes generating the modified first program without modifying the application or the first logical dataset.
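
The link indirection in the preceding aspects can be pictured with a small sketch: the catalog object holds only a link, and a regenerated program replaces the old one as the target of that link. The `programs` store, the `catalog_object` dictionary, and the link name `prog/orders` are hypothetical, and plain Python callables stand in for generated programs.

```python
from typing import Callable, Dict, List

# Program store: link name -> executable access program.
programs: Dict[str, Callable[[], List[dict]]] = {}

# The catalog object holds only a link, never the program itself.
catalog_object = {"logical_id": "orders", "program_link": "prog/orders"}


def run_application() -> List[dict]:
    """Application code: knows the logical dataset and nothing else."""
    link = catalog_object["program_link"]
    return programs[link]()


programs["prog/orders"] = lambda: [{"order_id": 1, "source": "file"}]
first = run_application()

# A change event leads to a regenerated program, which replaces the old
# one as the target of the same link. Neither the application nor the
# logical dataset is modified.
programs["prog/orders"] = lambda: [{"order_id": 1, "source": "database"}]
second = run_application()
```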

[0019] According to one aspect, the information associated with the first physical dataset includes information about the type of the first data storage device.

[0020] According to one aspect, the dataset multiplexer includes an object library storing information for accessing the plurality of physical datasets, and a first object in the object library includes an identifier of the first physical dataset.

[0021] According to one aspect, the dataset multiplexer further includes an API, and the method further includes enabling the application to access the first object through the API.

[0022] According to one aspect, the method further includes: assigning identifiers to objects in the library based on the schema and logical name of the corresponding logical dataset about which information is stored in the objects.

[0023] According to one aspect, the method further includes: receiving a command for registering the first physical dataset in a dataset catalog table; and generating the first object and storing it in the library based on the received command.

[0024] According to one aspect, the identifier of the first physical dataset is a physical identifier.

[0025] According to one aspect, the first object further includes a second identifier, and the second identifier is a logical identifier of a logical dataset associated with the first object.

[0026] According to one aspect, the method further includes: in response to detecting an event indicating that the first physical dataset has been changed from being stored in the first data storage device to being stored in the second data storage device, modifying the physical identifier in the first object without modifying the logical identifier.
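
A minimal sketch of this aspect, assuming an object that carries exactly two identifiers. The `CatalogObject` class, the `handle_move_event` function, and the storage URIs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogObject:
    logical_id: str    # stable name that applications refer to
    physical_id: str   # current storage location; may change over time


def handle_move_event(obj: CatalogObject, new_physical_id: str) -> None:
    """React to an event reporting that the dataset moved to new storage."""
    # Only the physical identifier is rewritten; logical_id is left alone.
    obj.physical_id = new_physical_id


obj = CatalogObject("sales.orders", "s3://bucket-a/orders.parquet")
handle_move_event(obj, "s3://bucket-b/orders.parquet")
```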

[0027] According to one aspect, the first object includes parameter values accessed during the execution of the first program; and the method further includes: modifying the value of a parameter stored in the first object based on detecting an event indicating a change in the parameter value accessed in the first program.

[0028] According to one aspect, the first program includes access logic and conversion logic, and when the application is executed, the access logic and conversion logic of the first program are executed to provide access to the first physical dataset and to convert between the format used in the first physical dataset and the format used in the first logical dataset.
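
To make the split between access logic and conversion logic concrete, the following sketch reads a small CSV payload (the physical format) and maps each record onto a differently named and typed logical schema. The column names, the cents-to-units conversion, and the function names are assumptions made for the example, and an in-memory string stands in for the data storage device.

```python
import csv
import io
from typing import Dict, Iterator, List

# Hypothetical physical dataset: a CSV export with its own column names
# and string-typed values.
PHYSICAL_CSV = "cust_no,bal_cents\n17,12550\n42,990\n"


def access_logic() -> Iterator[Dict[str, str]]:
    """Access logic: reach the physical data and yield raw records."""
    yield from csv.DictReader(io.StringIO(PHYSICAL_CSV))


def conversion_logic(raw: Dict[str, str]) -> Dict[str, object]:
    """Conversion logic: map a physical record onto the logical schema."""
    return {
        "customer_id": int(raw["cust_no"]),
        "balance": int(raw["bal_cents"]) / 100,
    }


def first_program() -> List[Dict[str, object]]:
    """The generated program runs both parts when the application reads."""
    return [conversion_logic(raw) for raw in access_logic()]


logical_records = first_program()
```

Keeping the two parts separate means a change of storage would, in this sketch, touch only `access_logic`, while a change of physical record layout would touch only `conversion_logic`.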

[0029] According to one aspect, the first program includes one or more parameters that affect the operation of the first program, such that the values of the one or more parameters affect access to the first physical dataset via the first program.

[0030] According to one aspect, the application is configured to provide the values of one or more parameters for use when the first program is invoked.

[0031] According to one aspect, the method further includes generating the first program by: detecting the type of the first data storage device; and selecting a template from a plurality of templates based on the detected type.

[0032] According to one aspect, the first program includes a first portion configured for read access to the first data storage device and a second portion configured for write access to the first data storage device.

[0033] According to one aspect, the first program is configured to include an executable data flow graph containing logic for accessing the first physical dataset.

[0034] According to several aspects, a method executed by a data processing system is provided for enabling an application to access multiple physical datasets in multiple data storage devices by using entries in a dataset catalog table, thereby achieving efficient data analysis in a dynamic environment with multiple datasets. The data processing system operates in conjunction with the application and the multiple physical datasets stored in the multiple data storage devices. The application is programmed to access logical datasets, which include data schemas independent of the format of corresponding data in the physical datasets. The method includes providing a user interface through which a user at least partially identifies the logical datasets to be accessed in the application; executing the application; and, when performing operations involving access to the identified logical datasets: enabling access to objects in an object library associated with the logical datasets; and enabling access to data storage devices storing the physical datasets corresponding to the identified logical datasets based on information within the objects.

[0035] According to one aspect, the method further includes: updating information in the object based on events associated with the storage of data corresponding to the identified logical dataset.

[0036] According to one aspect, the information in the object includes an executable program for accessing the physical dataset.

[0037] According to one aspect, the executable program used to access the physical dataset encodes the logic for converting data between the format used in the physical dataset and the format used in the logical dataset.

[0038] According to one aspect, the object is an executable program used to access the physical dataset.

[0039] According to one aspect, the information in the object includes the type of the data storage device.

[0040] According to one aspect, the information in the object includes the record format or schema associated with the physical dataset.

[0041] According to one aspect, the information in the object includes one or more parameters specifying how to access the physical dataset, and the one or more parameters include at least one parameter indicating whether the data in the physical dataset is compressed.

[0042] According to one aspect, the information in the object includes one or more parameters specifying how to access the physical dataset, and the one or more parameters include at least one parameter indicating the type of access.

[0043] According to one aspect, the type of access includes an indication of whether it is a read access or a write access.

[0044] According to one aspect, the type of access includes an indication of whether access is via a fast connection or a slow connection.
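
The access parameters named in the preceding aspects (compression, read versus write, fast versus slow connection) could be held in the object as a small record. The sketch below is hypothetical: `AccessParams` and `read_bytes` are illustrative names, gzip is assumed as the compression scheme, and the `connection` field is recorded but not acted on.

```python
import gzip
from dataclasses import dataclass


@dataclass
class AccessParams:
    compressed: bool = False    # is the stored data gzip-compressed?
    mode: str = "read"          # type of access: "read" or "write"
    connection: str = "fast"    # "fast" or "slow"; unused in this sketch


def read_bytes(stored: bytes, params: AccessParams) -> bytes:
    """Return the dataset bytes, decompressing when the parameters say so."""
    if params.mode != "read":
        raise ValueError("this sketch only covers read access")
    return gzip.decompress(stored) if params.compressed else stored


payload = b"id,name\n1,Ada\n"
plain = read_bytes(payload, AccessParams())
unzipped = read_bytes(gzip.compress(payload), AccessParams(compressed=True))
```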

[0045] According to one aspect, the data processing system includes a repository of metadata associated with the logical dataset; and the user interface includes a menu that presents the logical dataset based on the metadata in the repository.

[0046] According to some aspects, a method executed by a data processing system enables efficient data analysis in a dynamic environment with multiple datasets by generating entries in a dataset catalog table to access physical datasets in a data storage device. The data processing system is configured to execute data processing applications programmed to access logical datasets. Each logical dataset includes a data schema independent of the format of the corresponding data in the physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to the physical datasets in the data storage device. The method includes receiving information relating to a first physical dataset stored in a first data storage device, wherein an application is programmed to access a first logical dataset, and wherein the first physical dataset corresponds to the first logical dataset; generating a first program for accessing the first physical dataset from the first data storage device based on the received information, wherein generating the first program includes: identifying the type of the first data storage device from the received information; selecting a first program template for the type of the first data storage device; and filling the first program template with one or more values of one or more parameters of the first program template to generate the first program; and storing in an object information for invoking the execution of the first program from within the application programmed to access the first logical dataset.

[0047] According to one aspect, filling the first program template includes automatically discovering one or more values of one or more first parameters of the first program template based on information related to the first physical dataset.

[0048] According to one aspect, the one or more first parameters include information about the record format or schema associated with the first physical dataset.

[0049] According to one aspect, storing information in the object for invoking the execution of the first program from within an application programmed to access the first logical dataset includes storing an identifier of the first data storage device.

[0050] According to one aspect, storing information in the object for invoking the execution of the first program from within an application programmed to access the first logical dataset includes storing a logical identifier of the first logical dataset.

[0051] According to one aspect, generating the first program further includes: obtaining information about one or more second parameters of the first program template, wherein the one or more second parameters are different from the one or more first parameters.

[0052] According to one aspect, the one or more second parameters specify the manner of accessing the first physical dataset.

[0053] According to one aspect, generating the first program further includes: determining whether a program template is available for the type of the first data storage device; and selecting an available template as the first program template based on the determination that the first program template is available for the type of the first data storage device.

[0054] According to one aspect, the method includes, based on determining that no program template is available for the type of the first data storage device: creating a program structure based on user input; and generating a first program for accessing the first data storage device based on the created program structure.
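
The two branches described in the preceding aspects (use an available template, or fall back to a structure built from user input) can be sketched as follows. The `TEMPLATES` table, the `generate` function, and the `user_structure` argument are hypothetical, and the user-supplied structure is assumed to use the same `$name` placeholder syntax as the stored templates.

```python
from string import Template
from typing import Optional

# Hypothetical template library with a single known device type.
TEMPLATES = {"db_table": Template("SELECT * FROM $table")}


def generate(device_type: str, params: dict,
             user_structure: Optional[str] = None) -> str:
    """Generate an access program, falling back to a user-built structure."""
    template = TEMPLATES.get(device_type)
    if template is None:
        # No template is available for this device type.
        if user_structure is None:
            raise LookupError(
                f"no template for {device_type!r}; user input required")
        template = Template(user_structure)
    return template.substitute(params)


from_template = generate("db_table", {"table": "orders"})
from_user = generate("message_queue", {"topic": "orders"},
                     user_structure="consume(topic='$topic')")
```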

[0055] According to one aspect, the method includes receiving information relating to a second physical dataset stored in a second data storage device among the data storage devices; and generating a second program for accessing the second physical dataset from the second data storage device based on the information relating to the second physical dataset.

[0056] According to one aspect, the data processing system is configured to execute in multiple environments, each environment including an instance of the data processing system; and the object is assigned a unique identifier within the scope of each of the multiple environments and includes at least a common portion in the multiple environments.
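
One way to picture an identifier that is unique within each environment yet shares a common portion across environments is a prefixed name. The layout below (environment prefix, then a `ds/` portion derived from the logical name) and the `object_id` function are assumptions made for illustration, not a format stated in this disclosure.

```python
def object_id(environment: str, logical_name: str) -> str:
    """Compose an identifier scoped to one environment."""
    common_part = f"ds/{logical_name}"       # identical in every environment
    return f"{environment}/{common_part}"    # environment prefix scopes it


dev_id = object_id("dev", "orders")
prod_id = object_id("prod", "orders")
```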

[0057] According to some aspects, a method executed by a data processing system is provided for efficient analysis in a dynamic environment with multiple datasets by facilitating access to physical datasets in a data storage device through updating entries in a dataset catalog table. The data processing system is configured to execute a data processing application programmed to access data represented as logical datasets. Each logical dataset includes a data schema independent of the format of the corresponding data in the physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to the physical datasets in the data storage devices. The method includes receiving information associated with a first physical dataset corresponding to a first logical dataset stored in a first data storage device; generating a first program for accessing the first physical dataset from the first data storage device based on the received information; detecting an event indicating a change associated with the physical dataset corresponding to the first logical dataset; and modifying the first program for accessing the physical dataset corresponding to the first logical dataset based on the detected event.

[0058] According to one aspect, the physical dataset is a first physical dataset, and the event indicating a change associated with the physical dataset includes an event indicating a change from a first data storage device storing the first physical dataset to a second data storage device, and the method further includes: in response to detecting the event indicating a change from the first data storage device to the second data storage device, modifying the first program to access the first physical dataset from the second data storage device.

[0059] According to one aspect, the physical dataset is the first physical dataset, and the event indicating a change associated with the physical dataset includes an event indicating a change in the parameter values of the first program used to access the first physical dataset.

[0060] According to one aspect, detecting an event that indicates a change associated with the physical dataset includes detecting an event that indicates the replacement of the first physical dataset with a second physical dataset corresponding to the first logical dataset, and modifying the first program for accessing the physical dataset includes replacing the first program with a second program for accessing the second physical dataset.

[0061] According to one aspect, the data processing system is configured to invoke the first program to perform operations within an application specifying access to a first logical dataset; the data processing system is configured to execute in multiple environments, wherein a first environment includes a first instance of the data processing system and a second environment includes a second instance of the data processing system, the first data storage device and the first program are associated with the first instance of the data processing system, and the method further includes: generating a second program to perform operations within the second instance of the data processing system specifying access to the first logical dataset.

[0062] According to one aspect, the application that specifies access to the first logical dataset is executed in the second environment and accesses the second program, such that a second physical dataset is accessed in response to the application performing an operation on the first logical dataset.

[0063] According to several aspects, a method executed by a data processing system is provided for enabling an application to access multiple physical datasets across multiple data storage devices by using entries in a dataset catalog table, thereby achieving efficient data analysis in a dynamic environment with multiple datasets. The data processing system is configured to execute a data processing application programmed to access logical datasets. Each logical dataset includes a data schema independent of the format of the corresponding data in the physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to multiple physical datasets across the multiple data storage devices. The method includes performing an operation within the application to specify access to a logical dataset by: accessing a dataset catalog table to select an object associated with the logical dataset; and, based on the selected object, invoking a program configured to access a data source storing the physical dataset corresponding to the logical dataset.

[0064] According to one aspect, the method further includes: dynamically updating the objects in the dataset catalog table in response to an event indicating a change in the physical storage of the logical dataset represented by the objects in the dataset catalog table.

[0065] The various aspects described above may be used alternatively or additionally with aspects of any system, method, and/or process described herein. Further, a data processing system may be configured to operate according to a method having one or more of the foregoing aspects. Such a data processing system may include at least one computer hardware processor and at least one non-transitory computer-readable medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform such a method. Further, a non-transitory computer-readable medium may include processor-executable instructions that, when executed by at least one computer hardware processor of the data processing system, cause the at least one computer hardware processor to perform a method having one or more of the foregoing aspects. The foregoing is a non-limiting summary of the invention, which is defined by the appended claims.

Attached Figure Description

[0066] Various aspects will be described with reference to the following figures. It should be understood that the figures are not necessarily drawn to scale. Items appearing in multiple figures are indicated by the same or similar reference numerals in all the figures in which they appear.

[0067] Figure 1A This is a block diagram of an exemplary enterprise IT system with a data processing system according to the aspects of the technology described herein, the data processing system having a dataset multiplexer;

[0068] Figure 1B yes Figure 1A A block diagram of an exemplary enterprise IT system in an operational state at the first moment, during which the dataset multiplexer facilitates access between an application configured to access a logical dataset and a first data storage device storing the physical dataset corresponding to the logical dataset;

[0069] Figure 1C yes Figure 1B A block diagram of an exemplary enterprise IT system in an operational state at a second time, during which the dataset multiplexer facilitates access between an application configured to access the logical dataset and a second data storage device storing the physical dataset corresponding to the logical dataset;

[0070] Figure 2A yes Figure 1A The data processing system is instantiated in multiple instances to provide a block diagram of an exemplary enterprise IT system in multiple environments, wherein the application is executed by the first instance, and the dataset multiplexer facilitates access between the application and the first physical dataset for this purpose.

[0071] Figure 2B yes Figure 2AA block diagram of an exemplary enterprise IT system, in which the application is executed by a third instance, and a dataset multiplexer facilitates access between the application and a second physical dataset;

[0072] Figure 3A It is a diagram of a graphical development environment for an application that is written as a data flow diagram;

[0073] Figure 3B yes Figure 3A A schematic diagram of a data flow graph, wherein the input nodes of the data flow graph are configured or programmed based on a logical dataset;

[0074] Figure 3C yes Figure 3A A schematic diagram of the data flow graph is modified to access information in the dataset catalog table to enable access to the physical dataset, thereby performing operations in the application that specifies access to the logical dataset.

[0075] Figure 4 It is illustrative information that can be reflected in the object in the dataset catalog table, which provides information about the physical dataset corresponding to the logical dataset;

[0076] Figure 5A yes Figure 1A A block diagram of an exemplary enterprise IT system shows additional details of a dataset multiplexer;

[0077] Figure 5B yes Figure 1A A block diagram of an exemplary IT system illustrates the components of a data multiplexer that can be optionally used when connected to an application programming interface (API) being implemented.

[0078] Figure 6A For example Figure 1A or Figure 5A A block diagram depicting an exemplary enterprise IT system in its first operational state at the earliest possible moment;

[0079] Figure 6B yes Figure 6A A block diagram illustrating an exemplary enterprise IT system in a second state at a second time;

[0080] Figure 7 This is a block diagram illustrating the information used in a data processing system configured with a dataset multiplexer, based on some aspects of the techniques described in this article;

[0081] Figure 8 is a flowchart illustrating an exemplary method for operating a data processing system having a dataset multiplexer according to aspects of the technology described herein; and

[0082] Figure 9 is a block diagram of an illustrative computing system environment that can be used to implement some aspects of the technology described herein.

Detailed Implementation

[0083] The inventors have recognized and understood that dataset multiplexers enable efficient operation of data processing systems. In enterprises with numerous datasets that may be stored on various data storage devices, dataset multiplexers enable applications to be written based on one or more logical datasets rather than on physical datasets. If the data storage device storing the physical dataset(s) represented by the logical dataset changes, these applications written based on the logical dataset will operate correctly without modification. To support this dynamic updating of data storage devices, dataset multiplexers can maintain a catalog table of datasets, where each entry in the catalog table provides information for accessing the data storage device storing the physical dataset(s) represented by the logical dataset. For example, dataset multiplexers can enable efficient analysis in dynamic environments where the physical storage of datasets may evolve or change.

[0084] By using a dataset multiplexer, applications can be written and executed without the application's knowledge of the formats (e.g., record formats or schemas) supported by the data storage devices accessed by the application, or even the physical location of these data storage devices. Furthermore, for example, business users who understand how to extract business insights from data but lack knowledge of the physical datasets and data storage devices can write applications based on logical datasets rather than physical datasets. The dataset multiplexer can automatically provide connectivity between the application and the appropriate data storage device storing the physical dataset represented by the logical dataset, thus eliminating the need for the application and user to understand the implementation details of the data storage device.

[0085] The dataset catalog can be updated in response to events indicating changes in the storage of datasets (such as physical datasets represented by logical datasets). There is no need to modify the application and / or the logical dataset in response to these events. Because information about the data storage device holding the physical dataset that corresponds to the application's logical dataset is retrieved from the catalog at access time, the appropriate data storage device can be accessed without modifying the application to adapt to changes in data storage devices. In enterprises, this capability facilitates the migration of datasets from one storage location to another for efficient use of computer storage devices. For example, datasets may migrate from one storage location to another throughout their lifecycle, or even from one type of storage device to another. Such migrations can occur without modifying any application while maintaining correct application execution, providing reliable and efficient application execution and potentially offering significant cost savings to enterprises by avoiding the cost and downtime of modifying and retesting an application.

[0086] As a concrete example, a physical dataset can initially be stored as a file. Storing it as a file allows the use of low-cost computer storage devices. As the amount of data in the physical dataset grows or the data becomes more valuable, the physical dataset may be migrated to a database system for faster processing of large datasets or greater fault tolerance. By updating the catalog entries of the logical dataset corresponding to the physical dataset, applications written to access the logical dataset via a dataset multiplexer can continue operating without modification when the physical dataset is migrated from a file to a database system.

[0087] Catalog entries may include information for accessing the physical dataset, which can accommodate other types of changes to the storage of data associated with the logical dataset. This information may include a program that, when executed, accesses data from the data storage device and transforms it into a representation of the logical dataset. As a concrete example, the format of fields in the physical dataset used to store logical entities can be changed without affecting applications referencing those logical entities, because modifications to entries in the dataset catalog may include modifications to the program that transforms data from the data storage device into a format used in the logical dataset.

[0088] Dataset multiplexers can also facilitate application development by simplifying transitions between programming environments. For example, applications are typically developed in a development environment, tested in a test environment, and then rolled out to production. In production, the application can read from and write to one or more data storage devices containing “real-time” data used across the enterprise. In test and development environments, the application can operate on offline data storage devices that are less likely to impact the enterprise if they fail due to improper application operation. In development environments, data storage devices may be relatively small, while in test environments, they can be configured to provide robust test cases, including extreme test cases that might not appear in the current real-time data.

[0089] Regardless of the reason for expecting different datasets in different environments, each environment can have its own dataset catalog information. Instances of data processing systems providing development environments can access dataset catalog information within the development environment scope. Similarly, instances of data processing systems providing test or production environments can access dataset catalog information within their respective environment scope to access the appropriate data storage devices. In this way, applications written to access logical datasets can operate in any of these environments and automatically access the appropriate data storage devices in each environment without adapting the application to a specific environment. When the execution of an application involves operations on a logical dataset, the data processing system automatically utilizes the appropriate dataset catalog information of the appropriate environment to access the data storage device in that environment that contains the physical dataset storing the data corresponding to that logical dataset.

[0090] The value of such dataset multiplexers can be enhanced when the multiplexer is capable of automatically constructing entries for data storage devices in a dataset catalog table. For example, a dataset multiplexer can maintain a set of program templates applicable to different types of data storage devices. When a data storage device is registered with the dataset multiplexer, the multiplexer can detect the type of the data storage device and select an appropriate template. The program used to access that data storage device can be constructed by populating the selected template with parameter values detected by analyzing that data storage device. Alternatively or additionally, some or all of these parameter values can be obtained from a metadata repository maintaining the metadata of the data storage device, provided via user input, or obtained otherwise.
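
As a minimal sketch of template-based registration, the following Python fragment selects a program template by detected storage type and populates it with parameter values. The template strings, the `detect_store_type` heuristic, and every function and key name are hypothetical illustrations and are not taken from the patent.

```python
from string import Template
from typing import Optional

# Hypothetical program templates, one per type of data storage device.
PROGRAM_TEMPLATES = {
    "flat_file": Template("read_delimited(path='$path', delimiter='$delimiter')"),
    "relational_db": Template("run_query(dsn='$dsn', sql='SELECT * FROM $table')"),
}


def detect_store_type(descriptor: dict) -> str:
    """Assumed heuristic: a descriptor with a 'dsn' key is a database."""
    return "relational_db" if "dsn" in descriptor else "flat_file"


def build_access_program(descriptor: dict, overrides: Optional[dict] = None) -> str:
    """Populate the selected template with parameter values.

    Values in `overrides` (e.g. from a metadata repository or user input)
    take precedence over values detected from the descriptor.
    """
    store_type = detect_store_type(descriptor)
    params = {**descriptor, **(overrides or {})}
    # substitute() raises KeyError if a required parameter is missing.
    return PROGRAM_TEMPLATES[store_type].substitute(params)


file_program = build_access_program({"path": "/data/customers.csv", "delimiter": ","})
db_program = build_access_program({"dsn": "prod"}, overrides={"table": "customers"})
```

In this sketch a missing parameter fails loudly at registration time rather than at access time, which is one reasonable design choice among several.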

[0091] Various aspects of the data processing system can be implemented to achieve any one or more of the foregoing objectives and advantages. These objectives and advantages can be achieved individually or in any suitable combination.

[0092] Representative data processing systems with dataset multiplexers

[0093] Figure 1A is a block diagram of an IT system 100 according to some aspects of the technology described herein, which includes an illustrative data processing system 104 and a dataset multiplexer 105 integrated with the data processing system 104. For example, IT system 100 could be an IT system for an enterprise (such as a financial company). For simplicity, elements of an enterprise IT system, such as networks, cloud storage, and user devices, are not explicitly shown.

[0094] Data processing system 104 is configured to access data storage devices 102-1, 102-2, 102-3, ..., and 102-n (e.g., to read data from them and / or write data to them). Each of the data storage devices 102-1, 102-2, 102-3, ..., and 102-n can store one or more physical datasets. The data storage devices can store any suitable type of data or dataset in any suitable manner or format. The data storage devices can store data using, for example, a database system (e.g., a relational database system), or as flat text files, spreadsheets, etc. Furthermore, these data storage devices can be located either within or outside the enterprise. For example, external data storage devices can be located in the "cloud" or otherwise in storage hardware managed by a third party. Therefore, data storage devices can provide a federated environment where different data storage devices used by an enterprise can be located in different locations and / or managed by different entities inside or outside the enterprise.

[0095] In some cases, data storage devices can store transactional data. For example, a data storage device can store credit card transactions, telephone record data, or bank transaction data. However, the technology described herein can be applied to data storage devices that store other types of data used in an enterprise. It should be understood that the data processing system 104 can be configured to access any suitable number and type of data storage devices, as the aspects of the technology described herein are not limited in this respect. A data storage device from which the data processing system 104 can be configured to read data can be referred to as a data source. A data storage device to which the data processing system 104 can be configured to write data can be referred to as a data sink.

[0096] Each data storage device may be implemented using one or more storage units and may include data management software or other control mechanisms to support the storage of physical datasets in one or more formats of any suitable type. The storage unit(s) may be of any suitable type and may include, for example, one or more servers, one or more disk arrays, one or more disk array clusters, one or more portable storage units, one or more non-volatile storage units, one or more volatile storage units, and / or any other device(s) configured to electronically store data. In embodiments where the data storage device includes multiple storage units, these storage units may be located together in one physical location (e.g., in a building) or distributed across multiple physical locations (e.g., in multiple buildings, in different cities, states, or countries). These storage units may be configured to communicate with each other using one or more networks of any suitable type, as the aspects of the technology described herein are not limited in this respect.

[0097] Data management software organizes data in physical storage and provides mechanisms for accessing the data so that data can be written to or read from physical storage. Data management software can be, for example, a database system or a file management system. Depending on the type of data management software, the data storage device(s) can use one or more formats to store physical datasets, such as database tables, spreadsheet files, flat text files, and / or files in any other suitable format (e.g., a mainframe native format). Data storage devices 102-1, 102-2, 102-3, ..., 102-n can be of the same type (e.g., all can be relational databases) or different types (e.g., one can be a relational database, while another can be a data storage device storing data as flat files). When data storage devices are of different types, the storage environment can be referred to as a heterogeneous or federated data environment 102. Data storage devices can be, for example, SQL Server databases, Oracle databases, TERADATA databases, flat files, multi-file data storage devices, Hadoop distributed databases, DB2 data storage devices, Microsoft SQL Server data storage devices, INFORMIX data storage devices, tables, collections of tables, or other sub-parts of a database, and / or any other suitable type of data storage device, as the various aspects of the technology described herein are not limited in this respect.

[0098] Data processing system 104 supports a wide variety of applications 106 to perform functions such as accessing (e.g., read access and / or write access) physical datasets stored in data storage devices 102-1, 102-2, 102-3, ..., and 102-n. Applications 106 can then perform operations based on the data in the data storage devices. Data processing system 104 can support applications 106-1, 106-2, 106-3, ..., and 106-n, which can be of the same or different types. In some cases, an application, when executed, can read or write transactional data from one or more physical datasets in a data storage device. In other cases, an application, when executed, can read or write data from physical datasets stored in different data storage devices and analyze the data to extract business insights from the datasets.

[0099] For example, as shown in Figure 3A, application 106 can be developed as a data flow graph. A data flow graph can include components (referred to as “nodes” or “vertices”) representing data processing operations to be performed on data, and links between components representing data flows. Techniques for performing computations encoded by a data flow graph are described in U.S. Patent No. 5,966,072 entitled “Executing Computations Expressed as Graphs,” the entire contents of which are incorporated herein by reference. An environment for developing applications (e.g., computer programs) as data flow graphs is described in U.S. Patent Publication No. 2007/0011668 entitled “Managing Parameters for Graph-Based Applications,” the entire contents of which are incorporated herein by reference. A data flow graph can include data sources (e.g., input data storage devices 302 or 304 of Figure 3A) and data sinks (e.g., output data storage device 314 of Figure 3A). The data source and data sink nodes in the data flow graph represent access to the data storage devices 102-1, 102-2, 102-3, ..., or 102-n.

[0100] However, the application itself does not need to be programmed with any specific data storage device. Application 106 can be programmed based on logical datasets, rather than being hard-coded to access a single physical dataset. A logical dataset can refer to a logical representation of one or more datasets. Data processing system 104 can store the definitions of multiple logical datasets, as well as other metadata about these logical datasets. This information can be managed, for example, by a metadata management module (e.g., metadata management module 526 of Figure 5A). Tools used with the data processing system 104 can access metadata about the logical dataset and perform functions based on that metadata. For example, a program development environment can provide a user interface through which available logical datasets can be selected and used for application programming.

[0101] A logical dataset can have a schema defined independently of the corresponding data in the physical dataset / data storage device. For example, a logical dataset can have a schema defining logical entities within it. These logical entities can be recognizable and / or understandable to human users. For instance, a logical dataset might include logical entities such as customer names. In the corresponding physical dataset, a customer name might be stored as three fields in a row of a data table, storing data corresponding to the customer's first name, the first letter of their middle name, and their last name. However, a logical dataset can simply include the logical entity Customer_Name, regardless of the data format in physical storage.
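
The customer-name example can be sketched as a small conversion function. This is only an illustration: the physical field names `first_name`, `middle_initial`, and `last_name` are hypothetical, and only the logical entity name Customer_Name comes from the text.

```python
def to_customer_name(row: dict) -> str:
    """Combine three physical fields into the logical entity Customer_Name."""
    parts = [row["first_name"], row.get("middle_initial", ""), row["last_name"]]
    # Skip an empty middle initial so that no double space appears.
    return " ".join(part for part in parts if part)


# One row of the (hypothetical) physical data table.
physical_row = {"first_name": "Ada", "middle_initial": "K", "last_name": "Lovelace"}

# The application sees only the logical entity.
logical_record = {"Customer_Name": to_customer_name(physical_row)}
```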

[0102] Data processing system 104 may include an interface (not shown) through which a schema for a logical dataset can be defined. For example, the interface may be a user interface through which a user can specify a logical dataset or otherwise import it into the system by specifying a schema for the logical dataset. In some embodiments, data processing system 104 may store a set of logical entities commonly used in business operations. Examples of commonly used logical entities may include one or more of name, identification number, telephone number, address, nationality, account balance, transaction amount, or date. These business terms may be used to at least partially specify the schema for the logical dataset. However, instead of predefined logical entities, or in addition to predefined logical entities, the schema may be defined to include other logical entities.

[0103] Programming applications based on logical datasets avoids the need for application developers to understand the format of the data storage devices that store the corresponding physical datasets. Therefore, data analysts can develop applications using logical datasets even if they do not understand the data format within the data storage devices that hold the physical datasets.

[0104] As a more detailed example, within an enterprise, programmers can define a logical dataset for storing new customers. The schema of this logical dataset can include logical entities such as customer name, customer address, customer identifier, and customer acquisition date. Data analysts can write applications based on this logical dataset and these logical entities, regardless of the storage format of the physical dataset corresponding to the logical dataset. Therefore, data analysts can write applications without knowing the physical dataset containing the stored data that the application will access.

[0105] When the application is executed, the data in the physical dataset corresponding to the logical dataset can be stored in one or more of data storage devices 102-1, 102-2, 102-3, ..., and 102-n. To execute the application, each operation specifying access to the logical dataset can be performed by the data processing system 104 by reading or writing data from the corresponding physical dataset stored in one of the data storage devices 102-1, 102-2, 102-3, ..., and 102-n. According to some aspects, the dataset multiplexer 105 can automate the execution of such operations by automatically accessing the corresponding physical dataset. This access may include conversion between the data format stored in the physical data storage device and the format specified in the schema of the logical dataset. As another example, the conversion may result in associating data from the physical dataset with metadata already associated with the logical dataset. As a specific example, the conversion may associate a field from the physical dataset with a field in the logical dataset that is marked with an indication that personally identifiable information is stored. Therefore, in this example, the metadata can be used for operations on the data from the physical dataset, such as filtering or masking personally identifiable information.
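
A minimal sketch of the metadata association described above: once physical fields are mapped to logical fields, a flag carried by the logical field can drive masking. The schema layout, the field mapping, and the `pii` flag name are assumptions made for illustration.

```python
# Hypothetical logical schema: each logical field carries metadata flags.
LOGICAL_SCHEMA = {
    "Customer_Name": {"pii": True},
    "Account_Balance": {"pii": False},
}

# Hypothetical mapping from physical field names to logical field names.
FIELD_MAP = {"cust_nm": "Customer_Name", "bal": "Account_Balance"}


def convert(physical_row: dict) -> dict:
    """Rename physical fields to the logical fields they correspond to."""
    return {FIELD_MAP[name]: value for name, value in physical_row.items()}


def mask_pii(logical_row: dict) -> dict:
    """Mask every field whose logical metadata marks it as PII."""
    return {
        name: "***" if LOGICAL_SCHEMA[name]["pii"] else value
        for name, value in logical_row.items()
    }


masked = mask_pii(convert({"cust_nm": "Ada Lovelace", "bal": 125.50}))
```

The masking step never inspects the physical field names, so it keeps working if the physical schema changes and only `FIELD_MAP` is updated.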

[0106] As shown in Figure 1A, the data processing system 104 includes a dataset multiplexer 105 for automating access to corresponding physical datasets and converting between the formats of logical datasets and physical datasets. The dataset multiplexer 105 can maintain a dataset catalog table 107, where each entry in the catalog table corresponds to a logical dataset and provides information for accessing one or more physical datasets. For example, a catalog table entry can identify a physical dataset in a data storage device 102-1, 102-2, 102-3, ..., or 102-n corresponding to a logical dataset. Catalog table entries may alternatively or additionally include information for converting data stored in the physical datasets into the format of the logical datasets. This information may be or may include an executable program. For example, catalog table information may identify a program for converting data in multiple fields of a physical dataset into the format of a corresponding logical entity in the logical dataset. Other information may alternatively or additionally be stored as or reflected in the catalog table information used to access the one or more physical datasets.

[0107] Dataset multiplexer 105 enables application 106 to seamlessly access physical dataset(s), based on the logical dataset(s) with which it is programmed, using information from the dataset catalog table. Figure 1B illustrates an application (e.g., application 106-3) programmed to access data based on a logical dataset. When performing operations to access (e.g., read and / or write) the logical dataset, the dataset multiplexer 105 of the data processing system 104 enables access to the corresponding physical dataset in a data storage device (e.g., data storage device 102-1). For example, when the catalog information stored for the logical dataset is or includes an access control program, that program can be executed. Therefore, even though application 106-3 is programmed based on a logical dataset, the physical dataset stored in data storage device 102-1 is accessed when data access operations are performed.

[0108] Dataset multiplexer 105 can access its dataset catalog table to select the entry associated with the logical dataset referenced in application 106-3. Information used to identify the physical dataset stored in data storage device 102-1 and / or to convert data in the format of data storage device 102-1 into the format of the logical dataset can then be used for data access.

[0109] In some cases, this access can be dynamic. Catalog information can be used at the time an operation requiring data access is performed within the application. Entries in the dataset catalog associated with logical datasets can be updated in response to events indicating changes in the storage of information associated with the logical dataset. Accessing physical data storage devices via catalog information ensures that the application continues to function even though changes may be made at any point in the IT system 100, and even if the data analyst or other user who wrote application 106-3 is unaware of these changes.

[0110] For example, a physical dataset can be migrated from data storage device 102-1 to data storage device 102-n. Neither the application nor the logical dataset with which it is programmed needs to be modified to account for this change. Once the catalog entry of the logical dataset has been updated, the dataset multiplexer 105 can automatically use the updated catalog information to provide application 106-3 with access to the correct physical dataset, regardless of the data storage device on which it resides.
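
The migration scenario can be sketched as follows. The `CatalogEntry` and `DatasetMultiplexer` classes, their methods, and the logical dataset name `new_customers` are hypothetical stand-ins for illustration, and in-memory lists replace real data storage devices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry: where the data lives and how to read it."""
    store_id: str
    read: Callable[[], List[dict]]  # returns records in the logical format


class DatasetMultiplexer:
    def __init__(self) -> None:
        self._catalog: Dict[str, CatalogEntry] = {}

    def register(self, logical_name: str, entry: CatalogEntry) -> None:
        """Create or replace the entry for a logical dataset."""
        self._catalog[logical_name] = entry

    def read(self, logical_name: str) -> List[dict]:
        """Resolve the entry at access time, then run its access program."""
        return self._catalog[logical_name].read()


def application(mux: DatasetMultiplexer) -> int:
    """Application code refers only to the logical dataset name."""
    return len(mux.read("new_customers"))


mux = DatasetMultiplexer()
# Stand-in records; both stores are assumed to hold the same data.
records = [{"Customer_Name": "Ada Lovelace"}, {"Customer_Name": "Alan Turing"}]
mux.register("new_customers", CatalogEntry("102-1", lambda: list(records)))
before = application(mux)

# Migration event: only the catalog entry changes, not the application.
mux.register("new_customers", CatalogEntry("102-n", lambda: list(records)))
after = application(mux)
```

Because the entry is looked up on every call to `read`, the second run of `application` picks up the new store without any change to its code.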

[0111] Figure 1C shows application 106-3 accessing the data storage device 102-n via the dataset multiplexer 105 of the data processing system 104. The access scenarios of Figure 1B and Figure 1C can result from executing application 106-3 at different times. Because the catalog information is dynamic and changes to reflect changes in the storage of the dataset, no changes to application 106-3 are required to correctly access the expected data.

[0112] In Figure 1B, solid lines indicate the data flow from data storage device 102-1 to application 106-3 when performing operations to access (e.g., read and / or write) a logical dataset. Dashed lines indicate interactions between components that control the data flow during the operation. For example, application 106-3 may interact with dataset multiplexer 105 to obtain information from catalog entries associated with the logical dataset for accessing the physical dataset corresponding to that logical dataset. Dataset multiplexer 105 may obtain information from the corresponding physical dataset(s) in data storage device 102-1 to generate appropriate catalog entries. Similarly, in Figure 1C, solid lines indicate the data flow from data storage device 102-n to application 106-3 when performing operations to access (e.g., read and / or write) a logical dataset, while dashed lines indicate the interactions between components that can control the data flow during the operation (e.g., dataset multiplexer 105, application 106-3, and data storage device 102-n).

[0113] Dynamically using catalog information allows for correct operation even when any of a variety of other types of changes occur within IT system 100. Besides a change in which data storage device stores the physical dataset, the type of data storage device holding the dataset may also change. For example, data storage device 102-1 could be an Oracle database, while data storage device 102-n could be a SQL Server data storage device. As another example, the schema of the physical dataset can change, such as by including additional fields for name data. These changes can be automatically compensated for by modifying the transformation logic within the catalog table.

[0114] Dynamically using dataset catalog information for data access can automatically handle other types of changes. As another example, users can run different instances of a data processing system for different purposes. It might be desirable for the same application to access different physical datasets when running in different instances. Such execution can be ensured by providing different catalog information in different instances or, alternatively or additionally, in other contexts in which the application is expected to access different physical datasets corresponding to the same logical dataset.

[0115] Figure 2A illustrates aspects of the technology described herein, in which an application (e.g., application 106-2) accesses one or more physical datasets in a data storage device (e.g., data storage device 102-2) via a dataset multiplexer of an instance of a data processing system (e.g., instance 104-1 of data processing system 104). In the environment created by instance 104-1, access to the logical dataset is resolved to a dataset in data storage device 102-2. The same application executing in a different environment created by a different instance 104-n of the data processing system can access different physical datasets. Figure 2B shows application 106-2 accessing data storage device 102-n (e.g., a database data storage device) within an environment created by instance 104-n of data processing system 104. For simplicity, separate lines illustrating the control flow between components are not shown in Figure 2A and Figure 2B. However, it should be understood that the components of the data processing system can interact to control the operations described herein.

[0116] The operation shown in Figure 2A and Figure 2B can be achieved by scoping the catalog information to each instance, so that references to the same logical dataset within each scope access the physical dataset through the catalog information of that scope. All or part of the identifier for a logical dataset can remain unchanged across scopes. As a concrete example, a logical dataset can be identified by a combination of a name and a schema, which can be the same regardless of the scope. However, the catalog information of the dataset associated with that logical identifier may differ across different instances.
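
A minimal sketch of scoped catalog information: the same logical identifier, here a (name, schema) pair, resolves to a different physical location in each environment. The environment names, the schema label, and the location strings are all hypothetical.

```python
# The logical identifier stays the same in every scope.
LOGICAL_ID = ("new_customers", "customer_schema_v1")

# Hypothetical per-environment catalogs keyed by that identifier.
SCOPED_CATALOGS = {
    "development": {LOGICAL_ID: "file:///dev/new_customers.dat"},
    "production": {LOGICAL_ID: "db://prod/crm.new_customers"},
}


def resolve(environment: str, logical_id: tuple) -> str:
    """Look up the physical location using the catalog of one scope only."""
    return SCOPED_CATALOGS[environment][logical_id]


dev_location = resolve("development", LOGICAL_ID)
prod_location = resolve("production", LOGICAL_ID)
```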

[0117] In the embodiments of Figure 2A and Figure 2B, different instances 104-1, 104-2, ..., 104-n of the data processing system 104 can be provided for different programming environments. As a specific example, an enterprise can operate the data processing system in development, testing, and production environments. The dataset used by the same application may differ in each of these environments. Real-time data used in the production environment may not be used in the development or testing environment to avoid real-time data corruption and / or minimize the risk of exposing sensitive information. Data storage devices in the production environment may be large and provide fast data access, making them very expensive. On the other hand, datasets in the development environment may be smaller and stored in low-cost data storage devices to reduce the cost of application development. Datasets in the testing environment may include data from rare operational scenarios that are not present in the real-time dataset at the time the application is tested, to ensure robust testing and comprehensive code coverage. Enabling the application to run in any environment allows efficient movement between environments (such as development, testing, and production environments) and can improve application development efficiency and the overall operational efficiency of the IT system.

[0118] Each instance of the data processing system 104 may include a dataset multiplexer that maintains a dataset catalog table for the corresponding environment. Each dataset multiplexer can access the dataset catalog table for its environment to provide access to the appropriate data storage device(s). For example, Figure 2A shows application 106-2 accessing a data storage device 102-2 (e.g., a flat file data storage device) in the development environment via an instance 104-1 of the data processing system 104. Figure 2B shows application 106-2 accessing data storage device 102-n, which may be a database, in a production environment via instance 104-n of data processing system 104.

[0119] Representative techniques for developing applications using dataset multiplexers

[0120] In some embodiments, the application executed by the data processing system may be written by a human user of the data processing system in a graphical programming language. In other embodiments, procedural languages or other types of programming languages may be used alternatively or additionally.

[0121] Figure 3A shows a graphical user interface (GUI) through which data analysts or other human users can write applications within a graphical development environment. This GUI is used herein as an example of application development. In this example, the data processing system includes a library of components that perform operations on data. Although not explicitly shown in Figure 3A for simplicity, the graphical development environment may include toolbars or other user interface elements through which users can select components from the library. Users can also specify connections between these components to form a graph. For example, a component may specify an operation for transforming data or may specify a data source or data sink to access. Components may be represented by icons with different shapes, depending on the operation performed by the component or the type of data storage device that holds the data of the data source or data sink.

[0122] Users can write applications by selecting components corresponding to desired operations and connecting them in the order specified by the desired data flow through the operations represented by the components. Each component can be configured via parameters input by the user. The values of some configuration parameters can specify various aspects of the component's operation. For example, a component representing a dataset can receive a parameter specifying whether it operates as a data source or a data sink.

[0123] In embodiments where applications are written using logical datasets, the values of some configuration parameters can specify a particular logical dataset and / or logical entities within that dataset for performing component operations. For example, a component representing a dataset can be configured to represent a specified logical dataset by providing an identifier of the logical dataset as the value of that parameter. Alternatively or additionally, components can be configured via user input that specifies logical entities used as keys in specific operations.

[0124] A data processing system may include a repository of information about logical datasets and / or logical entities that can be used to configure application components. Entries in this repository may be created by the users who wrote the application. However, in an enterprise, many people may be involved in generating and analyzing data, so the information in the repository may not have been developed by the users who developed the application. For example, logical dataset information may be created by other users, or even through automated analysis of certain physical datasets.

[0125] The user interface provided in the development environment may include user interface elements that allow users to specify logical datasets or logical entities in the repository as parameter values for configuring components of the graph. These user interface elements may include elements for users to input search queries. For example, the query may be a faceted query in which the user specifies one or more values for dimensions describing the logical dataset or logical entity. These dimensions may include, for instance, words entered in the repository describing the names of fields included within the logical dataset.

[0126] The data processing system can perform a search based on a query and return a list of options selected by the system based on the query. The user can then select a returned value to configure a component, which will subsequently operate according to that selection. For example, when a dataset component is configured as a data source (configured to output data from a logical dataset), the component will operate by providing data in the format specified for the logical dataset when the application is executed.

[0127] Applications are not required to be developed entirely by human programmers. All or part of the program can be generated in other ways, such as from templates or by machine from another programming language or pseudo-language. Regardless of how the application is developed, specifying the data the application will operate on based on one or more logical datasets makes it possible to write applications without knowing or relying on the physical storage of the data. This capability can simplify any part of the development process performed by human users, as they can specify operations involving data access based on logical datasets and / or logical entities within those datasets. For example, a data analyst can write an application without understanding the details of any particular physical dataset. Furthermore, avoiding reliance on physical storage in the application can extend the functionality of the data processing system. For example, the application can be written even if the programmer does not know or has not yet determined the details of the physical dataset that will exist when the application is executed.

[0128] As a further simplification, the data processing system can be configured to perform operations specified based on a logical dataset or logical entities within a logical dataset. These operations can be specified to be performed within the application and may then be performed on data accessed in the physical dataset corresponding to the logical dataset.

[0129] For example, a logical entity can be associated with an enterprise-wide list of valid values, and that list can be changed at the enterprise level without requiring changes to every application that accesses that logical entity. As a concrete example, a gender logical entity can be defined within a data processing system. At some point, the metadata associated with this logical entity might indicate that the allowed values are M and F. Later, the allowed values could be changed to M, F, and X. Every application written based on this logical entity can automatically adapt to the changed list, regardless of which physical dataset stores the gender information. This is advantageous because, for example, indicating an "X" value as a new allowed value in the metadata can automatically affect all applications that use the gender logical entity.
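
As a minimal sketch of this behavior, assuming a hypothetical in-memory metadata repository (the names `LogicalEntity`, `ENTITY_REPOSITORY`, and `is_allowed` are illustrative and do not come from the described system), the list of allowed values can be kept with the logical entity and consulted by each application when it runs:

```python
from dataclasses import dataclass, field


@dataclass
class LogicalEntity:
    """Hypothetical metadata record for a logical entity."""
    name: str
    allowed_values: set = field(default_factory=set)


# Hypothetical enterprise-wide repository keyed by logical entity name.
ENTITY_REPOSITORY = {"gender": LogicalEntity("gender", {"M", "F"})}


def is_allowed(entity_name: str, value: str) -> bool:
    """Check a value against the current enterprise-level list."""
    return value in ENTITY_REPOSITORY[entity_name].allowed_values


# An application written against the logical entity, not a physical dataset.
before = is_allowed("gender", "X")

# The enterprise-level metadata changes; the application code does not.
ENTITY_REPOSITORY["gender"].allowed_values.add("X")
after = is_allowed("gender", "X")
```

Because the check reads the metadata at call time, the same application code rejects "X" before the change and accepts it afterwards.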

[0130] As another example, validation rules can be specified based on logical entities and applied regardless of the physical dataset from which the data is accessed. As a concrete example, a data processing system can configure data validation rules for a logical element used for an email address. Once one or more fields in any physical dataset storing emails are identified as corresponding to the logical element used for an email address, that data validation rule can be applied to data from that physical dataset. Validation rules can be used in an application in one or more ways. For example, rules can be invoked from within the application on data from a specific physical dataset, or the application can access the results of applying these rules to a specific physical dataset, even if the application of the rules to the dataset is triggered from outside the application.
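
The following sketch illustrates one way a rule tied to a logical entity could be applied to fields of different physical datasets. The registry names, dataset names, and field names are hypothetical, and the regular expression is deliberately simple rather than a complete email validator:

```python
import re
from typing import Callable, Dict, Iterable, List

# Hypothetical rule registry keyed by logical entity name.
VALIDATION_RULES: Dict[str, Callable[[str], bool]] = {
    # Simple illustrative pattern; not a full email address validator.
    "email_address": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

# Hypothetical mapping from (physical dataset, field) to logical entity.
FIELD_TO_ENTITY = {
    ("crm_contacts", "EMAIL_ADDR"): "email_address",
    ("web_signups", "mail"): "email_address",
}


def invalid_rows(dataset: str, rows: Iterable[dict]) -> List[dict]:
    """Return rows in which any mapped field fails its logical entity's rule."""
    bad = []
    for row in rows:
        for field_name, value in row.items():
            entity = FIELD_TO_ENTITY.get((dataset, field_name))
            if entity and not VALIDATION_RULES[entity](value):
                bad.append(row)
                break
    return bad


crm_bad = invalid_rows(
    "crm_contacts",
    [{"EMAIL_ADDR": "a@example.com"}, {"EMAIL_ADDR": "not-an-email"}],
)
web_bad = invalid_rows("web_signups", [{"mail": "b@example"}])
```

The rule is written once against the logical entity; each physical dataset only contributes a field mapping.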

[0131] As another example, components that perform masking or filtering operations can be specified based on logical entities and / or metadata about those logical entities, and can operate within the application regardless of which physical data storage device the data being processed is pulled from. As a concrete example, logical entities acting as identifiers of a person can be assigned privacy levels. Logical entities can be defined for multiple identifiers of a person, such as email addresses and Social Security numbers. The metadata associated with these logical entities can assign a medium privacy level to email addresses, but a high privacy level to Social Security numbers. Filtering or masking components specified based on logical entities can be configured to omit from their output records having values in fields associated with privacy levels above a threshold, or to obfuscate the values of those fields. When these operations are performed on physical datasets with fields corresponding to email addresses or Social Security numbers, they can be performed based on privacy levels. Defining logical datasets and associated metadata (such as privacy levels) in a repository that can be used to develop applications enables features such as these to be implemented and updated efficiently across the enterprise. This definition can also be used to enforce enterprise policies related to data access by ensuring the proper handling of physical datasets containing sensitive information (i.e., datasets including fields containing sensitive information).
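
A masking component of this kind might be sketched as follows. The `Privacy` levels, the `ENTITY_PRIVACY` table, and the field names are assumptions made for illustration only:

```python
from enum import IntEnum


class Privacy(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical metadata: privacy level assigned to each logical entity.
ENTITY_PRIVACY = {"email_address": Privacy.MEDIUM, "ssn": Privacy.HIGH}


def mask_record(record: dict, field_entities: dict, threshold: Privacy) -> dict:
    """Obfuscate fields whose logical entity has a privacy level above threshold.

    field_entities maps physical field names to logical entity names.
    Fields with no mapped logical entity are treated as LOW.
    """
    masked = {}
    for name, value in record.items():
        level = ENTITY_PRIVACY.get(field_entities.get(name), Privacy.LOW)
        masked[name] = "***" if level > threshold else value
    return masked


row = {"CUST_EMAIL": "a@example.com", "SSN_NO": "123-45-6789", "CITY": "Boston"}
mapping = {"CUST_EMAIL": "email_address", "SSN_NO": "ssn"}
result = mask_record(row, mapping, threshold=Privacy.MEDIUM)
```

With a medium threshold only the Social Security number is obfuscated; lowering the threshold to `Privacy.LOW` would also obfuscate the email address, with no change to the component itself.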

[0132] Figure 3A shows an application (e.g., application 106-3) being developed as a data flow graph via a user interface in a development environment. Here, components are represented as nodes in the graph. The data flow graph in this example includes input nodes 302, 304 for reading data from respective physical datasets and output node 314 for writing data to a respective physical dataset. An example of generating such input and output nodes based on the functionality they will provide (e.g., data sink functionality or data source functionality) is described in U.S. Patent No. 9,977,659 entitled “Managing Data Set Objects,” the entire contents of which are incorporated herein by reference. The data flow graph also includes nodes 306, 308, 310, 312 for various data processing operations (e.g., filtering, sorting, or joining operations) performed on the data read from the physical datasets. When the graph is executed by a data processing system, the results of the data processing operations are written to the physical dataset associated with output node 314.

[0133] Each input node can be configured with parameter values associated with its corresponding data source. These values indicate how to access data from the data source. Similarly, each output node can be configured with parameter values associated with its corresponding data destination. These values indicate how to write the results to the data destination.

[0134] Traditionally, applications (including an application such as the data flow graph illustrated in Figure 3A) need to be manually updated to account for changes in how data is stored. For example, if a dataset is migrated from one data storage device to another, an experienced developer will manually modify the configuration of the input and / or output nodes of the affected data flow graph. This manual update needs to be performed by an experienced developer with knowledge (e.g., programming knowledge) of the data flow graphs and data storage devices supported by the data processing system. In data processing systems supporting large datasets, changes in how data is stored occur frequently, and introducing errors during updates or neglecting to update the application for each change can lead to the propagation of errors throughout the enterprise. For example, executing a data flow graph whose input nodes are configured with incorrect or outdated parameter values associated with a data source may result in reading data from an incorrect data source or reading data in an incorrect format. Errors in the input data lead to data processing operations being performed on the erroneous data, resulting in inaccurate output. Incorrect output may be easily identifiable, such as a crashed job or a report lacking expected information. In other cases, errors are more subtle, with incorrect data being written to the physical dataset, which may be used in subsequent processing without any indication that the data has been corrupted by the error. By the time erroneous data propagates throughout an enterprise and is identified, many datasets may already be corrupted, making error detection and correction time-consuming and expensive. Furthermore, migrating from one data storage device to another is both costly and time-consuming, as it requires identifying all physical datasets affected by the change and then manually editing and testing the applications that use those datasets.

[0135] The inventors have developed a technique to avoid these problems by automatically providing access to the appropriate physical dataset, without requiring the maintenance of application / dataflow graphs to adapt to changes in data storage devices. By enabling data processing systems to adapt to changes in data storage devices, the risk of introducing errors when modifying applications is significantly reduced, thereby eliminating error propagation common in traditional systems.

[0136] Such access can be achieved through a dataset multiplexer 105 that automatically provides a connection between the application and the appropriate physical dataset. The application can be programmed based on one or more logical datasets. For example, a business user with little knowledge of physical datasets (e.g., their location or format) can write an application based on one or more logical datasets. The dataset multiplexer 105 can maintain a catalog table of datasets, where each entry in the catalog table is associated with a logical dataset and provides information for accessing the physical dataset corresponding to that logical dataset (regardless of which data storage device the physical dataset is stored on when the application executes). In response to an instruction, arising during execution of the data flow graph, that involves an operation on the logical dataset, the dataset multiplexer 105 can obtain information for accessing the physical dataset from the catalog table entry associated with the logical dataset and automatically provide a connection between the data flow graph and the physical dataset based on that information. In some embodiments, the information for accessing the physical dataset may include a program that provides access to the physical dataset. This program, when executed by the application, can access the physical dataset from the data storage device and convert it into the format of the logical dataset.
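
The lookup described above can be sketched as follows, under the assumption that the catalog table is a simple mapping from a logical dataset identifier to an access program. The class `DatasetMultiplexer`, its methods, and the two stand-in access programs are hypothetical illustrations, not the implementation of dataset multiplexer 105:

```python
from typing import Callable, Dict, Iterator, List

# An access program yields records already converted to the logical format.
AccessProgram = Callable[[], Iterator[dict]]


class DatasetMultiplexer:
    """Hypothetical stand-in for a dataset multiplexer with a catalog table."""

    def __init__(self) -> None:
        self._catalog: Dict[str, AccessProgram] = {}

    def register(self, logical_id: str, program: AccessProgram) -> None:
        """Create or overwrite the catalog entry for a logical dataset."""
        self._catalog[logical_id] = program

    def read(self, logical_id: str) -> Iterator[dict]:
        """Resolve the entry at execution time and run its access program."""
        return self._catalog[logical_id]()


def flat_file_program() -> Iterator[dict]:
    # Stand-in for reading a delimited flat file and converting each line.
    for line in ["1|Ann", "2|Bob"]:
        cust_id, name = line.split("|")
        yield {"customer_id": int(cust_id), "name": name}


def database_program() -> Iterator[dict]:
    # Stand-in for querying a table whose column names differ from the logical format.
    for row in [{"ID": 1, "NM": "Ann"}, {"ID": 2, "NM": "Bob"}]:
        yield {"customer_id": row["ID"], "name": row["NM"]}


def application(mux: DatasetMultiplexer) -> List[str]:
    """Application logic written only against the logical dataset 'loyalty'."""
    return [rec["name"].upper() for rec in mux.read("loyalty")]


mux = DatasetMultiplexer()
mux.register("loyalty", flat_file_program)
out_before = application(mux)
mux.register("loyalty", database_program)  # storage changed; application unchanged
out_after = application(mux)
```

The application function never names a file or table. Replacing the catalog entry is enough to redirect it to the new storage, and its output is the same in both cases.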

[0137] Figure 3B schematically shows input node 302 of Figure 3A being configured or programmed based on a logical dataset. The input node 302 can be configured to represent a specific logical dataset specified via user input provided through a user interface. For example, user input can be provided via user interface 315. A list 370 of logical datasets that can be used to configure input and output nodes of the data flow graph can be provided in user interface 315. The logical datasets that can be used to configure input and output nodes can be logical datasets that have entries in a dataset catalog table. Users can browse the list and select a specific logical dataset for configuring input node 302. Users can enter search queries via user interface element 372, where users can specify one or more values for dimensions describing the logical dataset or logical entity. These dimensions can include words entered in the repository describing the logical dataset or fields included within the logical dataset. Figure 3B depicts a user selecting the "loyalty" logical dataset 375 and input node 302 being configured to represent the selected logical dataset.

[0138] The co-pending application having attorney docket number A1041.70070US02 and titled "Data Processing System with Manipulation of Logical Dataset Groups" describes various search interfaces through which users can search for datasets and / or dataset groups as targets for operation. The interfaces and techniques described in that co-pending application can be used in the data processing system described herein to configure components of an application.

[0139] The dataset catalog table may include entries for the selected logical dataset, providing information for accessing the physical dataset corresponding to the selected logical dataset. This information may be, or include, a program for accessing the physical dataset. When the execution of an application involves operations on the selected logical dataset, the dataset multiplexer can utilize the appropriate dataset catalog table information to provide access to the physical dataset. For example, an identifier associated with the selected logical dataset can be used to identify the appropriate entry in the dataset catalog table that includes the program, and the program can be executed to access the physical dataset from the data storage device. The dataset multiplexer may expose a link to the program, thereby enabling access to the physical dataset by executing the program at that link.

[0140] Figure 3C illustrates how such connections can be made using the dataset catalog table. The figure schematically shows application 106-3 described above in connection with Figure 3B. As shown in Figure 3C, when the application is executed, input nodes 302, 304 and output node 314 of Figure 3B are replaced with programs that provide access to the physical datasets corresponding to the logical datasets that these components are configured to represent. For example, input nodes 302, 304 are replaced with programs 330, 340 that provide access to the respective physical datasets in the data storage devices in which they are currently stored. Furthermore, output node 314 can be replaced with program 350, which provides access for writing data to the respective physical dataset in the data storage device in which it currently resides. These programs can also convert between the format of the logical datasets used to program the application and the storage format of the physical datasets in the data storage device.

[0141] Representative dataset catalog

[0142] Dataset catalog 107 may include multiple objects, each storing information associated with a logical dataset. In this context, an object refers to a collection of information stored in a computer-readable medium that captures information related to a logical dataset. This information may be stored in any suitable format. For example, it may be stored in a contiguous block of computer memory, distributed across multiple locations in computer memory, stored in a single file or other data structure, distributed across multiple data structures, or otherwise stored in a manner that allows the information reflected in the object to be associated with the logical dataset.

[0143] The object can be associated with a logical dataset in any suitable manner. The object can have a predefined format including information, which can be in the form of a header identifying the logical dataset and / or physical dataset associated with that information. However, this information can be in a format other than a header. For example, a catalog table can store a list of pointers to objects indexed by logical dataset identifiers, such that accessing a pointer with a specific logical dataset identifier as an index allows a computer accessing the catalog table to find the object associated with that logical dataset as the target of that pointer. Alternatively or additionally, some or all of the catalog table information about the logical dataset can be stored as an appendix to an information repository that may otherwise exist within the data processing system. For example, the data processing system may include a repository of metadata associated with the logical dataset and / or physical dataset. The catalog table information can be appended to this repository and / or stored in a separate metadata repository.

[0144] Information about the logical dataset can be reflected in the object in any suitable form. For example, the information can be stored as one or more descriptors, each with a value. Alternatively or additionally, the information can be stored as or include computer-executable instructions. In some embodiments, the physical dataset can be reflected in the object because a program stored with the object to access the physical dataset is hard-coded to access it. In other embodiments, information identifying the physical dataset corresponding to the logical dataset can be stored as a value of a field in the data structure of the storage object. This value can be passed as a runtime parameter to a program stored with the object to access the physical dataset or otherwise used to access the physical dataset.

[0145] Figure 4 shows an example object 400 in dataset catalog table 107 maintained by dataset multiplexer 105. Figure 4 shows the information captured in object 400; however, some of this information (such as discovery information 406 and / or access information 408) may be optional.

[0146] The information captured in object 400 may include information used to identify the physical dataset corresponding to the logical dataset. In this example, the object is identified by identifier 404 of the logical dataset.

[0147] The information reflected in object 400 may be, or may include, an executable program 402 for accessing a physical dataset. When executed, the program can access the physical dataset corresponding to the logical dataset and convert data in the physical dataset into the format of the logical dataset, and vice versa. The program can be reflected in the object by storing a copy of the program's computer-executable instructions in the computer memory allocated for the catalog table object. In other embodiments, the program may be stored elsewhere, with only a pointer to the program or another identifier of the program stored in the computer memory allocated for the object.

[0148] In some embodiments, a program may be created using discovery information 406 identified during the registration process of the physical dataset and / or access information 408 used in other ways to access the physical dataset.

[0149] The object may reflect information about the physical data source storing the corresponding physical dataset, enabling access to and transformation of data within the physical dataset. This information can be obtained in any of a variety of ways, including via user input or via an automatic discovery process performed by reading data or metadata from the data source storing the physical dataset. In some embodiments, the discovered information 406 may be automatically discovered as part of a registration process for registering the physical dataset with the dataset multiplexer 105. As part of the registration process, the user may specify a logical dataset corresponding to the physical dataset, or the correspondence between the logical dataset and the physical dataset may be determined in another suitable manner. The automatically discovered information may include a physical identifier associated with the data storage device and / or the physical dataset, a reference to the storage location of the data storage device and / or the physical dataset, the type of data storage device, the record format or schema of the physical dataset, and / or other information.

[0150] In some embodiments, a copy of the discovered information may be stored in the object. In other embodiments, the discovered information 406 may be reflected in the object because it is used to create a program for accessing the physical dataset, which is stored as part of the object. For example, the type and format information of the data storage device and / or the physical dataset may be used to create a program with conversion logic for converting data in the physical dataset into a format suitable for a logical dataset.

[0151] Access information may include parameters 408, which may specify how to access physical datasets and / or data storage devices. In some embodiments, these parameters may be design-time parameters and / or runtime parameters. Design-time parameters may be applied to specify the functionality of program 402. Since the program is generated based on design-time parameters, the values of these parameters do not need to be stored separately in object 400. If they are runtime parameters, their values may be stored in the object and provided to the program as input during program execution.
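
The distinction between the two kinds of parameters can be illustrated with a small sketch. Here `make_program`, `CatalogObject`, the `delimiter` design-time parameter, and the `max_records` runtime parameter are all hypothetical names chosen for the example:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def make_program(delimiter: str) -> Callable[..., List[List[str]]]:
    """Design-time parameter `delimiter` is baked into the generated program."""
    def program(lines: List[str], max_records: int) -> List[List[str]]:
        # `max_records` is a runtime parameter supplied on each execution.
        return [line.split(delimiter) for line in lines[:max_records]]
    return program


@dataclass
class CatalogObject:
    """Hypothetical shape for a catalog table object."""
    logical_id: str
    program: Callable[..., List[List[str]]]
    runtime_params: Dict[str, object] = field(default_factory=dict)

    def execute(self, lines: List[str]) -> List[List[str]]:
        # Stored runtime parameter values are passed to the program as input.
        return self.program(lines, **self.runtime_params)


obj = CatalogObject(
    logical_id="loyalty",
    program=make_program(delimiter="|"),  # value not stored separately in the object
    runtime_params={"max_records": 2},    # value stored and passed at execution
)
records = obj.execute(["1|Ann", "2|Bob", "3|Cy"])
```

Changing the runtime value only requires editing `runtime_params`, whereas changing the delimiter would require generating a new program.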

[0152] Parameter 408 may include one or more parameters specifying the type of access to the physical dataset. In some embodiments, the access type may indicate read access or write access. In other embodiments, the access type may indicate the amount of bandwidth allocated for accessing a particular logical dataset. For example, the value of parameter 408 may indicate dedicated access or shared access. The data storage device may support multiple connections to application 106, which can collectively use no more than a predetermined amount of bandwidth to access the data storage device. Allocation methods may be applied so that applications performing tasks with higher priority than other tasks can use more of the total available bandwidth of the data source. As a specific example, the data source may support both dedicated and shared access, where dedicated access by an application results in more available bandwidth being allocated to the application than when shared access is provided. Specifying dedicated access to the logical dataset for higher-priority applications and shared access to the logical dataset for lower-priority applications can allocate available bandwidth at the data source as needed.

[0153] As another example, access parameters may alternatively or additionally indicate the type of connection used to access the data storage device that holds the physical dataset corresponding to the logical dataset, such as a fast connection or a slow connection.

[0154] As yet another example, parameter 408 may include one or more parameters specifying security-related information. In some embodiments, the one or more parameters may indicate whether the data in the physical dataset is encrypted. In embodiments where the data is encrypted, parameter 408 may include information such as a security key to decrypt the information or otherwise make it available. To enhance security, the security key may be provided by application 106 at runtime and may not be stored in the dataset catalog table 107. In other embodiments, the one or more parameters may indicate whether the data in the physical dataset is compressed. In embodiments where parameter 408 is used to create program 402, the value of parameter 408 indicating that the data in the physical dataset is encrypted may be used to include decryption logic in the program.

[0155] As a further example, parameter 408 may include one or more parameters that specify the criteria for a filtering operation. For example, these one or more parameters may specify a date that can be used to filter information when accessing the physical dataset.

[0156] In some embodiments, some or all values of parameter 408 can be automatically discovered. This automatic discovery process can be performed when a physical dataset is registered with a component of a data processing system that creates a dataset catalog table. For example, during the discovery process, a component of the data processing system can access metadata in the data storage device to determine the information reflected in the object. Alternatively or additionally, a component of the data processing system can analyze data read from the physical dataset to identify patterns in the data that indicate record format, encryption, compression, or other information about the physical data storage device.
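
As one illustration of discovery by analyzing the data itself, the sketch below checks sample bytes for the gzip signature and guesses a field delimiter from the first line. The function `discover` and its heuristics are assumptions for this example; a real discovery process would likely consider many more formats:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def discover(raw: bytes) -> dict:
    """Infer compression and a likely field delimiter from sample bytes."""
    compressed = raw.startswith(GZIP_MAGIC)
    payload = gzip.decompress(raw) if compressed else raw
    lines = payload.decode("utf-8").splitlines()
    first_line = lines[0] if lines else ""
    # Pick the candidate delimiter that occurs most often in the first line.
    candidates = ["|", ",", "\t"]
    delimiter = max(candidates, key=first_line.count)
    if first_line.count(delimiter) == 0:
        delimiter = None
    return {"compressed": compressed, "delimiter": delimiter}


plain = b"1|Ann|Boston\n2|Bob|Denver\n"
info_plain = discover(plain)
info_gzip = discover(gzip.compress(plain))
```

The discovered values could then be stored in the catalog object or used when generating the access program.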

[0157] However, it should be understood that the discovered information 406 can be obtained through means other than direct interaction with the data source, such as by reading from a repository of metadata associated with logical and / or physical datasets maintained by the data processing system. For example, security information such as encryption or compression can be applied to all datasets in the data storage device. Once security information is stored anywhere in the system for a physical dataset in the data storage device, that security information can be reflected in objects used to access other physical datasets in the same data storage device.

[0158] Some or all of the information reflected in the object, even if indicated as discovered in the example of Figure 4, can also be input by the user. In some embodiments, as part of the registration process, a portion of the discovery information 406 and / or access information 408 can be specified by the user via a user interface. However, it should be understood that user input can be provided in other ways, such as when defining a logical dataset. As a specific example, the priority of a logical dataset can be specified when defining the logical dataset or after defining the logical dataset by editing the metadata stored for that logical dataset.

[0159] Furthermore, it should be understood that Figure 4 illustrates an object associated with a logical dataset at a given time, configured to access a physical dataset. The data processing system can detect events affecting the storage of data associated with the logical dataset. If such an event is detected, the object for that logical dataset can be updated. For example, whenever a change in any parameter is detected, the values of those parameters can be updated. Alternatively or additionally, if a new physical dataset is registered, and the input indicates that the physical dataset stores data for a logical dataset whose object already exists in the catalog, the object for that logical dataset can be modified. For example, changes can be implemented by rewriting the object entirely or partially with new information, or by replacing the object with a new object to reflect the new physical dataset. In either case, the object for that logical dataset can still be accessed in the same way via the dataset catalog. In this way, once an application written to perform data access operations based on a logical dataset is configured to access the physical dataset corresponding to the logical dataset via the dataset catalog, it will continue to access the correct physical dataset regardless of any changes.

[0160] In some embodiments, program 402 may be configured to include an executable data flow graph containing logic for accessing a physical dataset. In embodiments where the application is developed as a graph, as described above in connection with Figures 3A to 3C, program 402 can be configured as a subgraph, in which case it will be executed as part of the data flow graph implementing the application. For example, Figure 3C depicts a first program 330 that includes a subgraph with logic for accessing an input dataset, a second program 340 that includes a subgraph with logic for accessing an input dataset, and a third program 350 that includes a subgraph with logic for accessing an output dataset.

[0161] These subgraphs can be considered dynamic subgraphs (DSGs) because they are updated from time to time based on events indicating changes in the appropriate data access mechanisms of the storage devices associated with the logical dataset. Therefore, using a subgraph for data access operations within an application results in dynamic access to the physical dataset that stores the correct data at that time. A DSG is thus used herein as an example of program 402.

[0162] Representative dataset multiplexer with dataset catalog table

[0163] Figure 5A is a block diagram highlighting the components of the dataset multiplexer 105 in the data processing system 104. As shown in Figure 5A, the dataset multiplexer 105 includes, among other components, a registration module 520, a dynamic subgraph (DSG) generator 524, a metadata management module 526, an operation metadata module 528, a catalog table service interface 522, and a user interface 530.

[0164] In some embodiments, the registration module 520 is configured to register a physical dataset with the dataset multiplexer 105. Registration can be triggered by adding a physical dataset to the IT infrastructure or by using the physical dataset from an application. Alternatively or additionally, the registration module 520 can receive commands for registering the physical dataset via a user interface 530. For example, a user can provide input via the user interface 530 to initiate the registration process for the physical dataset. This input can be in the form of a direct command for registering the physical dataset.

[0165] Alternatively or additionally, the input can indirectly indicate the initiation of registration. For example, registration can be triggered when a user of the application selects a logical dataset already associated with a physical dataset that has no information in the dataset catalog table or whose information in the catalog table is not up-to-date. Other actions used as indirect commands can include instructions for migrating a physical dataset from one data storage device to another, or commands for changing metadata associated with the logical dataset that may affect the conversion between the physical dataset and the logical dataset. Regardless of how the registration process is triggered, user input can specify a logical dataset corresponding to the physical dataset, allowing objects of that logical dataset in the catalog table to be created or rewritten with the latest information.

[0166] Additional information used to create or update objects in the catalog table can be collected from one or more sources. Registration module 520 can discover information about the physical dataset and / or the data storage device that stores it during the registration process. Information collected in this manner may include the type of data storage device, the record format or schema of the physical dataset, the physical storage location of the data storage device, compression and / or encryption status, and / or other information.

[0167] The registration module 520 can provide the obtained information to the DSG generator 524. The DSG generator 524 can create a DSG based on the received information. The DSG generator 524 can access multiple program templates, each corresponding to a specific type of data storage device. The DSG generator 524 can detect the type of data storage device from the received information and select the appropriate program template corresponding to the detected type from the multiple program templates. For example, the data processing system can be pre-configured with templates for read and / or write access to data tables in an Oracle database or a Hadoop distributed database. Detecting the type of data storage device storing the physical dataset allows the DSG generator 524 to select the appropriate template for accessing the physical dataset corresponding to the logical dataset for which the DSG is being created.

[0168] DSG generator 524 can generate a program based on a selected program template. DSG generator 524 can detect parameter values of the selected program template from received information and can populate the program template with the detected values. Alternatively or additionally, some or all of the parameter values can be obtained from metadata management module 526, which in this example can maintain metadata for physical datasets, data storage devices, and / or logical datasets. Alternatively or additionally, parameters can be provided through user input using user interface 530 or otherwise obtained.
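The template selection and population described in paragraphs [0167] and [0168] might be sketched as follows. This is only an illustration under stated assumptions: the template strings, the `device_type` key, and the parameter names (`table`, `host`, `path`) are hypothetical and do not come from the specification.

```python
from string import Template
from typing import Optional

# Hypothetical program templates, one per type of data storage device.
TEMPLATES = {
    "oracle": Template("READ TABLE $table FROM oracle://$host"),
    "hadoop": Template("READ FILE $path FROM hdfs://$host"),
}

def generate_program(discovered: dict, extra_params: Optional[dict] = None) -> str:
    """Select the template for the detected device type and fill in its parameters."""
    device_type = discovered["device_type"]
    if device_type not in TEMPLATES:
        raise ValueError(f"no template for device type {device_type!r}")
    # Parameter values may come from discovery, a metadata store, or user input.
    # Assumption: values supplied in extra_params override discovered ones.
    params = {**discovered, **(extra_params or {})}
    return TEMPLATES[device_type].substitute(params)

program = generate_program(
    {"device_type": "oracle", "host": "db1", "table": "CUSTOMERS"}
)
print(program)
```

A template that is missing a required parameter raises `KeyError` from `substitute`, which is one simple way to surface incomplete discovery information.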

[0169] DSG generator 524 generates a DSG that includes access logic for accessing the physical dataset and conversion logic for converting between the format of the physical dataset and the format of the corresponding logical dataset. DSG generator 524 can generate a logical-to-physical layer mapping for the physical dataset and the corresponding logical dataset. DSG generator 524 can generate a mapping between one or more fields of the logical dataset and one or more fields of the physical dataset representing the same information. This mapping can be generated using information from various sources, including information available within the data processing system, user input, and / or information obtained through semantic discovery. DSG generator 524 can utilize this mapping to generate conversion logic. For example, customer names in the physical dataset can be stored as three fields in a row of a data table, storing data corresponding to the customer's first name, middle initial, and last name, respectively. However, the logical dataset can simply include the logical entity Customer_Name. DSG generator 524 can generate a mapping between these three fields of the physical dataset and the logical entity of the logical dataset. The conversion logic can include logic for converting between the "customer's first name, middle initial, and last name" format of the physical dataset and the "Customer_Name" format of the logical entity. When the DSG is executed, the access logic is performed to obtain information from the three fields of the physical dataset, and the conversion logic is performed to convert between the format of the physical dataset and the format of the logical dataset.
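The Customer_Name example above can be illustrated with a small mapping-driven conversion. The physical field names (`first_name`, `middle_initial`, `last_name`) and the space-joined output are assumptions made for this sketch; the specification does not fix either.

```python
# Mapping from each logical field to the physical fields holding the same information.
# Physical field names are hypothetical.
FIELD_MAP = {
    "Customer_Name": ("first_name", "middle_initial", "last_name"),
}

def to_logical(physical_row: dict, field_map: dict = FIELD_MAP) -> dict:
    """Conversion logic: build a logical record from a physical row."""
    logical = {}
    for logical_field, physical_fields in field_map.items():
        values = [physical_row.get(name) for name in physical_fields]
        # Skip empty parts (for example, a missing middle initial).
        logical[logical_field] = " ".join(v for v in values if v)
    return logical

row = {"first_name": "Ada", "middle_initial": "K", "last_name": "Lovelace"}
print(to_logical(row))
```

Only the read direction is shown. Splitting a single Customer_Name back into three physical fields is ambiguous in general and would need additional rules.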

[0170] In some embodiments, DSG generator 524 creates a DSG for each of a plurality of physical datasets in a data storage device. The created DSGs may be included in dataset catalog table 107. Dataset catalog table 107 may include objects associated with logical datasets, wherein each object may be or include a DSG for accessing the physical dataset corresponding to the logical dataset.

[0171] The registration module 520 can also provide discovery information to the metadata management module 526, enabling the metadata management module 526 to receive and maintain metadata about the physical dataset and / or data storage device. In some embodiments, when a DSG is generated, the metadata management module 526 can be an information source for the DSG generator 524 and can additionally store metadata about the dataset, which can be used in other operations involving the dataset within the data processing system. For example, the metadata management module 526 can maintain information used as metadata about the logical dataset, information about logical entities in the logical dataset, relationships between logical entities in the dataset, and relationships with entities in other logical datasets and / or with other logical datasets.

[0172] The metadata management module 526 can also store a mapping between logical datasets and physical datasets. This mapping can be based on user input, or in some embodiments, it can be obtained, for example, through monitoring operations in which the user directly or indirectly specifies the association between the logical dataset and the physical dataset as part of a data processing operation. Regardless of how it is obtained, in some embodiments, the metadata management module 526 can maintain a table or other data structure that maps identifiers of logical datasets to identifiers of corresponding physical datasets. The DSG generator 524 can use this information when creating objects representing logical datasets and / or when it determines that the storage of data associated with the logical datasets has changed, thus requiring previously created objects to be updated.

[0173] The metadata management module 526 can maintain a list of known logical datasets of the data processing system 104. When an application is programmed based on a logical dataset, the list of known logical datasets can be presented to the user via the application's user interface, and the user can select a specific logical dataset from the presented list. This logical information maintained by the metadata management module 526 can be used, for example, to enable the user to search for a specific logical dataset used to write the application. Information about physical datasets, including their correspondence with logical datasets, stored by the metadata management module 526, can also be used to search for appropriate datasets. For example, this logical and physical information can be used to define the dimensions of a faceted search for a dataset.

[0174] Data processing systems can maintain other types of metadata about datasets, which can also be used by users searching for datasets for specific scenarios. For example, metadata related to the use of a dataset can be captured and stored when the dataset is used. This operational metadata can also be used by dataset search tools to enable users to search for datasets based on how others have used them.

[0175] Operational metadata module 528 can collect operational metadata about a dataset. Operational metadata can be collected during or after the execution of an application or other program accessing the dataset. Operational metadata collected during execution may include identification information about the accessed physical dataset, the date and time of access, whether the dataset was updated, parameter values associated with the execution of one or more subgraphs that accessed the dataset, and / or other operational data. Operational metadata collected or determined after execution may include information about the frequency of access to the dataset (whether physical or logical), information about the recency of access, or information about the size of the accessed data (e.g., the number of records read and / or written). Some operational metadata may be social information, such as information about the user who created or accessed the dataset. This social information may include the user's role within the enterprise, the permissions granted to the user, and / or other information about people within the enterprise.

[0176] In the example of Figure 5A, the catalog table service interface 522 integrates access to various types of metadata about datasets. It can provide, for example, a faceted search tool that enables searching on any of multiple facets that may exist in the logical metadata, physical metadata, and / or operational metadata about the physical and / or logical datasets that a user might want to select when writing an application or otherwise specifying an operation to be performed on a dataset. Facets in the search can be based on information stored within the data processing system about logical datasets, physical datasets, and / or operational metadata. For example, a search for datasets can be limited to returning only datasets that have entries / objects in the dataset catalog table. This facet can be combined with other facets associated with the same logical or physical dataset to provide a powerful search interface. For example, a search query can be limited to returning only datasets accessed within the past week and only those logical datasets with an email field, for which the corresponding physical datasets are stored on a data storage device with high-speed access.
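The combined query at the end of this paragraph (in the catalog, accessed within the past week, has an email field, stored on high-speed storage) can be sketched as a conjunction of facet predicates. The `DatasetInfo` record, its attribute names, and the sample data are hypothetical; the sketch only illustrates how logical, physical, and operational facets might be combined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DatasetInfo:
    logical_id: str
    fields: frozenset          # logical metadata
    storage_speed: str         # physical metadata
    last_accessed: datetime    # operational metadata
    in_catalog: bool           # has an entry / object in the catalog table

def faceted_search(datasets, *facets):
    """Return the datasets that satisfy every facet predicate."""
    return [d for d in datasets if all(facet(d) for facet in facets)]

now = datetime(2024, 1, 8)  # fixed "current time" so the example is reproducible
datasets = [
    DatasetInfo("abbott.customers", frozenset({"name", "email"}), "high",
                datetime(2024, 1, 5), True),
    DatasetInfo("abbott.preferred-cust", frozenset({"name"}), "high",
                datetime(2024, 1, 7), True),
    DatasetInfo("legacy.orders", frozenset({"email"}), "low",
                datetime(2023, 11, 1), False),
]

hits = faceted_search(
    datasets,
    lambda d: d.in_catalog,
    lambda d: now - d.last_accessed <= timedelta(days=7),
    lambda d: "email" in d.fields,
    lambda d: d.storage_speed == "high",
)
print([d.logical_id for d in hits])
```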

[0177] Although Figure 5A shows separate modules for managing different types of metadata, it should be understood that this depiction is based on functional separation, and the hardware and / or software components for capturing and / or providing multiple types of metadata may be divided in other ways, including integrating the capture and management of all such metadata into a single module or into more modules than shown.

[0178] The catalog table service interface 522 also enables application 106 to be programmed based on a logical dataset. Once the user selects a logical dataset for programming the application, the catalog table service interface 522 can provide information that allows the application written based on that logical dataset to access the appropriate physical dataset. The catalog table service interface 522 can access a dataset catalog table 107, in which each object corresponds to a logical dataset and provides information for accessing the physical dataset corresponding to that logical dataset. The catalog table object can be or include a program (shown as a DSG in this example) for accessing the physical dataset corresponding to the logical dataset.

[0179] The catalog table service interface 522 enables an application to access a physical dataset by providing information about a program within the object of a selected logical dataset in the dataset catalog table 107. When performing an operation to access a logical dataset from within the application, the application can use this information to access the corresponding physical dataset in a data storage device. In this way, a program identified from the catalog table object can be executed to access the physical dataset from the data storage device. For example, the catalog table service interface 522 may expose a link to a DSG, which a development environment used to develop an application can use to construct the application such that, when the application is executed, access to the physical dataset is achieved by executing the DSG at that link. In some embodiments, the catalog table service interface 522 provides this link via an application programming interface (API).

[0180] As described above, the catalog table object associated with the logical dataset, and therefore the DSG within that object, can be updated in response to events indicating a change in the storage of information associated with the logical dataset. For example, the physical dataset corresponding to the logical dataset may be migrated from one data storage device to another. The catalog table object of the logical dataset can be updated to account for this change. In some embodiments, the program used to access the physical dataset can be modified so that the application accesses the physical dataset from the correct data storage device. By updating the catalog table object of the logical dataset, applications written to access the logical dataset can continue to operate without modification even when the physical dataset is migrated from one data storage device to another. This dynamic update is described in more detail below with reference to Figures 6A to 6B.

[0181] Other events, which need not be associated with the location of the physical dataset, may cause changes to objects in the dataset catalog table. For example, in response to an event indicating a change in the format of the physical dataset, the appropriate catalog table object may be updated. For instance, if the format of the physical dataset is changed by adding a field to the dataset, the corresponding catalog table object may be updated to take the added field into account. In some embodiments, the conversion logic in the program used to access the physical dataset may be modified to take this change into account. As another example, in response to an event indicating a change in the value of a parameter used to generate a program or accessed within a program, the value of the parameter stored in the catalog table object may be updated and / or the program may be regenerated with the new value. As yet another example, an event may indicate that the physical dataset is replaced with another physical dataset corresponding to the same logical dataset. In this example, the catalog table object corresponding to the first physical dataset may be replaced by the catalog table object corresponding to the other physical dataset. These changes may be implemented by the DSG generator 524, which may be triggered to update the catalog table object when an event is detected. For example, an update can be implemented by rewriting all or part of the memory location storing the catalog table object, or by associating an object stored in another memory location with the dataset catalog table entry, so that the catalog table object of that entry is updated by being replaced with a new object. Such a change can be triggered by user input or automatically detected by the DSG generator 524, the catalog table service interface 522, or other components of the data processing system.
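The three kinds of non-location events described in this paragraph (a field added, a parameter value changed, a physical dataset replaced) might be handled as in the sketch below. The `CatalogObject` structure, the event dictionary layout, and the event names are all hypothetical. The sketch shows both update styles mentioned: rewriting the stored object in place, and re-pointing the catalog entry at a new object.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogObject:
    physical_id: str
    fields: list
    params: dict = field(default_factory=dict)

def apply_event(catalog: dict, logical_id: str, event: dict) -> None:
    """Update the catalog table object for logical_id in response to an event."""
    obj = catalog[logical_id]
    kind = event["kind"]
    if kind == "field_added":
        # Format change: rewrite part of the existing object in place.
        obj.fields.append(event["field"])
    elif kind == "param_changed":
        obj.params[event["name"]] = event["value"]
    elif kind == "replaced":
        # Replacement: associate the entry with a different object.
        catalog[logical_id] = event["new_object"]
    else:
        raise ValueError(f"unknown event kind {kind!r}")

catalog = {"abbott.customers": CatalogObject("123", ["name"])}
apply_event(catalog, "abbott.customers", {"kind": "field_added", "field": "email"})
apply_event(catalog, "abbott.customers",
            {"kind": "param_changed", "name": "Param1", "value": "batch"})
print(catalog["abbott.customers"])
```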

[0182] It should be understood that when an application written based on a logical dataset is executed and the dataset catalog table 107 is accessed to provide the application with access to the physical dataset corresponding to the logical dataset, one or more components, such as the registration module 520, the DSG generator 524, the metadata management module 526, the operational metadata module 528, and / or the user interface 530, may be optional, as shown in Figure 5B. When performing an operation to access a logical dataset from within the application, the application can obtain information about the DSG associated with the logical dataset from the dataset catalog table 107 via the catalog table service interface 522 based on the identifier associated with the logical dataset. In some embodiments, the catalog table service interface 522 may provide this information to the application by exposing a link to the DSG. When executed, the DSG provides the application with access to the physical dataset corresponding to the logical dataset.

[0183] Representative techniques for updating dataset catalog table objects

[0184] Objects in a dataset catalog table can be used to perform data access operations in applications programmed based on logical datasets. A catalog table object can be updated in response to events, so that appropriate data access is provided by using the information that is current in the object at the time of application execution. One such event is a change in the physical storage location of the dataset, as shown in Figure 6A and Figure 6B. Figure 6A is a block diagram of an exemplary enterprise IT system, such as that shown in Figure 1A or Figure 5A, in an operational state at a first time, during which the data processing system facilitates access between applications 106-1, 106-3 and data storage devices 102-1 and 102-2.

[0185] Application 106-3 can be developed in a development environment as a data flow graph that utilizes information from the dataset catalog table to implement references to logical datasets in the application specification. Components 330 and 340 of application 106-3, representing input nodes of the data flow graph, can be programmed according to logical datasets, wherein, for these components, information stored in computer memory for executing the application includes links to catalog table objects corresponding to these logical datasets. For example, component 330 may link to a catalog table object corresponding to a first logical dataset, and component 340 may link to a catalog table object corresponding to a second logical dataset. Links can be stored in any format that conveys information sufficient to identify the information within the object necessary to access the physical dataset corresponding to the logical dataset referenced in those components. For example, links can be stored as identifiers of objects or as paths through a directory structure to files storing programs for accessing the physical datasets.

[0186] Application 106-1 can also be developed as a data flow graph. Components 610 and 620 of application 106-1, which represent the input nodes of the data flow graph, can be programmed according to logical datasets, wherein these components are linked to catalog table objects corresponding to the logical datasets. For example, component 610 can be linked to a catalog table object corresponding to a first logical dataset, and component 620 can be linked to a catalog table object corresponding to a third logical dataset.

[0187] As shown in Figure 6A, component 330 of application 106-3 and component 610 of application 106-1 can be programmed based on the same logical dataset and can be linked to the same catalog table object in the dataset catalog table 107.

[0188] Data processing system 104 can maintain a dataset catalog table 107 that includes catalog table objects corresponding to logical datasets. Each catalog table object can be or include a DSG for accessing the physical dataset corresponding to the logical dataset. For example, as shown in Figure 6A, the dataset catalog table 107 includes a first set of DSGs, each of which is programmed to access a physical dataset from data source 102-2. The dataset catalog table 107 also includes a second set of DSGs, each of which is programmed to access a physical dataset from data source 102-1.

[0189] Data processing system 104 enables applications 106-3 and 106-1, which are programmed based on logical datasets, to access the corresponding physical datasets from data storage devices 102-2 and 102-1 using information in dataset catalog table 107. When programming application 106-3, a user can, for example, select a first logical dataset from a list of known logical datasets and associate that logical dataset with component 330, and associate a second logical dataset with component 340. Similarly, when programming application 106-1, a user can select the first logical dataset to associate with component 610 and select a third logical dataset to associate with component 620.

[0190] When performing an operation to access a logical dataset associated with component 330, data processing system 104 may select a DSG linked to component 330. When performing an operation to access a logical dataset associated with component 340, data processing system 104 may select a DSG linked to component 340. When performing an operation to access a logical dataset associated with component 610, data processing system 104 may select a DSG linked to component 610. When performing an operation to access a logical dataset associated with component 620, data processing system 104 may select a DSG linked to component 620.

[0191] Figure 6B is a block diagram of an exemplary data processing system, such as that shown in Figure 1A or Figure 5A, in an operational state at a second time, during which the data processing system facilitates access between applications 106-1, 106-3 and data storage devices 102-2 and 102-1' after the physical dataset of data storage device 102-1 has been migrated to data storage device 102-1'.

[0192] In this example, the migration of the physical dataset from data storage device 102-1 to data storage device 102-1' is an event that causes data processing system 104 to update dataset catalog table 107. Objects in dataset catalog table 107 corresponding to the logical datasets mapped to the physical datasets in data storage device 102-1 can be updated to account for the change in data storage device. Through this update, the second set of DSGs can be modified to access the physical dataset from data storage device 102-1' instead of data storage device 102-1. As shown in Figure 6B, the links between applications 106-3 and 106-1 and the dataset catalog table 107 remain unchanged, and applications 106-3 and 106-1 continue to operate regardless of changes in the physical storage of the dataset. Performing an operation within an application that specifies access to the logical dataset will still result in access to the physical dataset at its updated location.
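The migration scenario of Figures 6A and 6B can be illustrated with a toy catalog in which each object is a callable that stands in for a DSG. Everything here is a simplification under stated assumptions: the `DatasetCatalog` class, the dictionaries that stand in for data storage devices 102-1 and 102-1', and the logical identifier are all hypothetical. The point is that the application function is never modified, yet it reads from the new location after the catalog object is updated.

```python
class DatasetCatalog:
    """Toy catalog table: logical dataset id -> program (a callable stand-in for a DSG)."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_id, program):
        # Creating and updating an entry are the same operation here.
        self._objects[logical_id] = program

    def read(self, logical_id):
        # Execute whatever program is current at the time of access.
        return self._objects[logical_id]()

# Dictionaries standing in for data storage devices 102-1 and 102-1'.
device_102_1 = {"orders": [{"id": 1}, {"id": 2}]}
device_102_1_prime = {}

catalog = DatasetCatalog()
catalog.register("third_logical_dataset", lambda: device_102_1["orders"])

def application():
    # The application refers only to the logical dataset.
    return catalog.read("third_logical_dataset")

before = application()

# Migration event: move the data, then update the catalog object.
device_102_1_prime["orders"] = device_102_1.pop("orders")
catalog.register("third_logical_dataset", lambda: device_102_1_prime["orders"])

after = application()
print(before == after)
```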

[0193] A representative application configured for data access via a dataset catalog table object.

[0194] Figure 7 is a block diagram showing various information maintained by the dataset multiplexer 105. This information enables application 106-2, which is programmed based on logical datasets, to be configured to access the corresponding physical datasets. Once the application is configured, this information can also be logged as a result of the application's execution. This logged information can provide operational metadata for other functions performed by the data processing system, including providing a search interface through which a user can later search for a dataset to use in an application based on previous operations on the dataset.

[0195] In this example, application 106-2 is programmed to read data from a dataset containing information about customers. The application then extracts records representing preferred customers from this dataset and writes the results to a second dataset. When executed, application 106-2 will read from and write to a physical dataset. However, application 106-2 can be programmed based on a first logical dataset associated with input data storage device 710 and a second logical dataset associated with output data storage device 720.

[0196] While application 106-2 is being written, the user can provide configuration inputs to the input data storage device 710, specifying the logical dataset from which data is to be read. In this example, the logical dataset is identified as "abbott.customers". This dataset can be selected by user input, such as from a list of all logical datasets registered with the data processing system or from a limited list returned in response to a user query for a dataset with user-specified parameters. Such a selection interface can be provided by the development environment of application 106-2.

[0197] Similarly, the output data storage device 720 can be configured with a logical dataset. In this example, the logical dataset is identified as "abbott.preferred-cust".

[0198] To enable application execution, the development environment can associate the selected logical datasets with information that allows read and write operations on the physical datasets corresponding to the specified logical datasets when the application is executed. This can be done, for example, using information obtained via the catalog table service interface 522 (Figure 5A). The catalog table service interface 522 can, for example, respond to a request for catalog table information related to a logical dataset by providing information about a maintained program that, when executed, accesses the physical dataset corresponding to that logical dataset at that time. In this example, information about the program is provided as a path within a directory structure to the file storing the program. Here, the program used to access the physical dataset corresponding to the input logical dataset "abbott.customers" is stored at the path "common20 / abbott / customers / DSG". However, the link to this program can be provided in any suitable format.

[0199] Similarly, a link is obtained to a program for accessing the physical dataset corresponding to the output logical dataset "abbott.preferred-cust". In this example, the path is "common10 / abbott / preferred-cust / DSG". These links to programs that can access the physical datasets can be exposed by the catalog table service interface 522 during application execution. These links can be stored as part of the application's computer-executable representation, so that these programs can be executed when operations to access these datasets are performed within the application. Alternatively, sufficient information to execute a program to access the physical dataset can be obtained at any time before performing operations to access the data source, including during application execution.

[0200] Regardless of when the information about the program used to provide access to the physical dataset is identified, the dataset multiplexer 105 can provide information about that program. In the example of Figure 7, the dataset multiplexer 105 maintains information sufficient to associate a logical dataset with a program used to access the physical dataset corresponding to that logical dataset. For example, this information may be stored as a dataset catalog table object for the logical dataset. In some embodiments, this information may be acquired from or provided by the dataset multiplexer 105 at application runtime or at design / build time. Doing so at design / build time avoids adding time overhead to runtime operations and / or creating a dependence on runtime operations.

[0201] In the example of Figure 7, the information is shown as stored in two relations, with the identifier of the physical dataset used as the key that links information 702, 704, and 706 together. First, information 702 associates the logical ID of each logical dataset with the identifier of the physical dataset that currently stores the data corresponding to that logical dataset. Second, information 704 provides the relationship between the physical dataset and the programs that can be used to access it.

[0202] In the example of Figure 7, information 702 links the logical dataset “abbott.customers” to the physical dataset identified by the identifier “123”. The program with the path “common20 / abbott / customers / DSG” is associated with the physical dataset identified by “123” via information 704.

[0203] Similarly, the logical dataset "abbott.preferred-cust" is associated with the physical dataset ID "247" via information 702. Furthermore, the program at the path "common10 / abbott / preferred-cust / DSG" is associated with physical dataset 247 via information 704.
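The two relations of Figure 7 amount to a two-step lookup keyed on the physical dataset identifier. A minimal sketch, using the identifiers and paths given above (written here without spaces around the slashes), with the dictionary and function names being hypothetical:

```python
# Information 702: logical dataset id -> identifier of the physical dataset
# that currently stores its data.
INFO_702 = {
    "abbott.customers": "123",
    "abbott.preferred-cust": "247",
}

# Information 704: physical dataset identifier -> path of the program (DSG)
# used to access it.
INFO_704 = {
    "123": "common20/abbott/customers/DSG",
    "247": "common10/abbott/preferred-cust/DSG",
}

def program_for(logical_id: str) -> str:
    """Resolve a logical dataset to its access program via the physical id."""
    physical_id = INFO_702[logical_id]
    return INFO_704[physical_id]

print(program_for("abbott.customers"))
print(program_for("abbott.preferred-cust"))
```

Because the application holds only the logical identifier, re-pointing an entry in either relation changes which program runs without touching the application.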

[0204] The dataset multiplexer can maintain similar information, such as in a dataset catalog table object, for each logical dataset for which a corresponding physical dataset has been registered. Alternatively or additionally, some or all of this information can be maintained by the metadata management module 526 or other modules within the data processing system. Regardless of how the information is maintained, the dataset multiplexer 105 can provide information about the programs used to access the physical dataset corresponding to the logical dataset.

[0205] In the example of Figure 7, the identified program with the path "common20 / abbott / customers / DSG", along with the information used to invoke it, is stored as DSG 715, replacing the specified input data storage device 710. DSG 715 can be referred to as "Read DSG", which reads data from the physical dataset corresponding to the input logical dataset "abbott.customers". Similarly, the program with the path "common10 / abbott / preferred-cust / DSG", along with the information used to invoke it, is stored as DSG 725, replacing the specified output data storage device 720. DSG 725 can be referred to as "Write DSG", which writes data to the physical dataset corresponding to the output logical dataset "abbott.preferred-cust".

[0206] Information indicating the program to be executed within the application can be stored along with the program instructions constituting the application. When the application is written as a data flow graph and the program for accessing the data source is written as a subgraph, these subgraphs can be dynamically linked to the data flow graph at appropriate locations for execution. These locations can correspond to input and / or output nodes of the data flow graph. During or just before the execution of the data flow graph, link or path information of the subgraphs exposed by or obtained from the catalog table service interface 522 can be provided to the input and / or output nodes, and the corresponding subgraphs can be linked and / or stored in place of the input and / or output nodes. Example techniques for dynamically linking subgraphs to a data flow graph via a subgraph interface are described in U.S. Patent 10,180,821, entitled “Managing Interfaces for Sub-Graphs,” the entire contents of which are incorporated herein by reference, and can be used here. However, alternatively or additionally, other methods of storing information to execute the program can be used.
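As a rough illustration of linking subgraphs in place of input and output nodes, the sketch below models a data flow graph as a list of steps, with placeholder nodes that name a logical dataset. This is not the technique of the referenced patent; the graph representation, the `link` and `run` helpers, and the in-memory `store` are hypothetical simplifications of the preferred-customer example.

```python
# In-memory stand-in for the physical datasets.
store = {
    "customers": [{"name": "A", "preferred": True}, {"name": "B", "preferred": False}],
    "preferred": [],
}

def read_customers(_):
    """Stand-in for the "Read DSG"."""
    return list(store["customers"])

def write_preferred(records):
    """Stand-in for the "Write DSG"."""
    store["preferred"].extend(records)
    return records

def keep_preferred(records):
    return [r for r in records if r["preferred"]]

# Subgraphs available for linking, keyed by logical dataset id.
SUBGRAPHS = {
    "abbott.customers": read_customers,
    "abbott.preferred-cust": write_preferred,
}

# The application graph refers to logical datasets only.
graph = [
    {"logical_dataset": "abbott.customers"},
    keep_preferred,
    {"logical_dataset": "abbott.preferred-cust"},
]

def link(graph, subgraphs):
    """Replace each placeholder node with the subgraph for its logical dataset."""
    return [subgraphs[n["logical_dataset"]] if isinstance(n, dict) else n for n in graph]

def run(linked):
    data = None
    for step in linked:
        data = step(data)
    return data

run(link(graph, SUBGRAPHS))
print(store["preferred"])
```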

[0207] When application 106-2 is executed and an operation is encountered that accesses a logical dataset associated with input data storage device 710, the linked DSG 715 can be invoked. Invoking DSG 715 causes its access and transformation logic to be executed. During execution, input data storage device 710 can be accessed, and data from the input data storage device and / or its corresponding physical dataset can be read and converted into the format of the logical dataset. Invoking the DSG may require providing parameters to a controller module (not shown) within the data processing system.

[0208] In the example of Figure 7, the parameters provided for executing DSG 715 are shown as parameters 730. In this example, one of parameters 730 identifies the DSG, for example by providing its path. The value of this parameter can be stored when the input data source 710 is configured for a specific logical dataset.

[0209] Other parameters 730 can be provided so that they can be supplied by the controller module to the DSG 715 for execution. These runtime parameters (i.e., those provided at runtime) may affect the execution of the DSG. For example, values of the parameters “Param1” and “Param2” can be provided to the DSG at runtime. The value of such a parameter can, for example, specify that the DSG 715 should execute in a specific read mode (single record, batch, fast, shared, etc.). As another example, parameter values can reflect the access priority of the application.

[0210] The values of these runtime parameters can be obtained in one or more ways. For example, they can be encoded in application 106-2 based on input provided by the user during application development. For instance, parameter values can be obtained from information entered when configuring input data source 710 in the development environment. As another example, alternatively or additionally, parameter values can be obtained from other user input during application development or in response to prompts during execution. As yet another example, the application can identify parameter values during runtime from various inputs, such as external inputs indicating the time of day or current system load, or other inputs that depend on data provided as inputs to the data flow graph.

[0211] As yet another example, alternatively or additionally, parameter values may be obtained from other modules. Specifically, the values of at least some of the parameters 730 may be read from, or obtained by processing information in, a metadata repository that stores information about the logical dataset associated with the input data storage device 710. As yet another example, the values of at least some of the parameters 730 may be read from, or obtained by processing information in, an access control module that maintains information about users, and may reflect access priorities or mechanisms for the data storage device that are set based on the role of the user developing or executing the application.
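One way to combine parameter values from these several sources is a layered lookup, sketched below in Python. The function name, the parameter keys, and especially the precedence order (runtime inputs over access control, over the metadata repository, over values encoded in the application) are assumptions for illustration; the text does not prescribe an order.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def resolve_parameters(
    app_encoded: Mapping[str, Any],
    metadata_repo: Mapping[str, Any],
    access_control: Mapping[str, Any],
    runtime_inputs: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge parameter sources; maps listed first in the ChainMap take precedence."""
    return dict(ChainMap(runtime_inputs, access_control, metadata_repo, app_encoded))


params = resolve_parameters(
    app_encoded={"program_path": "/programs/input_710", "read_mode": "batch"},
    metadata_repo={"logical_id": "customers"},
    access_control={"priority": "high"},     # e.g. derived from the user's role
    runtime_inputs={"read_mode": "fast"},    # e.g. chosen from current system load
)
print(params)
```

Here the read mode encoded at development time is overridden by the value identified at runtime, while the other values pass through unchanged.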

[0212] The values of other parameters among the input data source parameters 730 can be included so that the controller module or other components of the data processing system can capture operational metadata. For example, the logical identifier of the dataset for which access is encoded can be stored in this way. Similarly, the identifier of the accessed physical dataset can be stored. The value of this parameter can be provided by the dataset multiplexer, such as from the current information 702 at execution time. Capturing such information allows the operational metadata module 528 (Figure 5A) to, for example, provide additional facets that support searching for data.

[0213] In the example of Figure 7, dataset multiplexer 105 is shown with information 706 collected during the execution of application 106-2. For example, information 706 may include the date the dataset was accessed, the size of the dataset at the time of access and / or the amount of data read from or written to the dataset, and the host ID of the computer hardware involved in the data access (e.g., hardware that executes the application or the access program, or that physically stores the data). Other portions of information 706 may indicate the logical dataset associated with output data storage device 720, the accessed physical dataset, parameter values provided to the program when the physical dataset was accessed (e.g., “Param1” and “Param2”), and / or other information. Such entries may be stored for each access to the dataset, for several previous accesses to the dataset, or for a predetermined time after accessing the dataset. This information may be analyzed after execution to determine other operational parameters, such as the frequency or recency / freshness of dataset usage.
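The post-execution analysis of such access records can be sketched as follows. The `AccessRecord` fields mirror the kinds of information listed above, but the record layout, the identifiers, and the two helper functions are hypothetical and are shown only to make frequency and recency concrete.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AccessRecord:
    """One operational-metadata entry per dataset access (illustrative layout)."""
    logical_id: str
    physical_id: str
    accessed_on: date
    bytes_read: int
    host_id: str


def access_count(log: List[AccessRecord], logical_id: str) -> int:
    """Frequency: how many recorded accesses touched the logical dataset."""
    return sum(1 for r in log if r.logical_id == logical_id)


def last_access(log: List[AccessRecord], logical_id: str) -> Optional[date]:
    """Recency: date of the most recent access, or None if never accessed."""
    return max((r.accessed_on for r in log if r.logical_id == logical_id), default=None)


log = [
    AccessRecord("customers", "ora:CUST", date(2024, 3, 1), 4096, "host-a"),
    AccessRecord("orders", "file:orders.csv", date(2024, 3, 5), 1024, "host-b"),
    AccessRecord("customers", "ora:CUST", date(2024, 3, 9), 8192, "host-a"),
]
print(access_count(log, "customers"), last_access(log, "customers"))
```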

[0214] Similar information can be stored for the output data storage device 720. When performing an operation to access the logical dataset associated with the output data storage device 720, the linked DSG 725 can be invoked. Invoking DSG 725 causes its access and transformation logic to be executed. During execution, the output data storage device 720 can be accessed, and data can be written to the output data storage device after being converted from the format of the logical dataset to the format of the output data storage device and / or the format of the corresponding physical dataset of the output data storage device. Parameters 740 represent the parameters whose values are provided to the controller module and can be utilized by DSG 725 during execution. Although not shown in Figure 7, entries can be created in the repository of operational metadata based on access to the physical dataset corresponding to the output data storage device 720.

[0215] Methods for registering representations of datasets in a dataset catalog table

[0216] Figure 8 is a flowchart illustrating a process 800 for registering a physical dataset in a dataset catalog table, enabling access to the physical dataset from an application configured to access a logical dataset corresponding to that physical dataset. Process 800 can be executed by a data processing system 104, for example by the dataset multiplexer 105 described with reference to Figures 1A to 1C. Alternatively or additionally, process 800 may include other actions, including those described elsewhere herein in conjunction with other embodiments.

[0217] Process 800 may commence 801 in response to a detected event. This event could be an indication that the dataset catalog has no catalog entry providing access to a physical dataset in the IT system that corresponds to a logical dataset defined in the data processing system. Such an indication could be, for example, a user-input command instructing the data processing system to register a physical dataset as corresponding to a logical dataset. The detected event could also be an automatic detection of a physical dataset in the IT system that does not yet have a catalog entry. Alternatively, the event could be an indication that a catalog entry in the dataset catalog providing access to a physical dataset in the IT system is outdated. However, other events, including those described herein, can trigger the execution of process 800. For example, as part of running a periodic (weekly, bi-weekly, etc.) import feed, a new physical dataset can be identified in the data storage device. This identification can trigger the execution of process 800.

[0218] Process 800 may proceed to action 802, during which information about a physical dataset stored in a data storage device is obtained. The physical dataset may be the same as the one mentioned above in the context of the start 801 of process 800. In some embodiments, the obtained information may include physical identifiers associated with the data storage device and / or physical dataset, references to the storage location of the data storage device and / or physical dataset, the type of data storage device, the record format or schema of the data storage device and / or physical dataset, and / or other information (e.g., the information described in the context of Figure 4).

[0219] At action 804, a logical-to-physical layer mapping can be generated for the physical dataset and the corresponding logical dataset. In some embodiments, the dataset multiplexer 105 can generate a mapping between one or more fields of the logical dataset and one or more fields of the physical dataset representing the same information. This mapping can be generated using information from various sources, including information available within the data processing system, user input, and / or information obtained through semantic discovery. For example, a field in the physical dataset where most entries include the characters “@” and “.” can be associated with a field in the logical dataset named “email”. This relationship can be obtained through semantic discovery and used to generate the mapping. Similar relationships between fields can be specified through user input or otherwise. The mapping between the logical dataset and the physical dataset can be generated by applying these relationships. In some embodiments, information about unique keys and / or foreign keys regarding relationships between the specified datasets can be used to generate the mapping.
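The "email" example of semantic discovery can be sketched in a few lines of Python. This is only an illustration of the heuristic described above: the function names, the sample field names `COL_A` and `COL_B`, and the reading of "most entries" as strictly more than half are assumptions.

```python
from typing import Dict, List, Optional


def looks_like_email(value: object) -> bool:
    """Heuristic from the example: the value contains both '@' and '.'."""
    text = str(value)
    return "@" in text and "." in text


def discover_email_field(rows: List[Dict[str, object]]) -> Optional[str]:
    """Return the first physical field in which most values look like emails."""
    if not rows:
        return None
    for name in rows[0]:
        hits = sum(1 for row in rows if looks_like_email(row.get(name, "")))
        if hits * 2 > len(rows):          # "most entries": strictly more than half
            return name
    return None


rows = [
    {"COL_A": "Ada", "COL_B": "ada@example.com"},
    {"COL_A": "Bob", "COL_B": "bob@example.org"},
    {"COL_A": "Cy", "COL_B": "n/a"},
]
# Logical field name -> physical field name.
mapping = {"email": discover_email_field(rows)}
print(mapping)
```

A real discovery step would likely combine several such heuristics with user input and key information, as the paragraph notes; this sketch covers the single email rule.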

[0220] Using these relationships, programs for accessing physical datasets can be configured to perform any necessary mappings between fields in the physical and logical datasets. A template for the program can be selected and then configured to implement the mapping, thereby providing access to and conversion of data formats. To obtain the template, at action 806, the type of data storage device can be determined based on the information obtained at action 802. At action 808, it can be determined whether a program template is available for that type of data storage device. Many data storage devices may have consistent access paradigms that can be captured in a template. Therefore, a data processing system can store a template library for widely used data storage device types, such as Oracle databases or SQL Server databases.

[0221] In response to the determination that a program template is available, the process proceeds to action 810, where an available program template is selected, and then proceeds to action 812, where a program is generated based on the selected program template. The generated program enables access to the target physical dataset and applies the mapping generated in action 804 to convert between the data format of the logical dataset and the data format of the physical dataset.

[0222] At action 812, a program for accessing the physical dataset from the data storage device is generated. This program can be generated by the following steps: at action 812a, a selected program template is populated based on one or more first parameters; and at action 812b, information about one or more other parameters is obtained.

[0223] At action 812a, the selected program template can be populated by identifying the value of the first parameter of the program template based on the information obtained in action 802 (such as information automatically discovered during the registration process).

[0224] At action 812b, information about one or more other parameters of the program template can be obtained. These parameters can specify the method of accessing the physical dataset. For example, some information can be obtained from a metadata repository that maintains the metadata of the data storage device. As another example, some information can be obtained via user input. For example, the user can specify information about the access type or security-related information. User input regarding other parameters can be obtained during the registration process.
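Actions 812a and 812b can be pictured as filling a template from two groups of parameters. The sketch below uses Python's standard `string.Template`; the template text, the parameter names (`table`, `columns`, `access_mode`), and the `generate_program` helper are hypothetical, and a real program template would contain access and transformation logic rather than a single query string.

```python
from string import Template
from typing import Dict

# Hypothetical program template for a table-like data storage device.
QUERY_TEMPLATE = Template("SELECT $columns FROM $table /* access_mode=$access_mode */")


def generate_program(
    template: Template, discovered: Dict[str, str], supplied: Dict[str, str]
) -> str:
    """Fill the template from discovered (812a) and supplied (812b) parameters."""
    return template.substitute({**discovered, **supplied})


# 812a: values identified from information obtained during registration.
discovered = {"table": "CUSTOMERS", "columns": "CUST_NM, EMAIL_ADDR"}
# 812b: values from a metadata repository or user input.
supplied = {"access_mode": "batch"}

program = generate_program(QUERY_TEMPLATE, discovered, supplied)
print(program)
```

`Template.substitute` raises `KeyError` when a placeholder has no value, which is one simple way to notice that a required parameter was never obtained.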

[0225] In some embodiments, in response to determining at action 808 that a program template is unavailable, the process proceeds to action 820, where a program structure for generating the program is created. In some embodiments, the program structure can be created by prompting the user for input. For example, the user can provide a file containing the program structure and / or parameter values. Next, at action 822, a program for accessing a physical dataset from a data storage device can be generated based on the program structure input by the user.
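The branch at action 808 between a library template and a user-provided program structure can be sketched as a simple lookup with a fallback. The library keys, the template names, and the `choose_program_source` function are hypothetical placeholders, not names taken from the system described.

```python
from typing import Dict, Optional, Tuple

# Hypothetical template library keyed by data storage device type.
TEMPLATE_LIBRARY: Dict[str, str] = {
    "oracle": "oracle_table_reader",
    "sqlserver": "sqlserver_table_reader",
}


def choose_program_source(
    storage_type: str,
    library: Dict[str, str],
    user_structure: Optional[str] = None,
) -> Tuple[str, str]:
    """Action 808: use a library template if one exists, else a user-provided structure."""
    template = library.get(storage_type)
    if template is not None:
        return ("template", template)              # continue with actions 810 and 812
    if user_structure is None:
        raise LookupError(
            f"no template for {storage_type!r}; a program structure must be provided"
        )
    return ("user_structure", user_structure)      # continue with actions 820 and 822


print(choose_program_source("oracle", TEMPLATE_LIBRARY))
print(choose_program_source("legacy_flat_file", TEMPLATE_LIBRARY, "custom_reader.cfg"))
```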

[0226] It will be understood that actions 802, 804, 806, 808, 810, 812, 820, and 822 can be performed to generate a program for accessing different physical datasets in a data storage device or to generate a program for accessing physical datasets in different data storage devices, without departing from the scope of this disclosure. For example, a first program can be generated for accessing a first physical dataset in a data storage device, and a second program can be generated for accessing a second physical dataset in a data storage device. As another example, a first program can be generated for accessing a first physical dataset in a first data storage device, and a second program can be generated for accessing a second physical dataset in a second data storage device different from the first data storage device.

[0227] Once the program is generated, information for invoking program execution from within an application programmed based on the logical dataset is stored in an object in the dataset catalog table 107. The stored information may include a physical identifier of the data storage device or the physical dataset stored in the data storage device, a logical identifier of the logical dataset, parameter values to be used during program execution, and / or other information. In some embodiments, the object may be or include the program.
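A catalog object holding this information might look like the following Python sketch. The `CatalogEntry` fields correspond to the items listed in the paragraph, while the class names, the choice to key entries by logical identifier, and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class CatalogEntry:
    """Illustrative catalog object: what is needed to invoke the access program."""
    logical_id: str                      # logical identifier of the logical dataset
    physical_id: str                     # physical identifier of the device or dataset
    program_path: str                    # where the generated program can be found
    parameters: Dict[str, Any] = field(default_factory=dict)


class DatasetCatalog:
    """Minimal in-memory catalog keyed by logical identifier (an assumption)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.logical_id] = entry

    def lookup(self, logical_id: str) -> CatalogEntry:
        return self._entries[logical_id]


catalog = DatasetCatalog()
catalog.register(
    CatalogEntry("customers", "ora:CUST", "/programs/input_710", {"read_mode": "batch"})
)
entry = catalog.lookup("customers")
print(entry.physical_id, entry.program_path, entry.parameters)
```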

[0228] Therefore, the program generated at action 812 or 822 can be used by an application that accesses the logical dataset corresponding to the physical dataset. Thus, at action 814, which may optionally be executed at any time after registration (or not at all), the program generated at action 812 or 822 is linked to one or more applications. This link enables an application programmed according to the logical dataset to access the physical dataset using the generated program. When performing an operation to access the logical dataset, the linked program is executed to provide access to the physical dataset corresponding to the logical dataset.

[0229] Regardless of whether the generated program is linked to an application that accesses the logical dataset, at action 816, it is determined whether an event indicating a change in the storage of data corresponding to the logical dataset has been detected. For example, this change could indicate a migration from a first data storage device to a second data storage device, a change in the format of the logical dataset, or a change in the format of the physical dataset. In response to detecting such an event, the process loops back to action 802, where it can be repeated. Repeating the process can result in the generation of a new program for accessing the physical dataset corresponding to the logical dataset or an update of an existing program for accessing the physical dataset corresponding to the logical dataset. However, the link to this program can be the same, ensuring that any application configured with that link for accessing the data corresponding to the logical dataset continues to operate on the correct data.
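The point that the link stays the same while the program behind it changes can be shown with a small Python sketch. Everything here is hypothetical: the `Catalog` class, the "customers" logical identifier, and the two stores with different physical field names stand in for a migration from a first data storage device to a second.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Program = Callable[[], List[Record]]


class Catalog:
    """Maps a stable link (the logical identifier) to the current access program."""

    def __init__(self) -> None:
        self._programs: Dict[str, Program] = {}

    def register(self, logical_id: str, program: Program) -> None:
        self._programs[logical_id] = program      # re-registering replaces the program

    def read(self, logical_id: str) -> List[Record]:
        return self._programs[logical_id]()


def application(catalog: Catalog) -> List[str]:
    """Written only against the logical dataset; never modified below."""
    return sorted(r["email"] for r in catalog.read("customers"))


old_store = [{"EMAIL_ADDR": "ada@example.com"}]   # first data storage device
new_store = [{"mail": "ada@example.com"}]         # second device, different format

catalog = Catalog()
catalog.register("customers", lambda: [{"email": r["EMAIL_ADDR"]} for r in old_store])
before = application(catalog)

# Change event detected: repeat registration and generate a new program.
catalog.register("customers", lambda: [{"email": r["mail"]} for r in new_store])
after = application(catalog)
print(before == after)
```

The application function is identical before and after the migration; only the catalog's program for the logical dataset was regenerated.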

[0230] In some embodiments, in response to determining at action 816 that no change event has been detected, process 800 continues to monitor for change events, such that the program for accessing the physical dataset corresponding to the logical dataset for which access information has been generated will continue to operate as expected.

[0231] Additional implementation details

[0232] Figure 9 shows an example of a suitable computing system environment 900 on which the techniques described herein can be implemented. The computing system environment 900 is merely one example of a suitable computing environment and is not intended to impose any limitation on the scope of use or functionality of the techniques described herein. Nor should the computing environment 900 be construed as having any dependencies or requirements relating to any one or combination of the components shown in the exemplary operating environment 900.

[0233] The techniques described herein are operational with many other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and / or configurations that may be suitable for use with the techniques described herein include, but are not limited to, personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the systems or devices described above.

[0234] A computing environment can execute computer-executable instructions, such as program modules. Typically, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. The techniques described herein can also be implemented in distributed computing environments, where tasks are performed by remote processing devices connected via communication networks. In distributed computing environments, program modules can reside in both local and remote computer storage media, including memory storage devices.

[0235] Referring to Figure 9, an exemplary system for implementing the techniques described herein includes a general-purpose computing device in the form of a computer 910. Components of the computer 910 may include, but are not limited to, a processing unit 920, a system memory 930, and a system bus 921 that couples various system components, including the system memory, to the processing unit 920. The system bus 921 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus (also known as a mezzanine bus).

[0236] Computer 910 typically includes a variety of computer-readable media. Computer-readable media can be any available medium accessible to computer 910, and includes volatile and non-volatile, removable and non-removable media. By way of example, and not limitation, computer-readable media can include computer storage media and communication media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage devices, magnetic tape cassettes, magnetic tape, disk storage devices or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to computer 910. Communication media typically embodies computer-readable instructions, data structures, program modules, or other data in the form of modulated data signals, such as carrier waves or other transmission mechanisms, and includes any information delivery medium. The term "modulated data signal" refers to a signal in which one or more of its characteristics are set or altered in such a way as to encode information in the signal. By way of example and not limitation, communication media include wired media (such as wired networks or direct wired connections) and wireless media (such as acoustic, RF, infrared, and other wireless media). Any combination of the above should also be included within the scope of computer-readable media.

[0237] System memory 930 includes computer storage media in the form of volatile and / or non-volatile memory, such as read-only memory (ROM) 931 and random access memory (RAM) 932. A basic input / output system 933 (BIOS), containing the basic routines that help transfer information between elements within computer 910, such as during startup, is typically stored in ROM 931. RAM 932 typically contains data and / or program modules that are immediately accessible to and / or currently being operated on by processing unit 920. By way of example and not limitation, Figure 9 shows operating system 934, application programs 935, other program modules 936, and program data 937.

[0238] Computer 910 may also include other removable / non-removable, volatile / non-volatile computer storage media. By way of example only, Figure 9 illustrates a hard disk drive 941 that reads from or writes to non-removable, non-volatile magnetic media, a flash drive 951 that reads from or writes to removable, non-volatile memory 952 (such as flash memory), and an optical disk drive 955 that reads from or writes to a removable, non-volatile optical disk 956 (such as a CD-ROM or other optical media). Other removable / non-removable, volatile / non-volatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid-state RAM, solid-state ROM, etc. The hard disk drive 941 is typically connected to the system bus 921 via a non-removable memory interface (such as interface 940), and the flash drive 951 and optical disk drive 955 are typically connected to the system bus 921 via a removable memory interface (such as interface 950).

[0239] The drives and their associated computer storage media, described above and shown in Figure 9, provide the computer 910 with storage for computer-readable instructions, data structures, program modules, and other data. In Figure 9, for example, hard disk drive 941 is shown as storing operating system 944, application programs 945, other program modules 946, and program data 947. Note that these components may be the same as or different from operating system 934, application programs 935, other program modules 936, and program data 937. Different reference numerals are given to operating system 944, application programs 945, other program modules 946, and program data 947 to show that they are at least different copies. An actor can input commands and information into computer 910 through input devices such as keyboard 962 and pointing device 961 (typically a mouse, trackball, or touchpad). Other input devices (not shown) may include a microphone, joystick, game controller, satellite dish, scanner, etc. These and other input devices are typically connected to processing unit 920 via user input interface 960 coupled to the system bus, but may be connected via other interfaces and bus structures (such as parallel ports, game ports, or Universal Serial Bus (USB)). Monitor 991 or another type of display device is also connected to system bus 921 via an interface (such as video interface 990). In addition to the monitor, the computer may also include other peripheral output devices, such as speakers 997 and printer 996, that can be connected via the peripheral output interface 995.

[0240] Computer 910 can operate in a networked environment using logical connections to one or more remote computers (such as remote computer 980). Remote computer 980 can be a personal computer, server, router, network PC, peer device, or other common network node, and typically includes many or all of the elements described above with respect to computer 910, although only a memory storage device 981 is shown in Figure 9. The logical connections depicted in Figure 9 include a local area network (LAN) 971 and a wide area network (WAN) 973, but may also include other networks. Such networking environments are common in offices, enterprise-wide computer networks, intranets, and the Internet.

[0241] When used in a LAN networking environment, computer 910 connects to LAN 971 via a network interface or adapter 970. When used in a WAN networking environment, computer 910 typically includes a modem 972 or other means for establishing communication over a WAN 973 (such as the Internet). Modem 972 may be built-in or external and may be connected to system bus 921 via actor input interface 960 or other suitable mechanism. In a networking environment, program modules or portions thereof depicted with respect to computer 910 may be stored in a remote memory storage device. By way of example and not limitation, Figure 9 shows remote application 985 as residing on memory device 981. It should be understood that the network connection shown is exemplary and other means of establishing a communication link between computers can be used.

[0242] The techniques described herein can be implemented in any of a variety of ways, as they are not limited to any particular implementation. The examples of implementation details provided herein are for illustrative purposes only. Furthermore, the techniques disclosed herein can be used alone or in any suitable combination, as aspects of the techniques described herein are not limited to the use of any particular technique or combination of techniques.

[0243] Therefore, several aspects of the technology described herein have been described, and it should be understood that various changes, modifications, and improvements are possible.

[0244] For example, a user is described as writing an application that accesses specific logical data. In some embodiments, the user can be a human user. In other embodiments, the user can be a program with artificial intelligence (AI). For example, the AI can derive data processing algorithms by processing a dataset, and those algorithms can then be applied to other datasets.

[0245] As another example, information 702, 704, and 706 is described as being stored in separate tables. However, this information can be stored in a single table or combined within any suitable data structure.

[0246] Such changes, modifications, and improvements are intended to be part of this disclosure and are intended to fall within the spirit and scope of this disclosure. Furthermore, while advantages of the techniques described herein are indicated, it should be understood that not every embodiment of the techniques described herein will include every described advantage. Some embodiments may not implement any of the features described herein as advantageous, and in some cases, one or more of the described features may be implemented to achieve further embodiments. Therefore, the foregoing description and figures are by way of example only.

[0247] The above aspects of the technology described herein can be implemented in any of a variety of ways. For example, embodiments can be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or set of processors, whether the software code is provided in a single computer or distributed across multiple computers. Such a processor can be implemented as an integrated circuit (where one or more processors are included in the integrated circuit assembly), including commercially available integrated circuit assemblies known in the art, such as CPU chips, GPU chips, microprocessors, microcontrollers, or coprocessors. Alternatively, the processor can be implemented as a custom circuit system (such as an ASIC) or a semi-custom circuit system resulting from configuring programmable logic devices. As yet another alternative, the processor can be part of a larger circuit or semiconductor device, whether commercially available, semi-custom, or custom. As a specific example, some commercially available microprocessors have multiple cores, such that one or a subset of these cores can constitute a processor. However, the processor can be implemented using any suitable form of circuit system.

[0248] Furthermore, it should be understood that a computer can be embodied in any of a variety of forms, such as a rack-mount computer, desktop computer, laptop computer, or tablet computer. Additionally, a computer can be embedded in a device that is not typically considered a computer but has suitable processing power, including a personal digital assistant (PDA), smartphone, or any other suitable portable or stationary electronic device.

[0249] Furthermore, a computer may have one or more input and output devices. These devices can be used, in particular, to present a user interface. Examples of output devices that can be used to provide a user interface include a printer or display screen for visual presentation of output, and a speaker or other sound-generating device for auditory presentation of output. Examples of input devices that can be used for a user interface include keyboards and pointing devices, such as mice, touchpads, and digitizers. As another example, a computer may receive input information through speech recognition or in other audible formats.

[0250] Such computers can be interconnected by one or more networks of any suitable form, including local area networks (LANs) or wide area networks (WANs), such as enterprise networks or the Internet. These networks can be based on any suitable technology and can operate according to any suitable protocol, and can include wireless networks, wired networks, or fiber optic networks.

[0251] Furthermore, the various methods or processes outlined herein can be encoded as software executable on one or more processors that employ any of a variety of operating systems or platforms. Additionally, such software can be written using any of a variety of suitable programming languages and / or programming or scripting tools, and can also be compiled into executable machine language code or intermediate code that executes on a framework or virtual machine.

[0252] In this regard, aspects of the technology described herein can be embodied in a computer-readable storage medium (or multiple computer-readable media) (e.g., computer memory, one or more floppy disks, compact discs (CDs), optical discs, digital video discs (DVDs), magnetic tape, flash memory, circuit configurations in field-programmable gate arrays or other semiconductor devices, or other tangible computer storage media) encoded with one or more programs that, when executed on one or more computers or other processors, perform the methods implementing the various embodiments described above. As will be clear from the foregoing examples, a computer-readable storage medium can retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer-readable storage medium can be transportable, such that one or more programs stored thereon can be loaded onto one or more different computers or other processors to implement the aspects of the technology described above. As used herein, the term "computer-readable storage medium" covers only non-transitory computer-readable media that can be regarded as an article of manufacture (i.e., a manufactured article) or a machine. Alternatively or additionally, aspects of the technology described herein can be embodied in computer-readable media other than computer-readable storage media, such as propagating signals.

[0253] The terms “program” or “software” are used herein in a general sense to refer to any type of computer code or computer-executable instruction set or processor-executable instructions that can be used to program a computer or other processor to implement the various aspects of the techniques described above. Furthermore, it should be understood that, according to one aspect of this embodiment, one or more computer programs that perform the methods of the invention when executed do not necessarily reside on a single computer or processor, but can be distributed in a modular manner across multiple different computers or processors to implement the various aspects of the techniques described herein.

[0254] Computer-executable instructions can take many forms, such as program modules, that are executed by one or more computers or other devices. Typically, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. Typically, the functionality of program modules can be combined or allocated as needed in different embodiments.

[0255] Furthermore, data structures can be stored in any suitable form on a computer-readable medium. For simplicity, a data structure can be shown as having fields related by their position within the data structure. Such relationships can also be implemented by assigning positions in a computer-readable medium that convey the relationships between fields to the storage used for those fields. However, any suitable mechanism can be used to establish relationships between information within the fields of a data structure, including the use of pointers, labels, or other mechanisms that establish relationships between data elements.

[0256] The various aspects of the technology described herein can be used individually, in combination, or in a variety of arrangements not precisely described in the foregoing embodiments, and therefore are not limited in their application to the details and arrangements of the components set forth in the foregoing description or shown in the accompanying drawings. For example, an aspect described in one embodiment can be combined in any way with aspects described in other embodiments.

[0257] Furthermore, the techniques described herein can be embodied as methods, examples of which are provided herein, including with reference to Figure 8. The actions performed as part of any of these methods can be ordered in any suitable manner. Therefore, embodiments can be constructed in which actions are performed in a different order than shown, which may include performing some actions simultaneously, even if these actions are shown as consecutive actions in the illustrative embodiments.

[0258] Furthermore, some actions are described as being performed by an “actor” or “user.” It should be understood that an “actor” or “user” does not necessarily have to be a single individual, and in some embodiments, actions attributable to an “actor” or “user” may be performed by a combination of individuals, teams, and / or individuals with computer-aided tools or other entities.

[0259] The use of ordinal terms such as "first," "second," and "third" to modify a claim element does not by itself imply any priority, precedence, or order of one claim element over another, or any temporal order in which the actions of a method are performed. Rather, such terms serve only as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term).

[0260] Furthermore, the wording and terminology used herein are for descriptive purposes and should not be considered restrictive. The terms “including,” “comprising,” “having,” “containing,” “involving,” and their variations, as used herein, are intended to cover all items listed thereafter and their equivalents, as well as any additional items.

Claims

1. A method executed by a data processing system for enabling efficient data analysis in a dynamic environment with multiple datasets by generating entries in a dataset catalog table and/or using entries in the dataset catalog table to access physical datasets in data storage devices, wherein the data processing system is configured to execute a data processing application programmed to access logical datasets, each logical dataset comprising a data schema independent of the format of the corresponding data in a physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to physical datasets in the data storage devices, the method comprising: creating multiple entries in the dataset catalog table, each of which is associated with a logical dataset and a physical dataset and has associated computer-executable instructions for accessing the physical dataset; receiving a first input that at least partially identifies a first logical dataset to be accessed in performing an operation within the data processing application that specifies access to a dataset; when performing the operation within the data processing application, invoking the computer-executable instructions to access the physical dataset associated with a first entry in the dataset catalog table that is associated with the first logical dataset; in response to an event indicating a change in the physical dataset associated with the logical dataset, updating one or more of the multiple entries in the dataset catalog table, wherein updating the one or more entries includes updating the first entry in the dataset catalog table that is associated with the first logical dataset; and after updating the first entry in the dataset catalog table that is associated with the first logical dataset: receiving a second input that at least partially identifies the first logical dataset to be accessed in performing an operation within the data processing application; and when performing the operation within the data processing application, invoking the computer-executable instructions to access the physical dataset associated with the updated first entry in the dataset catalog table that is associated with the first logical dataset.
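By way of non-limiting illustration, the sequence recited in claim 1 (create entries, access through an entry, update the entry on a storage change, access again) can be sketched in Python as follows. The class names `CatalogEntry` and `DatasetCatalog`, the method names, and the sample locations are hypothetical; a plain callable stands in for the computer-executable instructions associated with an entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class CatalogEntry:
    logical_name: str                    # identifies the logical dataset
    physical_location: str               # identifies the physical dataset
    access: Callable[[], List[Record]]   # stand-in for the access instructions


class DatasetCatalog:
    """Hypothetical in-memory dataset catalog table keyed by logical name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def create_entry(self, entry: CatalogEntry) -> None:
        self._entries[entry.logical_name] = entry

    def read(self, logical_name: str) -> List[Record]:
        # The application names only the logical dataset; the entry decides
        # which physical dataset is actually read.
        return self._entries[logical_name].access()

    def on_storage_changed(self, logical_name: str, new_location: str,
                           new_access: Callable[[], List[Record]]) -> None:
        # Event handler: replace the entry without touching the application.
        self._entries[logical_name] = CatalogEntry(
            logical_name, new_location, new_access)


catalog = DatasetCatalog()
catalog.create_entry(CatalogEntry(
    "customers", "file:/data/customers.csv",
    lambda: [{"id": 1, "source": "file"}]))

first_read = catalog.read("customers")      # first input, first access

catalog.on_storage_changed(
    "customers", "db:warehouse.customers",
    lambda: [{"id": 1, "source": "db"}])

second_read = catalog.read("customers")     # second input, updated entry
```

The application-side call `catalog.read("customers")` is identical before and after the update; only the catalog entry changes.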

2. The method of claim 1, wherein creating the multiple entries in the dataset catalog table includes: receiving information relating to a first physical dataset stored in a first data storage device, wherein the first physical dataset corresponds to the first logical dataset; generating, based on the information relating to the first physical dataset, a first program that includes computer-executable instructions for accessing the first physical dataset from the first data storage device; and storing, in the first entry in the dataset catalog table, a link to the first program so that the data processing application can access the first physical dataset using the first program.

3. The method of claim 2, wherein generating the first program for accessing the first physical dataset from the first data storage device includes: identifying the type of the first data storage device from the received information; selecting a first program template for the type of the first data storage device; and generating the first program by populating the first program template with one or more values of one or more parameters of the first program template.
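By way of non-limiting illustration, the three steps of claim 3 (identify the device type, select a template for that type, populate the template's parameters) can be sketched in Python with the standard `string.Template` class. The template texts, the device type names, and the function names `read_csv` and `run_query` inside the templates are hypothetical placeholders, and the generated "program" is represented simply as text.

```python
from string import Template

# Hypothetical program templates, one per data storage device type.
TEMPLATES = {
    "csv_file": Template("read_csv(path='$path', delimiter='$delimiter')"),
    "sql_table": Template("run_query('SELECT * FROM $table', dsn='$dsn')"),
}


def generate_program(info: dict) -> str:
    """Generate access-program text from registration information."""
    device_type = info["device_type"]          # identify the device type
    template = TEMPLATES[device_type]          # select the matching template
    # Populate the template; substitute() raises KeyError if a parameter
    # required by the template is missing.
    return template.substitute(info["parameters"])


csv_program = generate_program({
    "device_type": "csv_file",
    "parameters": {"path": "/data/customers.csv", "delimiter": ","},
})
sql_program = generate_program({
    "device_type": "sql_table",
    "parameters": {"table": "customers", "dsn": "warehouse"},
})
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: an incompletely populated template fails at generation time rather than at access time.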

4. The method according to any one of claims 1 to 3, wherein receiving input that at least partially identifies the first logical dataset includes: providing a user interface through which a user can at least partially identify the first logical dataset.

5. The method according to any one of claims 1 to 3, wherein invoking the computer-executable instructions includes: enabling access to the first entry in the dataset catalog table that is associated with the first logical dataset; and enabling, based on the information in the first entry, access to a data storage device that stores the physical dataset corresponding to the first logical dataset.

6. The method according to any one of claims 1 to 3, wherein updating one or more of the multiple entries in the dataset catalog table includes: detecting an event indicating a change associated with the physical dataset corresponding to the first logical dataset; and based on detecting the event, modifying the first entry in the dataset catalog table that is associated with the first logical dataset.

7. The method of claim 6, wherein modifying the first entry in the dataset catalog table includes: modifying the computer-executable instructions used to access the physical dataset corresponding to the first logical dataset.

8. A method executed by a data processing system for achieving efficient data analysis in a dynamic environment with multiple datasets by registering datasets in a dataset catalog table to facilitate access to multiple physical datasets in data storage devices, wherein the data processing system is capable of operating with multiple physical datasets stored in the data storage devices and includes a dataset multiplexer configured to provide an application programmed to access a logical dataset with access to a physical dataset among the multiple physical datasets, the physical datasets being stored in the data storage devices and corresponding to logical datasets, which include a data schema independent of the format of the corresponding data in the physical datasets, the method comprising: receiving information relating to a first physical dataset, among the multiple physical datasets, stored in a first data storage device among the multiple data storage devices, wherein the first physical dataset corresponds to a first logical dataset; generating, based on the information relating to the first physical dataset, a first program that includes computer-executable instructions for accessing the first physical dataset from the first data storage device; storing a link to the first program in the first object in the object library so that the application programmed to access the logical dataset can use the first program to access the first physical dataset; based on detecting an event indicating a change associated with the first physical dataset, determining whether to modify the first program used to access the first physical dataset; and based on a determination to modify the first program: generating a modified first program; and replacing the first program with the modified first program as the target of the link.

9. The method of claim 8, wherein, Generating the modified first program involves generating the modified first program without modifying the application or the first logical dataset.

10. The method of claim 8 or 9, wherein, Information associated with the first physical dataset includes information about the type of the first data storage device.

11. The method of claim 8 or 9, wherein: The dataset multiplexer includes an object library that stores information for accessing the multiple physical datasets, and a first object in the object library includes an identifier for the first physical dataset.

12. The method of claim 11, wherein: The dataset multiplexer further includes an API, and the method further includes enabling the application to access the first object through the API.

13. The method of claim 11, wherein the method further includes: assigning identifiers to objects in the object library based on the schema and logical name of the corresponding logical dataset for which information is stored in the objects.

14. The method of claim 11, wherein the method further includes: receiving a command to register the first physical dataset in the dataset catalog table; and based on the received command, generating the first object and storing it in the object library.

15. The method of claim 11, wherein: The identifier for this first physical dataset is a physical identifier.

16. The method of claim 15, wherein: The first object further includes a second identifier, and the second identifier is a logical identifier of a logical dataset associated with the first object.

17. The method of claim 16, wherein, The method further includes: In response to detecting an event indicating that the first physical dataset has been changed from being stored in the first data storage device to being stored in the second data storage device, the physical identifier in the first object is modified without modifying the logical identifier.
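By way of non-limiting illustration, the object of claims 15 to 17, which carries both a physical identifier and a logical identifier, can be sketched in Python as follows. The class name `CatalogObject`, the event dictionary layout, and the identifier strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogObject:
    logical_id: str    # what the application refers to
    physical_id: str   # where the data currently resides


def handle_move_event(obj: CatalogObject, event: dict) -> None:
    """Apply a hypothetical 'dataset moved' event to a catalog object.

    Only the physical identifier is rewritten; the logical identifier, and
    therefore every application that uses it, is left untouched.
    """
    if event["kind"] == "moved" and event["from"] == obj.physical_id:
        obj.physical_id = event["to"]


obj = CatalogObject(logical_id="sales.customers",
                    physical_id="device1:/data/customers.dat")
handle_move_event(obj, {
    "kind": "moved",
    "from": "device1:/data/customers.dat",
    "to": "device2:/archive/customers.dat",
})
```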

18. The method of claim 11, wherein: the first object includes values of parameters accessed during execution of the first program; and the method further includes modifying, based on an event indicating a change in the value of a parameter accessed in the first program, the value of the parameter stored in the first object.

19. The method of claim 8 or 9, wherein, The first program includes access logic and conversion logic, and when the application is executed, the access logic and conversion logic of the first program are executed to provide access to the first physical dataset and to convert between the format used in the first physical dataset and the format used in the first logical dataset.
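By way of non-limiting illustration, a program that combines access logic with conversion logic, as recited in claim 19, can be sketched in Python as follows. The physical format (CSV text with columns `cust_no`, `full_nm`, `bal_cents`) and the logical schema (`id`, `name`, `balance`) are assumptions made for the example; an in-memory string stands in for the data storage device.

```python
import csv
import io

# Stand-in for the physical dataset as stored (hypothetical column names).
PHYSICAL_CSV = "cust_no,full_nm,bal_cents\n1,Ada,1250\n2,Lin,300\n"


def access_logic(raw_text: str) -> list:
    """Read rows in the physical format (all values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def conversion_logic(row: dict) -> dict:
    """Convert one physical row into the logical dataset's schema."""
    return {
        "id": int(row["cust_no"]),
        "name": row["full_nm"],
        "balance": int(row["bal_cents"]) / 100,  # cents to currency units
    }


def first_program() -> list:
    """Access the physical dataset and return records in logical form."""
    return [conversion_logic(row) for row in access_logic(PHYSICAL_CSV)]


logical_records = first_program()
```

The application consumes `logical_records` only; if the physical column names or units changed, only `conversion_logic` would need to be regenerated.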

20. The method of claim 8 or 9, wherein the first program includes one or more parameters that affect the operation of the first program, such that the values of the one or more parameters affect access to the first physical dataset via the first program.

21. The method of claim 20, wherein the application is configured to provide the values of the one or more parameters for use when the first program is invoked.

22. The method of claim 8 or 9, wherein, The method further includes generating the first program by: Detect the type of the first data storage device; and Select a template from multiple templates based on the detected type.

23. The method of claim 22, wherein, The first program includes a first part configured for read access to the first data storage device and a second part configured for write access to the first data storage device.

24. The method of claim 8 or 9, wherein, The first program is configured to include an executable data flow graph containing logic for accessing the first physical dataset.

25. A method executed by a data processing system for enabling an application to access multiple physical datasets in multiple data storage devices by using entries in a dataset catalog table, thereby achieving efficient data analysis in a dynamic environment with multiple datasets, wherein, The data processing system is capable of operating with the application and the multiple physical datasets stored in the multiple data storage devices, and the application is programmed to access logical datasets that include data schemas independent of the format of the corresponding data in the physical datasets. The method includes: Provide a user interface through which the user can at least partially identify the logical dataset used for access in the application; When executing the application, and when performing operations involving access to the identified logical dataset: Access objects in the object repository that are associated with the logical dataset; Based on this object, access is provided to the data storage device that stores the physical dataset corresponding to the identified logical dataset, wherein: The information in this object includes an executable program for accessing the physical dataset, or the object itself is an executable program for accessing the physical dataset; and Accessing the data storage device includes using the executable program to access the data storage device storing the physical dataset; and Based on events associated with the storage of data corresponding to the identified logical dataset, update the information in the object or the object itself so that subsequent accesses to the data storage device can be made using the updated information in the object or the updated object.

26. The method of claim 25, wherein, The executable program used to access the physical dataset encodes the logic for converting data between the format used within the physical dataset and the format used within the logical dataset.

27. The method of claim 25, wherein, The information in this object includes the type of the data storage device.

28. The method of claim 25, wherein the information in the object includes the record format or schema associated with the physical dataset.

29. The method of claim 25, wherein, The information in the object includes one or more parameters specifying how to access the physical dataset, including at least one parameter indicating whether the data in the physical dataset is compressed.

30. The method of claim 25, wherein, The information in the object includes one or more parameters that specify how to access the physical dataset, including at least one parameter that indicates the type of access.

31. The method of claim 30, wherein, The type of access includes an indication of whether it is a read access or a write access.

32. The method of claim 30, wherein, The type of access includes an indication of whether access is via a fast connection or a slow connection.

33. The method of claim 25, wherein: The data processing system includes a repository of metadata related to logical datasets; and The user interface includes a menu that presents a logical dataset based on the metadata in the repository.

34. The method of claim 25, wherein: The information in the object includes one or more parameters specifying how to access the physical dataset, including at least one parameter indicating whether the data in the physical dataset is encrypted or specifying criteria for filtering operations on the data.
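By way of non-limiting illustration, access parameters of the kind recited in claims 29 to 34 (whether the data is compressed, the type of access, and filter criteria) can be sketched in Python as follows. The class name `AccessParameters`, its field names, and the use of gzip as the compression scheme are assumptions made for the example; only read access is implemented.

```python
import gzip
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AccessParameters:
    compressed: bool = False                            # is stored data compressed?
    access_type: str = "read"                           # "read" or "write"
    row_filter: Optional[Callable[[str], bool]] = None  # filter criterion


def read_lines(stored: bytes, params: AccessParameters) -> List[str]:
    """Read text lines from stored bytes according to the parameters."""
    if params.access_type != "read":
        raise ValueError("this sketch only implements read access")
    raw = gzip.decompress(stored) if params.compressed else stored
    lines = raw.decode("utf-8").splitlines()
    if params.row_filter is not None:
        lines = [line for line in lines if params.row_filter(line)]
    return lines


stored_bytes = gzip.compress(b"alpha\nbeta\ngamma\n")
all_lines = read_lines(stored_bytes, AccessParameters(compressed=True))
filtered = read_lines(
    stored_bytes,
    AccessParameters(compressed=True,
                     row_filter=lambda line: line.startswith("g")),
)
```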

35. A method executed by a data processing system for enabling efficient data analysis in a dynamic environment with multiple datasets by generating entries in a dataset catalog table to access physical datasets in a data storage device, wherein the data processing system is configured to execute a data processing application programmed to access logical datasets, each logical dataset comprising a data schema independent of the format of the corresponding data in the physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to physical datasets in a data storage device, the method comprising: receiving information relating to a first physical dataset stored in a first data storage device, wherein the application is programmed to access a first logical dataset, and wherein the first physical dataset corresponds to the first logical dataset; generating, based on the received information, a first program for accessing the first physical dataset from the first data storage device, wherein generating the first program includes: identifying the type of the first data storage device from the received information; selecting a first program template for the type of the first data storage device; and populating the first program template with one or more values of one or more parameters of the first program template to generate the first program; storing, in an object, information for invoking execution of the first program from within the application programmed to access the first logical dataset; and in response to an event indicating a change in the first physical dataset associated with the first logical dataset, updating the information in the object for invoking execution of the first program.

36. The method of claim 35, wherein populating the first program template includes automatically discovering one or more values of one or more first parameters of the first program template based on the information relating to the first physical dataset.

37. The method of claim 36, wherein the one or more first parameters include information about the record format or schema associated with the first physical dataset.

38. The method according to any one of claims 35 to 37, wherein storing, in the object, the information for invoking execution of the first program from within the application programmed to access the first logical dataset includes storing an identifier of the first data storage device.

39. The method according to any one of claims 35 to 37, wherein storing, in the object, the information for invoking execution of the first program from within the application programmed to access the first logical dataset includes storing a logical identifier of the first logical dataset.

40. The method of claim 36, wherein generating the first program further includes: obtaining information about one or more second parameters of the first program template, wherein the one or more second parameters are different from the one or more first parameters.

41. The method of claim 40, wherein, The one or more second parameters specify the method for accessing the first physical dataset.

42. The method according to any one of claims 35 to 37, wherein generating the first program further includes: determining whether a program template is available for the type of the first data storage device; and based on determining that a program template is available for the type of the first data storage device, selecting the available template as the first program template.

43. The method of claim 42, further comprising, based on determining that no program template is available for the type of the first data storage device: creating a program structure based on user input; and generating, based on the created program structure, the first program for accessing the first data storage device.

44. The method of any one of claims 35 to 37, further comprising: receiving information relating to a second physical dataset stored in a second data storage device among the data storage devices; and generating, based on the information relating to the second physical dataset, a second program for accessing the second physical dataset from the second data storage device.

45. The method according to any one of claims 35 to 37, wherein: the data processing system is configured to run in multiple environments, each of which includes an instance of the data processing system; and the object is assigned a unique identifier within the scope of each of the multiple environments and includes at least a portion common to the multiple environments.

46. A method executed by a data processing system for enabling efficient analysis in a dynamic environment with multiple datasets by updating entries in a dataset catalog table to facilitate access to physical datasets in data storage devices, wherein the data processing system is configured to execute a data processing application programmed to access data represented as logical datasets, each logical dataset including a data schema independent of the format of the corresponding data in a physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to physical datasets in the data storage devices, the method comprising: receiving information relating to a first physical dataset that corresponds to a first logical dataset and is stored in a first data storage device; generating, based on the received information, a first program for accessing the first physical dataset from the first data storage device, wherein the entry associated with the first logical dataset includes the first program or a link to the first program; detecting an event indicating a change associated with a physical dataset corresponding to the first logical dataset, wherein the physical dataset is the first physical dataset; and based on detecting the event, modifying the first program used to access the first physical dataset corresponding to the first logical dataset.

47. The method of claim 46, wherein, The event indicating a change associated with the physical dataset includes an event indicating a change from a first data storage device storing the first physical dataset to a second data storage device, and the method further includes: In response to detecting an event indicating a change from the first data storage device to the second data storage device, the first program is modified to access the first physical dataset from the second data storage device.

48. The method of claim 46 or 47, wherein the event indicating a change associated with the physical dataset includes an event indicating a change in the value of a parameter used in generating the first program for accessing the first physical dataset.

49. The method of claim 46 or 47, wherein: the event indicating a change associated with the physical dataset includes an event indicating replacement of the first physical dataset with a second physical dataset corresponding to the first logical dataset; and modifying the first program used to access the physical dataset includes replacing the first program with a second program for accessing the second physical dataset.

50. The method of claim 46 or 47, wherein: the data processing system is configured to invoke the first program to perform an operation within an application that specifies access to the first logical dataset; the data processing system is configured to run in multiple environments, wherein a first environment includes a first instance of the data processing system and a second environment includes a second instance of the data processing system; the first data storage device and the first program are associated with the first instance of the data processing system; and the method further includes generating a second program for performing, within the second instance of the data processing system, the operation within the application that specifies access to the first logical dataset.

51. The method of claim 50, further comprising: executing, in the second environment, the application that specifies access to the first logical dataset, and invoking the second program so as to access the second physical dataset in response to the application performing an operation on the first logical dataset.

52. A method executed by a data processing system for enabling an application to access multiple physical datasets in multiple data storage devices by using entries in a dataset catalog table, thereby achieving efficient data analysis in a dynamic environment with multiple datasets, wherein the data processing system is configured to execute a data processing application programmed to access logical datasets, each logical dataset including a data schema independent of the format of the corresponding data in a physical dataset, and the data processing system includes a dataset multiplexer configurable to provide the application with access to multiple physical datasets in multiple data storage devices, the method comprising: performing an operation, specified within the application, to access a logical dataset by: accessing the dataset catalog table to select an object associated with the logical dataset; and invoking, based on the selected object, a program configured to access a data source that stores the physical dataset corresponding to the logical dataset; dynamically updating objects in the dataset catalog table in response to events indicating changes in the physical storage of the logical datasets represented by the objects in the dataset catalog table; and subsequently performing the operation, specified within the application, to access the logical dataset by accessing the dataset catalog table to select the updated object associated with the logical dataset.

53. The method of claim 52, wherein, Dynamically updating objects within the dataset catalog table includes modifying programs configured to access data sources that store the physical datasets corresponding to the logical dataset.

54. A data processing system, comprising: at least one computer hardware processor; and at least one non-transitory computer-readable medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform the method as described in any one of claims 1 to 53.

55. At least one non-transitory computer-readable medium comprising processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform the method as described in any one of claims 1 to 53.

56. A computer program product comprising computer instructions that, when executed by a processor, perform the steps of the method according to any one of claims 1 to 53.