Apparatus, system, and method for generating and using a queryable index in a cyber data model to enhance network security
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- BLUEVOYANT LLC
- Filing Date
- 2023-06-23
- Publication Date
- 2026-06-26
AI Technical Summary
Conventional indexing schemes for cyber event data in large-scale databases can be overwhelmed by the number of records, which degrades read speed and query time. Compensating with more processing resources and memory is prohibitively expensive for petabyte-scale data.
A method for indexing cyber event data involves reformulating data into a common intermediate format, generating unique hash values, and creating rowKey indexes based on asset identifiers and timestamps, storing data in a rowKey database, and mapping these to the original dataset to enable efficient querying.
This approach allows for rapid retrieval of cyber event data within a certain time frame, supporting high throughput and efficient querying of billions of records while maintaining low costs, enabling scalable and consistent cloud-based SIEM implementations.
Abstract
Description
Technical Field
[0001] (Cross-Reference to Related Applications) This application claims the benefit and priority under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 63/366,903, filed on June 23, 2022, entitled DEVICES, SYSTEMS, AND METHOD FOR GENERATING AND USING A QUERYABLE INDEX IN A CYBER DATA MODEL TO ENHANCE NETWORK SECURITY, the disclosure of which is hereby incorporated by reference in its entirety.
[0002] This disclosure generally relates to network security and, more specifically, to improved systems and methods for processing and indexing files from a continuously updated database. In conventional indexing schemes, the system can be overwhelmed by the number of records, which can negatively impact read speed, memory database size, and query time. Some of these problems can be overcome by a significant increase in processing resources and memory space, but for petabyte-scale data, this solution can be prohibitively expensive.
Summary of the Invention
[0003] The following summary is provided to facilitate an understanding of some of the innovative features particular to the aspects disclosed herein and is not intended to be a complete description. A full understanding of the various aspects can be obtained by taking the entire specification, claims, and abstract.
[0004] In one aspect of the present disclosure, a method for indexing cyber event data in an extensible database for continuous queries is disclosed. The method includes receiving, by a processor, cyber event data from one or more data sources; reformulating, by the processor, the cyber event data into a common intermediate format consisting of accessible attributes including a timestamp of event occurrence and one or more asset identifiers; generating, by the processor, a hash value unique to each cyber event; generating, by the processor, one or more rowKey indexes based on the asset identifiers and the timestamps of the events corresponding to each of the cyber event hashes; storing, by the processor, the reformulated cyber event data in row entries of one or more rowKey databases, wherein the one or more rowKey databases are organized according to consecutive rowKeys; mapping, by the processor, the row entries in the rowKey database to the original dataset; receiving, by the processor, a rowKey query based on parameters of the rowKey field; and returning, by the processor, cyber event data based on the rowKey query, wherein the query result is returned within a fixed time.
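The reformulating and hashing steps of the method above can be illustrated with a minimal sketch. The source names, field names, and the `SOURCE_SCHEMAS` table are hypothetical, and SHA-256 over a canonical JSON encoding is one assumed way to obtain a practically unique per-event value; the disclosure does not name a specific hash function.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical per-source schemas: which raw field holds the event time and
# which raw fields identify assets. These names are illustrative assumptions.
SOURCE_SCHEMAS = {
    "dns_feed": {"time_field": "time", "asset_fields": ["client_ip", "qname"]},
    "fw_log": {"time_field": "event_time", "asset_fields": ["src_ip"]},
}


def normalize(raw: dict, source: str) -> dict:
    """Reformulate a source-specific record into one common intermediate shape."""
    schema = SOURCE_SCHEMAS[source]
    # Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC.
    occurred = datetime.fromisoformat(raw[schema["time_field"]]).astimezone(timezone.utc)
    return {
        "source": source,
        "timestamp": int(occurred.timestamp()),  # epoch seconds
        "asset_ids": [raw[name] for name in schema["asset_fields"]],
        "payload": raw,
    }


def event_hash(event: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not change the digest."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = {"time": "2023-06-23T12:00:00+00:00", "client_ip": "192.0.2.10", "qname": "example.com"}
event = normalize(raw, "dns_feed")
digest = event_hash(event)
```

Because the digest is computed over the normalized record rather than the raw bytes, two sources that report the same fields in a different order would map to the same value under this sketch.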
Brief Description of the Drawings
[0005]
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
[0006] Corresponding reference numerals indicate corresponding parts throughout the several views. The embodiments described herein illustrate various aspects of the invention in one form, and such embodiments should not be construed as limiting the scope of the invention in any way.
DETAILED DESCRIPTION OF THE INVENTION
[0007] The applicant of the present application owns the following patent applications, the entire disclosure of each of which is incorporated herein by reference:
- International Patent Application No. PCT/US2022/072739, filed on June 3, 2022, entitled "DEVICES, SYSTEMS, AND METHODS FOR ENHANCING SECURITY INFORMATION & EVENT MANAGEMENT UPDATES FOR MULTIPLE TENANTS BASED ON CORRELATED, AND SYNERGISTIC DEPLOYMENT NEEDS"
- International Patent Application No. PCT/US2022/072743, filed on June 3, 2022, entitled "DEVICES, SYSTEMS, AND METHODS FOR STANDARDIZING & STREAMLINING THE DEPLOYMENT OF SECURITY INFORMATION & EVENT MANAGEMENT ARTIFACTS FOR MULTIPLE TENANTS"
- International Patent Application No. PCT/US2022/082167, filed on December 21, 2022, entitled "DEVICES, SYSTEMS, AND METHODS FOR PROVISIONING AND UPDATING SECURITY INFORMATION & EVENT MANAGEMENT ARTIFACTS FOR MULTIPLE TENANTS"
- International Patent Application No. PCT/US2022/082173, filed on December 21, 2022, entitled "DEVICES, SYSTEMS, AND METHODS FOR STREAMLINING AND STANDARDIZING THE INGEST OF SECURITY DATA ACROSS MULTIPLE TENANTS"
- International Patent Application No. PCT/US2023/061069, filed on January 23, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR REMOTELY MANAGING ANOTHER ORGANIZATION’S SECURITY ORCHESTRATION, AUTOMATION, AND RESPONSE"
- International Patent Application No. PCT/US2023/062894, filed on February 20, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR IDENTIFYING CYBER ASSETS AND GENERATING CYBER RISK MITIGATION ACTION BASED ON DOMAIN REDIRECTS"
- International Patent Application No. PCT/US2023/021736, filed on May 10, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR SUMMARIZING ANALYTIC OBSERVATIONS"
- International Patent Application No. PCT/US2023/022858, filed on May 19, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR INGESTING & ENRICHING SECURITY INFORMATION TO AUTONOMOUSLY SECURE A PLURALITY OF TENANT NETWORKS"
- International Patent Application No. PCT/US2023/022535, filed on May 17, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR IDENTIFYING CYBER ASSETS AND GENERATING CYBER RISK MITIGATION ACTIONS BASED ON A DEMOCRATIC MATCHING ALGORITHM"
- International Patent Application No. PCT/US2023/024386, filed on June 4, 2023, entitled "DEVICES, METHODS, AND SYSTEMS FOR GENERATING A HIGHLY-SCALABLE, EFFICIENT COMPOSITE RECORD INDEX"
- International Patent Application No. PCT/US2023/068590, filed on June 16, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR CATEGORIZING, PRIORITIZING, AND MITIGATING CYBER SECURITY RISKS"
- U.S. Provisional Patent Application No. 63/368,567, filed on July 17, 2022, entitled "DEVICES, SYSTEMS, AND METHODS FOR UTILIZING A NETWORKED, COMPUTER-ASSISTED, THREAT HUNTING PLATFORM TO ENHANCE NETWORK SECURITY"
- U.S. Provisional Patent Application No. 63/369,582, filed on July 27, 2022, entitled "AUTONOMOUS THREAT SCORING AND SECURITY ENHANCEMENT"
- U.S. Provisional Patent Application No. 63/377,304, filed on September 27, 2022, entitled "DEVICES, SYSTEMS, AND METHODS FOR CONTINUOUSLY ENHANCING THE IMPLEMENTATION OF CODE CHANGES VIA ENRICHED PIPELINES"
- U.S. Provisional Patent Application No. 63/507,250, filed on June 9, 2023, entitled "DEVICES, SYSTEMS, AND METHODS FOR ATTRIBUTING NETWORK-IMPLEMENTED CYBER ASSETS TO OPERATING ENTITIES AND GENERATING CYBER RISK MITIGATION ACTIONS BASED ON THE ATTRIBUTION"
[0008] Numerous specific details are described in this disclosure to provide a complete understanding of the overall structure, function, manufacture, and use of the aspects illustrated in the accompanying drawings. Well-known operations, components, and elements are not described in detail so as not to obscure the aspects described herein. The reader will understand that the aspects described and illustrated herein are non-limiting aspects. Thus, it will be understood that the specific structural and functional details disclosed herein may be representative and exemplary. Modifications and variations can be made without departing from the scope of the claims.
[0009] Before detailing the various aspects and methods of the systems disclosed herein, it should be noted that the exemplary aspects are not limited to the application or use in the details disclosed in the accompanying drawings and description. Of course, the exemplary aspects may be implemented or incorporated in other aspects, variations, and modifications and may be practiced or carried out in various ways. Further, unless otherwise indicated, the terms and expressions used herein are selected for the purpose of describing the exemplary aspects for the convenience of the reader and are not intended for limitation. For example, any reference herein to a particular manufacturer, software suite, application, or development platform is merely intended to illustrate some of the many aspects of the present disclosure. This includes any reference to trademarks. Thus, it should be understood that the devices, systems, and methods disclosed herein can be implemented to enhance any software updates according to any purpose of use and/or user preference.
[0010] As used herein, the term "server" may refer to or include one or more computing devices that are operated by, or that facilitate communication and processing for, multiple parties in a network environment, such as the Internet or any public or private network. As used herein, a reference to a "server" or "processor" may refer to the previously recited server and/or processor that performed a previous step or function, a different server and/or processor, and/or a combination of servers and/or processors.
[0011] As used herein, the term "network" refers to, or can include, an entire enterprise information technology (IT) system deployed by a tenant. For example, a network can include a group of two or more nodes (e.g., assets) that are connected by any physical and/or wireless connection and configured to communicate and share information with one or more other nodes. However, the term "network" is not limited to any particular node or any particular means of connecting those nodes. A network can be connected to Ethernet, an intranet, and/or an extranet, and its nodes can be configured to communicate with each other via ad hoc connections (e.g., Bluetooth®, Near Field Communication (NFC), etc.), local area connections ("LAN"), wireless local area networks ("WLAN"), and/or virtual private networks ("VPN") regardless of the physical location of each device. The network can include any combination of assets (e.g., devices, servers, desktop computers, laptop computers, personal digital assistants, mobile phones, wearables, smart home appliances, etc.). The network can further include any tools, applications, and/or services deployed by a device or otherwise utilized by an enterprise IT system, such as a firewall, an email client, a document management system, an office system, etc. In some non-limiting aspects, a "network" can include third-party devices, applications, and/or services that are owned and controlled by a third party but that the tenant is authorized to access in the enterprise IT system.
[0012] As used herein, the term "platform" may include software architecture, hardware architecture, and/or combinations thereof. A platform may include any of a stand-alone software product, a network architecture, and/or a software product configured to be integrated within a software architecture and/or a hardware architecture, as needed for the software product to provide its technical benefits. For example, a platform may include any combination of a chipset, a processor, a logic-based device, memory, storage, a graphical user interface, a graphics subsystem, an application, and/or a communication module (e.g., a transceiver). In other words, a platform can provide the resources necessary to enable the technical benefits provided by software. According to some non-limiting aspects, the technical benefits provided by the software are provided to the physical resources of the ecosystem or to other software employed by the physical resources within the ecosystem (e.g., APIs, services, etc.). According to other non-limiting aspects, a platform may include a framework for several software applications intended and designed to function together.
[0013] As used herein, the term "security monitoring platform" refers to, or may include, software configured to aggregate and analyze activities from many different resources across an information technology (IT) infrastructure. For example, a security monitoring platform may include a security information and event management (SIEM) platform and/or other types of platforms used for data monitoring and/or analysis (e.g., Splunk Enterprise Security, Microsoft Sentinel, Datadog Security Monitoring, ELK, etc.). The various aspects of the devices, systems, and methods disclosed herein related to SIEM may similarly be applicable to any type of security monitoring platform.
[0014] As used herein, the term "constant" may refer to one or more security information and event management (SIEM) functions that do not change during the issuance of an alert. For example, constants can include, among others, Azure Sentinel Log Analytics functions. According to some non-limiting aspects, constants may be specifically configured according to the preferences and/or requirements of individual clients. For example, as described herein, alert rules may be the same for all client deployments. However, the use of the devices, systems, and methods disclosed herein can "fine-tune" the method of managing alerts for each specific client using client-specific constants. In other words, each constant may include a whitelist of specific protocols, accounts, etc., which the alert rules manage separately (e.g., skip them).
[0015] As used herein, the term "entity" refers to, or may include, a corporation, business-related organization, non-profit organization, government agency, charity, educational institution, or any other type of organization or individual that owns or has a relationship with a collection of cyber assets. As used herein, references to "cyber assets" may refer to computing devices, networks, hardware, software, data, information, or any other type of information technology-related component, label, or identifier for switching, signaling, or routing, such as, for example, domains, Internet protocol (IP) addresses, or shared and/or dynamic assets. As used herein, the term "cyber data" may refer to information related to cyber assets owned by a monitored enterprise or entity of interest.
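A minimal sketch of the client-specific constants described above follows. The rule logic, field names, and the `ClientConstants` type are hypothetical and are not taken from Azure Sentinel or any other SIEM; the sketch only illustrates one shared alert rule whose behavior is tuned per client through allowlists.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClientConstants:
    """Per-client allowlists that a shared alert rule consults (hypothetical shape)."""
    allowed_protocols: frozenset = field(default_factory=frozenset)
    allowed_accounts: frozenset = field(default_factory=frozenset)


def should_alert(event: dict, constants: ClientConstants) -> bool:
    """Shared rule: alert on a failed login unless the client's constants skip it."""
    if event.get("outcome") != "failed_login":
        return False
    if event.get("protocol") in constants.allowed_protocols:
        return False
    if event.get("account") in constants.allowed_accounts:
        return False
    return True


# The same rule is deployed to both clients; only the constants differ.
client_a = ClientConstants(allowed_accounts=frozenset({"svc-backup"}))
client_b = ClientConstants()
event = {"outcome": "failed_login", "protocol": "ssh", "account": "svc-backup"}
alerts = {"client_a": should_alert(event, client_a), "client_b": should_alert(event, client_b)}
```

Keeping the rule body identical and moving the differences into data is what allows one rule definition to be reused across deployments in this sketch.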
[0016] Examples of commonly implemented SIEMs include Azure Sentinel, Salesforce Cloud, Devo, LogRhythm, IBM’s QRadar, Securonix, McAfee Enterprise Security Manager, LogPoint, Elastic Stack, ArcSight Enterprise Security Manager, InsightIDR, etc. As a cloud-based tool, Azure Sentinel is widely adopted among Managed Security Service Providers (MSSPs), and thus Azure Sentinel is described as a non-limiting example. However, other SIEMs are of course contemplated by the present disclosure. As with most SIEMs, deploying Azure Sentinel requires advanced skills, is time-consuming, and is highly error-prone. Each organization that requires a security solution has specific monitoring needs, such as ingested log sources, detection/alert rules, response automation, reports, and alerts. Microsoft (MSFT) is often used by MSSPs to manage multiple clients, but the complexity of the initial configuration, deployment, and ongoing maintenance of artifacts (e.g., resource groups, log analytics workspaces, alert rules, workbooks, playbooks, etc.) has increased significantly. This can result in high costs for both MSSPs, who have to employ more expensive specialists, and clients, who often bear at least a portion of the increasing costs. However, in many cases, there is overlap among some of the implementation needs of various clients.
[0017] For example, many organizations may require similar firewall monitoring solutions. In such cases, asset reuse and redeployment (and updates) can lead to significant cost savings and operational simplification. Unfortunately, known SIEM tools are technically unable to exploit such synergies. Therefore, from initial provisioning to incident response automation, MSSPs have limited opportunities for reuse to capture efficiencies across multiple clients. Thus, improved devices, systems, and methods for implementing SIEM and issuing SIEM client updates are needed. Such enhancements can improve the technical performance and cost efficiency of SIEM, including the deployment of detection rules, visualizations, investigation workbooks, and ongoing maintenance.
[0018] Aggregating cyber event data, including log data, event data, threat intelligence data, etc., from multiple platforms and providing the cyber event data to a system such as a SIEM platform can be beneficial for processing and capturing malicious behavior or drawing other meaningful conclusions. For example, cyber event data can be useful for collecting records from network devices, servers, domain controllers, and memory. The collected records or cyber event data can be stored, normalized, aggregated, and analyzed to discover trends, detect threats, and enable an organization to investigate alerts. Known SIEM tools (also referred to herein as SIEM detection engines) can provide some functionality, including monitoring events, collecting data, and issuing security alerts across the network, but such tools are typically customized to the implementing organization, i.e., more specifically, to a particular network architecture that can often be complex.
[0019] Specifically, with respect to the large-scale data aggregation needed to effectively protect a network, traditional tools cannot efficiently monitor and aggregate data. For example, to monitor and aggregate data across multiple tenant (or client) networks, an MSSP needs to receive approximately 2 million records (e.g., cyber data) per second, and needs to efficiently and cost-effectively store, retrieve, and analyze the records related to specific requesting IP addresses, answering IP addresses, queried domain names (e.g., Qnames), and subdomains queried over a time range of several months. Traditional tools therefore cannot monitor and tally the records necessary to identify malicious activities in a footprint of interest, and thus cannot effectively identify important security metrics, including, in particular, traffic of interest for security appliances, software vendors, and/or specific use cases.
[0020] In other words, traditional tools are technically unable to aggregate and/or manage high-throughput records because the nature and volume of those records require a "write speed" that exceeds the tools' rated performance. Similarly, traditional tools are technically unable to maintain an efficient query index, and the resulting amount of data is too large for a user to effectively and/or efficiently search for records of interest for a specific tenant network, especially when managing the security of multiple tenant networks.
[0021] As used herein, the terms "domain" and "domain name" refer to, or may include, a string that identifies, or is otherwise associated with, a network, computing device, or other resource that communicates with the Internet, such as a server, personal computer, website, or other service communicated over the Internet. In some aspects, as used herein, "domain" and "domain name" generally refer to what is described in Domain Names - Implementation and Specification (RFC 1035), Network Working Group (Nov. 1987), the disclosure of which is incorporated herein by reference.
[0022] Entities typically need to understand and manage cyber security risks. More specifically, an enterprise needs to understand and manage cyber security risks associated with its cyber assets. For example, an entity can have an Internet presence, i.e., a large number of cyber assets used for Internet-related communications. One or more of these cyber assets may be configured such that the entity is potentially exposed to cyber security risks. Cyber security risks can include unwanted or malicious attempts to gain access to an entity's network, data, and/or other information. Cyber security risks can also include, for example, maliciously denying the proper owner's use of a cyber asset, such as a denial-of-service attack or ransomware. Thus, in order to identify potential exposure to cyber security risks and take measures against such threats, an entity, and/or its risk assessors and auditors, need to identify cyber assets and how they are configured.
[0023] To further improve the management of cyber threats and other security risks, an entity also needs to identify and understand the cyber assets of other entities (hereinafter sometimes referred to as "target entities"). This need can arise because communication between entities can lead to the exposure of threats, or perhaps because an entity's cyber security risk can cause catastrophic service disruptions outside the realm of the Internet, potentially affecting partner entities. For example, a first entity (e.g., a "client entity") can use its cyber assets to communicate with the cyber assets of thousands of other target entities, such as various suppliers, vendors, partners, and third parties. If any of the target entities' cyber assets are vulnerable to cyber security risks, the client entity may also be exposed to risks by communicating with these assets. Therefore, an entity needs to not only identify and understand its own cyber assets, but also identify and understand the risks posed by the cyber assets of other target entities.
[0024] However, the large-scale identification of target entities and their cyber assets can be a complex, time-consuming, and resource-intensive process. This can be particularly difficult for a managed security service provider (MSSP) that deploys cloud-based security information and event management (SIEM) at scale, repeatedly and consistently, for a very large number of client networks, as disclosed in U.S. Provisional Patent Application No. 63/196,458, filed on June 3, 2021, entitled "DEVICES, SYSTEMS, AND METHODS FOR ENHANCING SECURITY INFORMATION & EVENT MANAGEMENT UPDATES FOR MULTIPLE TENANTS BASED ON CORRELATED, AND SYNERGISTIC DEPLOYMENT NEEDS", the disclosure of which is hereby incorporated by reference in its entirety.
[0025] Even if there is a comprehensive list of target entities and their cyber assets, it can be complex, time-consuming, and resource-intensive to determine which cyber assets are vulnerable to cybersecurity risks. For example, malicious actors are constantly trying to identify and exploit vulnerabilities associated with cyber assets. At the same time, cyber asset configurations can become outdated and more vulnerable to attacks (e.g., new security protocols, version updates, evolution of industry standards related to cybersecurity, etc.). Therefore, it may be necessary to continuously monitor millions of cyber assets across thousands of target entities for potential cybersecurity risks in order to identify these vulnerabilities and help protect client entities in a meaningful way.
[0026] Furthermore, simply identifying cybersecurity vulnerabilities associated with the cyber assets of a target entity may not be sufficient to meaningfully protect a client entity. A client entity is likely to be unable to realize the benefits of identifying and monitoring the cyber assets of a target entity unless measures are implemented to address the discovered cybersecurity vulnerabilities. However, considering the magnitude and diversity of cybersecurity risks that may exist in the cyber asset footprint of a particular target entity, it can be difficult to determine the order and urgency in which risks need to be addressed. For example, some cybersecurity risks may need to be addressed immediately to prevent potential attacks, while others may be of low urgency or low priority. Therefore, there is a need for improved devices, systems, and methods for organizing and reporting identified cybersecurity risks so that the target entity and its cyber asset footprint can be reliably identified, the cybersecurity risks associated with the target entity's cyber assets can be identified, and appropriate corrective measures can be implemented before the target entity's cyber assets are utilized.
[0027] Accordingly, there is a need for devices, systems, and methods for generating and utilizing queryable indexes within a cyber data model to enhance network security. Such devices, systems, and methods have a number of practical applications and provide a number of technical improvements over known tools, including efficient querying and processing of records (e.g., cyber data) for specific cyber assets owned by specific entities, across record volumes of tens of billions of records, in just a few seconds while maintaining low cost and high write throughput. Accordingly, such devices, systems, and methods can be used to scale cloud-based data aggregation repeatedly and consistently without degrading the quality of search results.
[0028] The present disclosure presents such devices, systems, and methods, all of which provide many technical benefits and, in one non-limiting aspect, enable a user to deploy cloud-based SIEM implementations, such as an implementation of Azure Sentinel, on a large scale, repeatedly, and consistently. For example, the devices, systems, and methods disclosed herein can provide (1) a partitioning scheme for record (e.g., passive DNS (pDNS)) files, (2) a streaming clustering algorithm for rapidly accumulating and emitting files using this scheme, (3) an efficient query index for those files implemented in Google Bigtable, and (4) an efficient algorithm for updating the query index when partitioned files are written. The resulting composite index can include the partitioned files and a separate index, thereby enabling a SIEM or other user to write two million records per second along with their associated index values and to query the resulting data for specific assets of interest, among billions of written pDNS records, within seconds.
[0029] A composite index can include a streaming distributed database that accumulates records from various sources. For example, a structured streaming job (e.g., Apache Spark) can be executed on a cloud-based platform (e.g., Google Cloud Dataproc) to continuously read and process a record stream from the composite index in small batches called micro-batches. Records can be grouped into each micro-batch by the first byte of the requested protocol, which improves the performance later in the pipeline. The records can then be written as files (e.g., Apache Avro) on a cloud-based storage platform (such as Google Cloud Storage). According to the present disclosure, the grouped, written, and stored records can function as the primary data storage layer of the pDNS database and can support a very high write throughput (e.g., 6 million records per second). Conventional MSSP devices, systems, and methods cannot technically automate this work, and it is impractical, if not impossible, for an MSSP to manually and continuously aggregate and manage millions of tenant networks in real time.
[0030] A SIEM can be implemented to aggregate data (e.g., log data, event data, threat intelligence data, etc.) from multiple platforms, analyze the data, and capture abnormal behavior or potential cyberattacks. The SIEM can collect security data from network devices, servers, domain controllers, etc. The SIEM can store, normalize, aggregate, and analyze the data to discover trends, detect threats, and enable an organization to investigate any alerts. Known SIEM tools (also referred to herein as SIEM detection engines) provide excellent functionality, including monitoring of events, collection of data, and issuance of security alerts across the network, but such tools are typically customized to the implementing organization, i.e., more specifically, to a particular network architecture that can often be complex.
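The micro-batching and first-byte grouping described above can be sketched in plain Python, without Apache Spark, as follows. Which field supplies the grouping byte is an assumption here: the hypothetical `qname` field stands in for whatever field the pipeline actually partitions on, and the batch size is arbitrary.

```python
from collections import defaultdict
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of at most batch_size records from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def group_by_first_byte(batch, key_field="qname"):
    """Group one micro-batch by the first byte of the (assumed) key field."""
    groups = defaultdict(list)
    for record in batch:
        # Slicing keeps an empty value safe: it groups under b"".
        first_byte = record[key_field].encode("utf-8")[:1]
        groups[first_byte].append(record)
    return dict(groups)


records = [
    {"qname": "example.com"},
    {"qname": "eff.org"},
    {"qname": "mail.example.net"},
    {"qname": "news.example.org"},
]
batches = list(micro_batches(records, batch_size=3))
first_groups = group_by_first_byte(batches[0])
```

Each group could then be written out as its own file, so that a later lookup for a given key only needs to open the files whose first byte matches.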
[0031] FIG. 1 shows a system 1000 configured for security information and event management (SIEM) implementation across multiple tenants, in accordance with at least one non-limiting aspect of the present disclosure. System 1000 may include a SIEM provider server 1002 comprising a memory 1004 and a processor 1006. In various aspects, the SIEM provider server 1002 may comprise computer system 5000 and its various components (e.g., the processor 1006 may be similar to the processor 5004 and the memory 1004 may be similar to the main memory 5006), which are described further with reference to FIG. 5.
[0032] In various aspects, memory 1004 may be configured to store instructions that, when executed by processor 1006, generate requests for data from multiple data sources 1010, 1020. Provider server 1002 receives petabytes of raw data from a client or third party. The data may include global Internet traffic, and the network security computing system may be interested in only a fraction of the overall data set. Upon receiving the raw data, the provider server or network security computing system aggregates, processes, indexes, and stores a copy of the data to create a queryable database. The index may be stored locally on provider server 1002 or on backend server 1030. Further, provider server 1002 may operate as a front end and obtain query results from backend server 1030.
[0033] As the data set continues to grow, it may take an inordinate amount of time and/or resources to query for a specified record. A database index can reduce the amount of time and/or resources necessary to query for a particular record by reducing the number of records processed when searching for the query results. However, constructing an appropriate and implementable index requires careful consideration of the data content, the queries to be made, and the write performance requirements. Since index construction and maintenance are required for each insertion into the database, write performance degrades as more complex indexes are created. This disclosure describes a data indexing scheme for a continuously updated data set that includes petabytes of cyber data and requires the completion of terabytes of writes daily. The data indexing scheme provides a database architecture that indexes and stores SIEM data so that query results are returned in a fixed amount of time.
[0034] Thus, system 1000 can be implemented to write individual records or cyber data directly into a distributed key-value database, such as Google Bigtable, which requires either replicating data between keys of different fields or writing out some kind of back pointer. In this way, system 1000 can replicate data between keys of different fields so that the same record can be found using different indexes. Conventional systems and methods require significantly more database nodes (up to four times as many) to maintain the write speed in a security operation center (SOC) environment.
[0035] FIG. 2 shows a high-level flow diagram of a data indexing scheme. The system receives 2002 data from one or more data sources and aggregates 2004 the data into a distributed database. Alternatively, several different jobs can write to the rowKey table, some being scheduled batch jobs and some being streaming jobs. The "structured streaming job" can also simply be omitted from the figure. The system reformats 2006 the data into a common extensible format and writes the data to the rows of a rowKey database. The system reads the fields of the rows of the rowKey database and generates 2008 an index based on the field strings from the rowKey database. The system writes 2010 the index to a rowKey database where the index can be queried. Thus, the index indicates where the cyber data is stored and read, and is (currently) also stored along with the data. In other words, the index is the location that enables writes until it is generated. The rowKey database receives 2012 queries from the front-end computing system and retrieves data based on the index, which indicates the location of each record.
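The two options mentioned above for finding one record through several indexed fields, replicating the data under each key or writing a back pointer, can be contrasted with a minimal in-memory sketch. The class names and record fields are hypothetical, and plain dictionaries stand in for a distributed key-value database.

```python
from collections import defaultdict


class ReplicatedStore:
    """Option A: write a full copy of the record under every index key."""

    def __init__(self):
        self.rows = defaultdict(list)  # (field, value) -> [record copies]

    def write(self, record, index_fields):
        for name in index_fields:
            self.rows[(name, record[name])].append(dict(record))

    def lookup(self, name, value):
        # One read returns the records themselves.
        return list(self.rows.get((name, value), []))

    def stored_copies(self):
        return sum(len(copies) for copies in self.rows.values())


class BackPointerStore:
    """Option B: store the record once and write only pointers under index keys."""

    def __init__(self):
        self.primary = {}                  # record id -> record
        self.pointers = defaultdict(list)  # (field, value) -> [record ids]

    def write(self, record, index_fields):
        self.primary[record["id"]] = dict(record)
        for name in index_fields:
            self.pointers[(name, record[name])].append(record["id"])

    def lookup(self, name, value):
        # Two hops: read the pointers, then follow each one to the record.
        return [self.primary[i] for i in self.pointers.get((name, value), [])]

    def stored_copies(self):
        return len(self.primary)


record = {"id": "e1", "src_ip": "192.0.2.10", "qname": "example.com"}
replicated, pointed = ReplicatedStore(), BackPointerStore()
for store in (replicated, pointed):
    store.write(record, index_fields=["src_ip", "qname"])
```

In this sketch, replication trades storage (one copy per indexed field) for a single-hop read, while back pointers store the record once at the cost of an extra read per result.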
[0036] In various aspects, the system obtains cyber event data from one or more third-party sources and aggregates the cyber event data into a single dataset. The data can be received or obtained daily, according to the type and rate of change of the cyber event data. In one aspect, the system correlates the cyber event data to dynamic assets and provides an accurate assessment of the cyber event data to the third-party source as soon as possible. In one aspect, the cyber event data may include asset behavior at a particular point in time that correlates with malicious behavior or exposes the security of the system to risk. In another aspect, the cyber event data may include asset information such as software version, firmware version, update history, etc. Due to the dynamic nature of the cyber event data, the data can become stale and out-of-date in a short period (e.g., a few days or weeks). Therefore, the dataset needs to be continuously updated so that the system can maintain a continuity chain for dynamic assets.
[0037] When the system creates a queryable index table, the dataset can be queried based on timestamps and / or assets of interest. In one aspect, an entity is defined by its footprint that includes multiple assets related at several specific points. All information related to the footprint of the entity can be queried according to IP address or domain and obtained with a fixed-time query of the entity assets. The main advantage of the data indexing scheme is that there is little change in the query response time as the amount of data or records in the dataset increases.
[0038] In this aspect, the network security computing system ingests data from multiple data sources and aggregates the data into a single distributed dataset. Next, the data is translated from its original format, such as JSON or CSV text, into a structured intermediate format specific to the schema of each data source. The structured record is either saved directly to the database in a serialized format such as JSON, or saved to a separate extensible server with a reference to its location stored in the database. In either case, the structured record or reference is written to a row in the rowKey database, where each rowKey includes an asset identifier field, an event timestamp field, and the globally unique identifier of the event. If multiple assets or timestamps are associated with a record or reference, it is saved once for each asset/timestamp combination, so that the same record or reference can be retrieved using the rowKey index of any associated asset/timestamp.
[0039] The first step in the indexing scheme is to associate cyber assets (such as clients or tenants) with cyber event data. The system analyzes the original data to extract the data from its original form and reformats it into a common structured format. The structured format includes three important common fields: a list of zero or more IP addresses, a list of zero or more domains, and a single timestamp.
[0040] For each data source, a mapping is created that explicitly specifies the relationship between the fields of the reformatted input and the original data. For example, a data source that includes a banner scan of IP addresses may have columns called "scanned_ip" that indicates the scanned IP address, "source_ip" that indicates the IP address from which the scan was performed, and "scan_time" that indicates the time at which the scan occurred. In this example, the mapping includes "scanned_ip" and "source_ip" as a list of IP addresses associated with the scan server event, and "scan_time" as a single timestamp. The scheme is required to map at least one IP address or domain from the original data and exactly one timestamp from the original data.
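The banner-scan mapping above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: only the column names "scanned_ip", "source_ip", and "scan_time" come from the text, while the dictionary layout, the function name to_common_format, and the assumption that the source timestamp is ISO 8601 text are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical mapping for the banner-scan source: which raw columns feed
# each common field. Only the column names come from the description.
BANNER_SCAN_MAPPING = {
    "ips": ["scanned_ip", "source_ip"],
    "domains": [],
    "timestamp": "scan_time",
}


def to_common_format(raw, mapping):
    """Reformat one raw record into the three common fields."""
    ips = [raw[c] for c in mapping["ips"] if raw.get(c)]
    domains = [raw[c] for c in mapping["domains"] if raw.get(c)]
    if not ips and not domains:
        raise ValueError("record must map at least one IP address or domain")
    ts_column = mapping["timestamp"]
    if not raw.get(ts_column):
        raise ValueError("record must map exactly one timestamp")
    # Assumption: the source timestamp is ISO 8601 text with a UTC offset.
    timestamp = datetime.fromisoformat(raw[ts_column]).astimezone(timezone.utc)
    return {"ips": ips, "domains": domains, "timestamp": timestamp}


record = to_common_format(
    {
        "scanned_ip": "203.0.113.7",
        "source_ip": "198.51.100.2",
        "scan_time": "2023-06-23T12:00:00+00:00",
    },
    BANNER_SCAN_MAPPING,
)
```

A record that maps neither an IP address nor a domain, or that lacks a timestamp, is rejected, which mirrors the requirement stated above.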
[0041] The reformatting scheme can be defined as an abstract Java class that explicitly specifies the mapping of common extensible fields. The Java class can be configured to pull data from the original data to the corresponding fields. Using Java object classes provides access to higher-level class types such as open source IP address classes. Additionally, Java objects allow for greater customization for serialization and deserialization of data for different contexts. For example, when writing data to a backend database for translation simplification and human readability, the data is serialized as JSON, but when the data needs to be processed at a high throughput rate, it may be advantageous to serialize the data with an optimized msgpack library. Finally, the Java object class can be used to define how to build an index.
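The class-based approach can be illustrated with a short sketch. The description names an abstract Java class; the Python analogue below is only an assumption-laden illustration, and the names CommonRecord, BannerScanRecord, and serialize are hypothetical. The encoder argument shows how one record class could be serialized as readable JSON in one context and with a more compact encoder in another (a msgpack encoder could be passed the same way, but none is used here).

```python
import json
from abc import ABC, abstractmethod


class CommonRecord(ABC):
    """Hypothetical base class; each data source supplies a subclass."""

    def __init__(self, raw):
        self.raw = raw

    @abstractmethod
    def ips(self):
        """IP addresses associated with the event (may be empty)."""

    @abstractmethod
    def domains(self):
        """Domains associated with the event (may be empty)."""

    @abstractmethod
    def timestamp(self):
        """The single timestamp of the event."""

    def to_dict(self):
        return {
            "ips": self.ips(),
            "domains": self.domains(),
            "timestamp": self.timestamp(),
        }

    def serialize(self, encoder=json.dumps):
        # The encoder is swappable so the same class serves both contexts.
        return encoder(self.to_dict())


class BannerScanRecord(CommonRecord):
    """Mapping for the banner-scan example source."""

    def ips(self):
        return [self.raw["scanned_ip"], self.raw["source_ip"]]

    def domains(self):
        return []

    def timestamp(self):
        return self.raw["scan_time"]


scan = BannerScanRecord(
    {
        "scanned_ip": "203.0.113.7",
        "source_ip": "198.51.100.2",
        "scan_time": "2023-06-23T12:00:00Z",
    }
)
readable = scan.serialize()
compact = scan.serialize(
    lambda d: json.dumps(d, separators=(",", ":")).encode("utf-8")
)
```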
[0042] When the data is reformatted, the system writes the data to one or more rows of the rowKey database, where the data entries are stored in order according to their rowKey content. Since the rowKey starts with the asset content, consecutive IP address ranges are stored consecutively, enabling efficient batch retrieval of the data associated with an IP address range, and fully qualified domain names with a common suffix are also stored consecutively, enabling efficient batch retrieval of the data associated with the common domain suffix. Further, IPV4, IPV6, and domain name rowKey entries are each stored in a separate database within the database. This allows all rowKeys to be sorted by IP address or domain range and, within each asset, chronologically. When multiple assets are associated with the same cyber event or data, a different rowKey entry is created for each associated asset. In one example, when the associated cyber assets include IPV4, IPV6, and domain names, a rowKey entry is created in each of those databases for the same cyber event or data.
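The benefit of sorted rowKeys for range retrieval can be sketched with an in-memory stand-in. In this illustration a sorted Python list plays the role of the IPv4 rowKey table, and the names ip_hex and scan_network, as well as the placeholder timestamps and hashes, are hypothetical. Because the asset part is a fixed-width hexadecimal string, every key in a CIDR block falls in one contiguous slice.

```python
import bisect
import ipaddress


def ip_hex(ip):
    """Hex-encode the address bytes so string order matches numeric order."""
    return ipaddress.ip_address(ip).packed.hex()


# A sorted list stands in for the IPv4 rowKey table; the timestamps and
# hashes below are made-up placeholders.
rows = sorted(
    f"{ip_hex(ip)}#2023-06-23T00:00:00Z#{n:04x}"
    for n, ip in enumerate(
        ["10.0.0.9", "192.0.2.1", "192.0.2.200", "192.0.3.5", "8.8.8.8"]
    )
)


def scan_network(sorted_rows, cidr):
    """Return every rowKey whose asset lies in the CIDR block."""
    net = ipaddress.ip_network(cidr)
    low = net.network_address.packed.hex()
    # "$" sorts immediately after "#", so this bound lies just past every
    # key whose asset part equals the last address in the block.
    high = net.broadcast_address.packed.hex() + "$"
    start = bisect.bisect_left(sorted_rows, low)
    end = bisect.bisect_left(sorted_rows, high)
    return sorted_rows[start:end]


hits = [key.split("#")[0] for key in scan_network(rows, "192.0.2.0/24")]
```

The two binary searches locate the slice without touching keys outside the block, which is the property a sorted key-value store exploits for batch retrieval.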
Generation of Indices for Cyber Data
[0043] The system uses an extensible distributed data processing job such as a Google Dataflow job to encode the asset identifier (domain or IP address), observation time, and record as index values within a backend database such as Google Bigtable. This index takes the form of a set of rowKeys that are used by the backend database to associate the records with queryable fields. In various embodiments, the rowKey is a string that indicates the exact position of the stored row (data element or record). Further, the rowKey is used by the database to sort the rows according to each rowKey. To enable a fixed-time query for a particular record (e.g., cyber asset, cyber event), a copy of the record is stored for each asset associated with the record, and each copy has a rowKey bound to a single asset identifier.
[0044] The database contains multiple rowKeys, and each rowKey contains multiple fields. In one aspect, the rowKey string indicates the exact location of the pDNS record within the distributed database. The rowKey format is: <asset_identifier>#<observation_timestamp>#<unique_hash>
[0045] asset_identifier: Encoding of either an IP address or a domain name. The IP address is encoded as a hexadecimal representation of the IP address bytes, 4 bytes for ipv4 and 16 bytes for ipv6. The domain name is encoded as the fully qualified domain name, but in reverse. For example, "www.google.com" or "www.GOOGLE.com" is encoded as "com.google.www".
[0046] observation_timestamp: The most accurate timestamp related to the occurrence of the cyber event, encoded as an ISO 8601 string.
[0047] unique_hash: In one aspect, the uniqueness of a cyber event is defined by the set of columns that most clearly identify the event. The unique hash is generated by a hash algorithm that receives these parameters of the cyber event as input. If two recorded events have the same values for these columns, it can be determined that both records correspond to the same event. With the unique hash value, the system can deduplicate multiple occurrences of the same event within the data. This prevents the system from storing or returning multiple occurrences of the data for the same cyber event.
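The three rowKey fields can be assembled as in the sketch below. The hexadecimal IP encoding, the reversed lower-case domain, and the ISO 8601 timestamp follow the description; the choice of SHA-256, the field separator used when hashing, and the function names encode_asset, unique_hash, and row_key are assumptions made for illustration.

```python
import hashlib
import ipaddress
from datetime import datetime, timezone


def encode_asset(asset):
    """Hex bytes for an IP address, reversed lower-case labels for a domain."""
    try:
        return ipaddress.ip_address(asset).packed.hex()
    except ValueError:
        return ".".join(reversed(asset.lower().rstrip(".").split(".")))


def unique_hash(event, columns):
    # Assumption: SHA-256 over the defining columns; the description does
    # not name the hash algorithm.
    material = "\x1f".join(str(event[c]) for c in columns)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def row_key(asset, observed_at, event, columns):
    """Build <asset_identifier>#<observation_timestamp>#<unique_hash>."""
    ts = observed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{encode_asset(asset)}#{ts}#{unique_hash(event, columns)}"


observed = datetime(2023, 6, 23, 12, 0, 0, tzinfo=timezone.utc)
event = {"scanned_ip": "192.0.2.1", "qname": "www.example.com", "port": 443}
columns = ["scanned_ip", "qname", "port"]

# One rowKey per associated asset; all copies share the same unique hash.
keys = [
    row_key(asset, observed, event, columns)
    for asset in ("192.0.2.1", "www.example.com")
]
```

Because the hash depends only on the defining columns, a second ingestion of the same event produces the same rowKey and overwrites the earlier row instead of adding a duplicate.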
[0048] In one aspect, cyber event data may be classified into cyber-related analytic observations by querying assets or entities of interest and performing pattern matching analysis on the retrieved records. Based on the classified analysis, the system may collect and aggregate cyber event data of suspicious behavior that is similarly classified, and then downstream systems may summarize the behavior as disclosed in 210270P, entitled METHOD AND SYSTEM FOR SUMMARIZING ANALYTIC OBSERVATIONS, filed on January 31, 2021.
Processing and Indexing of pDNS Records
[0049] In one aspect, a data indexing scheme improves the processing speed and database size for processing and indexing the flow of passive DNS (pDNS) records, as well as the query response time for indexed pDNS records. The system receives a data stream of approximately two million new pDNS records per second and processes approximately 172 billion new pDNS records per day. pDNS records enable the system to store DNS resolution data that the system uses to reference past DNS record values and identify potential security incidents or malicious infrastructure. DNS records are dynamic, and when a DNS record changes, it becomes difficult to identify and associate the previous value with the domain. Thus, pDNS records can be very useful in providing references to new DNS values. pDNS records enable system administrators to determine when DNS records change, the previous DNS values, and the new DNS values. Without pDNS records, it can be difficult to identify previous DNS records of malicious websites and associate those values with current DNS values.
[0050] The pDNS data stream is also useful for security operation centers to identify patterns and create predictive analytics models to identify malicious actors or cyberattacks. In various aspects, pDNS records can be used to identify potentially malicious activities in the footprint of interest, possible security appliances, software vendors used by the company of interest, and to investigate traffic of interest for specific use cases and the like.
Ingest pDNS records and group pDNS records in a structured streaming job
[0051] The first step is to ingest the pDNS records into a distributed dataset, where all pDNS records are stored in a common file type using a common format. The pDNS records are accumulated from multiple data sources into a streaming distributed database, and a structured streaming job is periodically executed on the database. The structured streaming job continuously reads and processes the pDNS data stream from the database in small batches called micro-batches. Each micro-batch is grouped according to the first byte of the requesting IP address and contains multiple records that are written to a group file. In one aspect, the group file is stored in a data serialization format such as JSON, XML, or Apache Avro. The JSON-based file format may be advantageous for its ability to easily transmit data objects as key-value pairs and for its human-readable text. Additionally, the size of the group file may be set according to a predetermined size, range, or micro-batch processing time.
[0052] pDNS records are grouped or sorted according to the first byte of the requesting IP address in order to take advantage of the natural grouping of the requested data. For example, a single IP address tends to request similar resources over time, and thus grouping pDNS records by the first byte of the requesting IP address improves the grouping of other fields, such as qname and answer IP addresses, that will be indexed later. Further, grouping pDNS records according to the first byte of the requesting IP address reduces the shuffle data cost and improves the performance of the downstream Dataflow indexer job. The grouping scheme also provides a coarse first index across the requesting IPs, since the file path can be structured as follows: service://<bucket>/<write-year>/<write-month>/<write-day>/<write-hour>/<first-requesting-ip-byte>/<avro files>
[0053] With this directory structure, the system can query all DNS traffic for a specific group of IP addresses within a specific time range by reading all the data in the corresponding subdirectory.
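The directory layout and the grouping step can be sketched as follows. The path template follows the structure given above; the bucket name, the function names group_directory and group_micro_batch, and the rendering of the first byte as a decimal number are assumptions for illustration only.

```python
import ipaddress
from collections import defaultdict
from datetime import datetime, timezone


def group_directory(bucket, written_at, requesting_ip):
    """Directory for a record: write hour plus first requesting-IP byte."""
    # Assumption: the first byte is rendered as a decimal number (0-255).
    first_byte = ipaddress.ip_address(requesting_ip).packed[0]
    return (
        f"service://{bucket}/{written_at:%Y}/{written_at:%m}/"
        f"{written_at:%d}/{written_at:%H}/{first_byte}"
    )


def group_micro_batch(bucket, written_at, records):
    """Split one micro-batch into per-directory groups."""
    groups = defaultdict(list)
    for record in records:
        path = group_directory(bucket, written_at, record["requesting_ip"])
        groups[path].append(record)
    return dict(groups)


written = datetime(2023, 6, 23, 14, 5, tzinfo=timezone.utc)
batch = [
    {"requesting_ip": "192.0.2.1", "qname": "www.example.com"},
    {"requesting_ip": "192.0.2.77", "qname": "mail.example.com"},
    {"requesting_ip": "10.1.2.3", "qname": "www.example.org"},
]
groups = group_micro_batch("pdns-bucket", written, batch)
```

Reading every file under one such directory returns all traffic written in that hour by requesters sharing the same first address byte, which is the coarse index described above.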
[0054] The structured streaming job uses a micro-batch trigger to force each micro-batch to start 5 minutes after the previous start time. This prevents the job from staying fully caught up to the latest offset, which would cause problems by writing overly small group files.
[0055] FIG. 3 shows a flowchart for reformulating records from a distributed database into an extensible format suitable for indexing. The system reads and reformulates data from the distributed database by starting the first micro-batch process at 3002 using a structured streaming job. The system first reads a predetermined portion of the data from the distributed database within the first micro-batch at 3004. The amount of data within each micro-batch may be determined based on a predetermined trigger interval between the first micro-batch and subsequent micro-batches started thereafter. Based on the start of the first micro-batch, the system determines at 3006 whether to trigger subsequent micro-batches. The system reformulates the original records into an extensible format and sorts at 3008 the reformulated records into groups 3010 according to the first byte of the requesting IP address. The system generates group files associated with the multiple reformulated records. In one aspect, the trigger timing between each micro-batch correlates with the size and number of records within the group file.
[0056] Operation of the Dataflow Job Indexer
[0057] After the system writes pDNS data records to the group files, the system reads the group files and performs a single data flow job indexer operation at a predetermined indexing interval. In one aspect, the indexing operation may be performed every hour. The data flow job indexer operation indexes all pDNS fields in the group files and stores the index in a rowKey database such as Google BigTable. To read data from the group files, the data flow job indexer operation lists a predetermined number of subdirectories in parallel, one for each first requesting IP byte. The listing can cover all files, for example the Apache Avro files in all 256 subdirectories. This results in a substantial increase in read speed (about 256 times faster) compared to a conventional read scheme that lists all files on a single node. Further, indexing the group files in batches at the predetermined indexing interval provides a substantial improvement in indexed record throughput compared to a continuously streaming indexing scheme. In comparison, a continuously streaming indexing scheme runs the risk of processing records at a slower rate than they are received, and of backing off or duplicating write jobs in ways that disrupt the database system.
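The parallel listing step can be illustrated on a local file system. This sketch uses a thread pool and temporary directories as stand-ins for the distributed job and the storage service; the function name list_group_files, the worker count, and the file names are hypothetical, and the actual speedup depends on the storage backend rather than on this code.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_group_files(hour_dir, workers=32):
    """List the files under every first-byte subdirectory in parallel."""
    subdirs = [os.path.join(hour_dir, str(b)) for b in range(256)]

    def list_one(path):
        if not os.path.isdir(path):
            return []
        return [os.path.join(path, name) for name in sorted(os.listdir(path))]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves subdirectory order, so results are deterministic.
        listings = pool.map(list_one, subdirs)
        return [path for listing in listings for path in listing]


# Demo on a throwaway directory tree with two populated subdirectories.
with tempfile.TemporaryDirectory() as root:
    layout = {"192": ["part-1.avro", "part-0.avro"], "10": ["part-0.avro"]}
    for sub, names in layout.items():
        os.makedirs(os.path.join(root, sub))
        for name in names:
            open(os.path.join(root, sub, name), "w").close()
    found = [os.path.relpath(p, root) for p in list_group_files(root)]
```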
[0058] Each data flow indexer job reads one hour of grouped file data and generates an index for each queryable field of the pDNS record, which points to the file containing the pDNS record (and includes metadata about the record). The index is then grouped by the time of the event timestamp of the pDNS record, and each group of the index is written to a single row in the rowKey database. Grouping the pDNS index into time groups allows a much smaller rowKey database cluster (about 4 times smaller) than would otherwise be required to store the index of the data.
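The time-grouping step can be sketched as below. The hourly bin width, the "<field value>#<hour bin>" row layout, and the name group_index_by_hour are assumptions chosen for illustration; the description only states that index entries are grouped by event time and that each group is written to a single row.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_index_by_hour(entries):
    """Collapse (field_value, event_time, pointer) triples into hourly rows."""
    rows = defaultdict(list)
    for field_value, event_time, pointer in entries:
        hour_bin = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
        rows[f"{field_value}#{hour_bin}"].append(pointer)
    return dict(rows)


entries = [
    ("com.example.www", datetime(2023, 6, 23, 12, 1, tzinfo=timezone.utc),
     ("group-0001.avro", 0)),
    ("com.example.www", datetime(2023, 6, 23, 12, 48, tzinfo=timezone.utc),
     ("group-0001.avro", 5)),
    ("com.example.www", datetime(2023, 6, 23, 13, 2, tzinfo=timezone.utc),
     ("group-0002.avro", 3)),
]
index_rows = group_index_by_hour(entries)
```

Three index entries become two rows here; with many records per asset and hour, the number of rows written shrinks accordingly.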
[0059] FIG. 4 shows a flowchart for generating an index indicating the location of the original record. The system starts 4002 a first data flow indexing job based on a predetermined indexing interval. The system reads 4004 one or more group files based on a predetermined indexing interval. The system lists 4006 a predetermined number of subdirectories of the group file so that the system can read the subdirectories in parallel. The system generates 4008 indexes for all records associated with a predetermined indexing interval range based on the index timestamp of the file. The group index is written 4010 to a single row of the rowKey database, which enables the front-end computing resources to query the database and obtain the original record from the back-end server within a certain time.
[0060] Structure of the Index Database
[0061] For performance purposes, the index database is a separate database within a partitioned database architecture and can be optimized according to a specific pDNS dataset. In various embodiments, each row of the index database includes an index for a predetermined number of pDNS records, such as up to 1000 records. The index can be a map-like object including the following structure: { "Group Filename" -> { BlockNumber of pDNS record -> [record metadata] } }
[0062] The group file block number is saved in addition to the group file name, so that at the time of reading, the query system can directly request the block containing the record of interest, thus improving the reading performance.
[0063] Furthermore, the group file is described such that there is only one pDNS record per block, which means that each BlockNumber of the index uniquely identifies a single pDNS record. The record metadata includes additional fields regarding the pDNS record, such as whether the record was a DNS request or response, which can be used to filter out unwanted records.
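One index row with the map-like structure above can be modeled as a nested dictionary. The file names, the metadata fields "type" and "qname", and the helper blocks_to_read are hypothetical and serve only to show how block numbers and record metadata allow the reader to request just the blocks of interest.

```python
# One index row: { group filename -> { block number -> record metadata } }.
# File names and metadata fields are illustrative placeholders.
index_row = {
    "group-0001.avro": {
        0: {"type": "request", "qname": "www.example.com"},
        1: {"type": "response", "qname": "www.example.com"},
    },
    "group-0002.avro": {
        7: {"type": "request", "qname": "mail.example.com"},
    },
}


def blocks_to_read(row, keep):
    """Return sorted (file, block) pairs whose metadata passes the filter."""
    return sorted(
        (filename, block)
        for filename, blocks in row.items()
        for block, metadata in blocks.items()
        if keep(metadata)
    )


# Filter out responses before any group file is opened.
requests_only = blocks_to_read(index_row, lambda m: m["type"] == "request")
```

Since each block holds exactly one pDNS record, every returned pair identifies a single record that can be fetched directly.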
[0064] To query the database, the first step is to query the backend database server with a set of predetermined parameters such as the IP address or domain of interest and the time range. The database returns an index that identifies the files, and the offsets into those files, that contain records matching the IP address or domain. The system may obtain the records by reading the files from the database of grouped files and scanning to the corresponding records. In various embodiments, the pDNS records may be queried according to the requesting IP address, the responding IP address, the queried domain name (qname), and the queried subdomain, and the processed pDNS records are mapped to the domain over a predetermined period such as several months.
[0065] Indexing layer that graphs IP connections and domain connections in the context of partial pDNS records
[0066] In one embodiment, the indexing database utilizes only the connections between IP addresses and domains within the DNS, rather than the full content of the DNS records. Thus, this embodiment may enable an improvement in read performance by creating another index on the data. The new index represents a graph of the connections between IP addresses and domains within the data. The rowKey of this scheme may be stored in the following format: <requesting IP address>#<event timestamp bin>#<requested qname> -> Metadata regarding the request
[0067] The IP graph index provides a faster query for all domains requested by an IP address within a specific time range, or for all instances where an IP address has requested a domain over a specific window. The rowKey representing the connection between a domain and an IP address may be stored as follows: <requested qname>#<event timestamp bin>#<requesting IP> -> Metadata about the request
[0068] Similarly, the domain graph index provides an efficient query for all IP addresses that have requested a specific domain (or any specific subdomain) within a given period. This indexing scheme leverages the data indexing scheme disclosed above, but involves variations in storing data in different rowKey databases.
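The two graph indexes can be sketched with sorted key lists and prefix scans. This is an illustration under several assumptions: the hourly timestamp bin, the plain-text IP address in the key, the reversed qname (borrowed from the asset encoding described earlier), and the names ip_graph_key, domain_graph_key, and prefix_scan are all hypothetical.

```python
import bisect


def reverse_domain(qname):
    return ".".join(reversed(qname.lower().rstrip(".").split(".")))


def ip_graph_key(ip, time_bin, qname):
    """<requesting IP address>#<event timestamp bin>#<requested qname>"""
    return f"{ip}#{time_bin}#{reverse_domain(qname)}"


def domain_graph_key(qname, time_bin, ip):
    """<requested qname>#<event timestamp bin>#<requesting IP>"""
    return f"{reverse_domain(qname)}#{time_bin}#{ip}"


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of keys that start with the prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


observations = [
    ("192.0.2.1", "2023-06-23T12", "www.example.com"),
    ("192.0.2.1", "2023-06-23T12", "mail.example.com"),
    ("198.51.100.7", "2023-06-23T12", "www.example.com"),
    ("198.51.100.7", "2023-06-23T13", "www.example.com"),
]
ip_index = sorted(ip_graph_key(ip, b, q) for ip, b, q in observations)
domain_index = sorted(domain_graph_key(q, b, ip) for ip, b, q in observations)

# All domains one IP requested in a bin, and all IPs that requested a domain.
domains_for_ip = [
    key.split("#")[2]
    for key in prefix_scan(ip_index, "192.0.2.1#2023-06-23T12#")
]
ips_for_domain = [
    key.split("#")[2]
    for key in prefix_scan(domain_index, "com.example.www#2023-06-23T12#")
]
```

Each question is answered by one contiguous scan of the index whose key begins with the known value, which is why the same connection is stored twice with the fields in opposite order.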
[0069] FIG. 5 shows a diagram of a computing system 5000 according to at least one non-limiting aspect of the present disclosure. The computing system 5000 and various components included therein can be used to implement and / or execute any of the various components of the systems and methods 2000, 3000, and 4000 described above in connection with FIGS. 2-4, as described below.
[0070] According to a non-limiting aspect of FIG. 5, the computer system 5000 may include a bus 5002 (i.e., an interconnect), one or more processors 5004, main memory 5006, read-only memory 5008, removable storage media 5010, mass storage 5012, and one or more communication ports 5014. As should be understood, components such as removable storage media are optional and not required in all systems. The communication port 5014 can be connected to one or more networks through which the computer system 5000 can receive and / or transmit data.
[0071] As used herein, a processor can mean, regardless of architecture, one or more microprocessors, central processing units (CPUs), computing devices, microcontrollers, digital signal processors, or similar devices, or any combination thereof. An apparatus for performing a process can include, for example, a processor, and those devices such as input devices and output devices appropriate for performing the process.
[0072] The processor 5004 can be any known processor, such as but not limited to processors manufactured and / or sold by INTEL®, AMD®, MOTOROLA®, etc., which are generally well-known to those skilled in the art and clearly defined in the literature. The communication port 5014 can be any of, for example, an RS-232 port for use in a modem-based dial-up connection, a 10 / 100 Ethernet port, a gigabit port using copper or fiber, or a USB port. The communication port 5014 can be selected according to the network, such as a local area network (LAN), a wide area network (WAN), a CDN, or any network to which the computer system 5000 is connected. The computer system 5000 may communicate with peripheral devices (e.g., a display screen 5016, an input device 5018) via an input / output (I / O) port 5020.
[0073] The main memory 5006 can be a random access memory (RAM) or any other dynamic storage device commonly known in the art. The read-only memory 5008 can be any static storage device such as a programmable read-only memory (PROM) chip for storing static information such as instructions for the processor 5004. The mass storage device 5012 can be used to store information and instructions. For example, a hard disk such as the Adaptec® family of small computer system interface (SCSI) drives, an optical disk, an array of disks such as a redundant array of independent disks (RAID) such as the Adaptec® family of RAID drives, or any other mass storage device can be used.
[0074] The bus 5002 communicatively couples the processor 5004 to other memories, storage, and communication blocks. The bus 5002 can be, for example, a PCI / PCI-X, SCSI, or a Universal Serial Bus (USB)-based system bus (or others) depending on the storage devices used. The removable storage medium 5010 can be any kind of external hard drive, floppy drive, IOMEGA® Zip drive, compact disc read-only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile disc read-only memory (DVD-ROM), and the like.
[0075] Aspects described herein may be provided as one or more computer program products, which may include a machine-readable medium having instructions stored thereon, which may be used to program a computer (or other electronic device) to perform a process. As used herein, the term "machine-readable medium" refers to any medium, plural media, or combination of different media that participates in providing data (e.g., instructions, data structures) that can be read by a computer, processor, or similar device. Such media can take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks and other persistent memories. Volatile media includes dynamic random access memory, which typically constitutes a computer's main memory. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise a system bus coupled to a processor. Transmission media can include or convey acoustic waves, light waves, and electromagnetic emissions, such as those generated during radio frequency (RF) and infrared (IR) data communications.
[0076] Machine-readable media may include, but are not limited to, floppy disks, optical disks, CD-ROMs, magneto-optical disks, ROMs, RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other types of media / devices suitable for storing electronic instructions. Further, aspects described herein may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by a data signal embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
[0077] Various forms of computer-readable media may be involved in carrying data (e.g., a sequence of instructions) to a processor. For example, the data may (i) be delivered from RAM to the processor, (ii) be carried on a wireless transmission medium, (iii) be formatted and / or transmitted according to a number of formats, standards, or protocols, and / or (iv) be encrypted in any of a variety of ways well known in the art.
[0078] A computer-readable medium can store program elements (in any suitable format) suitable for performing a method.
[0079] As shown, main memory 5006 is encoded with an application 5022 that supports the functions discussed herein (the application 5022 may be an application that provides some or all of the functions of the CD service described herein, including a client application). The application 5022 (and / or other resources described herein) may be embodied as software code such as data and / or logical instructions (e.g., code stored in memory or on another computer-readable medium such as a disk) that support processing functions in different manners described herein.
[0080] During operation of one aspect, the processor 5004 accesses the main memory 5006 via the use of the bus 5002 to launch, execute, run, interpret, or otherwise execute the logical instructions of the application 5022. Execution of the application 5022 generates processing functions for services related to the application. In other words, the process 5024 represents one or more portions of the application 5022 that are executed within or on the processor 5004 within the computer system 5000.
[0081] In addition to the process 5024 that performs the operations discussed herein, note that other aspects described herein include the application 5022 itself (i.e., unexecuted or non-executing logical instructions and / or data). The application 5022 can be stored on a computer-readable medium such as a disk (e.g., a repository), or within an optical medium. According to other aspects, the application 5022 can also be stored in a memory type system such as firmware, read-only memory (ROM), or executable code within the main memory 5006 (e.g., within random access memory or RAM) as in this example. For example, the application 5022 can also be stored on the removable storage medium 5010, read-only memory 5008, and / or the mass storage device 5012.
[0082] One of ordinary skill in the art will understand that the computer system 5000 can include other processes and / or software and hardware components such as an operating system that controls the allocation and use of hardware resources.
[0083] The various aspects of the subject matter described herein are set forth in the following numbered examples.
[0084] Clause 1: A method for indexing cyber event data in an extensible database for fixed-time queries, the method comprising: receiving, by a processor, cyber event data from one or more data sources; reformatting, by the processor, the cyber event data into a common intermediate format consisting of accessible attributes including a timestamp of event occurrence and one or more asset identifiers; generating, by the processor, a hash value unique to each cyber event; generating, by the processor, one or more rowKey indexes, asset identifiers, and timestamps of events corresponding to each of the cyber event hashes; storing, by the processor, the reformatted cyber event data in row inputs of one or more rowKey databases, wherein the one or more rowKey databases are organized according to consecutive rowKeys; mapping, by the processor, the row inputs in the rowKey database to the original dataset; receiving, by the processor, a rowKey query based on parameters of the rowKey field; and returning, by the processor, cyber event data based on the rowKey query, wherein the query result is returned within a fixed time.
[0085] Clause 2: The method according to clause 1, wherein the one or more rowKey databases include separate databases for each asset type, and the asset types are IPV4, IPV6, and network domains.
[0086] Clause 3: The method according to either clause 1 or clause 2, wherein at least one of the query parameters includes an asset identifier, an observation timestamp, a unique hash value, or a range of observation timestamps.
[0087] Clause 4: The method according to any one of clauses 1 to 3, wherein index generation is performed by a data flow index job that differentiates cyber data content into higher-order Java classes.
[0088] Clause 5: The method according to any one of Clauses 1 to 4, wherein the asset identifier includes one or more of an IP address and a domain.
[0089] Clause 6: The method according to any one of Clauses 1 to 5, wherein the domain is described in reverse orientation.
[0090] Clause 7: The method according to any one of Clauses 1 to 6, further comprising: receiving, by the processor, a query for an asset or a range of assets (and optionally a timestamp or a range of times); determining, by the processor, the corresponding rowKey or range of rowKeys for the appropriate query result; and returning, by the processor, all cyber event data related to the query by obtaining data associated with the rowKey or range of rowKeys.
[0091] Clause 8: The method according to any one of Clauses 1 to 7, wherein data is collected, processed, and stored in an extensible distributed rowKey-indexed database using an extensible distributed processing system.
[0092] Clause 9: The method according to any one of Clauses 1 to 8, wherein one or more rowKey databases include separate databases for each asset type, and the asset types are IPV4, IPV6, and network domains.
[0093] Clause 10: The method according to any one of Clauses 1 to 9, wherein the query parameters include an asset identifier and optionally an observation time stamp or a range of observation time stamps.
[0094] Clause 11: The method according to any one of Clauses 1 to 10, wherein index generation is performed by a Google Dataflow index job that maps cyber event data to higher-order Java classes, and the rowKey database is implemented using Google Cloud BigTable.
[0095] Clause 12: The method according to any one of Clauses 1 to 11, wherein the asset identifier includes an IP address or a domain.
[0096] Clause 13: The method according to any one of Clauses 1 to 12, wherein the domain is described in reverse order.
[0097] All patents, patent applications, publications, or other disclosure materials described in this specification are hereby incorporated by reference in their entirety as if each individual reference were explicitly incorporated by reference. All references, and any materials or portions thereof, said to be incorporated by reference herein are incorporated herein only to the extent that the incorporated material does not conflict with the existing definitions, descriptions, or other disclosure materials set forth in this disclosure. Accordingly, and to the extent necessary, the disclosure as explicitly set forth in this application governs and supersedes any conflicting materials incorporated by reference herein.
[0098] Various illustrative and exemplary aspects are described. The aspects described herein are to be understood as providing illustrative features of various details of various aspects of the present disclosure, and thus, unless otherwise specified, without departing from the scope of the present disclosure, as much as possible, one or more features, elements, components, ingredients, materials, structures, modules, and / or aspects of the aspects of the present disclosure may be combined, separated, exchanged, and / or rearranged with one or more other features, elements, components, ingredients, materials, structures, modules, and / or aspects of the aspects of the present disclosure. Accordingly, those skilled in the art will recognize that various substitutions, modifications, or combinations of any of the illustrative aspects can be made without departing from the claimed subject matter. Further, those skilled in the art can, by a review of this specification, recognize or confirm many equivalents to the various aspects of the present disclosure using only routine experimentation. Accordingly, the present disclosure is not limited by the description of the various aspects, but only by the claims.
[0099] Those skilled in the art will recognize that terms used herein, and particularly in the appended claims (e.g., the bodies of the appended claims), are generally intended as "open" terms (e.g., the term "including" should be interpreted as "including but not limited to", the term "having" should be interpreted as "having at least", the term "includes" should be interpreted as "includes but is not limited to", etc.). It will further be understood by those skilled in the art that where a specific number of an introduced claim limitation is intended, such intent will be expressly recited in the claim, and where no such recitation is present, no such intent exists. For example, as an aid to understanding, the following appended claims may use the introductory phrases "at least one" and "one or more" to introduce claim limitations. However, the use of such phrases should not be construed as implying that the introduction of a claim limitation by the indefinite article "a" or "an" limits any particular claim containing such introduced claim limitation to a claim containing only one such limitation, even where the same claim contains the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an" (e.g., "a" and/or "an" should generally be interpreted as meaning "at least one" or "one or more"). The same applies to the use of definite articles to introduce claim limitations.
[0100] Furthermore, even if a specific number of an introduced claim limitation is explicitly recited, one of ordinary skill in the art will recognize that such a recitation should typically be interpreted to mean at least the recited number (e.g., a bare recitation of "two recitations", without other modifiers, typically means at least two recitations, or two or more recitations). Further, in those instances where a convention similar to "at least one of A, B, and C" is used, such a construction is generally intended in the sense that one of ordinary skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" includes, but is not limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). In those instances where a convention similar to "at least one of A, B, or C" is used, such a construction is generally intended in the sense that one of ordinary skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" includes, but is not limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). It will be further understood by those of skill in the art that any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, will typically be understood to contemplate the possibility of including one of the terms, either of the terms, or both terms, unless the context indicates otherwise. For example, the phrase "A or B" will typically be understood to include the possibilities of "A" or "B" or "A and B".
[0101] Regarding the appended claims, those skilled in the art will understand that the operations recited therein may generally be performed in any order. Also, although the operations are presented in a sequence, it should be understood that they may be performed in orders other than those described, or simultaneously. Examples of such alternative orderings include, unless the context indicates otherwise, overlapping, interleaved, interrupted, reordered, incremental, preparatory, supplemental, simultaneous, reverse, or other variant orderings. Further, unless the context indicates otherwise, terms such as "responsive to", "related to", or other past-tense adjectives are generally not intended to exclude such variants.
[0102] It should be noted that any reference to "one aspect", "an aspect", "an exemplification", "one exemplification", and the like means that the particular features, structures, or characteristics described in relation to the aspect are included in at least one aspect. Thus, the appearances of the phrases "in one aspect", "in an aspect", "in an exemplification", and "in one exemplification" at various places throughout this specification do not necessarily all refer to the same aspect. Further, the particular features, structures, or characteristics may be combined in any suitable manner in one or more aspects.
[0103] As used herein, unless the context clearly indicates otherwise, the singular forms "a", "an", and "the" include plural references.
[0104] Directional terms used herein, such as, for example and without limitation, up, down, left, right, below, above, front, back, and variations thereof, relate to the orientation of the elements shown in the accompanying drawings and, unless otherwise explicitly stated, do not limit the claims.
[0105] As used herein, the term "about" or "approximately" means, unless otherwise specified, an acceptable error for a particular value as determined by one of ordinary skill in the art, which depends in part on how the value is measured or determined. In certain embodiments, the term "about" or "approximately" means within 1, 2, 3, or 4 standard deviations. In certain embodiments, the term "about" or "approximately" means within 50%, 20%, 15%, 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, 0.5%, or 0.05% of a given value or range.
[0106] In this specification, unless otherwise indicated, all numerical parameters are to be understood as being preceded by the term "about," which in all cases means that the numerical parameter has the inherent variability characteristics of the underlying measurement technique used to determine the value of the parameter. At a minimum, and not as an attempt to limit the application of the doctrine of equivalents to the claims, each numerical parameter herein is to be construed in light of the reported number of significant digits and by applying ordinary rounding techniques.
[0107] Any numerical range recited herein includes all sub-ranges subsumed within the recited range. For example, the range "1 to 100" includes all sub-ranges between (and including) the recited minimum value of 1 and the recited maximum value of 100, i.e., all sub-ranges having a minimum value of 1 or more and a maximum value of 100 or less. Also, all ranges recited herein include the endpoints of the recited range. For example, the range 1 to 100 includes the endpoints 1 and 100. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation recited herein is intended to include all higher numerical limitations subsumed therein. Accordingly, Applicants reserve the right to amend this specification, including the claims, to expressly recite sub-ranges subsumed within the explicitly recited ranges. All such ranges are inherently described herein.
[0108] Any patent application, patent, non-patent publication, or other disclosure material mentioned in this specification and / or listed in any application data sheet is hereby incorporated by reference into this specification, provided that the incorporated material does not conflict with this specification. Accordingly, and to the extent necessary, the present disclosure as expressly set forth herein prevails over any conflicting material incorporated by reference into this specification. Although said to be incorporated by reference into this specification, any material or portion thereof that conflicts with an existing definition, statement, or other disclosure material set forth herein is incorporated only to the extent that no conflict arises between the incorporated material and the existing disclosure material.
[0109] The terms “comprise” (and any form of comprise such as “comprises” and “comprising”), “have” (and any form of have such as “has” and “having”), “include” (and any form of include such as “includes” and “including”), and “contain” (and any form of contain such as “contains” and “containing”) are open-ended conjunctive verbs. As a result, a system that “comprises,” “has,” “includes,” or “contains” one or more elements possesses those one or more elements, but is not limited to possessing only those one or more elements. Similarly, an element of a system, device, or apparatus that “comprises,” “has,” “includes,” or “contains” one or more features possesses those one or more features, but is not limited to possessing only those one or more features.
[0110] The foregoing detailed description has described various forms of devices and / or processes by use of block diagrams, flowcharts, and / or examples. Where such block diagrams, flowcharts, and / or examples include one or more functions and / or operations, those skilled in the art will understand that each such function and / or operation within such block diagrams, flowcharts, and / or examples can be implemented individually and / or collectively by a wide variety of hardware, software, firmware, or substantially any combination thereof. Those skilled in the art will recognize that some aspects of the forms disclosed herein can be implemented as one or more computer programs operating on one or more computers (e.g., as one or more programs operating on one or more computer systems), as one or more programs operating on one or more processors (e.g., as one or more programs operating on one or more microprocessors), as firmware, or substantially any combination thereof, and that all or part thereof can be equivalently integrated into an integrated circuit, and that the circuit design, and / or the description of software code, and also firmware, are within the scope of the skills of those skilled in the art in light of the present disclosure. Further, those skilled in the art will understand that the mechanisms of the subject matter described herein can be distributed in various forms as one or more program products, and that the exemplary forms of the subject matter described herein apply regardless of the particular type of signal-bearing medium used to actually carry out the distribution.
[0111] The instructions used to program the logic to implement the various disclosed aspects may be stored in memory within the system, such as dynamic random access memory (DRAM), cache, flash memory, or other storage devices. Further, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memory (CD-ROM), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, a non-transitory computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
[0112] When used in any aspect of this specification, the term "control circuit" can refer to, for example, a wired circuit, a programmable circuit (e.g., a computer processor with one or more individual instruction processing cores, a processing unit, a processor, a microcontroller, a microcontroller unit, a controller, a digital signal processor (DSP), a programmable logic device (PLD), a programmable logic array (PLA), or a field programmable gate array (FPGA)), a state machine circuit, firmware storing instructions executed by a programmable circuit, and any combination thereof. The control circuit can be embodied, collectively or individually, as part of a larger system, such as an integrated circuit (IC), an application specific integrated circuit (ASIC), a system on chip (SoC), a desktop computer, a laptop computer, a tablet computer, a server, a smartphone, etc. Thus, as used herein, "control circuit" includes, but is not limited to, an electrical circuit having at least one discrete electrical circuit, an electrical circuit having at least one integrated circuit, an electrical circuit having at least one application specific integrated circuit, an electrical circuit forming a general purpose computing device configured by a computer program (e.g., a general purpose computer configured by a computer program that at least partially executes a process, and / or a device described herein, or a microprocessor configured by a computer program that at least partially executes a process, and / or a device described herein), an electrical circuit forming a memory device (e.g., in the form of a random access memory), and / or an electrical circuit forming a communication device (e.g., a modem, a communication switch, or an optoelectronic device). One of ordinary skill in the art will recognize that the subject matter described herein can be implemented in analog or digital fashion or some combination thereof.
[0113] When used in any aspect of this specification, the term "logic" can refer to an application, software, firmware, and / or circuitry configured to perform any of the foregoing operations. The software can be embodied as a software package, code, instructions, instruction sets, and / or data recorded on a non-transitory computer-readable storage medium. The firmware can be embodied as code, instructions, or instruction sets, and / or hard-coded (e.g., non-volatile) data within a memory device.
[0114] When used in any aspect of this specification, terms such as "component", "system", "module", etc. can refer to a computer-related entity, hardware, a combination of hardware and software, software, or software in execution.
[0115] When used in any aspect of this specification, "algorithm" refers to a self-consistent order of steps that produces a desired result, and "step" refers to an operation on a physical quantity and / or a logical state that can, although not necessarily, take the form of an electrical or magnetic signal capable of being stored, moved, combined, compared, and otherwise manipulated. These signals are commonly referred to as bits, values, elements, symbols, characters, terms, numbers, etc. These and similar terms may be associated with appropriate physical quantities and are merely convenient labels applied to these quantities and / or states.
Claims
1. A method for indexing cyber event data in an extensible database for queries over a certain period of time, the method comprising:
receiving, by a processor, cyber event data from one or more data sources;
reformatting, by the processor, the cyber event data from an original dataset into a common intermediate format consisting of accessible attributes including a timestamp of event occurrence and one or more asset identifiers;
generating, by the processor, a unique hash value for each cyber event;
generating, by the processor, one or more rowKey indexes corresponding to the unique hash value for each cyber event, the one or more asset identifiers, and the timestamp of event occurrence;
storing, by the processor, the reformatted cyber event data in row inputs of one or more rowKey databases, wherein the one or more rowKey databases are organized according to consecutive rowKey indexes;
mapping, by the processor, the row inputs in the one or more rowKey databases to the original dataset;
receiving, by the processor, a rowKey query based on parameters of a rowKey field;
returning, by the processor, the cyber event data as a query result based on the rowKey query, wherein the query result is returned in a fixed time regardless of the total number of row inputs in the one or more rowKey databases;
identifying, by the processor, malicious behavior based on the returned query result; and
implementing, by the processor, security enhancements to mitigate the identified malicious behavior.
2. The method according to claim 1, wherein the one or more rowKey databases include separate databases for each asset type, and the asset types are IPv4, IPv6, and network domains.
3. The method according to claim 1, wherein the rowKey query based on the parameters of the rowKey field includes the one or more asset identifiers, an observation timestamp, a unique hash value, or a range of observation timestamps.
4. The method according to claim 1, wherein the one or more rowKey indexes are generated by a dataflow index job that deserializes cyber data content into a higher-order Java class.
5. The method according to claim 1, wherein the one or more asset identifiers include an IP address, a domain, or any combination thereof.
6. The method according to claim 5, wherein the domain is written in reverse order.
7. The method according to claim 1, wherein the security enhancement includes any one of a software version update, a firmware version update, a history update, a continuous update of the dataset, or any combination thereof.
8. The method according to claim 1, further comprising:
receiving, by the processor, the rowKey query for an asset or a range of assets;
determining, by the processor, the one or more rowKey indexes or a range of rowKey indexes for the query result; and
returning, by the processor, the cyber event data as the query result based on the rowKey query by obtaining data associated with the one or more rowKey indexes or the range of rowKey indexes.
9. A system for indexing cyber event data in an extensible distributed rowKey-indexed database, the system comprising:
at least one processor;
at least one memory communicatively coupled to the at least one processor;
an input/output interface configured to access data from one or more external sources, each of the one or more external sources being communicatively coupled to the at least one processor; and
a database residing in the at least one memory and configured to store the data;
wherein the at least one memory is further configured to store executable instructions that, when executed by the at least one processor, cause the at least one processor to:
receive cyber event data from one or more data sources;
reformat the cyber event data from an original dataset into a common intermediate format consisting of an event occurrence timestamp and accessible attributes including one or more asset identifiers;
generate a unique hash value for each cyber event;
generate one or more rowKey indexes corresponding to the unique hash value for each cyber event, the one or more asset identifiers, and the timestamp of the event occurrence;
store the reformatted cyber event data in a row input of one or more rowKey databases, wherein the one or more rowKey databases are organized according to consecutive rowKey indexes;
map the row input in the one or more rowKey databases to the original dataset;
receive a rowKey query based on parameters of a rowKey field;
return a portion of the cyber event data as a query result based on the rowKey query, wherein the query result is returned in a fixed time regardless of the total number of row inputs in the one or more rowKey databases;
identify malicious behavior based on the returned query result; and
implement security enhancements to mitigate the identified malicious behavior.
10. The system according to claim 9, wherein the one or more rowKey databases include separate databases for each asset type, and the asset types are IPv4, IPv6, and network domains.
11. The system according to claim 9, wherein the rowKey query based on the parameters of the rowKey field includes the one or more asset identifiers, an observation timestamp, a unique hash value, a range of observation timestamps, or any combination thereof.
12. The system according to claim 9, wherein the one or more rowKey indexes are generated by a Google Dataflow index job mapping cyber event data to higher-order Java classes, and the one or more rowKey databases are implemented using Google Cloud BigTable.
13. The system according to claim 9, wherein the security enhancement includes any one of software version updates, firmware version updates, history updates, continuous updates of the dataset, or any combination thereof.
14. The system according to claim 9, wherein the one or more asset identifiers include an IP address, a domain, or any combination thereof.
15. The system according to claim 14, wherein the domain is written in reverse order.
16. A method for indexing cyber event data in an extensible database for time-based queries, the method comprising:
receiving, by a processor, cyber event data from one or more data sources;
reformatting, by the processor, the cyber event data from a first dataset into a common intermediate format comprising a timestamp of event occurrence and accessible attributes including one or more asset identifiers, wherein the first dataset is one of a plurality of datasets continuously updated from the one or more data sources;
generating, by the processor, a unique hash value for each cyber event;
generating, by the processor, one or more rowKey indexes corresponding to the unique hash value for each cyber event, the one or more asset identifiers, and the timestamp of event occurrence;
storing, by the processor, the reformatted cyber event data in a row input of one or more rowKey databases, wherein the one or more rowKey databases are organized according to consecutive rowKey indexes;
mapping, by the processor, the row input in the one or more rowKey databases to the first dataset of the plurality of datasets;
receiving, by the processor, a rowKey query based on parameters of a rowKey field;
returning, by the processor, the cyber event data as a query result based on the rowKey query, wherein the query result is returned in a fixed time regardless of the total number of row inputs of the first dataset of the plurality of datasets;
identifying, by the processor, malicious behavior based on the returned query result; and
implementing, by the processor, security enhancements to mitigate the identified malicious behavior.
17. The method according to claim 16, wherein the one or more rowKey databases include separate databases for each asset type, and the asset types are IPv4, IPv6, and network domains.
18. The method according to claim 16, wherein the rowKey query based on the parameters of the rowKey field includes the one or more asset identifiers, an observation timestamp, a unique hash value, a range of observation timestamps, or any combination thereof.
19. The method according to claim 16, wherein the one or more rowKey indexes are generated by a Google Dataflow index job mapping cyber event data to higher-order Java classes, and the one or more rowKey databases are implemented using Google Cloud BigTable.
20. The method according to claim 16, wherein the security enhancement includes any one of software version updates, firmware version updates, history updates, continuous updates of the dataset, or any combination thereof.
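By way of non-limiting illustration, and not as part of the claims, the indexing and query steps recited above can be sketched in Python as follows. Everything in the sketch is an assumption made for illustration: the key layout `<asset>#<UTC timestamp>#<hash prefix>`, the choice of SHA-256, the field names, and the in-memory `RowKeyTable` class, which merely stands in for a store that keeps rows sorted by rowKey. It is not the Dataflow or BigTable implementation referred to in the claims.

```python
import bisect
import hashlib
import ipaddress
import json
from datetime import datetime, timezone


def event_hash(event):
    """Hash the canonical JSON form of an event (SHA-256 is an assumption)."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def asset_type(asset):
    """Classify an asset identifier as 'ipv4', 'ipv6', or 'domain'."""
    try:
        return f"ipv{ipaddress.ip_address(asset).version}"
    except ValueError:
        return "domain"


def stamp(ts):
    """Fixed-width UTC timestamp, so that string order equals time order."""
    return ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def make_row_key(asset, ts, digest):
    """Hypothetical layout: <asset>#<UTC timestamp>#<hash prefix>."""
    return f"{asset}#{stamp(ts)}#{digest[:16]}"


class RowKeyTable:
    """In-memory stand-in for a store that keeps rows sorted by rowKey."""

    def __init__(self):
        self._keys = []  # rowKeys, kept in sorted order
        self._rows = {}  # rowKey -> stored row

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start, end):
        """Return rows with start <= rowKey < end, located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [self._rows[k] for k in self._keys[lo:hi]]


# One table per asset type, mirroring the separate databases of the dependent claims.
tables = {"ipv4": RowKeyTable(), "ipv6": RowKeyTable(), "domain": RowKeyTable()}


def index_event(event):
    """Store one event, already in the common format, under its rowKey."""
    ts = datetime.fromisoformat(event["timestamp"])
    key = make_row_key(event["asset"], ts, event_hash(event))
    tables[asset_type(event["asset"])].put(key, event)
    return key


def query_asset(asset, t0, t1):
    """Return events for one asset with t0 <= timestamp < t1."""
    table = tables[asset_type(asset)]
    return table.scan(f"{asset}#{stamp(t0)}", f"{asset}#{stamp(t1)}")


# Hypothetical events, already reformatted into a common shape.
events = [
    {"asset": "203.0.113.7", "timestamp": "2023-06-01T10:00:00+00:00", "action": "login_failed"},
    {"asset": "203.0.113.7", "timestamp": "2023-06-02T09:30:00+00:00", "action": "port_scan"},
    {"asset": "com.example.www", "timestamp": "2023-06-01T11:00:00+00:00", "action": "dns_query"},
]
for e in events:
    index_event(e)

hits = query_asset(
    "203.0.113.7",
    datetime(2023, 6, 1, tzinfo=timezone.utc),
    datetime(2023, 6, 2, tzinfo=timezone.utc),
)
print([h["action"] for h in hits])  # ['login_failed']
```

Because the asset identifier and a fixed-width timestamp lead the key, a query for one asset over a time window reduces to reading one contiguous key range. In this sketch the range is found by binary search rather than by scanning every row; the sketch makes no claim about the timing behavior of any production store.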