Identification system for malicious applications
The integration of a neural network model with NLP for automated malicious application identification addresses the inefficiencies of manual methods, enabling swift and precise threat detection, enhancing mobile device security and reducing potential harm.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CASHSHIELD TECHNOLOGIES LLC
- Filing Date
- 2025-12-03
- Publication Date
- 2026-06-11
AI Technical Summary
Existing methods for identifying malicious mobile applications rely heavily on manual efforts, which are intensive, costly, and prone to human error, leading to delayed detection of threats that can cause user data compromise and financial loss.
A system utilizing a neural network model integrated with a natural language processor (NLP) to automate the identification of malicious applications, generating fixed-length feature vectors from collected application information, and using a labeled dataset to predict maliciousness, with an SDK installed on devices to gather data and transmit it to a server system.
The system enables rapid, accurate, and proactive detection of malicious applications, reducing the time to identify threats and minimizing harm by leveraging automation and AI, thus improving security and efficiency in mobile device protection.
Description
Attorney Docket No. 484-0003PCT
IDENTIFICATION SYSTEM FOR MALICIOUS APPLICATIONS
REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation-in-part of co-pending U.S. Application No. 18/029,033 filed March 28, 2023, which is a continuation-in-part of PCT Application No. PCT/SG2021/050584 filed September 28, 2021, which claims the benefit of U.S. Provisional Application No. 63/084,655 filed September 29, 2020, the contents of each of which are incorporated herein in their entirety by reference.
[0002] This application also claims the benefit of co-pending U.S. Provisional Application No. 63/727,429 filed December 3, 2024, the contents of which are incorporated herein in their entirety by reference.
[0003] This patent application is related to and incorporates by reference each of the following International Patent Applications, U.S. Patent Applications and U.S. Provisional Patent Applications:
Int’l Pat. Appl. Ser. No. PCT/SG2021/050584 filed on September 28, 2021, published as WO2022/071881 with publication date April 7, 2022;
Int’l Pat. Appl. Ser. No. PCT/IB2020/058799 filed on September 21, 2020, published as WO2021/053646 with publication date March 25, 2021;
Int’l Pat. Appl. Ser. No. PCT/IB2020/058801 filed on September 21, 2020, published as WO2021/053647 with publication date March 25, 2021;
U.S. Prov. Ser. No. 63/084,655 filed September 29, 2020;
U.S. Prov. Ser. No. 62/950,007 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,993 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,987 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,979 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,974 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,965 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,828 filed December 18, 2019;
U.S. Prov. Ser. No. 62/949,816 filed December 18, 2019;
U.S. Prov. Ser. No. 62/903,798 filed September 21, 2019;
U.S. Prov. Ser. No. 62/903,797 filed September 21, 2019; and
U.S. Prov. Ser. No. 62/903,796 filed September 21, 2019.
[0004] All of the above-referenced International Patent Applications, U.S. Patent Applications and U.S. Provisional Patent Applications are collectively referenced herein as "the commonly assigned incorporated applications."
FIELD
[0005] This disclosure generally relates to the device intelligence field. More particularly, some embodiments relate to identification of malicious applications on mobile devices.
BACKGROUND AND SUMMARY OF DISCLOSURE
[0006] Consequential risks are posed when malicious mobile applications, among the myriad introduced daily, go undetected. Failure to identify these threats in a timely manner can lead to compromised user data, financial losses, and damage to the reputation of the mobile app platform. Scraping for new applications can be performed using automation. However, many known techniques for discerning and labelling each newly discovered application for malicious intent include manual steps. This can be an intensive and highly cost-ineffective task. Additionally, human error can pose a significant risk to the accuracy of any assessment.
[0007] A solution is described that makes use of a high level of automation in the application identification process. This can effectively mitigate many limitations inherent in known techniques that rely more heavily on manual efforts. The integration of automation can streamline the workflow. Increased automation can significantly shorten the time used to predict whether newly found applications are malicious. This can result in the ability to detect potential threats at an earlier stage, when little or no harm has been suffered. The system can be proactive rather than reactive in defending against emerging risks.
SUMMARY
[0008] The present disclosure provides embodiments of methods and systems for predicting and/or identifying malicious applications. In some embodiments, a method for predicting and/or identifying malicious applications installed on or running on mobile or portable devices and in communication with a server system is provided. The method includes collecting application information on each of one or more newly identified applications installed on or running on the mobile or portable devices and transmitting the application information to the server system. In some embodiments, collecting application information on each of the one or more newly identified applications includes installing on each mobile or portable device an SDK configured to gather the application information from the mobile or portable device associated with applications installed or running on the mobile or portable device, and transmitting the application information to the server system.
[0009] On the server system, the method includes transforming the application information using at least a natural language processor to generate a fixed length feature vector for each of the one or more newly identified applications. In some embodiments, the fixed length of the feature vector may be at least one of 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions and/or 12,288 dimensions. The feature vector includes information on a plurality of features used to predict if each of the one or more newly identified applications is malicious or benign using a neural network model trained with a suitably labeled dataset of screened applications labeled as malicious or benign. In some embodiments, the method includes sequentially passing the feature vector for each of the one or more newly identified applications through the neural network model, and interpreting an output of the neural network model as a prediction that the application represented by the fixed length feature vector is malicious or benign.
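The step of passing each feature vector through the model and interpreting its output can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-element vectors, the single hidden layer, the hand-picked weights, and the 0.5 decision threshold are not taken from the disclosure, which does not specify the network architecture.

```python
import math

# Toy, hand-picked parameters (assumptions for illustration, not trained values).
HIDDEN_WEIGHTS = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]  # 2 hidden units x 3 inputs
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [2.0, -2.0]
OUTPUT_BIAS = 0.0


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_probability(vector: list[float]) -> float:
    """Forward pass: one ReLU hidden layer, then a sigmoid output unit."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)
    ]
    logit = sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS
    return sigmoid(logit)


def interpret(probability: float, threshold: float = 0.5) -> str:
    """Turn the model output into a malicious/benign prediction."""
    return "malicious" if probability >= threshold else "benign"


def classify_all(vectors: dict[str, list[float]]) -> dict[str, str]:
    """Pass each application's feature vector through the model in turn."""
    return {app: interpret(predict_probability(vec)) for app, vec in vectors.items()}
```

In a real deployment the vectors would have one of the fixed lengths listed above and the weights would come from training on the labeled dataset; the threshold is a tunable design choice that trades false alarms against missed detections.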
[0010] In some embodiments, the method may include labeling the application represented by the feature vector as malicious or benign, and updating the labeled dataset of screened applications with the newly labeled application.
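As a minimal sketch of this feedback step, the snippet below stores a newly labeled application in an in-memory dataset keyed by package name. The `LabeledApp` record, the dictionary-based dataset, and the package name used in the usage note are hypothetical; the disclosure does not describe how the labeled dataset is stored.

```python
from dataclasses import dataclass

VALID_LABELS = {"malicious", "benign"}


@dataclass(frozen=True)
class LabeledApp:
    """Hypothetical record for one screened application."""
    package_name: str
    label: str  # "malicious" or "benign"


def update_labeled_dataset(
    dataset: dict[str, LabeledApp], package_name: str, label: str
) -> LabeledApp:
    """Label an application and add it to (or overwrite it in) the dataset."""
    if label not in VALID_LABELS:
        raise ValueError(f"label must be one of {sorted(VALID_LABELS)}")
    record = LabeledApp(package_name=package_name, label=label)
    dataset[package_name] = record
    return record
```

For example, `update_labeled_dataset(dataset, "com.example.spoofer", "malicious")` would make the new label available the next time the model is retrained or refined.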
[0011] In some embodiments, transforming the application information to generate the feature vector for each of the one or more newly identified applications includes processing the application information to generate NLP-related features, tokenizing the NLP-related feature data to generate a feature vector, and pooling the feature data to fix the length of the feature vector. In some embodiments, pooling the feature data includes averaging word embeddings generated when tokenizing the NLP-related feature data.
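The tokenize-then-pool step can be sketched as follows. This is an illustration under stated assumptions: the whitespace tokenizer and the hash-derived `embed_token` function are stand-ins for a real transformer-based tokenizer and its learned embeddings, and the 8-dimensional size is a toy value rather than one of the lengths named in the disclosure. Only the averaging in `mean_pool` corresponds directly to the pooling described above.

```python
import hashlib

EMBEDDING_DIM = 8  # toy size (at most 32 here); the text mentions lengths such as 384 or 768


def tokenize(text: str) -> list[str]:
    """Lower-case whitespace tokenizer (a stand-in for a real NLP tokenizer)."""
    return text.lower().split()


def embed_token(token: str, dim: int = EMBEDDING_DIM) -> list[float]:
    """Hypothetical deterministic embedding derived from a hash of the token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Map each of the first `dim` bytes to a value in [-1, 1].
    return [(b / 127.5) - 1.0 for b in digest[:dim]]


def mean_pool(embeddings: list[list[float]], dim: int = EMBEDDING_DIM) -> list[float]:
    """Average token embeddings component-wise into one fixed-length vector."""
    if not embeddings:
        return [0.0] * dim
    count = len(embeddings)
    return [sum(vec[i] for vec in embeddings) / count for i in range(dim)]


def feature_vector(text: str) -> list[float]:
    """Tokenize the text, embed each token, and pool to a fixed length."""
    return mean_pool([embed_token(tok) for tok in tokenize(text)])
```

Because the average is taken over however many tokens the text contains, a two-word app name and a long store description both yield vectors of the same length, which is what allows a fixed-size input layer in the neural network.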
[0012] In some embodiments, the method may include issuing an alert when the prediction is that the application represented by the feature vector is malicious. The issued alert may include a type of the malicious application represented by the feature vector.
[0013] In some embodiments, a method for predicting malicious applications installed on or running on mobile or portable devices and in communication with a server system may include collecting application information on each of one or more newly identified applications from one or more app distribution platforms, and transmitting the application information to the server system. Collecting application information on each of the one or more newly identified applications may include installing on each mobile or portable device an SDK configured to gather the application information from the mobile or portable device associated with applications installed or running on the mobile or portable device, and transmitting the application information to the server system.
[0014] On the server system, the method includes transforming the application information using at least a natural language processor to generate a fixed length feature vector for each of the one or more newly identified applications. In some embodiments, the fixed length vector size may be at least one of 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions and/or 12,288 dimensions. The fixed length feature vector includes information on a plurality of features used to predict if each of the one or more newly identified applications is malicious or benign using a neural network model trained with a suitably labeled dataset of screened applications labeled as malicious or benign. In some embodiments, the method includes sequentially passing the feature vector for each of the one or more newly identified applications through the neural network model, and interpreting an output of the neural network model as a prediction that the application represented by the fixed length feature vector is malicious or benign.
[0015] In some embodiments, the method includes labeling the application represented by the fixed length feature vector as malicious or benign; and updating the labeled dataset of screened applications with the newly labeled application.
[0016] In some embodiments, transforming the application information to generate the fixed length feature vector for each of the one or more newly identified applications includes processing the application information to generate NLP-related features, tokenizing the NLP-related feature data to generate a feature vector; and pooling the feature data to generate the fixed length feature vector.
[0017] In some embodiments, the method may include issuing an alert when the prediction is that the application represented by the feature vector is malicious. The alert issued may include a type of the malicious application represented by the feature vector.
[0018] In some embodiments, a system running on a server for predicting malicious applications installed on or running on mobile or portable devices is provided. The system includes a scraper system and an NLP-based malicious application identifier system. In some embodiments, the scraper system collects application information on each of one or more newly identified applications from one or more app distribution platforms. In some embodiments, the scraper system collects application information on each of one or more newly identified applications installed on or running on the mobile or portable devices. In some embodiments, an SDK may be installed on each mobile or portable device that gathers the application information from applications installed or running on the mobile or portable device, and that transmits the gathered application information to the server.
[0019] In some embodiments, the NLP-based malicious application identifier system transforms the application information using at least a natural language processor to generate a fixed length feature vector for each of the one or more newly identified applications. In some embodiments, the fixed length of the feature vector may be at least one of 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions, and 12,288 dimensions. The feature vector includes information on a plurality of features used to predict if each of the one or more newly identified applications is malicious or benign using a neural network model running on the server and trained with a suitably labeled dataset of screened applications labeled as malicious or benign that resides on the server. In some embodiments, the NLP-based malicious application identifier system transforms the application information to generate the feature vector for each of the one or more newly identified applications by, for example, processing the application information to generate NLP-related features, tokenizing the NLP-related feature data to generate a feature vector, and pooling the feature data to fix the length of the feature vector. In some embodiments, pooling the feature data comprises averaging word embeddings generated when tokenizing the NLP-related feature data.
[0020] In some embodiments, the NLP-based malicious application identifier system sequentially passes the feature vector for each of the one or more newly identified applications through the neural network model, and interprets an output of the neural network model as a prediction that the application represented by the feature vector is malicious or benign. In some embodiments, the server may issue an alert when the prediction is that the application represented by the feature vector is malicious. In some embodiments, the issued alert may include a type of the malicious application represented by the feature vector.
[0021] In some embodiments, the NLP-based malicious application identifier system labels the application represented by the feature vector as malicious or benign, and updates the labeled dataset of screened applications with the newly labeled application.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] To further clarify the above and other advantages and features of the subject matter of this patent specification, specific examples of embodiments thereof are illustrated in the appended drawings. It should be appreciated that these drawings depict only illustrative embodiments and are therefore not to be considered limiting of the scope of this patent specification or the appended claims. The subject matter hereof will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
[0023] FIG. 1 is a schematic diagram illustrating an environment in which various devices are interacting with one or more servers and where malicious applications are identified, according to some embodiments;
[0024] FIG. 2 is a block diagram illustrating aspects of a multi-application-store scraper system used as part of a malicious application identification system, according to some embodiments;
[0025] FIG. 3 is a block diagram illustrating further aspects of a multi-application-store scraper system, according to some embodiments;
[0026] FIG. 4 is a block diagram illustrating further aspects of an NLP-based malicious application identifier, according to some embodiments;
[0027] FIG. 5 is a block diagram illustrating aspects of building training datasets and training of malicious application identifier systems, according to some embodiments; and
[0028] FIG. 6 is a block diagram illustrating aspects of issuing warnings and/or alerts based on a database maintained with a malicious application identifier system, according to some embodiments.
DETAILED DESCRIPTION
[0029] A detailed description of examples of preferred embodiments is provided below. While several embodiments are described, it should be understood that the new subject matter described in this patent specification is not limited to any one embodiment or combination of embodiments described herein, but instead encompasses numerous alternatives, modifications, and equivalents. In addition, while numerous specific details are set forth in the following description in order to provide a thorough understanding, some embodiments can be practiced without some or all of these details. Moreover, for the purpose of clarity, certain technical material that is known in the related art has not been described in detail in order to avoid unnecessarily obscuring the new subject matter described herein. It should be clear that individual features of one or several of the specific embodiments described herein can be used in combination with features or other described embodiments. Further, like reference numbers and designations in the various drawings indicate like elements. Some of the figures described herein are simplified in that for clarity they may omit elements of structures that skilled persons would understand and therefore do not need to be shown expressly. The figures may show only a portion of a structure that comprises repeated patterns of the shown portions.
[0030] According to some embodiments, a system for rapidly identifying malicious mobile applications is described. The system is updated regularly, such as daily, with newly discovered malicious applications. The system can include a neural network model as part of a natural language processor (NLP)-based malicious application identifier system. According to some embodiments, the neural network model is trained on details of a large number of applications that have been labelled as or otherwise determined to be malicious or not malicious. A database of applications and associated detail is compiled to initially train the model. Information on existing applications and new applications can be gathered by scraping internet sources. The gathered data should then be accurately labelled. According to some embodiments, artificial intelligence (AI) is used in the scraping and/or labelling processes. Following initial labeling, the data can be refined using additional iterations. According to some embodiments, actual data from actual users is used to refine the model.
[0031] The techniques described herein can provide improvements in speed of computer operation and/or network operation, improvements in efficiency (including energy efficiency) of computer operation and/or network operation, improvements in network capacity usage, improvements in performance of computer or network, reductions in network traffic, improvements in the security of mobile and/or portable devices and in the security of servers with which such mobile and/or portable devices communicate, and/or reductions in computer memory use.
[0032] FIG. 1 is a schematic diagram illustrating an environment in which various devices are interacting with one or more servers and where malicious applications may be identified, according to some embodiments. Internet 100 represents the global system of interconnected computer networks that use the internet protocol suite to communicate between networks and devices, e.g., mobile or portable devices. User 102 is shown operating mobile device 110, in this example a smart phone. Mobile device 110 includes a mobile application (“mobile app”) 112 that is provided to the user 102 for use on mobile device 110 by an organization 150. Mobile app 112 can provide various functionalities on mobile devices such as smart phones and/or tablets. Examples of functionality provided by or enhanced by app 112 include games, social media, banking, and news. In this example, the app is created by organization 150. The app 112 could be downloaded from one of many app stores, from another distribution platform, or provided or downloaded directly from organization 150 through a mobile browser. In some embodiments, organization 150 has an interest in assessing the risk that user 102 is using a malicious application to commit fraud or other malicious activity which may harm or damage organization 150 and/or others. Organization 150 hires, retains, subscribes or otherwise contracts with cyber protection company 160 to aid organization 150 in detecting and reducing risks associated with fraudulent or malicious activity. According to some embodiments, a software development kit (SDK) 114 is provided by cyber protection company 160 to organization 150 and installed on one or more devices, e.g., mobile or portable devices 110, 120, or 130. Organization 150 is shown in FIG. 1 with a server system 151. In many cases, organization 150 includes SDK 114 within mobile app 112 when app 112 is provided to its users, such as user 102. The SDK 114 is configured to facilitate information gathering from the device 110 on which the mobile app 112 is running. In the case of user 102, SDK 114 in app 112 provides certain information from device 110 for use by organization 150 and/or cyber protection company 160.
[0033] For example, in many cases the SDK 114 is able to identify other mobile applications that are currently active on the mobile device 110, and/or identify other mobile applications that are installed on the mobile device 110. In the simple example shown in FIG. 1, there are three devices shown running mobile app 112, namely smart phones 110 and 130 and laptop computer 120. In this example, smartphone 110 is used by user 102 who is legitimately using app 112. In contrast, user 104 who operates laptop 120, and user 106 who operates smartphone 130 are employing one or more malicious applications to commit fraud or other malicious activity. In the case of laptop 120, the user 104 has installed a malicious application 126 in the form of a spoofing tool. Spoofing tool 126 allows laptop 120 to emulate or appear to be a smartphone or other mobile device such as a tablet. User 106 is shown using smartphone 130, which has malicious application 136 configured for some other malicious activity. In some cases the devices 120 and/or 130 can include a plurality of malicious tools or applications.
[0034] As discussed, in many cases SDK 114 is able to identify other mobile applications that are currently active and/or installed on the mobile device where the SDK is installed. According to some embodiments, through SDK 114, cyber protection company 160 uses such information to identify new mobile applications that should be evaluated for maliciousness. With the techniques described herein, company 160 is able to quickly and accurately predict which of the newly discovered mobile apps are malicious and which are benign. After evaluation, cyber protection company 160 can identify devices that have one or more malicious applications currently active or currently installed. In the case of FIG. 1, SDK 114 is used by cyber protection company 160 to identify new mobile applications installed and/or being run on mobile devices that have app 112 from organization 150. Note that such devices are being operated by organization 150’s customers and/or end users. As will be described in greater detail, infra, cyber protection company 160 can evaluate each new mobile application for maliciousness. Cyber protection company 160 uses the information on whether each new application is malicious to maintain a comprehensive database of malicious mobile applications. Cyber protection company 160 uses the recently updated database to issue alerts and/or warnings to organization 150 that one or more of its customers or end users may be using or have installed malicious applications. In FIG. 1 this is depicted by warning/alert 152 on organization 150’s server system 151. According to some embodiments, cyber protection company 160 accurately identifies the presence and/or activity of malicious apps 126 and 136. In some cases, the speed and accuracy of detection of malicious apps allows cyber protection company 160 to issue its warning/alert to organization 150 before users such as 104 and 106 are able to cause harm. Note that in some cases it may not be apparent to the user of a mobile device that the device has one or more malicious applications installed (or being run). In such cases the malicious application is acting as malware. The warning/alert 152 can be used by organization 150 to alert and notify end users/customers of this situation.
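The cross-referencing described above, in which SDK-reported applications are matched against the malicious application database to decide which devices warrant a warning, can be sketched as follows. The data shapes, device identifiers, package names, and type strings are hypothetical and chosen only to mirror the three devices of FIG. 1; the disclosure does not define a report or database format.

```python
def find_affected_devices(
    sdk_reports: dict[str, set[str]],
    malicious_db: dict[str, str],
) -> dict[str, list[tuple[str, str]]]:
    """Return, per device, the reported packages that appear in the malicious database.

    sdk_reports maps a device id to the package names the SDK saw installed or active.
    malicious_db maps a known-malicious package name to its type.
    Devices with no matches are omitted from the result.
    """
    affected: dict[str, list[tuple[str, str]]] = {}
    for device_id, packages in sdk_reports.items():
        hits = sorted(
            (pkg, malicious_db[pkg]) for pkg in packages if pkg in malicious_db
        )
        if hits:
            affected[device_id] = hits
    return affected
```

With reports for devices 110, 120 and 130, a database listing a spoofing tool and one other malicious app would yield entries only for devices 120 and 130, and each entry carries the application type that a warning such as warning/alert 152 could include.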
[0035] According to some embodiments, using a malicious application identification system 164 such as in one or more of the embodiments described herein, the cyber protection company 160 is able to maintain a database 163 of malicious applications on server system 161 that is both accurate and up to date. In FIG. 1 cyber protection company 160 uses malicious application identification system 164 on server system 161 to update database 163 on a daily basis. For ease of description, the malicious application identification system 164 may also be referred to herein as the “app ID system”. In other cases the updates could be several times per day or every few days depending on various factors such as threat level and service level.
[0036] According to some embodiments, malicious application identification system 164 includes a multi-application-store scraper system 166 that scrapes one or more app distribution platforms 170 for new applications on a regular basis, such as daily. For ease of description, the multi-application-store scraper system 166 may also be referred to herein as the “scraper system”. The app ID system 164 also includes an NLP-based malicious application identifier system 168 which employs both an NLP-based transformer and a neural-network-based model to quickly and accurately predict which of the newly scraped applications are likely to be malicious apps. Using app ID system 164, cyber protection company 160 is able to maintain a highly up-to-date and accurate database of malicious apps 163, which can be stored on server system 161. Scraper system 166 is configured to gather information on new mobile apps by scraping relevant data from web app stores and other available app distribution platforms on a regular basis, such as daily. The information gathered through the scraping process is sufficient to accurately predict whether each new app is malicious. Cyber protection company 160 can use information from the installed SDKs to include in the warning/alert to organization 150 which users are likely to have malicious applications installed. In some cases, where a particularly dangerous app is discovered, the warning can be issued to the organization 150 in real time (e.g., within minutes of identification by cyber protection company 160).
[0037] Although the simple example of FIG. 1 shows only three users, in practice there will be many more users and potentially malicious actors employing a wider range of devices and equipment. Likewise, there can be many organizations or entities 150. The server systems can in practice be any of a wide range of server system / computing system types configured to provide one or more specific functions (e.g. web server, file server, mail server, game server, etc.); and could be implemented in various hardware layouts including partially or wholly virtual and / or cloud based.
[0038] As used herein, the term “app distribution platform” refers to a digital distribution platform, marketplace, or catalog that is configured to provide mobile applications to mobile devices. This includes, for example, mobile app stores and alternative distribution platforms. The mobile app stores include both operating system native platforms (native to the mobile operating system) and non-native, third party app stores. Examples of mobile app stores include, without limitation: Apple App Store; Google Play Store; Samsung Galaxy Store; Amazon Appstore; Microsoft Store; Huawei AppGallery; Xiaomi GetApps; Baidu Mobile Assistant; Tencent Appstore; OPPO App Market; VIVO App Store; Aptoide; Aptoide iOS; GetJar; F-Droid; Cydia; ACMarket; TapTap; Cafe Bazaar; CodeNGo; Pure OS Software Center; Snap Store; Appcircle App Distribution; Appland; Applivery App Distribution; Qoo App; 9Apps; Uptodown; APKPure; APKcombo; SlideME; HappyMod; AltStore; Epic Games Store; Setapp; and Setapp iOS. The term “mobile app distribution platform” also includes proxy app stores that act as a client for other stores, such as Aurora Store and Softpedia that act as a client for the Google Play store. Some of the above app stores may be considered “official” or “endorsed” while some of the third-party platforms may be considered “unofficial” app stores or “Alternative” sources. Other examples of unofficial or alternative app stores.
[0039] FIG. 2 is a block diagram illustrating aspects of a scraper system 166 and a malicious application identifier, according to some embodiments. As described with respect to FIG. 1, the automated system includes two components: a scraper system 166; and a natural language processing (NLP)-based malicious application identifier 168. The multi-application-store scraper 166 is a sophisticated scraper that gathers (e.g., by performing scraping operations) information 212 on newly released applications on a periodic basis (e.g. daily) from multiple mobile app distribution platforms. This can include both official and unofficial application stores worldwide. The period length between scraping operations can be a number of minutes, hours, days, or a combination thereof, depending on the particular application. According to some embodiments, the scraping operations can be run on an aperiodic basis, or at irregular intervals, for example, according to a perceived threat level algorithm or other information. According to some embodiments, the NLP-based malicious application identifier 168 can use ensemble methods according to which multiple learning algorithms improve predictive performance of the system. According to some embodiments, identifier system 168 can be a formidable structure comprising multiple machine learning models and techniques. Examples of known NLP-based transformer models that might be suitable, depending on the application, are discussed infra.
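One common way to combine multiple learned models is soft voting, in which each model's predicted probability is averaged, optionally with weights. The sketch below shows that technique as one possible reading of "ensemble methods"; the disclosure does not say which ensemble method is used, and the `Model` callable type and the weights are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

# A model is assumed to map a feature vector to a probability of maliciousness.
Model = Callable[[Sequence[float]], float]


def ensemble_probability(
    models: Sequence[Model],
    vector: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average (soft vote) of the probabilities from several models."""
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("weights and models must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * m(vector) for w, m in zip(weights, models)) / total
```

Weighting lets a model with a stronger validation record count for more in the combined score, while equal weights reduce to a plain average.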
[0040] According to some embodiments, the malicious application identifier 168 is configured to accept input data, e.g., text input data, containing, for example, the following: application name, application package name, and/or application description. In general, selecting which data to use will depend on the application including associated constraints. Examples of other types of data or information that can be considered in configuring malicious application identifier 168 include: application developer; app rate; app number of reviews; and app category. According to some embodiments, a unique application identifier (e.g. number) can be assigned and included with the app information. According to some embodiments, NLP-based malicious application identifier 168 includes a feature engineering processor 220 which transforms data 222 into a more effective set of inputs 232. Each input set comprises several attributes or features. By providing the model(s) with highly relevant information, feature engineering processor 220 significantly enhances the system’s accuracy in identifying malicious applications. The output of feature engineering processor 220, which is shown in FIG. 2 as set of features 232, is fed into an advanced transformer-based word tokenization model 230, which is configured to generate one or more word embeddings on which the classification is based. The output feature vector 242 of transformer model 230 is then added to a queue for processing by, or directly used as an input to, the input layer of deep neural network 240. In this case, neural network 240 has been trained to yield a probabilistic score on how risky the application is. Further detail is provided in FIG. 4 and associated description, infra.
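A minimal sketch of what a feature engineering step over the three text fields might look like is given below. The specific transformations (lower-casing, collapsing whitespace, splitting the package name on dots and underscores, and joining labeled segments) are assumptions made for illustration; the disclosure does not state how feature engineering processor 220 builds its inputs. The `deque` stands in for the queue that holds items awaiting the downstream model.

```python
import re
from collections import deque


def normalize(text: str) -> str:
    """Lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def split_package_name(package_name: str) -> str:
    """Expose the words inside a package name, e.g. 'com.fake.gps' -> 'com fake gps'."""
    parts = re.split(r"[._]+", package_name.lower())
    return " ".join(part for part in parts if part)


def build_model_input(name: str, package_name: str, description: str) -> str:
    """Combine the text fields into one labeled string for tokenization."""
    return " | ".join(
        [
            f"name: {normalize(name)}",
            f"package: {split_package_name(package_name)}",
            f"description: {normalize(description)}",
        ]
    )


# Queue of engineered inputs waiting to be tokenized and classified.
pending_inputs: deque = deque()
pending_inputs.append(
    build_model_input("Fake GPS", "com.fake.gps_spoofer", "  Change   your location ")
)
```

Splitting the package name matters because identifiers such as `gps_spoofer` would otherwise reach the tokenizer as a single opaque token, hiding words that may be informative for classification.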
[0041] FIG. 3 is a block diagram illustrating further aspects of a multi-application-store scraper system, according to some embodiments. The scraper system 166 includes a database 306 of existing apps. The database 306 includes information on a large number of existing apps, e.g., mobile apps, and includes the names and other information on both malicious and benign apps. The database 306 is also regularly updated with newly discovered apps. There are several known methods to build up such a database. Specific software development tools are collected into an installable package referred to as a software development kit (SDK). One or more SDKs are integrated into various client applications worldwide (shown in FIG. 1 as SDK 114 installed in app 112). Through this integration an extensive list of applications can be compiled in the form of package names used by end users globally. This is shown in block 320 in FIG. 3, and package name data sources 304 include the SDKs in client apps 302. Another potential source of package names is to directly scrape a wide range of app store websites and other distribution platforms. This is shown in FIG. 3 by a dashed arrow. According to some embodiments, focusing scraping on apps that are installed by customers or end users that have one or more of the SDKs installed can, in some cases, result in increased computer and network efficiency and significantly reduced network traffic when compared to a less focused approach, such as indiscriminately scraping all publicly available application store websites for newly released applications. This approach also has an advantage of focusing on applications that pose a greater risk to one’s own customers and / or end users. Note that when initially compiling dataset 306, each application should be thoroughly screened and labeled as either malicious or benign. Additionally, the labeled dataset 306 should be large enough to produce high-quality training sets. Further detail with respect to building of an initial dataset of mobile applications with suitable information and accuracy for training a neural network is provided in FIG. 5 and associated description, infra.
[0042] In block 322, package names from data sources, such as SDKs in client apps 302, are cross-referenced against database 306 of existing applications to create a refined list of new package names 212 (which is also shown in FIG. 2). In block 324, scraper script 310 is deployed on one or more app distribution platforms (including app store websites 170), and collects data for each application in the refined list of new package names 212. Scraping data from multiple sites and / or platforms, rather than just one, allows for more comprehensive scraping of all or most of the identified package names. Note that in general the package name is not the same as the app name; it can include the app name but can also include a unique application identifier. According to some embodiments, several scraper scripts 310 can be developed to handle the varying structures of these websites, for example separate scripts configured for different stores and / or sites. Some of the sites scraped include, but are not limited to, play.google.com, apkpure.com, apkcombo.net, and apple.com / store.
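The cross-referencing of block 322 can be pictured as a set difference between the package names reported by the SDKs and those already held in database 306. The following is a minimal sketch under stated assumptions, not the implementation of the described system: the function name `refine_new_package_names` and the sample package names are hypothetical.

```python
from typing import Iterable, List, Set


def refine_new_package_names(reported: Iterable[str], known: Set[str]) -> List[str]:
    """Return reported package names that are not yet in the known-app database.

    Order of first appearance is preserved and duplicates are dropped, so each
    new package name is queued for scraping only once.
    """
    seen: Set[str] = set()
    refined: List[str] = []
    for name in reported:
        key = name.strip()
        if not key or key in known or key in seen:
            continue
        seen.add(key)
        refined.append(key)
    return refined


# Hypothetical data standing in for database 306 and the SDK reports.
existing_db = {"com.example.bank", "com.example.chat"}
sdk_reports = [
    "com.example.chat",
    "com.sample.cloner",
    "com.sample.cloner",
    "com.example.bank",
]
new_names = refine_new_package_names(sdk_reports, existing_db)
```

Only the names left in `new_names` would then be handed to the scraper scripts, which is how redundant scraping of already-known applications is avoided.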
[0043] In block 326, the scraped data 222 collected from websites and platforms 170 is appended to a queue of new apps (with info) 312 for input to NLP-based malicious application identifier system 168. By using refined list 212, redundant scraping of the same information can be avoided. According to some embodiments, the scraper system shown in FIG. 3 can run at regular intervals, such as once a day.
[0044] According to some embodiments, the scraper makes use of one or more anti-blocking techniques to reduce the likelihood of being blocked by app platforms it intends to scrape, thereby reducing disruptions in the scraping process. In the example shown in FIG. 3, scraper script 310 is proxy-powered, meaning that it employs a proxy to prevent blocking, for example blocking through use of a firewall or velocity checking tool. The proxy capability can employ known techniques such as automatic IP rotation whenever a publicly available application store website returns an error due to excessive requests. Non-limiting examples of available proxy services include nordVPN®, oxylabs® and bright data.
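Automatic rotation on an excessive-requests error can be sketched as a retry loop over a pool of proxies. This is an illustration only: the `TooManyRequests` exception, the `fetch_with_rotation` helper, the proxy labels and the stand-in `fake_fetch` function are all hypothetical, and no real proxy service API is used.

```python
from itertools import cycle
from typing import Callable, Optional, Sequence


class TooManyRequests(Exception):
    """Signals that a site rejected a request because of excessive traffic."""


def fetch_with_rotation(
    url: str,
    proxies: Sequence[str],
    fetch: Callable[[str, str], str],
    max_attempts: int = 5,
) -> Optional[str]:
    """Try to fetch `url`, switching to the next proxy after each throttling error."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    proxy = next(pool)
    for _ in range(max_attempts):
        try:
            return fetch(url, proxy)
        except TooManyRequests:
            proxy = next(pool)  # rotate to the next proxy and retry
    return None  # every attempt was throttled


def fake_fetch(url: str, proxy: str) -> str:
    """Stand-in for a real HTTP call: the first proxy is always throttled."""
    if proxy == "proxy-a":
        raise TooManyRequests(url)
    return f"page for {url} via {proxy}"


page = fetch_with_rotation("https://store.example/app", ["proxy-a", "proxy-b"], fake_fetch)
```

Passing the fetch function in as a parameter keeps the rotation logic independent of any particular HTTP client or proxy provider.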
[0045] FIG. 4 is a block diagram illustrating further aspects of an NLP-based malicious application identifier system, according to some embodiments. The NLP-based malicious application identifier system 168 shown in FIG. 4 can generally be referred to as an inference pipeline. Every newly scraped application registered in the queue buffer 312 (also shown in FIG. 3) is passed through the inference pipeline of system 168 to determine whether it is considered malicious. The inference pipeline of system 168 includes sentence transformer 410 and deep learning model 420. Generally, the sentence transformer 410 is configured to capture the semantic meaning of sentences, e.g., text sentences, for subsequent processing, and the model 420 processes the output of the sentence transformer 410 to make a prediction that an app, e.g., mobile app, is malicious or benign (non-malicious). Model 420 has been trained on a suitably large (e.g. thousands) labelled dataset associated with collected application information.
[0046] In the sentence transformer 410, the first step is the NLP-based feature engineering process 412. In process 412 the application information is processed to generate NLP-related features. The “application information” being processed generally contains the information deemed relevant for an application such as, for example: app name, app description, package name, app rate, app number of reviews, app category, and / or app link. The processing includes cleaning and preprocessing the data, such as removing URLs, email traces, stop words, numbers, emojis, and non-alphanumeric text, which could introduce unnecessary noise in the subsequent steps. Note that the term “stop words” refers to common words that are generally not meaningful search terms for purposes of the NLP.
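The cleaning step of process 412 can be approximated with a few regular expressions. The sketch below is a simplified illustration, not the described implementation: the tiny `STOP_WORDS` set, the regular expressions and the `clean_text` helper are assumptions, and keeping only the letters a to z is a deliberate simplification that would also discard non-English text.

```python
import re

# Deliberately small, hypothetical stop-word list; real lists are far longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "for", "in", "on", "is", "your", "with"}

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # digits, emojis, punctuation, symbols


def clean_text(text: str) -> str:
    """Lowercase the text and strip URLs, emails, numbers, emojis and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


raw = "Clone your apps 2 times faster 🚀! Visit https://cloner.example or mail help@cloner.example"
cleaned = clean_text(raw)
```

URLs and email addresses are removed before the generic character filter so that fragments of them do not survive as stray tokens.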
[0047] Next, an advanced transformer-based word tokenization model 414 (also shown in FIG. 2 as model 230) is used to clean features and convert the NLP-related feature data into a feature vector. Many transformer models can be suitable for this purpose, depending upon the particular application. Examples include one or more of the following: sentence-transformers / all-MiniLM-L12-v1; sentence-transformers / all-mpnet-base-v1; sentence-transformers / multi-qa-MiniLM-L6-dot-v1; sentence-transformers / all-MiniLM-L6-v2; sentence-transformers / facebook-dpr-ctx_encoder-multiset-base; sentence-transformers / facebook-dpr-question_encoder-single-nq-base; sentence-transformers / multi-qa-mpnet-base-dot-v1; sentence-transformers / sentence-t5-xxl; sentence-transformers / LaBSE; sentence-transformers / msmarco-bert-base-dot-v5; sentence-transformers / msmarco-distilbert-dot-v5; sentence-transformers / multi-qa-distilbert-dot-v1; sentence-transformers / gtr-t5-xxl; and sentence-transformers / xlm-r-distilroberta-base-paraphrase-v1.
[0048] According to some embodiments, pooling operation 415 is used to average word embeddings being generated by the transformer model 414, resulting in a feature vector having a fixed length. The fixed length feature vector can be added to a queue 417 so that it can be passed through the neural network model 420 in an efficient manner. According to some embodiments, fixed length feature vectors 416 in queue 417 can be of a fixed predetermined vector size for efficient and effective processing by model 420. In general, the size or length of the vector will vary depending on the number of features, architecture, algorithm requirements, embedding model(s) used, result quality, etc. A range of vector sizes for this NLP application can be a few hundred to several thousand dimensions. Some common fixed-length vector sizes are 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions, and 12,288 dimensions.
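The effect of pooling operation 415 is that texts of different lengths all map to vectors of the same size. A minimal sketch of mean pooling follows, using toy 3-dimensional embeddings in place of real transformer output; the `mean_pool` helper and the numbers are illustrative assumptions, and production libraries typically also exclude padding tokens from the average.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token embeddings into one vector of the embedding dimension."""
    if not token_embeddings:
        raise ValueError("at least one token embedding is required")
    dim = len(token_embeddings[0])
    if any(len(vec) != dim for vec in token_embeddings):
        raise ValueError("all token embeddings must have the same dimension")
    count = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


# Two inputs with different token counts yield vectors of identical length.
two_tokens = mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
three_tokens = mean_pool([[0.0, 0.0, 3.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
```

With a real embedding model the dimension would be one of the sizes listed above (for example 384 or 768) rather than 3.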
[0049] The feature vectors are then applied to or passed through the neural network model 420, which has been previously trained with data accurately labeled to reflect each app as either malicious or benign. Neural network model 420 includes several layers of nodes. This includes input layer 418, output layer 424 and one or more intermediate layers 422, 426, and 428. The nodes of each intermediate layer are labeled ReLU. This model 420 accepts the feature vector 416 from transformer 410 and returns a probabilistic score, indicating the likelihood of the application being malicious. The configuration of feature vector 416, including its size / length, should be selected according to the particular application, which includes considerations such as available resources and other constraints. According to some embodiments, rectified linear units (ReLUs) are used as the nodes for each of the one or more intermediate layers 422, 426 and 428. In this model 420, the ReLU is a non-linear activation function used for deep neural networks in machine learning. It is also known as the rectifier activation function.
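A single inference pass through a network of this shape can be sketched in plain Python. The example is a toy under stated assumptions: the weights are hand-picked rather than trained, the vector has 2 dimensions instead of hundreds, and the sigmoid on the single output node is an assumption made here to obtain a score between 0 and 1; the text above specifies only that the intermediate layers use ReLU and that the output is a probabilistic score.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def relu(x: float) -> float:
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs: Sequence[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def predict_malicious_probability(
    features: Sequence[float], hidden_layers: Sequence[Layer], output_layer: Layer
) -> float:
    """Forward pass: ReLU intermediate layers, then a single sigmoid output node."""
    activations = list(features)
    for weights, biases in hidden_layers:
        activations = [relu(v) for v in dense(activations, weights, biases)]
    out_weights, out_biases = output_layer
    return sigmoid(dense(activations, out_weights, out_biases)[0])


# Hand-picked toy parameters (hypothetical, not trained).
hidden = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
output = ([[2.0, 1.0]], [-2.0])
score = predict_malicious_probability([1.0, -2.0], hidden, output)
```

In this toy case the negative feature is clipped to zero by the ReLU, the output logit is 2 × 1 + 1 × 0 − 2 = 0, and the resulting score is therefore 0.5.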
[0050] In block 430 the result of output layer 424 is interpreted as either malicious or benign (not malicious) and the database of applications 306 is updated for the application being processed. According to some embodiments, block 430 can flag the application for manual review in certain cases before the database 306 is updated.
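The interpretation in block 430, including the optional manual-review path, can be sketched as a pair of thresholds on the probabilistic score. The threshold values 0.8 and 0.2, the label strings and the `interpret_score` helper are hypothetical choices for illustration; suitable values would depend on the particular application.

```python
def interpret_score(
    score: float, malicious_threshold: float = 0.8, benign_threshold: float = 0.2
) -> str:
    """Map a probabilistic score to a label, deferring uncertain cases to a human."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    if score >= malicious_threshold:
        return "malicious"
    if score <= benign_threshold:
        return "benign"
    return "manual_review"  # flagged before the database is updated


labels = [interpret_score(s) for s in (0.95, 0.05, 0.5)]
```

Only applications labeled `malicious` or `benign` would be written straight to the database in this sketch; the middle band is held for review.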
[0051] The number of layers and the number of nodes per layer will vary depending on the particular application and resources available. For example, in some cases, the model can have only two layers: an input layer and an output layer. In other cases there are one or more intermediate layers (e.g. 3, 4 or more layers). The number of nodes in each layer that is suitable for a particular application can be found using known methods. For example, the number of nodes in the input layer 418 can be set according to the number of features in the input data. In FIG. 4, there is only a single node in output layer 424 since the result in this example is simply the likelihood of maliciousness associated with the application. The number of nodes in each of the intermediate layers (e.g. 422, 426 and 428) should be selected such that the level of accuracy and speed is suitable for the given application. In some examples, the number of nodes per layer can be a power of 2 such as 32, 64, 128, 256, etc.
[0052] The system shown in FIG. 4 can be a highly accurate prediction system that requires only a single inference pass. If the model is less accurate, then additional passes might be used to determine whether an application is malicious. However, by combining a well configured transformer-based word tokenization model with a suitable neural network architecture, it has been found that only a single pass can yield accurate results. Such configurations, therefore, can provide significant computer and network benefits, such as reducing computer memory usage.
[0053] The systems and techniques described herein enable a proactive approach to identifying newly released malicious applications online. Known methods of determining if an application is malicious range from basic heuristic algorithms to machine learning models. Compared with known methods, the techniques described herein can significantly improve performance in terms of accuracy and / or efficiency. In cases where a transformer-based word tokenization model is used, a more complete context of the application information is often captured. Additionally, using a deep neural network allows the predictions to be based on more than simple relationships in the data. In some cases, the predictions of maliciousness can be based on complex, nonlinear patterns in the data, which in some cases leads to increases in accuracy.
[0054] FIG. 5 is a block diagram illustrating aspects of building training datasets and training of malicious application identifier systems, according to some embodiments. The building of an initial dataset of mobile applications with suitable information and accuracy for training a neural network is shown in the blocks labeled as 500. In block 510, a collection of applications is selected to use as an initial training set. According to some embodiments the initial training set includes at least 100 applications. According to some other embodiments the initial training set includes at least 500 applications, at least 1000 applications, or at least 2000 applications. In block 512, for each application in the initial set, an accurate determination should be made whether the application is malicious or benign. In block 514, the list is compiled, with each application having sufficient information for the neural network model to predict maliciousness with suitable accuracy. Examples of types of information include: application name, package name, and app description. In block 516 the information is processed using the NLP-based sentence transformer model. The processing includes tokenizing to generate word embeddings, and pooling to aggregate the sequence of token embeddings into a single, fixed-length representation to accurately capture meaning. In block 518 the neural network is trained using the dataset generated in block 516. Depending on the results, further changes might be needed. For example, for each application more or different information might have to be gathered, the information tokenization and / or pooling might need to be changed, and / or the number of applications in the dataset might have to be increased. This iteration is shown in block 502.
Once a prediction model is created that has suitable results for the particular application, the neural network can be re-trained using additional data that has been compiled since the initial training, as shown in block 520. The frequency or length of time between re-training operations depends on the application and upon other factors such as perceived threat indications and indications of model inaccuracy (e.g. gradual model “drift”). In some cases, the model is retrained on a regular or routine basis, for example once per quarter year (3 months), or once every 6 months.
[0055] In some embodiments, the accuracy of the model 420 is a ratio of the total number of correct predictions to the total number of predictions. The accuracy of the model 420 can also be expressed in terms of the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), where:
Accuracy = (TP + TN) / (TP + TN + FP + FN).
In a non-limiting example, the suitability of the model 420 accuracy is dependent upon a number of factors including the size of the training data set, how well the data is cleaned and normalized, the number of layers in the model 420, and conventional hyperparameter tuning characteristics, such as learning rates. According to some embodiments, suitable accuracy can be at least 60% accuracy rate. This accuracy rate can be measured as being correct 60% or greater at predicting the ratio of true positives divided by the population size. According to some other embodiments, the accuracy rate is at least 70%, at least 80% or at least 90%.
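The accuracy formula above can be checked with a small worked example. In the sketch below, the labels use 1 for malicious and 0 for benign; the helper names and the five sample predictions are illustrative assumptions, not data from the described system.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (TP, TN, FP, FN) for binary labels where 1 means malicious."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical ground truth and predictions for five applications.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
```

Here TP = 2, TN = 1, FP = 1 and FN = 1, so the accuracy is 3 / 5 = 0.6, which sits exactly at the 60% level mentioned above.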
[0056] FIG. 6 is a block diagram illustrating aspects of issuing warnings and / or alerts based on a database maintained with a malicious application identifier system, according to some embodiments. In this example, database 306 has recently been updated, for example as shown and described by block 430 in FIG. 4. In block 602 the recently updated database 306 is compared with apps installed on devices having an SDK (such as SDK 114 from cyber protection company 160). Following the example of FIG. 1, this would be end-users or customers of organization 150 (FIG. 1). If matches are found, this indicates organization 150 has end users or customers with devices that have apps identified as malicious installed. In the example of FIG. 1, this could be malicious apps 126 and 136 (FIG. 1). In block 604, a warning / alert is automatically generated based on any matches found in block 602. In block 606 the warning / alert generated in block 604 is automatically issued. The warning / alert also identifies to organization 150 which of its customers / end users’ devices have apps installed that have been identified as malicious. The warning / alert may also include the type of malicious app, such as the type of malicious tool. In the example of FIG. 1, the warning alert 152 identifies devices 120 and 130 as having one or more apps installed that have been identified by cyber protection company 160 as likely to be malicious (e.g. apps 126 and 136 for devices 120 and 130 respectively), and may include that the malicious tool is a spoofing tool.
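The matching of block 602 and the alert generation of block 604 amount to intersecting each device's installed packages with the set of packages flagged as malicious. The following sketch assumes simple in-memory dictionaries; the `build_alerts` helper, the device identifiers, the package names and the alert fields are hypothetical.

```python
from typing import Dict, List, Set


def build_alerts(flagged: Dict[str, str], installed: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Return one alert per (device, flagged package) match, in a stable order.

    `flagged` maps a package name to its malicious type; `installed` maps a
    device identifier to the set of package names reported for that device.
    """
    alerts: List[Dict[str, str]] = []
    flagged_names = set(flagged)
    for device_id in sorted(installed):
        for package in sorted(installed[device_id] & flagged_names):
            alerts.append({"device": device_id, "package": package, "type": flagged[package]})
    return alerts


# Hypothetical contents of the updated database and the SDK device reports.
flagged_apps = {"com.sample.cloner": "app cloner", "com.sample.spoof": "spoofing tool"}
device_apps = {
    "device-a": {"com.example.bank", "com.sample.spoof"},
    "device-b": {"com.sample.cloner"},
    "device-c": {"com.example.chat"},
}
alerts = build_alerts(flagged_apps, device_apps)
```

Devices with no flagged packages, such as `device-c` here, produce no alert at all.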
[0057] According to some embodiments, the NLP-based malicious application identifier is configured to detect one or more groups or types of malicious applications. In such cases, a version of the NLP-based malicious application identifier can be modified and retrained for each different malicious application type or group. According to some embodiments, the techniques described herein can be applied to a wide range of use cases. Examples of use cases can include, without limitation: detection of use of app cloner on mobile devices; detection of use of auto clicker / scroller app on mobile devices; detection of rooted scenario on mobile devices; detection of use of spoofers on mobile devices for GPS location and / or network; detection of use of VPN on mobile devices; detection of screen sharing scenario on mobile devices; detection of device attributes masking tools scenario on mobile devices; detection of use of emulator to access mobile application; detection of tampered application usage on mobile device; detection of factory reset scenario on mobile device; detection of use of OS virtualization app on mobile device; detection of use of hooking instrumentation tool on mobile device; and detection of debugging scenario on mobile device. According to some embodiments, in cases when more than one application is desired, separate datasets and neural network model instances are created and used for each separate type of use.
[0058] Although the foregoing has been described in some detail for purposes of clarity, it will be apparent that certain changes and modifications may be made without departing from the principles thereof. It should be noted that there are many alternative ways of implementing both the processes and apparatuses described herein. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the body of work described herein is not to be limited to the details given herein. While the aforementioned techniques and systems are primarily applied to mobile applications and the detection of malicious apps and tools on mobile devices, the system can be modified by, for example, increasing or decreasing the number of layers in the model 420 and / or increasing or decreasing the fixed length feature vector sizes, to be applied to the detection of malicious apps and tools on web apps.
Claims
What is claimed is:
1. A method for predicting malicious applications installed on or running on mobile or portable devices and in communication with a server system, the method comprising:
collecting application information on each of one or more newly identified applications installed on or running on the mobile or portable devices;
transmitting the application information to the server system;
on the server system, transforming the application information using at least a natural language processor to generate a fixed length feature vector for each of the one or more newly identified applications, the feature vector includes information on a plurality of features used to predict if each of the one or more newly identified applications is malicious or benign using a neural network model trained with a suitably labeled dataset of screened applications labeled as malicious or benign;
sequentially passing the feature vector for each of the one or more newly identified applications through the neural network model; and
interpreting an output of the neural network model as a prediction that the application represented by the fixed length feature vector is malicious or benign.
2. The method according to claim 1, wherein the fixed length of the feature vector is at least one of 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions, and 12,288 dimensions.
3. The method according to claim 1, further comprising:
labeling the application represented by the feature vector as malicious or benign; and
updating the labeled dataset of screened applications with the newly labeled application.
4. The method according to claim 1, wherein collecting application information on each of the one or more newly identified applications comprises:
installing on each mobile or portable device an SDK configured to gather the application information from the mobile or portable device associated with applications installed or running on the mobile or portable device; and
transmitting the application information to the server system.
5. The method according to claim 1, wherein transforming the application information to generate the feature vector for each of the one or more newly identified applications comprises:
processing the application information to generate NLP-related features;
tokenizing the NLP-related feature data to generate a feature vector; and
pooling the feature data to fix the length of the feature vector.
6. The method according to claim 1, wherein pooling the feature data comprises averaging word embeddings generated when tokenizing the NLP-related feature data.
7. The method according to claim 1, further comprising issuing an alert when the prediction that the application represented by the feature vector is malicious.
8. The method according to claim 7, wherein the alert includes a type of the malicious application represented by the feature vector.
9. A system running on a server for predicting malicious applications installed on or running on mobile or portable devices, the system comprising:
a scraper system that collects application information on each of one or more newly identified applications from one or more app distribution platforms; and
an NLP-based malicious application identifier system that:
transforms the application information using at least a natural language processor to generate a fixed length feature vector for each of the one or more newly identified applications, the feature vector includes information on a plurality of features used to predict if each of the one or more newly identified applications is malicious or benign using a neural network model running on the server and trained with a suitably labeled dataset of screened applications labeled as malicious or benign that resides on the server;
sequentially passes the feature vector for each of the one or more newly identified applications through the neural network model; and
interprets an output of the neural network model as a prediction that the application represented by the feature vector is malicious or benign.
10. The system according to claim 9, wherein the scraper system collects application information on each of one or more newly identified applications installed on or running on the mobile or portable devices.
11. The system according to claim 9, wherein the fixed length of the feature vector is at least one of 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions, and 12,288 dimensions.
12. The system according to claim 9, wherein the NLP-based malicious application identifier system labels the application represented by the feature vector as malicious or benign, and updates the labeled dataset of screened applications with the newly labeled application.
13. The system according to claim 9, further comprises:
an SDK installed on each mobile or portable device that gathers the application information from applications installed or running on the mobile or portable device, and that transmits the gathered application information to the server.
14. The system according to claim 9, wherein the NLP-based malicious application identifier system transforms the application information to generate the feature vector for each of the one or more newly identified applications by:
processing the application information to generate NLP-related features;
tokenizing the NLP-related feature data to generate a feature vector; and
pooling the feature data to fix the length of the feature vector.
15. The system according to claim 14, wherein pooling the feature data comprises averaging word embeddings generated when tokenizing the NLP-related feature data.
16. The system according to claim 9, wherein the server issues an alert when the prediction that the application represented by the feature vector is malicious.
17. The system according to claim 16, wherein the alert includes a type of the malicious application represented by the feature vector.
18. A method for predicting malicious applications installed on or running on mobile or portable devices and in communication with a server system, the method comprising:
collecting application information on each of one or more newly identified applications from one or more app distribution platforms;
transmitting the application information to the server system;
on the server system, transforming the application information using at least a natural language processor to generate a fixed length feature vector for each of the one or more newly identified applications, the fixed length feature vector includes information on a plurality of features used to predict if each of the one or more newly identified applications is malicious or benign using a neural network model trained with a suitably labeled dataset of screened applications labeled as malicious or benign;
sequentially passing the feature vector for each of the one or more newly identified applications through the neural network model; and
interpreting an output of the neural network model as a prediction that the application represented by the fixed length feature vector is malicious or benign.
19. The method according to claim 18, wherein the fixed length vector size is at least one of 384 dimensions, 512 dimensions, 768 dimensions, 1024 dimensions, 1536 dimensions, 3072 dimensions, and 12,288 dimensions.
20. The method according to claim 18, further comprising:
labeling the application represented by the fixed length feature vector as malicious or benign; and
updating the labeled dataset of screened applications with the newly labeled application.
21. The method according to claim 18, wherein collecting application information on each of the one or more newly identified applications comprises:
installing on each mobile or portable device an SDK configured to gather the application information from the mobile or portable device associated with applications installed or running on the mobile or portable device; and
transmitting the application information to the server system.
22. The method according to claim 18, wherein transforming the application information to generate the fixed length feature vector for each of the one or more newly identified applications comprises:
processing the application information to generate NLP-related features;
tokenizing the NLP-related feature data to generate a feature vector; and
pooling the feature data to generate the fixed length feature vector.
23. The method according to claim 18, further comprising issuing an alert when the prediction that the application represented by the feature vector is malicious.
24. The method according to claim 23, wherein the alert includes a type of the malicious application represented by the feature vector.