Heuristic recovery mechanism for malicious shortcut file parsing and threat detection
By constructing a threat sample set and a robust parsing engine that combines sliding-window and recursive logic, the invention solves the parsing failures and threat characterization problems that shortcut file analysis tools on the Windows platform encounter with maliciously constructed files, achieving efficient malicious code detection and report generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANKAI UNIV
- Filing Date
- 2026-04-24
- Publication Date
- 2026-06-30
Description
Technical Field
[0001] This invention belongs to the field of software security technology, and in particular relates to malicious code analysis and detection technology for shortcut files on the Windows platform.
Background Technology
[0002] As the cyber threat landscape continues to evolve, Windows shortcut (LNK) files have become a key carrier for malware distribution. Attackers exploit the wide system compatibility and low detection threshold of shortcut files to weaponize them as covert loaders of malicious payloads. Although most existing analysis tools are built on Microsoft's publicly available standard documentation, their limitations become apparent when they face maliciously crafted, non-standard shortcut files. Specifically, existing technologies face three major challenges. First, low tolerance for anomalies: attackers often create structural anomalies by tampering with length fields or using undocumented structures to evade analysis, causing tools to crash or terminate prematurely during parsing and lose critical data. Second, incomplete field interpretation: traditional tools struggle to fully parse deeply nested extended data blocks and cannot extract overlay data hidden at the end of the file, so they cannot reveal deeper malicious semantics. Third, a lack of automated threat characterization: existing analysis processes rely heavily on manual judgment. For example, although patent CN121765720A proposes a dynamic detection method for the process startup phase triggered by a shortcut, it is a runtime behavioral defense. It cannot characterize threats in the file's internal structure during static analysis, before the file is executed, and it still struggles to automatically identify the complex command-line obfuscation techniques and visual camouflage tactics unique to shortcut files.
Summary of the Invention
[0003] The purpose of this invention is to provide a method for parsing and detecting threats in malicious shortcut files based on a heuristic recovery mechanism. Through a robust parsing strategy and a multi-dimensional threat model, it addresses the problems of missing parsing information and incomplete threat information extraction in existing tools when dealing with malicious shortcut samples. It can achieve high-fidelity data extraction and threat feature identification even in noisy and non-standard malicious samples.
[0004] To achieve the above objectives, the specific technical solution of the present invention is as follows:
[0005] A heuristic recovery mechanism-based method for malicious shortcut file parsing and threat detection constructs a threat sample set, utilizes a parsing engine incorporating sliding windows and recursive logic to process binary data, maximizes metadata extraction through a heuristic anomaly recovery mechanism, and automatically characterizes malicious behavior based on a three-dimensional model of deception, evasion, and execution. This method can perform high-fidelity parsing and threat feature extraction for structurally abnormal or malicious Windows shortcut files, including the following steps:
[0006] Step 1: Build a shortcut file threat intelligence dataset by integrating enterprise threat logs and public intelligence sources to obtain a massive number of malicious shortcut samples;
[0007] Step 2: Use a heuristic-based robust parsing engine to process the binary data of the shortcut file; when structurally abnormal or non-compliant fields are encountered, parsing is not terminated. Instead, a heuristic recovery strategy attempts to repair or skip the erroneous area in order to extract the maximum amount of metadata;
[0008] Step 3: Multi-dimensional threat feature analysis. Based on the metadata extracted in Step 2, an in-depth threat feature analysis model is constructed from the outside in, analyzing threat features from three dimensions: deceptiveness, evasion, and execution. In the deceptiveness dimension, the semantic consistency between the shortcut icon index and the target process path is compared. In the evasion dimension, the obfuscated command line parameters are restored through entropy analysis and pattern recognition algorithms. In the execution dimension, the call chain is extracted through the standardized target path.
[0009] Step 4: Output standardized structured data, mark detected structural anomalies and potential malicious features, and generate a threat detection report;
[0010] The robust parsing engine described above uses dynamic byte order processing and a sliding window algorithm to accurately locate terminators in variable-length fields; it uses a recursive parser to process nested file structures and extra data blocks, identifying and extracting payloads hidden after the end of the file or in overlay data.
[0011] The heuristic recovery strategy described above treats parsing failures as threat intelligence. When the parsing engine detects encoding tampering, abnormal length fields, or undocumented structure types, the system records the anomaly type and automatically adjusts the parsing parameters for iterative extraction, ensuring that critical attack chain information can still be restored even if the file structure is damaged.
[0012] The aforementioned threat feature analysis can automatically identify attack tactics unique to shortcut files. In terms of deception, it relies on a multi-dimensional attribute association and consistency verification engine to extract and compare icon resources and actual executables, accurately discovering visual disguises and semantic conflicts. In terms of evasion, it introduces lexical parsing and normalization algorithms based on Abstract Syntax Trees (ASTs) to perform semantic dimensionality reduction and reorganization on command lines that have been obfuscated at the command or statement level, achieving high-fidelity restoration of concealed malicious instructions. In terms of execution, it restores absolute physical paths based on lexical path normalization algorithms and dynamic evaluation engines, thereby accurately identifying fileless attack call chains that use Windows' native whitelisted tools to load malicious code.
[0013] Specifically, the construction of the dataset in step one of this invention includes the following steps:
[0014] (1) Collection of malicious samples from multiple sources: The collection scope covers log records in the real network environment of partner companies and public threat intelligence sources from VirusShare. After cleaning and deduplication, a total of 100,000 malicious shortcut file samples were finally retained.
[0015] (2) Threat intelligence enrichment: Based on the acquisition of original binary file samples, the sample hash value is used to perform batch retrieval of VirusTotal, construct the mapping relationship between samples and intelligence, and realize the temporal alignment and family classification of original binary files and structured intelligence tags.
[0016] Specifically, step two of this invention, which involves processing the binary data of the shortcut file, includes the following steps:
[0017] (1) Dynamic processing and sliding window positioning: For variable-length fields and specifically encoded strings in the file's binary stream, the parsing engine does not simply rely on the length value declared in the file header; instead, it scans the stream dynamically using a sliding window algorithm;
[0018] (2) Recursive decoding of complex nested structures: For the complex nested structures of shell item identifier lists and additional data blocks that are widely present in shortcut files, a self-developed deep recursive parsing method is used for processing;
[0019] (3) Heuristic recovery of abnormal states: an abnormal fault tolerance and recovery mechanism is established. When an invalid segment head size or an unknown structural abnormality is encountered during the parsing process, the engine will not interrupt the program. Instead, it will mark the current state as abnormal rather than fatal. The parsing engine will iteratively adjust the offset of the file pointer through a heuristic algorithm, try to skip the damaged area and automatically search for the header features of the next valid structure, thereby realizing the recovery parsing of subsequent data and maximizing the recovery of the remaining valid information in the file.
[0020] Specifically, the analysis model in step three of this invention is carried out according to the following steps:
[0021] (1) Deceptive feature analysis based on semantic conflict judgment: In response to the disguises that attackers create through social engineering, the system focuses on identifying visual disguise and metadata forgery. Through a multi-dimensional attribute association and consistency verification engine, it extracts and compares the icon resource index of the shortcut file and the actual target program path obtained by parsing, accurately discovers semantic conflicts between the displayed icon and the actual application, and detects whether the description field contains misleading information.
[0022] (2) Identification and detection of evasion techniques: In response to the obstacles that attackers set up to bypass static detection, the system analyzes structural anomalies and command-line obfuscation features in depth to detect whether special-character padding or malformed structures are used to resist traditional scanning engines; at the instruction level, a lexical parsing and normalization algorithm based on an abstract syntax tree (AST) is introduced to perform semantic dimensionality reduction and reorganization on command lines obfuscated with random mixing of uppercase and lowercase letters, large numbers of inserted escape characters, and string concatenation, so as to achieve high-fidelity restoration of the obscured malicious instructions without executing the code;
[0023] (3) Standardized restoration of execution logic: For the final call behavior of malicious payload, the system standardizes the parsed target path based on the lexical path normalization algorithm. By parsing environment variables to restore the absolute path, it automatically maps it to the absolute physical path of the victim host and eliminates redundant separators to calculate the unique real underlying execution address. This allows for accurate identification of the fileless attack call chain that uses the system's whitelist program to load malicious code. On this basis, the complex execution process is reconstructed into a structured sequential output, clearly showing the complete logical path from initial startup to final execution of the malicious payload.
[0024] Specifically, the report generation in step four of this invention is performed according to the following steps:
[0025] (1) Preservation and output of hierarchical data structure: In view of the complex nesting characteristics of shortcut files, the system uses a tree serialization algorithm based on depth-first traversal to dump the parsing results. Specifically, it recursively scans the highly nested blocks of the file with the file header as the root node. During the traversal, it records the physical offset of each data item and the parent-child relationships between items, decomposes the complex web of binary dependencies into a standardized tree node model, and then generates a standardized JSON format report. This format completely preserves the hierarchical logic from the file header to the extended data block and can map the subordinate relationship between the shell item list and the additional data without loss.
[0026] (2) Generation of flattened statistical data: In order to meet the needs of batch feature extraction and trend analysis of massive samples, the system synchronously generates flattened CSV format reports by standardizing and expanding key metadata fields;
[0027] (3) Automated highlighting of threat indicators: In the output report, the system has built-in an automatic labeling algorithm based on multi-feature association pattern matching, which automatically identifies and highlights the key risk points found in the analysis process, including the structural fields marked as abnormal, the detected command line obfuscation technology features, and the storage location of potential malicious payloads, thereby helping security analysts to focus on core threats as soon as possible and quickly complete the qualitative and forensic analysis of malicious samples.
[0028] The present invention has the following advantages:
[0029] (1) Strong parsing robustness and anti-interference capability: Addressing the common practice of malicious samples exploiting malformed structures to circumvent analysis, this invention abandons the fragile logic of traditional tools that strictly rely on file header specifications. By introducing a sliding window search and heuristic anomaly recovery mechanism, this invention maintains the stability and continuity of the parsing engine even in extreme anti-forensic situations such as length field tampering, non-standard character encoding, and structural out-of-bounds errors. This completely solves the pain point of existing tools stopping upon encountering errors and being prone to crashing, ensuring a high success rate of accurate parsing of various malformed or damaged malicious shortcut files even in complex attack and defense environments.
[0030] (2) Comprehensive Deep Detection Vision: This invention breaks through the limitations of traditional forensic tools that can only extract standard metadata, achieving a deep understanding of file content. Through the built-in recursive parsing algorithm and end-of-file scanning technology, this invention can penetrate complex multi-layered nested structures, identify undocumented shell item types, and accurately extract the overlay data hidden after the file terminator. This advantage eliminates detection blind spots, making it impossible for attackers to hide malicious payloads through deep nesting or steganography, significantly improving the ability to detect covert attacks.
[0031] (3) Automated Threat Intelligence Production and Defense Empowerment: This invention achieves a qualitative leap in automation from raw data extraction to high-value intelligence generation. The system can not only restore basic information, but also automatically identify and characterize character obfuscation, logical spoofing, and execution chain anomalies in command lines through built-in semantic analysis algorithms. The system can directly output standardized intelligence reports containing structured features, homology clustering tags, and YARA rules, greatly reducing the threshold and cost of manual judgment, providing security analysts with immediate and actionable defense guidance, and effectively improving the response speed against variant attacks.
Attached Figure Description
[0032] Figure 1 is an overall flowchart of the method of the present invention.
[0033] Figure 2 is a flowchart of the parsing engine logic based on the heuristic recovery mechanism of the present invention.
[0034] Figure 3 is a flowchart of the multi-dimensional threat feature analysis process of the present invention.
[0035] Figure 4 is a schematic diagram of the deceptive feature analysis process of the present invention.
[0036] Figure 5 is a flowchart of the identification and detection of evasion techniques in the present invention.
Detailed Implementation
[0037] The present invention will now be described in further detail with reference to the accompanying drawings. However, the present invention can be implemented in many different ways depending on the provided technical solution. The accompanying drawings, which constitute a part of this application, are used to provide a further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.
[0038] As shown in Figure 1, the malicious shortcut file parsing and threat detection method based on a heuristic recovery mechanism provided by this invention first constructs a shortcut threat sample set and performs data preprocessing. Then, it uses a parsing engine containing sliding windows and recursive logic to process binary data, maximizes the extraction of metadata through a heuristic anomaly recovery mechanism, and automatically characterizes malicious behavior based on a three-dimensional model of deception, evasion, and execution, ultimately generating a standardized threat report. This method can perform high-fidelity parsing and threat feature extraction for structurally abnormal or malicious Windows shortcut files, and includes the following steps:
[0039] Step 1: Constructing a shortcut sample set and preprocessing the data
[0040] By integrating real samples captured from the enterprise intranet with extensive intelligence aggregated from the public internet, a multi-source, heterogeneous shortcut file threat intelligence dataset was constructed. This dataset not only focused on sample collection during its construction but also aimed to reconstruct the full picture and contextual relationships of attacks. By integrating enterprise threat logs and public intelligence sources, a massive number of malicious shortcut samples were obtained, thus providing solid data support for subsequent robust analysis and feature engineering. The specific data construction process is as follows:
[0041] (1) Collection of malicious samples from multiple sources: The collection strategy adopted a combination of enterprise intranet threat hunting and public intelligence source aggregation. The collection scope covered log records from the real network environment of partner enterprises and a wide range of public threat intelligence sources such as VirusShare. After strict cleaning and deduplication, a total of hundreds of thousands of malicious shortcut file samples were retained to ensure that the data covered the complete evolution from historical classic attacks to the latest variants.
[0042] (2) Threat intelligence enrichment: Based on the acquisition of the original binary samples, the sample hash value was used to perform batch retrieval of VirusTotal, and a mapping relationship between samples and intelligence was constructed, realizing the temporal alignment and family classification of the original binary files and structured intelligence tags.
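As an illustration of the deduplication and enrichment steps above, the following minimal sketch hashes each sample, keeps one copy per hash, and joins the result against intelligence records. The record layout (`sha256`, `family`, `first_seen`) and the function names are assumptions made for this example; the records are taken to be already retrieved and stored locally, and no VirusTotal API call is shown.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest used here as the sample identifier."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(samples):
    """samples: iterable of (name, bytes). Keep the first copy of each hash."""
    unique = {}
    for _name, data in samples:
        unique.setdefault(sha256_hex(data), data)
    return unique


def enrich(unique, intel_records):
    """Join deduplicated hashes with intelligence records (hypothetical layout)
    and order the rows by first-seen date for temporal alignment."""
    by_hash = {rec["sha256"]: rec for rec in intel_records}
    rows = []
    for digest in unique:
        rec = by_hash.get(digest, {})
        rows.append({
            "sha256": digest,
            "family": rec.get("family"),
            "first_seen": rec.get("first_seen"),
        })
    # Samples without intelligence sort first (empty key).
    return sorted(rows, key=lambda r: r["first_seen"] or "")
```

Samples that have no matching record keep `None` in the `family` and `first_seen` columns, so missing intelligence stays visible rather than being silently dropped.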
[0043] Step 2: Processing binary data based on a heuristic recovery engine
[0044] A malicious shortcut file parsing engine based on a heuristic recovery mechanism was developed, overcoming the technical bottleneck of existing analysis tools that stop upon encountering errors when processing malformed or tampered samples. A resilient parsing strategy is adopted, which maximizes the continuity of the parsing process and the integrity of the data when facing non-standard structures or maliciously obfuscated data, thereby providing accurate metadata input for subsequent threat detection.
[0045] This step utilizes a robust parsing engine based on sliding window and recursive parsing algorithms to process the binary data of shortcut files. It aims to address the problem of traditional parsers terminating prematurely when faced with malformed structures. Through dynamic byte order processing and a sliding window algorithm, it accurately locates terminators in variable-length fields. A recursive parser handles nested file structures and extra data blocks, extracting payloads concealed after the end of the file or in overlay data. Simultaneously, this step employs a heuristic recovery strategy, treating parsing failures as threat intelligence. When encoding tampering, abnormal length fields, or undocumented structure types are detected, the anomaly type is recorded, and parsing parameters are automatically adjusted for iterative extraction. The specific parsing process includes the following three steps:
[0046] (1) Dynamic processing and sliding window positioning: For variable-length fields and specific encoded strings in the file, the parsing engine does not simply rely on the length value declared in the file header, but introduces a sliding window algorithm to dynamically scan the binary stream. This processing method intelligently identifies and locates the legal string terminator by moving the window in the data stream, ensuring that the basic data can still be accurately extracted even if the header information is unreliable.
[0047] (2) Recursive Decoding of Complex Nested Structures: For complex nested structures such as shell item identifier lists and additional data blocks that are widely present in shortcut files, a deep recursive parsing method is used for processing. This method can decode multi-level nested data objects layer by layer, effectively extracting hidden attribute fields that are not recorded in public documents, and eliminating the parsing blind spots for specific structural fields.
[0048] (3) Heuristic recovery of abnormal states: An abnormal fault tolerance and recovery mechanism has been established. When encountering abnormalities such as invalid segment header size or unknown structure during the parsing process, the engine will not interrupt the program, but will mark the current state as abnormal rather than fatal. The parsing engine iteratively adjusts the offset of the file pointer through a heuristic algorithm, attempts to skip damaged areas and automatically searches for the header features of the next valid structure, thereby achieving recoverable parsing of subsequent data and maximizing the recovery of the remaining valid information in the file.
[0049] The analysis process in this step is shown in Figure 2. The specific workflow is as follows: after the system receives the input binary stream and completes dynamic byte order adaptation, it enters the core read loop. For each structure field that is read, if the field is deemed valid, the system either proceeds to recursive deep parsing or directly extracts metadata, depending on whether nested structures exist; if the field is deemed invalid, the system switches to the exception recovery branch, adjusts the offset to skip bad blocks, and then returns to the read loop. When the system reaches the logical end of the file, it further checks for any remaining physical data to extract hidden payloads and generates a unified structured representation as output, at which point the process ends.
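The read loop described above can be sketched as follows. This is a minimal illustration, not the engine of the invention: it assumes the size-prefixed block layout of the extra data section of a shortcut file (a 4-byte little-endian size, a 4-byte signature, and a size below 4 marking the terminal block), and the plausibility test, the anomaly label, and the function names are assumptions made for the example.

```python
import struct


def _plausible(buf: bytes, pos: int) -> bool:
    """Assumed heuristic: the size fits in the file and the signature
    looks like an extra data block signature (0xA000000x)."""
    if pos + 8 > len(buf):
        return False
    size, sig = struct.unpack_from("<II", buf, pos)
    return 8 <= size <= len(buf) - pos and (sig & 0xFFFFFFF0) == 0xA0000000


def parse_blocks(buf: bytes, offset: int = 0) -> dict:
    blocks, anomalies, overlay = [], [], b""
    while offset + 4 <= len(buf):
        (size,) = struct.unpack_from("<I", buf, offset)
        if size < 4:
            # Logical end of the structure: anything after it is overlay data.
            overlay = buf[offset + 4:]
            break
        if _plausible(buf, offset):
            (sig,) = struct.unpack_from("<I", buf, offset + 4)
            blocks.append({"offset": offset, "signature": hex(sig), "size": size})
            offset += size
            continue
        # Recovery branch: record the anomaly as non-fatal, then slide
        # forward byte by byte to the next header that looks valid.
        anomalies.append({"offset": offset, "type": "invalid_block_header",
                          "declared_size": size})
        nxt = next((p for p in range(offset + 1, len(buf) - 7)
                    if _plausible(buf, p)), None)
        if nxt is None:
            break  # nothing recoverable remains
        offset = nxt
    return {"blocks": blocks, "anomalies": anomalies, "overlay": overlay}
```

Each anomaly is kept in the result rather than raised as an exception, which mirrors the idea of treating a parsing failure as intelligence. A production parser would also need to guard against false-positive headers found during the forward scan.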
[0050] Step 3: Multi-dimensional Threat Feature Analysis
[0051] Based on the extracted high-fidelity metadata, an in-depth threat feature analysis model was constructed, proceeding from the outside in. This model is no longer limited to static matching of single features, but delves deeper from visual appearance camouflage detection to internal structure evasion identification, ultimately restoring the true execution logic. Through three-dimensional correlation analysis, it accurately determines the hidden intent in shortcut files and finally generates a preliminary structured threat detection report.
[0052] Specifically, this step, based on an outside-in threat model, performs deep feature analysis on the extracted metadata across three dimensions: deceptiveness, evasion, and execution. This step can automatically identify attack tactics unique to shortcut files: in the deceptiveness dimension, visual camouflage is identified by comparing the semantic consistency between the icon index and the target process; in the evasion dimension, command-line parameters obfuscated at the symbol or statement level are reconstructed using entropy analysis and pattern recognition algorithms; and in the execution dimension, fileless attacks that use native system tools are identified by standardizing the target path. The multi-dimensional threat feature analysis process is shown in Figure 3. The workflow is as follows: the system first takes a shortcut binary stream as the input for analysis. Subsequently, the analysis engine performs parallel detection across three dimensions: deception analysis, evasion analysis, and execution analysis. After completing feature extraction and analysis in these three dimensions, the system aggregates the results, performs multi-dimensional feature correlation, and derives the final threat characterization result through comprehensive evaluation, at which point the process ends. The specific analysis model unfolds as follows:
[0053] (1) Deceptive Feature Analysis Based on Semantic Conflict Judgment: Targeting the camouflage that attackers create through social engineering techniques, the system focuses on identifying visual camouflage and metadata forgery. By cross-referencing the icon resource index of the shortcut file with the parsed actual target program path, it can accurately detect semantic conflicts between the displayed icon and the system script interpreter that the shortcut actually points to. Simultaneously, it detects whether misleading information exists in the description field, thereby exposing the file's false appearance. The deceptive feature analysis process based on semantic conflict judgment is shown in Figure 4. In this workflow, the system first aggregates the various descriptive fields of a single shortcut file, then, through feature decoupling, divides the data into two parts: the descriptive fields and the actual program pointed to. Next, semantic tagging is performed on the decoupled elements, and finally, conflict determination is performed based on the generated tags to verify whether the appearance and the substance are consistent, at which point the process ends.
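The tagging and conflict determination described above can be illustrated with a small sketch. The tag names, the interpreter list, and the document hints are hypothetical choices for this example; a real consistency engine would need a much richer mapping from icon resources to application types.

```python
import ntpath

# Hypothetical lists chosen for illustration only.
INTERPRETERS = {"cmd.exe", "powershell.exe", "wscript.exe", "cscript.exe", "mshta.exe"}
DOCUMENT_HINTS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt", ".jpg", ".png")


def tag_target(target_path: str) -> str:
    """Semantic tag for the program the shortcut actually launches."""
    name = ntpath.basename(target_path).lower()
    return "script_interpreter" if name in INTERPRETERS else "other_program"


def tag_appearance(icon_location: str, description: str) -> str:
    """Semantic tag for what the shortcut appears to be to the user."""
    text = f"{icon_location} {description}".lower()
    return "document" if any(hint in text for hint in DOCUMENT_HINTS) else "unknown"


def has_semantic_conflict(target_path: str, icon_location: str, description: str) -> bool:
    """Flag shortcuts that look like a document but start an interpreter."""
    return (tag_appearance(icon_location, description) == "document"
            and tag_target(target_path) == "script_interpreter")
```

For instance, a shortcut whose description reads "Invoice.pdf" while its target is `C:\Windows\System32\cmd.exe` is flagged, whereas the same description with a non-interpreter target is not.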
[0054] (2) Identification and Detection of Evasion Techniques: To address the obstacles set by attackers to bypass static detection, the system analyzes structural anomalies and command-line obfuscation features in depth to detect whether special character padding or malformed structures are used to circumvent traditional scanning engines. At the instruction level, character entropy values are calculated and features are matched on command-line parameters to identify obfuscation techniques including random case mixing, insertion of numerous escape characters, and string concatenation, restoring the masked malicious instructions without executing the code. The identification and detection process for evasion techniques is shown in Figure 5. In this workflow, after the command-line parameters are input, the system sends them to the multi-dimensional evasion feature calculation engine for parallel processing. A threat score for each level is calculated through three independent branches: character level, lexical level, and encoding level. Finally, the scores are aggregated into an overall evasion feature score, at which point the process ends.
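The three scoring branches can be sketched as below. The formulas, the escape characters that are counted, and the regular expressions are assumptions made for this illustration; the patent text does not specify them. Only the Shannon entropy calculation is standard.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def character_score(cmd: str) -> float:
    """Escape-character density plus the rate of upper/lower case flips."""
    escapes = sum(cmd.count(ch) for ch in "^`")
    letters = [c for c in cmd if c.isalpha()]
    flips = sum(1 for a, b in zip(letters, letters[1:]) if a.isupper() != b.isupper())
    return escapes / max(len(cmd), 1) + flips / max(len(letters) - 1, 1)


def lexical_score(cmd: str) -> float:
    """Number of quoted fragments joined by '+' (string concatenation)."""
    return float(len(re.findall(r"['\"]\s*\+\s*['\"]", cmd)))


def encoding_score(cmd: str) -> float:
    """1.0 if an encoded-command switch or a Base64 decoder call appears."""
    return float(bool(re.search(r"-enc\w*|frombase64string", cmd, re.IGNORECASE)))


def evasion_scores(cmd: str) -> dict:
    scores = {
        "character": character_score(cmd),
        "lexical": lexical_score(cmd),
        "encoding": encoding_score(cmd),
    }
    scores["total"] = sum(scores.values())
    scores["entropy"] = shannon_entropy(cmd)
    return scores
```

The branches are independent functions, so they could be evaluated in parallel as the workflow describes; here they are simply summed into an unweighted total.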
[0055] (3) Standardized Reconstruction of Execution Logic: For the final call behavior of malicious payloads, the system standardizes the parsed target path and restores the absolute path by parsing environment variables, thereby accurately identifying the fileless attack call chain that uses the system's whitelisted programs to load malicious code. On this basis, the complex execution flow is reconstructed into a structured sequential output, clearly showing the complete logical path from initial startup to final execution of the malicious payload.
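A minimal sketch of the path standardization step follows. The environment map stands in for the victim host's variables and, like the list of native tools, is an assumption made for this example. Python's `ntpath` module is used so that Windows path rules apply regardless of the platform running the analysis.

```python
import ntpath
import re

# Assumed values standing in for the victim host's environment.
ASSUMED_ENV = {
    "SYSTEMROOT": r"C:\Windows",
    "WINDIR": r"C:\Windows",
    "COMSPEC": r"C:\Windows\System32\cmd.exe",
}
# Hypothetical shortlist of native tools often abused to load code.
ASSUMED_NATIVE_TOOLS = {"cmd.exe", "powershell.exe", "mshta.exe",
                        "rundll32.exe", "regsvr32.exe", "wscript.exe"}


def expand_env(path: str, env=ASSUMED_ENV) -> str:
    """Replace %NAME% references; unknown variables are left untouched."""
    return re.sub(r"%([^%]+)%",
                  lambda m: env.get(m.group(1).upper(), m.group(0)), path)


def normalize_target(path: str, env=ASSUMED_ENV) -> str:
    """Expand variables, then collapse redundant separators, '.' and '..'."""
    return ntpath.normpath(expand_env(path.strip().strip('"'), env))


def uses_native_tool(path: str) -> bool:
    """True if the normalized target is one of the listed native tools."""
    return ntpath.basename(normalize_target(path)).lower() in ASSUMED_NATIVE_TOOLS
```

Normalizing before comparison means that differently written forms of the same target, such as a path padded with `.\` segments or doubled separators, resolve to a single execution address.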
[0056] Step 4: Generate a standardized threat report
[0057] The engine transforms the results of structural analysis and in-depth analysis into actionable intelligence assets. By establishing standardized output interfaces, the system can convert complex multidimensional analysis data into a universal structured format, which not only meets the needs of in-depth analysis of single samples but also adapts to statistical mining scenarios of large-scale datasets. At the same time, by intuitively presenting key threat indicators, it significantly improves the efficiency and accuracy of security analysis.
[0058] This step generates a threat detection report by outputting standardized structured data and automatically marking detected structural anomalies and potential malicious features. The specific process for report generation is as follows:
[0059] (1) Preservation and output of hierarchical data structure: In view of the complex nesting characteristics of shortcut files, the system first generates a standardized JSON format report. This format fully preserves the hierarchical logic from the file header to the extended data block, and can map the subordinate relationship between the shell item list and the additional data without loss. This structured data is highly compatible with the import interface of various threat modeling tools, which facilitates analysts to conduct a visual panoramic analysis.
[0060] (2) Generation of flattened statistical data: In order to meet the needs of batch feature extraction and trend analysis of massive samples, the system synchronously generates flattened CSV format reports. By standardizing and expanding key metadata fields, researchers can quickly import data into relational databases or statistical analysis software, thereby efficiently carrying out cluster analysis and threat situation mining of large-scale samples.
[0061] (3) Automated highlighting of threat indicators: The system has built-in intelligent annotation logic in the output report, which can automatically identify and highlight key risk points found during the analysis process. This includes structural fields marked as abnormal, detected command line obfuscation techniques, and potential malicious payload storage locations, thereby helping security analysts focus on core threats as soon as possible and quickly complete the qualitative and forensic analysis of malicious samples.
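The three outputs described above can be illustrated with one small sketch: a nested node model serialized to JSON, a depth-first flattening into CSV rows, and a filter that surfaces nodes marked as abnormal. The node fields (`name`, `offset`, `anomalous`, `children`) are a hypothetical layout chosen for this example.

```python
import csv
import io
import json


def flatten(node: dict, parent: str = ""):
    """Depth-first walk that yields one flat row per tree node."""
    path = f"{parent}/{node['name']}" if parent else node["name"]
    yield {"path": path, "offset": node["offset"],
           "anomalous": node.get("anomalous", False)}
    for child in node.get("children", []):
        yield from flatten(child, path)


def to_json(node: dict) -> str:
    """Hierarchical report: the nesting of the tree is preserved as-is."""
    return json.dumps(node, indent=2)


def to_csv(node: dict) -> str:
    """Flattened report suitable for bulk import and statistics."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["path", "offset", "anomalous"])
    writer.writeheader()
    writer.writerows(flatten(node))
    return out.getvalue()


def highlighted(node: dict) -> list:
    """Rows an analyst should look at first: nodes flagged as abnormal."""
    return [row for row in flatten(node) if row["anomalous"]]
```

Because both reports are derived from the same tree, a row in the CSV can always be traced back to its position in the JSON hierarchy through the `path` column.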
[0062] In the above embodiments, the present invention provides a highly robust parsing method for shortcut files based on a heuristic recovery mechanism and a sliding window algorithm.
[0063] This invention designs a highly robust parsing engine to address the problem of existing parsing tools easily crashing or developing blind spots when dealing with tampered or malformed shortcut files. This engine abandons the traditional mode of relying entirely on the length declared in the file header for reading, and instead employs a sliding window algorithm to dynamically search for valid string terminators and structural boundaries in the binary stream. This allows it to locate the start and end positions of valid data even when the length field has been maliciously modified. Simultaneously, recursive parsing logic is introduced to deeply decode complex shell item identifier lists and nested extra data blocks, and a mechanism that converts anomalies into intelligence is established. When a parsing anomaly is captured, the program is not interrupted; instead, the anomaly type is recorded and the offset is automatically adjusted to retry parsing, ensuring that standard fields, extended metadata, and overlay data hidden after the file terminator are extracted to the greatest extent possible.
[0064] In the above embodiments, the present invention provides multi-dimensional feature detection based on semantic consistency verification and multi-level obfuscation analysis.
[0065] This invention proposes a multi-dimensional static threat detection method that identifies evasion techniques by correlating the visual attributes and execution logic of files. In terms of deception analysis, the system extracts the icon resource path of the shortcut file and the actual target program path, and performs a semantic consistency comparison to accurately identify visual camouflage. For obfuscation detection, it performs three-level analysis of shortcut command-line parameters: character-level, statement-level, and encoding-level. By calculating string entropy, statistically analyzing escape character and space densities, and identifying specific concatenation characters and encoding functions, it can reconstruct the intent of obfuscated malicious instructions without executing the code and detect structural concealment techniques that use excessively long parameters to push malicious code out of the visible range of the screen.
[0066] In the above embodiments, the present invention provides an automated homology clustering method driven by static parsing.
[0067] Based on the results of multi-dimensional analysis of malicious shortcuts, this invention can systematically extract features and perform cluster analysis on large-scale command line parameter structures, classify samples with similar parameter construction patterns, and automatically synthesize YARA detection rules accordingly, thereby achieving batch identification and defense against homologous variant attacks.
[0068] Specifically, in the implementation of the parsing engine of this invention, the following key processing flow is designed to address the vulnerability of shortcut file formats:
[0069] A. Sliding Window Search for Variable-Length Fields: For variable-length fields such as the description string, the length value defined in the file header is not accepted at face value. The parser maintains a sliding window that scans the binary stream byte by byte. For Unicode-encoded strings, the window searches for two consecutive null bytes as terminator candidates and combines this with a validity check on the preceding characters. This allows the true physical boundary of the string to be located precisely even if the length field has been tampered with (e.g., set to a maximum value to force an overflow, or to a negative value to force truncation).
[0070] B. Deep Recursive Structure Decoding: For substructures in the shell item identifier list, the engine has a built-in recursive parser capable of identifying and parsing undocumented type codes. For extension blocks nested in the additional data, a hierarchical traversal algorithm handles multi-layered nested data structures, preventing attackers from exploiting deep nesting to cause parser stack overflows or omitted content.
[0071] C. Overlay Data Extraction and Anomaly Recovery: After reading all standard-defined structure blocks, the parser continues scanning the end of the file until the physical terminator. Data blocks following the standard terminator are extracted as overlay data for subsequent hidden payload detection. Simultaneously, an anomaly state machine is established. When a field validation failure or out-of-bounds access is encountered, the system marks the current position as an anomaly point. Based on preset heuristic rules, the file pointer offset is automatically adjusted to skip corrupted areas and attempt to recover the parsing of subsequent data blocks.
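The overlay extraction in item C can be sketched as follows. The sketch assumes the convention in the public shell link format documentation that each extra data block starts with a 4-byte little-endian size and that a size below 4 marks the terminal block; the function name and the anomaly tuple layout are hypothetical.

```python
import struct


def split_extra_data(data: bytes, start: int):
    """Walk size-prefixed blocks from start; return (blocks, overlay, anomalies)."""
    blocks, anomalies = [], []
    pos = start
    while pos + 4 <= len(data):
        (size,) = struct.unpack_from("<I", data, pos)
        if size < 4:
            # Terminal block: everything after it is treated as overlay data.
            pos += 4
            break
        if pos + size > len(data):
            # Declared size runs past the end of the file: record it and stop,
            # leaving the unparsed remainder available for inspection.
            anomalies.append((pos, "block size exceeds file length"))
            break
        blocks.append(data[pos:pos + size])
        pos += size
    return blocks, data[pos:], anomalies
```

Returning the remainder in both the normal and the anomalous case means that bytes are never silently dropped, which is the property the hidden payload detection relies on.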
[0072] Specifically, this invention constructs a three-dimensional detection logic based on high-fidelity metadata extracted by the parsing engine:
[0073] A. Visual deception detection based on semantic mapping matrix:
[0074] The system extracts the icon resource index path of the shortcut file and the parsed actual target execution path, and performs a cross-comparison. The system has a built-in file type semantic mapping matrix, and for semantic conflicts where the icon resource points to a document processing program but the actual target path points to a script interpreter or system management tool, it directly identifies it as a visual spoofing attack.
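A minimal sketch of such a mapping matrix, assuming that the icon is classified by the file extension of its resource path and the target by its executable name. The category tables and the conflict set are hypothetical and far smaller than a practical deployment would need.

```python
from pathlib import PureWindowsPath

# Hypothetical category tables, for illustration only.
ICON_CATEGORY = {".pdf": "document", ".docx": "document", ".xlsx": "document",
                 ".jpg": "image", ".png": "image"}
TARGET_CATEGORY = {"cmd.exe": "interpreter", "powershell.exe": "interpreter",
                   "wscript.exe": "interpreter", "mshta.exe": "interpreter",
                   "winword.exe": "document", "acrord32.exe": "document"}
# (icon category, target category) pairs treated as a semantic conflict.
CONFLICTS = {("document", "interpreter"), ("image", "interpreter")}


def is_visual_spoof(icon_path: str, target_path: str) -> bool:
    """Return True when the icon and the target fall into conflicting categories."""
    icon_kind = ICON_CATEGORY.get(PureWindowsPath(icon_path).suffix.lower())
    target_kind = TARGET_CATEGORY.get(PureWindowsPath(target_path).name.lower())
    return (icon_kind, target_kind) in CONFLICTS
```

Unknown icons or targets map to `None` and therefore never match a conflict pair, so the check errs toward not flagging when the tables have no entry.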
[0075] B. Multi-level instruction obfuscation feature recognition algorithm:
[0076] Character-level dimension: The string complexity of the command-line arguments is calculated using Shannon entropy. At the same time, the density of escape characters and the proportion of unnatural mixing of upper and lower case are measured. When a statistic exceeds its preset dispersion threshold, the arguments are marked as exhibiting character-level interference aimed at defeating static signatures.
[0077] Lexical-level dimension: Using regular expression feature matching, the system identifies specific combination patterns of string concatenation operators and content replacement functions, in order to detect techniques that split malicious instructions into fragments to circumvent keyword matching.
[0078] Encoding-level dimension: Analyze the character distribution characteristics of the parameter string, identify different encoding sequences, hexadecimal escape sequences, and specific decoding API call characteristics, thereby discovering hidden encoded payloads.
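The character-level statistics above can be sketched as follows. Shannon entropy is computed as $H = -\sum_i p_i \log_2 p_i$ over the character frequencies $p_i$. The caret is used as the escape character because it is the one cited for cmd.exe arguments; the threshold values and the function names are assumptions, since the preset thresholds are not stated.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())


def caret_density(text: str) -> float:
    """Share of characters that are the cmd.exe escape character '^'."""
    return text.count("^") / len(text) if text else 0.0


def case_flip_ratio(text: str) -> float:
    """Fraction of adjacent letter pairs whose case differs (e.g. 'pOwEr')."""
    letters = [c for c in text if c.isalpha()]
    if len(letters) < 2:
        return 0.0
    flips = sum(a.isupper() != b.isupper() for a, b in zip(letters, letters[1:]))
    return flips / (len(letters) - 1)


def character_level_flags(text, entropy_limit=4.5, caret_limit=0.05, flip_limit=0.4):
    """Compare each statistic with a hypothetical threshold and list what fired."""
    flags = []
    if shannon_entropy(text) > entropy_limit:
        flags.append("high_entropy")
    if caret_density(text) > caret_limit:
        flags.append("escape_density")
    if case_flip_ratio(text) > flip_limit:
        flags.append("case_mixing")
    return flags
```

Keeping the three statistics separate lets the report name which interference technique was observed rather than emitting a single opaque score.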
[0079] C. Structured stealth detection:
[0080] The system detects the directory backtracking depth in the target path and identifies abnormal traversal behavior that uses deep relative paths to hide the real target. At the same time, it calculates the whitespace offset in the parameter field. If the whitespace length causes the starting position of the valid instruction to exceed the visible truncation threshold of the operating system property window, it is determined to be a covert attack that uses truncation characteristics.
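A minimal sketch of both structural checks. The visible truncation threshold is not given in the text; the value of 260 characters used here is an assumption for illustration, and the function names are hypothetical.

```python
import re
from pathlib import PureWindowsPath

# Assumed display limit of the properties window; not stated in the source text.
VISIBLE_LIMIT = 260


def traversal_depth(target: str) -> int:
    """Count the '..' components in a relative target path."""
    return sum(part == ".." for part in PureWindowsPath(target).parts)


def padding_hides_command(arguments: str, limit: int = VISIBLE_LIMIT) -> bool:
    """True when a whitespace run ends beyond the limit with content after it."""
    for match in re.finditer(r"\s{2,}", arguments):
        if limit <= match.end() < len(arguments):
            return True
    return False
```

For example, an argument string that starts with `/c`, continues with 500 spaces, and only then carries the real command would be flagged, while an ordinary `/c dir` would not.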
[0081] Specifically, this invention systematically generalizes command-line parameters from massive amounts of malicious samples (removing randomized IP addresses, filenames, and temporary paths while preserving parameter syntax and obfuscation patterns). It employs a string similarity-based clustering algorithm to group samples. For samples within the same cluster, the system extracts their longest common subsequence as a structured fingerprint and automatically converts it into standard YARA detection rules, thereby enabling batch detection of variant samples created using the same generation tool framework.
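The generalization step can be sketched with ordered regular expression substitutions. The three rules and the placeholder tokens below are hypothetical examples of removing IP addresses, temporary paths, and file names; a production rule set would be considerably broader.

```python
import re

# Hypothetical placeholder rules, applied in order.
RULES = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"%temp%\\\S+", re.IGNORECASE), "<TMPPATH>"),
    (re.compile(r"\b[\w-]+\.(?:exe|dll|vbs|js|ps1|bat|pdf)\b", re.IGNORECASE), "<FILE>"),
]


def generalize(command: str) -> str:
    """Replace sample-specific values with placeholders, keeping the syntax."""
    for pattern, placeholder in RULES:
        command = pattern.sub(placeholder, command)
    return command
```

Two commands that differ only in address and file names then collapse to the same generalized string, which is what makes the subsequent similarity clustering meaningful.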
[0082] The principle behind the implementation of the technical solution of this invention is as follows:
[0083] (1) Robust parsing principle based on sliding window and heuristic recovery
[0084] This principle aims to overcome the limitation of traditional parsers, which rely heavily on the lengths declared in the file header and therefore fail when those values are wrong. The system abandons the fixed-length reading mode and maintains a sliding window over the binary stream to dynamically scan for and verify consecutive double null bytes and structural boundaries. By relying on content features rather than declared lengths, it can precisely locate the physical terminator of a variable-length field even when the length field has been maliciously modified. In addition, a built-in exception state machine and recursive decoding logic ensure that the process is not interrupted when encoding violations or nested structure errors are encountered. Instead, the system automatically calculates a new offset based on preset magic number features, skips the corrupted area, and heuristically resumes parsing of subsequent data blocks, ensuring deep extraction of all metadata, including overlay data appended at the end of the file.
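A minimal sketch of magic-number-based recovery for size-prefixed blocks. It assumes that each block begins with a 4-byte size followed by a 4-byte signature, and uses the extra data block signatures from the public shell link format documentation as the magic numbers; the function names and the anomaly message are hypothetical.

```python
import struct

# Extra data block signatures 0xA0000001..0xA0000009, 0xA000000B, 0xA000000C.
KNOWN_SIGNATURES = {0xA0000001 + i for i in range(9)} | {0xA000000B, 0xA000000C}


def looks_valid(data: bytes, pos: int) -> bool:
    """A header is plausible if the signature is known and the size fits the file."""
    size, signature = struct.unpack_from("<II", data, pos)
    return signature in KNOWN_SIGNATURES and 8 <= size <= len(data) - pos


def resync(data: bytes, bad_pos: int):
    """Scan forward byte by byte for the next plausible block header."""
    for pos in range(bad_pos + 1, len(data) - 7):
        if looks_valid(data, pos):
            return pos
    return None


def parse_with_recovery(data: bytes, start: int = 0):
    """Return (blocks, anomalies); anomalies are recorded instead of raised."""
    blocks, anomalies, pos = [], [], start
    while pos is not None and pos + 8 <= len(data):
        size, signature = struct.unpack_from("<II", data, pos)
        if size < 4:  # terminal block
            break
        if looks_valid(data, pos):
            blocks.append((pos, signature, size))
            pos += size
        else:
            anomalies.append((pos, "invalid block header"))
            pos = resync(data, pos)
    return blocks, anomalies
```

Requiring both a known signature and a size that fits in the remaining bytes keeps the byte-wise rescan from locking onto arbitrary data that merely contains a signature value.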
[0085] (2) Static detection principle based on semantic mapping matrix and entropy features
[0086] This principle identifies disguises and obfuscations by constructing a multi-dimensional static feature analysis model. First, it establishes a semantic mapping relationship between file types for comparison, comparing the icon resource attributes of shortcut files with the parsed attributes of the actual target program, and uses logical mutual exclusion detection to determine visual deception. Second, it introduces the Shannon entropy algorithm from information theory to calculate the complexity of command-line parameters, combining regular expression feature matching and escape character density statistics to deeply reconstruct the obfuscated command intent at the character, lexical, and encoding levels. Finally, by detecting the directory backtracking depth of relative paths and the length of whitespace padding in parameters, it determines whether malicious code is being concealed by using the truncation threshold of the operating system's visualization framework.
[0087] (3) Homologous clustering principle based on structured fingerprint extraction
[0088] This principle aims to solve the problem of automated source tracing and defense against massive numbers of variant samples. The system first performs "generalization processing" on the parsed malicious LNK sample command-line parameters, removing randomized IP addresses, filenames, and temporary paths to eliminate interference, retaining only the core syntax structure and obfuscation patterns. Then, a string similarity-based clustering algorithm is used to group the samples, and the longest common subsequence of samples within the same cluster is extracted as the structured fingerprint of that group. Finally, the system automatically maps and converts the extracted fingerprints into standard YARA detection rules, enabling batch identification of homologous variant samples created using the same generation tools or attack frameworks.
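The fingerprint extraction and rule generation can be sketched as follows. Two simplifications are assumed: the longest common subsequence is computed over whitespace-separated tokens and folded pairwise across the cluster, which approximates a true multi-sequence result, and the YARA rule matches the literal tokens without enforcing their order. The function names and the minimum token length are hypothetical.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    out, i, j = [], rows, cols
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def cluster_fingerprint(commands):
    """Fold pairwise LCS over the tokenized commands of one cluster."""
    return reduce(lcs, (command.split() for command in commands))


def to_yara(rule_name, tokens, min_len=4):
    """Emit a YARA rule whose strings are the literal fingerprint tokens."""
    literals = [t for t in dict.fromkeys(tokens)
                if len(t) >= min_len and not (t.startswith("<") and t.endswith(">"))]
    if not literals:
        raise ValueError("fingerprint has no usable literal tokens")
    escaped = [t.replace("\\", "\\\\").replace('"', '\\"') for t in literals]
    strings = "\n".join(f'        $s{i} = "{t}" ascii wide nocase'
                        for i, t in enumerate(escaped))
    return (f"rule {rule_name}\n{{\n    strings:\n{strings}\n"
            f"    condition:\n        all of them\n}}\n")
```

Placeholder tokens such as `<FILE>` are excluded from the rule because they stand for values that were generalized away and do not occur in the raw samples.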
[0089] Based on the above embodiments, compared with the prior art, the present invention improves the detection efficiency of malicious code and can better assess the security of malicious code detection models. The implementation of the technical solution of the present invention will be further illustrated below with specific examples.
[0090] Example 1
[0091] Taking sample cfe7a0ff9bc9671f0849e3466973424502412876bf9339e887bcdd66729f572a as an example, the following illustrates the method for parsing and detecting malicious shortcut files based on a heuristic recovery mechanism.
[0092] 1. Dataset Construction and Feature Preprocessing
[0093] This embodiment first verifies the shortcut file threat intelligence dataset constructed above. The dataset integrates publicly available samples from VirusShare and samples captured through enterprise insider threat hunting. After cleaning and deduplication, the total number of samples reaches over 180,000. In this embodiment, a highly socially engineered deceptive sample is selected for detailed analysis. The file name is Employees_Affected_by_Transition.pdf.lnk, with a SHA-256 value of cfe7a0ff9bc9671f0849e3466973424502412876bf9339e887bcdd66729f572a and an MD5 value of f3f9fec06f32c379307faeeffd6d94c8. This sample disguises itself as a PDF document about the impact of employee transitions, inducing victims to click on it.
[0094] 2. Robust engine-based parsing and metadata extraction: The binary file was processed using the LNKer parsing engine described in this invention. To handle the file structure anomalies in this sample, the engine first checked the byte order identifier (magic number) in the file header and adapted to the endianness automatically. It then invoked the sliding window algorithm and the heuristic recovery mechanism, successfully locating and repairing the structural boundaries. Through recursive parsing logic, the engine extracted complete, high-fidelity metadata, including the target path C:\Windows\System32\cmd.exe, command-line parameters spanning thousands of characters, and overlay data hidden at the end of the file, providing a precise data foundation for the subsequent multi-dimensional analysis.
[0095] In the structured JSON output of this embodiment, each metadata structure was effectively extracted. The header field fully records the file attributes, link_info accurately restores the local base address path, and extra_data records the additional data segments.
[0096] 3. Multi-dimensional threat characteristic analysis
[0097] Based on the extracted metadata, the system automatically initiates a multi-dimensional threat analysis model to qualitatively analyze the sample along three dimensions: deceptiveness, evasion, and execution. During threat detection, the system outputs log information to the console for each detection dimension. First, the system performs deceptive feature analysis, consisting of semantic consistency checks and social engineering trap detection. It discovers a serious semantic mismatch in the sample: the icon resource points to .\Document.pdf, attempting to display the appearance of a PDF document, but the actual target program is cmd.exe. This contradiction of "visually a document, logically an executable program" directly identifies a visual camouflage attack. Furthermore, the system detects that the filename uses a double file extension (.pdf.lnk), and that the sample opens the decoy document salary_report.pdf when it runs, further confirming its deceptive intent.
[0098] Secondly, the system identifies and detects evasion techniques. It performs deep analysis on command-line parameters that are thousands of characters long and identifies a variety of techniques that evade static scanning: In character-level analysis, the Shannon entropy of the parameter string is as high as 5.92 and contains 42 caret characters (^), indicating that encryption or randomization is involved; In lexical analysis, the system discovers a "whitespace injection" technique, which inserts more than 500 consecutive spaces to squeeze the malicious command out of the visible range of the property window, while using environment variable fragmentation (%C%^%o%...) to bypass keyword detection.
[0099] In the encoding-level analysis, the system found that 2 KB of invisible Unicode characters had been appended to the end of the file to change the file hash value. The system ultimately marked features such as high entropy, obfuscation, and resistance to static analysis, and successfully restored the obscured CreateObject(WinHttp) call logic.
[0100] Finally, the execution logic was standardized and restored. The system standardized and reconstructed the parsed parameters and drew a complete attack logic tree: after the user clicks the shortcut file, cmd.exe is launched as the initial launcher, using the echo command to drop a malicious script while opening the decoy file at the same time; wscript.exe is then called to execute the script, and the system detects that forfiles.exe is used for proxy execution to hide the parent-child process relationship; finally, the script connects to the C2 server, downloads a payload, and loads it directly into memory for execution. The sample is therefore classified as a typical "fileless landing-assisted attack".
[0101] 4. Standardized report generation and security assessment
[0102] The system generates a standardized threat detection report for each input file, automatically highlighting detected threat indicators. The report details key threat indicators, including C2 addresses, malicious behaviors such as droppers and indirect execution, spoofing techniques like PDF icon disguise and dual file extensions, and specific obfuscation tags such as whitespace injection and high entropy values. This result demonstrates that the multi-dimensional feature reconstruction and parsing technology of this invention can effectively penetrate complex obfuscation methods, accurately identify and reconstruct the true intent of unknown variant shortcut attacks.
[0103] Based on this, in order to verify the source tracing capability of the present invention in a large-scale threat scenario, the system included the sample in a dataset containing 180,000 samples for batch processing.
[0104] The system further performed source analysis on the batch-input LNK sample set. First, the system generalized the command-line parameters of all parsed malicious samples, using regular expression replacement to remove randomized IP addresses, dynamically generated temporary filenames, and varying paths, retaining only the core syntax structure and obfuscation patterns of the malicious commands. Then, a string similarity clustering algorithm based on edit distance was used to group the samples, dividing the malicious samples in the test set into multiple independent attack family clusters. For each cluster, the system extracted the longest common subsequence of the command-line parameters of all its samples as the structured fingerprint of that family and automatically mapped it into standard YARA detection rules. The visualization results of the clusters demonstrate the homology of the shortcut files within each malicious family. This enables batch identification of, and defense against, homologous variant samples created using the same generation tools or attack frameworks.
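A minimal sketch of edit-distance clustering over generalized command lines. The greedy single-pass assignment, the normalization by the longer string, and the 0.8 similarity threshold are assumptions chosen for brevity; the text does not specify the clustering procedure beyond its use of edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized into a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def cluster(commands, threshold=0.8):
    """Assign each command to the first cluster whose representative is similar."""
    clusters = []  # the first member of each list acts as the representative
    for command in commands:
        for members in clusters:
            if similarity(command, members[0]) >= threshold:
                members.append(command)
                break
        else:
            clusters.append([command])
    return clusters
```

The quadratic cost of pairwise edit distance is the main practical concern at the scale of the dataset described, so a real pipeline would likely bucket candidates first, for example by length or token set, before comparing them.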
Claims
1. A method for parsing and detecting malicious shortcut files using a heuristic recovery mechanism, characterized in that the process of parsing and extracting threat signatures from Windows shortcut files with abnormal structures or malicious intent involves four steps: Step 1: Build a shortcut file threat intelligence dataset by integrating enterprise threat logs and public intelligence sources to obtain a massive number of malicious shortcut samples; Step 2: Use a heuristic-based robust parsing engine to process the binary data of the shortcut file; instead of terminating the parsing process when encountering structurally abnormal or non-compliant fields, a heuristic recovery strategy is employed to attempt to repair or skip erroneous areas in order to extract the maximum amount of metadata; Step 3: Multi-dimensional threat feature analysis. Based on the metadata extracted in Step 2, an in-depth threat feature analysis model is constructed from the outside in, analyzing threat features from three dimensions: deceptiveness, evasion, and execution. In the deceptiveness dimension, the semantic consistency between the shortcut icon index and the target process path is compared. In the evasion dimension, the obfuscated command line parameters are restored through entropy analysis and pattern recognition algorithms. In the execution dimension, the call chain is extracted through the standardized target path; Step 4: Output standardized structured data, mark detected structural anomalies and potential malicious features, and generate a threat detection report; The robust parsing engine described above uses dynamic byte order processing and a sliding window algorithm to accurately locate terminators in variable-length fields; it uses a recursive parser to process nested file structures and extra expansion blocks to identify and extract payloads hidden after the end of the file or in overlay data.
The heuristic recovery strategy described above treats parsing failure as a threat intelligence. When the parsing engine detects encoding tampering, abnormal length fields, or undocumented structure types, the system records the anomaly type and automatically adjusts the parsing parameters for iterative extraction, ensuring that critical attack chain information can still be restored even if the file structure is damaged. The threat signature analysis described above can automatically identify attack tactics unique to shortcut files. In terms of deception, it relies on a multi-dimensional attribute association and consistency verification engine to extract and analyze icon resources and actual executables, accurately discovering visual disguises and semantic conflicts. In terms of evasion, a lexical parsing and normalization algorithm based on Abstract Syntax Tree (AST) is introduced to perform semantic dimensionality reduction and reorganization on command line shells that have been obfuscated at the command level or statement level, thereby achieving high-fidelity restoration of hidden malicious commands. In terms of execution, based on the lexical path normalization algorithm and dynamic evaluation engine, the absolute physical path is intercepted and restored, thereby accurately identifying fileless attack call chains that use the Windows system's native whitelist tool to load malicious code.
2. The method according to claim 1, characterized in that, The construction of the dataset in step one includes the following steps: (1) Collection of malicious samples from multiple sources: The collection scope covers log records in the real network environment of partner companies and public threat intelligence sources from VirusShare. After cleaning and deduplication, a total of 100,000 malicious shortcut file samples were finally retained. (2) Threat intelligence enrichment: Based on the acquisition of original binary file samples, the sample hash value is used to perform batch retrieval of VirusTotal, construct the mapping relationship between samples and intelligence, and realize the temporal alignment and family classification of original binary files and structured intelligence tags.
3. The method according to claim 1, characterized in that, Step two involves processing the binary data of the shortcut file, which includes the following steps: (1) Dynamic processing and sliding window positioning: By using the sliding window algorithm to dynamically scan, the parsing engine does not simply rely on the length value declared in the file header for variable-length fields and specific encoded strings in the file binary stream; (2) Recursive decoding of complex nested structures: For the complex nested structures of shell item identifier lists and additional data blocks that are widely present in shortcut files, a self-developed deep recursive parsing method is used for processing; (3) Heuristic recovery of abnormal states: an abnormal fault tolerance and recovery mechanism is established. When an invalid segment head size or an unknown structural abnormality is encountered during the parsing process, the engine will not interrupt the program. Instead, it will mark the current state as abnormal rather than fatal. The parsing engine will iteratively adjust the offset of the file pointer through a heuristic algorithm, try to skip the damaged area and automatically search for the header features of the next valid structure, thereby realizing the recovery parsing of subsequent data and maximizing the recovery of the remaining valid information in the file.
4. The method according to claim 1, characterized in that, The analysis model in step three is carried out according to the following steps: (1) Deceptive feature analysis based on semantic conflict judgment: In response to the disguise made by attackers using social engineering, the system focuses on identifying visual disguise and metadata forgery behavior. Through multi-dimensional attribute association and consistency verification engine, it extracts and analyzes the icon resource index of shortcut file and the actual target program path after parsing, accurately discovers the semantic conflict between icon display and actual application, and detects whether there is misleading information in the description field. (2) Identification and detection of evasion tactics: In response to the obstacles set up by attackers to bypass static detection, the system conducts in-depth analysis of structural anomalies and command line obfuscation features to detect whether there is any behavior of using special characters to fill or malformed structures to evade traditional scanning engines. At the instruction level, a lexical parsing and normalization algorithm based on Abstract Syntax Tree (AST) is introduced to perform semantic dimensionality reduction and reorganization on obfuscated shells in cases of random case mixing, insertion of a large number of escape characters, and string concatenation, so as to achieve high-fidelity restoration of obscured malicious instructions without executing the code. (3) Normalization and restoration of execution logic: For the final call behavior of malicious payload, the system standardizes the parsed target path based on the lexical path normalization algorithm, restores the absolute path by parsing environment variables, thereby identifying the fileless attack call chain that uses the system whitelist program to load malicious code, and reconstructs the complex execution process into a structured sequential output.
5. The method according to claim 1, characterized in that, The report generation in step four is performed as follows: (1) Preservation and output of hierarchical data structure: In view of the complex nesting characteristics of the shortcut file, the system uses a tree serialization algorithm based on depth-first traversal to dump the parsing results and then generate a standardized JSON format report. This format completely preserves the hierarchical logic from the file header to the extended data block and can map the subordinate relationship between the shell item list and the additional data without loss. (2) Generation of flattened statistical data: In order to meet the needs of batch feature extraction and trend analysis of massive samples, the system synchronously generates flattened CSV format reports by standardizing and expanding key metadata fields; (3) Automated highlighting of threat indicators: In the output report, the system’s built-in automatic labeling algorithm based on multi-feature association pattern matching automatically identifies and highlights key risk points found during the analysis process, including structural fields marked as abnormal, detected command line obfuscation technology features, and potential malicious payload storage locations, thereby focusing on core threats at the first time and quickly completing the qualitative and evidentiary work of malicious samples.