Automobile sensitive data detection system based on multi-modal semantic context analysis

The vehicle sensitive data detection system, which utilizes multimodal semantic context analysis, addresses the issues of low accuracy and insufficient automation in existing vehicle network packet detection technologies. It achieves efficient and accurate sensitive data detection and risk assessment, thereby enhancing the system's automation and security.

CN122309742A · Pending · Publication Date: 2026-06-30 · CATARC AUTOMOTIVE TECH (SHANGHAI) CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CATARC AUTOMOTIVE TECH (SHANGHAI) CO LTD
Filing Date
2026-03-09
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies for detecting sensitive vehicle data in automotive network packets suffer from low accuracy and high false alarm rates, making it impossible to achieve automated and large-scale detection. Furthermore, they lack the ability to correlate sensitive information with network communication context.

Method used

A vehicle sensitive data detection system based on multimodal semantic context analysis is adopted, including a data packet uploading module, a payload extraction module, a multimodal hybrid detection engine, a context association module, and a result generation and presentation module. Through keyword matching, regular expression matching, and local dictionary comparison, combined with semantic context analysis, the system realizes automated, end-to-end detection and analysis of sensitive information in vehicle network data packets.

Benefits of technology

It improves detection efficiency, enhances the accuracy and coverage of identifying vehicle-specific sensitive data, strengthens the traceability and risk assessment capabilities of detection results, and ensures the security and robustness of the system processing.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a vehicle sensitive data detection system based on multimodal semantic context analysis, specifically relating to the fields of network data security and automotive electronics technology. The system includes: a data packet upload module, a payload extraction module, a multimodal hybrid detection engine, a context association module, and a result generation and presentation module. The data packet upload module receives and verifies the PCAP file; the payload extraction module parses the data packet and extracts the application layer payload; the multimodal hybrid detection engine integrates keyword matching, regular expression matching, and local dictionary comparison units to perform parallel scanning of the payload text; the context association module binds the detection results with network quintuple information; and the result generation and presentation module generates a structured audit report and displays it visually. This invention achieves automated detection of sensitive information in automotive network data packets, improves the accuracy of identifying unique data such as VINs, geographic coordinates, and on-board diagnostic fault codes, and enhances the traceability and risk assessment capabilities of the detection results.

Description

Technical Field

[0001] This invention relates to the fields of network data security and automotive electronics technology, and more specifically, to a vehicle sensitive data detection system based on multimodal semantic context analysis.

Background Technology

[0002] With the continuous improvement of vehicle connectivity and intelligence, the data interaction between in-vehicle networks and the external environment is becoming increasingly frequent, including a large amount of sensitive data such as vehicle identification numbers, geographical locations, and diagnostic fault codes. Currently, existing technologies for detecting sensitive information in network data packets mainly rely on two types of methods: one is to use general data loss prevention systems to scan data streams using predefined keywords or regular expressions; the other is to use professional network analysis tools for manual analysis, where technicians filter and inspect captured data packets based on their experience.

[0003] However, these approaches have shortcomings in practice. General-purpose data loss prevention (DLP) systems are not designed for the specific formats and protocols of automotive data and lack dedicated detection rules for strictly structured sensitive data such as VINs and on-board diagnostic fault codes, resulting in low detection accuracy and a high false alarm rate. Manual analysis with professional tools is extremely inefficient, cannot cope with massive amounts of in-vehicle network data, and demands considerable expertise and experience from analysts, so it cannot meet the needs of automated, large-scale detection. Existing technologies also usually focus only on the data content itself and cannot automatically correlate detected sensitive information with its network communication context, which makes it difficult for security audits to trace the source of a data leak and assess its actual risk level.

Summary of the Invention

[0004] To overcome the aforementioned deficiencies of the prior art, the present invention provides a vehicle sensitive data detection system based on multimodal semantic context analysis, which can solve the problems mentioned in the background art through the following solution.

[0005] To achieve the above objectives, the present invention provides the following technical solution: a vehicle sensitive data detection system based on multimodal semantic context analysis, comprising:
Data packet upload module: Receives the network packet capture (PCAP) file uploaded by the user through the web front-end interface, saves the PCAP file to the server, and generates the PCAP file to be detected;
Payload extraction module: Loads the PCAP file to be detected, parses and extracts the payload data of each data packet, and generates a payload text set;
Multimodal hybrid detection engine: Integrates a keyword matching unit, a regular expression matching unit, and a local dictionary comparison unit. The regular expression matching unit is pre-configured with matching rules corresponding to the strictly predefined format of the vehicle identification number (VIN). The engine is configured to use each unit to scan the payload text set in parallel and output a sensitive information matching result set. The sensitive data types in this result set include the vehicle identification number (VIN), geographic coordinate data, and on-board diagnostic system fault codes;
Semantic context analysis engine: Receives the preliminary matching result set and, for each suspected sensitive data record, constructs a context sliding window of a specific byte length centered on the record's position in the payload text and extracts the neighboring text within the window. By analyzing the semantic logic of the neighboring text, it performs secondary verification and inference on the attributes of the suspected sensitive data, thereby reducing the false alarm rate, and outputs the final confirmed sensitive information matching result set;
Context association module: Receives the sensitive information matching result set and, for each record in it, extracts the network quintuple information from its source data packet to generate an association result set;
Result generation and presentation module: Receives the association result set, formats it into a structured audit report, and returns the audit report to the web front-end interface for visualization.

[0006] The technical effects and advantages of this invention are as follows:
Automated, end-to-end detection and analysis of sensitive information in automotive network data packets: through a modular design covering the entire process from data packet upload, payload extraction, multimodal detection, and context association to report generation, the invention turns network data analysis that was tedious and dependent on manual experience into a highly automated processing pipeline, which greatly improves detection efficiency and can quickly meet the auditing needs of massive volumes of vehicle data packets;
Improved accuracy and coverage of vehicle-specific sensitive data identification: the multimodal hybrid detection engine integrated in the system combines three strategies: keyword matching, regular expression matching, and local dictionary comparison;
Enhanced traceability and risk assessment of detection results: through the context association module, the system automatically associates and binds the detected sensitive information with its original network communication context;
Assured system security and robustness: the data packet upload module includes strict file format verification, security scanning, and size limit checks, effectively preventing malicious file uploads and resource exhaustion attacks.

Attached Figure Description

[0007] Figure 1 is a schematic diagram of the overall structure of the present invention.

[0008] Figure 2 is a schematic diagram of the working structure of the data packet upload module of the present invention.

[0009] Figure 3 is a schematic diagram of the working structure of the multimodal hybrid detection engine of the present invention.

[0010] Figure 4 is a schematic diagram of the working structure of the result generation and presentation module of the present invention.

Detailed Implementation

[0011] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0012] Referring to Figures 1-4, the automotive sensitive data detection system based on multimodal semantic context analysis includes: a data packet upload module, a payload extraction module, a multimodal hybrid detection engine, a context association module, and a result generation and presentation module. The modules communicate through predefined interfaces and data protocols, and together they form an automated processing pipeline from data input to audit report output.

[0013] The data packet upload module serves as the entry point for the initial interaction between the system and the user. Its core function is to reliably and securely receive network traffic data files provided by the user and prepare the data source for subsequent sensitive data detection processes. This module achieves trusted transmission and storage of files from the user's end to the server through the collaborative operation of the web front-end and back-end services. The specific implementation process of this module is as follows: S101: User Front-End Interaction: Users access the system through the web front-end interface provided by the system. The front-end interface offers dedicated file upload controls. Users can select one or more locally stored network packet capture files (i.e., PCAP format files) by clicking or dragging.

[0014] S102: File Upload Request Initiation: After the user confirms the file selection, the front-end JavaScript code captures the file object and constructs a FormData object that includes the selected PCAP file as part of the form data. The front-end initiates an HTTP POST request to the system's back-end server using AJAX (Asynchronous JavaScript and XML) or the Fetch API. The target address of this request is the back-end's dedicated file upload interface. In the request header, the Content-Type is set to multipart/form-data to support file transfer.

[0015] S103: Backend Interface Reception and Preliminary Processing: The backend server continuously listens for the file upload interface. Upon receiving an upload request, the backend-configured file parsing module parses the request and extracts the PCAP file stream. The server first temporarily stores the file in memory or a temporary directory to perform subsequent verification and saving operations.

[0016] S104: File Validation and Security Check: To ensure system stability and security, the backend executes a series of validation logic before saving files. Format verification: Check the file format identifier of the file to confirm whether it is a valid PCAP file format; Size limit: Checks whether the file size is within the system's preset threshold range to prevent malicious users from uploading excessively large files and exhausting server resources; Security Scan: Call the security scanning service to detect malicious code in the file.

[0017] If any validation fails, the backend will immediately abort the process and return an HTTP 4xx status code and error message to the frontend, containing the specific reason for the error. Upon receiving the error message, the frontend will notify the user through a user interface prompt component.

[0018] It should be further explained that format verification is divided into magic number and global header verification and data packet sample verification. Among them, magic number and global header verification: the backend reads the first 24 bytes of the file and parses the PCAP global header; firstly, it checks whether the magic number is a recognized identifier such as standard little endian or standard big endian. At the same time, it verifies whether the version number (such as major version 2, minor version 4) is within the compatibility range of the parsing library supported by this system. If this step fails, the file is determined to be in invalid PCAP format, the process is immediately interrupted and an HTTP 400 error is returned, thereby avoiding any unnecessary subsequent processing of invalid files. Packet sample verification: To ensure that the file is not only "header correct" but also "content correct", the system will attempt to read and parse the first N (e.g., N=5) packet records in the file; check whether the timestamp, capture length and original length fields of each packet header are reasonable (e.g., capture length ≤ original length, timestamp is not a future value), and verify whether the packet payload can be safely accessed.

[0019] Only files that pass both magic number and sample verification will be considered structurally complete PCAP files, and only then will the process be allowed to proceed to the next stage.
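
The two format checks described above can be sketched in Python with the standard `struct` module. This is a minimal sketch, not the system's actual validator: it assumes the classic libpcap layout (24-byte global header, 16-byte per-packet record header), recognizes only the two microsecond-resolution magic numbers, and the function names and `supported` version list are hypothetical.

```python
import struct

# Classic libpcap magic number 0xA1B2C3D4 as it appears on disk:
# written by a big-endian host, or byte-swapped by a little-endian host.
MAGIC_BYTE_ORDER = {
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian file
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian file
}


def check_global_header(header: bytes, supported=((2, 4),)):
    """Return the struct byte-order prefix if the global header is acceptable, else None."""
    if len(header) < 24:
        return None
    order = MAGIC_BYTE_ORDER.get(header[:4])
    if order is None:
        return None
    major, minor = struct.unpack(order + "HH", header[4:8])
    return order if (major, minor) in supported else None


def check_record_header(record: bytes, order: str, now: float) -> bool:
    """Plausibility check for one per-packet record header (first 16 bytes)."""
    if len(record) < 16:
        return False
    ts_sec, _ts_frac, incl_len, orig_len = struct.unpack(order + "IIII", record[:16])
    # Capture length must not exceed the original length; timestamp must not be in the future.
    return incl_len <= orig_len and ts_sec <= now
```

A caller would reject the upload with HTTP 400 when `check_global_header` returns `None`, then apply `check_record_header` to the first N records, passing `time.time()` as `now`.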

[0020] It should be further explained that the security scan is divided into two parts: initiating the scan task and judging the scan results. Initiating the scan task: For files that pass the first two layers of verification, the system calls the integrated security scan service. This service adopts a multi-engine strategy, including using antivirus engines such as ClamAV to scan for known malware based on a signature database, and using a custom rule engine (such as YARA rules) to match malicious traffic characteristics commonly seen in penetration testing. Judging the scan results: The scan service returns a status (e.g., "Clean", "Infected", "Suspicious"). Only a "Clean" status indicates that the file has passed all security verifications and may enter stage S105. If the status is "Infected" or "Suspicious", the file is immediately rejected, an HTTP 422 error is returned, and the file can be moved to a quarantine area for administrator analysis.

[0021] It should be further explained that the preset threshold range is 100MB. Setting the threshold to 100MB can cover the vast majority of legitimate and analytically valuable vehicle data packet files, ensuring that normal user experience is not restricted. The size of the PCAP file is linearly related to the number of data packets, and the processing time and memory consumption of the subsequent parsing, payload extraction, and detection modules are also positively related to the number of data packets. The 100MB threshold can effectively prevent the upload of excessively large files at one time, keeping the impact of potential resource attacks within a limited range. For a 100MB PCAP file, under normal server configuration, the processing time of the entire pipeline (parsing, extraction, detection, and correlation) is acceptable. If GB-level files are allowed to be uploaded, the processing time will be as long as tens of minutes or even hours, which does not meet the response expectations of interactive web applications.

[0022] S105: Persistent File Storage: After all verifications are passed, the backend program moves the PCAP file from the temporary storage area to a designated persistent storage directory on the server with access control. To avoid filename conflicts and overwriting, and to provide a unique identifier for subsequent processes, the system generates a unique file identifier for each successfully uploaded file. This file identifier is generated according to a preset naming rule, which includes at least one of the following: original file information, time information, and a random factor. The mapping relationship between the file storage path and the file identifier is recorded in the server's database or cache.

[0023] It should be further explained that the default naming rule is as follows: the naming rule for the file identifier is a string consisting of multiple fields. These fields together ensure the global uniqueness, readability, and traceability of the identifier. The rule format is as follows: {Timestamp}_{ServerID}_{RandomString}_{OriginalFileHash}_{SanitizedOriginalName}; The definitions and reasons for generating each field are as follows: Timestamp: Uses a UTC timestamp accurate to milliseconds, in the format YYYYMMDDHHMMSSmmm; Reason: It provides the precise upload time of files, facilitating sorting, retrieval, and archiving by time. Millisecond-level precision greatly reduces the probability of the same server generating duplicate identifiers at the same time.

[0024] ServerID: In a distributed or load-balanced environment, each server instance has a unique identifier (such as an instance ID, container ID, or a simplified server IP address range obtained from environment variables or a configuration center). Reason: In multi-server deployment scenarios, this field ensures that even if different servers upload files at the same time, the generated identifiers will not conflict, which is the key to ensuring global uniqueness.

[0025] RandomString: Generates a sufficiently long random string using a cryptographically secure random number generator (e.g., 8-12 characters long, with a character set of uppercase and lowercase letters and numbers). Reason: As the last line of defense against collisions. Even if the first two fields are duplicated in extreme cases (such as server time rollback), the random string can almost guarantee the uniqueness of the identifier. At the same time, it also increases the difficulty for malicious users to guess the file path.

[0026] OriginalFileHash: Calculates a short hash value for the original file content uploaded by the user (e.g., using the SHA-256 algorithm and taking the first 8 characters).

[0027] Reason: This field provides a fingerprint of the file's content and has two important functions: Deduplication: The system can quickly determine whether a user has already uploaded a file with the exact same content based on this hash value, thereby avoiding the storage of redundant data; Integrity verification: In subsequent processing, the hash value can be compared to verify that the file has not been tampered with during transmission and storage.

[0028] SanitizedOriginalName: This feature secures the original filename uploaded by the user by removing path information (to prevent path traversal attacks), replacing special characters (such as replacing non-alphanumeric characters with underscores), and truncating it to a certain length (such as keeping the first 50 characters). Reason: The semantic information of the original file is preserved, making it easier for administrators and users to identify the file's origin.
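
The naming rule {Timestamp}_{ServerID}_{RandomString}_{OriginalFileHash}_{SanitizedOriginalName} can be illustrated with a short Python sketch. The function names, the 10-character random length (within the stated 8-12 range), and the way `server_id` is supplied are assumptions made for illustration.

```python
import hashlib
import re
import secrets
import string
from datetime import datetime, timezone

ALPHABET = string.ascii_letters + string.digits


def sanitize_name(original_name: str, max_len: int = 50) -> str:
    # Drop directory components (both separator styles) to block path traversal.
    base = original_name.replace("\\", "/").rsplit("/", 1)[-1]
    # Replace every character that is not an ASCII letter or digit, then truncate.
    return re.sub(r"[^A-Za-z0-9]", "_", base)[:max_len]


def build_file_identifier(content: bytes, original_name: str,
                          server_id: str, now=None) -> str:
    now = now or datetime.now(timezone.utc)
    # UTC timestamp with millisecond precision: YYYYMMDDHHMMSSmmm
    timestamp = now.strftime("%Y%m%d%H%M%S") + f"{now.microsecond // 1000:03d}"
    # Cryptographically secure random string (secrets module).
    random_part = "".join(secrets.choice(ALPHABET) for _ in range(10))
    # First 8 hex characters of the SHA-256 digest of the file content.
    short_hash = hashlib.sha256(content).hexdigest()[:8]
    return "_".join([timestamp, server_id, random_part, short_hash,
                     sanitize_name(original_name)])
```

Note that an 8-character hash prefix is convenient as a quick deduplication hint, but a deployment that relies on it for integrity checks would presumably compare the full digest.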

[0029] S106: Upload Result Feedback: After the file is successfully saved, the backend interface constructs a successful HTTP response. The response body contains structured JSON data, which includes at least the operation status, file identifier, and storage path information. Upon receiving the successful response, the frontend JavaScript parses the JSON data and displays a success message to the user on the interface. Simultaneously, the frontend saves the file identifier for subsequent triggering of detection requests.

[0030] S107: Status Maintenance and Detection Queue: After the data packet upload module completes file storage, it updates the system's task status. It marks the newly uploaded PCAP file as "to be detected" and adds its information to the task queue. The generated file identifier will be passed as a key parameter to the payload extraction module for locating and loading the PCAP file. At this point, the data packet upload module has completed its responsibilities.

[0031] Through the above process, the data packet upload module realizes the reception, verification, storage and status management of PCAP files submitted by users, providing a reliable data source entry point for subsequent sensitive data detection.

[0032] The payload extraction module is the system's data processing unit. It receives a file identifier from the data packet upload module and its task is to convert the uploaded binary PCAP format file into a set of plain text payloads that can be processed by the subsequent detection engine. This module implements the step of extracting application layer payload data from network data packets. The specific implementation flow of this module is as follows: S201: Loading the PCAP file to be detected: When the system receives the detection command, the backend service is triggered. The backend program locates and reads the corresponding PCAP file in the server's persistent storage directory based on the file identifier provided by the packet upload module. The system calls a dedicated network packet parsing library to load the file, and uses the file reading function provided by the parsing library to read the PCAP file into memory, forming a list of packet objects.

[0033] S202: Traversing Data Packets and Filtering Valid Packets: The module iterates through the list of data packet objects loaded into memory, accessing each data packet one by one. To filter out data packets containing application data, the system performs a preliminary filter: Protocol filtering: Checks whether the data packet carries a Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) payload; Payload existence check: For data packets that meet the protocol requirements, checks whether they contain valid application layer payload data.

[0034] It should be further explained that protocol filtering is divided into network layer protocol filtering and transport layer protocol filtering. Network layer protocol filtering: Parse the Ethernet frame header of the data packet and check its Ethernet type field. Only when the value of this field is 0x0800 is it determined to be an IPv4 data packet; or when the value is 0x86DD, it is determined to be an IPv6 data packet. Data packets that conform to the IPv4 or IPv6 standard are retained and enter the next layer of filtering; non-IP data packets (such as ARP requests, LLDP frames, etc.) are skipped directly in this step because they do not contain a resolvable application layer payload. Transport layer protocol filtering: Parse the IP header and check its protocol number field. Only when the value of this field is 6 is it determined to be a TCP packet; or when the value is 17, it is determined to be a UDP packet. Packets identified as TCP or UDP are retained and enter deep inspection. Although packets of other protocols (such as ICMP, IGMP, etc.) are also IP packets, their payload content is network control information rather than business data, so they are skipped.
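
The two-layer protocol filter can be sketched directly on raw frame bytes. This is a minimal sketch under stated assumptions: frames are untagged Ethernet II (no VLAN tag), IPv6 extension headers are not followed, and the function name is hypothetical. A real implementation would more likely rely on the packet parsing library mentioned in S201.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD
PROTO_TCP = 6
PROTO_UDP = 17
ETH_HEADER_LEN = 14


def transport_protocol(frame: bytes):
    """Return "TCP" or "UDP" for frames that pass both filters, else None."""
    if len(frame) < ETH_HEADER_LEN:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ETHERTYPE_IPV4:
        if len(frame) < ETH_HEADER_LEN + 20:
            return None
        proto = frame[ETH_HEADER_LEN + 9]   # IPv4 Protocol field
    elif ethertype == ETHERTYPE_IPV6:
        if len(frame) < ETH_HEADER_LEN + 40:
            return None
        proto = frame[ETH_HEADER_LEN + 6]   # IPv6 Next Header field
    else:
        return None                         # ARP, LLDP, and other non-IP frames
    # ICMP, IGMP, and other protocol numbers map to None and are skipped.
    return {PROTO_TCP: "TCP", PROTO_UDP: "UDP"}.get(proto)
```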

[0035] S203: Parsing and Extracting Payload Data: For packets that pass the filtering, the module performs a payload extraction operation. Access the payload layer: Use the application programming interface of the network packet parsing library to obtain the payload layer of the data packet; Extract payload byte sequence: Extract the application layer payload data in the original byte sequence format from the payload layer.

[0036] S204: Decoding and Cleaning of Payload Data: The raw byte sequence cannot be directly processed by the text detection engine and needs to be decoded and cleaned. Multi-encoding format decoding attempt: The system attempts to decode the byte sequence using multiple character encoding formats until decoding is successful or all alternative encoding formats are traversed; Data cleaning: Preprocessing the successfully decoded text strings, including removing control characters and whitespace characters, and handling encoded error characters; For payloads that fail to be decoded, the system marks them as binary data and records them.

[0037] Further explanation of the multi-encoding format decoding attempt: The system maintains a list of encoding formats sorted by frequency of use and attempts to decode the byte sequence with each format in the list in turn. An attempt counts as successful when the decoding process throws no exception and the decoded string does not contain a large number of invalid characters. Once one attempt succeeds, no further encodings are tried and the process enters the cleaning phase. If all attempts fail, the payload is marked as "binary data." The system records its original packet index and byte length but excludes it from the subsequent text detection process, because the detection engine targets text patterns.

[0038] Data cleaning: Control character and whitespace handling: use the regular expression [\x00-\x1F\x7F] to remove ASCII control characters (with newline and tab characters exempted from removal), compress multiple consecutive whitespace characters (including spaces, newlines, tabs, etc.) into a single space to normalize the text structure and avoid matching failures caused by irregular formatting, and remove whitespace characters at the beginning and end of the text. Handling encoding errors: Replace any individual error characters that remain after decoding with spaces, or remove them directly, to ensure the continuity of the text; Character set unification: Convert full-width characters to half-width characters to ensure consistency in detection rules. For example, convert full-width commas to half-width commas.

[0039] Context-specific cleanup: Identify and remove common, meaningless protocol prefixes or suffixes.
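
The cleaning steps can be sketched as follows. Two choices here are assumptions rather than something the text specifies: Unicode NFKC normalization is used for the full-width to half-width conversion (it also folds other compatibility characters), and U+FFFD is treated as the "error character" left behind by a lenient decode. Protocol prefix and suffix removal is omitted because the text does not list the prefixes.

```python
import re
import unicodedata

# ASCII control characters to delete: everything in \x00-\x1F plus DEL,
# except tab (\x09) and newline (\x0A), which are collapsed as whitespace below.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B-\x1F\x7F]")
WHITESPACE_RUN = re.compile(r"\s+")


def clean_payload_text(text: str) -> str:
    # Character set unification: full-width forms become their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Encoding-error handling: replace the Unicode replacement character with a space.
    text = text.replace("\ufffd", " ")
    text = CONTROL_CHARS.sub("", text)
    # Collapse whitespace runs into one space and trim both ends.
    return WHITESPACE_RUN.sub(" ", text).strip()
```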

[0040] It should be further explained that the encoding format priority order is as follows: The system presets the encoding attempt order, sorted by the frequency of the adapted scenarios as follows: UTF-8 (preferred, suitable for most modern vehicle communication) → GBK / GB2312 (adapted to some domestic vehicle terminals) → ISO-8859-1 (compatible with traditional vehicle equipment ASCII extended encoding) → Unicode (UTF-16 LE / BE, adapted to specific vehicle control systems) → other niche encodings (such as EUC-JP, for special scenarios of imported equipment).
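
The prioritized decoding attempt can be sketched in Python using the standard codec names. The printable-character ratio used as the "not too many invalid characters" test, and its 0.9 threshold, are assumptions; the text does not define that criterion. Such a check matters in practice because ISO-8859-1 maps every byte to a character and therefore never raises a decoding error on its own.

```python
# Codec names follow Python's standard encodings; order mirrors the priority list above.
ENCODING_PRIORITY = ["utf-8", "gbk", "iso-8859-1", "utf-16-le", "utf-16-be", "euc_jp"]


def looks_like_text(text: str, min_printable_ratio: float = 0.9) -> bool:
    """Heuristic stand-in for 'does not contain a large number of invalid characters'."""
    if not text:
        return False
    printable = sum(ch.isprintable() or ch in "\t\r\n" for ch in text)
    return printable / len(text) >= min_printable_ratio


def decode_payload(raw: bytes):
    """Return (text, encoding) for the first acceptable decode, or (None, None) for binary data."""
    for encoding in ENCODING_PRIORITY:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        if looks_like_text(text):
            return text, encoding
    return None, None
```

A payload for which `decode_payload` returns `(None, None)` would be recorded as binary data with its packet index and byte length, as described in the decoding step.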

[0041] S205: Constructing the Payload Text Set: For each successfully extracted and decoded payload text, create a structured data item. This data item should contain at least: Original packet index: used to trace which packet in the PCAP file the payload originated from; Payload text content: a plain text string after decoding and cleaning; Network 5-tuple information: Extract the source IP address, destination IP address, source port number, destination port number, and transport layer protocol from the header information of this data packet.

[0042] Add all data items to the collection to form the payload text collection.
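
One possible shape for the structured data item is shown below as Python dataclasses. The class and field names are hypothetical; the text only requires that each item carry the packet index, the cleaned text, and the network 5-tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # "TCP" or "UDP"


@dataclass(frozen=True)
class PayloadItem:
    packet_index: int      # position of the source packet in the PCAP file
    text: str              # decoded and cleaned payload text
    five_tuple: FiveTuple


def build_payload_set(decoded):
    """Build the payload text set from (packet_index, text_or_None, FiveTuple) triples.

    Payloads whose text is None (marked as binary data) are left out.
    """
    return [PayloadItem(idx, text, ft) for idx, text, ft in decoded if text is not None]
```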

[0043] S206: Exception Handling and Status Transmission: The module is equipped with an exception handling unit to capture and handle exceptions during the data packet parsing process, ensuring that the process will not be interrupted due to a single data packet parsing error. After the module finishes execution, it generates a processing summary and updates the processing status to "payload extraction completed". Finally, the module outputs the constructed payload text set as output and passes it to the downstream multimodal hybrid detection engine.

[0044] Through the above process, this module realizes the function of reliably extracting application layer text data from network data packets, providing standardized data input for subsequent sensitive information detection.

[0045] The multimodal hybrid detection engine is the analysis and recognition unit of this system. It employs a multi-strategy parallel hybrid architecture to scan and identify sensitive information in text payloads. This engine integrates three detection units based on different principles, enabling it to handle various sensitive data types. The engine receives the payload text set from the payload extraction module and outputs structured detection results to the context association module. The specific implementation process of this engine is as follows: S301: Engine Initialization and Detection Unit Loading: When the system starts or a detection task begins, the engine initializes, and each detection unit loads its corresponding detection rules and data resources. Keyword matching unit: Loads a predefined list of sensitive keywords, which contains specific terms related to sensitive vehicle data; these keywords are configured in an updatable storage medium; Regular expression matching unit: compiles a series of pre-configured regular expression patterns designed for sensitive data types with specific formats; Local dictionary comparison unit: Loads a local sensitive data dictionary containing known sensitive information entries and loads it into memory using an efficient query data structure.

[0046] It should be further explained that the regular expression pattern for matching a Vehicle Identification Number (VIN) is: ^[A-HJ-NPR-Z0-9]{17}$. The 17 positions are defined as follows: 1st position: geographic region code (A-H, J-N, P, R-Z, 1-5, a total of 30 legal characters, excluding I, O, and Q to avoid confusion with numbers); 2nd-3rd positions: manufacturer code (assigned by SAE International, a combination of letters and numbers); 4th-8th positions: vehicle characteristic code (letter/number combination); 9th position: check digit (0-9 or X, used to verify the validity of the VIN); 10th position: vehicle model year code (A-Y for 1980-2000, 1-9 or A-H for 2001-2030, excluding I, O, Q, U, Z, and 0); 11th position: assembly plant code (letter/number combination); 12th-17th positions: production serial number (a combination of numbers or letters).
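
A hedged Python sketch of VIN matching follows. Because the anchored pattern above matches only when the whole string is a VIN, the sketch swaps the anchors for lookarounds so that candidates embedded in longer payload text can be found; that substitution is an assumption, not part of the stated rule. The position-9 check uses the widely published transliteration and weight table (remainder modulo 11, with 10 written as X). Not every market enforces this check digit, so a deployment might treat a failed check as a lower-confidence hit instead of discarding it.

```python
import re

# 17 characters from the VIN alphabet, not embedded in a longer alphanumeric run.
VIN_CANDIDATE = re.compile(r"(?<![A-Za-z0-9])[A-HJ-NPR-Z0-9]{17}(?![A-Za-z0-9])")

# Letter-to-number transliteration used by the check-digit calculation.
TRANSLITERATION = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7, "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def check_digit_ok(vin: str) -> bool:
    total = sum(TRANSLITERATION[ch] * w for ch, w in zip(vin, WEIGHTS))
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected


def find_vins(text: str):
    """Return (start_offset, vin) for each candidate whose check digit validates."""
    return [(m.start(), m.group()) for m in VIN_CANDIDATE.finditer(text)
            if check_digit_ok(m.group())]
```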

[0047] S302: Parallel Scan Triggering and Task Distribution: The engine receives the payload text set from the payload extraction module. To improve detection efficiency, the system employs a parallel computing mechanism, simultaneously delivering the entire payload text set to the three detection units mentioned above. Each detection unit processes the same payload data in an independent execution flow.
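
The task distribution in S302 can be sketched with `concurrent.futures`. The three unit functions, the keyword list, the fault-code pattern, and the dictionary entry below are placeholder rules invented for illustration. A thread pool is used for brevity; for CPU-bound scanning in CPython, a process pool may be the better fit.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Placeholder rules, assumed for this sketch only.
KEYWORDS = ("vin", "latitude", "fault")
DTC_PATTERN = re.compile(r"\b[PCBU][0-3][0-9A-F]{3}\b")
KNOWN_VALUES = frozenset({"1M8GDM9AXKP042788"})


def keyword_unit(payloads):
    return [("keyword", idx, kw) for idx, text in payloads
            for kw in KEYWORDS if kw in text.lower()]


def regex_unit(payloads):
    return [("regex", idx, m.group()) for idx, text in payloads
            for m in DTC_PATTERN.finditer(text)]


def dictionary_unit(payloads):
    return [("dictionary", idx, value) for idx, text in payloads
            for value in KNOWN_VALUES if value in text]


def run_parallel(payloads, units):
    """Hand the same payload set to every unit and merge their results."""
    with ThreadPoolExecutor(max_workers=len(units)) as pool:
        futures = [pool.submit(unit, payloads) for unit in units]
        merged = []
        for future in futures:
            merged.extend(future.result())
    return merged
```

Each unit receives the whole payload set and returns its own hits, so the units stay independent and the merge step is the only place where their outputs meet.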

[0048] S303: Keyword Matching Unit Execution Flow: This unit traverses each payload text in the payload text set; for each text, it compares it with the preloaded list of sensitive keywords, using a case-insensitive string matching algorithm. Once any keyword is found in the text, it is considered a successful match. For each successful match, the matched keyword, the position identifier of the keyword in the payload text, and the corresponding original data packet index are recorded.

[0049] It should be further explained that the specific steps of the case-insensitive string matching algorithm are as follows: 1. Preprocessing stage: Keyword and text standardization: Preprocessing is the core step in eliminating case differences. By unifying the character format of keywords and payload text, "case-insensitive matching" is transformed into "regular exact matching", avoiding frequent case conversion operations during the matching process.

[0050] (1) Keyword preprocessing. Loading and deduplication: Load keywords from the predefined list of sensitive keywords and deduplicate them to generate a keyword set without duplicates; Case conversion: Convert all English keywords in the keyword set to lowercase; Chinese keywords do not require conversion; Keyword length index: Group the preprocessed keywords by length. During subsequent matching, this allows the candidate keyword group to be located from the length of the current text segment alone, reducing invalid comparisons.

[0051] (2) Payload text preprocessing. Unified encoding: Keep the text encoding produced by the decoding in S204 (such as UTF-8) so that characters in the text are encoded consistently, avoiding case conversion errors caused by encoding differences; Case conversion: Convert all English characters in the payload text to lowercase, while keeping Chinese characters, numbers, and special symbols unchanged; Text segmentation: Split the preprocessed payload text into text segments at non-alphanumeric characters. Only segments containing letters are matched later, reducing the processing scope.
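
The preprocessing stage described above may be sketched as follows. The function names and the return shapes are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def preprocess_keywords(raw_keywords):
    """Deduplicate and lowercase keywords, then index them by length."""
    unique = {kw.lower() for kw in raw_keywords}  # lower() leaves Chinese characters unchanged
    by_length = defaultdict(set)
    for kw in unique:
        by_length[len(kw)].add(kw)
    return unique, dict(by_length)


def preprocess_payload(text):
    """Lowercase the text and split it at non-alphanumeric characters.

    Returns (start_offset, segment) pairs for segments that contain at least one letter.
    """
    lowered = text.lower()
    segments = []
    for match in re.finditer(r"[^\W_]+", lowered):  # runs of letters and digits
        if any(ch.isalpha() for ch in match.group()):
            segments.append((match.start(), match.group()))
    return segments
```

Because segmentation splits at spaces, a multi-word keyword would have to be matched against the unsegmented lowercased text instead of the individual segments.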

[0052] 2. Matching Phase: Multi-pattern Matching Algorithm Execution: For the preprocessed text fragments, the Aho-Corasick multi-pattern matching algorithm is employed to achieve parallel and rapid matching of multiple keywords. This algorithm, by constructing a "keyword trie tree + failure pointer," supports matching all keywords in a single text scan, improving matching efficiency in multi-keyword scenarios. The specific process is as follows: (1) Constructing the Aho-Corasick automaton Step 1: Construct a Trie tree: Based on the preprocessed set of keywords, construct a Trie tree (prefix tree). Each node in the tree represents a character, and the path from the root node to the leaf node constitutes a complete keyword. Step 2: Set failure pointers: Set failure pointers for each node, pointing to the node with the longest suffix match in the current path. This is used to quickly jump to the node when a match fails, avoiding rescanning the text. Step 3: Mark output nodes: Mark the corresponding keywords on the leaf nodes of the Trie tree. At the same time, for nodes containing sub-keywords, associate all possible matching keywords through failure pointers to ensure that no sub-keyword matching is missed.

[0053] (2) Text fragment scanning and matching Initialization: Set the current node of the Aho-Corasick automaton to the root node, and traverse the preprocessed text fragments; Character-by-character matching: Scan each character of each text segment one by one, starting from the current node, and search for whether there is a child node that matches the current character. If a matching child node exists: move the current node to that child node and continue scanning the next character; If no matching child node is found: jump to the failed node via the failure pointer and repeat the search until the root node is reached; Matching result recording: Whenever the current node is an "output node" (marked with keywords), record the successfully matched keywords, the starting position of the keyword in the text fragment, and the corresponding text fragment, and generate preliminary matching results.

[0054] 3. Validation Phase: Eliminating False Matches and Boundary Validation: Initial matching results may contain false positives such as "partial match" or "similar string match." Boundary validation and semantic validation are necessary to eliminate invalid results and ensure matching accuracy. (1) Boundary verification Word boundary check: Using the \b (word boundary) rule of regular expressions, determine whether the matched keyword is an independent word to avoid false judgments caused by it being attached to other characters; Length check: Compare the length of the matching keyword with the length of the matching part in the text fragment to ensure they are completely consistent.
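
The boundary verification may be sketched as follows, assuming that each preliminary match is given as a start offset plus the matched keyword. The function name is hypothetical.

```python
import re


def is_standalone_word(text, start, keyword):
    """Return True if keyword occurs at text[start:] as an independent word."""
    end = start + len(keyword)
    if text[start:end] != keyword:  # length and content check
        return False
    # Word boundary check: some \b-delimited occurrence must begin exactly at start.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return any(match.start() == start for match in pattern.finditer(text))
```

Note that `\b` treats Chinese characters as word characters, so a Chinese keyword embedded in a longer Chinese sentence would fail this check. A separate rule for such keywords would likely be needed.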

[0055] (2) Result output The verified matching results are recorded in the format required by the system: Matched keywords; The start / end position of keywords in the original payload text; The corresponding original data packet index (associated with the specific data packet in the PCAP file); Match status (“Valid”).

[0056] S304: Regular Expression Matching Unit Execution Flow: This unit iterates through each payload text in the payload text set. For each text, it performs a matching scan using all pre-compiled regular expression patterns. If any regular expression pattern matches a string that meets the condition in the text, it is considered to have discovered sensitive information. For each successfully matched item, it records the complete matched string, the regular expression pattern identifier used, the matching position information, and the corresponding original data packet index.

[0057] S305: Local Dictionary Comparison Unit Execution Flow: This unit iterates through each payload text in the payload text set. It performs an exact match query between the character sequences in the text and the local dictionary loaded into memory. For each item successfully found in the dictionary, it records the specific matched string, the match type identifier, and the corresponding original data packet index.

[0058] S306: Result Deduplication, Merging, and Integration: After the three detection units complete parallel scanning, three preliminary matching results are generated. The engine integrates these results: Data merging: combining results from different units into a unified result set; Deduplication: The rule for determining two records as duplicates is that they are identical if and only if their source data packet indexes are the same, their sensitive data content is identical after standardization, and the starting offset position of the content in the payload text is the same. Classification and labeling: Assign a type identifier to each finally confirmed sensitive information record.
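
The duplicate rule of S306 may be sketched as a dictionary keyed on the three stated criteria. The record field names and the standardization step (trimming whitespace and uppercasing) are assumptions, since this description does not specify them.

```python
def normalize(content):
    # Assumed standardization: trim surrounding whitespace and uppercase.
    return content.strip().upper()


def merge_and_deduplicate(*unit_results):
    """Merge per-unit result lists; records are duplicates if and only if packet
    index, standardized content, and start offset are all equal."""
    merged = {}
    for results in unit_results:
        for record in results:
            key = (record["packet_index"], normalize(record["content"]), record["start"])
            if key in merged:
                merged[key]["methods"].add(record["method"])  # remember every detecting unit
            else:
                merged[key] = {**record, "methods": {record["method"]}}
    return list(merged.values())
```

Keeping the set of detecting units on the merged record is a design choice of this sketch; it preserves the information that several strategies agreed on the same finding.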

[0059] S307: Generate a set of sensitive information matching results: After integration, deduplication, and classification, the engine outputs a structured set of sensitive information matching results. Each record in this set contains the following fields: Sensitive data content; Data type identifier; Detection method identifier; Source data packet index.

[0060] By implementing multiple detection strategies in parallel and fusing the results, the coverage and accuracy of sensitive information identification are improved.

[0061] The semantic context analysis engine is the core innovation of this system. It is responsible for deep verification of the preliminary results output by the multimodal hybrid detection engine. By analyzing the semantic context, it distinguishes between true and false alarms, thereby significantly reducing false positives. The specific implementation process of this engine is as follows: S401: Engine Initialization and Data Reception: The semantic context analysis engine loads the semantic analysis rule base when the system starts. The engine receives a preliminary matching result set from the upstream multimodal hybrid detection engine. Each record in this set contains suspected sensitive data content, data type identifier, detection method identifier, source data packet index, and its start and end positions in the payload text.

[0062] S402: Context sliding window construction: For each initial matching record, the engine dynamically constructs a context sliding window centered on its position in the payload text.

[0063] It is necessary to further explain the construction rules of the context sliding window: Window Size: The window's byte length is configurable, with a default value of 200 bytes. The engine will use the center of the suspected sensitive data string as a reference point, expanding 100 bytes forward and backward to form a 200-byte analysis window. If the data is at the beginning or end of the text, the window will adjust adaptively.
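
The window construction rule may be sketched as follows. The adaptive adjustment at the text boundaries is interpreted here as clamping the window to the payload (an assumption; shifting the window to keep its full length would be another reading).

```python
def context_window(payload: bytes, start: int, end: int, size: int = 200) -> bytes:
    """Return up to `size` bytes centered on the match payload[start:end]."""
    center = (start + end) // 2
    half = size // 2
    low = max(0, center - half)              # clamp at the beginning of the payload
    high = min(len(payload), center + half)  # clamp at the end of the payload
    return payload[low:high]
```

When the window is decoded for text analysis, `window.decode("utf-8", errors="ignore")` avoids failures if a boundary falls inside a multi-byte character.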

[0064] Window content: Extract all text characters within this window as the neighborhood context for this semantic analysis.

[0065] S403: Neighborhood Text Semantic Extraction and Analysis: The engine extracts and analyzes semantic features of the text within the sliding window.

[0066] Keyword co-occurrence analysis: This checks whether there are contextual keywords within the window that are highly relevant to the current sensitive data type. For example, for a suspected VIN code, the engine will search the window for indicative words such as "VIN," "vehicle identification number," or "vehicle identification."

[0067] Semantic structure reasoning: Analyzing the logical relationship between suspected data and surrounding text. For example, if suspected geographic coordinate data is preceded or followed by structured key-value pairs such as "latitude:", "longitude:", "location", or "coordinates", the likelihood that it is a real geographic coordinate increases significantly.

[0068] Protocol Fragment Reassembly Inference: For fragmented data commonly found in HTTP / TCP payloads, the engine analyzes the semantic coherence of the text within the window to infer whether multiple data packet fragments belong to the same logical session, thereby assisting in determining the integrity of sensitive data.

[0069] S404: Secondary verification and result determination: Based on the semantic analysis results, the engine scores the confidence level of the initially matched records.

[0070] High confidence (confirmation): When there are strongly related keywords or clear structured tags in the semantic context, the record is determined to be real sensitive data.

[0071] Low confidence (exclusion): When the semantic context is irrelevant to the suspected data or when negative words are present, the record is determined to be a false alarm and excluded from the final result.

[0072] Manual review required: When the confidence level is in the middle range, the record will be marked as "pending" and can be highlighted in the audit report for the administrator's final decision.
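
A simplified sketch of the three-way determination in S404 is given below. The indicator words are taken from the examples above, while the negative words, the point values, and the two thresholds are hypothetical, since this description does not fix them.

```python
INDICATORS = {
    "vin": ("vin", "vehicle identification"),
    "geo": ("latitude", "longitude", "location", "coordinates"),
}
NEGATIVE_WORDS = ("example", "test", "dummy")  # hypothetical negative words


def classify(data_type, window_text, confirm_at=70, exclude_below=30):
    """Return 'confirmed', 'excluded', or 'pending' for one preliminary match."""
    text = window_text.lower()
    score = 20  # assumed baseline: context with no supporting evidence
    if any(word in text for word in INDICATORS.get(data_type, ())):
        score += 60  # keyword co-occurrence or structured tag found
    if any(word in text for word in NEGATIVE_WORDS):
        score -= 40  # negative word found
    if score >= confirm_at:
        return "confirmed"
    if score < exclude_below:
        return "excluded"
    return "pending"  # middle range: left for manual review
```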

[0073] S405: Generate the final sensitive information matching result set: After the above verification process, the engine outputs a high-precision final sensitive information matching result set that has removed a large number of false alarms, and passes it to the downstream context association module.

[0074] The context association module is a crucial component of the system for tracing sensitive data and analyzing network behavior. Its main function is to associate the matching results of sensitive information with the original network communication context, providing data support for generating traceable audit reports. The specific implementation process of this module is as follows: S501: Module Initialization and Data Reception: The context association module completes its initialization configuration upon system startup. The module receives a set of sensitive information matching results from the upstream multimodal hybrid detection engine. Each record in the set contains sensitive data content, data type identifier, detection method identifier, and source data packet index.

[0075] S502: Association Request Processing and Original Data Reloading: For the received sensitive information matching result set, the module initiates the association processing flow. Based on the corresponding PCAP file identifier in the sensitive information matching result set, the module reloads the original PCAP file into memory and obtains the complete protocol information of the data packet by calling the file reading function of the network packet parsing library.

[0076] S503: Network Five-Tuple Information Extraction: The module iterates through each record in the sensitive information matching result set and locates the corresponding original data packet in the reloaded PCAP file based on the source data packet index in the record. The data packet is then parsed to extract the network five-tuple information. Source IP address: Extracted from the IP protocol layer of the data packet; Destination IP address: Extracted from the IP protocol layer of the data packet; Source port number: Extracted from the transport layer protocol header; Destination port number: Extracted from the transport layer protocol header; Transport layer protocol: Identifies the type of transport layer protocol used by the data packet.

[0077] S504: Sensitive Information and Context Association: The extracted network five-tuple information is bound to the corresponding sensitive information record to generate an enhanced association record. This record contains the following fields: Sensitive data content; Data type identifier; Detection method identifier; Network five-tuple information; Original data packet index; Optional timestamp information.
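
This description relies on an unnamed network packet parsing library for this step. As a library-independent sketch, the five-tuple may be read directly from the raw bytes of a frame using only the Python standard library. The sketch assumes an Ethernet II frame without VLAN tags carrying an unfragmented IPv4 packet with TCP or UDP; other cases return `None`.

```python
import ipaddress
import struct

PROTOCOLS = {6: "TCP", 17: "UDP"}  # IPv4 protocol numbers


def five_tuple(frame: bytes):
    """Return (src_ip, dst_ip, src_port, dst_port, protocol), or None if unsupported."""
    if len(frame) < 34 or struct.unpack("!H", frame[12:14])[0] != 0x0800:
        return None                          # not Ethernet II carrying IPv4
    header_len = (frame[14] & 0x0F) * 4      # IPv4 header length in bytes
    protocol = PROTOCOLS.get(frame[23])      # protocol field of the IPv4 header
    transport = 14 + header_len              # offset of the transport layer header
    if protocol is None or len(frame) < transport + 4:
        return None
    src_ip = str(ipaddress.IPv4Address(frame[26:30]))
    dst_ip = str(ipaddress.IPv4Address(frame[30:34]))
    src_port, dst_port = struct.unpack("!HH", frame[transport:transport + 4])
    return src_ip, dst_ip, src_port, dst_port, protocol
```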

[0078] S505: Construct a set of associated results: Add the generated enhanced associated records to the result set to form a structured set of associated results.

[0079] S506: Data Aggregation and Statistical Analysis: The module also performs data aggregation and analysis operations. Statistics by IP address: Statistics on the types and quantities of sensitive data associated with each IP address; Statistics by session flow: Sensitive information is grouped based on network characteristics; Risk Score: The risk assessment value is calculated based on the characteristics of sensitive data and the communication context.

[0080] It should be further explained that the risk score is calculated as follows: The RiskScore (range 0-100) for each associated record is calculated according to the following rules: Base score: VIN leak +30 points, geographic coordinate leak +20 points, fault code leak +10 points.

[0081] If the source IP is a public IP and not a known trusted IP (by comparing it with the built-in trusted IP list), the score is multiplied by 1.5.

[0082] If the communication protocol is a plaintext protocol (such as HTTP), the score is +10.

[0083] The final score is normalized to ensure it does not exceed 100.
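
The scoring rules above may be sketched as follows. The order of operations (multiplier first, then the plaintext bonus), the use of `ipaddress` to approximate "public IP", the type labels, and the contents of the trusted list are assumptions.

```python
import ipaddress

BASE_SCORES = {"vin": 30, "geo": 20, "dtc": 10}  # VIN, geographic coordinates, fault code
PLAINTEXT_PROTOCOLS = {"HTTP"}                   # assumed set of plaintext protocols
TRUSTED_IPS = frozenset()                        # hypothetical built-in trusted IP list


def risk_score(data_type, source_ip, app_protocol, trusted_ips=TRUSTED_IPS):
    """Compute the RiskScore (0-100) of one associated record."""
    score = BASE_SCORES.get(data_type, 0)
    address = ipaddress.ip_address(source_ip)
    if not address.is_private and source_ip not in trusted_ips:
        score *= 1.5                             # public and untrusted source
    if app_protocol.upper() in PLAINTEXT_PROTOCOLS:
        score += 10                              # plaintext transmission
    return min(100, score)                       # cap at 100
```

Under this reading, a record with a single data type reaches at most 30 × 1.5 + 10 = 55, so the cap would only take effect if base scores were accumulated across several data types.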

[0084] S507: Result Transmission and Output: After processing is complete, the context association module transmits the associated result set to the result generation and presentation module. The module updates the processing status and records the running log.

[0085] The above processing steps establish a connection between sensitive data and network communication context, thereby improving the traceability of audit results.

[0086] The result generation and presentation module is the terminal link for interaction between this system and the user. Its responsibility is to transform the structured data generated by upstream processing into an audit report and display it through a web front-end interface. The specific implementation process of this module is as follows: S601: Receiving Association Results and Data Preparation: This module receives the set of association results output from the upstream context association module. It also acquires metadata generated during system processing, including task identifiers, file identifiers, timestamps, data packet statistics, and sensitive information statistics.

[0087] S602: Structured Audit Report Generation: This module integrates the received set of related results with metadata to generate a structured audit report. The audit report uses a machine-readable data exchange format and includes the following main parts: Report Summary: Includes basic task information and statistical analysis data; Detailed discovery list: Contains detailed information about each sensitive information record and associated network context information.

[0088] It is necessary to further explain the core design principles of the audit report: Completeness: Covers key information throughout the entire process from PCAP file upload to sensitive data detection, including file metadata, detection statistics, sensitive data details and risk assessment, with no missing key information; Structured: JSON is used as the underlying data format (accommodating both machine parsing and cross-system transmission), with unified field definitions and clear hierarchy, avoiding ambiguity in unstructured text; Readability: Provide supplementary explanations for technical fields (such as network quintuples and detection method identifiers), and classify and label risk results to reduce the understanding threshold for non-technical users; Traceability: Each sensitive data record is associated with the original data packet index and network context, allowing users to trace back to the specific communication scenario and facilitating problem localization.
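
A possible JSON layout for the audit report is sketched below. All field names are hypothetical; this description only requires a report summary and a detailed list of findings in a machine-readable format.

```python
import json
from collections import Counter


def build_audit_report(task_id, file_id, packet_count, findings):
    """Serialize a summary and the detailed findings list as a JSON string."""
    report = {
        "summary": {
            "task_id": task_id,
            "file_id": file_id,
            "packet_count": packet_count,
            "finding_count": len(findings),
            "findings_by_type": dict(Counter(f["data_type"] for f in findings)),
        },
        "findings": findings,  # each entry keeps its packet index and network context
    }
    # ensure_ascii=False keeps Chinese text readable in the serialized report.
    return json.dumps(report, ensure_ascii=False, indent=2)
```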

[0089] S603: Report Transmission to Frontend: The backend service returns the structured audit report as an HTTP response to the frontend interface. A success status indicator is returned upon successful processing, and an error status indicator and error description are returned upon handling an exception.

[0090] S604: Front-end Report Parsing and Visualization: The web front-end receives and parses the audit report data, and then visualizes it through user interface components. Summary information display: Key statistical indicators are displayed through an information panel; Detailed data display: A detailed list of findings is displayed using a table component; Interactive controls: Provides data sorting, pagination browsing, and content filtering functions; Risk visualization feature: In the table, records with a risk score higher than 80 are marked with a red background, those between 50 and 80 are marked with yellow, and those below 50 are marked with green.
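
The color-coding rule may be expressed as a small mapping function. It is shown here in Python for consistency with the other sketches, although the same logic would normally live in the web front-end. Scores of exactly 50 and exactly 80 are treated as yellow, following the wording "between 50 and 80".

```python
def risk_color(score):
    """Map a risk score to the table background color."""
    if score > 80:
        return "red"
    if score >= 50:
        return "yellow"
    return "green"
```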

[0091] S605: Interactive Functions and Data Export: The front-end interface also provides enhanced interactive functions: Detailed display function: Supports viewing detailed information for a single record; Risk visualization function: Displays risk assessment results in a differentiated manner; Data export function: Supports exporting report data to multiple file formats; Print output function: Provides optimized report formats for printing.

[0092] S606: User Notification and Status Management: After the front-end completes the report display, it sends an operation completion notification to the user. The system updates the task status to complete and stores the report data in the history for later query.

[0093] Through the above process, this module realizes the functions of generating, transmitting, displaying and interacting with audit results, completing the closed-loop processing from data analysis to user decision support.

[0094] Secondly: The accompanying drawings of the embodiments disclosed in this invention only involve the structures involved in the embodiments disclosed in this invention. Other structures can refer to the general design. In the absence of conflict, the same embodiment and different embodiments of this invention can be combined with each other. In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A system for automotive sensitive data detection based on multi-modal semantic context analysis, characterized in that it comprises: Data packet upload module: Receives network packet capture (PCAP) files uploaded by users through the web front-end interface, saves the PCAP files to the server, and generates the PCAP file to be detected; Payload extraction module: Loads the PCAP file to be detected, parses and extracts the payload data of each data packet, and generates a payload text set; Multimodal hybrid detection engine: The engine integrates a keyword matching unit, a regular expression matching unit, and a local dictionary comparison unit. The regular expression matching unit is pre-configured with matching rules corresponding to a strictly predefined format of the vehicle identification number (VIN). The engine is configured to use each unit to perform parallel scanning of the payload text set and output a sensitive information matching result set. The sensitive data types in the sensitive information matching result set include the vehicle identification number (VIN), geographic coordinate data, and on-board diagnostic system fault codes; Semantic context analysis engine: Receives the preliminary matching result set and, for each suspected sensitive data record, constructs a context sliding window of a specific byte length centered on its position in the payload text and extracts the neighboring text within the window; by analyzing the semantic logic of the neighboring text, performs secondary verification and inference on the attributes of the suspected sensitive data, thereby reducing the false alarm rate, and outputs the final confirmed sensitive information matching result set;
Context association module: Receives the sensitive information matching result set, and for each record therein, extracts the network quintuple information from its source data packet to generate an association result set; Result generation and presentation module: Receives the associated result set, formats it into a structured audit report, and returns the audit report to the web front-end interface for visualization.

2. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The PCAP file to be detected includes: a PCAP file that has passed file format verification, security scanning and size limit checks, and has a unique file identifier, which includes a timestamp, server ID, random string, original file hash and secure original file name.

3. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The payload text set includes: application layer payload data extracted from TCP or UDP packets, which is then decoded and cleaned to form a text data set; the text data set contains the original packet index, payload text content, and network quintuple information.

4. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The matching rules include: a regular expression matching pattern for the vehicle identification number (VIN), which is a 17-character pattern containing the character set A-H, J-N, P, R-Z and 0-9, and excluding the characters I, O, and Q.

5. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The sensitive information matching result set includes: a set of sensitive data records identified by three detection methods: keyword matching, regular expression matching, and local dictionary comparison; the set includes sensitive data content, data type identifier, detection method identifier, and source data packet index.

6. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The network quintuple information includes: source IP address, destination IP address, source port number, destination port number, and transport layer protocol.

7. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The associated result set includes: a set of records that bind sensitive data content to the corresponding network quintuple information.

8. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The audit report includes a report summary and a detailed list of findings, in JSON format. The report summary contains basic task information, data packet statistics, and sensitive information statistics, while the detailed list of findings includes sensitive data content, data type identifiers, detection method identifiers, network quintuple information, and risk assessment values.

9. The vehicle sensitive data detection system based on multimodal semantic context analysis according to claim 1, characterized in that, The visualization includes: when displaying the risk assessment value on the web front-end interface, color-coding the risk assessment value, marking records with a risk score higher than 80 as red, between 50 and 80 as yellow, and below 50 as green.