Method and device for discovering confidential information leakage
A technology relating to confidential information and web page information, applied in the information field, which addresses the low efficiency of manual search and achieves more efficient retrieval.
Inactive Publication Date: 2019-10-15
TRAVELSKY
7 Cites 0 Cited by
AI-Extracted Technical Summary
Problems solved by technology
However, the efficiency of manual search is low, and it is impossible to find leaked confident...
Abstract
The invention discloses a method and device for discovering confidential information leakage. The method comprises: obtaining a confidential information retrieval task that carries a first confidential information keyword group and a second confidential information keyword group; retrieving at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain webpage information of a plurality of webpages; and retrieving at least one confidential information keyword in the second confidential information keyword group in the webpage content of at least one of the plurality of webpages to determine whether the confidential information is leaked in the at least one webpage. By having the confidential information retrieval task search automatically in a search engine, the invention overcomes the technical problem of relatively low manual retrieval efficiency and achieves the technical effect of discovering leaked confidential information among the massive information on the Internet in a timely and efficient manner.
Application Domain
Data processing applications; Special data processing applications +1
Technology Topic
Internet privacy; Web page +2
Example Embodiment
[0032] Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.
[0033] As shown in Figure 1, a method for discovering leakage of confidential information provided by an embodiment of the present invention includes:
[0034] S100. Obtain a confidential information retrieval task, where the confidential information retrieval task carries: a first confidential information keyword group and a second confidential information keyword group;
[0035] Specifically, the confidential information retrieval task may be formulated by technical personnel according to actual needs. Both the first confidential information keyword group and the second confidential information keyword group may include at least one confidential information keyword. In practical applications, technicians can set the confidential information keywords in the two groups according to the specific content of the confidential information they want to retrieve. The confidential information keywords may include one or more of: the name of the confidential technology, the name of a key means of the confidential technology, the name of an insider of the confidential technology, the name of the organization to which the confidential technology belongs, and the technical field to which the confidential technology belongs. Optionally, the first confidential information keyword group and the second confidential information keyword group may both include the same confidential information keyword, such as the name of the confidential technology. Of course, the confidential information keywords included in the two groups may not be exactly the same, or the two groups may share no confidential information keywords at all; the present invention is not limited in this respect.
[0036] In other embodiments of the present invention, the confidential information keywords in the first confidential information keyword group and the confidential information keywords in the second confidential information keyword group may be associated. For example, the first confidential information keyword group includes the confidential information keyword "A project", and the second confidential information keyword group includes the confidential information keywords "a1 technology", "a2 technology" and so on. Among them, a1 technology and a2 technology are secret technologies adopted in A project, so "A project" is related to "a1 technology" and "a2 technology". Another example: the first confidential information keyword group includes the confidential information keyword "B company mailbox", the second confidential information keyword group includes the confidential information keywords "B company Zhang's mailbox account", "B company Zhang's mailbox password", etc.
[0037] The confidential information retrieval task may also carry retrieval logic, that is, the logic between keywords, which may be AND, OR, NOT, and so on. In practical applications, the first confidential information keyword group, the second confidential information keyword group and the retrieval logic can be expressed as regular expressions. Specifically, the technician can write at least one regular expression according to the first confidential information keyword group and the retrieval logic between the confidential information keywords in that group, and put the written regular expression into the confidential information retrieval task. Correspondingly, the technician can also write at least one regular expression according to the second confidential information keyword group and the retrieval logic between the confidential information keywords in that group, and put the written regular expression into the confidential information retrieval task.
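As a minimal sketch of how a keyword group and its AND / OR / NOT logic could be folded into a single regular expression: the helper name `build_pattern` and the lookahead-based encoding are illustrative assumptions, not something the disclosure specifies.

```python
import re

def build_pattern(all_of=(), any_of=(), none_of=()):
    """Hypothetical helper: encode AND / OR / NOT logic over keywords in one regex."""
    parts = []
    # AND: every keyword must appear somewhere (one positive lookahead each).
    parts += [f"(?=.*{re.escape(k)})" for k in all_of]
    # OR: at least one of the keywords must appear.
    if any_of:
        parts.append("(?=.*(?:" + "|".join(re.escape(k) for k in any_of) + "))")
    # NOT: none of the keywords may appear (one negative lookahead each).
    parts += [f"(?!.*{re.escape(k)})" for k in none_of]
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

# The excluded keyword "tutorial" is an assumption added for illustration.
pattern = build_pattern(all_of=["bank", "mail", "password"], none_of=["tutorial"])
print(bool(pattern.match("Bank staff mail login and password list")))  # True
print(bool(pattern.match("bank mail password reset tutorial")))        # False
```

Because every condition is a lookahead anchored at the start of the text, `pattern.match` succeeds only when all conditions hold at once, regardless of the order in which the keywords appear.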
[0038] In practical applications, the form of confidential information can include: company internal mailbox password, database password, company internal or external website password, company internal information, development documents, system user manual, system source code, etc.
[0039] Optionally, the confidential information retrieval task may also carry: execution time information of the confidential information retrieval task. Specifically, the time information may be a specific execution time or an execution time interval. For example, a confidential information retrieval task can be scheduled to be executed at 7 o'clock in the evening, or a confidential information retrieval task can be scheduled to be executed every hour.
[0040] Optionally, the confidential information retrieval task may also carry a retrieval environment specified for the task, where the retrieval environment is the search engine or website in which the confidential information retrieval is performed. The invention retrieves confidential information within that retrieval environment. Specifically, a website serving as the retrieval environment in the present invention may be a website with an information sharing function.
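The items a retrieval task is described as carrying (two keyword groups, retrieval logic, execution time information, retrieval environment) can be pictured as one record. The sketch below is only illustrative; the class name `RetrievalTask`, its field names and the `next_run` helper are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class RetrievalTask:
    """Hypothetical container for the fields the task is described as carrying."""
    first_keywords: list                        # first confidential information keyword group
    second_keywords: list                       # second confidential information keyword group
    logic: str = "AND"                          # retrieval logic between keywords
    run_at: Optional[datetime] = None           # a specific execution time, or
    run_every: Optional[timedelta] = None       # an execution interval
    environments: list = field(default_factory=list)  # search engines or websites

    def next_run(self, last_run: Optional[datetime]) -> Optional[datetime]:
        """Return when the task should next execute, or None if nothing is scheduled."""
        if self.run_every is not None and last_run is not None:
            return last_run + self.run_every
        return self.run_at

task = RetrievalTask(
    first_keywords=["A project"],
    second_keywords=["a1 technology", "a2 technology"],
    run_every=timedelta(hours=1),
    environments=["search engine A"],
)
print(task.next_run(datetime(2024, 1, 1, 19, 0)))  # 2024-01-01 20:00:00
```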
[0041] S200: Retrieve at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain webpage information of multiple webpages;
[0042] Specifically, the search engine may include one or more of a web search engine, a vertical search engine, and an aggregate search engine. The webpage information may include the Uniform Resource Locator (URL) of the webpage, and may also include other information that can distinguish the webpage from other webpages. Web page information can also include titles and abstracts.
[0043] In practical applications, the retrieval environment set for the first secret information keyword group in the secret information retrieval task of the present invention may be a search engine.
[0044] Optionally, based on the method shown in Figure 1, and as shown in Figure 2, in another method for discovering leakage of confidential information provided by an embodiment of the present invention, step S200 may include:
[0045] S210, searching for at least one confidential information keyword in the first confidential information keyword group in at least two search engines, respectively, to obtain webpage information of multiple webpages output by each search engine;
[0046] Specifically, webpage information may be obtained by searching for the confidential information keywords in the first confidential information keyword group in at least two search engines respectively, which yields the webpage information of the multiple webpages output by each search engine. For example, the confidential information keywords "'bank' and 'mail' and 'password'" in the first confidential information keyword group are searched in search engine A and search engine B respectively, yielding the webpage information of 100 webpages output by search engine A and the webpage information of 80 webpages output by search engine B.
[0047] S220: Compare the web page information of the web pages output by each search engine, and filter out at least part of the web page information from the web page information output by each search engine according to the comparison result.
[0048] Specifically, the webpage information output by different search engines after searching for the confidential information keywords in the first confidential information keyword group is compared. For example, the confidential information keywords "'bank' and 'mail' and 'password'" in the first confidential information keyword group are searched in search engine A and search engine B respectively; search engine A outputs the webpage information of 100 webpages and search engine B outputs the webpage information of 80 webpages. The webpage information of the 100 webpages output by search engine A is then compared with the webpage information of the 80 webpages output by search engine B. According to the comparison result, the webpage information that satisfies a preset condition can be filtered out from the webpage information output by each search engine. The preset condition may be formulated as needed; one such condition is that the webpage information is present in the output of every search engine. When the webpage information of a webpage is present in the webpage information output by each search engine, the webpage information of that webpage satisfies the preset condition.
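The multi-engine search and the "present in every engine's output" condition can be sketched as follows. The engines are stubbed as plain callables returning dictionaries with a `url` key; the function names and the result layout are assumptions for illustration, and no real search engine API is used.

```python
def search_all(engines, query):
    """Run the query on each engine; `engines` maps a name to a search callable."""
    return {name: search(query) for name, search in engines.items()}

def in_every_engine(results):
    """Keep results from the first engine whose URL was returned by every engine."""
    if not results:
        return []
    url_sets = [set(r["url"] for r in hits) for hits in results.values()]
    common = set.intersection(*url_sets)
    first = next(iter(results.values()))
    return [r for r in first if r["url"] in common]

# Stub engines standing in for search engine A and search engine B.
engines = {
    "A": lambda q: [{"url": "http://example.com/1"}, {"url": "http://example.com/2"}],
    "B": lambda q: [{"url": "http://example.com/2"}, {"url": "http://example.com/3"}],
}
kept = in_every_engine(search_all(engines, "bank mail password"))
print([r["url"] for r in kept])  # ['http://example.com/2']
```

Other preset conditions (for example, "present in at least two engines") would only change how `common` is computed.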
[0049] In addition, after obtaining the webpage information of the multiple webpages output by each search engine, this embodiment can also compare the obtained webpage information of each webpage with the webpage information of the webpages in a history record and, according to the comparison result, remove from the obtained webpage information the webpage information of any webpage that already exists in the history record. In this embodiment, after the processing of step S300 has been performed on a webpage, the webpage information of that webpage can be put into the history record. The webpage information in the history record is therefore that of webpages for which it has already been determined whether confidential information is leaked. When webpage information already present in the history record is obtained again in step S200 during subsequent processing, step S300 need not be performed again on a webpage for which that determination has already been made. It is understandable that, by comparing the webpage information (such as the URL) of two webpages, the present invention can determine whether they are the same webpage. Of course, in addition to the webpage information, the determination result obtained by performing step S300 on the webpage, namely whether confidential information is leaked, can also be put into the history record. Specifically, the history record may store the webpage information of webpages for which the determination has been made, together with the corresponding determination result.
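A minimal sketch of the history record, assuming it is kept as a dictionary from URL to the earlier determination result (the layout and the function names are illustrative assumptions):

```python
def drop_already_checked(pages, history):
    """Return only the pages whose URL is not yet in the history record."""
    return [p for p in pages if p["url"] not in history]

def record_result(history, page, leaked):
    """Store the step S300 determination (True = leaked) for a page."""
    history[page["url"]] = leaked

history = {"http://example.com/1": False}   # page 1 was checked in an earlier run
pages = [{"url": "http://example.com/1"}, {"url": "http://example.com/2"}]

todo = drop_already_checked(pages, history)
print([p["url"] for p in todo])             # ['http://example.com/2']

record_result(history, todo[0], True)       # result of checking page 2
print(drop_already_checked(pages, history)) # []
```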
[0050] As shown in Figure 3, in another method for discovering leakage of confidential information provided by an embodiment of the present invention, step S220 may include:
[0051] S221. Compare the webpage information of the webpages output by each search engine and, according to the comparison result, filter out from the webpage information output by each search engine the webpage information of the webpages that are output by every search engine.
[0052] Specifically, the webpage information of the webpages output by one search engine can be compared with the webpage information of the webpages output by the other search engines and, according to the comparison result, the webpage information that is the same as webpage information output by the other search engines can be filtered out from the output of that search engine. For example: search engine A outputs the webpage information of 100 webpages and search engine B outputs the webpage information of 80 webpages; comparing the 100 items of webpage information output by search engine A with the 80 items output by search engine B filters out the webpage information of 50 webpages output by both search engine A and search engine B. That is to say, the webpage information of each of these 50 webpages is output by both search engine A and search engine B.
[0053] The process of comparing webpage information of a webpage output by one search engine with webpage information of a webpage output by another search engine may specifically include:
[0054] Compare the uniform resource locator of a webpage output by one search engine with the uniform resource locator of a webpage output by another search engine. When the uniform resource locators are the same, the titles of the two webpages can then be compared; if they are the same, it is determined that the webpage is output by both search engines. It is understandable that, when the uniform resource locators of the two webpages are the same, the abstracts of the two webpages can be compared instead; if they are the same, it is determined that the webpage is output by both search engines. The titles and abstracts can also be compared at the same time: if the two webpages have the same title and the same abstract, it is determined that the webpage is output by both search engines.
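The comparison described here (URL first, then optionally title, abstract, or both) could look like the sketch below. The function name, the flags and the dictionary keys are assumptions for illustration.

```python
def same_page(a, b, check_title=True, check_abstract=False):
    """URLs must match; title and/or abstract are compared as well when requested."""
    if a["url"] != b["url"]:
        return False
    if check_title and a.get("title") != b.get("title"):
        return False
    if check_abstract and a.get("abstract") != b.get("abstract"):
        return False
    return True

from_a = {"url": "http://example.com/2", "title": "Notes", "abstract": "first lines"}
from_b = {"url": "http://example.com/2", "title": "Notes", "abstract": "other lines"}

print(same_page(from_a, from_b))                       # True  (URL and title agree)
print(same_page(from_a, from_b, check_abstract=True))  # False (abstracts differ)
```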
[0055] S300. According to the obtained webpage information, search for at least one confidential information keyword in the second confidential information keyword group in the webpage content of at least one of the multiple webpages, to determine whether the confidential information is leaked in the at least one webpage.
[0056] In practical applications, the retrieval environment set for the second secret information keyword group in the secret information retrieval task of the present invention may be: multiple web pages obtained in step S200. In this way, step S300 can continue to search among the multiple web pages obtained in step S200, and the present invention can effectively improve the accuracy of the search result through the second search.
[0057] Specifically, the webpage content is the various information carried by the webpage, and may include one or more of title, author, date, summary, body, program code, pictures, audio, and video. Using the webpage information filtered out of the output of each search engine, the present invention can search the webpage content of at least one of the corresponding webpages for at least one confidential information keyword in the second confidential information keyword group, to determine whether confidential information is leaked in the at least one webpage. For example, 50 items of webpage information are output by both search engine A and search engine B; at least one confidential information keyword in the second confidential information keyword group is searched for in the webpage content of at least one of these 50 webpages, to determine whether confidential information is leaked in the webpage content of that webpage.
[0058] To improve the efficiency of searching the webpage content of at least one of the multiple webpages for at least one confidential information keyword in the second confidential information keyword group, this embodiment may also use preset useless-content characters to delete useless content from the webpage content of the at least one webpage. For example, useless content is deleted from the webpage content by means of useless-content characters such as "html, table, tab, div", and at least one confidential information keyword in the second confidential information keyword group is then searched for in the webpage content from which the useless content has been deleted. Excluding useless content from the search improves not only the efficiency but also the accuracy of the search.
[0059] Generally speaking, when at least one confidential information keyword in the second confidential information keyword group is found in the webpage content of a webpage, it can be determined that confidential information is leaked in that webpage; otherwise, it is determined that confidential information is not leaked in that webpage.
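The second-stage check can be sketched as: drop markup, then look for any keyword of the second group in what remains. The sketch uses Python's standard `html.parser` to discard tags, which is one possible reading of "deleting useless content"; the class and function names are assumptions.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text between tags; tag names and attributes are dropped."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def is_leaked(html, second_keywords):
    """Return True if any second-group keyword occurs in the page text."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Normalise whitespace so a keyword split across inline tags still matches.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return any(k.lower() in text for k in second_keywords)

page = ("<html><div class='a1 technology'><table><tr>"
        "<td>Notes on a2 <b>Technology</b></td></tr></table></div></html>")
print(is_leaked(page, ["a2 technology"]))  # True: the phrase is in the visible text
print(is_leaked(page, ["a1 technology"]))  # False: it occurs only in an attribute
```

The second call illustrates the accuracy point made above: a keyword that appears only inside markup no longer produces a match.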
[0060] It is understandable that after it is determined that the confidential information is leaked on a certain webpage, the relevant personnel can defend their rights by complaining to the administrator of the webpage to delete the confidential information from the webpage.
[0061] In this embodiment, after the processing of step S300 has been performed on a webpage, the webpage information of that webpage can be put into the history record. The webpage information in the history record is therefore that of webpages for which it has already been determined whether confidential information is leaked. Of course, in addition to the webpage information, the determination result obtained by performing step S300 on the webpage can also be put into the history record. Specifically, the history record may store the webpage information of webpages for which the determination has been made, together with the corresponding determination result.
[0062] Specifically, after rights protection in the form of complaints, relevant personnel also need to monitor the effect of rights protection, that is, whether confidential information is deleted from the webpage in a timely manner after rights protection.
[0063] In another embodiment of the present invention, after it is determined in step S300 that confidential information is leaked in a webpage, the webpage information of that webpage may be compared with the webpage information in the history record to determine whether the history record already contains the webpage information of that webpage. If it does, the determination result corresponding to that webpage information in the history record can be obtained. If the determination result is "leaked", it can be determined whether the interval between the time the determination result was generated and the current time exceeds a preset duration; if it does, a confidential information leakage follow-up reminder can be generated. The follow-up reminder prompts the relevant personnel to follow up promptly on the rights protection result for the webpage where the confidential information is leaked. It is understandable that, if a webpage was previously found to leak confidential information and a later run of the method finds that it still does, the earlier rights protection either did not achieve its effect or was forgotten, so the relevant personnel need to be reminded to follow up in time.
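The follow-up decision reduces to a lookup plus a time comparison. In the sketch below the history record is assumed to map a URL to a `(leaked, determined_at)` pair, and the seven-day default is an arbitrary placeholder for the preset duration.

```python
from datetime import datetime, timedelta

def needs_follow_up(url, history, now, max_age=timedelta(days=7)):
    """True when a page already recorded as leaking is found leaking again
    more than `max_age` after the earlier determination."""
    previous = history.get(url)
    if previous is None:
        return False                      # first time this page is seen
    leaked, determined_at = previous
    return leaked and (now - determined_at) > max_age

history = {"http://example.com/2": (True, datetime(2024, 1, 1))}

print(needs_follow_up("http://example.com/2", history, datetime(2024, 1, 10)))  # True
print(needs_follow_up("http://example.com/2", history, datetime(2024, 1, 3)))   # False
```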
[0064] The method for discovering leakage of confidential information provided by the embodiment of the present invention obtains a confidential information retrieval task that carries a first confidential information keyword group and a second confidential information keyword group; searches for at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain the webpage information of multiple webpages; and searches for at least one confidential information keyword in the second confidential information keyword group in the webpage content of at least one of the multiple webpages, to determine whether confidential information is leaked in the at least one webpage. Having the confidential information retrieval task search automatically in the search engine overcomes the technical problem of low manual search efficiency and thereby achieves the technical effect of discovering leaked confidential information among the massive information on the Internet in a timely and efficient manner.
[0065] Optionally, based on the method shown in Figure 1, and as shown in Figure 4, another method for discovering leakage of confidential information provided by an embodiment of the present invention may further include:
[0066] S400: If it is determined that the confidential information is leaked in the at least one webpage, generate a confidential information leak reminder.
[0067] Specifically, if it is determined that confidential information is leaked in at least one webpage, a confidential information leak reminder is generated. The reminder can be displayed on the device implementing this embodiment, or sent through a pre-designated communication method to a pre-designated device. For example, the reminder may be displayed in a pop-up window on the device implementing this embodiment, or sent as an SMS or MMS message to a mobile phone with a pre-designated mobile phone number.
[0068] The confidential information leakage reminder may take the form of a warning message, which may be generated from the webpage where the confidential information is leaked and a preset warning message template. Specifically, at least part of the webpage content of the webpage where the confidential information is leaked can be added to the preset warning message template to generate the warning message. To facilitate understanding, take Figure 5 as an example: the warning message is generated by adding part of the webpage content of the webpage where the confidential information is leaked, together with the uniform resource locator of the webpage, to the preset warning message template, and the date in the warning message can be set according to the time the warning message is generated. The warning message may also include one or more of the title, source site, abstract, and author of the webpage where the confidential information is leaked. It is understandable that the preset warning message template can have multiple styles; Figure 5 shows just one of them.
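Filling a preset template with the page's URL, title, an excerpt of its content and the generation date can be sketched with the standard `string.Template` class. The template wording and field names below are assumptions; the layout of Figure 5 is not reproduced here.

```python
from datetime import date
from string import Template

# Hypothetical template text; the actual style in Figure 5 may differ.
WARNING_TEMPLATE = Template(
    "[$date] Possible confidential information leak\n"
    "URL: $url\n"
    "Title: $title\n"
    "Excerpt: $excerpt\n"
)

def build_warning(page, today, excerpt_len=80):
    """Fill the template with page details; the excerpt is a prefix of the content."""
    return WARNING_TEMPLATE.substitute(
        date=today.isoformat(),
        url=page["url"],
        title=page.get("title", ""),
        excerpt=page["content"][:excerpt_len],
    )

page = {
    "url": "http://example.com/2",
    "title": "Forum post",
    "content": "B company Zhang's mailbox account and password are listed below",
}
msg = build_warning(page, date(2024, 1, 1))
print(msg)
```

The same approach would apply to any other preset template style, since only the template text changes.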
[0069] The confidential information leakage reminder can be given as text, as in the warning message described above, or by means of an alarm sound or flashing of the interface of the device implementing this embodiment. It is understandable that the reminder can take many forms, which are not further limited here.
[0070] Optionally, if it is determined that the confidential information is leaked in the at least one webpage, another method for discovering the leak of the confidential information provided in the embodiment of the present invention may also take a screenshot of the webpage and save the screenshot. By taking a screenshot of the webpage content of the webpage where the confidential information is leaked and saving the screenshot, it is possible to retain evidence that the confidential information is leaked on the webpage.
[0071] Optionally, based on the method shown in Figure 1, and as shown in Figure 6, an embodiment of the present invention provides another method for discovering leakage of confidential information. Before step S300, the method may further include:
[0072] S500. Determine websites corresponding to the multiple webpages, and obtain website login information corresponding to the determined website;
[0073] It is understandable that some of the websites corresponding to the webpages output by the search engine may require login before all the content of a webpage can be obtained; in that case, the login information of the website is needed to access all the content of the webpage. Optionally, the present invention may obtain the website login information corresponding to the determined website from a pre-established database of login information for each website. For example, when website A, corresponding to a webpage output by the search engine, requires login before the subsequent steps can proceed, the pre-established login information database is checked for the login information of website A and, if it exists, the method proceeds to the next step. The present invention can also generate a reminder that the website requires login information, and obtain the website's login information from the feedback to that reminder. For example, a reminder that the website requires login information is sent to a pre-designated mailbox, so that the user of the mailbox can provide the website's login information.
[0074] Preferably, if the confidential information search task carries website login information corresponding to the website, the website login information corresponding to the website can also be obtained from the confidential information search task.
[0075] S600: Log in to a determined website through the obtained website login information, and obtain webpage content of the multiple webpages.
[0076] Specifically, after logging in to a determined website with the obtained website login information, at least one lower-level link address on the homepage of the website is accessed to obtain the webpage content of the webpage that the link address points to. It is understandable that, in most cases, a website shows multiple lower-level link addresses for different sections after login. Each lower-level link address is visited to obtain the webpage content of the webpage it points to; if that webpage content itself contains lower-level link addresses, those are visited in turn, and so on until the webpage content contains no further link addresses. For example, if the link addresses of the three sections "variety shows", "TV series" and "movies" appear after logging in to website A, the link addresses of these three sections are accessed to obtain the webpage content of the webpages they point to.
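Following lower-level links until none remain is a graph traversal. The sketch below walks an in-memory stand-in for a logged-in site; `fetch` is a placeholder for an authenticated page request (not shown), and the visited set is an added safeguard against pages that link back to each other, which the description does not mention.

```python
from collections import deque

def crawl(start, fetch):
    """Breadth-first walk over lower-level links.
    `fetch(url)` must return a (content, links) pair for the page at `url`."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        content, links = fetch(url)
        pages[url] = content
        for link in links:
            if link not in seen:          # skip pages that were already queued
                seen.add(link)
                queue.append(link)
    return pages

# Toy site: the homepage links to three sections, one of which links back.
site = {
    "/home": ("home", ["/variety", "/tv", "/movies"]),
    "/variety": ("variety shows", ["/home"]),
    "/tv": ("TV series", ["/tv/ep1"]),
    "/movies": ("movies", []),
    "/tv/ep1": ("episode 1", []),
}
pages = crawl("/home", site.__getitem__)
print(sorted(pages))
```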
[0077] When confidential information is found in the obtained webpage content, then, based on the method shown in Figure 1 and as shown in Figure 7, another method for discovering leakage of confidential information provided by an embodiment of the present invention may further include:
[0078] S700. If it is determined that the confidential information is leaked in the at least one webpage, obtain the contact information of the management party of the webpage to which the confidential information is leaked.
[0079] Specifically, when it is determined that confidential information is leaked in a webpage, the contact information of the administrator of the webpage can be obtained. The contact information can be obtained through keywords such as “contact information”, “mobile phone”, and “email” on the webpage, or through specific character patterns in the webpage, for example: 123456@xx.com, 138xxxxxxx. The contact information may also include link addresses such as "Feedback", "Email" and "Call" on the webpage. It is understandable that a message function in the webpage can also serve as the contact information of the webpage.
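Picking out "specific character patterns" such as email addresses and phone numbers is a natural fit for regular expressions. Both patterns below are simplified assumptions: the email pattern is deliberately loose, and the phone pattern assumes 11-digit mobile numbers beginning with 1. The sample number is a dummy value.

```python
import re

# Simplified, illustrative patterns; real-world formats vary more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b1\d{10}\b")   # assumed: 11 digits starting with 1

def find_contacts(text):
    """Return the email addresses and phone numbers found in the page text."""
    return {"emails": EMAIL_RE.findall(text), "phones": PHONE_RE.findall(text)}

sample = "Contact information: 123456@xx.com, mobile phone 13800000000"
print(find_contacts(sample))
# {'emails': ['123456@xx.com'], 'phones': ['13800000000']}
```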
[0080] Optionally, based on the method shown in Figure 7, and as shown in Figure 8, the embodiment of the present invention provides another method for discovering leakage of confidential information. After step S700, the method may further include:
[0081] S800. Send a leaked information deletion notification letter to the management party through the obtained contact information.
[0082] Specifically, a pre-edited leaked information deletion notification letter can be sent to the management party through the obtained contact information. The leaked information deletion notification letter can be edited manually according to actual conditions. For example, after it is determined that confidential information is leaked on a webpage, the owner of the confidential information can edit a leaked information deletion notification letter and send it to the administrator of the webpage through the obtained contact information.
[0083] Of course, the leaked information deletion notification letter can also be edited automatically according to a template. The detailed steps for doing so can be as shown in step S810 of Figure 9.
[0084] In another method for discovering leakage of confidential information provided by an embodiment of the present invention, step S800 shown in Figure 8 may include:
[0085] S810: Generate a leaked information deletion notification letter according to the webpage where the confidential information is leaked and a preset leaked information deletion notification letter template, and send the generated leaked information deletion notification letter using the obtained contact information.
[0086] Specifically, at least part of the webpage content of the webpage where the confidential information is leaked is added to a preset leaked information deletion notification letter template to generate a leaked information deletion notification letter, and the generated letter is sent using the contact information. To facilitate understanding, take Figure 10 as an example: the leaked information deletion notification letter is generated by adding part of the content of the webpage where the confidential information is leaked, together with the uniform resource locator of the webpage, to the preset template, and the date in the letter can be set according to the date on which the letter is sent. It is understandable that the preset leaked information deletion notification letter template can have multiple styles; Figure 10 shows just one of them.
[0087] Corresponding to the foregoing method embodiment, the embodiment of the present invention also provides a device for discovering leakage of confidential information.
[0088] As shown in Figure 11, an apparatus for discovering leakage of confidential information provided by an embodiment of the present invention may include: a task obtaining unit 100, a first retrieval unit 200, and a second retrieval unit 300.
[0089] The task obtaining unit 100 is configured to obtain a confidential information retrieval task, wherein the confidential information retrieval task carries: a first confidential information keyword group and a second confidential information keyword group;
[0090] Specifically, the confidential information retrieval task may be formulated by technical personnel according to actual needs. Both the first confidential information keyword group and the second confidential information keyword group may include at least one confidential information keyword. In practical applications, technicians can set the confidential information keywords in the first confidential information keyword group and the second confidential information keyword group according to the specific content of the confidential information they want to retrieve. The confidential information keywords may include one or more of: the name of the confidential technology, the name of a key means of the confidential technology, the name of an insider of the confidential technology, the name of the organization to which the confidential technology belongs, and the technical field to which the confidential technology belongs. Optionally, the first confidential information keyword group and the second confidential information keyword group may both include the same confidential information keyword, such as the name of the confidential technology. Of course, the confidential information keywords included in the two groups may be only partly the same, or the two groups may share no confidential information keyword at all; the present invention is not limited in this respect.
[0091] In other embodiments of the present invention, the confidential information keywords in the first confidential information keyword group and the confidential information keywords in the second confidential information keyword group may be associated. For example, the first confidential information keyword group includes the confidential information keyword "A project", and the second confidential information keyword group includes the confidential information keywords "a1 technology", "a2 technology" and so on. Among them, a1 technology and a2 technology are secret technologies adopted in A project, so "A project" is related to "a1 technology" and "a2 technology". Another example: the first confidential information keyword group includes the confidential information keyword "B company mailbox", the second confidential information keyword group includes the confidential information keywords "B company Zhang's mailbox account", "B company Zhang's mailbox password", etc.
[0092] The confidential information retrieval task may also carry retrieval logic, where the retrieval logic is the logic between keywords, and the logic may be AND, OR, NOT, etc. In practical applications, the first confidential information keyword group, the second confidential information keyword group and the retrieval logic of the present invention can be expressed as regular expressions. Specifically, the technician can write at least one regular expression according to the first confidential information keyword group and the retrieval logic between the confidential information keywords in the first confidential information keyword group, and put the written regular expression into the confidential information retrieval task. Correspondingly, the technician can also write at least one regular expression according to the second confidential information keyword group and the retrieval logic between the confidential information keywords in the second confidential information keyword group, and put the written regular expression into the confidential information retrieval task.
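As one possible illustration of combining keywords and retrieval logic in a regular expression, the following sketch builds a pattern from three keyword lists. The function name `build_pattern` and its parameters are assumptions made for the example, and the use of lookahead assertions is only one way of encoding AND/OR/NOT; the description does not prescribe a particular encoding.

```python
import re


def build_pattern(all_of=(), any_of=(), none_of=()):
    """Compile a regex that encodes AND / OR / NOT logic between keywords."""
    parts = []
    # AND: every keyword must occur somewhere in the text.
    for kw in all_of:
        parts.append(f"(?=.*{re.escape(kw)})")
    # OR: at least one of the keywords must occur.
    if any_of:
        alternatives = "|".join(re.escape(kw) for kw in any_of)
        parts.append(f"(?=.*(?:{alternatives}))")
    # NOT: none of these keywords may occur.
    for kw in none_of:
        parts.append(f"(?!.*{re.escape(kw)})")
    # DOTALL lets ".*" span line breaks; IGNORECASE is a convenience choice.
    return re.compile("".join(parts), re.DOTALL | re.IGNORECASE)


# Hypothetical keyword group: "bank" AND "mail" AND ("password" OR "passwd"),
# excluding texts that mention "tutorial".
pattern = build_pattern(
    all_of=["bank", "mail"],
    any_of=["password", "passwd"],
    none_of=["tutorial"],
)
```

A text satisfies the logic when `pattern.match(text)` returns a match object.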
[0093] In practical applications, the form of confidential information can include: company internal mailbox password, database password, company internal or external website password, company internal information, development documents, system user manual, system source code, etc.
[0094] Optionally, the confidential information retrieval task may also carry: execution time information of the confidential information retrieval task. Specifically, the time information may be a specific execution time or an execution time interval. For example, a confidential information retrieval task can be scheduled to be executed at 7 o'clock in the evening, or a confidential information retrieval task can be scheduled to be executed every hour.
[0095] Optionally, the confidential information retrieval task may also carry: a retrieval environment for formulating the confidential information retrieval task, where the retrieval environment is a search engine or website used for the confidential information retrieval. The invention can retrieve confidential information in a retrieval environment. Specifically, the website serving as the search environment in the present invention may be a website with an information sharing function.
[0096] The first retrieval unit 200 is configured to retrieve at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain web page information of multiple web pages;
[0097] Specifically, the search engine may include one or more of a web search engine, a vertical search engine, and an aggregate search engine. The webpage information may include the Uniform Resource Locator (URL) of the webpage, and may also include other information that can distinguish the webpage from other webpages. Web page information can also include titles and abstracts.
[0098] In practical applications, the retrieval environment set for the first confidential information keyword group in the confidential information retrieval task of the present invention may be a search engine.
[0099] Optionally, based on the device shown in Figure 11, as shown in Figure 12, an embodiment of the present invention provides another device for discovering confidential information leakage. The first retrieval unit 200 includes: a first retrieval subunit 210 and a comparison and screening unit 220.
[0100] The first retrieval subunit 210 is configured to search for at least one confidential information keyword in the first confidential information keyword group in at least two search engines to obtain web page information of multiple web pages output by each search engine;
[0101] Specifically, webpage information may be obtained by searching for the confidential information keywords in the first confidential information keyword group in each of at least two search engines, so as to obtain the webpage information of the multiple webpages output by each search engine. For example, the confidential information keywords "bank" and "mail" and "password" in the first confidential information keyword group are searched in search engine A and search engine B respectively, yielding the webpage information of 100 webpages output by search engine A and the webpage information of 80 webpages output by search engine B.
[0102] The comparison and screening unit 220 is configured to compare webpage information of webpages output by each search engine, and filter out at least part of the webpage information from the webpage information output by each search engine according to the comparison result.
[0103] Specifically, the webpage information output by different search engines after searching for the confidential information keywords in the first confidential information keyword group is compared. For example, the confidential information keywords "bank" and "mail" and "password" in the first confidential information keyword group are searched in search engine A and search engine B respectively; search engine A outputs the webpage information of 100 webpages, and search engine B outputs the webpage information of 80 webpages. The webpage information of the 100 webpages output by search engine A is then compared with the webpage information of the 80 webpages output by search engine B. According to the comparison result, at least part of the webpage information that meets a preset condition can be filtered out from the webpage information output by each search engine. Specifically, the preset condition may be formulated according to needs, and one such preset condition may be: the webpage information exists in the webpage information of the webpages output by every search engine. When the webpage information of a certain webpage exists in the webpage information of the webpages output by every search engine, the webpage information of that webpage satisfies the preset condition.
[0104] In addition, after obtaining the webpage information of the multiple webpages output by each search engine, this embodiment can also compare the obtained webpage information of each webpage with the webpage information of the webpages in the history record and, according to the comparison result, remove from the obtained webpage information the webpage information of webpages that already exist in the history record. After processing a certain webpage, the second retrieval unit 300 of this embodiment can put the webpage information of that webpage into the history record. In this way, the webpage information in the history record is the webpage information of webpages for which it has already been determined whether confidential information is leaked. When the first retrieval unit 200 again obtains the webpage information of a webpage in the history record in subsequent processing, the second retrieval unit 300 need not process that webpage again. It is understandable that by comparing the webpage information (such as the URL) of two webpages, the present invention can determine whether the two webpages are the same webpage. Of course, in addition to putting the webpage information into the history record, the present invention can also put into the history record the determination result, obtained after the second retrieval unit 300 processes the webpage, of whether the confidential information is leaked. Specifically, the history record may store: the webpage information of webpages for which it has been determined whether confidential information is leaked, and the determination result corresponding to that webpage information.
[0105] Optionally, the comparison and screening unit 220 is specifically configured to compare webpage information of webpages output by each search engine and, according to the comparison result, filter out from the webpage information output by each search engine the webpage information of webpages that are output by every search engine.
[0106] Specifically, the webpage information of the webpages output by one search engine can be compared with the webpage information of the webpages output by the other search engines and, according to the comparison result, the webpage information that is the same as webpage information output by the other search engines can be filtered out from the webpage information output by that search engine. For example: search engine A outputs the webpage information of 100 webpages and search engine B outputs the webpage information of 80 webpages; the 100 pieces of webpage information output by search engine A are compared with the 80 pieces of webpage information output by search engine B, and the webpage information of 50 webpages output by both search engine A and search engine B is filtered out. That is to say, the webpage information of any one of the 50 webpages is output by both search engine A and search engine B.
[0107] The process of comparing webpage information of a webpage output by one search engine with webpage information of a webpage output by another search engine may specifically include:
[0108] Compare the uniform resource locator of a webpage output by one search engine with the uniform resource locator of a webpage output by another search engine. When the uniform resource locators are the same, whether the titles of the two webpages are the same can be compared; if they are the same, it is determined that the webpage is a webpage output by both search engines. It is understandable that when the uniform resource locators of the two webpages are the same, it is also possible to compare whether the abstracts of the two webpages are the same; if they are the same, it is determined that the webpage is a webpage output by both search engines. It is also understandable that when the uniform resource locators of the two webpages are the same, the titles and abstracts of the two webpages can be compared at the same time; if the two webpages have the same title and the same abstract, it is determined that the webpage is a webpage output by both search engines.
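The comparison described above can be sketched as follows. The `PageInfo` type, the `common_pages` function and the `compare_abstract` flag are hypothetical names introduced for illustration; the sketch compares the uniform resource locator and title, and optionally the abstract as well.

```python
from typing import NamedTuple


class PageInfo(NamedTuple):
    """Webpage information as returned by a search engine (illustrative fields)."""
    url: str
    title: str
    abstract: str = ""


def common_pages(results_a, results_b, compare_abstract=False):
    """Keep the pages of results_a that were also output by the other engine."""
    def key(page):
        # URL and title must agree; the abstract is compared only on request.
        if compare_abstract:
            return (page.url, page.title, page.abstract)
        return (page.url, page.title)

    seen_in_b = {key(page) for page in results_b}
    return [page for page in results_a if key(page) in seen_in_b]
```

With more than two search engines, the same function could be applied repeatedly to the running result and the next engine's output.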
[0109] The second retrieval unit 300 is configured to, according to the obtained web page information, search for at least one confidential information keyword in the second confidential information keyword group in the web content of at least one of the multiple web pages, so as to determine whether confidential information is leaked in the at least one webpage.
[0110] In practical applications, the retrieval environment set for the second confidential information keyword group in the confidential information retrieval task of the present invention may be: the multiple web pages obtained by the first retrieval unit 200. In this way, the second retrieval unit 300 can continue to search among the multiple web pages obtained by the first retrieval unit 200, and the present invention can effectively improve the accuracy of the search results through this second search.
[0111] Specifically, the webpage content is the various information carried by the webpage, and the webpage content may include one or more of title, author, date, summary, body, program code, pictures, audio, and video. The present invention can search for at least one confidential information keyword in the second confidential information keyword group in the webpage content of at least one webpage among the webpages whose webpage information was filtered out from the webpage information output by each search engine, so as to determine whether confidential information is leaked in the at least one webpage. For example, the webpage information of 50 webpages is output by both search engine A and search engine B. At least one confidential information keyword in the second confidential information keyword group is searched for in the webpage content of at least one of the 50 webpages to determine whether confidential information is leaked in the webpage content of that webpage.
[0112] In order to improve the efficiency of searching for at least one confidential information keyword in the second confidential information keyword group in the web content of at least one of the multiple web pages, this embodiment may also use preset useless content characters to delete the useless content in the webpage content of at least one webpage among the multiple webpages. For example, the useless content in the webpage content of the webpage is deleted by means of useless content characters such as "html, table, tab, div", and at least one confidential information keyword in the second confidential information keyword group is then searched for in the webpage content from which the useless content has been deleted. By excluding the useless content of the webpage from the search, not only the efficiency but also the accuracy of the search is improved.
[0113] Generally speaking, when at least one confidential information keyword in the second confidential information keyword group is found in the webpage content of the webpage, it can be determined that the confidential information is leaked in the webpage content of the webpage; otherwise, it is determined that the confidential information is not leaked on this webpage.
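The idea of removing useless markup before the second search can be illustrated with the following sketch. Note the substitution: instead of a preset list of useless content characters, the sketch uses the `HTMLParser` class of the Python standard library to keep only the text of the page. Skipping `script` and `style` content is an assumption of the example and would have to be relaxed if leaked program code inside such elements is to be searched; the names `visible_text` and `leaked_keywords` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags, attributes and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with markup removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def leaked_keywords(html, keywords):
    """Return the second-group keywords found in the cleaned page text."""
    text = visible_text(html).lower()
    return [kw for kw in keywords if kw.lower() in text]
```

Because attribute values and tag names are discarded, a keyword such as "password" that appears only in a `class` attribute no longer produces a hit.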
[0114] It is understandable that after it is determined that the confidential information is leaked on a certain webpage, the relevant personnel can defend their rights by complaining to the administrator of the webpage to delete the confidential information from the webpage.
[0115] Wherein, in this embodiment, after the second retrieval unit 300 processes a certain webpage, the webpage information of the webpage can be put into the history record. In this way, the web page information in the history record is the web page information of the web page that has been determined whether to leak confidential information. Of course, in addition to putting the webpage information into the historical record, the present invention can also put the determination result of whether the confidential information is leaked obtained after the second retrieval unit 300 processes the webpage into the historical record. Specifically, the history record may store: web page information of web pages that have been determined whether to disclose confidential information, and a result of determining whether to disclose confidential information corresponding to the web page information.
[0116] Specifically, after rights protection in the form of complaints, relevant personnel also need to monitor the effect of rights protection, that is, whether confidential information is deleted from the webpage in a timely manner after rights protection.
[0117] In another embodiment of the present invention, after the second retrieval unit 300 determines that the confidential information is leaked in a certain webpage, the webpage information of the webpage on which the confidential information is leaked can be compared with the webpage information of the webpages in the history record to determine whether the webpage information of that webpage exists in the history record. If it exists, the determination result corresponding to the webpage information in the history record can be further obtained. If the determination result is "leaked", it can be determined whether the time interval between the generation time of the determination result and the current time exceeds a preset duration; if it does, a confidential information leakage follow-up reminder can be generated. The confidential information leakage follow-up reminder can remind relevant personnel to follow up in time on the rights protection result for the webpage on which the confidential information is leaked. It is understandable that if a certain webpage has already been found to leak confidential information, and a subsequent execution of the present invention finds that the webpage still leaks the confidential information, this means that the previous rights protection did not achieve the corresponding effect or that the rights protection was forgotten, so relevant personnel need to be reminded to follow up in time.
[0118] According to an embodiment of the present invention, a device for discovering leakage of confidential information can obtain a confidential information retrieval task, where the confidential information retrieval task carries: a first confidential information keyword group and a second confidential information keyword group; search for at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain web page information of a plurality of web pages; and search for at least one confidential information keyword in the second confidential information keyword group in the web content of at least one of the plurality of web pages to determine whether the confidential information is leaked in the at least one webpage. By having the confidential information retrieval task searched automatically in the search engine, the technical problem of low manual search efficiency is overcome, and the technical effect of discovering leaked confidential information from the massive information on the Internet in a timely and efficient manner is achieved.
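The follow-up decision described above can be sketched as follows. The history record is modeled here as a dictionary mapping a URL to a pair of determination result and generation time; this layout and the function name `needs_follow_up` are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def needs_follow_up(history, url, now, preset_duration):
    """Decide whether a follow-up reminder should be generated for a leaking page.

    history maps url -> (leaked, determined_at); this layout is hypothetical.
    """
    entry = history.get(url)
    if entry is None:
        # First time this page is seen: no earlier determination to follow up on.
        return False
    leaked, determined_at = entry
    # Remind only if the page was already found leaking longer ago than allowed.
    return leaked and (now - determined_at) > preset_duration
```

For instance, with a preset duration of seven days, a page recorded as leaked ten days ago and found leaking again would trigger the reminder, while one recorded two days ago would not.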
[0119] Optionally, based on the device shown in Figure 12, as shown in Figure 13, another device for discovering leakage of confidential information provided by an embodiment of the present invention may further include: a reminder generating unit 400.
[0120] The reminder generating unit 400 is configured to generate a confidential information leakage reminder after the second retrieval unit 300 determines that the confidential information is leaked in the at least one webpage.
[0121] Specifically, if it is determined that confidential information is leaked in at least one webpage, a confidential information leakage reminder is generated. The confidential information leakage reminder can be displayed on the device implementing this embodiment, or sent through a pre-designated communication method to the device corresponding to that communication method. For example, the confidential information leakage reminder may be displayed in a pop-up window on the device implementing this embodiment, or sent to a mobile phone in the form of SMS or MMS through a pre-designated mobile phone number.
[0122] The confidential information leakage reminder may take the form of a warning message, where the warning message may be generated based on the webpage on which the confidential information is leaked and a preset warning message template. Specifically, at least part of the webpage content of the webpage on which the confidential information is leaked can be added to a preset warning message template to generate the warning message. For ease of understanding, an example is given here with reference to Figure 5: the warning message is generated by adding part of the webpage content of the webpage on which confidential information is leaked, together with the uniform resource locator of the webpage, to the preset warning message template, where the date in the warning message can be set according to the generation time of the warning message. The warning message may also include one or more of the title, source site, abstract, and author of the webpage on which the confidential information is leaked. It is understandable that the preset warning message template can have multiple styles, and Figure 5 shows only one of them.
[0123] The confidential information leakage reminder can be given in text form, similar to the warning message described above, and can also be given by means of an alarm sound or by flashing the interface of the device implementing this embodiment. It is understandable that the confidential information leakage reminder can be given in multiple ways, which are not further limited here.
[0124] Optionally, another device for discovering the leakage of confidential information provided by the embodiment of the present invention may further include a webpage screenshot unit, configured to take a screenshot of the webpage and save the screenshot after the second retrieval unit 300 determines that the confidential information is leaked in the at least one webpage. By taking a screenshot of the webpage content of the webpage on which the confidential information is leaked and saving the screenshot, evidence that the confidential information was leaked on the webpage can be retained.
[0125] Optionally, based on the device shown in Figure 13, as shown in Figure 14, another device for discovering leakage of confidential information provided by the embodiment of the present invention may further include: a website login information obtaining unit 500 and a web content obtaining unit 600.
[0126] The website login information obtaining unit 500 is configured to, before the second retrieval unit 300 searches for at least one confidential information keyword in the second confidential information keyword group in the web content of at least one of the plurality of web pages, determine the websites corresponding to the multiple web pages and obtain website login information corresponding to the determined websites;
[0127] It is understandable that the websites corresponding to the webpages output by the search engine may include a website that requires login before all the content of a webpage can be obtained. In this case, the login information of the website needs to be obtained in order to access all the content of the webpage. Optionally, the present invention may obtain the website login information corresponding to the determined website from a pre-established database of login information for each website. For example, when website A, corresponding to a webpage output by the search engine, requires login before the subsequent steps can proceed, it is determined from the pre-established database of login information whether the login information of website A exists, and if so, the next step is performed. The present invention can also generate a reminder that the website requires login information and obtain the website's login information from the feedback to the reminder. For example, a reminder that the website requires login information is sent to a pre-designated mailbox, so that the user of the mailbox can provide the website's login information.
[0128] Preferably, if the confidential information search task carries website login information corresponding to the website, the website login information corresponding to the website can also be obtained from the confidential information search task.
[0129] The webpage content obtaining unit 600 is configured to log in to a determined website through the obtained website login information, and obtain webpage contents of the multiple webpages.
[0130] Specifically, after logging in to a determined website with the obtained website login information, at least one lower-level link address on the homepage of the website is accessed to obtain the webpage content of the webpage pointed to by the link address. It is understandable that in most cases, after logging in to a website, there will be multiple lower-level link addresses for different sections. In this case, each lower-level link address is accessed to obtain the webpage content of the webpage it points to. If the webpage content of that webpage itself contains lower-level link addresses, those link addresses are accessed in turn to obtain the webpage content of the webpages they point to, until no further link address exists in the webpage content. For example, if after logging in to website A the link addresses of the three sections "variety shows", "TV series" and "movies" appear, the link addresses of these three sections are accessed to obtain the webpage content of the webpages they point to.
[0131] When the second retrieval unit 300 finds that there is confidential information in the web content of the obtained web page, based on the device shown in Figure 14, as shown in Figure 15, another device for discovering leakage of confidential information provided by an embodiment of the present invention may further include: a contact information obtaining unit 700.
[0132] The contact information obtaining unit 700 is configured to obtain the contact information of the administrator of the webpage on which the confidential information is leaked after the second retrieval unit 300 determines that the confidential information is leaked in the at least one webpage.
[0133] Specifically, when it is determined that the confidential information is leaked in the webpage, the contact information obtaining unit 700 obtains the contact information of the administrator of the webpage. The contact information can be obtained through keywords in the webpage such as "contact information", "mobile phone", and "email". It can also be obtained through specific character segments in the webpage, for example: 123456@xx.com, 138xxxxxxx. The contact information also includes link addresses on the webpage such as "Feedback", "Email" and "Call". It is understandable that a message function in the webpage can also serve as the contact information of the webpage.
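The traversal of lower-level link addresses can be sketched as follows. The sketch takes a caller-supplied `fetch` function (hypothetical; in practice it would perform the logged-in HTTP request) so that it runs without network access. The set of visited pages and the `max_pages` limit are safeguards added for the example and are not stated in the description; they prevent the traversal from looping when pages link back to each other.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Follow lower-level links from start_url and return {url: page content}."""
    contents = {}
    queue = [start_url]
    while queue and len(contents) < max_pages:
        url = queue.pop(0)
        if url in contents:
            continue  # already visited; avoids endless loops on back-links
        html = fetch(url)
        contents[url] = html
        collector = _LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative link addresses
            if target not in contents:
                queue.append(target)
    return contents
```

A real deployment would typically also restrict the traversal to the logged-in website's own domain.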
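Extraction of contact information from specific character segments can be illustrated with the following sketch. The two regular expressions are assumptions made for the example: a simplified e-mail pattern and an 11-digit mobile number beginning with 1. The description does not specify the actual patterns, and the name `extract_contacts` is hypothetical.

```python
import re

# Simplified e-mail pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed phone format: 11 digits starting with 1, not embedded in a longer number.
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def extract_contacts(text):
    """Return the distinct e-mail addresses and phone numbers found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(PHONE_RE.findall(text))),
    }
```

Links labeled "Feedback" or a message form would need separate handling, for example by inspecting the link text of the page.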
[0134] Optional, based on Figure 15 The device shown, such as Figure 16 As shown, another device for discovering leakage of confidential information provided by an embodiment of the present invention may further include: a notification letter sending unit 800;
[0135] The notification letter sending unit 800, after the contact information obtaining unit 700 obtains the contact information of the management party of the webpage to which the confidential information is leaked, is used to send the leakage information to the management party through the obtained contact information Delete the notification letter.
[0136] Specifically, through the contact information obtained by the contact information obtaining unit 700, the notification letter sending unit 800 may send a notification letter for deleting the previously edited leaked information to the management party. The notification letter for deletion of leaked information can be manually edited according to actual conditions. For example, after it is determined that the confidential information is leaked on the webpage, the party of the confidential information can edit the notification letter for deletion of the leaked information through the obtained contact of the webpage The method is sent to the administrator of the page.
[0137] Of course, the leaked information deletion notification letter can also be automatically edited based on a template. Specifically, the notification letter sending unit 800 may generate a leaked information deletion based on the web page to which the confidential information is leaked and a preset leaked information deletion notification letter template. Notification letter, sending the generated notification letter for deleting the leaked information to the contact information.
[0138] Optionally, the notification letter sending unit 800 is specifically configured to generate a leaked information deletion notification letter based on the webpage to which the confidential information is leaked and a preset leaked information deletion notification letter template, and send the generated leaked information deletion notification letter To the contact information.
[0139] Specifically, at least part of the webpage content of the webpage where the confidential information is leaked is added to a preset leaked information deletion notification letter template to generate a leaked information deletion notification letter, and the generated leaked information deletion notification letter is sent to the contact information. In order to facilitate understanding, combine here Picture 10 Take an example: by adding part of the content of the webpage to which the confidential information is leaked and the uniform resource locator of the webpage to the preset leaked information deletion notification letter template, a leaked information deletion notification letter is generated, where the leaked information The date in the deletion notification letter can be set according to the date when the notification letter is sent. It is understandable that the preset notification letter template for the deletion of leaked information can also have multiple styles. Picture 10 Shown is just one style.
[0140] The device for discovering leakage of confidential information includes a processor and a memory. The task obtaining unit 100, the first retrieval unit 200, and the second retrieval unit 300 are all stored in the memory as program units, and the processor executes these program units stored in the memory to realize the corresponding functions.
[0141] The processor contains a kernel, which calls the corresponding program unit from the memory. One or more kernels may be provided, and leaked confidential information is discovered from the massive information on the Internet in a timely and efficient manner by adjusting kernel parameters.
[0142] The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one storage chip.
[0143] The embodiment of the present invention provides a storage medium on which a program is stored, and when the program is executed by a processor, the method for discovering the leakage of confidential information is realized.
[0144] The embodiment of the present invention provides a processor, the processor is used to run a program, wherein the method for discovering leakage of confidential information is executed when the program is running.
[0145] The embodiment of the present invention provides a device. The device includes a processor, a memory, and a program stored on the memory and running on the processor, and the processor implements the following steps when the program is executed:
[0146] Obtaining a confidential information retrieval task, where the confidential information retrieval task carries a first confidential information keyword group and a second confidential information keyword group;
[0147] Searching for at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain webpage information of multiple webpages;
[0148] According to the obtained webpage information, searching for at least one confidential information keyword in the second confidential information keyword group in the webpage content of at least one of the multiple webpages, to determine whether the confidential information is leaked in the at least one webpage.
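The three steps above can be sketched, purely for illustration, as the following Python outline. The names RetrievalTask and find_leaks are hypothetical, and the search engine and page fetcher are passed in as plain callables because the disclosure does not name a specific search engine or HTTP client. Treating any single match from the second keyword group as a leak, and matching case-insensitively, are also assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class RetrievalTask:
    first_keywords: List[str]   # submitted to the search engine
    second_keywords: List[str]  # looked up inside the fetched page content


def find_leaks(
    task: RetrievalTask,
    search: Callable[[str], Iterable[str]],  # keyword -> result URLs (assumed interface)
    fetch: Callable[[str], str],             # URL -> page text (assumed interface)
) -> Dict[str, List[str]]:
    """Return {url: matched second-group keywords} for pages flagged as leaks."""
    # Step 2: query the search engine with each first-group keyword,
    # keeping each result URL once, in first-seen order.
    urls: List[str] = []
    seen = set()
    for keyword in task.first_keywords:
        for url in search(keyword):
            if url not in seen:
                seen.add(url)
                urls.append(url)

    # Step 3: look for second-group keywords in each page's content.
    leaks: Dict[str, List[str]] = {}
    for url in urls:
        content = fetch(url).lower()
        hits = [kw for kw in task.second_keywords if kw.lower() in content]
        if hits:  # assumption: one hit is enough to flag the page
            leaks[url] = hits
    return leaks
```

Injecting search and fetch keeps the sketch runnable offline with stub functions; a real deployment would substitute calls to an actual search engine interface and an HTTP client.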
[0149] The device described herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
[0150] This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
[0151] Obtaining a confidential information retrieval task, where the confidential information retrieval task carries a first confidential information keyword group and a second confidential information keyword group;
[0152] Searching for at least one confidential information keyword in the first confidential information keyword group through a search engine to obtain webpage information of multiple webpages;
[0153] According to the obtained webpage information, searching for at least one confidential information keyword in the second confidential information keyword group in the webpage content of at least one of the multiple webpages, to determine whether the confidential information is leaked in the at least one webpage.
[0154] Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
[0155] This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and any combination of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0156] These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0157] These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0158] In a typical configuration, the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
[0159] The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
[0160] Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
[0161] It should also be noted that the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or equipment including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, commodity, or equipment. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity or equipment that includes the element.
[0162] Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
[0163] The above are only examples of this application, and are not used to limit this application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.