[0032] Embodiment: A method for establishing web crawler tasks for a physician database system, that is, a method for quickly building a fast and stable web crawler for the corresponding website whenever a different website needs to be crawled. The specific implementation is as follows:
[0033] A. In step S11, write a template for storing page link addresses. This step creates a template that stores the link address information of each page to be crawled. The template is equivalent to a blank record book of page addresses: it saves the link address of a crawled page together with the depth of that page. For example, the link to a detail page of a Wanfang paper is:
[0034] (http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx), and the page depth is 3; the content stored in the template is then the above link address and the depth value 3.
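As an illustration of what such a template can look like, the following minimal Python sketch stores a link address together with its page depth (the class and field names are illustrative assumptions, not part of the disclosure):

```python
from dataclasses import dataclass

@dataclass
class PageLinkRecord:
    """Template from step S11: one page link address and its depth."""
    url: str    # link address of the page
    depth: int  # depth of the page within the site

# Example from the embodiment: a Wanfang paper detail page at depth 3.
record = PageLinkRecord(
    url="http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx",
    depth=3,
)
```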
[0035] B. In step S12, write a link parser. First, establish a regular expression: analyze the website to be crawled and, according to the characteristics of the link address of each type of page to be crawled, write a regular expression that can extract that type of link address from the page content. Then write the concrete implementation of the link parser. The input of the link parser is a string representing the content of a web page; the parser uses the regular expression to extract the links that meet the requirements from the page content, stores these links in the template written in step A, and returns them as the result. For example, to write a link parser that extracts links to paper detail pages from the web content of a Wanfang paper list, first write a regular expression to extract the links:
[0036] "
[0037] href='(? http://d.wanfangdata.com.cn/Periodical_.+?\.aspx)'> (?.+?)", and then write a method in the parser to extract the link based on the regular expression. image 3 Is a page of Wanfang's paper list, Figure 4 It is the source code of this page. The source code is used as input, and a parser is used to parse it to extract the links of the five articles displayed in the web content. The extracted links are:
[0038] [1].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx,
[0039] [2].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203002.aspx,
[0040] [3].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203003.aspx,
[0041] [4].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203004.aspx,
[0042] [5].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203005.aspx. After parsing, the above links and their corresponding page depths are stored in the corresponding templates.
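A minimal Python sketch of such a link parser is given below; it repeats the PageLinkRecord template from step S11 so the sketch is self-contained, and it adapts the named-group syntax to Python's (?P<Url>...) form. The function name and the use of the re module are illustrative assumptions, not the disclosed implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class PageLinkRecord:
    """Template from step S11 (repeated here for self-containment)."""
    url: str
    depth: int

# Regular expression from the embodiment; the named group "Url" captures
# the link of a paper detail page (Python named-group syntax).
LINK_PATTERN = re.compile(
    r'href="(?P<Url>http://d\.wanfangdata\.com\.cn/Periodical_.+?\.aspx)">'
)

def parse_links(page_content: str, depth: int) -> list[PageLinkRecord]:
    """Extract detail-page links from the list-page HTML and store each
    link, together with its page depth, in a template record."""
    return [
        PageLinkRecord(url=m.group("Url"), depth=depth)
        for m in LINK_PATTERN.finditer(page_content)
    ]
```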
[0043] C. In step S13, write the executor. For the client to work normally, an executor that submits crawling tasks must also be implemented. The executor sets the priority for the links extracted by the parser from the response content, encapsulates the links that have not yet been crawled into a collection of crawling tasks, and returns this collection to the server as the function's return value. The executor usually has a fixed format, and the default executor is used. If the website to be crawled has a specific directory structure, the default executor needs to be modified accordingly. For example, the 39 Health Network website uses a province/city directory structure, so the executor needs to establish that directory structure in advance.
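A minimal Python sketch of such an executor follows; the class names, the priority_fn callable, and the province/city dictionary are illustrative assumptions rather than the disclosed implementation:

```python
class CrawlTask:
    """One crawling task submitted to the server: a link plus its priority."""
    def __init__(self, url: str, depth: int, priority: int):
        self.url = url
        self.depth = depth
        self.priority = priority

class DefaultExecutor:
    """Default executor (step S13): wraps links that have not yet been
    crawled into a collection of crawling tasks."""
    def __init__(self, crawled_urls: set[str]):
        self.crawled_urls = crawled_urls

    def submit(self, records, priority_fn):
        # priority_fn maps a link record to a priority value (see step D).
        return [
            CrawlTask(r.url, r.depth, priority_fn(r))
            for r in records
            if r.url not in self.crawled_urls
        ]

class DirectoryExecutor(DefaultExecutor):
    """Executor modified for a site organised by province/city directories
    (e.g. the 39 Health Network); the directory tree is built in advance."""
    def __init__(self, crawled_urls: set[str], provinces: dict[str, list[str]]):
        super().__init__(crawled_urls)
        self.provinces = provinces  # hypothetical pre-built directory structure
```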
[0044] D. In step S13, the priority is set. For each web page, the client's executor assigns a priority value to the page task, and the crawler program controls the crawling order through the order of these priority values. The present invention balances the content crawled by multiple distributed web crawlers through this priority setting. The specific method is to first assign a priority interval to each page level and then, while the program is running, randomly assign each page a priority value within its corresponding interval; pages with larger priority values are crawled first.
[0045] For example, if the priority interval of Wanfang's paper list pages is set to [1,50] and the priority interval of Wanfang's paper detail pages is set to [30,80], a paper list page A (http://c.wanfangdata.com.cn/Periodical-ahzylczz.aspx) may be assigned a priority value of 35, while two paper detail pages B1 (http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx) and B2 (http://d.wanfangdata.com.cn/Periodical_ahzylczz201203002.aspx) may be assigned priority values of 30 and 60 respectively.
[0046] The order in which these three pages are crawled is then B2, A, B1.
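The random assignment within per-level priority intervals and the resulting crawl order can be sketched as follows (the interval table and function name follow the example above but are otherwise illustrative):

```python
import random

# Priority intervals per page level, as in the example above:
# paper list pages get [1, 50], paper detail pages get [30, 80].
PRIORITY_INTERVALS = {
    "paper_list": (1, 50),
    "paper_detail": (30, 80),
}

def assign_priority(page_level: str) -> int:
    """Randomly assign a priority value within the interval of the page level."""
    low, high = PRIORITY_INTERVALS[page_level]
    return random.randint(low, high)

# Pages with larger priority values are crawled first.
pages = {
    "A": assign_priority("paper_list"),
    "B1": assign_priority("paper_detail"),
    "B2": assign_priority("paper_detail"),
}
crawl_order = sorted(pages, key=pages.get, reverse=True)
print(crawl_order)  # e.g. ['B2', 'A', 'B1'] when the values are 60, 35, 30
```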
[0047] E. In step S21, the server and the client communicate through the WCF protocol. The client obtains a web page address from the queue of addresses to be crawled, encapsulates it into an HTTP request, and sends it to the server. The server receives the request from the client, sends a request to the corresponding URL, and returns the requested content to the client. The client extracts the link addresses of the pages to be crawled from the content returned by the server and adds them to the queue to be crawled, until the entire website has been crawled.
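For the server side of this exchange, a minimal sketch is given below; the embodiment communicates over WCF, and Python's standard urllib is used here only as an illustrative stand-in for receiving a URL and returning the requested content:

```python
from urllib.request import urlopen

def fetch_for_client(url: str, timeout: float = 10.0) -> bytes:
    """Server side: send a request to the URL to be crawled and return
    the requested content so it can be handed back to the client."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```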
[0048] The client communicates with the server through the WCF protocol. The client parses out useful URL addresses with regular expressions. Before encapsulating a parsed URL into a request and sending it to the server, the client checks whether the corresponding request already exists in the database. If it exists, the request does not need to be sent again, because the request has already been made and the page does not need to be processed again; otherwise, the request is sent to the server and the URL is stored in the database.
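A minimal sketch of this duplicate check, using SQLite as a stand-in for the database (the table name requested_urls and the send_request callable are assumptions):

```python
import sqlite3

def send_if_new(conn: sqlite3.Connection, url: str, send_request) -> None:
    """Send a crawl request for `url` only if it has not been requested before."""
    conn.execute("CREATE TABLE IF NOT EXISTS requested_urls (url TEXT PRIMARY KEY)")
    already = conn.execute(
        "SELECT 1 FROM requested_urls WHERE url = ?", (url,)
    ).fetchone()
    if already is not None:
        return  # the request was already made; the page needs no further processing
    send_request(url)  # encapsulate into a request and send it to the server
    conn.execute("INSERT INTO requested_urls (url) VALUES (?)", (url,))
    conn.commit()
```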
[0049] F. In step S22, if the crawler terminates unexpectedly during the crawling process, it does not need to crawl from the beginning. Only the server and the client need to be restarted; the client reads the unfinished crawling tasks from the database and re-sends the requests to the server until the entire website has been crawled.
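A minimal sketch of this recovery step, again with SQLite standing in for the database (the crawl_tasks table and its completed flag are assumed names):

```python
import sqlite3

def recover_unfinished_tasks(conn: sqlite3.Connection, send_request) -> None:
    """On restart, read every task not yet marked as completed and
    re-send its request to the server."""
    rows = conn.execute(
        "SELECT url FROM crawl_tasks WHERE completed = 0"
    ).fetchall()
    for (url,) in rows:
        send_request(url)
```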
[0050] Based on the above content, the implementation process of the present invention is summarized as follows:
[0051] 1. The client obtains the link address of the web page to be crawled. For the web crawler to crawl the website normally, the client must be given one or more initial crawling link addresses.
[0052] 2. The client takes out a link address to be crawled from the list of links to be crawled in the database, and sends this link address to the server.
[0053] 3. The server sends an HTTP request to the page to be crawled, and returns the requested content A to the corresponding client.
[0054] 4. The client receives the crawled content A returned by the server, and needs to do the following operations on content A:
[0055] 1) The client's executor stores content A on the hard disk.
[0056] 2) The client's parser uses regular expressions to parse the required links B out of content A (there may be one or more parsed links). For example, Figure 3 shows part of one page of Wanfang's paper list, with the names of five papers circled in boxes; Figure 4 shows the source code of this part of the web page, with the links to the five papers circled in boxes and the identical parts of these links circled in green boxes. On inspection, the differences between these links are the characters between the two green boxes, and this part can be matched with the wildcard ".+?" in a regular expression. A regular expression that extracts these links can therefore be written as: href="(?<Url>http://d.wanfangdata.com.cn/Periodical_.+?\.aspx)">. The "(?<Url>...)" in the regular expression means that the matching result is stored in the group Url. Applying this regular expression to the web content in Figure 4 extracts the following five web links:
[0057] [1].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx
[0058] [2].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203002.aspx
[0059] [3].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203003.aspx
[0060] [4].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203004.aspx
[0061] [5].http://d.wanfangdata.com.cn/Periodical_ahzylczz201203005.aspx.
[0063] 3) The client's executor sets priorities for the links B extracted in step 2). The specific method is as follows: before the web crawler runs, a priority interval must be manually assigned to each level of pages to be crawled; during the running of the program, the program then randomly assigns each page a priority value within its corresponding interval, and pages with higher priority values are crawled first. For example, if the priority interval of Wanfang's paper list pages is set to [1,50] and the priority interval of Wanfang's paper detail pages is set to [30,80], then a paper list page A (http://c.wanfangdata.com.cn/Periodical-ahzylczz.aspx) is assigned a priority value of 35 (randomly assigned within the interval [1,50]), and two paper detail pages B1 (http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx) and B2 (http://d.wanfangdata.com.cn/Periodical_ahzylczz201203002.aspx) are assigned priority values of 30 and 60 respectively (randomly assigned within the interval [30,80]); the order in which the three pages are crawled is then B2, A, B1. In this way, the order in which pages are crawled can be controlled by setting the priority of page crawling.
[0064] 4) The link address B extracted from the content is added to the list to be crawled in the database.
[0065] 5. Repeat steps 2-4 until the entire website has been crawled.
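Pulling steps 2 to 4 together, the client loop can be condensed into the following Python skeleton; db, server, parser, executor, storage and priority_fn are placeholders for the components sketched in the preceding steps, not the disclosed implementation:

```python
def client_loop(db, server, parser, executor, storage, priority_fn):
    """Repeat steps 2-4 until no links remain in the to-be-crawled list."""
    while True:
        task = db.pop_next_task()          # step 2: take out a link to crawl
        if task is None:
            break                          # the entire website has been crawled
        content = server.fetch(task.url)   # step 3: server returns content A
        storage.save(task.url, content)    # step 4.1: store content A on disk
        records = parser.parse_links(content, task.depth + 1)  # step 4.2: extract links B
        tasks = executor.submit(records, priority_fn)          # step 4.3: set priorities
        db.add_tasks(tasks)                # step 4.4: add B to the to-be-crawled list
        db.mark_completed(task.url)
```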
[0066] The following functions in the above process are all completed by the client:
[0067] 1. Send a new task to the server and add the unfinished task information to the database;
[0068] 2. Request the content that has been crawled from the website from the server's response queue, and mark the corresponding task in the database as completed. Parse the response: if crawling needs to continue, encapsulate the URL of the page to be crawled into a task request, send it to the server, and record it in the database; otherwise, save the response content;
[0069] 3. Recovery mode: read the unfinished tasks from the database and resend them to the server (as shown in Figure 2).