Method and apparatus for directionally grabbing page resource

A method and apparatus for page resources, applied in the field of Internet resource collection; it addresses the problems of missed pages and low recall rate, and achieves the effects of ensuring representativeness, improving efficiency, and ensuring accuracy.

Inactive Publication Date: 2009-06-10
ZHEJIANG UNIV
0 Cites 80 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, this kind of focused crawler, which only uses the relevance of the parent page to the topic to predict the relevance of a subpage to the topic as guidance, will inevitably miss many page...

Method used

By matching the seed site URL against both the topic-related regular expressions and the topic-independent regular expressions, it can be better determined whether to crawl the represented page, thereby further improving the validity and accuracy of page crawling.

Abstract

The invention discloses a method for directionally grabbing page resources. The method comprises the following steps: pre-crawling pages that meet a number threshold according to seed site URLs; determining characteristic pages among the pre-crawled pages; generating regular expressions that summarize the URLs of the characteristic pages; matching the seed site URLs against the regular expressions, and retaining the seed site URLs that meet the matching condition as crawling target URLs; and crawling pages according to the crawling target URLs. The method can effectively improve the harvest rate and recall rate of page resource crawling, thereby better helping people obtain the required information from the Internet on a large scale, with high efficiency and high accuracy.

Application Domain

Technology Topic

Regular expression, Database +2

Image


Examples

  • Experimental program(1)

Example Embodiment

[0093] In order to make the above objectives, features and advantages of the present invention more obvious and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0094] The present invention can be used in many general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor devices, distributed computing environments including any of the above systems or devices, and so on.
[0095] The invention can be described in the general context of computer-executable instructions executed by a computer, such as a program module. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. The present invention can also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.
[0096] One of the core concepts of the embodiments of the present invention is based on an analysis of Internet page resources: most web pages on the Internet are dynamic pages, and these dynamic pages are formed by querying corresponding data records from a database and filling them into page templates, with the query parameters generally included in the URL. Page URLs generated from the same template often differ only in the query fields, and the topics of pages generated from the same template are often similar. In short, pages generated from the same page template often belong to the same category, and their URLs are very similar.
[0097] Due to the similarity of the URLs of pages belonging to the same topic on the same website, one (or several) regular expressions can be used to generalize them. For example, for the book e-commerce website www.china-pub.com, the computer book information pages can be summarized by the regular expression http://www.china-pub.com/computers/common/info.asp?Id=*, where the asterisk (a metacharacter) represents the index numbers of the various books. With this regular expression, it can be judged whether the page represented by a URL on the site is related to computer book information; that is, by matching the URL against the regular expression, it can be concluded whether the URL is related to the topic.
[0098] For a focused crawler, the pages related to a certain topic it is interested in are often generated by one or several such templates on the same website. Therefore, on top of a dual-crawler architecture of experimental crawlers and focused crawlers, the present invention proposes a focused crawler based on URL rules. This focused crawler can learn, from each topic-related site, the URL regular expressions of representative topic-related content pages, topic-related catalog pages, and topic-independent pages, and then use these URL regular expressions to guide the crawling of the focused crawler.
[0099] Referring to Figure 1, which shows a flowchart of an embodiment of the method for directionally grabbing page resources of the present invention, the method may include the following steps:
[0100] Step 101: Obtain the seed site URL;
[0101] In this embodiment, there is no restriction on how the seed site URLs are obtained; for example, they may be retrieved from a preset seed resource library, or searched for and obtained according to certain keywords. It is feasible for those skilled in the art to use any method to obtain the seed site URLs they need. For example, a list of relevant seed sites in the hardware industry is:
[0102] cnsaw.com www.fsonline.com.cn www.beareyes.com.cn www.gx.xinhuanet.com www.beicha.com www.hnwj.net www.bxg.cn www.ieicn.com www.ce.cn www.ldmetals.com www.cenn.cn www.sealing.cn www.chemeinfo.com www.sg001.cn www.chinabmb.com www.wjjw.cn www.chinabtob.net www.wjw.cn www.cnpv.com www.xmnn.cn www.cutinfo.cn
[0103] In practice, a site may contain a large number of hyperlinked pages. For example, the NetEase site (www.163.com) has many hyperlinks to sports, military, finance and other channels. Therefore, preferably, the seed site URL may include not only the URL of the seed site itself, but also the URLs of all hyperlinks in the page represented by the seed site. However, the site of some hyperlink URLs may no longer be a seed site; for example, on the above NetEase site, some hyperlinks point through friendship links to other sites such as Sohu and Sina, which are non-seed sites. In this case, this step can also include the following sub-step:
[0104] Filtering the seed site URL.
[0105] It is understandable that the filtering sub-step is used to ensure that the crawler only crawls deeply within the seed sites; for example, when crawling pages within the NetEase site, it will not crawl to Sohu, Sina and other sites through the hyperlinks in the friendship links.
[0106] By analyzing the data structure of a URL, it can be seen that a URL can include a protocol parameter (such as the http protocol), a site parameter (host), a path parameter (path), and a query parameter (query); the path parameter can include a series of directories, and the query parameter can include a series of key-value pairs. For example, for the URL http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839, the site parameter is www.china-pub.com; the path parameter is /member/buybook/view.asp; the directories of the path parameter are member, buybook and view.asp; the query parameter is add=1&tid=203839, that is, the key-value pairs of the query parameter are (add, 1) and (tid, 203839). It can be understood that the main purpose of the filtering step is to determine, by extracting the site parameter of a URL, whether it belongs to a seed site: if so, the URL is kept; if not, the URL is removed. Obtaining seed-site URLs through filtering reduces the workload of subsequent page crawling, thereby effectively improving crawling efficiency and crawling accuracy.
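As an illustration of this decomposition (a minimal sketch, not part of the original disclosure; the class name UrlParts is made up), the standard java.net.URI class can be used to pull out the site, path and query parameters described above:

    import java.net.URI;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class UrlParts {
        public static void main(String[] args) throws Exception {
            URI uri = new URI("http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839");

            String host = uri.getHost();                  // site parameter: www.china-pub.com
            String path = uri.getPath();                  // path parameter: /member/buybook/view.asp
            String[] dirs = path.substring(1).split("/"); // directories: member, buybook, view.asp

            // query parameter add=1&tid=203839 -> key-value pairs (add, 1) and (tid, 203839)
            Map<String, String> query = new LinkedHashMap<>();
            for (String pair : uri.getQuery().split("&")) {
                String[] kv = pair.split("=", 2);
                query.put(kv[0], kv.length > 1 ? kv[1] : "");
            }

            System.out.println(host);
            System.out.println(Arrays.toString(dirs));
            System.out.println(query);
        }
    }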
[0107] In practice, the seed site URL can be filtered through the following sub-steps:
[0108] Sub-step A1, read the URL of the seed site into an array, and sort the array;
[0109] Sub-step A2, extract site parameters of a certain URL, and determine whether the site parameters are included in the array, if yes, perform sub-step A3; if not, perform sub-step A4;
[0110] Sub-step A3, reserve the URL;
[0111] Sub-step A4. Remove the URL.
[0112] In practice, a binary search can be used on the array to determine whether the site parameter is included in it. As is well known, the basic idea of binary search is to compare the middle element a[n/2] of the sorted array with the target x. If x = a[n/2], then x is found and the algorithm terminates. Assuming the array elements are arranged in ascending order, if x < a[n/2], the search continues only in the left half of array a; if x > a[n/2], the search continues only in the right half of array a.
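A minimal sketch of sub-steps A1 to A4 (illustrative only; the SeedFilter class and the sample hosts are assumptions): the seed hosts are sorted once, then each candidate URL is kept or removed by binary-searching its site parameter in the sorted array.

    import java.net.URI;
    import java.util.Arrays;

    public class SeedFilter {
        private final String[] seedHosts;

        public SeedFilter(String[] seedHosts) {
            this.seedHosts = seedHosts.clone();
            Arrays.sort(this.seedHosts);                          // sub-step A1: read into an array and sort it
        }

        // Returns true if the URL should be kept, i.e. its site parameter is a seed site.
        public boolean keep(String url) {
            try {
                String host = new URI(url).getHost();             // sub-step A2: extract the site parameter
                return Arrays.binarySearch(seedHosts, host) >= 0; // binary search in the sorted array
            } catch (Exception e) {
                return false;                                     // malformed URL: remove it (sub-step A4)
            }
        }

        public static void main(String[] args) {
            SeedFilter filter = new SeedFilter(new String[]{"www.wjw.cn", "www.cnpv.com", "cnsaw.com"});
            System.out.println(filter.keep("http://www.cnpv.com/news/1.html")); // true  (sub-step A3: keep)
            System.out.println(filter.keep("http://www.sohu.com/index.html"));  // false (sub-step A4: remove)
        }
    }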
[0113] Of course, the above-mentioned filtering method is only used as an example. Those skilled in the art can use any filtering method, for example, setting a site weight value and filtering according to the weight value; or, extracting keywords from the URL, and proceeding according to the keywords. Filtering and the like are all feasible, and the present invention does not need to limit this.
[0114] Step 102: Pre-fetch pages that meet the number threshold according to the URL of the seed site;
[0115] This step is the experimental crawler strategy. The experimental crawler can crawl pages according to a preset page crawling depth, and the number threshold is a preset value that limits the quantity of pages crawled. As the crawling depth increases, the number of pages to be crawled grows exponentially: assuming the depth is d and each level expands by a factor of N, then N^d pages must be crawled to reach the d-th level. In this situation, if N at each level is too large, the efficiency of page crawling will be too low; if N at each level is too small, the distribution range of the crawled pages will be narrow and their breadth insufficient to generate representative regular expressions.
[0116] In order to improve crawling efficiency and ensure the representativeness of the regular expressions, the present invention obtains the number threshold through continuous adjustment over a large number of experiments; the value range of the number threshold is 1000 to 5000. In practice, the number threshold can be set in a configuration file (such as data/config.xml) and read from the configuration file when used.
[0117] Specifically, this step can be implemented based on the Nutch crawler principle in the prior art. The Nutch Crawler is mainly used to crawl web pages from the Internet and to index them. Two aspects of the Crawler matter here: its workflow, and the format and meaning of the data files involved. The data files mainly include three types: the WebDB database (Web database), a series of segments (page data segments), and the index. WebDB stores the link structure information between the web pages crawled by the crawler and is used only while the Crawler is working. Two kinds of entities are stored in WebDB: pages and links. A Page entity characterizes an actual web page by describing its feature information on the network; because many web pages need to be described, WebDB indexes these page entities in two ways, by the URL of the web page and by the MD5 of the web page content. The features described by a Page entity mainly include the number of links in the page, the time this page was crawled and other crawl-related information, and the importance score of the page. Similarly, a Link entity describes the link relationship between two Page entities. WebDB thus constitutes a link structure graph of the crawled web pages, in which the Page entities are the nodes of the graph and the Link entities are its edges.
[0118] One crawl by the Crawler generates many segments; each segment stores the web pages crawled in a single crawl cycle together with the index of these web pages. When crawling, the Crawler generates the fetchlist (fetch list) required for each crawl cycle according to the link relationships in WebDB and a certain crawling strategy; the Fetcher (download thread) then fetches and indexes the web pages referenced by the URLs in the fetchlist and saves them into the segment. Segments are time-limited: when these web pages are re-crawled by the Crawler, the previously generated segments become invalid. Segment folders are named after their generation time so that invalid segments can easily be deleted to save storage space. The index is the index of all web pages crawled by the Crawler, obtained by merging the indexes of all the individual segments.
[0119] The working principle of the Crawler is as follows: first, the Crawler generates from WebDB a collection of URLs of web pages to be crawled, called a Fetchlist; the download thread Fetcher then starts to fetch the web pages according to the Fetchlist. If there are many download threads, many Fetchlists are generated, that is, one Fetcher corresponds to one Fetchlist. The Crawler then updates WebDB according to the fetched web pages and generates a new Fetchlist from the updated WebDB, containing the uncrawled or newly discovered URLs, after which the next round of the crawl cycle starts. This cyclic process can be called the "generate/fetch/update" cycle. In addition, URLs pointing to web resources on the same host are usually assigned to the same Fetchlist, which prevents too many Fetchers from fetching from one host at the same time and overloading that host.
[0120] In Nutch, the realization of the crawler operation is accomplished through the realization of a series of sub-operations. These sub-operations include:
[0121] 1. Create a new WebDB;
[0122] 2. Write the start URL of the crawl into WebDB;
[0123] 3. Generate a fetchlist according to WebDB and write it into the corresponding segment;
[0124] 4. Fetch web pages according to the URL in the fetchlist;
[0125] 5. Update WebDB according to the crawled webpage;
[0126] 6. Repeat steps 3-5 until the preset crawling depth is reached;
[0127] 7. Update the segments according to the page scores and links obtained from WebDB;
[0128] 8. Index the crawled webpages;
[0129] 9. Discard web pages with duplicate content and duplicate URLs in the index;
[0130] 10. Merge the indexes in the segments to generate the final index (merge) for retrieval.
[0131] The detailed workflow of the Crawler is: after a WebDB is created (step 1), the "generate/fetch/update" cycle (steps 3-6) starts from some seed URLs. When this cycle has completely finished, the Crawler builds an index from the segments generated during the crawl (steps 7-10). Before duplicate URLs are removed (step 9), the index of each segment is independent (step 8). Finally, the independent segment indexes are merged into one final index (step 10).
[0132] According to the working principle of the aforementioned Nutch crawler, in this embodiment, the generation of its crawl list can be modified. Specifically, the page can be pre-fetched through the following sub-steps:
[0133] Sub-step B1, write the URL of the seed site into the database (WebDB);
[0134] Sub-step B2, read the URL from the database, and extract the site parameters (host) of the URL;
[0135] Sub-step B3, update the number of URL crawls corresponding to the site parameters;
[0136] Sub-step B4: Determine whether the number of URL crawls exceeds the number threshold, if not, add the URL to the URL crawl list;
[0137] Sub-step B5, download the page corresponding to the URL in the URL grab list, and generate a corresponding page data segment (Segment);
[0138] Sub-step B6: Update the database according to the page data segment.
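The per-site quota of sub-steps B3 and B4 can be sketched as follows (a simplified illustration that stands in for the modified Nutch fetchlist generation; the class and method names are assumptions, not Nutch APIs):

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class QuotaFetchlistGenerator {
        private final int threshold;                                // number threshold, e.g. 1000-5000 from data/config.xml
        private final Map<String, Integer> crawledPerHost = new HashMap<>();

        public QuotaFetchlistGenerator(int threshold) {
            this.threshold = threshold;
        }

        // Sub-steps B2-B4: extract the site parameter of each URL, update its counter,
        // and admit the URL to the crawl list only while the counter is below the threshold.
        public List<String> generateFetchlist(List<String> candidateUrls) {
            List<String> fetchlist = new ArrayList<>();
            for (String url : candidateUrls) {
                String host;
                try {
                    host = new URI(url).getHost();
                } catch (Exception e) {
                    continue;                                       // skip malformed URLs
                }
                int count = crawledPerHost.getOrDefault(host, 0);
                if (count < threshold) {
                    crawledPerHost.put(host, count + 1);
                    fetchlist.add(url);                             // pages in this list are downloaded in sub-step B5
                }
            }
            return fetchlist;
        }
    }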
[0139] As another implementation of this embodiment, the step of filtering the seed site URL can also be completed in this step, for example, by pre-fetching pages through the following sub-steps:
[0140] Sub-step C1: Write the URL of the seed site into the database (WebDB);
[0141] Sub-step C2, read the URL from the database, and extract the site parameters (host) of the URL;
[0142] Sub-step C3, judge whether the site parameters match the seed site, if so, directly execute sub-step C4; if not, remove the URL;
[0143] Sub-step C4, update the number of URL crawls corresponding to the site parameters;
[0144] Sub-step C5: Determine whether the number of URL crawls exceeds the number threshold, and if not, add the URL to the URL crawl list;
[0145] Sub-step C6, download the page corresponding to the URL in the URL grab list, and generate a corresponding page data segment (Segment);
[0146] Sub-step C7: Update the database according to the page data segment.
[0147] Of course, the foregoing method of pre-fetching pages is only used as an example. It is feasible for those skilled in the art to pre-fetch pages by using any web crawler according to actual conditions, and the present invention does not need to limit this.
[0148] Step 103: Determine a feature page in the pre-fetched page;
[0149] In this embodiment, the feature pages may include topic-related pages, and the topic-related pages may further include topic-related content pages and topic-related catalog pages. Here, a topic-related catalog page is a link page whose links point to topic-related content pages.
[0150] In order to further improve the accuracy of page crawling, the feature pages may also include topic-independent pages.
[0151] In the prior art, many methods for page classification have been proposed. For example, a page classification method includes the following steps:
[0152] 11) The sample library is preset, and the sample characteristic parameters are calculated for each sample;
[0153] 12) Collect network texts on the Internet that meet the preset conditions, and calculate the corresponding text feature parameters of the network text;
[0154] 13) Compare the text feature parameters with the feature parameters of each sample in the sample library, and complete the classification of the network text in turn. Generally speaking, a web text can be classified into the sample class with the highest similarity.
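As a rough sketch of steps 11) to 13) (illustrative only; the term-frequency features and cosine similarity used here are assumptions, and any comparable feature and similarity choice would serve), each page is assigned to the sample class with the highest similarity:

    import java.util.HashMap;
    import java.util.Map;

    public class SimilarityClassifier {
        // Steps 11)/12): a very simple feature parameter -- term frequencies of the text.
        static Map<String, Integer> features(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : text.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
            }
            return tf;
        }

        // Cosine similarity between two feature vectors.
        static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                na += e.getValue() * e.getValue();
                Integer w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;
            }
            for (int w : b.values()) nb += w * w;
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Step 13): classify a page into the most similar sample class.
        static String classify(String pageText, Map<String, String> samplesByClass) {
            Map<String, Integer> page = features(pageText);
            String best = null;
            double bestSim = -1;
            for (Map.Entry<String, String> e : samplesByClass.entrySet()) {
                double sim = similarity(page, features(e.getValue()));
                if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
            }
            return best;
        }
    }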
[0155] Alternatively, LingPipe (a Java open source toolkit for natural language processing developed by Alias-i) can be used for page classification. Taking news page classification as an example, the following steps can be included:
[0156] 21) Read the news text content and its category from the original database, and then store each file containing news content in a different folder according to its category; for example, sports news content is stored under folder 1, and entertainment news under folder 2. These data are collectively referred to as training data;
[0157] 22) Based on the LingPipe open source package, first set a similarity threshold (it can be read from the configuration file data/config.xml; the larger the threshold, the higher the required similarity and the easier it is to misjudge), then extract the text content of the page and match it against the training data. Using the matching function, the training data closest to the text content is obtained, and the content category to which this training data belongs is taken as the category of the page;
[0158] 23) Extract keywords as needed and assign certain weights to them, to further confirm whether the page should be assigned to that category.
[0159] Obviously, through the above steps, a more accurate URL of the topic-related page can be obtained, which provides a good resource foundation for learning regular expressions.
[0160] Step 104: Generate a regular expression summarizing the URL of the characteristic page;
[0161] As is well known, regular expressions are tools for text matching, usually composed of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits, while metacharacters have special meanings. Regular expression matching can be understood as finding, in a given string, the parts that match a given regular expression; more than one part of the string may satisfy the expression, and each such part is called a match. In this text, "match" can carry three senses: an adjectival one, as in a string matching an expression; a verbal one, as in matching a regular expression against a string; and a nominal one, namely the "part of the string that satisfies the given regular expression" just mentioned.
[0162] The following uses examples to illustrate the regular expression generation rules.
[0163] If you want to find hi, you can use the regular expression hi. This regular expression exactly matches such a string: it consists of two characters, the first being h and the second being i. In practice, regular expressions can ignore case. Many words contain the two consecutive characters hi, such as him, history and high, so searching with hi will also find the hi inside these words. If you want to find the word hi exactly, you should use \bhi\b, where \b is a regular expression metacharacter representing the beginning or end of a word, that is, the boundary between words. Although English words are usually separated by spaces, punctuation or newlines, \b does not match any of these separator characters; it only matches a position. If you are looking for a Lucy not far after a hi, you should use \bhi\b.*\bLucy\b. Here, the dot (.) is another metacharacter, which matches any character except the newline character, and * is also a metacharacter, expressing a quantity: the content before the * may be repeated any number of times in a row so that the whole expression matches. The meaning of \bhi\b.*\bLucy\b is now obvious: first the word hi, then any number of arbitrary characters (but no newline), and finally the word Lucy.
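For instance, the pattern just discussed can be tried directly in Java (a minimal illustration of regular expression matching; the sample sentence is made up):

    import java.util.regex.Pattern;

    public class RegexDemo {
        public static void main(String[] args) {
            // \bhi\b.*\bLucy\b : the word "hi", then any characters (no newline), then the word "Lucy"
            Pattern p = Pattern.compile("\\bhi\\b.*\\bLucy\\b");
            System.out.println(p.matcher("He said hi to Lucy yesterday.").find()); // true

            // "history" contains hi, but not as a whole word, so \bhi\b does not match it
            System.out.println(Pattern.compile("\\bhi\\b").matcher("history").find()); // false
        }
    }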
[0164] Preferably, this step may include the following sub-steps:
[0165] Sub-step D1: divide the feature page URL into multiple URL subsets;
[0166] Sub-step D2, aggregate the subset of URLs into multiple URL categories;
[0167] Substep D3: Extract the regular expression of the URL category.
[0168] Generally speaking, the sub-step of dividing multiple URL subsets can be implemented by dividing URLs with the same site parameters into the same URL subset.
[0169] More specifically, the sub-step of dividing multiple URL subsets can be implemented through the following steps:
[0170] Divide URLs with the same site parameters into the same URL subset;
[0171] Divide URLs with the same number of directories into the same URL subset.
[0172] In some cases, the division of multiple URL subsets can be achieved through the following steps:
[0173] Divide URLs with the same site parameters into the same URL subset;
[0174] Divide URLs with path parameters with the same number of directories into the same URL subset;
[0175] Divide URLs with the same query parameters into the same URL subset.
[0176] It can be seen that URL division splits a URL collection into several subsets according to a certain standard: its input is a collection of URLs and its output is URL subsets. First, based on the site parameters of the URLs, URLs with the same site parameter are placed in the same subset, so that the entire URL collection is divided into several subsets in which the site parameters of all URLs are the same. After this division, further divisions can be made, for example according to the number of directories in the path parameter of the URL, placing URLs with the same number in the same subset, or according to the query parameter of the URL, placing URLs with the same query parameters in the same subset. In practice, however, if only the query parameter parts of two URLs differ, the pages they refer to are basically generated by the same template and generally belong to the same type of page, so division based on query parameters is only used in extreme situations. The purpose of dividing the URL set is mainly to facilitate the next clustering step and thereby save clustering time.
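A minimal sketch of this division step (illustrative only; the grouping key shown combines the site parameter with the directory count, as described above):

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class UrlPartitioner {
        // Group URLs so that each subset shares the same host and the same number of path directories.
        public static Map<String, List<String>> partition(List<String> urls) {
            Map<String, List<String>> subsets = new HashMap<>();
            for (String url : urls) {
                try {
                    URI uri = new URI(url);
                    String path = uri.getPath() == null ? "" : uri.getPath();
                    int dirCount = path.length() <= 1 ? 0 : path.substring(1).split("/").length;
                    String key = uri.getHost() + "|" + dirCount;   // same host, same directory count
                    subsets.computeIfAbsent(key, k -> new ArrayList<>()).add(url);
                } catch (Exception e) {
                    // malformed URL: ignore it
                }
            }
            return subsets;
        }
    }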
[0177] In this case, the sub-step of aggregating multiple URL classes may include:
[0178] Preset clustering rules of the URL category;
[0179] Read a URL from the URL subset, and determine whether the URL meets the clustering rules of a URL category; if so, assign the URL to that URL category; if not, create a new URL category based on the URL.
[0180] In this way, after aggregation, a number of URL classes are obtained in the URL class queue. One URL class can contain several similar URLs, that is, the URLs in a URL class have certain similarities. Preferably, therefore, the clustering rule can be implemented by setting a similarity function, that is, all URLs in the same class satisfy this similarity function.
[0181] In order to obtain better clustering results, the sub-step of aggregating multiple URL categories may further include the steps:
[0182] Count the number of URL categories and the total number of URLs;
[0183] Adjust the clustering rule of the URL category according to the statistical result.
[0184] Then, for each URL class, a URL regular expression that can summarize and represent the class is extracted. Specifically, according to the data structure of the URL described earlier, the URL is decomposed into three parts, host (site parameter), path (path parameter) and query (query parameter); the path is decomposed into a series of directories, and the query into a series of key-value pairs. Since the host parts must be the same, the host is written directly into the regular expression. The directories of the path parts are then aligned: if the directories at corresponding positions are the same, that value is added to the regular expression; otherwise a * is added. The query part is added to the regular expression in a way similar to the path part. Corresponding regular expressions are generated in this way, and there are usually several of them.
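A simplified sketch of this extraction rule (illustrative; it assumes the URLs of one class share the same host and the same number of directories, as guaranteed by the division step, and it collapses the query part to a single wildcard rather than aligning individual key-value pairs):

    import java.net.URI;
    import java.util.Arrays;
    import java.util.List;

    public class RegexExtractor {
        public static String extract(List<String> urlClass) throws Exception {
            URI first = new URI(urlClass.get(0));
            String[] dirs = first.getPath().substring(1).split("/");
            boolean[] same = new boolean[dirs.length];
            Arrays.fill(same, true);

            for (String url : urlClass) {
                String[] d = new URI(url).getPath().substring(1).split("/");
                for (int i = 0; i < dirs.length; i++) {
                    if (!d[i].equals(dirs[i])) same[i] = false;   // differing directory -> wildcard
                }
            }

            StringBuilder sb = new StringBuilder("http://").append(first.getHost());
            for (int i = 0; i < dirs.length; i++) {
                sb.append("/").append(same[i] ? dirs[i] : "*");   // identical directories are kept as-is
            }
            if (first.getQuery() != null) {
                sb.append("?*");                                  // query part simplified to a wildcard here
            }
            return sb.toString();
        }
    }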
[0185] Step 105: Match the seed site URL against the regular expressions, and retain the seed site URLs that meet the matching condition as crawling target URLs;
[0186] In the case where the feature page is a topic-related page, the regular expression is also generated by summarizing the URL of the topic-related page. In this case, the matching step may include:
[0187] If the seed site URL matches a regular expression of a topic-related page, the seed site URL meets the matching condition.
[0188] In the case that the characteristic page is a topic-independent page, the regular expression is also generated by summarizing the URL of the topic-independent page. In this case, the matching step may further include:
[0189] If the seed site URL matches a regular expression of a topic-independent page, the seed site URL does not meet the matching condition.
[0190] By matching the URL of the seed site with topic-related regular expressions and topic-unrelated regular expressions, it is possible to better determine whether to crawl the represented page, thereby further improving the effectiveness and accuracy of page crawling.
[0191] Of course, it is feasible for those skilled in the art to set corresponding matching conditions according to actual conditions, and the present invention does not need to limit this.
[0192] In practice, a small number of topic-related pages may be classified as topic-independent pages, or a few topic-independent pages may be classified as topic-related pages. As a result, among the URL regular expressions learned from a website as representing topic-related URLs, there may be some that actually represent topic-independent pages. To filter out these regular expressions, this embodiment may further include the following steps:
[0193] Count the number of URLs matched by the regular expression;
[0194] If the number of URLs is less than the preset filtering threshold, the regular expression is deleted.
[0195] Usually, the number of URLs matched by these URL regular expressions is relatively small, so URL regular expressions with a particularly small number of matched URLs can be filtered out as noise.
[0196] Specifically, the number of URLs can be counted based on the following matching corresponding strategies:
[0197] 1. If the URL matches multiple regular expressions of topic-related pages, the URL is matched with the regular expression having the largest number of matched URLs;
[0198] 2. If the URL matches multiple regular expressions of topic-independent pages, the URL is likewise matched with the regular expression having the largest number of matched URLs.
[0199] In this case, if the feature page includes a topic-related page and a topic-unrelated page, the matching step may include sub-steps:
[0200] Sub-step S1, matching the URL of the seed site with the regular expressions of the subject-related pages and the regular expressions of the subject-independent pages;
[0201] Sub-step S2, respectively, count the number of URLs corresponding to the regular expressions of topic-related pages matched by the URL of the seed site, and the number of URLs corresponding to the regular expressions of the topic-independent pages, and compare them;
[0202] Sub-step S3: If the comparison result meets the noise filtering threshold, the seed site URL does not meet the matching condition.
[0203] Step 106: Grab the page according to the crawling target URL.
[0204] The page crawling in this step can be implemented with reference to the aforementioned Nutch crawler, or can be implemented with other methods in the prior art, which is not limited by the present invention.
[0205] In practice, there may also be "over-learning" (the learned regular expressions are too general, so that they also match URLs of other categories) or "under-learning" (the learned regular expressions match only part of a category, so that other URLs in the category fail to match). To solve this problem, the following strategy can also be applied during page crawling:
[0206] 1. Determine whether the site parameter of the new URL belongs to a derived site; if not, the page of this URL is not crawled and the process ends;
[0207] 2. Search the URL regular expression list for a URL regular expression of a topic-related page that matches the current URL; if none exists, the page of this URL is not crawled and the process ends;
[0208] 3. If multiple URL regular expressions of topic-related pages match, select the optimal regular expression, that is, the one that summarizes the largest number of URLs;
[0209] 4. Search the URL regular expression list for a URL regular expression of a topic-independent page that matches the current URL; if none exists, the page of this URL is crawled and the process ends;
[0210] 5. If multiple URL regular expressions of topic-independent pages match, select the optimal regular expression;
[0211] 6. If P/N>f, then grab the page, otherwise don't grab it.
[0212] Among them, P represents the number of URLs summarized by the URL regular expressions of the optimal topic-related pages, N represents the number of URLs summarized by the URL regular expressions of the optimal topic-independent pages, and f is a filtering threshold greater than 0.
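A compact sketch of steps 3 to 6 above (illustrative; the RulePattern class holding each learned expression together with the number of URLs it summarized is an assumption):

    import java.util.Comparator;
    import java.util.List;

    public class CrawlDecision {
        // A learned URL expression together with the number of URLs it summarized during learning.
        public static class RulePattern {
            final String pattern;
            final int urlCount;
            RulePattern(String pattern, int urlCount) { this.pattern = pattern; this.urlCount = urlCount; }
            boolean matches(String url) { return url.matches(pattern); }
        }

        // Pick the best (largest urlCount) matching topic-related and topic-independent expressions,
        // then crawl only if P/N > f, or if no topic-independent expression matches at all.
        public static boolean shouldCrawl(String url, List<RulePattern> related,
                                          List<RulePattern> unrelated, double f) {
            RulePattern best = bestMatch(url, related);
            if (best == null) return false;                   // step 2: no topic-related expression matches
            RulePattern bestNeg = bestMatch(url, unrelated);
            if (bestNeg == null) return true;                 // step 4: no topic-independent expression matches
            double p = best.urlCount, n = bestNeg.urlCount;
            return p / n > f;                                 // step 6: P/N > f
        }

        private static RulePattern bestMatch(String url, List<RulePattern> rules) {
            return rules.stream()
                    .filter(r -> r.matches(url))
                    .max(Comparator.comparingInt((RulePattern r) -> r.urlCount))
                    .orElse(null);
        }
    }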
[0213] In order to enable those skilled in the art to better understand the present invention, the embodiments of the present invention will be described in detail below through a specific example.
[0214] (1) Obtain the seed site URLs, filter the hyperlink URLs in the pages represented by the seed site URLs, and obtain a topic-related "derived site URL list".
[0215] The specific process is: first read the seed site URLs from a seed site file into an array and sort the array; for a URL to be crawled, first take out its site parameter and then binary-search for that site parameter in the array. If the site parameter is found in the array, the URL is returned, meaning that the URL is not filtered out and the page it represents should be crawled; if the site parameter is not found in the array, the URL is filtered out, indicating that the page it represents should not be crawled.
[0216] This step can reduce the workload of the experimental crawler, and at the same time improve the efficiency and accuracy of the experimental crawler.
[0217] (2) Use the experimental crawler to grab the page to be learned:
[0218] According to the filtered seed site URL list (i.e., the derived site URL list), an experimental crawler is used to crawl web pages from these URLs. It uses a breadth-first search algorithm to crawl up to N pages from each seed site (N is 1000 to 5000). The experimental crawler is implemented based on Nutch (Apache 2007), but in this example the crawl list generator of the Nutch crawler is modified. The modified crawl list generator process is as follows:
[0219] 1. Read the threshold of the number of URLs set in the configuration file (data/config.xml), and initialize the MAP (site parameters, number of URLs);
[0220] 2. Extract the site parameters of the URL to be crawled, and find out whether there are corresponding site parameters in the MAP, if yes, go directly to 4; if not, go to 3;
[0221] 3. Add the site parameter to the MAP and initialize its corresponding URL count to 0;
[0222] 4. Add 1 to the number of URLs corresponding to the corresponding site parameters;
[0223] 5. Determine whether the number of URLs corresponding to the site parameters exceeds the number threshold, if so, do not add this URL to the crawl list; if not, add this URL to the crawl list.
[0224] (3) Classify the captured N pages and determine whether a page is a theme-related content page or a theme-related catalog page, and the theme-related content page and the theme-related catalog page constitute a theme-related page collection;
[0225] (4) Learn URL regular expressions from the collection of topic-related pages:
[0226] The specific learning process is:
[0227] 1. URL distance:
[0228] (1) URL data structure
[0229] Divide a URL into three parts (removing the http protocol part): host, path, and query. The path is composed of a series of directories, and the query is composed of a series of key-value pairs. For example, for the URL http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839, its host is www.china-pub.com; its path is /member/buybook/view.asp, and the directories composing the path are member, buybook and view.asp; its query is add=1&tid=203839, and the key-value pairs composing the query are (add, 1) and (tid, 203839). The URL data structure expressed in Java is as follows:
[0230] public class URLStruct {        // requires: import java.util.ArrayList;
[0231]     private String host;        // site parameter (host)
[0232]     private String[] path;      // directories of the path parameter
[0233]     private ArrayList<String> query;  // key-value pairs of the query parameter, e.g. "add=1", "tid=203839"
       }
[0234] (2) URL distance (similarity) measurement
[0235] After decomposing a URL into the above URL data structure, the distance between URLs can be calculated from the distances between the parts of the data structure. The distance d_URL(i,j) between two URLs i and j can be expressed by the following formula:
[0236] d_URL(i,j) = (d_Host(i,j) + 1) × (d_Path(i,j) + 1) × (d_Query(i,j) + 1) − 1
[0237] where d_Host(i,j) is the distance between the host parts of the i-th and j-th URLs, d_Path(i,j) is the distance between their path parts, and d_Query(i,j) is the distance between their query parts.
[0238] The distance calculation principle of the above three parts is as follows:
[0239] a. If the hosts of the two URLs are not the same, then d_Host(i,j) = 32; otherwise d_Host(i,j) = 0.
[0240] Setting d_Host(i,j) in this way helps prevent URLs from different hosts from being clustered into the same class.
[0241] b. Suppose the path parts of the i-th and j-th URLs contain m and n directories respectively, with m ≤ n, and let k be the number of unequal directories among the first m corresponding positions; then
[0242] d_Path(i,j) = k × 2 + (n − m) × 4
[0243] c. d_Query(i,j) is set simply: if the query parts are equal, then d_Query(i,j) = 0; otherwise d_Query(i,j) = 1.
[0244] This is because, for most websites, if the other parts are equal and only the query part differs, the pages referred to by the two URLs are usually generated by the same template and generally belong to the same type of page.
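The distance just defined can be written directly against the parsed URL parts described above (a sketch; it assumes the host, path directories and query key-value pairs have already been extracted, as in the URLStruct data structure):

    import java.util.List;

    public class UrlDistance {
        // d_Host: 32 if the hosts differ, otherwise 0.
        static int hostDistance(String hostI, String hostJ) {
            return hostI.equals(hostJ) ? 0 : 32;
        }

        // d_Path: k*2 + (n-m)*4, where k is the number of unequal directories
        // among the first m corresponding positions and m <= n.
        static int pathDistance(String[] pathI, String[] pathJ) {
            String[] shorter = pathI.length <= pathJ.length ? pathI : pathJ;
            String[] longer  = pathI.length <= pathJ.length ? pathJ : pathI;
            int k = 0;
            for (int idx = 0; idx < shorter.length; idx++) {
                if (!shorter[idx].equals(longer[idx])) k++;
            }
            return k * 2 + (longer.length - shorter.length) * 4;
        }

        // d_Query: 0 if the query parts are equal, otherwise 1.
        static int queryDistance(List<String> queryI, List<String> queryJ) {
            return queryI.equals(queryJ) ? 0 : 1;
        }

        // d_URL(i,j) = (d_Host + 1) x (d_Path + 1) x (d_Query + 1) - 1
        static int urlDistance(String hostI, String[] pathI, List<String> queryI,
                               String hostJ, String[] pathJ, List<String> queryJ) {
            return (hostDistance(hostI, hostJ) + 1)
                 * (pathDistance(pathI, pathJ) + 1)
                 * (queryDistance(queryI, queryJ) + 1) - 1;
        }
    }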
[0245] 2. URL collection division:
[0246] URL division splits the URL collection into several subsets according to a certain standard: its input is a collection of URLs and its output is URL subsets. In this example, according to the host part of the URLs, URLs with the same host can be grouped into the same subset, so that the entire URL set is divided into several subsets; the hosts of the URLs within each subset are the same, while the hosts of URLs in different subsets differ. Alternatively, the division can be made according to the number of directories in the path part of the URL, with URLs having the same number grouped into the same subset.
[0247] 3. URL aggregation algorithm:
[0248] After dividing a URL set, several URL subsets are obtained. Such a URL subset is a cluster of URLs. The aggregation algorithm is implemented for each cluster of URLs to aggregate into several types of URLs. The specific process of aggregation can be divided into the following steps:
[0249] (1) Read the URL list from a certain URL cluster divided by the URL collection;
[0250] (2) Create a new URL class list C_URL = {C_URL(1), C_URL(2), ..., C_URL(j), ..., C_URL(m)}, initialized so that C_URL(1) = URL(1), and set the distance threshold h = 1;
[0251] (3) Continue to read URL(i), i = 2, 3, ..., n, from the URL list of the cluster being aggregated; if i >= n, end;
[0252] (4) Search the URL class list C_URL for a URL class C_URL(j) that matches URL(i); if the match succeeds, add URL(i) to C_URL(j); if there is no matching URL class, create a new URL class C_URL(j+1), add URL(i) to the newly created class C_URL(j+1), and insert the new class into the URL class list C_URL; then go to step (3).
[0253] A URL matching a URL class means that the distance between the URL and every URL in that class is not greater than the distance threshold h, that is,
[0254] d_URL(i,j) ≤ h
[0255] In this way, a number of URL classes are obtained in the URL class queue after aggregation. The distance between any two URLs in the same class is not greater than the distance threshold h, so URLs in the same class have certain similarities.
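A minimal sketch of steps (1) to (4) of this aggregation (illustrative only; each URL class is represented simply as a list of URLs, and the distance function is the d_URL defined above):

    import java.util.ArrayList;
    import java.util.List;

    public class UrlAggregator {
        // Distance function between two URLs, e.g. d_URL as defined above.
        interface Distance { int apply(String a, String b); }

        // Aggregate one URL cluster into classes: a URL joins a class only if its distance
        // to every URL already in that class is not greater than the threshold h;
        // otherwise a new class is created for it.
        public static List<List<String>> aggregate(List<String> cluster, Distance d, int h) {
            List<List<String>> classes = new ArrayList<>();
            for (String url : cluster) {
                List<String> match = null;
                for (List<String> cls : classes) {
                    if (cls.stream().allMatch(u -> d.apply(u, url) <= h)) {
                        match = cls;
                        break;
                    }
                }
                if (match == null) {             // no matching class: create a new one
                    match = new ArrayList<>();
                    classes.add(match);
                }
                match.add(url);
            }
            return classes;
        }
    }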
[0256] It is then checked whether the aggregated URL class list meets a preset condition on the following quantities:
[0257] where m is the number of URL classes in the aggregation result produced by static aggregation, n is the total number of URLs, l is the maximum number of URLs contained in a single class C_URL(j), and p is a quantity induction parameter with 0 < p < 1.
[0258] Through the above aggregation process, a URL class list C_URL = {C_URL(1), C_URL(2), ..., C_URL(j), ..., C_URL(m)} is obtained, in which each class C_URL(j) represents one or more URLs.
[0259] 4. URL regular expression extraction:
[0260] After a cluster of URLs has been aggregated, several URL classes C_URL(j) are generated. The next extraction process extracts, for each URL class, a URL regular expression that can summarize and represent that class.
[0261] According to the aforementioned URL data structure, every URL is decomposed into three parts, host, path and query; the path is decomposed into a series of directories, and the query into a series of key-value pairs. Since the host parts must be the same, the host is recorded in the regular expression as is. The directories of the path parts are then aligned: if the directories at corresponding positions are the same, that value is added to the regular expression; otherwise a * is added. The query part is handled in a way similar to the path part. Finally, several regular expressions related to the hardware industry are obtained.
[0262] (5) Focused crawlers carry out web page crawling work under the guidance of regular expressions according to the "derived site URL list". The main steps are as follows:
[0263] 1. Read the URL list of derivative sites;
[0264] 2. Read the URL regular expression list file obtained in the previous stage (including topic-related regular expressions and topic-independent regular expressions), then perform positive and negative example matching on each derived site URL against these regular expressions to decide whether to crawl the web page where the derived URL is located, and generate a "crawl target URL list" from the derived URLs to be crawled;
[0265] 3. Carry out web page crawling work according to the "crawl target URL list".
[0266] The focused crawler in this example is also implemented based on the Nutch crawler, and also only modified the crawl list generator part. The modified process of the crawl list generator is:
[0267] 1. Initialization. Initialize a site filter and a URL regular expression filter;
[0268] The rule of the site filter is to determine whether the host of the URL to be crawled can be found in the derived site list. If it can be found, the URL passes the site filter and is not filtered out. This ensures that the crawler does not crawl pages of sites that are not on the topic-related site list, thereby improving vertical crawling efficiency.
[0269] The rule of the URL regular expression filter is to determine whether the URL to be crawled matches the regular expression, if it is, no filtering is required; if not, it is filtered out.
[0270] 2. Read a URL that needs to be crawled from WebDB;
[0271] 3. Use a site filter to filter the URL, if you need to filter, go to 6;
[0272] 4. Use the URL regular expression filter to filter the URL, if you need to filter, go to 6;
[0273] 5. Add the URL to the list of URLs that need to be crawled;
[0274] 6. If the reading of WebDB is completed, end, otherwise go to 2.
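The two filters of this modified crawl list generator can be sketched as follows (the class is an illustrative stand-in, not a Nutch plugin interface):

    import java.net.URI;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Pattern;

    public class FocusedListGenerator {
        private final Set<String> derivedSites;        // site filter: hosts from the derived site URL list
        private final List<Pattern> topicPatterns;     // URL regular expression filter: topic-related expressions

        public FocusedListGenerator(Set<String> derivedSites, List<Pattern> topicPatterns) {
            this.derivedSites = derivedSites;
            this.topicPatterns = topicPatterns;
        }

        // Steps 3-5: a URL is added to the crawl list only if it passes both filters.
        public boolean accept(String url) {
            try {
                String host = new URI(url).getHost();
                if (!derivedSites.contains(host)) return false;                      // removed by the site filter
                return topicPatterns.stream().anyMatch(p -> p.matcher(url).find());  // must match some expression
            } catch (Exception e) {
                return false;
            }
        }
    }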
[0275] (6) Noise filtering
[0276] In practice, a small number of topic-related pages may be judged as topic-independent pages, or a small number of topic-independent pages may be judged as topic-related pages. As a result, among the URL regular expressions learned from a website as representing the topic, there may be some that actually represent topic-independent pages, and these URL regular expressions need to be filtered out. Since the number of URLs summarized by such expressions is relatively small, the URL regular expressions with a particularly small number of summarized URLs can be filtered out as noise. The filtering criterion is:
[0277] n_URL ≤ N × v
[0278] where n_URL is the number of URLs summarized by the expression and v is the "noise filtering threshold", usually 0 < v < 1.
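A one-method sketch of this noise filter (illustrative; the map from each expression to the number of URLs it summarized is an assumption about how the learning results are stored):

    import java.util.Map;
    import java.util.stream.Collectors;

    public class NoiseFilter {
        // Keep only expressions whose summarized URL count exceeds N x v.
        public static Map<String, Integer> filter(Map<String, Integer> urlCountByExpression, int n, double v) {
            double cutoff = n * v;
            return urlCountByExpression.entrySet().stream()
                    .filter(e -> e.getValue() > cutoff)
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        }
    }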
[0279] In order to better illustrate the technical effects of the present invention, the following compares the performance of the focused crawler of the present invention (UBFC), a breadth-first search crawler (BFSC), and a basic focused crawler (BLFC) in crawling hardware-industry news pages.
[0280] The performance may be evaluated by the harvest rate and the recall rate. Specifically, the harvest rate may represent the proportion of the pages related to the subject in all the web pages crawled by the web crawler, which may represent the crawling accuracy rate. It can be calculated by the following formula:
[0281] g=p/d
[0282] Among them, g represents the harvest rate, p represents the number of crawled topic-related web pages, and d represents the total number of crawled web pages.
[0283] The recall rate can represent the ratio of topic-related pages crawled by web crawlers to all topic-related pages on the Internet, and can be calculated by the following formula:
[0284] r=p/ps
[0285] Among them, r represents the recall rate, p represents the number of crawled topic-related pages, and ps represents the total number of topic-related pages that actually exist.
[0286] It should be noted that, since it is impossible to count the number of topic-related pages on the Internet, the recall rate is difficult to calculate in practice. Therefore, the experiment is based on a simulated data set: the simulated data set is treated as a simulated Internet, and the number of topic-related pages contained in the simulated data set is used as the total number of topic-related pages when calculating the recall rate.
[0287] Using the focused crawler of the present invention (UBFC), the breadth-first search crawler (BFSC), and the basic focused crawler (BLFC) to crawl the list of relevant seed sites in the hardware industry, with a crawling depth of 4 and 57 threads opened, the results obtained with the various crawlers are shown in the following crawl test result table. In the table, S represents the total number of crawled pages, P the number of related pages, G the harvest rate, and R the recall rate. It should be noted that the breadth-first search crawler (BFSC) has no recall rate, because the number of relevant pages crawled by BFSC is used as the benchmark value for calculating the recall rates of the other two crawlers.
[0288] The overall crawl test results of BFSC, BLFC and UBFC are:
[0289]        S (total pages)  P (related pages)  G (harvest rate)  R (recall rate)
       BFSC   82946            3558               0.04              -
       BLFC   1629             95                 0.06              0.03
       UBFC   5670             1514               0.27              0.43
[0290] As can be seen from the above table, websites such as beareyes, ce, chemeinfo, chinabmb, cnpv, fsonline, ieicn, sg001 and xmnn have almost no topic-related pages, yet BFSC still downloaded 51135 pages from these websites, of which 401 are related to the topic. BLFC downloaded only 593 pages from these sites, of which 4 are related to the topic, while UBFC works better: it downloaded 575 pages, of which 42 are related to the topic. In terms of this performance, UBFC is undoubtedly the best.
[0291] From an overall point of view, BFSC crawled a total of 82,946 pages, of which only 3,558 were related to the topic, a harvest rate of 0.04. BLFC crawled a total of 1,629 pages, 95 of which were related to the topic; its harvest rate was 0.06, only slightly higher than BFSC, and its recall rate was only 0.03. From this point of view, BLFC is of little practical value, because a large number of related pages were not crawled. This is because it only crawls URLs linked from topic-related pages; not all topic-related pages are linked to one another, and pages often contain a large number of links, most of which may be topic-unrelated, which is why BLFC crawls so few pages. UBFC crawled a total of 5,670 pages, of which 1,514 were related to the topic, a harvest rate of 0.27 and a recall rate of 0.43. It can be seen that the total number of pages crawled by UBFC is only about one-fifteenth of that of BFSC, while its recall rate is close to one-half, more than 10 times that of BLFC; its harvest rate is more than 6 times that of BFSC and more than 4 times that of BLFC.
[0292] Referring further to Figure 2, which compares, over time, the trends in the number of topic-related web pages crawled during the performance test: at the beginning the numbers of relevant pages crawled by the three crawlers are almost the same, but as time passes UBFC grows faster and faster, showing the advantages of the present invention, while the growth of BFSC and BLFC is slow. In the end BLFC even crawls fewer relevant pages than BFSC. This is because, first, when deciding whether to put an unknown URL into the crawl list, BLFC must determine whether its parent page is a topic-related page, which takes time; and second, when deciding whether to put an unknown URL into the crawl list, BLFC filters more than BFSC, so few pages need to be crawled at each level and the crawling of each level is completed quickly, which causes it to spend much of its time not on actual page crawling but on the self-update and maintenance of Nutch's data structures.
[0293] Figure 3 further illustrates the relationship between the harvest rates of the three crawlers and time. It can be seen from the figure that the harvest rates of BFSC and BLFC stabilize quickly, while the harvest rate of UBFC rises for a longer time before gradually stabilizing, and its harvest rate is much higher than that of the other crawlers.
[0294] Through the above analysis, it can be concluded that the present invention effectively improves the harvest rate and recall rate of page resource crawling, and can better help people obtain the required information from the Internet in a large range, high efficiency, and high precision.
[0295] For the foregoing method embodiments, for simplicity of description they are all expressed as a series of action combinations, but those skilled in the art should know that the present invention is not limited by the described sequence of actions, since according to the present invention some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
[0296] Referring to Figure 4, which shows a structural block diagram of an embodiment of an apparatus for directionally grabbing page resources of the present invention, the apparatus may include the following modules:
[0297] The seed filtering module 401 is used to filter the seed site URL, where the seed site URL includes the URL of the seed site itself and the URLs of its links;
[0298] The experimental crawler module 402 is used to pre-crawl pages that meet the number threshold according to the URL of the seed site;
[0299] The classification module 403 is used to determine a characteristic page in the pre-fetched page;
[0300] The regular expression learning module 404 is configured to generate a regular expression summarizing the URL of the characteristic page;
[0301] The matching module 405 is configured to match the seed site URL against the regular expressions, and retain the seed site URLs that meet the matching condition as crawling target URLs;
[0302] The focused crawler module 406 is used to crawl the page according to the crawling target URL.
[0303] Preferably, the regular expression learning module 404 may include the following submodules:
[0304] The set division sub-module is used to divide the feature page URL into multiple URL subsets;
[0305] The clustering sub-module is used to aggregate the subset of URLs into multiple URL categories;
[0306] The extraction submodule is used to extract the regular expression of the URL category.
[0307] Preferably, the URL includes site parameters, and the set division sub-module may include the following units:
[0308] The first dividing unit is used to divide URLs with the same site parameters into the same URL subset.
[0309] More preferably, the URL may also include path parameters, and the set division submodule may also include the following units:
[0310] The second dividing unit is used to divide URLs of path parameters with the same number of directories into the same URL subset.
[0311] Furthermore, the URL may also include query parameters, and the set division submodule may also include the following units:
[0312] The third dividing unit is used to divide URLs with the same query parameters into the same URL subset.
[0313] Preferably, the clustering sub-module may include the following units:
[0314] The rule setting unit is used to preset the clustering rules of the URL category;
[0315] The processing unit is configured to read URLs from the URL subset, and determine whether the URL conforms to the clustering rules of the URL category, and if so, assign the URL to the URL category; if not, Then create a new URL class based on the URL.
[0316] More preferably, the clustering submodule may further include the following units:
[0317] A statistical unit, used to count the number of URL categories and the total number of URLs;
[0318] The rule adjustment unit is configured to adjust the clustering rule of the URL category according to the statistical result.
[0319] Preferably, the seed filtering module may include the following units:
[0320] An array generation sub-module for reading the URL of the seed site into an array, and sorting the array;
[0321] The site filtering sub-module is used to extract site parameters of a certain URL, determine whether the site parameters are included in the array, and if so, keep the URL; if not, remove the URL.
[0322] Preferably, this embodiment may also include the following modules:
[0323] A quantity statistics module for counting the number of URLs matched by the regular expression;
[0324] The regular expression filtering module is configured to delete the regular expression when the number of URLs is less than a preset filtering threshold.
[0325] Referring to Figure 5, which shows a flowchart of the method for directionally grabbing page resources applied to the embodiment shown in Figure 4, the method may include the following steps:
[0326] Step 501: The seed filtering module filters the URL of the seed site;
[0327] Preferably, the seed site URL includes the URL of the seed site itself and the URLs of its links. In this case, the processing steps of the seed filtering module include:
[0328] The array generation sub-module reads the URL of the seed site into an array, and sorts the array;
[0329] The site filtering sub-module extracts the site parameters of a certain URL, determines whether the site parameters are included in the array, and if so, keeps the URL; if not, removes the URL.
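The following is a minimal Python sketch of the two sub-steps above; it is illustrative only, and the names filter_seed_urls, candidate_urls and seed_sites are assumptions rather than identifiers used in the embodiment.

```python
# Minimal sketch of the seed filtering step (assumed names, not from the embodiment).
from bisect import bisect_left
from urllib.parse import urlparse

def filter_seed_urls(candidate_urls, seed_sites):
    # Array generation sub-module: read the seed-site hosts into an array and sort it.
    hosts = sorted(set(seed_sites))
    kept = []
    for url in candidate_urls:
        # Site filtering sub-module: extract the site parameter (host) of the URL.
        host = urlparse(url).netloc
        # Binary-search the sorted array to decide whether the host is a seed site.
        i = bisect_left(hosts, host)
        if i < len(hosts) and hosts[i] == host:
            kept.append(url)      # seed site: keep the URL
        # otherwise the URL is simply not kept, i.e. it is removed
    return kept
```

Sorting the array once lets each site parameter be checked by binary search rather than a linear scan, which is presumably the purpose of the array generation sub-module.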
[0330] Step 502: The experimental crawler module pre-fetches pages that meet the number threshold according to the URL of the seed site;
[0331] Preferably, the value range of the number threshold is 1000 to 5000, and the experimental crawler module may crawl pages through the following steps (sketched in code after this list):
[0332] Write the URL of the seed site into the database;
[0333] Read the URL from the database, and extract site parameters of the URL;
[0334] Update the number of URL crawls corresponding to the site parameters;
[0335] Determine whether the number of URL crawls exceeds the number threshold, and if not, add the URL to the URL crawl list;
[0336] Download the page corresponding to the URL in the URL grab list, and generate a corresponding page data segment;
[0337] The database is updated according to the page data segment.
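As an illustrative reading of the six sub-steps above, the pre-fetch loop might be sketched as follows; the in-memory structures standing in for the database and the download_page() helper are assumptions, not part of the embodiment.

```python
# Sketch of the experimental crawler's pre-fetch loop (assumed helpers and data structures).
from urllib.parse import urlparse

def prefetch(seed_urls, count_threshold=1000, download_page=None):
    url_db = list(seed_urls)      # URLs written to the "database"
    seen = set(seed_urls)
    crawl_counts = {}             # site parameter -> number of crawls recorded
    pages = []                    # page data segments produced so far

    while url_db:
        url = url_db.pop(0)                                   # read a URL from the database
        site = urlparse(url).netloc                           # extract its site parameter
        crawl_counts[site] = crawl_counts.get(site, 0) + 1    # update the crawl count
        if crawl_counts[site] > count_threshold:              # over the number threshold
            continue                                          # do not add it to the crawl list
        body, out_links = download_page(url)                  # download the listed page
        pages.append({"url": url, "content": body})           # generate a page data segment
        for link in out_links:                                # update the database
            if link not in seen:
                seen.add(link)
                url_db.append(link)
    return pages
```

The default count_threshold of 1000 reflects the lower end of the 1000 to 5000 range mentioned above.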
[0338] Step 503: The classification module determines a feature page in the pre-fetched page;
[0339] Preferably, the feature pages may include topic-related pages and topic-independent pages, and the topic-related pages may specifically include topic-related content pages and topic-related catalog pages.
[0340] Step 504: The regular expression learning module generates a regular expression summarizing the URL of the characteristic page;
[0341] Preferably, the regular expression learning module can learn the regular expression of the feature page URL through the following steps:
[0342] The set division sub-module divides the feature page URL into multiple URL subsets;
[0343] The clustering sub-module aggregates the subset of URLs into multiple URL categories;
[0344] The extraction sub-module extracts the regular expression of the URL category.
[0345] Specifically, the URL may include site parameters, path parameters, and query parameters, and the set division sub-module may divide URL subsets through the following steps (a sketch follows this list):
[0346] The first dividing unit divides URLs with the same site parameters into the same URL subset;
[0347] The second dividing unit divides URLs of path parameters with the same number of directories into the same URL subset;
[0348] The third dividing unit divides URLs with the same query parameters into the same URL subset.
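A minimal sketch of these three dividing units, assuming that "the same query parameters" means the same set of query-parameter keys (the text does not say whether the values are compared as well):

```python
# Sketch of the set division sub-module (the query-key interpretation is an assumption).
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl

def divide_into_subsets(urls):
    subsets = defaultdict(list)
    for url in urls:
        parts = urlparse(url)
        directories = [d for d in parts.path.split('/') if d]             # path directories
        query_keys = tuple(sorted(k for k, _ in parse_qsl(parts.query)))  # query keys only
        # URLs sharing the host, the directory count and the query keys fall into one subset.
        key = (parts.netloc, len(directories), query_keys)
        subsets[key].append(url)
    return list(subsets.values())
```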
[0349] In this case, the clustering sub-module can aggregate URL categories through the following steps (a sketch follows this list):
[0350] The rule setting unit presets the clustering rule of the URL category;
[0351] The processing unit reads a URL from the URL subset and judges whether the URL conforms to the clustering rule of a URL category; if so, it assigns the URL to that URL category; if not, it creates a new URL category based on the URL;
[0352] The statistics unit counts the number of URL categories and the total number of URLs;
[0353] The rule adjustment unit adjusts the clustering rule of the URL category according to the statistical result.
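Because the text does not specify the concrete clustering rule, the sketch below assumes a simple one: a URL joins an existing URL category when a sufficient fraction of its path directories agree with the category's representative URL; the rule adjustment unit is likewise illustrated by relaxing that threshold when the category count grows too large relative to the number of URLs seen.

```python
# Sketch of the clustering sub-module; the similarity rule and its adjustment are assumptions.
from urllib.parse import urlparse

def _directories(url):
    return [d for d in urlparse(url).path.split('/') if d]

def cluster_subset(urls, min_similarity=0.8):
    classes = []     # each class holds a representative URL and its member URLs
    total = 0        # total number of URLs processed so far (statistics unit)
    for url in urls:
        total += 1
        dirs = _directories(url)
        placed = False
        for cls in classes:
            rep_dirs = _directories(cls["representative"])
            same = sum(1 for a, b in zip(dirs, rep_dirs) if a == b)
            # Assumed clustering rule: enough directories agree with the representative.
            if dirs == rep_dirs or (rep_dirs and same / len(rep_dirs) >= min_similarity):
                cls["members"].append(url)
                placed = True
                break
        if not placed:
            # The URL fits no existing category: create a new URL category based on it.
            classes.append({"representative": url, "members": [url]})
        # Rule adjustment unit: if categories outnumber half of the URLs seen so far,
        # the rule is probably too strict, so relax the similarity threshold slightly.
        if total >= 10 and len(classes) > total // 2 and min_similarity > 0.5:
            min_similarity -= 0.05
    return classes
```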
[0354] The extraction sub-module then extracts, for each URL category, a URL regular expression that summarizes and represents that category. Specifically, following the data structure of the URL described above, each URL is decomposed into three parts: host, path, and query; the path is further decomposed into a series of directories, and the query into a series of key-value pairs. Since all URLs in a category share the same host, the host is written into the regular expression directly. The directories of the path part are then aligned position by position: if the directories at a given position are identical across the URLs, that value is added to the regular expression; otherwise a "*" wildcard is added in its place. The query part is handled in the same way as the path part. In this way, one regular expression is obtained for each URL category.
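A hedged sketch of this extraction step, assuming the URLs of one category already share the same host, the same number of directories and the same query keys (which the set division step is meant to guarantee), and using the regular-expression fragment [^/?&]+ as the concrete form of the "*" wildcard:

```python
# Sketch of the extraction sub-module; the wildcard form and scheme prefix are assumptions.
import re
from urllib.parse import urlparse, parse_qsl

def extract_regex(urls):
    parsed = [urlparse(u) for u in urls]
    host = parsed[0].netloc                                   # identical across the category
    dir_lists = [[d for d in p.path.split('/') if d] for p in parsed]
    query_lists = [parse_qsl(p.query) for p in parsed]

    def align(rows):
        # Compare the values column by column across all URLs of the category.
        parts = []
        for column in zip(*rows):
            if len(set(column)) == 1:         # same value at this position: keep it
                parts.append(re.escape(column[0]))
            else:                             # values differ: use a wildcard instead
                parts.append(r"[^/?&]+")
        return parts

    pattern = "https?://" + re.escape(host)   # write the host directly into the expression
    if dir_lists[0]:
        pattern += "/" + "/".join(align(dir_lists))           # align the path directories
    if query_lists[0]:
        kv_rows = [["%s=%s" % kv for kv in q] for q in query_lists]
        pattern += r"\?" + "&".join(align(kv_rows))           # treat the query like the path
    return pattern + "$"
```

For example, given two hypothetical URLs http://site.example/member/view.asp?add=1&tid=203839 and http://site.example/member/view.asp?add=1&tid=100, this sketch would produce a pattern along the lines of https?://site\.example/member/view\.asp\?add=1&tid=[^/?&]+$.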
[0355] Step 505: The matching module matches the seed site URL with the regular expression, and retains the seed site URL that meets the matching condition as the crawling target URL;
[0356] In this embodiment, for a topic-related page, if the seed site URL matches the regular expression of the topic-related page, the seed site URL meets the matching condition; for a topic-independent page, if the seed site URL matches the regular expression of the topic-independent page, the seed site URL does not meet the matching condition.
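A minimal sketch of one natural reading of this rule, in which a seed site URL is kept as a crawling target only if it matches some topic-related expression and none of the topic-independent ones; the pattern lists and names are assumptions:

```python
# Sketch of the matching module; the combined keep/drop rule is an interpretation.
import re

def select_crawl_targets(seed_urls, related_patterns, unrelated_patterns):
    related = [re.compile(p) for p in related_patterns]
    unrelated = [re.compile(p) for p in unrelated_patterns]
    targets = []
    for url in seed_urls:
        # Keep the URL only if a topic-related pattern matches and no
        # topic-independent pattern matches.
        if any(r.match(url) for r in related) and not any(u.match(url) for u in unrelated):
            targets.append(url)
    return targets
```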
[0357] Step 506: The focused crawler module crawls the page according to the crawling target URL.
[0358] Preferably, this embodiment may further include the following steps (sketched in code below):
[0359] The quantity statistics module counts the number of URLs matched by the regular expression;
[0360] The regular expression filtering module deletes the regular expression when the number of URLs is less than a preset filtering threshold.
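A small sketch of these two steps, assuming the pre-fetched URLs are available as a list and the filtering threshold is a plain integer:

```python
# Sketch of the quantity statistics and regular expression filtering steps (assumed names).
import re

def filter_patterns(patterns, urls, filter_threshold=5):
    kept = []
    for pattern in patterns:
        compiled = re.compile(pattern)
        # Quantity statistics: count how many URLs this regular expression matches.
        match_count = sum(1 for url in urls if compiled.match(url))
        # Regular expression filtering: drop patterns below the preset threshold.
        if match_count >= filter_threshold:
            kept.append(pattern)
    return kept
```

The default threshold of 5 is arbitrary here; the embodiment only requires that the filtering threshold be preset.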
[0361] Since the device embodiment basically corresponds to the method embodiment, its description is relatively brief; for related details, please refer to the description of the method embodiment.
[0362] The method for directionally grabbing page resources and the apparatus for directionally grabbing page resources provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the present invention; the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.