Person search system and method based on search engines
A search engine and retrieval system technology, applied in the Internet and search field. It addresses problems such as a degraded clustering effect and the inability to extract information related to person names from webpages, and achieves the effect of resolving name ambiguity.
Examples
Embodiment 1
[0057] A person retrieval system based on a search engine, as shown in Figure 1, includes a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence;
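The sequential connection of the four modules can be pictured as a chain of functions, each consuming the output of the previous one. The following is a minimal sketch only: the function names, the dictionary passed between stages, and the sample name are hypothetical placeholders, not the patent's implementation.

```python
from typing import Callable, Dict, Sequence

# A stage takes the previous stage's output and returns its own.
Stage = Callable[[object], object]


def run_pipeline(query_name: str, stages: Sequence[Stage]) -> object:
    """Feed the output of each stage into the next, in order."""
    data: object = query_name
    for stage in stages:
        data = stage(data)
    return data


# Toy stand-ins (hypothetical) that only record the order in which they ran.
def acquire(name: str) -> Dict:
    return {"name": name, "trace": ["acquire"]}


def preprocess(d: Dict) -> Dict:
    return {**d, "trace": d["trace"] + ["preprocess"]}


def extract_features(d: Dict) -> Dict:
    return {**d, "trace": d["trace"] + ["extract_features"]}


def cluster(d: Dict) -> Dict:
    return {**d, "trace": d["trace"] + ["cluster"]}


result = run_pipeline("Zhang Wei", [acquire, preprocess, extract_features, cluster])
print(result["trace"])
```

Keeping each module behind a single function boundary mirrors the "connected in sequence" arrangement and lets one stage be swapped out without touching the others.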
[0058] The name to be retrieved is input, and the data acquisition module uses a distributed crawling system based on Scrapy-redis to crawl the webpage information that multiple search engines return for that name, forming a webpage set. The webpage information refers to the webpages returned by the search engines for the name query; each webpage includes a title (title), a URL (url), a summary (content), and the complete webpage;
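The webpage record described above can be sketched as a small data structure. This is an illustrative assumption rather than the system's actual schema: the class name `WebPage`, the helper `build_page_set`, and the rule of keeping each URL only once are hypothetical, since the source does not state how duplicate results from different engines are handled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class WebPage:
    title: str
    url: str
    summary: str    # the snippet the source calls "content"
    html: str = ""  # complete webpage, filled in once it is downloaded


def build_page_set(results_per_engine: Dict[str, Iterable[WebPage]]) -> List[WebPage]:
    """Merge the results of several search engines into one webpage set.

    Assumption: a URL returned by more than one engine is kept once,
    in first-seen order.
    """
    seen = set()
    merged: List[WebPage] = []
    for pages in results_per_engine.values():
        for page in pages:
            if page.url not in seen:
                seen.add(page.url)
                merged.append(page)
    return merged
```

In a Scrapy-redis deployment the crawler workers would produce these records; here they are plain in-memory objects so the merging step can be read in isolation.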
[0059] First, the crawler engine collects the URL in each result that the different search engines return for the name query, and then uses the webpage download tool HTTrack to download the complete webpage at each URL. After observation, it is found that only the first 10 pages of the information returned by the se...
Embodiment 2
[0071] The search engine-based person retrieval system according to Embodiment 1, with the following differences:
[0072] The data preprocessing module includes a data cleaning module, a webpage segmentation module, and a person-related visual block extraction module connected in sequence. The data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module;
[0073] The data cleaning module uses a named entity recognizer to identify whether each webpage crawled by the crawler system contains the retrieved name. If a webpage does not contain the retrieved name, or the number of names different from the retrieved name exceeds 5, the webpage is directly marked as unrelated to the person name; otherwise, the webpage is marked as related to the person name;
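The cleaning rule above can be expressed as a short predicate. This is a sketch under stated assumptions: the person names on each page are taken as already produced by some named entity recognizer (not shown), the function name `is_related` is hypothetical, and "number of names different from the retrieved name" is interpreted here as the number of distinct other names.

```python
from typing import Iterable

MAX_OTHER_NAMES = 5  # threshold stated in the description


def is_related(query_name: str, names_on_page: Iterable[str]) -> bool:
    """Return True if the page should be kept as related to query_name.

    names_on_page: person names a named entity recognizer found on the page.
    Assumption: repeated mentions of the same name are counted once.
    """
    names = set(names_on_page)
    if query_name not in names:
        return False  # the retrieved name does not appear at all
    other_names = names - {query_name}
    return len(other_names) <= MAX_OTHER_NAMES
```

For example, a page mentioning the retrieved name alongside six other people would be marked as unrelated, while a page with five other names would be kept.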
[0074] The web page block module performs visual block p...
Embodiment 3
[0077] The search engine-based person retrieval system according to Embodiment 1 or 2, with the following differences:
[0078] The feature extraction module includes a person-related attribute extraction module, a person relationship extraction module, and a text vectorization module. The data preprocessing module is connected to each of the person-related attribute extraction module, the person relationship extraction module, and the text vectorization module. The person-related attribute extraction module uses rule matching and template matching to extract 20-dimensional person attributes from each webpage. The 20-dimensional person attributes are: place of birth, occupation, graduate school, date of birth, ethnicity, gender, names of works, personal experience, political affiliation, education, religious belief, height, weight, email, marital status, nationality, achievements, blood type, hobbies, and telephone.
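Rule-based extraction of a few of these attributes might look like the following. It is a minimal sketch, not the patent's rule set: the three regular expressions, the English trigger words, and the name `extract_attributes` are hypothetical, and only 3 of the 20 attributes are covered.

```python
import re
from typing import Dict, Optional

# Hypothetical rules for three of the 20 attributes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "height": re.compile(r"\bheight[:\s]+(\d{2,3}\s?cm)", re.IGNORECASE),
    "date_of_birth": re.compile(
        r"\bborn\s+(?:on\s+)?(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
}


def extract_attributes(text: str) -> Dict[str, Optional[str]]:
    """Return one value per attribute, or None when no rule matches."""
    found: Dict[str, Optional[str]] = {}
    for attr, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            found[attr] = None
        else:
            # Use the capture group when the rule defines one,
            # otherwise the whole match.
            found[attr] = match.group(1) if pattern.groups else match.group(0)
    return found


sample = "Li Na, born on 1982-02-26. Height: 172 cm. Contact: li.na@example.com."
attributes = extract_attributes(sample)
```

Attributes with no matching rule stay `None`, so every page yields a fixed-length record that later stages can compare dimension by dimension.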
[0079] Rule matching is to use regula...