Character search system and method based on search engines

A search-engine-based person retrieval technology, applied in the Internet and search field. It addresses two problems: name-related information on a web page cannot be extracted when it lies outside the body text, and clustering quality depends heavily on the amount of extracted information and on a manually specified threshold. Its effect is to resolve the ambiguity of personal names in search results.

Active Publication Date: 2018-04-13
HARBIN INST OF TECH AT WEIHAI
8 Cites, 3 Cited by

AI Technical Summary

Problems solved by technology

The prior patent extracts the body text of each web page directly and applies word segmentation and part-of-speech tagging to form a document. However, the web pages returned by search engines today are of many types and have varied structures, and the sidebars and multi-level headings of a page often contain most of the information about the searched name.
The prior patent's method cannot extract name-related information from the non-body parts of the page, which seriously degrades clustering. Its clustering algorithm also relies on person-field information extracted from the text, so the amount of extracted information strongly affects the clustering result. In addition, the clustering threshold must be specified manually, which makes the result dependent on manual intervention.

Examples

Embodiment 1

[0057] A person retrieval system based on a search engine, as shown in Figure 1, includes a data acquisition module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence.

[0058] The name to be retrieved is input, and the data acquisition module uses a distributed crawling system based on Scrapy-redis to crawl the web page information that multiple search engines return for that name, forming a web page set. The web page information consists of the web pages returned by the search engines for the name; each entry includes a title, a url, a summary (content), and the complete web page.
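The web page set described in [0058] can be pictured as a list of records with the four stated fields. The following is a minimal sketch only: the names `PageRecord` and `build_page_set` are hypothetical, and keeping the first record seen per url is an assumption, since the text does not say how results that several engines share are merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """One search result: the four fields named in [0058]."""
    title: str
    url: str
    summary: str
    html: str = ""  # complete web page, filled in after download


def build_page_set(results_by_engine):
    """Merge per-engine result lists into one web page set.

    Assumption (not stated in the source): a url returned by several
    engines is kept once, using the first record encountered.
    """
    seen = {}
    for records in results_by_engine.values():
        for rec in records:
            seen.setdefault(rec.url, rec)
    return list(seen.values())
```

In a Scrapy-redis deployment the per-engine lists would come from the distributed spiders; here they are plain dictionaries so the merge step can be read in isolation.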

[0059] First, the crawler engine crawls the url in each result that the different search engines return for the name, and then uses the web page download tool httrack to download the complete web page at that url. After observation, it is found that only the first 10 pages of the information returned by the se...

Embodiment 2

[0071] A search-engine-based person retrieval system according to Embodiment 1, differing in that:

[0072] The data preprocessing module includes a data cleaning module, a webpage segmentation module, and a person-related visual block extraction module connected in sequence. The data acquisition module is connected to the data cleaning module, and the person-related visual block extraction module is connected to the feature extraction module.

[0073] The data cleaning module uses a named entity recognizer to check whether each web page crawled by the crawler system contains the retrieved name. If a web page does not contain the retrieved name, or the number of names different from the retrieved name exceeds 5, the web page is marked as unrelated to the person's name; otherwise, it is marked as related to the person's name.
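The cleaning rule in [0073] reduces to a two-part test on the names found in a page. The sketch below assumes the named entity recognizer has already produced a list of person names for the page; the function name `is_related_page` is hypothetical, and counting distinct other names (rather than occurrences) is an assumption, because the text does not say which is meant.

```python
def is_related_page(retrieved_name, person_names, max_other_names=5):
    """Apply the cleaning rule from [0073].

    person_names: person names an NER tool found on the page (assumed input).
    Returns False when the retrieved name is absent, or when more than
    max_other_names distinct other names appear (assumed to mean distinct).
    """
    if retrieved_name not in person_names:
        return False
    other_names = {n for n in person_names if n != retrieved_name}
    return len(other_names) <= max_other_names
```

A page that mentions the retrieved name alongside a long roster of other people, such as a directory or a list of competitors, is therefore dropped before segmentation.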

[0074] The webpage segmentation module performs visual block p...

Embodiment 3

[0077] A search-engine-based person retrieval system according to Embodiment 1 or 2, differing in that:

[0078] The feature extraction module includes a person-related attribute extraction module, a person relationship extraction module, and a text vectorization module. The data preprocessing module is connected to each of these three modules. The person-related attribute extraction module uses rule matching and template matching to extract 20-dimensional person attributes from each web page. The 20 attributes are: place of birth, occupation name, graduate school, date of birth, ethnicity, gender, work name, personal experience, political affiliation, education, religious belief, height, weight, email, marital status, nationality, achievements, blood type, hobbies, and telephone.
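As an illustration of rule-based attribute extraction, the sketch below covers three of the 20 attributes. It assumes the rules are regular expressions applied to English text; the patterns, the `RULES` table, and the `extract_attributes` function are all hypothetical and are not the patent's actual rules or templates.

```python
import re

# Hypothetical rules for three of the 20 attributes (illustrative only).
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date_of_birth": re.compile(r"[Bb]orn (?:on |in )?(\d{4}(?:-\d{2}-\d{2})?)"),
    "height": re.compile(r"[Hh]eight:?\s*(\d{2,3})\s*cm"),
}


def extract_attributes(text):
    """Return the first match of each rule found in the text."""
    attrs = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            # Use the capture group when the rule defines one,
            # otherwise the whole match.
            attrs[name] = match.group(1) if pattern.groups else match.group(0)
    return attrs


# Made-up sample text, not data from the patent.
sample = "Zhang San, born on 1980-01-01, height: 175 cm. Contact: zhangsan@example.com."
print(extract_attributes(sample))
```

Attributes with no matching rule are simply absent from the result, which leaves the corresponding dimensions empty for that page.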

[0079] Rule matching is to use regula...

Abstract

The invention relates to a person search system and method based on search engines. The system comprises a data collection module, a data preprocessing module, a feature extraction module, and a clustering module connected in sequence. The data collection module crawls the web page information that the search engines return for a searched name. The data preprocessing module filters out web pages irrelevant to the name, segments the remaining pages into visual blocks, and filters out the blocks irrelevant to the searched name. The feature extraction module extracts attributes and entities related to the searched person, counts word frequencies in the visual blocks, constructs a vector representation of each web page, and appropriately increases the values of the dimensions that correspond to the extracted features in the vector space. The clustering module takes the vector representation of each web page as input, clusters the web page texts, and outputs a list of web page class labels. The system and method effectively address the ambiguity of names and the disorder of information in the web pages returned for a person search. Person digests are constructed from the extracted person attributes and person relationships, which makes name searches more convenient for the user.
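The vectorization step in the abstract, word-frequency counts with extra weight on feature dimensions, can be sketched as follows. The boost factor of 2.0 and the names `page_vector` and `cosine` are assumptions; the abstract says only that the values are "properly" increased and does not name the similarity measure that the clustering module uses.

```python
import math
from collections import Counter


def page_vector(tokens, feature_terms, boost=2.0):
    """Word-frequency vector with boosted feature dimensions.

    tokens: words taken from the person-related visual blocks.
    feature_terms: extracted attributes and entities (assumed to be words).
    boost: hypothetical multiplier; the source gives no value.
    """
    counts = Counter(tokens)
    return {t: c * (boost if t in feature_terms else 1.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Boosting makes two pages that share an extracted attribute, such as the same occupation, look more similar than pages that only share common words, which is the stated purpose of raising those dimensions before clustering.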

Description

Technical field

[0001] The invention relates to a search-engine-based person retrieval system and method, and belongs to the technical field of Internet and search.

Background technique

[0002] At present, the main difficulty of person retrieval is that the web pages returned by a personal name search suffer from name ambiguity and information clutter. Name disambiguation means distinguishing multiple individuals who share the same name. The ubiquity of ambiguous personal names causes considerable inconvenience for information dissemination and resource acquisition. The results that mainstream search engines provide for a personal name are often a mixture of web pages about every person with that name together with irrelevant web pages. These web pages are sorted according to certain rules, and person information with a high degree of accuracy is more likely to be ranked near the top. For example, for "Li Na" in the Baidu search engine, among the search...

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/3347, G06F16/355, G06F16/951, G06F40/295
Inventors: 周奇, 刘扬, 王佰玲, 辛国栋, 孙云霄, 王巍
Owner HARBIN INST OF TECH AT WEIHAI