Method and apparatus for quickly searching for contents required to be queried

A fast content-search technology, applied in the field of search engines, that addresses the problems of widely scattered data sources, limited search features, and heavily duplicated content, and achieves efficient and accurate queries, high matching efficiency, and a good user experience

Inactive Publication Date: 2017-01-04
GUANGDONG IDATATECH CO LTD
8 Cites 9 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0003] However, general crawler search has the following shortcomings: because the goal of crawling is to cover as much of the network as possible, the crawl results inevitably contain a large number of web pages that users do not need; and for data with a certain structure, general search engines rely mostly on keyword retrieval, which makes it difficult to support queries on semantic information or an intelligent indexing engine.
The s...

Method used

In short, in this embodiment the data model enables fast retrieval. With a Chinese index, users can find a target using fuzzy terms, and high-precision word segmentation together with combined queries makes querying efficient and accurate. Combined with graph data, this gives users a better experience: the attribute values of nodes in the graph structure are retrieved by the search engine, while nodes and relationships are retrieved through the graph structure, so matching efficiency is high.
[0057] Step S06: Extract the data stored in each node of the graph structure and build a Chinese index. With the index, users can search for a target using fuzzy terms. High-precision...

Abstract

The invention discloses a method and an apparatus for quickly searching for content to be queried. The method comprises the steps of: acquiring various data from the Internet and storing the data in association with the corresponding nodes of a graph structure in a graph database; converting unstructured data into structured data that can be analyzed and applied; cleaning the data and building a unified data model; establishing a data warehouse with HBase and loading the cleaned data into the data warehouse; associating scattered data through company names, abbreviations, or stock codes, and storing the scattered data in the corresponding nodes according to the node-and-relationship pattern of the graph structure; extracting the data stored in each node from the graph structure and establishing a Chinese index; and inputting a statement to be queried, searching for related graph structures with a traversal algorithm, and ranking the graph structures found by relevance. With the method and the apparatus, retrieval is fast, queries are efficient and accurate, users get a relatively good experience, and matching efficiency is relatively high.


Examples

  • Experimental program(1)

Example Embodiment

[0047] The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
[0048] In this embodiment of the method and device for quickly searching for content to be queried, the flowchart of the method is shown in Figure 1. As shown in Figure 1, the method includes the following steps:
[0049] Step S01: Use a web crawler system to collect various data from the Internet, and store the collected data in association with the corresponding nodes of the graph structure in the graph database. The web crawler system is a high-performance web crawler system. The collected data is physically stored in the Hadoop distributed file system for subsequent processing, providing a solid basis for data processing, suspect analysis, and audit and evidence collection.
[0050] In this embodiment, the graph structure includes a number of nodes, and related nodes are connected by directed connecting lines. Figure 4 is a schematic diagram of the directed acyclic graph structure in this embodiment. Each circle in the figure represents a node, and a node represents an entity such as a person or a commodity. An edge represents the relationship between two nodes and can be directional or non-directional: if user A buys product B, this is expressed as A -> B; if user A and user C know each other, the relationship is two-way and is expressed as A <-> C.
[0051] A graph database can be regarded as a collection of nodes and relationships. The graph database stores the collected data in nodes with attributes and uses relationships to organize these nodes, as shown in Figure 5.
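The node-and-relationship model described in [0050] and [0051] can be illustrated with a minimal in-memory sketch. The `Graph` class, its method names, and the `kind` attribute are hypothetical names chosen for illustration; the patent does not name a specific graph database or API.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes carry attributes, edges are directed."""

    def __init__(self):
        self.nodes = {}                      # node id -> attribute dict
        self.out_edges = defaultdict(set)    # node id -> ids it points to

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst, bidirectional=False):
        # A two-way relationship (A <-> C) is stored as two directed edges.
        self.out_edges[src].add(dst)
        if bidirectional:
            self.out_edges[dst].add(src)

    def neighbors(self, node_id):
        return sorted(self.out_edges.get(node_id, ()))


g = Graph()
g.add_node("A", kind="user")
g.add_node("B", kind="product")
g.add_node("C", kind="user")
g.add_edge("A", "B")                      # user A buys product B: A -> B
g.add_edge("A", "C", bidirectional=True)  # A and C know each other: A <-> C

print(g.neighbors("A"))  # nodes reachable from A in one step
print(g.neighbors("B"))  # B has no outgoing edges
```

Storing a two-way relationship as a pair of directed edges keeps traversal code uniform, at the cost of one extra entry per such relationship.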
[0052] Graph databases are good at searching association relationships but are less efficient at attribute search, especially Chinese text matching. The present invention therefore combines the graph database with a search engine: attribute values in the graph database are retrieved by the search engine, while node relationships are retrieved through the graph structure. In some cases of this embodiment, open web page resources on the World Wide Web are extracted by the web crawler and the information is put to use, which greatly reduces manpower and material costs. Data is obtained automatically by selecting an appropriate web page crawling strategy and applying web page analysis and topology analysis algorithms.
[0053] Step S02: Convert the unstructured data in the collected data into structured data that can be analyzed and applied. After the collected data is saved, it is processed uniformly, and its unstructured portion is transformed into structured data that can be analyzed and applied.
[0054] Step S03: Clean the structured data and establish a unified data model according to business field and source. Cleaning removes noise and provides a quality data source for subsequent analysis. Based on the business fields and data sources of the cleaned data, the system analyzes the data and establishes a unified data model.
[0055] Step S04: Use the HBase database to establish a data warehouse according to the data model, and load the scattered data into the data warehouse by extracting, transforming, and loading the cleaned data. In this step, the HBase database is used to create a new data warehouse, NDW, according to the data model, and all data resources are integrated on this basis. Extracting, transforming, and loading the cleaned data brings the scattered data into the data warehouse NDW.
[0056] Step S05: Associate the scattered data by company name, abbreviation, or stock code, and store it in the corresponding nodes according to the node-and-relationship pattern of the graph structure. In this step the scattered data is processed again and linked by company name, abbreviation, or stock code. This process uses the graph database: all scattered data is stored in the corresponding nodes according to the node-and-relationship pattern. An important purpose of storing the data this way is subsequent retrieval. The data model allows the relationships between nodes to be found, and the multi-layer relationships between nodes to be displayed, quickly.
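One way to picture the association in step S05 is an alias table that maps every known name, abbreviation, and stock code to a single node key. This is a sketch under stated assumptions, not the patent's implementation: the field names (`name`, `abbr`, `code`, `company`, `fact`), the choice of the stock code as node key, and the sample company are all hypothetical.

```python
def build_alias_index(companies):
    """Map each company name, abbreviation, and stock code to one node key."""
    index = {}
    for c in companies:
        for alias in (c["name"], c["abbr"], c["code"]):
            index[alias.strip().lower()] = c["code"]  # stock code used as key
    return index


def associate(records, alias_index):
    """Group scattered records under the node they refer to."""
    nodes = {}
    for r in records:
        key = alias_index.get(r["company"].strip().lower())
        if key is None:
            continue  # unknown company: left unassociated in this sketch
        nodes.setdefault(key, []).append(r["fact"])
    return nodes


# Hypothetical reference data and scattered records from different sources.
companies = [{"name": "Example Tech Co Ltd", "abbr": "ExTech", "code": "000001"}]
records = [
    {"company": "Example Tech Co Ltd", "fact": "revenue up"},
    {"company": "extech", "fact": "new plant"},
    {"company": "000001", "fact": "ceo change"},
    {"company": "Unknown Corp", "fact": "ignored"},
]

nodes = associate(records, build_alias_index(companies))
print(nodes)
```

Records that mention the company by full name, abbreviation, or stock code all end up on the same node, which is the effect the step describes.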
[0057] Step S06: Extract the data stored in each node of the graph structure and establish a Chinese index. With the index, users can search for a target using fuzzy terms. High-precision word segmentation and combined queries make querying efficient and accurate, and combining the index with graph data gives users an excellent experience. When a node or relationship is searched for by attribute, the Chinese index is used.
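The idea of a Chinese index over node data can be sketched with an inverted index. Note the substitution: the patent refers to high-precision word segmentation, whereas this sketch uses overlapping character bigrams, a simpler stand-in that needs no segmenter. The function names, node ids, and sample texts are hypothetical.

```python
from collections import defaultdict


def bigrams(text):
    """Overlapping two-character grams; a stand-in for real word segmentation."""
    text = "".join(text.split())
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_index(node_texts):
    """Inverted index: gram -> set of node ids whose text contains it."""
    index = defaultdict(set)
    for node_id, text in node_texts.items():
        for gram in bigrams(text):
            index[gram].add(node_id)
    return index


def search(index, query):
    """Return node ids ordered by number of grams shared with the query."""
    hits = defaultdict(int)
    for gram in bigrams(query):
        for node_id in index.get(gram, ()):
            hits[node_id] += 1
    return sorted(hits, key=lambda n: (-hits[n], n))


# Hypothetical attribute text stored on three nodes.
node_texts = {
    "n1": "广东数据科技有限公司",
    "n2": "北京数据服务公司",
    "n3": "上海纺织集团",
}
index = build_index(node_texts)
print(search(index, "数据科技"))  # partial name still finds matching nodes
```

Because matching is by shared grams, a partial or imprecise query still surfaces candidate nodes, with closer matches ranked first.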
[0058] Step S07: The user enters the sentence to be queried, a traversal algorithm searches for related graph structures, and the graph structures found are ranked by relevance. Specifically, searching the graph structure is done by a traversal algorithm, which moves from a starting node to connected nodes to answer questions such as "Who are my friends' friends?". The graph structure can thus be navigated by the traversal algorithm to determine the paths between nodes, as shown in Figure 6.
[0059] As shown in Figure 7, establishing a Chinese index allows a given node to be found faster and more efficiently. Often one only wants to find a particular node or relationship by its attributes rather than traverse the entire graph structure. In that case, the node can be found through the Chinese index, for example "locating a user node by user name". The method for quickly searching for content to be queried in the present invention retrieves quickly, queries efficiently and accurately, provides a better user experience, and has higher matching efficiency.
[0060] In this embodiment, step S07 above can be further refined; the refined flowchart is shown in Figure 2. As shown in Figure 2, step S07 further includes:
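The "friends of friends" question in step S07 is a two-hop traversal. The sketch below uses breadth-first search over an adjacency dictionary; the patent does not say which traversal algorithm is used, so BFS is an assumption, and the function names and sample data are hypothetical.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Breadth-first traversal: shortest hop count to each node within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def friends_of_friends(adjacency, me):
    """Nodes exactly two hops away, excluding direct friends and the start."""
    dist = within_hops(adjacency, me, 2)
    return sorted(n for n, d in dist.items() if d == 2)


# Hypothetical friendship graph as an adjacency dictionary.
adjacency = {
    "me": ["ann", "bob"],
    "ann": ["me", "cat"],
    "bob": ["me", "dan", "ann"],
    "cat": ["ann", "eve"],
}
print(friends_of_friends(adjacency, "me"))
```

The same traversal, run with a larger hop limit, yields the multi-layer relationships between nodes mentioned in step S05.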
[0061] Step S71: Compose the weights of the words in the sentence to be queried into a query vector, and compose the weights of the corresponding words in the data stored in the nodes of the graph structures found into document vectors.
[0062] Specifically, the sentence to be queried is itself regarded as a document, and the relevance between documents is scored: the higher the score, the more relevant the document and the higher its ranking. The scoring can also be influenced manually; Baidu search, for example, may not rank purely by relevance. A document is composed of one or more words (represented in this example by Term), such as "solr" and "tutorial". Different words may differ in importance; for example, "solr" is more important than "tutorial". If one document contains "tutorial" 10 times but "solr" only once, while another contains "solr" 4 times and "tutorial" once, the latter is more likely to be the desired result. This leads to the concept of weight (represented in this example by Term weight).
[0063] The weight indicates the importance of a word in a document. The more important the word, the higher its weight and the greater its influence when document relevance is calculated. Obtaining document relevance from the weights of words is called the Vector Space Model algorithm. Two main factors affect the importance of a word in a document: Term Frequency (abbreviated as tf) and Document Frequency (abbreviated as df). Term Frequency is how often the Term appears in this document; the larger tf is, the more important the Term. Document Frequency is the number of documents in which the Term appears; the larger df is, the less important the Term. The weight formula is as follows:
[0064] $W_{t,d} = tf_{t,d} \times \log(n / df_t)$
[0065] where $W_{t,d}$ is the weight of term $t$ in document $d$, $tf_{t,d}$ is the frequency of term $t$ in document $d$, $n$ is the total number of documents, and $df_t$ is the number of documents containing term $t$.
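The weight formula can be checked on the "solr" / "tutorial" example from [0062]. This is a minimal sketch: the function name `tf_idf_weights` and the two extra filler documents are hypothetical, and the natural logarithm is an assumption because the source does not state the base.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Compute W[t, d] = tf[t, d] * log(n / df[t]) for every term in every doc."""
    n = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))          # each document counts once per term
    weights = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        # Natural log is an assumption; the source does not give the base.
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


# doc1 and doc2 follow the example in the text; doc3 and doc4 are filler
# so that the two terms have different document frequencies.
docs = {
    "doc1": ["tutorial"] * 10 + ["solr"],
    "doc2": ["solr"] * 4 + ["tutorial"],
    "doc3": ["tutorial", "java"],
    "doc4": ["java", "search"],
}
weights = tf_idf_weights(docs)
for doc_id in ("doc1", "doc2"):
    print(doc_id, {t: round(w, 3) for t, w in weights[doc_id].items()})
```

With only the two original documents, both terms would appear in every document and every weight would be $\log(1) = 0$, which is why the sketch adds filler documents.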
[0066] In this embodiment, the weights of the words in a document are regarded as a vector. Document = {term1, term2, ..., termN}, where Document is the document and term1, term2, ..., termN are the words in the document; Document Vector = {weight1, weight2, ..., weightN}, where Document Vector is the document vector of a document found by the search and weight1, weight2, ..., weightN are the weights of the words in the document vector.
[0067] The sentence to be queried is treated as a simple document and is likewise represented by a vector: Query = {term11, term12, ..., term1N}, where Query is the sentence to be queried and term11, term12, ..., term1N are the words in the query sentence; Query Vector = {weight11, weight12, ..., weight1N}, where Query Vector is the query vector and weight11, weight12, ..., weight1N are the weights of the words in the query vector.
[0068] Step S72: Put each document vector and the query vector into an N-dimensional space in which each word represents one dimension, as shown in Figure 8. N is equal to the number of words in the document vector or query vector.
[0069] Step S73: Calculate the angle between each document vector and the query vector, and arrange the documents in ascending order of angle. The smaller the angle, the more similar the vectors and the greater the relevance.
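Steps S72 and S73 amount to ranking by the angle between weight vectors. The sketch below represents each vector as a dictionary from term to weight, so each distinct term is one dimension; the function names and the sample weights are hypothetical, and treating a zero vector as orthogonal is a choice made here, not something the source specifies.

```python
import math


def angle(u, v):
    """Angle in radians between two sparse weight vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return math.pi / 2  # assumption: a zero vector is treated as unrelated
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding error
    return math.acos(cos)


def rank(query_vec, doc_vecs):
    """Document ids in ascending order of angle to the query vector."""
    return sorted(doc_vecs, key=lambda d: angle(query_vec, doc_vecs[d]))


# Hypothetical weights: the query cares only about "solr".
query_vec = {"solr": 1.0}
doc_vecs = {
    "doc1": {"solr": 1.0, "tutorial": 10.0},
    "doc2": {"solr": 4.0, "tutorial": 1.0},
}
print(rank(query_vec, doc_vecs))  # smallest angle first
```

Sorting by ascending angle is equivalent to sorting by descending cosine similarity, since the cosine decreases monotonically on the range of angles between 0 and pi.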
[0070] This embodiment also relates to a device for implementing the above method for quickly searching for content to be queried; its structure is shown in Figure 3. As shown in Figure 3, the device includes a data acquisition and storage unit 1, a data conversion unit 2, a data model establishment unit 3, a data loading unit 4, an association storage unit 5, an index establishment unit 6, and a search and arrangement unit 7. The data acquisition and storage unit 1 is used to collect various data from the Internet with a web crawler system and to store the collected data in association with the corresponding nodes of the graph structure in the graph database. The graph structure includes several nodes, and related nodes are connected by directed connecting lines; a graph database is a collection of nodes and relationships.
[0071] In this embodiment, the data conversion unit 2 is used to convert the unstructured data in the collected data into structured data that can be analyzed and applied. The data model establishment unit 3 is used to clean the structured data and establish a unified data model according to business field and source. The data loading unit 4 is used to build a data warehouse with the HBase database according to the data model, and to load the scattered data into the data warehouse by extracting, transforming, and loading the cleaned data. The association storage unit 5 is used to associate the scattered data by company name, abbreviation, or stock code, and to store the scattered data in the corresponding nodes according to the node-and-relationship pattern of the graph structure. The index establishment unit 6 is used to extract the data stored in each node of the graph structure and establish a Chinese index; when a node or relationship is searched for by attribute, the Chinese index is used. The search and arrangement unit 7 is used to let the user input the sentence to be queried, search for related graph structures with a traversal algorithm, and rank the graph structures found by relevance. The device of the present invention retrieves quickly, queries efficiently and accurately, provides a better user experience, and has higher matching efficiency.
[0072] In this embodiment, the search and arrangement unit 7 further includes a vector composition module 71, a vector dimension module 72, and a vector angle calculation and arrangement module 73. The vector composition module 71 is used to compose the weights of the words in the sentence to be queried into a query vector, and to compose the weights of the corresponding words in the data stored in the nodes of the graph structures found into document vectors. The vector dimension module 72 is used to put each document vector and the query vector into an N-dimensional space in which each word represents one dimension; N is equal to the number of words in the document vector or query vector. The vector angle calculation and arrangement module 73 is used to calculate the angle between each document vector and the query vector and to arrange the documents in ascending order of angle.
[0073] In short, in this embodiment the data model enables fast retrieval. With a Chinese index, users can search for a target using fuzzy terms, and high-precision word segmentation together with combined queries makes querying efficient and accurate. Combining the index with graph data gives users a better experience: the attribute values of nodes in the graph structure are retrieved by the search engine, while nodes and relationships are retrieved through the graph structure, which yields higher matching efficiency.
[0074] The above descriptions are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.
