Method and apparatus for removing html tag from search engine

A search engine and HTML tag technology, applied in the field of network search, that addresses the narrow applicability, transformer failures, and dependence on frequently updated software of existing solutions, and achieves strong versatility.

Status: Inactive
Publication Date: 2016-08-31
LETV HLDG BEIJING CO LTD +1
AI Technical Summary

Problems solved by technology

However, the official release of Solr is updated frequently, with a new version every one or two months.
Releases are sometimes unstable, and the transformer fails in some versions.
In addition, this solution applies only to Solr, so its applicability is too narrow; many other search engines still have the above problem when searching.


Examples

Embodiment Construction

[0028] In view of the problems in the prior art, regular expressions can be used to remove HTML tags. The key question is where to do so. Broadly, there are two options: at index-building time or at search time, and index-building time is the more efficient of the two. The system at the time used the Solr search engine, which has a regular-expression filtering function, but that step can only run after word segmentation. By then, semantic word segmentation has already split the HTML tags into pieces, so regular expressions can no longer match them. None of the existing solutions in these search engines solves this problem. Chinese text is more complicated to segment, and software developed abroad generally assumes that a tokenizer splits words on spaces, which does not apply to Chinese.
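The ordering problem described in [0028] can be illustrated with a minimal Python sketch. It is not the patent's implementation: `toy_tokenize` is a hypothetical stand-in for a semantic tokenizer, and `TAG_RE` is a deliberately naive tag pattern chosen only for illustration.

```python
import re

# Naive tag pattern (an assumption for illustration, not a full HTML parser).
TAG_RE = re.compile(r"<[^>]+>")


def toy_tokenize(text):
    """Hypothetical stand-in for a semantic tokenizer.

    Splits text into runs of word characters and single punctuation marks,
    which is enough to show how a tag gets broken into pieces.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def strip_tags(text):
    """Replace every tag with a space so neighbouring words stay separated."""
    return TAG_RE.sub(" ", text)


raw = '<p class="blog">hello world</p>'

# Order A: tokenize first, then try to drop tokens that look like a tag.
# No single token is a whole tag any more, so nothing is filtered out.
tokens_a = [t for t in toy_tokenize(raw) if not TAG_RE.fullmatch(t)]

# Order B: strip tags first, then tokenize the remaining visible text.
tokens_b = toy_tokenize(strip_tags(raw))

print(tokens_a)
print(tokens_b)
```

In order A, fragments of the markup such as the attribute value survive as ordinary tokens; in order B, only the visible words remain.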

[0029] It is also possible to remove html tags after fetching from the datab...

Abstract

Embodiments of the present invention provide a method and apparatus for removing HTML tags in a search engine. The method comprises: removing the HTML tags from a data source, which is formed after a user edits content on a website and which includes HTML tags, before a website server processes the data source; performing semantic word segmentation on the data source from which the HTML tags have been removed; and storing the segmented content in a search database maintained by the website.
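The three steps in the abstract (remove tags, segment, store) can be sketched as follows. This is an illustration under stated assumptions, not the claimed apparatus: the function names are hypothetical, `segment` is a simple placeholder for real semantic word segmentation, and an in-memory dictionary stands in for the search database.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_tags(source):
    """Step 1: drop tags and attributes, keep the visible text."""
    parser = _TextExtractor()
    parser.feed(source)
    parser.close()
    return " ".join(parser.parts)


def segment(text):
    """Step 2 (placeholder): lower-cased word runs instead of semantic segmentation."""
    return [word.lower() for word in re.findall(r"\w+", text)]


def index_document(index, doc_id, source):
    """Step 3: store the segmented terms in the (in-memory) search index."""
    for term in segment(remove_html_tags(source)):
        index[term].add(doc_id)


index = defaultdict(set)
index_document(index, 1, "<div><b>Solr</b> upgrade notes</div>")
index_document(index, 2, '<p style="color:red">Search engine tips</p>')

print(sorted(index))
```

Because the tags are removed before segmentation, attribute names and values such as `style` or `color` never reach the index.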

Description

Technical field

[0001] The invention relates to the technical field of network search, in particular to a method for removing Hypertext Markup Language (HTML) tags in a search engine.

Background technique

[0002] When users search for content, some search systems return results that do not match the entered keywords. For example, a search for the keyword "blog" returns many results, yet the word "blog" does not appear anywhere in their visible text.

[0003] This happens because part of the searchable content is edited by users as rich text and is stored in the database as the data source, so it carries HTML markup. If the word "blog" happens to appear in the class attribute of a tag, the entry is returned as a match.

[0004] In the prior art, the Solr search engine is used to solve the above problem. Solr's solution is to remove the HTML tags of a given field when fetching data from the database. H...
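The false match described in [0002] and [0003] can be reproduced with a small Python sketch. The stored documents, the `naive_search` helper, and the tag-stripping regular expression are all hypothetical, chosen only to show the effect.

```python
import re


def naive_search(keyword, documents):
    """Return the ids of documents whose stored body contains the keyword."""
    return [doc_id for doc_id, body in documents.items() if keyword in body]


# Hypothetical rich-text content as it might be stored in the database.
stored = {
    "post-1": '<p class="blog">Weekend hiking photos</p>',
    "post-2": "<p>Notes from my blog</p>",
}

# Searching the raw markup matches the class attribute of post-1.
hits_raw = naive_search("blog", stored)

# Searching only the visible text (tags removed) avoids the false match.
visible = {doc_id: re.sub(r"<[^>]+>", "", body) for doc_id, body in stored.items()}
hits_visible = naive_search("blog", visible)

print(hits_raw)
print(hits_visible)
```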

Application Information

IPC(8): G06F17/30
CPC: G06F16/9562, G06F16/951, G06F16/972
Inventor 谢晓静
Owner LETV HLDG BEIJING CO LTD