Method and system for generating input-method word frequency base based on internet information

This technology concerns input method systems and the generation of input method word frequency databases. It addresses the problems that existing word frequency sources cannot cover current vocabulary, are updated slowly, and do not reflect how actively words are used on the Internet. Its effect is to improve the hit rate, input speed, and input efficiency.

Active Publication Date: 2007-03-28
BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD

AI Technical Summary

Problems solved by technology

[0004] However, as the pace of society accelerates and cultures continually clash and merge, modern society uses many new words that existing closed document collections are far from able to cover. With the popularity of the Internet and the resulting rapid expansion of information, this problem has become increasingly prominent.
Because a closed document collection is small and fixed in content, was compiled early, and is updated very slowly, the word frequencies derived from it do not reflect how actively words are used on the Internet. As a result, words that are now used relatively rarely are ranked first, while the words currently used most often are ranked last.
For example, common Internet words such as "top", "online game", and "financial report" are used quite frequently, yet in the existing technology they are generally ranked low, which does not meet users' need to enter them often.


Embodiment Construction

[0051] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0052] The core idea of the present invention is to crawl Chinese webpages (for example, 4 billion) from the Internet, covering Internet news, forums, blogs, chat rooms, and other network content. Duplicate webpages, spam webpages, and pornographic webpages are given lower weight values. Webpages with lower weight values are then either removed, yielding a relatively high-quality set of webpages for analysis (for example, 1 billion), or their contribution to the word frequency statistics is reduced according to their weight values. Then, through webpage analysis technology and Chinese word segmentation technology, the information in the webpage collection is segmented, and word frequency statistics are performed on the entries to obtain a word frequency library that...
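The weighting, filtering, segmentation, and counting steps described in [0052] can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say which segmentation algorithm is used, so forward maximum matching against a small hypothetical lexicon stands in for it, and the page weights, the `min_weight` threshold, and all function names are invented for the example.

```python
from collections import Counter

# Hypothetical lexicon; a real system would use a full dictionary.
LEXICON = {"网络", "游戏", "网络游戏", "财报", "置顶"}
MAX_WORD_LEN = 4  # longest entry in the hypothetical lexicon


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Forward maximum matching: take the longest lexicon word at each position.

    Assumption: the source names no segmentation algorithm; this is a stand-in.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def build_frequency_base(pages, min_weight=0.5):
    """Build weighted word counts from (text, weight) pairs.

    Pages below min_weight are dropped; the remaining pages contribute
    counts scaled by their weight. Both policies are illustrative.
    """
    freq = Counter()
    for text, weight in pages:
        if weight < min_weight:
            continue  # low-weight page (duplicate, spam, ...) is removed
        for word in segment(text):
            if word in LEXICON:  # count only lexicon entries
                freq[word] += weight
    return freq


pages = [
    ("网络游戏财报", 1.0),
    ("网络游戏置顶", 0.8),
    ("网络游戏网络游戏", 0.1),  # e.g. a spam page given a low weight
]
freq = build_frequency_base(pages)
for word, count in freq.most_common():
    print(word, count)
```

The sketch applies both options mentioned in the paragraph (removing low-weight pages and scaling the counts of the rest); a real system might use only one of them.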

Abstract

The method includes the following steps: using web crawler technology to obtain webpages from the Internet; performing word segmentation on the webpage information; and performing word frequency statistics on the resulting entries and saving the results to form an Internet word frequency base. By using public, continuously changing Internet information as the source of word frequency statistics, the invention can produce up-to-date, high-quality word frequency information. The system word frequency base of an input method system is then updated from this information through any convenient channel, so that it stays consistent with word usage on the Internet. The invention raises the hit rate of the user's first-choice word and thereby improves input speed and efficiency.
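To illustrate how an updated word frequency base changes candidate ranking, here is a minimal sketch. The abstract does not specify the update policy, so simple replacement of old frequencies by Internet-derived ones is an assumption, as are the function names and every number in the example.

```python
def update_system_base(system_freq, internet_freq):
    """Return a copy of the system base with Internet frequencies applied.

    Assumption: Internet-derived values simply overwrite the old ones;
    the source only says the base is updated "through any convenient channel".
    """
    updated = dict(system_freq)
    updated.update(internet_freq)
    return updated


def rank_candidates(candidates, freq):
    """Order candidate words by descending frequency (unknown words count as 0)."""
    return sorted(candidates, key=lambda word: freq.get(word, 0), reverse=True)


# Illustrative numbers only: two candidates for the same pinyin input "caibao".
system_freq = {"财宝": 50, "财报": 10}      # frequencies from an older closed corpus
internet_freq = {"财宝": 120, "财报": 900}  # frequencies counted from crawled pages

candidates = ["财宝", "财报"]
print(rank_candidates(candidates, system_freq))
print(rank_candidates(candidates, update_system_base(system_freq, internet_freq)))
```

With the older frequencies, "财宝" (treasure) is offered first; after the update, "财报" (financial report) moves to the first position, which is the behavior the abstract aims for.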

Description

Technical field

[0001] The present invention relates to the field of Internet information processing, in particular to a method and system for generating an input method word frequency database using Internet information as the source of word frequency statistics, and to an input method system.

Background

[0002] Current input method systems (including Chinese, Japanese, Korean, etc.) all rely on their lexicons and the word frequencies in those lexicons to rank candidate words for users during information input. The ranking of candidate words largely determines the hit rate of the user's first-choice word during input. A first-choice hit means that, after the user enters certain keystrokes, the character or word ranked first is the one the user most needs. Of course, taking the Chinese input method as an example, technically speaking, the input method system itself cannot know which word ...
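The first-choice hit rate described in [0002] can be made concrete with a small sketch: for each input event, check whether the top-ranked candidate is the word the user wanted. The event format, the function name, and the sample frequencies are assumptions made for illustration and do not come from the patent text.

```python
def first_choice_hit_rate(events, freq):
    """Fraction of input events whose top-ranked candidate is the wanted word.

    events: list of (candidates, wanted_word) pairs (hypothetical format).
    freq:   mapping from word to frequency; unknown words count as 0.
    """
    if not events:
        return 0.0
    hits = 0
    for candidates, wanted in events:
        # The candidate with the highest frequency is shown first.
        top = max(candidates, key=lambda word: freq.get(word, 0))
        if top == wanted:
            hits += 1
    return hits / len(events)


# Illustrative data only.
freq = {"财报": 900, "财宝": 120, "游戏": 800, "有戏": 30}
events = [
    (["财宝", "财报"], "财报"),  # hit: 财报 ranks first
    (["游戏", "有戏"], "游戏"),  # hit: 游戏 ranks first
    (["财宝", "财报"], "财宝"),  # miss: the user wanted the lower-ranked word
]
print(first_choice_hit_rate(events, freq))
```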

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F17/30864, G06F16/951
Inventor: 佟子健, 郭奇
Owner: BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD