
Method for automatically acquiring new words from Chinese webpages

A technique for acquiring new words from web pages, applied in the field of Internet data mining. It addresses problems such as low algorithm efficiency, leakage of user privacy, and poor Chinese support, and improves accuracy and processing efficiency.

Active Publication Date: 2010-05-12
TSINGHUA UNIV
0 Cites, 44 Cited by

AI Technical Summary

Problems solved by technology

[0008] The first method: user data such as search engine query keywords and chat records are hard to obtain, and improper use may leak user privacy.
[0009] The second method: each candidate new word must be looked up in a search engine, so the algorithm is inefficient and has poor applicability.
[0010] The third method: it suffers from low timeliness and an incomplete search range for new words.
[0012] However, the dictionary-lookup method relies on dictionaries that are difficult to create and maintain, and it cannot handle new words that have yet to be recognized.
[0013] To sum up, each of the above methods suffers from one or more of the following defects: inefficient new word acquisition, insufficient real-time performance, an incomplete search range for new words, or poor support for Chinese.



Examples


Embodiment Construction

[0055] The method proposed by the present invention for automatically acquiring new words from Chinese web pages is described in detail below with reference to the accompanying drawings and embodiments:

[0056] In the method proposed by the present invention for automatically acquiring new words from Chinese web pages, an original database and a stop word database are set up first. The original database is initially empty and stores the data produced while the method runs. The stop word database pre-stores words that cannot occur under the rules of the Chinese language, together with commonly used words that are to be deleted; it can be changed at any time as needed. A new word acquisition cycle is then set, and its length can be adjusted to suit the application: to capture words that appeared recently, set a short cycle; otherwise set a longer one, and you ca...
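The setup described in [0056] can be pictured with a minimal sketch. The class names, field names, and the seven-day default cycle below are illustrative assumptions, not details taken from the patent; the text only states that the original database starts empty, that a stop word database is pre-populated, and that the cycle length is adjustable.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class WordRecord:
    """Hypothetical row in the original database: frequency plus time information."""
    frequency: int = 0
    dates: list = field(default_factory=list)


@dataclass
class AcquisitionState:
    """Hypothetical container for the two databases and the acquisition cycle."""
    stop_words: set = field(default_factory=set)      # pre-stored, editable at any time
    original_db: dict = field(default_factory=dict)   # starts empty, filled during processing
    cycle: timedelta = timedelta(days=7)              # assumed default; adjustable

    def record(self, word_string, seen_on):
        """Store one occurrence of a word string together with its date."""
        rec = self.original_db.setdefault(word_string, WordRecord())
        rec.frequency += 1
        rec.dates.append(seen_on)

    def in_current_cycle(self, seen_on, today):
        """True if the occurrence falls inside the cycle that ends today."""
        return today - self.cycle < seen_on <= today
```

A shorter `cycle` narrows the window to recently coined words, while a longer one widens it, which matches the trade-off the paragraph describes.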


Abstract

The invention relates to a method for automatically acquiring new words from Chinese web pages and belongs to the technical field of Internet data mining. The method comprises the following steps: acquiring web pages of different types from the Internet and parsing them to obtain the text of those pages that carry time information; preprocessing the text and applying n-gram word segmentation to the resulting sentence segments to generate word strings and count their frequencies; storing the word strings, their frequencies, and their time information in an original database; filtering the word strings in the original database against a word frequency threshold and keeping those whose frequency is greater than or equal to the threshold; further filtering the kept word strings by adjacent-string comparison and parent-child string comparison; deleting any word strings that also appear in the stop word database; and performing time-series analysis on the time information of the resulting preliminary new word strings to obtain new words. The method can also comprise a step of adding manually labeled filter word strings to the stop word database. The method offers a wide range of new word acquisition, a simple and convenient Chinese word segmentation approach, high processing efficiency, and high accuracy in finding new words.
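The filtering stages named in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the patented procedure: the punctuation set used to split sentence segments, the n-gram lengths (2 to `max_n`), and the father-son (parent-child) rule shown here, which drops a substring whose frequency equals that of a longer string containing it, are all assumptions. Adjacent-string comparison and time-series analysis are not sketched because the abstract does not describe how they work.

```python
import re
from collections import Counter


def split_segments(text):
    """Assumed preprocessing: split text into segments at punctuation and whitespace."""
    return [s for s in re.split(r"[\s，。！？；：、,.!?;:]+", text) if s]


def ngram_counts(segments, max_n=4):
    """Count every character n-gram of length 2..max_n inside each segment."""
    counts = Counter()
    for seg in segments:
        for n in range(2, max_n + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


def filter_by_frequency(counts, threshold):
    """Keep word strings whose frequency is greater than or equal to the threshold."""
    return {w: c for w, c in counts.items() if c >= threshold}


def drop_child_strings(counts):
    """Assumed father-son rule: drop a string if a longer string contains it
    and occurs exactly as often, since the shorter one adds no information."""
    kept = {}
    for w, c in counts.items():
        absorbed = any(
            len(p) > len(w) and w in p and counts[p] == c for p in counts
        )
        if not absorbed:
            kept[w] = c
    return kept


def remove_stop_words(counts, stop_words):
    """Delete word strings that also appear in the stop word database."""
    return {w: c for w, c in counts.items() if w not in stop_words}
```

Chaining the helpers in the order the abstract lists (segment, count, threshold, string comparison, stop word removal) yields the preliminary candidates that would then go to time-series analysis.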

Description

Technical field

[0001] The invention belongs to the technical field of Internet data mining, and in particular relates to a method for acquiring new words.

Background

[0002] With the rapid development and spread of computer network technology, the volume of network data has grown rapidly. These data are updated quickly, are huge in volume, and are organized irregularly, but they also contain a great deal of valuable information. In addition, as people's need to communicate with one another has grown, the Internet has become a platform for publishing and disseminating information. Some Internet terms and hot words that arose there are now widely used in real life and affect people's lives, and some new words have gradually been accepted, expanding the Chinese vocabulary. These newly emerging words appear quickly, spread widely, and are often scattered across massive amounts of network text. It is unimaginable to view and retrieve th...


Application Information

IPC(8): G06F17/30
Inventors: 孙立远, 袁睿翕, 卞小丁
Owner TSINGHUA UNIV