Method for automatically acquiring new words from Chinese webpages

A technique for acquiring new words from webpages, applied in the field of Internet data mining. It addresses problems such as low algorithm efficiency, leakage of user privacy, and poor Chinese support, and improves accuracy and processing efficiency.

Active Publication Date: 2010-05-12
TSINGHUA UNIV
0 Cites, 44 Cited by

AI Technical Summary

Problems solved by technology

[0008] The first method: user data such as search engine query keywords and chat records is hard to obtain, and improper use may leak user privacy;
[0009] The second method: each candidate new word must be looked up in a search engine, so the algorithm is inefficient and of limited applicability;
[0010] The third method: there are defects of low timeliness and incomplete search range of new



Examples


Example Embodiment

[0055] The method proposed by the present invention for automatically acquiring new words from Chinese webpages is described in detail below with reference to the drawings and embodiments:

[0056] In the method proposed by the present invention for automatically acquiring new words from Chinese webpages, an original database and a stop word database are first set up. The original database is initially empty and stores the data generated while the new word acquisition method runs. The stop word database is pre-loaded with strings that cannot occur as words under the rules of the Chinese language, together with words already in use that are to be deleted; its contents can be changed at any time as needed. A new word acquisition cycle is then set, and its length can be changed according to actual application needs: to obtain new words from the recent past, set a short cycle; otherwise set a longer one. Appropriate adjustments can also be made accord...
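The setup step in [0056] can be pictured as a small piece of state: an empty store for generated data, an editable stop word set, and an adjustable acquisition cycle. The following is only a minimal sketch of that idea; the class name `AcquisitionState`, its field names, and the 7-day default cycle are illustrative assumptions that do not come from the patent text.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class AcquisitionState:
    """Hypothetical container for the setup described in [0056]."""

    # Original database: starts empty, filled during processing.
    # Assumed layout: word string -> list of (timestamp, frequency) records.
    original: dict = field(default_factory=dict)

    # Stop word database: pre-loaded strings to discard; editable at any time.
    stop_words: set = field(default_factory=set)

    # New word acquisition cycle; the 7-day default is an arbitrary assumption.
    period: timedelta = timedelta(days=7)


# A shorter cycle targets more recent new words.
state = AcquisitionState(stop_words={"的", "了"}, period=timedelta(days=1))
state.stop_words.add("是")  # the stop word set can be changed as needed
```

Keeping the stop words in a set makes the later "delete strings that appear in the stop word database" check a constant-time lookup.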



Abstract

The invention relates to a method for automatically acquiring new words from Chinese webpages and belongs to the technical field of Internet data mining. The method comprises the following steps: acquiring different types of webpages from the Internet, parsing them to obtain webpage texts that contain time information, preprocessing the texts, performing n-gram word segmentation on the resulting sentence segments to generate word strings and count word frequencies, and storing the word strings, their word frequencies and their time information in an original database; filtering the word strings in the original database by a word frequency threshold, keeping the word strings whose word frequency is greater than or equal to the threshold; filtering the kept word strings by adjacent string comparison and father-son string comparison, deleting the word strings that also appear in the stop word database, and performing time-sequence analysis on the time information of the resulting preliminarily selected new word strings to obtain new words. The method may also comprise a step of adding filter word strings obtained by manual labeling to the stop word database. The method has the advantages of a wide range for acquiring new words, a simple and convenient Chinese word segmentation method, high processing efficiency, and high accuracy and soundness in finding new words.
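Part of the pipeline in the abstract can be sketched in a few lines: character n-gram generation, frequency counting, threshold filtering, a father-son (parent-child) string comparison, and stop word removal. The sketch below is a minimal illustration under stated assumptions, not the patent's procedure. The function names, the n-gram length range, and the father-son rule used here (a shorter string is dropped when a longer kept string containing it has the same frequency) are all assumptions, since the abstract does not define them. Adjacent string comparison and time-sequence analysis are omitted.

```python
from collections import Counter


def ngrams(segment, max_n=4):
    """Yield every substring of length 2..max_n of one sentence segment."""
    for n in range(2, max_n + 1):
        for i in range(len(segment) - n + 1):
            yield segment[i:i + n]


def count_strings(segments, max_n=4):
    """Count how often each n-gram word string occurs across all segments."""
    counts = Counter()
    for seg in segments:
        counts.update(ngrams(seg, max_n))
    return counts


def filter_candidates(counts, min_freq, stop_words=frozenset()):
    """Apply the frequency threshold, an assumed father-son rule, and stop words."""
    # Keep strings whose frequency is greater than or equal to the threshold.
    kept = {s: c for s, c in counts.items() if c >= min_freq}
    result = {}
    for s, c in kept.items():
        # Assumed father-son rule: a child string that never occurs outside a
        # longer kept parent (equal frequency) is absorbed by that parent.
        absorbed = any(
            len(p) > len(s) and s in p and pc == c for p, pc in kept.items()
        )
        if not absorbed and s not in stop_words:
            result[s] = c
    return result


segments = ["给力的表现", "真给力啊", "很给力"]
candidates = filter_candidates(count_strings(segments, max_n=3), min_freq=3)
```

With these three segments and a threshold of 3, only the string shared by all of them survives. The pairwise parent check is quadratic in the number of kept strings, which is acceptable for a sketch but would need an index for a real corpus.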

Description

Technical field

[0001] The invention belongs to the technical field of Internet data mining, and in particular relates to a method for acquiring new words.

Background technique

[0002] With the rapid development and spread of computer network technology, the volume of network data has grown rapidly. This data is updated quickly, is huge in volume, and is irregularly organized, but it also contains a great deal of valuable information. In addition, as people's need to communicate with one another has grown, the Internet has become a platform for publishing and disseminating information. Some of the Internet terms and hot words that arose there have come into wide use in real life and affect people's lives, and some new words have gradually been accepted, expanding the Chinese vocabulary. These newly emerging words are generated quickly and spread widely, and they are often scattered across massive amounts of network text. It is unimaginable to view and retrieve th...


Application Information

IPC(8): G06F17/30
Inventor: 孙立远, 袁睿翕, 卞小丁
Owner: TSINGHUA UNIV