ITQ algorithm-based Indonesian similar news recommendation method

A news recommendation method, applied in the computer field, that addresses the failure of existing approaches to use additional news information, their high overhead, and their low utilization of news content. Its news codes have a small dimension, which reduces the amount of calculation and memory overhead while giving good recommendation results.

Active Publication Date: 2019-07-09
UNIV OF ELECTRONIC SCI & TECH OF CHINA
4 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

The disadvantages of this method are as follows: TF-IDF (term frequency-inverse document frequency) first vectorizes the news, that is, converts each news item into a one-dimensional numerical vector of the same dimension, and makes similar-news recommendations on the basis of these news vectors.
The dimension of this vector is very large. Even if some vocabulary filtering methods are used t...

Method used

[0082] In this embodiment, the present invention converts news text into binary codes to extract similarity relations between news items and to screen the news, instead of working with the vector representation of the news. This greatly reduces the amount of calculation. The existing TF-IDF (term frequency-inverse document frequency) technique calculates the TF-IDF value of every word in each news item and uses the resulting vector to represent the news, so the vector is as long as the number of distinct words. The common words of a language may number in the hundreds of thousands, so the vector of each news article has hundreds of thousands of dimensions, and computing with vectors of that length is too expensive in memory. The present invention uses the ITQ method to convert a piece of news i...

Abstract

The invention provides an ITQ algorithm-based Indonesian similar news recommendation method, which comprises the following steps: first, extracting the title and text of each Indonesian news item and storing them in the fields corresponding to that news item; training a Word2Vec model on the Indonesian news data to obtain a news-to-vector mapping dictionary; obtaining the binary code of the feature vector under the optimal rotation matrix through the ITQ algorithm; calculating the n-bit binary signature of the currently browsed Indonesian news and of each Indonesian news item in the candidate data set; calculating the Hamming distance between the currently browsed news and each Indonesian news item in the candidate data set; and sorting by Hamming distance and selecting the first m Indonesian news items with the smallest distance in the candidate data set as recommended news. The method solves the technical problem of balancing content-based news recommendation quality against the amount of calculation. The method is highly flexible and can be applied to various language environments.

Examples

Example Embodiment

[0049] Example
[0050] As shown in Figure 1, the present invention discloses an ITQ algorithm-based Indonesian similar news recommendation method, implemented as follows:
[0051] (S1) Crawl Indonesian news data, extract the title and text of each Indonesian news item, and save them in the fields corresponding to that news item;
[0052] (S2) Train a Word2Vec model on the Indonesian news data to obtain a news-to-vector mapping dictionary, comprising the following steps:
[0053] (a1) From the crawled Indonesian news data, take the 100,000 most frequently used words and use the Word2Vec model to calculate their word embeddings;
[0054] (a2) Convert each news item into a vector representation according to the word embeddings, thereby obtaining the news-to-vector mapping dictionary;
[0055] Step (S2) also comprises preprocessing of the Indonesian news, as follows:
[0056] (b1) Segment the content of the Indonesian news into words;
[0057] (b2) From the word segmentation result, filter out the stop words and special characters to obtain plain text data, thereby completing the preprocessing of the Indonesian news;
[0058] (S3) Apply the ITQ algorithm to the news-to-vector mapping dictionary to obtain the binary code of the feature vector under the optimal rotation matrix, comprising the following steps:
[0059] (c1) Use PCA to reduce the dimensionality of the news-to-vector mapping dictionary; the expression is as follows:
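[0060] $$\tilde{I}(W) = \sum_{k} E\left(\lVert x\,\omega_k \rVert_2^2\right) = \frac{1}{m}\sum_{k} \omega_k^T X^T X\,\omega_k = \frac{1}{m}\operatorname{tr}\left(W^T X^T X W\right), \qquad W^T W = I$$
[0061] where $\tilde{I}(W)$ is the objective function in W, W is the matrix of eigenvectors of the covariance matrix $X^T X$, E is the expected value, x is a single sample, $\omega_k$ is the hyperplane parameter of the kth sgn(·) function, m is the number of samples, T denotes matrix transposition, X is the data set after news word vectorization, k indexes the sgn functions, and I is the identity matrix;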
[0062] (c2) From the dimensionality reduction result, find the rotation matrix with the smallest quantization error, so as to obtain the binary code of the feature vector under the optimal rotation matrix. The solution process is as follows:
[0063] (d1) Randomly initialize the orthogonal matrix R, hold it fixed, and update the optimal solution matrix B;
[0064] (d2) Substitute the orthogonal matrix R and the matrix B into the objective function to be minimized, $Q(B, R) = \lVert B - VR \rVert_F^2 = \lVert B \rVert_F^2 + \lVert V \rVert_F^2 - 2\operatorname{tr}\left(B R^T V^T\right)$, and obtain the optimal solution matrix $B = \operatorname{sgn}(VR)$, where $\lVert\cdot\rVert_F$ is the Frobenius norm, tr is the trace of a matrix, T denotes matrix transposition, V is the projected news matrix with V = XW, X is the data set after news word vectorization, and W is the matrix of eigenvectors of the covariance matrix $X^T X$;
[0065] (d3) Hold the optimal solution matrix B fixed and update the orthogonal matrix R;
[0066] (d4) Following the orthogonal Procrustes problem, compute the singular value decomposition (SVD) $B^T V = S \Omega \hat{S}^T$ and obtain the optimal orthogonal matrix $R = \hat{S} S^T$, where B is the optimal solution matrix, S and $\hat{S}$ are the unitary matrices from the singular value decomposition of $B^T V$, Ω is the positive semi-definite diagonal matrix from the same decomposition, T denotes matrix transposition, V is the projected news matrix with V = XW, X is the data set after news word vectorization, and W is the matrix of eigenvectors of the covariance matrix $X^T X$;
[0067] (d5) Repeat steps (d1) to (d4) and output the optimal solution matrix B, thereby obtaining the binary code of the feature vector under the optimal rotation matrix;
[0068] (S4) Using the news-to-vector mapping dictionary and the binary code of the feature vector under the optimal rotation matrix, calculate the n-bit binary signature of the currently browsed news and of each Indonesian news item in the candidate data set, where:
[0069] Calculating the n-bit binary signature of the currently browsed Indonesian news comprises the following steps:
[0070] (e1) From the news-to-vector mapping dictionary, calculate the n-bit binary signature $A_i$ of the headline of the currently browsed news;
[0071] (e2) From the binary code of the feature vector under the optimal rotation matrix, calculate the n-bit binary signature $B_i$ of the content of the currently browsed news;
[0072] (e3) Splice the signatures $A_i$ and $B_i$ to obtain the n-bit binary signature $C_i$ of the currently browsed news, thereby completing the n-bit binary signature of the currently browsed Indonesian news, where i is the index of the news in the candidate data set and n is the total number of bits in the binary code;
[0073] Calculating the n-bit binary signature of each Indonesian news item in the candidate data set comprises the following steps:
[0074] (f1) From the news-to-vector mapping dictionary, calculate the n-bit binary signature $D_i$ of the headline of each news item in the candidate data set;
[0075] (f2) From the binary code of the feature vector under the optimal rotation matrix, calculate the n-bit binary signature $E_i$ of the content of each news item in the candidate data set;
[0076] (f3) From the signatures $D_i$ and $E_i$, calculate the n-bit binary signature $F_i$ of each news item in the candidate data set, thereby completing the n-bit binary signature of each Indonesian news item in the candidate data set, where i is the index of the news in the candidate data set and n is the total number of bits in the binary code;
[0077] (S5) Using the n-bit binary signatures, calculate the Hamming distance between the currently browsed news and each Indonesian news item in the candidate data set. Specifically, the Hamming distance between the n-bit signature $C_i$ and the n-bit signature $F_i$ is calculated as follows:
[0078] $$d(C_i, F_i) = \sum_{j=1}^{n} C_i^{(j)} \oplus F_i^{(j)}$$
[0079] where d(·) is the Hamming distance function, $C_i$ and $F_i$ are n-bit binary codes, n is the total number of bits in the binary code, i is the index of the news in the candidate data set, $C_i^{(j)}$ and $F_i^{(j)}$ are the jth bits of $C_i$ and $F_i$, and $\oplus$ is the XOR operation.
[0080] (S6) Sort by Hamming distance and select the first m Indonesian news items with the smallest distance in the candidate data set as recommended news, thereby completing the recommendation of similar news.
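Steps (S4) to (S6) can be illustrated with a minimal sketch. It is not the patent's implementation: the function names `splice`, `hamming` and `recommend`, the 4-bit codes and the news ids are all hypothetical, and splicing is assumed here to mean concatenating the title code and the content code.

```python
import heapq

def splice(title_bits: str, content_bits: str) -> int:
    """Concatenate a title code and a content code into one integer signature."""
    return int(title_bits + content_bits, 2)

def hamming(a: int, b: int) -> int:
    """Hamming distance: count the 1 bits in the XOR of the two signatures."""
    return bin(a ^ b).count("1")

def recommend(current: int, candidates: dict, m: int) -> list:
    """Return the ids of the m candidates with the smallest Hamming distance."""
    return heapq.nsmallest(
        m, candidates, key=lambda news_id: hamming(current, candidates[news_id])
    )

# Hypothetical 4-bit title codes and 4-bit content codes.
current = splice("1010", "1100")
candidates = {
    "news_a": splice("1010", "1101"),  # differs from current in 1 bit
    "news_b": splice("0101", "0011"),  # differs from current in all 8 bits
    "news_c": splice("1001", "1100"),  # differs from current in 2 bits
}
top = recommend(current, candidates, m=2)
print(top)
```

Storing each signature as an integer keeps the distance computation to one XOR and one bit count per candidate.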
[0081] In this embodiment, the Word2Vec algorithm is used to compute Indonesian word embeddings, the news-to-vector mapping dictionary is obtained from the crawled Indonesian news data, and the vectors are converted into binary codes by the ITQ algorithm model. When recommending for the currently browsed news, the time complexity of finding the top m news articles that contain the most keywords of the currently browsed news is O(N log m), where N is the total number of news items. Compared with using the traditional TF-IDF (term frequency-inverse document frequency) technique as the vector representation of news, this greatly reduces the amount of calculation and memory overhead, and candidate news can be selected quickly in O(N log m) time. In the ITQ algorithm, rotating the matrix reduces the quantization error, so the binary-coded news retains more context information, and after the first screening step the number of candidate news items is within one hundred. This hierarchical architecture therefore both ensures the accuracy of news similarity and achieves fast calculation.
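The news-to-vector mapping of step (S2) can be sketched as follows. The source does not state how word embeddings are combined into a news vector, so averaging is an assumption made here for illustration. The stop-word list, the two-dimensional embeddings and the helper names `preprocess` and `news_vector` are likewise hypothetical; a real system would take its embeddings from a trained Word2Vec model.

```python
import re

# Hypothetical, deliberately tiny Indonesian stop-word list.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari"}

def preprocess(text: str) -> list:
    """Lowercase, split into alphabetic tokens, and drop stop words.

    Special characters are removed implicitly because only letters match.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def news_vector(tokens: list, embeddings: dict, dim: int) -> list:
    """Average the embeddings of the known tokens (assumed combination rule)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Toy embeddings standing in for the output of a trained Word2Vec model.
embeddings = {"harga": [1.0, 0.0], "beras": [0.0, 1.0], "naik": [1.0, 1.0]}
tokens = preprocess("Harga beras naik di Jakarta!")
vector = news_vector(tokens, embeddings, dim=2)
print(tokens, vector)
```

Words outside the embedding vocabulary (here "jakarta") are skipped, which mirrors restricting the vocabulary to the most frequent words in step (a1).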
[0082] In this embodiment, the present invention converts news text into binary codes to extract similarity relations between news items and to screen the news, instead of working with the vector representation of the news, which greatly reduces the amount of computation. The existing TF-IDF (term frequency-inverse document frequency) technique calculates the TF-IDF value of every word in each news item and uses the resulting vector to represent the news, so the vector is as long as the number of distinct words. The common words of a language may number in the hundreds of thousands, so the vector of each news article has hundreds of thousands of dimensions, and computing with vectors of that length is too expensive in memory. The present invention uses the ITQ method to convert a piece of news into a binary code with a fixed length of 32, 64 or 128, which is far below the original magnitude and reduces storage overhead. The ITQ algorithm yields an n-bit binary signature for each news item. The news encoding then has a length on the order of one hundred, and the similarity between news items is calculated from it; because the dimension is small, the calculation cost is low and the calculation is fast. Because the recommendation is obtained through this further filtering, the recommendation effect is better. In addition, the ITQ-based model converts each piece of news into a fixed-length binary code, and the Hamming distance between binary codes reflects the similarity between news items: the smaller the distance, the higher the similarity.
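The PCA projection and alternating update of steps (c1) to (d5) can be sketched with NumPy. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `itq`, the iteration count, the random seed, and the random input standing in for news vectors are all assumptions.

```python
import numpy as np

def itq(X, n_bits, n_iter=50, seed=0):
    """Sketch of ITQ: returns (codes, W, R) with codes in {0, 1}."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                       # zero-center the news vectors
    # (c1) PCA: keep the n_bits leading eigenvectors of X^T X.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    V = X @ W                                    # projected data, V = XW
    # (d1) Random orthogonal initialisation of R via a QR decomposition.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)      # (d2) fix R, B = sgn(VR)
        S, _, S_hat_T = np.linalg.svd(B.T @ V)   # (d4) B^T V = S Omega S_hat^T
        R = S_hat_T.T @ S.T                      # (d3) fix B, R = S_hat S^T
    codes = (V @ R >= 0).astype(np.uint8)        # map {-1, +1} to {0, 1}
    return codes, W, R

# Random data standing in for 200 news vectors of dimension 16 (hypothetical).
X = np.random.default_rng(1).standard_normal((200, 16))
codes, W, R = itq(X, n_bits=8)
print(codes.shape)
```

In practice `n_bits` would be 32, 64 or 128 as described above. A news item encoded later would need to be centered with the same mean before being projected with W and rotated with R.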
[0083] Through the above method, the present invention solves the technical problem of balancing content-based news recommendation quality against the calculation load. It ensures the similarity of the recommended news while greatly reducing the calculation load. The present invention is highly flexible and can be applied to various language environments.