Word Similarity with Co-occurrence Matrix and Word2Vec (Skip-gram)

Mahdi Amrollahi · Published in Analytics Vidhya · Apr 19, 2021

One of the most important areas of natural language processing is semantic analysis, and measuring the similarity between two words is a useful building block for it. First, however, each word needs to be converted into a vector that a machine can compute with. These vectors can be obtained from co-occurrence matrices or from neural networks. In this experiment, we use a co-occurrence matrix and Skip-gram, show the results, and then compare them.

Introduction

Word embedding is a representation in which words with similar meanings get similar vectors, and it is one of the central building blocks of natural language processing. Each individual word is mapped to a vector of real numbers, and the vector values are learned by a model. There are two common ways to compute these vectors: co-occurrence matrices and neural networks.

Co-occurrence matrix

A co-occurrence matrix captures relationships between words by describing how often they occur together. The simplest way to build one is to count how many times each pair of words appears together within a context window in the corpus, as in the sketch below.
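As an illustration, here is a minimal sketch of such a count-based matrix; the toy corpus, window size, and function name are only for demonstration and are not the exact setup of the experiment.

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` words of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for i, target in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][sentence[j]] += 1
    return counts

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
cooc = build_cooccurrence(corpus, window=2)
print(cooc["i"]["like"])   # 2: "i" and "like" co-occur in two sentences
```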

There are other ways to fill the co-occurrence matrix besides simple counting. In this experiment, we use PMI (Pointwise Mutual Information) and SPPMI (Shifted Positive PMI).

The idea of PMI is to quantify how likely two words are to co-occur, while correcting for the fact that frequent words co-occur with many words simply because of their individual frequencies: PMI(w, c) = log(P(w, c) / (P(w)P(c))). In practice, computing PMI from raw corpus-wide counts is problematic, so we estimate it from how often a word appears within a window around each target word; the intuition is that a word's meaning is mostly determined by its nearby words, not by the whole corpus. When the corpus is not very large, the variance of this estimator becomes high, so instead of plain PMI we use the Shifted Positive variant, SPPMI(w, c) = max(PMI(w, c) − log k, 0).
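As a sketch of how the counts can be turned into an SPPMI matrix, assuming the count dictionary produced above and the standard PMI/SPPMI definitions from Levy and Goldberg (2014):

```python
import numpy as np

def sppmi_matrix(counts, k=5):
    """Build an SPPMI matrix from co-occurrence counts:
    PMI(w, c) = log(P(w, c) / (P(w) * P(c))), SPPMI(w, c) = max(PMI(w, c) - log(k), 0)."""
    vocab = sorted({w for w in counts} | {c for ctx in counts.values() for c in ctx})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for w, ctx in counts.items():
        for c, n in ctx.items():
            M[idx[w], idx[c]] = n
    total = M.sum()
    p_w = M.sum(axis=1) / total           # marginal probability of each target word
    p_c = M.sum(axis=0) / total           # marginal probability of each context word
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M / total) / np.outer(p_w, p_c))
    sppmi = np.maximum(pmi - np.log(k), 0)
    sppmi[~np.isfinite(sppmi)] = 0        # pairs that never co-occur get 0, not -inf/NaN
    return sppmi, vocab

sppmi, vocab = sppmi_matrix(cooc, k=5)    # `cooc` from the previous snippet
```

Each row of the SPPMI matrix can then be used as the word's vector, and two words can be compared with cosine similarity.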

Neural network

Creating embeddings with a neural network is helpful because it scales better to large corpora, reduces the dimensionality of the representation, and places words in a transformed vector space. Neural network embeddings can also overcome the limitations of one-hot encoding. In this experiment we use the Skip-gram variant of Word2Vec, which learns the vectors by predicting the context words around each target word.
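A minimal Skip-gram sketch with gensim follows; the hyperparameters are illustrative and not necessarily the ones used in the experiment.

```python
from gensim.models import Word2Vec

# Toy corpus; in the experiment the sentences come from the Wikipedia dump.
sentences = [["i", "like", "deep", "learning"],
             ["i", "like", "nlp"],
             ["i", "enjoy", "flying"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embeddings (`size` in gensim < 4.0)
    window=5,          # context window size
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    workers=4,
)

print(model.wv.similarity("deep", "learning"))   # cosine similarity of two word vectors
print(model.wv.most_similar("nlp", topn=3))      # nearest neighbours in the embedding space
```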

Dataset

For training the model, a rich text corpus is important. The dataset should cover a wide range of topics so that the resulting vectors can be evaluated against well-known test collections, which are mostly scored by humans. Among several candidates, we chose a custom Wiki-Dump (build 2021–04–01) containing pages and articles to build the model. Wiki-Dump files come in different formats (.xml, .txt, .sql, …) and contain a huge amount of text, which makes them one of the best datasets for this purpose. Another advantage of the Wiki-Dump is that it includes both formal and informal text, which helps the model capture the similarity between two words better.
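As a hedged sketch, the plain text can be extracted from the dump with gensim's WikiCorpus; the file name below is only a placeholder for the 2021–04–01 build.

```python
from gensim.corpora import WikiCorpus

# Placeholder file name; substitute the actual dump used in the experiment.
wiki = WikiCorpus("enwiki-20210401-pages-articles.xml.bz2", dictionary={})

with open("wiki_text.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():      # each article is yielded as a list of tokens
        out.write(" ".join(tokens) + "\n")
```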

Results

References

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27:2177–2185, 2014.
