Bag of Words: Approach, Python Code, Limitations


By Naman Swarnkar

In this blog post, we will study the Bag of Words method for creating vectorized representations of text data. These representations can then be used for Natural Language Processing tasks such as sentiment analysis. We'll go through the relevant terms, the limitations, and the advantages of the method.

Bag of Words is a simple feature extraction method for text data that is easy to implement. It involves maintaining a vocabulary and counting how often each word occurs, while ignoring aspects of natural language such as grammar and word order.

Bag of Words Approach

The Bag of Words approach takes a document as input and breaks it into words. These words are also known as tokens, and the process is called tokenization.

The unique tokens collected from all processed documents form an ordered vocabulary. Finally, a vector whose length equals the size of the vocabulary is created for each document, with each value giving the frequency of the corresponding token in that document.

Note that we ignore the order in which these words appear in the document. Hence the name ‘Bag of Words’, which signifies an unordered collection of items in a bag. We can easily implement this approach in Python. Below is an example demonstrating the same.

[Image: Python code for the Bag of Words approach and its output]
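
A minimal sketch of the tokenization and vocabulary steps described above. The two sample documents and the simple regex tokenizer are assumptions made for illustration, not the article's original listing.

```python
import re

# Hypothetical sample documents, chosen only for illustration.
documents = [
    "The market opened higher today",
    "The market closed lower today after the news",
]


def tokenize(text):
    """Split a document into word tokens (case is deliberately preserved)."""
    return re.findall(r"[A-Za-z0-9']+", text)


tokenized_docs = [tokenize(doc) for doc in documents]

# Unique tokens from all documents, sorted to give an ordered vocabulary.
vocabulary = sorted({token for tokens in tokenized_docs for token in tokens})

total_words = sum(len(tokens) for tokens in tokenized_docs)
print("Total words:", total_words)
print("Vocabulary size:", len(vocabulary))
print(vocabulary)
```

Because case is preserved here, ‘The’ and ‘the’ end up as separate vocabulary entries.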

Note the difference between the total number of words and the length of the vocabulary. We'll now calculate the frequency of each word in each document and store the results in a dictionary.

[Image: Python code for computing word frequencies per document and its output]
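
The frequency step can be sketched as follows. The document names `doc1` and `doc2`, the sample text, and whitespace tokenization are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample documents keyed by a made-up name.
documents = {
    "doc1": "The market opened higher today",
    "doc2": "The market closed lower today after the news",
}

tokenized = {name: text.split() for name, text in documents.items()}
vocabulary = sorted({tok for toks in tokenized.values() for tok in toks})

# For each document, record the count of every vocabulary word (0 if absent).
frequencies = {}
for name, tokens in tokenized.items():
    counts = Counter(tokens)
    frequencies[name] = {word: counts[word] for word in vocabulary}

# The Bag of Words vector is the counts read off in vocabulary order.
vectors = {
    name: [freq[word] for word in vocabulary]
    for name, freq in frequencies.items()
}

for name, vector in vectors.items():
    print(name, vector)
```

Each vector has one entry per vocabulary word, so words that a document does not contain show up as zeros.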

Limitations of Bag of Words

Consider using the Bag of Words method to generate vectors for large documents. The resulting vectors will be high-dimensional and will consist mostly of zeros, that is, they will be sparse. This can already be observed in the small example above.

Apart from producing sparse representations, Bag of Words does a poor job of capturing the meaning of text. For example, consider the two sentences "I love playing football and hate cricket" and its reverse, "I love playing cricket and hate football". The Bag of Words approach produces identical vectors for both, because they contain the same words with the same counts, even though the sentences carry different meanings. Attention-based deep learning models like BERT are used to address this lack of contextual awareness.
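
A short sketch of this limitation, using the two sentences from the paragraph above. The helper name `to_vector` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

sentence_a = "I love playing football and hate cricket"
sentence_b = "I love playing cricket and hate football"

tokens_a = sentence_a.split()
tokens_b = sentence_b.split()
vocabulary = sorted(set(tokens_a) | set(tokens_b))


def to_vector(tokens, vocabulary):
    """Count each vocabulary word in the token list, ignoring word order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vector_a = to_vector(tokens_a, vocabulary)
vector_b = to_vector(tokens_b, vocabulary)

# Word order is discarded, so the two opposite sentences are indistinguishable.
print(vector_a == vector_b)
```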

We can solve the problem of sparse vectors to some extent using the techniques discussed below:

Converting all words to lower case

While tokenizing documents, we may encounter the same word in different cases, e.g. upper ‘CASE’, lower ‘case’, or title ‘Case’. Although the underlying word is the same, a different token is generated for each variant. This increases the size of the vocabulary and, consequently, the dimension of the generated vectors.
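
A minimal sketch of the effect, assuming a made-up token list:

```python
# Hypothetical tokens containing the same word in three different cases.
tokens = ["CASE", "case", "Case", "study"]

vocab_raw = sorted(set(tokens))
vocab_lower = sorted({token.lower() for token in tokens})

print(len(vocab_raw), vocab_raw)
print(len(vocab_lower), vocab_lower)
```

Lowercasing collapses the three case variants into a single vocabulary entry.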

Removing Stop Words

Stop words are commonly occurring words such as ‘the’, ‘is’, etc. Removing such words from the vocabulary results in vectors of lower dimension. Stop word lists are not exhaustive, and one can specify custom stop words while working on a Bag of Words model.
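
A minimal sketch of stop-word filtering. Both word lists below are tiny, hypothetical examples (real lists are much longer), and the custom entry `rt` is just an assumed example of a domain-specific stop word.

```python
# Tiny, hypothetical stop-word list for illustration only.
stop_words = {"the", "is", "a", "an", "and", "of", "in"}

# Assumed custom addition, e.g. the retweet marker when working with tweets.
custom_stop_words = {"rt"}
all_stop_words = stop_words | custom_stop_words

text = "RT the outlook of the market is positive"
tokens = text.lower().split()

# Keep only tokens that are not stop words.
filtered = [token for token in tokens if token not in all_stop_words]

print(tokens)
print(filtered)
```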

Stemming and Lemmatization

Both techniques aim to reduce a word to its root form, but they go about it differently. Stemming does this by stripping suffixes from the word under consideration. For example, ‘playing’ becomes ‘play’. There is no single standard procedure for stemming, and various stemmers are available. Stemming often produces strings that are not real words. Lemmatization takes a different approach: it takes linguistics into account and returns meaningful root words. This method is relatively harder, as it relies on a dictionary to achieve the desired results.
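
The contrast can be sketched with two toy functions. Both `naive_stem` and `naive_lemmatize`, the suffix list, and the small lemma dictionary are hypothetical stand-ins written for illustration; they are not a real stemmer or lemmatizer from any library.

```python
# Toy suffix list, checked in this order (an assumption, not a standard).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary; a real one would be far larger.
LEMMAS = {"playing": "play", "played": "play", "studies": "study", "better": "good"}


def naive_lemmatize(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)


for word in ["playing", "studies", "better"]:
    print(word, naive_stem(word), naive_lemmatize(word))
```

Here suffix stripping turns ‘studies’ into ‘studi’, which is not a word, while the dictionary lookup returns ‘study’.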

Below is an example using Scikit-learn’s CountVectorizer, which can remove stop words and convert words to lower case before building the vectorized representation of the documents.

[Image: Python code using Scikit-learn’s CountVectorizer with stop word removal and its output]
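
A minimal sketch of such a CountVectorizer setup, assuming scikit-learn is installed. The sample documents are made up, and the options shown (`lowercase=True`, `stop_words="english"`) are one reasonable configuration rather than the article's original listing.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, chosen only for illustration.
documents = [
    "The market opened higher today",
    "The market closed lower today after the news",
]

# Lowercase the text and drop scikit-learn's built-in English stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(documents)

# vocabulary_ maps each term to its column index; sort terms by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = matrix.toarray()

print(vocabulary)
print(dense)
```

`fit_transform` returns a sparse matrix, which is why `toarray()` is called before printing the counts.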

Notice how the number of words in the vocabulary differs from that of the basic approach.

Bag of Words vs Word2Vec

Even after applying the techniques mentioned above, it is difficult to limit the growing dimension of the vectors when dealing with a large number of documents. One can indeed cap the vocabulary so that it includes only the most frequent words, but this discards the information carried by rarer words and can hurt performance.
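
A minimal sketch of capping the vocabulary at the most frequent words. The sample documents and the `max_features` value of 2 are assumptions chosen to keep the example small.

```python
from collections import Counter

# Hypothetical sample documents.
documents = [
    "buy buy sell hold",
    "buy sell sell rally",
    "buy hold crash",
]

tokenized = [doc.split() for doc in documents]
counts = Counter(token for tokens in tokenized for token in tokens)

# Keep only the most frequent words across the whole corpus.
max_features = 2  # assumed cap, for illustration
vocabulary = [word for word, _ in counts.most_common(max_features)]

vectors = []
for tokens in tokenized:
    doc_counts = Counter(tokens)
    vectors.append([doc_counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

The third document's vector no longer reflects ‘crash’ at all, which illustrates the information lost by truncating the vocabulary.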

Word embedding models like Word2Vec produce distributed representations that take semantics into account, such that words with similar meanings lie close to each other in vector space. Word2Vec also keeps the dimension of the generated vectors fixed, regardless of the size of the vocabulary. This often makes Word2Vec a preferred choice for creating vectorized representations of words.

Advantages of Bag of Words

Bag of Words is still widely used owing to its simplicity. NLP researchers usually create their first model using Bag of Words to get an idea of the performance of their work before proceeding to better word embeddings.

It is particularly helpful when we are working with a small number of documents that are very domain-specific, for example, political news data from Twitter used to measure sentiment. Word2Vec is often used in the form of pre-trained embeddings, which may not include words from niche domains.

Conclusion

With the help of word embedding models, one can create an end-to-end algorithmic trading pipeline that processes alternative text data and uses it to predict potential price movements. You have seen how Bag of Words can be used to create vectorized representations of text. You can learn about more sophisticated techniques like Word2Vec and BERT to build sentiment analysis models in the course Natural Language Processing in Trading.

Disclaimer: All data and information provided in this article are for informational purposes only. QuantInsti® makes no representations as to accuracy, completeness, currentness, suitability, or validity of any information in this article and will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use. All information is provided on an as-is basis.