Twitter for Sentiment Analysis

This corpus is a collection of tweets containing both text and images, gathered from July to December 2016. During this period, we used Twitter’s Sample API to access a random 1% sample of the stream of all globally produced tweets, discarding:

  • tweets not containing any static image or containing other media (i.e., we also discarded tweets containing only videos and/or animated GIFs)
  • tweets not written in the English language
  • tweets whose text was fewer than 5 words long
  • retweets
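
The four discard rules above can be sketched as a single predicate. This is only an illustrative sketch, not the code used to build the corpus: it assumes each tweet is a dict shaped like a Twitter API v1.1 tweet object (fields `lang`, `text`, `retweeted_status`, and `extended_entities.media` with `type` values such as `photo`, `video`, `animated_gif`), and it assumes words are counted by whitespace splitting. The function name `keep_tweet` is hypothetical.

```python
def keep_tweet(tweet: dict) -> bool:
    """Return True if the tweet survives all four discard rules."""
    # Retweets: assumed to carry a `retweeted_status` object.
    if "retweeted_status" in tweet:
        return False

    # Language: keep only tweets tagged as English.
    if tweet.get("lang") != "en":
        return False

    # Length: discard texts shorter than 5 words
    # (whitespace tokenization is an assumption).
    if len(tweet.get("text", "").split()) < 5:
        return False

    # Media: require at least one static image and no other media type.
    media = tweet.get("extended_entities", {}).get("media", [])
    media_types = {m.get("type") for m in media}
    return media_types == {"photo"}
```

Comparing the set of media types with `{"photo"}` covers both halves of the first rule at once: an empty set (no media) and a set containing `video` or `animated_gif` are both rejected.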

At the end of the data collection process, our dataset contained ~3.4M tweets, corresponding to ~4M images. Each tweet (text and associated images) was labeled according to the sentiment polarity of its text (negative, neutral, or positive) as predicted by our classifier, yielding a labeled set of tweets and images divided into 3 categories. We selected the tweets with the most confident textual sentiment predictions to build our Twitter for Sentiment Analysis (T4SA) dataset. We then removed corrupted and near-duplicate images and selected a balanced subset of images, named B-T4SA, which we used to train our visual classifiers.
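
The selection and balancing steps described above might look roughly like the following. This is a sketch under stated assumptions, not the procedure from the paper: the confidence `threshold` is a free parameter (no value is given here), each prediction is assumed to be a dict of per-class probabilities, and balancing is done by randomly downsampling every class to the size of the smallest one. The function names `select_confident` and `balance` are hypothetical.

```python
import random
from collections import defaultdict

LABELS = ("negative", "neutral", "positive")


def select_confident(predictions, threshold):
    """Keep items whose top class probability reaches `threshold`.

    predictions: iterable of (item_id, {label: probability}) pairs.
    Returns a list of (item_id, predicted_label) pairs.
    """
    selected = []
    for item_id, probs in predictions:
        label = max(LABELS, key=lambda name: probs[name])
        if probs[label] >= threshold:
            selected.append((item_id, label))
    return selected


def balance(labeled, seed=0):
    """Downsample each class to the size of the smallest class."""
    by_label = defaultdict(list)
    for item_id, label in labeled:
        by_label[label].append(item_id)
    # Size of the smallest class (0 if a class is missing entirely).
    n = min(len(by_label[name]) for name in LABELS)
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return {name: rng.sample(by_label[name], n) for name in LABELS}
```

A typical use would be `balance(select_confident(predictions, threshold))`, where `predictions` comes from the textual sentiment classifier and the resulting item ids are mapped to their images.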

You can download the T4SA dataset at the following link.

References

Vadicamo L., Carrara F., Cimino A., Cresci S., Dell’Orletta F., Falchi F., Tesconi M. (2017) “Cross-Media Learning for Image Sentiment Analysis in the Wild”. In Proceedings of the 5th Workshop on Web-scale Vision and Social Media (VSM), International Conference on Computer Vision Workshop (ICCVW), 23 October 2017, Venice, Italy.

(Please cite the paper above if you make use of this corpus in your research)