Sharing GEW-Based Emotion Lexicons and Data

In the course of our research on advancing emotion recognition in tweets, we produced several emotion lexicons, collected emotional data, and annotated some of the data with emotion labels. These resources correspond to the 20 emotion categories of the Geneva Emotion Wheel (GEW), version 2.0.

The presented resources are password-protected. To request access, please send an e-mail to pearl.pu [at] epfl.ch and valentinasintsova [at] gmail.com.

All of these resources are available for research purposes. Please read the terms of use below.

Terms of use for the provided linguistic emotional resources

  • These resources can be used for research purposes only.
  • Do not redistribute these resources further. Instead, please refer interested parties to this web page: http://hci.epfl.ch/sharing-emotion-lexicons-and-data
  • Please cite our papers and reports in your publications.

Provided resources

Emotion Lexicons

      • OlympLex

        This emotion lexicon was designed to recognize emotions in the domain of sports events. More specifically, it was constructed via human computation from Twitter data collected during the gymnastics competitions of the 2012 Olympic Games. The tweets used contained the hashtag #gymnastics. The labeled tweets are provided in the section Sports-Related Emotion Corpus (SREC) below.

        More details on the construction process of the lexicon can be found in:
        Valentina Sintsova, Claudiu Musat, and Pearl Pu. Fine-Grained Emotion Recognition in Olympic Tweets Based on Human Computation. In Proceedings of the NAACL/HLT Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA), ACL, 2013.

        We distribute two versions of this lexicon:
        – Version 1.0, the original one, created as explained in the WASSA paper;
        – Version 1.1, from which we removed ~100 terms that are not indicative of specific emotions and are often purely factual. ‘Event’, ‘women’, and ‘today’ are examples of removed terms. The removal was based on a manual review of the most frequent terms in our data (i.e. not all terms in the lexicon were checked).

        In both versions, the lexicon terms are n-grams of up to 5 consecutive words. Each term is assigned an emotion distribution over the GEW 2.0 category set.
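As an illustration of how a lexicon of this shape could be applied, the following minimal Python sketch sums the emotion distributions of every lexicon term found in a text. The in-memory layout, the toy entries and their weights, the whitespace tokenizer, and the summing rule are all assumptions made for the example; they are not the distributed file format or the classifier described in the WASSA paper.

```python
from collections import Counter

MAX_N = 5  # lexicon terms are n-grams of up to 5 consecutive words

# Hypothetical toy entries: the real terms, labels, and weights differ.
TOY_LEXICON = {
    ("so", "proud"): {"Pride": 0.8, "Happiness": 0.2},
    ("amazing",): {"Happiness": 1.0},
}


def score_text(text, lexicon, max_n=MAX_N):
    """Sum the emotion distributions of all lexicon n-grams found in text."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    totals = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            distribution = lexicon.get(tuple(tokens[i:i + n]))
            if distribution:
                totals.update(distribution)  # adds the weights per emotion
    return dict(totals)


scores = score_text("So proud of this amazing routine", TOY_LEXICON)
```

Here both the bigram "so proud" and the unigram "amazing" match, so their weights are accumulated per emotion category.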

        Download OlympLex

      • GALC-R for GEW-2.0

        This lexicon is derived from the Geneva Affect Label Coder (GALC), a lexicon of explicit emotional stems provided with the emotion categories of the Geneva Emotion Wheel (GEW). The original GALC lexicon contains stemmed terms that allow any continuation, such as happ* for Happiness. We produced a revised lexicon, GALC-R, with manually validated, instantiated words for the emotion categories of GEW version 2.0 that we use. First, we extracted the stems corresponding to those emotion categories. Second, we instantiated these stems on random tweets, i.e. we found the exact words that matched the stems, along with their frequencies. Finally, we manually reviewed the most frequent words to validate their correspondence to the associated emotion. This resulted in 1026 terms, 52.9 on average per emotion category.
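The stem instantiation step can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the regular-expression tokenizer, and the sample tweets are hypothetical, and a stem is treated as a plain prefix once its trailing `*` is removed.

```python
import re
from collections import Counter


def instantiate_stems(stems, tweets):
    """Map each stem (e.g. 'happ*') to a Counter of the words it matches."""
    prefixes = {stem: stem.rstrip("*").lower() for stem in stems}
    found = {stem: Counter() for stem in stems}
    for tweet in tweets:
        # Simple tokenizer (assumption): runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            for stem, prefix in prefixes.items():
                if word.startswith(prefix):
                    found[stem][word] += 1
    return found


sample_tweets = [
    "So happy today",
    "Happiness is a happy dog",
    "what happened",
]
matches = instantiate_stems(["happ*"], sample_tweets)
candidates = matches["happ*"].most_common()  # most frequent words first
```

In this toy run the stem happ* also matches "happened", which carries no emotional meaning; this is the kind of word that a manual review of the most frequent matches is meant to filter out.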

        Download GALC-R

      • PMI-Hash

        We generated this lexicon by extracting the PMI-scores of the association of terms to each emotion over the pseudo-labeled dataset of tweets with emotional hashtags. This PMI-based method of training emotion lexicons was described in:
        Saif M. Mohammad. #Emotional Tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics and the Sixth International Workshop on Semantic Evaluation, 246–55, 2012.

        We collected tweets with explicit emotional hashtags corresponding to the emotion categories we use. Those tweets are considered to be pseudo-labeled. From them, we randomly selected 500,000 tweets that were neither retweets nor duplicates, contained at least 3 words, and had only one of the considered hashtags. Those tweets were used for learning the PMI-Hash lexicon, where we assigned per-emotion PMI-based Strength-of-Association scores to each unigram and bigram appearing at least 5 times.
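A minimal sketch of how per-emotion scores of this kind could be computed is given below. It assumes the Strength-of-Association form SoA(t, e) = PMI(t, e) - PMI(t, not e), which reduces to log[freq(t, e) * freq(not e) / (freq(t, not e) * freq(e))]. Counting frequencies at the tweet level, the add-one smoothing, the whitespace tokenizer, and all function and parameter names are assumptions of this example, not a description of the released lexicon.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Yield all unigrams and bigrams of a token list as tuples."""
    for n in (1, 2):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def soa_scores(labeled_tweets, min_count=5, smoothing=1.0):
    """Return {(term, emotion): score} from (text, emotion) pairs."""
    term_emotion = Counter()   # tweets with emotion e that contain the term
    term_total = Counter()     # tweets that contain the term
    emotion_total = Counter()  # tweets labeled with emotion e
    for text, emotion in labeled_tweets:
        emotion_total[emotion] += 1
        for term in set(ngrams(text.lower().split())):
            term_emotion[term, emotion] += 1
            term_total[term] += 1
    n_tweets = sum(emotion_total.values())
    scores = {}
    for term, total in term_total.items():
        if total < min_count:  # keep terms seen at least min_count times
            continue
        for emotion, n_e in emotion_total.items():
            n_not_e = n_tweets - n_e
            if n_not_e == 0:   # a single emotion gives no contrast set
                continue
            in_e = term_emotion[term, emotion] + smoothing
            out_e = total - term_emotion[term, emotion] + smoothing
            scores[term, emotion] = math.log((in_e * n_not_e) / (out_e * n_e))
    return scores


toy_data = (
    [("great win", "Happiness")] * 5
    + [("bad loss", "Sadness")] * 5
    + [("rare", "Sadness")]
)
toy_scores = soa_scores(toy_data)
```

A positive score means the term appears more often with that emotion than with the others; terms below the frequency threshold, such as "rare" in the toy data, receive no score.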

        Download PMI-Hash

      • Dystemo-produced emotion lexicons for sports events

        We make available six emotion lexicons produced using the Dystemo framework and designed to recognize emotions within the domain of sports events. The construction process involved pseudo-labeling unlabeled in-domain data (Olympic tweets) with emotions by applying initial emotion lexicons. The six shared lexicons were trained with the Balanced Weighted Voting and PMI-based methods, each starting from one of three initial lexicons (GALC-R, OlympLex-1.1, or PMI-Hash). They were trained with the parameters optimized for each case.

        More details on the construction of those lexicons can be found in:
        Valentina Sintsova and Pearl Pu. Dystemo: Distant Supervision Method for Multi-Category Emotion Recognition in Tweets. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):Article No.13, 2016.

        Download Dystemo-produced lexicons

Annotated Emotional Data

  • Sports-Related Emotion Corpus (SREC)

    This dataset contains a set of sports-related tweets manually annotated with emotion categories. The annotation was performed by workers from the Amazon Mechanical Turk platform. This dataset was the basis for the OlympLex emotion lexicon.

    More details on the collection and annotation process can be found in:
    Valentina Sintsova, Claudiu Musat, and Pearl Pu. Fine-Grained Emotion Recognition in Olympic Tweets Based on Human Computation. In Proceedings of the NAACL/HLT Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA), ACL, 2013.

    Unfortunately, under Twitter's terms of service, we cannot share this dataset directly. Instead, we share the tweets via their identifiers. The current distribution (version 1.2) provides annotations for the 1265 tweets for which we have Twitter identifiers, rather than for all 1957 tweets used in the paper.

    Download SREC data

Pseudo-Labeled Emotional Data: EMO-Hash data

  • Tweets with explicit emotional hashtags

    We collected 17.6 million tweets with explicit emotional hashtags corresponding to the GEW emotion categories, using the Twitter Streaming API between 27th February and 26th May 2014. Among them, we extracted 1,729,980 tweets that had those hashtags at the end of the text, were not repeated, were not retweets, did not contain URLs, and were assigned to only one emotion category. Using 500,000 of these pseudo-labeled tweets, we built the PMI-Hash emotion lexicon, as described above.
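The selection criteria listed above can be sketched as a simple filter. Everything specific in this example is an assumption: the hashtag-to-category mapping is a hypothetical stand-in for the real hashtag list, "at the end of the text" is interpreted as the last token being an emotional hashtag, retweets are detected by a leading "RT", and for repeated texts the first copy is kept.

```python
import re

# Hypothetical mapping: the real hashtag list per GEW category differs.
HASHTAG_TO_EMOTION = {
    "#happy": "Happiness",
    "#happiness": "Happiness",
    "#sad": "Sadness",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def pseudo_label(tweets, hashtag_to_emotion=HASHTAG_TO_EMOTION):
    """Return (text, emotion) pairs for tweets passing all filters."""
    seen = set()
    labeled = []
    for text in tweets:
        tokens = text.lower().split()
        if not tokens or tokens[0] == "rt":     # empty text or retweet
            continue
        if URL_RE.search(text):                 # contains a URL
            continue
        if tokens[-1] not in hashtag_to_emotion:  # hashtag not at the end
            continue
        emotions = {hashtag_to_emotion[t] for t in tokens if t in hashtag_to_emotion}
        if len(emotions) != 1:                  # more than one category
            continue
        key = " ".join(tokens)                  # normalized text for dedup
        if key in seen:
            continue
        seen.add(key)
        labeled.append((text, emotions.pop()))
    return labeled


raw_tweets = [
    "Best day ever #happy",
    "Best day ever #happy",
    "RT @a: so good #happy",
    "see http://x.co #happy",
    "#happy to be here",
    "mixed feelings #happy #sad",
    "Rainy monday #sad",
]
kept = pseudo_label(raw_tweets)
```

With real data, retweets would more reliably be identified from tweet metadata than from the text prefix used in this sketch.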

    Unfortunately, under Twitter's terms of service, we cannot share this dataset directly. Instead, we share the tweets via their identifiers.

    Download EMO-Hash data