Creation of a large, high-quality human dialog corpus

April 12, 2019
Duration: One Semester
Goals: Crawl multi-turn dialogs from Twitter, then clean and analyze the resulting dataset.
Assistant: Ekaterina Svikhnushina (ekaterina DOT svikhnushina AT epfl DOT ch)
Student Name: Open

This project aims to create a large (~1M samples), clean dataset of natural human dialogs that can be used to develop and evaluate open-domain dialog agents.

The student is expected to:

  • use the Twitter API to crawl multi-turn sequences of tweets
  • develop criteria for cleaning the data
  • analyze the cleaned dataset at the utterance and vocabulary levels (e.g., identify the emotional coloring of the utterances and estimate the percentage of obscene language in the vocabulary)
  • visualize the findings of the analysis
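
To make the crawling and analysis tasks above more concrete, here is a minimal Python sketch of two of the steps: rebuilding multi-turn dialogs from reply links, and estimating the share of obscene words in the vocabulary. It is an illustration under stated assumptions, not a prescribed approach. The record fields (`id`, `in_reply_to`, `text`), the function names, and the one-word blocklist are all hypothetical; the sketch works on tweets already held in memory and does not call the Twitter API.

```python
import re

def build_dialogs(tweets, min_turns=2):
    """Rebuild dialogs by walking reply links from each leaf tweet to its root.

    Each tweet is a dict with hypothetical keys: "id", "in_reply_to", "text".
    """
    by_id = {t["id"]: t for t in tweets}
    # Ids that some other tweet replies to, i.e. tweets that are not leaves.
    replied_to = {t["in_reply_to"] for t in tweets if t["in_reply_to"] is not None}
    dialogs = []
    for tweet in tweets:
        if tweet["id"] in replied_to:
            continue  # only start a walk from the last turn of a thread
        chain, seen, node = [], set(), tweet
        while node is not None and node["id"] not in seen:
            seen.add(node["id"])  # guards against malformed cyclic links
            chain.append(node["text"])
            node = by_id.get(node["in_reply_to"])  # None once the root is reached
        chain.reverse()  # oldest turn first
        if len(chain) >= min_turns:
            dialogs.append(chain)
    return dialogs

def obscene_vocab_share(dialogs, blocklist):
    """Percentage of distinct word types that appear in the blocklist."""
    vocab = set()
    for dialog in dialogs:
        for utterance in dialog:
            vocab.update(re.findall(r"[a-z']+", utterance.lower()))
    if not vocab:
        return 0.0
    return 100.0 * len(vocab & set(blocklist)) / len(vocab)

# Toy data standing in for crawled tweets (invented for illustration).
sample_tweets = [
    {"id": 1, "in_reply_to": None, "text": "Anyone seen the new movie?"},
    {"id": 2, "in_reply_to": 1, "text": "Yes, it was great"},
    {"id": 3, "in_reply_to": 2, "text": "Darn, I missed it"},
    {"id": 4, "in_reply_to": None, "text": "Standalone tweet"},
]
sample_dialogs = build_dialogs(sample_tweets)
print(sample_dialogs)
print(obscene_vocab_share(sample_dialogs, {"darn"}))
```

Note that a tweet with several replies yields several dialogs sharing a common prefix; whether to keep or deduplicate such branches is one of the cleaning criteria the student would need to decide.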
Related Skills: Background or interest in data mining, machine learning, and natural language processing
Suitable for: Master's students. Interested students should contact Ekaterina Svikhnushina (ekaterina DOT svikhnushina AT epfl DOT ch) and Pearl Pu (pearl DOT pu AT epfl DOT ch) and include a copy of their CV.