Dialog Segmentation in Movie and TV Subtitles

April 12, 2019
Duration: One Semester
Goals: Devise an algorithm that can automatically segment dialogs in subtitles.
Assistant: Yubo Xie (yubo DOT xie AT epfl DOT ch)
Student Name: Open
Keywords: human dialogs, text segmentation, natural language processing

One of the main challenges in open-domain dialog modeling is the scarcity of training data, especially for multi-turn settings. Movie and TV subtitles are a natural source for building conversation corpora, and the largest such corpus currently available is the OpenSubtitles dataset. However, subtitle files usually lack clear scene markers, which makes it difficult to extract the self-contained dialogs needed to train multi-turn dialog models.

The student is expected to:

  • preprocess and analyze the OpenSubtitles dataset
  • devise an automatic dialog segmentation algorithm (rule-based or data-driven)
  • evaluate the segmentation accuracy
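
As a starting point, the tasks above can be illustrated with a very simple rule-based baseline: start a new dialog whenever the silence between two consecutive subtitle lines exceeds a threshold, then score the predicted boundaries against hand-annotated ones. This is only a sketch under stated assumptions, not the method the project is expected to arrive at. The Line class, the function names, and the 5-second threshold are all hypothetical choices made for illustration, and the subtitle lines are assumed to be already parsed into start time, end time, and text.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Line:
    """One subtitle line (hypothetical structure, times in seconds)."""
    start: float
    end: float
    text: str


def segment_by_gap(lines: List[Line], max_gap: float = 5.0) -> List[List[Line]]:
    """Split lines into dialogs wherever the pause exceeds max_gap seconds.

    The 5-second default is an arbitrary assumption, not a tuned value.
    """
    segments: List[List[Line]] = []
    current: List[Line] = []
    for line in lines:
        # A long pause since the previous line closes the current dialog.
        if current and line.start - current[-1].end > max_gap:
            segments.append(current)
            current = []
        current.append(line)
    if current:
        segments.append(current)
    return segments


def boundaries(segments: List[List[Line]]) -> Set[int]:
    """Return the index of the first line of every dialog except the first."""
    result: Set[int] = set()
    index = 0
    for segment in segments[:-1]:
        index += len(segment)
        result.add(index)
    return result


def boundary_prf(pred: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over boundary positions."""
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with made-up subtitle lines and a made-up gold annotation.
lines = [
    Line(0.0, 2.0, "Hi."),
    Line(2.5, 4.0, "Hello."),
    Line(30.0, 32.0, "Where were you?"),
    Line(32.5, 34.0, "Out."),
]
segments = segment_by_gap(lines)
predicted = boundaries(segments)
scores = boundary_prf(predicted, gold={2})
```

Exact-match boundary scoring is strict, since a boundary that is off by one line counts as a full miss. A data-driven approach or a more tolerant segmentation metric may be more appropriate for the actual evaluation.
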
Related Skills: Basic knowledge of natural language processing
Suitable for: Undergraduate or Master's students. Interested students should contact Yubo Xie (yubo DOT xie AT epfl DOT ch) and Pearl Pu (pearl DOT pu AT epfl DOT ch) and include a copy of their CV.