|Goals:||Devise an algorithm that can automatically segment dialogs in subtitles.|
|Assistant:||Yubo Xie (yubo DOT xie AT epfl DOT ch)|
|Keywords:||human dialogs, text segmentation, natural language processing|
One of the main challenges of open-domain dialog modeling is the scarcity of training data, especially for multi-turn settings. Movie and TV subtitles are a natural source for building conversation corpora, and the largest such corpus at present is the OpenSubtitles dataset. However, subtitle files usually lack clear scene markers, which makes it difficult to extract the self-contained dialogs needed to train multi-turn dialog models.
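To make the segmentation problem concrete, here is a minimal baseline sketch, not the method this project is expected to produce. It assumes that each subtitle cue carries start and end timestamps in seconds and that a long silence between consecutive cues hints at a dialog boundary. The names `Cue`, `segment_by_gap`, and `max_gap`, as well as the 5-second threshold, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """One subtitle cue (hypothetical structure): timestamps in seconds plus text."""
    start: float
    end: float
    text: str


def segment_by_gap(cues: List[Cue], max_gap: float = 5.0) -> List[List[Cue]]:
    """Split time-ordered cues into candidate dialogs.

    A new segment starts whenever the silence between the end of the
    previous cue and the start of the next one exceeds `max_gap` seconds.
    The threshold is an assumed value, not one taken from the project.
    """
    segments: List[List[Cue]] = []
    current: List[Cue] = []
    for cue in cues:
        if current and cue.start - current[-1].end > max_gap:
            segments.append(current)
            current = []
        current.append(cue)
    if current:
        segments.append(current)
    return segments


# Toy example: two short exchanges separated by a long pause.
cues = [
    Cue(0.0, 2.0, "Hi, how are you?"),
    Cue(2.5, 4.0, "Fine, thanks."),
    Cue(30.0, 32.0, "Where is the car?"),
    Cue(32.5, 34.0, "In the garage."),
]
dialogs = segment_by_gap(cues)
```

A timing-only heuristic like this would likely miss scene changes that happen without a pause and split dialogs that contain long silences, which is one reason a method that also uses the text itself may be needed.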
The student is expected to:
|Related Skills:||Basic knowledge of natural language processing|
|Suitable for:||Undergraduate/Master student. Interested students should contact Yubo Xie (yubo DOT xie AT epfl DOT ch) and Pearl Pu (pearl DOT pu AT epfl DOT ch) and include a copy of their CV.|