Datasets on the cluster

Moderator: Seminar: Deep Learning for NLP and Speech

Benjamin Milde
Posts: 3
Joined: 5 Oct 2015 12:26

Post by Benjamin Milde

All data files have now been uploaded. Let me know if something is missing.

Text datasets

I didn't decompress the files for the Reddit and Wikipedia datasets, since they would be too big. You can work with the bz2 files directly on the cluster, as if they were text files. This is probably faster too, since you read the files over the network.

The following prints every line of the wiki dump directly from the bz2 file (abort with CTRL+C):

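Code:

import bz2
# 'rt' opens the compressed file in text mode and decompresses it on the fly
with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rt', encoding='utf-8') as infile:
    for line in infile:
        # each line already ends with a newline, so don't add another one
        print(line, end='')
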
For the print function to work correctly with UTF-8 and Python 3 on the cluster, you need to set the LANG and PYTHONIOENCODING environment variables:

export LANG=en_US.UTF-8
export PYTHONIOENCODING=utf-8

You can also put the two export commands in your ~/.bashrc, before the module command.
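If you only want to look at the beginning of a dump rather than stream all of it, something like the following might help. It is just a sketch: head_lines is a made-up helper name, not something installed on the cluster, and it assumes the file is UTF-8 text compressed with bz2.

```python
import bz2
from itertools import islice


def head_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a bz2-compressed text file (hypothetical helper)."""
    # bz2.open in 'rt' mode decompresses lazily, so only the start
    # of the file is read, not the whole dump.
    with bz2.open(path, "rt", encoding=encoding) as infile:
        # islice stops after n lines, so there is no need for CTRL+C.
        return [line.rstrip("\n") for line in islice(infile, n)]


# Example call (file name taken from the post above):
#     for line in head_lines('enwiki-latest-pages-articles.xml.bz2', 10):
#         print(line)
```

Since the file is read lazily, this should return quickly even for the big dumps.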

Audio datasets

I've converted ALC into 16 kHz audio files for you. For this corpus you'll also find the FBank feature files (standard parameters). You might need to average over several vectors, since the temporal resolution (100 vectors per second) might be too fine for your RNN.
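Averaging over several vectors could look roughly like the sketch below. It assumes you have already loaded the FBank features of one utterance into a list of equal-length vectors (how to load them depends on the feature file format, which is not covered here); average_frames and the window size are made-up names for illustration.

```python
def average_frames(frames, window):
    """Average each run of `window` consecutive feature vectors into one vector."""
    if window < 1:
        raise ValueError("window must be >= 1")
    averaged = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        # zip(*group) walks over the feature dimensions; the last group
        # may contain fewer than `window` vectors.
        averaged.append([sum(dim) / len(group) for dim in zip(*group)])
    return averaged


# With 100 vectors per second, window=10 leaves 10 vectors per second.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_frames(frames, 2))  # [[2.0, 3.0], [5.0, 6.0]]
```

With numpy arrays the same thing can be done faster, but the idea stays the same: a larger window gives your RNN shorter sequences at the cost of temporal detail.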

For the German speech data (speaker identification), everything is already in 16 kHz. I can also run the FBank feature script for you, so that the feature files end up in the same directory. Let me know if you need that.
