For the Reddit and Wikipedia datasets I didn't decompress the files (they would be too big). You can work with the bz2 files directly on the cluster as if they were plain text files (this is probably faster anyway, since you read the files over the network!).
The following prints every line of the wiki dump directly from the bz2 file (abort with CTRL+C):
Code:
import bz2

# stream the compressed dump line by line, no decompression to disk needed
with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rt', encoding='utf-8') as infile:
    for line in infile:
        print(line)
You can also put the two export commands in your ~/.bashrc before the module command.
I've converted ALC to 16kHz audio files for you; for this corpus you will also find the FBank feature files (standard parameters). You might need to average over several consecutive vectors, since the temporal resolution (100 vectors per second) may be too fine for your RNN.
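One way to do that averaging, as a rough sketch: assuming you've loaded the FBank features into a NumPy array of shape (frames, dimensions), you can average non-overlapping groups of n frames (the function name and the group size n=4 are just examples, not part of the provided scripts):

Code:
import numpy as np

def average_frames(feats, n=4):
    # feats: (T, D) array of FBank vectors, 100 per second;
    # with n=4 the result has roughly 25 vectors per second.
    # Trailing frames that don't fill a complete group are dropped.
    T = (feats.shape[0] // n) * n
    return feats[:T].reshape(-1, n, feats.shape[1]).mean(axis=1)

# example: 1 second of 40-dimensional FBank features
feats = np.random.rand(100, 40)
print(average_frames(feats).shape)  # (25, 40)

Adjust n to whatever temporal resolution your RNN can handle.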
For the German speech data (speaker identification), everything is already at 16kHz. I can also run the FBank feature script for you so that the feature files end up in the same directory; let me know if you need that.