Dear all,

In EX2 we should split the data (about 75%) into training and test data. Should we first separate the data by class and then split each class into training and test data (which makes sense in my opinion), or should we split the overall data?

I first separate the classes and then split the data, and I hope that this is correct...

So you would always have the same proportional class sizes in your training and test data sets? Maybe I misunderstood, but I thought that was one of the things we were trying to prevent by randomization? Anyway, I doubt that it makes a huge difference. :)

To the original question:
You should randomize first, then split, even if that leads to slightly different proportions of the classes.
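In case it helps, here is a minimal sketch of "randomize first, then split". It assumes the 75% is the training share and that each data point is a (value, label) pair; the function name `shuffle_then_split` and the toy data are made up for illustration and are not part of the exercise.

```python
import random

def shuffle_then_split(points, train_fraction=0.75, seed=None):
    """Shuffle a copy of the data, then cut it into training and test parts."""
    rng = random.Random(seed)      # seed only makes the run reproducible
    shuffled = list(points)        # copy, so the original order is untouched
    rng.shuffle(shuffled)          # randomize over the whole data set
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data: 60 points of class "A", 40 of class "B".
data = [(i, "A" if i < 60 else "B") for i in range(100)]
train, test = shuffle_then_split(data, seed=0)
```

Note that the classes are not separated at any point here, so the class proportions in `train` and `test` can differ slightly from those in `data`.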

To the follow-up:
The purpose of the randomization is, more than anything else, to ensure that the order of the data points doesn't matter. The proportion of the different classes may vary a bit between different randomizations, but on average it will stay the same.
It's true, though, that it doesn't make a big difference either way.
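If you want to check the "on average it stays the same" part yourself, something like the following rough sketch should do. The 60/40 class mix, the 75% training share, and the helper name `train_share_of_class` are assumptions chosen only for the illustration.

```python
import random

def train_share_of_class(labels, target, train_fraction=0.75, seed=None):
    """Shuffle the labels, take the training part, return the share of `target` in it."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    train = shuffled[:cut]
    return train.count(target) / len(train)

# Hypothetical labels: class "A" makes up 60% of the data.
labels = ["A"] * 60 + ["B"] * 40

# Repeat the randomization with different seeds and look at the spread.
shares = [train_share_of_class(labels, "A", seed=s) for s in range(200)]
mean_share = sum(shares) / len(shares)
print(min(shares), mean_share, max(shares))
```

The individual values in `shares` wander a little around 0.6, while `mean_share` should end up very close to it.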


