Thanks for the suggestion. Personally, I like the idea. Below I want to give some advice and hints.
In general, I see a few (feasible) challenges that need to be addressed with this kind of data.
As mentioned in the introductory session, one of the best practices (cf. slides 23-24) is to keep the scope as narrow as possible, which reduces the number of possible questions. I don't see any problems with the first and third questions, but I have doubts about the second one: "What are the latest films of Matt Damon?" Remember that the answers to the questions need to be present in the text. Consider, for example, running an initial Google query to check whether the answer can be found on the particular web pages, as Google does quite a good job of extracting the data: https://www.google.de/#q=%22In+which+fi ... ipedia.org
To make question 2 answerable, we need some more reasoning, e.g. as mentioned, with structured data from DBpedia and possibly some other services.
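To give an idea of what the structured-data route could look like, here is a minimal sketch in Python that only builds a SPARQL query and the matching request URL for the public DBpedia endpoint. It assumes that films point to their actors via dbo:starring and carry a dbo:releaseDate, which is not the case for every film in DBpedia, so the result may well be incomplete. The function names and the "format" URL parameter are my own choices for illustration, not something fixed by the course.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_latest_films_query(actor_resource, limit=5):
    """Return a SPARQL query for the most recent films starring an actor.

    Assumption: films link to actors via dbo:starring and have a
    dbo:releaseDate. Films without a release date are silently dropped.
    """
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?film ?date WHERE {\n"
        f"  ?film dbo:starring dbr:{actor_resource} ;\n"
        "        dbo:releaseDate ?date .\n"
        "}\n"
        "ORDER BY DESC(?date)\n"
        f"LIMIT {limit}"
    )


def build_request_url(query):
    """Encode the query as a GET URL that asks for JSON results.

    Assumption: the endpoint accepts a 'format' parameter for the result type.
    """
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


query = build_latest_films_query("Matt_Damon")
url = build_request_url(query)
print(query)
```

The sketch deliberately stops before sending the request; the URL can be fetched with any HTTP client, and "latest" is then simply the top of the list ordered by release date.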
Another challenge I see is the homogeneity of the input data. For the Simpsons corpus from last year, for example, we could easily use the Wikia API to get relatively high-quality, semi-structured (sections, paragraphs) input data. This is certainly limited in the case of Rotten Tomatoes or IMDb, where the HTML pages need preprocessing; I guess it is more feasible with Wikipedia.
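As a rough illustration of the kind of HTML preprocessing I mean, the following sketch uses only Python's standard html.parser module to keep the text of <p> elements and drop scripts, styles and everything outside paragraphs (navigation, widgets and so on). The class and function names are hypothetical, and the sample page is made up; real pages from those sites will need site-specific rules on top of this.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements, ignoring <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []
        self._in_paragraph = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            # Start collecting text for a new paragraph.
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Join the collected pieces and normalise whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample page, for illustration only.
sample = """
<html><body>
<script>var x = 1;</script>
<div class="nav">Home | Login</div>
<p>Matt Damon stars in <a href="#">The Martian</a>.</p>
<p>   </p>
</body></html>
"""
print(extract_paragraphs(sample))
```

Text inside inline tags such as links is kept as part of the paragraph, while empty paragraphs are skipped.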
That's it from my side for now. I'd love to hear opinions from the other students.