Corpus Idea: Movie Facts

Moderator: Praktikum: QA Technologies Behind IBM Watson

Posts: 1
Joined: 22 Apr 2016 15:47

Corpus Idea: Movie Facts

Post by JohannesR

While thinking about an interesting knowledge base, an idea for a corpus came to mind: a corpus of movie facts such as plot, actors, producer, etc. Such a corpus would support user questions like:
What was the name of the film where Bruce Willis played a police officer rescuing his wife from terrorists?
What are the latest films of Matt Damon?
In which film did Bridget Moynahan and Will Smith play together?
Sources for the corpus could include Wikipedia, Rotten Tomatoes, or themoviedb. The corpus would also enable other use cases, such as information visualization ("which actors work together frequently?") or finding similar films (e.g. for recommendation).
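
To make the "which actors work together frequently" use case a bit more concrete, here is a minimal sketch that counts how often two actors share a film. The cast lists and the function name `costar_counts` are placeholders I made up for illustration; a real version would read the casts from whatever source the corpus ends up using.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy cast lists; a real corpus would fill these from the crawled sources.
casts = {
    "Film A": ["Actor 1", "Actor 2", "Actor 3"],
    "Film B": ["Actor 1", "Actor 2"],
    "Film C": ["Actor 2", "Actor 3"],
}

def costar_counts(casts):
    """Count how often each unordered pair of actors appears in the same film."""
    counts = Counter()
    for cast in casts.values():
        # set() drops duplicate credits; sorted() makes (a, b) and (b, a) one key.
        for pair in combinations(sorted(set(cast)), 2):
            counts[pair] += 1
    return counts

pair_counts = costar_counts(casts)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

The resulting pair counts could serve as edge weights in a co-starring graph for the visualization.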

What do you think?

Posts: 9
Joined: 15 Aug 2011 11:16

Re: Corpus Idea: Movie Facts

Post by biem

Sounds like a great idea -- this could combine structured data from DBpedia with unstructured data from textual sources such as the ones you mention.
Please evaluate whether it is possible to crawl enough of the data to create a corpus; Rotten Tomatoes might block you :)
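
One cheap first step for that evaluation is to look at a site's robots.txt before crawling. The sketch below uses Python's standard `urllib.robotparser` on an inline example file, so it runs without network access. The rules, the host `example.org`, and the user agent name `corpus-bot` are made up for illustration and say nothing about what any real site allows; a site can also rate-limit or block a crawler regardless of its robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(robots_txt, user_agent, urls):
    """Return the subset of urls that the given robots.txt permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

candidates = [
    "https://example.org/m/some-film",
    "https://example.org/private/stats",
]
crawlable = allowed_urls(ROBOTS_TXT, "corpus-bot", candidates)
print(crawlable)
```

For a live check, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file instead of the inline string.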

Posts: 9
Joined: 20 Apr 2015 22:16

Re: Corpus Idea: Movie Facts

Post by remstef

Hi Johannes,

Thanks for the suggestion. Personally, I like the idea. Below I want to give some advice and hints.

In general, I see a few (solvable) challenges that need to be addressed with this kind of data.
As mentioned in the introductory session, one of the best practices (cf. slides 23-24) is to keep the scope as narrow as possible, which reduces the number of possible questions. I don't see any problems with the first and third questions, but I have doubts about the second one: "What are the latest films of Matt Damon?" Remember that the answers to the questions need to be present in the text. Consider, for example, running an initial Google query to check whether the answer can be found on the particular web pages, as Google does quite a good job of extracting the data :) : ...

To make question 2 answerable we need some additional reasoning, e.g. using structured data from DBpedia, as mentioned, and possibly some other services.
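
To illustrate what that extra reasoning might look like: "latest" is not stated in any single passage, but it can be derived once each film has a release year attached. This is only a sketch under that assumption. The `FilmRecord` type, the `latest_films` helper, and the placeholder records are hypothetical; in practice the years would have to come from a structured source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FilmRecord:
    title: str
    year: int          # release year, assumed to come from a structured source
    cast: tuple        # actor names credited in the film

def latest_films(records, actor, n=3):
    """Return up to n films featuring the actor, newest release year first."""
    starring = [r for r in records if actor in r.cast]
    return sorted(starring, key=lambda r: r.year, reverse=True)[:n]

# Placeholder records for illustration; not real filmography data.
records = [
    FilmRecord("Film A", 2009, ("Actor 1", "Actor 2")),
    FilmRecord("Film B", 2015, ("Actor 1",)),
    FilmRecord("Film C", 2012, ("Actor 2",)),
    FilmRecord("Film D", 2013, ("Actor 1", "Actor 3")),
]

latest = [r.title for r in latest_films(records, "Actor 1", n=2)]
print(latest)
```

The point is that the answer is computed by filtering and sorting over structured fields rather than extracted from text.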

Another challenge I see is the homogeneity of the input data. For the Simpsons corpus from last year, for example, we could easily use the Wikia API to get relatively high-quality, semi-structured (sections, paragraphs) input data. That is certainly more limited in the case of Rotten Tomatoes or IMDb, where the HTML pages need preprocessing; it is more feasible with Wikipedia, I guess.
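
As a rough idea of the HTML preprocessing involved, the following sketch pulls the text of `<p>` elements out of a page with Python's standard `html.parser` and ignores navigation and script content. The class name `ParagraphExtractor` and the sample HTML are made up for illustration; real pages will need site-specific handling (nested markup, boilerplate blocks, and so on).

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the whitespace-normalized text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None      # list of text chunks while inside a <p>
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p" and self._buffer is None:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        # Keep text only when inside a paragraph and outside script/style.
        if self._buffer is not None and self._skip_depth == 0:
            self._buffer.append(data)

def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

sample = (
    "<html><body><nav>Home</nav>"
    "<p>A <b>police officer</b> rescues his wife.</p>"
    "<script>var x = 1;</script>"
    "<p>  Second   paragraph. </p></body></html>"
)
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

The extracted paragraphs would then play the role that the sections and paragraphs from the Wikia API played for last year's corpus.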

That's it from my side so far. I'd love to hear opinions from the other students.