Marvel CU Corpus extracted.

Moderator: Praktikum: QA Technologies Behind IBM Watson

Erich
Mausschubser
Posts: 57
Joined: 17 Oct 2010 13:29


Post by Erich »

Hey guys,

We used the Python script (with some adjustments and modifications) and retrieved 8453 HTML files from the Marvel CU Wikia.
Three of the 8453 files are in the attachment. Please take a look and tell us if you have any objections to the format of the information section.
Attachments
articles.zip
(5.98 KiB) Downloaded 33 times

SchottCh
Mausschubser
Posts: 74
Joined: 4 Oct 2010 16:39

Re: Marvel CU Corpus extracted.

Post by SchottCh »

Hi Erich,
Could you please also crawl the information from the infoboxes on the right-hand side and include it in the HTML documents? (See the attached screenshot.)
I think this information is important for us, the information network group, and possibly for the evaluation in particular.

Thanks and Regards,
Christoph
Attachments
Bildschirmfoto 2016-05-17 um 20.38.17.png (111.82 KiB) Viewed 901 times

remstef
Moderator
Posts: 9
Joined: 20 Apr 2015 22:16

Re: Marvel CU Corpus extracted.

Post by remstef »

Hi Erich,

I agree with Christoph: the data should probably include the information from the infoboxes. As far as I know, these have to be handled differently.

best,
Steffen

whileTrue
Neuling
Posts: 8
Joined: 10 May 2012 20:13

Re: Marvel CU Corpus extracted.

Post by whileTrue »

Hi,
I'm Wei from the same group as Erich. Since the API apparently doesn't extract the tables/infoboxes, I had to do it as a post-processing step: I extracted all the HTML files first and then added the tables/infoboxes afterwards. The second step takes about 30 minutes. Maybe there's a better way to do it that I'm not aware of.
Anyway, the updated version is in the attachment.
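
For anyone who wants to try something similar, here is a minimal sketch of such a post-processing step. It is not the script used for the attached files. It assumes that the infobox is the first element whose class attribute contains "infobox" (an assumption about the Wikia markup, not verified against the real pages) and that the markup inside it is well-formed. The names InfoboxExtractor, extract_infobox and append_infobox, as well as the sample page, are made up for illustration.

```python
import html
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class InfoboxExtractor(HTMLParser):
    """Collects the text of the first element whose class contains 'infobox'.

    Assumption: every non-void tag inside the infobox is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the infobox, 0 = outside
        self.done = False   # True once the first infobox has been closed
        self.chunks = []    # text fragments found inside the infobox

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif "infobox" in (dict(attrs).get("class") or ""):
            self.depth = 1  # entering the infobox element

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # left the infobox, ignore the rest of the page

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_infobox(page_html):
    """Return the text fragments of the first infobox, or [] if none."""
    parser = InfoboxExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.chunks


def append_infobox(article_html, fields):
    """Return article_html with the fields appended as a list before </body>."""
    if not fields:
        return article_html
    items = "".join(f"<li>{html.escape(field)}</li>" for field in fields)
    block = f'<ul class="infobox">{items}</ul>'
    head, sep, tail = article_html.rpartition("</body>")
    if sep:
        return head + block + sep + tail
    return article_html + block  # no </body> found, append at the end


# Made-up sample page, only to show the two steps together.
page = ('<html><body><aside class="portable-infobox">'
        '<h2>Tony Stark</h2><div><h3>Alias</h3><div>Iron Man</div></div>'
        '</aside><p>Article text.</p></body></html>')
article = "<html><body><p>Article text.</p></body></html>"

fields = extract_infobox(page)
merged = append_infobox(article, fields)
print(fields)
print(merged)
```

The sketch only keeps the text fragments in document order. Pairing labels with values would need knowledge of the actual infobox structure, which I have left out here.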
Attachments
articles_simple.rar
(2.61 KiB) Downloaded 23 times
