Marvel CU Corpus extracted.

Posted: 14 May 2016 22:04
by Erich
Hey guys,

we used the Python script (with some adjustments and modifications) and retrieved 8453 HTML files from the Marvel CU Wikia.
Three of the 8453 files are in the attachment. Please take a look and tell us if you have any objections regarding the format of the information section.

Re: Marvel CU Corpus extracted.

Posted: 17 May 2016 20:45
by SchottCh
Hi Erich,
could you please also crawl the information from the infoboxes on the right-hand side into the HTML documents? (See attachment.)
I think this information is important for us, the information network group, especially for the evaluation later on.

Thanks and Regards,

Re: Marvel CU Corpus extracted.

Posted: 17 May 2016 23:08
by remstef
Hi Erich,

I agree with Christoph, the data should probably include the information from the infoboxes. As far as I know, these have to be handled differently.


Re: Marvel CU Corpus extracted.

Posted: 21 May 2016 21:20
by whileTrue
I'm Wei from the same group as Erich. Since the API apparently doesn't extract the tables/infoboxes, I had to do it as a post-processing step: I first extracted all the HTML files and then added the tables/infoboxes afterwards. The second step takes about 30 minutes. Maybe there's a better way to do it that I'm not aware of.
Anyway, the updated version is in the attachment.
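
For anyone who wants to reproduce the post-processing step, here is a minimal sketch of how it could look. It is not the script we actually ran. It assumes the infobox is a `<table>` whose class list contains `infobox` (the real Wikia markup may differ), and the names `InfoboxExtractor`, `extract_infobox` and `append_infobox` are made up for illustration. It only uses the Python standard library.

```python
from html import escape
from html.parser import HTMLParser


class InfoboxExtractor(HTMLParser):
    """Collects the cell texts of the first table whose class contains 'infobox'.

    Assumption: the infobox is a <table class="... infobox ...">; adjust the
    check in handle_starttag if the wiki uses different markup.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # table nesting depth, counted from the infobox table
        self.rows = []      # finished rows, each a list of cell strings
        self.cells = None   # cells of the row currently being read
        self.text = None    # text chunks of the cell currently being read
        self.done = False   # set once the first infobox has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth > 0 or "infobox" in classes:
                self.depth += 1
        elif self.depth == 1:
            if tag == "tr":
                self.cells = []
            elif tag in ("th", "td") and self.cells is not None:
                self.text = []

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif self.depth == 1:
            if tag in ("th", "td") and self.text is not None:
                # Join the chunks and collapse runs of whitespace.
                self.cells.append(" ".join("".join(self.text).split()))
                self.text = None
            elif tag == "tr" and self.cells is not None:
                if self.cells:
                    self.rows.append(self.cells)
                self.cells = None

    def handle_data(self, data):
        if self.text is not None and not self.done:
            self.text.append(data)


def extract_infobox(page_html):
    """Return the infobox rows of a full wiki page as lists of cell strings."""
    parser = InfoboxExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.rows


def append_infobox(document_html, rows):
    """Insert the rows as a plain table before </body>, or at the end if absent."""
    if not rows:
        return document_html
    lines = ['<table class="infobox">']
    for cells in rows:
        tds = "".join("<td>%s</td>" % escape(cell) for cell in cells)
        lines.append("<tr>" + tds + "</tr>")
    lines.append("</table>")
    block = "\n".join(lines) + "\n"
    idx = document_html.lower().rfind("</body>")
    if idx == -1:
        return document_html + "\n" + block
    return document_html[:idx] + block + document_html[idx:]
```

The idea is to call `extract_infobox` on the full page HTML and then `append_infobox` on the already extracted document, once per file. Pages without a matching table are returned unchanged.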