Moderator: Praktikum: QA Technologies Behind IBM Watson
We used the Python script (with some adjustments and modifications) and retrieved 8453 HTML files from the Marvel CU Wikia.
Three of the 8453 files are in the attachment. Please take a look and tell us if you have any objections regarding the format of the information section.
- (5.98 KiB) Downloaded 30 times
Could you please also crawl the information from the infoboxes on the right side of each page into the HTML documents? (See the attachment.)
I think this information is important for us, the information network group, and possibly for the evaluation in particular.
Thanks and Regards,
- Bildschirmfoto 2016-05-17 um 20.38.17.png (111.82 KiB) Viewed 879 times
I'm Wei from the same group as Erich. Since the API apparently doesn't extract the table/infobox, I had to do it as a post-processing step: I first extracted all the HTML files and then added the tables/infoboxes afterwards. The second step takes about 30 minutes. Maybe there's a better way to do it that I'm not aware of.
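For illustration, here is a minimal sketch of what such a post-processing step could look like. It is not the actual script used here. It assumes the infobox appears in the raw page as a `<table>` whose class list contains `infobox` (the real Wikia markup may differ), and the names `InfoboxParser` and `add_infobox` are hypothetical. Only the Python standard library is used.

```python
from html import escape
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collects (label, value) rows from the first table with class "infobox".

    Assumption: each row holds its label in <th> and its value in <td>.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # table nesting depth inside the infobox (0 = outside)
        self.done = False   # True once the first infobox has been closed
        self.cell = None    # "th" or "td" while inside a cell
        self.row = {"th": "", "td": ""}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth or "infobox" in classes:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.row = {"th": "", "td": ""}
        elif self.depth and tag in ("th", "td"):
            self.cell = tag

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif tag in ("th", "td"):
            self.cell = None
        elif tag == "tr":
            label = self.row["th"].strip()
            value = self.row["td"].strip()
            if label and value:
                self.rows.append((label, value))
            self.row = {"th": "", "td": ""}

    def handle_data(self, data):
        # Text inside nested tags such as <a> still belongs to the open cell.
        if self.depth and self.cell:
            self.row[self.cell] += data


def add_infobox(doc_html, page_html):
    """Appends the infobox rows found in page_html to doc_html before </body>."""
    parser = InfoboxParser()
    parser.feed(page_html)
    parser.close()
    if not parser.rows:
        return doc_html  # nothing to add, leave the document unchanged
    body = "".join(
        f"<tr><th>{escape(label)}</th><td>{escape(value)}</td></tr>"
        for label, value in parser.rows
    )
    table = f'<table class="infobox">{body}</table>'
    head, sep, tail = doc_html.rpartition("</body>")
    if not sep:
        return doc_html + table  # no </body> tag: append at the end
    return head + table + sep + tail
```

Calling `add_infobox(doc_html, page_html)` once per file, with the already extracted document and the raw page it came from, would keep the two steps separate as described above. Rows from tables nested inside the infobox are flattened into the same list, which may or may not be what is wanted.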
Anyway, the updated version is in the attachment.
- (2.61 KiB) Downloaded 23 times