WARCs = data is pretty much set and done (and stuff in there is basically versioned because each URL should have a timestamp). Raws = challenging. Here's what I mean by raws:
(1) Path and data exactly match the HTTP source URL. Example: /ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a/archive.org/download/MyLittlePonyFull/MyLittlePonyFull_reviews.xml = a version of https://archive.org/download/MyLittlePonyFull/MyLittlePonyFull_reviews.xml - but from what time? The timestamp is either specified elsewhere or not recorded at all.
(2) Path has small replacements + other differences. Example: "/ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a/mega.nz/folder-Ji5VkAwY-BYzHARWDjj-djo8e4WTsnQ/Pony Life/Official GIFs on GIPHY/Twitch Cry Sticker-cPN6RVJmo6h8ZW5Ayo.gif" - related to https://mega.nz/folder/Ji5VkAwY#BYzHARWDjj-djo8e4WTsnQ - you can't go to a mega.nz child folder like you can in that IPFS mirror (mega.nz/folder/a#b/c = doesn't work, but with a download of that mega.nz folder you can go to a#b/c).
(3) Path is the same or similar, but it's a browser download of a webpage (one with a "*_files/" folder). Example: "/ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a/archive.is/yDlX9/PMV - Summer Sun Celebration - YouTube.html" - also, this has one IP address removed (only edit).
(4) Has extra data which is related. Example: "/ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a/drive.google.com/drive/folders/1i7nqeL8lSLElPoxwTsP2MrQVO57MxIa_/" includes file "mlp.heartshine.xyz-20230916T064019Z-001.zip".
(5) Path looks significantly different. Example: "/ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a/endchan/pone/thread3148/"
(6) Path is way different and doesn't match the source URLs at all (trying to organize raws, but can't match them up to paths or haven't tried yet). Example: "/ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a/pbooru.com/3103_pbooru.com_webpages_-_various_post_IDs_1301_to_308741_extracted/"
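
For type (1), getting from the mirror path back to the source URL is mechanical, so here's a rough sketch of it. The function name and the "https" default are my own assumptions, and it only makes sense for type (1) raws; types (2) to (6) need per-site rules that aren't worked out here.

```python
def raw_path_to_source_url(ipfs_path: str, scheme: str = "https") -> str:
    """Rebuild the source URL for a type (1) raw.

    Assumes the layout /ipfs/<CID>/<host>/<rest of URL path>.
    The scheme is a guess (hypothetical default), since the path does not record it.
    """
    parts = ipfs_path.split("/")
    # Expected split: ["", "ipfs", "<CID>", "<host>", ...]
    if len(parts) < 4 or parts[0] != "" or parts[1] != "ipfs" or not parts[3]:
        raise ValueError(f"not an /ipfs/<CID>/<host>/... path: {ipfs_path!r}")
    # Everything after the CID is host + URL path, unchanged for type (1).
    return f"{scheme}://" + "/".join(parts[3:])


example = (
    "/ipfs/QmSZBpnHGdkPTd5B7Dng8CoVNR9hp6W5av4JAnQnbdeR6a"
    "/archive.org/download/MyLittlePonyFull/MyLittlePonyFull_reviews.xml"
)
print(raw_path_to_source_url(example))
```

Note this still says nothing about when the copy was made; the timestamp has to come from somewhere else.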

Proposed folders for raws: 1 = "same", 2+4+5 = "diff", 3 = "browser", 6 and maybe also 5 = "dump".
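
The proposed mapping could be written down as a small lookup, roughly like this. The names are made up for illustration, and the flag for type (5) is just one way to express the "maybe also 5" part, which is still undecided.

```python
# Type number (from the list above) -> proposed folder name.
FOLDER_FOR_TYPE = {
    1: "same",
    2: "diff",
    3: "browser",
    4: "diff",
    5: "diff",
    6: "dump",
}


def folder_for_raw_type(raw_type: int, type5_as_dump: bool = False) -> str:
    """Return the proposed folder for a raw of the given type.

    type5_as_dump is a hypothetical switch for the open question of
    whether type (5) belongs in "dump" instead of "diff".
    """
    if raw_type == 5 and type5_as_dump:
        return "dump"
    if raw_type not in FOLDER_FOR_TYPE:
        raise ValueError(f"unknown raw type: {raw_type}")
    return FOLDER_FOR_TYPE[raw_type]
```

So folder_for_raw_type(3) gives "browser", and folder_for_raw_type(5, type5_as_dump=True) gives "dump".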