- Endchan Magrathea

Crystal Pony
4/1/2024 20:08:00 No. 10113
What_I_Like_About_You... webm
(11.42 MB, 640x480 vp9)
d-3334736 png
(19.37 KB, 256x235)
 >>/10108/
> Going with the non-repartition solution for now...
Some info for that - ponibooru_60k is this: /ipfs/QmS4EfzF2rQir4ZbE7zdTu46YyAVvY43STEwhzJN7CRzFF

 >>/10110/
> small channel
downloaded
> /z9/youtube/Twig_I_UCBt86lSt-wCJuS-y5juoYWg/What_I_Like_About_You_PMV.wmv-Twig_I-20110521-youtube-640x480-3mAKAd6v59c.webm

 >>/10111/
> pink
no pink posts here (any last50): https://endchan.org/.static/last50.html?b=pone&t=9086

> Oh, are you scraping the JS ponies from Derpi?
I used vim -- NVIM v0.6.1 -- to get all of the .gif file URLs that load from .js (I think today I was significantly faster at doing such a thing than I was in the past). I think Derpi only uses one .js file to load them, so that covers all cases (different overlay GIFs load based on character tags). All of those GIF animations are in:
. warc.gz https://endchan.org/.media/ab0aba07daca938a0c6a3909818c7307-applicationgzip.undefined
. cdx https://endchan.org/.media/735ab0b26a4ba1678231844d02f04438-applicationoctet-stream.so
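For anyone who wants to repeat the URL extraction without vim, here is a minimal Python sketch. It is not what I actually ran (that was done by hand in vim), and it assumes the GIF paths appear in the .js as ordinary quoted string literals ending in .gif. The function name and the base URL argument are just for illustration.

```python
import re
from urllib.parse import urljoin

# Matches a quoted string literal (single or double quotes) ending in .gif.
GIF_LITERAL = re.compile(r"""["']([^"'\s]+?\.gif)["']""", re.IGNORECASE)

def extract_gif_urls(js_source, base_url):
    """Return a sorted list of absolute URLs for every quoted *.gif string."""
    found = {
        urljoin(base_url, match.group(1))  # resolve relative paths against the site
        for match in GIF_LITERAL.finditer(js_source)
    }
    return sorted(found)
```

Feed it the text of the .js file plus the page URL it was loaded from, and write the result out one URL per line for wget or whatever. If the script builds paths by string concatenation, a regex like this will miss them.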

> I do believe that is a pre-existing open sourced project
That might be "Desktop Ponies".
> but it doesn't hurt with how many things I see just disappear over the years when everyone thought it would be safe.
Reminds me of LiveLeak: gone with zero warning. As for "Desktop Ponies", those files likely exist elsewhere, but the URLs to those files are unique and only exist on that website (unlike IPFS). That WARC shows the mtime of those files there and the server used plus other stuff; Derpi uses this server: cl*udflare.
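A rough sketch of how to pull the server and mtime info out of a .warc.gz with only the Python standard library. Assumptions: the records are plain WARC response records with an HTTP header block at the start of the payload, and "mtime" here means the Last-Modified header. The function names are made up for this example; a real WARC library would be more robust.

```python
import gzip
import io

def iter_warc_records(stream):
    """Yield (warc_headers, payload_bytes) from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)

def http_headers(payload):
    """Parse the HTTP status line and header fields from a response payload."""
    head = payload.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return lines[0], fields

def summarize(stream):
    """Yield (target URI, status line, Server, Last-Modified) per response."""
    for warc, payload in iter_warc_records(stream):
        if warc.get("warc-type") == "response":
            status, fields = http_headers(payload)
            yield (warc.get("warc-target-uri"), status,
                   fields.get("server"), fields.get("last-modified"))

def summarize_gz_bytes(data):
    """Same as summarize(), but for the raw bytes of a .warc.gz file."""
    return list(summarize(io.BytesIO(gzip.decompress(data))))
```

For a file on disk, `with gzip.open(path, "rb") as f: rows = list(summarize(f))` does the same thing; gzip handles the one-gzip-member-per-record layout that .warc.gz files normally use.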

>  >web.archive.org: 190 seconds between save requests = still get HTTP 429 TOO MANY REQUESTS (wget --spider). Months ago, 45 seconds between save requests was totally fine (all HTTP 200).
> The problem with scraping web archive is that they know all the tricks with web scraping and understandably their bandwidth is pretty precious as is, considering all the data that is being kept online. I can understand them not wanting scraping as multiple people running scraping operations would probably be a huge drain on their resources BUT I also think the information there is a chaotic mess and their is some instances were have to scrape it.
I meant "save requests to WBM"; I was saving those https://derpibooru.org/ponies/*.gif files to WBM. I see that was unclear, and I think you took it to mean I was downloading from WBM. I even got HTTP 429 with a 290-second wait between save-to-WBM requests; hoping this is an April Fools joke.
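For what it's worth, the wait-between-requests logic can be sketched like this in Python. I actually used wget --spider with a fixed wait, so this is a different approach: it doubles the wait every time a 429 comes back. The `save` argument is a placeholder for whatever function sends one save request and returns the HTTP status code, and the 45-second starting value is just the number that used to work for me. No promise that any wait length gets past the current limits.

```python
import time

def save_with_backoff(urls, save, wait=45.0, max_wait=3600.0,
                      max_tries=5, sleep=time.sleep):
    """Call save(url) for each URL, doubling the wait after every HTTP 429.

    `save` is caller-supplied and must return an HTTP status code.
    Returns a dict mapping each URL to the last status code seen for it.
    """
    results = {}
    for url in urls:
        for _ in range(max_tries):
            status = save(url)
            results[url] = status
            if status == 429:
                wait = min(wait * 2, max_wait)  # rate limited: slow down
            sleep(wait)  # always pause before the next request
            if status != 429:
                break  # anything other than 429 ends the retries for this URL
    return results
```

The wait never shrinks again after a 429, which is deliberately conservative. URLs that still return 429 after `max_tries` attempts stay in the result with status 429, so they can be retried some other day.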

If web.archive.org is annoyed by users downloading web files/pages from them, or annoyed that users are creating WARCs of their WARC replays, then they should offer info as to where their WARCs are located (which they don't anymore). But then, many of those .warc.gz files cannot be downloaded anyway. More on that:  >>/9774/. If archive.org were more open, then non-IA archive-focused websites could easily get just about any snapshot replaying on their own server from just about any WARC. As it is now, this is the situation with archive.org:
. ArchiveTeam-created+uploaded WARCs: open access
. WARCs uploaded to IA by normal users (at /details/): open access
. All other WARCs (everything in web.archive.org other than the above two cases): NO open access and there is no longer a way to tell where they are stored in archive.org  >>/9774/
.. this includes captures of sites which are and are not excluded from WBM

Why do they do this? Maybe they want to be the only one who can "archive the web" and provide replays. And/or they do it so that no one but them can know which file really stores whatever shit, so that only those who have access to archive.org's backend can request that a WARC file be deleted (or they could just delete it with no request process). Anyone can see which replay contains whatever at web.archive.org/web/... but no outsider can know the WARC file that it's based on. Deleting the replay (due to exclusion or whatever) + deleting the WARC (private backend) = basically totally gone.