- Endchan Magrathea

Anon
3/12/2024 04:28:00 No. 9816
Image: large.1100033403_BlindedbyTheTwilight.jpg.e0342277d807062e125626fae0cba3ab.jpg (54.83 KB, 1600x900)
Not blocked yet:
start -> 2024-03-11T10:11:57.051663083Z
now ---> 2024-03-12T03:36:41.353799992Z
unix1 -> 1710151917 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/id" and normal stat says 2024-03-11 10:11:57.553640017 +0000)
unix2 -> 1710215173 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/wpull.log" and normal stat says 2024-03-12 03:46:03.762802180 +0000)
size --> 237 MB /z9/warc/012/www.canterlot.com-2024-03-11-9b3a5a01
200x3 -> https://www.canterlot.com/gallery/image/8158-yama-san-from-the-mountains/ + https://www.canterlot.com/gallery/album/1167-ondrea + https://www.canterlot.com/gallery/image/8380-rainpng/ (all recent)
est. --> 100 GB final size (with many image files, it could be 200 GB)
ran ---> 63,256 seconds (1710215173 - 1710151917)
down --> 3.747 KB/s (237000/63256, taking 237 MB as 237,000 KB)
left --> roughly 99 GB (99,000,000 KB)
eta ---> roughly 26,421,137 seconds or 306 days (99000000/3.747 and 26421137/60/60/24)
notes -> The delay file can be changed to contain "3000" (or whatever number) while grab-site is still running, and grab-site will then use that delay instead. Doing that seems to cause no problems. grab-site option of interest = --permanent-error-status-codes STATUS_CODES = "A comma-separated list of HTTP status codes to treat as a permanent error and therefore *not* retry (default: 401,403,404,405,410)". The wpull.db file can be opened by running "sqlite3 -column -header -csv 'wpull.db'"; then view tables by running ".tables"; then view rows by running "select * from tablename;". What's the fate of this grab? "Probably" my computer will crash/reboot and I won't return to it, so I'll just get a small portion of that site, which requires a delay between requests. Or, I could keep working on it in various ways. A delay of 5000ms-10000ms will take 306 or 612 days; let's say it will take about a year. A delay of 5000ms will maybe take "only" 150 days to download all of that website. I wish that grab-site was more fault-tolerant. Apparently Common Crawl has a lot of www.canterlot.com, but it doesn't have the content.invisioncic.com outlinks or recent data.
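
The ran/down/left/eta arithmetic above can be redone with a small shell sketch. The function name `eta_days` and its argument order are made up here for illustration. The inputs are the post's own rough figures (237 MB treated as 237,000 KB done, 99,000,000 KB left), so the result is only as good as that 100 GB guess.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of grab-site.
# Args: start_unix now_unix done_kb left_kb
eta_days() {
  awk -v s="$1" -v n="$2" -v d="$3" -v l="$4" 'BEGIN {
    ran  = n - s        # seconds elapsed so far
    rate = d / ran      # average download rate in KB/s
    eta  = l / rate     # seconds left at that average rate
    printf "ran=%d s rate=%.3f KB/s eta=%.0f s (%.0f days)\n", ran, rate, eta, eta / 86400
  }'
}

# Figures from the status lines above.
eta_days 1710151917 1710215173 237000 99000000
```

The seconds figure comes out slightly different from the one in the post because the rate is not rounded to 3.747 before dividing; the day count still rounds to 306.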

Anyone rsync millions of files? It was a drag that bash deleted my paused job that was doing that:
> $ utc; rsync -a --info=progress2 /d1/path1/ /d2/path1/; utc # ~2,072,198 items
> 2024-03-10T14:47:03.267513346Z
> [...]479,775,578,488 29% 3.84MB/s 33:05:03 (xfr#1195512, to-chk=849846/2072198)^Z
> [1]+ Stopped rsync -a [...]
> 2024-03-11T23:57:51.704609306Z
> $ ./qbittorrent-4.6.0_x86_64.AppImage & disown
> [2] 69975
> bash: warning: deleting stopped job 1 with process group 2332
> $ jobs -l
> [2]+ 69975 Running ./qbittorrent-4.6.0_x86_64.AppImage &
> $ # Didn't disown it because that stopped rsync command got deleted.
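
A guess at what happened, not something the log confirms: a bare `disown` acts on bash's "current job", and with a stopped job in the table that may have been the stopped rsync rather than the AppImage that was just started. That would match the "deleting stopped job 1" warning. A minimal sketch of disowning by PID instead, so bash never has to pick the job; `start_detached` and `detached_pid` are hypothetical names, and `sleep 5` just stands in for the AppImage:

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a command in the background and disown
# that exact process by PID instead of relying on the "current job".
start_detached() {
  "$@" &
  detached_pid=$!          # PID of the job that was just started
  disown "$detached_pid"   # remove only that job from the job table
}

# Example usage; "sleep 5" stands in for ./qbittorrent-4.6.0_x86_64.AppImage
start_detached sleep 5
echo "detached PID: $detached_pid"
```

If the rsync was only dropped from the job table, it may still have existed as a stopped process, in which case `kill -CONT` on its PID might have resumed it. Either way, rerunning the same `rsync -a` command should skip files that were already copied with matching size and modification time.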

Image/Title which likely references "Blinded By The Light" ( https://iv.nboeck.de/watch?v=Rpq35wyDi7I ) from
>  https://www.canterlot.com/gallery/image/8372-blinded-by-the-twilight/
>  > https://content.invisioncic.com/r257793/monthly_2021_04/large.1100033403_BlindedbyTheTwilight.jpg.e0342277d807062e125626fae0cba3ab.jpg