large.1100033403_Blin... jpg
(54.83 KB, 1600x900)
Not blocked yet:
start -> 2024-03-11T10:11:57.051663083Z
now ---> 2024-03-12T03:36:41.353799992Z
unix1 -> 1710151917 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/id" and normal stat says 2024-03-11 10:11:57.553640017 +0000)
unix2 -> 1710215173 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/wpull.log" and normal stat says 2024-03-12 03:46:03.762802180 +0000)
size --> 237 MB /z9/warc/012/www.canterlot.com-2024-03-11-9b3a5a01
200x3 -> https://www.canterlot.com/gallery/image/8158-yama-san-from-the-mountains/ + https://www.canterlot.com/gallery/album/1167-ondrea + https://www.canterlot.com/gallery/image/8380-rainpng/ (all recent)
est. --> 100 GB final size (with many image files, it could be 200 GB)
ran ---> 63,256 seconds (1710215173 - 1710151917)
down --> 3.747 KB/s (237,000 KB / 63,256 s)
left --> roughly 99 GB (99,000,000 KB)
eta ---> roughly 26,421,137 seconds or 306 days (99000000/3.747 and 26421137/60/60/24)
notes -> The delay file can be changed to contain "3000" (or whatever number) while grab-site is still running, and it will then use that delay instead. Doing that seems to cause no problems.

grab-site option of interest = --permanent-error-status-codes STATUS_CODES = "A comma-separated list of HTTP status codes to treat as a permanent error and therefore *not* retry (default: 401,403,404,405,410)".

The wpull.db file can be opened by running "sqlite3 -column -header -csv 'wpull.db'"; then view tables by running ".tables"; then view rows by running "select * from tablename;".

What's the fate of this grab? "Probably" my computer will crash/reboot and then I won't return to it, so I'll just get a small portion of that site, which requires a delay between requests. Or, I could keep working on it in various ways. A delay of 5000ms-10000ms will take 306 or 612 days; let's say it will take about a year. A delay of 5000ms will maybe take "only" 150 days to download all of that website. I wish that grab-site was more fault-tolerant.
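For anyone who wants to redo the rate/ETA arithmetic above with their own numbers, here is a minimal Python sketch. It only uses the figures from this post; the function name estimate_eta and its parameter names are made up for illustration, and it assumes decimal units (1 MB = 1,000 KB, 1 GB = 1,000,000 KB), the same as the "99,000,000 KB" figure above.

```python
def estimate_eta(downloaded_kb, start_unix, now_unix, remaining_kb):
    """Return (elapsed seconds, KB/s, ETA seconds, ETA days).

    Hypothetical helper: assumes the average rate so far stays constant.
    """
    elapsed_s = now_unix - start_unix      # "ran": unix2 - unix1
    rate_kb_s = downloaded_kb / elapsed_s  # "down": average KB/s so far
    eta_s = remaining_kb / rate_kb_s       # "eta": time left at that rate
    return elapsed_s, rate_kb_s, eta_s, eta_s / 86400


# Figures from the post: 237 MB downloaded, roughly 99 GB left.
elapsed_s, rate_kb_s, eta_s, eta_days = estimate_eta(
    downloaded_kb=237_000,
    start_unix=1710151917,
    now_unix=1710215173,
    remaining_kb=99_000_000,
)
print(f"ran  {elapsed_s:,} s")
print(f"down {rate_kb_s:.3f} KB/s")
print(f"eta  {eta_s:,.0f} s (about {eta_days:.0f} days)")
```

Because this keeps the unrounded rate, the ETA in seconds comes out slightly different from the hand-computed figure above (which divided by the rounded 3.747), but it is still about 306 days.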
Apparently Common Crawl has a lot of www.canterlot.com, but it doesn't have content.invisioncic.com outlinks and recent data.

Has anyone rsynced millions of files? It was a drag that bash deleted my paused job that was doing that:
> $ utc; rsync -a --info=progress2 /d1/path1/ /d2/path1/; utc # ~2,072,198 items
> 2024-03-10T14:47:03.267513346Z
> [...]479,775,578,488 29% 3.84MB/s 33:05:03 (xfr#1195512, to-chk=849846/2072198)^Z
> [1]+ Stopped rsync -a [...]
> 2024-03-11T23:57:51.704609306Z
> $ ./qbittorrent-4.6.0_x86_64.AppImage & disown
> [2] 69975
> bash: warning: deleting stopped job 1 with process group 2332
> $ jobs -l
> [2]+ 69975 Running ./qbittorrent-4.6.0_x86_64.AppImage &
> $ # Didn't disown it because that stopped rsync command got deleted.

The image/title likely references "Blinded By The Light" ( https://iv.nboeck.de/watch?v=Rpq35wyDi7I ). It is from
> https://www.canterlot.com/gallery/image/8372-blinded-by-the-twilight/
> > https://content.invisioncic.com/r257793/monthly_2021_04/large.1100033403_BlindedbyTheTwilight.jpg.e0342277d807062e125626fae0cba3ab.jpg