Re-Intro Time!
I just moved here from sysad.ninja. I'm an #IT & #cybersecurity instructor/course developer.
I am a mediocre upright bass player focusing on #celticmusic, #blues and melodic #jazz. I've also done sound, run theatres, written and produced a play, etc.
My wife and I are currently beginning a small market farm, so you'll also see lots of posts about plants, trees, and the war against invasive privet.
wget http://some.thing
pgrep wget
tail -f /proc/<pid>/fd/4 # (or a higher fd number)
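If it's not obvious which descriptor number to tail, listing the process's open descriptors usually shows it. A minimal sketch, assuming Linux with /proc mounted; list_fds is a hypothetical helper name, not part of wget or any tool mentioned here:

```shell
# Hypothetical helper: list the open file descriptors of a process.
# Each entry is a symlink to whatever that descriptor has open.
list_fds() {
    ls -l "/proc/$1/fd"
}

# Example use (assumes a wget is currently running):
#   list_fds "$(pgrep -n wget)"
```

pgrep -n picks the newest matching process, which helps when several wgets are running at once.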
So how the web scrape is going...
I'm estimating completion by Wednesday.
I'm 170 base sites in. Each one points to about 30 others.
This is with a depth of 1.
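Rough math on those numbers, as a sketch: estimate_linked is a hypothetical helper, and the 30 is only my "about 30" average, so treat the result as a ballpark.

```shell
# Hypothetical helper: base sites times average links per base site.
estimate_linked() {
    echo $(( $1 * $2 ))
}

estimate_linked 170 30   # prints 5100
```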
Every 10.0s: echo number of sites archived ; ls /var/lib/topgen/vhosts/ | wc -l ; echo; echo current; ps -ef | grep wget | grep -v grep | awk '{ print $NF }'; ps -ef | grep wget | grep -v grep | awk '{ print $NF }' >> wfile ; sort... 1ecf4ac17ca0: Mon Aug 11 00:32:47 2025
number of sites archived
4815
current
lefigaro.fr
free
403G
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 0 192.18.0.254:51466 23.193.174.82:https
So what am I doing on my birthday weekend?
Herding a very large webscrape.
It got stuck on the live video at cnn.com.
Killing the wget broke the scraper process.
Had to hack out a bunch of the sanity checks in topgen-scrape.sh to get it running again.
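One idea for next time, untested against topgen-scrape.sh itself: wrap each fetch in coreutils timeout so a download that never ends (like a live video stream) gets killed on a schedule instead of by hand. A sketch only; run_with_limit is a hypothetical name, and I'm assuming the script could call wget through a wrapper like this.

```shell
# Hypothetical wrapper: run a command, kill it after N seconds.
# timeout exits with status 124 when the time limit was hit.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Example use:
#   run_with_limit 600 wget http://some.thing
```

A caller can check for exit status 124 to tell "timed out" apart from an ordinary wget failure.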