I’ve added a script, export_google_site_news, to my catchall repository on GitHub. It downloads the news stories that the News Gadget generates on a Google Sites hosted page. It takes two arguments, the address of your site and the name of the news section, and saves every news story it finds as an HTML file in your working directory.
For example, to download the news stories at https://sites.google.com/a/medinacommunityband.org/www/announcements, you would run
export_google_site_news \
    https://sites.google.com/a/medinacommunityband.org/www \
    announcements
When Free It Athens moved our website from Google Sites to Drupal, we started from scratch rather than importing our old content. I realized on Wednesday that the news posts on the site were interesting historical information, yet I’d never archived them.
First, I tried a recursive wget.
wget -r --no-parent --no-clobber \
    "https://sites.google.com/a/freeitathens.org/foo/news"
Next I found and tried to use the google-sites-export tool from Google’s Data Liberation Team, but I was never able to authenticate successfully with it. So I fell back on downloading the paginated news index directly, stepping through its offset query parameter with a short loop.
for i in $(seq 0 10 120); do
    wget "https://sites.google.com/a/freeitathens.org/foo/news?offset=$i" \
        -O "news.$i"
done
After doing that, I looked at the first downloaded page and worked out a pattern matching the relative URLs of the individual news stories, then extracted all of the URLs.
grep -E -h -o '/a/freeitathens.org/foo/news/[a-z0-9\-]+' news.* | sort -u > news_links
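Each line of news_links is a site-relative path; a hypothetical entry (the story slug here is made up) would look like

/a/freeitathens.org/foo/news/volunteer-day-recap

which is why the final download step needs a base URL to turn these paths into absolute addresses.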
Once I had the list of URLs, it was simple to have wget download them all, reading the list with -i and supplying the base URL for the relative paths with -B.
wget -i news_links -B https://sites.google.com
Since I didn’t find any other guides to doing this, I decided to flesh out what I’d done into a simple tool and write about it here.
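For anyone who just wants the shape of the solution, the whole approach fits in a few lines of shell. The sketch below is not the actual export_google_site_news script, just the steps from this post strung together; the offset limit, the temporary file names, and the way the site path is derived are assumptions for illustration.

#!/bin/sh
# Sketch only: strings together the steps described above. The real
# export_google_site_news script lives in the catchall repository; the
# offset limit and file names below are guesses.
SITE="$1"      # e.g. https://sites.google.com/a/freeitathens.org/foo
SECTION="$2"   # e.g. news

# Fetch the paginated news index, stepping the offset parameter.
for i in $(seq 0 10 120); do
    wget "$SITE/$SECTION?offset=$i" -O "index.$i"
done

# Extract the relative URLs of the individual stories.
SITE_PATH=$(echo "$SITE" | sed 's|https://sites.google.com||')
grep -E -h -o "$SITE_PATH/$SECTION/[a-z0-9-]+" index.* | sort -u > story_links

# Download every story as an html file into the working directory.
wget -i story_links -B https://sites.google.com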