Export news from Google Sites


I’ve added a script, export_google_site_news, to my catchall repository on GitHub. It downloads the section of a Google Sites-hosted website that the News Gadget generates. It takes two arguments: the address of your site and the name of the news section. It saves every news story it finds as an HTML file in your working directory.

For example, to download the news stories at https://sites.google.com/a/medinacommunityband.org/www/announcements, you would run

export_google_site_news \
https://sites.google.com/a/medinacommunityband.org/www \
announcements

The backstory

When Free It Athens moved our website from Google Sites to Drupal, we started from scratch rather than importing our old content. I realized on Wednesday that the news posts on the site were interesting historical information, yet I’d never archived them.

First, I tried a recursive wget.

wget -r --no-parent --no-clobber \
"https://sites.google.com/a/freeitathens.org/foo/news"

This failed because Google pointlessly used JavaScript rather than anchor tags to link between the news listing pages, so wget found no links to follow past the first page.

Next I found and tried the google-sites-export tool from Google’s Data Liberation Team, but I was never able to authenticate successfully with it.

At this point I was worried I’d need a tool like Selenium to run the JavaScript, but then I realized the news listing pages took a single parameter, offset, that determined how far along in the pagination they were. It wouldn’t take more than a simple counting for loop to download them all.

for i in $(seq 0 10 120); do
    wget "https://sites.google.com/a/freeitathens.org/foo/news?offset=$i" \
        -O "news.$i"
done
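
The offsets in that loop (0 to 120 in steps of 10) are specific to that site. As a rough sketch, the same idea can be written as a small function that prints the listing URLs for any page size and final offset. The name listing_urls and its arguments are made up for illustration and are not part of export_google_site_news.

```shell
# listing_urls is a hypothetical helper, not part of export_google_site_news.
# It prints one listing-page URL per offset, from 0 up to last_offset.
listing_urls() {
    base_url=$1     # e.g. https://sites.google.com/a/freeitathens.org/foo/news
    page_size=$2    # stories per listing page (10 in the loop above)
    last_offset=$3  # highest offset to request (120 in the loop above)
    for offset in $(seq 0 "$page_size" "$last_offset"); do
        printf '%s?offset=%s\n' "$base_url" "$offset"
    done
}
```

Running listing_urls "https://sites.google.com/a/freeitathens.org/foo/news" 10 120 prints the same thirteen URLs the loop above fetches, which makes it easy to check the list before handing it to wget.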

After doing that, I looked at the first one and determined a pattern that would match the relative URLs of individual news stories. I then extracted all the URLs.

grep -E -h -o '/a/freeitathens\.org/foo/news/[a-z0-9-]+' news.* |
    sort -u > news_links
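
To reuse that extraction on another site, the pattern can be built from the site path and section name. The following is only a sketch under the assumption that story slugs consist of lowercase letters, digits, and hyphens, as they did on this site. The function name extract_story_links and its arguments are hypothetical.

```shell
# extract_story_links is a hypothetical helper, not part of export_google_site_news.
# It prints the unique story paths found in the given listing files.
extract_story_links() {
    site_path=$1   # e.g. /a/freeitathens.org/foo
    section=$2     # e.g. news
    shift 2        # the remaining arguments are the downloaded listing files
    # Escape dots so that "." in the domain matches only a literal dot.
    prefix=$(printf '%s/%s' "$site_path" "$section" | sed 's/\./\\./g')
    # Assumption: story slugs contain only lowercase letters, digits, and hyphens.
    grep -E -h -o "$prefix/[a-z0-9-]+" "$@" | sort -u
}
```

For the site above, this would be called as extract_story_links /a/freeitathens.org/foo news news.* > news_links.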

Once I had the list of URLs, it was simple to have wget download them all.

wget -i news_links -B https://sites.google.com
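
The -B option is what lets wget turn the relative paths in news_links into full URLs. If you would rather see the absolute URLs before downloading, a sketch like the following produces them explicitly. The helper name absolute_urls is made up for this example.

```shell
# absolute_urls is a hypothetical helper, not part of export_google_site_news.
# It prefixes every path read from standard input with the given origin.
absolute_urls() {
    origin=$1   # e.g. https://sites.google.com
    while IFS= read -r path; do
        printf '%s%s\n' "$origin" "$path"
    done
}
```

For example, absolute_urls https://sites.google.com < news_links > news_urls writes the full list to a file named news_urls (a name chosen here for illustration), which can then be passed to wget -i without -B.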

Since I didn’t find any other guides to doing this, I decided to flesh out what I’d done into a simple tool and write about it here.