Tuesday, August 13, 2013

Using CURL and Google Spreadsheet REST Services

CURL and Google Spreadsheets

Reader Beware: this post contains some XML markup that did not survive the cut and paste. I'll have to clean it up later.

Google provides a full REST API for many of its services.  While you could use their programming APIs, it’s quick and fun to make your requests directly with the “curl” command.   These are notes on how to do this.  I’m interested in querying multiple online spreadsheets in Google Drive along with public websites to find intersections of researchers who have large computing allocations, are interested in workflows, who have existing advanced support allocations, and who are attending various conferences.  This is a good overview: https://developers.google.com/gdata/articles/using_cURL.  Other useful links:

Get your authentication credential

See https://developers.google.com/accounts/docs/AuthForInstalledApps. You’ll need to know the name of your service, so see https://developers.google.com/gdata/faq#clientlogin.  For spreadsheets, the service name is “wise”.
curl https://www.google.com/accounts/ClientLogin --data-urlencode Email=you@gmail.com --data-urlencode Passwd=yourpasswd -d accountType=GOOGLE -d source=your.org-your.service-your.service.version -d service=wise
Replace the italicized text above with your appropriate values. The “source” attribute is apparently used only for logging, so any value is OK.  
The response to the above should be (edited)
You’ll need the final value (Auth) for all your subsequent requests.

Get a list of all your spreadsheets.  

You’ll need this in order to get the URLs for the specific spreadsheets you want to query.  Use the Auth key value you just obtained here.
curl --silent --header "Authorization: GoogleLogin auth=DQAAANkAXXX" "https://spreadsheets.google.com/feeds/spreadsheets/private/full" | tidy -xml -indent -quiet
The response will be in ATOM/XML with an entry for every spreadsheet you own. You’ll probably want to save to a file.  Passing through tidy will format it nicely as described in https://developers.google.com/gdata/articles/using_cURL.  An example entry is shown below.  The key in the URLs is a unique ID for the spreadsheet.
Formatting: The UNIX tidy command is nice for formatting. The more powerful xmllint command can also be used.
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/worksheets/$KEY/private/full" |xmllint --format -
Note the trailing “-” after “--format”  $AUTH and $KEY are environment variables set using values discussed above.  The xmllint command can also do more powerful parsing, but this doesn’t seem to be working well on my Mac.
   term='http://schemas.google.com/spreadsheets/2006#spreadsheet' />
   XSEDE Allocations June 2013
   XSEDE Allocations June 2013
   type='application/atom+xml'  href='https://spreadsheets.google.com/feeds/worksheets/key/private/full' />
   href='https://spreadsheets.google.com/feeds/spreadsheets/private/full/key />

The “alternate” link above with type “text/html” is a link suitable for display in a browser (that is, how you usually view it). The “id” URL and also the “self” link are ATOM/XML feeds for the specific spreadsheet.  If you just want to return this particular entry as a response, GET the “self” URL.

GET the Worksheet Service URL

The worksheet feed link,  https://spreadsheets.google.com/feeds/worksheets/key/private/full, will provide the next step.  Curling it will return information about the specific worksheet in the spreadsheet.
curl --silent --header "Authorization: GoogleLogin auth=DDDDDD" "https://spreadsheets.google.com/feeds/worksheets/key/private/full" | tidy -xml -indent -quiet
The response again is in ATOM/XML.
 term='http://schemas.google.com/spreadsheets/2006#worksheet' />
 XSEDE Allocations June 2013
 href='https://spreadsheets.google.com/ccc?key=key’ />
 href='https://spreadsheets.google.com/feeds/worksheets/key/private/full' />
 href='https://spreadsheets.google.com/feeds/worksheets/key/private/full' />
 href='https://spreadsheets.google.com/feeds/worksheets/key/private/full' />
   term='http://schemas.google.com/spreadsheets/2006#worksheet' />
   href='https://spreadsheets.google.com/feeds/list/key/od6/private/full' />
   href='https://spreadsheets.google.com/feeds/cells/key/od6/private/full' />
   href='https://spreadsheets.google.com/tq?key=key&sheet=od6' />
   href='https://spreadsheets.google.com/feeds/worksheets/key/private/full/od6' />
 href='https://spreadsheets.google.com/feeds/worksheets/key/private/full/od6/0' />

The atom+xml type URLs in the response are useful for various operations.  In particular, the listfeed URL will allow us to interact with the content of a specific worksheet within a spreadsheet.

Use the worksheet feed URL to get the worksheet’s contents

The following command does what you want.
curl --silent --header "Authorization: GoogleLogin auth=DDDD" "https://spreadsheets.google.com/feeds/list/key/od6/private/full"|tidy  -xml -indent -quiet
The entire spreadsheet will be returned in ATOM/XML format. Each worksheet row will be an Atom feed entry with the worksheet columns as child XML tags.  For example, the above spreadsheet has columns listing the owner of the allocation, the requested allocation, the approved allocation, and so on.  There is one row for each allocation holder.  An example entry in the response to the curl command above is below:
   term='http://schemas.google.com/spreadsheets/2006#list' />
   pi: Pilastname Pifirstname, propnum: ABC123,
   requested: 100000, approved: 90000, production: 100,
   exploration: 0, education: 0, commandline: 100, gridtools: 0,
   metascheduler: 0, independent: 0, independentrelated: 0,
   tightlycoupled: 0, dependent: 100
   href='https://spreadsheets.google.com/feeds/list/key/od6/private/full/cokwr' />
   href='https://spreadsheets.google.com/feeds/list/key/od6/private/full/cokwr/xxxyyyzzz' />
   Pilastname, Pifirstname

We now sort and filter to our hearts’ content.  

Sorting, Filtering, and Ordering

Sorting, filtering, and other operations are performed by appending the operation to the end of the feedlist URL.  See https://developers.google.com/google-apps/spreadsheets/.  In summary, the following operation keywords are supported (append as ?keyword=keyvalue).
  • orderby: orders the response alphabetically or numerically with the supplied column. For example, to list the largest approved allocations first, use ?orderby=approved.
  • reverse: reverses the order of the response.  To sort the response with the largest response last, use ?orderby=approved&reverse=true.
  • sq: this is a structured query operation. To return a list of all responses with the “dependent” column value > 50, use ?sq=dependent>50.  The format for the query language is described in https://developers.google.com/chart/interactive/docs/querylanguage. The queries must be URL encoded; the link provides a little tool for doing this.
You can also request alternative formats (see https://sites.google.com/site/collaboratory20/journal/usingcurltodownloadagooglespreadsheetasxlsexcel).  The following command will download the spreadsheet in CSV format.
curl --silent --header "Authorization: GoogleLogin auth=XXXX" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=key&exportFormat=csv"
We can also get tabbed-separated values with
curl --silent --header "Authorization: GoogleLogin auth=XXXX" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=key&exportFormat=tsv"
Since the data in the spreadsheet cells often includes commas, TSV is the better choice.  Unfortunately the export feature doesn’t work with the ordering and structured query parameters.  On the other hand, TSV and CSV is easier to work with using standard UNIX command line tools than XML.

Sorting and Extracting Information from CSV and TSV Output

We’ll stick with CSV since we don’t really care too much about the document structure and just want to extract some output.  For extremely large outputs, this may not scale (the server may do a better job).  Let’s sort the allocations by the last field in the output (which indicates the percentage of the allocation projected to involve dependent jobs:
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | sort -n -k 15 -t $'\t'
See http://stackoverflow.com/questions/1037365/unix-sort-with-tab-delimiter for the tip about sorting tab-deliminated files.  We can also sort on multiple columns. To first sort by the dependent job percentage and then sort by awarded allocation size so that we can find out who the power users with dependent jobs, use
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | sort -n -k 15 -k 5 -t $'\t'
Let’s now clean this up, keeping only the last name, first name, and dependency percentages and removing the duplicated names. The latter come from users who have made multiple allocation requests.
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | sort -n -k 15 -k 5 -t $'\t'|awk 'BEGIN { FS="\t";OFS="\t";} {print $2,$15}' | uniq
Just keep the allocated users with 20% or more of their allocation for dependent jobs.  The output is formatted Lastname, Firstname.
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | sort -n -k 15 -k 5 -t $'\t'|awk 'BEGIN { FS="\t";OFS="\t";} {print $2,$15}' | uniq | awk 'BEGIN {FS="\t";OFS="\t";} $2 >= 20 {print $1, $2}'
Note that the 15th column in the original table is the 2nd column in the sorted and awked table.  
Since cross-matching names in multiple documents will usually be easiest if we search only for last names, we can use this command:
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | tail +2| sort -n -k 15 -k 5 -t $'\t'|awk 'BEGIN { FS="\t";OFS="\t";} {print $2,$15}' | uniq | awk 'BEGIN {FS="\t";OFS="\t";} $2 >= 75 {print $1}' | awk 'BEGIN {FS=",";ORS="\t";}{print $1;}'
The tail +2 removes the annoying header line.  The above curl filters out users with >75% requirement for dependent jobs.  The ORS in the last awk command separates the output with tabs rather than new lines, which will be useful later in constructing input arrays. We could have done a similar things by piping through xargs, but xargs doesn’t give us control over the separator. This is a problem for last names with spaces.
Here’s another useful one.  After filtering by dependency > 25%, it keeps only the comma-separated last and first names.  We then switch the order to Firstname Lastname and separate entries with a tab.
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | tail +2| sort -n -k 15 -k 5 -t $'\t'|awk 'BEGIN { FS="\t";OFS="\t";} {print $2,$15}'  | awk 'BEGIN {FS="\t";OFS="\t";} $2 >= 25 {print $1;}' |sort|uniq| awk 'BEGIN {FS=",";OFS=" ";ORS="\t"} {print $2,$1;}'
This will return the full names in Firstname Lastname order:
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | tail +2| sort -n -k 15 -k 5 -t $'\t'|awk 'BEGIN { FS="\t";OFS="\t";} {print $2,$15}'  | awk 'BEGIN {FS="\t";OFS="\t";} $2 >= 25 {print $1,$2;}' |cut -d$'\t' -f1|awk 'BEGIN {FS=","} {print $2,$1}'
You can use a similar approach with CSV output using the following command. Note that the comma in the “name” field between last name and the first name causes problems for sort, so we just treat this as a new column.
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=csv"|sort -n -k 16 -k 6 -t ','|awk 'BEGIN { FS=",";OFS=","} {print $2,$3,$16}' | uniq | awk 'BEGIN {FS=",";} $3 >= 20 {print $1;}'| tr -d '\"'|xargs

Searching Other Sites

We’re now ready to use the last names to search conference proceedings.  Annoyingly, the XSEDE 2013 proceedings are not immediately obvious. The conference site uses JavaScript and
tags to populate the presentation topics, so this won’t be preserved if we curl the page.  The proceedings are also in ACM’s Digital Libraries, but I can’t find a standalone “Table of Contents” page.  I did find a curl-able version of this by viewing one of the papers, switching from the tabbed view to the static table view, and selecting the “Table of Contents” link.  The URL is
The following simple curl will tell us which allocated users with a > 75% dependent jobs requirement also presented papers at XSEDE 2013:
for name in `curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY&exportFormat=tsv" | tail +2| sort -n -k 15 -k 5 -t $'\t'|awk 'BEGIN { FS="\t";OFS="\t";} {print $2,$15}' | awk 'BEGIN {FS="\t";OFS="\t";} $2 >= 75 {print $1}' | sort|uniq|awk 'BEGIN {FS=",";ORS="\t";}{print $1;}'`; do echo $name:;curl --silent "http://dl.acm.org/citation.cfm?id=2484815&preflayout=flat#prox"|grep $name; done
I’m keeping only the last names, and the sort|uniq removes duplicates.  Be careful with this: the ACM site will block your IP address if you make too many requests.
We can also search across two Google Spreadsheet accounts. The following command will return the first and last names of everyone receiving ECSS support:
curl --silent --header "Authorization: GoogleLogin auth=$AUTH" "https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$KEY2&exportFormat=tsv" | awk 'BEGIN {FS="\t";OFS=",";}{print $5}' | sort|uniq|tail +3
Using tail +3 removes some headers and leading blank lines. Again, TSV is preferable to CSV since the cells can contain free text with commas.