Tuesday, February 13, 2007

Notes on Amazon S3 File Services

Getting Started

* The best thing to do is go straight to the end and download the code, since the code snippets in the tutorial are not actual working programs. Work through the program S3Driver.java and then go back and read the tutorial:

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/?ref=get-started

* Then read the technical documentation,

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/

It is all pretty simple stuff--all the sophistication is on the server side, I'm sure.

Building a Client

* Interestingly (for Java), there is no jar to download. Everything works with standard Java classes if you use the REST version of the service. If you use the WSDL version, you will need jars for your favorite WSDL2Java tool (e.g. Axis).

* Java 1.5 worked fine, but 1.4.2 has a compilation error. The readme has almost no instructions on how to compile and run. To build and run on Linux:
  1. Unpack the download and cd into s3-example-libraries/java.
  2. Edit both S3Driver.java and S3Test.java to use your Access Key ID and Secret Access Key.
  3. Compile with: find . -name "*.java" | xargs javac -d .
  4. Run the tests with "java -cp . S3Test", then run the example with "java -cp . S3Driver".

This will compile various helper classes in the com.amazon.s3 package. These are all relatively transparent wrappers around lower level but standard HTTP operations (PUT, GET, DELETE), request parameters, and custom header fields, so you can easily invent your own API if you don't like Amazon's client code.

* Some basic concepts:
- To get started, you need to generate an Access Key ID and a Secret Access Key.
- Your Access Key ID is used to construct unique URL names for you.
- Files are called "objects" and named with keys.
- Files are stored in "buckets".
- You can name the files and buckets however you please.
- You can have at most 100 buckets, but the number of objects in a bucket is unlimited.

* Both buckets and S3 objects can be mapped to local Java objects. An S3 object is just data plus arbitrary metadata (stored as a Java Map of name/value pairs). A bucket has more limited metadata (name and creation date). Note again that this is all serialized as HTTP, so the local programming-language objects are just there as a programming convenience.
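For concreteness, here is a minimal sketch of what the local object might look like (SimpleS3Object is a made-up name; Amazon's sample library has its own com.amazon.s3.S3Object with a similar shape):

    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical stand-in for the local object: raw bytes plus a map of
    // name/value metadata pairs. Nothing here is special; it all goes over
    // the wire as plain HTTP.
    public class SimpleS3Object {
        private final byte[] data;
        private final Map<String, String> metadata = new TreeMap<String, String>();

        public SimpleS3Object(byte[] data) { this.data = data; }
        public byte[] getData() { return data; }
        public Map<String, String> getMetadata() { return metadata; }
    }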

* Buckets can't contain other buckets, so to mimic a directory structure you need to come up with some sort of key naming convention (e.g. name /mydir1/mydir2/file with the key mydir1.mydir2.file).
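A trivial sketch of one such convention (pathToKey is a hypothetical helper, not part of Amazon's API):

    public class KeySketch {
        // Hypothetical client-side convention: flatten a path into a flat
        // key. The server knows nothing about directories; only the client
        // ever interprets the dots.
        public static String pathToKey(String path) {
            if (path.startsWith("/")) path = path.substring(1);
            return path.replace('/', '.');  // "/mydir1/mydir2/file" -> "mydir1.mydir2.file"
        }
    }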

* In the REST version of things, everything is done by sending an HTTP command: PUT, GET, or DELETE. You then write the contents directly to the remote resource using standard HTTP transfer.
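Here is a bare-bones sketch of a PUT using only standard classes (putObject and httpDate are made-up names; the Authorization value is what the next note describes):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;

    public class RestPutSketch {
        // Upload one object with a single HTTP PUT. The Date and
        // Authorization header values are computed by the caller.
        public static int putObject(String bucket, String key, byte[] data,
                                    String date, String authorization)
                throws Exception {
            URL url = new URL("https://s3.amazonaws.com/" + bucket + "/" + key);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Date", date);
            conn.setRequestProperty("Authorization", authorization);
            OutputStream out = conn.getOutputStream();
            out.write(data);
            out.close();
            return conn.getResponseCode();  // 200 means the PUT succeeded
        }

        // S3 wants an HTTP-style date, e.g. "Tue, 13 Feb 2007 12:00:00 GMT".
        public static String httpDate() {
            SimpleDateFormat fmt =
                new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US);
            fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
            return fmt.format(new Date());
        }
    }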

* Security is handled by an HTTP request property called "Authorization". This is just a string of the form

"AWS "+"[your-access-key-id]"+":"+"[signed-canonical-string]".

The canonical string is a concatenation of all the "interesting" Amazon headers and request elements required for a particular communication. This string is then signed by the client program using the client's Secret Access Key. Amazon also has a copy of this secret key and can verify the authenticity of the request.
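A sketch of the signing step (buildAuthHeader is a made-up name; see the tech docs for the exact canonicalization rules):

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class AuthSketch {
        // The canonical string is roughly:
        //   HTTP-Verb "\n" Content-MD5 "\n" Content-Type "\n" Date "\n"
        //   [sorted x-amz-* headers] [resource path]
        // The signature is an HMAC-SHA1 over that string, keyed with your
        // Secret Access Key, then Base64-encoded. java.util.Base64 is used
        // here for brevity; Amazon's sample code ships its own encoder.
        public static String buildAuthHeader(String accessKeyId,
                                             String secretKey,
                                             String canonicalString)
                throws Exception {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
            byte[] raw = mac.doFinal(canonicalString.getBytes("UTF-8"));
            String signature = java.util.Base64.getEncoder().encodeToString(raw);
            return "AWS " + accessKeyId + ":" + signature;
        }
    }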

* Your files will be associated with the URL

https://s3.amazonaws.com/[your-access-key-id]-[bucket-name]/

That is, if your key is "ABC123DEF456" and you create a bucket called "my-bucket" containing a file object called "test-file-key", then your file will be at the URL

https://s3.amazonaws.com/ABC123DEF456-my-bucket/test-file-key

* By default, your file will be private, so even if you know the bucket and key name, you won't be able to retrieve the file without also including a signed request. This URL will look something like this:

https://s3.amazonaws.com:443/ABC123DEF456-test-bucket/test-key?Signature=xelrjecv09dj&AWSAccessKeyId=ABC123DEF456

* Also note that this URL can be constructed entirely on the client side without any communication with the server--all you need is the name of the bucket, the object key, and access to your Secret Key. So even though the URL looks somewhat random, it is not, and no communication between the client and Amazon S3 is required to create it.
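Here is a sketch of building such a URL offline (signedGetUrl is a hypothetical helper; as far as I can tell the query-string form also includes an Expires timestamp in the signed string, which the example URL above omits):

    import java.net.URLEncoder;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class SignedUrlSketch {
        // Build a pre-signed GET URL with no server round trip. The string
        // to sign for query-string auth is (roughly):
        //   "GET\n\n\n" + expires + "\n/" + bucket + "/" + key
        // where expires is seconds since the epoch.
        public static String signedGetUrl(String accessKeyId, String secretKey,
                                          String bucket, String key,
                                          long expires) throws Exception {
            String toSign = "GET\n\n\n" + expires + "\n/" + bucket + "/" + key;
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
            String sig = java.util.Base64.getEncoder()
                    .encodeToString(mac.doFinal(toSign.getBytes("UTF-8")));
            return "https://s3.amazonaws.com/" + bucket + "/" + key
                 + "?AWSAccessKeyId=" + accessKeyId
                 + "&Expires=" + expires
                 + "&Signature=" + URLEncoder.encode(sig, "UTF-8");
        }
    }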

* Note also that this URL is in no way tied to the client that holds the secret key. I could send it in email or post it to a blog and allow anyone to download the contents. I suppose it is possible to guess this URL, but guessing one file's URL this way would not help you guess another, since the Signature is a hash of the file name and other parameters--a file with a very similar name would have a very different hash value.

* You can also set access controls on your file. Amazon does this with a special HTTP header, x-amz-acl. To make the file publicly readable, you set the magic string "public-read" as the value of this header field.
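Continuing the hypothetical PUT sketch from earlier (conn is the HttpURLConnection), this is all it seems to take:

    // Set before sending the body. Since x-amz-* headers are part of the
    // signed canonical string, the signature must cover this header too.
    conn.setRequestProperty("x-amz-acl", "public-read");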

* If your file is public (that is, can be read anonymously), then the URL

https://s3.amazonaws.com/ABC123DEF456-my-bucket/test-file-key-public

is all you need to retrieve it.

Higher Level Operations

* Error Handling: The REST version relies almost entirely on HTTP response codes. As noted in the tutorial, the provided REST client classes have minimal error handling, so this is something that would need to be beefed up.
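A sketch of what "beefed up" might minimally look like (checkResponse is a made-up name; per the tech docs, failed REST calls carry an XML error document in the response body):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;

    public class ErrorSketch {
        // Treat any non-2xx status as a failure and surface the XML error
        // document S3 returns in the body, which the sample client mostly
        // ignores.
        public static void checkResponse(HttpURLConnection conn)
                throws Exception {
            int code = conn.getResponseCode();
            if (code / 100 == 2) return;              // 2xx: success
            StringBuffer body = new StringBuffer();
            InputStream err = conn.getErrorStream();  // may be null
            if (err != null) {
                BufferedReader in =
                    new BufferedReader(new InputStreamReader(err));
                String line;
                while ((line = in.readLine()) != null) body.append(line);
                in.close();
            }
            throw new Exception("S3 request failed: HTTP " + code + " " + body);
        }
    }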

For more information, see the tech docs:

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/

It is not entirely clear that the REST operations give you as much error information as the SOAP faults. I need to check whether REST returns additional error messages besides the standard HTTP status codes.

* Fine-Grained Access Control: Both objects and their container buckets can have ACLs. An object's ACL is actually expressed in XML, which can be sent as stringified XML. For more information on this, you have to go to the full developer documentation:

http://docs.amazonwebservices.com/AmazonS3/2006-03-01/


* Objects can have arbitrary metadata in the form of name/value pairs. These are sent as "x-amz-meta-" HTTP headers. If you want something more complicated (say, structured values or RDF triples), you will have to invent your own layer on top of this.
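For example, continuing the PUT sketch (addMetadata is a made-up helper):

    import java.net.HttpURLConnection;
    import java.util.Map;

    public class MetadataSketch {
        // Each name/value pair becomes one x-amz-meta-* request header
        // (and, being an x-amz-* header, it participates in the signature).
        public static void addMetadata(HttpURLConnection conn,
                                       Map<String, String> metadata) {
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                conn.setRequestProperty("x-amz-meta-" + e.getKey(),
                                        e.getValue());
            }
        }
    }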

Security Musings

* Amazon provides X.509 certs optionally, but all Amazon services work with Access Key identifiers. The Access Key ID is public and is transmitted over the wire as an identification mechanism. You verify your identity by signing your requests with your Secret Access Key. This is basically a shared secret, since Amazon also has a copy of this key. Based on the incoming Access Key ID, Amazon can look up the associated secret key and, by reproducing the message hash, verify that the message indeed comes from the proper person and was not tampered with.

* You use your secret key to sign all commands that you send to the S3 service. This authorizes you to write to a particular file space, for example.

* This is of course security 101, but I wonder how they handle compromised keys. These sorts of non-repudiation issues are the bane of elegant security implementations, so presumably they have some sort of offline method for resolving these issues. I note that you have to give Amazon your credit card to get a key pair in the first place, so I suppose on the user side you could always dispute charges.

* Imagine, for example, that someone got access to your secret key and uploaded lots of large home movies. This is not illegal material (like pirated movies or software), but it will cost you money. So how does Amazon handle this situation? Or imagine that I uploaded lots of stuff and then claimed my key was stolen in order to get out of paying. How does Amazon handle that?
