Friday, February 20, 2009

Word Cloud for My Blog

From Wordle, courtesy of Yu "Don't call me Marie on your stupid blog" Ma.

Friday, February 06, 2009

Some Xen command line stuff

Here's a Xen command line cheat sheet. We'll assume Xen is installed.

1. Make sure you've booted from the Xen kernel.

2. To start the Xen daemon, use "xend restart".

3. To install a new image, use

virt-install --paravirt --name vc8 -r 500 -f /var/lib/xen/images/vc5.img --file-size 8 -l http://fedora.fastsoft.net/pub/linux/fedora/linux/releases/8/Fedora/i386/os --nographics

The virt-install man page explains these options and has more examples. The command above gives the VM 500 MB of RAM and an 8 GB disk image, then installs Fedora 8 from the given mirror. After the download, you should be prompted to install the OS in the VM. This can take a while, so you'll want to clone the finished image rather than repeat the install.

4. To list VMs, you can use "virsh list" or "xm list".

5. To start your VM, use "virsh start vc8". Here "vc8" is the name of the VM created above.

6. To connect to the above VM, use "xm console vc8". Replace vc8 with your VM name. This attaches you to the VM's console, where you should see the operating system boot and reach a login prompt.
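
Since step 3 suggests cloning finished images, here is a minimal sketch of how that might look with virt-clone, which usually ships alongside virt-install. The new guest name (vc9), the helper function names, and the DO_CLONE switch are my own inventions for illustration, and the source guest needs to be shut off or paused before cloning. By default the script only prints the command it would run.

```shell
#!/bin/sh
# Sketch: clone an existing Xen guest with virt-clone.
# Assumptions (not from the steps above): virt-clone is installed, the new
# guest is named by the caller, and its disk goes in /var/lib/xen/images.

# Build the virt-clone command line for a source guest and a new guest name.
clone_command() {
    src=$1
    new=$2
    image_dir=${3:-/var/lib/xen/images}
    echo "virt-clone --original $src --name $new --file $image_dir/$new.img"
}

# Print the command by default; run it only when DO_CLONE=1 is set.
clone_vm() {
    cmd=$(clone_command "$@")
    if [ "${DO_CLONE:-0}" = "1" ]; then
        $cmd
    else
        echo "$cmd"
    fi
}
```

Running "clone_vm vc8 vc9" prints the command so you can check it first; "DO_CLONE=1 clone_vm vc8 vc9" actually runs it.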

Thursday, February 05, 2009

GRAM Job submission failed because data transfer to the server failed (error code 10)

Problem: I had a working pre-Web service GRAM "fork" job manager but then needed to use the LSF job manager for submissions to the scheduler on a cluster. The LSF job manager was not built when we initially deployed Globus, which is unusual.

The LSF job manager was built with the commands

% gpt-build globus_gram_job_manager_setup_lsf-1.17.tar.gz
% ./setup-globus-gram-job-manager-lsf

However, the command-line tests didn't work. For example, the command

globusrun -o -r my.secret.machine/jobmanager-lsf '&(executable=/bin/date)'

threw the error

GRAM Job submission failed because data transfer to the server failed (error code 10)

This is unfortunately an all-purpose Globus error. You will sometimes see it associated with problems in the grid-mapfile, but again my fork jobmanager worked fine, so I had a different bug.

Unfortunately nothing useful turned up in the gsi-gatekeeper.log, even after I turned up the logging level.

Solution: the problem turned out to be that the LSF job manager files were not given the correct permissions during the deployment. These should be 755 (group and world readable and executable). Find them with a command like

find $GLOBUS_LOCATION -name "*lsf*"

I made the changes manually, but you could also use a "find | xargs" pipeline.
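
For the record, here is roughly what the find/xargs version could look like. This is a sketch I did not run against the original installation: the function name is made up, and it assumes GNU findutils (for -print0, xargs -0 and xargs -r). It sets 755 on everything whose name contains "lsf", so run the plain find command above first and check that the list is what you expect.

```shell
#!/bin/sh
# Sketch: set mode 755 on every file whose name contains "lsf".
# Assumes GNU findutils (for -print0, xargs -0 and xargs -r).

fix_lsf_perms() {
    root=$1
    # -print0 / -0 keep odd file names intact; -r skips chmod if nothing matches.
    find "$root" -name '*lsf*' -print0 | xargs -0 -r chmod 755
}

# Only touch a real installation when GLOBUS_LOCATION is set.
if [ -n "${GLOBUS_LOCATION:-}" ]; then
    fix_lsf_perms "$GLOBUS_LOCATION"
fi
```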

Globus error 123 and Condor-G

Thanks to Stu Martin and Todd Tannenbaum for independently providing the solution below.

Problem: Condor-G submission to a Globus pre-Web service GRAM fails with the error

"Globus error 123 (could not write the job state file)"

This is described in more detail at http://www-unix.globus.org/mail_archive/discuss/2002/11/msg00131.html. You can reproduce it with this globusrun command:

globusrun -r my.secret.machine/jobmanager '&(executable=/bin/date)(save_state=yes)'

Solution: Edit $GLOBUS_LOCATION/etc/globus-job-manager.conf and change the value of -state-file-dir to point to a directory on a local, non-NFS file system. For example:

-state-file-dir /usr/local/gram_job_state

Also, make this directory world-writable with the sticky bit set (the same permissions as /tmp):

chmod ogu+rwxt /usr/local/gram_job_state
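
If the directory does not exist yet, the setup can be done in one go. The following is a sketch, not the exact commands from the original fix: the function name is made up, and the df call is only there so you can see which file system the directory landed on (an NFS mount typically shows up as host:/path in the first column).

```shell
#!/bin/sh
# Sketch: create the GRAM state directory and open it up like /tmp.

make_state_dir() {
    dir=$1
    mkdir -p "$dir" || return 1
    # Same mode as above: world-writable plus the sticky bit.
    chmod ogu+rwxt "$dir" || return 1
    # Show which file system holds the directory, to confirm it is not NFS.
    df -P "$dir"
}
```

As root, "make_state_dir /usr/local/gram_job_state" would match the example path above.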

myproxy-logon error

The problem: my favorite myproxy-logon command suddenly stopped working and I got the error barf below

Failed to receive credentials.
Error authenticating: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
globus_gss_assist: Error during context initialization
globus_gsi_gssapi: Unable to verify remote side's credentials
globus_gsi_gssapi: Unable to verify remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1052: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate SSL alert number 42

I tried the usual solution: I made sure my certificates were correct in $HOME/.globus/certificates and /etc/grid-security/certificates, but still got the error.

Solution: I had a proxy certificate from another (non-overlapping) Grid sitting around in the default X509 location (/tmp/x509up_nnn). I deleted this and myproxy-logon worked again. Apparently the myproxy-logon command tries to use this credential to authenticate to the MyProxy server. The errors are what you expect since MyProxy from Grid #1 doesn't trust the CA of Grid #2.

Delete the old credential and things should work again.
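
Here is a small sketch for finding and removing the leftover proxy. It assumes the usual Globus convention that the proxy lives at $X509_USER_PROXY if that variable is set and at /tmp/x509up_u<uid> otherwise; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: find and remove a leftover proxy credential.
# Assumption: the proxy is at $X509_USER_PROXY if set, else /tmp/x509up_u<uid>.

# Print the path where the current user's proxy is expected to be.
proxy_path() {
    echo "${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}"
}

# Delete the proxy file if it exists and say what was done.
remove_stale_proxy() {
    proxy=$(proxy_path)
    if [ -f "$proxy" ]; then
        echo "Removing $proxy"
        rm -f -- "$proxy"
    else
        echo "No proxy file at $proxy"
    fi
}
```

If the Globus client tools are on your path, grid-proxy-info should show which Grid the existing proxy came from before you delete it, and grid-proxy-destroy should do the same job as remove_stale_proxy.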