Wednesday, July 11, 2007

Beating Condor Like a Rented Buzzard

OK, trying to get Condor running on a few linux boxes on a private network with no DNS. There were lots of pitfalls that took a while to debug. Also, try searching for my del.icio.us links tagged with "condor" and "ecsu" for sites I found helpful.

* Make sure you have the libstdc++.so compatibility patch installed, since condor needs the obsolete libstdc++.so.5. See the previous post.

* Turn off your iptables firewalls (/etc/init.d/iptables stop).

* I did not find /etc/hosts entries to be helpful. They conflict with the NO_DNS attribute that I use below.

Get the tar.gz version of condor, since we will need to babystep our way through the complicated configuration file. Do the following as the root user. Unpack the tar, and use

./condor_install

for the interactive installation on all the machines you want to use. Pick one of these to be the master. Don't use condor_configure --install unless you know what you are doing. Answer the questions as best you can. Note you should answer "No" when asked about the shared file system if you are just using a random collection of machines with no NFS, etc.

Next, set your $CONDOR_CONFIG environment variable to point to the condor_config file, and then edit that file. Make sure that CONDOR_HOST points to the master node. You must change HOSTALLOW_WRITE and you may want to change HOSTALLOW_READ. I set these to IP addresses since I don't have DNS. Don't forget to include 127.0.0.1 in the allowed list.

Next, scroll down in condor_config to find the parameters DEFAULT_DOMAIN_NAME and NO_DNS. Uncomment these, since we assume there is no DNS. The value of the first can be anything; the value of the second should be TRUE.

Still editing condor_config, cruise down to part 3 and set the attributes START=TRUE, SUSPEND=FALSE, CONTINUE=TRUE, PREEMPT=FALSE, and KILL=FALSE as described at https://www-auth.cs.wisc.edu/lists/condor-users-rc/2005-February/msg00259.shtml.
This will cause condor to run your jobs right away.
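
Pulling the edits above together, the changed lines in condor_config might look roughly like the fragment below. This is only a sketch: the IP addresses 10.20.30.41 and 10.20.30.42 and the domain name example.local are made-up placeholders, and your file will contain many other settings that stay as they are.

```
# Hypothetical addresses: 10.20.30.40 is the master, 10.20.30.41 and
# 10.20.30.42 are the other cluster members.
CONDOR_HOST = 10.20.30.40

# IP addresses rather than host names, since there is no DNS.
HOSTALLOW_READ = 10.20.30.40, 10.20.30.41, 10.20.30.42, 127.0.0.1
HOSTALLOW_WRITE = 10.20.30.40, 10.20.30.41, 10.20.30.42, 127.0.0.1

# Any value works for the domain name; example.local is a placeholder.
DEFAULT_DOMAIN_NAME = example.local
NO_DNS = TRUE

# Part 3: always start jobs and never suspend, preempt, or kill them.
START = TRUE
SUSPEND = FALSE
CONTINUE = TRUE
PREEMPT = FALSE
KILL = FALSE
```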

Now, start condor with condor_master on your main condor host. This should come up in a few minutes, so use condor_status to check.

Next, from each machine in your cluster, try "telnet 10.20.30.40 9618". Replace 10.20.30.40 with the main host's IP address. 9618 is the port of the collector. If you can't even get to this port, you probably have firewall issues, so turn off iptables and check /etc/hosts.deny and /etc/hosts.allow. If you get connection refused, it is probably a condor_config issue, so make sure that you followed the steps above.
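
If telnet is not installed on a node, the same check can be approximated with a few lines of Python. This is a minimal sketch, not part of Condor: the function name check_port and its three return labels are my own invention. It separates "connection refused" (the host answered but nothing is listening, which points at condor_config) from a timeout or routing error (which points at the firewall or network).

```python
import socket


def check_port(host, port=9618, timeout=5.0):
    """Try one TCP connection and report 'open', 'refused', or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no process is listening on the port.
        return "refused"
    except OSError:
        # Timeouts, "no route to host", and similar network-level failures.
        return "unreachable"


# Example use (10.20.30.40 stands in for the main host's IP address):
#   print(check_port("10.20.30.40"))
```

A result of "unreachable" suggests looking at iptables and /etc/hosts.deny first, while "refused" suggests rechecking the condor_config steps.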

I actually found it useful to install Tomcat and modify it to run on port 9618 just to verify that other machines could reach the main host's port 9618. This tells you whether connection problems are in your network configuration or in condor's config. Be sure to turn tomcat off when you are done.
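
If installing Tomcat feels like overkill, a throwaway listener does the same job. The sketch below is a stand-in I am suggesting, not what I originally ran; open_listener and accept_one are hypothetical helper names. Run it on the main host while condor is stopped (otherwise the port will already be taken), then telnet to port 9618 from another machine.

```python
import socket


def open_listener(host="0.0.0.0", port=9618):
    """Bind a TCP socket on host:port and start listening."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def accept_one(srv):
    """Accept a single connection, send a short greeting, return the peer's IP."""
    conn, addr = srv.accept()
    with conn:
        conn.sendall(b"reachable\n")
    return addr[0]


if __name__ == "__main__":
    listener = open_listener()
    try:
        print("connection from", accept_one(listener))
    finally:
        # Free the port again so condor can use it.
        listener.close()
```

The script exits after the first connection, so there is nothing left to turn off afterwards. If the remote machine sees "reachable", the network path is fine and any remaining trouble is in condor's config.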

Repeat the condor installation steps for the other condor cluster members. When you bring them up, they should be added to the output of condor_status.
