Wednesday, July 11, 2007

Some Condor problems and solutions

Following on from my previous post, here are some problems I encountered while getting Condor installed on a set of lab PCs running Linux with no shared file systems and no DNS.

* Condor complains about something missing: install the compatibility patch for your OS. See the earlier post.

* You get something like

Failed to start non-blocking update to <>.
attempt to connect to < > failed: Connection refused (connect errno = 111).
ERROR: SECMAN:2003:TCP connection to <> failed

This probably means you have not set HOSTALLOW_READ and HOSTALLOW_WRITE correctly. Set them to the IP addresses of the machines that should be allowed to connect. Also be sure you have enabled the NO_DNS and DEFAULT_DOMAIN_NAME parameters in your condor_config file.
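
For illustration, the relevant condor_config entries might look roughly like this. It is only a sketch: the 192.168.1.* address range and the lab.example domain are placeholders I made up, so substitute whatever your own lab network uses.

```
# condor_config fragment (sketch; addresses and domain are placeholders)

# Machines allowed to query the pool and to join it / submit jobs
HOSTALLOW_READ = 192.168.1.*
HOSTALLOW_WRITE = 192.168.1.*

# No DNS available: have Condor build host names from IP addresses
NO_DNS = TRUE
DEFAULT_DOMAIN_NAME = lab.example
```

After editing the file, run condor_reconfig (or restart the daemons) so the new values are picked up.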

* Jobs seem stuck in "idle". This could just mean that Condor is being very unobtrusive and waiting until nobody is using the machines. Try changing the START, SUSPEND, etc. expressions in condor_config (see previous post and link).
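
As a rough sketch, a policy that always runs jobs regardless of keyboard or CPU activity could look like the following. This is an assumption about what suits a dedicated lab pool, not necessarily the settings from the previous post; on desktops that people actually use, you probably want something less aggressive.

```
# condor_config fragment (sketch): always start jobs, never suspend or evict
START = TRUE
SUSPEND = FALSE
CONTINUE = TRUE
PREEMPT = FALSE
KILL = FALSE
```

Run condor_reconfig on each machine afterwards so the new policy takes effect.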

* Jobs only run on the machine they are submitted from. Try adding


to your job submission script.

* Need some general debugging help: here is a nice little link

In short, look at the log files in /home/condor/log. Here are some useful commands:

1. condor_q -better-analyze
2. condor_q -long
3. condor_status -long

#1 will tell you why your job is failing to get matched.
