Wednesday, July 11, 2007

Some Condor problems and solutions

Following up on my previous post, here are some problems I ran into while getting Condor installed on a set of lab PCs running Linux with no shared file system and no DNS.

* Condor complains about a missing libstdc++.so.5: install the libstdc++ compatibility package for your OS. See the earlier post.

* You get something like

Failed to start non-blocking update to <10.237.226.81:9618>.
attempt to connect to <10.237.226.81:9618 > failed: Connection refused (connect errno = 111).
ERROR: SECMAN:2003:TCP connection to <10.237.226.81:9618> failed

This probably means you have not set HOSTALLOW_READ and HOSTALLOW_WRITE correctly. Set both to the IP addresses of the machines that should be allowed to connect. Also make sure you have enabled the NO_DNS and DEFAULT_DOMAIN_NAME parameters in your condor_config file.
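
For reference, here is a minimal sketch of what those settings might look like in condor_config. The 10.237.226.* subnet is only guessed from the error message above, and lab.example is a made-up placeholder domain, so substitute your own values.

```
# Hosts allowed to query the pool (READ) and to join or submit to it (WRITE).
# Subnet assumed from the example error above; adjust to your machines.
HOSTALLOW_READ = 10.237.226.*
HOSTALLOW_WRITE = 10.237.226.*

# Identify machines by IP address instead of looking them up in DNS.
NO_DNS = True
# Placeholder domain (hypothetical) used to build host names when there is no DNS.
DEFAULT_DOMAIN_NAME = lab.example
```

Restart Condor on each machine after editing so the daemons pick up the change.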

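* Jobs seem stuck in "idle". This could just mean that Condor is being very unobtrusive: the default policy only starts jobs on machines that look unused (no recent keyboard activity, low load). Try changing the START, SUSPEND, etc. expressions in condor_config (see previous post and link).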
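
As an illustration, a policy that lets jobs start and keep running regardless of keyboard or CPU activity might look like the sketch below. It assumes you are happy for jobs to compete with whoever is sitting at the machine, and it is not necessarily what the previous post used.

```
# Always willing to start a job, whatever the machine's owner is doing.
START = True
# Never suspend, preempt, or kill a running job because of owner activity.
SUSPEND = False
PREEMPT = False
KILL = False
# If a job ever does get suspended, resume it right away.
CONTINUE = True
```

With these values Condor is no longer unobtrusive, so this is best kept to dedicated or test machines.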

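* Jobs only run on the machine they are submitted from. With no shared file system, Condor has to be told to copy the job's files to the execute machine itself. Try adding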

should_transfer_files=YES
when_to_transfer_output=ON_EXIT

to your job submit file.
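
To show where those two lines fit, here is a minimal sketch of a complete submit file. The file names (myjob.sh, input.dat, and the output files) are hypothetical placeholders.

```
# Plain (vanilla) job: no relinking against the Condor libraries required.
universe = vanilla
# Placeholder script to run on the execute machine.
executable = myjob.sh
# Where stdout, stderr, and Condor's own job log end up on the submit machine.
output = myjob.out
error = myjob.err
log = myjob.log
# No shared file system, so let Condor copy files to and from the execute machine.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Extra input files to send along with the executable (placeholder name).
transfer_input_files = input.dat
queue
```

Submit it with condor_submit followed by the file name.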

* Need some general debugging help: here is a nice little link
http://www.cs.wisc.edu/condor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs

In short, look at the log files in /home/condor/log (or wherever the LOG setting in your condor_config points). Here are some useful commands:

1. condor_q -better-analyze
2. condor_q -long
3. condor_status -long

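#1 will tell you why your job is failing to get matched. #2 and #3 dump the full job and machine ClassAds, so you can compare what a job requires with what the machines advertise.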
