As in my previous post, here are some problems I encountered while getting Condor installed on a set of lab PCs running Linux with no shared file system and no DNS.
* Condor complains about missing libstdc++.so.5: install the compatibility patch for your OS. See earlier post.
* You get something like
Failed to start non-blocking update to <10.237.226.81:9618>.
attempt to connect to <10.237.226.81:9618 > failed: Connection refused (connect errno = 111).
ERROR: SECMAN:2003:TCP connection to <10.237.226.81:9618> failed
This probably means you have not set HOSTALLOW_READ and HOSTALLOW_WRITE correctly. Set them to the IP addresses of the machines that should be allowed to connect. Also make sure the NO_DNS and DEFAULT_DOMAIN_NAME parameters are set in your condor_config file.
* Jobs seem stuck in "idle". This could just mean that Condor is being very unobtrusive and is waiting for the machines to look unused before starting anything. Try changing the START, SUSPEND, etc. expressions in condor_config (see previous post and link).
* Jobs only run on the machine they are submitted from. Try adding
should_transfer_files=YES
when_to_transfer_output=ON_EXIT
to your job submission script.
* Need some general debugging help? Here is a nice little link:
http://www.cs.wisc.edu/condor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
In short, look at the log files in /home/condor/log. Here are some useful commands:
1. condor_q -better-analyze
2. condor_q -long
3. condor_status -long
#1 will tell you why your job is failing to get matched.
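
For the "Connection refused" item above, the relevant part of condor_config might look roughly like the sketch below. The address range is only a guess based on the IP in the error message, and the domain name is a made-up placeholder; substitute the values for your own lab.

```
# condor_config fragment (a sketch; all values are placeholders)

# Run without DNS lookups.
NO_DNS = TRUE
# Needed alongside NO_DNS; "lab.example" is a hypothetical domain.
DEFAULT_DOMAIN_NAME = lab.example

# Machines allowed to query the pool and to join or submit to it.
# Assumed range; list the addresses of your own lab PCs here.
HOSTALLOW_READ = 10.237.226.*
HOSTALLOW_WRITE = 10.237.226.*
```

After changing these on each machine, the daemons need to pick up the new settings, for example by running condor_reconfig or restarting Condor.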
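
For the item about jobs only running on the submitting machine, here is a minimal sketch of a submit file showing where the two file-transfer lines go. The executable and file names are hypothetical; only the two transfer settings come from the fix described above.

```
# myjob.submit (a sketch; file names are hypothetical)
universe = vanilla
executable = myjob.sh
output = myjob.out
error = myjob.err
log = myjob.log

# With no shared file system, let Condor copy files to and from the execute machine.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

queue
```

Submit it with condor_submit myjob.submit, then check condor_q to see whether the job gets matched to another machine.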