IBM Blade Server Cluster
 

> FAQs

Question: I’ve been trying to optimize some of the parameters in my calculation for speed. I found that when I run an identical calculation back-to-back, there is significant variation in the time it takes. When I checked the job with the bjobs command, I found that when the node numbers are close to each other, it’s faster than when one of the processes is running on node 2 and the other on node 132. Why is this? If I understood you correctly, there is one switch between each blade and one switch between each chassis, so would a job that is running across several chassis run slower than one that is running within a single chassis? If this is the case, is it possible to make a queue that would run all the processes of a job in a single chassis for increased speed?

Answer: We are hesitant to create a queue that will capture an entire chassis. It’s true that this would give you some speed improvement today, but it could slow you down over time as more users log in and begin to request CPUs. Your job would wait in the queue until a full chassis was available rather than taking advantage of the first CPUs that become available. Remember that your job will not begin until all the CPUs you have requested are available. We will continually evaluate how the system is being used and tailor the queues to reflect that usage. We can revisit this option after we have some experience with how the current queue system is working.

Question: Is it possible for LSF, when it assigns nodes to jobs, to pick those open, active nodes that are separated by the fewest switches? Is LSF already doing that, or is what I’m suggesting easier said than done? In any case, I’m interested in how LSF decides which nodes to run on – is a random open, active node picked, or does it have some other means of choosing?

Answer: LSF chooses the host based on r15s (the average number of processes ready to use the CPU during the last 15 seconds) and pg (the paging rate). We schedule based on system load indices because we assume the cluster runs on a local network.
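
As a rough illustration of that ordering, the sketch below ranks hosts by r15s and then by pg using the output of lsload. The rank_hosts helper is hypothetical, and it assumes the default lsload column layout (HOST_NAME, status, r15s, r1m, r15m, ut, pg, ...). It only mimics the sort order and does not reproduce LSF's actual scheduling.

```shell
#!/bin/sh
# rank_hosts: hypothetical helper, not part of LSF.
# Reads lsload-style text on stdin and prints host names, least loaded first.
# Assumes the default lsload columns: $1=HOST_NAME $2=status $3=r15s $7=pg.
rank_hosts() {
    # Skip the header line, keep hosts whose status is "ok",
    # then sort numerically by r15s and break ties on pg.
    awk 'NR > 1 && $2 == "ok" { print $3, $7, $1 }' |
        sort -k1,1n -k2,2n |
        awk '{ print $3 }'
}

# Only call lsload when the LSF commands are on PATH.
if command -v lsload >/dev/null 2>&1; then
    lsload | rank_hosts
fi
```

The hosts at the top of the list are the ones with the shortest run queues, which is a reasonable first guess at where a newly submitted job might land.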

Question: How can I tell if LSF is running?

Answer: The lsid command will tell you if LSF is running on your system.
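
If you want to make the same check from a script, a minimal sketch follows. The lsf_is_running function name is made up for this example, and the sketch assumes that lsid returns a non-zero exit status when it cannot contact LSF.

```shell
#!/bin/sh
# lsf_is_running: hypothetical helper, not part of LSF.
# Returns 0 if lsid succeeds, 2 if lsid is not on PATH,
# and lsid's own non-zero status otherwise.
lsf_is_running() {
    # If lsid is missing, the LSF environment is probably not set up in this shell.
    if ! command -v lsid >/dev/null 2>&1; then
        echo "lsid not found on PATH" >&2
        return 2
    fi
    # Assumption: lsid exits non-zero when LSF does not answer.
    lsid >/dev/null 2>&1
}

if lsf_is_running; then
    echo "LSF is running"
else
    echo "LSF is not reachable from this host"
fi
```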

Question: How do I make LSF put the output from my job in a file rather than an email message?

Answer: By default, LSF includes all of the standard output (stdout) and standard error (stderr) from your job in the email message it sends you after your job finishes. However, you can force all standard output to a file by using the -o argument to the bsub command, and you can force all standard error to a file using the -e argument. For example, bsub -o my.out -e my.err ...

If you specify a -o argument but do not specify a -e argument, the standard error is merged with the standard output.

The output file created by the -o option to the bsub command normally contains job report information as well as the job output. If you want to separate the job report information from the job output, use the -N option to specify that the job report information should be sent by email.

The output files specified by the -o and -e options are created on the execution host.

To uniquely identify the output from your job, add the job number to the file name using the special %J variable.

For example: bsub -o out.%J ...
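
The same options can also be collected in a job script and submitted with bsub < job.sh, because bsub reads #BSUB lines from a script supplied on standard input. The sketch below is only an illustration: the job name demo, the file names, and the script body are placeholders.

```shell
#!/bin/sh
# Job name (placeholder).
#BSUB -J demo
# Standard output goes to a file tagged with the job number.
#BSUB -o out.%J
# Standard error goes to its own file instead of being merged with stdout.
#BSUB -e err.%J
# Send the job report by email so out.%J holds only the job output.
#BSUB -N

# LSB_JOBID is set by LSF inside a running job; fall back when run by hand.
job_id=${LSB_JOBID:-unknown}
echo "job $job_id running on $(uname -n)"
echo "this line goes to the -e file" >&2
```

Run by hand, the script simply prints to the terminal. Submitted through bsub, the two echo lines end up in out.%J and err.%J on the execution host.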

Question: How can I tell how busy the machines are?

Answer: The lsload command lists all the host machines with the length of their UNIX run queues, CPU utilization, a measure of swap activity, and various other load indices. The bhosts -w command shows how LSF has assigned processors to jobs; the -w option makes it report why a machine may be closed. The bqueues command shows the state of each queue, including how many jobs are pending, running, and suspended.
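
To look at all three views at once, the commands can be wrapped in a small script. This is a minimal sketch; the cluster_snapshot function name is made up for this example, and the script only runs the three commands named above in turn.

```shell
#!/bin/sh
# cluster_snapshot: hypothetical helper, not part of LSF.
# Runs lsload, bhosts -w and bqueues one after another under a heading each.
cluster_snapshot() {
    for cmd in "lsload" "bhosts -w" "bqueues"; do
        echo "== $cmd =="
        # $cmd is left unquoted on purpose so "bhosts -w" splits into command and option.
        $cmd 2>&1 || echo "($cmd failed or is unavailable)"
    done
}

cluster_snapshot
```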
