GNQS Job Submission

Queen is not intended for computational work. This is because computational jobs and interactive work tend to place very different requirements on a system. So on the hive they are separated from one another: queen runs the interactive stuff (including job submission) and the drones do the grunt work of running the jobs.
This means, though, that you have to use the job queue if you want to
run computational jobs, because that's the only way you're going to
get access to the computational machines. We use a system called
GNQS (Generic Network Queueing System). To use it you'll need to know
how to use the qsub command.

The simple way to submit jobs
The simplest way to submit a job with qsub is to type your commands in on the spot:

    % qsub
    source /prog/setup charmm
    charmm < file.inp > file.out
Once the job has been submitted, look for files named like this:

    STDIN.e123
    STDIN.o123

The number at the end will be the job number of the submission.
Generally errors will be in the .e file.

A bit more sophisticated

Instead of typing in your commands on the spot like above, you can create a file that looks something like this:

    #!/bin/tcsh
    ###########
    #
    # This is a job script for a charmm job.
    #
    source /prog/setup charmm
    charmm < input-file > output-file
If the above file were called charmm-script you would submit it like this:

    % qsub charmm-script

Putting the job in a script like this is useful if the job is a bit complex, or if you'll be running it (perhaps with minor variations) many times.

Specifying a queue

In the above examples the jobs are submitted without worrying about which queue they'll be run in. There are two queues, though: long and short.

The long queue is the default queue; unless you have a reason to choose the short queue this is where your jobs should go. There is essentially no time limit on the long queue (about a month, actually) and there are as many as six slots your job could run in.

The short queue is for quick little jobs where you need the results as soon as possible. These actually work by stealing processor time from long jobs, so the short queue shouldn't be used without reason. For this reason there are only three running slots available in the short queue. There's also a time limit of 8 hours: if your job won't finish in that time it shouldn't be in the short queue. Short jobs do run at a higher priority than long jobs though, so if you need something quickly and the system is already loaded this is an option.
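The rule of thumb above can be written down as a tiny helper. This is only a sketch: pick_queue is a made-up name, not something installed on the hive, and it assumes you can estimate your job's run time in whole hours.

```shell
#!/bin/sh
# pick_queue HOURS URGENT
# Hypothetical helper (not part of GNQS): prints "short" or "long".
#   HOURS  - estimated run time in whole hours (your own guess)
#   URGENT - "yes" if you need the results as soon as possible
pick_queue() {
    hours=$1
    urgent=$2
    # The short queue has an 8 hour limit and steals time from long jobs,
    # so only use it for urgent jobs that fit inside the limit.
    if [ "$urgent" = "yes" ] && [ "$hours" -le 8 ]; then
        echo short
    else
        echo long
    fi
}
```

For example, `pick_queue 2 yes` prints short, while a 12 hour job goes to the long queue no matter how urgent it is.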
To specify a queue when calling qsub, use the -q option:

    % qsub -q short charmm-script

More sophisticated yet
In order to run your job most efficiently, sometimes you need to use
the local disk of the drone you'll be running your job on. Running on
the local disk isn't usually necessary, though.
You can always access that disk directly via its path. Using the local-disk scripts (such as local-long) places a few demands on how you organize your files, though.

Typically a job will require input data and will output data as well (often into multiple files). So the first thing to do is to make sure that all your input data files are in the same directory (possibly organized into subdirectories). Ideally, you want only the files you need for the job to be in this directory, because it will be copied across the network at least twice.

For this to work properly it's important that you don't reference any of your data files via an absolute path (in files or in symbolic links). Otherwise the job will still access your files in the original location instead of the local copy.
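To make the copy-run-copy-back idea concrete, here is a rough sketch of that kind of workflow. It is not the hive's actual script: run_in_scratch and SCRATCH_ROOT are names invented for this illustration, and /tmp is only a stand-in for wherever the drone's local disk really lives.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the hive's real local-disk script.
# SCRATCH_ROOT is an assumed variable standing in for the local disk.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp}"

# run_in_scratch DATA_DIR COMMAND [ARGS...]
# Copies DATA_DIR to a private local directory, runs COMMAND there,
# then copies everything back and cleans up.
run_in_scratch() {
    data_dir=$1
    shift
    # Private working directory on the local disk.
    work_dir=$(mktemp -d "$SCRATCH_ROOT/job.XXXXXX") || return 1
    # First copy across the network: bring the inputs over.
    cp -R "$data_dir/." "$work_dir/" || return 1
    # Run inside the local copy so relative paths resolve there.
    ( cd "$work_dir" && "$@" )
    status=$?
    # Second copy across the network: send inputs and outputs back.
    cp -R "$work_dir/." "$data_dir/"
    rm -rf "$work_dir"
    return $status
}
```

Because the command runs from inside the copy, a relative name like file.inp picks up the local file, while an absolute path would reach back to the original location. That's the reason for the rule about absolute paths.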
So let's say that you have all your files in a single directory. You would run something like this:

    % local-long charmm-script /scratch/data
The nowan.XXXXzPmHKq
Your data will be removed from the local disk.

Checking the queue
To confirm that your job is running after you submit it you use the
qstat command:

    % qstat -d

This command will check the drones as well as queen for any jobs that you've submitted. You'll see output that looks something like this:

    Destination machine: drone1.med.jhmi.edu
    Destination machine: drone2.med.jhmi.edu
    Destination machine: drone3.med.jhmi.edu
    Request         I.D.    Owner    Queue     St
    --------------  ------- -------- --------  --
    STDIN           241     nowan    long-dest R

You can see that it checks drone1, drone2, and finally drone3. Since drone3 is running one of my jobs it's listed there. A number of characteristics are listed, but the most useful are the job number (241, in this example) and the queue (long-dest, which is simply an end-point of the long queue).
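If you only want the job numbers, a small filter can pull them out of a listing like the one above. This is a sketch under an assumption: it expects each job row to have exactly five whitespace-separated fields (name, number, owner, queue, state) as in the sample, and the real qstat layout may differ. The name job_ids is made up for this example.

```shell
#!/bin/sh
# job_ids OWNER
# Hypothetical helper: reads a qstat-style listing on standard input and
# prints the job number of every row owned by OWNER.
# Assumes rows look like: STDIN 241 nowan long-dest R
job_ids() {
    awk -v owner="$1" '
        # A job row has five fields, a numeric second field (the job
        # number), and the owner in the third field. Header, separator,
        # and "Destination machine" lines do not match.
        NF == 5 && $2 ~ /^[0-9]+$/ && $3 == owner { print $2 }
    '
}
```

It would be used as a filter, along the lines of `qstat -d | job_ids nowan`.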
You may also want to see what jobs others are running. To do this you
can add in the -a option:

    % qstat -da

This command will list any job that anyone is running, on any machine in the hive. If you're interested in seeing more information about jobs that are running try this for a long listing:

    % qstat -dal

Check the man page for more options, or an explanation of the output of the long listing:

    % man qstat

Deleting and killing jobs
When you submit a job it doesn't immediately start running. It will
sit on queen for a little bit while the queueing system decides which
machine to send the job to. Sometimes all the drones are full, so the
job stays on queen waiting for a drone to free up. When this happens
the job is said to be waiting, as opposed to running. If a job is
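waiting on queen you can delete it using the qdel command:

    % qdel 241

Where 241 is the job number. For a job that is already running on a drone, the kill form looks like this:

    % qdel -k 241.queen@drone3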
This document was created by Jeremy Hankins and is maintained by the Crystallography facility.