#PBS -N hwrf%STORMNUM%_post2_%CYC%
#PBS -j oe
#PBS -S /bin/bash
#PBS -q %QUEUE%
#PBS -A %PROJ%-%PROJENVIR%
#PBS -l walltime=03:00:00
#PBS -l place=vscatter,select=1:mpiprocs=24:ncpus=24:mem=10G
#PBS -l debug=true

export NODES=1
export TOTAL_TASKS=24

model=hwrf
export cyc="%CYC%"
%include <head.h>
%include <envir-p1.h>

export cyc="%CYC%"
export storm_num="%STORMNUM%"

# versions file for hwrf sets $model_ver and $code_ver

module load envvar/${envvar_ver}
module load PrgEnv-intel/${PrgEnv_intel_ver}
module load craype/${craype_ver}
module load intel/${intel_ver}
module load cray-pals/${cray_pals_ver}
module load libjpeg/${libjpeg_ver}
module load grib_util/${grib_util_ver}
module load wgrib2/${wgrib2_ver}
module load bufr/${bufr_ver}
module load hdf5/${hdf5_ver}
module load netcdf/${netcdf_ver}
# module load pnetcdf/${pnetcdf_ver}
module load udunits/${udunits_ver}
module load gsl/${gsl_ver}
module load nco/${nco_ver}
module load python/${python_ver}
module load cfp/${cfp_ver}
module list

${HOMEhwrf}/jobs/JHWRF_POST

%include <tail.h>

%manual

PURPOSE: Runs in parallel with the forecast/model/* jobs, converting
native WRF output files to native grid GRIB files.  The native grid
GRIB file is not used by the public; it is later read in by the
products job for further processing.

This job runs two streams of post-processing: 

  sat - generate synthetic satellite brightness temperatures by
    running the Community Radiative Transfer Model (CRTM) on the
    synthetic atmosphere generated by the WRF

  nonsat - all other GRIB-based products.

Both streams have to run from hours 0-126.  However, the sat stream is
six-hourly due to the extreme expense of the CRTM.  The nonsat stream
runs hourly from 0-9 and three-hourly from 9-126.  This is reflected
in the updates to the meters.
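
As an illustration only, the two schedules described above can be
listed with a short bash sketch.  The function names sat_hours and
nonsat_hours are hypothetical and are not part of the HWRF scripts,
and the sketch assumes that the three-hourly part of the nonsat
stream starts at hour 12, since hour 9 is already covered by the
hourly part.

```bash
#!/bin/bash
# Hypothetical helpers, not taken from the HWRF code base.

# sat stream: every six hours from 0 through 126.
sat_hours() {
    seq 0 6 126
}

# nonsat stream: hourly from 0 through 9, then every three hours
# through 126 (assumed to resume at hour 12).
nonsat_hours() {
    seq 0 1 9
    seq 12 3 126
}
```

Running "nonsat_hours | wc -l" shows how many nonsat forecast hours
the two post jobs share between them under these assumptions.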

The meters display the last forecast hour of each type known to have
completed.  That means, if hours 0-6 of sat are completed, post1 is
running 9, and post2 is running 12, the meter will show 6.
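
The meter rule can be sketched in bash as follows.  This is only an
illustration: meter_value is a hypothetical function, and it assumes
that the meter reports the last hour of the unbroken run of completed
hours at the start of the schedule.  The real job may compute the
value differently.

```bash
#!/bin/bash
# Hypothetical helper, not taken from the HWRF code base.
# Usage: meter_value "SCHEDULE" "COMPLETED"
# Both arguments are space-separated lists of forecast hours.
# Prints the last scheduled hour reached before the first hour that
# has not completed, or -1 if hour 0 has not completed.
meter_value() {
    local schedule="$1" completed="$2" last=-1 hour
    for hour in $schedule; do
        case " $completed " in
            *" $hour "*) last=$hour ;;   # this hour is done
            *) break ;;                  # first gap ends the run
        esac
    done
    echo "$last"
}
```

For example, meter_value "0 3 6 9 12" "0 3 6" prints 6, because hours
9 and 12 are still running.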

There are two copies of this job because that is how many are needed to
keep up with the forecast.  Both jobs, post1 and post2, run both
streams of the post.  They communicate with one another using lock
files and an sqlite3 database, to prevent duplication of work.
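
The idea behind the lock files can be sketched in bash.  This is not
the HWRF implementation, which also tracks state in the sqlite3
database; claim_hour and its lock naming are hypothetical.  The
sketch relies on the fact that mkdir without -p fails when the
directory already exists, so only one of two competing jobs can claim
a given stream and hour.

```bash
#!/bin/bash
# Hypothetical helper, not taken from the HWRF code base.
# Usage: claim_hour LOCKDIR STREAM HOUR
# Returns 0 if this caller claimed the work, nonzero if another job
# already holds the claim.
claim_hour() {
    local lockdir="$1" stream="$2" hour="$3"
    mkdir -p "$lockdir" || return 1
    # Creating the lock directory is the claim; it fails if present.
    mkdir "$lockdir/$stream.f$hour.lock" 2>/dev/null
}
```

A job would call claim_hour before processing an hour and skip that
hour when the call fails.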

TROUBLESHOOTING

Most failures of this job fall in three categories:

  - model failed
  - operator error
  - system issues

If this job failed, check the model first.  If the model is stuck or
failed, that is why the post1/2 job failed.

What do I mean by "operator error"?

* ALWAYS KILL AND REQUEUE THE ENTIRE forecast FAMILY to rerun the
    forecast model.  Never, under ANY circumstances, rerun
    just the model!

* If you need to rerun the post or products, KILL AND REQUEUE THE
    ENTIRE FAMILY so that the unpost runs first.

Both system issues and operator errors have caused a wide variety of
interesting problems in the post and products jobs.

As of the 2016 upgrade, the post and products jobs exit immediately
at even the smallest sign of error instead of retrying operations.
However, if you forget to requeue the entire post family, the system
cannot do much to overcome that operator error.

%end