#PBS -N hwrf%STORMNUM%_post2_%CYC%
#PBS -j oe
#PBS -S /bin/bash
#PBS -q %QUEUE%
#PBS -A %PROJ%-%PROJENVIR%
#PBS -l walltime=03:00:00
#PBS -l place=vscatter,select=1:mpiprocs=24:ncpus=24:mem=10G
#PBS -l debug=true

export NODES=1
export TOTAL_TASKS=24

model=hwrf
export cyc="%CYC%"
%include
%include

export cyc="%CYC%"
export storm_num="%STORMNUM%"

# versions file for hwrf sets $model_ver and $code_ver
module load envvar/${envvar_ver}
module load PrgEnv-intel/${PrgEnv_intel_ver}
module load craype/${craype_ver}
module load intel/${intel_ver}
module load cray-pals/${cray_pals_ver}
module load libjpeg/${libjpeg_ver}
module load grib_util/${grib_util_ver}
module load wgrib2/${wgrib2_ver}
module load bufr/${bufr_ver}
module load hdf5/${hdf5_ver}
module load netcdf/${netcdf_ver}
# module load pnetcdf/${pnetcdf_ver}
module load udunits/${udunits_ver}
module load gsl/${gsl_ver}
module load nco/${nco_ver}
module load python/${python_ver}
module load cfp/${cfp_ver}
module list

${HOMEhwrf}/jobs/JHWRF_POST

%include
%manual
PURPOSE:

Runs in parallel with the forecast/model/* jobs, converting native WRF
output files to native grid GRIB files.  The native grid GRIB file is
not used by the public; it is later read in by the products job for
further processing.

This job runs two streams of post-processing:

  sat - generate synthetic satellite brightness temperatures by
    running the Community Radiative Transfer Model (CRTM) on the
    synthetic atmosphere generated by the WRF

  nonsat - all other GRIB-based products.

Both streams have to run from hours 0-126.  However, the sat stream is
six-hourly due to the extreme expense of the CRTM.  The nonsat stream
runs hourly from 0-9 and three hourly from 9-126.  This is reflected
in the updates to the meters.  The meters display the last forecast
hour of each type known to have completed.  That means, if hours 0-6
of sat are completed, post1 is running 9, and post2 is running 12, the
meter will show 6.

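The stream schedules described above can be sketched in bash as
follows.  This is only an illustration of the stated hour ranges, not
the code the post actually runs; the function names sat_hours and
nonsat_hours are hypothetical, and the sketch assumes the seq utility
is available.

```shell
#!/bin/bash
# Illustrative sketch only: these function names are hypothetical and
# do not exist in the HWRF scripts.

# sat stream: six-hourly from hour 0 through hour 126.
sat_hours() {
    seq 0 6 126
}

# nonsat stream: hourly from hour 0 through hour 9, then three-hourly
# through hour 126.
nonsat_hours() {
    seq 0 1 9
    seq 12 3 126
}
```

Calling nonsat_hours prints one forecast hour per line: 0 through 9,
then 12, 15, and so on up to 126.
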
There are two copies of this job because that is how many are needed
to keep up with the forecast.  Both jobs, post1 and post2, run both
streams of the post.  They communicate with one another using lock
files and an sqlite3 database, to prevent duplication of work.

TROUBLESHOOTING

Most failures of this job fall into three categories:

  - model failed
  - operator error
  - system issues

If this job failed, check the model first.  If the model is stuck or
failed, that is why the post1/2 job failed.

What do I mean by "operator error"?

  * ALWAYS KILL AND REQUEUE THE ENTIRE forecast FAMILY to rerun the
    forecast model.  Never, under ANY circumstances, rerun just the
    model!

  * If you need to rerun the post or products, KILL AND REQUEUE THE
    ENTIRE FAMILY so that the unpost runs first.

Either category of issues, system or operator, has caused a wide
variety of interesting problems in the post and products jobs.  In the
2016 upgrade, the post and products jobs were changed to exit
immediately at even the smallest sign of error instead of retrying
operations.  However, if you forget to requeue the entire post family,
ultimately the system cannot do much to overcome that operator error.
%end