#PBS -N hwrf%STORMNUM%_products_%CYC%
#PBS -j oe
#PBS -S /bin/bash
#PBS -q %QUEUE%
#PBS -A %PROJ%-%PROJENVIR%
#PBS -l walltime=03:00:00
#PBS -l select=1:ncpus=48:mpiprocs=48:mem=18G
#PBS -l debug=true

export NODES=1
export TOTAL_TASKS=48
export NHC_PRODUCTS_NTHREADS=24

model=hwrf
export cyc="%CYC%"

%include
%include

export cyc="%CYC%"
export storm_num="%STORMNUM%"

# versions file for hwrf sets $model_ver and $code_ver

module load envvar/${envvar_ver}
module load PrgEnv-intel/${PrgEnv_intel_ver}
module load craype/${craype_ver}
module load intel/${intel_ver}
module load cray-pals/${cray_pals_ver}
module load libjpeg/${libjpeg_ver}
module load grib_util/${grib_util_ver}
module load wgrib2/${wgrib2_ver}
module load bufr/${bufr_ver}
module load hdf5/${hdf5_ver}
module load netcdf/${netcdf_ver}
# module load pnetcdf/${pnetcdf_ver}
module load udunits/${udunits_ver}
module load nco/${nco_ver}
module load python/${python_ver}
module load cfp/${cfp_ver}
module list

${HOMEhwrf}/jobs/JHWRF_PRODUCTS

%include

%manual
TASK products

PURPOSE: Runs in parallel with the rest of the forecast model, converting
native-grid WRF data into products useful to forecasters and the public.
Delivers native files to COM as required. Runs the tracker, issues
dbn_alerts, and delivers products.

Meters:
  gribber - last forecast hour that has completed all GRIB2 generation,
            delivery and alerting.
  tracker - last forecast hour the tracker has completed. Forecast hours
            past the end of the track are still counted in this bar.

Events:
  SentTrackToNHC - set immediately after the track file has been
                   delivered to NHC areas.

DETAILS:

In short, this is the main delivery job for the HWRF system. This job
runs multiple threads that work together through an sqlite3 database
system. It has restart capability, which means it will start where it
left off if it is killed and restarted. That means if you want it to
start from the beginning, you MUST first run the unpost job.
On the other hand, if the job died from a technical problem (downed node,
fire, etc.), then requeueing the job will cause it to start where it left
off. The only exception is the tracker component, which always starts
from the beginning.

The last step of this job runs a special threaded OpenMP program that
reads in several custom input files and generates the *.swath.grb2 file,
various NHC custom files, and the AFOS file. The AFOS file is emailed to
the SDM at the end of the job.

TROUBLESHOOTING

Most failures of this job fall into one of three categories:

  - a post job failed
  - operator error
  - system issues

If this job failed, check the post1 and post2 jobs first. If the post1/2
jobs are stuck or failed, that is why the products job failed.

What do I mean by "operator error"?

  * ALWAYS KILL AND REQUEUE THE ENTIRE forecast FAMILY to rerun the
    forecast model. Never, under ANY circumstances, rerun just the model!

  * If you need to rerun the post or products, KILL AND REQUEUE THE
    ENTIRE FAMILY so that the unpost runs first.

Either category of issue, system or operator, has caused a wide variety
of interesting problems in the post and products jobs. The 2016 upgrade
changed the post and products jobs to exit immediately at even the
smallest sign of error instead of retrying operations. However, if you
forget to requeue the entire post family, the system ultimately cannot
do much to overcome that operator error.
%end