/* *=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=* */
/* ** Copyright UCAR (c) 1992 - 2010 */
/* ** University Corporation for Atmospheric Research(UCAR) */
/* ** National Center for Atmospheric Research(NCAR) */
/* ** Research Applications Laboratory(RAL) */
/* ** P.O.Box 3000, Boulder, Colorado, 80307-3000, USA */
/* ** 2010/10/7 23:12:35 */
/* *=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=* */

PROPOSED DESIGN - DIDSS - DATA INGEST AND DISTRIBUTED SERVER SYSTEM
===================================================================

Mike Dixon, October 1998

Overview.
---------

This document describes the design of the new DIDSS system of data
ingest and handling. It includes data servers and a data 'mapper'
process which will keep track of data availability within a 'shared
data region'.

Shared data region.
-------------------

A shared data region is any networked region with access fast enough
for distributed access to work properly. Generally this means a fast
LAN, such as Ethernet, although slower links may be included in the
region if the performance across those links is deemed satisfactory.

Some projects are set up in such a way that each machine is a shared
data region. In such cases all data sets reside on each machine which
requires access to them.

Data types.
-----------

Any number of main data types may be used. For example, we might have:

  mdv   - data in MDV format
  spdb  - data in SPDB format
  titan - data in TITAN storm track format
  grib  - data in GRIB format
  etc.

Server Protocols.
-----------------

The server protocols will be derived from the data types, with 'p'
appended. So using the types above we would have:

  mdvp:   MDV data server protocol
  spdbp:  SPDB data server protocol
  titanp: TITAN data server protocol

Directory structure.
--------------------

The top of the data directory tree will be defined by the environment
variable $DIDSS_DATA_DIR.

The various main data types will be stored in subdirectories of
$DIDSS_DATA_DIR.
For example, you could expect to see the following subdirectories:

  $DIDSS_DATA_DIR/mdv   - all mdv data
  $DIDSS_DATA_DIR/spdb  - all spdb data
  $DIDSS_DATA_DIR/titan - all titan data
  $DIDSS_DATA_DIR/grib  - all grib data
  etc.

The various data sets would be stored in subdirectories of these main
directories. For example:

  $DIDSS_DATA_DIR/mdv/sat/vis        - visible satellite data
  $DIDSS_DATA_DIR/mdv/radarCart      - cartesian radar data
  $DIDSS_DATA_DIR/spdb/boundaries    - Colide-derived boundaries
  $DIDSS_DATA_DIR/spdb/ltg           - lightning
  $DIDSS_DATA_DIR/spdb/surface/metar - METARs
  $DIDSS_DATA_DIR/titan/tz30         - TITAN 30dBZ storm data

Putting data to disk.
---------------------

Data may be put to disk by any number of different processes.

The put function may optionally register information about the latest
data with the data mapper.

The put function may optionally have a distributed capability. In this
case the put function would contact a server to perform the put.

The put API should effectively hide the implementation, so that the
server functionality, if it is implemented, is transparent to the
programmer.

Getting data from disk.
-----------------------

The data may be read by any number of processes.

The get function may optionally have a distributed capability. In this
case the get function would contact a server to perform the get.

The get API should effectively hide the implementation, so that the
server functionality, if it is implemented, is transparent to the
programmer.

URL definition.
---------------

Data location will be specified using a URL-style address.

The URL syntax will be:

  protocol:translator:params//host:port:dir#args

The protocol and dir are required. The translator, params, host, port
and #args are optional. All delimiters except the # are required.

The optional translator, if present, defines the program to be used to
translate the data into the desired form.

The optional params, if present, defines the param file to be used by
the server or the translator.

If the host is missing, the data_mapper must be contacted to determine
the host.

If the port is present, contact with the server is made via this
alternative port. This will only be used if multiple users on the host
are running a server.

The #args is an open-ended mechanism for passing extra information to
the server API. It will probably not be used in the majority of cases.

URL examples are:

  Visible satellite MDV data from virga on the standard port:

    mdvp:://virga::sat/vis

  IR satellite MDV data - host to be determined from data_mapper:

    mdvp:://::sat/ir

  Visible satellite MDV data from virga using server on port 11000:

    mdvp:://virga:11000:sat/vis

  Colide boundaries SPDB data on babinet:

    spdbp:://babinet::boundaries/colide

  TITAN tz30 SPDB data, using alternative server on port 14900, the
  Titan2Symprod translation process running from the mult_forecasts
  parameter file:

    spdbp:Titan2Symprod:mult_forecasts//vil:14900:titan/tz30

Data mapper.
------------

The data mapper is a process which listens on a well-known port, and
receives and serves out information on data sets within the shared
data region.

Up to 2 data mappers may run in a shared data region. The primary
mapper will run on $DATAMAP_HOST, and the secondary on $DATAMAP_HOST2.
The information stored on the secondary will mirror that on the
primary. The rationale for 2 mappers is redundancy - if the primary
host fails the secondary will be available.

The data mapper will store the following information on each data set:

  host:        host on which the data resides
  user:        user who owns data
  data_type:   major data type, e.g. mdv
  dir:         subdirectory for data store
  latest_time: the time of the latest data put to the store. This is
               not necessarily the most recent data in the store,
               because in playback-type mode the latest data may
               precede other data.
  start_time:  start time of data in the store
  end_time:    end time of data in the store

Data mapper clients.
--------------------

A number of clients will use the data mapper for gathering information
on the data.

(a) get functions: these use the data mapper to resolve the host name
    if it is missing in the URL.

(b) data monitors: these processes compare the information in the data
    mapper with a list of expected data sets to provide information on
    the status of the data sets. For example, a process may provide
    information on which data sets are late by comparing the data
    mapper information with a list of data sets and the expected
    frequencies for data arrival.

Simple servers.
---------------

A simple server is one which never requires any translation processes.
For example, if all gridded data sets are stored in MDV format, the
data may be served out using the MdvServer without translation.

A simple server starts up in default mode, without reading any
parameter file, and then listens on its port for requests. The port
may be overridden via the command line.

When a simple server receives a request, it creates a new thread of
execution either using a thread call or by spawning a child, depending
upon implementation. The thread uses the URL to find the relevant
directory. If the params option is set, it reads the params file. It
then serves out the data and exits.

Translating servers.
--------------------

A translating server is one which sometimes requires a translation
process. For example, sometimes SPDB data must be translated into
graphical objects before being served out to a display.

A translating server acts just like a simple server for any requests
for which the translator option is not set.

If the translator option is set, the server checks whether it has
already started the translation process, using the required params
file if the params option is also set. If necessary, it starts the
translator. It then passes the request to the translator, which
processes it and passes back the product data. The product data is
then served to the client.

Translators.
------------

The translator processes are themselves servers. When spawned, a
translator reads the relevant param file, if required, and then handles
any requests passed to it from the translating server. Each request is
handled in a separate thread.
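
To make the URL syntax and the simple/translating server distinction
more concrete, the following is a minimal sketch, in Python, of how a
DIDSS URL might be split into its components. It is an illustration
only and not part of the design: the names DidssUrl, parse_didss_url,
needs_data_mapper and needs_translator are hypothetical, and the
sketch assumes that the port is numeric and that dir is everything
after the second ':' following the '//'.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DidssUrl:
    """Components of protocol:translator:params//host:port:dir#args.

    Hypothetical container for illustration; optional fields that are
    empty in the URL are stored as None.
    """
    protocol: str
    directory: str
    translator: Optional[str] = None
    params: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None
    args: Optional[str] = None

    @property
    def needs_data_mapper(self) -> bool:
        # A missing host means the data mapper must resolve it.
        return self.host is None

    @property
    def needs_translator(self) -> bool:
        # A translating server only starts a translator if this is set.
        return self.translator is not None


def parse_didss_url(url: str) -> DidssUrl:
    """Split a DIDSS URL string into its components (assumed rules)."""
    # The '#' delimiter is the only optional one, so strip it first.
    body, _, args = url.partition("#")

    # Everything before '//' is protocol:translator:params.
    head, sep, tail = body.partition("//")
    if not sep:
        raise ValueError(f"missing '//' delimiter in URL: {url!r}")

    head_parts = head.split(":")
    # Limit the split so the directory keeps any later ':' characters.
    tail_parts = tail.split(":", 2)
    if len(head_parts) != 3 or len(tail_parts) != 3:
        raise ValueError(f"missing ':' delimiter in URL: {url!r}")

    protocol, translator, params = head_parts
    host, port, directory = tail_parts
    if not protocol or not directory:
        raise ValueError(f"protocol and dir are required: {url!r}")

    return DidssUrl(
        protocol=protocol,
        directory=directory,
        translator=translator or None,
        params=params or None,
        host=host or None,
        port=int(port) if port else None,  # ValueError if not numeric
        args=args or None,
    )
```

Under these assumptions, parsing mdvp:://::sat/ir would give a URL
with no host, so a get function would first consult the data mapper,
while spdbp:Titan2Symprod:mult_forecasts//vil:14900:titan/tz30 would
carry both a translator and a params name, so a translating server
would hand the request to the Titan2Symprod translator.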