!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!                                                                   !!
!!                    GNU General Public License                    !!
!!                                                                   !!
!! This file is part of the Flexible Modeling System (FMS).          !!
!!                                                                   !!
!! FMS is free software; you can redistribute it and/or modify       !!
!! it and are expected to follow the terms of the GNU General Public !!
!! License as published by the Free Software Foundation.             !!
!!                                                                   !!
!! FMS is distributed in the hope that it will be useful,            !!
!! but WITHOUT ANY WARRANTY; without even the implied warranty of    !!
!! MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the     !!
!! GNU General Public License for more details.                      !!
!!                                                                   !!
!! You should have received a copy of the GNU General Public License !!
!! along with FMS; if not, write to:                                 !!
!!          Free Software Foundation, Inc.                           !!
!!          59 Temple Place, Suite 330                               !!
!!          Boston, MA  02111-1307  USA                              !!
!! or see:                                                           !!
!!          http://www.gnu.org/licenses/gpl.txt                      !!
!!                                                                   !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!-----------------------------------------------------------------------
!                 Communication for message-passing codes
!
! AUTHOR: V. Balaji (V.Balaji@noaa.gov)
!         SGI/GFDL Princeton University
!
! This program is free software; you can redistribute it and/or modify
! it under the terms of the GNU General Public License as published by
! the Free Software Foundation; either version 2 of the License, or
! (at your option) any later version.
!
! This program is distributed in the hope that it will be useful,
! but WITHOUT ANY WARRANTY; without even the implied warranty of
! MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
! GNU General Public License for more details.
!
! For the full text of the GNU General Public License,
! write to: Free Software Foundation, Inc.,
!           675 Mass Ave, Cambridge, MA 02139, USA.
!-----------------------------------------------------------------------

module mpp_mod
!a generalized communication package for use with shmem and MPI
!will add: co_array_fortran, MPI2
!Balaji (V.Balaji@noaa.gov) 11 May 1998

! CONTACT: V. Balaji (V.Balaji@noaa.gov)
!
! OVERVIEW:
!
! mpp_mod is a set of simple calls to provide a uniform interface
! to different message-passing libraries. It can currently be
! implemented either in the SGI/Cray native SHMEM library or in the
! MPI standard. Other libraries (e.g. MPI-2, Co-Array Fortran) can be
! incorporated as the need arises.
!
! The data transfer between a processor and its own memory is based
! on load and store operations upon memory. Shared-memory systems
! (including distributed shared memory systems) have a single address
! space, and any processor can acquire any data within the memory by
! load and store. The situation is different for distributed parallel
! systems. Specialized MPP systems such as the T3E can simulate shared
! memory by direct data acquisition from remote memory. But if the
! parallel code is distributed across a cluster, or across the Net,
! messages must be sent and received using the protocols for
! long-distance communication, such as TCP/IP. This requires a
! ``handshaking'' between nodes of the distributed system. One can
! think of the two different methods as involving puts or gets
! (e.g. the SHMEM library), or, in the case of negotiated
! communication (e.g. MPI), sends and recvs.
!
! The difference between SHMEM and MPI is that SHMEM uses one-sided
! communication, which can have very low-latency, high-bandwidth
! implementations on tightly coupled systems. MPI is a standard
! developed for distributed computing across loosely coupled systems,
! and therefore incurs a software penalty for negotiating the
! communication. It is, however, an open industry standard, whereas
! SHMEM is a proprietary interface. Besides, the puts or gets on which
! SHMEM is based cannot currently be implemented in a cluster
! environment (there are recent announcements from Compaq that
! occasion hope).
!
! The message-passing requirements of climate and weather codes can be
! reduced to a fairly simple minimal set, which is easily implemented
! in any message-passing API. mpp_mod provides this API.
!
! Features of mpp_mod include:
!
! 1) Simple, minimal API, with free access to the underlying API for
!    more complicated operations.
! 2) A design aimed at typical use in climate/weather CFD codes.
! 3) Performance not significantly lower than that of any native API.
!
! This module is used to develop higher-level calls for domain
! decomposition and parallel I/O.
!
! Parallel computing is initially daunting, but it soon becomes second
! nature, much the way many of us can now write vector code without
! much effort. The key insight required while reading and writing
! parallel code is arriving at a mental grasp of several independent
! parallel execution streams through the same code (the SPMD model).
! Each variable you examine may have different values for each stream,
! the processor ID being an obvious example. Subroutines and function
! calls are particularly subtle, since it is not always obvious from
! looking at a call what synchronization between execution streams it
! implies. An example of erroneous code would be a global barrier call
! (see mpp_sync below) placed within a code block that not all PEs
! will execute, e.g.:
!
!   if( pe.EQ.0 )call mpp_sync()
!   
!
! Here only PE 0 reaches the barrier, where it will wait indefinitely.
! While this is a particularly egregious example to illustrate the
! coding flaw, more subtle versions of the same are among the most
! common errors in parallel code.
!
! It is therefore important to be conscious of the context of a
! subroutine or function call, and of the synchronization it implies.
! There are certain calls here (e.g. mpp_declare_pelist, mpp_init,
! mpp_malloc, mpp_set_stack_size) which must be called by all PEs.
! There are others which must be called by all the PEs in a subset
! (here called a pelist), e.g. mpp_max, mpp_sum, mpp_sync. Still
! others imply no synchronization at all. I will make every effort to
! highlight the context of each call in the MPP modules, so that the
! implicit synchronization is spelled out.
!
! For performance it is necessary to keep synchronization as limited
! as the algorithm being implemented will allow. For instance, a
! single message between two PEs should only imply synchronization
! across the PEs in question. A global synchronization (or barrier)
! is likely to be slow, and is best avoided. But codes first
! parallelized on a Cray T3E tend to have many global syncs, as very
! fast barriers were implemented there in hardware.
!
! Another reason to use pelists is to run a single program in MPMD
! mode, where different PE subsets work on different portions of the
! code. A typical example is to assign an ocean model and an
! atmosphere model to different PE subsets, and couple them
! concurrently instead of running them serially. The MPP module
! provides the notion of a current pelist, which is set when a group
! of PEs branches off into a subset. Subsequent calls that omit the
! pelist optional argument (seen below in many of the individual
! calls) assume that the implied synchronization is across the current
! pelist. The calls mpp_root_pe and mpp_npes also return the values
! appropriate to the current pelist. The mpp_set_current_pelist call
! is provided to set the current pelist.
!
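! A minimal sketch of this usage (illustrative only: the half/half PE
! split and the variables pe, npes, i and a are assumed from the
! surrounding context, much as in the test program at the end of this
! file):
!
!   call mpp_declare_pelist( (/(i,i=0,npes/2-1)/) )  !all PEs must make this call
!   if( pe.LE.npes/2-1 )then
!       !only the PEs in the pelist branch off into the subset
!       call mpp_set_current_pelist( (/(i,i=0,npes/2-1)/) )
!       call mpp_sum( a(1:1000), 1000 )  !sum (and sync) across the current pelist only
!   end if
!   call mpp_set_current_pelist()        !revert to the global pelist
!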
!
! F90 is a strictly typed language, and the syntax pass of the
! compiler requires matching of type, kind and rank (TKR). Most calls
! listed here use a generic type, shown here as MPP_TYPE_. This is
! resolved in the pre-processor stage to any of a variety of types.
! In general the MPP operations work on 4-byte and 8-byte variants of
! integer, real, complex and logical variables, of rank 0 to 5,
! leading to 48 specific module procedures under the same generic
! interface. Any of the variables below shown as MPP_TYPE_ is treated
! in this way.
!
use mpp_parameter_mod, only : MPP_VERBOSE, MPP_DEBUG, ALL_PES, ANY_PE, NULL_PE
use mpp_parameter_mod, only : NOTE, WARNING, FATAL, MPP_CLOCK_DETAILED, MPP_CLOCK_SYNC
use mpp_parameter_mod, only : CLOCK_COMPONENT, CLOCK_SUBCOMPONENT, CLOCK_MODULE_DRIVER
use mpp_parameter_mod, only : CLOCK_MODULE, CLOCK_ROUTINE, CLOCK_LOOP, CLOCK_INFRA
use mpp_data_mod,      only : request, mpp_record_timing_data
use mpp_comm_mod,      only : mpp_chksum, mpp_max, mpp_min, mpp_sum, mpp_transmit
use mpp_comm_mod,      only : mpp_send, mpp_recv, mpp_broadcast, mpp_malloc
use mpp_comm_mod,      only : mpp_init, mpp_exit, mpp_set_stack_size
#ifdef use_MPI_GSM
use mpp_comm_mod,      only : mpp_gsm_malloc, mpp_gsm_free
#endif
use mpp_util_mod,      only : stdin, stdout, stderr, stdlog, lowercase, uppercase
use mpp_util_mod,      only : mpp_error, mpp_error_state, mpp_set_warn_level, mpp_sync
use mpp_util_mod,      only : mpp_sync_self, mpp_pe, mpp_node, mpp_npes, mpp_root_pe
use mpp_util_mod,      only : mpp_set_root_pe, mpp_declare_pelist, mpp_get_current_pelist
use mpp_util_mod,      only : mpp_set_current_pelist, mpp_clock_begin, mpp_clock_end
use mpp_util_mod,      only : mpp_clock_id, mpp_clock_set_grain

character(len=128), public :: version= &
     '$Id mpp.F90 $'
character(len=128), public :: tagname= &
     '$Name$'

!--- public parameters ------------------------------------------------
public :: MPP_VERBOSE, MPP_DEBUG, ALL_PES, ANY_PE, NULL_PE, NOTE, WARNING, FATAL
public :: MPP_CLOCK_SYNC, MPP_CLOCK_DETAILED, CLOCK_COMPONENT, CLOCK_SUBCOMPONENT
public :: CLOCK_MODULE_DRIVER, CLOCK_MODULE, CLOCK_ROUTINE, CLOCK_LOOP, CLOCK_INFRA

!--- public data from mpp_data_mod -------------------------------------
public :: request

!--- public interface from mpp_util_mod --------------------------------
public :: stdin, stdout, stderr, stdlog, lowercase, uppercase, mpp_error, mpp_error_state
public :: mpp_set_warn_level, mpp_sync, mpp_sync_self, mpp_set_stack_size, mpp_pe
public :: mpp_node, mpp_npes, mpp_root_pe, mpp_set_root_pe, mpp_declare_pelist
public :: mpp_get_current_pelist, mpp_set_current_pelist, mpp_clock_begin, mpp_clock_end
public :: mpp_clock_id, mpp_clock_set_grain, mpp_record_timing_data

!--- public interface from mpp_comm_mod --------------------------------
public :: mpp_chksum, mpp_max, mpp_min, mpp_sum, mpp_transmit, mpp_send, mpp_recv
public :: mpp_broadcast, mpp_malloc, mpp_init, mpp_exit
#ifdef use_MPI_GSM
public :: mpp_gsm_malloc, mpp_gsm_free
#endif

end module mpp_mod

#ifdef test_mpp
#ifdef SYSTEM_CLOCK
#undef SYSTEM_CLOCK
#endif

program test
!test various aspects of mpp_mod
#include <fms_platform.h>
#ifdef sgi_mipspro
  use shmem_interface
#endif
  use mpp_mod, only : mpp_init, mpp_exit, mpp_pe, mpp_npes, mpp_root_pe, stdout
  use mpp_mod, only : mpp_clock_id, mpp_clock_begin, mpp_clock_end, mpp_sync, mpp_malloc
  use mpp_mod, only : mpp_declare_pelist, mpp_set_current_pelist, mpp_set_stack_size
  use mpp_mod, only : mpp_broadcast, mpp_transmit, mpp_sum, mpp_max, mpp_chksum, ALL_PES
#ifdef use_MPI_GSM
  use mpp_mod, only : mpp_gsm_malloc, mpp_gsm_free
#endif
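!NOTE: the mpp calls exercised below are generic interfaces resolved by
!type, kind and rank (see the MPP_TYPE_ discussion in mpp_mod above);
!the same generic names handle both the scalar call mpp_max(a(1)) and
!the rank-1 call mpp_sum(a(1:1000),1000) used in this test. A minimal
!sketch of that resolution, with illustrative variables only (x4 and x8
!are not part of this test):
!
!   real(4) :: x4(1000)
!   real(8) :: x8
!   call mpp_sum( x4, size(x4) )  !resolves to a 4-byte, rank-1 procedure
!   call mpp_max( x8 )            !resolves to an 8-byte, scalar procedure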
  implicit none
  integer, parameter :: n = 1048576
  real, allocatable, dimension(:) :: a, b, c
#ifdef use_MPI_GSM
  real :: d(n)
  pointer (locd, d)
#else
  real, allocatable, dimension(:) :: d
  integer(LONG_KIND) :: locd
#endif
  integer :: tick, tick0, ticks_per_sec, id
  integer :: pe, npes, root, i, j, k, l, m, n2, istat
  real :: dt

  call mpp_init()
  call mpp_set_stack_size(3145746)
  pe   = mpp_pe()
  npes = mpp_npes()
  root = mpp_root_pe()

  call SYSTEM_CLOCK( count_rate=ticks_per_sec )

  allocate( a(n), b(n) )
  id = mpp_clock_id( 'Random number' )
  call mpp_clock_begin(id)
  call random_number(a)
  call mpp_clock_end  (id)
!---------------------------------------------------------------------!
!             time transmit, compare against shmem_put and get        !
!---------------------------------------------------------------------!
  if( pe.EQ.root )then
      print *, 'Time mpp_transmit for various lengths...'
#ifdef SGICRAY
      print *, 'For comparison, times for shmem_get and shmem_put are also provided.'
#endif
      print *
  end if
  id = mpp_clock_id( 'mpp_transmit' )
  call mpp_clock_begin(id)
  !timing is done for cyclical pass (more useful than ping-pong etc)
  l = n
  do while( l.GT.0 )
     !--- mpp_transmit -------------------------------------------------
     call mpp_sync()
     call SYSTEM_CLOCK(tick0)
     do i = 1,npes
        call mpp_transmit( put_data=a(1), plen=l, to_pe=modulo(pe+npes-i,npes), &
                           get_data=b(1), glen=l, from_pe=modulo(pe+i,npes) )
!        call mpp_sync_self( (/modulo(pe+npes-i,npes)/) )
     end do
     call mpp_sync()
     call SYSTEM_CLOCK(tick)
     dt = real(tick-tick0)/(npes*ticks_per_sec)
     dt = max( dt, epsilon(dt) )
     if( pe.EQ.root )write( stdout(),'(/a,i8,f13.6,f8.2)' )'MPP_TRANSMIT length, time, bw(Mb/s)=', l, dt, l*8e-6/dt
!#ifdef SGICRAY
!     !--- shmem_put ----------------------------------------------------
!     call mpp_sync()
!     call SYSTEM_CLOCK(tick0)
!     do i = 1,npes
!        call shmem_real_put( b, a, l, modulo(pe+1,npes) )
!     end do
!     call mpp_sync()
!     call SYSTEM_CLOCK(tick)
!     dt = real(tick-tick0)/(npes*ticks_per_sec)
!     dt = max( dt, epsilon(dt) )
!     if( pe.EQ.root )write( stdout(),'( a,i8,f13.6,f8.2)' )'SHMEM_PUT length, time, bw(Mb/s)=', l, dt, l*8e-6/dt
!     !--- shmem_get ----------------------------------------------------
!     call mpp_sync()
!     call SYSTEM_CLOCK(tick0)
!     do i = 1,npes
!        call shmem_real_get( b, a, l, modulo(pe+1,npes) )
!     end do
!     call SYSTEM_CLOCK(tick)
!     dt = real(tick-tick0)/(npes*ticks_per_sec)
!     dt = max( dt, epsilon(dt) )
!     if( pe.EQ.root )write( stdout(),'( a,i8,f13.6,f8.2)' )'SHMEM_GET length, time, bw(Mb/s)=', l, dt, l*8e-6/dt
!#endif
     l = l/2
  end do
!---------------------------------------------------------------------!
!                         test mpp_sum                                !
!---------------------------------------------------------------------!
  if( pe.EQ.root )then
      print '(/a)', 'Time mpp_sum...'
  end if
  a = real(pe+1)
  call mpp_sync()
  call SYSTEM_CLOCK(tick0)
  call mpp_sum(a(1:1000),1000)
  call SYSTEM_CLOCK(tick)
  dt = real(tick-tick0)/ticks_per_sec
  dt = max( dt, epsilon(dt) )
  if( pe.EQ.root )write( stdout(),'(a,2i4,f9.1,i8,f13.6,f8.2/)' ) &
       'mpp_sum: pe, npes, sum(pe+1), length, time, bw(Mb/s)=', pe, npes, a(1), n, dt, n*8e-6/dt
  call mpp_clock_end(id)
!---------------------------------------------------------------------!
!                         test mpp_max                                !
!---------------------------------------------------------------------!
  if( pe.EQ.root )then
      print *
      print *, 'Test mpp_max...'
  end if
  a = real(pe+1)
  print *, 'pe, pe+1 =', pe, a(1)
  call mpp_max( a(1) )
  print *, 'pe, max(pe+1)=', pe, a(1)
  !pelist check
  call mpp_sync()
  call flush(stdout(),istat)
  if( npes.GE.2 )then
      if( pe.EQ.root )print *, 'Test of pelists: bcast, sum and max using PEs 0...npes-2 (excluding last PE)'
      call mpp_declare_pelist( (/(i,i=0,npes-2)/) )
      a = real(pe+1)
      if( pe.NE.npes-1 )call mpp_broadcast( a, n, npes-2, (/(i,i=0,npes-2)/) )
      print *, 'bcast(npes-1) from 0 to npes-2=', pe, a(1)
      a = real(pe+1)
      if( pe.NE.npes-1 )then
          call mpp_set_current_pelist( (/(i,i=0,npes-2)/) )
          id = mpp_clock_id( 'Partial mpp_sum' )
          call mpp_clock_begin(id)
          call mpp_sum( a(1:1000), 1000, (/(i,i=0,npes-2)/) )
          call mpp_clock_end (id)
      end if
      if( pe.EQ.root )print *, 'sum(pe+1) from 0 to npes-2=', a(1)
      a = real(pe+1)
      if( pe.NE.npes-1 )call mpp_max( a(1), (/(i,i=0,npes-2)/) )
      if( pe.EQ.root )print *, 'max(pe+1) from 0 to npes-2=', a(1)
  end if
  call mpp_set_current_pelist()

#ifdef use_CRI_pointers
!---------------------------------------------------------------------!
!                        test mpp_chksum                              !
!---------------------------------------------------------------------!
  if( modulo(n,npes).EQ.0 )then  !only set up for even division
      n2 = 1024
      a = 0.d0
      if( pe.EQ.root )call random_number(a(1:n2))
!      if( pe.EQ.root )call random_number(a)
      call mpp_sync()
      call mpp_transmit( put_data=a(1), plen=n2, to_pe=ALL_PES, &
                         get_data=a(1), glen=n2, from_pe=root )
!      call mpp_transmit( put_data=a(1), plen=n, to_pe=ALL_PES, &
!                         get_data=a(1), glen=n, from_pe=root )
      m = n2/npes
!      m = n/npes
      allocate( c(m) )
      c = a(pe*m+1:pe*m+m)
      if( pe.EQ.root )then
          print *
          print *, 'Test mpp_chksum...'
          print *, 'This test shows that a whole array and a distributed array give identical checksums.'
      end if
      print *, 'chksum(a(1:1024))=', mpp_chksum(a(1:n2),(/pe/))
      print *, 'chksum(c(1:1024))=', mpp_chksum(c)
!      print *, 'chksum(a)=', mpp_chksum(a,(/pe/))
!      print *, 'chksum(c)=', mpp_chksum(c)
  end if
  !test of pointer sharing
#ifdef use_MPI_GSM
  call mpp_gsm_malloc( locd, sizeof(d) )
#else
  if( pe.EQ.root )then
      allocate( d(n) )
      locd = LOC(d)
  end if
  call mpp_broadcast(locd,root)
#endif
  if( pe.EQ.root )then
      call random_number(d)
  end if
  call mpp_sync()
  call test_shared_pointers(locd,n)

#ifdef use_MPI_GSM
  call mpp_gsm_free( locd )
#else
  if( pe.EQ.root )then
      deallocate( d )
  end if
#endif
#endif

  call mpp_exit()

contains

  subroutine test_shared_pointers(locd,n)
    integer(LONG_KIND), intent(in) :: locd
    integer :: n
    real :: dd(n)
    pointer( p, dd )
    p = locd
    print *, 'TEST_SHARED_POINTERS: pe, locd=', pe, locd
    print *, 'TEST_SHARED_POINTERS: pe, chksum(d)=', pe, mpp_chksum(dd,(/pe/))
    return
  end subroutine test_shared_pointers
end program test
#endif test_mpp