Status Report of FNAL (July 15, 2013)

- (Soon) Decomposed the full tracking kernel by physics process and model,
  to evaluate performance and to validate the physics output.
- Profiled the corresponding CPU code with Open|SpeedShop and studied the results for future optimizations.
- (Philippe) Finished the broker, which can handle multiple tasks.
  The number of memory areas is the number of tasks plus 2.  Tasks are scheduled when their buffers are full or when they have not received data for a while (a rough sketch of this rule follows below).
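  A minimal sketch of that scheduling rule, assuming a simple per-task buffer; names and types are illustrative only and are not the actual broker code:

    // Illustrative sketch only - not the actual broker implementation.
    // One buffer per task (plus two spares); a task is dispatched when its
    // buffer fills up or when it has been idle longer than a timeout.
    #include <chrono>
    #include <cstddef>
    #include <vector>

    struct TaskBuffer {
        std::vector<int> trackIds;                       // pending work items
        std::chrono::steady_clock::time_point lastFill;  // last time data arrived
    };

    bool shouldDispatch(const TaskBuffer& b, std::size_t capacity,
                        std::chrono::milliseconds idleTimeout)
    {
        auto idle = std::chrono::steady_clock::now() - b.lastFill;
        return b.trackIds.size() >= capacity ||
               (!b.trackIds.empty() && idle > idleTimeout);
    }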

- (John) A summer student is optimising sequential (voxel) navigation.  Will prepare a task on simplified 'vector-like' navigation in voxels for him, hoping he can create a first sequential implementation in early August (from which a corrected GPU implementation can be derived).

Status Report of FNAL (Mar 18, 2013)

 I. Memory allocation for secondaries produced by EM processes on the device

o test configuration
   - thread organization: 32 blocks x 128 threads
   - one step for 100K primary tracks
   - maximum number of secondaries per track and per step = 2
   - generate secondaries randomly for 100K tracks per event
     i.e., <number of secondaries>/step/event ~ 100K

o memory transaction

  1) allocate a fixed amount of device memory as global storage, and
     directly write each secondary using a global atomic counter
     (see the sketch after this list);
     maximum amount of device memory per step = 2*ntracks*sizeof(track)
  2) allocate a fixed amount of device memory as global storage, write
     all secondaries produced by each track to a dynamically allocated
     (temporary) memory, copy the secondaries to the global storage, and free
     the temporary memory
  3) dynamic memory allocation/reallocation on a per-track basis,
     using two separate kernels:
     i) first kernel - generate secondaries, store them in global memory,
        send information about the stack (number of secondaries, addresses) to the CPU
     ii) second kernel - allocate the required device memory and copy the array
        of stacks to the allocated memory
  4) dynamic memory allocation/reallocation on a per-thread basis,
     using a temporary amount of local memory per thread =
     (nTracks*maxSecondaryPerStep)/(blockDim.x*gridDim.x)
  5) dynamic memory allocation/reallocation on a per-block basis,
     using a temporary amount of shared memory per block =
     (nTracks*maxSecondaryPerStep)/(gridDim.x)
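
  Below is a hedged CUDA sketch of option 1 (a preallocated global buffer of size
  2*ntracks filled through a global atomic counter).  The GXTrack layout, the kernel
  name and the fixed nsec = 2 are assumptions for illustration, not the prototype's
  actual code.

    // Sketch of option 1: fixed global buffer + global atomic counter.
    struct GXTrack { float x, y, z, px, py, pz, E; };   // illustrative layout only

    __global__ void generateSecondaries(const GXTrack* tracks, int ntracks,
                                        GXTrack* secondaries, int* counter)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = tid; i < ntracks; i += stride) {
            int nsec = 2;  // placeholder: the physics model decides, at most 2 per step
            for (int s = 0; s < nsec; ++s) {
                int slot = atomicAdd(counter, 1);  // reserve one slot in the fixed buffer
                secondaries[slot] = tracks[i];     // placeholder: real code fills secondary kinematics
            }
        }
    }

  With the buffer sized at 2*ntracks, the counter can never run past the end as long
  as each track produces at most 2 secondaries per step.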

o memory throughput: (100K tracks)/1 ms = 6.4 GB/sec (i.e., about 64 bytes per track)
   method     GPU [ms]          CPU [ms]
     1           1.5              35
     2          70                50
     3         275 (250 + 25)     60
     4         130 (100 + 30)     60
     5          60 (30 + 30)      60

   For methods 3-5, the GPU time is shown as (first kernel + second kernel).

II. Implementation of G4EmStandardPhysics for "e-": 4 Processes

1. G4eBremsstrahlung process (Done) 
   - G4SeltzerBergerModel (E<1GeV) - xsec-data: brem_SB
   - G4eBremsstrahlungRelModel (E>1GeV)
    lambda table only for PbWO4, xsec-data for all atoms up to Z=92
2. G4eIonisation
   - G4MollerBhabhaModel (Moller e-e-, Bhabha e+e-)
3. G4eMultipleScattering
   - G4UrbanMscModel95 (E<100 MeV)
   - G4WentzelVIModel (E>100 MeV)
4. G4CoulombScattering


First notes from Vector/GPU meeting, Mon  4 March 2013, 17:30-19:00 CET
Editor: J. Apostolakis
First version: Thu 7 Mar 2013, 17:00 CET
----------------- DRAFT ------------ DRAFT -------------- DRAFT -------------------
Changes by: JA

Items to discuss
- Challenges, stumbling blocks
- Next steps

Reports on progress
- Andrei Gheata

   Wrote a vectorised version of the method which determines whether a point is inside a Box.
Simply using one method call for many boxes, in place of multiple function calls, gave a speedup of about 2.5.  Vectorisation provided an additional speedup, to a factor between 3 and 4.  The vector size used was 1024.
  Next will try intrinsics and potentially Vc.
 Action> Check speedup in cases of modest 'vector' size, e.g. 16
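
  A rough illustration of such a batched call (one call testing a point against many
  boxes).  Names are hypothetical, not the prototype's interface; boxes are assumed
  axis-aligned, centred at the origin and stored as half-widths in a structure-of-arrays
  layout so that the loop vectorises well.

    // Hypothetical batched 'inside' test: one call over n boxes instead of
    // n separate virtual calls.
    #include <cmath>

    void ContainsBatch(const double* halfX, const double* halfY, const double* halfZ,
                       double px, double py, double pz, int n, bool* inside)
    {
        for (int i = 0; i < n; ++i) {
            inside[i] = std::fabs(px) <= halfX[i] &&
                        std::fabs(py) <= halfY[i] &&
                        std::fabs(pz) <= halfZ[i];
        }
    }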

 - Philippe Canal / Soon Jun
  Examined two ways to handle secondaries:
  i) use a fixed-size buffer, and atomic operations to write secondaries into it;
 ii) create memory dynamically.
  Keep the full information on secondaries on the GPU and send information on their addresses to the CPU.

 [ JA Question: Not sending the bulk of the information about secondaries to the CPU could reduce the volume sent from GPU to CPU. Is the very limited space/RAM on the GPU an issue? ]
 [ PhC: Memory is not that scarce on the GPU.  Our current board has 8GB; the next generation (NVidia K20) should have more. ]

[ JA2> What uses do lists of secondaries have in current simulations? ]

  Issue: Sending output information to the CPU requires the CPU to know the amount of memory required (as it will initiate the transfer).  This requires an extra communication step, with an associated (latency) cost.
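
  A common way to handle this is a two-step transfer: the CPU first copies back the
  counter, then copies exactly that many secondaries; the first, small copy is the
  extra communication step mentioned above.  The sketch below uses standard CUDA
  runtime calls; the function name and the GXTrack layout are illustrative assumptions.

    // Illustrative two-step transfer from device to host.
    #include <cuda_runtime.h>

    struct GXTrack { float x, y, z, px, py, pz, E; };  // same illustrative layout as above

    int fetchSecondaries(const GXTrack* d_secondaries, const int* d_counter,
                         GXTrack* h_secondaries)
    {
        int nsec = 0;
        // step 1: small copy of the counter (the extra round trip / latency cost)
        cudaMemcpy(&nsec, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
        // step 2: copy exactly nsec secondaries back to the host
        cudaMemcpy(h_secondaries, d_secondaries, nsec * sizeof(GXTrack),
                   cudaMemcpyDeviceToHost);
        return nsec;
    }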

  Time for creating secondaries from 100K interactions:
  Fixed:           1.5 ms 
  Dynamic:  300.0 ms 

  Plan to use a 2nd kernel to reorganize tracks on GPU.

  How many secondaries are created per step?  Assume a maximum of 2 secondaries (see Appendix, Issue 1).

 JA> Need to clarify the existing functionality:
        Is the set of secondaries kept on the GPU until exhausted?
        Is there just one batch of tracks sent from CPU to GPU, or are several new tracks sent?

  Can we quantify the cost of sending back secondaries (and surviving particles) to the CPU?
  [ The above could be a first step in an alternative: from this, an optimized version which sends back only exiting particles, or some other selection, can be created. ]

  Potential scenarios:  
    1) Keep all tracks already sent on the GPU - this was the originally envisioned scenario. The cost is that tracks are in many different parts of the geometry, so the GPU must keep navigation-history information and there is reduced commonality between tracks in an SM; the gain is a significant reduction in data traffic.
    2) Send tracks in a set of logical volumes to the GPU.  Tracks that are created outside the current volume (or this set?) are sent back to the CPU.

  Issue> What data must be transferred for a secondary particle, and what for a survivor?
    Secondary: particle type, event number/batch, position, momentum, energy   
    Survivor:  same ?
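
  If secondaries and survivors end up carrying the same fields, a single payload layout
  could serve both.  A hypothetical sketch, with field names taken from the list above
  (not an agreed data format):

    // Hypothetical per-track payload for GPU->CPU transfer, covering both
    // secondaries and survivors; fields follow the list above.
    struct TrackPayload {
        int   particleType;   // particle type (e.g. PDG code or internal id)
        int   eventId;        // event number / batch
        float position[3];
        float momentum[3];
        float energy;
    };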

  Note> The history of particle creation (process, parent type, etc.) is sometimes important for some analyses.  We will ignore this potential requirement for the time being.

  Issue> How to create compatible geometries between different simulation engines (G4, Vector, GPU)?
  Proposed solutions: 
   1) Use consistent indexing for Logical Volumes, Physical volumes (Replica Numbers) - Preferred
   2) Use common navigation engine, potentially VMC - significant complexity (Federico)
   3) Share future implementation of geometry (starting with USolids) - not yet available

  Issue> Should NavigationHistory information be transferred from GPU to CPU?  How to transfer it efficiently?  (Andrei and John to discuss and summarize.)

How to proceed ?
* Agreed to use GDML to transfer the geometry model between simulation engines (Vector, GPU, G4).

Actions> 
- Proposal for numbering logical volumes and physical volumes in the GDML file and reading it into all simulation engines (Federico, John, Philippe)
- How to transfer the stack of volumes / navigation-history information between CPU and GPU efficiently? (Andrei, John)

- Next meeting: March 18th
    Request for updates to be presented with short written summaries


Appendix
Issue 1> How many secondaries are created per step?

[JA] After consulting with an EM expert: the EM processes can produce a maximum of 2 secondaries and a survivor.  Ignoring atomic de-excitation (which complicates matters a lot and is not used in HEP production), here is a short list:
    2 secondaries, 1 survivor:   mu to mu e+e-
    2 secondaries, no survivor:  gamma conversion, positron annihilation
    1 secondary,   1 survivor:   ionisation, bremsstrahlung

----------------- DRAFT ------------ DRAFT -------------- DRAFT -------------------


First notes from Vector/GPU meeting, Wed 21 February 2013
Editor: J. Apostolakis
First version: Thu 21 Feb 2013, 11:00 CET
Changes by: 

Goals in next 3 months:
- Make first estimates of potential speedup from vectorisation
- Estimate the effect of secondary production from GPU on speedup potential
Target in 12-18 month timeframe:
- Comparable semi-realistic benchmarks of Vector and GPU approaches, using simplified ECAL geometry and small sets of physics processes.

Agreements reached:
- Agreed on extending the Vector prototype with vectorised implementations of two solid types: Box and Trd.
- Agreed to work with the existing Basketiser/Scheduler and create two adaptors:
    i) a class to provide input taken from Geant4;    Person: John A.
   ii) a class to broker between the Basketiser/Scheduler and the GPU ("GPU Broker");   Person: Philippe C.
          (first version to dispatch even small baskets; potential second version to aggregate work)
  iii) enhance one electron GPU process by adding generation of secondaries;   Person: Soon J.

- Add a milestone to interface the USolids library to the Vector prototype (to precede the proposed vectorisation of the USolids classes).

Actions
1> Make first estimates of effort for these deliverables. (Responsible person/s for each one)

2> Review and revise the lists of further proposed deliverables, to enable us to consider goals and further milestones at the next meeting (John, Philippe)

3> Identify what data must be passed between CPU and GPU (Andrei, Soon)
     a) passed for each track to the GPU.  A first list:
             event Id, event slot # (?), trackId, position, momentum, energy, volume Id
     b) returned from GPU to CPU:
             surviving tracks, secondaries, hits

4> Find a new day and time for the meeting; will create a poll with proposals. (Daniel)
     In case a new day/time is not yet found, the next meeting will be Wed 6 March at 16:15 CET

Open questions:
- Creating one central list of agreed tasks/milestones
- If tracks can remain on the GPU (as well as the CPU), we must identify the master element which decides when an event is finished.
- How to estimate the effect of missing components on results? (And when will it be possible?)