Deos Multicore Design


Introduction

The purpose of the Deos multicore project is to enable Deos applications to execute on multicore processors. The following describes the state of the Deos multicore design as of April 2015.

The project is guided by the following principles.

Principles

  1. Unless there is a compelling reason, multicore semantics should be the same as single-core Deos semantics.
    1. Ideally, it would be possible to run an application designed for multicore Deos on a uni-core processor.
  2. Guaranteed budget is more important than slack availability.
  3. All design decisions should be applicable to all targets (e.g., x86 and PowerPC).
  4. Maximize backward compatibility.
    1. Most important for applications, but also for BOOT and the PAL.
  5. If possible, do not require separate uni-core and multicore kernels.
    1. This is mostly a cost containment issue.

Technical overview

The primary observation is that in a time-partitioned multicore environment there is a high degree of application execution-time co-dependence, due to the degree of resource sharing in a multicore processor: most notably memory and cache, although the system bus and devices are also a concern. Thus if application A might run at the same time as application B, the timing analyses for A and B are co-dependent. An unconstrained environment with N applications requires on the order of N² analyses, which is prohibitively expensive. Furthermore, if any application changes, then all analyses must be repeated. Clearly that is unacceptable, so the key is to constrain the set of applications that can execute simultaneously.

The chosen solution is to divide the timeline into windows of time. Each window assigns a scheduler to each of the cores. Each thread is assigned to a scheduler, which in turn constrains which threads can run simultaneously on the other cores.

The other key principle is that interference should only be caused by the application itself. An application that is accessing semaphore S1 can tolerate interference from another thread accessing S1; however, accesses to an unrelated semaphore S2 should not induce any blocking. This principle is referred to as Application Induced Blocking (AIB).

Backward Compatibility

The Deos multicore kernel is derived from the Deos 8.3 kernel including support for memory pools.

Baseline Features

653 Windows

The Deos kernel API and PPI supporting ARINC 653 windows are being re-evaluated in light of the changes required for multicore. At present the window APIs exist in the kernel, but some fields are #Not_Supported.

Memory Pools

No changes; however, see #Alloc_Dealloc_Not_Supported.

Not Supported

The phrase Not Supported means that the capability does not work. While it may be possible to invoke a not-supported service or situation, the results are undefined: it may work sometimes, or it may crash the kernel. That is, not-supported services are either unimplemented or incompletely implemented.

Alloc Dealloc Not Supported

Memory allocation and deallocation, or more generally any changes to a process' virtual address space, currently have several issues:

  1. Such accesses have non-AIB blocking, specifically they depend on global cross core locks.
    • This will be removed in a future release.
  2. Changes to a VAS are not propagated to other cores until a context switch.
    • This is a permanent situation.
    • The current documentation (UG/DDD) does not reflect this situation.
  3. De-allocation may leave dangling references that might cause corruption.
    • This will be fixed in a future release
    • This can easily cause a kernel halt or other undefined behavior.

Note that allocation and deallocation are not restricted to virtualAllocDEOS() and virtualFreeDEOS(); these issues also affect envelopes, writeProcessMemory() to frozen pages, mapViewOfKernelFile(), SMO manipulation, etc.


Differences from Previous kernels

The following are the key differences from the uni-core kernel. The description is in the same order as the existing Deos User's Guide.

Supported Targets

Only Intel x86 and PowerPC are supported for the multicore kernel at this time.

System Startup and Shutdown

Executable File Format

No changes.

API

The following kernel services are being evaluated for possible changes as a side effect of the multicore design. What they have in common is that all return pointers to kernel data structures that are neither constants nor process-local data.

getSystemInfoDeos()
processRemainingQuota()
startOfPeriodTickValueArray()
systemRemainingQuota()
systemTickPointer()
windowActivationTableInfo()
windowInfo()

TBD: Does the level E debugger support library export any functions that meet the above qualification?

Using Named Objects

No changes

Process Services

No API changes are anticipated.

Creating and deleting processes that have threads on other cores is supported in kernel version 9999.1.5 (Mar 2015). However, all processes that are dynamically created or deleted must have a main thread on the same core as all other dynamically created or deleted processes. Furthermore, the thread that is creating the processes must also be on that same core. This restriction will be lifted in a future release.

Furthermore, when a child process is created from parent quota, the main threads of the parent and child must be in the same scheduler, not just on the same core.

Thread Services

No API changes are anticipated. Core specification is in the registry.

Allocation of thread budgets is validated for threads created in the same scheduler instance as the main thread. For other threads, the budget is assumed to be available and is not taken from the process, which may result in an oversubscription of the CPU. This accounting will be fixed in a future release of the kernel. All budgets are enforced at runtime, and exceeding the budget during a period will result in a timeBudgetExceeded exception.

setCPUBudgetTransferRecipient() is restricted to threads with the same #CoreScheduler.

Indefinite waits work on multicore as they do on single-core. For short-duration waits, when the resource is made ready, released threads on other cores are not raised in priority, but they will detect the resource availability as on uni-core. Slack-waiting threads on the same core will be readied immediately, but cross-core slack-waiting threads may not be readied until an interval boundary is crossed.

The above changes are expected to be permanent.

Thread States
Slack Scheduling

TODO: Need to fully evaluate the changes in the scheduler related to all the overview sections. Ideally there would be no changes, however the changes in #Scheduling need to be carefully evaluated to ensure application developers understand the new constraints.

Thread Coordination Services

Mutexes

Mutexes are constrained to only be associated with a single #Scheduler_Instance. This is a permanent restriction.

The implication of the above is that mutexes cannot be used for synchronization of threads that execute on different cores.

Events and Semaphores

No API changes are anticipated for semaphores. No API changes are expected for events within a single scheduler. The impact of cross scheduler events is being investigated.

Inter-Process Communication

See #API, #Memory_Pools, #Imprecise_Waiters, and #Short_Duration_Wait.

Exception Handling

No API changes are anticipated. However, at this time raising an exception to a different thread is restricted to a single exception, and that exception may not be noticed until the recipient thread next transitions from kernel mode to user mode (e.g., returns from an API call). If the recipient thread has another exception raised to it in the interim (e.g., timeBudgetExceeded, or it calls raiseExceptionToThread() on itself), the first exception may be lost.

TODO: The final disposition of this situation is under discussion and may change in a future release.

Interrupt Handling

TODO: Interrupt windows are not yet implemented. At this time, interrupts should be on the master core in scheduler instance 0.

Platform Resource Services

See #API.

BIT Services

No API changes are anticipated. BIT tests will be restricted in some manner to only execute on a single core, perhaps the master core, and all other cores will be required to be #Quiescent. BIT services are currently #Not_Supported.

Kernel Attribute Services

No changes are anticipated. However, warmstart is currently #Not_Supported.

File Services

No changes are anticipated. However see #Alloc_Dealloc_Not_Supported.

Library Services

No changes are anticipated. However see #Alloc_Dealloc_Not_Supported.

Virtual Memory Services

No changes are anticipated. However see #Alloc_Dealloc_Not_Supported.

Development Environment Support

System Information Services

See #API

The kernel's scheduling and process creation history queues are being changed to core-specific data structures. This is being done to minimize memory interference between cores. To user code, this change will manifest itself via backward-compatible changes to the results returned from the systemInformationBlock() and processInformationBlock() kernel services.

Also, see #653_Windows.

Debugger Support Services

No API changes are anticipated.

Platform Abstraction Layer

  1. What mutual exclusion properties apply?

    Only one function that "must be called in a critical" will be called at a time; otherwise there are no assurances beyond what the API description defines.

  2. Being in a critical is no longer the same thing as having interrupts disabled. Many functions say "must be called within a critical" when in fact they require interrupts to be disabled, and vice versa. This will have to be a code review change, and we need to decide whether the PAL will be responsible for this or whether we want the kernel to handle it. Note the precondition difference between the enter/exit functions in core.h and cpu.h.
  3. At this point we are assuming LLKMI will not be supported on multicore, and perhaps not at all any more.
  4. We need a clear definition of the differences between interrupts, master interrupts, and being in a critical.

TODO: Historically there was a correlation between disabling interrupts and entering of critical sections. That relationship is now slightly altered. Specifically we need to clarify the difference between interrupts disabled, master interrupts disabled, and #Locks.

TODO: The multicore kernel uses ELF "weak symbols" to enable backward compatible operation on PALs that don't provide the new multicore PPIs. If window scheduling is retained we may remove the recently added register* KERNPPI functions.

New Kernel Interfaces
 DEOSBASEAPI UNSIGNED32 DEOSKERNPPI currentCoreIndex(void);
 DEOSBASEAPI void DEOSKERNPPI raiseIPI(void); 
New PAL Interfaces
 DEOSBASEAPI void DEOSPALPPI sendIPI(SIGNED32 targetCore); 
Changes to PAL and PRL Interfaces

The PAL must supply a thread timer per core. The PAL and PRLs must ensure that all core specific data structures are maintained properly.

The following table shows what functions in the PAL and potentially in PRLs can be called by the kernel on master and slave processors.

TODO: In the 9999.1.5 kernel, window start interrupts are delivered to the master core; in the future they will need to be delivered to all cores.

    FLASHeraseBlock                  Any, although kernel will ensure that only one
    FLASHnumPagesPerBlock            FLASH API is active at a time across all cores.
    FLASHupdate32
    FLASHwrite

    PALVideoMemoryAddress            master only
    PALcoldstart                     master only
    PALidleFunction                  all cores(?)  If so, must be "quiet" on slave cores
    PALwarmstart                     all cores  "master first, then one core at a time"
    frameSynchLostTickIndex          master
    maskPlatformInterrupt            all  However any specific interrupt will only be
                                     masked/unmasked on one core at a time.  IWI on
                                     master, non-IWI on the core where
                                     raisePlatformInterrupt() should be called.
                                     TODO: the mask/unmask assurances are not yet implemented.
    pollForSystemTick                master
    powerLossDuration                master
    timerTimeRemaining               all (must have unique timer per core)
    timerWrite                       all 
    unmaskPlatformInterrupt          all.  See maskPlatformInterrupt
    waitForNextSystemTick            master. Change to waitForWindowStart
    whichCPUInLRU                    any
    setActiveWAT		     master
    windowTimerTimeRemaining         all (TBD: master only?)
    windowTimerWrite                 master (TBD: master only?)
    kernelExtensionsHandler          all
    setCriticalLevel                 all

The following table shows what kernel PPI functions are callable from the PAL and PRLs.

    raiseTimer                       all
    raiseWindowTimer                 all
    raisePowerTransient              master - TODO: all?
    raisePlatformInterrupt           all (see unmaskPlatformInterrupt)
    numberOfPlatformInterrupts       all
    platformInterruptKind            all
    maximumSchedulingPriority        all
    frameSynchronizationLost         master?
    mapPhysicalAddress               master
    mapPhysicalAddress64             master
    allocateRAM                      master
    registerKernelExtensionsHandler  master
    exitCritical                     all
    enterCritical                    all
    virtualAttributesPPI             all
    logSystemEventPPI                all
    registerInterruptControl         master
    registerSetActiveWAT             master
    set[Logical]InterruptHandler     master

Note that raisePowerTransient() and frameSynchronizationLost() are #Not_Supported at this time.

Boot Changes

The "logical core number" is a new parameter to kernel entry. This was done in a "binary backward compatible, but not source code compatible" manner.

Logical core number must be zero based without skipping any numbers.

Defined new bootiface members: maxLogicalCoreIndex and currentBootCore. It is likely that the maxLogicalCoreIndex field will be removed in a future update. Boot is expected to start all applicable cores. Currently the kernel uses maxLogicalCoreIndex != 0 to indicate that the kernel is to be "multicore"; this will likely be changed to a processorSubArchitecture flag in the future.

Multicore Boot Sequence

  1. At power on, BOOT starts the master, initializes the boot interface object, then releases the slave cores. Kernel entry is called on the master and all slaves in parallel.
  2. Boot must not change the boot interface object once any core has entered the kernel until all cores have exited.
  3. The kernel runs the master until just before calling waitForNextSystemTick().
  4. The kernel releases the slave cores one at a time to just before entering PALidleFunction() on each.
  5. The master calls waitForNextSystemTick(), releases the slave cores, and then normal scheduling ensues on all cores.

Slave cores are released one at a time, primarily because it is simpler, but also because the time for each core is very small. This also permits certain optimizations on x86 (see x86/thrd_con_arch.cpp). The currentBootCore member of the boot interface object is used to sequence the above.

The single threaded startup behavior of the kernel is not visible to the BOOT or PAL and may change in the future to permit more parallelism.

Note that the master and "one or more" slave cores are in a critical in the above.

It is still TBD whether PALidleFunction() will run on the slave cores long term.

Multicore Shutdown Sequence

  1. Some core initiates a shutdown (this might be restricted to master in the future), e.g., by calling setKernelAttributes(), or via raisePowerTransient().
  2. All (other) cores are told to shutdown (using an IPI).
  3. All cores return to boot in arbitrary order (potentially in parallel).
  4. Boot must wait for all cores to exit the kernel before re-entering the kernel.
    1. There are additional requirements boot must satisfy, e.g., synchronizing any required updates to the boot interface object, but those requirements are true for any kernel entry.

One way for boot to accomplish the above is to keep a "boot count" global variable. On first power up, the master core initializes it to zero. Each core snapshots that global value into a local variable prior to booting. After the kernel returns, each slave core waits for the global variable to become different than the local cached value, at which point the core can update its local copy with the new global value and call the kernel. The master core updates the global value just before it reenters the kernel.

Status Monitor Support

The status monitor will support some new/updated commands to provide insight into the multicore kernel.

R S I (Report System Information)

Updated to include how many cores are configured in the system.

R S S (Report System Schedule)

Updated to display the core number for events in the scheduling history log.

R S E (Report System Events)

Updated to display the core number for events in the dynamic process event log.

R T E (Report Thread Events)

Updated to display the core number for events in the thread history.

R T X (Report Thread eXceptions)

Updated to display the core number for events in the exception history.

R T G (Report Thread Group (Core/Scheduler))

Updated to display the core and scheduler index for threads.


New Design Issues

The multicore Deos kernel is a single #Kernel_Instance managing one or more processor cores. All cores share the same kernel global data structures. The kernel requires that the processor cores form a shared-memory multiprocessor. Specifically, memory accesses on different cores are required to be coherent, and reads and writes are required to be atomic for all properly aligned machine-defined storage units, normally 8-bit, 16-bit, and 32-bit bytes/words.

The Deos kernel has always been multi-threaded, even when executing in privileged mode. The uni-core kernel uses a single critical section to manage mutual exclusion, which leads to non-AIB interference. The multicore Deos kernel instead uses fine-grained locks. The #Locks are categorized in a manner which enables deadlock avoidance and bounded critical sections, and ensures #Application_Induced_Blocking only. The current prototype kernel has not fully deployed the locks and may exhibit non-AIB interference characteristics.

The primary design issues for multicore Deos are:

  1. #Ensuring consistent machine state between the cores
  2. #Dealing with true parallelism
  3. #Scheduling

Ensuring consistent machine state between the cores

Most multicore processor architectures require software support to maintain consistency between the processor states on the cores. All processor state, including all registers and processor-defined memory tables, needs to be analyzed to determine what form of consistency is required. In some cases consistency means "the same value" on all cores (e.g., the kernel code space); in others, "the same within some tolerance" (e.g., the timestamp counter); and sometimes it means "can be completely different" (e.g., the current thread). Note that this also affects the behavior of applications. The current classification and characterization of state is described in #State_Management_and_Coordination. The details are still in some degree of flux, especially the names of the categories.

Ensuring consistency also involves establishing the initial consistent state. This task is shared between the kernel and BOOT, and to a lesser extent the PAL. The main challenge is permitting all cores to perform initialization (and finalization) activity that requires mutual exclusion, but doing so prior to the time when the runtime mutual exclusion mechanism has been initialized. See #Boot_Changes for an overview.

Finally, there is a need to implement cross core critical sections. See #Enter_Critical_Algorithm. Note that a particularly interesting sub-case is dealing with BIT, which can temporarily invalidate the normal state consistency management assumptions. E.g., by changing what would normally be considered constants, such as kernel code.

Dealing with true parallelism

In the original Deos design, certain algorithms took advantage of the fact that if a thread was running in privileged mode during a critical section then, by definition, all other threads had to be suspended. These algorithms require changes which have not yet been fully implemented. The March 2015 kernel addresses some of these issues, most notably enabling #Exception_Handling, but more cases exist.

True parallelism also affects the kernel when it is manipulating hardware-defined paging structures (e.g., page tables) or data structures referenced by core-specific hardware handlers (e.g., TLB handlers on PowerPC). If one core is updating a page table entry while a different core's hardware is accessing the same entry, extreme care must be exercised to ensure that the other core does not see an inconsistent intermediate state. Note that there are typically no synchronization primitives that can be used to coordinate this situation. See #TLB/page_tables.

True parallelism also affects users, and consequently a slightly different definition of application correctness is needed. See #State_Management_and_Coordination for our current definition.

Scheduling

The following new concepts were introduced in the multicore scheduler:

  1. Window scheduler
  2. Scheduler instances
  3. More than one "current thread"
  4. Application Induced Blocking
  5. Locks

It is important to understand that most kernel APIs are not affected by the addition of multicore. All manipulations of process and system state already had to be thread safe, so the change primarily affects the scheduler. The scheduler was rewritten to be multicore aware: effectively, what was a single monolithic scheduler with a single "ready list" was restructured into several cooperating #Scheduler_Instances. A further observation was that activating the schedulers at a period boundary is effectively identical to the activity required to manage ARINC 653 windows, so the scheduler was further split into two pieces, a #Window_Scheduler and a Thread Scheduler, the latter abbreviated to just #Scheduler.

Short Duration Wait

In Deos the timeout value waitShortDuration is used to yield between threads of the same priority. If the thread is scheduled again without the resource being available, it reports a timeout status. This allows threads to ping-pong while sharing a resource and maintain their fixed RMA budgets. In many client-server relationships, budget transfer was used in conjunction with a short-duration wait.

In the multicore kernel the semantics of short duration wait have not changed. However, there is the possibility that the related threads may be configured to execute in different #Scheduler_Instances. In this case budget transfer cannot be utilized, and when yielding the short duration wait may not have any threads ready to execute at the same priority. This may result in a timeout status before the thread on the other scheduler/core has a chance to execute and make the resource available.

Therefore, the application design must account for this possibility. One approach is to loop, calling the wait API with either waitNoWait or waitShortDuration, until it does not time out or some other condition (such as the period identification) has changed.

A thread waiting short duration in the same #Scheduler_Instance will be readied as soon as the resource is available. A thread in a different #Scheduler_Instance, however, must wait for the other same-priority threads in that scheduler that are also polling with short-duration waits to time out or become ready before it is scheduled and can detect that its resource is available.

See #Imprecise_Waiter for more details about waiting between #Scheduler_Instances.

Imprecise Waiter

In the multicore kernel, pulsing an event can potentially ready threads in an effectively unbounded number of #Scheduler_Instances. Clearly, visiting N scheduler instances is not possible in a single time-bounded critical section, and breaking an event pulse into multiple critical sections is very complex. Instead, each event is assigned an "owning #Scheduler_Instance" matching the thread that created the event. Given a thread T that is waiting for an event E, if the owning #Scheduler_Instance for T is different from the owning scheduler instance for E, then T is called an Imprecise Waiter of E.

When a thread does an imprecise wait, rather than being added to the list of threads to be readied when the event is pulsed, the thread is added to an imprecise waiters list in the thread's owning #Scheduler_Instance and the event is marked as having an imprecise waiter.

If there are imprecise waiters when a pulse is performed, every core in the system is sent an #Inter_Processor_Interrupt. On receipt of the IPI, the schedulers put all threads on their imprecise waiter list onto their runnable list, and the threads determine if the resource they were waiting for has become ready.

The same situation does not apply to semaphores: signaling a semaphore releases only one thread, whereas pulsing an event releases all waiters. Once the appropriate thread has been readied, the kernel determines whether the thread is a member of the current scheduler on the current core. If it is, the thread may be scheduled immediately; if it is in another scheduler, the thread may not be scheduled until the next window activation.

TODO: We need to provide guidance to the user for this situation. E.g., an imprecise-waiting thread may consume its fixed budget and return eventInsufficientTime due to unrelated event waits; the act of integrating other applications can affect the behavior of existing applications.

TODO: Need a better solution.

Since a #Scheduler_Instance may not be current when an event is pulsed, threads on the imprecise wait list are readied whenever the scheduler instance is activated due to a window change.

When a semaphore is cleared or deleted, all of the waiters must be unblocked. In this case the semaphore is marked as not usable and the kernel releases one thread at a time, allowing for preemption between each thread.

Currently, runAllWaiters loops through the waiting list and the slackWaitingList associated with the semaphore. After each thread, it enables and disables interrupts to prevent an O(n) critical section. This is necessary because these lists contain threads from all schedulers, so the list cannot simply be merged into one scheduler's ready list. In normal semaphore operation only one thread is pulled off at a time, so that thread can be added to its scheduler's list without needing a merge. The performance cost of multiple preemptions was not a concern, since clearing and/or deleting a semaphore is considered an atypical operation; the priority was keeping the longest critical time down. However, depending on what the other longest criticals are, this may be changed to release a small batch of threads (e.g., 5-10) before allowing preemption.

Application Induced Blocking

With concurrent execution on multiple cores, it is crucial to understand the interference which may need to be accounted for in a thread's budget to ensure sufficient worst-case execution time. The multicore kernel executes on behalf of the thread running on a core, and any time it spends is allocated to that thread. Therefore, the kernel should not introduce any blocking or interference that the application does not expect. This is referred to as Application Induced Blocking (AIB). AIB is acceptable when the impact is confined to the threads involved in it.

For example, if two threads are waiting on and signalling the same semaphore, then that is an application coupling where interference is expected. However, if another thread is signalling a different semaphore or pulsing an event, then the application would not expect interference and that would be non-AIB interference.

The multicore kernel is designed to eliminate non-AIB interference imposed by Deos algorithms. Refer to #Locks for more information on how the kernel blocking works.

Locks

The Deos kernel has many data structures which must be protected from concurrent access in order to ensure data integrity and consistency. The simplest approach would be to use a single critical section to protect the kernel data, but this would lead to non-AIB interference. Instead the multicore kernel uses a finer grained locking strategy.

Smaller locks increase the intellectual overhead and introduce the potential for livelock and deadlock. To prevent these, the locks are aggregated into a few categories, and within a "core local critical" at most one lock may be held at each category level, acquired in a specified order. If a second lock is required at the same level, the algorithm must release all of its locks, exit the critical section, and then enter a new critical section, in order to maintain bounds on the worst-case critical section times. Locks are not permitted to be held outside of a core local critical section.

The categories are defined as:

Thread Private
Protects data that must be accessible to the current thread.
Scheduler Private
Protects data used for scheduling which must always be available to a non-SMP scheduler.
KIO Public
Protects data associated with kernel interface objects (e.g., semaphores, processes, threads).
Scheduler Public
Protects data associated with cross scheduler readying of threads which will then occur at the next scheduler activation.
Miscellaneous
Protects other data which has not migrated yet to the new designs.

The Miscellaneous category is envisioned to be eliminated or merged with Scheduler Public once all of the kernel algorithms have eliminated non-AIB interference.

The categories above are listed in the order in which the locks must be obtained. For the scenario of a thread signalling a semaphore that will ready another thread, the following sequence of locks is taken.

  1. The thread private lock is always implicitly held.
  2. Enter the core critical.
  3. Lock the private data for the thread's scheduler.
  4. Lock the semaphore.
  5. Determine the thread to release.
    1. If it is in the same scheduler, it can be readied immediately using the private data structures already locked.
    2. If it is in another scheduler, lock the public data for the released thread's scheduler, add the thread, and unlock the public data.
  6. Unlock the semaphore.
  7. If the released thread is in the current scheduler, perform scheduling in case it is higher priority. If scheduling occurs, the critical section, scheduler private lock, and thread private lock are migrated to the new current thread, and will be migrated back once the preempted thread resumes.
  8. Unlock the private scheduler lock.
  9. Exit the core critical.

Every time an explicit lock must be obtained, there is a potential for blocking. Assigning a thread to a scheduler which does not support SMP ensures that the first two categories of locks are always immediately obtainable. The third level of lock depends on a specific kernel interface object; if blocking occurs there, multiple threads require access to the same kernel interface object, and this is a point in the application design where #Application_Induced_Blocking may occur. At the fourth level, the time for which the lock is held is expected to be minimal. Applications that are designed to coordinate across #Scheduler_Instances are indicating that this is an acceptable overhead at that point in their processing.

Critical Section Bound Formulas

The length of time that the kernel spends within a critical section impacts the latency of platform interrupts and the start of the next window activation. The crittime kernel will be updated in the future to support measuring the times that encompass the different values in the formula. There are two cases:

  • No nested locking occurs, or
  • A nested lock is also acquired.

In source code form:

 Lock(outer);
 do stuff;
 UnLock(outer);

or, for the nested lock case:

 Lock(outer);
 Pre nested lock code; // Part of Execution[i]
   Lock(inner);
   do stuff;
   UnLock(inner);
 Post nested lock code; // Part of Execution[i]
 UnLock(outer);

At each nested lock level there is some work which is performed before optionally obtaining the next-level lock, and again after the nested lock is released; these are the pre- and post-nested-lock code above, collectively referred to as Execution. Furthermore, since a lock may be held by another core, there might be some delay acquiring the lock. This delay is called the Latency. We define Duration as the amount of time for which a lock is held; as the formula below indicates, it depends on the latency to acquire a nested lock and the duration spent within the nested lock. Therefore, Duration, graphically depicted, is as follows:

  PreExe   Latency to get nested lock   Duration of nested lock   PostExe 

Since Execution = PreExe + PostExe, and knowing that eventually nesting must stop, we get the recursively defined formula for the Duration for which a lock is held:

 Duration[i] =  Exe[i]  +  Latency[i+1]  +  Duration[i+1] 

where i has the following values:

  1. Core Critical
  2. Scheduler Private
  3. KIO
  4. Scheduler Public
  5. Miscellaneous (Should be eliminated/merged with scheduler public in final implementation)

Since the thread private lock is always implicitly held, it does not impact the formula.
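Assuming per-level execution times and acquisition latencies are known (the numbers used in testing are placeholders, not measured values), the recursion can be evaluated directly; lockDuration() is an illustrative name:

```c
#include <assert.h>

enum { LOCK_LEVELS = 5 }; /* core critical, sched private, KIO, sched public, misc */

/* Duration[i] = Exe[i] + Latency[i+1] + Duration[i+1];
 * the innermost level takes no nested lock, so the recursion
 * bottoms out at Exe[LOCK_LEVELS-1]. */
static long lockDuration(int i, const long exe[], const long latency[])
{
    if (i == LOCK_LEVELS - 1)
        return exe[i];
    return exe[i] + latency[i + 1] + lockDuration(i + 1, exe, latency);
}
```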

  • Non-SMP case: thread private and scheduler private are always immediately available.

This leaves us with KIO and sched public.

start of window latency = ???

latency to get thread private = 0 (non-smp case)
latency to get scheduler private = 0 (non-smp case)

The following ignores window activation criticals, which may be long for 653/posix window activations.

latency to start of window interrupt = latency for critical section

latency for critical section = latency to get scheduler private +
   longest scheduler private lock +
   duration of critical section activity

longest scheduler private lock = latency to get KIO lock +
   longest KIO lock +
   duration of scheduler private activity

latency to get KIO lock = (#cores-1) * longest KIO lock

longest KIO lock = duration of KIO +
   sched public latency +
   duration of sched public

sched public latency = (#cores-1) * longest sched public

longest sched public = duration of sched public + 0 (no nested locks)

expected times:

 duration of sched public: time to merge a priority list.
     #priorities * (2 or 3) * time to get a cache line.
 duration of KIO: current longest non-ctx critical.


Approximation:

 latency to start of window interrupt =
     #cores * context switch + (#cores)**2 * longest sched public
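The approximation can be stated as a one-line function. The inputs here are hypothetical values; real numbers would come from crittime measurements:

```c
#include <assert.h>

/* latency to start of window interrupt ~=
 *   #cores * context switch + #cores^2 * longest sched public */
static long windowStartLatency(long nCores, long ctxSwitch, long longestSchedPublic)
{
    return nCores * ctxSwitch + nCores * nCores * longestSchedPublic;
}
```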


Observation: We might be able to tolerate a window interrupt after getting a scheduler private lock, but before getting a KIO lock. This would reduce window start latency.

In practice the worst case will not include the number of cores squared, because the execution time at each subsequent level is expected to be smaller than at the previous level. While one core performs the KIO work after acquiring the lock, the core which released the KIO will be performing the scheduler public operation. If a core had to wait for all other cores at the KIO level, then each of those cores most likely performed this pattern when it released the KIO, so no cores will be waiting for the scheduler public lock, making the actual latency zero. In addition, by controlling the application designs, #cores can be reduced to the number of related cores. For instance, on a quad-core system two cores may be used in an application which coordinates with a semaphore. Therefore, only one other core may already hold the KIO lock on the semaphore, and if this core is waiting for the KIO lock it will not block the other core from getting the scheduler public lock.

Data Structures and Terms

Core

  • A piece of hardware that can independently execute instructions.
  • Processor execution core - GPR's
  • Thread timer for active scheduler
  • Active thread or quiescent state

Cross Core Critical

  • A critical section that only one #Core at a time can be in.
  • At present it is single entry (only one writer or reader).
  • This is intended only for select cases:
    • Startup and shutdown.
    • BIT
    • Window Boundaries.

Kernel Instance

  • Manages kernel heap, handles, etc.
  • Current Schedulers[#cores]  : the "current" scheduler for each core.

Scheduler Instance

Master Core

The CPU core that coordinates cross system events, e.g., startup, shutdown, frame synchronization, and window transitions. There is only one master per #Kernel Instance.

Slave Core

Any core that is not the #Master Core.

Window

A contiguous range of wall time during which a particular set of #Scheduler_Instances are active.

  • time duration @ an offset from WAT start
  • 653 user attribute (periodic processing start)
  • may start early attribute
  • may end early (negation of "sit-n-spin")
    • Window has "is start of system tick" and "slowest RMA rate that starts now" attributes.
    • continuation window vs start of period (may not be bool, but slowest period starting now. scheduler can have table and handle with just bool)
    • Zero width window not permitted. If you have a "may finish early" followed by a "may NOT start early", you have to add a minimal width RMA window between them
  • Rules
    • window at system tick boundary cannot start early
    • Windows can not span system tick boundaries
      • This is necessary if system tick is a system wide attribute.
      • This is a desirement (for most efficient use of window-generated slack) if the system tick is a Deos scheduler notion only. If this is not enforced, something will be needed to prevent a Deos window from spanning a tick boundary.
    • RMA Windows must be at least as frequent as the fastest rate.
    • The least common denominator of the sum of the duration of all Deos windows containing the RMA scheduler per tick is the RMA scheduler cpu quota
      • It is possible to relax this
  • If we allow 653 windows to start early for aperiodic processing, we will need a way to trigger the release point at the normal start of the window. Perhaps if a window starts early the window start time is not updated and the interrupt triggers it.
  • track unused time.
  • Track time donated from other windows. (slack)
  • Associated with a set of schedulers, one for each core.
  • May have scheduler specific data, but then must be validated

Thread

A schedulable entity that executes on a processor core.

  • Stack
  • Budget
    • may be only quanta (wall time) for POSIX threads
    • TODO: may not be guaranteed for ISR threads in interrupt window if interrupts are handled on multiple cores.
  • Priority (dynamic)
  • rate (for replenishment)?
  • #Scheduler_Instance that we are a member of. This may be a scheduler tied to a core giving affinity to this core, or an SMP scheduler referenced by a number of core schedulers.

Window Scheduler

  • handled by master core
  • may handle some Deos global data. May be part of activation/scheduler switch to.
  • invokes thread scheduler
  • allow registering window activation handler

Scheduler

  • Data
    • shortDurationList
    • runnableList
    • slackRequestorList
    • itsHighestSlackRequestorPriority
  • Slack data valid only for RMA schedulers, specifically not meaningful for SMP:
    • fastestPeriodPending
    • itsRunningThread
    • netSlackReclaimed
    • itsSlackBudget
    • _slackBudgetVector
  • SMP issues
    • when we need to get a thread, check the core list and the SMP list; use the higher priority. If the same priority, does the core have preference, or do we need a timestamp inserted and an extra comparison (i.e., FIFO requirement for POSIX/653)?
    • TODO: SMP needs to be changed to a scheduler instance which can manage multiple cores, accounting for some threads in the scheduler having affinity to a single core. Otherwise, multiple scheduler locks are needed simultaneously.
  • Scheduling
    • Cross scheduler thread activation will use a public interface. The public lists will be merged with the scheduler's private lists at scheduler activation (i.e., window activation).
  • Scheduler context switch
    • Does the old scheduler get notification when leaving, or must it clean up when switched back to?
    • scheduler/thread switched away from will get bonus (cache + possibly window)
    • Start/Stop charging CPU time - currently thread members. Should they be scheduler?
      • Posix threads have notion of thread CPU time clock, and process CPU time clock, tracking actual CPU time
      • Deos RMA threads uses for remaining budget
      • Is this mechanism for 653 deadline (Wall time) and Posix round robin (Wall or CPU?)
      • Need to determine who is responsible for starting and stopping time accounting
  • Examples
    • Deos RMA uni-core scheduler
      • Manages Deos threads as today. No Window activated threads present.
    • Priority scheduler unicore
      • Priority preemptive
      • Thread timer used for 653 deadline monitoring or POSIX round robin interval
    • Deos RMA multicore
    • Priority scheduler multicore
      • Priority preemptive across cores in cluster
      • Uses affinities to place ready threads on ready cores
  • schedulers can have different priority ranges
    • certain cross core semantics are only meaningful when schedulers have the same priorities
  • priority preemptive
  • enforces thread budget
  • short duration wait? Thrd level?
  • relative timeout

Time line

Example window schedule:

|   DW    | PW   |  DW    |   6W   | 6W  |  DW   |    DW    |

DW = Deos Window PW = Posix Window 6W = 653 Window

Each DW has budget for cache bonus and ctxpd.

 Total available DW budget for threads ==
     sum of DW window durations - (number of activations * (cache bonus + ctxpd))
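The budget arithmetic above can be sketched as follows; availableDwBudget() and its inputs are illustrative, not actual kernel interfaces:

```c
#include <assert.h>

/* Total available DW budget for threads ==
 *   sum of DW window durations - activations * (cacheBonus + ctxpd)
 * One activation per DW is assumed here. */
static long availableDwBudget(const long dwDurations[], int nActivations,
                              long cacheBonus, long ctxpd)
{
    long sum = 0;
    for (int i = 0; i < nActivations; i++)
        sum += dwDurations[i];
    return sum - (long)nActivations * (cacheBonus + ctxpd);
}
```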

Scheduler Algorithms

Goals:

  • Must be able to activate windows at precise times.
  • Must efficiently support interrupts (minimal budget).
  • Desire window slack.
  • want window scheduler to be as simple as possible, e.g.,
    • No priorities
    • No reordering
    • Interrupt windows are an implementation tactic; they are not specifically required, i.e., it would be nice to eliminate interrupt windows.

Thread Scheduler (just new fields):

 - unmaskedInterrupts : set of interrupts
   The interrupts for which ISR threads in the scheduler are in a
   waitUntilNextInterrupt().

Platform interrupts:

 - Platform interrupts can activate ISR threads either in the current
   scheduler, or for a scheduler in the interrupt window.
 - To reduce overall IWI latency, IWI interrupts should be higher
   priority than non IWI platform interrupts.
 - Each platform interrupt is associated with (at most) one thread
   (at a time), hence one scheduler.
   - It would be acceptable for the registry to know "the" scheduler
     an interrupt would be associated with.
 - When HW asserts an interrupt, there are two cases. If the
   interrupt is not associated with the interrupt window, the PAL
   must call raisePlatformInterrupt() on the scheduler's current
   core.  If the interrupt is associated with the interrupt window,
   then the PAL must call the IWI handler on all cores.  The kernel
   resolves/addresses the interaction of IWI and WSI interrupts.
 - An ISR thread may be in a waitUntilNextInterrupt() when the window
   is suspended, but retainedBudget could expire.
   The next window containing the ISR thread's scheduler
   must ensure the ISR thread is runnable at the window start.  I.e.,
   for RMA schedulers the window start must be a budget replenishment
   point for the ISR thread.  For threads in schedulers that don't
   enforce budgets, this is trivially satisfied.

More fundamentally, ISR budgets cannot be assured across cores. This is the same issue as for single-core RMA ISRs when the interrupt arrives close to the end of a period.

A window can be activated by:

  1. windowStart interrupt.
  2. The early completion of the previous window.
  3. End of an interrupt window preemption.
  4. In addition to the above, the interrupt window can also be activated by the PAL calling raisePlatformInterrupt() for any of the interrupt window's ownedInterrupts.

As in the FourPeaks design, the windows are in a WAT.

WAT

 windows
   A sequence of windows, indexed by 0.
 windowSlack
   The sum of the finish early times for all the most recent
   contiguous sequence of window activations where
   mayFinishEarly=true, minus any windowSlack used in that interval.
 retainedBudget
   The amount of time remaining for the IW.  retainedBudget is zero
   at WAT start.  retainedBudget is incremented by the finish early time
   when the IW is scheduled, decremented when the IW runs, and reset
   to zero (expires) when:
     - All the retainedBudget is used, or
     - a window actualStart happens at its
       (specifiedStart-windowSlack).  An interesting special
       case is the start of a mayFinishEarly=false window.
   In the second case, retainedBudget incrementally transitions to
   windowSlack as retainedBudget would otherwise begin to expire.
 retainedBudgetExhausted
   Indicates whether there is sufficient retainedBudget to activate
   the interruptWindow.  retainedBudgetExhausted is set only by
   core0, but read by all cores.

Window

 specifiedStart  The time specified during system design.
 actualStart     The most recent run-time value.
   The offset from the WATStart.  actualStart is always <=
   specifiedStart+(some TBD overhead parameter)
 mayStartEarly bool
   If true, the window may be activated prior to its specifiedStart.
 specifiedDuration
 actualDuration
 mayFinishEarly bool
   If true, the window may finish before the next window's
   specifiedStart.
 mayUseSlack    bool
   If true, the window may use more time than its specifiedDuration.
 unusedBudgetDestination : enum {interruptWindow, windowSlack}
   Where the unused budget goes when the window ends early.  Only
   the interruptWindow can give the interruptWindow budget (currently).
 ownedInterrupts : set of interrupts
   The interrupts that *might* be enabled when the window is active.
    When a window is activated, the ownedInterrupts for which ISR threads
    are in a waitUntilNextInterrupt() are unmasked.
 enabledInterrupts : set of interrupts
   The interrupts that might trigger an interrupt window activation
   during this window.  These interrupts are only unmasked if there
   is retainedBudget and the ISR thread's scheduler indicates the
   interrupt is unmasked.
 nextWindow
   The next window in the WAT.  The WAT's last window's nextWindow is
   WAT.windows[0]
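The field list maps naturally onto a C structure. The types and the bitmask representation of interrupt sets are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

typedef uint32_t interruptSet_t;   /* assumed: one bit per platform interrupt */

typedef enum { DEST_INTERRUPT_WINDOW, DEST_WINDOW_SLACK } unusedBudgetDest_t;

typedef struct Window {
    uint64_t            specifiedStart;     /* design-time offset from WAT start */
    uint64_t            actualStart;        /* most recent run-time value        */
    bool                mayStartEarly;
    uint64_t            specifiedDuration;
    uint64_t            actualDuration;
    bool                mayFinishEarly;
    bool                mayUseSlack;
    unusedBudgetDest_t  unusedBudgetDestination;
    interruptSet_t      ownedInterrupts;    /* might be enabled while active     */
    interruptSet_t      enabledInterrupts;  /* may trigger an IW activation      */
    struct Window      *nextWindow;         /* last window points at windows[0]  */
} Window;
```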

Rationale: A design specifying times (integers) for flexible start and end times was rejected because integers would (in some cases) require the window scheduler to introduce unscheduled windows into the timeline.

Rationale: There is only one interrupt window because otherwise the interrupt windows would need a priority scheme.

Future enhancements:

  • Allow windows other than the IW to provide retainedBudget.
  • Allow windows to designate the window to receive its unused budget.
  • Allow IW to use slack.
  • Allow multiple schedulers (presumably on different cores) to own the same IWI.
    • May need some restriction like all schedulers are in the same window(?).
  • Allow a window to specify an "earliest start time" to constrain how far left the window could slide.
    • Some constraints like "may use slack" might have to be co-properties to avoid the need to introduce unscheduled windows.
    • Intent is to provide more precise control over window timing without fixing the window in time.
    • Proposed at the 3/19/2015 telecon.

If a window has mayStartEarly=false, the preceding window's end time must be fixed. The permitted predecessor combinations are:

  1. mayStartEarly=false, mayFinishEarly=false, mayUseSlack=D/C.
  2. mayStartEarly=true, mayFinishEarly=false, mayUseSlack=true

If a window has a fixed duration (mayUseSlack=false), then mayStartEarly=true implies mayFinishEarly=true.

For mayStartEarly=false windows, if the predecessor could finish early and there is an ISR window wanting CPU time, it would be a shame to force the predecessor to spin.

To minimize overhead, the PAL must deliver window start interrupts, and all interrupt window interrupts, to all cores.

  • Rationale: An alternative is that the PAL delivers to one core, then the kernel sends IPIs to all other cores. In that case the latency is 2*longest critical, rather than 1. Also, most platforms can be configured for the HW to deliver the interrupts to a specific core, or to all, so the PAL overhead may be minimal.
  • Note: In the March 2015 and previous implementations, IWI is only delivered to master core.

If the Interrupt Window is activated close to the next window's (specifiedStart - windowSlack), it could significantly delay the next window's actualStart. In this case there will be an interruptInhibitedInterval during which an interrupt that would normally activate the interrupt window will be inhibited (the current design is to mask all interrupt window interrupts when the first such interrupt arrives in the interruptInhibitedInterval). Core 0 determines the start of the interruptInhibitedInterval.

The interrupt window must be scheduled in order to "refresh" its budget (see the description of WAT.retainedBudget).

The expected scheduling sequence in a "period", would be:

  1. Non-interruptable, fixed start, fixed duration windows.
  2. The interrupt window
  3. May start early windows.

We're debating whether some form of the above should be, or could be, formalized to simplify the design, but no specific proposal is currently available.

A "start of period" must be at a window that is mayStartEarly=false.

See the FourPeaks design for timing of Start of Window interrupts. Basically SOW is when the HW will assert the interrupt. We will publish the "maximum delay before the first user mode instruction", and that is what the application would consider to be the start of the window.

Is there a way we could separate the window and thread schedulers more? E.g., have window scheduler be able to preempt thread scheduler critical regions?

  • Use a VM like separation. There would have to be constraints on what scheduler operations could block other schedulers, and constraints on consistency of scheduling so that future windows were not blocked by a lock held by an inactive scheduler.
  • Perhaps if IW/SOW interrupts had higher priority, and schedulers could complete the critical region of another scheduler, at least until the point where the original scheduler could continue.

The design addresses the fact that a window can be started by several different events that can arrive and be interpreted differently by each core, and that each core can even get a different number of triggering events. E.g., one core could decide that its window should complete early, and another core could get a window interrupt before it makes the same determination. Thus one core gets two window change events, the other core gets only one. The differing number of events issue is addressed in proposal 1 by case analysis of which events are redundant or for which duplicates are possible, and then addressed at the beginning of the function.

Window interrupt handler algorithm

Core 0 determines window to start, and all cores synchronize for window activation.

  • TODO: Need a case analysis of the mayStartEarly, mayFinishEarly, mayUseSlack combinations.
  • TODO: it is TBD if the window start interrupt is written on master core or all cores.
  • TODO: Are IWI's masked on each core, or only by master?
  • Window timer only needs to be readable by master.
  WSI   Window Start Interrupt
  WSI.time = latest time that the current window can finish, aka next window start.
    Always relative to current WAT start.
 
    For interrupt window:
           = retainedBudget
    else (non interrupt windows):
           = window.duration + window slack
           window slack includes retainedBudget if next window has a fixed
           start time, i.e., the current window is the last window before retainedBudget expires.
  IWI   Interrupt Window Interrupt (multiple possible)
 
 // Note handler is also called by maystartEarly().
 WSI_or_IWI_handler(bool IWI or WSI, actualInterruptNumber) // runs on all cores
 {
   if IWI and currentWindow.isIW then propagate actualInterruptNumber to sched and return;
   if IWI and currentWindow has IWI's masked, is assumed to be impossible;
   if IWI and retainedBudgetExhausted return;  // This is a "spurious" interrupt, only one/core possible.
   if currentCore == 0
      syncA;  // A simple barrier, perhaps based on a window activation counter
      recompute retainedBudget;  // cases: nextWindow.mayStartEarly=false and currently window is IW.
      switch:
        IWI and not currentWindow.isIW:
           if retainedBudget > some threshold
             currentWindow = interrupt window;
           else
             // PAL must reassert IWI later, TODO: which may be an issue for edge triggered interrupts.
             temporarily mask all IWIs; // Prevent remaining IWI's from consuming any more budget
             retainedBudgetExhausted = true;
             programming window timer not necessary in this case;
        WSI: currentWindow = next normal window;
             retainedBudgetExhausted = false;
      syncB; // sync with slaves again;
      // interrupt mask status TBD.
      write window timer for currentWindow;
   else  // slave core
      syncA;
      syncB;
   update IWI mask status for currentCore (depends on both window status and temporary mask status);
   activate scheduler for currentWindow on this core;

Since IWIs are broadcast, some schedulers will not have an ISR.

Use Cases

The above algorithm needs to address IWI, WSI, and finishEarly() cases, both when addressed separately and in combinations close enough that the observed event sequence may differ between processors. Fortunately WSI and mayStartEarly() are disjoint because mayStartEarly() is synchronized from a critical on all cores. This means that the only ambiguous event sequence is IWI and WSI. Furthermore different cores may see different IWI interrupts, but all are assumed to be masked simultaneously so are treated as one event below.

Case  currentwindow                Events                 nextWindow
       IWIEnabled       Core 0             Core 1          IWIEnabled
 5       false          WSI                WSI             D/C
 1       true           IWI, WSI           IWI, WSI        false
 2       true           IWI, WSI           WSI, IWI        false
 3       true           IWI, WSI           WSI             false
 4       true           WSI                WSI, IWI        false
 5       true           IWI                IWI             false
 6       true           WSI                WSI             false

Only Core 0 can change the window. There is a cross core synchronization point when the window is changed.


Finish Early

An important property of any mechanism is that either all cores will start early, or none will. This prevents some cores from doing a window transition via start early and others via an interrupt, e.g., a WSI.

This algorithm is for the case where idle triggers "finish early" and it supports schedulers with ISRs, however it has an undesirable "timeout".

The algorithm is asymmetric. Every core calls finishEarly() in Idle, but a master core initiates commands to the slaves and coordinates their responses. Since not all slaves may be ready to finish early, the algorithm employs a timeout to detect ineligible slave cores. The timeout should be set to the maximum amount of time it would take for all N cores that were already in finishEarly() to receive and respond to a command. Assuming a CPU has a write buffer depth of NWB cache lines, a command will be seen in NWB times the time it takes to transfer a cache line to memory, plus one times the time it takes to transfer a cache line between cores. The equation for the slave's response should be similar, but the details depend on the configuration of the memory subsystem.

In the following the phrase "could observe foo in {x, y, z}", means that at that point in the code, the current core could observe the variable foo as having any of the values in the set.

The key to the algorithm is that every core will see the SAME transition of command from wait to either sleep or finish. This is important because any core specific state transitions without a synchronization point in between them could permit the first state to be missed by a cross core observer.

Either the master or the slave could handle ISRs until all cores finish. Care is taken to ensure that inter-core cache line transfers are minimized.

 command_t enum { run, wait, finish };
 command_t command=run;  // "command" only written by core0
 slaveState_t enum { running, waiting, acknowledging };
 slaveState_t slaveState[ncores] = running;  // Elements inter-core cache line aligned.
 
 // References to command and slaveState need to be onceOnly(). There
 // need to be memory barriers between reads and following writes (not
 // writes and reads), to prevent speculative and out of order writes.
 function finishEarly()  // This would replace PALidleFunction().
   loop
     if currentCore==0

       // Not sure how to get a power efficient wait here.

       // At this point, and at all points outside of the following critical,
       // core0 will observe, for all other cores, slaveState[core]==running
       enterCritical();  // May have to be master critical.
       if finish early is permitted, e.g., the window specifies mayFinishEarly==true then
         command=wait;
         var myCommand=run;
         if for every other core, slaveState[core]==waiting within timeout then
           command=finish;
           myCommand=finish;
           wait for every other core, slaveState[core]==acknowledging;
         else
           // The "else" case must account for only a proper subset of cores
           // seeing the command.  Observing "running" below does the trick.
         command=run;
         // The following loop prevents cores from observing a transition
         // directly from wait to another wait.
         wait until for every other core, slaveState[core]==running;
         // At this point core0 will observe, for all other cores, slaveState[core]==running
         if myCommand==finish
           WSI_or_IWI_handler(WSI=true, actualInterruptNumber=D/C);
       exitCritical();
     else // Not core 0
       // For power efficiency, there could be a waitUntilEqual(&command, wait); here.
       enterCritical();

       // Could observe command in {run, wait}, but not finish.
       if command==wait
         // Could observe command in {run, wait}, but not finish.
         slaveState[currentCore] = waiting;
         // Could observe command as anything, i.e., {run, wait, finish}.
         // The "Not Equal" here addresses the fact that some cores may
         // not see command=wait, or may not be in this algorithm at all.
         var myCommand=waitUntilNotEqual(&command, wait);
         // Could observe command in {run, finish}.  
         slaveState[currentCore] = acknowledging;
         waitUntilEqual(&command, run);
         slaveState[currentCore] = running;
         if myCommand==finish
            WSI_or_IWI_handler(WSI=true, actualInterruptNumber=D/C);
       exitCritical()
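A minimal two-core model of the handshake above, using C11 atomics and one pthread per "core"; the timeout, the criticals, and the power-efficient waits are omitted, and all names are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>
#include <stddef.h>

typedef enum { CMD_RUN, CMD_WAIT, CMD_FINISH } command_t;
typedef enum { SL_RUNNING, SL_WAITING, SL_ACKNOWLEDGING } slaveState_t;

static _Atomic command_t    command    = CMD_RUN;    /* written only by core0 */
static _Atomic slaveState_t slaveState = SL_RUNNING; /* single slave, for brevity */
static int masterSawFinish, slaveSawFinish;

static void *core0(void *arg)
{
    (void)arg;
    atomic_store(&command, CMD_WAIT);
    while (atomic_load(&slaveState) != SL_WAITING) ;       /* timeout omitted   */
    atomic_store(&command, CMD_FINISH);
    while (atomic_load(&slaveState) != SL_ACKNOWLEDGING) ;
    atomic_store(&command, CMD_RUN);
    /* Prevent the slave observing a transition from wait directly to wait. */
    while (atomic_load(&slaveState) != SL_RUNNING) ;
    masterSawFinish = 1;            /* would call WSI_or_IWI_handler() here */
    return NULL;
}

static void *core1(void *arg)
{
    (void)arg;
    while (atomic_load(&command) != CMD_WAIT) ;            /* waitUntilEqual    */
    atomic_store(&slaveState, SL_WAITING);
    command_t my;
    while ((my = atomic_load(&command)) == CMD_WAIT) ;     /* waitUntilNotEqual */
    atomic_store(&slaveState, SL_ACKNOWLEDGING);
    while (atomic_load(&command) != CMD_RUN) ;
    atomic_store(&slaveState, SL_RUNNING);
    if (my == CMD_FINISH)
        slaveSawFinish = 1;         /* would call WSI_or_IWI_handler() here */
    return NULL;
}
```

Because the master only advances command after observing the matching slaveState transition, both cores see the same wait-to-finish transition, which is the property the design note above calls out.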


State Management and Coordination

Correct operation requires that machine state (including memory) is only accessed in a coherent manner. The kernel will always perform coherent accesses, but it can't ensure that applications are doing so. For example, if thread A on core 1 is looking at the user stack for thread B on core 2 while both A and B are active, then the kernel is still operating properly, but the application is misbehaving and the application results are undefined.

Deos provides assurances only to *well behaved* applications. In simple terms, well behaved applications work in the presence of parallelism. Temporarily ignoring time partitioning issues, a well behaved application is one that would perform its intended function if any ready thread could be interrupted between the execution of any two of its instructions by any number of other ready threads that execute any number of their instructions before the first thread executes its next instruction. Note that the thread may be susceptible even to the execution of lower priority threads, if those threads are on a different core.

All machine state (including memory) is characterized by what level of mutual exclusion (execution state) is required for it to be coherently accessed.

Open Issues

TODO: Should this be called Inter Processor Interrupt (IPI) or Inter Core Interrupt (ICI)?

What HW assumptions can we make?

  • Queued int/message delivery?
    • No
  • automatic TLB and cache coherency?
    • In general No although some architectures may support it.
    • Use of such capabilities would preclude Deos being used in a heterogeneous (different RTOS) cluster.

TODO: Limit IPI. Multiple cores in core local criticals.

  • Kernel defined events
    • "force all quiescent" (e.g., for raiseKernelException())
    • Shutdown
    • System Tick(?)
  • If a core raises an inter-core interrupt, how do we ensure that interrupt is processed before other interrupts, e.g., timer?
    • Only one core in a critical at a time helps, but the "wait to enter critical" must address inter-core interrupts.

Page Tables and TLBs

TLB handlers may access page tables asynchronously.

64-bit page table entries will either have to be written atomically, or with a specific protocol, in order to ensure that consistent data is read out. Either way, TLBIVAX can't be trusted to invalidate TLB entries, because the IVAX may arrive between the read of the page table entry and the writing of the TLB.

One protocol would be to write the entry not present, then write the other word, then the word containing the present bit.
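A sketch of that protocol, modeling the 64-bit entry as two 32-bit words, with C11 fences standing in for the architecture-specific barriers; this is illustrative, not the final algorithm:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

#define PTE_PRESENT 0x1u

typedef struct {
    _Atomic uint32_t lo;   /* word containing the present bit */
    _Atomic uint32_t hi;
} pte_t;

/* Write a new translation so that a concurrent TLB-miss handler never
 * observes a present entry whose two words come from different versions. */
static void pteWrite(pte_t *pte, uint32_t newHi, uint32_t newLoWithoutPresent)
{
    /* 1: mark the entry not present first */
    atomic_store(&pte->lo, atomic_load(&pte->lo) & ~PTE_PRESENT);
    atomic_thread_fence(memory_order_seq_cst);
    /* 2: write the other word */
    atomic_store(&pte->hi, newHi);
    atomic_thread_fence(memory_order_seq_cst);
    /* 3: write the word containing the present bit last */
    atomic_store(&pte->lo, newLoWithoutPresent | PTE_PRESENT);
}
```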

The final algorithm is still TBD. Consider investigating BSD/Linux implementation. Unsure how Intel architecture handles this since the page table entry may be read by HW at any time. Key phrase in the literature is "TLB shootdown".


TLB shootdown

Visibility:
Change:
Coherency:

Paging structures must be consistently viewed across all cores.  I.e., a
thread from any process executing on any core must see Deos RAM (all
globally visible state?) the same in all cases.

Visibility
Who can change it and when.
What is its lifetime

Many things immutable after first process dispatch.
  paging structures in svas
  idt raw handler tables
  code

many processor specific registers defined by boot

floating point context

timestamp counter (TSC).  Boot/PAL are required to synchronize the
value across all cores.


Memory
  can be consistently read/written with single aligned read/write.

icache
  Coherent until kernel changes mapping of code space, then requires
  cross core coherency protocol.

TLBs

memory mapping change (VAS):
  TLB shootdown
  perhaps icache if virtually indexed.



Thread User stack:
  Can be written by:
    - Current Thread
    - Any thread running BIT
    - Some other thread executing in cross core crit that ensures the
      thread is :
	- not the "current thread" on any core.
	- Ensures the core with the thread as "current" is "quiescent".
  

kernel data structure:
  Can be written by:
    - Current Thread when holding cross core critical.
    - Any thread running BIT
    - Kernel during startup/shutdown.
  
data used to enter kernel critical
exception vector save area
  Any core, but only using special algorithm
  Special note: Not BIT.

kernel code:
  any thread on any core at any time, except when some thread is
  running BIT.


Issues

  • event wait must account for not waking in a period. Perhaps array of tick counts.
    • current implementation tracks number of periods elapsed when thread does poll
  • Short duration wait cross core is problematic for events. Perhaps need wait with "ISR" semantics.
    • currently we prevent and raise a new error code

Optimizations