Deos Multicore Design


Introduction

The purpose of the Deos multicore project is to enable Deos applications to execute on multicore processors. The following describes the state of the Deos multicore design as of April 2015.

The project is guided by the following principles.

Principles

  1. Unless there is a compelling reason, multicore semantics should be the same as single-core Deos semantics.
    1. Ideally, it would be possible to run an application designed for multicore Deos on a uni-core processor.
  2. Guaranteed budget is more important than slack availability.
  3. All design decisions should be applicable to all targets (e.g., x86 and PowerPC).
  4. Maximize backward compatibility.
    1. Most important for applications, but also for BOOT and the PAL.
  5. If possible, do not require separate uni-core and multicore kernels.
    1. This is mostly a cost containment issue.

Technical overview

The primary observation is that in a time-partitioned multicore environment there is a high degree of application execution-time co-dependence, due to the degree of resource sharing in a multicore processor: most notably memory and cache, although the system bus and devices are also a concern. Thus if application A might run at the same time as application B, the timing analyses for A and B are co-dependent. An unconstrained environment with N applications requires on the order of N² analyses, which is prohibitively expensive. Furthermore, if any application changes, then all analyses must be repeated. Clearly that is unacceptable, so the key is to constrain the set of applications that can execute simultaneously.

The chosen solution is to divide the timeline into windows of time. Each window assigns a scheduler to each of the cores. Each thread is assigned to a scheduler, which in turn constrains which threads can run simultaneously on the other cores.

The other key principle is that interference should only be caused by the application itself. An application that is accessing semaphore S1 can tolerate interference from another thread accessing S1; however, accesses to an unrelated semaphore S2 should not induce any blocking. This principle is referred to as Application Induced Blocking (AIB).

Backward Compatibility

The Deos multicore kernel is derived from the Deos 8.3 kernel including support for memory pools.

Baseline Features

653 Windows

The Deos kernel API and PPI supporting ARINC 653 windows are being re-evaluated in light of the changes required for multicore. At present the window APIs exist in the kernel, but some fields are #Not_Supported.

Memory Pools

No changes; however, see #Alloc_Dealloc_Not_Supported.

Not Supported

The phrase Not Supported means that the capability does not work. While it may be possible to invoke a not-supported service or situation, the results are undefined: it may work sometimes, or it may crash the kernel. That is, not-supported services are either unimplemented or incompletely implemented.

Alloc Dealloc Not Supported

Memory allocation and deallocation, or more generally any changes to a process' virtual address space, currently have several issues:

  1. Such accesses have non-AIB blocking, specifically they depend on global cross core locks.
    • This will be removed in a future release.
  2. Changes to a VAS are not propagated to other cores until a context switch.
    • This is a permanent situation.
    • The current documentation (UG/DDD) does not reflect this situation.
  3. De-allocation may leave dangling references that might cause corruption.
    • This will be fixed in a future release
    • This can easily cause a kernel halt or other undefined behavior.

Note that allocation and deallocation are not restricted to virtualAllocDEOS() and virtualFreeDEOS(); these issues also affect envelopes, writeProcessMemory() to frozen pages, mapViewOfKernelFile(), SMO manipulation, etc.


Differences from Previous kernels

The following are the key differences from the uni-core kernel. The description is in the same order as the existing Deos User's Guide.

Supported Targets

Only Intel x86 and PowerPC are supported for the multicore kernel at this time.

System Startup and Shutdown

Executable File Format

No changes.

API

The following kernel services are being evaluated for possible changes as a side effect of the multicore design. What they have in common is that all return pointers to kernel data structures that are neither constants nor process-local data.

getSystemInfoDeos()
processRemainingQuota()
startOfPeriodTickValueArray()
systemRemainingQuota()
systemTickPointer()
windowActivationTableInfo()
windowInfo()

TBD: Does the level E debugger support library export any functions that meet the above qualification?

Using Named Objects

No changes

Process Services

No API changes are anticipated.

Creating and deleting processes that have threads on other cores is supported in kernel version 9999.1.5 (Mar 2015). However, all processes that are dynamically created or deleted must have a main thread on the same core as all other dynamically created or deleted processes. Furthermore, the thread that is creating the processes must also be on that same core. This restriction will be lifted in a future release.

Furthermore, when a child process is created from parent quota, the main threads of the parent and child must be in the same scheduler, not just on the same core.

Thread Services

No API changes are anticipated. Core specification is in the registry.

Allocation of thread budgets is validated for threads created in the same scheduler instance as the main thread. For other threads, the budget is assumed to be available and is not taken from the process, which may result in an oversubscription of the CPU. This accounting will be fixed in a future release of the kernel. All budgets are enforced at runtime, and exceeding the budget during a period will result in a timeBudgetExceeded exception.

setCPUBudgetTransferRecipient() is restricted to threads with the same #CoreScheduler.

Indefinite waits work on multicore as they do on single-core. For short-duration waits, when the resource is made ready, released threads on other cores are not raised in priority, but they will detect the resource availability as on uni-core. Slack-waiting threads on the same core will be readied immediately, but cross-core slack-waiting threads may not be readied until an interval boundary is crossed.

The above changes are expected to be permanent.

Thread States
Slack Scheduling

TODO: Need to fully evaluate the changes in the scheduler related to all the overview sections. Ideally there would be no changes, however the changes in #Scheduling need to be carefully evaluated to ensure application developers understand the new constraints.

Thread Coordination Services

Mutexes

Mutexes are constrained to only be associated with a single #Scheduler_Instance. This is a permanent restriction.

The implication of the above is that mutexes cannot be used for synchronization of threads that execute on different cores.

Events and Semaphores

No API changes are anticipated for semaphores. No API changes are expected for events within a single scheduler. The impact of cross scheduler events is being investigated.

Inter-Process Communication

See #API, #Memory_Pools, #Imprecise_Waiters, and #Short_Duration_Wait.

Exception Handling

No API changes are anticipated. However, at this time raising an exception to a different thread is restricted to a single exception, and that exception may not be noticed until the recipient thread next transitions from kernel mode to user mode (e.g., returns from an API call). If the recipient thread has another exception raised to it in the interim (e.g., timeBudgetExceeded, or it calls raiseExceptionToThread() on itself), the first exception may be lost.

TODO: The final disposition of this situation is under discussion and may change in a future release.

Interrupt Handling

TODO: Interrupt windows are not yet implemented. At this time, interrupts should be on the master core in scheduler instance 0.

Platform Resource Services

See #API.

BIT Services

No API changes are anticipated. BIT tests will be restricted in some manner to only execute on a single core, perhaps the master core, and all other cores will be required to be #Quiescent. BIT services are currently #Not_Supported.

Kernel Attribute Services

No changes are anticipated. However, warmstart is currently #Not_Supported.

File Services

No changes are anticipated. However see #Alloc_Dealloc_Not_Supported.

Library Services

No changes are anticipated. However see #Alloc_Dealloc_Not_Supported.

Virtual Memory Services

No changes are anticipated. However see #Alloc_Dealloc_Not_Supported.

Development Environment Support

System Information Services

See #API

The kernel's scheduling and process creation history queues are being changed to core-specific data structures. This is being done to minimize memory interference between cores. To user code, this change will manifest itself via backward-compatible changes to the results returned from the systemInformationBlock() and processInformationBlock() kernel services.

Also, see #653_Windows.

Debugger Support Services

No API changes are anticipated.

Platform Abstraction Layer

  1. What mutual exclusion properties apply?

    Only one function that "must be called in a critical" will be called at a time; otherwise there are no assurances beyond what the API description defines.

  2. Being in a critical is no longer the same thing as having interrupts disabled. Many functions say "must be called within a critical" when in fact they require interrupts to be disabled, and vice versa. This will have to be a code review change, and we need to decide whether the PAL will be responsible for this or whether we want the kernel to handle it. Note the precondition difference between the enter/exit functions in core.h and cpu.h.
  3. At this point we are assuming LLKMI will not be supported on multicore, and perhaps not at all any more.
  4. We need a clear definition of the differences between interrupts, master interrupts, and being in a critical.

TODO: Historically there was a correlation between disabling interrupts and entering of critical sections. That relationship is now slightly altered. Specifically we need to clarify the difference between interrupts disabled, master interrupts disabled, and #Locks.

TODO: The multicore kernel uses ELF "weak symbols" to enable backward compatible operation on PALs that don't provide the new multicore PPIs. If window scheduling is retained we may remove the recently added register* KERNPPI functions.

New Kernel Interfaces
 DEOSBASEAPI UNSIGNED32 DEOSKERNPPI currentCoreIndex(void);
 DEOSBASEAPI void DEOSKERNPPI raiseIPI(void); 
New PAL Interfaces
 DEOSBASEAPI void DEOSPALPPI sendIPI(SIGNED32 targetCore); 
Changes to PAL and PRL Interfaces

The PAL must supply a thread timer per core. The PAL and PRLs must ensure that all core specific data structures are maintained properly.

The following table shows what functions in the PAL and potentially in PRLs can be called by the kernel on master and slave processors.

TODO: In the 9999.1.5 kernel, window start interrupts are delivered to the master core; in the future they will need to be delivered to all cores.

    FLASHeraseBlock                  Any, although kernel will ensure that only one
    FLASHnumPagesPerBlock            FLASH API is active at a time across all cores.
    FLASHupdate32
    FLASHwrite

    PALVideoMemoryAddress            master only
    PALcoldstart                     master only
    PALidleFunction                  all cores(?)  If so, must be "quiet" on slave cores
    PALwarmstart                     all cores  "master first, then one core at a time"
    frameSynchLostTickIndex          master
    maskPlatformInterrupt            all  However any specific interrupt will only be
                                     masked/unmasked on one core at a time.  IWI on
                                     master, non-IWI on the core where
                                     raisePlatformInterrupt() should be called.
                                     TODO: the mask/unmask assurances are not yet implemented.
    pollForSystemTick                master
    powerLossDuration                master
    timerTimeRemaining               all (must have unique timer per core)
    timerWrite                       all 
    unmaskPlatformInterrupt          all.  See maskPlatformInterrupt
    waitForNextSystemTick            master. Change to waitForWindowStart
    whichCPUInLRU                    any
    setActiveWAT		     master
    windowTimerTimeRemaining         all (TBD: master only?)
    windowTimerWrite                 master (TBD: master only?)
    kernelExtensionsHandler          all
    setCriticalLevel                 all

The following table shows what kernel PPI functions are callable from the PAL and PRLs.

    raiseTimer                       all
    raiseWindowTimer                 all
    raisePowerTransient              master - TODO: all?
    raisePlatformInterrupt           all (see unmaskPlatformInterrupt)
    numberOfPlatformInterrupts       all
    platformInterruptKind            all
    maximumSchedulingPriority        all
    frameSynchronizationLost         master?
    mapPhysicalAddress               master
    mapPhysicalAddress64             master
    allocateRAM                      master
    registerKernelExtensionsHandler  master
    exitCritical                     all
    enterCritical                    all
    virtualAttributesPPI             all
    logSystemEventPPI                all
    registerInterruptControl         master
    registerSetActiveWAT             master
    set[Logical]InterruptHandler     master

Note that raisePowerTransient() and frameSynchronizationLost() are #Not_Supported at this time.

Boot Changes

The "logical core number" is a new parameter to kernel entry. This was done in a "binary backward compatible, but not source code compatible" manner.

Logical core number must be zero based without skipping any numbers.

Defined new bootiface members: maxLogicalCoreIndex and currentBootCore. It is likely that the maxLogicalCoreIndex field will be removed in a future update. Boot is expected to start all applicable cores. Currently the kernel uses maxLogicalCoreIndex != 0 to indicate that the kernel is to be "multicore"; this will likely be changed to a processorSubArchitecture flag in the future.

Multicore Boot Sequence

  1. At power on, BOOT starts the master, initializes the boot interface object, then releases the slave cores. Kernel entry is called on the master and all slaves in parallel.
  2. Boot must not change the boot interface object once any core has entered the kernel until all cores have exited.
  3. The kernel runs the master until just before calling waitForNextSystemTick().
  4. The kernel releases the slave cores one at a time to just before entering PALidleFunction() on each.
  5. The master calls waitForNextSystemTick(), releases the slave cores, and then normal scheduling ensues on all cores.

Slave cores are released one at a time, primarily because it is simpler, but also because the time for each core is very small. This also permits certain optimizations on x86 (see x86/thrd_con_arch.cpp). The currentBootCore member of the boot interface object is used to sequence the above.

The single threaded startup behavior of the kernel is not visible to the BOOT or PAL and may change in the future to permit more parallelism.

Note that the master and "one or more" slave cores are in a critical in the above.

It is still TBD whether PALidleFunction() will run on the slave cores long term.

Multicore Shutdown Sequence

  1. Some core initiates a shutdown (this might be restricted to master in the future), e.g., by calling setKernelAttributes(), or via raisePowerTransient().
  2. All (other) cores are told to shutdown (using an IPI).
  3. All cores return to boot in arbitrary order (potentially in parallel).
  4. Boot must wait for all cores to exit the kernel before re-entering the kernel.
    1. There are additional requirements boot must satisfy, e.g., synchronizing any required updates to the boot interface object, but those requirements are true for any kernel entry.

One way for boot to accomplish the above is to keep a "boot count" global variable. On first power up, the master core initializes it to zero. Each core snapshots that global value into a local variable prior to booting. After the kernel returns, each slave core waits for the global variable to become different than the local cached value, at which point the core can update its local copy with the new global value and call the kernel. The master core updates the global value just before it reenters the kernel.

Status Monitor Support

The status monitor will support some new/updated commands to provide insight into the multicore kernel.

R S I (Report System Information)

Updated to include how many cores are configured in the system.

R S S (Report System Schedule)

Updated to display the core number for events in the scheduling history log.

R S E (Report System Events)

Updated to display the core number for events in the dynamic process event log.

R T E (Report Thread Events)

Updated to display the core number for events in the thread history.

R T X (Report Thread eXceptions)

Updated to display the core number for events in the exception history.

R T G (Report Thread Group (Core/Scheduler))

Updated to display the core and scheduler index for threads.


New Design Issues

The multicore Deos kernel is a single #Kernel_Instance managing one or more processor cores. All cores share the same kernel global data structures. The kernel requires that the processor cores form a shared-memory multiprocessor. Specifically, memory accesses on different cores are required to be coherent, and reads and writes are required to be atomic for all properly aligned machine-defined storage units, normally 8-bit, 16-bit, and 32-bit bytes/words.

The Deos kernel has always been multi-threaded, even when executing in privileged mode. The uni-core kernel uses a single critical section to manage mutual exclusion, which leads to non-AIB interference. The multicore Deos kernel instead uses fine-grained locks. The #Locks are categorized in a manner which enables deadlock avoidance and bounded critical sections, and ensures #Application_Induced_Blocking only. The current prototype kernel has not fully deployed the locks and may exhibit non-AIB interference characteristics.

The primary design issues for multicore Deos are:

  1. #Ensuring consistent machine state between the cores
  2. #Dealing with true parallelism
  3. #Scheduling

Ensuring consistent machine state between the cores

Most multicore processor architectures require software support to maintain consistency between the processor states on the cores. All processor state, including all registers and processor-defined memory tables, needs to be analyzed to determine what form of consistency is required. In some cases consistency means "the same value" on all cores (e.g., the kernel code space); in others, "the same within some tolerance" (e.g., the timestamp counter); and sometimes it means "can be completely different" (e.g., the current thread). Note that this also affects the behavior of applications. The current classification and characterization of state is described in #State_Management_and_Coordination. The details are still in some degree of flux, especially the names of the categories.

Ensuring consistency also involves establishing the initial consistent state. This task is shared between the kernel and BOOT, and to a lesser extent the PAL. The main challenge is permitting all cores to perform initialization (and finalization) activity that requires mutual exclusion, but doing so prior to the time when the runtime mutual exclusion mechanism has been initialized. See #Boot_Changes for an overview.

Finally, there is a need to implement cross core critical sections. See #Enter_Critical_Algorithm. Note that a particularly interesting sub-case is dealing with BIT, which can temporarily invalidate the normal state consistency management assumptions. E.g., by changing what would normally be considered constants, such as kernel code.

Dealing with true parallelism

In the original Deos design, certain algorithms took advantage of the fact that if a thread was running in privileged mode during a critical section then, by definition, all other threads had to be suspended. These algorithms require changes which have not yet been fully implemented. The March 2015 kernel addresses some of these issues, most notably enabling #Exception_Handling, but more cases exist.

True parallelism also affects the kernel when it is manipulating hardware-defined paging structures (e.g., page tables) or data structures referenced by core-specific hardware handlers (e.g., TLB handlers on PowerPC). If one core is updating a page table entry while a different core's hardware is accessing the same entry, extreme care must be exercised to ensure that the other core does not see an inconsistent intermediate state. Note that there are typically no synchronization primitives that can be used to coordinate this situation. See #TLB/page_tables.

True parallelism also affects users, and consequently a slightly different definition of application correctness is needed. See #State_Management_and_Coordination for our current definition.

Scheduling

The following new concepts were introduced in the multicore scheduler:

  1. Window scheduler
  2. Scheduler instances
  3. More than one "current thread"
  4. Application Induced Blocking
  5. Locks

It is important to understand that most kernel APIs are not affected by the addition of multicore. All manipulations of process and system state already had to be thread safe, so the change primarily affects the scheduler. The scheduler was rewritten to be multicore aware: effectively, what was a single monolithic scheduler with a single "ready list" was restructured into several cooperating #Scheduler_Instances. A further observation was that activating the schedulers at a period boundary is effectively identical to the activity required to manage ARINC 653 windows, so the scheduler was further split into two pieces, a #Window_Scheduler and a Thread Scheduler, the latter abbreviated to just #Scheduler.

Short Duration Wait

In Deos the timeout value waitShortDuration is used to yield between threads of the same priority. If the thread is scheduled again without the resource being available, it reports a timeout status. This allows threads to ping-pong while sharing a resource and maintain their fixed RMA budgets. In many client-server relationships, budget transfer was used in conjunction with a short-duration wait.

In the multicore kernel the semantics of short duration wait have not changed. However, there is the possibility that the related threads may be configured to execute in different #Scheduler_Instances. In this case budget transfer cannot be utilized, and when yielding the short duration wait may not have any threads ready to execute at the same priority. This may result in a timeout status before the thread on the other scheduler/core has a chance to execute and make the resource available.

Therefore, the application design must account for this possibility. One approach is to loop, calling the wait API with either waitNoWait or waitShortDuration, until it does not time out or some other condition (such as the period identification) has changed.

A thread waiting short duration in the same #Scheduler_Instance will be readied as soon as the resource is available. A thread in a different #Scheduler_Instance, however, must wait for the other same-priority threads in that scheduler that are also polling with short-duration waits to time out or become ready before it is scheduled and can detect that its resource is available.

See #Imprecise_Waiter for more details about waiting between #Scheduler_Instances.

Imprecise Waiter

In the multicore kernel, pulsing an event can potentially ready threads in an effectively unbounded number of #Scheduler_Instances. Clearly, visiting N scheduler instances is not possible in a single time-bounded critical section, and breaking an event pulse into multiple critical sections is very complex. Instead, each event is assigned an "owning #Scheduler_Instance" matching the thread that created the event. Given a thread T that is waiting for an event E, if the owning #Scheduler_Instance for T is different from the owning scheduler instance for E, then T is called an Imprecise Waiter of E.

When a thread does an imprecise wait, rather than being added to the list of threads to be readied when the event is pulsed, the thread is added to an imprecise waiters list in the thread's owning #Scheduler_Instance and the event is marked as having an imprecise waiter.

If there are imprecise waiters when a pulse is performed, every core in the system is sent an #Inter_Processor_Interrupt. On receipt of the IPI, the schedulers put all threads on their imprecise waiter list onto their runnable list, and the threads determine if the resource they were waiting for has become ready.

The same situation does not apply to semaphores: signaling a semaphore releases only one thread, whereas pulsing an event releases all waiters. Once the appropriate thread has been readied, the kernel determines whether the thread is a member of the current scheduler on the current core. If it is, the thread may be scheduled immediately; if it is in another scheduler, the thread may not be scheduled until the next window activation.

TODO: We need to provide guidance to the user for this situation. E.g., an imprecise-waiting thread may consume its fixed budget and return eventInsufficientTime due to unrelated event waits; the act of integrating other applications can affect the behavior of existing applications.

TODO: Need a better solution.

Since a #Scheduler_Instance may not be current when an event is pulsed, threads on the imprecise wait list are readied whenever the scheduler instance is activated due to a window change.

When a semaphore is cleared or deleted, all of the waiters must be unblocked. In this case the semaphore is marked as not usable and the kernel releases one thread at a time, allowing for preemption between each thread.

Currently, runAllWaiters loops through the waiting list and the slackWaitingList associated with the semaphore. After each thread, it enables and disables interrupts to prevent an O(n) critical section. This is necessary because these lists contain threads from all schedulers, so the list cannot simply be merged into one scheduler's ready list. In normal semaphore operation only one thread is pulled off at a time, so that thread can be added to its scheduler's list without needing a merge. The performance cost of multiple preemptions was not a concern, since clearing and/or deleting a semaphore is considered an atypical operation; the priority was keeping the longest critical time down. However, depending on what the other longest criticals are, this may be changed to release a small batch of threads (e.g., 5-10) before allowing preemption.

Application Induced Blocking

With concurrent execution on multiple cores, it is crucial to understand the interference which may need to be accounted for in a thread's budget to ensure sufficient worst-case execution time. The multicore kernel executes on behalf of the thread running on a core, and any time it spends is allocated to that thread. Therefore, the kernel should not introduce any blocking or interference that the application does not expect. This is referred to as Application Induced Blocking (AIB). AIB is acceptable when the impact is confined to the threads involved in it.

For example, if two threads are waiting on and signalling the same semaphore, then that is an application coupling where interference is expected. However, if another thread is signalling a different semaphore or pulsing an event, then the application would not expect interference and that would be non-AIB interference.

The multicore kernel is designed to eliminate non-AIB interference imposed by Deos algorithms. Refer to #Locks for more information on how the kernel blocking works.

Locks

The Deos kernel has many data structures which must be protected from concurrent access in order to ensure data integrity and consistency. The simplest approach would be to use a single critical section to protect the kernel data, but this would lead to non-AIB interference. Instead the multicore kernel uses a finer grained locking strategy.

Smaller locks increase the intellectual overhead and introduce the potential for livelock and deadlock. To prevent these, the locks are aggregated into a few categories, and within a "core local critical" at most one lock may be held at each category level, acquired in a specified order. If a second lock is required at the same level, the algorithm must release all of its locks, exit the critical section, and then enter a new critical section, in order to maintain bounds on the worst-case critical section times. Locks are not permitted to be held outside of a core local critical section.

The categories are defined as:

Thread Private
Protects data that must be accessible to the current thread.
Scheduler Private
Protects data used for scheduling which must always be available to a non-SMP scheduler.
KIO Public
Protects data associated with kernel interface objects (e.g., semaphores, processes, threads).
Scheduler Public
Protects data associated with cross scheduler readying of threads which will then occur at the next scheduler activation.
Miscellaneous
Protects other data which has not migrated yet to the new designs.

The Miscellaneous category is envisioned to be eliminated or merged with Scheduler Public once all of the kernel algorithms have eliminated non-AIB interference.

The categories above are listed in the order in which the locks must be obtained. For the scenario of a thread signalling a semaphore that will ready another thread, the following sequence of locks is taken.

  1. The thread private lock is always implicitly held.
  2. Enter the core critical.
  3. Lock the private data for the thread's scheduler.
  4. Lock the semaphore.
  5. Determine the thread to release.
    1. If it is in the same scheduler, it can be readied immediately using the private data structures already locked.
    2. If it is in another scheduler, lock the public data for the released thread's scheduler, add the thread, and unlock the public data.
  6. Unlock the semaphore.
  7. If the released thread is in the current scheduler, perform scheduling in case it is higher priority. If scheduling occurs, the critical section, scheduler private lock, and thread private lock are migrated to the new current thread, and will be migrated back once the preempted thread resumes.
  8. Unlock the private scheduler lock.
  9. Exit the core critical.

Every time an explicit lock must be obtained, there is a potential for blocking. Assigning a thread to a scheduler which does not support SMP ensures that the first two categories of locks are always immediately obtainable. The third level of lock depends on a specific kernel interface object; if blocking occurs there, multiple threads require access to the same kernel interface object, and this is a point in the application design where #Application_Induced_Blocking may occur. At the fourth level, the time for which the lock is held is expected to be minimal. Applications that are designed to coordinate across #Scheduler_Instances are indicating that this is an acceptable overhead at that point in their processing.

Critical Section Bound Formulas

The length of time that the kernel spends within a critical section impacts the latency of platform interrupts and the start of the next window activation. The crittime kernel will be updated in the future to support measuring the times that encompass the different values in the formula. There are two cases:

  • No nested locking occurs, or
  • A nested lock is also acquired.

In source code form:

 Lock(outer);
 do stuff;
 UnLock(outer);

or, for the nested lock case:

 Lock(outer);
 Pre nested lock code; // Part of Execution[i]
   Lock(inner);
   do stuff;
   UnLock(inner);
 Post nested lock code; // Part of Execution[i]
 UnLock(outer);

At each nested lock level there is some work which is performed before optionally obtaining the next-level lock, and again after the nested lock is released; these are the pre- and post-nested-lock code above, collectively referred to as Execution. Furthermore, since a lock may be held by another core, there might be some delay acquiring the lock. This delay is called the Latency. We define Duration as the amount of time for which a lock is held; as the formula below indicates, it depends on the latency to acquire a nested lock and the duration spent within the nested lock. Therefore, Duration, graphically depicted, is as follows:

  PreExe   Latency to get nested lock   Duration of nested lock   PostExe 

Since Execution = PreExe + PostExe, and knowing that eventually nesting must stop, we get the recursively defined formula for the Duration for which a lock is held:

 Duration[i] =  Exe[i]  +  Latency[i+1]  +  Duration[i+1] 

where i has the following values:

  1. Core Critical
  2. Scheduler Private
  3. KIO
  4. Scheduler Public
  5. Miscellaneous (Should be eliminated/merged with scheduler public in final implementation)

Since the thread private lock is always implicitly held, it does not impact the formula.
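Assuming per-level execution times and acquisition latencies are known (the numbers used in testing are placeholders, not measured values), the recursion can be evaluated directly; lockDuration() is an illustrative name:

```c
#include <assert.h>

enum { LOCK_LEVELS = 5 }; /* core critical, sched private, KIO, sched public, misc */

/* Duration[i] = Exe[i] + Latency[i+1] + Duration[i+1];
 * the innermost level takes no nested lock, so the recursion
 * bottoms out at Exe[LOCK_LEVELS-1]. */
static long lockDuration(int i, const long exe[], const long latency[])
{
    if (i == LOCK_LEVELS - 1)
        return exe[i];
    return exe[i] + latency[i + 1] + lockDuration(i + 1, exe, latency);
}
```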

  • Non-SMP case: thread private and scheduler private are always immediately available.

This leaves us with KIO and sched public.

start of window latency = ???

latency to get thread private = 0 (non-smp case)
latency to get scheduler private = 0 (non-smp case)

The following ignores window activation criticals, which may be long for 653/posix window activations.

latency to start of window interrupt = latency for critical section

latency for critical section = latency to get scheduler private +
   longest scheduler private lock +
   duration of critical section activity

longest scheduler private lock = latency to get KIO lock +
   longest KIO lock +
   duration of scheduler private activity

latency to get KIO lock = (#cores-1) * longest KIO lock

longest KIO lock = duration of KIO +
   sched public latency +
   duration of sched public

sched public latency = (#cores-1) * longest sched public

longest sched public = duration of sched public + 0 (no nested locks)

expected times:

 duration of sched public: time to merge a priority list.
     #priorities * (2 or 3) * time to get a cache line.
 duration of KIO: current longest non-ctx critical.


Approximation:

 latency to start of window interrupt =
     #cores * context switch + (#cores)**2 * longest sched public
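The approximation can be stated as a one-line function. The inputs here are hypothetical values; real numbers would come from crittime measurements:

```c
#include <assert.h>

/* latency to start of window interrupt ~=
 *   #cores * context switch + #cores^2 * longest sched public */
static long windowStartLatency(long nCores, long ctxSwitch, long longestSchedPublic)
{
    return nCores * ctxSwitch + nCores * nCores * longestSchedPublic;
}
```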


Observation: We might be able to tolerate a window interrupt after getting a scheduler private lock, but before getting a KIO lock. This would reduce window start latency.

In practice the worst case will not include the number of cores squared, because the execution time at each subsequent level is expected to be smaller than at the previous level. While one core performs the KIO work after acquiring the lock, the core which released the KIO will be performing the scheduler public operation. If a core had to wait for all other cores at the KIO level, then each of those cores most likely performed this pattern when it released the KIO, so no cores will be waiting for the scheduler public lock, making the actual latency zero. In addition, by controlling the application designs, #cores can be reduced to the number of related cores. For instance, on a quad-core system two cores may be used in an application which coordinates with a semaphore. Therefore, only one other core may already hold the KIO lock on the semaphore, and if this core is waiting for the KIO lock it will not block the other core from getting the scheduler public lock.

Data Structures and Terms

Core

  • A piece of hardware that can independently execute instructions.
  • Processor execution core - GPR's
  • Thread timer for active scheduler
  • Active thread or quiescent state

Cross Core Critical

  • A critical section that only one #Core at a time can be in.
  • At present it is single entry (only one writer or reader).
  • This is intended only for select cases:
    • Startup and shutdown.
    • BIT
    • Window Boundaries.

Kernel Instance

  • Manages kernel heap, handles, etc.
  • Current Schedulers[#cores]  : the "current" scheduler for each core.

Scheduler Instance

Master Core

The CPU core that coordinates cross system events, e.g., startup, shutdown, frame synchronization, and window transitions. There is only one master per #Kernel Instance.

Slave Core

Any core that is not the #Master Core.

Window

A contiguous range of wall time during which a particular set of #Scheduler_Instances are active.

  • time duration @ an offset from WAT start
  • 653 user attribute (periodic processing start)
  • may start early attribute
  • may end early (negation of "sit-n-spin")
    • Window has "is start of system tick" and "slowest RMA rate that starts now" attributes.
    • continuation window vs start of period (may not be bool, but slowest period starting now. scheduler can have table and handle with just bool)
    • Zero width window not permitted. If you have a "may finish early" followed by a "may NOT start early", you have to add a minimal width RMA window between them
  • Rules
    • window at system tick boundary cannot start early
    • Windows can not span system tick boundaries
      • This is necessary if system tick is a system wide attribute.
      • This is a desirement (for most efficient use of window-generated slack) if the system tick is a Deos scheduler notion only. If this is not enforced, something will be needed to prevent a Deos window from spanning a tick boundary.
    • RMA Windows must be at least as frequent as the fastest rate.
    • The least common denominator of the sum of the duration of all Deos windows containing the RMA scheduler per tick is the RMA scheduler cpu quota
      • It is possible to relax this
  • If we allow 653 windows to start early for aperiodic processing, we will need a way to trigger the release point at the normal start of the window. Perhaps if a window starts early the window start time is not updated and the interrupt triggers it.
  • track unused time.
  • Track time donated from other windows. (slack)
  • Associated with a set of schedulers, one for each core.
  • May have scheduler specific data, but then must be validated

Thread

A schedulable entity that executes on a processor core.

  • Stack
  • Budget
    • may be only quanta (wall time) for POSIX threads
    • TODO: may not be guaranteed for ISR threads in interrupt window if interrupts are handled on multiple cores.
  • Priority (dynamic)
  • rate (for replenishment)?
  • #Scheduler_Instance that we are a member of. This may be a scheduler tied to a core giving affinity to this core, or an SMP scheduler referenced by a number of core schedulers.

Window Scheduler

  • handled by master core
  • may handle some Deos global data. May be part of activation/scheduler switch to.
  • invokes thread scheduler
  • allow registering window activation handler

Scheduler

  • Data
    • shortDurationList
    • runnableList
    • slackRequestorList
    • itsHighestSlackRequestorPriority
  • Slack data valid only for RMA schedulers, specifically not meaningful for SMP:
    • fastestPeriodPending
    • itsRunningThread
    • netSlackReclaimed
    • itsSlackBudget
    • _slackBudgetVector
  • SMP issues
    • when we need to get a thread, check the core list and the SMP list; use the higher priority. If the same priority, does the core have preference, or do we need a timestamp inserted and an extra comparison (i.e., FIFO requirement for POSIX/653)?
    • TODO: SMP needs to be changed to a scheduler instance which can manage multiple cores, accounting for some threads in the scheduler having affinity to a single core. Otherwise, multiple scheduler locks are needed simultaneously.
  • Scheduling
    • Cross scheduler thread activation will use a public interface. The public lists will be merged with the scheduler's private lists at scheduler activation (i.e., window activation).
  • Scheduler context switch
    • Does the old scheduler get notification when leaving, or must it clean up when switched back to?
    • scheduler/thread switched away from will get bonus (cache + possibly window)
    • Start/Stop charging CPU time - currently thread members. Should they be scheduler?
      • Posix threads have notion of thread CPU time clock, and process CPU time clock, tracking actual CPU time
      • Deos RMA threads uses for remaining budget
      • Is this mechanism for 653 deadline (Wall time) and Posix round robin (Wall or CPU?)
      • Need to determine who is responsible for starting and stopping time accounting
  • Examples
    • Deos RMA uni-core scheduler
      • Manages Deos threads as today. No Window activated threads present.
    • Priority scheduler unicore
      • Priority preemptive
      • Thread timer used for 653 deadline monitoring or POSIX round robin interval
    • Deos RMA multicore
    • Priority scheduler multicore
      • Priority preemptive across cores in cluster
      • Uses affinities to place ready threads on ready cores
  • schedulers can have different priority ranges
    • certain cross core semantics are only meaningful when schedulers have the same priorities
  • priority preemptive
  • enforces thread budget
  • short duration wait? Thrd level?
  • relative timeout

Time line

Example window schedule:

|   DW    | PW   |  DW    |   6W   | 6W  |  DW   |    DW    |

DW = Deos Window PW = Posix Window 6W = 653 Window

Each DW has budget for cache bonus and ctxpd.

 Total available DW budget for threads ==
     sum of DW window durations - (number of activations * (cache bonus + ctxpd))
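The budget arithmetic above can be sketched as follows; availableDwBudget() and its inputs are illustrative, not actual kernel interfaces:

```c
#include <assert.h>

/* Total available DW budget for threads ==
 *   sum of DW window durations - activations * (cacheBonus + ctxpd)
 * One activation per DW is assumed here. */
static long availableDwBudget(const long dwDurations[], int nActivations,
                              long cacheBonus, long ctxpd)
{
    long sum = 0;
    for (int i = 0; i < nActivations; i++)
        sum += dwDurations[i];
    return sum - (long)nActivations * (cacheBonus + ctxpd);
}
```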

Scheduler Algorithms

Goals:

  • Must be able to activate windows at precise times.
  • Must efficiently support interrupts (minimal budget).
  • Desire window slack.
  • want window scheduler to be as simple as possible, e.g.,
    • No priorities
    • No reordering
    • Interrupt windows are an implementation tactic; they are not specifically required, i.e., it would be nice to eliminate interrupt windows.

Thread Scheduler (just new fields):

 - unmaskedInterrupts : set of interrupts
   The interrupts for which ISR threads in the scheduler are in a
   waitUntilNextInterrupt().

Platform interrupts:

 - Platform interrupts can activate ISR threads either in the current
   scheduler, or for a scheduler in the interrupt window.
 - To reduce overall IWI latency, IWI interrupts should be higher
   priority than non IWI platform interrupts.
 - Each platform interrupt is associated with (at most) one thread
   (at a time), hence one scheduler.
   - It would be acceptable for the registry to know "the" scheduler
     an interrupt would be associated with.
 - When HW asserts an interrupt, there are two cases. If the
   interrupt is not associated with the interrupt window, the PAL
   must call raisePlatformInterrupt() on the scheduler's current
   core.  If the interrupt is associated with the interrupt window,
   then the PAL must call the IWI handler on all cores.  The kernel
   resolves/addresses the interaction of IWI and WSI interrupts.
 - An ISR thread may be in a waitUntilNextInterrupt() when the window
   is suspended, but retainedBudget could expire.
   The next window containing the ISR thread's scheduler
   must ensure the ISR thread is runnable at the window start.  I.e.,
   for RMA schedulers the window start must be a budget replenishment
   point for the ISR thread.  For threads in schedulers that don't
   enforce budgets, this is trivially satisfied.

More fundamentally, ISR budgets cannot be assured across cores. This is the same issue as for single-core RMA ISRs when the interrupt arrives close to the end of a period.

A window can be activated by:

  1. windowStart interrupt.
  2. The early completion of the previous window.
  3. End of an interrupt window preemption.
  4. In addition to the above, the interrupt window can also be activated by the PAL calling raisePlatformInterrupt() for any of the interrupt window's ownedInterrupts.

As in the FourPeaks design, the windows are in a WAT.

WAT

 windows
   A sequence of windows, indexed by 0.
 windowSlack
   The sum of the finish early times for all the most recent
   contiguous sequence of window activations where
   mayFinishEarly=true, minus any windowSlack used in that interval.
 retainedBudget
   The amount of time remaining for the IW.  retainedBudget is zero
   at WAT start.  retainedBudget is incremented by the finish early time
   when the IW is scheduled, decremented when the IW runs, and reset
   to zero (expires) when:
     - All the retainedBudget is used, or
     - a window actualStart happens at its
       (specifiedStart-windowSlack).  An interesting special
       case is the start of a mayFinishEarly=false window.
   In the second case, retainedBudget incrementally transitions to
   windowSlack as retainedBudget would otherwise begin to expire.
 retainedBudgetExhausted
   Indicates whether there is sufficient retainedBudget to activate
   the interruptWindow.  retainedBudgetExhausted is set only by
   core0, but read by all cores.

Window

 specifiedStart  The time specified during system design.
 actualStart     The most recent run-time value.
   The offset from the WATStart.  actualStart is always <=
   specifiedStart+(some TBD overhead parameter)
 mayStartEarly bool
   If true, the window may be activated prior to its specifiedStart.
 specifiedDuration
 actualDuration
 mayFinishEarly bool
   If true, the window may finish before the next window's
   specifiedStart.
 mayUseSlack    bool
   If true, the window may use more time than its specifiedDuration.
 unusedBudgetDestination : enum {interruptWindow, windowSlack}
   Where the unused budget goes when the window ends early.  Only
   the interruptWindow can give the interruptWindow budget (currently).
 ownedInterrupts : set of interrupts
   The interrupts that *might* be enabled when the window is active.
    When a window is activated, the ownedInterrupts for which ISR threads
    are in a waitUntilNextInterrupt() are unmasked.
 enabledInterrupts : set of interrupts
   The interrupts that might trigger an interrupt window activation
   during this window.  These interrupts are only unmasked if there
   is retainedBudget and the ISR thread's scheduler indicates the
   interrupt is unmasked.
 nextWindow
   The next window in the WAT.  The WAT's last window's nextWindow is
   WAT.windows[0]
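The field list maps naturally onto a C structure. The types and the bitmask representation of interrupt sets are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

typedef uint32_t interruptSet_t;   /* assumed: one bit per platform interrupt */

typedef enum { DEST_INTERRUPT_WINDOW, DEST_WINDOW_SLACK } unusedBudgetDest_t;

typedef struct Window {
    uint64_t            specifiedStart;     /* design-time offset from WAT start */
    uint64_t            actualStart;        /* most recent run-time value        */
    bool                mayStartEarly;
    uint64_t            specifiedDuration;
    uint64_t            actualDuration;
    bool                mayFinishEarly;
    bool                mayUseSlack;
    unusedBudgetDest_t  unusedBudgetDestination;
    interruptSet_t      ownedInterrupts;    /* might be enabled while active     */
    interruptSet_t      enabledInterrupts;  /* may trigger an IW activation      */
    struct Window      *nextWindow;         /* last window points at windows[0]  */
} Window;
```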

Rationale: A design specifying times (integers) for flexible start and end times was rejected because integers would (in some cases) require the window scheduler to introduce unscheduled windows into the timeline.

Rationale: There is only one interrupt window because otherwise the interrupt windows would need a priority scheme.

Future enhancements:

  • Allow windows other than the IW to provide retainedBudget.
  • Allow windows to designate the window to receive its unused budget.
  • Allow IW to use slack.
  • Allow multiple schedulers (presumably on different cores) to own the same IWI.
    • May need some restriction like all schedulers are in the same window(?).
  • Allow a window to specify an "earliest start time" to constrain how far left the window could slide.
    • Some constraints like "may use slack" might have to be co-properties to avoid the need to introduce unscheduled windows.
    • Intent is to provide more precise control over window timing without fixing the window in time.
    • Proposed at the 3/19/2015 telecon.

If a window has mayStartEarly=false, the preceding window's end time must be fixed. The permitted predecessor combinations are:

  1. mayStartEarly=false, mayFinishEarly=false, mayUseSlack=D/C.
  2. mayStartEarly=true, mayFinishEarly=false, mayUseSlack=true

If a window has a fixed duration (mayUseSlack=false), then mayStartEarly=true implies mayFinishEarly=true.

For mayStartEarly=false windows, if the predecessor could finish early and there is an ISR window wanting CPU time, it would be a shame to force the predecessor to spin.

To minimize overhead, the PAL must deliver window start interrupts, and all interrupt window interrupts, to all cores.

  • Rationale: An alternative is that the PAL delivers to one core, then the kernel sends IPIs to all other cores. In that case the latency is 2*longest critical, rather than 1. Also, most platforms can be configured for the HW to deliver the interrupts to a specific core, or to all, so the PAL overhead may be minimal.
  • Note: In the March 2015 and previous implementations, IWI is only delivered to master core.

If the Interrupt Window is activated close to the next window's (specifiedStart - windowSlack), it could significantly delay the next window's actualStart. In this case there will be an interruptInhibitedInterval during which an interrupt that would normally activate the interrupt window will be inhibited (the current design is to mask all interrupt window interrupts when the first such interrupt arrives in the interruptInhibitedInterval). Core 0 determines the start of the interruptInhibitedInterval.

The interrupt window must be scheduled in order to "refresh" its budget (see the description of WAT.retainedBudget).

The expected scheduling sequence in a "period", would be:

  1. Non-interruptable, fixed start, fixed duration windows.
  2. The interrupt window
  3. May start early windows.

We're debating whether some form of the above should be, or could be, formalized to simplify the design, but no specific proposal is currently available.

A "start of period" must be at a window that is mayStartEarly=false.

See the FourPeaks design for timing of Start of Window interrupts. Basically SOW is when the HW will assert the interrupt. We will publish the "maximum delay before the first user mode instruction", and that is what the application would consider to be the start of the window.

Is there a way we could separate the window and thread schedulers more? E.g., have window scheduler be able to preempt thread scheduler critical regions?

  • Use a VM like separation. There would have to be constraints on what scheduler operations could block other schedulers, and constraints on consistency of scheduling so that future windows were not blocked by a lock held by an inactive scheduler.
  • Perhaps if IW/SOW interrupts had higher priority, and schedulers could complete the critical region of another scheduler, at least until the point where the original scheduler could continue.

The design addresses the fact that a window can be started by several different events that can arrive and be interpreted differently by each core, and that each core can even get a different number of triggering events. E.g., one core could decide that its window should complete early, and another core could get a window interrupt before it makes the same determination. Thus one core gets two window change events, the other core gets only one. The differing number of events issue is addressed in proposal 1 by case analysis of which events are redundant or for which duplicates are possible, and then addressed at the beginning of the function.

Window interrupt handler algorithm

Core 0 determines window to start, and all cores synchronize for window activation.

  • TODO: Need a case analysis of the mayStartEarly, mayFinishEarly, mayUseSlack combinations.
  • TODO: it is TBD if the window start interrupt is written on master core or all cores.
  • TODO: Are IWI's masked on each core, or only by master?
  • Window timer only needs to be readable by master.
  WSI   Window Start Interrupt
  WSI.time = latest time that the current window can finish, aka next window start.
    Always relative to current WAT start.
 
    For interrupt window:
           = retainedBudget
    else (non interrupt windows):
           = window.duration + window slack
           window slack includes retainedBudget if next window has a fixed
           start time, i.e., the current window is the last window before retainedBudget expires.
  IWI   Interrupt Window Interrupt (multiple possible)
 
 // Note handler is also called by maystartEarly().
 WSI_or_IWI_handler(bool IWI or WSI, actualInterruptNumber) // runs on all cores
 {
   if IWI and currentWindow.isIW then propagate actualInterruptNumber to sched and return;
   if IWI and currentWindow has IWI's masked, is assumed to be impossible;
   if IWI and retainedBudgetExhausted return;  // This is a "spurious" interrupt, only one/core possible.
   if currentCore == 0
      syncA;  // A simple barrier, perhaps based on a window activation counter
      recompute retainedBudget;  // cases: nextWindow.mayStartEarly=false and currently window is IW.
      switch:
        IWI and not currentWindow.isIW:
           if retainedBudget > some threshold
             currentWindow = interrupt window;
           else
             // PAL must reassert IWI later, TODO: which may be an issue for edge triggered interrupts.
             temporarily mask all IWIs; // Prevent remaining IWI's from consuming any more budget
             retainedBudgetExhausted = true;
             programming window timer not necessary in this case;
        WSI: currentWindow = next normal window;
             retainedBudgetExhausted = false;
      syncB; // sync with slaves again;
      // interrupt mask status TBD.
      write window timer for currentWindow;
   else  // slave core
      syncA;
      syncB;
   update IWI mask status for currentCore (depends on both window status and temporary mask status);
   activate scheduler for currentWindow on this core;

Since IWIs are broadcast, some schedulers will not have an ISR.

Use Cases

The above algorithm needs to address IWI, WSI, and finishEarly() cases, both when addressed separately and in combinations close enough that the observed event sequence may differ between processors. Fortunately WSI and mayStartEarly() are disjoint because mayStartEarly() is synchronized from a critical on all cores. This means that the only ambiguous event sequence is IWI and WSI. Furthermore different cores may see different IWI interrupts, but all are assumed to be masked simultaneously so are treated as one event below.

Case  currentwindow                Events                 nextWindow
       IWIEnabled       Core 0             Core 1          IWIEnabled
 5       false          WSI                WSI             D/C
 1       true           IWI, WSI           IWI, WSI        false
 2       true           IWI, WSI           WSI, IWI        false
 3       true           IWI, WSI           WSI             false
 4       true           WSI                WSI, IWI        false
 5       true           IWI                IWI             false
 6       true           WSI                WSI             false

Only Core 0 can change the window. There is a cross core synchronization point when the window is changed.


Finish Early

An important property of any mechanism is that either all cores will start early, or none will. This prevents some cores from doing a window transition via start early and others via an interrupt, e.g., a WSI.

This algorithm is for the case where idle triggers "finish early" and it supports schedulers with ISRs, however it has an undesirable "timeout".

The algorithm is asymmetric. Every core calls finishEarly() in Idle, but a master core initiates commands to the slaves and coordinates their responses. Since not all slaves may be ready to finish early, the algorithm employs a timeout to detect ineligible slave cores. The timeout should be set to the maximum amount of time it would take for all N cores that were already in finishEarly() to receive and respond to a command. Assuming a CPU has a write buffer depth of NWB cache lines, a command will be seen in NWB times the time it takes to transfer a cache line to memory, plus one times the time it takes to transfer a cache line between cores. The equation for the slave's response should be similar, but the details depend on the configuration of the memory subsystem.

In the following the phrase "could observe foo in {x, y, z}", means that at that point in the code, the current core could observe the variable foo as having any of the values in the set.

The key to the algorithm is that every core will see the SAME transition of command from wait to either sleep or finish. This is important because any core specific state transitions without a synchronization point in between them could permit the first state to be missed by a cross core observer.

Either the master or the slave could handle ISRs until all cores finish. Care is taken to ensure that inter-core cache line transfers are minimized.

 command_t enum { run, wait, finish };
 command_t command=run;  // "command" only written by core0
 slaveState_t enum { running, waiting, acknowledging };
 slaveState_t slaveState[ncores] = running;  // Elements inter-core cache line aligned.
 
 // References to command and slaveState need to be onceOnly(). There
 // need to be memory barriers between reads and following writes (not
 // writes and reads), to prevent speculative and out of order writes.
 function finishEarly()  // This would replace PALidleFunction().
   loop
     if currentCore==0

       // Not sure how to get a power efficient wait here.

       // At this point, and at all points outside of the following critical,
       // core0 will observe, for all other cores, slaveState[core]==running
       enterCritical();  // May have to be master critical.
       if finish early is permitted, e.g., the window specifies mayFinishEarly==true then
         command=wait;
         var myCommand=run;
         if for every other core, slaveState[core]==waiting within timeout then
           command=finish;
           myCommand=finish;
           wait for every other core, slaveState[core]==acknowledging;
         else
           // The "else" case must account for only a proper subset of cores
           // seeing the command.  Observing "running" below does the trick.
         command=run;
         // The following loop prevents cores from observing a transition
         // directly from wait to another wait.
         wait until for every other core, slaveState[core]==running;
         // At this point core0 will observe, for all other cores, slaveState[core]==running
         if myCommand==finish
           WSI_or_IWI_handler(WSI=true, actualInterruptNumber=D/C);
       exitCritical();
     else // Not core 0
       // For power efficiency, there could be a waitUntilEqual(&command, wait); here.
       enterCritical();

       // Could observe command in {run, wait}, but not finish.
       if command==wait
         // Could observe command in {run, wait}, but not finish.
         slaveState[currentCore] = waiting;
         // Could observe command as anything, i.e., {run, wait, finish}.
         // The "Not Equal" here addresses the fact that some cores may
         // not see command=wait, or may not be in this algorithm at all.
         var myCommand=waitUntilNotEqual(&command, wait);
         // Could observe command in {run, finish}.  
         slaveState[currentCore] = acknowledging;
         waitUntilEqual(&command, run);
         slaveState[currentCore] = running;
         if myCommand==finish
            WSI_or_IWI_handler(WSI=true, actualInterruptNumber=D/C);
       exitCritical()
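A minimal two-core model of the handshake above, using C11 atomics and one pthread per "core"; the timeout, the criticals, and the power-efficient waits are omitted, and all names are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>
#include <stddef.h>

typedef enum { CMD_RUN, CMD_WAIT, CMD_FINISH } command_t;
typedef enum { SL_RUNNING, SL_WAITING, SL_ACKNOWLEDGING } slaveState_t;

static _Atomic command_t    command    = CMD_RUN;    /* written only by core0 */
static _Atomic slaveState_t slaveState = SL_RUNNING; /* single slave, for brevity */
static int masterSawFinish, slaveSawFinish;

static void *core0(void *arg)
{
    (void)arg;
    atomic_store(&command, CMD_WAIT);
    while (atomic_load(&slaveState) != SL_WAITING) ;       /* timeout omitted   */
    atomic_store(&command, CMD_FINISH);
    while (atomic_load(&slaveState) != SL_ACKNOWLEDGING) ;
    atomic_store(&command, CMD_RUN);
    /* Prevent the slave observing a transition from wait directly to wait. */
    while (atomic_load(&slaveState) != SL_RUNNING) ;
    masterSawFinish = 1;            /* would call WSI_or_IWI_handler() here */
    return NULL;
}

static void *core1(void *arg)
{
    (void)arg;
    while (atomic_load(&command) != CMD_WAIT) ;            /* waitUntilEqual    */
    atomic_store(&slaveState, SL_WAITING);
    command_t my;
    while ((my = atomic_load(&command)) == CMD_WAIT) ;     /* waitUntilNotEqual */
    atomic_store(&slaveState, SL_ACKNOWLEDGING);
    while (atomic_load(&command) != CMD_RUN) ;
    atomic_store(&slaveState, SL_RUNNING);
    if (my == CMD_FINISH)
        slaveSawFinish = 1;         /* would call WSI_or_IWI_handler() here */
    return NULL;
}
```

Because the master only advances command after observing the matching slaveState transition, both cores see the same wait-to-finish transition, which is the property the design note above calls out.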


State Management and Coordination

Correct operation requires that machine state (including memory) is only accessed in a coherent manner. The kernel will always perform coherent accesses, but it can't ensure that applications are doing so. For example, if thread A on core 1 is looking at the user stack for thread B on core 2 while both A and B are active, then the kernel is still operating properly, but the application is misbehaving and the application results are undefined.

Deos provides assurances only to *well behaved* applications. In simple terms, well behaved applications work in the presence of parallelism. Temporarily ignoring time partitioning issues, a well behaved application is one that would perform its intended function if any ready thread could be interrupted between the execution of any two of its instructions by any number of other ready threads that execute any number of their instructions before the first thread executes its next instruction. Note that the thread may be susceptible even to the execution of lower priority threads, if those threads are on a different core.

All machine state (including memory) is characterized by what level of mutual exclusion (execution state) is required for it to be coherently accessed.

Open Issues

TODO: Should this be called Inter Processor Interrupt (IPI) or Inter Core Interrupt (ICI)?

What HW assumptions can we make?

  • Queued int/message delivery?
    • No
  • automatic TLB and cache coherency?
    • In general No although some architectures may support it.
    • Use of such capabilities would preclude Deos being used in a heterogeneous (different RTOS) cluster.

TODO: Limit IPI. Multiple cores in core local criticals.

  • Kernel defined events
    • "force all quiescent" (e.g., for raiseKernelException())
    • Shutdown
    • System Tick(?)
  • If a core raises an inter-core interrupt, how do we ensure that interrupt is processed before other interrupts, e.g., timer?
    • Only one core in a critical at a time helps, but the "wait to enter critical" must address inter-core interrupts.

Page Tables and TLBs

TLB handlers may access page tables asynchronously.

64-bit page table entries will either have to be written atomically, or with a specific protocol, in order to ensure that consistent data is read out. Either way, TLBIVAX can't be trusted to invalidate TLB entries, because the IVAX may arrive between the read of the page table entry and the writing of the TLB.

One protocol would be to write the entry not present, then write the other word, then the word containing the present bit.
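A sketch of that protocol, modeling the 64-bit entry as two 32-bit words, with C11 fences standing in for the architecture-specific barriers; this is illustrative, not the final algorithm:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

#define PTE_PRESENT 0x1u

typedef struct {
    _Atomic uint32_t lo;   /* word containing the present bit */
    _Atomic uint32_t hi;
} pte_t;

/* Write a new translation so that a concurrent TLB-miss handler never
 * observes a present entry whose two words come from different versions. */
static void pteWrite(pte_t *pte, uint32_t newHi, uint32_t newLoWithoutPresent)
{
    /* 1: mark the entry not present first */
    atomic_store(&pte->lo, atomic_load(&pte->lo) & ~PTE_PRESENT);
    atomic_thread_fence(memory_order_seq_cst);
    /* 2: write the other word */
    atomic_store(&pte->hi, newHi);
    atomic_thread_fence(memory_order_seq_cst);
    /* 3: write the word containing the present bit last */
    atomic_store(&pte->lo, newLoWithoutPresent | PTE_PRESENT);
}
```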

The final algorithm is still TBD. Consider investigating BSD/Linux implementation. Unsure how Intel architecture handles this since the page table entry may be read by HW at any time. Key phrase in the literature is "TLB shootdown".


TLB shootdown

Visibility:
Change:
Coherency:

Paging structures must be consistently viewed across all cores.  I.e., a
thread from any process executing on any core must see Deos RAM (all
globally visible state?) the same in all cases.

Visibility
Who can change it and when.
What is its lifetime

Many things immutable after first process dispatch.
  paging structures in svas
  idt raw handler tables
  code

many processor specific registers defined by boot

floating point context

timestamp counter (TSC).  Boot/PAL are required to synchronize the
value across all cores.


Memory
  can be consistently read/written with single aligned read/write.

icache
  Coherent until kernel changes mapping of code space, then requires
  cross core coherency protocol.

TLBs

memory mapping change (VAS):
  TLB shootdown
  perhaps icache if virtually indexed.



Thread User stack:
  Can be written by:
    - Current Thread
    - Any thread running BIT
    - Some other thread executing in cross core crit that ensures the
      thread is :
	- not the "current thread" on any core.
	- Ensures the core with the thread as "current" is "quiescent".
  

kernel data structure:
  Can be written by:
    - Current Thread when holding cross core critical.
    - Any thread running BIT
    - Kernel during startup/shutdown.
  
data used to enter kernel critical
exception vector save area
  Any core, but only using special algorithm
  Special note: Not BIT.

kernel code:
  any thread on any core at any time, except when some thread is
  running BIT.


Issues

  • event wait must account for not waking in a period. Perhaps array of tick counts.
    • current implementation tracks number of periods elapsed when thread does poll
  • Short duration wait cross core is problematic for events. Perhaps need wait with "ISR" semantics.
    • currently we prevent and raise a new error code

Optimizations