Kernel multicore


Project to finalize the Deos multicore kernel.

Marketing Level Overview

The multi-core kernel is based on the greys feature set. The kernel is API complete and largely functionally complete, but requires numerous optimizations to satisfy non-interference requirements and to properly account for system overheads.

  • Targets: ARM, PPC, x86.
    • No MIPS. Not even unicore.
  • Only RMA scheduler
    • POSIX/FACE via RTEMS in user mode.
    • 653 via user mode library.
  • No RMA SMP
    • SMP for 653 and RTEMS possible in a future release, but will require modest kernel changes
  • Memory throttling only partially implemented.
    • At present only really usable by 653 (using slack inhibit mechanism).
  • Thermal throttling
    • Using slack inhibit feature.
  • QEMU for all supported targets.
    • With multi-core support

Backward Compatibility

The multi-core kernel is, to the largest extent possible, binary backward compatible with prior Deos kernel versions. However, some restrictions are necessary:

  1. Mutexes & budget transfer are limited to threads within a single scheduler.
  2. BIT routines have been removed.
  3. BSP interface changes were made.

Be a Maintainer

Currently running the multi-core kernel "head" requires the following:

  • Install the following "unreleased" components:
    • integration tool (3.37.0)
    • kernel
    • status monitor
    • hypstart
    • any unreleased qemu-{arm,ppc,x86}-{boot,config,pal}

Task Lists

Kernel Tasks

The current certification-complete "need by" date is end of Q1 2019, driven by Harrys_Program.

Release 1 Lock and Barrier Optimization

Done.

Task Priority Assignee Status Effort (hours) Remarks
Remove multi-level criticals 2 Done 0
Objects with multiple owners and when objects change ownership(POOSOO) 6 Done 0 I believe this was w.r.t. ownership of thread nodes on lists.
Resolve accessHack issues 4 Done 0 See #Removal_of_accessHack and #2017-07 Face to Face Meeting
Remove CCC lock from memory pools. 6 Done 0
Atomic PTE update 6 Done 0
Finalize #Anticipated_Process_Changes(APROCH) 2 Done 0 Needed by "Initial" to make sure artifacts embody new standards.
Default cache pool for mailboxes 0 Done 0 Told LM that this task has been deleted.
Identify how to restrict globally virtually visible. 6 Done 0 See #Make Kernel not Globally Virtually Visible. Not clear this is required yet. May need this for user cert, not clear yet. Meeting in FL early July (#2017-07 Face to Face Meeting).
Investigate setting code/RO-data to be non-coherent 6 Done 0 See #Non-coherent memory
Merge mainline 1 Done 0
C11 types for 64-bit APIs PCR:9981 2 Done 0 Add types file, change systyp UNSIGNED32 from "long" to "uint", see also videobuf
Locking when computing slack when donor main thread is not active (CSWDMT) 6 Done 0 Appears to be OBE
Locking when creating sub-process and mains have mismatched periods (CSPMMP) 6 Done 0 Appears to be OBE
Resolve interrupt handling, separate windows, multiple schedulers/window(IHSWMS) 6 Done 0 PCR:10864 One change added, but may revisit when proper bonusing. Other issues likely OBE, but should verify.
Need invariant/protocol constraining locking on either side of a context switch. 4 Done 0 remove fakeCriticalExit.
Update "kernel" for 64-bit types 6 RLF Done
investigate sync requirements on wait 6 Done 0 What level sync required? Write followed by write without barrier. Impact on non idle loops.
Total @sum(column) 940 original estimate

Release 2 Windows Optimization

Project Number: 2110-002-603
Project Title: Deos Multi-core R&D
Task Number: Release 2
Task Description: Windows Optimization Release
Planned Completion: October 31, 2017

Goal is that cross core interference is bounded as it will be in the final release.

Task Priority Assignee Status Effort (hours) Remarks
Reconcile wikis 0 Pending 24 AL: #Scheduler
Correctly apply lock protocols(KIMBKOPL) 2-release RLR Done 0 Remaining: Add lock documentation to SRD and trace to code.
Window slack Full Functionality 2-release Done 0 Requirements, analyze any corner cases, etc.
Proper bonus for window activation and scheduler interactions design(IIAWRBII) 2-release Done 0
Update scheduler documentation. 2-release Done 0
Update Window scheduler documentation. 2-release Done 0
Cache align critical data structures 2-release RLF In Work 60 Note: may be granule aligned in some cases. "Critical data structures" == data structures writable during normal operation (especially .data). Be sure to consider hardware-accessed memory. From the checklist: mutable system state (e.g., L2 shared by a cluster of cores, or page tables) that is accessed by multiple hardware agents (e.g., CPU cores, DMA bus masters, memory management prefetch, etc.). This is done enough for single-core operation.
Develop multi-core#Verification_Strategy 2-release AL,JON Pending 120 Still need to go through CAST32A document. Separation of end system supplier vs DDCI for platform behaviors. Also, core-to-core interference stimulation to support crittime data. Talk with suppliers for non-obvious core-to-core interference, etc.
make kernel not globally virtually visible. 2-release Done 0 At this point the majority of the kernel's data is not user mode accessible. Of the data that is user mode accessible there are only 3 items that are also modified at run-time (the rest are constant after startup): StartOfPeriodEvent::startOfPeriodTickValues[], StartOfPeriodEvent::fastestPeriodCounter, and System::systemTickCount. Except during a frame sync (frame sync operation has not been written yet), these values are only written by the master core during the handling of a system tick when all other cores are quiescent (i.e. it happens infrequently and at time when the cores are assured to not be "fighting" over the cache line).
Thread/Process deletion improve performance of VAS RAM return 2-release AL/RLR Done 0 Initial code implementation of UFT memory is complete. Documentation work is still TODO. Hours reduced from 80 to 20 to do the documentation updates.
Validate CC lock inter-core interference is acceptable(VCCLIIIA) 2-release RLR Done 0
Resolve bad name of interruptAcknowlege 2-release RLR Done 0 Also update documentation. Add task to update all BSPs. Need hooks for window/thread change. Call WindowTimerWrite() on all cores, master core first (with barrier), delete interruptAcknowlege(). PPI_CHANGE
Add applicable threadHandle to timerWrite 2-release RLR Done 0 Removes need for PALs to call currentThreadHandle(). PPI_CHANGE
remove stdcall on setLogicalInterruptHandler() 2-release RLR Done 0 Also remove docs and code references to DEOSKERNPPI2. PPI_CHANGE
change x86 HAL to halt on services trap from kernel mode. 2-release RLR Done 0 PPI_CHANGE
remove KERNEL_PROTOTYPICAL_MULTICORE_SUPPORT macros and references 2-release RLR Done 0 PPI_CHANGE
Add "PAL is using obsolete interfaces" to idle display 2-release RLR Done 0 PPI_CHANGE
Decide what to do with atomic-fence.h 2-release RLR Done 0 Remove atomic-fences.h
Remove Interrupt Window Functionality (kernel) 2-release RLR Done 0 Requirements, analyze any corner cases, etc
Remove Interrupt Window Functionality (Integ tool) 2-release Done 0 Requirements, analyze any corner cases, etc
Add adjustNextLibraryStartAddress PCR:10816 2-release RLR Done 0 Decided to not do this in the kernel.
Remove BIT tests 3-release RLR Done 0 both testPageOfRamCells() and testRamAddressDecodeLogic(). Create PCR for remapPhysicalAddress() so someone could implement a memory scrubber.
Optimize TLS 3-release Done 0 Preliminary reports from Honeywell indicate TLS on multi-core is substantially slower than UC.
Reconcile test suite 3-release GK In Work 100 1) Update existing test suite for changed APIs, C11 types. 2) Address changed/added requirements.
Implement POM 3-release AL/RLR Done 0 Finish TODOs, implement secure/release proposal.
64-bit page tables for ARM (kernel and hypstart) 3-release AL/RLR Done 0
Fix GCC 7.3 linker script 3-release SH Done 0
Fix VAS locks 3-release RLR Done 0 Currently using RAM quota lock which is insufficient, e.g., system process not same as current process and/or memory object.
Update kernel examples for 9.0.0 types/API changes 3-release NL Done 0
Resolve resource wait definition specificity 3-release Done 0 See GK's mailing list posting
Merge multi-core branch back to mainline 3-release Done 0 Includes Subversion, Bugzilla, and updating externals of hypstart.
Total @sum(column) 1760 original estimate

Release 3 Space and Time Optimization

Project Number: 2110-002-603
Project Title: Deos Multi-core R&D
Task Number: Release 3
Task Description: Space & Time Optimization Release
Planned Completion: December 31, 2017

Task Priority Assignee Status Effort (hours) Remarks
PAL interface changes: backward incompatible Phase 2 3-release Done 0 Remove binary backward compatibility from the kernel. Remove IPI interfaces.
Proper bonus for window activation and scheduler interactions implement 3-release Done 0
Do preliminary review of kernel stack size 3-release Done 0 Just to get an early confirmation the size is adequate.
Design and implement frame sync lost logic 3-release RLR Done 0
Design and implement warmstart logic 3-release RLR Done 0
Portals thread specific 3-release Done 0 Static assignment optimization could be added but is not required based on current support. At a minimum need to verify the portal analysis is correct.
Support Release activity 3-release Pending 80
Add support for mapViewOfKernelFilePPI PCR:10507 3-release Done 0
Reduce kernel object sizes 3-release Done 0 Decided all objects are small enough.
Resolve ~500 TODO items 3-release In Work 40 estimate 2 hours each
Resolve currentThread from PAL 3-release Done 0 API has become trapping.
Update crittime kernel to address lock level durations and multiple cores 2a-release JN In Work 300 include ctxtime, PIG
Remove support for PPC 750/e600 cores 3-release RLR Done 0 These processors can no longer be supported because there is no way to invalidate TLBs across multiple processors.
TLS. Durants3 JN Done 0
Individual file system part numbers design Durants3 Pending 0
Individual file system part numbers implementation Durants3 Pending 0
Total @sum(column) 1068 original estimate

Certification Candidate Release

Project Number: 2110-002-603
Project Title: Deos Multi-core R&D
Task Number: Cert.Cand.
Task Description: Certification Candidate Release

Task Priority Assignee Status Effort (hours) Remarks
Update CPU tables with data item applicability, e.g., HW thread, core, cluster, etc. (not needed for harrys) 9-CertCand Pending 80 Estimate is mostly infrastructure. Specific processor analysis and documentation not included.
Warn users about things like interrogation capability. 2-release Pending 40 Add AIB warning to UG. Done enough for single-core operation?
Implement #Verification_Strategy 9-CertCand Pending 200
capture the external references made by the updated howtos 9-CertCand Pending 8 Current links are to external web sites.
Update test case and procedure checklists. 9-CertCand Pending 40
Create a new kernel variant for verifying kernel locks. 9-CertCand Pending 16 May be able to use Debug kernel, but would have to change kernel RFS to require debug variant.
Total @sum(column) 596 original estimate

Post Harrys Tasks

Task Priority Assignee Status Effort (hours) Remarks
Update critical section bound documentation. 2-release Pending 80 We might be able to reduce scope for this for single-core operation.
Time accounting information APIs 3-release Pending 120 Supply processor usage information for windows, cumulative execution time for threads (F22) Done enough for Harrys?
RMA Schedulers that don't get activated at rates faster than the fastest rate they have threads ?-release Pending 80 Related to "Integ tool support for major frames longer than the fastest period". When is this needed? Harrys does not seem to need this. Can this be deferred until after the current set of customer certs?
Update UG description for all APIs 3-release Pending 120 AIB related and other MC whatever new documentation standard is. Done enough for single-core operation?
Total @sum(column) 1760 original estimate


Tasks Pending Assessment

Tasks uncovered during development that have not yet been assessed are captured here until we decide what to do with them:

  • PCR:11141 - Rename dwPageSize and dwAllocationGranularity in DEOS_SYSTEM_INFO
  • PCR:11140 - Remove eventLogClockFrequency from DEOS_SYSTEM_INFO
  • PCR:7345 - Specify video memory address in the hyperstart image

Deferred Tasks

Task Priority Assignee Status Effort (hours) Remarks
Cross core mutex (kernel PCR:9940, IT PCR TBD) Post TRL6 Defer TBD Have user workaround using semaphore and a mutex
Runtime support for dynamic thread templates (M4.1) 7-FullFunc Defer 100 Ref minutes for March 5, 2015 meeting. Estimate is from quote to complete
Minimize number of idle threads Post TRL6 Defer 40
Make Mutex blocking time per scheduler Post TRL6 Defer Currently blocking time is system wide.
MIPS updates Post TRL6 Defer 0 Multi-core support not required, but must make consistent changes. MIPS won't be part of MC kernel release
Remove boot interface object maxCoreIndex Post TRL6 Defer 56 Ref kernel-maintainers posting
How do timeouts work for posix/653? Is kernel support required? Post TRL6 Defer TBD Was in play when Kernel was going to do the scheduling
Multi-core debugger support Post TRL6 Defer 0 Scope needs to be defined. May be adequate as is. No work identified, but still might need something
Return Idle Implementation to the PAL 7-FullFunc Defer 0
Single writer, multi-reader semaphores Post TRL6 Defer TBD
Yield API Post TRL6 Defer TBD
Eliminate hard-coded limit on number of cores Post TRL6 Defer TBD
POSIX scheduler Post TRL6 Defer TBD
ARINC 653 scheduler Post TRL6 Defer TBD
multi-cast HAL/PAL inter-processor communication (kernel) 4-Initial Defer 27 23 PAL interface changes: backward incompatible Phase 2 existing code sufficient for initial releases
Make kernel stack come from application memory pool Post TRL6 Defer 60 50 46 Probably not necessary until SMP
Windows finish early idle priority enhancements(SWWFEACAR) Post TRL6 Defer
Add window notification PCR:10441 Post TRL6 Defer 60 2 of three items done, do we want waitUntilNextWindow?
Add ability of PRL/PAL to change physical address mapping of a page Post TRL6 Defer 60 To permit a PRL to implement a memory scrubber. First mapPhysPage, then remap over memory to be tested.
Total @sum(column)

BSP Task List

The following BSPs should be supported by the multi-core kernel:

  • x86: qemu-x86, minnow-turbot
  • ppc: qemu-ppc, t10xx
  • arm: qemu-arm, zcu102
Task Priority Assignee Status Effort (hours) Remarks
qemu-platforms PAL interface changes: 1-release TBR In work 6 Update to kernel PAL interface changes window timer write, phase 1, phase 2 backward incompatible - The 3 qemu bsps now work. Hopefully the changes to the bsps can be reused for the real boards. The 3 qemu bsps are currently uploaded as UNRELEASED. It will take a few hours to move them to stable (9 components total).
qemu-platforms Fix multi-core BSPs to run kernel regression 1-release TBR In work 15 Ensure tests can run, but no analysis on qemu platforms. Kernel tests have been attempted on qemu-x86, qemu-arm and qemu-ppc. Some of the tests either kill the qemu platform or make it unresponsive, so no further tests can be run. When these few tests are exempted, results have been achieved for the remaining tests. Many give the same results on all 3 qemu platforms, some have configuration differences. The tpk*_down typically work, but the following tpk*_norm don't (the needed applications are not left on the target). Note also that some tests are marked as problem tests that are not run.
t10xx PAL interface changes: 1-release Pending 80 Update to kernel PAL interface changes window timer write, phase 1, phase 2 backward incompatible
t10xx Fix multi-core BSPs to run kernel regression 1-release Pending 40 multiple memory pools, analyze failures, etc
PCR:10155 Broadcast interrupt needs investigating. Pending
ARM kernel regression 1-release Pending 40 Ensure tests can run, but no analysis on qemu platforms
PPC kernel regression 1-release Pending 40 Ensure tests can run, but no analysis on qemu platforms
X86 kernel regression 1-release Pending 40 Ensure tests can run, but no analysis on qemu platforms
Support Release activity 1-release Pending 40
Evaluate IO Fence support TBD Pending 40 Investigate C11 to see if there is a IO fence capability we should export, perhaps via atomic-fences.h
Total @sum(column) 400 original estimate

Embedded Components Task List

Task Priority Assignee Status Effort (hours) Remarks
Add "unsigned int" to videobuf 2-release JON Done 0 Required for kernel task "C11 types for 64-bit APIs"
Status Monitor - introspection APIs 2-release Pending 80 Supply processor usage information for windows
Demo apps 2-release Pending 120
IOI barriers 2-release GK Pending 40 PCR:9878 Probably requires IOI API and config changes. Depends on task: User visible barrier services. May need optimization after initial release
RTEMS: Ensure Deos SMO satisfy RTEMS needs 3-release Pending 80
RTEMS: Ensure IST satisfy RTEMS needs (consider PCR:7929) 3-release Pending 80
Update library APIs for new C11/64-bit types N/A Pending TBD See Deos_Multicore_Verf for individual components
Debugger changes TBD Defer 0 Currently believed to be ok
Network TBD Defer 0
653 Part 1 Supplement 4 TBD Defer 0
Decide what to do with atomic-fence.h TBD Pending 40 Investigate C11 to see if there is a IO fence capability we should export, perhaps via atomic-fences.h. Kernel has UG material for this header but removed the header. At least cffs-server and ioi ring-buffer currently use atomic-fence.h.
Total @sum(column) 424 original estimate

Tooling Task List

Task Priority Assignee Status Effort (hours) Remarks
Get 64-bit GCC compilers installable via cygwin 3-release JON Done 0 Needed to do experiments with kernel 64-bit header changes.
Stop deleting /usr/include in cygwin post install script 3-release JON Pending 16 So 'cygcheck -c' is (more) clean. Current issue is python installs a dev package that brings in many .h files. Counter indication: .h files left in place are confusing.
Integ tool support for major frames longer than the fastest period ?-release GK Pending 120 When is this needed? Harrys does not seem to need this.
hypstart 3-release JON Done 0 Build 32 bit MinGW exes
Registry CVT 9-CertCand GK Pending 120
IOI CVT 9-CertCand GK Pending 120
OpenArbor: Window time accounting information 3-release Pending 40
OpenArbor: General enhancement requests 3-release Pending 120
OpenArbor: Window timeline layout and scheduler reporting 3-release Pending 120
Add support for race condition detection (e.g., thread sanitizer) TBD Defer 0 There is a Linux kernel variant: https://github.com/google/ktsan/wiki Note this requires a 64-bit Linux.
General bucket for kernel requested enhancements TBD Defer 0
TRAC TBD Defer 0 No changes currently identified. Perform some testing
Remove Interrupt Window Functionality (Integ tool) TBD Hold 24 Requirements, analyze any corner cases, etc
Warn users about things like interrogation capability. TBD GK Pending 8 Add AIB warning to ITUG.
Total @sum(column) 720 original estimate

Future Enhancements

  1. wait for thread/process deletion API
    1. Otherwise we risk zombie processes
    2. How to control thread membership in a scheduler
      1. Larry's display window use case would like to have membership be more dynamic than the thread template. Note the scenario was originally for execution groups.
  2. General multi-core support library services (optional)
    1. A general busy wait loop that ensures cache is exclusive to current processor. (Does accessing shared cache still generate bus traffic?).
    2. Scheduler Yield?
  3. Performance monitor data, e.g., cache misses, bus accesses, etc.
    1. Perhaps use for ensuring some sort of budget enforcement.
  4. A standard "adversary" component.
    1. Define an application that can drive bus interference, cache interference, etc.

References

  1. http://lwn.net/Articles/588300/
    • The kernel mailing list discussion is important background, including general MC programming considerations:
    1. Read copy update (RCU) strategy
      1. https://www.kernel.org/doc/Documentation/RCU/rcu.txt
      2. The directory contains several additional docs, but the above is the key.
    2. read write memory reordering
    3. memory barriers
      1. https://www.kernel.org/doc/Documentation/memory-barriers.txt
      2. http://en.cppreference.com/w/c/atomic/memory_order (C++ 11)
    4. compiler optimizations
    5. speculative execution
    6. atomic operations
      1. https://www.kernel.org/doc/Documentation/atomic_ops.txt
      2. http://lwn.net/Articles/588300/
      3. http://lwn.net/Articles/586838/
  2. https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html
    • This article makes a good case that volatile should be removed from the kernel, and probably from all Deos code. (APROCH) Suggest making a PCR and perhaps adding a checklist item for it. AL:'s summary is that with single-core memory model semantics, volatile across a critical section boundary is "good enough", but in a multi-core system it is insufficient to ensure both compiler optimizations and appropriate memory barriers have been established.
  3. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4322.html
    • C++11 memory models, including atomics and pointer dependencies.
  4. Determinism and Fail-Stop Races for Sane Multiprocessing
    • An approach based on transactional memory, presented by Luis Ceze at the Jan 19, 2011 NWCPP meeting.
    • Teaming with a vendor like Corensic to provide debugging and analysis support, like their Jinx debugger for MP programs, would surely strengthen our position.
  5. http://www.cs.rice.edu/~johnmc/comp522/lecture-notes/COMP522-2014-Lecture8-MemoryModels.pdf
    • An approachable set of lecture notes on memory ordering issues.
  6. The Multikernel: A new OS architecture for scalable multicore systems
    • Is message passing better than shared memory?
  7. DMP: Deterministic Shared Memory Multiprocessing
    • Token passing to achieve deterministic parallel program execution. Still requires global memory, and implementation tricks like transactional memory.
  8. Exokernel: An Operating System Architecture for Application-Level Resource Management
    • Exporting HW capabilities rather than abstractions.
    • Interesting concept, but its not clear how we could use the techniques.
  9. Task Synchronization in Reservation-Based Real-Time Systems
    • Introduces the notion of bandwidth (effectively normalized utilization), Bandwidth Isolation (BWI) and formally defines task isolation in the presence of shared resources, then generalizes deadline satisfaction in the presence of soft real time tasks (those without accurate definitions of budget and deadline). The paper assumes PIP rather than PCP (to handle soft real time) and defines how deadlines and budgets can be inherited by tasks attempting to lock shared resources. Since Deos uses PCP, this is not all that useful.
  10. The Multiprocessor BandWidth Inheritance Protocol
    • Like BWI, the key to this paper is support for soft real time.
  11. BBBDISS Scheduling and Locking in Multiprocessor Real-Time Operating Systems
    • Thesis: A “multicore-ready” RTOS should employ scheduling and synchronization algorithms that minimize the loss of processor capacity allocable to real-time tasks. When both overhead-related and algorithmic capacity loss are considered on a current multicore platform, (i) partitioned scheduling is preferable to global and clustered approaches in the HRT case, (ii) P-EDF is superior to P-FP, and (iii) clustered scheduling can be effective in reducing the impact of bin-packing limitations in the SRT case. Further, (iv) multiprocessor locking protocols exist that are both efficiently implementable and asymptotically optimal with regard to the maximum duration of priority inversion.
    • ... phase-fair RWlocks... implementation of RW locks optimized for multiple readers.
    • (pg 119): We require that requests for global resources are not nested. That is, jobs are not allowed to request additional resources while holding a global resource and are further barred from requesting global resources while holding a local resource. Global resource requests can thus not lead to deadlock. This limitation is common to all multiprocessor locking protocols proposed to date. In Chapter 5, we discuss a simple “workaround” applicable to any protocol that allows nested requests by reducing nesting to coarse-grained “group locking.” To the best of our knowledge, predictable, deadlock-free support for nested requests (without resorting to coarse-grained coalescing of resources into groups) is an open problem.
  12. Supporting Nested Locking in Multiprocessor Real-Time Systems (long version)
    • The long version has an appendix that the journal paper does not have.
    • Introduces Real time Nested Locking Protocol (RNLP) which supports bounded time nested locking using a FIFO ordered by timestamp
    • The FIFO logically seems like a real winner, but they don't account for the core-to-core interference caused by maintaining the timestamp ordered queues.
  13. Fine-Grained Multiprocessor Real-Time Locking with Improved Blocking
    • Extends RNLP with "Group locking".
    • Approach is not directly applicable to Deos because the characterization of groups is insufficiently constrained, and consequently would result in substantial core-to-core interference.
    • The "token lock" approach could be used to prevent cores from blocking nested locks multiple times, thus tightening the nested lock blocking bounds.
  14. Multi-core Interference-Sensitive WCET Analysis Leveraging Runtime Resource Capacity Enforcement
    • From Michael Paulitsch following MCFA meeting. Just skimmed. Basic idea is that since WCET in MC is heavily dependent on core-to-core interference, then use performance monitors to count such events and partition that way. Placing an upper bound on shared resource accesses can then be employed by WCET tools to get higher utilization bounds.
  15. [WRL95-9] Memory Consistency Models for Shared-Memory Multiprocessors
    • Somewhat surprising memory ordering possibilities, imply locks for everything involving a write.
  16. MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms
    • Using performance monitors to partition memory bus traffic.
    • In AL:'s opinion, the technique suffers from several flaws.
      1. Not all memory accesses are equal:
        • Memory accesses have memory (history), e.g., serial and non-serial accesses take different times. (Mentioned in the paper.)
        • Accesses that contend have more of an effect than accesses that are non-contending, i.e., some accesses cause problems, others don't.
      2. It is not clear how fine-grained a bounding time interval is practical; the paper doesn't specify, but typical Linux system ticks are 1 or 10 ms.
      3. Any limit would have to differentiate between potentially conflicting and non-conflicting, e.g., be specific to a memory bus.
  17. Proving the Correctness of Nonblocking Data Structures
  18. [MARA_draft]A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
    • Good, though long, overview of PPC and ARM memory models.
    • Has several interesting diagramming techniques and terminology for describing aspects of reordering, including prefetch, etc.
    • References formal PPC memory model
  19. Non-scalable Locks are Dangerous (2012)
    • Highlights the importance of not having multiple cores monitoring (accessing simultaneously) the same cache line, especially for spinlocks.
  20. C/C++11 mappings to processors
    • Mapping various c11/c++11 memory order operations to various CPU architectures; notably ARM, PowerPC/PPC, x86.
  21. Memory consistency and event ordering in scalable shared-memory multiprocessors, Gharachorloo; Lenoski; Laudon; Gibbons; Gupta; Hennessy, Int'l Symp Comp Arch, ISCA(17):15-26, 1990, doi 10.1145/325096.325102.
    • The paper that introduced acquire/release consistency.
    • This paper was referenced in an interesting Stack Overflow discussion pointing out a defect in a (approx) 2016 version of the cppreference.com memory_order definition (subsequently fixed).

Unread Papers

Data synchronization - locking vs lock-free

  1. Performance evaluation of inter-thread communication mechanisms on multicore/multithreaded architectures (2012)
  2. Performance Impact of Lock-Free Algorithms on Multicore Communication APIs (2014)
  3. Nonblocking Algorithms and Scalable Multicore Programming - Exploring some alternatives to lock-based synchronization (2013)
  4. Resource Synchronization in Hierarchically Scheduled Real-Time Systems using Preemptive Critical Sections (2014)
  5. Real-Time Synchronization on Multiprocessors: To Block or Not to Block, to Suspend or Spin? (2008)
  6. A Flexible Real-Time Locking Protocol for Multiprocessors (2007)
  7. Multiprocessor priority ceiling based protocols (1994)
  8. Space-optimal, wait-free real-time synchronization
  9. Efficient synchronization under global edf scheduling on multiprocessors
  10. Resource sharing in global fixed-priority preemptive multiprocessor scheduling
  11. Minimizing memory utilization of real-time task sets in single and multiprocessor systems-on-a-chip
  12. Coordinated task scheduling, allocation and synchronization on multiprocessors
  13. Multiprocessor synchronization and hierarchical scheduling
  14. An investigation of synchronization under multiprocessors hierarchical scheduling
  15. Real-time synchronization protocols for shared memory multiprocessors


Scheduling

  1. Schedulability analysis of global scheduling algorithms on multiprocessor platforms
  2. New Schedulability Tests for Real-Time task sets scheduled by Deadline Monotonic on Multiprocessors
  3. A Framework for Implementing Objects and Scheduling Tasks in Lock-Free Real-Time Systems
  4. Resource sharing in hierarchical fixed priority pre-emptive systems (ppt presentation)
  5. Resource Kernels: A Resource-Centric Approach to Real-Time and Multimedia Systems (1998)

Random notes

What follows are notes that have not (yet) been incorporated into the Deos_Multicore_Design.

Full Application and Tooling Support

multi-core
  Finish scheduler
   - All RMA windows have to have an overhead to "bonus" the final thread that is interrupted at a window boundary.
   - When a window stops, make sure the thread timer is read "early" so that
     window associated overheads are not charged to the thread in the preceding
     window.
   - Desire to continue support for system ticks that are faster than fastest rate.

Verification Strategy

See also Multicore_Verification.

More notes than detailed strategy at this time.

System level

Issues where the kernel and application design must characterize and mitigate core-to-core interference. We should review the CAST-32A paper for completeness, ensuring its concerns are addressed; at this level these would be things like various HW architecture interactions (ref #Performance_Characterization) and SW architecture interactions (ref #Application_Induced_Blocking).

Note that there are potentially very subtle issues like ensuring that simultaneous multi-core access to the same cache line can't cause livelock. Ensuring we have a complete list of the issues may require interacting with the HW vendors.

Kernel Level

When verifying the kernel the remaining issues break down as follows:

  1. What additional tests and/or combinations of tests will be required is TBD. One idea: test low-level synch with detailed MC tests, and use that as justification for limiting true concurrent tests for higher-level operations (e.g., mutexes, etc.).

Design

Note, the history of this page contains a number of alternate proposals and design work for them.

The design treats each core uniformly, with very minimal dependence on master vs slave. Fine grained locking is utilized to provide parallel read/update of unrelated kernel objects. The goal is to introduce blocking only when it is evident from the user source, e.g., two threads accessing the same semaphore. Ref #Application_Induced_Blocking.

All inter-core induced overhead must be accounted for. E.g., no spurious TLB or scheduler IPIs.

Multi-core Deos Design Overview

This project will extend Deos to multi-core processors. It is specifically intended for processors whose cores have the same CPU architecture and share memory.

All software inter-core interference in the kernel is explicitly tied to the kernel application programmer interface (API). This permits the application developer to properly reason about and budget for the overheads. This is a key to enabling guaranteed budget activity on more than one core.

The design assumes a single Deos kernel managing the scheduling of all the CPU cores in a system. Consequently, there is a single registry and tooling will be knowledgeable about the entire system, thus simplifying integration. The system will employ two level scheduling, the first level schedules windows, the second level schedules threads. The entities performing the scheduling are called the window scheduler and thread scheduler respectively. Some terminology:

                Windows, Schedulers, and cores
         -------------------------------------------------------------
         |   Window 1   | Window 2 |   Window 3       |   Window 4   |
         -------------------------------------------------------------
Core 0   |      S1      |   S2     |     S3           |     S2       |
         -------------------------------------------------------------
Core 1   |      S5      |   S6     |     S6           |     S6       |
         -------------------------------------------------------------


Window

A wall time duration from a start to an end time. The start and end times of a window are synchronized across all cores. I.e., a window starts and finishes at the same time on all cores.

AL: note finish-early material from below. Should a window finish early, i.e., on all cores, then the next window may start early. There will be controls to determine whether windows starting early must retain their duration, or get additional time.

AL: Windows never start late. AL: Should note: Minimum window duration is assured. Window can expand due to slack. Window can shift due to interrupts.

Window Activation Table

An ordered list of windows where the end of one window is adjacent to the start of the next window. The duration of the window activation table is fixed and corresponds to a major frame in ARINC 653, or a hyperperiod in previous Deos systems. When the last window in the window activation table finishes, the window scheduler starts scheduling windows again starting with the first window in the window sequence. Windows must span the entire time line, i.e., there can be no "GAPS".
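
For illustration, a minimal sketch of the invariant above (C++ with illustrative names; these are not the kernel's real types): windows must be adjacent, non-empty, and span exactly one major frame, wrapping back to the first window.

  // Sketch only: validates a window activation table against the
  // "no gaps, exact major-frame span" rule described above.
  #include <cstdint>
  #include <vector>

  struct Window {
      uint64_t startTick;       // offset from the start of the table
      uint64_t durationTicks;
  };

  bool isValidActivationTable(const std::vector<Window>& table,
                              uint64_t majorFrameTicks)
  {
      uint64_t expectedStart = 0;
      for (const Window& w : table) {
          if (w.startTick != expectedStart || w.durationTicks == 0)
              return false;                 // gap, overlap, or empty window
          expectedStart += w.durationTicks;
      }
      return expectedStart == majorFrameTicks;  // last window abuts window 1
  }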

AL: must describe interrupt windows, start early, finish early, window slack. AL: what about interrupt windows?


Thread Scheduler Class

An algorithm that determines how threads are scheduled. The initial design will have a Rate Monotonic (RM) scheduler class implementing the legacy Deos scheduling algorithms. ARINC 653 (a653) and POSIX scheduling will be considered during the design, but will not be implemented. Support may be added in a future project. At the present time a #Thread_Scheduler_Instance of any of the scheduler classes is restricted to scheduling threads that are on a single core. SMP scheduling where a #Thread Scheduler Instance can span more than one core will be considered during the design, but will not be implemented. SMP scheduling support may be added in a future project.

At this point we believe the classes of schedulers will share the same basic algorithm, priority preemptive scheduling where rates map to priorities in the RM scheduler. Budgets are ignored in the a653 and POSIX schedulers.

Thread Scheduler Instance

A set of threads that are scheduled by the rules of a #Thread_Scheduler_Class. A thread is in exactly one thread scheduler instance. There can be multiple instances of a scheduler class. E.g., there can be scheduler instances S1 containing threads T1 and T2, and a second instance S2 containing threads T3 and T4.

At system configuration time every thread scheduler instance is declared and assigned to one core in zero or more windows. E.g., S1 is set to run on core 0 in window 3. A scheduler instance can be assigned to at most one core in any window. A scheduler instance can be assigned to any number of windows in the window sequence (including to no window at all). The budget and the rates that a scheduler can support depends on the size and frequency of the windows it is assigned to.

In the proposed design, the budget for an RM scheduler instance is the minimum of the normalized utilization at the highest rate the scheduler supports (e.g., if more time is available in a slower rate, that time will be accumulated as slack). It may be possible to ease this restriction in the future.
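
One possible reading of this rule, as a hedged sketch (all names and types are illustrative assumptions, not kernel code): compute the window time granted to the scheduler in each instance of its fastest period across the major frame, normalize by the period, and take the minimum; time available only at slower rates then surfaces as slack rather than budget.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct Window { uint64_t startTick; uint64_t durationTicks; };

  // Window time granted to this scheduler inside [periodStart, periodEnd).
  static uint64_t timeInPeriod(const std::vector<Window>& assigned,
                               uint64_t periodStart, uint64_t periodEnd)
  {
      uint64_t total = 0;
      for (const Window& w : assigned) {
          uint64_t s = std::max(w.startTick, periodStart);
          uint64_t e = std::min(w.startTick + w.durationTicks, periodEnd);
          if (e > s) total += e - s;
      }
      return total;
  }

  // Budget = minimum normalized utilization over every instance of the
  // fastest period within the major frame.
  double rmSchedulerBudget(const std::vector<Window>& assigned,
                           uint64_t fastestPeriodTicks,
                           uint64_t majorFrameTicks)
  {
      double budget = 1.0;
      for (uint64_t t = 0; t < majorFrameTicks; t += fastestPeriodTicks) {
          double u = double(timeInPeriod(assigned, t, t + fastestPeriodTicks))
                     / double(fastestPeriodTicks);
          budget = std::min(budget, u);
      }
      return budget;
  }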

A thread running in any scheduler will be able to create another thread that runs in any other scheduler, provided that the creating thread's process has sufficient access (permission) and resources (e.g., budget) in the created thread's scheduler.

The following cases require locks in situations not currently covered:

  • (CSWDMT): Computing slack when donor main thread is not active.
  • (CSPMMP): Creating sub-process and mains have mismatched periods.

Thread Templates

In addition to the current template fields, an additional field specifies the scheduler for the created thread. The integration tool maps window and scheduler names to a scheduler instance handle used in the kernel registry.

A new API will exist that permits the user to specify the contents of the thread template at runtime. Users can fetch a template from the registry and modify the template before creating a thread with the template. In other words the registry can supply thread template templates.

Future projects for POSIX and multi-core ARINC 653 will need thread template properties to be specified at runtime: attr objects in POSIX, and a combination of the CREATE API for stack size plus additional APIs for core affinity for a653.

Slack

There are now two kinds of slack; window slack, and scheduler slack.

A Window finishing early generates window timeline slack. A subsequent window may optionally be permitted to use this slack, i.e., have its duration extended. In legacy Deos, this would be the equivalent of somehow having a system tick get longer.

A window permitted to expand its duration gets the window timeline slack available at the time the window is activated. A scheduler in such a window thus sees the window timeline slack as additional scheduler timeline slack.

At the end of a window, unused window timeline slack, and scheduler timeline slack surrendered when the window finishes early, become window timeline slack. If a window does not finish early, then slack generated by a scheduler can only be used by the scheduler instance where it was created. I.e., slack generated or consumed will not directly affect any other RM scheduler instance. This provides better control of where slack is used, but it causes unused slack to be wasted.

Interrupts

AL: needs lots of work.

The hardware could raise an interrupt during any window. Each window lists the interrupts that can be serviced during the window. Deos will ensure that only specified interrupts will be serviced during the window. Time to service interrupts will not be charged to every window in which the interrupt could be serviced. There will be some restrictions, e.g., a window where an interrupt can be serviced will need to be "can finish late" and the following window "can start late".

AL: So far we've talked about the above as "interrupt windows". I believe the reason is so that we get budget sharing across windows, and also so that there is the possibility to limit the threads running on the "other" cores (get core-to-core isolation) while the ISR runs. Is the core-to-core isolation required? If it is not required, then we could potentially use some other (simpler?) means than "interrupt windows". E.g., using the "FACE case" where each window has a list of schedulers, one of which could be a scheduler containing ISR threads, another an a653 or RM scheduler. In that situation it would be difficult to restrict the threads running on the "other" core. The thread schedulers would also somehow have to be able to ensure they caused the window to finish on time (i.e., ISR scheduler not multiply accounted). Hmm, on second thought this might not be simpler.

We may be able to limit conceptually to a single interrupt window, which contains all ISR threads if this simplifies the design. However, this may still be multiple interrupt windows in order to deal with not crossing fixed window boundaries (can't slide left).

Terminology

  • Current Thread or Current Scheduler
    Always from the perspective of a core, the current thread is the thread that is running on the (local) core. For example, thread A is current on core 0.
  • Running thread or Running scheduler
    A, or the set of, threads/schedulers that are current. For example, thread A is one of the running threads.

Application Induced Blocking

New design principle: all overheads must be tied to user operations or system events. (This principle is actually old, but previously unrecorded.) The implication in a multi-core system is that many operations that were easily accounted for in a unicore system, e.g., accessing memory or entering a critical section, become problematic. For example, global critical sections are not acceptable because core-to-core interference cannot be properly accounted for.

AL: machine state, locks, core critical sections: all have the potential to affect another core. Need to describe a strategy for all.

The following do not cause core to core interference:

  1. Executing a cached instruction from a virtual page that is in the CPU's local TLB, where the execution of the instruction itself does not cause interference. Non-interfering instructions include:
    1. Register to register operations that do not reference memory, e.g., register moves and arithmetic and logical operations. Note that floating point operations may cause an FPU context swap.
    2. Branch operations to in-TLB, cached instructions.
    3. Memory references to addresses that are already in cache in a consistent (e.g., MESI) state and also in the TLB.
  2. AL: incomplete list, although I'm not sure we have to enumerate all the cases.
  • A corollary is that one core cannot interrupt another core unless the interrupting core can assure that the overhead on the interrupted core can be properly accounted for. In general this is very hard.

A single large lock is very analyzable, but core-to-core interference violates kernel assurance of access to CPU for threads.

Small locks increase intellectual overhead and introduce the potential of live/dead lock.

A new class of interference: application induced blocking (AIB). E.g., two threads manipulating the same KIO may experience application induced interference/blocking.

  • Non AIB blocking (hidden interference) is not acceptable.
  • AIB cannot affect other operations. E.g., manipulating the same event on two cores cannot cause a longer critical section or otherwise interfere with system operation, including timing.
  • AIB is acceptable, but only if the impact is only to threads involved in the AIB, and only when the AIB can be prevented if desired. E.g., threads interacting with a scheduler resource crossing cores are implicitly stating their tolerance to AIB.

When there is a conflict such that accessing an object induces core-to-core blocking the design idea is that we separate the object into a "private" (object or core) specific part, and a "public" part.

  • This affects: Scheduler, thread
  • Parts of some objects may be covered by locks in others. E.g., a list node in a thread may be "owned" by a scheduler or a scheduler resource, not the thread. I.e., conceptually the list node is not part of the thread; it's only there as a "design necessity".

Level 3 issue: Get memory from pool (as bounding interference pattern) vs. moving memory from system to process pool.

  • Resolved, but need to update DDD to note pool as source of interference.

Should add a note on IPC describing issues associated with waiting/interacting with objects in another scheduler.

RLR and AL 6/22/2017 telecon notes on characterizing interference

consistency, coherency are different

inputs to interference calculus

   - permissions (security sort)
   - core
   - locks (locks are used to determine acceptable interference
     patterns)
   - access requested (read, write) affects exclusivity (MESI level)
   - mutability of the data (read only, writable)
   - temporal nature of interference (e.g., different cores accessing
     at different times)
   - writer of shared data does something (e.g., dcbf, dcbst) to
     reduce interference.

effects:

   - transferring a cache line
   - fighting over a cache line
     A writer and one or more concurrent readers or writers

Given two data items A and B, when can they be on the same cache line?

  - Both const.
  - Both core specific, covered by same lock, etc.
  - Something about temporal nature.  Temporal can be achieved by the
    following and all require coordination, e.g., some sort of fences:
    - lock.
    - COC/COD (creation/deletion)
    - Window synchronization 
    - single threaded ????

Is single threaded the same as "temporal nature"?
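
As a concrete illustration of these rules, a hedged C++11 sketch (the 64-byte line size and all names are assumptions; real code would take the line size from the target's configuration): per-core mutable items each get their own cache line so cores never transfer or fight over a shared line, while both-const items may share a line freely.

  #include <atomic>
  #include <cstddef>
  #include <cstdint>

  constexpr std::size_t kCacheLineSize = 64;   // assumed; target dependent

  // Per-core mutable state: one line per core, so no line transfer
  // or fighting between cores.
  struct alignas(kCacheLineSize) PerCoreCounter {
      std::atomic<std::uint64_t> value{0};
      char pad[kCacheLineSize - sizeof(std::atomic<std::uint64_t>)];
  };

  PerCoreCounter perCore[4];   // hypothetical 4-core configuration

  // Both-const data may share a line: it stays Shared (MESI) in every
  // core's cache and generates no coherency traffic.
  struct ConstTable {
      std::uint32_t a;
      std::uint32_t b;
  };
  const ConstTable table = {1, 2};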

System States

AL: this needs much work.

System states are:

  • Off
  • Starting Up
  • Multi-Threaded
  • BIT
  • Shutting Down

Some object state is "constant" only within a particular system state. E.g., a pointer to a KIO pool is constant during multi-threaded and shutting down, but not otherwise. Curiously, Scheduler::currentThread() is also const for *nearly* all invocations. However, that may not be something we can exploit.

Access to CPU state:

  • constant(states): Object state is unchanging during the listed system states (or is it that it can't be simultaneously changed by any other thread? That would make handling startup/shutdown easier). The object state can be accessed without any locking.
  • Thread local: Only the current thread can access the state.
  • Core local: Only a thread operating on the core can manipulate the state.
  • Object lock: The object can only be accessed with the appropriate lock in place.
  • Atomic word-at-a-time: Caller must address all interference patterns. This would be done only in the most extreme cases. Candidates: VAS.
Currently accessConst occurs in Envelope, Thread, BudgetUpdateVector, 
InterruptEvent, Mutex, Process, Scheduler, StartOfPeriodEvent, and Event

Scheduler, StartOfPeriodEvent, InterruptEvent, and BUVecs belonging to a 
scheduler are initialized on the master core and slave cores wait for the 
startup sequencer.
BUVecs belonging to a Process are covered by the justification for Process.

This leaves Envelope, Thread, Mutex, and Process, which are all KIOs.
For KIOs the final step after creating an object is to bind it.
After being bound, the object is primarily accessed through its handle.
This may be through securing the object or userModeObjectAddress.
Both result in calling getObjectAddress, which reads the kio->handle.

Proposal:
Add a release fence to KernelInterfaceObject::bind via _handle
Add an acquire fence to getObjectAddress via candidateObjectPtr->handle()

Alternate:
Add a release fence to KernelInterfaceObject::bind via _handle
Add a verifyHandle API which can be used to ensure a handle is valid and 
if so perform the acquire fence via the kio->handle().
Add warnings to UG that handle is not valid on another core until the next 
window activation or that core has called verifyHandle for that handle.

The first proposal adds more acquire fences than are needed since we'll 
continue to secure objects, but seems simpler. TODO: Can we reduce once
secureCount has a per core value? Can we check if core has ever secured?
The second allows the user to limit the extra fence to only when needed. 
However, what is the behavior if they do not use the verifyHandle and a 
heavy barrier has not occurred? Can this lead to kernel vs user mode 
errors where the kernel would have to verify the handle is valid for a 
core before use which may be worse than just unconditionally having the 
fence?
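
A minimal sketch of the first proposal, using hypothetical stand-ins for KernelInterfaceObject::bind and getObjectAddress (the real signatures differ). The point is only the fence placement: release before the handle is published, acquire after the handle is read, so a consumer that observes a valid handle also observes the fully constructed object.

  #include <atomic>
  #include <cstdint>

  struct Kio {
      std::uint32_t state = 0;               // stands in for the object body
      std::atomic<std::uint32_t> handle{0};  // 0 == not yet bound
  };

  void bind(Kio& kio, std::uint32_t h)   // cf. KernelInterfaceObject::bind
  {
      kio.state = 42;                                  // construct the object
      std::atomic_thread_fence(std::memory_order_release);
      kio.handle.store(h, std::memory_order_relaxed);  // publish via _handle
  }

  Kio* getObjectAddress(Kio& kio, std::uint32_t expected)
  {
      if (kio.handle.load(std::memory_order_relaxed) != expected)
          return nullptr;                              // handle not (yet) valid
      std::atomic_thread_fence(std::memory_order_acquire);
      return &kio;     // reads of the body are now ordered after the check
  }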

Question:
How do we ensure this analysis stays accurate? Do we need a checklist item
for accessConst requiring that each occurrence justify multi-core consistency,
with two standard justifications (scheduler startup and KIOs) that can be
referenced?


VAS Note: By using a "process VAS lock" we could keep the precondition on the VAS operations that says the caller is responsible for ensuring stability, and it would even be a stronger sense of stability (e.g., entire VAS vs. page table stability). This might simplify VAS code. It would introduce some undesirable coupling between threads, e.g., two threads simultaneously calling some combination of VAS manipulation APIs, such as mapViewOfKernelFile(), sendEnvelope(), attachPlatformResource(), etc. In cases where the interference would be non-local (e.g., sendEnvelope()), we might have to deploy different locking strategies for different parts of the VAS, or alternately change where operations are performed. E.g., sendEnvelope() currently manipulates both the sender's and recipient's VASes; we could instead unmap in the sender and queue the page to the recipient, who would map it. Since swapPage() is the longest non-context-switch critical, this might actually lower the worst case non-ctx critical. The "two threads calling mapViewOfKernelFile()" scenario has different blocking characteristics than current behavior, so that would have to be noted in the API. In general, what interference patterns are induced will have to become an API section. And, of course, there is still the HW interference on multi-word writes that needs to be resolved (w.r.t. TLB handlers/shootdown).

When a thread locks the private part of its scheduler:

  • For AMP, the lock is guaranteed to be immediately available, i.e., no blocking.
  • For SMP, the time to acquire will be O(# cores spanned by the SMP scheduler). Since there can only be one thread at a time with the scheduler locked, there can be no added live/deadlock cases.

Observation (not sure if it is critical or not to the analysis): Only a current thread can lock/unlock an object. I'm thinking this is like the observation that "if a thread is current, then all other threads are in kernel mode" from the legacy kernel.

Lock acquisition should be optimized for the success case to minimize the effects of the multi-core implementation on single-core operation. E.g., the "test if object locked, if not lock it" should probably have an atomic load/store "first pass", then "queue for lock" only if that fails. When the kernel is operating on a single core, the run-time overhead would then be minimal, although structural coverage analysis would potentially still be a bear.

AL: The solution proposed in the preceding paragraph would likely introduce unbounded blocking because an opportunistic strategy would interfere with fairness if there were pending locks.

Lock acquisition is still going to require a FIFO queue, at least in cases where lock inter-arrival rates can't be assured.
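
For illustration, a minimal ticket-lock sketch (not the kernel's implementation) of the FIFO property: arrival order is captured by a ticket counter, so a core's blocking is bounded by the number of cores ahead of it rather than being unfair or unbounded.

  #include <atomic>
  #include <cstdint>

  class TicketLock {
      std::atomic<std::uint32_t> next{0};    // ticket dispenser
      std::atomic<std::uint32_t> owner{0};   // ticket currently served
  public:
      void lock() {
          std::uint32_t my = next.fetch_add(1, std::memory_order_relaxed);
          while (owner.load(std::memory_order_acquire) != my) {
              // spin: holders release in strict FIFO (ticket) order
          }
      }
      void unlock() {
          owner.store(owner.load(std::memory_order_relaxed) + 1,
                      std::memory_order_release);
      }
  };

Note that all waiters spin on the same owner word, the very pattern reference 19 (Non-scalable Locks are Dangerous) warns about; an MCS-style queue lock gives each waiter its own cache line at the cost of extra bookkeeping.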

The current thread should always have priority for locking the public part of a thread/scheduler.

  • This ensures that acquiring the lock is O(1) for the current thread, which is important because the current thread may not be the agent involved in the application induced blocking. E.g., processes A and B might be interacting on cores 0 and 1, but a thread from process C might be current on core 0 when the core 1 thread readies a thread on core 0. We don't want overhead forced on C's threads.
  • The implication here is that it is acceptable for the current thread to only look once at the public thread/scheduler interface. The schedulers for the other cores have to deal with the case that the target scheduler's current thread was executing in user mode when they wrote to the public interface, so the local thread processing only part of the requests for its public interface *has* to be ok. The foreign scheduler will have to send an IPI anyway, so things will "just work". Using the same principle, it might be possible for the local thread to ignore the public interface if it is locked without queuing. This would make worst case timing estimates easier since there would be no blocking, and hence no need to measure the critical section.

The worst case observed time for a lock, since it is tied to a category, can be readily measured and worst case values kept. We'd also only have #categories worth of values. Still not great, but better than "# of KIOs".

Level 3 Issue: State that is unchanging while a thread is executing, but that might change while the thread is suspended (e.g., the thread->Scheduler()) is effectively const for the thread. We need a name for such state.

  • We think this issue is not applicable for BMP (must review above text to be sure).

Level 3 Issue: Reading "atomic" values, e.g., KIO handles. Why is it legal to read the handle of an object without the object being locked? Need to get terminology/state description for this case. Especially for waitShortDuration in waitForResource() for the budget transfer recipient pointer. The code seems right, but seems "obviously wrong" w.r.t. state access without a lock.

  • We think this issue is not applicable with our new locking terminology.

When is it allowable to access something?

Proposed precondition/invariant terminology:
  Accessibility:
  Access:

      Locks(locks required)
      const after <lifetime>
        lifetime can be
	  startup something, perhaps first process dispatch, etc.
	  c++, COC

      monotonic, e.g., "process being deleted" transitions from 0 to
      1, but stays 1 until process is actually deleted.
        - similar for access rights.

      "Somehow assure no other references" is always implied.  Yuck.

      single_writer?  Suggest this only be used for infrastructure.

Assertion checker capability????
  Proposal 1:
    - Add "access" parameter, and have caller specify the means they are
      using to assure the access.
    - Introduce "dummy" ATOMIC_READ_32 "lock", STARTUP_LOCK, ...

       kio_lock = lock(...)
       x = foo(params ACCESS_JUSTIFICATION(kio_lock));

       int foo(params LOCK_REQUIRED_PARAM)


Should we use GCC's function names for atomic add, subtract, compare and
swap, etc.?
  - Are these exported by the kernel?
  - If we do, we should note that these are unbounded time complexity.
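
For reference, the GCC names in question: the legacy __sync_* builtins (full-barrier semantics) and the newer __atomic_* builtins, which take an explicit memory-order argument. Whether to export them, given the unbounded-time caveat above, remains the open question.

  #include <cstdint>

  std::uint32_t counter;

  void examples()
  {
      // Legacy __sync_* forms, full-barrier semantics:
      __sync_fetch_and_add(&counter, 1u);
      __sync_fetch_and_sub(&counter, 1u);
      std::uint32_t seen = __sync_val_compare_and_swap(&counter, 0u, 5u);

      // Newer __atomic_* forms with explicit ordering. On LL/SC targets
      // (PPC, ARM) these compile to retry loops, hence the unbounded
      // time-complexity caveat above.
      __atomic_fetch_add(&counter, 1u, __ATOMIC_ACQ_REL);
      std::uint32_t expected = 0;
      __atomic_compare_exchange_n(&counter, &expected, 5u, /*weak=*/false,
                                  __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
      (void)seen;
  }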

Verify that handle() lifetime is C++ object, not COC object.

Alternate means for mutual exclusion

It is possible to write to the private portion of a "not current" thread if the thread is in the same scheduler as the current thread, (implied mutual exclusion). Is this sort of reasoning allowed?

Currentperiod ID isn't really const, but it is const for the duration of (most of) a window. So why is it ok to read without a lock?


Locks

Must hold klocks only within a core critical to bound blocking due to klocks().

  • Note: the secure protocol allows a secure to span a core critical when access to the object has been granted. Having an object secured is different from having a klock for the object. Also, secure currently doesn't require privileges to access the object; that would have to change.

lock_klock() will enterCritical() and acquire the lock, returning with both the lock and a critical. Note that sometimes a critical may already be present, which means more analysis for that case.

unklock() would release the lock and exitCritical().
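
A minimal sketch of this pairing, assuming stand-in primitives (the enterCritical()/exitCritical() bodies and the lock type are placeholders, not kernel code). The invariant illustrated is that a klock is only ever held inside a core critical.

  #include <atomic>

  void enterCritical() { /* stand-in: would disable preemption on this core */ }
  void exitCritical()  { /* stand-in: would re-enable preemption */ }

  struct Klock { std::atomic_flag held = ATOMIC_FLAG_INIT; };

  void lock_klock(Klock& k)
  {
      enterCritical();     // critical first, then the lock ...
      while (k.held.test_and_set(std::memory_order_acquire)) {
          // spin: every holder is inside a critical, so it cannot be
          // preempted mid-hold and blocking stays bounded
      }
  }

  void unklock(Klock& k)
  {
      k.held.clear(std::memory_order_release);   // ... lock released first,
      exitCritical();                            // then the critical
  }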

(VCCLIIIA): If queue lock enqueue operations use atomic operations, what is the means to ensure the operation is time bounded?

  • Similar issue for single atomic word update operations such as secure()/release()
  • Look at reference 7 to see if they have a solution.
  • Unsure what task this would be associated with.

Lock Categories

To simplify analysis we aggregate objects into a few categories, place restrictions on the categories, and then analyze only the interactions among the categories:

  • Thread Private. An implicit lock, only available to the #current_thread.
  • Scheduler Private. An implicit lock when in a critical.
    • Only lockable by *A* current thread in that scheduler.
    • Some sort of lock migration probably occurs at/in switchTo().
  • KIO Public. Includes Thread public, and all other KIOs.
    • Thread is the only KIO expected to have a private part.
    • Schedulers will have a user visible existence, but at this time are not expected to formally be KIOs. I.e., they will be referred to using a registry defined "index", not a handle.
  • System. Things like, system (i.e., unallocated) memory and object pools, etc. I.e., things that are acquired, accessed quickly, and released. The goal is for System locks to only affect process creation/deletion. Things like history buffers and portals will be made either thread or core specific.
  • Scheduler Public And Memory Pool. These two locks share the same lock level.

Probably want a better name than "public/private".

We define an order for locking the above. A core may have at most one object at a time locked from each class. The objects must be locked in the order listed above. If two objects at a level must be locked, then before the second object can be locked all locks must be completely released by the calling thread and its critical section exited. By maintaining the locks in a core based structure, we know the current thread on the core is responsible for the lock, but it allows for implicit migration at context switch for the scheduler private lock.

Restricting to locking only one at a time from each category and a specified order will prevent live/deadlock.

Historical commentary: BBBDISS page 119 indicates that nested locking in multi-core systems "has not been considered at all". Subsequent papers by the UNC staff, specifically the introduction of RNLP, provide a basis for analysis of nested locking protocols, but the resulting algorithms would force non-AIB; specifically, their "token server" requires maintaining total orders. The issue they are resolving is ensuring that a core does not get more than one chance to block other cores at a resource. We need to investigate alternate means. One thing to consider is that, at least for determining worst case window-to-window bounds, an IWI/ISI would inhibit multiple lock attempts. The thread scheduler level is still open.

The choice to split objects (fine grained locking) versus aggregating them is a design flexibility. The principle is that we should only go finer if absolutely necessary. The more deeply nested the lock, the shorter the lock time must be. The crit time kernel will record maximum lock duration at each level.

Locking and unlocking must ensure memory coherency. Non-"const" accesses performed without a lock held may not be correctly ordered on weakly ordered memory. Runtime verification of appropriate lock usage is being investigated.

  • Lock verification is now implemented in prototype

At this time we don't think we need separate read and write locks, only one lock for both read and write.

A category level may be skipped by recording a dummy lock at that level. The dummy lock is always available and has no latency to acquire. Subsequent items locked cannot depend on automatically knowing the higher categories are actually locked, and they cannot attempt to lock a missing higher level. When dummy locks may span a function, we would need more preconditions on what locks are required to enter a function: all required levels must be specified explicitly, rather than a precondition of holding a lock at level X implying the higher levels are locked. A sketch follows.
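
Sketch of per-core, per-category lock tracking under the rules above; the category names come from this section, while KLock, kernelAssert(), and the tracking structure are assumptions:

 enum LockCategory { THREAD_PRIVATE, SCHEDULER_PRIVATE, KIO_PUBLIC,
                     SYSTEM, SCHED_PUBLIC_AND_MEM_POOL, NUM_CATEGORIES };

 // One slot per category per core.  Keeping the state per core (not per
 // thread) allows the scheduler-private slot to migrate implicitly at
 // switchTo().
 struct CoreLockState { KLock *held[NUM_CATEGORIES]; };

 void lockAtLevel(CoreLockState &cs, LockCategory cat, KLock *k)
 {
     // Order check: nothing may already be held at this level or any
     // level later in the order (those must be locked after this one).
     for (int c = cat; c < NUM_CATEGORIES; ++c)
         kernelAssert(cs.held[c] == nullptr);
     cs.held[cat] = k;            // k may be the always-available dummy
     if (!k->isDummy())           // dummy locks skip a level with zero
         lock_klock(*k);          // acquisition latency
 }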

Observation: Some data members are not protected by a lock on their owning class. Instead it may make sense for it to be owned by another class that is involved with the data flow, or a "super-parent" class to control the number/order of locks.

  • All ThreadListNodes that are "on a list" are owned by the object that owns the list. E.g., if it's on a scheduler list, then the scheduler owns the node. If the node is not on a list, then it is owned by the thread (covered by the thread private lock).
  • Implication is that a node can only be owned by one sort of object. E.g., it can't spontaneously transition from being owned by a scheduler to a resource.
    • Issues:
      1. forceQuietComplete (warmStart/frameResync) calls removeFromAllLists without schedulerPrivate lock. That will need to be part of that design.
      2. threadPrivate cannot be used for ownership when not on a list, since not current thread when wanting to put on a public scheduler list.
      3. Is there a way to automate threadListNode owner verification?
    • The following scenarios currently involve adding a thread to a public list. This is a problem because the threadPrivate lock is not held when the node is not on a list.
      1. abortAnyPendingInterruptWait - uses resourceWaitNode to ready thread waiting for an interrupt. Therefore, thread cannot be in a resource list. propagatePlatformInterrupt will use the resourceWaitNode to ready isrThread in scheduler private runnable list. Need to look for race conditions between those.
      2. IWI - Adds thread to public runnable for isr's scheduler. This is from single threaded portion of window scheduler, but would need to analyze window critical vs scheduler critical for any race conditions with schedule state.
      3. signalSemaphore/runAllWaiters - code just took the node off of list and waitForResource handles resourceNode or timeout/SOP occurring "simultaneously", so should not be an issue.
      4. SOPEvent::makeThreadWait will add to public isr threads or waiting threads if not from current scheduler. Uses of public appear to be from waitForNextTriggeringEvent called from startThread and forceQuietComplete. forceQuietComplete needs new warmstart/frameSync design. When thread is being started there should not be race condition with its sop node.
  • Some fields of a thread may be controlled by a lock on the owning process in order to simplify threadPublic vs. process locking. This abstraction should be minimized, since it can trade interference at the thread public level between interacting threads for increased access time on the process, but it does maintain AIB.

Invariant, or observation, or something: When the scheduler switches to a new thread, the thread must have all its list nodes removed from the scheduler lists before the scheduler lock is released (because the lock is what serializes writes to the nodes).

Scheduler

AL: merging is out of date. The scheduler public interface is currently limited to adding a thread to a public version of the slack requester or runnable lists. The lists are merged to the private list on every scheduler activation. This implies that there may be priority inversion between the time that another core readies a thread and the next window activation which occurs for that scheduler.

Resource Wait

  • Other design options were removed from the wiki on 10-Feb-2015.

Dealing with resource waits. There are two cases: events and everything else, i.e., multiple threads readied at one time vs. a single thread readied, respectively.

Single Thread Released
  • Long duration wait on fixed budget.
    Works as "mainline" now, namely at wait, the thread is added to scheduler's SOP list. There is no need to send an IPI, the released thread's scheduler will ready it at the SOP.
  • Short duration wait:
    Thread is removed from the resource and added to the thread->scheduler public runnable list. The key is that you never have to send an IPI in this case; even the mainline scheduler doesn't need to reschedule, since the released thread can't be higher priority than the currently running thread on that scheduler (i.e., mainline could be optimized).
  • Long duration wait on slack.
    Like short duration, except that if the released thread's scheduler is current, scheduling must be done; if the thread's scheduler is running but non-current, send an IPI. If the released thread's scheduler is non-running, then the thread will be readied by SOW/SOP. (The three cases are sketched below.)
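
A sketch of the three cases, using hypothetical field and helper names (waitKind, publicRunnable, sendIPI()), not the actual kernel API:

 void releaseSingleWaiter(Thread *t)
 {
     switch (t->waitKind) {
     case LONG_DURATION_FIXED:
         // Nothing to do: the thread is already on its scheduler's SOP
         // list and will be readied at the SOP.  No IPI needed.
         break;
     case SHORT_DURATION:
         // No IPI and no reschedule: the released thread cannot be
         // higher priority than what its scheduler is running.
         t->scheduler()->publicRunnable.add(t->resourceWaitNode);
         break;
     case LONG_DURATION_SLACK:
         t->scheduler()->publicRunnable.add(t->resourceWaitNode);
         if (t->scheduler() == currentScheduler())
             reschedule();                       // must do scheduling
         else if (t->scheduler()->isRunning())
             sendIPI(t->scheduler()->currentCore());
         // else: a non-running scheduler readies it at SOW/SOP.
         break;
     }
 }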

Note that for this case the resource wait list node is "owned" by the KIO having the list that the node is on. Possible lists it can be on:

  • semaphore wait list.
  • scheduler public
  • ready list


Multiple Threads Released

  • atypical scenarios - deleteSemaphore and clearSemaphore are handled by multiple critical sections (i.e., unbounded wall time, but each critical is O(1))
  • typical scenarios - Events
Events Design 2016-01-12

Basic problem is that when an operation releases an unbounded number of threads, it is impossible to visit an unbounded number of schedulers in bounded time. So it is not possible to ready all the threads that have been released while not readying those that haven't. The options are limited:

  1. Limit the number of schedulers that can have threads waiting, or less likely the total number waiting threads.
    • Note that limiting the number of schedulers whose threads can pulse the event doesn't help
  2. Break the "pulse" into several critical sections.
    • This introduces a new event object state "doing a pulse" and the corresponding failure condition "event busy" for pulse and wait.
      • It may be possible to eliminate the new state by keeping track of the last pulse for each thread currently waiting; assuring no wrap-around of the pulse count would be theoretically hard, but practically a non-issue.
  3. Be less precise about which threads are readied.
    • Precisely ready some threads (or schedulers), imprecise for the rest.
      • Precise means put threads on some list and notify scheduler.
      • Imprecise means use some broadcast communication means, e.g., a global counter, or just treat all imprecise as SOP or SOW waiters.

Since it is impossible to precisely notify all schedulers, some arbitrary limit must be set. Zero, one, number of cores, or some registry limit are the most reasonable choices. Zero is out because that wouldn't support legacy semantics. A registry limit means that changes to the limit would change ctx+delta, and would undoubtedly involve non-trivial tooling and possibly runtime checks.

Proposal:

  • Event - add a precise scheduler - the scheduler of the thread which creates the event
  • Event - add a field for each core - initially a flag indicating a thread is waiting for the event on that core
  • System - add a field for each core - indicates that an event has been pulsed for which there was a thread waiting on that core

Event waits are handled by the 3 cases:

  • Long duration wait on fixed budget.
    Same as for the single object released case. I.e., the current event code doesn't add the thread to the wait list even in the mainline kernel.
  • Short duration wait.
    • From precise scheduler.
      Like mainline - add the thread to the event's waiting threads list and to the scheduler's short duration timeout list.
    • From any other scheduler
      Let threads poll. This means blocked threads will potentially run before unblocked threads.
      We could give user more control, e.g., by adding new APIs:
      1. Yield short duration:
      2. Yield until next window activation
  • Long duration wait on slack.
    wait puts the thread on the current scheduler's "resource wait on slack waiter list",
    and sets the event's flag for the current core, indicating a thread is waiting on this core.
    The thread will eventually be scheduled when any event is pulsed, and will poll to see if its event is available.

Pulse event

  1. If current scheduler is precise scheduler, merge the event's waiting threads list to the current scheduler's runnable list.
  2. for all cores
    1. set system field for the core if the event has a thread waiting on that core
    2. clear the event's field for the core
  3. Allow for scheduling on the current core
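
A sketch of the pulse sequence above, with hypothetical names for the proposed fields:

 void pulseEvent(Event *e)
 {
     // 1. Precisely ready the precise scheduler's waiters.
     if (currentScheduler() == e->preciseScheduler)
         currentScheduler()->runnableList.merge(e->waitingThreads);

     // 2. Imprecise broadcast to every core via the system fields.
     for (unsigned core = 0; core < CPU::numberOfCores(); ++core) {
         if (e->threadWaitingOnCore[core])
             systemEventPulsed[core] = true;   // needs fences, see the
                                               // barrier note below
         e->threadWaitingOnCore[core] = false;
     }

     // 3. Allow for scheduling on the current core only.
     allowScheduling();
 }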

Context Switch

  • During a context switch for which slack is available, if the system field for that core is set
    1. Reset the system's field for the core to 0.
    2. Merge the scheduler's "resource wait on slack waiter list" to the slack requester list

Ensure appropriate barriers/fences are used for system field accesses, since they are done without a lock.

Scheduler Activation (i.e., SOW)

  • At each SOW every scheduler merges its resourceSlackWaiting lists to its local slack waiting list. An optimization would be to only merge based on a "global pulse count".
    Required since the field is tracked per core, not per scheduler: this handles parallel schedulers, but not the additional schedulers which may run in other windows on the core.


Thus one scheduler will be precise for short duration waits, and threads in the other schedulers will poll when they time out. For long duration on slack, no schedulers are precise, but after a scheduler has done a context switch which may ready a slack thread, all threads which have waited for any event on that core are merged to the slack requester list and are available for slack, in order to poll whether the event they are waiting on was pulsed.

Introduce restriction on mutable TLB scope

This section may be of interest for DDD updates, but is largely obsolete.

This isn't quite right, but it gives hope that a well time bounded answer is possible without a two level critical.

Invariant: For a given core, only translations for the current PDT can be in the TLB.

This implies that all TLBs are flushed on every context switch. Might be able to make an exception for "global" pages.

Observation: TLB shootdown time for a core N will not be well bounded if N is executing in a critical section, but will be well bounded if it isn't.

Proceed by cases:

First, we restrict the kernel, while in a critical section, from referencing anything that could be affected by a TLB shootdown. Hopefully this only means accesses to user or MOM space. Portals never need TLB shootdown because portals are thread local. This may require kernel accesses to user/MOM space to be protected by a VAS lock.

Second, we ensure that every time a core enters a critical, a counter is incremented and a flag is set to indicate the core is in a critical. To do a shootdown, we raise an IPI to the other core. Once the IPI is delivered, we know that the other core will not exit a critical without executing the TLB shootdown IPI handler. (Not quite correct: it will exit the critical, but the IPI will be pending and it will re-enter without getting to a place where the TLB would matter. Obviously this needs more clarity.) I'm not sure how we can detect the delivery of the IPI to the other core. I'm hoping a sync will work.

The general idea is that if the remote core is in a critical, we queue a TLB shootdown and depend on the TLB IPI to clear later (fire and forget). If the core is not in a critical, we wait for the TLB shootdown to happen, or for the core to enter a critical. Since the core is not in a critical, this case should be well time bounded. At some point we say that if more than N shootdowns are queued, the other core is just going to flush all TLBs.

 TLBShootdown(Core n, PDT p, VirtualAddressTYP addr)
   If core[n].pdt != p then
     there is no need to perform shootdown for n.
   else
     raise_ipi(n)
     spin until core[n].is_in_critical ||
                core[n] indicates addr has been shot down;

Tracking shoot down status could be handled by having a queue for every core (or perhaps between every pair of cores). When a shootdown is requested, a slot in the queue is added and a "has been shot down" flag set to false. When the TLB shootdown happens, the flag is set to true. Or perhaps there is a TLB shootdown count and the sender notes the count.

Essentially TLBShootdown() establishes a barrier for: "the TLB has been shot down, or will be shot down, before it can be observed on that core".
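
A sketch of the count-based variant; coreTLB[], queueShootdown(), and the wait loop are assumptions consistent with the pseudo code above:

 struct CoreTLBState {
     volatile uint32_t shootdownCount;  // bumped by the IPI handler
     volatile bool     is_in_critical;
 };

 void tlbShootdownAndWait(int n, VirtualAddressTYP addr)
 {
     uint32_t before = coreTLB[n].shootdownCount;
     queueShootdown(n, addr);           // per-core (or per-pair) queue
     raise_ipi(n);
     // If the core is in a critical, fire and forget: the pending IPI
     // will do the flush before the TLB can matter.  Otherwise this
     // spin is well time bounded.
     while (!coreTLB[n].is_in_critical &&
            coreTLB[n].shootdownCount == before)
         ;
 }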

I think the crit serial number is not required.

some sort of queuing

Can we queue freed pages and wait for all references to go away before adding page to "free list"? If so, what if we don't have extra free pages?

TLB invalidate by waiting

Wait to free the page until it can be assured that TLBs on other cores no longer refer to it.

Proposal 1
  1. Ensure that on each context switch (or at least every window start) flush all TLB entries (at least all non-svas based TLB entries).
  2. Add a VAS spin lock. Perhaps a spin lock that allows multiple readers and one writer (I created one of these for FLASH once), or maybe only writers need the spin lock and readers just deal with it?
  3. VAS operations that free pages when multiple threads are present (i.e. delete process would not have to do this) work as follows:
    1. Get spin lock
    2. Go through each page to be deleted and if present mark it as no user access
    3. wait a period
    4. Go through each page to be deleted and if present delete it
    5. release spin lock
  4. VAS operations that allocate, map, or remap pages work as follows:
    1. Get spin lock
    2. Go through each page to be changed and:
      1. change its attributes to the new attributes
      2. if page was owned
        1. if page was writeable:
          1. wait a period // Could queue multiple writable pages before waiting a period and freeing them.
        2. free the page
    3. wait a period // Ensures all cores will see the new mapping when the function returns
    4. release spin lock
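
A sketch of Proposal 1's free path (steps 3.1 through 3.5); vasLock, Page, and waitAPeriod() are stand-in names for the steps above:

 void vasFreePages(Page *first, Page *last)
 {
     vasLock.acquire();                  // 3.1 get the VAS spin lock
     for (Page *p = first; p != last; p = p->next)
         if (p->present())
             p->markNoUserAccess();      // 3.2 revoke user access
     waitAPeriod();                      // 3.3 every core context
                                         //     switches (and flushes its
                                         //     TLBs) at least once, so no
                                         //     stale entry survives
     for (Page *p = first; p != last; p = p->next)
         if (p->present())
             deletePage(p);              // 3.4 actually delete
     vasLock.release();                  // 3.5
 }
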
Proposal 2
  1. Ensure that on each context switch (or at least every window start) flush all TLB entries (at least all non-svas based TLB entries).
  2. Add an "deleting" attribute in page table entries to not present pages. The attribute location would have to be carefully selected such that at least the owned and physical address bits where not affected (others may be needed as well, not sure at this point). Another option would be to create a shadow page table structure like exists for ARM (the way things are going this may be necessary in the not to distant future) VAS operation on the same location)
  3. VAS operations that free pages when multiple threads are present (i.e. delete process would not have to do this) work as follows:
    1. Go through each page to be deleted, and if not deleting, mark it not present and set the "deleting" flag, preserving all other parts of the entry, using compare and swap; if the write fails continue to the next page
    2. wait a period
    3. Go through each page to be deleted and if not present and "deleting":
      1. clear the "deleting" flag using compare and swap, if write failes continue to the next page
      2. if the page was owned return it
  4. VAS operations that allocate, map, or remap pages work as follows:
    1. Go through each page to be changed and:
      1. if not present and "deleting", continue to the next page
      2. change the page's attributes to the new attributes using compare and swap; if the write fails continue to the next page
      3. if page was owned
        1. if page was writeable:
          1. wait a period // Could queue multiple writable pages before waiting a period and freeing them.
        2. free the page
    2. wait a period // Ensures all cores will see the new mapping when the function returns
Proposal 3
  1. Ensure that on each context switch (or at least every window start) flush all TLB entries (at least all non-svas based TLB entries).
  2. Add an "deleting" attribute in page table entries to not present pages. The attribute location would have to be carefully selected such that at least the owned and physical address bits where not affected (others may be needed as well, not sure at this point). Another option would be to create a shadow page table structure like exists for ARM (the way things are going this may be necessary in the not to distant future) VAS operation on the same location)
  3. VAS operations that free pages:
    1. Go through each page to be deleted:
      1. if the page is not present and marked "deleting", exit the loop.
      2. mark the page not present and set the "deleting" flag, preserving all other parts of the entry, using compare and swap; if the write fails exit the loop. // assume owned flag is not set for not present pages
    2. if multiple threads are present, wait a period // could instead check if process has threads active on other cores, as long as we assume less than 32 cores this would be trivial to do with a bit map
    3. Go through each page in the range processed above:
      1. clear the "deleting" and owned flags
      2. if the page was owned return it
    4. if full range not deleted return an error.
  4. VAS operations that allocate, map, or remap pages work as follows:
    1. Go through each page to be changed and:
      1. if not present and "deleting", continue to the next page
      2. change the page's attributes to the new attributes using compare and swap; if the write fails continue to the next page
      3. if page was owned
        1. if multiple threads are present /* could instead check if process has threads active on other cores */ and page was writeable:
          1. wait a period // Could queue multiple writable pages before waiting a period and freeing them.
        2. free the page
    2. if multiple threads are present // could instead check if process has threads active on other cores
      1. wait a period // Ensures all cores will see the new mapping when the function returns
Proposal 4
  1. Ensure that on each context switch (or at least every window start) flush all TLB entries (at least all non-svas based TLB entries).
  2. Add a per pool (should not need to be per core) pending free list.
  3. At each window start have core 0 append the pending free lists onto the pool free lists.
  4. Modify RAM quotas to delay updating the number of pages until a window boundary is crossed (kind of like buvec, but for RAM quota).

Allocation from RAM when there is insufficient quota, but there would be sufficient quota if the pending pages were added in, could return a different error code, but I am not sure that is necessary. Also, we could probably only do this if more than one core is defined in the registry, for backward compatibility, but again I am not convinced this is necessary.
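
Sketch of Proposal 4; Pool, pendingFree, and the helper names are assumptions:

 // Frees are deferred to the window boundary, by which time every core
 // has flushed its TLBs at least once.
 void freePage(Pool &pool, Page *page)
 {
     pool.pendingFree.push(page);        // no shootdown, no waiting
 }

 void onWindowStart()                    // runs on core 0 (item 3 above)
 {
     for (Pool *pool = firstPool(); pool != nullptr; pool = pool->next)
         pool->freeList.append(pool->pendingFree.removeAll());
     // Item 4: RAM quota counts are likewise updated here, not at
     // free time.
 }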

Other Issues

  • What is the queuing mechanism? Are threads queued on a lock for fairness, or do they wait on the lock changing and then try to obtain? Is it a sit-n-spin wait while they have budget or is there some threshold where they give up and wait short duration or longer?
  • How do windows provide budget to schedulers?
    • normalized (consistent sized windows in every period)
    • buvec approach (variable sized windows)

History Buffers

  • There will be separate scheduling and process creation histories for each core, so only a core local critical is required.
    • Offline tooling will merge the multiple histories.
  • If there are multiple timestamp counters (e.g. one per core), the PAL will need to keep them reasonably synchronized. The accuracy of synchronization only affects event logging accuracy, not time partitioning.

Threads

Typically threads are only secured for a specific operation initiated by an API request. Several of these do not involve scheduling (e.g., getThreadPeriod). During scheduling it was assumed threads could be freely manipulated by the scheduler, since scheduling occurred in a cross core critical.

The following routines are an exception to this rule:

  • raiseExceptionToAllThreads - only used for frame resync and powerTransient (i.e., global events for which it would be OK to wait for cores to be quiescent and then perform the operation from a master critical)
  • suspendOrResumeProcessesBeingDebugged - while debugging, all threads may be secured in single critical to check if they are part of process or debug group being suspended or resumed. This does not affect any scheduling lists.

The following thread secures involve a minimal read and/or write operation that is disjoint with scheduling:

  • resetWorstCaseUtilizationsK
  • setThreadBudgetK
  • getThreadBudgetK
  • threadTemplateASCIINameK
  • setThreadRestartEntryPointK
  • getThreadDataK
  • getThreadPeriodK
  • executionStatusPointerK
  • setThreadTLSPtrK
  • setCPUBudgetTransferRecipientK
    • requires recipient handle and ptr to be stored atomically before source thread references in short duration wait. No direct impact on scheduler.
  • debuggerSuspendThreadK
    • The thread suspended flag is atomic and is needed by the scheduler.
    • Current thread can wait for next interval, but that's not different from it calling WUNP
  • resumeSuspendedThreadK

The following thread secures could interfere with a scheduler on another core accessing them:

  • raiseExceptionToThreadK
    • Writes in threads context.
  • Process::initiateSuicide (deleteProcess/restartProcess)
    • For every thread in system, secures one per critical and checks if process owns thread and if so unbinds thread and initiates stopAndDelete, which raises kernel exception to thread
    • Process threads can be in different schedulers on different cores
  • deleteThreadK
    • initiateStopAndDelete on specific thread (raises kernel exception to thread)
  • restartThreadK
    • initiateStopAndDelete on specific thread (raises kernel exception to thread)

Raising an exception to a thread should not be done while the thread could be current on another core and potentially writing to the context (i.e., entering a kernel trap). Potential options include:

  • Update ctx switch to raise the exception when switching to the thread. That is when it would be handled. Extends ctx switch worst case and adds one check to general case.
  • Queue request to thread scheduler. When scheduler activated, it processes this queue and before scheduling, raises the exception to threads in its queue.
    • Who pays for scheduler activation cost?

Exception Masking

653 Impacts

Installing handlers:
  Only plan to use for startOfWindow exception. 
  A future enhancement could allow the exceptions to mask to be specified in the 653 config file for the other kernel synthesized exceptions which map to Health Monitor exceptions.
  Would like mainline to accept sizeOfExcExPar >= sizeof(excExParType)

Scheduler Stacks:
  May be able to be eliminated in the future. For now keeping compatible with mainline which will require scheduler stacks.

Deadlines and Timeouts:
  Will work as normal. May have worst case to process after a throttling event.

Release Points:
  This is the primary issue I see.
  Initial release point is calculated as next window in the partition with the periodic processing start attribute set.
  Because of DELAYED_START and starting processes at random times after NORMAL mode, processes may have a different initial release point.

  next release point = previous release point + 653 process period

  release points match if the current time is within system tick jitter of the calculated release point
  previous release point is updated to the start time of the window matching the release point, i.e., windowStart(currentWindow()) + majorFrameStart

  The problem is going to be if a 653 partition is throttled across multiple windows.
  We would want to ensure (make PAL ensure) that the partition is rescheduled at the start of a window which is allowed to be a release point.
  We would not want slack re-enabled in the middle of a window, with a pending startOfWindowEx raised that updates a release point to a mid window value.
   Does the kernel re-enable slack immediately, or at a window, tick, or hyperperiod boundary?
  We do not have a way to maintain offset release points.
  For example, if the release point for processes A and B is window 2, the release point for processes C and D is window 4, and the partition is throttled until window 20, then window 20 will become the previous release point for processes A, B, C, and D. We do not currently have a method to make window 20 the release point for A and B and window 22 the release point for C and D.

Semaphore Notes

The semaphore code was walked through in detail with the new design approach in mind.

createSemaphoreK()

Can be done with either fine or coarse locking without lock nesting.

Locking an object, or constituent objects, multiple times during an operation could cause another core to experience "multiple" lock duration times, perhaps times the number of cores. Eww.

Securing a KIO also secures its process; securing the process is probably not required.

KernelInterfaceManagerBase

  • Is the lock at the kiomgr level, or the namespace level?
  • Is a lock on the namespace transitive to the contained names? Yes.
  • Do we need read locks or just write locks? For now, just write locks.

While an object is secured, some fields are const.

While thread is active, it is implicitly secured, so all "secure const" members can be accessed.


  • signalSemaphore - the highest priority waiter may be in a different scheduler which is simultaneously having an SD or SOP event and wants to remove the thread from the resource list.
  • clear/deleteSemaphore - (runAllWaiters) - waiters are in multiple schedulers, some of which may be current on this and/or other cores.
  • waitSemaphore - small issue with checking delayed grant vs when it is decremented, but I think we can easily account for that.

Questions/Comments:

  • clear/delete results in runAllWaiters. If the thread does not have budget to complete this, it may be interrupted, since the operation is comprised of multiple critical sections, releasing one thread per critical. This can potentially delay (indefinitely) when the resource becomes available.
  • Local vs foreign scheduler is biggest issue.
  • Is there a benefit to having a local only attribute? (i.e., the resource is only usable within a single scheduler). It may allow optimizations if we know we're only dealing with one scheduler. However, would it be useful for the apps? This is different from process local, since threads of the same process can be in different schedulers. I think POSIX has a shared boolean with the latter meaning.
  • Our semaphores have a multi-level priority. Short duration (implicitly current priority) have precedence. Then slack waiters. Then long duration waiters who have to wake up and poll if there is a resource. This is geared to allow the first thread which is ready to run to gain the resource, even if it immediately runs out of budget and is switched out. This can starve a higher priority thread continually. For POSIX, short duration and slack do not apply. If we add a timeout approach, then the long duration thread could be in a resource list and eliminate delayed grants. However, we would have the issue of budgeting for additional timer interrupts for timeouts or polling this list at spots which we eventually have to address. I don't think that helps with the general multi-core issue, but it could eliminate delayed grants.

Memory Subsystem

How to handle virtually visible, physically shared, memory between cores?

The issue arises due to shared code/data, e.g., the kernel. One core has kernel data modified in its L1 cache; another core accesses the cache line; this causes bus traffic. How do we bound that?

Ideas so far:

  • Make the kernel data/heap be not globally user visible. There will be some data structures that are required to be user visible, but likely a very small amount.
    • This doesn't completely eliminate the problem, but it does limit the amount of RAM that a given thread can access because other data is protected via access rights.
  • Disassociate the kernel stack from kernel heap and instead allocate from the memory pool of the creating thread.
  • Code is not modifiable (ignoring BIT), so coherency would not be required, which may minimize snooping
  • Place the kernel in an SRAM pool. This would be much easier if the kernel were smaller.
  • Put rarely used code/data, e.g., mode transition and boot code, into supervisor mode only access and put it in a different (e.g., "slow") pool.

Monitors

  1. How do we ensure that one core does not get out of time sync?
  2. Is there adequate information available to the PAL to implement a clock monitor?
    1. Slew
    2. Stuck timer, or timer anomalies

Kernel multi-core window scheduler design

Goals:

  • Must be able to activate windows at precise times.
  • Must efficiently support interrupts (minimal budget).
  • Desire window slack.
  • want window scheduler to be as simple as possible, e.g.,
    • No priorities
    • No reordering
    • Interrupt windows are an implementation tactic, they are not specifically required, i.e., it would be nice to eliminate interrupt windows.

Thread Scheduler (just new fields):

 - unmaskedInterrupts : set of interrupts
   The interrupts for which ISR threads in the scheduler are in a
   waitUntilNextInterrupt().

Platform interrupts:

 - Platform interrupts can activate ISR threads either in the current
   scheduler, or for a scheduler in the interrupt window.
 - To reduce overall IWI latency, IWI interrupts should be higher
   priority than non IWI platform interrupts.
 - Each platform interrupt is associated with (at most) one thread
   (at a time), hence one scheduler.
   - It would be acceptable for the registry to know "the" scheduler
     an interrupt would be associated with.
 - When HW asserts an interrupt, there are two cases. If the
   interrupt is not associated with the interrupt window, the PAL
   must call raisePlatformInterrupt() on the scheduler's current
   core.  If the interrupt is associated with the interrupt window,
   then the PAL must call the IWI handler on all cores.  The kernel
   resolves/addresses the interaction of IWI and WSI interrupts.
   - The DEOSKERNPPI will be extended so the PAL can determine
     whether each interrupt will be unicast, or broadcast.
   - The "IWI handler" will be raisePlatformInterrupt().
   - For unicast interrupts, the kernel will call
     maskPlatformInterrupt() and unmaskPlatformInterrupt() on the core
     where the interrupt should be delivered.
   - For IWI (multicast) interrupts, the kernel will call
     maskPlatformInterrupt() and unmaskPlatformInterrupt() on exactly one
     core.
 - An ISR thread may be in a waitUntilNextInterrupt() when the window
   is suspended, but retainedBudget could expire.
   The next window containing the ISR thread's scheduler
   must ensure the ISR thread is runnable at the window start.  I.e.,
   for RMA schedulers the window start must be a budget replenishment
   point for the ISR thread.  For threads in schedulers that don't
   enforce budgets, this is trivially satisfied.

Note: The original design described the IW as the thing that has retainedBudget, but the new design has the WAT having the retained budget. The issue is that the scheduler for the IW may need to be scheduled multiple times during the WAT, so the IS may appear in multiple windows. Furthermore, we wouldn't want a "normal window running the IS" to have to be interrupted to handle an interrupt. It might make sense to describe there being only one IW, or perhaps that "interrupt windowness" is a property of any window that is running the IS.

(IIAWRBII): How to deal with interrupts arriving when retainedBudget is insufficient to activate the ISR window? This may be a case where additional timeline budget must be reserved, or other special runtime considerations made.

More fundamentally, ISR budgets cannot be assured across cores. This is the same issue as for single-core RMA ISRs when the interrupt arrives close to the end of a period.

A window can be activated by:

  1. windowStart interrupt.
  2. The early completion of the previous window.
  3. End of an interrupt window preemption.
  4. In addition to the above, the interrupt window can also be activated by the PAL calling raisePlatformInterrupt() for any of the interrupt window's ownedInterrupts.

As in the FourPeaks design, the windows are in a WAT.

WAT

 windows
   A sequence of windows, indexed by 0.
 windowSlack
   The sum of the finish early times for all the most recent
   contiguous sequence of window activations where
   mayFinishEarly=true, minus any windowSlack used in that interval.
 retainedBudget
   The amount of time remaining for the IW.  retainedBudget is zero
   at WAT start.  retainedBudget is incremented by the finish early time
   when the IW is scheduled, decremented when the IW runs, and reset
   to zero (expires) when:
     - All the retainedBudget is used, or
     - a window actualStart happens at its
       (specifiedStart-windowSlack).  An interesting special
       case is the start of a mayFinishEarly=false window.
   In the second case, retainedBudget incrementally transitions to
   windowSlack as retainedBudget would otherwise begin to expire.
 retainedBudgetExhausted
   Indicates whether there is sufficient retainedBudget to activate
   the interruptWindow.  retainedBudgetExhausted is set only by
   core0, but read by all cores.

Window

 specifiedStart  The time specified during system design.
 actualStart     The most recent run-time value.
   The offset from the WATStart.  actualStart is always <=
   specifiedStart+(some TBD overhead parameter)
 mayStartEarly bool
   If true, the window may be activated prior to its specifiedStart.
 specifiedDuration
 actualDuration
 mayFinishEarly bool
   If true, the window may finish before the next window's
   specifiedStart.
 mayUseSlack    bool
   If true, the window may use more time than its specifiedDuration.
 unusedBudgetDestination : enum {interruptWindow, windowSlack}
   Where the unused budget goes when the window ends early.  Only
   the interruptWindow can give the interruptWindow budget (currently).
 ownedInterrupts : set of interrupts
   The interrupts that *might* be enabled when the window is active.
   When a window is activated, the ownedInterrupts for which ISR threads
   are in a waitUntilNextInterrupt() are unmasked.
 enabledInterrupts : set of interrupts
   The interrupts that might trigger an interrupt window activation
   during this window.  These interrupts are only unmasked if there
   is retainedBudget and the ISR thread's scheduler indicates the
   interrupt is unmasked.
 nextWindow
   The next window in the WAT.  The WAT's last window's nextWindow is
   WAT.windows[0]
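
The fields above, collected into a minimal C++ sketch; the types (TimeTYP, InterruptSet) are assumptions in the style of this design:

 struct Window {
     TimeTYP      specifiedStart;    // offset from WAT start
     TimeTYP      actualStart;       // <= specifiedStart + TBD overhead
     TimeTYP      specifiedDuration, actualDuration;
     bool         mayStartEarly, mayFinishEarly, mayUseSlack;
     enum Dest { interruptWindow, windowSlack } unusedBudgetDestination;
     InterruptSet ownedInterrupts;   // might be enabled when active
     InterruptSet enabledInterrupts; // might trigger an IW activation
     Window      *nextWindow;        // last window wraps to windows[0]
 };

 struct WAT {
     Window  *windows;                  // sequence, indexed from 0
     TimeTYP  windowSlack;
     TimeTYP  retainedBudget;           // zero at WAT start
     bool     retainedBudgetExhausted;  // set only by core0, read by all
 };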

Rationale: A design specifying times (integers) for flexible start and end times was rejected because integers would (in some cases) require the window scheduler to introduce unscheduled windows into the timeline.

Rationale: There is only one interrupt window because otherwise the interrupt windows would need a priority scheme.

Future enhancements:

  • Allow windows other than the IW to provide retainedBudget.
  • Allow windows to designate the window to receive its unused budget.
  • Allow IW to use slack.
  • Allow multiple schedulers (presumably on different cores) to own the same IWI.
    • May need some restriction like all schedulers are in the same window(?).
  • Allow a window to specify an "earliest start time" to constrain how far left the window could slide.
    • Some constraints like "may use slack" might have to be co-properties to avoid the need to introduce unscheduled windows.
    • Intent is to provide more precise control over window timing without fixing the window in time.
    • Proposed at the 3/19/2019 telecon.

If a window has mayStartEarly=false, the preceding window's end time must be fixed. The permitted predecessor combinations are:

  1. mayStartEarly=false, mayFinishEarly=false, mayUseSlack=D/C.
  2. mayStartEarly=true, mayFinishEarly=false, mayUseSlack=true

If a window has a fixed duration (mayUseSlack=false), then mayStartEarly=true implies mayFinishEarly=true.

For mayStartEarly=false windows, if the predecessor could finish early and there is an ISR window wanting CPU time, it would be a shame to force the predecessor to spin.

(SWWFEACAR): Specifying that a window will finish early only when all cores are running Idle is a bit constraining. Propose: have an "idle window priority" (or perhaps idle scheduler priority) and when all cores are running threads with priority lower than that, the window may end early. This would permit cores opposite ISR threads to do constructive work and still not inhibit early window completion. For non-RMA threads that would also permit the threads to dynamically signal their desire to finish early by adjusting their priority.

  • We don't have an algorithm that would support this.
  • Also deal with successive windows where the same scheduler is running in both on one core, permit other core first window to finish early.

The Interrupt Window must be mayFinishEarly==true.

(IIAWRBII): Need timing diagrams and window <--> thread scheduler protocols.

To minimize overhead, the PAL must deliver window start interrupts and all interrupt window interrupts to all cores.

  • Rationale: An alternative is that the PAL delivers to one core, then the kernel sends IPIs to all other cores. In that case the latency is 2*longest critical, rather than 1. Also, most platforms can be configured for the HW to deliver the interrupts to a specific core, or to all, so the PAL overhead may be minimal.
  • Note: In the March 2015 and previous implementations, IWI is only delivered to master core.


AL: The next paragraph is close, but I think it would be more correct to state that this situation arises when the currently computed value of WAT.retainedBudget is "small". That encapsulates both the end condition below, and the situation when the retainedBudget has been exhausted.

The Interrupt Window being activated close to the next window's (specifiedStart - windowSlack) could significantly delay the next window's actualStart. In this case there will be an interruptInhibitedInterval during which an interrupt that would normally activate the interrupt window will be inhibited (the current design is to mask all interrupt window interrupts when the first such interrupt arrives in the interruptInhibitedInterval). Core 0 determines the start of the interruptInhibitedInterval.

The interrupt window must be scheduled in order to "refresh" its budget (see the description of WAT.retainedBudget).

The expected scheduling sequence in a "period" would be:

  1. Non-interruptable, fixed start, fixed duration windows.
  2. The interrupt window
  3. May start early windows.

We're debating whether some form of the above should be, or could be, formalized to simplify the design, but no specific proposal is currently available.

A "start of period" must be at a window that is mayStartEarly=false.

See the Fourpeaks design for timing of Start of Window interrupts. Basically SOW is when the HW will assert the interrupt. We will publish the "maximum delay before the first user mode instruction", and that is what the application would consider to be the start of the window.

Is there a way we could separate the window and thread schedulers more? E.g., have window scheduler be able to preempt thread scheduler critical regions?

  • AL: Use a VM like separation. There would have to be constraints on what scheduler operations could block other schedulers, and constraints on consistency of scheduling so that future windows were not blocked by a lock held by an inactive scheduler.
  • AL: Perhaps if IW/SOW interrupts had higher priority, and schedulers could complete the critical region of another scheduler, at least until the point where the original scheduler could continue.

The design addresses the fact that a window can be started by several different events that can arrive and be interpreted differently by each core, and that each core can even get a different number of triggering events. E.g., one core could decide that its window should complete early, and another core could get a window interrupt before it makes the same determination. Thus one core gets two window change events, the other core gets only one. The differing number of events issue is addressed in proposal 1 by case analysis of which events are redundant or for which duplicates are possible, and then addressed at the beginning of the function.

Proposal 1: Core 0 determines window to start

Core 0 determines window to start, and all cores synchronize for window activation.

  • TODO: Need a case analysis of the mayStartEarly, mayFinishEarly, mayUseSlack combinations.
  • TODO: it is TBD if the window start interrupt is written on master core or all cores.
  • TODO: Are IWI's masked on each core, or only by master?
  • TODO: Need to track interrupted window and when interrupt window completes early return to that window with proper remaining time (not window.duration + slack)
  • TODO: When interrupt window returns to interrupted window, if too close to window boundary, should spin
  • Window timer only needs to be readable by master.

Proposal 1 pseudo code

 WSI   Window Start Interrupt
 WSI.time = latest time that the current window can finish, aka next window start.
   Always relative to current WAT start.

   For interrupt window:
          = retainedBudget
   else (non interrupt windows):
          = window.duration + window slack
          window slack includes retainedBudget if next window has a fixed
          start time, i.e., the current window is the last window before retainedBudget expires.
 IWI   Interrupt Window Interrupt (multiple possible)

// Note handler is also called by mayStartEarly().
WSI_or_IWI_handler(bool IWI or WSI, actualInterruptNumber) // runs on all cores
{
  if IWI and currentWindow.isIW then propagate actualInterruptNumber to sched and return;
  if IWI and currentWindow has IWI's masked, is assumed to be impossible;
  if IWI and retainedBudgetExhausted return;  // This is a "spurious" interrupt, only one/core possible.
  if currentCore == 0
     syncA;  // A simple barrier, perhaps based on a window activation counter
     recompute retainedBudget;  // cases: nextWindow.mayStartEarly=false and current window is IW.
     switch:
       IWI and not currentWindow.isIW:
          if retainedBudget > some threshold
            currentWindow = interrupt window;
          else
            // PAL must reassert IWI later, TODO: which may be an issue for edge triggered interrupts.
            temporarily mask all IWIs; // Prevent remaining IWI's from consuming any more budget
            retainedBudgetExhausted = true;
            programming window timer not necessary in this case;
       WSI: currentWindow = next normal window;
            retainedBudgetExhausted = false;
     syncB; // sync with slaves again;
     // interrupt mask status TBD.
     write window timer for currentWindow;
  else  // slave core
     syncA;
     syncB;
  update IWI mask status for currentCore (depends on both window status and temporary mask status);
  activate scheduler for currentWindow on this core;
  

Since IWIs are broadcast, some schedulers will not have an ISR for the interrupt.

Use Cases

The above algorithm needs to address the IWI, WSI, and finishEarly() cases, both when they occur separately and in combinations close enough that the observed event sequence may differ between processors. Fortunately WSI and mayStartEarly() are disjoint, because mayStartEarly() is synchronized from a critical on all cores. This means that the only ambiguous event sequence is IWI and WSI. Furthermore, different cores may see different IWI interrupts, but all are assumed to be masked simultaneously, so they are treated as one event below.

Case  currentWindow                Events                 nextWindow
       IWIEnabled       Core 0             Core 1          IWIEnabled
 5       false          WSI                WSI             D/C
 1       true           IWI, WSI           IWI, WSI        false
 2       true           IWI, WSI           WSI, IWI        false
 3       true           IWI, WSI           WSI             false
 4       true           WSI                WSI, IWI        false (assumed to be impossible)
 5       true           IWI                IWI             false
 6       true           WSI                WSI             false

Only Core 0 can change the window. There is a cross core synchronization point when the window is changed.

Finish Early Mechanism

An important property of any mechanism is that either all cores will start early, or none will. This prevents some cores from doing a window transition via start early and others via an interrupt, e.g., a WSI.

Finish Early Proposal

For case where idle triggers "finish early". Works for scheduler with ISRs.

This algorithm has an undesirable "timeout".

The algorithm is asymmetric. Every core calls finishEarly() in Idle, but a master core initiates commands to slaves and coordinates the responses from the slaves. Since not all slaves may be ready to finish early, the algorithm employs a timeout to detect ineligible slave cores. The timeout should be set to the maximum amount of time it would take for all N cores that were already in finishEarly() to receive and respond to a command. Assuming a CPU has a write buffer depth of NWB cache lines, a command will be seen in NWB times the time it takes to transfer a cache line to memory, plus once the time it takes to transfer a cache line between cores. The equation for the slave's response should be similar, but the details depend on the configuration of the memory subsystem.

In the following the phrase "could observe foo in {x, y, z}", means that at that point in the code, the current core could observe the variable foo as having any of the values in the set.

The key to the algorithm is that every core will see the SAME transition of command from wait to either sleep or finish. This is important because any core specific state transitions without a synchronization point in between them could permit the first state to be missed by a cross core observer.

Either the master or the slave could handle ISRs until all cores finish. Care is taken to ensure that inter-core cache line transfers are minimized.

 command_t enum { run, wait, finish };
 command_t command=run;  // "command" only written by core0
 slaveState_t enum { running, waiting, acknowledging };
 slaveState_t slaveState[ncores] = running;  // Elements inter-core cache line aligned.
 
 // References to command and slaveState need to be VOLATILE_REF(). There
 // need to be memory barriers between reads and subsequent writes (not
 // writes and reads), to prevent speculative reads and out of order writes.
 function finishEarly()  // This would replace PALidleFunction().
   loop
     if currentCore==0

       // Not sure how to get a power efficient wait here.

       // At this point, and at all points outside of the following critical,
       // core0 will observe, for all other cores, slaveState[core]==running
       enterCritical();  // May have to be master critical.
       if finish early is permitted, e.g., the window specifies mayFinishEarly==true then
         command=wait;
         var myCommand=run;
         if for every other core, slaveState[core]==waiting within timeout then
           command=finish;
           myCommand=finish;
           wait for every other core, slaveState[core]==acknowledging;
         else
           // The "else" case must account for only a proper subset of cores
           // seeing the command.  Observing "running" below does the trick.
         command=run;
         // The following loop prevents cores from observing a transition
         // directly from wait to another wait.
         wait until for every other core, slaveState[core]==running;
         // At this point core0 will observe, for all other cores, slaveState[core]==running
         if myCommand==finish
           WSI_or_IWI_handler(WSI=true, actualInterruptNumber=D/C);
       exitCritical();
     else // Not core 0
       // For power efficiency, there could be a waitUntilEqual(&command, wait); here.
       enterCritical();

       // Could observe command in {run, wait}, but not finish.
       if command==wait
         // Could observe command in {run, wait}, but not finish.
         slaveState[currentCore] = waiting;
         // Could observe command as anything, i.e., {run, wait, finish}.
         // The "Not Equal" here addresses the fact that some cores may
         // not see command=wait, or may not be in this algorithm at all.
         var myCommand=waitUntilNotEqual(&command, wait);
         // Could observe command in {run, finish}.  
         slaveState[currentCore] = acknowledging;
         waitUntilEqual(&command, run);
         slaveState[currentCore] = running;
         if myCommand==finish
           WSI_or_IWI_handler(WSI=true, actualInterruptNumber=D/C);
       exitCritical()

Finish Early Proposal No timer

Idle thread behavior:

Define the following state for each secondary:
state[core] is:

  run         some thread other than idle is (about to be) active

  waitNoCrit  secondary is running idle outside of a critical.

  startCrit   primary has commanded secondary to enter a critical

  waitInCrit  secondary was commanded to enter critical and wait, and is now waiting

  runPending  secondary was commanded to enter critical and wait, but refuses (got an interrupt)

  finish      The primary has concurrence from all cores that it is
              time to start the next window.  Each secondary will
              call activateSchedulerWindow.

  abort       Some core has refused to start the next window; the
              secondary will exit the critical back to waitNoCrit.

Secondary:
  on every ctx into idle, before exiting critical:
    state[currentThread] = idling
  ...

primary:
  outside of a critical
  while (1)
      // The next statement is not required for algorithmic correctness,
      // it just saves power.
      in a power efficient manner wait for all cores to be idle

      enter critical
      foreach core:
          CAS(if state[core]==waitNoCrit, then state[core]=startCrit)
      // possible secondary states: waitNoCrit, run, startCrit
      foreach core:
          // wait until secondary changes, also covers CAS failed
          waitUntilNotEqual(state[core], startCrit)
      // possible secondary states: waitNoCrit, run, waitInCrit, runPending

      bool abort=false
      foreach core:
          if (state[core] != waitInCrit):
              abort=true
      if abort:
          foreach core:
              s=state[core]
              if s in [waitInCrit, runPending]:
                  state[core]=abort
          foreach core:
             waitUntilNotEqual(state[core], abort)
       else:
          // possible secondary states: waitInCrit
          foreach core:
              state[core]=finish
          activateSchedulerWindow()
      exit critical

secondary idle loop:
   state[core]=waitNoCrit
   exitCritical
   while (1)
       // local core possible states: waitInCrit, startCrit
       waitUntilNotEqual(state[core], waitNoCrit)
       enterCritical
       state[core]=waitInCrit
       waitUntilNotEqual(state[core], waitInCrit)
       if state[core]==finish:
          state[core]=run
          activateSchedulerWindow()
//     else: // state[core]==abort
       state[core]=waitNoCrit
       exitCritical        

secondary interrupt entry function
   // possible states: waitNoCrit, startCrit
   if CAS(if state[core]==waitNoCrit, then state[core]=run)
      handle interrupt...
   else
      // state is startCrit, transition to runPending.
      state[core]=runPending
      waitun....
      state[core]=run
      handle interrupt
      
secondary interrupt exit function
   state[core]=waitNoCrit

Each secondary idle thread when it is context switched to enters
waitNoCrit:

  S-> waitNoCrit P-> startCrit s-> waitInCrit P-> finish   S-> run (and window switches)
                                              P-> abort    S-> waitNoCrit
                               s*> runPending P-> abort    S-> run (to handle interrupt)
                 S*> run


--------------------------------------------------------------------------------
alternate:

primary:
  outside of a critical
  while (1)
      // The next statement is not required for algorithmic correctness,
      // it just saves power.
      in a power efficient manner wait for all cores to be idle

      enter critical
      foreach core:
          CAS(if state[core]==waitNoCrit, then state[core]=startCrit)
      // possible secondary states: waitNoCrit, run, startCrit
      foreach core:
          // wait until secondary changes, also covers CAS failed
          waitUntilNotEqual(state[core], startCrit)
      // possible secondary states: waitNoCrit, run, waitInCrit

      bool abort=false
      foreach core:
          if (state[core] != waitInCrit):
              abort=true
      if abort:
          foreach core:
              if state[core]==waitInCrit:
                  state[core]=abort
          // possible secondary states: waitNoCrit, run, abort
       else:
          // possible secondary states: waitInCrit
          foreach core:
              state[core]=finish
          activateSchedulerWindow()
      exit critical

secondary idle loop:
   state[core]=waitNoCrit
   exitCritical
   while (1)
       // local core possible states: waitInCrit, startCrit
       waitUntilNotEqual(state[core], waitNoCrit)
       enterCritical
       state[core]=waitInCrit
       waitUntilNotEqual(state[core], waitInCrit)
       if state[core]==finish:
          state[core]=run
          activateSchedulerWindow()
//     else: // state[core]==abort
       state[core]=waitNoCrit
       exitCritical        

secondary interrupt entry function
   // possible states: waitNoCrit, startCrit
   state[core]=run
   handle interrupt...
      
secondary interrupt exit function
   state[core]=waitNoCrit

Each secondary idle thread enters waitNoCrit when it is context
switched to:

  S-> waitNoCrit P-> startCrit s-> waitInCrit P-> finish   S-> run (and window switches)
                                              P-> abort    S-> waitNoCrit
                               s*> run (to handle interrupt)
                 S*> run

barriers?

Legend:
  s-> / S->  normal program flow (transition made by the secondary)
  P->        transition made by the primary
  s*> / S*>  due to an interrupt

Finish Early Proposal One Counter

The advantage here is that only one word is shared by all cores, and hence no memory barriers are needed.

The algorithm depends on two counters:
  1. The number of cores that are executing the idle thread.
  2. The number of cores executing the idle thread in a critical
     waiting for agreement to start a mayFinishEarly window.
The algorithm proceeds in two phases.  In the first phase all cores
wait outside of a crit for all other cores to start executing the
idleFunction.  In the second phase the cores enter a critical and
wait until all cores are in the crit, or until any core stops
running idle.

// Number of cores waiting in idle
struct idleCoreCounts
    // Logically a structure, like this, but treated as a raw integer
    //  struct 
    //      uint16_t idleWaitingCount
    //      uint16_t idleMayFinECount

idleWaitingDelta = 1
idleMayFinEDelta = (1<<16)

// To get an interrupt, Idle must have been outside a crit.
interrupt entry function
    if currentThread.process() == systemProcess() // Thread is idle
        atomicAdd(&idleCoreCounts, -idleWaitingDelta);
        waitUntilEqual(idleCoreCounts.idleMayFinECount, 0) // effectively

interrupt exit function
    if currentThread.process() == systemProcess() // Thread is idle
        atomicAdd(&idleCoreCounts, idleWaitingDelta);

Idle function:
    atomicAdd(&idleCoreCounts, idleWaitingDelta);
    while (1)
        // Wait until all cores are idling outside of a crit
        while (1)
            current=idleCoreCounts
            break if current.idleWaitingCount == CPU::numberOfCores()
            waitUntilNotEqual(&idleCoreCounts, current)

        // Wait until all cores are idling inside of a crit, or at
        // least one core stops executing idle
        enterCritical()
        if ! current window may finish early:
            // all cores are sleeping but window can't finish early, just sleep
            atomicSleep() // exits critical
        else:
            atomicAdd(&idleCoreCounts, idleMayFinEDelta);
            while (1)
                current=idleCoreCounts
                if current.idleMayFinECount == CPU::numberOfCores()
                    // At this point, logically, the idle counts would be reset to
                    // zero after establishing a cross core barrier; for efficiency
                    // do that as part of the barrier in activateSchedulerWindow().
                    activateSchedulerWindow()
                    atomicAdd(&idleCoreCounts, idleWaitingDelta);
                    break
                if current.idleWaitingCount != CPU::numberOfCores()
                    atomicAdd(&idleCoreCounts, -idleMayFinEDelta);
                    break
                waitUntilNotEqual(&idleCoreCounts, current)
            exitCritical()


Add to activateSchedulerWindow after core barrier:
    // sets idleMayFinECoreCount = 0 and idleWaitingCoreCount = 0
    idleCoreCounts = 0
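A minimal C11 sketch of the packed counter word (the field layout matching the deltas above is an assumption; the names are illustrative):

 #include <stdatomic.h>

 // Both 16-bit counts live in one word, so a single atomic add updates
 // either field and a single load observes both fields consistently.
 static atomic_uint idleCoreCounts;   // low 16 bits: idleWaitingCount
                                      // high 16 bits: idleMayFinECount

 #define idleWaitingDelta  1u
 #define idleMayFinEDelta  (1u << 16)

 static unsigned idleWaitingCount(unsigned v) { return v & 0xFFFFu; }
 static unsigned idleMayFinECount(unsigned v) { return v >> 16; }

 // Example: interrupt entry for an idle thread decrements the waiting
 // count with one atomic op and no separate memory barrier.
 static void idleInterruptEntry(void)
 {
     atomic_fetch_sub(&idleCoreCounts, idleWaitingDelta);
 }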

Critical Section Bound Formulas

The length of time that the kernel spends within a critical section impacts the latency of platform interrupts and the start of the next window activation. The crittime kernel will be updated in the future to support measuring the times that encompass the different values in the formula. There are two cases:

  • No nested locking occurs, or
  • A nested lock is also acquired.

In source code form:

 Lock(outer);
 do stuff;
 UnLock(outer);

or, for the nested lock case:

 Lock(outer);
 Pre nested lock code; // Part of Execution[i]
   Lock(inner);
   do stuff;
   UnLock(inner);
 Post nested lock code; // Part of Execution[i]
 UnLock(outer);

At each nested lock level there is some work performed before optionally obtaining the next level lock, and again after the nested lock is released: PreExe and PostExe above, collectively referred to as Execution. Furthermore, since a lock may be held by another core, there may be some delay acquiring the lock, called the Latency. We define Duration as the amount of time a lock is held; as the formula below indicates, it depends on the latency to acquire the nested lock and the duration spent within the nested lock. Graphically, Duration is as follows:

  PreExe   Latency to get nested lock   Duration of nested lock   PostExe 

Since Execution = PreExe + PostExe, and knowing that eventually nesting must stop, we get the following recursively defined formula for the Duration a lock is held:

 Duration[i] =  Exe[i]  +  Latency[i+1]  +  Duration[i+1] 

where i has the following values:

  1. Core Critical
  2. Scheduler Private
  3. KIO
  4. Scheduler Public
  5. Miscellaneous (Should be eliminated/merged with scheduler public in final implementation)

Since the thread private lock is always implicitly held, it does not impact the formula.

  • In the non-SMP case, the thread and scheduler private locks are always immediately available.

This leaves us with KIO and sched public.
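As a worked illustration, a small C sketch of the recursion (the level names come from the list above; the exe/latency values would come from crittime measurements, so everything here is illustrative only):

 #include <stdint.h>

 // Levels from the list above; index 0 is unused so indices match.
 enum { CORE_CRITICAL = 1, SCHED_PRIVATE, KIO, SCHED_PUBLIC, MISC };

 // Duration[i] = Exe[i] + Latency[i+1] + Duration[i+1]; the recursion
 // stops at the innermost lock level actually taken.
 static uint64_t duration(const uint64_t exe[], const uint64_t latency[],
                          int i, int innermost)
 {
     if (i == innermost)
         return exe[i];  // no nested lock taken at the innermost level
     return exe[i] + latency[i + 1] + duration(exe, latency, i + 1, innermost);
 }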

start of window latency = ???

latency to get thread private = 0 (non-smp case)
latency to get scheduler private = 0 (non-smp case)

The following ignores window activation criticals, which may be long for 653/posix window activations.

latency to start of window interrupt = latency for critical section

latency for critical section = latency to get scheduler private +
    longest scheduler private lock +
    duration of critical section activity

longest scheduler private lock = latency to get KIO lock +
    longest KIO lock +
    duration of scheduler private activity

latency to get KIO lock = (#cores-1) * longest KIO lock

longest KIO lock = duration of KIO +
    sched public latency +
    duration of sched public

sched public latency = (#cores-1) * longest sched public

longest sched public = duration of sched public + 0 (no nested locks)

expected times:

 duration of sched public: time to merge a priority list.
     #priorities * (2 or 3) * time to get a cache line.
 duration of KIO: current longest non-ctx critical.


Approximation:

 latency to start of window interrupt =
     #cores * context switch + (#cores)**2 * longest sched public


Observation: We might be able to tolerate a window interrupt after getting a scheduler private lock, but before getting a KIO lock. This would reduce window start latency.

In practice the worst case will not include the number of cores squared, because the execution time at each subsequent level is expected to be smaller than at the previous level. While a core performs the KIO-level work after acquiring the lock, the core that just released the KIO will be performing its scheduler public operation. If a core had to wait for all other cores at the KIO level, then each of those cores most likely followed that same pattern as it released the KIO, so no cores will be waiting for the scheduler public lock, making that latency effectively zero. In addition, by controlling the application designs, #cores can be reduced to the number of related cores. For instance, on a quad core system two cores may be used in an application that coordinates with a semaphore. Then only one other core may already hold the KIO lock on the semaphore, and if this core is waiting for the KIO lock it will not block the other core from getting the scheduler public lock.

critical section bound summary:

  • Have above equation
  • May need to rename "scheduler public" category to also include process VAS.
  • Crittime kernel/tool may need to consider the morphology of locks when determining worst case values, e.g., the Exe[critical] for a critical that doesn't take a nested lock could be treated differently to reduce worst case bounds.


Status w.r.t. locks (some issues w.r.t. scheduling, etc. not included below)

Scheduler private

Performance Characterization

We'll undoubtedly want to have benchmark tests to characterize the performance of a platform.

  1. How long to transfer a cache line between cores.
  2. Effect of snoop on cross core performance.
    • Performance change in a core when other core is sleeping vs doing non-conflicting cache reads/writes, and memory reads/writes.
  3. Cross core effect of TLB invalidate
  4. Cross core effect of ICACHE invalidate

Tech meeting minutes

12-10-2014 notes

--------------------
Aaron's airplane notes for how to implement time bounded multi-core operations.

A core has the lock when both core_waiting[core] is true and locking_core == core.

lock:
  bool has_waiters;
  bool core_waiting[n_cores];  // True if the core is waiting (or holds the lock).
  core_t locking_core;         // The core whose turn it is to take the lock.

// must analyze inter-core cache behavior.
lock(l, core, n_cores)
  l.core_waiting[core] = true;  // cw1:
  while true
    c = lwarx(l.locking_core) // lc1:
    if c == core then

      // At this point, another core could have observed both
      // locking_core==core at lc1 and core_waiting==false at cw2
      // and be attempting to increment locking_core at lc3.  Writing
      // the locking_core at lc2 ensures that either their attempted
      // update of locking_core at lc3 will fail, or ours at lc2 will.

      success = stwx(l.locking_core = c) // lc2: stwx returns true if write happened
      // AL: need justification of why this is true
      if success then break out of while loop // (we've got the lock)
    else
      while(l.core_waiting[c]) // cw2:
	/* read loop prevents spurious bus writes, could be sleepy loop */;
      c2 = increment_modulo(c, n_cores) // wrap around at n_cores

      // The following stwx will either succeed or fail.  If it
      // succeeds, then the local core is making progress.  A failing
      // stwx here, could be due to either the write at lc2 or lc3
      // from a different core(AL: not sure about reservation granule
      // interference).
      stwx(l.locking_core = c2) // lc3:
  return

unlock(l, core, n_cores)
  l.locking_core = increment_modulo(l.locking_core, n_cores)
  l.core_waiting[core] = false;
  
There is no "isLocked" API.

What if we took a different approach: you can only lock locks that the
current process has access to, so any interference is AIB; then just
use an indeterminate wait loop.  We can't bound critical times that way.

07-24-2015 Notes

Notes from AL and RLR conversation on scheduler private lock not in a critical.

We think we can replace the generic scheduler public lock with a
doubly linked list "lock free queue add" as long as we can ensure that
the set of possible enqueues are disjoint with the dequeues, and that
only one dequeue at a time would occur.  Since the dequeue would be
done at window boundaries, this condition should be trivial to
satisfy.  The enqueue operation is still not guaranteed to be fair,
but at this time neither is the "generic lock enqueue" operation.

We could eliminate the fairness issue by keeping an array of #cores
thread lists and have each core enqueue into its own element, but
then at window boundaries we'd have to visit #cores lists.  Since
object creation is rare, this seems like a bad trade.


Assume the Scheduler has public lists and a "copy" list.  The copy list
is only written from within a critical section during a window change,
and is only read during a critical section within a scheduler private
lock.  I.e., the "copy" is a write buffer to eliminate interference
between writers to the public list and the scheduler's reading of the
list.  A similar approach could be achieved by using a ping/pong buffer.


   public list: pl
   private list: pl_copy

  at window boundary, within synchronized window boundary activity: move pl to pl_copy.

  pl_copy is merged to scheduler's private lists;
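A minimal C11 sketch of this idea, using a singly linked push (the doubly linked variant mentioned above would be similar in spirit; the node and list names are illustrative): any core can enqueue lock free at any time, and "move pl to pl_copy" at the window boundary is a single atomic exchange.

 #include <stdatomic.h>
 #include <stddef.h>

 typedef struct thread_node {
     struct thread_node *next;
     // ... payload ...
 } thread_node_t;

 static _Atomic(thread_node_t *) pl;   // public list head

 // Any core may enqueue at any time: a lock-free push.
 static void pl_enqueue(thread_node_t *n)
 {
     thread_node_t *head = atomic_load(&pl);
     do {
         n->next = head;
     } while (!atomic_compare_exchange_weak(&pl, &head, n));
 }

 // Done only within the synchronized window boundary activity:
 // moves the entire public list to the caller (pl_copy) in one step.
 static thread_node_t *pl_take_all(void)
 {
     return atomic_exchange(&pl, NULL);
 }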

Assume there is a scheduler lock region defined by something like:

  lock_sched_private();
  ...
  unlock_sched_private();

Then unlock would be implemented as follows, and something similar
would have to be done for return to user mode.

unlock_sched_private()
{
  if pending_stuff_to_do, then
    do it, and don't unlock_sched_private until you do.
  release sched private lock;
}


The existing HW Interrupts:
  platform
  power loss
  frame sync
  window start/IWI

When a hw interrupt happens, treat it like a LLKMI, where there is an
"immediate response" (below) and a deferred response ("do it" from above):
  1. Do minimal activity to respond to HW, only accessing data
     structures protected by critical sections, for power loss this
     might entail a reboot.
  2. Scribble off what needs to be done into a data structure that is
     protected by a critical section.
  3. Cause unlock_sched_private() to "do it".

Issues and analysis

The HW interrupt arrival will reorder the deferred response.
A. Platform interrupt deferred response will reorder by priority but maintain order within
   same priority threads.
B. Window start must not do a scheduleOtherThread() since either or both
   of the "from" and "to" schedulers may be running.  We'll have to save
   machine state (registers, interrupt mask status, timer value) then
   do a low level thread context switch to let the thread continue
   running until it invokes the deferred action.
C. TBD

A set of time correlated events where there is a dependency on the
relationship between them must happen in the SAME critical.  E.g.,

   now = timerTimeRemaining()
   ...
   timerWrite(now - timerOffTime)

Make Kernel not Globally Virtually Visible

The Problem

Because the kernel is globally virtually visible, any thread can perturb the cache state of the kernel's memory pool, and thereby any other thread, simply by reading the kernel's memory.

The Question

Is this really a problem? With the exception of kernel objects the thread has access to, and the thread's kernel stack, access will be limited to read only. The kernel object and stack access is also read only from user mode, but they are written by the kernel when the thread makes API calls. This limited interference may be tolerable.

Ideas

  • Break the kernel code up into multiple segments. The API and a small amount of data would be user readable, the majority would be user access none. This approach reduces the interference but there is still user readable code/data that is accessible. Plus API calls can impact stack/kernel object data and the execution of instructions will also have a cache effect.
  • Break the kernel up into 2 SO's: a user mode SO and a kernel mode SO (not sure about code that is currently shared by user and kernel mode; most of it would be forced to be kernel mode because the data would only be accessible in kernel mode, but code like the image code may wind up duplicated in each library). The kernel mode library would be loaded as the kernel is today (across all processes). The user mode library would be loaded independently into each process. This allows an application to be forced to put the user mode portion in a separate pool. API calls can still impact stack/kernel object data, and the execution of instructions will also have a cache effect.
  • Allow a process to specify a specific pool for its copy of kernel code and read only data. This allows users to prevent other processes from impacting their cached copy of kernel code. Memory reads and API calls can still impact stack/kernel object data. To prevent memory reads, kernel stacks could be made user access none without too much effort; kernel data would be a little more difficult. That would leave the impact from API calls.
  • Do nothing. I like this, it is already done.

All of the above have the problem of API calls impacting stack/kernel object data. To fix this we could move stacks and kernel objects into user specifiable memory pools. This would require pool specific kernel interface object quotas; managing these would be a configuration nightmare for the users, so we would likely want to combine objects of similar size into a single quota (something we should consider even if we don't implement this). Given the current object sizes, that would likely result in 2 quota types: processes/threads, and other. This approach could be applied to all of the above ideas. Note: even with this, there are some kernel data items that would have to live in a common pool.

What about PRLs? Perhaps a PRL could be loaded as user access none at startup and when a process attaches to an associated resource the PRL can be mapped user readable.

Random thought: Only make the kernel virtually visible to processes with interprocessInterrogationServices capability. This would fix any APIs that return pointers to status like info.

Single system blocking time or per scheduler?

Two concerns:

  1. Which schedulers can lock the mutex?
  2. Can the mutex be locked cross-scheduler?

per scheduler

  • Registry would have to specify mutex's scheduler(s)
  • increases available utilization

system

  • less available utilization
  • simpler to implement?

Crittime kernel design ideas

  1. User would like a way to log outlier worst case critical sections to try and spot a pattern
    • Idea: allow the user to configure a threshold in the registry, and when a crit larger than the threshold is observed, log it to the system event log (see the sketch after this list)
      • May need more than just a threshold; may also need a type or specific critical section identifier.
  2. Instead of capturing all the ctx time data in the big table only capture the n (where n is small) largest different critical sections observed. This would reduce the size of the data to a manageable size such that each core could track its own data.
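A minimal C sketch of the threshold idea from item 1 (every name here is hypothetical; the registry-configured variable and the event log call are assumptions):

 #include <stdint.h>

 static uint32_t critThresholdTicks;   // configured via the registry; 0 = disabled

 extern void systemEventLog(int critId, uint32_t lengthTicks);  // assumed API

 // Called on critical section exit with timestamps captured at entry/exit.
 static void critExitHook(int critId, uint32_t startTicks, uint32_t endTicks)
 {
     uint32_t len = endTicks - startTicks;
     if (critThresholdTicks != 0 && len > critThresholdTicks)
         systemEventLog(critId, len);   // log only the outliers
 }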

Non-coherent memory

Summary: Establishing non-coherent memory on some architectures is possible, but it is not clear that the benefits would be consistent across an architecture, so this approach is currently considered a last resort. RLR: I don't think this is sufficient because (at least) on ppc the user can invalidate cache (on all cores) for any address that is readable (e.g. code space).

This proposal does not apply to x86 because, as far as I can tell, x86 does not provide control of cached memory coherency at the page level (actually I am not sure if there is control at any level).

For ARM I believe the shareable attribute is how the coherency of a page is specified.

For PPC the coherency of a page is specified by the M (memory coherence required) bit.

Proposal

Change the page table format to add a non-coherent write back cached type that would be applied in one of the following 2 ways:

  1. Change VAS::shareFrom(...FrozenVAS...) to use this caching type when cache is enabled. VAS::thaw() would maintain this attribute if userAccess is unchanged.
  2. Change PTEntry to set this memory type if the page is frozen and write back cached. This approach would also cause processes that are executing from RAM to be non-coherent.
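A toy sketch of the attribute selection in proposal 2 (the enum and names are illustrative, not the kernel's real PTEntry interface):

 #include <stdbool.h>

 enum mem_type { COHERENT_WB, NONCOHERENT_WB, UNCACHED };

 static enum mem_type pteMemType(bool frozen, bool writeBackCached)
 {
     if (frozen && writeBackCached)
         return NONCOHERENT_WB;  // frozen pages skip cross-core snoop traffic
     return writeBackCached ? COHERENT_WB : UNCACHED;
 }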

Concerns

  1. I am not sure how writeProcessMemory() would work with this.

Other Harebrained Ideas

  1. Have a registry setting that specifies a process is using non-coherent memory and, when page tables for cached memory are created within that process, use the non-coherent form.
    • The clone of the SVAS would have to use the coherent form for the data pages
    • I am not sure how readProcessMemory() would work with this.

Notes from February 2016 design meetings

Open Issues

  • kernel virtually visible
    • short vs long term view
  • multi-level interrupts: In or out?
    • buvec impact.
    • assists with broadcast IWI.
  • startChargingCPUTime buvec access hack
  • crittime kernel design.
  • misc locks
    • vas uses CCC, treated as misc lock.
    • ram quota?
    • system
      • process and memory object creation/deletion?
      • process name alias creation
      • others?
    • memory pools
    • others?
  • Two kinds of quotas: "can use", vs "can give to subprocesses"
    • Reduces the MOM space problem; this is the solution described in PCR 1028#c10
    • Minimizes interference:
      • Memory pool allocation,
      • Sub-process creation simultaneous with object creation.
    • Is this something we need to do now, or can it be deferred?
    • Does it help remove need for misc lock?

Done "For now" issues

  • reservation granule size vs coherency granule size, vs cache line size
    • What is a reservation granule size
      • EREF says implementation defined, searching manuals finds:
        • T1024, T1040
          • All references to "granule" are "coherency granule".
          • No reference to "reservation"
        • T4240
          • "granule" refers to: coherency, DMA, DDR controller
        • e500 core rm
          • says reservation granule size=coherence granule size.
        • e6500
          • says 64 byte (no equivalence assurance with cache line size)
    • Need to have reservation granule size somehow represented in kernel data structure layout decisions.


  • DataMemberTemplate for constructor with arguments.
    • RLR exception queue. 3 args,
    • RLF case scheduling history has something similar.
      • constructor({ a, b, c});

Agreed Upon

accessConst

portal management
  portals moved to thread specific at process creation time
  non-AIB for process creation from system quota.
  portals being thread local limits # of threads.
    new exception code may not require portals.
    Update portal analysis after finishing exception code.

On multi-core, race variables must be permitted from critical
sections, e.g., to handle raiseExceptionToThread()

Do we need a porting guide, highlighting changes, e.g., new multi-core
only status codes from various APIs?  How is that different from
release notes?

schedule-before.
  2 or 2a.  A thread created in a different scheduler starts immediately,
  might run out of sequence the first time.

hack fake exit critical aka fakeCriticalExit()

Status of SMP
  SMP is out, but problematic areas are to be noted.

Create new PCRs for follow on MC work.

exception raising


----------------------------------------
ram allocation/quota synchronization
  Currently using cross core critical.

AL
> > I'm less sure how to deal with memory allocation, but we could 
> > probably get away with making memory alloc/dealloc operations 
> > subject to less well bounded interference.

RAM managed to pools
  alloc/dealloc from a pool doesn't impact other pools.

P1: use pools directly for alloc/dealloc.

P2: At process creation, move RAM from pools into process local pools, alloc/dealloc uses process pool.

Can minimize the difference between P1 and P2 by having integ tool
create process local pools, that works just fine for most use cases,
although something like the following is still problematic:
     process a with subprocesses b, c, d
     b is mutually exclusive with c&d
     quota(b) == quota(c)+quota(d)

Assume implementation does not need to use "locks" because
alloc/dealloc operation is small enough that primitive operations
would be "fair enough".  Need analysis and justification.


RLR
> It seems to me that cross core interference to allocate/deallocate 
> within a given memory pool should be acceptable (user can avoid it 
> by careful selection of pool assignment). I don't think we want to 
> add another type of lock for the effort so that leaves compare and 
> swap algorithms which might be fun to bound worst case execution 
> when needed to be performed in a critical (I did a quick check and I
> only saw a couple of places this will be a problem).
----------------------------------------

Need to add task to ensure "only blocking is AIB blocking"
  New requirements/code checklist item.
  (soon) Analyze existing APIs, identify problematic APIs
  For each problematic API
    - (initial) update docs as appropriate.
    - (full func) update code as appropriate.
  (soon) Add DDD/UG description of AIB

2017-07 Face to Face Meeting

Topics

  1. #Kernel globally virtually invisible
    1. Current idea is to combine with allocating all a process' objects from the process' memory pool.
    2. Random thoughts:
      • map the kernel virtually visible in "privileged" processes (e.g. statmo)
  2. Can we #Simplify secure by utilizing more UGO access control, especially for memory objects
  3. #Atomic PTE update
  4. #UFT ram page alloc/dealloc
  5. #Secure/release protocol
    • Can we simplify secure by delaying delete, e.g., ensure no delete can happen in the same critical as a secure.
  6. #Remove the scheduler public lock?
  7. #SOP event rework
    • Perhaps put the calculations in the registry.

Notes from meeting

Kernel globally virtually invisible

This really has two parts, minimizing the portion of the kernel visible (readable) by applications, and also minimizing the effects of references to kernel objects.

Limiting Kernel Visibility

  1. Current idea is to combine with allocating all a process' objects from the process' memory pool.
  2. Random thoughts:
    • map the kernel virtually visible in "privileged" processes (e.g. statmo). RLR: Not sure this is a good idea because on ARM the new page table format does not allow user mode and kernel mode to have different memory access. It is not clear how much longer ARM will support the legacy page table format in 32-bit mode; what is clear is that A64 does not support the legacy page table format.
  3. flag for "pal data is user readable"
  4. flag per PRL for "PRL data is user readable"
    1. Maybe flag is a "future feature".
  5. Shortcut option:
    1. make current non-trapping APIs trapping
    2. kmalloc -> no user access
  6. interference between const/RO data has to be ok.
    1. We have no recourse.
    2. Even incoherent settings don't help.
  7. Is interference between const/RO data and writable data ok?
    1. Now there is "fighting".
  8. Is interference between writable data and writable data ok?
    1. Now there is more unbounded behavior, potentially even livelock.

In the above "Between" means data on the same line (trashing), not cache line eviction.

Perhaps an option of copying the kernel (and dependents) to a local pool.

Reduce core-to-core interference due to kernel objects

Process specific objects contained within process specific region of kernel VM.

  1. Each process is assigned a virtual address range in kernel space.
    1. All constituent objects in the process are created in that VM range
    2. Object handles become <process ID, object ID within process, generation count>
      • Not sure if we have enough bits to keep object type.

Issues:

Do process object instances get C++ deleted?

  1. Assume yes. Then who deletes RAM?
    1. Idle
    2. The last thread in the process
    3. The next thread that creates (a|the) process.
  2. Assume no.
    1. Would need to create some shell of the processes using the appropriate pool
    2. Could we auto create the shell of all processes?
      1. We'd have to know the pool for each shell.

Terms:

  1. process object manager (POM): the process KIO and all the contained KIOs.
  2. POM VM : The virtual memory that spans the POM

Can we restrict process objects to be only visible to the current process?

  1. Option 1:
    1. process deletion returns POM RAM
    2. Make the POM VM be chapter sized&aligned.
    3. Use one of the PDT levels to restrict/enable user mode read access to its POM, leaving all other process' POMs unreadable.
      1. The pdt for the current process would be
  2. Option 2:
    1. Pre-create process POM based on <process template, memory pool> tuple
    2. KIO pools are moved from system to process specific.
    3. At system startup allocate RAM for all objects in the POM
      1. Number of objects of each type would be worst case for the tuple
    4. Also create object pools
    5. POM RAM is never deleted.
      1. Theoretically could delete ram for kernel thread stacks.
    6. Most KIO quota checks are no longer needed.
    7. Most quotas are not "transferred to a child". Exceptions are:
      1. <Process, memory pool>, RAM, CPU, SMO(probably?)
      2. previous "process quota" now essentially becomes quota for the "tuple"
    8. POM VM must be page aligned.
  3. Option 2a (optimization):
    1. Compute worst case RAM for worst case POM based on <process template, memory pool> tuple that might use the POM.

Random thoughts:

Would a "process deactivate" be better than process create/delete? Deactivate would stop the execution of threads, but not delete them.

Atomic PTE update

Should review the ARM page table code to ensure that the "4 entries for 1k pages" case is appropriately consistent. Need commentary saying that page table updates may not be visible to other cores until a window boundary.

UFT ram page alloc/dealloc

Keep three free lists: UFT, CFT, and unrestricted usage.

CFT pages can NOT be reused until a global TLB flush is achieved, e.g., at a window boundary. Such pages are paging structures, and perhaps that is all. UFT pages CAN be reused, but must only be user data pages. Any updates via stray TLB references would be restricted to the "user". Moving a set of UFT pages to the "unrestricted" list would have to be a constant time operation and would happen anytime after the global TLB flush, e.g., when a new RAM page is allocated. Process and MO deletion might be able to be structured so that the CFT list isn't really needed.

  • Rationale: These are the only times we deallocate paging structures, but we're not sure if paging structures are the only source of CFT pages.

On page alloc or dealloc:

  • pages (may) get moved from UFT to unrestricted.
  • alloc and dealloc must get parameter for: can use/is UFT page
  • Attempting to allocate a non-UFT page when only UFT pages are available would presumably require a WUNP (or wait until next window).

Issues:

  1. still some sort of interference in managing the lists, this is a data structure issue, perhaps solvable by making something process specific that is now system level, but we don't know how.
  2. perhaps allocating pages N at a time would simplify/minimize inter-core interference.
  3. The current strategy of allocating/deallocating from a pool, where the pool ram page "free list" uses cache lines shared between pools, needs to be changed so that pools have their own cache lines for their respective lists.

SOP event rework

The proposal is to add "slowest period starting" as a property of each window in the registry, then recode the SOP code to take advantage, which should greatly simplify the task.

  1. System would increment period counts
  2. Each scheduler would merge the SOP lists for each started period.
  3. Why is there ISRThreadsWaitingPeriodic and periodic wait list, why not just a PriorityListOfThreads?

Detailed implementation notes for POM

During development of the MC kernel we decided to push on the POM strategy for managing kernel interface objects.

Why

  • Process delete was very messy due to object transfer.
  • Names are allocated from system, so you need system lock during object creation.
  • removes MOM space limits.
  • Better inter-process memory bus isolation (cache partitioning).
  • Simplify secure/release/bind/unbind protocols.
  • Some APIs would need to change again when we solve mom space/cache partitioning (status monitoring and test).
  • Easy process local thread index.
    • Simplify TLS
    • Portal Index
      • Portals likely have to become process local.
  • Separates process extent object quota from child process' object quota.

Requirements level changes

  • Each constituent object has new failure condition for "insufficient privilege to secure handle"
  • Process has many changes.
    • Process extent quota failures are different.
    • New Heap/pool creation failures
  • Secure/release/bind/unbind is quite different.

This takes advantage of the fact that Processes and MOs won't go away in the same period where they are unbound. It also uses compareAndSwap() for atomicity to eliminate fences on secure and release.

class SecureToken (fit within a 32 bit unsigned, aligned to granule size)
   int32_t count:31;
   bit beingDeleted;

constant unboundKIOTokens which has #cores worth of {0, false} SecureToken

KIO objects have two new fields
  secure_counts_object : ptr to a KIO containing the SecureTokens to use
    - AL: Instead could this be ptr to SecureTokens[#cores]?
  ?? : SecureTokens[#cores]

All KIO objects are created with secure_counts_object pointing to unboundKIOTokens

bind
    handle = the valid handle. // AL: Was *after* the following in our orig proposal
    if instances is process or mo
        system::{process/mo}_objects[index] = this;
    else
        secure_counts_object = this
    storeStoreFence // needed because quiescent accesses are used during object creation.

unbind
    if process or mo
        system::objects[index] = some global constant unbound object;
    else
       _handle.invalidate()
       secure_counts_object = unboundKIOTokens.
    storeLoadFence() // above stores visible before subsequent reads are visible.
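Plausible C11 equivalents for the two fences used by bind and unbind above (an assumption; the kernel presumably has architecture-specific implementations):

 #include <stdatomic.h>

 static inline void storeStoreFence(void)
 {
     atomic_thread_fence(memory_order_release);  // orders prior stores before later stores
 }

 static inline void storeLoadFence(void)
 {
     atomic_thread_fence(memory_order_seq_cst);  // full barrier: stores visible before later loads
 }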

// This depends on process and MO being stable during an entire critical section.
handleToKernelPools(handle)
    switch handle.systemScopeID
      process or object in process:
         kernelpools = system::processes[handle.process_index].kpools;
      memory_object:
         kernelpools = system::kpools[handle.object_index];

// Consider merging SecureToken with the kio lock.
secure(handle)
    // Ordering of unbound check and SecureToken is not required since
    // this is an optimization for limiting the waitForAllReleased() loop count.
    pool = handleToKernelPools(handle)
    UGO access rights check based on process or MO data;
    obj = convert handle to object ptr in the appropriate pool.
    if obj == null return NULL;
    foo = &obj->secure_counts_object[core];
    orig_count = foo->count
    // If this is the same window the object was created in, the handle is
    // only valid if the current core is the creating core.
    // Currently the window ID is a 32-bit number, so if each window were 1ms
    // this would roll over; may need to make the window ID a 64-bit number.
    if (obj->creationWindowID == currentWindowID()) && (obj->creationCore != core)
      return NULL;
    // The only way this can fail is if the beingDeleted flag was set.
    if compareAndSwap(foo, {orig_count, false}, {orig_count+1, false})
      return obj;
    return NULL;


release()
    // compareAndSwap() not needed here since only current core can
    // modify a secure count > 0.
    this->secure_counts_object[core].count--;

// precondition: obj is unbound
waitForAllReleased()
   core = 0
   while core < #cores
      if compareAndSwap(&secure_counts_object[core], {0, false}, {0, true})
        core++
      else
        waitUntilNextPeriod()
   // Ensure all test and set operations above finish before subsequent reads
   storeLoadFence() // or loadLoadFence()...
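A minimal C11 sketch of the packed SecureToken and the three operations above (this assumes per-core tokens as described; the names are illustrative, not the kernel's actual interface):

 #include <stdatomic.h>
 #include <stdbool.h>
 #include <stdint.h>

 // 31-bit count plus the beingDeleted bit in one 32-bit word.
 #define BEING_DELETED (UINT32_C(1) << 31)

 typedef _Atomic uint32_t secure_token_t;   // one per core, per object

 // secure(): bump this core's count; fails only if beingDeleted was set,
 // since no other core modifies this core's count.
 static bool tokenSecure(secure_token_t *t)
 {
     uint32_t v = atomic_load(t);
     if (v & BEING_DELETED)
         return false;
     return atomic_compare_exchange_strong(t, &v, v + 1);
 }

 // release(): only the owning core modifies a count > 0, so no CAS needed.
 static void tokenRelease(secure_token_t *t)
 {
     atomic_fetch_sub(t, 1);
 }

 // One step of waitForAllReleased(): claim a fully released token.
 static bool tokenClaimIfReleased(secure_token_t *t)
 {
     uint32_t expected = 0;   // count == 0, beingDeleted clear
     return atomic_compare_exchange_strong(t, &expected, BEING_DELETED);
 }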

Throttling

Throttling Design

Proposed Design

Thermal Interrupt Support

Current x86 thermal monitor kernel requirements:
  • Thermal Management MSRs
    • Footnote di states It is only permissible to enable thermal management technology that will halt/reset the processor. It is not permissible to enable any thermal management technology that would cause instruction execution time to vary.
  • IA32_MISC_ENABLE.TM2 Enable
    • Footnote di above, and dj which states If TM2 is set to something other than 0, the PAL must halt on the receipt of a thermal management interrupt.
  • A couple of other bits for speed step, opportunistic processor performance, etc.
    • footnotes indicating these cannot be used, affect time partitioning, etc., and specifying the required value
  • Power and Thermal Management
    • Footnote eq states Power and Thermal management that causes the execution time to vary (i.e. Enhanced Intel SpeedStep Technology, Opportunistic Processor Performance Operation, Hardware-Controlled Performance States (HWP), Hardware Duty Cycling, and Automatic and adaptive thermal monitoring) can not be enabled. It is permissible to use Power and Thermal management that causes a Deos™ warm start or Deos™ cold start. It is also permissible to put the processor to sleep as long as the wake time does not add to interrupt latency (perhaps C1).
Kernel Proposal
  • Combine footnotes di and eq into one footnote based on eq, and add something to it about setting slack state being permissible.
    • I think that is all that is needed for the kernel (no code update), and should allow the IA32_[PACKAGE_]THERM_STATUS and IA32_[PACKAGE_]THERM_INTERRUPT registers to be used by the PAL to set threshold 1/2 values and interrupt enables.
interruptAcknowledge issues
  • Is this named properly?
  • Is it always called in relation to an interrupt, or is it also called in the finish early case?
  • Should it have a window start/index indicator?
  • Do we want to document any use cases for throttling (separate white paper, etc)? Are the following addressed:
    • What is the proper way for PAL to get control at start of window and know window index for looking up configuration data (new memory budgets) for window on each core?
    • Consider finish early where no interrupt is generated.
    • Consider start of interrupt window and resumption of interrupted window.

Notes for Geekfest

2018

  • Old Kernel news:
    • Partner with OAR for POSIX support instead of implementing in the kernel.
  • Old MC Kernel news (a refresher for those who forgot):
    • Schedulers:
      • Every scheduler schedules threads on one and only one core
      • Every thread executes under one and only one scheduler
      • The above implies no SMP
    • Window changes:
      • Window activated threads are no more.
      • Window starts are across all cores
      • Every window is assigned one and only one scheduler per core
      • The same scheduler may be assigned to multiple windows
      • May Consume/generate slack
      • May be activated by an interrupt (see below)
    • Core specific logs
    • Fine grain locks, not disabling interrupts
    • More precise exception terminology. Queued exceptions
    • BIT is essentially useless in multi-core.
    • Envelopes are now copied, not "swapped"
    • No more PAL idle function.
    • Considering eliminating videoMemoryAddress
  • MC Kernel user visible changes of note:
    • At the last geekfest we had multiple critical levels. This feature was added when we thought POSIX and 653 would be baked into the kernel. That is no longer the plan, so the feature has been removed to simplify the kernel and PAL.
    • Masking of raised exceptions
    • Changed to C11 types
      • The API should now survive a port to 64-bit
    • Exception handlers can be configured to use the thread/process/system hierarchy of the legacy design (exception protocol v1) or to inherit from the creating thread (protocol v2). Processes with threads in more than one scheduler must use the inherit approach (v2).
    • ARM enters the kernel with paging enabled. Why did we do it: the kernel assumed RAM started at physical address 0, which is not the case on all platforms.
    • For the most part, kernel data is no longer mapped user readable. As documented in recent e-mails, this change was not backward compatible, so why did we do it:
      • To reduce cross core cache interference. A user mode app reading kernel data on one core could have substantial impact to performance on another core writing to the same data because of "fighting" over ownership of the data in cache.
      • In preparation for 64-bit physical address support on ARM which won't allow the same virtual address to be mapped as kernel read/write and user read only using a single set of page tables.
      • Security.
  • MC Kernel internal changes of note:
    • Kernel trap optimized to minimize parameter re-ordering and register spills. Kernel code shrunk in size.
    • Cache aligned kernel objects to reduce cross core interference. Kernel objects have grown in size.
    • Most APIs are now trapping. This may have a noticeable impact to performance.
  • Possible changes for this cert:
    • Remove interrupt window support
    • Finalize PAL interface changes (what to do with interruptAcknowledge())
    • Proper bonuses/crit time accounting
    • Allow schedulers to be activated slower than fastest rate.
    • Improve ram [de]allocation performance (UFTM)
    • Support for legacy single core processors (uni-core kernel is dead! Long live multi-core!)
    • what about BIT on single core
    • ??
  • Possible changes for a future cert:
    • Support data space that is not user readable in PALs and PRLs
    • When the same scheduler is in two consecutive windows, skip window scheduling on that core if it is not an SOP event.
    • Replace grant access calls with static UGO. Is this needed for first cert?
    • kernel objects in memory pools.
    • ??
  • Tools in the toolbox others may wish to use:
    • weak functions: very useful for creating a component that is compatible with both uni-core and multi-core kernels (see gdbserver, and the sketch after this list).
    • __builtin_unreachable(): useful to stop the compiler from injecting unreachable branches like loop entry checks (see ASSERT_UNREACHABLE in comp_sup.h)
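A tiny illustration of the weak function trick (the function name here is hypothetical): the weak stub links in when no strong definition is present, so one component binary can serve both kernels.

 // Uni-core builds get this stub; a multi-core-aware library can
 // provide a strong definition that overrides it at link time.
 __attribute__((weak)) unsigned kernelCoreCount(void)
 {
     return 1;
 }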

2019

How to reduce kernel verf costs:

  1. https://deos.ddci.com/mailman/private/kernel-maintainers/2018-December/001339.html
  2. Move file system modification into DAL-E library.
  3. Better understand schedule and kernel features for upcoming programs.
    1. TLS requires thread index and image index. How will image code do that?
      • build something into image code.
      • define some sort of "hook" that image code would call for each image loaded.
    2. What is in review scope/how to limit review scope?
      • multi-core
      • 64-bit vs 32-bit
  4. BSP dev-kit broken by kernel 9, which is hosing the BSP team. When can we go stable with 9.0?

What are the kernel features needed by Desert_Eagle_Program/Durants3_Program that Harrys_Program does not need:

  1. Warmstart? : Need by Sept*
  2. Framesync? : Need by Sept*
  3. Multiple Filesystems : Need by Sept*
    • Special binary : Need by Sept*
    • DesertEagle would potentially have 16 KFS when considering memory layout or other constraints.
  4. TLS (library index?) : Need by Sept*
    • TLS means threads for native Deos, but "653 processes" for ARINC 653. This likely means the loader/runtime will either need to paper over the differences, or we'll need to provide different libraries for "native" vs "653" processes/partitions.
  5. Multiple module schedules : Need by April
  6. cumulative execution time for RMA threads : Need by Sept*
  7. Other?
    1. memory barriers
    2. compare and swap

Notes: "*" means progress updates needed March/April and June/July

Harrys kernel deferrable development tasks:

  1. ??

Harrys kernel deferrable other tasks

  1. No x86 or ppc verf (review, or test)

Misc

  1. and ARM fp abi.
  2. ansi adding dart as a dependency complicates
    • building components and tests (e.g., might need to add -rpath-link),
    • test loading omits dart (no .cd.xml support)