Thread Local Store
Description
The Durants3_Program requires the implementation of Thread Local Store. Aaron asked Johan for a brief rundown on thread local store (TLS), specifically as it relates to kernel needs. The following is a refinement of that query and some summarization. Johan's answer to Aaron's ill-formed request appears at the end.
Kernel side
- (1) What does the ELF loader have to do specially for TLS?
- (2) What does the kernel have to do on thread create/delete?
- (3) Anything else?
Aaron's summary:
For (1), the ELF loader must:
- Track the ELF images (.exe/.so files) that use TLS, assigning each a unique index.
- Implement a few new relocation types.
- Ensure that specific registers are preserved (e.g., r2 on ppc).
- On x86, probably have some specific arrangement for at least one segment register.
- Using some specific GCC switches, this can be replaced (at some runtime cost) by another mechanism that uses function calls to get TLS values.
Perhaps we could place some restrictions on TLS support, such as not supporting library unloading (freeLibraryDeos()).
For (2) the answer on the kernel side is still unclear.
Runtime side
GCC related TLS options
-fextern-tls-init
Support dynamic initialization of thread-local variables in a different translation unit.
Comment: It does not seem to have any effect on a simple example.
-mtls-dialect= [ gnu | gnu2 ]
Applies to arm and x86, not to ppc.
Comment: Very little difference between gnu and gnu2.
-ftls-model= (arm,x86) [global-dynamic | local-dynamic | initial-exec | local-exec]
Set the default thread-local storage code generation model.
Comment: Very little difference between the models.
-mtls-markers (ppc)
Mark __tls_get_addr calls with argument info.
Comment: This appears only to happen with -fPIC.
-mtls-size (ppc)
Specify bit size of immediate TLS offsets.
--no-tls-optimize (ppc, linker option)
Don't try to optimize TLS accesses.
--no-tls-get-addr-optimize (ppc, linker option)
Don't use a special __tls_get_addr call.
--param (ppc,x86)
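tm-max-aggregate-size
Size in bytes after which thread-local aggregates should be instrumented with the logging functions instead of save/restore pairs.
Comment: Not sure what this really means. The "tm-" prefix suggests it belongs to GCC's transactional memory support (-fgnu-tm) rather than to TLS code generation.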
-mtls-direct-seg-refs (x86)
Use direct references against %gs when accessing tls data.
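As a possible way to compare the -ftls-model= settings listed above, the following minimal sketch declares one TLS variable that uses the command-line default and one that overrides the model with GCC's tls_model attribute. The variable and function names are hypothetical and exist only for illustration. Compiling with and without -fPIC and disassembling should show which access sequence each variable gets.

```cpp
// Hypothetical counters, used only to illustrate per-variable TLS models.

// Uses whatever model -ftls-model= (or the -fPIC default) selects.
__thread int default_counter = 0;

// Overrides the model for this one variable via GCC's tls_model attribute.
__thread int ie_counter __attribute__((tls_model("initial-exec"))) = 0;

// Each function touches exactly one variable, so its disassembly
// isolates the access sequence generated for that variable's model.
int bump_default() { return ++default_counter; }
int bump_initial_exec() { return ++ie_counter; }
```

Something like g++ -O2 -fPIC -S on this file, repeated with different -ftls-model= values, gives a side-by-side view of the generated accesses.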
TLS specific relocations
TLS relocations for ARM when -fPIC:
R_ARM_TLS_GD32 R_ARM_TLS_GOTDESC R_ARM_TLS_LE32
TLS relocations for ARM when no -fPIC:
R_ARM_TLS_LE32
TLS relocations for PPC when -fPIC:
R_PPC_GOT_TLSGD16 R_PPC_GOT_TLSLD16 R_PPC_TLSGD R_PPC_TLSLD
TLS relocations for PPC when no -fPIC:
R_PPC_TPREL16_LO R_PPC_TPREL16_HA R_PPC_DTPREL16_LO R_PPC_DTPREL16_HA
TLS relocations for X86 when -fPIC:
R_386_TLS_GD R_386_TLS_GOTDESC R_386_TLS_LE
TLS relocations for X86 when no -fPIC:
R_386_TLS_LE
Disassemblies of the same source compiled with various combinations of these options are available on request: send a mail to trasmussen@ddci.com.
Emails from Johan
Edited for brevity and such by Aaron:
Aaron asked about TLS and distribution of tasks between compiler and loader. A detailed ELF description was published back in 2001. The latest version I have found is https://akkadia.org/drepper/tls.pdf. I haven't yet looked for the answer to Aaron's question but just wanted to get a good reference to you.
See the 2nd paragraph from the bottom of page 1, last sentence: "The only real limitation is that in C++ programs thread-local variables must not require a static constructor".
Spoke too soon. The TLS paper is correct that GCC's original and proprietary extension __thread did not permit a dynamic initialization expression. However, at least C++11 permits this.
The good news is that G++ 7.3.0 implements it. When compiling C code, even with -std=c11, this is not accepted by GCC 7.3.0. The C++ implementation does not appear to require any particular loader handling (and could just as well work for C). Instead it introduces a separate TLS guard boolean that probably is tested before any access to the user-defined TLS variable to branch around a copy of the initialization expression.
So the user perspective is that dynamic initialization expressions are possible for TLS variables in C++. The nasty bit is that the compiler injects extra conditions that will often be impossible to cover. Also calculations of the necessary TLS resources will need to take the extra guards into account.
The kernel and loader perspective on TLS variables with dynamic initialization expressions should remain blissful ignorance.
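To make the guard mechanism described above concrete, here is a minimal sketch. The first variable uses a C++11 thread_local with a dynamic initializer. The second part is a hand-written approximation of what the compiler appears to inject: a separate TLS guard boolean tested before the access. All names (make_seed, manual_guard, and so on) are hypothetical, and the code GCC actually generates may differ in detail.

```cpp
#include <atomic>

// Counts how often the initializer runs (hypothetical helper).
std::atomic<int> init_calls{0};

// Stands in for any non-constant initialization expression.
int make_seed() { return 100 + init_calls.fetch_add(1); }

// Dynamic initialization of a TLS variable: accepted as C++11.
thread_local int seed = make_seed();

// Hand-written approximation of the injected guard. Both variables are
// plain zero-initialized TLS, so the loader needs no special handling.
thread_local bool manual_guard = false;
thread_local int manual_seed;

int& manual_seed_ref() {
    if (!manual_guard) {            // extra branch on every access
        manual_guard = true;
        manual_seed = make_seed();  // copy of the initialization expression
    }
    return manual_seed;
}
```

The branch in manual_seed_ref() is the kind of extra condition that is hard to cover in testing, and manual_guard is the extra per-thread storage that resource calculations would need to include.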
The mentioned TLS paper covers the loader and the user code side of ELF thread local storage in great detail.
It uses "module ID" to refer to a process-wide, unique, positive value that the loader assigns to each simultaneously loaded module (i.e. executable or shared object). See chapter 4, last paragraph: when performing a relocation for an STT_TLS symbol, the result is a module ID and a TLS block offset. The module ID must be 1 for the executable.
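The module ID bookkeeping might look roughly like the sketch below. TlsModuleTable and its members are hypothetical names, not part of Deos or of the TLS paper. The sketch assumes the executable is registered first so that it receives ID 1, and that IDs only need to be unique among simultaneously loaded modules, so a freed ID may be reused. It ignores the per-thread consequences of unloading.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per module that has a TLS segment (hypothetical layout).
struct TlsModuleSlot {
    std::size_t block_size;  // size of the module's TLS block in bytes
    bool in_use;             // false once the module has been unloaded
};

class TlsModuleTable {
public:
    // Returns the module ID (1-based). Call first for the executable.
    std::size_t add(std::size_t block_size) {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (!slots_[i].in_use) {          // reuse a freed ID
                slots_[i] = TlsModuleSlot{block_size, true};
                return i + 1;
            }
        }
        slots_.push_back(TlsModuleSlot{block_size, true});
        return slots_.size();
    }

    // Frees the ID of an unloaded shared object. ID 1 (the executable)
    // is assumed never to be unloaded.
    void remove(std::size_t module_id) {
        assert(module_id > 1 && module_id <= slots_.size());
        slots_[module_id - 1].in_use = false;
    }

    std::size_t block_size(std::size_t module_id) const {
        assert(module_id >= 1 && module_id <= slots_.size());
        assert(slots_[module_id - 1].in_use);
        return slots_[module_id - 1].block_size;
    }

private:
    std::vector<TlsModuleSlot> slots_;
};
```

If library unloading were not supported, remove() and the slot reuse loop could be dropped and the ID would simply be a running count.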
Interestingly, the TLS paper only directly covers IA-32 out of our three favourite architectures. Refer to https://www.uclibc.org/docs/tls-ppc.txt and https://www.uclibc.org/docs/tls-ppc64.txt for PowerPC specific descriptions. Compared to the ABIs currently referenced by Deos, TLS appears to involve an updated i386 ABI, https://www.uclibc.org/docs/psABI-i386.pdf, as well as the new PowerPC ABI, https://deos.ddci.com/scm/Deos/docs/reference-data/cpu/ppc/power.org/Power-Arch-32-bit-ABI-supp-1.0-Linux.pdf. I do not see any ARM references so that is an unknown.
The TLS paper explicitly limits compilers' exploits of certain representation aspects (chapter 3): compilers are not allowed to emit code which directly accesses dtv_t. On the other hand, compiled code is expected (required?) to use a particular thread register tp_t for accesses. In keeping with the ABIs mentioned above (and ARM https://deos.ddci.com/viewscm/Deos/docs/reference-data/cpu/arm/arm/bsabi/IHI0036B_bsabi.pdf & https://deos.ddci.com/viewscm/Deos/docs/reference-data/cpu/arm/arm/bsabi/IHI0042F_aapcs.pdf), GCC generates code requiring the thread register to be as follows:
i386: %gs:[0]
ppc: r2
arm: r9
I haven't tried the 64 bit compilers but at least the IA-64 and ppc64 ABIs differ from their 32 bit counterparts.
The compiled code calls __tls_get_addr (with 3 underscores, ___tls_get_addr, for i386), which appears to be the only out-of-line support code directly referenced by application code. It is unclear where these functions should go. These functions alone use the module ID of the TLS variable in question as an index into the dtv_t; compiled user code should never need the module ID directly. Aaron, I believe this answers your question.
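The lookup performed by these functions can be sketched as follows. This is an illustration under stated assumptions, not the real routine: TlsIndex mirrors the tls_index argument structure described in the TLS paper, while ThreadDtv and sketch_tls_get_addr are hypothetical names. A real implementation would locate the DTV through the thread register instead of taking it as a parameter, and might allocate blocks lazily.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the tls_index structure from the TLS paper.
struct TlsIndex {
    unsigned long ti_module;  // module ID assigned by the loader
    unsigned long ti_offset;  // offset of the variable in that TLS block
};

// Hypothetical per-thread DTV: entry (m - 1) holds the base address of
// this thread's TLS block for module ID m.
struct ThreadDtv {
    std::vector<unsigned char*> blocks;
};

// Sketch of the lookup: module ID selects the block, offset selects the
// variable. The DTV is passed explicitly here for testability.
void* sketch_tls_get_addr(ThreadDtv& dtv, const TlsIndex& ti) {
    assert(ti.ti_module >= 1 && ti.ti_module <= dtv.blocks.size());
    unsigned char* base = dtv.blocks[ti.ti_module - 1];
    assert(base != nullptr);  // lazy allocation is not modelled
    return base + ti.ti_offset;
}
```

Only this function touches the module ID and the DTV, which matches the observation that compiled user code never needs either directly.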
Johan