Embedded Tools Footprint Reduction Project

From DDCIDeos

Description

In support of the Louie_Program, the footprint of the embedded target development support applications must be reduced to 250KB of RAM, and the applications must use a serial line for communications. FLASH footprint is less of a concern.

Things to keep in mind:

  • In the host environment (OpenArbor, etc.) the current interaction with target tools is 1-1 and assumes a communication channel that multiplexes data transfer over virtual circuits.
  • Raw serial IO is modestly reliable, but often not 8-bit clean. Some sort of error detection and correction protocol is likely necessary.
  • Synchronous Ack/Nack protocols have low bandwidth utilization.
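
To illustrate the last point, here is a rough model of a synchronous (stop-and-wait) Ack/Nack protocol: the sender is idle while it waits for each acknowledgement, so utilization is the frame time divided by the frame time plus the wait. The baud rate, frame size, and turnaround delay are assumed values for illustration only, not measurements from this project.

```python
def stop_and_wait_utilization(frame_bytes, baud, turnaround_s,
                              ack_bytes=1, bits_per_byte=10):
    """Fraction of link time spent sending data frames under stop-and-wait.

    bits_per_byte=10 assumes 8N1 serial framing (start + 8 data + stop bit).
    """
    frame_time = frame_bytes * bits_per_byte / baud
    ack_time = ack_bytes * bits_per_byte / baud
    return frame_time / (frame_time + ack_time + turnaround_s)


# Assumed numbers: 115200 baud, 64-byte frames, 50 ms receiver turnaround.
u = stop_and_wait_utilization(64, 115200, 0.050)
print(f"utilization: {u:.1%}")
```

With these assumed numbers the link is idle roughly nine tenths of the time, which is why a windowed protocol may be worth considering.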

Evaluation Criteria

  1. Must fit in available room on PPC E200 core machine, ideally < 250KB RAM.
  2. Prefer solution that is not "one off", i.e., can be the solution for all customers going forward.
  3. Cost and schedule
  4. Performance, e.g., response time
  5. Have option to not require network driver to reduce BSP development time.
  6. Retain remote development, i.e., targets in lab, development in office.

Note: ordering of above priorities not yet vetted with management.

Reduce existing tools RAM footprint

Make LwIP configurable and move apps into LwIP, with each app having its own thread. Add CSLIP or PPP to LwIP. Both are already in LwIP; they just need to be enabled and a serial driver integrated.
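
For reference, the framing that SLIP (RFC 1055) puts on the serial line can be sketched as below. This is a host-side illustration in Python, not LwIP's own code. Note that SLIP only delimits packets; it does no error detection, so corruption would have to be caught by the IP/TCP checksums or an added layer.

```python
# SLIP special bytes as defined in RFC 1055.
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD


def slip_encode(packet: bytes) -> bytes:
    """Wrap one packet in a SLIP frame, escaping END and ESC bytes."""
    out = bytearray([END])  # leading END flushes any accumulated line noise
    for b in packet:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def slip_decode(frame: bytes) -> bytes:
    """Recover the packet from a single SLIP frame."""
    out = bytearray()
    escaped = False
    for b in frame:
        if escaped:
            out.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        elif b != END:
            out.append(b)
    return bytes(out)


frame = slip_encode(b"\x01\xc0\xdb")
print(frame.hex())
```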

Current status

Current suggested strategy is to run all apps as independent threads in LwIP, thus minimizing effect on host tools.

You can try it yourself. See #Be A Maintainer.

Aaron has a prototype reduced-footprint LwIP, along with ftp, sysvstrm, and sm apps that can run as threads inside LwIP. There is also a (compile time switchable, currently switched out) ftp server that can execute statmo commands by converting statmo into an interactive "sockobj app". Also, LwIP currently must be compiled two different ways to get "small" and "large" footprint behavior. Substantial integration issues remain (see #Tasks below).

Aaron's "informed speculation" is that achieving 200KB for two simultaneous applications (statmo + ftp), or (statmo + gdb) is attainable in a month, 150KB is achievable in 2-3 months, and 100KB might be possible but would require more effort to develop a plan.

In the below, "Prototype RAM" is what Aaron has demonstrated in a prototype; "Est Final RAM" is an informed guess with all optimizations considered, including statically linking all apps into LwIP, merging inetd into an existing LwIP thread, and combining FTP and statmo into a single thread.

Component   Original RAM (KB)  Prototype RAM (KB)  2-3 month RAM (KB)  Est final RAM (KB)  Remarks
LwIP                      600                 236                  92                  76  Prototype RAM includes RAM for all apps and kernel stacks, but excludes Network Buffers and driver resources.
buffers                  1000                  68                  41                  41  68KB required to support all apps simultaneously, smaller amount *might* be possible.
driver res                 64                  64                  16                   4  I.e., "Realtek Ethernet Shared Memory". Slip will replace part of this, but estimates are a total guess.
ftpserv                   100                   -                  16                   4  Final would static link and combine ftp and statmo.
gdbserver                 108                   -                  16                  12  Final would static link gdbserver.
inetd                     100                   -                   0                   0  Eliminate, use net.config and statically config ftp and gdbserver.
statmo                    132                   -                  16                  16  Final included in ftp.
sysvstrm                  140                   -                   0                   0  Eliminate, use ftpserver instead.
Total                    2244                 368                 197                 153


In the above, "net" is the RAM quota plus one extra page (since RPU reports a RAM quota of one more than the RAM quota), plus 1 additional page for the kernel stack of each thread. LwIP driver buffer space is ignored above.
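
The totals in the RAM table can be cross-checked with a few lines of Python. This sketch assumes the app rows have no separate prototype figure because the LwIP prototype number already includes all apps (per the LwIP remark); those cells are entered as None.

```python
# Columns: (original, prototype, 2-3 month, est final), all in KB.
# None marks a cell with no separate figure (assumed to be counted under LwIP).
ram_kb = {
    "LwIP":       (600, 236, 92, 76),
    "buffers":    (1000, 68, 41, 41),
    "driver res": (64, 64, 16, 4),
    "ftpserv":    (100, None, 16, 4),
    "gdbserver":  (108, None, 16, 12),
    "inetd":      (100, None, 0, 0),
    "statmo":     (132, None, 16, 16),
    "sysvstrm":   (140, None, 0, 0),
}

# Sum each column, treating missing cells as zero.
totals = [sum(row[col] or 0 for row in ram_kb.values()) for col in range(4)]
print(totals)  # [2244, 368, 197, 153]
```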

Component  Code complete  Tested  Committed  Unreleased  Stable  Remarks
LwIP       Yes            Yes     Yes        Yes         No      Lots of config details to work out.
ftpserv    Yes            Yes     Yes        Yes         No
gdbserver  Yes            Yes     Yes        Yes         No      In LwIP, can't support debug groups.
inetd      No             No      No         No          No      Current plan is inetd won't be changed.
statmo     Yes            Yes     Yes        Yes         No
sysvstrm   Yes            Yes     Yes        Yes         No      In LwIP, only one instance possible.

Testing on various platforms:

Platform      LwIP 3.10 large footprint  LwIP 3.10 small footprint  Remarks
qemu-ppc      Works                      Works
qemu-x86      Works                      Works
ep8343m       Works                      Works                      Followed instructions in #Be A Maintainer and had success with Greys environment.
DeosMPC5675K  Works                      Works                      Followed instructions in #Be A Maintainer and had success with Greys environment.

Responsiveness

Having apps run as threads within LwIP definitely reduces latency and makes OA much more responsive. Some quantitative results:

  time (exec 4<>/dev/tcp/192.168.19.100/1026; echo -e "r s p\rq" >&4; cat <&4)
  App as process: real    2.263s
  App as thread : real    0.041s

  junk$ time echo quit | ftp 192.168.19.100 --no-login
  App as process: real    2.781s
  App as thread : real    0.045s
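
As a quick worked figure, the ratios implied by the timings above can be computed as follows. The labels are only descriptive names for the two measurements.

```python
# (app as process, app as thread) wall-clock seconds from the runs above.
timings = {
    "tcp port 1026 command": (2.263, 0.041),
    "ftp quit": (2.781, 0.045),
}

speedups = {name: proc / thread for name, (proc, thread) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: about {ratio:.0f}x faster as a thread")
```

Both cases come out at more than a 50x reduction in wall-clock time.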

Configuration and Feedback

Applications can now be run:

  1. From inetd. Process is deleted when connection closes.
  2. Auto-created. Behaves like FTP did, one connection at a time, process is retained between connections.
  3. As LwIP threads. Thread, and all process state, is retained between connections.

Things to consider:

  1. Configuring all the possibilities independently seems overly complicated.
  2. gdbserver doesn't handle connection failure situations well.
    • Fixed some target issues, host is still problematic.
  3. Thread apps are fast. Hate to give that up.
    1. Average connection time:
      • < .01 sec thread apps
      • 2.5 * period for inetd created app
    2. Eliminated .5 sec latency in ftp connection establishment.
      • I.e., thread apps are faster to start than download mode FTP was.
    3. Eliminated mutex in all apps, reduced sysvstrm to 1 thread.
    4. All apps are now slowest rate, min slack, which helps CPU, although the biggest hog, by far, is LwIP.
      • 0.072% 200ms : sysvstrm
      • 0.016% 1 sec : ftpserver, gdbserver, statmo
    5. For reference CPU budget of previous apps:
      • 1.088% : sysvstrm
      • 0.072% : ftpserver, gdbserver, statmo

We could run separate process apps at 200ms and have the same CPU footprint as before, but responsiveness wouldn't be as good as with thread apps.


Tasks

  1. Test cffs related ftpserver changes.
  2. Reduce Network Buffers to ~64KB.


Optional

  1. fix deprecated functions w/o need for deleting deprecation warnings.
  2. make sysvstrm use sockobj rather than internal buffering scheme.
  3. Add lwip configuration option, e.g., net.config, to disable socketDiagStream() and stats_init()
    • Saves about 12KB RAM for each.
    • Currently #ifdef'd and included only for large footprint
  4. Add support for gethostbyname() and gethostbyaddr() to the desocket interface inside of LwIP.
    • Current Deos apps do not use these interfaces.
  5. Convert status monitor to use sockobj for better buffering (optional)

Deferred

  1. Get apps to link as relocatable executables
    • This permits them to be loaded as apps.
    • Problem, see #App_Fixed_Addresses. For now managing fixed link addresses.
  2. Consider changing socket init to use semaphore rather than mutex.
    • Marginal value at this time since NUM_SOCKET_THREADS=1 for all apps.
  3. The ftpserver platregd .pd file specifies a huge RAM quota. It would be nice to have that change dynamically as system resources are available.
    • Or as a .fp file for the platform.
    • In general, ftp would do better to have some fp files that a platform could invoke from itconf files, e.g., download modeness, run as applet, etc.
  4. Shared objects (.exe and lib*.so) have non-page aligned data segments.
    • Applies to all architectures. These shell commands list them:
    • for f in /desk/*/appbin{,/dbg}/*.exe; do s=$(objdump -x $f | grep 'LOAD off' | grep -v '000 paddr'); if [ -n "$s" ]; then printf "%-40s %s\n" $f "$s"; fi; done
    • for f in $(find /desk -iname 'lib*.so'); do s=$(objdump -x $f | grep 'LOAD off' | grep -v '000 paddr'); if [ -n "$s" ]; then printf "%-54s %s\n" $f "$s"; fi; done
    • For ppc as of unreleased components on 1/18/2016, the alignment isn't wasting any RAM.
  5. gdbserver
    • Attempting to start a second debug session before the first finishes leaves OA locked up. It appears that MLD gets hung.
    • There is some global data that caches various state. From within LwIP it may require work to be able to switch the executable being debugged without doing a reboot.
      • When run in LwIP, runs at 1 Hz.
    • Previous RAM quota was 33 pages; 10 seems adequate. Needs review.
  6. In LwIP, setting LWIP_STATS to zero causes DDCI added code to fail to compile.
    • Disabling stats would likely be a small footprint savings.
  7. All network buffer pools are currently 64 byte aligned. Only the PBUF pools require such a large alignment
    • This could save 6-8KB depending on configured limits
    • Document how to show lwip mem status video display.
  8. Compile files used in download mode with -Os (optimize for space)
    • Requires creation of runtime .o file for GPR save/restore intrinsics, a conceptually small effort but logically it would go in the compiler runtime.
    • LwIP would save 3 code pages based on a test Aaron did compiling and hacking in the symbols for the intrinsics to create a non-functional lwip.exe


For each app must test:

  • invoking via inetd
  • invoking inside lwip
  • invoking inside ftp (statmo only)

Results So Far

For x86/release:

  1. ftpserver
    • data+bss reduced from 7 to 2 pages (0x63d0 to 0x14d4)
    • CPU: 0.072 percent to 0.016 percent
  2. sysvstrm
    • Eliminated one thread and reduced the rate of the main.
    • data+bss reduced from 7 to 2 pages (0x6428 to 0x137c)
    • CPU: 1.088 percent to 0.072 percent
  3. status monitor
    • data+bss: reduced from 8 to 4 pages (0x74e8 to 0x33c8)
    • CPU: 0.072 percent to 0.016 percent

Preliminary FTP "get" results show about 15% improvement in data transfer rate.
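
The page counts quoted for data+bss follow from the hex sizes if pages are 4KB, as in the page arithmetic elsewhere on this page. A small check:

```python
PAGE = 4096  # bytes; 4KB pages assumed


def pages(nbytes):
    """Number of whole pages needed to hold nbytes (ceiling division)."""
    return -(-nbytes // PAGE)


# (before, after) data+bss sizes in bytes for x86/release.
sizes = {
    "ftpserver": (0x63d0, 0x14d4),
    "sysvstrm": (0x6428, 0x137c),
    "status monitor": (0x74e8, 0x33c8),
}

page_counts = {app: (pages(b), pages(a)) for app, (b, a) in sizes.items()}
for app, (before, after) in page_counts.items():
    print(f"{app}: {before} -> {after} pages")
```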

Be A Maintainer

  1. Get unreleased for: ftpserver, sysvstrm, sm, gdbserver, lwip
  2. lwip distribution contains /desk/etc/lwip-apps.fp.xml. Copy this file to your platform project. It assumes all Deos network apps should be configured as lwip threads.
  3. Add the following lines to the end of /desk/platform/$PLATFORMNAME/etc/lwip.config
    # Make the following be application threads inside LwIP.
    application ftpserv.exe
    application gdbserver.exe
    application sm.exe
    application sysvstrm.exe
    
  4. Build a platform and start QEMU. All apps are available in normal and download mode.

Apps overview

Moving an app into LwIP still requires pages for the user and kernel stack per app either way; however, the LwIP server thread equivalents would go away.

RAM savings in pages (per process):

  • 3-4: 1 PDT + 2 or 3 PTs
  • 1: Data page for each shared library used, probably just ANSI
  • 2: No longer need envelopes or mailboxes

FTPserver

The x86 ftpserver RAM quota is 24 pages (not including kernel stack), with 3 remaining, so 21 pages of the RAM quota are in use:

 Now  Save
   6        user stack
   7    5   data+bss
   1    1   ansi
   1    1   imageapi.dll  (not required on limited footprint)
   2    2   envelopes
   3    3   PDT+PTs
 ---  ---
  20   12

So, 8 pages, but that does not include kernel stack: (8+1)*4KB = 36KB

Linker script wastes nearly half a page of RAM by starting data section at same page relative offset as end of code.

Estimate we could further reduce stack by 4 and data by 1.

Final estimate (9-5)*4K = 16KB

The data+bss reduction is possible because the existing 7-page BSS is largely due to desocket, which wastes 5 pages, so it easily reduces to 2 pages.
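
The FTPserver arithmetic above, written out step by step. The figures are taken from the table and the estimates in this section; 4KB pages are assumed.

```python
PAGE_KB = 4

now = {"user stack": 6, "data+bss": 7, "ansi": 1,
       "imageapi.dll": 1, "envelopes": 2, "PDT+PTs": 3}
save = {"data+bss": 5, "ansi": 1, "imageapi.dll": 1,
        "envelopes": 2, "PDT+PTs": 3}

remaining = sum(now.values()) - sum(save.values())  # 20 - 12 = 8 pages
with_kernel_stack = remaining + 1                   # 9 pages
first_estimate_kb = with_kernel_stack * PAGE_KB     # 36KB

# Further estimated reductions: 4 stack pages and 1 data page.
final_estimate_kb = (with_kernel_stack - (4 + 1)) * PAGE_KB  # 16KB

print(first_estimate_kb, final_estimate_kb)
```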

Eliminate inetd process

Integrate inetd into one of LwIP's existing periodic threads. Saves user and kernel stack (8KB)


Combined App Shell

Merge functionality for all interactive apps (ftpserv, statmo, sysvstrm, and gdbserver) into a single thread where the host can request any of them to process a command at any time. As a practical matter this probably only helps for the ftp (or gdbserver) coupled with statmo, so the effective saving is probably only 8KB, but latency would also be reduced since OA currently has to create both ftp and statmo and each has roughly 2 second latency (4 seconds total), whereas this approach would probably have only 1 second latency.

Multiple independent instances are possible. A "shell" would be created (i.e., a thread within LwIP) with a parameter saying what the default app is that it is implementing. The shell would permit sending a command at any time to any of the apps, e.g., "!sm R T H" to have statmo execute the "RTH" command. If the thread was created as a default status monitor, the above command would be just "R T H", but invoking sysvstrm could still be performed by executing "!sysvstrm". Similarly, if the app was created as a default FTP server, then the app could transparently interact with a host FTP client, and in the FTP client you could "quote site !sm R T H". When FTP is the default app a little client and host magic is required since FTP frames SITE command output and the framing would have to be added on the client and removed on the server. I'm not sure if the Java and Python FTP libraries already do the host side stripping or not.

Each app provides the following interfaces, either via a library or via source integration (undecided at this time).

  1. Each interactive app generates a library with 3 interfaces:
    1. appOpen(sockobj, void** appObject)
      • returns a pass fail status
      • Sets appObject to whatever the app wants.
      • sockobj is effectively stdin/stdout
    2. appProcessCommand(appObject, string, outputDiagnostic)
      • returns a pass, fail, unknown command status.
      • Process the command specified by the string
      • Output to sockobj is permitted, app should print its "next command" prompt as appropriate.
      • If outputDiagnostic is false and the command is unrecognized, the app should not print any diagnostic messages.
    3. appClose(appObject)
      • Any shutdown operations the app needs.
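
The dispatch described above can be modeled in a few lines of Python. This is only a sketch of the idea, not the target implementation: StubApp, Shell, and the status strings are hypothetical names, a plain list stands in for the sockobj, and the app instance itself plays the role of appObject.

```python
PASS, FAIL, UNKNOWN = "pass", "fail", "unknown command"


class StubApp:
    """Hypothetical stand-in for an interactive app such as sm or ftpserv."""

    def __init__(self, name, commands):
        self.name = name
        self.commands = set(commands)
        self.sockobj = None

    def app_open(self, sockobj):
        # Models appOpen(sockobj, &appObject): the instance is the appObject.
        self.sockobj = sockobj
        return PASS

    def app_process_command(self, command, output_diagnostic):
        if command not in self.commands:
            if output_diagnostic:
                self.sockobj.append(f"{self.name}: unknown command {command!r}")
            return UNKNOWN
        self.sockobj.append(f"{self.name}: ran {command!r}")
        return PASS

    def app_close(self):
        self.sockobj = None
        return PASS


class Shell:
    """One shell thread: a default app plus '!name' routing to the others."""

    def __init__(self, apps, default, sockobj):
        self.apps = apps
        self.default = default
        self.sockobj = sockobj
        for app in apps.values():
            app.app_open(sockobj)

    def handle_line(self, line):
        if line.startswith("!"):
            name, _, command = line[1:].partition(" ")
            app = self.apps.get(name)
            if app is None:
                self.sockobj.append(f"shell: no such app {name!r}")
                return UNKNOWN
            return app.app_process_command(command, True)
        return self.apps[self.default].app_process_command(line, True)


out = []  # a list stands in for the sockobj output stream
shell = Shell({"sm": StubApp("sm", ["R T H"]),
               "ftpserv": StubApp("ftpserv", ["LIST"])},
              default="ftpserv", sockobj=out)
status_default = shell.handle_line("LIST")       # goes to the default app
status_routed = shell.handle_line("!sm R T H")   # routed to sm explicitly
status_missing = shell.handle_line("!nosuch x")  # unknown app name
```

Keeping the "!" prefix handling in the shell means each app only needs the three entry points and never has to know which app is the default.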

Combined App Shell Eliminate Apps

Like #Combined App Shell, but eliminate more apps.

  • statmo: integrate as app into ftp.
  • inetd: Eliminate entirely in small footprint configuration.
  • sysvstrm: Eliminate. OA polls for output using ftpserver.
  • ftpserver: static app inside LwIP
  • gdbserver: static app inside LwIP.

Add option to net.config to specify apps to load, sort of like an inetd.config file.

App Fixed Addresses

Several applications were changed to run as sockObjApp applications as threads within LwIP. This reduces overhead for separate processes (PDT/PTs), the need for network server thread stacks (user and kernel), and the envelopes used for RPC communication. However, the kernel, as of 8.3.2, does not support loading a relocatable object file as the process' executable. This presents a problem because currently all applications are linked at the same address. It is further complicated because applications use getNextLibraryStartAddress() and setNextLibraryStartAddress() to manage the virtual address space and assume that all addresses above the next library address are unused, but the kernel ignores any fixed address files other than the process' .exe when initializing setNextLibraryStartAddress(). This means that the process' .exe has to be linked at an address HIGHER than any application it might subsequently load.

The current sockObjApp apps are:

ftpserv, gdbserver, sm, sysvstrm, and maybe inetd.

  1. lwip might load any app, so lwip must be highest.
  2. ftpserv, and perhaps gdbserver, might load sm, so make sm be lowest, and gdb and ftp next highest below lwip.

No other apps are likely to load one another and can be anywhere.

And, of course, spreading the applications over a large address space, and thus causing new page tables to be required, defeats the intention of reducing the RAM footprint, so we need to reserve reasonable app virtual address ranges. Currently (Nov 2015) the worst case code+data+bss is 0x1c000 bytes (112KB). The sizes were determined via:

 size /desk/*/appbin{,/dbg}/{ftpserv,gdbserver,sm,sysvstrm,inetd}.exe | sort -n -k 4

Using 512KB (0x80000) for each (5 apps * 0.5MB = 2.5MB) leaves 1.5MB for lwip. Hence the following address offsets from APP_BASE_ADDR:

 SM_LOAD_OFFSET        = 0x000000
 SYSVSTRM_LOAD_OFFSET  = 0x080000
 INETD_LOAD_OFFSET     = 0x100000
 GDBSERVER_LOAD_OFFSET = 0x180000
 FTPSERVER_LOAD_OFFSET = 0x200000
 LWIP_LOAD_OFFSET      = 0x280000
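
The offsets above are consecutive 512KB slots in load order, lowest first. The sketch below regenerates them and checks the stated constraints. The 4MB window is inferred from the 2.5MB + 1.5MB figures and is an assumption.

```python
SLOT = 0x80000        # 512KB reserved per app
WINDOW = 0x400000     # assumed 4MB total (2.5MB for apps + 1.5MB for lwip)
WORST_CASE = 0x1c000  # worst case code+data+bss as of Nov 2015

# Lowest to highest: sm must be below the apps that may load it (ftpserv,
# gdbserver), and lwip must be highest since it may load any app.
order = ["SM", "SYSVSTRM", "INETD", "GDBSERVER", "FTPSERVER", "LWIP"]
offsets = {f"{name}_LOAD_OFFSET": i * SLOT for i, name in enumerate(order)}

for name, value in offsets.items():
    print(f"{name:<22}= 0x{value:06x}")

lwip_room = WINDOW - offsets["LWIP_LOAD_OFFSET"]  # space left for lwip
```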