Kinghall lwip performance

From DDCIDeos

Project to improve the performance of LwIP for Kinghall

Description

LwIP is too slow: the customer is seeing 3 MB/s on an ftp get from the target. Linux on the same configuration gets ~125 MB/s, and the customer wants the same on Deos.

Current results are 38.5 MB/s (approx. 13x faster than the original), still ~3x slower than the theoretical max.
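
The ratios in the description can be checked with a few lines of C. This is a minimal sketch, assuming the 38.5 figure is in megabytes per second (the unit under which the 13x and 3x ratios work out) and that the theoretical max is the ~125 MB/s of a 1 Gbps link. The function names are illustrative only and do not come from any Deos or lwIP source.

```c
#include <stdio.h>

/* Convert megabytes per second to megabits per second (1 byte = 8 bits). */
double mbytes_to_mbits(double mbytes_per_s)
{
    return mbytes_per_s * 8.0;
}

/* Ratio between two throughput figures expressed in the same unit. */
double throughput_ratio(double faster, double slower)
{
    return faster / slower;
}

/* Print the three figures from the description in both units,
 * plus the two ratios quoted in the text. */
void print_throughput_summary(void)
{
    const double original  = 3.0;    /* MB/s, customer's ftp get on Deos     */
    const double current   = 38.5;   /* MB/s, ASSUMED unit (see text above)  */
    const double reference = 125.0;  /* MB/s, Linux on the same hardware     */

    printf("original : %6.1f MB/s = %7.1f Mbps\n", original,  mbytes_to_mbits(original));
    printf("current  : %6.1f MB/s = %7.1f Mbps\n", current,   mbytes_to_mbits(current));
    printf("reference: %6.1f MB/s = %7.1f Mbps\n", reference, mbytes_to_mbits(reference));
    printf("speedup over original : %.1fx\n", throughput_ratio(current, original));
    printf("remaining gap to max  : %.1fx\n", throughput_ratio(reference, current));
}
```

Calling print_throughput_summary() from a main() prints the table. Keeping the MB/s and Mbps columns side by side makes it easier to compare ftp's reported rate (bytes) with Wireshark's graph (bits).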

This is being worked partly as a BOT. See Richard/Kelly for time reporting.

There is an extensive chat history: https://teams.microsoft.com/l/message/19:fccc132d3593491598a081eaae7a506e@thread.v2/1735937346991?context=%7B%22contextType%22%3A%22chat%22%7D

Tasks

Task Description                        Priority  Assignee  Status    Effort (Hours)  Comments
Unelease updated ANSI                   1         AL        Done-ish  24              Many TODOs. No optimization for arch != ARM
Unelease lwip (see issues below)        1         MV        Done-ish
Unelease gem driver (see issues below)  3         MV        Pending                   must compile from head
Unelease kernel                         3         RLR       Done-ish                  unreleased "from the desk of Ryan"
Unelease ftp                            3         CP        Done-ish
Generate customer letter                1         TBD       Pending

Issues Identified

  1. ANSI: memcpy() and memset() very slow for misaligned or uncached memory.
  2. Debug variants of lwip and gem drivers are compiled with -O0
  3. gem descriptor resource cache mode to writeThru
  4. gem turn off tx interrupts
  5. kernel debug variant very slow (presumably due to DataMemberTemplate)
  6. kernel event signaling
  7. lwip
    1. lwipopts window size to 44 times MSS
    2. TCP send buffer is 32 times MSS; may want at least 44 to match the window size.
  8. FTP send buffer output cache
  9. On ARM, reads of device memory are slower than reads of uncached normal memory (using writeThru as a surrogate, since the current customer's ARM processors don't implement writeThru and fall back to uncached normal memory semantics).
  10. PCR:16193 Kernel was being conservative when computing slack, causing idle time at the end of a period.
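
The lwipopts items in issue 7 might look like the excerpt below. This is a sketch, not the project's actual lwipopts.h: TCP_MSS, TCP_WND, TCP_SND_BUF and LWIP_WND_SCALE are standard lwIP option names, but the MSS value of 1460 is an assumption (1500-byte Ethernet MTU minus 40 bytes of IPv4 and TCP headers), and raising the send buffer to 44 times MSS is the suggestion from item 7.2, not a confirmed setting.

```c
/* lwipopts.h excerpt (sketch under the assumptions stated above). */

#define TCP_MSS      1460              /* ASSUMED: 1500-byte MTU - 40 header bytes */
#define TCP_WND      (44 * TCP_MSS)    /* receive window: 64240 bytes              */
#define TCP_SND_BUF  (44 * TCP_MSS)    /* send buffer sized to match TCP_WND       */

/* The TCP header carries the window in a 16-bit field, so without window
 * scaling (LWIP_WND_SCALE) the window cannot exceed 65535 bytes. */
#if !defined(LWIP_WND_SCALE) || !LWIP_WND_SCALE
#if TCP_WND > 65535
#error "TCP_WND does not fit in 16 bits; enable LWIP_WND_SCALE or reduce it"
#endif
#endif
```

With a 1460-byte MSS, 44 is the largest multiplier that still fits the 16-bit window field (45 * 1460 = 65700), which may be why 44 was chosen. Going beyond that would require enabling window scaling.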

Possible things to investigate

  1. Checksum offload. Determine how much time spent there now.
    1. Ref: https://lists.nongnu.org/archive/html/lwip-users/2008-02/msg00022.html
    2. Ref: https://docs.amd.com/r/en-US/ug1087-zynq-ultrascale-registers/dma_config-GEM-Register
  2. Minimize semaphore overhead:
    1. Alloc multiple pbufs per semaphore lock?
    2. Use fast path via atomics?
    3. Modify alloc algorithm to be lockless?
  3. Get gprof working
  4. Continue to add logSystemEvent() calls.
    1. Receive path?
  5. Try to modify gem driver to use cached descriptor memory.
  6. Turn off lwip software receive checksum verification: CHECKSUM_CHECK_TCP
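
For item 2.1, the idea of taking several buffers per lock acquisition can be sketched as follows. This is an illustration only: struct buf, struct buf_pool and the pool_* functions are hypothetical names, not lwIP or Deos APIs, and a pthread mutex stands in for whatever semaphore the real pbuf pool takes.

```c
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 64                    /* hypothetical pool depth */

/* Hypothetical fixed-size buffer; stands in for a pbuf. */
struct buf {
    struct buf *next;
    unsigned char payload[1536];
};

/* Free list guarded by a single lock (stand-in for the pool semaphore). */
struct buf_pool {
    pthread_mutex_t lock;
    struct buf *free_list;
    struct buf storage[POOL_SIZE];
};

/* Link every buffer in storage[] onto the free list. */
void pool_init(struct buf_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
}

/* Take up to `want` buffers under ONE lock acquisition.
 * Returns how many buffers were actually written to out[]. */
size_t pool_alloc_batch(struct buf_pool *p, struct buf **out, size_t want)
{
    size_t got = 0;

    pthread_mutex_lock(&p->lock);
    while (got < want && p->free_list != NULL) {
        out[got] = p->free_list;
        p->free_list = p->free_list->next;
        out[got]->next = NULL;
        got++;
    }
    pthread_mutex_unlock(&p->lock);
    return got;
}

/* Return `count` buffers to the pool under one lock acquisition. */
void pool_free_batch(struct buf_pool *p, struct buf **bufs, size_t count)
{
    pthread_mutex_lock(&p->lock);
    for (size_t i = 0; i < count; i++) {
        bufs[i]->next = p->free_list;
        p->free_list = bufs[i];
    }
    pthread_mutex_unlock(&p->lock);
}
```

A caller that needs N buffers pays for one lock/unlock pair instead of N. Whether this helps in practice depends on how much of the measured time is really spent in the semaphore, which is what the logSystemEvent() and gprof items are meant to establish.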

To reproduce current best case results

Configuration

  1. Put libansi.so from this chat.
  2. Put release variants of lwip, kernel, and gem driver.
  3. Change cacheMode:
    lwip.pia.xml: NetworkBuffers stays off (this is not a change, but it is curious that off works better than writeThru)
    xilinx-gem.pia.xml: GigabitEthDescriptorMemory0 (and 1, 2, and 3) to writeThru
  4. Change Scheduling Priority of ISR thread to zero.

To Run The Experiment

  1. On tfhost associated with target:
    1. ftp to target and:
      get /dev/zero /dev/null
    2. Start wireshark
      set filter of ip.addr to target's IP.
      Statistics/ I don't recall the steps here.

The wireshark graph should show ~230Mbps