DVMS Update Notes



DVMS redesign notes 11/20/24

Options

  1. Go back to server-based solution
    • Multiple servers, per filesystem volume?
    • Issues:
      1. Many windows, many schedulers
      2. Latency?
      3. Could we have server processes in 653 partitions?
  2. Sacrifice integrity for implementability
    • We guarantee access is only given to permitted volumes
    • We guarantee no collisions on directory manipulations
    • File synchronization occurs at cluster boundaries to ensure metadata/file consistency
    • Metadata is user-writable and therefore susceptible to corruption by any process/653-partition that has access to the same filesystem volume; however, using the filesystem APIs will maintain metadata integrity
      1. Blocking may be necessary, but non-interfering accesses should not cause blocking
  3. Implement full chain in kernel space (filesystem, media library, journal, ...) to ensure metadata integrity in the face of potential bad actors
  1. To provide serialization of filesystem activities, there is a 32-bit "activity in progress" flag (see the sketch after this list):
    • 30 bits: reader count
    • 1 bit: write requested
    • 1 bit: write in progress

This allows multiple readers or a single writer, but not readers while writing.

  1. Can the runtimes (653, RMA, RTEMS) implement yield() to allow rescheduling in tight loops, e.g., while waiting for an activity grant via the activity-in-progress flag?
    1. There can be unbounded blocking if a WAT change occurs during active filesystem API operations.
  1. No filesystem operation can be left outstanding in the presence of a WAT change (ensure data is synchronized).
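
A minimal sketch of the activity-in-progress gate described above, assuming C11 atomics; the type, constant, and function names are illustrative, not the actual implementation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define READER_MASK       0x3FFFFFFFu  /* bits 0..29: reader count        */
#define WRITE_REQUESTED   0x40000000u  /* bit 30: a writer is waiting     */
#define WRITE_IN_PROGRESS 0x80000000u  /* bit 31: a writer holds the gate */

typedef struct { _Atomic uint32_t state; } activity_gate_t;

/* A reader may enter only when no write is requested or in progress. */
static bool reader_try_enter(activity_gate_t *g)
{
    uint32_t cur = atomic_load(&g->state);
    if (cur & (WRITE_REQUESTED | WRITE_IN_PROGRESS))
        return false;                    /* caller should yield() and retry */
    if ((cur & READER_MASK) == READER_MASK)
        return false;                    /* reader count saturated          */
    return atomic_compare_exchange_weak(&g->state, &cur, cur + 1);
}

static void reader_exit(activity_gate_t *g)
{
    atomic_fetch_sub(&g->state, 1u);
}

/* A writer announces intent, then waits for the reader count to drain. */
static bool writer_try_enter(activity_gate_t *g)
{
    atomic_fetch_or(&g->state, WRITE_REQUESTED);
    uint32_t expected = WRITE_REQUESTED;   /* no readers, no active writer */
    return atomic_compare_exchange_weak(&g->state, &expected,
                                        WRITE_REQUESTED | WRITE_IN_PROGRESS);
}

static void writer_exit(activity_gate_t *g)
{
    atomic_store(&g->state, 0u);
}

Callers that spin on reader_try_enter() or writer_try_enter() are exactly the tight loops where the runtime yield() discussed above would be needed.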

Prior meeting notes...

Abbreviations to help compact table:

  • DES: Directory Entry Set: 3 to 19 contiguous 32-byte chunks of data in one cluster or spanning across two clusters that describe one file or directory entry in the root directory or a sub-directory.
  • cDES: cluster(s) containing the API-referenced (by path or descriptor) Directory Entry Set: Understood to be one cluster if the DES is entirely contained within it, two otherwise.
  • pcDES: cluster(s) of parent Directory Entry Set: The cluster(s) containing the DES of the parent of the API-referenced DES.
  • 1C: First cluster of file/directory content
  • nC: Nth cluster of file/directory content. Note: no distinction is made whether the file/directory is contiguous or chained.
  • FAT: File Allocation Table (chained items only).
  • CBM: Cluster Bitmap.

Functions and what clusters they need to make stable in the cluster cache are described below.

Assumptions:

  1. Getting to cDES may require walking multiple clusters including:
    1. 1C of root directory.
    2. nC of root directory.
      • Then repeat for each sub-directory:
      1. 1C of sub-directory.
      2. nC of sub-directory.
  2. Getting to nC requires consulting the FAT if the "chained flag" is set in the DES (see the FAT-walk sketch after the table).
  3. Nothing made stable in one API call is "already stable, guaranteed" in any subsequent call.
Function   | Path Based? | Cluster 1 | Cluster 2 | Cluster 3 | Descriptor-saved items from cDES | Notes
getattr    | Yes | cDES  |      |    |                              | Metadata inquisition only.
fgetattr   | No  | cDES  |      |    |                              | Metadata inquisition only.
truncate   | Yes |       |      |    |                              |
ftruncate  | No  |       |      |    |                              |
opendir    | Yes | cDES  | 1C   |    | "valid size", "chained flag" |
readdir    | No  | 1C    | nC   |    |                              | Directory walk may spill into subsequent clusters. No need to make 1C stable if we've already walked past it. On any given call the descriptor offset may have been changed by rewinddir to a value outside the current cluster.
releasedir | No  | cDES  |      |    |                              | Things in the DES that may need updating: "access time" (noatime option to avoid this).
open       | Yes | cDES  | 1C*  |    | "valid size", "chained flag" | Make 1C stable if it exists ("first cluster" is nonzero). *May not be necessary (no perceived benefit?).
create     | Yes | cDES  |      |    |                              | 1C doesn't exist yet. cDES is written.
release    | No  | cDES  |      |    |                              | Things in the DES that may need updating if the file was opened read-only: "access time" (noatime option to avoid this). If opened read-write: "valid/actual size", "start cluster" (if a new file that now has content).
fsync      | No  | cDES  |      |    |                              |
read       | No  | 1C    | nC   |    |                              | File read may spill into subsequent clusters. On any given call the descriptor offset may have been changed by lseek to a value outside the current cluster.
write      | No  | cDES  | 1C   | nC |                              | File write may spill into subsequent clusters and may need to consult the CBM to find one. If "chained", may need to consult the FAT to locate nC. On any given call the descriptor offset may have been changed by lseek to a value outside the current cluster.
unlink     | Yes | pcDES | cDES |    |                              |
rmdir      | Yes | pcDES | cDES |    |                              |
mkdir      | Yes | cDES  | 1C   |    |                              |
rename     | Yes |       |      |    |                              | Rename has a few possible cases: rename in the same directory to a shorter name - reuse the existing DES; rename in the same directory to a longer name - similar to a create call (if renaming a directory, the "first cluster" and "valid length" are copied to the new DES); rename to a different directory - possibly similar to create or mkdir (in the new directory) followed by unlink or rmdir (in the existing directory).
statfs     | Yes |       |      |    |                              | Metadata inquisition only.
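
A minimal sketch of assumption 2 above (reaching nC when the "chained flag" is set), with a hypothetical fat_read() callback standing in for the real FAT/block-cache access:

#include <stdint.h>

#define EXFAT_EOF_CLUSTER 0xFFFFFFFFu          /* end-of-chain marker in the FAT */

/* Return the cluster index of the nth (0-based) content cluster of a
 * file/directory, or EXFAT_EOF_CLUSTER if the chain ends early. */
uint32_t nth_cluster(uint32_t first_cluster, uint32_t n, int chained,
                     uint32_t (*fat_read)(uint32_t cluster))
{
    if (!chained)                              /* NoFatChain: clusters contiguous */
        return first_cluster + n;

    uint32_t cluster = first_cluster;
    while (n-- > 0) {
        cluster = fat_read(cluster);           /* follow the chain link by link   */
        if (cluster == EXFAT_EOF_CLUSTER)
            break;
    }
    return cluster;
}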

TODO: What is the data structure and name for the object representing the opened files? Needed for prevention of rmdir, unlink, or rename of files/directories that are opened by others.

exFAT redesign plan 9/17/24

Need to try to make some progress on the design of concurrency protections for cluster bitmap, FAT, and directories.

Some thoughts...

  1. What do we guarantee?
    1. Metadata correctness.

AL: Forward progress is not a guarantee. As long as someone is making forward progress then the system is making forward progress.

  1. Guarantee existence.
  2. Guarantee non-interfering access.
  3. Thread executing with lock held, loses context.

Cluster bitmap:

  1. Acquisitions from it need to be atomically protected so Aaron and Chris don't get the same cluster when trying to acquire one.
  2. Releases don't have to be seen immediately by all involved. Worst case: Chris releases one cluster that was the last available cluster in the volume; Aaron at the same time is trying to allocate one but sees there are none.
  3. Implementation proposal to improve speed and co-location, perhaps:
    • Instead of a flat list of clusters searched linearly (the current design), clusters could be "binned" by some divisor with an "available count" tracked for each bin. When allocating a cluster, the "hint" given to the algorithm could be [if relevant]: "I got one from this bin in the past for this file, so let's try here again".
    • Speed of acquiring could increase: instead of having to check each and every cluster, check each bin and allocate from one that still has a non-zero "available count".

https://en.wikipedia.org/wiki/Free-space_bitmap#Advanced_techniques

  1. The structure is global and kept in memory.
    • Atomic compare and swap (see the sketch after this list).
    • If the binning optimization is implemented, a lock will be needed to keep the two pieces of information (the cluster bits and the per-bin counts) consistent. That lock could result in unbounded blocking.
  2. Operations:
    1. Acquire cluster - set bit(s) to 1 (owned by somebody).
    2. Release cluster - set bit(s) to 0 (available).
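
A minimal sketch of the acquire/release operations above using compare-and-swap on one 32-bit word of the in-memory bitmap; the function names are illustrative, and __builtin_ctz assumes a GCC/Clang-style compiler:

#include <stdatomic.h>
#include <stdint.h>

/* Try to claim one free (0) bit in a 32-bit word of the cluster bitmap.
 * Returns the claimed bit index, or -1 if the word has no free bits.
 * The CAS guarantees two clients can never be handed the same cluster. */
int claim_cluster_in_word(_Atomic uint32_t *word)
{
    uint32_t cur = atomic_load(word);
    while (cur != UINT32_MAX) {
        int bit = __builtin_ctz(~cur);              /* lowest free bit      */
        uint32_t next = cur | (1u << (unsigned)bit);
        if (atomic_compare_exchange_weak(word, &cur, next))
            return bit;                             /* success: bit is ours */
        /* CAS failed: cur now holds the fresh value; retry. */
    }
    return -1;
}

/* Releases need no retry loop and may be observed lazily by other clients. */
void release_cluster_in_word(_Atomic uint32_t *word, int bit)
{
    atomic_fetch_and(word, ~(1u << (unsigned)bit));
}

If the binning optimization is added, the per-bin "available count" would still need separate protection, as noted above.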

FAT:

  1. Modified FAT elements *must* be protected by a cluster bitmap acquisition.
  2. Unsure if this means that FAT is entirely protected by the cluster bitmap acquisition atomicity requirements.
  3. Actions that require FAT modification:
    • When a file/directory grows such that it crosses a cluster boundary into a non-contiguously acquired cluster, the FAT entries for the entire file/directory must be updated.
  4. Actions that do not require FAT modification:
    • Deleting a file/directory.
  5. FAT relevance to any particular file/directory is indicated by the "NoFatChain" flag (called "contiguous" in current design). The MS documentation also makes reference to "AllocationPossible" flag - I think this is set on things like the cluster bitmap, volume label, and upcase directory entries.
  1. Block-cache. Eviction?
  2. Operations (only relevant for non-contiguous files/directories):
    1. Update - set some cluster index's value to point at the next cluster when allocating new ones (see the append sketch after this list).
    2. Read -
      1. Consistency protection if multiple users cause non-contiguous transition?
        1. Not possible to do without locks?
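
A minimal sketch of the "Update" operation above, with a hypothetical fat_write() helper; the new cluster's end-of-chain entry is written before the old tail is linked to it, so a partially applied update never leaves a dangling link. Per item 1 above, the new cluster must already have been acquired from the cluster bitmap:

#include <stdint.h>

#define EXFAT_EOF_CLUSTER 0xFFFFFFFFu          /* end-of-chain marker in the FAT */

/* Link a newly acquired cluster onto the end of an existing chain. */
void fat_append(uint32_t tail_cluster, uint32_t new_cluster,
                void (*fat_write)(uint32_t cluster, uint32_t value))
{
    fat_write(new_cluster, EXFAT_EOF_CLUSTER);   /* new cluster terminates chain */
    fat_write(tail_cluster, new_cluster);        /* old tail now points to it    */
}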

Directories:

  1. Should we cache directory clusters?
  2. Protections required on new "directory entry set" allocations. Deallocations should be similar to cluster bitmap deallocations; it shouldn't be an issue if Chris is deallocating a directory entry set and Aaron is searching for one but doesn't see Chris' deallocation.
  3. The allocation unit size is 32 bytes. A "directory entry set" requires a minimum of 3 allocation units (for names shorter than 16 characters) and a maximum of 19 (for names up to 255 characters); see the sizing sketch after this list.
  4. Allocations are allowed to cross cluster boundaries. A "directory entry set" does not have to be contained in a single cluster. Similar to writing to a file - there's no restriction that data being written must not cross a cluster boundary.
  5. Adding a file/directory to a directory can cause the directory to grow. Same contiguity rules apply as for files.
  1. Cache.
    1. Mechanism to cause cache persistence for updates.
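
A minimal sketch of the sizing rule in item 3 above, assuming one File entry, one Stream Extension entry, plus one File Name entry per 15 UTF-16 characters of name:

/* Number of 32-byte directory entries in a Directory Entry Set for a given
 * file-name length (in UTF-16 characters): 3 for names of 15 characters or
 * fewer, 19 for the 255-character maximum. */
unsigned des_entry_count(unsigned name_len_utf16)
{
    unsigned name_entries = (name_len_utf16 + 14u) / 15u;   /* ceil(len / 15) */
    return 2u + name_entries;
}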

exFAT redesign plan 8/20/24

  • Capturing diagrams of filesystem metadata and modifications therein due to user activities:

\\nx3000\Deos\cpow\FileSystem\exFAT.drawio

exFAT redesign plan 8/14/24

  • Tasks:
    1. Shayne: Investigate Sourcetrail for call graphs.
    2. Shayne: Work vfile interface task.
    3. Chris: Create repository location, empty repository. Work Format utility.
  • Goals:
    1. Create unencumbered DDCI version of exFAT.
    2. Make it multi-user capable, minimal blocking, from the outset.
  • Things to avoid:
    1. Using same file names?
    2. Copy/paste from existing source. Greg Rose advised [if my memory is correct] that having a copy of the original code open to use as a reference is fine, but any sort of copying from it should be avoided [ideally by not opening it in the first place].
  • Some places to start:
    1. vfile interface - get exFAT able to call vfile open, fstat, pread, pwrite, close (others?)
    2. Format utility (mkfs) - create the filesystem base metadata.
    3. mount - parse and interpret existing filesystem metadata.
      • These would cover/include creation of and parsing of metadata like superblock, cluster bitmap, FAT, root directory cluster, volume boot record (VBR), upcase table.
  • Follow-on:
    1. FUSE interface - get DVMS able to call stub filesystem functions.
    2. PRL - this could largely be copied from the existing code, as it is mostly our own original content. But some changes may be necessary because it references structures defined in exFAT header files, which we will need to rewrite.
      • Also, consider opportunities to improve the multi-user concurrency algorithms (directory slot reservations, FAT updates, ...)
    3. C++ classes (not hard-over on using C++, but some object-orientedness might be helpful):
      • Cluster bitmap - routines to initialize, find empty, mark used, mark empty, ...
      • FAT - routines to initialize, read entry, write entry, ...
      • Directory + directory elements - routines to initialize a directory cluster (zeroize it), add directory entries, remove directory entries, ...
      • Upcase table - perhaps not useful as a class as it is really just a case conversion table, but various things that use it could be grouped.
      • Volume label - perhaps not useful as a class but it is a "special file" we need to handle.
    4. Unmount - close superblock.
    5. Determine logic flow from analysis of existing exFAT code.
      • Functions needed:
        • name lookup - directory traversal
        • utf16/utf8 conversions
        • logging
        • time - functions currently excluded due to the lack of an available time library for 64-bit architectures.
  • Also:
    1. Directory searching should make use of the name hash to short-circuit long file-name comparisons. The existing exFAT has the hash algorithm implemented but does not use it in name lookup (see the name-hash sketch at the end of this section).
  • Later...:
    1. Label utility - allow changing of the volume label.
    2. Dump utility - allow analysis of the filesystem metadata.
    3. Fsck utility - allow analysis of and fixing of errors in the filesystem metadata.
      • Repair functions invoked via fsck by a write-allowed client to fix errors.
  • Testing thoughts:
    • All testing uses RAM MAL, potentially a much smaller RAM disk (copy-edit on dvms-ram.pia.xml) to make offload/onload quicker.
    • Use the existing exFAT implementation's mkfs and cross-compare its disk image with one produced by the new exFAT mkfs.
    • Use existing exFAT implementation to mkfs, capture disk image and attempt to mount it with new exFAT.
    • Use DVMS to make filesystem calls, verify with stub MAL that proper data exchange occurs with disk image.
  • FUSE APIs (* = not called by DVMS; could be excluded from redesign):
.getattr    = fuse_exfat_getattr
.fgetattr   = fuse_exfat_fgetattr
.truncate   = fuse_exfat_truncate
.ftruncate  = fuse_exfat_ftruncate
.opendir    = fuse_exfat_opendir
.readdir    = fuse_exfat_readdir
.releasedir = fuse_exfat_releasedir
.open       = fuse_exfat_open
.create     = fuse_exfat_create
.release    = fuse_exfat_release
.flush*     = fuse_exfat_flush
.fsync      = fuse_exfat_fsync
.fsyncdir*  = fuse_exfat_fsync
.read       = fuse_exfat_read
.write      = fuse_exfat_write
.unlink     = fuse_exfat_unlink
.rmdir      = fuse_exfat_rmdir
.mknod*     = fuse_exfat_mknod
.mkdir      = fuse_exfat_mkdir
.rename     = fuse_exfat_rename
.utimens*   = fuse_exfat_utimens
.chmod*     = fuse_exfat_chmod
.chown*     = fuse_exfat_chown
.statfs     = fuse_exfat_statfs
.init       = fuse_exfat_init
.destroy    = fuse_exfat_destroy
  • vfile APIs called by exFAT:
close
fstat
fsync*
lseek*
open
pread*
pwrite*
  • * = vfile API is short-circuited to inode dev_ops because the file descriptor is held open for the lifetime of the volume mount.
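
A minimal sketch of the name-hash short-circuit noted under "Also:" above; all type and helper names here are hypothetical, not the actual on-disk layout:

#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t name_hash;       /* hash of the up-cased name, stored in the DES */
    uint8_t  name_length;     /* name length in UTF-16 characters             */
    uint16_t name[255];       /* up-cased UTF-16 name (flattened for brevity) */
} dir_entry_set_t;

/* Cheap reject on the stored hash before paying for the full comparison. */
int des_name_matches(const dir_entry_set_t *des,
                     const uint16_t *upcased_name, uint8_t len,
                     uint16_t upcased_name_hash)
{
    if (des->name_hash != upcased_name_hash)
        return 0;                                  /* hashes differ: no match  */
    if (des->name_length != len)
        return 0;
    return memcmp(des->name, upcased_name, (size_t)len * sizeof(uint16_t)) == 0;
}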

Notes 8/14/24

  • SATA AHCI
    • Investigation yields potential to use up to 8 PRDT entries per transfer with minimal changes to driver.
    • With one PRDT, 4MB transfer limit. So 8 gives us 32MB transfer limit in one slot with one call.
      • exFAT maximum cluster size is 32MB, so this is a good natural limit.
  • exFAT
    • The cluster bitmap is high density, so there is a high likelihood of multiple users trying to modify the same sector of it.
    • Discussed slot reservation algorithm with Aaron - the window of opportunity for slot over-reservation still exists!
      • Further, the slot reservation algorithm as-is does not protect multi-user in same subfolder - the basher examples do not currently test this, so issues were not observed.
        • check_slot function is a "no-op" for anything but root directory. Root directory has items that are not "files or directories" in it, so the root directory must be read from disk during check_slot.
  • Journal
    • Described plan for discovering whether sectors are currently in play in any "other's" running transaction and "holding off" if so.

Notes 8/7/24

Documentation:

  • Indicate in UG restrictions on multiple-DAL users using the same directory - don't do this.

Concerns in current design:

  • Original FUSE implementation written for single-thread access - a Linux process manages the filesystem much like our CFFS server.
    • Approach to figuring out what needs to be protected to allow multi-client access has been focused on filesystem metadata:
      • Cluster bitmap:
        • This is maintained within PRL. Should be fairly easy to make it multi-core safe?
      • FAT:
        • Currently a vfile system semaphore is used to protect accesses to the FAT. This seems inefficient and/or wrong.
        • exFAT has a "contiguous" optimization that avoids the FAT if a file can be stored contiguously. The dvmssimple and dvmsthroughput examples do not do anything that breaks contiguity. The dvmsbasher examples, thankfully, do.
      • Directories:
        • Currently no example causes directories to have to be grown/shrunk. Potential issue here?
        • I've optimized directory reading - did I make correct assumption? DDCI_PCR:5157
  • Inefficiencies:
    • The cluster table is read/written 4 bytes at a time. Extremely inefficient.
    • Directory entries are read/written 32 bytes at a time. Very inefficient.
      • Options to improve? Both of these cause RMW of a 512-byte sector. Maybe that is the best we can do.
  • Directory searching for open/available slots.
    • Current implementation of algorithm to prevent multiple clients from finding the same slot as available seems to work and seems to "be fair". But:
      • Bug/hole found on Monday where two clients actually did receive the same slot - the downstream bad effect of this was not immediate.
      • It generally allows for "forward progress" to be made by clients trying to find an open slot in the same directory, but is it "the best approach"?
  • SATA AHCI concerns?
    • PRDT maximum transfer limit of 4MB per entry.
      • Currently the client thread transfer buffer size can be set bigger than this. Matt: that means the loop in the MAL needs to be reinstated.
    • Do we need ability to specify multiplier on thread transfer buffers per client?

Geekfest 2024 - State of the Union

  1. dvmsbasher-raw
    • Haven't spent much time on this example as it seems the shorter pole, involving the least number of pieces.
    • Throughput measured is 2.4MBps when configured with two processes, two threads each, interacting with different raw partitions.
      • Assumption: if interacting with the same partition, throughput would be the same. There's no difference - the reads/writes go pretty much straight to hardware, which doesn't care where data is being written to or read from.
  2. dvmsbasher-with-exfat
    • Seems to work fairly well when configured with two processes, two threads each, either interacting with one filesystem volume or two.
      • Throughput measured is pretty paltry when directed at the SATA AHCI MAL - not sure where the hang-ups are yet.
      • ~220KBps (~10% of raw) when four threads writing to independent filesystem volumes. ~120KBps when writing to same filesystem volume. Conclusion: contention in filesystem.
        • This throughput only counts reads/writes. There are many open, close, lseek, mkdir, rmdir, etc. operations that take time but are not counted in throughput. Perhaps a "raw mode" option on the non-raw examples would help - focus on just read/write throughput.
        • There are still some areas in the filesystem where it seems necessary and/or prudent to block while activities are occurring, such as when modifying the FAT. A potential solution is to move the FAT into the PRL (similar to the cluster bitmap) - the current design interacts with the FAT on the disk directly.
        • A profiler would probably be handy here.
  3. dvmsbasher-with-journaled-exfat
    • Seems to work fairly well when configured with two processes, one thread each, interacting with separate filesystems (journaled separately, now that DVMS requires a 1:1 relation between journal and filesystem volume).
    • Has issues pretty much immediately if:
      • Adding threads, or
      • Directing activities from multiple threads (whether or not in same user process) to a single journaled filesystem volume.
        • Probably a design flaw - with the journal there's more in flight in each user's independent space, and one user could be creating something that isn't committed. Anything not committed must be considered not available. So each user must somehow be made aware of the other user's intent, even if that intent never actually materializes, so they can avoid using the same resources.

DVMS SATA AHCI MAL

  1. Update to either use more SATA_CLIENT_RAM or use PIA to generate client resources.
    • get resource size with platformResourceSize
  2. Transaction overlap/ordering?
    • Do we have to protect against transactions that write to same sector from occurring out of order?
    • Can we rely on the hardware to take care of this? According to section 9.3.2 of the SATA AHCI 1.3 documentation, transaction order is maintained across separate writes to the PxCI register, but if multiple slot commands are issued in one register write, their order is arbitrary. For example, if commands in slots 1, 3, and 5 are written at the same time, the order is arbitrary; if slots 3, 1, and 5 are written in that order via 3 separate writes to the PxCI register, that order is maintained.
  3. Client RAM could allow us to avoid spilling transactions(?) into several slots; at least for filesystem. Perhaps not for raw.
  4. Allow users to nail up slots for "higher priority" data transfer?
  5. Might need MAL-specific configuration elements in dvmsconfig?
  6. Transaction retirement -
    • Question: Is completion bit for a slot "sticky" or is it "clear on read"?
      • If "sticky", perhaps nothing needed except to make sure that a client only clears *its* slot bits.
      • If "clear on read", then MAL needs to ensure that the client resources are updated indicating transaction completion for all completed transactions, not just its own.
      • Need to duplicate concept of completed transactions from HW completion register to each client.
    • Answer: Software writes 1s to PxCI bits to issue commands in slots that are ready to be processed; hardware clears each bit when its transaction has completed (see the sketch after this list).
  7. Notes from running the dvmsbasher-with-exfat example with both applications pointed at the /mount/hd2 filesystem volume.
    • Nothing to see here yet - what was previously here turned out to be an error in example code.
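
A minimal sketch of issuing and retiring a command slot via PxCI, per item 6's answer; the register pointer and function names are illustrative. PxCI is write-1-to-issue, so writing a single set bit does not disturb other pending slots:

#include <stdint.h>

/* Issue the command prepared in 'slot' (0..31) on a port's PxCI register. */
static void ahci_issue(volatile uint32_t *pxci, unsigned slot)
{
    *pxci = (1u << slot);                /* writing 0 bits has no effect   */
}

/* The slot is retired once hardware clears its PxCI bit. */
static int ahci_slot_retired(volatile uint32_t *pxci, unsigned slot)
{
    return ((*pxci) & (1u << slot)) == 0;
}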

DVMS

  1. Check whether ACLs apply to raw partition access? They should, but I'm thinking they don't.
    • They didn't... they now do.

exFAT

  1. Huge performance issue found: DDCI_PCR:5157

PRLs

  1. DONE - PIAfy the PRL resources, remove add from FPs.