VAST is a parallel file system designed to deliver high performance to applications that typically perform poorly on traditional parallel file systems such as Lustre. It accomplishes this through a combination of tiered flash and persistent memory, new data layout and protection algorithms, and high-performance network protocols for I/O.
The VAST file system is not a production resource
The VAST file system at NERSC is an experimental resource that will only be available for a limited time. Do not plan to store any important data there for long periods of time, because the file system may be taken offline, rebooted, modified, etc., without notice.
Its speeds and feeds are as follows:
| Metric | Value |
|---|---|
| Usable Capacity | 533 TB |
| Read Bandwidth | 40 GB/s |
| Write Bandwidth | 5 GB/s |
| Random Read IOPS | 338,000 |
| Total NVMe Drives | 44 |
| Total Optane SSDs | 12 |
The VAST file system is available on all Cori GPU nodes but not on other parts of Cori, including its login nodes. To get access, fill out the NERSC VAST Account Request form; if your request is approved, a directory will be created for you at `/vast/$USER/`. You can treat this directory like your space on any scratch file system, with the following caveats:
- You will not be able to access `/vast` from login nodes, so copy important data out of your VAST directory at the end of your job if you wish to interact with it outside of jobs. Your data will generally persist on this file system until you delete it, though, so it will be waiting for you when your next Cori GPU job begins.
- Performing shared-file I/O requires special consideration because VAST does not guarantee strict consistency as Cori Scratch and the Community File System do. Specifically, the contents of a file are only guaranteed to be consistent across all nodes once that file is closed, so opening one file on two nodes and writing to it may produce unexpected results. See the Consistency section below for more information.
The VAST file system is optimized for read-intensive, metadata-intensive, and small-I/O workloads rather than for pure bandwidth. For example, read and write bandwidth differ by only 25% between 2880-byte operations and 4 MiB operations.
VAST does favor higher concurrency though, so writing from as many different threads or MPI processes as possible will yield the best performance.
Whether this parallelism is on-node or across nodes is less important; reading eight files from a single node or one file per eight nodes will result in roughly the same aggregate performance.
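As a rough illustration of on-node concurrency, the following minimal sketch reads several files at once from a pool of threads. The function names, the 4 MiB chunk size, and the default of eight workers are illustrative assumptions, not NERSC recommendations, and the demo uses throwaway local files rather than a real VAST directory.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read call; an arbitrary choice


def read_file(path, chunk_size=CHUNK_SIZE):
    """Read one file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_concurrently(paths, workers=8):
    """Read every file in `paths` with up to `workers` threads.

    Returns the total number of bytes read across all files.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_file, paths))


if __name__ == "__main__":
    # Demo on temporary local files; on the real system the paths would
    # live under your VAST directory instead (an assumption of this sketch).
    with tempfile.TemporaryDirectory() as scratch:
        paths = []
        for index in range(8):
            path = os.path.join(scratch, f"part{index}.bin")
            with open(path, "wb") as handle:
                handle.write(os.urandom(1024))
            paths.append(path)
        print(read_concurrently(paths))  # total bytes read across all files
```

Threads are used here because blocking file reads release Python's global interpreter lock; the same pattern applies to separate processes or MPI ranks.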
VAST uses a high-performance NFS client and has close-to-open consistency, meaning the contents of a file are only guaranteed to be consistent after it has been closed. For this reason, it is not safe to write to a single file from multiple nodes unless you take the following steps:
- Do not write to overlapping parts of the same file
- Ensure all writes are aligned on 4 KiB boundaries within a shared file
If you accidentally violate these rules, your file may contain chunks of NULL bytes (`0x00`) where data should be.
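One way to satisfy both rules is to plan each writer's byte range up front. The sketch below is a hypothetical helper, not part of any VAST or NERSC tooling: it splits a file of known size into contiguous, non-overlapping ranges whose starting offsets fall on 4 KiB boundaries. It only plans the ranges; each writer is still responsible for issuing its writes inside its own range.

```python
ALIGNMENT = 4096  # 4 KiB, the boundary named in the rules above


def aligned_ranges(file_size, num_writers, alignment=ALIGNMENT):
    """Split [0, file_size) into one (offset, length) range per writer.

    Ranges are contiguous and never overlap, and every non-empty range
    starts on a multiple of `alignment`. Only the range that reaches the
    end of the file may have a length that is not a multiple of
    `alignment`. Writers beyond the available blocks get empty ranges.
    """
    if file_size < 0 or num_writers <= 0 or alignment <= 0:
        raise ValueError("file_size must be >= 0; num_writers and alignment > 0")

    total_blocks = -(-file_size // alignment)  # ceiling division
    base, extra = divmod(total_blocks, num_writers)

    ranges = []
    offset = 0
    for writer in range(num_writers):
        blocks = base + (1 if writer < extra else 0)
        # Clip the final range so it ends exactly at the end of the file.
        length = min(blocks * alignment, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    print(aligned_ranges(10000, 3))
    # [(0, 4096), (4096, 4096), (8192, 1808)]
```

In an MPI program, each rank would look up its own entry by rank number and write only within that range.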
It is safe to read the same file from multiple nodes, and reading from one node while it is written by another node will generally work, although the reader will not always see the latest changes to the file.
If you must write to a single file from multiple nodes, you have two options:
- Use MPI-IO with collective buffering enabled. This will ensure that I/Os are aligned on 4 KiB boundaries.
- Open the file using the `O_DIRECT` option. This forces all I/O to go back to the file system servers before the reads/writes return, ensuring that everyone always sees a consistent view of the file. This may reduce performance.
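The second option can be sketched in Python on Linux, where `os.O_DIRECT` is available. Direct I/O commonly requires the buffer, offset, and length to be aligned, so this sketch conservatively pads each write to a multiple of 4 KiB and uses an anonymous `mmap` region as a page-aligned buffer. The helper names and the padding policy are assumptions made for illustration; the exact alignment requirements on VAST are not stated here.

```python
import mmap
import os

BLOCK = 4096  # 4 KiB; assumed to satisfy the direct I/O alignment rules


def padded_length(nbytes, block=BLOCK):
    """Round `nbytes` up to the next multiple of `block`."""
    return -(-nbytes // block) * block


def write_direct(path, payload, offset=0):
    """Write `payload` at `offset` using O_DIRECT (Linux only).

    The payload is zero-padded to a multiple of BLOCK, so the bytes
    written (and the file, if this is its last block) may extend past
    len(payload). Returns the number of bytes written.
    """
    if offset % BLOCK:
        raise ValueError("offset must be a multiple of BLOCK")
    if not payload:
        return 0

    length = padded_length(len(payload))
    # An anonymous mapping is page-aligned and zero-filled.
    buffer = mmap.mmap(-1, length)
    buffer[: len(payload)] = payload

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        return os.pwrite(fd, buffer, offset)
    finally:
        os.close(fd)
        buffer.close()
```

Some file systems (for example `tmpfs`) reject `O_DIRECT`, in which case `os.open` raises `OSError`. If the trailing zero padding is unwanted, the file can be truncated to its true size once all writers have finished.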