Using GHI at NERSC

Introduction

GPFS/HPSS Interface (GHI) is a way of interacting with the HPSS tape archive that offers the benefits of a more familiar file system interface. Users often want to store complex directory structures or large bundles in the tape archive, which can be difficult to do with the traditional HPSS access tool. GHI can be used to easily move data between the GPFS file system and the tape archive with a few simple commands.

Tip

Files archived/migrated into the tape archive with GHI can’t be easily accessed with other HPSS tools like ftp, hsi, or htar. The assumption is that you will continue to use GHI to manage these files.

GHI is best thought of as a merging of two file systems into a single namespace: if you delete a GHI-controlled file on the GPFS file system, it will also be deleted from the tape archive.

Warning

This installation of GHI at NERSC is experimental. Hardware backing GHI is not designed for production use, and thus does not have the same safeguards that are present in the production systems. Please don’t store any unique data in GHI and please let us know first if you’re going to try to access it at large scales.

Example Use Cases

Large Volumes of Infrequently Accessed Data

A group has hundreds of TBs of data that is accessed only twice a year for reanalysis. The group doesn’t want to keep this data on GPFS between analyses because it’s very large, so they use ghi put to copy the files into the tape archive. They then use ghi punch to remove all but the first few bytes from GPFS and free up space in their directories. When they’re ready to analyze the data, they bring it back to GPFS using ghi stage. They can check which file system the data is on using ghi ls.

Archiving Complex Directory Structures

A project would like to archive many thousands of files that range in size from a few bytes to hundreds of GBs (some files are too large to be bundled by htar). These files are organized in a complex directory structure that must be maintained in order to track the material in the files. The aggregate size of this data is around 100TB. Normally, they’d have to use tar to bundle each directory together (or parts of directories, to keep the bundles around 500GB) and then put the bundles into the tape archive. Instead, they use the ghi put command to put the entire directory into the tape archive.

Using GHI

Access

Currently you can only access GHI on NERSC's data transfer nodes using the /global/projectm file system. If you don’t have a usable directory there, let us know and we’ll make one. You can reach the data transfer nodes via ssh to dtn0[1-4].nersc.gov.

Commands

GHI can be accessed via the ghi command. This command should already be in your path on the NERSC data transfer nodes.

Usage

The ghi command requires an “action” argument. The possible choices are:

  • ls: show what file system the files are currently on
  • put: copy the files to the tape archive
  • stage: move the files back from the tape archive to spinning disk
  • punch: move the files to only the tape archive (leaving a stub behind on GPFS)
  • pin: keep the files from being removed from spinning disk
  • lock: keep the files from being removed using rm or otherwise modified

GHI ls

This is used to tell you which file system the files are on:

  • “G” means only on GPFS
  • “H” means only on the tape archive (with only a small stub remaining on GPFS)
  • “B” means on both GPFS and the tape archive

Also, if a file is “pinned” there will be a “P” shown when listing.
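When a directory holds many files, it can help to tally the state letters rather than read them line by line. The sketch below assumes the default ghi ls output format shown in the examples on this page (a state field such as “B” or “BP”, then the path); output produced with extra flags such as -l may look different and is not handled. The function name ghi_ls_summary is a hypothetical helper, not part of GHI.

```shell
# Summarize default `ghi ls` output read from stdin.
# Assumption: each line looks like "B  /path" or "BP /path".
# ghi_ls_summary is a hypothetical helper name, not a GHI command.
ghi_ls_summary() {
    awk '
        { state = substr($1, 1, 1) }       # first letter: G, H, or B
        state == "G"     { gpfs++ }
        state == "H"     { tape++ }
        state == "B"     { both++ }
        $1 ~ /^[GHB]P$/  { pinned++ }      # trailing P marks a pinned file
        END {
            printf "GPFS only: %d\n", gpfs
            printf "Tape only: %d\n", tape
            printf "Both:      %d\n", both
            printf "Pinned:    %d\n", pinned
        }
    '
}
```

For example, ghi ls /global/projectm/projectdirs/nstaff/elvis | ghi_ls_summary would print one count per state.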

You can also use extra flags to see more information:

  • -a: Show hidden files
  • -l: Long listing (same as “-l” for ls)
  • -e: Show tape position
  • -w: Show name of file in the tape archive
  • -n: Same as “-l” but use numeric user and group ID
  • -?: Print List Help

ghi ls can also be used to display the name of the file in the tape archive (note it is a UUID that’s used internally for system tracking).

List one file

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
B  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt

The “B” means the file is on both GPFS and the tape archive (“H” is tape only, and “G” is GPFS only)

List many files

To list many files, you can either list them on the command line or put them in a text file. You can also list every file in a directory:

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis

To recursively list all directories, add a wildcard (*) at the end:

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/*

List via text file

To list files in an input text file, use the “-f” option:

dtn> cat /global/projectm/projectdirs/nstaff/elvis/my_list.txt
/global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
B  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
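For large directory trees, the list file can be generated rather than written by hand. The sketch below assumes the list format is one full path per line, as the single-entry example above suggests. The helper name make_ghi_list is hypothetical and uses only standard tools (find, sort), not GHI itself.

```shell
# Write one absolute path per line for every regular file under a directory.
# make_ghi_list is a hypothetical helper name, not a GHI command.
# Write the list file outside the directory being scanned, otherwise the
# list may end up containing its own path.
make_ghi_list() {
    dir=$1
    list=$2
    # Resolve to an absolute path so entries match the full-path style above.
    abs=$(cd "$dir" && pwd) || return 1
    find "$abs" -type f | sort > "$list"
}
```

The resulting file could then be passed to ghi ls -f, for example after running make_ghi_list /global/projectm/projectdirs/nstaff/elvis/data "$HOME/my_list.txt" (both paths here are illustrative).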

GHI put

This is used to put a file or set of files into the tape archive.

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/test.txt
G  /global/projectm/projectdirs/nstaff/elvis/test.txt
dtn> ghi put /global/projectm/projectdirs/nstaff/elvis/test.txt
dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/test.txt
B  /global/projectm/projectdirs/nstaff/elvis/test.txt

You can also put many files using wildcards (*) or by putting them into a separate text file and using the -f flag following the syntax described in the ghi ls section.

GHI stage

This is used to bring a file or set of files back from the tape archive.

Note

The ghi stage command is asynchronous. This means that the prompt could return before all files have been retrieved from the tape archive. You can check the progress of a stage using the ghi ls command.

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/test.txt
H  /global/projectm/projectdirs/nstaff/elvis/test.txt

dtn> ghi stage /global/projectm/projectdirs/nstaff/elvis/test.txt
option -t t 0
Will not wait for results after posting stage request(s) to SD...
Staging 1 file...
Splitting the scan results for processing.
Grouping the scan results by media.
Sorting the results by media order.
Setting up socket to SD.
Sending stage request(s) to the SD.
All stage requests acknowledged by SD.

Temporary files/directories left behind are /global/projectm/scratch/.ghi/ghi_stage.ngfsv491.nersc.gov.16869
They may be removed with the command: rm -r /global/projectm/scratch/.ghi/ghi_stage.ngfsv491.nersc.gov.16869

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/test.txt
B  /global/projectm/projectdirs/nstaff/elvis/test.txt

You can also stage many files using wildcards (*) or by putting them into a separate text file and using the -f flag following the syntax described in the ghi ls section.
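Because ghi stage returns before retrieval finishes, a script that needs the data on GPFS has to wait for it. One way to do that, sketched here under assumptions, is to poll ghi ls -f until no file in the list is still reported as “H”. The helper name wait_for_stage, the 60-second interval, and the retry limit are all hypothetical choices, and the sketch assumes the default ghi ls output format and that ghi exits with a nonzero status on failure.

```shell
# Poll `ghi ls -f LIST` until no file is reported as tape only ("H").
# wait_for_stage is a hypothetical helper; interval and tries are arbitrary.
wait_for_stage() {
    list=$1
    interval=${2:-60}     # seconds between checks
    tries=${3:-120}       # maximum number of checks
    while [ "$tries" -gt 0 ]; do
        # Assumption: ghi returns nonzero if the listing itself fails.
        out=$(ghi ls -f "$list") || return 2
        # Count lines whose state field starts with H.
        remaining=$(printf '%s\n' "$out" |
            awk 'substr($1, 1, 1) == "H" { n++ } END { print n + 0 }')
        if [ "$remaining" -eq 0 ]; then
            echo "All listed files are on GPFS."
            return 0
        fi
        echo "$remaining file(s) still on tape only; next check in ${interval}s."
        sleep "$interval"
        tries=$((tries - 1))
    done
    echo "Gave up waiting for the stage to finish." >&2
    return 1
}
```

A typical use would be to run ghi stage -f on a list file and then call wait_for_stage on the same list before starting the analysis.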

GHI punch

This is used to remove a file or set of files from GPFS. The file must be put into the tape archive before it can be punched. When a file is punched, only a small stub is left behind on the GPFS file system. This is usually done to free up space for other more frequently used data. You must have write access to a file in order to punch it.

Note

Files must be put into the tape archive using ghi put before they can be punched.

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
B  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt

dtn> ghi punch /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt

dtn> ghi ls /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
H  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt

You can also punch many files using wildcards (*) or by putting them into a separate text file and using the -f flag following the syntax described in the ghi ls section.
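Since a file has to be in the tape archive before it can be punched, a script may want to check the state first. The sketch below inspects the first letter reported by ghi ls and only calls ghi punch when the file is on both GPFS and tape. The helper name punch_if_archived and its return codes are hypothetical, and the parsing assumes the default ghi ls output format shown in the examples.

```shell
# Punch a file only if `ghi ls` reports it on both GPFS and tape ("B").
# punch_if_archived is a hypothetical helper name, not a GHI command.
punch_if_archived() {
    file=$1
    # Take the first letter of the state field from the first output line.
    state=$(ghi ls "$file" | awk 'NR == 1 { print substr($1, 1, 1) }')
    case $state in
        B) ghi punch "$file" ;;
        H) echo "Already on tape only: $file" ;;
        G) echo "Not yet archived, run ghi put first: $file" >&2
           return 1 ;;
        *) echo "Unrecognized ghi ls output for: $file" >&2
           return 2 ;;
    esac
}
```

Calling punch_if_archived on a file that is still marked “G” prints a reminder and leaves the file untouched.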

GHI pin

You can use ghi pin to keep files on the GPFS file system. A pinned file will remain on the GPFS file system even if you punch it later. Pinned files have an extra "P" in the output when you list them with ghi ls. You can unpin files by invoking ghi pin with the -u flag.

In the example below, the pinned file remains on both the tape archive and GPFS even after a punch command is issued. You can use ghi pin with either a list of files on the command line or -f for a text file list following the syntax described in the ghi ls section.

dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
B  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
B  /global/projectm/projectdirs/nstaff/elvis/blah.txt

dtn> ghi pin /global/projectm/projectdirs/nstaff/elvis/blah.txt

dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
B  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
BP /global/projectm/projectdirs/nstaff/elvis/blah.txt

dtn> ghi punch -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt

dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
H  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
BP /global/projectm/projectdirs/nstaff/elvis/blah.txt

Unpinning a file

You unpin a file by adding the -u flag to the pin command.

dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
H  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
BP /global/projectm/projectdirs/nstaff/elvis/blah.txt

dtn> ghi pin /global/projectm/projectdirs/nstaff/elvis/blah.txt -u

dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
H  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
B  /global/projectm/projectdirs/nstaff/elvis/blah.txt

dtn> ghi punch -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt

dtn> ghi ls -f /global/projectm/projectdirs/nstaff/elvis/my_list.txt
H  /global/projectm/projectdirs/nstaff/elvis/strace_ghi.txt
H  /global/projectm/projectdirs/nstaff/elvis/blah.txt

GHI lock

You can use ghi lock to prevent a file from being removed by commands like rm or otherwise modified. This is intended to prevent accidental deletion of a file from the tape system.

Note

Removing a file from the GHI system with rm will remove it from both GPFS and the tape archive.

You unlock a file by adding the -u flag to the lock command.

dtn> ghi lock /global/projectm/projectdirs/nstaff/elvis/hold.txt
dtn> rm /global/projectm/projectdirs/nstaff/elvis/hold.txt
rm: cannot remove ‘/global/projectm/projectdirs/nstaff/elvis/hold.txt’: Read-only file system
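To protect or release a whole set of files, the lock command can be applied to each path in a list file in turn. The sketch below calls ghi lock once per file, since this page does not say whether ghi lock accepts -f. Placing -u after the path mirrors the ghi pin example above and is an assumption for ghi lock. The helper name ghi_lock_list is hypothetical.

```shell
# Lock (default) or unlock every path named in a list file, one call per file.
# ghi_lock_list is a hypothetical helper name, not a GHI command.
ghi_lock_list() {
    list=$1
    mode=${2:-lock}                    # "lock" or "unlock"
    while IFS= read -r target; do
        [ -n "$target" ] || continue   # skip blank lines
        if [ "$mode" = "unlock" ]; then
            # Assumption: -u after the path, as in the "ghi pin <file> -u" example.
            ghi lock "$target" -u < /dev/null || return 1
        else
            ghi lock "$target" < /dev/null || return 1
        fi
    done < "$list"
}
```

Running ghi_lock_list my_list.txt locks each listed file, and ghi_lock_list my_list.txt unlock releases them again. The redirect from /dev/null keeps each ghi call from reading the list that the loop is consuming.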