NVM Express (NVMe), previously known as the Non-Volatile Memory Host Controller Interface (NVMHCI), is a host controller interface designed for communicating with solid-state storage devices across a PCIe fabric. The current Synopsys NVMe Verification IP (VIP) is a comprehensive testing vehicle that consists of two main subsystems: the SVC (System Verification Component) and the SVT (System Verification Technology). The SVC layers correspond to the actual NVMe (and PCIe, etc.) protocol layers, while the SVT provides the verification methodology interface to UVM and to other methodologies such as VMM and OVM.
You can learn more about Synopsys' VC Verification IP for NVMe, as well as for PCIe and M-PHY, on the Synopsys website.
Although the VIP supports multiple versions of the NVMe specification, we will initially be version agnostic, speaking in generalities about the protocol in order to provide a 10,000-foot view of it and of its support in the VIP. Future discussions will delve deeper into particular details of NVMe and features of the Verification IP.
Unlike PCIe, where the root and endpoint are essentially equals, NVMe defines an asymmetric relationship that is closer to that of other storage protocols (e.g. SATA, Fibre Channel): the host issues commands and the controller services them.
An NVMe command (e.g. Identify, Read, Write) is initiated at the host and converted to an NVMe request, which is then appended to a particular submission queue that lives in host memory. Once the command is inserted into a queue, the host writes to a per-queue doorbell register on the controller (controllers live on PCIe endpoints). This doorbell write wakes up the controller, which then probes the queue for the new request(s). It reads the queue entry, executes the command (potentially reading data buffers from host memory), appends a completion entry to a completion queue, and finally notifies the host via an interrupt. The host wakes up, pops that completion off the queue, and returns the results to the user.
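To make that flow concrete, here's a minimal C sketch of the host side of the submission step. The names (nvme_sq_entry_t, host_sq_t, nvme_submit) are illustrative placeholders of our own, not the VIP's API or any real driver interface.

#include <stdint.h>
#include <string.h>

/* Placeholder for the 64-byte submission queue entry (expanded later on). */
typedef struct { uint32_t dword[16]; } nvme_sq_entry_t;

/* Hypothetical host-side state for one submission queue. */
typedef struct {
    nvme_sq_entry_t   *entries;   /* base of the circular queue in host memory */
    uint16_t           size;      /* number of slots in the queue              */
    uint16_t           tail;      /* host-managed producer index               */
    volatile uint32_t *doorbell;  /* memory-mapped SQ tail doorbell register   */
} host_sq_t;

/* Append one command to the submission queue and ring the doorbell. */
static void nvme_submit(host_sq_t *sq, const nvme_sq_entry_t *cmd)
{
    /* 1. Build the request in the next free slot of the queue. */
    memcpy(&sq->entries[sq->tail], cmd, sizeof(*cmd));

    /* 2. Advance the tail, wrapping because the queue is circular. */
    sq->tail = (uint16_t)((sq->tail + 1) % sq->size);

    /* 3. Wake the controller by writing the new tail to its doorbell.
     *    From here the controller pulls the entry, executes it, and
     *    eventually posts a completion entry and an interrupt.        */
    *sq->doorbell = sq->tail;
}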
There are two main types of queues that are used: Admin queues, used for device management and control commands, and I/O queues, used for the actual data-transfer commands such as Read and Write. Each type uses both submission queues (carrying requests from the host to the controller) and completion queues (carrying responses back).
Each queue has both a tail (producer) pointer and a head (consumer) pointer. The tail pointer points to the next free slot in the queue. After the producer adds an entry, it increments the tail pointer, wrapping back to zero once it reaches the end of the queue (all of these queues are circular). The queue is considered empty when the head and tail pointers are equal.
The consumer uses the head pointer to determine where to start reading from the queue; after examining the tail pointer and determining that the queue is non-empty, it increments the head pointer after reading each entry.
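In code form, that producer/consumer arithmetic looks roughly like the sketch below (made-up helper names, 16-bit indices, and a queue of qsize slots are assumed):

#include <stdbool.h>
#include <stdint.h>

/* Producer: after adding an entry, advance the tail with wrap-around. */
static uint16_t advance_tail(uint16_t tail, uint16_t qsize)
{
    return (uint16_t)((tail + 1) % qsize);   /* wraps back to zero at the end */
}

/* The queue is empty when the head and tail pointers are equal.
 * (Conversely, it is full when the tail is one slot behind the head,
 * so one slot always goes unused.)                                    */
static bool queue_empty(uint16_t head, uint16_t tail)
{
    return head == tail;
}

/* Consumer: starting at the head, read entries until the queue is empty. */
static uint16_t drain_queue(uint16_t head, uint16_t tail, uint16_t qsize)
{
    while (!queue_empty(head, tail)) {
        /* ... process the entry at index 'head' here ... */
        head = (uint16_t)((head + 1) % qsize);
    }
    return head;   /* the new head value after consuming everything */
}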
The submission queue's tail pointer is managed by the host: after one or more entries have been pushed into the queue, the updated tail pointer is written to the controller via that queue's submission queue doorbell register. The controller maintains the head pointer and begins to read the queue once notified of the tail pointer update. It can continue reading until the queue is empty. As it consumes entries, it updates its head pointer and reports the new value back to the host via completion queue entries (see below).
Similarly, the completion queue's tail pointer is managed by the controller, but unlike the host, the controller keeps only a private copy of it. The only indication that a new completion queue entry has been posted is a bit inside the entry itself (the Phase Tag) that the host can poll. Once the host determines an entry is available, it reads that entry and updates the head pointer. The controller is notified of head pointer updates by host writes to the completion queue doorbell register.
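The Phase Tag is inverted by the controller on every pass through the circular completion queue, so the host only has to compare it against an expected value. Here's a simplified C sketch of host-side completion reaping; the completion-entry type and its field packing are placeholders of our own rather than a definitive encoding.

#include <stdbool.h>
#include <stdint.h>

/* Simplified 16-byte completion entry; the low bit of 'status' is treated
 * as the Phase Tag in this sketch.                                        */
typedef struct {
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;    /* controller's current submission queue head pointer */
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;     /* phase tag in bit 0, completion status above it     */
} nvme_cq_entry_t;

/* Hypothetical host-side state for one completion queue. */
typedef struct {
    volatile nvme_cq_entry_t *entries;   /* circular queue in host memory     */
    uint16_t                  size;
    uint16_t                  head;      /* host-managed consumer index       */
    uint16_t                  phase;     /* phase value expected for new CQEs */
    volatile uint32_t        *doorbell;  /* memory-mapped CQ head doorbell    */
} host_cq_t;

/* Poll for one completion; returns true and copies it out when one arrives. */
static bool nvme_poll_completion(host_cq_t *cq, nvme_cq_entry_t *out)
{
    volatile nvme_cq_entry_t *cqe = &cq->entries[cq->head];

    /* A new entry is present only when its phase bit matches what we expect. */
    if ((cqe->status & 1u) != cq->phase)
        return false;

    *out = *(const nvme_cq_entry_t *)cqe;

    /* Advance the head; on wrap-around the expected phase flips. */
    if (++cq->head == cq->size) {
        cq->head   = 0;
        cq->phase ^= 1u;
    }

    /* Tell the controller how far we've consumed via the CQ head doorbell. */
    *cq->doorbell = cq->head;
    return true;
}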
Note that all work done by an NVMe controller is either pulled into or pushed out of the controller by the controller itself. The host merely places work into host memory and rings the doorbell ("you've got submission entries to handle"), then later collects results from the completion queue, again ringing a doorbell ("I'm done with these completion entries"). The controller is therefore free to work in parallel with the host; for example, there is no required ordering of completions, and the controller can order its work any way it sees fit.
So what are these queue entries that we’re moving back and forth between host and controller?
The first is the Submission Queue Entry, a 64-byte data structure that the host uses to transmit command requests to the controller:
Bytes | Description
63:40 | Command Dwords 15-10 (CDW15-10): Six dwords of command-specific information.
39:32 | PRP Entry 2 (PRP2): Pointer to the second PRP entry or buffer, or (in conjunction with PRP1) part of the SGL segment descriptor.
31:24 | PRP Entry 1 (PRP1): Pointer to the first PRP entry or buffer, or (in conjunction with PRP2) part of the SGL segment descriptor.
23:16 | Metadata Pointer (MPTR): The address of an SGL segment or of a contiguous buffer containing metadata.
15:08 | Reserved
07:04 | Namespace Identifier (NSID): The namespace that this command applies to.
03:00 | Command Dword 0 (CDW0): Common to all commands; contains the Command Opcode (OPC), Command Identifier (CID), and various control bits.
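For reference, here is a C-style view of that 64-byte layout. It is an illustrative sketch that follows the table above, not the VIP's internal data model.

#include <stdint.h>

/* 64-byte Submission Queue Entry, laid out per the table above. */
typedef struct {
    uint32_t cdw0;         /* bytes 03:00 - opcode, command identifier, control bits */
    uint32_t nsid;         /* bytes 07:04 - namespace identifier                     */
    uint64_t reserved;     /* bytes 15:08                                            */
    uint64_t mptr;         /* bytes 23:16 - metadata pointer                         */
    uint64_t prp1;         /* bytes 31:24 - PRP entry 1 (or first half of the SGL)   */
    uint64_t prp2;         /* bytes 39:32 - PRP entry 2 (or second half of the SGL)  */
    uint32_t cdw10_15[6];  /* bytes 63:40 - command dwords 10 through 15             */
} nvme_sq_entry_t;

/* CDW0 carries the opcode in bits 7:0 and the command identifier in bits 31:16
 * (the remaining bits hold the various control fields mentioned above).        */
static inline uint32_t build_cdw0(uint8_t opcode, uint16_t cid)
{
    return (uint32_t)opcode | ((uint32_t)cid << 16);
}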
One submission queue entry per command is enqueued to the appropriate Admin or I/O queue. The Opcode specifies the particular command to execute and the Command Identifier is a unique identifier for a command (when combined with the Submission Queue ID).
In addition to using queue entries to move information back and forth, the host can also allocate data buffers in host memory. These buffers can either be contiguous (defined by a base address and length) or scattered throughout memory. The latter case uses data structures called PRP (Physical Region Page) lists and scatter-gather lists (SGLs) to define the buffer locations. When the host needs to move these buffers to or from the controller (e.g. for a Read or Write command), it allocates the appropriate data structure in host memory and writes information about it into the PRP1 and PRP2 fields above before writing the queue entry for that command.
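As a rough sketch of the PRP side of that bookkeeping, here's how a host might fill PRP1/PRP2 for a buffer made up of scattered, page-aligned pages. This sketch assumes whole pages and a PRP list that fits in a single page; the full rules in the specification also cover offsets into the first page and chained PRP lists.

#include <stdint.h>
#include <stddef.h>

/* Describe a data buffer consisting of 'npages' scattered, page-aligned host
 * pages (physical addresses in page_phys[]) using the PRP1/PRP2 fields.
 * 'prp_list' is an array in host memory that the controller will fetch from
 * the physical address 'prp_list_phys'.                                      */
static void fill_prps(const uint64_t *page_phys, size_t npages,
                      uint64_t *prp_list, uint64_t prp_list_phys,
                      uint64_t *prp1, uint64_t *prp2)
{
    *prp1 = page_phys[0];                 /* first data page                   */

    if (npages == 1) {
        *prp2 = 0;                        /* unused for a single-page transfer */
    } else if (npages == 2) {
        *prp2 = page_phys[1];             /* second page fits directly in PRP2 */
    } else {
        /* Three or more pages: build a PRP list in host memory and point
         * PRP2 at it; the controller reads the list to find the other pages. */
        for (size_t i = 1; i < npages; i++)
            prp_list[i - 1] = page_phys[i];
        *prp2 = prp_list_phys;
    }
}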
Metadata (e.g. end-to-end data protection) can also be passed along with the NVMe commands, in two ways. It can be sent either in-band with the data (i.e. it is contiguous with the data, per sector), or out-of-band (i.e. it is sent as a separate data stream). In SCSI parlance these are known as Data Integrity Field (DIF) and Data Integrity Extension (DIX), respectively. The latter of these uses the Metadata Pointer described above. We’ll discuss this in detail in future episodes.
When we actually write to or read from the non-volatile storage on the controller, we address namespaces. Other storage technologies have analogous containers, for example LUNs in SCSI. Namespaces can be unique to a controller or shared across multiple controllers. Either way, the Namespace ID field in the request determines which namespace is being accessed. Some commands don't use the namespace field (it is then set to 0), while others may need to operate on all namespaces (the Namespace ID is then set to 0xffff_ffff).
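In code, the two special NSID values called out above might be captured like this (the constant and function names are our own, not from the specification):

#include <stdint.h>

#define NVME_NSID_UNUSED 0x00000000u   /* command does not use the namespace field */
#define NVME_NSID_ALL    0xffffffffu   /* command applies to all namespaces        */

/* Pick the NSID to place in a request: a specific namespace, or one of the
 * special values above.                                                     */
static inline uint32_t nsid_for(uint32_t specific, int applies_to_all)
{
    return applies_to_all ? NVME_NSID_ALL
                          : (specific ? specific : NVME_NSID_UNUSED);
}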
On the completion side, there is an analogous data structure, the Completion Queue Entry:
Bytes | Description
15:14 | Status Field (SF) and Phase Tag (P): The completion status of the command, plus the Phase Tag bit that the host polls to detect new entries.
13:12 | Command Identifier (CID): The identifier of the command being completed (unique per submission queue).
11:10 | Submission Queue Identifier (SQID): The submission queue to which the associated command was submitted (16 bits).
09:08 | Submission Queue Head Pointer (SQHD): The controller's current head pointer for that submission queue.
07:04 | Reserved
03:00 | Command Specific Information: One dword of returned information (not always used).
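And the corresponding C-style view of the 16-byte completion entry, again just a sketch mirroring the table, with the Phase Tag packed into the low bit of the status halfword:

#include <stdint.h>

/* 16-byte Completion Queue Entry, laid out per the table above. */
typedef struct {
    uint32_t result;     /* bytes 03:00 - command-specific information        */
    uint32_t reserved;   /* bytes 07:04                                       */
    uint16_t sq_head;    /* bytes 09:08 - controller's SQ head pointer        */
    uint16_t sq_id;      /* bytes 11:10 - submission queue identifier         */
    uint16_t cid;        /* bytes 13:12 - command identifier                  */
    uint16_t status;     /* bytes 15:14 - phase tag (bit 0) plus status field */
} nvme_cq_entry_t;

/* Convenience accessors for the packed status halfword. */
static inline unsigned cqe_phase(const nvme_cq_entry_t *c)  { return c->status & 1u; }
static inline unsigned cqe_status(const nvme_cq_entry_t *c) { return c->status >> 1; }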
A few items to note: the completion entry carries the controller's current submission queue head pointer back to the host, which is how the host learns how far the controller has progressed through the submission queue; the Phase Tag bit is the host's only in-band indication that a new completion has been posted; and the Command Identifier combined with the Submission Queue ID ties the completion back to the original command.
Just to provide an introduction to the commands available, a few of them are: Admin commands such as Identify, Create/Delete I/O Submission Queue, Create/Delete I/O Completion Queue, Get/Set Features, and Get Log Page; and NVM (I/O) commands such as Read, Write, and Flush.
More commands and features are being added to NVMe all the time due to an active group that’s moving the development forward.
The industry consortium behind NVM Express is NVM Express, Inc., which develops the specifications together with its various contributing companies.
Thanks for hanging around through this foundational discussion. Next time we will look at the VIP and its basic architecture, function and use-models.
Authored by Eric Peterson