Verification Central

 

The Synopsys NVMe VIP: A High Level View

Overview

NVM Express (NVMe), previously known as the Non-Volatile Memory Host Controller Interface (NVMHCI), is a host-based software interface designed to communicate with solid-state storage devices across a PCIe fabric. The current Synopsys NVMe Verification IP (VIP) is a comprehensive testing vehicle that consists of two main subsystems: the SVC (System Verification Component) and the SVT (System Verification Technology). The SVC layers correspond to the actual NVMe (and PCIe, etc.) protocol layers, while the SVT provides a verification methodology interface to UVM and other methodologies such as VMM and OVM.

Here’s where you can learn more about Synopsys’ VC Verification IP for NVMe and for PCIe and M-PHY.

Although the VIP supports multiple versions of the NVMe specification, we will initially be version agnostic, speaking more in generalities of the protocol in order to provide a 10,000-foot view of the protocol and its support in the VIP. Future discussions will delve deeper into particular details of NVMe and features of the Verification IP.

[Figure: NVMe VIP]


A Brief Glance at NVMe

Unlike PCIe, where the root and endpoint are essentially peers, NVMe's asymmetric host/controller relationship is closer to that of other storage protocols (e.g. SATA, Fibre Channel).

An NVMe command (e.g. Identify, Read, Write) is initiated at the host and converted to an NVMe request, which is then appended to a particular submission queue that lives in host memory. Once the command is inserted into a queue, the host writes to a per-queue doorbell register on the controller (controllers live on PCIe endpoints). This doorbell write wakes up the controller, which then probes the queue for the new request(s). It reads the queue entry, executes the command (potentially reading data buffers from host memory), appends a completion to a completion queue, and notifies the host via an interrupt. The host wakes up, pops that completion off the queue and returns the results to the user.
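
In rough host-side pseudo-code, the submission half of that flow looks something like the sketch below. This is only an illustration of the protocol steps described above, not the VIP's API; the type and function names (nvme_queue_pair, nvme_submit and friends) are assumptions made for the example.

    #include <stdint.h>

    #define SQ_DEPTH 64                       /* illustrative queue depth            */

    struct nvme_sqe { uint8_t bytes[64]; };   /* 64-byte submission queue entry      */
    struct nvme_cqe { uint8_t bytes[16]; };   /* 16-byte completion queue entry      */

    struct nvme_queue_pair {
        struct nvme_sqe   *sq;                /* submission queue in host memory     */
        struct nvme_cqe   *cq;                /* completion queue in host memory     */
        uint16_t           sq_tail;           /* host-maintained SQ tail pointer     */
        uint16_t           cq_head;           /* host-maintained CQ head pointer     */
        volatile uint32_t *sq_tail_db;        /* memory-mapped SQ tail doorbell      */
        volatile uint32_t *cq_head_db;        /* memory-mapped CQ head doorbell      */
    };

    /* Submit one command: append it to the submission queue, advance the tail,
     * then ring the doorbell so the controller knows there is work to fetch.   */
    static void nvme_submit(struct nvme_queue_pair *qp, const struct nvme_sqe *cmd)
    {
        qp->sq[qp->sq_tail] = *cmd;
        qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % SQ_DEPTH);
        *qp->sq_tail_db = qp->sq_tail;
    }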

[Figure: NVMe architecture]

There are two main types of queues that are used:

  • Admin Queues – these are used for configuring and managing various aspects of the controller. There is only one pair of Admin queues per controller.
  • I/O Queues – these are used to move NVMe protocol specific commands (e.g. Read, Write).  There can be up to 64K I/O queues per controller.

Each queue has a tail (producer) pointer and a head (consumer) pointer. The tail pointer points to the next free entry; after the producer adds an entry to a queue, it increments the tail pointer (wrapping back to zero once it reaches the end of the queue, since all queues are circular). The queue is considered empty when the head and tail pointers are equal.

The consumer uses its head pointer to determine where to start reading from the queue; after examining the tail pointer and determining that the queue is non-empty, it increments the head pointer after reading each entry.
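
Those producer/consumer rules boil down to a little pointer arithmetic. Here is a minimal sketch, assuming a fixed queue depth (real queues are sized when they are created):

    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_DEPTH 64u   /* illustrative; set when the queue is created */

    /* Empty: the consumer has caught up with the producer. */
    static bool queue_empty(uint16_t head, uint16_t tail)
    {
        return head == tail;
    }

    /* Full: advancing the tail one more step would collide with the head,
     * so one slot is always left unused.                                   */
    static bool queue_full(uint16_t head, uint16_t tail)
    {
        return ((tail + 1u) % QUEUE_DEPTH) == head;
    }

    /* Both head and tail advance the same way, wrapping back to zero. */
    static uint16_t advance(uint16_t ptr)
    {
        return (uint16_t)((ptr + 1u) % QUEUE_DEPTH);
    }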

The submission queue’s tail pointer is managed by the host; after one or more entries have been pushed into the queue, the updated tail pointer is written to the controller via a submission queue doorbell register. The controller maintains the head pointer and begins to read the queue once notified of the tail pointer update; it can continue to read the queue until it is empty. As it consumes entries, it updates the head pointer and sends the new value back to the host via completion queue entries (see below).

Similarly, the completion queue’s tail is managed by the controller, but unlike the host, the controller only maintains a private copy of the tail pointer. The only indication that a new completion queue entry is available is a bit in the entry itself that the host can poll. Once the host determines an entry is available, it reads that entry and updates the head pointer. The controller is notified of head pointer updates by host writes to the completion queue doorbell register.
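
From the register map’s point of view, ringing a doorbell is just a 32-bit write into the controller’s BAR0 space. The sketch below assumes the standard NVMe doorbell layout, where the doorbells start at offset 0x1000 and are spaced by the stride advertised in CAP.DSTRD; the function names are made up for the example.

    #include <stdint.h>

    /* Submission queue y tail doorbell: 0x1000 + (2y) * (4 << DSTRD)      */
    static uint32_t sq_tail_doorbell_offset(uint16_t qid, uint8_t dstrd)
    {
        return 0x1000u + (2u * qid) * (4u << dstrd);
    }

    /* Completion queue y head doorbell: 0x1000 + (2y + 1) * (4 << DSTRD)  */
    static uint32_t cq_head_doorbell_offset(uint16_t qid, uint8_t dstrd)
    {
        return 0x1000u + (2u * qid + 1u) * (4u << dstrd);
    }

    /* Ring a doorbell: write the new tail (or head) value to BAR0 at the
     * computed offset. 'bar0' is assumed to be the mapped register space. */
    static void ring_doorbell(volatile uint8_t *bar0, uint32_t offset, uint16_t value)
    {
        *(volatile uint32_t *)(bar0 + offset) = value;
    }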

Note that all work done by an NVMe controller is either pulled into or pushed out of that controller by the controller itself. The host merely places work into host memory and rings the doorbell (“you’ve got a submission entry to handle”). Later it collects results from the completion queue, again ringing the doorbell (“I’m done with these completion entries”). So the controller is free to work in parallel with the host; for example, there is no requirement for ordering of completions, and the controller can order its work any way it likes.

So what are these queue entries that we’re moving back and forth between host and controller?

The first is the Submission Queue Entry, a 64-byte data structure that the host uses to transmit command requests to the controller:

Bytes     Description
63:40     Command Dwords 15-10 (CDW15-10): 6 dwords of command-specific information.
39:32     PRP Entry 2 (PRP2): Pointer to the second PRP entry or buffer, or (in conjunction with PRP1) the SGL Segment.
31:24     PRP Entry 1 (PRP1): Pointer to the first PRP entry or buffer, or (in conjunction with PRP2) the SGL Segment.
23:16     Metadata Pointer (MPTR): The address of an SGL Segment or a contiguous buffer containing metadata.
15:08     Reserved
07:04     Namespace Identifier (NSID): The namespace ID that this command applies to.
03:00     Command Dword 0 (CDW0): Common to all commands; contains the Command Opcode (OPC), Command Identifier (CID), and various control bits.

One submission queue entry is enqueued per command to the appropriate Admin or I/O queue. The Opcode specifies the particular command to execute, and the Command Identifier (when combined with the Submission Queue ID) uniquely identifies the command.
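
That 64-byte layout maps naturally onto a C structure. The sketch below is illustrative only (a real driver, or the VIP’s transaction classes, will have their own definitions); the field names simply follow the table above.

    #include <stdint.h>

    /* 64-byte Submission Queue Entry, per the byte layout in the table above. */
    struct nvme_sqe {
        uint32_t cdw0;    /* bytes 03:00  opcode (7:0), control bits, command ID (31:16) */
        uint32_t nsid;    /* bytes 07:04  namespace identifier                           */
        uint64_t rsvd;    /* bytes 15:08  reserved                                       */
        uint64_t mptr;    /* bytes 23:16  metadata pointer                               */
        uint64_t prp1;    /* bytes 31:24  PRP entry 1 (or half of an SGL descriptor)     */
        uint64_t prp2;    /* bytes 39:32  PRP entry 2 (or half of an SGL descriptor)     */
        uint32_t cdw10;   /* bytes 63:40  command-specific dwords 10..15                 */
        uint32_t cdw11;
        uint32_t cdw12;
        uint32_t cdw13;
        uint32_t cdw14;
        uint32_t cdw15;
    };

    /* CDW0 packs the opcode into bits 7:0 and the command identifier into bits 31:16. */
    static uint32_t make_cdw0(uint8_t opcode, uint16_t cid)
    {
        return (uint32_t)opcode | ((uint32_t)cid << 16);
    }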

In addition to using queue entries to move information back and forth, the host can also allocate data buffers in host memory. These buffers can either be contiguous (defined by their base address and length) or a set of data buffers scattered across memory. The latter use data structures called PRP lists and scatter-gather lists (SGLs) to define their locations. When the host needs to move these buffers to/from the controller (e.g. for a read or write command), it allocates the appropriate data structure in host memory and writes information about those buffers into the PRP1 and PRP2 fields above prior to writing the queue entry to that controller.
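
As a rough illustration of how the host might fill in PRP1 and PRP2 for a scattered buffer, here is a simplified sketch. It assumes 4 KiB memory pages, a page-aligned buffer, and a single PRP list page (no list chaining); the function and parameter names are invented for the example.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u   /* assumes the controller is configured for 4 KiB pages */

    /* 'pages' holds the physical addresses of the pages backing the buffer;
     * 'prp_list'/'prp_list_phys' are the virtual/physical addresses of one
     * page-aligned scratch page used to hold the PRP list when needed.      */
    static void build_prps(const uint64_t *pages, size_t npages,
                           uint64_t *prp_list, uint64_t prp_list_phys,
                           uint64_t *prp1, uint64_t *prp2)
    {
        *prp1 = pages[0];                        /* first page always goes in PRP1 */
        if (npages == 1) {
            *prp2 = 0;                           /* unused                         */
        } else if (npages == 2) {
            *prp2 = pages[1];                    /* two pages: PRP2 is the second  */
        } else {
            for (size_t i = 1; i < npages; i++)  /* otherwise build a PRP list...  */
                prp_list[i - 1] = pages[i];
            *prp2 = prp_list_phys;               /* ...and point PRP2 at it        */
        }
    }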

Metadata (e.g. end-to-end data protection) can also be passed along with the NVMe commands, in two ways.  It can be sent either in-band with the data (i.e. it is contiguous with the data, per sector), or out-of-band (i.e. it is sent as a separate data stream).  In SCSI parlance these are known as Data Integrity Field (DIF) and Data Integrity Extension (DIX), respectively.  The latter of these uses the Metadata Pointer described above.   We’ll discuss this in detail in future episodes.
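
For reference, the protection information itself is a small fixed-size tuple per sector, whichever way it travels. A sketch of the classic 8-byte layout (guard, application tag, reference tag):

    #include <stdint.h>

    /* Classic 8-byte protection information tuple, carried per sector either
     * appended to the data (DIF) or in a separate metadata buffer (DIX).
     * The fields are big-endian on the wire.                                 */
    struct t10_pi_tuple {
        uint16_t guard;      /* CRC-16 computed over the sector data          */
        uint16_t app_tag;    /* application-defined tag                       */
        uint32_t ref_tag;    /* typically tracks the low bits of the LBA      */
    };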

When we are actually writing/reading to/from the Non-Volatile storage on the controller, we write to namespaces.  In other storage technologies, there are other analogous containers – for example LUNs in SCSI.   Namespaces can be unique to a controller, or shared across multiple controllers.  Regardless, the namespace ID field in the request determines which namespace is getting accessed. Some commands don’t use the namespace field (which is then set to 0), others may need to deal with all the namespaces (the namespace ID is then set to 0xffff_ffff).

On the completion side, there is an analogous data structure, the Completion Queue Entry:

Bytes     Description
15:14     Status Field (SF) and Phase Tag (P): The completion status for the command, plus the phase bit used to detect new entries.
13:12     Command Identifier (CID): The identifier of the command being completed.
11:10     Submission Queue Identifier (SQID): The submission queue on which the associated command was sent. (16 bits)
09:08     Submission Queue Head Pointer (SQHD): The controller's current head pointer for that submission queue.
07:04     Reserved
03:00     Command Specific Information: One dword of returned information. (Not always used.)


A few items to note:

  • The Submission Queue ID needs to be explicitly shown in the completion queue entry because multiple submission queues can share a completion queue.
  • As discussed above, the Submission Queue head pointer is maintained by the controller. As the controller updates that head pointer, it needs to send those updates back to the host, allowing the host to re-use the queue entries up to and including the new head.
  • The Status Field is actually made up of a handful of sub-fields, and there are a good number of error status codes.
  • The Phase Tag provides a bit that one can poll on (if the host driver software wants to, as opposed to being purely interrupt driven). If the Phase Tag changes when examining the tail entry of the queue, one can infer that there is a new entry to be examined (see the sketch below).
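
Pulling the table and the notes together, the completion entry and the phase-tag check might be sketched in C as follows (the field names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* 16-byte Completion Queue Entry, per the byte layout in the table above. */
    struct nvme_cqe {
        uint32_t cmd_specific;   /* bytes 03:00  command-specific result             */
        uint32_t rsvd;           /* bytes 07:04  reserved                            */
        uint16_t sq_head;        /* bytes 09:08  current submission queue head       */
        uint16_t sq_id;          /* bytes 11:10  submission queue identifier         */
        uint16_t cid;            /* bytes 13:12  command identifier                  */
        uint16_t status;         /* bytes 15:14  phase tag (bit 0) + status field    */
    };

    /* The controller inverts the phase bit each time it wraps around the queue,
     * so a host that tracks the expected phase can tell a fresh entry from a
     * stale one without any other notification.                               */
    static bool cqe_is_new(const struct nvme_cqe *cqe, unsigned expected_phase)
    {
        return (cqe->status & 1u) == expected_phase;
    }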

Just to provide an introduction to the commands available, we’ll list a few of them here:

  • Create Admin Queues – actually not a command per se; this is a register write to the NVMe controller to define the Admin Submission and Completion queues. The NVMe register set has a handful of registers for basic configuration, control and status.
  • Create I/O Queues – commands to create both Submission and Completion queues.
  • Identify – Several commands allowing one to request various information about the controller and the namespaces attached to the controller.
  • Set / Get Features – A means to set NVMe specific ‘features’ (e.g. configurable facilities).
  • Read / Write – This is why we’re really here: to read and write data from/to the namespaces attached to the controllers (a sketch of building a Read command follows below).
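
As an example of how one of these commands is put together, here is a sketch of building an NVM Read submission entry: opcode 0x02 selects Read, CDW10/11 carry the 64-bit starting LBA, and CDW12 bits 15:0 carry a 0’s-based block count. The raw dword array and function name are just for illustration.

    #include <stdint.h>
    #include <string.h>

    /* Build a Read command in a raw 16-dword (64-byte) submission entry.
     * 'nblocks' is the number of logical blocks to read (at least one).    */
    static void build_read(uint32_t sqe[16], uint16_t cid, uint32_t nsid,
                           uint64_t prp1, uint64_t slba, uint16_t nblocks)
    {
        memset(sqe, 0, 16 * sizeof(uint32_t));
        sqe[0]  = 0x02u | ((uint32_t)cid << 16);    /* CDW0: opcode + command ID   */
        sqe[1]  = nsid;                             /* namespace to read from      */
        sqe[6]  = (uint32_t)(prp1 & 0xffffffffu);   /* PRP1, low dword             */
        sqe[7]  = (uint32_t)(prp1 >> 32);           /* PRP1, high dword            */
        sqe[10] = (uint32_t)(slba & 0xffffffffu);   /* CDW10: starting LBA, low    */
        sqe[11] = (uint32_t)(slba >> 32);           /* CDW11: starting LBA, high   */
        sqe[12] = (uint32_t)(nblocks - 1u);         /* CDW12: 0's-based count      */
    }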

More commands and features are being added to NVMe all the time due to an active group that’s moving the development forward.

The industry consortium behind NVM Express is NVM Express, Inc.  They develop the specifications along with the various contributing companies.

Thanks for hanging around through this foundational discussion. Next time we will look at the VIP and its basic architecture, function and use-models.

Authored by Eric Peterson