[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
Virtio PCI Card Specification
v0.9.1 DRAFT

Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor)

2011 August 1.

Purpose and Description

This document describes the specifications of the “virtio” family
of PCI devices. These devices are found in virtual environments,
yet by design they are not all that different from physical PCI
devices, and this document treats them as such. This allows the
guest to use standard PCI drivers and discovery mechanisms.

The purpose of virtio and this specification is that virtual
environments and guests should have a straightforward, efficient,
standard and extensible mechanism for virtual devices, rather
than boutique per-environment or per-OS mechanisms.

  Straightforward: Virtio PCI devices use normal PCI mechanisms
  of interrupts and DMA which should be familiar to any device
  driver author. There is no exotic page-flipping or COW
  mechanism: it's just a PCI device.[footnote:
This lack of page-sharing implies that the implementation of the
device (e.g. the hypervisor or host) needs full access to the
guest memory. Communication with untrusted parties (i.e.
inter-guest communication) requires copying.
]

  Efficient: Virtio PCI devices consist of rings of descriptors
  for input and output, which are neatly separated to avoid cache
  effects from both guest and device writing to the same cache
  lines.

  Standard: Virtio PCI makes no assumptions about the environment
  in which it operates, beyond supporting PCI.
  In fact the virtio
  devices specified in the appendices do not require PCI at all:
  they have been implemented on non-PCI buses.[footnote:
The Linux implementation further separates the PCI virtio code
from the specific virtio drivers: these drivers are shared with
the non-PCI implementations (currently lguest and S/390).
]

  Extensible: Virtio PCI devices contain feature bits which are
  acknowledged by the guest operating system during device setup.
  This allows forwards and backwards compatibility: the device
  offers all the features it knows about, and the driver
  acknowledges those it understands and wishes to use.

  Virtqueues

The mechanism for bulk data transport on virtio PCI devices is
pretentiously called a virtqueue. Each device can have zero or
more virtqueues: for example, the network device has one for
transmit and one for receive.

Each virtqueue occupies two or more physically-contiguous pages
(defined, for the purposes of this specification, as 4096 bytes),
and consists of three parts:

+-------------------+-----------------------------------+-----------+
| Descriptor Table  | Available Ring     (padding)      | Used Ring |
+-------------------+-----------------------------------+-----------+

When the driver wants to send buffers to the device, it puts them
in one or more slots in the descriptor table, and writes the
descriptor indices into the available ring. It then notifies the
device. When the device has finished with the buffers, it writes
the descriptors into the used ring, and sends an interrupt.

Specification

  PCI Discovery

Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000
through 0x103F inclusive is a virtio device[footnote:
The actual value within this range is ignored
]. The device must also have a Revision ID of 0 to match this
specification.

The Subsystem Device ID indicates which virtio device is
supported by the device.
The Subsystem Vendor ID should reflect
the PCI Vendor ID of the environment (it's currently only used
for informational purposes by the guest).

+----------------------+--------------------+---------------+
| Subsystem Device ID  | Virtio Device      | Specification |
+----------------------+--------------------+---------------+
| 1                    | network card       | Appendix C    |
+----------------------+--------------------+---------------+
| 2                    | block device       | Appendix D    |
+----------------------+--------------------+---------------+
| 3                    | console            | Appendix E    |
+----------------------+--------------------+---------------+
| 4                    | entropy source     | Appendix F    |
+----------------------+--------------------+---------------+
| 5                    | memory ballooning  | Appendix G    |
+----------------------+--------------------+---------------+
| 6                    | ioMemory           | -             |
+----------------------+--------------------+---------------+
| 9                    | 9P transport       | -             |
+----------------------+--------------------+---------------+

  Device Configuration

To configure the device, we use the first I/O region of the PCI
device. This contains a virtio header followed by a
device-specific region.

There may be different widths of accesses to the I/O region; the
“natural” access method for each field in the virtio header must
be used (i.e. 32-bit accesses for 32-bit fields, etc.), but the
device-specific region can be accessed using any width accesses,
and should obtain the same results.

Note that this is possible because while the virtio header is PCI
(i.e. little) endian, the device-specific region is encoded in
the native endian of the guest (where such distinction is
applicable).

  Device Initialization Sequence

We start with an overview of device initialization, then expand
on the details of the device and how each step is performed.

  1. Reset the device. This is not required on initial start up.

  2. The ACKNOWLEDGE status bit is set: we have noticed the device.

  3. The DRIVER status bit is set: we know how to drive the device.

  4. Device-specific setup, including reading the Device Feature
     Bits, discovery of virtqueues for the device, optional MSI-X
     setup, and reading and possibly writing the virtio
     configuration space.

  5. The subset of Device Feature Bits understood by the driver is
     written to the device.

  6. The DRIVER_OK status bit is set.

  7. The device can now be used (i.e. buffers added to the
     virtqueues)[footnote:
Historically, drivers have used the device before steps 5 and 6.
This is only allowed if the driver does not use any features
which would alter this early use of the device.
]

If any of these steps go irrecoverably wrong, the guest should
set the FAILED status bit to indicate that it has given up on the
device (it can reset the device later to restart if desired).

We now cover the fields required for general setup in detail.
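
Before going into the individual fields, the initialization
sequence listed above can be sketched in C. This is only an
illustrative sketch under stated assumptions: struct
virtio_hdr_model is a hypothetical in-memory stand-in for three
fields of the virtio header (a real driver would use I/O port
accesses of the natural width instead), virtio_init_sketch and its
parameters are names invented here, and the numeric status values
are the ones used by Linux's virtio_config.h rather than values
taken from this section.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: numeric values as defined in Linux's virtio_config.h. */
#define VIRTIO_STATUS_ACKNOWLEDGE 0x01
#define VIRTIO_STATUS_DRIVER      0x02
#define VIRTIO_STATUS_DRIVER_OK   0x04
#define VIRTIO_STATUS_FAILED      0x80

/* Hypothetical model of the header fields touched during setup.
 * Real hardware exposes these in the first I/O region of the device. */
struct virtio_hdr_model {
    uint32_t device_features; /* read-only for the guest */
    uint32_t guest_features;  /* written by the guest */
    uint8_t  device_status;   /* written by the guest */
};

/*
 * Walk through steps 1 to 6 of the initialization sequence.
 * driver_features:   feature bits the driver understands.
 * required_features: bits the driver cannot work without (hypothetical
 *                    policy, used here to show the FAILED path).
 * Returns 0 on success, -1 after setting FAILED.
 */
static int virtio_init_sketch(struct virtio_hdr_model *hdr,
                              uint32_t driver_features,
                              uint32_t required_features)
{
    uint32_t accepted;

    assert(hdr);

    hdr->device_status = 0;                           /* 1. reset */
    hdr->device_status |= VIRTIO_STATUS_ACKNOWLEDGE;  /* 2. noticed */
    hdr->device_status |= VIRTIO_STATUS_DRIVER;       /* 3. can drive */

    /* 4. device-specific setup: here only the feature bits are read;
     * virtqueue discovery and MSI-X setup are omitted. */
    accepted = hdr->device_features & driver_features;
    if ((accepted & required_features) != required_features) {
        hdr->device_status |= VIRTIO_STATUS_FAILED;   /* give up */
        return -1;
    }

    hdr->guest_features = accepted;                   /* 5. write subset */
    hdr->device_status |= VIRTIO_STATUS_DRIVER_OK;    /* 6. ready */
    return 0;                                         /* 7. usable */
}
```

The driver only ever writes back bits the device offered, which is
what makes the negotiation safe in both directions.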

  Virtio Header

The virtio header looks as follows:

+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
| Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
| Read/Write || R                   | R+W                 | R+W      | R      | R+W     | R+W     | R+W     | R      |
+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
| Purpose    || Device              | Guest               | Queue    | Queue  | Queue   | Queue   | Device  | ISR    |
|            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+

If MSI-X is enabled for the device, two additional fields
immediately follow this header:

+------------++----------------+--------+
| Bits       || 16             | 16     |
+------------++----------------+--------+
| Read/Write || R+W            | R+W    |
+------------++----------------+--------+
| Purpose    || Configuration  | Queue  |
| (MSI-X)    || Vector         | Vector |
+------------++----------------+--------+

Finally, if the VIRTIO_F_FEATURES_HI feature bit is negotiated,
this is immediately followed by two additional fields:

+------------++----------------------+----------------------+
| Bits       || 32                   | 32                   |
+------------++----------------------+----------------------+
| Read/Write || R                    | R+W                  |
+------------++----------------------+----------------------+
| Purpose    || Device               | Guest                |
|            || Features bits 32:63  | Features bits 32:63  |
+------------++----------------------+----------------------+

Immediately following these general headers, there may be
device-specific headers:

+------------++--------------------+
| Bits       || Device Specific    |
+------------++--------------------+
| Read/Write || Device Specific    |
+------------++--------------------+
| Purpose    || Device Specific... |
|            ||                    |
+------------++--------------------+

  Device Status

The Device Status field is updated by the guest to indicate its
progress. This provides a simple low-level diagnostic: it's most
useful to imagine the bits hooked up to traffic lights on the
console indicating the status of each device.

The device can be reset by writing a 0 to this field, otherwise
at least one bit should be set:

  ACKNOWLEDGE (1) Indicates that the guest OS has found the
  device and recognized it as a valid virtio device.

  DRIVER (2) Indicates that the guest OS knows how to drive the
  device. Under Linux, drivers can be loadable modules so there
  may be a significant (or infinite) delay before setting this
  bit.

  DRIVER_OK (4) Indicates that the driver is set up and ready to
  drive the device.

  FAILED (128) Indicates that something went wrong in the guest,
  and it has given up on the device. This could be an internal
  error, or the driver didn't like the device for some reason, or
  even a fatal error during device operation. The device must be
  reset before attempting to re-initialize.

  Feature Bits

The least significant 31 bits of the first configuration field
indicate the features that the device supports (the high bit is
reserved, and will be used to indicate the presence of future
feature bits elsewhere). If more than 31 feature bits are
supported, the device indicates so by setting feature bit 31 (see
[cha:Reserved-Feature-Bits]).
The bits are allocated as follows:

  0 to 23 Feature bits for the specific device type

  24 to 40 Feature bits reserved for extensions to the queue and
  feature negotiation mechanisms

  41 to 63 Feature bits reserved for future extensions

For example, feature bit 0 for a network device (i.e. Subsystem
Device ID 1) indicates that the device supports checksumming of
packets.

The feature bits are negotiated: the device lists all the
features it understands in the Device Features field, and the
guest writes the subset that it understands into the Guest
Features field. The only way to renegotiate is to reset the
device.

In particular, new fields in the device configuration header are
indicated by offering a feature bit, so the guest can check
before accessing that part of the configuration space.

This allows for forwards and backwards compatibility: if the
device is enhanced with a new feature bit, older guests will not
write that feature bit back to the Guest Features field and it
can go into backwards compatibility mode. Similarly, if a guest
is enhanced with a feature that the device doesn't support, it
will not see that feature bit in the Device Features field and
can go into backwards compatibility mode (or, for poor
implementations, set the FAILED Device Status bit).

The guest enables access to feature bits 32 to 63 by setting
feature bit 31. If this bit is unset, the device must assume that
all feature bits > 31 are unset.

  Configuration/Queue Vectors

When the MSI-X capability is present and enabled in the device
(through standard PCI configuration space), 4 bytes at byte offset
20 are used to map configuration change and queue interrupts to
MSI-X vectors. In this case, the ISR Status field is unused, and
device-specific configuration starts at byte offset 24 in the
virtio header structure.
When the MSI-X capability is not enabled,
device-specific configuration starts at byte offset 20 in the
virtio header.

Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
the Configuration/Queue Vector registers maps interrupts triggered
by the configuration change/selected queue events respectively to
the corresponding MSI-X vector. To disable interrupts for a
specific event type, unmap it by writing a special NO_VECTOR
value:

/* Vector value used to disable MSI for queue */
#define VIRTIO_MSI_NO_VECTOR 0xffff

Reading these registers returns the vector mapped to a given
event, or NO_VECTOR if unmapped. All queue and configuration
change events are unmapped by default.

Note that mapping an event to a vector might require allocating
internal device resources, and might fail. Devices report such
failures by returning the NO_VECTOR value when the relevant
Vector field is read. After mapping an event to a vector, the
driver must verify success by reading the Vector field value: on
success, the previously written value is returned, and on
failure, NO_VECTOR is returned. If a mapping failure is detected,
the driver can retry mapping with fewer vectors, or disable MSI-X.

  Virtqueue Configuration

As a device can have zero or more virtqueues for bulk data
transport (for example, the network device has two), the driver
needs to configure them as part of the device-specific
configuration.

This is done as follows, for each virtqueue a device has:

  Write the virtqueue index (first queue is 0) to the Queue
  Select field.

  Read the virtqueue size from the Queue Size field, which is
  always a power of 2. This controls how big the virtqueue is
  (see below). If this field is 0, the virtqueue does not exist.

  Allocate and zero virtqueue in contiguous physical memory, on a
  4096 byte alignment.
  Write the physical address, divided by
  4096 to the Queue Address field.[footnote:
The 4096 is based on the x86 page size, but it's also large
enough to ensure that the separate parts of the virtqueue are on
separate cache lines.
]

  Optionally, if MSI-X capability is present and enabled on the
  device, select a vector to use to request interrupts triggered
  by virtqueue events. Write the MSI-X Table entry number
  corresponding to this vector in Queue Vector field. Read the
  Queue Vector field: on success, previously written value is
  returned; on failure, NO_VECTOR value is returned.

The Queue Size field controls the total number of bytes required
for the virtqueue according to the following formula:

#define ALIGN(x) (((x) + 4095) & ~4095)

static inline unsigned vring_size(unsigned int qsz)
{
    return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 + qsz))
         + ALIGN(sizeof(struct vring_used_elem)*qsz);
}

This currently wastes some space with padding, but also allows
future extensions. The virtqueue layout structure looks like this
(qsz is the Queue Size field, which is a variable, so this code
won't compile):

struct vring {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[qsz];

    /* A ring of available descriptor heads with free-running index. */
    struct vring_avail avail;

    /* Padding to the next 4096 boundary. */
    char pad[];

    /* A ring of used descriptor heads with free-running index. */
    struct vring_used used;
};

  A Note on Virtqueue Endianness

Note that the endian of these fields and everything else in the
virtqueue is the native endian of the guest, not little-endian as
PCI normally is.
This makes for simpler guest code, and it is
assumed that the host already has to be deeply aware of the guest
endian so such an “endian-aware” device is not a significant
issue.

  Descriptor Table

The descriptor table refers to the buffers the guest is using for
the device. The addresses are physical addresses, and the buffers
can be chained via the next field. Each descriptor describes a
buffer which is read-only or write-only, but a chain of
descriptors can contain both read-only and write-only buffers.

No descriptor chain may be more than 2^32 bytes long in total.

struct vring_desc {
    /* Address (guest-physical). */
    u64 addr;

    /* Length. */
    u32 len;

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1

/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2

/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4

    /* The flags as indicated above. */
    u16 flags;

    /* Next field if flags & NEXT */
    u16 next;
};

The number of descriptors in the table is specified by the Queue
Size field for this virtqueue.

  <sub:Indirect-Descriptors>Indirect Descriptors

Some devices benefit by concurrently dispatching a large number
of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
used to allow this (see [cha:Reserved-Feature-Bits]). To increase
ring capacity it is possible to store a table of indirect
descriptors anywhere in memory, and insert a descriptor in the
main virtqueue (with flags&INDIRECT on) that refers to the memory
buffer containing this indirect descriptor table; fields addr and
len refer to the indirect table address and length in bytes,
respectively.
The indirect table layout structure looks like this
(len is the length of the descriptor that refers to this table,
which is a variable, so this code won't compile):

struct indirect_descriptor_table {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[len / 16];
};

The first indirect descriptor is located at the start of the
indirect descriptor table (index 0); additional indirect
descriptors are chained by the next field. An indirect descriptor
without a next field (with flags&NEXT off) signals the end of the
indirect descriptor table, and transfers control back to the main
virtqueue. An indirect descriptor can not refer to another
indirect descriptor table (flags&INDIRECT must be off). A single
indirect descriptor table can include both read-only and
write-only descriptors; the write-only flag (flags&WRITE) in the
descriptor that refers to it is ignored.

  Available Ring

The available ring refers to what descriptors we are offering the
device: it refers to the head of a descriptor chain. The “flags”
field is currently 0 or 1: 1 indicating that we do not need an
interrupt when the device consumes a descriptor from the
available ring. Alternatively, the guest can ask the device to
delay interrupts until an entry with an index specified by the
“used_event” field is written in the used ring (equivalently,
until the idx field in the used ring will reach the value
used_event + 1). The method employed by the device is controlled
by the VIRTIO_RING_F_EVENT_IDX feature bit (see
[cha:Reserved-Feature-Bits]). This interrupt suppression is merely
an optimization; it may not suppress interrupts entirely.

The “idx” field indicates where we would put the next descriptor
entry (modulo the ring size). This starts at 0, and increases.

struct vring_avail {
#define VRING_AVAIL_F_NO_INTERRUPT 1
    u16 flags;
    u16 idx;
    u16 ring[qsz]; /* qsz is the Queue Size field read from device */
    u16 used_event;
};

  Used Ring

The used ring is where the device returns buffers once it is done
with them. The flags field can be used by the device to hint that
no notification is necessary when the guest adds to the available
ring. Alternatively, the “avail_event” field can be used by the
device to hint that no notification is necessary until an entry
with an index specified by the “avail_event” is written in the
available ring (equivalently, until the idx field in the
available ring will reach the value avail_event + 1). The method
employed by the device is controlled by the guest through the
VIRTIO_RING_F_EVENT_IDX feature bit (see
[cha:Reserved-Feature-Bits]). [footnote:
These fields are kept here because this is the only part of the
virtqueue written by the device
].

Each entry in the ring is a pair: the head entry of the
descriptor chain describing the buffer (this matches an entry
placed in the available ring by the guest earlier), and the total
number of bytes written into the buffer. The latter is extremely
useful for guests using untrusted buffers: if you do not know
exactly how much has been written by the device, you usually have
to zero the buffer to ensure no data leakage occurs.

/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
    /* Index of start of used descriptor chain.
     */
    u32 id;

    /* Total length of the descriptor chain which was used (written to) */
    u32 len;
};

struct vring_used {
#define VRING_USED_F_NO_NOTIFY 1
    u16 flags;
    u16 idx;
    struct vring_used_elem ring[qsz];
    u16 avail_event;
};

  Helpers for Managing Virtqueues

The Linux Kernel Source code contains the definitions above and
helper routines in a more usable form, in
include/linux/virtio_ring.h. This was explicitly licensed by IBM
and Red Hat under the (3-clause) BSD license so that it can be
freely used by all other projects, and is reproduced (with slight
variation to remove Linux assumptions) in Appendix A.

  Device Operation

There are two parts to device operation: supplying new buffers to
the device, and processing used buffers from the device. As an
example, the virtio network device has two virtqueues: the
transmit virtqueue and the receive virtqueue. The driver adds
outgoing (read-only) packets to the transmit virtqueue, and then
frees them after they are used. Similarly, incoming (write-only)
buffers are added to the receive virtqueue, and processed after
they are used.

  Supplying Buffers to The Device

Actual transfer of buffers from the guest OS to the device
operates as follows:

  1. Place the buffer(s) into free descriptor(s).

     If there are no free descriptors, the guest may choose to
     notify the device even if notifications are suppressed (to
     reduce latency).[footnote:
The Linux drivers do this only for read-only buffers: for
write-only buffers, it is assumed that the driver is merely
trying to keep the receive buffer ring full, and no notification
of this expected condition is necessary.
]

  2. Place the id of the buffer in the next ring entry of the
     available ring.

  3. The steps (1) and (2) may be performed repeatedly if batching
     is possible.

  4. A memory barrier should be executed to ensure the device sees
     the updated descriptor table and available ring before the
     next step.

  5. The available “idx” field should be increased by the number of
     entries added to the available ring.

  6. A memory barrier should be executed to ensure that we update
     the idx field before checking for notification suppression.

  7. If notifications are not suppressed, the device should be
     notified of the new buffers.

Note that the above code does not take precautions against the
available ring buffer wrapping around: this is not possible since
the ring buffer is the same size as the descriptor table, so step
(1) will prevent such a condition.

In addition, the maximum queue size is 32768 (it must be a power
of 2 which fits in 16 bits), so the 16-bit “idx” value can always
distinguish between a full and empty buffer.

Here is a description of each stage in more detail.

  Placing Buffers Into The Descriptor Table

A buffer consists of zero or more read-only physically-contiguous
elements followed by zero or more physically-contiguous
write-only elements (it must have at least one element). This
algorithm maps it into the descriptor table:

  for each buffer element, b:

    Get the next free descriptor table entry, d

    Set d.addr to the physical address of the start of b

    Set d.len to the length of b.

    If b is write-only, set d.flags to VRING_DESC_F_WRITE,
    otherwise 0.

    If there is a buffer element after this:

      Set d.next to the index of the next free descriptor element.

      Set the VRING_DESC_F_NEXT bit in d.flags.

In practice, the d.next fields are usually used to chain free
descriptors, and a separate count kept to check there are enough
free descriptors before beginning the mappings.
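
The mapping algorithm above, together with the free-list practice
just described, can be sketched as follows. This is a minimal
sketch, not the Linux implementation: struct buf_elem, struct
desc_pool, desc_pool_init and map_buffer are hypothetical names
introduced for illustration, the descriptor layout is restated with
stdint types so the block stands alone, and the caller is assumed
to pass read-only elements before write-only ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT  1
#define VRING_DESC_F_WRITE 2

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical: one physically-contiguous element of a buffer. */
struct buf_elem {
    uint64_t phys_addr;
    uint32_t len;
    int      device_writes; /* non-zero: write-only for the device */
};

/* Hypothetical: free descriptors are chained through their next fields. */
struct desc_pool {
    struct vring_desc *desc;
    uint16_t free_head; /* index of the first free descriptor */
    uint16_t num_free;  /* separate count of free descriptors */
};

static void desc_pool_init(struct desc_pool *p, struct vring_desc *desc,
                           uint16_t qsz)
{
    uint16_t i;

    /* Chain every descriptor to its successor; the last link is
     * never followed because num_free is checked first. */
    for (i = 0; i < qsz; i++)
        desc[i].next = (uint16_t)(i + 1);
    p->desc = desc;
    p->free_head = 0;
    p->num_free = qsz;
}

/* Map n elements into a descriptor chain. Returns the head index,
 * or -1 if there are not enough free descriptors. */
static int map_buffer(struct desc_pool *p, const struct buf_elem *elems,
                      size_t n)
{
    uint16_t head, i;
    size_t k;

    assert(n > 0); /* a buffer must have at least one element */
    if (n > p->num_free)
        return -1;

    head = i = p->free_head;
    for (k = 0; k < n; k++) {
        struct vring_desc *d = &p->desc[i];

        d->addr = elems[k].phys_addr;
        d->len = elems[k].len;
        d->flags = elems[k].device_writes ? VRING_DESC_F_WRITE : 0;
        if (k + 1 < n)
            d->flags |= VRING_DESC_F_NEXT; /* d->next is already the
                                              next free descriptor */
        i = d->next;
    }
    p->free_head = i;
    p->num_free = (uint16_t)(p->num_free - n);
    return head;
}
```

Because the free list already links descriptors through next, the
chain falls out of simply walking that list; only the flags need to
be set per element.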

  Updating The Available Ring

The head of the buffer we mapped is the first d in the algorithm
above. A naive implementation would do the following:

avail->ring[avail->idx % qsz] = head;

However, in general we can add many descriptors before we update
the “idx” field (at which point they become visible to the
device), so we keep a counter of how many we've added:

avail->ring[(avail->idx + added++) % qsz] = head;

  Updating The Index Field

Once the idx field of the virtqueue is updated, the device will
be able to access the descriptor entries we've created and the
memory they refer to. This is why a memory barrier is generally
used before the idx update, to ensure it sees the most up-to-date
copy.

The idx field always increments, and we let it wrap naturally at
65536:

avail->idx += added;

  <sub:Notifying-The-Device>Notifying The Device

Device notification occurs by writing the 16-bit virtqueue index
of this virtqueue to the Queue Notify field of the virtio header
in the first I/O region of the PCI device. This can be expensive,
however, so the device can suppress such notifications if it
doesn't need them. We have to be careful to expose the new idx
value before checking the suppression flag: it's OK to notify
gratuitously, but not to omit a required notification. So again,
we use a memory barrier here before reading the flags or the
avail_event field.

If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated, and if
the VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and write
to the Queue Notify field.

If the VIRTIO_RING_F_EVENT_IDX feature is negotiated, we read the
avail_event field in the used ring structure. If the available
index crossed the avail_event field value since the last
notification, we go ahead and write to the Queue Notify field.
The avail_event field wraps naturally at 65536 as well:

(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)

  <sub:Receiving-Used-Buffers>Receiving Used Buffers From The
  Device

Once the device has used a buffer (read from or written to it, or
parts of both, depending on the nature of the virtqueue and the
device), it sends an interrupt, following an algorithm very
similar to the algorithm used for the driver to send the device a
buffer:

  Write the head descriptor number to the next entry in the used
  ring.

  Update the used ring idx.

  Determine whether an interrupt is necessary:

    If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated: check
    if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in
    avail->flags

    If the VIRTIO_RING_F_EVENT_IDX feature is negotiated: check
    whether the used index crossed the used_event field value
    since the last update. The used_event field wraps naturally
    at 65536 as well:

    (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)

  If an interrupt is necessary:

    If MSI-X capability is disabled:

      Set the lower bit of the ISR Status field for the device.

      Send the appropriate PCI interrupt for the device.

    If MSI-X capability is enabled:

      Request the appropriate MSI-X interrupt message for the
      device; the Queue Vector field sets the MSI-X Table entry
      number.

      If the Queue Vector field value is NO_VECTOR, no interrupt
      message is requested for this event.

The guest interrupt handler should:

  If MSI-X capability is disabled: read the ISR Status field,
  which will reset it to zero. If the lower bit is zero, the
  interrupt was not for this device. Otherwise, the guest driver
  should look through the used rings of each virtqueue for the
  device, to see if any progress has been made by the device
  which requires servicing.

  If MSI-X capability is enabled: look through the used rings of
  each virtqueue mapped to the specific MSI-X vector for the
  device, to see if any progress has been made by the device
  which requires servicing.

For each ring, the guest should then disable interrupts by setting
the VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if
required. It can then process used ring entries, finally enabling
interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or
updating the used_event field in the available structure. The
guest should then execute a memory barrier, and then recheck the
ring empty condition. This is necessary to handle the case where,
after the last check and before enabling interrupts, an interrupt
has been suppressed by the device:

vring_disable_interrupts(vq);

for (;;) {
    if (vq->last_seen_used == vring->used->idx) {
        /* Ring looks empty: re-enable interrupts, then recheck. */
        vring_enable_interrupts(vq);
        mb();
        if (vq->last_seen_used == vring->used->idx)
            break;
        /* More buffers arrived in the window: keep processing. */
        vring_disable_interrupts(vq);
    }

    struct vring_used_elem *e =
        &vring->used->ring[vq->last_seen_used % qsz];
    process_buffer(e);
    vq->last_seen_used++;
}

  Dealing With Configuration Changes

Some virtio PCI devices can change the device configuration
state, as reflected in the virtio header in the PCI configuration
space. In this case:

  If MSI-X capability is disabled: an interrupt is delivered and
  the second lowest bit is set in the ISR Status field to
  indicate that the driver should re-examine the configuration
  space. Note that a single interrupt can indicate both that one
  or more virtqueues have been used and that the configuration
  space has changed: even if the config bit is set, virtqueues
  must be scanned.

  If MSI-X capability is enabled: an interrupt message is
  requested. The Configuration Vector field sets the MSI-X Table
  entry number to use.
  If the Configuration Vector field value is
  NO_VECTOR, no interrupt message is requested for this event.

Creating New Device Types

Various considerations are necessary when creating a new device
type:

  How Many Virtqueues?

It is possible that a very simple device will operate entirely
through its configuration space, but most will need at least one
virtqueue in which it will place requests. A device with both
input and output (e.g. console and network devices described here)
needs two queues: one which the driver fills with buffers to
receive input, and one in which the driver places buffers to
transmit output.

  What Configuration Space Layout?

Configuration space is generally used for rarely-changing or
initialization-time parameters. But it is a limited resource, so
it might be better to use a virtqueue to update configuration
information (the network device does this for filtering,
otherwise the table in the config space could potentially be very
large).

Note that this space is generally the guest's native endian,
rather than PCI's little-endian.

  What Device Number?

Currently device numbers are assigned quite freely: a simple
request mail to the author of this document or the Linux
virtualization mailing list[footnote:
https://lists.linux-foundation.org/mailman/listinfo/virtualization
] will be sufficient to secure a unique one.

Meanwhile for experimental drivers, use 65535 and work backwards.

  How many MSI-X vectors?

Using the optional MSI-X capability, devices can speed up
interrupt processing by removing the need for the guest driver to
read the ISR Status register (which might be an expensive
operation), reducing interrupt sharing between devices and queues
within the device, and handling interrupts from multiple CPUs.
However, some
systems impose a limit (which might be as low as 256) on the
total number of MSI-X vectors that can be allocated to all
devices. Devices and/or device drivers should take this into
account, limiting the number of vectors used unless the device is
expected to cause a high volume of interrupts. Devices can
control the number of vectors used by limiting the MSI-X Table
Size or not presenting MSI-X capability in PCI configuration
space. Drivers can control this by mapping events to as small a
number of vectors as possible, or disabling MSI-X capability
altogether.

 Message Framing

The descriptors used for a buffer should not affect the semantics
of the message, except for the total length of the buffer. For
example, a network buffer consists of a 10 byte header followed
by the network packet. Whether this is presented in the ring
descriptor chain as (say) a 10 byte buffer and a 1514 byte
buffer, or a single 1524 byte buffer, or even three buffers,
should have no effect.

In particular, no implementation should use the descriptor
boundaries to determine the size of any header in a request.[footnote:
The current qemu device implementations mistakenly insist that
the first descriptor cover the header in these cases exactly, so
a cautious driver should arrange it so.
]

 Device Improvements

Any change to configuration space, or new virtqueues, or
behavioural changes, should be indicated by negotiation of a new
feature bit. This establishes clarity[footnote:
Even if it does mean documenting design or implementation
mistakes!
] and avoids future expansion problems.

Clusters of functionality which are always implemented together
can use a single bit, but if one feature makes sense without the
others they should not be gratuitously grouped together to
conserve feature bits.
We can always extend the spec when the
first person needs more than 24 feature bits for their device.

Appendix A: virtio_ring.h

#ifndef VIRTIO_RING_H
#define VIRTIO_RING_H
/* An interface for efficient virtio implementation.
 *
 * This header is BSD licensed so anyone can use the definitions
 * to implement compatible drivers/servers.
 *
 * Copyright 2007, 2009, IBM Corporation
 * Copyright 2011, Red Hat, Inc
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of IBM nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.
 * IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2

/* The Host uses this in used->flags to advise the Guest: don't kick me
 * when you add a buffer. It's unreliable, so it's simply an
 * optimization. Guest will still kick if it's out of buffers. */
#define VRING_USED_F_NO_NOTIFY 1
/* The Guest uses this in avail->flags to advise the Host: don't
 * interrupt me when you consume a buffer. It's unreliable, so it's
 * simply an optimization. */
#define VRING_AVAIL_F_NO_INTERRUPT 1

/* Virtio ring descriptors: 16 bytes.
 * These can chain together via "next". */
struct vring_desc {
    /* Address (guest-physical). */
    uint64_t addr;
    /* Length. */
    uint32_t len;
    /* The flags as indicated above.
     */
    uint16_t flags;
    /* We chain unused descriptors via this, too */
    uint16_t next;
};

struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[];
    /* Followed in memory by: uint16_t used_event;
     * (C does not allow a member after a flexible array, so it is
     * not declared here.) */
};

/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
    /* Index of start of used descriptor chain. */
    uint32_t id;
    /* Total length of the descriptor chain which was written to. */
    uint32_t len;
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];
    /* Followed in memory by: uint16_t avail_event; */
};

struct vring {
    unsigned int num;

    struct vring_desc *desc;
    struct vring_avail *avail;
    struct vring_used *used;
};

/* The standard layout for the ring is a continuous chunk of memory which
 * looks like this. We assume num is a power of 2.
 *
 * struct vring {
 *     // The actual descriptors (16 bytes each)
 *     struct vring_desc desc[num];
 *
 *     // A ring of available descriptor heads with free-running index.
 *     __u16 avail_flags;
 *     __u16 avail_idx;
 *     __u16 available[num];
 *
 *     // Padding to the next align boundary.
 *     char pad[];
 *
 *     // A ring of used descriptor heads with free-running index.
 *     __u16 used_flags;
 *     __u16 used_idx;
 *     struct vring_used_elem used[num];
 * };
 * Note: for virtio PCI, align is 4096.
 */
static inline void vring_init(struct vring *vr, unsigned int num, void *p,
                              unsigned long align)
{
    vr->num = num;
    vr->desc = p;
    /* Cast to char *: arithmetic on void * is a compiler extension. */
    vr->avail = (void *)((char *)p + num*sizeof(struct vring_desc));
    vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
                         + align-1)
                        & ~(align - 1));
}

static inline unsigned vring_size(unsigned int num, unsigned long align)
{
    return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
             + align - 1) & ~(align - 1))
        + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
}

static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
#endif /* VIRTIO_RING_H */

<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits

Currently there are five device-independent feature bits defined:

 VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
 indicates that the driver wants an interrupt if the device runs
 out of available descriptors on a virtqueue, even though
 interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
 flag or the used_event field. An example of this is the
 networking driver: it doesn't need to know every time a packet
 is transmitted, but it does need to free the transmitted
 packets a finite time after they are transmitted. It can avoid
 using a timer if the device interrupts it when all the packets
 are transmitted.

 VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature
 indicates that the driver can use descriptors with the
 VRING_DESC_F_INDIRECT flag set, as described in
 [sub:Indirect-Descriptors].

 VIRTIO_F_RING_EVENT_IDX (29) This feature enables the used_event
 and the avail_event fields. If set, it indicates that the
 device should ignore the flags field in the available ring
 structure. Instead, the used_event field in this structure is
 used by the guest to suppress device interrupts. Further, the
 driver should ignore the flags field in the used ring
 structure. Instead, the avail_event field in this structure is
 used by the device to suppress notifications. If unset, the
 driver should ignore the used_event field; the device should
 ignore the avail_event field; the flags field is used.

 VIRTIO_F_BAD_FEATURE (30) This feature should never be
 negotiated by the guest; doing so is an indication that the
 guest is faulty[footnote:
An experimental virtio PCI driver contained in Linux version
2.6.25 had this problem, and this feature bit can be used to
detect it.
]

 VIRTIO_F_FEATURES_HIGH (31) This feature indicates that the
 device supports feature bits 32:63. If unset, feature bits
 32:63 are unset.

Appendix C: Network Device

The virtio network device is a virtual ethernet card, and is the
most complex of the devices supported so far by virtio. It has
been enhanced rapidly and demonstrates clearly how support for
new features should be added to an existing device. Empty buffers
are placed in one virtqueue for receiving packets, and outgoing
packets are enqueued into another for transmission in that order.
A third command queue is used to control advanced filtering
features.

 Configuration

 Subsystem Device ID 1

 Virtqueues 0:receiveq. 1:transmitq.
 2:controlq[footnote:
Only if VIRTIO_NET_F_CTRL_VQ set
]

 Feature bits

 VIRTIO_NET_F_CSUM (0) Device handles packets with partial
 checksum

 VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial
 checksum

 VIRTIO_NET_F_MAC (5) Device has given MAC address.

 VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with
 any GSO type.[footnote:
It was supposed to indicate segmentation offload support, but
upon further investigation it became clear that multiple bits
were required.
]

 VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.

 VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.

 VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN.

 VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.

 VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.

 VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.

 VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN.

 VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.

 VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers.

 VIRTIO_NET_F_STATUS (16) Configuration status field is
 available.

 VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.

 VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support.

 VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.

 Device configuration layout Two configuration fields are
 currently defined. The MAC address field always exists (though
 it is only valid if VIRTIO_NET_F_MAC is set), and the status
 field only exists if VIRTIO_NET_F_STATUS is set. Only one bit is
 currently defined for the status field: VIRTIO_NET_S_LINK_UP.
#define VIRTIO_NET_S_LINK_UP 1

struct virtio_net_config {
    u8 mac[6];
    u16 status;
};

 Device Initialization

 The initialization routine should identify the receive and
 transmission virtqueues.

 If the VIRTIO_NET_F_MAC feature bit is set, the configuration
 space “mac” entry indicates the “physical” address of the
 network card, otherwise a private MAC address should be
 assigned. All guests are expected to negotiate this feature if
 it is set.

 If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify
 the control virtqueue.

 If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link
 status can be read from the bottom bit of the “status” config
 field. Otherwise, the link should be assumed active.

 The receive virtqueue should be filled with receive buffers.
 This is described in detail below in “Setting Up Receive
 Buffers”.

 A driver can indicate that it will generate checksumless
 packets by negotiating the VIRTIO_NET_F_CSUM feature. This
 “checksum offload” is a common feature on modern network cards.

 If that feature is negotiated, a driver can use TCP or UDP
 segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4
 (IPv4 TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and
 VIRTIO_NET_F_HOST_UFO (UDP fragmentation) features. It should
 not send TCP packets requiring segmentation offload which have
 the Explicit Congestion Notification bit set, unless the
 VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
This is a common restriction in real, older network cards.
]

 The converse features are also available: a driver can save the
 virtual device some work by negotiating these features.[footnote:
For example, a network packet transported between two guests on
the same system may not require checksumming at all, nor
segmentation, if both guests are amenable.
] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially
 checksummed packets can be received, and if it can do that then
 the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
 VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input
 equivalents of the features described above. See “Packet
 Receive Interrupt” below.

 Device Operation

Packets are transmitted by placing them in the transmitq, and
buffers for incoming packets are placed in the receiveq. In each
case, the packet itself is preceded by a header:

struct virtio_net_hdr {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE 0
#define VIRTIO_NET_HDR_GSO_TCPV4 1
#define VIRTIO_NET_HDR_GSO_UDP 3
#define VIRTIO_NET_HDR_GSO_TCPV6 4
#define VIRTIO_NET_HDR_GSO_ECN 0x80
    u8 gso_type;
    u16 hdr_len;
    u16 gso_size;
    u16 csum_start;
    u16 csum_offset;
/* Only if VIRTIO_NET_F_MRG_RXBUF: */
    u16 num_buffers;
};

The controlq is used to control device features such as
filtering.

 Packet Transmission

Transmitting a single packet is simple, but varies depending on
the different features the driver negotiated.

 If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has
 not been fully checksummed, then the virtio_net_hdr's fields
 are set as follows. Otherwise, the packet must be fully
 checksummed, and flags is zero.

  flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,

  <ite:csum_start-is-set>csum_start is set to the offset within
  the packet to begin checksumming, and

  csum_offset indicates how many bytes after the csum_start the
  new (16 bit ones' complement) checksum should be placed.[footnote:
For example, consider a partially checksummed TCP (IPv4) packet.
It will have a 14 byte ethernet header and 20 byte IP header
followed by the TCP header (with the TCP checksum field 16 bytes
into that header). csum_start will be 14+20 = 34 (the TCP
checksum includes the header), and csum_offset will be 16. The
value in the TCP checksum field will be the sum of the TCP pseudo
header, so that replacing it by the ones' complement checksum of
the TCP header and body will give the correct result.
]

 <enu:If-the-driver>If the driver negotiated
 VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires
 TCP segmentation or UDP fragmentation, then the “gso_type”
 field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP.
 (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this
 case, packets larger than 1514 bytes can be transmitted: the
 metadata indicates how to replicate the packet header to cut it
 into smaller packets. The other gso fields are set:

  hdr_len is a hint to the device as to how much of the header
  needs to be kept to copy into each packet, usually set to the
  length of the headers, including the transport header.[footnote:
Due to various bugs in implementations, this field is not useful
as a guarantee of the transport header size.
]

  gso_size is the size of the packet beyond that header (i.e.
  MSS).

 If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
 VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
 indicating that the TCP packet has the ECN bit set.[footnote:
This case is not handled by some older hardware, so is called out
specifically in the protocol.
]

 If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
 the num_buffers field is set to zero.

 The header and packet are added as one output buffer to the
 transmitq, and the device is notified of the new entry (see
 [sub:Notifying-The-Device]).[footnote:
Note that the header will be two bytes longer for the
VIRTIO_NET_F_MRG_RXBUF case.
]

 Packet Transmission Interrupt

Often a driver will suppress transmission interrupts using the
VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers])
and check for used packets in the transmit path of following
packets. However, it will still receive interrupts if the
VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that
the transmission queue is completely emptied.

The normal behavior in this interrupt handler is to retrieve any
new descriptors from the used ring and free the corresponding
headers and packets.

 Setting Up Receive Buffers

It is generally a good idea to keep the receive virtqueue as
fully populated as possible: if it runs out, network performance
will suffer.

If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or
VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to
accept packets of up to 65550 bytes long (the maximum size of a
TCP or UDP packet, plus the 14 byte ethernet header), otherwise
1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every
buffer in the receive queue needs to be at least this length[footnote:
Obviously each one can be split across multiple descriptor
elements.
].
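
The sizing rule above can be sketched as a small helper. This is
only an illustration, not part of the specification: the function
name rx_min_packet_bytes is hypothetical, the feature bit numbers
are those listed under “Feature bits” in this appendix, and it is
assumed that the quoted lengths count packet bytes only (the text
does not say whether the struct virtio_net_hdr is included). It
covers the case where VIRTIO_NET_F_MRG_RXBUF is not negotiated.

```c
#include <stdint.h>

/* Feature bit numbers, as listed in this appendix. */
#define VIRTIO_NET_F_GUEST_TSO4  7
#define VIRTIO_NET_F_GUEST_TSO6  8
#define VIRTIO_NET_F_GUEST_UFO  10

/* Hypothetical helper: smallest packet capacity a receive buffer
 * needs when VIRTIO_NET_F_MRG_RXBUF is NOT negotiated.
 * 'features' holds the negotiated feature bits 0..31. */
static inline uint32_t rx_min_packet_bytes(uint32_t features)
{
    const uint32_t big_packets =
        (1u << VIRTIO_NET_F_GUEST_TSO4) |
        (1u << VIRTIO_NET_F_GUEST_TSO6) |
        (1u << VIRTIO_NET_F_GUEST_UFO);

    /* Any receive offload means packets of up to 65550 bytes;
     * otherwise a plain 1514 byte ethernet frame is the maximum. */
    return (features & big_packets) ? 65550u : 1514u;
}
```

A driver might call this once after feature negotiation and use
the result when allocating every receive buffer.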

If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
least the size of the struct virtio_net_hdr.

 Packet Receive Interrupt

When a packet is copied into a buffer in the receiveq, the
optimal path is to disable further interrupts for the receiveq
(see [sub:Receiving-Used-Buffers]) and process packets until no
more are found, then re-enable them.

Processing a packet involves:

 If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
 then the “num_buffers” field indicates how many descriptors
 this packet is spread over (including this one). This allows
 receipt of large packets without having to allocate large
 buffers. In this case, there will be at least “num_buffers”
 buffers in the used ring, and they should be chained together
 to form a single packet. The other buffers will not begin with
 a struct virtio_net_hdr.

 If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or
 the “num_buffers” field is one, then the entire packet will be
 contained within this buffer, immediately following the struct
 virtio_net_hdr.

 If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
 VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be
 set: if so, the checksum on the packet is incomplete and the
 “csum_start” and “csum_offset” fields indicate how to calculate
 it (see [ite:csum_start-is-set]).

 If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
 negotiated, then the “gso_type” may be something other than
 VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
 desired MSS (see [enu:If-the-driver]).

 Control Virtqueue

The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is
negotiated) to send commands to manipulate various features of
the device which would not easily map into the configuration
space.

All commands are of the following form:

struct virtio_net_ctrl {
    u8 class;
    u8 command;
    u8 command-specific-data[];
    u8 ack;
};

/* ack values */
#define VIRTIO_NET_OK 0
#define VIRTIO_NET_ERR 1

The class, command and command-specific-data are set by the
driver, and the device sets the ack byte. There is little the
driver can do except issue a diagnostic if the ack byte is not
VIRTIO_NET_OK.

 Packet Receive Filtering

If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
send control commands for promiscuous mode, multicast receiving,
and filtering of MAC addresses.

Note that in general, these commands are best-effort: unwanted
packets may still arrive.

 Setting Promiscuous Mode

#define VIRTIO_NET_CTRL_RX 0
 #define VIRTIO_NET_CTRL_RX_PROMISC 0
 #define VIRTIO_NET_CTRL_RX_ALLMULTI 1

The class VIRTIO_NET_CTRL_RX has two commands:
VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
off. The command-specific-data is one byte containing 0 (off) or
1 (on).

 Setting MAC Address Filtering

struct virtio_net_ctrl_mac {
    u32 entries;
    u8 macs[entries][ETH_ALEN];
};

#define VIRTIO_NET_CTRL_MAC 1
 #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0

The device can filter incoming packets by any number of
destination MAC addresses.[footnote:
Since there are no guarantees, it can use a hash filter
or silently switch to allmulti or promiscuous mode if it is given
too many addresses.
] This table is set using the class VIRTIO_NET_CTRL_MAC and the
command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data
is two variable length tables of 6-byte MAC addresses.
The first
table contains unicast addresses, and the second contains
multicast addresses.

 VLAN Filtering

If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it
can control a VLAN filter table in the device.

#define VIRTIO_NET_CTRL_VLAN 2
 #define VIRTIO_NET_CTRL_VLAN_ADD 0
 #define VIRTIO_NET_CTRL_VLAN_DEL 1

Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
commands take a 16-bit VLAN id as the command-specific-data.

Appendix D: Block Device

The virtio block device is a simple virtual block device (i.e. a
disk). Read and write requests (and other exotic requests) are
placed in the queue, and serviced (probably out of order) by the
device except where noted.

 Configuration

 Subsystem Device ID 2

 Virtqueues 0:requestq.

 Feature bits

 VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.

 VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
 in “size_max”.

 VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
 request is in “seg_max”.

 VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in
 “geometry”.

 VIRTIO_BLK_F_RO (5) Device is read-only.

 VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.

 VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.

 VIRTIO_BLK_F_FLUSH (9) Cache flush command support.

 Device configuration layout The capacity of the device
 (expressed in 512-byte sectors) is always present. The
 availability of the others depends on various feature bits
 as indicated above.
struct virtio_blk_config {
    u64 capacity;
    u32 size_max;
    u32 seg_max;
    struct virtio_blk_geometry {
        u16 cylinders;
        u8 heads;
        u8 sectors;
    } geometry;
    u32 blk_size;

};

 Device Initialization

 The device size should be read from the “capacity”
 configuration field. No request should be submitted which goes
 beyond this limit.

 If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
 blk_size field can be read to determine the optimal sector size
 for the driver to use. This does not affect the units used in
 the protocol (always 512 bytes), but awareness of the correct
 value can affect performance.

 If the VIRTIO_BLK_F_RO feature is set by the device, any write
 requests will fail.

 Device Operation

The driver queues requests to the virtqueue, and they are used by
the device (not necessarily in order).
Each request is of the form:

struct virtio_blk_req {

    u32 type;
    u32 ioprio;
    u64 sector;
    char data[][512];
    u8 status;
};

If the device has the VIRTIO_BLK_F_SCSI feature, it can also
support scsi packet command requests; each of these requests is
of the form:

struct virtio_scsi_pc_req {
    u32 type;
    u32 ioprio;
    u64 sector;
    char cmd[];
    char data[][512];
#define SCSI_SENSE_BUFFERSIZE 96
    u8 sense[SCSI_SENSE_BUFFERSIZE];
    u32 errors;
    u32 data_len;
    u32 sense_len;
    u32 residual;
    u8 status;
};

The type of the request is either a read (VIRTIO_BLK_T_IN), a
write (VIRTIO_BLK_T_OUT), a scsi packet command
(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote:
the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device
does not distinguish between them
]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote:
the FLUSH and FLUSH_OUT types are equivalent, the device does not
distinguish between them
]). If the device has the VIRTIO_BLK_F_BARRIER feature, the high
bit (VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
barrier and that all preceding requests must be complete before
this one, and all following requests must not be started until
this is complete. Note that a barrier does not flush caches in
the underlying backend device in the host, and thus does not
serve as a data consistency guarantee. The driver must use a
FLUSH request to flush the host cache.

#define VIRTIO_BLK_T_IN 0
#define VIRTIO_BLK_T_OUT 1
#define VIRTIO_BLK_T_SCSI_CMD 2
#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
#define VIRTIO_BLK_T_FLUSH 4
#define VIRTIO_BLK_T_FLUSH_OUT 5
#define VIRTIO_BLK_T_BARRIER 0x80000000

The ioprio field is a hint about the relative priorities of
requests to the device: higher numbers indicate more important
requests.

The sector number indicates the offset (multiplied by 512) where
the read or write is to occur. This field is unused and set to 0
for scsi packet commands and for flush commands.

The cmd field is only present for scsi packet command requests,
and indicates the command to perform. This field must reside in a
single, separate read-only buffer; command length can be derived
from the length of this buffer.

Note that these first three (four for scsi packet commands)
fields are always read-only: the data field is either read-only
or write-only, depending on the request. The size of the read or
write can be derived from the total size of the request buffers.

The sense field is only present for scsi packet command requests,
and indicates the buffer for scsi sense data.

The data_len field is only present for scsi packet command
requests; this field is deprecated, and should be ignored by the
driver. Historically, devices copied data length there.

The sense_len field is only present for scsi packet command
requests and indicates the number of bytes actually written to
the sense buffer.

The residual field is only present for scsi packet command
requests and indicates the residual size, calculated as data
length - number of bytes actually transferred.
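
The type and sector fields described above can be decoded with a
few one-line helpers. This is a minimal sketch rather than code
from any driver: the blk_req_* function names are hypothetical,
and only the constants already defined in this appendix are
assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Request type constants, as defined in this appendix. */
#define VIRTIO_BLK_T_IN      0
#define VIRTIO_BLK_T_OUT     1
#define VIRTIO_BLK_T_BARRIER 0x80000000

/* Hypothetical helper: request type with the barrier bit removed. */
static inline uint32_t blk_req_base_type(uint32_t type)
{
    return type & ~(uint32_t)VIRTIO_BLK_T_BARRIER;
}

/* Hypothetical helper: true if the high (barrier) bit is set. */
static inline bool blk_req_is_barrier(uint32_t type)
{
    return (type & (uint32_t)VIRTIO_BLK_T_BARRIER) != 0;
}

/* Hypothetical helper: byte offset on the device for a request,
 * i.e. the sector number multiplied by 512. */
static inline uint64_t blk_req_byte_offset(uint64_t sector)
{
    return sector * 512u;
}
```

For instance, a write marked as a barrier carries the type value
VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_BARRIER, and a request for sector
8 starts at byte offset 4096.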

The final status byte is written by the device: either
VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:

#define VIRTIO_BLK_S_OK 0
#define VIRTIO_BLK_S_IOERR 1
#define VIRTIO_BLK_S_UNSUPP 2

Historically, devices assumed that the fields type, ioprio and
sector reside in a single, separate read-only buffer; the sense
field in a separate write-only buffer of size 96 bytes, by
itself; the fields errors, data_len, sense_len and residual in a
single, separate write-only buffer; and the status field in a
separate write-only buffer of size 1 byte, by itself.

Appendix E: Console Device

The virtio console device is a simple device for data input and
output. A device may have one or more ports. Each port has a pair
of input and output virtqueues. Moreover, a device has a pair of
control IO virtqueues. The control virtqueues are used to
communicate information between the device and the driver about
ports being opened and closed on either side of the connection,
indication from the host about whether a particular port is a
console port, adding new ports, port hot-plug/unplug, etc., and
indication from the guest about whether a port or a device was
successfully added, port open/close, etc. For data IO, one or
more empty buffers are placed in the receive queue for incoming
data and outgoing characters are placed in the transmit queue.

 Configuration

 Subsystem Device ID 3

 Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control
 receiveq[footnote:
Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set
], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
 ...

 Feature bits

 VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
 are valid.

 VIRTIO_CONSOLE_F_MULTIPORT (1) Device has support for multiple
 ports; configuration fields nr_ports and max_nr_ports are
 valid and control virtqueues will be used.

 Device configuration layout The size of the console is supplied
 in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
 is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
 is set, the maximum number of ports supported by the device can
 be fetched.

struct virtio_console_config {
    u16 cols;
    u16 rows;

    u32 max_nr_ports;
};

 Device Initialization

 If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
 can read the console dimensions from the configuration fields.

 If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
 driver can spawn multiple ports, not all of which may be
 attached to a console. Some could be generic ports. In this
 case, the control virtqueues are enabled and according to the
 max_nr_ports configuration-space value, the appropriate number
 of virtqueues are created. A control message indicating the
 driver is ready is sent to the host. The host can then send
 control messages for adding new ports to the device. After
 creating and initializing each port, a
 VIRTIO_CONSOLE_PORT_READY control message is sent to the host
 for that port so the host can let us know of any additional
 configuration options set for that port.

 The receiveq for each port is populated with one or more
 receive buffers.
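
The virtqueue numbering given under Configuration above (port 0
on queues 0 and 1, the control queues on 2 and 3, port 1 on 4 and
5) can be turned into index arithmetic. The sketch below is an
illustration only: the console_* function names are hypothetical,
and it assumes that the pattern continues for later ports (port 2
on queues 6 and 7, and so on), which the list only implies with
its trailing "...".

```c
#include <stdint.h>

/* Hypothetical helper: index of the receiveq for a port.
 * Port 0 uses queue 0; queues 2 and 3 are the control queues,
 * so port n (n >= 1) is assumed to start at queue 2n + 2. */
static inline uint32_t console_rx_vq(uint32_t port)
{
    return port == 0 ? 0 : 2 * port + 2;
}

/* Hypothetical helper: the transmitq directly follows the receiveq. */
static inline uint32_t console_tx_vq(uint32_t port)
{
    return console_rx_vq(port) + 1;
}

/* Hypothetical helper: number of virtqueues to create when
 * VIRTIO_CONSOLE_F_MULTIPORT is negotiated: one pair per port
 * plus the control pair. */
static inline uint32_t console_num_vqs(uint32_t max_nr_ports)
{
    return 2 * max_nr_ports + 2;
}
```

With max_nr_ports read from the configuration space, a driver
could size its virtqueue array with console_num_vqs() and look up
each port's queues with the other two helpers.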

  Device Operation

  For output, a buffer containing the characters is placed in the
  port's transmitq.[footnote:
Because this is high importance and low bandwidth, the current
Linux implementation polls for the buffer to be used, rather than
waiting for an interrupt, simplifying the implementation
significantly. However, for generic serial ports with the
O_NONBLOCK flag set, the polling limitation is relaxed and the
consumed buffers are freed upon the next write or poll call or
when a port is closed or hot-unplugged.
]

  When a buffer is used in the receiveq (signalled by an
  interrupt), its contents are the input to the port associated
  with the virtqueue for which the notification was received.

  If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
  configuration change interrupt may occur. The updated size can
  be read from the configuration fields.

  If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
  feature, active ports are announced by the host using the
  VIRTIO_CONSOLE_PORT_ADD control message. The same message is
  used for port hot-plug as well.

  If the host specified a port `name', a sysfs attribute is
  created with the name filled in, so that udev rules can be
  written that create a symlink from the port's name to the
  char device for port discovery by applications in the guest.

  Changes to ports' state are effected by control messages.
  Appropriate action is taken on the port indicated in the
  control message.
  The layout of the structure of the control
  buffer and the events associated are:

struct virtio_console_control {
        uint32_t id;    /* Port number */
        uint16_t event; /* The kind of control event */
        uint16_t value; /* Extra information for the event */
};

/* Some events for the internal messages (control packets) */

#define VIRTIO_CONSOLE_DEVICE_READY     0
#define VIRTIO_CONSOLE_PORT_ADD         1
#define VIRTIO_CONSOLE_PORT_REMOVE      2
#define VIRTIO_CONSOLE_PORT_READY       3
#define VIRTIO_CONSOLE_CONSOLE_PORT     4
#define VIRTIO_CONSOLE_RESIZE           5
#define VIRTIO_CONSOLE_PORT_OPEN        6
#define VIRTIO_CONSOLE_PORT_NAME        7

Appendix F: Entropy Device

The virtio entropy device supplies high-quality randomness for
guest use.

  Configuration

  Subsystem Device ID 4

  Virtqueues 0:requestq.

  Feature bits None currently defined.

  Device configuration layout None currently defined.

  Device Initialization

  The virtqueue is initialized.

  Device Operation

When the driver requires random bytes, it places the descriptor
of one or more buffers in the queue. Each buffer will be
completely filled with random data by the device.

Appendix G: Memory Balloon Device

The virtio memory balloon device is a primitive device for
managing guest memory: the device asks for a certain amount of
memory, and the guest supplies it (or withdraws it, if the device
has more than it asks for). This allows the guest to adapt to
changes in allowance of underlying physical memory. If the
VIRTIO_BALLOON_F_STATS_VQ feature is negotiated, the device can
also be used to communicate guest memory statistics to the host.

  Configuration

  Subsystem Device ID 5

  Virtqueues 0:inflateq. 1:deflateq.
  2:statsq.[footnote:
Only if VIRTIO_BALLOON_F_STATS_VQ set
]

  Feature bits

    VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
    pages from the balloon are used.

    VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
    memory statistics is present.

  Device configuration layout Both fields of this configuration
  are always available. Note that they are little endian, despite
  the convention that device fields are guest endian:

struct virtio_balloon_config {
        u32 num_pages;
        u32 actual;
};

  Device Initialization

  The inflate and deflate virtqueues are identified.

  If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:

    Identify the stats virtqueue.

    Add one empty buffer to the stats virtqueue and notify the
    host.

Device operation begins immediately.

  Device Operation

  Memory Ballooning The device is driven by the receipt of a
  configuration change interrupt.

  The “num_pages” configuration field is examined. If this is
  greater than the “actual” number of pages, memory must be given
  to the balloon. If it is less than the “actual” number of
  pages, memory may be taken back from the balloon for general
  use.

  To supply memory to the balloon (a.k.a. inflate):

    The driver constructs an array of addresses of unused memory
    pages. These addresses are divided by 4096[footnote:
This is historical, and independent of the guest page size
] and the descriptor describing the resulting 32-bit array is
    added to the inflateq.

  To remove memory from the balloon (a.k.a. deflate):

    The driver constructs an array of addresses of memory pages it
    has previously given to the balloon, as described above. The
    descriptor describing this array is added to the deflateq.

    If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the
    guest may not use these requested pages until that descriptor
    in the deflateq has been used by the device.

    Otherwise, the guest may begin to re-use pages previously given
    to the balloon before the device has acknowledged their
    withdrawal.[footnote:
In this case, deflation advice is merely a courtesy
]

  In either case, once the device has completed the inflation or
  deflation, the “actual” field of the configuration should be
  updated to reflect the new number of pages in the balloon.[footnote:
As updates to configuration space are not atomic, this field
isn't particularly reliable, but can be used to diagnose buggy
guests.
]

  Memory Statistics

The stats virtqueue is atypical because communication is driven
by the device (not the driver). The channel becomes active at
driver initialization time when the driver adds an empty buffer
and notifies the device. A request for memory statistics proceeds
as follows:

  The device pushes the buffer onto the used ring and sends an
  interrupt.

  The driver pops the used buffer and discards it.

  The driver collects memory statistics and writes them into a
  new buffer.

  The driver adds the buffer to the virtqueue and notifies the
  device.

  The device pops the buffer (retaining it to initiate a
  subsequent request) and consumes the statistics.

  Memory Statistics Format Each statistic consists of a 16-bit
  tag and a 64-bit value. Both quantities are represented in the
  native endianness of the guest. All statistics are optional and
  the driver may choose which ones to supply. To guarantee
  backwards compatibility, unsupported statistics should be
  omitted.

struct virtio_balloon_stat {
#define VIRTIO_BALLOON_S_SWAP_IN  0
#define VIRTIO_BALLOON_S_SWAP_OUT 1
#define VIRTIO_BALLOON_S_MAJFLT   2
#define VIRTIO_BALLOON_S_MINFLT   3
#define VIRTIO_BALLOON_S_MEMFREE  4
#define VIRTIO_BALLOON_S_MEMTOT   5
        u16 tag;
        u64 val;
} __attribute__((packed));

  Tags

    VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
    swapped in (in bytes).

    VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
    swapped out to disk (in bytes).

    VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
    have occurred.

    VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
    have occurred.

    VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
    for any purpose (in bytes).

    VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
    (in bytes).
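
  Because the statistics structure is packed, each entry occupies
  10 bytes (a 16-bit tag followed directly by a 64-bit value), and
  a driver simply omits any tag it cannot supply. The following is
  a minimal sketch of how a driver might fill a statistics buffer
  before adding it to the stats virtqueue. It assumes a GCC or
  Clang compatible compiler for __attribute__((packed)); the
  helper names balloon_stat_add and balloon_stats_len are
  hypothetical and not part of this specification.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_S_SWAP_IN  0
#define VIRTIO_BALLOON_S_SWAP_OUT 1
#define VIRTIO_BALLOON_S_MAJFLT   2
#define VIRTIO_BALLOON_S_MINFLT   3
#define VIRTIO_BALLOON_S_MEMFREE  4
#define VIRTIO_BALLOON_S_MEMTOT   5

/* Same layout as the structure above, with fixed-width types. */
struct virtio_balloon_stat {
        uint16_t tag;
        uint64_t val;
} __attribute__((packed));

/* Hypothetical helper: append one statistic in guest-native byte
 * order. Returns the new entry count; if the buffer is already
 * full the entry is dropped, which is permitted because every
 * statistic is optional. */
static inline size_t balloon_stat_add(struct virtio_balloon_stat *buf,
                                      size_t count, size_t capacity,
                                      uint16_t tag, uint64_t val)
{
        if (count >= capacity)
                return count;
        buf[count].tag = tag;
        buf[count].val = val;
        return count + 1;
}

/* Hypothetical helper: number of bytes the descriptor for the
 * statistics buffer should cover. */
static inline size_t balloon_stats_len(size_t count)
{
        return count * sizeof(struct virtio_balloon_stat);
}
```

  A driver that only knows free and total memory would add the
  VIRTIO_BALLOON_S_MEMFREE and VIRTIO_BALLOON_S_MEMTOT entries and
  expose a 20-byte buffer to the device, leaving the swap and
  page-fault tags out entirely.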