---
title: Journal File Format
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Journal File Format

_Note that this document describes the binary on-disk format of journals only.
For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS.md#journal-json-format).
For transfer of journal data across the network there's the [Journal Export Format](JOURNAL_EXPORT_FORMATS.md#journal-export-format)._

The systemd journal stores log data in a binary format with several features:

* Fully indexed by all fields
* Can store binary data, up to 2^64-1 bytes in size
* Seekable
* Primarily append-based, hence robust to corruption
* Support for in-line compression
* Support for in-line Forward Secure Sealing

This document explains the basic structure of the file format on disk. We are
making this available primarily to allow review and provide documentation. Note
that the actual implementation in the [systemd
codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/) is the
only ultimately authoritative description of the format, so if this document
and the code disagree, the code is right. That said, we'll of course try hard to
keep this document up-to-date and accurate.

Instead of implementing your own reader or writer for journal files we ask you
to use the [Journal's native C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) to access
these files. It provides you with full access to the files, and will not
withhold any data. If you find a limitation, please ping us and we might add
some additional interfaces for you.
If you need access to the raw journal data in serialized stream form without C
API our recommendation is to make use of the [Journal Export
Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format), which you can
get via `journalctl -o export` or via `systemd-journal-gatewayd`. The export
format is much simpler to parse, but complete and accurate. Due to its
stream-based nature it is not indexed.

_Or, to put this in other words: this low-level document is probably not what
you want to use as base of your project. You want our [C
API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead!
And if you really don't want the C API, then you want the
[Journal Export Format or Journal JSON Format](JOURNAL_EXPORT_FORMATS.md)
instead! This document is primarily for your entertainment and education.
Thank you!_

This document assumes you have a basic understanding of the journal concepts,
the properties of a journal entry and so on. If not, please go and read up,
then come back! This is a good opportunity to read about the [basic properties
of journal
entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html),
in particular realize that they may include binary non-text data (though
usually don't), and the same field might have multiple values assigned within
the same entry.

This document describes the current format of systemd 246. The documented
format is compatible with the format used in the first versions of the journal,
but has received various compatible and incompatible additions since then.

If you are wondering why the journal file format has been created in the first
place instead of adopting an existing database implementation, please have a
look [at this
thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html).


## Basics

* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
* Offsets are always relative to the beginning of the file.
* The 64bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one used as the lower 32bit part.
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit.
* The format is designed to be read and written via memory mapping using multiple mapped windows.
* All time values are stored in usec since the respective epoch.
* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970 (`CLOCK_REALTIME`).
* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers (`CLOCK_MONOTONIC`).
* Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement.

## General Rules

If any kind of corruption is noticed by a writer it should immediately rotate
the file and start a new one. No further writes should be attempted to the
original file, but it should be left around so that as little data as possible
is lost.

If any kind of corruption is noticed by a reader it should try hard to handle
this gracefully, such as skipping over the corrupted data, but allowing access
to as much data around it as possible.

A reader should verify all offsets and other data as it reads them. This includes
checking for alignment and range of offsets in the file, especially before
trying to read data via a memory map.

A reader must interleave rotated and corrupted files as well as possible and
present them as a single stream to the user.

All fields marked as "reserved" must be initialized with 0 when writing and be
ignored on reading. They are currently not used but might be used later on.


## Structure

The file format's data structures are declared in
[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h).

The file format begins with a header structure. After the header structure,
object structures follow. Objects are appended to the end as time
progresses. Most data stored in these objects is not altered anymore after
having been written once, with the exception of records necessary for
indexing. When new data is appended to a file the writer first writes all new
objects to the end of the file, and then links them up at the front after
that's done. Currently, seven different object types are known:

```c
enum {
        OBJECT_UNUSED,
        OBJECT_DATA,
        OBJECT_FIELD,
        OBJECT_ENTRY,
        OBJECT_DATA_HASH_TABLE,
        OBJECT_FIELD_HASH_TABLE,
        OBJECT_ENTRY_ARRAY,
        OBJECT_TAG,
        _OBJECT_TYPE_MAX
};
```

* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.`, but possibly including large or binary data, and always prefixed by the field name and "=".
* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even a value.
* An **ENTRY** object, which binds several **DATA** objects together into a log entry.
* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects.
* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects.
* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search.
* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later).

## Header

The Header struct defines, well, you guessed it, the file header:

```c
_packed_ struct Header {
        uint8_t signature[8]; /* "LPKSHHRH" */
        le32_t compatible_flags;
        le32_t incompatible_flags;
        uint8_t state;
        uint8_t reserved[7];
        sd_id128_t file_id;
        sd_id128_t machine_id;
        sd_id128_t boot_id;    /* last writer */
        sd_id128_t seqnum_id;
        le64_t header_size;
        le64_t arena_size;
        le64_t data_hash_table_offset;
        le64_t data_hash_table_size;
        le64_t field_hash_table_offset;
        le64_t field_hash_table_size;
        le64_t tail_object_offset;
        le64_t n_objects;
        le64_t n_entries;
        le64_t tail_entry_seqnum;
        le64_t head_entry_seqnum;
        le64_t entry_array_offset;
        le64_t head_entry_realtime;
        le64_t tail_entry_realtime;
        le64_t tail_entry_monotonic;
        /* Added in 187 */
        le64_t n_data;
        le64_t n_fields;
        /* Added in 189 */
        le64_t n_tags;
        le64_t n_entry_arrays;
        /* Added in 246 */
        le64_t data_hash_chain_depth;
        le64_t field_hash_chain_depth;
};
```

The first 8 bytes of journal files must contain the ASCII characters `LPKSHHRH`.

If a writer finds that the **machine_id** of a file to write to does not match
the machine it is running on it should immediately rotate the file and start a
new one.

When a journal file is first created the **file_id** is randomly and uniquely
initialized.

When a writer opens a file it shall initialize the **boot_id** to the current
boot ID of the system.

The currently used part of the file is the **header_size** plus the
**arena_size** field of the header.
If a writer needs to write to a file where
the actual file size on disk is smaller than the reported value it shall
immediately rotate the file and start a new one. If a writer is asked to write
to a file with a header that is shorter than its own definition of the struct
Header, it shall immediately rotate the file and start a new one.

The **n_objects** field contains a counter for objects currently available in
this file. As objects are appended to the end of the file this counter is
increased.

The first object in the file starts immediately after the header. The last
object in the file is at the offset **tail_object_offset**, which may be 0 if
no object is in the file yet.

The **n_entries**, **n_data**, **n_fields**, **n_tags** and **n_entry_arrays** fields are
counters of the objects of the specific types.

**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number
(see below) of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_realtime** and **head_entry_realtime** contain the wallclock
timestamp of the last or first entry in the file, respectively, or 0 if no
entry has been written yet.

**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
file, referring to monotonic time of the boot identified by **boot_id**.

**data_hash_chain_depth** is a counter of the deepest chain in the data hash
table, minus one. It is updated whenever a chain is found that is longer than
the previous deepest chain found. Note that the counter is updated during hash
table lookups, as the chains are traversed. This counter is used to determine
when it is a good time to rotate the journal file, because hash collisions
have become too frequent.

Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
field hash table, minus one.


## Extensibility

The format is supposed to be extensible in order to enable future additions of
features. Readers should simply skip objects of unknown types as they read
them. If a compatible feature extension is made, a new bit is registered in the
header's **compatible_flags** field. If a feature extension is used that makes
the format incompatible, a new bit is registered in the header's
**incompatible_flags** field. Readers should check these two bit fields: if
they find a flag they don't understand in **compatible_flags** they should continue
to read the file, but if they find one in **incompatible_flags** they should
fail, asking for an update of the software. Writers should refuse writing if
there's an unknown bit flag in either of these fields.

The file header may be extended as new features are added. The size of the file
header is stored in the header. All header fields up to **n_data** are known to
unconditionally exist in all revisions of the file format; all fields starting
with **n_data** need to be explicitly checked for via a size check, since they
were additions after the initial release.

Currently only five extension flags are known:

```c
enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

enum {
        HEADER_COMPATIBLE_SEALED = 1 << 0,
};
```

HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
indicates that the file includes DATA objects that are compressed with the LZ4
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
objects compressed with ZSTD.

HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
hash function the keyed siphash24 hash function is used for the two hash
tables, see below.

HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
for Forward Secure Sealing.


## Dirty Detection

```c
enum {
        STATE_OFFLINE  = 0,
        STATE_ONLINE   = 1,
        STATE_ARCHIVED = 2,
        _STATE_MAX
};
```

If a file is opened for writing the **state** field should be set to
STATE_ONLINE. If a file is closed after writing the **state** field should be
set to STATE_OFFLINE. After a file has been rotated it should be set to
STATE_ARCHIVED. If a writer is asked to write to a file that is not in
STATE_OFFLINE it should immediately rotate the file and start a new one,
without changing the file.

Before and after the state field is changed, `fdatasync()` should be executed on
the file to ensure the dirty state hits disk.


## Sequence Numbers

All entries carry sequence numbers that are monotonically counted up for each
entry (starting at 1) and are unique among all files which carry the same
**seqnum_id** field. This field is randomly generated when the journal daemon
creates its first file. All files generated by the same journal daemon instance
should hence carry the same seqnum_id. This should guarantee a monotonic stream
of sequential numbers for easy interleaving even if entries are distributed
among several files, such as the system journal and many per-user journals.


## Concurrency

The file format is designed to be usable in a simultaneous
single-writer/multiple-reader scenario. The synchronization model is very weak
in order to facilitate storage on the most basic of file systems (well, the
most basic ones that provide us with `mmap()` that is), and to allow good
performance. No file locking is used.
The only time disk synchronization
via `fdatasync()` should be enforced is before and after changing the **state**
field in the file header (see above). It is recommended to execute a memory
barrier after appending and initializing new objects at the end of the file,
and before linking them up in the earlier objects.

This weak synchronization model means that it is crucial that readers verify
the structural integrity of the file as they read it and handle invalid
structure gracefully. (Checking what you read is a pretty good idea out of
security considerations anyway.) This specifically includes checking offset
values, and that they point to valid objects, with valid sizes and of the type
and hash value expected. All code must be written with the fact in mind that a
file with inconsistent structure might just be inconsistent temporarily, and
might become consistent later on. Payload OTOH requires less scrutiny, as it
should only be linked up (and hence visible to readers) after it was
successfully written to memory (though not necessarily to disk). On non-local
file systems it is a good idea to verify the payload hashes when reading, in
order to avoid annoyances with `mmap()` inconsistencies.

Clients intending to show a live view of the journal should use `inotify()` to
watch for file changes. Since file writes done via `mmap()` do not
result in `inotify()` events, writers shall truncate the file to its current size after
writing one or more entries, which results in inotify events being
generated. Note that this is not used as a transaction scheme (it doesn't
protect anything), but merely for triggering wakeups.

Note that inotify will not work on network file systems if reader and writer
reside on different hosts.
Readers which detect they are run on journal files
on a non-local file system should hence not rely on inotify for live views but
fall back to simple time-based polling of the files (maybe recheck every 2s).


## Objects

All objects carry a common header:

```c
enum {
        OBJECT_COMPRESSED_XZ   = 1 << 0,
        OBJECT_COMPRESSED_LZ4  = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

_packed_ struct ObjectHeader {
        uint8_t type;
        uint8_t flags;
        uint8_t reserved[6];
        le64_t size;
        uint8_t payload[];
};
```

The **type** field is one of the object types listed above. The **flags** field
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
OBJECT_COMPRESSED_* flags is set for an object then the matching
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
flag must be set for the file as well. At most one of these three bits may be
set. The **size** field encodes the size of the object including all its
headers and payload.


## Data Objects

```c
_packed_ struct DataObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t next_field_offset;
        le64_t entry_offset; /* the first array entry we store inline */
        le64_t entry_array_offset;
        le64_t n_entries;
        uint8_t payload[];
};
```

Data objects carry actual field data in the **payload[]** array, including a
field name, a `=` and the field data. Example:
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
this is the siphash24 hash value of the payload, keyed by the file ID as stored
in the **file_id** field of the file header.
If the flag is not set it is the
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
it makes the format more robust against attackers that want to trigger hash
collisions in the hash table.

**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
a hash collision happens (in a singly linked list, with an offset of 0
indicating the end). **next_field_offset** is used to link up data objects with
the same field name from the FIELD object of the field used.

**entry_offset** is an offset to the first ENTRY object referring to this DATA
object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with
offsets to other entries referencing this DATA object. Storing the offset to
the first ENTRY object in-line is an optimization given that many DATA objects
will be referenced from a single entry only (for example, `MESSAGE=` frequently
includes a practically unique string). **n_entries** is a counter of the total
number of ENTRY objects that reference this object, i.e. the sum of all
ENTRY_ARRAYs chained up from this object, plus 1.

The **payload[]** field contains the field name and data unencoded, unless
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
`ObjectHeader`, in which case the payload is compressed with the indicated
compression algorithm.


## Field Objects

```c
_packed_ struct FieldObject {
        ObjectHeader object;
        le64_t hash;
        le64_t next_hash_offset;
        le64_t head_data_offset;
        uint8_t payload[];
};
```

Field objects are used to enumerate all possible values a certain field name
can take in the entire journal file.

The **payload[]** array contains the actual field name, without '=' or any
field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of
the payload.
As for the DATA objects, this too is either the `.file_id` keyed
siphash24 hash of the payload, or the non-keyed Jenkins hash.

**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE
if a hash collision happens (in a singly linked list, with an offset of 0 indicating the
end). **head_data_offset** points to the first DATA object that shares this
field name. It is the head of a singly linked list using DATA's
**next_field_offset** offset.


## Entry Objects

```c
_packed_ struct EntryItem {
        le64_t object_offset;
        le64_t hash;
};

_packed_ struct EntryObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t realtime;
        le64_t monotonic;
        sd_id128_t boot_id;
        le64_t xor_hash;
        EntryItem items[];
};
```

An ENTRY object binds several DATA objects together into one log entry, and
includes other metadata such as various timestamps.

The **seqnum** field contains the sequence number of the entry, **realtime**
the realtime timestamp, and **monotonic** the monotonic timestamp for the boot
identified by **boot_id**.

The **xor_hash** field contains a binary XOR of the hashes of the payload of
all DATA objects referenced by this ENTRY. This value is usable to check the
contents of the entry, being independent of the order of the DATA objects in
the array. Note that even for files that have the
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus otherwise use the siphash24
hash function) the hash function used for this field, as a singular
exception, is the Jenkins lookup3 hash function. The XOR hash value is used to
quickly compare the contents of two entries, and to define a well-defined order
between two entries that otherwise have the same sequence numbers and
timestamps.

The **items[]** array contains references to all DATA objects of this entry,
plus their respective hashes (which are calculated the same way as in the DATA
objects, i.e. keyed by the file ID).

In the file, ENTRY objects are written ordered monotonically by sequence
number. For continuous parts of the file written during the same boot
(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo
wallclock time jumps (due to incorrect clocks being corrected) the realtime
timestamps are monotonic too.


## Hash Table Objects

```c
_packed_ struct HashItem {
        le64_t head_hash_offset;
        le64_t tail_hash_offset;
};

_packed_ struct HashTableObject {
        ObjectHeader object;
        HashItem items[];
};
```

The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects is
identical. They implement a simple hash table, with each cell containing
offsets to the head and tail of the singly linked list of the DATA and FIELD
objects, respectively. DATA's and FIELD's next_hash_offset fields are used to
chain up the objects. Empty cells have both offsets set to 0.

Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE
object. Their payload is directly referred to by the file header in the
**data_hash_table_offset**, **data_hash_table_size**,
**field_hash_table_offset** and **field_hash_table_size** fields. These offsets do
_not_ point to the object headers but directly to the payloads. When a new
journal file is created the two hash table objects need to be created right
away as the first two objects in the stream.

If the hash table fill level rises above a certain limit (learning
from Java's Hashtable, for example: > 75%), the writer should rotate the file
and create a new one.

The DATA_HASH_TABLE should be sized taking into account the maximum size the
file is expected to grow to, as configured by the administrator or disk space
considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
number of fields should be pretty static as it depends only on developers'
creativity rather than runtime parameters.


## Entry Array Objects


```c
_packed_ struct EntryArrayObject {
        ObjectHeader object;
        le64_t next_entry_array_offset;
        le64_t items[];
};
```

Entry Arrays are used to store a sorted array of offsets to entries. Entry
arrays are strictly sorted by offsets on disk, and hence by their timestamps
and sequence numbers (with some restrictions, see above).

Entry Arrays are chained up. If one entry array is full another one is
allocated and the **next_entry_array_offset** field of the old one is set to point to
it. An Entry Array with **next_entry_array_offset** set to 0 is the last in the
list. To optimize allocation and seeking, as entry arrays are appended to a
chain of entry arrays they should increase in size (double).

Due to being monotonically ordered, entry arrays may be searched with a binary
search (bisection).

One chain of entry arrays links up all entries written to the journal. The
first entry array is referenced in the **entry_array_offset** field of the
header.

Each DATA object also references an entry array chain listing all entries
referencing that specific DATA object. Since many DATA objects are only referenced
by a single ENTRY, the first offset of the list is stored inside the DATA object
itself; an ENTRY_ARRAY object is only needed if it is referenced by more than
one ENTRY.


## Tag Object

```c
#define TAG_LENGTH (256/8)

_packed_ struct TagObject {
        ObjectHeader object;
        le64_t seqnum;
        le64_t epoch;
        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
};
```

Tag objects are used to seal off the journal against alteration. At regular
intervals a tag object is appended to the file. The tag object consists of a
SHA-256 HMAC tag that is calculated from the objects stored in the file since
the last tag was written, or from the beginning if no tag was written yet. The
key for the HMAC is calculated via the externally maintained FSPRG logic for
the epoch that is written into **epoch**. The sequence number **seqnum** is
increased with each tag. When calculating the HMAC of objects, header fields
that are volatile are excluded (skipped). More specifically, all fields that
might validly be altered to maintain a consistent file structure (such as
offsets to objects added later for the purpose of linked lists and suchlike)
after an object has been written are not protected by the tag. This means a
verifier has to independently check these fields for consistency of
structure. For the fields excluded from the HMAC please consult the source code
directly. A verifier should read the file from the beginning to the end, always
calculating the HMAC for the objects it reads. Each time a tag object is
encountered the HMAC should be verified and restarted. The tag object sequence
numbers need to increase strictly monotonically. Tag objects themselves are
partially protected by the HMAC (i.e. seqnum and epoch are included, but the tag
itself is not).


## Algorithms

### Reading

Given an offset to an entry, all data fields are easily found by following the
offsets in the data item array of the entry.

Listing entries without a filter is done by traversing the list of entry arrays
starting with the header's **entry_array_offset** field.

Seeking to an entry by timestamp or sequence number (without any matches) is
done via binary search in the entry arrays starting with the header's
**entry_array_offset** field. Since these arrays double in size as more are
added, the time cost of seeking is O(log(n)*log(n)) where n is the number of
entries in the file.

When seeking or listing with one field match applied, the DATA object of the
match is first identified, and then its data entry array chain is traversed. The
time cost is the same as for seeks/listings with no match.

If multiple matches are applied, multiple chains of entry arrays should be
traversed in parallel. Since they are all strictly monotonically ordered by
offset of the entries, advancing in one can be directly applied to the others,
until an entry matching all matches is found. In the worst case seeking like
this is O(n) where n is the number of matching entries of the "loosest" match,
but in the common case it should be much more efficient, at least for the
well-known fields, where the set of possible field values tends to be closely
related. Checking whether an entry matches a number of matches is efficient
since the item array of the entry contains hashes of all data fields
referenced, and the number of data fields of an entry is generally small (<
30).

When interleaving multiple journal files, seeking tends to be a frequently used
operation, but in this case it can be effectively suppressed by caching results
from previous entries.

When listing all possible values a certain field can take it is sufficient to
look up the FIELD object and follow the chain of links to all DATA it includes.

### Writing

When an entry is appended to the journal, for each of its data fields the data
hash table should be checked. If the data field does not yet exist in the file,
it should be appended and added to the data hash table.
When a data field's data
object is added, the field hash table should be checked for the field name of
the data field, and a field object added if necessary. After all data fields
(and recursively all field names) of the new entry are appended and linked up
in the hash tables, the entry object should be appended and linked up too.

At regular intervals a tag object should be written if sealing is enabled (see
above). Before the file is closed a tag should be written too, to seal it off.

Before writing an object, time and disk space limits should be checked and
rotation triggered if necessary.


## Optimizing Disk IO

_A few general ideas to keep in mind:_

The hash tables for looking up fields and data should quickly end up in the memory
cache and not hurt performance. All entries and entry arrays are ordered
strictly by time on disk, and hence should expose an OK access pattern on
rotating media, when read sequentially (which should be the most common case,
given the nature of log data).

The disk access patterns of the binary search for entries needed for seeking
are problematic on rotating disks. This should not be a major issue though,
since seeking should not be a frequent operation.

When reading, collecting data fields for presenting entries to the user is
problematic on rotating disks. In order to optimize these patterns the item
array of entry objects should be sorted by disk offset before
writing. Effectively, frequently used data objects should end up in the memory
cache quickly. Non-frequently used data objects are likely to be located
between the previous and current entry when reading and hence should expose an
OK access pattern. Problematic are data objects that are neither frequently nor
infrequently referenced, which will cost seek time.

And that's all there is to it.

Thanks for your interest!