The latest version of the Minecraft game uses a new, more efficient (but still poor) format for storing a world's chunk data, based on work by Scaevolous.
Background & Reasoning
Prior to this system, each chunk (a 16x16x128 area of a level) was contained within its own file on disk, two folders down. The first folder represented the X region, and the second the Z region. This (very obviously) has several issues associated with it, mainly:
- Excessive filesystem activity,
- Heavy fragmentation,
- Unneeded filesystem overhead (a 90-byte compressed chunk of mostly dirt and air will still consume at a minimum one sector, generally more),
- Generates large enough directory trees on large worlds to crash the Windows Explorer,
- A large multiplayer world would eventually run out of file handles (by default, a single process can open only 1024), causing the server to stall, crash, or otherwise fail (the particular behaviour has changed between versions).
To improve performance, Scaevolous created an unofficial mod called McRegion. This mod groups chunks into regions, each of which contains a 32 by 32 group of chunks. In theory, this improves a few things:
- Reduces filesystem activity & overhead considerably,
- Improves fragmentation (not directly, but the OS or underlying drive can intelligently improve it),
- Far, far fewer handles need to be open at any given time, allowing more of the level to be loaded at any one time.
- Reduces the number of files created considerably (no more crashing Explorer)
To define some terminology,
|Chunk||A single section of the level that is 16 by 16 blocks, and 128 blocks high.|
|Region||A single grouping of chunks in a 32 by 32 area|
|Level||A (realistically, but not technically) unlimited collection of chunks stored in regions that make up a single playable world.|
In the official client and server, each region is stored within the folder region within the level folder. The naming scheme for region files is very simple. For example, in a filename built from the parts r, 8, and 20:
- r is a meaningless prefix found on all region files,
- 8 is the X coordinate of the region, and
- 20 is the Z coordinate of the region.
The region's X and Z can be found simply by dividing the chunk's X and Z by 32, and then flooring the result. Here's an example in Python, where given a chunk at <81, -39>, the region coordinates can be found:
>>> # Floor division (//) divides and floors in one step, including for negatives.
>>> region_xz = lambda x, z: (x // 32, z // 32)
>>> region_xz(81, -39)
(2, -2)
So, this chunk would reside in the region with an X of 2 and a Z of -2.
Every region file begins with two 4KiB tables (each composed of 1024 4-byte integers), with the first table containing the location of each chunk, and the second table the last-modified timestamp of that chunk.
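As a rough illustration of that layout, the sketch below splits the first 8KiB of a region file into the two tables. The function name split_header is made up for this example, and the big-endian byte order is an assumption that the text above does not state.

```python
import struct

SECTOR_SIZE = 4096  # each table occupies 4 KiB
ENTRIES = 1024      # 32 * 32 chunks per region


def split_header(header):
    """Split the first 8 KiB of a region file into (locations, timestamps)."""
    if len(header) < 2 * SECTOR_SIZE:
        raise ValueError("region header must be at least 8192 bytes")
    # Assumption: entries are big-endian unsigned 32-bit integers.
    fmt = ">%dI" % ENTRIES
    locations = struct.unpack(fmt, header[:SECTOR_SIZE])
    timestamps = struct.unpack(fmt, header[SECTOR_SIZE:2 * SECTOR_SIZE])
    return locations, timestamps
```

Each returned tuple has 1024 entries, one per chunk slot in the region.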
The location table is composed of 1024 entries, each 4 bytes long. The first three bytes of an entry give the offset in the file where the chunk may be found, and the last byte gives the size of the chunk, both counted in 4096-byte sectors. Multiplying each by 4096 gives the exact start of the chunk in bytes and its length in bytes. If you know the X and Z of the chunk you're looking for, you can find the byte position of its location entry using the formula ((x % 32) + (z % 32) * 32) * 4, where % is a modulo that never returns a negative result. If the offset and size are both 0, then the chunk at that location hasn't been generated yet.
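The lookup formula can be sketched in Python as below. The function name location_entry_offset is hypothetical, chosen only for this example. Python's % already yields a non-negative result for a positive modulus, so negative chunk coordinates work without extra handling. In languages where % can return a negative value, an adjustment would be needed.

```python
def location_entry_offset(chunk_x, chunk_z):
    """Byte offset of a chunk's entry within the location table."""
    # chunk_x % 32 and chunk_z % 32 are the chunk's position inside its
    # region; each entry is 4 bytes wide, hence the final multiplication.
    return ((chunk_x % 32) + (chunk_z % 32) * 32) * 4
```

For the chunk at <81, -39>, the local position is <17, 25>, so the entry starts at byte (17 + 25 * 32) * 4 = 3268 of the region file.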
As an example, this is the location entry for the first chunk in a region:
|Offset (3 bytes)||Size (1 byte)|
|On Disk (in hex)||
Here's an example in Python to decode a location entry:
def chunk_location(l):
    """
    Returns the offset (in bytes) and size (in bytes) of the chunk for
    the given location entry, read as a big-endian 32-bit integer.
    """
    offset = (l >> 8) & 0xFFFFFF  # upper three bytes: offset in 4096-byte sectors
    size = l & 0xFF               # lowest byte: size in 4096-byte sectors
    return (offset * 4096, size * 4096)
The timestamp table is composed of 1024 timestamps, each a 4-byte integer. Each is the time that the chunk was last modified, and the table is in the same order as the location table. Thus, the chunk whose location is at a given index in the location table has its timestamp at the same index in the timestamp table.
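Because the two tables share the same ordering, a timestamp lookup only needs the chunk's index plus the 4096 bytes taken up by the location table. The sketch below assumes big-endian integers, which the text does not state, and uses a made-up function name, chunk_timestamp. It returns the raw integer without interpreting its unit.

```python
import struct


def chunk_timestamp(header, chunk_x, chunk_z):
    """Raw last-modified value for a chunk, read from the 8 KiB region header."""
    index = (chunk_x % 32) + (chunk_z % 32) * 32
    # The timestamp table starts right after the 4096-byte location table.
    (timestamp,) = struct.unpack_from(">I", header, 4096 + index * 4)
    return timestamp
```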
Each chunk has an additional 5-byte header, followed by the actual chunk which is stored as NBT.
|Length (in bytes)||Compression Scheme|
|On disk (in hex)||
There are two possible values for the compression scheme. If it is a 1, the following chunk is compressed using gzip. If it's a 2, the following chunk is compressed with zlib. In practice, you will only ever encounter chunks compressed using zlib.
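A minimal sketch of decoding one chunk record follows. It rests on two assumptions that the text above does not confirm: the header is big-endian, and the length field counts the compression byte along with the compressed data. The function name read_chunk_payload is hypothetical. It returns the decompressed bytes, which would still need to be parsed as NBT.

```python
import gzip
import struct
import zlib


def read_chunk_payload(data):
    """Decode one chunk record: 5-byte header followed by compressed bytes."""
    length, scheme = struct.unpack_from(">IB", data, 0)
    # Assumption: `length` includes the 1-byte compression scheme, so the
    # compressed data is the (length - 1) bytes after the header.
    compressed = data[5:5 + length - 1]
    if scheme == 1:
        return gzip.decompress(compressed)
    if scheme == 2:
        return zlib.decompress(compressed)
    raise ValueError("unknown compression scheme: %d" % scheme)
```

Here data is expected to begin at the chunk's starting byte, as given by its location entry. Any padding after the compressed data is ignored.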