Thursday, February 4, 2010

Presentation Data (HPFS & Ext3)

HPFS and Ext3
High Performance File System
HPFS or High Performance File System is a file system created specifically for the OS/2 operating system to improve upon the limitations of the FAT file system. It was written by Gordon Letwin and others at Microsoft and added to OS/2 version 1.2, at that time still a joint undertaking of Microsoft and IBM.

The HPFS file system was first introduced with OS/2 1.2 to allow for greater access to the larger hard drives that were then appearing on the market. Additionally, it was necessary for a new file system to extend the naming system, organization, and security for the growing demands of the network server market. HPFS maintains the directory organization of FAT, but adds automatic sorting of the directory based on filenames. Filenames are extended to up to 254 double byte characters. HPFS also allows a file to be composed of "data" and special attributes to allow for increased flexibility in terms of supporting other naming conventions and security. In addition, the unit of allocation is changed from clusters to physical sectors (512 bytes), which reduce lost disk space.

Under HPFS, directory entries hold more information than under FAT. As well as the attribute file, this includes information about the modification, creation, and access date and times. Instead of pointing to the first cluster of the file, the directory entries under HPFS point to the FNODE. The FNODE can contain the file's data, or pointers that may point to the file's data or to other structures that will eventually point to the file's data.

HPFS attempts to allocate as much of a file in contiguous sectors as possible. This is done in order to increase speed when doing sequential processing of a file.

HPFS organizes a drive into a series of 8 MB bands, and whenever possible a file is contained within one of these bands. Between each of these bands are 2K allocation bitmaps, which keep track of which sectors within a band have and have not been allocated. Banding increases performance because the drive head does not have to return to the logical top (typically cylinder 0) of the disk, but to the nearest band allocation bitmap to determine where a file is to be stored.
Additionally, HPFS includes a couple of unique special data objects:
1. Super Block
The Super Block is located in logical sector 16 and contains a pointer to the FNODE of the root directory. One of the biggest dangers of using HPFS is that if the Super Block is lost or corrupted due to a bad sector, so are the contents of the partition, even if the rest of the drive is fine. It would be possible to recover the data on the drive by copying everything to another drive with a good sector 16 and rebuilding the Super Block. However, this is a very complex task.

2. Spare Block
The Spare Block is located in logical sector 17 and contains a table of "hot fixes" and the Spare Directory Block. Under HPFS, when a bad sector is detected, the "hot fixes" entry is used to logically point to an existing good sector in place of the bad sector. This technique for handling write errors is known as hot fixing. Hot fixing is a technique where if an error occurs because of a bad sector, the file system moves the information to a different sector and marks the original sector as bad. This is all done transparent to any applications that are performing disk I/O (that is, the application never knows that there were any problems with the hard drive). Using a file system that supports hot fixing will eliminate error messages such as the FAT "Abort, Retry, or Fail?"error message that occurs when a bad sector is encountered.

Note: The version of HPFS that is included with Windows NT does not support hot fixing.

Among its improvements are:
  • support for mixed case file names, in different code pages
  • support for long file names (256 characters as opposed to FAT's 8+3 characters)
  • more efficient use of disk space (files are not stored using multiple-sector clusters but on a per-sector basis)
  • an internal architecture that keeps related items close to each other on the disk volume
  • less fragmentation of data
  • extent-based space allocation
  • separate datestamps for last modification, last access, and creation (as opposed to FAT's one last modification datestamp)
  • a B+ tree structure for directories
  • root directory located at the mid-point, rather than beginning of the disk, for faster average access
  • HPFS also can keep 64 KB of metadata ("extended attributes") per file.

IBM offers two kind of IFS drivers for this file system:

  • the standard one with a cache limited to 2 MB
  • HPFS386 provided with the server versions of OS/2

Windows Native Support
Windows 95 and its successors Windows 98, Windows Me can read/write HPFS only when mapped via a network share, but cannot read it from a local disk. They listed the NTFS partitions of networked computers as "HPFS", because NTFS and HPFS share the same filesystem identification number in the partition table.

Windows NT 3.1 and 3.5 have native read/write support for local disks and can even be installed onto an HPFS partition. This is because NT was originally going to be a version of OS/2.

Windows NT 3.51 can also read and write from local HPFS formatted drives. However, Microsoft discouraged using HPFS in Windows NT 4 and in subsequent versions. Microsoft even removed the ability of NT 3.51 to format an HPFS file system. Starting with Windows NT 4 the file system driver pinball.sys enabling the read/write access is not included in a default installation anymore. Pinball.sys is included on the installation media for Windows 2000 and can be manually installed and used with some limitations. Later Windows versions do not ship with this driver.

Microsoft retained rights to OS/2 technologies, including the HPFS file system, after they ceased collaboration. Since Windows NT 3.1 was designed for more rigorous (enterprise-class) use than previous versions of Windows, it included support for HPFS (and NTFS) giving it a larger storage capacity than FAT file systems. However, since HPFS lacks a journal, any recovery after an unexpected shutdown or other error state takes progressively longer as the file system grows. A utility such as CHKDSK would need to scan each entry in the file system to ensure no errors are present, a problem which is vastly reduced on NTFS where the journal is simply replayed.

Advantages of HPFS

  • HPFS is best for drives in the 200-400 MB range.
  • Support for long file names upto 256 characters.
  • Upper and lower case- HPFS preserves case, but it is not case sensitive
  • Native support for EA’s FAT is just too fragile to support this and the workplace spell depends on it heavily.
  • HPFS provides high performance.
  • Much greater integrity: Signature at the beginning of the system structure sectors, forwards and backwards links in fnode trees.
  • Much less fragmentation.

Disadvantages of HPFS

  • Because of the overhead involved in HPFS, it is not a very efficient choice for a volume of under approximately 200 MB. In addition, with volumes larger than about 400 MB, there will be some performance degradation.
  • You cannot set security on HPFS under WindowsNT.
  • HPFS is only supported under Windows NT versions 3.1, 3.5, and 3.51. Windows NT 4.0 cannot access HPFS partitions.

Ext3

The ext3 or third extended file system is a journaled file system that is commonly used by the Linux kernel. It is the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extending ext2 in Journaling the Linux ext2fs File system in a 1998 paper and later in a February 1999 kernel mailing list posting, and the file system was merged with the mainline Linux kernel in November 2001 from 2.4.15 onward. Its main advantage over ext2 is journaling which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.

Journaling results in massively reduced time spent recovering a file system after a crash, and is therefore in high demand in environments where high availability is important, not only to improve recovery times on single machines but also to allow a crashed machine's file system to be recovered on another machine when we have a cluster of nodes with a shared disk.

Advantages
Although its performance (speed) is less attractive than competing Linux file systems such as JFS, ReiserFS and XFS, it has a significant advantage in that it allows in-place upgrades from the ext2 file system without having to back up and restore data. Ext3 also uses less CPU power than ReiserFS and XFS. It is also considered safer than the other Linux file systems due to its relative simplicity and wider testing base.

The ext3 file system adds, over its predecessor:

  • A Journaling file system
  • Online file system growth
  • Htree indexing for larger directories. An HTree is a specialized version of a B-tree (not to be confused with the H tree fractal).

Without these, any ext3 file system is also a valid ext2 file system. This has allowed well-tested and mature file system maintenance utilities for maintaining and repairing ext2 file systems to also be used with ext3 without major changes. The ext2 and ext3 file systems share the same standard set of utilities, e2fsprogs, which includes a fsck tool. The close relationship also makes conversion between the two file systems (both forward to ext3 and backward to ext2) straightforward.

While in some contexts the lack of "modern" file system features such as dynamic inode allocation and extents could be considered a disadvantage, in terms of recoverability this gives ext3 a significant advantage over file systems with those features. The file system metadata is all in fixed, well-known locations, and there is some redundancy inherent in the data structures that may allow ext2 and ext3 to be recoverable in the face of significant data corruption, where tree-based file systems may not be recoverable.

What is a Journaling File system?

A journaling file system keeps a journal or log of the changes that are being made to the file system during disk writing that can be used to rapidly reconstruct corruptions that may occur due to events such a system crash or power outage. The level of journaling performed by the file system can be configured to provide a number of levels of logging depending on your needs and performance requirements.


What are the Advantages of a Journaling File system?

There are a number of advantages to using a journaling files system:


Both the size and volume of data stored on disk drives has grown exponentially over the years. The problem with a non-journaled file system is that following a crash the fsck (file system consistency check) utility has to be run. fsck will scan the entire file system validating all entries and making sure that blocks are allocated and referenced correctly. If it finds a corrupt entry it will attempt to fix the problem. The issues here are two-fold. Firstly, the fsck utility will not always be able to repair damage and you will end up with data in the lost+found directory. This is data that was being used by an application but the system no longer knows where they were reference from. The other problem is the issue of time. It can take a very long time to complete the fsck process on a large file system leading to unacceptable down time.

A journaled file system records information in a log area on a disk (the journal and log do not need to be on the same device) during each write. This is a essentially an "intent to commit" data to the file system. The amount of information logged is configurable and ranges from not logging anything, to logging what is known as the "metadata" (i.e ownership, date stamp information etc), to logging the "metadata" and the data blocks that are to be written to the file. Once the log is updated the system then writes the actual data to the appropriate areas of the file system and marks an entry in the log to say the data is committed.

After a crash the file system can very quickly be brought back on-line using the journal log reducing what could take minutes using fsck to seconds with the added advantage that there is considerably less chance of data loss or corruption.


What is a Journal Checkpoint?

When a file is accessed on the filesystem, the last snapshot of that file is read from the disk into memory. The journal log is then consulted to see if any uncommitted changes have been made to the file since the data was last written to the file (essentially looking for an "intention to commit" in the log entry as described above). At particular points the filesystem will update file data on the disk from the uncommited log entries and trim those entries from the log. Committing operations from the log and synchronizing the log and its associated filesystem is called a checkpoint.

What are the disadvantages of a Journaled Filesystem?

Nothing in life is is free and ext3 and journaled filesystems are no exception to the rule. The biggest draw back of journaling is in the area of performance simply because more disk writes are required to store information in the log. In practice, however, unless you are running system where disk performance is absolutely critical the performance difference will be negligable.

What Journaling Options are Available with the ext3 filesystem?

The ext3 file system provides three options. These are as follows:

Journal (lowest risk)

Both metadata and file contents are written to the journal before being committed to the main file system. Because the journal is relatively continuous on disk, this can improve performance in some circumstances. In other cases, performance gets worse because the data must be written twice - once to the journal, and once to the main part of the file system.

Ordered (medium risk)

Only metadata is journaled; file contents are not, but it's guaranteed that file contents are written to disk before associated metadata is marked as committed in the journal. This is the default on many Linux distributions. If there is a power outage or kernel panic while a file is being written or appended to, the journal will indicate the new file or appended data has not been "committed", so it will be purged by the cleanup process. (Thus appends and new files have the same level of integrity protection as the "journaled" level.) However, files being overwritten can be corrupted because the original version of the file is not stored. Thus it's possible to end up with a file in an intermediate state between new and old, without enough information to restore either one or the other (the new data never made it to disk completely, and the old data is not stored anywhere). Even worse, the intermediate state might intersperse old and new data, because the order of the write is left up to the disk's hardware. XFS uses this form of journaling.

Writeback (highest risk)

Only metadata is journaled; file contents are not. The contents might be written before or after the journal is updated. As a result, files modified right before a crash can become corrupted. For example, a file being appended to may be marked in the journal as being larger than it actually is, causing garbage at the end. Older versions of files could also appear unexpectedly after a journal recovery. The lack of synchronization between data and journal is faster in many cases. JFS uses this level of journaling, but ensures that any "garbage" due to unwritten data is zeroed out on reboot.

Does the Journal log have to be on the same disk as the file system?

No, the ext3 journal log does not have to be on the same physical device as the file system it is logging. On a Red Hat Linux the journal device can be specified using the journal_device= option with the -journal-options command line argument of the tune2fs utility.

Features of ext3

The ext3 file system is essentially an enhanced version of the ext2 file system. These improvements provide the following advantages:

Availability

After an unexpected power failure or system crash (also called an unclean system shutdown), each mounted ext2 file system on the machine must be checked for consistency by the e2fsck program. This is a time-consuming process that can delay system boot time significantly, especially with large volumes containing a large number of files. During this time, any data on the volumes is unreachable.

The journaling provided by the ext3 file system means that this sort of file system check is no longer necessary after an unclean system shutdown. The only time a consistency check occurs using ext3 is in certain rare hardware failure cases, such as hard drive failures. The time to recover an ext3 file system after an unclean system shutdown does not depend on the size of the file system or the number of files; rather, it depends on the size of the journal used to maintain consistency. The default journal size takes about a second to recover, depending on the speed of the hardware.

Data Integrity

The ext3 file system provides stronger data integrity in the event that an unclean system shutdown occurs. The ext3 file system allows you to choose the type and level of protection that your data receives. By default, Red Hat Linux 8.0 configures ext3 volumes to keep a high level of data consistency with regard to the state of the file system.

Speed

Despite writing some data more than once, ext3 has a higher throughput in most cases than ext2 because ext3's journaling optimizes hard drive head motion. You can choose from three journaling modes to optimize speed, but doing so means trade offs in regards to data integrity.

Easy Transition

It is easy to change from ext2 to ext3 and gain the benefits of a robust journaling file system without reformatting.

Why ext3?
Ext3 is forward and backward compatible with ext2, allowing users to keep existing file systems while very simply adding journaling capability. Any user who wishes to un-journal a file system can do so easily (not that we expect many to do so...). Furthermore, an ext3 file system can be mounted as ext2 without even removing the journal, as long as a recent version of e2fsprogs (such as the one included in Red Hat Linux 7.2) is installed.
Ext3 benefits from the long history of fixes and enhancements to the ext2 file system, and will continue to do so. This means that ext3 shares ext2's well-known robustness, but also that as new features are added to ext2, they can be carried over to ext3 with little difficulty. When, for example, extended attributes or HTrees are added to ext2, it will be relatively easy to add them to ext3. (The extended attributes feature will enable things like access control lists; HTrees make directory operations extremely fast and highly scalable to very large directories.)
Ext3, like ext2, has a multi-vendor team of developers who develop it and understand it well; its development does not depend on any one person or organization.
Ext3 provides and makes use of a generic journaling layer (jbd) which can be used in other contexts. ext3 can journal not only within the file system, but also to other devices, so as NVRAM devices become available and supported under Linux, ext3 will be able to support them.
Ext3 has multiple journaling modes. It can journal all file data and metadata (data=journal), or it can journal metadata but not file data (data=ordered or data=writeback). When not journaling file data, you can choose to write file system data before metadata (data=ordered; causes all metadata to point to valid data), or not to handle file data specially at all (data=writeback; file system will be consistent, but old data may appear in files after an unclean system shutdown). This gives the administrator the power to make the tradeoff between speed and file data consistency, and to tune speed for specialized usage patterns.
Ext3 has broad cross-platform compatibility, working on 32- and 64- bit architectures, and on both little-endian and big-endian systems. Any system (currently including many Unix clones and variants, BeOS, and Windows) capable of accessing files on an ext2 file system will also be able to access files on an ext3 file system.
Ext3 does not require extensive core kernel changes and requires no new system calls, thus presenting Linus Torvalds no challenges that would effecitvely prevent him from integrating ext3 into his official Linux kernel releases. Ext3 is already integrated into Alan Cox's -ac kernels, slated for migration to Linus's official kernel soon.
The e2fsck file system recovery program has a long and proven track record of successful data recovery when software or hardware faults corrupt a file system. ext3 uses this same e2fsck code for salvaging the file system after such corruption, and therefore it has the same robustness against catastrophic data loss as ext2 in the presence of data-corruption faults.

Size limits
Ext3 has a maximum size for both individual files and the entire filesystem. For the filesystem as a whole that limit is 232 blocks. Both limits are dependent on the block size of the filesystem; the following chart summarizes the limits:

Block size
Max file size
Max filesystem size





1 KB
16 GB
2 TB

2 KB
256 GB
8 TB

4 KB
2 TB
16 TB

8 KB
2 TB
32 TB



Disadvantages
Functionality
Since ext3 aims to be backwards compatible with the earlier ext2, many of the on-disk structures are similar to those of ext2. Because of that, ext3 lacks a number of features of more recent designs, such as extents, dynamic allocation of inodes, and block sub allocation. There is a limit of 31998 sub-directories per one directory, stemming from its limit of 32000 links per inode.

ext3, like most current Linux filesystems, cannot be fsck-ed while the filesystem is mounted for writing. Attempting to check a file system that is already mounted may detect bogus errors where changed data has not reached the disk yet, and corrupt the file system in an attempt to "fix" these errors.

Defragmentation
There is no online ext3 defragmentation tool that works on the filesystem level. An offline ext2 defragmenter, e2defrag, exists but requires that the ext3 filesystem be converted back to ext2 first. But depending on the feature bits turned on in the filesystem, e2defrag may destroy data; it does not know how to treat many of the newer ext3 features.

There are userspace defragmentation tools like Shake and defrag. Shake works by allocating space for the whole file as one operation, which will generally cause the allocator to find contiguous disk space. It also tries to write files used at the same time next to each other. Defrag works by copying each file over itself. However they only work if the filesystem is reasonably empty. A true defragmentation tool does not exist for ext3.

That being said, as the Linux System Administrator Guide states, "Modern Linux filesystem(s) keep fragmentation at a minimum by keeping all blocks in a file close together, even if they can't be stored in consecutive sectors. Some filesystems, like ext3, effectively allocate the free block that is nearest to other blocks in a file. Therefore it is not necessary to worry about fragmentation in a Linux system."

While ext3 is more resistant to file fragmentation than the FAT filesystem, nonetheless ext3 filesystems can get fragmented over time or on specific usage patterns, like slowly-writing large files. Consequently the successor to the ext3 filesystem, ext4, includes a filesystem defragmentation utility and support for extents (contiguous file regions).

Recovery
There is no support of deleted file recovery in file system design. Ext3 driver actively deletes files by wiping file inodes for crash safety reasons. That's why accidental 'rm -rf ...' may cause permanent data loss.

There are still several techniques and some commercial software like UFS Explorer Standard Recovery version 4 for recovery of deleted or lost files using file system journal analysis; however, they do not guarantee any specific file recovery.

There is no chance of file recovery after file system format.

Compression
Support for transparent compression is available as an unofficial patch for ext3. This patch is a direct port of e2compr and still needs further development, it compiles and boots well with upstream kernels but journaling is not implemented yet. The current patch is named e3compr.

No checksumming in journal
Ext3 does not do checksumming when writing to the journal. If barrier=1 is not enabled as a mount option (in /etc/fstab), and if the hardware is doing out-of-order write caching, one runs the risk of severe filesystem corruption during a crash.

Consider the following scenario: If hard disk writes are done out-of-order (due to modern hard disks caching writes in order to amortize write speeds), it is likely that one will write a commit block of a transaction before the other relevant blocks are written. If a power failure or unrecoverable crash should occur before the other blocks get written, the system will have to be rebooted. Upon reboot, the file system will replay the log as normal, and replay the "winners" (transactions with a commit block, including the invalid transaction above which happened to be tagged with a valid commit block). The unfinished disk write above will thus proceed, but using corrupt journal data. The file system will thus mistakenly overwrite normal data with corrupt data while replaying the journal. There is a test program available to trigger the problematic behavior. If checksums had been used, where the blocks of the "fake winner" transaction were tagged with a mutual checksum, the file system could have known better and not replayed the corrupt data onto the disk. Journal checksumming has been added to EXT4.

EXT3 distribution

The EXT3 filesystem patch distributions and design papers are available from ftp://ftp.kernel.org/pub/linux/kernel/people/sct/ext3

Alternately, these materials are available from ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/

The EXT3 author and maintainer, Stephen Tweedie, may be reached at sct@redhat.com

No comments:

Post a Comment