File Distribution by Size
Recently, while backing up my files to a 4 TB HDD, I was shocked to discover that after backing up 1.6 TB of data, I had only 1 TB left on the disk.
It turns out that I had formatted the disk with the exFAT file system, and Microsoft Windows had automatically assigned an Allocation Unit size of 256 KB, i.e. every file is allocated a minimum of 512 sectors of 512 bytes each. Since most of my files were small, this resulted in a huge waste of space on the backup disk.
I wanted to analyze the files on my disk by factors like file size, file type, last-modified date etc. While Windows does not come with a built-in tool for this, there are a number of free tools available, such as WinDirStat. I, however, wanted greater control over the process and opted to use a combination of the ForFiles command and a spreadsheet program to get the desired output.
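To make the rounding concrete, here is a minimal sketch of how much space one file takes with a given Allocation Unit. The function name `allocated_bytes` is made up for illustration, and the sketch ignores directory entries and other file-system metadata.

```python
SECTOR_BYTES = 512
SECTORS_PER_UNIT = 512
ALLOCATION_UNIT = SECTOR_BYTES * SECTORS_PER_UNIT  # 262,144 bytes = 256 KB


def allocated_bytes(file_size, unit=ALLOCATION_UNIT):
    """Round a file size up to a whole number of allocation units.

    Simplification: only file data is counted, so a zero-byte file
    comes out as 0 here.
    """
    return (file_size + unit - 1) // unit * unit


# A 3 KB document still takes one full 256 KB unit.
size = 3 * 1024
print(allocated_bytes(size))         # 262144
print(allocated_bytes(size) - size)  # 259072 bytes of slack
```

The difference between the allocated size and the real size is the slack that gets lost for each file.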
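The same raw data can also be collected with a short script. The sketch below is not the ForFiles approach used here but a rough Python equivalent, assuming all that is needed is the path, size and last-modified time of every file; `collect_file_sizes` is a hypothetical helper name.

```python
import os


def collect_file_sizes(root):
    """Walk `root` and return a list of (path, size_in_bytes, mtime) tuples."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # do not follow symbolic links
            except OSError:
                continue  # skip files that vanish or cannot be read
            records.append((path, info.st_size, info.st_mtime))
    return records
```

Calling it on a drive root returns one tuple per file, which can then be written to a CSV file and opened in a spreadsheet.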
The statistics below represent my laptop’s C drive, which is the single partition on the 1 TB drive that the computer came with. The files on the computer are very diverse, ranging from just a few kilobytes (like documents) to several gigabytes (like ISO images).
|Largest File (GB)|16|
The distribution of the files by sizes – How to read this table:
The table is presented as the number of files in each range. For example, there are 139,440 files that are greater than 1 KB, but less than or equal to 10 KB. Similarly, there are 176,398 files that are greater than 10 KB but less than or equal to 10 MB.
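The "greater than the lower bound, less than or equal to the upper bound" counting can be sketched as follows. The bucket boundaries and labels are assumptions chosen for illustration (powers of ten, with 1 KB taken as 1,024 bytes), not the exact ranges of my spreadsheet.

```python
from bisect import bisect_left

KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Inclusive upper bounds of each range (illustrative choice).
UPPER_BOUNDS = [1 * KB, 10 * KB, 100 * KB, 1 * MB, 10 * MB, 100 * MB, 1 * GB]
LABELS = [
    "<= 1 KB", "1 KB - 10 KB", "10 KB - 100 KB", "100 KB - 1 MB",
    "1 MB - 10 MB", "10 MB - 100 MB", "100 MB - 1 GB", "> 1 GB",
]


def count_by_range(sizes):
    """Count files per range; each range is (lower, upper], upper inclusive."""
    counts = [0] * len(LABELS)
    for size in sizes:
        # bisect_left finds the first bound that is >= size, so a file of
        # exactly 10 KB lands in "1 KB - 10 KB", not in the next range.
        counts[bisect_left(UPPER_BOUNDS, size)] += 1
    return dict(zip(LABELS, counts))
```

Wider ranges, such as "greater than 10 KB but less than or equal to 10 MB", are then just sums of adjacent buckets.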
Plotting these numbers graphically:
In the graph above, if you look at the 10x range, you will notice that a considerable share of the files (nearly 280 K of the 467 K total) are less than 10 KB. In fact, almost 140 K files are <= 1 KB. If you look at the 10Kx range, you will see that 350 K of the 467 K files are less than 100 KB in size. You will need to look at the 100x range to make out that about 80 K files are in the 100 MB range and a nearly invisible proportion is bigger than 1 GB.
In the graph above, each column represents the total files present on the disk and the colour segments represent the ranges. For example, looking at the 10x range, it is evident that files from 0 KB – 10 KB form 60% of the total files on the system. Looking at the 1Mx range, it's easy to see that more than 90% of the files on the system are in the range of 0 KB – 10 MB.
The same graph plotted as an area chart instead of a stacked bar graph clearly shows the distribution curve. For example, the 100Kx point clearly illustrates that nearly 90% of the files are less than 1 MB in size (cyan area).
If I were to interpret this data for an exFAT file system with a 256 KB Allocation Unit size, I can easily determine that approx. 373,423 files are less than 100 KB in size, but they will each use a 256 KB Allocation Unit, thus wasting well over half of the space allocated to them. In fact, the actual loss will be more severe, as almost 300,000 files are actually less than 10 KB in size.
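As a rough check of that estimate, the sketch below works out a lower bound on the loss for these 373,423 files. It assumes 1 KB means 1,024 bytes, that every one of these files is non-empty and occupies exactly one Allocation Unit, and it takes the worst case for the estimate, namely that each file is a full 100 KB.

```python
FILES_UNDER_100_KB = 373_423  # count taken from the statistics above
UNIT_KB = 256                 # exFAT Allocation Unit chosen by Windows
MAX_FILE_KB = 100             # every one of these files is smaller than this

# Each file occupies exactly one unit, because 100 KB < 256 KB.
allocated_kb = FILES_UNDER_100_KB * UNIT_KB

# Even if every file were a full 100 KB, this much would still be unused.
min_slack_kb = FILES_UNDER_100_KB * (UNIT_KB - MAX_FILE_KB)
min_slack_fraction = (UNIT_KB - MAX_FILE_KB) / UNIT_KB

KB_PER_GB = 1024 * 1024
print(f"allocated: {allocated_kb / KB_PER_GB:.1f} GB")
print(f"slack:     at least {min_slack_kb / KB_PER_GB:.1f} GB")
print(f"fraction:  at least {min_slack_fraction:.1%}")
```

Under these assumptions, roughly 91 GB is allocated to these files and at least about 55 GB of it (about 61%) holds no data. Because most of the files are far smaller than 100 KB, the real figure is higher.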
Large allocation units result in optimum performance when transferring large files with contiguous data (such as audio-video files, large archives, database dumps etc.). However, in this case, smaller Allocation Units of 4 KB might be enough. In fact, I may even consider going as low as 1 KB Allocation Units.
Interestingly, while Microsoft Windows defaults to formatting the 1 TB HDD with 4 KB Allocation Units in NTFS, it uses 256 KB for exFAT, thus resulting in a severe penalty in terms of usable disk space, seek times, and overall transfer speeds. It almost looks like a deliberate attempt to show the exFAT file system in a bad light.
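One way to weigh these options is to total the unused space for each candidate Allocation Unit size over a list of real file sizes. The following is a sketch only: `slack_by_unit` is a hypothetical helper, it counts file data alone and ignores file-system metadata, and it says nothing about transfer speed.

```python
def slack_by_unit(sizes, units=(1024, 4096, 262144)):
    """Return {unit_size: total unused bytes} for a list of file sizes."""
    data_bytes = sum(sizes)
    report = {}
    for unit in units:
        # Round every file up to whole units, then subtract the real data.
        on_disk = sum((size + unit - 1) // unit * unit for size in sizes)
        report[unit] = on_disk - data_bytes
    return report


# Made-up sample: three small files and one large one.
sample_sizes = [500, 3_000, 9_000, 700_000_000]
for unit, slack in slack_by_unit(sample_sizes).items():
    print(f"{unit // 1024:>3} KB units: {slack:,} bytes unused")
```

Feeding in the actual list of file sizes from the disk shows how much each choice would cost before the disk is formatted.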
So the next time you format a disk (or a pen-drive), consider the sizes of the files that you may store on it and then decide upon the Allocation Unit size to get optimum transfer speeds. In the event Microsoft Windows does not offer satisfactory controls on the disk formatting process, consider using another tool from the list here.