Filling a file with zeros
Saturday, September 4, 2010
In this blog post I’ll demonstrate a few ways to fill a file with zeros in Factor. The goal is to write some number of bytes to a file in the least amount of time while using only a small amount of RAM; writing a large file should not fail.
Filling a file with zeros by seeking
The best way to write a file full of zeros is to seek to one byte before the desired end of the file, write a single zero byte, and close the file. Here’s the code:
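USING: combinators io io.directories io.encodings.binary
io.files kernel math ;

! Seek to offset n-1 and write one zero byte, giving a file of n bytes.
: (zero-file) ( n path -- )
    binary
    [ 1 - seek-absolute seek-output 0 write1 ] with-file-writer ;

ERROR: invalid-file-size n path ;

: zero-file ( n path -- )
    {
        { [ over 0 < ] [ invalid-file-size ] }
        { [ over 0 = ] [ nip touch-file ] }
        [ (zero-file) ]
    } cond ;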
The first thing you’ll notice about zero-file is that we special-case negative and zero file sizes. Special-casing a zero file length is necessary to avoid seeking to -1, which does everything correctly but throws an error in the process instead of returning normally. Special-casing negative file sizes is important because a negative size is always an error; without the check the operation would still fail overall, but the file system could become littered with zero-length files that are created before the exception is thrown.
To call the new word:
IN: scratchpad 123,456,789 "/Users/erg/zeros.bin" zero-file
"/Users/erg/zeros.bin" file-info size>> .
123456789
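As a quick sanity check, the sketch below creates a small file with zero-file as defined above and reads it back to confirm that the bytes skipped by the seek really are zeros. The file name zeros-small.bin is a made-up example, and the vocabulary names in the USING: line are an assumption about where these words live in your Factor version.

```factor
USING: io.encodings.binary io.files kernel math prettyprint
sequences ;

! Create a 1000-byte file with the seeking approach.
1000 "zeros-small.bin" zero-file

! Read the whole file back as a byte-array, then print its
! length and whether every byte is zero.
"zeros-small.bin" binary file-contents
[ length . ] [ [ zero? ] all? . ] bi
```

If the seek leaves zeros behind as expected, this should print 1000 followed by t. Because it reads the whole file into memory, it is only suitable for small test files.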
Copying a zero-stream
With Factor’s stream protocol, you can write new kinds of streams that, when read from or written to, do whatever you want. I wrote a read-only zero-stream below that returns zeros whenever you read from it. By wrapping a limit-stream around it, you can give the inexhaustible zero-stream an artificial length, so that copying it reaches an end and terminates.
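USING: byte-arrays destructors io io.encodings.binary io.files
io.streams.limited kernel locals ;

TUPLE: zero-stream ;

C: <zero-stream> zero-stream

! Every read returns zeros; the stream itself never ends.
M: zero-stream stream-read drop <byte-array> ;
M: zero-stream stream-read1 drop 0 ;
M: zero-stream stream-read-partial stream-read ;
M: zero-stream dispose drop ;

:: zero-file2 ( n path -- )
    <zero-stream> n limit-stream
    path binary <file-writer> stream-copy ;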
The drawback to this approach is that it allocates 8 KB byte arrays in memory only to write them to disk immediately.
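To see the two pieces separately, here is a small listener sketch, assuming the zero-stream definitions above are loaded and that your Factor version still lets you define stream-read methods directly as this post does. The file name is hypothetical.

```factor
USING: accessors io io.files.info prettyprint ;

! Reading from the bare stream returns as many zeros as requested.
4 <zero-stream> stream-read .     ! a 4-byte array of zeros
<zero-stream> stream-read1 .      ! a single zero byte

! With the limit in place the copy terminates, and the file
! should end up exactly as long as the limit.
100 "zeros2-small.bin" zero-file2
"zeros2-small.bin" file-info size>> .
```

Without the limit-stream wrapper the copy would never reach an end, since the zero-stream itself has no end-of-stream condition.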
Setting the contents of a file directly
Using the set-file-contents word, you can just assign a file’s contents to be a sequence. However, this sequence has to fit into memory, so this solution is not as good for our use case.
:: zero-file3 ( n path -- )
    n <byte-array> path binary set-file-contents ;
Bonus: writing random data to a file
The canonical way of copying random data to a file on Unix systems is to use the dd tool to read from /dev/urandom and write to a file. But what about Windows, where there is no /dev/urandom? We can come up with a cross-platform solution that uses method number two from above, but with a random-stream instead of a zero-stream. But then what about efficiency? Well, it turns out that Factor’s Mersenne Twister implementation generates random numbers faster than /dev/urandom on my MacBook: in the timings below, writing a 100 MB file from /dev/urandom with dd takes about three times as long as the Factor-only solution. So not only is the Factor solution cross-platform, it’s also faster.
USING: destructors io io.encodings.binary io.files
io.streams.limited kernel locals random ;

TUPLE: random-stream ;

C: <random-stream> random-stream

M: random-stream stream-read drop random-bytes ;
M: random-stream stream-read1 drop 256 random ;
M: random-stream stream-read-partial stream-read ;
M: random-stream dispose drop ;

:: stream-copy-n ( from to n -- )
    from n limit-stream to stream-copy ;

! stream-copy-n needs a source stream, so pass a fresh random-stream
:: random-file ( n path -- )
    <random-stream> path binary <file-writer> n stream-copy-n ;

! Read from /dev/urandom
:: random-file-urandom ( n path -- )
    [
        <random-stream> path
        binary <file-writer> n stream-copy-n
    ] with-system-random ;
Here are the results:
$ dd if=/dev/urandom of=here.bin bs=100000000 count=1
1+0 records in
1+0 records out
100000000 bytes transferred in 17.384370 secs (5752294 bytes/sec)
vs.
IN: scratchpad [ 100,000,000 "there.bin" random-file ] time
Running time: 5.623136439 seconds
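The dd run above also measures dd itself, so a like-for-like comparison would time both Factor words, which share the same stream-copying code and differ only in the random generator. This is a sketch of how one might do that in the listener; the second file name is made up, and no timings are shown because I have not recorded any for this variant.

```factor
USING: tools.time ;

! Factor's default generator (Mersenne Twister).
[ 100,000,000 "there.bin" random-file ] time

! Same stream code, but with the system generator bound,
! so the bytes ultimately come from the operating system.
[ 100,000,000 "there-urandom.bin" random-file-urandom ] time
```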
Conclusion
Since Factor has high-level libraries that wrap the low-level libc and system calls used for nonblocking I/O, we don’t have to deal with platform-specific quirks at this level of abstraction, such as handling EINTR, error codes, or resource cleanup at the operating system level. When a call gets interrupted, that is, when errno is set to EINTR after the call returns, the I/O operation is simply tried again behind the scenes, and only serious I/O errors get thrown. There are many options for correct resource cleanup should an error occur, but the error handling code we used here is incorporated into the stream-copy and with-file-writer words, so resources are cleaned up regardless of what happens. We also demonstrated that a Factor word is preferable to a shell script or the dd command for making files full of random data because it’s more portable and faster, and that custom streams are easy to define.
Finally, there’s actually a faster way to create huge files full of zeros, and that’s by using sparse files. Sparse files can start off using virtually no file-system blocks, but can appear to be as large as you wish, and only start to consume more blocks as parts of the file are written. However, support for this is file-system dependent and, overall, sparse files are of questionable use. On Unix file-systems that support sparse files, the first method above should automatically create them with no extra work. Note that on Mac OS X, sparse files are supported but not enabled by default. On Windows, however, you have to make a call to DeviceIoControl. If someone wants to make a small contribution to the Factor project, they are welcome to implement creation of sparse files for Windows.
Edit: Thanks to one of the commenters, I rediscovered that there’s a Unix syscall, truncate, that creates zero-filled files of a given length in constant time on my Mac. This is indeed the best solution for making files full of zeros, and although it is unportable, a Factor library would have no problem using a hook on the OS variable to call truncate on Unix and another method on Windows.