Sorting Filenames Containing Numbers
Thursday, December 13, 2007
Jeff Atwood articulated a problem I usually run into when looking at files in Explorer. If you have a bunch of files with numbers, file “a100.txt” will come before file “a2.txt” when sorted by name. Of course, two is less than 100, so you’d expect “a2.txt” to come first.
Here’s a solution in Factor:
: human-sort ( seq -- newseq )
[ [ digit? ] cut3 >r string>number r> 3array ] map natural-sort
[ first3 >r number>string r> 3append ] map ;
IN: scratchpad { "a100.txt" "a10.txt" } human-sort .
{ "a10.txt" "a100.txt" }
The algorithm is simple: split a filename at the first number, convert
that from a string to a number, and let Factor’s natural-sort
do the
rest.
It maps over the sequence, { “a100.txt” “a10.txt” }, and cuts it at the
first number inside each element so, after cut3, you get the result on
the stack like “a” “100” “.txt”, and “a” “10” “.txt”. You dip under the
top element and convert the to a number, >r string>number r>
, and
3array
to make a sequence of sequences { { “a” 100 “.txt” } { “a” 10
“.txt” } }. You then call natural-sort
on this. Natural-sort knows how
to compare element by element, so seeing the “a” “a” is equal it moves
onto the next comparison, 100 10. Now that it’s in the right order, you
still need to turn your sequence back into strings. natural-sort
returned { { “a” 10 “.txt” } { “a” 100 “.txt” } }, so map (do something
to each element in the sequence) over this and do
first3 >r number>string r> 3append
, which explodes the array, converts
back to a string, and appends three strings together to get the
filenames. The dot at the end prints it.
I had to write cut3, but it’s a generally useful word for partitioning a sequence that I put into sequences.lib.
: cut3 ( seq pred -- head match tail )
2dup find drop [
rot swap cut rot [ not ] compose
dupd find drop [ cut ] [ f ] if*
] [
drop nip f like f f
] if* ;
UPDATE:
I generalized this to work on { “a100b200.txt” “a100b2.txt” } with some
suggestions from Slava. If you understood the previous code, this should
be pretty understandable too. It keeps calling cut3 until it returns no
matches and accumulates the result in a sequence, all the while applying
a quotation to whatever matched the predicate. It then compares with
natural-sort
like before, and transforms these sequences back into
filenames. cut3
is cleaned up in this version. I ended up with two
general words, cut3
and cut-all
that can be put into a library,
sequences.lib, and used elsewhere.
: cut-find ( seq pred -- before after )
dupd find drop dup [ cut ] when ;
: cut3 ( seq pred -- first mid last )
[ cut-find ] keep [ not ] compose cut-find ;
: (cut-all) ( seq pred quot -- )
[ >r cut3 r> dip >r >r , r> [ , ] when* r> ] 2keep
pick [ (cut-all) ] [ 3drop ] if ;
: cut-all ( seq pred quot -- seq )
[ (cut-all) ] { } make ;
: human-sort ( seq -- newseq )
[ [ digit? ] [ string>number ] cut-all ] map natural-sort
[ [ dup string? [ number>string ] unless ] map concat ] map ;
IN: scratchpad { "a100b2.txt" "a100b200.txt" } human-sort .
{ "a100b2.txt" "a100b200.txt" }
UPDATE TWO:
Here is a much better algorithm that doesn’t reconstruct the original
filenames and thus can handle leading zeros. It constructs keys that can
be sorted by sort-values
or sort-keys
(sort-values
is just as
efficient here and it lets you use dup
in the map instead of a
[ ] keep
, which makes the code slightly more legible).
: human-sort ( seq -- newseq )
[ dup [ digit? ] [ string>number ] cut-all ] { } map>assoc
sort-values keys ;
IN: scratchpad { "000a.txt" "00a.txt" } human-sort .
{ "00a.txt" "000a.txt" }