Sorting by tuple slots

Friday, January 9, 2009

Factor’s sorting has been extended to support sorting tuples by multiple slots, one after the other. In a music player application, you may wish to sort your songs first by artist, then by album, then track number. If you had a tuple representing every song, the code for such a sort is now very easy:

{ { artist>> <=> } { album>> <=> } { track>> <=> } } sort-by-slots

The main word at work is sort-by-slots ( seq sort-specs -- seq' ), where a sort-spec is an accessor paired with a comparator, one of <=>, >=<, human-<=>, human->=<. Human sort first converts consecutive digits to integers and then makes the comparison, e.g. { "a1" "a10" "a03" "a2" } sorts as { "a1" "a2" "a03" "a10" } instead of { "a03" "a1" "a10" "a2" }, as it would with natural-sort.

Sorting in reverse order is possible with the new operator >=<, which inverts the result of the <=> comparator. To sort a playlist by the most played songs in reverse order (most played first):

{ { play-count>> >=< } } sort-by-slots

Source code for sort-by-slots

: slot-comparator ( accessor comparator -- quot )
    '[ [ _ execute ] bi@ _ execute dup +eq+ eq? [ drop f ] when ] ;

MACRO: compare-slots ( sort-specs -- <=> )
    #! sort-spec: { accessor comparator }
    [ first2 slot-comparator ] map '[ _ 2|| +eq+ or ] ;

: sort-by-slots ( seq sort-specs -- seq' )
    '[ _ compare-slots ] sort ;

First, a bit about how Factor’s sort ( seq quot -- sortedseq ) word works. It expects a quotation that compares two objects lexicographically, which is element by element in dictionary order. Thus, the macro compare-slots expands the sort-specs into a quotation that compares tuples slot-by-slot until there is a difference. The macro-expansion for the first example looks like this:

( scratchpad ) [ { { artist>> <=> } { album>> <=> } { track>> <=> } } compare-slots ] expand-macros .
    f 2 [
        \ artist>> \ <=> [ [ execute ] curry bi@ ] dip execute
        dup +eq+ eq? [ drop f ] when
    ] [ [ drop ] dip ndup ] dip call dup
    [ 2 [ ndrop ] curry dip ] [
        2 [
            \ album>> \ <=> [ [ execute ] curry bi@ ] dip execute
            dup +eq+ eq? [ drop f ] when
        ] [ [ drop ] dip ndup ] dip call dup
        [ 2 [ ndrop ] curry dip ] [
            2 [
                \ track>> \ <=> [ [ execute ] curry bi@ ] dip
                execute dup +eq+ eq? [ drop f ] when
            ] [ [ drop ] dip ndup ] dip call dup
            [ 2 [ ndrop ] curry dip ]
            [ 2 [ drop ] dip ndrop t [ f ] [ no-cond ] if ] if
        ] if
    ] if +eq+ or

The macro first extracts the slots and compares with the first comparator. If the comparator returns +lt+ or +gt+ then the comparison is over, but if the comparator returns +eq+ then the next comparison is invoked and the sort continues in the same way. Notice that the slot-comparator word replaces +eq+ with f to avoid short-circuiting the iteration done by 2||. The final sort-by-slots word is a trivial call to sort with the new comparator word we just defined.

Algorithmic complexity and ideas for improvement

The sorting bound is still O(n log n) for time, but of course the more comparators that you need to break ties, the longer the algorithm will take to run.
If your data has to be sorted anyway, it’s possible to sort and then split the data into related slices using a sequence of accessors like the above. You could then compute the playing time of every album, or find your most-played song from every artist. Splitting with the same accessors takes O(n) time, but the odds are that you probably won’t have already sorted the data. So for unsorted data, there is a more efficient way to extract features – you can instead partition it in O(n) time and find features on each partition, avoiding sorting altogether. This will be the subject of a future blog post.