Sorting strings in Python

In this post, I focus on explaining string sorting. I know that Python already has a good sorting algorithm, but it’s missing one: natural sort! Luckily, there is an easy solution ;)

I will try expose the problem that we faced with an example. Take a look at the output of the following code:

>>> sorted(['1', '2', '3', '10', '11'])
['1', '10', '11', '2', '3']

It's not the expected output from a human point of view.

what gif?

Character mapping

We had this result because Python compares the characters of the string. But, how is the order defined between characters?

In software, there is a character mapping table, so for example: ‘a’ is 97, ‘b’ is 98, ‘c’ is 99, and so on. There are different tables or character encodings, and the most popular one is Unicode. Here are some characters mapped in Unicode:

character code value (decimal)
1 49
2 50
... ...
A 65
B 66
... ...
a 97
b 98
... ...
🐍 128013
🐎 128014

Note
The difference between Unicode and UTF-8 is that Unicode defines the mapping between the characters, 'a' is 97, and UTF-8 defines how these values are represented; this can be 0097 or 0x61 in hexadecimal. You can learn more about Unicode and Python here.

You can view the encoding number with ord and convert the number to the character with chr. Try running the following code to see the values of each character:

[chr(x) for x in range(0, 123)]

String comparison

Once we have defined the order of characters, then sorting is quite easy. One of the algorithms that are implemented in CPython is the following (source code):

// compares two strings in case insensitive
int PyOS_mystricmp(const char *s1, const char *s2)
{
    // iterate over the two strings until the end or a difference is found
    while (*s1 && (tolower((unsigned)*s1++) == tolower((unsigned)*s2++))) {
        ;
    }

    //compare the latest characters 
    return (tolower((unsigned)*s1) - tolower((unsigned)*s2));
}

So basically, when comparing '2' with '10', we compare characters by characters. As the first two characters are different, the algorithm returns the smallest one between '2' and '1'.

Now you know why we can sort something like this and why the output is ;)

>>> sorted(['1', 'a', 'A', '😀', ' ', '{', '🐍', '❤️'])
[' ', '1', 'A', 'a', '{', '❤️', '🐍', '😀']

Natural sort

But the problem still exists, for us '10' is greater than '2'. It will not make sense that in pepy.tech we show first the versions like 1.0, 1.1, 1.10, 1.2.

To solve this problem, we go for the easy way, use a library. A good one for us is natsort.

If you want to read more about how does it work, nothing better than their documentation.

\x54\x68\x61\x6E\x6B\x73\x20\xF0\x9F\x98\x80

links

social