Sorting versions in Python
In this post, I focus on explaining string sorting. I know that Python already has a good sorting algorithm, but it’s missing one: natural sort! Luckily, there is an easy solution ;)
I will try expose the problem that we faced with an example. Take a look at the output of the following code:
>>> sorted(['1', '2', '3', '10', '11'])
['1', '10', '11', '2', '3']
It's not the expected output from a human point of view.
Character mapping
We had this result because Python compares the characters of the string. But, how is the order defined between characters?
In software, there is a character mapping table, so for example: ‘a’ is 97, ‘b’ is 98, ‘c’ is 99, and so on. There are different tables or character encodings, and the most popular one is Unicode. Here are some characters mapped in Unicode:
character | code value (decimal) |
---|---|
1 | 49 |
2 | 50 |
... | ... |
A | 65 |
B | 66 |
... | ... |
a | 97 |
b | 98 |
... | ... |
🐍 | 128013 |
🐎 | 128014 |
ℹ️ NOTE: The difference between Unicode and UTF-8 is that Unicode defines the mapping between the characters, 'a' is 97, and UTF-8 defines how these values are represented; this can be 0097 or 0x61 in hexadecimal. You can learn more about Unicode and Python here.
You can view the encoding number with ord
and convert the number to the character with chr
. Try running the following code to see the values of each character:
String comparison
Once we have defined the order of characters, then sorting is quite easy. One of the algorithms that are implemented in CPython is the following (source code):
So basically, when comparing '2' with '10', we compare characters by characters. As the first two characters are different, the algorithm returns the smallest one between '2' and '1'.
Now you know why we can sort something like this and why the output is ;)
>>> sorted(['1', 'a', 'A', '😀', ' ', '{', '🐍', '❤️'])
[' ', '1', 'A', 'a', '{', '❤️', '🐍', '😀']
Natural sort
But the problem still exists, for us '10' is greater than '2'. It will not make sense that in pepy.tech we show first the versions like 1.0, 1.1, 1.10, 1.2.
To solve this problem, we go for the easy way, use a library. A good one for us is natsort.
If you want to read more about how does it work, nothing better than their documentation.
\x54\x68\x61\x6E\x6B\x73\x20\xF0\x9F\x98\x80