Sorting arrays of the ndarray
crate#
I came across a problem where I needed to sort an array by its values in one column and potentially sort other arrays with that index. In Python, this is very easy:
a = np.random.rand(4,3)
print(a)
# array([[0.98782953, 0.3603707 , 0.07544306],
# [0.3278871 , 0.97414566, 0.36348919],
# [0.77329616, 0.82990707, 0.07634337],
# [0.5545888 , 0.19784322, 0.37579676]])
idx = np.argsort(a[:,0])
print(idx)
# array([1, 3, 2, 0])
print(a[idx,:])
# array([[0.3278871 , 0.97414566, 0.36348919],
# [0.5545888 , 0.19784322, 0.37579676],
# [0.77329616, 0.82990707, 0.07634337],
# [0.98782953, 0.3603707 , 0.07544306]])
However, in rust this seems to be much more complicated, and I could not find
to solve this issue efficiently. I stumbled over the
ndarray-slice
crate but preferred to
try my own solution. As efficiency is not really that important (I need it for testing)
this approach will suffice for now. Maybe it proves useful for you too (if so, please let me know in the comments).
We start by including the ndarray
prelude and itertools
—it includes implementations
for sorting iterators.
use itertools::Itertools;
use ndarray::prelude::*;
Then, let’s start with a simple array and select the first column which we want to use for sorting.
fn main() {
let a = arr2(&[
[0.3898262290068988, 0.8167391344557614],
[0.11878159178282421, 0.5626218062608417],
[0.5704567148662028, 0.7053201952102196],
[0.43654539176129825, 0.39508426718930445],
]);
println!("{}", a.slice(s![.., 0]));
// Output: [0.3898262290068988, 0.11878159178282421, 0.5704567148662028, 0.43654539176129825]
We now have to convert the ndarray
slice into a standard rust slice using as_slice
. This
method fails if the memory is not aligned (which is the case when slicing a column). Therefore,
we create a memory aligned copy to_owned
(which is an unnecessary copy). As we want the
indices that sort the rows, we enumerate the values of the column. Using itertools
gives
us sort_by
, and we can use it to sort the column values. For now, we will print
the rows of the sorted array one after another together with their row indices in the original
array.
// slice needs to be owned / copied to be memory aligned (if iterating over rows)
a.slice(s![.., 0])
.to_owned().as_slice().unwrap()
.into_iter()
.enumerate()
.sorted_by(|a, b| a.1.partial_cmp(b.1).unwrap())
.for_each(|x| println!("{} {}", x.0, x.1));
// 1 0.11878159178282421
// 0 0.3898262290068988
// 3 0.43654539176129825
// 2 0.5704567148662028
We are interested in the index, so we modify the code such that we can store it in a vector.
let idx: Vec<usize> = a.slice(s![.., 0])
.to_owned().as_slice().unwrap()
.into_iter()
.enumerate()
.sorted_by(|a, b| a.1.partial_cmp(b.1).unwrap())
.map(|x| x.0).collect();
println!("{:?}", idx);
// Output: [1, 0, 3, 2]
Now that we have the index, we still cannot sort by this index as this seem to be not supported by the
Slice
object. It seems we can only slice
by ranges or single values but not an array.
One solution sufficient for my use case is to iterate over the index and copy the rows in a new array:
let mut b = a.clone();
idx.into_iter()
.enumerate()
.for_each(|x| {
b.slice_mut(s![x.0, ..]).assign(&a.slice(s![x.1, ..]));
});
// Output:
// [[0.11878159178282421, 0.5626218062608417],
// [0.3898262290068988, 0.8167391344557614],
// [0.43654539176129825, 0.39508426718930445],
// [0.5704567148662028, 0.7053201952102196]], shape=[4, 2], strides=[2, 1], layout=Cc (0x5), const ndim=2
Or skipping the explicit index:
let mut b = a.clone();
a.slice(s![.., 0])
.to_owned().as_slice().unwrap()
.into_iter()
.enumerate()
.sorted_by(|a, b| a.1.partial_cmp(b.1).unwrap())
.enumerate()
.for_each(|x| {
b.slice_mut(s![x.0, ..]).assign(&a.slice(s![x.1.0, ..]));
});
}
We can include multiple assignments in the last for_each
so that we can sort
multiple arrays by the index vector. Eventually, I’ll use the last bit to sort the components
of (Gaussian) mixture models by their relevance/weight.
To further improve readability, we can destructure the x
variable in the fore_each
method into more traditional index names i
and j
. We can do the same for sorted_by
step in the pipeline. Variables that we don’t need (i.e., the indices) can be replaced
by the placeholder _
to avoid compiler warnings (and improve readability even further).
let mut b = a.clone();
a.slice(s![.., 0])
.to_owned()
.as_slice()
.unwrap()
.into_iter()
.enumerate()
.sorted_by(|(_,a), (_,b)| a.partial_cmp(b).unwrap())
.enumerate()
.for_each(|(i, (j, _))| {
b.slice_mut(s![i, ..]).assign(&a.slice(s![j, ..]));
});
println!("{:?}", b);
We can of course apply the same to the case using the index. Here,
I used iter
instead of into_iter
. The latter gives away the ownership
of the index which is what we don’t want. Note the dereferencing operator *
which is then required.
let mut b = a.clone();
idx.clone().iter().enumerate().for_each(|(i, j)| {
b.slice_mut(s![i, ..]).assign(&a.slice(s![*j, ..]));
});
println!("{:?}", b);
Updated on 08 August 2023
Added paragraph about tuple destructuring and ownership preservation of the index