If you are reading this blog, you probably already know what Eigen is. If not, Eigen is a header only C++ library for linear algebra. And its a really good one. It does a lot of crazy magic under the hood to get the best performance possible.

When I started working at my current company, a workmate told me that we shouldn’t be using Eigen, that its performance its not okay, and that it’s good only for prototyping, but that we should write our own algebra, tailored specifically for our need to get the best possible performance. I am not absolutely sure if that is true or false, but using to its maximum power is a complex task.

For what we are interested in this series, we use Eigen for computing some properties on some geometry elements. For example, the area of all the triangles in a triangle mesh. We call the code that performs those computations *kernels*. In this series we are going to try to improve the performance of the *kernels* we use for cloth simulation. I have no idea if we will be able to improve the performance, but at least we will try it, and we will learn something.

## The (fast) norm

Before we try with the complex kernels, lets first see some easy examples. It is common to compute the norm of several 3D vectors. To do this, the simplest approach is to iterate on all the vectors and just compute the norm.

```
auto norm(const std::vector<Eigen::Vector3f> &vs, std::vector<float> &out) {
std::transform(std::begin(vs), std::end(vs), std::begin(out), [](const auto &v) {
return v.norm();
});
}
```

The most basic implementation receives a vector of 3D vectors and returns a vector of floats. For each vector, we compute it norm. Straightforward, easy to read, terrible performance.

```
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
bm_norm 348423 ns 347858 ns 2043
```

But what is wrong with that code? Its not using the CPU efficiently. Modern CPUs (well, modern in 2002 I guess…) can compute several operations at the same time in a single core. If we want to sum 3 floats to 3 floats, we can do that in a single instruction. This is called SIMD, Single Instruction Multiple Data. Except that we can’t do it for 3 floats. SIMD has some constraints, the data must come in packets of a specific size: 128, 256 or 512 bytes, depending on your processor. For floats, that means groups of 4, 8 or 16 floats.

The caveat if that to use SIMD, you must know in which platform your code is going to run, and you must use *intrinsics*. You dont do *a+b`*. Instead, you write *v_sum = _mm_hadd_ps(v_sum, v_sum).* That is not fun.

Fortunately, Eigen knows all this, and they have already done the dirty job of coding for each platform. So you can just write *v1 + v2* and if the vectors fit in a SIMD packet, they will call the vectorized sum.

But 3 floats are not 4 floats, and the smaller packet size we can use is 4 floats. So one option would to just add a fourth component to our vectors.

```
auto norm4(const std::vector<Eigen::Vector3f> &vs, std::vector<float> &out) {
std::transform(std::begin(vs), std::end(vs), std::begin(out), [](const auto &v) {
Eigen::Vector4f v4{v.x(), v.y(), v.z(), 0};
return v4.norm();
});
}
```

If the case with 3 floats was bad, this new example is not better. Its actually quite worse.

```
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
bm_norm 348423 ns 347858 ns 2043
bm_norm4 1103279 ns 1101929 ns 631
```

### Memory layout

If we want to make this code faster, we first need to understand how we are storing our data in memory. An *std::vector* stores its data continuosly, and an *Eigen::Vector3f* stores its 3 components compactly. That means, our memory layout is XYZXYZXYZ…XYZ. This way of storing the data is called Array of Structures (or AoS), because we have an array that store our structured data, in this case the point.

Instead of using AoS, we can store our data as a Structure of Arrays (SoA). With SoA, we store a structure that holds an array for each of the data members. In this case, we sould store an array for the X, an array for the Y, and an array for the Z. That means, our memory layout will be XXX…XYYY…YZZZ…Z.

And why is this better? Well, its not better or worse, that will depends on your memory access pattern. For the norm case, its better because of the operation we are doing. The norm of a 3D vector is computed in 5 steps. First, we square each component (X*X, Y*Y, Z*Z), then we sum all the squares and finally we perform the square root to get the final result.

If we store each component continuosly, that means we can perform 4, 8 or 16 squares in a single operation, the sum of the squares of X and Y in a single operation and the sum of the Z in a single operation. So we are computing the norm for N vectors are the same time.

To do this in Eigen, we first need to change our memory layout. Instead of representing each vector independently, we are going to store N vectors in a single *Eigen::Matrix<Scalar, N, 3>*. Since *Eigen* is by default column major, we have the memory layout we have just discussed. Since most probably the number of vectors we originally had is not a multiple of *N*, we will compute the remaining norms one by one. We could add more points until we fill the packet and compute the norm of this packet in the same way as for the others.

```
std::vector<Eigen::Matrix<Real, N, 3>> soaPoints;
soaPoints.resize(nbPackets);
for (int i = 0; i < nbPackets; ++i) {
soaPoints[i] = typename Eigen::Matrix<Real, N, 3, Eigen::RowMajor>::ConstMapType(points[N*i].data(), N, 3);
}
for(auto _ : state) {
for (int i = 0; i < nbPackets; ++i) {
typename Eigen::Matrix<Real, N, 1>::MapType(&norms[N*i], N, 1) = soaPoints[i].cwiseProduct(soaPoints[i]).rowwise().sum().array().sqrt();
}
for (int i = nbPackets*N; i < points.size(); ++i) {
norms[i] = points[i].dot(points[i]);
}
benchmark::DoNotOptimize(norms);
}
```

In lines 3 and 4 we convert our original *std::vector<Vector3>* vectors to a *std::vector<Eigen::Matrix<Real, N, 3>>*. This way, we ensure we have the SoA memory layout for N vectors. However, this operation takes some time, that is why we have left it out from the benchmark measurement. In a real case, we should measure how much time this operation takes and if its worth to convert perform the conversion.

In line 9 we perform the *norm* operation. We multiply each value in the matrix by itself, and then, row by row we sum all the columns. Although it says *rowwise*, Eigen is smart enough to sum N rows at the same time. Finally, we perform the *sqrt*, again, N at a time. Its not as straightforward and easy to read as the original implementation, but there is an importante performance boost.

In my CPU, a AMD Ryzen 5700 XT processor, which supports AVX2 instructions, I get ~2X performance boost when using a packet size of 8.

```
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
bm_norm 348423 ns 347858 ns 2043
bm_norm4 1103279 ns 1101929 ns 631
bm_norm_simd<8> 161240 ns 161063 ns 4283
```

## Conclusion

In this first post on this series, we have seen how we can spped up a very simple operation, the norm of a vector, when using Eigen appropiately. By changing the memory layout of our vectors, and changing the way we code the norm operation, we can get around a ~2X speed up.

However, it requires changing the memory layout of the vectors, and that itself has a huge impact on the performance, so we need to measure this impact to decide if is worth it.

This example was a very simple demo. In the next post, we will use a real 3D triangle mesh and we will compute the areas of all of the triangles.