NumPy Fast Operations and Computations

Learn NumPy fast operations and computations in this article by Alberto Boschetti, a data scientist with expertise in signal processing and statistics, and Luca Massaron, a data scientist and marketing research director specialized in multivariate statistical analysis, machine learning, and customer insight.

When arrays need to be manipulated by mathematical operations, you just need to apply the operation on the array with respect to a numerical constant (a scalar) or an array of the same shape:
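As a minimal sketch of this idea (the array values and variable names here are illustrative assumptions, not taken from the original listing), an array can be combined with a scalar or with another array of the same shape:

```python
import numpy as np

# Illustrative 1 x 5 array: [[0 1 2 3 4]]
a = np.arange(5).reshape(1, 5)

# Array and scalar: 1.0 is added to every element
a_plus_one = a + 1.0   # [[1. 2. 3. 4. 5.]]

# Array and array of the same shape: elements are multiplied pair by pair
a_times_a = a * a      # [[ 0  1  4  9 16]]

print(a_plus_one)
print(a_times_a)
```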

As a result, the operation is performed element-wise; that is, every element of the array is combined with either the scalar value or the corresponding element of the other array.

When operating on arrays of different dimensions, it is still possible to obtain element-wise operations without having to restructure the data if one of the corresponding dimensions is 1. In fact, in such a case, the dimension of size 1 is stretched until it matches the dimension of the corresponding array. This conversion is called broadcasting. For instance:
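A small sketch of broadcasting, assuming a row vector and a column vector chosen purely for illustration: multiplying a 1 x 5 array by a 5 x 1 array stretches both size-1 dimensions and yields a 5 x 5 table of pair-wise products.

```python
import numpy as np

row = np.arange(5).reshape(1, 5)   # shape (1, 5)
col = np.arange(5).reshape(5, 1)   # shape (5, 1)

# Both size-1 dimensions are stretched, so the result has shape (5, 5)
# and table[i, j] == i * j
table = row * col

print(table.shape)
print(table)
```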

Broadcasting does not require expanding the original arrays in memory in order to obtain the pair-wise multiplication.

Furthermore, there exists a wide range of NumPy functions that can operate element-wise on arrays: abs(), sign(), round(), floor(), sqrt(), log(), and exp().
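A brief sketch of a few of these functions applied to an illustrative array (the sample values are an assumption made for this example):

```python
import numpy as np

x = np.array([-2.5, -1.0, 0.0, 1.5, 4.0])

absolute = np.abs(x)          # magnitude of each element
signs = np.sign(x)            # -1, 0 or 1 for each element
floors = np.floor(x)          # largest integer not above each element
roots = np.sqrt(np.abs(x))    # square root, taken on non-negative values
round_trip = np.log(np.exp(x))  # log undoes exp, up to rounding error

print(absolute, signs, floors, roots, round_trip, sep="\n")
```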

Other common operations available as NumPy functions are sum() and prod(), which compute the sum and the product of the array along its rows or columns, depending on the specified axis:
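As a sketch on a small illustrative 2 x 3 array (the values are assumed for this example), axis=0 collapses the rows and gives one result per column, while axis=1 collapses the columns and gives one result per row:

```python
import numpy as np

M = np.arange(1, 7).reshape(2, 3)   # [[1 2 3]
                                    #  [4 5 6]]

col_sums = np.sum(M, axis=0)    # [5 7 9]
row_sums = np.sum(M, axis=1)    # [ 6 15]
col_prods = np.prod(M, axis=0)  # [ 4 10 18]
row_prods = np.prod(M, axis=1)  # [  6 120]
total = np.sum(M)               # no axis: sum of all elements, 21

print(col_sums, row_sums, col_prods, row_prods, total)
```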

When operating on your data, remember that operations and NumPy functions on arrays are extremely fast when compared to simple Python lists. Now, try out a couple of experiments. First, compare a list comprehension to an array operation when adding a constant to every element:
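The sketch below runs this comparison with the standard library timeit module rather than a notebook magic, so that it works as a plain script; that substitution, the array size, and the function names are assumptions of this example. Each snippet runs one loop per measurement, three times, and the best time is kept.

```python
import timeit
import numpy as np

n = 10**5  # illustrative size; larger values widen the gap

def add_with_list():
    # Pure Python: one addition per element inside a list comprehension
    return [i + 1.0 for i in range(n)]

def add_with_array():
    # NumPy: the addition is applied to the whole array at once
    return np.arange(n) + 1.0

# number=1: one loop per run; repeat=3: three runs, keep the best
list_time = min(timeit.repeat(add_with_list, number=1, repeat=3))
array_time = min(timeit.repeat(add_with_array, number=1, repeat=3))

print(f"list comprehension: {list_time:.6f} s")
print(f"NumPy array:        {array_time:.6f} s")
```

The absolute timings depend on the machine, so no expected figures are given here.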

In Jupyter, the %timeit magic allows you to easily benchmark operations. The -n 1 parameter requires the benchmark to execute the code snippet for only one loop; -r 3 repeats the execution of the loops (in this case, just one loop) three times and reports the best performance recorded across those repetitions.

Results on your computer may vary depending on your configuration and operating system. In any case, the difference between the standard Python operations and the NumPy ones will remain quite large. Though unnoticeable when working on small datasets, this difference can really impact your analysis when dealing with larger data or when repeatedly looping over the same analysis pipeline for parameter or variable selection.

This also happens when applying sophisticated operations, such as finding a square root:
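A comparable sketch for the square root, again using the standard library timeit module as a stand-in for a notebook benchmark (the size and function names are assumptions of this example):

```python
import math
import timeit
import numpy as np

n = 10**5  # illustrative size

def sqrt_with_list():
    # Pure Python: math.sqrt is called once per element
    return [math.sqrt(i) for i in range(n)]

def sqrt_with_array():
    # NumPy: np.sqrt works on the whole array in one call
    return np.sqrt(np.arange(n))

list_time = min(timeit.repeat(sqrt_with_list, number=1, repeat=3))
array_time = min(timeit.repeat(sqrt_with_array, number=1, repeat=3))

print(f"math.sqrt in a list comprehension: {list_time:.6f} s")
print(f"np.sqrt on an array:               {array_time:.6f} s")
```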

Sometimes, you may need to apply custom functions to your array instead. The apply_along_axis function lets you take a custom function and apply it along an axis of an array:
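As a sketch, assume a hypothetical custom function, value_range, that returns the spread (maximum minus minimum) of a one-dimensional slice; np.apply_along_axis calls it once for every column (axis=0) or every row (axis=1):

```python
import numpy as np

def value_range(x):
    # Hypothetical custom function: spread of a 1-D slice
    return x.max() - x.min()

M = np.arange(12).reshape(3, 4)   # [[ 0  1  2  3]
                                  #  [ 4  5  6  7]
                                  #  [ 8  9 10 11]]

by_column = np.apply_along_axis(value_range, axis=0, arr=M)  # [8 8 8 8]
by_row = np.apply_along_axis(value_range, axis=1, arr=M)     # [3 3 3]

print(by_column)
print(by_row)
```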

Matrix operations

Apart from element-wise calculations, you can also multiply your two-dimensional arrays according to the rules of matrix calculation, such as vector-matrix and matrix-matrix multiplications, by using the np.dot() function:

As an example, create a 5 x 5 two-dimensional array of ordinal numbers from 0 to 24:

  • Define a vector of coefficients and an array column stacking the vector and its reverse:

  • Now, multiply the array with the vector using the dot function:

  • Alternatively, multiply the vector by the array:

  • Or the array by the stacked coefficient vectors (which is a 5 x 2 matrix):
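The steps above can be sketched as follows. The coefficient values and the variable names are assumptions chosen for illustration; only the 5 x 5 array of numbers from 0 to 24 and the 5 x 2 shape of the stacked coefficients come from the text.

```python
import numpy as np

# 5 x 5 array of the numbers from 0 to 24
M = np.arange(25).reshape(5, 5)

# Illustrative coefficient vector (assumed values)
coefs = np.array([1.0, 0.5, 0.5, 0.5, 0.5])

# Column-stack the vector and its reverse: shape (5, 2)
coefs_matrix = np.column_stack((coefs, coefs[::-1]))

# Array by vector: one weighted sum per row of M
array_by_vector = np.dot(M, coefs)          # [ 5. 20. 35. 50. 65.]

# Vector by array: one weighted sum per column of M
vector_by_array = np.dot(coefs, M)          # [25. 28. 31. 34. 37.]

# Array by the 5 x 2 coefficient matrix: shape (5, 2)
array_by_matrix = np.dot(M, coefs_matrix)

print(array_by_vector)
print(vector_by_array)
print(array_by_matrix)
```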

NumPy also offers an object class, matrix, which is a subclass of ndarray and inherits all its attributes and methods. Unlike arrays, which can have any number of dimensions, NumPy matrices are always two-dimensional. When multiplied, they apply matrix products, not element-wise ones (the same happens when raising them to a power), and they have some special matrix attributes (.H for the conjugate transpose and .I for the inverse).

Apart from the convenience of operating in a fashion that is similar to that of MATLAB, they do not offer any other advantage. You may risk confusion in your scripts since you’ll have to handle different product notations for matrix objects and arrays.
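A short sketch of the difference in product notation, using an illustrative 2 x 2 example (the values are assumed). Note that current NumPy documentation no longer recommends the matrix class, which is consistent with the caution above.

```python
import numpy as np

A_arr = np.array([[1.0, 2.0], [3.0, 4.0]])
A_mat = np.matrix(A_arr)

# On arrays, * is element-wise
elementwise = A_arr * A_arr        # [[ 1.  4.] [ 9. 16.]]

# On matrix objects, * is the matrix product
matrix_product = A_mat * A_mat     # [[ 7. 10.] [15. 22.]]

# The same matrix product on plain arrays needs np.dot
same_with_dot = np.dot(A_arr, A_arr)

# .I gives the inverse of a matrix object
inverse = A_mat.I

print(elementwise)
print(matrix_product)
print(inverse)
```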

Slicing and indexing with NumPy arrays

Indexing allows you to take a view of an ndarray by pointing out either which slice of columns and rows to visualize or an index:

  • Define a working array:

  • Your array is a 10 x 10 two-dimensional array. Start by slicing it into a single dimension. The notation for a single dimension is the same as that in Python lists:

  • You may want to extract even rows from 2 to 8:

  • After slicing the rows, slice the result further by taking only the columns from index 5 onward:

  • As in lists, it is possible to use negative index values in order to start counting from the end. Moreover, a negative number for parameters, such as steps, reverses the order of the output array, like in the following example, where the counting starts from column index 5 but in the reverse order and goes toward index 0:

  • You can also create Boolean indexes that point out the rows and columns to select. Therefore, you can replicate the previous example using a row_index and a col_index variable:

You cannot use Boolean indexes on both rows and columns in the same square brackets to obtain a rectangular selection (NumPy would try to pair the selected positions element by element), though you can combine a Boolean index on one dimension with a slice or integer indexes on the other. Consequently, you first have to apply a Boolean selection on the rows, then reopen the square brackets and apply a second selection on the result, this time focusing on the columns.
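The slicing and Boolean-index steps above can be sketched as follows. The result names and the way row_index and col_index are built are assumptions of this example; the text only says that they replicate the earlier slice.

```python
import numpy as np

# Working 10 x 10 array with the numbers from 0 to 99
M = np.arange(100).reshape(10, 10)

# Even rows from 2 to 8 (start 2, stop before 9, step 2), all columns
even_rows = M[2:9:2, :]

# Same rows, only the columns from index 5 onward
right_part = M[2:9:2, 5:]

# Same rows, columns from index 5 back toward index 0
reversed_cols = M[2:9:2, 5::-1]

# Boolean indexes replicating the right_part selection
row_index = np.zeros(10, dtype=bool)
row_index[2:9:2] = True              # True for rows 2, 4, 6, 8
col_index = np.arange(10) >= 5       # True for columns 5..9

# Select rows first, then reopen the brackets for the columns
selected = M[row_index, :][:, col_index]

print(even_rows)
print(right_part)
print(reversed_cols)
print(selected)
```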

  • If you need a global selection of elements in the array, you can also use a mask of Boolean values, as follows:

This approach is particularly useful if you need to operate on the partition of the array selected by the mask (for example, M[mask]=0 ).
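A sketch of a global Boolean mask, with a condition assumed purely for illustration (values between 20 and 90 whose last digit is at least 5). Selecting with the mask returns a flat one-dimensional array, and assigning through the mask changes only the selected elements.

```python
import numpy as np

M = np.arange(100).reshape(10, 10)

# Illustrative condition combining several element-wise tests
mask = (M >= 20) & (M <= 90) & (M % 10 >= 5)

# Selection: a 1-D array of the elements where mask is True
selected = M[mask]

# Assignment through the mask, done on a copy to keep M intact
M_zeroed = M.copy()
M_zeroed[mask] = 0

print(selected)
print(M_zeroed)
```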

Another way to point out the elements that need to be selected from your array is by providing a row or column index consisting of integers. Such indexes may be defined either by the np.where() function, which transforms a Boolean condition on an array into indexes, or by simply providing a sequence of integer indexes, where the integers may be in any particular order or might even be repeated. Such an approach is called fancy indexing:
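A sketch of both ways of building integer indexes. The condition passed to np.where() is an assumption for illustration; the explicit list reuses the row coordinates that the text mentions later.

```python
import numpy as np

M = np.arange(100).reshape(10, 10)

# np.where turns a Boolean condition into integer indexes
# (it returns a tuple with one index array per dimension)
rows_from_condition = np.where(M[:, 0] >= 70)[0]   # [7 8 9]

# An explicit sequence of integers; repetitions are allowed
row_index = [1, 1, 2, 7]
rows = M[row_index, :]    # row 1 appears twice in the result

print(rows_from_condition)
print(rows)
```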

Having defined the indexes of your rows and columns, you have to apply them contextually to select elements whose coordinates are given by the tuple of values of both the indexes:
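As a sketch, with integer indexes assumed here to be row_index = [1, 1, 2, 7] and col_index = [0, 2, 4, 8], passing both in the same square brackets pairs them position by position:

```python
import numpy as np

M = np.arange(100).reshape(10, 10)

row_index = [1, 1, 2, 7]
col_index = [0, 2, 4, 8]

# One element per (row, column) pair: (1,0), (1,2), (2,4), (7,8)
points = M[row_index, col_index]   # [10 12 24 78]

print(points)
```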

In this way, the selection will report the following points: (1,0), (1,2), (2,4), and (7,8). Otherwise, you have to select the rows first and then the columns, which are separated by square brackets:
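A sketch of the two-step selection, reusing the same illustrative indexes: the rows are picked first, then the brackets are reopened to pick the columns from that intermediate result, which gives every combination of the chosen rows and columns.

```python
import numpy as np

M = np.arange(100).reshape(10, 10)

row_index = [1, 1, 2, 7]
col_index = [0, 2, 4, 8]

# Rows first, then columns: a 4 x 4 block instead of 4 single points
block = M[row_index, :][:, col_index]

print(block)
```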

Finally, remember that basic slicing returns just a view of the data (Boolean and fancy indexing, by contrast, return copies). If you need to create new data from such a view, you have to use the .copy() method on the slice and assign it to another variable. Otherwise, any modification to the original array will be reflected in your slice and vice versa. The copy method is shown here:
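A minimal sketch contrasting a slice with its copy; the slice and the modified element are chosen for illustration:

```python
import numpy as np

M = np.arange(100).reshape(10, 10)

view = M[2:9:2, 5:]          # a view: shares memory with M
N = M[2:9:2, 5:].copy()      # an independent copy of the same slice

# Modify the original array at a position covered by the slice
M[2, 5] = -1

print(view[0, 0])   # the view reflects the change
print(N[0, 0])      # the copy keeps the old value
print(np.shares_memory(view, M), np.shares_memory(N, M))
```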

If you found this article interesting, you can explore Python Data Science Essentials to gain useful insights from your data using popular data science tools. Fully expanded and upgraded, the latest edition of Python Data Science Essentials offers up-to-date insight into the core of Python, including the latest versions of the Jupyter Notebook, NumPy, pandas, and scikit-learn.

Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events. Luca Massaron is a data scientist and marketing research director specialized in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience of solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top-10 Kaggler, he has always been very passionate about every aspect of data and its analysis, and also about demonstrating the potential of data-driven knowledge discovery to both experts and non-experts.
