Last week I attended the Python and ML Summit in Berlin. One of the most interesting sessions I took part in was a workshop named “Reading all yourself was yesterday – How to turn large amounts of text into insights with machine learning”. It explored the analysis of large datasets, and it prompted me to write a basic article on how to use pandas and NumPy.
These two libraries have the following main features:
- pandas
  - data analysis; the name is derived from “panel data”
  - provides the DataFrame
  - somewhat similar to a spreadsheet
- NumPy
  - numeric functions for Python
  - mainly for calculation purposes
  - has its own tricks with arrays
The examples below are available in a Jupyter notebook here.
So, let’s start with pandas. This is what the initial sample data looks like:

    import numpy as np
    import pandas as pd

    data = {
        'vitoshacademy.com': [0, 0, 2, 1],
        'codedaily.vitoshacademy.com': [2, 6, 3, 1]
    }

Once the data is loaded into a DataFrame, it is displayed as a table with one column per site.
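A minimal sketch of that step, assuming the `data` dictionary from above. The variable name `df` is only illustrative and is not taken from the notebook:

```python
import pandas as pd

data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1]
}

# Each dictionary key becomes a column; the rows get a default integer index 0..3.
df = pd.DataFrame(data)
print(df)
```

Without an explicit index, the rows are simply numbered, which is why adding meaningful labels is the natural next step.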
This becomes even better once an index is added. In our case, the index labels are the week numbers:

    data_with_index = pd.DataFrame(data, index=['wk33', 'wk34', 'wk35', 'wk36'])
    data_with_index

The index is accessible through the .index attribute, i.e. data_with_index.index. Calling data_with_index.to_numpy() returns the values as a two-dimensional NumPy array, with one row per week.
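Here is a small sketch of both calls, assuming the same sample data as above. The names `weeks`, `values` and `totals` are illustrative, and the column sum at the end is just one example of the kind of array calculation NumPy is used for:

```python
import pandas as pd

data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1]
}
data_with_index = pd.DataFrame(data, index=['wk33', 'wk34', 'wk35', 'wk36'])

# .index is an attribute holding the row labels.
weeks = list(data_with_index.index)
print(weeks)

# .to_numpy() is a method; it returns a 2-D NumPy array, one row per week.
values = data_with_index.to_numpy()
print(values)

# NumPy operates on whole arrays at once: sum down each column.
totals = values.sum(axis=0)
print(totals)
```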
There are other nice one-line commands that help us get the most out of our data. For example, the data can be:
- described with describe()
- averaged with mean()
- analyzed with a cumulative sum, e.g. 2, 8, 11, 12 is the cumulative sum of codedaily, because 2+6=8; 8+3=11; 11+1=12
- reduced with a lambda expression, so that the difference between max and min is easily taken
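The one-liners above can be sketched as follows, assuming the `data_with_index` DataFrame from earlier. The result names (`summary`, `means`, `running`, `spread`) and the use of `cumsum()` and `apply()` are illustrative choices, not necessarily the exact code from the notebook:

```python
import pandas as pd

data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1]
}
data_with_index = pd.DataFrame(data, index=['wk33', 'wk34', 'wk35', 'wk36'])

# Summary statistics per column: count, mean, std, min, quartiles, max.
summary = data_with_index.describe()

# Mean of each column.
means = data_with_index.mean()

# Running total down each column, week by week.
running = data_with_index.cumsum()

# Difference between the largest and smallest value of each column.
spread = data_with_index.apply(lambda col: col.max() - col.min())

print(summary)
print(means)
print(running)
print(spread)
```

By default `apply` passes each column to the lambda as a Series, so `col.max() - col.min()` yields one number per site.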
pandas.DataFrame.to_excel
Writing the DataFrame to Excel and reading it back are both one-liners:
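
    # Write the DataFrame to a sheet named 'Pandas' (needs an Excel engine such as openpyxl)
    data_with_index.to_excel('myExcel.xlsx', sheet_name='Pandas')
    # Read the sheet back; with index_col=None the week labels come back as an ordinary column
    pd.read_excel('myExcel.xlsx', sheet_name='Pandas', index_col=None, na_values=['NA'])
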
Indeed, pandas is a game changer even here; see the documentation page pandas.DataFrame.to_excel.html. As mentioned, all examples are available in a Jupyter notebook here.
Cheers!