Last week I attended the Python and ML Summit in Berlin. One of the most interesting sessions I took part in was a workshop named “Reading all yourself was yesterday – How to turn large amounts of text into insights with machine learning”. It explored the analysis of large datasets, so I have decided to write a basic article on how to use pandas and NumPy.
These two libraries have the following main features:
- pandas
  - data analysis; the name is derived from “panel data”
  - provides the DataFrame
  - somewhat similar to a spreadsheet
- numpy
  - numeric functions for Python
  - mainly for calculation purposes
  - has its own tricks with arrays
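To give a rough idea of what the array "tricks" can look like, here is a minimal sketch. The variable names and the sample values are made up for illustration only; they are not taken from the workshop.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
articles = np.array([2, 6, 3, 1])

# Arithmetic is applied element by element, no explicit loop needed
doubled = articles * 2

# A boolean mask keeps only the values that satisfy the condition
busy_weeks = articles[articles > 2]

# Aggregations are available as methods on the array
total = articles.sum()

print(doubled)
print(busy_weeks)
print(total)
```

The point of the sketch is that whole-array operations replace the loops one would write with plain Python lists.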
The examples below are available in a Jupyter notebook here.
So, let’s start with pandas. We may decide to compare the number of articles at vitoshacademy.com and codedaily.vitoshacademy.com. This is what the initial sample data looks like:
import numpy as np
import pandas as pd

data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1]
}
Then, once the articles are loaded into a DataFrame, they look like this:
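The conversion step itself can be sketched roughly as below. The variable name `df` is an assumption made for this example; the dictionary is the sample data from the article.

```python
import pandas as pd

# Sample data from the article: articles per site
data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1],
}

# Each dictionary key becomes a column; rows get a default 0..3 index
df = pd.DataFrame(data)
print(df)
```

Without an explicit index, pandas numbers the rows starting from 0, which is why adding meaningful labels is the natural next step.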
This becomes even better once an index is added. In our case, the index labels are the week numbers:
data_with_index = pd.DataFrame(data, index=['wk33', 'wk34', 'wk35', 'wk36'])
data_with_index
The index labels are of course accessible through the .index attribute: data_with_index.index. And if we call data_with_index.to_numpy(), a two-dimensional NumPy array with one row per week shows up:
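A small sketch of both calls, rebuilding the DataFrame so the snippet runs on its own. The name `values` is an assumption introduced for this example.

```python
import pandas as pd

data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1],
}
data_with_index = pd.DataFrame(data, index=['wk33', 'wk34', 'wk35', 'wk36'])

# .index is an attribute, so no parentheses are needed
print(data_with_index.index)

# to_numpy() is a method and has to be called; it returns a 2-D array
# with one row per week and one column per site
values = data_with_index.to_numpy()
print(values)
```

Note that the index labels and column names are not part of the returned array; only the numbers are kept.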
There are other nice one-line commands that can help us get the best out of our data. For example, it can be:
- described with describe():
- averaged with mean():
- analyzed with a cumulative sum, cumsum() – e.g. 2, 8, 11, 12 is the cumulative sum of codedaily, because 2+6=8; 8+3=11; 11+1=12:
- reduced with a lambda expression, so that the difference between max and min is easily taken:
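The one-liners listed above can be sketched as follows. The result names (`summary`, `means`, `cumulative`, `spread`) are assumptions introduced for this example, and the lambda is just one possible way to take the difference between max and min.

```python
import pandas as pd

data = {
    'vitoshacademy.com': [0, 0, 2, 1],
    'codedaily.vitoshacademy.com': [2, 6, 3, 1],
}
data_with_index = pd.DataFrame(data, index=['wk33', 'wk34', 'wk35', 'wk36'])

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = data_with_index.describe()

# Average number of articles per week, per column
means = data_with_index.mean()

# Running total down each column
cumulative = data_with_index.cumsum()

# apply() passes each column to the lambda as a Series
spread = data_with_index.apply(lambda column: column.max() - column.min())

print(summary)
print(means)
print(cumulative)
print(spread)
```

By default, all four calls work column by column, so each site gets its own result.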
pandas.DataFrame.to_excel
Writing data from the DataFrame to Excel and reading it back really takes one line each:
data_with_index.to_excel('myExcel.xlsx', sheet_name='Pandas')
pd.read_excel('myExcel.xlsx', sheet_name='Pandas', index_col=None, na_values=['NA'])
Indeed, pandas is a game changer even here; see the documentation page pandas.DataFrame.to_excel.html. As mentioned, all examples are available in a Jupyter notebook here.
Cheers!