Working with pandas is like working with Excel on steroids: it can do a lot of things fast, but somehow doing the easy things gets complicated. This video and article tutorial covers:
- loading data to a dataframe
- how to select data from a dataframe in pandas
- how to change and format the index of the dataframe
- how to do basic operations with pandas
- plotting data in matplotlib
- summing columns to a new column
The data used in the first part is taken from the statsmodels datasets, which we reach through import statsmodels.api as sm. Once the data is loaded into a dataframe, it can be accessed in one of the following ways:
    df
    df.head()
    df['YEAR']
Each of these returns the corresponding data: the whole dataframe, its first five rows, and the YEAR column, respectively.
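The loading call itself can be sketched as follows. This is a minimal sketch, assuming the sunspots dataset that ships with statsmodels (the notebook's filename and the YEAR and SUNACTIVITY columns point to it). The fallback branch is a tiny hand-typed stand-in with illustrative values, added only so the snippet also runs where statsmodels is not installed:

```python
import pandas as pd

try:
    import statsmodels.api as sm

    # load_pandas() returns a dataset object; its .data attribute is the DataFrame
    df = sm.datasets.sunspots.load_pandas().data
except ImportError:
    # Assumption: illustrative stand-in, not the real dataset
    df = pd.DataFrame({
        "YEAR": [1700.0, 1701.0, 1702.0, 1703.0, 1704.0],
        "SUNACTIVITY": [5.0, 11.0, 16.0, 23.0, 36.0],
    })

print(df.head())     # first five rows
print(df["YEAR"])    # a single column, returned as a Series
```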
Changing the index of the dataframe is quite an easy task. First, we need to produce a new index:
    index = pd.Index(sm.tsa.datetools.dates_from_range('1700', '2008'))
Once the index is produced, we may decide to format it. If we only need the year, this code does the job:
    index = pd.to_datetime(index, format="%m%d%Y").strftime("%Y")
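A self-contained sketch of the same idea on a five-row frame, including the assignment of the new index to the dataframe. It builds the year-end dates with pd.to_datetime instead of the statsmodels helper (a swapped-in technique, so the snippet needs only pandas), and the sample values are illustrative:

```python
import pandas as pd

# Illustrative five-row frame standing in for the sunspots data
df = pd.DataFrame({
    "YEAR": [1700, 1701, 1702, 1703, 1704],
    "SUNACTIVITY": [5.0, 11.0, 16.0, 23.0, 36.0],
})

# One year-end timestamp per row, built from the YEAR column
index = pd.to_datetime([f"{year}-12-31" for year in df["YEAR"]])

# Keep only the year; strftime turns the dates into plain strings
df.index = index.strftime("%Y")

print(df.index.tolist())
```

Note that the resulting labels are strings, so df.loc["1702"] finds a row while df.loc[1702] does not.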
Selecting data from the dataframe is done with slicing and indexing. There are plenty of ways to do so, but the easiest ones are these:
    df[df.columns[1:2]][3:5]
    df.iloc[3:5, 1:2]
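To check that the two forms select the same cells, here is a small sketch on an illustrative five-row frame. The .loc variant is an addition not used in the original, shown for comparison:

```python
import pandas as pd

# Illustrative frame with the same column names as the sunspots data
df = pd.DataFrame({
    "YEAR": [1700, 1701, 1702, 1703, 1704],
    "SUNACTIVITY": [5.0, 11.0, 16.0, 23.0, 36.0],
})

# Chained: pick the second column by name, then slice rows 3 and 4
by_chain = df[df.columns[1:2]][3:5]

# Positional: rows 3 and 4, column 1, in a single call
by_iloc = df.iloc[3:5, 1:2]

# Label-based: .loc slices include the end label, so 3:4 covers rows 3 and 4
by_loc = df.loc[3:4, ["SUNACTIVITY"]]

print(by_chain.equals(by_iloc) and by_iloc.equals(by_loc))
```

The single .iloc call is usually the safer habit, since chained selection can behave unexpectedly when it is later used for assignment.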
Basic operations with pandas are quite trivial: Python's built-in sum() and pandas methods such as .sum() and .mean() can be used for these:
    sum(df['YEAR'])
    df['SUNACTIVITY'].mean()
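A short sketch comparing the built-in sum() with the pandas method on an illustrative five-row frame (the values are stand-ins, not loaded from statsmodels):

```python
import pandas as pd

# Illustrative frame with the same column names as the sunspots data
df = pd.DataFrame({
    "YEAR": [1700, 1701, 1702, 1703, 1704],
    "SUNACTIVITY": [5.0, 11.0, 16.0, 23.0, 36.0],
})

total_builtin = sum(df["YEAR"])    # Python built-in, walks the values one by one
total_pandas = df["YEAR"].sum()    # pandas method, vectorised
average = df["SUNACTIVITY"].mean()

print(total_builtin, total_pandas, average)
```

Both totals agree here. They differ once the column contains missing values: the built-in sum() propagates NaN, while the pandas method skips it by default.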
Creating our own dataframe and plotting data with matplotlib is quite easy, in a Jupyter notebook as well. We may generate a few lists in a single loop and then put them in the dataframe:
    dfx = pd.DataFrame()
    ww, xx, yy, zz = [], [], [], []

    for n in range(100):
        w = n * 10
        if n % 13 == 0:
            x = n * 2
        y = n ** 1.4
        z = n * 10.5

        ww.append(w)
        xx.append(x)
        yy.append(y)
        zz.append(z)

    dfx['n * 10'] = ww
    dfx['n * 2'] = xx
    dfx['n ** 1.4'] = yy
    dfx['n * 10.5'] = zz
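The same frame can also be built without the loop. This is a sketch assuming that only the x assignment sits under the if, so that x keeps the value from the last multiple of 13. It uses NumPy, which the original does not:

```python
import numpy as np
import pandas as pd

n = np.arange(100)

dfx = pd.DataFrame({
    "n * 10": n * 10,
    # x changes only at multiples of 13, which gives a step series
    "n * 2": (n // 13) * 13 * 2,
    "n ** 1.4": n ** 1.4,
    "n * 10.5": n * 10.5,
})

print(dfx.head())
```

Passing a dictionary of equal-length arrays to pd.DataFrame creates all columns at once, which avoids the four helper lists.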
Once this is carried out, the plotting of the data is a piece of cake:
    import matplotlib.pyplot as plt

    plt.rcParams['figure.figsize'] = [10, 10]  # figure size in inches
    dfx.plot()                                 # one line per column
    plt.ylabel("Values")
    plt.xlabel("N")
    print(dfx.index.tolist())
Summing columns into a new column in pandas is not science fiction. The trick is to remember that axis = 1 sums across the columns, giving one total per row, while axis = 0 sums down the rows, giving one total per column:
    for n in range(5):
        # each pass also sums the sum columns added by the earlier passes
        dfx["n Sum " + str(n + 2)] = dfx.sum(axis=1, numeric_only=True)
    dfx.plot()
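The effect of the axis argument, and of repeating the assignment in a loop, is easier to see on a tiny frame. A minimal sketch with a hypothetical two-column frame named small:

```python
import pandas as pd

# Hypothetical two-column frame, small enough to check by hand
small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

row_totals = small.sum(axis=1)       # one value per row: a + b
column_totals = small.sum(axis=0)    # one value per column

# Repeating the assignment folds the earlier totals into the later ones
for n in range(2):
    small["Sum " + str(n + 1)] = small.sum(axis=1, numeric_only=True)

print(small)
```

Because every new column takes part in the next sum, the totals double on each pass. To keep each total based on the original columns only, select them first, for example small[["a", "b"]].sum(axis=1).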
That's pretty much all. The Jupyter notebook is available on GitHub here: https://github.com/Vitosh/Python_personal/blob/master/JupyterNotebook/sunspots.load_pandas.ipynb