University Library, University of Illinois at Urbana-Champaign

Engineering & Innovation

About Python

Good for:

  • Machine learning
  • Task automation
  • Multifaceted projects

Why Python:

  • It’s freely available
  • Well-supported libraries and documentation
  • Used throughout academia and industry for a wide range of tasks

Recommended setup(s):

  • Anaconda Python 3.x.  If you prefer a graphical user interface, consider an IDE such as PyCharm, Visual Studio Code, or Spyder.  PyCharm offers a free one-year license for students.  Visual Studio Code is also a staple of industry.  Spyder ships with Anaconda Python and is open source.

 

Best resource to learn quickly:

Python Data Analysis on Lynda (access through SSO)

  • A nice introduction to key features of Python, with good examples of how to work with datasets and extract insights.  Login through SSO to get access to the exercise files.
  • Average video length: 5+ min
  • Total duration: 2h30m

Using Libraries

How to load/use a library in Python that isn't loaded by default

In [2]:
# to load libraries you've installed with pip or conda, 
# you import them like so:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
# note you can choose an alias to refer to a loaded library or module. 
# This means we can refer to matplotlib.pyplot as plt when calling functions later.
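As a small illustration of the alias idea, here is a sketch of three common import forms. It uses only standard-library modules (picked for convenience, not taken from the notebook) so it runs without installing anything.

```python
# Three common ways to import, shown with standard-library modules
# (illustrative choices, not part of the original notebook).
import statistics                 # plain import: use the full module name
import statistics as stats        # alias: a shorter name for the same module
from math import sqrt             # import a single name directly

values = [1.0, 4.0, 9.0]

mean_full = statistics.mean(values)    # called through the full module name
mean_alias = stats.mean(values)        # same function, reached through the alias
roots = [sqrt(v) for v in values]      # sqrt needs no prefix

print(mean_full == mean_alias)
print(roots)
```

An alias only changes the name you type; `stats` and `statistics` refer to the same module object.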

Installing New Python Libraries

How to install and load new packages

In [3]:
# We highly recommend using Anaconda python 3.x and the conda environment for data analysis. 
# To install a new library in Anaconda, open the Anaconda prompt and activate whatever 
# environment you want to use (default is 'base').  Then simply call: conda install <library>
# where <library> is the name of the library you want to install.

#from the anaconda prompt, you only need to type:
# conda install wxPython
# OR
# pip install wxPython
# again, conda is preferred because it does some checking to make sure you won't break everything 
# by installing something incompatible with whatever else you have installed.

# from inside a jupyter notebook, it's best to use the following syntax:
!conda install --yes --prefix {sys.prefix} wxPython
 
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\Anaconda3

  added / updated specs:
    - wxpython


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.3                |   py37hc8dfbb8_1         3.1 MB  conda-forge
    wxpython-4.0.7.post2       |   py37h5fe3f0a_3        22.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        25.1 MB

The following NEW packages will be INSTALLED:

  python_abi         conda-forge/win-64::python_abi-3.7-1_cp37m

The following packages will be UPDATED:

  ca-certificates                     2019.11.28-hecc5488_0 --> 2020.6.20-hecda079_0
  certifi                                 2019.11.28-py37_0 --> 2020.6.20-py37hc8dfbb8_0
  conda                                        4.8.2-py37_0 --> 4.8.3-py37hc8dfbb8_1
  openssl                                 1.1.1d-hfa6e2cd_0 --> 1.1.1g-he774522_0
  wxpython                             4.0.4-py37h6538335_0 --> 4.0.7.post2-py37h5fe3f0a_3



Downloading and Extracting Packages

wxpython-4.0.7.post2 | 22.0 MB   | ########## | 100% 

conda-4.8.3          | 3.1 MB    | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
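After an install finishes, it can be handy to confirm from Python that the package is actually visible. The sketch below is one possible check, not part of the original notebook: the helper names `is_importable` and `installed_version` are made up for illustration, and `importlib.metadata` assumes Python 3.8 or newer. Note that wxPython is imported under the module name `wx`.

```python
import importlib.util
from importlib import metadata   # assumes Python 3.8+


def is_importable(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# json ships with Python, so this should always be True.
print(is_importable("json"))
# These depend on whether wxPython is installed in the active environment.
print(is_importable("wx"))
print(installed_version("wxPython"))
```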

Using Multiple Python Scripts

How to call functions and classes from other files

In [4]:
# To access other python scripts, you can import them as you would a library.
# For example, if you had a script called other_stuff.py in the same folder, 
# you can import it as follows:
import other_stuff

# If other_stuff.py defined a function do_stuff(a, b) that returned a + (b/2), 
# you could now call it like:
a = other_stuff.do_stuff(2,7)
print(a)
 
5.5
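For reference, here is a guess at what a file like other_stuff.py could contain. The notebook only describes what do_stuff returns, so the body and the `__main__` guard below are assumptions.

```python
# other_stuff.py  (hypothetical contents, reconstructed from the description above)


def do_stuff(a, b):
    """Return a plus half of b."""
    return a + (b / 2)


if __name__ == "__main__":
    # This part runs only when the file is executed directly
    # (python other_stuff.py), not when it is imported from another script.
    print(do_stuff(2, 7))
```

Keeping any test or demo code under the `__main__` guard means `import other_stuff` stays quiet and just makes the functions available.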

Importing Data From Excel Spreadsheet

How to open and work with Excel datasets in Python.

In [6]:
# The pandas library is a great package for dealing with raw data.
# It can import from csv or xlsx files directly, and handles 
# a lot of the importing and data types for you.

import pandas as pd

data = pd.read_excel('ExampleDataClean.xlsx')
print(data.columns, data)
 
Index(['Date', 'Location', 'Field1', 'Field2', 'Field3', 'Field4', 'Field5',
       'Field6', 'Field7', 'Field8', 'Field9', 'Field10'],
      dtype='object')          Date Location    Field1    Field2    Field3    Field4    Field5  \
0  2020-05-06       US  0.087072  0.466163  0.256585  0.850972  0.024691   
1  2020-05-07       US  0.374148  0.596247  0.460567  0.455865  0.473421   
2  2020-05-09       MX  0.950790  0.786810  0.375649  0.651548  0.224263   
3  2020-05-11       CA  0.956982  0.137989  0.949380  0.251682  0.422171   
4  2020-05-13       US  0.275037  0.759614  0.623671  0.096792  0.265659   
5  2020-05-15       CA  0.579684  0.597635  0.354101  0.926063  0.061220   
6  2020-05-17       US  0.935200  0.917529  0.661647  0.260087  0.040231   
7  2020-05-19       MX  0.782576  0.316358  0.387379  0.021592  0.390715   
8  2020-05-21       MX  0.283339  0.928038  0.543262  0.318045  0.896379   
9  2020-05-23       US  0.357202  0.342773  0.762433  0.097341  0.628032   
10 2020-05-25       CA  0.071354  0.107643  0.787911  0.413408  0.876708   
11 2020-05-27       CA  0.016469  0.364118  0.303169  0.654925  0.702061   
12 2020-05-29       CA  0.844085  0.347905  0.430369  0.789135  0.326151   
13 2020-05-31       US  0.621470  0.226082  0.096330  0.699755  0.608162   
14 2020-06-02       MX  0.075084  0.893168  0.448265  0.001585  0.092250   
15 2020-06-04       US  0.479846  0.586265  0.751123  0.731068  0.320718   
16 2020-06-06       US  0.227084  0.188554  0.463362  0.728477  0.220309   
17 2020-06-08       US  0.401304  0.951056  0.434225  0.758877  0.387425   
18 2020-06-10       US  0.446687  0.553265  0.703386  0.075320  0.572195   

      Field6    Field7    Field8    Field9   Field10  
0   0.170840  0.306948  0.224802  0.227992  0.423540  
1   0.037419  0.164160  0.317844  0.357118  0.233253  
2   0.364410  0.277153  0.687860  0.586596  0.062513  
3   0.782476  0.938242  0.237547  0.456039  0.619374  
4   0.174680  0.850576  0.956483  0.336973  0.135804  
5   0.301500  0.766449  0.856508  0.824069  0.974019  
6   0.086861  0.234838  0.124674  0.453847  0.309176  
7   0.835033  0.681286  0.232524  0.841582  0.754590  
8   0.957806  0.934876  0.111279  0.977557  0.263052  
9   0.125401  0.686620  0.188627  0.035643  0.983478  
10  0.379938  0.816157  0.613449  0.978133  0.097426  
11  0.658342  0.041644  0.410201  0.881190  0.438716  
12  0.398118  0.793277  0.739230  0.561713  0.419796  
13  0.007754  0.395009  0.142180  0.111411  0.494485  
14  0.170952  0.360447  0.141528  0.524062  0.287555  
15  0.210069  0.992240  0.572826  0.811075  0.336953  
16  0.300855  0.083729  0.329310  0.069730  0.878799  
17  0.187967  0.394746  0.848277  0.216059  0.038225  
18  0.774101  0.265008  0.099812  0.121811  0.634335  
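Once the spreadsheet is loaded, the result is an ordinary pandas DataFrame. The sketch below rebuilds a few rows from the output above in memory (so it runs without the Excel file) and shows two common next steps, filtering rows and averaging by group. Which columns to keep is just an illustrative choice.

```python
import pandas as pd

# A few rows copied from the output above (Field2..Field10 left out for brevity).
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-06", "2020-05-07", "2020-05-09", "2020-05-11"]),
    "Location": ["US", "US", "MX", "CA"],
    "Field1": [0.087072, 0.374148, 0.950790, 0.956982],
})

# Boolean mask: keep only the rows where Location is "US".
us_rows = sample[sample["Location"] == "US"]

# Mean of Field1 for each Location.
mean_by_location = sample.groupby("Location")["Field1"].mean()

print(us_rows)
print(mean_by_location)
```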

Importing Data From CSV

How to read in CSV files

In [7]:
# In Python, there are many ways to read in a csv file.
# One of the easiest is to use pandas like above.
# you would call read_csv() like so:

import pandas as pd

data = pd.read_csv('ExampleDataClean2.csv')
print(data.columns, data)

# If your data is dirtier than this and contains blank cells, pandas reads them as NaN by default.
# Use the na_values and keep_default_na parameters of read_csv() to control what counts as missing.
 
Index(['Timestamp', 'Val1', 'Val2', 'Val3'], dtype='object')        Timestamp                  Val1      Val2       Val3
0    15:50:40.94  7741335 pig attached -0.656250   6.699250
1    15:50:41.08  7741335 pig attached -0.562500   6.770688
2    15:50:41.23  7741335 pig attached -1.265625   7.151688
3    15:50:41.38  7741335 pig attached -0.656250  10.191750
4    15:50:41.51  7741335 pig attached  1.062500  12.961937
..           ...                   ...       ...        ...
398  15:51:38.99  7741335 pig attached -0.437500   7.381875
399  15:51:39.14  7741335 pig attached  0.500000   6.905625
400  15:51:39.27  7741335 pig attached -1.343750   6.770688
401  15:51:39.41  7741335 pig attached  1.015625   6.667500
402  15:51:39.56  7741335 pig attached -1.000000   6.611937

[403 rows x 4 columns]
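To see what keep_default_na changes, here is a small sketch using made-up CSV text held in memory (not the notebook's data file). With the default settings blank cells become NaN; with keep_default_na=False they are kept as empty strings.

```python
import io

import pandas as pd

# Made-up CSV text with two blank cells (illustrative, not the notebook's file).
csv_text = "name,score\nalice,1.5\nbob,\n,2.0\n"

# Default behaviour: blank cells are read as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: blank cells are kept as empty strings instead.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df.isna().sum())   # count of missing values per column
print(raw_df["name"].tolist())   # the blank name survives as ''
```

Which behaviour you want depends on the data: NaN is convenient for numeric work, while empty strings may be preferable when a blank is itself meaningful.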

Calculating Statistics by Index

How to analyze parts of your data

In [14]:
# Use slicing to look at specific components of your data.


import pandas as pd
import numpy as np

data = pd.read_csv('ExampleDataClean2.csv')
print(data.columns, data)

# Now we'll calculate the min and max of Val2 and Val3. NOTE: Unlike R and MATLAB, Python is 0-indexed. 
# 0-indexing means the first value is at index 0, not 1.  

minVal2, minVal3 = data[['Val2', 'Val3']].min()
maxVal2, maxVal3 = data[['Val2', 'Val3']].max()
print(f'min of Val2: {minVal2}, max of Val2: {maxVal2}\n' +
        f'min of Val3: {minVal3}, max of Val3: {maxVal3}.')

# You can also apply your own function to a pandas column like so:
# Let's take the inverse square root of the values in the last column:
res = data['Val3'].apply(lambda x: 1/np.sqrt(x) if x != 0 else 0)
print(res)

# Val2 (the third column) contains some negatives. We can filter those out
# in our function by returning 0 for anything that isn't positive:
res2 = data['Val2'].apply(lambda x: 1/np.sqrt(x) if x > 0 else 0)
print(res2)
 
Index(['Timestamp', 'Val1', 'Val2', 'Val3'], dtype='object')        Timestamp                  Val1      Val2       Val3
0    15:50:40.94  7741335 pig attached -0.656250   6.699250
1    15:50:41.08  7741335 pig attached -0.562500   6.770688
2    15:50:41.23  7741335 pig attached -1.265625   7.151688
3    15:50:41.38  7741335 pig attached -0.656250  10.191750
4    15:50:41.51  7741335 pig attached  1.062500  12.961937
..           ...                   ...       ...        ...
398  15:51:38.99  7741335 pig attached -0.437500   7.381875
399  15:51:39.14  7741335 pig attached  0.500000   6.905625
400  15:51:39.27  7741335 pig attached -1.343750   6.770688
401  15:51:39.41  7741335 pig attached  1.015625   6.667500
402  15:51:39.56  7741335 pig attached -1.000000   6.611937

[403 rows x 4 columns]
min of Val2: -1.625, max of Val2: 1.1875
min of Val3: 6.3103125, max of Val3: 20.5105.
0      0.386355
1      0.384312
2      0.373935
3      0.313239
4      0.277757
         ...   
398    0.368058
399    0.380538
400    0.384312
401    0.387274
402    0.388898
Name: Val3, Length: 403, dtype: float64
0      0.000000
1      0.000000
2      0.000000
3      0.000000
4      0.970143
         ...   
398    0.000000
399    1.414214
400    0.000000
401    0.992278
402    0.000000
Name: Val2, Length: 403, dtype: float64
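The cell above mentions slicing and 0-indexing without showing them directly, so here is a short sketch. It rebuilds the first five rows of Val2 and Val3 from the output above, slices by position with .iloc, and then computes the same inverse square root without apply(). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five rows of Val2 and Val3, copied from the output above.
sample = pd.DataFrame({
    "Val2": [-0.656250, -0.562500, -1.265625, -0.656250, 1.062500],
    "Val3": [6.699250, 6.770688, 7.151688, 10.191750, 12.961937],
})

first_row = sample.iloc[0]        # position 0 is the first row
middle = sample.iloc[1:3]         # positions 1 and 2 (the end, 3, is excluded)
val3_max_first3 = sample["Val3"].iloc[:3].max()

# Vectorized inverse square root: 0 wherever Val2 is not positive.
positive = sample["Val2"] > 0
inv_sqrt = pd.Series(0.0, index=sample.index)
inv_sqrt[positive] = 1 / np.sqrt(sample.loc[positive, "Val2"])

print(first_row["Val3"], len(middle), val3_max_first3)
print(inv_sqrt)
```

On large datasets the masked version is usually faster than apply() with a lambda, because the square root is computed on the whole column at once.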

Plotting Data

How to plot data in Python

In [2]:
# There are a lot of great libraries for plotting (matplotlib, bokeh, seaborn, plot.ly, ...)
# Matplotlib is a staple though, and very extensible. We'll use it here.
%matplotlib notebook

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
import datetime
import dateutil.parser


data = pd.read_csv('ExampleDataClean2.csv')
print(data.columns, data)

# we can plot directly from pandas like so:
data.plot(use_index=True, y=['Val2', 'Val3'], style='o-')

# we can also extract that data and plot it separately:
xs = data['Timestamp'].tolist()
ys = data['Val3'].tolist()
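# convert the time strings to datetimes, shifting every point by the same offset
# (60 minus the whole seconds of the first timestamp) so the seconds on the axis begin near zero
offset = datetime.timedelta(seconds=60 - int(float(xs[0].split(':')[-1])))
xs = [dateutil.parser.parse(s) + offset for s in xs]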

# set up new plot
fig = plt.figure(2)
ax = fig.add_subplot()
# set x axis to datetime
ax.xaxis_date()
ax.xaxis.set_major_locator(mdates.SecondLocator(interval=5))
# format the date so we don't get super long strings
date_fmt = mdates.DateFormatter('%S.0')
ax.xaxis.set_major_formatter(date_fmt)

# plot the data
ax.plot(xs, ys, linestyle='-', linewidth=1, color='darkgrey')

# do some formatting to clean up the look of the plots
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.xaxis.set_tick_params(top=False, direction='out', width=1, labelsize=10)
ax.yaxis.set_tick_params(right=False, direction='out', width=1, labelsize=10)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
ax.set_xlabel(r'Time, $sec$', fontsize=20)

# display the plot
plt.show()
 
Index(['Timestamp', 'Val1', 'Val2', 'Val3'], dtype='object')        Timestamp                  Val1      Val2       Val3
0    15:50:40.94  7741335 pig attached -0.656250   6.699250
1    15:50:41.08  7741335 pig attached -0.562500   6.770688
2    15:50:41.23  7741335 pig attached -1.265625   7.151688
3    15:50:41.38  7741335 pig attached -0.656250  10.191750
4    15:50:41.51  7741335 pig attached  1.062500  12.961937
..           ...                   ...       ...        ...
398  15:51:38.99  7741335 pig attached -0.437500   7.381875
399  15:51:39.14  7741335 pig attached  0.500000   6.905625
400  15:51:39.27  7741335 pig attached -1.343750   6.770688
401  15:51:39.41  7741335 pig attached  1.015625   6.667500
402  15:51:39.56  7741335 pig attached -1.000000   6.611937

[403 rows x 4 columns]
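The cell above relies on the %matplotlib notebook magic, which only works inside Jupyter. If you want the same kind of plot from a plain Python script, a sketch along these lines may help. It uses the first five rows from the output above, plots against seconds elapsed since the first sample (a simplification of the date handling above), and writes a PNG instead of opening a window. The "Agg" backend and the in-memory buffer are choices made for this example.

```python
import io

import matplotlib
matplotlib.use("Agg")             # non-interactive backend: no window needed
import matplotlib.pyplot as plt
import pandas as pd

# First five Timestamp and Val3 values, copied from the output above.
stamps = ["15:50:40.94", "15:50:41.08", "15:50:41.23", "15:50:41.38", "15:50:41.51"]
val3 = [6.699250, 6.770688, 7.151688, 10.191750, 12.961937]

# Parse the time strings, then measure seconds elapsed since the first sample.
times = pd.to_datetime(stamps, format="%H:%M:%S.%f")
elapsed = (times - times[0]).total_seconds()

fig, ax = plt.subplots()
ax.plot(elapsed, val3, linestyle="-", linewidth=1, color="darkgrey")
ax.set_xlabel("Time since first sample (s)")
ax.set_ylabel("Val3")

# Write the figure as PNG bytes; pass a filename instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

print(len(buffer.getvalue()) > 0)
```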