Data Science with Python I

Posted on 2018-09-08 | Post modified 2018-09-12 | In Languages, Data Science, Python, Basic

Basic Process and Possible Python Packages

Step 1. Get Data

Beautiful Soup - deals with HTML and XML

Example: scrape IMDb and get the actor names and characters in The Shawshank Redemption

Sample code using Beautiful Soup

>     # modified from https://raw.githubusercontent.com/5harad/datascience/master/webscraping/01-bs/get_cast_from_movie.py
>     
>     from bs4 import BeautifulSoup
>     import requests
>     
>     
>     def clean_text(text):
>         """ Strips leading and trailing whitespace and collapses runs of
>         whitespace between words
>         :param text: the string to clean
>         :return: a "cleaned" string with no more than one space between
>         words
>         """
>         return ' '.join(text.split())
>     
>     
>     """ Go to the IMDb Movie page in link, and find the cast overview list.
>         Prints tab-separated movie_title, actor_name, and character_played to
>         stdout as a result.
>     """
>     link = 'http://www.imdb.com/title/tt0111161/?ref_=nv_sr_1'
>     movie_page = requests.get(link)
>     
>     # Parse the HTML of the movie page
>     soup = BeautifulSoup(movie_page.content, 'html.parser')
>     
>     # Iterate through rows and extract the name and character
>     # Remember that some rows might not be a row of interest (e.g., a blank
>     # row for spacing the layout). Therefore, we need to use a try-except
>     # block to make sure we capture only the rows we want, without python
>     # complaining.
>     
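>     cast_list = soup.find('table', {'class': 'cast_list'})
>     # find() returns None if the page has no such table (e.g., after a layout change)
>     rows = cast_list.find_all('tr') if cast_list is not None else []
>     for row in rows: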
>         try:
>             td = row.find_all('td')
>             if len(td) == 4:
>                 actor = clean_text(td[1].find('a').text)
>                 character = clean_text(td[3].find('a').text)
>                 print('{:20} {:10}'.format(actor, character))
>         except AttributeError:
>             pass
>

LXML - deals with HTML and XML

Tweepy - deals with Twitter

PRAW - deals with Reddit

wikipedia - deals with Wikipedia

Pandas - loads csv or table

easily loads csv, tsv files

easily loads data in chunks if needed

supports group-by, indexing, selection, merge operations

supports data analysis functions like mean, median
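
As a rough sketch of the loading and group-by features listed above, the snippet below reads a small in-memory CSV in chunks and then computes a per-group mean. The column names and values are made up for illustration; a real file path would normally replace the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so the example is self-contained
csv_text = """city,temp
A,10
A,14
B,20
B,22
"""

# With chunksize set, read_csv yields DataFrames of at most two rows each
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
df = pd.concat(list(chunks), ignore_index=True)

# Group by city and take the mean temperature of each group
mean_temp = df.groupby('city')['temp'].mean()
print(mean_temp)
```

In practice each chunk would usually be filtered or aggregated before being combined, so that the full file never has to sit in memory at once.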

Step 2. Data Pre-processing - Raw data might need to be pre-processed

Pandas - deals with numeric data

NumPy - deals with numeric data

NLTK - deals with text data

scikit-image - deals with image data

Matplotlib - comprehensive 2D plotting

can easily create figures and manipulate them

supports: scatter plots, charts, bar charts, pie charts, box and whisker plots, lines
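
A minimal sketch of one of the plot types listed above, a scatter plot. The data points and the output file name `scatter.png` are made up for illustration, and the non-interactive `Agg` backend is an assumption so that the script also runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up sample points for illustration
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Create a figure with a single set of axes and draw the points
fig, ax = plt.subplots()
ax.scatter(x, y)

# Manipulate the figure: label the axes and add a title
ax.set_xlabel('x')
ax.set_ylabel('x squared')
ax.set_title('Scatter plot')

# Write the figure to an image file
fig.savefig('scatter.png')
```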

Preprocess the data only once! Don’t waste CPU cycles doing it each time
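
One possible way to follow this advice is to cache the pre-processed result on disk and reload it on later runs. The sketch below assumes a hypothetical cache file name and made-up raw data; it standardizes the data with NumPy the first time and simply loads the saved copy afterwards.

```python
import os
import tempfile

import numpy as np

# Made-up raw measurements for illustration
raw = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Hypothetical cache location in the system's temporary directory
cache_path = os.path.join(tempfile.gettempdir(), 'preprocessed_example.npy')


def load_preprocessed():
    """Return standardized data, computing it only if no cached copy exists."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    # Standardize each column to zero mean and unit standard deviation
    processed = (raw - raw.mean(axis=0)) / raw.std(axis=0)
    np.save(cache_path, processed)
    return processed


data = load_preprocessed()
print(data)
```

A real pipeline would also need to invalidate the cache when the raw data or the pre-processing steps change; that is left out here.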

Step 3. Analysis and Modeling - Build or infer a mathematical model for the problem

NumPy - n-dimensional array package

supports several statistical operations: np.mean, np.std, np.median

supports linear algebra operations: dot product, cross product

also supports fast Fourier transforms and signal processing operations

Example: invert the matrix [[2, 3], [2, 2]]

>       import numpy as np
>       
>       # Create the matrix we want to invert
>       A = np.array([[2, 3], [2, 2]])
>       
>       # Invert the matrix using linalg.inv
>       AI = np.linalg.inv(A)
>
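
As a quick check of the inversion, and a sketch of the statistical and linear algebra operations mentioned above, the following multiplies the matrix by its inverse and tries a few of the listed functions. The sample vectors are made up for illustration.

```python
import numpy as np

A = np.array([[2, 3], [2, 2]])
AI = np.linalg.inv(A)

# A matrix times its inverse should give the identity (up to rounding)
is_identity = np.allclose(A @ AI, np.eye(2))
print(is_identity)

# Statistical operations on a made-up sample
data = np.array([1.0, 2.0, 3.0, 4.0])
print(np.mean(data), np.median(data), np.std(data))

# Dot and cross products of two unit vectors
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
print(np.dot(u, v))
print(np.cross(u, v))
```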

SciPy - fundamental library for scientific computing

contains extensive functionality for use by scientists, such as:

scipy.linalg - linear algebra

scipy.optimize - optimization

scipy.stats - statistics

scipy.signal - signal processing

scipy.special - special functions, like Gamma

Example: a car's velocity in mph at time t (in hours) is given by 25 + 10t. Find the distance in miles covered by the car in the first 3 hours.

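>       # Import the submodule explicitly; a bare "import scipy" may not load it
>       from scipy import integrate
>       
>       # Velocity of car
>       def velocity(t):
>           return 25 + 10.0 * t
>           
>       # Integrate velocity from 0 to 3; quad returns (value, error estimate)
>       distance, abs_error = integrate.quad(velocity, 0, 3)
>       
>       print("Distance", distance)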
>

SymPy - symbolic mathematics

supports differentiation, integration, simplification of expressions, etc.

useful in modeling, especially in machine learning

mostly used for computing exact (symbolic) solutions
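
To illustrate the "exact solutions" point, here is a small sketch that redoes the car example from the SciPy section symbolically. The variable names are chosen for illustration only.

```python
import sympy as sp

# Symbolic time variable and the velocity expression 25 + 10t
t = sp.symbols('t')
velocity = 25 + 10 * t

# Exact definite integral of the velocity from t = 0 to t = 3
distance = sp.integrate(velocity, (t, 0, 3))
print(distance)  # 120

# Differentiating the velocity gives the constant acceleration
acceleration = sp.diff(velocity, t)
print(acceleration)  # 10
```

Unlike numerical integration, the result here is the exact integer 120 rather than a floating-point approximation with an error estimate.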

scikit-learn (sklearn) - machine learning in Python

supports regression, classification, clustering and dimensionality reduction

provides many models: SVM, Linear Regression, Logistic Regression

Catch: Understand how these algorithms work before you apply them
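
A minimal sketch of the common fit/predict pattern, using linear regression as one of the models named above. The data points are made up and lie exactly on the line y = 2x + 1, so the fitted slope and intercept are easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit the model to the training data
model = LinearRegression()
model.fit(X, y)

# The learned slope and intercept should be close to 2 and 1
print(model.coef_, model.intercept_)

# Predict the value at a new point, x = 4
prediction = model.predict(np.array([[4.0]]))
print(prediction)
```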

statsmodels - provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration

Step 4. Evaluate and Present

IPython

Bokeh - an interactive visualization library that targets modern web browsers for presentation

Flask - lightweight web framework

Advanced Visualization: Seaborn

Built on Matplotlib with a high-level interface; many statistical plots take a single line of code

Example

>     import seaborn as sns
>     sns.set(style='ticks')
>     
>     # Load the bundled iris dataset and plot every pair of features
>     df = sns.load_dataset('iris')
>     sns.pairplot(df, hue='species')

Applications

Are boys taller than girls on average?

Get data

Form hypothesis

Analyze data

Interpret results
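
The four steps above can be sketched roughly as follows. The heights here are synthetic numbers generated only for illustration (they are not real measurements), and Welch's t-test is one possible choice of analysis, not necessarily the one intended by the original example.

```python
import numpy as np
from scipy import stats

# Get data: synthetic heights in cm, generated only for illustration
rng = np.random.default_rng(seed=0)
boys = rng.normal(loc=150.0, scale=7.0, size=100)
girls = rng.normal(loc=148.0, scale=7.0, size=100)

# Form hypothesis: the null hypothesis is that both groups have the same mean

# Analyze data: Welch's t-test does not assume equal variances
result = stats.ttest_ind(boys, girls, equal_var=False)

print('t statistic:', result.statistic)
print('p-value:', result.pvalue)

# Interpret results: a small p-value (commonly below 0.05) suggests
# that the difference in means is unlikely to be due to chance alone
```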

How to classify iris flowers?

Derived from an example given by Randal S. Olson: http://www.randalolson.com/, licensed under CC BY 4.0

Goal: take four measurements of a flower and identify its species based on those measurements

The measurements (features): sepal length, sepal width, petal length, and petal width

These measurements were taken by hand by field researchers
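
A minimal sketch of such a classifier, assuming the copy of the iris dataset bundled with scikit-learn and a decision tree as the model. Both choices are assumptions made for illustration and may differ from the referenced example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the four features (sepal/petal length and width) and species labels
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the flowers to test the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a decision tree on the training set
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Fraction of held-out flowers whose species is predicted correctly
accuracy = clf.score(X_test, y_test)
print('Test accuracy:', accuracy)
```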

Reference