Data Science with Python I

Basic Process and Possible Python Packages

Step 1. Get Data

  • Beautiful Soup - deals with HTML and XML

    • example: scrape IMDb and get the actor names and characters in The Shawshank Redemption

      Sample code using Beautiful Soup

      # modified from https://raw.githubusercontent.com/5harad/datascience/master/webscraping/01-bs/get_cast_from_movie.py

      from bs4 import BeautifulSoup
      import requests


      def clean_text(text):
          """Collapse runs of whitespace and strip leading/trailing whitespace.

          :param text: the string to clean
          :return: a "cleaned" string with no more than one space between words
          """
          return ' '.join(text.split())


      # Go to the IMDb movie page in link and find the cast overview list.
      # Prints actor_name and character_played in aligned columns to stdout.
      link = 'http://www.imdb.com/title/tt0111161/?ref_=nv_sr_1'
      movie_page = requests.get(link)

      # Parse the page and locate the cast_list table
      soup = BeautifulSoup(movie_page.content, 'html.parser')
      cast_list = soup.find('table', {'class': 'cast_list'})

      # Iterate through rows and extract the name and character.
      # Some rows are not of interest (e.g., a blank row for spacing the layout)
      # and some cells have no link, so a try-except block skips those rows
      # without Python complaining. If the page has no cast_list table
      # (e.g., the layout changed), nothing is printed.
      rows = cast_list.find_all('tr') if cast_list is not None else []
      for row in rows:
          try:
              td = row.find_all('td')
              if len(td) == 4:
                  actor = clean_text(td[1].find('a').text)
                  character = clean_text(td[3].find('a').text)
                  print('{:20} {:10}'.format(actor, character))
          except AttributeError:
              pass

  • LXML - deals with HTML and XML

  • Tweepy - deals with Twitter

  • PRAW - deals with Reddit

  • wikipedia - deals with Wikipedia

  • Pandas - loads CSV files and other tabular data

    • easily loads CSV and TSV files
    • easily loads data in chunks if needed
    • supports group-by, indexing, selection, and merge operations
    • supports data analysis functions like mean and median
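
The Pandas features listed above can be sketched on a small in-memory table. The CSV text, column names, and chunk size below are made up for illustration; `read_csv` with `chunksize`, `groupby`, and `merge` are standard pandas calls.

```python
import io

import pandas as pd

# Hypothetical CSV content, used here instead of a file on disk
CSV_TEXT = """name,team,score
ann,red,10
bob,blue,20
cat,red,30
dan,blue,40
"""

# Load the whole table at once
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Load the same data in chunks of two rows and total the scores
chunk_total = 0
for chunk in pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2):
    chunk_total += chunk["score"].sum()

# Group-by combined with an analysis function (mean)
mean_by_team = df.groupby("team")["score"].mean()

# Merge with a second (hypothetical) table on the shared column
colors = pd.DataFrame({"team": ["red", "blue"], "hex": ["#f00", "#00f"]})
merged = df.merge(colors, on="team")

print(chunk_total)
print(mean_by_team)
print(merged)
```

Reading in chunks keeps memory use bounded by the chunk size, which matters when the file is larger than available RAM.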

Step 2. Data Pre-processing - Raw data might need to be pre-processed

  • Pandas - deals with numeric data
  • NumPy - deals with numeric data
  • NLTK - deals with text data
  • scikit-image - deals with image data
  • Matplotlib - comprehensive 2D plotting
    • can easily create figures and manipulate them
    • supports: scatter plots, line plots, bar charts, pie charts, box-and-whisker plots

Preprocess the data only once! Save the result and reload it, so you don’t waste CPU cycles repeating the work on every run.
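
One way to follow this advice is to cache the pre-processed result on disk and reload it on later runs. The sketch below assumes the data fits in a pandas DataFrame; `preprocess`, `load_or_preprocess`, the column name, and the cache path are hypothetical names chosen for illustration.

```python
import os
import tempfile

import pandas as pd


def preprocess(raw):
    """Hypothetical pre-processing: drop missing rows, scale a column to [0, 1]."""
    clean = raw.dropna().copy()
    clean["value"] = clean["value"] / clean["value"].max()
    return clean


def load_or_preprocess(raw, cache_path):
    """Reuse the cached result if it exists; otherwise compute and save it."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    clean = preprocess(raw)
    clean.to_pickle(cache_path)
    return clean


raw = pd.DataFrame({"value": [1.0, None, 2.0, 4.0]})
cache_path = os.path.join(tempfile.mkdtemp(), "clean.pkl")

first = load_or_preprocess(raw, cache_path)   # computes and writes the cache
second = load_or_preprocess(raw, cache_path)  # reads the cache instead
print(second)
```

A cache like this has to be deleted or versioned whenever the raw data or the pre-processing logic changes, otherwise stale results are reused.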

Step 3. Analysis and Modeling - Build or infer a mathematical model for the problem

  • NumPy - base n-dimensional array package

    • supports several statistical operations: np.mean, np.std, np.median

    • supports linear algebra operations: dot product, cross product

    • also supports fast Fourier transforms (np.fft) and basic signal-processing operations such as convolution

    • example

      • invert the matrix [[2, 3], [2, 2]]

        import numpy as np

        # Create the matrix we want to invert
        A = np.array([[2, 3], [2, 2]])

        # Invert the matrix using linalg.inv
        AI = np.linalg.inv(A)

  • SciPy - fundamental library for scientific computing

    • contains extensive functionality for use by scientists, such as:

      • scipy.linalg - linear algebra
      • scipy.optimize - optimization
      • scipy.stats - statistics
      • scipy.signal - signal processing
      • scipy.special - special functions, like Gamma
    • example

      • A car’s velocity in mph at time t (in hours) is given by v(t) = 25 + 10t. Find the distance in miles covered by the car in 3 hours.

        from scipy import integrate


        # Velocity of the car in mph at time t (hours)
        def velocity(t):
            return 25 + 10.0 * t


        # Integrate velocity from 0 to 3; quad returns (value, error estimate)
        distance, abs_error = integrate.quad(velocity, 0, 3)

        print("Distance", distance)

  • SymPy - symbolic mathematics

    • supports differentiation, integration, simplifying equations, etc.
    • useful in modeling, especially for machine learning
    • mostly used for computing exact (symbolic) solutions
  • scikit-learn (sklearn) - machine learning in Python

    • supports regression, classification, clustering and dimensionality reduction
    • provides many models: SVM, Linear Regression, Logistic Regression
    • Catch: Understand how these algorithms work before you apply them
  • statsmodels - provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration
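
As a sketch of the exact-solution use of SymPy mentioned above, the car problem from the SciPy item can be integrated symbolically instead of numerically. The symbol and variable names are chosen here for illustration.

```python
import sympy as sp

t = sp.symbols("t")
velocity = 25 + 10 * t

# Exact antiderivative and exact definite integral from t = 0 to t = 3
position = sp.integrate(velocity, t)
distance = sp.integrate(velocity, (t, 0, 3))

# Differentiating the antiderivative recovers the velocity
recovered = sp.diff(position, t)

print(position)
print(distance)
```

The result is an exact integer rather than a floating-point value with an error estimate, which is the main difference from the numerical `quad` approach.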

Step 4. Evaluate and Present

  • IPython
  • Bokeh - an interactive visualization library that targets modern web browsers for presentation
  • Flask - lightweight web framework

Advanced Visualization: Seaborn

  • Builds upon Matplotlib with a high-level interface

  • Produces complex statistical plots, often with a single line of code

  • Example

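    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set(style='ticks')

    # Pairwise scatter plots of the iris features, colored by species
    df = sns.load_dataset('iris')
    sns.pairplot(df, hue='species')
    plt.show()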

Applications

Are boys taller than girls on average?

  • Get data
  • Form hypothesis
  • Analyze data
  • Interpret results
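
The four steps can be sketched with a two-sample t-test. The heights below are synthetic numbers generated for illustration, not real measurements, and Welch's t-test is one reasonable choice of test, not necessarily the one used in the lectures.

```python
import numpy as np
from scipy import stats

# 1. Get data: synthetic heights in cm (assumed means and spreads, not real data)
rng = np.random.default_rng(0)
boys = rng.normal(loc=175.0, scale=7.0, size=200)
girls = rng.normal(loc=162.0, scale=6.5, size=200)

# 2. Form hypothesis: the null hypothesis says the two group means are equal
# 3. Analyze data: Welch's t-test does not assume equal variances
t_stat, p_value = stats.ttest_ind(boys, girls, equal_var=False)

# 4. Interpret results at a 5% significance level (two-sided test;
#    the sign of t_stat shows which group has the larger sample mean)
alpha = 0.05
reject_null = p_value < alpha

print("mean difference:", boys.mean() - girls.mean())
print("t =", t_stat, "p =", p_value)
print("reject null hypothesis:", reject_null)
```

With real data, the same code applies once `boys` and `girls` are replaced by the measured heights.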

How to classify iris flowers?

  • Derived from an example given by Randal S. Olson: http://www.randalolson.com/, licensed under CC BY 4.0
  • Goal: take four measurements of a flower and identify its species based on those measurements
  • The measurements (features): sepal length, sepal width, petal length, and petal width
  • These measurements come from hand-measurements by field researchers
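
A minimal sketch of this task with scikit-learn follows. It uses the copy of the iris data bundled with scikit-learn and a decision tree classifier; both choices, along with the train/test split settings, are assumptions made for illustration and may differ from the referenced example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Four features per flower: sepal length/width and petal length/width
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the flowers to estimate accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier on the training set only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("features:", iris.feature_names)
print("test accuracy:", accuracy)
```

Scoring on held-out flowers rather than on the training set gives a more honest estimate of how the model would perform on new measurements.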

Reference

  1. Lecture 3 Python for Data Science I
  2. Lecture 4 Python for Data Science II