Basic Process and Possible Python Packages
Step 1. Get Data
Beautiful Soup - deals with HTML and XML
example: scrape IMDb and get the actor and character names for The Shawshank Redemption
Sample code using Beautiful Soup
> # modified from https://raw.githubusercontent.com/5harad/datascience/master/webscraping/01-bs/get_cast_from_movie.py
>
> from bs4 import BeautifulSoup
> import requests
>
>
> def clean_text(text):
>     """ Removes white-space before, after, and between words
>     :param text: the string to clean
>     :return: a "cleaned" string with no more than one white space between
>     words
>     """
>     return ' '.join(text.split())
>
>
> # Go to the IMDb movie page in link, and find the cast overview list.
> # Prints actor_name and character_played to stdout in two columns.
> link = 'http://www.imdb.com/title/tt0111161/?ref_=nv_sr_1'
> movie_page = requests.get(link)
>
> # Parse the movie_page so it can be searched for the cast_list table
> soup = BeautifulSoup(movie_page.content, 'html.parser')
>
> # Iterate through rows and extract the name and character.
> # Remember that some rows might not be a row of interest (e.g., a blank
> # row for spacing the layout). Therefore, we need to use a try-except
> # block to make sure we capture only the rows we want, without Python
> # complaining.
>
> cast_list = soup.find('table', {'class': 'cast_list'})
> # soup.find returns None if the page has no such table (e.g. the layout changed)
> rows = cast_list.find_all('tr') if cast_list is not None else []
> for row in rows:
>     try:
>         td = row.find_all('td')
>         if len(td) == 4:
>             actor = clean_text(td[1].find('a').text)
>             character = clean_text(td[3].find('a').text)
>             print('{:20} {:10}'.format(actor, character))
>     except AttributeError:
>         pass
>
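Since a live page can change or be unreachable, the same row-extraction idea can be tried offline on a small hard-coded snippet. This is only a sketch: the markup in `SAMPLE_HTML` and the helper name `extract_cast` are made up for illustration and do not reproduce IMDb's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; it only mimics a
# four-column cast table with the class name 'cast_list'.
SAMPLE_HTML = """
<table class="cast_list">
  <tr><td colspan="4">Cast overview</td></tr>
  <tr>
    <td><img alt="photo"></td>
    <td><a href="/name/1">Tim  Robbins</a></td>
    <td>...</td>
    <td><a href="/char/1">Andy
        Dufresne</a></td>
  </tr>
  <tr>
    <td><img alt="photo"></td>
    <td><a href="/name/2">Morgan Freeman</a></td>
    <td>...</td>
    <td><a href="/char/2">Ellis Boyd 'Red' Redding</a></td>
  </tr>
</table>
"""


def extract_cast(html):
    """Return a list of (actor, character) pairs from a cast_list table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'cast_list'})
    if table is None:  # find() returns None when nothing matches
        return []
    pairs = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) != 4:  # skip header or spacer rows
            continue
        actor_link = cells[1].find('a')
        character_link = cells[3].find('a')
        if actor_link is None or character_link is None:
            continue
        # ' '.join(s.split()) collapses runs of whitespace, as clean_text does
        pairs.append((' '.join(actor_link.text.split()),
                      ' '.join(character_link.text.split())))
    return pairs


cast = extract_cast(SAMPLE_HTML)
for actor, character in cast:
    print('{:20} {:10}'.format(actor, character))
```

Checking for `None` explicitly is an alternative to the try-except approach; either keeps unwanted rows from stopping the loop.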
LXML - deals with HTML and XML
Tweepy - deals with Twitter
PRAW - deals with Reddit
wikipedia - deals with Wikipedia
Pandas - loads CSV files or tables
- easily loads csv, tsv files
- easily loads data in chunks if needed
- supports group-by, indexing, selection, and merge operations
- supports data analysis functions like mean, median
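The Pandas features listed above can be sketched as follows. The data is made up and held in strings (read through `io.StringIO`) so the example runs without any files; the column names are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up sample data standing in for CSV files on disk.
SALES_CSV = """region,product,units
north,apple,10
south,apple,4
north,pear,6
south,pear,8
north,apple,5
"""
PRICES_CSV = """product,price
apple,2.0
pear,3.0
"""

# Load a whole CSV at once (pass sep='\t' for TSV files).
sales = pd.read_csv(io.StringIO(SALES_CSV))
prices = pd.read_csv(io.StringIO(PRICES_CSV))

# Load the same data in chunks of two rows, keeping a running total.
total_units = 0
for chunk in pd.read_csv(io.StringIO(SALES_CSV), chunksize=2):
    total_units += int(chunk['units'].sum())

# Selection with a boolean mask, then group-by with an aggregate.
north = sales[sales['region'] == 'north']
units_by_product = sales.groupby('product')['units'].sum()

# Merge on the shared 'product' column and compute revenue per row.
merged = sales.merge(prices, on='product')
merged['revenue'] = merged['units'] * merged['price']

print(total_units)
print(units_by_product)
print(sales['units'].mean(), sales['units'].median())
```

Chunked loading matters when a file is too large to fit in memory; each chunk is an ordinary DataFrame.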
Step 2. Data Pre-processing - Raw data might need to be pre-processed
- Pandas - deals with numeric data
- NumPy - deals with numeric data
- NLTK - deals with text data
- scikit-image - deals with image data
- Matplotlib - comprehensive 2D plotting
- can easily create figures and manipulate them
- supports: scatter plots, charts, bar charts, pie charts, box and whisker plots, lines
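A few of the plot types listed above can be sketched as below. The numbers are made up, and the `Agg` backend and in-memory buffer are choices made here so the example runs without a display or output file.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: no display needed
import matplotlib.pyplot as plt

# Made-up numbers, used only to have something to draw.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

# One figure with three side-by-side axes.
fig, (ax_scatter, ax_bar, ax_box) = plt.subplots(1, 3, figsize=(9, 3))

ax_scatter.scatter(x, y)
ax_scatter.set_title('scatter plot')

ax_bar.bar(x, y)
ax_bar.set_title('bar chart')

ax_box.boxplot(y)
ax_box.set_title('box and whisker plot')

fig.tight_layout()

# Render to an in-memory PNG; pass a file name to savefig to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
```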
Preprocess the data only once! Don’t waste CPU cycles doing it each time
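One way to preprocess only once is to cache the cleaned data on disk and reuse it on later runs. The sketch below is one possible approach, not a prescribed one: the function names, the cache file name, and the "drop missing rows" step are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def preprocess(raw):
    """Stand-in for an expensive cleaning step: drop rows with missing values."""
    return raw.dropna().reset_index(drop=True)


def load_preprocessed(raw, cache_path):
    """Return preprocessed data, computing it only if no cached copy exists."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    clean = preprocess(raw)
    clean.to_csv(cache_path, index=False)
    return clean


# Made-up raw data with one incomplete row.
raw = pd.DataFrame({'height': [150.0, None, 172.5],
                    'weight': [50.0, 61.0, 70.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'clean.csv')  # hypothetical cache file
    first = load_preprocessed(raw, cache_path)   # computes and writes the cache
    second = load_preprocessed(raw, cache_path)  # reads the cached file instead

print(len(first), first.equals(second))
```

A real pipeline would also need to invalidate the cache when the raw data or the cleaning code changes.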
Step 3. Analysis and Modeling - Build or infer a mathematical model for the problem
NumPy - base n-dimensional array package
supports several statistical operations: np.mean, np.std, np.median
supports linear algebra operations: dot product, cross product
also supports Fast Fourier transforms, Signal Processing operations
example
invert the matrix [[2, 3], [2, 2]]
> import numpy as np
>
> # Create the matrix we want to invert
> A = np.array([[2, 3], [2, 2]])
>
> # Invert the matrix using linalg.inv
> AI = np.linalg.inv(A)
>
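A quick check of the inverse, plus the other NumPy operations listed above, might look like this. The sample arrays other than `A` are made up for illustration.

```python
import numpy as np

A = np.array([[2, 3], [2, 2]])
AI = np.linalg.inv(A)

# A matrix times its inverse should give the identity (up to rounding).
is_identity = np.allclose(A @ AI, np.eye(2))

# Statistical operations on a small made-up sample.
data = np.array([1.0, 2.0, 3.0, 4.0])
mean, std, median = np.mean(data), np.std(data), np.median(data)

# Dot and cross products of two perpendicular 3-D unit vectors.
u = np.array([1, 0, 0])
v = np.array([0, 1, 0])
dot_uv = np.dot(u, v)
cross_uv = np.cross(u, v)

# Fast Fourier transform of a constant signal: only the first bin is non-zero.
spectrum = np.fft.fft(np.ones(4))

print(AI)
print(is_identity, mean, median, dot_uv, cross_uv)
```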
SciPy - fundamental library for scientific computing
contains extensive functionality for use by scientists, such as:
- scipy.linalg - linear algebra
- scipy.optimize - optimization
- scipy.stats - statistics
- scipy.signal - signal processing
- scipy.special - special functions, like Gamma
example
A car’s velocity in mph at time t is given by:
25 + 10t.
Find the distance in miles covered by the car in 3 hours.
> from scipy import integrate
>
> # Velocity of car (mph) at time t (hours)
> def velocity(t):
>     return 25 + 10.0 * t
>
> # Integrate velocity from 0 to 3; quad returns (value, error estimate)
> distance, abs_error = integrate.quad(velocity, 0, 3)
>
> print("Distance", distance)
>
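The other submodules listed above can be sketched in the same way. The small problems below are made up for illustration and were chosen because their answers are easy to verify by hand.

```python
from scipy import linalg, optimize, special, stats

# scipy.linalg: solve the system 2x + 3y = 8, 2x + 2y = 6 (x = 1, y = 2).
solution = linalg.solve([[2, 3], [2, 2]], [8, 6])

# scipy.optimize: minimize (x - 2)**2, whose minimum lies at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

# scipy.special: Gamma(5) equals 4! = 24.
gamma_5 = special.gamma(5)

# scipy.stats: the standard normal CDF at 0 is 0.5.
cdf_at_zero = stats.norm.cdf(0)

print(solution, result.x, gamma_5, cdf_at_zero)
```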
SymPy - symbolic mathematics
- supports differentiation, integration, simplifying equations, etc.
- useful in modeling, especially machine learning
- mostly used for computing exact solutions
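As a sketch of an exact solution, the car-velocity problem from the SciPy example can be solved symbolically. The variable names are chosen here for illustration.

```python
import sympy as sp

t = sp.symbols('t')
velocity = 25 + 10 * t

# Exact definite integral of the velocity over [0, 3].
distance = sp.integrate(velocity, (t, 0, 3))

# Differentiation and simplification.
acceleration = sp.diff(velocity, t)
simplified = sp.simplify(sp.sin(t) ** 2 + sp.cos(t) ** 2)

print(distance, acceleration, simplified)
```

Unlike numerical integration, the result is an exact integer with no error estimate attached.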
Sklearn (scikit-learn) - machine learning in Python
- supports regression, classification, clustering, and dimensionality reduction
- provides many models: SVM, Linear Regression, Logistic Regression
- Catch: Understand how these algorithms work before you apply them
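A minimal sketch of two of these tasks, regression and clustering, on made-up data. The points were chosen so the expected answer is obvious by inspection; real data will not be this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Regression: made-up points that lie exactly on y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
reg = LinearRegression().fit(X, y)
slope, intercept = reg.coef_[0], reg.intercept_

# Clustering: two well-separated groups of made-up 1-D points.
points = np.array([[0.0], [0.2], [0.1], [10.0], [10.2], [10.1]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print(slope, intercept)
print(labels)
```

Most scikit-learn estimators share this pattern: construct the model, call `fit`, then read fitted attributes or call `predict`.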
statsmodels - provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data
Step 4. Evaluate and Present
Advanced Visualization: Seaborn
Builds upon Matplotlib with a high-level interface
A full figure can take a single line of code
Example
> import seaborn as sns
> sns.set(style='ticks')
> df = sns.load_dataset('iris')
> sns.pairplot(df, hue='species')
Applications
Are boys taller than girls on average?
- Get data
- Form hypothesis
- Analyze data
- Interpret results
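The four steps above can be sketched with a two-sample t-test. The heights below are synthetic, generated from assumed means and spreads purely for illustration; they are not real measurements and say nothing about actual populations.

```python
import numpy as np
from scipy import stats

# Get data: synthetic heights in cm. The group means (170 and 165) and the
# standard deviation (7) are assumptions made up for this example.
rng = np.random.default_rng(42)
boys = rng.normal(loc=170.0, scale=7.0, size=200)
girls = rng.normal(loc=165.0, scale=7.0, size=200)

# Form hypothesis: H0 says both groups have the same mean height.
# Analyze data: Welch's t-test does not assume equal variances.
t_stat, p_value = stats.ttest_ind(boys, girls, equal_var=False)

# Interpret results: a small p-value is evidence against H0.
alpha = 0.05
reject_null = p_value < alpha
print(f't = {t_stat:.2f}, p = {p_value:.4g}, reject H0: {reject_null}')
```

The test here is two-sided; a positive t statistic together with a small p-value points toward the first group being taller in this synthetic sample.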
How to classify iris flowers?
- Derived from an example given by Randal S. Olson: http://www.randalolson.com/, licensed under CC BY 4.0
- Goal: take four measurements of a flower and identify the species based on those measurements
- The measurements (features): sepal length, sepal width, petal length, and petal width
- These measurements were taken by hand by field researchers
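A minimal sketch of such a classifier follows. It uses the copy of the iris data bundled with scikit-learn and a logistic regression model; both are choices made here for illustration and are not necessarily what the referenced example uses.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target  # 150 flowers, 4 measurements each

# Hold out a quarter of the flowers to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(iris.feature_names)
print(f'test accuracy: {accuracy:.2f}')
```

Scoring on held-out flowers rather than the training set gives a fairer picture of how the model would do on new measurements.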