Data Munging
- Good data scientists spend most of their time cleaning and formatting data.
- The rest spend most of their time complaining that there is no data available.
Data munging or data wrangling is the art of acquiring data and preparing it for analysis.
Languages for Data Science
Python: contains libraries and features (e.g. regular expressions) that make munging easier (see the sketch after this list)
Notebook Environments: mixing code, data, computational results, and text is essential for projects to be:
- reproducible
- tweakable
- documented
Perl: used to be the go-to language for data munging on the web, before Python ate it for lunch.
- Don’t be surprised if you encounter it in some legacy project.
R: a programming language of statisticians with the deepest libraries available for data analysis and visualization
Matlab: fast and efficient matrix operations
Java/C: language for Big Data systems
- less suitable than Python, R, or Matlab for building models
- tend to be used for infrastructure
Mathematica/Wolfram Alpha: symbolic math
Excel: the bread-and-butter tool for data exploration
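As a taste of why Python is popular for munging, here is a minimal sketch of regular-expression cleanup; the field contents and pattern are hypothetical, not from any particular data set:

```python
import re

# Hypothetical raw field: currency strings scraped from the web.
raw_prices = ["$1,234.50 ", " 99 USD", "N/A", "$7.00"]

def parse_price(s):
    """Strip currency symbols, commas, and labels; return a float or None."""
    match = re.search(r"[\d,]+(?:\.\d+)?", s)     # first numeric token, commas allowed
    if match is None:
        return None                               # unparseable -> treat as missing
    return float(match.group(0).replace(",", ""))

print([parse_price(p) for p in raw_prices])       # [1234.5, 99.0, None, 7.0]
```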
The Importance of Notebook Environments
The deliverable result of every data science project should be a computable notebook tying together the code, data, computational results, and written analysis of what you have learned in the process.
The reason this is so important is that computational results are the product of long chains of parameter selections and design decisions. This creates several problems that are solved by notebook computing environments:
Computations need to be reproducible.
We must be able to run the same programs again from scratch and get exactly the same results.
Computations must be tweakable.
Often, reconsideration or evaluation will prompt a change to one or more parameters or algorithms.
- A notebook is never finished until after the entire project is done.
Data pipelines need to be documented.
Notebooks make it easier to maintain data pipelines, the sequence of processing steps from start to finish.
Expect to have to redo your analysis from scratch, so build your code to make it possible!
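A minimal sketch of what reproducible and tweakable code can look like: all parameters collected in one place and a fixed random seed, so rerunning from scratch gives exactly the same answer. The parameter names and processing steps here are placeholders, not a prescribed structure:

```python
import random

# All tunable choices live in one place, so the analysis is easy to tweak and rerun.
PARAMS = {
    "random_seed": 42,   # fixed seed -> the same "random" sample on every run
    "sample_size": 5,    # a hypothetical parameter you may revisit later
}

def run_pipeline(params):
    random.seed(params["random_seed"])
    data = list(range(100))                              # stand-in for loading raw data
    sample = random.sample(data, params["sample_size"])  # stand-in for a processing step
    return sum(sample) / len(sample)                     # stand-in for the final result

print(run_pipeline(PARAMS))   # rerunning from scratch prints exactly the same number
```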
Standard Data Formats
The best computational data formats have several useful properties:
- They are easy for computers to parse
- They are easy for people to read
- They are widely used by other tools and systems
Accepted standards are now available:
CSV files: for tables, like spreadsheets
- CSV stands for comma-separated values
- How to tell whether a CSV file is well formed? Try opening it in Excel.
- human readable (see the reading/writing sketch after this list)
XML: for structured but non-tabular data
- XML stands for eXtensible Markup Language
- not very human readable
JSON: JavaScript Object Notation, widely used by APIs
- human readable
SQL databases: for multiple related tables
- SQL stands for Structured Query Language
Protocol buffers: a language/platform-neutral way of serializing structured data for communication and storage across applications
- essentially a lighter-weight version of XML
- designed to communicate small amounts of data across programs, like JSON
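A small sketch of reading and writing the two simplest formats with Python's standard library; the file names and fields are made up for illustration:

```python
import csv
import json

# A tiny demonstration table; the field names are made up.
table = [{"name": "Ada", "year": 1843}, {"name": "Alan", "year": 1936}]

# CSV: one header line, then one comma-separated row per record.
with open("people.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "year"])
    writer.writeheader()
    writer.writerows(table)

with open("people.csv", newline="", encoding="utf-8") as f:
    print(list(csv.DictReader(f)))   # note: every value comes back as a string

# JSON: nested dicts/lists, so it also handles non-tabular structure.
with open("people.json", "w", encoding="utf-8") as f:
    json.dump(table, f)

with open("people.json", encoding="utf-8") as f:
    print(json.load(f))              # types (int vs. str) are preserved
```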
Collecting Data
Where Does Data Come From?
The critical issue in any modeling project is finding the right data set.
Large data sets often come with valuable metadata:
- book titles
- image captions
- Wikipedia edit history…
Repurposing metadata requires imagination.
Sources of Data
Companies and Proprietary data sources
- Facebook, Google, Amazon, Blue Cross, etc. have exciting user/transaction/log data sets.
- Most organizations have (or should have) internal data sets of interest to their business.
- Getting outside access is usually impossible.
- Companies sometimes release rate-limited APIs, including Twitter and Google.
Government data sets
- City, State, and Federal governments are increasingly committed to open data.
- Data.gov has over $10^6$ open data sets.
- The Freedom of Information Act (FOIA) enables you to request data that is not already open.
- Preserving privacy is often the big issue in whether a data set can be released.
Academic data sets
- Making data available is now a requirement for publication in many fields.
- Expect to be able to find economic, medical, demographic, and meteorological data if you look hard enough.
- Track data sets down from relevant papers, and ask.
- Google your topic plus ‘Open Science’ or ‘data’.
Web search
What is the difference between spidering and scraping?
- Spidering is the process of downloading the right set of pages for analysis.
- Scraping is the fine art of stripping this content from each page to prepare it for computational analysis.
Libraries exist in Python to help parse/scrape the web (see the sketch below), but first check:
- Are APIs available from the source?
- Did someone previously write a scraper?
Terms of service limit what you can legally do.
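A minimal scraping sketch, assuming the third-party packages requests and beautifulsoup4 are installed; the URL and target tags are placeholders, and any real scraping must respect the site's terms of service:

```python
import requests                      # third-party: pip install requests
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4

url = "https://example.com/articles"          # placeholder page to scrape
response = requests.get(url, timeout=10)
response.raise_for_status()                   # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]  # hypothetical target tags
print(titles)
```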
Available Data Sources:
Bulk Downloads:
Wikipedia
IMDB
Million Song Database
…
API access:
New York Times
Twitter
Facebook
Google
…
Be aware of limits and terms of use.
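A sketch of polite API access that respects rate limits by backing off on HTTP 429 responses; the endpoint, query parameters, and retry policy are illustrative assumptions, not any particular provider's API:

```python
import time
import requests   # third-party: pip install requests

def fetch(url, params, max_retries=3):
    """GET a JSON endpoint, backing off exponentially when rate-limited (HTTP 429)."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=10)
        if resp.status_code == 429:          # rate limit hit: wait and retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("rate limit not lifted after retries")

# Placeholder endpoint and query, not a real service:
# results = fetch("https://api.example.com/v1/search", {"q": "data science"})
```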
Sensor data
- The ‘Internet of Things’ can do amazing things:
- Image/video data can do many things: e.g. measuring the weather using Flickr images.
- Measure earthquakes using accelerometers in cell phones.
- Identify traffic flows through GPS on taxis.
- Build logging systems: storage is cheap.
Crowdsourcing
- Many amazing open data resources have been built up by teams of contributors:
- Wikipedia/Freebase
- IMDB
- Crowdsourcing platforms like Amazon Mechanical Turk enable you to pay armies of people to help you gather data, e.g. through human annotation.
Sweat equity
- Often projects require sweat equity.
- Sometimes you must work for your data instead of stealing it.
- Much historical data still exists only on paper or PDF, requiring manual entry/curation.
- At one record per minute, you can enter $10^3$ records in only two work days ($10^3$ minutes is roughly 17 hours, or about two 8-hour days).
Cleaning Data: Garbage In, Garbage Out
Many issues arise in ensuring the sensible analysis of data from the field, including:
- Distinguishing errors from artifacts
- Data compatibility/unification
- Imputation of missing values
- Estimating unobserved (zero) counts
- Outlier detection
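Two of these issues, imputation of missing values and outlier detection, admit simple first-pass treatments. A sketch using only the standard library, with made-up values; real projects usually need more careful, domain-specific rules:

```python
import statistics

values = [4.1, 3.9, None, 4.3, 41.0, 4.0]    # None = missing, 41.0 = a likely typo

# Imputation of missing values: replace None with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Outlier detection: flag points far from the median relative to the median
# absolute deviation (a robust rule of thumb, not a universal test).
mad = statistics.median(abs(v - median) for v in observed)
outliers = [v for v in observed if abs(v - median) > 10 * mad]

print(imputed)    # [4.1, 3.9, 4.1, 4.3, 41.0, 4.0]
print(outliers)   # [41.0]
```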
Errors vs. Artifacts
Data errors represent information that is fundamentally lost in acquisition.
Artifacts are systematic problems arising from processing done to the data.
The good news is that processing artifacts can be corrected, so long as the original raw data set remains available.
The bad news is that these artifacts must be detected before they can be corrected.
The key to detecting artifacts is the sniff test, examining the product closely enough to get a whiff of something bad. Something bad is usually something unexpected or surprising, because people are naturally optimists. Surprising observations are what data scientists live for.
- Such insights are the primary reason we do what we do. However, most surprises turn out to be artifacts, so we must look at them skeptically.
First-time Scientific Authors by Year?
In a bibliographic study, we analyzed PubMed data to identify the year of first publication for the $10^5$ most frequently cited authors.
What should the distribution of new top authors by year look like?
It is important to have a preconception of any result to help detect anomalies.