Data Science: Assembling Data Sets

Data Munging

  • Good data scientists spend most of their time cleaning and formatting data.
  • The rest spend most of their time complaining that there is no data available.
  • Data munging or data wrangling is the art of acquiring data and preparing it for analysis.

Language for Data Science

  • Python: contains libraries and features (e.g. regular expressions; see the sketch after this list) for easier munging

    • Notebook Environments: Mixing code, data, computational results, and text is essential for projects to be:

      • reproducible
      • tweakable
      • documented
  • Perl: used to be the go-to language for data munging on the web, before Python ate it for lunch.

    • Don’t be surprised if you encounter it in some legacy project.
  • R: a programming language of statisticians with the deepest libraries available for data analysis and visualization

  • Matlab: fast and efficient matrix operations

  • Java/C: languages for Big Data systems

    • less well suited than Python, R, or Matlab for building models
    • tend to be used for infrastructure
  • Mathematica/Wolfram Alpha: symbolic math

  • Excel: bread and butter tool for exploration
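
As a small illustration of the munging features mentioned in the Python bullet above, here is a minimal sketch using Python's built-in re module to pull dates out of messy free text; the strings are invented for the example.

```python
# Minimal munging sketch: use a regular expression to pull dates out of messy text.
import re

messy = "born 1984-07-12; moved 1990/3/4; graduated 2006.05.30"

# Match year, month, day separated by -, / or .
dates = re.findall(r"(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})", messy)
print(dates)   # [('1984', '07', '12'), ('1990', '3', '4'), ('2006', '05', '30')]
```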

The Importance of Notebook Environments

The deliverable result of every data science project should be a computable notebook tying together the code, data, computational results, and written analysis of what you have learned in the process.

The reason this is so important is that computational results are the product of long chains of parameter selections and design decisions. This creates several problems that are solved by notebook computing environments:

  • Computations need to be reproducible

    We must be able to run the same programs again from scratch, and get exactly the same result.

  • Computations must be tweakable

    Often reconsideration or evaluation will prompt a change to one or more parameters or algorithms.

    • A notebook is never finished until after the entire project is done.
  • Data pipelines need to be documented

    Notebooks make it easier to maintain data pipelines, the sequence of processing steps from start to finish.

    Expect to have to redo your analysis from scratch, so build your code to make it possible!
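
The reproducibility and tweakability points above are easiest to honor if every design decision lives in one visible place. Below is a minimal, hypothetical sketch of a notebook's first cell, assuming NumPy is installed; the file name and parameter values are invented for illustration.

```python
# First cell of a notebook: pin every parameter and random seed in one place,
# so rerunning the whole notebook from scratch reproduces the same results.
import random

import numpy as np

SEED = 42                        # fixed seed: identical random draws on every run
SAMPLE_FRACTION = 0.10           # a tweakable design decision, kept visible up front
INPUT_FILE = "observations.csv"  # hypothetical raw-data file; never modified in place

random.seed(SEED)
np.random.seed(SEED)
```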

Standard Data Formats

The best computational data formats have several useful properties:

  • They are easy for computers to parse
  • They are easy for people to read
  • They are widely used by other tools and systems

Accepted standards are now available:

  • CSV files: for tables like spreadsheets

    • CSV stands for comma-separated values

    • How can you tell whether a CSV file is well formed?

      Try opening it in Excel (a short parsing sketch also appears at the end of this section).

    • human readable

  • XML: for structured but non-tabular data

    • XML stands for eXtensible Markup Language
    • not easily human readable
  • JSON: JavaScript Object Notation, used by many APIs

    • human readable
  • SQL databases: for multiple related tables

    • SQL stands for Structured Query Language
  • Protocol buffers: a language/platform-neutral way of serializing structured data for communications and storage across applications

    • essentially a lighter-weight version of XML
    • designed, like JSON, to communicate small amounts of data across programs
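
To make the "easy for computers to parse" claim concrete, here is a minimal sketch of reading the same toy table from CSV and from JSON using only Python's standard library; the file names ("papers.csv", "papers.json") and fields are hypothetical.

```python
# Parse the same toy table from CSV and from JSON with the standard library alone.
import csv
import json

with open("papers.csv", newline="") as f:
    rows = list(csv.DictReader(f))   # each row becomes a dict keyed by the header line
print(rows[0]["title"], rows[0]["year"])

with open("papers.json") as f:
    papers = json.load(f)            # JSON maps directly onto Python lists and dicts
print(papers[0]["title"], papers[0]["year"])
```

For the "multiple related tables" case, a similarly minimal sketch using Python's built-in sqlite3 module; the tables and rows are invented for illustration.

```python
# Sketch: two related tables in an in-memory SQLite database, joined with SQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE papers (author_id INTEGER, title TEXT)")
con.execute("INSERT INTO authors VALUES (1, 'Ada')")
con.execute("INSERT INTO papers VALUES (1, 'Notes on the Analytical Engine')")

query = (
    "SELECT authors.name, papers.title "
    "FROM papers JOIN authors ON authors.id = papers.author_id"
)
for row in con.execute(query):
    print(row)
```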

Collecting Data

Where Does Data Come From?
  • The critical issue in any modeling project is finding the right data set.

  • Large data sets often come with valuable metadata:

    • book titles
    • image captions
    • Wikipedia edit history…

  • Repurposing metadata requires imagination.

Sources of Data
  • Companies and Proprietary data sources

    • Facebook, Google, Amazon, Blue Cross, etc. have exciting user/transaction/log data sets.
    • Most organizations have (or should have) internal data sets of interest to their business.
    • Getting outside access is usually impossible.
    • Companies sometimes release rate-limited APIs, including Twitter and Google.
  • Government data sets

    • City, State, and Federal governments are increasingly committed to open data.
    • Data.gov has over $10^6$ open data sets.
    • The Freedom of Information Act (FOIA) lets you request data that has not been made open.
    • Preserving privacy is often the big issue in whether a data set can be released.
  • Academic data sets

    • Making data available is now a requirement for publication in many fields.
    • Expect to be able to find economic, medical, demographic, and meteorological data if you look hard enough.
    • Track data sets down from relevant papers, and ask the authors.
    • Google the topic together with ‘Open Science’ or ‘data’.
  • Web search

    What is the difference between spidering and scraping?

    • Spidering is the process of downloading the right set of pages for analysis.
    • Scraping is the fine art of stripping text/data from each page to prepare it for computational analysis.

    • Libraries exist in Python to help parse/scrape the web (see the sketch at the end of this section), but before writing your own, first search:

      • Are APIs available from the source?
      • Did someone previously write a scraper?
    • Terms of service limit what you can legally do.

    • Available Data Sources:

      • Bulk Downloads:
        • Wikipedia
        • IMDB
        • Million Song Database
      • API access:
        • New York Times
        • Twitter
        • Facebook
        • Google

      Be aware of limits and terms of use.

  • Sensor data

    • The ‘Internet of Things’ enables amazing applications:
      • Use image/video data for many tasks, e.g. measuring the weather using Flickr images.
      • Measure earthquakes using accelerometers in cell phones.
      • Identify traffic flows through GPS on taxis.
    • Build logging systems: storage is cheap.
  • Crowdsourcing

    • Many amazing open data resources have been built up by teams of contributors:
      • Wikipedia/Freebase
      • IMDB
    • Crowdsourcing platforms like Amazon Mechanical Turk enable you to pay armies of people to help you gather data, e.g. through human annotation.
  • Sweat equity

    • Often projects require sweat equity.
    • Sometimes you must work for your data instead of stealing it.
    • Much historical data still exists only on paper or PDF, requiring manual entry/curation.
      • At one record per minute, you can enter $10^3$ records in only two work days.
  • Dataset search: dedicated search engines (e.g. Google Dataset Search, Kaggle) index public data sets.
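
As a concrete illustration of the spidering/scraping distinction above, here is a minimal sketch assuming the third-party requests and beautifulsoup4 packages are installed; the URL is hypothetical, and any real crawl must respect the site's terms of service and robots.txt.

```python
# Spidering: download a set of pages.  Scraping: strip the content you need from each.
import requests
from bs4 import BeautifulSoup

seed_urls = ["https://example.com/articles"]     # hypothetical starting point

for url in seed_urls:
    page = requests.get(url, timeout=10)         # spidering: fetch the page
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a"):              # collect further pages to crawl
        print(link.get("href"))
    for headline in soup.find_all("h2"):         # scraping: extract the text we care about
        print(headline.get_text(strip=True))
```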

Cleaning Data: Garbage In, Garbage Out

Many issues arise in ensuring the sensible analysis of data from the field, including:

  • Distinguishing errors from artifacts
  • Data compatibility/unification
  • Imputation of missing values
  • Estimating unobserved (zero) counts
  • Outlier detection
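
Two of these steps lend themselves to a short sketch: mean imputation and z-score outlier screening. This is a minimal example assuming pandas and NumPy are available; the file and column names are hypothetical.

```python
# Sketch: mean imputation and z-score outlier screening on one numeric column.
import numpy as np
import pandas as pd

df = pd.read_csv("measurements.csv")             # hypothetical file with a "height" column

# Imputation: fill missing heights with the column mean.
df["height"] = df["height"].fillna(df["height"].mean())

# Outlier screening: flag values more than 3 standard deviations from the mean.
z = (df["height"] - df["height"].mean()) / df["height"].std()
print(df[np.abs(z) > 3])
```
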
Errors vs. Artifacts
  • Data errors represent information that is fundamentally lost in acquisition.

  • Artifacts are systematic problems arising from processing done to data.

    The good news is that processing artifacts can be corrected, so long as the original raw data set remains available.

    The bad news is that these artifacts must be detected before they can be corrected.

    The key to detecting artifacts is the sniff test, examining the product closely enough to get a whiff of something bad. Something bad is usually something unexpected or surprising, because people are naturally optimists. Surprising observations are what data scientists live for.

    • Such insights are the primary reason we do what we do. However, most surprises turn out to be artifacts, so we must look at them skeptically.

First-time Scientific Authors by Year?

In a bibliographic study, we analyzed PubMed data to identify the year of first publication for the $10^5$ most frequently cited authors.

What should the distribution of new top authors by year look like?

It is important to have a preconception of any result to help detect anomalies.
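
One way to build that preconception is to compute the distribution directly and compare it to what you expect. Below is a minimal sketch assuming pandas and a hypothetical pubmed_records.csv with author and year columns; it is not the actual pipeline used in the study.

```python
# Sketch: find each author's first publication year and examine the distribution.
import pandas as pd

pubs = pd.read_csv("pubmed_records.csv")              # hypothetical columns: author, year

first_year = pubs.groupby("author")["year"].min()     # year of first publication per author
distribution = first_year.value_counts().sort_index()
print(distribution)   # compare against your preconception to spot artifacts
```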

Reference

Lecture 6
