CyberSpy

Rantings from a guy with way too much free time

Data Science with Python: Let there be Light

2017-11-05 Programming

Data-Science - Clueless? No prob. We got you, dude.

So you’ve likely heard all the hype about “data-science” - and if you’re not among the cool kids, it might be a wee bit overwhelming. Where to even start? What does it all mean? How do you figure out what you need to know to start learning and making progress in the field?

Let’s start off with a few definitions:

Data-Science

definition: data-science: also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. source: Wikipedia

Or, said another way, given a (typically large) collection of data, apply tools that let investigators infer knowledge that is not readily available from first-hand observation of the data. Often these data corpora are large and interrelated, making observations difficult and insights hard to glean.

Tools

Modern data-science consists of knowing how to use some fundamental tools combined with mathematical techniques largely derived from linear algebra, probability and statistics, and machine learning. Each of these subjects, taken on its own, requires a significant amount of study to master. And even once mastered, writing code that implements the different algorithmic transformations within these areas would be a daunting programming task.
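To give a feel for that burden, here is a rough sketch (my own illustration, not taken from any library) of about the simplest statistical routine there is: fitting a straight line to data by ordinary least squares, using nothing but plain Python. The function name fit_line and the sample numbers are made up for the example.

```python
# Ordinary least-squares fit of y = slope * x + intercept,
# written by hand with only built-in Python.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance of x and y, and unnormalized variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up sample data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Even this toy case needs input checks and a guard against dividing by zero. Scale that up to matrices, probability distributions, and learning algorithms, and you can see why nobody wants to write it all from scratch.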

So, what can we do to limit the burden on the data-science investigator (you!)? We can take advantage of some fantastic tools that are widely supported by other data-science professionals who back open-source software (OSS).

Python - the language of data-science

Much of the early work in data-science was done using the Python programming language. Its nature as an incremental scripting language, combined with early support for community-maintained libraries, made it a natural choice for professionals in the field. As Python evolved, cornerstone tools emerged within the community. We can break down these tools into two categories:

  • Python coding environments
  • Python libraries

Coding environments

Writing Python is as easy as opening up an editor and writing code. But a more productive approach is to use an interactive environment such as IPython, or a notebook such as Jupyter.

If you are using OS X and Homebrew (brew), you can easily install these programs:

brew install python3 ipython jupyter qtconsole

Additionally, qtconsole is a program that works with both IPython and Jupyter to pop up a Qt window. Start a session (with the qtconsole) by executing ipython qtconsole or jupyter qtconsole. You should see a window like the following:

[Image: Jupyter Qt window]

A more useful way to use Jupyter is within the browser, by creating and managing a notebook. Using a notebook within the browser is a great way to experiment and capture your efforts in an organized and stateful way.

DS libraries

With the programming environment installed and configured, the next step is to install the pillars of data-science libraries.

The predominant libraries used for data-science analysis and programming are found in the SciPy.org collection:

  • NumPy - objects and operations on multi-dimensional arrays
  • SciPy - algorithms and convenience functions that manipulate NumPy objects
  • Matplotlib - extensive 2D and 3D plotting for visualizing data objects
  • Pandas - provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive
  • SymPy - symbolic mathematics, and
  • sklearn (scikit-learn) - a machine-learning library; not part of the SciPy.org core collection, but handy for its example data-sets and learning examples.

Installing libraries

To make use of these libraries we need to install them onto our system. We have a few ways to do it. We can use brew once more for some, or we can use pip3. Let’s proceed by installing the libraries with pip3:

/usr/local/Cellar/python3/3.6.3/bin/pip3 install numpy scipy matplotlib pandas sympy scikit-learn
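The pip3 path above is tied to one particular Homebrew Python version (3.6.3), so it may not exist on your machine. One way around that, sketched here on the assumption that you run it with the same python3 you plan to work in, is to let the interpreter report its own location and invoke pip as a module. The helper name pip_install_command is made up for this example.

```python
import sys

def pip_install_command(packages):
    """Build a pip command bound to the interpreter running this script."""
    # "python -m pip" runs the pip that belongs to this exact interpreter,
    # so no version-specific path needs to be typed out.
    return [sys.executable, "-m", "pip", "install", *packages]

# scikit-learn is the package name on PyPI; it is imported as "sklearn".
packages = ["numpy", "scipy", "matplotlib", "pandas", "sympy", "scikit-learn"]
command = pip_install_command(packages)
print(" ".join(command))
# To actually run the install, pass the list to
# subprocess.run(command, check=True).
```

The script only prints the command, so it is safe to run first and eyeball before installing anything.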

You can verify that the installation was successful by importing each one in your Python interactive window.
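If you'd rather check them all in one go, here is a minimal sketch that tries each import and reports the version it finds. The helper name check_imports is my own; the only thing it assumes about the libraries is that most of them expose a __version__ attribute, and it falls back to "unknown" when one doesn't.

```python
import importlib

def check_imports(module_names):
    """Map each module name to its version string, "unknown" if it has
    no __version__ attribute, or None if the import fails."""
    results = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None
        else:
            results[name] = getattr(module, "__version__", "unknown")
    return results

# Import names for the libraries from the install step.
libraries = ["numpy", "scipy", "matplotlib", "pandas", "sympy", "sklearn"]
for name, version in check_imports(libraries).items():
    status = "MISSING" if version is None else version
    print(f"{name:12} {status}")
```

Anything that shows up as MISSING didn't install into the Python you're currently running, which usually means pip3 and python3 are pointing at different installations.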

Next Steps

Now that we have our programming language installed and libraries available to import, we’re ready to begin our journey of learning. We’ll start by taking each library in turn and experimenting with it in our notebook. You ready to start? Let’s go!
