Data Science Software Stack: Linux

In data science, the “jack-of-all-trades” motif comes up a lot, and it is especially true of operating systems. The major three (Microsoft Windows, macOS, and Linux) all have their strengths and weaknesses. In this section, I’ll cover installing a basic data science software stack on Linux, specifically Ubuntu’s latest LTS release, 16.04.

Installing this software stack is quite simple. Open your terminal in Ubuntu and run the lines of code in order from top to bottom. I have separated the code into individual lines for error management. Software is always changing and installation processes can change overnight.

Base

Libraries:

The base Ubuntu install comes with very few libraries beyond what Ubuntu itself needs to run. We will have to install our own dev libraries for our R and Python packages to work.

sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install libtiff5-dev
sudo apt-get install libgmp-dev
sudo apt-get install libglu1-mesa-dev
sudo apt-get install libudunits2-dev
sudo apt-get install libgdal1-dev
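
Since the point of running each install separately is to see exactly which step fails, the same idea can be sketched as a small shell function. This is only an illustration: `install_each` and the `INSTALL_CMD` variable are hypothetical names, and the `-y` flag (which skips the confirmation prompt) is an addition that the commands above do not use.

```shell
# Sketch: install packages one at a time and report which ones failed.
# INSTALL_CMD is a hypothetical override so the loop can be tried without sudo.
INSTALL_CMD=${INSTALL_CMD:-"sudo apt-get install -y"}

install_each() {
    failed=""
    for pkg in "$@"; do
        # $INSTALL_CMD is left unquoted on purpose so it splits into command and flags
        if ! $INSTALL_CMD "$pkg"; then
            failed="$failed $pkg"
        fi
    done
    if [ -n "$failed" ]; then
        echo "Failed to install:$failed"
        return 1
    fi
    echo "All packages installed."
}
```

Called as `install_each libcurl4-openssl-dev libssl-dev libxml2-dev`, it tries every package and lists any failures at the end instead of stopping at the first error.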

Atom:

Atom will serve as our text editor. Alternatively, you can use the base Ubuntu software gedit. I quite enjoy the flexibility and usability that Atom provides.

sudo add-apt-repository ppa:webupd8team/atom
sudo apt-get update
sudo apt-get install atom

Git:

Git is how we will version control our work. Git should already be installed through base Ubuntu, but it does not hurt to make sure.

sudo apt-get update
sudo apt-get install git
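
To "make sure" before installing, one option is to check whether the command is already on the PATH. The sketch below is a minimal illustration; `is_installed` is a hypothetical helper name, not something Ubuntu or Git provides.

```shell
# Sketch: report whether a command is already available on the PATH.
# is_installed is a hypothetical helper name.
is_installed() {
    command -v "$1" >/dev/null 2>&1
}

if is_installed git; then
    # Print the installed version
    git --version
else
    echo "git not found; run: sudo apt-get install git"
fi
```

The same check should work for the other tools in this post, for example `is_installed pandoc` or `is_installed make`.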

LaTeX:

LaTeX is our mathematical/scientific typesetting language. We can write formulas with ease anywhere in Ubuntu and render them to PDF.

sudo apt-get update
sudo apt-get install texlive
sudo apt-get install texlive-latex-extra
sudo apt-get install texlive-xetex

Pandoc:

Pandoc is specifically used to render our PDF documents. This workflow program is especially useful through R markdown documents.

sudo apt-get update
sudo apt-get install pandoc
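
As a rough check that Pandoc and LaTeX work together, the sketch below writes a tiny Markdown file containing a formula and asks Pandoc to render it to PDF. The file names `example.md` and `example.pdf` are hypothetical, and the conversion step only runs if `pandoc` is found.

```shell
# Sketch: write a small Markdown file with a LaTeX formula and render it to PDF.
# example.md and example.pdf are hypothetical file names.
workdir=$(mktemp -d)
cat > "$workdir/example.md" <<'EOF'
# Sample report

The sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
EOF

if command -v pandoc >/dev/null 2>&1; then
    # For PDF output, pandoc hands the document to a LaTeX engine (pdflatex by default)
    pandoc "$workdir/example.md" -o "$workdir/example.pdf" \
        && echo "Wrote $workdir/example.pdf" \
        || echo "pandoc could not build the PDF; check the LaTeX install"
else
    echo "pandoc not found; install it first"
fi
```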

Coding

Python:

Python is our bread-and-butter coding language for machine learning. For ease of use, we will be using the Anaconda distribution for Python, providing support for Python 2.7, Python 3.4, and the conda package installer.

You will have to download the bash script from Anaconda and run that script through your terminal. Don’t open the script in a text editor: the file is close to 500MB and will be extremely slow to load.

bash Anaconda3-5.0.1-Linux-x86_64.sh

Once you go through all of the installation instructions, close and open your terminal to finalize the installation. We will then update the conda installer and the distribution to the most current version.

conda update conda
conda update anaconda

We will also be installing our Python integrated development environment (IDE): PyCharm.

sudo snap install pycharm-community --classic

R:

R is the other tool in our toolkit for machine learning. I find R extremely useful for data wrangling and data visualization. All of our package installations will live in this R installation.

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get install r-base

We will also install our R integrated development environment (IDE): RStudio.

The base Anaconda install will also contain an R installation. We will have to add the line below to our .bashrc document, above the Anaconda line at the bottom, to make sure that RStudio uses our R installation instead of Anaconda’s.

atom .bashrc
export RSTUDIO_WHICH_R=/usr/bin/R
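
If you would rather not edit the file by hand, the placement described above can be sketched with `grep` and `awk`. This is an illustration under one assumption: that the line the Anaconda installer added contains the word "anaconda". The function name `add_rstudio_r` is hypothetical.

```shell
# Sketch: insert the RSTUDIO_WHICH_R export above the first line mentioning Anaconda.
# Assumption: the installer's lines in .bashrc contain the word "anaconda".
add_rstudio_r() {
    bashrc="$1"
    line='export RSTUDIO_WHICH_R=/usr/bin/R'
    # Skip the edit if the exact line is already present, so reruns are safe
    if grep -qxF "$line" "$bashrc"; then
        return 0
    fi
    # Keep a backup copy before rewriting the file
    cp "$bashrc" "$bashrc.bak"
    awk -v line="$line" '
        !done && tolower($0) ~ /anaconda/ { print line; done = 1 }
        { print }
        END { if (!done) print line }
    ' "$bashrc.bak" > "$bashrc"
}
```

Running `add_rstudio_r ~/.bashrc` leaves a backup at `~/.bashrc.bak`. If no Anaconda line is found, the export is appended to the end of the file instead.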

All of the following packages should be installed through RStudio to make sure they end up in the right R installation.

install.packages(c("repr", "IRdisplay", "crayon", "pbdZMQ", "devtools", "stringr"))
install.packages(c("knitr", "ezknitr", "reprex"))
install.packages(c("tidyverse", "widyr"))
install.packages(c("forcats", "lubridate", "countrycode", "maps", "USAboundaries", "units", "Hmisc"))
install.packages(c("coin", "mvtnorm", "boot", "MASS"))
install.packages(c("moments", "cowplot", "lattice", "RColorBrewer", "viridis", "ggnetwork", "GGally", "factoextra", "rgl"))
install.packages(c("mclust", "flexclust", "ClusterR", "cluster", "matrixStats", "clusterSim"))
install.packages(c("coda", "rjags", "runjags"))
install.packages(c("shiny", "shinydashboard", "rsconnect"))

These next commands make R available as a kernel in Jupyter notebooks. Run all three in the terminal rather than in RStudio: the first starts an R session, and the other two are entered at its prompt.

R
devtools::install_github("IRkernel/IRkernel")
IRkernel::installspec()

Automation

Make:

Make will allow us to run a project from start to finish with one command: creating repositories, running scripts, writing reports, and removing extraneous files. Make should already be installed through base Ubuntu, but it does not hurt to make sure.

sudo apt-get update
sudo apt-get install build-essential
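
To give a feel for the "one command" idea, here is a minimal sketch that writes a two-target Makefile and asks Make for a dry run. The file names `report.md` and `report.pdf` are hypothetical, and `printf` is used so that the tab characters Make requires before each recipe line are explicit.

```shell
# Sketch: write a small Makefile and preview what make would run.
# report.md and report.pdf are hypothetical file names.
makedir=$(mktemp -d)
# Rule: rebuild report.pdf whenever report.md changes (recipe lines must start with a tab)
printf 'report.pdf: report.md\n\tpandoc report.md -o report.pdf\n\n' > "$makedir/Makefile"
# Rule: remove generated files
printf '.PHONY: clean\nclean:\n\trm -f report.pdf\n' >> "$makedir/Makefile"
touch "$makedir/report.md"

if command -v make >/dev/null 2>&1; then
    # -n prints the commands make would run without executing them
    make -C "$makedir" -n
else
    echo "make not found; install build-essential first"
fi
```

In a real project, `make` would build the report and `make clean` would remove it.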

Docker:

Docker is how we will automate our projects in a controlled environment. Another user will only have to run our Docker container instead of needing a specific operating system with specific programs and specific packages/libraries.

sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo 'deb https://apt.dockerproject.org/repo ubuntu-xenial main' | sudo tee -a /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install linux-image-extra-$(uname -r)
sudo apt-get install docker-engine
sudo docker run hello-world
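
To make the "controlled environment" idea concrete, the sketch below writes a minimal Dockerfile for a project that runs a single R script. It is an illustration only: `analysis.R`, the `/project` path, and the `my-analysis` image name are hypothetical, and a real project would likely install more packages.

```shell
# Sketch: write a minimal Dockerfile for a project that runs one R script.
# analysis.R, /project and my-analysis are hypothetical names.
dockerdir=$(mktemp -d)
cat > "$dockerdir/Dockerfile" <<'EOF'
# Start from the same Ubuntu release used in this post
FROM ubuntu:16.04
# Install R inside the image
RUN apt-get update && apt-get install -y r-base
# Copy the analysis script into the image
COPY analysis.R /project/analysis.R
# Run the script when the container starts
CMD ["Rscript", "/project/analysis.R"]
EOF
echo "Wrote $dockerdir/Dockerfile"

# To build and run (needs Docker and an analysis.R next to the Dockerfile):
#   sudo docker build -t my-analysis "$dockerdir"
#   sudo docker run my-analysis
```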

Misc.

The remaining programs are purely personal preference. I use Slack, Spotify, and VLC for my communications, music, and media respectively.

Slack:

Slack installation

Spotify:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 0DF731E45CE24F27EEEB1450EFDC8610341D9410
echo deb http://repository.spotify.com stable non-free | sudo tee /etc/apt/sources.list.d/spotify.list
sudo apt-get update
sudo apt-get install spotify-client

VLC:

sudo add-apt-repository ppa:videolan/stable-daily
sudo apt-get update
sudo apt-get install vlc

Additional Customization

Terminal:

I personally like to replace the “~” directory shortcut with the full path of the current directory in my terminal prompt. Replace the “\W” or “\w” in the PS1 and/or PS2 variable in .bashrc with “$(pwd)\n”. See the example below for my own PS1 line.

atom .bashrc
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]$(pwd)\n\[\033[00m\]\$ '
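
The same substitution can be previewed with `sed` before touching the real file. This is a sketch, not a tested recipe for every .bashrc: `full_path_prompt` is a hypothetical helper that prints the modified contents to standard output and leaves the original file unchanged.

```shell
# Sketch: print a copy of a bashrc file with the first \w or \W on each
# PS1 line replaced by $(pwd)\n. The input file itself is not modified.
# full_path_prompt is a hypothetical helper name.
full_path_prompt() {
    # \\[wW] matches a literal backslash followed by w or W;
    # in the replacement, \\n produces a literal backslash followed by n
    sed '/PS1=/s/\\[wW]/$(pwd)\\n/' "$1"
}
```

For example, `full_path_prompt ~/.bashrc > ~/.bashrc.new` writes the result to a new file that can be reviewed before replacing the original.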

To install the VirtualBox virtual machine and the Ubuntu Linux distribution, you can find the instructions on my Operating System blog post.

Find this article helpful or interesting? Let me know! Send me an email at indiana@nikel.io or message me on either Twitter or LinkedIn.
