Monday, October 19, 2015

Spark/IPython on CentOS 6.5

1. Introduction
IPython is a great tool for demonstrating and sharing your data analysis with collaborators. Because Spark provides a Python API, IPython is a convenient way to share a Spark analysis with an explanation attached to each step.

However, if you are on CentOS, you are in for some trouble. CentOS 6 relies on Python 2.6 for yum, while most modern Python tools, including recent versions of IPython, require Python 2.7 or later. The solution is to keep two versions of Python on each system: Python 2.6 for root, because yum typically runs with root privileges, and Python 2.7 for other users.

In addition, PySpark (up to Spark 1.1.0) depends on numpy, which in turn depends on the LAPACK and BLAS libraries. If you don’t install them, you will get seemingly random errors when your Python program executes across the cluster.

2. Procedures
2.1. Install Python 2.7, with pip2.7, on each node

sudo yum groupinstall "Development tools"
sudo yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel
wget --no-check-certificate https://www.python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
tar xf Python-2.7.6.tar.xz
cd Python-2.7.6
./configure --prefix=/usr/local
make 
sudo make altinstall

After running the commands above, your newly installed Python 2.7.6 interpreter will be available as /usr/local/bin/python2.7, and the system Python 2.6.6 will still be available as /usr/bin/python and /usr/bin/python2.6. (make altinstall installs only the versioned python2.7 binary, so the system python is left untouched.)

Check with:
 ls -ltr /usr/bin/python*

lrwxrwxrwx 1 root root    6 Nov 16  2002 /usr/bin/python2 -> python
-rwxr-xr-x 1 root root 1418 Jul 10  2013 /usr/bin/python2.6-config
-rwxr-xr-x 2 root root 4864 Jul 10  2013 /usr/bin/python2.6
-rwxr-xr-x 2 root root 4864 Jul 10  2013 /usr/bin/python
lrwxrwxrwx 1 root root   16 Oct 24 15:39 /usr/bin/python-config -> python2.6-config

ls -ltr /usr/local/bin/python*
-rwxr-xr-x 1 root root 6214533 Mar 19 22:46 /usr/local/bin/python2.7
-rwxr-xr-x 1 root root    1674 Mar 19 22:46 /usr/local/bin/python2.7-config
 
If python does not resolve to Python 2.7 for regular users, you might need to create a symbolic link in /usr/local/bin. First check what is already there:
cd /usr/local/bin
ls -ltr python*
WARNING: do not create the link before checking the $PATH for root. If /usr/local/bin comes before /usr/bin, as in the example below, root will find the Python 2.7 link first:
 echo $PATH
/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
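
The check described in the warning can also be scripted. The following is a minimal sketch, not part of the original procedure: local_bin_shadows_system is a hypothetical helper name, and it only looks at the order of the two directories in the PATH string it is given.

```shell
# Hypothetical helper (name is illustrative): succeeds when /usr/local/bin is
# searched before /usr/bin in the PATH string passed as the first argument.
# The body runs in a subshell, so the IFS and globbing changes do not leak.
local_bin_shadows_system() (
    IFS=:
    set -f                      # do not glob-expand PATH entries
    for dir in $1; do
        case "$dir" in
            /usr/local/bin) return 0 ;;  # a python link here would be found first
            /usr/bin)       return 1 ;;  # the system python is found first
        esac
    done
    return 1                    # /usr/local/bin never appears in the PATH
)
```

For example, local_bin_shadows_system "$(sudo sh -c 'echo $PATH')" && echo "root sees /usr/local/bin first" applies it to the PATH that commands run through sudo see.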
 
To add the link:
sudo ln -s /usr/local/bin/python2.7 /usr/local/bin/python
After adding the link, run “which python” both as the regular user and as root. If root resolves to /usr/local/bin/python, remove the link you just added and find another approach.

Final check:

sudo which python
sudo python --version


which python
python --version
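
The final check can be scripted as well. This is a minimal sketch; python_minor_version is a hypothetical helper name. One detail to be aware of: Python 2 prints its --version banner to standard error rather than standard output, so the two streams are merged before parsing.

```shell
# Hypothetical helper: print the major.minor version of the interpreter
# named by the first argument, e.g. "2.7" for a "Python 2.7.6" banner.
python_minor_version() {
    # Python 2 writes the banner to stderr, hence the 2>&1.
    "$1" --version 2>&1 | awk '{ split($2, v, "."); print v[1] "." v[2] }'
}
```

With the setup described above, python_minor_version /usr/bin/python should print 2.6 and python_minor_version /usr/local/bin/python2.7 should print 2.7.
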
2.2. Install dependencies for IPython/PySpark
Assuming Spark is already installed across the cluster, we will install pip and the numpy dependencies, followed by a few useful Python data analysis tools:
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
sudo /usr/local/bin/python2.7 ez_setup.py
sudo /usr/local/bin/easy_install-2.7 pip

sudo yum install lapack lapack-devel blas blas-devel libpng-devel freetype*

sudo /usr/local/bin/pip2.7 install numpy
sudo /usr/local/bin/pip2.7 install scipy
sudo /usr/local/bin/pip2.7 install matplotlib
sudo /usr/local/bin/pip2.7 install pandas
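
Because missing libraries tend to show up only later, as errors on the cluster, it may be worth confirming on each node that the packages import cleanly. A minimal sketch follows; check_modules is a hypothetical helper name, and it only tests whether each import succeeds, not whether the library works correctly.

```shell
# Hypothetical helper: try to import each named module with the given
# interpreter and report the result. Returns non-zero if any import fails.
# The body runs in a subshell, so its variables stay local to the call.
check_modules() (
    interpreter=$1
    shift
    status=0
    for module in "$@"; do
        if "$interpreter" -c "import $module" >/dev/null 2>&1; then
            echo "$module: ok"
        else
            echo "$module: import failed"
            status=1
        fi
    done
    return "$status"
)
```

For the packages installed above, the call would be: check_modules /usr/local/bin/python2.7 numpy scipy matplotlib pandas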
 
2.3. Finally, install IPython

sudo /usr/local/bin/pip2.7 install "ipython[notebook]"
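
Once everything is installed, PySpark still has to be told to use the new interpreter instead of the system Python 2.6. The following is a sketch under assumptions, not a step from the procedure above: it assumes a Spark 1.x release, whose launch scripts read the PYSPARK_PYTHON and IPYTHON_OPTS environment variables (later releases use different variables for the driver), and it assumes SPARK_HOME points at your Spark installation.

```shell
# Interpreter PySpark should use (the path installed in section 2.1).
export PYSPARK_PYTHON=/usr/local/bin/python2.7
# Spark 1.x launch scripts start the driver inside an IPython notebook
# when this variable is set (assumption: a Spark 1.x release).
export IPYTHON_OPTS="notebook"
# Then launch PySpark; SPARK_HOME is an assumed location, adjust as needed:
#   "$SPARK_HOME"/bin/pyspark
```

Putting the PYSPARK_PYTHON line in the environment of every node keeps the workers on the same Python version as the driver.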