MARGO

Running Jupyter Notebook on Google Cloud for a Kaggle challenge

By Hao Chen, Data Scientist

20/04/2018

When you start a Kaggle challenge, you usually need a computer that can hold the whole dataset in memory and accelerate training with a GPU. Rather than purchasing a new computer, I’d like to do it for free with the $300 credit offered by Google Cloud Platform (GCP).

Step 1: Create a free account in Google Cloud

For this step, you can create a new Google Account or sign in with your existing one at https://cloud.google.com/. Then you will have to enter your payment information and verify your account.

Step 2: Create a new project on GCP

Click on the three dots shown in the image below and then click on the + sign to create a new project.

[Image: Create a new project on GCP]

Step 3: Create a VM instance

Click on the three lines in the upper left corner, then, under the Compute section, click on ‘Compute Engine’.

[Image: Create a VM instance]

Now click on ‘Create new instance’. Name your instance and select a zone close to you; in my case, I chose ‘europe-west1-b’. Then choose your ‘machine type’ (I chose 8 vCPUs and 52 GB of memory because I had a huge dataset). GCP will give you an estimated price according to your configuration.

You can also customize your virtual machine if you need GPUs. Note that GPUs are available only in certain zones, so make sure that you have chosen one of the zones below:

  • us-west1-b
  • us-central1-c
  • us-central1-f
  • us-east1-c
  • europe-west1-b
  • europe-west1-d
  • asia-east1-a
  • asia-east1-c
  • europe-west4-a

[Image: Machine type]

Under the Boot Disk option, select ‘Ubuntu 16.04 LTS’ as your OS and set the disk size to what you need for your datasets; for example, I needed 50 GB.

Under the firewall options, tick both ‘http’ and ‘https’ (very important). Then choose the disk tab and untick ‘Delete boot disk when instance is deleted’.

Now click on ‘Create’ and your instance is ready!

IMPORTANT : DON’T FORGET TO STOP YOUR GPU INSTANCE AFTER YOU ARE DONE BY CLICKING ON THE THREE DOTS ON THE IMAGE ABOVE AND SELECTING STOP. OTHERWISE GCP WILL KEEP CHARGING YOU ON AN HOURLY BASIS.
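
Once the Google Cloud SDK from Step 5 is installed, the instance can also be stopped from a terminal rather than from the console. The following is a minimal Python sketch, not part of the original walkthrough: it only builds the argument list for gcloud compute instances stop and prints it. The project, zone and instance names are hypothetical placeholders.

```python
import shlex
from typing import List


def stop_instance_command(project: str, zone: str, instance: str) -> List[str]:
    """Build the argument list for `gcloud compute instances stop`."""
    return [
        "gcloud", "compute", "instances", "stop", instance,
        "--zone", zone,
        "--project", project,
    ]


# Hypothetical names; replace them with your own project, zone and instance.
command = stop_instance_command("my-project", "europe-west1-b", "my-instance")

# Print the command so it can be reviewed before it is run.
print(" ".join(shlex.quote(part) for part in command))

# To actually stop the instance (requires an initialized Google Cloud SDK):
#   import subprocess
#   subprocess.run(command, check=True)
```

Printing the command first makes it easy to check the zone and instance name before anything is executed.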

Step 4: Make the external IP address static

By default, the external IP address is dynamic, and we need to make it static to make our life easier. Click on the three horizontal lines at the top left, then, under Networking, click on VPC network and then External IP addresses.

Change the type from Ephemeral to Static.

Now click on the ‘Firewall rules’ setting under VPC network. Click on ‘Create Firewall Rules’ and refer to the image below:

Under protocols and ports you can choose any port. I have chosen tcp:5000 as my port number. Now click on the save button.
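
For readers who prefer the command line, a similar rule can be created with the Google Cloud SDK installed in Step 5. The sketch below is an illustration under stated assumptions, not the exact settings from the console form: the rule name is hypothetical, and the source range 0.0.0.0/0 (any address) is an assumed value that you may want to narrow to your own IP range. It builds the argument list for gcloud compute firewall-rules create without running it.

```python
from typing import List


def firewall_rule_command(
    rule_name: str, port: int, source_range: str = "0.0.0.0/0"
) -> List[str]:
    """Build the argument list for `gcloud compute firewall-rules create`."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be between 1 and 65535, got {port}")
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--allow=tcp:{port}",               # protocol and port to open
        f"--source-ranges={source_range}",   # assumed: any address may connect
    ]


# "jupyter-notebook" is a hypothetical rule name; 5000 is the port chosen above.
print(" ".join(firewall_rule_command("jupyter-notebook", 5000)))
```

The printed command can be pasted into a terminal, or the list can be passed to subprocess.run once the SDK is initialized.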

Step 5: Install Google Cloud SDK

Depending on your OS, refer to the corresponding document at https://cloud.google.com/sdk/docs/quickstarts

Then run gcloud init and follow the steps on the website to initialize your Google Cloud SDK.

Step 6: Install Jupyter Notebook and other packages

Open a terminal, connect to your VM instance:

gcloud compute --project <project name> ssh --zone <zone name> <instance name>

Then install Anaconda3 on your VM:

wget http://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh

and follow the on-screen instructions. The defaults usually work fine, but answer yes to the last question about prepending the install location to PATH:

Do you wish the installer to prepend the
Anaconda3 install location to PATH
in your /home/<username>/.bashrc ?
[yes|no][no] >>> yes

To make use of Anaconda right away, source your bashrc:

source ~/.bashrc

Now install any other packages you need, for example:

conda install -c conda-forge lightgbm
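
To confirm that the packages are visible to the Anaconda Python interpreter, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and lightgbm is simply the package installed above.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every package in the list can be imported.
print(missing_packages(["lightgbm"]))
```

Run it with the python that Anaconda put on your PATH; if a package is listed, install it with conda before launching the notebook.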

Step 7: Set up the VM server

Open an SSH session to your VM. Check if you have a Jupyter configuration file:

ls ~/.jupyter/jupyter_notebook_config.py

If it doesn’t exist, create one:

jupyter notebook --generate-config

We’re going to add a few lines to your Jupyter configuration file. The file is plain text, so you can do this with your favorite editor (e.g., vim, emacs). Make sure you replace the port number with the one you allowed through the firewall in Step 4.

c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = <Port Number>
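
If you would rather not edit the file by hand, the same lines can be appended from Python. The sketch below is one possible way to do it, assuming the configuration file lives at the default path shown above; the function name append_jupyter_settings is hypothetical.

```python
from pathlib import Path


def append_jupyter_settings(config_path: Path, port: int) -> str:
    """Append the settings from this step to a Jupyter config file.

    Returns the text that was appended.
    """
    settings = (
        "\n"
        "c = get_config()\n"
        "c.NotebookApp.ip = '*'\n"
        "c.NotebookApp.open_browser = False\n"
        f"c.NotebookApp.port = {port}\n"
    )
    # Create ~/.jupyter if it does not exist yet, then append to the file.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("a", encoding="utf-8") as handle:
        handle.write(settings)
    return settings
```

For example, append_jupyter_settings(Path.home() / ".jupyter" / "jupyter_notebook_config.py", 5000) would add the settings for port 5000. Running it twice appends the lines twice, so call it only once.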

Step 8: Launching Jupyter Notebook

To run Jupyter Notebook, type the following command in your SSH session:

jupyter notebook --ip=0.0.0.0 --port=<port-number> --no-browser &

Once you run the command, it prints a URL that contains a login token.
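
The token is the value of the token query parameter in the URL that Jupyter prints. As a small illustration, the sketch below pulls it out with the standard library; the sample URL and its token value are made up for the example and are not real output.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(url: str) -> str:
    """Return the value of the `token` query parameter of a Jupyter URL."""
    query = parse_qs(urlparse(url).query)
    tokens = query.get("token", [])
    if not tokens:
        raise ValueError("no token parameter found in URL")
    return tokens[0]


# Hypothetical URL in the shape Jupyter prints at startup.
sample_url = "http://0.0.0.0:5000/?token=abc123"
print(extract_token(sample_url))
```

Copy the resulting value; you will need it when the notebook asks for a token in the browser.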

Now, to open your Jupyter notebook, type the following in your browser:

http://<External Static IP Address>:<Port Number>

where the external IP address is the one we made static and the port number is the one we allowed through the firewall.
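
If the page does not load, it can help to check from your own machine whether the port is reachable at all. The following is a rough troubleshooting sketch using only the standard library; the helper names are hypothetical, and a closed port may point to the firewall rule from Step 4 or to the notebook server not running.

```python
import socket


def notebook_url(external_ip: str, port: int) -> str:
    """Build the address to type in the browser."""
    return f"http://{external_ip}:{port}"


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, is_port_open("<External Static IP Address>", 5000) with your own static IP filled in should return True once the notebook server is running and the firewall rule is in place.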

Enter the token you got in the previous step.

You now have a Jupyter notebook running on GCP.
