Exploring Google Cloud Platform

From IT infrastructure to Data Science

By Joshua Mifsud, Data Scientist at Margo

06/03/2018

Google offers about 50 different products in its Cloud solution, ranging from storage and computing infrastructure to Machine Learning, and including tools for analyzing and transforming massive data sets. Most of these solutions are quick to set up (around 10 minutes or less) and cheap compared to standard on-premises software.

You can find turnkey solutions following the SaaS (Software as a Service) model, such as Machine Learning APIs (facial recognition or emotion detection), and low-level services following the IaaS (Infrastructure as a Service) model, such as storage or Cloud Computing. Products halfway between infrastructure and service are also available and let you deploy your own applications: this is the PaaS (Platform as a Service) model.

Figure: three kinds of services, from infrastructure service (IaaS) to turnkey service (SaaS)

In the era of serverless

Most Google Cloud products run on a “serverless” architecture. The servers still exist somewhere in Google Cloud, but they are invisible to the user, who no longer needs to worry about sizing the infrastructure. In other words, the system automatically scales up or down depending on the user’s storage and computing needs.

Google App Engine lets you deploy your own application on the PaaS model in a secure way, with easy version management. It is also possible to run A/B tests by deploying two different versions of the same application. This solution is able to scale up to 1 billion users.
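To make the A/B testing idea concrete, here is a minimal sketch of how traffic could be split between two deployed versions. It is plain Python and does not use App Engine itself: the function name assign_version, the share_b parameter and the user identifiers are all hypothetical, and the hashing scheme is only one common way to keep each user on the same version.

```python
import hashlib


def assign_version(user_id: str, share_b: float = 0.5) -> str:
    """Deterministically map a user to version 'A' or 'B'.

    share_b is the fraction of users (0.0 to 1.0) routed to version B.
    This is an illustrative scheme, not App Engine's own mechanism.
    """
    if not 0.0 <= share_b <= 1.0:
        raise ValueError("share_b must be between 0.0 and 1.0")
    # Hash the user id so the same user always lands on the same version.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Reduce the hash to one of 10,000 buckets (0..9999).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "B" if bucket < share_b * 10_000 else "A"


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_version(f"user-{i}", share_b=0.2)] += 1
    print(counts)  # roughly 80% of users on A and 20% on B
```

Because the assignment depends only on the user identifier, a returning user keeps seeing the same version, which is what makes the comparison between the two versions meaningful.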

Cloud Storage

Google Cloud offers numerous storage solutions adapted to different needs. Only a few minutes are necessary to set up a cloud storage instance.

The most basic one is flat file storage, in which files are immutable once stored in the Cloud and you pay according to the volume of files held in the instance (called a bucket). You choose the kind of bucket according to how you use it: fast, regular access or rare access.

Some products, such as Bigtable or Datastore, let you store data as documents with high throughput and very low latency (5 ms to access 1 TB). They are ideal for real-time processing.

Cloud Computing

Starting a virtual machine is very easy (it takes less than one minute), through either the Google Cloud Platform Console or the gcloud command-line tool. You choose the number of cores (up to 64!), the amount of RAM and the type of disk (HDD or SSD). You can also network several machines together to balance the computing load across your resources; this setup takes around 10 minutes. Computing power is billed in proportion to the time the machine is running (a machine that is turned off costs $0).
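As a quick illustration of billing in proportion to running time, the sketch below computes the cost of a machine from the number of seconds it was on. The hourly rate used in the example is a made-up figure, not an actual Google Cloud price, and compute_cost is a hypothetical helper.

```python
def compute_cost(seconds_running: int, hourly_rate_usd: float) -> float:
    """Cost of a machine billed in proportion to the time it was running."""
    if seconds_running < 0 or hourly_rate_usd < 0:
        raise ValueError("seconds_running and hourly_rate_usd must be non-negative")
    # Convert seconds to hours, then apply the hourly rate.
    return seconds_running / 3600 * hourly_rate_usd


if __name__ == "__main__":
    rate = 0.10  # hypothetical price in USD per hour, for illustration only
    print(f"{compute_cost(90 * 60, rate):.2f}")  # 90 minutes of use
    print(f"{compute_cost(0, rate):.2f}")        # machine turned off
```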

Big Data

Several tools allow you to process very large amounts of data. With Dataflow, you can run ETL jobs in batch or streaming mode. Its provisioning is automated, and its API lets you, for example, compute an average over a sliding window in a single line of code.
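For readers unfamiliar with the term, the following sketch shows what an average over a sliding window computes. It is written in plain Python and is not the Dataflow API: the function sliding_window_averages and its size and period parameters are hypothetical names, and windows are assumed to start at timestamp 0.

```python
from typing import Iterable, List, Tuple


def sliding_window_averages(
    events: Iterable[Tuple[float, float]],
    size: float,
    period: float,
) -> List[Tuple[float, float]]:
    """Average values over windows of `size` seconds starting every `period` seconds.

    events: (timestamp_in_seconds, value) pairs with non-negative timestamps.
    Returns (window_start, average) pairs for every non-empty window.
    """
    if size <= 0 or period <= 0:
        raise ValueError("size and period must be positive")
    ordered = sorted(events)
    if not ordered:
        return []
    last_ts = ordered[-1][0]
    results = []
    k = 0
    while k * period <= last_ts:
        start = k * period
        # A window covers the half-open interval [start, start + size).
        values = [v for ts, v in ordered if start <= ts < start + size]
        if values:
            results.append((start, sum(values) / len(values)))
        k += 1
    return results


if __name__ == "__main__":
    readings = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 40.0)]
    # 60-second windows that start every 30 seconds, so windows overlap.
    print(sliding_window_averages(readings, size=60, period=30))
    # [(0, 15.0), (30, 25.0), (60, 35.0), (90, 40.0)]
```

Each reading contributes to every window that covers its timestamp, which is the difference between a sliding window and a simple fixed-size bucket.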

BigQuery offers a relational (but not transactional) system in serverless mode, letting you analyze relational data in SQL. With its computing power, you can process 1 billion rows in two seconds for a count query with a group by. This kind of performance is hard to match without an infrastructure of comparable computing power.
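The kind of query mentioned above, a count with a group by, looks like the sketch below. To keep the example runnable anywhere, it uses Python's built-in sqlite3 module as a local stand-in rather than BigQuery, so it says nothing about BigQuery's performance; the clicks table, its columns and the sample rows are invented for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple


def count_by_country(rows: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Load (country, user_id) rows into an in-memory table and count rows per country."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE clicks (country TEXT, user_id INTEGER)")
        conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
        # The count / group by query itself is standard SQL.
        query = (
            "SELECT country, COUNT(*) AS n "
            "FROM clicks GROUP BY country "
            "ORDER BY n DESC, country"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("FR", 1), ("FR", 2), ("US", 3), ("MT", 4), ("FR", 5)]
    print(count_by_country(sample))  # [('FR', 3), ('MT', 1), ('US', 1)]
```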

Finally, it’s possible to start a Hadoop cluster (with Pig/Hive/Spark components) in less than 2 minutes through the Cloud Console or the command-line tool. In your cluster, you can use preemptible machines, which are cheaper but can be reclaimed by Google at any time. Moreover, you can add or remove a node even while a job is running.

Machine Learning/Deep Learning

In terms of Machine Learning, Google offers APIs based on TensorFlow, allowing the user to implement neural networks. An NLP (Natural Language Processing) API is able to identify the entities in a text and their characteristics (organizations, people, celebrities) and to extract a sentiment and its intensity.

With the Google Vision API, the user is able to detect faces in an image, along with the related emotions at a given confidence level.

It’s also possible to apply visual recognition to videos and list the objects visible at different times in the video.

It’s interesting to note that the infrastructure behind these APIs doesn’t rely on conventional processors such as CPUs (Central Processing Units) or GPUs (Graphics Processing Units). Google designed a new kind of processor, the TPU (Tensor Processing Unit), optimized for the matrix calculations that neural networks require, which explains the negligible computation time. TPUs are also available on Cloud Compute.
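To see why matrix calculation matters so much for neural networks, consider a single fully connected layer: its output is a matrix product followed by a simple element-wise function. The sketch below spells this out in plain Python with made-up weights; the helper names matmul and dense_layer are hypothetical, and the code only illustrates the arithmetic, not how a TPU executes it.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def dense_layer(inputs: Matrix, weights: Matrix, biases: List[float]) -> Matrix:
    """One fully connected layer: ReLU(inputs x weights + biases)."""
    z = matmul(inputs, weights)
    # ReLU keeps positive values and replaces negative ones with 0.
    return [[max(0.0, v + bias) for v, bias in zip(row, biases)] for row in z]


if __name__ == "__main__":
    x = [[1.0, 2.0]]                # one input example with two features
    w = [[1.0, -1.0], [0.5, 1.0]]   # illustrative weights: 2 inputs, 2 neurons
    b = [0.0, -2.0]
    print(dense_layer(x, w, b))     # [[2.0, 0.0]]
```

A real network chains many such layers over large batches, so almost all of the work is multiply-and-add operations of this kind, which is the workload a matrix-oriented processor targets.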

For Data Scientists, a Cloud Datalab instance lets you handle data and build Machine Learning or Deep Learning models (with TensorFlow) through a Python notebook that is very close to Jupyter and comes equipped with Data Science libraries (numpy, pandas, scikit-learn, nltk for text mining, matplotlib and seaborn for data visualization, and tensorflow). However, this component still shows some instability (especially SSH connectivity issues).

One last interesting possibility, halfway between a turnkey API and a “from scratch” model, is Cloud AutoML. With this feature, you can train Google’s models on your own data so that they fit the context of the problem being addressed.

Conclusion

Google Cloud Platform allows the user to quickly create applications and host them in serverless environments, with billing based on per-second consumption of the infrastructure. For Data Science, Google supplies a complete suite of technical components and solutions that quickly provide significant capabilities for exploring and exploiting data.

