Margo consultants took part in Devoxx France 2018, the conference for passionate developers, held in Paris from April 18 to 20, 2018. Below is our report on the Tools-in-Action session dedicated to the Sparkube project, "Sparkube: Transforming Apache Spark into an OLAP Cube", presented by Antoine Chambille on Wednesday, April 18.
You finally have your favourite notebook, your Spark cluster is powerful and well configured, and your HDFS holds all the data you need. But did you know that you can get even more value out of that data thanks to the power of OLAP? At Devoxx France 2018, Antoine Chambille, R&D Director at Activeviam, presented the newly launched Sparkube project.
Let’s start with a quick reminder about OLAP.
OLAP (OnLine Analytical Processing) systems are designed to let users navigate their data intuitively. The technology relies on a particular structure: the hypercube (or multidimensional cube).
The cube makes it possible to analyse the data along different axes of analysis, called hierarchies: for example, the date of the transaction, the product category, the geographical area, or the name of the seller. It also contains measures, which are aggregated along those axes. Examples of measures include the number of rows (count), the amount, and the margin.
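To make the idea concrete, here is a minimal sketch in plain Python of what a cube computes: the same measures (count, amount, margin) aggregated along whichever axes the user picks. The rows, the column names and the `aggregate` helper are invented for illustration and have nothing to do with Sparkube's internals.

```python
from collections import defaultdict

# Hypothetical fact rows: two axes (category, region) and two raw values.
SALES = [
    {"category": "Books", "region": "Europe", "amount": 120.0, "margin": 30.0},
    {"category": "Books", "region": "Asia", "amount": 80.0, "margin": 20.0},
    {"category": "Games", "region": "Europe", "amount": 200.0, "margin": 50.0},
    {"category": "Games", "region": "Europe", "amount": 100.0, "margin": 25.0},
]


def aggregate(rows, axes):
    """Aggregate the measures (count, amount, margin) along the given axes."""
    cells = defaultdict(lambda: {"count": 0, "amount": 0.0, "margin": 0.0})
    for row in rows:
        key = tuple(row[axis] for axis in axes)  # coordinates of the cube cell
        cell = cells[key]
        cell["count"] += 1
        cell["amount"] += row["amount"]
        cell["margin"] += row["margin"]
    return dict(cells)


# The same data viewed along one axis, along two axes, and as a grand total.
by_category = aggregate(SALES, ["category"])
by_category_and_region = aggregate(SALES, ["category", "region"])
grand_total = aggregate(SALES, [])
```

Changing the list of axes changes the view without touching the underlying rows, which is the kind of navigation an OLAP client offers interactively.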
But how do you use this cube?
MDX (Multidimensional Expressions) is the de facto query language for OLAP cubes. Introduced by Microsoft and later adopted by a large number of OLAP solutions, this powerful language is particularly well suited to the cube structure, although its learning curve is steeper than one might like.
Fortunately, there is a protocol, XMLA (XML for Analysis), that supports both discovery (listing the axes, measures, and so on) and the execution of MDX queries against an OLAP cube. Paired with a suitable graphical interface, it spares the user from writing MDX: every kind of query can be generated through the interface. The icing on the cake: Excel implements the XMLA protocol, so it can connect, without any additional plugin, to any server that implements it.
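As an illustration of what travels over the wire, the sketch below builds the SOAP envelope of an XMLA `Execute` request carrying an MDX statement, using only the Python standard library. The cube, hierarchy and measure names in the MDX query are hypothetical, and nothing is actually sent to a server.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"

# Hypothetical MDX query: the cube, hierarchy and measure names are invented.
MDX_QUERY = (
    "SELECT {[Measures].[Average Price]} ON COLUMNS, "
    "[brand].Members ON ROWS "
    "FROM [Cars]"
)


def build_execute_request(mdx):
    """Return an XMLA Execute request (SOAP envelope) as an XML string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")

    # The MDX statement goes inside <Command><Statement>.
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    statement = ET.SubElement(command, f"{{{XMLA_NS}}}Statement")
    statement.text = mdx

    # Ask for the result as a multidimensional cell set.
    properties = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    property_list = ET.SubElement(properties, f"{{{XMLA_NS}}}PropertyList")
    result_format = ET.SubElement(property_list, f"{{{XMLA_NS}}}Format")
    result_format.text = "Multidimensional"

    return ET.tostring(envelope, encoding="unicode")


request_xml = build_execute_request(MDX_QUERY)
```

A client such as Excel produces requests of this shape behind the scenes, which is why the user never has to type the MDX by hand.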
So where does Sparkube fit in?
Sparkube is a library developed by the software vendor Activeviam that introspects a Spark DataFrame and creates a suitable OLAP cube structure. It exposes the cube through an XMLA interface, answers client discovery requests, translates each MDX query into Spark queries, and returns the result once execution completes.
Sparkube works out of the box with Excel, and Activeviam also provides its own client, Active UI.
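The talk did not detail how this translation is implemented. Conceptually, though, a cube query that asks for some measures along some axes boils down to a grouped aggregation. The sketch below is an assumption-laden illustration, not Sparkube's actual mechanism: the hypothetical `to_spark_sql` helper turns a list of axes and measures into the kind of SQL statement Spark could execute, and the table and column names are invented.

```python
def to_spark_sql(table, axes, measures):
    """Build a GROUP BY statement for the requested axes and measures.

    `measures` maps an output alias to a (function, column) pair.
    This helper is hypothetical and only illustrates the principle.
    """
    aggregates = [
        f"{func}({column}) AS {alias}"
        for alias, (func, column) in measures.items()
    ]
    select_list = ", ".join(list(axes) + aggregates)
    sql = f"SELECT {select_list} FROM {table}"
    if axes:  # no GROUP BY clause is needed for a grand total
        sql += " GROUP BY " + ", ".join(axes)
    return sql


# Average price and number of cars per brand (hypothetical names).
query = to_spark_sql(
    "cars",
    ["brand"],
    {"avg_price": ("AVG", "price"), "nb_cars": ("COUNT", "*")},
)
```

With a DataFrame registered as a temporary view named `cars`, a statement like this could be run with `spark.sql(query)`.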
And in practice?
Nothing could be simpler. After downloading a dataset of more than 370,000 rows for the test, we follow the instructions on the Sparkube page. We start by creating the Dataset, then ask Sparkube to create and expose the cube.
Sparkube reports that the cube is exposed at http://localhost:9090/xmla.
And there we are: we can start exploring.
For example, we can retrieve the average price of a car by brand, by mileage, or by the car's accident history:
We can also insert charts:
The tests were run on a dataset of more than 370,000 rows and 20 columns. Building the cube takes about 3 minutes, and query response times range from 2 to 5 seconds.
Sparkube certainly does not offer all the flexibility and performance of solutions designed specifically for OLAP. Such solutions can, in particular, pre-aggregate and index every point of the cube. They take advantage of specific hardware architectures (NUMA, for example) and rely mainly on vertical scalability to limit network exchanges (and to allow very powerful non-linear aggregations). Nevertheless, the project remains very promising: the ability to build a proof of concept this easily, with very good performance on reasonably sized datasets, demonstrates the value of OLAP to users.
For now, the Sparkube page offers little information, but the project is well worth following!
- Sparkube: https://activeviam.com/en/sparkube
- Test data: https://data.world/data-society/used-cars-data
- XMLA Reference: https://docs.microsoft.com/en-us/sql/analysis-services/xmla/xml-for-analysis-xmla-reference
- Test machine specifications: Intel® Core™ i5-7200U CPU @ 2.5 GHz, 8 GB DDR4 RAM, SSD
Watch the video of the talk given by Antoine Chambille: "Sparkube: Transforming Apache Spark into an OLAP Cube".