Spark’s website has this one liner as the introduction — “Unified engine for large-scale data analytics.” But the question is why limit Spark to just Data analytics?
An apache spark data analytics platform is what enterprises traditionally deploy. But, this article is not a technical how-to article, covering basic programming or performance optimization centered around data analytics. Rather, I want to talk about how Spark works and why I think setting up a Spark cluster could be very useful for almost all data processing jobs, one of them being data import. In the current environment of low-cost memory, storage and CPU power, investing in a Spark cluster might be a good approach to take.
Spark SQL defined
Let’s focus specifically on Spark’s SQL Module and the merits of developing an Apache spark data import platform.
The question to be answered is “What benefits would there be to switch to Spark over the likes of Java or python?”
The need for batched Data import exists in almost all applications. The volume of the import varies, from 1000s of rows to millions of rows. Data validation and transformation is a certainty in nearly all these data import jobs.
The tendency has been to hand code small data import jobs in Java or python based on the skillsets available and use ETL/ELT platforms (Tibco, Informatica, Matillion etc) for larger workflows. In Java, the Spring framework via Spring batch enables the development of robust data processing jobs. When you code in Java or python, the varied data formats force different programming semantics and libraries to be used. Databases require JDBC API while JSON api exist for managing JSON data, etc. The coding approach will differ for each data format.
On the other hand, apache spark data processing is centered around Dataframe construct. Spark translates all incoming data formats into a Dataframe object. View this as a typical table (rows x named columns). You can then perform transformations, validations and joins to merge additional data. You can split the data into partitions and have these processed in parallel across the Spark executor nodes. All of these can be done using a high-level API on the Dataframe object.
Each transformation on a dataframe does not modify the existing dataframe but creates a new dataframe incorporating the changes — this approach enables the Spark engine to retry failed operations transparently, since it maintains a directed acyclic graph (DAG) of the changes and can rerun any failed operation as required.
If you’re a SQL expert, you alias the dataframe as a View and apply all the above via SQL. Finally, you write this dataframe to your Sink and complete your ETL processing!
Spark compiles your actions on the dataframe into an optimized physical execution plan, and via the step called “Whole stage codegen” generates java bytecode, that is executed across the executor nodes in the cluster. It only builds the plan when you perform a collecting step, e.g., save the dataframe to a file in parquet format or save into a table via JDBC. This allows Spark to analyze the data processing steps stored as a DAG and come up with the optimal execution plan.
As part of this physical execution plan, Spark attempts to push down filters (typically) into the data Source read operation. For example, using specific partitions against a parquet data source or using the ‘where’ clause in a ‘Select’ fired against a JDBC data source.
As you can see, the pros are indeed many. Now, let’s look at the cons:
1) Spark only supports inserts
It now supports Hive updates/deletes via Hive transactions but not for other JDBC data sources. But you’ll anyway code the individual upserts in your java/python programs via the batching mechanism available — jdbc batch updates is an example. This part doesn’t change. What changes is that you can now handle the Extract and Transform in Spark and utilize all its robust API/SQL functionality to vastly simplify these steps using the single data model called the dataframe.
2) Watch out for the shuffles!
The biggest performance impact in Spark programs is when Spark needs to send data to other nodes to complete aggregation steps, e.g., group by or sort, or to re-partition the data to effectively utilize the cluster.
The main benefit of spark is the ability to split the entire dataset into partitions and process them in parallel across the cluster. If the partitions keys are explicitly set “correctly” then you can either eliminate shuffles completely or reduce them to a vast extent. If you don’t, then Spark needs to ensure that all the data for the aggregation keys reside on the same partition in a node. Shuffling could also happen when data has to spread uniformly across the cluster.
But just as we would “Explain” SQL queries to figure out if indexes are missing or the query needs to be restructured, you can analyze the job execution steps in Spark and then figure out what needs to be tweaked to reduce the shuffle.
I think the pluses outweigh the minuses when it comes to Spark. Given the data processing requirements most firms have in their work pipeline, it does make sense to invest in a Spark Cluster.
To be continued…
In Part II of this article, I’ll walk you through how Spark can be setup quickly as a multi node cluster on your PC. This should enable you to run some samples and view the job execution in Spark’s history server dashboard. Stay tuned for a hands-on perspective of Spark so that you can evaluate it for your next Data processing requirement.
As a final note, typically a distributed file system is also setup for the unified file access across all nodes of the Spark cluster. The most common solution is Apache HDFS, but nowadays, you also have S3 solutions available.