Author: Mayank Malhotra (Big Data Engineer)
Apache Spark is one of the most popular big data processing frameworks. It can be used for many scenarios, such as ETL (Extract, Transform and Load), data analysis, training ML models, NLP processing, etc. To serve a broad community of users, Spark supports multiple programming languages, namely Scala, Java, and Python. It also supports R with limited functionality.
In this blog I will cover the important aspects of each programming language (excluding R) to help you choose the right one for your needs.
Why Not Java?
Unlike Python and Scala, Java does not support all of Spark's APIs and has some limitations with the Spark framework. For example, Java has no built-in tuple type, although there is a workaround (Scala's Tuple2) to handle it. Java is also a verbose language: you have to deal with a lot of syntax to express simple logic. Another downside is that Java does not have an interactive notebook environment, which can be very painful from a developer's viewpoint.
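For contrast, tuples are first-class in Python (as they are in Scala), so paired data such as (key, value) records needs no wrapper class like Java's Tuple2. A minimal sketch:

```python
# Tuples are built into Python: no wrapper class needed,
# unlike Java, where Spark code must fall back on scala.Tuple2.
pair = ("spark", 3)           # a (word, count) pair
word, count = pair            # destructuring, like Scala's val (w, c) = pair

# A word-count-style reduction over (key, value) pairs:
pairs = [("a", 1), ("b", 2), ("a", 4)]
total = sum(c for _, c in pairs)
print(word, count, total)
```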
If you are working on a data science project, you will miss the interactive notebook environment. I strongly advise you to avoid Java unless your team has solid Java experience and is prepared to find a workaround for every other problem.
After eliminating R and Java, we are left with only two contenders: Python and Scala.
Python vs Scala
Both Scala and Python are object-oriented as well as functional languages, and both have flourishing support communities. Spark itself is written in Scala, which makes Scala and Spark quite compatible with each other. However, Scala has a steeper learning curve than Python. Python has good standard libraries, particularly for data science; Scala, on the other hand, offers powerful APIs with which you can build complex workflows very easily.
Static vs Dynamic Type
Scala runs on the JVM (Java Virtual Machine), which is not the case for Python. Scala is faster than Python in most cases (though not all) because it is a statically typed language, whereas Python is dynamically typed.
In a dynamically typed language, all data types must be resolved at runtime, which adds some execution overhead. Debugging, on the other hand, can be easier in Python because you do not need to recompile your code again and again.
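The runtime-type-resolution point above can be illustrated in plain Python: types belong to values, not variables, so a mismatch only surfaces when the offending line actually executes.

```python
# In Python, types are attached to values and checked at runtime.
def add(a, b):
    return a + b          # works for ints, floats, strings, lists...

print(add(1, 2))          # 3
print(add("py", "spark")) # pyspark

# A type mismatch is only discovered when the line runs,
# not at compile time as it would be in Scala:
try:
    add(1, "spark")
except TypeError as e:
    print("runtime type error:", e)
```

In Scala, `add(1, "spark")` against an `(Int, Int) => Int` signature would be rejected before the program ever ran; the flexibility and the runtime cost go together.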
Apart from RDDs and DataFrames, Spark also offers the Dataset abstraction, introduced experimentally in 1.6 and stabilized in the 2.x line, but it is available only in the JVM languages (Scala and Java), not Python. The major differences between Datasets and DataFrames are compile-time type safety and the performance of encoders. Encoders handle serialization and deserialization, specifically in the case of UDFs, and they are extremely fast with Datasets. This is one more case where Scala performs better.
Python UDF performance is slow, since each row has to be serialized between the JVM and the Python process. Spark 2.3 introduced vectorized (Pandas) UDFs, which improve performance considerably, but they still have some limitations for now. I expect this to improve further in future releases.
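The difference between row-at-a-time and vectorized UDFs can be sketched in plain Python, with no Spark dependency. A regular Python UDF is invoked once per row; a vectorized UDF receives a whole batch at once (in Spark, an Arrow-backed pandas.Series). The batch function below is just a stand-in for a real pandas UDF:

```python
# Row-at-a-time: the function is called once per value,
# which is how a plain Python UDF behaves in Spark.
def plus_one(x):
    return x + 1

rows = [1, 2, 3, 4]
per_row = [plus_one(x) for x in rows]

# Vectorized: the function receives the whole batch and returns
# a batch, amortizing call and serialization overhead.
# (Spark's pandas UDFs do this with pandas.Series via Arrow.)
def plus_one_batch(batch):
    return [x + 1 for x in batch]

vectorized = plus_one_batch(rows)
print(per_row, vectorized)
```

The results are identical; the win is paying the Python-call and serialization cost once per batch instead of once per row.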
Concurrency and Multi-threading
Scala supports multiple concurrency primitives, whereas in Python the Global Interpreter Lock (GIL) means that, even with multiple threads, only one thread executes Python bytecode at a time, so multi-threading does not speed up CPU-bound work. Thanks to its concurrency features, Scala has better memory management and data processing performance. Python's usual workaround is heavyweight process forking.
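A minimal sketch of the Python side of this, using the standard-library `multiprocessing` module (the heavyweight process forking mentioned above). The function and worker count here are illustrative:

```python
import multiprocessing

def square(n):
    # CPU-bound work: under the GIL, threads would execute this
    # one at a time, but separate processes each get their own
    # interpreter (and their own GIL), so they truly run in parallel.
    return n * n

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
```

The trade-off is that forking a process is far more expensive than starting a thread, and data must be pickled across process boundaries, whereas Scala threads share one JVM heap.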
Python has external libraries like pandas and NumPy which can be super useful for data science use cases (GraphX, by contrast, is Spark's own graph-processing API and is not available in Python). Scala also has some libraries for machine learning, but Python overshadows it in this capability.
Python also has popular visualization libraries such as matplotlib and Plotly.
Conclusion
- Python is comparatively slower but easy to use, whereas Scala is faster and moderately easy to use.
- As Spark is written in Scala, Scala is the most compatible language, and new features arrive with Scala support first, followed by the other languages.
- Python has better external libraries for data science and visualization use cases than Scala.
- Both languages have strong community support.
- Python should be preferred for data science use cases; otherwise, use Scala to fully exploit the power of Spark.