Do we really need yet another big data processing system? That was the first question that popped up when we first heard about Apache Flink. The big data world does not lack frameworks. But when we try to solve our different data processing needs, the limitations of these platforms become apparent. Apache Spark seemed to be the best framework for these situations, which made the need for another framework with similar goals questionable.
In this post we try to put together our first impressions of Apache Flink and how it differs from Apache Spark.
What is Apache Flink?
Apache Flink is a community-driven open source framework for distributed big data analytics, like Hadoop and Spark. It is known for processing big data quickly, with low latency and high fault tolerance, on large-scale distributed systems. Its defining feature is its ability to process streaming data in real time. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.
Need for Flink
The traditional wisdom is that data has value no matter how old it is. As processing capabilities increased, the industry started realizing that the value of information is highest at the moment the data is generated. Businesses want to process data as and when it happens, which dictates a need for real-time processing systems.
Streaming data processing makes it possible to set up and load a data warehouse quickly. A streaming processor with low data latency yields insights on data faster. Beyond quicker processing, there is another significant benefit: you have more time to design an appropriate response to events. For example, in anomaly detection, lower latency and quicker detection let you identify the best response, which is key to preventing damage in cases such as fraudulent attacks on a secure website or industrial equipment failure. In this way you can prevent substantial loss.
Apache Spark vs Apache Flink
Spark iterates over its data in batches, while Flink iterates over data using its streaming architecture.
- Processing Time
Flink generally processes data faster than Spark because of its pipelined execution, which streams intermediate results directly between operators instead of materializing them between stages.
- Computational Model
Spark Streaming and Flink differ in their computation models. While Spark has adopted micro-batching, Flink has adopted a continuous-flow, operator-based streaming model.
Apache Spark looks at streaming as fast batch processing, whereas Apache Flink looks at batch processing as a special case of stream processing.
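The difference can be illustrated with a toy sketch in plain Python (a conceptual illustration, not the actual Spark or Flink APIs): a micro-batch processor buffers records and emits one result per batch, while a continuous operator emits a result for every record as it arrives.

```python
def micro_batch_sum(records, batch_size):
    """Spark-Streaming-style: group records into small batches and
    emit one updated running total per completed batch."""
    totals, running = [], 0
    for i in range(0, len(records), batch_size):
        running += sum(records[i:i + batch_size])
        totals.append(running)
    return totals

def continuous_sum(records):
    """Flink-style: process every record individually and emit an
    updated total per record."""
    totals, running = [], 0
    for r in records:
        running += r
        totals.append(running)
    return totals

events = [3, 1, 4, 1, 5, 9]
print(micro_batch_sum(events, batch_size=3))  # one output per batch
print(continuous_sum(events))                 # one output per record
```

The continuous version reacts to each record immediately, which is why per-record latency is lower even though both compute the same totals in the end.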
- Data Flow
Though most machine learning algorithms are cyclic data flows, Spark represents them internally as a directed acyclic graph (DAG). Flink takes a slightly different approach: it supports controlled cyclic dependency graphs at runtime, which lets it represent ML algorithms very efficiently.
With Apache Flink you get the benefit of being able to use the same algorithms in both streaming and batch modes (exactly as you do in Spark), but you no longer have to turn to a technology like Apache Storm if you require low-latency responsiveness. In addition, Flink takes the approach that a cluster should manage itself rather than rely on user tuning.
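The contrast between an unrolled DAG and a native cycle can be sketched in plain Python (again a conceptual illustration, not either framework's API): a DAG engine must plan one stage per iteration up front, while a cycle-aware runtime loops inside a single operator until convergence.

```python
def unrolled_iterations(x, step, n):
    """DAG-style: a fixed number of iterations is planned up front,
    one scheduled stage per iteration, even if the result converges
    earlier."""
    for _ in range(n):
        x = step(x)
    return x

def native_iteration(x, step, tol=1e-9):
    """Cycle-aware style: keep iterating inside one operator until
    the result stops changing."""
    while True:
        nxt = step(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt

# Newton's method for sqrt(2) as the iterative step.
step = lambda x: 0.5 * (x + 2.0 / x)
print(round(native_iteration(1.0, step), 6))  # → 1.414214
```

The cycle-aware version also makes a natural place to hang a convergence criterion, which is why iterative ML algorithms map onto it so well.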
- Memory Management
Flink has its own memory management system, separate from Java's garbage collector. By managing memory explicitly, Flink nearly eliminates the memory spikes often seen on Spark clusters.
- Iterative Processing
In Spark, each iteration of an iterative job has to be scheduled and executed separately. Flink, however, can be instructed to process only the parts of the data that have actually changed, which significantly improves job performance.
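The "process only what changed" idea behind Flink's delta iterations can be sketched in plain Python (a conceptual illustration, not Flink's actual API) with label propagation on a tiny graph: each round revisits only the nodes whose labels changed in the previous round.

```python
def connected_components(edges, nodes):
    """Label each node with the smallest node id in its component.
    Only nodes whose label changed last round (the 'workset') are
    revisited, instead of recomputing the full dataset every round."""
    label = {n: n for n in nodes}
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    workset = set(nodes)           # everything is "changed" at first
    while workset:
        changed = set()
        for n in workset:
            for m in neighbors[n]:
                if label[n] < label[m]:
                    label[m] = label[n]
                    changed.add(m)
        workset = changed          # next round touches only these
    return label

print(connected_components([(1, 2), (2, 3), (5, 6)], [1, 2, 3, 5, 6]))
# nodes 1-3 end up with label 1; nodes 5-6 with label 5
```

As the labels stabilize, the workset shrinks and each round does less work, which is exactly where the performance win over rescheduling full iterations comes from.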
Use Cases Supported by Flink vs. Spark
Use cases mainly differ in whether the data needs to be processed batch-wise or in real time.
- To calculate monthly sales at daily intervals: here we need to compute the daily sales total and then a cumulative sum. Batch processing can handle the individual batches of sales figures by date and then add them up.
- To calculate how much bandwidth a particular user has used: you really don't need a real-time streaming solution here; a micro batch is probably more efficient.
- To aggregate requests from different IP addresses and classify whether an IP address is blacklisted.
- To see the usage of particular industrial equipment in IoT.
- To calculate the monthly time each visitor spends on a website. The visit count may be updated hourly, daily, or even monthly. But the problem arises in defining a session: it may be difficult to define the session start and end times, and to identify periods of inactivity. In such situations, real-time streaming data processing is helpful.
- To detect network anomalies, which has to happen in real time.
- Credit card fraud prevention.
- To raise an alert whenever a particular threshold is reached, real-time streaming is very much required.
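The first use case above, daily sales totals plus a cumulative sum, can be sketched as a simple batch job in plain Python (the transaction records and dates here are hypothetical, and this is an illustration of the computation, not a Spark or Flink program):

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical (day, amount) transaction records.
sales = [
    ("2024-01-01", 120.0), ("2024-01-01", 80.0),
    ("2024-01-02", 50.0),  ("2024-01-03", 200.0),
]

# Step 1: batch-aggregate individual transactions into daily totals.
daily = defaultdict(float)
for day, amount in sales:
    daily[day] += amount

# Step 2: cumulative (month-to-date) sum over the daily totals.
days = sorted(daily)
cumulative = list(accumulate(daily[d] for d in days))
for day, total in zip(days, cumulative):
    print(day, total)
```

Because the input is complete and date-bounded, nothing here needs a streaming engine; this is exactly the kind of job where plain batch processing is the right fit.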