Apache Flink is an open-source streaming platform which gives you tremendous capabilities to run real-time data-processing pipelines in a fault-tolerant way, at a scale of millions of events per second.

The key point is that it does all this using the minimum possible resources, at single-millisecond latencies. So how does it manage that, and what makes it better than other solutions in the same domain?

Low latency on minimal resources

Flink is based on the DataFlow model, i.e. processing elements as and when they come rather than processing them in micro-batches (which is what Spark Streaming does).

Micro-batches can contain a huge number of elements, and the resources needed to process all those elements at once can be substantial. In the case of a sparse data stream (in which you get only a burst of data at irregular intervals), this becomes a major pain point.

With the dataflow model you also don't need to go through the trial and error of configuring the micro-batch size so that the processing time of a batch doesn't exceed its accumulation time. If that happens, the batches start to queue up and eventually all the processing comes to a halt.

The dataflow model allows Flink to process millions of records per minute at milliseconds of latency on a single machine (it's also because of Flink's managed memory and custom serialisation, but more on that in the next article). Here are some benchmarks.

Variety of Sources and Sinks

Flink provides seamless connectivity to a variety of data sources and sinks. Some of these include:

Apache Cassandra
Elasticsearch
Kafka
RabbitMQ
Hive

Fault tolerance

Flink provides robust fault tolerance using checkpointing (periodically saving its internal state to external storage such as HDFS). Moreover, Flink's checkpointing mechanism can be made incremental (saving only the changes and not the whole state), which greatly reduces both the amount of data written to HDFS and the I/O duration.
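The idea behind incremental checkpointing can be shown with a small, self-contained sketch. This is plain Python, not Flink's actual API: on each checkpoint, only the keys that changed since the previous checkpoint are persisted, and the full state is rebuilt on restore by replaying the deltas in order.

```python
# Conceptual sketch of incremental checkpointing (not Flink's real API):
# each checkpoint persists only the entries that changed since the last one.

class IncrementalCheckpointer:
    def __init__(self):
        self.last_snapshot = {}   # state as of the previous checkpoint
        self.checkpoints = []     # simulated durable storage (e.g. HDFS)

    def checkpoint(self, state):
        # Write only new or modified entries, plus explicit deletions.
        delta = {k: v for k, v in state.items()
                 if self.last_snapshot.get(k) != v}
        deleted = [k for k in self.last_snapshot if k not in state]
        self.checkpoints.append({"delta": delta, "deleted": deleted})
        self.last_snapshot = dict(state)
        return delta

    def restore(self):
        # Rebuild the full state by replaying all deltas in order.
        state = {}
        for cp in self.checkpoints:
            state.update(cp["delta"])
            for k in cp["deleted"]:
                state.pop(k, None)
        return state

cp = IncrementalCheckpointer()
cp.checkpoint({"a": 1, "b": 2})                   # first checkpoint: everything is new
delta = cp.checkpoint({"a": 1, "b": 3, "c": 4})   # only "b" and "c" are written
print(delta)         # {'b': 3, 'c': 4}
print(cp.restore())  # {'a': 1, 'b': 3, 'c': 4}
```

The second checkpoint writes two entries instead of three; with large states and small per-interval changes, this is where the I/O savings come from.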
The checkpointing overhead is almost negligible, which enables users to keep large states inside Flink applications.

Flink also provides a high-availability setup through ZooKeeper. This is for re-spawning the job in cases where the driver (known as the JobManager in Flink) crashes due to some error.

High level API

Unlike Apache Storm (which also follows a dataflow model), Flink provides an extremely simple high-level API in the form of Map, Reduce, Filter, Window, GroupBy, Sort and Joins. This gives a developer a lot of flexibility and speeds up development when writing new jobs.

Stateful processing

Sometimes an operation requires some config or data from some other source to perform its work. A simple example would be counting the number of records of type Y in a stream X. This counter is known as the state of the operation.

Flink provides a simple API to interact with state like you would interact with a Java object. States can be backed by memory, the filesystem or RocksDB, and are checkpointed and thus fault-tolerant. With respect to the above example: in case your application restarts, your counter value will still be preserved.

Exactly once processing

Apache Flink provides exactly-once processing, like Kafka 0.11 and above, with minimal overhead and zero dev effort. This is not trivial to do in other streaming solutions such as Spark Streaming and Storm, and is not supported in Apache Samza.

Exactly-once Semantics is Possible: Here's How Apache Kafka Does it (www.confluent.io)

SQL Support

Like Spark Streaming, Flink also provides a SQL API interface, which makes writing a job easier for people from a non-programming background. Flink SQL is maturing day by day and is already being used by companies such as Uber and Alibaba to do analytics on real-time data.
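To make the stateful-processing example from earlier concrete, here is a minimal, self-contained sketch (plain Python, not Flink's ValueState API) of an operator that counts records of type "Y", checkpoints its counter to a file, and restores it after a simulated restart:

```python
import json
import os
import tempfile

# Minimal sketch of fault-tolerant operator state (not Flink's real API):
# count records of type "Y", checkpoint the counter, restore after a "restart".

class CountingOperator:
    def __init__(self, checkpoint_path):
        self.path = checkpoint_path
        self.count = 0
        if os.path.exists(self.path):       # restore state on (re)start
            with open(self.path) as f:
                self.count = json.load(f)["count"]

    def process(self, record):
        if record["type"] == "Y":
            self.count += 1

    def checkpoint(self):                   # persist state durably
        with open(self.path, "w") as f:
            json.dump({"count": self.count}, f)

path = os.path.join(tempfile.mkdtemp(), "state.json")

op = CountingOperator(path)
for r in [{"type": "Y"}, {"type": "X"}, {"type": "Y"}]:
    op.process(r)
op.checkpoint()

op2 = CountingOperator(path)   # simulated restart: counter is restored
print(op2.count)               # 2
```

In real Flink the checkpoint destination would be a durable store such as HDFS and the save/restore cycle is handled by the framework, but the effect is the same: the counter survives a crash.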
Flink SQL & TableAPI in Large Scale Production at Alibaba

Environment Support

A Flink job can be run in a distributed system or on a local machine. The program can run on Mesos, YARN or Kubernetes, as well as in standalone mode (e.g. in Docker containers). Since Flink 1.4, Hadoop is no longer a prerequisite, which opens up a number of possibilities for places to run a Flink job.

Awesome community

Flink has a great dev community, which allows for frequent new features and bug fixes, as well as great tools to ease the developer effort further. Some of these tools are:

Flink Tensorflow — run TensorFlow graphs as a Flink process
Flink HTM — anomaly detection in a stream in Flink
Tink — a temporal graph library built on top of Flink

Flink SQL and Complex Event Processing (CEP) were also initially developed by Alibaba and contributed back to Flink.

Note: Spark Streaming 2.3 has started offering support for continuous processing rather than micro-batching. Check it out here. I'll run some benchmarks using yahoo-streaming-benchmarks and post the results in the next article.

Connect with me on LinkedIn or Facebook, or drop a mail to kharekartik@gmail.com to share feedback.