Spark has recently been gaining traction. So I thought of providing starting point to play with Spark. Have written a simple code for Logistic regression to help  in transition.

I hope you guys find it interesting, and building block for learning spark. Spark is really interesting.

What Is Apache Spark?

Apache Spark is an open source processing engine built around speed, ease of use, & Analytics. Apache Spark is a cluster computing platform designed to be fast and general-purpose. Spark is alternative for large amounts of data that requires low latency processing that a typical Map Reduce program cannot provide. Spark performs at speeds up to 100 times faster than Map Reduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning fast speed and supports Java, Scala, and Python

Spark combines SQL, streaming and complex Analytics together seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, or S3.

At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to inter-operate closely, letting you combine them like libraries in a software project.

Components of Spark

Spark Core Concepts

At a high level, a Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains the main function of your application which will be then distributed to the clusters members for execution. The SparkContext object is used by the driver program to access the computing cluster. For the shell applications the SparkContext is by default available through the sc variable.

A very important concept in Spark is RDD – resilient distributed data-set. This is an immutable collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDD can contain any type of object from Java, Scala, Python or R including user-defined classes. The RDDs can be created in two ways: by loading an external data-set or by distributing a collection of objects like list or sets.

After creation we can have two types of operation on the RDDS:

  • Transformations –  construct a new RDD from an existing one
  • Actions – compute a result based on an RDD

RDDs are computed in a lazy way – that is when they are used in an action.  Once Spark sees a chain of transformations, it can compute just the data needed for its result. Each time an action is run on an RDD, it  is recomputed. If you need the RDD for multiple actions, you can ask Spark to persist it using RDD.persist().

You can use Spark from a shell session or as a standalone program. Either way you  will have the following workflow:

  • create input RDDs
  • transform them using transformations
  • ask Spark to persist them if needed for reuse
  • launch actions to start parallel computation, which is then optimized and executed by Spark

Brief intro on Logistic Regression

Logistic Regression is a classification algorithm. Classification involves looking at data and assigning a class (or a label) to it. Usually there are more than one classes, but in our example, we’ll be tackling Binary Classification, in which there at two classes: 0 or 1.

PySpark

Most importantly for us, Spark supports a Python API to write Python Spark jobs or interact with data on cluster through a shell

Steps to launch spark and python shell.

1. First step is to go to folder where spark is build writing

cd  /usr/local/spark

  1. Launch python shell:

./bin/pyspark

OR

  1. To Launch Ipython notebook:

IPYTHON_OPTS=”notebook –pylab inline” ./bin/pyspark

After performing IPython Command,  IPython(Now its called jupyter) notebook will appear on screen. Lets get started…

you can preview the data  by using

#Preview data

raw_data.take(1)

Labeled Point

A labeled point is a local vector associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms and they are stored as doubles. For binary classification, a label should be either 0 (negative) or 1 (positive).

Remember…
Spark provides specific functions to deal with RDDs which elements are key/value pairs. They are  used to perform aggregations and other processings by key.

We have used the map function to create key-value pair. you can also notice Labeled Point output and that’s why Labeled Point library was imported.