Setting up PySpark for Jupyter Notebook – with Docker

When you google “How to run PySpark on Jupyter”, you get so many tutorials showcasing so many different ways to configure an IPython notebook to support PySpark that it gets a little confusing for a non-geek like me. So when I finally figured out a way to do it, with the help of multiple websites, I thought I would post it here as a blog to help my fellow non-geeks who wish to rock the world of Data Science!

Prerequisites: You should have Jupyter Notebook and PySpark locally installed on your machine.

Continue reading

Predict the Pitch | Baseball Research

PROBLEM STATEMENT :

To predict the pitch type, given various measurements of the pitch.

BACKGROUND :

This dataset was obtained from a national baseball team, as part of a Predictive Modeling Challenge. The objective was to predict the Pitch_Type from various other parameters, such as start speed, height, and angles. It’s a classification problem, and I used a Random Forest to build the model.

Continue reading

PySpark SQL Demonstration

PROBLEM STATEMENT :

a) Count the number of students whose names start with the letter “D”.
b) Display names of students, their average on the 3 exams, and a letter grade. The letter grade is computed as follows:
An average >= 90 is an “A”, 80 to <90 is a “B”, and so on.
c) What is the class average on exam_1?
d) Repeat (b), but the display should be sorted in descending order by average on the 3
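
Before writing these as Spark SQL queries, it can help to pin down the logic in plain Python. The sketch below is only an illustration: the sample rows are made up, the tuple layout (name, exam_1, exam_2, exam_3) is an assumption, and the cutoffs below “B” (70 for “C”, 60 for “D”, otherwise “F”) are my reading of “and so on”, not something the problem statement spells out.

```python
from statistics import mean

# Hypothetical sample rows: (name, exam_1, exam_2, exam_3).
# These are invented for illustration and are not from the original dataset.
students = [
    ("Dana", 95, 88, 92),
    ("David", 72, 65, 80),
    ("Maria", 85, 90, 79),
    ("Omar", 55, 62, 58),
]


def letter_grade(average):
    """Map an exam average to a letter grade.

    Only the "A" and "B" cutoffs come from the problem statement;
    the remaining cutoffs are assumed.
    """
    if average >= 90:
        return "A"
    if average >= 80:
        return "B"
    if average >= 70:
        return "C"  # assumed cutoff
    if average >= 60:
        return "D"  # assumed cutoff
    return "F"      # assumed cutoff


def count_names_starting_with(rows, letter):
    """Count rows whose name starts with the given letter."""
    return sum(1 for name, *_ in rows if name.startswith(letter))


def grade_report(rows):
    """Return (name, average, letter) for every student."""
    return [
        (name, mean(scores), letter_grade(mean(scores)))
        for name, *scores in rows
    ]


def exam_average(rows, exam_index):
    """Class average for one exam; exam_index 1 is exam_1."""
    return mean(row[exam_index] for row in rows)


d_count = count_names_starting_with(students, "D")             # part (a)
grades = grade_report(students)                                # part (b)
exam_1_average = exam_average(students, 1)                     # part (c)
ranked = sorted(grades, key=lambda row: row[1], reverse=True)  # part (d)

for name, average, letter in ranked:
    print(f"{name}: {average:.2f} {letter}")
```

Each helper corresponds to one part of the problem, so the same rules can be carried over to DataFrame or SQL expressions once the logic is settled.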

Continue reading

Data Exploration in PySpark

PROBLEM STATEMENT :

To perform exploratory analysis on a given set of Sales and Purchases data.

  1. How many records does the file have?
  2. Display total sales by store_name.
  3. Display the payment types received by store.
  4. How many “Music” items were bought?
  5. Provide a list of items that were purchased at the store “San Jose”. Note that your list should not contain duplicates.
  6. For each item, list the stores from which it was purchased.
  7. Display the total sales for the store “San Jose” by date.
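
To make the seven questions concrete, here is a minimal plain-Python sketch of the aggregations they ask for. It is an illustration under assumptions: the records are invented, and every field name other than store_name (date, item, sales, payment_type) is hypothetical. In PySpark the same questions would typically be answered with DataFrame operations such as count, groupBy and distinct.

```python
from collections import defaultdict
from typing import NamedTuple


class Sale(NamedTuple):
    # Field names other than store_name are assumed for this sketch.
    date: str
    store_name: str
    item: str
    sales: float
    payment_type: str


# Hypothetical records, invented for illustration only.
records = [
    Sale("2012-01-01", "San Jose", "Music", 20.5, "Visa"),
    Sale("2012-01-01", "Fort Worth", "Toys", 10.0, "Cash"),
    Sale("2012-01-02", "San Jose", "Books", 5.25, "Cash"),
    Sale("2012-01-02", "San Jose", "Music", 4.0, "Visa"),
    Sale("2012-01-02", "Fort Worth", "Music", 8.0, "Amex"),
]

# 1. Number of records.
record_count = len(records)

# 2. Total sales by store_name.
sales_by_store = defaultdict(float)
for r in records:
    sales_by_store[r.store_name] += r.sales

# 3. Payment types received by each store.
payments_by_store = defaultdict(set)
for r in records:
    payments_by_store[r.store_name].add(r.payment_type)

# 4. Number of "Music" items bought.
music_count = sum(1 for r in records if r.item == "Music")

# 5. Distinct items purchased at "San Jose" (a set removes duplicates).
san_jose_items = sorted({r.item for r in records if r.store_name == "San Jose"})

# 6. Stores from which each item was purchased.
stores_by_item = defaultdict(set)
for r in records:
    stores_by_item[r.item].add(r.store_name)

# 7. Total sales for "San Jose" by date.
san_jose_sales_by_date = defaultdict(float)
for r in records:
    if r.store_name == "San Jose":
        san_jose_sales_by_date[r.date] += r.sales

print(record_count, dict(sales_by_store), san_jose_items)
```

Questions 2, 3, 6 and 7 are all the same pattern, grouping by a key and then aggregating, which is why they share the defaultdict structure here.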

Continue reading