Data Exploration in PySpark

PROBLEM STATEMENT :

To perform exploratory analysis on a given set of Sales and Purchases data.

  1. How many records does the file have?
  2. Display total sales by store_name.
  3. Display the payment types received by store.
  4. How many “Music” items were bought?
  5. Provide a list of items that were purchased at the store “San Jose”. Note that your list should not contain duplicates.
  6. For each item, list the stores from which it was purchased.
  7. Display the total sales for the store “San Jose” by date.

File to use: purchases.txt.gz

Format: Date, Time, Store_name , Item, Cost, Payment_type

The PySpark code is as follows :

p1 = sc.textFile(“../purchases.txt”)

p1.count()

4138476

p2 = p1.map(lambda x: x.split(“\t”))

p2.map(lambda x: (x[2],float(x[4]))).reduceByKey(lambda x,y: x+y).take(10)

[(u’Baton Rouge’, 10131273.229999999), (u’Spokane’, 10083362.979999999), (u’Sacramento’, 10123468.180000002), (u’San Diego’, 9966038.389999995), (u’Fremont’, 10053242.359999986), (u’Minneapolis’, 10011757.780000012), (u’El Paso’, 10016409.970000012), (u’Anaheim’, 10076416.360000007), (u’Honolulu’, 10006273.490000004), (u’Santa Ana’, 10050309.929999989)]

p2.map(lambda x: x[5]).distinct().collect()

[u’Visa’, u’Discover’, u’Amex’, u’Cash’, u’MasterCard’]

p2.map(lambda x: x[3] == “Music”).countByValue()

defaultdict(<type ‘int’>, {False: 3908326, True: 230150})

p2.map(lambda x: (x[2] == “San Jose”, x[3])).distinct().collect()

[(True, u”Children’s Clothing”), (False, u’Computers’), (True, u’Books’), (False, u’Toys’), (False, u’Crafts’), (False, u’Sporting Goods’), (True, u’Sporting Goods’), (True, u”Women’s Clothing”), (True, u’Pet Supplies’), (False, u’DVDs’), (False, u’Health and Beauty’), (True, u’Cameras’), (False, u”Children’s Clothing”), (False, u’Baby’), (True, u’DVDs’), (True, u”Men’s Clothing”), (True, u’Computers’), (True, u’Health and Beauty’), (True, u’Consumer Electronics’), (True, u’Music’), (True, u’CDs’), (False, u’Books’), (False, u”Women’s Clothing”), (True, u’Garden’), (False, u’Garden’), (False, u’Pet Supplies’), (False, u’Cameras’), (True, u’Video Games’), (False, u’Consumer Electronics’), (False, u”Men’s Clothing”), (True, u’Crafts’), (False, u’CDs’), (False, u’Music’), (False, u’Video Games’), (True, u’Baby’), (True, u’Toys’)]

p2.map(lambda x: (x[3],x[2])).distinct().groupByKey().map(lambda t: (t[0], list(t[1]))).take(2)

[(u’Pet Supplies’, [u’Orlando’, u’Columbus’, u’Gilbert’, u’Atlanta’, u’Lincoln’, u’Hialeah’, u’Santa Ana’, u’Aurora’, u’Kansas City’, u’Cleveland’, u’Tampa’, u’Louisville’, u’Jersey City’, u’Garland’, u’Fort Worth’, u’Reno’, u’Raleigh’, u’Baltimore’, u’Richmond’, u’Tulsa’, u’Cincinnati’, u’Henderson’, u’Portland’, u’Minneapolis’, u’Saint Paul’, u’Austin’, u’Omaha’, u’Nashville’, u’New York’, u’Oklahoma City’, u’Memphis’, u’North Las Vegas’, u’San Jose’, u’Plano’, u’Boston’, u’Fremont’, u’Pittsburgh’, u’Baton Rouge’, u’Chula Vista’, u’Stockton’, u’Anaheim’, u’Laredo’, u’Durham’, u’New Orleans’, u’Miami’, u’St. Louis’, u’San Bernardino’, u’Winston\u2013Salem’, u’Phoenix’, u’Seattle’, u’Fresno’, u’Lexington’, u’Dallas’, u’Philadelphia’, u’Colorado Springs’, u’Chicago’, u’Chandler’, u’Long Beach’, u’Spokane’, u’Scottsdale’, u’San Antonio’, u’Birmingham’, u’Newark’, u’San Diego’, u’Lubbock’, u’Houston’, u’Tucson’, u’Jacksonville’, u’Boise’, u’Charlotte’, u’Norfolk’, u’Irvine’, u’Anchorage’, u’Buffalo’, u’Madison’, u’Mesa’, u’Greensboro’, u’Sacramento’, u’Riverside’, u’Las Vegas’, u’St. Petersburg’, u’Los Angeles’, u’Milwaukee’, u’Bakersfield’, u’Washington’, u’Corpus Christi’, u’Chesapeake’, u’San Francisco’, u’Denver’, u’Wichita’, u’Oakland’, u’Arlington’, u’Fort Wayne’, u’Glendale’, u’Honolulu’, u’Virginia Beach’, u’Albuquerque’, u’El Paso’, u’Indianapolis’, u’Toledo’, u’Rochester’, u’Detroit’, u’Irving’]),

(u”Men’s Clothing”, [u’Chesapeake’, u’Sacramento’, u’Lubbock’, u’Fort Wayne’, u’Denver’, u’New York’, u’Corpus Christi’, u’Atlanta’, u’Glendale’, u’Reno’, u’Omaha’, u’Irving’, u’Columbus’, u’Albuquerque’, u’Detroit’, u’Stockton’, u’Norfolk’, u’Arlington’, u’Tucson’, u’Fresno’, u’Saint Paul’, u’Baltimore’, u’Los Angeles’, u’Boise’, u’Memphis’, u’Chicago’, u’Greensboro’, u’Rochester’, u’San Diego’, u’Colorado Springs’, u’Riverside’, u’Washington’, u’Charlotte’, u’Spokane’, u’San Francisco’, u’Fremont’, u’San Antonio’, u’San Jose’, u’Chandler’, u’Fort Worth’, u’Richmond’, u’San Bernardino’, u’Virginia Beach’, u’Indianapolis’, u’Anaheim’, u’Austin’, u’St. Louis’, u’Miami’, u’Madison’, u’Henderson’, u’Minneapolis’, u’Laredo’, u’Las Vegas’, u’El Paso’, u’Pittsburgh’, u’Scottsdale’, u’St. Petersburg’, u’Nashville’, u’Kansas City’, u’Philadelphia’, u’Tulsa’, u’Cincinnati’, u’Hialeah’, u’Portland’, u’New Orleans’, u’Seattle’, u’Cleveland’, u’Gilbert’, u’Durham’, u’Winston\u2013Salem’, u’Phoenix’, u’Anchorage’, u’Lincoln’, u’Orlando’, u’Louisville’, u’Long Beach’, u’Aurora’, u’Irvine’, u’Plano’, u’North Las Vegas’, u’Oklahoma City’, u’Boston’, u’Jacksonville’, u’Jersey City’, u’Dallas’, u’Wichita’, u’Baton Rouge’, u’Toledo’, u’Buffalo’, u’Bakersfield’, u’Garland’, u’Raleigh’, u’Houston’, u’Lexington’, u’Birmingham’, u’Chula Vista’, u’Mesa’, u’Oakland’, u’Honolulu’, u’Milwaukee’, u’Newark’, u’Tampa’, u’Santa Ana’])]

p3 = p2.filter(lambda x: x[2] == “San Jose”)

p3.map(lambda x: (x[0], float(x[4]))).reduceByKey(lambda x,y: x+y).take(10)

[(u’2012-02-07′, 27025.41000000001), (u’2012-12-25′, 25083.530000000002), (u’2012-11-10′, 28685.380000000005), (u’2012-08-06′, 32018.71), (u’2012-10-31′, 21446.13000000001), (u’2012-03-29′, 25643.80999999999), (u’2012-10-12′, 26152.890000000007), (u’2012-03-20′, 23913.04), (u’2012-03-04′, 30501.620000000003), (u’2012-08-29′, 24854.199999999993)]

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s