**PROBLEM STATEMENT :**

- Is there a relationship between the number of groups a person belongs to and the number of friends he / she has?
- Can we predict the number of groups a person has, given the number of friends he / she has?
- Can we predict the number of friends a person has, given the number of groups he / she has?
- Is it possible to group people into clusters based on their number of groups and number of friends?

**Files to use :** http://socialcomputing.asu.edu/datasets/YouTube2 .

Format of files :

**1. nodes.csv :**
It’s the file of all the users. This file works as a dictionary of all the users in this dataset. It contains all the node ids used in the dataset.

**2. groups.csv :**
It’s the file of all the groups. It contains all the group ids used in the dataset.

**3. edges.csv :**
It’s the friendship network among the users. Each friendship is represented as an edge between two users. Since the network is symmetric, each edge is listed only once. Here is an example.

`1,2` means that the user with id “1” is a friend of the user with id “2”.

**4. group-edges.csv :**
It’s the user-group membership. In each line, the first entry is the user id and the second entry is the group id.

**DATA CLEANING :**

I had to reshape the raw files into a single dataset that a regression model could use. First, I calculated the number of friends each node has. Because the network is symmetric and each edge is stored only once, every edge has to be counted for both of its endpoints; otherwise the friends of the second user on each line are left out. Then I calculated the number of groups each person belongs to. The final dataset has one row per node (person), with the number of groups they belong to and the number of friends they have. I also applied reciprocal, log, and square-root transformations to these two variables to bring their distributions closer to normal.
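The preparation described above could be sketched in SAS roughly as follows. This is a minimal sketch under stated assumptions, not the code actually used for the project: the directory path, the intermediate table names, the `user_id` / `group_id` column names, and the output table `group_analysis_prepared` are all hypothetical. It also assumes the CSV files have no header row, that the log is the natural log, and that only users with at least one friend and one group are kept, so that the reciprocal and log are defined.

```sas
/* Hypothetical location of the downloaded CSV files. */
%let datadir = /path/to/YouTube2;

/* edges.csv: one undirected friendship per line (assumed: no header row). */
data edges;
    infile "&datadir./edges.csv" dsd truncover;
    input user1 user2;
run;

/* group-edges.csv: one membership per line, user id then group id. */
data group_edges;
    infile "&datadir./group-edges.csv" dsd truncover;
    input user_id group_id;
run;

/* Each edge is stored once, so emit one row per endpoint.
   Counting rows per user then gives that user's number of friends. */
data edge_ends;
    set edges;
    user_id = user1; output;
    user_id = user2; output;
    keep user_id;
run;

proc sql;
    /* Number of friends per user. */
    create table friend_counts as
        select user_id, count(*) as No__of_Friends
        from edge_ends
        group by user_id;

    /* Number of distinct groups per user. */
    create table group_counts as
        select user_id, count(distinct group_id) as No__of_Groups
        from group_edges
        group by user_id;

    /* Assumption: keep only users present in both tables, so both
       counts are at least 1 and 1/x and log(x) are defined. */
    create table node_summary as
        select f.user_id, g.No__of_Groups, f.No__of_Friends
        from friend_counts as f
             inner join group_counts as g
             on f.user_id = g.user_id;
quit;

/* Reciprocal, natural-log, and square-root transformations. */
data group_analysis_prepared;
    set node_summary;
    Reciprocal_Groups  = 1 / No__of_Groups;
    Reciprocal_Friends = 1 / No__of_Friends;
    Log_Groups         = log(No__of_Groups);
    Log_Friends        = log(No__of_Friends);
    Sqrt_Groups        = sqrt(No__of_Groups);
    Sqrt_Friends       = sqrt(No__of_Friends);
run;
```

For example, if `edges.csv` contained only the lines `1,2` and `1,3`, user 1 would get 2 friends and users 2 and 3 would get 1 friend each, even though users 2 and 3 never appear in the first column.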

**NULL HYPOTHESIS :**

H0 : The number of groups a person belongs to does not affect the number of friends that person has.

**ALTERNATE HYPOTHESIS :**

H1 : The number of groups a person belongs to noticeably affects the number of friends that person has.

Note : I found that the reverse model (predicting the number of groups from the number of friends) is not significant with the current dataset.
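In terms of a simple linear regression, the two hypotheses can be read as a test on the slope. The symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are notation introduced here for illustration and do not appear in the original analysis:

$$
\text{Friends}_i = \beta_0 + \beta_1\,\text{Groups}_i + \varepsilon_i,
\qquad
H_0:\ \beta_1 = 0
\quad \text{vs.} \quad
H_1:\ \beta_1 \neq 0
$$

In the `proc reg` parameter estimates table, the `Pr > |t|` value on the slope row is the p-value for this test.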

**The SAS code for this project is as follows:**

<pre>data group_analysis;
    set work.import;
run;

/* sort based on no. of groups and no. of friends */
proc sort data=group_analysis;
    by descending No__of_Friends;
run;

proc sort data=group_analysis;
    by descending No__of_Groups;
run;

/* mean of no. of groups and no. of friends */
proc means data=group_analysis;
    var No__of_Groups No__of_Friends;
run;

/* graphical representation */
proc sgplot data=group_analysis;
    scatter x=No__of_Friends y=No__of_Groups;
    series x=No__of_Friends y=No__of_Groups;
run;

proc sgplot data=group_analysis;
    hbox No__of_Friends / category=No__of_Groups;
run;

/* check correlation between no. of groups and no. of friends */
proc corr data=group_analysis;
    var No__of_Groups No__of_Friends;
run;

/* summarizing data distribution of analysis variable no. of groups */
proc univariate data=group_analysis;
    var No__of_Groups;
run;

/* summarizing data distribution of analysis variable no. of friends */
proc univariate data=group_analysis;
    var No__of_Friends;
run;

/* regression analysis : Y = no. of groups, X = no. of friends */
proc reg data=group_analysis;
    model No__of_Groups = No__of_Friends;
run;

/* regression analysis : Y = no. of friends, X = no. of groups */
proc reg data=group_analysis;
    model No__of_Friends = No__of_Groups;
run;

/* transformation of variables with reciprocal */
proc reg data=group_analysis;
    model Reciprocal_Groups = Reciprocal_Friends;
run;

proc reg data=group_analysis;
    model Reciprocal_Friends = Reciprocal_Groups;
run;

/* transformation of variables with log */
proc reg data=group_analysis;
    model Log_Groups = Log_Friends;
run;

proc reg data=group_analysis;
    model Log_Friends = Log_Groups;
run;

/* transformation of variables with sqrt */
proc reg data=group_analysis;
    model Sqrt_Groups = Sqrt_Friends;
run;

proc reg data=group_analysis;
    model Sqrt_Friends = Sqrt_Groups;
run;

/* cluster analysis */
proc fastclus data=group_analysis out=clust maxclusters=10 maxiter=100 converge=0;
    var No__of_Groups No__of_Friends;
run;

/* plotting the clusters (clust is the output of proc fastclus) */
proc plot data=clust;
    plot No__of_Groups*No__of_Friends=cluster;
run;

/* ANOVA Tukey procedure */
proc anova data=group_analysis;
    class No__of_Groups;
    model No__of_Friends = No__of_Groups;
    means No__of_Groups / tukey;
run;
</pre>

**CONCLUSION :**

- According to my model, the null hypothesis H0 is rejected in favor of the alternate hypothesis H1.
- The regression model and the cluster analysis indicate that the number of groups a person belongs to does affect the number of friends he/she has.
- The model can be enriched by adding more variables and used for network analysis.