Predicting number of friends based on number of social groups

PROBLEM STATEMENT :

  1. Is there a relationship between the number of groups a person belongs to and the number of friends he/she has?
  2. Can we predict the number of groups a person belongs to, given the number of friends he/she has?
  3. Can we predict the number of friends a person has, given the number of groups he/she belongs to?
  4. Is it possible to cluster people based on their number of groups and number of friends?

Files to use : http://socialcomputing.asu.edu/datasets/YouTube2

Format of files :

1. nodes.csv :
The file of all users. It works as a dictionary of the users in this data set and contains all the node ids used in the dataset.

2. groups.csv :
The file of all groups. It contains all the group ids used in the dataset.

3. edges.csv :
The friendship network among the users. Each friendship is represented as an edge. Since the network is symmetric, each edge is listed only once. Here is an example:
1,2 – This means the user with id “1” is friends with the user with id “2”.

4. group-edges.csv :
The user-group membership. In each line, the first entry is the user id and the second entry is the group id.
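
As a rough illustration of the layouts described above, the four files could be read into SAS along these lines. This is only a sketch: the folder `/data/youtube2`, the fileref names, the dataset names, and the column names (`node_id`, `group_id`, `user1`, `user2`, `user_id`) are all assumptions made for the example, and it assumes the files are comma-delimited with no header row.

```sas
/* Hypothetical location of the downloaded files; adjust to your setup. */
filename nodesf  '/data/youtube2/nodes.csv';
filename groupsf '/data/youtube2/groups.csv';
filename edgesf  '/data/youtube2/edges.csv';
filename grpedgf '/data/youtube2/group-edges.csv';

/* nodes.csv: one node id per line */
data nodes;
	infile nodesf truncover;
	input node_id;
run;

/* groups.csv: one group id per line */
data groups;
	infile groupsf truncover;
	input group_id;
run;

/* edges.csv: one friendship per line, stored once (e.g. 1,2) */
data edges;
	infile edgesf dlm=',' dsd truncover;
	input user1 user2;
run;

/* group-edges.csv: user id first, group id second */
data group_edges;
	infile grpedgf dlm=',' dsd truncover;
	input user_id group_id;
run;

/* Quick look at the first few rows to confirm the layout */
proc print data=edges (obs=5);
run;
```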

DATA CLEANING :

The raw files do not contain the counts needed for a regression model, so I derived them. First, I calculated the number of friends each node has: because the network is symmetric and each edge is stored only once, I counted every edge for both of its endpoints so that no friendships were left out. Then I calculated the number of groups each person belongs to. The final dataset has one row per node (person) with the number of groups they belong to and the number of friends they have. I also transformed these two variables with the reciprocal, log, and square root to bring their distributions closer to normal.
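
The counting step described above can be sketched in SAS on a small made-up edge list. This is not the code actually used to build the dataset. The toy rows, the intermediate dataset names, the `user_id` and `group_id` column names, the inner join (which drops people with no group rows), and the use of the natural log are all assumptions made for illustration. Only the final variable names are taken from the analysis code.

```sas
/* Toy edge list in the edges.csv layout: each friendship appears once. */
data edges;
	infile datalines dlm=',';
	input user1 user2;
	datalines;
1,2
1,3
2,3
3,4
;
run;

/* Toy membership list in the group-edges.csv layout: user id, group id. */
data group_edges;
	infile datalines dlm=',';
	input user_id group_id;
	datalines;
1,10
1,11
3,10
4,12
;
run;

/* Write each edge once per endpoint so both users get credit for it. */
data edge_long;
	set edges;
	node = user1; output;
	node = user2; output;
	keep node;
run;

proc sql;
	/* Friends per node = number of edges touching that node */
	create table friend_counts as
	select node, count(*) as No__of_Friends
	from edge_long
	group by node;

	/* Groups per node = number of distinct groups the user is in */
	create table group_counts as
	select user_id as node, count(distinct group_id) as No__of_Groups
	from group_edges
	group by user_id;

	/* Assumption: keep only nodes that have both counts */
	create table counts as
	select f.node, g.No__of_Groups, f.No__of_Friends
	from friend_counts as f
	inner join group_counts as g
	on f.node = g.node;
quit;

/* Transformed versions of both variables (natural log assumed) */
data group_analysis_sketch;
	set counts;
	Reciprocal_Groups  = 1 / No__of_Groups;
	Reciprocal_Friends = 1 / No__of_Friends;
	Log_Groups   = log(No__of_Groups);
	Log_Friends  = log(No__of_Friends);
	Sqrt_Groups  = sqrt(No__of_Groups);
	Sqrt_Friends = sqrt(No__of_Friends);
run;
```

On this toy data, user 3 appears in three edges and so gets three friends, even though it is listed first in only one of them. User 2 is dropped by the inner join because it has no group rows.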

NULL HYPOTHESIS :

H0 : The number of groups a person belongs to does not affect the number of friends that person has.

ALTERNATE HYPOTHESIS :

H1 : The number of groups a person belongs to noticeably affects the number of friends that person has.

Note : I found that the reverse model (predicting the number of groups from the number of friends) is not significant with the current dataset.
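
In regression terms, the two hypotheses can be written as a test on the slope of a simple linear model. This is a sketch that assumes the straight-line form fitted by `proc reg`. The symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are introduced here for illustration and are not part of the original write-up.

```latex
\text{Friends}_i = \beta_0 + \beta_1\,\text{Groups}_i + \varepsilon_i,
\qquad
H_0:\ \beta_1 = 0
\quad \text{vs.} \quad
H_1:\ \beta_1 \neq 0
```

The reverse model mentioned in the note swaps the roles of the two variables, with the number of groups as the response and the number of friends as the predictor.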

The SAS code for this project is as follows:

<pre>data group_analysis;
	set work.import;
run;

/*sort by no. of groups, then no. of friends, both descending*/
proc sort data=group_analysis;
	by descending No__of_Groups descending No__of_Friends;
run;

/*mean of no. of groups and no.of friends*/
proc means data=group_analysis;
	var No__of_Groups No__of_Friends;
run;

/*graphical representation*/
proc sgplot data=group_analysis;
	scatter x=No__of_Friends y=No__of_Groups;
	series x=No__of_Friends y=No__of_Groups;
run;

proc sgplot data=group_analysis;
	hbox No__of_Friends / category=No__of_Groups;
run;

/*Check Correlation between No. of Groups and No. of Friends*/
proc corr data=group_analysis;
	var No__of_Groups No__of_Friends;
run;

/*summarizing data distribution of analysis variable No. of Groups*/
proc univariate data=group_analysis;
	var No__of_Groups;
run;

/*summarizing data distribution of analysis variable No. of Friends*/
proc univariate data=group_analysis;
	var No__of_Friends;
run;

/*Regression analysis : Y=No. of groups, X=No. of friends*/
proc reg data=group_analysis;
	model No__of_Groups = No__of_Friends;
run;

/*Regression analysis : Y=No. of Friends, X=No. of groups*/
proc reg data=group_analysis;
	model No__of_Friends = No__of_Groups;
run;

/*Transformation of variables with Reciprocal*/
proc reg data=group_analysis;
	model Reciprocal_Groups = Reciprocal_Friends;
run;

proc reg data=group_analysis;
	model Reciprocal_Friends = Reciprocal_Groups;
run;

/*Transformation of variables with Log*/
proc reg data=group_analysis;
	model Log_Groups = Log_Friends;
run;

proc reg data=group_analysis;
	model Log_Friends = Log_Groups;
run;

/*Transformation of variables with Sqrt*/
proc reg data=group_analysis;
	model Sqrt_Groups = Sqrt_Friends;
run;

proc reg data=group_analysis;
	model Sqrt_Friends = Sqrt_Groups;
run;

/*cluster analysis*/
proc fastclus data=group_analysis out=clust maxclusters=10 maxiter=100 converge=0;
	var No__of_Groups No__of_Friends;
run;

/*plotting the clusters*/
proc plot data=clust;
	plot No__of_Groups*No__of_Friends=cluster;
run;

/*one-way ANOVA with Tukey's multiple comparison procedure*/
proc anova data=group_analysis;
	class No__of_Groups;
	model No__of_Friends = No__of_Groups;
	means No__of_Groups / tukey;
run;
quit;</pre>

CONCLUSION

  • According to my model, the null hypothesis H0 was rejected in favor of the alternate hypothesis H1.
  • The regression model and cluster analysis indicate that the number of groups a person belongs to does affect the number of friends he/she has.
  • The model can be enriched by adding more variables and used for network analysis.