TalkingData User Demographics

A Kaggle Problem.

In a moment of instant motivation, impulsive decision-making, and binge-watching movies on AI, I decided to solve this Kaggle problem. Let's dive in.

In a nutshell, we are given mobile usage data from which we have to predict the age and gender of each user. Please check the Kaggle site for the complete problem statement.

So, the final prediction should be the group, which comprises gender and an age range.

e.g., M32–38, F24–26, etc.

Let's focus on “mobile usage data” here.

As shown in the diagram above, 7 files are provided: gender_age is the train data, 5 other files hold the data for the train and test devices, and the 7th file is similar to gender_age but contains only device ids, skipping the rest (it is the test set).

We have device_id as our primary key identifying the device, and it links the labels to the events data; app_events, app_labels, and label_categories are linked through event_id, app_id, and label_id.
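For clarity, here is a minimal pandas sketch of how these files join. The file paths are assumptions (a typical Kaggle input layout), and the column names follow the competition's data description; this is not the original code.

```python
import pandas as pd

# File paths are assumptions; column names follow the Kaggle data description.
gender_age = pd.read_csv("input/gender_age_train.csv")        # device_id, gender, age, group
events = pd.read_csv("input/events.csv")                      # event_id, device_id, timestamp, longitude, latitude
app_events = pd.read_csv("input/app_events.csv")              # event_id, app_id, is_installed, is_active
app_labels = pd.read_csv("input/app_labels.csv")              # app_id, label_id
label_categories = pd.read_csv("input/label_categories.csv")  # label_id, category
phone = pd.read_csv("input/phone_brand_device_model.csv")     # device_id, phone_brand, device_model

# device_id links the labels (and the phone table) to the events;
# event_id, app_id and label_id chain the remaining tables together.
train_events = gender_age.merge(events, on="device_id", how="left")
event_apps = (app_events
              .merge(app_labels, on="app_id", how="left")
              .merge(label_categories, on="label_id", how="left"))
full = train_events.merge(event_apps, on="event_id", how="left")
print(full.head())
```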

Key observations I had during analysis were:

  • There are 12 groups, and male users outnumber female users.
  • Male age range: 22 to 39+.
  • Female age range: 23 to 43+.
  • Females span a wider age range, but males account for more records.
  • The 25th and 75th percentile age values for males and females are similar.
  • The MAD of age is the same for males and females.
  • Even though there are more males than females, the age distributions of both are similar.
  • The majority (68.77%) of train devices have events.
  • The majority (68.80%) of test devices have events.
  • The data spans 30 April 2016, 11:52 PM to 8 May 2016, 12 AM.
  • The M32–38 group has a significant number of people who use the phone between 11 PM and 3 AM.
  • Comparatively, males use the phone at night more than females.
  • Assuming a higher event count means heavier device usage, there is no apparent relationship between a user's age or gender and how much they use the device.
  • In each event the user has used multiple apps, and these apps belong to multiple categories; the most-used categories stood out during the analysis.
  • All registered apps are installed, but only 39.21% are actively used (train).
  • Duplicates are present in the phone data and need to be removed (a couple of these checks are sketched after this list).
  • The top 3 brands account for 58.78% of phones, so they dominate the mobile market.
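A few of these observations can be reproduced with simple checks on the tables loaded in the earlier snippet. This is a rough sketch of how one might compute them, not the original analysis code.

```python
# Reuses the DataFrames loaded in the earlier snippet (rough sketch only).
has_events = gender_age["device_id"].isin(events["device_id"])
print(f"Train devices with events: {has_events.mean():.2%}")   # reported ~68.77% above

dupes = phone.duplicated(subset="device_id").sum()
print(f"Duplicate rows in phone data: {dupes}")
phone = phone.drop_duplicates(subset="device_id")

# Rough proxy for the installed-vs-active split (the 39.21% figure above was
# computed on train devices, which may differ from this overall ratio).
print(f"Active share of app-event rows: {app_events['is_active'].mean():.2%}")
```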

After the analysis and a few references, I came up with the features below.

  • Mode/median of longitude (over all events and timestamps).
  • Mode/median of latitude (over all events and timestamps).
  • TF-IDF over the apps actively used by that particular device (see the sketch after this list).
  • BoW over the labels of the apps installed on that particular device.
  • BoW of the phone brand (one-hot encoding, since these come in different languages).
  • BoW of the phone model (one-hot encoding, since these come in different languages).
  • TF-IDF of the weekday on which the event occurred.
  • TF-IDF of the hour of the event.
  • BoW of the hour bin of the event.
  • The ratio of active apps to installed apps.
  • Clustering of location into 10 clusters using latitude and longitude.
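As an illustration, here is a hedged sketch of two of these features, the TF-IDF over active apps and the 10-way location clustering, using scikit-learn and the DataFrames loaded earlier. The variable names are mine, and the exact preprocessing in the original solution may differ.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# --- TF-IDF over the apps actively used by each device ---
active = (app_events[app_events["is_active"] == 1]
          .merge(events[["event_id", "device_id"]], on="event_id"))
# One "document" per device: the space-separated list of its active app ids.
app_docs = (active.groupby("device_id")["app_id"]
            .apply(lambda ids: " ".join(map(str, ids))))
tfidf = TfidfVectorizer(token_pattern=r"\S+")   # keep full (possibly negative) app ids
X_apps = tfidf.fit_transform(app_docs)          # sparse device-by-app matrix

# --- Cluster event locations into 10 groups ---
coords = events.loc[(events["latitude"] != 0) | (events["longitude"] != 0),
                    ["latitude", "longitude"]]
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
events.loc[coords.index, "loc_cluster"] = kmeans.fit_predict(coords)
```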

Since 32% of the data doesn't have events, I have divided the prediction into 2 parts:

  • With Events
  • Without Events

We will have an individual prediction model for each subset and finally concatenate the results, roughly as sketched below.
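A minimal sketch of this split-and-concatenate flow, assuming the tables loaded earlier and using uniform placeholder probabilities in place of the real model outputs:

```python
import numpy as np

# Placeholder names; the real models and their predictions are not shown here.
test = pd.read_csv("input/gender_age_test.csv")            # device_id only
has_ev = test["device_id"].isin(events["device_id"])

test_with, test_without = test[has_ev], test[~has_ev]
groups = sorted(gender_age["group"].unique())               # the 12 target classes

# Uniform probabilities stand in for the outputs of the two separate models.
pred_with = pd.DataFrame(np.full((len(test_with), len(groups)), 1 / len(groups)),
                         index=test_with["device_id"], columns=groups)
pred_without = pd.DataFrame(np.full((len(test_without), len(groups)), 1 / len(groups)),
                            index=test_without["device_id"], columns=groups)

# Concatenate the two blocks and restore the submission order.
submission = pd.concat([pred_with, pred_without]).loc[test["device_id"]]
submission.to_csv("submission.csv", index_label="device_id")
```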

I have tried Logistic Regression, XGBoost, and neural networks, so the base of the model is an ensemble of the methods I tried. After obtaining the predictions of each algorithm, I gave each result a weight based on its cross-validation log loss (CV_logloss).
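One plausible way to turn CV log losses into blend weights is shown below. This is a hedged sketch using inverse-loss weights, not necessarily the exact weighting scheme used in the solution.

```python
import numpy as np

def weighted_ensemble(probas, cv_loglosses):
    """Blend class-probability arrays, weighting each by 1 / its CV log loss."""
    weights = 1.0 / np.asarray(cv_loglosses)
    weights /= weights.sum()
    blended = sum(w * p for w, p in zip(weights, probas))
    return blended / blended.sum(axis=1, keepdims=True)   # renormalize each row

# Dummy predictions from three models over 5 devices and 12 groups:
rng = np.random.default_rng(0)
preds = [rng.dirichlet(np.ones(12), size=5) for _ in range(3)]
blend = weighted_ensemble(preds, cv_loglosses=[2.28, 2.25, 2.30])
```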

Again, since neural networks have run-to-run variance, i.e. they give different predictions every time (even with the same parameters), I have implemented ensembling for each neural network I had. It is essentially an average of all the results produced by that network.

In this way, I have trained 2 NNs for each dataset 10 times each and taken the average of the results, and then also applied Logistic Regression and XGBoost. You have to find the right combination of features to get the best results.
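A small sketch of this run-averaging idea, using scikit-learn's MLPClassifier on dummy data as a stand-in for the actual networks used in the solution:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Dummy data stands in for the real feature matrices.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 20)), rng.integers(0, 12, size=200)
X_test = rng.normal(size=(50, 20))

runs = []
for seed in range(10):                       # the post trains each network 10 times
    nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=seed)
    nn.fit(X_train, y_train)
    runs.append(nn.predict_proba(X_test))

avg_pred = np.mean(runs, axis=0)             # average of the 10 runs' probabilities
```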

And then, finally, concatenate the end results of both datasets.

In my final model, I used only the neural networks for my prediction and was able to reach the top 10% of the leaderboard, with a private score of 2.24054 and a public score of 2.23523.

I have seen many other participants get better scores by giving very low weights, such as 0.1, to the other algorithms like LR, LightGBM, etc. You can try that too. In the end, it's all about the right features with the right set of methods to get the best results.

Please check my code on GitHub for the detailed implementation, and let me know if you got a better result or took a different approach. We can connect through LinkedIn.

Aspiring Data Scientist