While going through volume after volume of WhatsApp message logs of my conversations with my fiancée, I realized that there was an interesting statistical pattern hidden in them. Essentially, the number of messages per day in those logs showed telltale signs of a Pareto distribution. The same distribution that you see in a lot of social data, such as the distribution of wealth, internet traffic sources, and the number of successful self-published authors on Amazon, was present in our data too. The common thread among all these social data sets is that they follow the 80-20 rule, whereby the majority of the effects result from a small set of sources. For human communication, however, that rule doesn’t seem to explain the cause of this distribution.
Word Cloud
It all started last year when I began the paperwork that would enable my fiancée to immigrate to the US. The Immigration and Naturalization Service (INS), the gatekeeper for all US immigration services, requires an extensive list of evidence to prove that our budding relationship is in fact legitimate. This involved providing them with logs of all our communication, travels, photos, etc. For moderate chat users like us, one year of messaging had already resulted in more than 400 double-sided A4 sheets of messages. This was after I had reduced the text to an 8pt font with zero spacing between lines. The hapless INS worker tasked with deciphering this mountain of messages is not going to get anywhere with it and is most likely to make a subjective decision about my relationship. I didn’t want to leave anything to chance, so I went looking for ways to present this data effectively.
The first thing that came to my mind was the word cloud. It is essentially a visual representation of the commonly used words in our communication, where text size indicates frequency. There are online services like Wordle that generate this cloud once you input the entire chat log. Typical things like love, hope, marriage, and terms of endearment dominate our word cloud. There is potential to develop a linear model that would output a conversation summary based on this cloud data. But I wasn’t satisfied with this form of representation alone. I wanted to dig deeper and look for a mathematical explanation that would describe my feelings for Neetha in an objective manner.
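The counting behind a word cloud is simple enough to sketch. The snippet below is only an illustration of the idea, not what Wordle does internally: the function name `word_frequencies`, the tiny stop-word list, and the sample text are all made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "i", "you", "is", "it", "in"}

def word_frequencies(text, top_n=10):
    """Return the top_n most common words in text, ignoring case and stop words."""
    # Lowercase the text and pull out runs of letters (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up sample text, not taken from the actual chat logs.
sample = "I love you. I hope you love the plan. Hope is a good thing."
top_words = word_frequencies(sample, top_n=2)
print(top_words)
```

A word cloud renderer would then scale each word’s font size by its count.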
The next data point that piqued my interest was the number of messages that we generated per day. A plot of this data over a period of one year is shown below.
Messages per day over a period of one year
Clearly, three phases of our emerging relationship are visible in the above plot. First, we had the initial getting-to-know phase, where we were very cautious about what we wrote. Each message was carefully crafted, thoroughly thought through, and finally sent out in the hope that it wouldn’t offend the other person in any way. Once familiarity had set in and our curiosity about the other person had reached its crescendo, we exchanged a flurry of messages sharing our life experiences, family history, people we know, and so on. Finally, in the third phase, we settled into a regular pattern of mostly sharing day-to-day activities and dreams about our future, at a message frequency that respects our time constraints. This was the second objective data point that I wanted to use to impress the unknown INS worker who will decide our future, but I wasn’t satisfied with just that.
PDF of messages per day
The real fun was hidden a level deeper in that messages-per-day data. On generating a histogram of this data, it became clear that it follows a familiar distribution. The image to the left shows the empirical histogram of our messages-per-day data in blue, matched to a Pareto distribution PDF plotted in red. A generalized Pareto distribution is described by three parameters: location, scale, and shape.
For our data, the parameters of this Pareto distribution were found to be
A Kolmogorov-Smirnov test gave a p-value of 0.90, meaning the test found no evidence against the Pareto fit; the data and the fitted distribution match closely.
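For anyone who wants to try a similar check on their own logs, here is a minimal sketch of the fit-then-test idea using only the Python standard library. It is not the procedure used for the plots in this post. To keep it short it swaps in the simpler two-parameter classical (type I) Pareto instead of the generalized three-parameter form, it uses the asymptotic Kolmogorov formula for the p-value, and it runs on synthetic data. All function names are made up for this example.

```python
import math
import random

def fit_pareto(data):
    """Maximum-likelihood fit of a classical (type I) Pareto distribution.

    Returns (x_min, alpha): x_min is the smallest observation and
    alpha is the Pareto index (shape).
    """
    x_min = min(data)
    if x_min <= 0:
        raise ValueError("type I Pareto needs strictly positive data")
    log_sum = sum(math.log(x / x_min) for x in data)
    if log_sum == 0:
        raise ValueError("all observations are identical")
    return x_min, len(data) / log_sum

def pareto_cdf(x, x_min, alpha):
    """CDF of the type I Pareto distribution."""
    return 0.0 if x < x_min else 1.0 - (x_min / x) ** alpha

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_p_value(d, n):
    """Asymptotic Kolmogorov p-value for statistic d and sample size n."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series converges slowly here and the true value is ~1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2 * total))

# Synthetic stand-in for messages per day; NOT the data from this post.
rng = random.Random(0)
messages_per_day = [20 * rng.paretovariate(1.5) for _ in range(365)]

x_min, alpha = fit_pareto(messages_per_day)
d = ks_statistic(messages_per_day, lambda x: pareto_cdf(x, x_min, alpha))
p = ks_p_value(d, len(messages_per_day))
print(f"x_min={x_min:.1f}, alpha={alpha:.2f}, D={d:.3f}, p={p:.2f}")
```

One caveat worth keeping in mind: when the parameters are estimated from the same data that is then tested, the standard Kolmogorov-Smirnov p-value tends to be optimistic, so a high value should be read as “no obvious mismatch” rather than proof of a Pareto law.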
CDF Match
Why Pareto?
The popular explanation for a Pareto-distributed data set is that some form of 80-20 rule is causing the distribution. However, in a conversation between two individuals, that doesn’t seem to be the cause. The explanation that made more sense to me has a lot to do with the phrase “Money makes more money”. What this means is that each additional dollar in your pocket opens up more avenues to make money. In the context of a conversation between two people who are interested in each other, each message sows the seed for further communication between them. This is a perfect example of a runaway system in action. However, other factors such as sleep, hunger, and job requirements limit the number of messages they exchange every day. So a simple check of the strength of communication between two individuals would be a test for a Pareto distribution. Further, we should be able to use the Pareto index defined as
This idea is still nascent and I am struggling with its birth pangs. The eventual goal is to polish these tests into a quick tool that would quantify any human interaction.