Random vs. expert sampling of tweet streams

Many research studies and applications today rely on content streams crowd-sourced from online social networks. Because processing the large volume of data generated on these sites in real time is difficult, analytics companies and researchers increasingly resort to sampling. In this project, we investigated a crucial question: how should the data generated by users in social networks be sampled? The traditional method is to randomly sample all the data; for example, most researchers and applications today rely on the 1% and 10% randomly sampled streams of tweets provided by Twitter. We proposed and analyzed a different sampling methodology, in which content is gathered only from a relatively small set of expert users. Over the duration of a month, we gathered tweets from over 500,000 experts on a diverse set of topics and compared the resulting expert-sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter along several dimensions: the diversity, timeliness, and trustworthiness of the information contained in the tweet samples. Our observations revealed significant differences between the data obtained through the two sampling methodologies, with major implications for applications such as topical search, trustworthy content recommendation, and breaking news detection.
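The contrast between the two sampling strategies can be illustrated with a minimal sketch. This is not the pipeline used in the project: the tweet dictionaries, the `user_id` field, the function names, and the toy expert set are all assumptions made for illustration only.

```python
import random


def random_sample(tweets, rate, seed=0):
    """Random sampling: keep each tweet independently with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [t for t in tweets if rng.random() < rate]


def expert_sample(tweets, expert_ids):
    """Expert sampling: keep only tweets whose author is in the expert set."""
    return [t for t in tweets if t["user_id"] in expert_ids]


# Toy stream (hypothetical data): 1,000 tweets spread evenly over 100 users.
stream = [{"user_id": i % 100, "text": f"tweet {i}"} for i in range(1000)]

# Hypothetical expert set; in practice this would come from an
# expert-identification step that is not shown here.
experts = {3, 14, 15}

random_part = random_sample(stream, rate=0.01)  # analogous to a 1% stream
expert_part = expert_sample(stream, experts)
```

The random sample draws from every user in proportion to how much they post, while the expert sample depends entirely on which accounts are placed in the expert set, which is why the two samples can differ in diversity, timeliness, and trustworthiness.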



On Sampling the Wisdom of Crowds: Random vs. Expert Sampling of the Twitter Stream
Saptarshi Ghosh, Muhammad Bilal Zafar, Parantapa Bhattacharya, Naveen Sharma, Niloy Ganguly and Krishna P. Gummadi. International ACM Conference on Information and Knowledge Management (CIKM), San Francisco, USA, October-November 2013.



