Final Project - Ricardo B Lourenço

Date post: 15-Feb-2017

Integration of Facebook Data to MongoDB and R-Studio

Ricardo Barros Lourenço

MSc. Candidate in Predictive Analytics

CAPES Foundation – Ministry of Education of Brazil - BSMP Scholarship # 88888.075449/2013-00

Summary

• Objectives

• DataSift API

• rmongodb: R-Studio integration with MongoDB

• References

• Questions & Answers

Big Data and NoSQL - Prof. Marco Chou and Prof. Gint Butenas

Objectives

• Ingest a public Facebook data stream into MongoDB using the DataSift infrastructure

• Use MongoDB's flexible document model to handle schemaless messages, such as those generated by social networks, without having to define data structures up front or worry about ingestion performance

• The data consists of messages mentioning “Obama” and “Obamacare”, which are popular topics these days

• Integrate R-Studio with MongoDB (via rmongodb) once the data has been loaded

DataSift

• DataSift is a startup based in San Francisco, with offices in New York and London

• It specializes in social media data and operates as a PaaS covering data sources, filtering, and destinations

• It has a firehose connection to Twitter and a connection to public Facebook data

• Website: http://datasift.com

DataSift: Facebook API

• It is an API connected to a public Facebook data stream (more info at: https://developers.facebook.com/docs/public_feed/)

• It generates JSON messages containing anonymized or public data

• It is useful for getting a broader, more in-depth view of trending topics on Facebook

DataSift: MongoDB API

• It is an API that connects your stream source (in my case, a Facebook source) to a MongoDB instance

• It inserts every JSON message generated by the Facebook API as a document in a specified database, optionally in a specified collection

• It preserves all data structures that come from the Facebook API source
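
As a rough illustration of the points above, the following Python sketch shows what it means to store each JSON message as a document while preserving its structure. It is not DataSift's connector: the message fields are hypothetical, and a plain in-memory list named `collection` stands in for a MongoDB collection.

```python
import json

# Hypothetical messages; the real field names delivered by DataSift may differ.
raw_messages = [
    '{"id": "1", "message": "Obama speech tonight", "author": {"type": "user"}}',
    '{"id": "2", "message": "Obamacare deadline", "likes": {"count": 3}, "tags": ["health"]}',
]

# A plain list stands in for a MongoDB collection: each JSON message becomes
# one document, and no common schema is imposed on the documents.
collection = [json.loads(raw) for raw in raw_messages]

# Nested structures survive unchanged, and documents may have different keys.
for doc in collection:
    print(sorted(doc.keys()))
```

The two documents share only some keys, which is exactly the situation a schemaless store accepts without any prior table definition.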

DataSift: Task

• Once the data source and the data destination are defined, you must start a task

• A starter account receives $10 of test credit, which is more than adequate, because a volume of 1000 messages costs just under $0.10

• The latency is around 200 ms

• The system works asynchronously, using PUSH messages
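
The credit figures above can be turned into a quick estimate. The sketch below assumes a flat rate of $0.10 per 1000 messages, which the slide gives only approximately, so the result should be read as a rough lower bound rather than DataSift's actual pricing. The function name is introduced here for illustration.

```python
# Figures quoted above: $10 of test credit and roughly $0.10 per 1000 messages.
# Amounts are kept in whole cents to avoid floating-point rounding.
CREDIT_CENTS = 1000        # $10.00
COST_CENTS_PER_1000 = 10   # $0.10, treated as an upper bound on the real cost


def messages_covered(credit_cents: int, cost_cents_per_1000: int) -> int:
    """Return how many messages the credit pays for at the given flat rate."""
    return credit_cents * 1000 // cost_cents_per_1000


print(messages_covered(CREDIT_CENTS, COST_CENTS_PER_1000))  # 100000
```

Under this assumption the test credit covers on the order of 100,000 messages, far more than a small class project needs.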

MongoDB: Setup

• Used a local instance of MongoDB (on a notebook)

• Needed to open the firewall ports for mongod

• Needed to set up access control for the facebookObama database, defining a user and password for external connections

• Needed to insert a sample record into the database, just to guarantee that the facebookObama database was created
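
Once a user and password exist, an external client typically addresses the database through a MongoDB connection string. The sketch below only builds such a string with the Python standard library. The user name, password, and host are hypothetical placeholders, 27017 is mongod's default port, and only the database name facebookObama comes from this project. How DataSift itself expects these settings to be entered is not shown here.

```python
from urllib.parse import quote_plus


def mongo_uri(user: str, password: str, host: str, database: str, port: int = 27017) -> str:
    """Build a mongodb:// connection string with percent-escaped credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Hypothetical credentials and host (a documentation-only IP address).
uri = mongo_uri("datasift", "p@ss/word", "203.0.113.10", "facebookObama")
print(uri)
```

Escaping matters because characters such as `@` or `/` in a password would otherwise be read as part of the address.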

MongoDB after ingestion

MongoDB: Message sample

rmongodb: Connecting to MongoDB and displaying a single message

rmongodb: Loading all messages

rmongodb: Loading all messages (error on filtering by a key)

rmongodb: Possibilities

• Once your MongoDB instance is connected to R-Studio, there is a wide range of options you can apply for data analysis

• The difficulty lies in the data structures: because MongoDB is schemaless, you need to know every kind of structure a document may contain (including multiple levels of embedding)
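
One common way to cope with multiple levels of embedding is to flatten each document into dotted keys before analysis. The following Python sketch illustrates the idea. It is not part of rmongodb, and the sample document and its field names are hypothetical.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys.

    Empty dicts and lists contribute no keys to the result.
    """
    if isinstance(doc, dict):
        items = doc.items()
    elif isinstance(doc, list):
        items = enumerate(doc)
    else:
        return {prefix: doc}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical message with an embedded document and an embedded list.
sample = {"id": "1", "author": {"type": "user"}, "tags": ["health", "policy"]}
print(flatten(sample))
```

Collecting the flattened keys across many documents gives a quick inventory of which structures actually occur in the collection.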

rmongodb: Possibilities

• ETL activities would consume most of the user's effort, even when the “schema” of a sample message is known

• R has a text mining package (called tm), but the ETL must be done very carefully to avoid excessive bias when mining

• With this text mining and proper visualization, the user should be able to recognize patterns in the data
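
To give a feel for the kind of pattern recognition meant above, here is a minimal term-frequency sketch. It uses plain Python in place of the tm package, so it shows the idea and not tm's API. The sample messages and the tiny stopword list are made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "is", "a", "on", "of", "and", "to"}


def term_frequencies(messages):
    """Count lowercase word occurrences across messages, skipping stopwords."""
    counts = Counter()
    for text in messages:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Hypothetical message texts, as they might look after the ETL step.
messages = ["Obama speaks on Obamacare", "The Obamacare deadline is today"]
freq = term_frequencies(messages)
print(freq.most_common(1))  # [('obamacare', 2)]
```

The quality of such counts depends directly on the cleaning done beforehand, which is why the ETL step deserves most of the effort.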

References

DataSift

• http://dev.datasift.com/docs/push/connectors/mongodb

• http://dev.datasift.com/docs/push/steps

MongoDB

• http://docs.mongodb.org/manual/reference/program/mongod/

• http://docs.mongodb.org/manual/tutorial/add-user-administrator/

• http://docs.mongodb.org/manual/tutorial/add-user-to-database/

R-Studio

• http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html

• http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf

• http://dugontario.files.wordpress.com/2013/12/qualitative-analysis-in-r.pdf

Questions & Answers
