This presentation, given at MongoSV 2012, focuses on data processing when using MongoDB as your primary database, including integration with Hadoop and the new MongoDB aggregation framework. Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for processing their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well: the Hadoop Streaming interfaces offer a new way to write native MongoDB MapReduce jobs.
MongoDB, Hadoop and humongous data – MongoSV 2012, from Steve Francia
MongoDB, Hadoop and humongous data – MongoSV 2012 — Presentation Transcript
- MongoDB, Hadoop & humongous data
- Talking about: What is Humongous Data? Humongous Data & You. MongoDB & Data Processing. The Future of Humongous Data.
- @spf13 AKA Steve Francia. 15+ years building the internet.
Father, husband, skateboarder. Chief Solutions Architect @ 10gen
responsible for drivers, integrations, web & docs
- What is humongous data ?
- 2000: Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs.
- 2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual webpages out there is growing by several billion pages per day).
- An unprecedented amount of data is being created and is accessible
- Data Growth (chart of millions of URLs, 2000–2008): 1, 4, 10, 24, 55, 120, 250, 500, 1,000
- Truly exponential growth is hard for people to grasp. A BBC reporter recently: “Your current PC is more powerful than the computer they had on board the first flight to the moon.”
- Moore’s Law applies to more than just CPUs. Boiled down, it says that things double at regular intervals. It’s exponential growth… and it applies to big data.
- How BIG is it?
- How BIG is it? 2008
- How BIG is it? 2001 2002 2003 2004 2005 2006 2007 2008
- Why all this talk about BIG Data now?
- In the past few years, open source software has emerged that enables ‘us’ to handle BIG Data
- The Big Data Story
- Is actually two stories
- Doers & Tellers talking about different things
http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
- Tellers
- Doers
- Doers talk a lot more about actual solutions
- They know it’s a two-sided story: Storage & Processing
- Takeaways: MongoDB and Hadoop. MongoDB for storage & operations; Hadoop for processing & analytics.
- MongoDB & Data Processing
- Applications have complex needs. MongoDB is the ideal operational database. MongoDB is ideal for BIG data: it is not a data processing engine, but it provides processing functionality.
- Many options for Processing Data • Process in MongoDB using Map
Reduce • Process in MongoDB using Aggregation Framework • Process
outside MongoDB (using Hadoop)
- MongoDB Map Reduce (diagram): map() iterates on documents, one at a time per shard; the current document is $this. emit(k,v), then group(k) and sort(k). reduce(k, values) can run multiple times, so its input must match its output (k,v). finalize(k,v) runs last.
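As a point of reference (not from the slides), here is a minimal sketch of driving this kind of map/reduce from Python with PyMongo. It assumes a PyMongo release that still provides Collection.map_reduce (2.x/3.x), the test.live tweet collection used in the demo later, and an arbitrary output collection name hashtag_counts.

    from pymongo import MongoClient
    from bson.code import Code

    # Sketch only: count hashtags with MongoDB's built-in map/reduce via PyMongo.
    # "test.live" is the tweet collection from the demo; "hashtag_counts" is made up.
    coll = MongoClient("mongodb://127.0.0.1").test.live

    mapper = Code("""
    function () {                                  // 'this' is the current document
        if (!this.entities || !this.entities.hashtags) return;
        this.entities.hashtags.forEach(function (tag) {
            emit(tag.text, 1);                     // emit(k, v) once per hashtag
        });
    }""")

    reducer = Code("""
    function (key, values) {                       // may run multiple times per key,
        return Array.sum(values);                  // so output shape matches input values
    }""")

    out = coll.map_reduce(mapper, reducer, "hashtag_counts")
    for doc in out.find().sort("value", -1).limit(10):
        print(doc)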
- MongoDB Map Reduce: MongoDB map reduce is quite capable… but it has limits: JavaScript is not the best language for processing map reduce; JavaScript is limited in external data processing libraries; it adds load to the data store.
- MongoDB Aggregation: Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries: realtime aggregation similar to SQL GROUP BY.
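To make the SQL GROUP BY analogy concrete (this example is not from the slides), counting tweets per user with the aggregation framework might look like the following in PyMongo. The user.screen_name field and the test.live collection come from the Twitter demo below; it assumes a PyMongo version whose aggregate() returns a cursor of result documents.

    from pymongo import MongoClient

    # SQL:  SELECT user, COUNT(*) AS count FROM tweets GROUP BY user ORDER BY count DESC
    # Aggregation framework equivalent (sketch):
    coll = MongoClient("mongodb://127.0.0.1").test.live
    pipeline = [
        {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc)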
- MongoDB & Hadoop (diagram): works the same against a single server or a sharded cluster (same as mongos). Many map operations run, one at a time per input split; MongoDB shard chunks (64 MB) become the input splits. The InputFormat creates a list of input splits, and a RecordReader feeds each split to Map(k1, v1, ctx), which calls ctx.write(k2, v2). A Combiner(k2, values2) runs on the same thread as the map, followed by Partitioner(k2) and Sort(k2). Reducer threads run Reduce(k2, values3), and the OutputFormat runs once per key, writing (kf, vf).
- DEMO TIME
- DEMO: Install the Hadoop MongoDB plugin. Import tweets from Twitter. Write a mapper in Python using Hadoop Streaming. Write a reducer in Python using Hadoop Streaming. Call myself a data scientist.
- Installing mongo-hadoop (https://gist.github.com/1887726):
hadoop_version="0.23"
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh
- Grokking Twitter: curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live … let it run for about 2 hours
- DEMO 1
- Map hashtags in Python:
#!/usr/bin/env python
import sys
sys.path.append(".")
from pymongo_hadoop import BSONMapper

def mapper(documents):
    # each input document is a tweet; emit a count of 1 per hashtag
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
- Reduce hashtags in Python:
#!/usr/bin/env python
import sys
sys.path.append(".")
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # sum the per-hashtag counts emitted by the mapper
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
- All together:
hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py
- Popular hash tags: db.twit_hashtags.find().sort({ count : -1 })
{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "Bahrain", "count" : 129 }
{ "_id" : "bahrain", "count" : 125 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "BabyKillerOcalan", "count" : 106 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }
- DEMO 2
- Aggregation in Mongo 2.1:
db.live.aggregate(
  { $unwind : "$entities.hashtags" },
  { $match : { "entities.hashtags.text" : { $exists : true } } },
  { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
  { $sort : { count : -1 } },
  { $limit : 10 }
)
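The same pipeline can also be run from Python; here is a minimal sketch using PyMongo against the demo's test.live collection (again assuming a PyMongo version whose aggregate() returns a cursor of result documents).

    from pymongo import MongoClient

    # Sketch: the aggregation pipeline above, issued from PyMongo.
    coll = MongoClient("mongodb://127.0.0.1").test.live
    pipeline = [
        {"$unwind": "$entities.hashtags"},
        {"$match": {"entities.hashtags.text": {"$exists": True}}},
        {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc)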
- Popular hash tags: db.twit_hashtags.aggregate(a)
{ "result" : [
  { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
  { "_id" : "teamfollowback", "count" : 200 },
  { "_id" : "RT", "count" : 150 },
  { "_id" : "Arsenal", "count" : 148 },
  { "_id" : "milars", "count" : 145 },
  { "_id" : "sanremo", "count" : 145 },
  { "_id" : "LoseMyNumberIf", "count" : 139 },
  { "_id" : "RelationshipsShould", "count" : 137 },
  { "_id" : "Bahrain", "count" : 129 },
  { "_id" : "bahrain", "count" : 125 }
], "ok" : 1 }
- The Future of humongous data
- What is BIG? BIG today is normal tomorrow
- Data Growth (chart of millions of URLs, 2000–2011): 1, 4, 10, 24, 55, 120, 250, 500, 1,000, 2,150, 4,400, 9,000
- 2012: Generating over 250 million tweets per day
- MongoDB enables us to scale with the redefinition of BIG. New
processing tools like Hadoop & Storm are enabling us to process the
new BIG.
- Hadoop is our first step
- MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more
- http://spf13.com http://github.com/spf13 @spf13 Questions?
download at github.com/mongodb/mongo-hadoop