Not Just Hadoop: NoSQL in the Enterprise at Strata NYC 2012 | spf13

At the NYC Strata & Hadoop World conference I presented ‘Not Just Hadoop: NoSQL in the Enterprise’. Robert Lancaster from Orbitz joined me on stage for the final presentation of the Bridge to Big Data track. Mark Madsen did a great job moderating the session and kept the energy high the entire day. Robert shared how Orbitz uses MongoDB with Apache Hadoop to provide real-time rates. This was my second time presenting at Strata’s Big Data conference. There are few things I enjoy more in my work than presenting to an engaged audience full of good questions, which is exactly what I found at Strata.

While Hadoop is the most well-known technology in big data, it’s not always the most approachable or appropriate solution for data storage and processing. In this session you’ll learn about enterprise NoSQL architectures, with examples drawn from real-world deployments, as well as how to apply big data regardless of the size of your own enterprise.

Big data for the rest of us from Steve Francia

Tweets

During the talk, the following tweets captured some of the highlights:

.@spf13: “Moore’s Law applies to more than just CPUs. It also applies to data” #structureconf

— Matt Asay (@mjasay) October 23, 2012

.@spf13: “For 10+ years, Big Data = ‘custom sw w/ big hw.’” In the past few years open source has made Big Data accessible to the rest of us

— Matt Asay (@mjasay) October 23, 2012

Learning from @spf13 of @10gen how MongoDB enables #BigData (for the rest of us) #strataconf

— Tamara Dull (@tamaradull) October 23, 2012

“What is BIG? What is big today is normal tomorrow.” @spf13 #strataconf

— Tamara Dull (@tamaradull) October 23, 2012

https://twitter.com/markmadsen/status/260844790348918784

Presentation Transcript

Not Just Hadoop, NoSQL in the Enterprise
Talking about
• What is BIG Data
• BIG Data & you
• Real world examples
• The future of Big Data
@spf13, AKA Steve Francia. 16+ years building the internet. Father, husband, skateboarder. Chief Evangelist @ 10gen, responsible for drivers, integrations, web & writing.
What is BIG data?
2000: Google Inc today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs
2008 Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual webpages out there is growing by several billion pages per day).
An unprecedented amount of data is being created and is accessible
Data Growth
Truly exponential growth is hard for people to grasp. A BBC reporter recently: “Your current PC is more powerful than the computer they had on board the first flight to the moon”.
Moore’s Law applies to more than just CPUs. Boiled down, it is that things double at regular intervals. It’s exponential growth, and it applies to big data.
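To make the doubling idea concrete, here is a minimal sketch in Python that estimates the doubling time implied by the two Google index figures quoted in the slides above (1 billion URLs in 2000, 1 trillion in 2008). The function name `doubling_time_years` is illustrative and not from the talk, and the calculation assumes steady exponential growth between the two data points.

```python
import math


def doubling_time_years(start_count, end_count, years):
    """Years per doubling, assuming steady exponential growth."""
    doublings = math.log2(end_count / start_count)
    return years / doublings


# Figures quoted on the slides: 1 billion URLs (2000), 1 trillion URLs (2008).
t = doubling_time_years(1e9, 1e12, 2008 - 2000)
print(f"{t:.2f} years per doubling")  # prints "0.80 years per doubling"
```

Under that assumption the index doubled roughly every ten months, faster than the 18 to 24 months usually associated with Moore’s Law.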
How BIG is it?
How BIG is it? [Figure: years 2001 through 2008]
We’ve had BIG Data needs for a long time. In 1998 Google won the search race through custom software & infrastructure
We’ve had BIG Data needs for a long time. In 2002 Amazon again wrote custom & proprietary software to handle their BIG Data needs
We’ve had BIG Data needs for a long time. In 2006 Facebook started with off the shelf software, but quickly turned to developing their own custom built solutions
Ability to handle big data is one of the largest factors in determining winners vs losers.
For over a decade BIG Data = custom software
Why all this talk about BIG Data now?
In the past few years open source software emerged enabling ‘us’ to handle BIG Data
The Big Data Story
Is actually two stories
Doers & Tellers talking about different things http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
Tellers
Doers
Doers talk a lot more about actual solutions
They know it’s a two sided story: Storage & Processing
Takeaways: MongoDB and Hadoop. MongoDB for storage & operations; Hadoop for processing & analytics.
How MongoDB enables big data
• Flexible schema
• Horizontal scale built in & free
• Operates at near speed of memory
• Optimized for modern apps
MongoDB @ Orbitz. Rob Lancaster, October 23, 2012
Use Cases
• Hotel Data Collection
• Hotel Rate Feed:
  • Supply hotel rates to Google for their Hotel Finder
  • Uses MongoDB:
    – Maintain state of data sent to Google
    – Identify changes in rates as they occur
  • Makes use of complex querying, secondary indexing
• EasyLoader:
  • Feature allowing suppliers to easily load inventory to Orbitz
  • Uses MongoDB to persist all changes for auditing purposes
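The rate feed idea (keep the state of what was last sent, then identify changes as they occur) can be sketched in a few lines of Python. This is only an illustration: a plain dict stands in for the MongoDB collection that holds the sent state, and the function name, the `(hotel_id, date)` key shape, and the sample rates are all assumptions rather than details from Orbitz.

```python
def changed_rates(last_sent, current):
    """Return the rates that are new or differ from what was last sent.

    Both arguments map a hypothetical (hotel_id, date) key to a rate.
    """
    return {key: rate for key, rate in current.items()
            if last_sent.get(key) != rate}


# A dict stands in for the stored state of data already sent downstream.
last_sent = {("hotel-1", "2012-10-23"): 129.00,
             ("hotel-2", "2012-10-23"): 89.00}
current = {("hotel-1", "2012-10-23"): 119.00,   # changed
           ("hotel-2", "2012-10-23"): 89.00,    # unchanged
           ("hotel-3", "2012-10-23"): 240.00}   # new

updates = changed_rates(last_sent, current)
last_sent.update(updates)  # remember what has now been sent
print(sorted(updates))
```

The sketch only covers new and changed rates; withdrawn rates would need a separate pass over keys that are missing from `current`.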
Hotel Data Collection
• Goals:
  • Better understand performance of our hotel path
  • Optimize our hotel rate cache
• Methods:
  • Capture every action performed by our hotel search engine.
  • Persist this data for long periods.
• Challenges:
  • Need high performance capture.
  • Scalable, inexpensive storage.
Requirements
Collection:
• High write throughput
  • 500 servers
  • > 100 million documents/day
• Flexibility
  • Complex extendable documents
  • No forced schema
• Scalability
• Simplicity
Storage & Processing:
• High data volume
  • ~500 GB/day
  • 7 TB/month compressed
• Scalable
• Inexpensive
• Proximity with other data
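Some back-of-the-envelope arithmetic helps put these requirements in perspective. The sketch below takes the slide's figures at face value and makes several assumptions that the slide does not state: that the ~500 GB/day and the 100 million documents/day describe the same data, that "GB" is decimal, that load is spread evenly over the day, and that the 500 servers are the sources of the documents. Because the slide says "> 100 million", the per-second rate is a lower bound and the average document size an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide (see the assumptions listed above).
docs_per_day = 100_000_000        # "> 100 million documents/day"
bytes_per_day = 500 * 10**9       # "~500 GB/day", decimal gigabytes assumed
servers = 500                     # "500 servers"

docs_per_second = docs_per_day / SECONDS_PER_DAY      # average, not peak
avg_doc_bytes = bytes_per_day / docs_per_day          # upper bound
docs_per_server_per_day = docs_per_day / servers

print(f"{docs_per_second:.0f} documents/second on average")
print(f"{avg_doc_bytes:.0f} bytes per document at most")
print(f"{docs_per_server_per_day:.0f} documents per server per day")
```

Real traffic is rarely spread evenly, so peak write rates would be higher than the average computed here.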
The Solution
• Utilize MongoDB as a collector:
  • ~ 500 clients
  • Utilize unsafe writes for high throughput
  • Heterogeneous documents
  • New collection for each hour
• HDFS for storage & processing:
  • Data moved via M/R job:
    – One job per collection
    – One mapper per MongoDB instance
  • Additional processing and analysis by other jobs
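The "new collection for each hour" and "one job per collection" pattern can be sketched as follows. This is a minimal illustration in plain Python, not Orbitz's implementation: the `hotel_events` prefix, the naming scheme, and both function names are assumptions, and no database or Hadoop calls are made.

```python
from datetime import datetime, timedelta, timezone


def hourly_collection_name(ts, prefix="hotel_events"):
    """Collection name for the hour containing ts (a timezone-aware datetime)."""
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y%m%d%H}"


def closed_collections(names, now, prefix="hotel_events"):
    """Collections whose hour has ended, so an export job can pick them up.

    The fixed-width timestamp suffix makes string order match time order.
    """
    current = hourly_collection_name(now, prefix)
    return sorted(n for n in names
                  if n.startswith(prefix + "_") and n < current)


now = datetime(2012, 10, 23, 14, 5, tzinfo=timezone.utc)
names = [hourly_collection_name(now - timedelta(hours=h)) for h in range(3)]

# One export job would be scheduled per closed collection.
for name in closed_collections(names, now):
    print("export", name)
```

Writing to a fresh collection each hour means an export job never competes with live writes, and old collections can be dropped whole once they have been copied. The "unsafe writes" on the slide most likely refer to unacknowledged writes (a write concern of `w=0` in MongoDB's terms); the sketch does not model that part.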
Challenges & Conclusions
• Challenges? None really.
• Achieved a robust and simple solution
• MongoDB has been entirely worry free
• Very high write throughput
• Reads (well, full collection dumps across the wire) are slower
The Future of BIG data
What is BIG? BIG today is normal tomorrow
Data Growth [Chart: millions of URLs per year, 2000 through 2011, rising from 1 to 9,000]
How BIG is it?
How BIG is it? [Figure: years 2005 through 2012]
2012: Generating over 250 million tweets per day
MongoDB enables us to scale with the redefinition of BIG. Tools like Hadoop are enabling us to process the new BIG.
MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more