James Governor's Monkchips

The Guardian: NoSQL EU. Don’t Melt The Database


Matthew Wall - the Guardian

What follows is something like a live blog, based on comments from Matthew Wall and Simon Willison from The Guardian at the NoSQL EU conference in London today.
Wall kicked off the talk with a question about NoSQL: is it a good name for the phenomenon? He says not really, pointing out the absurdity of calling SQLite and MySQL “old world databases” as opposed to “new world” key value stores.

[This point resonates strongly with RedMonk thinking. Stephen and I have both been wary of reductionist approaches to defining NoSQL – we feel Hadoop style Big Data for example should be thought of as a related trend]

Where is The Guardian today? It’s a modern, information-driven web site driven by tags and feeds.

“It’s a traditional three tier web app, with a large Oracle database at the center of the world. People might have thought we’re cooler than that, but we’re not.”

“The Guardian took the decision to stick with the traditional relational model 5 years ago. The kind of tools we’re beginning to use weren’t as mature back then. A key reason for sticking with Oracle was the maturity of the surrounding tools ecosystem – performance management and optimisation, backup – and available skills.

SQL has worked well for the paper. SQL is great. We can do cool stuff with it. At scale.”

Searching one tag is ok, but what about two? What does it do to the database?
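To make the two-tag question concrete, here is a minimal sketch using SQLite from Python’s standard library. The `content` and `content_tag` tables and their rows are hypothetical, not the Guardian’s actual Oracle schema. The point is only that asking for content carrying several tags at once means grouping (or self-joining) over the tagging table, which gets heavier as tags and content multiply.

```python
import sqlite3

# Hypothetical schema and data for illustration only;
# this is not the Guardian's real data model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (id INTEGER PRIMARY KEY, headline TEXT);
    CREATE TABLE content_tag (content_id INTEGER, tag TEXT);
    INSERT INTO content VALUES
        (1, 'Budget analysis'), (2, 'Election night'), (3, 'Match report');
    INSERT INTO content_tag VALUES
        (1, 'politics'), (1, 'economics'),
        (2, 'politics'),
        (3, 'football');
""")

def content_with_all_tags(conn, tags):
    """Return headlines of content that carries every tag in `tags`."""
    placeholders = ",".join("?" for _ in tags)
    sql = f"""
        SELECT c.headline
        FROM content c
        JOIN content_tag ct ON ct.content_id = c.id
        WHERE ct.tag IN ({placeholders})
        GROUP BY c.id
        HAVING COUNT(DISTINCT ct.tag) = ?
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, (*tags, len(tags)))]

one_tag = content_with_all_tags(conn, ["politics"])
two_tags = content_with_all_tags(conn, ["politics", "economics"])
print(one_tag)
print(two_tags)
```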

“Related content” was 40% of the Guardian’s app load, so the team used a search engine instead. The search engine approach – using Apache Solr – worked well, but scale issues were still likely to become a problem.

Willison suggested the Guardian stick a massive memcached in front instead.

It worked. But what about throwing more resource at Oracle instead?

“We wanted to avoid Oracle RAC because it’s really expensive, but we want to scale out”.

[Oracle RAC is the database giant’s clustering technology.]

The Guardian’s Business Drivers: linked data, social networks – there are all sorts of information out there. We need to engage with them. We can’t just broadcast the news…

The Guardian’s editor called for the organisation to Mutualise the News.
“We’re changing the platform because of the business change. New technologies: we have a real need to use them… blurring the line between journalists and readers.”

“Journalism is becoming the curation of all the world’s information”.

[Note: Google’s automated curation seems to be winning at this point… which explains why the Guardian is responding in the way it is.]

What happens with API access, which drives, for example, tag proliferation, which in turn dramatically increases load on the database?

“Apache Solr is like a database, it works like one for us”

Fields can be multi-value: one piece of content with five tags can be stored in one field. Most important is that Solr offers the ability to facet the content. Apply it *like* a tag…

For example, an editor’s star rating: we can facet on that for free, and just jump to all the three star albums. Facets can be combined much more quickly than in a relational database.

With Solr we can perform complex queries and filter by facets.
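As a rough illustration of what a faceted request looks like, the sketch below builds a Solr select URL using Solr’s standard query parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). The field names (`type`, `tag`, `star_rating`) and the core URL are assumptions made up for this example, not the Guardian’s schema. Nothing is sent over the network; the code only constructs and inspects the URL.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def solr_select_url(base_url, query, filters=(), facet_fields=()):
    """Build a Solr /select URL with filter queries and field facets."""
    params = [("q", query), ("wt", "json")]
    # Each fq narrows the result set without affecting relevance scoring.
    params += [("fq", f) for f in filters]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", f) for f in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"

url = solr_select_url(
    "http://localhost:8983/solr/content",  # assumed local Solr core
    "type:album_review",                   # hypothetical field names
    filters=["star_rating:3", "tag:music"],
    facet_fields=["tag", "star_rating"],
)
decoded = parse_qs(urlsplit(url).query)
print(url)
```

Because `tag` would be a multi-valued field, a single document can count towards several facet values at once, which is what makes the “jump to all the three star albums” style of navigation cheap.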

“On our data set, most queries are about the same cost. No transactions.”

With Solr, schema design is very important – the schemas are more flexible and fuzzy than relational ones.

This is about getting data out of the system: powering the Guardian’s iPad app, site components and editors’ tools off the API, with far more to follow. But what about getting data in?

The Guardian has also built a simple REST/HTTP framework, for example for sucking in live football scores, i.e. apps that don’t affect the data store.

At this point the talk speeded up dramatically. Willison talks a lot faster than Wall. Never mind the high level stuff – if you’re a real dork I recommend you go straight to the source and check out Simon’s slides from the Redis workshop at NoSQL EU.

NoSQL for journalism

“I am working at the Guardian because I am interested in the opportunity to build rapid prototypes that go live: apps that live for two or three days. My interest is how NoSQL can help support journalism.”

Rapid prototyping needs things that scale down as well as up, and that handle massive spikes (if you’re on the front page). The quickest way to do lookups was to use Redis.

Version 1 of the Guardian’s Investigate Your MPs Expenses app was not Redis enabled.

The initial application generated 468k rows, randomised, every time someone hit the button!
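The talk notes do not say exactly how Redis replaced this, so the following is only one plausible approach: keep the ids of the remaining rows in a set and pick one at random in constant time, instead of randomising every row per click. In Redis that corresponds to a set plus the `SRANDMEMBER` command. The sketch uses a pure-Python stand-in (`RandomPickSet`, a name invented here) so it runs without a Redis server.

```python
import random

class RandomPickSet:
    """Set with O(1) add, remove and random pick (stand-in for a Redis set)."""

    def __init__(self, rng=None):
        self._items = []   # dense list so random.choice is O(1)
        self._index = {}   # item -> position in self._items
        self._rng = rng or random.Random()

    def add(self, item):
        if item not in self._index:
            self._index[item] = len(self._items)
            self._items.append(item)

    def remove(self, item):
        pos = self._index.pop(item, None)
        if pos is None:
            return
        last = self._items.pop()
        if pos < len(self._items):
            # Move the former last item into the gap left by `item`.
            self._items[pos] = last
            self._index[last] = pos

    def random_member(self):
        return self._rng.choice(self._items) if self._items else None

    def __contains__(self, item):
        return item in self._index

    def __len__(self):
        return len(self._items)

# Hypothetical usage: 468k row ids, one random pick per button click.
unreviewed = RandomPickSet(random.Random(0))
for row_id in range(468_000):
    unreviewed.add(row_id)

picked = unreviewed.random_member()
unreviewed.remove(picked)  # e.g. once someone has reviewed it
print(len(unreviewed))
```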

Guardian Zeitgeist, meanwhile, doesn’t use Redis. The app attempts to highlight stories on the Guardian that are interesting, based on the amount of conversation about them on social networks. It looks for peaks, i.e. a page on the Guardian’s Environment section that gets more traffic than normal.

So use message queues and cron jobs: pull data, put it on a task queue, then calculate hotness. Feed the results into Big Table, running on Google AppEngine, which is not great at complex queries, but good at simple select and sort.
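The talk did not spell out the hotness formula, so the sketch below is purely an assumption about what “more traffic than normal” could mean: score each page by its current traffic relative to its own recent average, then store the score so the datastore only ever has to sort by one column. The page URLs and numbers are made up.

```python
from statistics import mean

def hotness(current_views, recent_views):
    """Ratio of current traffic to the page's own recent baseline.

    The +1 avoids division by zero for pages with no history.
    This formula is an illustrative assumption, not Zeitgeist's.
    """
    baseline = mean(recent_views) if recent_views else 0
    return current_views / (baseline + 1)

def hottest(pages, top_n=3):
    """Precompute scores, then do the 'simple select and sort'."""
    scored = [(hotness(p["current"], p["history"]), p["url"]) for p in pages]
    scored.sort(reverse=True)
    return [url for _, url in scored[:top_n]]

# Made-up sample data.
pages = [
    {"url": "/environment/a", "current": 900, "history": [100, 120, 80]},
    {"url": "/football/b", "current": 5000, "history": [4800, 5200, 5000]},
    {"url": "/politics/c", "current": 300, "history": [290, 310, 300]},
]
ranking = hottest(pages, top_n=2)
print(ranking)
```

Note that the busy football page does not win: its traffic is high but normal for that page, whereas the environment page is running at roughly nine times its usual level.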

“Using Big Table as a dumping ground for data you can sort by 1 or 2 columns when you need to”

Talking of dumping grounds… Guardian employees were effectively creating data sets that, if they didn’t make it into the paper as infographics, weren’t used. Raw numbers were being collected and cleaned up. Today the underlying data will be in a Google Docs spreadsheet, and made accessible on the Guardian website accordingly.

Guardian Datablog – a bunch of Google Docs spreadsheets. Retrieve data as CSV, XLS or JSON. Click “Make a copy” and run your own.

“We want to keep publishing arbitrary data sets, for example ‘output school league tables’ or ‘volcano information’. We want something schema free.”

Our first option is CouchDB. Create a schema-free database, then index in Solr.
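One way to bridge schema-free documents and a search index is to map each document’s keys onto typed dynamic fields. The sketch below assumes the suffix convention used in Solr’s stock example schema (`*_s` string, `*_i` integer, `*_f` float, `*_b` boolean); whether the Guardian did it this way was not stated. The function name and the sample documents are hypothetical, and nothing here talks to CouchDB or Solr.

```python
def to_solr_fields(doc):
    """Flatten a schema-free JSON document into dynamic-field names.

    Assumes a Solr schema that defines *_s, *_i, *_f and *_b
    dynamic fields, as the stock example schema does.
    """
    suffix_for = {bool: "_b", int: "_i", float: "_f", str: "_s"}
    fields = {"id": doc.get("_id")}
    for key, value in doc.items():
        if key.startswith("_"):
            continue  # skip CouchDB metadata such as _id and _rev
        suffix = suffix_for.get(type(value))
        if suffix is not None:
            fields[key + suffix] = value
        # Nested or unsupported values are left out of this sketch.
    return fields

# Two made-up documents with entirely different shapes.
school = {"_id": "school-1", "_rev": "1-abc", "name": "Example High",
          "rank": 12, "selective": False}
volcano = {"_id": "volcano-7", "name": "Example Peak",
           "height_m": 1666.0, "last_eruption": 2010}

school_fields = to_solr_fields(school)
volcano_fields = to_solr_fields(volcano)
print(school_fields)
print(volcano_fields)
```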

We have changed from the relational database being at the center of the world to a mix of datastores and models.

Disclosure: Oracle is not a client. VMware, which is, recently acquired Redis.
