Building a super database from linked data

Post on 25-Jun-2015

description

Stephen Wang (http://stephenwang.com), Alivenotdead.com CTO. mongoDB Beijing presentation (March 3, 2011): From Rotten Tomatoes to alivenotdead.com to alive.cn, an explanation of how an entertainment database was built at each stage of its evolution. The current version is a multilingual global entertainment database using linked open data and mongoDB.

transcript

Building a super database from linked data

Stephen Wang 王傳仁

me@stephenwang.com

March 3, 2011

Who is this NOT for?

Building a large database from a tiny team

Organizing the world's information

Information innovation

Who IS this for?

About

Co-founder, CTO

Popular movie reviews web site

Aggregated reviews, comprehensive film database

The Stone Age

Static HTML templates

Editors read articles and pull quotations

Only cover the newest movies

~1000 films

Modern Times

Shift to LAMP

License long-tail database

Automated spiders, early UGC via critics

Use homegrown CMS for additional content

(How I felt maintaining Rotten Tomatoes' overloaded database servers)

8 million unique visitors / month

Lean startup: 25x traffic with 7 staff

Great site for film lovers (including Steve Jobs)

The Result

About

Co-founder, CTO

SNS for artists started with Daniel Wu 吴彦祖

Started with six artists, now 1,600 artists, 600K registered users

Also powers official web sites:

李连杰: JetLi.com

成龙: JackieChan.com

莫文蔚: KarenMok.com

Our LAMP stack: Not the best setup for...

Newsfeeds...

Viral loop analysis...

Multivariate testing...

The Problem?!?

Scalability issues with real-time data, but without traffic from public, long-tail content

About

A better entertainment database

Providing the long-tail content

Still a part of alivenotdead.com

Still in alpha

Features

Comprehensive info for celebrities, films, music, and TV

Searchable, structured data

Multilingual: English, Chinese, Japanese

Aggregated social media from inside/outside China

Why use mongoDB?

Flexible schema for different data sources

Dozens of other sources...
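The "flexible schema" point can be illustrated with a minimal sketch. It assumes three made-up records from different kinds of sources; every field name and value is hypothetical and is not the real alive.cn schema. A plain Python list stands in for a single mongoDB collection so the example runs without a server.

```python
# Hypothetical records from three different sources; all field names are
# illustrative assumptions, not the real alive.cn schema.
film_from_reviews = {
    "type": "film",
    "name": {"en": "Example Film"},
    "reviews": [{"critic": "A. Critic", "fresh": True}],
}
artist_from_sns = {
    "type": "artist",
    "name": {"en": "Example Artist", "zh": "示例艺人"},
    "fan_count": 1200,
}
album_from_wiki = {
    "type": "album",
    "name": {"en": "Example Album", "ja": "サンプルアルバム"},
    "tracks": ["Track One", "Track Two"],
    "source": "wiki",
}

# A plain list stands in for a single MongoDB collection here.
topics = [film_from_reviews, artist_from_sns, album_from_wiki]


def find(collection, **criteria):
    """Return documents whose top-level fields match every criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def languages(doc):
    """List the languages a document has a name in, sorted for stability."""
    return sorted(doc.get("name", {}))
```

Records with different shapes sit side by side, and a query such as `find(topics, type="artist")` simply skips documents that lack a field, with no schema migration needed when a new source adds new fields.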

Why use mongoDB?

Scalable big data

500,000 translations

Next challenge:

Aggregating and storing the social media firehose

2 million+ topics covered

Why use mongoDB?

Crossing the border...

alive.tom.com in Tianjin

Alivenotdead.com in Hong Kong

Use replica sets/eventual consistency to overcome frequent cross-border network issues
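As a rough illustration of eventual consistency over an unreliable link, the toy model below has one node accept writes into an append-only log while a second node replays that log whenever the connection is up. It is a simplified sketch, not mongoDB's actual replication protocol; the class names, the `link_up` flag, and the choice of Hong Kong as the writing side are all assumptions made for the example.

```python
class Primary:
    """Accepts writes and records each one in an append-only operation log."""

    def __init__(self):
        self.data = {}
        self.oplog = []  # list of (key, value) operations, in write order

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))


class Secondary:
    """Replays the primary's log from wherever it last stopped."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def sync(self, primary, link_up):
        """Pull and apply new entries; do nothing while the link is down."""
        if not link_up:
            return 0
        pending = primary.oplog[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)


# Which site takes the writes is an assumption made for this sketch.
hong_kong = Primary()
tianjin = Secondary()

hong_kong.write("topic:1", {"name": "Example Artist"})
tianjin.sync(hong_kong, link_up=True)

# Cross-border link drops: writes continue, the secondary serves stale data.
hong_kong.write("topic:1", {"name": "Example Artist", "fans": 10})
hong_kong.write("topic:2", {"name": "Example Film"})
tianjin.sync(hong_kong, link_up=False)
stale = dict(tianjin.data)

# Link recovers: the secondary replays the backlog and converges.
caught_up = tianjin.sync(hong_kong, link_up=True)
```

While the link is down the second node keeps answering reads from slightly old data, and once the link returns it catches up without any manual repair.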

Wikipedia as structured data

Creative Commons license

Multiple CC sources

Organized taxonomy

Acquired by Google

No Chinese/Japanese yet!

Using Linked Open Data

Wikipedia as structured data

Creative Commons license

Only Wikipedia

Messy taxonomy

Chinese/Japanese topic translations, but requires English topic link

Using Linked Open Data

Use Freebase organized taxonomy, broad data

Expand DBpedia to Chinese-only topics

Same methodology across Chinese wiki sources
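The combination described above can be sketched as a join on the English topic name: an organized taxonomy supplies the types, and wiki language links supply the translated names. The records below are hand-made and hypothetical; the IDs and field names are assumptions and do not reflect the real Freebase or DBpedia formats.

```python
# Hypothetical, hand-made records; IDs and field names are assumptions and do
# not reflect the real Freebase or DBpedia formats.
taxonomy = {  # stands in for an organized, English-keyed taxonomy
    "Example Artist": {"id": "t1", "types": ["person", "actor"]},
    "Example Film": {"id": "t2", "types": ["film"]},
}
translations = [  # stands in for wiki language links
    {"en": "Example Artist", "zh": "示例艺人", "ja": "サンプル俳優"},
    {"en": "Example Film", "zh": "示例电影"},
    {"en": None, "zh": "仅中文条目"},  # Chinese-only topic, no English link
]


def link_topics(taxonomy, translations):
    """Merge translations into taxonomy entries via the English topic name."""
    linked, unlinked = {}, []
    for entry in translations:
        topic = taxonomy.get(entry.get("en"))
        if topic is None:
            unlinked.append(entry)  # needs another way to be matched
            continue
        names = {lang: name for lang, name in entry.items() if name}
        linked[topic["id"]] = {"types": topic["types"], "names": names}
    return linked, unlinked


linked, unlinked = link_topics(taxonomy, translations)
```

Entries that end up in `unlinked` are the Chinese-only topics: they have no English topic link to join on, which is why they need separate handling.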

The Future

Developer API

Topic extraction

Real-time trends across languages

Other verticals

Already 10x more data than Rotten Tomatoes...

The complete sum of information from across the web...

Information not constrained by language...

We're hiring PHP engineers! Send your CV to me@stephenwang.com

My blog: http://stephenwang.com