Date posted: 12-Jan-2015
Category: Education
Uploaded by: andraz-tori
Today's plan
• Short story of Zemanta
• The Zemanta technology
Where am I right now?
Wonders of modern communication
Ljubljana
Strip mine
• A system for Slovenian National television in 2006
• Closed captioning web page for each episode of each show
• Natural Language Processing, Information Retrieval...
Start-up? Why not?
Tour de Slovénie
Sales
Seedcamp
• First European program inspired by YC (2007)
• London based
• 3 months, 50,000 EUR for 10%
Roller coaster
• 12 August: Deadline
• 20 August: Shortlist
• 23 August: Phone interview
• 24 August: Results
• 3 September: London week start
• 7 September: London week end
• 16 September: ==> London
3 months in London
Back to Ljubljana
• Figuring out that the US is our target market
• Figuring out where in the US to be and who to have here
• Partnerships
• And naturally the business model
And then ...
Technology
What do we do?
• Zemanta – Personal Writing Assistant
- on your current platform
• While bloggers write, we suggest:
- images
- related articles
- in-text links
- tags
Some stats
• 80k bloggers monthly
• 1.3 million posts enhanced in 2011
How does it work
• Natural Language Processing
- Big database of “meanings” (entities, concepts, topics)
- Word Sense Disambiguation
- Linking out to Wikipedia, Freebase, …
- Categorization, Named Entity Recognition
• Information Retrieval
- Solr based, using features from NLP
- With some twists
[Architecture diagram. Boxes: Plain text (article), Analysis, Background knowledge, Semantic search, Indexed content, Content suggestions]
“Text Understanding”
- Input is a meaningful chunk of text (not a keyword or a phrase)
- Input is in (semi-)English
- Has to work across all domains in the open world
  - music, celebrities, finance, entertainment, politics, gardening, parenting, …
Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web
- Includes linguistic and semantic properties and unstructured data
- Present in two forms:
  - in the “original” custom-built triple store on top of MySQL (150 GB)
  - processed into a 7 GB optimized “memory mapped dump”
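The slides do not describe the layout of the “memory mapped dump”, so the following is only a minimal sketch of the general idea: fixed-width records sorted by key, mapped read-only and binary-searched in place, without loading the file into process memory. The record format (a 32-bit entity id plus a 32-bit float score), the file name, and the functions `write_dump` and `lookup` are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: entity id (uint32) and a score (float32).
RECORD = struct.Struct("<If")


def write_dump(path, records):
    """Write records sorted by id so lookups can binary-search the file."""
    with open(path, "wb") as f:
        for entity_id, score in sorted(records):
            f.write(RECORD.pack(entity_id, score))


def lookup(mm, entity_id):
    """Binary search over the mapped bytes; returns the score or None."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        found_id, score = RECORD.unpack_from(mm, mid * RECORD.size)
        if found_id == entity_id:
            return score
        if found_id < entity_id:
            lo = mid + 1
        else:
            hi = mid
    return None


def demo():
    """Build a tiny dump in a temp directory and query it through mmap."""
    path = os.path.join(tempfile.mkdtemp(), "entities.dump")
    write_dump(path, [(42, 0.5), (7, 0.25), (1000, 0.75)])
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return lookup(mm, 42), lookup(mm, 8)
```

Because the operating system pages the file in on demand, several worker processes can share one mapped dump, which is one plausible reason to prefer it over querying the MySQL-backed store at analysis time.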
Analysis pipeline
[Pipeline diagram. Boxes:]
• Named Entity Extraction
• Known phrases extraction (Aho-Corasick)
• Triple store
• Surface form features evaluation
• Statistical comparison to background knowledge
• Semantic coherence and hand-tuned heuristics
• Disambiguated entities
• etc.
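The known-phrases step names Aho-Corasick, which finds every occurrence of a whole dictionary of phrases in a single pass over the text. Below is a minimal character-level sketch of that algorithm; it is not Zemanta's implementation, and the phrase list and the function names `build_automaton` and `find_phrases` are made up for illustration.

```python
from collections import deque


def build_automaton(phrases):
    """Build the trie, failure links and output lists for the phrases."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> phrases that end at this node
    for phrase in phrases:
        node = 0
        for ch in phrase:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(phrase)

    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes derive
    # their failure link from the parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit phrases that end at the suffix node as well.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_phrases(text, automaton):
    """Return (start_index, phrase) for every dictionary phrase in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for phrase in out[node]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches


automaton = build_automaton(["new york", "york", "new york times"])
hits = find_phrases("the new york times", automaton)
```

Overlapping candidates such as "new york" and "new york times" are all reported; choosing between them is left to the later stages of the pipeline.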
Connecting content
• Indexing blogosphere and mediasphere
• Solr based index
• Twist: complicated queries – 50 terms
• Filtering out spam is “fun”
• Probably best “related content” in terms of accuracy
• Coming soon: social signal
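The “complicated queries” twist can be pictured as turning the analysis output into one long boosted query. The sketch below is an assumption about how such a query might be assembled, not the actual Zemanta query: the field name `text`, the weights, and the function `build_solr_params` are hypothetical, while `q`, `rows`, `fl` and the `^` boost operator are standard Solr/Lucene query features.

```python
from urllib.parse import urlencode


def escape_phrase(term):
    """Escape backslashes and double quotes so the term is safe inside quotes."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def build_solr_params(weighted_terms, field="text", max_terms=50, rows=10):
    """Build select parameters from {term: weight}, keeping the top terms."""
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
    clauses = [
        f'{field}:"{escape_phrase(term)}"^{weight:.2f}'
        for term, weight in top[:max_terms]
    ]
    return {"q": " OR ".join(clauses), "rows": rows, "fl": "id,score"}


params = build_solr_params({"Seedcamp": 1.5, "Ljubljana": 2.0})
# Query string to append to a Solr select URL
# (host and core name would be deployment specific).
query_string = urlencode(params)
```

Capping the number of clauses at 50 mirrors the figure on the slide; in practice the cut-off trades recall against query latency.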
But why just for bloggers?
Let's open up the API!
Some API users
Back to reality.
Age of “smart”
Blog me up, Scotty!
23 April 2012
Some takeaways
• Accelerators are good
• World is getting flatter
- But it will never be flat
• Start monetizing soon – to learn, not to earn
• Be where your market is
• Many markets left to innovate in
Thank you!