Welcome to hell
Penalties and Big Data
@portentint
ask me questions
Kinda cool
and a cautionary tale
How hard could machine learning be?
I was a history major
Used: Links we disavowed and/or Google pointed at and said “these are bad.”
Trained: Based on domains that were clear disavows
Trained: Based on links Google pointed out
Total: 250,000 domains
Then: Hand-reviewed the results (a random sample of 1,000 links)
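One way to draw a random hand-review sample like that, sketched in Python. The function names, the fixed seed, and the agreement-rate check are illustrative assumptions, not the tooling used for this project.

```python
import random


def sample_for_review(links, k=1000, seed=42):
    """Draw up to k links uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    return rng.sample(links, min(k, len(links)))


def agreement_rate(reviewed):
    """Share of reviewed links where the human label matches the model's.

    `reviewed` is a list of (model_says_bad, human_says_bad) boolean pairs.
    """
    if not reviewed:
        raise ValueError("nothing reviewed yet")
    matches = sum(1 for model, human in reviewed if model == human)
    return matches / len(reviewed)


# Hypothetical stand-in for the scored link list.
links = [f"https://site{i}.example/page" for i in range(5000)]
review_set = sample_for_review(links)
print(len(review_set))
```

Sampling without replacement keeps the same link from being reviewed twice, and the fixed seed lets a second reviewer pull the identical set.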
Lesson 1
Excel doesn’t like 250,000 row spreadsheets
Work with text files, or use SQLite
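A minimal sketch of the SQLite route, using Python's built-in sqlite3 and csv modules. The table name, the two-column domain/label layout, and the sample rows are assumptions made for illustration; the real data would come from a text file on disk.

```python
import csv
import io
import sqlite3


def load_domains(conn, lines):
    """Load (domain, label) rows from a CSV-style text stream into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (domain TEXT PRIMARY KEY, label TEXT)"
    )
    reader = csv.reader(lines)
    # INSERT OR IGNORE skips duplicate domains instead of raising an error.
    conn.executemany(
        "INSERT OR IGNORE INTO links (domain, label) VALUES (?, ?)",
        (row for row in reader if len(row) == 2),
    )
    conn.commit()


def count_by_label(conn):
    """Return {label: row count} using a GROUP BY query."""
    rows = conn.execute("SELECT label, COUNT(*) FROM links GROUP BY label")
    return dict(rows.fetchall())


# Hypothetical sample standing in for a 250,000-line text file.
sample = io.StringIO(
    "spam-example.test,bad\n"
    "news-example.test,good\n"
    "spam-example.test,bad\n"
)
conn = sqlite3.connect(":memory:")
load_domains(conn, sample)
print(count_by_label(conn))
```

For a real file, pass an open file handle instead of the StringIO object and connect to a path on disk rather than ":memory:".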
Here are a few rules of thumb for detecting bad links.
These are not causal. They are correlations.
BigML: Upload source
BigML: Dataset
BigML: Model
Lesson 2
Crap data = crap results
Check your assumptions: Don’t behave like a history major trying to analyze 4 million records using machine learning.
Just enough to be dangerous
Lesson 3
These are not causal. These are correlations.
@portentint portent.co/playpenguin
That’s it