Date posted: 14-Feb-2017
Category: Data & Analytics
Uploaded by: alberto-paro
Alberto Paro, degree in Computer Engineering (POLIMI)
Author of 3 books on Elasticsearch covering versions 1.x to 5.x, plus 6 tech reviews
Works mainly in Scala and on Big Data technologies (Akka, Spray.io, Play Framework, Apache Spark) and NoSQL (Accumulo, Cassandra, Elasticsearch and MongoDB)
Evangelist for the Scala language and Scala.js
Elasticsearch 5.x Cookbook
Choose the best Elasticsearch cloud topology to deploy, and power it up with external plugins
Develop tailored mappings to take full control of indexing steps
Build complex queries by managing indices and documents
Optimize search results by executing analytics aggregations
Monitor the performance of the cluster and nodes
Install Kibana to monitor the cluster and extend it with plugins
Integrate Elasticsearch in Java, Scala, Python and Big Data applications
Discount code for ebook: ALPOEB50
Discount code for print book: ALPOPR15
Expiration date: 21st Feb 2017
Goals
Big Data architectures with ES: Apache Spark, GeoIngester
Data collection, index optimization, ingestion via Apache Spark, searching for a place
Overview of Big Data tools
Apache Spark
Written in Scala, with APIs in Java, Python and R
An evolution of the Map/Reduce model
Powerful bundled modules:
Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graphs)
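The Map/Reduce model that Spark generalizes can be sketched with plain Scala collections; a minimal word count (the canonical example, not from the deck):

```scala
// Word count over an in-memory "dataset", showing the three phases
val lines = List("spark streaming", "spark sql")
val counts = lines
  .flatMap(_.split(" "))                      // map: emit one record per word
  .groupBy(identity)                          // shuffle: group records by key
  .map { case (word, ws) => word -> ws.size } // reduce: aggregate each group
```

Spark applies the same shape to distributed datasets, with the shuffle moving data between nodes.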
GeoNames
GeoNames is a geographical database, freely downloadable under a Creative Commons license.
It contains about 10 million geographical names and consists of about 9 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names.
It can easily be downloaded from http://download.geonames.org/export/dump as CSV files.
The code is available at: https://github.com/aparo/elasticsearch-geonames-locator
GeoNames - Structure
No. | Attribute name  | Explanation
1   | geonameid       | Unique ID for this geoname
2   | name            | The name of the geoname
3   | asciiname       | ASCII representation of the name
4   | alternatenames  | Other forms of this name, generally in several languages
5   | latitude        | Latitude in decimal degrees
6   | longitude       | Longitude in decimal degrees
7   | fclass          | Feature class, see http://www.geonames.org/export/codes.html
8   | fcode           | Feature code, see http://www.geonames.org/export/codes.html
9   | country         | ISO-3166 2-letter country code
10  | cc2             | Alternate country codes, comma separated, ISO-3166 2-letter
11  | admin1          | FIPS code (subject to change to ISO code)
12  | admin2          | Code for the second administrative division (a county in the US)
13  | admin3          | Code for the third-level administrative division
14  | admin4          | Code for the fourth-level administrative division
15  | population      | The population of the geoname
16  | elevation       | The elevation in meters
17  | gtopo30         | Digital elevation model
18  | timezone        | The timezone of the geoname
19  | moddate         | The date of last change of this geoname
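A GeoNames dump row is one tab-separated line in the column order above; a small pure-Scala sketch of splitting one (the sample row below is made up for illustration):

```scala
// Hypothetical sample row with the 19 tab-separated GeoNames columns
val sample = "524901\tMoscow\tMoscow\tMoskva,Moscou\t55.75222\t37.61556\t" +
  "P\tPPLC\tRU\t\t48\t\t\t\t10381222\t\t144\tEurope/Moscow\t2012-01-17"

val cols = sample.split("\t", -1) // limit -1 keeps trailing empty columns
val geonameid = cols(0).toInt     // column 1: unique ID
val latitude  = cols(4).toFloat   // column 5: decimal degrees
val fclass    = cols(6)           // column 7: "P" = populated place
```

Note the `-1` limit: without it, `split` drops trailing empty columns and the row would not always have 19 fields.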
Index optimization – 1/2
Needed to:
Remove unneeded fields
Handle geo-point fields
Optimize string fields (text, keyword)
Pick the right number of shards (11M records => 2 shards)
Benefits => performance / space / CPU
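The "11M records => 2 shards" figure can be read as a sizing heuristic; a sketch, assuming a hypothetical cap of roughly 6 million documents per shard (the threshold is not stated in the deck):

```scala
// Hypothetical heuristic: cap each shard at ~6M documents,
// consistent with the deck's "11M records => 2 shards" rule of thumb
def shardCount(records: Long, maxDocsPerShard: Long = 6000000L): Int =
  math.max(1, math.ceil(records.toDouble / maxDocsPerShard).toInt)
```

In practice shard sizing also depends on document size, query load and hardware, so any such formula is only a starting point.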
Index optimization – 2/2
{
  "mappings": {
    "geoname": {
      "properties": {
        "admin1": { "type": "keyword", "ignore_above": 256 },
        …
        "alternatenames": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        …
        "location": { "type": "geo_point" },
        …
        "longitude": { "type": "float" },
        "moddate": { "type": "date" },
        …
Ingestion via Spark – GeonameIngester – 1/7
Our ingester performs the following steps:
1. Spark job initialization
2. CSV parsing
3. Definition of the indexing structure
4. Population of the classes
5. Writing the data to Elasticsearch
6. Running the Spark job
Ingestion via Spark – GeonameIngester – 2/7
Initializing a Spark job

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark

import scala.util.Try

object GeonameIngester {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()
Ingestion via Spark – GeonameIngester – 3/7
Parsing the CSV

val geonameSchema = StructType(Array(
  StructField("geonameid", IntegerType, false),
  StructField("name", StringType, false),
  StructField("asciiname", StringType, true),
  StructField("alternatenames", StringType, true),
  StructField("latitude", FloatType, true),
  ….

val GEONAME_PATH = "downloads/allCountries.txt"
val geonames = sparkSession.sqlContext.read
  .option("header", false)
  .option("quote", "")
  .option("delimiter", "\t")
  .option("maxColumns", 22)
  .schema(geonameSchema)
  .csv(GEONAME_PATH)
  .cache()
Ingestion via Spark – GeonameIngester – 4/7
Defining our classes for indexing

case class GeoPoint(lat: Double, lon: Double)

case class Geoname(geonameid: Int, name: String, asciiname: String,
  alternatenames: List[String], latitude: Float, longitude: Float,
  location: GeoPoint, fclass: String, fcode: String, country: String,
  cc2: String, admin1: Option[String], admin2: Option[String],
  admin3: Option[String], admin4: Option[String], population: Double,
  elevation: Int, gtopo30: Int, timezone: String, moddate: String)

implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) return None
  val clean = value.trim
  if (clean.isEmpty) None else Some(clean)
}
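With that implicit in scope, raw (possibly null or blank) column values convert automatically into the `Option[String]` admin fields of `Geoname`; a self-contained sketch (the implicit is repeated so the snippet compiles on its own):

```scala
import scala.language.implicitConversions

// Same conversion as in the ingester: null/blank strings become None
implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) None
  else {
    val clean = value.trim
    if (clean.isEmpty) None else Some(clean)
  }
}

// The compiler inserts the conversion at the assignment:
val admin1: Option[String] = "  48 " // trimmed to Some("48")
val admin2: Option[String] = ""      // blank column becomes None
```

This is why the population step can pass `row.getString(10)` directly where `Geoname` expects an `Option[String]`.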
Ingestion via Spark – GeonameIngester – 6/7
Populating our classes

val records = geonames.map { row =>
  val id = row.getInt(0)
  val lat = row.getFloat(4)
  val lon = row.getFloat(5)
  Geoname(id, row.getString(1), row.getString(2),
    Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
    lat, lon, GeoPoint(lat, lon),
    row.getString(6), row.getString(7), row.getString(8), row.getString(9),
    row.getString(10), row.getString(11), row.getString(12), row.getString(13),
    row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16),
    row.getString(17), row.getDate(18).toString)
}
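The `alternatenames` transformation in that map is worth isolating; the same logic as a standalone function, runnable without Spark:

```scala
// Split the raw comma-separated alternatenames column, trim each
// entry, and drop empties; null (missing column) becomes Nil
def parseAlternateNames(raw: String): List[String] =
  Option(raw)
    .map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList)
    .getOrElse(Nil)
```

Wrapping the raw value in `Option(...)` handles the null case that `row.getString(3)` can return for rows without alternate names.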
Ingestion via Spark – GeonameIngester – 7/7
Writing to Elasticsearch

EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname",
  Map("es.mapping.id" -> "geonameid"))
Running the Spark job

spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar

(~20 minutes on a single machine)
Searching for a place

curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "term": { "name": "moscow" } },
        { "term": { "alternatenames": "moscow" } },
        { "term": { "asciiname": "moscow" } }
      ],
      "filter": [
        { "term": { "fclass": "P" } },
        { "range": { "population": { "gt": 0 } } }
      ]
    }
  },
  "sort": [ { "population": { "order": "desc" } } ]
}'
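The query above sorts matches by population; since the index also has a `geo_point` field, results can instead be sorted by proximity. To see what such a geo sort computes, here is a pure-Scala haversine (great-circle distance) sketch, not taken from the deck:

```scala
import math._

// Great-circle distance in km between two (lat, lon) points,
// the quantity a geo-distance sort orders documents by
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val r = 6371.0 // mean Earth radius, km
  val dLat = toRadians(lat2 - lat1)
  val dLon = toRadians(lon2 - lon1)
  val a = pow(sin(dLat / 2), 2) +
    cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
  2 * r * asin(sqrt(a))
}
```

For example, Moscow (55.75, 37.62) to Saint Petersburg (59.93, 30.31) comes out at roughly 630 km.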
NoSQL
Key-Value: Redis, Voldemort, Dynomite, Tokio*
BigTable clones: Accumulo, HBase, Cassandra
Document: CouchDB, MongoDB, Elasticsearch
GraphDB: Neo4j, OrientDB, …
Message queues: Kafka, RabbitMQ, …
Language – Scala vs Java

public class User {
  private String firstName;
  private String lastName;
  private String email;
  private Password password;

  public User(String firstName, String lastName, String email, Password password) {
    this.firstName = firstName;
    this.lastName = lastName;
    this.email = email;
    this.password = password;
  }

  public String getFirstName() { return firstName; }
  public void setFirstName(String firstName) { this.firstName = firstName; }
  public String getLastName() { return lastName; }
  public void setLastName(String lastName) { this.lastName = lastName; }
  public String getEmail() { return email; }
  public void setEmail(String email) { this.email = email; }
  public Password getPassword() { return password; }
  public void setPassword(Password password) { this.password = password; }

  @Override
  public String toString() {
    return "User [email=" + email + ", firstName=" + firstName + ", lastName=" + lastName + "]";
  }

  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + ((firstName == null) ? 0 : firstName.hashCode());
    result = prime * result + ((lastName == null) ? 0 : lastName.hashCode());
    result = prime * result + ((password == null) ? 0 : password.hashCode());
    return result;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (getClass() != obj.getClass()) return false;
    User other = (User) obj;
    if (email == null) {
      if (other.email != null) return false;
    } else if (!email.equals(other.email)) return false;
    if (password == null) {
      if (other.password != null) return false;
    } else if (!password.equals(other.password)) return false;
    if (firstName == null) {
      if (other.firstName != null) return false;
    } else if (!firstName.equals(other.firstName)) return false;
    if (lastName == null) {
      if (other.lastName != null) return false;
    } else if (!lastName.equals(other.lastName)) return false;
    return true;
  }
}
The Scala equivalent of the entire Java class above:

case class User(var firstName: String, var lastName: String, var email: String, var password: Password)
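A quick self-contained demonstration of what the compiler derives for a case class (a hypothetical `User` without the `Password` field, so the snippet compiles on its own):

```scala
// equals, hashCode, toString and copy are all generated by the compiler
case class User(firstName: String, lastName: String, email: String)

val a = User("Ada", "Lovelace", "ada@example.com")
val b = a.copy(email = "countess@example.com") // non-destructive update
```

Structural equality and `copy` are exactly the boilerplate the Java version had to write (and get subtly wrong) by hand.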