Date posted: 14-Feb-2017
Category: Data & Analytics
Uploaded by: alberto-paro
Alberto Paro, degree in Computer Engineering (POLIMI)
Author of 3 books on Elasticsearch covering versions 1.x to 5.x, plus 6 tech reviews
Works mainly in Scala and on Big Data technologies (Akka, Spray.io, Play Framework, Apache Spark) and NoSQL (Accumulo, Cassandra, Elasticsearch and MongoDB)
Evangelist for the Scala language and Scala.js
Elasticsearch 5.x Cookbook
Choose the best Elasticsearch cloud topology to deploy, and power it up with external plugins
Develop tailored mappings to take full control of indexing steps
Build complex queries by managing indices and documents
Optimize search results by executing analytics aggregations
Monitor the performance of the cluster and nodes
Install Kibana to monitor the cluster and extend it with plugins
Integrate Elasticsearch in Java, Scala, Python and Big Data applications
Discount code for ebook: ALPOEB50
Discount code for print book: ALPOPR15
Expiration date: 21st Feb 2017
Goals
Big Data architectures with ES: Apache Spark, GeoIngester
Data collection, index optimization, ingestion via Apache Spark, searching for a place
Overview of Big Data tools
Apache Spark
Written in Scala, with APIs in Java, Python and R
An evolution of the Map/Reduce model
Powerful bundled modules:
Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graphs)
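The Map/Reduce model that Spark generalizes can be sketched with plain Scala collections; a minimal word count (the canonical example, not from the deck):

```scala
// Word count over an in-memory "dataset", showing the three phases
val lines = List("spark streaming", "spark sql")
val counts = lines
  .flatMap(_.split(" "))                      // map: emit one record per word
  .groupBy(identity)                          // shuffle: group records by key
  .map { case (word, ws) => word -> ws.size } // reduce: aggregate each group
```

Spark applies the same shape to distributed datasets, with the shuffle moving data between nodes.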
GeoNames
GeoNames is a geographical database, freely downloadable under a Creative Commons license.
It contains about 10 million geographical names and consists of about 9 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names.
It can easily be downloaded from http://download.geonames.org/export/dump as CSV files.
The code is available at: https://github.com/aparo/elasticsearch-geonames-locator
GeoNames - Structure
No. | Attribute name  | Explanation
1   | geonameid       | Unique ID for this geoname
2   | name            | The name of the geoname
3   | asciiname       | ASCII representation of the name
4   | alternatenames  | Other forms of this name, generally in several languages
5   | latitude        | Latitude in decimal degrees
6   | longitude       | Longitude in decimal degrees
7   | fclass          | Feature class, see http://www.geonames.org/export/codes.html
8   | fcode           | Feature code, see http://www.geonames.org/export/codes.html
9   | country         | ISO-3166 2-letter country code
10  | cc2             | Alternate country codes, comma separated, ISO-3166 2-letter
11  | admin1          | FIPS code (subject to change to ISO code)
12  | admin2          | Code for the second administrative division (a county in the US)
13  | admin3          | Code for the third-level administrative division
14  | admin4          | Code for the fourth-level administrative division
15  | population      | The population of the geoname
16  | elevation       | The elevation in meters
17  | gtopo30         | Digital elevation model
18  | timezone        | The timezone of the geoname
19  | moddate         | The date of last change of this geoname
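A GeoNames dump row is one tab-separated line in the column order above; a small pure-Scala sketch of splitting one (the sample row below is made up for illustration):

```scala
// Hypothetical sample row with the 19 tab-separated GeoNames columns
val sample = "524901\tMoscow\tMoscow\tMoskva,Moscou\t55.75222\t37.61556\t" +
  "P\tPPLC\tRU\t\t48\t\t\t\t10381222\t\t144\tEurope/Moscow\t2012-01-17"

val cols = sample.split("\t", -1) // limit -1 keeps trailing empty columns
val geonameid = cols(0).toInt     // column 1: unique ID
val latitude  = cols(4).toFloat   // column 5: decimal degrees
val fclass    = cols(6)           // column 7: "P" = populated place
```

Note the `-1` limit: without it, `split` drops trailing empty columns and the row would not always have 19 fields.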
Index optimization – 1/2
Needed to:
Remove unneeded fields
Handle geo-point fields
Optimize string fields (text, keyword)
Pick the right number of shards (11M records => 2 shards)
Benefits => performance / space / CPU
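The "11M records => 2 shards" figure can be read as a sizing heuristic; a sketch, assuming a hypothetical cap of roughly 6 million documents per shard (the threshold is not stated in the deck):

```scala
// Hypothetical heuristic: cap each shard at ~6M documents,
// consistent with the deck's "11M records => 2 shards" rule of thumb
def shardCount(records: Long, maxDocsPerShard: Long = 6000000L): Int =
  math.max(1, math.ceil(records.toDouble / maxDocsPerShard).toInt)
```

In practice shard sizing also depends on document size, query load and hardware, so any such formula is only a starting point.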
Index optimization – 2/2
{
  "mappings": {
    "geoname": {
      "properties": {
        "admin1": { "type": "keyword", "ignore_above": 256 },
        …
        "alternatenames": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        …
        "location": { "type": "geo_point" },
        …
        "longitude": { "type": "float" },
        "moddate": { "type": "date" },
        …
Ingestion via Spark – GeonameIngester – 1/7
Our ingester performs the following steps:
1. Spark job initialization
2. CSV parsing
3. Definition of the indexing structure
4. Population of the classes
5. Writing the data to Elasticsearch
6. Running the Spark job
Ingestion via Spark – GeonameIngester – 2/7
Initializing a Spark job

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark

import scala.util.Try

object GeonameIngester {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()
Ingestion via Spark – GeonameIngester – 3/7
Parsing the CSV

val geonameSchema = StructType(Array(
  StructField("geonameid", IntegerType, false),
  StructField("name", StringType, false),
  StructField("asciiname", StringType, true),
  StructField("alternatenames", StringType, true),
  StructField("latitude", FloatType, true),
  ….

val GEONAME_PATH = "downloads/allCountries.txt"
val geonames = sparkSession.sqlContext.read
  .option("header", false)
  .option("quote", "")
  .option("delimiter", "\t")
  .option("maxColumns", 22)
  .schema(geonameSchema)
  .csv(GEONAME_PATH)
  .cache()
Ingestion via Spark – GeonameIngester – 4/7
Defining our classes for indexing

case class GeoPoint(lat: Double, lon: Double)

case class Geoname(geonameid: Int, name: String, asciiname: String,
  alternatenames: List[String], latitude: Float, longitude: Float,
  location: GeoPoint, fclass: String, fcode: String, country: String,
  cc2: String, admin1: Option[String], admin2: Option[String],
  admin3: Option[String], admin4: Option[String], population: Double,
  elevation: Int, gtopo30: Int, timezone: String, moddate: String)

implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) return None
  val clean = value.trim
  if (clean.isEmpty) None else Some(clean)
}
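With that implicit in scope, raw (possibly null or blank) column values convert automatically into the `Option[String]` admin fields of `Geoname`; a self-contained sketch (the implicit is repeated so the snippet compiles on its own):

```scala
import scala.language.implicitConversions

// Same conversion as in the ingester: null/blank strings become None
implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) None
  else {
    val clean = value.trim
    if (clean.isEmpty) None else Some(clean)
  }
}

// The compiler inserts the conversion at the assignment:
val admin1: Option[String] = "  48 " // trimmed to Some("48")
val admin2: Option[String] = ""      // blank column becomes None
```

This is why the population step can pass `row.getString(10)` directly where `Geoname` expects an `Option[String]`.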
Ingestion via Spark – GeonameIngester – 6/7
Populating our classes

val records = geonames.map { row =>
  val id = row.getInt(0)
  val lat = row.getFloat(4)
  val lon = row.getFloat(5)
  Geoname(id, row.getString(1), row.getString(2),
    Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
    lat, lon, GeoPoint(lat, lon),
    row.getString(6), row.getString(7), row.getString(8), row.getString(9),
    row.getString(10), row.getString(11), row.getString(12), row.getString(13),
    row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16),
    row.getString(17), row.getDate(18).toString)
}
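The `alternatenames` transformation in that map is worth isolating; the same logic as a standalone function, runnable without Spark:

```scala
// Split the raw comma-separated alternatenames column, trim each
// entry, and drop empties; null (missing column) becomes Nil
def parseAlternateNames(raw: String): List[String] =
  Option(raw)
    .map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList)
    .getOrElse(Nil)
```

Wrapping the raw value in `Option(...)` handles the null case that `row.getString(3)` can return for rows without alternate names.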
Ingestion via Spark – GeonameIngester – 7/7
Writing to Elasticsearch

EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname",
  Map("es.mapping.id" -> "geonameid"))
Running the Spark job

spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar

(~20 minutes on a single machine)
Searching for a place

curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "term": { "name": "moscow" } },
        { "term": { "alternatenames": "moscow" } },
        { "term": { "asciiname": "moscow" } }
      ],
      "filter": [
        { "term": { "fclass": "P" } },
        { "range": { "population": { "gt": 0 } } }
      ]
    }
  },
  "sort": [ { "population": { "order": "desc" } } ]
}'
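The query above sorts matches by population; since the index also has a `geo_point` field, results can instead be sorted by proximity. To see what such a geo sort computes, here is a pure-Scala haversine (great-circle distance) sketch, not taken from the deck:

```scala
import math._

// Great-circle distance in km between two (lat, lon) points,
// the quantity a geo-distance sort orders documents by
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val r = 6371.0 // mean Earth radius, km
  val dLat = toRadians(lat2 - lat1)
  val dLon = toRadians(lon2 - lon1)
  val a = pow(sin(dLat / 2), 2) +
    cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
  2 * r * asin(sqrt(a))
}
```

For example, Moscow (55.75, 37.62) to Saint Petersburg (59.93, 30.31) comes out at roughly 630 km.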
NoSQL
Key-Value: Redis, Voldemort, Dynomite, Tokio*
BigTable clones: Accumulo, HBase, Cassandra
Document: CouchDB, MongoDB, Elasticsearch
GraphDB: Neo4j, OrientDB, …
Message queues: Kafka, RabbitMQ, …
Language – Scala vs Java

public class User {
  private String firstName;
  private String lastName;
  private String email;
  private Password password;

  public User(String firstName, String lastName, String email, Password password) {
    this.firstName = firstName;
    this.lastName = lastName;
    this.email = email;
    this.password = password;
  }

  public String getFirstName() { return firstName; }
  public void setFirstName(String firstName) { this.firstName = firstName; }
  public String getLastName() { return lastName; }
  public void setLastName(String lastName) { this.lastName = lastName; }
  public String getEmail() { return email; }
  public void setEmail(String email) { this.email = email; }
  public Password getPassword() { return password; }
  public void setPassword(Password password) { this.password = password; }

  @Override
  public String toString() {
    return "User [email=" + email + ", firstName=" + firstName + ", lastName=" + lastName + "]";
  }

  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + ((firstName == null) ? 0 : firstName.hashCode());
    result = prime * result + ((lastName == null) ? 0 : lastName.hashCode());
    result = prime * result + ((password == null) ? 0 : password.hashCode());
    return result;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (getClass() != obj.getClass()) return false;
    User other = (User) obj;
    if (email == null) {
      if (other.email != null) return false;
    } else if (!email.equals(other.email)) return false;
    if (password == null) {
      if (other.password != null) return false;
    } else if (!password.equals(other.password)) return false;
    if (firstName == null) {
      if (other.firstName != null) return false;
    } else if (!firstName.equals(other.firstName)) return false;
    if (lastName == null) {
      if (other.lastName != null) return false;
    } else if (!lastName.equals(other.lastName)) return false;
    return true;
  }
}
The Scala equivalent of the entire Java class above:

case class User(var firstName: String, var lastName: String, var email: String, var password: Password)
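A quick self-contained demonstration of what the compiler derives for a case class (a hypothetical `User` without the `Password` field, so the snippet compiles on its own):

```scala
// equals, hashCode, toString and copy are all generated by the compiler
case class User(firstName: String, lastName: String, email: String)

val a = User("Ada", "Lovelace", "ada@example.com")
val b = a.copy(email = "countess@example.com") // non-destructive update
```

Structural equality and `copy` are exactly the boilerplate the Java version had to write (and get subtly wrong) by hand.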