Mist
https://github.com/Hydrospheredata/mist
www.provectus.com
© Provectus, Inc.
1
• Leonid Blokhin
• Big Data Engineer
• +7 (917) 295 - 40 - 49
Mist
• HydroSphere
• Spark
• Why We Needed a Mist
• Running
• Configuration
• Spark Job at Mist
• Road Map
http://hydrosphere.io/
Hydrosphere is an open-source Big Data and analytics platform built with DevOps culture in mind.
http://spark.apache.org/
Apache Spark™ is a fast and general engine for large-scale data processing.
Mist
• Mist is a thin service on top of Spark which makes it possible to execute Scala & Python Spark Jobs
from application layers and get synchronous, asynchronous, and reactive results as well as provide
an API to external clients.
• It implements Spark as a Service and creates a unified API layer for building enterprise solutions
and services on top of a Big Data lake.
● HTTP and Messaging (MQTT) API
● Scala & Python Spark job execution
● Works with Standalone, Mesos, YARN — any Spark config
● Support for Spark SQL and Hive
● High Availability and Fault Tolerance
● Persists job state for self-healing
● Async and sync API, JSON job results
Why We Needed a Mist

Running
Build the project
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly
Create configuration file
Run
spark-submit --class io.hydrosphere.mist.Mist \
--driver-java-options "-Dconfig.file=/path/to/application.conf" \
target/scala-2.10/mist-assembly-0.2.0.jar
Configuration
# spark master url can be one of: local, yarn, mesos (local by default)
mist.spark.master = "local[*]"

# number of threads: one thread per job
mist.settings.threadNumber = 16

# http interface (off by default)
mist.http.on = true
mist.http.host = "192.168.10.13"
mist.http.port = 2003
# MQTT interface (off by default)
mist.mqtt.on = true
mist.mqtt.host = "192.168.10.33"
mist.mqtt.port = 1883
# mist listens on this topic for incoming requests
mist.mqtt.subscribeTopic = "foo"
# mist answers on this topic with the results
mist.mqtt.publishTopic = "foo"
# recovery job (off by default)
mist.recovery.on = true
mist.recovery.multilimit = 10
mist.recovery.typedb = "MapDb"
mist.recovery.dbfilename = "file.db"
# default settings for all contexts
# timeout for each job in a context
mist.contextDefaults.timeout = 100 days
# mist can kill the context after a job finishes (off by default)
mist.contextDefaults.disposable = false

# settings for SparkConf
mist.contextDefaults.sparkConf = {
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}
# settings can be overridden for each context
mist.contexts.foo.timeout = 100 days
mist.contexts.foo.sparkConf = {
  spark.scheduler.mode = "FIFO"
}

mist.contexts.bar.timeout = 1000 seconds
mist.contexts.bar.disposable = true

# mist can create contexts on start, so we don't waste time on the first request
mist.contextSettings.onstart = ["foo"]
Spark Job at Mist

Mist Scala Spark Job
To prepare your job to run on Mist, extend a Scala object from MistJob and implement the abstract method doStuff:
def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = ???
def doStuff(context: SQLContext, parameters: Map[String, Any]): Map[String, Any] = ???
def doStuff(context: HiveContext, parameters: Map[String, Any]): Map[String, Any] = ???
Example:
object SimpleContext extends MistJob {
override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
val rdd = context.parallelize(numbers)
Map("result" -> rdd.map(x => x * 2).collect())
}
}
Building Mist Jobs

Add Mist as a dependency in your build.sbt:

libraryDependencies += "io.hydrosphere" % "mist" % "0.2.0"
Mist Python Spark Job
Import mist and implement the method doStuff. The following Spark context aliases are available for convenience:
job.sc = SparkContext
job.sqlc = SQLContext
job.hc = HiveContext
For example:

import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        # job.parameters is a Scala/Java collection exposed to Python,
        # so it is unpacked element by element via head()/tail()/size()
        val = job.parameters.values()
        list = val.head()
        pylist = []
        count = 0
        while count < list.size():
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
mosquitto_pub -h 192.168.10.33 -p 1883 -m '{
  "jarPath": "/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
  "className": "SimpleContext$",
  "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]},
  "external_id": "12345678",
  "name": "foo"
}' -t 'foo'
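The same request body can also be assembled programmatically before publishing. A minimal sketch using Python's standard json module, with field names and values taken from the mosquitto_pub example above (the MQTT publish step itself is omitted):

```python
import json

# Build the Mist job request JSON; field names follow the
# mosquitto_pub example above.
def build_request(jar_path, class_name, parameters, external_id, name):
    return json.dumps({
        "jarPath": jar_path,
        "className": class_name,
        "parameters": parameters,
        "external_id": external_id,
        "name": name,
    })

request = build_request(
    "/vagrant/examples/target/scala-2.11/mist_examples_2.11-0.0.1.jar",
    "SimpleContext$",
    {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]},
    "12345678",
    "foo",
)
```

The resulting string can be handed to any MQTT client and published to the subscribe topic configured earlier.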
{
  "success": true,
  "payload": {"result": [2, 4, 6, 8, 10, 12, 14, 16, 18, 0]},
  "errors": [],
  "request": {
    "jarPath": "src/test/resources/mistjob_2.10-1.0.jar",
    "className": "SimpleContext$",
    "name": "foo",
    "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]},
    "external_id": "12345678"
  }
}
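A subscriber on the publish topic can unpack this response with Python's standard json module. A minimal sketch, using an abridged copy of the response shown above:

```python
import json

# The response string as delivered on the MQTT publish topic
# (abridged from the result shown above).
message = '''{
  "success": true,
  "payload": {"result": [2, 4, 6, 8, 10, 12, 14, 16, 18, 0]},
  "errors": [],
  "request": {"external_id": "12345678", "name": "foo"}
}'''

response = json.loads(message)

# Check success before reading the payload; the echoed request lets a
# client correlate the answer with its original external_id.
if response["success"]:
    result = response["payload"]["result"]
    external_id = response["request"]["external_id"]
```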
Road Map

● Super parallel mode: support for multiple JVMs
● Cluster mode and node framework
● Add logging
● RESTification
● Support for streaming contexts/jobs
● Apache Kafka support
● AMQP support
● Web UI
Your contributions are very welcome on Github!https://github.com/Hydrospheredata/mist
Thanks!
Questions?
Leonid Blokhin
Skype: leonid_niko
Email: [email protected]