Post on 16-Apr-2017
transcript
Microsoft R Server
Ing. Eduardo Castro, PhD
ecastrom@gmail.com
Data Science Specialization
Microsoft Data Platform MVP
PASS Regional Mentor
PASS Board Advisor
Organized by
http://tinyurl.com/ComunidadWindows
SQL Saturday sponsors
Platinum Sponsor
Diamond Sponsor
Bronze Sponsor
Sources consulted
This presentation includes slides taken from the following sources:
Revolution R Enterprise. Hong Ooi.
Data Science with Azure Machine Learning, SQL Server and R. Lukawiecki
Tutorials and demos: https://msdn.microsoft.com/en-us/library/mt590536.aspx
Data science
The scientific method of reasoning applied to data-driven decisions
Hypotheses, experiments, facts, logical reasoning
+ Data engineering
Data wrangling (munging), retrieval + storage
Data mining & machine learning
Statistics
Big data
How?
Data, models, business need
Machine learning ≣ data science
Explore data
Find patterns
Predict (scoring)
Available tools
Chart from "2014 Data Science Salary Survey" (ISBN 978-1-491-91842-5) © 2015 O'Reilly Media, used with permission. Arrows mine. For more info, and great titles on data science, visit oreilly.com
Data science tool #1: SQL
Microsoft SQL Server!
R language
SAS
Sometimes
On the rise
Suggested methodology
SSAS Data Mining: easy, visual, intuitive, Excel, it just works
R: descriptive statistics, a "feel" for your data, more algorithms
Azure ML: advanced algorithms, auto-tuning, web services, cloud!
Other Microsoft data science tools
HDInsight: Hadoop in the cloud + Storm (real-time analytics) + HBase (NoSQL) + Mahout (ML!)
Azure Stream Analytics: streaming data from the cloud, based on HDInsight / Hadoop
Also useful:
• Power BI: Power Query, Power View, and Dashboards
• Excel
• Azure Data Factory (ETL in the cloud)
• Analytics Platform System (SQL Server on steroids + Hadoop + hardware)
What is R?
• Interpreted language, poor IDE
• 5000+ statistical software packages
• Best IDE: RStudio http://www.rstudio.com/
• Rattle and OnePageR make it even easier
• Open source, free, cross-platform
• R Core: the purest version: http://cran.r-project.org/
• Revolution Analytics: parallelism and performance: http://www.revolutionanalytics.com/
• Azure ML: built-in
Limitations of open source R
• R needs the data in memory
• R has a single thread of execution
• R requires specialized skills to build a cluster
• R Open is community supported
Revolution R Enterprise provides a solution to this!
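To make the single-thread and cluster points concrete, here is a minimal sketch of what spreading work across local processes looks like in open source R, using only the base `parallel` package. It is an illustration, not how Revolution R Enterprise works internally; the built-in `mtcars` data set, the four-way split and the two-worker cluster are assumptions chosen for the example.

```r
library(parallel)

# Split the row indices of mtcars into 4 roughly equal chunks.
idx <- seq_len(nrow(mtcars))
chunks <- split(idx, cut(idx, 4, labels = FALSE))

# Start two local worker processes; this setup is manual in open source R.
cl <- makeCluster(2)

# Each worker returns a small partial summary for its chunk.
partial <- parLapply(cl, chunks, function(rows, data) {
  c(n = length(rows), sum = sum(data$mpg[rows]))
}, data = mtcars)

stopCluster(cl)

# Combine the partial summaries into the overall mean.
totals <- Reduce(`+`, partial)
mean_mpg <- unname(totals["sum"] / totals["n"])
```

The cluster has to be created, fed with data and shut down by hand, which is the kind of extra skill the slide refers to.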
Revolution Analytics users
Revolution roadmap with Microsoft
Continued support for these platforms:
• Windows
• Linux
• Hadoop
• Teradata
Integration with new platforms:
• Azure Marketplace
• Azure ML
• Azure HDInsight
• SQL Server 2016
• Azure SQL
• Frontend tooling/BI integration
Revolution R vs open source R
No RAM limits
• Open source R fills memory and fails
• RRE scales linearly even beyond the RAM limit
Faster algorithms
• RRE is optimized for large amounts of data

File name    Compressed file size (MB)   No. rows       Open source R (secs)   Revolution R (secs)
Tiny         0.3                         1,235          0.00                   0.05
V. Small     0.4                         12,353         0.21                   0.05
Small        1.3                         123,534        0.03                   0.03
Medium       10.7                        1,235,349      1.94                   0.08
Large        104.5                       12,353,496     60.69                  0.42
Big (full)   12,960.0                    123,534,969    Memory!                4.89
V.Big        25,919.7                    247,069,938    Memory!                9.49
Huge         51,840.2                    494,139,876    Memory!                18.92

Public US flight data. Linear regression on the Arrival Delay field. Run on a 4-core laptop with 16 GB RAM and a 500 GB SSD.
Revolution R vs SAS
• Tests run by independent consultants on 5 x 4-core machines running CentOS
• SAS 9.4: Base SAS, SAS/STAT, Grid Mgr
• Revolution R Enterprise ScaleR, with IBM Platform LSF, Platform MPI Release 9
• Data set: 591 columns and 5,000,000 rows
Data scientists
Interact directly with data
Built into SQL Server
Data developer / DBA
Data management and analytics in the same engine
Incorporating advanced analytics
In-database analytics
Example solutions
• Fraud detection
• Sales forecasting
• Inventory efficiency
• Predictive maintenance
[Diagram labels: Relational data, Analytics library, T-SQL interface, Extensibility, R integration, Microsoft Azure Machine Learning Marketplace, New R scripts]
Integration with R scripts
Source: https://visualstudiomagazine.com/articles/2015/05/04/sql-server-2016-preview.aspx
Revo product suite
Revolution R Open
• Free, open source R distribution
• Enhanced and distributed by Revolution Analytics
Revolution R Enterprise
• Secure, scalable R distribution with support
• Includes proprietary components created by Revolution Analytics
Revolution R Enterprise (RRE)
Open source R distribution:
• Connectivity to big-data objects
• Big-data advanced analytics
• Multi-platform support
• In-Hadoop and in-Teradata predictive analytics
• Support for development and production environments
• Technical support and training services
[Diagram labels: R+CRAN, Revolution R Open, DistributedR, DeployR, DevelopR, ScaleR, ConnectR]
The RRE platform
[Diagram labels: RevoR, DevelopR, DeployR, R+CRAN, DistributedR, ScaleR, ConnectR]
ConnectR
• Contains high-speed, direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited and fixed-format text files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Includes ready-to-use, high-performance features for big data big analytics
• Fully parallelized analytical processing
• Descriptive statistics and statistical tests
• Includes additional predictive analytics functions
• Tools for distributing R algorithms across nodes
• Support for wide data: thousands of variables
DistributedR
• Distributed computing framework
• Cross-platform portability
Available on:
• Windows Servers
• Red Hat and SuSE Linux Servers
• Teradata Database
• Cloudera Hadoop
• Hortonworks Hadoop
• MapR Hadoop
R+CRAN
• Open source R interpreter
• R 3.2.2
• Large number of free algorithms
• Algorithms used by RevoR
• Embeddable R scripts
• 100% compatible with R scripts, functions and packages
RevoR
• R interpreter with improved performance
• Based on open source R
• Adds a high-performance math library to accelerate linear algebra functions
Integrating R into SQL Server 2016
Enable external scripts and register the R extension:
exec sp_configure 'external scripts enabled', 1;
reconfigure;
"C:\Program files\RRO\RRO-3.2.2-for-RRE-7.5.0\R-3.2.2\library\RevoScaleR\rxLibs\x64\registerRext.exe" /install

Create a login and a user in the target database:
USE <target database name>
GO
CREATE LOGIN [<login name>] WITH PASSWORD = '<password>', CHECK_EXPIRATION = OFF, CHECK_POLICY = OFF;
CREATE USER [<user name>] FOR LOGIN [<login name>] WITH DEFAULT_SCHEMA = [db_datareader]
ALTER ROLE [db_datareader] ADD MEMBER [<user name>]

Add the user to db_rrerole in master:
USE [master]
GO
CREATE USER [<user name>] FOR LOGIN [<login name>] WITH DEFAULT_SCHEMA = [db_rrerole]
ALTER ROLE [db_rrerole] ADD MEMBER [<user name>]
Demo
Installing R Server and integrating it with SQL Server 2016
Possible client tools
RRE: scaling to large data volumes
Data "chunking" relieves memory limits
Volume is limited only by storage capacity
[Figure: a data set split into chunks of rows. Columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]
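The chunking idea can be sketched in plain R. This is only an illustration under stated assumptions, not the ScaleR implementation: the built-in `mtcars` data set stands in for a file on disk, the chunk size of 8 rows is arbitrary, and the model `mpg ~ wt` is chosen for the example. Each chunk contributes to two small running sums (X'X and X'y), so only one chunk is ever needed in memory at a time.

```r
# Fit mpg ~ wt by accumulating X'X and X'y one chunk at a time.
chunk_size <- 8
n <- nrow(mtcars)
starts <- seq(1, n, by = chunk_size)

xtx <- matrix(0, 2, 2)   # running X'X
xty <- matrix(0, 2, 1)   # running X'y

for (s in starts) {
  rows <- s:min(s + chunk_size - 1, n)
  chunk <- mtcars[rows, ]        # stand-in for reading one chunk from disk
  X <- cbind(1, chunk$wt)        # intercept column plus predictor
  y <- chunk$mpg
  xtx <- xtx + crossprod(X)      # add this chunk's X'X
  xty <- xty + crossprod(X, y)   # add this chunk's X'y
}

# Solve the normal equations using the accumulated sums.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit with all rows in memory at once.
beta_full <- unname(coef(lm(mpg ~ wt, data = mtcars)))
```

The two coefficient vectors agree up to numerical tolerance, and the memory needed by the chunked version depends on the chunk size and the number of variables, not on the number of rows.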
RRE: scaling to large data volumes
[Figure: data chunks (columns mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb) stored in a local XDF file]
RRE: scaling to large data volumes
[Figure: data chunks (columns mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb) stored in Teradata. Diagram labels: Teradata Database, VAMPs, ODBC, Revolution R Enterprise, Data Segments, Database Nodes, Hybrid Storage, Parse Engine, External Stored Procedure, Table Operator (x4), Desktops & Servers]
RRE: scaling to large data volumes
[Figure: data chunks (columns mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb) stored in Hadoop. Diagram labels: Master node (Job tracker), Slave node (Task tracker) x3]
RRE: distributed computing
• No data movement
• Setting the compute context determines where the transformation is performed
[Diagram labels: Teradata Database, VAMPs, ODBC, Revolution R Enterprise, Data Segments, Database Nodes, Hybrid Storage, Parse Engine, External Stored Procedure, Table Operator (x4), Desktops & Servers]
Local compute context
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext("local")

### CREATE DIRECTORY AND FILE OBJECTS ###
AirlineDatabase <- file.path("datasets", "AirlineDemoSmall")
AirlineDataSet <- RxXdfData(file.path(AirlineDatabase, "AirlineDemoSmall.xdf"))

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### Cross-tabulate the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and plot
arrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(arrLateLinMod$coefficients)
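The `+ 0` in the formula `ArrDelay ~ DayOfWeek + 0` removes the intercept. A small sketch with base R's `lm` shows the effect, assuming the predictor is a factor (as DayOfWeek would need to be) and assuming rxLinMod follows the same formula convention as `lm`. The `mtcars` data and the `cyl` factor are stand-ins chosen for the example; with no intercept, each coefficient is simply the mean of the response for one factor level.

```r
# Treat the number of cylinders as a categorical predictor.
cars <- transform(mtcars, cyl = factor(cyl))

# Without an intercept there is one coefficient per factor level.
fit <- lm(mpg ~ cyl + 0, data = cars)
coefs <- unname(coef(fit))

# Per-level means of the response, computed directly.
group_means <- as.vector(tapply(cars$mpg, cars$cyl, mean))
```

Under these assumptions, plotting the coefficients of the no-intercept model amounts to plotting the average response per level, which in the slide's case would be the average arrival delay per day of the week.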
Remote compute: Teradata
### SETUP TERADATA ENVIRONMENT VARIABLES ###
dbConnStr <- "Driver=Teradata; Server=dbHostName; Database=RevoDb; Uid=xxxx; pwd=xxxx"
myTeradataCC <- RxInTeradata(connectionString = dbConnStr,
                             shareDir = "/tmp",
                             remoteShareDir = "/tmp/revoJobs",
                             revoPath = "/usr/lib64/Revo-7.0/R-3.0.2/lib64/R")

### TERADATA COMPUTE CONTEXT ###
rxSetComputeContext(myTeradataCC)

### CREATE TERADATA DATA SOURCE ###
AirlineDemoQuery <- "SELECT * FROM AirlineDemoSmall;"
AirlineDataSet <- RxTeradata(connectionString = dbConnStr, sqlQuery = AirlineDemoQuery)

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### Cross-tabulate the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and plot
arrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(arrLateLinMod$coefficients)
Remote compute: Hadoop
### SETUP HADOOP ENVIRONMENT VARIABLES ###
myNameNode <- "master"
myUser <- "root"
myPort <- 8020
myHadoopCluster <- RxHadoopMR(sshUsername = myUser, sshHostname = myNameNode, port = myPort)

### HADOOP COMPUTE CONTEXT USING HDFS ###
rxSetComputeContext(myHadoopCluster)

### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)
AirlineDatabase <- file.path("datasets", "AirlineDemoSmall")
AirlineDataSet <- RxXdfData(file.path(AirlineDatabase, "AirlineDemoSmall.xdf"), fileSystem = hdfsFS)

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### Cross-tabulate the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and plot
arrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(arrLateLinMod$coefficients)
Remote compute: SQL Server *
### SETUP SQL SERVER ENVIRONMENT VARIABLES ###
dbConnStr <- "Driver=SQL Server; Server=dbHostName; Database=RevoDb; Uid=xxxx; pwd=xxxx"
mySqlServerCC <- RxInSqlServer(connectionString = dbConnStr, consoleOutput = TRUE)

### SQL SERVER COMPUTE CONTEXT ###
rxSetComputeContext(mySqlServerCC)

### CREATE SQL SERVER DATA SOURCE ###
AirlineDemoQuery <- "SELECT * FROM AirlineDemoSmall;"
AirlineDataSet <- RxSqlServer(connectionString = dbConnStr, sqlQuery = AirlineDemoQuery)

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### Cross-tabulate the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and plot
arrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(arrLateLinMod$coefficients)

* In 2016
ScaleR functions and algorithms
Data step
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, merge, split
• Aggregate by category (means, sums)
Descriptive statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard deviation
• Variance
• Correlation
• Covariance
• Sum of squares (cross product matrix for set variables)
• Pairwise cross tabs
• Risk ratio & odds ratio
• Crosstabulation of data (standard tables & long form)
• Marginal summaries of crosstabulations
Statistical tests
• Chi square test
• Kendall rank correlation
• Fisher's exact test
• Student's t-test
Sampling
• Subsample (observations & variables)
• Random sampling
Predictive models
• Sum of squares (cross product matrix for set variables)
• Multiple linear regression
• Generalized linear models (GLM): exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User-defined distributions & link functions.
• Covariance & correlation matrices
• Logistic regression
• Classification & regression trees
• Predictions/scoring for models
• Residuals for all models
Variable selection
• Stepwise regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel random number generation
Cluster analysis
• K-means clustering
Classification
• Decision forests (random forests)
• Decision trees
• Gradient boosted decision trees
• Naïve Bayes
Combination
• PEMA API
• rxDataStep
• rxExec
DeployR
R-as-a-service framework for BI / web applications
DeployR architecture
QUESTIONS AND ANSWERS