Date posted: 03-Oct-2018
Uploaded by: duongkhanh
Machine Learning Will Shape Threat Management in Cybersecurity
Francisco J. Gomez, Elizabeth Gonzalez
Logtrust Inc.
Roadmap
• Definitions
• Learning
• Comparison between them
• The data
• Types of learning
Machine Learning - Threats - Cybersecurity
NCA: Cyber Crime Assessment 2016
“The ONS estimated that there were 2.46 million cyber incidents and 2.11 million victims of cyber crime in the UK in 2015.”
Office for National Statistics
Source: http://www.nationalcrimeagency.gov.uk/publications/709-cyber-crime-assessment-2016/file
Too many
Virginia Aguilar, Head of the NATO Computer Incident Response Capability Coordination Centre (NCIRC CC)
Where is ML?
“In an incredibly complex operation, machine learning in online advertising can help determine the optimal amount to bid on every single impression.”
Source: https://www.refuel4.com/blog/2016/07/21/machine-learning-advertisings-next-big-thing/
Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
expertsystem.com
The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.
Tom Mitchell
Machine learning studies computer algorithms for learning to do stuff. We might, for instance, be interested in learning to complete a task, or to make accurate predictions, or to behave intelligently. [...]
The emphasis of machine learning is on automatic methods. In other words, the goal is to devise learning algorithms that do the learning automatically without human intervention or assistance.
Robert Schapire
Comparison
• Manual analysis
- Any case that has not been modeled goes unprocessed
- Requires fewer cases
- Lower error
- Narrower coverage
• Automatic analysis
- Cases that have not been modeled can still be processed
- Requires a large dataset
- More errors
- Broader coverage
Types of learning
• Supervised – The training data consists of pairs of objects in which one component is the input data and the other is the desired result. Labeled data.
• Unsupervised – The training data does not contain the result. Unlabeled data. Its usual role is to group the data in search of patterns.
• Semi-supervised – Combines the two previous approaches: it uses a labeled dataset much smaller than the unlabeled dataset to improve on unsupervised learning.
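To make the difference in data shape concrete, here is a minimal sketch, not taken from the talk. The toy connection features, the `labeled` and `unlabeled` lists, and both function names are hypothetical. The supervised part is a 1-nearest-neighbour classifier; the semi-supervised part uses self-training (pseudo-labeling), which is only one of several possible semi-supervised techniques.

```python
# Hypothetical toy data: (bytes_sent, duration_seconds) per connection.
# Supervised learning needs (input, label) pairs; unsupervised data has inputs only.
labeled = [
    ((10, 1.0), "normal"),
    ((12, 1.2), "normal"),
    ((900, 0.1), "attack"),
    ((950, 0.2), "attack"),
]
unlabeled = [(11, 1.1), (920, 0.15), (13, 0.9)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_1nn(train, point):
    """Supervised step: return the label of the closest labeled example."""
    _, label = min(train, key=lambda pair: squared_distance(pair[0], point))
    return label


def self_training_round(train, pool):
    """Semi-supervised step: pseudo-label the unlabeled pool with the
    current model and return the enlarged training set."""
    return train + [(point, predict_1nn(train, point)) for point in pool]


enlarged = self_training_round(labeled, unlabeled)
```

In practice the pseudo-labels would be filtered by some confidence measure before being trusted; that step is omitted here for brevity.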
Network attack detection
Dataset: UNSW-NB15
https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/
- Training set: 175,341 rows
- Validation set: 82,332 rows
Parameterization
● ntrees: this option specifies the number of trees to build in the model, the default is set to 50 in the h2o R package.
● max_depth: the maximum depth for each tree, the default is 5. Deeper trees lead to better results in general but
this can overfit the data and increase the execution time.
● sample_rate: row sample rate per tree, from 0 to 1, default is set to 1.
● col_sample_rate_per_tree: column sample rate per tree, from 0 to 1, default is set to 1.
● col_sample_rate: column sample rate per split, from 0 to 1, default is set to 1.
● col_sample_rate_change_per_level: relative change of the column sampling rate for every level (from 0.0 to 2.0). Defaults to 1.
● histogram_type: this option specifies the type of histogram to use for finding optimal split points.
● min_rows: this option specifies the minimum number of observations for a leaf in order to split, default is set to 10.
● min_split_improvement: this option specifies the minimum relative improvement in squared error reduction in order for a split to occur. When properly tuned, this option can help reduce overfitting because the algorithm stops splitting once no candidate split improves the error by at least this amount.
● nbins: for numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point. Defaults to 20.
● nbins_cats: for categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. To make a model more general, decrease nbins_cats. To make a model more specific, increase nbins and/or nbins_cats. Keep in mind that increasing nbins_cats can have a big impact on the amount of overfitting.
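The defaults and ranges listed above can be collected in one place and checked before training. The sketch below is illustrative only: it does not call h2o, the names `DEFAULTS`, `RANGES`, and `build_params` are hypothetical, and it covers only the options for which the list states a default. The ranges are taken as inclusive, which is an assumption.

```python
# Defaults as stated in the parameter list above (assumed, not read from h2o).
DEFAULTS = {
    "ntrees": 50,
    "max_depth": 5,
    "sample_rate": 1.0,
    "col_sample_rate_per_tree": 1.0,
    "col_sample_rate": 1.0,
    "col_sample_rate_change_per_level": 1.0,
    "min_rows": 10,
    "nbins": 20,
}

# Allowed ranges as stated in the list; treated as inclusive here.
RANGES = {
    "sample_rate": (0.0, 1.0),
    "col_sample_rate_per_tree": (0.0, 1.0),
    "col_sample_rate": (0.0, 1.0),
    "col_sample_rate_change_per_level": (0.0, 2.0),
}


def build_params(**overrides):
    """Merge overrides into the defaults and reject out-of-range values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        if not low <= params[name] <= high:
            raise ValueError(f"{name}={params[name]} is outside [{low}, {high}]")
    return params


# A more conservative setting: shallower trees plus row and column subsampling.
conservative = build_params(max_depth=4, sample_rate=0.8, col_sample_rate=0.8)
```

Shallower trees and sampling rates below 1 are common first moves against overfitting; the specific values here are placeholders, not recommendations from the talk.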
Attack detection in HTTP traffic
Dataset: NSL-KDD (http://www.unb.ca/cic/research/datasets/nsl.html)
duration,src_bytes,dst_bytes,logged_in,num_compromised,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
• 40,337 events
• 7 attack types
Attack detection over HTTP
Clustering algorithm: K-means
Goal:
Obtain a set of points (centroids) that, in a sense, represent the rest of the initial points because of their privileged position relative to the whole (pattern).
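As a rough illustration of how such representative points are found, here is a minimal K-means sketch in plain Python. It is not the implementation used in the talk; the function names, the toy points, and the fixed seed are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Alternate assignment and centroid update until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean_point(members) if members else centroids[i])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical toy data with two well-separated groups.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
centroids, labels = kmeans(points, k=2)
```

Each returned centroid is the mean of its cluster, so it summarizes the points assigned to it. K-means can settle on a poor local optimum for less clearly separated data, which is why it is usually run from several starting points.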
Attack detection in HTTP traffic
Result of the clustering process
Name: Mix_Nor1
Name: Portsweep
Name: Mix_Nor2
Name: MixNor_Nep
Credit card fraud detection
Dataset: transactions made in September 2013
• 492 frauds in 284,807 transactions
Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
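The two counts above imply a heavily imbalanced dataset. The short calculation below uses only those figures; the "always legitimate" baseline is a hypothetical reference point, not a model from the talk.

```python
total_transactions = 284_807
fraud_count = 492

# Share of transactions that are fraudulent.
fraud_rate = fraud_count / total_transactions

# A trivial model that labels every transaction as legitimate is wrong
# only on the frauds, so its accuracy is 1 - fraud_rate while it
# detects none of them.
baseline_accuracy = 1 - fraud_rate
baseline_frauds_detected = 0

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-legitimate accuracy: {baseline_accuracy:.4%}")
```

Because fewer than 0.2% of the transactions are fraudulent, plain accuracy says little here; measures focused on the fraud class, such as recall and precision, are more informative.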