Árbol de decisiones en el aprendizaje automático


Árbol de decisiones en el aprendizaje automático

Príncipe Yadav

13 de noviembre de 2018·9 min de lectura


Árbol de decisiones para el pronóstico de lluvia

Formato de datos

training_data = [
['Green', 3, 'Apple'],
['Yellow', 3, 'Apple'],
['Red', 1, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon'],
# Header = ["Color", "diameter", "Label"]
# The last column is the label.
# The first two columns are features.

Enfoque para tomar el árbol de decisiones

Ganancia de información

Haciendo preguntas

class Question:
"""A Question is used to partition a dataset.
This class just records a 'column number' (e.g., 0 for Color) and a
'column value' (e.g., Green). The 'match' method is used to compare
the feature value in an example to the feature value stored in the
question. See the demo below.
def __init__(self, column, value):
self.column = column
self.value = value
def match(self, example):
# Compare the feature value in an example to the
# feature value in this question.
val = example[self.column]
if is_numeric(val):
return val >= self.value
return val == self.value
def __repr__(self):
# This is just a helper method to print
# the question in a readable format.
condition = "=="
if is_numeric(self.value):
condition = ">="
return "Is %s %s %s?" % (
header[self.column], condition, str(self.value))
Question(1, 3) ## Is diameter >= 3?
Question(0, "Green") ## Is color == Green?
def partition(rows, question):
"""Partitions a dataset.
For each row in the dataset, check if it matches the question. If
so, add it to 'true rows', otherwise, add it to 'false rows'.
true_rows, false_rows = [], []
for row in rows:
if question.match(row):
return true_rows, false_rows

# Let's partition the training data based on whether rows are Red.
true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# This will contain all the 'Red' rows.
true_rows ## [['Red', 1, 'Grape'], ['Red', 1, 'Grape']]
false_rows ## [['Green', 3, 'Apple'], ['Yellow', 3, 'Apple'], ['Yellow', 3, 'Lemon']]

Impureza de Gini



Definición de impureza de Gini

def gini(rows):
"""Calculate the Gini Impurity for a list of rows.

There are a few different ways to do this, I thought this one was
the most concise. See:
counts = class_counts(rows)
impurity = 1
for lbl in counts:
prob_of_lbl = counts[lbl] / float(len(rows))
impurity -= prob_of_lbl**2
return impurity
# Demo 1:
# Let's look at some example to understand how Gini Impurity works.
# First, we'll look at a dataset with no mixing.
no_mixing = [['Apple'],
# this will return 0
gini(no_mixing) ## output=0

## Demo 2:
# Now, we'll look at dataset with a 50:50 apples:oranges ratio
some_mixing = [['Apple'],
# this will return 0.5 - meaning, there's a 50% chance of misclassifying
# a random example we draw from the dataset.
gini(some_mixing) ##output=0.5

## Demo 3:
# Now, we'll look at a dataset with many different labels
lots_of_mixing = [['Apple'],
# This will return 0.8
gini(lots_of_mixing) ##output=0.8

Pasos para tomar el árbol de decisiones

def find_best_split(rows):
"""Find the best question to ask by iterating over every feature / value
and calculating the information gain."""
best_gain = 0 # keep track of the best information gain
best_question = None # keep train of the feature / value that produced it
current_uncertainty = gini(rows)
n_features = len(rows[0]) - 1 # number of columns
for col in range(n_features): # for each feature values = set([row[col] for row in rows]) # unique values in the column for val in values: # for each value question = Question(col, val) # try splitting the dataset
true_rows, false_rows = partition(rows, question)
# Skip this split if it doesn't divide the
# dataset.
if len(true_rows) == 0 or len(false_rows) == 0:
# Calculate the information gain from this split
gain = info_gain(true_rows, false_rows, current_uncertainty)
# You actually can use '>' instead of '>=' here
# but I wanted the tree to look a certain way for our
# toy dataset.
if gain >= best_gain:
best_gain, best_question = gain, question
return best_gain, best_question

# Demo:
# Find the best question to ask first for our toy dataset.
best_gain, best_question = find_best_split(training_data)
## output - Is diameter >= 3?
def build_tree(rows):
"""Builds the tree.
Rules of recursion: 1) Believe that it works. 2) Start by checking
for the base case (no further information gain). 3) Prepare for
giant stack traces.
# Try partitioning the dataset on each of the unique attribute,
# calculate the information gain,
# and return the question that produces the highest gain.
gain, question = find_best_split(rows)
# Base case: no further info gain
# Since we can ask no further questions,
# we'll return a leaf.
if gain == 0:
return Leaf(rows)
# If we reach here, we have found a useful feature / value
# to partition on.
true_rows, false_rows = partition(rows, question)
# Recursively build the true branch.
true_branch = build_tree(true_rows)
# Recursively build the false branch.
false_branch = build_tree(false_rows)
# Return a Question node.
# This records the best feature / value to ask at this point,
# as well as the branches to follow
# dependingo on the answer.
return Decision_Node(question, true_branch, false_branch)

Árbol de decisiones de construcción

training_data = [
['Green', 3, 'Apple'],
['Yellow', 3, 'Apple'],
['Red', 1, 'Grape'],
['Red', 1, 'Grape'],
['Yellow', 3, 'Lemon'],
# Header = ["Color", "diameter", "Label"]
# The last column is the label.
# The first two columns are features.

my_tree = build_tree(training_data)

Is diameter >= 3?
--> True:
Is color == Yellow?
--> True:
Predict {'Lemon': 1, 'Apple': 1}
--> False:
Predict {'Apple': 1}
--> False:
Predict {'Grape': 2}

Deja una respuesta

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *

05J7EIiTwRyACm92tUQC5ZP lim.size 1200x630

¿Puede actualizar a Windows 11? Vuelve la aplicación PC Health Check de Microsoft

social og oracle badge

Replicar y transformar datos | oracleGoldenGate