Trying out a random forest algorithm for a university assignment.

fib-upc

Vylion 0f0ae62cd5 🐞 Fixed Out-of-bag error calculation 📝 Uncommented the single tree training in Forest testing 📝 Silenced the single tree training output in Forest testing		2019-04-26 22:30:09 +02:00
.gitignore	📻 Added parallelization to best question search	2019-04-24 05:04:39 +02:00
forest.py	🐞 Fixed Out-of-bag error calculation	2019-04-26 22:30:09 +02:00
forest_tester.py	🐞 Fixed Out-of-bag error calculation	2019-04-26 22:30:09 +02:00
hygdata_v3.csv	📻 Initial commit	2019-04-23 03:51:09 +02:00
question.py	📝 Added the Forest class	2019-04-26 13:07:16 +02:00
README.md	Add README.md	2019-04-23 14:22:31 +00:00
star.py	📝 Added the Forest class	2019-04-26 13:07:16 +02:00
star_reader.py	📝 Added the Forest class	2019-04-26 13:07:16 +02:00
tree.py	📝 Added the Forest class	2019-04-26 13:07:16 +02:00
tree_bootstrapped.py	📝 Added the Forest class	2019-04-26 13:07:16 +02:00
tree_tester.py	📝 Added the Forest class	2019-04-26 13:07:16 +02:00

README.md

About

My first Random Forest algorithm implementation for a classification problem, as part of a university assignment.

Database

I use a star database as input for the classification. For each star I keep five characteristics (magnitude, distance, luminosity, color, temperature), a label (the spectral class, which is what the random forest has to work out) and a display name (either a proper/popular name, an abbreviated name, or the ID in the database if the first two are missing).

The database used is the HYG Data (version 3).

Since star data isn't precise, some stars could belong to more than one class. Such a star is considered to belong to all of its possible classes, and predicting any one of them counts as a success.
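
The success rule above can be sketched roughly as follows. This is only an illustration, not the code in this repository: the names `is_success` and `success_rate` are hypothetical, and it assumes each star's admissible classes are available as a set of strings.

```python
from typing import Iterable, Set


def is_success(predicted: str, possible_classes: Set[str]) -> bool:
    """A prediction succeeds if it matches any class the star may belong to."""
    return predicted in possible_classes


def success_rate(predictions: Iterable[str], possible: Iterable[Set[str]]) -> float:
    """Fraction of predictions that hit at least one admissible class."""
    pairs = list(zip(predictions, possible))
    if not pairs:
        return 0.0
    hits = sum(is_success(pred, classes) for pred, classes in pairs)
    return hits / len(pairs)
```

With this rule, a star that could be either class A or class F is scored as correct whether the forest answers A or F.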

Decision Tree

The decision tree is more or less agnostic to the fact that the entries to classify are stars. It receives a list of entries (a dataset) and a list of fields. Each entry must be an object with an entry.label, holding the class value that the decision tree must figure out, and an entry.data, which is a list of values. The length of entry.data is expected to match the length of fields, and each item of fields must be the name of the value at the same position in entry.data.
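
A minimal sketch of an input that would satisfy this contract is shown below. The `Entry` class and the `check_dataset` helper are hypothetical stand-ins (the repository's own classes may look different), the field names are assumed from the characteristics listed in the Database section, and the numbers are made-up illustrative values.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Entry:
    label: str        # class value the decision tree must figure out
    data: List[Any]   # one value per field, in the same order as `fields`


def check_dataset(dataset: List[Entry], fields: List[str]) -> None:
    """Raise ValueError if any entry's data does not line up with `fields`."""
    for index, entry in enumerate(dataset):
        if len(entry.data) != len(fields):
            raise ValueError(
                f"entry {index} has {len(entry.data)} values, expected {len(fields)}"
            )


# Assumed field names; entry.data[i] is the value of fields[i].
fields = ["magnitude", "distance", "luminosity", "color", "temperature"]
dataset = [Entry(label="G", data=[4.8, 10.0, 1.0, 0.65, 5800.0])]
check_dataset(dataset, fields)
```

Because the tree only sees labels, positional values and field names, the same structure would work for any other classification dataset.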

The training set is a subset of the whole database, and the resulting tree is tested against the remaining entries.
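
One simple way to obtain such a split is to shuffle the entries and cut the list in two. The sketch below is an assumption about how this could be done, not a description of the repository's code; `split_dataset`, the 80% default and the fixed seed are all hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    entries: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `entries` and split it into (training, testing) lists."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Every entry ends up in exactly one of the two lists, so the tree is never tested on an entry it was trained on.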

Random Forest

The random forest is built by bootstrapping the original training set (sampling it with replacement) and then training one tree on each bootstrapped sample.
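
The bootstrapping step can be sketched as below. The function names are hypothetical and `train_tree` stands in for whatever builds a single tree from a dataset. Combining the trees by majority vote is the usual choice for classification forests, but it is an assumption here, since the README does not say how the trees' answers are merged.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")
Tree = TypeVar("Tree")


def bootstrap_sample(training_set: Sequence[E], rng: random.Random) -> List[E]:
    """Draw len(training_set) entries uniformly at random, with replacement."""
    return [rng.choice(training_set) for _ in training_set]


def build_forest(
    training_set: Sequence[E],
    n_trees: int,
    train_tree: Callable[[List[E]], Tree],
    seed: int = 0,
) -> List[Tree]:
    """Train one tree per bootstrapped sample of the training set."""
    rng = random.Random(seed)
    return [train_tree(bootstrap_sample(training_set, rng)) for _ in range(n_trees)]


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most common class among the trees' predictions (assumed rule)."""
    return Counter(votes).most_common(1)[0][0]
```

Since each sample is drawn with replacement, some entries appear several times and others are left out. The left-out entries are the "out-of-bag" entries that the commit messages above refer to when mentioning the out-of-bag error.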