API

The following sections explain in more detail how to use python-weka-wrapper from Python via its API.

You will find many more examples in the (aptly named) examples repository.

Java Virtual Machine

In order to use the library, you need to manage the Java Virtual Machine (JVM).

For starting up the library, use the following code:

import weka.core.jvm as jvm
jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages, use the following call:

jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in wekafiles in your user’s home directory, then you have two options for specifying the alternative location: use the WEKA_HOME environment variable or the packages parameter, supplying a directory. The latter is shown below:

jvm.start(packages="/my/packages/are/somewhere/else")

Most of the time, you will want to increase the maximum heap size available to the JVM. The following example reserves 512 MB:

jvm.start(max_heap_size="512m")

If you want to print system information at startup, then you can use the system_info parameter:

jvm.start(system_info=True)

This will output key-value pairs generated by Weka’s weka.core.SystemInfo class, similar to this:

DEBUG:weka.core.jvm:System info:
DEBUG:weka.core.jvm:java.runtime.name=OpenJDK Runtime Environment
DEBUG:weka.core.jvm:java.awt.headless=true
...
DEBUG:weka.core.jvm:java.vm.compressedOopsMode=Zero based
DEBUG:weka.core.jvm:java.vm.specification.version=11

And, finally, in order to stop the JVM again, use the following call:

jvm.stop()

Option handling

Any class derived from OptionHandler (module weka.core.classes) allows getting and setting of its options via the options property. Depending on the sub-class, you may also be able to provide the options when instantiating the class. The following two examples instantiate a J48 classifier, one using the options property and the other using the shortcut through the constructor:

from weka.classifiers import Classifier
cls = Classifier(classname="weka.classifiers.trees.J48")
cls.options = ["-C", "0.3"]
from weka.classifiers import Classifier
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the options property to retrieve the currently set options:

from weka.classifiers import Classifier
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
print(cls.options)

Data generators

Artificial data can be generated using one of Weka’s data generators, e.g., the Agrawal classification generator:

from weka.datagenerators import DataGenerator
generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-B", "-P", "0.05"])
DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting data to stdout):

generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-n", "10", "-r", "agrawal"])
generator.dataset_format = generator.define_data_format()
print(generator.dataset_format)
if generator.single_mode_flag:
    for i in range(generator.num_examples_act):
        print(generator.generate_example())
else:
    print(generator.generate_examples())

Loaders and Savers

You can load and save datasets of various data formats using the Loader and Saver classes.

The following example loads an ARFF file and saves it as CSV:

from weka.core.converters import Loader, Saver
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("/some/where/iris.arff")
print(data)
saver = Saver(classname="weka.core.converters.CSVSaver")
saver.save_file(data, "/some/where/iris.csv")

The weka.core.converters module has convenience methods for loading and saving datasets, called load_any_file and save_any_file. These methods determine the loader/saver based on the file extension:

import weka.core.converters as converters
data = converters.load_any_file("/some/where/iris.arff")
converters.save_any_file(data, "/some/where/else/iris.csv")

Filters

The Filter class from the weka.filters module allows you to filter datasets, e.g., removing the last attribute using the Remove filter:

from weka.filters import Filter
data = ...                       # previously loaded data
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
remove.inputformat(data)     # let the filter know about the type of data to filter
filtered = remove.filter(data)   # filter the data
print(filtered)                  # output the filtered data

Classifiers

Here is an example on how to cross-validate a J48 classifier (with confidence factor 0.3) on a dataset and output the summary and some specific statistics:

from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random
data = ...             # previously loaded data
data.class_is_last()   # set class attribute
classifier = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
evaluation = Evaluation(data)                     # initialize with priors
evaluation.crossvalidate_model(classifier, data, 10, Random(42))  # 10-fold CV
print(evaluation.summary())
print("pctCorrect: " + str(evaluation.percent_correct))
print("incorrect: " + str(evaluation.incorrect))

Here we train a classifier and output predictions:

from weka.classifiers import Classifier
data = ...             # previously loaded data
data.class_is_last()   # set class attribute
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
cls.build_classifier(data)
for index, inst in enumerate(data):
    pred = cls.classify_instance(inst)
    dist = cls.distribution_for_instance(inst)
    print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Clusterers

The following example shows how to build a SimpleKMeans clusterer (with 3 clusters) using a previously loaded dataset without a class attribute:

from weka.clusterers import Clusterer
data = ... # previously loaded dataset
clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
clusterer.build_clusterer(data)
print(clusterer)

Once a clusterer is built, it can be used to cluster Instance objects:

for inst in data:
    cl = clusterer.cluster_instance(inst)  # 0-based cluster index
    dist = clusterer.distribution_for_instance(inst)   # cluster membership distribution
    print("cluster=" + str(cl) + ", distribution=" + str(dist))

Attribute selection

You can perform attribute selection using BestFirst as search algorithm and CfsSubsetEval as evaluator as follows:

from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
data = ...   # previously loaded dataset
search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
attsel = AttributeSelection()
attsel.search(search)
attsel.evaluator(evaluator)
attsel.select_attributes(data)
print("# attributes: " + str(attsel.number_attributes_selected))
print("attributes: " + str(attsel.selected_attributes))
print("result string:\n" + attsel.results_string)

Attribute selection is also available through meta-schemes:

  • classifier: weka.classifiers.AttributeSelectedClassifier
  • filter: weka.filters.AttributeSelection

Associators

Associators, like Apriori, can be built and output like this:

from weka.associations import Associator
data = ...   # previously loaded dataset
associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
associator.build_associations(data)
print(associator)

Timeseries

Timeseries forecasting can be achieved with the weka.timeseries module (which wraps the timeseriesForecasting Weka package). Notable classes are the WekaForecaster forecaster, the TSLagMaker filter and the TSEvaluation evaluation class:

from weka.timeseries import WekaForecaster, TSEvaluation
from weka.classifiers import Classifier
forecaster = WekaForecaster()
forecaster.fields_to_forecast = ["passenger_numbers"]
forecaster.base_forecaster = Classifier(classname="weka.classifiers.functions.LinearRegression")
forecaster.tslag_maker.timestamp_field = "Date"
forecaster.tslag_maker.adjust_for_variance = False
forecaster.tslag_maker.include_powers_of_time = True
forecaster.tslag_maker.include_timelag_products = True
forecaster.tslag_maker.remove_leading_instances_with_unknown_lag_values = False
forecaster.tslag_maker.add_month_of_year = True
forecaster.tslag_maker.add_quarter_of_year = True
print("algorithm name: " + str(forecaster.algorithm_name))
print("command-line: " + forecaster.to_commandline())
print("lag maker: " + forecaster.tslag_maker.to_commandline())

airline_data = ...   # previously loaded dataset
evaluation = TSEvaluation(airline_data, 0.0)
evaluation.evaluate_on_training_data = False
evaluation.evaluate_on_test_data = False
evaluation.prime_window_size = forecaster.tslag_maker.max_lag
evaluation.forecast_future = True
evaluation.horizon = 20
evaluation.evaluation_modules = "MAE,RMSE"
evaluation.evaluate(forecaster)
print("Evaluation setup:")
print(evaluation)
print("Future forecasts")
print(evaluation.print_future_forecast_on_training_data(forecaster))

Serialization

You can easily serialize and de-serialize objects as well.

Here we just save a trained classifier to a file, load it again from disk and output the model:

from weka.classifiers import Classifier
classifier = ...  # previously built classifier
classifier.serialize("/some/where/out.model")
...
classifier2, _ = Classifier.deserialize("/some/where/out.model")
print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order to determine whether test data is compatible). This is done as follows:

from weka.classifiers import Classifier
classifier = ...  # previously built Classifier
data = ... # previously loaded/generated Instances
classifier.serialize("/some/where/out.model", header=data)
...
classifier2, data2 = Classifier.deserialize("/some/where/out.model")
print(classifier2)
print(data2)

Clusterers and filters offer the serialize and deserialize methods as well. For all other serialization/deserialization tasks, use the methods offered by the weka.core.serialization module:

  • write(file, object)
  • write_all(file, [obj1, obj2, …])
  • read(file)
  • read_all(file)

Experiments

Experiments, like the ones run in Weka’s Experimenter, can be configured and executed as well.

Here is an example for performing a cross-validated classification experiment:

from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
from weka.classifiers import Classifier
import weka.core.converters as converters
# configure experiment
datasets = ["iris.arff", "anneal.arff"]
classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.trees.J48")]
outfile = "results-cv.arff"   # store results for later analysis
exp = SimpleCrossValidationExperiment(
    classification=True,
    runs=10,
    folds=10,
    datasets=datasets,
    classifiers=classifiers,
    result=outfile)
exp.setup()
exp.run()
# evaluate previous run
loader = converters.loader_for_file(outfile)
data   = loader.load_file(outfile)
matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
tester.resultmatrix = matrix
comparison_col = data.attribute_by_name("Percent_correct").index
tester.instances = data
print(tester.header(comparison_col))
print(tester.multi_resultset_full(0, comparison_col))

And here is a setup for performing regression experiments using random splits of the datasets:

from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
from weka.classifiers import Classifier
import weka.core.converters as converters
# configure experiment
datasets = ["bolts.arff", "bodyfat.arff"]
classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.functions.LinearRegression")]
outfile = "results-rs.arff"   # store results for later analysis
exp = SimpleRandomSplitExperiment(
    classification=False,
    runs=10,
    percentage=66.6,
    preserve_order=False,
    datasets=datasets,
    classifiers=classifiers,
    result=outfile)
exp.setup()
exp.run()
# evaluate previous run
loader = converters.loader_for_file(outfile)
data   = loader.load_file(outfile)
matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
tester.resultmatrix = matrix
comparison_col = data.attribute_by_name("Correlation_coefficient").index
tester.instances = data
print(tester.header(comparison_col))
print(tester.multi_resultset_full(0, comparison_col))

Packages

Packages can be listed, installed and uninstalled using the weka.core.packages module:

# refresh package cache
import weka.core.packages as packages
packages.refresh_cache()

# list all packages (name and URL)
items = packages.all_packages()
for item in items:
    print(item.name + " " + item.url)

# install CLOPE package
packages.install_package("CLOPE")
items = packages.installed_packages()
for item in items:
    print(item.name + " " + item.url)

# uninstall CLOPE package
packages.uninstall_package("CLOPE")
items = packages.installed_packages()
for item in items:
    print(item.name + " " + item.url)

You can also output suggested Weka packages for partial class/package names or exact class names (the default is partial string matching):

# suggest package for classifier 'RBFClassifier'
search = "RBFClassifier"
suggestions = packages.suggest_package(search)
print("suggested packages for " + search + ":", suggestions)

# suggest package for package '.ft.'
search = ".ft."
suggestions = packages.suggest_package(search)
print("suggested packages for " + search + ":", suggestions)

# suggest package for classifier 'weka.classifiers.trees.J48graft'
search = "weka.classifiers.trees.J48graft"
suggestions = packages.suggest_package(search, exact=True)
print("suggested packages for " + search + ":", suggestions)