API

The following sections explain in more detail how to use python-weka-wrapper from Python using the API.

You will find many more examples in the (aptly named) examples repository.

Java Virtual Machine

In order to use the library, you need to manage the Java Virtual Machine (JVM).

For starting up the library, use the following code:

>>> import weka.core.jvm as jvm
>>> jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages, use the following call:

>>> jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in wekafiles in your user’s home directory, then you have two options for specifying the alternative location: use the WEKA_HOME environment variable or the packages parameter, supplying a directory. The latter is shown below:

>>> jvm.start(packages="/my/packages/are/somewhere/else")
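The WEKA_HOME route works the same way, except that the location is picked up from the environment rather than passed as a parameter. A minimal sketch (the path is just a placeholder; the variable must be set before the JVM starts):

```python
import os

# Setting WEKA_HOME before the JVM starts achieves the same effect as
# passing a directory via the packages parameter:
os.environ["WEKA_HOME"] = "/my/packages/are/somewhere/else"

# jvm.start(packages=True) would now resolve packages from WEKA_HOME.
print(os.environ["WEKA_HOME"])
```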

Most of the time, you will want to increase the maximum heap size available to the JVM. The following example reserves 512 MB:

>>> jvm.start(max_heap_size="512m")

And, finally, in order to stop the JVM again, use the following call:

>>> jvm.stop()

Option handling

Any class derived from OptionHandler (module weka.core.classes) allows getting and setting of its options via the options property. Depending on the sub-class, you may also supply the options when instantiating the class. The following two examples instantiate a J48 classifier, one using the options property and the other using the shortcut through the constructor:

>>> from weka.classifiers import Classifier
>>> cls = Classifier(classname="weka.classifiers.trees.J48")
>>> cls.options = ["-C", "0.3"]
>>> from weka.classifiers import Classifier
>>> cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the options property to retrieve the currently set options:

>>> from weka.classifiers import Classifier
>>> cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
>>> print(cls.options)
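Note that options are plain Python lists of strings. If you have them as a single command-line string instead, you need to split them into list form first; weka.core.classes also offers split_options/join_options helpers for this (they require a running JVM), but for simple cases a plain-Python split via shlex suffices, as this sketch shows:

```python
import shlex

# Turn a command-line style option string into the list form
# that the options property and the constructors expect:
opts = shlex.split("-C 0.3 -M 2")
print(opts)  # ['-C', '0.3', '-M', '2']
```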

Data generators

Artificial data can be generated using one of Weka’s data generators, e.g., the Agrawal classification generator:

>>> from weka.datagenerators import DataGenerator
>>> generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-B", "-P", "0.05"])
>>> DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting data to stdout):

>>> generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-n", "10", "-r", "agrawal"])
>>> generator.dataset_format = generator.define_data_format()
>>> print(generator.dataset_format)
>>> if generator.single_mode_flag:
>>>     for i in range(generator.num_examples_act):
>>>         print(generator.generate_example())
>>> else:
>>>     print(generator.generate_examples())

Loaders and Savers

You can load and save datasets of various data formats using the Loader and Saver classes.

The following example loads an ARFF file and saves it as CSV:

>>> from weka.core.converters import Loader, Saver
>>> loader = Loader(classname="weka.core.converters.ArffLoader")
>>> data = loader.load_file("/some/where/iris.arff")
>>> print(data)
>>> saver = Saver(classname="weka.core.converters.CSVSaver")
>>> saver.save_file(data, "/some/where/iris.csv")

The weka.core.converters module has the convenience methods load_any_file and save_any_file for loading and saving datasets. These methods determine the loader/saver based on the file extension:

>>> import weka.core.converters as converters
>>> data = converters.load_any_file("/some/where/iris.arff")
>>> converters.save_any_file(data, "/some/where/else/iris.csv")
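Conceptually, the extension-based dispatch works like the following simplified sketch (this is an illustration only, not the library's actual code; load_any_file supports more formats and delegates the real work to Weka):

```python
import os

# Hypothetical, simplified mapping from file extension to Weka loader class:
LOADERS = {
    ".arff": "weka.core.converters.ArffLoader",
    ".csv": "weka.core.converters.CSVLoader",
    ".json": "weka.core.converters.JSONLoader",
}

def loader_classname(filename):
    """Return the loader classname for the file's extension (or None)."""
    ext = os.path.splitext(filename)[1].lower()
    return LOADERS.get(ext)

print(loader_classname("/some/where/iris.arff"))  # weka.core.converters.ArffLoader
```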

Filters

The Filter class from the weka.filters module allows you to filter datasets, e.g., removing the last attribute using the Remove filter:

>>> from weka.filters import Filter
>>> data = ...                       # previously loaded data
>>> remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
>>> remove.inputformat(data)     # let the filter know about the type of data to filter
>>> filtered = remove.filter(data)   # filter the data
>>> print(filtered)                  # output the filtered data

Classifiers

Here is an example of how to cross-validate a J48 classifier (with confidence factor 0.3) on a dataset and output the summary and some specific statistics:

>>> from weka.classifiers import Classifier, Evaluation
>>> from weka.core.classes import Random
>>> data = ...             # previously loaded data
>>> data.class_is_last()   # set class attribute
>>> classifier = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
>>> evaluation = Evaluation(data)                     # initialize with priors
>>> evaluation.crossvalidate_model(classifier, data, 10, Random(42))  # 10-fold CV
>>> print(evaluation.summary())
>>> print("pctCorrect: " + str(evaluation.percent_correct))
>>> print("incorrect: " + str(evaluation.incorrect))

Here we train a classifier and output predictions:

>>> from weka.classifiers import Classifier
>>> data = ...             # previously loaded data
>>> data.class_is_last()   # set class attribute
>>> cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
>>> cls.build_classifier(data)
>>> for index, inst in enumerate(data):
>>>     pred = cls.classify_instance(inst)
>>>     dist = cls.distribution_for_instance(inst)
>>>     print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Clusterers

The following is an example of how to build a SimpleKMeans clusterer (with 3 clusters), using a previously loaded dataset without a class attribute:

>>> from weka.clusterers import Clusterer
>>> data = ... # previously loaded dataset
>>> clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
>>> clusterer.build_clusterer(data)
>>> print(clusterer)

Once a clusterer is built, it can be used to cluster Instance objects:

>>> for inst in data:
>>>     cl = clusterer.cluster_instance(inst)  # 0-based cluster index
>>>     dist = clusterer.distribution_for_instance(inst)   # cluster membership distribution
>>>     print("cluster=" + str(cl) + ", distribution=" + str(dist))

Attribute selection

You can perform attribute selection using BestFirst as search algorithm and CfsSubsetEval as evaluator as follows:

>>> from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
>>> data = ...   # previously loaded dataset
>>> search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
>>> evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
>>> attsel = AttributeSelection()
>>> attsel.search(search)
>>> attsel.evaluator(evaluator)
>>> attsel.select_attributes(data)
>>> print("# attributes: " + str(attsel.number_attributes_selected))
>>> print("attributes: " + str(attsel.selected_attributes))
>>> print("result string:\n" + attsel.results_string)

Associators

Associators, like Apriori, can be built and output like this:

>>> from weka.associations import Associator
>>> data = ...   # previously loaded dataset
>>> associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
>>> associator.build_associations(data)
>>> print(associator)

Serialization

You can easily serialize and de-serialize objects as well.

Here we just save a trained classifier to a file, load it again from disk and output the model:

>>> from weka.classifiers import Classifier
>>> classifier = ...  # previously built classifier
>>> classifier.serialize("/some/where/out.model")
>>> ...
>>> classifier2, _ = Classifier.deserialize("/some/where/out.model")
>>> print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order to determine whether test data is compatible). This is done as follows:

>>> from weka.classifiers import Classifier
>>> classifier = ...  # previously built Classifier
>>> data = ... # previously loaded/generated Instances
>>> classifier.serialize("/some/where/out.model", header=data)
>>> ...
>>> classifier2, data2 = Classifier.deserialize("/some/where/out.model")
>>> print(classifier2)
>>> print(data2)

Clusterers and filters offer the serialize and deserialize methods as well. For all other serialization/deserialization tasks, use the methods offered by the weka.core.serialization module:

  • write(file, object)
  • write_all(file, [obj1, obj2, …])
  • read(file)
  • read_all(file)

Experiments

Experiments, like the ones run in Weka’s Experimenter, can be configured and executed as well.

Here is an example for performing a cross-validated classification experiment:

>>> from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
>>> from weka.classifiers import Classifier
>>> import weka.core.converters as converters
>>> # configure experiment
>>> datasets = ["iris.arff", "anneal.arff"]
>>> classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.trees.J48")]
>>> outfile = "results-cv.arff"   # store results for later analysis
>>> exp = SimpleCrossValidationExperiment(
>>>     classification=True,
>>>     runs=10,
>>>     folds=10,
>>>     datasets=datasets,
>>>     classifiers=classifiers,
>>>     result=outfile)
>>> exp.setup()
>>> exp.run()
>>> # evaluate previous run
>>> loader = converters.loader_for_file(outfile)
>>> data   = loader.load_file(outfile)
>>> matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
>>> tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
>>> tester.resultmatrix = matrix
>>> comparison_col = data.attribute_by_name("Percent_correct").index
>>> tester.instances = data
>>> print(tester.header(comparison_col))
>>> print(tester.multi_resultset_full(0, comparison_col))

And a setup for performing regression experiments on random splits on the datasets:

>>> from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
>>> from weka.classifiers import Classifier
>>> import weka.core.converters as converters
>>> # configure experiment
>>> datasets = ["bolts.arff", "bodyfat.arff"]
>>> classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.functions.LinearRegression")]
>>> outfile = "results-rs.arff"   # store results for later analysis
>>> exp = SimpleRandomSplitExperiment(
>>>     classification=False,
>>>     runs=10,
>>>     percentage=66.6,
>>>     preserve_order=False,
>>>     datasets=datasets,
>>>     classifiers=classifiers,
>>>     result=outfile)
>>> exp.setup()
>>> exp.run()
>>> # evaluate previous run
>>> loader = converters.loader_for_file(outfile)
>>> data   = loader.load_file(outfile)
>>> matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
>>> tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
>>> tester.resultmatrix = matrix
>>> comparison_col = data.attribute_by_name("Correlation_coefficient").index
>>> tester.instances = data
>>> print(tester.header(comparison_col))
>>> print(tester.multi_resultset_full(0, comparison_col))

Packages

Packages can be listed, installed and uninstalled using the weka.core.packages module:

# refresh package cache
import weka.core.packages as packages
packages.refresh_cache()

# list all packages (name and URL)
items = packages.all_packages()
for item in items:
    print(item.name + " " + item.url)

# install CLOPE package
packages.install_package("CLOPE")
items = packages.installed_packages()
for item in items:
    print(item.name + " " + item.url)

# uninstall CLOPE package
packages.uninstall_package("CLOPE")
items = packages.installed_packages()
for item in items:
    print(item.name + " " + item.url)