API
===

The following sections explain in more detail how to use *python-weka-wrapper* from Python via its API. You will find many more examples in the (aptly named) `examples `_ repository.

Java Virtual Machine
--------------------

In order to use the library, you need to manage the Java Virtual Machine (JVM). To start up the library, use the following code:

.. code-block:: python

   import weka.core.jvm as jvm
   jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages, use the following call:

.. code-block:: python

   jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in `wekafiles` within your user's home directory, you have two options for specifying the alternative location: the `WEKA_HOME` environment variable or the `packages` parameter, supplying a directory. The latter is shown below:

.. code-block:: python

   jvm.start(packages="/my/packages/are/somewhere/else")

Most of the time, you will want to increase the maximum heap size available to the JVM. The following example reserves 512 MB:

.. code-block:: python

   jvm.start(max_heap_size="512m")

If you want to print system information at start-up, use the `system_info` parameter:

.. code-block:: python

   jvm.start(system_info=True)

This will output key-value pairs generated by Weka's `weka.core.SystemInfo` class, similar to this::

   DEBUG:weka.core.jvm:System info:
   DEBUG:weka.core.jvm:java.runtime.name=OpenJDK Runtime Environment
   DEBUG:weka.core.jvm:java.awt.headless=true
   ...
   DEBUG:weka.core.jvm:java.vm.compressedOopsMode=Zero based
   DEBUG:weka.core.jvm:java.vm.specification.version=11

And, finally, in order to stop the JVM again, use the following call:

.. code-block:: python

   jvm.stop()

Option handling
---------------

Any class derived from ``OptionHandler`` (module ``weka.core.classes``) allows getting and setting of its options via the property ``options``.
Depending on the sub-class, you may also provide the options when instantiating the class. The following two examples instantiate a J48 classifier, one using the ``options`` property and the other using the shortcut through the constructor:

.. code-block:: python

   from weka.classifiers import Classifier
   cls = Classifier(classname="weka.classifiers.trees.J48")
   cls.options = ["-C", "0.3"]

.. code-block:: python

   from weka.classifiers import Classifier
   cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the ``options`` property to retrieve the currently set options:

.. code-block:: python

   from weka.classifiers import Classifier
   cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   print(cls.options)

Data generators
---------------

Artificial data can be generated using one of Weka's data generators, e.g., the `Agrawal` classification generator:

.. code-block:: python

   from weka.datagenerators import DataGenerator
   generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-B", "-P", "0.05"])
   DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting the data to stdout):

.. code-block:: python

   generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-n", "10", "-r", "agrawal"])
   generator.dataset_format = generator.define_data_format()
   print(generator.dataset_format)
   if generator.single_mode_flag:
       for i in range(generator.num_examples_act):
           print(generator.generate_example())
   else:
       print(generator.generate_examples())

Loaders and Savers
------------------

You can load and save datasets in various data formats using the `Loader` and `Saver` classes. The following example loads an ARFF file and saves it as CSV:

.. code-block:: python

   from weka.core.converters import Loader, Saver
   loader = Loader(classname="weka.core.converters.ArffLoader")
   data = loader.load_file("/some/where/iris.arff")
   print(data)
   saver = Saver(classname="weka.core.converters.CSVSaver")
   saver.save_file(data, "/some/where/iris.csv")

The `weka.core.converters` module has convenience methods for loading and saving datasets, called `load_any_file` and `save_any_file`. These methods determine the loader/saver based on the file extension:

.. code-block:: python

   import weka.core.converters as converters
   data = converters.load_any_file("/some/where/iris.arff")
   converters.save_any_file(data, "/some/where/else/iris.csv")

Filters
-------

The `Filter` class from the `weka.filters` module allows you to filter datasets, e.g., removing the last attribute using the `Remove` filter:

.. code-block:: python

   from weka.filters import Filter
   data = ...                      # previously loaded data
   remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
   remove.inputformat(data)        # let the filter know about the type of data to filter
   filtered = remove.filter(data)  # filter the data
   print(filtered)                 # output the filtered data

Classifiers
-----------

Here is an example of cross-validating a `J48` classifier (with confidence factor 0.3) on a dataset and outputting the summary and some specific statistics:

.. code-block:: python

   from weka.classifiers import Classifier, Evaluation
   from weka.core.classes import Random
   data = ...                      # previously loaded data
   data.class_is_last()            # set class attribute
   classifier = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   evaluation = Evaluation(data)   # initialize with priors
   evaluation.crossvalidate_model(classifier, data, 10, Random(42))  # 10-fold CV
   print(evaluation.summary())
   print("pctCorrect: " + str(evaluation.percent_correct))
   print("incorrect: " + str(evaluation.incorrect))

Here we train a classifier and output its predictions:

.. code-block:: python

   from weka.classifiers import Classifier
   data = ...                      # previously loaded data
   data.class_is_last()            # set class attribute
   cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   cls.build_classifier(data)
   for index, inst in enumerate(data):
       pred = cls.classify_instance(inst)
       dist = cls.distribution_for_instance(inst)
       print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Clusterers
----------

The following is an example of how to build a `SimpleKMeans` clusterer (with 3 clusters) using a previously loaded dataset without a class attribute:

.. code-block:: python

   from weka.clusterers import Clusterer
   data = ...                      # previously loaded dataset
   clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
   clusterer.build_clusterer(data)
   print(clusterer)

Once a clusterer is built, it can be used to cluster Instance objects:

.. code-block:: python

   for inst in data:
       cl = clusterer.cluster_instance(inst)             # 0-based cluster index
       dist = clusterer.distribution_for_instance(inst)  # cluster membership distribution
       print("cluster=" + str(cl) + ", distribution=" + str(dist))

Attribute selection
-------------------

You can perform attribute selection using `BestFirst` as the search algorithm and `CfsSubsetEval` as the evaluator as follows:

.. code-block:: python

   from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
   data = ...
   # previously loaded dataset
   search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
   evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
   attsel = AttributeSelection()
   attsel.search(search)
   attsel.evaluator(evaluator)
   attsel.select_attributes(data)
   print("# attributes: " + str(attsel.number_attributes_selected))
   print("attributes: " + str(attsel.selected_attributes))
   print("result string:\n" + attsel.results_string)

Attribute selection is also available through meta-schemes:

* classifier: `weka.classifiers.AttributeSelectedClassifier`
* filter: `weka.filters.AttributeSelection`

Associators
-----------

Associators, like `Apriori`, can be built and output like this:

.. code-block:: python

   from weka.associations import Associator
   data = ...                      # previously loaded dataset
   associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
   associator.build_associations(data)
   print(associator)

Timeseries
----------

Timeseries forecasting can be performed with the `weka.timeseries` module (which wraps the `timeseriesForecasting` package). Notable are the `WekaForecaster` forecaster, the `TSLagMaker` filter and the `TSEvaluation` class:

.. code-block:: python

   from weka.timeseries import WekaForecaster, TSEvaluation
   from weka.classifiers import Classifier

   airline_data = ...              # previously loaded timeseries dataset
   forecaster = WekaForecaster()
   forecaster.fields_to_forecast = ["passenger_numbers"]
   forecaster.base_forecaster = Classifier(classname="weka.classifiers.functions.LinearRegression")
   forecaster.tslag_maker.timestamp_field = "Date"
   forecaster.tslag_maker.adjust_for_variance = False
   forecaster.tslag_maker.include_powers_of_time = True
   forecaster.tslag_maker.include_timelag_products = True
   forecaster.tslag_maker.remove_leading_instances_with_unknown_lag_values = False
   forecaster.tslag_maker.add_month_of_year = True
   forecaster.tslag_maker.add_quarter_of_year = True
   print("algorithm name: " + str(forecaster.algorithm_name))
   print("command-line: " + forecaster.to_commandline())
   print("lag maker: " + forecaster.tslag_maker.to_commandline())

   evaluation = TSEvaluation(airline_data, 0.0)
   evaluation.evaluate_on_training_data = False
   evaluation.evaluate_on_test_data = False
   evaluation.prime_window_size = forecaster.tslag_maker.max_lag
   evaluation.forecast_future = True
   evaluation.horizon = 20
   evaluation.evaluation_modules = "MAE,RMSE"
   evaluation.evaluate(forecaster)
   print("Evaluation setup:")
   print(evaluation)
   print("Future forecasts")
   print(evaluation.print_future_forecast_on_training_data(forecaster))

Serialization
-------------

You can easily serialize and de-serialize objects as well. Here we save a trained classifier to a file, load it again from disk and output the model:

.. code-block:: python

   from weka.classifiers import Classifier
   classifier = ...                # previously built classifier
   classifier.serialize("/some/where/out.model")
   ...
   classifier2, _ = Classifier.deserialize("/some/where/out.model")
   print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order to determine whether test data is compatible with the model). This is done as follows:

.. code-block:: python

   from weka.classifiers import Classifier
   classifier = ...
   # previously built Classifier
   data = ...                      # previously loaded/generated Instances
   classifier.serialize("/some/where/out.model", header=data)
   ...
   classifier2, data2 = Classifier.deserialize("/some/where/out.model")
   print(classifier2)
   print(data2)

Clusterers and filters offer the `serialize` and `deserialize` methods as well. For all other serialization/deserialization tasks, use the methods offered by the `weka.core.serialization` module:

* `write(file, object)`
* `write_all(file, [obj1, obj2, ...])`
* `read(file)`
* `read_all(file)`

Experiments
-----------

Experiments, as run in Weka's Experimenter, can be configured and executed as well. Here is an example of a cross-validated classification experiment:

.. code-block:: python

   from weka.experiments import SimpleCrossValidationExperiment, Tester, ResultMatrix
   from weka.classifiers import Classifier
   import weka.core.converters as converters

   # configure experiment
   datasets = ["iris.arff", "anneal.arff"]
   classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.trees.J48")]
   outfile = "results-cv.arff"     # store results for later analysis
   exp = SimpleCrossValidationExperiment(
       classification=True,
       runs=10,
       folds=10,
       datasets=datasets,
       classifiers=classifiers,
       result=outfile)
   exp.setup()
   exp.run()

   # evaluate previous run
   loader = converters.loader_for_file(outfile)
   data = loader.load_file(outfile)
   matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
   tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
   tester.resultmatrix = matrix
   comparison_col = data.attribute_by_name("Percent_correct").index
   tester.instances = data
   print(tester.header(comparison_col))
   print(tester.multi_resultset_full(0, comparison_col))

And here is a setup for performing regression experiments on random splits of the datasets:

.. code-block:: python

   from weka.experiments import SimpleRandomSplitExperiment, Tester, ResultMatrix
   from weka.classifiers import Classifier
   import weka.core.converters as converters

   # configure experiment
   datasets = ["bolts.arff", "bodyfat.arff"]
   classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.functions.LinearRegression")]
   outfile = "results-rs.arff"     # store results for later analysis
   exp = SimpleRandomSplitExperiment(
       classification=False,
       runs=10,
       percentage=66.6,
       preserve_order=False,
       datasets=datasets,
       classifiers=classifiers,
       result=outfile)
   exp.setup()
   exp.run()

   # evaluate previous run
   loader = converters.loader_for_file(outfile)
   data = loader.load_file(outfile)
   matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
   tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
   tester.resultmatrix = matrix
   comparison_col = data.attribute_by_name("Correlation_coefficient").index
   tester.instances = data
   print(tester.header(comparison_col))
   print(tester.multi_resultset_full(0, comparison_col))

Packages
--------

Packages can be listed, installed and uninstalled using the `weka.core.packages` module:

.. code-block:: python

   import weka.core.packages as packages

   # refresh package cache
   packages.refresh_cache()

   # list all packages (name and URL)
   items = packages.all_packages()
   for item in items:
       print(item.name + " " + item.url)

   # install CLOPE package
   packages.install_package("CLOPE")
   items = packages.installed_packages()
   for item in items:
       print(item.name + " " + item.url)

   # uninstall CLOPE package
   packages.uninstall_package("CLOPE")
   items = packages.installed_packages()
   for item in items:
       print(item.name + " " + item.url)

You can also output suggested Weka packages for partial class/package names or exact class names (the default is partial string matching):

.. code-block:: python

   import weka.core.packages as packages

   # suggest package for classifier 'RBFClassifier'
   search = "RBFClassifier"
   suggestions = packages.suggest_package(search)
   print("suggested packages for " + search + ":", suggestions)

   # suggest package for package '.ft.'
   search = ".ft."
   suggestions = packages.suggest_package(search)
   print("suggested packages for " + search + ":", suggestions)

   # suggest package for classifier 'weka.classifiers.trees.J48graft'
   search = "weka.classifiers.trees.J48graft"
   suggestions = packages.suggest_package(search, exact=True)
   print("suggested packages for " + search + ":", suggestions)