API
===

The following sections explain in more detail of how to use *python-weka-wrapper* from Python using the API.

A lot more examples you will find in the (aptly named) `examples <https://github.com/fracpete/python-weka-wrapper-examples>`_
repository.


Java Virtual Machine
--------------------

In order to use the library, you need to manage the Java Virtual Machine (JVM).

For starting up the library, use the following code:

.. code-block:: python

   >>> import weka.core.jvm as jvm
   >>> jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages,
use the following call:

.. code-block:: python

   >>> jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in `wekafiles` in your user's home directory,
then you have two options for specifying the alternative location: use the `WEKA_HOME` environment
variable or the `packages` parameter, supplying a directory. The latter is shown below:

.. code-block:: python

   >>> jvm.start(packages="/my/packages/are/somwhere/else")

Most of the times, you will want to increase the maximum heap size available to the JVM.
The following example reserves 512 MB:

.. code-block:: python

   >>> jvm.start(max_heap_size="512m")

And, finally, in order to stop the JVM again, use the following call:

.. code-block:: python

   >>> jvm.stop()


Option handling
---------------

Any class derived from ``OptionHandler`` (module ``weka.core.classes``) allows 
getting and setting of the options via the property ``options``. Depending on
the sub-class, you may also provide the options already when instantiating the
class. The following two examples instantiate a J48 classifier, one using
the ``options`` property and the other using the shortcut through the constructor:

.. code-block:: python

   >>> from weka.classifiers import Classifier
   >>> cls = Classifier(classname="weka.classifiers.trees.J48")
   >>> cls.options = ["-C", "0.3"]

.. code-block:: python

   >>> from weka.classifiers import Classifier
   >>> cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can use the ``options`` property also to retrieve the currently set options:

.. code-block:: python

   >>> from weka.classifiers import Classifier
   >>> cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   >>> print(cls.options)


Data generators
---------------

Artifical data can be generated using one of Weka's data generators, e.g., the
`Agrawal` classification generator:

.. code-block:: python

   >>> from weka.datagenerators import DataGenerator
   >>> generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-B", "-P", "0.05"])
   >>> DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting data to stdout):

.. code-block:: python

   >>> generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-n", "10", "-r", "agrawal"])
   >>> generator.dataset_format = generator.define_data_format()
   >>> print(generator.dataset_format)
   >>> if generator.single_mode_flag:
   >>>     for i in xrange(generator.num_examples_act):
   >>>         print(generator.generate_example())
   >>> else:
   >>>     print(generator.generate_examples())


Loaders and Savers
------------------

You can load and save datasets of various data formats using the `Loader` and `Saver` classes.

The following example loads an ARFF file and saves it as CSV:

.. code-block:: python

   >>> from weka.core.converters import Loader, Saver
   >>> loader = Loader(classname="weka.core.converters.ArffLoader")
   >>> data = loader.load_file("/some/where/iris.arff")
   >>> print(data)
   >>> saver = Saver(classname="weka.core.converters.CSVSaver")
   >>> saver.save_file(data, "/some/where/iris.csv")

The `weka.core.converters` module has convenience method for loading and saving
datasets called `load_any_file` and `save_any_file`. These methods determine
the loader/saver based on the file extension:

.. code-block:: python

   >>> import weka.core.converters as converters
   >>> data = converters.load_any_file("/some/where/iris.arff")
   >>> converters.save_any_file(data, "/some/where/else/iris.csv")


Filters
-------

The `Filter` class from the `weka.filters` module allows you to filter datasets, e.g.,
removing the last attribute using the `Remove` filter:

.. code-block:: python

   >>> from weka.filters import Filter
   >>> data = ...                       # previously loaded data
   >>> remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
   >>> remove.inputformat(data)     # let the filter know about the type of data to filter
   >>> filtered = remove.filter(data)   # filter the data
   >>> print(filtered)                  # output the filtered data

Classifiers
-----------

Here is an example on how to cross-validate a `J48` classifier (with confidence factor 0.3)
on a dataset and output the summary and some specific statistics:

.. code-block:: python

   >>> from weka.classifiers import Classifier, Evaluation
   >>> from weka.core.classes import Random
   >>> data = ...             # previously loaded data
   >>> data.class_is_last()   # set class attribute
   >>> classifier = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   >>> evaluation = Evaluation(data)                     # initialize with priors
   >>> evaluation.crossvalidate_model(classifier, data, 10, Random(42))  # 10-fold CV
   >>> print(evaluation.summary())
   >>> print("pctCorrect: " + str(evaluation.percent_correct))
   >>> print("incorrect: " + str(evaluation.incorrect))

Here we train a classifier and output predictions:

.. code-block:: python

   >>> from weka.classifiers import Classifier
   >>> data = ...             # previously loaded data
   >>> data.class_is_last()   # set class attribute
   >>> cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   >>> cls.build_classifier(data)
   >>> for index, inst in enumerate(data):
   >>>     pred = cls.classify_instance(inst)
   >>>     dist = cls.distribution_for_instance(inst)
   >>>     print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Clusterers
----------

In the following an example on how to build a `SimpleKMeans` (with 3 clusters)
using a previously loaded dataset without a class attribute:

.. code-block:: python

   >>> from weka.clusterers import Clusterer
   >>> data = ... # previously loaded dataset
   >>> clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
   >>> clusterer.build_clusterer(data)
   >>> print(clusterer)

Once a clusterer is built, it can be used to cluster Instance objects:

.. code-block:: python

   >>> for inst in data:
   >>>     cl = clusterer.cluster_instance(inst)  # 0-based cluster index
   >>>     dist = clusterer.distribution_for_instance(inst)   # cluster membership distribution
   >>>     print("cluster=" + str(cl) + ", distribution=" + str(dist))


Attribute selection
-------------------

You can perform attribute selection using `BestFirst` as search algorithm and
`CfsSubsetEval` as evaluator as follows:

.. code-block:: python

   >>> from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
   >>> data = ...   # previously loaded dataset
   >>> search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
   >>> evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
   >>> attsel = AttributeSelection()
   >>> attsel.search(search)
   >>> attsel.evaluator(evaluator)
   >>> attsel.select_attributes(data)
   >>> print("# attributes: " + str(attsel.number_attributes_selected))
   >>> print("attributes: " + str(attsel.selected_attributes))
   >>> print("result string:\n" + attsel.results_string)


Associators
-----------

Associators, like `Apriori`, can be built and output like this:

.. code-block:: python

   >>> from weka.associations import Associator
   >>> data = ...   # previously loaded dataset
   >>> associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
   >>> associator.build_associations(data)
   >>> print(associator)


Serialization
-------------

You can easily serialize and de-serialize as well.

Here we just save a trained classifier to a file, load it again from disk and output the model:

.. code-block:: python

   >>> from weka.classifiers import Classifier
   >>> classifier = ...  # previously built classifier
   >>> classifier.serialize("/some/where/out.model")
   >>> ...
   >>> classifier2, _ = Classifier.deserialize("/some/where/out.model")
   >>> print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order to determine
whether test data is compatible). This is done as follows:

.. code-block:: python

   >>> from weka.classifiers import Classifier
   >>> classifier = ...  # previously built Classifier
   >>> data = ... # previously loaded/generated Instances
   >>> classifier.serialize("/some/where/out.model", header=data)
   >>> ...
   >>> classifier2, data2 = Classifier.deserialize("/some/where/out.model")
   >>> print(classifier2)
   >>> print(data2)

Clusterers and filters offer the `serialize` and `deserialize` methods as well. For all other
serialization/deserialiation tasks, use the methods offered by the `weka.core.serialization` module:

* `write(file, object)`
* `write_all(file, [obj1, obj2, ...])`
* `read(file)`
* `read_all(file)`


Experiments
-----------

Experiments, like they are run in Weka's Experimenter, can be configured and executed as well.

Here is an example for performing a cross-validated classification experiment:

.. code-block:: python

   >>> from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
   >>> from weka.classifiers import Classifier
   >>> import weka.core.converters as converters
   >>> # configure experiment
   >>> datasets = ["iris.arff", "anneal.arff"]
   >>> classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.trees.J48")]
   >>> outfile = "results-cv.arff"   # store results for later analysis
   >>> exp = SimpleCrossValidationExperiment(
   >>>     classification=True,
   >>>     runs=10,
   >>>     folds=10,
   >>>     datasets=datasets,
   >>>     classifiers=classifiers,
   >>>     result=outfile)
   >>> exp.setup()
   >>> exp.run()
   >>> # evaluate previous run
   >>> loader = converters.loader_for_file(outfile)
   >>> data   = loader.load_file(outfile)
   >>> matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
   >>> tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
   >>> tester.resultmatrix = matrix
   >>> comparison_col = data.attribute_by_name("Percent_correct").index
   >>> tester.instances = data
   >>> print(tester.header(comparison_col))
   >>> print(tester.multi_resultset_full(0, comparison_col))

And a setup for performing regression experiments on random splits on the datasets:

.. code-block:: python

   >>> from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
   >>> from weka.classifiers import Classifier
   >>> import weka.core.converters as converters
   >>> # configure experiment
   >>> datasets = ["bolts.arff", "bodyfat.arff"]
   >>> classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.functions.LinearRegression")]
   >>> outfile = "results-rs.arff"   # store results for later analysis
   >>> exp = SimpleRandomSplitExperiment(
   >>>     classification=False,
   >>>     runs=10,
   >>>     percentage=66.6,
   >>>     preserve_order=False,
   >>>     datasets=datasets,
   >>>     classifiers=classifiers,
   >>>     result=outfile)
   >>> exp.setup()
   >>> exp.run()
   >>> # evaluate previous run
   >>> loader = converters.loader_for_file(outfile)
   >>> data   = loader.load_file(outfile)
   >>> matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
   >>> tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
   >>> tester.resultmatrix = matrix
   >>> comparison_col = data.attribute_by_name("Correlation_coefficient").index
   >>> tester.instances = data
   >>> print(tester.header(comparison_col))
   >>> print(tester.multi_resultset_full(0, comparison_col))


Packages
--------

Packages can be listed, installed and uninstalled using the `weka.core.packages` module:

.. code-block:: python

   # refresh package cache
   import weka.core.packages as packages
   packages.refresh_cache()

   # list all packages (name and URL)
   items = packages.all_packages()
   for item in items:
       print(item.name + " " + item.url)

   # install CLOPE package
   packages.install_package("CLOPE")
   items = packages.installed_packages()
   for item in items:
       print(item.name + " " + item.url)

   # uninstall CLOPE package
   packages.uninstall_package("CLOPE")
   items = packages.installed_packages()
   for item in items:
       print(item.name + " " + item.url)