Examples
========

The following examples are meant to be executed in sequence, as they rely on previous steps,
e.g., on data being present.

For more examples, check out the example repository on GitHub:
`github.com/fracpete/python-weka-wrapper3-examples <https://github.com/fracpete/python-weka-wrapper3-examples>`__

Start up JVM
------------

.. code-block:: python

    import weka.core.jvm as jvm
    jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages,
use the following call:

.. code-block:: python

    jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in `wekafiles` in your user's home directory,
then you have two options for specifying the alternative location: use the `WEKA_HOME` environment
variable or the `packages` parameter, supplying a directory. The latter is shown below:

.. code-block:: python

    jvm.start(packages="/my/packages/are/somewhere/else")

Most of the time, you will want to increase the maximum heap size available to the JVM.
The following example reserves 512 MB:

.. code-block:: python

    jvm.start(max_heap_size="512m")

If you want to print system information at startup, then you can use the `system_info` parameter:

.. code-block:: python

    jvm.start(system_info=True)

This will output key-value pairs generated by Weka's `weka.core.SystemInfo` class, similar to this::

    DEBUG:weka.core.jvm:System info:
    DEBUG:weka.core.jvm:java.runtime.name=OpenJDK Runtime Environment
    DEBUG:weka.core.jvm:java.awt.headless=true
    ...
    DEBUG:weka.core.jvm:java.vm.compressedOopsMode=Zero based
    DEBUG:weka.core.jvm:java.vm.specification.version=11

For more information, check out the help of the `jvm` module:

.. code-block:: python

    help(jvm.start)
    help(jvm.stop)

Location of the datasets
------------------------

The following examples assume the datasets to be present in the `data_dir` directory.
For instance, this could be the following directory:

.. code-block:: python

    data_dir = "/my/datasets/"

Load dataset and print it
-------------------------

.. code-block:: python

    from weka.core.converters import Loader

    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file(data_dir + "iris.arff")
    data.class_is_last()
    print(data)

The `weka.core.converters` module has a convenience method for loading datasets called
`load_any_file`. This method determines a loader based on the file extension and then loads
the full dataset:

.. code-block:: python

    import weka.core.converters as converters

    data = converters.load_any_file(data_dir + "iris.arff")
    data.class_is_last()
    print(data)

It is also possible to define the class attribute when loading:

.. code-block:: python

    data = loader.load_file(data_dir + "iris.arff", class_index="last")
    data = converters.load_any_file(data_dir + "iris.arff", class_index="last")

The following strings are supported:

* `first`
* `second`
* `third`
* `last-2` (third to last)
* `last-1` (second to last)
* `last`
* any other string gets interpreted as a 1-based index
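For instance, to use the third attribute as the class attribute, you can pass its 1-based index
as a string; a minimal sketch, reusing the `loader` and `data_dir` from above:

.. code-block:: python

    # "3" is interpreted as the 1-based index of the class attribute
    data = loader.load_file(data_dir + "iris.arff", class_index="3")
    print(data.class_attribute.name)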
Create dataset manually
-----------------------

The following code snippet defines the dataset structure by creating its attributes and then the
dataset itself. Once the `weka.core.dataset.Instances` object is available, rows (i.e.,
`weka.core.dataset.Instance` objects) can be added.

.. code-block:: python

    from weka.core.dataset import Attribute, Instance, Instances

    # create attributes
    num_att = Attribute.create_numeric("num")
    date_att = Attribute.create_date("dat", "yyyy-MM-dd")
    nom_att = Attribute.create_nominal("nom", ["label1", "label2"])

    # create dataset
    dataset = Instances.create_instances("helloworld", [num_att, date_att, nom_att], 0)

    # add rows
    values = [3.1415926, date_att.parse_date("2014-04-10"), 1.0]
    inst = Instance.create_instance(values)
    dataset.add_instance(inst)

    values = [2.71828, date_att.parse_date("2014-08-09"), Instance.missing_value()]
    inst = Instance.create_instance(values)
    dataset.add_instance(inst)

    print(dataset)

Create dataset from lists
-------------------------

If your data is readily available as lists, you can also construct datasets using this approach
(custom column names can be supplied via `cols_x` and `col_y`):

.. code-block:: python

    from weka.core.dataset import create_instances_from_lists
    from random import randint

    # pure numeric
    x = [[randint(1, 10) for _ in range(5)] for _ in range(10)]
    y = [randint(0, 1) for _ in range(10)]
    dataset = create_instances_from_lists(x, y, name="generated from lists")
    print(dataset)
    dataset = create_instances_from_lists(x, name="generated from lists (no y)")
    print(dataset)

    # mixed data types
    x = [["TEXT", 1, 1.1], ["XXX", 2, 2.2]]
    y = ["A", "B"]
    dataset = create_instances_from_lists(x, y, name="generated from mixed lists",
                                          cols_x=["text", "integer", "float"], col_y="class")
    print(dataset)

Create dataset from matrices
----------------------------

Another way of constructing a dataset is to use numpy matrices/arrays, e.g., obtained from a
pandas DataFrame (custom column names can be supplied via `cols_x` and `col_y`):

.. code-block:: python

    from weka.core.dataset import create_instances_from_matrices
    import numpy as np

    # pure numeric
    x = np.random.randn(10, 5)
    y = np.random.randn(10)
    dataset = create_instances_from_matrices(x, y, name="generated from matrices")
    print(dataset)
    dataset = create_instances_from_matrices(x, name="generated from matrix (no y)")
    print(dataset)

    # mixed data types
    x = np.array([("TEXT", 1, 1.1), ("XXX", 2, 2.2)], dtype='S20, i4, f8')
    y = np.array(["A", "B"], dtype='S20')
    dataset = create_instances_from_matrices(x, y, name="generated from mixed matrices",
                                             cols_x=["text", "integer", "float"], col_y="class")
    print(dataset)
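For a purely numeric pandas DataFrame, this can be as simple as converting it to a numpy array
first; a minimal sketch, assuming pandas is installed (non-numeric columns would require the
mixed-type approach shown above):

.. code-block:: python

    import pandas as pd
    from weka.core.dataset import create_instances_from_matrices

    # hypothetical numeric-only DataFrame
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    dataset = create_instances_from_matrices(df.to_numpy(), name="from pandas",
                                             cols_x=list(df.columns))
    print(dataset)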
Dataset subsets
---------------

Transformations in Weka usually occur by applying filters (see section *Filters* below). However,
quite often one only wants to quickly create a subset (of columns or rows) of a dataset. For this
purpose, the `subset` method of the `weka.core.dataset.Instances` class can be used (it uses
filters under the hood to generate the actual subset):

.. code-block:: python

    from weka.core.converters import load_any_file

    data = load_any_file("/some/where/iris.arff")
    print(data.attribute_names(), data.num_instances)

    # select columns by name
    subset = data.subset(col_names=['sepallength', 'sepalwidth', 'petallength', 'petalwidth'])
    print(subset.attribute_names(), subset.num_instances)

    # select columns by range (1-based indices)
    subset = data.subset(col_range='1-3,5')
    print(subset.attribute_names(), subset.num_instances)

    # select rows by range (1-based indices)
    subset = data.subset(row_range='51-150')
    print(subset.attribute_names(), subset.num_instances)

    # invert selection of cols/rows and keep original relation name
    subset = data.subset(col_range='5', invert_cols=True,
                         row_range='51-150', invert_rows=True,
                         keep_relationame=True)
    print(subset.attribute_names(), subset.num_instances)

Data generators
---------------

Artificial data can be generated using one of Weka's data generators, e.g., the `Agrawal`
classification generator:

.. code-block:: python

    from weka.datagenerators import DataGenerator

    generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal",
                              options=["-B", "-P", "0.05"])
    DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting the data to stdout):

.. code-block:: python

    generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal",
                              options=["-n", "10", "-r", "agrawal"])
    generator.dataset_format = generator.define_data_format()
    print(generator.dataset_format)
    if generator.single_mode_flag:
        for i in range(generator.num_examples_act):
            print(generator.generate_example())
    else:
        print(generator.generate_examples())

Filters
-------

The `Filter` class from the `weka.filters` module allows you to filter datasets, e.g., removing
the last attribute using the `Remove` filter:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")

    from weka.filters import Filter
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
    remove.inputformat(data)
    filtered = remove.filter(data)
    print(filtered)
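Several filters can also be applied in one pass by chaining them; a minimal sketch, assuming the
`MultiFilter` wrapper of the `weka.filters` module and reusing the `vote` data from above:

.. code-block:: python

    from weka.filters import Filter, MultiFilter

    # hypothetical chain: replace missing values first, then remove the last attribute
    replace = Filter(classname="weka.filters.unsupervised.attribute.ReplaceMissingValues")
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])

    multi = MultiFilter()
    multi.filters = [replace, remove]
    multi.inputformat(data)
    filtered = multi.filter(data)
    print(filtered)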
Output help from underlying OptionHandler
-----------------------------------------

If the underlying Java class implements the ``weka.core.OptionHandler`` interface, then you can
use the ``to_help()`` method to generate a string containing the ``globalInfo()`` and
``listOptions()`` information:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48")
    print(cls.to_help())

Option handling
---------------

Any class derived from ``OptionHandler`` (module ``weka.core.classes``) allows getting and
setting of the options via the property ``options``. Depending on the sub-class, you may also
provide the options already when instantiating the class. The following two examples instantiate
a J48 classifier, one using the ``options`` property and the other using the shortcut through
the constructor:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48")
    cls.options = ["-C", "0.3"]

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the ``options`` property to retrieve the currently set options:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(cls.options)

Using the `to_commandline()` method, you can return a single string that contains classname and
options, just like Weka's Explorer does when copying the setup to the clipboard:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(cls.to_commandline())

The `to_commandline(...)` function of the `weka.core.classes` module generates the command-line
string for any class that implements the `weka.core.OptionHandler` Java interface under the hood
(a lot of classes do!):

.. code-block:: python

    from weka.classifiers import Classifier
    from weka.core.classes import to_commandline

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(to_commandline(cls))

The reverse, generating an object from a command-line string, is done via the
`from_commandline(...)` function:

.. code-block:: python

    from weka.core.classes import from_commandline

    cmdline = 'weka.classifiers.functions.SMO -K "weka.classifiers.functions.supportVector.NormalizedPolyKernel -E 3.0"'
    classifier = from_commandline(cmdline, classname="weka.classifiers.Classifier")

Build classifier on dataset, output predictions
-----------------------------------------------

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    cls.build_classifier(data)

    for index, inst in enumerate(data):
        pred = cls.classify_instance(inst)
        dist = cls.distribution_for_instance(inst)
        print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Build classifier on dataset, print model and draw graph
--------------------------------------------------------

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    cls.build_classifier(data)
    print(cls)

    import weka.plot.graph as graph  # NB: pygraphviz and PIL are required
    graph.plot_dot_graph(cls.graph)

Build classifier incrementally with data and print model
---------------------------------------------------------

.. code-block:: python

    loader = Loader(classname="weka.core.converters.ArffLoader")
    iris_inc = loader.load_file(data_dir + "iris.arff", incremental=True)
    iris_inc.class_is_last()
    print(iris_inc)

    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayesUpdateable")
    cls.build_classifier(iris_inc)
    for inst in loader:
        cls.update_classifier(inst)
    print(cls)
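Before moving on to cross-validation, a quick hold-out evaluation is also possible; a minimal
sketch, reusing the `train_test_split` method (also used in the *Timeseries* section below) and
assuming the `test_model` method of the `Evaluation` class:

.. code-block:: python

    from weka.classifiers import Classifier, Evaluation

    data = loader.load_file(data_dir + "iris.arff")
    data.class_is_last()
    # 66% for training, the remainder for testing
    train, test = data.train_test_split(66.0)

    cls = Classifier(classname="weka.classifiers.trees.J48")
    cls.build_classifier(train)

    evl = Evaluation(train)
    evl.test_model(cls, test)
    print(evl.summary())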
Cross-validate filtered classifier and print evaluation and display ROC
------------------------------------------------------------------------

.. code-block:: python

    data = loader.load_file(data_dir + "diabetes.arff")
    data.class_is_last()

    from weka.filters import Filter
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "1-3"])

    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayes")

    from weka.classifiers import FilteredClassifier
    fc = FilteredClassifier()
    fc.filter = remove
    fc.classifier = cls

    from weka.classifiers import Evaluation
    from weka.core.classes import Random
    evl = Evaluation(data)
    evl.crossvalidate_model(fc, data, 10, Random(1))
    print(evl.percent_correct)
    print(evl.summary())
    print(evl.class_details())

    import weka.plot.classifiers as plcls  # NB: matplotlib is required
    plcls.plot_roc(evl, class_index=[0, 1], wait=True)

Cross-validate regressor, display classifier errors and predictions
--------------------------------------------------------------------

.. code-block:: python

    from weka.classifiers import PredictionOutput, KernelClassifier, Kernel

    data = loader.load_file(data_dir + "bolts.arff")
    data.class_is_last()

    cls = KernelClassifier(classname="weka.classifiers.functions.SMOreg", options=["-N", "0"])
    kernel = Kernel(classname="weka.classifiers.functions.supportVector.RBFKernel", options=["-G", "0.1"])
    cls.kernel = kernel
    pout = PredictionOutput(classname="weka.classifiers.evaluation.output.prediction.PlainText")

    evl = Evaluation(data)
    evl.crossvalidate_model(cls, data, 10, Random(1), pout)
    print(evl.summary())
    print(pout.buffer_content())

    import weka.plot.classifiers as plcls  # NB: matplotlib is required
    plcls.plot_classifier_errors(evl.predictions, wait=True)

Parameter optimization - property names
---------------------------------------

Both `GridSearch` and `MultiSearch` use Java Bean property names (and paths consisting of these),
not command-line options, in order to get/set the parameters under optimization. Using the
`list_property_names` function of the `weka.core.classes` module, you can list the properties
of a Java object:

.. code-block:: python

    from weka.core.classes import list_property_names

    cls = Classifier(classname="weka.classifiers.trees.J48")
    for p in list_property_names(cls):
        print(p)

Parameter optimization - GridSearch
-----------------------------------

The following code optimizes the `C` property of `SMOreg` and the `gamma` property of its
`RBFKernel`; as training data `train`, a dataset with a numeric class is required, e.g.,
`bolts.arff`:

.. code-block:: python

    from weka.classifiers import GridSearch

    train = loader.load_file(data_dir + "bolts.arff")  # numeric class required
    train.class_is_last()

    grid = GridSearch(options=["-sample-size", "100.0", "-traversal", "ROW-WISE", "-num-slots", "1", "-S", "1"])
    grid.evaluation = "CC"
    grid.y = {"property": "kernel.gamma", "min": -3.0, "max": 3.0, "step": 1.0, "base": 10.0, "expression": "pow(BASE,I)"}
    grid.x = {"property": "C", "min": -3.0, "max": 3.0, "step": 1.0, "base": 10.0, "expression": "pow(BASE,I)"}

    cls = Classifier(
        classname="weka.classifiers.functions.SMOreg",
        options=["-K", "weka.classifiers.functions.supportVector.RBFKernel"])
    grid.classifier = cls
    grid.build_classifier(train)
    print("Model:\n" + str(grid))
    print("\nBest setup:\n" + grid.best.to_commandline())

**NB:** The `gridSearch` package must be installed for this to work.
Parameter optimization - MultiSearch
------------------------------------

The following code optimizes the `C` property of `SMOreg` and the `gamma` property of its
`RBFKernel` (reusing the numeric-class dataset `train` from the `GridSearch` example):

.. code-block:: python

    from weka.classifiers import MultiSearch
    from weka.core.classes import ListParameter, MathParameter

    multi = MultiSearch(options=["-S", "1"])
    multi.evaluation = "CC"

    mparam = MathParameter()
    mparam.prop = "kernel.gamma"
    mparam.minimum = -3.0
    mparam.maximum = 3.0
    mparam.step = 1.0
    mparam.base = 10.0
    mparam.expression = "pow(BASE,I)"

    lparam = ListParameter()
    lparam.prop = "C"
    lparam.values = ["-2.0", "-1.0", "0.0", "1.0", "2.0"]

    multi.parameters = [mparam, lparam]

    cls = Classifier(
        classname="weka.classifiers.functions.SMOreg",
        options=["-K", "weka.classifiers.functions.supportVector.RBFKernel"])
    multi.classifier = cls
    multi.build_classifier(train)
    print("Model:\n" + str(multi))
    print("\nBest setup:\n" + multi.best.to_commandline())

**NB:** The `multisearch-weka-package <https://github.com/fracpete/multisearch-weka-package>`_
package must be installed for this to work.

Clustering
----------

The following is an example of how to build a `SimpleKMeans` clusterer (with 3 clusters), using
a previously loaded dataset without a class attribute:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.delete_last_attribute()

    from weka.clusterers import Clusterer
    clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
    clusterer.build_clusterer(data)
    print(clusterer)

Once a clusterer is built, it can be used to cluster `Instance` objects:

.. code-block:: python

    # cluster the data
    for inst in data:
        cl = clusterer.cluster_instance(inst)  # 0-based cluster index
        dist = clusterer.distribution_for_instance(inst)  # cluster membership distribution
        print("cluster=" + str(cl) + ", distribution=" + str(dist))

Associations
------------

Associators, like `Apriori`, can be built and output like this:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.class_is_last()

    from weka.associations import Associator
    associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
    associator.build_associations(data)
    print(associator)

Attribute selection
-------------------

You can perform attribute selection using, e.g., `BestFirst` as the search algorithm and
`CfsSubsetEval` as the evaluator, as follows:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.class_is_last()

    from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
    search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
    evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
    attsel = AttributeSelection()
    attsel.search(search)
    attsel.evaluator(evaluator)
    attsel.select_attributes(data)
    print("# attributes: " + str(attsel.number_attributes_selected))
    print("attributes: " + str(attsel.selected_attributes))
    print("result string:\n" + attsel.results_string)

Attribute selection is also available through meta-schemes (see the sketch after this list):

* classifier: `weka.classifiers.meta.AttributeSelectedClassifier`
* filter: `weka.filters.supervised.attribute.AttributeSelection`
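A minimal sketch of the classifier meta-scheme, mirroring the search and evaluator setup from
above (the nested option strings are assumptions):

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.meta.AttributeSelectedClassifier",
                     options=["-E", "weka.attributeSelection.CfsSubsetEval -P 1 -E 1",
                              "-S", "weka.attributeSelection.BestFirst -D 1 -N 5",
                              "-W", "weka.classifiers.trees.J48"])
    cls.build_classifier(data)
    print(cls)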
Timeseries
----------

With the `timeseriesForecasting` package installed and the JVM started with package support,
you can perform timeseries forecasting:

.. code-block:: python

    airline_data = loader.load_file(data_dir + "airline.arff")
    airline_train, airline_test = airline_data.train_test_split(90.0)

    # configure and build
    from weka.timeseries import WekaForecaster
    from weka.classifiers import Classifier
    forecaster = WekaForecaster()
    forecaster.fields_to_forecast = ["passenger_numbers"]
    forecaster.base_forecaster = Classifier(classname="weka.classifiers.functions.LinearRegression")
    forecaster.build_forecaster(airline_train)

    # prime
    from weka.core.dataset import Instances
    num_prime_instances = 12
    airline_prime = Instances.copy_instances(
        airline_train, airline_train.num_instances - num_prime_instances, num_prime_instances)
    forecaster.prime_forecaster(airline_prime)

    # forecast
    num_future_forecasts = airline_test.num_instances
    preds = forecaster.forecast(num_future_forecasts)
    print("Actual,Predicted,Error")
    for i in range(num_future_forecasts):
        actual = airline_test.get_instance(i).get_value(0)
        predicted = preds[i][0].predicted
        error = actual - predicted
        print("%f,%f,%f" % (actual, predicted, error))

Serialization
-------------

You can easily serialize and de-serialize objects as well. Here we just save a trained
classifier to a file, load it again from disk and output the model:

.. code-block:: python

    from weka.classifiers import Classifier

    classifier = ...  # previously built classifier
    classifier.serialize("/some/where/out.model")
    ...
    classifier2, _ = Classifier.deserialize("/some/where/out.model")
    print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order
to determine whether test data is compatible). This is done as follows:

.. code-block:: python

    from weka.classifiers import Classifier

    classifier = ...  # previously built Classifier
    data = ...  # previously loaded/generated Instances
    classifier.serialize("/some/where/out.model", header=data)
    ...
    classifier2, data2 = Classifier.deserialize("/some/where/out.model")
    print(classifier2)
    print(data2)

Clusterers and filters offer the `serialize` and `deserialize` methods as well. For all other
serialization/deserialization tasks, use the methods offered by the `weka.core.classes` module
(see the sketch after this list):

* `serialization_write(file, object)`
* `serialization_write_all(file, [obj1, obj2, ...])`
* `serialization_read(file)`
* `serialization_read_all(file)`
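A minimal sketch of these generic functions, storing and restoring a dataset; wrapping the
deserialized object back into an `Instances` wrapper is an assumption here:

.. code-block:: python

    from weka.core.classes import serialization_write, serialization_read
    from weka.core.dataset import Instances

    data = ...  # previously loaded/generated Instances
    serialization_write("/some/where/data.ser", data)
    # re-wrap the deserialized Java object (assumption: Instances accepts the jobject)
    data2 = Instances(serialization_read("/some/where/data.ser"))
    print(data2)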
Experiments
-----------

Experiments, like the ones run in Weka's Experimenter, can be configured and executed as well.
Here is an example for performing a cross-validated classification experiment:

.. code-block:: python

    datasets = [
        data_dir + "iris.arff",
        data_dir + "vote.arff",
        data_dir + "anneal.arff"
    ]
    classifiers = [
        Classifier(classname="weka.classifiers.rules.ZeroR"),
        Classifier(classname="weka.classifiers.trees.J48"),
        Classifier(classname="weka.classifiers.trees.REPTree"),
    ]
    result = "exp.arff"

    from weka.experiments import SimpleCrossValidationExperiment
    exp = SimpleCrossValidationExperiment(
        classification=True,
        runs=10,
        folds=10,
        datasets=datasets,
        classifiers=classifiers,
        result=result)
    exp.setup()
    exp.run()

    import weka.core.converters
    loader = weka.core.converters.loader_for_file(result)
    data = loader.load_file(result)

    from weka.experiments import Tester, ResultMatrix
    matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.resultmatrix = matrix
    comparison_col = data.attribute_by_name("Percent_correct").index
    tester.instances = data
    print(tester.header(comparison_col))
    print(tester.multi_resultset_full(0, comparison_col))
    print(tester.multi_resultset_full(1, comparison_col))

Other parameters that can be supplied to the constructor of the `SimpleCrossValidationExperiment`
or `SimpleRandomSplitExperiment` classes (see the sketch after this list):

* `class_for_ir_statistics` - defines the class label to use for computing IR statistics like AUC
* `attribute_id` - the 0-based index of the attribute that identifies rows
* `pred_target_column` - for outputting the predictions and ground truth in separate columns in
  case of classification, e.g., for calculating confusion matrices manually afterwards
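A minimal sketch using these parameters; the values shown are merely assumptions, and
`datasets`, `classifiers` and `result` are the variables from the example above:

.. code-block:: python

    from weka.experiments import SimpleCrossValidationExperiment

    exp = SimpleCrossValidationExperiment(
        classification=True, runs=10, folds=10,
        datasets=datasets, classifiers=classifiers, result=result,
        class_for_ir_statistics=0,  # class label index to use for AUC etc.
        pred_target_column=True)    # separate columns for predictions/ground truth
    exp.setup()
    exp.run()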
And a setup for performing regression experiments on random splits of the datasets:

.. code-block:: python

    from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
    from weka.classifiers import Classifier
    import weka.core.converters as converters

    # configure experiment
    datasets = [data_dir + "bolts.arff", data_dir + "bodyfat.arff"]
    classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"),
                   Classifier(classname="weka.classifiers.functions.LinearRegression")]
    outfile = "results-rs.arff"  # store results for later analysis
    exp = SimpleRandomSplitExperiment(
        classification=False,
        runs=10,
        percentage=66.6,
        preserve_order=False,
        datasets=datasets,
        classifiers=classifiers,
        result=outfile)
    exp.setup()
    exp.run()

    # evaluate previous run
    loader = converters.loader_for_file(outfile)
    data = loader.load_file(outfile)
    matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.resultmatrix = matrix
    comparison_col = data.attribute_by_name("Correlation_coefficient").index
    tester.instances = data
    print(tester.header(comparison_col))
    print(tester.multi_resultset_full(0, comparison_col))

The `Tester` class allows you to swap columns and rows, thereby comparing datasets rather than
classifiers:

.. code-block:: python

    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.swap_rows_and_cols = True
    tester.resultmatrix = matrix

Partial classnames
------------------

All classes derived from `weka.core.classes.JavaObject`, like `Classifier`, `Filter`, etc.,
allow the use of partial classnames. So instead of instantiating a classifier like this:

.. code-block:: python

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can instantiate it with a shortened classname (it must start with a `.`):

.. code-block:: python

    cls = Classifier(classname=".J48", options=["-C", "0.3"])

**NB:** This will fail with an exception if there are no or multiple matches. For instance, the
following will result in an error, as there are two `Discretize` filters, supervised and
unsupervised:

.. code-block:: python

    cls = Filter(classname=".Discretize")

.. code-block:: bash

    Exception: Found multiple matches for '.Discretize':
    weka.filters.supervised.attribute.Discretize
    weka.filters.unsupervised.attribute.Discretize

Packages
--------

The following examples show how to list, install and uninstall an *official* package:

.. code-block:: python

    import weka.core.packages as packages

    items = packages.all_packages()
    for item in items:
        if item.name == "CLOPE":
            print(item.name + " " + item.url)

    packages.install_package("CLOPE")
    items = packages.installed_packages()
    for item in items:
        print(item.name + " " + item.url)

    packages.uninstall_package("CLOPE")
    items = packages.installed_packages()
    for item in items:
        print(item.name + " " + item.url)

You can also install *unofficial* packages. The following example installs a previously
downloaded zip file:

.. code-block:: python

    import weka.core.packages as packages

    success = packages.install_package("/some/where/funky-package-1.0.0.zip")
    print(success)

And here it is installed directly from a URL:

.. code-block:: python

    import weka.core.packages as packages

    info = packages.install_package("http://some.server.com/funky-package-1.0.0.zip", details=True)
    print(info)

Using the `details=True` flag, you receive a dictionary instead of a simple boolean. This
dictionary consists of:

* `from_repo`: whether the package was installed from the repo or not (i.e., unofficial URL or
  local archive)
* `version`: the version (only for packages from the repo)
* `error`: any error that may have occurred during installation
* `install_message`: optional message from the package maintainer on the installation
* `success`: whether the package was installed successfully

Of course, you can also install multiple packages in one go using the `install_packages` method:

.. code-block:: python

    import weka.core.packages as packages

    info = packages.install_packages([
        "http://some.server.com/funky-package-1.0.0.zip",
        "http://some.server.com/cool-package-2.0.0.zip",
        "http://some.server.com/fancy-package-1.1.0.zip",
    ], fail_fast=False, details=True)

This method offers the `details` flag as well and returns a dictionary with the package
name/URL/file name as the key and the information dictionary as the value. With the `fail_fast`
flag you can control whether to stop the installation process as soon as the first package fails
to install (`fail_fast=True`) or to keep trying to install the remaining ones
(`fail_fast=False`).

You can include automatic installation of packages in your scripts:

.. code-block:: python

    import sys
    import weka.core.jvm as jvm
    from weka.core.packages import install_missing_package, install_missing_packages, LATEST

    # installs a single package (if missing) and exits if installation occurred (outputs messages in console)
    install_missing_package("CLOPE", stop_jvm_and_exit=True)

    # installs any missing package, outputs messages in console, but restarting the JVM is left to the script
    success, exit_required = install_missing_packages([("CLOPE", LATEST), ("gridSearch", LATEST), ("multisearch", LATEST)])
    if exit_required:
        jvm.stop()
        sys.exit(0)
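If you prefer to control the flow yourself, you can check for the presence of a package first; a
small sketch, assuming the `is_installed` function of the `weka.core.packages` module:

.. code-block:: python

    import weka.core.packages as packages

    if not packages.is_installed("CLOPE"):
        packages.install_package("CLOPE")
        # newly installed packages only become available after a JVM restart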
You can also output suggested Weka packages for partial class/package names or exact class names
(the default is partial string matching):

.. code-block:: python

    # suggest package for classifier 'RBFClassifier'
    search = "RBFClassifier"
    suggestions = packages.suggest_package(search)
    print("suggested packages for " + search + ":", suggestions)

    # suggest package for package '.ft.'
    search = ".ft."
    suggestions = packages.suggest_package(search)
    print("suggested packages for " + search + ":", suggestions)

    # suggest package for classifier 'weka.classifiers.trees.J48graft'
    search = "weka.classifiers.trees.J48graft"
    suggestions = packages.suggest_package(search, exact=True)
    print("suggested packages for " + search + ":", suggestions)

Stop JVM
--------

.. code-block:: python

    jvm.stop()

Database access
---------------

Thanks to JDBC (Java Database Connectivity), it is very easy to connect to SQL databases and
load data as an `Instances` object. However, since we rely on 3rd-party libraries to achieve
this, we need to specify the JDBC driver jar of the database when starting up the JVM. For
instance, adding a MySQL driver called `mysql-connector-java-X.Y.Z-bin.jar`:

.. code-block:: python

    jvm.start(class_path=["/some/where/mysql-connector-java-X.Y.Z-bin.jar"])

Assuming the following parameters:

* database host is `dbserver`
* database is called `mydb`
* database user is `me`
* database password is `verysecret`

We can use the following code to select all the data from the table `lotsadata`:

.. code-block:: python

    from weka.core.database import InstanceQuery

    iquery = InstanceQuery()
    iquery.db_url = "jdbc:mysql://dbserver:3306/mydb"
    iquery.user = "me"
    iquery.password = "verysecret"
    iquery.query = "select * from lotsadata"
    data = iquery.retrieve_instances()
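As with datasets loaded from files, the retrieved `Instances` object has no class attribute set
yet; designate one before training on it, e.g.:

.. code-block:: python

    # use the last column of the retrieved table as the class attribute
    data.class_is_last()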