Examples
========

The following examples are meant to be executed in sequence, as they rely on previous steps, e.g., on data already being loaded.

For more examples, check out the example repository on GitHub:
`github.com/fracpete/python-weka-wrapper-examples <https://github.com/fracpete/python-weka-wrapper-examples>`__

Start up JVM
------------

.. code-block:: python

    import weka.core.jvm as jvm
    jvm.start()

For more information, check out the help of the `jvm` module:

.. code-block:: python

    help(jvm.start)
    help(jvm.stop)

Location of the datasets
------------------------

The following examples assume the datasets to be present in the `data_dir` directory. For instance, this could be the following directory:

.. code-block:: python

    data_dir = "/my/datasets/"

Load dataset and print it
-------------------------

.. code-block:: python

    from weka.core.converters import Loader
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file(data_dir + "iris.arff")
    data.class_is_last()
    print(data)

The `weka.core.converters` module has a convenience method for loading datasets called `load_any_file`. This method determines a loader based on the file extension and then loads the full dataset:

.. code-block:: python

    import weka.core.converters as converters
    data = converters.load_any_file(data_dir + "iris.arff")
    data.class_is_last()
    print(data)

Create dataset manually
-----------------------

The following code snippet defines the dataset structure by creating its attributes and then the dataset itself. Once the `weka.core.dataset.Instances` object is available, rows (i.e., `weka.core.dataset.Instance` objects) can be added.

.. code-block:: python

    from weka.core.dataset import Attribute, Instance, Instances

    # create attributes
    num_att = Attribute.create_numeric("num")
    date_att = Attribute.create_date("dat", "yyyy-MM-dd")
    nom_att = Attribute.create_nominal("nom", ["label1", "label2"])

    # create dataset
    dataset = Instances.create_instances("helloworld", [num_att, date_att, nom_att], 0)

    # add rows
    values = [3.1415926, date_att.parse_date("2014-04-10"), 1.0]
    inst = Instance.create_instance(values)
    dataset.add_instance(inst)

    values = [2.71828, date_att.parse_date("2014-08-09"), Instance.missing_value()]
    inst = Instance.create_instance(values)
    dataset.add_instance(inst)

    print(dataset)

Create dataset from lists
-------------------------

If your data is easily available as lists, you can also construct datasets using this approach:

.. code-block:: python

    from weka.core.dataset import create_instances_from_lists
    from random import randint

    # pure numeric
    x = [[randint(1, 10) for _ in range(5)] for _ in range(10)]
    y = [randint(0, 1) for _ in range(10)]
    dataset = create_instances_from_lists(x, y, name="generated from lists")
    print(dataset)
    dataset = create_instances_from_lists(x, name="generated from lists (no y)")
    print(dataset)

    # mixed data types
    x = [["TEXT", 1, 1.1], ["XXX", 2, 2.2]]
    y = ["A", "B"]
    dataset = create_instances_from_lists(x, y, name="generated from mixed lists")
    print(dataset)
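
If you need the generated rows back on the Python side, you can simply iterate over the dataset again. The following is a minimal sketch (not part of the original examples), assuming the `dataset` object from the snippet above is still available; each row's `values` property holds the internal numeric representation, so nominal or string values show up as indices rather than labels:

.. code-block:: python

    # assumes `dataset` from the snippet above
    for inst in dataset:
        print(inst.values)  # internal numeric representation of the row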

Create dataset from matrices
----------------------------

Another way of constructing a dataset is to use numpy matrices/arrays (e.g., obtained from a pandas data frame):

.. code-block:: python

    from weka.core.dataset import create_instances_from_matrices
    import numpy as np

    # pure numeric
    x = np.random.randn(10, 5)
    y = np.random.randn(10)
    dataset = create_instances_from_matrices(x, y, name="generated from matrices")
    print(dataset)
    dataset = create_instances_from_matrices(x, name="generated from matrix (no y)")
    print(dataset)

    # mixed data types
    x = np.array([("TEXT", 1, 1.1), ("XXX", 2, 2.2)], dtype='S20, i4, f8')
    y = np.array(["A", "B"], dtype='S20')
    dataset = create_instances_from_matrices(x, y, name="generated from mixed matrices")
    print(dataset)

Output help from underlying OptionHandler
------------------------------------------

If the underlying Java class implements the ``weka.core.OptionHandler`` interface, then you can use the ``to_help()`` method to generate a string containing the ``globalInfo()`` and ``listOptions()`` information:

.. code-block:: python

    from weka.classifiers import Classifier
    cls = Classifier(classname="weka.classifiers.trees.J48")
    print(cls.to_help())

Option handling
---------------

Any class derived from ``OptionHandler`` (module ``weka.core.classes``) allows getting and setting of the options via the ``options`` property. Depending on the sub-class, you may also be able to provide the options when instantiating the class. The following two examples instantiate a J48 classifier, one using the ``options`` property and the other using the shortcut through the constructor:

.. code-block:: python

    from weka.classifiers import Classifier
    cls = Classifier(classname="weka.classifiers.trees.J48")
    cls.options = ["-C", "0.3"]

.. code-block:: python

    from weka.classifiers import Classifier
    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the ``options`` property to retrieve the currently set options:

.. code-block:: python

    from weka.classifiers import Classifier
    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(cls.options)

Build classifier on dataset, output predictions
-----------------------------------------------

.. code-block:: python

    from weka.classifiers import Classifier
    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    cls.build_classifier(data)

    for index, inst in enumerate(data):
        pred = cls.classify_instance(inst)
        dist = cls.distribution_for_instance(inst)
        print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Build classifier on dataset, print model and draw graph
--------------------------------------------------------

.. code-block:: python

    from weka.classifiers import Classifier
    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    cls.build_classifier(data)
    print(cls)

    import weka.plot.graph as graph  # NB: pygraphviz and PIL are required
    graph.plot_dot_graph(cls.graph)

Build classifier incrementally with data and print model
----------------------------------------------------------

.. code-block:: python

    loader = Loader(classname="weka.core.converters.ArffLoader")
    iris_inc = loader.load_file(data_dir + "iris.arff", incremental=True)
    iris_inc.class_is_last()
    print(iris_inc)

    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayesUpdateable")
    cls.build_classifier(iris_inc)
    for inst in loader:
        cls.update_classifier(inst)
    print(cls)
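
The incrementally trained classifier can then be used for predictions like any other classifier. A minimal sketch (not part of the original examples) that reloads the iris data in batch mode just to obtain an instance to classify:

.. code-block:: python

    # assumption: reload iris.arff in batch mode to get instances for prediction
    iris = loader.load_file(data_dir + "iris.arff")
    iris.class_is_last()
    print(cls.classify_instance(iris.get_instance(0)))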

Cross-validate filtered classifier and print evaluation and display ROC
-------------------------------------------------------------------------

.. code-block:: python

    data = loader.load_file(data_dir + "diabetes.arff")
    data.class_is_last()

    from weka.filters import Filter
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "1-3"])

    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayes")

    from weka.classifiers import FilteredClassifier
    fc = FilteredClassifier()
    fc.filter = remove
    fc.classifier = cls

    from weka.classifiers import Evaluation
    from weka.core.classes import Random
    evl = Evaluation(data)
    evl.crossvalidate_model(fc, data, 10, Random(1))

    print(evl.percent_correct)
    print(evl.summary())
    print(evl.class_details())

    import weka.plot.classifiers as plcls  # NB: matplotlib is required
    plcls.plot_roc(evl, class_index=[0, 1], wait=True)

Cross-validate regressor, display classifier errors and predictions
---------------------------------------------------------------------

.. code-block:: python

    from weka.classifiers import PredictionOutput, KernelClassifier, Kernel
    data = loader.load_file(data_dir + "bolts.arff")
    data.class_is_last()

    cls = KernelClassifier(classname="weka.classifiers.functions.SMOreg", options=["-N", "0"])
    kernel = Kernel(classname="weka.classifiers.functions.supportVector.RBFKernel", options=["-G", "0.1"])
    cls.kernel = kernel
    pout = PredictionOutput(classname="weka.classifiers.evaluation.output.prediction.PlainText")

    evl = Evaluation(data)
    evl.crossvalidate_model(cls, data, 10, Random(1), pout)

    print(evl.summary())
    print(pout.buffer_content())

    import weka.plot.classifiers as plcls  # NB: matplotlib is required
    plcls.plot_classifier_errors(evl.predictions, wait=True)

Parameter optimization - GridSearch
-----------------------------------

The following code optimizes the `C` parameter of `SMOreg` and the `gamma` parameter of its `RBFKernel`:

.. code-block:: python

    from weka.classifiers import GridSearch

    grid = GridSearch(options=["-sample-size", "100.0", "-traversal", "ROW-WISE", "-num-slots", "1", "-S", "1"])
    grid.evaluation = "CC"
    grid.y = {"property": "kernel.gamma", "min": -3.0, "max": 3.0, "step": 1.0, "base": 10.0, "expression": "pow(BASE,I)"}
    grid.x = {"property": "C", "min": -3.0, "max": 3.0, "step": 1.0, "base": 10.0, "expression": "pow(BASE,I)"}

    cls = Classifier(
        classname="weka.classifiers.functions.SMOreg",
        options=["-K", "weka.classifiers.functions.supportVector.RBFKernel"])
    grid.classifier = cls
    grid.build_classifier(data)  # the bolts regression data loaded in the previous example
    print("Model:\n" + str(grid))
    print("\nBest setup:\n" + grid.best.to_commandline())
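
Once built, the `grid` object behaves like any other trained classifier and uses the best parameter combination it found. A minimal sketch (not part of the original examples), assuming `data` still holds the bolts regression data:

.. code-block:: python

    # assumes `grid` was built on `data` (bolts.arff) as shown above
    for index, inst in enumerate(data):
        print(str(index+1) + ": predicted=" + str(grid.classify_instance(inst)))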

**NB:** Make sure that the `GridSearch` package is not installed, as the `GridSearch` meta-classifier is already part of the monolithic `weka.jar` that comes with *python-weka-wrapper*.

Parameter optimization - MultiSearch
------------------------------------

The following code optimizes the `C` parameter of `SMOreg` and the `gamma` parameter of its `RBFKernel`:

.. code-block:: python

    from weka.classifiers import MultiSearch
    from weka.core.classes import ListParameter, MathParameter

    multi = MultiSearch(
        options=["-sample-size", "100.0", "-initial-folds", "2", "-subsequent-folds", "2",
                 "-num-slots", "1", "-S", "1"])
    multi.evaluation = "CC"

    mparam = MathParameter()
    mparam.prop = "classifier.kernel.gamma"
    mparam.minimum = -3.0
    mparam.maximum = 3.0
    mparam.step = 1.0
    mparam.base = 10.0
    mparam.expression = "pow(BASE,I)"

    lparam = ListParameter()
    lparam.prop = "classifier.C"
    lparam.values = ["-2.0", "-1.0", "0.0", "1.0", "2.0"]

    multi.parameters = [mparam, lparam]

    cls = Classifier(
        classname="weka.classifiers.functions.SMOreg",
        options=["-K", "weka.classifiers.functions.supportVector.RBFKernel"])
    multi.classifier = cls
    multi.build_classifier(data)  # the bolts regression data loaded earlier
    print("Model:\n" + str(multi))
    print("\nBest setup:\n" + multi.best.to_commandline())

**NB:** The `multisearch-weka-package <https://github.com/fracpete/multisearch-weka-package>`__ must be installed for this to work.

Experiments
-----------

.. code-block:: python

    datasets = [
        data_dir + "iris.arff",
        data_dir + "vote.arff",
        data_dir + "anneal.arff"
    ]
    classifiers = [
        Classifier(classname="weka.classifiers.rules.ZeroR"),
        Classifier(classname="weka.classifiers.trees.J48"),
        Classifier(classname="weka.classifiers.trees.REPTree"),
    ]
    result = "exp.arff"

    from weka.experiments import SimpleCrossValidationExperiment
    exp = SimpleCrossValidationExperiment(
        classification=True,
        runs=10,
        folds=10,
        datasets=datasets,
        classifiers=classifiers,
        result=result)
    exp.setup()
    exp.run()

    import weka.core.converters
    loader = weka.core.converters.loader_for_file(result)
    data = loader.load_file(result)

    from weka.experiments import Tester, ResultMatrix
    matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.resultmatrix = matrix
    comparison_col = data.attribute_by_name("Percent_correct").index
    tester.instances = data
    print(tester.header(comparison_col))
    print(tester.multi_resultset_full(0, comparison_col))
    print(tester.multi_resultset_full(1, comparison_col))

Clustering
----------

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.delete_last_attribute()

    from weka.clusterers import Clusterer
    clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
    clusterer.build_clusterer(data)
    print(clusterer)

    # cluster the data
    for inst in data:
        cl = clusterer.cluster_instance(inst)  # 0-based cluster index
        dist = clusterer.distribution_for_instance(inst)  # cluster membership distribution
        print("cluster=" + str(cl) + ", distribution=" + str(dist))
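
Besides per-instance output, you can also generate evaluation statistics for a built clusterer with the ``ClusterEvaluation`` class from the ``weka.clusterers`` module. A minimal sketch (evaluating on the same data the model was built on), assuming `clusterer` and `data` from the snippet above:

.. code-block:: python

    from weka.clusterers import ClusterEvaluation

    # assumes `clusterer` and `data` from the snippet above
    evaluation = ClusterEvaluation()
    evaluation.set_model(clusterer)
    evaluation.test_model(data)
    print(evaluation.cluster_results)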

Associations
------------

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.class_is_last()

    from weka.associations import Associator
    associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
    associator.build_associations(data)
    print(associator)

Attribute selection
-------------------

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.class_is_last()

    from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
    search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
    evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
    attsel = AttributeSelection()
    attsel.search(search)
    attsel.evaluator(evaluator)
    attsel.select_attributes(data)

    print("# attributes: " + str(attsel.number_attributes_selected))
    print("attributes: " + str(attsel.selected_attributes))
    print("result string:\n" + attsel.results_string)

Data generators
---------------

.. code-block:: python

    from weka.datagenerators import DataGenerator
    generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-B", "-P", "0.05"])
    DataGenerator.make_data(generator, ["-o", data_dir + "generated.arff"])

    generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-n", "10", "-r", "agrawal"])
    generator.dataset_format = generator.define_data_format()
    print(generator.dataset_format)
    if generator.single_mode_flag:
        for i in range(generator.num_examples_act):
            print(generator.generate_example())
    else:
        print(generator.generate_examples())

Filters
-------

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")

    from weka.filters import Filter
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
    remove.inputformat(data)
    filtered = remove.filter(data)
    print(filtered)

Partial classnames
------------------

All classes derived from `weka.core.classes.JavaObject`, like `Classifier`, `Filter`, etc., allow the use of partial classnames. So instead of instantiating a classifier like this:

.. code-block:: python

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

you can instantiate it with a shortened classname (it must start with a `.`):

.. code-block:: python

    cls = Classifier(classname=".J48", options=["-C", "0.3"])

**NB:** This will fail with an exception if there is no match or if there are multiple matches. For instance, the following results in an error, as there are two `Discretize` filters, one supervised and one unsupervised:

.. code-block:: python

    cls = Filter(classname=".Discretize")

.. code-block:: bash

    Exception: Found multiple matches for '.Discretize':
    weka.filters.supervised.attribute.Discretize
    weka.filters.unsupervised.attribute.Discretize

Packages
--------

The following examples show how to list, install and uninstall an *official* package:

.. code-block:: python

    import weka.core.packages as packages

    items = packages.all_packages()
    for item in items:
        if item.name == "CLOPE":
            print(item.name + " " + item.url)

    packages.install_package("CLOPE")
    items = packages.installed_packages()
    for item in items:
        print(item.name + " " + item.url)

    packages.uninstall_package("CLOPE")
    items = packages.installed_packages()
    for item in items:
        print(item.name + " " + item.url)

You can also install *unofficial* packages. The following example installs a previously downloaded zip file:

.. code-block:: python

    import weka.core.packages as packages
    packages.install_package("/some/where/funky-package-1.0.0.zip")

And here it is installed directly from a URL:

.. code-block:: python

    import weka.core.packages as packages
    packages.install_package("http://some.server.com/funky-package-1.0.0.zip")
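
If you want to avoid re-installing a package that is already present, you can combine the calls shown above in a small helper. This is purely illustrative and not part of the library:

.. code-block:: python

    import weka.core.packages as packages

    def ensure_installed(name):
        # hypothetical helper: only installs the official package if it is not listed yet
        if not any(item.name == name for item in packages.installed_packages()):
            packages.install_package(name)

    ensure_installed("CLOPE")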

Stop JVM
--------

.. code-block:: python

    jvm.stop()

Database access
---------------

Thanks to JDBC (Java Database Connectivity), it is easy to connect to SQL databases and load data as an `Instances` object. However, since we rely on 3rd-party libraries to achieve this, we need to specify the database's JDBC driver jar when starting up the JVM. For instance, adding a MySQL driver called `mysql-connector-java-X.Y.Z-bin.jar`:

.. code-block:: python

    jvm.start(class_path=["/some/where/mysql-connector-java-X.Y.Z-bin.jar"])

Assuming the following parameters:

* database host is `dbserver`
* database is called `mydb`
* database user is `me`
* database password is `verysecret`

we can use the following code to select all the data from the table `lotsadata`:

.. code-block:: python

    from weka.core.database import InstanceQuery

    iquery = InstanceQuery()
    iquery.db_url = "jdbc:mysql://dbserver:3306/mydb"
    iquery.user = "me"
    iquery.password = "verysecret"
    iquery.query = "select * from lotsadata"
    data = iquery.retrieve_instances()
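
The returned `data` object is a regular `Instances` object and can be processed like any dataset loaded from a file. A minimal sketch (not part of the original examples), assuming the last column of the `lotsadata` table should act as the class attribute:

.. code-block:: python

    # assumption: the last column of the retrieved data is the class attribute
    data.class_is_last()
    print("rows: " + str(data.num_instances) + ", columns: " + str(data.num_attributes))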