Examples
========

The following examples are meant to be executed in sequence, as they rely on previous steps,
e.g., on data being present.

For more examples, check out the example repository on GitHub:
`github.com/fracpete/python-weka-wrapper3-examples <https://github.com/fracpete/python-weka-wrapper3-examples>`__

Start up JVM
------------

.. code-block:: python

    import weka.core.jvm as jvm
    jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages,
use the following call:

.. code-block:: python

    jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in `wekafiles` in your user's home directory,
then you have two options for specifying the alternative location: use the `WEKA_HOME` environment
variable or the `packages` parameter, supplying a directory. The latter is shown below:

.. code-block:: python

    jvm.start(packages="/my/packages/are/somewhere/else")

Most of the time, you will want to increase the maximum heap size available to the JVM.
The following example reserves 512 MB:

.. code-block:: python

    jvm.start(max_heap_size="512m")

If you want to print system information at startup, then you can use the `system_info` parameter:

.. code-block:: python

    jvm.start(system_info=True)

This will output key-value pairs generated by Weka's `weka.core.SystemInfo` class, similar to this::

    DEBUG:weka.core.jvm:System info:
    DEBUG:weka.core.jvm:java.runtime.name=OpenJDK Runtime Environment
    DEBUG:weka.core.jvm:java.awt.headless=true
    ...
    DEBUG:weka.core.jvm:java.vm.compressedOopsMode=Zero based
    DEBUG:weka.core.jvm:java.vm.specification.version=11

For more information, check out the help of the `jvm` module:

.. code-block:: python

    help(jvm.start)
    help(jvm.stop)

Location of the datasets
------------------------

The following examples assume the datasets to be present in the `data_dir` directory.
For instance, this could be the following directory:

.. code-block:: python

    data_dir = "/my/datasets/"

Load dataset and print it
-------------------------

.. code-block:: python

    from weka.core.converters import Loader

    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file(data_dir + "iris.arff")
    data.class_is_last()
    print(data)

The `weka.core.converters` module has a convenience method for loading datasets called
`load_any_file`. This method determines a loader based on the file extension and then loads
the full dataset:

.. code-block:: python

    import weka.core.converters as converters

    data = converters.load_any_file(data_dir + "iris.arff")
    data.class_is_last()
    print(data)

It is also possible to define the class attribute when loading:

.. code-block:: python

    data = loader.load_file(data_dir + "iris.arff", class_index="last")
    data = converters.load_any_file(data_dir + "iris.arff", class_index="last")

The following strings are supported:

* `first`
* `second`
* `third`
* `last-2` (third to last)
* `last-1` (second to last)
* `last`
* any other string gets interpreted as a 1-based index
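For instance, to use the third attribute as the class attribute, you can pass its 1-based index
as a string; a minimal sketch, reusing the `loader` and `data_dir` from above:

.. code-block:: python

    # "3" is interpreted as the 1-based index of the class attribute
    data = loader.load_file(data_dir + "iris.arff", class_index="3")
    print(data.class_attribute.name)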
Create dataset manually
-----------------------

The following code snippet defines the dataset structure by creating its attributes and then the
dataset itself. Once the `weka.core.dataset.Instances` object is available, rows (i.e.,
`weka.core.dataset.Instance` objects) can be added.

.. code-block:: python

    from weka.core.dataset import Attribute, Instance, Instances

    # create attributes
    num_att = Attribute.create_numeric("num")
    date_att = Attribute.create_date("dat", "yyyy-MM-dd")
    nom_att = Attribute.create_nominal("nom", ["label1", "label2"])

    # create dataset
    dataset = Instances.create_instances("helloworld", [num_att, date_att, nom_att], 0)

    # add rows
    values = [3.1415926, date_att.parse_date("2014-04-10"), 1.0]
    inst = Instance.create_instance(values)
    dataset.add_instance(inst)

    values = [2.71828, date_att.parse_date("2014-08-09"), Instance.missing_value()]
    inst = Instance.create_instance(values)
    dataset.add_instance(inst)

    print(dataset)

Create dataset from lists
-------------------------

If your data is readily available as lists, you can also construct datasets using this approach
(custom column names can be supplied via `cols_x` and `col_y`):

.. code-block:: python

    from weka.core.dataset import create_instances_from_lists
    from random import randint

    # pure numeric
    x = [[randint(1, 10) for _ in range(5)] for _ in range(10)]
    y = [randint(0, 1) for _ in range(10)]
    dataset = create_instances_from_lists(x, y, name="generated from lists")
    print(dataset)
    dataset = create_instances_from_lists(x, name="generated from lists (no y)")
    print(dataset)

    # mixed data types
    x = [["TEXT", 1, 1.1], ["XXX", 2, 2.2]]
    y = ["A", "B"]
    dataset = create_instances_from_lists(x, y, name="generated from mixed lists",
                                          cols_x=["text", "integer", "float"], col_y="class")
    print(dataset)

Create dataset from matrices
----------------------------

Another way of constructing a dataset is to use numpy matrices/arrays, e.g., obtained from a
pandas DataFrame (custom column names can be supplied via `cols_x` and `col_y`):

.. code-block:: python

    from weka.core.dataset import create_instances_from_matrices
    import numpy as np

    # pure numeric
    x = np.random.randn(10, 5)
    y = np.random.randn(10)
    dataset = create_instances_from_matrices(x, y, name="generated from matrices")
    print(dataset)
    dataset = create_instances_from_matrices(x, name="generated from matrix (no y)")
    print(dataset)

    # mixed data types
    x = np.array([("TEXT", 1, 1.1), ("XXX", 2, 2.2)], dtype='S20, i4, f8')
    y = np.array(["A", "B"], dtype='S20')
    dataset = create_instances_from_matrices(x, y, name="generated from mixed matrices",
                                             cols_x=["text", "integer", "float"], col_y="class")
    print(dataset)
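For a purely numeric pandas DataFrame, this can be as simple as converting it to a numpy array
first; a minimal sketch, assuming pandas is installed (non-numeric columns would require the
mixed-type approach shown above):

.. code-block:: python

    import pandas as pd
    from weka.core.dataset import create_instances_from_matrices

    # hypothetical numeric-only DataFrame
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    dataset = create_instances_from_matrices(df.to_numpy(), name="from pandas",
                                             cols_x=list(df.columns))
    print(dataset)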
Dataset subsets
---------------

Transformations in Weka usually occur by applying filters (see section *Filters* below). However,
quite often one only wants to quickly create a subset (of columns or rows) of a dataset. For this
purpose, the `subset` method of the `weka.core.dataset.Instances` class can be used (it uses
filters under the hood to generate the actual subset):

.. code-block:: python

    from weka.core.converters import load_any_file

    data = load_any_file("/some/where/iris.arff")
    print(data.attribute_names(), data.num_instances)

    # select columns by name
    subset = data.subset(col_names=['sepallength', 'sepalwidth', 'petallength', 'petalwidth'])
    print(subset.attribute_names(), subset.num_instances)

    # select columns by range (1-based indices)
    subset = data.subset(col_range='1-3,5')
    print(subset.attribute_names(), subset.num_instances)

    # select rows by range (1-based indices)
    subset = data.subset(row_range='51-150')
    print(subset.attribute_names(), subset.num_instances)

    # invert selection of cols/rows and keep original relation name
    subset = data.subset(col_range='5', invert_cols=True,
                         row_range='51-150', invert_rows=True,
                         keep_relationame=True)
    print(subset.attribute_names(), subset.num_instances)

Data generators
---------------

Artificial data can be generated using one of Weka's data generators, e.g., the `Agrawal`
classification generator:

.. code-block:: python

    from weka.datagenerators import DataGenerator

    generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal",
                              options=["-B", "-P", "0.05"])
    DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting the data to stdout):

.. code-block:: python

    generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal",
                              options=["-n", "10", "-r", "agrawal"])
    generator.dataset_format = generator.define_data_format()
    print(generator.dataset_format)
    if generator.single_mode_flag:
        for i in range(generator.num_examples_act):
            print(generator.generate_example())
    else:
        print(generator.generate_examples())

Filters
-------

The `Filter` class from the `weka.filters` module allows you to filter datasets, e.g., removing
the last attribute using the `Remove` filter:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")

    from weka.filters import Filter
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
    remove.inputformat(data)
    filtered = remove.filter(data)
    print(filtered)
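Several filters can also be applied in one pass by chaining them; a minimal sketch, assuming the
`MultiFilter` wrapper of the `weka.filters` module and reusing the `vote` data from above:

.. code-block:: python

    from weka.filters import Filter, MultiFilter

    # hypothetical chain: replace missing values first, then remove the last attribute
    replace = Filter(classname="weka.filters.unsupervised.attribute.ReplaceMissingValues")
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])

    multi = MultiFilter()
    multi.filters = [replace, remove]
    multi.inputformat(data)
    filtered = multi.filter(data)
    print(filtered)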
Output help from underlying OptionHandler
-----------------------------------------

If the underlying Java class implements the ``weka.core.OptionHandler`` interface, then you can
use the ``to_help()`` method to generate a string containing the ``globalInfo()`` and
``listOptions()`` information:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48")
    print(cls.to_help())

Option handling
---------------

Any class derived from ``OptionHandler`` (module ``weka.core.classes``) allows getting and
setting of the options via the property ``options``. Depending on the sub-class, you may also
provide the options already when instantiating the class. The following two examples instantiate
a J48 classifier, one using the ``options`` property and the other using the shortcut through
the constructor:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48")
    cls.options = ["-C", "0.3"]

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the ``options`` property to retrieve the currently set options:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(cls.options)

Using the `to_commandline()` method, you can return a single string that contains classname and
options, just like Weka's Explorer does when copying the setup to the clipboard:

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(cls.to_commandline())

The `to_commandline(...)` function of the `weka.core.classes` module generates the command-line
string for any class that implements the `weka.core.OptionHandler` Java interface under the hood
(a lot of classes do!):

.. code-block:: python

    from weka.classifiers import Classifier
    from weka.core.classes import to_commandline

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    print(to_commandline(cls))

The reverse, generating an object from a command-line string, is done via the
`from_commandline(...)` function:

.. code-block:: python

    from weka.core.classes import from_commandline

    cmdline = 'weka.classifiers.functions.SMO -K "weka.classifiers.functions.supportVector.NormalizedPolyKernel -E 3.0"'
    classifier = from_commandline(cmdline, classname="weka.classifiers.Classifier")

Build classifier on dataset, output predictions
-----------------------------------------------

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    cls.build_classifier(data)

    for index, inst in enumerate(data):
        pred = cls.classify_instance(inst)
        dist = cls.distribution_for_instance(inst)
        print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Build classifier on dataset, print model and draw graph
--------------------------------------------------------

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
    cls.build_classifier(data)
    print(cls)

    import weka.plot.graph as graph  # NB: pygraphviz and PIL are required
    graph.plot_dot_graph(cls.graph)

Build classifier incrementally with data and print model
---------------------------------------------------------

.. code-block:: python

    loader = Loader(classname="weka.core.converters.ArffLoader")
    iris_inc = loader.load_file(data_dir + "iris.arff", incremental=True)
    iris_inc.class_is_last()
    print(iris_inc)

    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayesUpdateable")
    cls.build_classifier(iris_inc)
    for inst in loader:
        cls.update_classifier(inst)
    print(cls)
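Before moving on to cross-validation, a quick hold-out evaluation is also possible; a minimal
sketch, reusing the `train_test_split` method (also used in the *Timeseries* section below) and
assuming the `test_model` method of the `Evaluation` class:

.. code-block:: python

    from weka.classifiers import Classifier, Evaluation

    data = loader.load_file(data_dir + "iris.arff")
    data.class_is_last()
    # 66% for training, the remainder for testing
    train, test = data.train_test_split(66.0)

    cls = Classifier(classname="weka.classifiers.trees.J48")
    cls.build_classifier(train)

    evl = Evaluation(train)
    evl.test_model(cls, test)
    print(evl.summary())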
Cross-validate filtered classifier and print evaluation and display ROC
------------------------------------------------------------------------

.. code-block:: python

    data = loader.load_file(data_dir + "diabetes.arff")
    data.class_is_last()

    from weka.filters import Filter
    remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "1-3"])

    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayes")

    from weka.classifiers import FilteredClassifier
    fc = FilteredClassifier()
    fc.filter = remove
    fc.classifier = cls

    from weka.classifiers import Evaluation
    from weka.core.classes import Random
    evl = Evaluation(data)
    evl.crossvalidate_model(fc, data, 10, Random(1))
    print(evl.percent_correct)
    print(evl.summary())
    print(evl.class_details())

    import weka.plot.classifiers as plcls  # NB: matplotlib is required
    plcls.plot_roc(evl, class_index=[0, 1], wait=True)

Cross-validate regressor, display classifier errors and predictions
--------------------------------------------------------------------

.. code-block:: python

    from weka.classifiers import PredictionOutput, KernelClassifier, Kernel

    data = loader.load_file(data_dir + "bolts.arff")
    data.class_is_last()

    cls = KernelClassifier(classname="weka.classifiers.functions.SMOreg", options=["-N", "0"])
    kernel = Kernel(classname="weka.classifiers.functions.supportVector.RBFKernel", options=["-G", "0.1"])
    cls.kernel = kernel
    pout = PredictionOutput(classname="weka.classifiers.evaluation.output.prediction.PlainText")

    evl = Evaluation(data)
    evl.crossvalidate_model(cls, data, 10, Random(1), pout)
    print(evl.summary())
    print(pout.buffer_content())

    import weka.plot.classifiers as plcls  # NB: matplotlib is required
    plcls.plot_classifier_errors(evl.predictions, wait=True)

Parameter optimization - property names
---------------------------------------

Both `GridSearch` and `MultiSearch` use Java Bean property names (and paths consisting of these),
not command-line options, in order to get/set the parameters under optimization. Using the
`list_property_names` function of the `weka.core.classes` module, you can list the properties
of a Java object:

.. code-block:: python

    from weka.core.classes import list_property_names

    cls = Classifier(classname="weka.classifiers.trees.J48")
    for p in list_property_names(cls):
        print(p)

Parameter optimization - GridSearch
-----------------------------------

The following code optimizes the `C` property of `SMOreg` and the `gamma` property of its
`RBFKernel`; as training data `train`, a dataset with a numeric class is required, e.g.,
`bolts.arff`:

.. code-block:: python

    from weka.classifiers import GridSearch

    train = loader.load_file(data_dir + "bolts.arff")  # numeric class required
    train.class_is_last()

    grid = GridSearch(options=["-sample-size", "100.0", "-traversal", "ROW-WISE", "-num-slots", "1", "-S", "1"])
    grid.evaluation = "CC"
    grid.y = {"property": "kernel.gamma", "min": -3.0, "max": 3.0, "step": 1.0, "base": 10.0, "expression": "pow(BASE,I)"}
    grid.x = {"property": "C", "min": -3.0, "max": 3.0, "step": 1.0, "base": 10.0, "expression": "pow(BASE,I)"}

    cls = Classifier(
        classname="weka.classifiers.functions.SMOreg",
        options=["-K", "weka.classifiers.functions.supportVector.RBFKernel"])
    grid.classifier = cls
    grid.build_classifier(train)
    print("Model:\n" + str(grid))
    print("\nBest setup:\n" + grid.best.to_commandline())

**NB:** The `gridSearch` package must be installed for this to work.
Parameter optimization - MultiSearch
------------------------------------

The following code optimizes the `C` property of `SMOreg` and the `gamma` property of its
`RBFKernel` (reusing the numeric-class dataset `train` from the `GridSearch` example):

.. code-block:: python

    from weka.classifiers import MultiSearch
    from weka.core.classes import ListParameter, MathParameter

    multi = MultiSearch(options=["-S", "1"])
    multi.evaluation = "CC"

    mparam = MathParameter()
    mparam.prop = "kernel.gamma"
    mparam.minimum = -3.0
    mparam.maximum = 3.0
    mparam.step = 1.0
    mparam.base = 10.0
    mparam.expression = "pow(BASE,I)"

    lparam = ListParameter()
    lparam.prop = "C"
    lparam.values = ["-2.0", "-1.0", "0.0", "1.0", "2.0"]

    multi.parameters = [mparam, lparam]

    cls = Classifier(
        classname="weka.classifiers.functions.SMOreg",
        options=["-K", "weka.classifiers.functions.supportVector.RBFKernel"])
    multi.classifier = cls
    multi.build_classifier(train)
    print("Model:\n" + str(multi))
    print("\nBest setup:\n" + multi.best.to_commandline())

**NB:** The `multisearch-weka-package <https://github.com/fracpete/multisearch-weka-package>`_
package must be installed for this to work.

Clustering
----------

The following is an example of how to build a `SimpleKMeans` clusterer (with 3 clusters), using
a previously loaded dataset without a class attribute:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.delete_last_attribute()

    from weka.clusterers import Clusterer
    clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
    clusterer.build_clusterer(data)
    print(clusterer)

Once a clusterer is built, it can be used to cluster `Instance` objects:

.. code-block:: python

    # cluster the data
    for inst in data:
        cl = clusterer.cluster_instance(inst)  # 0-based cluster index
        dist = clusterer.distribution_for_instance(inst)  # cluster membership distribution
        print("cluster=" + str(cl) + ", distribution=" + str(dist))

Associations
------------

Associators, like `Apriori`, can be built and output like this:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.class_is_last()

    from weka.associations import Associator
    associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
    associator.build_associations(data)
    print(associator)

Attribute selection
-------------------

You can perform attribute selection using, e.g., `BestFirst` as the search algorithm and
`CfsSubsetEval` as the evaluator, as follows:

.. code-block:: python

    data = loader.load_file(data_dir + "vote.arff")
    data.class_is_last()

    from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
    search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
    evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
    attsel = AttributeSelection()
    attsel.search(search)
    attsel.evaluator(evaluator)
    attsel.select_attributes(data)
    print("# attributes: " + str(attsel.number_attributes_selected))
    print("attributes: " + str(attsel.selected_attributes))
    print("result string:\n" + attsel.results_string)

Attribute selection is also available through meta-schemes (see the sketch after this list):

* classifier: `weka.classifiers.meta.AttributeSelectedClassifier`
* filter: `weka.filters.supervised.attribute.AttributeSelection`
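A minimal sketch of the classifier meta-scheme, mirroring the search and evaluator setup from
above (the nested option strings are assumptions):

.. code-block:: python

    from weka.classifiers import Classifier

    cls = Classifier(classname="weka.classifiers.meta.AttributeSelectedClassifier",
                     options=["-E", "weka.attributeSelection.CfsSubsetEval -P 1 -E 1",
                              "-S", "weka.attributeSelection.BestFirst -D 1 -N 5",
                              "-W", "weka.classifiers.trees.J48"])
    cls.build_classifier(data)
    print(cls)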
Timeseries
----------

With the `timeseriesForecasting` package installed and the JVM started with package support,
you can perform timeseries forecasting:

.. code-block:: python

    airline_data = loader.load_file(data_dir + "airline.arff")
    airline_train, airline_test = airline_data.train_test_split(90.0)

    # configure and build
    from weka.timeseries import WekaForecaster
    from weka.classifiers import Classifier
    forecaster = WekaForecaster()
    forecaster.fields_to_forecast = ["passenger_numbers"]
    forecaster.base_forecaster = Classifier(classname="weka.classifiers.functions.LinearRegression")
    forecaster.build_forecaster(airline_train)

    # prime
    from weka.core.dataset import Instances
    num_prime_instances = 12
    airline_prime = Instances.copy_instances(
        airline_train, airline_train.num_instances - num_prime_instances, num_prime_instances)
    forecaster.prime_forecaster(airline_prime)

    # forecast
    num_future_forecasts = airline_test.num_instances
    preds = forecaster.forecast(num_future_forecasts)
    print("Actual,Predicted,Error")
    for i in range(num_future_forecasts):
        actual = airline_test.get_instance(i).get_value(0)
        predicted = preds[i][0].predicted
        error = actual - predicted
        print("%f,%f,%f" % (actual, predicted, error))

Serialization
-------------

You can easily serialize and de-serialize objects as well. Here we just save a trained
classifier to a file, load it again from disk and output the model:

.. code-block:: python

    from weka.classifiers import Classifier

    classifier = ...  # previously built classifier
    classifier.serialize("/some/where/out.model")
    ...
    classifier2, _ = Classifier.deserialize("/some/where/out.model")
    print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order
to determine whether test data is compatible). This is done as follows:

.. code-block:: python

    from weka.classifiers import Classifier

    classifier = ...  # previously built Classifier
    data = ...  # previously loaded/generated Instances
    classifier.serialize("/some/where/out.model", header=data)
    ...
    classifier2, data2 = Classifier.deserialize("/some/where/out.model")
    print(classifier2)
    print(data2)

Clusterers and filters offer the `serialize` and `deserialize` methods as well. For all other
serialization/deserialization tasks, use the methods offered by the `weka.core.classes` module
(see the sketch after this list):

* `serialization_write(file, object)`
* `serialization_write_all(file, [obj1, obj2, ...])`
* `serialization_read(file)`
* `serialization_read_all(file)`
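A minimal sketch of these generic functions, storing and restoring a dataset; wrapping the
deserialized object back into an `Instances` wrapper is an assumption here:

.. code-block:: python

    from weka.core.classes import serialization_write, serialization_read
    from weka.core.dataset import Instances

    data = ...  # previously loaded/generated Instances
    serialization_write("/some/where/data.ser", data)
    # re-wrap the deserialized Java object (assumption: Instances accepts the jobject)
    data2 = Instances(serialization_read("/some/where/data.ser"))
    print(data2)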
Experiments
-----------

Experiments, like the ones run in Weka's Experimenter, can be configured and executed as well.
Here is an example for performing a cross-validated classification experiment:

.. code-block:: python

    datasets = [
        data_dir + "iris.arff",
        data_dir + "vote.arff",
        data_dir + "anneal.arff"
    ]
    classifiers = [
        Classifier(classname="weka.classifiers.rules.ZeroR"),
        Classifier(classname="weka.classifiers.trees.J48"),
        Classifier(classname="weka.classifiers.trees.REPTree"),
    ]
    result = "exp.arff"

    from weka.experiments import SimpleCrossValidationExperiment
    exp = SimpleCrossValidationExperiment(
        classification=True,
        runs=10,
        folds=10,
        datasets=datasets,
        classifiers=classifiers,
        result=result)
    exp.setup()
    exp.run()

    import weka.core.converters
    loader = weka.core.converters.loader_for_file(result)
    data = loader.load_file(result)

    from weka.experiments import Tester, ResultMatrix
    matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.resultmatrix = matrix
    comparison_col = data.attribute_by_name("Percent_correct").index
    tester.instances = data
    print(tester.header(comparison_col))
    print(tester.multi_resultset_full(0, comparison_col))
    print(tester.multi_resultset_full(1, comparison_col))

Other parameters that can be supplied to the constructor of the `SimpleCrossValidationExperiment`
or `SimpleRandomSplitExperiment` classes (see the sketch after this list):

* `class_for_ir_statistics` - defines the class label to use for computing IR statistics like AUC
* `attribute_id` - the 0-based index of the attribute that identifies rows
* `pred_target_column` - for outputting the predictions and ground truth in separate columns in
  case of classification, e.g., for calculating confusion matrices manually afterwards
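A minimal sketch using these parameters; the values shown are merely assumptions, and
`datasets`, `classifiers` and `result` are the variables from the example above:

.. code-block:: python

    from weka.experiments import SimpleCrossValidationExperiment

    exp = SimpleCrossValidationExperiment(
        classification=True, runs=10, folds=10,
        datasets=datasets, classifiers=classifiers, result=result,
        class_for_ir_statistics=0,  # class label index to use for AUC etc.
        pred_target_column=True)    # separate columns for predictions/ground truth
    exp.setup()
    exp.run()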
And a setup for performing regression experiments on random splits of the datasets:

.. code-block:: python

    from weka.experiments import SimpleCrossValidationExperiment, SimpleRandomSplitExperiment, Tester, ResultMatrix
    from weka.classifiers import Classifier
    import weka.core.converters as converters

    # configure experiment
    datasets = [data_dir + "bolts.arff", data_dir + "bodyfat.arff"]
    classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"),
                   Classifier(classname="weka.classifiers.functions.LinearRegression")]
    outfile = "results-rs.arff"  # store results for later analysis
    exp = SimpleRandomSplitExperiment(
        classification=False,
        runs=10,
        percentage=66.6,
        preserve_order=False,
        datasets=datasets,
        classifiers=classifiers,
        result=outfile)
    exp.setup()
    exp.run()

    # evaluate previous run
    loader = converters.loader_for_file(outfile)
    data = loader.load_file(outfile)
    matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.resultmatrix = matrix
    comparison_col = data.attribute_by_name("Correlation_coefficient").index
    tester.instances = data
    print(tester.header(comparison_col))
    print(tester.multi_resultset_full(0, comparison_col))

The `Tester` class allows you to swap columns and rows, thereby comparing datasets rather than
classifiers:

.. code-block:: python

    tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
    tester.swap_rows_and_cols = True
    tester.resultmatrix = matrix

Partial classnames
------------------

All classes derived from `weka.core.classes.JavaObject`, like `Classifier`, `Filter`, etc.,
allow the use of partial classnames. So instead of instantiating a classifier like this:

.. code-block:: python

    cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can instantiate it with a shortened classname (it must start with a `.`):

.. code-block:: python

    cls = Classifier(classname=".J48", options=["-C", "0.3"])

**NB:** This will fail with an exception if there are no or multiple matches. For instance, the
following will result in an error, as there are two `Discretize` filters, supervised and
unsupervised:

.. code-block:: python

    cls = Filter(classname=".Discretize")

.. code-block:: bash

    Exception: Found multiple matches for '.Discretize':
    weka.filters.supervised.attribute.Discretize
    weka.filters.unsupervised.attribute.Discretize

Packages
--------

The following examples show how to list, install and uninstall an *official* package:

.. code-block:: python

    import weka.core.packages as packages

    items = packages.all_packages()
    for item in items:
        if item.name == "CLOPE":
            print(item.name + " " + item.url)

    packages.install_package("CLOPE")
    items = packages.installed_packages()
    for item in items:
        print(item.name + " " + item.url)

    packages.uninstall_package("CLOPE")
    items = packages.installed_packages()
    for item in items:
        print(item.name + " " + item.url)

You can also install *unofficial* packages. The following example installs a previously
downloaded zip file:

.. code-block:: python

    import weka.core.packages as packages

    success = packages.install_package("/some/where/funky-package-1.0.0.zip")
    print(success)

And here it is installed directly from a URL:

.. code-block:: python

    import weka.core.packages as packages

    info = packages.install_package("http://some.server.com/funky-package-1.0.0.zip", details=True)
    print(info)

Using the `details=True` flag, you receive a dictionary instead of a simple boolean. This
dictionary consists of:

* `from_repo`: whether the package was installed from the repo or not (i.e., unofficial URL or
  local archive)
* `version`: the version (only for packages from the repo)
* `error`: any error that may have occurred during installation
* `install_message`: optional message from the package maintainer on the installation
* `success`: whether the package was installed successfully

Of course, you can also install multiple packages in one go using the `install_packages` method:

.. code-block:: python

    import weka.core.packages as packages

    info = packages.install_packages([
        "http://some.server.com/funky-package-1.0.0.zip",
        "http://some.server.com/cool-package-2.0.0.zip",
        "http://some.server.com/fancy-package-1.1.0.zip",
    ], fail_fast=False, details=True)

This method offers the `details` flag as well and returns a dictionary with the package
name/URL/file name as the key and the information dictionary as the value. With the `fail_fast`
flag you can control whether to stop the installation process as soon as the first package fails
to install (`fail_fast=True`) or to keep trying to install the remaining ones
(`fail_fast=False`).

You can include automatic installation of packages in your scripts:

.. code-block:: python

    import sys
    import weka.core.jvm as jvm
    from weka.core.packages import install_missing_package, install_missing_packages, LATEST

    # installs a single package (if missing) and exits if installation occurred (outputs messages in console)
    install_missing_package("CLOPE", stop_jvm_and_exit=True)

    # installs any missing package, outputs messages in console, but restarting the JVM is left to the script
    success, exit_required = install_missing_packages([("CLOPE", LATEST), ("gridSearch", LATEST), ("multisearch", LATEST)])
    if exit_required:
        jvm.stop()
        sys.exit(0)
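If you prefer to control the flow yourself, you can check for the presence of a package first; a
small sketch, assuming the `is_installed` function of the `weka.core.packages` module:

.. code-block:: python

    import weka.core.packages as packages

    if not packages.is_installed("CLOPE"):
        packages.install_package("CLOPE")
        # newly installed packages only become available after a JVM restart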
You can also output suggested Weka packages for partial class/package names or exact class names
(the default is partial string matching):

.. code-block:: python

    # suggest package for classifier 'RBFClassifier'
    search = "RBFClassifier"
    suggestions = packages.suggest_package(search)
    print("suggested packages for " + search + ":", suggestions)

    # suggest package for package '.ft.'
    search = ".ft."
    suggestions = packages.suggest_package(search)
    print("suggested packages for " + search + ":", suggestions)

    # suggest package for classifier 'weka.classifiers.trees.J48graft'
    search = "weka.classifiers.trees.J48graft"
    suggestions = packages.suggest_package(search, exact=True)
    print("suggested packages for " + search + ":", suggestions)

Stop JVM
--------

.. code-block:: python

    jvm.stop()

Database access
---------------

Thanks to JDBC (Java Database Connectivity), it is very easy to connect to SQL databases and
load data as an `Instances` object. However, since we rely on 3rd-party libraries to achieve
this, we need to specify the JDBC driver jar of the database when starting up the JVM. For
instance, adding a MySQL driver called `mysql-connector-java-X.Y.Z-bin.jar`:

.. code-block:: python

    jvm.start(class_path=["/some/where/mysql-connector-java-X.Y.Z-bin.jar"])

Assuming the following parameters:

* database host is `dbserver`
* database is called `mydb`
* database user is `me`
* database password is `verysecret`

We can use the following code to select all the data from the table `lotsadata`:

.. code-block:: python

    from weka.core.database import InstanceQuery

    iquery = InstanceQuery()
    iquery.db_url = "jdbc:mysql://dbserver:3306/mydb"
    iquery.user = "me"
    iquery.password = "verysecret"
    iquery.query = "select * from lotsadata"
    data = iquery.retrieve_instances()
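As with datasets loaded from files, the retrieved `Instances` object has no class attribute set
yet; designate one before training on it, e.g.:

.. code-block:: python

    # use the last column of the retrieved table as the class attribute
    data.class_is_last()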