API
===

The following sections explain in more detail how to use *python-weka-wrapper* from Python via its API. You will find many more examples in the (aptly named) `examples `_ repository.

Java Virtual Machine
--------------------

In order to use the library, you need to manage the Java Virtual Machine (JVM). To start up the library, use the following code:

.. code-block:: python

   import weka.core.jvm as jvm
   jvm.start()

If you want to use the classpath environment variable and all currently installed Weka packages, use the following call:

.. code-block:: python

   jvm.start(system_cp=True, packages=True)

In case your Weka home directory is not located in `wekafiles` within your user's home directory, you have two options for specifying the alternative location: the `WEKA_HOME` environment variable or the `packages` parameter, supplying a directory. The latter is shown below:

.. code-block:: python

   jvm.start(packages="/my/packages/are/somewhere/else")

Most of the time, you will want to increase the maximum heap size available to the JVM. The following example reserves 512 MB:

.. code-block:: python

   jvm.start(max_heap_size="512m")

If you want to print system information at start-up, use the `system_info` parameter:

.. code-block:: python

   jvm.start(system_info=True)

This will output key-value pairs generated by Weka's `weka.core.SystemInfo` class, similar to this::

   DEBUG:weka.core.jvm:System info:
   DEBUG:weka.core.jvm:java.runtime.name=OpenJDK Runtime Environment
   DEBUG:weka.core.jvm:java.awt.headless=true
   ...
   DEBUG:weka.core.jvm:java.vm.compressedOopsMode=Zero based
   DEBUG:weka.core.jvm:java.vm.specification.version=11

And, finally, in order to stop the JVM again, use the following call:

.. code-block:: python

   jvm.stop()

Option handling
---------------

Any class derived from ``OptionHandler`` (module ``weka.core.classes``) allows getting and setting of its options via the property ``options``.
Depending on the sub-class, you may also provide the options when instantiating the class. The following two examples instantiate a J48 classifier, one using the ``options`` property and the other using the shortcut through the constructor:

.. code-block:: python

   from weka.classifiers import Classifier
   cls = Classifier(classname="weka.classifiers.trees.J48")
   cls.options = ["-C", "0.3"]

.. code-block:: python

   from weka.classifiers import Classifier
   cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])

You can also use the ``options`` property to retrieve the currently set options:

.. code-block:: python

   from weka.classifiers import Classifier
   cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   print(cls.options)

Data generators
---------------

Artificial data can be generated using one of Weka's data generators, e.g., the `Agrawal` classification generator:

.. code-block:: python

   from weka.datagenerators import DataGenerator
   generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-B", "-P", "0.05"])
   DataGenerator.make_data(generator, ["-o", "/some/where/outputfile.arff"])

Or using the low-level API (outputting the data to stdout):

.. code-block:: python

   generator = DataGenerator(classname="weka.datagenerators.classifiers.classification.Agrawal", options=["-n", "10", "-r", "agrawal"])
   generator.dataset_format = generator.define_data_format()
   print(generator.dataset_format)
   if generator.single_mode_flag:
       for i in range(generator.num_examples_act):
           print(generator.generate_example())
   else:
       print(generator.generate_examples())

Loaders and Savers
------------------

You can load and save datasets in various data formats using the `Loader` and `Saver` classes. The following example loads an ARFF file and saves it as CSV:

.. code-block:: python

   from weka.core.converters import Loader, Saver
   loader = Loader(classname="weka.core.converters.ArffLoader")
   data = loader.load_file("/some/where/iris.arff")
   print(data)
   saver = Saver(classname="weka.core.converters.CSVSaver")
   saver.save_file(data, "/some/where/iris.csv")

The `weka.core.converters` module has convenience methods for loading and saving datasets, called `load_any_file` and `save_any_file`. These methods determine the loader/saver based on the file extension:

.. code-block:: python

   import weka.core.converters as converters
   data = converters.load_any_file("/some/where/iris.arff")
   converters.save_any_file(data, "/some/where/else/iris.csv")

Filters
-------

The `Filter` class from the `weka.filters` module allows you to filter datasets, e.g., removing the last attribute using the `Remove` filter:

.. code-block:: python

   from weka.filters import Filter
   data = ...                      # previously loaded data
   remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
   remove.inputformat(data)        # let the filter know about the type of data to filter
   filtered = remove.filter(data)  # filter the data
   print(filtered)                 # output the filtered data

Classifiers
-----------

Here is an example of cross-validating a `J48` classifier (with confidence factor 0.3) on a dataset and outputting the summary and some specific statistics:

.. code-block:: python

   from weka.classifiers import Classifier, Evaluation
   from weka.core.classes import Random
   data = ...                      # previously loaded data
   data.class_is_last()            # set class attribute
   classifier = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   evaluation = Evaluation(data)   # initialize with priors
   evaluation.crossvalidate_model(classifier, data, 10, Random(42))  # 10-fold CV
   print(evaluation.summary())
   print("pctCorrect: " + str(evaluation.percent_correct))
   print("incorrect: " + str(evaluation.incorrect))

Here we train a classifier and output its predictions:

.. code-block:: python

   from weka.classifiers import Classifier
   data = ...                      # previously loaded data
   data.class_is_last()            # set class attribute
   cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.3"])
   cls.build_classifier(data)
   for index, inst in enumerate(data):
       pred = cls.classify_instance(inst)
       dist = cls.distribution_for_instance(inst)
       print(str(index+1) + ": label index=" + str(pred) + ", class distribution=" + str(dist))

Clusterers
----------

The following is an example of how to build a `SimpleKMeans` clusterer (with 3 clusters) using a previously loaded dataset without a class attribute:

.. code-block:: python

   from weka.clusterers import Clusterer
   data = ...                      # previously loaded dataset
   clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "3"])
   clusterer.build_clusterer(data)
   print(clusterer)

Once a clusterer is built, it can be used to cluster Instance objects:

.. code-block:: python

   for inst in data:
       cl = clusterer.cluster_instance(inst)             # 0-based cluster index
       dist = clusterer.distribution_for_instance(inst)  # cluster membership distribution
       print("cluster=" + str(cl) + ", distribution=" + str(dist))

Attribute selection
-------------------

You can perform attribute selection using `BestFirst` as the search algorithm and `CfsSubsetEval` as the evaluator as follows:

.. code-block:: python

   from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
   data = ...
   # previously loaded dataset
   search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
   evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
   attsel = AttributeSelection()
   attsel.search(search)
   attsel.evaluator(evaluator)
   attsel.select_attributes(data)
   print("# attributes: " + str(attsel.number_attributes_selected))
   print("attributes: " + str(attsel.selected_attributes))
   print("result string:\n" + attsel.results_string)

Attribute selection is also available through meta-schemes:

* classifier: `weka.classifiers.AttributeSelectedClassifier`
* filter: `weka.filters.AttributeSelection`

Associators
-----------

Associators, like `Apriori`, can be built and output like this:

.. code-block:: python

   from weka.associations import Associator
   data = ...                      # previously loaded dataset
   associator = Associator(classname="weka.associations.Apriori", options=["-N", "9", "-I"])
   associator.build_associations(data)
   print(associator)

Timeseries
----------

Timeseries forecasting can be performed with the `weka.timeseries` module (which wraps the `timeseriesForecasting` package). Notable are the `WekaForecaster` forecaster, the `TSLagMaker` filter and the `TSEvaluation` class:

.. code-block:: python

   from weka.timeseries import WekaForecaster, TSEvaluation
   from weka.classifiers import Classifier

   airline_data = ...              # previously loaded timeseries dataset
   forecaster = WekaForecaster()
   forecaster.fields_to_forecast = ["passenger_numbers"]
   forecaster.base_forecaster = Classifier(classname="weka.classifiers.functions.LinearRegression")
   forecaster.tslag_maker.timestamp_field = "Date"
   forecaster.tslag_maker.adjust_for_variance = False
   forecaster.tslag_maker.include_powers_of_time = True
   forecaster.tslag_maker.include_timelag_products = True
   forecaster.tslag_maker.remove_leading_instances_with_unknown_lag_values = False
   forecaster.tslag_maker.add_month_of_year = True
   forecaster.tslag_maker.add_quarter_of_year = True
   print("algorithm name: " + str(forecaster.algorithm_name))
   print("command-line: " + forecaster.to_commandline())
   print("lag maker: " + forecaster.tslag_maker.to_commandline())

   evaluation = TSEvaluation(airline_data, 0.0)
   evaluation.evaluate_on_training_data = False
   evaluation.evaluate_on_test_data = False
   evaluation.prime_window_size = forecaster.tslag_maker.max_lag
   evaluation.forecast_future = True
   evaluation.horizon = 20
   evaluation.evaluation_modules = "MAE,RMSE"
   evaluation.evaluate(forecaster)
   print("Evaluation setup:")
   print(evaluation)
   print("Future forecasts")
   print(evaluation.print_future_forecast_on_training_data(forecaster))

Serialization
-------------

You can easily serialize and de-serialize objects as well. Here we save a trained classifier to a file, load it again from disk and output the model:

.. code-block:: python

   from weka.classifiers import Classifier
   classifier = ...                # previously built classifier
   classifier.serialize("/some/where/out.model")
   ...
   classifier2, _ = Classifier.deserialize("/some/where/out.model")
   print(classifier2)

Weka usually saves the header of the dataset that was used for training as well (e.g., in order to determine whether test data is compatible with the model). This is done as follows:

.. code-block:: python

   from weka.classifiers import Classifier
   classifier = ...
   # previously built Classifier
   data = ...                      # previously loaded/generated Instances
   classifier.serialize("/some/where/out.model", header=data)
   ...
   classifier2, data2 = Classifier.deserialize("/some/where/out.model")
   print(classifier2)
   print(data2)

Clusterers and filters offer the `serialize` and `deserialize` methods as well. For all other serialization/deserialization tasks, use the methods offered by the `weka.core.serialization` module:

* `write(file, object)`
* `write_all(file, [obj1, obj2, ...])`
* `read(file)`
* `read_all(file)`

Experiments
-----------

Experiments, as run in Weka's Experimenter, can be configured and executed as well. Here is an example of a cross-validated classification experiment:

.. code-block:: python

   from weka.experiments import SimpleCrossValidationExperiment, Tester, ResultMatrix
   from weka.classifiers import Classifier
   import weka.core.converters as converters

   # configure experiment
   datasets = ["iris.arff", "anneal.arff"]
   classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.trees.J48")]
   outfile = "results-cv.arff"     # store results for later analysis
   exp = SimpleCrossValidationExperiment(
       classification=True,
       runs=10,
       folds=10,
       datasets=datasets,
       classifiers=classifiers,
       result=outfile)
   exp.setup()
   exp.run()

   # evaluate previous run
   loader = converters.loader_for_file(outfile)
   data = loader.load_file(outfile)
   matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
   tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
   tester.resultmatrix = matrix
   comparison_col = data.attribute_by_name("Percent_correct").index
   tester.instances = data
   print(tester.header(comparison_col))
   print(tester.multi_resultset_full(0, comparison_col))

And here is a setup for performing regression experiments on random splits of the datasets:

.. code-block:: python

   from weka.experiments import SimpleRandomSplitExperiment, Tester, ResultMatrix
   from weka.classifiers import Classifier
   import weka.core.converters as converters

   # configure experiment
   datasets = ["bolts.arff", "bodyfat.arff"]
   classifiers = [Classifier(classname="weka.classifiers.rules.ZeroR"), Classifier(classname="weka.classifiers.functions.LinearRegression")]
   outfile = "results-rs.arff"     # store results for later analysis
   exp = SimpleRandomSplitExperiment(
       classification=False,
       runs=10,
       percentage=66.6,
       preserve_order=False,
       datasets=datasets,
       classifiers=classifiers,
       result=outfile)
   exp.setup()
   exp.run()

   # evaluate previous run
   loader = converters.loader_for_file(outfile)
   data = loader.load_file(outfile)
   matrix = ResultMatrix(classname="weka.experiment.ResultMatrixPlainText")
   tester = Tester(classname="weka.experiment.PairedCorrectedTTester")
   tester.resultmatrix = matrix
   comparison_col = data.attribute_by_name("Correlation_coefficient").index
   tester.instances = data
   print(tester.header(comparison_col))
   print(tester.multi_resultset_full(0, comparison_col))

Packages
--------

Packages can be listed, installed and uninstalled using the `weka.core.packages` module:

.. code-block:: python

   import weka.core.packages as packages

   # refresh package cache
   packages.refresh_cache()

   # list all packages (name and URL)
   items = packages.all_packages()
   for item in items:
       print(item.name + " " + item.url)

   # install CLOPE package
   packages.install_package("CLOPE")
   items = packages.installed_packages()
   for item in items:
       print(item.name + " " + item.url)

   # uninstall CLOPE package
   packages.uninstall_package("CLOPE")
   items = packages.installed_packages()
   for item in items:
       print(item.name + " " + item.url)

You can also output suggested Weka packages for partial class/package names or exact class names (the default is partial string matching):

.. code-block:: python

   import weka.core.packages as packages

   # suggest package for classifier 'RBFClassifier'
   search = "RBFClassifier"
   suggestions = packages.suggest_package(search)
   print("suggested packages for " + search + ":", suggestions)

   # suggest package for package '.ft.'
   search = ".ft."
   suggestions = packages.suggest_package(search)
   print("suggested packages for " + search + ":", suggestions)

   # suggest package for classifier 'weka.classifiers.trees.J48graft'
   search = "weka.classifiers.trees.J48graft"
   suggestions = packages.suggest_package(search, exact=True)
   print("suggested packages for " + search + ":", suggestions)