The flow components of python-weka-wrapper are not related to Weka’s KnowledgeFlow; they were inspired by the ADAMS workflow engine. The result is a very simple workflow engine, aimed at automating tasks and easy to extend. Instead of linking operators with explicit connections, a flow uses a tree structure that implicitly defines how the data is processed.
A workflow component is called an actor. All actors are derived from the Actor class and fall into four kinds:
- source actors generate data, but don’t consume any
- transformer actors consume and generate data, similar to a filter
- sink actors only consume data, e.g., displaying data or writing to file
- control actors define how the data is passed around in a flow, e.g., branching
The data itself is passed around in Token containers.
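To make the roles concrete, here is a minimal toy model in plain Python. It does not use the python-weka-wrapper classes: Token, number_source, doubler and collector are hypothetical names chosen for illustration, and the real actors are configured and executed differently.

```python
class Token:
    """Toy stand-in for a token container: wraps a single payload."""

    def __init__(self, payload):
        self.payload = payload


def number_source(count):
    """Source: generates tokens without consuming any."""
    for i in range(count):
        yield Token(i)


def doubler(tokens):
    """Transformer: consumes tokens and generates new ones."""
    for token in tokens:
        yield Token(token.payload * 2)


def collector(tokens):
    """Sink: only consumes tokens; here it gathers the payloads in a list."""
    return [token.payload for token in tokens]


# source -> transformer -> sink
result = collector(doubler(number_source(4)))
print(result)
```

Control actors are left out of this sketch; they would decide which of these steps receives a given token.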
Because the tree structure only provides 1-to-n connections, objects can also be stored inside a flow in a simple dictionary (internal storage), so that a value produced in one part of the tree can be used in another. Special actors store, retrieve, update and delete these objects.
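The idea behind internal storage can be sketched with an ordinary dictionary. The helper functions below are hypothetical and only mirror the four operations mentioned above (store, retrieve, update, delete); they are not the storage actors themselves.

```python
storage = {}  # toy stand-in for a flow's internal storage


def set_storage_value(name, payload):
    """Stores the payload under the given name and passes it on unchanged."""
    storage[name] = payload
    return payload


def get_storage_value(name):
    """Retrieves a previously stored value."""
    return storage[name]


def update_storage_value(name, func):
    """Replaces the stored value with func(old value)."""
    storage[name] = func(storage[name])


def delete_storage_value(name):
    """Removes the value, if present."""
    storage.pop(name, None)


# Typical use: a counter that one part of the flow increments
# and another part reads.
set_storage_value("counter", 0)
for _ in range(3):
    update_storage_value("counter", lambda value: value + 1)
counter = get_storage_value("counter")
delete_storage_value("counter")
print(counter, "counter" in storage)
```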
To find out more about a specific actor and the parameters it offers (via the config dictionary property), use one of the following actor methods:
- print_help() – outputs a description of the actor and its options on stdout
- generate_help() – generates the help string output by print_help()
Printing the layout of a flow is very simple. Assuming you have a flow variable called myflow, you simply use the tree property to output the structure: print(myflow.tree)
All actors can also be serialized to and restored from JSON; use the following property to get or set the JSON representation: json
Life cycle
The typical life-cycle of a flow (actually any actor) can be described through the following method calls:
- setup() configures and checks the flow (returns an error message if it failed, None otherwise)
- execute() performs the execution of the actors (returns an error message if it failed, None otherwise)
- wrapup() finishes up the execution
- cleanup() is destructive and frees up memory
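The four calls can be wrapped in a small helper. This is a sketch rather than part of the library: run_flow and RecordingActor are hypothetical names, and calling wrapup() and cleanup() even after a failed setup() is an assumption of this sketch. The helper relies only on the four methods listed above, so it is demonstrated here with a stand-in object instead of a real flow.

```python
def run_flow(flow):
    """Runs an actor through its life cycle.

    Returns None on success, otherwise the error message.
    """
    try:
        msg = flow.setup()
        if msg is not None:
            return "Error setting up flow:\n" + msg
        msg = flow.execute()
        if msg is not None:
            return "Error executing flow:\n" + msg
        return None
    finally:
        # always release resources, whether or not an error occurred
        flow.wrapup()
        flow.cleanup()


class RecordingActor:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self, setup_msg=None):
        self.calls = []
        self.setup_msg = setup_msg

    def setup(self):
        self.calls.append("setup")
        return self.setup_msg

    def execute(self):
        self.calls.append("execute")
        return None

    def wrapup(self):
        self.calls.append("wrapup")

    def cleanup(self):
        self.calls.append("cleanup")


ok = RecordingActor()
ok_result = run_flow(ok)
bad = RecordingActor(setup_msg="file not found")
bad_result = run_flow(bad)
print(ok.calls, bad.calls)
```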
The following source actors are available:
- CombineStorage expands storage items in a format string and forwards the generated string
- DataGenerator outputs artificial data
- FileSupplier outputs predefined file names
- ForLoop outputs integer tokens as defined by the loop setup (min, max, step)
- GetStorageValue outputs a value from internal storage
- ListFiles lists files/directories
- LoadDatabase loads data from a database using an SQL query
- Start is a dummy source that just triggers the execution of the actors that follow
- StringConstants simply outputs a list of predefined strings, one by one
The following transformers are available:
- AttributeSelection performs attribute selection on a dataset and outputs an AttributeSelectionContainer
- ClassSelector sets/unsets the class attribute of a dataset
- Convert applies simple conversion schemes to the data passing through
- Copy creates a deep copy of serializable Java objects
- CrossValidate performs cross-validation on a classifier or clusterer
- DeleteFile deletes files that match a regular expression
- DeleteStorageValue deletes a value from internal storage
- Evaluate evaluates a trained classifier/clusterer in internal storage on the data passing through
- EvaluationSummary generates a summary from a classifier/clusterer evaluation object
- Filter applies a Weka filter to the data passing through
- InitStorageValue sets the initial value for an internal storage value
- LoadDataset loads the data stored in the file received as input, using either an automatically determined loader or a user-specified one
- MathExpression computes a numeric value from an expression and numeric input
- ModelReader reads classifier/clusterer models from disk and forwards a ModelContainer
- PassThrough is a dummy that just passes through the tokens
- Predict applies a classifier/clusterer model (serialized file or from storage) to incoming Instance objects
- RenameRelation updates the relation name of Instance/Instances objects
- SetStorageValue stores the payload of the current token in internal storage
- Train builds a classifier/clusterer/associator and passes on a ModelContainer
- UpdateStorageValue applies an expression to update an internal storage value, e.g. incrementing a counter
The following sinks are available:
- ClassifierErrors displays the classifier errors obtained from an Evaluation object
- Console just outputs a string representation of the object on stdout
- DumpFile similar to Console, but stores the string representation in a file
- InstanceDumper dumps incoming Instance/Instances objects to a file
- LinePlot displays an Instances object as a line plot, using the internal format
- MatrixPlot displays an Instances object as a matrix plot
- ModelWriter stores a trained model on disk
- Null simply swallows any token (like /dev/null on Linux)
- PRC plots a precision-recall curve from an Evaluation object
- ROC plots a receiver operating characteristic (ROC) curve from an Evaluation object
Control actors
The following control actors define how data is passed around in a workflow:
- Branch forwards the same input token to all of its sub-branches
- ContainerValuePicker extracts a named value from a container, e.g. the Model from a ModelContainer
- Flow the outermost actor that also handles the internal storage
- Sequence executes its sub-actors sequentially, with the data generated by the previous being the input for the next one
- Stop stops the execution of the flow
- Tee tees off the current token and processes it separately in a sub-flow before the main processing continues; an optional condition determines when a token gets tee’d off
- Trigger executes its sub-actors whenever a token passes through (i.e., when its condition evaluates to True)
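The difference between Sequence, Branch and Tee can be illustrated with plain functions. The sketch below is a toy model under simplifying assumptions: sequence, branch and tee are hypothetical helpers, each step handles a single value, and branch collects the results of its sub-branches so they can be inspected, which the real Branch actor is not described as doing.

```python
def sequence(*steps):
    """Chains steps: the output of one step is the input of the next."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def branch(*sub_branches):
    """Forwards the same input to every sub-branch."""
    def run(value):
        return [sub(value) for sub in sub_branches]
    return run


def tee(side_flow, condition=lambda value: True):
    """Runs side_flow on the value if condition holds, then passes the value on unchanged."""
    def run(value):
        if condition(value):
            side_flow(value)
        return value
    return run


seen = []  # filled by the tee'd-off side flow
pipeline = sequence(
    lambda v: v + 1,
    tee(seen.append, condition=lambda v: v % 2 == 0),
    branch(lambda v: v * 10, lambda v: -v),
)
outputs = [pipeline(v) for v in range(4)]
print(outputs)
print(seen)
```

Note how the tee'd-off side flow only sees even values, while every value continues to the branch unchanged.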
The following conversion schemes can be used in conjunction with the Convert transformer:
- AnyToCommandline generates a command-line string from an object, e.g., a classifier
- CommandlineToAny generates an object from a command-line string, e.g., a classifier setup
- PassThrough is a dummy conversion that just passes through the data
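The round trip between an object and its command-line string can be pictured as follows. This is only an analogy in plain Python: to_commandline and from_commandline are hypothetical helpers that use shell-style quoting from the standard library, whereas the actual conversions work on Weka objects and Weka's quoting rules may differ.

```python
import shlex


def to_commandline(classname, options):
    """Joins a class name and its options into one command-line string."""
    return shlex.join([classname] + list(options))


def from_commandline(cmdline):
    """Splits a command-line string into class name and list of options."""
    parts = shlex.split(cmdline)
    return parts[0], parts[1:]


cmdline = "weka.classifiers.trees.J48 -C 0.25 -M 2"
classname, options = from_commandline(cmdline)
roundtrip = to_commandline(classname, options)
print(classname, options)
print(roundtrip)
```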
Check out the examples available through the python-weka-wrapper-examples project on GitHub. The example scripts are located in the src/wekaexamples/flow sub-directory.
Below is a code snippet for building a flow that cross-validates a classifier on a dataset and outputs the evaluation summary and the ROC and PRC curves:
from weka.classifiers import Classifier
from weka.flow.control import Flow, Branch, Sequence
from weka.flow.source import FileSupplier
from weka.flow.transformer import LoadDataset, ClassSelector, CrossValidate, EvaluationSummary
from weka.flow.sink import Console, ClassifierErrors, ROC, PRC

# top-level actors: supply file, load data, set class, cross-validate
flow = Flow(name="cross-validate classifier")

filesupplier = FileSupplier()
filesupplier.config["files"] = ["/some/where/iris.arff"]
flow.actors.append(filesupplier)

loaddataset = LoadDataset()
flow.actors.append(loaddataset)

select = ClassSelector()
select.config["index"] = "last"
flow.actors.append(select)

cv = CrossValidate()
cv.config["setup"] = Classifier(classname="weka.classifiers.trees.J48")
flow.actors.append(cv)

# the branch forwards the evaluation to each of the four sequences below
branch = Branch()
flow.actors.append(branch)

seqsum = Sequence()
seqsum.name = "summary"
branch.actors.append(seqsum)
summary = EvaluationSummary()
summary.config["title"] = "=== J48/iris ==="
summary.config["complexity"] = False
summary.config["matrix"] = True
seqsum.actors.append(summary)
console = Console()
seqsum.actors.append(console)

seqerr = Sequence()
seqerr.name = "errors"
branch.actors.append(seqerr)
errors = ClassifierErrors()
errors.config["wait"] = False
seqerr.actors.append(errors)

seqroc = Sequence()
seqroc.name = "roc"
branch.actors.append(seqroc)
roc = ROC()
roc.config["wait"] = False
roc.config["class_index"] = [0, 1, 2]
seqroc.actors.append(roc)

seqprc = Sequence()
seqprc.name = "prc"
branch.actors.append(seqprc)
prc = PRC()
prc.config["wait"] = True
prc.config["class_index"] = [0, 1, 2]
seqprc.actors.append(prc)

# run the flow
msg = flow.setup()
if msg is None:
    msg = flow.execute()
    if msg is not None:
        print("Error executing flow:\n" + msg)
else:
    print("Error setting up flow:\n" + msg)
flow.wrapup()
flow.cleanup()
With the following command you can output the built flow tree: print(flow.tree)
The above example gets printed like this:
Flow 'cross-validate classifier'
|-FileSupplier [files: 1]
|-LoadDataset [incremental: False, custom: False, loader: weka.core.converters.ArffLoader]
|-ClassSelector [index: last]
|-CrossValidate [setup: weka.classifiers.trees.J48 -C 0.25 -M 2, folds: 10]
|-Branch
| |-Sequence 'summary'
| | |-EvaluationSummary [title: === J48/iris ===, complexity: False, matrix: True]
| | |-Console [prefix: '']
| |-Sequence 'errors'
| | |-ClassifierErrors [absolute: True, title: None, outfile: None, wait: False]
| |-Sequence 'roc'
| | |-ROC [classes: [0, 1, 2], title: None, outfile: None, wait: False]
| |-Sequence 'prc'
| | |-PRC [classes: [0, 1, 2], title: None, outfile: None, wait: True]
Adding flow components is quite easy:
- choose the superclass, based on how the actor is supposed to behave:
  - source – weka.flow.source.Source
  - transformer – weka.flow.transformer.Transformer
  - sink – weka.flow.sink.Sink
- add the new class with a constructor like this: def __init__(self, name=None, options=None):
- add a description method that returns a string describing what your actor does
- add a fix_config method that ensures that all configuration options are present, along with help for them (see, e.g., transformer.ClassSelector)
- if you want to output some additional info in the tree layout, implement a quickinfo method (see, e.g., transformer.ClassSelector)
- override the setup method if you require some additional checks (e.g., file actually exists) or setup steps (e.g., loading of model from disk); return None if everything is OK, otherwise the error; don’t forget to call the super-method
- transformers and sinks can check the input data by overriding the check_input method
- the actual execution or processing of input data happens in the do_execute method; return an error string if something failed, otherwise None; sources and transformers can append the generated data (wrapped in Token objects) to the self._output list
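The steps above can be sketched as a skeleton. To keep the example runnable without Weka, it uses stand-in Token and Transformer classes defined in the snippet itself; they are not the library classes, and details such as the fix_config(options) signature, the help dictionary, the input attribute and how execute calls check_input are assumptions of this sketch. UpperCase and its strip option are hypothetical. In a real extension you would derive from weka.flow.transformer.Transformer and check an existing actor such as transformer.ClassSelector for the exact signatures.

```python
class Token:
    """Stand-in for the token container."""

    def __init__(self, payload):
        self.payload = payload


class Transformer:
    """Stand-in base class; NOT weka.flow.transformer.Transformer."""

    def __init__(self, name=None, options=None):
        self.name = name or self.__class__.__name__
        self.help = {}
        self.config = self.fix_config(dict(options or {}))
        self.input = None
        self._output = []

    def fix_config(self, options):
        return options

    def setup(self):
        return None

    def check_input(self, token):
        pass

    def execute(self):
        self.check_input(self.input)
        return self.do_execute()


class UpperCase(Transformer):
    """Hypothetical actor that upper-cases incoming strings."""

    def description(self):
        return "Turns incoming strings into upper case."

    def fix_config(self, options):
        # make sure every option has a default value and a help text
        options = super().fix_config(options)
        if "strip" not in options:
            options["strip"] = False
        if "strip" not in self.help:
            self.help["strip"] = "Whether to strip surrounding whitespace (bool)."
        return options

    def quickinfo(self):
        # extra information for the tree layout
        return "strip: " + str(self.config["strip"])

    def setup(self):
        # call the super-method first, then add custom checks
        result = super().setup()
        if result is None and not isinstance(self.config["strip"], bool):
            result = "'strip' must be a bool"
        return result

    def check_input(self, token):
        if not isinstance(token.payload, str):
            raise Exception("Expected a string payload")

    def do_execute(self):
        text = self.input.payload
        if self.config["strip"]:
            text = text.strip()
        self._output.append(Token(text.upper()))
        return None  # None signals success


actor = UpperCase(options={"strip": True})
setup_msg = actor.setup()
actor.input = Token("  iris ")
exec_msg = actor.execute()
print(actor.quickinfo(), actor._output[0].payload)
```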