2007:Design wiki backup

From MIREX Wiki

M2K 1.2 Design issues, thoughts and discussions

This page is going to be a mess to start with, as I am going to use it to dump random thoughts on.

I will tidy it up and add some structure when I have time. You may notice that much of this page criticises the complete lack of software engineering principles being applied to the design of the alpha version of M2K, much of which is my fault. Careful thought may help us to stop this happening in the Beta. This is probably our last chance to correct this mistake (we're not in a hurry for the Beta, are we?), as beyond the Beta we will be stuck with the format.


Kris


M2K 1.2 design approach

  • Take a step back and properly specify the API (interfaces), then refactor modules into it. Produce reference/abstract implementations of every type of module that may have multiple types/children.
  • Add test methods (JUnit?) to everything. We had an unacceptable number of bugs in the alpha versions that went unnoticed due to ad hoc testing. A proper API and set of abstract classes will significantly reduce the burden of doing this as, in many cases, they will allow components of the tests, or the whole test, to be inherited from a base class.
  • Divide into 4 or 5 sub-APIs? (DSP, Utility (IO, dataflow etc.), Machine Learning, Symbolic stuff, Evaluation). Certain processes are performed in many modules and should be performed by reference implementations in a Utility package or the API concerned, e.g. covariance matrix calculation (feature extraction, modelling, transformation), confusion matrix calculation (Eval) and reindexing of data arrays for use in Jama matrices.
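
The interface-plus-abstract-base idea above could look roughly like the following. This is a minimal sketch, not existing M2K code: Classifier, AbstractClassifier, MajorityClassifier and ClassifierContract are hypothetical names invented for illustration. The contract checks are plain static methods so that the sketch has no JUnit dependency; with JUnit they would live in an abstract test class that each implementation's test extends.

```java
/** Hypothetical classifier contract; not part of the existing M2K code base. */
interface Classifier {
    void train(double[][] features, int[] labels);
    int classify(double[] features);
}

/** Reference/abstract implementation holding the checks every classifier shares. */
abstract class AbstractClassifier implements Classifier {
    private boolean trained = false;

    public final void train(double[][] features, int[] labels) {
        if (features.length == 0 || features.length != labels.length) {
            throw new IllegalArgumentException("need one label per feature vector");
        }
        doTrain(features, labels);
        trained = true;
    }

    public final int classify(double[] features) {
        if (!trained) {
            throw new IllegalStateException("classify() called before train()");
        }
        return doClassify(features);
    }

    protected abstract void doTrain(double[][] features, int[] labels);
    protected abstract int doClassify(double[] features);
}

/** Toy subclass: always predicts the most frequent training label (labels assumed >= 0). */
class MajorityClassifier extends AbstractClassifier {
    private int majority;

    protected void doTrain(double[][] features, int[] labels) {
        int max = 0;
        for (int label : labels) max = Math.max(max, label);
        int[] counts = new int[max + 1];
        for (int label : labels) counts[label]++;
        majority = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[majority]) majority = i;
        }
    }

    protected int doClassify(double[] features) {
        return majority;
    }
}

/** Contract checks written once against the interface; any implementation can reuse them. */
final class ClassifierContract {
    static boolean rejectsUseBeforeTraining(Classifier c) {
        try {
            c.classify(new double[] {0.0});
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    static boolean rejectsMismatchedLabels(Classifier c) {
        try {
            c.train(new double[][] {{0.0}, {1.0}}, new int[] {0});
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }
}
```

Because the argument and state checks sit in the abstract base, a new classifier only supplies doTrain and doClassify, and it passes the shared contract checks without any test code of its own.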

M2K 1.2 specific ideas

  • Completely rebuild eval modules in a proper API. Many share common methods/functionality, and implementations could be massively simplified by proper use of abstraction (likely to reduce cockups too).
  • Deprecate/remove old ML stuff that I am gradually replacing (but not removing from cvs).
  • Remove all spaces from file names and ensure correct Java naming conventions are used throughout.
  • Insert more abstraction wherever possible; we duplicate effort/modules all over the place. E.g. we have loads of ApplyThisModel or ApplythatModel modules in the modelling pack (same in transforms). Using a decent classifier interface will allow us to cut that to one module and will enforce consistent behaviour, where appropriate.
  • Big, Sexy ML API and ensemble/bagging system (in progress).
  • Add endExecution methods to *everything*, as cleanup only occurs before the next itinerary execution! On Linux/Unix the heap size can go down as well as up (a huge advantage over Windows), so this will help M2K to behave more helpfully by releasing you from page file hell when you press abort (at present I actually have to close D2K or press run then abort).
  • Add an M2KComputeModule to extend everything from. This could have a message identifying if no endExecution method is implemented...
  • Add execution timing to modules? Output it in endExecution; great for profiling. Perhaps link it to the D2K logging level.
  • Change windowing code to work on windows in time rather than frames, or peg number of frames to a sample rate, so that audio at other rates can use same window settings (handle different sample rate streams with same setup). Overlap sizes should be in percentages.
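
The last idea above, window settings expressed in time and overlap expressed as a percentage, might be sketched as follows. WindowSpec and its method names are hypothetical and do not exist in M2K; rounding to the nearest sample is one possible choice among several.

```java
/** Hypothetical helper: window settings given in time and percent, resolved per sample rate. */
final class WindowSpec {
    final double windowMillis;
    final double overlapPercent;

    WindowSpec(double windowMillis, double overlapPercent) {
        if (windowMillis <= 0 || overlapPercent < 0 || overlapPercent >= 100) {
            throw new IllegalArgumentException("window must be > 0 ms and overlap in [0, 100)");
        }
        this.windowMillis = windowMillis;
        this.overlapPercent = overlapPercent;
    }

    /** Window length in samples at the given rate, rounded to the nearest sample. */
    int windowSamples(int sampleRateHz) {
        return (int) Math.max(1, Math.round(windowMillis * sampleRateHz / 1000.0));
    }

    /** Hop (advance between successive windows) in samples; always at least 1. */
    int hopSamples(int sampleRateHz) {
        int window = windowSamples(sampleRateHz);
        long hop = Math.round(window * (1.0 - overlapPercent / 100.0));
        return (int) Math.max(1, hop);
    }
}
```

With a single setting such as new WindowSpec(20, 50), a 44100 Hz stream gets an 882-sample window with a 441-sample hop, while a 16000 Hz stream gets 320 and 160, so both streams are analysed over the same 20 ms span.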

D2K Feedback

Counter-intuitive behaviours

API

  • Setting of input and output types as Strings. Surely these should be instances of a class or Array object? That way mistakes would be found at compile time (instead of months later, such as confusing Integer with java.lang.Integer).
  • Only one level of nesting supported.
  • Properties file is overwritten on exit, which is very annoying (and somewhat unpredictable) if running on a couple of machines with a shared filesystem (perhaps a bad idea, but very useful). Perhaps we should be able to pass a props file on the command line that would take precedence over the default one (I would prefer a file to command-line options).
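
The first point above can be illustrated without any D2K code. In this sketch, PortTypes, resolves and inputTypes are hypothetical names; the only real API used is Class.forName. A type named by a string is checked only when something resolves it at run time, whereas a Class literal is checked by the compiler.

```java
/** Contrasts string-named port types with Class-object port types. */
final class PortTypes {
    /** Resolves a type name at run time, as a string-based declaration must. */
    static boolean resolves(String typeName) {
        try {
            Class.forName(typeName);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Hypothetical typed declaration: a misspelt class name here is a compile error. */
    static Class<?>[] inputTypes() {
        return new Class<?>[] { Integer.class, double[][].class };
    }

    public static void main(String[] args) {
        System.out.println(resolves("Integer"));            // unqualified name
        System.out.println(resolves("java.lang.Integer"));  // fully qualified name
        System.out.println(inputTypes()[1].getName());      // JVM name of double[][]
    }
}
```

The unqualified name "Integer" does not resolve while "java.lang.Integer" does, which is exactly the kind of mismatch that a string-based declaration lets through until run time.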


GUI

  • The selection box encloses both the module icon and its name, making it huge for modules with long names. Often this requires you to move several modules to be able to select a particular one.
  • Can't seem to copy/paste between two nested itineraries.
  • When copy/pasting, module links between subgraphs are lost.
  • When copy/pasting, pasted modules should be centred on the last mouse click, not on the location of the old modules (plus a small increment).

Actual bugs

  • Hang caused by Thread.sleep() calls (used in network IO, eek!) when set to use more than one processor and an explicit pipesize. (Fixed in next release, woohoo!).
  • Module names that are too long for the toolkit to handle are often created when nesting an itinerary, as the nested itinerary name is prepended to all the module names in the nest.
  • Changing a nested itinerary's name when module names in the nest have already been changed can cause D2K to clip names to incorrect (shorter) lengths, or even to negative lengths, making the nest impossible to open.
  • When an exception is propagated up into the executor, it displays the error then just sits there, instead of exiting. AFAIK, RuntimeExceptions stop the executor, whereas others just cause it to sit there doing nothing. We need them to 'Fail-Fast'.
  • Inconsistent/bad behaviour with capitalised property names. E.g. a property named PropertyName, with getPropertyName() and setPropertyName(), will fail, producing the message:
    The property in the property descriptions named PropertyName does not exist.
    PropertyName DID NOT EXIST.
    Contact the module developer.

    Whereas PRopertyName, with getPRopertyName() and setPRopertyName(), appears to work fine, as does propertyName with getPropertyName() and setPropertyName()...
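
One possible explanation for the property name behaviour above is offered here as an assumption, not a confirmed diagnosis of D2K: the three observations match the standard JavaBeans decapitalisation rule, under which a name derived from a getter has its first letter lower-cased unless its first two letters are both upper case. The sketch below uses the real java.beans.Introspector.decapitalize method; PropertyNameDemo and propertyNameOf are hypothetical names.

```java
import java.beans.Introspector;

final class PropertyNameDemo {
    /** Derives the bean property name from a getter name such as "getPropertyName". */
    static String propertyNameOf(String getterName) {
        return Introspector.decapitalize(getterName.substring("get".length()));
    }

    public static void main(String[] args) {
        // Derived name differs from a declared "PropertyName", so a lookup would miss.
        System.out.println(propertyNameOf("getPropertyName"));
        // First two letters are upper case, so the name is left unchanged and matches.
        System.out.println(propertyNameOf("getPRopertyName"));
    }
}
```

If D2K compares the declared property name with a name derived in this way, a declaration of PropertyName would not match the derived propertyName, while PRopertyName and propertyName would both match.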

Wishlist

  • A simpler method of adding drop down menus for parameter selection, perhaps based on a simple object with string keys or a 2D String array (String[key][menu text]).
  • A utility method for adding file or directory name parameters, which doesn't require implementation of a custom property editor. These parameters can be quite common and often a simple string gets used (leading to errors) instead of a proper selection box.
  • Update Jama jar file to version 1.0.2.
  • It would be great to have the ability to set proximities for an itinerary via a script when running headless.
  • Having some sort of secure authentication mechanism for running on remote machines would be very valuable. Our setup at McGill prevents all our computers from being behind the same firewall, so it greatly restricts the number of computers I can run D2KServer on without letting the whole world use it.
  • It would also be very nice if, instead of just crashing when a remote proximity is set to be used in an itinerary but it is not available for some reason, there was an option of defaulting to running on another node instead, or waiting for the remote node to become available again.

Documentation

  • Clearer and more prominent declaration that modules should only produce one output per pipe, per iteration would be good, along with justification.
  • Advice on avoiding deadlocks (Modules doing double duty) might be useful. With careful thought you can have modules doing double duty and these tend to massively simplify complex itineraries, but the thought must be put into avoiding deadlocks, and this is not obvious to a new developer (who will likely try and keep functionality related to a single process/function in one module, rather than splitting it up).
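
The one-output-per-pipe-per-iteration rule can be pictured with a small stand-in. This is a simplified sketch under stated assumptions: OneItemPerFiring, offerInput and outputs are hypothetical, the isReady() and doit() methods only mimic the roles those names suggest, and the loop in main stands in for whatever scheduler actually fires a module. None of it is the D2K API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Stand-in for a module that receives a whole list but emits one item per firing. */
final class OneItemPerFiring {
    private final Queue<String> pending = new ArrayDeque<String>();
    private final List<String> outputPipe = new ArrayList<String>(); // stands in for a pipe

    /** Simulates a list arriving on the input pipe. */
    void offerInput(List<String> items) {
        pending.addAll(items);
    }

    /** The module is ready to fire while buffered items remain. */
    boolean isReady() {
        return !pending.isEmpty();
    }

    /** One firing: exactly one item is pushed to the output. */
    void doit() {
        outputPipe.add(pending.remove());
    }

    List<String> outputs() {
        return outputPipe;
    }

    public static void main(String[] args) {
        OneItemPerFiring module = new OneItemPerFiring();
        module.offerInput(Arrays.asList("a.wav", "b.wav", "c.wav"));
        int firings = 0;
        while (module.isReady()) { // the scheduler's role, simulated
            module.doit();
            firings++;
        }
        System.out.println(firings + " firings, outputs " + module.outputs());
    }
}
```

The point of the design is that the list is buffered inside the module and drained one element per firing, rather than pushed to the output in a loop within a single call.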

(Thanks go to Rebecca Fiebrink for contributing much (or all) of the next section)

Additional documentation that would've made things much easier when I was learning to use D2K, listed in no particular order:

  • It wasn't clear initially that modules must be serializable, and why. (I did discover I could use some non-serializable classes by adding them to lib/ on the nodes with D2K Server, but this would have been nice to know up front.)
  • It wasn't really clear how to implement a reentrant module, especially with regard to passing it my long list of data and "collecting" the objects that came out of the output pipes without breaking the "rule" about pushing only once per output pipe per doit() call. I figured this out eventually, but it would have been nice to have an example of this to look at.
  • It would be helpful to have explicit documentation on how to take advantage of the logging level ("debug" / "warn" / etc.) from within one's own modules.
  • It would also be nice to see in the documentation that you can throw an exception from within your module code, and that it will result in the nice error box being displayed in the GUI.
  • An issue I encountered is the guesswork involved in setting module properties via a script for the built-in D2K modules, for which I don't have the source code and therefore don't have access to the property names as they are coded in the module. Generally guesswork seems to do the trick (e.g., "fileName" works for the "File Name" property in the "Input File Name" module), but it seems like there should be an explicit naming convention for module properties that one can count on, or that the property name one would use in the script should be clearly indicated in each module's documentation shown in the "Info" tab.
  • An issue that came up at the M2K workshop was the question of objects being passed by reference or by value between modules. I haven't seen anything making this explicit in the documentation. Also, if it is the case that they are being passed by reference, there should be documentation making clear the behavior of the fan-out (does it copy the object or just pass multiple references to the same object?), and the expected behavior when an object is passed through an input pipe to a module that is being executed remotely.
  • It would be helpful to have some sort of information on how D2K decides to distribute the load when a reentrant module is set to be executed on several processors. Having a clearer picture of this might help me design my modules and itineraries better.
  • Finally, there are a few quite simple questions that nevertheless took a lot of time to figure out on my own. Having answers to these up front in the documentation would have saved me a lot of time:
    • Where should I put my own modules in the directory/package structure? What is the classpath necessary for headless execution?
    • How do I change the classpath that the GUI uses to start up? (On Mac OS X? On Windows? These are different.)
    • If my modules use X.jar, where should I put this in the D2K directory structure?
    • How secure is D2K Server? Should I just open up 7021 to the world? What are the security ramifications of doing this?
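
The pass-by-reference question raised above matters because of ordinary Java aliasing, which the following sketch demonstrates. It says nothing about what D2K actually does at a fan-out; FanOutDemo and its two methods are hypothetical and simply contrast the two possible behaviours.

```java
import java.util.Arrays;

final class FanOutDemo {
    /** Fan-out by reference: both consumers receive the same array object. */
    static double[][] fanOutByReference(double[] data) {
        return new double[][] { data, data };
    }

    /** Fan-out by value: each consumer receives its own copy. */
    static double[][] fanOutByCopy(double[] data) {
        return new double[][] { data.clone(), data.clone() };
    }

    public static void main(String[] args) {
        double[][] shared = fanOutByReference(new double[] {1.0, 2.0});
        shared[0][0] = 99.0; // "module A" modifies its input in place
        System.out.println(Arrays.toString(shared[1])); // module B sees the change

        double[][] copied = fanOutByCopy(new double[] {1.0, 2.0});
        copied[0][0] = 99.0;
        System.out.println(Arrays.toString(copied[1])); // module B is unaffected
    }
}
```

If objects are shared by reference, a module that modifies its input in place can silently change what a sibling module receives. An object sent to a remotely executed module would normally have to be serialized, which implies a copy, so local and remote runs could then behave differently; whether that is the case in D2K is the open question.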