Bob Marinier : binning data

Introduction

A common data transformation is to take a number and assign it a label based on what range it falls in. For example, we might have a speed, and decide that [0,10) is slow, [10, 20) is medium, and [20, infinity) is fast. Of course, writing rules to do this for specific cases is pretty simple, but we'd ideally have some kind of general solution. For this problem, Tcl provides a natural solution to this problem: we can define the bins we care about and let Tcl generate the specific rules for us. This results is much more readable and maintainable code than raw rules.

We will also discuss non-Tcl alternatives: a pure Soar solution and having the input system handle it.

Finally, of course any of these approaches could be used in conjunction with Tcl constants.

Approach 1: Input System

If the data you want to bin comes from input, and the bins are static (i.e., not context-dependent), and they are unlikely to need to change, then the most efficient solution may be to have the input system (i.e., code outside of Soar) create the bins and put them on the input-link.

While slightly more efficient than a rule-based approach (it avoids rule matching), it's also very inflexible. Changing anything about the bins will typically require recompiling the system, and the next project will need to reimplement its own binning system. Of course, declarative solutions are possible (e.g., where the bins are defined in a config file).

Approach 2: Pure Soar

A pure Soar, non-Tcl general solution may also be possible, but is likely to be clunky, and will certainly be less efficient as it will rely on matching declarative structures in working memory to determine what the bins are.

Approach 2.1: Specific Soar

This approach is not actually interesting – it's just writing specific Soar rules for each bin. It's included here for comparison to the other approaches. As you can see, you need a separate rule for every bin, and the bin definitions are spread among multiple rules.

sp "bin-speed-low
   (state <s> ^io.input-link.speed-object <so>)
   (<so> ^speed <= 10.0)
-->
   (<so> ^speed-bin low)
"

sp "bin-speed-medium
   (state <s> ^io.input-link.speed-object <so>)
   (<so> ^speed {> 10.0 <= 20.0})
-->
   (<so> ^speed-bin medium)
"

sp "bin-speed-high
   (state <s> ^io.input-link.speed-object <so>)
   (<so> ^speed > 20.0)
-->
   (<so> ^speed-bin high)
"

Approach 2.2: Define Specific Bin

Here's an example that's specific to a particular set of bins. This gathers all of the constants in one place, so if you want to change the bins you can just modify one rule. The downside is you need separate rules for every value you want to bin, so this isn't really a general solution. It also requires matching on the bin structure, which is slightly less efficient than having separate rules with the bin bounds directly encoded in them. Finally, as shown here, it doesn't allow for defining whether the bounds of the ranges are included or not – it's hardcoded a particular way. And unlike the specific approach above, it requires both min and max bounds. (These last two issues are fixable within this paradigm, of course, but would require more rules.)

# this rule defines the bins
sp "define-speed-bins
   (state <s> ^superstate nil)
-->
   (<s> ^bins <bs>)
   (<bs> ^low <l> ^medium <m> ^high <h>)
   (<l> ^min 0.0 ^max 10.0)
   (<m> ^min 10.0 ^max 20.0)
   (<h> ^min 20.0 ^max 9999999.0)
"

# this rule creates them
sp "bin-speed
   (state <s> ^io.input-link.speed-object <so>
              ^bins <bs>)
   (<so> ^speed <speed>)
   (<bs> ^<bin> <b>)
   (<b> ^min <= <speed>
        ^max > <speed>)
-->
   (<so> ^speed-bin <bin>)
"

Approach 2.3: General Soar

A more general solution could allow all bins to be registered in a single place, and then just have a single generic rule for creating the bin wherever it needs to be. The problem here is in specifying the location declaratively. If the location always exists, then a rule could literally create a pointer to where the data is (and possibly a separate location of where the bin needs to go). But many data will come and go, and thus the pointers to the locations will need to come and go – this implies separate rules for each location. With these extra rules, I start to wonder if we're actually saving much over the more specific solutions.

# these rules define the bins and locations
# there are two sets: speed bins and distance bins

sp "define-speed-bins
   (state <s> ^bins <bs>)
-->
   (<bs> ^speed <sb>)
   (<sb> ^ranges <rs>
         ^source-attr speed
         ^dest-attr speed-bin)
   (<rs> ^low <l> ^medium <m> ^high <h>)
   (<l> ^min 0.0 ^max 10.0)
   (<m> ^min 10.0 ^max 20.0)
   (<h> ^min 20.0 ^max 9999999.0)
"

sp "define-speed-location
   (state <s> ^bins.speed <sb>
              ^io.input-link.speed-object <obj>)
-->
   (<sb> ^location <obj>)
"

sp "define-distance-bins
   (state <s> ^bins <bs>)
-->
   (<bs> ^distance <db>)
   (<db> ^ranges <rs>
         ^source-attr distance
         ^dest-attr distance-bin)
   (<rs> ^near <n> ^far <f>)
   (<n> ^min 0.0 ^max 50.0)
   (<f> ^min 50.0 ^max 9999999.0)
"

sp "define-distance-location
   (state <s> ^bins.distance <db>
              ^io.input-link.distance-object <obj>)
-->
   (<db> ^location <obj>)
"

# these are the generic rules that support the bins

sp "create-generic-bins
   (state <s> ^superstate nil)
-->
   (<s> ^bins <bs>)
"

sp "bin-value
   (state <s> ^bins <bs>)
   (<bs> ^<type> <b>)
   (<b> ^ranges <rs>
        ^location <loc>
        ^source-attr <sattr>
        ^dest-attr <dattr>)
   (<loc> ^<attr> <value>)
   (<rs> ^<bin> <b>)
   (<b> ^min <= <value>
        ^max > <value>)
-->
   (<loc> ^<dattr> <bin>)
"

Another consideration in creating these bin data structures is what they may do to the size of episodic memory episodes. Of course, they can be excluded, and unless you think you'll need them in retrieved episodes for some reason, that's probably a good idea.

Approach 3: Tcl

Of course, the real problem with any of the pure Soar approaches is there's just too much to type. Lots of typing means lots of potential for typos.

Tcl does a great job of reducing typing, and thus typos, making development faster. It also makes it easier to refactor code, as you just have to change the code that generates the rules, not every rule as you would in the pure Soar approaches. Finally, you can get the benefits of the specific and general Soar approaches – a general-purpose proc that generates specific rules, thus maximizing performance. The third approach below demonstrates this.

Just as with pure Soar, there are multiple approaches with Tcl. The ones shown here are intended to give you an idea of what's possible (and indeed, these have been used in real systems), but you may come up with a better way.

Approach 3.1: Specific Tcl

This approach is specific to a particular set of bins. The primary gain over the specific and specific bin Soar approaches shown above is that it makes it very easy to see what the bins are, as the details of the bounds are not buried in the rules. It is even easier to see the bin definitions but still has the performance of the specific Soar solution.

# defineSpeedBins --
#
#    Generates rules for binning speed values.
#    Min is always > and max is always <=
#
# Arguments:
#    bin: bin to create
#    min: numeric value must be > min
#    max: numeric value must be <= max
#
# Results:
#    A Soar rule is generated that creates the specified bin if the speed value falls in the specified range

proc defineSpeedBins { bin min max } {
     set body "elaborate*$bin
               (state <s> ^io.input-link.speed-object <so>)
               (<so> ^speed \{> $min <= $max\} )
              -->
               (<so> ^speed-bin $bin)
              "
         
     debugEcho $body
     sp $body
}

defineSpeedBins slow 0.0 10.0
defineSpeedBins medium 10.0 20.0
defineSpeedBins fast 20.0 9999999.0

This proc uses a general approach: assign the body of the Soar rule to a variable so it can be printed for debugging. This is especially helpful when there is a bug that prevents it from sourcing, so it can't be printed from Soar.

Also note the braces in the Soar syntax need to be escaped.

Approach 2: More flexible specific Tcl

One of the problems with the above example is it requires a min and max to be specified. It also hardcodes that the rule will use > for min and <= for max.

A more flexible approach is to specify the conditions directly in the call to the Tcl proc. This makes the calls more complex, but the resulting code is very flexible (although still specific to one set of bins).

# defineSpeedBins --
#
#    Generates rules for binning speed values using supplied conditions.
#
# Arguments:
#    bin: bin to create
#    conditions: tests to use on speed value
#
# Results:
#    A Soar rule is generated that creates the specified bin if the speed value falls in the specified range

proc defineSpeedBins2 { bin conditions } {
     set body "elaborate*$bin
               (state <s> ^io.input-link.speed-object <so>)
               (<so> ^speed \{$conditions\} )
              -->
               (<so> ^speed-bin $bin)
              "
         
     debugEcho $body
     sp $body
}

# no min value
defineSpeedBins2 slow "<= 10.0" 
# neither uses >= or <=
defineSpeedBins2 medium "> 10.0 < 20.0"
# no max value
defineSpeedBins2 fast ">= 20.0"
# speed undefined
defineSpeedBins2 undefined "nil"

Note that this does not technically require the tests to be numeric. If the speed value ever contains non-numeric values, those can be tested for, too (as in the last example).

Approach 3.3: General Tcl

This approach attempts to generalize the proc so that a single proc can be used to define multiple bin sets. To accomplish this, the location to bin must be specified, as well as the attribute to use for the bin. This version also uses a Tcl dict to define the ranges, so a single call creates all the bin rules. This was done to avoid having to specify the same location across multiple calls to the proc, which would have been an opportunity for a typo.

# mapRangesToCategory --
#
#    Generates rules for binning speed values using supplied conditions.
#
# Arguments:
#    bin: bin to create
#    conditions: tests to use on speed value
#
# Results:
#    A Soar rule is generated that creates the specified bin if the speed value falls in the specified range

# ranges: a tcl dict defining all the bins of interest (see example below)
# object: the location where the numeric values exist and where the bins will go
# destAttr: the attribute to use for the binned value
proc mapRangesToCategory { ranges object destAttr } {

    # we'll generate a separate rule for each entry in the dictionary
    # see here for an explanation of the dict for syntax: http://www.tcl.tk/man/tcl8.5/tutorial/Tcl23a.html
    dict for { category info } $ranges {
        
        # to make sure all the rules have unique names, we'll generate part of 
        # the name from the object. Since that may have "." characters in it 
        # (if it's a path) and that's not valid in a soar rule, we'll convert 
        # those to "*" and lowercase the whole thing since that's standard Soar
        set objectName [string map {"." *} [string tolower $object]]
        
        set conditions ""

        # this version also supports defining multiple tests across different
        # source attributes to define a bin
        # so we add a condition for each attribute we need to test
        # note these don't even have to be numbers, although it does assume
        # that the attributes are all on the same object 
        foreach { sourceAttr val } $info {
            append conditions "^$sourceAttr \{$val\} "
        }

        set body "elaborate*$objectName*$sourceAttr*$category
                  (state <s> ^$object <object>)
                  (<object> $conditions)
                  -->
                  (<object> ^$destAttr $category)"

                debugEcho $body
        sp $body
    }
}

# to use this, you first define the dict that defines the bins
# then you call the proc with the dict and path information
dict set speeds slow { speed "< 10" }
dict set speeds medium { speed ">= 10 <= 50" }
dict set speeds deep { speed "> 50" }
dict set speeds undefined { speed "nil" }

mapRangesToCategory $speeds io.input-link.speed-object.speed speed-bin

# here's a different example showing some of the flexibility
# it uses multiple different tests, only one of which is numeric
dict set ships avoidable { speed "< 10" has-sonar false }
dict set ships unavoidable { speed ">= 10" has-sonar true }
dict set ships probably-avoidable { speed "< 10" has-sonar true }
dict set ships maybe-avoidable { speed "> 10" has-sonar false }

mapRangesToCategory $ships ship-specs.capabilities avoidance-bin

Of course, improvements to this are possible:

  • the location where the bins are created could be definable. Indeed, rather than just specifying the name of the attribute, and entire path could be given
  • it's not really necessary to require the bin name to be specified (a default name could be generated from the value attribute if not specified)
  • this could be extended to support testing values from multiple objects (so a more complex mapping function as opposed to just simple binning)
  • this could be made easier to use for numeric values by allowing a sequence of numbers to be used, rather than requiring the max of one bin to be repeated as the min for the next bin (which is asking for errors). This would probably be a less flexible version of this proc, but could cover most of the need.

One issue is there is a complexity/flexibility tradeoff. If you just want to create a single simple bin, this is more typing than the first Tcl example above. And the use of dicts is a little awkward. Perhaps someone can come up with a more clever solution.