Tuesday, July 28, 2009

Beware! Groovy split and tokenize don't treat empty elements the same

Groovy's tokenize, which returns a List, ignores empty elements (those produced when a delimiter appears twice in succession). split keeps such elements, except trailing ones, which it drops, and returns an array (String[]). If you want to use List methods without losing your empty elements, use split and convert the array into a List in a separate step.

This might be important if you are parsing CSV files with empty cells.


import groovy.util.GroovyTestCase

class StringTests extends GroovyTestCase {

    void testSplitAndTokenize() {
        // tokenize drops the empty element between consecutive delimiters
        assertEquals(5, "This,,should,have,five,items".tokenize(',').size())
        // a single space is not empty, so tokenize keeps it
        assertEquals(6, "This, ,should,have,six,items".tokenize(',').size())

        // split keeps both the space and the empty element
        assertEquals(6, "This, ,should,have,six,items".split(',').size())
        assertEquals(6, "This,,should,have,six,items".split(',').size())

        // convert the array to a List and re-evaluate
        def fieldArray = "This,,should,have,six,items".split(',')
        def fields = fieldArray.toList()
        assert fields instanceof java.util.List
        assertEquals(6, fields.size())
    }
}
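
To make the CSV point concrete, here is a minimal sketch. The header names and the sample row are made up for illustration, and it assumes a simple row with no quoted fields and no trailing empty cells.

```groovy
// Hypothetical header and row; the row has an empty "middle" cell.
def headers = ['first', 'middle', 'last']
def row = 'Ada,,Lovelace'

// tokenize drops the empty cell, so the remaining values shift left by one column.
def tokenized = row.tokenize(',')
assert tokenized == ['Ada', 'Lovelace']

// split keeps the empty cell; toList() turns the String[] into a List.
def cells = row.split(',').toList()
assert cells == ['Ada', '', 'Lovelace']

// Pair each header with its cell to build a record.
def record = [headers, cells].transpose().collectEntries { it }
assert record == [first: 'Ada', middle: '', last: 'Lovelace']
```

With tokenize, 'Lovelace' would end up under the 'middle' header, which is the kind of silent column shift to watch for.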

4 comments:

  1. Thanks for the post!

    Another tricky thing I noticed with split is that

    assertEquals(",,,".split(',').size(),0)
    assertEquals(",,,a".split(',').size(),4)

    That is, if all the tokens are empty, the returned array does not contain empty tokens; it is itself empty.

    This caused me some headaches :)

  2. split omits trailing empty elements.

    "a,b,c".split(",").size() == 3 //as expected

    but

    "a,b,".split(",").size() == 2 //not as expected

    and

    "a,,".split(",").size() == 1 //not as expected

    However

    ",,c".split(",").size() == 3 //as expected

    Similarly,

    ",, ".split(",").size() == 3 //as expected

    and

    "a,, ".split(",").size() == 3 //as expected

    This gives rise to a hacky work-around:

    (someString + " ").split(someDelimiter).size() == yourExpectedSize

    By adding a space to the end of the string, we make sure the last element is never empty. In this case, split works as expected. If necessary, you can .trim() the last element of the array.

  3. This seems to work as well:
    foo = "A,,C"

    println foo.tokenize(",")

    bar= foo.split(",").toList()

    println "bar is " + bar



    OUTPUT is:
    [A, C]
    bar is [A, , C]

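
As a possible alternative to appending a space, the trailing-empty behaviour described in the comments can be avoided with the two-argument form of split inherited from java.lang.String: a negative limit keeps trailing empty strings. This is a small sketch using the sample strings from the comments; the `fields` variable name is only for illustration.

```groovy
// A negative limit tells String.split to keep trailing empty strings.
assert 'a,b,'.split(',', -1).size() == 3
assert 'a,,'.split(',', -1).size() == 3
assert ',,,'.split(',', -1).size() == 4

// Converting to a List works the same way as with the one-argument split.
def fields = 'a,,'.split(',', -1).toList()
assert fields == ['a', '', '']
```

Unlike the space workaround, this needs no trim() on the last element afterwards.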