The Groovy JVM scripting language has been around for many years now, but I never really had much interest in trying it. I finally read a bit more about it and watched a presentation, and then wanted to try it out myself by parsing a table on an HTML page and printing the output. Very little code was required, and the syntax was somewhat familiar from Java. I used Groovy/Grails Tool Suite as my IDE, since it had better code completion than MyEclipse 10.7.1.
Here’s the “final” code for the test:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("data.html")
def myTable = htmlParser.'**'.find{ it.@class == 'my_div_class'}.'**'.find{ it.@class == 'my_table_class' }
myTable.tr.eachWithIndex{ row, index ->
    println "${row.td[0]} ${row.td[3]} ${row.td[2]}"
}
On lines 1-3 we grab the TagSoup package, which can parse HTML that has missing end tags and similar problems, and create a parser. On line 4 we load the HTML file and parse it. On line 5 we extract the table we are looking for by first finding an element with the class “my_div_class” and then, inside it, the table with the class “my_table_class”. On line 6 we loop over all the rows in the table, passing a closure that, on line 7, prints the first, fourth and third cells of each row, in that order. And that’s it!
Here’s the same thing in Java, using jsoup:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HtmlParser {
    public static void main(String[] args) {
        File htmlFile = new File("src/main/java/ama/test/mavenstuff/data.html");
        try {
            Document doc = Jsoup.parse(htmlFile, null);
            Element tableElement = doc.getElementsByClass("module").get(0).getElementsByClass("table_stockexchange").get(0);
            Elements tableRows = tableElement.select("tr");
            for (int i = 0; i < tableRows.size(); i++) {
                System.out.println(tableRows.get(i).select("td").get(0).text()
                    + " " + tableRows.get(i).select("td").get(3).text()
                    + " " + tableRows.get(i).select("td").get(2).text()
                );
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
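Both versions above need a data file and an external library. To try out just the lookup and the cell reordering without either, here is a minimal sketch that uses only the XML parser and XPath support bundled with the JDK. It swaps in a different technique from TagSoup and jsoup, and it only works on well-formed markup, so it is not a replacement for them on real-world HTML. The class name `TableSketch`, the method `extractRows` and the sample rows are made up for illustration; the class names in the query are the ones from the Groovy example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableSketch {

    // Hypothetical sample data. It has to be well-formed XML, which real HTML often is not.
    static final String SAMPLE =
        "<html><body><div class='my_div_class'>"
        + "<table class='my_table_class'>"
        + "<tr><td>Nokia</td><td>1.0</td><td>2.0</td><td>3.0</td></tr>"
        + "<tr><td>Kone</td><td>4.0</td><td>5.0</td><td>6.0</td></tr>"
        + "</table></div></body></html>";

    // Returns one line per table row with the first, fourth and third cells, in that order.
    static List<String> extractRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // One query does the nested lookup: the div by class, then the table inside it, then its rows.
            NodeList rows = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "//div[@class='my_div_class']//table[@class='my_table_class']//tr",
                doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Fetch the cells once per row instead of once per printed column.
                NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
                lines.add(cells.item(0).getTextContent()
                    + " " + cells.item(3).getTextContent()
                    + " " + cells.item(2).getTextContent());
            }
            return lines;
        } catch (Exception e) {
            // Parsing fails on markup that is not well-formed, e.g. a missing end tag.
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        extractRows(SAMPLE).forEach(System.out::println);
    }
}
```

With the sample data this prints "Nokia 3.0 2.0" and "Kone 6.0 5.0". The `@class='...'` test in the query is an exact match on the whole attribute value, so an element that carries several classes would need a different query.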

