
Showing posts from November, 2010

Google Refine and easier HTML parsing

After spending most of the day BANGING my head on Regex and GREL to handle HTML parsing, I thought: there MUST be a better way to parse HTML! I know several of you have thought the same thing. So I took the time today to find out where and how this could be improved, either directly in Google Refine or with an extension.

It just so happens that Google Refine already has a wonderful extension built in, in the form of another expression language: Jython. Enter BeautifulSoup (love the name?), a Python library, usable from Jython, for powerful HTML parsing and entity extraction.

Here's more on how to use it easily within Refine: http://code.google.com/p/google-refine/wiki/StrippingHTML

Enjoy! -Thad
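For a rough idea of what this looks like, here is a minimal sketch rather than the exact recipe from the wiki page. It assumes the `bs4` package (BeautifulSoup 4) is available. The older BeautifulSoup 3 release is imported as `from BeautifulSoup import BeautifulSoup` and has a somewhat different API, so the calls below would need adapting there. The function names `strip_html` and `extract_links` are made up for illustration.

```python
# Assumption: BeautifulSoup 4 (the bs4 package) is installed.
from bs4 import BeautifulSoup


def strip_html(value):
    """Return the visible text of an HTML fragment with all tags removed."""
    soup = BeautifulSoup(value, "html.parser")
    # Join the text nodes with a space so words in adjacent tags
    # do not run together; strip=True trims each piece.
    return soup.get_text(" ", strip=True)


def extract_links(value):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(value, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

In a Refine Jython expression the cell contents arrive in the variable `value` and the expression ends with a `return`, so the body of `strip_html` could be pasted in more or less as is, with the import line placed above it.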