Monday, 10 June 2013

Microsoft Office Metadata with R

Sometimes I need to retrieve various items of metadata from Microsoft Office files. For the 'old-style' (i.e. '.doc' and '.xls') files perhaps a solution in python, such as hachoir, was the best way to extract this data from the ole2 file format - although perhaps it was always possible in R too? When I started digging around for a similar solution for the 'new-style' (i.e. '.xlsx' and '.docx') files I was pleasantly surprised to find the file structure is much more open, indeed it is called Office Open XML. I am by no means an expert but basically it is a zipped set of xml type files. This makes getting at the metadata so much easier. I found a simple example in python by zeekay on stack overflow. My code below is an unashamed replication of this in R.


  1. Very cool, I didn't know it was so easy. You can pick up comments in the same way, if there are any: doc = xmlInternalTreeParse(unzip('test2.docx','word/comments.xml'))

  2. Another factoid that isn't widely noted is that MS Word will open an HTML file as a Word file. If you use Knitr and Markdown, and save the document as HTML, you can change the extension on the file to ".doc" and MS Word will open it as if it were a Word file, with correct formatting. That makes editing documents from Markdown in Word easy.

  3. Nice! I modified the code so it can parse an OpenDocument XML file (e.g., LibreOffice, OpenOffice, etc). In this case, I used an ODT file (LibreOffice Writer), which unzips to meta.xml in the current directory (rather than the DOCX docProps subdirectory), and then modified the XPATH in the getNodeSet command:

    doc = xmlInternalTreeParse(unzip('mainArticle.odt','meta.xml'))

    ns=c('dc'= '')

    author = xmlValue(getNodeSet(doc, '//dc:creator', namespaces=ns)[[1]])