---
title: 'File magic and libextract on Windows'
date: '2007-07-16'
published_at: '2007-07-16T13:38:00.000+10:00'
tags: ['cygwin', 'programming', 'windows']
author: 'Gavin Jackson'
excerpt: 'My current problem is quite simple, I have a directory full of files with no extensions. I would like to locate all word documents and run them through my header/footer extraction tool. Easy, I''ll jus...'
updated_at: '2007-07-16T13:53:44.787+10:00'
legacy_url: 'http://www.gavinj.net/2007/07/file-magic-and-libextract-on-windows.html'
---

My current problem is quite simple: I have a directory full of files with no extensions, and I would like to locate all the Word documents and run them through my header/footer extraction tool. Easy, I'll just use the file tool in cygwin to check each file's binary "magic" pattern.

So I downloaded the cygwin file package and started running some tests. Initially all of my Word, Excel and PowerPoint files came back as "Windows Installer", so I modified the magic file. After that it only reported them as "Microsoft Office Documents". It turns out that all of these Microsoft formats are essentially the same thing: Microsoft OLE objects (which contain an internal filesystem structure).

So is there a way to perform a deeper analysis of the OLE2 structure, to differentiate between the various Office formats? Yes, there is a library called libextractor, which provides an application called extract that lets you "extract" metadata from files.

[http://gnunet.org/libextractor/](http://gnunet.org/libextractor/)

I needed to use the Windows build of this software, and it needed a bit of fiddling to get it to work.

Step 1 - extract the archive

Step 2 - copy all DLLs from lib/libextract/*.dll into bin/

Step 3 - run extract using the following command: extract -l libextractor_ole2 -f mydoc.doc

```
C:\Documents and Settings\Administrator\Desktop\blah\bin>extract -l libextractor_ole2 -f mydoc.doc
filesize - 20.99 KB
filename - mydoc.doc
mimetype - application/msword
language - U.S. English
company - Vertex Systems Incorporated
paragraph count - 7
line count - 13
last saved by - Gav
character count - 147
template - Normal.dot
creation date - 2007-07-12T23:25:0
title - Text on Page 1 (Section 1)
word count - 42
page count - 7
creator - CNVT
date - 2007-07-13T00:15:0
generator - Microsoft Office Word
C:\Documents and Settings\Administrator\Desktop\blah\bin>
```

By looking at the file's MIME type, we can now determine the exact type of OLE2 object we are dealing with.
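
To run this check over a whole directory, something like the following Python sketch might do. It makes a few assumptions: that extract is on the PATH, that the plugin is loaded as libextractor_ole2, and that the output keeps the "keyword - value" layout seen above. The function names (is_ole2, parse_extract_output, find_word_documents) are made up for illustration and are not part of libextractor. The eight-byte signature is the standard OLE2 compound file header, used here as a cheap pre-filter before calling extract.

```python
import subprocess
from pathlib import Path

# OLE2 compound file signature: the first eight bytes of legacy Word,
# Excel and PowerPoint files (and of Windows Installer packages).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole2(path):
    """Return True if the file starts with the OLE2 signature."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE2_MAGIC)) == OLE2_MAGIC


def parse_extract_output(text):
    """Turn 'keyword - value' lines into a dict, skipping other lines."""
    metadata = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" - ")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


def find_word_documents(directory, extract_cmd="extract"):
    """Yield files whose reported mimetype is application/msword.

    Assumption: extract_cmd accepts the same arguments as the command
    shown in the post (-l libextractor_ole2 -f <file>).
    """
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or not is_ole2(path):
            continue
        result = subprocess.run(
            [extract_cmd, "-l", "libextractor_ole2", "-f", str(path)],
            capture_output=True,
            text=True,
            check=False,
        )
        metadata = parse_extract_output(result.stdout)
        if metadata.get("mimetype") == "application/msword":
            yield path
```

Calling `list(find_word_documents("."))` from the bin directory would then give the extensionless files that extract identifies as Word documents. Swapping the MIME type string should select a different format, using whatever string extract reports for that format.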