---
title: 'File magic and libextract on Windows'
date: '2007-07-16'
published_at: '2007-07-16T13:38:00.000+10:00'
tags: ['cygwin', 'programming', 'windows']
author: 'Gavin Jackson'
excerpt: 'My current problem is quite simple, I have a directory full of files with no extensions. I would like to locate all word documents and run them through my header/footer extraction tool. Easy, I''ll jus...'
updated_at: '2007-07-16T13:53:44.787+10:00'
legacy_url: 'http://www.gavinj.net/2007/07/file-magic-and-libextract-on-windows.html'
---

My current problem is quite simple: I have a directory full of files with no extensions, and I would like to locate all the Word documents and run them through my header/footer extraction tool. Easy, I'll just use the file tool in Cygwin to check each file's binary "magic" pattern.

So I downloaded the Cygwin file package and started running some tests. Initially all of my Word, Excel and PowerPoint files came back as "Windows Installer", so I modified the magic file. After that it reported them only as "Microsoft Office Documents". It turns out that all of these Microsoft formats are essentially the same thing: Microsoft OLE objects, which contain an internal filesystem structure.

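The shared container is easy to spot by hand, because every OLE2 compound file starts with the same eight-byte signature. Here is a minimal Python sketch of that check, assuming the files sit in a single directory. The function names are illustrative and are not part of the file tool.

```python
from pathlib import Path

# Signature found at the start of every OLE2 compound file
# (legacy .doc, .xls and .ppt files, and also .msi installers).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole2(path):
    """Return True if the file starts with the OLE2 signature."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE2_MAGIC)) == OLE2_MAGIC


def find_ole2_files(directory):
    """List extensionless files in `directory` that look like OLE2 containers."""
    return sorted(
        entry
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix == "" and is_ole2(entry)
    )
```

A check like this only confirms the container. It cannot tell Word from Excel, which is why the magic pattern alone gives such a generic answer.
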
So is there a way to perform a deeper analysis of the OLE2 structure, to differentiate between the various Office formats? Yes: a library called libextractor provides an application called extract that lets you "extract" metadata from files.

[http://gnunet.org/libextractor/](http://gnunet.org/libextractor/)

I needed to use the Windows build of this software, and it needed a bit of fiddling to get it to work:

1. Extract the archive.
2. Copy all DLLs in `lib/libextract/*.dll` to `bin/`.
3. Run extract using the following command: `extract -l libextractor_ole2 -f mydoc.doc`

```
C:\Documents and Settings\Administrator\Desktop\blah\bin>e
ctor_ole2 -f mydoc.doc
filesize - 20.99 KB
filename - mydoc.doc
mimetype - application/msword
language - U.S. English
company - Vertex Systems Incorporated
paragraph count - 7
line count - 13
last saved by - Gav
character count - 147
template - Normal.dot
creation date - 2007-07-12T23:25:0
title - Text on Page 1 (Section 1)
word count - 42
page count - 7
creator - CNVT
date - 2007-07-13T00:15:0
generator - Microsoft Office Word
C:\Documents and Settings\Administrator\Desktop\blah\bin>
```

By looking at the mimetype field, we can now determine the exact type of OLE2 object we are dealing with.
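
Picking the type out of that output can be automated. The sketch below parses the `key - value` lines printed by extract and maps the mimetype to a label. The function names and labels are hypothetical, and the Excel and PowerPoint entries assume extract reports the standard MIME types for those formats, which this post does not show.

```python
# Standard MIME types for the legacy binary Office formats.
# Only application/msword appears in the output above; the other two
# entries are assumptions. The labels on the right are illustrative.
OFFICE_TYPES = {
    "application/msword": "word",
    "application/vnd.ms-excel": "excel",
    "application/vnd.ms-powerpoint": "powerpoint",
}


def parse_extract_output(text):
    """Turn 'key - value' lines into a dict, skipping anything else."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" - ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def office_kind(text):
    """Return a label for the document type, or None if unrecognised."""
    mimetype = parse_extract_output(text).get("mimetype")
    return OFFICE_TYPES.get(mimetype)


sample = """filesize - 20.99 KB
filename - mydoc.doc
mimetype - application/msword
generator - Microsoft Office Word"""

print(office_kind(sample))  # word
```

Lines without a ` - ` separator, such as the command prompt itself, are simply ignored, so the raw console capture can be passed in unchanged.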