
MIME type detection is a very interesting problem. I wrote the media type detection for McAfee Web Gateway 7.x, and because it was a high-performance proxy, detection speed was a major focus, along with precision, especially for "container" types such as MS Office and other OLE-based files. The core was a simple Lisp-like language that let us write signatures very quickly, combined with very aggressive caching of the data, so we avoided reading the same data again and again and relied heavily on internal caches. In our tests, detection was ~10x faster than file, and the more flexible language let us recognize more file types precisely. There were challenges with some formats, though: OLE-based files had a FAT directory structure at the end of the file, and you needed to walk the tree to find the top-level structure to distinguish an Excel file from an Excel file embedded in a Word document.
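To make the "signatures plus aggressive caching" idea concrete, here is a rough Python sketch. It is not the Web Gateway implementation or its signature language; `CachedReader`, `SIGNATURES`, `detect`, the chunk size, and the `application/x-ole-storage` label are all names I made up for illustration. The only point is that every signature checks the same cached buffer instead of re-reading the stream.

```python
import io

class CachedReader:
    """Hypothetical helper: pulls data from the stream once, serves all later peeks from memory."""
    CHUNK = 4096  # assumed prefetch size, chosen arbitrarily for this sketch

    def __init__(self, stream):
        self._stream = stream
        self._buf = b""
        self._eof = False
        self.reads = 0  # number of times the underlying stream was actually read

    def peek(self, offset, length):
        end = offset + length
        if end > len(self._buf) and not self._eof:
            want = max(end - len(self._buf), self.CHUNK)
            self._stream.seek(len(self._buf))
            data = self._stream.read(want)
            self.reads += 1
            self._buf += data
            # Assumption: a short read means end of data (true for files and BytesIO).
            self._eof = len(data) < want
        return self._buf[offset:end]

# (label, offset, magic bytes); the labels are illustrative, the magic numbers are the well-known ones
SIGNATURES = [
    ("application/pdf", 0, b"%PDF-"),
    ("image/png", 0, b"\x89PNG\r\n\x1a\n"),
    ("application/zip", 0, b"PK\x03\x04"),
    ("application/x-ole-storage", 0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"),
]

def detect(reader):
    """Return the label of the first matching signature, or a generic fallback."""
    for label, offset, magic in SIGNATURES:
        if reader.peek(offset, len(magic)) == magic:
            return label
    return "application/octet-stream"
```

For example, `detect(CachedReader(io.BytesIO(b"PK\x03\x04...")))` evaluates three signatures but touches the stream only once.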

Stream detection was also quite a fun task...



Ah, I remember: the self-extracting .msi file was one of the more challenging cases. It's an executable, it's a .cab file, and it's an OLE2 container.
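A tiny Python sketch of why such a file is awkward: a naive scan finds several valid magic numbers in the same blob. The `MAGICS` table and `find_formats` are hypothetical names for this illustration, and a real detector would have to parse the structures rather than trust a byte search, which produces false positives.

```python
# Well-known magic numbers: DOS/PE header, Microsoft Cabinet, OLE2 compound file.
MAGICS = {
    "pe-executable": b"MZ",
    "cab": b"MSCF",
    "ole2": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}

def find_formats(data):
    """Return {format name: first offset} for every magic found anywhere in data."""
    hits = {}
    for name, magic in MAGICS.items():
        pos = data.find(magic)
        if pos != -1:
            hits[name] = pos
    return hits
```

All three formats can show up in one file, so the detector has to decide which one is the "real" top-level type.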



