
This looks cool. I ran this on some web crawl data I have locally, so: all files you'd find on regular websites: HTML, CSS, JavaScript, fonts, etc.

It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document", whereas the `file` utility correctly identified all such examples as "HTML document text".

Some woff and woff2 files it identified as "TrueType Font Data"; others came back as "Unknown binary data (unknown)" with low-confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".

I like the idea, but the current implementation can't be relied on IMO; especially not for automation.

A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
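The usual fix is to check whether stdout is a terminal before emitting colour codes. A minimal sketch of that check (Python, hypothetical output string; not Magika's actual code):

    import sys

    def colorize(text, code="1;37"):
        # Only emit ANSI colour escapes when stdout is a real terminal,
        # so piping into vim, less, a file, etc. gets plain text.
        if sys.stdout.isatty():
            return f"\x1b[{code}m{text}\x1b[0m"
        return text

    print(colorize("example.html: HTML document"))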



Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue -- send us an email at magika-dev@google.com if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that, despite our best effort to test Magika extensively on various file types, it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika -- this is very useful.


Sure thing :)

Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff

These are files that were in one of my crawl datasets.

[0]: https://poc.lol/files/magika-test.tgz


Thank you - we are adding them to our test suite for the next version.


Super, thank you! I look forward to it :)

I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
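For example, a quick Python illustration (the word choices are arbitrary):

    import base64

    # Any string over the base64 alphabet whose length is a multiple of 4
    # decodes without error, so ordinary words pass even strict validation.
    for word in ["Perl", "handbags", "three"]:
        try:
            base64.b64decode(word, validate=True)
            print(word, "-> technically valid base64")
        except Exception:
            print(word, "-> rejected")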


Do you have permission to redistribute these files?


LOL nice b8 m8. For the rest of you who are curious, the files look like this:

    <HTML><HEAD>
    <TITLE>Access Denied</TITLE>
    </HEAD><BODY>
    <H1>Access Denied</H1>
     
    You don't have permission to access "http&#58;&#47;&#47;placement&#46;api&#46;test4&#46;example&#46;com&#47;" on this server.<P>
    Reference&#32;&#35;18&#46;9cb0f748&#46;1695037739&#46;283e2e00
    </BODY>
    </HTML>
Legend. "Do you have permission" hahaha.


You are asking what if this guy has "web crawl data" that google does not have?

And what if he says no, he does not have permission.


> You are asking what if this guy has "web crawl data" that google does not have?

No, I'm asking if he has permission to redistribute these files.


Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?

https://en.wikipedia.org/wiki/Fair_use


I'm asking a question.

Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?

If so, could you go ahead and post that zip? I'd like to ingest it in my model.


Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.


I don't see how publicly posting them on a forum is

> the minimum amount of information required to reproduce the bug

MAYBE if they had communicated privately that'd be an argument that made sense.


So you don't think that software development which happens in public web forums deserves fair use protection?


That's an interesting way to frame "publicly posted someone else's data without their consent for anyone to see and download"


I notice you're so invested that you haven't noticed that the files have been renamed and zipped such that they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.


[flagged]


Have fun, buddy!


It's three files that were scraped from (and so publicly available on) the web. That's not at all similar to your strawful analogy.


I'm over here trying to fathom the lack of control over one's own life it would take to cause someone to turn into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context which would make it useful for anything other than fixing the bug, and about which the original copyright holder hasn't complained.

Some people just want to argue.

If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a lawsuit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.

I don't expect anyone would be so daft.


Literally just asked a question and that seems to have set you off, bud. Are you alright? Do you need to feed your LLM more data to keep it happy?


I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?


I share your distaste for people whose only contribution is subtraction, but I suggest you lay off the sarcasm. Trolls: don't feed. (Well done on your project, BTW.)


I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.


Perhaps I misread "Maybe take a walk and get some fresh air?" - no worries though.


I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.


Literally all you did is bitch and moan about someone asking a simple question, lol. Go touch grass.


I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!


no thanks, not interested in your american nonsense laws. lecturing people who are asking SOMEONE ELSE a question is a terrible personality trait btw


181 out of 195 countries and counting!

https://en.wikipedia.org/wiki/Berne_Convention

Look at that map!

https://upload.wikimedia.org/wikipedia/commons/7/76/Berne_Co...

P.S. Berne doesn't sound like a very American name.

You would really learn a lot from reading Groklaw. Of course, I can't make you. Good luck in the world though!


man, you really are putting a lot of effort into justifying stealing other people's content


Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.


If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.


Not sure what your point is, but why would i care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?


> why would i care to learn about the laws of some other dude's country

The website you're attempting to police other people's behavior on is hosted in the country you're complaining about. Lol.

Maybe there is a website local to your country where your ideas would be better received?


You're so brave


Thanks!


What is the MIME type of a .tar file, and what are the MIME types of the constituent concatenated files within an archive format like tar?
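One simple, filename-only answer to the second half, assuming Python's tarfile and mimetypes modules and the magika-test.tgz archive from upthread:

    import tarfile, mimetypes

    # The container itself is application/x-tar (gzip-compressed here);
    # each member gets its own guess, by name only -- content sniffing
    # is a separate problem.
    with tarfile.open("magika-test.tgz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                mime, _ = mimetypes.guess_type(member.name)
                print(member.name, "->", mime or "application/octet-stream")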

hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

File signature: https://en.wikipedia.org/wiki/File_signature

PhotoRec: https://en.wikipedia.org/wiki/PhotoRec

"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/

Kaitai format cross-reference table: https://formats.kaitai.io/xref.html

AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...

Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression

Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database

sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...

clamav sigtool: https://www.google.com/search?q=clamav+sigtool

https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :

    sigtool --find-sigs "$name" | sigtool --decode-sigs
List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then there are all the other places files get hashed or scanned:

- clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools,

- `debsums` / `rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they are supposed to have been installed from,

- a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good (sketch below),

- ~gdesktop LLM tools scan every file,

- and there are extended filesystem attributes for label-based MAC systems like SELinux, oh and NTFS ADS.
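The HIDS piece, for instance, is just a hash database plus a later comparison. A toy sketch (assuming SHA-256 and a made-up directory):

    import hashlib, json, pathlib

    def snapshot(root):
        # Map every regular file under root to its SHA-256 digest.
        return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in pathlib.Path(root).rglob("*") if p.is_file()}

    baseline = snapshot("/srv/www")              # presumed-good state
    json.dump(baseline, open("baseline.json", "w"))

    # ... later scan ...
    for path, digest in snapshot("/srv/www").items():
        if baseline.get(path) != digest:
            print("changed or new:", path)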

A sufficient cryptographic hash function yields random bits with uniform probability. DRBGs (Deterministic Random Bit Generators) need high-entropy random bits in order to continuously re-seed the RNG. Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA-256?

https://github.com/google/osv.dev/blob/master/README.md#usin... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

... With package metadata, that is, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifests of already-calculated checksums.

Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

Add'l useful formats:

> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories

Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage
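The core idea is tiny: the address of a blob is a hash of its contents. A throwaway sketch (Python, in-memory store):

    import hashlib

    store = {}

    def put(blob: bytes) -> str:
        # The content hash *is* the address, as in IPFS or magnet links.
        addr = hashlib.sha256(blob).hexdigest()
        store[addr] = blob
        return addr

    addr = put(b"hello world")
    assert store[addr] == b"hello world"
    print(addr)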


I’m not sure what this comment is trying to say


File-based hashing is done in so many places; there's so much heat.

Sub-file-based hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account in addition to zip archives and magic file numbers.

AV (antivirus) applications with LLMs: what do you train them on, and what are some of the existing signature databases?

https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.

Also, OTOH, with a time limit:

1. What file is this? Dirname, basename, hash(es)

2. Is it supposed to be installed at such path?

3. Per its header, is the file an archive or an image or a document?

4. What file(s) and records and fields are packed into a file, and what transforms were applied to the data?


> the current implementation can't be relied on IMO

What's your reasoning for not relying on this? (It seems to me that this would be application-dependent at the very least.)


I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?


It had 3 failures. How is that a sign it's untrustworthy? I'm sure all alternatives have more than 3 failures. You might be making assumptions about the distribution of successes and failures (GP didn't say how many files they tested to find those 3) or how "soft" they were. In an extreme case, they might even have been crafted adversarial examples. But even if not, they might have features that really do look more like some other file type from the point of view of the classifier even if it's not easily apparent to a human. Being strictly superior to a competent human is a pretty high bar to set.


> or how "soft" they were.

From the comment: It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".

That's pretty soft. Nothing "adversarial" claimed either.

> Being strictly superior to a competent human is a pretty high bar to set.

The bar is the file utility.


Those are only soft to a human. I looked at a couple and I picked them correctly but I don't know what details the classifier was seeing which I was blind to. Not to say it was correct, just that we can't call them soft just because they're short and easy for a human.

> The bar is the file utility.

It has higher accuracy than that. You would reject it just because the failures are different, even though there are fewer of them?


Yes. Unpredictable failures are significantly worse than predictable ones. If `file` messes up, it's because it decided a ZIP-based document was a generic ZIP file. If Magika messes up, it's entirely random. I can work around `file`'s failure modes, especially if it's one I work with often. Magika's failure modes strike at random and are not possible to anticipate. `file` also bails out when it doesn't know; a very common failure mode in Magika is that it confidently returns a random answer when it wasn't trained on a file type.


Your original statement was that having a couple of failures brings into question its claims about performance. It doesn't, because it doesn't claim such high performance: 99.31% is lower than perhaps 997 out of 1000, or whatever the GP tested. Of course having unpredictable failures is a worry, but it's a different worry.
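Back-of-the-envelope, with the sample size being a pure guess since the GP didn't give one:

    claimed_accuracy = 0.9931
    n_tested = 1000                            # guess at the GP's sample size
    print(n_tested * (1 - claimed_accuracy))   # ~6.9 expected misses, so 3 observed is better than claimed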


They uploaded 3 sample files for the authors; there were more failures than that, and the failures that GP and others have experienced are of a less tolerable nature. This is the point I was making: the value added by classifying files with no rigid structure is offset heavily by its unpredictable shortcomings and difficult-to-detect failure modes.

If you have a point of your own to make I'd prefer you jump to it. Nitpicking baseless assumptions like how many files the evil GP had to sift through in order to breathlessly bring us 3 bad eggs is not something I find worthwhile.


The point I'm making is that you drew a conclusion based on insufficient information, apparently by making assumptions about the distribution of failures or the definition of "easy".


> It whiffed multiple common softballs

I must have missed this in the article. Where was this?


...It's in the comment you were responding to. Directly above the section you quoted.


I understand that, but it wasn't clear to me where those examples came from.


It's pretty obvious from the whole comment that they're his own experience. Are you going anywhere with this or are you just saying things?


It provided the wrong file-types for some files, so I cannot rely on its output to be correct.

If you wanted to, for example, use this tool to route different files to different format-specific handlers, it would sometimes send files to the wrong handlers.


Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.


Nobody's asking for perfection. But the AI is offering inexplicable and obvious nondeterministic mistakes that the traditional algorithms don't suffer from.

Magika goes wrong and your fonts become audio files and nobody knows why. Magic goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.


Seconding this.

Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
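In code, that ordering is about as boring as it sounds. A crude sketch (Python; the ML fallback is a hypothetical stand-in, not Magika's actual API):

    MAGIC = {b"GIF87a": "image/gif", b"GIF89a": "image/gif",
             b"\x89PNG\r\n\x1a\n": "image/png", b"%PDF-": "application/pdf"}

    def ml_classify(data: bytes) -> str:
        return "unknown"          # stand-in for a model-based second pass

    def detect(data: bytes) -> str:
        # First pass: cheap, predictable signature checks.
        for sig, mime in MAGIC.items():
            if data.startswith(sig):
                return mime
        # Only fall back to the classifier when the signatures say nothing.
        return ml_classify(data)

    print(detect(b"GIF89a" + b"\x00" * 10))   # image/gif, no model needed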


Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.

So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. Still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ) which could legitimately appear by accident at the beginning of a plain text file, for example.


Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.


Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Addition and multiplication of floats is neither commutative nor associative - you may get different results depending on the order in which you add/multiply numbers.
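Non-associativity, at least, is a one-liner to demonstrate:

    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c == a + (b + c))   # False: 0.6000000000000001 vs 0.6
    # Summation order changes the rounding, which is why parallel reductions can vary.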


Addition and multiplication of floats are commutative.


> It would seem surprising for there to be anything non-deterministic about an ML model like this

I think there may be some confusion of ideas going on here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.



