Hacker News

> I have never seen any kind of technical documentation published in any other format than PDF that is comfortable for reading and searching, even when that is done on a mobile phone.

Can you provide an example of what you mean? My experience is the polar opposite.



I refer to something like a 3000 page manual of some microcontroller, or the datasheets of some integrated circuits or the specifications of some Arm architecture variant, or the standards for some programming language, e.g. C++ or System Verilog.

These are concrete examples of documents that I might have read during some flights or when waiting for some flight, on a smartphone.

When reading something like a fiction novel, reflowing the text based on the window width may be acceptable.

On the other hand, navigating a huge document, half of which consists of tables, figures, diagrams, schematics and graphics, is extremely painful when it is in HTML format, because the layout changes based on the device and window used and there is no means to jump quickly, e.g. to page 1436, then to page 2117. When zoom, pan and scroll are correctly implemented, which unfortunately seldom happens, they are much less distracting than the random changes in page layout caused by browser rendering.

I strongly dislike it when a company provides only Web documentation that is hard to navigate, instead of also providing a PDF manual.

Web documentation may be acceptable for very small documents, but not for most of the current technical documentation, where many thousands of pages for a manual are common.

Perhaps an EPUB format extended with everything necessary to completely describe a fixed page layout might become competitive with PDF, but I will have to see an example to believe it.

For now, whenever I see a book or any other document both in PDF and in EPUB formats, I always choose the PDF variant, because without exception it provides a better quality of the rendered pages.


I accept your points and agree that the kind of documentation you're thinking about sounds like a poor use case for HTML/EPUB. I do not regularly encounter this sort of documentation.

I've been boosting the idea in the OP, but more for things like "your local council's meeting minutes" or "your English class assignment" or "a research paper".

Though I do want to point out that even moderately complex specs, when designed for the web, can work well. For example, the HTML spec doesn't reference page numbers, but has extensive internal hyperlinking: https://html.spec.whatwg.org/

> Perhaps an EPUB format extended with everything necessary to completely describe a fixed page layout might become competitive with PDF

I highly doubt this will ever happen, for use cases which require fixed layout. But there are plenty of use cases where fixed layout is unnecessary and inferior.


I work with the same type of documents regularly, and I’d give up both exact referencing and stable rendering in a heartbeat in exchange for something reflowable that I can reliably search in and copy paste from.


The PDF documents allow reliable search and copy/paste, but unfortunately only when the author of the document has taken care to ensure this. Nevertheless, this usually happens automatically when the PDF has been created by exporting a document created with some Office suite, unless the author has changed the default options to forbid these features.

Even many of the PDFs created by scanning printed documents allow reasonably reliable search/copy/paste, if they have been processed with OCR.


> The PDF documents allow reliable search and copy/paste

Are you sure about that? As far as I understand, extracting text from an ultimately vector-graphics-like PDF heavily depends on OCR-like heuristics on the PDF consumer's side.

The ToUnicode mapping table can help with the glyph-to-codepoint mapping aspect of this, but figuring out the difference between the gap between two letters and two words seems hard.

I've seen bothtypesofissues mentioned in the following article i n t h e p a s t, including in a specification document I use multiple times per day for my job:

https://web.archive.org/web/20220328102205/https://filingdb....
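
To make the word-gap problem above concrete, here is a minimal sketch of the kind of heuristic a PDF consumer might apply. It assumes each glyph has already been mapped to Unicode (e.g. through a ToUnicode table) and comes with its horizontal extent. The `Glyph` class, the `join_glyphs` function and the `gap_ratio` threshold are hypothetical names chosen for illustration, not taken from any real PDF library.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str   # Unicode text for the glyph (assumed already resolved)
    x0: float   # left edge on the page
    x1: float   # right edge on the page


def join_glyphs(glyphs, gap_ratio=0.25):
    """Rebuild one line of text, guessing a space wherever the gap is wide.

    gap_ratio is an assumed tuning knob: a gap wider than this fraction
    of the average glyph width is treated as a word break.
    """
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda g: g.x0)
    avg_width = sum(g.x1 - g.x0 for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        if cur.x0 - prev.x1 > gap_ratio * avg_width:
            out.append(" ")  # gap looks like a word boundary
        out.append(cur.char)
    return "".join(out)


# Normal spacing: small gaps inside words, one wide gap between them.
normal = [Glyph("a", 0, 10), Glyph("b", 11, 21), Glyph("c", 30, 40), Glyph("d", 41, 51)]
# Letter-spaced text: every gap is wide enough to look like a space.
tracked = [Glyph("a", 0, 10), Glyph("b", 14, 24), Glyph("c", 28, 38), Glyph("d", 42, 52)]
# Tightly set text: no gap is wide enough, so the words run together.
tight = [Glyph("a", 0, 10), Glyph("b", 11, 21), Glyph("c", 22, 32), Glyph("d", 33, 43)]
```

With the default threshold, `join_glyphs(normal)` gives `"ab cd"`, `join_glyphs(tracked)` gives `"a b c d"` and `join_glyphs(tight)` gives `"abcd"`. The last two are exactly the two failure modes: no single fixed threshold is right for every document.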


I did not look at the details of the PDF specification, but I have heard that there are indeed many cases that can confuse a PDF reader which wants to find or copy a text string.

Nevertheless, I have used search and copy/paste in PDF documents many times a day for many years without any problem. I usually prefer to use mupdf as the PDF reader, because it is very fast (it also works better as an EPUB reader than the other EPUB readers that I have tried), but there are some rarely encountered PDF files that mupdf cannot parse, in which case I fall back to other PDF readers, e.g. okular.

The only case that I encounter when search/copy/paste does not work is in scanned books that have not been OCR'ed, so they contain only bitmap images of the pages, without text.

The problems mentioned at your link are caused mostly by the PDF specification being too permissive, which allows abuses like using a non-standard character encoding coupled with the use of a non-standard font. However, this specific type of abuse could not be prevented by any specification without using some sort of AI to decide whether the glyph used for a character encoded as Unicode "A" is really a kind of "A".

Among the problems enumerated at your link, I have encountered a few times the case when there are thin spaces inserted between each letter of a string. In such a case it is annoying to remove those spaces after pasting the text in another document, but this is something that I have seen only very rarely.
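
For the rare letter-spaced case, the cleanup after pasting can be partly automated. The following is only a rough sketch: `collapse_letter_spacing` is a hypothetical helper, and it assumes the stray separators are ordinary spaces or one of a few Unicode thin/narrow spaces.

```python
import re

# Three or more single word characters, each separated by exactly one
# space-like character (space, thin space, hair space, narrow no-break space).
_SPACED_RUN = re.compile(r"\b(?:\w[ \u2009\u200a\u202f]){2,}\w\b")
_SEPARATORS = re.compile(r"[ \u2009\u200a\u202f]")


def collapse_letter_spacing(text):
    """Remove the separators inside runs of letter-spaced characters."""
    return _SPACED_RUN.sub(lambda m: _SEPARATORS.sub("", m.group(0)), text)
```

For example, `collapse_letter_spacing("seen i n t h e p a s t, including")` returns `"seen inthepast, including"`. Note that the real word boundaries inside the run are lost, so this only reduces the manual work, it does not remove it.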


I find that the PDF format makes some technical manuals, like the Intel instruction set reference, harder to use than they should be (though it probably works great if printed out). It's often easier to use other websites for reference.

I totally see what you mean for circuits though!


And between PDF and EPUB, I always choose the EPUB variant, because on my laptop it definitely looks better, with the text the right size and sane pagination.

I don't jump to page 2112, I use the table of contents to jump to section 3.1.2, which is as fast if not faster.



