Saturday, February 25, 2017

Farewell to SHA-1 and hello to TenFourFox FPR1

The long-in-the-tooth hashing algorithm SHA-1 has finally been broken by its first collision attack (and the collision apparently briefly broke the WebKit repository, too). SHA-1 SSL/TLS certificates have been considered dangerous for some time, which is why the Web has largely moved on to SHA-256 certificates and why I updated Classilla to support them.

Now that it is within the realm of practical plausibility to actually falsify site certificates signed with SHA-1 and deploy them in the real world, it's time to cut that monkey loose. Mozilla is embarking on a plan of increasing deprecation, but Google is dumping SHA-1 certificates entirely in Chrome 56, in which they will become untrusted. Ordinarily we would follow Mozilla's lead here with TenFourFox, but 45ESR doesn't have the more gradated SHA-1 handling they're presently using.

So, we're going to go with Chrome's approach and shut the whole sucker down. The end of 45ESR (and the end of TenFourFox source parity) will occur on June 13 with the release of Firefox 54 (52.2 ESR); the last scheduled official 45ESR is 45.9, on April 18. For that first feature parity release of TenFourFox in June after 45.9, security.pki.sha1_enforcement_level will be changed from the default 0 (permit) to 1 (deny) and all SHA-1 certificates will become untrusted. You can change it back if you have a server you must connect to that uses a SHA-1 certificate to secure it, but this will not be the default, and it will not be a supported configuration.

You can try this today. In fact, please do switch over now and see what breaks; just go into about:config, find security.pki.sha1_enforcement_level and set it to 1. The good news is that Mozilla's metrics indicate few public-facing sites should be affected by this any more. In fact, I can't easily find a server to test this with; I've been running with this setting switched over for the past day or so with nothing gone wrong. If you hit this on a site, you might want to let them know as well, because it won't be just TenFourFox that will refuse to connect to it very soon. TBH, it would have to cause problems on a major, major site for me not to implement this change because of the urgency of the issue, but I want to give our users enough time to poke around and make sure they won't suddenly be broken with that release.
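If you keep a user.js file in your profile, the same change can be made there; this is just a sketch using the standard Mozilla preference syntax, and about:config works equally well:

user_pref("security.pki.sha1_enforcement_level", 1); // 1 = deny SHA-1 certificates, 0 = default (permit)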

That release, by the way, will be the end of TenFourFox 45 and the beginning of the FPR (Feature Parity Release) series, starting with FPR1. Internally the browser will still be a fork of Gecko 45 and addons that ask its version will get 45.10 for FPR1 and 45.11 for FPR2 and so on, but the FPR release number will now appear in the user agent string as well as the 45.x version number, and the About box and site branding will now reference the current FPR number instead. The FPR will not necessarily be bumped every release: if it's a security update only and there are no new major features, it will get an SPR bump instead (Security Parity Release). FPR1 (SPR2) would be equivalent, then, to an internal version of 45.10.2.

Why drop the 45 branding? Frankly, because we won't really be 45.* Like Classilla, where I backported later changes into its Mozilla 1.3.1 core, I already have a list of things to add to the TenFourFox 45 core that will improve JavaScript ES6 compatibility and enable additional HTML5 features, which will make the browser more advanced. Features post-52ESR will be harder to backport as Mozilla moves more towards Rust code descended from Servo and away from traditional C++ and XPCOM, but there is a lot we can still make work, and unlike Classilla we won't be already six years behind when we get started.

The other thing we need to start thinking about is addons. There is no XUL, only WebExtensions, according to Mountain View, and moreover we'll be an entire ESR behind even if Mozilla ends up keeping legacy XUL addons around. While I don't really want to be in the business of maintaining these addons, we may want to archive a set of popular ones so that we don't need to depend on AMO. Bluhell Firewall, uBlock, Download YouTube Videos as MP4 and OverbiteFF are already on my list, and we will continue to host and support the QuickTime Enabler. What other ones of general appeal would people like added to our repository? Are there any popular themes or Personas we should keep copies of? (No promises; the final decision is mine.)

The big picture is still very bright overall. We'll have had almost seven years of source parity by June, and I think I can be justifiably proud of that. Even the very last Quad G5 to roll off the assembly line will have had almost eleven years of current Firefox support and we should still have several more years of practical utility in TenFourFox yet. So, goodbye SHA-1, but hello FPR1. The future beckons.

(* The real reason: FPR and SPR are absolutely hilarious PowerPC assembly language puns that I could backronym. I haven't figured out what I can use for GPR yet.)

Tuesday, February 21, 2017

45.8.0b1 available with user agent switching (plus: Chrome on PowerPC?!)

45.8.0b1 is available (downloads, hashes). This was supposed to be in your hands last week, but a prolonged brownout (68VAC on the mains! thanks, So Cal Edison!) finally pushed the G4 file server over the edge and ruined various other devices, and was followed by two extended blackouts; the G5 was left completely off all week and I did work on my iBook G4 and i7 MBA until I was convinced I wasn't going to regret powering the Quad back on again. The other reason for the delay is that there are actually a lot of internal changes in this release: besides including the work so far on Operation Short Change (there is still a fair bit more to do, but this current snapshot improves benchmarks anywhere from four to eight percent depending on workload), it also has numerous minor bug fixes imported from later versions of Firefox and continues removing telemetry from UI-facing code to improve browser responsiveness. I also smoked out and fixed many variances from the JavaScript conformance test suite, mostly minor; this now means we are fully compliant, and some classes of glitchy bugs should just disappear.

But there are two other big changes in this release. The first is to font support. While I was updating the ATSUI font blacklist for Apple's newest crock of crap, I noticed the fallback font (Helvetica Neue) appeared strangely spaced, with as much as a sixth of the page between words. This looks like it's due to the 10.4-compatible method we use for getting native font metrics; there appears to be some weird edge case in CGFontGetGlyphAdvances() that occasionally yields preposterous values, meaning we've pretty much shipped this problem since the very beginning of TenFourFox and no one either noticed or reported it. It's not clear to me why (or if) other glyphs are unaffected, but in the meantime the workaround is to check whether the space glyph's horizontal advance value is redonkulous and, if it is (and the font metrics are not consistent with a non-proportional font), heuristically generate a more reasonable size. In the process I also tuned up font handling a little bit, so text run generation should be a tiny bit faster as well.

The second is this:

As promised, user agent switching is now built into TenFourFox; no extension is required (and you should disable any you have installed, or they may conflict). I chose five alternatives which have been useful to me in the past for sites that balk at TenFourFox (including cloaking your Power Mac as Intel Firefox) or for lower-powered systems, including the default Classilla user agent. I thought about allowing an arbitrary string, but frankly you could do that in about:config if you really wanted, and it would have been somewhat more bookwork. However, I'm willing to entertain additional suggestions for user agent strings that work well on a variety of sites. For fun, open Google in one tab and the preferences pane in another, change the agent to a variety of settings, and watch what happens to Google as you reload it.

Note that the user agent switching feature is not intended to hide you're using TenFourFox; it's just for stupid sites that don't properly feature-sniff the browser (Chase Bank, I'm looking at yoooooouuuuu). The TenFourFox start page can still determine you're using TenFourFox for the purpose of checking if your browser is out of date by looking at navigator.isTenFourFox, which I added to 45.8 (it will only be defined in TenFourFox; the value is undefined in other browsers). And yes, there really are some sites that actually check for TenFourFox and serve specialized content, so this will preserve that functionality once they migrate to checking for the browser this way.
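For site authors who want to take advantage of this, a minimal feature check might look something like the sketch below (the property is as described above; the two handler functions are hypothetical stand-ins for whatever the site actually does):

if (typeof navigator.isTenFourFox !== "undefined") {
    // TenFourFox 45.8 or later: serve the Power Mac-friendly content.
    serveTenFourFoxContent();   // hypothetical site-specific function
} else {
    // Any other browser (or an older TenFourFox): serve the usual content.
    serveDefaultContent();      // hypothetical site-specific function
}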

If you notice font rendering has gone awry in this version, please compare it with 45.7 before reporting it. Even if the bug is actually old I'd still like to fix it, but new regressions are more serious than something I've been shipping for a while. Similarly, please advise on user agent strings you think are useful, and I will consider them (no promises). 45.8 is scheduled for release on March 7.

Finally, last but not least, who says you can't run Google Chrome on a Power Mac?

Okay, okay, this is the PowerPC equivalent of a stupid pet trick, but this is qemu 2.2.1 (the last version that did not require thread-local storage, which won't work with OS X 10.6 and earlier) running on the Quad as a 64-bit PC, booting Instant Web Kiosk into Chromium 53. It is incredibly, crust-of-the-earth-coolingly slow, and that nasty blue tint is an endian problem in the emulator I need to smoke out. But it actually can browse websites (eventually, with patience, after a fashion) and do web-browsery things, and I was even able to hack the emulator to run on 10.4, a target this version of qemu doesn't support.

So what's the point of running a ridiculously sluggish emulated web browser, particularly one I can't stand for philosophical reasons? Well, there's gotta be something I can work on "After TenFourFox"™ ...

Jokes aside, even though I see some optimization possibilities in the machine code qemu's code generator emits, those are likely to yield only marginal improvements. Emulation will obviously never be anywhere near as fast as native code. If I end up doing anything with this idea of emulation for future browser support, it would likely only be for high-end G5 systems, and it may never be practical even then. But, hey, Google, guess what: it only took seven years to get Chrome running on a Tiger Power Mac.

Saturday, February 4, 2017

An experiment: introducing the PopOutPlayer (plus: microbenchmarking PowerPC SPRs)

TenFourFox 45.8 is coming along -- Operation Short Change (my initiative to rework IonMonkey to properly emit single branch instructions instead of entire branch stanzas when the target is guaranteed to be in range) is smoothing out the browser quite a bit, but there are hundreds of places that need to be changed. I've finished the macroassembler and the regular expression engine (and improved both of them in a couple of other minor ways), and now I'm working on the Ion code generator. This mostly benefits systems that cannot fit generated code into cache, so while the Quad G5 with its (comparatively large) 1MB of L2 per core has only improved on the order of 5%, I expect more on low-end systems because hot code is now more likely to stay in the i-cache. I am also planning to put user agent switching support in 45.8, and there are a few other custodial improvements. The idea is to release that to beta testing around Valentine's Day so I don't have to buy you lot any flowers.

A long while back I mentioned a couple secret projects; one of them was SandboxSafari, and the other was intended as a followup to the now obsolete MacTubes Enabler. That second project got backburnered for awhile, but since I'm noticing changes that may obsolete the work I've done on it as well, I'm going to push it out the door in its current state. Without further ado, let's introduce the PopOutPlayer.

This is not Photoshop. There is, just as you see, a floating video window playing the Vimeo movie that TenFourFox itself cannot play (no built-in H.264 codec). In fact, I think this is the only way you can play Vimeo movies on Tiger now -- neither Safari nor OmniWeb works anymore.

Yes, it works fine with YouTube also:

Drop the application in your apps folder, drop the PopOutPlayer Enabler addon on TenFourFox, and then on any Vimeo (10.4 only) or YouTube (10.4 or 10.5) page or embedded video, right-click or SHIFT-right-click and select Pop Out Video. The video floats in its own window on top of other windows. You can close the browser tab and go look at something else while the video plays; the playback is a completely independent process. When you open another video, the window pops up in the same location where you left it. On multiprocessor Power Macs it can even get scheduled on another core for smoother playback.

Sounds great, right? Well, there are a few reasons why I hadn't released this earlier.

First, the app itself is, and I'm actually feeling ill writing this, based on these sites' Flash players. Flash was the only reliable way to do playback and was even more performant than native H.264 in WebKit, and it also had substantially fewer user interface problems at the risk of sometimes being crashy. Flash is no safer now than it was before I condemned it from the browser, but I've learned a few lessons from SandboxSafari, of which the PopOutPlayer is actually a descendant. To remember its window location means we have to run the app with your uid (which eliminates SandboxSafari's primary means of protection), but we also have the advantage of only needing to support at most two specific video player applets, so we can design a very restrictive environment that protects the app from being subverted and rejects running video or Flash applets from other sources.

This different type of sandbox implements other restrictions, most notably preventing the applet from going full-screen. This is necessity reborn as virtue because most of our systems would not do well with full-screen playback, let alone HD (which is also blocked/unsupported), and prevents a subverted player from monkeying with the rest of your screen. The sandbox doesn't let URLs open from the applet either, and has its own Mach exception handler and CGEventTap to filter other possible avenues of exploit. However, that also means that you can't do many of the things you would expect to do if the Flash player were embedded in the browser. That won't change.

The window is fixed-size and floats. Allowing the window to resize caused lots of problems because the applets didn't expect to deal with that possibility. You get one size no matter how big your screen is, and you can only close it or move it.

Although the PopOutPlayer can play more YouTube videos than the QuickTime Enabler, there are still many it cannot play, though at least the Flash player will give you some explanation (the Vevo ones are the most notorious). The PopOutPlayer also isn't intended for generic HTML5 video playback; it doesn't replace the QTE in that sense, and the QTE is still the official video solution in general. The application will reject passing it URLs from other video sites.

The really big limitation, however, is that I could not get the Vimeo applet to run on 10.5 using the hacks I devised for 10.4. Leopard WebKit can play some Vimeo videos, so all is not lost, but no matter what I tried the PopOutPlayer simply wouldn't display any video itself. For the time being, Vimeo URLs on Leopard will generate an "Unsupported video URL" message. It is quite possible this might never be able to be fixed with the current method the PopOutPlayer uses for display, so don't expect it will necessarily be repaired in the future. For that matter, Vimeo on-demand doesn't even work with 10.4.

I consider the PopOutPlayer to be highly experimental, and when (not if) Vimeo and YouTube decommission their Flash players, it will abruptly cease to work without warning. But because I expect that time is coming sooner and not later you are welcome to use the PopOutPlayer for as long as it benefits you, and if I can solve some of these issues I might even make it a supported option in the future -- just don't hold your breath.

Download it here. It is unsupported. Source code is not currently available.

So back to TenFourFox. If you would, permit me now to indulge in some gratuitous nerderosity. Part of Operation Short Change was also to explore whether our branch stanza far calls could be made more efficient. Currently, if the target of a branch instruction exceeds the displacement the branch instruction (i.e., b(l) or bc(l)) can encode, we load the target into a general purpose register (GPR), transfer it to the counter register (a special purpose register, or SPR), and branch to that (i.e., lis/ori/mtctr/b(c)ctr(l)). The PowerPC ISA does not allow directly branching to a GPR or FPR, only to the counter register (CTR) and the link register (LR), which are both SPRs.
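For the record, a far call stanza looks something like this sketch (the target address and scratch register are arbitrary examples; the actual generated code varies):

        ; hypothetical far call to absolute address 0x12345678
        lis r12, 0x1234        ; load the upper 16 bits of the target into a GPR
        ori r12, r12, 0x5678   ; OR in the lower 16 bits
        mtctr r12              ; move the GPR into the counter register (an SPR)
        bctrl                  ; branch to the address in CTR, setting LR for the call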

This would be all well and good except that the G5 groups instructions together, and IBM warns that there is a substantial execution penalty if mtctr and b(c)ctr(l) are in the same dispatch group. Since mtspr instructions like mtctr must always lead dispatch groups, the above stanza is guaranteed to put them both together (recall instruction dispatch groups are no more than four, or five with a branch, with branches being the last slot). Is it faster to insert nops and accept the code bloat? What about using the link register instead?

It's time for ... assembly language microbenchmarking!

#define REPS 0x4000
_main:
        .globl _main

        mflr r0
        stwu r0, -4(r1)

        li r3,0
        lis r5,REPS
        bl .+4
        ; the location of the following mflr is now in r4
        mflr r4
        addi r4, r4, 8
        ; now r4 points to the addi below

        addi r3,r3,1
        cmp cr0,r3,r5
        beq done
#if USE_LR
        mtlr r4
#if USE_NOPS
        nop
        nop
        nop
        nop
#endif
        blr
#else
        mtctr r4
#if USE_NOPS
        nop
        nop
        nop
        nop
#endif
        bctr
#endif

done:
        lwz r0, 0(r1)
        mtlr r0
        li r3,0
        addi r1, r1, 4
        blr
This just runs a tight loop 1,073,741,824 times (lis shifts REPS left 16 bits, so the limit in r5 is 0x4000 << 16 = 0x40000000), branching to the loop header with either LR or CTR, and with the mtspr instruction either separated from the blr/bctr by enough nops to put them in separate dispatch groups or not (there must be four to keep the branch out of the same group's terminal branch slot). That gives us four variations to test with a loop so tight that the cost of the branch should substantially weigh on total runtime. Let's see what we get. If you're following along on your own Power Mac, compile these like so:
gcc -o lrctr_ctr lrctr.s
gcc -DUSE_NOPS -o lrctr_ctrn lrctr.s
gcc -DUSE_LR -o lrctr_lr lrctr.s
gcc -DUSE_LR -DUSE_NOPS -o lrctr_lrn lrctr.s
If you want to confirm what was actually assembled, you can look at the result with otool -tV.

Our control will be our trusty 1.0GHz iMac G4 (256K L2 cache). There should be no difference between the SPRs, and it all fits into cache and there are no dispatch groups, so if we did this right the runtimes should be nearly identical. In this case we are only interested in the user CPU time (the first field).

luxojr% time ./lrctr_ctr
7.550u 0.077s 0:08.88 85.8%     0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_ctrn
7.550u 0.071s 0:08.55 89.1%     0+0k 0+0io 0pf+0w
luxojr% time ./lrctr_lr
7.550u 0.070s 0:10.08 75.5%     0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_lrn
7.550u 0.069s 0:08.43 90.2%     0+0k 0+0io 0pf+0w
Excellent. Let's run it on the Quad G5 (numbers in reduced performance mode).
bruce% time ./lrctr_ctr
4.298u 0.028s 0:04.33 99.5%     0+0k 0+2io 0pf+0w
bruce% time ./lrctr_ctrn
4.986u 0.035s 0:05.03 99.6%     0+0k 0+2io 0pf+0w
bruce% time ./lrctr_lr
13.755u 0.050s 0:13.82 99.8%    0+0k 0+1io 0pf+0w
bruce% time ./lrctr_lrn
13.752u 0.048s 0:13.82 99.7%    0+0k 0+2io 0pf+0w
Wait, what? Putting mtctr and bctr in the same dispatch group was actually the fastest of these four variations. Not only was using LR slower, it was over three times slower. Even spacing the two CTR instructions apart was marginally worse. Just to see if it was an artifact of throttling, I ran them again in highest performance. Same thing:
bruce% time ./lrctr_ctr
2.149u 0.012s 0:02.16 99.5%     0+0k 0+1io 0pf+0w
bruce% time ./lrctr_ctrn
2.492u 0.010s 0:02.50 100.0%    0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lr
6.876u 0.013s 0:06.89 99.8%     0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lrn
6.874u 0.017s 0:06.89 99.8%     0+0k 0+1io 0pf+0w

I found this so surprising I rewrote it for AIX and put it on my POWER6, which is also dispatch-group based and uses an evolved version of the same instruction pipeline as the POWER4 (from which the G5 is derived). And, well ...

uppsala% time ./lrctr_ctr
3.752u 0.001s 0:04.41 85.0%     0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_ctrn
4.064u 0.001s 0:04.94 82.1%     0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lr
13.499u 0.001s 0:16.19 83.3%    0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lrn
13.215u 0.001s 0:15.66 84.3%    0+1k 0+0io 0pf+0w
When reading these execution times, keep in mind that I run this POWER6 throttled in ASMI to reduce power consumption, and that it has its own microarchitectural differences, but the same relative run times hold. It's actually not faster to space the mtctr and the branch apart (that is, if there's nothing better you could be doing -- see below), and the CTR is the best SPR to use for branching on the G5 regardless of any penalty paid. It may well be that using LR for a non-return branch fouls the CPU's link stack (its return-address predictor) and thus tanks runtime, but whatever the explanation, using it for far calls is clearly the worst option.

Now, as you may have guessed, I've deliberately presented a false choice here because all four of these options are patently pathological. The optimal instruction sequence would be to schedule some work between the mtctr and the bctr. We don't have much work to, uh, work with here, but here's one way.

#define REPS 0x4000
_main:
        .globl _main

        mflr r0
        stwu r0, -4(r1)

        li r3,0
        lis r5,REPS
        bl .+4
        ; the location of the following mflr is now in r4
        mflr r4
        addi r4, r4, 8
        ; now r4 points to the mtctr below

        mtctr r4
        addi r3,r3,1
        cmp cr0,r3,r5
        beq done
        bctr

done:
        lwz r0, 0(r1)
        mtlr r0
        li r3,0
        addi r1, r1, 4
        blr

bruce% gcc -o ctrop ctrop.s
bruce% time ./ctrop
4.299u 0.028s 0:04.34 99.3%     0+0k 0+1io 0pf+0w
Almost identical runtimes (in reduced mode), but the beq takes the branch slot away from the bctr, guaranteeing the mtctr and the bctr land in separate dispatch groups without having to space them out with nops. But inexplicably, if you coalesce the beq/bctr into a single bnectr (which does occupy the branch slot of the same dispatch group as the mtctr), it's even faster:
bruce% time ./ctrop2
3.824u 0.039s 0:04.02 95.7%     0+0k 0+0io 0pf+0w
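I haven't shown ctrop2 here, but based on that description its loop body presumably collapses to something like this sketch:

        mtctr r4                ; loop target (the mtctr itself) into the counter register
        addi r3,r3,1            ; increment the counter
        cmp cr0,r3,r5           ; compare against the limit
        bnectr                  ; not done? branch back through CTR; otherwise fall through to done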
Is this optimal on the G4 and POWER6 as well?
luxojr% time ./ctrop
6.472u 0.064s 0:12.05 54.1%     0+0k 0+2io 0pf+0w
luxojr% time ./ctrop2
6.471u 0.063s 0:09.33 69.9%     0+0k 0+0io 0pf+0w

uppsala% time ./ctrop
3.918u 0.001s 0:04.61 84.8%     0+1k 0+0io 0pf+0w
uppsala% time ./ctrop2
2.612u 0.001s 0:03.09 84.4%     0+1k 0+0io 0pf+0w
Yup, it is, though it's worth noting the G4 did not improve with the bnectr. (This is still pathological, mind you: the best instruction sequence would be simply addi/cmp/bne, which the G5 in reduced mode runs in 2.580u 0.029s and the POWER6 in 1.261u 0.001s, reclaiming its speed crown. But when you have a far call, you don't have a choice.)
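For reference, the addi/cmp/bne sequence in that parenthetical reduces the loop body to something like this sketch (with a local label standing in for the short relative displacement):

loop:
        addi r3,r3,1            ; increment the counter
        cmp cr0,r3,r5           ; compare against the limit
        bne loop                ; a single short branch; no GPR or SPR juggling needed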

The moral of the story? Don't fix what ain't broke.