Wuffs’ PNG image decoder is memory-safe but can also clock between 1.22x and 2.75x faster than libpng, the widely used open source C implementation. It’s also faster than the libspng, lodepng and stb_image C libraries, as well as the most popular Go and Rust PNG libraries. The high performance comes from SIMD acceleration, 8-byte-wide input and output copies when bit-twiddling, and zlib-decompressing the entire image all-at-once (into 1 large intermediate buffer) instead of 1 row at a time (into smaller, reusable buffers). All-at-once requires more intermediate memory but allows substantially more of the image to be decoded in the zlib decompressor’s fastest code paths.
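A minimal Python sketch of the two buffering strategies (an illustration of the idea only, not Wuffs’ actual implementation, which is SIMD-accelerated and transpiled to C):

```python
import zlib

def decode_rows_one_at_a_time(idat, row_size, num_rows):
    """Row-at-a-time: cap the decompressor's output at one row per call and
    reuse a small buffer. Every call re-enters the decompressor, which then
    spends more time outside its fastest bulk-output code paths."""
    d = zlib.decompressobj()
    data = idat
    rows = []
    for _ in range(num_rows):
        row = d.decompress(data, row_size)  # at most one row of output
        data = d.unconsumed_tail            # feed back whatever wasn't consumed
        rows.append(row)                    # a real decoder would unfilter here
    return rows

def decode_rows_all_at_once(idat, row_size, num_rows):
    """All-at-once: inflate the whole IDAT stream into one large intermediate
    buffer in a single call, then slice it into rows afterwards."""
    buf = zlib.decompress(idat)
    return [buf[i * row_size:(i + 1) * row_size] for i in range(num_rows)]
```

(For a real PNG, row_size would be 1 + width * bytes_per_pixel to account for the per-row filter byte.)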
Android lags by years
2020’s high-end Androids sport the single-core performance of an iPhone 8, a phone released in Q3’17
mid-priced Androids were slightly faster than 2014’s iPhone 6
low-end Androids have finally caught up to the iPhone 5 from 2012
You’re reading that right: single-core Android performance at the low end is both shockingly bad and dispiritingly stagnant.

GPU price per FLOPS
In recent years, GPU prices have fallen at rates that would yield an order of magnitude over roughly:
17 years for single-precision FLOPS
10 years for half-precision FLOPS
5 years for half-precision fused multiply-add FLOPS
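Back-of-the-envelope: "an order of magnitude over N years" converts to an implied annual price decline of 1 - 10^(-1/N). A quick check of the three figures above:

```python
# Convert "10x cheaper over N years" into an implied annual rate of decline.
for label, years in [("single-precision FLOPS", 17),
                     ("half-precision FLOPS", 10),
                     ("half-precision fused multiply-add FLOPS", 5)]:
    annual = 1 - 10 ** (-1 / years)
    print(f"{label}: ~{annual:.0%} cheaper per year")

# single-precision: ~13%/year, half-precision: ~21%/year, half-precision FMA: ~37%/year
```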
Against key-value stores
Fast key-value stores: an idea whose time has come and gone
In ProtoCache, replacing the RInK (remote in-memory key-value store) with stateful application servers resulted in a 29-57% median latency improvement and a 40%+ reduction in CPU.
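The core argument, sketched minimally in Python (hypothetical names, not the paper’s code): a stateless server that round-trips every request through a remote in-memory store pays a network hop plus (de)serialization on each access, while a stateful application server keeps the hot object in its own heap.

```python
import json

class RinkBackedServer:
    """Stateless app server: every request fetches, deserializes, mutates,
    re-serializes and writes back state via a remote key-value store."""
    def __init__(self, kv_client):
        self.kv = kv_client  # assumed get/set interface, e.g. a memcached or Redis client

    def handle(self, user_id, update):
        blob = self.kv.get(user_id)               # network hop
        state = json.loads(blob) if blob else {}  # deserialize
        state.update(update)
        self.kv.set(user_id, json.dumps(state))   # serialize + network hop
        return state

class StatefulServer:
    """Stateful app server: requests for a given user are routed to the server
    that already holds that user's state, so the hot path is an in-memory
    lookup with no network round trip and no (de)serialization."""
    def __init__(self):
        self.state = {}  # lives in the server's own heap

    def handle(self, user_id, update):
        s = self.state.setdefault(user_id, {})
        s.update(update)
        return s
```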
Datacenter performance
How do you know how well your large Kubernetes cluster is performing? Is a particular change worth deploying? Can you quantify the ROI? To do that, you’re going to need a performance metric that spans the whole warehouse-scale computer (WSC). Not so easy! The WSC may be running thousands of distinct jobs all sharing the same underlying resources. Developing a load-testing benchmark workload to accurately model this is ‘practically impossible.’ Therefore, we need a method that lets us evaluate performance in a live production environment. Google’s answer is the Warehouse Scale performance Meter (WSMeter), “a methodology to efficiently and accurately evaluate a WSC’s performance using a live production environment.” At WSC scale, even small improvements can translate into considerable cost reductions. WSMeter’s low-risk, low-cost approach encourages more aggressive evaluation of potential new features.
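As a hedged sketch of what such a WSC-wide metric can look like (my reading of the approach, not the paper’s exact formula): a weighted average of per-job performance over a representative sample of live jobs, with weights proportional to each job’s share of the cluster’s resources.

```python
def wsc_performance(jobs):
    """jobs: list of (perf, weight) pairs, where perf is a job-level performance
    number measured in production (e.g. normalized throughput) and weight is that
    job's share of cluster resources (e.g. CPU quota). Returns the weighted average."""
    total_weight = sum(w for _, w in jobs)
    return sum(p * w for p, w in jobs) / total_weight

# Evaluating a candidate change: deploy it to a small slice of the fleet,
# compute the metric for baseline and candidate, and compare.
baseline  = wsc_performance([(1.00, 40), (1.00, 25), (1.00, 35)])
candidate = wsc_performance([(1.03, 40), (0.99, 25), (1.01, 35)])
print(f"estimated WSC-wide change: {candidate / baseline - 1:+.2%}")  # +1.30%
```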
Reducing data movements
Our evaluation shows that offloading simple functions from these consumer workloads to processing-in-memory logic, consisting of either simple cores or specialized accelerators, reduces system energy consumption by 55.4% and execution time by 54.2%, on average across all of our workloads.
Computer latency 1977-2017
Compared to a modern computer that’s not the latest ipad pro, the apple 2 has significant advantages on both the input and the output, and it also has an advantage between the input and the output for all but the most carefully written code since the apple 2 doesn’t have to deal with context switches, buffers involved in handoffs between different processes, etc.
Facebook Lite
We rolled out Facebook Lite, our version of Facebook for Android built for emerging markets, in June of 2015. The app has hit 100M monthly active users, making it the fastest-growing version of Facebook: it reached 100M users in under 9 months. Its APK is less than 1 MB in size, meaning people can download it in seconds on slow connections.
To reach the APK size target, the Lite APK doesn’t have the product code and resources found in a typical Android app. The Lite client is a simple VM that provides various capabilities to interact with the OS (such as read a file, open the camera, create an SQLite database, and so on) and a rendering engine to drive the Android UI. Product code is written on the server and is expressed in terms of the capabilities the client has. Resources are sent down from the server as needed and cached. So it has infinite scalability for building additional product without bloating the APK.
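A hypothetical sketch of the thin-client split described above (the command names and wire format are invented, not Facebook Lite’s actual protocol): the client only registers capabilities and a dispatcher, and server-side product code drives it by sending commands expressed in terms of those capabilities.

```python
# Invented example of a "thin client + server-side product code" design.
CLIENT_CAPABILITIES = {}

def capability(name):
    """Register a client capability (file access, camera, SQLite, rendering, ...)."""
    def register(fn):
        CLIENT_CAPABILITIES[name] = fn
        return fn
    return register

@capability("render_text")
def render_text(args):
    print(f"[ui] {args['text']}")

@capability("cache_resource")
def cache_resource(args):
    print(f"[cache] storing {args['url']}")

def run_client(server_commands):
    """The client is just a dispatcher; product logic lives on the server,
    which sends commands expressed in terms of the client's capabilities."""
    for cmd in server_commands:
        CLIENT_CAPABILITIES[cmd["op"]](cmd["args"])

# What one server response for a screen might look like:
run_client([
    {"op": "cache_resource", "args": {"url": "https://example.com/icon.png"}},
    {"op": "render_text", "args": {"text": "Rendered by server-side product code"}},
])
```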
Networks are 100x slower than optimal
In principle, a network can transfer data at nearly the speed of light. Today’s Internet, however, is much slower: our measurements show that latencies are typically more than one, and often more than two, orders of magnitude larger than the lower bound implied by the speed of light.
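For a sense of that lower bound: light in vacuum covers roughly 300 km per millisecond, so the best-case round trip for a given great-circle distance is easy to bound (the distances below are rough, for illustration only).

```python
C_KM_PER_MS = 299_792.458 / 1000  # speed of light in vacuum, in km per millisecond

def c_bound_rtt_ms(distance_km):
    """Round-trip time if the path ran dead-straight at the speed of light in vacuum."""
    return 2 * distance_km / C_KM_PER_MS

# Approximate great-circle distances, for illustration only.
for route, km in [("New York - London", 5_570),
                  ("San Francisco - Tokyo", 8_280)]:
    print(f"{route}: c-bound RTT ~{c_bound_rtt_ms(km):.0f} ms")  # ~37 ms and ~55 ms

# Measured Internet RTTs on such routes are typically a few times higher (fiber is
# slower than c, and paths aren't straight), and full page fetches can be one to two
# orders of magnitude above the c-bound, which is the gap the paper is pointing at.
```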
Bloated web
Facebook has put everyone else on notice. Your content better load fast or you’re screwed. Publication websites have become an absolutely bloated mess. They range from beautiful (The Verge) to atrocious (Bloomberg) to unusable (Forbes). The common denominator: they’re all way too slow. Instant karma’s gonna get them.
This is why I have JavaScript off by default, and only allowlist maybe 10 sites. It avoids all those stupid “widgets” that these sites love so much.
The price of efficiency for advertisers is the user experience of the reader. The problem for publishers, though, is that dollars and cents, which come from advertisers, are a far more scarce resource than are page views, leaving publishers with a binary choice: provide a great user experience and go out of business, or muddle along with all of the baggage that relying on advertising networks entails.