Making Camera Uploads for Android Faster and More Reliable
Camera uploads is a feature in our Android and iOS apps that automatically backs up a user's photos and videos from their mobile device to Dropbox. The feature was first introduced in 2012, and uploads millions of photos and videos for hundreds of thousands of users every day. People who use camera uploads are some of our most dedicated and engaged users. They care deeply about their photo libraries, and expect their backups to be quick and dependable every time. It's important that we offer a service they can trust.
Until recently, camera uploads was built on a C++ library shared between the Android and iOS Dropbox apps. This library served us well for a long time, uploading billions of images over many years. However, it had numerous problems. The shared code had grown polluted with complex platform-specific hacks that made it hard to understand and risky to modify. This risk was compounded by a lack of tooling support, and a shortage of in-house C++ expertise. Plus, after more than five years in production, the C++ implementation was beginning to show its age. It was unaware of platform-specific restrictions on background processes, had bugs that could delay uploads for long periods of time, and made outage recovery difficult and time-consuming.
In 2019, we decided that rewriting the feature was the best way to offer a reliable, trustworthy user experience for years to come. This time, the Android and iOS implementations would be separate and use platform-native languages (Kotlin and Swift respectively) and libraries (such as WorkManager and Room for Android). The implementations could then be optimized for each platform and evolve independently, without being constrained by design decisions from the other.
This post is about some of the design, validation, and release decisions we made while building the new camera uploads feature for Android, which we released to all users during the summer of 2021. The project shipped successfully, with no outages or major problems; error rates went down, and upload performance greatly improved. If you haven't already enabled camera uploads, you should try it out for yourself.
Designing for background reliability
The main value proposition of camera uploads is that it works silently in the background. For users who don't open the app for weeks or even months at a time, new photos should still upload promptly.
How does this work? When someone takes a new photo or modifies an existing photo, the OS notifies the Dropbox mobile app. A background worker we call the scanner carefully identifies all the photos (or videos) that haven't yet been uploaded to Dropbox and queues them for upload. Then another background worker, the uploader, batch uploads all the photos in the queue.
Uploading is a two-step process. First, like many Dropbox systems, we break the file into 4 MB blocks, compute the hash of each block, and upload each block to the server. Once all the file blocks are uploaded, we make a final commit request to the server with a list of all block hashes in the file. This creates a new file consisting of those blocks in the user's Camera Uploads folder. Photos and videos uploaded to this folder can then be accessed from any linked device.
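To make the flow concrete, here is a minimal Kotlin sketch of that two-step upload. The BlockServerClient interface, its method names, the upload path, and the use of SHA-256 block hashes are assumptions made for illustration; they are not our actual internal API.

import java.io.File
import java.security.MessageDigest

// Hypothetical client for the block server; method names are illustrative.
interface BlockServerClient {
    suspend fun uploadBlock(hash: String, data: ByteArray)
    suspend fun commitFile(path: String, blockHashes: List<String>)
}

const val BLOCK_SIZE = 4 * 1024 * 1024 // 4 MB blocks

suspend fun uploadFile(file: File, client: BlockServerClient) {
    val blockHashes = mutableListOf<String>()
    file.inputStream().use { input ->
        val buffer = ByteArray(BLOCK_SIZE)
        while (true) {
            val bytesRead = input.read(buffer)
            if (bytesRead <= 0) break
            val block = buffer.copyOf(bytesRead)
            // Hash each block, then upload it to the server.
            val hash = MessageDigest.getInstance("SHA-256")
                .digest(block)
                .joinToString("") { "%02x".format(it) }
            client.uploadBlock(hash, block)
            blockHashes += hash
        }
    }
    // After all blocks are uploaded, commit the full list of block hashes to
    // create the file in the user's Camera Uploads folder.
    client.commitFile("/Camera Uploads/${file.name}", blockHashes)
}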
One of our biggest challenges is that Android places strong constraints on how often apps can run in the background and what capabilities they have. For example, App Standby limits our background network access if the Dropbox app hasn't recently been foregrounded. This means we might only be allowed to access the network for a 10-minute interval once every 24 hours. These restrictions have grown more strict in recent versions of Android, and the cross-platform C++ version of camera uploads was not well-equipped to handle them. It would sometimes try to perform uploads that were doomed to fail because of a lack of network access, or fail to restart uploads during the system-provided window when network access became available.
Our rewrite does not escape these background restrictions; they still apply unless the user chooses to disable them in Android's system settings. However, we reduce delays as much as possible by taking maximum advantage of the network access we do receive. We use WorkManager to handle these background constraints for us, guaranteeing that uploads are attempted if, and only if, network access becomes available. Unlike our C++ implementation, we also do as much work as possible while offline (for example, by performing rudimentary checks on new photos for duplicates) before asking WorkManager to schedule us for network access.
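As an illustration of this scheduling pattern, here is a stripped-down example of asking WorkManager to run upload work only when the network is available. The UploadWorker class and the unique work name are placeholders for this sketch, not the app's real worker.

import android.content.Context
import androidx.work.Constraints
import androidx.work.CoroutineWorker
import androidx.work.ExistingWorkPolicy
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.WorkerParameters

class UploadWorker(context: Context, params: WorkerParameters) :
    CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        // Upload queued photos here; returning Result.retry() on a transient
        // failure lets WorkManager reschedule us under the same constraints.
        return Result.success()
    }
}

fun scheduleUploads(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiredNetworkType(NetworkType.CONNECTED)
        .build()
    val request = OneTimeWorkRequestBuilder<UploadWorker>()
        .setConstraints(constraints)
        .build()
    // Enqueue as unique work so repeated scheduling doesn't pile up workers.
    WorkManager.getInstance(context)
        .enqueueUniqueWork("camera-uploads", ExistingWorkPolicy.KEEP, request)
}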
To further optimize use of our limited network access, we also refined our handling of failed uploads. C++ camera uploads aggressively retried failed uploads an unlimited number of times. In the rewrite we added backoff intervals between retry attempts, and also tuned our retry behavior for different error categories. If an error is likely to be transient, we retry multiple times. If it's likely to be permanent, we don't bother retrying at all. As a result, we make fewer overall retry attempts, which limits network and battery usage, and users see fewer errors.
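The snippet below sketches what category-aware retries with backoff can look like. The categories, attempt limits, and backoff values are illustrative assumptions, not the tuned values we actually use.

import kotlinx.coroutines.delay

enum class ErrorCategory { TRANSIENT, PERMANENT }

suspend fun <T> retryByCategory(
    maxAttempts: Int = 5,
    initialBackoffMs: Long = 1_000,
    classify: (Throwable) -> ErrorCategory,
    block: suspend () -> T,
): T {
    var backoffMs = initialBackoffMs
    repeat(maxAttempts - 1) {
        try {
            return block()
        } catch (e: Exception) {
            // Permanent errors (for example, an unsupported file) are not retried.
            if (classify(e) == ErrorCategory.PERMANENT) throw e
            delay(backoffMs)
            backoffMs *= 2 // back off between attempts to save bandwidth and battery
        }
    }
    return block() // final attempt; let any remaining error propagate
}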
Designing for performance
Our users don't just expect camera uploads to work reliably. They also expect their photos to upload quickly, and without wasting system resources. We were able to make some big improvements here. For instance, first-time uploads of large photo libraries now finish up to four times faster. There are a few ways our new implementation achieves this.
Parallel uploads
First, we substantially improved performance by adding support for parallel uploads. The C++ version uploaded only one file at a time. Early in the rewrite, we collaborated with our iOS and backend infrastructure colleagues to design an updated commit endpoint with support for parallel uploads.
Once the server constraint was gone, Kotlin coroutines made it easy to run uploads concurrently. Although Kotlin Flows are typically processed sequentially, the available operators are flexible enough to serve as building blocks for powerful custom operators that support concurrent processing. These operators can be chained declaratively to produce code that's much simpler, and has less overhead, than the manual thread management that would've been necessary in C++.
val uploadResults = mediaUploadStore
    .getPendingUploads()
    .unorderedConcurrentMap(concurrentUploadCount) {
        mediaUploader.upload(it)
    }
    .takeUntil { it != UploadTaskResult.SUCCESS }
    .toList()
A simple example of a concurrent upload pipeline. unorderedConcurrentMap is a custom operator that combines the built-in flatMapMerge and transform operators.
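For reference, here is one way such an operator could be written on top of the standard Flow APIs. This is a sketch of the idea, not our exact operator; flatMapMerge does the concurrent fan-out and emits results as they complete.

import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flatMapMerge
import kotlinx.coroutines.flow.flow

@OptIn(ExperimentalCoroutinesApi::class)
fun <T, R> Flow<T>.unorderedConcurrentMap(
    concurrency: Int,
    block: suspend (T) -> R,
): Flow<R> =
    // Run up to `concurrency` transformations at once; results are emitted as
    // they finish, so output order may differ from input order.
    flatMapMerge(concurrency) { value -> flow { emit(block(value)) } }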
Optimizing memory use
After adding support for parallel uploads, we saw a large uptick in out-of-memory crashes from our early testers. A number of improvements were required to make parallel uploads stable enough for production.
First, we modified our uploader to dynamically vary the number of simultaneous uploads based on the amount of available system memory. This way, devices with lots of memory could enjoy the fastest possible uploads, while older devices would not be overwhelmed. However, we were still seeing much higher memory usage than we expected, so we used the memory profiler to take a closer look.
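A rough sketch of that sizing logic is below. The memory thresholds and concurrency levels are made up for illustration; the real values depend on profiling.

import android.app.ActivityManager
import android.content.Context

fun chooseConcurrentUploadCount(context: Context): Int {
    val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    val availableMb = memoryInfo.availMem / (1024 * 1024)
    return when {
        memoryInfo.lowMemory -> 1   // the device is already under memory pressure
        availableMb < 512 -> 2
        availableMb < 1024 -> 4
        else -> 8                   // plenty of headroom for parallel uploads
    }
}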
The first thing we noticed was that memory consumption wasn't returning to its pre-upload baseline after all uploads were done. It turned out this was due to an unfortunate behavior of the Java NIO API. It created an in-memory cache on every thread where we read a file, and once created, the cache could never be destroyed. Since we read files with the threadpool-backed IO dispatcher, we typically ended up with many of these caches, one for each dispatcher thread we used. We resolved this by switching to direct byte buffers, which don't allocate this cache.
The next thing we noticed were large spikes in memory usage when uploading, especially with larger files. During each upload, we read the file in blocks, copying each block into a ByteArray for further processing. We never created a new byte array until the previous one had gone out of scope, so we expected only one to be in memory at a time. However, it turned out that when we allocated a large number of byte arrays in a short time, the garbage collector could not free them quickly enough, causing a transient memory spike. We resolved this issue by re-using the same buffer for all block reads.
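Both fixes can be seen in this simplified reader: it uses a single direct ByteBuffer, allocated once and reused for every block, instead of allocating a fresh heap array per read. This is an illustrative sketch rather than our production code, and it ignores partial reads for brevity.

import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

const val UPLOAD_BLOCK_SIZE = 4 * 1024 * 1024 // 4 MB

fun readBlocks(file: File, onBlock: (ByteBuffer) -> Unit) {
    // One direct buffer, reused for every block; direct buffers also avoid
    // the per-thread heap-buffer cache described above.
    val buffer = ByteBuffer.allocateDirect(UPLOAD_BLOCK_SIZE)
    FileChannel.open(file.toPath(), StandardOpenOption.READ).use { channel ->
        while (true) {
            buffer.clear()
            if (channel.read(buffer) <= 0) break
            buffer.flip() // switch the buffer from writing mode to reading mode
            onBlock(buffer)
        }
    }
}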
Parallel scanning and uploading
In the C++ implementation of camera uploads, uploading could not start until we finished scanning a user's photo library for changes. To avoid upload delays, each scan only looked at changes that were newer than what was seen in the previous scan.
This approach had downsides. There were some edge cases where photos with misleading timestamps could be skipped completely. If we ever missed photos due to a bug or OS change, shipping a fix wasn't enough to recover; we also had to clear affected users' saved scan timestamps to force a full re-scan. Plus, when camera uploads was first enabled, we still had to check everything before uploading anything. This wasn't a great first impression for new users.
In the rewrite, we ensured correctness by re-scanning the whole library after every change. We also parallelized uploading and scanning, so new photos can start uploading while we're still scanning older ones. This means that although re-scanning can take longer, the uploads themselves still start and finish promptly.
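One simple way to structure this kind of producer/consumer pipeline with coroutines is shown below: the scanner pushes photos onto a channel as it finds them, and the uploader consumes them concurrently. MediaScanner, MediaUploader, and PhotoItem are hypothetical stand-ins for the real components, not our actual classes.

import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

data class PhotoItem(val uri: String)

interface MediaScanner {
    // Invokes onFound for every photo that still needs to be uploaded.
    suspend fun scanLibrary(onFound: suspend (PhotoItem) -> Unit)
}

interface MediaUploader {
    suspend fun upload(photo: PhotoItem)
}

suspend fun scanAndUpload(scanner: MediaScanner, uploader: MediaUploader) =
    coroutineScope {
        val pending = Channel<PhotoItem>(capacity = Channel.BUFFERED)
        launch {
            // Producer: re-scan the whole library, queueing photos as found.
            scanner.scanLibrary { photo -> pending.send(photo) }
            pending.close()
        }
        launch {
            // Consumer: start uploading immediately, while scanning continues.
            for (photo in pending) uploader.upload(photo)
        }
    }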
Validation
A rewrite of this magnitude is risky to ship. It has dangerous failure modes that might only show up at scale, such as corrupting one out of every million uploads. Plus, as with most rewrites, we could not avoid introducing new bugs because we did not understand, or even know about, every edge case handled by the old system. We were reminded of this at the start of the project when we tried to remove some ancient camera uploads code that we thought was dead, and instead ended up DDOSing Dropbox's crash reporting service.
Hash validation in production
During early development, we validated many low-level components by running them in production alongside their C++ counterparts and then comparing the outputs. This let us confirm that the new components were working correctly before we started relying on their results.
One of those components was a Kotlin implementation of the hashing algorithms that we use to identify photos. Because these hashes are used for de-duplication, unexpected things could happen if the hashes change for even a tiny percentage of photos. For example, we might re-upload old photos believing they are new. When we ran our Kotlin code alongside the C++ implementation, both implementations almost always returned matching hashes, but they differed about 0.005% of the time. Which implementation was wrong?
To answer this, we added some additional logging. In cases where Kotlin and C++ disagreed, we checked if the server subsequently rejected the upload because of a hash mismatch, and if so, what hash it was expecting. We saw that the server was expecting the Kotlin hashes, giving us high confidence the C++ hashes were wrong. This was great news, since it meant we had fixed a rare bug we didn't even know we had.
Validating state transitions
Camera uploads uses a database to track each photo's upload state. Typically, the scanner adds photos in state NEW and then moves them to PENDING (or DONE if they don't need to be uploaded). The uploader tries to upload PENDING photos and then moves them to DONE or ERROR.
Since we parallelize so much work, it's normal for multiple parts of the system to read and write this state database simultaneously. Individual reads and writes are guaranteed to happen sequentially, but we're still vulnerable to subtle bugs where multiple workers try to alter the state in redundant or contradictory ways. Since unit tests only cover single components in isolation, they won't catch these bugs. Even an integration test might miss rare race conditions.
In the rewritten version of camera uploads, we guard against this by validating every state update against a set of allowed state transitions. For example, we stipulate that a photo can never move from ERROR to DONE without passing back through PENDING. Unexpected state transitions could indicate a serious bug, so if we see one, we stop uploading and report an exception.
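Here is a small sketch of what that allow-list check could look like. The transition table is inferred from the states and rules described in this post and is illustrative, not our exact rule set.

enum class UploadState { NEW, PENDING, DONE, ERROR }

private val allowedTransitions: Map<UploadState, Set<UploadState>> = mapOf(
    UploadState.NEW to setOf(UploadState.PENDING, UploadState.DONE),
    UploadState.PENDING to setOf(UploadState.DONE, UploadState.ERROR),
    UploadState.ERROR to setOf(UploadState.PENDING), // never ERROR -> DONE directly
    UploadState.DONE to emptySet(),
)

class IllegalStateTransitionException(from: UploadState, to: UploadState) :
    IllegalStateException("Unexpected upload state transition: $from -> $to")

fun validateTransition(from: UploadState, to: UploadState) {
    if (to !in allowedTransitions.getValue(from)) {
        // An unexpected transition may indicate a serious bug, so surface it
        // loudly instead of silently updating the database.
        throw IllegalStateTransitionException(from, to)
    }
}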
These checks helped us discover a nasty bug early in our rollout. We started to see a high volume of exceptions in our logs that were caused when camera uploads tried to transition photos from DONE to DONE. This made us realize we were uploading some photos multiple times! The root cause was a surprising behavior in WorkManager where unique workers can restart before the previous instance is fully cancelled. No duplicate files were being created because the server rejects them, but the redundant uploads were wasting bandwidth and time. Once we fixed the issue, upload throughput dramatically improved.
Rolling it out
Even after all this validation, we still had to be cautious during the rollout. The fully integrated system was more complex than its parts, and we'd also need to contend with a long tail of rare device types that are not represented in our internal user testing pool. We also needed to continue to meet or surpass the high expectations of all our users who rely on camera uploads.
To reduce this risk preemptively, we made sure to support rollbacks from the new version to the C++ version. For instance, we ensured that all user preference changes made in the new version would apply to the old version as well. In the end we never needed to roll back, but it was still worth the effort to have the option available in case of disaster.
We started our rollout with an opt-in pool of beta (Play Store early access) users who receive a new version of the Dropbox Android app every week. This pool of users was large enough to surface rare errors and collect key performance metrics such as upload success rate. We monitored these key metrics in this population for a number of months to gain confidence it was ready to ship widely. We discovered many problems during this time period, but the fast beta release cadence allowed us to iterate and fix them quickly.
We also monitored many metrics that could hint at future problems. To make sure our uploader wasn't falling behind over time, we watched for signs of ever-growing backlogs of photos waiting to upload. We tracked retry success rates by error type, and used this to fine-tune our retry algorithm. Last but not least, we also paid close attention to feedback and support tickets we received from users, which helped surface bugs that our metrics had missed.
When we finally released the new version of camera uploads to all users, it was clear our months spent in beta had paid off. Our metrics held steady through the rollout and we had no major surprises, with improved reliability and low error rates right out of the gate. In fact, we ended up finishing the rollout ahead of schedule. Since we'd front-loaded so much quality improvement work into the beta period (with its weekly releases), we didn't have any multi-week delays waiting for critical bug fixes to roll out in the stable releases.
So, was it worth it?
Rewriting a large legacy feature isn't always the right decision. Rewrites are extremely time-consuming (the Android version alone took two people working for two full years) and can easily cause major regressions or outages. In order to be worthwhile, a rewrite needs to deliver tangible value by improving the user experience, saving engineering time and effort in the long term, or both.
What advice do we have for others who are starting a project like this?
- Define your goals and how you will measure them. At the start, this is important to make sure that the benefits will justify the effort. At the end, it will help you determine whether you got the results you wanted. Some goals (for example, future resilience against OS changes) may not be quantifiable, and that's OK, but it's good to spell out which ones are and aren't.
- De-risk it. Identify the components (or system-wide interactions) that would cause the biggest problems if they failed, and guard against those failures from the very beginning. Build critical components first, and try to test them in production without waiting for the whole system to be finished. It's also worth doing extra work up-front in order to be able to roll back if something goes wrong.
- Don't rush. Shipping a rewrite is arguably riskier than shipping a new feature, since your audience is already relying on things to work as expected. Start by releasing to an audience that's just big enough to give you the data you need to evaluate success. Then, watch and wait (and fix stuff) until your data give you confidence to proceed. Dealing with issues when the user base is small is much faster and less stressful in the long run.
- Limit your scope. When doing a rewrite, it's tempting to tackle new feature requests, UI cleanup, and other outstanding work at the same time. Consider whether this will actually be faster or easier than shipping the rewrite first and fast-following with the rest. During this rewrite we addressed issues linked to the core architecture (such as crashes intrinsic to the underlying data model) and deferred all other improvements. If you change the feature too much, not only does it take longer to implement, but it's also harder to notice regressions or roll back.
In this case, we feel good about the decision to rewrite. We were able to improve reliability right away, and more importantly, we set ourselves up to stay reliable in the future. As the iOS and Android operating systems continue to evolve in separate directions, it was only a matter of time before the C++ library broke badly enough to require fundamental systemic changes. Now that the rewrite is complete, we're able to build and iterate on camera uploads much faster, and offer a better experience for our users, too.
Also: We're hiring!
Are you a mobile engineer who wants to make software that's reliable and maintainable for the long haul? If so, we'd love to have you at Dropbox! Visit our jobs page to see current openings.
Source: https://dropbox.tech/mobile/making-camera-uploads-for-android-faster-and-more-reliable