ENGINEERING · 12 min read
Why your SharePoint migration stalls at 99%
If you have run a Google Drive to SharePoint migration at meaningful scale you have probably watched a batch stall at 99%. Different files, different sizes, different tenants — the same symptom: chunked upload ended without completion, 99% of bytes written. It looks like a flaky network. It is not a flaky network. It is the signature of four separate failure modes that live in the plumbing between Google’s download API and Microsoft Graph’s upload session protocol, and they all surface as the same number.
If you are evaluating a migration tool, or building your own pipeline, this is what you need to plan for. The failure modes are deterministic. Each has a root cause in a real HTTP-layer behavior. A correctly designed pipeline handles all four automatically. A naive implementation hits at least two of them on any batch that contains large files or runs against a busy tenant.
Why it is always 99%
The data path looks simple on paper. A worker opens a stream from Google Drive (files.get?alt=media), pipes it into a Graph upload session (createUploadSession on the destination path, then chunked PUTs against the returned uploadUrl), and when the pipe finishes the file is in SharePoint. At rest this works. Under load, roughly 4–8% of files per batch fail — weighted toward larger files — and they all fail at the same fraction of total size.
That alone is the most useful diagnostic. A flaky network fails at 37%, 62%, 4% — random percentages scattered across the distribution of socket lifetimes. When every file fails at the same fraction of its total size, you are not looking at flaky hardware. You are looking at a systematic drift between what one side thinks the payload is and what the other side is measuring. Google thinks the file is done. SharePoint thinks the file is not done. The bytes streamed are short of the bytes Graph was expecting, but only by a small margin. That margin is the tell.
Failure mode 1: Google auto-gzips, Node auto-decompresses, Content-Length lies
Google Drive returns binary file content with Content-Encoding: gzip for anything the server deems compressible. This includes many document formats. The Content-Length header Google sets describes the compressed byte count on the wire, not the file’s true size.
Node’s fetch implementation and the high-level HTTP clients built on top of core http — axios, got, the googleapis transport — transparently decompress a gzip response unless told otherwise. (The raw http module hands back the compressed stream, but almost no application code reads it directly.) The stream the application reads is the uncompressed payload — the true file bytes — but the Content-Length on the response object is the compressed count. If that value is used to drive the upload (Graph’s chunked protocol requires the total size in every Content-Range header), Graph is told the file is smaller than it actually is. When the stream keeps producing bytes past the advertised length, Graph closes the session politely and leaves the file 99% written.
The correct defense is to refuse compression on the source request: send Accept-Encoding: identity and eat the extra bandwidth in exchange for a truthful Content-Length. The right pipeline does this on every Google Drive read.
```javascript
// Naive
const src = await drive.files.get(
  { fileId, alt: 'media' },
  { responseType: 'stream' }
);
// src.headers['content-length'] ← compressed count
// src.data actually emits ← uncompressed count, larger
// Result: upload session sized too small, fails at ~99%

// Tolerant
const src = await drive.files.get(
  { fileId, alt: 'media' },
  {
    responseType: 'stream',
    headers: { 'Accept-Encoding': 'identity' },
  }
);
// Content-Length matches the stream. Upload session sized correctly.
```
This one detail eliminates the majority of 99% failures. It does not eliminate all of them, because there are three more modes stacked behind it.
Failure mode 2: Google idle socket timeout on slow destinations
Large files on tenants with slow SharePoint upload endpoints fail for a second, independent reason. When the destination upload is slow, the source stream’s internal socket sits unread for several seconds. Google’s edge does not wait forever. After a threshold of idleness — roughly 20–25 seconds in most regions — the source side closes the connection. The destination side sees EOF and declares the upload ended, short.
In a naive pipe, the source socket only gets read when the destination socket is ready to receive. If the destination does anything slow — a cert renegotiation, a throttle response, a chunk finalize — the source blocks. Google’s side sees no reads. Google disconnects.
The defense is a buffering PassThrough between source and destination: a 64 MB in-memory buffer that the source fills as fast as Google can serve, and the destination drains as fast as SharePoint accepts. The source socket is always being read actively — there is no idle period for Google to time out on. The destination can pause briefly without propagating back-pressure to the source. A tolerant pipeline keeps this buffer in place on every transfer so the source side is never the bottleneck.
64 MB is not arbitrary. It balances the longest observed destination pause during throttling events (around 10–12 seconds) against peak source throughput (around 40–60 MB/s) with headroom. Smaller buffers let slow destinations propagate back-pressure. Larger buffers waste memory per concurrent transfer. 64 MB per worker thread is the right trade for cross-cloud transfers at scale.
Failure mode 3: Graph upload sessions lock the total size at the first chunk
Graph’s large-file upload protocol is chunked. The client opens a session, then PUTs ranges against it: Content-Range: bytes 0-4194303/14380102, Content-Range: bytes 4194304-8388607/14380102, and so on until the final range closes the session.
The trailing number after the slash — the total size — is not negotiable. Whatever value is sent in the first chunk is the value Graph records. Subsequent chunks that disagree are rejected with 416 Requested Range Not Satisfiable. This happens even when the later range arithmetic is self-consistent and the actual byte count is higher. Graph trusts the first total and will refuse the real file if that first total was wrong.
Combined with failure mode 1, this is catastrophic. Read the compressed Content-Length off the source response, use it to size the upload session, start streaming, and when the actual uncompressed bytes outrun that number Graph slams the door at 99% — because 99% of the understated size is 100% of the truth originally declared.
The identity-encoding fix resolves most of this. The remaining defense is: never open a Graph upload session before the true size is known. For files where the source does not return a reliable size (rare, but it happens for Google Docs exported on the fly), the right approach is to spool the exported stream to local disk, read the final size, then open the session. It is not free, but it removes the class of failure entirely.
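The range arithmetic is worth seeing concretely, using the file size from the example above. This is a sketch, not a full client: `contentRanges` and `uploadChunks` are illustrative names, and `uploadUrl` is assumed to be the URL returned by Graph's createUploadSession endpoint.

```javascript
// Compute the Content-Range header for each chunk of a file whose true
// size is totalSize. The total after the slash must be identical on
// every chunk — Graph locks in the value from the first one.
function contentRanges(totalSize, chunkSize) {
  const ranges = [];
  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    const end = Math.min(offset + chunkSize, totalSize) - 1;
    ranges.push(`bytes ${offset}-${end}/${totalSize}`);
  }
  return ranges;
}

// Illustrative upload loop against an already-created session.
async function uploadChunks(uploadUrl, fileBuffer) {
  const chunkSize = 4 * 1024 * 1024; // Graph requires multiples of 320 KiB
  const ranges = contentRanges(fileBuffer.length, chunkSize);
  for (const [i, range] of ranges.entries()) {
    const start = i * chunkSize;
    const res = await fetch(uploadUrl, {
      method: 'PUT',
      headers: { 'Content-Range': range },
      body: fileBuffer.subarray(start, start + chunkSize),
    });
    if (res.status === 416) {
      throw new Error(`Graph rejected ${range}: total disagrees with first chunk`);
    }
  }
}
```

For the 14,380,102-byte file from the protocol example, `contentRanges` produces exactly the headers shown earlier, ending with `bytes 12582912-14380101/14380102`.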
Failure mode 4: The tenant throttle budget is smaller than the obvious concurrency
The last failure mode is not an error at all. It is a gradual slowdown that is easily misread as SharePoint being slow. It shows up when concurrency is set by hardware capacity rather than by the destination tenant’s Graph budget.
Each chunked upload is not a single request. It is: open session, multiple chunk PUTs, finalize, get-item-metadata, set-metadata, permission replay if enabled. A single file costs 6–12 Graph requests. At concurrency 30 with an average of 8 requests per file and a 5-second-per-file median, that is about 48 requests per second, or roughly 2,900 per minute against the destination tenant. Microsoft’s per-tenant Graph budget is generous but not unlimited — the practical ceiling lands around 600 requests per minute before 429s start appearing, and sustained overshoot drives Retry-After headers into the tens of seconds.
Concurrency has to match the budget, not the hardware. A sensible default for Google Drive to SharePoint transfers is concurrency 10, which is the sustainable steady-state for a typical tenant; scale up only for workspaces that have provisioned additional Graph capacity or that run during off-peak hours. The pipeline should measure 429 rate continuously and back off before Graph forces it to.
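The back-off half of that is small. A minimal sketch, assuming a `doRequest` callback that returns a fetch-style response (the function name and retry cap are ours):

```javascript
// Honor Retry-After on 429 instead of hammering the tenant budget.
// doRequest is any function returning a response with .status and
// .headers.get().
async function withThrottleBudget(doRequest, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await doRequest();
    if (res.status !== 429) return res;
    // Back off for the server-suggested interval, or exponentially
    // when no Retry-After header is present.
    const waitSec = Number(res.headers.get('retry-after') ?? 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, waitSec * 1000));
  }
  throw new Error('Throttle budget exhausted after retries');
}
```

The piece this sketch omits — and the part that matters at scale — is feeding the observed 429 rate back into the concurrency setting itself, so the pipeline slows down before Graph forces it to.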
What a tolerant pipeline produces
On a representative 312-file test dataset (mixed sizes, 24 GB total), the difference between a naive pipe and a pipeline that handles all four failure modes:
| Metric | Naive pipe | Tolerant pipeline |
|---|---|---|
| Files failed | 47 / 312 (15%) | 0 / 312 (0%) |
| Silent 99% truncations | 31 | 0 |
| Retries needed | 94 | 0 |
| Sustained throughput | 11 MB/s effective | 38 MB/s sustained |
| Graph 429 rate | 8.2% | 0.1% |
| Wall clock (24 GB batch) | 62 min + manual cleanup | 11 min, no cleanup |
The 6x wall-clock improvement is not because any single request is faster. It is because the pipeline stops paying the tax of doing the same work twice — first as a broken truncated upload, then as a retried full upload. Eliminating the failure class removes almost all of the retry overhead at once. A correctly built pipeline retries from the byte offset Graph has already accepted rather than restarting the file, which is the difference between minutes and hours on a large batch.
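Resuming from the accepted offset is a feature of the protocol, not a trick: a GET on the session's uploadUrl returns `nextExpectedRanges`, the byte ranges Graph has not yet received. The helper names below are illustrative.

```javascript
// Graph reports what it still needs as nextExpectedRanges,
// e.g. ["12582912-"] or ["12582912-14380101"]. The resume offset is
// the start of the first expected range.
function parseResumeOffset(nextExpectedRanges) {
  return parseInt(nextExpectedRanges[0].split('-')[0], 10);
}

// Query the live session and continue from where Graph left off.
async function resumeOffset(uploadUrl) {
  const res = await fetch(uploadUrl, { method: 'GET' });
  const { nextExpectedRanges } = await res.json();
  return parseResumeOffset(nextExpectedRanges);
}
```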
How to tell which mode you are hitting
If you are evaluating a migration tool or instrumenting your own pipe between a compressed HTTP source and a chunked upload session, these are the fingerprints worth watching for:
- Every failure at exactly the same fraction of file size (not random percentages) → gzip Content-Length drift. The pipeline should force `Accept-Encoding: identity` or compute size independently.
- Failures correlate with large files on slow tenants, and small files succeed → idle source socket. The pipeline needs a draining buffer between source and destination.
- `416 Requested Range Not Satisfiable` on the final chunk with self-consistent range arithmetic → first-chunk size mismatch. The upload session should not be opened until the true size is known.
- Gradual slowdown with `Retry-After` in response headers → concurrency exceeds the tenant throttle budget. The pipeline should be budget-aware, not hardware-aware.
The larger point
The seductive thing about streaming HTTP is that a pipe() call looks like it handles everything. Source goes to destination. The fact that the source has its own timeout clock, that the framework silently decompresses responses, that the destination locks metadata on the first write, and that the entire pipeline sits inside a per-tenant throttle budget — none of that is visible in the one-line pipe. All of it produces failure modes that look identical at the log level. “Upload ended at 99%.”
It looks like a network problem. It is four separate layers each lying slightly to the next one. A migration tool either handles that invisibly, or it fails at 99% and blames the network.
When you pick a migration tool for Google Drive to SharePoint work, the question is not whether it can move a file. Any tool can move a file. The question is what happens to the twenty-seventh file in a batch of four hundred when the destination tenant starts throttling and the source socket starts idling. The right answer covers all four cases: identity encoding on source reads, buffered pass-through between streams, deferred upload-session creation until true size is known, and tenant-aware concurrency that respects the Graph budget. MigrationFox ships these as defaults on every workspace — no flags, no config.
Related reading
- How to migrate Google Drive to SharePoint
- SharePoint migration speed fixes
- Why .DS_Store and Thumbs.db slow down your migration
Get started
Run a Google Drive to SharePoint migration at app.migrationfox.com/register. Tolerant pipeline defaults apply on every workspace tier.