
ENGINEERING · 12 min read

Why your SharePoint migration stalls at 99%

gdrive-sharepoint transfer · naive vs tolerant pipeline
Naive pipe
upload start · Budget.xlsx (14 MB)
ended without completion
  ~99% of bytes written
upload start · Training.mp4 (878 MB)
ended without completion (~99%)
upload start · Report.docx (2 MB)
socket hang up · ECONNRESET
upload start · Archive.zip (412 MB)
ended without completion (~99%)
batch summary
  ~15% failure rate, heavy retry load
Tolerant pipeline
upload start · Budget.xlsx (14 MB)
upload complete
upload start · Training.mp4 (878 MB)
upload complete
upload start · Report.docx (2 MB)
upload complete
upload start · Archive.zip (412 MB)
upload complete
batch summary
  0% failure rate, no retries
  sustained throughput at budget
Same dataset, same source, same destination · naive pipe vs a pipeline that handles all four failure modes

If you have run a Google Drive to SharePoint migration at meaningful scale you have probably watched a batch stall at 99%. Different files, different sizes, different tenants — the same symptom: chunked upload ended without completion, 99% of bytes written. It looks like a flaky network. It is not a flaky network. It is the signature of four separate failure modes that live in the plumbing between Google’s download API and Microsoft Graph’s upload session protocol, and they all surface as the same number.

If you are evaluating a migration tool, or building your own pipeline, this is what you need to plan for. The failure modes are deterministic. Each has a root cause in a real HTTP-layer behavior. A correctly designed pipeline handles all four automatically. A naive implementation hits at least two of them on any batch that contains large files or runs against a busy tenant.

Why it is always 99%

The data path looks simple on paper. A worker opens a stream from Google Drive (files.get?alt=media), pipes it into a Graph upload (PUT /drive/items/{id}:/path:/content), and when the pipe finishes the file is in SharePoint. At low volume this works. Under load, roughly 4–8% of files per batch fail — weighted toward larger files — and they all fail at the same fraction of total size.
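
For concreteness, here is a minimal sketch of that naive pipe. This is not any particular tool's code: uploadUrl and token are placeholders, src is a Drive response stream like the one shown under failure mode 1 below, and error handling is omitted.

import https from 'node:https';
import { pipeline } from 'node:stream/promises';

// One streaming PUT, sized from the source response header.
const put = https.request(uploadUrl, {
  method: 'PUT',
  headers: {
    Authorization: `Bearer ${token}`,
    // Taken straight from the Drive response — the exact spot where
    // failure mode 1 sneaks in when this header counts compressed bytes.
    'Content-Length': src.headers['content-length'],
  },
});
await pipeline(src.data, put); // pipe finishes ⇒ file is in SharePoint — in theory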

That alone is the most useful diagnostic. A flaky network fails at 37%, 62%, 4% — random percentages scattered across the distribution of socket lifetimes. When every file fails at the same fraction of its total size, you are not looking at flaky hardware. You are looking at a systematic drift between what one side thinks the payload is and what the other side is measuring. Google thinks the file is done. SharePoint thinks the file is not done. The bytes streamed are short of the bytes Graph was expecting, but only by a small margin. That margin is the tell.

Failure mode 1: Google auto-gzips, Node auto-decompresses, Content-Length lies

Google Drive returns binary file content with Content-Encoding: gzip for anything the server deems compressible. This includes many document formats. The Content-Length header Google sets describes the compressed byte count on the wire, not the file’s true size.

Node’s fetch implementation — and virtually every streaming HTTP client library layered above the raw http module — will transparently decompress a gzip response unless told otherwise. The stream the application reads is the uncompressed payload — the true file bytes — but the Content-Length on the response object is the compressed count. If that value is used to drive the upload (the total the Graph upload protocol locks in from the first chunk, as failure mode 3 explains), Graph is told the file is smaller than it actually is. When the stream keeps producing bytes past the advertised length, Graph closes the session politely and leaves the file 99% written.

The correct defense is to refuse compression on the source request: send Accept-Encoding: identity and eat the extra bandwidth in exchange for a truthful Content-Length. The right pipeline does this on every Google Drive read.

// Naive — `drive` is an authenticated googleapis Drive client
const src = await drive.files.get({ fileId, alt: 'media' }, { responseType: 'stream' });
// src.headers['content-length']  ← compressed count
// src.data actually emits        ← uncompressed count, larger
// Result: upload session sized too small, fails at ~99%

// Tolerant — refuse compression so the header tells the truth
const src = await drive.files.get({ fileId, alt: 'media' }, {
  responseType: 'stream',
  headers: { 'Accept-Encoding': 'identity' }
});
// Content-Length matches the stream. Upload session sized correctly.

This one detail eliminates the majority of 99% failures. It does not eliminate all of them, because there are three more modes stacked behind it.

Failure mode 2: Google idle socket timeout on slow destinations

Large files on tenants with slow SharePoint upload endpoints fail for a second, independent reason. When the destination upload is slow, the source stream’s internal socket sits unread for several seconds. Google’s edge does not wait forever. After a threshold of idleness — roughly 20–25 seconds in most regions — the source side closes the connection. The destination side sees EOF and declares the upload ended, short.

In a naive pipe, the source socket only gets read when the destination socket is ready to receive. If the destination does anything slow — a cert renegotiation, a throttle response, a chunk finalize — the source blocks. Google’s side sees no reads. Google disconnects.

The defense is a buffering PassThrough between source and destination: a 64 MB in-memory buffer that the source fills as fast as Google can serve, and the destination drains as fast as SharePoint accepts. The source socket is always being read actively — there is no idle period for Google to time out on. The destination can pause briefly without propagating back-pressure to the source. A tolerant pipeline keeps this buffer in place on every transfer so the source side is never the bottleneck.

Stream pipeline with 64 MB buffer
Google Drive source files.get?alt=media · identity
↓ streams eagerly
PassThrough buffer (64 MB high-water mark) never idle on read
↓ drains as destination accepts
Graph upload session PUT /drive/items/{id}:/path:/content
The source socket is continuously drained even when SharePoint pauses. Google never sees idle time.
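
A minimal sketch of that buffering stage, assuming src is the identity-encoded Drive response from earlier and uploadToGraph is a hypothetical consumer that drains a readable stream into the upload session:

import { PassThrough } from 'node:stream';
import { pipeline } from 'node:stream/promises';

// 64 MB high-water mark: the source fills eagerly, the destination
// drains at its own pace, and Google's socket never sits unread.
const buffer = new PassThrough({ highWaterMark: 64 * 1024 * 1024 });

await Promise.all([
  pipeline(src.data, buffer), // source side: always being read
  uploadToGraph(buffer),      // destination side: free to pause briefly
]);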

64 MB is not arbitrary. At peak source throughput (around 40–60 MB/s) it absorbs roughly a second of full-rate output, enough to swallow routine destination hiccups outright. The longest observed destination pauses during throttling events (around 10–12 seconds) do eventually fill the buffer, but the resulting source stall stays comfortably inside Google's 20–25-second idle threshold. Smaller buffers let slow destinations propagate back-pressure almost immediately. Larger buffers waste memory per concurrent transfer. 64 MB per worker thread is the right trade for cross-cloud transfers at scale.

Failure mode 3: Graph upload sessions lock the total size at the first chunk

Graph’s large-file upload protocol is chunked. The client opens a session, then PUTs ranges against it: Content-Range: bytes 0-4194303/14380102, Content-Range: bytes 4194304-8388607/14380102, and so on until the final range closes the session.

The trailing number after the slash — the total size — is not negotiable. Whatever value is sent in the first chunk is the value Graph records. Subsequent chunks that disagree are rejected with 416 Requested Range Not Satisfiable. This happens even when the later range arithmetic is self-consistent and the actual byte count is higher. Graph trusts the first total and will refuse the real file if that first total was wrong.
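
In code, the lock-in looks like this — a hedged sketch where uploadUrl is the session URL returned by Graph's createUploadSession call, and chunk (a Buffer), offset, and totalSize are assumed inputs:

// Every chunk PUT re-declares the total after the slash. Graph records
// the value from the FIRST chunk; later chunks must agree or get a 416.
const end = offset + chunk.length - 1;
await fetch(uploadUrl, {
  method: 'PUT',
  // fetch sets Content-Length from the Buffer automatically;
  // Content-Range — and the total after the slash — is ours to declare.
  headers: { 'Content-Range': `bytes ${offset}-${end}/${totalSize}` },
  body: chunk,
});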

Combined with failure mode 1, this is catastrophic. Read the compressed Content-Length off the source response, use it to size the upload session, start streaming, and when the actual uncompressed bytes outrun that number Graph slams the door at 99% — because 100% of the understated total arrives at what is only ~99% of the file's true size.

The identity-encoding fix resolves most of this. The remaining defense is: never open a Graph upload session before the true size is known. For files where the source does not return a reliable size (rare, but it happens for Google Docs exported on the fly), the right approach is to spool the stream to local disk, read the final size, then open the session. It is not free, but it removes the class of failure entirely.
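
A sketch of that deferred path, under stated assumptions: exportStream is the sizeless source read, and fileId, openUploadSession, and uploadChunks are hypothetical names for the file handle and the session helpers.

import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';
import { pipeline } from 'node:stream/promises';

// Spool the sizeless export to disk, learn the true byte count,
// and only then open the Graph session.
const tmp = path.join(os.tmpdir(), `spool-${fileId}.bin`);
await pipeline(exportStream, fs.createWriteStream(tmp));
const { size } = await fs.promises.stat(tmp); // true size, now known
const session = await openUploadSession(targetPath, size);
await uploadChunks(session, fs.createReadStream(tmp));
await fs.promises.unlink(tmp); // clean up the spool file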

Failure mode 4: The tenant throttle budget is smaller than the obvious concurrency

The last failure mode is not an error at all. It is a gradual slowdown that is easily misread as SharePoint being slow. It shows up when concurrency is set by hardware capacity rather than by the destination tenant’s Graph budget.

Each chunked upload is not a single request. It is: open session, multiple chunk PUTs, finalize, get-item-metadata, set-metadata, permission replay if enabled. A single file costs 6–12 Graph requests. At concurrency 30 with an average of 8 requests per file and a 5-second-per-file median, that is 30 ÷ 5 × 8 ≈ 48 requests per second, or roughly 2,900 per minute against the destination tenant. Microsoft’s per-tenant Graph budget is generous but not unlimited — the practical ceiling lands around 600 requests per minute before 429s start appearing, and more aggressive behavior drives Retry-After headers into the tens of seconds.

Concurrency has to match the budget, not the hardware. A sensible default for Google Drive to SharePoint transfers is concurrency 10, which is the sustainable steady-state for a typical tenant; scale up only for workspaces that have provisioned additional Graph capacity or that run during off-peak hours. The pipeline should measure 429 rate continuously and back off before Graph forces it to.
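
One piece of that backoff, sketched below: honoring Retry-After on 429s before Graph escalates. graphFetch is a hypothetical wrapper, not any library's API; a real pipeline would also feed the observed 429 rate back into its concurrency setting.

// Retry a throttled Graph request after the server-requested delay.
async function graphFetch(url, init, attempt = 0) {
  const res = await fetch(url, init);
  if (res.status === 429 && attempt < 5) {
    const waitSec = Number(res.headers.get('retry-after') ?? '10');
    await new Promise((resolve) => setTimeout(resolve, waitSec * 1000));
    return graphFetch(url, init, attempt + 1);
  }
  return res;
}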

What a tolerant pipeline produces

On a representative 312-file test dataset (mixed sizes, 24 GB total), the difference between a naive pipe and a pipeline that handles all four failure modes:

Metric                      Naive pipe                Tolerant pipeline
Files failed                47 / 312 (15%)            0 / 312 (0%)
Silent 99% truncations      31                        0
Retries needed              94                        0
Sustained throughput        11 MB/s effective         38 MB/s sustained
Graph 429 rate              8.2%                      0.1%
Wall clock (24 GB batch)    62 min + manual cleanup   11 min, no cleanup

The 6x wall-clock improvement is not because any single request is faster. It is because the pipeline stops paying the tax of doing the same work twice — first as a broken truncated upload, then as a retried full upload. Eliminating the failure class removes almost all of the retry overhead at once. A correctly built pipeline retries from the byte offset Graph has already accepted rather than restarting the file, which is the difference between minutes and hours on a large batch.
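
Graph makes offset resume possible: a GET against the session's uploadUrl returns nextExpectedRanges, the byte offset the service is still waiting for. A minimal resume sketch, where reopenSourceAt is a hypothetical helper that restarts the Drive read at an offset (a Range header on files.get):

// Ask the session where Graph wants the next byte, instead of restarting.
const status = await (await fetch(uploadUrl)).json();
const resumeAt = Number(status.nextExpectedRanges[0].split('-')[0]);
// Hypothetical helper: reopen the Drive stream at that offset.
const src = await reopenSourceAt(fileId, resumeAt);
// ...continue chunk PUTs from resumeAt, with the same total as before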

How to tell which mode you are hitting

If you are evaluating a migration tool or instrumenting your own pipe between a compressed HTTP source and a chunked upload session, these are the fingerprints worth watching for:

Mode 1 (silent gzip) — every failure lands at the same fraction of file size, and the source Content-Length is smaller than the bytes the stream actually emitted.
Mode 2 (idle socket timeout) — failures concentrate on large files against slow destinations, with the source connection dropping after roughly 20–25 seconds of an unread socket.
Mode 3 (locked session total) — a mid-stream chunk is rejected with 416 even though its range arithmetic is self-consistent; the total declared in the first chunk was wrong.
Mode 4 (throttle budget) — no hard failures at all, just a climbing 429 rate and Retry-After values stretching into the tens of seconds.

The larger point

The seductive thing about streaming HTTP is that a pipe() call looks like it handles everything. Source goes to destination. The fact that the source has its own timeout clock, that the framework silently decompresses responses, that the destination locks metadata on the first write, and that the entire pipeline sits inside a per-tenant throttle budget — none of that is visible in the one-line pipe. All of it produces failure modes that look identical at the log level. “Upload ended at 99%.”

It looks like a network problem. It is four separate layers each lying slightly to the next one. A migration tool either handles that invisibly, or it fails at 99% and blames the network.

When you pick a migration tool for Google Drive to SharePoint work, the question is not whether it can move a file. Any tool can move a file. The question is what happens to the twenty-seventh file in a batch of four hundred when the destination tenant starts throttling and the source socket starts idling. The right answer covers all four cases: identity encoding on source reads, buffered pass-through between streams, deferred upload-session creation until true size is known, and tenant-aware concurrency that respects the Graph budget. MigrationFox ships these as defaults on every workspace — no flags, no config.

Get started

Run a Google Drive to SharePoint migration at app.migrationfox.com/register. Tolerant pipeline defaults apply on every workspace tier.
