# Gazing at the numbers: the Collatz sequence

I've been obsessing on this problem for a couple weeks. While thinking on it I made some observations on my own, so it's worth a post on my blog. Note that I'm not a mathematician, so there's more English than Math here (which, for people like me, could in fact make it easier to read). I don't have proofs for the most interesting observations I made, and they might be known already for the trained mind.

If you studied the Collatz conjecture, then you might want to jump to the TL;DR.

## The problem

The Collatz conjecture defines a sequence of positive integers that starts with any number $n > 1$, and at each step, in order to get the next number, you apply this simple logic, depending on the parity of the current number $n$:

• When $n$ is even, the next number is $n / 2$
• When $n$ is odd, the next number is $3n + 1$

The conjecture states that any such sequence eventually reaches $1$ (of course, we could continue after $1$ but it will loop). For an example, let's start with $n = 6$: 6 → 3 → 10 → 5 → 16 → 8 → 4 → 2 → 1.

## It looks so simple!

I was intrigued that a problem so simply formulated that I could explain it to my 10 y.o. kid is actually one of the unsolved puzzles, with great minds stating “Mathematics may not be ready for such problems” and “this is an extraordinarily difficult problem, completely out of reach of present day mathematics” (from the Wikipedia page).

Can I prove it? (hint: nope) I thought maybe induction would be the right tool here. The conjecture can be easily verified with a computer (according to Wikipedia, it has been verified upto $5 \times 2^{60}$). I would start a proof like this:

Let's assume that the conjecture holds true for any start number from $1$ to $n$. Can we then prove that the sequence starting at $n + 1$ also leads to $1$? When $n$ is odd, $n + 1$ will be even so the next number will decrease to ${n + 1} \over 2$, which is smaller than $n$, so based on our induction assumption, the sequence leads to $1$, case closed. What's left is to analyze the situation where $n$ is even. Because $n + 1$ is odd, the sequence starting at $n + 1$ will look like this:

$n + 1, 3(n + 1) + 1, {{3(n + 1) + 1} \over 2}$

The last term above is greater than $n$, so we can't yet use our induction assumption. However, let's notice something obvious: after an increase, the sequence will immediately decrease at least once (because when $n$ is odd, $3n + 1$ is even). I feel this “at least once” is the key. I started thinking, what are the chances that $3n + 1$ is a power of two? If we reach a power of two, it can only decrease to $1$ and the conjecture stands.

## Obsessing about powers of two

Observation #1. It's too simple so it must be known, but I didn't know it. Turns out, many powers of two can be written as $3k + 1$. I would say, half of them. We can notice that:

$2^0 = 1 = 3 \times 0 + 1 \\ 2^2 = 4 = 3 \times 1 + 1 \\ 2^4 = 16 = 3 \times 5 + 1 \\ 2^6 = 64 = 3 \times 21 + 1$

We can prove that $2^n$ can be written as $3k + 1$ if and only if $n$ is even. Proof by induction: we already verified the base case (it works for $n = 0$). Let's assume $2^n = 3k + 1$. We have:

\begin{align} 2^{n+2} &= 2^2 \times 2^n \\ &= 4(3k + 1) \\ &= 12k + 3 + 1 \\ &= 3(4k + 1) + 1 \end{align}

so it has the form I was talking about.

First I naively thought “great, seems we have a 50% chance to reach a power of two when the sequence increases to $3n + 1$”. But this is silly, of course. The bigger the numbers, the bigger the distance between consecutive powers of two, and smaller the chance to actually hit one. In fact, using a test program for various start numbers, I would say chances to reach a high power of two are close to zero.

But I kept thinking. Do the bits tell us anything? What if I see the numbers of this sequence in binary? I modified my little test program to display the factorization of the current number at each step, and the binary representation.

Observation #2. While looking at the binaries, it stroke me that a number $k$ with the property that $3k + 1$ is a power of two has a binary representation consisting of alternating ones and zeros. For example:

\begin{align} 10101_{(2)} &= 21 & \quad 3 \times 21 + 1 &= 64 = 2^6 \\ 1010101_{(2)} &= 85 & \quad 3 \times 85 + 1 &= 256 = 2^8 \\ 101010101_{(2)} &= 341 & \quad 3 \times 341 + 1 &= 1024 = 2^{10} \end{align}

This turns out to have a simple explanation. It can be deduced from the classic multiplication algorithm in base two:

   1010101 *
11 (this is 3)
------------
1010101 +
1010101
------------
11111111

Obviously, if we add $1$ to the result, we'll get 100000000 which is $2^8$. This result can be formulated in various ways, but I guess the mathematical way of putting it is: if $k = 2^0 + 2^2 + 2^4 + \ldots + 2^n$ then $3k + 1 = 2^{n+2}$ (actually, now that I've written it in algebraic form, the formal proof seems obvious; I won't bother you with it). Also an interesting consequence, we can say that a number whose binary representation consists of an even number of ones is divisible by 3. And this in fact can be easily deduced from the “first observation” above. And the result of dividing it by 3 can be obtained by nulling out every bit of an odd rank. It's all beautiful, isn't it? But I digress.

Observation #3. Perhaps the most interesting one, in relation to the Collatz sequence, and also closely related to the previous one: a number $k$ whose binary representation ends with a sequence of alternating 0 and 1 of length $n$ ($1$ must be the last digit) has two interesting properties:

• Obviously, it's not even (since the last bit is 1). This means that in the Collatz sequence it will be followed by $3k + 1$.
• $3k + 1$ is a multiple of $2^n$.

I didn't bother to prove it, but it should come easily from the previous point. The reason why this is interesting in the Collatz sequence is that $3k + 1$ will be half-able $n$ times and will effectively decrease the sequence.

I think what we should analyze is how do the operations affect the last few digits in the binary representation. An even number (ends with 0 in binary) will obviously decrease the sequence; this case isn't interesting. An odd number that ends in 01 will increase to $3n + 1$ and then decrease twice to ${3n + 1} \over 4$, which is smaller than $n$ — again, the sequence obviously decreases. This is derived from Observation #3 above, but let's see an example:

                (*)
↓
(n)          xxxx001 *
(3)               11
-----------
xxxx001 +
xxxx001
-----------
(3n)       xxxxxx011 +
1
-----------
(3n+1)     xxxxxx100 (divides by 4)
-----------
(3n+1)/4     xxxxxx1

Worst case scenario, $(3n+1)/4$ will have the same number of binary digits as $n$, but it's obviously smaller, so the sequence decreases. I marked bit 2 above with a (*) as I wanted to explain why I've chosen it to be zero for this example. If it's 1, then the sequence will decrease even more because $3n+1$ would be a multiple of $2^3$.

So to recap: if the current number ends with 0, the sequence decreases. If it ends with 01, the sequence decreases. Then the only interesting case left is when it ends with 11. Such numbers do indeed increase the sequence most often. Next I started analyzing sequences starting with just ones in the binary representation (in other words, numbers $2^k-1$).

Observation #4. If the sequence $2^k-1$ (where $k$ is odd) converges in $n$ iterations, the sequence $2^{k+1}-1$ converges in $n+1$ iterations. This is a non-trivial observation for which I have yet no proof; but there's more: in a relatively small number of steps, both sequences reach the same number. More precisely, the first common number is the $(k+2)$-th even number in the first sequence, and the $(k+3)$-th even number in the second sequence. The rest of the numbers are obviously common.

It's time I show some sample output from my program:

LISPERATOR> (collatz #b111)            LISPERATOR> (collatz #b1111)
7       111   (7)                     15        1111   (3 5)
22     10110 * (2 11)                  46      101110 * (2 23)
11      1011   (11)                    23       10111   (23)
34    100010 * (2 17)                  70     1000110 * (2 5 7)
17     10001   (17)                    35      100011   (5 7)
52    110100 * (2 2 13)               106     1101010 * (2 53)
26     11010 * (2 13)                  53      110101   (53)
13      1101   (13)                   160    10100000 * (2 2 2 2 2 5)
→ 40    101000 * (2 2 2 5)               80     1010000 * (2 2 2 2 5)
20     10100 * (2 2 5)               → 40      101000 * (2 2 2 5)
10      1010 * (2 5)                   20       10100 * (2 2 5)
5       101   (5)                     10        1010 * (2 5)
16     10000 * (2 2 2 2)                5         101   (5)
8      1000 * (2 2 2)                 16       10000 * (2 2 2 2)
4       100 * (2 2)                    8        1000 * (2 2 2)
2        10 * (2)                      4         100 * (2 2)
1         1   NIL                      2          10 * (2)
5 ↑      11 ↓ 2.2                         1           1   NIL
5 ↑      12 ↓ 2.4

On the left there's output for $collatz(7)$, and on the right it's $collatz(15)$ (thus, $k=3$). After they reach 1 there's one line showing us how many times did the sequence increase (both 5 times) and how many times they decrease (the first decreases 11 times, and the second 12 times). Lines marked with a * contain even numbers (i.e. the sequence decreases). I marked the number 40, which is the 5-th even number in the first sequence ($k+2$) and the 6-th even number in the second sequence ($k+3$).

(BTW, the last number displayed (2.2 and 2.4) is the ratio between the count of even numbers vs. count of odd numbers that a sequence reaches. For long sequences, this ratio seems to float (without converging) around 1.8, which suggests the sequence decreases almost twice more times than it increases.)

Here's another example, for $k=5$ (remember, we analyze sequences $2^k-1$ and $2^{k+1}-1$ where $k$ is odd):

LISPERATOR> (collatz #b11111)            LISPERATOR> (collatz #b111111)
31        11111   (31)                   63         111111   (3 3 7)
94      1011110 * (2 47)                190       10111110 * (2 5 19)
47       101111   (47)                   95        1011111   (5 19)
142     10001110 * (2 71)                286      100011110 * (2 11 13)
71      1000111   (71)                  143       10001111   (11 13)
214     11010110 * (2 107)               430      110101110 * (2 5 43)
107      1101011   (107)                 215       11010111   (5 43)
322    101000010 * (2 7 23)              646     1010000110 * (2 17 19)
161     10100001   (7 23)                323      101000011   (17 19)
484    111100100 * (2 2 11 11)           970     1111001010 * (2 5 97)
242     11110010 * (2 11 11)             485      111100101   (5 97)
121      1111001   (11 11)              1456    10110110000 * (2 2 2 2 7 13)
→ 364    101101100 * (2 2 7 13)            728     1011011000 * (2 2 2 7 13)
...                                    → 364      101101100 * (2 2 7 13)
39 ↑      67 ↓ 1.7179487                   ...
39 ↑      68 ↓ 1.7435898

The 7-th even number in the first sequence is 364, same as the 8-th even number in the second sequence. The first common number always seems to be reached this way:

• in the $2^k-1$ sequence, after an increase operation ($3n+1$).
• in the $2^{k+1}-1$ sequence, after two decrease operations ($n/2$).

I verified this for odd $k$ from 1 to 2047: it's always the case that the $(k+2)$-th even number in the $2^k-1$ sequence is the same as the $(k+3)$-th even number in the $2^{k+1}-1$ sequence.

## The shortcut Collatz sequence

Moving forward, I'm going to focus on the “shortcut” (or “simplified”) Collatz sequence. It's defined similarly, but it skips the even numbers generated by a $3n+1$ operation. Thus, if the current number is $n$, the next number will be:

• ${3n+1} \over 2$ if $n$ is odd
• $n/2$ if $n$ is even.

Let's introduce a notation here, because I'll refer to it quite often. Let $S(n)$ be a function that returns the shortcut Collatz sequence for number $n$, as an ordered set (list/array, programmers!) where element zero is $n$ itself. I'm going to write $S(n)[k]$ to refer to the $k$-th number in this sequence. Thus:

S(n) = n
S(n)[i] = S(n)[i-1] % 2 == 0 ? S(n)[i-1] / 2 : (3 * S[n][i-1] + 1) / 2


Observation #5. The first even number in $S(2^k-1)$ is reached after $k$ iterations.

Here are some examples:

LISPERATOR> (collatz-simplified-list #b11)      ; k = 2
(3 5 8 4 2 1)

LISPERATOR> (collatz-simplified-list #b111)     ; k = 3
(7 11 17 26 13 20 10 5 8 4 2 1)

LISPERATOR> (collatz-simplified-list #b1111)    ; k = 4
(15 23 35 53 80 40 20 10 5 8 4 2 1)

LISPERATOR> (collatz-simplified-list #b11111)   ; k = 5
(31 47 71 107 161 242 ...)

LISPERATOR> (collatz-simplified-list #b111111)  ; k = 6
(63 95 143 215 323 485 728 ...)

Next, I couldn't help noticing something about the ratios between these numbers:

\begin{align} 26/8 &= 3.25 \\ 80/26 &= 3.076923... \\ 242/80 &= 3.025 \\ 728/242 &= 3.008264... \end{align}

I would postulate that:

$\lim_{n \to \infty}{{S(2^{n+1}-1)[n+1]} \over {S(2^n-1)[n]}} = 3$

I verified that after a small number of steps, the ratio is indistinguishable from 3 in floating point, but it never equals exactly 3. This result might not be very interesting itself, but it helped me move one step forward, and observe that (look at numbers from the above example, and their positioning into the sequence):

\begin{align} 11 &= 3 \times 3 + 2 \\ 23 &= 3 \times 7 + 2 \\ 161 &= 3 \times 53 + 2 \\ 485 &= 3 \times 161 + 2 \\ 728 &= 3 \times 242 + 2 \end{align}

in particular,

\begin{align} 728 &= 3 \times 242 + 2 \\ &= 3 \times (3 \times 80 + 2) + 2 \\ &= 3 \times (3 \times (3 \times 26 + 2) + 2) + 2 \\ &= 3 \times (3 \times (3 \times (3 \times 8 + 2) + 2) + 2) + 2 \end{align}

Observation #6. $S(2^k-1)[k] = 3^k-1$. This can be easily deduced by unpacking the parens above, although that's not a proof (and I don't have a proof for this observation yet).

In particular, when $k$ is odd, $3^k-1$ divides by two exactly once (why? hmm, dunno). We can use this knowledge to compute the second even number:

\begin{align} {{3 \times {{3^k-1} \over 2} + 1} \over 2} &= {{3^{k+1}-1} \over 4} \end{align}

But according to our observation, $3^{k+1}-1$ is the first even number in $S(2^{k+1}-1)$, so it follows that ${3^{k+1}-1} \over 4$ belongs to both sequences! This goes to sustain Observation #4 which stated that when $k$ is odd, sequences $2^k-1$ and $2^{k+1}-1$ quickly reach the same value.

Observation #7. The Collatz sequence cannot produce a number which is a multiple of 3 (except when the starting number is itself a multiple of 3, in which case it will stay so for as long as it remains even).

Seems obvious.

Observation #8. Back to binary representation, we can extend what we noticed above to arbitrary numbers that end with $k$ digits of 1 in binary. I noticed that for a number that looks like this in binary: xxxxx011111 — simply, it ends with $k$ digits of one, and it doesn't matter what's before the zero — the following stands:

\begin{align} S(\texttt{xxxxx0111...1})[k] &= 3^k \times \texttt{xxxxx1} - 1 \end{align}

Random example:

LISPERATOR> (elt (collatz-simplified-list #b1011011011111)
5)   ;; k = 5, as we have 5 1-s at the end
44468
LISPERATOR> (1- (* #b10110111 (expt 3 5)))
44468


So.. you take a bunch of binary digits — upto the last zero — replace that zero with a one, and multiply that by 3 to the number of ones at the end, and subtract 1, and you get the first even number in the shortcut sequence! Let me state this in a more “academic” manner:

$S(2^{k+1} n + 2^k - 1)[k] = 3^k (2n + 1) - 1, \quad \forall n \in \mathbb{N}$

In particular, it stands for $n=0$ (Observation #6).

## TL;DR: it's been fun

It was definitely a nice experience to notice all this myself. I didn't study any existing material (except the Wikipedia page) prior to making these observations, but I did after. All this is known (more so, it's proven); I still didn't find Observation #8 clearly stated anywhere, but it can be easily inferred from existing knowledge (Lemma 3 from this document).

While it's a bit disconcerting that I couldn't find anything new, I'm glad to see that I wasn't completely off-the-mark by looking at the binary representation. It might remain the key to this problem. I'm not remotely as smart as other people who have studied it, so I'm going to stop here. I'm giving up. I hope. :-)