Self application is not a necessary ingredient of the proof
In a nutshell
If there is a Turing machine H that solves
the halting problem, then from that machine we can build another Turing
machine L with a halting behavior (halting characteristic function)
that cannot be the halting behavior of any Turing machine.
The paradox built on the self-applied function D (called L in this answer - sorry about notation inconsistencies) is not a necessary
ingredient of the proof, but a device that works for one specific
way of reaching the contradiction, hiding what seems to be the "real purpose" of the construction. That is probably why it is not
intuitive.
It seems more direct to show that there is only a denumerable number
of halting behaviors (no more than Turing machines), that can be
defined as characteristic halting functions associated with each Turing
machine. One can define constructively a characteristic halting
function not in the list, and build from it, and from a machine H
that solves the halting problem, a machine L that has that new
characteristic halting function. But since, by construction, it is not
the characteristic halting function of a Turing machine, L cannot be
one. Since L is built from H using Turing machine building
techniques, H cannot be a Turing machine.
Applying L to itself, as many proofs do, is one way
to show the contradiction. However, it works only when the impossible
characteristic halting function is built from the diagonal of the list
of Turing permitted characteristic halting functions, by flipping this
diagonal (exchanging 0 and 1). There are infinitely many other ways
of building a new characteristic halting function, and for those
non-Turing-ness can no longer be evidenced with a liar paradox (at
least not simply). The self-application construction is not intuitive
because it is not essential, but it looks slick when pulled out of the
magic hat.
Basically, L is not a Turing machine because it is designed from
the start to have a halting behavior that is not that of a Turing
machine, and that can be shown more directly, hence more intuitively.
Note: It may be that, for any constructive choice of the
impossible characteristic halting function, there is a computable
reordering of the Turing machine enumeration such that it becomes
the diagonal (I do not know). But, imho, this does not change the
fact that self-application is an indirect proof technique that
hides a more intuitive and interesting fact.
Detailed analysis of the proofs
I am not going to be historical (thanks to those who are, I enjoy
it); I am only trying to work the intuitive side.
I think that the presentation given by @vzn, which I did encounter a long
time ago (I had forgotten), is actually rather intuitive, and even
explains the name diagonalization. I am repeating it in detail only because I
feel @vzn did not emphasize its simplicity enough.
My purpose is to have an intuitive way to retrieve the proof, knowing
that of Cantor. The problem with many versions of the proof is that the
constructions seem to be pulled from a magic hat.
The proof that I give is not exactly the same as in the question,
but it is correct, as far as I can see. If I did not make a mistake, it
is intuitive enough since I could retrieve it after more years than I
care to count, working on very different issues.
The case of the subsets of N (Cantor)
Cantor's proof assumes (it is only a hypothesis) that there is an enumeration of the
subsets of the integers, so that every such subset Sj can be
described by its characteristic function Cj(i) which is 1 if
i∈Sj and is 0 otherwise.
This may be seen as a table T, such that T[i,j]=Cj(i)
Then, considering the diagonal, we build a characteristic function D
such that $D(i)=\overline{T[i,i]}$, i.e. it is identical to the
diagonal of the table with every bit flipped to the other value.
There is nothing special about the diagonal, except that it is an easy
way to get a characteristic function D that is different from all
others, and that is all we need.
Hence, the subset characterized by D cannot be in the enumeration.
Since that would be true of any enumeration, there cannot be an
enumeration that enumerates all the subsets of N.
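The diagonal step can be checked on a finite excerpt of the table. The following is only a small sketch: the 4x4 `table` is made-up data, and the helper names `flipped_diagonal` and `column` are mine, not part of the original argument.

```python
def flipped_diagonal(table):
    """Return D with D[i] = 1 - table[i][i] for a square 0/1 table."""
    return [1 - table[i][i] for i in range(len(table))]


def column(table, j):
    """Column j is the characteristic function C_j restricted to rows 0..n-1."""
    return [row[j] for row in table]


# Hypothetical excerpt of the table: table[i][j] = C_j(i).
table = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]

D = flipped_diagonal(table)

# D disagrees with column j at row j, so it equals none of the columns.
differs_from_all = all(D[j] != column(table, j)[j] for j in range(len(table)))
print(D, differs_from_all)
```

The same check works for any square 0/1 table, which is the finite shadow of "that would be true of any enumeration".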
This is admittedly, according to the initial question, fairly
intuitive. Can we make the proof of the halting problem as intuitive?
The case of the halting problem (Turing)
We assume we have an enumeration of Turing machines (which we know is
possible). The halting behavior of a Turing machine Mj can be
described by its characteristic halting function Hj(i) which is 1 if
Mj halts on input i and is 0 otherwise.
This may be seen as a table T, such that T[i,j]=Hj(i)
Then, considering the diagonal, we build a characteristic halting function D
such that $D(i)=\overline{T[i,i]}$, i.e. it is identical to the
diagonal of the table with every bit flipped to the other value.
There is nothing special about the diagonal, except that it is an easy
way to get a characteristic halting function D that is different from all
others, and that is all we need (see note at the bottom).
Hence, the halting behavior characterized by D cannot be that of a
Turing machine in the enumeration. Since we enumerated them all, we
conclude that there is no Turing machine with that behavior.
No halting oracle so far, and no computability hypothesis: We know
nothing of the computability of T and of the functions Hj.
Now suppose we have a Turing machine H that can solve the halting
problem, such that H(i,j) always halts with Hj(i) as result.
We want to prove that, given H, we can build a machine L that has the characteristic halting function D. The machine
L is nearly identical to H, so that L(i) mimics
H(i,i), except that whenever H(i,i) is about to
terminate with value 1, L(i) goes into an infinite loop and does not
terminate.
It is quite clear that we can build such a machine L if H
exists. Hence this machine should be in our initial enumeration of all
machines (which we know is possible). But it cannot be since its
halting behavior D corresponds to none of the machines enumerated.
Machine L cannot exist, which implies that H cannot
exist.
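The construction of L from H can be sketched in code, with Python functions standing in for Turing machines. Everything here is an assumption made for illustration: `make_L` is a hypothetical helper, and `always_no` is a deliberately wrong stand-in for H, since a correct H is exactly what the argument rules out.

```python
from typing import Callable


def make_L(H: Callable[[int, int], int]) -> Callable[[int], str]:
    """Build L from a claimed halting decider H, where H(i, j) is meant
    to return 1 if machine M_j halts on input i and 0 otherwise."""
    def L(i: int) -> str:
        if H(i, i) == 1:
            # H claims M_i halts on i, so L refuses to halt.
            while True:
                pass
        # H claims M_i does not halt on i, so L halts.
        return "halted"
    return L


def always_no(i: int, j: int) -> int:
    """Hypothetical stand-in for H: claims that nothing ever halts."""
    return 0


L = make_L(always_no)
print(L(0))  # L halts here because the stand-in answered 0
```

With this stand-in, L halts on every input, so whatever index jL the machine L might have in the enumeration, `always_no(jL, jL)` answers 0 although L(jL) halts. The stand-in is wrong about L, and the argument above says the same must happen with any candidate for H.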
I deliberately mimicked the first proof and went into tiny
details.
My feeling is that the steps come naturally in this way, especially
when one considers Cantor's proof as reasonably intuitive.
One first enumerates the constructs in question. Then one takes and modifies the
diagonal as a convenient way of touching all of them to get an
unaccounted-for behaviour, then gets a contradiction by exhibiting an
object that has the unaccounted-for behaviour ... if some hypothesis
were to be true: existence of the enumeration for Cantor, and
existence of a computable halting oracle for Turing.
Note: To define the function D, we could replace the flipped
diagonal by any other characteristic halting function, different from
all the ones listed in T, that is computable (from the ones listed
in T, for example) provided a halting oracle is available. Then the machine L
would have to be constructed accordingly, to have D as characteristic
halting function, and L(i) would make use of the machine H, but
not mimic so directly H(i,i). The choice of the diagonal makes it
much simpler.
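As one possible illustration of a non-diagonal choice (my example, not taken from the original proof), the diagonal can be replaced by any bijection pi on the indices: setting D(i) = 1 - T[i, pi(i)] makes D differ from column pi(i) at row i, and since pi hits every column, D differs from all of them. The sketch below uses the pair swap pi(i) = i XOR 1 on a made-up 4x4 table.

```python
def pi(i):
    """Swap neighbouring indices: 0<->1, 2<->3, ... (a bijection on N)."""
    return i ^ 1


def flipped_along(table, perm):
    """Return D with D[i] = 1 - table[i][perm(i)]."""
    return [1 - table[i][perm(i)] for i in range(len(table))]


# Hypothetical excerpt of the table: table[i][j] = H_j(i).
table = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]

D = flipped_along(table, pi)

# Column pi(i) disagrees with D at row i; pi covers every column.
differs_from_all = all(D[i] != table[i][pi(i)] for i in range(len(table)))
covered = sorted(pi(i) for i in range(len(table)))
print(D, differs_from_all, covered)
```

Under this choice, the corresponding machine L(i) would mimic H(i, pi(i)) rather than H(i,i), looping exactly when H answers 1.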
Comparison with the "other" proof
The function L defined here is apparently the analog of the function
D in the proof described in the question.
We only build it in such a way that it has a characteristic halting
function that corresponds to no Turing machine, and get directly a
contradiction from that. This gives us the freedom of not using the
diagonal (for what it is worth).
The "usual" proof seems to be trying to kill what I see
as an already dead fish. It says: let's assume that L is one of the machines
that were listed (i.e., all of them). Then it has an index jL in
that enumeration: L=MjL. Then if L(jL) halts, we have
T[jL,jL]=H(jL,jL)=1, so that L(jL) will loop by
construction. Conversely, if L(jL) does not halt, then
T[jL,jL]=H(jL,jL)=0 so that L(jL) will halt by construction.
Thus we have a contradiction. But the contradiction results from the
way the characteristic halting function of L was constructed,
and it seems a lot simpler just to say that L cannot be a Turing
machine because it is constructed to have a characteristic halting
function that is not that of a Turing machine.
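Written as a single chain, using only the symbols already introduced (L loops exactly when H answers 1, H is assumed correct, and L is assumed to be the machine with index jL), the usual contradiction reads:

```latex
\[
L(j_L)\ \text{halts}
\;\iff\; H(j_L, j_L) = 0
\;\iff\; M_{j_L}(j_L)\ \text{does not halt}
\;\iff\; L(j_L)\ \text{does not halt}.
\]
```

The first equivalence is the construction of L, the second is the assumed correctness of H, and the third is the assumption L = M_{j_L}.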
A side-point is that this usual proof would be a lot more painful if
we did not choose the diagonal, while the direct approach used above
has no problem with it. Whether that can be useful, I do not know.