A few days ago, I pulled out of the bookshelf a book. I sometimes like to randomly go back to old readings, to dig up a citation or a particular image that I remember, or maybe just to check if I still remember at least something of what I had read. This time it was The book of legendary lands (Storia delle terre e dei luoghi leggendari), by Umberto Eco. While browsing for some other memories, I stumbled on the pages describing the history of the myth and the search for the lost realm of Atlantis. There have been tens, even hundreds of presumed localizations of the “real” site of Atlantis, covering about any site on Earth, from Morocco to Siberia, from the Spitzbergen to the Tierra del Fuego. What is even more surprising (but maybe we should not be too surprised) is that there have been people searching for it all over the history, since the earliest years just after Plato’s writing in his Timaeus around 360 BC, up to these days, practically uninterrupted. Every now and then somebody comes up with a new possible localization, but the lost island remains shrouded in its mystery, unreachable. The search for Atlantis falls into a special category, that is, searching for an object that we do not know for sure it exists (an unbounded search). But the predicate of existence does not necessarily make the search so much easier: the problem of finding even a well known object that we are sure must be somewhere can be nearly equally daunting, depending on the size, dimensionality and topology of the search space. And this becomes a mathematical physics problem.

What is the best strategy for finding a missing object? Anyone who has ever lost keys has already faced this problem. The “streetlight effect” hinted in the title refers to the observational bias that occurs when people only search for something where it is easiest to look. Such everyday life situations as looking for a displaced object are a prototypical example of a search problem, which under its simplest form involves a searcher, be it a person, an animal, or any kind of organism or particle, in general able to move across a given search domain, and one or several targets. Even if it is a schematic definition, the search problem as stated turns out to be a universal question, which arises at different scales and in various fields, and has generated an increasing amount of work in recent years, notably in the physics community. Several different problems can be born out of this simple scheme, in which a “blind walker” explores at random a nearly empty landscape. The search for a missing object is a single task, but more complex and/or detailed searches can involve finding at least one object from a set (a problem known as “first-passage time”), or determining the average time between two successful findings, or finding all the objects of the set (the exhaustive search, or “all-the-mushrooms-in-the-forest” problem). For each of these searches it may be possible to find optimal ‘walking’ strategies, which may have in some cases an analytical expression, or a statistical formulation, or in many cases have only a numerical solution. On the other hand, search strategies belong to a class of combinatorial decision problems that often follow exponential probability laws, so that the purely numerical solution very quickly will stumble against a memory and/or computation time wall. Add to this the malicious attitude of us scientists, to invent (apparently useless) complications to (apparently simple) problems, such variants like the persistent random walk (the walker has at each step a given probability of continuing in the same direction as the previous step); the Cauchy or Lévy flights, in which the walking step lengths are distributed according to a given probability function; intermittent walk, in which the walker can every now and then jump to a distant site; restarted or reset random walk, in which the walker is randomly (and dastardly) put back to the initial position; and many more.

Applications of theoretical studies of search strategies can be traced back to World War II, during which the U.S. Navy tried to efficiently hunt for submarines and developed rationalized search procedures. Similar search algorithms have since been developed and utilized in the context of castaway rescue operations, or even for the recovery of an atomic bomb lost in the Mediterranean Sea near Palomares, in 1966, and the nuclear submarine Scorpion lost near the Azores in 1968. At the macroscopic scale, other important and widely studied examples of search processes concern animals searching for a mate, food, or shelter. Even prehistoric migrations, apart from classical archaeological literature, have been studied as a search problem, in which human groups search for new profitable territories. At the microscopic scale, search processes naturally occur in the context of chemical reactions, for which the encounter of reactive molecules – or, in other words, the fact that one searcher molecule finds a reactive target site – is a required first step. An obvious example is the theory of diffusion-controlled reactions, initiated years ago by the work of Smoluchowski (1917) and developed by innumerable researchers. More recently, this field has regained interest in the context of biochemical reactions in cells, where the sometimes very small number of reactive molecules makes this first step of the search for a reaction partner crucial for the kinetics.

This was more or less the problem I was working at, a couple of weeks ago. We wanted to better understand a kinetics plot of a DNA repair enzyme: the researchers had created a short fragment of DNA sequence containing an error, and wanted to study the efficiency of a specific protein that is able to find exactly that type of sequence error, and fix it. By using radioactive-P labelled DNA strands, they could measure the amount of DNA copies that were tagged by the repair enzyme, as a function of reaction time. Our problem is then the following: we have a volume of a solution containing a given concentration of short DNA fragments, plus a given concentration of copies of the repair protein; the probability of one protein finding one DNA in a volume of given size is a simple combinatorial problem; however, if we ask such questions as “how much time does it take for a protein to hit a DNA”, and “how much time does it take for one protein to find all the DNA in a given volume”, then we must resort to a random search model. This is actually what we did: we built in the computer a huge three-dimensional lattice of ‘sites’; a few of these sites are occupied by a DNA copy; and one protein walks along the sites, jumping from one site to the next, until she finds a DNA. A number of subtleties must be taken into account in such kind of problems, such as the dimensionality and topology of the walking space. For example, are the sites connected in a regular cubic or triangular or icosahedral grid? Or are they distributed with random connections? Can the walker make only nearest-neighbor moves, or can it move also diagonally between sites? Is the walking step-length fixed, or it can be of variable length? And so on. The final objective is to obtain a kinetics (that is, a timing schedule of the DNA-protein matching) that fits the experimental data, and a posteriori deduce what is the actual microscopic mechanism of search that best fits the observed kinetics.

Such kind of questions open up a whole Pandora’s vase of physically interesting situations. For the sake of argument, let us stick to the subject of recognition and binding of specific sites on DNA by proteins, a vital issue for many cellular functions. In the search for its target site, the DNA-associated protein is facing both thermodynamic and kinetic difficulties. The thermodynamic challenge lies in recognizing and tightly binding a specific site (usually just 3 or 4 nucleotides) among the billions of other non-specific sequences on the DNA; the kinetic difficulty lies in finding a specific site in mere seconds amidst the crowded cellular environment that is filled with zillions of other DNA sequences and proteins. Considering the nanoMolar concentrations relevant for such binding reactions, it may be useful to bear in mind that 1 nM corresponds to merely 1 molecule per cubic micrometer, that is, not more than 5-6 molecules in a typical cell nucleus. Simple irreversible association kinetics describe the rate of matching [D]NA and [P]rotein as proportional to the product [D][P] via a kinetic coefficient that looks like the product of three different factors (and the constant 4π), ka=4πD3Dzb. D3D is the 3D-diffusion constant of the protein in solution, estimated via the Einstein-Smoluchowski equation as something of the order of ~10-6 cm2/s for a typical protein; z is the fraction of the protein surface interested in the contact, could be about z~0.1; and b =0.34 nm is the spacing between DNA bases, needed to discriminate between different sequences. The result is a typical ka value of about 108 M-1s-1. Unfortunately, such an estimate is 100 to 1000 times slower than the observed DNA-protein association rates, showing that a pure symmetric random walk in 3D is as far from the solution, as Oedipus was from finding the killer of Laius. The currently accepted view to get around this problem, is that proteins could use a ‘facilitated diffusion’ mechanism, resulting from a combination of 3D and 1D (much faster) diffusion: after a protein hits a non-specific DNA site by random 3D walk, she sticks there for a little while, and starts hopping along the DNA length with a 1D random walk, thereby accelerating the search times by orders of magnitude. Such events are actually observed in experiments; however, also alternative models to the “sliding ” exist, such as the long-distance hopping (akin to the intermittent walk), or the inter-loop transfer, which requires a higher-order topology of connections along looping DNA lengths (if you are interested, take a loop… ooops!… a look at the nice review Dancing on DNA).

When you run the computer simulation I described above and ask for the time to find all targets in the search space, you are trying to estimate the so-called cover time, a long-standing problem in the random walk theory. The numerical solution, as we directly verified, becomes rapidly impossible, since the cover time of N sites grows with N3. On the other hand, analytical results are scarce and are found only for symmetric cubic walks in Euclidean geometry. Nothing exists for more complex (and more interesting) situations like the intermittent walk, Lévy flights, or the persistent random walk. The group of Olivier Benichou, from the Paris-Jussieu university and CNRS, has been working for a long time on such problems, and published a few years ago a very interesting contribution on the subject. They were able to prove an analytical formula for the partial cover times of a single walker, that is the time to find M sites out of N, in the limit of large N and M (for M→N this is the total cover time); they also showed that the distribution of partial cover times, its average and second moment, depend only on the average value of the first-passage time, and on the size of the search domain, both global properties of the problem. Such results were recently challenged (this is what actually put me on the track for today’s letter) by a very recent paper by a young mathematician at Utah university, Sean Lawley (for the moment only in preprint on arXiv). They found an explicit formula, supported by rigorous computer simulations, for all the moments of the cover-time distribution, for an arbitrary number of walkers on an arbitrary discrete network. The results show that cover times depend only on properties of the network along the shortest paths to the most distant parts of the target. This is extremely interesting, since it proves a strictly local dependence, and contrasts with the idea that cover times for a single walker depend on global properties of the network.

The search is (still) on.

Walkers in the dark, or how to avoid the streetlight effect

Post navigation


Leave a Reply

Your email address will not be published. Required fields are marked *