Chapter 2
Matter Turns Intelligent
Hydrogen…, given enough time, turns into people.
Edward Robert Harrison, 1995
One of the most spectacular developments during the 13.8 billion years since our Big Bang is that dumb and lifeless matter has turned intelligent. How could this happen and how much smarter can things get in the future? What does science have to say about the history and fate of intelligence in our cosmos? To help us tackle these questions, let’s devote this chapter to exploring the foundations and fundamental building blocks of intelligence. What does it mean to say that a blob of matter is intelligent? What does it mean to say that an object can remember, compute and learn?
What Is Intelligence?
My wife and I recently had the good fortune to attend a symposium on artificial intelligence organized by the Swedish Nobel Foundation, and when a panel of leading AI researchers were asked to define intelligence, they argued at length without reaching consensus. We found this quite funny: there’s no agreement on what intelligence is even among intelligent intelligence researchers! So there’s clearly no undisputed “correct” definition of intelligence. Instead, there are many competing ones, including capacity for logic, understanding, planning, emotional knowledge, self-awareness, creativity, problem solving and learning.
In our exploration of the future of intelligence, we want to take a maximally broad and inclusive view, not limited to the sorts of intelligence that exist so far. That’s why the definition I gave in the last chapter, and the way I’m going to use the word throughout this book, is very broad: intelligence = ability to accomplish complex goals
This is broad enough to include all above-mentioned definitions, since understanding, self-awareness, problem solving, learning, etc. are all examples of complex goals that one might have. It’s also broad enough to subsume the Oxford Dictionary definition—“the ability to acquire and apply knowledge and skills”—since one can have as a goal to apply knowledge and skills.
Because there are many possible goals, there are many possible types of intelligence. By our definition, it therefore makes no sense to quantify intelligence of humans, non-human animals or machines by a single number such as an IQ.*1 What’s more intelligent: a computer program that can only play chess or one that can only play Go? There’s no sensible answer to this, since they’re good at different things that can’t be directly compared. We can, however, say that a third program is more intelligent than both of the others if it’s at least as good as them at accomplishing all goals, and strictly better at at least one (winning at chess, say).
It also makes little sense to quibble about whether something is or isn’t intelligent in borderline cases, since ability comes on a spectrum and isn’t necessarily an all-or-nothing trait. Which people have the ability to accomplish the goal of speaking? Newborns? No. Radio hosts? Yes. But what about toddlers who can speak ten words? Or five hundred words? Where would you draw the line? I’ve used the deliberately vague word “complex” in the definition above, because it’s not very interesting to try to draw an artificial line between intelligence and non-intelligence, and it’s more useful to simply quantify the degree of ability for accomplishing different goals.
To classify different intelligences into a taxonomy, another crucial distinction is that between narrow and broad intelligence. IBM’s Deep Blue chess computer, which dethroned chess champion Garry Kasparov in 1997, was only able to accomplish the very narrow task of playing chess—despite its impressive hardware and software, it couldn’t even beat a four-year-old at tic-tac-toe. The DQN AI system of Google DeepMind can accomplish a slightly broader range of goals: it can play dozens of different vintage Atari computer games at human level or better. In contrast, human intelligence is thus far uniquely broad, able to master a dazzling panoply of skills. A healthy child given enough training time can get fairly good not only at any game, but also at any language, sport or vocation. Comparing the intelligence of humans and machines today, we humans win hands-down on breadth, while machines outperform us in a small but growing number of narrow domains, as illustrated in figure 2.1.

The holy grail of AI research is to build “general AI” (better known as artificial general intelligence, AGI) that is maximally broad: able to accomplish virtually any goal, including learning. We’ll explore this in detail in chapter 4. The term “AGI” was popularized by the AI researchers Shane Legg, Mark Gubrud and Ben Goertzel to more specifically mean human-level artificial general intelligence: the ability to accomplish any goal at least as well as humans.1 I’ll stick with their definition, so unless I explicitly qualify the acronym (by writing “superhuman AGI,” for example), I’ll use “AGI” as shorthand for “human-level AGI.”*2

Although the word “intelligence” tends to have positive connotations, it’s important to note that we’re using it in a completely value-neutral way: as ability to accomplish complex goals regardless of whether these goals are considered good or bad. Thus an intelligent person may be very good at helping people or very good at hurting people. We’ll explore the issue of goals in chapter 7.

Regarding goals, we also need to clear up the subtlety of whose goals we’re referring to. Suppose your future brand-new robotic personal assistant has no goals whatsoever of its own, but will do whatever you ask it to do, and you ask it to cook the perfect Italian dinner. If it goes online and researches Italian dinner recipes, how to get to the closest supermarket, how to strain pasta and so on, and then successfully buys the ingredients and prepares a succulent meal, you’ll presumably consider it intelligent even though the original goal was yours. In fact, it adopted your goal once you’d made your request, and then broke it into a hierarchy of subgoals of its own, from paying the cashier to grating the Parmesan. In this sense, intelligent behavior is inexorably linked to goal attainment.
Figure 2.2: Illustration of Hans Moravec’s “landscape of human competence,” where elevation represents difficulty for computers, and the rising sea level represents what computers are able to do.
It’s natural for us to rate the difficulty of tasks relative to how hard it is for us humans to perform them, as in figure 2.1. But this can give a misleading picture of how hard they are for computers. It feels much harder to multiply 314,159 by 271,828 than to recognize a friend in a photo, yet computers creamed us at arithmetic long before I was born, while human-level image recognition has only recently become possible. This fact that low-level sensorimotor tasks seem easy despite requiring enormous computational resources is known as Moravec’s paradox, and is explained by the fact that our brain makes such tasks feel easy by dedicating massive amounts of customized hardware to them—more than a quarter of our brains, in fact.
Moravec himself put it memorably:

Computers are universal machines, their potential extends uniformly over a boundless expanse of tasks. Human potentials, on the other hand, are strong in areas long important for survival, but weak in things far removed. Imagine a “landscape of human competence,” having lowlands with labels like “arithmetic” and “rote memorization,” foothills like “theorem proving” and “chess playing,” and high mountain peaks labeled “locomotion,” “hand-eye coordination” and “social interaction.” Advancing computer performance is like water slowly flooding the landscape. A half century ago it began to drown the lowlands, driving out human calculators and record clerks, but leaving most of us dry. Now the flood has reached the foothills, and our outposts there are contemplating retreat. We feel safe on our peaks, but, at the present rate, those too will be submerged within another half century. I propose that we build Arks as that day nears, and adopt a seafaring life!2

During the decades since he wrote those passages, the sea level has kept rising relentlessly, as he predicted, like global warming on steroids, and some of his foothills (including chess) have long since been submerged. What comes next and what we should do about it is the topic of the rest of this book.
As the sea level keeps rising, it may one day reach a tipping point, triggering dramatic change. This critical sea level is the one corresponding to machines becoming able to perform AI design. Before this tipping point is reached, the sea-level rise is caused by humans improving machines; afterward, the rise can be driven by machines improving machines, potentially much faster than humans could have done, rapidly submerging all land. This is the fascinating and controversial idea of the singularity, which we’ll have fun exploring in chapter 4.
Computer pioneer Alan Turing famously proved that if a computer can perform a certain bare minimum set of operations, then, given enough time and memory, it can be programmed to do anything that any other computer can do. Machines exceeding this critical threshold are called universal computers (aka Turing-universal computers); all of today’s smartphones and laptops are universal in this sense. Analogously, I like to think of the critical intelligence threshold required for AI design as the threshold for universal intelligence: given enough time and resources, it can make itself able to accomplish any goal as well as any other intelligent entity. For example, if it decides that it wants better social skills, forecasting skills or AI-design skills, it can acquire them. If it decides to figure out how to build a robot factory, then it can do so. In other words, universal intelligence has the potential to develop into Life 3.0.
The conventional wisdom among artificial intelligence researchers is that intelligence is ultimately all about information and computation, not about flesh, blood or carbon atoms. This means that there’s no fundamental reason why machines can’t one day be at least as intelligent as us.
But what are information and computation really, given that physics has taught us that, at a fundamental level, everything is simply matter and energy moving around? How can something as abstract, intangible and ethereal as information and computation be embodied by tangible physical stuff? In particular, how can a bunch of dumb particles moving around according to the laws of physics exhibit behavior that we’d call intelligent?
If you feel that the answer to this question is obvious and consider it plausible that machines might get as intelligent as humans this century—for example because you’re an AI researcher—please skip the rest of this chapter and jump straight to chapter 3. Otherwise, you’ll be pleased to know that I’ve written the next three sections specially for you.
What Is Memory?
If we say that an atlas contains information about the world, we mean that there’s a relation between the state of the book (in particular, the positions of certain molecules that give the letters and images their colors) and the state of the world (for example, the locations of continents). If the continents were in different places, then those molecules would be in different places as well. We humans use a panoply of different devices for storing information, from books and brains to hard drives, and they all share this property: that their state can be related to (and therefore inform us about) the state of other things that we care about.
What fundamental physical property do they all have in common that makes them useful as memory devices, i.e., devices for storing information? The answer is that they all can be in many different long-lived states—long-lived enough to encode the information until it’s needed. As a simple example, suppose you place a ball on a hilly surface that has sixteen different valleys, as in figure 2.3. Once the ball has rolled down and come to rest, it will be in one of sixteen places, so you can use its position as a way of remembering any number between 1 and 16.
This memory device is rather robust, because even if it gets a bit jiggled and disturbed by outside forces, the ball is likely to stay in the same valley that you put it in, so you can still tell which number is being stored. The reason that this memory is so stable is that lifting the ball out of its valley requires more energy than random disturbances are likely to provide. This same idea can provide stable memories much more generally than for a movable ball: the energy of a complicated physical system can depend on all sorts of mechanical, chemical, electrical and magnetic properties, and as long as it takes energy to change the system away from the state you want it to remember, this state will be stable. This is why solids have many long-lived states, whereas liquids and gases don’t: if you engrave someone’s name on a gold ring, the information will still be there years later because reshaping the gold requires significant energy, but if you engrave it in the surface of a pond, it will be lost within a second as the water surface effortlessly changes its shape.
The simplest possible memory device has only two stable states (figure 2.3). We can therefore think of it as encoding a binary digit (abbreviated “bit”), i.e., a zero or a one. The information stored by any more complicated memory device can equivalently be stored in multiple bits: for example, taken together, the four bits shown in figure 2.3 can be in 2 × 2 × 2 × 2 = 16 different states 0000, 0001, 0010, 0011,…, 1111, so they collectively have exactly the same memory capacity as the more complicated 16-state system. We can therefore think of bits as atoms of information—the smallest indivisible chunk of information that can’t be further subdivided, which can combine to make up any information. For example, I just typed the word “word,” and my laptop represented it in its memory as the 4-number sequence 119 111 114 100, storing each of those numbers as 8 bits (it represents each lowercase letter by a number that’s 96 plus its order in the alphabet). As soon as I hit the w key on my keyboard, my laptop displayed a visual image of a w on my screen, and this image is also represented by bits: 32 bits specify the color of each of the screen’s millions of pixels.
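To make that bookkeeping concrete, here is a minimal sketch in Python (my own illustration, not from the book) that reproduces the numbers quoted above:

```python
# How the word "word" becomes numbers and bits, assuming the standard
# ASCII/UTF-8 encoding described in the text.
text = "word"
codes = [ord(c) for c in text]              # -> [119, 111, 114, 100]
bits = [format(n, "08b") for n in codes]    # each number stored as 8 bits
print(codes)                                # [119, 111, 114, 100]
print(bits)                                 # ['01110111', '01101111', '01110010', '01100100']
# The "96 plus its order in the alphabet" rule for lowercase letters:
print([ord(c) - 96 for c in text])          # [23, 15, 18, 4] = w, o, r, d
```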
Since two-state systems are easy to manufacture and work with, most modern computers store their information as bits, but these bits are embodied in a wide variety of ways. On a DVD, each bit corresponds to whether there is or isn’t a microscopic pit at a given point on the plastic surface. On a hard drive, each bit corresponds to a point on the surface being magnetized in one of two ways. In my laptop’s working memory, each bit corresponds to the positions of certain electrons, determining whether a device called a micro-capacitor is charged. Some kinds of bits are convenient to transport as well, even at the speed of light: for example, in an optical fiber transmitting your email, each bit corresponds to a laser beam being strong or weak at a given time.
Engineers prefer to encode bits into systems that aren’t only stable and easy to read from (as a gold ring), but also easy to write to: altering the state of your hard drive requires much less energy than engraving gold. They also prefer systems that are convenient to work with and cheap to mass-produce. But other than that, they simply don’t care about how the bits are represented as physical objects—and nor do you most of the time, because it simply doesn’t matter! If you email your friend a document to print, the information may get copied in rapid succession from magnetizations on your hard drive to electric charges in your computer’s working memory, radio waves in your wireless network, voltages in your router, laser pulses in an optical fiber and, finally, molecules on a piece of paper. In other words, information can take on a life of its own, independent of its physical substrate! Indeed, it’s usually only this substrate-independent aspect of information that we’re interested in: if your friend calls you up to discuss that document you sent, she’s probably not calling to talk about voltages or molecules. This is our first hint of how something as intangible as intelligence can be embodied in tangible physical stuff, and we’ll soon see how this idea of substrate independence is much deeper, including not only information but also computation and learning.
Because of this substrate independence, clever engineers have been able to repeatedly replace the memory devices inside our computers with dramatically better ones, based on new technologies, without requiring any changes whatsoever to our software. The result has been spectacular, as illustrated in figure 2.4: over the past six decades, computer memory has gotten half as expensive roughly every couple of years. Hard drives have gotten over 100 million times cheaper, and the faster memories useful for computation rather than mere storage have become a whopping 10 trillion times cheaper. If you could get such a “99.99999999999% off” discount on all your shopping, you could buy all real estate in New York City for about 10 cents and all the gold that’s ever been mined for around a dollar.
For many of us, the spectacular improvements in memory technology come with personal stories. I fondly remember working in a candy store back in high school to pay for a computer sporting 16 kilobytes of memory, and when I made and sold a word processor for it with my high school classmate Magnus Bodin, we were forced to write it all in ultra-compact machine code to leave enough memory for the words that it was supposed to process. After getting used to floppy drives storing 70kB, I became awestruck by the smaller 3.5-inch floppies that could store a whopping 1.44MB and hold a whole book, and then my first-ever hard drive storing 10MB—which might just barely fit a single one of today’s song downloads. These memories from my adolescence felt almost unreal the other day, when I spent about $100 on a hard drive with 300,000 times more capacity.
What about memory devices that evolved rather than being designed by humans? Biologists don’t yet know what the first-ever life form was that copied its blueprints between generations, but it may have been quite small. A team led by Philipp Holliger at Cambridge University made an RNA molecule in 2016 that encoded 412 bits of genetic information and was able to copy RNA strands longer than itself, bolstering the “RNA world” hypothesis that early Earth life involved short self-replicating RNA snippets. So far, the smallest memory device known to be evolved and used in the wild is the genome of the bacterium Candidatus Carsonella ruddii, storing about 40 kilobytes, whereas our human DNA stores about 1.6 gigabytes, comparable to a downloaded movie. As mentioned in the last chapter, our brains store much more information than our genes: in the ballpark of 10 gigabytes electrically (specifying which of your 100 billion neurons are firing at any one time) and 100 terabytes chemically/biologically (specifying how strongly different neurons are linked by synapses). Comparing these numbers with the machine memories shows that the world’s best computers can now out-remember any biological system—at a cost that’s rapidly dropping and was a few thousand dollars in 2016.
The memory in your brain works very differently from computer memory, not only in terms of how it’s built, but also in terms of how it’s used. Whereas you retrieve memories from a computer or hard drive by specifying where it’s stored, you retrieve memories from your brain by specifying something about what is stored. Each group of bits in your computer’s memory has a numerical address, and to retrieve a piece of information, the computer specifies at what address to look, just as if I tell you “Go to my bookshelf, take the fifth book from the right on the top shelf, and tell me what it says on page 314.” In contrast, you retrieve information from your brain similarly to how you retrieve it from a search engine: you specify a piece of the information or something related to it, and it pops up. If I tell you “to be or not,” or if I google it, chances are that it will trigger “To be, or not to be, that is the question.” Indeed, it will probably work even if I use another part of the quote or mess things up somewhat. Such memory systems are called auto-associative, since they recall by association rather than by address.
In a famous 1982 paper, the physicist John Hopfield showed how a network of interconnected neurons could function as an auto-associative memory. I find the basic idea very beautiful, and it works for any physical system with multiple stable states. For example, consider a ball on a surface with two valleys, like the one-bit system in figure 2.3, and let’s shape the surface so that the x-coordinates of the two minima where the ball can come to rest are x = √2 ≈ 1.41421 and x = π ≈ 3.14159, respectively. If you remember only that π is close to 3, you simply put the ball at x = 3 and watch it reveal a more exact π-value as it rolls down to the nearest minimum. Hopfield realized that a complex network of neurons provides an analogous landscape with very many energy-minima that the system can settle into, and it was later proved that you can squeeze in as many as 138 different memories for every thousand neurons without causing major confusion.
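Here is a minimal sketch in Python of that rolling-ball picture (my own illustration; the simplified quadratic "valleys" stand in for the actual surface of figure 2.3): start the ball near a stored value and it settles onto the exact value.

```python
import math

MINIMA = [math.sqrt(2), math.pi]            # the two stored "memories"

def gradient(x):
    # A simplified landscape whose slope always points toward the nearest
    # minimum -- the derivative of 0.5 * (x - nearest)^2.
    nearest = min(MINIMA, key=lambda m: abs(x - m))
    return x - nearest

def recall(x, step=0.1, n_steps=200):
    for _ in range(n_steps):
        x -= step * gradient(x)             # let the ball roll downhill
    return x

print(recall(3.0))   # the rough guess 3 settles near 3.14159...
print(recall(1.5))   # settles near 1.41421...
```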
What Is Computation?
We’ve now seen how a physical object can remember information. But how can it compute?
A computation is a transformation of one memory state into another. In other words, a computation takes information and transforms it, implementing what mathematicians call a function. I think of a function as a meat grinder for information, as illustrated in figure 2.5: you put information in at the top, turn the crank and get processed information out at the bottom—and you can repeat this as many times as you want with different inputs. This information processing is deterministic in the sense that if you repeat it with the same input, you get the same output every time.
Although it sounds deceptively simple, this idea of a function is incredibly general. Some functions are rather trivial, such as the one called NOT that inputs a single bit and outputs the reverse, thus turning zero into one and vice versa. The functions we learn about in school typically correspond to buttons on a pocket calculator, inputting one or more numbers and outputting a single number—for example, the function x² simply inputs a number and outputs it multiplied by itself. Other functions can be extremely complicated. For instance, if you’re in possession of a function that would input bits representing an arbitrary chess position and output bits representing the best possible next move, you can use it to win the World Computer Chess Championship. If you’re in possession of a function that inputs all the world’s financial data and outputs the best stocks to buy, you’ll soon be extremely rich. Many AI researchers dedicate their careers to figuring out how to implement certain functions. For example, the goal of machine-translation research is to implement a function inputting bits representing text in one language and outputting bits representing that same text in another language, and the goal of automatic-captioning research is inputting bits representing an image and outputting bits representing text describing it.
In other words, if you can implement highly complex functions, then you can build an intelligent machine that’s able to accomplish highly complex goals. This brings our question of how matter can be intelligent into sharper focus: in particular, how can a clump of seemingly dumb matter compute a complicated function?
Rather than just remain immobile as a gold ring or other static memory device, it must exhibit complex dynamics so that its future state depends in some complicated (and hopefully controllable/programmable) way on the present state. Its atom arrangement must be less ordered than a rigid solid where nothing interesting changes, but more ordered than a liquid or gas. Specifically, we want the system to have the property that if we put it in a state that encodes the input information, let it evolve according to the laws of physics for some amount of time, and then interpret the resulting final state as the output information, then the output is the desired function of the input. If this is the case, then we can say that our system computes our function.
As a first example of this idea, let’s explore how we can build a very simple (but also very important) function called a NAND gate*3 out of plain old dumb matter. This function inputs two bits and outputs one bit: it outputs 0 if both inputs are 1; in all other cases, it outputs 1. If we connect two switches in series with a battery and an electromagnet, then the electromagnet will only be on if the first switch and the second switch are closed (“on”). Let’s place a third switch under the electromagnet, as illustrated in figure 2.6, such that the magnet will pull it open whenever it’s powered on. If we interpret the first two switches as the input bits and the third one as the output bit (with 0 = switch open, and 1 = switch closed), then we have ourselves a NAND gate: the third switch is open only if the first two are closed. There are many other ways of building NAND gates that are more practical—for example, using transistors as illustrated in figure 2.6. In today’s computers, NAND gates are typically built from microscopic transistors and other components that can be automatically etched onto silicon wafers.
There’s a remarkable theorem in computer science that says that NAND gates are universal, meaning that you can implement any well-defined function simply by connecting together NAND gates.*4 So if you can build enough NAND gates, you can build a device computing anything! In case you’d like a taste of how this works, I’ve illustrated in figure 2.7 how to multiply numbers using nothing but NAND gates.
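In that spirit, here is a minimal sketch in Python (my own illustration, not the book’s figure 2.7) that builds NOT, AND, OR, XOR and a one-bit adder out of nothing but NAND gates:

```python
def NAND(a, b):
    return 0 if (a and b) else 1            # outputs 0 only if both inputs are 1

def NOT(a):    return NAND(a, a)
def AND(a, b): return NOT(NAND(a, b))
def OR(a, b):  return NAND(NOT(a), NOT(b))
def XOR(a, b): return AND(OR(a, b), NAND(a, b))

def half_adder(a, b):
    """Add two bits; returns (sum_bit, carry_bit), built only from NAND."""
    return XOR(a, b), AND(a, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "-> NAND:", NAND(a, b), " sum,carry:", half_adder(a, b))
```

Chaining such one-bit adders gives multi-bit addition, and repeated addition gives multiplication, in the same spirit as the construction mentioned above.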
MIT researchers Norman Margolus and Tommaso Toffoli coined the name computronium for any substance that can perform arbitrary computations. We’ve just seen that making computronium doesn’t have to be particularly hard: the substance just needs to be able to implement NAND gates connected together in any desired way. Indeed, there are myriad other kinds of computronium as well. A simple variant that also works involves replacing the NAND gates by NOR gates that output 1 only when both inputs are 0. In the next section, we’ll explore neural networks, which can also implement arbitrary computations, i.e., act as computronium. Scientist and entrepreneur Stephen Wolfram has shown that the same goes for simple devices called cellular automata, which repeatedly update bits based on what neighboring bits are doing. Already back in 1936, computer pioneer Alan Turing proved in a landmark paper that a simple machine (now known as a “universal Turing machine”) that could manipulate symbols on a strip of tape could also implement arbitrary computations. In summary, not only is it possible for matter to implement any well-defined computation, but it’s possible in a plethora of different ways.
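To give one concrete flavor of such a substrate, here is a minimal sketch in Python (my own choice of example; the text doesn’t single out a particular rule) of an elementary cellular automaton—Rule 110, which has been proven Turing-universal—where each bit is repeatedly updated from itself and its two neighbors:

```python
RULE = 110                                  # the 8-bit update table, encoded as a number

def step(cells):
    n = len(cells)
    return [(RULE >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
            for i in range(n)]

row = [0] * 31 + [1] + [0] * 31             # start with a single "on" bit
for _ in range(20):
    print("".join("#" if c else "." for c in row))
    row = step(row)                         # every bit looks only at its local neighborhood
```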
As mentioned earlier, Turing also proved something even more profound in that 1936 paper of his: that if a type of computer can perform a certain bare minimum set of operations, then it’s universal in the sense that given enough resources, it can do anything that any other computer can do. He showed that his Turing machine was universal, and connecting back more closely to physics, we’ve just seen that this family of universal computers also includes objects as diverse as a network of NAND gates and a network of interconnected neurons. Indeed, Stephen Wolfram has argued that most non-trivial physical systems, from weather systems to brains, would be universal computers if they could be made arbitrarily large and long-lasting.
This fact that exactly the same computation can be performed on any universal computer means that computation is substrate-independent in the same way that information is: it can take on a life of its own, independent of its physical substrate! So if you’re a conscious superintelligent character in a future computer game, you’d have no way of knowing whether you ran on a Windows desktop, a Mac OS laptop or an Android phone, because you would be substrate-independent. You’d also have no way of knowing what type of transistors the microprocessor was using.
I first came to appreciate this crucial idea of substrate independence because there are many beautiful examples of it in physics. Waves, for instance: they have properties such as speed, wavelength and frequency, and we physicists can study the equations they obey without even needing to know what particular substance they’re waves in. When you hear something, you’re detecting sound waves caused by molecules bouncing around in the mixture of gases that we call air, and we can calculate all sorts of interesting things about these waves—how their intensity fades as the inverse square of the distance, how they bend when they pass through open doors and how they bounce off of walls and cause echoes—without knowing what air is made of. In fact, we don’t even need to know that it’s made of molecules: we can ignore all details about oxygen, nitrogen, carbon dioxide, etc., because the only property of the wave’s substrate that matters and enters into the famous wave equation is a single number that we can measure: the wave speed, which in this case is about 300 meters per second. Indeed, this wave equation that I taught my MIT students about in a course last spring was first discovered and put to great use long before physicists had even established that atoms and molecules existed!
This wave example illustrates three important points. First, substrate independence doesn’t mean that a substrate is unnecessary, but that most of its details don’t matter. You obviously can’t have sound waves in a gas if there’s no gas, but any gas whatsoever will suffice. Similarly, you obviously can’t have computation without matter, but any matter will do as long as it can be arranged into NAND gates, connected neurons or some other building block enabling universal computation. Second, the substrate-independent phenomenon takes on a life of its own, independent of its substrate. A wave can travel across a lake, even though none of its water molecules do—they mostly bob up and down, like fans doing “the wave” in a sports stadium. Third, it’s often only the substrate-independent aspect that we’re interested in: a surfer usually cares more about the position and height of a wave than about its detailed molecular composition. We saw how this was true for information, and it’s true for computation too: if two programmers are jointly hunting a bug in their code, they’re probably not discussing transistors.
We’ve now arrived at an answer to our opening question about how tangible physical stuff can give rise to something that feels as intangible, abstract and ethereal as intelligence: it feels so non-physical because it’s substrate-independent, taking on a life of its own that doesn’t depend on or reflect the physical details. In short, computation is a pattern in the spacetime arrangement of particles, and it’s not the particles but the pattern that really matters! Matter doesn’t matter.
In other words, the hardware is the matter and the software is the pattern. This substrate independence of computation implies that AI is possible: intelligence doesn’t require flesh, blood or carbon atoms.
Because of this substrate independence, shrewd engineers have been able to repeatedly replace the technologies inside our computers with dramatically better ones, without changing the software. The results have been every bit as spectacular as those for memory devices. As illustrated in figure 2.8, computation keeps getting half as expensive roughly every couple of years, and this trend has now persisted for over a century, cutting the computer cost a whopping million million million (10^18) times since my grandmothers were born. If everything got a million million million times cheaper, then a hundredth of a cent would enable you to buy all goods and services produced on Earth this year. This dramatic drop in costs is of course a key reason why computation is everywhere these days, having spread from the building-sized computing facilities of yesteryear into our homes, cars and pockets—and even turning up in unexpected places such as sneakers.
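As a rough consistency check of those numbers (my own arithmetic, not the book’s):

```latex
10^{18} = 2^{\,n} \;\Longrightarrow\; n = 18 \log_2 10 \approx 60 \text{ doublings over roughly a century,}
```

i.e., one cost halving every year or two, which squares with “half as expensive roughly every couple of years.”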
Why does our technology keep doubling its power at regular intervals, displaying what mathematicians call exponential growth? Indeed, why is it happening not only in terms of transistor miniaturization (a trend known as Moore’s law), but also more broadly for computation as a whole (figure 2.8), for memory (figure 2.4) and for a plethora of other technologies ranging from genome sequencing to brain imaging? Ray Kurzweil calls this persistent doubling phenomenon “the law of accelerating returns.” All examples of persistent doubling that I know of in nature have the same fundamental cause, and this technological one is no exception: each step creates the next. For example, you yourself underwent exponential growth right after your conception: each of your cells divided and gave rise to two cells roughly daily, causing your total number of cells to increase day by day as 1, 2, 4, 8, 16 and so on. According to the most popular scientific theory of our cosmic origins, known as inflation, our baby Universe once grew exponentially just like you did, repeatedly doubling its size at regular intervals until a speck much smaller and lighter than an atom had grown more massive than all the galaxies we’ve ever seen with our telescopes. Again, the cause was a process whereby each doubling step caused the next. This is how technology progresses as well: once technology gets twice as powerful, it can often be used to design and build technology that’s twice as powerful in turn, triggering repeated capability doubling in the spirit of Moore’s law.
Something that occurs just as regularly as the doubling of our technological power is the appearance of claims that the doubling is ending. Yes, Moore’s law will of course end, meaning that there’s a physical limit to how small transistors can be made. But some people mistakenly assume that Moore’s law is synonymous with the persistent doubling of our technological power. Contrariwise, Ray Kurzweil points out that Moore’s law involves not the first but the fifth technological paradigm to bring exponential growth in computing, as illustrated in figure 2.8: whenever one technology stopped improving, we replaced it with an even better one. When we could no longer keep shrinking our vacuum tubes, we replaced them with transistors and then integrated circuits, where electrons move around in two dimensions. When this technology reaches its limits, there are many other alternatives we can try—for example, using three-dimensional circuits and using something other than electrons to do our bidding.
Nobody knows for sure what the next blockbuster computational substrate will be, but we do know that we’re nowhere near the limits imposed by the laws of physics. My MIT colleague Seth Lloyd has worked out what this fundamental limit is, and as we’ll explore in greater detail in chapter 6, this limit is a whopping 33 orders of magnitude (10^33 times) beyond today’s state of the art for how much computing a clump of matter can do. So even if we keep doubling the power of our computers every couple of years, it will take over two centuries until we reach that final frontier.
Although all universal computers are capable of the same computations, some are more efficient than others. For example, a computation requiring millions of multiplications doesn’t require millions of separate multiplication modules built from separate transistors as in figure 2.6: it needs only one such module, since it can use it many times in succession with appropriate inputs. In this spirit of efficiency, most modern computers use a paradigm where computations are split into multiple time steps, during which information is shuffled back and forth between memory modules and computation modules. This computational architecture was developed between 1935 and 1945 by computer pioneers including Alan Turing, Konrad Zuse, Presper Eckert, John Mauchly and John von Neumann. More specifically, the computer memory stores both data and software (a program, i.e., a list of instructions for what to do with the data). At each time step, a central processing unit (CPU) executes the next instruction in the program, which specifies some simple function to apply to some part of the data. The part of the computer that keeps track of what to do next is merely another part of its memory, called the program counter, which stores the current line number in the program. To go to the next instruction, simply add one to the program counter. To jump to another line of the program, simply copy that line number into the program counter—this is how so-called “if” statements and loops are implemented.
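To make the program-counter mechanics concrete, here is a minimal sketch in Python of a toy computer (my own invented three-instruction set, not any real CPU): memory holds both data and a program, and the program counter either steps to the next line or jumps, which is exactly how loops arise.

```python
def run(program, data):
    pc = 0                                  # the program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "add":                     # data[a] += data[b]
            a, b = args; data[a] += data[b]; pc += 1
        elif op == "dec":                   # data[a] -= 1
            (a,) = args; data[a] -= 1; pc += 1
        elif op == "jump_if_nonzero":       # if data[a] != 0, jump to program line t
            a, t = args; pc = t if data[a] != 0 else pc + 1
        elif op == "halt":
            break
    return data

# Multiply data[0] by data[1] via repeated addition; the result accumulates in data[2].
program = [
    ("jump_if_nonzero", 1, 2),   # line 0: if the counter isn't zero, go to line 2
    ("halt",),                   # line 1: done
    ("add", 2, 0),               # line 2: result += multiplicand
    ("dec", 1),                  # line 3: counter -= 1
    ("jump_if_nonzero", 1, 2),   # line 4: loop while the counter isn't zero
    ("halt",),                   # line 5
]
print(run(program, [6, 7, 0]))   # -> [6, 0, 42]
```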
Today’s computers often gain additional speed by parallel processing, which cleverly undoes some of this reuse of modules: if a computation can be split into parts that can be done in parallel (because the input of one part doesn’t require the output of another), then the parts can be computed simultaneously by different parts of the hardware.
The ultimate parallel computer is a quantum computer. Quantum computing pioneer David Deutsch controversially argues that “quantum computers share information with huge numbers of versions of themselves throughout the multiverse,” and can get answers faster here in our Universe by in a sense getting help from these other versions.4 We don’t yet know whether a commercially competitive quantum computer can be built during the coming decades, because it depends both on whether quantum physics works as we think it does and on our ability to overcome daunting technical challenges, but companies and governments around the world are betting tens of millions of dollars annually on the possibility. Although quantum computers cannot speed up run-of-the-mill computations, clever algorithms have been developed that may dramatically speed up specific types of calculations, such as cracking cryptosystems and training neural networks. A quantum computer could also efficiently simulate the behavior of quantum-mechanical systems, including atoms, molecules and new materials, replacing measurements in chemistry labs in the same way that simulations on traditional computers have replaced measurements in wind tunnels.
What Is Learning?
Although a pocket calculator can crush me in an arithmetic contest, it will never improve its speed or accuracy, no matter how much it practices. It doesn’t learn: for example, every time I press its square-root button, it computes exactly the same function in exactly the same way. Similarly, the first computer program that ever beat me at chess never learned from its mistakes, but merely implemented a function that its clever programmer had designed to compute a good next move. In contrast, when Magnus Carlsen lost his first game of chess at age five, he began a learning process that made him the World Chess Champion eighteen years later.
The ability to learn is arguably the most fascinating aspect of general intelligence. We’ve already seen how a seemingly dumb clump of matter can remember and compute, but how can it learn? We’ve seen that finding the answer to a difficult question corresponds to computing a function, and that appropriately arranged matter can calculate any computable function. When we humans first created pocket calculators and chess programs, we did the arranging. For matter to learn, it must instead rearrange itself to get better and better at computing the desired function—simply by obeying the laws of physics.
To demystify the learning process, let’s first consider how a very simple physical system can learn the digits of π and other numbers. Above we saw how a surface with many valleys (see figure 2.3) can be used as a memory device: for example, if the bottom of one of the valleys is at position x = π ≈ 3.14159 and there are no other valleys nearby, then you can put a ball at x = 3 and watch the system compute the missing decimals by letting the ball roll down to the bottom. Now, suppose that the surface is made of soft clay and starts out completely flat, as a blank slate. If some math enthusiasts repeatedly place the ball at the locations of each of their favorite numbers, then gravity will gradually create valleys at these locations, after which the clay surface can be used to recall these stored memories. In other words, the clay surface has learned to compute digits of numbers such as π.
Other physical systems, such as brains, can learn much more efficiently based on the same idea. John Hopfield showed that his above-mentioned network of interconnected neurons can learn in an analogous way: if you repeatedly put it into certain states, it will gradually learn these states and return to them from any nearby state. If you’ve seen each of your family members many times, then memories of what they look like can be triggered by anything related to them.
Neural networks have now transformed both biological and artificial intelligence, and have recently started dominating the AI subfield known as machine learning (the study of algorithms that improve through experience). Before delving deeper into how such networks can learn, let’s first understand how they can compute. A neural network is simply a group of interconnected neurons that are able to influence each other’s behavior. Your brain contains about as many neurons as there are stars in our Galaxy: in the ballpark of a hundred billion. On average, each of these neurons is connected to about a thousand others via junctions called synapses, and it’s the strengths of these roughly hundred trillion synapse connections that encode most of the information in your brain.
We can schematically draw a neural network as a collection of dots representing neurons connected by lines representing synapses (see figure 2.9). Real-world neurons are very complicated electrochemical devices looking nothing like this schematic illustration: they involve different parts with names such as axons and dendrites, there are many different kinds of neurons that operate in a wide variety of ways, and the exact details of how and when electrical activity in one neuron affects other neurons is still the subject of active study. However, AI researchers have shown that neural networks can still attain human-level performance on many remarkably complex tasks even if one ignores all these complexities and replaces real biological neurons with extremely simple simulated ones that are all identical and obey very simple rules. The currently most popular model for such an artificial neural network represents the state of each neuron by a single number and the strength of each synapse by a single number. In this model, each neuron updates its state at regular time steps by simply averaging together the inputs from all connected neurons, weighting them by the synaptic strengths, optionally adding a constant, and then applying what’s called an activation function to the result to compute its next state.*5 The easiest way to use a neural network as a function is to make it feedforward, with information flowing only in one direction, as in figure 2.9, plugging the input to the function into a layer of neurons at the top and extracting the output from a layer of neurons at the bottom.
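Here is a minimal sketch in Python of that simple neuron model (my own illustration; the sigmoid activation and the hand-picked weights are assumptions, chosen so that the tiny two-layer network below roughly computes XOR):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))       # a common choice of activation function

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus a constant, passed through the activation.
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def feedforward(values, layers):
    # Information flows one way: each layer's outputs feed the next layer.
    for weights, biases in layers:
        values = [neuron(values, w_row, b) for w_row, b in zip(weights, biases)]
    return values

layers = [
    ([[ 5.0,  5.0], [-5.0, -5.0]], [-2.5, 7.5]),   # hidden layer: two neurons
    ([[ 5.0,  5.0]],               [-7.5]),        # output layer: one neuron
]
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", round(feedforward([a, b], layers)[0], 2))
# Prints roughly 0.11 for (0,0) and (1,1), roughly 0.85 for (0,1) and (1,0): a soft XOR.
```

(The text describes “averaging” the weighted inputs; the weighted sum used here differs only by an overall scale that can be absorbed into the weights.)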
The success of these simple artificial neural networks is yet another example of substrate independence: neural networks have great computational power seemingly independent of the low-level nitty-gritty details of their construction. Indeed, George Cybenko, Kurt Hornik, Maxwell Stinchcombe and Halbert White proved something remarkable in 1989: such simple neural networks are universal in the sense that they can compute any function arbitrarily accurately, by simply adjusting those synapse strength numbers accordingly. In other words, evolution probably didn’t make our biological neurons so complicated because it was necessary, but because it was more efficient—and because evolution, as opposed to human engineers, doesn’t reward designs that are simple and easy to understand.
When I first learned about this, I was mystified by how something so simple could compute something arbitrarily complicated. For example, how can you compute even something as simple as multiplication, when all you’re allowed to do is compute weighted sums and apply a single fixed function? In case you’d like a taste of how this works, figure 2.10 shows how a mere five neurons can multiply two arbitrary numbers together, and how a single neuron can multiply three bits together.
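For a taste in code (my own sketch, using a hard threshold instead of the smooth activation in figure 2.10), here is a single neuron that multiplies three bits:

```python
def step(x):
    return 1 if x >= 0 else 0               # a hard-threshold activation

def bit_product_neuron(a, b, c):
    # For bits, a*b*c is 1 exactly when all three are 1, so weights 1,1,1 and
    # a bias of -2.5 make the neuron fire only in that case.
    return step(a + b + c - 2.5)

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert bit_product_neuron(a, b, c) == a * b * c
print("one neuron reproduces the product of three bits")
```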
Although you can prove that you can compute anything in theory with an arbitrarily large neural network, the proof doesn’t say anything about whether you can do so in practice, with a network of reasonable size. In fact, the more I thought about it, the more puzzled I became that neural networks worked so well.
For example, suppose that we wish to classify megapixel grayscale images into two categories, say cats or dogs. If each of the million pixels can take one of, say, 256 values, then there are 256^1,000,000 possible images, and for each one, we wish to compute the probability that it depicts a cat. This means that an arbitrary function that inputs a picture and outputs a probability is defined by a list of 256^1,000,000 probabilities, that is, way more numbers than there are atoms in our Universe (about 10^78). Yet neural networks with merely thousands or millions of parameters somehow manage to perform such classification tasks quite well. How can successful neural networks be “cheap,” in the sense of requiring so few parameters? After all, you can prove that a neural network small enough to fit inside our Universe will epically fail to approximate almost all functions, succeeding merely on a ridiculously tiny fraction of all computational tasks that you might assign to it.
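Spelling out that counting argument (my own arithmetic):

```latex
256^{1{,}000{,}000} = 10^{\,1{,}000{,}000 \cdot \log_{10} 256} \approx 10^{\,2{,}400{,}000}
\;\gg\; 10^{78} \ \text{(the rough number of atoms in our Universe)}.
```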
I’ve had lots of fun puzzling over this and related mysteries with my student Henry Lin. One of the things I feel most grateful for in life is the opportunity to collaborate with amazing students, and Henry is one of them. When he first walked into my office to ask whether I was interested in working with him, I thought to myself that it would be more appropriate for me to ask whether he was interested in working with me: this modest, friendly and bright-eyed kid from Shreveport, Louisiana, had already written eight scientific papers, won a Forbes 30-Under-30 award, and given a TED talk with over a million views—and he was only twenty! A year later, we wrote a paper together with a surprising conclusion: the question of why neural networks work so well can’t be answered with mathematics alone, because part of the answer lies in physics. We found that the class of functions that the laws of physics throw at us and make us interested in computing is also a remarkably tiny class because, for reasons that we still don’t fully understand, the laws of physics are remarkably simple. Moreover, the tiny fraction of functions that neural networks can compute is very similar to the tiny fraction that physics makes us interested in! We also extended previous work showing that deep-learning neural networks (they’re called “deep” if they contain many layers) are much more efficient than shallow ones for many of these functions of interest. For example, together with another amazing MIT student, David Rolnick, we showed that the simple task of multiplying n numbers requires a whopping 2^n neurons for a network with only one layer, but takes only about 4n neurons in a deep network. This helps explain not only why neural networks are now all the rage among AI researchers, but also why we evolved neural networks in our brains: if we evolved brains to predict the future, then it makes sense that we’d evolve a computational architecture that’s good at precisely those computational problems that matter in the physical world.
Now that we’ve explored how neural networks work and compute, let’s return to the question of how they can learn. Specifically, how can a neural network get better at computing by updating its synapses?
In his seminal 1949 book, The Organization of Behavior: A Neuropsychological Theory, the Canadian psychologist Donald Hebb argued that if two nearby neurons were frequently active (“firing”) at the same time, their synaptic coupling would strengthen so that they learned to help trigger each other—an idea captured by the popular slogan “Fire together, wire together.” Although the details of how actual brains learn are still far from understood, and research has shown that the answers are in many cases much more complicated, it’s also been shown that even this simple learning rule (known as Hebbian learning) allows neural networks to learn interesting things. John Hopfield showed that Hebbian learning allowed his oversimplified artificial neural network to store lots of complex memories by simply being exposed to them repeatedly. Such exposure to information to learn from is usually called “training” when referring to artificial neural networks (or to animals or people being taught skills), although “studying,” “education” or “experience” might be just as apt. The artificial neural networks powering today’s AI systems tend to replace Hebbian learning with more sophisticated learning rules with nerdy names such as “backpropagation” and “stochastic gradient descent,” but the basic idea is the same: there’s some simple deterministic rule, akin to a law of physics, by which the synapses get updated over time. As if by magic, this simple rule can make the neural network learn remarkably complex computations if training is performed with large amounts of data. We don’t yet know precisely what learning rules our brains use, but whatever the answer may be, there’s no indication that they violate the laws of physics.
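Here is a minimal sketch in Python of Hebbian learning in a tiny Hopfield-style network (my own illustration; the eight-neuron size and the ±1 firing convention are assumptions): training strengthens connections between co-active neurons, after which the network restores a stored pattern from a corrupted cue.

```python
def train_hebbian(patterns):
    # "Fire together, wire together": strengthen W[i][j] when neurons i and j agree.
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:                      # neuron states are +1 (firing) or -1 (quiet)
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, state, n_sweeps=5):
    # Repeatedly let each neuron align with the weighted "vote" of the others.
    n = len(state)
    for _ in range(n_sweeps):
        for i in range(n):
            total = sum(W[i][j] * state[j] for j in range(n))
            state[i] = 1 if total >= 0 else -1
    return state

memory = [1, -1, 1, -1, 1, -1, 1, -1]       # one stored pattern
W = train_hebbian([memory])
cue = memory[:]
cue[0], cue[3] = -cue[0], -cue[3]           # corrupt two of the eight neurons
print(recall(W, cue) == memory)             # True: the stored memory is restored
```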
Just as most digital computers gain efficiency by splitting their work into multiple steps and reusing computational modules many times, so do many artificial and biological neural networks. Brains have parts that are what computer scientists call recurrent rather than feedforward neural networks, where information can flow in multiple directions rather than just one way, so that the current output can become input to what happens next. The network of logic gates in the microprocessor of a laptop is also recurrent in this sense: it keeps reusing its past information, and lets new information input from a keyboard, trackpad, camera, etc., affect its ongoing computation, which in turn determines information output to, say, a screen, loudspeaker, printer or wireless network. Analogously, the network of neurons in your brain is recurrent, letting information input from your eyes, ears and other senses affect its ongoing computation, which in turn determines information output to your muscles.
The history of learning is at least as long as the history of life itself, since every self-reproducing organism performs interesting copying and processing of information—behavior that has somehow been learned. During the era of Life 1.0, however, organisms didn’t learn during their lifetime: their rules for processing information and reacting were determined by their inherited DNA, so the only learning occurred slowly at the species level, through Darwinian evolution across generations.
About half a billion years ago, certain gene lines here on Earth discovered a way to make animals containing neural networks, able to learn behaviors from experiences during life. Life 2.0 had arrived, and because of its ability to learn dramatically faster and outsmart the competition, it spread like wildfire across the globe. As we explored in chapter 1, life has gotten progressively better at learning, and at an ever-increasing rate. A particular ape-like species grew a brain so adept at acquiring knowledge that it learned how to use tools, make fire, speak a language and create a complex global society. This society can itself be viewed as a system that remembers, computes and learns, all at an accelerating pace as one invention enables the next: writing, the printing press, modern science, computers, the internet and so on. What will future historians put next on that list of enabling inventions? My guess is artificial intelligence.
As we all know, the explosive improvements in computer memory and computational power (figure 2.4 and figure 2.8) have translated into spectacular progress in artificial intelligence—but it took a long time until machine learning came of age. When IBM’s Deep Blue computer overpowered chess champion Garry Kasparov in 1997, its major advantages lay in memory and computation, not in learning. Its computational intelligence had been created by a team of humans, and the key reason that Deep Blue could outplay its creators was its ability to compute faster and thereby analyze more potential positions. When IBM’s Watson computer dethroned the human world champion in the quiz show Jeopardy!, it too relied less on learning than on custom-programmed skills and superior memory and speed. The same can be said of most early breakthroughs in robotics, from legged locomotion to self-driving cars and self-landing rockets.
In contrast, the driving force behind many of the most recent AI breakthroughs has been machine learning. Consider figure 2.11, for example. It’s easy for you to tell what it’s a photo of, but to program a function that inputs nothing but the colors of all the pixels of an image and outputs an accurate caption such as “A group of young people playing a game of frisbee” had eluded all the world’s AI researchers for decades. Yet a team at Google led by Ilya Sutskever did precisely that in 2014. Input a different set of pixel colors, and it replies “A herd of elephants walking across a dry grass field,” again correctly. How did they do it? Deep Blue–style, by programming handcrafted algorithms for detecting frisbees, faces and the like? No, by creating a relatively simple neural network with no knowledge whatsoever about the physical world or its contents, and then letting it learn by exposing it to massive amounts of data. AI visionary Jeff Hawkins wrote in 2004 that “no computer can…see as well as a mouse,” but those days are now long gone.
Just as we don’t fully understand how our children learn, we still don’t fully understand how such neural networks learn, and why they occasionally fail. But what’s clear is that they’re already highly useful and are triggering a surge of investments in deep learning. Deep learning has now transformed many aspects of computer vision, from handwriting transcription to real-time video analysis for self-driving cars. It has similarly revolutionized the ability of computers to transform spoken language into text and translate it into other languages, even in real time—which is why we can now talk to personal digital assistants such as Siri, Google Now and Cortana. Those annoying CAPTCHA puzzles, where we need to convince a website that we’re human, are getting ever more difficult in order to keep ahead of what machine-learning technology can do. In 2015, Google DeepMind released an AI system using deep learning that was able to master dozens of computer games like a kid would—with no instructions whatsoever—except that it soon learned to play better than any human. In 2016, the same company built AlphaGo, a Go-playing computer system that used deep learning to evaluate the strength of different board positions and defeated the world’s strongest Go champion. This progress is fueling a virtuous circle, bringing ever more funding and talent into AI research, which generates further progress.
We’ve spent this chapter exploring the nature of intelligence and its development up until now. How long will it take until machines can out-compete us at all cognitive tasks? We clearly don’t know, and need to be open to the possibility that the answer may be “never.” However, a basic message of this chapter is that we also need to consider the possibility that it will happen, perhaps even in our lifetime. After all, matter can be arranged so that when it obeys the laws of physics, it remembers, computes and learns—and the matter doesn’t need to be biological. AI researchers have often been accused of over-promising and under-delivering, but in fairness, some of their critics don’t have the best track record either. Some keep moving the goalposts, effectively defining intelligence as that which computers still can’t do, or as that which impresses us. Machines are now good or excellent at arithmetic, chess, mathematical theorem proving, stock picking, image captioning, driving, arcade game playing, Go, speech synthesis, speech transcription, translation and cancer diagnosis, but some critics will scornfully scoff “Sure—but that’s not real intelligence!” They might go on to argue that real intelligence involves only the mountaintops in Moravec’s landscape (figure 2.2) that haven’t yet been submerged, just as some people in the past used to argue that image captioning and Go should count—while the water kept rising.
Assuming that the water will keep rising for at least a while longer, AI’s impact on society will keep growing. Long before AI reaches human level across all tasks, it will give us fascinating opportunities and challenges involving issues such as bugs, laws, weapons and jobs. What are they and how can we best prepare for them? Let’s explore this in the next chapter.
THE BOTTOM LINE:
•Intelligence, defined as ability to accomplish complex goals, can’t be measured by a single IQ, only by an ability spectrum across all goals.
•Today’s artificial intelligence tends to be narrow, with each system able to accomplish only very specific goals, while human intelligence is remarkably broad.
•Memory, computation, learning and intelligence have an abstract, intangible and ethereal feel to them because they’re substrate-independent: able to take on a life of their own that doesn’t depend on or reflect the details of their underlying material substrate.
•Any chunk of matter can be the substrate for memory as long as it has many different stable states.
•Any matter can be computronium, the substrate for computation, as long as it contains certain universal building blocks that can be combined to implement any function. NAND gates and neurons are two important examples of such universal “computational atoms.”
•A neural network is a powerful substrate for learning because, simply by obeying the laws of physics, it can rearrange itself to get better and better at implementing desired computations.
•Because of the striking simplicity of the laws of physics, we humans only care about a tiny fraction of all imaginable computational problems, and neural networks tend to be remarkably good at solving precisely this tiny fraction.
•Once technology gets twice as powerful, it can often be used to design and build technology that’s twice as powerful in turn, triggering repeated capability doubling in the spirit of Moore’s law. The cost of information technology has now halved roughly every two years for about a century, enabling the information age.
•If AI progress continues, then long before AI reaches human level for all skills, it will give us fascinating opportunities and challenges involving issues such as bugs, laws, weapons and jobs—which we’ll explore in the next chapter.