
Artificial Neural Networks

Adapted from "Neural Networks 101," by S. L. Thaler, Servo Magazine, April 2005

Artificial neural networks may be broadly summarized as collections of 'on-off' switches that automatically connect themselves so as to build intricate computer programs, without the need for human intervention. As such they may be considered the 'Swiss Army Knife' of fitting functions, in that they can absorb behaviors and phenomena in the world without any human preconceptions or input. Accordingly, this branch of artificial intelligence, inspired by the brain, the technique nature itself has used to attain intelligence, must inevitably form the basis of the freewheeling and self-determining AI predicted by futurists and visionaries.

1.0 Introduction

Prior to 1974, there had been many attempts to model events and phenomena via network-based techniques. Probably the best known of these modeling methodologies traces back to the Reverend Thomas Bayes, whose theorem was published as early as 1763. Within these models (Figure 1), simple probability or rule-based calculations were carried out within so-called 'nodes' and the results transmitted to other such nodes to create compound models. Although such Bayesian networks were impressive once completed, they did require that humans endow each node with wisdom. Certainly, the creation of a Bayesian net is an admirable and scholarly project; however, it does not qualify as the basis for the kinds of futuristic AI or robotics that Hollywood envisions. For one thing, what happens when an AI agent or robot encounters a new scenario or environment for which new human wisdom is required? Out of pure necessity, the kinds of cybernetic systems predicted by science fiction must be capable of fending for themselves.

Figure 1. Network vs. Neural Network Models. Vintage network models consist of intelligent nodes communicating through dumb connections.
Modern-day neural networks consist of dumb nodes or switches that effectively grow connections that embody intelligence.

1974 was the breakthrough year, when researchers discovered that multiple layers of totally unintelligent nodes, effectively acting as 'on-off' switches, could autonomously wire themselves together so as to eliminate the need for humans to plant explicit rules or calculations within the individual nodes (Werbos). Instead, all of the wisdom spontaneously 'grew' in the form of connection weights between very dumb nodes called neurons. Interestingly enough, this work was fueled by so-called computational psychologists who were attempting to gain a mathematical foothold into how the brain perceives things and events within the environment. They needed to account for how the brain could enlist relatively dumb switches to encode all world knowledge in the form of electrochemical connections between them. The result was the so-called multilayer perceptron, a collection of interconnected neurons that allows sets of patterns to be associated with one another. Thus the image of broccoli, a million-dimensional retinal firing pattern, could activate an opinion, the cortical firing pattern representing one's emotional response to eating this vegetable.

To better understand how wisdom, in the form of complex logic, can be represented by connection weights between simple switches, consider Figure 2. There, human ingenuity has been used to prescribe connection weight values that convert the simplest neural network of them all, a single neuron, into either an AND or an OR logic gate. Note that the neuron switches between off (0) and on (1) states by weighting the logical inputs, in_{1} and in_{2}, and comparing the result against some predetermined threshold value, θ. If the weighted input in_{1} * w_{1} + in_{2} * w_{2} exceeds θ, the neuron transitions from the off to the on state.
(Check for yourself that by simply adjusting the threshold value, θ, normally treated as a connection weight fed by a constant neuron outputting a value of 1, the AND network becomes an OR network.)
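The check suggested above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the weights (1.0 each) and switching thresholds (1.5 for AND, 0.5 for OR) are the values quoted in the Figure 2 caption.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 (on) if the weighted input sum exceeds the threshold, else 0 (off)."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net > threshold else 0


WEIGHTS = (1.0, 1.0)  # w_1 and w_2, as given in the Figure 2 caption


def and_gate(a, b):
    # With a threshold of 1.5, both inputs must be on for the sum to exceed it.
    return threshold_neuron((a, b), WEIGHTS, threshold=1.5)


def or_gate(a, b):
    # With a threshold of 0.5, a single active input is enough.
    return threshold_neuron((a, b), WEIGHTS, threshold=0.5)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", and_gate(a, b), "OR:", or_gate(a, b))
```

Only the threshold differs between the two gates; the weights, and the neuron itself, are unchanged.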
Figure 2. Simple Example of How Intelligence May Be Contained in Connection Weights. With weights w_{1} and w_{2} both assigned values of 1.0 and the switching threshold θ set to 1.5, this simplest of neural networks acts as an AND logic gate. By lowering the threshold to 0.5, it then acts as an OR gate.

2.0 An Exemplary Artificial Neural Network

Realizing that the single-neuron network just discussed was purely a pedagogical one that did not self-organize, let's discuss how practical neural networks function. These will be nets that effectively build themselves and capture much more complex relationships in higher-dimensional spaces. Let's consider as an example a sonar system aboard a robot (e.g., a hexapod crawler) that returns range measurements over a 30 degree sweep of the robot's forward vista (Figure 3), in 1 degree sampling windows. If we elect to build a neural network to classify the scene in front of the robot as an open doorway or not, all we need do is assemble a training database of representative sonar input patterns along with human judgments as to whether each sonar pattern represents a passageway. The input patterns could therefore consist of 28 sonar range measurements, perhaps in meters: {1, 1, 1, ... 4, 4, 4, 4 ..., 1, 1, 1}. Since this pattern looks like an open door, the corresponding desired network output pattern would be {1, 0}, with the first component set to 1 to indicate a doorway pattern. The second component, here 0, is reserved to indicate those sonar patterns that do not qualify as open doorways. If this had been the sonar return from a wall, this latter component would be set to 1 and the first to 0.

Figure 3. Doorway Detection Using Sonar Sweep. A hexapod robot, simulated here in virtual reality, conducts a 30 degree sonar sweep to detect doors, portals, and other openings through which it can crawl. Here it has identified a doorway to crawl through.
Prior to training this network we would amass a training exemplar database consisting of several hundred records, each with 28 fields describing the range measurements in the sonar sweep and two further fields indicating whether the corresponding sonar pattern represents a doorway or not. Ideally, this database would be collected over a range of scenarios that include sonar sweeps of both walls and openings, at a variety of viewing aspects.

The particular network chosen (Figure 4) is known as a multilayer perceptron, or MLP, and consists of 28 input neurons to accept the 28 sonar ranges from the angular sweep, 10 intermediate or hidden layer neurons, and two output neurons that are intended to activate in a mutually exclusive fashion to indicate the presence or absence of a doorway. To train this MLP we rapidly stream these records, called exemplars in neural network parlance, applying input patterns to the net's input layer and the corresponding output patterns to its output layer, as the MLP's connection weights self-organize so as to capture all the complex relationships involved. In this manner we hope to create an accurate doorway detector for the robotic crawler.

At the start of such training, all connection weights are randomized, perhaps in the range from -1 to +1. Each of our input patterns is then applied to the net in what is called a feed-forward cycle. Now each of the hidden layer neurons, labeled by the index j, receives 28 analog inputs instead of the two Boolean inputs in Figure 2, each weighted by the connection that delivers it. We call this weighted sum net_{j}, the net input to the j^{th} neuron (Equation 1):

    net_{j} = Σ_{i} w_{ij} act_{i}    (1)
Rather than simply compare this net input against some stored threshold value, as we did before, we utilize a sigmoid function (Figure 5) to approximate this switching behavior and to calculate the activation of the jth neuron, act_{j} (Equation 2),

    act_{j} = 1 / (1 + exp(-(net_{j} - θ_{j})))    (2)

where θ_{j} is the switching threshold or bias of the jth neuron and act_{i} is the activation of any preceding neuron i.

Figure 4. Doorway Detecting Neural Network. Inputs are 28 sonar ranges for the angular sweep between +14 and -14 degrees. Outputs represent classification into door and not-door categories. The intermediate layer is called the hidden layer.

Figure 5. Sigmoid Function (solid blue) versus Simple Thresholding Function (dashed red). This is a plot of Equation 2, showing the activation of the jth neuron, act_{j}, with respect to its net input, net_{j}. Here the threshold, θ_{j}, has been set to a value of 5.

The reader will note that the functional form expressed by Equation 2 manifests the switching behavior required of a neuron, transitioning from an off state (0) to an on state (1) near the threshold value θ_{j} (Figure 5), where act_{j} assumes the value 1/(1 + exp(0)) = 1/2. One very convenient feature of the sigmoid function is that its first derivative is well behaved, in a mathematical sense, and takes on a very simple form (Equation 3),

    act_{j}' = act_{j} (1 - act_{j})    (3)

where act_{j}' represents the derivative of act_{j} with respect to net_{j}. This quantity is very important, as you are about to see, because it helps determine the corrections to any connection weight feeding the neuron j. Further, because of its simple analytical form, it can be readily incorporated into the computer algorithm responsible for training this neural net.
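As a minimal sketch of the feed-forward arithmetic for a single neuron, the Python below computes a weighted net input, passes it through a sigmoid shifted by the switching threshold, and evaluates the derivative. It assumes the standard logistic form, which is what the value 1/(1 + exp(0)) = 1/2 at the threshold implies, and uses the well-known property that the logistic derivative equals act * (1 - act). The function names are hypothetical.

```python
import math


def net_input(weights, activations):
    """Weighted sum of the activations feeding one neuron."""
    return sum(w * a for w, a in zip(weights, activations))


def sigmoid_activation(net, threshold):
    """Smooth on/off switch: 0.5 exactly at the threshold, approaching 1 above it."""
    return 1.0 / (1.0 + math.exp(-(net - threshold)))


def sigmoid_derivative(act):
    """Derivative of the activation with respect to net input, written in terms of act."""
    return act * (1.0 - act)


if __name__ == "__main__":
    # Threshold of 5, as in the Figure 5 plot; the nets sampled here are arbitrary.
    for net in (0.0, 4.0, 5.0, 6.0, 10.0):
        act = sigmoid_activation(net, threshold=5.0)
        print(net, round(act, 4), round(sigmoid_derivative(act), 4))
```

Because the derivative is expressed through the activation itself, a training routine can reuse the value already computed in the forward pass rather than evaluating another exponential.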
Once the outputs have been calculated for each hidden layer neuron, the forward propagation is repeated for the two output layer neurons, which likewise accept activation values from the ten hidden layer neurons, calculate net inputs, and produce activations that are always initially wrong when compared with the desired network outputs. It is at this stage that a training algorithm takes charge, notes the difference between actual and desired outputs of the network, and begins the process called backpropagation. In the process, the connection weights feeding any neuron are corrected or updated according to the following rule (Equation 4),

    Δw_{ij} = η δ_{j} act_{j}' act_{i}    (4)

where Δw_{ij} is the correction, either positive or negative in sign, that must be added to the weight w_{ij}; η is an adjustable constant called the learning rate; δ_{j} is the output 'error' of the jth neuron; act_{j}' is the first derivative of neuron output with respect to net input noted above; and act_{i} represents the raw, unweighted output of the neuron i feeding the neuron j through the weight w_{ij}.

The output error δ_{j} is easy to calculate for any output neuron, being the difference between its desired and actual outputs. The calculation of δ_{j} for a hidden layer neuron is more formidable, since it involves the weighting of all the output layer deltas as they propagate from the network's output layer back to the neuron of interest, j. In effect, the network's output errors percolate back through the network, counter to the direction of typical information flow. This backpropagation takes place along 'trees' whose branches begin at all the output neurons and terminate at the 'back doors' of the neuron whose weights are being corrected. Thereafter these weighted error sums govern the size and sign of the updates applied to the weights leading to the 'front door' of each neuron. (For more details see, for instance, Freeman and Skapura.)
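To make the training cycle concrete, here is a hedged, pure-Python sketch of a 28-10-2 network trained with the update rule described above: each weight is corrected by the learning rate times the receiving neuron's error, times its sigmoid derivative, times the feeding activation. Several things are assumptions rather than the article's own method: the hidden-layer error follows the standard gradient-descent form (output errors multiplied by the output derivatives and weighted back through the output weights), thresholds are stored as bias weights fed by a constant 1, the sonar sweeps are synthetic and hypothetical, and ranges are divided by 4 so the sigmoids do not saturate.

```python
import math
import random

N_IN, N_HID, N_OUT = 28, 10, 2  # layer sizes of the doorway network


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_from, n_to, rng):
    # One weight row per receiving neuron, randomized between -1 and +1. The
    # extra last weight is fed by a constant 1 and acts as the negated threshold.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_from + 1)]
            for _ in range(n_to)]


def layer_forward(weights, acts_in):
    extended = acts_in + [1.0]
    return [sigmoid(sum(w * a for w, a in zip(row, extended)))
            for row in weights]


def forward(net, inputs):
    hidden = layer_forward(net["hidden"], inputs)
    return hidden, layer_forward(net["output"], hidden)


def train_step(net, inputs, target, rate):
    """One feed-forward cycle followed by one backpropagation update."""
    hidden, output = forward(net, inputs)
    # Output error: desired minus actual activation.
    delta_out = [t - o for t, o in zip(target, output)]
    # Error times the sigmoid derivative act' = act * (1 - act).
    term_out = [d * o * (1.0 - o) for d, o in zip(delta_out, output)]
    # Hidden error: output terms weighted back through the output weights,
    # computed before those weights are modified.
    delta_hid = [sum(t * net["output"][k][j] for k, t in enumerate(term_out))
                 for j in range(N_HID)]
    term_hid = [d * h * (1.0 - h) for d, h in zip(delta_hid, hidden)]
    # Update rule: correction = rate * delta_j * act_j' * act_i.
    for row, term in zip(net["output"], term_out):
        for i, act in enumerate(hidden + [1.0]):
            row[i] += rate * term * act
    for row, term in zip(net["hidden"], term_hid):
        for i, act in enumerate(inputs + [1.0]):
            row[i] += rate * term * act


def make_exemplar(rng, is_door):
    """Hypothetical sweep: a wall near 1 m, optionally with a gap near 4 m."""
    ranges = [1.0 + rng.uniform(-0.05, 0.05) for _ in range(N_IN)]
    if is_door:
        start = rng.randint(4, 12)
        for i in range(start, start + rng.randint(8, 12)):
            ranges[i] = 4.0 + rng.uniform(-0.05, 0.05)
    inputs = [r / 4.0 for r in ranges]  # scaling is an added assumption
    return inputs, ([1.0, 0.0] if is_door else [0.0, 1.0])


def mean_error(net, data):
    total = 0.0
    for inputs, target in data:
        _, output = forward(net, inputs)
        total += sum((t - o) ** 2 for t, o in zip(target, output))
    return total / len(data)


def train(seed=0, n_exemplars=200, epochs=20, rate=0.5):
    rng = random.Random(seed)
    net = {"hidden": init_layer(N_IN, N_HID, rng),
           "output": init_layer(N_HID, N_OUT, rng)}
    data = [make_exemplar(rng, n % 2 == 0) for n in range(n_exemplars)]
    before = mean_error(net, data)
    for _ in range(epochs):
        rng.shuffle(data)
        for inputs, target in data:
            train_step(net, inputs, target, rate)
    return net, before, mean_error(net, data)
```

Calling `train()` returns the trained weights together with the mean squared error before and after training. On data this easy the error should fall, though the exact figures depend on the seed, and real sonar sweeps would be far less tidy than these synthetic ones.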
Before departing from our discussion of Equation 4, I offer three observations. First, although the weight update rule looks simple, it is derived from the requirement that the average prediction error of the network over all of its training exemplars be minimized, producing some very impressive differential equations that ultimately reduce to this simple equation. Secondly, Equation 4 may contain an optional additive term proportional to the product of the last weight update, Δw_{ij}, and a constant called momentum. Its purpose is to keep the average training error from 'falling' into and dwelling within local minima; instead, it keeps the error rolling between these error canyons and valleys until it arrives at an exceptionally deep error minimum from which training cannot escape. Finally, Equation 4 amounts to what I typically call a 'mathematical spanking', effectively punishing connection weights that do not contribute to the accuracy of the neural network by modifying their values. This mathematical spanking continues, as we repeatedly apply input and output training exemplars to the network, until the prediction error falls below some acceptable level.

As a result of these myriad weight updates, some very interesting phenomena occur within the network. The first of these is the formation of what is called a classification layer, as the connection weights leading to the hidden layer self-organize to effectively bin input patterns into important families or classes. In so doing, this classification layer learns, by repeated exposure, to detect the important features of the input space that can ultimately contribute to the correct classification of the sonar pattern as indicative of a doorway or not. In this sonar example, distributed colonies of neurons develop that, because of their feeding or afferent connection weights, respond to important features such as walls, edges, and gaps.
Therefore, when the sonar return senses a solid wall ahead, with no available portal to crawl through, only the colony of cells corresponding to walls activates. However, if the sonar return is from an open doorway, then the wall, gap, and edge detecting colonies all activate, producing a preliminary classification of the doorway as just that. The second self-organizational phenomenon takes place in the output weight layer, wherein the necessary logic for classification develops, logic that might very well resemble our previous example of the AND gate. This heuristic knowledge, once discerned, would look something like this: if edge, gap, and wall features are all indicated in the classification layer, then a door is present; otherwise the forward scene does not include a doorway.

Now that I've described the training of an ANN, let me emphasize that absolutely no domain knowledge was necessary. The network simply trained on both examples and non-examples of doors, essentially learning the 'zen' of what constitutes a doorway. This process consisted of (1) learning the essential features of what constitutes a door and (2) learning the logic required to discern the presence of a door should the requisite features be detected. A programmer did not have to hard-code this logic into software, drawing upon his or her world knowledge. The ANN learns completely by exposure to sonar returns that are representative and non-representative of doorways. (What I haven't told you is how this process may be totally automated in real time.)

You should now have a handle on how intelligence is automatically absorbed within the connection weights between neurons, typically through this iterative process of weight corrections. Of course, this particular example has illustrated just one kind of artificial neural network, notably the workhorse of the field, the MLP.
Although we have chosen a rather small, three-layer network as a robotics-related working example, oftentimes four or more layers are used, with many more neurons. The kind of neural network just discussed may be used to build reactive robotic systems that, somewhat like a spinal reflex, automatically map sensor inputs to some robotic response. In this case, such a response may involve generating the necessary leg servo signals to advance a robot toward the sensed doorway. Note that the kind of neural architecture required to build more ambitious deliberative robots, those that choose among multiple courses of action from self-acquired world models, is quite different from that discussed above and is beyond the scope of this introductory article.

3.0 Conclusions

Neural network practitioners are always quick to point out that the primary advantage of artificial neural networks is their ability to capture highly nonlinear relationships, but I typically take the argument a few levels deeper than that. Throughout history, humans have devised models of things and events by combining fundamental analogies to describe more complex phenomena. Mathematicians have characteristically combined well-understood functional behaviors such as linear, power law, and sinusoidal relationships to account for phenomena having scientific or technical merit, weighting these simpler functional analogies by expansion coefficients, a_{i} (Equation 5):

    F(x) = Σ_{i} a_{i} F_{i}(x)    (5)

Here, the F_{i} represent the simpler functional behaviors, which could be the powers of x, Fourier components in the form of sine and cosine terms, or a simple linear expansion. The choice of proper functional analogies, F_{i}, and the solution for the expansion coefficients, a_{i}, have occupied scientists and mathematicians for centuries. Once these weightings are discovered, typically through linear matrix techniques, we produce precise mathematical models to predict moderately complex behaviors.
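For contrast with the iterative training of a neural network, the following sketch shows the kind of linear matrix technique the passage above refers to: expansion coefficients a_{i} for a weighted sum of chosen basis functions F_{i}(x), found in one step by least squares. The basis (powers of x), the sample data, and all function names are illustrative assumptions; the normal equations are solved with a small hand-written Gaussian elimination so that the example needs no external library.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_expansion(basis, xs, ys):
    """Least-squares coefficients a_i for F(x) = sum_i a_i * F_i(x)."""
    design = [[f(x) for f in basis] for x in xs]
    n = len(basis)
    # Normal equations: (D^T D) a = D^T y, where D is the design matrix.
    gram = [[sum(row[i] * row[j] for row in design) for j in range(n)]
            for i in range(n)]
    moment = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(n)]
    return solve_linear_system(gram, moment)


# Hypothetical example: recover the coefficients of a known quadratic.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]  # powers of x
xs = [i / 10.0 for i in range(21)]
ys = [2.0 - 3.0 * x + 0.5 * x * x for x in xs]
coeffs = fit_expansion(basis, xs, ys)
```

Here `coeffs` should come out close to the generating values 2.0, -3.0, and 0.5. The fit works in a single linear solve only because the basis functions were chosen by hand in advance, which is precisely the human preconception the article argues a neural network can do without.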
But the bulk of activities going on in the world are not describable in terms of simplistic series expansions of the type embodied in Equation 5. In reality, in the more complex and non-ideal problems increasingly mulled over by scientists and engineers, events happen in complex causal chains. For instance, there may be some initiating event, after which subsequent events occur in a cascading fashion. Looking at any given output neuron in Figure 4, its activation is a function of all the activations within the hidden layer, which are in turn a function of the activations in the previous layer. In effect, this network can represent a causal chain of the form (Equation 6)

    F(x) = F_{n}(F_{n-1}(... F_{1}(x) ...))    (6)

where F(x) is the output of any output neuron, represented in a nested functional form. In other words, something has happened, F_{n}, because something else has happened, F_{n-1}, ultimately because of some initiating event x.

...In other words, multilayered neural networks are the natural and most flexible way of modeling the world.

A moment's reflection reveals that Equation 6 is a much more general functional fit than Equation 5, allowing for such nontrivial causal chains and incorporating the less general series expansions at any level of nesting. Also realize that the solution for the basic functional forms, F_{i}, cannot be achieved using the traditional linear matrix techniques. Instead, we must rely upon the iterative weight update procedure discussed above, which forms the basis of training methods like backpropagation. Because of this generality of fit, without any presuppositions about the functional forms to be chosen, I call MLPs the "Swiss Army Knife" of fitting functions. As long as there exists some intrinsic relationship between input and output patterns, a multilayered neural network can discover it.
In the process, the network will automatically divide its world into its most notable entities (i.e., microfeature detection that partitioned the sonar network into edges, walls, and gaps at the hidden layer), classify its sensed world according to the dominance of such features (i.e., hidden layer classification), and automatically grow the logic, in the form of connection weights, to make some decision (i.e., the decision as to whether the forward vista contained a portal through which to crawl). For these and other reasons, I will always think of neural networks as so much more than very high-dimensional, nonlinear, statistical curve fits. They are the future of autonomous robotics as well as a revealing model of human perception.

4.0 References

Freeman, J. A. and Skapura, D. M. (1991). Neural Networks: Algorithms, Applications, and Programming Techniques, Addison-Wesley, Reading, MA.

Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, Cambridge, MA.
© 1997-2020, Imagination Engines, Inc. Creativity Machine®, Imagination Engines®, Imagitron®, and DataBots® are registered trademarks of Imagination Engines, Inc.
1550 Wall Street, Ste. 300, St. Charles, MO 63303 • (636) 724-9000