
Optimization for Industrial Problems

Patrick Bangert



Patrick Bangert, algorithmica technologies GmbH, Bremen, Germany

ISBN 978-3-642-24973-0
e-ISBN 978-3-642-24974-7
DOI 10.1007/978-3-642-24974-7
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2011945031
Mathematics Subject Classification (2010): 90-08, 90B50

© Springer-Verlag Berlin Heidelberg 2012

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Printed on acid-free paper.

Springer is part of Springer Science+Business Media (www.springer.com)

It can be done!

algorithmica technologies GmbH
Advanced International Research Institute on Industrial Optimization gGmbH
Department of Mathematics, University College London

Preface

Some Early Opinions on Technology

"There is practically no chance communications space satellites will be used to provide better telephone, telegraph, television, or radio service inside the United States."
    T. Craven, FCC Commissioner, 1961

"There is not the slightest indication that nuclear energy will ever be obtainable. It would mean that the atom would have to be shattered at will."
    Albert Einstein, 1932

"Heavier-than-air flying machines are impossible."
    Lord Kelvin, 1895

"We will never make a 32 bit operating system."
    Bill Gates, 1983

"Such startling announcements as these should be deprecated as being unworthy of science and mischievous to its true progress."
    William Siemens, on Edison's light bulb, 1880

"The energy produced by the breaking down of the atom is a very poor kind of thing. Anyone who expects a source of power from the transformation of these atoms is talking moonshine."
    Ernest Rutherford, shortly after splitting the atom for the first time, 1917

"Everything that can be invented has been invented."
    Charles H. Duell, Commissioner of the US Patent Office, 1899

Content and Scope

Optimization is the determination of the values of the independent variables of a function such that the dependent variable attains a maximum over a suitably defined area of validity (cf. the boundary conditions). We consider the case in which the independent variables are many but the dependent variable is limited to one; multi-criterion decision making will only be touched upon.

This book, for the first time, combines mathematical methods with a wide range of real-life case studies of the industrial use of these methods. Both the methods and the problems to which they are applied as examples and case studies are useful in real situations that occur in profit-making industrial businesses in fields such as chemistry, power generation, oil exploration and refining, manufacturing, retail and others. The case studies focus on real projects that actually happened and that resulted in positive business for the industrial corporation. They are problems that other companies also have and thus have a degree of generality. The thrust is on take-home lessons that industry managers can use to improve their production via optimization methods.

Industrial production is characterized by very large investments in technical facilities and regular returns over decades. Improving yield or similar characteristics in a production facility is a major goal of the owners in order to leverage their investment. The current approach to doing this is mostly via engineering solutions that are costly, time-consuming and approximate. Mathematics entered the industrial stage in the 1980s with methods such as linear programming, revolutionizing the area of industrial optimization. Neural networks, simulation and direct modeling joined, and an arsenal of methods now exists to help engineers improve plants, both existing and new. The dot-com revolution in the late 1990s slowed this trend of knowledge transfer, and it is safe to say that industry is essentially stuck with these early methods. Mathematics has evolved since then and accumulated much expertise in optimization that remains hardly used.
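Read as a recipe, the opening definition amounts to: sample the independent variables within the area of validity and keep the point at which the dependent variable is largest. A minimal sketch in Python; the goal function and the boundary conditions here are invented for illustration and are not an example from the book:

```python
import random

def objective(x, y):
    """An invented 'yield' function of two independent variables."""
    return 5.0 - (x - 1.0) ** 2 - (y + 2.0) ** 2

def random_search(f, bounds, n_samples=20000, seed=0):
    """Crude optimization: sample the area of validity uniformly and
    keep the point with the largest value of the dependent variable."""
    rng = random.Random(seed)
    best_point, best_value = None, float("-inf")
    for _ in range(n_samples):
        point = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        value = f(*point)
        if value > best_value:
            best_point, best_value = point, value
    return best_point, best_value

# Boundary conditions: both independent variables restricted to [-5, 5].
point, value = random_search(objective, [(-5.0, 5.0), (-5.0, 5.0)])
print(point, value)  # with this many samples, lands near (1, -2) with value near 5
```

Even this naive search illustrates the vocabulary used throughout the book: a goal function, constraints on the variables, and a procedure that trades certainty for practicality.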
Also, modern computing power has exploded with the affordable parallel computer, so that methods that were once doomed to the dusty shelf can now actually be used. These two effects combine to harbor a possible revolution in industrial uses of mathematical methods. These uses center on the problem of optimization, as almost every industrial problem concerns maximizing some goal function (usually efficiency or yield). We want to help start this revolution with a coordinated presentation of methods, uses and successful examples.

The methods are necessarily heuristic, i.e. non-exact, as industrial problems are typically very large and complex indeed. Also, industrial problems are defined by imprecise, sometimes even faulty, data that must be absorbed by a model. They are always non-linear and have many independent variables. So we must focus on heuristic methods that have these characteristics.

This book is practical

This book is intended to be used to solve real problems in a handbook manner. It should be used to look for potential as yet untapped. It should be used to see possibilities where there were none before. The impossible should move towards the realm of the possible. The use, therefore, will mainly be in the sphere of application by persons employed in industry.

The book may also be used as instructional material in courses on either optimization methods or applied mathematics, and in MBA courses for industrial managers. Many readers will get their first introduction to what mathematics can really and practically do for industry, instead of general commonplaces. Many will find out what problems exist where they previously thought none existed. Many will discover that presumed impossibilities have been solved elsewhere. In total, I believe that you, the reader, will benefit by being empowered to solve real problems. These solutions will save corporations money, they will employ people, they will reduce pollution of the environment. They will have impact. They will also show people that very theoretical sciences have real uses.

It should be emphasized that this book focuses on applications. Practical problems must be understood at a reasonable level before a solution is possible. Also, all applications have several non-technical aspects, such as legal, compliance and managerial ramifications, in addition to the obvious financial dimension. Every solution must be implemented by people, and the interactions with them are the principal cause of failure in industrial applications. The right change management, including the motivation of all concerned, is an essential element that will also be addressed. Thus, this book presents cases as they can really function in real life.

Due to the wide scope of the book, it is impossible to present either the methods or the cases in full detail. We present what is necessary for understanding. To actually implement these methods, a more detailed study or prior knowledge is required. Many take-home lessons are, however, spelt out. The major aim of the book is to generate understanding, not technical facility.
This book is intended for practitioners

The intended readership comprises five groups:

1. Industrial managers will learn what can be done with mathematical methods. They will find that a lot of their problems, many seemingly impossible, are already solved. These methods can then be handed to technical persons for implementation.
2. Industrial scientists will use the book as a manual for their jobs. They will find methods that can be applied practically and that have solved similar problems before.
3. University students will learn that their theoretical subjects do have practical application in the context of diverse industries, which will motivate them in their studies towards a practical end. As such, the book will also provide starting points for theses.
4. University researchers will learn to what applications the methods they research have been put, or, respectively, what methods have been used by others to solve the problems they are investigating. As this is a trans-disciplinary book, it should facilitate communication across the boundaries of the mathematics, computer science and engineering departments.
5. Government funding bodies will learn that fundamental research does actually pay off in many particular cases.

A potential reader from these groups will be assumed to have completed mathematics training up to and including calculus (European high-school or US first-year college level). All other mathematics will be covered as far as needed. The book contains no proofs or other technical material; it is practical.

A short summary

Before a problem can be solved, both it and the tools must be understood. In fact, a correct, complete, detailed and clear description of the problem is (measured in total human effort) often nearly half of the final solution. Thus, we will devote substantial room in this book to understanding both the problems and the tools presented to solve them. Indeed, we place primary emphasis on understanding and only secondary emphasis on use. For the most part, ready-made packages exist to actually perform an analysis. For the remainder, experts exist who can carry it out. What cannot be denied, however, is that a good amount of understanding must permeate the relationship between the problem-owner and the problem-solver; a relationship that often encompasses dozens of people over years.

Here is a brief list of the contents of the chapters:

1. What is optimization?
2. What is an optimization problem?
3. What are the management challenges in an optimization project?
4. How can we deal with faulty and noisy empirical data?
5. How do we gain an understanding of our dataset?
6. How is a dataset converted into a mathematical model?
7. How is the optimization problem actually solved?
8. What are some challenges in implementing the optimal solution in industrial practice (change management)?

Most of the book was written by me; any deficiencies are the result of my own limited mind, and I ask for your patience with them. Any benefits are, of course, obtained by standing on the shoulders of giants and making small changes. Many case studies are co-authored by the management of the relevant industrial corporations; the case-study texts themselves were also written by me, and the same comments apply to them. I heartily thank all co-authors for their participation, for the trust and willingness to conduct the projects in the first place, and for agreeing to publish them here.

Chapter 8 was entirely written by Andreas Ruff of Elkem Silicon Materials. He has many years of experience in implementing the results of optimization projects in chemical corporations and has written a great practical account of the potential pitfalls and their solutions in change management.

Following this text, we provide first an alphabetical list of all co-authors and their affiliations, and then a list of all case studies together with their main topics and educational illustrations.

Bremen, 2011

Patrick Bangert

List of Co-Authors

Markus Ahorner, COO
Section 4.8, p. 53; Section 4.9, p. 58
algorithmica technologies GmbH, Gustav-Heinemann-Strasse 101, 28215 Bremen, Germany (www.algorithmica-technologies.com)

Dr. Patrick Bangert, CEO
All sections
algorithmica technologies GmbH, Gustav-Heinemann-Strasse 101, 28215 Bremen, Germany (www.algorithmica-technologies.com)

Claus Borgböhmer, Director Project Management
Section 4.8, p. 53
Sasol Solvents Germany GmbH, Römerstrasse 733, 47443 Moers, Germany (www.sasol.com)

Pablo Cajaraville, Director Engineering and Sales
Section 6.6, p. 135
Reiner Microtek, Poligono Industrial Itziar, Parcela H-3, 20820 Itziar-Deba, Spain (www.reinermicrotek.com)

Roger Chevalier, Senior Research Engineer
Section 6.10, p. 152
EDF SA, R&D Division, 6 quai Watier, BP49, 78401 Chatou Cedex, France (www.edf.com)

Jörg-A. Czernitzky, Power Plant Group Director Berlin
Section 7.10, p. 197
Vattenfall Europe Wärme AG, Puschkinallee 52, 12435 Berlin, Germany (www.vattenfall.de)

Prof. Dr. Adele Diederich, Professor of Psychology
Section 7.6, p. 183
Jacobs University Bremen gGmbH, P.O. Box 750 561, 28725 Bremen, Germany (www.jacobs-university.de)

Björn Dormann, Research Director
Section 6.6, p. 135
Klöckner Desma Schuhmaschinen GmbH, Desmastr. 3/5, 28832 Achim, Germany (www.desma.de)

Hans Dreischmeier, Director SAP
Section 4.9, p. 58
Vestolit GmbH & Co. KG, Industriestrasse 3, 45753 Marl, Germany (www.vestolit.de)

Bernd Herzog, Quality Control Manager
Section 4.11, p. 63
Hella Fahrzeugkomponenten GmbH, Dortmunder Strasse 5, 28199 Bremen, Germany (www.hella.de)

Dr. Philipp Imgrund, Director Biomaterial Technology, Director Power Technologies
Section 6.6, p. 135
Fraunhofer Institute for Manufacturing and Advanced Materials IFAM, Wiener Strasse 12, 28359 Bremen, Germany (www.ifam.fhg.de)

Maik Köhler, Technical Expert
Section 6.6, p. 135
Klöckner Desma Schuhmaschinen GmbH, Desmastr. 3/5, 28832 Achim, Germany (www.desma.de)

Lutz Kramer, Project Manager Metal Injection Molding
Section 6.6, p. 135
Fraunhofer Institute for Manufacturing and Advanced Materials IFAM, Wiener Strasse 12, 28359 Bremen, Germany (www.ifam.fhg.de)

Guisheng Li, Institute Director
Section 6.12, p. 157
Oil Production Technology Research Institute, Plant No. 5 of Petrochina Dagang Oilfield Company, Tianjin 300280, China (www.petrochina.com.cn)

Bailiang Liu, Vice Director
Section 7.9, p. 194
PetroChina Dagang Oilfield Company, Tianjin 300280, China (www.petrochina.com.cn)

Oscar Lopez, Senior Research Engineer
Section 6.6, p. 135
MIM TECH ALFA, S.L., Avenida Otaola, 4, 20600 Eibar, Spain (www.alfalan.es)

Torsten Mager, Director Technical Services
Section 5.6, p. 102
KNG Kraftwerks- und Netzgesellschaft mbH, Am Kühlturm 1, 18147 Rostock, Germany (www.kraftwerk-rostock.de)

Manfred Meise, CEO
Section 4.11, p. 63
Hella Fahrzeugkomponenten GmbH, Dortmunder Strasse 5, 28199 Bremen, Germany (www.hella.de)

Kurt Müller, Director Maintenance
Section 4.9, p. 58
Vestolit GmbH & Co. KG, Industriestrasse 3, 45753 Marl, Germany (www.vestolit.de)

Kaline Pagnan Furlan, Research Assistant Metal Injection Molding
Section 6.6, p. 135
Fraunhofer Institute for Manufacturing and Advanced Materials IFAM, Wiener Strasse 12, 28359 Bremen, Germany (www.ifam.fhg.de)

Yingjun Qu, Institute Director
Section 6.12, p. 157
Oil Production Technology Research Institute, Plant No. 6 of Petrochina Changqing Oilfield Company, 718600 Shanxi, China (www.petrochina.com.cn)

Pedro Rodriguez, Director R&D
Section 6.6, p. 135
MIM TECH ALFA, S.L., Avenida Otaola, 4, 20600 Eibar, Spain (www.alfalan.es)

Andreas Ruff, Technical Marketing Manager
Chapter 8, p. 201
Elkem Silicon Materials, Hochstadenstrasse 33, 50674 Köln, Germany (www.elkem.no)

Dr. Natalie Salk, CEO
Section 6.6, p. 135
PolyMIM GmbH, Am Gefach, 55566 Bad Sobernheim, Germany (www.polymim.com)

Prof. Chaodong Tan, Professor
Section 6.12, p. 157; Section 7.9, p. 194
China University of Petroleum, Beijing 102249, China (www.upc.edu.cn)

Jörg Volkert, Project Manager Metal Injection Molding
Section 6.6, p. 135
Fraunhofer Institute for Manufacturing and Advanced Materials IFAM, Wiener Strasse 12, 28359 Bremen, Germany (www.ifam.fhg.de)

Xuefeng Yan, Director of Production Technology
Section 6.12, p. 157
Beijing Yadan Petroleum Technology Co., Ltd., No. 37 Changqian Road, Changping, Beijing 102200, China (www.yadantech.com)

Jie Zhang, Vice CEO
Section 7.9, p. 194
Yadan Petroleum Technology Co., Ltd., No. 37 Changqian Road, Changping, Beijing 102200, China (www.yadantech.com)

Timo Zitt, Director Dormagen Combined-Cycle Plant
Section 7.11, p. 199
RWE Power AG, Chempark, Geb. A789, 41538 Dormagen, Germany (www.rwe.com)

The following is a list of all case studies provided in the book. For each study, we give its location in the text and its title. The summary indicates what the case deals with and what the result was. The "lessons" are the mathematical optimization concepts that the case particularly illustrates.

Self-Benchmarking in Maintenance of a Chemical Plant (Section 4.8, p. 53)
Summary: In addition to the common practice of benchmarking, we suggest comparing the plant to itself in the past to make a self-benchmark.
Lessons: The right pre-processing of raw data from the ERP system can already yield useful information without further mathematical analysis.

Financial Data Analysis for Contract Planning (Section 4.9, p. 58)
Summary: Based on past financial data, we create a detailed projection into the future in several categories and so provide decision support for budgeting.
Lessons: Discovering the basic statistical features of the data first allows the transformation of ERP data into a mathematical framework capable of making reliable projections.

Early Warning System for Importance of Production Alarms (Section 4.11, p. 63)
Summary: Production alarms are analyzed in terms of their abnormality, so that we only react to those alarms that indicate a qualitative change in operations.
Lessons: Comparison of statistical distributions based on statistical testing allows us to distinguish normal from abnormal events.

Optical Digit Recognition (Section 5.4, p. 92)
Summary: Images of hand-written digits are shown to the computer in an effort for it to learn the difference between them without us providing this information (unsupervised learning).
Lessons: It is possible to cluster data into categories without providing any information at all apart from the raw data, but it pays to pre-process this data and to be careful about the number of categories specified.

Turbine Diagnosis in a Power Plant (Section 5.5, p. 96)
Summary: Operational data from many turbines are analyzed to determine which turbine was behaving strangely and which was not.
Lessons: Time-series can be statistically compared based on several distinctive features, providing an automated check on the qualitative behavior of the system.

Determining the Cause of a Known Fault (Section 5.6, p. 102)
Summary: We search for the cause of a bent turbine blade and do not find it.
Lessons: Sometimes the causal mechanism is beyond current data acquisition and then cannot be analyzed out of the data. It is important to recognize that analysis can only elucidate what is already there.

Customer Segmentation (Section 5.10, p. 117)
Summary: Consumers are divided into categories based on their purchasing habits.
Lessons: Based on purchasing histories, it is possible to group customers into behavioral groups. It is also possible to extract cause-effect information about which purchases trigger other purchases.

Scrap Detection in Injection Molding Manufacturing (Section 6.6, p. 135)
Summary: It is determined whether an injection-molded part is scrap or not.
Lessons: Several time-series need to be converted into a few distinctive features that a neural network can then categorize as scrap or not.

Prediction of Turbine Failure (Section 6.7, p. 140)
Summary: A turbine blade tear is correctly predicted two days before it happened.
Lessons: Time-series can be extrapolated into the future and failures thus predicted. The failure mechanism must already be visible in the data.

Failures of Wind Power Plants (Section 6.8, p. 143)
Summary: Failures of wind power plants are predicted several days before they happen.
Lessons: Even if the physical system is not stable because of changing wind conditions, the failure mechanism is sufficiently predictable.

Catalytic Reactors in Chemistry and Petrochemistry (Section 6.9, p. 148)
Summary: The catalyst deactivation in fluid and solid catalytic reactors is projected into the future.
Lessons: Non-mechanical degradation can be predicted as well, allowing projection more than one year in advance.

Predicting Vibration Crises in Nuclear Power Plants (Section 6.10, p. 152)
Summary: A temporary increase in turbine vibrations is predicted several days before it happens.
Lessons: Subtle events that are not discrete failures but rather quantitative changes in behavior can be predicted too.

Identifying and Predicting the Failure of Valves (Section 6.11, p. 155)
Summary: In a system of valves, we determine which valve is responsible for a non-constant final mixture and predict when this state will be reached.
Lessons: Using data analysis in combination with plant know-how, we can identify the root cause even if the system is not fully instrumented.

Predicting the Dynamometer Card of a Rod Pump (Section 6.12, p. 157)
Summary: The condition of a rod pump can be determined from a diagram known as the dynamometer card. This 2D shape is projected into the future in order to diagnose and predict future failures.
Lessons: It is possible to predict not only time-series but also changing geometrical shapes, based on a combination of modeling and prediction.

Human Brains use Simulated Annealing to Think (Section 7.6, p. 183)
Summary: Based on human trials, we determine that human problem solving uses the simulated annealing paradigm.
Lessons: Simulated annealing is a very general and successful method of solving optimization problems that, when combined with the natural advantages of the computer, becomes very powerful and can find the optimal solution in nearly all cases.

Optimization of the Müller-Rochow Synthesis of Silanes (Section 7.8, p. 189)
Summary: A complex chemical reaction whose kinetics is not fully understood by science is modeled with the aim of increasing both selectivity and yield.
Lessons: It is possible to construct empirical models without theoretical understanding and still compute the desired answers.

Increase of Oil Production Yield in Shallow-Water Offshore Oil Wells (Section 7.9, p. 194)
Summary: Offshore oil pumps are modeled with the aim of both predicting their future failures and increasing the oil production yield.
Lessons: The pumps must be considered as a system in which they influence each other. We solve a balancing problem between them using their individual models.

Increase of Coal-Burning Efficiency in a CHP Power Plant (Section 7.10, p. 197)
Summary: The efficiency of a CHP coal power plant is increased by 1%.
Lessons: While each component in a power plant is already optimized, mathematical modeling offers added value in optimizing the combination of these components into a single system. The combination still allows a substantial efficiency increase based on dynamic reaction to changing external conditions.

Reducing the Internal Power Demand of a Power Plant (Section 7.11, p. 199)
Summary: A power plant uses up some of its own power by operating pumps and fans. The internal power demand is reduced by computing when these should be turned off.
Lessons: We extrapolate discrete actions (turning pumps and fans off and on) from the continuous data from the plant in order to optimize a financial goal.

Contents

1  Overview of Heuristic Optimization . . . 1
   1.1  What is Optimization? . . . 1
        1.1.1  Searching vs. Optimization . . . 2
        1.1.2  Constraints . . . 3
        1.1.3  Finding through a little Searching . . . 3
        1.1.4  Accuracy . . . 4
        1.1.5  Certainty . . . 4
   1.2  Exact vs. Heuristic Methods . . . 5
        1.2.1  Exact Methods . . . 5
        1.2.2  Heuristic Methods . . . 6
        1.2.3  Multi-Objective Optimization . . . 7
   1.3  Practical Issues . . . 9
   1.4  Example Theoretical Problems . . . 11

2  Statistical Analysis in Solution Space . . . 13
   2.1  Basic Vocabulary of Statistical Mechanics . . . 14
   2.2  Postulates of the Theory . . . 18
   2.3  Entropy . . . 20
   2.4  Temperature . . . 23
   2.5  Ergodicity . . . 25

3  Project Management . . . 29
   3.1  Waterfall Model vs. Agile Model . . . 30
   3.2  Design of Experiments . . . 34
   3.3  Prioritizing Goals . . . 35

4  Pre-processing: Cleaning up Data . . . 37
   4.1  Dirty Data . . . 37
   4.2  Discretization . . . 38
        4.2.1  Time-Series from Instrumentation . . . 38
        4.2.2  Data not Ordered in Time . . . 39
   4.3  Outlier Detection . . . 40
        4.3.1  Unrealistic Data . . . 41
        4.3.2  Unlikely Data . . . 41
        4.3.3  Irregular and Abnormal Data . . . 41
        4.3.4  Missing Data . . . 42
   4.4  Data Reduction / Feature Selection . . . 43
        4.4.1  Similar Data . . . 43
        4.4.2  Irrelevant Data . . . 43
        4.4.3  Redundant Data . . . 44
        4.4.4  Distinguishing Features . . . 44
   4.5  Smoothing and De-noising . . . 47
        4.5.1  Noise . . . 47
        4.5.2  Singular Spectrum Analysis . . . 48
   4.6  Representation and Sampling . . . 50
   4.7  Interpolation . . . 51
   4.8  Case Study: Self-Benchmarking in Maintenance of a Chemical Plant . . . 53
        4.8.1  Benchmarking . . . 53
        4.8.2  Self-Benchmarking . . . 54
        4.8.3  Results and Conclusions . . . 56
   4.9  Case Study: Financial Data Analysis for Contract Planning . . . 58
   4.10 Case Study: Measuring Human Influence . . . 62
   4.11 Case Study: Early Warning System for Importance of Production Alarms . . . 63

5  Data Mining: Knowledge from Data . . . 67
   5.1  Concepts of Statistics and Measurement . . . 67
        5.1.1  Population, Sample and Estimation . . . 67
        5.1.2  Measurement Error and Uncertainty . . . 68
        5.1.3  Influence of the Observer . . . 70
        5.1.4  Meaning of Probability and Statistics . . . 71
   5.2  Statistical Testing . . . 73
        5.2.1  Testing Concepts . . . 73
        5.2.2  Specific Tests . . . 75
               5.2.2.1  Do two datasets have the same mean? . . . 75
               5.2.2.2  Do two datasets have the same variance? . . . 76
               5.2.2.3  Are two datasets differently distributed? . . . 76
               5.2.2.4  Are there outliers and, if so, where? . . . 77
               5.2.2.5  How well does this model fit the data? . . . 78
   5.3  Other Statistical Measures . . . 79
        5.3.1  Regression . . . 79
        5.3.2  ANOVA . . . 81
        5.3.3  Correlation and Autocorrelation . . . 84
        5.3.4  Clustering . . . 85
        5.3.5  Entropy . . . 89
        5.3.6  Fourier Transformation . . . 91
   5.4  Case Study: Optical Digit Recognition . . . 92
   5.5  Case Study: Turbine Diagnosis in a Power Plant . . . 96
   5.6  Case Study: Determining the Cause of a Known Fault . . . 102
   5.7  Markov Chains and the Central Limit Theorem . . . 105
   5.8  Bayesian Statistical Inference and the Noisy Channel . . . 107
        5.8.1  Introduction to Bayesian Inference . . . 107
        5.8.2  Determining the Prior Distribution . . . 108
        5.8.3  Determining the Sampling Distribution . . . 110
        5.8.4  Noisy Channels . . . 110
               5.8.4.1  Building a Noisy Channel . . . 111
               5.8.4.2  Controlling a Noisy Channel . . . 112
   5.9  Non-Linear Multi-Dimensional Regression . . . 113
        5.9.1  Linear Least Squares Regression . . . 113
        5.9.2  Basis Functions . . . 114
        5.9.3  Nonlinearity . . . 115
   5.10 Case Study: Customer Segmentation . . . 117

6  Modeling: Neural Networks . . . 121
   6.1  What is Modeling? . . . 121
        6.1.1  Data Preparation . . . 124
        6.1.2  How much data is enough? . . . 125
   6.2  Neural Networks . . . 126
   6.3  Basic Concepts of Neural Network Modeling . . . 129
   6.4  Feed-Forward Networks . . . 131
   6.5  Recurrent Networks . . . 132
   6.6  Case Study: Scrap Detection in Injection Molding Manufacturing . . . 135
   6.7  Case Study: Prediction of Turbine Failure . . . 140
   6.8  Case Study: Failures of Wind Power Plants . . . 143
   6.9  Case Study: Catalytic Reactors in Chemistry and Petrochemistry . . . 148
   6.10 Case Study: Predicting Vibration Crises in Nuclear Power Plants . . . 152
   6.11 Case Study: Identifying and Predicting the Failure of Valves . . . 155
   6.12 Case Study: Predicting the Dynamometer Card of a Rod Pump . . . 157

7  Optimization: Simulated Annealing . . . 165
   7.1  Genetic Algorithms . . . 166
   7.2  Elementary Simulated Annealing . . . 167
   7.3  Theoretical Results . . . 169
   7.4  Cooling Schedule and Parameters . . . 172
        7.4.1  Initial Temperature . . . 173
        7.4.2  Stopping Criterion (Definition of Freezing) . . . 174
        7.4.3  Markov Chain Length (Definition of Equilibrium) . . . 175
        7.4.4  Decrement Formula for Temperature (Cooling Speed) . . . 177
        7.4.5  Selection Criterion . . . 178
        7.4.6  Parameter Choice . . . 178
   7.5  Perturbations for Continuous and Combinatorial Problems . . . 181

xxii

Contents

7.6 Case Study: Human Brains use Simulated Annealing to Think . . . . . 183 7.7 Determining an Optimal Path from A to B . . . . . . . . . . . . . . . . . . . . . . 186 7.8 Case Study: Optimization of the M¨uller-Rochow Synthesis of Silanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 7.9 Case Study: Increase of Oil Production Yield in Shallow-Water Offshore Oil Wells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 7.10 Case Study: Increase of coal burning efﬁciency in CHP power plant 197 7.11 Case Study: Reducing the Internal Power Demand of a Power Plant 199 8

The human aspect in sustainable change and innovation . . . . . . . . . . . . 201 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 8.1.1 Deﬁning the items: idea, innovation, and change . . . . . . . . . . 202 8.1.2 Resistance to change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 8.2 Interface Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 8.2.1 The Deliberate Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 8.2.2 The Healthy Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 8.3 Innovation Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 8.4 Handling the Human Aspect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 8.4.1 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 8.4.2 KPIs for team engagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 8.4.3 Project Preparation and Set Up . . . . . . . . . . . . . . . . . . . . . . . . . 221 8.4.4 Risk Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 8.4.5 Roles and responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 8.4.6 Career development and sustainable change . . . . . . . . . . . . . . 228 8.4.7 Sustainability in Training and Learning . . . . . . . . . . . . . . . . . . 231 8.4.8 The Economic Factor in Sustainable Innovation . . . . . . . . . . . 232 8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

Chapter 1

Overview of Heuristic Optimization

1.1 What is Optimization?

Suppose we have a function f(x) where the variable x may be a vector of many dimensions. We seek the point x* such that f(x*) is the maximum value among all possible f(x). This point x* is called the global optimum of the function f(x). It is possible that x* is a unique point, but it is also possible that there are several points that share the maximal value f(x*). Optimization is the field of mathematics that concerns itself with finding the point x* given the function f(x).

There are two fine distinctions to be made relative to this. First, the point x* is the point with the highest f(x) for all possible x and as such the global optimum. We are usually interested in this global optimum. There also exists the concept of a local optimum, which is the point with the highest f(x) for all x in the neighborhood of the local optimum. For example, any peak is a local optimum, but only the highest peak is the global maximum. Usually we are not interested in finding local optima, but we are interested in recognizing them, because we want to be able to determine that, while we are on a peak, there exists a higher peak elsewhere.

Second, the phrase "all possible x" needs careful consideration. Usually any value of the independent variable is allowed, x ∈ (−∞, ∞), but in some cases the independent variable is restricted. Such restrictions may be very simple, like 3 ≤ x ≤ 18. Some may be complex by not giving explicit limitations but rather tying two elements of the independent variable vector together, e.g.

x1
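To make the distinction between local and global optima concrete, here is a deliberately naive sketch (ours, not an algorithm from this book) that locates the global maximum of a one-dimensional function by brute force over a grid. The heuristics of the later chapters exist precisely because such exhaustive search becomes hopeless in many dimensions.

```python
def global_max_on_grid(f, lo, hi, steps=10000):
    """Brute-force search for the global maximum of f on [lo, hi].

    Every grid point that beats its neighbors is a local optimum;
    we keep only the best one, i.e. the global optimum on the grid.
    """
    best_x, best_f = lo, f(lo)
    for k in range(1, steps + 1):
        x = lo + (hi - lo) * k / steps
        fx = f(x)
        if fx > best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# Two peaks near x = -1 and x = +1; the peak at +1 is slightly higher,
# so it is the global optimum while the one at -1 is only local.
two_peaks = lambda x: -(x**2 - 1) ** 2 + 0.2 * x
```

Running `global_max_on_grid(two_peaks, -2.0, 2.0)` returns a point near x = 1, not the local peak near x = −1.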

p we accept the null hypothesis and otherwise accept the alternative hypothesis. In the following treatment, we will be assuming this method. By using any standard statistical software, you will be able to follow it. In summary, this is the method:

1. Compute the test statistic and call the value x,


2. Compute the probability P(X < x),
3. Choose a significance level 1 − p, and
4. If P(X < x) > p, accept the null hypothesis and otherwise accept the alternative hypothesis.

In the next section, we will state a few specific tests. For each, we will give the formulas for computing the test statistic, the distribution function and the identity of the null and alternative hypotheses. This makes the above method definite apart from choosing p, which must be done in dependence upon the practical problem at hand.

5.2.2 Specific Tests

Please note that statistical theory has constructed a great many tests for various purposes; sometimes there are even several tests for a single purpose. This book does not aim to give an exhaustive treatment. We will give a test for those questions that we consider relevant for basic data analysis in the process industry.

5.2.2.1 Do two datasets have the same mean?

For the two datasets A and B, that are thought to have the same variance, we compute the t-statistic

t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{ \dfrac{\sum_{i \in A}(x_i - \bar{x}_A)^2 + \sum_{i \in B}(x_i - \bar{x}_B)^2}{N_A + N_B - 2} \left( \dfrac{1}{N_A} + \dfrac{1}{N_B} \right) }}

and the distribution function

P(X < t) = \frac{1}{\sqrt{\nu}\, B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \int_{-t}^{t} \left( 1 + \frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}} dx

where the number of degrees of freedom is ν = N_A + N_B − 2, B(···) is the beta function, and N_A and N_B are the numbers of observations in either dataset. If the two datasets do not have the same variance, the t-statistic is

t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\sigma_A^2 / N_A + \sigma_B^2 / N_B}}

with σ_A² the variance of dataset A, and the number of degrees of freedom is

\nu = \frac{\left( \sigma_A^2 / N_A + \sigma_B^2 / N_B \right)^2}{\dfrac{(\sigma_A^2 / N_A)^2}{N_A - 1} + \dfrac{(\sigma_B^2 / N_B)^2}{N_B - 1}}


while the distribution function remains unchanged. The null hypothesis is that the means are the same and the alternative hypothesis is that the means are different.
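Both t-statistics translate into a few lines of code. The sketch below (function names are ours, not the book's) implements the pooled and the unequal-variance versions on plain Python lists; evaluating P(X < t) would additionally require the incomplete beta function from a statistics library.

```python
import math

def t_statistic_equal_var(a, b):
    """Pooled two-sample t-statistic (equal variances assumed).
    Returns the statistic and the degrees of freedom nu = N_A + N_B - 2."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)      # sum of squared deviations, A
    ssb = sum((x - mb) ** 2 for x in b)      # sum of squared deviations, B
    nu = na + nb - 2
    pooled = math.sqrt((ssa + ssb) / nu * (1 / na + 1 / nb))
    return (ma - mb) / pooled, nu

def t_statistic_welch(a, b):
    """Unequal-variance t-statistic with its effective degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variance of A
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)   # sample variance of B
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    nu = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, nu
```

For two small samples of equal spread, both versions naturally agree.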

5.2.2.2 Do two datasets have the same variance?

For the two datasets A and B, that are thought to have the same mean, we compute the F-statistic

F = \frac{\sigma_A^2}{\sigma_B^2}

where σ_A² > σ_B². The distribution function is

P(X < F) = 2 - I_{\frac{\nu_B}{\nu_B + \nu_A F}}\!\left( \frac{\nu_B}{2}, \frac{\nu_A}{2} \right) - I_{\frac{\nu_A}{\nu_A + \nu_B F}}\!\left( \frac{\nu_A}{2}, \frac{\nu_B}{2} \right)

with I(···) the incomplete beta function. The null hypothesis is that they have equal variances and the alternative hypothesis is that the variances are different.

5.2.2.3 Are two datasets differently distributed?

There are different approaches depending on the nature of the two distributions. We have to answer whether we are comparing an empirical distribution to a theoretically expected distribution or to another empirical distribution. We also have to answer whether the empirical data is in the form of binned data or available as a continuously valued distribution. Note that while binning involves a loss of information and an arbitrary choice of bins, it is necessary if the dataset is in itself not a distribution, and will thus convert the dataset into a probability distribution. If possible, one should not bin datasets.

From these two questions, we arrive at four possibilities. In all cases the null hypothesis is that the two sets are equally distributed and the alternative hypothesis is that they are differently distributed. Note that this test makes no statements as to how they are distributed but merely as to same or different.

One binned empirical distribution against a theory: The empirical distribution has n_i observations in bin i where we expect to find m_i observations, and so we create the chi-squared statistic

\chi^2 = \sum_i \frac{(n_i - m_i)^2}{m_i}

with distribution

P(X < \chi^2) = \frac{1}{\Gamma(a)} \int_0^{\chi^2 / 2} e^{-t} t^{a-1} \, dt


where a is half the number of degrees of freedom. The degrees of freedom are the number of bins used minus the number of constraint equations imposed on the theory, e.g. that the sum of expected bin counts over the theory equals that over the empirical data, ∑ m_i = ∑ n_i.

Two binned empirical distributions: When the first distribution has n_i observations in bin i and the second has m_i observations, the chi-squared statistic is

\chi^2 = \sum_i \frac{\left( \sqrt{M/N}\, n_i - \sqrt{N/M}\, m_i \right)^2}{n_i + m_i}

where N = ∑ n_i and M = ∑ m_i. The distribution function is the same as above.

One continuous empirical distribution against a theory: The empirical distribution n(x) is compared to a theory m(x) by computing the very simple Kolmogorov-Smirnov statistic

D = \max_{-\infty < x < \infty} |n(x) - m(x)|

5.2.2.4 Are there outliers and, if so, where?

If ESD_i > λ_i for a particular i, then we look for the largest such i, i.e. l = max{i | ESD_i > λ_i}, and declare the l points x_ext(S_0), x_ext(S_1), ..., x_ext(S_{l−1}) as outliers.
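The two binned chi-squared statistics above can be sketched directly (helper names are ours); the resulting value would then be fed into the chi-squared distribution via an incomplete gamma routine from a statistics library.

```python
import math

def chi2_vs_theory(observed, expected):
    """Chi-squared statistic: one binned empirical distribution vs. a theory.
    observed[i] = n_i counts in bin i, expected[i] = m_i expected counts."""
    return sum((n - m) ** 2 / m for n, m in zip(observed, expected))

def chi2_two_binned(ns, ms):
    """Chi-squared statistic for two binned empirical distributions,
    with the sqrt(M/N), sqrt(N/M) rescaling for unequal totals."""
    N, M = sum(ns), sum(ms)
    return sum((math.sqrt(M / N) * n - math.sqrt(N / M) * m) ** 2 / (n + m)
               for n, m in zip(ns, ms) if n + m > 0)
```

Identical histograms give a statistic of zero, as expected.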

5.2.2.5 How well does this model fit the data?

Suppose we have experimental data (x_i, y_i) for many i. We then suppose that the function f() represents the relationship between x_i and y_i to within our desired accuracy in the manner that y_i ≈ f(x_i). Note that it is unrealistic to expect that the equality be precise, i.e. y_i = f(x_i). The reason is simply that experimental data is always subject to measurement errors and, generally, no model considers all effects.

Generally, a model f() has parameters a_1, a_2, ···, a_m that must be determined in some manner. The fitting problem consists of finding the values of these parameters such that the model fits the data optimally, i.e. that ∑_i (f(x_i) − y_i)² is a minimum over all possible sets of parameters. This approach is called the least squares fitting approach, as one attempts to find the least sum over squared terms. One typically finds this method illustrated in books in the context of fitting a straight line, f(x) = mx + b with m and b being the parameters, but the approach is quite general. See section 5.3.1 for this.

The approach of least squares only prescribes the utility function (the above sum over squares) and not the method for finding the parameters. Finding these is a different issue and we must generally use a full optimization algorithm to do it. Even though one will usually see least squares fitting in the context of straight lines, please note that straight lines are rare in real life situations. We almost always encounter non-linear situations and so f() must be a non-linear function. The methods to find the parameters must then take account of this. The optimization methods discussed in chapter 7 can handle all such situations. Clearly, we can use the sum ∑_i (f(x_i) − y_i)² as a score to rank different parameter assignments and choose the best one; that is the least squares method.
However, when we have chosen the best one and have finished the least squares method, we are left with the questions: Is the f() really representative of the data? Could a different f() have represented the data better? Is this least squares sum "good enough" for our practical purpose?

¹ We start with the entire dataset, compute its mean and select the point that is furthest from the mean on either side. We remove this point to get dataset S_1. We recompute the mean and select the point that is now furthest from the mean and proceed like this. If we are only interested in removing outliers from one end of the distribution, we must only search for points on the interesting end.

The least squares approach makes no statements to this effect as it considers a single f() that is given to it by the human operator of the method. It is we who must choose the f() and here lies the magic of modeling. If you have chosen the modeling function f() well, the modeling and optimizing are merely laborious steps that will lead you to your goal; if you have chosen f() poorly, those steps will be a waste of time. Thus, it is elemental to verify that the model fits the data. For this purpose, we will use the chi-squared test. First, we compute the chi-squared statistic

\chi^2 = \sum_{i=1}^{n} \frac{(y_i - f(x_i))^2}{f(x_i)}.

The probability distribution is the same as the one from section 5.2.2.3,

P(X < \chi^2) = \frac{1}{\Gamma(a)} \int_0^{\chi^2 / 2} e^{-t} t^{a-1} \, dt

where a is half the number of degrees of freedom. The hypothesis that the model does indeed correctly represent the data collected is to be accepted if this probability is larger than your chosen significance level (usually 0.95 or 0.99); otherwise the model is a poor one at this significance level. Note that this method makes no statements about how to fix the problem if your model is poor. You must choose another one by yourself and try again!
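The statistic itself is a one-liner; a sketch (the helper name is ours) under the assumption that the model values f(x_i) are positive:

```python
def chi2_model_fit(xs, ys, f):
    """Chi-squared statistic measuring how well the model f fits (x_i, y_i),
    as defined above: sum of (y_i - f(x_i))^2 / f(x_i)."""
    return sum((y - f(x)) ** 2 / f(x) for x, y in zip(xs, ys))
```

A perfect fit yields exactly zero; any mismatch contributes positively.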

5.3 Other Statistical Measures

5.3.1 Regression

The process of fitting a model to data, described loosely above in section 5.2.2.5, is often called regression. The term regression, with its generally unflattering connotations, derives from the first academic use of the method: to describe the phenomenon that the descendants of tall ancestors tend to get shorter and approach the mean height of the population over the generations. This term is very commonly used for the general problem of fitting a model to data. Often the word is used in the context of straight lines. If it is used in a wider context, the literature generally speaks of non-linear regression, which then refers to the general fitting problem. Here, we will briefly discuss a few special cases. We will present only the result, i.e. the formulas by which to compute the parameter values. If you are interested


in how these were arrived at, there are many books that will describe this in detail, e.g. [62]. Recall that, with fitting, we are concerned with determining the values of the parameters from empirical data given a known model. The process of deciding on the model is generally a human decision with all of its subjective, alchemical features, leaving it impossible to treat in a mathematical book.

We will suppose that your data includes N observations in the form (x_i, y_i) where we will suppose that the y_i are dependent upon the x_i in some way that we wish to model. Both the x_i and the y_i may be vectors and need not be single values. The model takes the general form y = f(x). It is our hope that y_i ≈ f(x_i) with a "good" accuracy. Presumably, we have decided what accuracy we require for our practical purpose. We will also assume that the y_i are empirically measured quantities that have a known measurement error σ_i inherent in them, so that each measurement is actually y_i ± σ_i. Should you wish to ignore this feature, please set σ_i = 1 for all i in the formulas below.

Should you choose a straight line, f(x) = mx + b, you will need to determine two parameters from the data: the slope m and the y-intercept b. This is how:

\Delta = \left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \right) \left( \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2} \right) - \left( \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \right)^2    (5.1)

b = \left[ \left( \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2} \right) \left( \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2} \right) - \left( \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \right) \left( \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2} \right) \right] / \Delta    (5.2)

m = \left[ \left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \right) \left( \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2} \right) - \left( \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \right) \left( \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2} \right) \right] / \Delta    (5.3)
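Equations (5.1)-(5.3) translate directly into code. This sketch (names are ours) takes plain Python lists and defaults to σ_i = 1 when no measurement errors are given:

```python
def fit_line_weighted(xs, ys, sigmas=None):
    """Weighted least-squares straight line y = m*x + b, eqs. (5.1)-(5.3)."""
    if sigmas is None:
        sigmas = [1.0] * len(xs)          # ignore measurement errors
    w = [1.0 / s ** 2 for s in sigmas]
    S   = sum(w)                                          # sum 1/sigma^2
    Sx  = sum(wi * x for wi, x in zip(w, xs))             # sum x/sigma^2
    Sy  = sum(wi * y for wi, y in zip(w, ys))             # sum y/sigma^2
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))         # sum x^2/sigma^2
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))  # sum x*y/sigma^2
    delta = S * Sxx - Sx ** 2                             # eq. (5.1)
    b = (Sxx * Sy - Sx * Sxy) / delta                     # eq. (5.2)
    m = (S * Sxy - Sx * Sy) / delta                       # eq. (5.3)
    return m, b
```

Fitting points that lie exactly on y = 2x + 1 recovers m = 2 and b = 1.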

Please note that many formulas that do not appear to be linear at first sight actually are after the variable has been transformed, e.g.

y = e^{a+x} + b  →  y = a'z + b  with  a' = e^a  and  z = e^x    (5.4)
y = (ax)^c + b  →  y = a'z + b  with  a' = a^c  and  z = x^c    (5.5)
y = ab^x  →  y' = a' + b'x  with  y' = ln y,  a' = ln a  and  b' = ln b.    (5.6)

Suppose that this model is not sufficient and you would like to make things more interesting. General linear least squares focuses on models that are linear (in the parameters) but may take non-linear basis functions. An example is the polynomial,

y = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots + a_n x^{n-1},

but generally any function linear in the parameters is fine. The most general form is

y = \sum_{i=1}^{M} a_i X_i(x)


where the X_i(x) are functions of x only and have no unspecified parameters. To obtain the unknown parameter values a_i, we go through the following steps,

A_{ij} = \frac{X_j(x_i)}{\sigma_i}, \text{ the design matrix}    (5.7)
b_i = \frac{y_i}{\sigma_i}, \text{ the result vector}    (5.8)
a_i, \text{ the solution vector}    (5.9)
\lambda = A^T \cdot A    (5.10)
\beta = A^T \cdot b    (5.11)
a = \lambda^{-1} \cdot \beta.    (5.12)

Please note that the matrix A should have more rows than columns, as we should have more data points than unknown parameters. All steps are easy except for the last, which requires us to compute λ^{−1}. As this matrix is generally large, an explicit inversion is not a good idea for many numerical reasons. The solution of the implicit equation λ · a = β for a must therefore be done numerically. We suggest singular value decomposition (SVD) as the method of choice, as it deals well with the round-off errors that accumulate. Explaining the SVD method would go beyond the scope of this book; any linear algebra book will explain this in painful detail, e.g. [122]. This paragraph will just explain the procedure once this decomposition is done. So, we assume that A has been SVD decomposed into A = U · W · V^T where W is the diagonal matrix of singular values, U is column orthogonal and V is orthogonal. Then,

a = \sum_{i=1}^{M} \left( \frac{U_{(i)} \cdot b}{W_{ii}} \right) V_{(i)}

where the subscript (i) refers to taking the i-th column of the respective matrix.
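A compact sketch of steps (5.7)-(5.12) for a small problem follows (names are ours). One deliberate simplification: instead of the SVD recommended above, we solve the small system λ · a = β by Gaussian elimination, which is acceptable only for tiny, well-conditioned problems; a production code should use SVD as the text advises.

```python
def design_matrix(xs, sigmas, basis):
    """A[i][j] = X_j(x_i) / sigma_i, eq. (5.7)."""
    return [[f(x) / s for f in basis] for x, s in zip(xs, sigmas)]

def solve(mat, vec):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(vec)
    M = [row[:] + [v] for row, v in zip(mat, vec)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (M[r][n] - sum(M[r][c] * a[c] for c in range(r + 1, n))) / M[r][r]
    return a

def linear_least_squares(xs, ys, sigmas, basis):
    """General linear least squares: steps (5.7)-(5.12)."""
    A = design_matrix(xs, sigmas, basis)
    b = [y / s for y, s in zip(ys, sigmas)]           # eq. (5.8)
    n, rows = len(basis), len(xs)
    lam = [[sum(A[k][i] * A[k][j] for k in range(rows)) for j in range(n)]
           for i in range(n)]                         # lambda = A^T A, (5.10)
    beta = [sum(A[k][i] * b[k] for k in range(rows)) for i in range(n)]  # (5.11)
    return solve(lam, beta)                           # a = lambda^{-1} beta, (5.12)
```

Fitting a quadratic basis to data generated from y = 1 + 2x + 3x² recovers the coefficients.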

5.3.2 ANOVA

ANOVA is an abbreviation for analysis of variance, a very popular method that is also very prone to misunderstandings. Note carefully that ANOVA is not a statistical test. That is, it does not answer any question with a yes or no answer, nor does it confirm or deny any hypotheses. The method is used to investigate the effect of qualitative factors on a quantitative result. Should you have quantitative factors, you can always suppress this by grouping them into low-middle-high or good-bad groups and use ANOVA to deal with that. The principal assumption behind ANOVA is that the relationship between factors and result is linear. This is a crucial assumption, as many relationships are known to be non-linear, in which case this method will do you no good. Historically, the method was devised as a part of the theory of the design of experiments in


which we plan experiments before carrying them out in order to reduce the work to a minimum while gathering all required data for a certain desired future analysis.

The basic idea of ANOVA is to see if the variance of the result can be explained on the basis of the differing categories of the factors. For this, we make experiments and compute variances within and between groups of categories to see if they differ significantly. This allows conclusions as to whether the division of a factor into its categories is sensible or useful with respect to the result. For instance, we may use ANOVA to compare a control group to an experimental group to see if the factor that is different really causes some observable difference in some result. As such, ANOVA is a central and commonly used tool in many experimental studies (particularly involving people). Because it is based on categorical variables rather than numerical ones, it is naturally suited towards the social studies even though it is also used in the sciences.

One use to which we may put ANOVA is the attribution of causes to effects. We may for instance find that there is a link between touching dirt and washing hands and we may find that this link has a preferred time-wise direction: generally, touching dirt occurs before washing hands. This may lead us to believe that touching dirt is a cause for washing hands. ANOVA allows us to draw this conclusion in a well-defined procedural way. As such it is useful to deal with situations in which we lack the common sense that we all have in relation to dirt and washing.

In order to use the method, the experimental data must satisfy a few conditions. If the data does not satisfy these, then the method is unusable. This is also the reason for which you should plan your experiments, using design-of-experiments methods, before you make them. The conditions are:

1. The observations in any one group should be homogeneously distributed. This means that if we break our group of measurements into several (arbitrary) smaller groups, these should not differ in terms of their natural variation or distribution. A common example of when this does not hold is a time series in which observations early in time are tightly clustered around a mean and then, as time passes, get less and less clustered around the mean. This group of measurements is not homogeneously distributed but heterogeneously distributed. The technical terms for this are homoscedasticity versus heteroscedasticity. There are various tests to confirm or deny this, but these would go beyond the scope of this book. You will find them in many statistical books, e.g. [71].

2. The observations over the groups of any one factor are assumed to be normally distributed. If, for example, we have a factor with three groups (high-middle-low), then we would expect to find an equal number of observations in the high and low groups and a correspondingly larger amount in the middle category.

3. The observations must be independent of each other. In a time series, for example, this is definitely not true, as the observation at a later time generally depends causally (often in a known way) on the observation at an earlier time.

4. If you have more than one factor, it is extremely helpful to have an equal number of observations in each combination of groups of all the factors. For example, if we have two factors and each factor has three groups (high-middle-low), then we


have nine combinations of these groups (high-high, high-middle, high-low, ...). This is called balance. It is not a strict requirement, but life is easier if it is true.

It can easily be seen that if we are strict about these conditions (especially 2 and 3) most datasets cannot be analyzed with ANOVA. We urge you here, without any sarcasm, to consider the principle "Never trust a statistic you didn't fake yourself," attributed to Winston Churchill. Should you wish to continue beyond this point, we must now set up the presumed model. Here we will only treat the case of a single factor. This is not very realistic, but treating more complex cases would go beyond the scope of this book, see e.g. [87]. We then assume that the relationship between the result y and the factor is given by

y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \quad \text{for } i = 1, 2, \cdots, k; \; j = 1, 2, \cdots, n_i

where y_{ij} is the result and is normally distributed within each group, μ is the mean of the entire dataset, α_i is the effect of the i-th factor group, ε_{ij} is a catch-all for all sorts of external random disturbances to our experiment, k is the number of groups in our factor, and n_i is the number of observations in the i-th group.

Practically, we must now compute the variance in each group σ_i², the mean in each group x̄_i and the variance of the entire dataset σ². Then we compute the F statistic

F = \frac{n_1 n_2 (\bar{x}_1 - \bar{x}_2)^2}{(n_1 + n_2)\sigma^2} = \frac{n_1 n_2 (\bar{x}_1 - \bar{x}_2)^2}{n_1 \sigma_1^2 + n_2 \sigma_2^2}

to compare two groups with each other. To finish the F-test, we must look up the value of the probability distribution of the F statistic

P(X < F) = 2 - I_{\frac{\nu_1}{\nu_1 + \nu_2 F}}\!\left( \frac{\nu_1}{2}, \frac{\nu_2}{2} \right) - I_{\frac{\nu_2}{\nu_2 + \nu_1 F}}\!\left( \frac{\nu_2}{2}, \frac{\nu_1}{2} \right)

with I(···) the incomplete beta function and ν_1 and ν_2 being the numbers of degrees of freedom in each group. The null hypothesis in this case was that the two variances are the same. If we reject this, we must conclude that the variances are indeed different. In this case, we would conclude that the factor does indeed have an influence upon the result. Note that we did not say anything about the nature or strength of this influence. The significance level at which this can be claimed is important, as this can be interpreted to be the degree to which this factor can be used to explain the variation in the result.

In this case, we used the F-test, and this is quite frequent in ANOVA. Even though we ended up using an F-test, please note that ANOVA itself is not a test and involves a lot more than a test. For realistic cases, we would, of course, use tests with several factors, but this goes beyond the level of this treatment, which is intended more as a caveat than a genuine introduction.
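As a toy illustration, the two-group F statistic above can be computed directly. The helper name and the use of population variances (dividing by n) are our assumptions, since the text does not pin down the variance convention:

```python
def f_statistic_two_groups(g1, g2):
    """Two-group F statistic: n1*n2*(mean1 - mean2)^2 / (n1*var1 + n2*var2)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((x - m1) ** 2 for x in g1) / n1   # population variance of group 1
    v2 = sum((x - m2) ** 2 for x in g2) / n2   # population variance of group 2
    return n1 * n2 * (m1 - m2) ** 2 / (n1 * v1 + n2 * v2)
```

A large F (here driven by the gap between the group means relative to their spread) is what points to a real factor influence.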


5.3.3 Correlation and Autocorrelation

The concept of correlation is very simple. When one thing is changed, does another thing change as a consequence? If yes, these are correlated. For example, if we increase the temperature in a vessel, then the pressure will rise (supposing nothing else was changed as well). Thus, these are correlated under these circumstances. In the ideal gas situation, the variation of pressure and temperature is linear and positive: if pressure rises by a factor a, then temperature will also rise (positive correlation) by the same factor a (linear correlation). The correlation is so strong that, if volume is not modified, one variable will be enough to compute the value of the other.

We measure correlation strength numerically on a scale between −1 and 1. In the above example, the correlation would be 1. If we compare the effect of studying hard on exam results, we would expect a correlation to exist and for this correlation to be positive, but it certainly will not be 1, because certain people are more prone to the subject and will thus get higher grades than others with the same amount of studying. The effect of sunlight exposure on the water level in a glass is negative: as the sun continues to shine, the water level drops due to evaporation.

Two variables x and y given by a group of measurements (x_i, y_i) have a linear correlation coefficient, or Pearson's r, of

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}

where the overline indicates taking an average. You should not try to determine if the correlation is significant based on r; it is not suitable for that. However, if the correlation is significant, r is a good measure of its strength. Depending on your field of inquiry, various r values are considered "good" by the community. If you are a physicist, then you would be looking for r = 0.99 or so to explain an effect. Dealing with experimental data from a real life situation (as opposed to laboratory conditions), you should be happy with r = 0.8 or so. Many studies based on questionnaires among humans will try to interpret correlations of r = 0.4 and sometimes lower to have sensible meaning. Sometimes working with such low correlations is necessary when the influencing factors are too many or cannot all be measured.

In our efforts to make a mathematical model of a natural phenomenon that is known via experimental data from an industrial process, we should be happy with r between 0.8 and 0.9. If we achieve an r higher than this, we should suspect ourselves of over-fitting the model, i.e. introducing so many parameters that the model can memorize the data instead of extrapolating intelligently. Such over-fitting is a modeler's suicide and must be avoided.

Note that the above formula is the linear correlation coefficient. This assumes that the relationship between the two variables is linear. This is an important restriction as few real life relationships are linear. In order to use a non-linear correlation, we must first specify the exact form of the relationship that we believe to hold. For this reason, we cannot treat this here; it must be done on a case by case basis. For a quick-and-dirty check whether two variables have anything at all to do with each other, the linear coefficient is a reasonable thing to compute. Just don't base any important arguments on it.

In industrial practice, we most often deal with time-series. A time-series is a variable that depends on or changes with time. A very interesting question about such a time-series is whether we can get a feeling for future values based on past values, i.e. knowing x(0), x(1), x(2), ..., x(t), can we make a reasonable guess at x(t + 1) and so on? To answer this, we will be asking how a variable correlates with itself at an earlier time. This is called the autocorrelation function R(τ),

R(\tau) = \int_{-\infty}^{\infty} x(t)\, x(t - \tau)\, dt.

The new variable τ is called the lag of the time-series. Usually, we normalize the autocorrelation such that R(0) = 1. The autocorrelation indicates the correlation of the variable with itself at different times. The value R(τ) is a measure of the influence that the value x(t − τ) has on the value of x(t).

Take the example of a retail business measuring its total revenue once per month. Business is mostly stable, except in December where the Christmas business doubles revenue. We will therefore observe that R(12) for this time-series will be much higher than the rest of the autocorrelation function. This indicates that there is a strong cyclic behavior with time-lag 12 (measured in months due to the data taking cadence), which agrees with our expectation. You will probably also see a stronger dimple at time-lags of 4 and 8. These will be due to the Easter business (Easter is usually 4 months after Christmas and it usually takes 8 months from Easter to Christmas). Thus, the strength of the Christmas business may be used somewhat to predict the strength of next Easter's business and also next Christmas' business. That is the point of autocorrelation. The score R(τ) can only be interpreted relatively and offers a useful indication for further investigation.

Note that autocorrelation is time directed. That is, we measure the correlation that past events have with future events. Thus, autocorrelation is a measure of the strength of predictability: knowing a historical fact, how reliable is an estimate of future performance on its basis (relative to knowing the future fact by simply waiting for it, which is equal to 1 by normalizing the function)? Thus, autocorrelation identifies cause-effect relationships within a single variable's time-series.
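Both measures can be sketched for discrete, evenly sampled data (function names are ours; the continuous integral above becomes a finite sum over samples, normalized so that lag 0 gives 1):

```python
import math

def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

def autocorrelation(series, lag):
    """Discrete autocorrelation, normalized so that autocorrelation(s, 0) = 1."""
    r0 = sum(v * v for v in series)
    return sum(series[t] * series[t - lag] for t in range(lag, len(series))) / r0
```

A series that alternates with period 2 shows a strong autocorrelation at lag 2 and none at lag 1, just as the Christmas example shows a peak at lag 12.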

5.3.4 Clustering Suppose that you have many observations of some process and that each observation is a vector of values. We may now wish to group observations into a few qualitatively distinct categories in order to generate some form of understanding about the underlying dynamics that produce the observations. A simple example is a group of people visiting a store. Each purchase action is an observation. The observation itself is a vector of purchased goods. We now want to group the many observations made over some time interval into qualitative groups, e.g. "health conscious client" or "ready made meals client." These groups can then be described both by their purchase habits (what, how often, in which combinations ...) as well as by their economic impact (what revenue, what margins ...) in order to draw conclusions about possible changes to marketing or the store. Each cluster should be as homogeneous as possible and the clusters should be as heterogeneous between each other as possible. The action of grouping vectored observations into phenomenological groups is called clustering. The manner in which observations are clustered has some elements that are common to all methods. Beyond these common elements, there are many algorithms to accomplish a clustering and it is difficult to say a priori which method is best. A major reason for this, in practice, is that it is frequently not clear how to define or measure what is "best"; this emerges only empirically once the results of several methods have been compared by persons with significant domain knowledge. First, all methods require a metric. A metric is a method to compute a distance measure between two observations. Some types of observation (locations and the like) have an inherent sensible distance measure but others (option A instead of option B) are difficult to tie to the concept of distance. Since we are comparing vectors of values, the issue of comparing apples with pears is also a significant problem. For instance, if we measure both temperature and pressure of something in a two-dimensional vector, how are we going to measure the distance between two sample vectors?
The physical units in which these quantities are measured become important. For example, if we measure temperature in kelvin as opposed to degrees Celsius, then the numerical values of all temperatures will be much higher. In any normal distance measurement, the temperature will then be more significant and could thus potentially skew the results. Thus, the design of the metric is an issue that must be resolved carefully in a practical application. Second, each clustering creates so-called centers. These are points in the multidimensional space defined by the observational vectors that indicate the position of the "center" of the cluster. Each cluster has a radius around this center (as measured by the metric function) and all observations within this sphere belong to that center. We may thus list the observations per center and, via descriptive statistics, arrive at a description of each group. This description can then be interpreted and thus knowledge may be derived. Third, some observations may not lie inside any center and are thus called outliers. Generally, these outliers are a few observations that, for one reason or another, are sufficiently atypical that they do not belong to a cluster and also sufficiently few in number not to justify forming a new cluster (or several new clusters depending on the metric function) for them.
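The unit problem can be made concrete in a few lines (a sketch of our own; the per-dimension `scale` values are assumed for illustration, and rescaling by spread is one common remedy, not a prescription from the book):

```python
import numpy as np

# Two observations of (temperature, pressure). In kelvin the temperature
# axis numerically dwarfs the pressure axis.
a = np.array([293.15, 1.01])   # 20 deg C, 1.01 bar
b = np.array([303.15, 1.51])   # 30 deg C, 1.51 bar

# Raw Euclidean distance: dominated by the 10-unit temperature gap.
d_raw = np.linalg.norm(a - b)

# One common remedy: divide each dimension by its typical variation
# before measuring distance, so both dimensions contribute comparably.
scale = np.array([10.0, 0.5])                 # assumed typical spreads
d_scaled = np.linalg.norm((a - b) / scale)
```

With the raw metric, doubling the pressure difference barely moves the distance; after rescaling, temperature and pressure carry equal weight, which is exactly the design decision the text says must be resolved deliberately.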

Outliers are very interesting points as they indicate an abnormal observation. As outliers skew statistical results, it is important to look at these points in detail to determine if they are indeed genuine observations. It is possible that outliers are produced by some form of error in the observation process and would then be excluded from further treatment. If an outlier is a genuine and correct observation, then it offers insights into abnormal events. This may be important for the practical application for a variety of reasons such as capacity planning, which must orient itself toward the events that are maximally taxing and so, by definition, abnormal. Fourth, the number of centers is often a crucial point. In many practical applications, a central question is: Into how many groups is it most sensible to divide the observations? The disadvantage of most clustering methods is that the number of centers must be specified by the user before clustering is begun. In practice then, clustering is re-run with several different settings for the number of centers and each result is examined for "sensibility," whatever this may mean in the practical application at hand. Simple versions of a sensibility definition include (1) a homogeneous distribution of observations within each cluster, (2) no cluster with less/more than a specified percentage of all observations, (3) less than some specified percentage of outliers and (4) a certain specified minimum distance between centers to make sure that clusters are sufficiently distinct. The idea of k-means requires the number of centers, k, to be specified by the user. It is an optimization problem that requires the mean-squared distance of each observation to its nearest center to be minimized by moving the k centers around in the multi-dimensional space provided by the observations. Please note that k-means clustering is not an algorithm but a problem specification.
There are several algorithms that are used to accomplish the solution of the above described problem. In fact, common optimization algorithms may profitably be used for this purpose. There is a specific algorithm, called Lloyd's algorithm, that has been invented just for this purpose: (1) Assign the observations to the centers randomly, (2) compute the location of each center as the centroid of the observations associated with it, (3) move each observation to the center that it is closest to, (4) repeat steps 2 and 3 until no re-assignments of observations to centers are made. This algorithm is simple but it finds only a local minimum. In order for us to find a global minimum, this algorithm is generally enhanced by an incorporation of simulated annealing (see chapter 7). A sample output of a k-means clustering run is shown in figure 5.1. The data is two-dimensional in this case to make drawing an image easier. In practice the number of dimensions would generally be large. The metric used here is the Euclidean metric where the straight line between two points is the shortest distance; this will be inappropriate for many practical applications. Having obtained this output, the question is what to do with it. Clustering means that the observations associated with a particular cluster are in some sense similar and observations associated with different clusters are in the same sense different. The "sense" indicated here is principally measured by the metric function. The metric function is however a stepping stone and not the result because it is
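The four steps of Lloyd's algorithm can be sketched as follows (a minimal illustration on synthetic data, not the book's implementation; the reseeding of an empty centre is our own small safeguard):

```python
import numpy as np

def lloyd(points, k, rng, max_iter=100):
    """Lloyd's algorithm as described in the text: random assignment,
    centroid update, nearest-centre reassignment, repeat until stable."""
    n = len(points)
    # (1) assign the observations to the k centres at random
    assign = rng.integers(0, k, size=n)
    for _ in range(max_iter):
        # (2) each centre becomes the centroid of its observations
        # (an empty centre is reseeded to a random observation)
        centers = np.array([points[assign == j].mean(axis=0)
                            if np.any(assign == j) else points[rng.integers(n)]
                            for j in range(k)])
        # (3) move each observation to its nearest centre (Euclidean metric)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # (4) stop when no reassignments occur
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return centers, assign

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob2 = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
data = np.vstack([blob1, blob2])
centers, assign = lloyd(data, k=2, rng=rng)
```

On two well-separated blobs the centres converge onto the blob means; on realistic data, the text's point stands that this is only a local minimum and restarts or simulated annealing are advisable.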

Fig. 5.1 The output of a k-means clustering with two clusters (top image) and ﬁve clusters (bottom image) on a two-dimensional dataset with the Euclidean distance metric. The centers are marked with stars and the observations with circles.

a convenient numerical measure to help the algorithm but it does not help human understanding. What is needed at this point is to describe each cluster in such a way as to be meaningful to a human being who is charged with interpreting the data. We suggest extracting some of the following statistics, in general, for each cluster: (1) The position of the center, (2) the radius in each dimension, being a measure of how large the cluster is, (3) the number of points belonging to this cluster, (4) the mean and variance over the observations in each cluster, to be compared to the center position and radius as a measure of how tightly clustered the cluster really is. A comparison and critical examination of this data will allow one to discover a number of generally useful conclusions. First, are there artifacts in the data? Artifacts are all those features of the dataset that are not intended to be there, are strange and thus to be excluded. With clustering one is likely to get artifacts from three sources: bad data (solution: better pre-processing, see chapter 4), outliers (solution: critically examine outliers and possibly exclude them) and an incorrect number of centers (solution: change k). One can determine simple artifacts by looking at how clearly clusters are separated from each other. An example will illustrate the point: Two different large cities (e.g. Paris and London) are clusters of houses and are well separated as there are large tracts of virtually houseless lands in between them. However, two suburbs of London are also clusters of houses but they are not well separated – their distinction is a purely administrative one and it is not immediately visible to the tourist that one section stops and another starts. Thus, mathematically speaking, the division of a city into suburbs is an artifact that we would want to exclude (with respect to a certain viewpoint of finding clusters of houses). Clustering is there to discover meaningful qualitative differences.
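The four suggested per-cluster statistics can be collected mechanically; the sketch below (our own, with an assumed dictionary layout) assumes a clustering result in the form produced by a k-means run:

```python
import numpy as np

def describe_clusters(points, centers, assign):
    """Per-cluster summary as suggested in the text: centre position,
    radius per dimension, population, and mean/variance of the members."""
    summary = []
    for j, c in enumerate(centers):
        members = points[assign == j]
        summary.append({
            "center": c,
            "count": len(members),
            # radius per dimension: farthest member deviation along each axis
            "radius": np.abs(members - c).max(axis=0),
            "mean": members.mean(axis=0),
            "variance": members.var(axis=0),
        })
    return summary

# Tiny worked example with two hand-made clusters.
pts = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]])
ctrs = np.array([[0.1, 0.0], [5.0, 5.1]])
asg = np.array([0, 0, 1, 1])
s = describe_clusters(pts, ctrs, asg)
```

Comparing `mean` against `center` and `variance` against `radius`, as the text recommends, immediately shows how tight each cluster really is.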

Second, which dimensions principally distinguish the clusters, i.e. which attributes are sufﬁciently telling about an observation? This will in practice lead to the conclusion that a (hopefully small) subset of the measured parameters is sufﬁcient to distinguish observations into their clusters. This will save effort and cost while still allowing the important conclusions to be drawn. Third, what is the population of the clusters? If there is one cluster with many observations and other clusters with only a few each, then the situation is very different than if each cluster had approximately equally many observations. The one large cluster could be called the “normal” cluster while the others are various kinds of non-normal clusters. It depends upon the application of course, but situations with one large cluster are often correctly interpreted by saying that the large cluster acts as an attractor for the system, i.e. it is a kind of equilibrium state that a participant of the system would like to go towards. In that sense all other clusters would be pseudo-stable states that would eventually decay into the attractor. Thus, the clustering would have distinguished (pseudo-)ergodic sets from each other in the sense of statistical mechanics (see chapter 2). Fourth, what are the application speciﬁc characteristics of each cluster? It is interesting now to compute the distinguishing features of each cluster relative to the application at hand. This depends, of course, on the nature of the problem but industrially speaking these are now parameters focusing on the major areas: safety, reliability, quality, costs, margin/proﬁtability and the use of various resources. In this way, one can distinguish the clusters and judge them to be, in some sense, “bad” or “good” clusters. Typically this is done with respect to money using safety as a limiting criterion, i.e. we wish to maximize proﬁtability while retaining a reasonable level of safety, reliability and quality. 
Fifth, is there a dynamic in the clustering? If some of the major distinguishing features are such that they change over time, it may be possible to extrapolate a dynamic system over the clusters. This means that a participant in the system may be in one cluster at one time and in another cluster at a later time. This transition may be governed by laws that could perhaps be discovered (using other methods). In this way, a member of a "bad" cluster may be transformed into a member of a "good" cluster in some manner that, after analysis, could be understood well enough to be manipulated.

5.3.5 Entropy Informational entropy is a measure of the information content of a signal. High entropy means that a signal carries much information per unit of signal. If a signal is a series of letters, then the sequence "aaaaa..." is predictable and thus carries very little information. In contrast, the sequence "abcd..." carries high information content. Generally speaking, a signal source that behaves in a uniform manner will transmit a signal with near-constant entropy. While the individual measurements vary from moment to moment, the informational content of the signal is static - this is referred to as an ergodic source and is a highly desirable property for mechanical systems. All the mechanical states that the system assumes from moment to moment and that give rise to this constant entropy are an ergodic state set. All states in one ergodic set may be interpreted as belonging to the same qualitative mode of operation. Using this, we can thus detect several qualitative states of the system over time and label these as desirable states or not. If the system switches from one ergodic state set into another, we will observe a discontinuity in the entropy signal. This event, referred to as ergodicity breaking, is a significant event and indicates a qualitative change in operations, see section 2.5. Thus, our method of entropy tracing detects ergodicity breaking events in the history of the measurements. Entropy can be understood by graphing a signal as a histogram. During a particular time window, the range from lowest to highest observed value is split into bins and the number of occurrences per bin over the time window is counted. Divided by the total number of measurements, this yields a probability density function that characterizes the signal over that time window, see figure 5.2 (a). The shape of this distribution characterizes the ergodic set that the system experiences during this time. If we allow some time to pass, we may detect an alternative density function, see figure 5.2 (b). The system has evolved from one qualitative state to another and this is clearly visible from the change in the density function. There has been ergodicity breaking. Entropy provides a statistical measure of this change in a single numerical quantity and allows measurement of the severity of the change. That is only a simple example of what the entropy method is able to detect. Most fundamental structural changes in the shape of the distribution will be detected, because they always involve a change in the information content.
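The windowed histogram entropy can be sketched in a few lines (our own illustration; the bin count of 16 is an assumption, not a value from the book). A constant signal, like the sequence "aaaaa...", has zero entropy, while a varying one does not:

```python
import numpy as np

def window_entropy(x, bins=16):
    """Shannon entropy (in bits) of one time-window, computed from the
    binned histogram of observed values."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()       # empirical probability per bin
    p = p[p > 0]                    # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(1)
flat = np.full(1000, 3.7)                        # "aaaaa...": no information
noisy = rng.normal(0.0, 1.0, size=1000)          # varying signal
bimodal = np.concatenate([noisy, rng.normal(6.0, 1.0, size=1000)])
# A jump in window_entropy from one time-window to the next hints at
# ergodicity breaking, as when the secondary peak of figure 5.2 (b) appears.
```

Tracing `window_entropy` over successive time-windows and watching for discontinuities is the essence of the entropy method described above.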
If the entropy for a single time-window is larger/smaller than the mean of all time-windows ± 2 standard deviations, then the change in entropy is big enough for us to define it as a significant change. Supposing that the distribution is normal² (as seen in figure 5.2 (a)), a deviation of more than two standard deviations has a probability of less than 5%. This entropy method can be tuned to a problem by adaptively varying the number of standard deviations chosen as its detection sensitivity. We may summarize the meaning of the abnormality score of this method: The bigger the absolute value of the abnormality score for a single time-window, the higher the difference between the distribution of values in the present time-window and the averaged distribution of all time-windows.

² If a large number of independent and identically distributed factors add up to produce a single result, then this result is approximately normally distributed - this is called the central limit theorem. Often, this theorem is used to claim that the result of almost any complex mechanism should be normally distributed. However, in practice, we find that the assumptions of the theorem (independent and identically distributed factors) hardly apply and thus the distribution is not normal. Many industrial processes have probability distributions significantly different from normal. As a result, we must not over-interpret the claim that a data point further away from the mean than two standard deviations occurs with 5% likelihood. This is a rough guideline unless we know the identity of the distribution.

Fig. 5.2 The value distribution of a measurement before and after a qualitative change in the system that gave rise to the secondary peak on the right.

5.3.6 Fourier Transformation The Fourier transform (FT) f̂(ζ) of a function f(t) is given by

f̂(ζ) = ∫_{−∞}^{∞} f(t) e^{−2πiζt} dt

where t is time and ζ can be interpreted to be the frequency of the signal. The transform is essentially a transformation of the basis of the coordinate system to another basis. While singular spectrum analysis (SSA) transforms to a basis derived from the data itself, the FT transforms to a fixed basis of sinusoids and so focuses on the frequencies with which signals change. We thus separate slow changes that occur with low frequency and fast changes that occur with high frequency.

Over the different frequencies, we may plot the amplitudes of the frequencies in the signal spectrum. This amplitude is not that of the original signal but of the transformed signal in frequency space. Figure 5.3 (a) displays an example in which it is clear that the frequencies around 50 Hz dominate this particular time-window. As the system evolves to a later time, displayed in figure 5.3 (b), we observe that the frequencies around 77 Hz get populated and thus there is a fast signal variation present now that was not present before. The shape of the graph of the amplitudes has thus changed. We measure the difference between two graphs by computing their correlation. This provides us with a numerical measure of shape similarity that we can trace over time. This correlation obtains a mean and standard deviation and we mark abnormalities if the correlation at a particular time-window is more than two standard deviations different from its mean over all time-windows. When a signal gains or loses variations at particular frequencies, the FT method will show this. An abnormality here indicates that there is a significant change in the speed with which the signal varies over time.
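The 50 Hz / 77 Hz scenario described above can be reproduced with a discrete FT (our own sketch; the sampling rate and amplitudes are assumed for illustration):

```python
import numpy as np

fs = 1000.0                          # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)      # one second of signal

before = np.sin(2 * np.pi * 50 * t)                   # 50 Hz dominates
after = before + 0.8 * np.sin(2 * np.pi * 77 * t)     # 77 Hz appears later

def amplitude_spectrum(x):
    """One-sided amplitude spectrum of a real signal in frequency space."""
    return np.abs(np.fft.rfft(x))

freqs = np.fft.rfftfreq(len(t), d=1.0 / fs)
A_before, A_after = amplitude_spectrum(before), amplitude_spectrum(after)

# Shape similarity of the two spectra, traced over time as a correlation:
similarity = np.corrcoef(A_before, A_after)[0, 1]
```

The drop of `similarity` below 1 is precisely the quantity whose excursions beyond two standard deviations the text flags as abnormal.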

Fig. 5.3 The frequency spectrum of a signal before and after a qualitative change in the system that gave rise to the presence of frequencies around 77 Hz on the right.

5.4 Case Study: Optical Digit Recognition Consider a letter whose envelope is addressed by hand. Someone must be able to read the address to be able to deliver it. As there are a great many letters sent each day, postal agencies the world over have invested in automated systems that can read an address on an envelope. Part of the task is to identify numbers like the ZIP code. Technically speaking, we are given an image of a digit and we are to say which digit the image represents. A sample of such data can be seen in figure 5.4 where we see several examples of the number six. There exists a database of about 60,000 images of digits written by about 250 different persons that we use as our dataset; this is the NIST dataset (National Institute of Standards and Technology in the USA). In order to obtain better results, the data set is pre-processed before applying any training algorithm. One of the classical pre-processing steps is binarization: The image is transformed into having only black and white pixels instead of various gray levels or colors. After binarization, the data was skeletonized and thinned, see figure 5.5. However, the results obtained were worse than the results obtained using only binarization. We will attempt to solve the problem by using a self-organizing map (SOM). In the terminology of section 6.4 this is a one-layer perceptron network but there is no need to skip ahead in the book as we will discuss the concept here. We want to classify an image of a digit into the abstract categories of digits. The input of the classifier thus receives the pixels of this image and the output yields the category that this image belongs to. Thus, the SOM needs two layers. The input layer is a vector with as many elements as there are pixels in the image and so allows for the image to be input into the SOM. The output layer is a vector with as many elements as there are categories; in our case of digits there are 10 such categories. Each element of the input layer is connected with each element of the output layer and this connection has a certain strength. Thus the strengths form a matrix. In this way, every output element has a weight vector made up of the connection strengths of all inputs with respect to this particular output. When an input is presented to the network, we determine the output element whose weight vector is closest to the input vector. To measure distance, we use a metric. Most of the time, the metric is the normal Euclidean metric where the distance between (a, b) and (c, d) is √((a − c)² + (b − d)²).

Fig. 5.4 Several examples of the number six that have been hand written.

Fig. 5.5 Binarized images (on the ﬁrst row) after applying thinning (on the second row) and skeletonization (on the third row)

We may however use a different metric. According to a specific algorithm, the weight vector is now adjusted slightly to reduce the distance between the input and the weight vector. Independently of the weight vector, the output elements are considered to be locations on a map. Usually, these elements are hexagonally distributed over a two-dimensional plane, see figure 5.6 (a). The weight vectors of those output elements that are close to the specific output element just chosen are also modified by a specific algorithm. Then the next input vector is processed. In this way, the system learns the input data and particularly learns to classify it in the form of a map. Each output element corresponds to a region in this map. We may now display this map in various graphical ways to aid human understanding. Particularly, we may plot the distance (in the sense of the above mentioned metric) between neighboring output elements. Close elements indicate related categories and so on. The map may of course be distorted graphically in order to reflect these distances and to give the illusion of it being a true map of something. This way of classifying input data is particularly powerful if we have not categorized the inputs beforehand by human means. This is also an important difference. If we have input data for which we already know the output, we may teach a computer system by example as we would teach a student; this is called supervised learning. If we do not have this, then we must present only the input data to the computer system and hope that it will divine some sensible method to differentiate the data into categories; this is called un-supervised learning. For un-supervised categorization, the SOM is a very good technology. We note in passing that the accuracy of an un-supervised system is generally several percentage points worse than that of a supervised learning system, for the simple reason that a supervised learning system has much more information (this applies for the same input data volume). The weight vectors of each output element must be set to some value at the start of training; this assignment is called initialization. We obtained the best results after initializing the network with samples from the training data set. In this way the map is initially well-ordered and the convergence is faster. Figures 5.6 (a) and 5.6 (b) show a map with hexagonal and rectangular topology, respectively, initialized from the training data.
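The winner-selection and neighbourhood update described above can be sketched as follows (a toy map on four-pixel inputs; the Gaussian neighbourhood function, the grid size, and the learning parameters are illustrative assumptions, not the book's settings):

```python
import numpy as np

rng = np.random.default_rng(2)

n_inputs, grid_w, grid_h = 4, 3, 3          # 4-pixel "images", 3x3 output map
weights = rng.random((grid_w * grid_h, n_inputs))
# Grid coordinates of each output element, used by the neighbourhood function.
coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)],
                  dtype=float)

def train_step(x, lr=0.5, radius=0.5):
    """One SOM update: find the winner by the Euclidean metric, then pull
    the winner and its map neighbours towards the input."""
    winner = int(np.linalg.norm(weights - x, axis=1).argmin())
    grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))  # nearby nodes move more
    weights[:] += lr * influence[:, None] * (x - weights)      # in-place update
    return winner

# Two artificial input categories; repeated presentation orders the map.
dark = np.zeros(n_inputs)
light = np.ones(n_inputs)
for _ in range(50):
    train_step(dark)
    train_step(light)
```

After training, distinct output elements have specialized on the two categories, which is the map formation the text describes; a real digit-recognition SOM differs only in scale and in the choice of metric.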

Fig. 5.6 Initialization of a 6x6 output layer with hexagonal topology on the left and rectangular topology on the right

Note that when initializing the network from the training data set, similar letters should be close to each other. For example, in Figure 5.6, the digit 1 is close to 7 since they look alike. A crucial part of the SOM training algorithm is determining the winner node (the output element, and therefore category, assigned to the particular input image under present investigation) which is closest to the input with respect to the metric function. Thus the choice of the metric function is very important for the performance of the training. We use the so-called tangent metric and not the Euclidean metric. The reason is that it achieves substantially better results (approx. 5% difference). However, the downside of using the tangent metric is that evaluations are computationally intesive and take more time than using the Euclidean metric (about a factor 30 in time). It goes beyond the scope of this book to give the details of the tangent metric, we refer the interested reader to the literature, e.g. [72]. At ﬁrst, we may observe bad performance. This may be due to a badly formulated output layer. For instance the digit 7 may be written with or without a horizontal bar

96

5 Data Mining: Knowledge from Data

and thus the number 7 is orthographically worth at least two categories. The same is true of several other digits and so we must increase the output layer from 10 to 16 as seen in ﬁgure 5.6. In this way, we may cut the error probability into half. On a total training set of 25,000 images, we end up with an error rate of 11% on the remaining 35,000 images that were not used for training, which is not too bad given that the algorithm has no idea what we are trying to classify. The technique of re-learning is also useful: We learn and then use the ﬁnal state as the starting state for another round of learning. If we make several such rounds, we can end up with a very good performance.

5.5 Case Study: Turbine Diagnosis in a Power Plant A coal power plant essentially works by heating water in a coal-fired furnace to create steam, see figure 5.7. This steam is passed through a turbine, which turns a generator that makes the electricity. The most important piece of equipment in the power plant is the turbine.

Fig. 5.7 A schematic diagram of a combined cycle power plant. The term “combined cycle” means that the power plant produces both electricity and heat for local homes. The diagram describes the combined heat and power (CHP) process in overview including the major steps: (1) Entry-point of the air, (2) boiler with water and steam, (3) high-pressure turbine, (4) mid-pressure turbine, (5) low-pressure turbine, (6) condenser, (7) generator, (8) transformer, (9) feed into power grid, (10) district heating, (11) cooling water source, (12) cooling tower, (13) ﬂue gas, (14) Ammonia addition (15) denitriﬁcation of ﬂue gas, (16) air pre-heater, (17) dust ﬁlter, (18) ash end-product, (19) ﬁltered ash end-product, (20) desulfurization of ﬂue gas, (21) wash cycle, (22) chalk addition, (23) cement/gypsum removal, (24) cement/gypsum end-product.

A turbine is a rotating engine that converts the energy of a ﬂuid (here steam) into usable work. Figure 5.8 shows a steam turbine. Turbines are used in many other contexts such as water power plants, windmills, wind power turbines, airplane turbines and so on. We will be focusing on turbines used in standard coal-ﬁred power plants. The blades on the turbine have the job of actually capturing the energy and driving the central rotating shaft. They are made from steel and may be more than one meter in length. The forces at work when this machine is running are very large indeed.

Fig. 5.8 A Siemens steam turbine during a routine inspection. Source: Siemens AG Press Photos.

Economically, the turbine is the heart of a power plant. All other equipment in the plant essentially caters to the turbine-generator combination because it is responsible for the conversion of energy from steam to electricity. The rest of the power plant essentially has the job of making the steam and cleaning up after itself (for example filtering the flue gas). Thus, it is important to carefully watch the turbine for any signs of abnormal behavior. As the turbine is so important, its operations are monitored by a variety of sensors installed in key locations. The most crucial information regarding the health of the turbine is contained in the vibration measurements. All sensor output is logged in a data historian and therefore available for study. In our case of monitoring a fleet of turbines, there are between 111 and 179 sensors measuring the condition of each turbine, which provides us with a satisfactory amount of data for the analysis.

At all times, we want to provide some automatic diagnosis that decides whether the values of the sensors are abnormal. This is the key to the analysis. Only an abnormally functioning turbine should require manual inspection, and thus we want to automatically determine abnormal behavior. For this purpose, we are going to use three methods that can do this. If any set of sensors delivers abnormal values, we want to know:

1. Which method or combination of methods detected the abnormal operation?
2. How does the strength of the abnormalities develop over time?
3. Which sensor or set of sensors delivers the abnormal values and for how long?
4. Which sensors send abnormal values as a result of a previous abnormality?
5. When did the first/last signs of this abnormal operation appear/disappear?

We analyze time-windows from the time-series, where each time-window contains 7 days of sensor data and the time difference between two consecutive windows is 1 hour. Thus, if the result of one of the methods is that in the time-window 1 May 00:00:00 - 8 May 00:00:00 there is no abnormality, but in the time-window 1 May 01:00:00 - 8 May 01:00:00 there is some abnormality, we conclude that the observed abnormality is induced by the 1 hour difference, i.e. that it occurred on 8 May between 00:00:00 and 01:00:00. We always use the last hour of the analyzed time interval to present the obtained results. Each analysis method delivers one value per time-window. To detect whether there was an abnormality in the analyzed time-window for a given method, we compute the mean and the standard deviation based on the results delivered by that method over all the time-windows. Then we check whether the result for the current window is larger/smaller than the mean ± two standard deviations. If it is, then the excess amount is called the "abnormality score" and an abnormality is recognized. This means that if a sensor had an abnormality score of 4.3 on 1 May 00:00:00, then while analyzing the data for this sensor from 24 April 00:00:00 to 1 May 00:00:00, the analysis result was either greater or less than the mean ± (2 + 4.3) times the standard deviation. We used three techniques for the investigation of the time-windows of the four datasets. All three methods use the concept of "abnormality score" as defined above and produce one abnormality score per time-window and per sensor. The methods are singular spectrum analysis (see section 4.5.2), entropy (see section 5.3.5) and Fourier transformation (see section 5.3.6). In brief, the entropy indicates if there are abnormal values in the individual measurements, the SSA indicates if there are abnormal variances in the principal component direction and the FT indicates if there are abnormal frequencies in the signal.
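The abnormality score defined above can be sketched directly (our own illustration; the score is expressed in units of the standard deviation, as in the 4.3 example):

```python
import numpy as np

def abnormality_scores(results, n_sigma=2.0):
    """Per-window abnormality score: the amount by which a window's analysis
    result lies beyond mean +/- n_sigma standard deviations, measured in
    standard deviations (0 where the result stays within the band)."""
    results = np.asarray(results, dtype=float)
    mu, sigma = results.mean(), results.std()
    z = np.abs(results - mu) / sigma        # distance from the mean in sigmas
    return np.maximum(z - n_sigma, 0.0)     # excess beyond the 2-sigma band

# One analysis result per time-window; the last window is clearly off.
window_results = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 8.0]
scores = abnormality_scores(window_results)
```

Only the deviant window receives a non-zero score; all windows inside the two-sigma band score zero, matching the detection rule of the text.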
We thus have two indicators that concern the shape of a probability density (entropy and FT) and one that concerns the size of the density (SSA). These methods concern very different indicators of abnormality and thus may or may not simultaneously detect an event. We have seven combinations of the methods for detecting an event. Each of these possibilities indicates an abnormal situation. Which methods, or which combinations of methods, respond indicates the nature of the abnormality and may assist in the diagnosis of what the cause of the abnormality may be. We describe briefly what this may mean:

1. Entropy only: We measure abnormal values but the system changes at the same speed and with the same variation as before. This must mean that the abnormal value observed is not along the principal component direction.
2. SSA only: We measure an abnormal variance but no abnormal values or frequencies. This must mean that the principal component direction changed as otherwise we would require a significant change in the value distribution (entropy) as well. Thus, we have a qualitative change in which measurements are important for characterizing the system.
3. FT only: We measure an abnormal frequency but no abnormal values or variances. This means that the system does what it did before but it has changed the speed at which it does it.
4. Entropy and SSA: We measure abnormal values and variances but the same frequencies. The system has changed the range of its operation but not the speed at which it varies.
5. Entropy and FT: We measure abnormal values and frequencies but the same variances. As the values have changed but not the variance, this must mean that the values that changed are not along the principal component direction. Additionally, the frequencies have changed so that the inherent speed of variation has changed.
6. SSA and FT: We measure abnormal variances and frequencies but the same values. As the variance changed without having values change, this must mean that the principal component direction changed. Additionally, the speed changed.
7. Entropy, SSA and FT: We observe abnormal values, variances and frequencies. The system now visits new values at new speeds and this changes the range of variation along the principal component direction. This is the most significant indication of a change in the system and these changes should be viewed as bearing the most danger.
Whether an abnormality is dangerous or not is not included in this analysis. It is likely that, when the operator makes a significant change in operational settings, an abnormality is detected even though this does not necessarily indicate a danger. However, it seems very likely that most dangerous situations would display an abnormality of at least one of the above kinds. A zoom-in for one dataset and the SSA method shows how a particular event starts and progresses over time, see figure 5.9. Monitored over a full year, the analysis yields the result of figure 5.10. When all three methods are combined, we may compare them as per figure 5.11. As discussed above, the combination of methods that detect a particular event lets us interpret what kind of event is taking place and thus aids the engineer in deciding what should be done about it. In table 5.1, we summarize, for four turbines, how many events were detected by each combination of methods. Please note that no combination is in any sense "better" than another.


Fig. 5.9 The numbers on the vertical axis indicate the sensor that is being analyzed so that the image as a whole gives us a holistic health check for the whole turbine. We can read from the plot that the event starts with Sensor 41 on the 30th of June, then a more signiﬁcant deviation is observed for sensors 122 and 151, then on the 9th of July more sensors (51) get involved and the largest abnormality (-2.09) is observed for sensor 93 on the 15th of July. For several days, abnormalities of most sensors disappear and only the sensors 122 and 151 continue deviating and start a second reaction of a smaller magnitude on the 24th of July. On the 3rd of August all abnormalities disappear.

By and large, we can see in the data that those events that have a particularly large abnormality score do tend to be detected by more than one method. The largest abnormalities are mostly detected by all three methods. On this basis, we can say that the urgency with which an event should be looked at can be proportional to the number of methods that detected it. However, this judgment is made without correlating it with the actual ﬁnal outcome of the event (benign events vs. dangerous situations). In ﬁgure 5.12, we provide a plot of the abnormality score for the events detected for one turbine. It is also visible that the total abnormality score of an event forms an approximate exponential distribution. This is an interesting feature as this is not the outcome that would result from a large number of random interactions (central limit theorem) but rather suggests a much more structured causality. In particular, we would expect


Fig. 5.10 Here we see a turbine analyzed by SSA over a full year's operation. The vertical black areas are areas where the turbine was offline. In between the offline times, we can see where the method diagnosed abnormalities. These were then analyzed by human means and appropriately responded to.

Set of Methods            Turbine 1   Turbine 2   Turbine 3   Turbine 4
SSA only                          2           7           1           4
FFT only                          7           4           4           1
ENT only                          1           4           0           3
SSA and FFT                       4           0           3           5
SSA and Entropy                   3           1           1           4
FFT and Entropy                   1           1           7           0
SSA and FFT and Entropy           4           6           3           3

Table 5.1 Each combination detects a particular signature of event and thus they should be seen as complementary detection schemes rather than a hierarchy. No one method dominates this table. This shows that events of all signatures do take place in the systems studied.

this outcome to result from a Poisson stochastic error source, which is present when events occur continuously and independently at a constant average rate. Thus, we would conclude that, approximately and on average, the events detected here did not interfere with or cause each other but were independently caused. We would also conclude that whatever causation mechanism is giving rise to these


Fig. 5.11 Here we see all three methods in comparison over a whole year on one particular sensor in a particular turbine. We see broad agreement but differing opinions in the details. These differences can be interpreted as discussed in the text.

events acts at a constant rate. This means that the system does not exhibit ageing over the time period (one year) investigated here; ageing would cause the event rate to increase with time. We may summarize all the above findings by saying that the present methods allow the fully automatic screening of the majority of the data while finding that they indicate normal operations. For a selected small minority of the data, these methods give a clear indication of which sensors are abnormal at which times, how abnormal they are and in what sense they are abnormal. This allows targeted human analysis to take place on an as-needed basis.

5.6 Case Study: Determining the Cause of a Known Fault Co-Author: Torsten Mager, KNG Kraftwerks- und Netzgesellschaft mbH


Fig. 5.12 These events are sorted by the sum of their abnormality scores over all three methods. As the score is defined similarly for each method, the numerical value of the score of one method is comparable to the score of another. We can see that the detection efficiency of the methods increases with increasing abnormality, as it should. The plot also contains the number of days before an event that an advance warning would have been possible. It is recognizable that the first signs appeared several days in advance for most events. There are only two events where there was no sign before the event. The average advance warning time for an event was five days.

Here we take the case of a turbine that has failed for a known reason. We seek the cause of the failure in both time and space, i.e. where and when did what happen to bring about the known failure? This is often important to settle issues of liability between the manufacturer, operator and insurer of the machine. It is certainly important in terms of planning what to do about it and what to do to prevent a similar case in the future. We recall that a turbine failure is a significant and expensive event for the plant. In the previous case study, we presented three methods to analyze abnormal operations for turbines. We might think that using these methods over the history of the particular turbine would reveal the point at which things went wrong. It should be mentioned that the particular fault in question was that a single blade touched the casing and was bent. This bent blade was only detected visually upon opening the turbine months later. It was unclear to what extent the turbine would be able to continue running and so this blade was exchanged; a time-consuming and


expensive process. We expect that this is an event that should be visible in the data (particularly in the vibration data) upon careful analysis. The analysis result can be seen in ﬁgure 5.13 for one vibration measurement as an example. The three lines indicate the abnormality score of each of the three methods outlined in the previous case study. It is not important here which is which. We simply note that there are a few points at which the analysis result intersects the abnormality boundary and thus we do ﬁnd abnormal events. Later manual analysis discovered that all such abnormal events could be explained by sensible operator decisions.

Fig. 5.13 The outcome of the analysis using the three methods of singular spectrum analysis, Fourier transform and entropy analysis are shown with respect to one vibration measurement. We observe that there are signiﬁcant changes that cross boundary lines and thus indicate abnormal events.

In figure 5.14, we display the raw data in a different manner that also allows some interesting interpretations. One vibration is plotted against another in an effort to track their mutual locus. We find that they do indeed possess a well-defined mutual locus that is traversed clockwise in figure 5.14. Each point in the image represents 10 minutes of operation. A single cycle thus represents approximately 15 hours. We see on the image (on the far right) that there is a region in time, lasting roughly 3 hours, in which the system deviated significantly from the established locus. We can interpret this as an abnormal event. In this particular case, however, the event was benign as it was intentionally caused by the operators. This further indicates that abnormal events need not be harmful. A data analysis system can detect abnormalities but would have to be given a vast amount of knowledge to be able to interpret these as benign or harmful – at present we regard this enhancement to be impractical. We do observe, however, that there exist some tools that can interpret particular problematic situations on particular devices and so some work has been done in this regard [77]. In a similar fashion, we also analyzed the cases in which the turbine was not rotating at operating speed but was in various stages of cycling up or down. Here too, we were not able to find any genuine abnormality.


Fig. 5.14 One vibration measurement (horizontal axis) with respect to another (vertical axis). The locus is essentially a dented cycle, which (in this case) we traverse clockwise over time. We see that most of the time, the system holds a fairly well-deﬁned locus in time. Occasionally, we do deviate from this and this represents, loosely deﬁned, an abnormal event.

Even though this is a negative example of data analysis, it is instructive in various ways that must be taken seriously when starting such an analysis. In particular, it is clear that there are events that cannot be seen – even upon detailed analysis – in the normal measurements of a plant. To circumvent this, we may have to install further instrumentation equipment specifically targeted at discovering such a problem. We must be aware, before we conduct a data analysis, that it is possible that the feature we are looking for: (1) does not exist at all, (2) is not contained in the data we have, (3) is overshadowed by noise in the data, (4) occurs at such short timescales that it appears to be an outlier, and so on. Ideally, we would have the opportunity to design a data acquisition system in order to look for a particular problem in advance. When we encounter a problem, however, we will simply have to deal with the data available and may then encounter the above challenges. The correct acquisition and cleaning of data prior to analysis is crucial for success. We conclude this case study by observing that there are two principal explanations for the failure to find the cause:

1. The event occurred during the time period analyzed but was not visible in the data, possibly because it was too short-lived.
2. The event did not occur during the time period analyzed. This would imply that the turbine was initially taken into operation in a damaged state.

5.7 Markov Chains and the Central Limit Theorem

A Markov chain is a sequence of numbers, each of which is a random variable, with the property that the probability distribution of any one number depends only upon the previous number in the sequence. This property is called the Markov property.


Thus, a Markov chain z^(0), z^(1), z^(2), ..., z^(m-1), z^(m) has the property that

\[
p\left(z^{(m+1)} \mid z^{(0)}, z^{(1)}, \ldots, z^{(m)}\right) = p\left(z^{(m+1)} \mid z^{(m)}\right).
\]

This probability is called the transition probability, which is generally a matrix and not a scalar as both z^(m) and z^(m+1) are vectors and we must specify the probability of transition from each element of one to each element of the other,

\[
T_m \equiv p\left(z^{(m+1)} \mid z^{(m)}\right).
\]

If the transition probability T_m is the same for all m, then the Markov chain is called homogeneous. The Markov property is a severe restriction and an extreme simplification. It therefore allows many special properties of Markov chains to be proved. However, it also means that a Markov chain is not always suitable to model a practical situation. For that reason, we will often want to relax the Markov property and define the p-order Markov chain by the property that the transition probability depends upon the prior p random variables,

\[
p\left(z^{(m+1)} \mid z^{(0)}, z^{(1)}, \ldots, z^{(m)}\right) = p\left(z^{(m+1)} \mid z^{(m-p+1)}, z^{(m-p+2)}, \ldots, z^{(m)}\right).
\]

The transition probability becomes

\[
T_m \equiv p\left(z^{(m+1)} \mid z^{(m-p+1)}, z^{(m-p+2)}, \ldots, z^{(m)}\right).
\]

To model a real system by using a Markov chain, we thus need to determine the transition probabilities T_m. If we know these, and we determine the initial condition of the Markov chain (i.e. the values of the first p random variables), then we may probabilistically compute the evolution of the chain into the future and thus arrive at our model. Supposing that the initial conditions are not to be obtained from physical experiments but rather must also be calculated, we must establish the probability distribution of the initial random variables and then rely on our determination of T_m to compute the others. As we are dealing with statistical distributions, we require a significant amount of data to be able to distinguish the various possible distributions from each other. In case of doubt, one often chooses the Gaussian distribution, also called the normal distribution, which looks like

\[
p(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(z-\bar{z})^2}{2\sigma^2}}
\]

where σ is the standard deviation of the distribution and z̄ is the mean of the distribution. The defense for this choice is often the central limit theorem. The users of this defense believe that the central limit theorem effectively says: If the factors leading


to a particular observation are sufficiently many, the distribution of this observation will tend to the normal distribution in the limit of infinitely many factors. Actually it states that the random variable y which is the sum of many random variables x_i (i.e. y = x_1 + x_2 + ... + x_m) will tend to be distributed normally in the limit of infinitely many x's if the x_i are independent and identically distributed and have finite mean and variance. Please note the if clause in the previous sentence. Thus, the factors (x_i) leading to a particular observation (y) have to be independent and identically distributed, which is typically not the case as cause-effect interrelationships usually exist in the physical world. Also, the observation y is a very particular observation, namely the sum of the x's, and not some other related observation. In short, we must be very careful when using the central limit theorem to justify a normal distribution for the initial random variable in a Markov chain.

There are several popular probability distributions that optically all look the same – they have a bell-shaped curve. It is nevertheless wrong to say that they are all the same and that one might just as well stick with the normal distribution. In the central area of the bell, these distributions are indeed similar but they usually differ significantly in the tails (i.e. far away from the central area). Please note that in probability theory it is generally the tails that are the interesting parts because these describe the non-typical situations that will nevertheless occur. A few such distributions are the Cauchy, Student's t, generalized normal and logistic distributions. We will not go further into these individually.

If an incorrect distribution is chosen for z^(0), then even a correct T_m will lead to bad results overall as everything depends on the initial condition. Thus, modeling the starting point correctly is essential for a correct Markov chain model.
Moreover, the correct modeling of the initial condition can only be done with sufficient data from the system under consideration. In the absence of such data, we must rely on some other source, such as a physical model of the situation.
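To make the machinery concrete, here is a small sketch (ours, with a made-up two-state transition matrix) of evolving a homogeneous Markov chain from an initial distribution, exactly the procedure described above: knowing T_m and the initial condition determines the evolution.

```python
import numpy as np

# Hypothetical two-state homogeneous Markov chain: T[j, i] is the
# probability of moving to state j given the current state i, so each
# column of T sums to 1.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

def evolve(p0, T, steps):
    """Propagate the state distribution: p_{m+1} = T p_m."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = T @ p
    return p

p0 = np.array([1.0, 0.0])   # initial condition: surely in state 0
print(evolve(p0, T, 50))    # approaches the stationary distribution, ~[0.833, 0.167]
```

Note how sensitive the short-term behaviour is to p0, which is the point made in the text: a wrong initial distribution corrupts the model even when T is correct.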

5.8 Bayesian Statistical Inference and the Noisy Channel

5.8.1 Introduction to Bayesian Inference

Let us focus on a practical problem, namely optical character recognition, see for instance section 5.4 above. We may observe the system via a random variable x, the photographic image of a letter, and wish to deduce the parameter θ, the original letter, that gave rise to this observation. We begin our excursion with the prior distribution g(θ), which is the probability distribution over the various possible values of the parameter. We will suppose, for the moment, that this distribution is known. We will discuss later on how it can be determined. The observable random variable x has a conditional distribution function f(x|θ) called the sampling distribution that is also assumed known for all values of the


parameter θ ∈ Θ. According to Bayes' theorem, we may now compute the so-called posterior distribution

\[
g(\theta \mid x) = \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \theta)\, g(\theta)\, d\theta}.
\]

Thus, we now know the distribution of the parameter given an observation. When we have made an observation, we can use this distribution to determine the probability distribution of the parameter for this particular observation. Knowing this distribution is very useful indeed. For example, if we have an image of a handwritten letter and we determine that the probability distribution over the alphabet is such that the probability of "r" is 0.4, the probability of "v" is 0.6 and the probability of all other letters is virtually zero, then we may conclude that we have either an "r" or a "v" with high probability. We may also conclude that we are more likely to have a "v" than an "r" and we have a quantitative method to assess the degree to which it is more likely, namely 0.2 more. We may conclude from such a distribution that our model is not yet good enough, since it cannot tell these two letters apart with sufficient certainty, and thus that we need to present it with more examples of these two letters to train it to be better. Thus, it is a very useful result to have the posterior distribution. However, it requires both the prior distribution and the conditional distribution to be known. In general these must be determined by examining the physical mechanism that gives rise to the problem. In the following subsections, we will treat the determination of these two distributions.
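The discrete form of this posterior computation, with the integral over Θ replaced by a sum, can be sketched as follows. The two-letter numbers are made up so that the posterior reproduces the "r"/"v" example above; the function and dictionary layout are our own illustration, not the book's code.

```python
def posterior(prior, likelihood, x):
    """Discrete Bayes' theorem: g(theta|x) proportional to f(x|theta) g(theta).
    prior:      g(theta), a dict mapping parameter -> probability
    likelihood: f(x|theta), a dict mapping parameter -> {observation: prob}
    """
    unnorm = {t: likelihood[t].get(x, 0.0) * g for t, g in prior.items()}
    z = sum(unnorm.values())          # discrete normalization constant
    return {t: u / z for t, u in unnorm.items()}

# Toy two-letter example (invented numbers): both letters equally likely
# a priori, but the observed image looks more like a "v".
prior = {"r": 0.5, "v": 0.5}
likelihood = {"r": {"img": 0.4}, "v": {"img": 0.6}}
print(posterior(prior, likelihood, "img"))   # {'r': 0.4, 'v': 0.6}
```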

5.8.2 Determining the Prior Distribution

In our case, the prior distribution g(θ) is the probability that a particular letter is going to occur. We can make our life simple, deduce this from a typical piece of prose text and stipulate this as the prior distribution, see table 5.2. However, the letter frequencies are very different if we know what letter came before the current one, see table 5.3. We may, in fact, introduce a complex grammar-based language model here to give an intelligent guess as to the next expected letter based on a lot of domain-specific knowledge. We must decide how much knowledge must be inserted into the prior distribution for it to be good enough for our practical purpose. Noting how much complexity was added going from single-letter to digram frequencies, we need to be careful before attempting to introduce a more complex model as this may create more effort than the result is worth. In the case of optical character recognition, the usefulness is however so large that several independent projects have created models of great complexity at great expense, resulting in several commercial software packages. In the case that we do not want to or cannot insert theoretical knowledge into the construction of the prior distribution, we must construct it empirically. So let us assume that we have a number of observations at our disposal x_1, x_2, ..., x_n. The parameters θ_1, θ_2, ..., θ_n that gave rise to these observations are unknown but we will

a  8.167%    j  0.153%    s  6.327%
b  1.492%    k  0.772%    t  9.056%
c  2.782%    l  4.025%    u  2.758%
d  4.253%    m  2.406%    v  0.978%
e 12.702%    n  6.749%    w  2.360%
f  2.228%    o  7.507%    x  0.150%
g  2.015%    p  1.929%    y  1.974%
h  6.094%    q  0.095%    z  0.074%
i  6.966%    r  5.987%

Table 5.2 The probability of each letter in average English prose texts.

[Table 5.3: a 26 × 26 matrix of relative letter digram counts, with the first letter of each digram on the vertical axis and the second on the horizontal axis.]

Table 5.3 The relative letter digram frequencies as measured on one English prose text. They are ordered in that the letter on the vertical axis precedes the letter on the horizontal axis in any individual digram, so that the relative frequency of the digram "AB" is 32 and not 8.
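Measuring such digram frequencies from a sample text is straightforward; the following sketch (ours) counts ordered letter pairs, ignoring case and non-letter characters. Note that in this simple version pairs also span word boundaries, which a more careful tabulation might exclude.

```python
from collections import Counter

def digram_counts(text):
    """Count ordered letter pairs (digrams) in a text, lowercasing and
    dropping every character that is not a letter."""
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(zip(letters, letters[1:]))

counts = digram_counts("The theory of the thing")
print(counts[("t", "h")])   # the digram "th" occurs 4 times in this sample
```

Dividing each count by the total number of digrams would turn the table into the conditional distributions needed for a digram-based prior.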

assume that they are independent and identically distributed. Then we can group the observations into sets that are as homogeneous as possible within the set and as heterogeneous as possible between the sets. This is exactly the unsupervised clustering example that we introduced via the method of k-means in section 5.3.4 and illustrated in section 5.4 using the self-organizing map.


5.8.3 Determining the Sampling Distribution

The sampling distribution f(x|θ) is a distribution of the observable x for every value of the parameter θ. In our case, it is the distribution of all possible letter images for every possible letter. A particular letter, when it is transformed into an image, can be rotated, skewed, squeezed, expanded, thinned, thickened and so on. Various mechanisms exist to make the letter image look different from a prototypical letter image – just compare your handwriting to machine typing. Generally, we would have to have domain knowledge to construct this distribution. If this is unavailable, we may construct it empirically as we did above. We simply ask many people to write text and take this as our distribution, see section 5.4. When we cluster observations into clusters, the cluster determines the parameter and thus the prior distribution. The not quite homogeneous distribution inside any particular cluster is the sampling distribution for that particular parameter. Please note that there is a snag: unsupervised clustering generally produces clusters that may or may not coincide with your intended parameter. In fact, a single parameter may actually correspond to several clusters because there are various very different ways to distort the parameter. In many cases, the connection between the parameter and the observation is indeed very similar to the case of the letter, i.e. an idealistic concept is transformed into a physical reality via a series of mechanisms that distort the original prototype in various ways. Thus there is a channel from prototype to real object and this channel adds noise to the signal in some fashion. This noise must be removed for us to recover the original prototype.

5.8.4 Noisy Channels

There are two branches of science that deal with noisy channels. From an engineering point of view, we focus on actively building a channel that has the property that we can later remove the noise as much as possible. This view led to the construction of the television and telephone channels, which are both noisy but allow the noise to be removed at the recipient's end sufficiently well to suit the user's needs. From a computer science point of view, we focus on cybernetics or control theory, in that we are given a noisy channel that we cannot influence and must remove the noise as best we can. An example of this is handwriting: we cannot influence the author's handwriting and must do the best we can to recognize the intended letters via optical character recognition. Please note carefully the difference here. In the first, we are building a channel with the noise removal problem in mind. In the second, we are given an imperfect channel and told to remove the noise. The focus shifts from making an ideal channel to making a noise-removing machine. Furthermore, when building the channel, we know to some extent the noise that the channel will add and this knowledge helps in later removal attempts, but when given an existing channel, we do not generally know how it adds the noise. We will briefly present both approaches here.

5.8.4.1 Building a Noisy Channel

In constructing a channel, we will want to reduce as much as possible the noise that the channel adds. For example, we want to measure the temperature in a process and convey this measurement to a process control system. The original reality is a substance that has a certain temperature. Via a sensor, cables, analog-to-digital converters, filters and so on, the information arrives at the process control system in the form of a floating-point number. This will be stored once every so often or, as is commonly done, if the value changes by more than a certain amount from the last measurement. A noisy channel is generally characterized by an input alphabet (possible temperatures) A, an output alphabet (possible temperature readings) B and a set of conditional probability distributions P(y|x). We may construct a transition probability matrix Q_ji = P(y = b_j | x = a_i) which gives the probability that the original temperature a_i is displayed as the reading b_j. A probability distribution over the inputs (temperatures) p_x, which is a vector, may thus be converted into a distribution over the outputs (readings) p_y by multiplying it with the transition matrix like this: p_y = Q p_x. As we can influence the channel itself because we are building it, what can we do to increase our chances of correct reconstruction? That's right, we insert the same information into the channel more than once. For very important temperature measurements, the industry generally installs three sensors and the process control system records the measurement if at least two sensors agree. There is a plethora of other engineering changes that can be made to make the channel more reliable. They include procuring a good sensor with little drift and good aging properties, installing it in a location where it is not likely to get damaged, dirty or overheated, insulating the cables, setting the recording threshold low so that many values are stored, and so on.
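The relation p_y = Q p_x can be illustrated with a made-up three-symbol temperature alphabet; all matrix entries below are invented for illustration only.

```python
import numpy as np

# Hypothetical alphabets: true temperatures {low, mid, high} and readings
# {low, mid, high}. Q[j, i] = P(reading b_j | temperature a_i), so each
# column (one true temperature) sums to 1.
Q = np.array([[0.95, 0.10, 0.00],
              [0.05, 0.80, 0.10],
              [0.00, 0.10, 0.90]])

px = np.array([0.2, 0.5, 0.3])   # distribution over true temperatures
py = Q @ px                      # induced distribution over readings
print(py)                        # [0.24 0.44 0.32]
```

The output distribution is smeared relative to the input, which is precisely the noise that redundancy (e.g. the two-out-of-three sensor vote mentioned above) is meant to counteract.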
Mathematically speaking, we would control the manner in which a source signal (your voice) is encoded into electrical form over the channel (telephone) to arrive at the decoding output (speaker). We may do this via special mathematical techniques of including some, but not too much, extra information to allow the decoder to correct some of the errors that are introduced inside the channel. The adding of extra information reduces the capacity of the channel and so it is important to add the right amount. There is a large theory on exactly how much is the right amount of extra information in order to be able to decode enough of the signal for all practical purposes. Of course, this requires the “practical purpose” to be speciﬁed very accurately ﬁrst. We will not go into this, as this goes beyond the scope of this book. You will ﬁnd this under the headings of information theory or communication theory.


5.8.4.2 Controlling a Noisy Channel

In real life, we are more often concerned with controlling a noisy channel. That is, the channel and the encoding mechanism exist and cannot be modified. Our task is exclusively restricted to decoding, i.e. recovering the original prototype from the signal, which consists of the original prototype and some noise. As the noise is generally not uniformly random but rather structured in some fashion, this becomes a challenge. Simply put, at one end we have the physical reality that we cannot directly observe, then we have a channel whose operation we do not know and at the other end we have the output that we can measure. Let us consider the channel to be a black box. Let us also suppose that we can arrange for the physical reality to be in a particular known state, at least for the purposes of experimentation if not in the final application stage. Then let us present the known physical state to the black box and observe the output. We repeat this for many trials. If we choose our inputs carefully (for example by a design of experiment, see section 3.2), then we will have observed the functioning of the channel in all important modes of operation. What we are left with is a relationship between inputs A and outputs B. The channel is then a function B = f(A). Our task is to obtain the function from the data. This is a problem of modeling that we will tackle in chapter 6. For the moment, we will assume that the function is determined. Now we can compute, for any input, what its corresponding output will be. But we can do more. If we invert the function, A = f^{-1}(B), then we may compute the input for a given observed output. This finally solves the problem of determining the reality that gave rise to an observation. Of course, it is possible to construct the inverted function directly via modeling; there is no need to first model f(...) only to then construct f^{-1}(...).
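A crude sketch (ours) of the probe-then-invert idea: record input/output pairs of a black-box channel on a grid of known inputs and invert by nearest-neighbour lookup. The linear stand-in channel is purely illustrative; a real channel would be noisy and unknown.

```python
import numpy as np

def probe(channel, inputs):
    """Present each known input state to the black box and record the output,
    giving an empirical model of B = f(A)."""
    return np.array([channel(a) for a in inputs])

def invert(inputs, outputs, b):
    """Nearest-neighbour inversion: given an observed output b, return the
    probed input whose recorded output is closest, i.e. A ~ f^{-1}(B)."""
    return inputs[np.argmin(np.abs(outputs - b))]

inputs = np.linspace(0.0, 10.0, 1001)      # a simple grid design of experiment
channel = lambda a: 2.0 * a + 1.0          # stand-in for the unknown channel
outputs = probe(channel, inputs)
print(invert(inputs, outputs, 9.0))        # recovers the input a ~ 4.0
```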
It is, however, often useful to have both orientations of the function on hand, because f(···) is effectively a simulation of the real system and can be used for experimentation, while f⁻¹(···) is a back trace for any output. Suppose that we are capable of influencing the inputs A in some way; then we can take B = f(A) and ask the question: What is the value of A such that a function of the outputs g(B), the so-called goal function or merit function, takes on the maximum (or minimum) value possible? This is an optimization question that we will tackle in chapter 7.

The topic of developing the model f(···) is effectively the development of the sampling distribution in the Bayesian inference approach. Please note that there is no contradiction between inverting the function and using Bayes' approach. The difference in method does, however, yield a difference in output. The Bayesian inference approach does not yield a single answer (as the inversion approach does) but rather the distribution of possible answers. In practice we may be boring and desire a single answer, but for more scientific work we may indeed be interested in the distribution. Also, for practical work, we would indeed be interested in the confidence (or probability) with which we may accept the single most likely answer. Perhaps the single answer of inversion is the most likely answer in Bayesian terms, but there may be other answers that are sufficiently competitive that we actually cannot be confident in our conclusion.

5.9 Non-Linear Multi-Dimensional Regression

5.9.1 Linear Least Squares Regression

The word regression refers, in the context of statistics, to a collection of methods that infer a function describing a set of known observations. Often, regression is also referred to as curve fitting. A simple example is the fitting of a straight line to a set of two-dimensional observations, see figure 5.15. The simplest example is linear regression, where we have observations of two variables (xᵢ, yᵢ) and wish to deduce a straight line y = mx + b from this data. Our task, therefore, is to determine a slope m and an intercept b such that this straight line fits the observed data best. The critical word here is "best" because we will need to define very carefully what we mean by best. The classic criterion is to minimize the squared difference between model and observations, which is called the method of least squares. Thus, we take

D = ∑ᵢ (yᵢ − y)² = ∑ᵢ (yᵢ − mxᵢ − b)²

and desire to minimize D in the space of all possible m and b. The least-squares sum is a simple function and so we can apply calculus to it, i.e. at the minimum the first derivatives are equal to zero,

∂D/∂m = ∂D/∂b = 0.

Solving this leads to

m = ∑ᵢ (xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ (xᵢ − x̄)²,   b = ȳ − m x̄,

where the over-line represents an average of the observed data. This method is very simple and yields a result visible in figure 5.15.

Regression lends itself to more complex questions. We may principally make matters more complex along three different directions: the observed data may be in more dimensions, the function that we hope describes the data may be non-linear, and the criterion for "best" fitting may be more complex than the least-squares criterion. The investigation of methods, such as the above, in the large arena opened by these three directions constitutes the field of regression.
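The closed-form solution above translates directly into code. A minimal sketch with invented data points lying exactly on a known line, so that the formulas can be checked against the expected slope and intercept:

```python
# Least-squares line fit using the closed-form solution:
# m = sum((x_i - mean_x)(y_i - mean_y)) / sum((x_i - mean_x)^2),
# b = mean_y - m * mean_x.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    m = num / den
    return m, mean_y - m * mean_x

# Points lying exactly on y = 2x + 1 are recovered exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
m, b = fit_line(xs, ys)
```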


Fig. 5.15 An example of a linear regression line drawn through a set of observations in two dimensions.

5.9.2 Basis Functions

When the observations are multidimensional, we merely represent the observations by a vector xᵢ. The non-linearity of the function can be complex and varied. An important possibility is to work with basis functions. An example of the basis function approach is a Fourier series, in which we represent a function by a combination of sines and cosines,

f(x) = a₀/2 + a₁ cos(x) + b₁ sin(x) + a₂ cos(2x) + b₂ sin(2x) + ··· + aₙ cos(nx) + bₙ sin(nx) + ··· .   (5.13)

The trick in such a representation is to answer the following questions: (1) Can my function be represented in such a way in the first place? (2) Does this series converge to the true value of my function after a practical number of terms, and how many terms do I need? (3) How do I calculate the values of the coefficients? In the case of Fourier series, we may answer these questions: (1) If the function is square integrable, then it may be represented in this way. A square integrable function is one for which

∫_{−∞}^{∞} |f(x)|² dx

is finite. Note that this is true for nearly all functions that we are likely to meet in real life. (2) It will converge to the true value of the function at every point at which the function is differentiable. As the frequency of variation increases with each set of terms, we may terminate the series after a number of terms such that further terms would vary too rapidly for our practical purpose. In practice, we observe that this number of terms is low enough for this to be practical. Note, for instance, that music is represented (roughly) this way on a CD or in an MP3 file. (3) There are formulas for working out the value of the coefficients. We will not present them here as this would carry us too far afield.
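The truncation idea in point (2) can be illustrated with the square wave, whose Fourier expansion is well known to keep only the odd sine terms with coefficients 4/(πn); this example is standard mathematics and is not taken from the text:

```python
import math

# Partial Fourier sum of the square wave sign(sin x):
# f(x) ≈ sum over odd n of (4 / (pi * n)) * sin(n * x).
# Truncating after a practical number of terms, as discussed above.

def square_wave_partial(x, terms):
    return sum(4.0 / (math.pi * n) * math.sin(n * x)
               for n in range(1, 2 * terms, 2))

# At x = pi/2 the square wave equals 1; the truncated series
# approaches this value as more terms are kept.
approx = square_wave_partial(math.pi / 2, 50)
```

With 50 terms the partial sum is already within a couple of percent of the true value, illustrating why a finite, practical number of basis terms suffices.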


In summary, the sines and cosines form a basis in which we may expand square integrable functions. There are other bases for other types of functions, but the Fourier series is the most important one for practical applications. Another important basis is the polynomial basis, i.e. the fact that any polynomial function can be written in terms of the powers xⁿ. So we may put

y = b₀ + b₁x + b₂x² + b₃x³ + ··· + bₙxⁿ.

This is suitable if the dependency of y upon x has at most n − 1 extrema (local maxima and minima) and at most n − 2 points of inflection. To determine the bᵢ, we apply the same method as before. We form the least-squares sum

D = ∑ⱼ (yⱼ − b₀ − b₁xⱼ − b₂xⱼ² − b₃xⱼ³ − ··· − bₙxⱼⁿ)²

and then set the first derivatives equal to zero,

∂D/∂bᵢ = 0   ∀i : 0 ≤ i ≤ n.

This yields a system of equations that can be solved using matrix methods (e.g. Gaussian elimination) and you are left with a regression polynomial.
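The normal equations ∂D/∂bᵢ = 0 and their solution by Gaussian elimination, as suggested above, can be written out directly. A clarity-first sketch (not an efficient or numerically hardened implementation) with data invented to lie on a known quadratic:

```python
# From-scratch polynomial least squares: build the normal equations
# sum_j x_j^(i+k) * b_k = sum_j x_j^i * y_j  (from dD/db_i = 0)
# and solve them by Gaussian elimination with partial pivoting.

def fit_polynomial(xs, ys, n):
    """Return coefficients b_0..b_n minimizing the least-squares sum D."""
    size = n + 1
    A = [[sum(x ** (i + k) for x in xs) for k in range(size)]
         for i in range(size)]
    rhs = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(size)]
    # Forward elimination.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            factor = A[row][col] / A[col][col]
            for k in range(col, size):
                A[row][k] -= factor * A[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    b = [0.0] * size
    for i in reversed(range(size)):
        b[i] = (rhs[i] - sum(A[i][k] * b[k]
                             for k in range(i + 1, size))) / A[i][i]
    return b

# Data generated from y = 1 - 2x + 3x^2 is recovered (up to rounding).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1 - 2 * x + 3 * x ** 2 for x in xs]
b = fit_polynomial(xs, ys, 2)
```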

5.9.3 Nonlinearity

The above example of using the polynomial basis to obtain a function that is nonlinear in the independent variable x is a popular method. It leaves us with the question of determining the "correct" value of n. The answer to this is complex because we first need to agree on a criterion that would allow us to objectively judge the quality of a particular choice. There are various possibilities, but the least-squares sum will also be useful here, i.e. select that n which minimizes D over the space of all n. This can easily be determined with a little computation.

Let us recall Taylor's theorem from our school days. In the simple version, it states that any n + 1 times differentiable function f(x) can be approximated in the neighborhood of a point x = a by

f(x) ≈ f(a) + f′(a)(x − a) + f″(a)/2! (x − a)² + f‴(a)/3! (x − a)³ + ··· + f⁽ⁿ⁾(a)/n! (x − a)ⁿ.

The approximation is at least as accurate as the remainder term

R = ∫ₐˣ f⁽ⁿ⁺¹⁾(t)/n! (x − t)ⁿ dt.


Simply put, we may approximate any reasonable function by a polynomial within some region around an interesting point. Because this theorem applies to virtually every function that we are likely to meet in daily life, the polynomial basis is a valid approximation for virtually any function within a certain convergence interval. Of course, practically it will not be easy to determine the convergence interval, but it can be determined by experimental means (deviation between polynomial and reality). Strictly speaking, we must analyze the remainder term R and show that it approaches zero in the limit of n approaching infinity. If we can do this, the function is called analytic and its series expansion will converge to the function within a neighborhood around a. The neighborhood's size can be found by determining the range of values for which the remainder tends to zero for infinite n. With closed-form functions, we can compute this, but with empirically determined functions, we must compare the output of the polynomial to experimental data.

To truly capture a real-life case, however, we need to be able to consider several independent variables. We will denote the variables by x(i) to distinguish them from the respective empirical measurements x(i)ⱼ. Thus, if we want the fifth empirical measurement of the third variable, we would say x(3)₅. If we now take m such variables and combine them to a maximum power of n, then the function is

y = ∑_{i₁=0}^{n} ∑_{i₂=0}^{n−i₁} ∑_{i₃=0}^{n−i₁−i₂} ··· ∑_{iₘ=0}^{n−i₁−i₂−···−i_{m−1}} b_{i₁i₂···iₘ} x(1)^{i₁} x(2)^{i₂} ··· x(m)^{iₘ}.

As before, we form the least-squares sum

D = ∑ⱼ ( yⱼ − ∑_{i₁=0}^{n} ∑_{i₂=0}^{n−i₁} ··· ∑_{iₘ=0}^{n−i₁−···−i_{m−1}} b_{i₁i₂···iₘ} x(1)ⱼ^{i₁} x(2)ⱼ^{i₂} ··· x(m)ⱼ^{iₘ} )².

To determine the parameters, we again take the relevant partial derivatives and solve the ensuing system of coupled equations to arrive at the b_{i₁i₂···iₘ}. The m will be apparent from the problem specification; it is equal to the number of independent variables that are available. However, we must be careful here: not all available variables really do influence y (enough to matter)! Thus, we must take care to choose only those independent variables that have a relevant and significant influence, and this is a matter of some delicacy that requires domain knowledge about the source of these variables. The n may be determined as before by performing several computations with diverse n and comparing D.

In this manner, we may deduce a non-linear model from data that will be valid at least locally. Please note that such models are inherently static, as we have said nothing about time – time is not just another variable, as it represents a causal correlation between subsequent measurements of the same variable and creates a totally different complication.
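The nested sums above contain one coefficient b_{i₁···iₘ} per exponent tuple with total degree at most n. Enumerating these tuples, as in the following sketch, makes concrete how quickly the number of parameters grows with m and n (the helper name is ours, not the book's):

```python
from itertools import product

# One coefficient exists per exponent tuple (i1, ..., im) with
# i1 + ... + im <= n; enumerating them shows the parameter count.

def monomial_exponents(m, n):
    """All exponent tuples (i1, ..., im) with total degree at most n."""
    return [t for t in product(range(n + 1), repeat=m) if sum(t) <= n]

terms = monomial_exponents(2, 3)   # m = 2 variables, max total degree 3
# Ten coefficients: 1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3
```

The count equals the binomial coefficient C(n + m, m), which is one reason why choosing only the truly relevant independent variables matters so much.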


5.10 Case Study: Customer Segmentation

At a major European wholesale retailer, hoteliers, restaurants, caterers, canteens, small and medium-sized retailers as well as service companies and businesses of all kinds find everything they need to run their daily business. Every customer has a dedicated membership number and card. Because of this, it is possible to attribute every item sold to a particular customer.

Customer segmentation in general is the problem of grouping a set of customers into meaningful groups based, for example, on their profession or on their buying behavior. In this particular case, it also allows us to trace which customers belong to these groups because we are aware of their (business) identities. This traceability is attempted by many other retailers via loyalty programs in which clients also allow the retailer to attach their identity to the products purchased.

Globally speaking, it is interesting to find out which buying patterns can be detected in a certain group of clients. Based on a more detailed description of these groups and investigations of cause and effect for the actions of these groups, it is then possible to adjust the business model to react to such features, for example with targeted advertising such as specific product offerings to specific customers based on their purchasing habits.

Such an investigation has taken place for a dataset of all sold items in two stores over one calendar year. This included over 31 million transactions. The investigation included no particular questions to be answered and no a priori hypotheses to be confirmed or denied. The goal was to find any structures that might be economically interesting from the point of view of marketing to these clients. The methods used to treat the data were diverse in nature. We used descriptive statistics, non-linear multi-dimensional regression analyses in all dimensions, k-means clustering and Markov chain modeling. The aims were as follows:

1. Descriptive statistics [91]: To get an overall feel for the dataset and its various sections as discovered by the other algorithms. This includes correlation analyses. In supporting the Markov chain methodology, this also includes Bayesian prior and posterior distribution analysis, which is able to tell, for example, in which order in time events happen (leading to cause-and-effect conclusions).
2. Non-linear multi-dimensional regression [40]: To get a model of the dependency of the variables on each other. Expressing variables in terms of each other can lead at once to understanding and also to dimensionality reduction.
3. k-means clustering [44]: To find out which purchases/clients belong in the same phenomenological group and thus determine the actual segmentation that the other methods describe.
4. Markov chain modeling [44]: To model the time-dependent dynamics of the system and thus to find out what stable states exist.

Several descriptive conclusions are available to help with the understanding of the dataset. We present them here in a descriptive format, as this is all that is required for understanding the final result. In the actual case study, these conclusions can be made numerically precise:


1. The total amount of money spent per visit is, statistically speaking, the same for any particular client. Thus, in order to increase total revenues, the key is to increase customer traffic – either by getting a client to come more often or by attracting new customers.
2. Customers generally go to the store closest to their own location (in this study their place of business, since we are dealing with a wholesaler). The probability of visiting another store decreases exponentially with distance.
3. The high-season business is focused in the aftermath of the summer school holidays and the preparations for Christmas. The low-season business is focused in the summer school holidays and in the early part of the year after Christmas.
4. The total amount of money spent per year and per visit, as well as the number of articles purchased, depends highly on the type of client and the geographical region. This has a significant effect upon storage and logistics planning.
5. The majority of clients shop in the store very rarely. There is a core group of clients that shop quite regularly.
6. The products and product groups sold depend strongly on regional effects and on the visit frequency of a customer.
7. Certain products are generally bought in combination with certain other products. Thus, we may speak of a "bag of goods" that is generally bought as a whole. The contents of this bag depend upon the customer group and geography.
8. Via Bayesian analysis and Markov chain modeling it is possible to deduce that the purchase of a certain product causally leads to the purchase of another product as an effect of the initial purchase. An example is that a purchase of fresh meat directly leads to the purchase of vegetables, cheese, and other milk products.

To summarize these conclusions, we may say that customer behavior depends upon geography, product availability, time of the year and certain key products.
It was determined that the following factors offer a significant potential to improve the profitability of the retail market (most significant first):

1. Individual marketing [42]: Customers tend to be interested in a narrow range of products. It is instructive to cluster the customers into interest groups. We find that there are fewer than 10 clusters that hold a significant number of customers and that are sufficiently heterogeneous in terms of the products bought to really divide the customers into different groups. These different interest groups could now be treated differently in some ways, e.g. by sending them advertising materials specifically targeted towards their interest group.
2. Price arbitrage: In each important product group there is a particular product that is the causal product in the group. This means that if the customer buys this product, then the customer will also buy a variety of related products in this category. This cause-effect relationship may be used to make this key product more attractive in order to boost sales in the entire product group. One way to do this is to lower the price of the key product (and raise the prices of the non-key products in the same category). It can be shown that the causal relationship is independent of price changes. However, the identity of the key product is not universal, in that there are regional differences.


3. Geography: Most sales are made to customers whose place of business is 20 to 40 minutes away from the store. At an average travel speed of 30 km/h, this is an area of approximately 940 km², which is comparable to the size of a moderately sized city. The wholesaler can focus his efforts in this area, e.g. when establishing one-to-one contacts with his customers. Promotional activities in this area, like billboard advertising on major roads, may also be effective.
4. Time of the year: The main purchase times are March, August and the pre-Christmas period. The low times are January, February and the summer holidays. The rest of the year corresponds to average purchase activity. The advertising should reflect this trend, focusing on and exploiting the seasonal peaks.

Due to non-disclosure, we have presented the conclusions at this high level. The procedures of data mining are able to output a quantitative presentation of these results (also with uncertainty corridors) that allows these conclusions to act as a firm basis for business decisions.

We note that these conclusions were the result of blind analysis. That is, data on 31 million transactions were given to the mining algorithms without specifying either questions or hypotheses. These algorithms output data that could be interpreted by an experienced human analyst into the above conclusions in just a few hours. Based on these results we may now ask a number of specific questions to make the results clearer, especially when decisions have to be taken to implement changes based on these findings. We will not go into such an interactive question-answer process. Despite the wish to know more, these conclusions are already quite telling and provide valuable material for high-level decision making.

This illustrates very well the power of data mining. We have converted a vast collection of data into a small number of understandable, actionable conclusions that can be presented to corporate management. Moreover, we have been able to do so quickly.
This procedure may well be reproduced automatically every month to track changes in customer behavior. One caveat remains, however: the challenge for any data-mining approach in a "bricks and mortar" business is to translate the findings into successful operational business concepts.
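The clustering step named in the study can be illustrated with a deliberately minimal one-dimensional k-means; the actual analysis worked in many dimensions, and the data and cluster count here are invented:

```python
# Minimal one-dimensional k-means: alternate between assigning each
# point to its nearest center and moving each center to the mean of
# its assigned points.

def kmeans_1d(points, centers, iterations=20):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers

# Two well-separated "customer groups" around 2 and 11.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = sorted(kmeans_1d(data, [0.0, 5.0]))
```

On this toy data the algorithm converges after two iterations to centers near 2 and 11, the group means.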

Chapter 6

Modeling: Neural Networks

6.1 What is Modeling?

A mathematical model is a mathematical description of a system. This may take the form of a set of equations that can be solved for the values of several variables, or it may take the form of an algorithmic prescription of how to compute something of interest, e.g. a decision tree to compute whether a produced part is good or bad. It is wrong to think that all models can be written in traditional equation format, e.g. y = mx + b, no matter how complicated or simple the equation is. Frequently, it is far simpler to give a step-by-step recipe based on if-then rules and the like to describe how to get at the desired result. The model is then this recipe.

Modeling is a process that has the mathematical model as its objective and end. Mostly it starts with data that has been obtained by taking measurements in the world. Industrially, we have instrumentation equipment in our plants that produces a steady stream of data, which may be used to create a model of the plant. Note that modeling itself converts data into a model – a model that fits the situation as described by the data. That's it.

Practically, just having a model is nice but does not solve the problem. In order to solve a particular practical problem, we need to use the model for something. Thus, modeling is not the end of industrial problem solving; it must be followed by other steps, at least some form of corporate decision making and analysis. Modeling is also not the start of solving a problem. The beginning is formed by formulating the problem itself, defining what data is needed, collecting the data and preparing the data for modeling. Frequently it is these steps prior to modeling that require most of the human time and effort to solve the problem. Mathematically speaking, modeling is the most complicated step along the road. It is here that we must be careful with the methodology and the data, as much happens that has the character of a black box.
Generally speaking, modeling involves two steps: (1) manual choice of a functional form and learning algorithm and (2) automatic execution of the learning algorithm to determine the parameters of the chosen functional form. It is important to distinguish between these two because the first step must be done by an experienced modeler after some insight into the problem is gained, while step two is a question of computing resources only (supposing, of course, that the necessary software is ready). In practice, however, this two-step process is most frequently a loop. The results of the computation make it clear that the manual choices must be re-evaluated, and so on. Through a few loops, a learning process takes place in the mind of the human modeler as to what approach would work best. Modeling is thus a discovery process with an uncertain time plan; it is not a mechanical application of rules or methods.

P. Bangert (ed.), Optimization for Industrial Problems, DOI 10.1007/978-3-642-24974-7_6, © Springer-Verlag Berlin Heidelberg 2012
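The two steps can be shown in miniature. In this sketch, step 1 (manual) is the choice of the functional form sin(ax); step 2 (automatic) is a plain grid search standing in for a real learning algorithm. The "true" parameter value 1.7 and all data are invented:

```python
import math

# Step 1 (manual): choose the functional form f(x) = sin(a * x),
# with a single free parameter a.
# Step 2 (automatic): determine a from data; here, by grid search.

xs = [0.25 * i for i in range(40)]
ys = [math.sin(1.7 * x) for x in xs]          # "observed" data

def squared_error(a):
    return sum((y - math.sin(a * x)) ** 2 for x, y in zip(xs, ys))

candidates = [0.01 * k for k in range(1, 301)]
best_a = min(candidates, key=squared_error)   # recovers a close to 1.7
```

The functional form plus the determined parameter value together constitute the model; rerunning step 2 with a different form (the loop described above) is where the modeler's judgment enters.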

Fig. 6.1 A neural network is trained to distinguish two categories. After a few training steps (a) and two intermediate steps (b) and (c), the network settles into its ﬁnal convergent state (d). In this case it may take approximately 200 iterations of a normal neural network training method to reach the ﬁnal state.

Let us take the example of fitting a curve through a set of data points. First, we choose to fit a straight line, i.e. y = mx + b. Second, we use the linear least-squares algorithm to determine the values of m and b from the collected data. The result is the model, i.e. specific values for m and b. The algorithm used in the second step will produce the best values for the parameters that it can, given the functional form and the data. It will produce an output even if a straight line is a patently poor fit for the data. Thus, it is essential to understand what type of model even has a chance to correctly model the data.

Frequently, model types are chosen that are provably so flexible that they can be used for nearly all data. An example of this is the neural network that we will deal with in this chapter. It can model virtually any relationship present in a dataset. However, this statement must be taken with a grain of salt, as it is contingent upon a variety of conditions, such as that the number of parameters may have to be increased indefinitely for the statement to hold. That would pose a practical problem, of course, because we desire to have many fewer parameters in the model than we have data points, and clearly the data points are finite in number. If we had as many parameters as data points, then the model would be no better than a simple list of data points and the modeling step would not provide us with knowledge.

It is precisely the compactification of lots of data into a functional form with a few parameters that encapsulates the knowledge gain of the modeling process. We can attempt to understand the model because it is "small." We can use the model to compute what the system will be like at points that were never measured (interpolation and perhaps extrapolation). If the number of parameters were too large, the model would merely learn the data by heart and we would lose both advantages – a situation known as over-fitting. Thus, it is fair to say that a model is a functional summary of a dataset – it is a summary in that it encapsulates the same information in fewer numbers, and it is functional in that we may use it to generate information that is not immediately apparent, by evaluating it at novel points.

To make this clearer, we will look at an example in figure 6.2. This is a vibration measurement on an industrial crane. The gray jagged line displays the raw data and the dotted black line displays the cleaned data according to the methods of chapter 4.
In fact, we have seen this example in ﬁgure 4.1 before. What we have added in this ﬁgure is the solid black line.

Fig. 6.2 Here we compare raw data (gray) against ﬁltered data (black dotted) and the model (black solid) with a prediction into the future of the model.


Please note that the input data (raw as well as denoised) lasts from time zero to time 6500 minutes. This data is provided to a machine learning algorithm that produces a mathematical formula for calculating the next value in time given the previous values. Using this formula, we first attempt to re-create the known values and then calculate some more in order to predict the future. What we see here is that the model output (solid black line) for the time from zero to 6500 reproduces the known denoised data (dotted line) so well that we can hardly tell them apart on the diagram. That is a good sign.

Beyond reproducing the known data, the model is then used to compute values for the time from 6500 to 8000 minutes. On the image, we have also graphed the uncertainty in this prediction, as we can no longer validate the prediction due to a lack of observations in the future. And so we see a line that slowly gains a greater and greater uncertainty but remains within a well-defined corridor of values. The formula found, the model, thus reproduces the known data and makes a prediction that, at face value, makes sense. Whether this corresponds to true reality, only experimentation (waiting for the future) can reveal. Once we are confident that the model can indeed predict the future, we can use it to compute the future. On this basis, we may then, with confidence, plan actions to prevent events we do not want to occur but of which we now know that they will occur if we do nothing.

6.1.1 Data Preparation

We will assume that the data is already clean and so contains only data that is representative of the problem that we wish to model; see chapter 4 for methods of getting raw data this far. A factor that must be considered at this point is whether the data is in a form that allows a machine learning algorithm to quickly learn the salient features of the dataset.

We give a simple example to illustrate the point. Consider modeling the expected rise/fall of a stock price as a function of the principal balance sheet components of a company. To train this model, you may have data over several years from many companies. Such a model will not be useful to predict a share price (too little data per company) but it will reveal some interesting basic information. Two of the principal balance sheet figures are, for example, the stock price and the earnings. The machine learning algorithm will be able to learn that the expected rise/fall of the stock price depends upon the price-to-earnings ratio, but it will require both time and data to reach this conclusion. Since this is something that we humans already know, it would have helped the algorithm if we had removed the column of prices and that of earnings and had inserted a column consisting of their ratio. This example is referred to by the general injunction that one should add domain knowledge into the training data set. We have not attempted to explain the dynamics of the stock exchange to a neural network; rather, we transformed the data into a form that is conducive to learning, just as we would purchase a nice colorful language learning book for our child, as opposed to a mere dictionary, to aid it in its learning of a foreign language.

Another feature of data preparation for machine learning is that many learning methods do not deal well with data that is collinear. What is meant by this is that if we have two series of observations (e.g. a and b) that are related by a simple linear transformation (e.g. a = mb + c with m and c constants), then the learner can become confused by this. The reason why may be compared to a situation with human beings: if we are trying to teach someone the meaning of "chair" and illustrate this with examples of the same chair in a small and a large variety, the person who is to learn the concept may actually confuse the purpose of the chair with the size relationship in the dataset and thus learn something entirely unintended. This must be avoided by removing too simply related data from the dataset.

It cannot be overemphasized that the form in which data is presented to a machine learning algorithm has more influence on the accuracy of the final model than the choice of learning algorithm (as long as it is a reasonable algorithm that has the essential capability to learn the present problem). Thus, data preparation is a delicate activity and must be aided by thought and domain knowledge. Generally speaking, the following questions should be answered when modeling industrial data:

1. Are all the relevant measurements present and have they been cleaned in the sense of chapter 4?
2. Are only the relevant measurements present?
3. Have simple relationships been removed?
4. Can the data be transformed in some meaningful way in order to better represent the phenomenon that is important for modeling?
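A collinearity check of the kind described above can be sketched with the Pearson correlation coefficient: two columns related by a = mb + c have a correlation of exactly +1 or −1, flagging one of them for removal. The data below is invented:

```python
import math

# Pearson correlation as a simple collinearity detector: columns
# related by an exact linear transformation yield |r| = 1.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

b_col = [1.0, 2.0, 4.0, 5.0, 9.0]
a_col = [3 * b + 2 for b in b_col]    # perfectly collinear with b_col
r = pearson(a_col, b_col)             # r = 1 up to rounding
```

In practice one would compute r for every pair of candidate input columns and drop one column of any pair whose |r| is very close to 1.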

6.1.2 How much data is enough?

When modeling, the question of how much data is needed always arises. Often, we are limited by practicality. Getting data sometimes represents a cost in time, money and effort. Having more data may also make learning slower. Increasing the sample rate may just produce noise and not more valuable signal information. Too little information is not enough, and too much may be counterproductive as well. Before we end up with too much data, there is definitely a long interval of diminishing returns from getting more and more data.

One can philosophically say that an industrial process contains a certain amount of information regardless of how much data we extract from it. We must make sure that this information is represented in the data and by a reasonable amount of data. Thus, we must choose the amount of data that represents the greatest gain of knowledge for the resources that we put into its acquisition. Because this is dependent upon the problem at hand, we need a quantitative measure of "enough." We will approach this in two steps.

First, we will say that what we are really looking for is that, if we have a reasonable model of the situation, an additional bit of data will improve the model somewhat. As soon as we have reached the state where additional data no longer allows a model improvement, we may stop and say that we have enough data. Thus, we define enough data as that amount of data that we need to get the model to converge (to the right result), i.e. the model output for the validation data agrees with the experimental measurements for the same dataset and no longer changes appreciably with more data.

Second, we need to find a quantitative measure of convergence. This is provided, for example, by the variance. After the model is generated, we compute the variance, add more data, remodel and again compute the variance. In this way, we obtain the relationship between the size of the dataset and the variance. This relationship is roughly logarithmic, i.e. the variance rises with increasing dataset size and eventually settles down to a more or less horizontal line. This can easily be detected, and there is your convergence point. Equally well, this can be measured by the mean squared deviation between model output and experimental data in the validation dataset. This should behave in roughly the same manner and thus provide the point at which training may profitably stop.

Note that in this approach it is not possible to calculate, a priori, how many data points are needed. Rather, it is a checking procedure, dependent upon the modeling method, to check whether we already have enough. Any method to compute a definite data volume before modeling begins will be a mere estimate. Note also that the information content is not a simple function of data volume as, for example, many repetitions of the same measurement do not add any information. The data volume must, therefore, have a suitably diverse population of points in it to aid the analysis.
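The plateau-detection step of this checking procedure can be sketched as follows. The error values are invented to stand in for the flattening curve described above, and the function name and tolerance are ours:

```python
# Sketch of the "enough data" check: track the validation error as the
# data set grows and stop when the relative improvement from adding
# more data drops below a tolerance.

def enough_data(errors, tolerance=0.01):
    """Index at which the error curve has flattened out, i.e. where
    the relative improvement falls below `tolerance`."""
    for i in range(1, len(errors)):
        improvement = (errors[i - 1] - errors[i]) / errors[i - 1]
        if improvement < tolerance:
            return i
    return len(errors) - 1

# Validation error as the data set grows: big gains first, then a plateau.
errors = [1.00, 0.55, 0.34, 0.25, 0.21, 0.205, 0.204]
index = enough_data(errors)
```

With the invented numbers above, the relative improvement first falls below 1% at the last step, so the procedure reports that the sixth enlargement of the dataset brought no appreciable gain.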

6.2 Neural Networks The term neural network refers to a wide family of functional forms that are frequently used to model experimental data. Historically, the creation of some of these forms was motivated by the apparent design of the human brain. This historical motivation does not concern us here and will not be discussed. For the present book, we shall distinguish between a functional form and a function by allowing the former to have numerically undetermined parameters and requiring the latter to specify numerical values for all parameters. So, for example, sin(ax) with a a parameter is a functional form and sin(2x) is a function. This distinction is crucial in machine learning, as the principal effort in machine learning is always put into the methods of determining the numerical values of the parameters in an a priori determined functional form. Practically speaking, we begin with a dataset and decide (by some unspecified and generally human procedure) that a certain functional form should be able to


model that data. Then a machine learning algorithm determines the values of the parameters. The end result is a function (without parameters) that models the data. The topic of neural networks can be profitably split into two categories:

1. A list of different functional forms, so-called networks, that can be used to model data and their attendant properties such as
a. the kind of functions that can be represented,
b. restrictions on the values of parameters,
c. robustness properties, and
d. scaling behavior.
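To make the functional-form-versus-function distinction concrete, one can recover the parameter a in sin(ax) from data generated by sin(2x). The grid search below is a deliberately crude stand-in for a real training algorithm:

```python
import numpy as np

# Data generated from the function sin(2x); the "true" parameter a = 2
# is treated as unknown and recovered by fitting the form sin(a*x).
x = np.linspace(0, 3, 100)
y = np.sin(2 * x)

candidates = np.linspace(0.5, 4.0, 701)   # grid over the parameter a
errors = [float(np.sum((np.sin(a * x) - y) ** 2)) for a in candidates]
a_hat = float(candidates[int(np.argmin(errors))])   # best-fitting parameter
```

The learning step turns the functional form sin(ax) into the function sin(2x) by fixing the parameter from data.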

2. A presentation of various algorithms used to determine the numerical values of the parameters in the functional form and their attendant properties such as
a. requirements on the training data form (labeled or unlabeled) or cleanliness (signal-to-noise ratio or other pre-processing requirements),
b. speed of training,
c. convergence rate, convergence target (local or global optimum) and robustness of convergence, and
d. practical issues such as termination criteria, parametrization or initialization requirements.

Most books on neural networks mix these topics strongly and implicitly place a focus on the second. In practice, however, it is important to distinguish the network from the algorithm used to train it, because we usually have a (largely) independent choice of both network and algorithm and must be aware of the limitations involved. For understanding, it is also important to know what a network can accomplish in the realistic case. Training a neural network is a black art requiring much experience, and it depends mostly on the person preparing the data (pre-processing) and selecting the training methodology and the parametrization of the training method. Even then, the issue of convergence to local optima typically requires significant tuning to a particular problem before a function is found that represents the data well enough for practical purposes.¹ For this reason, we will not be discussing training algorithms at all but refer to the specialist literature for this purpose [69]. We mention in passing that if you have a problem that you wish to model using neural networks and you are not already an expert in training them, it is probably best to get an expert to do the modeling for you, as learning how to do it can require many months. This chapter is intended to give the novice an overview of what neural networks are, what they can and cannot do, and to give a sense of the complexity of the topic.
Before going into the details, we need to discuss two important issues.

¹ Together, the neural network topology and the training method give rise to several variables that must be chosen by the human modeler. On top of that, most training methods involve some degree of random number generation, which means that each training run is conducted slightly differently and so results cannot be completely comparable. The exact effect of changing any one of the initial parameters is unclear, and so much stabbing in the dark may become necessary before enough learning has happened in the modeler's mind.


First, neural network methods yield a function that describes a specific set of data. What is first done to obtain this set of data, or what is later done with this formula, is no longer an aspect of neural network theory or practice. As such, it is useful to view a neural network as a summary of data – a large table of numbers is converted into a function – similar to the abstract of a scientific paper being a summary. Please note that a summary cannot contain more information than the original set of data²; indeed, it contains less! Due to its brevity, however, it is hoped that the summary may be useful. In this way, neural networks can be said to transform information into knowledge, albeit into knowledge that still requires interpretation to yield something practically usable. Second, neural networks are intended for practical modeling purposes. The summarization of data is nice, but it is not sufficient for most applications. To be practical, we require interpolative and extrapolative qualities of the model. Supposing that the dataset included measurements taken for the independent variable x at the values x = 1 and x = 2, the model has the interpolative quality if the model output at a point in between these two is a reasonable value, i.e. it is close to what would have been measured had this measurement been taken, e.g. at x = 1.5. Of course, this can only hold if the original data has been taken with sufficient resolution to cover all effects. The model has the extrapolative quality if the model output for values of the independent variable outside the observed range is also reasonable, e.g. for x = 2.5 in this case. If a model behaves well under both interpolation and extrapolation, it is said to generalize well. A neural network model is generally used as a black-box model.
That is, it is not usually taken as a function to be understood by human engineers, but rather as a computational tool to save having to do many experiments and determine values by direct observation. It is this application that necessitates both of the above aspects: We require a function for computation, and this is only useful if it can produce reasonable output for values of the independent variables that were not measured (i.e. compute the result of an experiment that was not actually done, having confidence that this computed result corresponds to what would have been observed had the experiment been done). The data used for training can be obtained from actual experiments or alternatively from simulations – neural networks are not concerned with the data source. Simulations of physical phenomena are often very complex, as they are usually done from so-called first principles, i.e. by using the basic laws of physics and so on to model the system. As these simulations take time, they cannot be continuously run in an industrial setting. Neural networks provide a simpler empirical device by adjusting the parameters of a functional form until the so-determined function represents the data well.

² Whatever is in the data will hopefully also be in the network, but there is no guarantee of this, as summarization is a lossy process. Whatever is not in the data, however, is definitely not in the network. We must thus not expect a neural network to divine effects that have not been measured in the data. Issues such as noise and over-representation of one effect over another can produce unexpected results. That is the reason why pre-processing is so important.


These are the major advantages of neural networks: They are (1) easier to create than first-principles simulations, (2) able to capture more dynamics than a typical first-principles model, because they model actual experimental output rather than idealized situations, and (3) once they exist, easy and fast to evaluate. They are thus practical and cheap. The price that must be paid for this practicality is that the way in which the function is obtained and the resultant function itself are both not intended for human understanding. Questions such as 'why' and 'how' must thus never be asked of a neural network. We may only ask 'what' the value of some variable is.

6.3 Basic Concepts of Neural Network Modeling The dataset that will be used as the basis for obtaining the neural network contains several variables. For better understanding, we include a very simple example in table 6.1. We have three variables x1, x2 and y1 in this table. Each variable has an associated measurement uncertainty or error Δ(x1), Δ(x2) and Δ(y1). It is important to note that no observation whatsoever is fully accurate and so there are always measurement errors. Frequently, however, their presence is ignored for basic modeling purposes. We include them here because they have significant effects for industrial practice.

x1    Δ(x1)   x2    Δ(x2)   y1    Δ(y1)
1.2   0.1     2.3   0.2     0.1   0.01
1.4   0.1     3.1   0.2     0.2   0.01
1.6   0.2     3.2   0.2     0.3   0.02
1.8   0.2     3.5   0.3     0.4   0.05

Table 6.1 An example data set for training a neural network. Note the presence of uncertainty measurements as well. This is important in practice as no measurement in the real world is totally precise.

The variables must first be classified into dependent and independent variables or, to use neural network vocabulary, into output and input variables respectively. We will decide that the two x variables are independent and yield the single y variable, which is the dependent variable. We could have arbitrarily many independent and dependent variables in general; there is no fundamental limitation. Thus, we look for some function f(···) such that y1 = f(x1, x2). In general, when we have many variables, we represent their collection by a vector, x = {x1, x2, ···}. The general function is thus y = f(x). Knowing the Δxi, we may compute Δy by


(Δy)² = ∑i (∂f(x)/∂xi)² (Δxi)².
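This error-propagation formula can be evaluated numerically even when the partial derivatives of f are not available in closed form. The model f(x1, x2) = x1·x2 below is a hypothetical stand-in, evaluated at the first row of table 6.1, with the derivatives estimated by central finite differences:

```python
import numpy as np

def propagate_uncertainty(f, x, dx, eps=1e-6):
    """Gaussian error propagation: (dy)^2 = sum_i (df/dx_i)^2 (dx_i)^2,
    with partial derivatives estimated by central finite differences."""
    x = np.asarray(x, dtype=float)
    var = 0.0
    for i in range(len(x)):
        h = np.zeros_like(x)
        h[i] = eps
        grad_i = (f(x + h) - f(x - h)) / (2 * eps)   # df/dx_i
        var += grad_i ** 2 * dx[i] ** 2
    return float(np.sqrt(var))

# Hypothetical model f(x1, x2) = x1 * x2 at the first row of table 6.1.
f = lambda x: x[0] * x[1]
dy = propagate_uncertainty(f, [1.2, 2.3], [0.1, 0.2])   # approx. 0.332
```

Here dy² = (2.3 · 0.1)² + (1.2 · 0.2)² = 0.1105, so dy ≈ 0.332.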

Having classified the variables into input/output for the function, we must classify the type of output variable. There are two principal options: an output variable may be numeric or nominal. If a variable is numeric, its numerical value has some significance and so it makes sense to compare various values with each other, e.g. a temperature measurement. The function to be modeled is thus a regular mathematical function that could be drawn on a plot. If a variable is nominal, then the value of the variable serves only to distinguish it from other values, and its numerical value has no significance. We use nominal values to differentiate category 1 from category 2 knowing that it is senseless to say that the difference between category 2 and category 1 is 1. Neural networks are very often used for nominal variables, and many are specifically intended to be classification networks, i.e. they classify an observation into one of several categories. Empirically, it has been found that neural network methods are very good at learning classifications. Generally, most neural network methods assume that the data points illustrated in the above sample table are independent measurements. This is a pivotal point in modeling and bears some discussion. Suppose we have a collection of digital images and we classify them into two groups: those showing human faces and those showing something else. Neural networks can learn the classification into these two groups if the collection is large enough. The images are unrelated to each other; there is no cause-effect relationship between any two images – at least not one relevant to the task of distinguishing a human face from other images. Suppose, however, that we are classifying winter versus summer images of nature and that our images are of the same location and arranged chronologically with relatively high cadence. Now the images are not independent; rather, they have a cause-effect relationship ordered in time.
This implies that the function f(···) that we were looking for is really quite different:

y = f(x)   →   yi = f(xi−1, xi−2, ···, xi−h).

In this version, we see a dependence upon history that implies a time-oriented memory of the system over a time length h that must somehow be determined. Depending on the dynamics of the time-dependent system, the memory of the process need not be a universal constant, so that in general h = h(i) is itself a function of time. As a consequence, we have network models that work well for datasets with independent data points (see section 6.4) and others that work well for datasets in which the data points are time-dependent (see section 6.5). The networks that deal with independent points are called feed-forward networks; they form the historical beginning of the field of neural networks and remain the principal methods in use. The networks that deal with time-dependent points are called recurrent networks, which are newer and more complex to apply.
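For a fixed memory length h, the time-dependent formulation amounts to building training pairs from a sliding window over the series, for example:

```python
import numpy as np

def make_windows(series, h):
    """Turn a time series into supervised pairs: the h previous values
    predict the next one, i.e. y_i = f(x_{i-1}, ..., x_{i-h})."""
    X = np.array([series[i - h:i] for i in range(h, len(series))])
    y = np.array(series[h:])
    return X, y

X, y = make_windows([1, 2, 3, 4, 5, 6], h=3)
# X[0] is [1, 2, 3] and the corresponding target y[0] is 4
```

A time-varying memory h(i) would require re-cutting the windows per index; the fixed h here is the simplest case.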


6.4 Feed-Forward Networks The most popular neural network is called the multi-layer perceptron and takes the form

y = aN(WN · aN−1(WN−1 · ··· a1(W1 · x + b1) ··· + bN−1) + bN)

where N is the number of layers, the Wi are weight matrices, the bi are bias vectors, the ai(···) are activation functions and x are the inputs as before. The weight matrices and the bias vectors are the place-holders for the model's parameters. The so-called topology of the network refers to the freedom that we have in choosing the size of the matrices and vectors, and also the number of layers N. Per layer, we get to choose one integer and thus have N + 1 integers to choose for the topology. The only restriction that we have is the problem-inherent size of the input and output vectors. Once the topology is chosen, the model has a specific number of parameters that reside in these matrices and vectors. The activation functions are almost always functions with the shape of a sigmoid, for example tanh(···). In training such a network, we must first choose the topology of the network and the nature of the activation functions. After that, we must determine the values of the parameters inside the weight matrices and bias vectors. The first step is a matter of human choice and involves considerable experience. After decades of research into this topic, the initial topological choices of a neural network are still effectively a black art. There are many results to guide the practitioners of this art in their rituals, but these go far beyond the scope of this book. The second step can be accomplished by the standard training algorithms that we mentioned before and also will not treat in this book (see e.g. [69]). A single-layer perceptron is thus

y = a(W · x + b)

and was one of the first neural networks to be investigated. The single-layer perceptron can represent only linearly separable patterns.
It is possible to prove that a two-layer perceptron with a sigmoidal activation function for the first layer and a linear activation function for the second layer can approximate virtually any function of interest to any degree of accuracy, provided the weight matrices and bias vectors in each layer are chosen large enough. In practice, we find that perceptrons with between two and four layers are used very frequently to model data. Such a two-layer perceptron looks like

y = m(W2 · tanh(W1 · x + b1) + b2)

where m is a scalar and where we get to choose the sizes of the two weight matrices to match the problem.
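A forward pass through such a two-layer perceptron is a direct transcription of the formula. The layer sizes and random parameter values below are illustrative assumptions; in practice they would come from a training algorithm:

```python
import numpy as np

def two_layer_perceptron(x, W1, b1, W2, b2, m=1.0):
    """y = m * (W2 . tanh(W1 . x + b1) + b2), as in the text."""
    return m * (W2 @ np.tanh(W1 @ x + b1) + b2)

# Illustrative topology: 2 inputs, 4 hidden units, 1 output. Random values
# stand in for parameters that a training algorithm would determine.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)
y = two_layer_perceptron(np.array([1.2, 2.3]), W1, b1, W2, b2)
```

The topology choice here fixes the matrix sizes; training would only change the numbers inside them.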


This approach is used both for numerical data and for nominal data and is found to work very well indeed. Many industrial applications are based on perceptrons, e.g. adaptive controllers.

6.5 Recurrent Networks The basic idea of a recurrent neural network is to make the future dependent upon the past by a similar form of function to the perceptron model. So, for instance, a very simple recurrent model could be

x(t+1) = a(W · x(t) + b).

Note very carefully three important features of this equation in contrast to the perceptron discussed before: (1) both input and output are no longer vectors of values but rather vectors of functions that depend on time, (2) there is no separate input and output but rather the input and output are the same entity at two different times, and (3) as this is a time-dependent recurrence relation, we need an initial condition such as x(0) = p for evaluation. The above network, if we choose the activation function to be

a(z) = 1 for z > 1,   a(z) = z for −1 ≤ z ≤ 1,   a(z) = −1 for z < −1,

is called the Hopfield network and is a very good classifier. The methodology is as follows: (1) every one of the possible categories is characterized by a vector of values called a primary pattern, (2) each item to be classified is also characterized by a similar vector of values, (3) the network is trained using a specialized training algorithm, (4) the characterizing vector of an as-yet-unclassified item is input into the network as the initial condition p, (5) the network is iterated in "time" until the vector converges to an unchanging state, and (6) this convergent state is one of the primary patterns and thus the classification is done. This network uses the concept of time to accomplish a static task in which actual time plays no role. If correctly constructed, the time iteration can make the network numerically more stable and so produce more reliable answers than a static network like the perceptron.
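The recall loop of steps (4)-(6) can be sketched as follows. The Hebbian outer-product rule used for training is a common choice but an assumption here, since the text leaves the specialized training algorithm unspecified, and the bias vector b is taken as zero:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian outer-product rule -- a common choice; the text leaves the
    specialized training algorithm unspecified."""
    d = patterns.shape[1]
    W = np.zeros((d, d))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)   # no self-connections
    return W / len(patterns)

def activation(z):
    """The piecewise-linear activation from the text, applied elementwise."""
    return np.clip(z, -1.0, 1.0)

def recall(W, p, iters=100):
    """Iterate x(t+1) = a(W . x(t)) from the initial condition p until the
    state stops changing (the bias b is taken as zero here)."""
    x = p.astype(float)
    for _ in range(iters):
        x_new = activation(W @ x)
        if np.allclose(x_new, x):
            break
        x = x_new
    return np.sign(x)

# Two +/-1 primary patterns; a corrupted probe converges back to pattern 0.
patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]], dtype=float)
W = train_hopfield(patterns)
probe = np.array([1, 1, 1, 1, -1, -1, 1, 1], dtype=float)  # last two bits flipped
restored = recall(W, probe)
```

The convergent state is the primary pattern closest to the probe, which is exactly the classification described in steps (5) and (6).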

Fig. 6.3 The ﬁrst three digits as input to train a Hopﬁeld network to recognize these digits.


We demonstrate the issues by means of a common example, the recognition of digits. In figure 6.3 we display the bit patterns of three digits that we wish to recognize using a Hopfield network. Each two-dimensional pattern can easily be converted into a one-dimensional vector of bits by appending each column to the bottom of the previous column. Thus, the digit "0" becomes the vector x0 = [011110100001100001100001100001011110]T. With this setup, we can train the Hopfield network and obtain the matrix W and the vector b. Using this network, we may then classify new inputs. Figure 6.4 displays the results of verifying the network on novel input. We see that if we occlude 50% of the original pattern, we retrieve the correct result. However, if we occlude 67%, then we get errors in recognition. If the network is presented with noisy inputs, it makes the same identification that a human being would have made. Thus, we conclude that the network is sensible in its classification. We have thus been able to represent the difference between the first three digits in a Hopfield network. This is roughly the principle by which optical character recognition is done, even though fancier techniques are used in commercial software to make the system less error prone.
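The column-wise flattening can be checked directly. The 6x6 ring-shaped bit pattern below is a plausible reconstruction of the digit "0" from figure 6.3, not the book's exact data:

```python
import numpy as np

# A 6x6 ring-shaped bit pattern standing in for the digit "0"
# (a plausible reconstruction, not the book's exact data).
digit0 = np.array([
    [0, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 1, 1, 1, 0],
])

# Column-major flattening joins each column below the previous one.
x0 = digit0.flatten(order="F")
bits = "".join(str(b) for b in x0)
```

Because this pattern is symmetric, its column-wise flattening reproduces the vector quoted in the text.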


Fig. 6.4 Several test patterns for the digit recognizing Hopﬁeld network and their associated outputs. Test set (a) includes 50% occluded inputs. Test set (b) includes 67% occluded inputs. Test set (c) includes noisy inputs.

A very new type of recurrent neural network is the echo state network. This uses the concept of a reservoir, which is essentially just a set of nodes that are connected to each other in some way. The connection between nodes is expressed by a weight


matrix W that is initialized in some fashion.³ The current state of each node in the reservoir is stored in a vector x(t) that depends on time t. An input signal u(t), which also depends upon time, is given to the network. This is the actual time-series, measured in reality, that we wish to predict. The input process is governed by an input weight matrix Win that provides the input vector values to any desired neurons in the reservoir. The output of the reservoir is given as a vector y(t). The system state then evolves over time according to

x(t+1) = f(W · x(t) + Win · u(t+1) + Wfb · y(t))

where Wfb is a feedback matrix, which is optional for cases in which we want to include the feedback of output back into the system, and f(···) is a sigmoidal function, usually tanh(···). The output y(t) is computed from the extended system state z(t) = [x(t); u(t)] by using

y(t) = g(Wout · z(t))

where Wout is an output weight matrix and g(···) is a sigmoidal activation function, e.g. tanh(···). The input and feedback matrices are part of the problem specification and must therefore be provided by the user. The internal weight matrix is initialized in some way and then remains untouched. If the matrix W satisfies some complex conditions, then the network has the echo state property, which means that the prediction is eventually independent of the initial condition. This is crucial in that it does not matter at which time we begin modeling. Such networks are theoretically capable of representing any function (with some technical conditions) arbitrarily well if correctly set up. An example of this can be seen in figure 6.5. The original time-series is the very detailed spiky line. The smooth curve on top is an echo state network with many nodes in the reservoir, and the thick line that seems to have a slight time delay is an echo state network with a small number of nodes in the reservoir. In this case, the time-series has a financial origin: it is the Euro to US Dollar exchange rate. We see that such a signal can be modeled to sufficient accuracy using an echo state network. Please note again at this point the principal difference between a time-series in which the points are correlated in time and the classification of observations into categories in which the observations are not correlated at all. The correlation in time makes it necessary to use much more sophisticated mathematics.³

³ Generally it is initialized randomly, but substantial gains can be had when it is initialized with some structure. At present, it is a black art to determine what that structure should be, as it definitely depends upon the problem to be solved.
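A minimal echo state network along these lines can be sketched with a random reservoir and, as a simplification, a linear least-squares readout in place of the sigmoidal g(···). All sizes, scalings and the toy input series are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reservoir setup -- sizes and scalings are illustrative assumptions.
n_res, n_in = 50, 1
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))

def run_reservoir(u):
    """Drive the reservoir with the input series u (shape T x 1); the
    optional feedback term Wfb . y(t) is omitted here."""
    x = np.zeros(n_res)
    states = np.zeros((len(u), n_res + n_in))
    for t in range(len(u)):
        x = np.tanh(W @ x + W_in @ u[t])
        states[t] = np.concatenate([x, u[t]])  # extended state z(t) = [x(t); u(t)]
    return states

# One-step-ahead prediction of a toy series via a ridge-regression readout
# (a linear readout replaces the sigmoidal g(...) for simplicity).
u = np.sin(np.linspace(0, 20, 400)).reshape(-1, 1)
Z = run_reservoir(u[:-1])
y = u[1:]                                      # teacher signal: the next value
W_out = np.linalg.solve(Z.T @ Z + 1e-6 * np.eye(Z.shape[1]), Z.T @ y)
pred = Z @ W_out
```

Only the readout matrix Wout is fitted; W and Win stay fixed after initialization, exactly as the text describes.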


Fig. 6.5 The prediction of the Euro to US Dollar exchange rate over 1.5 years.

6.6 Case Study: Scrap Detection in Injection Molding Manufacturing Co-Authors: Pablo Cajaraville, Reiner Microtek; Björn Dormann, Klöckner Desma Schuhmaschinen GmbH; Dr. Philipp Imgrund, Fraunhofer Institute IFAM; Maik Köhler, Klöckner Desma Schuhmaschinen GmbH; Lutz Kramer, Fraunhofer Institute IFAM; Oscar Lopez, MIM TECH ALFA, S.L.; Kaline Pagnan Furlan, Fraunhofer Institute IFAM; Pedro Rodriguez, MIM TECH ALFA, S.L.; Dr. Natalie Salk, PolyMIM GmbH; Jörg Volkert, Fraunhofer Institute IFAM

Injection molding is a widely used technology for the mass production of components with complex geometries. Almost all material classes can be processed with this technology. For polymers, the pelletized material is injection molded under elevated temperature into a mold cavity showing the negative structure of the resulting part. The part is cooled down and ejected as the finished component. In the case of metals and ceramics, the so-called metal injection molding (MIM) or ceramic injection molding (CIM) process is applied. Both processes fall under the umbrella term powder injection molding (PIM). In all cases, the material powder is mixed with a binder system composed of polymers and/or waxes. This so-called feedstock is subsequently injection molded analogously to the polymer material described above. The ejected parts, called green parts, still contain the binder material that acted as a flowing agent during the injection molding process. To remove the binder,


the components have to be debinded in a solvent or water solution for a certain amount of time. Subsequently, a thermal debinding step is needed to decompose the residual binder acting as a backbone in what is now called the brown part. During the final sintering step, the parts are heated up to approximately 3/4 of the melting point of the material powder. During the sintering process, the material densifies into a fully metallic or ceramic part, showing the same material properties as the respective material. Examples of good and bad polymer as well as metal parts are illustrated in figure 6.6.

Fig. 6.6 Both images display a good part and a damaged part. On the left is a plastic part, for which various deformations are one damage mechanism. On the right is a green metal part that is broken in one place, together with a final metal part showing the end product as it should be. In both examples the damage is visible, but often it is not.

Often, the damage to a part that occurs during injection can only be seen on the final part. The unnecessarily performed steps of debinding and sintering have then expended significant amounts of electrical energy and have also made the material useless. If we could identify a damaged part during its green stage, we could recycle the material and also save the energy for debinding and sintering. For all parts, there is an effort involved in determining whether the part is good or not. Today, this effort is usually made manually, which is expensive. If we could make the identification automatic, then we would save this effort as well. An injection molding machine is controlled by manually inputting a series of values known as set-points. These are the values of various physical quantities that we desire during the injection process. It is the responsibility of the machine to attempt to realize these set-points in actual operation. This attempt is generally successful, but there are deviations in the details. In order to monitor the actual values of these various quantities, an injection machine will also have sensors that output these measurements over time, i.e. a time-series. For each part produced, we thus have the set-points and also a variety of time-series over the duration of its injection. This information is available in order to characterize a part.


We will assume that the function that computes the scrap-versus-good status, the decision function, takes the form of a three-layer perceptron

γ = W3 · tanh(W2 · tanh(W1 · x + b1) + b2) + b3

where the bias vectors bi and the weight matrices Wi must be determined by some training algorithm. In order to make an input vector x, we must extract salient features from the time-series in the form of scalar quantities that allow the characterization of scrap vs. good parts, as we cannot input the whole time-series into the neural network. If we did input the entire time-series, we would force the weight matrices to become extremely large and would thus have many parameters to be found by training. This would require many more parts to be produced, which is unrealistic. We must live with a few hundred training parts, and so we must keep the input vector as small as possible. After many trials with various settings, we have found the best performance with the following procedure (see figure 6.7 for an example):

1. Take all observations of good parts over the pilot series. For each time-series, create an averaged time-series over all these good parts.
2. Disregarding local noise in the time-series, compute the turning points of this averaged time-series.
3. For every part encountered, perform the same turning-point analysis.
4. For every part, compute the differences between the turning points of the present part and the turning points of the averaged series.
5. These differences form the salient features and constitute the vector x.

For practicality, we limit ourselves to a specific maximum number of turning points. In order to use the training method, it must be provided with some pairs (x, γ) of input vectors and the quality output. To determine these pairs, we must manually assess a number of injected parts and characterize them as scrap or good.
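Steps 2-5 of the procedure can be sketched as follows. The moving-average smoothing, the window width and the zero-padding to a fixed length are assumptions, since the text only states that local noise is disregarded and that the number of turning points is capped:

```python
import numpy as np

def turning_points(series, window=5):
    """Indices of sign changes in the slope of a moving-average-smoothed
    copy of the series (the smoothing scheme is an assumption)."""
    s = np.convolve(series, np.ones(window) / window, mode="valid")
    d = np.diff(s)
    return [i for i in range(1, len(d)) if d[i - 1] * d[i] < 0]

def salient_features(part, reference, window=5, max_points=10):
    """Differences between a part's turning-point positions and those of
    the averaged good-part curve, padded to a fixed-length vector x."""
    tp_part = turning_points(part, window)
    tp_ref = turning_points(reference, window)
    diffs = [p - r for p, r in zip(tp_part, tp_ref)]
    diffs = (diffs + [0] * max_points)[:max_points]  # fixed length for the network
    return np.array(diffs, dtype=float)

# Illustration: a "part" curve whose turning points lag the reference curve.
t = np.linspace(0, 4 * np.pi, 200)
reference = np.sin(t)
part = np.sin(t - 0.3)
x = salient_features(part, reference)
```

A real feature set would also include vertical differences of the turning points, as figure 6.7 suggests; only the horizontal shift is shown here.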
Having obtained such training data, the training algorithm produces the decision function. In practice this means that the machine user must inject several parts, record the data x, manually determine whether the final parts are scrap or not, γ, and provide this information to the learning method. We have determined that a good number of observations is 500 or more. Furthermore, it is good to use several settings of the set-points within these 500 parts. When a new part is now injected, its data x is provided to the decision function, which computes whether this part is good or bad, γ, and outputs this value. As a result, a robot can be triggered to remove the scrap parts from further production. We may define four different recognition rates:

1. good rate, ρg – the number of correctly identified good parts
2. scrap rate, ρs – the number of correctly identified scrap parts
3. false-negative rate, ρn – the number of good parts identified as scrap
4. false-positive rate, ρp – the number of scrap parts identified as good


Fig. 6.7 The bottom curve is the average pressure observed at the nozzle averaged over all parts known to be good. The top curve is a single observation of a part known to be bad. The black arrows pointing up indicate the position of the turning points of the bottom curve and the black arrows pointing down indicate the position of the turning points of the top curve. We observe that the top curve has two extra turning points. We also observe that the vertical position of several turning points is higher than that of the average good curve.

where each count is divided by the total number of produced parts in order to make each item into a genuine rate. Please note that these rates are thus normalized by definition, i.e. ρg + ρs + ρn + ρp = 1. It is generally not possible to design a system that will have perfect recognition efficiency (ρn = ρp = 0). We would rather throw away a good part than let a scrap part through. Thus, the overall objective is to minimize the false-positive rate ρp by designing a decision function that is as accurate as possible. This enumeration is, of course, theoretical, as it would require the user to know which parts are really good or bad. This identification would require the manual characterization that we want to avoid using the present methods (except for the training dataset, for which it is necessary). Thus, we will never actually know what these rates are except in two cases: the pilot series, where the data is used for training the function, and, possibly, any quality-control spot checks, which are usually too infrequent to really allow the computation of a rate. Thus, these rates must be interpreted as a useful guideline for thinking but not as practical numerical quantities, except at training time when we actually know the quality state of all parts. Even if we are able to correctly identify every part as either good or bad, this will not change the amount of scrap actually produced – it will merely change the
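As a concrete check of the normalization, the four rates can be computed from part counts; the counts used here are those reported for the case study later in this section:

```python
# Recognition rates normalized by the total number of produced parts.
# Counts taken from the case-study results reported later in this section.
total = 500
good_as_good, scrap_as_scrap = 392, 98
good_as_scrap, scrap_as_good = 9, 1

rho_g = good_as_good / total    # good rate
rho_s = scrap_as_scrap / total  # scrap rate
rho_n = good_as_scrap / total   # false-negative rate
rho_p = scrap_as_good / total   # false-positive rate
```

By construction the four rates sum to one, and ρg + ρs gives the overall recognition efficiency.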


amount of scrap delivered to the customer. In order to reduce the amount of scrap actually produced, we must interact with the production process actively via the set-points, and this is our next stage. We observe that the recognition rates mentioned above are in fact functions of the set-points. In particular, we want to reduce the scrap rate, as we would like to produce as many good parts as possible. Let us combine the ten set-points αi into a single vector,

s = (α1, α2, ..., α10).    (6.1)

Using this, we focus on the scrap rate function ρs(s). This function achieves its global minimum ρs* at the point s*,

ρs* = min_s ρs(s) = ρs(s*).    (6.2)
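The minimization of ρs(s) can be sketched with a simple coordinate search. The smooth surrogate `rho_s` below merely stands in for the trained scrap-rate model, and both the surrogate and its optimum `s_opt` are invented for illustration:

```python
import numpy as np

# Surrogate for the scrap-rate model rho_s(s); in practice this would be the
# trained neural network.  The surrogate and its optimum s_opt are invented.
def rho_s(s, s_opt):
    return 0.05 + 0.15 * np.mean((s - s_opt) ** 2)

rng = np.random.default_rng(0)
s_opt = rng.uniform(0.3, 0.7, size=10)   # unknown optimal set-points (mock)

# Simple coordinate search over the ten set-points alpha_1 ... alpha_10.
s = np.full(10, 0.5)
candidates = np.linspace(0.0, 1.0, 101)
for _ in range(3):                        # a few sweeps suffice here
    for i in range(10):
        trials = np.tile(s, (candidates.size, 1))
        trials[:, i] = candidates
        s[i] = candidates[np.argmin([rho_s(t, s_opt) for t in trials])]
# The search drives the surrogate scrap rate down to its 5 % floor.
```

Any derivative-free optimizer would do here; coordinate search is chosen only because it is short enough to show in full.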

The point s* is determined using an optimization algorithm and then communicated to the injection molding machine.

These methods were tried on the part in the right image of figure 6.6. In total, 500 parts were made with various settings of the set-points, of which approximately 20% were scrap. This high scrap rate results from the various set-point settings, some of which are, of course, not optimal. We find that the recognition efficiency is 98%, that is, ρg + ρs = 0.98: good and bad parts are recognized correctly for 490 of the 500 parts. We obtain a low false-positive rate ρp = 0.002, the rate at which scrap parts are recognized as good. Relative to our sample size of 500, this means that a single scrap part was not recognized as such. The false-negative rate, at which good parts are recognized as scrap, was ρn = 0.018, which means 9 parts in total. Recall that the objective of training the network was to minimize the false-positive rate. It is clear that this cannot reliably be zero, and so a rate of one part in 500 can be interpreted as a success. The system is certainly more reliable than manual quality control, which is common in the industry.

The actual production scrap rate was approximately 20%. This could have been reduced to 5% by adjusting the set-points appropriately. Of course, a real production line will not run at 20% scrap, and so an improvement by a factor of four seems unlikely; nevertheless, a significant factor should be possible.

Now that we have verified that it is possible to reliably distinguish scrap from good parts based only on process data, and that we can optimize the process based on the same analysis, we ask what the practical significance of this is. There are two major points: quality improvement and energy savings. With respect to quality improvement, there are again two aspects. First, the quality of the delivered product.
Even if we produce 20% scrap, 98% of these are correctly recognized as scrap and are therefore not delivered to the client. With the present numbers, we produce 500 parts in total, of which 100 are scrap. We recognize 98 scrap parts as scrap, 392 good parts as good, 9 good parts as scrap and one scrap part as good. Thus the client receives 392 good parts and one scrap part in the delivery. This is an effective scrap rate – with the client – of 0.3%. The identification was thus able to lower the production scrap rate of 20% to a delivery scrap rate of 0.3%.

Second, the production quality will also improve due to the optimization. Since we can lower the production scrap rate from 20% to 5%, we would produce 475 good parts and 25 scrap parts, as compared to the above figures. In the end, the client would receive 466 parts, of which one would be scrap. This lowers the effective delivery scrap rate to 0.2%, and relative to a larger delivery size at that. With the optimization, the molding production cost per delivered part is lowered by 19%. This is a reduction in production cost that is otherwise unreachable.

With respect to energy savings, we also save the energy costs that would have flowed into the debinding and sintering of parts that are later recognized as scrap. It is hard to quantify this in any general manner, but we estimate that it lowers the total production cost per delivered part by another 4%.

We gratefully acknowledge the partial funding of this research by the European Union: Investment in your future – European fund for regional development, the EU programs MANUNET and ERANET, the German ministry for education and research (BMBF) and the Wirtschaftsförderung Bremen (economic development agency of the city state of Bremen, Germany).

6.7 Case Study: Prediction of Turbine Failure

In this study we attempt to predict a known turbine failure using historical data. We refer to section 5.5 for details on turbines and coal power plants. On a particular turbine, a blade tore off and completely damaged the turbine. After the event, the question was raised whether this could have been predicted and localized to a specific location in the turbine.

The specific turbine in question has over 80 measurements on it that were considered worthwhile to monitor. Most of these were vibrations, but there were also some temperatures, pressures and electrical values. A history of six months was deemed long enough, and the sampling frequency depended upon each individual measurement point – some were measured several times per second, others only once every few hours. In fact, the data historian only stores a new value in its database if the new value differs from the last stored value by more than a static parameter. In this way, the history matrix contained a realistic picture of an actual turbine instrumented as it normally is in the industry. No enhancements were made to the turbine, its instrumentation or the data itself. A data dump of six months was made without modification.

The data stopped two days before a known (historically occurring) blade tear on that turbine. During the time leading up to the blade tear and until immediately before it, no sign of it could be detected by any analysis run by the plant engineers, either before or after the blade tear was known. Thus, it had been concluded that the tear was a spontaneous and therefore unpredictable event.

Initially, the machine learning algorithm (the echo state network of section 6.5) was provided with no data. Then the measured points were presented to the algorithm one by one, starting with the first. Slowly, the model learned more and more about the system, and the quality of its predictions improved both absolutely and in terms of the maximum possible future period of prediction. Once the last measured point had been presented to the algorithm, it produced a prediction valid for the following two days of real time. The result may be seen in figure 6.8.
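The historian's deadband storage rule mentioned above can be sketched as follows (the threshold and readings are illustrative):

```python
def deadband_compress(samples, threshold):
    """Store a sample only if it differs from the last *stored* value
    by more than the static threshold (as the data historian does)."""
    stored = []
    for t, value in samples:
        if not stored or abs(value - stored[-1][1]) > threshold:
            stored.append((t, value))
    return stored

readings = [(0, 10.0), (1, 10.05), (2, 10.4), (3, 10.45), (4, 9.0)]
archive = deadband_compress(readings, threshold=0.2)
# Only readings 0, 2 and 4 differ enough from the previous stored value.
```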

Fig. 6.8 Here we see the actual measurement (spiky curve) versus the model output (smooth line) over a little history (left of the vertical line) and for the future three days (right of the vertical line). We observe a close correspondence between the measurement and the model. Particularly the event, the sharp drop, is correctly predicted two days in advance.

Thus, we can predict that something will take place two days from now, with an accuracy of a few hours. Indeed, it is apparent from the data that it would have been impossible to predict this particular event more than two days ahead of time, due to the qualitative change in the system (the failure mode) occurring only a few days before the event. The model must be able to see some qualitative change for some period of time before it is capable of extrapolating a failure, and so the model has a reaction time. Events that develop quickly are thus predicted relatively close to the deadline, but two days' warning is enough to prevent the major damage in this case. In general, slower failure modes can be predicted longer in advance.

It must be emphasized here that the model can only predict an event, such as the drop of a measurement. It cannot label this event with the words "blade tear." The identification of an event as a certain type of event is another matter altogether. It is possible via the same sort of methods, but it would require many examples of blade tears, which is a practical difficulty. Thus, the model is capable of giving a specific time when the turbine will suffer a major defect; the nature of the defect must be discovered by manual search on the physical turbine. This is interesting, but to be truly helpful we must be able to locate the damage within the large structure of the turbine, so that maintenance personnel do not spend days looking for the proverbial needle in a haystack.

Fault detection and localization is done by an advanced data-mining methodology (singular spectrum analysis, see section 4.5.2) that tracks frequency distributions of signals over the history and can detect qualitative changes. Among the 80 measurement points, we are able to isolate four that contain a qualitative shift in their history; two of these four go through such a shift several days before the other two. Thus, we are able to determine which two out of 80 locations in the turbine are the root cause of the event that is to occur in two days. See figure 6.9 for an illustration. In this figure, we graph the abnormality, as measured by singular spectrum analysis (see section 4.5.2), over time for each measurement. We observe that only four time-series are abnormal at all. Two of them become abnormal early and two others follow. When we ask which time-series these are, we find that the first two are the radial and axial vibrations of one bearing and the second two are the same vibrations of the neighboring bearing. We may safely attribute causation to this evolution: the first bearing's abnormality causes the second bearing's abnormality, and the first bearing's abnormality was the first sign of what later led to the event, the blade tear. Indeed, the blade tore off very close to this particular bearing. Thus we are successful in localizing the fault within the large turbine.
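A minimal numerical sketch of such a subspace-based change detector follows. The window lengths, the synthetic two-frequency signal and the abnormality score are illustrative choices, not the implementation used in the study:

```python
import numpy as np

def ssa_abnormality(series, window, lag):
    """Score how far each sliding segment's dominant SSA subspace has moved
    away from that of an initial reference segment (0 = no change)."""
    def subspace(x, k=2):
        # Trajectory (Hankel) matrix of the segment, then its leading
        # left singular vectors, which span the dominant signal subspace.
        H = np.stack([x[i:i + lag] for i in range(len(x) - lag + 1)])
        U, _, _ = np.linalg.svd(H.T, full_matrices=False)
        return U[:, :k]
    ref = subspace(series[:window])
    scores = []
    for start in range(window, len(series) - window + 1):
        test = subspace(series[start:start + window])
        # Cosines of the principal angles between the two subspaces.
        cosines = np.linalg.svd(ref.T @ test, compute_uv=False)
        scores.append(1.0 - cosines.min())
    return np.array(scores)

# Synthetic sensor: the oscillation frequency changes qualitatively at t = 250.
rng = np.random.default_rng(1)
t = np.arange(400)
x = np.sin(0.2 * t) + 0.05 * rng.standard_normal(400)
x[250:] = np.sin(0.9 * t[250:]) + 0.05 * rng.standard_normal(150)
scores = ssa_abnormality(x, window=100, lag=20)
# Scores stay near zero before the change and rise sharply after it.
```

In the turbine study, running such a score separately for every sensor and comparing the onset times of the abnormality is what singles out the two root-cause bearings.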

Fig. 6.9 A deviation from normal is computed and tracked over a window of about four days' length. We observe that two sensors start behaving abnormally and that, two days later, two other sensors follow. About 3.5 days after the start of the abnormal behavior, this new behavior has itself become normal and the deviation from normal is seen to decrease again. We therefore observe a qualitative change in the behavior of these four points.

The localization that is possible here is to identify the sensor that measures an abnormal signal and that will be the first to show the anomaly that will develop into the event. It is, of course, not possible to compute a physical location on the actual turbine more accurately than the data allows. However, a physical search of the turbine, after the actual blade tear, found that the cause was indeed at the location determined by the data-mining approach.

It is thus possible to reliably and accurately predict a failure on a steam turbine two days in advance. Furthermore, it is possible to locate its cause within the turbine, so that maintenance personnel can focus on the location covered by the sensor that measures the anomaly. The combination of these two results allows preventative maintenance to be performed on a turbine in a real industrial setting, saving the operator great expense.

6.8 Case Study: Failures of Wind Power Plants

Wind power plants sometimes shut down due to diverse failure mechanisms and must then be maintained. Especially in the offshore sector, but also in the countryside, these maintenance activities are costly due to logistics and delay. Common failures are due, for example, to insufficient lubrication or bearing damage. These can be seen in vibration patterns if the signal is analyzed appropriately. It is possible to model the dynamically evolving mechanisms of aging in mathematical form such that a reliable prediction of a future failure can be computed. For example, we can say that a bearing will fail 59 hours from now because the vibration will then exceed the allowed limits. This information allows a maintenance activity to be planned in advance and thus avoids collateral damage and a longer outage.

Wind power plants experience failures that lead to financial losses due to a variety of causes. See figure 6.10 for an overview of the causes, figure 6.11 for their effects and figure 6.12 for the implemented maintenance measures. Figure 6.13 shows the mean time between failures, figure 6.14 the failure rate per age and figure 6.15 the shutdown duration and failure frequency. (All statistics used in figures 6.10 to 6.15 were obtained from ISET and IWET.) From these statistics we may conclude the following:

1. At least 62.9% of all failure causes are internal engineering-related failure modes, while the remainder are due to external effects, mostly weather related.
2. At least 69.5% of all failure consequences lead to less or no power being produced, while the remainder lead to aging in some form.
3. About 82.5% of all maintenance activity is hardware related, which means that a maintenance crew must travel to the plant in order to fix the problem. This is particularly problematic when the power plant is offshore.
4. On average, a failure will occur once per year for plants with less than 500 kW, twice per year for plants between 500 and 999 kW, and 3.5 times per year for plants with more than 1 MW of power output. The more power-producing capacity a plant has, the more often it will fail.
5. The age of a plant does not lead to a significantly higher failure rate.
6. The rarer the failure mode, the longer the resulting shutdown.
7. A failure will, on average, lead to a shutdown lasting about 6 days.

From this evidence, we must conclude that internal causes are responsible for a 1% capacity loss for plants with less than 500 kW, 2% for plants between 500 and 999 kW and 3.5% for plants with more than 1 MW of power output. In a wind power field like Alpha Ventus in the North Sea, with 60 MW installed and expecting 220 GWh of electricity production per year (i.e. an expected capacity factor of 41.8%), the 3.5% loss amounts to 7.7 GWh. At the German government feed-in rate of 7.6 Eurocents per kWh, this loss is worth 0.6 million Euro per year. Every cause leads to some damage that usually leads to collateral damages as well. Adding the cost of the maintenance measures related to these collateral damages yields a financial damage of well over 1 million Euro per year. The original cause exists and cannot be prevented, but if it could be identified in advance, these consequential costs could be saved. This calculation does not take into account worst-case scenarios such as the plant burning up, which effectively requires a new build.
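The loss figures quoted here follow from a short calculation (installed capacity, expected production, loss percentage and tariff as given in the text):

```python
installed_mw = 60.0           # Alpha Ventus installed capacity
expected_gwh = 220.0          # expected yearly production of the field
hours_per_year = 8760

# Capacity factor implied by the expected production (the "41.8 %" in the text).
capacity_factor = expected_gwh * 1000.0 / (installed_mw * hours_per_year)

internal_loss_gwh = expected_gwh * 0.035          # 3.5 % internal-cause loss
loss_eur = internal_loss_gwh * 1e6 * 0.076        # 7.6 Eurocents per kWh
# The loss comes to about 0.6 million Euro per year.
```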

Fig. 6.10 The causes for a wind power plant to fail are illustrated here with their corresponding likelihood of happening relative to each other.

Fig. 6.11 The effects of the causes of ﬁgure 6.10 are presented here with the likelihood of happening relative to each other.

Fig. 6.12 The maintenance measures put into place to remedy the effects of ﬁgure 6.11 with the likelihood of being implemented relative to each other.

Fig. 6.13 The mean time between failures per major failure mode.

Fig. 6.14 The yearly failure rate as a function of wind power plant age. It can be seen that plants with higher output fail more often and that age does not signiﬁcantly inﬂuence the failure rate.

Fig. 6.15 The failure frequency per failure mode and the corresponding duration of the shutdown in days.

A recurrent neural network was applied to a particular wind power plant. From the instrumentation, all values were recorded to a data archive for six months. One value per second was taken and recorded if it differed significantly from the previously recorded value. There were a total of 56 measurements available from around the turbine and generator, but also from subsidiary systems such as the lubrication pump. Using five months of these time-series, a model was created; it agreed with the last month of experimental data to within 0.1%. Thus, we can assume that the model correctly represents the dynamics of the wind power plant.

This system was then allowed to make predictions for the future state of the plant. The prediction, according to the model's own calculations, was accurate up to one week in advance. Naturally, such predictions assume that the prevailing conditions do not change significantly during this projection. If they do, a new prediction is immediately made. Thus, if for example a storm suddenly arises, the prediction must be adjusted.

One prediction is shown in figure 6.16, where we can see that a particular vibration on the turbine is predicted to exceed the maximum allowed alarm limit 59 ± 5 hours after the present moment. Note that this prediction means that the failure event will take place somewhere in the time range from 54 to 64 hours from now; a narrower range becomes available as the event comes closer in time. This information is, however, accurate enough to be actionable. We may schedule a maintenance activity two days from now that will definitely prevent the problem. Planning two days in advance is sufficiently practical that this would solve the problem in practice.
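The statement "59 ± 5 hours" translates directly into the actionable window used above – a trivial but useful piece of bookkeeping:

```python
def failure_window(lead_hours, uncertainty_hours):
    """Turn a prediction such as '59 +/- 5 hours' into the earliest and
    latest expected failure times (hours from now)."""
    return lead_hours - uncertainty_hours, lead_hours + uncertainty_hours

earliest, latest = failure_window(59, 5)
# A maintenance activity scheduled two days (48 h) from now still falls
# before the earliest expected failure time of 54 h.
```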

In this case, no maintenance activity was performed, in order to test the system. The turbine failed due to this particular vibration exceeding the limit 62 hours after the prediction was made. This failure led to contact with the casing, which caused a fire that effectively destroyed the plant.

Fig. 6.16 The prediction for one of the wind power plant’s vibration sensors on the turbine clearly indicating a failure due to excessive vibration. The vertical line on the last ﬁfth of the image is the current moment. The curve to the left of this is the actual measurement, the curve to the right shows the model’s output with the conﬁdence of the model.

It would have been impossible to predict this particular event more than 59 hours ahead of time due to the qualitative change in the system (the failure mode) occurring just a few days before the event. The model must be able to see some qualitative change for some period of time before it is capable of extrapolating a failure and so the model has a reaction time. Events that are caused quickly are thus predicted relatively close to the deadline. In general, failure modes that are slower can be predicted longer in advance.

6.9 Case Study: Catalytic Reactors in Chemistry and Petrochemistry

Catalytic reactors are devices in chemical plants whose job it is to provide a conducive environment for a certain chemical reaction to take place, see figure 6.17. In a reactor, at least two substances are brought into contact with each other. One is a substance that we would like to change in some chemical way and the other is the catalyst, i.e. the substance that is supposed to bring this change about. The two substances are mixed and heated to provide the energy for the change. Then we wait, and provide the necessary plumbing for the substances to enter the reactor and for the end product to leave it. Any material that is not converted has to be recycled for a second, and possibly further, passes through the reactor until finally all of the original substance has been changed. An example is the breaking down of the long molecular chains of crude oil in the effort to make gasoline.

Fig. 6.17 The basic workings of the catalytic reactor system in a petrochemical reﬁnery.

The catalyst performs its work upon the substance and brings about a change. It thereby uses up its potential to cause this change and thus ages over time. This degradation of the catalyst is the primary problem in operating such a reactor continuously over the long term. The catalyst must therefore be re-activated in some fashion and at some time. We will investigate both major kinds of catalytic reactor: the fluid catalytic converter (FCC) and the granular catalytic reactor (GCR).

In the FCC, the catalyst is a fluid that can be pumped into and out of the reactor. We can therefore create a loop in which the catalyst is pumped into the reactor to perform its function and then out again into a reactivation phase, only to return. This loop runs continuously and the catalyst can thus be used essentially without limit. However, the speed of the loop must be carefully tuned to the actual aging of the catalyst inside the reactor, so that we put neither too much work (attempting to reactivate catalyst that is still fine) nor too little work (not reactivating enough catalyst, so that eventually we have too little) into the reactivation job.

In the GCR, the catalyst is in the form of granules that are filled into tubes in the reactor. These granules stay inside the tubes until the point is reached where the catalyst is so deactivated that the process is no longer economical. At this point, the reactor must be opened, the old catalyst removed and fresh catalyst loaded. The old granules can then be sent for reactivation. Such an exchange may require approximately four weeks of downtime and is thus a substantial cost for the plant. Moreover, the new catalyst must be ordered well in advance, so the date of the exchange must be planned beforehand.

Both types of reactor therefore require a prediction of the aging process into the future: we must know weeks in advance if we will have a critical deactivation. In order to make the prediction, we have access to several temperatures around the reactor, the inflow and outflow, a gas chromatographic identification of what is flowing out, and a few process pressures. In fact, the age of the catalyst is measured by the pressure difference across the reactor: the higher it gets, the older the catalyst is.

Using the method of recurrent neural networks, we create a model of the GCR using almost four years of data in which the catalyst was exchanged twice, see figure 6.18. The jagged curve running over the whole plot is the measured pressure difference over the reactor. Whenever you see a sharp drop, this means that the catalyst has been exchanged; this happened three times in total in that figure.
The mathematical model draws the smooth curve over the data. We can hardly see the difference on the left of the image, so closely does the model represent the data. At the first vertical dashed line, we have reached the "now" point, from which the model predicts without receiving more input data. We see three smooth lines diverging from this time: the middle one is the actual prediction, the other two being its uncertainty boundaries. The jagged line then shows what actually occurred. We can see that the model is very accurate indeed. The brief ups and downs in the real measurement are not in fact due to aging of the catalyst but due to various operational modes and the varying quality of the crude oil injected into the reactor. We are only concerned with the long-term trend and not with short-term fluctuations. At the time of the second vertical dashed line, the catalyst was exchanged. The prediction was accurate up to this time, 416 ± 25 days later. Thus the prediction was accurate more than one year in advance.

You may ask why the catalyst was exchanged, for the three exchanges that we see in the figure, at different ages (i.e. different differential pressures). Would it not be good to specify a single boundary value to define "too old"? In this case, our true cut-off criterion is not a certain age but rather the point at which the process becomes uneconomical. As this depends on various scientific but also on various economic influences, the

Fig. 6.18 The pressure differential (equivalent to the catalyst age) over time. The jagged line is the actual measurement and the middle smooth line the prediction. The first vertical line from the left marks the present moment, from which the prediction starts. The second vertical line is the time at which we predict a catalyst exchange to be necessary. Using the real measurements, this prediction can be seen to be correct more than one year in advance.

actual age at which the catalyst becomes uneconomical changes with fluctuating market prices. These influences (and their uncertainties) must be taken into account.

We now proceed to the FCC in a chemical plant. Here we take a different viewpoint. The FCC is a complex unit with many set-points. For instance, the rate and manner in which the fluid catalyst is recycled is up to us to control. The set-points that control this are changed by the operating personnel to match the demands of the time. While measurements such as the ambient air temperature are variables that must be measured to be known, the set-points are of a different nature: since the operators modify them in dependence upon market factors, we have some knowledge about them beforehand. Thus, we ask to what extent this prior knowledge helps the model to predict the future state.

We investigate, in figure 6.19, a very simple neural network model (a perceptron, see section 6.4) for the pressure differential in dependence upon all other variables of the FCC, of which there may be several dozen, including several set-points. If we provide only historical information to it, we obtain the solid curve. Compared to the actually measured dotted curve, we can see that this model is too simplistic to capture reality. If we present the future data for the set-points (i.e. the production plan) in addition to the historical data to the same simple model, we obtain the dashed curve. This dashed curve is accurate and achieves our aim.

Fig. 6.19 The pressure differential over time (in hours) on a ﬂuid catalytic converter predicted into the future using two different models. The actual measurement is shown in the dotted line. A prediction without any information about the future is shown in the solid line. A prediction made with the knowledge of some future set-points is shown in the dashed line. It is clear that knowledge of future actions is beneﬁcial.

We can thus see that, in the case of a simple perceptron model, the provision of some limited information about the future signiﬁcantly helps the model to predict those measurements that we cannot know in advance. In conclusion, we see that both major kinds of catalytic reactors can be modeled well. We can predict both kinds of reactor into the future and thus provide information about essential future events such as the deactivation of the catalyst and thus the time (in the case of GCR) and the manner (in the case of FCC) of this deactivation. From both predictions, we easily derive the ability to plan speciﬁc actions to remedy the problem.
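The benefit of knowing future set-points can be reproduced in miniature with a single linear layer on synthetic data. The data-generating process and all coefficients below are invented; only the comparison between the two feature sets matters:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
setpoints = rng.uniform(0, 1, T)               # operator-chosen, known in advance
state = np.cumsum(rng.normal(0, 0.05, T))      # slowly drifting process variable
# The pressure differential depends strongly on the *current* set-point.
dp = 2.0 * setpoints + 0.5 * state + rng.normal(0, 0.02, T)

def fit_and_score(X, y):
    # Least-squares single linear layer ("perceptron") with a bias term.
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sqrt(np.mean((A @ w - y) ** 2))

past = np.column_stack([dp[:-1], setpoints[:-1], state[:-1]])
future_known = np.column_stack([past, setpoints[1:]])   # add the production plan

err_past_only = fit_and_score(past, dp[1:])
err_with_plan = fit_and_score(future_known, dp[1:])
# Knowing the future set-points reduces the one-step prediction error markedly.
```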

6.10 Case Study: Predicting Vibration Crises in Nuclear Power Plants

Co-Author: Roger Chevalier, EDF SA, R&D Division

So far, we have focused on predicting failure events. Such events are characterized by large, usually fast, changes that result in damage and usually a shutdown of the plant. In this section, we focus on predicting a more subtle phenomenon. We observe that the vibration measurement on a certain bearing of a steam turbine increases periodically. This increase is alarming but does not in itself represent damage or danger. In our specific example, the turbine has five bearings, each with a vibration sensor. See figure 6.20 for a plot of a temporary increase in vibration, which we will call a vibration crisis. The exact cause of the problem has not been precisely identified at present, but it always occurs under the same conditions of vacuum pressure and power.

Fig. 6.20 This is the vibration of one bearing over time. The horizontal line is the limit for the vibration crisis, i.e. if the vibration measurement exceeds this limit, we speak of a crisis. It will be our goal to detect such events. Time is measured in units of ten minutes and so the plot is over a period of roughly 35 days.

This study concerns itself with the prediction of future vibration crises, not with determining the mechanism at their source. If one can know, even only hours in advance, that a crisis will happen, this helps operators significantly in preparing for the event. The plant can be regulated into a state more conducive to controlling the impending crisis.

We attempt a model via a recurrent neural network (see section 6.5). The information about the turbine includes five displacement vibrations; two metal pad temperatures for each bearing; the steam pressure at several points in the process; the axial position of the turbine shaft; the rotation rate; the active and reactive power produced; and one temperature on the oil circuit. This information is sampled once every ten minutes over a period of about five months in order to generate the model and learn the signs of an impending vibration crisis. The dataset contained several examples of such crises, so that effective learning was possible.

With respect to the raw data of figure 6.20, we see the results of the prediction process in figure 6.21. The raw data is displayed in gray and the model output in black. There is a first period during which we observe the turbine before a prediction is possible. When enough data is available, we enter a second period during which we predict and immediately validate the predictions against the real data. During this second period we see close agreement between model output and measured values. Then we enter the third period, the actual future, during which no more measurement data are available. This is the genuine prediction. We note from figure 6.21 that we can correctly predict a vibration crisis, in this example, up to about 2.8 days in advance. Beyond this point, the prediction is no

Fig. 6.21 The same data as in figure 6.20, but now in gray and overlaid in black with the model output from time 3100. We note that the vibration crisis from 3100 to 3500 is predicted correctly, but the next vibration crisis at 3700 and the one after that from 4500 are not.

longer successful. This is roughly played out in all other examples. Thus, we see that there is a lead time of less than 3 days before such an event is detectable. This must be enough in practice to construct some kind of reaction. Please note carefully the aim of this study. It is not the aim to correctly represent the vibration measurement over all values and all times. The goal is to accurately compute the times at which the vibration measurement crosses the limit line. When doing modeling it is essential to keep one’s objectives clear as modeling is an adjustment of numerical parameters with respect to minimizing some sort of numeric accuracy requirement. In general, if one’s aim were a representation of the vibration signal as such, one would choose the least squares method to measure the difference between measurement and model output. In this case, however, we are not trying to do this. Thus, our metric is not the least squares metric but the deviation between the actual and modeled times of the vibration signal crossing the limit line – a very different goal. Thus, the ﬁgure 6.20 should be interpreted accordingly. We note that the strongest correlant with the problem is the outside temperature (cooling water). However it is not as clean as having a speciﬁc trigger temperature. The problem has a compound trigger in which the cooling water temperature plays a leading role but not the only role. A prediction of a future vibration crisis is reliable for 3 days in advance. If we attempt to predict it further into the future, the uncertainty in the prediction makes the prediction itself useless. Of course, the closer we get to the interesting time, the more accurate the prediction gets. We have made 6 such predictions in a double blind study and have correctly predicted 5 vibration crises from among the 6 cases. The model is thus quite successful in being able to predict the future occurrence of a crisis. 
This is the case even though the specific causal mechanism remains under investigation. The model could be improved if the problem were better understood, so that the data could be prepared more appropriately.
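To make the metric concrete, here is a small illustrative sketch (function names and toy data are hypothetical, not from the study) of scoring a model by the deviation of limit-line crossing times rather than by least squares over the whole signal:

```python
def crossing_times(signal, limit):
    """Indices at which the signal crosses the limit line upwards."""
    return [i for i in range(1, len(signal))
            if signal[i - 1] < limit <= signal[i]]

def crossing_error(measured, modeled, limit):
    """Mean absolute deviation between paired crossing times."""
    a = crossing_times(measured, limit)
    b = crossing_times(modeled, limit)
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sum(diffs) / len(diffs) if diffs else 0.0

measured = [0, 1, 3, 5, 4, 2, 1, 2, 4, 6]
modeled  = [0, 1, 2, 4, 5, 3, 1, 1, 2, 6]
print(crossing_error(measured, modeled, limit=3.5))  # → 0.5
```

A least-squares score would penalize the modeled signal at every sample; this score only cares about when the limit line is crossed.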


6.11 Case Study: Identifying and Predicting the Failure of Valves

A chemical plant has a particular unit that is meant to combine several chemicals from a variety of input sources and to provide a gaseous output with an as-constant-as-possible composition. This task is controlled by an assembly of 40 valves that a computer opens and closes according to a well-balanced schedule. If the valves fail to open and close according to schedule – because they are either fast or slow, or leak – then the tailgas is not constant and causes problems later on in the process. In this study, we are to predict future problems of the assembly and also to identify which of the 40 valves is responsible for the problem.

The whole process has three phases. For each of these, we compute the probability distribution of deviations between the set point provided by the control computer and the actual response as measured by the instrumentation. Figure 6.22 displays the results for each phase. We observe that one phase has an exponential distribution and the two others do not, as they have secondary or even tertiary local maxima. An exponential distribution is what we would expect to see from normal operations of a controller – deviations are very rare and exponentially decreasing in magnitude, indicating that deviations are random in origin. Seeing secondary peaks in this distribution is not expected, as it indicates a structured mechanism and hence some form of damage.

Fig. 6.22 The probability distribution over the three phases of valve operations. The vertical axis is the probability and the horizontal axis the normalized absolute value of the deviation between set point and actual response of valve openings. We observe that one phase appears to be operating correctly (exponential distribution) and two phases incorrectly (non-exponential distributions).
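The diagnostic idea can be sketched as follows: under normal control the histogram of |set point − response| should decay roughly exponentially, so a histogram that rises again after its initial decay hints at a structured fault. This is an illustrative sketch with invented toy data, not the plant's actual analysis:

```python
def histogram(deviations, n_bins, max_dev):
    """Counts of normalized deviations in n_bins equal-width bins."""
    counts = [0] * n_bins
    for d in deviations:
        counts[min(int(d / max_dev * n_bins), n_bins - 1)] += 1
    return counts

def has_secondary_peak(counts):
    """True if the histogram rises again after it has started to fall."""
    falling = False
    for prev, cur in zip(counts, counts[1:]):
        if cur < prev:
            falling = True
        elif falling and cur > prev:
            return True
    return False

healthy = [0.05] * 50 + [0.3] * 20 + [0.6] * 5 + [0.9]   # decaying counts
damaged = [0.05] * 50 + [0.3] * 5 + [0.6] * 20 + [0.9]   # secondary peak
print(has_secondary_peak(histogram(healthy, 4, 1.0)))  # → False
print(has_secondary_peak(histogram(damaged, 4, 1.0)))  # → True
```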


6 Modeling: Neural Networks

Next, we introduce a measure of abnormality for a valve. The score itself is based on the difference between set point and response (just as in figure 6.22). However, we also demand that there be an associated surge in non-constancy of the tailgas within a certain response time, so as to track only those abnormal valve openings and closings that were close to an unwanted product surge. In figure 6.23, we graph the abnormality in this sense for each valve across all three phases of operation. The valve numbers are on the horizontal axis and the absolute value of the difference between set point and measurement of the valve opening and closing is on the vertical axis. The solid, dashed and dotted lines correspond to the three phases in the same manner as in figure 6.22. The gray line is the average of the three phase lines.

Fig. 6.23 A selection of the 40 valves (horizontal axis) is investigated here in terms of their deviation from the set point (vertical axis) in those cases in which an output pressure surge occurs within a certain time frame. The three black lines correspond to the three phases in ﬁgure 6.22 and the gray line is the average of the three black lines.
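A sketch of such an abnormality score (the names, toy data and response window are hypothetical): a valve's set-point deviation contributes only when a tailgas surge follows within the window:

```python
def abnormality(deviation, surge, window):
    """deviation[t]: |set point - response| for one valve at time t;
    surge[t]: True if the tailgas was non-constant at time t."""
    score = 0.0
    for t, d in enumerate(deviation):
        if any(surge[t:t + window + 1]):  # surge within the response time
            score += d
    return score

deviation = [0.0, 0.5, 0.0, 0.8, 0.1]
surge     = [False, False, True, False, True]
print(abnormality(deviation, surge, window=1))
```

Deviations with no nearby surge are ignored, which is what separates "not ideal but harmless" behavior from behavior close to an actual product problem.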

Combining these two images, we must look at the dashed and dotted phases in figure 6.23 and select those valves that have a high score there. We now combine this information with some process-specific analysis from the scheduling application. From this we can rule out some valves because they do not operate in the relevant phase, so their deviation is a spurious observation. After such an analysis, only one valve remains, and we attribute the problem to it. As such, the valve is operating in a way that is not ideal, but it is not causing a significant problem just yet.

Let us look at the time evolution of this abnormality in figure 6.24. The jagged line is the abnormality at each moment in time. As we are not concerned with the shorter-term ups and downs, we take a 20-day moving average to smooth out the curve. This is the thick line on the plot. We observe an upward trend over time with a dip near the end. The time on the horizontal axis is in days, so this evolution occurs over a significant time frame. We now create a prediction of this time-series, which is plotted as the continuation of the thick line on the right of the image.
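The smoothing step can be sketched as a trailing moving average; a naive linear extrapolation stands in here for the actual prediction model (which in the study is a neural network):

```python
def moving_average(series, window):
    """Trailing moving average; shorter windows at the start of the series."""
    out = []
    for t in range(len(series)):
        lo = max(0, t - window + 1)
        out.append(sum(series[lo:t + 1]) / (t + 1 - lo))
    return out

def extrapolate(series, steps):
    """Continue the series with the slope of its last two points."""
    slope = series[-1] - series[-2]
    return [series[-1] + slope * (k + 1) for k in range(steps)]

smooth = moving_average([1, 2, 3, 4, 5, 6], window=3)
print(smooth)                        # → [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
print(extrapolate(smooth, steps=2))  # → [6.0, 7.0]
```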


Please note that the peak observed in figure 6.24 at day 147 is a failure of a valve. After the valve has been fixed, we see the abnormality decline, which suggests that the maintenance measure has relieved the problem. However, the abnormality does not decline to its former levels, which means that we have not fully solved the problem. From the prediction made, we expect another failure to occur on day 208. This prediction is made on day 175, i.e. 33 days in advance. The uncertainty of this prediction is ±10 days.

Fig. 6.24 Abnormality over time during the relevant phase together with 20-day moving average and prediction into the future. The peak around day 147 is a known failure. On day 175, we predict a second failure to occur on day 208 with an accuracy of ±10 days. This event occurred as predicted.

After waiting for a few weeks, we find that the failure did indeed happen as predicted. The failed valve is the same valve as the one we identified using the above abnormality approach. Thus, we conclude that it is possible to predict the future failure of valves and to identify which valve it will be, even if we only have information about the family of valves.

6.12 Case Study: Predicting the Dynamometer Card of a Rod Pump

Co-Authors:
Prof. Chaodong Tan, China University of Petroleum
Guisheng Li, Plant No. 5 of Petrochina Dagang Oilfield Company
Yingjun Qu, Plant No. 6 of Petrochina Changqing Oilfield Company
Xuefeng Yan, Beijing Yadan Petroleum Technology Co., Ltd.


A rod pump is a simple device that is used the world over to pump for oil on land, see figures 6.25 and 6.26. Basically, we drill a hole in the ground and cement it appropriately so that a nice vertical cavity results. Into this cavity we insert a rod, which is moved up and down by a mechanical device called the rod pump. Attached to the bottom of the rod is a plunger, a cylindrical "bottle" used to transport the oil. On the downward stroke, the plunger is allowed to fill with oil and on the upward stroke this oil is transported to the top, where it is extracted and put into barrels.

Fig. 6.25 A schematic of a rod pump. The motor drives the gearbox, which causes the beam to tilt. This drives the horsehead up and down. This assembly assures that the rotating motion of the motor is converted into a linear up-down movement of the rod. The stufﬁng box contains the oil that is discharged through a valve on the top of the well.

Let us focus our attention on two variables of this assembly: the displacement of the rod as measured from its topmost position and the tension force in the rod. When we graph these two variables against each other, with the displacement on the horizontal and the tension on the vertical axis, we find that, as the system is in cyclic motion, the curve is a closed locus. This is called the dynamometer card of the rod pump; see figure 6.27 (01) for an example of expected operations. To travel once around the locus takes the same amount of time as the rod pump takes to complete one full cycle of downstroke and upstroke. A normal rod pump makes four strokes per minute. It is a remarkable observation that the shape of this locus allows us to diagnose any important problem with the rod pump [64]. In figure 6.27 we display dynamometer card examples for the most common problems. We will go into a little detail on these shapes and their problems because it is an exceptional fact that a complete diagnosis can be made so readily from a single image. This approach should be possible for a variety of other machines once the correct measurements and the correct way of presenting them are found. That is the deeper reason for presenting these cases here; we encourage seeking a similar presentation of faults in other machinery.


Fig. 6.26 A schematic of the well bottom. The rod drives the plunger down into the well, guided by the well casing. The bottom of the plunger has a so-called riding valve to take in the oil through the inlets. The bottom of the well is closed off from the reservoir with a so-called standing valve that opens once the plunger is at the bottom.

The cases are:

01 This is the shape we expect to see on a properly working rod pump. The upper and lower horizontal features are nearly parallel and the diagram is close to the theoretically expected diagram.
02 Another example of good operations.
03 The two horizontal features are sloping downward, are much closer to each other and more wave-shaped than in the good case. This is due to excessive vibration of the rod.
04 The lower right-hand corner of the card is missing but the two horizontal features are still horizontal. This indicates that the plunger is not being filled fully but that the pump is working properly.
05 A more severe case of the former kind.
06 Here the pump is still working properly but the oil is very thick.
07 These distinctive jagged features with the lower right corner missing are caused by the presence of sand in the oil. This will cause damage to the rod assembly in the short term.
08 The lower right-hand corner is missing but the horizontal features are no longer horizontal; the bite taken out of the lower right corner has an exponential boundary. This is caused by the reservoir de-gassing and slowing the downward plunge.
09 A more severe case of the former kind.
10 A similar case to the former kind. Here the gas forms an air-lock inside the plunger, preventing the plunger from draining at the top.
11 The bottom horizontal feature is rounded and/or lifted up, making the whole card significantly smaller. This is due to a leaking inlet valve.


12 The opposite feature to the above. Here the top horizontal feature is rounded and/or pressed down, making the whole card smaller. This is due to a leaking outlet valve.
13 This oval feature results from a combination of both the inlet and the outlet valve leaking. Note that this is a fairly flat oval compared to the oval of image (06).
14 The top left-hand corner is missing and the boundary is in the shape of an exponential curve; compare with (08) and (09). This is due to a delay in the closing of the inlet valve.
15 Same as above but for a shorter delay.
16 The right side of the card is pressed down. This happens because of a sudden unloading of the oil at the top. The outlet valve is not opening properly but suddenly.
17 The characteristic upturned top right-hand corner (as opposed to (08)) indicates a collision of the plunger and the guide ring.
18 The lower left-hand corner is bent backwards and the top right-hand corner is sloped down, in addition to features like (08). This indicates a collision between the plunger and the fixed valve at the bottom of the hole.
19 The thin card with concave loading and unloading dents on the upper left and lower right corners indicates a resistance to the flow of the oil, such as the presence of paraffin wax.
20 A very thin but long card in the middle of the theoretically expected card, with wavy horizontal features, indicates a broken rod.
21 A thin long card with straight horizontal features indicates that the plunger is filling too fast due to a high pressure inside the reservoir. The plunger should be exchanged for a larger one.
22 The card looks normal but is too thin, particularly on the bottom. This is due to tubing leakage.
23 The piston is sticking to the walls of the hole and bending the rod.

It can easily be seen that, given a little experience in the matter, the diagnosis of problems is immediate from the shape.
In fact, it has been shown that the diagnosis can be automated by recognizing the shape with a perceptron neural network [43] (see section 6.4 for a discussion of perceptrons). Our purpose here is to investigate whether we can predict the future shape of the dynamometer card and thus diagnose a situation today that will lead to a problem in the next days. In order to predict the evolution of the shape over time, we must first be able to characterize the shape numerically. For this we use a two-step process. First, we find a formula-based model for the shape itself that has only a few parameters, which must be fitted to any particular shape. As we get a new dynamometer card several times per minute, this fitting process happens continuously, thereby inducing a time-series on those parameters. It is these parameters that we model using a recurrent neural network. In total, this yields a prediction system for the future shape of a dynamometer card.

6.12 Case Study: Predicting the Dynamometer Card of a Rod Pump

Fig. 6.27 The various cases of dynamometer cards. See text for an explanation.


In order to allow modeling, we first normalize the experimental data so that a dynamometer card's displacements and tensions always lie in the interval [−1, 1]. We do this by

$$t = \frac{2t - t_{\min} - t_{\max}}{t_{\max} - t_{\min}} \qquad\text{and}\qquad d = \frac{2d - d_{\min} - d_{\max}}{d_{\max} - d_{\min}}.$$

Using this transformation, the shape of the dynamometer card can be described by the parametric equation

$$\begin{pmatrix} t \\ d \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a\cos^{b}x + c\sin^{d}x \\ e\sin x \end{pmatrix} + \begin{pmatrix} f \\ g \end{pmatrix}$$

where x is the artificial variable of the parametric equation such that x ∈ [−π, π] [83]. The parameters a, b, c, d, e, f and g must be found by fitting. A typical dynamometer card consists of about 144 observations of d and t. Thus, we have enough data to reliably fit the 7 free parameters of the model using the normal least-squares approach. The vector v = [a, b, c, d, e, f, g, t_min, t_max, d_min, d_max] is actually a function of time v(t) and this function is then modeled with a recurrent neural network, see section 6.5.

In figure 6.28 we see the evolution of such a prediction. Time is measured in strokes, each of which is about 15 seconds in duration. The dotted line is the experimental data and the solid line is the model. The historical data upon which the model is based is mostly not shown, but images (1) and (2) are still historical data. Images (3), (4) and (5) are the resulting predictions. Based on this prediction, we initiate a maintenance measure that restores the pump to normal operations in image (6). As the model and the experimental measurement agree quite well, we have demonstrated that this approach can indeed predict future problems with dynamometer cards. Note that the problem in image (5) and the moment at which the prediction is made in image (3) are separated by 4000 pump cycles, or about 16.7 hours. This is enough warning time for practical maintenance to react.

To fully understand this evolution, we need to look at the corresponding evolution of the model parameters. See the left image in figure 6.29 for those model parameters that changed. From time 15000 onwards, we have normal operations and so the level of the parameters is constant there. Relative to this baseline, we observe an increasing deviation from normal operations in the model parameters.
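In plain Python, the normalization and the evaluation of the parametric card model look roughly as follows. This is a sketch: integer exponents b and d are assumed so that the powers stay real, and θ is treated here as one more parameter alongside a through g:

```python
import math

def normalize(values):
    """Map a card's raw coordinates into [-1, 1]."""
    vmin, vmax = min(values), max(values)
    return [(2 * v - vmin - vmax) / (vmax - vmin) for v in values]

def card_point(x, a, b, c, d, e, f, g, theta):
    """Point (t, disp) on the card for parametric variable x in [-pi, pi]."""
    u = a * math.cos(x) ** b + c * math.sin(x) ** d  # shape before rotation
    v = e * math.sin(x)
    t = math.cos(theta) * u - math.sin(theta) * v + f
    disp = math.sin(theta) * u + math.cos(theta) * v + g
    return t, disp

print(normalize([0, 5, 10]))                      # → [-1.0, 0.0, 1.0]
print(card_point(0.0, 1, 2, 0, 2, 1, 0, 0, 0.0))  # → (1.0, 0.0)
```

Fitting the parameter vector to the roughly 144 observed points of a card is then an ordinary nonlinear least-squares problem.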
The right image in ﬁgure 6.29 displays the evolution of the average of the displacement and the width of the displacement. We display these as the experimental data was normalized for the images in ﬁgure 6.28. Here we also observe an abnormal behavior in the beginning.


Fig. 6.28 The modeling of a dynamometer card's evolution in time. The difference between each image is 2000 cycles, i.e. about 8.3 hours. The model was trained on historical data. Images (1) and (2) are historical data providing the model with the initial data. Using this, the model predicts images (3) to (5) and indicates that at image (5) we have a problem requiring attention. The maintenance measure is performed and we observe, in image (6), the restoration of normal operations. See figure 6.29 for more details.

In conclusion, we note that a recurrent neural network can reliably predict a future fault of a rod pump system via predicting the future model parameters of a mathematical formulation of the dynamometer card. In this example, the prediction could be made 16.7 hours in advance of the problem.


Fig. 6.29 On the top image, we see the evolution of the model’s parameters over time: e as the solid line, a as the dotted line, θ as the closely dashed line and c as the long dashed line. All other model parameters were constant throughout. On the bottom we see the evolution of displacement average in the solid line and the displacement width (maximum minus minimum) in the dotted line. The period from time 15000 onwards is to be considered normal operations and so we can observe a gradual worsening of operations leading up to the necessary maintenance measure at time 10000.

Chapter 7

Optimization: Simulated Annealing

There are many optimization approaches. Most are exact algorithms that definitely compute the true global optimum for their input. Unfortunately, such methods are usually intended for one very specific problem only. There exists only one general strategy that always finds the global optimum. This method is called enumeration and consists of listing all options and choosing the best one. It is not realistic, as in most problems the number of options is so large that they cannot practically be listed. All other general optimization methods derive from enumeration and attempt either to exclude bad options without examining them or to examine only options that have good potential based on some criteria.

Two methods in particular, genetic algorithms and simulated annealing, have been successful in a variety of applications. The later advent of genetic algorithms stole the limelight from simulated annealing for some time [102]. However, direct comparisons between these two approaches suggest that simulated annealing nearly always wins in all three important categories: implementation time, use of computing resources (memory and time) and solution quality [54, 102]. Thus, we shall focus on simulated annealing. Having said this, opinions on these two approaches border on religious fervor. To do some justice to this debate, we will present genetic algorithms in section 7.1 and then describe simulated annealing for the rest of the chapter. It will become clear where the differences lie.

It should also be mentioned here that several methods exist that are usually presented under the heading of optimization methods but that will be ignored here. Such methods are, for example, conjugate gradient methods or Newton's method. The reason we shall ignore them is that they rely on the problem being purely continuous. They cannot deal with some of the variables being discrete.
Industrial problems, however, almost always involve discrete variables, as equipment is turned on and off or may be switched into discrete modes or levels. If you encounter a particular problem that can be written in terms of a purely continuous function, then these methods may well be appropriate. However, general optimization methods may be used profitably in this case as well.

P. Bangert (ed.), Optimization for Industrial Problems, DOI 10.1007/978-3-642-24974-7_7, © Springer-Verlag Berlin Heidelberg 2012


7.1 Genetic Algorithms

Genetic algorithms get their name and basic idea from the evolutionary ideas of biology. There is a population of individuals that mate, beget offspring and eventually die. At any one time, there is a population of many individuals, but over time these individuals change. These individuals, and thus the whole population, have certain characteristics that are important to us. In particular, they have a so-called "fitness," a term derived from the statement by Charles Darwin: "In the struggle for survival, the fittest win out at the expense of their rivals because they succeed in adapting themselves best to their environment." The fitness is thus identical with the objective function of optimization. The "purpose" of evolution is to breed an individual with the best possible fitness.

In nature, successive generations face changing environmental conditions, so it is likely that we will never reach an equilibrium stage at which the truly fittest possible individual can live. In mathematical optimization, however, the environment is the problem instance and so does not change. Thus, evolution can reach an equilibrium, and this is the proposed optimum. Note here that certain concepts are being turned upside down by the method: the ground state of the problem instance (the least likely state) becomes the equilibrium (the most likely state) of the population. The mechanism that achieves this switch is evolution. The fact that the search for something rare is turned into a process of equilibration toward something common is the whole point behind genetic algorithms.

The basic features of the genetic approach are thus: the initialization of a population of candidate solutions, the decision of how many such solutions there should be in any one generation, the method for combining several solutions into a new one and the criteria for stopping the search. Note that this approach is a heuristic.
We thus have no assurance of finding the true optimum and even if we do find it by chance, we do not have a foolproof way of recognizing it for what it is. This is a pity, but we cannot both have our optimum and know it, as it were. This is the price we pay for fast practical execution times. The decision of how many individuals live per generation is a human design decision that is essentially a black art, comparable to choosing the number of hidden neurons in a neural network. The initialization of the first generation is generally done uniformly randomly among all possible candidate solutions. The criteria for stopping have given rise to a significant research field, but most applications terminate the search when solution quality no longer improves significantly over many generations, i.e. a convergence criterion. If this seems too haphazard, simply restart the process with a different initial population a few thousand times and take the best outcome. This "restart" method has been shown to provide a small but significant improvement in general and is worth doing for nearly all applications, the only exception being if you are very pressed for time (e.g. real-time applications).

The only point that is really complex is defining how solutions mate and beget child solutions. The idea again derives from biological evolution: two parent DNA codes combine to make a new DNA code by selecting genes from either parent and then subjecting the result to some mutation.


Supposing that the two parents are the two solution vectors A and B, then we may construct a new solution C

A = [a_1, a_2, a_3, ..., a_n]   (7.1)
B = [b_1, b_2, b_3, ..., b_n]   (7.2)
C = [c_1, c_2, c_3, ..., c_n]   (7.3)

by putting randomly either c_i = a_i + ε_i or c_i = b_i + ε_i, where ε_i is a small randomly generated mutation term. Choosing an element from either A or B is called the crossover operator and adding a small random element is called the mutation operator. It is the mutation operator that allows the new solution to be made up of elements that did not yet exist in the initial population, and this is crucial in order to gradually visit all possible points. The method by which we choose elements from A or B can become arbitrarily complex or may be a simple 50-50 random choice. Likewise, the method of applying the mutation may be complex or simple. In general, it can be said that only simple problems can be solved by simple methods for these two stages of the solution generation process. For complex problems, we must prefer certain features in the crossover operation and must gradually suppress mutation over the long term so that the overall solution can focus on a precise final result. How exactly this is done would go beyond the scope of this book, since we wish to focus on simulated annealing, but the ideas presented for simulated annealing may be applied in this context.
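A minimal sketch of these two operators, using the simple variants described above (a 50-50 parent choice and a uniform mutation term):

```python
import random

def crossover_mutate(parent_a, parent_b, eps, rng):
    """Pick each element from either parent (crossover), then add a
    small random perturbation in [-eps, eps] (mutation)."""
    return [(a if rng.random() < 0.5 else b) + rng.uniform(-eps, eps)
            for a, b in zip(parent_a, parent_b)]

rng = random.Random(0)
child = crossover_mutate([0.0, 1.0, 2.0], [10.0, 11.0, 12.0],
                         eps=0.1, rng=rng)
print(child)  # each element lies within 0.1 of one parent's element
```

In a complete genetic algorithm this operator would be applied repeatedly to selected pairs from the current generation, with eps shrinking over time to suppress mutation in the long term.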

7.2 Elementary Simulated Annealing

When an alloy is made from various metals, these are all heated beyond their melting points, stirred and then allowed to cool according to a carefully structured timetable. If the system is allowed to cool too quickly or is not sufficiently melted initially, local defects arise in the system and the alloy is unusable because it is too brittle or displays other defects. Simulated annealing is an optimization algorithm based on the same principle. It starts from a random microstate. This is modified by changing the values of the variables. The new microstate is compared with the old one. If it is better, we keep it. If it is worse, we keep it with a certain probability that depends on a 'temperature' parameter. As the temperature is lowered, the probability of accepting a change for the worse decreases and so uphill transitions are accepted increasingly rarely. Eventually the solution is so good that improvements are very rare and accepted changes for the worse are also rare because the temperature is very low; the method then finishes and is said to converge. The exact specification of the temperature values and other parameters is called the cooling schedule. Many different cooling schedules have been proposed in the literature, but these affect only the details and not the overall philosophy of the method.


The initial idea came from physics [95]. The physical process of annealing is one in which a solid is first heated up only to be cooled down slowly in an effort to find its ground state. We are asked to cool the solid so slowly that it remains at thermal equilibrium throughout the entire process. Based on this assumption, statistical physics has been able to calculate the probability distribution of microstates (exact atomic configurations) giving rise to an observed macrostate (the global features of the whole solid). This distribution says that the probability of the solid making a transition from a state of energy E to a state of energy E′ (with E′ > E) is proportional to

$$\exp\left(-\frac{E' - E}{kT}\right)$$

where k is a constant and T is the temperature of the solid. As annealing is an actual physical process used in many manufacturing plants, the computerized simulation of this process is known as simulated annealing. In the context of a general substance the method takes the following form [13]:

Data: A candidate solution S and a cost function C(x).
Result: A solution S that minimizes the cost function C(x).
T ← starting temperature
while not frozen do
    while not at equilibrium do
        S′ ← perturbation of S
        if C(S′) < C(S) or selection criterion then
            S ← S′
        end
    end
    T ← reduced temperature
end

Algorithm 1: General Simulated Annealing

We immediately see some rather enticing features: (1) only two solutions must be kept in memory at any one time, (2) we must only be able to compare them against each other, (3) we allow temporary worsening of the solution, (4) the more time we give the algorithm, the better the solution gets and (5) the method is very general and can be applied to virtually any problem, as it needs to know nothing about the problem as such. The only points at which the problem enters this method are the creation of a perturbed solution and the comparison of cost function values. Note that the inner loop gives rise to a Markov chain, as each new state depends only upon its predecessor.

Also note that this formulation of the method is quite general. Several pieces are missing: (1) a method for assigning an initial temperature, (2) a definition of "frozen," (3) a definition of "equilibrium," (4) a selection criterion and (5) a method to calculate the next temperature. The cost function and perturbation mechanism are given by the problem we are trying to solve. It must be said that there are, in general, various ways in which perturbations could be generated. The solution quality (final cost and computation time) depends on this choice. As this is highly problem-specific, we can only hint at it in section 7.5. Giving definite computable functions for the five elements named above is referred to as a cooling schedule, which must be found to turn simulated annealing into an algorithm that can be implemented. As presented above, it may be considered a general computational paradigm but not yet an algorithm.

First, we give an example in the form of the traveling salesman problem, in which we try to find the optimal journey between 100 cities arranged on a circle, beginning from a randomly generated ordering. We chose a large number as initial temperature haphazardly, told the algorithm to stop once no improvement was seen over two consecutive temperatures, defined equilibrium as 1000 proposed transitions and cooled our system by multiplying the current temperature by 0.9. This extremely simple schedule led to the optimal solution of this problem in 35 temperatures. The cost for the problem is the Euclidean distance between the points on the journey and the moves are simple too: (1) reverse a sub-path and (2) replace the two sub-paths between towns A and B and towns C and D by two paths between A and C and between B and D [86]. This is an example in which we know what the optimal solution is (a circular journey) and thus we can be happy with this simple cooling schedule. In general, however, we do not know what the optimal solution is, and so choosing a cooling schedule becomes a problem in its own right because we cannot verify whether the final answer is actually good. In fact, most authors make a rather haphazard choice of functions and parameters for their cooling schedule.
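The run described above can be sketched in a few dozen lines. This smaller instance (12 cities instead of 100) uses the same schedule ingredients: a haphazard starting temperature, 1000 proposals per temperature, geometric cooling by 0.9 and the sub-path reversal move; the specific parameter values are illustrative:

```python
import math
import random

def tour_length(tour, pts):
    """Total Euclidean length of the closed tour."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal(pts, t0=10.0, alpha=0.9, proposals=1000, n_temps=60, seed=1):
    rng = random.Random(seed)
    tour = list(range(len(pts)))
    rng.shuffle(tour)
    cost = tour_length(tour, pts)
    best = cost
    T = t0
    for _ in range(n_temps):
        for _ in range(proposals):
            i, j = sorted(rng.sample(range(len(pts)), 2))
            cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # reverse a sub-path
            c = tour_length(cand, pts)
            # Metropolis criterion: accept improvements always, worsenings
            # with probability exp(-(c - cost) / T)
            if c < cost or rng.random() < math.exp((cost - c) / T):
                tour, cost = cand, c
                best = min(best, cost)
        T *= alpha  # geometric cooling
    return tour, best

# 12 cities on the unit circle; the optimal tour is the circle itself,
# with length 12 * 2 * sin(pi / 12), roughly 6.212
pts = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
       for k in range(12)]
tour, best = anneal(pts)
print(best)
```

Because the optimum is known here (the circular journey), the result can be checked directly; for a general instance this is exactly what we cannot do.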

7.3 Theoretical Results

In the limit of infinitely slow cooling, simulated annealing finds the optimal solution for any problem and any instance [123]. This is the central result on which most authors justify their use of simulated annealing. The question of how slow is slow enough in practice is a complex one that again raises the question of a cooling schedule. It is possible to prove polynomial-time execution of simulated annealing for a large class of cooling schedules [124].

If R is the set of all possible microstates and q_k the stationary probability distribution of the Markov chain (the inner loop of the algorithm), we may define the expected cost ⟨C(T)⟩, the expected square cost ⟨C(T)²⟩, the variance in the cost at equilibrium σ²(T) and the entropy at equilibrium S(T), all at a certain value of the temperature T, by

$$\langle C(T)\rangle = \sum_{k\in R} C(k)\,q_k(T); \qquad (7.4)$$
$$\langle C(T)^2\rangle = \sum_{k\in R} C^2(k)\,q_k(T); \qquad (7.5)$$
$$\sigma^2(T) = \langle C(T)^2\rangle - \langle C(T)\rangle^2; \qquad (7.6)$$
$$S(T) = -\sum_{k\in R} q_k(T)\,\ln q_k(T). \qquad (7.7)$$

Furthermore, the specific heat of the system is given by

$$h = \frac{\partial}{\partial T}\langle C(T)\rangle = T\,\frac{\partial}{\partial T}S(T) = \frac{\sigma^2(T)}{T^2}. \qquad (7.8)$$

The expected cost and expected square cost can be shown to be approximated by their averages over the Markov chain by virtue of the central limit theorem and the law of large numbers [3]. These quantities prove helpful not only in analogy to the physics origins of the paradigm but also in providing us with a good cooling schedule. Furthermore, they become important in a discussion of the typical behavior of simulated annealing [124].

In physical annealing, the substance effectively undergoes slow solidification after it has initially been melted at high temperature. It thus undergoes a phase transition. We would expect the total entropy of the system to decrease drastically during this transition and only slowly on either side of it. Physically, such transitions are usually fast but do take a finite amount of time. If we monitor the average energy, the standard deviation of the energy and the specific heat of the alloy during the physical annealing process, we should be able to see the phase transition clearly.

Surprisingly, we see the same effects in the evolution of combinatorial problems under simulated annealing. For the traveling salesman problem with 100 cities on the circle, we monitored these quantities over 72 temperatures and plotted them against the logarithm of the temperature, see figure 7.1. We see that there is a clear phase transition extended over a few temperatures and that both cost and standard deviation vary relatively little on either side of this transition. Subject to the assumption that cost is distributed normally over configuration space, one can prove that at large temperatures the standard deviation is constant and the cost inversely proportional to the temperature, while at low temperatures both standard deviation and cost depend linearly on the temperature [2, 70]. This is borne out by the data we have collected. The specific heat of the instance is roughly constant, with a few exceptions.
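Estimating the quantities (7.4)–(7.6) and the specific heat from the costs sampled along one Markov chain is straightforward; this sketch relies on the averaging justification mentioned above:

```python
def chain_statistics(costs, T):
    """Estimate expected cost, cost variance and specific heat at
    temperature T from the costs observed along one Markov chain."""
    n = len(costs)
    mean = sum(costs) / n
    mean_sq = sum(c * c for c in costs) / n
    var = mean_sq - mean * mean       # estimate of sigma^2(T)
    return mean, var, var / (T * T)   # specific heat h = sigma^2 / T^2

mean, var, h = chain_statistics([2.0, 4.0, 4.0, 6.0], T=2.0)
print(mean, var, h)  # → 4.0 2.0 0.5
```

In an adaptive schedule, h computed at the end of each chain would be compared against the value from the first chain to decide whether the cooling was too fast.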
The specific heat of a material body is the amount of heat that must be added to the body to increase its temperature by one kelvin, per kilogram of material. It differs in value depending on whether the pressure or the volume of the body is kept constant throughout the process of heating; for either process, however, it is a constant property of the material. In the context of combinatorial problems, we could interpret the pressure as the external forces of change (i.e. the probability distribution of accepting proposed transitions), which is constant over the Markov chain used to compute the specific heat. The volume of the problem could be considered to be

7.3 Theoretical Results


Fig. 7.1 We see the normalized average cost (top curve) and normalized standard deviation (lower curve) against the logarithm of the temperature. As temperature decreases over time, this means that time runs from the right of the plot to the left. We clearly observe the phase transition in the middle of the image.

the average cost during the Markov chain. Thus we are computing the specific heat at constant pressure. In further analogy to physics, we would expect the specific heat to be constant throughout the execution of simulated annealing. In physics, local maxima in the specific heat indicate local freezing and hence local cluster formation. This is detrimental to finding the ground state of the material and should be avoided. In other words, local maxima in the specific heat indicate deviation from equilibrium and thus too rapid cooling. If the specific heat at the end of any particular Markov chain is significantly greater than the specific heat computed in the first Markov chain, we should thus disregard the last chain and cool the system more slowly. This adaptive philosophy for the decrement formula forces the specific heat to be roughly constant over the evolution. In our specific example, we see a specific heat maximum around the onset of the phase transition in figure 7.2. Had we cooled more slowly here, we would in general have obtained a much better result.
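A minimal sketch of such an adaptive rule follows; the threshold `tolerance` and the way the cooling ratio is relaxed are illustrative choices of our own, not values prescribed by the text:

```python
def adaptive_temperature_step(T, c_heat_first, c_heat_last,
                              ratio=0.95, tolerance=2.0):
    """If the specific heat of the last chain greatly exceeds that of
    the first chain, the cooling was too fast: raise the temperature
    back up and cool more gently from now on.  Otherwise apply the
    normal constant-ratio decrement.  Returns the next temperature
    and the (possibly adjusted) cooling ratio."""
    if c_heat_last > tolerance * c_heat_first:
        # disregard the last chain: back up and shrink the cooling step
        return T / ratio, (1.0 + ratio) / 2.0
    return T * ratio, ratio
```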


7 Optimization: Simulated Annealing

Fig. 7.2 We observe the speciﬁc heat as the grey curve and the acceptance probability of the suggested transitions as the black curve against the logarithm of temperature. As in ﬁgure 7.1, time therefore runs from the right to the left of the image. We again clearly observe the phase transition in the speciﬁc heat curve as the onset of local freezing. As expected, the acceptance ratio of suggestions declines exponentially and, upon becoming too low for further progress, leads to the end of the optimization run.

7.4 Cooling Schedule and Parameters

A host of experimental data from a variety of combinatorial problems shows that the performance, in both quality of the final solution and execution time, is highly dependent upon the cooling schedule [124]. We will spend some time reviewing different possibilities for such a schedule and, as we shall see, the parameters that the parts of the cooling schedule require us to choose are no less significant for the performance of the algorithm. Generally, the quality of the final solution of simulated annealing competes favorably with the very best tailored algorithms for specific combinatorial problems, though at the cost of additional execution time [124]. This observation has been made in many papers, but all of them have used very simple cooling schedules and have not tuned the parameters of these schedules well. This leads us to speculate that one may expect additional quality and time performance from simulated annealing after appropriate tuning. If this were consistently true, the method might beat a number of tailored algorithms in quality. There are parallel implementations of simulated annealing that speed up the execution considerably. These are, however, more complex and deviate somewhat from the original physical and intuitive ideas. They are also harder to implement, so we refer to the literature, e.g. [100] and references therein. We begin the discussion of cooling schedules with the cautionary remark that an experimental comparison


of the major cooling schedules (with tuned parameters) has never been done and so advantages and disadvantages are a matter of theory for now.

7.4.1 Initial Temperature

The starting temperature has to be chosen such that the system can make highly uneconomical transitions in the beginning and later settle down to temperatures at which very few such transitions are possible. Thus, in combination with the temperature decrement formula and the selection criterion, this temperature should be chosen high enough but not too high. There are two major directions in which investigators have chosen to go. The first is to say that the initial temperature T0 should be such that a certain percentage χ0 of uneconomical transitions is accepted [112, 79]. We start by assuming a Gaussian distribution of cost fluctuations, because this is our generic selection criterion, and thus set the desired probability of acceptance χ0 equal to the acceptance probability at the initial temperature [57],

χ0 = exp(−ΔC(+)/T0)  →  T0 = ΔC(+)/ln(χ0⁻¹),

where ΔC(+) is the average of all observed cost function increases. In order to use this formula, we thus perform one Markov chain to compute ΔC(+) and then compute T0 by choosing some χ0. Estimates of this kind are used in many papers [84, 85, 119, 66, 97]. While this is a relatively simple and utilitarian way of finding a good starting temperature, most application-oriented papers do not even go this far but merely fix T0 to a number that seems to give good results empirically. Once this number is found using a few test instances, it is fixed for all subsequent instances and thus becomes part of the problem specification. Clearly, this is not a good strategy. In the best case, there will be many instances for which the computation time is larger than needed, but it is likely that in many cases the final solution will be inferior to the one that could have been found with a different initial temperature. We thus advise adaptive tuning of the initial temperature according to some model.
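The χ0-based estimate can be sketched as follows; the random-walk sampling loop is our own illustration, the text only prescribes the formula T0 = ΔC(+)/ln(χ0⁻¹):

```python
import math
import random

def initial_temperature(cost, perturb, s0, chi0=0.8, trial_moves=500):
    """Estimate T0 from one trial chain: accept every proposed
    transition (a pure random walk), record the average cost increase
    dC(+), and solve chi0 = exp(-dC(+)/T0) for T0."""
    s, c = s0, cost(s0)
    increases = []
    for _ in range(trial_moves):
        s_new = perturb(s)
        c_new = cost(s_new)
        if c_new > c:
            increases.append(c_new - c)
        s, c = s_new, c_new          # random walk: accept everything
    mean_increase = sum(increases) / max(len(increases), 1)
    return mean_increase / math.log(1.0 / chi0)
```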
A more sophisticated approach is based on the assumption that the number of solutions corresponding to a particular cost C, the configuration density, is normally distributed with mean ⟨C⟩ and standard deviation σ∞. These parameters are determined empirically during a Markov chain. We may then compute the expectation value of the cost as a function of the temperature, ⟨C⟩(T) ≈ ⟨C⟩ − σ∞²/T, a local Taylor expansion in which ⟨C⟩ is an average taken at the current temperature, i.e. over the current Markov chain. The variance σ∞² = ⟨C(T)²⟩ − ⟨C(T)⟩² is estimated by the corresponding chain averages. Finally, we agree that we would like the initial expectation of the cost to be within x standard deviations of the average cost. Together


with the empirical estimators for the expectation values, we obtain [128]

T0 > x √(⟨C²⟩ − ⟨C⟩²).

We learn in statistics that, for a normal distribution, 68% of all cases lie within one, 95% within two and 99.7% within three standard deviations of the mean. The number of standard deviations x thus again comes down to choosing a χ0. From the cumulative Gaussian probability distribution, the number of standard deviations and the initial acceptance probability are related by

χ0 = ∫_{−xσ∞}^{+xσ∞} (1/(√(2π) σ∞)) exp(−(C − ⟨C⟩)²/(2σ∞²)) dC.
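The integral is just the probability mass of a normal distribution within ±x standard deviations of its mean, so the relation reduces to χ0 = erf(x/√2):

```python
import math

def chi0_from_x(x):
    # chi0 = erf(x / sqrt(2)): normal probability mass within
    # +/- x standard deviations of the mean
    return math.erf(x / math.sqrt(2.0))

# the familiar 68-95-99.7 rule
print(round(chi0_from_x(1), 4), round(chi0_from_x(2), 4),
      round(chi0_from_x(3), 4))   # -> 0.6827 0.9545 0.9973
```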

This is analogous to the practical method in which we start the metal at room temperature and heat it gradually until we believe it is hot enough to be well mixed (some distance past its melting point, for instance); only then do we begin the annealing process in earnest. It is the concept of the melting point that we have essentially attempted to capture in this section, and it accurately describes what we need the initial temperature to be.

7.4.2 Stopping Criterion (Definition of Freezing)

The analogy between the melting point of a metal to be annealed and the starting temperature of a combinatorial problem to be solved by simulated annealing was drawn in the previous section. It can be continued between the freezing point of a metal and the stopping criterion of the simulated annealing process. Physically speaking, the freezing temperature and the melting point are the same (this temperature marks the phase transition between the solid and liquid phases), but the transition is not instantaneous. In the context of optimization, the starting temperature is higher and the final temperature lower. Thus, a real phase transition does not occur at one instant but over a period of time. The entropy does not have an actual discontinuity (as the theory would have us believe); rather, it undergoes a sudden and drastic change over a small but finite time frame. The simplest proposal for the final temperature is to fix the number of distinct temperatures, so that the final temperature depends upon the temperature reduction formula. The actual number varies between six [113] and fifty [47] in the literature. The analogue of waiting until no more consequential transitions are made is to wait until the optimal configurations found after a number of Markov chains (typically the last four) are identical [79, 97, 115]. We may further require that the probability of accepting a random transition be smaller than some fixed value χ_f, by analogy to the treatment of the starting temperature [57].


A number of more sophisticated proposals are made in the literature. Suppose that we are at a local minimum C0 in the cost function and the lowest cost value of any configuration in the neighborhood of the current configuration (i.e. reachable by a single transition) is C1. Then we would like the probability of transiting from the local minimum to this point to be lower than 1/R, where R is the size of the neighborhood. This condition (assuming that cost is normally distributed, as before) gives [128]

T_f ≤ (C1 − C0)/ln R.

The choice of 1/R as the cut-off probability seems reasonable, but we may consider this a general statement under the Gaussian assumption and input any desired probability. Alternatively, we may require that the probability that the last configuration reached in a Markov chain lies more than ε (in cost) above the true minimum of the cost function be less than some real number θ. This may be used to derive the condition [89]

T_f ≤ ε/(ln(|R| − 1) − ln θ),

where R is the set of all possible configurations. This obviously suffers from having to choose an ε and a θ (similar to the χ_f above) and from having to calculate the size of the configuration space. Consider the difference between the average cost ⟨C⟩(T_k) during the kth Markov chain and the true optimum. This may be expanded as a first-order Taylor series when T_k is small. We require this difference, relative to the average cost in the first Markov chain, to be lower than some fixed ε divided by the terminal temperature T_f. Finally this gives [4, 101]

T_f > (⟨C²⟩(T_f) − ⟨C⟩(T_f)²)/(ε (⟨C⟩(T0) − ⟨C⟩(T_f))).
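Two of the simpler stopping tests translate directly into code; the thresholds `n_same` and `chi_f` below are illustrative choices of ours, not values prescribed by the text:

```python
import math

def frozen(best_costs, acceptance_ratio, n_same=4, chi_f=0.01):
    """Freeze when the best cost of the last n_same chains is
    unchanged, or when the acceptance ratio drops below chi_f."""
    unchanged = (len(best_costs) >= n_same
                 and len(set(best_costs[-n_same:])) == 1)
    return unchanged or acceptance_ratio < chi_f

def final_temperature(c1, c0, neighborhood_size):
    """T_f <= (C1 - C0) / ln R: below this temperature the probability
    of escaping a local minimum C0 to its best neighbor C1 falls
    under 1/R."""
    return (c1 - c0) / math.log(neighborhood_size)
```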

7.4.3 Markov Chain Length (Definition of Equilibrium)

Continuing our physical analogy, we are to anneal our substance at equilibrium. Thus we are only allowed to lower the temperature when the substance has reached thermal equilibrium at the current temperature. We need a firm definition of this concept in combinatorial terms in order to terminate our Markov chain. A strict mathematical definition of equilibrium is practically impossible, as it would entail computing the probability distribution over configuration space, which would correspond to the simplest of all optimization algorithms (check all possibilities and choose the best one). Let the length of the kth chain be L_k. The simplest practical way is to give a definite fixed length to every chain, so that L_k is independent of k and depends only


upon the problem size, i.e. it is some polynomial-time computable function of |R|, the size of the configuration space. Many authors simply choose some number, e.g. L_k = 100 [47, 46, 90, 51, 115]. Alternatively, we use this function only as a ceiling for the length and possibly terminate the chain earlier, as soon as the number of accepted transitions reaches some η_min (the ceiling is required because the acceptance ratio is low at low temperatures) [79, 57, 84, 85, 97]. On the other hand, we may require that the number of refused transitions reach a certain value [113]. This seems counter-intuitive, as it leads to shorter chains at low temperatures and thus a speedup of cooling, whereas one would assume that achieving equilibrium takes longer at low temperatures. More physically, consider breaking the chain into fixed-length (in terms of accepted transitions) segments. The cost of a segment is the cost of its last configuration. When the cost of the current segment is within a cut-off of that of the preceding segment, we terminate the chain [119, 58, 52]. This is more intuitive because it is related to the fluctuations in cost over the chain: we terminate when the fluctuations settle down, a definition of equilibrium that an experimental physicist might agree with. Statistically speaking, we would wish for a sufficiently large probability of making an uneconomical transition (possibly out of a local minimum) to be maintained throughout the chain. A reasonable estimate, based on Markov chains, for a length scale of a specific chain is

N ≈ 1/exp(−(C_max − C_min)/T) = exp((C_max − C_min)/T),

where C_max and C_min are the largest and smallest costs observed so far (including previous chains). This is further corroborated by the fact that N plays a similar role in stochastic dynamical systems as the time constant plays in linear dynamical systems; it really is a length scale [59, 108]. Taking the actual length of the chain to be a few N should thus be enough to reach equilibrium; exactly how many, however, remains to be decided by the user. Finally, it is possible to show that at equilibrium the number of accepted transitions within an interval ±δ about the average cost ⟨C⟩ reaches the stationary value

κ = erf(δ/(√2 σ(T))) ≈ erf(δ/√(2(⟨C²⟩ − ⟨C⟩²))),

where erf(···) denotes the error function, and we may take this as our definition (practical care has to be taken to avoid extremely long chains at low temperatures) [94].
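The ceiling-plus-minimum-acceptances rule, for instance, can be sketched as follows (the interface of `step` is our own illustration):

```python
def run_chain(step, accepted_min=100, ceiling=10000):
    """One Markov chain at a fixed temperature: stop once accepted_min
    transitions have been accepted, but never propose more than
    ceiling moves (needed because acceptance becomes rare at low
    temperature).  step() performs one proposal and returns True if
    the proposed transition was accepted."""
    accepted = proposals = 0
    while accepted < accepted_min and proposals < ceiling:
        if step():
            accepted += 1
        proposals += 1
    return accepted, proposals
```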


7.4.4 Decrement Formula for Temperature (Cooling Speed)

After each Markov chain, the temperature is decreased in analogy to the physical annealing process in which a metal is cooled at equilibrium. The new temperature T_{k+1} is calculated from the old temperature T_k very simply by keeping either their ratio [79, 57, 47, 46, 90, 51, 84, 85, 97, 115] or their difference [119, 66] a global constant. The ratio is usually taken between 0.9 and 0.99, but 0.5 has also been used. If the difference is used, it is determined by fixing the number of distinct temperatures, the initial temperature and the final temperature. The ratio rule is used very frequently in applications, to the virtual exclusion of other rules. Among the five rules that we must prescribe, the decrement rule has the most impact upon the quality and efficiency of the algorithm [124]. We would like to decrease the temperature slowly, so that subsequent chains do not have to be too long to reestablish equilibrium, but not so slowly that the algorithm takes too much time to freeze. We begin with the reasonable statement that the stationary probability distributions of two successive temperatures should be close, i.e. their ratio should be larger than 1/(1 + δ) and smaller than 1 + δ for some (small) real number δ. Depending on subsequent assumptions, we may derive the following rules,

T_{k+1} = T_k [1 + T_k ln(1 + δ)/(3σ(T_k))]⁻¹, see [4, 1]   (7.9)

T_{k+1} = T_k [1 − T_k ln(1 + δ)/(3σ(T_k))], see [82]   (7.10)

T_{k+1} = T_k [1 + γT_k/U]⁻¹, see [89]   (7.11)

T_{k+1} = T_k − T_k² ln(1 + δ)/(C_max + T_k ln(1 + δ)), see [101]   (7.12)

where γ is some small real number and U is an upper bound on the difference in cost between the current point and the optimum. Alternatively, we can begin from the statistical mechanics formula for specific heat and approximate, as we have done so far, the expected cost by the average cost. This leads to

T_{k+1} = T_k exp(−λT_k/σ(T_k)),

where λ is the number of standard deviations by which the average costs of different chains are allowed to differ; we require λ ≤ 1 [94].
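A few of these decrement rules, sketched in Python; δ and λ are the tunable constants, and the default values below are illustrative:

```python
import math

def ratio_rule(T, E=0.9):
    # the ubiquitous constant-ratio rule, with E typically 0.9-0.99
    return E * T

def aarts_rule(T, sigma, delta=0.1):
    # T_{k+1} = T_k / (1 + T_k ln(1 + delta) / (3 sigma(T_k)))
    return T / (1.0 + T * math.log(1.0 + delta) / (3.0 * sigma))

def huang_rule(T, sigma, lam=0.7):
    # T_{k+1} = T_k exp(-lambda T_k / sigma(T_k)), with lambda <= 1
    return T * math.exp(-lam * T / sigma)
```

Here `sigma` is the standard deviation of the cost over the chain just completed, so the adaptive rules slow the cooling exactly where the cost fluctuations are small.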


7.4.5 Selection Criterion

Generally, the Maxwell-Boltzmann distribution is assumed to be a reasonable criterion for accepting or rejecting a proposed transition for the worse, by analogy to statistical physics, and so the probabilistic selection criterion is based on the function exp[−ΔC/T]. The discussion of whether a different condition should be chosen starts from the observation that transitions of high cost difference can help to get the system out of local minima, and these are accepted rather less often at low temperatures. Furthermore, at large values of the temperature virtually all transitions are accepted without bias. One may wish to bias the selection of transitions such that large transitions are more likely at lower temperatures, to help approach the optimum faster. There are many possible choices, but they are all centered on attempting to force faster convergence, not a lower final cost. It is our experience that with present hardware it is not necessary to speed up the algorithm (at the cost of a possibly worse solution) for almost all practical problems. We mention one simple way to tune the selection process, namely the reintroduction of Boltzmann's constant into the Maxwell-Boltzmann distribution, i.e. changing the function to exp[−ΔC/kT] where k is a constant. In physics, this constant takes one universal and constant value; it can thus be thought of as a conversion factor from kelvins to joules (units of temperature to units of energy). In combinatorics, this factor would convert units of temperature (whatever that may mean here) to units of cost. In our discussion of initial temperatures and decrement formulas, however, the unit of temperature was the same as the unit of cost, so the constant, in this context, does not need to convert units. Specifically, there is some evidence that k = 2 may lead to slightly faster convergence to equilibrium [68].
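The Metropolis-style selection criterion with the reintroduced constant k is only a few lines:

```python
import math
import random

def accept(delta_c, T, k=1.0):
    """Accept a proposed transition: improvements always, worsenings
    of size delta_c > 0 with probability exp(-delta_c / (k T))."""
    if delta_c <= 0:
        return True
    return random.random() < math.exp(-delta_c / (k * T))
```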
A very interesting approach uses the so-called Tsallis statistics to attempt a speed up of annealing’s convergence without a loss of quality. This is very promising but is beyond the scope of this book to discuss, see [80] for a start.

7.4.6 Parameter Choice

We have seen that we must choose five elements (initial and final temperatures, chain length, decrement formula and selection criterion) to turn the paradigm of simulated annealing into an algorithm, in addition to formulating our problem in terms of transitions. This choice is far from obvious. In addition, almost all of these elements depend on parameters that we must also choose. We have some theoretical and practical guidelines and intuition as to which rules to choose, but the parameters generally escape precise quantification by intuition. We are thus led to the question: do slight variations in the parameters make measurable differences in the performance (quality and speed) of the simulated annealing algorithm applied to a particular problem? The experimental answer is definitely affirmative. Thus we have to make intelligent choices.


It is unfortunate that almost all practitioners of the simulated annealing paradigm do not put much effort into finding the optimal parameters. From the literature it seems that the vast majority follows this method: a few test instances of the problem are generated, some with known optimal solutions and some without. The parameters are adjusted manually such that the known cases are solved to optimality and the unknown cases are solved to a final cost that seems reasonable in the light of the known cases. This manual adjustment means in practice that rather few different parameter sets are tried and the first one that looks good in the above way is kept. Furthermore, the parameters are then kept fixed for all future cases to be solved and are hardly ever (except perhaps in the case of the initial temperature) varied. However, an optimal parameter set can improve the average solution quality appreciably over a manually chosen one. Another method used more recently is to try a few values for each parameter and then use linear regression to obtain some optimal interpolated parameter set [63]. This is essentially a manual adjustment as well, since there is no good method to choose the few sets on which the regression is based. Alternatively, we can regard simulated annealing as a function of its parameters that returns the relative cost reduction α = (C_initial − C_final)/C_initial of the problem instance. This is not quite good enough because α has a probability distribution that is largely unknown. However, if we generate a large number of similar instances of the same size and compute the average cost reduction α by running simulated annealing with identical parameters on each one, then this should return the expectation value of the relative cost reduction. This is a good measurement of the efficacy of the method and we shall take it as our figure of merit function to find the optimal parameters.
In short, we have a multidimensional function minimization problem (maximizing α is the same as minimizing 1/α). The simplest version of simulated annealing sets five constants A, B, C, D and E to some initial values and looks like this:

Data: A candidate solution S and a cost function C(x).
Result: A solution S that minimizes the cost function C(x).
T ← A
while T > B do
    for i = 1 to C do
        S′ ← perturbation of S
        if C(S′) < C(S) or Random < exp[(C(S) − C(S′))/(D·T)] then
            S ← S′
        end
    end
    T ← E·T
end

Algorithm 2: Simple Simulated Annealing

In words, we start with a constant temperature A and define a constant temperature B to be the freezing point. Equilibration is assumed to occur after or within C steps of the proposal-acceptance loop, where the selection criterion is the thermodynamic Maxwell-Boltzmann distribution with Boltzmann's constant D, after which


the temperature is decremented by a constant factor E. The standard choices for these constants are A = C(S), B one hundred times smaller than the best lower bound on the cost, C = 1000, D = 1 and E = 0.9. After a successful implementation of this algorithm, one usually plays with these parameters until the program behaves satisfactorily. It is clear that implementing this method is very fast, and we observe from the literature that the vast majority of applications are computed using the version of simulated annealing given in algorithm 2, with the five parameters determined manually [54]. Using this interpretation, we may regard simulated annealing as defining a function α = α(A, B, C, D, E) of five parameters (for the simple schedule). We would like this average reduction ratio to be as large as possible. This is yet another optimization problem, now over a function instead of a combinatorial problem. We can evaluate the function only at considerable computational cost (N runs of simulated annealing for N randomly generated initial configurations) and we do not know its derivative accurately; even approximating the derivative comes only at heavy computational cost. There are many optimization methods, such as Newton's method and, more generally, the family of methods known as quasi-Newton or Newton-Kantorovich methods, that rely on computing the derivative of the objective function. Some of them have high computational complexity due to the computation of the Hessian matrix, but complexity considerations are secondary here. The most important argument against all these methods is that the derivative computation is not very accurate for the function constructed here, and this loss of accuracy in an iterative method would yield meaningless answers. Indeed, such methods were tried and the results found to be unpredictable because of error accumulation, and much worse than the results obtained by methods not requiring the computation of derivatives.
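For concreteness, Algorithm 2 transcribes almost line for line into Python; the default parameter values below are the standard choices quoted above, and the bookkeeping of the best solution seen so far is a small safeguard of our own:

```python
import math
import random

def simulated_annealing(cost, perturb, s0, A=None, B=1e-3, C=1000,
                        D=1.0, E=0.9):
    s, c = s0, cost(s0)
    best, best_c = s, c
    T = A if A is not None else max(abs(c), 1.0)   # A = C(S) by default
    while T > B:                                   # B: freezing point
        for _ in range(C):                         # C: chain length
            s_new = perturb(s)
            c_new = cost(s_new)
            # Metropolis criterion with Boltzmann constant D
            if c_new < c or random.random() < math.exp(-(c_new - c) / (D * T)):
                s, c = s_new, c_new
                if c < best_c:
                    best, best_c = s, c
        T *= E                                     # E: decrement ratio
    return best, best_c
```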
The method of choice for optimizing a function over several dimensions without computing its derivative is the downhill simplex method (alternatively, one may use direction set methods). Thus, we use the downhill simplex method to minimize 1/α(A, B, C, D, E). The starting point for the simplex method is given by those values of the five parameters that we obtain after some manual experiments, for the reason that most practitioners of the simulated annealing paradigm choose their parameters based on manual experiments [54]. The other points of the simplex are set by manually estimating the length scale for each parameter [127]. We find, after extensive computational trials on a variety of test problems, that the average improvement in the reduction ratio after annealing has been parametrized by the downhill simplex method, as opposed to human tuning, is 17.6%. We believe this is significant enough to seriously recommend it in practice. Note that this is an average, so there are cases where the improvement is small and cases where it is large. It seems impossible to tell a priori what the result will be. Many simulated annealing papers have been published that center on the performance of the algorithm in terms of getting to an acceptable minimum quickly [102]. A variety of cooling schedules have been designed that can reduce the computation time at the expense of solution quality. When the author experimented with a number of open-source implementations of simulated annealing for a variety
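The figure of merit that the simplex method optimizes can be set up as a plain function of the parameter set; the instance generator and the annealer are passed in, and everything here is an illustrative skeleton rather than the authors' code:

```python
def average_reduction(anneal, make_instance, params, n_runs=20):
    """Average relative cost reduction
    alpha = (C_initial - C_final) / C_initial over n_runs random
    instances annealed with one fixed parameter set; a downhill
    simplex minimizer would then be applied to 1/alpha (or -alpha)
    as a function of params."""
    total = 0.0
    for _ in range(n_runs):
        s0, cost = make_instance()
        c_initial = cost(s0)
        _, c_final = anneal(cost, s0, params)
        total += (c_initial - c_final) / c_initial
    return total / n_runs
```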


of optimization problems with tools such as a profiler, speed-ups of up to three orders of magnitude were achieved. This is in contrast to claimed speed-up factors of between 1.2 and 2.0 that come from changing the cooling schedule at the expense of solution quality [102]. The author therefore believes the speed of the simulated annealing method to be so dominated by programming care that he has not attempted to optimize solution quality and execution speed simultaneously. This simultaneous optimization would, however, be no problem in principle, once one has made the essentially arbitrary decision of how important speed is relative to quality. Thus we may draw a number of conclusions that appear to hold in general: (1) the solution quality obtained using simulated annealing depends strongly on the numerical values of the parameters of the cooling schedule; (2) the downhill simplex method is effective in locating the optimal parameter values for a specific input size; (3) the parameters depend strongly on input size and should therefore not be global constants for all instances of an optimization problem; (4) the improvement in solution quality can be significant for theoretical and practical problems (up to 26.1% improvement was measured in these experiments, which is large enough to have significant industrial impact). Furthermore, the reason that the usual manual search is so much worse than an automated search seems to be that the solution quality (as measured by the average reduction ratio) depends strongly on the cooling schedule parameters, i.e. the landscape is a complex mountain range with narrow valleys that are hard to find manually. Finally, the improved schedule parameters in general lead to slightly greater execution time, but in view of the dramatic improvement in quality (and the fact that execution time seems to be dominated by programming care) this is well worth it.
However, computation times are generally so low nowadays, with powerful computers, that increasing the speed of annealing at the expense of quality is a non-issue. Rather, we would expand the computation time for the benefit of additional quality.

7.5 Perturbations for Continuous and Combinatorial Problems Apart from the cost function, with which we compare any two proposed solutions, the only other point in annealing that the problem enters the algorithm is in the method to perturb or change any proposed solution. This method to modify a solution must be carefully constructed such that we have a good chance to meet with many solutions and to be able to exit local minima. Let us imagine that we are dealing with a continuous problem. That is, a problem in which the independent variables (the one whose value we want to determine) take on continuous values as opposed to discrete values. Then a solution is any value of the independent variable vector x that obeys the boundary conditions. In order to generate a new vector x from this, we can create several ideas 1. Set x to a random vector independent of x. This is a simple and intuitive idea but it violates the basic philosophy of simulated annealing of adaptive change.

182

2.

3.

4. 5.

7 Optimization: Simulated Annealing

This method makes annealing essentially a variant of random search (take the best solution of many randomly generated ones) and performs poorly. Set x equal to x and change one element by ±Δ for some a priori chosen length scale Δ . This is better but performs poorly as well, as most problems tend to have shorter length scales at lower temperatures due essentially to the phase transition observed at intermediate temperatures. Do the above but make Δ into Δ (T ) a function of temperature. The scale should decrease with temperature, that much is clear. There is wide disagreement in the literature as to how to decrease it exactly. Describing these methods would carry us too far aﬁeld. We merely note that the length scale of any particular variable at any particular temperature may be estimated by performing many transitions for various Δ for this variable at that temperature and noting down the variation in the cost function thus achieved. In this manner, we may empirically construct a Δ (T ). Do the above but assign a different Δ (T ) to each element in x because every variable may (and in general will) have its own length scale. Do the above but do not change only one element of x but several during a single move. We then ask how many is “several?” We have found that about 10% of the number of elements in x is a good number to vary simultaneously. Which they are should be chosen randomly.

In absence of domain knowledge that may allow us to design a better transition mechanism, the last of the above suggestions has performed best for many experiments of the author. In case of doubt, this approach is recommended. If we are dealing with a problem that is not continuous but rather discrete, matters are more diverse. We now need to take into account the actual structure of the problem. Take, for example, the traveling salesman problem. The mechanism that has proven to be the best has two moves: reverse any partial path and interchange two partial paths, e.g. A → B and C → D with the two partial paths A → C and B → D. There have been many other suggestions for generating a new solution from an old solution but it is this suggestion that has been found to perform best. It is unclear, before experimenting, which set of moves will perform best. It is apparent from the structure of the moves for the traveling salesman problem, that these were designed with the structure of the problem itself in mind – the problem is about a path between points without repetition. We cannot use these moves for a different problem. As such we must really design a move set with respect to a problem. There is no general theory for constructing a transition mechanism. We must think what is natural for the problem at hand and try it out. In general several suggestions will have to be tested. Most often we will have no theoretical explanation for the better performance of the winner but must merely be content to observe which happens to be the best. We note in closing that the suitability of the transition mechanism is a major point in using simulated annealing. If you use a poor transition mechanism, annealing will take much longer (require more transitions) and may indeed converge to a poorer

outcome. Note also that the cooling schedule depends on the transition mechanism and so the cooling schedule must be tuned to a particular transition mechanism.
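A sketch of the two traveling-salesman moves just described – reversing a partial path and interchanging two partial paths – assuming a tour represented as a list of city indices (the names and conventions are ours):

```python
import random

def reverse_segment(tour, i, j):
    # Move 1: reverse the partial path between positions i and j (inclusive)
    t = tour[:]
    t[i:j + 1] = reversed(t[i:j + 1])
    return t

def exchange_segments(tour, i, j, k):
    # Move 2: interchange the partial paths tour[i:j] and tour[j:k]
    t = tour[:]
    t[i:k] = t[j:k] + t[i:j]
    return t

def random_move(tour, rng=random):
    # Pick one of the two moves at random; this pair serves as the SA
    # transition mechanism for the traveling salesman problem.
    n = len(tour)
    if rng.random() < 0.5:
        i, j = sorted(rng.sample(range(n), 2))
        return reverse_segment(tour, i, j)
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    return exchange_segments(tour, i, j, k)
```

Both moves always return a valid tour, i.e. a permutation of the cities, which is exactly the property a transition mechanism for this problem must preserve.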

7.6 Case Study: Human Brains use Simulated Annealing to Think

Co-Author: Prof. Dr. Adele Diederich, Jacobs University Bremen gGmbH

Humanity has long searched for the mechanism that allows the human brain to be as successful as it is observed to be in solving a variety of problems, both new and old, every day. Much is known about the architecture of the brain on the level of neurons and synapses but very little about its global modus operandi. We find evidence here that simulated annealing is that elusive mechanism that could be called the 'formula of the brain.' By examining the traveling salesman and capacitated vehicle routing problems, which are typical of the everyday problems that humans solve, we illustrate that none of the optimization methods known to date match the observations except for simulated annealing. The method is both simple and general while being highly successful and robust. It solves problems very close to optimality and shows fault-tolerance and graceful degradation in the presence of errors in both input data and objective function calculations.

The human brain is constituted of approximately 10^11 individually simple computational elements (neurons) that are interconnected via approximately 5·10^14 synapses^1 [55, 111]. These large numbers prohibit a comprehensive direct computer model of the brain. Even if it were possible, such a model would be essentially epistemological, i.e. it would treat the brain as a "black box" and would concern itself only with input and output to this box. It is eminently more desirable to search for an ontology of the human brain, a theory that (at least to some degree) explains as well as reproduces input–output pairings.

The importance for science in general of understanding how our brain thinks in global terms can hardly be overemphasized. Given the philosophical nature of the issues, it seems unreasonable to expect to resolve the nature-nurture, consciousness-complexity or intelligence debates on such grounds.
However, many issues of scientific interest can be tackled from this basis, such as the performance issues at the basis of intelligence and all manner of questions regarding memory and learning, as well as the modularization or compartmentalization of the brain. Furthermore, through a better understanding of the human 'hardware' it should be possible to facilitate improved learning, recall, and equilibrated and enhanced brain usage. In brief colloquial terms, an ontological brain mechanics forms the essential introduction to a brain operations manual for the scientist as well as for the lay-thinker.

^1 Graph-theoretically speaking, the brain is a very sparse graph – with 10^11 nodes, a graph with all possible edges would have 0.5·10^22 edges, meaning that the human brain has approximately 0.00001% of all possible synapses. This is, of course, necessary, as compartmentalization and modularization are quite essential for the myriad functions that the brain has to perform simultaneously.


7 Optimization: Simulated Annealing

The average human being must make complex choices many times per day. Many of these fall into the category of optimization problems: choosing the 'best' alternative from a wealth of possibilities, a simple example being that of planning a route between many stops. The meaning of 'best' differs widely between problems^2 but it is clear that we must be able to (and are able to) compare several possibilities as to their goodness when trying to find the best alternative. The process of 'thinking about' the problem (i.e. considering the relative goodness of various possible alternatives) takes time, and often we consider an alternative that is worse than the best one found so far in an effort to find an aspect of that alternative that will allow us to find an even better alternative later on – we accept temporary losses in expectation of greater gain at a later time.

In most problems, the total number of possible alternatives is astronomically large and no simple recipe for solution exists. As an example, planning the shortest route between n stops while running errands would have us consider (n − 1)! possible routes. Most of these problems can only be (realistically) solved using heuristic methods. The human brain is thus capable of selecting a good alternative from a large set of possibilities without considering all possibilities. Additionally, the brain does not search randomly but 'intelligently' considers the alternatives. It is unreasonable to believe that the human brain has a separate solution strategy for each possible different problem. This would require a vast brain (which we do not have) and enormous learning (which we do not have time for). Thus there must exist a central problem-solving apparatus that manages to solve many very different problems to a reasonable degree each.^3
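The factor (n − 1)! grows explosively, as a quick computation shows:

```python
import math

# number of distinct round-trip routes through n stops,
# fixing the starting point: (n - 1)!
for n in (5, 10, 15):
    print(n, math.factorial(n - 1))
```

Already at 15 stops there are roughly 87 billion candidate routes, which rules out exhaustive comparison.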
Furthermore, the neuron-synapse structure of the brain operates approximately 1000 times slower than current computer hardware, and humans still regularly outthink computer programs in tasks such as pattern recognition. It is sometimes thought that massive parallelism is the key to this performance gap [117, 111]. Our thinking strategy is thus problem-independent and very quickly obtains a nearly optimal solution via a directed search through a very small portion of the space of possible solutions.

We wish to draw a parallel between the SA algorithm and the functionality of the human brain in solving optimization problems. As most problems encountered in everyday life are optimization problems, we will take this to be a strong indication that the human brain uses SA as its general problem-solving strategy. Learning is incorporated in two ways: (1) the cooling schedule of the SA paradigm is very flexible and amenable to substantial tuning, and (2) after sufficient experience with a particular kind of problem, the brain may develop a custom method for dealing with those particular issues important to that human being.

SA is very powerful, as we have seen, but it is also very robust. Robustness refers to the preservation of the method's ability to find a good solution in the presence of noise (errors in the input data) and/or uncertainty (errors in the objective function evaluation). Clearly the human brain is very robust, and this feature must carry over into any ontological mechanics of the brain.

The proper unit of time for the SA paradigm is the number of proposals made. It is useless to measure the actual time taken, as this depends too strongly on the computer and programmer. As each proposal necessitates the construction of a candidate solution, its comparison to the current reference solution and its subsequent acceptance or rejection, we postulate based on timing measurements of the brain that the average human is capable of doing this in 3 milliseconds [56, 111]. This means that the human brain considers approximately 333 proposals per second. We shall take this as the conversion factor in order to compare our SA algorithm to brain measurements.^4

In our experiments we task a number of human subjects and the SA algorithm with a number of instances of the traveling salesman problem (TSP) and the capacitated vehicle routing problem (CVRP). In both problems, the locations (on the two-dimensional Euclidean plane) of n cities are chosen. In the TSP, the shortest possible round-trip journey visiting each city exactly once is to be constructed. In the CVRP, each city has an associated demand value that the traveling salesman has to meet, given that his vehicle has a maximum capacity. A particular city is flagged as the 'depot' to which the salesman has to periodically return to refill his vehicle. The CVRP asks for the shortest journey visiting each city (except the depot) exactly once, starting and ending at the depot, and fulfilling all demands while never exceeding the vehicle capacity.^5

^2 Examples include minimizing the number of kilometers needed for a journey, the amount of time required for a job, the number of trucks needed to supply a chain of stores, the number of rooms required for a conference, the best assignment of employees to job tasks and so on.

^3 Note that due to the astronomically large number of possible solutions and the non-existence of any quick guaranteed solution schemes, the human brain cannot be expected to (and does not) solve these problems to optimality but only close enough for practical purposes.

^4 It should be noted that almost no proposals are considered consciously. The computational power of the brain is vast but relies on almost all of the computing being done unconsciously.
^5 Method Note: Both problems can, of course, be posed with the cities not on the two-dimensional Euclidean plane, but this restriction made it possible to have the human test subjects solve the same instances as the computer. The instances used had between 10 and 70 cities; half the instances were taken from actual cities on the Earth projected onto the plane and the other half were uniformly randomly distributed in a square. For the human test subjects, the computer screen first informed them how many cities the next problem had and what the maximum time limit was; then a fixation cross was displayed at the center of the screen; and then the cities were displayed as dots on the screen, together with a clock counting backwards from the maximum time limit. The subjects had to use the mouse pointer to click on the cities in the order in which they wished to visit them. A choice could be undone by clicking the other mouse button, and the click times were all recorded, as well as the length of the final journey. It was found that the average subject needs approximately one second per city just to perform the clicking operations, i.e. giving less than this much time would not yield a complete tour of the cities. Each instance was displayed several times for different maximum time values in order to measure the progress of the subjects as they were given more or less time to think. Each time an instance was redisplayed it was rotated by a different angle so that it did not look the same as on its previous display. As such, the learning effects of the experiment were kept to the general task and were not specific to an instance. For the computer algorithm, we used a cooling schedule that assesses the correct starting temperature by heating up the system until 99% of all proposals are accepted (using the Maxwell-Boltzmann distribution for accepting cost-increasing proposals). Equilibrium is defined as 200 proposals, and the temperature is decreased by a constant factor of 0.99 until the cost does not change over four consecutive equilibria. This schedule is capable of solving to optimality all small problem instances contained in the TSPLIB collection of standard problem instances for both the TSP and CVRP. This collection represents the international testbed for TSP and CVRP algorithms. One possible criticism is that measuring cost deviation from optimum skews the human performance because the subjects do not control the length directly but only the order of the cities. It has been shown in the context of the TSP that the deviation from one journey to another (in terms of Hamming distance – the number of differing bits in a vector) approximately scales with the corresponding cost [45].

In order to compare the data from the subjects to that of the computer, we note that for the SA paradigm the graph of percentage cost deviation from optimum versus time is largely independent of the problem size for small instances. Furthermore, this graph, as well as the graph of cost variance versus time, displays a smoothed step function profile. The initial plateau is a feature that SA could be criticized for as an optimization algorithm, as it could be viewed as a waste of computational resources; after all, we only begin to see progress after a substantial amount (roughly one-third) of the time has been spent. From a catalog of possible smoothed step function forms [103], we have determined that the best fit is based on the complementary error function:

    a·erfc(bx + c) + d,   where a, b, c and d are constants and erfc(x) = (2/√π) ∫_x^∞ exp(−t²) dt.

Other optimization algorithms do not have this profile; their cost versus time graphs generally follow a decaying exponential. We do indeed find a scale invariance in the human subjects' performance, as was expected from SA experience, i.e. the normalized cost and standard deviation curves do not depend upon the problem size. Thus we restrict ourselves to presenting data from a particular instance. See figure 7.3, where the SA output has been scaled in time according to the rate of 333 proposals per second. The notable features of this comparison are thus: (1) scale invariance was observed in both human subjects and computer algorithm, (2) the cost and standard deviation functions agree closely between computer and human subjects, (3) the (independently arrived at) translation between the number of proposals and seconds is accurate, and (4) these features are characteristic of the SA paradigm and do not occur all together in any of the other general optimization methods.

Thus we have evidence that the brain mechanics cannot follow any of the other standard methods, as well as evidence that SA is very close to the observed performance. We conclude that simulated annealing is the prime candidate for an ontological brain mechanics.

Fig. 7.3 The normalized cost deviation from optimum versus time in seconds is plotted in the upper image, with the grey line being the SA output and the black line being the human subjects' average for a particular instance. The normalized standard deviation versus time is plotted in the lower image in the same manner.

7.7 Determining an Optimal Path from A to B

Suppose that you are currently at the point A and that you have computed that point B is the optimum that you would like to reach. In industrial reality, you cannot always just change all set points from A to B in one go. The process must be guided smoothly from here to there. This is called a change at equilibrium, meaning that the change must happen at such a slow speed that the process is always (nearly) at equilibrium even though values are being changed. This will ensure that the process will continue producing the product that you want without causing any unwanted side effects that might even destroy the optimization gains altogether.

This is equivalent to a navigation system in a car. It is unfortunately not possible to drive from A to B instantaneously; you need to pass through some of the intermediate points. There is an optimal route, and it is the responsibility of the navigation system to tell you what this optimal route is, to guide you through each of the steps involved as and when they are necessary, and to alert you if you deviate from this plan. We must do the same for an industrial process.

A simple example of this process is to find the shortest line between two points. On a flat space, this is obviously a straight line. If the space is not flat (for example, the hilly surface of the Earth), then this shortest path is no longer a straight line. The method for solving such problems is called the calculus of variations. This method has a few steps.


First, we define our criterion for optimality. In this context it is called the Lagrangian and is a function of the variables x, the function that we wish to find f(x) and the derivative of the function we wish to find f′(x), written L(x, f(x), f′(x)). In the case of finding the shortest distance between two points, we have

    L(x, f(x), f′(x)) = √(1 + (f′(x))²).

Second, we define the action integral to be

    A = ∫_A^B L(x, f(x), f′(x)) dx

where A and B are the two extremal points of the line. The action thus depends on the function f(x), which is unknown at this point. This is a strange dependency, as f(x) is not a variable with a numerical value but a variable with a function as its value. We will not discuss this at length here but merely request the, in mathematics very common, "willing suspension of disbelief." Third, we state that we wish f(x) to take on that function as its value which will minimize the action. The result, after some pencil work, is the Euler-Lagrange equation

    ∂L/∂f − d/dx (∂L/∂f′) = 0.

This equation must be solved for f(x). In general, this may be difficult. We will illustrate it with a simple example. We start with the arc length L(x, f(x), f′(x)) = √(1 + (f′(x))²) and observe that here

    ∂L/∂f = 0

as f does not appear explicitly in L. Then,

    d/dx (∂L/∂f′) = d/dx [ f′(x) / √(1 + (f′(x))²) ].

From the fact that this expression must equal zero, we see that we must have the numerator of the resulting derivative equal to zero and thus

    d²f(x)/dx² = 0.


The general solution to this is f(x) = mx + b, i.e. the famous straight line. This is the calculus-of-variations way of proving that the shortest distance between two points (on a flat space) is a straight line. In industrial reality, we have a multidimensional space (x is a vector) and we want to minimize the path's total cost in terms of the objective function that we used in the optimization to find the optimal point. This leads to an Euler-Lagrange equation that must be solved. As the objective function becomes the L used above, we cannot be more concrete here without a specific example. This equation will, in general, not be solvable directly but must be solved numerically. That is how to determine the most economical path from A to B. The methods for numerically solving a non-linear second-order partial differential equation in several dimensions go beyond the scope of this book but are available in several commercial software libraries for practical use.
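The same result can be checked numerically: discretize the path, take the summed segment lengths as the action, and descend on the interior points. A minimal sketch (the discretization and step sizes are our own choices, not from the text):

```python
import math

def path_length(ys, x0=0.0, x1=1.0):
    # Discretized action: total length of the piecewise-linear path
    # through the points (x_i, y_i) with equally spaced x_i.
    n = len(ys) - 1
    dx = (x1 - x0) / n
    return sum(math.hypot(dx, ys[i + 1] - ys[i]) for i in range(n))

def minimize_path(ys, iters=2000, step=0.01, h=1e-6):
    # Crude gradient descent on the interior points; the endpoints stay
    # fixed (they are the boundary conditions f(A) and f(B)).
    ys = list(ys)
    for _ in range(iters):
        for i in range(1, len(ys) - 1):
            up = ys[:]
            up[i] += h
            grad = (path_length(up) - path_length(ys)) / h
            ys[i] -= step * grad
    return ys

# path from (0, 0) to (1, 1) with a deliberately crooked initial guess
best = minimize_path([0.0, 0.9, 0.1, 0.8, 0.2, 1.0])
```

The interior points relax onto the straight line y = x, in agreement with the Euler-Lagrange result above; in an industrial setting the segment length would be replaced by the objective-function cost of each step.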

7.8 Case Study: Optimization of the Müller-Rochow Synthesis of Silanes

Silanes are chemical compounds based on silicon and hydrogen. Important for industrial use are the methyl chloride silanes. Industrially, these are principally produced using the Müller-Rochow synthesis (MRS), which is the reaction

    2CH3Cl + Si → (CH3)2SiCl2.

There are various hypotheses regarding the catalytic mechanism of the reaction, but there is no generally accepted theory. Regardless of this, the reaction is widely used in large-scale industrial facilities to produce the dimethyl chloride silane (CH3)2SiCl2 and the trimethyl chloride silane CH3SiCl3, which we will refer to below as Di and Tri. Practically, the reaction is carried out using silicon that is available in powder form in particle sizes between 45 and 250 μm with a purity higher than 97%. The most common catalyst is copper, and the promoters are a combination of zinc, tin, phosphorus and various other elements. The reaction is carried out at about 300 °C and between 0.5 and 2 bar overpressure. In a fluidized bed reactor, the silicon powder encounters chloromethane gas from below. The product stream leaving the reactor contains the desired end product but also unused methyl chloride that has to be separated in a condenser. The mixture of a variety of silanes is then separated by rectification, where the desired Di and Tri are split off from the other methyl chloride silanes, which are mostly waste. These desired end products can now be hydrolyzed into various silicones. The final products of this process find practical use as lubricants in cars, creams for cosmetics, flexible rubber piping, paints for various applications, insulating paste for buildings and in a variety of other applications. Unfortunately, the reaction also produces several unwanted by-products.

The selectivity of the process measures how much of the total end product is of each of the various types; for example, a Di-selectivity of 80% indicates that 80% of the total end product is in the form of Di.


The market value of Di is the highest among the different end products, and so we want to maximize the Di-selectivity. However, the selectivity is influenced in part by the addition of catalyst. The relationship between increasing catalyst and increasing selectivity is a matter of folklore in this area. As part of this research project, no relationship whatsoever could be discovered within the studied range of 1%–3% catalyst addition. As the catalysts represent a financial cost, the most economical selectivity is not, in fact, the maximum that could be chemically reached. We desire an economic maximum here.

Due to the fact that no generally accepted theory of the catalytic mechanism exists, there is considerable debate and experimentation in industrial settings on the correct use of the catalyst and promoters in order to get optimum performance. The question specifically is: in what circumstances should what amount of what element be added to the reaction? An important component of finding an answer to this question is what the desired outcome of adding these substances is. In an industrial setting, the commercial environment supplies us with some additional variables such as market prices and supply and demand variations. Finally, we establish that the desired outcome is a maximum of profitability. Whatever combination of catalysts, promoters and end products is required for this will be taken, and it is the purpose of an optimization to compute this at any time.

Consider a black box. This box has five principal features:

1. There are various slots into which you feed raw materials such as silicon, copper and so on.
2. There are some pipes where the various end products come out of the box.
3. The box has a few dials and buttons with which you can act upon the system. These will be called the controllable variables, c.
4. The box has various gauges that display some information about the inside of the box, such as various temperatures and pressures. These variables change in dependence upon the controllable ones but cannot be controlled directly and thus will be called the semi-controllable variables, s.
5. The box also has gauges that display some information about the external world, such as market prices for end products or the outside air temperature. As these variables are determined by the external world, we have no influence over them at all. These will therefore be called the uncontrollable variables, u.

Inside this box, the Müller-Rochow synthesis is doing its job. Due to the lack of a theory about the synthesis, we cannot describe the process inside the box using a set of equations that we can write down from textbooks or first principles. Therefore, we will adopt a different viewpoint. Any industrial facility records the values measured by all the gauges and dials in an archive system that is capable of describing the state of the box over a long history. As the underlying chemistry has not changed over time, we therefore have a large collection of "input signals" (controllable) into the unknown process alongside their corresponding "output signals" (semi-controllable) in dependence upon the boundary conditions or constraints (uncontrollable), which, mathematically, are


also a form of input signal. This experimental data should allow us to design a mathematical description of the process that would take the form of several coupled partial differential equations. Formally speaking, these equations look like s = f(c; u). Mathematically speaking, the uncontrollable variables assume the role of parameters in this function (hence their position after the semicolon in the notation). Discovering this function is the principal purpose here and is very complex. One of the most intriguing features is that all three sets of variables are time-dependent and the process itself has a memory. Thus the output of the process now may depend on the last few minutes of one variable and the last few hours of another. These memories of the process must be correctly modeled in order for this function to represent the process well enough to use it as a basis for decision making. In order not to clutter the mathematical notation, we will skip the dependence upon time that should really be attached to every variable here. The modeling is done using the methods from section 6.5.

In order to do optimization, we need to define a goal g to maximize, which is a function of the process variables and parameters, g = g(c, s; u). Using the recurrent neural network modeling approach, the goal becomes g = g(c, f(c; u); u), i.e. the goal is now a function of only the controllable variables and the uncontrollable parameters. Optimization theory can be applied to this in order to find the optimal point ĉ at which the goal function assumes a maximum, g_max = g(ĉ, f(ĉ; u); u). As the location of the optimal point is computed in dependence upon the goal function as described above, it becomes clear that the optimal point is, in fact, a function of the uncontrollable measurements, ĉ = ĉ(u).

Now we have the optimal point at any moment in time. We simply determine the uncontrollable measurements by observation and compute the optimal point, which depends only upon these measurements. Thus we arrive at our final destination: the correct operational response r at any moment in time is the difference between the current operational point c and the optimal controllable point ĉ(u), i.e. r = ĉ(u) − c. This response r is what we report to the control room personnel and request them to implement. Ideally, the plant is already at the optimal point, in which case the response r is the null vector and nothing needs to be done.

As a result of the plant personnel performing the response r, the optimal point will be attained and an increase of the goal function value will be observed; this increase is Δg = g_max − g(c, f(c; u); u), which we can easily compute and report as well. The relative (percentage) increase Δg_rel = Δg / g(c, f(c; u); u) has been found, in this example, to be approximately 6%; see below for details. Please note carefully that the response r = ĉ(u) − c is a time-dependent response even though we have skipped this dependency in the notation. Thus, we do not necessarily proceed from the current point c to the optimal point ĉ(u) in one step; see section 7.7 for a discussion of this point. Most often it is important to carefully negotiate the plant from the current to the optimal point, and this journey may take a macroscopic amount of time – sometimes several hours.

Figure 7.4 displays this problem graphically using real data taken from the current process. The two axes on the horizontal plane indicate two controllable variables and the vertical axis displays the goal function. We can easily see that the


Fig. 7.4 The dependency of the goal on two controllable variables. The upper path displays the reaction of a human operator and the lower path displays the reaction of the optimization system. The paths are different and arrive at a different destination even though they started from the same initial state on the left of the image. The optimized path is better than the human determined path by approximately 3% as measured by the goal function.

change in a controllable variable can produce a dramatic change in the goal. The two paths displayed represent the reactions to the current situation by a human operator (the upper path) and by the computer program (the lower path). They initially begin on the left at the current operational point. Because of their differing operational philosophies, the paths deviate and eventually arrive at different final states. This is a practical example of the human operator making decisions that he believes are best but that are, in fact, not the best possible.

For the specific current application, the molecules are produced in three separate reactors and then brought together for shipment. We are to optimize the global performance of the plant but are able to make changes for each reactor separately. In this case, the controllable variables c were the following: temperature of the reactor, amount of raw material to the jet mill, steam pressure to the jet mill, amount of methyl chloride (MeCl) to the reactor, pressure of the reactor, and others relating to the processes before the synthesis itself. The uncontrollable parameters u were X-ray fluorescence spectroscopy measurements of 17 different elements in the reactor. The semi-controllable variables s are the other variables measured in the system. In total, there were almost 1000 variables measured at different cadences.
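The chain of computations described in this section – model s = f(c; u), goal g = g(c, f(c; u); u), optimal point ĉ(u) and response r = ĉ(u) − c – can be sketched end to end. Everything below is an illustrative stand-in: a toy model and goal replace the learned process model and the financial objective, and plain random search replaces the real optimizer:

```python
import random

def f(c, u):
    # toy process model s = f(c; u) (an assumption, for illustration only)
    return [ci * (1.0 + 0.1 * u[0]) for ci in c]

def goal(c, s, u):
    # toy goal: reward output, penalize control effort
    return sum(s) - 0.5 * sum(ci ** 2 for ci in c)

def optimal_point(u, n_controls, trials=20000, lo=-5.0, hi=5.0, seed=0):
    # crude search for c_hat = argmax_c g(c, f(c; u); u); the text would
    # use simulated annealing here
    rng = random.Random(seed)
    best_c, best_g = None, float("-inf")
    for _ in range(trials):
        c = [rng.uniform(lo, hi) for _ in range(n_controls)]
        g = goal(c, f(c, u), u)
        if g > best_g:
            best_c, best_g = c, g
    return best_c, best_g

u = [0.2]                      # current uncontrollable measurements
c_now = [0.0, 0.0]             # current operational point
c_hat, g_max = optimal_point(u, n_controls=2)
r = [ch - c for ch, c in zip(c_hat, c_now)]    # recommended response r
g_now = goal(c_now, f(c_now, u), u)
delta_g = g_max - g_now                        # expected improvement
```

The response r would be reported to the control room, and because ĉ depends on u, the whole computation is repeated as the uncontrollable measurements change.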


The goal function is the financial gain of the reaction. We compute the input raw materials and the output end products; each amount is multiplied by the currently relevant financial cost or revenue. The final goal is thus the value added to the product by the synthesis, and we desire this to assume a maximum. This function is dominated by two effects: Di is the most valuable end product, so we wish to maximize its selectivity, and the overall yield represents the profit margin, so we wish to maximize it also. Possible conflicts between these criteria are resolved by their respective contributions to the overall financial goal. In the results, we will focus on these two factors.

The results reported here were obtained in an experimental period lasting three months and encompassing three reactors. The experiment was broken into three equal periods. During the reference period, the optimization was not used at all. During the evaluation period, the optimization had only partial control, in that the human operator controlled the input of catalyst and promoters. During the usage period, the optimization was given full control. We may observe the results in figure 7.5. In each graph, the dotted line is the reference period, the dashed line is the evaluation period and the continuous line is the usage period. What is displayed is the probability distribution function of the observed values. This way of presenting the results allows an immediate statistical assessment instead of a time series.

                 Reference     Evaluation    Usage
Selectivity (%)  79.8 ± 3.6    79.9 ± 2.5    82.7 ± 1.9
Yield (%)        86.6 ± 4.2    89.7 ± 4.3    91.7 ± 3.2

Table 7.1 The results displayed numerically. For both selectivity and yield, we give the mean ± the standard deviation for all three periods.

It is apparent from the images alone that the selectivity and the yield increase with greater use of the optimization, and that the variance in both selectivity and yield decreases as well. Numerically, the results are displayed in table 7.1. Decreasing the variance is desirable because it yields a more stable reaction over the long term, which produces its output more uniformly over time. We may conclude that the selectivity can be increased by approximately 2.9% and the yield by approximately 5.1%, both in absolute terms. Together these two factors yield an increase in profitability of approximately 6% for the plant. We emphasize that this profitability increase of 6% has been made possible through a change of operator behavior only (as assisted by the computational optimization); no capital expenditures were necessary.


Fig. 7.5 The probability distribution functions for selectivity and yield of Di for periods in which the optimization was not used (dotted), used for controllable variables without the catalyst (dashed) and used fully without restrictions (solid).

7.9 Case Study: Increase of Oil Production Yield in Shallow-Water Offshore Oil Wells
Co-Authors: Prof. Chaodong Tan, China University of Petroleum; Bailiang Liu, PetroChina Dagang Oilfield Company; Jie Zhang, Yadan Petroleum Technology Co Ltd


Several shallow-water offshore oil wells are operated in the Dagang oilfield in China. We demonstrate that it is possible to create a mathematical model of the pumping operation using automated machine learning methods. The resulting differential equations represent the process well enough to make two computations: (1) we may predict the status of the pumps up to four weeks in advance, allowing preventative maintenance to be performed and thus availabilities to be increased, and (2) we may compute in real-time which set-points should be changed so as to obtain the maximum yield of the oilfield as a whole, considering the numerous interdependencies and boundary conditions that exist. We conclude that a yield increase of approximately 5% is possible using these methods.

The Dagang oilfield lies in the Huanghua depression and is located in the Dagang district of Tianjin. Its exploration covers twenty-five districts, cities and counties in Tianjin, Hebei and Shandong, including the Dagang exploration area and the Yoerdus basin in Xinjiang. The total exploration area of the Dagang oilfield is 34,629 km², including 18,629 km² in the Dagang exploration area. For the present study, we will consider data for five oil wells of a shallow-water oil rig in Dagang operated by PetroChina.

An offshore platform drills several wells into an oilfield and places a pump into each one. If the pressure of the oilfield is too low, as in this case, the platform must inject water into the well in order to push out the oil. Thus, the pump extracts a mixture of oil, water and gas, which is then separated on the platform. External elements like sand and rock pieces in this mixture cause abrasion and damage the equipment. When a pump fails, it must be repaired. Such a maintenance activity requires significantly less time if it can be planned, as then the required spare parts and expert personnel can be procured and made available before the actual failure.
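One way such advance warning can be realized, sketched here under the assumption that a learned model of the healthy pump exists, is to watch the residual between the model prediction and the live measurement. The constant model stub and the threshold value below are illustrative placeholders, not the actual method details.

```python
# Flag an impending failure when the measured signal drifts away
# from the learned model of the healthy pump. The model stub and
# the threshold are illustrative assumptions.

def predicted_discharge_pressure(t):
    # stand-in for the machine-learned model of a healthy pump
    return 150.0

def failure_alarm(times, measured, threshold=5.0):
    """Return the first time at which the prediction residual
    exceeds the threshold, or None if the pump tracks the model."""
    for t, m in zip(times, measured):
        residual = abs(m - predicted_discharge_pressure(t))
        if residual > threshold:
            return t
    return None
```

A slowly growing residual would trip the alarm well before the actual failure, which provides the planning window for maintenance described above.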
If we wait until the failure happens, the amount of time that the well is out of operation is significantly longer. Thus, we would like to know several weeks in advance when a pump is going to fail.

Each pump can be influenced via two major control variables: the choke diameter and the frequency of the pump. These parameters are currently controlled manually by the operators. The maximum possible yield of the rig thus depends largely on the decisions of the operators, which are defined by the knowledge and experience of the operator as well as the level of difficulty of any particular pump state. However, the employment of continuous and uniform knowledge and experience for the pump operation is not realistically possible, as no one operator controls the plant over the long term but usually only over a shift. Observations show oscillations of parameters in a rough eight-hour pattern, which supports the argument that a fluctuation in the knowledge and experience of human operators may lead to a fluctuation in decision making and thus a varying influence on the operation of the rig. While some operators may be better than others, it is often not fully practical or possible to extract and structure the experience and knowledge of the best operators in such a fashion as to teach it to the others.

Pumps in an oilfield are not independent. Demanding a great load from one will cause the local pressure field to change and will make less oil available for neighboring pumps. Obtaining the maximum yield, therefore, is not a simple matter but requires careful balancing of the entire field. In addition, certain external factors


also influence the pressure of the oilfield, e.g. the tide. This high degree of complexity of the pump control problem presents an overwhelming challenge to the human mind, and the consequence is that suboptimal decisions are made.

Fig. 7.6 The discharge pressure of a pump as measured (jagged curve) and calculated from the model (smooth curve). We observe that the model is able to correctly represent the pump as exempliﬁed by this one variable.

The model is accurate and stable enough to predict the future working of the pump up to four weeks in advance. It can thus reliably predict a failure of a pump within this time horizon due to some slow mechanism. We verify that the model accurately represents a pump's evolution in figure 7.6.

The model was then inverted for optimization of yield. The computation was done for the entire available history of 2.5 years and it was found that the optimal point deviated from the actually achieved points by approximately 5% in absolute terms. The main benefits of the current approach are that it: (1) processes all measured parameters from the rig in real-time, (2) encompasses all interactions between these parameters and their time evolution, (3) provides a uniform and sustainable operational strategy 24 hours per day and (4) achieves the optimal operational point and thus smooths out variations in human operations.

Effectively, the model represents a virtual oil rig that acts identically to the real one. The virtual rig can thus act as a proxy on which we can dry-run a variety of strategies and then port them to the real rig only if they are good. That is the basic principle of the approach. The novelty here is that we have demonstrated, on a real rig, that it is possible to generate a representative and correct model based on machine learning of historical process data. This model is more accurate, more comprehensive, more detailed, more robust and more applicable to the real rig than any human-engineered model could possibly be.
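The inversion of the model can be sketched with the simulated annealing scheme of this chapter. The quadratic yield_model below is an invented surrogate standing in for the machine-learned pump model, and the annealing parameters are illustrative.

```python
# Simulated annealing over one pump's two control variables.
# yield_model is an invented surrogate for the learned model;
# its maximum lies at choke = 40, frequency = 50.
import math
import random

random.seed(0)

def yield_model(choke, freq):
    return 100.0 - 0.02 * (choke - 40.0) ** 2 - 0.05 * (freq - 50.0) ** 2

def anneal(steps=20000, temp=5.0, cooling=0.9995):
    state = (20.0, 30.0)                 # initial set-points
    val = yield_model(*state)
    best, best_val = state, val
    for _ in range(steps):
        cand = (state[0] + random.gauss(0.0, 1.0),
                state[1] + random.gauss(0.0, 1.0))
        cand_val = yield_model(*cand)
        # always accept improvements; accept worse moves with
        # Boltzmann probability so the search can escape local traps
        if cand_val > val or random.random() < math.exp((cand_val - val) / temp):
            state, val = cand, cand_val
            if val > best_val:
                best, best_val = state, val
        temp *= cooling
    return best, best_val
```

On this smooth surrogate the search settles near the optimum; on the real rig, the learned model takes the surrogate's place and the search must additionally respect the interdependencies between neighboring pumps.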


The increase of approximately 5% in yield is signiﬁcant as it will allow the operator to extract more oil in the same amount of time as before and thus represents an economic competitive advantage.

7.10 Case Study: Increase of coal burning efficiency in CHP power plant
Co-Author: Jörg-A. Czernitzky, Vattenfall Europe Wärme AG

The entire process of a combined-heat-and-power (CHP) coal-fired power plant, from coal delivery to electricity and heat generation, can be modeled using machine learning methods that generate a single set of equations describing the entire plant. The plant has an efficiency that depends on how the plant is operated. While many smaller processes are automated using various technologies, the large-scale processes are often controlled by human operators. The Vattenfall power plant Reuter-West in Berlin, Germany is largely automated in these parts.

The maximum possible efficiency of the plant depends in part on the decisions of the operators, which are defined by the knowledge and experience of the operator as well as the level of difficulty of any particular plant state. However, the employment of continuous and uniform knowledge and experience for the plant operation is not realistically possible, as no one operator controls the plant over the long term but usually only over an eight-hour shift. Observations show oscillations of parameters in a rough eight-hour pattern, which indicates that a fluctuation in the knowledge and experience of human operators may lead to a fluctuation in decision making and thus a varying influence on the operation of the plant. While some operators may be better than others, it is often not fully practical or possible to extract and structure the experience and knowledge of the best operators in such a fashion as to teach it to the others.

Furthermore, the plant outputs several thousand measurements at high cadence. At such frequency, an operator cannot possibly keep track of even the most important of these at all times. This intensity, combined with the high degree of complexity of the outputs, presents an overwhelming challenge to the human mind and the consequence is that suboptimal decisions are made.
Here, a novel method is suggested to achieve the best possible, i.e. optimal, efficiency at any moment in time, taking into account all outputs produced as well as their complex interconnections. This method yields a computed efficiency increase in the range of one percent. Moreover, this efficiency increase is available uniformly over time, effectively increasing the base output capability of the plant or reducing the CO2 emission of the plant per megawatt.

Initially, the machine learning algorithm was provided with no data. Then the measured points were presented to the algorithm one by one, starting with the first measured point. Slowly, the model learned more and more about the system and the quality of its representation improved. Once even the last measured point was presented to the algorithm, it was found that the model correctly represents the system. See section 6.5 for details on the method. In the particular plant considered here, Reuter-West in Berlin, eight months of history from nearly 2000 measurement locations, each recorded at one value per minute, were selected; this yields approximately 0.7 billion individual data points. After modeling, the output of the function deviated from the real measured output by less than 0.1%. This indicates that the machine learning method is actually capable of finding a good model and also that the recurrent neural network is a good way of representing the model.

The power plant is largely automated and so we considered, for test purposes, only the district heating portion of the plant to be under the influence of the optimization program. The controllable variables are then the flow rate, temperature and pressure of the district heating water at various stages during production. The boundary conditions, or uncontrollable parameters, are provided by the coal quality, the temperature, pressure and humidity of the outside air, the amount of power demanded from the plant, the temperature demanded for the district heating water in the district and the temperature of the cooling water at various points during production.

The model was then inverted for optimization of plant efficiency. The computation was done for the entire available history and it was found that the optimal point deviated from the actually achieved points by 1.1% efficiency in absolute terms. This is a significant saving in coal purchases but mainly a reduction of the CO2 emissions that saves valuable emission certificates. In the analysis, about 800 different operational conditions (in the eight-month history) were identified that the operators would have to react to. This is not practical for the human operator.
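The separation into controllable set-points and fixed boundary conditions can be sketched as follows. The efficiency_model here is an invented stand-in for the learned recurrent-network model, and all ranges and coefficients are illustrative.

```python
# Search only the controllable district-heating variables while the
# boundary conditions stay fixed. efficiency_model is an invented
# stand-in for the learned model; its coefficients are illustrative.

def efficiency_model(flow, temp, pressure, boundary):
    base = boundary["coal_quality"]
    return (base
            - 0.001 * (flow - 300.0) ** 2
            - 0.002 * (temp - 95.0) ** 2
            - 0.5 * (pressure - 8.0) ** 2)

def best_setpoints(boundary):
    """Grid search over flow rate, temperature and pressure of the
    district heating water; boundary conditions are held fixed."""
    best = None
    for flow in range(250, 351, 5):
        for temp in range(85, 106):
            for pressure in (7.0, 7.5, 8.0, 8.5):
                eff = efficiency_model(flow, temp, pressure, boundary)
                if best is None or eff > best[0]:
                    best = (eff, flow, temp, pressure)
    return best
```

Only the three district-heating variables are searched; the boundary dictionary is passed through unchanged, exactly as the uncontrollable parameters are treated in the text.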
The model is capable of determining the current state of the plant, computing the optimal reaction to these conditions and communicating this optimal reaction to the operators. The operators then implement this suggestion and the plant efficiency is monitored. It is found that a 1.1% efficiency increase can be achieved uniformly over the long term. The model can provide this help continuously. As the plant changes, these changes are reflected in the data and the model learns this information continuously. Thus, the model is always current and can always deliver the optimal state.

In daily operations, this means that the operators are given advice whenever the model computes that the optimal point differs from the current point. The operators then have the responsibility to implement the decision or to veto it. Specifically, an example situation may be that the outside air temperature changes during the day due to the sun rising. It could then be efficient to lower the pressure of the district heating water by 0.3 bar. The program would make this suggestion and, after the change is effected, the efficiency increase can be observed.

The main benefits to a power plant are that the model: (1) processes all measured parameters from the plant in real-time, (2) encompasses all interactions between these parameters and their time evolution, (3) provides a uniform and sustainable operational


strategy 24 hours per day and (4) achieves the optimal operational point and thus smooths out variations in human operations.

For those parts of the power plant that are already automated, the model is also valuable. Automation generally functions by humans programming a certain response curve into the controller. This curve is obtained from experience and is generally not optimal. The model can provide an optimal response curve. Based on this, the programming of the automation can be changed and the efficiency increased. The model is thus advantageous for both manual and automated parts.

7.11 Case Study: Reducing the Internal Power Demand of a Power Plant
Co-Author: Timo Zitt, RWE Power AG

A power plant uses some of the electricity it produces for its own operations. In particular, the pumps in the cooling system and the fans in the cooling tower consume significant amounts of electricity. Reducing this internal demand increases the effective efficiency of the power plant. For the particular power plant in question here, we have six pumps (two pumps each with 1100, 200 and 55 kW of power demand) and eight fans with 200 kW of power demand each. The influence we have is the ability to switch any of the pumps and fans on and off as we please, with the restriction that the power plant as a whole must be able to perform its intended function. A further restriction is introduced by allowing a pump to be switched on or off only if it has not been switched in the prior 15 minutes, to prevent too frequent switching.

Five factors define the boundary conditions of the plant: air pressure, air temperature, the amount of available cooling water, and the power produced by each of two gas turbines. These factors are given at any moment in time and cannot be modified by the operator at all.

The definition of the boundary conditions is crucial for optimization. Recall the example of looking for the tallest mountain in a certain region. If the region is Europe, the answer is Mont Blanc; if the region is the world, the answer is Mount Everest. In more detail, we have a set of points (the locations over the whole world) that consist of three values each: latitude, longitude and altitude. Out of these points, we first select those matching the boundary conditions (Europe or the whole world) and then search for the point of highest altitude. In the power plant context, we must also define regions in which we will look for an optimum. We do this by providing each boundary condition dimension (the five above) with a range parameter.
Let us take the example of air temperature and give it a range parameter of 2 degrees Celsius. If we measure an air temperature of 25 °C, we interpret this to mean that we are allowed to look for an optimal point of the function among all those points that have an air temperature in the range [23, 27] °C. As we have five dimensions of boundary conditions, we have to supply five such range parameters.

It is a priori unclear what value to give a range parameter. A typical choice is the standard deviation of the measurement over a long history, which gives the natural variation of that dimension over time. However, we may artificially set it higher because the boundary condition may not be quite so restrictive for the application. In the present case, we have chosen each range parameter to be one standard deviation over a long history. In addition, we have investigated a scenario where the range parameter of the air pressure is two standard deviations because we regard this condition as less important.

In order to build our model, we have access to myriad other variables from within the plant. Thus we determine when we require which pumps and fans to be on in order to run the power plant reliably. This culminates in a recurrent neural network model of the plant, which we find to represent the plant to an accuracy better than 1%. The model is then optimized using the simulated annealing approach to compute the minimal internal power demand at any one time. Operationally, this means that the optimization would recommend turning off a pump or a fan from time to time and, aggregated over the long term, achieve a lower internal power demand for the plant.

The computation was made for the period of one year and it was found that the internal power usage can be reduced by between 6.8% and 9.2% absolute. The two values are due to the two different boundary condition setups. We therefore see that the loosening of restrictions has a significant effect on the potential of the optimization. Please observe the essential conclusion that the parametrization of the problem is very important indeed, both for the quality and the sensibility of the optimization output.
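The range-parameter mechanism can be sketched as a filter on historical operating points, after which the optimum is sought only among the surviving candidates. All values below are invented for illustration.

```python
# Keep only historical operating points whose boundary conditions lie
# within one range parameter of the current measurement. All values
# are invented for illustration.

history = [
    {"air_temp": 24.1, "air_pressure": 1013.0, "internal_demand": 2710.0},
    {"air_temp": 26.5, "air_pressure": 1019.0, "internal_demand": 2650.0},
    {"air_temp": 30.2, "air_pressure": 1008.0, "internal_demand": 2805.0},
]

def matching_points(points, current, ranges):
    """Select points within the range parameter in every
    boundary-condition dimension."""
    return [p for p in points
            if all(abs(p[k] - current[k]) <= ranges[k] for k in ranges)]

current = {"air_temp": 25.0, "air_pressure": 1015.0}
ranges = {"air_temp": 2.0, "air_pressure": 6.0}   # e.g. one std. dev. each

candidates = matching_points(history, current, ranges)
best = min(candidates, key=lambda p: p["internal_demand"])
```

Widening a range parameter, as done for the air pressure in the second scenario above, admits more historical points into the candidate set and therefore a potentially better optimum.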
In conclusion, we observe that the internal demand can be reduced by a substantial margin (6.8% to 9.2%), which will increase the effective efficiency by about 0.05% to 0.06%, given that the internal demand is only about 0.7% of the base output capacity. This is achieved only by turning off a few pumps and fans when they are unnecessary.

Chapter 8

The human aspect in sustainable change and innovation

Author: Andreas Ruff, Elkem Silicon Materials

8.1 Introduction

Imagine an everyday situation when driving to work, except this time your usual route is blocked due to construction work for a couple of weeks. When realizing this, would you not abruptly wake from the mindless automation that literally makes you float to the office? And would you not quickly need to take back control over your turns in order to follow the detour? What distinguishes a rehearsed mental situation from a deliberate one? How can humans actively participate in a change process? Finally, what is needed to sustain that change?

Continue to imagine that, when finally arriving at your desk, your office PC lets you know that your password has expired. Most people, including myself, have a hard time coming up with a new set of letters, characters and numbers. Why is it so hard to think of something new, and why do I struggle to use the new password for weeks? I enter the old password and only when it fails does my brain begin to question my action. Only then will my cognition remind me that I had to change it and that I have a new one. It will take a while until I get used to the new password. But after that, it will become the usual password. The same applies to the example of the roadblock: using the detour for a few weeks, you will sink into the same automation but with a changed route.

It is the power of habit: the things we do regularly are processed in our brains automatically. A habit is acting on an accepted status quo and we tend not to think about it or even question its necessity or validity. To strive for the new is given by our human nature. In order to alter any situation, it takes a lot more than just the intention to change. It requires the will and ability to take on the new status quo and live up to it.
The complexity in remembering the password or the changed route to work lies in the perceived futility of the change and is enhanced by the user's perception that this change does not make life easier, nor does it contain additional value. In fact, the change is an obstacle.

P. Bangert (ed.), Optimization for Industrial Problems, DOI 10.1007/978-3-642-24974-7_8, © Springer-Verlag Berlin Heidelberg 2012



Breaking the habit and accepting the new situation is hard both for someone initiating and managing the change and for the user. Overcoming habits and, especially, changing business procedures depend dramatically on every individual's ability to understand and accept the need for change and to adopt the new practice.

In this text, I will discuss the various aspects concerning the human ability to change and its influence on sustainability. In the past, I kept asking myself how much the sustainability of project work relates to a personal work style and how much is fortunate coincidence. I will summarize some important aspects on how to reduce the coincidence part. Additionally, I will mention some general aspects backed by research and literature. I want to share my personal experience and hope to illustrate the positive experiences that I have made when dealing with people in project teams. I have always used the human aspect in my project work and found that change against inner conviction is grueling and virtually impossible. But how can human aspects help, especially when one is introducing major changes, and what can businesses do in order to make change sustainable? What preconditions are needed to make employees accept change or even help create it? I try to answer what successful change management is and how to gain sustainability. I will focus on the human perception of change and on how the organizational set-up and a manager's personality can support sustainability in change.

All ideas, suggestions and examples derive from my experience in the chemical industry but I believe that they apply to any other business. My proposals and visions should be seen as recommendations. I would like to give you ideas and food for thought on the value of the human aspect in order to increase your personal satisfaction and efficiency when making use of it. The terms sustainability and change are not contradictory.
The status of any situation between two deliberate changes should ideally be sustainable.

8.1.1 Defining the terms: idea, innovation, and change

Thomas Edison, the doyen of innovation and creativity, once said that the process of invention takes 10 seconds of inspiration and 10 years of transpiration. In order to understand the process of creativity better, a short differentiation of the terms used is useful.

The first step in a creative process is the idea. It can be described as what is before we think. It is a mental activity, the interaction of neurons and synapses resulting in an electric impulse comparable to a flash of light. This bioelectric interaction is able to create visions of objects or imagined solutions in our cognition. The brain is structured such that most ideas come up when the thinker focuses on something completely different or even sleeps. There are famous examples where scientists had worked intensively and with concentration to solve a problem, but it was during relaxation that they imagined the solution. Kekulé, working on the structural form of benzene in 1865, dreamt about a snake chasing its tail. Researchers like McCarley (1982)


attribute this to the absence of the catecholamines adrenalin and noradrenalin, resulting in reduced cortical arousal. The presence of catecholamines is related to the size of the neuronal networks and therefore depresses the individual's cognitive flexibility [74]. Emotional distance and a let-go attitude are very important when solving a problem. Ramón y Cajal (1999) mentions this in his book Advice for a Young Investigator. There he describes the "flower of truth, whose calyx usually opens after a long and profound sleep at dawn, in those placid hours of the morning that ... are especially favorable for discovery" [74].

Philosophers consider the ability to generate and understand ideas as the core aspect of individualism and of the human being as a whole. Ideas are not isolated; they grow and improve when shared and discussed with others. The first idea is not necessarily the final solution, but it is the starting point of a thought's long journey to realization.

The bridge between the idea and realization is called creative innovation. The thought has to fall on fruitful soil or, as Heilman calls it, the prepared mind [74]. It is obvious that the precondition for creativity is an open mindset and the ability to think outside conventional barriers. The prepared mind is able to look at the same situation from different angles. The open mind is not limited in its imagination and it keeps asking "what if?" Kekulé and his problem of structuring benzene can illustrate this concept. His dream about the snake chasing its own tail merely induced the idea of a ring structure. The innovation lies in the acceptance of the idea and the insight that this is a possible solution. Had Kekulé discarded the idea of a ring-shaped molecule because it just cannot be, he would probably have failed to solve the problem forever. The ability to "understand, develop and express in a systematic fashion" is the foundation of creative innovation [74].
Neither special skills nor an increased IQ are required to be creative or innovative. Coming back to Edison's quote, the creative phase takes seconds, but it may take years to establish the novelty successfully in practice. The goal of innovation is positive change to the status quo, or simply to make something better. Innovation leading to increased productivity is the fundamental source of increasing wealth in an economy [129]. Innovation can be described as an act of succeeding to establish something new. Success in that respect means to eventually make an idea come alive. The novelty can be a service, product or organization. Once it is launched, there is no guarantee of success in the market. This can be seen when Amabile et al. (1996) propose [6]:

All innovation begins with creative ideas. We define innovation as the successful implementation of creative ideas within an organization. In that view, creativity of individuals and teams is a starting point for innovation; the first is a necessary but not a sufficient condition for the second.

In order to be innovative one needs more than just a plain creative idea or insight. The insight must be put into action to make a genuine difference. For example, it could result in a new or altered business process within an organization or it could create or improve products, processes or services. Sometimes creative people have taken on existing ideas or concepts making use of individualization in combination with hard work and luck, making the replicated idea even more successful. Neither creativity nor an innovative mind can grant sustainable economic success.


In 1921, J. Walter Anderson together with Edgar Waldo Ingram founded White Castle in Wichita, Kansas. White Castle quickly became America's first fast-food hamburger chain and satisfied customers with a standardized look, menu and service. In 1931, they had the idea of producing frozen hamburgers and they were the first to use advertisements to sell their burgers [132]. White Castle's success inspired many imitators and so, in 1937, the brothers Richard and Maurice McDonald opened a drive-in in Arcadia, California. First selling hot dogs and orange juice, they quickly added hamburgers to the menu [130]. Today, White Castle sells more than 500 million burgers a year [131]. That sounds like a lot but, compared to the market leader McDonald's, it is less than 1%. Surely White Castle was the first and is still in operation, but it is far from the success of others. The success of an innovation depends on several factors such as market conditions, customer demand and expectation. But more than that, a good portion of luck is required: being at the right place at the right time. That is what Edison meant when intimating that successful innovation is most of all hard work and transpiration.

This chapter is not about business plans, nor will it advise the reader how to plan a successful business. I want to gain insight into what preconditions are needed to implement change successfully and, most of all, how to make it sustainable. The word "change" describes a transactional move from stage A to stage B. It is not necessarily true that the new stage is better than the original, even if the intentions were good. A software update, for example, can put the user in a position of not knowing where to find buttons and features. The changed look leads to an uncomfortable feeling until the user gets to know the program better. This is similar to the above-mentioned examples of the new password or the roadblock on the way to work.
The software developer's intention, of course, is to win over the user with new tools and an improved appearance. Change is perceived differently depending on the individual's situation. It is up to the user to experience the change as a chance or a challenge.

Those of us who have had the chance to manage change, or to work in teams in order to implement change, have experienced that a substantial amount of money is spent to introduce advanced software, restructure organizations or improve processes or products. With all this cost and effort, how and why does the improvement slowly regress when the project team moves away? It is the human aspect that needs to be considered and remembered right from the starting point. It determines the sustainability of the change process. Enforced change is unlikely to be sustainable. As modern organizations need to be able to adapt quickly, the human aspect needs to play a central role in any change process.

8.1.2 Resistance to change

Most people are attracted by novelty. Especially when it comes to consumer electronics, thousands stroll through trade shows or wait in front of retail stores to get the latest products. To be equipped with up-to-date fashion and to be trendy defines our status in modern societies. The speed of change is enormous. Companies


are subject to various constraints in keeping up with the need to create new products and services at satisfactory prices. The challenges are to improve technology and market share, and to stay (or become) competitive. The globalized world demands mental flexibility and the passion to take on any new situation. Yes, we grow with our responsibilities and we have to try hard to live up to the task. But too many businesses exert such pressure and expect more of the employees than they can handle. The modern (especially western) business world generates more and more mental illness. The number one reason for sick leave is mental overload. The managerial challenge is to balance the need for change with the duty of care towards the employees.

Businesses need continuous improvement and enduring change in order to survive and to continue engagement. Employees are required to contribute for the benefit of competitiveness, job security and growth. Studies done by Waddell and Sohal in the United Kingdom and Australia show that resistance is the biggest hurdle in the implementation of modern production management methods [126]. Interestingly, this resistance comes equally from the management and the workers. They also found that most managers and business leaders perceive resistance negatively. Historically, resistance is seen as an expression of divergent opinions, and good change management is often associated with little or no resistance. That implies that well-managed change generates no resistance. Here the question has to be answered: what comes first, well-managed change or properly handled resistive forces?

Enforced change, with the main focus on the technical aspects of change, seems to be widely accepted by management. Action plans are worked off and reviewed, but the individual's needs are often overlooked. It does not matter what is going to be changed; the reaction mechanisms of humans are similar and have to be taken into consideration.
Resistance is a reaction to a transition from the known to the unknown [49]. Change is a natural attitude, but only if the initiative for this transition derives from the individual itself. If the initiative for alteration is pushed from the outside, then the change process needs to be well attuned. Every individual confronted with change undergoes the following phases: initial denial, resistance, gradual exploration and eventually commitment. This is a well-developed natural habit of defending ourselves.

Especially when companies execute major business decisions (e.g. organizational rationalization by software implementation), resistance results from the individual's anticipated personal impact of the upcoming change. Humans immediately see themselves confronted with an uncertain future. Past experiences, combined with what we have heard (from relatives who have been in a similar situation or the media reporting about others), escalate to existential fear and questions arise such as "will I lose my job?" and "do I have to sell my house?" Whether this anxiety is imaginary or real, the physiological response is the same: stress. The negative, irrational emotion represses any logical aspect affiliated with the intention and we divert all energy to defending the status quo rather than to the task at hand. It is an ancient defense mechanism from deep inside that hinders us from considering change rationally, adopting it or even helping to shape it.

Thus, resistance is often seen as an objection to change. But is that really the case? If resistance is negative, how can we turn it into something useful to enable change? First of all, resistance and


anxiety are important human factors in any undertaking. Data provided by Waddell and Sohal indicate that humans are not against change for the sake of being against it. In many observed cases, resistance occurs when those who resist simply lack the necessary understanding. They therefore have a negative expectation of the upcoming effect of the change. A major organizational restructuring or a local implementation of a software tool for process improvement will be seen (by some) as an assault. Preparatory measures should be used not only during the implementation but also as a structural tool. It is important to recognize and analyze resistance in order to make use of it. Managers and organizations should exploit resistance and engage it with participative techniques. Participation in that respect means more than just giving regular status updates. Many companies inform their employees regularly about the status of an announced project. This is a one-way communication model, and the problem is that, in many cases, questions from those involved are answered inadequately or not at all. Bidirectional information and communication is a critical tool to create well-conceived solutions and to avoid misunderstandings or misinterpretations. After the intention to go through a change process is announced by top management, the middle management needs to communicate detailed information as soon as possible. Regular personal meetings between decision makers and their direct reports enable a consistent flow of (the same) information from the top down. Teams and departments should sit with their direct management and discuss actions, explain project milestones and intended results. The lower and middle management should provide platforms for addressing questions. This will keep up the communication even if the local leadership is not yet permitted to explain the exact plans and details. Most of us have experienced situations where we lived through unconscious fear.
It helps if one can express this worry and gets the sense that the fear is taken seriously. It is obvious that this cannot be done in a works meeting, but has to be done in small groups or even in one-to-one conversations. Certainly this is a big effort and (if done right) it takes a lot of working time from the leadership and the workforce, but it will pay back when the effectiveness of the implementation is measured. The organization needs to be set up to be ready for change. Thus, the organization should focus less on technical details and instead prepare psychologically and emotionally. It is mandatory for the management to define the goal and to set the expectations, but it should also leave room for individualism and pluralism. The initial idea is not always the best. Leaders must learn to focus on the result rather than on every individual detail. The organization has to be constituted so that resistance can be articulated and dealt with. In return, every individual needs to be confident that the managers are honestly willing to listen and communicate. Therefore the entire organization needs to be set up to ensure communication and a flow of ideas and suggestions, no matter from which level of the hierarchy they originate. This requires managers at every level of the entity to remember their vested duty to lead and manage people. The entity's organizational structure determines its ability to adapt to innovation.


8.2 Interface Management

8.2.1 The Deliberate Organization

Organizations depend on their ability to adapt to change and to make it sustainable. The following describes a desirable but fictional stage. To claim that all of the following is doable appears rather unrealistic. But I do claim that most companies have issues with sustainable change management, and that this is due to neglecting the human aspect. As little as resistance is seen as positive in real companies, it still is a desired and necessary process for sustainable change. Furthermore, the proper definition and scrutiny of functional interfaces is the key to this process. Whenever peers have to work together, the efficiency of this interaction depends on how well each individual knows what to do and how to do it. If there is an overlap of authority or an improper definition of roles and responsibilities, the involved employees spend a great deal of time and energy organizing themselves. Some organizations even believe that employees will solve this conflict in the best interest of the business. The opposite is true. It is in the nature of humans to try to get the interesting or highly appreciated duties and to leave those that require a lot of work or are less esteemed to others. Instead of a cooperative atmosphere, a power struggle will eventually leave few winners and many defeated. Those who cannot or do not want to keep up with this fight might eventually leave the company. In addition, it is not necessarily true that the ones who emerge victorious are the better leaders or managers. It is obvious that roles and responsibilities have to be defined. Reality proves that grey zones exist in many organizations. An organizational interface requires several aspects to be defined, executed and controlled. First of all, it has to be defined which roles have a share in the interface and who controls and monitors it.
Consider the following example from my experience in purchasing raw materials. In order to buy the right amount and to set up realistic delivery plans, sourcing needs to cooperate with manufacturing, quality control and perhaps even research and development (R&D) as well as logistics. Depending on the company and its setup, the finance department can be involved to contribute to payment terms, agree to letters of credit and manage cash flow. This is an obvious exercise, and it sounds simple when taking the concept of value chains into consideration. If it is crucial to deliver high-quality products in reasonable time at competitive cost, all stages in the value chain have to receive the right material at the right time. The smooth execution of customer orders requires everybody involved to focus on the same target: total customer satisfaction. Contrary to this concept, most companies have introduced individual Key Performance Indicators (KPIs) for every department separately. Who could blame the manufacturing manager for demanding only top-class raw material or the most reliable supplier, if his task is to produce "just in time", reduce the reject rate and keep inventory levels low? The manufacturing manager will simply not care about the cost of goods sourced


because it plays absolutely no role in his list of performance indicators. Assuming the sourcing manager is held responsible for price reductions and availability of material, are these targets not conflicting? This fictional company will quickly find itself in the position that everybody is pulling the same rope, but in different directions. Who would win the battle, especially if personal bonus payments depend on the level of goal achievement? Furthermore, a proper interface definition also looks at the acting persons (defined as names or job functions) and defines who exactly has to take decisions. The principle of four-eye decisions is expanded to six or more eyes. In the context of the procurement example, the manufacturing, R&D and quality leaders would all have to agree to the proposal made by sourcing. They could use evaluation forms and would discuss all pros and cons to finally sign off on the ultimate decision. In this way, they share the risk of failure (instead of a single person taking the decision and being held responsible) and also improve communication. Throughout the process the team shares the need for transparent decision-making and traceable accountability. This little example shows how important it is to manage any interface and to define all incoming parameters as well as the outcome specification. The better everybody understands the interface's definition and the individual's share in it, the more efficient it is. The critical definitions have to come from the management, and it is within the responsibility of every executive leader to maintain the defined balance in each interface by frequently reviewing its overall performance. The benefit of well-defined, established and managed interfaces is that the need for change is easily detectable and its effect can be simply measured. Any diffuse organization with uncertain responsibilities inevitably leads to mismanaged situations.
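The KPI conflict in the procurement example can be made concrete with a tiny sketch. The supplier names and numbers below are invented for illustration only; the point is that when each department optimizes its own KPI over the same decision, the optima disagree.

```python
# Illustrative only: supplier data is invented, not taken from the text.
# Sourcing and manufacturing optimize different KPIs over the same
# decision variable (which supplier to use).
suppliers = {
    "budget":  {"price": 80,  "reject_rate": 0.08},
    "mid":     {"price": 100, "reject_rate": 0.03},
    "premium": {"price": 140, "reject_rate": 0.005},
}

# Sourcing's KPI: minimize the purchase price.
best_for_sourcing = min(suppliers, key=lambda s: suppliers[s]["price"])
# Manufacturing's KPI: minimize the reject rate.
best_for_manufacturing = min(suppliers, key=lambda s: suppliers[s]["reject_rate"])

print(best_for_sourcing, best_for_manufacturing)  # budget premium
```

Each manager's individually rational choice is different, which is exactly the "same rope, different directions" situation described above; only a shared objective (or a shared interface definition) resolves it.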
The task of executing change is either hard to assign, or it is given to someone who has to fight it through against the resistance of his colleagues. They will claim that the executing individual has neither the authority nor the power to implement any action. The battle for competence and power will make the implementation extremely challenging. The desired effect will finally fizzle out. The well-defined interface, in contrast, will remain well defined because the required change is assigned to the person in charge. Together with the necessary resources and the helping hands of all involved, change can be implemented quickly and effectively. A famous example is the work organization of the Toyota Motor Company's assembly teams. All work tasks are clearly regulated, described and monitored. Expectations are defined and the work outcome is permanently controlled. The individual is held responsible for his work and quality. If workers experience a problem, they pull a cord, and co-workers from the previous and later work steps come together to discuss the issue, identify the root cause and the location of its appearance, as well as to find and implement a solution. Later, they monitor whether the implemented change is sufficient and track whether the problem is solved for good. If they cannot find or agree on a solution, a member of the upper hierarchy has to be informed according to defined levels of escalation and will support the immediate problem-solving process or take other decisions.
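The escalation logic just described can be sketched as a simple loop. This is purely illustrative: the handler names and the interface are invented for the example and do not describe any actual Toyota procedure.

```python
# Illustrative sketch: a problem is handled at the lowest hierarchic
# level that has the authority to solve it; otherwise it moves up one
# level, until someone solves it.

def escalate(problem, levels):
    """Walk up the hierarchy until some level can solve the problem.

    `levels` is ordered lowest first (the work team); each entry is a
    handler returning a solution, or None if it lacks the authority.
    """
    for handler in levels:
        solution = handler(problem)
        if solution is not None:
            return solution  # the line restarts once the fix is in place
    raise RuntimeError("problem exceeded all defined escalation levels")

# Hypothetical run: the team cannot fix it, the shift leader can.
team = lambda p: None
shift_leader = lambda p: "fix for " + p
print(escalate("misaligned part", [team, shift_leader]))  # fix for misaligned part
```

The guarantee in the text — "continuing the escalation until the problem is solved" — corresponds to the loop visiting every level before giving up.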


8.2.2 The Healthy Organization

Let us assume that an enterprise defines, manages and controls its interfaces, all roles and all responsibilities. The resistance throughout the organization is still unmanaged and uncontrolled. Not knowing the reasons for change initiates resistance. In some cases this is aligned with the unwillingness of the leaders to exchange opinions. Maurer looks at resistance as a force that keeps us from attaching ourselves to every crazy idea that comes along. If an organization cultivates resistance and sensitizes itself to human nature, it will produce better ideas and a faster turnaround of change. A necessary prerequisite is a certain type of manager, one able to change his attitude towards resistant employees and to develop a healthy organization. Such an organization empowers managers to resist ideas originating at higher hierarchic levels. Every employee openly shares ideas and participates in a culture of discussion. A healthy organization is willing and empowered to improve initial ideas and shift opinions. Being healthy implies having critical thinkers identified, accepted and involved at every level of the hierarchy. The appreciation of feedback and the willingness to shift opinion based on reasons need to be practiced top down and should be lived as a business philosophy. Project teams must be put together based on both their critical capacity and their professional experience. The more diversified a group is, the more facets are reflected and the more options are considered. It is critical for any environment to have strong characters with different opinions, whether it is a team, an enterprise organization or a group of peers. It is to the benefit of the leadership (at every level) to cultivate at least a few dissidents rather than to surround oneself only with those who just say what one wants to hear.
If resistance is used as a productive tool, it enables a positive environment of trust and honesty, with an open dialog between all levels of the hierarchy. In such an environment, alternatives can be considered carefully and thorough discussions can evaluate every option. It is the duty of an organization's leadership to carefully select every member of the structure. Factors like specialist knowledge, leadership skills and personality play a major role when selecting the appropriate candidates. A healthy leadership has the guts to selectively pick those who are uneasy, unconventional and dare to speak up. It is in every manager's hands to choose the appropriate team members – the healthy way is definitely stony and requires more effort from the beginning. I have chosen this way many times now, and I have always been rewarded with excellent feedback, a proactive flow of information and an extraordinary team spirit. All this contributes to much more than just implementing and sustaining change. It is of the highest importance to be aware that a manager is not responsible for his direct reports only. The managerial duties also apply to the levels below; thus it is in every manager's interest to have close communication beyond the direct reporting line. This ensures the translation of business visions from one level to the other and their proper explanation. At the same time it guarantees that everybody is and stays focused on that same target. In well-functioning organizations, the vision is clear and broken down so that the important parts get executed where needed. Many companies suffer from over-communication. Visions, missions and updates are sent weekly if not daily, and those who actually execute on these missions are not capable (due to IT access or


language barriers) of understanding the message. Again, a good example of doing it right is Toyota. They have their targets visually broken down so that every employee can easily see, for example, how many cars need to leave the factory per day to meet the monthly promise, and they get their personal goals aligned. Especially micro-managing organizations need to reconsider their position. As Antoine de Saint-Exupéry states: "If you want to build a ship, don't herd people together to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea." It cannot be overstated how important the information flow within an organization is. Sharing the vision means providing the right parts of it and transforming the vision into executable and measurable tasks. A proper interpretation of the vision has to be communicated through the hierarchy with a clear and understandable obligation. From the top down, managers have to sit down with their staff, explain the part of the vision they own and then break this part up into workable portions. These parts of the vision get assigned to the manager's direct reports. The latter then do the same with their teams. In this way, the organization is truly focused on the common target, and individual interpretation and/or cherry-picking is barred. The proper break-up and exact goal setting should be controlled over (at least) two hierarchic levels. A manager needs to set the goals for his direct reports, and the manager's superior controls that the set targets truly match the overall vision. The vision itself is ideally defined long term. When breaking the vision down into manageable actions, take the time horizon into consideration. There are decisions made today with an immediate impact, and there are actions paying off tomorrow. The leadership role of the top management is by definition oriented towards the long-term future.
Visions of growth, EBITDA margin and profitability surely need to be transformed into actions on a time scale of months or years. This is the leading part of the organization. However, a worker on the assembly line needs to get his tasks on a more compressed time horizon. The vision of job safety, salary increases and promotion perspectives can be achieved by every day's performance. Executable tasks need to be laid out in hours or shifts. This is the acting part of the hierarchy, and its targets need to be monitored on the short term. Led by the management, the vision gets executed by the shop-floor personnel. The responsibility is spread equally over the hierarchy. One could not succeed without the other – leaders need actors and vice versa. The organization needs to be perceived like a Swiss watch: every cogwheel is equally important, no matter what size. The real difference is in the relative ratio of leading and acting. This ratio varies, and naturally it is at 100% lead at the top of the hierarchic pyramid and at 100% act at the bottom. Draw an imaginary line through the hierarchy at the level where act and lead take the same share. Anyone below that level is likely to be more on the workers' side. This fictional border is extremely important: it is this hierarchic level that is the most important interface. It should be seen as the front line in communication and needs to be supported in an extraordinary way. Those employees are the ones that will receive their part of the vision in a more lead-oriented form and will need to pass it on with a strong focus on short-term execution (act). They need to transform the business vision into something understandable, no matter what


language, cultural context or expectation. The front line in the organization is where the worries of team members get shared. At the same time, those working at this front line stand in for the top management's decisions on a day-to-day basis. Those at the front line need to be embedded properly in the decision process, or at least be well prepared and receive all information and training up front. Imagine a shift leader on night shift being responsible for a handful of workers with different backgrounds and histories. Today, a certain change in the business strategy gets announced via e-mail by the management. This shift leader is the single point of contact that every shift member will turn to. If the shift leader's supervisor did not prepare him with advance information or supply him with answers to the most likely questions, how can one believe that the shift will continue to focus on the job? Everyone on the shift will spend a great deal of time on recurring questions like "does this affect me and my family?" This time is costly and must be avoided. I am not claiming that the less sophisticated are not capable of comprehending financial data or business numbers – the opposite is true! But remember that a rational view of the uncertain new situation might be dismissed if the change affects oneself. It would be better to hand out individualized communication packages with background information and a few answers to likely questions. This information, sent out shortly before or even with the announcement and to all relevant functions, is necessary, especially in times of major business change. If done right, the abstract decision is explained and can be discussed. Concerns may be addressed, and the focus will likely remain on the work rather than on something else. Depending on the impact of the change, the team coming together has to be well prepared. Key personnel need frequent training in communication and in how to handle such an extraordinary situation.
Most change does not come as a surprise overnight, and communication can be prepared. The official announcement is only the tip of the iceberg. Preparing the key information multipliers with the required information is mandatory for sustainable change. Town-hall meetings with all employees are good for maintaining communication regarding the change process, but only a few will have the courage to express personal fears or raise questions in public. It is best to break up the enlarged meeting into work teams and further explain the situation in smaller groups. This chapter is not about leadership or professional management, and there are many other articles, publications and books available. Nevertheless, one aspect seems important to me when discussing communication and organizational responsibility. It is a question about leadership itself: how many employees can possibly be led effectively by one person? In times when efficiency and productivity are the main drivers for business decisions, many (especially larger) companies tend to merge departments and divisions. In order to make the organization financially more efficient, groups and independent teams are put together under just one leader. The number of direct and indirect reports keeps growing continuously. To me, the maximum number of direct reports should not exceed 5-7. Following the thought path of effective vision communication, any manager is liable for at least the second hierarchic level below himself. For example, a manager with five direct reports would need to work intensively with all 30 (25 + 5) individuals. If one takes the leadership role seriously, this consumes a great amount of the daily


working time. On the other hand, the fewer hierarchic levels exist, the more day-to-day business inevitably ends up on the manager's desk. At a certain point, leadership time gets crowded out by the excessive time that functional involvement requires. To me, that stage has been reached in the majority of companies, and it is an organizational disaster. It is not my intention to judge but to alert. Any organization decides how much a manager is a people leader or a functional worker. But my deepest belief is that anything above 40% functional work will come at the cost of effective leadership and good communication. Time is the resource of today and especially of the future. It will not take long until (especially) highly educated professionals will ask for more balance between job hours and private time, thus for a better work-life balance. There is, and will continue to be, a trend towards flexible and effective working with new work structures such as home office, job sharing, etc. Time already is, and will certainly become even more, a satisfying form of payment rather than pure monetary salary. Modern organizations will have to react to the fact that more managers (male and female) value family time over career. Any organization's challenge for the upcoming years will be to make use of the existing resources as effectively as possible. The Lean principle, with its concept of labor efficiency, is far from able to model this. People management and modern leadership will have to focus much more on the individual's involvement in e.g. teamwork, projects or cross-functional problem solving. Thus the manager shall have more time to identify and promote talents and, at the same time, encourage those who are less motivated. Dealing with the latter is extremely time-consuming but necessary, if taken seriously. Assume a manager takes the time and spends it on a one-to-one conversation with a critical individual.
Then the individual's supervisor needs to be coached and instructed as well. All this requires at least three people's working time. The alternative is to do nothing, but this will consequently de-motivate the entire group. Assuming that there is a fixed proportion of less motivated people in every organization, enlarged teams are not necessarily easier to manage, especially as hierarchic levels might have been eliminated. Following the idea of sustainable change, the new manager will have to focus on all individuals, ultimately identifying their talents and level of motivation, working through communication and structural issues and finally building a new team spirit. How can this person possibly find the time to work on the newly assigned job tasks?! This conflict gets even worse if those who have the most professional qualification are promoted to manage the merged department, and not the ones with excellent leadership skills. The number of professional and leadership tasks suddenly overloads the new manager, and there is a chance that the individual will get sick or leave the company. The worst case is if the manager decides to set priorities towards less people management. Then the frustration in the team increases, with similar consequences for the employees' health and the turnover rate, but multiplied by the number of reports. Leadership and people management cannot be learned as easily as professional skills. The higher the rank in the hierarchy, the less important professional skills are and the more important leadership and emotional intelligence become.
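The span-of-control arithmetic above (five direct reports implying 25 + 5 = 30 individuals across two levels) generalizes simply to n + n². A minimal check, assuming a uniform reporting tree, with the 5-7 range taken from the text:

```python
# People across the two hierarchic levels below one manager: his n
# direct reports plus their reports (assumed n each, a uniform tree).

def span_of_control(n_direct):
    """Return n + n*n, the two-level span of control."""
    return n_direct + n_direct * n_direct

print(span_of_control(5))  # 30, the 25 + 5 from the text
print(span_of_control(7))  # 56, at the upper end of the suggested 5-7 range
```

The quadratic growth is the point: adding just two direct reports nearly doubles the number of individuals the manager is liable for.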


Healthy organizations consist of smaller, highly specialized teams supervised by those who continuously prove to be motivators and enablers. The base of the healthy organization is the self-motivated worker who shares the management's vision and sees a clear perspective for himself. The worker's talents need to be discovered and his resources should be managed properly. It is the managerial task of any supervisor or leader to bring out the best in their teams. The hierarchic pyramid needs to be turned upside down, putting the worker and his individualized work environment in focus. Those directly involved in the value chain need to be supported most. The above-described front line is what needs the highest attention, as this is where the managerial vision is practically put into action. Here is the interface that makes businesses successful. The entire remaining organization is supportive only. Empowering the organization means developing manageable structures with clearly defined interfaces, responsibilities and authorities. A healthy organization is equivalent to hierarchic emancipation.

8.3 Innovation Management

Two issues of the highest importance are long-term business existence and consistent success. Both rely on continuous innovation. As shown above, innovation is not being the first but being better than the others. Google started with an innovative algorithm to search the Internet, but they did not invent the search function itself. They came into business by advancing existing technology. Only later in their history did they introduce revolutionary technology like Google Earth, the virtual Street View, or a free online counterpart of Microsoft's Office software package. Furthermore, they were the second (after Apple) to develop their own smartphone. Innovation is Google's backbone, and each employee is expected to spend 20% of their time on non-job-related issues. Google's success is based on diversified employees and the trust that freedom of thought will result in creative ideas. As much as I support this concept, it is hardly applicable for those in traditional businesses with non-virtual products and rather high manpower costs. Those who operate in the real world need more basic tools to generate and implement innovation successfully. The recently introduced industrial operational excellence programs use Lean, Six Sigma and other problem-solving techniques. Lean relates to the Toyota Production System, and Six Sigma is a statistical program that originated at Motorola in the 1980s. Both systems are common answers to the same problem: structural operational improvement. In many organizations the two are implemented together. Lean focuses strongly on the executing workforce and its ability to prevent failure immediately. In Japan, detecting defects is not viewed negatively or, to phrase it differently: the Japanese control the process so as to detect the issue as soon as possible. This, in combination with a culture of deference to authority, worked well for Toyota.
In Japan, no one would change work procedures or dare to question the importance of details. Only if there is a problem is the belt stopped. In the event of a shutdown, the team gathers and discusses the issue. Together they find a solution


and implement corrective actions immediately. If the issue is outside the team's authority or cannot be fixed within a certain, well-defined time frame, the problem gets escalated to the next hierarchic level. That level is equipped with more authority and might have the power to either empower the team or call for help. Continuing the escalation until the problem is solved guarantees that eventually something will be done about it. The belt stands still until the solution is implemented. It is a common aim to restart the belt as quickly as possible, but also to strive for perfection even if that ties up resources. This is the opposite of most western civilizations, where individuals tend to hide or cover up their mistakes. We tend to believe that errors will either not matter or that someone else will fix them later anyway. The culture is fundamentally different, especially in the individual's identification with the product. While most western employees work for money, the Japanese workplace is closer to a second family. I am neither glorifying the one nor criticizing the other – it is important to understand that a program is based on culture and cannot easily be copied to an organization with a different culture. The individualized western civilization requires corrections to the program. Pulling Lean, as practiced in Japan, over a central European work environment will fail. There are multiple examples of companies being successful with Lean and Six Sigma. Managing innovation is doable with the various programs and systems. In many companies, the traditional organizational structure has been adjusted to match the newly introduced improvement programs. Departments with specialists and expert know-how are introduced and, in many cases, the talented employees have been moved there from elsewhere in the hierarchy. The introduced program creates localized innovation, while the entire surrounding organization merely reacts to the input.
Hot spots of innovation and creativity are not enough to inspire an entire business. The innovation must come from within the organization itself and must derive from the deepest wish, present in every human, to create and innovate. Everyone should have the same opportunity to put innovations and improvements into practice. Exclusive programs will not succeed in sustaining the change they implement. Everybody is innovative. Inventions are as old as mankind, and it is not surprising that people continue to invent even today. Thus, making use of ideas and developing creative solutions has become more vital for economic organizations than ever. The preconditions for innovation have been adequately discussed, but the question remains how to bring the best ideas forward and how organizational structures can support this. Operational Excellence continues its triumph in almost every enterprise. The larger businesses allow themselves the luxury of allocating resources to new departments or structures to implement and execute the systematic improvement. The smaller companies introduce those excellence systems within their existing structures. Lean, Six Sigma, production systems and innovation programs are introduced to ensure the implementation of innovation management and systematic improvement. Black Belts are put almost everywhere with the clear objective to discover and eliminate waste. Statistical tools are installed, training sessions are held and the company policy gets adjusted to the new philosophy. Consequently, there is a lot of change in the organization: new faces are employed with new, sophisticated titles (many of them in English) to execute on those fancily named programs. Imagine being a worker at


shop-floor level, facing the structural shift and discovering the situation. You would be challenged with a double impact: (1) improvement projects will unforeseeably change your work environment and (2) you will have to accommodate to a diversified organization with additional reporting and responsibility structures. The individual's position becomes more diffuse and established organizational structures change. This particular change is driven top down and is littered with terms that most workers do not comprehend. The acting participants are different from those who have to live with the alteration. How can one believe this will work without friction, or at least without some resistance? The situation gets even worse considering the fact that the average retention time of most improvement leaders (e.g. Black Belts) is a little over two years. The duration of stay is influenced by the project's size, its importance and the company's attitude towards its own program. Would it not be silly to promote a project manager who has just familiarized himself with the team, figured out shift structures or even come close to the problem's root cause? Would you not agree that it could easily take up to two years to discover all this? I confess to being rather critical of these programs and their improvement leaders. For over 100 years, the industry in central Europe did not know these highly specialized and centralized functions. To me this is a consequence of the extreme reduction of manpower. Those who worked in an area for a long time and were highly experienced did not need intensive statistics or other tools to discover fundamental issues. I am not claiming that analytical (especially statistical) know-how can be replaced by experience, but I strongly suggest combining the two.
The young, inexperienced professional with his special knowledge and technical ability is in a much better position when introduced (not only to peers and colleagues) by an older employee who is well respected and familiar with the situation. I predict faster and more sustainable results if the generations work together and combine their individual strengths. Team success strongly depends on the characters in the game. Successfully implementing improvement starts with preparing the organization for the improvement program itself. Thus, organizational development depends on the ability to think critically and on respect for age and experience. Neither can the program replace experience, nor can we achieve profound effects without modern analytical tools. Sustainable process change and significant improvement cannot be prescribed. Many things need to fall into place to sustain change. If statistical figures, graphs and key metrics are presented to management, the viewers need to fully comprehend what is shown to them, and they need to grasp what consequences might come along with the improvement. The biggest issue with improvement projects (especially of complex processes) is that an improvement here can have a major impact elsewhere. Amendments should be real, not a statistical fake or an imaginary effect on colorful slides. As soon as the number of improvements or the speed of their implementation enables career opportunities, the chance of improper project management with short-term effects increases. Individual promotion should be linked not only to the quantity of improvements but mainly to the projects' quality. Responsible, caring program managers will always ask for the voice of the customer when evaluating a project's success. The customer, in my eyes, is by definition the one who needs to operate and maintain the improvement later on. This is unlikely to be a manager or supervisor, but certainly the person dealing with the change afterwards. It only takes moments to judge the quality of the implementation, the documentation and the users' involvement during planning, execution and training. In a healthy organization, poorly managed change is identified immediately. Consequently, it gets attacked and stopped. Criticality and resistance are important factors: they prevent us from making stupid mistakes or repeating the same mistake. Healthy, in that respect, also implies that management takes the resistance seriously and values it. At the same time, communication is open and everyone strives for necessary change. A healthy organization will be able to adopt any trend quickly but, at the same time, will wisely adapt it to meet its needs.

8.4 Handling the Human Aspect

As outlined earlier, there are multiple aspects where the human interface impacts sustainable change. Many of the company-wide programs described above contain change management within the improvement project phases. Unfortunately, none of them really looks into the human sensitivities within the change. University curricula do not prepare students for these challenges, nor are they covered in management or leadership seminars. Any new leader faces this fundamental issue when promoted into a leadership position. Together with the new professional task comes the responsibility to lead a team. The challenge of leading is accompanied by the struggle to compete with colleagues and other departments. The issue with this overload lies in poorly defined roles and responsibilities and in the improper preparation of the individual. This ignorance of proper leadership is carried from one level to the next, and sooner or later the entire organization lacks human interface management. Some leaders do learn from experience. Throughout our professional lives we experience managers with positive as well as negative leadership attitudes. It is important to remember that nobody is perfect, and it is of highest importance to take on some of the positive abilities presented by managers and peers. The principle of successful lead-management is: treat others as you want to be treated yourself. Good and pure people management is not easy, and a lack of time for leadership induces most interpersonal issues in companies. Effective tools to manage the human aspect in sustainable change are desirable. Their effectiveness strongly depends on the overall company's attitude towards this aspect and on every individual's intentions. Each of the following topics is an important factor by itself.
But, when combined into a leadership vision that is lived up to, they will increase engagement, motivation, identification and, finally, productivity and revenue. Focus on the human aspect is not only a matter of sustainable change but also a chance for sustainable business success. The following suggestions are neither ranked nor a guarantee of success. They are meant to add some practicality to the paragraphs above.

8.4.1 Communication

Team meetings should be held (at least bi-monthly) with a clear, pre-communicated agenda and an open space for critique and suggestions. The team should be selected by function, such as shift leaders or regional marketing managers. At recurring events, an extended team, including e.g. the deputy shift leaders or the most experienced marketing manager, should get an opportunity to be heard. They also need to have their share in hot-topic discussions and must have a chance to express their opinions. Especially in times of change, listening frequently to those affected can reduce tension and resistance. Remember that those who resist are not necessarily against the change. Change pressed through by tension and force is likely to increase the resistance. Team meetings are excellent tools to develop a common understanding and to communicate the status quo. Sharing information with the team enables a broad discussion and allows everybody to participate. The major disadvantage of multi-person meetings is that those who gain a major share of the discussion are those who would probably have expressed their opinion anyway. You need to reach those who are quiet and meet them in personal communications.

One-to-one meetings should be held frequently with the leaders, but also with those known to be the unofficial leaders. Communicate with and involve them during decision-making, and keep a special focus on those who struggle to accept the change. Special attention is needed for those who are extremely reticent. They are likely not brave enough to speak up among others, but you need to find a way to get to their viewpoint, too. An open atmosphere in a non-business environment (the cafeteria or a colleague's office) will help to establish this channel of communication. The individual needs to be, and feel, safe to disclose his opinion or express serious concerns.
Follow-up meetings are important to double-check that the person is still okay and on track, but also to clarify any misunderstanding or misinterpretation up front. One-to-one conversations are extremely time-consuming but indispensable. I have had very positive experiences when demanding critique from my direct reports. That way, you force them to think about what needs to be changed. Or, to phrase it slightly differently, the question to ask is: "What do I have to change in order to make you more successful?" One-to-one meetings can be planned as official meetings but could also be set up as a consultation hour on fixed dates. Getting to the crucial point of information is difficult when dealing with people. You never know whether you are being told the truth or whether your counterpart is just playing politics. A good feel for people is the most essential characteristic of any leader, but if you really need a good average opinion, you should use anonymous communication tools.

The Critique Box can be a letterbox or a web-based anonymous message drop box. Most companies, especially international ones, have compliance hotlines and ombudspersons. Here, people can address their concerns and questions and receive help. This rather complex system might even prevent people from using it. Users could get the impression that their concern is just not important enough to bother such a professional instrument. And if they do raise a concern, will it be treated seriously and confidentially? Certainly, people do not want to feel or be disadvantaged when using the hotlines. An anonymous web-based interface is a simple tool through which every employee can submit inquiries. Quick and frequent feedback is most important if the company takes it seriously and deals with the requests. Handling the inquiries and dealing with whatever comes in is the challenge the leadership has to take on. There could be a team consisting of the general management and the representatives of the workers' union to process the incoming information. Depending on the type of feedback, the project manager and key users could provide facts to ensure a satisfying answer. In any case, both the question and the answer should be published close to where the input came from. Place them, for example, on the intranet and hang them on dedicated information boards for everybody to read; others might have had similar questions. The attitude and honesty of the answers is extremely important. The answer must be reasonable and comprehensible to the questioner. Write as clearly as possible and try to really answer the question. The tool is extremely powerful if used properly. When communication is reliable, and the sender is treated seriously and respectfully, confidence in the (project) management will certainly increase.

Reliable communication, or "do not say it if you do not mean it," is an obvious suggestion. It sounds easy, but it is not. We all know that political statements are sometimes hard to believe. Take the criticality of national finances as an example. The deficit in most countries is so dramatic that it constrains the room for any political manoeuvre.
A statement in which someone promises tax reductions may feel implausible. Nevertheless, certain politicians are able to get attention although their main statement seems illusory. Whoever communicates wants to transmit a message. The key is the combination of the speech's context and the speaker's body language. A plausible combination of both may determine the receiver's emotions. Some people think of communication as a pure exchange of information. A very good counterexample is President Barack Obama. He was not yet elected, and already many people put great hope in him, mainly because he was perceived as a new type of politician. He received the Nobel Peace Prize after just a few months in office because he managed to convey the hope for a better world. It is important to communicate with a positive attitude and to transmit an optimistic message. This is of highest importance especially in official announcements. You want to spread your optimism amongst the audience and want them to look positively into the future as well. It is obvious that the right words are needed but, more than that, you need to believe in them yourself! Many are skeptical about what they hear or what is promised to them because they have been disappointed many times before. Promises never came true, and many projects that were sold too well got stuck halfway. Honesty and reliability are the foundation for the trust and cooperation needed to create sustainable change.

8.4.2 KPIs for Team Engagement

Once a project is started and the team is selected, you will find that there is an inner project circle driving the initiative. This so-called "core team" needs to be supported by experts who were not chosen to join it; the latter are called the outer circle. Both groups need to be engaged in the change activity, and every single player must stay focused, committed and satisfied. Frequent checkups on the common understanding of the project's basis (target, approach and timing) enable you to manage the project effectively and to overcome identified hurdles quickly. You need to make your team's emotional involvement measurable and visible. This is a leading indicator for the project's, and your personal, progress and success. Communication, whether in team meetings or on a one-to-one basis, is vital for project management but hard to quantify. Questionnaires do help to get a quick overview of where the team thinks it stands. Is your opinion on progress aligned with the team's view, or is your perception impaired? Modern online forms help you to design questionnaires within a few minutes. Even the evaluation is simple: just select the recipients, click a few times, and you are done. Unfortunately, there are a few more things to consider. In order to make the evaluation quick and easy, I suggest avoiding free-text fields unless you want participants to comment on a special topic. You should use predefined statements instead and ask the user for his level of agreement. The grading for your statements could be 1-5, where 1 represents the least agreement and 5 the highest. Alternatively, you could go for 0-100% or choose any other scale. A few examples of precise statements are:

1. The project is on track and we will finish by the end of the month.
2. Communication is reliable.
3. Project management takes on suggestions.

If you do this assessment frequently, you can detect disagreement early and take action if needed.
Focus on binary, precise and clear questions. Look at the examples above: asking whether communication is "reliable and frequent" would put the user in the dilemma of answering two questions at once. What if it is frequent but unreliable? Also remember whom you ask, and prepare separate questionnaires for individual groups if necessary. Example: in preparation for a major enterprise resource planning system (e.g. SAP or Oracle) implementation, a group of people is asked to rank how much the system is perceived to simplify their daily work. Assume the result is rather confusing, as about 50% view the system as a useful tool and the others do not. Furthermore, imagine that the more technically oriented group sees less use than the rest of the group. Would you not simply split the total into separate user groups to get a more differentiated answer? Tracking the number of supporters within the technical group could also be an excellent KPI for the commitment of this originally more skeptical user group. In addition, you have easily identified which group you need to focus on in order to increase total engagement and commitment.
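As a minimal sketch of how such questionnaire results could be aggregated (assuming the 1-5 agreement scale described above; the group names, statements and scores below are invented for illustration), one could compute per-group averages and the share of supporters:

```python
# Sketch: aggregating 1-5 agreement scores per statement and per user group.
# Group names, statements and scores are invented for illustration.

def mean(values):
    return sum(values) / len(values)

# responses[group][statement] -> list of 1-5 agreement scores
responses = {
    "technical": {
        "project on track": [2, 3, 2, 4],
        "communication reliable": [4, 4, 5, 3],
    },
    "commercial": {
        "project on track": [4, 5, 4, 4],
        "communication reliable": [4, 3, 5, 4],
    },
}

def group_averages(responses):
    """Average agreement per statement, kept separate for each user group."""
    return {
        group: {statement: mean(scores) for statement, scores in by_statement.items()}
        for group, by_statement in responses.items()
    }

def supporter_share(scores, threshold=4):
    """Share of respondents agreeing at `threshold` or above: a simple commitment KPI."""
    return sum(1 for s in scores if s >= threshold) / len(scores)

averages = group_averages(responses)
print(averages["technical"]["project on track"])                    # 2.75
print(supporter_share(responses["technical"]["project on track"]))  # 0.25
```

Splitting by group, as in the ERP example, then reduces to evaluating each group's entries separately; the supporter share of the skeptical group can be tracked over repeated surveys as a commitment KPI.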

The happiness check is a very simple way of monitoring the emotional baseline. If you take 30 minutes to stand in the hall, walk over the site and look into offices, just count the number of happy faces. The "smile rate" is no solid scientific indicator, nor should you overvalue it. But, especially if you start somewhere new or launch a project in an unfamiliar area, you can get a feel for the current morale. Frustration and resignation are sub-optimal conditions for launching a change project. If you find this to be the case, you should focus on identifying the cause of the dissatisfaction first and then try to use it in your favor. Develop an emotional lever to get people out of their lethargy. Convince them to help you change their situation. Encourage them to be part of the creative force rather than to continue complaining. The happiness check can also be used to measure the attitude towards you and your project. Again, do not overvalue it, and also consider the environmental circumstances: I would expect more smiling, happy faces during spring and summer than in autumn or winter.

Visualization and non-verbal communication enable you to communicate with teams even without being present. Especially when interacting with the outer project circle, visualization is the tool of choice. Try to put as much as possible into graphs and pictures. Replace text blocks with bullet points and use short but precise wording. All communication has to be well-founded and reliable. Avoid speculation or guessing and, if you have to speculate, separate your guesses and mark them clearly. If visualization is used as a standard communication tool, it reduces subjective perception. Print it as large as possible (e.g. poster size) and hang it in a highly frequented place such as the entrance hall, cafeteria, waiting area or smokers' room. Start to pick random people and present your project to them using the poster. Ask what they think and whether they agree with your statements.
Even if the volunteers are neither impacted by the project nor involved in it, they might raise excellent questions or give you hints on misleading information. Thus you get the chance to inform people and spread the information you want to be spread. In addition, you may get detailed feedback free of charge. Put a feedback e-mail address or a telephone number on the poster so that you can be contacted with questions or suggestions later on. Keep the poster as a record, especially for the project documentation. In order to make your poster easily comprehensible, use dedicated areas for certain topics. Place a project summary preferably at the page's top or bottom and provide a status indication for the various sub-projects. Color codes or status bars can be used to indicate progress. Traffic-light colors are readily recognized: green equals "OK", yellow means "behind" and red stands for "critical". A status bar is more detailed and contains additional time-related information. You could use weeks or months, milestones, or the number of items worked off. Simply calculate the percentage and illustrate it in a bar chart similar to what most people know from downloading or installing computer software. Alternatively, or in addition, you could place a little marker over the status bar to visualize where you are supposed to be according to plan. Add a text box next to your status bar to point out the reasons for a delay and articulate the actions and help required to speed the project up. Especially if your project is behind schedule, you need to think about proper communication to get the support you need. The earlier you flag items as "behind" or "critical", the higher the chance to counteract.

A first-year review is done to analyze the status of the project after one year. Many projects, even if planned and executed well, lack after-project support. Unfortunately, many project resources are cut back, or responsible leaders are assigned to new tasks before the ultimate implementation. As a result, the users are quickly left in charge of finalizing the project and handling all the problems. What if those who are suddenly in charge were the least involved in the project? It is part of managerial culture to ensure that expectations are met! Everybody who buys a car will test-drive it before signing the contract. In professional life, somehow, executive attention ends with the announcement of the project's end. As soon as the first few results are reported, some managers take its continuation for granted and, in many cases, focus on the next projects. Sustainable project management requires regular meetings to follow up on the status. This includes involving the users to ensure that the expectations really are met. Expectations, in that respect, can vary depending on the function. Unfortunately, the preferred project expectation is whether the spending is within budget; functionality or performance as promised is second in line. To me, the primary expectation in any change is to sustain the desired new state. Change of operational behavior and the related improvements depend on repetition. Be aware that learning can erode over time. There is no such thing as a quick hit. Behavior practiced for years will not change in a month. Sustainability equals change for, and especially over, a long time.
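The percent-complete calculation and traffic-light mapping described above for status bars can be sketched as follows; the thresholds and figures are illustrative assumptions, not part of the original text:

```python
# Sketch: percent-complete and traffic-light status for a project status bar.
# The thresholds and the numbers in the example are illustrative assumptions.

def percent_complete(items_done, items_total):
    """Progress as a percentage of work items closed."""
    return 100.0 * items_done / items_total

def traffic_light(actual_pct, planned_pct, behind_margin=10.0):
    """Green = on or ahead of plan, yellow = behind by up to `behind_margin`
    percentage points, red = critical."""
    gap = planned_pct - actual_pct
    if gap <= 0:
        return "green"   # "OK"
    if gap <= behind_margin:
        return "yellow"  # "behind"
    return "red"         # "critical"

done = percent_complete(12, 40)  # 12 of 40 work items closed -> 30.0%
print(done, traffic_light(done, planned_pct=45.0))  # 30.0 red
```

The little marker over the status bar corresponds to `planned_pct`; flagging "behind" or "critical" early is then just a matter of where the thresholds are set.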

8.4.3 Project Preparation and Set Up

Stakeholders and team members play a critical role in any project. Team members are all those working on the change process. By definition, a project team consists of multiple professions and departments. In many cases, the team consists of employees who do project work besides their everyday jobs. Most companies have been through rationalization programs, and it may be that each position is held by only one employee. There are no spare capacities that can be pulled in full-time to work on projects. Only if a major investment is to be launched might experts be made available. But even then, a team member might be part of other teams as well. This person, with his individual character and profession, might play a more or less central role in each team. Thus, an individual might be critical for the overall success of multiple projects. In contrast to a team member, the stakeholders enable the change. Stakeholders are not part of the project team but supervise the activities. Project managers need to communicate with key stakeholders frequently. Stakeholders demand updates and supply help to overcome hurdles during project execution. Take the following example: imagine the launch of a global marketing project. Delegates from the regional marketing organizations gather as the project team. The global product manager would act as the project leader, driving the initiative and being supported by the team. Stakeholders could be, e.g., the global marketing manager and/or the global sales manager. It looks like the set-up is simple and roles and responsibilities are clearly defined. Expectations seem to be set and frequent communication is ensured. This project should be straightforward, successful and sustainable. Reality proves that most such projects (especially the ones driven globally) are a debacle. The simple but devastating cause is that too many projects need to be handled by only a few team members. In addition, the individual is often left alone when it comes to balancing project time and everyday duties. And even if the individual were dedicated to full-time project work, those who ultimately execute it are probably not. Back to the example of the global product launch described above: imagine that the regional marketing managers were dedicated to that project only. What are their duties? They need to work with the regional marketing groups to prepare the promotion. They need to make sure that production is aware and ready for local manufacturing. Quality standards need to be discussed and agreed upon globally; in order to be effective and unique, they commonly need to be accepted locally. There are various local interfaces to work with in order to coordinate and control the information flow to and from all involved departments. If just one interface is not managed properly or fails, the entire project is at risk. This is not an issue if one assumes all parties concerned are pulling the same rope. But what if there are conflicting targets? Do all the important projects that get launched almost daily really take into consideration that project members work in more than just one team? What if a potential team member simply cannot be freed from daily duties? Would you agree that everybody is pulling the same rope, but in different directions?
Just like the business vision, projects need to be assigned top-down, and resources need to be planned and allocated reasonably. Most project managers tend to forget that there is more than what happens within their project. They forget that any action needs to be executed by someone in the outer project team. The executing individual is often neither allocated nor planned for. Regrettably, the executing forces likely set the pace of the implementation and, finally, the likelihood of sustainability. I claim that the number of projects within businesses is far too high to be managed properly. Or, to phrase it differently: fewer projects with proper resource allocation (at all levels) will increase the chance of successful and sustainable change.

Team selection is an important act for any project. You should identify two or three well-experienced supporters, even if the task seems easy and the implementation is perceived as a no-brainer. Assigning a team does not necessarily mean holding meetings and sitting together for hours regularly. Any idea must grow, and the condition will only remain changed if someone feels responsible and takes care after the project is done. The earlier in the project the end users are involved, and the closer you keep contact with the key personnel, the more sustainable it will be. So why not build a small team, spread the responsibility and share the credit? Another important aspect is the potentially reduced resistance. The lone fighter, even with the smartest idea and a brilliant brain, is likely to fail. Selecting the right (trusted and respected) team members helps to knock down prejudices and helps you around roadblocks. And remember that it is not always easy to get the users' honest opinion. Most people would rather express their concerns to a colleague than to a project manager. In order to be supported, you need to be perceived as helpful. Allies are needed to get to that stage quickly. The sooner you establish an honest working routine with the (informal) area leaders, the faster and more successful (and sustainable) your project will be. Focus on frequent communication and you might get all relevant information without even asking for it. Remember to deal with resistance openly and try to find the root causes rather than fighting it with force. Keep an eye on your project team's attitude towards the change. Once your team no longer believes in success, consider re-adjusting the project rather than the team!

Delegation needs to be learned. Delegation means putting trust and confidence in your associates. The delegate represents the department and must have full support. Make sure you choose the right occasions to send delegates. Official meetings with the general management or the works council can likely not be delegated. Imagine the effect on the attendees and the delegate if he could not answer questions or even contradicted himself. Politically, you need to stand up for your project yourself. For any other issue, a delegate can be assigned. Some project managers keep complaining of being overworked and stressed. As much as I sympathize with that, I also question whether they delegate well. As stated above, choosing the appropriate team is the baseline for success and a good work-life balance. Select the team members who are capable and assign tasks to them. Make sure the delegate understands the goal and is not overwhelmed by it. Delegated tasks still need to be supported: offer your help frequently and request updates. Monitoring and controlling the project remains the project manager's responsibility. Delegation done wisely can help to reduce the manager's involvement in sub-topics and ensures his ability to see the big picture.
Proper delegation will get things off your desk, and the receiver might even see it as an opportunity. Most people I work with are grateful to participate in project work. They like to get away from the daily routine and gain different insights. They are motivated by having a say in business decisions. Trust and esteem placed in the delegates will energize them and enable them to deliver the highest performance. Be aware that delegates speak in their manager's name and represent the entire project. Their decisions should be binding and must not be questioned or revised unless absolutely unavoidable.

8.4.4 Risk Management

Risk and opportunity evaluation is a necessary act in managing businesses. Some modern companies do not evaluate well and go for any new idea, especially if it comes from the upper levels of the hierarchy. A natural diversification between locations, cultures and regional requirements gets abandoned in favor of organizational simplicity. Dealing globally and being successful in local markets requires diversity. No doubt, markets are different, but demand is local. When the US car market still requested high-horsepower vehicles, the trend in Europe and Asia was already going towards lower emissions and city-friendly mobility. Products, and especially brand marketing, are focused on localized needs. Most companies fail to make use of the different cultural strengths. Just think about Toyota's production concept: it works fine in Japan but needs corrections in order to work in Europe or North America. Business-unit structures covering continents and countries need to manage this diversity properly in order to benefit from it. Opportunities can be straightforward in one country but carry high risks in another. Local regulations and laws can differ between regions. To implement solutions blindly, just because they worked elsewhere, is likely to create problems. Thus, look at the potential solution first. Try to understand what the possible consequences are when implementing the solution. Do you really have a similar issue that needs to be solved, and what might the result be if one implements a solution for a non-existing problem? Solutions are powerful, and finally sustainable, only if adapted reasonably. Global guidelines and programs should be made variable. They need to be flexible enough to comply with global policy once they have been adjusted to every region. Risks and opportunities are best judged locally. Nevertheless, there is a need to control the entire process and to set up a supra-regional organization. Risk management structures need to be considered carefully and maintained in any organization. If a business decides to introduce operational excellence or similar initiatives, the various options for setting up the organization need to be evaluated. Depending on these structures, the working attitude as well as the approach to managing risks and opportunities will be fundamentally different. One option is to centralize all global experts in one department in order to send them out as in-house improvement consultants. This concept's obvious advantage lies in a unified approach and standardized tools. Solution transparency is high, as all information about local projects is tracked and documented centrally.
Best practices and benchmarks are collected, evaluated and looped back into the organization. There is only a small risk that everyday duties overwhelm these resources: a separate improvement organization ensures that improvement leaders focus 100% on the task. It can even be an advantage if the improvement expert is unfamiliar with the area; too much expertise in a particular area carries the risk of routine blindness. As with consultancy from an external company, the project's duration is critical, and thus the total success depends on the involvement and commitment of the executing teams (the outer project circle). An alternative set-up could be to empower and reinforce existing structures and resources. Here, the local managers and their teams are trained to enable the improvement by themselves. The advantage of this set-up is a direct (reporting) line to the process experts. An improvement manager based locally might instantly know what to change and whom to contact in order to get things done. By giving local people the right tools, you might enable the change directly. Local people have a strong network, and it enables them to get to the cause quicker than non-locals. Furthermore, the non-local might be perceived and treated as an outsider even if employed by the same company. Getting things done often depends more on interpersonal relations than on organizational power. A well-known program leader, who ideally operates within a strong emotional network, might be able to reduce resistance towards change dramatically. That is exactly where the biggest risks reside: as much as personal relationships enable quick execution, they can equally prevent necessary cuts and/or drastic decisions. I had the pleasure of working with a highly knowledgeable process manager who had over 40 years of experience in the area. He told me not to ask for his opinion about a change in the process: he said he had known the existing process for too long to be able to imagine how it could be done differently. He suggested splitting duties: I would tell him what I wanted changed, and he would ensure execution, if doable. The combination of specific process experience and program know-how worked well in that case. Incorporating the program approach globally while executing it locally is the biggest challenge when designing corporate improvement programs. Who leads? Who decides? And finally, who executes? It is a question of core competences, roles and responsibilities and, finally, power. Once the disputes "global capability versus local competence" and "program know-how versus process experience" are resolved, the improvement work may begin.

The steering committee controls the overall project progress and direction. It approves completed milestones and keeps track of required next actions. Some project managers perceive steering committee meetings as an affront to their competence and professionalism. I consider steering committees helpful and important when it comes to sharing responsibility: once the committee has approved a milestone, it acknowledges your effort, and you have a regular platform to raise questions and concerns. Working with steering committees is easier if you have prepared an excellent meeting agenda based on the project's schedule and planning. First, you need to agree collectively on the level of detail in the project plan. It depends very much on the initiative whether a complete schedule for each individual action is required or whether it would be sufficient to assign completion dates to milestones.
Second, set up a meeting to discuss your project plan with the committee as early as possible, and also agree on the KPIs to monitor the project's progress. Once you start executing the plan without the committee's approval, it is hard to redo the fundamental planning. This applies to major investments in civil engineering or construction as well as to smaller projects. Spend sufficient time on the planning and ensure frequent and timely communication with the steering committee, especially in the project's initial stage. Third, the steering group should also discuss whether alternative plans and fallback positions are required in the event of obstacles to the master plan. You and the committee should agree on, and include in your project plan, a written procedure with proper definitions of who needs to be informed in the event of complications or failure. In addition, you should plan upfront which additional resources might be pulled in or requested in such a case. The definition of a crisis, as well as planning for its management, is best done in advance. Some project managers are not aware that project delay or failure can easily cause major business issues. Mismanaged projects have the potential to harm the company well beyond the misspent investment money: apart from a massive impact on the business's image (e.g. the Deepwater Horizon incident at BP in 2010), there can be legal liabilities. Imagine being responsible for a major capacity expansion project. A delay might have a substantial impact on customers' material availability and therefore carries a high risk of contractual penalties. Would you not agree that it makes sense to share this responsibility? In order to do so, every level of escalation as well as the communication paths need to be defined and approved by the steering committee beforehand.

8.4.5 Roles and responsibilities

Sustainable responsibility is one of the key factors in sustainable change. In order to sustain change, it is critical to keep up the force that ultimately initiated the alteration process. There is a difference between doing the right things and doing things right; this is a fundamental challenge. The residence time of the new state is determined not only by organizational structures but by the managerial attitude towards project work in general. Assume that a reengineering project in the chemical industry is launched. A team of highly professional project managers and technical experts is put together, all of them dedicated to this task only. A budget is available, steering group meetings have been held, and the targets and timelines are approved. Everything looks in order and the project seems to be driven towards sustainability. The dilemma lies in differing personal expectations that may contradict sustainability. What if the project leader was promised a bonus if the budget is underspent by some percentage? Would you agree that his personal ambition and expectation are then already set? What if the leading project manager knew upfront that he is to be promoted to manage the department after the project? Would he do things right, or do the right things? Doing it right is not necessarily sustainable. Is it right if the project is within schedule and budget and minimum requirements are just met? Doing it right could also mean doing the right things, perhaps even overspending the budget if that ensures advanced technology leading to a more reliable process. I am not proposing that one is better than the other. As a project manager you need to understand what the expectations are, and as a business you need to know what you want and how to achieve it. You should deal openly with the expectations and communicate them accordingly.
Albert Einstein once said that "we cannot solve problems by using the same kind of thinking we used when we created them." Change, if well thought through, should ideally solve certain problems or improve the status quo. Well thought through in this respect means not accepting that issues may simply arise elsewhere: relocating problems is not an option in sustainable change management. Sustainable responsibility also implies that the responsibility remains with the leader even after the project's closure. Coming back to the example of the reengineering project: would you not agree that the project manager acts responsibly by supplying spare parts for the newly installed equipment? The area's maintenance budget should not have to pay for fixing project issues once the budget and project team are gone. Sustainable change needs holistic responsibility. Bedouins move from one place to another and, when they are gone, leave little to no trace on the environment. I have observed the same behavior in global engineering teams: they moved on to another project, having left little improvement but caused a great stir. Inventor-driven change is the most natural way to implement modifications. It lies in human nature to create, invent and improve. Historically, humans have always sustained the new status quo when they have seen a clear advantage over the old state. The ideal person to implement the change is the one who had the idea. Practically speaking, you should enable inventors to take the lead in implementing their own ideas. Who could explain the underlying problem and convince colleagues or co-workers better than the solution's originator? The inventor is motivated to solve the issue for good; anyone else might give up after a few attempts, but the innovator who believes in his idea will keep trying. In a professional setting, the amount of work spent and the total number of attempts certainly need to be restricted and controlled carefully. My unfortunate but practical observation is that too many ideas never get implemented. Innovators do not share their ideas because they fear loss of intellectual property, and many believe that an idea needs to be ready for implementation before being disclosed. Many companies have suggestion tools where employees submit their ideas and are rewarded if a suggestion is put into practice. It is a pity that once the raw idea is submitted, it is evaluated and implemented (if at all) by someone else. The originator has extremely little influence on the process, and his colleagues and peers often do not even know that the idea exists. Imagine a brain pool of ideas where everyone could browse, get inspired and add input and thought to others' suggestions. It would be like an open community in which the lineage of ideas and innovations is tracked carefully. Everyone adding reasonable input gains the same share of the final idea or suggestion: the originator is nothing without those adding practicality, and vice versa.
You could take your standard suggestion form and add a couple of text boxes to the back. Document the basic idea on the front: describe the suggestion in as much detail as possible and name the originator, date and time. Publish it within the working community, and if someone wants to add to the idea, they use one of the boxes on the back. Thus the idea is considered carefully and refined. It is important not to jump on the idea immediately; let it sit and age like a good red wine. Imagine that inventors could be asked to lead their own ideas into realization. What if you could equip them with resources and help them bring their ideas to life? Only a minority of proposers do it purely for the money. I maintain that an idea realized by its inventor(s) is inherently more sustainable: the implementation is done with brain and heart, and the inventor(s) will do the right things right. Even more important is that the idea's originators are known. Such a project will be perceived as change coming from inside the organization, whereas improvement introduced by outsiders is often viewed as imposed. Suddenly the change gets a name: indeed, name the equipment or improvement after the inventor and, if doable, put a label next to it in memory of the event. The emotional relation to an improvement made by someone known is different. Although the outcome of the improvement is obviously no different, the handling and the sustainability are. It is the same emotional differentiation as between driving a rental car and borrowing one from a friend: in both cases the car is not yours, but you might treat them quite differently. Self-discipline is a necessary precondition for any behavioral change. Discipline can be ordered and controlled, but once it derives from within a person it is far more powerful. Let self-discipline grow by naming the innovator, by teaching the principles underlying the intention and by presenting the expected benefits. You should never underestimate the users. If they lack the discipline, do not try to force change on them: the change will fail and a lot of resources will be wasted. Discipline needs managerial leadership and grows in an atmosphere of trust and honesty. Work on the people first and try to create a team spirit; without the discipline, do not implement change. I was once assigned to a process improvement project in a chemical synthesis area. Rather than starting with the necessary technical solutions, I tried to harmonize the way the shifts ran the process. I established frequent meetings to discuss the various operational philosophies for running the process. A common understanding of what the process is, what it needs and what actions should be taken set the baseline. This discussion took months until we came to a workable agenda. The common understanding resulted in a shared commitment towards standardized behavior. This is not necessarily self-discipline yet, but the resulting peer pressure enforces discipline. A shared commitment to harmonized operations reduces the variability induced by human interaction. Once the impact is positive and obvious, the discipline will follow naturally, if slowly. The human aspect is great and powerful no matter whether the improvement is technical or organizational. And it has two sides: support will help you manage the situation easily, but destructive forces destroy even the best idea or manager.
It is extremely hard to push your ideas through against negative confrontation. In that case, you need to work quietly from within. Try to find the root cause of the rejection and identify those willing to support you. Start with those who are positive and proceed in tiny steps to prove your concept right. If you fail, do not give up: always go back to the entire team to discuss the results, adjust the trial and jointly agree to start over again. Make sure to work on the team's self-discipline by involving them in the decision process, and keep on selling the advantage that is in it for them.

8.4.6 Career development and sustainable change

Stagnation and sustainability are two totally different things: stagnation is sustainable, but not vice versa. Stagnation maintains the status quo. A good indicator of a stagnating organization is when people state "it is OK the way it is" because "it has always been that way". Stagnating organizations are inflexible and unimaginative. As shown above, flexibility of mind and a creative vision are drivers of innovation. Organizations need to understand that the ability to accommodate new situations is a personal strength. Human resource departments should identify those candidates who are open-minded and who see the opportunities rather than the problems.


Creative employees need to be developed and deserve the chance to witness the organization's flexibility and opportunities. Employers need to make use of every individual's strengths, independent of educational level and degree subject. Talent development in this respect requires accepting and managing risk. Imagine a successful employee working in an almost perfect position. Would it not be unfair not to promote this employee just because the organization has no successor for the position? Or would it not be sad if that same employee missed out on applying for a job just because there might be no option to go back? I am not in favor of massive rotation within organizations, but I deplore that too many creative heads are stranded within inflexible structures. There are simply too many underdeveloped talents. This major problem worsens as the population ages, and soon it will become even harder to recruit qualified staff. The problem is self-imposed, as resources are cut back and some organizations simply lack a minimum number of people. Human resources departments are degraded to counting heads rather than developing talents and increasing resource efficiency. Sustainable improvement is a tool to release some of those talents, who will then initiate improvement somewhere else. Here lies a fundamental opportunity for businesses: consider all the in-house experts already working in your structure. It is just a matter of will and a bit of clever investment in the right resources. Managing successful and sustainable change is not necessarily restricted to employees with university degrees. Proper resource planning takes all of the above into consideration. It is critical to look ahead and prepare the organization for the future. If an experienced employee retires, the knowledge is gone forever; specialist know-how cannot be transferred to a successor in weeks, it might take years.
Identifying the ideal candidate and training this person on the job are necessary requirements for sustainability. Success relates not only to know-how or experience; it often relates to the employee's personal and professional network. Building up such a network is slow and arduous, but once established it helps to generate ideas and reduces the risk of repeating mistakes over and over again. The successor needs to be trained properly, which includes the time needed to interconnect and build up a network within the organization. The training also needs to incorporate professional preparation for potential leadership roles. Leadership is gained by experience, not by taking classes. Experience-based leadership skills are even more important if the leader is not the manager and still needs to get things done: cogency is more powerful than persuasion. In order to plan and track a project's resources properly, I use mind maps to capture, sort and rank ideas. I prepare individualized action lists and assign them to people in a timely manner to be able to track their work. Following the slogan "how to eat an elephant: bite by bite," I define sub-projects and assign resources to them. I define the minimum requirements and assign a sub-project leader. Despite the need for planning, you should not overdo it; try to balance complexity and clarity. This can be done with one master plan showing the overview, plus multiple sub-plans and action lists for each sub-project, month or person. The level of detail varies with individual preferences and the project volume. Make sure to discuss the plan and its timetable frequently with the team to track the status and to ensure the team's awareness of, and commitment to, the common goal. Even if you use computerized project planning tools, print the plan for the discussion and post it somewhere public so that everybody can see it. The more open and transparent the project is, the easier it is for the project team to stay focused and for the outer project circle to comprehend and actively participate. "Best to the top" needs to be more than a phrase. It enables career opportunities for those who are willing and capable. It is unacceptable not to invest in the human asset. The careers of talented candidates need to be supported and coached by professionals. A transparent program with support at every level of the organization ensures a constant flow of talent. This implies that the lower hierarchical levels need more attention, as their numbers are higher and their (average) age is lower; this group therefore contains more potential to be explored and developed. The senior leader who made it up from the lowest level of the organization is emotionally bound to the company. He brings expertise, experience and a working network that outsiders probably do not have. External options should be used only if the outsider really is better than the internal solution, true to the motto: best to the top. Award and recognition systems are supporting tools for career development programs. Not everybody can follow a career path, but everybody wants to be recognized and rewarded for performance. The award and recognition system need not be monetary in nature. It has to be set up so that managers and leaders can access it immediately: employees need instant feedback on performance, and not only when things go wrong. Such a system could start with permission to buy ice cream on hot summer days, or allow teams to order pizza on their own. Celebrating success empowers the entire organization to be proud of its performance. The system promotes the performer but can also, for example, provide loans to selected talents who are building a house.
This monetary system focuses on employee retention, and the double positive effect for the employer is obvious:

1. The employee is emotionally bound to the company and benefits from a low-interest loan. It is unlikely that this employee will leave the company during the payback time. Once the employee owns property, immobility increases, resulting in strong regional ties. This becomes especially important if the area is rather unattractive to work in (e.g. little industry or reduced availability of jobs).

2. While the employer uses the tool to motivate identified individuals, he might even make some money from the interest, depending on the company size. This money could be used to finance training opportunities.

Certainly, there is a risk of privileging individuals, and there will be discussions about the system's fairness. Thus, I stress the need to limit the value of the immediate recognition and rather have frequent team events (sports, dinner, cultural events, etc.) and recognize individuals with an award but without money. Those awards could be presented in a light-hearted way and should not be limited to work performance. The employee of the month could be accompanied by a "we are glad you are back" award for someone who has been sick for a while. It is less the award and more the recognition, the team's expression of genuine gladness that the individual is back. This also creates the emotional bond that is so important for employee retention. Another approach could take the family situation into consideration: why not give "time" as an award? One could be rewarded with half a working day off and an entry voucher to a local swimming pool in order to take the kids for a swim. The award lies in additional quality time with the kids. The joy and fun will be remembered and might continue to have a positive effect on the employee's motivation.

8.4.7 Sustainability in Training and Learning

A culture of failure seems like a revolutionary concept. It implies that more learning experience derives from failure than from success; if failure is analyzed properly and the right conclusions are drawn, this is true. In most cases we do not question why something is working; only when it fails do we start to realize how things work. After Thomas Edison invented the light bulb, he stated that he had found one way to make it work and had, in addition, learned over 500 ways not to make it work. His innovations are based on learning from failures. Conclusions drawn from results are driving forces in innovation and are not limited to technical processes. For example, realizing that the information flow within an organization is poor opens opportunities to restructure and gain business advantages. Improvement is possible only when the issue is detected and consequently analyzed and reported. A culture is needed in which nonconformity, failure and defects are seen as a chance. Employees need to feel safe discovering, reporting and admitting mistakes, and it is a managerial responsibility to protect the information source. The organization should reward employees for detecting failure, and the general culture should be to deal openly with mistakes. Focus on solving the problem, as Toyota does: bundle all available forces to improve the process. Train to change and practice sustainability. Everything needs to be learned and practiced, even the process of change itself. In a professional environment, improvement is often linked to rationalization and economization and is therefore mentally connected with job cuts and perceived injustice. The training sessions should clarify the mission of the improvement process. Most companies provide tool training and teach how to manage change when introducing improvement programs.
The training should also teach how to think outside the box and how to generate creative ideas. Several tools are available for experiencing the creative thought process. A lesson could start with a small example demonstrating that some things are not impossible even if they seem to be. Take the following example: draw a box and, inside it, 9 circles (3x3). Ask your colleagues to connect all of these circles with just three straight lines. The task seems impossible. The key is to remove imaginary boundaries, see figure 8.1:

1. There is no rule that prohibits drawing beyond the box.
2. There is no rule that the circles must be connected at their centers.


Fig. 8.1 The solution of the puzzle in the text.
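The geometric trick behind figure 8.1 can be checked numerically. The sketch below (an illustration of mine, not from the book) models the 9 circles as disks of radius 0.3 on a unit grid, and one continuous zigzag of three straight segments that leaves the box and touches the circles off-center; it verifies that every circle is hit.

```python
import math

def seg_dist(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

# 3x3 grid of circle centers; the circles have extent, which is what makes
# three lines sufficient (they need not pass through the centers)
centers = [(x, y) for x in range(3) for y in range(3)]
radius = 0.3

# one continuous zigzag of three slightly tilted segments extending beyond the box
zigzag = [(-1.0, -0.5), (3.0, 0.5), (-1.0, 1.5), (3.0, 2.5)]
segments = list(zip(zigzag, zigzag[1:]))

# every circle is touched by at least one of the three lines
hit = all(min(seg_dist(c, a, b) for a, b in segments) <= radius for c in centers)
print(hit)  # True
```

Each tilted segment stays within 0.25 of one row of centers across the box, so with radius 0.3 it crosses all three circles of that row; the turning points lie outside the box, which is exactly the "imaginary boundary" the puzzle asks you to drop.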

This little exercise can be a starting point. You could continue by showing example projects from within your company and explaining how change is managed and how the results affected the work environment. Most companies do not intend to cut jobs, but if that is the intention, make it clear upfront. If you want the change to be sustainable, the rules must be clear and fair. Cutting jobs in one area does not necessarily mean that people are laid off across the whole enterprise. Will the affected persons be used to fill open positions elsewhere in the organization, or is there even a chance for them to continue working as improvement leaders instead? Rationalization may free up resources desperately needed somewhere else in the company (please also refer to career development). Diversification is a source of different routes to solutions and of surprising learning experiences. As much as the overall business goals need to be defined unambiguously, the way to achieve a goal has to remain flexible. Too many boundaries will limit the creativity of the executing forces. Every problem has more than one solution. Different departments, locations or business units have different backgrounds and individual needs. The teams must have the freedom to discover their own ways and to develop their individual solutions. This assures a more focused approach and increases commitment, as the result is perceived as "our" solution. In addition, it broadens the experience baseline and fills the toolbox. Adapt the standard tools and adopt individualized ones. Every situation is unique, and consequently individual tools need to be developed to design the matching solution. This may take longer and probably requires more resources, but it is a fundamental precondition for sustainable success.

8.4.8 The Economic Factor in Sustainable Innovation

Employer attractiveness is an increasingly vital factor if companies want to maintain their business. The struggle for the best talents and various employee-binding strategies were described earlier. Business success in this respect depends on, but at the same time generates, attractiveness to those considered "the best". Porsche is still one of the most desired employers in Europe. It is the brand that sells, but it also attracts young, ambitious talents. Forming an employer brand matters more than almost anything else. The chance to be part of a highly innovative and well-respected brand is a major employee binder in itself. Choosing from the best is a luxury that only very few, huge industrial enterprises had in the past. In recent years companies like eBay, Google or Facebook have become extremely attractive to work for; it is the opportunity to bring ideas to life that attracts young people in particular. Sustainable innovation and development are critical factors in developing existing employees as well as in attracting highly talented new ones. Another remarkable factor can be discovered when comparing the success of Internet-based companies to that of traditional industries. By traditional industry I mean, for example, the automobile or consumer electronics industry; even the banking and insurance sector can be called traditional, although it has become largely virtualized in recent years. The major difference between the two is the product application. Innovation, quick releases of new updates and an innovative website are the drivers for web-based companies. Quality in innovation is perceived as the ability to anticipate or even create tomorrow's demand. Who would have guessed 15 years ago that teenagers might spend more time sitting at home alone chatting with their friends than communicating face to face? The innovation cycle in web-based companies is shorter, and innovation can be something that we, the users, do not even notice: a different programming language can reduce the total storage on the server or improve the download speed. The product's character derives from the various options the user can choose from. Google's browser, for example, can be personalized and is able to connect all the services provided by the company.
Beyond the pure browser options, Google created an interface between the user and the web. Success will soon no longer be measured by the number of clicks or the number of members; there might be more value in the total data volume transferred and stored. The information collected by a web site provider can be very powerful, as users surrender their privacy by uploading their lives. Thus, Internet-based companies needed new structures to align the business areas they operate in, unlike a car manufacturer, who will remain in the car manufacturing industry. The needs of existing product businesses are totally different. Products for use, whether real (a car) or virtual (a bank account), need warranties, contracts and quality inspections. The traditional business model is founded on making money with the product produced. Traditional industries focus mainly internally, on products or processes; the attention is directed inward rather than outward as in web-operating companies. Innovation directed outside the company can also be used as a marketing tool. Take eBay as an example, when it introduced its PayPal service: this financial solution has very little in common with the original trading platform, and by launching it as a separate, individual tool, eBay received a lot of attention. The creative part is to imagine what users might want besides the service they originally came for. The focus on core competences is one reason why traditional industry is rather slow with such products and services. The recently established financial services owned by car manufacturers are just the first examples of aligning new products with services in the traditional industry; this is learning from the web-based companies' success. Sustainability as a marketing and sales factor should therefore be seen less ecologically than strategically. If you consistently offer an advanced service, remain the leader in technology or are recognized as the low-price source, you will bind your customers. Over the last 20 to 25 years, the German grocery chain Aldi has come to be perceived as the leader in price. They sustain their lowest-cost image, and most consumers shop there with the comfortable certainty of getting the best price. Do they really? Aldi (North and South) operates in more than 27 countries worldwide, and the founding brothers belong to the wealthiest people in Europe. The success is obvious and comes from a highly innovative and flexible organization. The Aldi concept is to sustain the high quality standards of no-name brands at low cost. At the same time, the reduced variety of articles improves turnover and leads to positive cash flow. Continuous and sustainable change, e.g. adding clothes to the product line, enables Aldi to maintain and widen this successful business model.

8.5 Summary

Successful businesses enable their employees to be successful. The workforce will become an irreplaceable asset, as modern high-tech jobs relate less to manpower than to experience, know-how and emotional commitment. Those companies that retain their brilliant brains and experienced workers have a higher chance of future invention and faster market implementation. It is in the natural interest of any employer to retain those who have, or can gain, know-how. The innovative driving force will no longer be the privilege of R&D or engineering departments but needs to be spread equally across every single step of the value chain; especially in high-labor-cost countries, everyone is needed as a source of creativity. As most processes are complex and have hidden issues, mathematical modeling is not always the primary choice as a starting point for change; the change needs to happen within the organization first. Traditional hierarchical constructions have already been compressed and many levels removed. This was done in the sincere belief that it would strengthen the business. But getting more done with less hierarchy requires a different working attitude and altered organizational communication and commitment. Furthermore, it requires a shared vision comprehensible to everybody. The executing force needs to share the vision but also needs to understand what part of the vision is theirs. Communication skills, project and people management, and the ability to listen and reflect on oneself critically are key necessities for successful and sustainable change. These conditions hold for project managers, but even more for the entire organization. Sustainability is obtained if the final users are not only part of the change right from the beginning, but are the drivers of change and innovation and interact equally and actively.
Modern hierarchic organizations need to turn their pyramid upside down and focus on the satisfaction and the interests of every individual. Talent development programs focusing on retaining expertise, as well as a strong employer brand to attract talent, are key prerequisites for the future. Demographic change forces organizations to reconsider their attitude towards the human aspect. Successful businesses prove that non-job-related tasks can support the process of creativity and innovation and lead to the above-mentioned emotional commitment. Social values, whether in flexible working time, parental leave or educational leave, eventually create a different organizational environment. This will foster the formation and development of a most creative atmosphere, which is the key to innovation and satisfaction. In order to make the innovation last, successful employees in an emancipated hierarchy are necessary. The latter does not arise by chance but emerges when top managers accept resistance as a source of employee interaction throughout all hierarchic levels. We need to make use of the diversified strengths of every individual but at the same time accept the weaknesses of humans. Most modern companies simply cannot afford to refrain from using their human assets intensively. Focus on, and increased investment in, the human aspect will not only speed up innovation, market implementation and sustainability; it is the most valuable asset of all. Make use of this asset but treat it with respect and do not forget that "there is a difference between listening and waiting to speak!" [61]

References


1. E.H.L. Aarts, P.J.M. Korst, and P.J.M. van Laarhoven. Pattern Recognition: Theory and Applications, chapter Simulated Annealing: A Pedestrian Review of the Theory and Some Applications. Springer Verlag, 1987.
2. E.H.L. Aarts, P.J.M. Korst, and P.J.M. van Laarhoven. Quantitative analysis of the statistical cooling algorithm. Philips J. Res., 1987.
3. E.H.L. Aarts and P.J.M. van Laarhoven. A new polynomial time cooling schedule. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, November 1985, pages 206–208, 1985.
4. E.H.L. Aarts and P.J.M. van Laarhoven. Statistical cooling: A general approach to combinatorial optimization problems. Philips J. of Research, 40:193–226, 1985.
5. Forman S. Acton. Numerical Methods That Work. The Mathematical Association of America, 1990.
6. Teresa M. Amabile, Regina Conti, Heather Coon, Jeffrey Lazenby, and Michael Herron. Assessing the work environment for creativity. The Academy of Management Journal, 39(5):1154–1184, 1996.
7. Patrick D. Bangert. How smooth is space? Panopticon, 1:31–33, 1997.
8. Patrick D. Bangert. Algorithmic Problems in the Braid Groups. PhD thesis, University College London Mathematics Department, 2002.
9. Patrick D. Bangert. Mathematik – was ist das? Bild der Wissenschaft, page 10, 2004.
10. Patrick D. Bangert. Raid braid: Fast conjugacy disassembly in braid and other groups. In Quoc Nam Tran, editor, Proceedings of the 10th International Conference on Applications of Computer Algebra, ACA, pages 3–14, 2004.
11. Patrick D. Bangert. Downhill simplex methods for optimizing simulated annealing are effective. In Algoritmy 2005, number 17 in Conference on Scientific Computing, pages 341–347, 2005.
12. Patrick D. Bangert. In search of mathematical identity. MSOR Connections, 5(4):1–3, 2005.
13. Patrick D. Bangert. Optimizing simulated annealing. In Proceedings of SCI 2005 – The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, FL, USA, July 10–13, 2005, volume 3, pages 198–202, 2005.
14. Patrick D. Bangert. Optimizing simulated annealing. In Nagib Callaos and William Lesso, editors, 9th World Multi-Conference on Systemics, Cybernetics and Informatics, volume 3, pages 198–202, 2005.
15. Patrick D. Bangert. What is mathematics? Aust. Math. Soc. Gazette, 32(3):179–186, 2005.
16. Patrick D. Bangert. Jenseits des Verstandes, chapter Einführung in die buddhistische Meditation, pages 165–172. S. Hirzel Verlag, 2007.
17. Patrick D. Bangert. Jenseits des Verstandes, chapter Inwieweit kann man mit Logik spirituell sein? Die Sicht eines Mathematikers und Buddhisten, pages 147–152. S. Hirzel Verlag, 2007.
18. Patrick D. Bangert. Kreativität und Innovation, chapter Kreativität in der deutschen Wirtschaft, pages 79–86. S. Hirzel Verlag, 2008.
19. Patrick D. Bangert. Mathematical identity (in Greek). Journal of the Greek Mathematical Society, 5:22–31, 2008.
20. Patrick D. Bangert. Lectures on Topological Fluid Mechanics, chapter Braids and Knots, pages 1–74. Number 1973 in LNM. Springer Verlag, 2009.
21. Patrick D. Bangert. Neuroästhetik, chapter Fraktale Kunst: Eine Einführung, pages 89–95. E. A. Seemann, 2009.
22. Patrick D. Bangert. Ausbeuteoptimierung einer Silikonproduktion. In Arbeitskreis Prozessanalytik, number 6 in Tagung, page 14. DECHEMA, 2010.
23. Patrick D. Bangert. Ausbeuteoptimierung in der Silikonproduktion. Analytic Journal, www.analyticjournal.de/fachreports/fluessig_analytik/algorithmica_technol_silikon.html, 2010.



24. Patrick D. Bangert. Increasing energy efficiency using autonomous mathematical modeling. In Victor Risonarta, editor, Energy Efficiency in Industry: Technology Cooperation and Economic Benefit of Reduction of GHG Emissions in Indonesia, pages 80–86. Shaker Verlag, 2010.
25. Patrick D. Bangert. Two-day advance prediction of a blade tear on a steam turbine of a coal power plant. In M. Link, editor, Schwingungsanalyse & Identifikation, VDI-Berichte No. 2093, pages 175–182, 2010.
26. Patrick D. Bangert. Two-day advance prediction of a blade tear on a steam turbine of a coal power plant. In Instandhaltung 2010, pages 35–44. VDI/VDEh, 2010.
27. Patrick D. Bangert. Prediction of damages on wind power plants. In Schwingungen von Windenergieanlagen 2011, number 2123 in VDI Berichte, pages 135–144. VDI, 2011.
28. Patrick D. Bangert. Prediction of damages using measurement data. In Bernd Bertsche, editor, Technische Zuverlässigkeit, number 2146 in VDI Berichte, pages 305–316. VDI, 2011.
29. Patrick D. Bangert. Two-day advance prediction of blade tear on the steam turbine at a coal-fired plant. In 54th ISA POWID Symposium, volume 54 of ISA. ISA, 2011.
30. Patrick D. Bangert and Markus Ahorner. Modellierung eines Pumpenanlaufs zur Lebensdaueroptimierung mit der völlig neuen N-Körper-Methode. In Produktivitätssteigerung durch Anlagenoptimierung, number 29 in VDI/VDEh Forum Instandhaltung, pages 29–36. VDI/VDEh, 2008.
31. Patrick D. Bangert, M.A. Berger, and R. Prandi. In search of minimal random braid configurations. J. Phys. A, 35:43–59, 2002.
32. Patrick D. Bangert, Mitchel A. Berger, and Rosela Prandi. In search of minimal random braid configurations. J. Phys. A, 35:43–59, 2002.
33. Patrick D. Bangert, Martin D. Cooper, and S.K. Lamoreaux. Enhancement of superthermal ultracold neutron production by trapping cold neutrons. Nucl. Instr. Meth. in Phys. Res. A, 410:264–272, 1998.
34. Patrick D. Bangert, Martin D. Cooper, and S.K. Lamoreaux. Uniformity of the magnetic field produced by a cosine magnet with a superconducting shield. LANL EDM Expt. Tech. Rep., 1, 1999.
35. Patrick D. Bangert and Jörg-Andreas Czernitzky. Increase of overall combined-heat-and-power (CHP) efficiency via mathematical modeling. In VGB Fachtagung Dampferzeuger, Industrie- und Heizkraftwerke, 2010.
36. Patrick D. Bangert and Jörg-Andreas Czernitzky. Efficiency increase of 1% in coal-fired power plants with mathematical optimization. In 54th ISA POWID Symposium, volume 54 of ISA. ISA, 2011.
37. Patrick D. Bangert and Jörg-Andreas Czernitzky. Increase of overall combined-heat-and-power efficiency through mathematical modeling. VGB PowerTech, 91(3):55–57, 2011.
38. Patrick D. Bangert, Chaodong Tan, Zhang Jie, and Bailiang Liu. Mathematical model using machine learning boosts output offshore China. World Oil, 231(11):37–40, 2010.
39. Patrick D. Bangert, Chaodong Tan, Bailiang Liu, and Zhang Jie. Maschinelles Lernen erhöht Ertrag. China Contact, 15(6):52–54, 2011.
40. D.M. Bates and D.G. Watts. Nonlinear Regression Analysis and Its Applications. Wiley, 1988.
41. M.A. Berger. Minimum crossing numbers for three-braids. J. Phys. A, 27:6205–6213, 1994.
42. Lutz Beyering. Individual Marketing. Verlag Moderne Industrie, 1987.
43. Marco A. D. Bezerra, Leizer Schnitman, M. de A. Baretto Filho, and J.A.M. Felippe de Souza. Pattern recognition for downhole dynamometer card in oil rod pump system using artificial neural networks. In Proceedings of the 11th International Conference on Enterprise Information Systems ICEIS 2009, Milan, Italy, pages 351–355, 2009.
44. C.M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006.
45. Dan Bonachea, Eugene Ingerman, Joshua Levy, and Scott McPeak. An improved adaptive multi-start approach to finding near-optimal solutions to the Euclidean TSP. In Genetic and Evolutionary Computation Conference (GECCO-2000), 2000.



46. E. Bonomi and J.-L. Lutton. The asymptotic behaviour of quadratic sum assignment problems: A statistical mechanics approach. Euro. J. Oper. Res., 1984.
47. E. Bonomi and J.-L. Lutton. The N-city travelling salesman problem: Statistical mechanics and the Metropolis algorithm. SIAM Rev., 26:551–568, 1984.
48. M. Boulle. Khiops: A statistical discretization method of continuous attributes. Machine Learning, 55:53–69, 2004.
49. Wayne H. Bovey and Andrew Hede. Resistance to organisational change: the role of defence mechanisms. Journal of Managerial Psychology, 16(7):534–548, 2001.
50. Michael Brusco and Stephanie Stahl. Branch-and-Bound Applications in Combinatorial Data Analysis. Springer Verlag, 2005.
51. R.E. Burkard and F. Rendl. A thermodynamically motivated simulation procedure for combinatorial optimization problems. Euro. J. Oper. Res., 17:169–174, 1984.
52. V. Cerny. Thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm. J. Opt. Theory Appl., 45:41–51, 1985.
53. William G. Cochran. Sampling Techniques. Wiley, 1977.
54. N.E. Collins, R.W. Eglese, and B.L. Golden. Simulated annealing – an annotated bibliography. Am. J. Math. Manag. Sci., 8:209–307, 1988.
55. Peter Dayan and L.F. Abbott. Theoretical Neuroscience. The MIT Press, 2001.
56. John E. Dowling. Neurons and Networks: An Introduction to Neuroscience. The Belknap Press of Harvard University Press, 1992.
57. D.S. Johnson, C.R. Aragon, L.A. McGeoch, and C. Schevon. Optimization by simulated annealing: An experimental evaluation. In List of Abstracts, Workshop on Statistical Physics in Engineering and Biology, Yorktown Heights, April 1984, revised version, 1986.
58. F. Catthoor, H. DeMan, and J. Vanderwalle. Sailplane: A simulated annealing based CAD tool for the analysis of limit-cycle behaviour. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, October 1985, pages 244–247, 1985.
59. F. Romeo, A.L. Sangiovanni-Vincentelli, and C. Sechen. Research on simulated annealing at Berkeley. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, November 1984, pages 652–657, 1984.
60. U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the 13th Int. Joint Conf. on Artificial Intelligence, pages 1022–1027, 1993.
61. Tom Foremsko. Twitter from cocoon.ifs.tuwien.ac.at, 2009.
62. David A. Freedman. Statistical Models: Theory and Practice. Cambridge University Press, 2005.
63. S. George. An Improved Simulated Annealing Algorithm for Solving Spatially Explicit Forest Management Problems. PhD thesis, Penn. State Uni., 2003.
64. Walton E. Gilbert. An oil-well pump dynagraph. Production Practice, Shell Oil Company, pages 94–115, 1936.
65. Fred Glover and Manuel Laguna. Tabu Search. Kluwer Academic Publishers, 1996.
66. B.L. Golden and C.C. Skiscim. Using simulated annealing to solve routing and location problems. Nav. Log. Res. Quart., 33:261–279, 1986.
67. N. Golyandina, V. Nekrutkin, and A. Zhigljavsky. Analysis of Time Series Structure: SSA and Related Techniques. Chapman and Hall/CRC, 2001.
68. J.W. Greene and K.J. Supowit. Simulated annealing without rejected moves. IEEE Trans. Comp. Aided Design, CAD-5:221–228, 1986.
69. Martin T. Hagan, Howard B. Demuth, and Mark Beale. Neural Network Design. PWS Pub. Co., 1996.
70. B. Hajek. A tutorial survey of theory and application of simulated annealing. In Proc. 24th Conf. Decision and Control, Ft. Lauderdale, December 1985, pages 755–760, 1985.
71. J.D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
72. T. Hastie and P. Simard. Models and metrics for handwritten character recognition. Statistical Science, 13(1):54–65, 1998.
73. Randy L. Haupt and Sue Ellen Haupt. Practical Genetic Algorithms. Wiley-Interscience, 2004.



74. Kenneth M. Heilman, Stephen E. Nadeau, and David O. Beversdorf. Creative innovation: Possible brain mechanisms. Neurocase, 9(5):369–379, 2003.
75. Klaus Hinkelmann and Oscar Kempthorne. Design and Analysis of Experiments, Vols. I and II. Wiley, 2008.
76. Douglas R. Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Penguin Books, 1979.
77. Torbjörn Idhammar. Condition Monitoring Standards (4 vols). Idcon Inc., 2001–2009.
78. Alexander I. Khinchin and George Gamow. Mathematical Foundations of Statistical Mechanics. Dover Publications, 1949.
79. S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
80. J. Klos and S. Kobe. Nonextensive Statistical Mechanics and Its Applications, chapter Generalized Simulated Annealing Algorithms Using Tsallis Statistics, pages 253–258. LNP 560, Springer Verlag, 2001.
81. D.E. Knuth. Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming. Addison-Wesley, Reading, MA, USA, 1981.
82. J. Lam and J.-M. Delosme. Logic minimization using simulated annealing. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, November 1986, pages 348–351, 1986.
83. Rotislav V. Lapshin. Analytical model for the approximation of hysteresis loop and its application to the scanning tunneling microscope. Rev. Sci. Instrum., 66(9):4718–4730, 1995.
84. H.W. Leong and C.L. Liu. Permutation channel routing. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, October 1985, pages 579–584, 1985.
85. H.W. Leong, D.F. Wong, and C.L. Liu. A simulated annealing channel router. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, November 1985, pages 226–229, 1985.
86. S. Lin. Computer solutions for the travelling salesman problem. Bell Sys. Tech. J., 44:2245–2269, 1965.
87. H.R. Lindman. Analysis of Variance in Complex Experimental Designs. W.H. Freeman & Co., Hillsdale, 1974.
88. David G. Luenberger. Linear and Nonlinear Programming. Springer Verlag, 2003.
89. M. Lundy and A. Mees. Convergence of an annealing algorithm. Math. Prog., 34:111–124, 1986.
90. J.-L. Lutton and E. Bonomi. Simulated annealing algorithm for the minimum weighted perfect Euclidean matching problem. R.A.I.R.O. Recherche Opérationnelle, 20:177–197, 1986.
91. P.S. Mann. Introductory Statistics. Wiley, 1995.
92. S. Martin, M. Anderson, I. Salman, V. Lazar, and Patrick David Bangert. Processes contributing to the evolution of two filament channels to global scales. In K. Sankarasubramanian, M. Penn, and A. Pevtsov, editors, Large Scale Structures and their Role in Solar Activity, ASP Conference Proceedings Series. Astronomical Society of the Pacific, 2005.
93. W. Mass. Efficient agnostic PAC-learning with simple hypotheses. In Proc. of the 7th ACM Conf. on Computational Learning Theory, pages 67–75, 1994.
94. M.D. Huang, F. Romeo, and A.L. Sangiovanni-Vincentelli. An efficient general cooling schedule for simulated annealing. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, November 1986, pages 381–384, 1986.
95. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092, 1953.
96. D.C. Montgomery. Design and Analysis of Experiments. Wiley, 2000.
97. C.A. Morgenstern and H.D. Shapiro. Chromatic number approximation using simulated annealing. Technical Report CS86-1, Dept. Comp. Sci., Univ. New Mexico, 1986.
98. Leonard K. Nash. Elements of Chemical Thermodynamics. Dover Publications, 2005.
99. Taiichi Ohno. Toyota Production System: Beyond Large-Scale Production. Productivity Press, 1988.
100. Esin Onbaşoğlu and Linet Özdamar. Parallel simulated annealing algorithms in global optimization. Journal of Global Optimization, 19(1), 2001.
101. R.H.J.M. Otten and L.P.P.P. van Ginneken. Floorplan design using simulated annealing. In Proc. IEEE Int. Conf. on Comp. Aided Design, Santa Clara, November 1984, pages 96–98, 1984.



102. P. Salamon, P. Sibani, and R. Frost. Facts, Conjectures, and Improvements for Simulated Annealing. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2002.
103. Athanasios Papoulis and S. Unnikrishna Pillai. Probability, Random Variables and Stochastic Processes. McGraw Hill, 2002.
104. Oliver Penrose. Foundations of Statistical Mechanics: A Deductive Treatment. Dover Pub. Inc., Mineola, NY, USA, 2005.
105. George Polya. How to Solve It. Princeton University Press, 1957.
106. ProQuest. http://www.umi.com/proquest/.
107. D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
108. F. Romeo and A.L. Sangiovanni-Vincentelli. Probabilistic hill climbing algorithms: Properties and applications. In Proc. 1985 Chapel Hill Conf. VLSI, May 1985, pages 393–417, 1985.
109. B. Rosner. On the detection of many outliers. Technometrics, 17:221–227, 1975.
110. B. Rosner. Percentage points for a generalized ESD many-outlier procedure. Technometrics, 25:165–172, 1983.
111. Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall International, 1995.
112. S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Technical Report RC 9355, IBM Research, 1982.
113. S. Nahar, S. Sahni, and E. Shragowitz. Simulated annealing and combinatorial optimization. In Proc. 23rd Des. Automation Conf., Las Vegas, June 1986, pages 293–299, 1986.
114. Ken Schwaber. Agile Project Management with Scrum. Microsoft Press, 2004.
115. C. Sechen and A.L. Sangiovanni-Vincentelli. The TimberWolf placement and routing package. IEEE J. Solid State Circuits, SC-20:510–522, 1985.
116. Amartya K. Sen. Collective Choice and Social Welfare. London, 1970.
117. Mike Sharples, David Hogg, Chris Hutchinson, Steve Torrance, and David Young. Computers and Thought: A Practical Introduction to Artificial Intelligence. The MIT Press, 1989.
118. J. Shore and S. Warden. The Art of Agile Development. O'Reilly Media, Inc., 2008.
119. C.C. Skiscim and B.L. Golden. Optimization by simulated annealing: A preliminary computational study for the TSP. In NIHE Summer School on Comb. Opt., Dublin, 1983.
120. R.F. Stengel. Optimal Control and Estimation. Dover Publications, 1994.
121. Chaodong Tan, Patrick D. Bangert, Zhang Jie, and Bailiang Liu. Yield optimization in Dagang offshore oilfield (in Chinese). China Petroleum and Chemical Industry, 237(11):46–47, 2010.
122. Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
123. P.J.M. van Laarhoven. Theoretical and Computational Aspects of Simulated Annealing. Centrum voor Wiskunde en Informatica, 1988.
124. P.J.M. van Laarhoven and E.H.L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel, Dordrecht, 1987.
125. R. von Mises. Probability, Statistics and Truth. George Allen & Unwin, London, UK, 1957.
126. Dianne Waddell and Amrik S. Sohal. Resistance: a constructive tool for change management. Management Decision, 38(8):543–548, 1998.
127. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C, 2nd edition. Cambridge University Press, 1992.
128. S.R. White. Concepts of scale in simulated annealing. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, November 1984, pages 646–651, 1984.
129. Wikipedia. Innovation.
130. www.buildabetterburger.com/burgers/timeline.
131. www.foodreference.com.
132. www.whatscookingamerica.net/History/HamburgerHistory.htm.

Index

accuracy, 3, 4, 10 action integral, 188 adaptive controller, 132 analysis of variance, 81 ANOVA, 81, 83 archive system, 39 artifact, 88 autocorrelation, 85 average ensemble, 15 moving, 47 time, 15 Bayesian, 71 benchmark, 53, 54 self, 54 binarization, 92 Boltzmann’s constant, 17, 21, 178, 179 boundary condition, 1, 15, 190 brain, 183 budgeting, 58 calculus of variations, 187 catalyst, 149 catalytic reactor, 148 cause-effect link, 82 center, 86 central limit theorem, 100, 105, 106 certainty, 4 change, 202 resistance, 204 chemical plant, 53, 58, 148, 155, 189 classiﬁcation, 130 clustering, 42, 46, 86, 88, 110 k-means, 87, 109, 117 center, 87, 88 Lloyd’s algorithm, 87

coal power plant, 96 combinatorial problem, 181 combined heat and power (CHP), 96, 197 communication theory, 111 conﬁguration density, 173 conjugate gradient, 165 constraint, 1, 3 continuous problem, 181 control theory, 110 correlation, 44, 84 spurious, 73 critical value, 73 curve ﬁtting, 113 cybernetics, 110 data, 43 Debye’s law, 24 decision tree, 46 design matrix, 81 design of experiment, 34, 81 determinism, 19 discretization, 38 distribution Gaussian, 106 normal, 106 posterior, 108 prior, 107 sampling, 107 domain knowledge, 124 dynamic control system (DCS), 39, 111 dynamometer card, 158 early-warning, 63 energy, 14, 16, 18, 20 ensemble, 15 enterprise resource planning (ERP), 54 entropy, 18, 20, 20, 21, 89, 90, 98, 99, 169, 170



Boltzmann, 21 informational, 89 equilibrium, 16, 20, 21, 23, 168, 169, 171, 175 change at, 186 ergodic, 16, 26 source, 90 ergodic set, 26, 89 ergodicity, 25 breaking, 27, 90 theorem, 26 error type I, 72 type II, 72 estimate, 68 estimation, 67, 68 estimator, 68 Euler-Lagrange equation, 188 extrapolation, 51, 128 extreme studentized deviate, 77 fault localization, 143 feature, 44 feed-back, 71, 71 filter, 47 fitness, 166 fitting problem, 78 fluid catalytic converter, 149 Fourier series, 52, 114 Fourier transform, 91, 98, 99 freezing in, 22 freezing point, 174 frequency, 71 function, 126 analytic, 116 goal, 112 merit, 112 functional form, 126 genetic algorithm, 165, 166 crossover, 167 mutation, 167 goal, 35 granular catalytic reactor, 149 ground state, 22 heat bath, 17 Helmholtz free energy, 18 heteroscedasticity, 82 historian, 39 homoscedasticity, 82 hypothesis alternative, 73, 74, 75 null, 73, 74, 75, 83

idea, 202 iid, 35 independent and identically distributed, 35 information, 43, 89 information theory, 111 injection molding, 135 innovation, 202, 213 instrumentation, 37, 38 insurance, 62 interpolation, 42, 51, 128 knowledge, 43 knowledge gain, 125 lag, 85 Lagrangian, 188 learning supervised, 94 un-supervised, 95, 110 least-squares fitting, 78, 79, 113 general linear, 80 linear correlation coefficient, 84 logistics, 118 Müller-Rochow Synthesis (MRS), 189 macrostate, 13, 14, 15, 19, 21 maintenance, 53, 55, 58 management, 29 agile, 32 change, 201 critique box, 217 delegation, 223 diversification, 232 employer attractiveness, 232 first year review, 221 innovation, 213 interface, 207 inventor-driven, 227 kaizen, 33 kanban, 32 KPI's, 219 marketing, 234 one-to-one meetings, 217 questionnaires, 219 recognition, 230 resource planning, 229 responsibility, 226 risk, 223 sales, 234 scrum, 32 self-discipline, 228 stakeholder, 221 steering committee, 225 structures, 224

sustainability, 226, 228, 231, 232 talent development, 229 team member, 221 team selection, 222 team-meetings, 217 training, 231 marketing individual, 118 Markov chain, 19, 26, 105, 117, 118, 168, 175, 177 p-order, 106 homogeneous, 106 property, 105 Maxwell-Boltzmann distribution, 17, 178, 179 mean, 75, 76 measurement, 67, 70 error, 68, 69 spike, 69 melting point, 174 methods branch-and-bound, 6 complete enumeration, 5 enumeration, 165 exact, 5, 5 genetic algorithm, 6, 7 heuristic, 5, 6 Monte Carlo, 16 restarting, 27 scientific, 34 simulated annealing, 6, 7 advantages, 7 tabu search, 6 metric, 86 microstate, 13, 14, 14, 19, 21 model agile, 30 generalization, 128 kanban, 30 mathematical, 121 scrum, 30 waterfall, 30 modeling, 79, 121, 121, 123, 125 move, 6 multi-objective optimization, 7 neural network, 121, 123, 126, 129 echo-state network, 133 Hopfield network, 132 multi-layer perceptron, 131 activation function, 131 bias vector, 131 layer, 131 topology, 131 weight matrix, 131

primary pattern, 132 Newton's method, 165 noise, 38, 40, 47, 48 de-noising, 47 noisy channel, 107, 110, 111, 112 non-locality, 71, 71 nonlinearity, 115 nuclear power plant, 152 observation, 70 Occam's razor, 52 occupation number, 17 offshore production, 195 oil production, 195 optimal path, 186 optimization, 1, 79, 112 optimum global, 1, 165 local, 1 Pareto, 7, 8 outlier, 40, 69, 77, 86, 88 outsourcing, 58 over-fitting, 84, 123 partition function, 17, 18 Pearson's r, 84 perceptron, 92 petrochemical plant, 148 phase transition, 22, 170, 171 second order, 22 Poisson stochastic error source, 101 polynomial, 52, 80 population, 67, 67 pre-processing, 92 price arbitrage, 118 prioritization, 35 probability, 3, 4, 19, 71, 72 acceptance, 172 distribution, 39, 72, 73, 75, 76 Cauchy, 107 generalized normal, 107 homogeneous, 82 logistic, 107 normal, 72, 82 posterior, 117 prior, 108, 117 sampling, 110 Student t, 107 uniform, 72, 73 transition, 106 problem, 11 instance, 11 solution, 15 process

non-reversible, 20 reversible, 20 process stability, 70 production, 63, 135 profitability, 193 pump choke-diameter, 195 frequency, 195 qualitative, 81 quantitative, 81 random, 17, 73 regression, 79, 113 linear, 113 non-linear, 117 representation, 50 representative, 62, 68 reservoir computing, 133 restarting, 166 retailer, 117 sample, 67, 68 sampling, 50, 68 random, 50 stratified, 50 scrap, 63 identification, 135 selectivity, 190, 193 self-organizing map (SOM), 92, 109 sensor, 37, 38 drift, 69 tolerance, 68 significance, 72, 72, 73 level, 72, 74 silane, 189 simulated annealing, 87, 165, 167, 168, 183 cooling schedule, 169, 172, 179 equilibrium, 168, 175 final temperature, 168, 174 initial temperature, 168, 173 selection criterion, 168, 178 temperature reduction, 168, 177 perturbation, 181 singular spectrum analysis, 48, 91, 98, 99 singular value decomposition, 81 six-sigma, 72 soft sensor, 42, 44 specific heat, 23, 25, 170, 172 constant pressure, 25 constant volume, 25 spline, 53 state persistent, 26 pseudo-persistent, 27 transient, 26

stationary, 16 statistical inference Bayesian, 107, 112, 117, 118 statistical mechanics, 14 statistics, 67, 71, 72 F-test, 76, 83 χ²-test, 76, 77, 79 t-test, 75 confidence, 74 critical value, 74 descriptive, 117 independence, 82 Kolmogorov-Smirnov-test, 77 population, 74 Rosner test, 77 test, 72, 73, 73, 74, 75 variance, 83 storage, 118 straight line, 80 system, 15 systematic error, 70 Taylor's theorem, 115 temperature, 17, 23, 167, 169 absolute zero, 20 testing, 32 thermodynamics, 14 laws, 19, 21 postulates, 18 time-series, 38 timescale, 10 traveling salesman problem, 11 turbine, 96, 98, 103, 140, 152 blade tear, 141 uncertainty, 68–70 valve failures, 155 variable categorical, 82 controllable, 190 input, 129 nominal, 130 numeric, 130 output, 129 semi-controllable, 190 uncontrollable, 190 variance, 75, 76 vibration, 104 vibration crisis, 152 virtually certain, 74 wind power plant, 143 yield, 193


Preface


Content and Scope

Optimization is the determination of the values of the independent variables in a function such that the dependent variable attains a maximum over a suitably defined

area of validity (cf. the boundary conditions). We consider the case in which the independent variables are many but the dependent variable is limited to one; multi-criterion decision making will only be touched upon. This book, for the first time, combines mathematical methods with a wide range of real-life case studies of industrial use of these methods. Both the methods and the problems to which they are applied as examples and case studies are useful in real situations that occur in profit-making industrial businesses from fields such as chemistry, power generation, oil exploration and refining, manufacturing, retail and others. The case studies focus on real projects that actually happened and that resulted in positive business outcomes for the industrial corporation. They are problems that other companies also have and thus have a degree of generality. The thrust is on take-home lessons that industry managers can use to improve their production via optimization methods. Industrial production is characterized by very large investments in technical facilities and regular returns over decades. Improving yield or similar characteristics in a production facility is a major goal of the owners in order to leverage their investment. The current approach to doing this is mostly via engineering solutions that are costly, time-consuming and approximate. Mathematics entered the industrial stage in the 1980s with methods such as linear programming and revolutionized the area of industrial optimization. Neural networks, simulation and direct modeling followed, and an arsenal of methods now exists to help engineers improve plants, both existing and new. The dot-com revolution in the late 1990s slowed this trend of knowledge transfer, and it is safe to say that industry is essentially stuck with these early methods. Mathematics has evolved since then and accumulated much expertise in optimization that remains largely unused.
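To make the scope concrete: the generic problem treated throughout the book is to find the values of several independent variables that maximize a single goal function within a bounded area of validity. As a minimal illustrative sketch (a toy example of our own, not one of the book's case studies, with all names and parameter values chosen purely for illustration), a random-restart hill climber, one of the simplest members of the heuristic family discussed later, might look like this:

```python
import random

def maximize(f, dim, lo, hi, restarts=20, steps=2000, sigma=0.1):
    """Random-restart hill climbing: maximize f over the box [lo, hi]^dim."""
    best_x, best_f = None, float("-inf")
    for _ in range(restarts):
        # start each restart from a random point inside the area of validity
        x = [random.uniform(lo, hi) for _ in range(dim)]
        fx = f(x)
        for _ in range(steps):
            # propose a small random perturbation, clipped to the boundary
            y = [min(hi, max(lo, xi + random.gauss(0.0, sigma))) for xi in x]
            fy = f(y)
            if fy > fx:  # greedy: keep the move only if it improves the goal
                x, fx = y, fy
        if fx > best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# toy goal function: concave, with its maximum of 1.0 at (0.3, 0.7)
random.seed(0)  # for reproducibility of this sketch
goal = lambda x: 1.0 - (x[0] - 0.3) ** 2 - (x[1] - 0.7) ** 2
x, val = maximize(goal, dim=2, lo=0.0, hi=1.0)
```

Real industrial goal functions are, of course, noisy, constrained and expensive to evaluate; methods such as simulated annealing and genetic algorithms, treated in the earlier chapters, replace the greedy acceptance rule with more forgiving ones precisely for that reason.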
Also, modern computing power has exploded with the affordable parallel computer, so that methods that were once doomed to the dusty shelf can now actually be used. These two effects combine to harbor a possible revolution in industrial uses of mathematical methods. These uses center around the problem of optimization, as almost every industrial problem concerns maximizing some goal function (usually efficiency or yield). We want to help start this revolution with a coordinated presentation of methods, uses and successful examples. The methods are necessarily heuristic, i.e. non-exact, as industrial problems are typically very large and complex indeed. Also, industrial problems are defined by imprecise, sometimes even faulty data that must be absorbed by a model. They are always non-linear and have many independent variables. So we must focus on heuristic methods that can cope with these characteristics.

This book is practical

This book is intended to be used to solve real problems in a handbook manner. It should be used to look for potential yet untapped. It should be used to see possibilities where there were none before. The impossible should move towards the realm
of the possible. The use, therefore, will mainly be in the sphere of application by persons employed in industry. The book may also be used as instructional material in courses on optimization methods or applied mathematics, and in MBA courses for industrial managers. Many readers will get their first introduction to what mathematics can really and practically do for industry, instead of general commonplaces. Many will find out what problems exist where they previously thought none existed. Many will discover that presumed impossibilities have been solved elsewhere. In total, I believe that you, the reader, will benefit by being empowered to solve real problems. These solutions will save corporations money, they will employ people, and they will reduce pollution of the environment. They will have impact. They will also show people that very theoretical sciences have real uses.

It should be emphasized that this book focuses on applications. Practical problems must be understood at a reasonable level before a solution is possible. Also, all applications have several non-technical aspects, such as legal, compliance and managerial ramifications, in addition to the obvious financial dimension. Every solution must be implemented by people, and the interaction with them is the principal cause of failure in industrial applications. The right change management, including the motivation of all concerned, is an essential element that will also be addressed. Thus, this book presents cases as they can really function in real life.

Due to the wide scope of the book, it is impossible to present either the methods or the cases in full detail. We present what is necessary for understanding. To actually implement these methods, a more detailed study or prior knowledge is required. Many take-home lessons are, however, spelled out. The major aim of the book is to generate understanding and not technical facility.
This book is intended for practitioners

The intended readership comprises five groups:

1. Industrial managers - will learn what can be done with mathematical methods. They will find that a lot of their problems, many seemingly impossible, are already solved. These methods can then be handed to technical persons for implementation.
2. Industrial scientists - will use the book as a manual for their jobs. They will find methods that can be applied practically and have solved similar problems before.
3. University students - will learn that their theoretical subjects do have practical application in the context of diverse industries, which will motivate them in their studies towards a practical end. As such, the book will also provide starting points for theses.
4. University researchers - will learn to what applications the methods they research have been put, or, respectively, what methods have been used by others to solve the problems they are investigating. As this is a trans-disciplinary
book, it should facilitate communication across the boundaries of the mathematics, computer science and engineering departments.
5. Government funding bodies - will learn that fundamental research does actually pay off in many particular cases.

A potential reader from these groups will be assumed to have completed mathematics training up to and including calculus (European high-school or US first-year college level). All other mathematics will be covered as far as needed. The book contains no proofs or other technical material; it is practical.

A short summary

Before a problem can be solved, both it and the tools must be understood. In fact, a correct, complete, detailed and clear description of the problem is (measured in total human effort) often nearly half of the final solution. Thus, we will devote substantial room in this book to understanding both the problems and the tools that are presented to solve them. Indeed, we place primary emphasis on understanding and only secondary emphasis on use. For the most part, ready-made packages exist to actually perform an analysis. For the remainder, experts exist who can carry it out. What cannot be denied, however, is that a good amount of understanding must permeate the relationship between the problem-owner and the problem-solver; a relationship that often encompasses dozens of people for years.

Here is a brief list of the contents of the chapters:

1. What is optimization?
2. What is an optimization problem?
3. What are the management challenges in an optimization project?
4. How can we deal with faulty and noisy empirical data?
5. How do we gain an understanding of our dataset?
6. How is a dataset converted into a mathematical model?
7. How is the optimization problem actually solved?
8. What are some challenges in implementing the optimal solution in industrial practice (change management)?

Most of the book was written by me. Any deficiencies are the result of my own limited mind, and I ask for your patience with them. Any benefits are, of course, obtained by standing on the shoulders of giants and making small changes. Many case studies are co-authored by the management of the relevant industrial corporations; these case studies, too, were written by me, and the same comments apply to them. I heartily thank all co-authors for their participation, and for the trust and willingness to conduct the projects in the first place and to publish them here.

Chapter 8 was entirely written by Andreas Ruff of Elkem Silicon Materials. He has many years of experience in implementing optimization projects' results
in chemical corporations and has written a great practical account of the potential pitfalls and their solutions in change management.

Following this text, we provide first an alphabetical list of all co-authors and their affiliations, and then a list of all case studies together with their main topics and educational illustrations.

Bremen, 2011

Patrick Bangert

Markus Ahorner COO [email protected] Section 4.8, p. 53; Section 4.9, p. 58

algorithmica technologies GmbH Gustav-Heinemann-Strasse 101 28215 Bremen, Germany www.algorithmica-technologies.com

Dr. Patrick Bangert CEO [email protected] All sections

algorithmica technologies GmbH Gustav-Heinemann-Strasse 101 28215 Bremen, Germany www.algorithmica-technologies.com

Claus Borgböhmer Director Project Management [email protected] Section 4.8, p. 53

Sasol Solvents Germany GmbH Römerstrasse 733 47443 Moers, Germany www.sasol.com

Pablo Cajaraville Director Engineering and Sales [email protected] Section 6.6, p. 135

Reiner Microtek Poligono Industrial Itziar, Parcela H-3 20820 Itziar-Deba, Spain www.reinermicrotek.com

Roger Chevalier Senior Research Engineer [email protected] Section 6.10, p. 152

EDF SA, R&D Division 6 quai Watier, BP49 78401 Chatou Cedex, France www.edf.com

Jörg-A. Czernitzky Power Plant Group Director Berlin [email protected] Section 7.10, p. 197

Vattenfall Europe Wärme AG Puschkinallee 52 12435 Berlin, Germany www.vattenfall.de

Prof. Dr. Adele Diederich Professor of Psychology [email protected] Section 7.6, p. 183

Jacobs University Bremen gGmbH P.O. Box 750 561 28725 Bremen, Germany www.jacobs-university.de


Björn Dormann Research Director [email protected] Section 6.6, p. 135

Klöckner Desma Schuhmaschinen GmbH Desmastr. 3/5 28832 Achim, Germany www.desma.de

Hans Dreischmeier Director SAP [email protected] Section 4.9, p. 58

Vestolit GmbH & Co. KG Industriestrasse 3 45753 Marl, Germany www.vestolit.de

Bernd Herzog Quality Control Manager [email protected] Section 4.11, p. 63

Hella Fahrzeugkomponenten GmbH Dortmunder Strasse 5 28199 Bremen, Germany www.hella.de

Dr. Philipp Imgrund Director Biomaterial Technology, Director Power Technologies [email protected] Section 6.6, p. 135
Maik Köhler Technical Expert [email protected] Section 6.6, p. 135

Klöckner Desma Schuhmaschinen GmbH Desmastr. 3/5 28832 Achim, Germany www.desma.de

Lutz Kramer Project Manager Metal Injection Molding [email protected] Section 6.6, p. 135
Guisheng Li Institute Director [email protected] Section 6.12, p. 157

Fraunhofer Institute for Manufacturing and Advanced Materials IFAM Wiener Strasse 12 28359 Bremen, Germany www.ifam.fhg.de

Fraunhofer Institute for Manufacturing and Advanced Materials IFAM Wiener Strasse 12 28359 Bremen, Germany www.ifam.fhg.de

Oil Production Technology Research Institute Plant No. 5 of Petrochina Dagang Oilﬁeld Company Tianjin 300280, China www.petrochina.com.cn

Bailiang Liu Vice Director [email protected] Section 7.9, p. 194

PetroChina Dagang Oilﬁeld Company Tianjin 300280 China www.petrochina.com.cn


Oscar Lopez Senior Research Engineer [email protected] Section 6.6, p. 135
Torsten Mager Director Technical Services [email protected] Section 5.6, p. 102

MIM TECH ALFA, S.L. Avenida Otaola, 4 20600 Eibar, Spain www.alfalan.es
KNG Kraftwerks- und Netzgesellschaft mbH Am Kühlturm 1 18147 Rostock, Germany www.kraftwerk-rostock.de

Manfred Meise CEO [email protected] Section 4.11, p. 63

Hella Fahrzeugkomponenten GmbH Dortmunder Strasse 5 28199 Bremen, Germany www.hella.de

Kurt Müller Director Maintenance [email protected] Section 4.9, p. 58

Vestolit GmbH & Co. KG Industriestrasse 3 45753 Marl, Germany www.vestolit.de

Kaline Pagnan Furlan Research Assistant Metal Injection Molding [email protected] Section 6.6, p. 135

Fraunhofer Institute for Manufacturing and Advanced Materials IFAM Wiener Strasse 12 28359 Bremen, Germany www.ifam.fhg.de

Yingjun Qu Institute Director jyj [email protected] Section 6.12, p. 157
Oil Production Technology Research Institute Plant No. 6 of Petrochina Changqing Oilfield Company 718600 Shanxi, China www.petrochina.com.cn
Pedro Rodriguez Director R&D [email protected] Section 6.6, p. 135

MIM TECH ALFA, S.L. Avenida Otaola, 4 20600 Eibar, Spain www.alfalan.es

Andreas Ruff Technical Marketing Manager [email protected] Chapter 8, p. 201

Elkem Silicon Materials Hochstadenstrasse 33 50674 Köln www.elkem.no

Dr. Natalie Salk CEO

PolyMIM GmbH Am Gefach


[email protected] Section 6.6, p. 135

55566 Bad Sobernheim www.polymim.com

Prof. Chaodong Tan Professor [email protected] Section 6.12, p. 157; Section 7.9, p. 194
Jörg Volkert Project Manager Metal Injection Molding [email protected] Section 6.6, p. 135

China University of Petroleum Beijing 102249 China www.upc.edu.cn Fraunhofer Institute for Manufacturing and Advanced Materials IFAM Wiener Strasse 12 28359 Bremen, Germany www.ifam.fhg.de

Xuefeng Yan Director of Production Technology yxf [email protected] Section 6.12, p. 157
Beijing Yadan Petroleum Technology Co., Ltd. No. 37 Changqian Road, Changping Beijing 102200, China www.yadantech.com
Jie Zhang Vice CEO [email protected] Section 7.9, p. 194

Yadan Petroleum Technology Co Ltd No. 37 Changqian Road, Changping Beijing 102200, China www.yadantech.com

Timo Zitt Director Dormagen Combined-Cycle Plant [email protected] Section 7.11, p. 199

RWE Power AG Chempark, Geb. A789 41538 Dormagen www.rwe.com

The following is a list of all case studies provided in the book. For each study, we provide its location in the text and its title. The summary indicates what the case deals with and what the result was. The "lessons" are the mathematical optimization concepts that this case particularly illustrates.

Self-Benchmarking in Maintenance of a Chemical Plant
Section 4.8, p. 53
Summary: In addition to the common practice of benchmarking, we suggest comparing the plant to itself in the past to create a self-benchmark.
Lessons: The right pre-processing of raw data from the ERP system can already yield useful information without further mathematical analysis.

Financial Data Analysis for Contract Planning
Section 4.9, p. 58


Summary: Based on past financial data, we create a detailed projection into the future in several categories and so provide decision support for budgeting.
Lessons: Discovering the basic statistical features of the data first allows the transformation of ERP data into a mathematical framework capable of making reliable projections.

Early Warning System for Importance of Production Alarms
Section 4.11, p. 63
Summary: Production alarms are analyzed in terms of their abnormality. Thus we only react to those alarms that indicate a qualitative change in operations.
Lessons: Comparison of statistical distributions based on statistical testing allows us to distinguish normal from abnormal events.

Optical Digit Recognition
Section 5.4, p. 92
Summary: Images of hand-written digits are shown to the computer in an effort for it to learn the difference between them without us providing this information (unsupervised learning).
Lessons: It is possible to cluster data into categories without providing any information at all apart from the raw data, but it pays to pre-process this data and to be careful about the number of categories specified.

Turbine Diagnosis in a Power Plant
Section 5.5, p. 96
Summary: Operational data from many turbines are analyzed to determine which turbine was behaving strangely and which was not.
Lessons: Time-series can be statistically compared based on several distinctive features, providing an automated check on the qualitative behavior of the system.

Determining the Cause of a Known Fault
Section 5.6, p. 102
Summary: We search for the cause of a bent turbine blade and do not find it.
Lessons: Sometimes the causal mechanism is beyond the current data acquisition and then cannot be analyzed out of the data. It is important to recognize that analysis can only elucidate what is already there.

Customer Segmentation
Section 5.10, p. 117
Summary: Consumers are divided into categories based on their purchasing habits.
Lessons: Based on purchasing histories, it is possible to group customers into behavioral groups. It is also possible to extract cause-effect information about which purchases trigger other purchases.

Scrap Detection in Injection Molding Manufacturing
Section 6.6, p. 135


Summary: It is determined whether an injection molded part is scrap or not.
Lessons: Several time-series need to be converted into a few distinctive features, to then be categorized by a neural network as scrap or not.

Prediction of Turbine Failure
Section 6.7, p. 140
Summary: A turbine blade tear is correctly predicted two days before it happens.
Lessons: Time-series can be extrapolated into the future and thus failures predicted. The failure mechanism must already be visible in the data.

Failures of Wind Power Plants
Section 6.8, p. 143
Summary: Failures of wind power plants are predicted several days before they happen.
Lessons: Even if the physical system is not stable because of changing wind conditions, the failure mechanism is sufficiently predictable.

Catalytic Reactors in Chemistry and Petrochemistry
Section 6.9, p. 148
Summary: The catalyst deactivation in fluid and solid catalytic reactors is projected into the future.
Lessons: Non-mechanical degradation can be predicted as well and allows for projection more than one year in advance.

Predicting Vibration Crises in Nuclear Power Plants
Section 6.10, p. 152
Summary: A temporary increase in turbine vibrations is predicted several days before it happens.
Lessons: Subtle events that are not discrete failures but rather quantitative changes in behavior can be predicted too.

Identifying and Predicting the Failure of Valves
Section 6.11, p. 155
Summary: In a system of valves, we determine which valve is responsible for a non-constant final mixture and predict when this state will be reached.
Lessons: Using data analysis in combination with plant know-how, we can identify the root cause even if the system is not fully instrumented.

Predicting the Dynamometer Card of a Rod Pump
Section 6.12, p. 157
Summary: The condition of a rod pump can be determined from a diagram known as the dynamometer card. This 2D shape is projected into the future in order to diagnose and predict future failures.


Lessons: It is possible to predict not only time-series but also changing geometrical shapes, based on a combination of modeling and prediction.

Human Brains use Simulated Annealing to Think
Section 7.6, p. 183
Summary: Based on human trials, we determine that human problem solving uses the simulated annealing paradigm.
Lessons: Simulated annealing is a very general and successful method for solving optimization problems that, when combined with the natural advantages of the computer, becomes very powerful and can find the optimal solution in nearly all cases.

Optimization of the Müller-Rochow Synthesis of Silanes
Section 7.8, p. 189
Summary: A complex chemical reaction whose kinetics is not fully understood by science is modeled with the aim of increasing both selectivity and yield.
Lessons: It is possible to construct empirical models without theoretical understanding and still compute the desired answers.

Increase of Oil Production Yield in Shallow-Water Offshore Oil Wells
Section 7.9, p. 194
Summary: Offshore oil pumps are modeled with the aim of both predicting their future failures and increasing the oil production yield.
Lessons: The pumps must be considered as a system in which they influence each other. We solve a balancing problem between them using their individual models.

Increase of coal burning efficiency in CHP power plant
Section 7.10, p. 197
Summary: The efficiency of a CHP coal power plant is increased by 1%.
Lessons: While each component in a power plant is already optimized, mathematical modeling offers added value in optimizing the combination of these components into a single system. The combination still allows a substantial efficiency increase based on dynamic reaction to changing external conditions.

Reducing the Internal Power Demand of a Power Plant
Section 7.11, p. 199
Summary: A power plant uses up some of its own power by operating pumps and fans. The internal power demand is reduced by computing when these should be turned off.
Lessons: We extrapolate discrete actions (turning pumps and fans off and on) from the continuous plant data in order to optimize a financial goal.

Contents

1 Overview of Heuristic Optimization  1
   1.1 What is Optimization?  1
       1.1.1 Searching vs. Optimization  2
       1.1.2 Constraints  3
       1.1.3 Finding through a little Searching  3
       1.1.4 Accuracy  4
       1.1.5 Certainty  4
   1.2 Exact vs. Heuristic Methods  5
       1.2.1 Exact Methods  5
       1.2.2 Heuristic Methods  6
       1.2.3 Multi-Objective Optimization  7
   1.3 Practical Issues  9
   1.4 Example Theoretical Problems  11

2 Statistical Analysis in Solution Space  13
   2.1 Basic Vocabulary of Statistical Mechanics  14
   2.2 Postulates of the Theory  18
   2.3 Entropy  20
   2.4 Temperature  23
   2.5 Ergodicity  25

3 Project Management  29
   3.1 Waterfall Model vs. Agile Model  30
   3.2 Design of Experiments  34
   3.3 Prioritizing Goals  35

4 Pre-processing: Cleaning up Data  37
   4.1 Dirty Data  37
   4.2 Discretization  38
       4.2.1 Time-Series from Instrumentation  38
       4.2.2 Data not Ordered in Time  39


   4.3 Outlier Detection  40
       4.3.1 Unrealistic Data  41
       4.3.2 Unlikely Data  41
       4.3.3 Irregular and Abnormal Data  41
       4.3.4 Missing Data  42
   4.4 Data reduction / Feature Selection  43
       4.4.1 Similar Data  43
       4.4.2 Irrelevant Data  43
       4.4.3 Redundant Data  44
       4.4.4 Distinguishing Features  44
   4.5 Smoothing and De-noising  47
       4.5.1 Noise  47
       4.5.2 Singular Spectrum Analysis  48
   4.6 Representation and Sampling  50
   4.7 Interpolation  51
   4.8 Case Study: Self-Benchmarking in Maintenance of a Chemical Plant  53
       4.8.1 Benchmarking  53
       4.8.2 Self-Benchmarking  54
       4.8.3 Results and Conclusions  56
   4.9 Case Study: Financial Data Analysis for Contract Planning  58
   4.10 Case Study: Measuring Human Influence  62
   4.11 Case Study: Early Warning System for Importance of Production Alarms  63

5 Data Mining: Knowledge from Data  67
   5.1 Concepts of Statistics and Measurement  67
       5.1.1 Population, Sample and Estimation  67
       5.1.2 Measurement Error and Uncertainty  68
       5.1.3 Influence of the Observer  70
       5.1.4 Meaning of Probability and Statistics  71
   5.2 Statistical Testing  73
       5.2.1 Testing Concepts  73
       5.2.2 Specific Tests  75
           5.2.2.1 Do two datasets have the same mean?  75
           5.2.2.2 Do two datasets have the same variance?  76
           5.2.2.3 Are two datasets differently distributed?  76
           5.2.2.4 Are there outliers and, if so, where?  77
           5.2.2.5 How well does this model fit the data?  78
   5.3 Other Statistical Measures  79
       5.3.1 Regression  79
       5.3.2 ANOVA  81
       5.3.3 Correlation and Autocorrelation  84
       5.3.4 Clustering  85
       5.3.5 Entropy  89
       5.3.6 Fourier Transformation  91


   5.4 Case Study: Optical Digit Recognition  92
   5.5 Case Study: Turbine Diagnosis in a Power Plant  96
   5.6 Case Study: Determining the Cause of a Known Fault  102
   5.7 Markov Chains and the Central Limit Theorem  105
   5.8 Bayesian Statistical Inference and the Noisy Channel  107
       5.8.1 Introduction to Bayesian Inference  107
       5.8.2 Determining the Prior Distribution  108
       5.8.3 Determining the Sampling Distribution  110
       5.8.4 Noisy Channels  110
           5.8.4.1 Building a Noisy Channel  111
           5.8.4.2 Controlling a Noisy Channel  112
   5.9 Non-Linear Multi-Dimensional Regression  113
       5.9.1 Linear Least Squares Regression  113
       5.9.2 Basis Functions  114
       5.9.3 Nonlinearity  115
   5.10 Case Study: Customer Segmentation  117

6 Modeling: Neural Networks  121
   6.1 What is Modeling?  121
       6.1.1 Data Preparation  124
       6.1.2 How much data is enough?  125
   6.2 Neural Networks  126
   6.3 Basic Concepts of Neural Network Modeling  129
   6.4 Feed-Forward Networks  131
   6.5 Recurrent Networks  132
   6.6 Case Study: Scrap Detection in Injection Molding Manufacturing  135
   6.7 Case Study: Prediction of Turbine Failure  140
   6.8 Case Study: Failures of Wind Power Plants  143
   6.9 Case Study: Catalytic Reactors in Chemistry and Petrochemistry  148
   6.10 Case Study: Predicting Vibration Crises in Nuclear Power Plants  152
   6.11 Case Study: Identifying and Predicting the Failure of Valves  155
   6.12 Case Study: Predicting the Dynamometer Card of a Rod Pump  157

7 Optimization: Simulated Annealing  165
   7.1 Genetic Algorithms  166
   7.2 Elementary Simulated Annealing  167
   7.3 Theoretical Results  169
   7.4 Cooling Schedule and Parameters  172
       7.4.1 Initial Temperature  173
       7.4.2 Stopping Criterion (Definition of Freezing)  174
       7.4.3 Markov Chain Length (Definition of Equilibrium)  175
       7.4.4 Decrement Formula for Temperature (Cooling Speed)  177
       7.4.5 Selection Criterion  178
       7.4.6 Parameter Choice  178
   7.5 Perturbations for Continuous and Combinatorial Problems  181


   7.6 Case Study: Human Brains use Simulated Annealing to Think  183
   7.7 Determining an Optimal Path from A to B  186
   7.8 Case Study: Optimization of the Müller-Rochow Synthesis of Silanes  189
   7.9 Case Study: Increase of Oil Production Yield in Shallow-Water Offshore Oil Wells  194
   7.10 Case Study: Increase of coal burning efficiency in CHP power plant  197
   7.11 Case Study: Reducing the Internal Power Demand of a Power Plant  199

8 The human aspect in sustainable change and innovation  201
   8.1 Introduction  201
       8.1.1 Defining the items: idea, innovation, and change  202
       8.1.2 Resistance to change  204
   8.2 Interface Management  207
       8.2.1 The Deliberate Organization  207
       8.2.2 The Healthy Organization  209
   8.3 Innovation Management  213
   8.4 Handling the Human Aspect  216
       8.4.1 Communication  217
       8.4.2 KPIs for team engagement  219
       8.4.3 Project Preparation and Set Up  221
       8.4.4 Risk Management  223
       8.4.5 Roles and responsibilities  226
       8.4.6 Career development and sustainable change  228
       8.4.7 Sustainability in Training and Learning  231
       8.4.8 The Economic Factor in Sustainable Innovation  232
   8.5 Summary  234

References  237

Index  243

Chapter 1

Overview of Heuristic Optimization

1.1 What is Optimization?

Suppose we have a function f(x) where the variable x may be a vector of many dimensions. We seek the point x∗ such that f(x∗) is the maximum value among all possible f(x). This point x∗ is called the global optimum of the function f(x). It is possible that x∗ is a unique point, but it is also possible that there are several points that share the maximal value f(x∗). Optimization is a field of mathematics that concerns itself with finding the point x∗ given the function f(x).

There are two fine distinctions to be made relative to this. First, the point x∗ is the point with highest f(x) for all possible x and as such the global optimum. We are usually interested in this global optimum. There exists the concept of a local optimum, which is the point with highest f(x) for all x in the neighborhood of the local optimum. For example, any peak is a local optimum but only the highest peak is the global maximum. Usually we are not interested in finding local optima, but we are interested in recognizing them because we want to be able to determine that, while we are on a peak, there exists a higher peak elsewhere.

Second, the phrase “all possible x” needs careful consideration. Usually any value of the independent variable is allowed, x ∈ [−∞, ∞], but in some cases the independent variable is restricted. Such restrictions may be very simple like 3 ≤ x ≤ 18. Some may be complex by not giving explicit limitations but rather tying two elements of the independent variable vector together, e.g. “
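The distinction between a local and a global optimum can be sketched numerically. The function below is our own made-up example, not one from the text: it has a local peak near x = 2 and a higher, global peak near x = 7, and a coarse grid search over the allowed range locates the global one.

```python
import math

def f(x):
    # two peaks: a local optimum near x = 2 and the global optimum near x = 7
    return math.exp(-(x - 2) ** 2) + 2.0 * math.exp(-(x - 7) ** 2)

# coarse sampling of the allowed range of x
grid = [i / 100.0 for i in range(0, 1001)]   # 0.00, 0.01, ..., 10.00
x_star = max(grid, key=f)                    # best sampled point
```

Near x = 2 the function is locally maximal, yet x_star lands near 7: being on a peak does not mean being on the highest peak.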


p we accept the null hypothesis and otherwise accept the alternative hypothesis. In the following treatment, we will be assuming this method. By using any standard statistical software, you will be able to follow this method. In summary, this is the method: 1. Compute the test statistic and call the value x,

5.2 Statistical Testing


2. Compute the probability P(X < x),
3. Choose a significance level 1 − p, and
4. If P(X < x) > p, accept the null hypothesis and otherwise accept the alternative hypothesis.

In the next section, we will state a few specific tests. For each, we will give the formulas for computing the test statistic, the distribution function, and the identity of the null and alternative hypotheses. This makes the above method definite apart from choosing p, which must be done in dependence upon the practical problem at hand.

5.2.2 Specific Tests

Please note that statistical theory has constructed a great many tests for various purposes. There are sometimes even several tests for a particular purpose. This book does not mean to give an exhaustive treatment. We will give a test for those questions that we consider relevant for basic data analysis in the process industry.

5.2.2.1 Do two datasets have the same mean?

For the two datasets A and B, which are thought to have the same variance, we compute the t-statistic,

t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\left[\frac{\sum_{i \in A}(x_i - \bar{x}_A)^2 + \sum_{i \in B}(x_i - \bar{x}_B)^2}{N_A + N_B - 2}\right]\left(\frac{1}{N_A} + \frac{1}{N_B}\right)}}

and the distribution function

P(X < t) = \frac{1}{\sqrt{\nu}\, B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \int_{-t}^{t} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} dx

where the number of degrees of freedom \nu = N_A + N_B - 2, B(\cdots) is the beta function, and N_A and N_B are the number of observations in either dataset. If the two datasets do not have the same variance, the t-statistic is

t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\sigma_A^2/N_A + \sigma_B^2/N_B}}

with \sigma_A^2 the variance of dataset A, and the number of degrees of freedom is

\nu = \frac{\left(\frac{\sigma_A^2}{N_A} + \frac{\sigma_B^2}{N_B}\right)^2}{\frac{(\sigma_A^2/N_A)^2}{N_A - 1} + \frac{(\sigma_B^2/N_B)^2}{N_B - 1}}

5 Data Mining: Knowledge from Data

while the distribution function remains unchanged. The null hypothesis is that the means are the same and the alternative hypothesis is that the means are different.
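The two t-statistics above can be sketched in pure Python; the function names are ours, not from the text, and each returns the statistic together with its degrees of freedom.

```python
import math

def pooled_t(a, b):
    # t-statistic for two datasets thought to have the same variance
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sd = math.sqrt((ssa + ssb) / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    return (ma - mb) / sd, na + nb - 2           # t and nu = N_A + N_B - 2

def welch_t(a, b):
    # t-statistic and effective degrees of freedom for unequal variances
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    nu = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, nu
```

The resulting t would then be fed into the distribution function above to obtain P(X < t).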

5.2.2.2 Do two datasets have the same variance?

For the two datasets A and B, which are thought to have the same mean, we compute the F-statistic,

F = \frac{\sigma_A^2}{\sigma_B^2}

where \sigma_A^2 > \sigma_B^2. The distribution function is

P(X < F) = 2 - I_{\nu_B/(\nu_B + \nu_A F)}\!\left(\frac{\nu_B}{2}, \frac{\nu_A}{2}\right) - I_{\nu_A/(\nu_A + \nu_B F)}\!\left(\frac{\nu_A}{2}, \frac{\nu_B}{2}\right)

with I(\cdots) the incomplete beta function. The null hypothesis is that they have equal variances and the alternative hypothesis is that the variances are different.
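Computing the F-statistic itself is a one-liner once the sample variances are known; this small sketch (function name is ours) also returns the two degrees of freedom, putting the larger variance on top as required.

```python
def f_statistic(a, b):
    # F = larger sample variance over smaller, with the matching dof pair
    def var(x):
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    va, vb = var(a), var(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1
```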

5.2.2.3 Are two datasets differently distributed?

There are different approaches depending on the nature of the two distributions. We have to answer whether we are comparing an empirical distribution to a theoretically expected distribution or to another empirical distribution. We also have to answer whether the empirical data is in the form of binned data or available as a continuously valued distribution. Note that while binning involves a loss of information and an arbitrary choice of bins, it is necessary if the dataset is in itself not a distribution and will thus convert the dataset into a probability distribution. If possible, one should not bin datasets. From these two questions, we arrive at four possibilities. In all cases the null hypothesis is that the two sets are equally distributed and the alternative hypothesis is that they are differently distributed. Note that this test makes no statements as to how they are distributed but merely as to same or different.

One binned empirical distribution against a theory: The empirical distribution has n_i observations in bin i where we expect to find m_i observations, and so we create the chi-squared statistic

\chi^2 = \sum_i \frac{(n_i - m_i)^2}{m_i}

with distribution

P(X < \chi^2) = \frac{1}{\Gamma(a)} \int_0^{\chi^2/2} e^{-t}\, t^{a-1}\, dt

where a is half the number of degrees of freedom. The degrees of freedom are the number of bins used minus the number of constraint equations imposed on the theory, e.g. that the sum of expected bin counts over the theory equals that over the empirical data, \sum m_i = \sum n_i.

Two binned empirical distributions: When the first distribution has n_i observations in bin i and the second has m_i observations, the chi-squared statistic is

\chi^2 = \sum_i \frac{\left(\sqrt{M/N}\, n_i - \sqrt{N/M}\, m_i\right)^2}{n_i + m_i}

where N = \sum n_i and M = \sum m_i. The distribution function is the same as above.

One continuous empirical distribution against a theory: The empirical distribution n(x) is compared to a theory m(x) by computing the very simple Kolmogorov-Smirnov statistic

D = \max_{-\infty < x < \infty} |n(x) - m(x)|

If ESD_i > \lambda_i for a particular i, then we look for the largest such i, i.e. l = \max\{i \mid ESD_i > \lambda_i\}, and declare the l points x_{ext}(S_0), x_{ext}(S_1), \ldots, x_{ext}(S_{l-1}) as outliers.
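The two binned chi-squared statistics of this section can be sketched directly from their formulas; function names are ours, not from the text.

```python
import math

def chi2_vs_theory(n, m):
    # one binned empirical distribution n_i against expected counts m_i
    return sum((ni - mi) ** 2 / mi for ni, mi in zip(n, m))

def chi2_two_binned(n, m):
    # two binned empirical distributions, with the sqrt(M/N), sqrt(N/M) scaling
    N, M = sum(n), sum(m)
    return sum((math.sqrt(M / N) * ni - math.sqrt(N / M) * mi) ** 2 / (ni + mi)
               for ni, mi in zip(n, m) if ni + mi > 0)
```

Bins in which both counts are zero carry no information and are skipped, so they do not divide by zero.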

1 We start with the entire dataset, compute its mean, and select the point that is furthest from the mean on either side. We remove this point to get dataset S1. We recompute the mean, select the point that is now furthest from the mean, and proceed like this. If we are only interested in removing outliers from one end of the distribution, we must only search for points on the interesting end.

5.2.2.5 How well does this model fit the data?

Suppose we have experimental data (x_i, y_i) for many i. We then suppose that the function f() represents the relationship between x_i and y_i to within our desired accuracy in the manner that y_i ≈ f(x_i). Note that it is unrealistic to expect that the equality be precise, i.e. y_i = f(x_i). The reason is simply that experimental data is always subject to measurement errors and, generally, no model considers all effects. Generally, a model f() has parameters a_1, a_2, ..., a_m that must be determined in some manner. The fitting problem consists of finding the values of these parameters such that the model fits the data optimally, i.e. that ∑_i (f(x_i) − y_i)² is a minimum over all possible sets of parameters. This approach is called the least squares fitting approach, as one attempts to find the least sum over squared terms. One typically finds this method illustrated in books in the context of fitting a straight line, f(x) = mx + b with m and b being the parameters, but the approach is quite general. See section 5.3.1 for this. The approach of least squares only prescribes the utility function (the above sum over squares) and not the method for finding the parameters. Finding these is a different issue and we must generally use a full optimization algorithm to do it. Even though one will usually see least squares fitting in the context of straight lines, please note that straight lines are rare in real life situations. We almost always encounter non-linear situations and so f() must be a non-linear function. The methods to find the parameters must then take account of this. The optimization methods discussed in chapter 7 can handle all such situations. Clearly, we can use the sum ∑_i (f(x_i) − y_i)² as a score to rank different parameter assignments and choose the best one; that is the least squares method. However, when we have chosen the best one and have finished the least squares method, we are left with the questions: Is the f() really representative of the data? Could a different


f() have represented the data better? Is this least squares sum "good enough" for our practical purpose? The least squares approach makes no statements to this effect, as it considers a single f() that is given to it by the human operator of the method. It is we who must choose the f(), and here lies the magic of modeling. If you have chosen the modeling function f() well, the modeling and optimizing are merely laborious steps that will lead you to your goal; if you have chosen f() poorly, those steps will be a waste of time. Thus, it is essential to verify that the model fits the data. For this purpose, we will use the chi-squared test. First, we compute the chi-squared statistic

\chi^2 = \sum_{i=1}^{n} \frac{(y_i - f(x_i))^2}{f(x_i)}.

The probability distribution is the same as the one from section 5.2.2.3,

P(X < \chi^2) = \frac{1}{\Gamma(a)} \int_0^{\chi^2/2} e^{-t}\, t^{a-1}\, dt

where a is half the number of degrees of freedom. The hypothesis that the model does indeed correctly represent the data collected is to be accepted if this probability is larger than your chosen significance level (usually 0.95 or 0.99); otherwise the model is a poor one at this significance level. Note that this method makes no statements about how to fix the problem if your model is poor. You must choose another one by yourself and try again!
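The fit statistic itself is a one-line sum; this sketch (function name is ours) makes the point that a better model produces a smaller chi-squared value.

```python
def chi_squared_fit(xs, ys, f):
    # chi-squared statistic for a fitted model: sum of (y_i - f(x_i))^2 / f(x_i)
    return sum((y - f(x)) ** 2 / f(x) for x, y in zip(xs, ys))
```

For data generated exactly by y = 2x, the model f(x) = 2x scores zero, while f(x) = 2x + 1 scores strictly worse.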

5.3 Other Statistical Measures

5.3.1 Regression

The process of fitting a model to data described loosely above in section 5.2.2.5 is often called regression. The term regression, with its generally unflattering connotations, derives from the first academic use of the method: to describe the phenomenon that the descendants of tall ancestors tend to get shorter and approach the mean height of the population over the generations. This term is very commonly used for the general problem of fitting a model to data. Often the word is used in the context of straight lines. If it is used in a wider context, the literature generally speaks of non-linear regression, which then refers to the general fitting problem. Here, we will briefly discuss a few special cases. We will present only the result, i.e. the formulas by which to compute the parameter values. If you are interested


in how these were arrived at, there are many books that will describe this in detail, e.g. [62]. Recall that, with fitting, we are concerned with determining the values of the parameters from empirical data given a known model. The process of deciding on the model is generally a human decision with all of its subjective alchemical features, leaving it impossible to treat in a mathematical book. We will suppose that your data includes N observations in the form (x_i, y_i), where we will suppose that the y_i are dependent upon the x_i in some way that we wish to model. Both the x_i and the y_i may be vectors and need not be single values. The model takes the general form y = f(x). It is our hope that y_i ≈ f(x_i) with a "good" accuracy. Presumably, we have decided what accuracy we require for our practical purpose. We will also assume that the y_i are empirically measured quantities that have a known measurement error σ_i inherent in them, so that each measurement is actually y_i ± σ_i. Should you wish to ignore this feature, please set σ_i = 1 for all i in the formulas below. Should you choose a straight line, f(x) = mx + b, you will need to determine two parameters from the data: the slope m and the y-intercept b. This is how:

\Delta = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{N} \frac{x_i}{\sigma_i^2}\right)^2    (5.1)

b = \left(\sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2} \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2} - \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2}\right) / \Delta    (5.2)

m = \left(\sum_{i=1}^{N} \frac{1}{\sigma_i^2} \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2} - \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2}\right) / \Delta    (5.3)
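Equations (5.1)-(5.3) translate directly into code; the function name is ours, and passing no measurement errors corresponds to setting σ_i = 1 as the text suggests.

```python
def fit_line(xs, ys, sigmas=None):
    # weighted least-squares straight line y = m*x + b, per equations (5.1)-(5.3)
    if sigmas is None:
        sigmas = [1.0] * len(xs)          # sigma_i = 1: unweighted fit
    w = [1.0 / s ** 2 for s in sigmas]
    S   = sum(w)
    Sx  = sum(wi * x for wi, x in zip(w, xs))
    Sy  = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = S * Sxx - Sx ** 2             # equation (5.1)
    b = (Sxx * Sy - Sx * Sxy) / delta     # equation (5.2)
    m = (S * Sxy - Sx * Sy) / delta       # equation (5.3)
    return m, b
```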

Please note that many formulas that do not appear to be linear at first sight actually are after the variable has been transformed, e.g.

y = e^{a+x} + b \;\rightarrow\; y = a'z + b \quad \text{with } a' = e^a \text{ and } z = e^x    (5.4)
y = (ax)^c + b \;\rightarrow\; y = a'z + b \quad \text{with } a' = a^c \text{ and } z = x^c    (5.5)
y = a b^x \;\rightarrow\; y' = a' + b'x \quad \text{with } y' = \ln y,\; a' = \ln a \text{ and } b' = \ln b.    (5.6)

Suppose that this model is not sufficient and you would like to make things more interesting. General linear least squares focuses on models that are linear (in the parameters) but may take non-linear basis functions. An example is the polynomial, y = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots + a_n x^{n-1}, but generally any function linear in the parameters is fine. The most general form is

y = \sum_{i=1}^{M} a_i X_i(x)


where the X_i(x) are functions of x only and have no unspecified parameters. To obtain the unknown parameter values a_i, we go through the following steps,

A_{ij} = \frac{X_j(x_i)}{\sigma_i}, \text{ the design matrix}    (5.7)
b_i = \frac{y_i}{\sigma_i}, \text{ the result vector}    (5.8)
a_i = a_i, \text{ the solution vector}    (5.9)
\lambda = A^T \cdot A    (5.10)
\beta = A^T \cdot b    (5.11)
a = \lambda^{-1} \cdot \beta.    (5.12)
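Steps (5.7)-(5.12) can be sketched for a small, well-conditioned problem; for such cases a direct Gaussian-elimination solve of λ · a = β is adequate, although for large systems the SVD route recommended in the text below is preferable. Function names are ours.

```python
def solve(mat, vec):
    # Gaussian elimination with partial pivoting for the small system lambda * a = beta
    n = len(vec)
    M = [row[:] + [v] for row, v in zip(mat, vec)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (M[r][n] - sum(M[r][c] * a[c] for c in range(r + 1, n))) / M[r][r]
    return a

def general_linear_fit(xs, ys, basis, sigmas=None):
    # build the design matrix A (5.7), form lambda = A^T A (5.10) and
    # beta = A^T b (5.11), then solve for the coefficients a (5.12)
    if sigmas is None:
        sigmas = [1.0] * len(xs)
    A = [[f(x) / s for f in basis] for x, s in zip(xs, sigmas)]
    b = [y / s for y, s in zip(ys, sigmas)]
    m, n = len(basis), len(xs)
    lam = [[sum(A[i][j] * A[i][k] for i in range(n)) for k in range(m)] for j in range(m)]
    beta = [sum(A[i][j] * b[i] for i in range(n)) for j in range(m)]
    return solve(lam, beta)
```

For example, fitting the basis {1, x, x²} to data generated by y = 1 + 2x + 3x² recovers the coefficients (1, 2, 3).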

Please note that the matrix A should have more rows than columns, as we should have more data points than unknown parameters. All steps are easy except for the last, which requires us to compute λ⁻¹. As this matrix is generally large, an explicit inversion is not a good idea for many numerical reasons. The solution of the implicit equation λ · a = β for a must therefore be done numerically. We suggest singular value decomposition (SVD) as the method of choice as it deals well with the round-off errors that accumulate. Explaining the SVD method would go beyond the scope of this book. Any linear algebra book will explain this in painful detail, e.g. [122]. This paragraph will just explain the procedure once this decomposition is done. So, we assume that A has been SVD decomposed into A = U · W · V^T, where W is the diagonal matrix of singular values, U is column orthogonal and V is orthogonal. Then,

a = \sum_{i=1}^{M} \left(\frac{U_{(i)} \cdot b}{W_{ii}}\right) V_{(i)}

where the subscript (i) refers to taking the i-th column of the respective matrix.

5.3.2 ANOVA

ANOVA is an abbreviation for analysis of variance, a very popular method that is also very prone to misunderstandings. Note carefully that ANOVA is not a statistical test. That is, it does not answer any question with a yes or no answer, nor does it confirm or deny any hypotheses. The method is used to investigate the effect of qualitative factors on a quantitative result. Should you have quantitative factors, you can always suppress this by grouping them into low-middle-high or good-bad groups and use ANOVA to deal with that. The principal assumption behind ANOVA is that the relationship between factors and result is linear. This is a crucial assumption as many relationships will be known to be non-linear, in which case this method will do you no good. Historically, the method was devised as a part of the theory of the design of experiments in


which we plan experiments before carrying them out in order to reduce the work to a minimum while gathering all required data for a certain desired future analysis. The basic idea of ANOVA is to see if the variance of the result can be explained on the basis of the differing categories of the factors. For this, we make experiments and compute variances within and between groups of categories to see if they differ signiﬁcantly. This allows conclusions as to whether the division of a factor into its categories is sensible or useful with respect to the result. For instance we may use ANOVA to compare a control group to an experimental group to see if the factor that is different really causes some observable difference in some result. As such, ANOVA is a central and commonly used tool in many experimental studies (particularly involving people). Because it is based on categorical variables rather than numerical ones, it is naturally suited towards the social studies even though it is also used in the sciences. One use to which we may put ANOVA is the attribution of causes to effects. We may for instance ﬁnd that there is a link between touching dirt and washing hands and we may ﬁnd that this link has a preferred time-wise direction: Generally touching dirt occurs before washing hands. This may lead us to believe that touching dirt is a cause for washing hands. ANOVA allows us to draw this conclusion in a well-deﬁned procedural way. As such it is useful to deal with situations in which we lack the common sense that we all have in relation to dirt and washing. In order to use the method, the experimental data must satisfy a few conditions. If the data does not satisfy these, then the method is unusable. This is also the reason for which you should plan your experiments, using design-of-experiments methods, before you make them. The conditions are 1. The observations in any one group should be homogeneously distributed. 
This means that if we break our group of measurements into several (arbitrary) smaller groups, these should not differ in terms of their natural variation or distribution. A common example of when this does not hold is for a time series in which observations early in time are tightly clustered around a mean and then, as time passes, get less and less clustered around the mean. This group of measurements is not homogeneously distributed but heterogeneously distributed. The technical terms for this are homoscedasticity versus heteroscedasticity. There are various tests to conﬁrm or deny this but these would go beyond the scope of this book. You will ﬁnd them in many statistical books, e.g. [71]. 2. The observations over the groups of any one factor are assumed to be normally distributed. If, for example, we have a factor with three groups (high-middlelow), then we would expect to ﬁnd an equal number of observations in the high and low groups and a correspondingly larger amount in the middle category. 3. The observations must be independent of each other. In a time series, for example, this is deﬁnitely not true as the observation at a later time generally depends causally (often in a known way) on the observation at an earlier time. 4. If you have more than one factor, it is extremely helpful to have an equal number of observations in each combination of groups of all the factors. For example if we have two factors and each factor has three groups (high-middle-low), then we

have nine combinations of these groups (high-high, high-middle, high-low, ...). This is called balance. It is not a strict requirement but life is easier if it is true.

It can easily be seen that if we are strict about these conditions (especially 2 and 3) most datasets cannot be analyzed with ANOVA. We urge you here, without any sarcasm, to consider the principle "Never trust a statistic you didn't fake yourself" attributed to Winston Churchill. Should you wish to continue beyond this point, we must now set up the presumed model. Here we will only treat the case of a single factor. This is not very realistic but treating more complex cases would go beyond the scope of this book, see e.g. [87]. We then assume that the relationship between the result y and the factor is given by

y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \quad \text{for } i = 1, 2, \cdots, k; \; j = 1, 2, \cdots, n_i

where y_{ij} is the result and is normally distributed within each group, μ is the mean of the entire dataset, α_i is the effect of the i-th factor group, ε_{ij} is a catch-all for all sorts of external random disturbances to our experiment, k is the number of groups in our factor, and n_i is the number of observations in the i-th group. Practically, we must now compute the variance in each group σ_i², the means in each group \bar{x}_i and the variance of the entire dataset σ². Then we compute the F statistic

F = \frac{n_1 n_2 (\bar{x}_1 - \bar{x}_2)^2}{(n_1 + n_2)\,\sigma^2} = \frac{n_1 n_2 (\bar{x}_1 - \bar{x}_2)^2}{n_1 \sigma_1^2 + n_2 \sigma_2^2}

to compare two groups with each other. To finish the F-test, we must look up the value of the probability distribution of the F statistic

P(X < F) = 2 - I_{\nu_1/(\nu_1 + \nu_2 F)}\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right) - I_{\nu_2/(\nu_2 + \nu_1 F)}\!\left(\frac{\nu_2}{2}, \frac{\nu_1}{2}\right)

with I(···) the incomplete beta function and ν_1 and ν_2 being the number of degrees of freedom in each group. The null hypothesis in this case was that the two variances are the same. If we reject this, we must conclude that the variances are indeed different. In this case, we would conclude that the factor does indeed have an influence upon the result. Note that we did not say anything about the nature or strength of this influence. The significance level at which this can be claimed is important as this can be interpreted to be the degree to which this factor can be used to explain the variation in the result. In this case, we used the F-test and this is quite frequent in ANOVA. Even though we ended up using an F-test, please note that ANOVA itself is not a test and involves a lot more than a test. For realistic cases, we would, of course, use tests with several factors but this goes beyond the level of this treatment, which is intended more as a caveat than a genuine introduction.
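The single-factor, two-group F statistic above can be sketched directly from its second form; the function name is ours, and the within-group variances are taken as population variances (divided by n) as in the text's σ_i².

```python
def anova_f_two_groups(g1, g2):
    # F statistic from the text: n1 n2 (mean1 - mean2)^2 / (n1 var1 + n2 var2)
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((x - m1) ** 2 for x in g1) / n1   # population variance within group 1
    v2 = sum((x - m2) ** 2 for x in g2) / n2   # population variance within group 2
    return n1 * n2 * (m1 - m2) ** 2 / (n1 * v1 + n2 * v2)
```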


5.3.3 Correlation and Autocorrelation

The concept of correlation is very simple. When one thing is changed, does another thing change as a consequence? If yes, these are correlated. For example, if we increase the temperature in a vessel, then the pressure will rise (supposing nothing else was changed as well). Thus, these are correlated under these circumstances. In the ideal gas situation, the variation of pressure and temperature is linear and positive: if pressure rises by a factor a, then temperature will also rise (positive correlation) by the same factor a (linear correlation). The correlation is so strong that, if volume is not modified, one variable will be enough to compute the value of the other. We measure correlation strength numerically on a scale between −1 and 1. In the above example, the correlation would be 1. If we compare the effect of studying hard on exam results, we would expect a correlation to exist and for this correlation to be positive, but it certainly will not be 1 because certain people are more prone to the subject and will thus get higher grades than others with the same amount of studying. The effect of sunlight exposure on the water level in a glass is negative: as the sun continues to shine, the water level drops due to evaporation. Two variables x and y given by a group of measurements (x_i, y_i) have a linear correlation coefficient or Pearson's r of

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}

where the overline indicates taking an average. You should not try to determine whether the correlation is significant based on r; it is not suitable for that. However, if the correlation is significant, r is a good measure of its strength. Depending on your field of inquiry, various r values are considered "good" by the community. If you are a physicist, then you would be looking for r = 0.99 or so to explain an effect. Dealing with experimental data from a real life situation (as opposed to laboratory conditions), you should be happy with r = 0.8 or so. Many studies based on questionnaires among humans will try to interpret correlations of r = 0.4 and sometimes lower to have sensible meaning. Sometimes working with such low correlations is necessary when the influencing factors are too many or cannot all be measured. In our efforts to make a mathematical model of a natural phenomenon that is known via experimental data from an industrial process, we should be happy with r between 0.8 and 0.9. If we achieve an r higher than this, we should suspect ourselves of over-fitting the model, i.e. introducing so many parameters that the model can memorize the data instead of extrapolating intelligently. Such over-fitting is a modeler's suicide and must be avoided.

Note that the above formula is the linear correlation coefficient. This assumes that the relationship between the two variables is linear. This is an important restriction as few real life relationships are linear. In order to use a non-linear correlation, we must first specify the exact form of the relationship that we believe to hold. For this reason, we cannot treat this here; it must be done on a case by case basis. For a quick-and-dirty check whether two variables have anything at all to do with each other, the linear coefficient is a reasonable thing to compute. Just don't base any important arguments on it.

In industrial practice, we most often deal with time-series. This is a variable that depends on or changes with time. A very interesting question about such a time-series is whether we can get a feeling for future values based on past values, i.e. knowing x(0), x(1), x(2), ..., x(t), can we make a reasonable guess at x(t + 1) and so on? To answer this, we will be asking how a variable correlates with itself at an earlier time. This is called the autocorrelation function R(τ),

R(\tau) = \int_{-\infty}^{\infty} x(t)\, x(t - \tau)\, dt.

The new variable τ is called the lag of the time-series. Usually, we normalize the autocorrelation such that R(0) = 1. The autocorrelation indicates the correlation of the variable with itself at different times. The value R(τ) is a measure of the influence that the value x(t − τ) has on the value of x(t). Take the example of a retail business measuring its total revenue once per month. Business is mostly stable, except in December, when the Christmas business doubles revenue. We will therefore observe that R(12) for this time-series will be much higher than the rest of the autocorrelation function. This indicates that there is a strong cyclic behavior with time-lag 12 (measured in months due to the data taking cadence), which agrees with our expectation. You will probably also see smaller bumps at time-lags of 4 and 8. These will be due to the Easter business (Easter is usually 4 months after Christmas and it usually takes 8 months from Easter to Christmas). Thus, the strength of the Christmas business may be used somewhat to predict the strength of next Easter's business and also next Christmas' business. That is the point of autocorrelation. The score R(τ) can only be interpreted relatively and offers a useful indication for further investigation.

Note that autocorrelation is time directed. That is, we measure the correlation that past events have with future events. Thus, autocorrelation is a measure of the strength of predictability: knowing a historical fact, how reliable is an estimate of future performance on its basis (relative to knowing the future fact by simply waiting for it, which is equal to 1 by normalizing the function)? Thus, autocorrelation identifies cause-effect relationships within a single variable's time-series.
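Both measures of this section can be sketched for discrete series; the function names are ours, and the autocorrelation is the discrete analogue of the integral above, normalized so that R(0) = 1.

```python
import math

def pearson_r(xs, ys):
    # linear correlation coefficient (Pearson's r)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

def autocorrelation(x, tau):
    # discrete analogue of R(tau), normalized so that R(0) = 1
    r0 = sum(v * v for v in x)
    return sum(x[t] * x[t - tau] for t in range(tau, len(x))) / r0
```

For an alternating series such as 1, −1, 1, −1, ..., the autocorrelation at even lags exceeds that at odd lags, exactly the kind of cyclic signature the Christmas example describes.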

5.3.4 Clustering

Suppose that you have many observations of some process and that each observation is a vector of values. We may now wish to group observations into a few qualitatively


distinct categories in order to generate some form of understanding about the underlying dynamics that produce the observations. A simple example is a group of people visiting a store. Each purchase action is an observation. The observation itself is a vector of purchased goods. We now want to group the many observations made over some time interval into qualitative groups, e.g. “health conscious client” or “ready made meals client.” These groups can then be described both by their purchase habits (what, how often, in which combinations ...) as well as by their economic impact (what revenue, what margins ...) in order to draw conclusions about possible changes to marketing or the store. Each cluster should be as homogeneous as possible and the clusters should be as heterogeneous between each other as possible. The action of grouping vectored observations into phenomenological groups is called clustering. The manner in which observations are clustered has some elements that are common to all methods. Beyond these common elements, there are many algorithms to accomplish a clustering and it is difﬁcult to say a priori which method is best. A major reason for this, in practice, is that it is frequently not clear how to deﬁne or measure what is “best” and this emerges only empirically once the results of several methods have been compared by persons with signiﬁcant domain knowledge. First, all methods require a metric. A metric is a method to compute a distance measure between two observations. Some types of observation (locations and the like) have an inherent sensible distance measure but others (option A instead of option B) are difﬁcult to tie to the concept of distance. Since we are comparing vectors of values, the issue of comparing apples with pears is also a signiﬁcant problem. For instance if we measure both temperature and pressure of something in a two-dimensional vector, how are we going to measure the distance between two sample vectors? 
The physical units in which these quantities are measured become important. For example if we measure temperature in degrees Kelvin as opposed to degrees Celsius, then the numerical values of all temperatures will be much higher. In any normal distance measurement, the temperature will then be more signiﬁcant and could thus potentially skew the results. Thus, the design of the metric is an issue that must be resolved carefully in a practical application. Second, each clustering creates so called centers. These are points in the multidimensional space deﬁned by the observational vectors that indicate the position of the “center” of the cluster. Each cluster has a radius around this center (as measured by the metric function) and all observations within this sphere belong to that center. We may thus list the observations per center and, via descriptive statistics, arrive at a description of each group. This description can then be interpreted and thus knowledge may be derived. Third, some observations may not lie inside any center and thus be called outliers. Generally these outliers are few observations that, for one reason or another, are sufﬁciently atypical that they do not belong to a cluster and also sufﬁciently few in number not to justify forming a new cluster (or several new clusters depending on the metric function) for them.
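The unit-sensitivity of the metric discussed above can be demonstrated with a small sketch: standardizing each feature to zero mean and unit variance (a common, though not the only, remedy; the function name and the sample temperatures are ours) makes the distances independent of whether temperature was recorded in kelvin or degrees Celsius.

```python
def zscores(values):
    # standardize one feature to zero mean and unit variance
    m = sum(values) / len(values)
    s = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - m) / s for v in values]

temps_c = [20.0, 30.0, 40.0]           # the same physical temperatures ...
temps_k = [293.15, 303.15, 313.15]     # ... expressed in kelvin
```

Because a z-score is invariant under a per-feature shift and rescaling, both unit choices yield identical standardized values, and hence identical contributions to any distance computed from them.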


Outliers are very interesting points as they indicate an abnormal observation. As outliers skew statistical results, it is important to look at these points in detail to determine if they are indeed genuine observations. It is possible that outliers are produced by some form of error in the observation process and would then be excluded from further treatment. If an outlier is a genuine and correct observation, then it offers insights into abnormal events. This may be important for the practical application for a variety of reasons such as capacity planning, which must orient itself on the events that are maximally taxing and so, by deﬁnition, abnormal. Fourth, the number of centers is often a crucial point. In many practical applications, a central question is: Into how many groups is it most sensible to divide the observation? The disadvantage of most clustering methods is that the number of centers must be speciﬁed by the user before clustering is begun. In practice then, clustering is re-run with several different settings for the number of centers and each result is examined for “sensibility” whatever this may mean in the practical application at hand. Simple versions of a sensibility deﬁnition include (1) a homogeneous distribution of observations within each cluster, (2) no cluster with less/more than a speciﬁed percentage of all observations, (3) less than some speciﬁed percentage of outliers and (4) a certain speciﬁed minimum distance between centers to make sure that clusters are sufﬁciently distinct. The idea of k-means requires the number of centers, k, to be speciﬁed by the user. It is an optimization problem that requires the mean-squared distance of each observation to its nearest center to be minimized by moving the k centers around in the multi-dimensional space provided by the observations. Please note that k-means clustering is not an algorithm but a problem speciﬁcation. 
There are several algorithms that can be used to solve the problem described above. In fact, common optimization algorithms may profitably be used for this purpose. There is a specific algorithm, called Lloyd's algorithm, that was invented just for this purpose: (1) Assign the observations to the centers randomly, (2) compute the location of each center as the centroid of the observations associated with it, (3) move each observation to the center that it is closest to, (4) repeat steps 2 and 3 until no re-assignments of observations to centers are made. This algorithm is simple but it finds only a local minimum. In order for us to find a global minimum, this algorithm is generally enhanced by an incorporation of simulated annealing (see chapter 7).

A sample output of a k-means clustering run is shown in figure 5.1. The data is two-dimensional in this case to make drawing an image easier; in practice the number of dimensions would generally be large. The metric used here is the Euclidean metric, in which the straight line between two points is the shortest distance; this will be inappropriate for many practical applications.

88

5 Data Mining: Knowledge from Data

Fig. 5.1 The output of a k-means clustering with two clusters (top image) and five clusters (bottom image) on a two-dimensional dataset with the Euclidean distance metric. The centers are marked with stars and the observations with circles.

Given this output, the question is what to do with it. Clustering means that the observations associated with a particular cluster are in some sense similar and observations associated with different clusters are in the same sense different. The "sense" indicated here is principally measured by the metric function. The metric function is, however, a stepping stone and not the result because it is a convenient numerical measure to help the algorithm; it does not help human understanding. What is needed at this point is to describe each cluster in such a way as to be meaningful to a human being who is charged with interpreting the data. We suggest extracting some of the following statistics, in general, for each cluster: (1) the position of the center, (2) the radius in each dimension as a measure of how large the cluster is, (3) the number of points belonging to the cluster, (4) the mean and variance over the observations in each cluster, to be compared to the center position and radius as a measure of how tightly clustered the cluster really is.

A comparison and critical examination of this data will allow one to discover a number of generally useful conclusions. First, are there artifacts in the data? Artifacts are all those features of the dataset that are not intended to be there, are strange and are thus to be excluded. With clustering one is likely to get artifacts from three sources: bad data (solution: better pre-processing, see chapter 4), outliers (solution: critically examine outliers and possibly exclude them) and an incorrect number of centers (solution: change k). One can detect simple artifacts by looking at how clearly clusters are separated from each other. An example will illustrate the point: Two different large cities (e.g. Paris and London) are clusters of houses and are well separated, as there are large tracts of virtually houseless land in between them. However, two suburbs of London are also clusters of houses but they are not well separated – their distinction is a purely administrative one and it is not immediately visible to the tourist that one section stops and another starts. Thus, mathematically speaking, the division of a city into suburbs is an artifact that we would want to exclude (with respect to a certain viewpoint of finding clusters of houses). Clustering is there to discover meaningful qualitative differences.
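Lloyd's algorithm as described above can be sketched in a few lines; the following is a minimal illustration with the Euclidean metric, using best-of-several random restarts as a crude stand-in for the simulated-annealing enhancement (the function name and data are our own):

```python
import math
import random

def lloyd_kmeans(points, k, seed=0, max_iter=100):
    """One run of Lloyd's algorithm with the Euclidean metric.

    Returns (centers, assignment, sse); finds only a local minimum."""
    rng = random.Random(seed)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Step 1: assign the observations to the centers randomly.
    assign = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iter):
        # Step 2: each center becomes the centroid of its observations.
        centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if not members:                     # empty cluster: reseed it
                members = [rng.choice(points)]
            centers.append(tuple(sum(col) / len(members)
                                 for col in zip(*members)))
        # Step 3: move each observation to its nearest center.
        new_assign = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        # Step 4: repeat until no re-assignments are made.
        if new_assign == assign:
            break
        assign = new_assign
    sse = sum(dist(p, centers[a]) ** 2 for p, a in zip(points, assign))
    return centers, assign, sse

# Restarting from several random assignments and keeping the lowest
# mean-squared distance crudely approximates a global search.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
centers, labels, sse = min((lloyd_kmeans(data, 2, seed=s) for s in range(5)),
                           key=lambda r: r[2])
```

On this well-separated toy dataset, the best restart recovers the two intended groups; in a real application the Euclidean metric would be replaced by one appropriate to the units of the observations.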


Second, which dimensions principally distinguish the clusters, i.e. which attributes are sufficiently telling about an observation? In practice, this will lead to the conclusion that a (hopefully small) subset of the measured parameters is sufficient to sort observations into their clusters. This will save effort and cost while still allowing the important conclusions to be drawn.

Third, what is the population of the clusters? If there is one cluster with many observations and other clusters with only a few each, then the situation is very different from one in which each cluster has approximately equally many observations. The one large cluster could be called the "normal" cluster while the others are various kinds of non-normal clusters. It depends upon the application, of course, but situations with one large cluster are often correctly interpreted by saying that the large cluster acts as an attractor for the system, i.e. it is a kind of equilibrium state that a participant of the system would like to move towards. In that sense, all other clusters would be pseudo-stable states that would eventually decay into the attractor. Thus, the clustering would have distinguished (pseudo-)ergodic sets from each other in the sense of statistical mechanics (see chapter 2).

Fourth, what are the application-specific characteristics of each cluster? It is interesting now to compute the distinguishing features of each cluster relative to the application at hand. This depends, of course, on the nature of the problem, but industrially speaking these are parameters focusing on the major areas: safety, reliability, quality, costs, margin/profitability and the use of various resources. In this way, one can distinguish the clusters and judge them to be, in some sense, "bad" or "good" clusters. Typically this is done with respect to money, using safety as a limiting criterion, i.e. we wish to maximize profitability while retaining a reasonable level of safety, reliability and quality.
Fifth, is there a dynamic in the clustering? If some of the major distinguishing features change over time, it may be possible to extrapolate a dynamic system over the clusters. This means that a participant in the system may be in one cluster at one time and in another cluster at a later time. This transition may be governed by laws that could perhaps be discovered (using other methods). In this way, a member of a "bad" cluster may be transformed into a member of a "good" cluster in some manner that, after analysis, could be understood well enough to be manipulable.
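The cluster description suggested above – center position, per-dimension radius, population, and the mean and variance of the member observations – can be computed directly. The sketch below is a hypothetical illustration; the function and field names are our own:

```python
from statistics import mean, pvariance

def describe_clusters(points, labels, centers):
    """Summarize each cluster in human-readable terms (a sketch).

    For every cluster we report the center position, the radius in each
    dimension (largest absolute deviation from the center), the number of
    member points, and the per-dimension mean and variance of the members."""
    summary = {}
    for c, center in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == c]
        dims = list(zip(*members))       # transpose: one tuple per dimension
        summary[c] = {
            "center": center,
            "radius": tuple(max(abs(x - m) for x in d)
                            for d, m in zip(dims, center)),
            "count": len(members),
            "mean": tuple(mean(d) for d in dims),
            "variance": tuple(pvariance(d) for d in dims),
        }
    return summary

# A tiny 2-D example with two hand-assigned clusters.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 9.0)]
labels = [0, 0, 0, 1, 1]
centers = [(0.5, 0.5), (9.5, 9.0)]
info = describe_clusters(points, labels, centers)
```

Comparing `mean` and `variance` against `center` and `radius` gives exactly the tightness check suggested in the text.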

5.3.5 Entropy

Informational entropy is a measure of the information content of a signal. High entropy means that a signal carries much information per unit of signal. If a signal is a series of letters, then the sequence "aaaaa..." is predictable and thus carries very little information. In contrast, the sequence "abcd..." carries high information content. Generally speaking, a signal source that behaves in a uniform manner will transmit a signal with near-constant entropy. While the individual measurements vary from moment to moment, the informational content of the signal is static – this is referred to as an ergodic source and is a highly desirable property for mechanical systems. All the mechanical states that the system assumes from moment to moment and that give rise to this constant entropy are an ergodic state set. All states in one ergodic set may be interpreted as belonging to the same qualitative mode of operation. Using this, we can detect several qualitative states of the system over time and label these as desirable states or not. If the system switches from one ergodic state set into another, we will observe a discontinuity in the entropy signal. This event, referred to as ergodicity breaking, is a significant event and indicates a qualitative change in operations, see section 2.5. Thus, our method of entropy tracing detects ergodicity breaking events in the history of the measurements.

Entropy can be understood by graphing a signal as a histogram. During a particular time window, the range from lowest to highest observed value is split into bins and the number of occurrences per bin over the time window is counted. Divided by the total number of measurements, this yields a probability density function that characterizes the signal over that time window, see figure 5.2 (a). The shape of this distribution characterizes the ergodic set that the system experiences during this time. If we allow some time to pass, we may detect an alternative density function, see figure 5.2 (b). The system has evolved from one qualitative state to another and this is clearly visible from the change in the density function. There has been ergodicity breaking. Entropy provides a statistical measure of this change in a single numerical quantity and allows measurement of the severity of the change. That is only a simple example of what the entropy method is able to detect. Most fundamental structural changes in the shape of the distribution will be detected because they always entail a change in the information content.
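The histogram construction just described can be sketched as follows; this is a minimal illustration, and the bin count of 10 is an arbitrary choice of ours:

```python
import math

def window_entropy(values, bins=10):
    """Shannon entropy (in bits) of the binned value distribution of one
    time window: split the observed range into bins, count occurrences,
    normalize to probabilities, then sum -p*log2(p)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # degenerate window: single bin
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    probs = [c / len(values) for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)

constant = [5.0] * 100                 # like "aaaaa...": no information
spread = [t % 10 for t in range(100)]  # varied signal: high entropy
low = window_entropy(constant)         # 0.0
high = window_entropy(spread)          # log2(10), about 3.32 bits
```

A near-constant entropy trace over successive windows then corresponds to an ergodic source, while a jump in this quantity signals ergodicity breaking.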
If the entropy for a single time-window is larger/smaller than the mean of all time-windows ± 2 standard deviations, then the change in entropy is big enough for us to define it as a significant change. Supposing that the distribution is normal² (as seen in figure 5.2 (a)), a deviation of more than two standard deviations has a probability of less than 5%. This entropy method can be tuned to a problem by adaptively varying the number of standard deviations chosen as its detection sensitivity. We may summarize the meaning of the abnormality score of this method: The bigger the absolute value of the abnormality score for a single time-window, the larger the difference between the distribution of values in the present time-window and the averaged distribution of all time-windows.

² If a large number of independent and identically distributed factors add up to produce a single result, then this result is approximately normally distributed – this is called the central limit theorem. Often, this theorem is used to claim that the result of almost any complex mechanism should be normally distributed. However, in practice, we find that the assumptions of the theorem (independent and identically distributed factors) hardly apply and thus the distribution is not normal. Many industrial processes have probability distributions significantly different from normal. As a result, we must not over-interpret the claim that a data point further away from the mean than two standard deviations occurs with 5% likelihood. This is a rough guideline unless we know the identity of the distribution.


Fig. 5.2 The value distribution of a measurement before and after a qualitative change in the system that gave rise to the secondary peak on the right.

5.3.6 Fourier Transformation

The Fourier transform (FT) f̂(ζ) of a function f(t) is given by

f̂(ζ) = ∫_{−∞}^{+∞} f(t) e^{−2πiζt} dt

where t is time and ζ can be interpreted as the frequency of the signal. The transform is essentially a transformation of the basis of the coordinate system to another basis. While singular spectrum analysis (SSA) makes a linear transformation, FT makes a non-linear transformation and focuses on the frequencies with which signals change. We thus separate slow changes that occur with low frequency from fast changes that occur with high frequency.


Over the different frequencies, we may plot the amplitudes of the frequencies in the signal spectrum. This amplitude is not that of the original signal but of the transformed signal in frequency space. Figure 5.3 (a) displays an example in which it is clear that the frequencies around 50 Hz dominate this particular time-window. As the system evolves to a later time, displayed in figure 5.3 (b), we observe that the frequencies around 77 Hz become populated and thus there is a fast signal variation present now that was not present before. The shape of the graph of the amplitudes has thus changed. We measure the difference between two such graphs by computing their correlation. This provides us with a numerical measure of shape similarity that we can trace over time. This correlation has a mean and standard deviation over all time-windows and we mark abnormalities if the correlation at a particular time-window is more than two standard deviations away from this mean. When a signal gains or loses variations at particular frequencies, the FT method will show this. An abnormality here indicates that there is a significant change in the speed with which the signal varies over time.
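The spectrum-correlation idea can be sketched as follows; the direct O(n²) transform and the synthetic two-tone signal are for illustration only, and a real application would use an FFT over each time-window:

```python
import cmath
import math
from statistics import mean

def amplitude_spectrum(signal):
    """Amplitudes of the discrete Fourier transform of one time-window
    (direct evaluation; only half the spectrum is kept for a real signal)."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n // 2)]

def correlation(a, b):
    """Pearson correlation of two spectra as a shape-similarity measure."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

n = 64
slow = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]    # 4 cycles
mixed = [s + math.sin(2 * math.pi * 20 * t / n)                 # + 20 cycles
         for t, s in enumerate(slow)]
unchanged = correlation(amplitude_spectrum(slow), amplitude_spectrum(slow))
changed = correlation(amplitude_spectrum(slow), amplitude_spectrum(mixed))
```

The appearance of the fast component pushes `changed` well below `unchanged`, which is exactly the drop in shape similarity that the method flags when it exceeds two standard deviations.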

5.4 Case Study: Optical Digit Recognition

Consider a letter whose envelope is addressed by hand. Someone must be able to read the address in order to deliver it. As there are a great many letters sent each day, postal agencies the world over have invested in automated systems that can read an address on an envelope. Part of the task is to identify numbers like the ZIP code. Technically speaking, we are given an image of a digit and we are to say which digit the image represents. A sample of such data can be seen in figure 5.4 where we see several examples of the number six. There exists a database of about 60,000 images of digits written by about 250 different persons that we use as our dataset; this is the NIST dataset (National Institute of Standards and Technology in the USA).

In order to obtain better results, the dataset is pre-processed before applying any training algorithm. One of the classical pre-processing steps is binarization: The image is transformed into having only black and white pixels instead of various gray levels or colors. After binarization, the data was skeletonized and thinned, see figure 5.5. However, the results obtained were worse than the results obtained using only binarization.

We will attempt to solve the problem by using a self-organizing map (SOM). In the terminology of section 6.4 this is a one-layer perceptron network but there is no need to skip ahead in the book as we will discuss the concept here. We want to classify an image of a digit into the abstract categories of digits. The input of the classifier thus receives the pixels of this image and the output yields the category that this image belongs to. Thus, the SOM needs two layers. The input layer is a vector with as many elements as there are pixels in the image and so allows for the image to be input into the SOM. The output layer is a vector with as many elements as there are categories; in our case of digits there are 10 such categories.

Fig. 5.3 The frequency spectrum of a signal before and after a qualitative change in the system that gave rise to the presence of frequencies around 77 Hz on the right.

Each element of the input layer is connected with each element of the output layer and this connection has a certain strength. Thus, the strengths form a matrix. In this way, every output element has a weight vector made up of the connection strengths of all inputs with respect to this particular output. When an input is presented to the network, we determine the output element whose weight vector is closest to the input vector. To measure distance, we use a metric. Most of the time, the metric is the normal Euclidean metric where the distance between (a, b) and (c, d) is √((a − c)² + (b − d)²).


Fig. 5.4 Several examples of the number six that have been handwritten.

Fig. 5.5 Binarized images (on the first row) after applying thinning (on the second row) and skeletonization (on the third row).

We may, however, use a different metric. According to a specific algorithm, the weight vector is now adjusted slightly to reduce the distance between the input and the weight vector. Independently of the weight vector, the output elements are considered to be locations on a map. Usually, these elements are hexagonally distributed over a two-dimensional plane, see figure 5.6 (a). The weight vectors of those output elements that are close to the specific output element just chosen are also modified by a specific algorithm. Then the next input vector is processed. In this way, the system learns the input data and particularly learns to classify it in the form of a map. Each output element corresponds to a region in this map. We may now display this map in various graphical ways to aid human understanding. Particularly, we may plot the distance (in the sense of the above-mentioned metric) between neighboring output elements. Close elements indicate related categories and so on. The map may of course be distorted graphically in order to reflect these distances and to give the illusion of it being a true map of something.

This way of classifying input data is particularly powerful if we have not categorized the inputs beforehand by human means. This is an important distinction: If we have input data for which we already know the output, we may teach a computer system by example as we would teach a student; this is called supervised learning. If we do not have this, then we must present only the input data to the computer system and hope that it will divine some sensible method to differentiate the data into categories; this is called un-supervised learning. For un-supervised categorization, the SOM is a very good technology. We note in passing that the accuracy of an un-supervised system is generally several percentage points worse than that of a supervised learning system for the simple reason that a supervised learning system has much more information (this applies for the same input data volume).

The weight vectors of each output element must be set to some value at the start of training; this assignment is called initialization. We obtained the best results after we initialized the network with samples from the training dataset. In this way the map is initially well-ordered and the convergence is faster. Figures 5.6 (a) and 5.6 (b) show a map with hexagonal and rectangular topology, respectively, initialized from the training data.
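A single winner-selection and neighbourhood update of the kind described above can be sketched as follows. This is a deliberately simplified illustration with a Euclidean metric, a Gaussian neighbourhood function and made-up parameter values; it is not the specific update rule used in our experiments:

```python
import math

def som_step(weights, grid, x, lr=0.5, radius=1.0):
    """One SOM training step: find the winning output element, then pull
    the winner and its map neighbours toward the input vector x."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # The winner is the output element whose weight vector is closest to x.
    winner = min(range(len(weights)), key=lambda i: dist(weights[i], x))
    for i, w in enumerate(weights):
        # The neighbourhood factor decays with distance on the output map,
        # so elements near the winner are adjusted more strongly.
        h = math.exp(-dist(grid[i], grid[winner]) ** 2 / (2 * radius ** 2))
        weights[i] = tuple(wj + lr * h * (xj - wj) for wj, xj in zip(w, x))
    return winner

# A 2x2 rectangular map; weights and map positions coincide initially.
weights = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
winner = som_step(weights, grid, (0.9, 0.1))   # element 1 is closest
```

Repeating this step over all training inputs, while shrinking `lr` and `radius`, produces the ordered map described in the text.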

Fig. 5.6 Initialization of a 6×6 output layer with hexagonal topology on the left and rectangular topology on the right.

Note that when initializing the network from the training dataset, similar digits should be close to each other. For example, in figure 5.6, the digit 1 is close to 7 since they look alike. A crucial part of the SOM training algorithm is determining the winner node (the output element, and therefore category, assigned to the particular input image under present investigation), which is closest to the input with respect to the metric function. Thus, the choice of the metric function is very important for the performance of the training. We use the so-called tangent metric and not the Euclidean metric. The reason is that it achieves substantially better results (approx. 5% difference). However, the downside of using the tangent metric is that evaluations are computationally intensive and take more time than with the Euclidean metric (about a factor of 30 in time). It goes beyond the scope of this book to give the details of the tangent metric; we refer the interested reader to the literature, e.g. [72].

At first, we may observe bad performance. This may be due to a badly formulated output layer. For instance, the digit 7 may be written with or without a horizontal bar and thus the number 7 is orthographically worth at least two categories. The same is true of several other digits and so we must increase the output layer from 10 to 16 as seen in figure 5.6. In this way, we may cut the error probability in half. On a total training set of 25,000 images, we end up with an error rate of 11% on the remaining 35,000 images that were not used for training, which is not too bad given that the algorithm has no idea what we are trying to classify. The technique of re-learning is also useful: We learn and then use the final state as the starting state for another round of learning. If we make several such rounds, we can end up with a very good performance.

5.5 Case Study: Turbine Diagnosis in a Power Plant

A coal power plant essentially works by heating water into steam in a coal furnace, see figure 5.7. This steam is passed through a turbine, which turns a generator that makes the electricity. The most important piece of equipment in the power plant is the turbine.

Fig. 5.7 A schematic diagram of a combined cycle power plant. The term "combined cycle" means that the power plant produces both electricity and heat for local homes. The diagram describes the combined heat and power (CHP) process in overview, including the major steps: (1) entry-point of the air, (2) boiler with water and steam, (3) high-pressure turbine, (4) mid-pressure turbine, (5) low-pressure turbine, (6) condenser, (7) generator, (8) transformer, (9) feed into power grid, (10) district heating, (11) cooling water source, (12) cooling tower, (13) flue gas, (14) ammonia addition, (15) denitrification of flue gas, (16) air pre-heater, (17) dust filter, (18) ash end-product, (19) filtered ash end-product, (20) desulfurization of flue gas, (21) wash cycle, (22) chalk addition, (23) cement/gypsum removal, (24) cement/gypsum end-product.


A turbine is a rotating engine that converts the energy of a ﬂuid (here steam) into usable work. Figure 5.8 shows a steam turbine. Turbines are used in many other contexts such as water power plants, windmills, wind power turbines, airplane turbines and so on. We will be focusing on turbines used in standard coal-ﬁred power plants. The blades on the turbine have the job of actually capturing the energy and driving the central rotating shaft. They are made from steel and may be more than one meter in length. The forces at work when this machine is running are very large indeed.

Fig. 5.8 A Siemens steam turbine during a routine inspection. Source: Siemens AG Press Photos.

Economically, the turbine is the heart of a power plant. All other equipment in the plant essentially caters to the turbine-generator combination because this combination is responsible for the conversion of energy from steam to electricity. The rest of the power plant essentially has the job of making the steam and cleaning up after itself (for example, filtering the flue gas). Thus, it is important to watch the turbine carefully for any signs of abnormal behavior. As the turbine is so important, its operations are monitored by a variety of sensors installed in key locations. The most crucial information regarding the health of the turbine is contained in the vibration measurements. All sensor output is logged in a data historian and is therefore available for study. In our case of monitoring a fleet of turbines, there are between 111 and 179 sensors measuring the condition of each turbine, which provides us with a satisfactory amount of data for the analysis.


At all times, we want to provide an automatic diagnosis that decides whether the values of the sensors are abnormal. This is the key to the analysis. Only an abnormally functioning turbine should require manual inspection and thus we want to determine abnormal behavior automatically. For this purpose, we are going to use three methods that can do this. If any set of sensors delivers abnormal values, we want to know:

1. Which method or combination of methods detected the abnormal operation?
2. How does the strength of the abnormality develop over time?
3. Which sensor or set of sensors delivers the abnormal values and for how long?
4. Which sensors send abnormal values as a result of a previous abnormality?
5. When did the first/last signs of this abnormal operation appear/disappear?

We analyze time-windows from the time-series, where each time-window contains 7 days of sensor data and the time difference between two consecutive windows is 1 hour. Thus, if the result of one of the methods is that in the time-window 1 May 00:00:00 – 8 May 00:00:00 there is no abnormality, but in the time-window 1 May 01:00:00 – 8 May 01:00:00 there is some abnormality, we conclude that the observed abnormality is induced by the 1 hour difference, i.e. that it occurred on 8 May between 00:00:00 and 01:00:00. We always use the last hour of the analyzed time interval to present the obtained results.

Each analysis method delivers one value per time-window. To detect whether there was an abnormality in the analyzed time-window for a given method, we compute the mean and the standard deviation of the results delivered by that method over all the time-windows. Then we check whether the result for the current window is larger/smaller than the mean ± two standard deviations. If it is, the excess amount is called the "abnormality score" and is recognized as an abnormality. This means that if a sensor had an abnormality score of 4.3 on 1 May 00:00:00, then while analyzing the data for this sensor from 24 April 00:00:00 to 1 May 00:00:00, the analysis result was either greater or less than the mean ± (2 + 4.3) times the standard deviation.

We used three techniques for the investigation of the time-windows of the four datasets. All three methods use the concept of "abnormality score" as defined above and produce one abnormality score per time-window and per sensor. The methods are singular spectrum analysis (see section 4.5.2), entropy (see section 5.3.5) and Fourier transformation (see section 5.3.6). In brief, the entropy indicates if there are abnormal values in the individual measurements, the SSA indicates if there are abnormal variances in the principal component direction and the FT indicates if there are abnormal frequencies in the signal.
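The scoring rule just described can be sketched as follows; this is a minimal illustration with a function name of our own, and the per-window "results" could come from any of the three methods:

```python
from statistics import mean, pstdev

def abnormality_scores(results, n_std=2.0):
    """Per-window abnormality scores: how far, in standard deviations,
    a window's result lies beyond mean +/- n_std * std; 0 if inside."""
    m, s = mean(results), pstdev(results)
    if s == 0:                     # all windows identical: nothing abnormal
        return [0.0] * len(results)
    return [max(0.0, abs(r - m) / s - n_std) for r in results]

# Twenty quiet windows and one spike: only the spike scores above zero.
scores = abnormality_scores([1.0] * 20 + [10.0])
```

A score of 4.3 under this rule means the window result lay beyond the mean ± (2 + 4.3) standard deviations, exactly as in the example in the text.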
We thus have two indicators that concern the shape of a probability density (entropy and FT) and one that concerns the size of the density (SSA). These methods concern very different indicators of abnormality and thus may or may not simultaneously detect an event. We have seven combinations of the methods for detecting an event. Each of these possibilities indicates an abnormal situation. Which methods, or which combinations of methods, respond indicates the nature of the abnormality and may assist in the diagnosis of what the cause of the abnormality may be. We describe briefly what this may mean:

1. Entropy only: We measure abnormal values but the system changes at the same speed and with the same variation as before. This must mean that the abnormal value observed is not along the principal component direction.
2. SSA only: We measure an abnormal variance but no abnormal values or frequencies. This must mean that the principal component direction changed as otherwise we would require a significant change in the value distribution (entropy) as well. Thus, we have a qualitative change of which measurements are important for characterizing the system.
3. FT only: We measure an abnormal frequency but no abnormal values or variances. This means that the system does what it did before but it has changed the speed at which it does it.
4. Entropy and SSA: We measure abnormal values and variances but the same frequencies. The system has changed the range of its operation but not the speed at which it varies.
5. Entropy and FT: We measure abnormal values and frequencies but the same variances. As the values have changed but not the variance, this must mean that the values that changed are not along the principal component direction. Additionally, the frequencies have changed so that the inherent speed of variation has changed.
6. SSA and FT: We measure abnormal variances and frequencies but the same values. As the variance changed without the values changing, this must mean that the principal component direction changed. Additionally, the speed changed.
7. Entropy, SSA and FT: We observe abnormal values, variances and frequencies. The system now visits new values at new speeds and this changes the range of variation along the principal component direction. This is the most significant indication of a change in the system and these changes should be viewed as bearing the most danger.
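For monitoring software, the seven cases above amount to a small lookup table. The sketch below maps which methods fired to a rough diagnostic hint; the hint texts are our own paraphrases of the cases in the text:

```python
def interpret_detection(entropy_hit, ssa_hit, ft_hit):
    """Map the combination of triggered methods to a short diagnostic hint,
    following the seven cases described in the text (paraphrased)."""
    hints = {
        (True,  False, False): "abnormal values off the principal component direction",
        (False, True,  False): "principal component direction changed",
        (False, False, True):  "same behaviour, but at a different speed",
        (True,  True,  False): "new operating range, unchanged speed",
        (True,  False, True):  "new values off the principal direction, new speed",
        (False, True,  True):  "new principal direction and new speed",
        (True,  True,  True):  "new values at new speeds - most significant",
    }
    return hints.get((entropy_hit, ssa_hit, ft_hit), "no abnormality detected")
```

Such a table makes the engineer's first triage step mechanical: the more methods fire, the more urgent the hint.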
Whether an abnormality is dangerous or not is not included in this analysis. It is likely that when the operator makes a significant change in operational settings, an abnormality is detected even though this does not necessarily indicate a danger. However, it seems very likely that most dangerous situations would display an abnormality of at least one of the above kinds. A zoom-in for one dataset and the SSA method shows how a particular event starts and progresses over time, see figure 5.9. Monitored over a full year, the analysis yields the result of figure 5.10. When all three methods are combined, we may compare them as per figure 5.11. As discussed above, the combination of methods that detect a particular event lets us interpret what kind of event is taking place and thus aids the engineer in deciding what should be done about the event. In table 5.1, we summarize, for four turbines, how many events were detected by each combination of methods. Please note that no combination is in any sense "better" than another.


Fig. 5.9 The numbers on the vertical axis indicate the sensor that is being analyzed so that the image as a whole gives us a holistic health check for the whole turbine. We can read from the plot that the event starts with sensor 41 on the 30th of June, then a more significant deviation is observed for sensors 122 and 151, then on the 9th of July more sensors (51) get involved and the largest abnormality (−2.09) is observed for sensor 93 on the 15th of July. For several days, the abnormalities of most sensors disappear and only sensors 122 and 151 continue deviating and start a second reaction of a smaller magnitude on the 24th of July. On the 3rd of August all abnormalities disappear.

By and large, we can see in the data that those events that have a particularly large abnormality score do tend to be detected by more than one method. The largest abnormalities are mostly detected by all three methods. On this basis, we can say that the urgency with which an event should be looked at can be proportional to the number of methods that detected it. However, this judgment is made without correlating it with the actual ﬁnal outcome of the event (benign events vs. dangerous situations). In ﬁgure 5.12, we provide a plot of the abnormality score for the events detected for one turbine. It is also visible that the total abnormality score of an event forms an approximate exponential distribution. This is an interesting feature as this is not the outcome that would result from a large number of random interactions (central limit theorem) but rather suggests a much more structured causality. In particular, we would expect

5.5 Case Study: Turbine Diagnosis in a Power Plant

101

Fig. 5.10 Here we see a turbine analyzed by SSA over a full year’s operation. The vertical black areas are areas where the turbine was ofﬂine. In between the ofﬂine times, we can see where the method diagnosed abnormalities. These were then analyzed by human means and appropriately responded to. Set of Methods SSA only FFT only ENT only SSA and FFT SSA and Entropy FFT and Entropy SSA and FFT and Entropy

Turbine 1 2 7 1 4 3 1 4

Turbine 2 7 4 4 0 1 1 6

Turbine 3 1 4 0 3 1 7 3

Turbine 4 4 1 3 5 4 0 3

Table 5.1 Each combination detects a particular signature of event and thus they should be seen as complementary detection schemes rather than a hierarchy. No one method dominates this table. This shows that events of all signatures do take place in the systems studied.

this outcome to result from a Poisson stochastic error source, which is present when events occur continuously and independently at a constant average rate. Thus, we would conclude that, approximately and on average, the events detected here did not interfere with or cause each other but were independently caused. We would also conclude that whatever causation mechanism is giving rise to these


events acts at a constant rate. This means that the system does not exhibit ageing over the time period (one year) investigated here, in which case the event rate would increase over time. We may summarize all the above findings by saying that the present methods allow the fully automatic screening of the majority of the data, finding that it indicates normal operations. For a selected small minority of the data, these methods give a clear indication of at which times which sensors are abnormal, how abnormal they are and in what sense they are abnormal. This allows targeted human analysis to take place on a need basis.

Fig. 5.11 Here we see all three methods in comparison over a whole year on one particular sensor in a particular turbine. We see broad agreement but differing opinions in the details. These differences can be interpreted as discussed in the text.

5.6 Case Study: Determining the Cause of a Known Fault

Co-Author: Torsten Mager, KNG Kraftwerks- und Netzgesellschaft mbH


Fig. 5.12 These events are sorted by the sum of their abnormality scores over all three methods. As the score is defined similarly for each method, the numerical value of the score of one method is comparable to the score of another. We can see that the detection efficiency of the methods increases with increasing abnormality, as it should. The plot also contains the number of days before an event at which an advance warning would have been possible. It is recognizable that the first signs appeared several days in advance for most events. There are only two events where there was no sign before the event. The average advance warning time for an event was five days.

Here we take the case of a turbine that has failed for a known reason. We seek the cause of the failure in both time and space, i.e. where and when did what happen to bring about the known failure? This is often important to settle issues of liability between the manufacturer, operator and insurer of the machine. It is certainly important in terms of planning what to do about it and what to do to prevent a similar case in the future. We recall that a turbine failure is a significant and expensive event for the plant. In the previous case study, we presented three methods to analyze abnormal operations for turbines. We might think that using these methods over the history of the particular turbine would reveal the point at which things went wrong. It should be mentioned that the particular fault in question was that a single blade touched the casing and was bent. This bent blade was only detected visually upon opening the turbine months later. It was unclear to what extent the turbine would be able to continue running and so this blade was exchanged; a time-consuming and


expensive process. We expect that this is an event that should be visible in the data (particularly in the vibration data) upon careful analysis. The analysis result can be seen in ﬁgure 5.13 for one vibration measurement as an example. The three lines indicate the abnormality score of each of the three methods outlined in the previous case study. It is not important here which is which. We simply note that there are a few points at which the analysis result intersects the abnormality boundary and thus we do ﬁnd abnormal events. Later manual analysis discovered that all such abnormal events could be explained by sensible operator decisions.

Fig. 5.13 The outcome of the analysis using the three methods of singular spectrum analysis, Fourier transform and entropy analysis are shown with respect to one vibration measurement. We observe that there are signiﬁcant changes that cross boundary lines and thus indicate abnormal events.

In figure 5.14, we display the raw data in a different manner that also allows some interesting interpretations. One vibration is plotted against another in an effort to track their mutual locus. We find that they do indeed possess a well-defined mutual locus that is traversed clockwise in figure 5.14. Each point in the image represents 10 minutes of operation. A single cycle thus represents approximately 15 hours. We see on the image (on the far right) that there is a region in time lasting roughly 3 hours in which the system deviated significantly from the established locus. We can interpret this as an abnormal event. In this particular case, however, the event was benign as it was intentionally caused by the operators. This further indicates that abnormal events do not need to be harmful. A data analysis system can detect abnormalities but would have to be given a vast amount of knowledge to be able to interpret these as benign or harmful; at present we regard this enhancement as impractical. We do observe, however, that there exist some tools that can interpret particular problematic situations on particular devices and so some work has been done in this regard [77]. In a similar fashion, we also analyzed the cases in which the turbine was not rotating at operating speed but was in various stages of cycling up or down. Here too, we were not able to find any genuine abnormality.


Fig. 5.14 One vibration measurement (horizontal axis) with respect to another (vertical axis). The locus is essentially a dented cycle, which (in this case) we traverse clockwise over time. We see that most of the time, the system holds a fairly well-deﬁned locus in time. Occasionally, we do deviate from this and this represents, loosely deﬁned, an abnormal event.

Even though this is a negative example of data analysis, it is instructive in various ways that must be taken seriously when starting such an analysis. In particular, it is clear that there are events that cannot be seen, even upon detailed analysis, in the normal measurements of a plant. To circumvent this, we may have to install further instrumentation equipment specifically targeted at discovering such a problem. We must be aware, before we conduct a data analysis, that it is possible that the feature we are looking for: (1) does not exist at all, (2) is not contained in the data we have, (3) is overshadowed by noise in the data, (4) occurs at such short timescales that it appears to be an outlier, and so on. Ideally, we would have the opportunity to design a data acquisition system in order to look for a particular problem in advance. When we encounter a problem, however, we will simply have to deal with the data available and may then encounter the above challenges. The correct acquisition and cleaning of data prior to analysis is crucial for success.

We conclude this case study by observing that there are two principal explanations for the failure to find the cause:

1. The event occurred during the time period analyzed but was not visible in the data, possibly because it was too short-lived.
2. The event did not occur during the time period analyzed. This would imply that the turbine was initially taken into operation in a damaged state.

5.7 Markov Chains and the Central Limit Theorem

A Markov chain is a sequence of numbers, each of which is a random variable, with the property that the probability distribution of any one number depends only upon the previous number in the sequence. This property is called the Markov property.


Thus, a Markov chain z^{(0)}, z^{(1)}, z^{(2)}, \cdots, z^{(m-1)}, z^{(m)} has the property that

p(z^{(m+1)} | z^{(0)}, z^{(1)}, \cdots, z^{(m)}) = p(z^{(m+1)} | z^{(m)}).

This probability is called the transition probability, which is generally a matrix and not a scalar as both z^{(m)} and z^{(m+1)} are vectors and we must specify the probability of transition from each element of one to each element of the other,

T_m \equiv p(z^{(m+1)} | z^{(m)}).

If the transition probability T_m is the same for all m, then the Markov chain is called homogeneous. The Markov property is a severe restriction and an extreme simplification. It therefore allows many special properties of Markov chains to be proved. However, it also means that a Markov chain is not always suitable to model a practical situation. For that reason, we will often want to relax the Markov property and define the p-order Markov chain by the property that the transition probability should depend upon the prior p random variables,

p(z^{(m+1)} | z^{(0)}, z^{(1)}, \cdots, z^{(m)}) = p(z^{(m+1)} | z^{(m-p+1)}, z^{(m-p+2)}, \cdots, z^{(m)}).

The transition probability becomes

T_m \equiv p(z^{(m+1)} | z^{(m-p+1)}, z^{(m-p+2)}, \cdots, z^{(m)}).

To model a real system by using a Markov chain, we thus need to determine the transition probabilities T_m. If we know these, and we determine the initial condition of the Markov chain (i.e. the values of the first p random variables), then we may probabilistically compute the evolution of the chain into the future and thus arrive at our model. Supposing that the initial conditions are not to be obtained from physical experiments but rather must also be calculated, we must establish the probability distribution of the initial random variables and then rely on our determination of T_m to compute the others. As we are dealing with statistical distributions, we require a significant amount of data to be able to distinguish the various possible distributions from each other. In case of doubt, one often chooses the Gaussian distribution, also called the normal distribution,

p(z) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(z-\bar{z})^2}{2\sigma^2}}

where \sigma is the standard deviation of the distribution and \bar{z} is the mean of the distribution. The defense for this choice is often the central limit theorem. The users of this defense believe that the central limit theorem effectively says: If the factors leading


to a particular observation are sufficiently many, the distribution of this observation will tend to the normal distribution in the limit of infinitely many factors. Actually, it states that the random variable y which is the sum of many random variables x_i (i.e. y = x_1 + x_2 + \cdots + x_m) will tend to be distributed normally in the limit of infinitely many x's if the x_i are independent and identically distributed and have finite mean and variance. Please note the if clause in the previous sentence. Thus, the factors (x_i) leading to a particular observation (y) have to be independent and identically distributed, which is typically not the case as cause-effect interrelationships usually exist in the physical world. Also, the observation y is a very particular observation, namely the sum of the x's, and not some other related observation. In short, we must be very careful when using the central limit theorem to justify a normal distribution for the initial random variable in a Markov chain.

There are several popular probability distributions that optically all look the same: they have a bell-shaped curve. It is also wrong to say that they are all the same and one might just as well stick with the normal distribution. In the central area of the bell, these distributions are indeed similar but they usually differ significantly in the tails (i.e. far away from the central area). Please note that in probability theory it is generally the tails that are the interesting parts because these describe the non-typical situations that will nevertheless occur. A few names of such distributions are: Cauchy, Student's t, generalized normal and logistic distribution. We will not go further into these individually.

If an incorrect distribution is chosen for z^{(0)}, then even a correct T_m will lead to bad results overall as everything depends on the initial condition. Thus, modeling the starting point correctly is essential for a correct Markov chain model. Moreover, the correct modeling of the initial condition can only be done with sufficient data from the system under consideration. Lacking such data, we must rely on some other source such as a physical model of the situation.
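To make the mechanics concrete, a homogeneous first-order Markov chain can be evolved once the transition matrix and the initial distribution are chosen. The three-state matrix and starting distribution below are invented purely for illustration:

```python
# Hypothetical 3-state homogeneous Markov chain: T[i][j] is the
# probability of moving from state i to state j (each row sums to 1).
T = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]

def step(p, T):
    """One transition: p_next[j] = sum_i p[i] * T[i][j]."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

# Initial condition: the chain certainly starts in state 0.
p = [1.0, 0.0, 0.0]
for _ in range(200):
    p = step(p, T)

# After many steps the distribution settles to the stationary
# distribution of the chain (p = pT), regardless of the start.
print([round(x, 3) for x in p])  # → [0.643, 0.286, 0.071]
```

Note that the long-run behavior is a property of T alone, which is exactly why a wrong initial distribution matters most for the short-term predictions the model is typically used for.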

5.8 Bayesian Statistical Inference and the Noisy Channel

5.8.1 Introduction to Bayesian Inference

Let us focus on a practical problem, namely optical character recognition; see for instance section 5.4 above. We may observe the system via a random variable x, the photographic image of a letter, and wish to deduce the parameter θ, the original letter, that gave rise to this observation. We begin our excursion with the prior distribution g(θ), that is, the probability distribution over the various possible values of the parameter. We will suppose, for the moment, that this distribution is known. We will discuss later on how it can be determined. The observable random variable x has a conditional distribution function f(x|θ), called the sampling distribution, that is also assumed known for all values of the


parameter θ ∈ Θ. According to Bayes' theorem, we may now compute the so-called posterior distribution

g(\theta|x) = \frac{f(x|\theta) g(\theta)}{\int_\Theta f(x|\theta) g(\theta) \, d\theta}.

Thus, we now know the distribution of the parameter given an observation. When we have made an observation, we can use this distribution to determine the probability distribution of the parameter for this particular observation. Knowing this distribution is very useful indeed. For example, if we have an image of a handwritten letter and we determine that the probability distribution over the alphabet is such that the probability of "r" is 0.4, the probability of "v" is 0.6 and the probability of all other letters is virtually zero, then we may conclude that we have either an "r" or a "v" with high probability. We may also conclude that we are more likely to have a "v" than an "r" and we have a quantitative method to assess the degree to which it is more likely, namely 0.2 more. We may conclude from such a distribution that our model is not yet good enough since it cannot tell these two letters apart with sufficient certainty and thus that we need to present it with more examples of these two letters to train it to be better. Thus, it is a very useful result to have the posterior distribution. However, it requires both the prior distribution and the conditional distribution to be known. In general these must be determined by examining the physical mechanism that gives rise to the problem. In the following subsections, we will treat the determination of these two distributions.
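For a discrete parameter such as a letter, the integral in the denominator becomes a sum and the posterior can be computed directly. The prior and likelihood numbers below are invented so that the example reproduces the 0.4/0.6 split discussed above:

```python
# Hypothetical prior g(theta) over two candidate letters
# (all other letters assumed negligible for this image).
prior = {"r": 0.06, "v": 0.01}

# Hypothetical sampling probabilities f(x|theta) of observing this
# particular image given that the writer intended each letter.
likelihood = {"r": 0.002, "v": 0.018}

# Bayes' theorem: posterior = likelihood * prior / normalizing sum.
evidence = sum(likelihood[t] * prior[t] for t in prior)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}

print({t: round(posterior[t], 3) for t in posterior})  # → {'r': 0.4, 'v': 0.6}
```

Note how the prior favors "r" (it is simply a more common letter) but the image evidence overrules it; this interplay is the whole point of the posterior.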

5.8.2 Determining the Prior Distribution

In our case, the prior distribution g(θ) is the probability that a particular letter is going to occur. We can make our life simple, deduce this from a typical piece of prose text and stipulate this as the prior distribution, see table 5.2. However, the letter frequencies are very different if we know what letter came before the current one, see table 5.3. We may, in fact, introduce a complex grammar-based language model here to give an intelligent guess as to the next expected letter based on a lot of domain-specific knowledge. We must decide how much knowledge to insert into the prior distribution for it to be good enough for our practical purpose. Noting how much complexity was added going from single letter to digram frequencies, we need to be careful before attempting to introduce a more complex model as this may create more effort than the result is worth. In the case of optical character recognition, the usefulness is however so large that several independent projects have created models of great complexity at great expense, resulting in several commercial software packages. In the case that we do not want to or cannot insert theoretical knowledge into the construction of the prior distribution, we must construct it empirically. So let us assume that we have a number of observations at our disposal x_1, x_2, \cdots, x_n. The parameters \theta_1, \theta_2, \cdots, \theta_n that gave rise to these observations are unknown but we will

a  8.167%     j  0.153%     s  6.327%
b  1.492%     k  0.772%     t  9.056%
c  2.782%     l  4.025%     u  2.758%
d  4.253%     m  2.406%     v  0.978%
e 12.702%     n  6.749%     w  2.360%
f  2.228%     o  7.507%     x  0.150%
g  2.015%     p  1.929%     y  1.974%
h  6.094%     q  0.095%     z  0.074%
i  6.966%     r  5.987%

Table 5.2 The probability of each letter in average English prose texts.

[Table 5.3: a 26 × 26 matrix of digram counts; the letter on the vertical axis is the first letter of the digram and the letter on the horizontal axis is the second.]

Table 5.3 The relative letter digram frequencies as measured on one English prose text. They are ordered in that the letter on the vertical axis precedes the letter on the horizontal axis in any individual digram, so that the relative frequency of the digram "AB" is 32 and not 8.

assume that they are independent and identically distributed. Then we can group the observations into sets that are as homogeneous as possible within the set and as heterogeneous as possible between the sets. This is exactly the unsupervised clustering example that we introduced via the method of k-means in section 5.3.4 and illustrated in section 5.4 using the self-organizing map.


5.8.3 Determining the Sampling Distribution

The sampling distribution f(x|θ) is a distribution of the observable x for every value of the parameter θ. In our case, it is the distribution of all possible letter images for every possible letter. A particular letter, when it is transformed into an image, can be rotated, skewed, squeezed, expanded, thinned, thickened and so on. Various mechanisms exist to make the letter image look different from a prototypical letter image; just compare your handwriting to machine typing. Generally, we would have to have domain knowledge to construct this distribution. If this is unavailable, we may construct it empirically as we did above. We simply ask many people to write text and take this as our distribution, see section 5.4. When we cluster the observations, the cluster determines the parameter and thus the prior distribution. The not quite homogeneous distribution inside any particular cluster is the sampling distribution for that particular parameter. Please note that there is a snag: Unsupervised clustering generally produces clusters that may or may not coincide with your intended parameter. In fact, a single parameter may actually correspond to several clusters because there are various very different ways to distort the parameter. In many cases, the connection between the parameter and the observation is indeed very similar to the case of the letter, i.e. the connection is that an idealistic concept is transformed into a physical reality via a series of mechanisms that distort the original prototype in various ways. Thus there is a channel from prototype to real object and this channel adds noise to the signal in some fashion. This noise must be removed for us to recover the original prototype.

5.8.4 Noisy Channels

There are two branches of science that deal with noisy channels. From an engineering point of view, we focus on actively building a channel that has the property that we can later remove the noise as much as possible. This view led to the construction of the television and telephone channels, which are both noisy but allow the noise to be removed at the recipient's end sufficiently well to suit the user's needs. From a computer science point of view, we focus on cybernetics or control theory, in that we are given a noisy channel that we cannot influence and must remove the noise as best we can. An example of this is handwriting: We cannot influence the author's handwriting and must do the best we can to recognize the intended letters via optical character recognition. Please note carefully the difference here. In the first, we are building a channel with the noise removal problem in mind. In the second, we are given an imperfect channel and told to remove the noise. The focus shifts from making an ideal channel to making a noise removing machine. Furthermore, when building the channel, we know to some extent the noise that the channel will add and this knowledge helps in later removal attempts, but when given an existing channel, we do not generally know how it adds the noise. We will briefly present both approaches here.

5.8.4.1 Building a Noisy Channel

In constructing a channel we will want to reduce as much as possible the noise that the channel adds. For example, we want to measure the temperature in a process and convey this measurement to a process control system. The original reality is a substance that has a certain temperature. Via a sensor, cables, analog-to-digital converters, filters and so on, the information arrives at the process control system in the form of a floating-point number. This will be stored once every so often or, as is commonly done, if the value changes by more than a certain amount from the last measurement. A noisy channel is generally characterized by an input alphabet (possible temperatures) A, an output alphabet (possible temperature readings) B and a set of conditional probability distributions P(y|x). We may construct a transition probability matrix

Q_{ji} = P(y = b_j | x = a_i)

which gives the probability that the original temperature a_i is displayed as the reading b_j. A probability distribution over the input (temperatures) p_x, which is a vector, may thus be converted into a distribution over the outputs (readings) p_y by multiplying it with the transition matrix like this: p_y = Q p_x. As we can influence the channel itself because we are building it, what can we do to increase our chances of correct reconstruction? That's right, we insert the same information into the channel more than once. For very important temperature measurements, the industry generally installs three sensors and the process control system records the measurement if at least two sensors agree. There are a plethora of other engineering changes that can be made to make the channel more reliable. They include procuring a good sensor with little drift and good aging properties, installing it in a location where it is not likely to get damaged, dirty or overheated, insulating the cables, setting the recording threshold low so that many values are stored, and so on.
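The matrix form p_y = Q p_x can be written out directly. The following sketch uses an invented three-temperature channel; the matrix Q and the input distribution p_x below are illustrative assumptions, not measured values:

```python
# Hypothetical channel with 3 true temperatures and 3 possible readings.
# Q[j][i] = P(reading b_j | true temperature a_i); each column sums to 1.
Q = [[0.8, 0.1, 0.0],
     [0.2, 0.8, 0.2],
     [0.0, 0.1, 0.8]]

p_x = [0.5, 0.3, 0.2]  # distribution over the true temperatures

# p_y = Q p_x : the distribution over the readings the system records.
p_y = [sum(Q[j][i] * p_x[i] for i in range(3)) for j in range(3)]

print([round(v, 2) for v in p_y])  # → [0.43, 0.38, 0.19]
```

Because Q's columns are probability distributions, p_y automatically sums to one: the channel reshuffles probability mass between readings but never creates or destroys it.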
Mathematically speaking, we would control the manner in which a source signal (your voice) is encoded into electrical form over the channel (telephone) to arrive at the decoding output (speaker). We may do this via special mathematical techniques that include some, but not too much, extra information to allow the decoder to correct some of the errors that are introduced inside the channel. Adding extra information reduces the capacity of the channel and so it is important to add the right amount. There is a large theory on exactly how much is the right amount of extra information in order to be able to decode enough of the signal for all practical purposes. Of course, this requires the "practical purpose" to be specified very accurately first. We will not go into this, as it goes beyond the scope of this book. You will find it under the headings of information theory or communication theory.
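The simplest illustration of adding "some, but not too much" extra information is a repetition code: each bit is sent three times and the decoder takes a majority vote, which corrects any single flipped bit per triple. This is only a toy sketch of the idea, not a practical channel code:

```python
def encode(bits):
    """Triple each bit: 1 payload bit becomes 3 channel bits."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(received):
    """Majority vote over each triple; corrects one flip per triple."""
    out = []
    for i in range(0, len(received), 3):
        triple = received[i:i + 3]
        out.append(1 if sum(triple) >= 2 else 0)
    return out

msg = [1, 0, 1, 1]
sent = encode(msg)

# Channel noise: flip one bit in each of two different triples.
sent[1] ^= 1
sent[9] ^= 1

print(decode(sent) == msg)  # → True
```

The price is visible immediately: the channel carries three bits for every bit of payload, i.e. two thirds of the capacity are spent on redundancy. Real codes achieve far better trade-offs.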


5.8.4.2 Controlling a Noisy Channel

In real life, we are more often concerned with controlling a noisy channel. That is, the channel and the encoding mechanism exist and cannot be modified. Our task is exclusively restricted to decoding, i.e. recovering the original prototype from the signal, which consists of the original prototype and some noise. As the noise is generally not uniformly random but rather structured in some fashion, this becomes a challenge. Simply put, at one end we have the physical reality that we cannot directly observe, then we have a channel whose operation we do not know and at the other end we have the output that we can measure. Let us consider the channel to be a black box. Let us also suppose that we can arrange for the physical reality to be in a particular known state, at least for the purposes of experimentation if not in the final application stage. Then let us present the known physical state to the black box and observe the output. We repeat this for many trials. If we choose our inputs carefully (for example by a design of experiment, see section 3.2), then we will have observed the functioning of the channel in all important modes of operation. What we are left with is a relationship between inputs A and outputs B. The channel is then a function B = f(A). Our task is to obtain the function from the data. This is a problem of modeling that we will tackle in chapter 6. For the moment, we will assume that the function is determined. Now, we can compute, for any input, what its corresponding output will be. But we can do more. If we invert the function, A = f^{-1}(B), then we may compute the input for a given observed output. This finally solves the problem of determining the reality that gave rise to an observation. Of course, it is possible to construct the inverted function directly via modeling; there is no need to first model f(\cdots) only to then construct f^{-1}(\cdots).
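As a minimal numerical sketch of this inversion, suppose the fitted channel model turned out to be a monotonic function; the cubic below is an invented stand-in for a model obtained from data. Then A = f^{-1}(B) can be recovered by simple bisection:

```python
def f(a):
    """Stand-in for a channel model fitted from input/output pairs."""
    return a ** 3 + 2.0 * a  # strictly increasing, hence invertible

def f_inverse(b, lo=-10.0, hi=10.0, tol=1e-10):
    """Recover the input a with f(a) = b by bisection on [lo, hi].
    Requires f to be increasing and b to lie between f(lo) and f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < b:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

a = 1.5
b = f(a)                        # what the channel outputs
print(round(f_inverse(b), 6))   # → 1.5, the reconstructed input
```

For multidimensional or non-monotonic models the inversion is harder and may have several solutions, which is precisely where the Bayesian view of a distribution over answers, discussed below, earns its keep.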
It is however often useful to have both orientations of the function on hand because f (· · · ) is effectively a simulation of the real system and can be used for experimentation and f −1 (· · · ) is a back trace for any output. Suppose that we are capable of inﬂuencing the inputs A in some ways, then we can take B = f (A) and ask the question: What is the value of A such that a function of the outputs g(B), the so-called goal function or merit function, takes on the maximum (or minimum) value possible? This is an optimization question that we will tackle in chapter 7. The topic of developing the model f (· · · ) is effectively the development of the sampling distribution in the Bayesian inference approach. Please note that there is no contradiction between inverting the function and using Bayes’ approach. The difference in method also yields a difference in output. The Bayesian inference approach does not yield a single answer (as the inversion approach does) but it yields the distribution of possible answers. In practice we may be boring and desire a single answer but for more scientiﬁc work we may indeed be interested in the distribution. Also, for practical work, we would indeed be interested in the conﬁdence (or probability) with which we may accept the single most likely answer. Perhaps the single answer of inversion is the most


likely answer in Bayesian terms but there may be other answers that are sufﬁciently competitive that we actually cannot be conﬁdent in our conclusion.

5.9 Non-Linear Multi-Dimensional Regression

5.9.1 Linear Least Squares Regression

The word regression refers, in the context of statistics, to a collection of methods that infer a function describing a set of known observations. Often, regression is also referred to as curve fitting. A simple example is the fitting of a straight line to a set of two-dimensional observations, see figure 5.15. The simplest example is linear regression where we have observations of two variables (x_i, y_i) and wish to deduce a straight line y = mx + b from this data. Our task, therefore, is to determine a slope m and an intercept b such that this straight line fits the observed data best. The critical word here is "best" because we will need to define very carefully what we mean by best. The classic criterion is to minimize the squared difference between model and observations, which is called the method of least squares. Thus, we take

D = \sum_i (y_i - y)^2 = \sum_i (y_i - m x_i - b)^2

and desire to minimize D in the space of all possible m and b. The least squares sum is a simple function and so we can apply calculus to it, i.e. the minimum has the first derivative equal to zero,

\frac{\partial D}{\partial m} = \frac{\partial D}{\partial b} = 0.

Solving this leads to

m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m \bar{x}

where the over-line represents an average of the observed data. This method is very simple and yields a result visible in figure 5.15. Regression lends itself to more complex questions. We may principally make matters more complex along three different directions: The observed data may be in more dimensions, the function that we hope describes the data may be non-linear and the criterion for "best" fitting may be more complex than the least squares criterion. The investigation of methods, such as the above, in the large arena opened by these three directions constitutes the field of regression.
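The two formulas for m and b translate directly into code. The sample points below are made up, scattered loosely around the line y = 2x + 1:

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]  # roughly y = 2x + 1

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# m = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)

# b = y_bar - m * x_bar
b = y_bar - m * x_bar

print(round(m, 3), round(b, 3))  # → 1.97 1.06
```

As expected, the fitted slope and intercept land close to, but not exactly on, the values 2 and 1 that generated the noisy data.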


Fig. 5.15 An example of a linear regression line drawn through a set of observations in two dimensions.

5.9.2 Basis Functions

When the observations are multidimensional, we merely represent the observations by a vector x_i. The non-linearity of the function can be complex and varied. An important possibility is to work with basis functions. An example of the basis function approach is a Fourier series in which we represent a function by a combination of sines and cosines,

f(x) = \frac{a_0}{2} + a_1 \cos(x) + b_1 \sin(x) + a_2 \cos(2x) + b_2 \sin(2x) + \cdots + a_n \cos(nx) + b_n \sin(nx) + \cdots.   (5.13)

The trick in such a representation is to answer the following questions: (1) Can my function be represented in such a way in the first place? (2) Does this series converge to the true value of my function after a practical number of terms, and how many terms do I need? (3) How do I calculate the values of the coefficients? In the case of Fourier series, we may answer these questions: (1) If the function is square integrable, then it may be represented in this way. A square integrable function is one for which

\int_{-\infty}^{\infty} |f(x)|^2 \, dx

is finite. Note that this is true for nearly all functions that we are likely to meet in real life. (2) It will converge to the true value of the function at every point at which the function is differentiable. As the frequency of variation increases with each set of terms, we may terminate the series after a number of terms such that further terms would vary too rapidly for our practical purpose. In practice, we observe that this number of terms is low enough for this to be practical. Note, for instance, that music is represented (roughly) this way on a CD or in an MP3 file. (3) There are formulas for working out the value of the coefficients. We will not present them here as this would carry us too far afield.
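To see the convergence in question numerically, one can sum a truncated series. The square wave is a standard example whose Fourier coefficients are known in closed form, (4/π) Σ sin((2k+1)x)/(2k+1); the partial sums approach the function value as terms are added:

```python
import math

def square_wave_partial(x, n_terms):
    """Partial Fourier sum of the square wave sign(sin x):
    (4/pi) * sum over k of sin((2k+1)x) / (2k+1)."""
    return (4.0 / math.pi) * sum(
        math.sin((2 * k + 1) * x) / (2 * k + 1) for k in range(n_terms)
    )

# At x = pi/2 the square wave equals 1; more terms give a better value.
for n in (1, 5, 50):
    print(n, round(square_wave_partial(math.pi / 2, n), 4))
```

The slow, oscillating approach to the true value (and the overshoot near the jumps, the Gibbs phenomenon) illustrates why question (2) above, how many terms are enough, has to be answered relative to the practical purpose at hand.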

5.9 Non-Linear Multi-Dimensional Regression

115

In summary, the sines and cosines form a basis in which we may expand square integrable functions. There are other bases for other types of functions, but the Fourier series is the most important one for practical applications. Another important basis is the polynomial basis, i.e. the fact that any polynomial function can be written in terms of the powers x^n. So we may put

y = b0 + b1 x + b2 x^2 + b3 x^3 + · · · + bn x^n.

This is suitable if the dependency of y upon x has at most n − 1 extrema (local maxima and minima) and at most n − 2 points of inflection. To determine the bi, we apply the same method as before. We form the least squares sum

D = Σ_i ( y_i − b0 − b1 x_i − b2 x_i^2 − b3 x_i^3 − · · · − bn x_i^n )^2

and then set the first derivatives equal to zero,

∂D/∂b_i = 0   for all i with 0 ≤ i ≤ n.

This yields a system of equations that can be solved using matrix methods (e.g. Gaussian elimination), and you are left with a regression polynomial.
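A sketch of this procedure in Python: numpy's least-squares solver plays the role of the Gaussian-elimination step on the normal equations; the data points are invented and noise-free so the fit is exact.

```python
import numpy as np

# Sketch: regression polynomial of degree n by least squares.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x ** 2            # hidden quadratic, noise-free

n = 2                                       # degree of the polynomial
V = np.vander(x, n + 1, increasing=True)    # columns 1, x, x^2
coeffs, *_ = np.linalg.lstsq(V, y, rcond=None)
print(np.round(coeffs, 6))                  # recovers b0 = 1, b1 = 2, b2 = -0.5
```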

5.9.3 Nonlinearity

The above example of using the polynomial basis to obtain a function that is nonlinear in the independent variable x is a popular method. It leaves us with the question of determining the "correct" value of n. The answer to this is complex because we first need to agree on a criterion that would allow us to objectively judge the quality of a particular choice. There are various possibilities, but the least squares sum will also be useful here, i.e. select that n that minimizes D over the space of all n. This can easily be determined with a little computation. Let us recall Taylor's theorem from our school days. In the simple version, it states that any n + 1 times differentiable function f(x) can be approximated in the neighborhood of a point x = a by

f(x) ≈ f(a) + f'(a)(x − a) + f''(a)/2! (x − a)^2 + f'''(a)/3! (x − a)^3 + · · · + f^(n)(a)/n! (x − a)^n.

The approximation is at least as accurate as the remainder term

R = ∫_a^x f^(n+1)(t)/n! (x − t)^n dt.


Simply put, we may approximate any reasonable function by a polynomial within some region around an interesting point. Because this theorem applies to virtually every function that we are likely to meet in daily life, the polynomial basis is a valid approximation for virtually any function within a certain convergence interval. Of course, practically it will not be easy to determine the convergence interval, but it can be determined by experimental means (deviation between polynomial and reality). Strictly speaking, we must analyze the remainder term R and show that it approaches zero in the limit of n approaching infinity. If we can do this, the function is called analytic and its series expansion will converge to the function within a neighborhood around a. The neighborhood's size can be determined by finding the range of values for which the remainder tends to zero for infinite n. With closed form functions, we can compute this, but with empirically determined functions, we must compare the output of the polynomial to experimental data.

To truly capture a real life case, however, we need to be able to consider several independent variables. We will denote the variables by x(i) to distinguish them from the respective empirical measurements x(i)_j. Thus, if we want the fifth empirical measurement of the third variable, we would write x(3)_5. If we now take m such variables and combine them to a maximum power of n, then the function is

y = Σ_{i1=0}^{n} Σ_{i2=0}^{n−i1} Σ_{i3=0}^{n−i1−i2} · · · Σ_{im=0}^{n−i1−i2−···−i(m−1)} b_{i1 i2 ··· im} x(1)^{i1} x(2)^{i2} · · · x(m)^{im}.

As before, we form the least squares sum

D = Σ_j ( y_j − Σ_{i1=0}^{n} Σ_{i2=0}^{n−i1} · · · Σ_{im=0}^{n−i1−i2−···−i(m−1)} b_{i1 i2 ··· im} x(1)_j^{i1} x(2)_j^{i2} · · · x(m)_j^{im} )^2.

To determine the parameters, we again take the relevant partial derivatives and solve the ensuing system of coupled equations to arrive at the b_{i1 i2 ··· im}. The m is apparent from the problem specification: it is equal to the number of independent variables that are available. However, we must be careful here: not all available variables really do influence y (enough to matter)! Thus, we must be careful to choose only those independent variables that have a relevant and significant influence, and this is a matter of some delicacy that requires domain knowledge about the source of these variables. The n may be determined as before by performing several computations with diverse n and comparing D. In this manner, we may deduce a non-linear model from data that will be valid at least locally. Please note that such models are inherently static, as we have said nothing about time – time is not just another variable, as it represents a causal correlation between subsequent measurements of the same variable and creates a totally different complication.
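The multivariate construction can be sketched as follows; two variables and a hidden "true" model are invented for illustration, and the nested sums in the formula become an enumeration of exponent tuples (i1, i2) with i1 + i2 ≤ n.

```python
import numpy as np

# Sketch of the multivariate case with m = 2 variables, total degree n = 2.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + 0.5 * x1 * x2   # invented hidden model

n = 2
# all exponent pairs with i1 + i2 <= n, mirroring the nested sums
exponents = [(i1, i2) for i1 in range(n + 1) for i2 in range(n + 1 - i1)]
A = np.column_stack([x1 ** i1 * x2 ** i2 for i1, i2 in exponents])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

for (i1, i2), coef in zip(exponents, b):
    print(f"b_{i1}{i2} = {coef:+.3f}")
```

The solver recovers the hidden coefficients (b00 = 3, b10 = 2, b01 = −1.5, b11 = 0.5) and near-zero values for the remaining terms.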


5.10 Case Study: Customer Segmentation

At a major European wholesale retailer, hoteliers, restaurants, caterers, canteens, small and medium-sized retailers as well as service companies and businesses of all kinds find everything they need to run their daily business. Every customer has a dedicated membership number and card. Due to this, it is possible to attribute every item sold to a particular customer. Customer segmentation in general is the problem of grouping a set of customers into meaningful groups based, for example, on their profession or on their buying behavior. In this particular case, it also allows us to trace which customers belong to these groups because we are aware of their (business) identities. This traceability is attempted by many other retailers via loyalty programs in which clients also allow the retailer to attach their identity to the products purchased. Globally speaking, it is interesting to find buying patterns that can be detected in a certain group of clients. Based on a more detailed description of these groups and investigations of cause and effect for the actions of these groups, it is then possible to adjust the business model to react to such features, for example with targeted advertising such as specific product offerings to specific customers based on their purchasing habits.

Such an investigation has taken place for a dataset of all sold items in two stores over one calendar year. This included over 31 million transactions. The investigation included no particular questions to be answered and no a priori hypotheses to be confirmed or denied. The goal was to find any structures that might be economically interesting from the point of view of marketing to these clients. The methods used to treat the data were diverse in nature. We used descriptive statistics, non-linear multi-dimensional regression analyses in all dimensions, k-means clustering and Markov chain modeling. The aims were as follows:

1. Descriptive statistics [91]: To get an overall feel for the dataset and its various sections as discovered by the other algorithms. This includes correlation analyses. In supporting the Markov chain methodology, this also includes Bayesian prior and posterior distribution analysis, which is able to tell, for example, in which order in time events happen (leading to cause and effect conclusions).
2. Nonlinear multidimensional regression [40]: To get a model of the dependencies of the variables among each other. Expressing variables in terms of each other can lead at once to understanding and also to dimensionality reduction.
3. k-means clustering [44]: To find out which purchases/clients belong to the same phenomenological group and thus determine the actual segmentation that the other methods describe.
4. Markov chain modeling [44]: To model the time-dependent dynamics of the system and thus to find out what stable states exist.

Several descriptive conclusions are available to help with the understanding of the dataset. We present them here in a descriptive format, as this is all that is required for understanding the final result. In the actual case study, these conclusions can be made numerically precise:


1. The total amount of money spent per visit is, statistically speaking, the same per visit for any particular client. Thus, in order to increase total revenues, the key is to increase customer traffic – either by getting a client to come more often or by attracting new customers.
2. Customers will generally go to the store closest to their own location (in this study their place of business, since we are dealing with a wholesaler). The probability of visiting another store decreases exponentially with distance.
3. The high seasonal business is focused in the aftermath of the summer school holidays and the preparations for Christmas. The low seasonal business is focused in the summer school holidays and in the early year after Christmas.
4. The total amount of money spent per year and per visit, as well as the number of articles purchased, depends highly on the type of client and the geographical region. This has a significant effect upon storage and logistics planning.
5. The majority of clients very rarely shop in the store. There is a core group of clients that shop quite regularly.
6. The products and product groups sold depend strongly on regional effects and on the visit frequency of a customer.
7. Certain products are generally bought in combination with certain other products. Thus, we may speak of a "bag of goods" that is generally bought as a whole. The contents of this bag depend upon the customer group and geography.
8. Via Bayesian analysis and Markov chain modeling it is possible to deduce that the purchase of a certain product causally leads to the purchase of another product as an effect of the initial purchase. An example is that a purchase of fresh meat directly leads to the purchase of vegetables, cheese, and other milk products.

To summarize these conclusions, we may say that customer behavior depends upon geography, product availability, time of the year and certain key products.
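To give a flavor of the k-means step used in such a study (the real features and data are confidential), here is a minimal sketch on two invented per-customer features, standing in for quantities such as visit frequency and average basket value.

```python
import numpy as np

# Minimal k-means sketch on made-up customer features.
rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal([1.0, 1.0], 0.1, (50, 2)),   # occasional visitors, small baskets
    rng.normal([5.0, 4.0], 0.1, (50, 2)),   # regular visitors, large baskets
])

k = 2
centers = customers[[0, -1]].copy()          # deterministic initial centers
for _ in range(20):
    # assign each customer to the nearest center, then move each center
    # to the mean of its assigned customers
    dists = np.linalg.norm(customers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    centers = np.array([customers[labels == j].mean(axis=0) for j in range(k)])

print(np.round(centers[np.argsort(centers[:, 0])], 1))
```

The two recovered centers sit near the two invented customer groups; each customer's label then gives the segment membership.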
It was determined that the following factors offer significant potential to improve the profitability of the retail market (most significant first):

1. Individual marketing [42]: Customers tend to be interested in a narrow range of products. It is instructive to cluster the customers into interest groups. We find that there are fewer than 10 clusters that hold a significant number of customers and that are sufficiently heterogeneous in terms of the products they buy to really divide the customers into different groups. These different interest groups could now be treated differently in some ways, e.g. by sending them advertising materials specifically targeted towards their interest group.
2. Price arbitrage: In each important product group there is a particular product that is the causal product in the group. This means that if the customer buys this product, then the customer will also buy a variety of related products in this category. This cause-effect relationship may be used to make this key product more attractive in order to boost sales in the entire product group. One way to do this is to lower the price of the key product (and raise the prices of the non-key products in the same category). It can be shown that the causal relationship is independent of price changes. However, the identity of the key product is not universal, in that there are regional differences.


3. Geography: Most sales are made to customers whose place of business is 20 to 40 minutes away from the store. At an average travel speed of 30 km/h, this is an area of approximately 940 km², which is comparable to the size of a moderately sized city. The wholesaler can focus his efforts, e.g. when establishing one-to-one contacts with his customers, on this area. Promotional activities in this area, like billboard advertising on major roads, may also be effective.
4. Time of the year: The main purchase times are March, August and pre-Christmas. The low times are January, February and the summer holidays. The rest of the year corresponds to average purchase activity. The advertising should reflect this trend, focusing on and exploiting the seasonal peaks.

Due to non-disclosure, we have presented the conclusions at such a high level. The procedures of data mining are able to output a quantitative presentation of these results (also with uncertainty corridors) that allows these conclusions to act as a firm basis for business decisions. We note that these conclusions were the result of blind analysis. That is, the data of 31 million transactions were given to the mining algorithms without specifying either questions or hypotheses. These algorithms output results that could be interpreted by an experienced human analyst to reach the above conclusions in just a few hours. Based on these results, we may now ask a number of specific questions to make the results clearer, especially when decisions have to be taken to implement changes based on these findings. We will not go into such an interactive question-answer process. Despite the wish to know more, these conclusions are quite telling and provide valuable material for high-level decision making.

This illustrates very well the power of data mining. We have converted a vast collection of data into a small number of understandable, actionable conclusions that can be presented to corporate management. Moreover, we have been able to do so quickly.
This procedure may well be reproduced automatically every month to track changes in customer behavior. One caveat remains, however: the challenge for any data-mining approach in a "bricks and mortar" business is to translate the findings into successful operational business concepts.

Chapter 6

Modeling: Neural Networks

6.1 What is Modeling?

A mathematical model is a mathematical description of a system. This may take the form of a set of equations that can be solved for the values of several variables, or it may take the form of an algorithmic prescription of how to compute something of interest, e.g. a decision tree to compute whether a produced part is good or bad. It is wrong to think that all models can be written in the traditional equation format of a y = mx + b, however complicated or simple the equation may be. Frequently, it is far simpler to give a step-by-step recipe based on if-then rules and the like to describe how to get at the desired result. The model is then this recipe.

Modeling is a process that has the mathematical model as its objective and end. Mostly it starts with data that has been obtained by taking measurements in the world. Industrially, we have instrumentation equipment in our plants that produces a steady stream of data, which may be used to create a model of the plant. Note that modeling itself converts data into a model – a model that fits the situation as described by the data. That's it. Practically, just having a model is nice but does not solve the problem. In order to solve a particular practical problem, we need to use the model for something. Thus, modeling is not the end of industrial problem solving; modeling must be followed by other steps, at least some form of corporate decision making and analysis. Modeling is also not the start of solving a problem. The beginning is formed by formulating the problem itself, defining what data is needed, collecting the data and preparing the data for modeling. Frequently it is these steps prior to modeling that require most of the human time and effort to solve the problem. Mathematically speaking, modeling is the most complicated step along the road. It is here that we must be careful with the methodology and the data, as much happens that has the character of a black box.
Generally speaking, modeling involves two steps: (1) manual choice of a functional form and learning algorithm, and (2) automatic execution of the learning algorithm to determine the parameters of the chosen functional form. It is important to

P. Bangert (ed.), Optimization for Industrial Problems, DOI 10.1007/978-3-642-24974-7_6, © Springer-Verlag Berlin Heidelberg 2012



distinguish between these two because the first step must be done by an experienced modeler after some insight into the problem is gained, while step two is a question of computing resources only (supposing, of course, that the necessary software is ready). In practice, however, this two-step process is most frequently a loop. The results of the computation make it clear that the manual choices must be re-evaluated, and so on. Through a few loops, a learning process takes place in the mind of the human modeler as to what approach would work best. Modeling is thus a discovery process with an uncertain time plan; it is not a mechanical application of rules or methods.

Fig. 6.1 A neural network is trained to distinguish two categories. After a few training steps (a) and two intermediate steps (b) and (c), the network settles into its ﬁnal convergent state (d). In this case it may take approximately 200 iterations of a normal neural network training method to reach the ﬁnal state.

Let us take the example of fitting a curve through a set of data points. First, we choose to fit a straight line, i.e. y = mx + b. Second, we use the linear least-squares algorithm to determine the values of m and b from the collected data. The result is the model, i.e. specific values for m and b. The algorithm used in the second step will produce the best values for the parameters that it can, given the functional form and the data. It will make an output even if a straight line is a patently poor fit for


the data. Thus, it is essential to understand what type of model even has a chance to correctly model the data. Frequently, model types are chosen that are provably so flexible that they can be used for nearly all data. An example of this is the neural network that we will deal with in this chapter. It can model virtually any relationship present in a dataset. However, this statement must be taken with a grain of salt, as it is contingent upon a variety of conditions, such as that the number of parameters may have to be increased indefinitely for the statement to hold. That would pose a practical problem, of course, because we desire to have many fewer parameters in the model than we have data points, and clearly the data points are finite in number. If we had as many parameters as data points, then the model would be no better than a simple list of the data points and the modeling step would not provide us with knowledge. It is precisely the compactification of lots of data into a functional form with a few parameters that encapsulates the knowledge gain of the modeling process. We can attempt to understand the model because it is "small." We can use the model to compute what the system will be like at points that were never measured (interpolation and perhaps extrapolation). If the number of parameters were too large, the model would merely learn the data by heart and we would lose both advantages – a situation known as over-fitting. Thus, it is fair to say that a model is a functional summary of a dataset – it is a summary in that it encapsulates the same information in fewer numbers, and it is functional in that we may use it to generate information that is not immediately apparent by evaluating it at novel points.

To make this clearer, we will look at an example in figure 6.2. This is a vibration measurement on an industrial crane. The gray jagged line displays the raw data and the dotted black line displays the cleaned data according to the methods of chapter 4.
In fact, we have seen this example in ﬁgure 4.1 before. What we have added in this ﬁgure is the solid black line.

Fig. 6.2 Here we compare raw data (gray) against ﬁltered data (black dotted) and the model (black solid) with a prediction into the future of the model.


Please note that the input data (raw as well as denoised) lasts from time zero to time 6500 minutes. This data is provided to a machine learning algorithm that produces a mathematical formula for calculating the next value in time given the previous values. Using this formula, we first attempt to re-create the known values and then calculate some more in order to predict the future. What we see here is that the model output (solid black line) for the time from zero to 6500 reproduces the known denoised data (dotted line) so well that we can hardly tell them apart on the diagram. That is a good sign. Beyond reproducing the known data, the model is then used to compute values for the time from 6500 to 8000 minutes. On the image, we have also graphed the uncertainty in this prediction, as we can no longer validate the prediction due to a lack of observations in the future. And so we see a line that slowly gains a greater and greater uncertainty but remains within a well-defined corridor of values. The formula found, the model, thus reproduces the known data and makes a prediction that, at face value, makes sense. Whether this corresponds to reality, only experimentation (waiting for the future) can reveal. Once we are confident that the model can indeed predict the future, we can use the model to compute the future. On this basis, we may then, with confidence, plan actions to prevent events that we do not want to occur but of which we now know that they will occur if we do nothing.
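The actual model behind figure 6.2 is not specified here, but the idea of learning a next-value formula and iterating it into the future can be sketched with a simple linear autoregressive model on a synthetic series.

```python
import numpy as np

# Sketch: learn a formula for the next value from the previous two values,
# then iterate it past the end of the data. The sine series is a synthetic
# stand-in for the denoised crane signal.
t = np.arange(200)
series = np.sin(0.1 * t)

lag = 2
X = np.column_stack([series[i:len(series) - lag + i] for i in range(lag)])
y = series[lag:]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# roll the learned formula forward 50 steps past the last observation
history = list(series[-lag:])
for _ in range(50):
    history.append(np.dot(coeffs, history[-lag:]))

print(round(history[-1], 3))   # prediction for t = 249
```

On this noise-free series the learned recurrence is essentially exact, so the 50-step-ahead prediction matches the continuation of the sine; real data would show the growing uncertainty corridor discussed above.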

6.1.1 Data Preparation

We will assume that the data is already clean and so contains only data that is representative of the problem we wish to model; see chapter 4 for methods of getting raw data this far. A factor that must be considered at this point is whether the data is in a form that allows a machine learning algorithm to quickly learn the salient features of the dataset. We give a simple example to illustrate the point.

Consider modeling the expected rise/fall of a stock price as a function of the principal balance sheet components of a company. To train this model, you may have data over several years from many companies. Such a model will not be useful to predict a share price (too little data per company), but it will reveal some interesting basic information. Two of the principal balance sheet figures are, for example, the stock price and the earnings. The machine learning algorithm will be able to learn that the expected rise/fall of the stock price depends upon the price-to-earnings ratio, but it will require both time and data to make this conclusion. Since this is something that we humans already know, it would have helped the algorithm if we had removed the column of prices and that of earnings and had inserted a column consisting of the ratio. This example is captured by the general injunction that one should add domain knowledge into the training data set. We have not attempted to explain the dynamics of the stock exchange to a neural network; rather, we transformed the data into a form that is conducive to learning, just as we would purchase a colorful, friendly


language learning book for our child, as opposed to a mere dictionary, to aid it in learning a foreign language.

Another feature of data preparation for machine learning is that many learning methods do not deal well with data that is collinear. What is meant by this is that if we have two series of observations (e.g. a and b) that are related by a simple linear transformation (e.g. a = mb + c with m and c constants), then the learner can become confused by this. The reason may be compared to a situation with human beings: suppose we are trying to teach someone the meaning of "chair" and illustrate this with examples of the same chair in a variety of sizes. The person who is to learn the concept may actually confuse the purpose of the chair with the size relationship in the dataset and thus learn something entirely unintended. This must be avoided by removing too simply related data from the dataset.

It cannot be overemphasized that the form in which data is presented to a machine learning algorithm has more influence on the accuracy of the final model than the choice of learning algorithm (as long as it is a reasonable algorithm that has the essential capability to learn the present problem). Thus, data preparation is a delicate activity and must be aided by thought and domain knowledge. Generally speaking, the following questions should be answered when modeling industrial data:

1. Are all the relevant measurements present and have they been cleaned in the sense of chapter 4?
2. Are only the relevant measurements present?
3. Have simple relationships been removed?
4. Can the data be transformed in some meaningful way in order to better represent the phenomenon that is important for modeling?
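Both preparation steps can be sketched on made-up numbers: encoding the price-to-earnings domain knowledge as a ratio column, and flagging collinear column pairs via the correlation matrix.

```python
import numpy as np

# Sketch of two data-preparation steps; all figures are invented.
rng = np.random.default_rng(2)

# (1) feature transformation: replace price and earnings by their ratio
price = np.array([120.0, 45.0, 300.0, 80.0])
earnings = np.array([10.0, 9.0, 12.0, 2.0])
pe_ratio = price / earnings
print(pe_ratio)                      # [12.  5. 25. 40.]

# (2) collinearity check: b is an exact linear function of a
a = rng.normal(size=500)
b = 3.0 * a + 1.0                    # a = (b - 1)/3: collinear pair
c = rng.normal(size=500)             # independent column
corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)
redundant = [(i, j) for i in range(3) for j in range(i + 1, 3)
             if abs(corr[i, j]) > 0.99]
print(redundant)                     # only the (a, b) pair is flagged
```

One of each flagged pair would then be dropped before training.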

6.1.2 How much data is enough?

When modeling, the question of how much data is needed always arises. Often, we are limited by practicality. Getting data sometimes represents a cost in time, money and effort. Having more data may also make learning slower. Increasing the sample rate may just produce noise and not more valuable signal information. Too little information is not enough, and too much may be counterproductive as well. Before we end up with too much data, there is definitely a long interval of diminishing returns from getting more and more data. One can philosophically say that an industrial process contains a certain amount of information regardless of how much data we extract from it. We must make sure that this information is represented in the data and by a reasonable amount of data. Thus, we must choose the right amount of data that represents the greatest gain of knowledge for the resources that we put into its acquisition. Because this is dependent upon the problem at hand, we need a quantitative measure of "enough." We will approach this in two steps.


First, we will say that what we are really looking for is that, if we have a reasonable model of the situation, an additional bit of data will improve the model somewhat. As soon as we have reached the state where additional data no longer allows a model improvement, we may stop and say that we have enough data. Thus, we define enough data as that amount of data that we need to get the model to converge (to the right result), i.e. the model output for the validation data agrees with the experimental measurements for the same dataset and no longer changes appreciably with more data.

Second, we need to find a quantitative measure of convergence. This is provided, for example, by the variance. After the model is generated, we compute the variance, add more data, remodel and again compute the variance. In this way, we obtain the relationship between the size of the dataset and the variance. This relationship is roughly logarithmic, i.e. the variance rises with increasing dataset size and eventually settles down to a more or less horizontal line. This can easily be detected, and there is your convergence point. Equally well, this can be measured by the mean squared deviation between model output and experimental data in the validation dataset. This should behave in roughly the same manner and thus provide the point at which training may profitably stop.

Note that in this approach it is not possible to calculate, a priori, how many data points are needed. Rather, it is a checking procedure, depending on the modeling method, to determine whether we already have enough. Any method to compute a definite data volume before modeling begins will be a mere estimate. Note also that the information content is not a simple function of data volume as, for example, many repetitions of the same measurement do not add any information. The data volume must therefore contain a suitably diverse population of points to aid the analysis.
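A sketch of this convergence check: refit a model on growing subsets of synthetic data and watch the held-out validation error settle onto a horizontal line near the noise floor.

```python
import numpy as np

# Sketch of the "enough data" check on an invented linear process with
# Gaussian noise of standard deviation 0.5 (noise variance 0.25).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 2000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 2000)
x_val, y_val = x[1500:], y[1500:]            # held-out validation set

errors = []
for size in (10, 50, 250, 1500):             # growing training subsets
    V = np.column_stack([np.ones(size), x[:size]])
    b, m = np.linalg.lstsq(V, y[:size], rcond=None)[0]
    errors.append(np.mean((y_val - (b + m * x_val)) ** 2))

print([round(e, 3) for e in errors])         # settles near the noise variance
```

The point at which this curve flattens is the convergence point described in the text; beyond it, more data no longer improves the model appreciably.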

6.2 Neural Networks

The term neural network refers to a wide family of functional forms that are frequently used to model experimental data. Historically, the creation of some of these forms was motivated by the apparent design of the human brain. This historical motivation does not concern us here and will not be discussed.

For the present book, we shall distinguish between a functional form and a function by allowing the former to have numerically undetermined parameters and requiring the latter to specify numerical values for all parameters. So, for example, sin(ax) with a a parameter is a functional form and sin(2x) is a function. This distinction is crucial in machine learning, as the principal effort in machine learning is always put into the methods of determining the numerical values of the parameters in an a priori determined functional form. Practically speaking, we begin with a dataset and decide (by some unspecified and generally human procedure) that a certain functional form should be able to


model that data. Then a machine learning algorithm determines the values of the parameters. The end result is a function (without parameters) that models the data. The topic of neural networks can be profitably split into two categories:

1. A list of different functional forms, so-called networks, that can be used to model data and their attendant properties, such as
   a. the kind of functions that can be represented,
   b. restrictions on the values of parameters,
   c. robustness properties, and
   d. scaling behavior.
2. A presentation of various algorithms used to determine the numerical values of the parameters in the functional form and their attendant properties, such as
   a. requirements on the training data form (labeled or unlabeled) or cleanliness (signal-to-noise ratio or other pre-processing requirements),
   b. speed of training,
   c. convergence rate, convergence target (local or global optimum) and robustness of convergence, and
   d. practical issues such as termination criteria, parametrization or initialization requirements.

Most books on neural networks mix these topics strongly and implicitly place a focus on the second. In practice, however, it is important to distinguish the network from the algorithm used to train it, because we usually have a (largely) independent choice for both network and algorithm and must be aware of the involved limitations. For understanding, it is also important to know what a network can accomplish in the realistic case.

Training a neural network is a black art requiring much experience and depends mostly on the person preparing the data (pre-processing) and selecting the training methodology and parametrization of the training method. Even then, the issue of convergence to local optima typically requires significant tuning to a particular problem before a function is found that represents the data well enough for practical purposes.¹ For this reason, we will not be discussing training algorithms at all but refer to the specialist literature for this purpose [69]. We mention in passing that if you have a problem that you wish to model using neural networks and you are not already an expert in training them, it is probably best to get an expert to do the modeling for you, as learning how to do it can require many months. This chapter is intended to give the novice an overview of what neural networks are, what they can and cannot do, and to give a sense of the complexity of the topic.
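To make the notion of a network as a functional form (category 1) concrete, here is a minimal one-hidden-layer forward pass; the architecture and all numbers are arbitrary illustrations, not a recommendation.

```python
import numpy as np

# A minimal functional form: y = w2 · tanh(W1 * x + b1) + b2 with a
# three-unit hidden layer. Until the parameters are given numbers, this is
# a functional form; with numbers, it becomes a function.
def network(x, W1, b1, w2, b2):
    return np.dot(w2, np.tanh(W1 * x + b1)) + b2

# choosing numerical parameter values turns the form into a function
W1 = np.array([1.0, -2.0, 0.5])
b1 = np.array([0.0, 1.0, -1.0])
w2 = np.array([0.3, 0.7, -0.2])
b2 = 0.1

print(round(network(0.0, W1, b1, w2, b2), 4))  # → 0.7854
```

A training algorithm (category 2) would adjust W1, b1, w2 and b2 until the resulting function fits the data.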
Before going into the details, we need to discuss two important issues.

¹ Together, the neural network topology and the training method give rise to several variables that have to be chosen by the human modeler. On top of that, most training methods involve some degree of random number generation, which means that each training run will be conducted slightly differently, and so results cannot be completely comparable. The exact effect of changing any one of the initial parameters is unclear, and so much stabbing in the dark may become necessary before enough learning has happened in the modeler's mind.


6 Modeling: Neural Networks

First, neural network methods yield a function that describes a specific set of data. What is first done to obtain this set of data, or what is later done with this formula, is no longer an aspect of neural network theory or practice. As such, it is useful to view a neural network as a summary of data – a large table of numbers is converted into a function – similar to the abstract of a scientific paper being a summary. Please note that a summary cannot contain more information than the original set of data²; indeed it contains less! Due to its brevity, however, it is hoped that the summary may be useful. In this way, neural networks can be said to transform information into knowledge, albeit into knowledge that still requires interpretation to yield something practically usable. Second, neural networks are intended for practical modeling purposes. The summarization of data is nice but it is not sufficient for most applications. To be practical, we require interpolative and extrapolative qualities of the model. Supposing that the dataset included measurements taken for the independent variable x at the values x = 1 and x = 2, the model has the interpolative quality if the model output at a point in between these two is a reasonable value, i.e. it is close to what would have been measured had this measurement been taken, e.g. at x = 1.5. Of course, this can only hold if the original data has been taken with sufficient resolution to cover all effects. The model has the extrapolative quality if the model output for values of the independent variable outside the observed range is also reasonable, e.g. for x = 2.5 in this case. If a model behaves well under both interpolation and extrapolation, it is said to generalize well. A neural network model is generally used as a black-box model.
That is, it is not usually taken as a function to be understood by human engineers but rather as a computational tool to save having to do many experiments and determine values by direct observation. It is this application that necessitates both above aspects: We require a function for computation and this is only useful if it can produce reasonable output for values of the independent variables that were not measured (i.e. compute the result of an experiment that was not actually done having conﬁdence that this computed result corresponds to what would have been observed had the experiment been done). The data used for training can be obtained from actual experiments or alternatively from simulations – neural networks are not concerned with the data source. Simulations of physical phenomena are often very complex as they are usually done from so-called ﬁrst principles, i.e. by using the basic laws of physics and so on to model the system. As these simulations take time, they cannot be continuously run in an industrial setting. Neural networks provide a simpler empirical device by adjusting the parameters of a functional form until the so-determined function represents the data well. 2 Whatever is in the data will hopefully also be in the network but there is no guarantee of this as the summarization process is a lossy process. Whatever is not in the data, however, is deﬁnitely not in the network. We must thus not expect a neural network to divine effects that have not been measured in the data. Issues such as noise and over-representation of one effect over another can produce unexpected results. That is the reason why pre-processing is so important.


This represents the major advantages of neural networks: They are (1) easier to create than first-principles simulations, (2) able to capture more dynamics than a typical first-principles model, because they model actual experimental output rather than idealized situations, and (3) once built, easy and fast to evaluate. They are thus practical and cheap. The price that must be paid for this practicality is that neither the way in which the function is obtained nor the resultant function itself is intended for human understanding. Questions such as ‘why’ and ‘how’ must thus never be asked of a neural network. We may only ask ‘what’ the value of some variable is.

6.3 Basic Concepts of Neural Network Modeling

The dataset that will be used as the basis for obtaining the neural network contains several variables. For better understanding, we include a very simple example in table 6.1. We have three variables x1, x2 and y1 in this table. Each variable has an associated measurement uncertainty or error Δ(x1), Δ(x2) and Δ(y1). It is important to note that no observation whatsoever is fully accurate and so there are always measurement errors. Frequently, however, their presence is ignored for basic modeling purposes. We include them here because they have significant effects in industrial practice.

  x1    Δ(x1)   x2    Δ(x2)   y1    Δ(y1)
  1.2   0.1     2.3   0.2     0.1   0.01
  1.4   0.1     3.1   0.2     0.2   0.01
  1.6   0.2     3.2   0.2     0.3   0.02
  1.8   0.2     3.5   0.3     0.4   0.05

Table 6.1 An example data set for training a neural network. Note the presence of uncertainty measurements as well. This is important in practice as no measurement in the real world is totally precise.

The variables must first be classified into dependent and independent variables or, to use neural network vocabulary, into output and input variables respectively. We will decide that the two x variables are independent and yield the single y variable that is the dependent variable. We could have arbitrarily many independent and dependent variables in general; there is no fundamental limitation. Thus, we look for some function f(···) such that y1 = f(x1, x2). In general, when we have many variables, we represent their collection by a vector, x = {x1, x2, ···}. The general function is thus y = f(x). Knowing the Δxi, we may compute Δy by


(Δy)² = Σi (∂f(x)/∂xi)² (Δxi)².
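This error propagation formula can be applied numerically even when f is only available as a black box, by estimating the partial derivatives with central differences. The model f(x1, x2) = x1 · x2 below is purely a hypothetical illustration, using the uncertainties from the first row of table 6.1:

```python
import math

def propagate_error(f, x, dx, h=1e-6):
    """Gaussian error propagation (Δy)² = Σᵢ (∂f/∂xᵢ)² (Δxᵢ)²,
    with partial derivatives estimated by central differences."""
    var = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        dfdxi = (f(xp) - f(xm)) / (2 * h)
        var += (dfdxi * dx[i]) ** 2
    return math.sqrt(var)

# Hypothetical model y = x1 * x2 evaluated at the first row of table 6.1
f = lambda x: x[0] * x[1]
dy = propagate_error(f, [1.2, 2.3], [0.1, 0.2])
# analytic check: (Δy)² = x2²(Δx1)² + x1²(Δx2)² = 0.23² + 0.24²
```

For this bilinear model the central-difference estimate is exact, so the numerical and analytic results agree to rounding error.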

Having classified the variables into input/output for the function, we must classify the type of output variable. There are two principal options: an output variable may be numeric or nominal. If a variable is numeric, its numerical value has some significance and so it makes sense to compare various values to each other, e.g. a temperature measurement. The function to be modeled is thus a regular mathematical function that could be drawn on a plot. If a variable is nominal, then its value serves only to distinguish it from other values and its numerical value has no significance. We use nominal values to differentiate category 1 from category 2, knowing that it is senseless to say that the difference between category 2 and category 1 is 1. Neural networks are very often used for nominal variables and many are specifically intended to be classification networks, i.e. they classify an observation into one of several categories. Empirically, it has been found that neural network methods are very good at learning classifications. Generally, most neural network methods assume that the data points illustrated in the above sample table are independent measurements. This is a pivotal point in modeling and bears some discussion. Suppose we have a collection of digital images and we classify them into two groups: those showing human faces and those showing something else. Neural networks can learn the classification into these two groups if the collection is large enough. The images are unrelated to each other; there is no cause-effect relationship between any two images – at least not one relevant to the task of learning to differentiate a human face from other images. Suppose, however, that we are classifying winter versus summer images of nature and that our images were of the same location and arranged chronologically with relatively high cadence. Now the images are not independent but rather have a cause-effect relationship ordered in time.
This implies that the function f(···) that we were looking for is really quite different:

y = f(x)  →  yi = f(xi−1, xi−2, ···, xi−h).

In this version, we see a dependence upon history that implies a time-oriented memory of the system over a time length h that must somehow be determined. Depending on the dynamics of the time-dependent system, the memory of the process does not have to be a universal constant, so that in general h = h(i) is itself a function of time. As a consequence of this, we have network models that work well for datasets with independent data points (see section 6.4) and others that work well for datasets in which the data points are time-dependent (see section 6.5). The networks that deal with independent points are called feed-forward networks and form the historical beginning of the ﬁeld of neural networks as well as the principal methods being used. The networks that deal with time-dependent points are called recurrent networks, which are newer and more complex to apply.
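The move from independent points to a history-dependent model amounts to building lagged training pairs from the series. A minimal sketch, assuming a fixed memory length h (i.e. ignoring the general case h = h(i)):

```python
def make_lagged_pairs(series, h):
    """Turn a time series into supervised pairs in the form
    y_i = f(x_{i-1}, ..., x_{i-h}) discussed above."""
    pairs = []
    for i in range(h, len(series)):
        history = [series[i - k] for k in range(1, h + 1)]  # x_{i-1} ... x_{i-h}
        pairs.append((history, series[i]))                   # target is y_i
    return pairs

pairs = make_lagged_pairs([10, 11, 13, 16, 20], h=2)
# first pair: history [11, 10] predicts 13
```

Each pair can then be fed to any of the modeling methods of this chapter as if the points were independent, the time ordering having been folded into the input vector.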


6.4 Feed-Forward Networks

The most popular neural network is called the multi-layer perceptron and takes the form

y = aN(WN · aN−1(WN−1 · · · a1(W1 · x + b1) · · · + bN−1) + bN)

where N is the number of layers, the Wi are weight matrices, the bi are bias vectors, the ai(···) are activation functions and x are the inputs as before. The weight matrices and the bias vectors are the place-holders for the model’s parameters. The so-called topology of the network refers to the freedom that we have in choosing the size of the matrices and vectors, and also the number of layers N. Per layer, we get to choose one integer and thus have N + 1 integers to choose for the topology. The only restriction that we have is the problem-inherent size of the input and output vectors. Once the topology is chosen, the model has a specific number of parameters that reside in these matrices and vectors. The activation functions are almost always sigmoid-shaped functions, for example tanh(···). In training such a network, we must first choose the topology of the network and the nature of the activation functions. After that, we must determine the values of the parameters inside the weight matrices and bias vectors. The first step is a matter of human choice and involves considerable experience. After decades of research into this topic, the initial topological choices of a neural network are still effectively a black art. There are many results to guide the practitioners of this art in their rituals but these go far beyond the scope of this book. The second step can be accomplished by standard training algorithms that we mentioned before and also will not treat in this book (see e.g. [69]). A single-layer perceptron is thus y = a(W · x + b) and was one of the first neural networks to be investigated. The single-layer perceptron can represent only linearly separable patterns.
It is possible to prove that a two-layer perceptron with a sigmoidal activation function for the ﬁrst layer and a linear activation function for the second layer can approximate virtually any function of interest to any degree of accuracy if only the weight matrices and bias vectors in each layer are chosen large enough. In practice, we ﬁnd that perceptrons with between two and four layers are used very frequently to model data. Such a two-layer perceptron looks like y = m (W2 · tanh (W1 · x + b1 ) + b2 ) where m is a scalar and where we get to choose the size of the two weight matrices to match the problem.
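The forward pass of such a two-layer perceptron can be written out directly. The weights and sizes below are made up for illustration; in practice they would come from a training algorithm:

```python
import math

def mlp_forward(x, W1, b1, W2, b2, m=1.0):
    """Two-layer perceptron y = m (W2 · tanh(W1 · x + b1) + b2)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [m * (sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(W2, b2)]

# 2 inputs -> 3 hidden units -> 1 output, with invented weights
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.7]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.2]
y = mlp_forward([1.2, 2.3], W1, b1, W2, b2)
```

The topology choice discussed above is exactly the choice of the matrix shapes: here one free integer, the hidden width 3, since input and output sizes are fixed by the problem.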


This approach is used both for numerical data and for nominal data and is found to work very well indeed. Many industrial applications are based on perceptrons, e.g. adaptive controllers.

6.5 Recurrent Networks

The basic idea of a recurrent neural network is to make the future dependent upon the past by a similar form of function to the perceptron model. So, for instance, a very simple recurrent model could be

x(t + 1) = a(W · x(t) + b).

Note very carefully three important features of this equation in contrast to the perceptron discussed before: (1) both input and output are no longer vectors of values but rather vectors of functions that depend on time, (2) there is no separate input and output but rather the input and output are the same entity at two different times and (3) as this is a time-dependent recurrence relation, we need an initial condition such as x(0) = p for evaluation. The above network, if we choose the activation function to be

a(z) = 1 for z > 1,   a(z) = z for −1 ≤ z ≤ 1,   a(z) = −1 for z < −1,

is called the Hopfield network and is a very good classifier. The methodology is as follows: (1) every one of the possible categories is characterized by a vector of values called a primary pattern, (2) each item to be classified is also characterized by a similar vector of values, (3) the network is trained using a specialized training algorithm, (4) the characterizing vector of an as-yet-unclassified item is input into the network as the initial condition p, (5) the network is iterated in “time” until the vector converges to an unchanging state, and (6) this convergent state is one of the primary patterns and thus the classification is done. This network uses the concept of time to accomplish a static task in which actual time does not play a role. If correctly constructed, the time iteration can make the network numerically more stable and so produce more reliable answers than a static network like the perceptron.
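The iteration just described can be sketched as follows. The Hebbian weight rule used here is one common way to store primary patterns; the specialized training algorithm the text refers to may differ. The two patterns and the occluded start vector are invented for illustration:

```python
def clip(z):
    """The piecewise-linear Hopfield activation: -1, z, or +1."""
    return max(-1.0, min(1.0, z))

def hebbian_weights(patterns):
    """A simple Hebbian storage rule (one common choice, not necessarily
    the book's algorithm): W_jk = Σ_p p_j p_k with zero diagonal."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for j in range(n):
            for k in range(n):
                if j != k:
                    W[j][k] += p[j] * p[k]
    return W

def iterate(W, x, steps=20):
    """Iterate x(t+1) = a(W · x(t)) until the state stops changing."""
    for _ in range(steps):
        nxt = [clip(sum(W[j][k] * x[k] for k in range(len(x))))
               for j in range(len(x))]
        if nxt == x:
            break
        x = nxt
    return x

# Two primary patterns; start from a partially occluded copy of the first
p1 = [1, -1, 1, -1, 1, -1]
p2 = [1, 1, -1, -1, 1, 1]
W = hebbian_weights([p1, p2])
start = [1, -1, 1, 0, 0, 0]            # last three entries "occluded"
result = iterate(W, [float(v) for v in start])
# result converges to the stored pattern p1
```

The occluded input is thereby completed to the nearest stored primary pattern, which is exactly the classification mechanism described in steps (4)–(6).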

Fig. 6.3 The ﬁrst three digits as input to train a Hopﬁeld network to recognize these digits.


We demonstrate the issues by means of a common example, recognition of digits. In figure 6.3 we display the bit patterns of three digits that we wish to recognize using a Hopfield network. Each two-dimensional pattern can easily be converted into a one-dimensional vector of bits by appending each column to the bottom of the previous column. Thus, the digit “0” becomes the vector x0 = [011110100001100001100001100001011110]T. With this setup, we can train the Hopfield network and obtain the matrix W and the vector b. Using this network, we may then classify new inputs. Figure 6.4 displays the results of verifying the network on novel input. We see that if we occlude 50% of the original pattern, we retrieve the correct result. However, if we occlude 67%, then we will get errors in recognition. If the network is presented with noisy inputs, the network will make an identification that is the same as a human being would have made. Thus, we conclude that the network is sensible in its classification. We have thus been able to represent the difference between the first three digits in a Hopfield network. This is roughly the principle by which optical character recognition is done, even though fancier techniques are used in commercial software to make the system less error prone.
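The column-stacking step can be sketched directly. The 3×4 pattern below is invented for illustration; the book's digit patterns are 6×6, giving 36-bit vectors like x0 above:

```python
def pattern_to_vector(pattern):
    """Flatten a 2-D bit pattern column by column, appending each column
    below the previous one, as described for the digit images."""
    rows, cols = len(pattern), len(pattern[0])
    return [pattern[r][c] for c in range(cols) for r in range(rows)]

# A hypothetical 3x4 bit pattern
pattern = [[0, 1, 1, 0],
           [1, 0, 0, 1],
           [1, 0, 0, 1]]
vec = pattern_to_vector(pattern)
# columns read top-to-bottom: [0,1,1, 1,0,0, 1,0,0, 0,1,1]
```

For Hopfield use, the 0/1 bits would additionally be mapped to −1/+1 to match the activation range.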


Fig. 6.4 Several test patterns for the digit recognizing Hopﬁeld network and their associated outputs. Test set (a) includes 50% occluded inputs. Test set (b) includes 67% occluded inputs. Test set (c) includes noisy inputs.

A very new type of recurrent neural network is the echo state network. This uses the concept of a reservoir, which is essentially just a set of nodes that are connected to each other in some way. The connection between nodes is expressed by a weight matrix W that is initialized in some fashion³. The current state of each node in the reservoir is stored in a vector x(t) that depends on time t. An input signal u(t), which also depends upon time, is given to the network. This is the actual time-series measured in reality that we wish to predict. The input process is governed by an input weight matrix Win that provides the input values to any desired neurons in the reservoir. The output of the reservoir is given as a vector y(t). The system state then evolves over time according to

x(t + 1) = f(W · x(t) + Win · u(t + 1) + Wfb · y(t))

where Wfb is a feedback matrix, which is optional for cases in which we want to include the feedback of the output back into the system, and f(···) is a sigmoidal function, usually tanh(···). The output y(t) is computed from the extended system state z(t) = [x(t); u(t)] using

y(t) = g(Wout · z(t))

where Wout is an output weight matrix and g(···) is a sigmoidal activation function, e.g. tanh(···). The input and feedback matrices are part of the problem specification and must therefore be provided by the user. The internal weight matrix is initialized in some way and then remains untouched. If the matrix W satisfies some complex conditions, then the network has the echo state property, which means that the prediction is eventually independent of the initial condition. This is crucial in that it does not matter at which time we begin modeling. Such networks are theoretically capable of representing any function (with some technical conditions) arbitrarily well if correctly set up. An example of this can be seen in figure 6.5. The original time-series is the very detailed spiky line. The smooth curve on top is an echo state network with many nodes in the reservoir and the thick line that seems to have a slight time delay is an echo state network with a small number of nodes in the reservoir. In this case, the time-series has a financial origin: it is the Euro to US Dollar exchange rate. We see that such a signal can be modeled to sufficient accuracy using an echo state network. Please note again at this point the principal difference between a time-series in which the points are correlated in time and the classification of observations into categories in which the observations are not correlated at all. The correlation in time makes it necessary to use much more sophisticated mathematics.

3 Generally it is initialized randomly, but substantial gains can be had when it is initialized with some structure. At present, it is a black art to determine what structure this should be as it definitely depends upon the problem to be solved.
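The two evolution equations can be sketched as follows. The reservoir size, weight scaling and input signal are all invented for illustration, the optional feedback term Wfb · y(t) is omitted, and the output weights are untrained, so y here is only a structural demonstration:

```python
import math
import random

random.seed(0)

def esn_step(W, Win, x, u_next):
    """One reservoir update x(t+1) = tanh(W · x(t) + Win · u(t+1));
    the optional feedback term Wfb · y(t) is left out of this sketch."""
    n = len(x)
    return [math.tanh(sum(W[j][k] * x[k] for k in range(n)) + Win[j] * u_next)
            for j in range(n)]

def esn_output(Wout, x, u):
    """Readout y(t) = g(Wout · z(t)) on the extended state z = [x; u]."""
    z = x + [u]
    return math.tanh(sum(w * zi for w, zi in zip(Wout, z)))

# A tiny 4-node reservoir with random weights, scaled small so that the
# echo state property plausibly holds (a heuristic, not the full condition)
n = 4
W = [[random.uniform(-0.3, 0.3) for _ in range(n)] for _ in range(n)]
Win = [random.uniform(-0.5, 0.5) for _ in range(n)]
Wout = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]  # untrained, illustrative

x = [0.0] * n
for u in [0.1, 0.2, 0.15, 0.3]:        # a short input signal u(t)
    x = esn_step(W, Win, x, u)
y = esn_output(Wout, x, 0.3)
```

In a real application only Wout would be fitted to the data (typically by linear regression on collected states), which is what makes echo state networks comparatively cheap to train.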


Fig. 6.5 The prediction of the Euro to US Dollar exchange rate over 1.5 years.

6.6 Case Study: Scrap Detection in Injection Molding Manufacturing

Co-Authors: Pablo Cajaraville, Reiner Microtek; Björn Dormann, Klöckner Desma Schuhmaschinen GmbH; Dr. Philipp Imgrund, Fraunhofer Institute IFAM; Maik Köhler, Klöckner Desma Schuhmaschinen GmbH; Lutz Kramer, Fraunhofer Institute IFAM; Oscar Lopez, MIM TECH ALFA, S.L.; Kaline Pagnan Furlan, Fraunhofer Institute IFAM; Pedro Rodriguez, MIM TECH ALFA, S.L.; Dr. Natalie Salk, PolyMIM GmbH; Jörg Volkert, Fraunhofer Institute IFAM

Injection molding is a widely used technology for the mass production of components with complex geometries. Almost all material classes can be processed with this technology. For polymers, the pelletized material is injection molded under elevated temperature into a mold cavity showing the negative structure of the resulting part. The part is cooled down and ejected as the finished component. In the case of metals and ceramics, the so-called metal injection molding (MIM) or ceramic injection molding (CIM) process is applied. Both processes fall under the umbrella term powder injection molding (PIM). In all cases, the material powder is mixed with a binder system composed of polymers and/or waxes. This so-called feedstock is subsequently injection molded analogously to the polymer material described above. The ejected parts, called green parts, still contain the binder material that acted as a flowing agent during the injection molding process. To remove the binder, the components have to be debinded in a solvent or water solution for a certain amount of time. Subsequently, a thermal debinding step is needed to decompose the residual binder acting as the backbone of what is now called the brown part. During the final sintering step, the parts are heated up to approximately 3/4 of the melting point of the material powder. During the sintering process, the material densifies into a solid metallic or ceramic part showing the same material properties as the respective bulk material. Examples of good and damaged polymer and metal parts are illustrated in figure 6.6.

Fig. 6.6 Both images display a good part and a damaged part. On the left, we have a plastic part where we can have various deformations as one damage mechanism. On the right, we have a green metal part that is broken in one place and a ﬁnal metal part showing the end product as it should be. In both examples, the damage is visible but this is often not so.

Often, the damage that occurs to a part during injection can only be seen on the final part. The unnecessarily performed steps of debinding and sintering have then expended significant amounts of electrical energy and have also made the material useless. If we could identify a damaged part during its green stage, we could recycle the material and also save the energy for debinding and sintering. For every part, there is an effort involved in determining whether the part is good or not. Today, this effort is usually made manually, which is expensive. If we could make the identification automatic, then we would save this effort as well. An injection molding machine is controlled by manually inputting a series of values known as set-points. These are the values for various physical quantities that we desire to hold during the injection process. It is the responsibility of the machine to attempt to realize these set-points in actual operation. This attempt is generally successful but there are deviations in the details. In order to monitor the actual values of these various quantities, an injection machine will also have sensors that output these measurements over time, i.e. a time-series. For each part produced, we thus have the set-points and also a variety of time-series over the duration of its injection. This information is available in order to characterize a part.


We will assume that the function that computes the scrap versus good status, the decision function, takes the form of a three-layer perceptron

γ = W3 · tanh(W2 · tanh(W1 · x + b1) + b2) + b3

where the bias vectors bi and the weight matrices Wi must be determined by some training algorithm. In order to make an input vector x, we must extract salient features from the time-series in the form of scalar quantities that allow the characterization of scrap vs. good parts, as we cannot input the whole time-series into the neural network. If we did input the entire time-series, we would force the weight matrices to become extremely large and would thus have many parameters to be found by training. This would require many more parts to be produced, which is unrealistic. We must live with a few hundred training parts and so we must keep the input vector as small as possible. After many trials with various settings, we have found the best performance with the following procedure (see figure 6.7 for an example):
1. Take all observations of good parts over the pilot series. For each time-series, create an averaged time-series over all these good parts.
2. Disregarding local noise in the time-series, compute the turning points of this averaged time-series.
3. For every part encountered, perform the same turning-point analysis.
4. For every part, we now compute the difference between the turning points of the present part relative to the turning points of the averaged series. If we find differences, then these will be taken as salient features.
5. These differences form the salient features and constitute the vector x. For practicality, we limit ourselves to a specific maximum number of turning points allowed.
In order to use the training method, it must be provided with some pairs (x, γ) of input vectors and the quality output. To determine these pairs, we must manually assess a number of injected parts and characterize them as scrap or good.
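The turning-point extraction at the heart of this procedure can be sketched as follows; the tolerance parameter is a hypothetical stand-in for the noise handling mentioned in step 2:

```python
def turning_points(series, tolerance=0.0):
    """Indices where the series switches from rising to falling or back.
    Direction changes smaller than `tolerance` are ignored, a simple
    stand-in for disregarding local noise."""
    points = []
    direction = 0                      # +1 rising, -1 falling, 0 unknown
    for i in range(1, len(series)):
        step = series[i] - series[i - 1]
        if abs(step) <= tolerance:
            continue
        d = 1 if step > 0 else -1
        if direction and d != direction:
            points.append(i - 1)       # previous sample was the turning point
        direction = d
    return points

# Pressure-like curve: rises, peaks, falls, then rises again
tps = turning_points([0, 2, 5, 4, 3, 6, 8], tolerance=0.5)
# turning points at indices 2 (peak) and 4 (trough)
```

The differences between a part's turning points and those of the averaged good curve, in position and height, would then be assembled into the feature vector x.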
Having obtained such training data, the training algorithm produces the decision function. In practice this means that the machine user must inject several parts, record the data x, manually determine whether the final parts are scrap or not, γ, and provide this information to the learning method. We have determined that a good number of observations is 500 or more. Furthermore, it is good to use several settings of the set-points within these 500 parts. When a new part is now injected, its data x is provided to the decision function, which computes whether this part is good or bad, γ, and outputs this value. As a result, a robot can be triggered to remove the scrap parts from further production. We may define four different recognition rates:
1. good rate, ρg – the number of correctly identified good parts,
2. scrap rate, ρs – the number of correctly identified scrap parts,
3. false-negative rate, ρn – the number of good parts identified as scrap,
4. false-positive rate, ρp – the number of scrap parts identified as good,


Fig. 6.7 The bottom curve is the average pressure observed at the nozzle averaged over all parts known to be good. The top curve is a single observation of a part known to be bad. The black arrows pointing up indicate the position of the turning points of the bottom curve and the black arrows pointing down indicate the position of the turning points of the top curve. We observe that the top curve has two extra turning points. We also observe that the vertical position of several turning points is higher than that of the average good curve.

where each count is divided by the total number of produced parts in order to make each item into a genuine rate. Please note that these rates are thus normalized by definition, i.e. ρg + ρs + ρn + ρp = 1. It is generally not possible to design a system that will have perfect recognition efficiency (ρn = ρp = 0). We would rather throw away a good part than let a scrap part through. Thus, the overall objective is to minimize the false-positive rate ρp by designing a decision function that is as accurate as possible. This enumeration is, of course, theoretical as it would require the user to know which parts are really good or bad. This identification would require the manual characterization that we want to avoid using the present methods (except for the training data set, for which it is necessary). Thus, we will never actually know what these rates are except in two cases: the pilot series, where the data is used for training the function, and, possibly, any quality control spot checks, which are usually too infrequent to really allow the computation of a rate. Thus, these rates must be interpreted as a useful guideline for thinking but not as practical numerical quantities, except at training time when we actually know the quality state of all parts. Even if we are able to correctly identify every part as either good or bad, this will not change the amount of scrap actually produced – it will merely change the amount of scrap delivered to the customer. In order to reduce the amount of scrap actually produced, we must interact with the production process actively via the set-points, and this is our next stage. We observe that the recognition rates mentioned above are in fact functions of the set-points. In particular, we want to reduce the scrap rate as we would like to produce as many good parts as possible. Let us combine the ten set-points αi into a single vector,

s = (α1, α2, ···, α10).   (6.1)

Using this, we thus focus on the scrap rate function ρs(s). This function achieves a global minimum ρs∗ at the point s∗,

ρs∗ = min over s of {ρs(s)} = ρs(s∗).   (6.2)
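The minimization in equation (6.2) can be sketched with any derivative-free optimizer; a crude random search is shown below. The quadratic scrap-rate surface over two of the ten set-points is invented for illustration, since the real ρs(s) is only known through experiments:

```python
import random

random.seed(1)

def minimize_scrap_rate(rho_s, bounds, n_samples=2000):
    """Crude random search for s* = argmin ρ_s(s) over a box of set-point
    ranges; any derivative-free optimizer could stand in here."""
    best_s, best_r = None, float("inf")
    for _ in range(n_samples):
        s = [random.uniform(lo, hi) for lo, hi in bounds]
        r = rho_s(s)
        if r < best_r:
            best_s, best_r = s, r
    return best_s, best_r

# Hypothetical smooth scrap-rate surface with a 5% floor at s = (1.0, 2.0)
rho = lambda s: 0.05 + 0.15 * ((s[0] - 1.0) ** 2 + (s[1] - 2.0) ** 2)
s_star, r_star = minimize_scrap_rate(rho, [(0.0, 2.0), (1.0, 3.0)])
# r_star approaches the 5% floor as the sampling density grows
```

In the industrial setting each evaluation of ρs(s) is expensive, so a sample-efficient optimizer would be preferred over the dense sampling used in this sketch.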

The point s∗ is determined using an optimization algorithm and then communicated to the injection molding machine. These methods were tried on the part in the right image of figure 6.6. In total, 500 parts were made with various settings of the set-points, of which approximately 20% were scrap. This high scrap rate results from the various set-point settings, some of which are, of course, not optimal. We find that the recognition efficiency is 98%, in that ρg + ρs = 0.98: we recognize good and bad parts correctly nearly always, with 490 out of 500 parts. We obtain a low false-positive rate of ρp = 0.002, in which scrap parts are recognized as good. Relative to our sample size of 500, this means that a single scrap part was not recognized as such. The false-negative rate, in which good parts are recognized as scrap, was ρn = 0.018, which means 9 parts in total. Recall that the objective of training the network was to minimize the false-positive rate. It is clear that this cannot reliably be zero and so a rate of one part in 500 can be interpreted as a success. The system is certainly more reliable than manual quality control, which is common in the industry. The actual production scrap rate was approximately 20%. This could have been reduced to 5% by adjusting the set-points appropriately. Of course, a real production will not be at the level of 20% scrap and so an improvement by a factor of four seems unlikely. Nevertheless, a significant factor should be possible. Now that we have verified that it is possible to reliably distinguish scrap from good parts based only on process data, and that we can optimize the process based on the same analysis, we ask what the practical significance of this is. There are two major points: quality improvement and energy savings. With respect to quality improvement, there are two aspects. First, the quality of the delivered product.
Even if we produce 20% scrap, 98% of these are correctly recognized to be scrap and so they are not delivered to the client. Looking at the current numbers, then, we produce 500 parts in all. Of these, we have 100 scrap parts. We recognize 98 scrap parts as scrap, 392 good parts as good, 9 good parts as scrap and one scrap part as good. Thus the client receives 392 good parts and one scrap part in the delivery. This is an effective scrap rate – with the client – of 0.3%. The identification was thus able to lower the production scrap rate of 20% to a delivery scrap rate of 0.3%. Second, the production quality will also improve due to the optimization. Since we can lower the production scrap rate from 20% to 5%, we would produce 475 good parts and 25 scrap parts as compared to the above figures. In final consequence, the client would receive 466 parts of which one would be scrap. This lowers the effective delivery scrap rate to 0.2%, and relative to a larger delivery size. With the optimization, the molding production cost per delivered part is lowered by 19%. This is a reduction in production cost that is otherwise unreachable. With respect to energy savings, we also save the energy costs that would have flowed into the steps of debinding and sintering of parts that are later recognized as scrap. It is hard to quantify this in any general manner but we assume that this lowers the total production cost per delivered part by another 4%. We gratefully acknowledge the partial funding of this research by the European Union: Investment in your future – European fund for regional development, the EU programs MANUNET and ERANET, the German ministry for education and research (BMBF) and the Wirtschaftsförderung Bremen (economic development agency of the city state of Bremen, Germany).

6.7 Case Study: Prediction of Turbine Failure

In this study we attempt to predict a known turbine failure using historical data. We refer to section 5.5 for details on turbines and coal power plants. On a particular turbine, a blade tore off and completely damaged the turbine. After the event, the question was raised whether this could have been predicted and localized to a specific location in the turbine.

The specific turbine in question has over 80 measurement points that were considered worthwhile to monitor. Most of these were vibrations, but there were also some temperatures, pressures and electrical values. A history of six months was deemed long enough, and the recording frequency depended on each individual measurement point: some were measured several times per second, others only once every few hours. In fact, the data historian only stores a new value in its database if the new value differs from the last stored value by more than a static threshold. In this way, the history matrix contained a realistic picture of an actual turbine instrumented as it normally is in the industry. No enhancements were made to the turbine, its instrumentation or the data itself. A data dump of six months was made without modification.

The data stopped two days before a known (historically occurring) blade tear on that turbine. During the time leading up to the blade tear and until immediately before it, no sign of it could be detected by any analysis run by the plant engineers, either before or after the blade tear was known. It was therefore concluded that the tear was a spontaneous and thus unpredictable event.

Initially, the machine learning algorithm (the echo state network from section 6.5) was provided with no data. Then the measured points were presented to the algorithm one by one, starting with the first measured point. Slowly, the model learned more and more about the system, and the quality of its predictions improved both in accuracy and in the length of the horizon over which it could predict. Once even the last measured point had been presented to the algorithm, it produced a prediction valid for the following two days of real time. The result may be seen in figure 6.8.
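The historian's storage rule mentioned above, keeping a new value only when it differs from the last stored value by more than a static threshold, is simple enough to sketch. The threshold and the signal below are invented for illustration:

```python
def deadband_store(samples, threshold):
    """Keep a sample only if it differs from the last *stored* value
    by more than the threshold, as a process data historian does."""
    stored = []
    for value in samples:
        if not stored or abs(value - stored[-1]) > threshold:
            stored.append(value)
    return stored

# A slowly drifting signal with two jumps; threshold chosen arbitrarily.
signal = [1.00, 1.01, 1.02, 1.30, 1.31, 1.32, 2.00]
print(deadband_store(signal, threshold=0.05))   # → [1.0, 1.3, 2.0]
```

The small fluctuations are discarded while every significant change survives, which is why such archives remain a realistic, if irregularly sampled, picture of the machine.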

Fig. 6.8 Here we see the actual measurement (spiky curve) versus the model output (smooth line) over a little history (left of the vertical line) and for the future three days (right of the vertical line). We observe a close correspondence between the measurement and the model. In particular, the event (the sharp drop) is correctly predicted two days in advance.
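Readers who wish to experiment with the echo state network idea of section 6.5 can start from a minimal sketch such as the following. The reservoir size, spectral radius and ridge parameter are arbitrary illustrative choices, not those used in this study, and the sine wave stands in for a plant measurement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reservoir: fixed random recurrent weights, scaled to spectral radius 0.9.
n_res = 100
W_in = rng.uniform(-0.5, 0.5, n_res)
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs):
    """Drive the reservoir with a scalar input sequence; collect states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in * u + W @ x)
        states.append(x.copy())
    return np.array(states)

# Teach the linear readout to predict the next value of a toy signal.
series = np.sin(0.2 * np.arange(400))
X = run_reservoir(series[:-1])     # states driven by u(t)
y = series[1:]                     # target: u(t + 1)
ridge = 1e-6                       # only the readout weights are trained
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)

err = np.sqrt(np.mean((X @ W_out - y) ** 2))
print(f"one-step RMS error: {err:.4f}")
```

Only the readout weights are trained; the recurrent weights stay fixed, which is what makes echo state networks cheap enough to retrain as each new measurement point arrives.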

Thus, we can predict accurately that something will take place two days from now, with an accuracy of a few hours. Indeed, it is apparent from the data that it would have been impossible to predict this particular event more than two days ahead of time, because the qualitative change in the system (the failure mode) occurred only a few days before the event. The model must see some qualitative change for some period of time before it is capable of extrapolating a failure; the model therefore has a reaction time. Events that develop quickly are thus predicted relatively close to the deadline, but two days of warning were enough to prevent the major damage in this case. In general, slower failure modes can be predicted longer in advance.

It must be emphasized here that the model can only predict an event, such as the drop of a measurement. It cannot label this event with the words “blade tear.” The identification of an event as a certain type of event is another matter altogether. It is possible via the same sort of methods but would require many examples of blade tears, which is a practical difficulty. Thus, the model is capable of giving a specific time when the turbine will suffer a major defect; the nature of the defect must be discovered by a manual search on the physical turbine.

This is interesting, but to be truly helpful we must be able to locate the damage within the large structure of the turbine, so that maintenance personnel do not spend days looking for the proverbial needle in the haystack. Fault detection and localization is done by an advanced data-mining methodology (singular spectrum analysis from section 4.5.2) that tracks the frequency distributions of signals over their history and can deduce qualitative changes. Of the 80 measurement points, we are able to isolate four that contain a qualitative shift in their history, and two of these four go through such a shift several days before the other two. Thus, we are able to determine which two of the 80 locations in the turbine are the root cause of the event that is to occur in two days. See figure 6.9 for an illustration.

In this figure, we graph the abnormality, as measured by singular spectrum analysis (see section 4.5.2), over time for each measurement. We observe that only four time-series are abnormal at all. Two of them become abnormal early, and the two others follow. When we ask which time-series these are, we find that the first two are the radial and axial vibrations of one bearing and the second two are the same vibrations of the neighboring bearing. We may safely attribute causation to this evolution: the first bearing’s abnormality causes the second bearing’s abnormality, and the first bearing’s abnormality was the first sign of what later led to the event, the blade tear. Indeed, the blade tore off very close to this particular bearing. Thus we are successful in localizing the fault within the large turbine.
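A minimal abnormality score in the spirit of singular spectrum analysis can be sketched as follows: embed each window of the signal in a lagged trajectory matrix, take its singular values, and compare that spectrum against a healthy reference period. The window length, lag and synthetic signals are illustrative assumptions, not the study's actual settings:

```python
import numpy as np

def ssa_spectrum(window, lag=20):
    """Singular values of the lagged trajectory matrix of one window."""
    n = len(window) - lag + 1
    traj = np.array([window[i:i + lag] for i in range(n)])
    return np.linalg.svd(traj, compute_uv=False)

def abnormality(reference, window, lag=20):
    """Distance between the singular spectra of a reference period and a
    current window; a large value signals a qualitative change."""
    a = ssa_spectrum(reference, lag)
    b = ssa_spectrum(window, lag)
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(1)
t = np.arange(400)
# Healthy vibration: fixed frequency and amplitude plus sensor noise.
normal = np.sin(0.3 * t[:200]) + 0.05 * rng.standard_normal(200)
# Later the signal changes character (new frequency and amplitude).
faulty = 1.5 * np.sin(0.8 * t[200:]) + 0.05 * rng.standard_normal(200)

score_normal = abnormality(normal[:100], normal[100:200])
score_faulty = abnormality(normal[:100], faulty[:100])
print(score_normal, score_faulty)   # the faulty window scores much higher
```

The score is insensitive to the phase of the signal but reacts to changes in its frequency content and amplitude, which is exactly the kind of qualitative shift tracked in figure 6.9.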

Fig. 6.9 We compute a deviation from normal, tracked over a window of about four days’ length. We observe that two sensors start behaving abnormally and, two days later, two other sensors behave abnormally. About 3.5 days after the start of the abnormal behavior, this new behavior has itself become normal, and so the deviation from normal decreases again. We therefore observe a qualitative change in the performance of these four points.


The localization possible here is to identify the sensor that measures an abnormal signal and that is the first to show the anomaly that will develop into the event. It is, of course, not possible to compute a physical location on the actual turbine more accurately than the sensor coverage allows. However, a physical search of the turbine after the actual blade tear found that the cause was indeed at the location determined by the data-mining approach.

It is possible to reliably and accurately predict a failure on a steam turbine two days in advance. Furthermore, it is possible to locate the cause of this within the turbine, so that maintenance personnel can focus on the area covered by the sensor that measures the anomaly. The combination of these two results allows preventative maintenance on a turbine to be performed in a real industrial setting, saving the operator a great expense.

6.8 Case Study: Failures of Wind Power Plants

Wind power plants sometimes shut down due to diverse failure mechanisms and must then be maintained. Especially in the offshore sector, but also in the countryside, these maintenance activities are costly due to logistics and delay. Common failures are, for example, due to insufficient lubrication or bearing damage. These can be seen in vibration patterns if the signal is analyzed appropriately. It is possible to model the dynamically evolving mechanisms of aging in mathematical form such that a reliable prediction of a future failure can be computed. For example, we can say that a bearing will fail 59 hours from now because the vibration will then exceed the allowed limits. This information allows a maintenance activity to be planned in advance and thus saves collateral damage and a longer outage.

Wind power plants experience failures that lead to financial losses due to a variety of causes. Please see figure 6.10 for an overview of the causes, figure 6.11 for their effects and figure 6.12 for the implemented maintenance measures. Figure 6.13 shows the mean time between failures, figure 6.14 the failure rate per age and figure 6.15 the shutdown duration and failure frequency. (All statistics used in figures 6.10 to 6.15 were obtained from ISET and IWET.) From these statistics we may conclude the following:

1. At least 62.9% of all failure causes are internal engineering-related failure modes, while the remainder are due to external effects, mostly weather-related.
2. At least 69.5% of all failure consequences lead to less or no power being produced, while the remainder lead to aging in some form.
3. About 82.5% of all maintenance activity is hardware-related, which means that a maintenance crew must travel to the plant in order to fix the problem. This is particularly problematic when the power plant is offshore.
4. On average, a failure will occur once per year for plants with less than 500 kW, twice per year for plants between 500 and 999 kW and 3.5 times per year for plants with more than 1 MW of power output. The more power-producing capacity a plant has, the more often it will fail.
5. The age of a plant does not lead to a significantly higher failure rate.
6. The rarer the failure mode, the longer the resulting shutdown.
7. A failure will, on average, lead to a shutdown lasting about 6 days.

From this evidence, we must conclude that internal causes are responsible for a capacity loss of 1% for plants with less than 500 kW, 2% for plants between 500 and 999 kW and 3.5% for plants with more than 1 MW of power output. In a wind power field like Alpha Ventus in the North Sea, with 60 MW installed and an expected 220 GWh of electricity production per year (i.e. an expectation that the field will operate at 41.8% of capacity), the 3.5% loss indicates a loss of 7.7 GWh. At the German feed-in tariff of 7.6 Euro cents per kWh, this loss is worth 0.6 million Euro per year. Every cause leads to some damage that usually leads to collateral damage as well. Adding the cost of the maintenance measures related to this collateral damage yields a financial damage of well over 1 million Euro per year. The actual original cause exists and cannot be prevented, but if it could be identified in advance, these costs could be saved. This calculation does not take into account worst-case scenarios such as the plant burning up, which effectively requires a new build.
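The loss estimate above is simple arithmetic and can be verified directly:

```python
# Recomputing the Alpha Ventus loss estimate quoted in the text.
installed_kw = 60_000                 # 60 MW installed
expected_gwh = 220.0                  # expected yearly production

# Capacity factor implied by that expectation (close to the quoted 41.8%).
capacity_factor = expected_gwh * 1e6 / (installed_kw * 8760)

loss_gwh = 0.035 * expected_gwh       # 3.5% loss from internal causes
tariff = 0.076                        # Euro per kWh, German feed-in tariff
loss_eur = loss_gwh * 1e6 * tariff

print(f"capacity factor: {capacity_factor:.1%}")
print(f"lost production: {loss_gwh:.1f} GWh")        # 7.7 GWh
print(f"lost revenue: {loss_eur / 1e6:.2f} M EUR")   # about 0.6 million Euro
```

The roughly 0.59 million Euro of lost feed-in revenue is what the text rounds to 0.6 million Euro per year, before collateral damage is added.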

Fig. 6.10 The causes for a wind power plant to fail are illustrated here together with their relative likelihoods.


Fig. 6.11 The effects of the causes of figure 6.10 are presented here together with their relative likelihoods.

Fig. 6.12 The maintenance measures put into place to remedy the effects of figure 6.11, together with their relative likelihoods of being implemented.


Fig. 6.13 The mean time between failures per major failure mode.

Fig. 6.14 The yearly failure rate as a function of wind power plant age. It can be seen that plants with higher output fail more often and that age does not signiﬁcantly inﬂuence the failure rate.


Fig. 6.15 The failure frequency per failure mode and the corresponding duration of the shutdown in days.

A recurrent neural network was applied to a particular wind power plant. From the instrumentation, all values were recorded to a data archive for six months. One value per second was taken and recorded if it differed significantly from the previously recorded value. There were a total of 56 measurements available from around the turbine and generator, but also from subsidiary systems such as the lubrication pump. Using five months of these time-series, a model was created; it agreed with the last month of experimental data to within 0.1%. Thus, we can assume that the model correctly represents the dynamics of the wind power plant.

This system was then allowed to make predictions for the future state of the plant. The prediction, according to the model’s own calculations, was accurate up to one week in advance. Naturally, such predictions assume that the present conditions do not change significantly during this projection. If they do, a new prediction is immediately made. Thus, if for example a storm suddenly arises, the prediction must be adjusted.

One prediction made is shown in figure 6.16, where we can see that a particular vibration on the turbine will exceed the maximum allowed alarm limit 59 ± 5 hours from the present moment. Please note that this prediction means that the failure event will take place somewhere in the time range from 54 to 64 hours from now. A narrower range becomes available as the event comes closer in time. This information is, however, already accurate enough to be actionable: we may schedule a maintenance activity two days from now that will definitely prevent the problem. Planning two days in advance is sufficiently practical that this would solve the problem in practice.
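A statement of the form "the vibration will exceed the alarm limit in 59 ± 5 hours" can, in its simplest form, come from extrapolating a fitted trend to the limit line. The sketch below uses a plain linear fit on synthetic data; the actual study used a recurrent neural network, so this only shows the idea in miniature:

```python
import numpy as np

def hours_to_limit(t, y, limit):
    """Fit a linear trend to a vibration history and extrapolate when it
    will cross the alarm limit, with a crude uncertainty band derived
    from the residual scatter."""
    slope, intercept = np.polyfit(t, y, 1)
    if slope <= 0:
        return None                      # no rising trend, no forecast
    t_cross = (limit - intercept) / slope
    band = np.std(y - (slope * t + intercept)) / slope
    return t_cross - t[-1], band

# Synthetic rising vibration; the trend crosses the limit ~60 h ahead.
rng = np.random.default_rng(2)
t = np.arange(0.0, 100.0)                # hours of history
y = 4.0 + 0.0375 * t + 0.1 * rng.standard_normal(t.size)
eta, band = hours_to_limit(t, y, limit=10.0)
print(f"limit reached in {eta:.0f} +/- {band:.0f} hours")
```

The uncertainty band shrinks as the crossing approaches, mirroring the observation in the text that the predicted range narrows as the event comes closer in time.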


In this case, no maintenance activity was performed, in order to test the system. It was found that the turbine failed, due to this particular vibration exceeding the limit, 62 hours after the moment the prediction was made. This failure led to contact with the casing, which led to a fire that effectively destroyed the plant.

Fig. 6.16 The prediction for one of the wind power plant’s vibration sensors on the turbine, clearly indicating a failure due to excessive vibration. The vertical line at the last fifth of the image is the current moment. The curve to its left is the actual measurement; the curve to its right shows the model’s output together with the model’s confidence bounds.

It would have been impossible to predict this particular event more than 59 hours ahead of time, because the qualitative change in the system (the failure mode) occurred just a few days before the event. The model must see some qualitative change for some period of time before it is capable of extrapolating a failure; the model therefore has a reaction time. Events that develop quickly are thus predicted relatively close to the deadline. In general, slower failure modes can be predicted longer in advance.

6.9 Case Study: Catalytic Reactors in Chemistry and Petrochemistry

Catalytic reactors are devices in chemical plants whose job it is to provide a conducive environment for a certain chemical reaction to take place, see figure 6.17. In a reactor, at least two substances are brought into contact with each other. One is a substance that we would like to change in some chemical way, and the other is the catalyst, i.e. the substance that is supposed to bring this change about. The two substances are mixed and heated to provide the energy for the change. Then we wait, and provide the necessary plumbing for the substances to enter the reactor and for the end product to leave it. Any part that is not converted has to be recycled through the reactor for a second, and possibly more, passes until finally all the original substance has been changed. An example is the breaking down of the long molecular chains of crude oil in the effort to make gasoline.

Fig. 6.17 The basic workings of the catalytic reactor system in a petrochemical reﬁnery.

The catalyst performs its work upon the substance and brings about a change. It thereby uses up its potential to cause this change and thus ages over time. This degradation of the catalyst is the primary problem with operating such a reactor continuously over the long term. The catalyst must therefore be re-activated in some fashion and at some time. We will investigate both major kinds of catalytic reactors: the fluid catalytic converter (FCC) and the granular catalytic reactor (GCR).

In the FCC, the catalyst is a fluid that can be pumped into and out of the reactor. We can therefore create a loop in which the catalyst is pumped into the reactor to perform its function and then out again into a reactivation phase, only to return. This loop runs continuously, and the catalyst can thus be used essentially without limit. However, the speed of the loop must be carefully tuned to the actual aging of the catalyst inside the reactor, so that we put neither too much work (attempting to reactivate catalyst that is still fine) nor too little work (not reactivating enough catalyst, so that eventually we have too little) into the reactivation job.

In the GCR, the catalyst is in the form of granules that are filled into tubes in the reactor. These granules stay inside the tubes until the point is reached where the catalyst is so deactivated that the process is no longer economical. At this point, the reactor must be opened, the old catalyst removed and fresh catalyst loaded. The old granules can then be sent for reactivation. Such an exchange may require approximately four weeks of downtime and is thus a substantial cost for the plant. Also, the new catalyst must be ordered well in advance, so the date of the exchange must be planned beforehand.

Both types of reactor therefore require a prediction of the aging process into the future. We must know weeks in advance whether we will have a critical deactivation. To make the prediction, we have access to several temperatures around the reactor, the inflow and outflow, a gas-chromatographic identification of what is flowing out and a few process pressures. In fact, the age of the catalyst is measured by the pressure difference across the reactor: the higher it gets, the older the catalyst is.

Using the method of recurrent neural networks, we create a model of the GCR using almost four years of data in which the catalyst was exchanged twice, see figure 6.18. The jagged curve running over the whole plot is the measured pressure difference over the reactor. Whenever you see a sharp drop, the catalyst has been exchanged; this happened three times in total in that figure.
The mathematical model draws the smooth curve over the data. On the left of the image we can hardly see a difference, so closely does the model represent the data. At the first vertical dashed line, we have reached the “now” point from which the model predicts without receiving more input data. We see three smooth lines diverging from this time: the middle one is the actual prediction, the other two its uncertainty boundaries. The jagged line then shows what actually occurred. We can see that the model is very accurate indeed.

The brief ups and downs in the real measurement are not in fact due to aging of the catalyst but due to various operational modes and the varying quality of the crude oil injected into the reactor. We are only concerned with the long-term trend, not with short-term fluctuations. At the time of the second vertical dashed line, the catalyst was exchanged. The prediction was accurate up to this time, 416 ± 25 days later. Thus the prediction was accurate more than one year in advance.

Fig. 6.18 The pressure differential (equivalent to the catalyst age) over time. The jagged line is the actual measurement and the middle smooth line the prediction. The first vertical line from the left is the present moment from which the prediction starts. The second vertical line is the time at which we predict a catalyst exchange to be necessary. Using the real measurements, this prediction can be seen to be correct more than one year in advance.

You may ask why the catalyst was exchanged at different ages (or different differential pressures) for the three exchanges we see in the figure. Would it not be better to specify a single boundary value to define “too old”? In this case, our true cut-off criterion is not a certain age but rather the point at which the process becomes uneconomical. As this depends on various scientific but also various economic influences, the actual age at which the catalyst becomes uneconomical changes with fluctuating market prices. These influences (and their uncertainties) must be taken into account.

We now proceed to the FCC in a chemical plant. Here we take a different viewpoint. The FCC is a complex unit that has many set-points. For instance, the rate and manner in which the fluid catalyst is recycled is up to us to control. The set-points that control this are changed by the operating personnel to match the demands of the time. While measurements such as the ambient air temperature are variables that must be measured to be known, the set-points are of a different nature: since the operators modify them in dependence upon market factors, we have some knowledge about them beforehand. Thus, we ask to what extent this prior knowledge helps the model to predict the future state.

We investigate, in figure 6.19, a very simple neural network model (a perceptron, see section 6.4) for the pressure differential in dependence upon all other variables of the FCC, of which there may be several dozen, including several set-points. If we only provide historical information to it, we obtain the solid curve. Compared to the actually measured dotted curve, we can see that this model is too simplistic to capture reality. If we present the future values of the set-points (i.e. the production plan) to the same simple model in addition to the historical data, we obtain the dashed curve. This dashed curve is accurate and achieves our aim.


Fig. 6.19 The pressure differential over time (in hours) on a ﬂuid catalytic converter predicted into the future using two different models. The actual measurement is shown in the dotted line. A prediction without any information about the future is shown in the solid line. A prediction made with the knowledge of some future set-points is shown in the dashed line. It is clear that knowledge of future actions is beneﬁcial.

We can thus see that, in the case of a simple perceptron model, the provision of some limited information about the future significantly helps the model to predict those measurements that we cannot know in advance.

In conclusion, we see that both major kinds of catalytic reactors can be modeled well. We can predict the future behavior of both kinds of reactor and thus provide information about essential future events such as the deactivation of the catalyst: its time in the case of the GCR and its manner in the case of the FCC. From both predictions, we easily derive the ability to plan specific actions to remedy the problem.
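The benefit of knowing the future set-points can be reproduced on a toy system. The sketch below simplifies the perceptron to an ordinary least-squares model and compares a predictor that sees only the past with one that also sees the next set-point of a hypothetical production plan; the process dynamics and all constants are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy process: the pressure differential y has inertia and reacts to the
# set-point u that the operators schedule (the production plan).
n = 500
u = np.repeat(rng.uniform(0.0, 1.0, n // 50), 50)   # step-wise plan
y = np.zeros(n)
for k in range(1, n):
    y[k] = 0.5 * y[k - 1] + u[k] + 0.02 * rng.standard_normal()

def rms_error(X, target):
    """Least-squares linear fit; in-sample RMS prediction error."""
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.sqrt(np.mean((X @ w - target) ** 2)))

ones = np.ones(n - 1)
past_only = np.column_stack([y[:-1], u[:-1], ones])         # history only
with_plan = np.column_stack([y[:-1], u[:-1], u[1:], ones])  # plus future set-point
target = y[1:]

err_past = rms_error(past_only, target)
err_plan = rms_error(with_plan, target)
print(err_past, err_plan)   # the planned set-point lowers the error
```

The past-only model fails exactly at the step changes of the plan, which is where most of its error comes from; the model that knows the next set-point reduces the residual to the noise floor.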

6.10 Case Study: Predicting Vibration Crises in Nuclear Power Plants

Co-Author: Roger Chevalier, EDF SA, R&D Division

So far, we have focused on predicting failure events. Such events are characterized by large, usually fast, changes that result in damage and usually a shutdown of the plant. In this section, we will focus on predicting a more subtle phenomenon. We observe that the vibration measurement on a certain bearing of a steam turbine increases periodically. This increase is alarming but does not represent damage or danger. In our specific example, the turbine has five bearings, and each has a vibration sensor. See figure 6.20 for a plot of a temporary increase in vibration, which we will call a vibration crisis. The exact cause of the problem has not been precisely identified at present, but it always occurs under the same conditions of vacuum pressure and power.

Fig. 6.20 This is the vibration of one bearing over time. The horizontal line is the limit for the vibration crisis, i.e. if the vibration measurement exceeds this limit, we speak of a crisis. It will be our goal to detect such events. Time is measured in units of ten minutes and so the plot is over a period of roughly 35 days.

This study concerns itself with the prediction of future vibration crises, not with determining the mechanism at their source. If one can know, even only hours in advance, that a crisis will happen, this will help operators significantly in preparing for the event. The plant can be regulated into a state more conducive to controlling the impending crisis.

We model the turbine with a recurrent neural network (see section 6.5). The available information includes five displacement vibrations; two metal pad temperatures for each bearing; the steam pressure at several points in the process; the axial position of the turbine shaft; the rotation rate; the active and reactive power produced; and one temperature on the oil circuit. This information is sampled once every ten minutes over a period of about five months in order to generate the model and learn the signs of an impending vibration crisis. The dataset contained several examples of such crises, so that effective learning was possible.

With respect to the raw data from figure 6.20, we see the results of the prediction process in figure 6.21. The raw data are displayed in gray and the model output in black. There is a first period during which we observe the turbine before a prediction is possible. When enough data are available, we enter a second period during which we predict and immediately validate the predictions against the real data. During this second period we see close agreement between model output and measured values. Then we enter the third period, the actual future, during which no more measurement data are available. This is the genuine prediction.

Fig. 6.21 The same data as in figure 6.20, but now in gray and overlaid in black with the model output from time 3100. We note that the vibration crisis from 3100 to 3500 is predicted correctly, but the next vibration crisis at 3700 and the one after it at 4500 are not predicted correctly.

We note from figure 6.21 that we can correctly predict a vibration crisis, in this example, up to about 2.8 days in advance. Beyond this point, the prediction is no longer successful. This pattern roughly repeats in all other examples. Thus, we see that there is a lead time of less than three days before such an event is detectable. In practice, this must be enough to prepare some kind of reaction.

Please note carefully the aim of this study. It is not the aim to correctly represent the vibration measurement over all values and all times; the goal is to accurately compute the times at which the vibration measurement crosses the limit line. When doing modeling, it is essential to keep one’s objectives clear, as modeling is an adjustment of numerical parameters so as to minimize some numeric accuracy criterion. In general, if one’s aim were a representation of the vibration signal as such, one would choose the least-squares method to measure the difference between measurement and model output. In this case, however, we are not trying to do this. Thus, our metric is not the least-squares metric but the deviation between the actual and modeled times at which the vibration signal crosses the limit line, which is a very different goal. Figure 6.21 should be interpreted accordingly.

We note that the strongest correlate with the problem is the outside temperature (cooling water). However, it is not as clean as having a specific trigger temperature: the problem has a compound trigger in which the cooling water temperature plays a leading role, but not the only role.

A prediction of a future vibration crisis is reliable three days in advance. If we attempt to predict further into the future, the uncertainty makes the prediction itself useless. Of course, the closer we get to the interesting time, the more accurate the prediction gets. We have made six such predictions in a double-blind study and have correctly predicted five vibration crises among the six cases. The model is thus quite successful in being able to predict the future occurrence of a crisis.
This is the case even though the specific causal mechanism is still under investigation. The model could be improved if the problem were better understood, so that the data could be more properly prepared.
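The metric described above, the deviation between actual and modeled limit-crossing times, can be made concrete as follows. The signals and the one-unit prediction lag below are invented for illustration:

```python
import numpy as np

def crossing_times(t, y, limit):
    """Times at which the signal crosses the limit line from below."""
    above = y >= limit
    idx = np.flatnonzero(~above[:-1] & above[1:]) + 1
    return t[idx]

def crossing_metric(t, measured, modeled, limit):
    """Mean absolute deviation between matched limit-crossing times;
    this, not least squares, is the quantity that matters here."""
    a = crossing_times(t, measured, limit)
    b = crossing_times(t, modeled, limit)
    m = min(len(a), len(b))
    if m == 0:
        return float("inf")
    return float(np.mean(np.abs(a[:m] - b[:m])))

t = np.arange(0.0, 100.0)
measured = np.sin(0.2 * t)          # a crisis whenever the signal exceeds 0.9
modeled = np.sin(0.2 * (t - 1.0))   # same signal predicted one time unit late
print(crossing_metric(t, measured, modeled, limit=0.9))
```

A model can score well on this metric while fitting the waveform poorly elsewhere, and vice versa, which is why the choice of objective must match the operational question being asked.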


6.11 Case Study: Identifying and Predicting the Failure of Valves

A chemical plant has a particular unit that is meant to combine several chemicals from a variety of input sources and to provide a gaseous output with an as-constant-as-possible composition. This task is handled by an assembly of 40 valves that are controlled by a computer, which opens and closes them according to a well-balanced schedule. If the valves fail to open and close according to schedule, being either too fast or too slow, or if they leak, then the tailgas is not constant and causes problems later in the process. In this study, we are to predict future problems of the assembly and also to identify which of the 40 valves is responsible for the problem.

In the whole process, there are three phases. For each of these, we will compute the probability distribution of deviations between the set point provided by the control computer and the actual response as measured by the instrumentation. Figure 6.22 displays the results for each phase. We observe that one phase has an exponential distribution and the two others do not, as they have secondary or even tertiary local maxima. An exponential distribution is what we would expect to see from normal operation of a controller: deviations are very rare and exponentially decreasing in magnitude, indicating that they are random in origin. Secondary peaks in this distribution are not expected, as they show a structured mechanism and hence some form of damage.

Fig. 6.22 The probability distribution over the three phases of valve operations. The vertical axis is the probability and the horizontal axis the normalized absolute value of the deviation between set point and actual response of valve openings. We observe that one phase appears to be operating correctly (exponential distribution) and two phases incorrectly (non-exponential distributions).


Next, we introduce a measure of abnormality for a valve. The score itself is based on the difference between set point and response (just as in figure 6.22). However, we also demand that there be an associated surge in the non-constancy of the tailgas within a certain response time, so as to track only those abnormal valve openings and closings that were close to an unwanted product surge. In figure 6.23, we graph the abnormality in this sense for each valve across all three phases of operation. The valve numbers are on the horizontal axis, and the absolute value of the difference between set point and measured valve opening and closing is on the vertical axis. The solid, dashed and dotted lines correspond to the three phases in the same manner as in figure 6.22. The gray line is the average of the three phase lines.
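The gating logic of this score, counting a set-point deviation only when a tailgas surge follows within a response window, can be sketched as follows; the window length, deviations and surge times are hypothetical:

```python
import numpy as np

def valve_abnormality(deviation, surge, window=5):
    """Sum a valve's set-point deviations, but count a deviation only if
    a tailgas surge follows within `window` control steps."""
    surge_idx = np.flatnonzero(surge)
    score = 0.0
    for i, d in enumerate(deviation):
        if np.any((surge_idx > i) & (surge_idx <= i + window)):
            score += abs(d)
    return score

# Two hypothetical valves over 20 control steps: valve A deviates just
# before each surge, valve B at unrelated times.
surge = np.zeros(20, dtype=bool)
surge[[6, 15]] = True
dev_a = np.zeros(20); dev_a[[4, 13]] = 0.8
dev_b = np.zeros(20); dev_b[[0, 9]] = 0.8
print(valve_abnormality(dev_a, surge), valve_abnormality(dev_b, surge))
```

Valve A's deviations precede the surges and score fully, while valve B's identical deviations score zero: the gating separates suspicious valves from merely sloppy ones.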

Fig. 6.23 A selection of the 40 valves (horizontal axis) is investigated here in terms of their deviation from the set point (vertical axis) in those cases in which an output pressure surge occurs within a certain time frame. The three black lines correspond to the three phases in ﬁgure 6.22 and the gray line is the average of the three black lines.

Combining these two images, we must look for the dashed and dotted phases in figure 6.23 and select those valves that have a high score there. We then combine this information with some process-specific analysis from the scheduling application, from which we can rule out some valves because they do not operate in the relevant phase and their deviation is thus a spurious observation. After such an analysis, one valve remains, and we attribute the problem to it. As such, the valve is operating in a way that is not ideal, but it is not yet causing a significant problem.

Let us look at the time evolution of this abnormality in figure 6.24. The jagged line is the abnormality at each moment in time. As we are not concerned with the shorter-term ups and downs, we take a 20-day moving average to smooth the curve; this is the thick line on the plot. We observe an upward trend over time with a dip near the end. The time on the horizontal axis is in days, so this evolution occurs over a significant time frame. We now create a prediction of this time-series, which is plotted as the continuation of the thick line on the right of the image.


Please note that the peak observed in figure 6.24 at day 147 corresponds to a failure of a valve. After the valve has been fixed, we see the abnormality decline, which suggests that the maintenance measure has relieved the problem. However, the abnormality does not decline to its former level; this means that we have not fully solved the problem. From the prediction made, we predict that another failure will occur on day 208. This prediction is made on day 175, i.e. 33 days in advance. The uncertainty of this prediction is ±10 days.

Fig. 6.24 Abnormality over time during the relevant phase together with 20-day moving average and prediction into the future. The peak around day 147 is a known failure. On day 175, we predict a second failure to occur on day 208 with an accuracy of ±10 days. This event occurred as predicted.

After waiting for a few weeks, we find that the failure did indeed happen as predicted. The failed valve is the same valve as the one we identified using the above abnormality approach. Thus we conclude that it is possible to predict the future failure of valves and to identify which valve it will be, even if we only have information about the family of valves.

6.12 Case Study: Predicting the Dynamometer Card of a Rod Pump

Co-Authors: Prof. Chaodong Tan, China University of Petroleum; Guisheng Li, Plant No. 5 of Petrochina Dagang Oilfield Company; Yingjun Qu, Plant No. 6 of Petrochina Changqing Oilfield Company; Xuefeng Yan, Beijing Yadan Petroleum Technology Co., Ltd.


A rod pump is a simple device that is used the world over to pump for oil on land, see figures 6.25 and 6.26. Basically, we drill a hole in the ground and cement it appropriately so that a clean vertical cavity results. This cavity is filled with a rod that is moved up and down by a mechanical device called the rod pump. Attached to the bottom of the rod is a plunger, a cylindrical "bottle" used to transport the oil. On the downward stroke, the plunger is allowed to fill with oil, and on the upward stroke this oil is transported to the top, where it is extracted and put into barrels.

Fig. 6.25 A schematic of a rod pump. The motor drives the gearbox, which causes the beam to tilt. This drives the horsehead up and down. This assembly assures that the rotating motion of the motor is converted into a linear up-down movement of the rod. The stufﬁng box contains the oil that is discharged through a valve on the top of the well.

Let us focus our attention on two variables of this assembly: the displacement of the rod as measured from its topmost position and the tension force in the rod. When we graph these two variables against each other, with the displacement on the horizontal and the tension on the vertical axis, we find that, as the system is in cyclic motion, the curve is a closed locus. This is called the dynamometer card of the rod pump; see figure 6.27 (01) for an example of expected operations. To travel once around the locus takes the same amount of time as the rod pump takes to complete one full cycle of downstroke and upstroke. A normal rod pump makes four strokes per minute. It is a remarkable observation that the shape of this locus allows us to diagnose any important problem with the rod pump [64]. In figure 6.27 we display dynamometer card examples for the most common problems. We will go into a little detail on these shapes and their problems because it is an exceptional fact that a complete diagnosis can be made so readily from a single image. This approach should be possible for a variety of other machines once the correct measurements and the correct way of presenting them are found. That is the deeper reason for presenting these cases here; we encourage the reader to seek a similar presentation of faults in other machinery.


Fig. 6.26 A schematic of the well bottom. The rod drives the plunger down into the well, guided by the well casing. The bottom of the plunger has a so-called riding valve to take in the oil through the inlets. The bottom of the well is closed off from the reservoir by a so-called standing valve that opens once the plunger is at the bottom.

The cases are:

01 This is the shape we expect to see on a properly working rod pump. The upper and lower horizontal features are nearly parallel and the diagram is close to the theoretically expected diagram.
02 Another example of good operations.
03 The two horizontal features slope downward, are much closer to each other and are more wave-shaped than in the good case. This is due to excessive vibration of the rod.
04 The lower right-hand corner of the card is missing but the two horizontal features are still horizontal. This indicates that the plunger is not being filled fully but that the pump is working properly.
05 A more severe case of the former kind.
06 Here the pump is still working properly but the oil is very thick.
07 These distinctive jagged features with the lower right corner missing are caused by the presence of sand in the oil. This will cause damage to the rod assembly in the short term.
08 The lower right-hand corner is missing but the horizontal features are no longer horizontal; the bite taken out of the lower right corner has an exponential boundary. This is caused by the reservoir de-gassing and slowing the downward plunge.
09 A more severe case of the former kind.
10 A similar case to the former kind. Here the gas forms an air-lock inside the plunger, preventing the plunger from draining at the top.
11 The bottom horizontal feature is rounded and/or lifted up, making the whole card significantly smaller. This is due to a leaking inlet valve.


12 The opposite feature to the above. Here the top horizontal feature is rounded and/or pressed down, making the whole card smaller. This is due to a leaking outlet valve.
13 This oval feature results from a combination of both the inlet and the outlet valve leaking. Note that this is a fairly flat oval compared to the oval of image (06).
14 The top left-hand corner is missing and the boundary is in the shape of an exponential curve; compare with (08) and (09). This is due to a delay in the closing of the inlet valve.
15 Same as above but for a shorter delay.
16 The right side of the card is pressed down. This happens because of a sudden unloading of the oil at the top: the outlet valve is not opening gradually but suddenly.
17 The characteristic upturned top right-hand corner (as opposed to (08)) indicates a collision of the plunger and the guide ring.
18 The lower left-hand corner is bent backwards and the top right-hand corner is sloped down, in addition to features like (08). This indicates a collision between the plunger and the fixed valve at the bottom of the hole.
19 The thin card with concave loading and unloading dents on the upper left and lower right corners indicates a resistance to the flow of the oil, such as the presence of paraffin wax.
20 A very thin but long card in the middle of the theoretically expected card, with wavy horizontal features, indicates a broken rod.
21 A thin long card with straight horizontal features indicates that the plunger is filling too fast due to a high pressure inside the reservoir. The plunger should be exchanged for a larger one.
22 The card looks normal but is too thin, particularly on the bottom. This is due to tubing leakage.
23 The piston is sticking to the walls of the hole and bending the rod.

It can easily be seen that, given a little experience in the matter, the diagnosis of problems is immediate from the shape.
In fact, it has been shown that the diagnosis can be automated by recognizing the shape with a perceptron neural network [43] (see section 6.4 for a discussion of perceptrons). Our purpose here is to investigate whether we can predict the future shape of the dynamometer card and thus diagnose a situation today that will lead to a problem in the coming days. In order to predict the evolution of the shape over time, we must first be able to characterize the shape numerically. For this we use a two-step process. First, we find a formula-based model for the shape itself that has only a few parameters to be fitted to any particular shape. As we get a new dynamometer card several times per minute, this fitting process happens continuously, thereby inducing a time series on those parameters. It is these parameters that we model using a recurrent neural network. In total, this yields a prediction system for the future shape of a dynamometer card.


Fig. 6.27 The various cases of dynamometer cards, panels (01) through (23). See text for an explanation.


In order to allow modeling, we first normalize the experimental data so that a dynamometer card's displacements and tensions always lie in the interval [−1, 1]. We do this by

\[
t \leftarrow \frac{2t - t_{\min} - t_{\max}}{t_{\max} - t_{\min}}
\qquad \text{and} \qquad
d \leftarrow \frac{2d - d_{\min} - d_{\max}}{d_{\max} - d_{\min}}.
\]

Using this transformation, the shape of the dynamometer card can be described by the parametric equation

\[
\begin{pmatrix} t \\ d \end{pmatrix}
= \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} a\cos^{b} x + c\sin^{d} x \\ e\sin x \end{pmatrix}
+ \begin{pmatrix} f \\ g \end{pmatrix}
\]

where x is the artificial variable of the parametric equation such that x ∈ [−π, π] [83]. The parameters a, b, c, d, e, f and g must be found by fitting. A typical dynamometer card consists of about 144 observations of d and t. Thus, we have enough data to reliably fit the 7 free parameters of the model using the normal least-squares approach. The vector v = [a, b, c, d, e, f, g, t_min, t_max, d_min, d_max] is actually a function of time v(t), and this function is then modeled by a recurrent neural network, see section 6.5. In figure 6.28 we see the evolution of such a prediction. Time is measured in strokes, each of which is about 15 seconds in duration. The dotted line is the experimental data and the solid line is the model. The historical data upon which the model is based is mostly not shown, but images (1) and (2) are still historical data. Images (3), (4) and (5) are the resulting predictions. Based on this prediction, we initiate a maintenance measure that restores the pump to normal operations in image (6). As the model and the experimental measurements agree quite well, we have demonstrated that this approach can indeed predict future problems with dynamometer cards. Note that the problem in image (5) and the moment at which the prediction is made in image (3) are separated by 4000 pump cycles, or about 16.7 hours. This is enough warning time for practical maintenance to react. To fully understand this evolution, we need to look at the corresponding evolution of the model parameters. See the left image in figure 6.29 for those model parameters that changed. From time 15000 onwards we have normal operations, so this level of the parameters is constant. Against this baseline, we observe an increasing deviation from normal operations in the model parameters.
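The fitting step can be illustrated with a small sketch. The full seven-parameter fit is nonlinear; as a simplification we treat the exponents b, d and the rotation angle θ as fixed and known, so that the remaining parameters a, c, e, f, g enter linearly and can be recovered by ordinary least squares. All numbers below are synthetic, not taken from the case study.

```python
# Sketch: fit the card-shape model to one normalized dynamometer card.
# With b, d and theta held fixed, the model is linear in (a, c, e, f, g).
import numpy as np

def design_matrix(x, b, d, theta):
    """Columns map the linear parameters (a, c, e, f, g) to stacked (t, d) data."""
    C = np.cos(x) ** b           # the a * cos^b(x) shape term
    S = np.sin(x) ** d           # the c * sin^d(x) shape term
    Z = np.sin(x)                # the e * sin(x) term
    ct, st = np.cos(theta), np.sin(theta)
    ones, zeros = np.ones_like(x), np.zeros_like(x)
    top = np.column_stack([ct * C, ct * S, -st * Z, ones, zeros])   # t rows
    bot = np.column_stack([st * C, st * S,  ct * Z, zeros, ones])   # d rows
    return np.vstack([top, bot])

# A synthetic "measured" card: 144 points generated from known parameters
x = np.linspace(-np.pi, np.pi, 144)
b_fix, d_fix, theta_fix = 2, 4, 0.1      # treated as known, for illustration
true = np.array([0.8, -0.3, 0.5, 0.05, -0.02])      # a, c, e, f, g
A = design_matrix(x, b_fix, d_fix, theta_fix)
observed = A @ true                      # stacked [t values; d values]

fitted, *_ = np.linalg.lstsq(A, observed, rcond=None)
print("recovered a, c, e, f, g:", np.round(fitted, 3))
```

Note that the exponents were chosen as b = 2 and d = 4 here: with b = d = 2 the identity cos²x + sin²x = 1 would make the columns linearly dependent and the fit ill-posed, a pitfall a real fitting routine must also guard against.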
The right image in ﬁgure 6.29 displays the evolution of the average of the displacement and the width of the displacement. We display these as the experimental data was normalized for the images in ﬁgure 6.28. Here we also observe an abnormal behavior in the beginning.


Fig. 6.28 The modeling of a dynamometer card's evolution in time. The difference between consecutive images is 2000 cycles, i.e. about 8.3 hours. The model was trained on historical data. Images (1) and (2) are historical data providing the model with its initial data. Using this, the model predicts images (3) to (5) and indicates that at image (5) we have a problem requiring attention. The maintenance measure is performed and we observe, in image (6), the re-establishment of normal operations. See figure 6.29 for more details.

In conclusion, we note that a recurrent neural network can reliably predict a future fault of a rod pump system by predicting the future model parameters of a mathematical formulation of the dynamometer card. In this example, the prediction could be made 16.7 hours in advance of the problem.


Fig. 6.29 On the top image, we see the evolution of the model’s parameters over time: e as the solid line, a as the dotted line, θ as the closely dashed line and c as the long dashed line. All other model parameters were constant throughout. On the bottom we see the evolution of displacement average in the solid line and the displacement width (maximum minus minimum) in the dotted line. The period from time 15000 onwards is to be considered normal operations and so we can observe a gradual worsening of operations leading up to the necessary maintenance measure at time 10000.

Chapter 7

Optimization: Simulated Annealing

There are many optimization approaches. Most are exact algorithms that definitely compute the true global optimum for their input. Unfortunately, such methods are usually intended for one very specific problem only. There exists only one general strategy that always finds the global optimum. This method is called enumeration and consists of listing all options and choosing the best one. It is not realistic, as in most problems the number of options is so large that they cannot practically be listed. All other general optimization methods derive from enumeration and attempt either to exclude bad options without examining them or to examine only options that have good potential according to some criteria. Two methods in particular, genetic algorithms and simulated annealing, have been successful in a variety of applications. The later advent of genetic algorithms stole the limelight from simulated annealing for some time [102]. However, direct comparisons between these two approaches suggest that simulated annealing nearly always wins in all three important categories: implementation time, use of computing resources (memory and time) and solution quality [54, 102]. Thus, we shall focus on simulated annealing. Having said this, opinions on these two approaches border on religious fervor. To do some justice to this debate, we will present genetic algorithms in section 7.1 and then describe simulated annealing for the rest of the chapter. It will become clear where the differences lie. It should also be mentioned here that several methods exist that are usually presented under the heading of optimization methods but that will be ignored here, for example conjugate-gradient methods or Newton's method. The reason we ignore them is that they rely on the problem being purely continuous; they cannot deal with some of the variables being discrete.
Industrial problems, however, almost always involve discrete variables, as equipment is turned on and off or may be switched between discrete modes or levels. If you meet a particular problem that can be written in terms of a purely continuous function, then these methods may well be used. However, general optimization methods may be used profitably in this case as well.


7.1 Genetic Algorithms

Genetic algorithms get their name and basic idea from the evolutionary ideas of biology. There is a population of individuals that mate, beget offspring and eventually die. At any one time, there is a population of many individuals, but over time these individuals change. These individuals, and thus the whole population, have certain characteristics that are important to us. In particular, they have a so-called "fitness," a notion derived from Charles Darwin's statement "In the struggle for survival, the fittest win out at the expense of their rivals because they succeed in adapting themselves best to their environment." The fitness is thus identical to the objective function of optimization. The "purpose" of evolution is to breed an individual with the best possible fitness. In nature, changing generations face changing environmental conditions, so it is likely that we will never reach an equilibrium stage at which the truly fittest possible individual can live. In mathematical optimization, however, the environment is the problem instance and so does not change. Thus evolution can reach an equilibrium, and this equilibrium is the proposed optimum. Note here that certain concepts are turned upside down by the method: the ground state of the problem instance (the least likely state) becomes the equilibrium (the most likely state) of the population. The mechanism that achieves this switch is evolution. The fact that the search for something rare is turned into a process of equilibration toward something common is the whole point behind genetic algorithms. The basic features of the genetic approach are thus: the initialization of a population of candidate solutions, the decision of how many such solutions there should be in any one generation, the method for combining several solutions into a new one and the criteria for stopping the search. Note that this approach is a heuristic.
We thus have no assurance of finding the true optimum, and even if we do find it by chance, we do not have a foolproof way of recognizing it for what it is. This is a pity, but we cannot both have our optimum and know it, as it were. This is the price we pay for fast practical execution times. The decision on how many individuals live per generation is a human design decision that is essentially a black art, comparable to choosing the number of hidden neurons in a neural network. The initialization of the first generation is generally done uniformly at random among all possible candidate solutions. The criteria for stopping have given rise to a significant research field, but most applications terminate the search when solution quality no longer improves significantly over many generations, i.e. a convergence criterion. If this seems too haphazard, simply restart the process with a different initial population a few thousand times and take the best outcome. This "restart" method has been shown to provide a small but significant improvement in general and is worth doing for nearly all applications, the only exception being when you are very pressed for time (e.g. real-time applications). The only point that is really complex is defining how solutions mate and beget child solutions. The idea again derives from biological evolution: two parent DNA codes combine to make a new DNA code by selecting genes from either parent and then subjecting the result to some mutation.


Supposing that the two parents are the two solution vectors A and B, then we may construct a new solution C

\[
A = [a_1, a_2, a_3, \ldots, a_n] \qquad (7.1)
\]
\[
B = [b_1, b_2, b_3, \ldots, b_n] \qquad (7.2)
\]
\[
C = [c_1, c_2, c_3, \ldots, c_n] \qquad (7.3)
\]

by putting randomly either c_i = a_i + ε_i or c_i = b_i + ε_i, where ε_i is a small randomly generated mutation term. Choosing an element from either A or B is called the crossover operator and adding a small random element is called the mutation operator. It is the mutation operator that allows the new solution to contain elements that did not yet exist in the initial population, and this is crucial in order to gradually visit all possible points. The method by which we choose elements from A or B can become arbitrarily complex or may be a simple 50-50 random choice. Likewise, the method used to apply the mutation may be complex or simple. In general, it can be said that only simple problems can be solved by simple methods for these two stages of the solution-generation process. For complex problems, we must prefer certain features in the crossover operation and must gradually suppress mutation over the long term so that the search can focus on a precise final result. How exactly this is done would go beyond the scope of this book, since we wish to focus on simulated annealing, but the ideas presented for simulated annealing may be applied in this context as well.
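These two operators can be sketched on a toy problem. The fitness function, population size and mutation-decay rule below are illustrative choices, not prescriptions from the text.

```python
# Sketch of the crossover and mutation operators described above, applied
# to a toy problem: maximize fitness = -sum(x_i^2), whose optimum is the
# zero vector. The mutation scale is gradually suppressed over generations,
# as the text recommends for complex problems.
import random

def crossover_mutate(parent_a, parent_b, mutation_scale):
    """c_i = a_i + eps_i or b_i + eps_i, chosen 50-50 at random per element."""
    return [random.choice(pair) + random.gauss(0.0, mutation_scale)
            for pair in zip(parent_a, parent_b)]

def fitness(x):
    return -sum(v * v for v in x)

random.seed(0)
n, pop_size = 5, 30
population = [[random.uniform(-10, 10) for _ in range(n)]
              for _ in range(pop_size)]

for generation in range(200):
    population.sort(key=fitness, reverse=True)   # best individuals first
    survivors = population[:pop_size // 2]
    scale = 1.0 / (1 + generation)               # suppress mutation over time
    children = [crossover_mutate(random.choice(survivors),
                                 random.choice(survivors), scale)
                for _ in range(pop_size - len(survivors))]
    population = survivors + children

best = max(population, key=fitness)
print("best fitness:", round(fitness(best), 4))
```

The truncation selection used here (keep the top half) is one of many possible selection schemes; roulette-wheel or tournament selection are common alternatives.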

7.2 Elementary Simulated Annealing

When an alloy is made from various metals, these are all heated beyond their melting points, stirred and then allowed to cool according to a carefully structured timetable. If the system is allowed to cool too quickly or is not sufficiently melted initially, local defects arise and the alloy is unusable because it is too brittle or displays other defects. Simulated annealing is an optimization algorithm based on the same principle. It starts from a random microstate. This is modified by changing the values of the variables. The new microstate is compared with the old one. If it is better, we keep it. If it is worse, we keep it with a certain probability that depends on a 'temperature' parameter. As the temperature is lowered, the probability of accepting a change for the worse decreases, so uphill transitions are accepted increasingly rarely. Eventually the solution is so good that improvements are very rare, and accepted changes for the worse are also rare because the temperature is very low; the method then finishes and is said to converge. The exact specification of the temperature values and other parameters is called the cooling schedule. Many different cooling schedules have been proposed in the literature, but these affect only the details and not the overall philosophy of the method.


The initial idea came from physics [95]. The physical process of annealing is one in which a solid is first heated up only to be cooled down slowly in an effort to find its ground state. We are asked to cool the solid so slowly that it remains at thermal equilibrium throughout the entire process. Based on this assumption, statistical physics has been able to calculate the probability distribution of microstates (exact atomic configurations) giving rise to an observed macrostate (the global features of the whole solid). This distribution says that the probability of the solid making a transition from a state of energy E to a state of energy E′ (with E′ > E) is proportional to

\[
\exp\left( -\frac{E' - E}{kT} \right)
\]

where k is a constant and T is the temperature of the solid. As annealing is an actual physical process used in many manufacturing plants, the computerized simulation of this process is known as simulated annealing. In the context of a general substance, the method takes the following form [13]:

Data: A candidate solution S and a cost function C(x).
Result: A solution S that minimizes the cost function C(x).
T ← starting temperature
while not frozen do
    while not at equilibrium do
        S′ ← perturbation of S
        if C(S′) < C(S) or selection criterion then
            S ← S′
        end
    end
    T ← reduced temperature
end

Algorithm 1: General Simulated Annealing

We immediately see some rather enticing features: (1) only two solutions must be kept in memory at any one time, (2) we must only be able to compare them against each other, (3) we allow temporary worsening of the solution, (4) the more time we give the algorithm, the better the solution gets and (5) the method is very general and can be applied to virtually any problem, as it needs to know nothing about the problem as such. The only points at which the problem enters the method are in creating a perturbed solution and in comparing cost function values. Note that the inner loop gives rise to a Markov chain, as each new state depends only upon its predecessor. Also note that this formulation of the method is quite general. Several pieces are missing: (1) a method for assigning an initial temperature, (2) a definition of "frozen," (3) a definition of "equilibrium," (4) a selection criterion and (5) a method to calculate the next temperature. The cost function and perturbation mechanism are given by the problem we are trying to solve. It must be said that there are, in general, various ways in which perturbations could be generated. The solution quality (final cost and computation time) depends on this choice. As this is highly problem-specific, we can only hint at this in section 7.5. Giving definite computable functions for the five elements named above is referred to as a cooling schedule and must be found to turn simulated annealing into an algorithm that can be implemented. As presented above, it may be considered a general computational paradigm but not yet an algorithm. First we give an example in the form of the traveling salesman problem, in which we try to find the optimal journey between 100 cities arranged on a circle, beginning from a randomly generated ordering. We chose a large number as initial temperature haphazardly, told the algorithm to stop once no improvement was seen over two consecutive temperatures, defined equilibrium as 1000 proposed transitions and cooled our system by multiplying the current temperature by 0.9. This extremely simple schedule led to the optimal solution of this problem in 35 temperatures. The cost for the problem is the Euclidean distance between the points on the journey and the moves are simple too: (1) reverse a sub-path and (2) replace two sub-paths between towns A and B and C and D by two paths between A and C and B and D [86]. This is an example in which we know what the optimal solution is (a circular journey) and thus we can be happy with this simple cooling schedule. In general, however, we do not know what the optimal solution is, and so choosing a cooling schedule becomes a problem in its own right because we cannot verify whether the final answer is actually good. In fact, most authors make a rather haphazard choice of functions and parameters for their cooling schedule.
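The circle-TSP experiment can be reproduced in miniature. This is an illustrative sketch, not the book's implementation: we use 20 cities instead of 100 so that it runs in seconds, and only the sub-path reversal move, with the same geometric cooling by 0.9 and the same stop-after-two-stale-temperatures rule.

```python
# Minimal simulated annealing for the circle TSP: geometric cooling by 0.9,
# 1000 proposed moves per temperature as the "equilibrium" criterion, and
# stopping after two temperatures without any improving accepted move.
import math
import random

def tour_length(order, pts):
    """Euclidean length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[order[i]], pts[order[(i + 1) % len(order)]])
               for i in range(len(order)))

random.seed(1)
n = 20
pts = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
       for i in range(n)]
optimal = tour_length(list(range(n)), pts)   # the circle visited in order

order = list(range(n))
random.shuffle(order)
cost = tour_length(order, pts)
T = 10.0        # a haphazardly large starting temperature, as in the text
stale = 0       # temperatures in a row without any improving accepted move
while stale < 2:
    improved = False
    for _ in range(1000):                    # "equilibrium" = 1000 proposals
        i, j = sorted(random.sample(range(n), 2))
        new = order[:i] + order[i:j + 1][::-1] + order[j + 1:]  # reverse sub-path
        new_cost = tour_length(new, pts)
        delta = new_cost - cost
        if delta < 0 or random.random() < math.exp(-delta / T):
            if delta < 0:
                improved = True
            order, cost = new, new_cost
    stale = 0 if improved else stale + 1
    T *= 0.9                                 # geometric cooling

print(f"final tour length {cost:.3f} (optimal {optimal:.3f})")
```

The acceptance test `random.random() < exp(-delta / T)` is the Metropolis selection criterion; the reversal move suffices here because, for cities in convex position, any crossing tour can be improved by uncrossing.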

7.3 Theoretical Results

In the limit of infinitely slow cooling, simulated annealing finds the optimal solution for any problem and any instance [123]. This is the central result on which most authors justify their use of simulated annealing. The question of how slow is slow enough in practice is a complex one that again raises the question of a cooling schedule. It is possible to prove polynomial-time execution of simulated annealing for a large class of cooling schedules [124]. If R is the set of all possible microstates and q_k the stationary probability distribution of the Markov chain (the inner loop of the algorithm), we may define the expected cost ⟨C(T)⟩, the expected square cost ⟨C²(T)⟩, the variance in the cost at equilibrium σ²(T) and the entropy at equilibrium S(T), all at a certain value of the temperature T, by


\[
\langle C(T) \rangle = \sum_{k \in R} C(k)\, q_k(T); \qquad (7.4)
\]
\[
\langle C^2(T) \rangle = \sum_{k \in R} C^2(k)\, q_k(T); \qquad (7.5)
\]
\[
\sigma^2(T) = \langle C^2(T) \rangle - \langle C(T) \rangle^2; \qquad (7.6)
\]
\[
S(T) = -\sum_{k \in R} q_k(T) \ln q_k(T). \qquad (7.7)
\]

Furthermore, the specific heat of the system is given by

\[
h = \frac{\partial \langle C(T) \rangle}{\partial T} = T\, \frac{\partial S(T)}{\partial T} = \frac{\sigma^2(T)}{T^2}. \qquad (7.8)
\]

The expected cost and expected square cost can be shown to be approximated by their averages over the Markov chain by virtue of the central limit theorem and the law of large numbers [3]. These quantities prove helpful not only in analogy to the physics origins of the paradigm but also in providing us with a good cooling schedule. Furthermore, they become important in a discussion of the typical behavior of simulated annealing [124]. In physical annealing, the substance effectively undergoes slow solidification after it has initially been melted at high temperature. It thus undergoes a phase transition. We would expect the total entropy of the system to decrease drastically during this transition and only slowly on either side of it. Physically, such transitions are usually fast but do take a finite amount of time. If we monitor the average energy, the standard deviation of the energy and the specific heat of the alloy during the physical annealing process, we should be able to see the phase transition clearly. Surprisingly, we see the same effects in the evolution of combinatorial problems under simulated annealing. For the traveling salesman problem on 100 cities on a circle, we monitored these quantities over 72 temperatures and plotted them against the logarithm of the temperature, see figure 7.1. We see that there is a clear phase transition extended over a few temperatures and that both cost and standard deviation vary relatively little on either side of it. Subject to the assumption that cost is distributed normally over configuration space, we are able to prove that at large temperatures the standard deviation is constant and the cost inversely proportional to the temperature, while at low temperatures both standard deviation and cost are linearly dependent on temperature [2, 70]. This is borne out by the data we have collected. The specific heat of the instance is roughly constant, with a few exceptions.
Fig. 7.1 We see the normalized average cost (top curve) and normalized standard deviation (lower curve) against the logarithm of the temperature. As temperature decreases over time, this means that time runs from the right of the plot to the left. We clearly observe the phase transition in the middle of the image.

The specific heat of a material body is the amount of heat that must be added to the body, per kilogram, to increase its temperature by one kelvin. Its value differs depending on whether the pressure or the volume of the body is kept constant throughout the heating process, but it is a constant property of the body's material. In the context of combinatorial problems, we could interpret the pressure as the external force of change (i.e. the probability distribution for accepting proposed transitions), which is constant over the Markov chain that is used to compute the specific heat. The volume of the problem could be considered to be the average cost during the Markov chain. Thus we are computing the specific heat at constant pressure. In further analogy to physics, we would expect the specific heat to be constant throughout the execution of simulated annealing. In physics, local maxima in the specific heat indicate local freezing and hence local cluster formation. This is detrimental to finding the ground state of the material and should be avoided. In other words, local maxima in specific heat indicate deviation from equilibrium and thus too rapid cooling. If the specific heat at the end of any particular Markov chain is significantly greater than the specific heat computed in the first Markov chain, we should disregard the last chain and cool the system more slowly. Applying this adaptive philosophy to a decrement formula will force the specific heat to be roughly constant over the run. In our specific example, we see a specific-heat maximum around the onset of the phase transition in figure 7.2. Had we cooled more slowly here, we would in general have obtained a much better result.
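The chain-level quantities used in this discussion, the average cost, its variance and the specific heat σ²(T)/T², can be estimated directly from the costs visited along one Markov chain. A sketch, with synthetic cost samples standing in for a real chain:

```python
# Sketch: estimating the equilibrium quantities of this section from the
# costs sampled along one Markov chain (one inner loop at temperature T).
# The averages over the chain approximate the expectations over q_k(T);
# the cost samples here are synthetic stand-ins for a real chain.
import random

def chain_statistics(costs, T):
    """Return (mean cost, variance, specific heat sigma^2 / T^2)."""
    n = len(costs)
    mean = sum(costs) / n
    mean_sq = sum(c * c for c in costs) / n
    variance = mean_sq - mean * mean     # sample version of eq. (7.6)
    specific_heat = variance / (T * T)   # eq. (7.8)
    return mean, variance, specific_heat

random.seed(2)
T = 2.0
# Stand-in for the costs visited along a Markov chain at temperature T
costs = [10.0 + random.gauss(0.0, T) for _ in range(5000)]
mean, var, h = chain_statistics(costs, T)
print(f"mean={mean:.2f} variance={var:.2f} specific heat={h:.3f}")
```

In an adaptive schedule, the specific heat computed this way at the end of each chain would be compared against the value from the first chain to decide whether to slow the cooling.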


Fig. 7.2 We observe the speciﬁc heat as the grey curve and the acceptance probability of the suggested transitions as the black curve against the logarithm of temperature. As in ﬁgure 7.1, time therefore runs from the right to the left of the image. We again clearly observe the phase transition in the speciﬁc heat curve as the onset of local freezing. As expected, the acceptance ratio of suggestions declines exponentially and, upon becoming too low for further progress, leads to the end of the optimization run.

7.4 Cooling Schedule and Parameters

A host of experimental data from a variety of combinatorial problems shows that the performance, in both quality of final solution and execution time, is highly dependent upon the cooling schedule [124]. We will spend some time reviewing different possibilities for such a schedule and, as we shall see, the parameters that the parts of the cooling schedule require us to choose are no less significant for the performance of the algorithm. Generally, the quality of the final solution of simulated annealing can compete favorably with the very best tailored algorithms for specific combinatorial problems, albeit at the cost of additional execution time [124]. This observation has been made in many papers, but all of them have used very simple cooling schedules and have not tuned the parameters of these schedules well. This leaves us to speculate that one may expect additional quality and time performance from simulated annealing after appropriate tuning. If this were consistently true, the method might beat a number of tailored algorithms in quality. There are parallel implementations of simulated annealing that speed up the execution considerably. These are, however, more complex and deviate somewhat from the original physical and intuitive ideas. They are also harder to implement, so we refer to the literature, e.g. [100] and references therein. We begin the discussion of cooling schedules with the cautionary remark that an experimental comparison of the major cooling schedules (with tuned parameters) has never been done, so their advantages and disadvantages remain a matter of theory for now.

7.4.1 Initial Temperature

The starting temperature has to be chosen such that the system can make highly uneconomical transitions in the beginning and later settle down to temperatures at which very few such transitions are possible. Thus, in combination with the temperature decrement formula and the selection criterion, this temperature should be chosen high enough but not too high. There are two major directions in which investigators have chosen to go.

The first is to say that the initial temperature T0 should be such that a certain percentage χ0 of uneconomical transitions is accepted [112, 79]. We start by assuming a Gaussian distribution of cost fluctuations because this is our generic selection criterion and thus set the probability of acceptance that we want (χ0) equal to the probability at the initial temperature [57],

    χ0 = exp(−⟨ΔC(+)⟩/T0)   →   T0 = ⟨ΔC(+)⟩ / ln(χ0⁻¹),

where ⟨ΔC(+)⟩ is the average of all cost function increases observed. In order to use this formula, we thus perform one Markov chain in order to compute ⟨ΔC(+)⟩ and then compute T0 by choosing some χ0. Estimates of this kind are used in many papers [84, 85, 119, 66, 97].

While this is a relatively simple and utilitarian way of finding a good starting temperature, most application-oriented papers do not even go this far but merely fix T0 to a number that seems to give good results empirically. Once this number is found using a few test instances, it is fixed for all subsequent instances and thus becomes part of the problem specification. Clearly this is not a good strategy. In the best case, there will be many instances for which the computation time taken will be larger than needed, but it is likely that in many cases the final solution found will be inferior to the one that could have been found with a different initial temperature. We thus advise adaptive tuning of the initial temperature according to some model.
A more sophisticated approach is based on the assumption that the number of solutions corresponding to a particular cost C, the configuration density, is normally distributed with a mean of ⟨C⟩ and standard deviation of σ∞. These parameters are empirically determined during a Markov chain. We may then compute the expectation value of the cost as a function of the temperature, C(T) ≈ ⟨C⟩ − σ∞²/T, which is a local Taylor expansion where ⟨C⟩ is an average taken at the current temperature, i.e. over the current Markov chain. The variance σ∞² = ⟨C(T)²⟩ − ⟨C(T)⟩² is estimated by the corresponding averages over the Markov chain. Finally, we agree that we would like the initial expectation of the cost to be within x standard deviations of the average cost. Together with the empirical estimators for the expectation values, we obtain [128]

    T0 > x √(⟨C²⟩ − ⟨C⟩²).

We learn in statistics that 68% of all cases lie within one, 95% within two and 99.7% within three standard deviations of the mean for a normal distribution. Choosing the number of standard deviations x thus again comes down to choosing a χ0. From the cumulative Gaussian probability distribution, the number of standard deviations and the initial acceptance probability are related by

    χ0 = ∫ from −xσ∞ to xσ∞ of (1/(√(2π) σ∞)) exp(−(C − ⟨C⟩)²/(2σ∞²)) dC.

This is analogous to a practical method in which we start the metal off at room temperature and heat it up gradually until we believe it is hot enough to be well mixed (some distance past its melting point, for instance), and then we begin the annealing process in earnest. It is the concept of a melting point that we have essentially attempted to describe in this section and that accurately captures what we need the initial temperature to be.
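The acceptance-ratio formula T0 = ⟨ΔC(+)⟩/ln(χ0⁻¹) can be sketched in a few lines of Python. This is a minimal illustration of ours, not code from the book; the function names, the exploratory accept-everything chain and the toy cost function are all assumptions.

```python
import math
import random

def estimate_initial_temperature(cost, perturb, s0, chi0=0.8, n_samples=200):
    """Estimate T0 such that a fraction chi0 of cost-increasing moves is
    accepted: T0 = <dC+> / ln(1/chi0), with <dC+> the average cost increase
    observed over one exploratory Markov chain."""
    s = s0
    increases = []
    for _ in range(n_samples):
        s_new = perturb(s)
        d = cost(s_new) - cost(s)
        if d > 0:
            increases.append(d)
        s = s_new  # accept every move during the exploratory chain
    avg_increase = sum(increases) / max(len(increases), 1)
    return avg_increase / math.log(1.0 / chi0)

# Hypothetical toy problem: minimize the sum of squares of a vector.
random.seed(0)
cost = lambda x: sum(v * v for v in x)
perturb = lambda x: [v + random.uniform(-0.5, 0.5) for v in x]
T0 = estimate_initial_temperature(cost, perturb, [1.0] * 5)
print(T0 > 0)  # True
```

A larger χ0 (closer to 1) yields a hotter start; χ0 near 0.8 is a common choice in the papers cited above.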

7.4.2 Stopping Criterion (Definition of Freezing)

The analogy between the melting point of a metal to be annealed and the starting temperature of a combinatorial problem to be simulated annealed was drawn in the previous section. It can be continued between the freezing point of a metal and the stopping criterion of the simulated annealing process. Physically speaking, the freezing temperature and the melting point are the same (this temperature marks the phase transition between the solid and liquid phases), but the transition is not instantaneous. In the context of optimization, the starting temperature is higher and the final temperature lower. Thus, a real phase transition does not occur at one instant but over a period of time. The entropy does not have an actual discontinuity (as the theory would have us believe) but rather a sudden and drastic change over a small but finite time frame.

The simplest proposal for the final temperature is to fix the number of different temperatures, so that the final temperature depends upon the temperature reduction formula. The actual number varies between six [113] and fifty [47] in the literature. The analogue of waiting until no more consequential transitions are made is to wait until the optimal configurations found after a number of Markov chains (typically the last four) are identical [79, 97, 115]. We may further require that the probability of accepting a random transition is smaller than some fixed value χf, by analogy to the treatment of the starting temperature [57].
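The last-few-chains-identical criterion is simple enough to state as code. The sketch below is our own illustration (the window of four follows [79, 97, 115]; the tolerance parameter is an assumption of ours to allow for floating-point costs).

```python
def is_frozen(best_costs_per_chain, window=4, tol=0.0):
    """Declare freezing when the best costs found by the last `window`
    Markov chains are identical (within tol)."""
    if len(best_costs_per_chain) < window:
        return False
    recent = best_costs_per_chain[-window:]
    return max(recent) - min(recent) <= tol

print(is_frozen([10.0, 9.0, 8.5, 8.5, 8.5, 8.5]))  # True
print(is_frozen([10.0, 9.0, 8.5, 8.4]))            # False
```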


A number of more sophisticated proposals are made in the literature. Suppose that we are at a local minimum C0 in the cost function and the lowest cost value of any configuration in the neighborhood of the current configuration (i.e. reachable by a single transition) is C1. Then we would like the probability of transiting from the local minimum to this point to be lower than 1/R, where R is the size of the neighborhood. This condition (assuming that cost is normally distributed, as before) gives [128]

    Tf ≤ (C1 − C0) / ln R.

The choice of 1/R as the cut-off probability seems reasonable, but we may nevertheless consider this a general statement under the Gaussian assumption and input any desired probability.

Alternatively, we may require that the probability of the last configuration reached in a Markov chain being more than ε (in cost) over the true minimum of the cost function is less than some real number θ. This may be used to derive the condition [89]

    Tf ≤ ε / (ln(|R| − 1) − ln θ),

where R is now the set of all possible configurations. This obviously suffers from having to choose an ε and a θ (similar to the χf above) and from having to calculate the size of the configuration space.

Finally, consider the difference between the average cost ⟨C(Tk)⟩ during the kth Markov chain and the true optimum. This may be expanded as a first-order Taylor series when Tk is small. This difference (calculated by the Taylor series) relative to the average cost in the first Markov chain is desired to be lower than some fixed ε divided by the terminal temperature Tf. Finally this gives [4, 101]

    Tf > ε (⟨C²(Tf)⟩ − ⟨C(Tf)⟩²) / (⟨C(T0)⟩ − ⟨C(Tf)⟩).

7.4.3 Markov Chain Length (Definition of Equilibrium)

Continuing our physical analogy, we are to anneal our substance at equilibrium. Thus we are only allowed to lower the temperature when the substance has reached thermal equilibrium at the current temperature. We need a firm definition of this concept in combinatorial terms in order to terminate our Markov chain. A strict mathematical definition of equilibrium is practically impossible, as it would entail computing the probability distribution over configuration space, which would correspond to the simplest of all optimization algorithms (check all possibilities and choose the best one).

Let the length of the kth chain be Lk. The simplest practical way is to give a definite fixed length to every chain so that Lk is independent of k and depends only


upon the problem size, i.e. it is some polynomial-time computable function of |R|, the size of the configuration space. Many authors simply choose some number, e.g. Lk = 100 [47, 46, 90, 51, 115]. Alternatively, we use this function only as a ceiling for the length and possibly terminate the chain earlier, such that the number of accepted transitions is at least some ηmin (this requires a ceiling because at low temperatures the acceptance ratio is lower) [79, 57, 84, 85, 97]. On the other hand, we may want to require that the refused transitions number at least a certain amount [113]. This seems counter-intuitive, as it leads to shorter chains at low temperatures and thus a speedup of cooling, whereas one would assume that achieving equilibrium takes longer at low temperatures.

More physically, consider breaking the chain into fixed-length (in terms of accepted transitions) segments. The cost of a segment is the cost of its last configuration. When the cost of the current segment is within a cut-off of that of the preceding segment, we terminate the chain [119, 58, 52]. This is more intuitive because it is related to the fluctuations in cost over the chain. We terminate when the fluctuations settle down; a definition of equilibrium that an experimental physicist might agree with.

Statistically speaking, we would wish for a sufficiently large probability to make an uneconomical transition (possibly out of a local minimum) to be maintained throughout the chain. A reasonable estimate, based on Markov chains, for a scale length of a specific chain is

    N ≈ 1 / exp(−(Cmax − Cmin)/T) = exp((Cmax − Cmin)/T),

where Cmax and Cmin are the largest and smallest costs observed so far (including previous chains). This is further corroborated by the fact that N plays a similar role in stochastic dynamical systems as the time constant plays in linear dynamical systems; it is thus really a length scale [59, 108]. Taking the actual length of the chain to be a few N should thus be enough to get to equilibrium; exactly how many, however, remains to be decided by the user.

Finally, it is possible to show that the number of accepted transitions within an interval ±δ about the average cost ⟨C⟩ reaches a stationary value at equilibrium,

    κ = erf(δ/σ(T)) ≈ erf(δ / √(⟨C²⟩ − ⟨C⟩²)),

where erf(···) denotes the error function, and we may take this as our definition (practical care has to be taken to avoid extremely long chains at low temperatures) [94].
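The minimum-accepted-transitions rule with a ceiling translates directly into code. The sketch below is ours, not from the book; the values ηmin = 60 and Lmax = 1000, the acceptance function and the toy cost function are all illustrative assumptions.

```python
import math
import random

def run_chain(cost, perturb, accept, s, T, eta_min=60, L_max=1000):
    """One Markov chain at temperature T: terminate once eta_min transitions
    have been accepted, with L_max as a ceiling for low temperatures."""
    accepted = 0
    for _ in range(L_max):
        s_new = perturb(s)
        if accept(cost(s_new) - cost(s), T):
            s = s_new
            accepted += 1
            if accepted >= eta_min:
                break
    return s, accepted

# Hypothetical toy problem: one-dimensional quadratic cost.
random.seed(1)
accept = lambda dC, T: dC < 0 or random.random() < math.exp(-dC / T)
cost = lambda x: (x - 3.0) ** 2
perturb = lambda x: x + random.uniform(-1, 1)
s, n_acc = run_chain(cost, perturb, accept, 0.0, T=5.0)
print(n_acc <= 60)  # True
```

At high temperatures the chain ends after roughly ηmin proposals; near freezing it runs up against the Lmax ceiling, exactly the behavior the text describes.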


7.4.4 Decrement Formula for Temperature (Cooling Speed)

After each Markov chain the temperature is decreased, in analogy to the physical annealing process in which a metal is cooled at equilibrium. The new temperature Tk+1 is calculated from the old temperature Tk very simply by keeping either their ratio [79, 57, 47, 46, 90, 51, 84, 85, 97, 115] or their difference [119, 66] a global constant. The ratio is usually taken between 0.9 and 0.99, but 0.5 has also been used. If the difference is used, it is determined by fixing the number of different temperatures, the initial temperature and the final temperature. The ratio rule is used very frequently in applications, to the virtual exclusion of other rules.

It is the decrement rule that has the most impact upon the quality and efficiency of the algorithm among the five rules that we must prescribe [124]. We would like to decrease the temperature slowly enough that the subsequent chains do not have to be too long in order to reestablish equilibrium, but not so slowly that the algorithm takes too much time to freeze. We begin with the reasonable statement that the stationary probability distributions of two successive temperatures should be close, i.e. their ratio larger than 1/(1 + δ) and smaller than 1 + δ for some (small) real number δ. Depending on subsequent assumptions, we may derive the following rules,

    Tk+1 = Tk [1 + Tk ln(1 + δ)/(3σ(Tk))]⁻¹,                    (7.9) see [4, 1]
    Tk+1 = Tk [1 − Tk ln(1 + δ)/(3σ(Tk))],                      (7.10) see [82]
    Tk+1 = Tk [1 + γTk/U]⁻¹,                                    (7.11) see [89]
    Tk+1 = Tk − Tk² ln(1 + δ)/(Cmax + Tk ln(1 + δ)),            (7.12) see [101]

where γ is some small real number and U is an upper bound on the difference in cost between the current point and the optimum. Alternatively, we can begin from the statistical mechanics formula for specific heat and approximate, as we have done so far, the expected cost by the average cost. This leads to

    Tk+1 = Tk exp(−λTk/σ(Tk)),

where λ is the number of standard deviations by which the average costs of different chains are allowed to differ; we require λ ≤ 1 [94].


7.4.5 Selection Criterion

Generally, the Maxwell-Boltzmann distribution is assumed to be a reasonable criterion for accepting or rejecting a proposed transition for the worse, by analogy to statistical physics, and so the probabilistic selection criterion is relative to the function exp[−ΔC/T]. The discussion of whether a different condition should be chosen is based on the observation that transitions of high cost difference can help to get the system out of local minima, and these are accepted rather less often at low temperatures. Furthermore, at large values of the temperature virtually all transitions are accepted without bias. One may wish to bias the selection of transitions such that large transitions are more likely at lower temperatures, to help approach the optimum faster. There are many possible choices, but they all center on attempting to force faster convergence, not lower final cost. It is our experience that with present hardware it is not necessary to speed up the algorithm (at the cost of a possibly worse solution) for almost all practical problems.

We mention one simple way to tune the selection process, namely the reintroduction of Boltzmann's constant into the Maxwell-Boltzmann distribution, i.e. changing the function to exp[−ΔC/kT] where k is a constant. In physics, this constant takes one universal and constant value; it can thus be thought of as a conversion factor from Kelvins to Joules (units of temperature to units of energy). In combinatorics, this factor would convert units of temperature (whatever that may mean here) to units of cost. In our discussion of initial temperatures and decrement formulas, however, the unit of temperature was the same as the unit of cost, and so the constant, in this context, does not need to convert units. Specifically, there is some evidence that k = 2 may lead to slightly faster convergence to equilibrium [68].
A very interesting approach uses the so-called Tsallis statistics to attempt to speed up annealing's convergence without a loss of quality. This is very promising but beyond the scope of this book; see [80] for a start.
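The selection criterion with a tunable constant k can be sketched as follows. This is our own minimal illustration; the injectable random-number source is an assumption made for testability.

```python
import math
import random

def metropolis_accept(delta_cost, T, k=1.0, rng=random.random):
    """Maxwell-Boltzmann selection criterion exp(-dC / kT); k = 1 is the
    usual choice, while k = 2 has been reported to equilibrate slightly
    faster [68]. Downhill moves (dC <= 0) are always accepted."""
    if delta_cost <= 0:
        return True
    return rng() < math.exp(-delta_cost / (k * T))

random.seed(2)
print(metropolis_accept(-1.0, T=1.0))  # True: downhill moves always accepted
```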

7.4.6 Parameter Choice

We have seen that we must choose five elements (initial and final temperatures, chain length, decrement formula and selection criterion) to turn the paradigm of simulated annealing into an algorithm, in addition to formulating our problem in terms of transitions. This choice is far from obvious. In addition, almost all of these elements depend on some parameters that we must also choose. We have some theoretical and practical guidelines and intuition as to which rules to choose, but the parameters generally escape precise quantification by intuition. We are thus led to the question: Do slight variations in the parameters make measurable differences in the performance (quality and speed) of the simulated annealing algorithm applied to a particular problem? The experimental answer is definitely affirmative. Thus we have to make intelligent choices.


It is unfortunate that almost all practitioners of the simulated annealing paradigm do not put much effort into finding the optimal parameters. From the literature it seems that the vast majority follows the following method. A few test instances of the problem are generated. Some of these have known optimal solutions and some do not. The parameters are adjusted manually such that the known cases are solved to optimality and the unknown cases are solved to a final cost that seems reasonable in the light of the known cases. This manual adjustment means in practice that rather few different parameter sets are tried and the first one that looks good in the above way is kept. Furthermore, the parameters are then kept fixed for all future cases to be solved and are hardly ever (except perhaps in the case of the initial temperature) varied. However, an optimal parameter set can improve the average solution quality appreciably over a manually chosen one.

Another method used more recently is to try a few values for each parameter and then use linear regression to obtain some optimal interpolated parameter set [63]. This is essentially a manual adjustment as well, as there is no good method to choose the few sets on which the regression is based.

Alternatively, we can regard simulated annealing as a function of its parameters that returns the relative cost reduction α = (Cinitial − Cfinal)/Cinitial of the problem instance. This is not quite good enough because α has a probability distribution that is largely unknown. However, if we generate a large number of similar instances of the same size and compute the average cost reduction ᾱ by running simulated annealing with identical parameters for each one, then this should return the expectation value of the relative cost reduction. This is a good measurement of the efficacy of the method, and we shall take it as our figure-of-merit function to find the optimal parameters.
In short, we have a multidimensional function minimization problem (maximizing ᾱ is the same as minimizing 1/ᾱ). The simplest version of simulated annealing sets five constants A, B, C, D and E to some initial values and looks like this:

    Data: A candidate solution S and a cost function C(x).
    Result: A solution S that minimizes the cost function C(x).
    T ← A
    while T > B do
        for i = 1 to C do
            S′ ← perturbation of S
            if C(S′) < C(S) or Random < exp[−(C(S′) − C(S))/DT] then
                S ← S′
            end
        end
        T ← E·T
    end

Algorithm 2: Simple Simulated Annealing

In words, we start with a constant temperature A and define a constant temperature B to be the freezing point. Equilibration is assumed to occur after or within C steps of the proposal-acceptance loop, where the selection criterion is the thermodynamic Maxwell-Boltzmann distribution with Boltzmann's constant D, after which


the temperature is decremented by a constant factor E. The standard choices for these constants are A = C(S), B is 100 times smaller than the best lower bound on the cost, C = 1000, D = 1 and E = 0.9. After successful implementation of this algorithm, one usually plays with these parameters until the program behaves satisfactorily. It is clear that implementing this method is very fast, and we observe from the literature that the vast majority of applications are computed using the version of simulated annealing given in algorithm 2, where the five parameters are determined manually [54].

Using this interpretation, we may regard simulated annealing as defining a function ᾱ = ᾱ(A, B, C, D, E) depending on five parameters (for the simple schedule). We would like this average reduction ratio to be as large as possible. This is yet another optimization problem, now with a function instead of a combinatorial problem. We are able to evaluate the function only at considerable computational cost (N runs of simulated annealing for N randomly generated initial configurations) and we do not know its derivative accurately. Even approximating the derivative comes only at heavy computational cost.

There are many optimization methods, such as Newton's method or, more generally, the family of methods known under the names quasi-Newton or Newton-Kantorovich methods, that rely on computing the derivative of the objective function. Some of them have high computational complexity due to the computation of the Hessian matrix, but complexity considerations are secondary here. The most important argument against all these methods is that the derivative computation is not very accurate for the function constructed here, and this loss of accuracy in an iterative method would yield meaningless answers. Indeed, such methods were tried and the results found to be unpredictable because of error accumulation, and much worse than the results obtained by methods not requiring the computation of derivatives.
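Algorithm 2 can be sketched as a short Python function. This is our own rendering, not code from the book; the one-dimensional test function and its parameter values are illustrative assumptions.

```python
import math
import random

def simulated_annealing(cost, perturb, s, A, B, C=1000, D=1.0, E=0.9):
    """Simple simulated annealing (Algorithm 2): start temperature A,
    freezing temperature B, chain length C, Boltzmann constant D and
    geometric decrement factor E."""
    T = A
    while T > B:
        for _ in range(C):
            s_new = perturb(s)
            dC = cost(s_new) - cost(s)
            if dC < 0 or random.random() < math.exp(-dC / (D * T)):
                s = s_new
        T = E * T
    return s

# Hypothetical instance: a 1-d quartic with two minima near x = +1 and x = -1;
# the tilt 0.2x makes the minimum near x = -1 the global one.
random.seed(3)
f = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
move = lambda x: x + random.uniform(-0.3, 0.3)
x_best = simulated_annealing(f, move, 2.0, A=f(2.0), B=1e-4)
print(f(x_best) < 1.0)  # True: the run ends well inside one of the two basins
```

The standard choices A = C(S), C = 1000, D = 1, E = 0.9 from the text are the defaults here.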
The method of choice for optimizing a function over several dimensions without computing its derivative is the downhill simplex method (alternatively, one may use direction set methods). Thus, we use the downhill simplex method to minimize 1/ᾱ(A, B, C, D, E). The starting point for the simplex method is given by those values of the five parameters that we obtain after some manual experiments. This is done because most practitioners of the simulated annealing paradigm choose their parameters based on manual experiments [54]. The other points of the simplex are set by manually estimating the length scale for each parameter [127].

We find, after extensive computational trials on a variety of test problems, that the average improvement in the reduction ratio after annealing has been parametrized by the downhill simplex method, as opposed to human tuning, is 17.6%. We believe this is significant enough to seriously recommend it in practice. Note that this is an average, and so there are cases where the improvement is small and cases where it is large. It seems impossible to tell a priori what the result will be.

Many simulated annealing papers have been published that center on the performance of the algorithm in terms of getting to an acceptable minimum quickly [102]. A variety of cooling schedules have been designed that can reduce the computation time at the expense of solution quality. While the author experimented with a number of open-source implementations of simulated annealing for a variety


of optimization problems with tools such as a profiler, speed-ups of up to three orders of magnitude were achieved. This is in contrast to claimed speed-up factors of between 1.2 and 2.0 that come from changing the cooling schedule at the expense of solution quality [102]. Thus the author believes the speed of the simulated annealing method to be so dominated by programming care that he has not attempted to simultaneously optimize solution quality and execution speed. This simultaneous optimization would, however, be no problem in principle once one has made the essentially arbitrary decision of how important speed is relative to quality.

Thus we may draw a number of conclusions that would appear to hold in general: (1) the solution quality obtained using simulated annealing depends strongly on the numerical values of the parameters of the cooling schedule, (2) the downhill simplex method is effective in locating the optimal parameter values for a specific input size, (3) the parameters depend strongly on input size and should therefore not be global constants for all instances of an optimization problem, and (4) the improvement in solution quality can be significant for theoretical and practical problems (up to 26.1% improvement was measured in these experiments, which is large enough to have significant industrial impact).

Furthermore, the reason that the usual manual search is so much worse than an automated search seems to be that the solution quality (as measured by the average reduction ratio) depends strongly on the cooling schedule parameters, i.e. the landscape is a complex mountain range with narrow valleys that are hard to find manually. Finally, the improved schedule parameters in general lead to slightly greater execution time, but in view of the dramatic improvement in quality (as well as the fact that execution time seems to be dominated by programming care) this is well worth it.
However, computation times are generally so low nowadays with powerful computers that increasing the speed of annealing at the expense of quality is a non-issue. Rather, we would expend additional computation time for the benefit of additional quality.
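The figure-of-merit function ᾱ(A, B, C, D, E) can be sketched as follows. This is our own illustration on a toy problem: it only constructs ᾱ by averaging the relative cost reduction over random instances and evaluates it for two hand-picked parameter sets; in the experiments above, this function would instead be handed to a downhill simplex routine (e.g. a Nelder-Mead implementation).

```python
import math
import random

def anneal(cost, perturb, s, A, B, C, D, E):
    # Simple simulated annealing (Algorithm 2).
    T = A
    while T > B:
        for _ in range(C):
            s_new = perturb(s)
            dC = cost(s_new) - cost(s)
            if dC < 0 or random.random() < math.exp(-dC / (D * T)):
                s = s_new
        T = E * T
    return s

def mean_alpha(params, n_runs=10):
    """Figure of merit: average relative cost reduction
    alpha = (C_initial - C_final) / C_initial over random toy instances."""
    A, B, C, D, E = params
    total = 0.0
    for _ in range(n_runs):
        s0 = [random.uniform(-5, 5) for _ in range(4)]
        cost = lambda x: sum(v * v for v in x)
        perturb = lambda x: [v + random.uniform(-0.5, 0.5) for v in x]
        c0 = cost(s0)
        cf = cost(anneal(cost, perturb, s0, A, B, C, D, E))
        total += (c0 - cf) / c0
    return total / n_runs

random.seed(4)
fast = mean_alpha((10.0, 0.01, 50, 1.0, 0.5))  # aggressive cooling, E = 0.5
slow = mean_alpha((10.0, 0.01, 50, 1.0, 0.9))  # gentler cooling, E = 0.9
print(fast <= 1.0 and slow <= 1.0)  # True: alpha cannot exceed 1
```

Because ᾱ is noisy and derivative-free, this is precisely the kind of objective for which the downhill simplex method is recommended in the text.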

7.5 Perturbations for Continuous and Combinatorial Problems

Apart from the cost function, with which we compare any two proposed solutions, the only other point at which the problem enters the annealing algorithm is the method used to perturb, or change, any proposed solution. This method to modify a solution must be carefully constructed such that we have a good chance to meet with many solutions and to be able to exit local minima.

Let us imagine that we are dealing with a continuous problem, that is, a problem in which the independent variables (those whose values we want to determine) take on continuous values as opposed to discrete values. Then a solution is any value of the independent variable vector x that obeys the boundary conditions. In order to generate a new vector x′ from this, several ideas suggest themselves:

1. Set x′ to a random vector independent of x. This is a simple and intuitive idea, but it violates the basic philosophy of simulated annealing of adaptive change. This method makes annealing essentially a variant of random search (take the best solution of many randomly generated ones) and performs poorly.
2. Set x′ equal to x and change one element by ±Δ for some a priori chosen length scale Δ. This is better but performs poorly as well, as most problems tend to have shorter length scales at lower temperatures, due essentially to the phase transition observed at intermediate temperatures.
3. Do the above but make Δ into a function Δ(T) of temperature. The scale should decrease with temperature; that much is clear. There is wide disagreement in the literature as to how exactly to decrease it. Describing these methods would carry us too far afield. We merely note that the length scale of any particular variable at any particular temperature may be estimated by performing many transitions for various Δ for this variable at that temperature and noting down the variation in the cost function thus achieved. In this manner, we may empirically construct a Δ(T).
4. Do the above but assign a different Δ(T) to each element of x, because every variable may (and in general will) have its own length scale.
5. Do the above but change not only one element of x but several during a single move. We then ask how many is "several?" We have found that about 10% of the number of elements in x is a good number to vary simultaneously. Which they are should be chosen randomly.

In the absence of domain knowledge that may allow us to design a better transition mechanism, the last of the above suggestions has performed best in many experiments of the author. In case of doubt, this approach is recommended.

If we are dealing with a problem that is not continuous but discrete, matters are more diverse. We now need to take into account the actual structure of the problem. Take, for example, the traveling salesman problem. The mechanism that has proven best has two moves: reverse any partial path, and interchange two partial paths, e.g. replace A → B and C → D with the two partial paths A → C and B → D. There have been many other suggestions for generating a new solution from an old one, but it is this suggestion that has been found to perform best. It is unclear, before experimenting, which set of moves will perform best.

It is apparent from the structure of the moves for the traveling salesman problem that these were designed with the structure of the problem itself in mind – the problem is about a path between points without repetition. We cannot use these moves for a different problem. As such, we must design a move set with respect to each problem. There is no general theory for constructing a transition mechanism. We must think about what is natural for the problem at hand and try it out. In general, several suggestions will have to be tested. Most often we will have no theoretical explanation for the better performance of the winner but must merely be content to observe which happens to be the best.

We note in closing that the suitability of the transition mechanism is a major point in using simulated annealing. If you use a poor transition mechanism, annealing will take much longer (require more transitions) and may indeed converge to a poorer


outcome. Note also that the cooling schedule depends on the transition mechanism and so the cooling schedule must be tuned to a particular transition mechanism.
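The two traveling-salesman moves described above can be sketched as simple list operations. This is a minimal illustration of ours; the function names and index conventions (segments given by inclusive positions i..j, valid insertion point k) are assumptions, not notation from the book.

```python
def reverse_segment(tour, i, j):
    """Reverse the partial path tour[i..j] (the classical 2-opt style move)."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def transport_segment(tour, i, j, k):
    """Remove the partial path tour[i..j] and reinsert it at position k of the
    remaining tour, i.e. interchange two partial paths. Assumes k is a valid
    position in the shortened tour."""
    seg = tour[i:j + 1]
    rest = tour[:i] + tour[j + 1:]
    return rest[:k] + seg + rest[k:]

t = [0, 1, 2, 3, 4, 5]
print(reverse_segment(t, 1, 3))       # [0, 3, 2, 1, 4, 5]
print(transport_segment(t, 1, 2, 3))  # [0, 3, 4, 1, 2, 5]
```

Both moves preserve the defining property of a tour: every city appears exactly once.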

7.6 Case Study: Human Brains use Simulated Annealing to Think

Co-Author: Prof. Dr. Adele Diederich, Jacobs University Bremen gGmbH

Humanity has long searched for the mechanism that allows the human brain to be as successful as it is observed to be in solving a variety of problems, both new and old, every day. Much is known about the architecture of the brain on the level of neurons and synapses but very little about the global modus operandi. We find evidence here that simulated annealing is that elusive mechanism that could be called the 'formula of the brain.' By examining the traveling salesman and capacitated vehicle routing problems, which are typical of the everyday problems that humans solve, we illustrate that none of the optimization methods known to date match the observations except for simulated annealing. The method is both simple and general while being highly successful and robust. It solves problems very close to optimality and shows fault tolerance and graceful degradation in the presence of errors in both input data and objective function calculations.

The human brain is constituted of approximately 10^11 individually simple computational elements (neurons) that are interconnected via approximately 5·10^14 synapses¹ [55, 111]. These large numbers prohibit a comprehensive direct computer model of the brain. Even if it were possible, such a model would be essentially epistemological, i.e. it would treat the brain as a "black box" and would concern itself only with input and output to this box. It is eminently more desirable to search for an ontology of the human brain, a theory that (at least to some degree) explains as well as reproduces input-output pairings.

The importance for science in general of understanding how our brain thinks in global terms can hardly be overemphasized. Given the philosophical nature of the issues, it seems unreasonable to expect to resolve the nature-nurture, consciousness-complexity or intelligence debates on such grounds.
However, many issues of scientific interest can be tackled from this basis, such as the performance issues at the basis of intelligence and all manner of questions regarding memory and learning, as well as modularization or compartmentalization of the brain. Furthermore, through better understanding of the human 'hardware' it should be possible to facilitate improved learning, recall and equilibrated and enhanced brain usage. In brief colloquial terms, an ontological brain mechanics forms the essential introduction to a brain operations manual for the scientist as well as for the lay-thinker.

¹ Graph-theoretically speaking, the brain is a very sparse graph – with 10^11 nodes, a graph with all possible edges would have 0.5·10^22 edges, meaning that the human brain has approximately 0.00001% of all possible synapses. This is, of course, necessary, as compartmentalization and modularization are quite essential for the myriad functions that the brain has to perform simultaneously.


The average human being must make complex choices many times per day. Many of these fall into the category of optimization problems: choosing the 'best' alternative from a wealth of possibilities; a simple example being that of planning a route between many stops. The meaning of 'best' differs widely between problems², but it is clear that we must be able to (and are able to) compare several possibilities as to their goodness when trying to find the best alternative. The process of 'thinking about' the problem (i.e. considering the relative goodness of various possible alternatives) takes time, and often we consider an alternative that is worse than the best one found so far in an effort to find an aspect of that alternative that will allow us to find an even better alternative later on – we accept temporary losses in expectation of greater gain at a later time.

In most problems, the total number of possible alternatives is astronomically large and no simple recipe for solution exists. As an example, planning the shortest route between n stops while running errands would have us consider (n − 1)! possible routes. Most of these problems can only be (realistically) solved using heuristic methods. The human brain is thus capable of selecting a good alternative from a large set of possibilities without considering all possibilities. Additionally, the brain does not search randomly but 'intelligently' considers the alternatives.

It is unreasonable to believe that the human brain has separate solution strategies for each possible different problem. This would require a vast brain (which we do not have) and enormous learning (which we do not have time for). Thus there must exist a central problem solving apparatus that manages to solve many very different problems, each to a reasonable degree³.
Furthermore, the neuron-synapse structure of the brain operates approximately 1000 times slower than current computer hardware, and humans still regularly out-think computer programs in tasks such as pattern recognition. It is sometimes thought that massive parallelism is the key to this performance gap [117, 111]. Our thinking strategy is thus problem-independent and very quickly obtains a nearly optimal solution via a directed search through a very small portion of the space of possible solutions. We wish to draw a parallel between the SA algorithm and the functioning of the human brain in solving optimization problems. As most problems encountered in everyday life are optimization problems, we will take this to be a strong indication that the human brain uses SA as its general problem-solving strategy. Learning enters in two ways: (1) the cooling schedule of the SA paradigm is very flexible and amenable to substantial tuning, and (2) after sufficient experience with a particular kind of problem, the brain may develop a custom method for dealing with those particular issues important to that human being. SA is very powerful, as we have seen, but it is also very robust. Robustness refers to the preservation of the method's ability to find a good solution in the presence of

^2 Examples include minimizing the number of kilometers needed for a journey, the amount of time required for a job, the number of trucks needed to supply a chain of stores, the number of rooms required for a conference, the best assignment of employees to job tasks, and so on.

^3 Note that, due to the astronomically large number of possible solutions and the non-existence of any quick guaranteed solution schemes, the human brain cannot be expected to (and does not) solve these problems to optimality but only close enough for practical purposes.

7.6 Case Study: Human Brains use Simulated Annealing to Think


noise (errors in the input data) and/or uncertainty (errors in the objective function evaluation). Clearly the human brain is very robust, and this feature must carry over into any ontological mechanics of the brain. The proper unit of time for the SA paradigm is the number of proposals made. It is useless to measure the actual time taken, as this depends too strongly on the computer and the programmer. As each proposal necessitates the construction of a candidate solution, its comparison to the current reference solution and its subsequent acceptance or rejection, we postulate, based on timing measurements of the brain, that the average human is capable of doing this in 3 milliseconds [56, 111]. This means that the human brain considers approximately 333 proposals per second. We shall take this as the conversion factor in order to compare our SA algorithm to brain measurements^4. In our experiments we task a number of human subjects and the SA algorithm with a number of instances of the traveling salesman problem (TSP) and the capacitated vehicle routing problem (CVRP). In both problems, the locations (on the two-dimensional Euclidean plane) of n cities are chosen. In the TSP, the shortest possible round-trip journey visiting each city exactly once is to be constructed. In the CVRP, each city has an associated demand value that the traveling salesman has to meet, given that his vehicle has a maximum capacity. A particular city is flagged as the 'depot', to which the salesman has to return periodically to refill his vehicle. The CVRP asks for the shortest journey visiting each city (except the depot) exactly once, starting and ending at the depot and fulfilling all demands while never exceeding the vehicle capacity^5.

^4 It should be noted that almost no proposals are considered consciously. The computational power of the brain is vast but relies on almost all of the computing being done unconsciously.
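The annealing procedure used in these experiments can be sketched in a few lines. The 0.99 cooling factor, the 200-proposal equilibrium and the four-equilibria stopping rule follow the schedule described in the method note; the starting temperature, city layout and proposal move are illustrative simplifications, not the book's exact setup:

```python
import math
import random

def tour_length(tour, pts):
    """Total length of a closed tour through the given points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal_tsp(pts, cool=0.99, equilibrium=200, patience=4, seed=0):
    """Simulated annealing for the TSP with a geometric cooling schedule."""
    rng = random.Random(seed)
    tour = list(range(len(pts)))
    cost = tour_length(tour, pts)
    temp = cost  # crude starting temperature; the book heats until 99% acceptance
    stale = 0
    while stale < patience:
        cost_before = cost
        for _ in range(equilibrium):
            # Proposal: reverse a random segment of the tour (a 2-opt move).
            i, j = sorted(rng.sample(range(len(pts)), 2))
            cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            delta = tour_length(cand, pts) - cost
            # Metropolis acceptance: always accept improvements; accept
            # cost increases with Boltzmann probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                tour, cost = cand, cost + delta
        temp *= cool  # decrease the temperature by a constant factor
        # Stop once the cost has not changed over several consecutive equilibria.
        stale = stale + 1 if abs(cost - cost_before) < 1e-12 else 0
    return tour, cost
```

Calling `anneal_tsp` on a list of (x, y) city coordinates returns a permutation of the city indices together with its length; on small instances of the kind used in the experiment, such a schedule typically freezes at or very near the optimal tour.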
^5 Method Note: Both problems can, of course, be posed with the cities not on the two-dimensional Euclidean plane, but this restriction made it possible to have human test subjects solve the same instances as the computer. The instances used had between 10 and 70 cities; half the instances were taken from actual cities on the Earth projected onto the plane and the other half were uniformly randomly distributed in a square. For the human test subjects, the computer screen first informed them how many cities the next problem had and what the maximum time limit was; then a fixation cross was displayed at the center of the screen; and then the cities were displayed as dots together with a clock counting backwards from the maximum time limit. The subjects had to use the mouse pointer to click on the cities in the order in which they wished to visit them. A choice could be undone by clicking the other mouse button, and all click times were recorded, as was the length of the final journey. It was found that the average subject needs approximately one second per city just to perform the clicking operations, i.e. giving less than this much time would not yield a complete tour of the cities. Each instance was displayed several times for different maximum time values in order to measure the progress of the subjects as they were given more or less time to think. Each time an instance was redisplayed, it was rotated by a different angle so that it did not look the same as on its previous display. In this way, the learning effects of the experiment were kept to the general task and not made specific to an instance. For the computer algorithm, we used a cooling schedule that assesses the correct starting temperature by heating up the system until 99% of all proposals are accepted (using the Maxwell-Boltzmann distribution for accepting cost-increasing proposals). Equilibrium is defined as 200 proposals, and the temperature is decreased by a constant factor of 0.99 until the cost does not change over four consecutive equilibria. This schedule is capable of solving to optimality all small problem instances contained in the TSPLIB collection of standard problem instances for both the TSP and CVRP. This collection represents the international testbed for TSP and CVRP algorithms.

In order to compare the data from the subjects to that of the computer, we note that for the SA paradigm the graph of percentage cost deviation from optimum^6 versus time is largely independent of the problem size for small instances. Furthermore, this graph, as well as the graph of cost variance versus time, displays a smoothed step-function profile. The initial plateau is a feature for which SA could be criticized as an optimization algorithm, since it could be viewed as a waste of computational resources; after all, we only begin to see progress after a substantial amount (roughly one third) of the time has been spent. From a catalog of possible smoothed step-function forms [103], we have determined that the best fit is the complementary error function, a · erfc(bx + c) + d, where a, b, c and d are constants and erfc(x) = (2/√π) ∫_x^∞ exp(−t²) dt. Other optimization algorithms do not have this profile; their cost versus time graphs generally follow a decaying exponential. We do indeed find a scale invariance in the human subjects' performance, as was expected from SA experience, i.e. the normalized cost and standard deviation curves do not depend upon the problem size. Thus we restrict ourselves to presenting data from a particular instance. See figure 7.3, where the SA output has been scaled in time according to the rate of 333 proposals per second. The notable features of this comparison are thus: (1) scale invariance was observed in both human subjects and the computer algorithm, (2) the cost and standard deviation functions agree closely between computer and human subjects, (3) the (independently arrived at) translation between the number of proposals and seconds is accurate, and (4) these features are characteristic of the SA paradigm and do not occur all together in any of the other general optimization methods.

Thus we have evidence that the brain mechanics cannot follow any of the other standard methods, as well as evidence that SA is very close to the observed performance. We conclude that simulated annealing is the prime candidate for an ontological brain mechanics.

^6 One possible criticism of this is that measuring the cost deviation from optimum skews the human performance, because the subjects do not control the length directly but only the order of the cities. It has been shown in the context of the TSP that the deviation of one journey from another (in terms of Hamming distance, the number of differing entries in a vector) approximately scales with the corresponding cost [45].

Fig. 7.3 The normalized cost deviation from optimum versus time in seconds is plotted in the upper image, with the grey line being the SA output and the black line the average of the human subjects for a particular instance. The normalized standard deviation versus time is plotted in the lower image in the same manner.

7.7 Determining an Optimal Path from A to B

Suppose that you are currently at the point A and that you have computed that point B is the optimum that you would like to reach. In industrial reality, you cannot always just change all set points from A to B in one go. The process must be guided smoothly from here to there. This is called a change at equilibrium, meaning that the change must happen at such a slow speed that the process is always (nearly) at equilibrium even though values are being changed. This will ensure that the process continues producing the product that you want without causing any unwanted side effects that might even destroy the optimization gains altogether. This is analogous to a navigation system in a car: it is unfortunately not possible to drive from A to B instantaneously; you must pass through intermediate points. There is an optimal route, and it is the responsibility of the navigation system to tell you what this optimal route is, to guide you through each of the steps involved as and when they are necessary, and to alert you if you deviate from the plan. We must do the same for an industrial process. A simple example of this kind of problem is finding the shortest line between two points. On a flat space, this is obviously a straight line. If the space is not flat (for example, the hilly surface of the Earth), then the shortest path is no longer a straight line. The method for solving such problems is called the calculus of variations. It has a few steps.


First, we define our criterion for optimality. In this context it is called the Lagrangian; it is a function of the variable x, the function that we wish to find f(x), and the derivative of that function f′(x), written L(x, f(x), f′(x)). In the case of finding the shortest distance between two points, we have

L(x, f(x), f′(x)) = √(1 + (f′(x))²).

Second, we define the action integral to be

A = ∫_A^B L(x, f(x), f′(x)) dx

where A and B are the two end points of the line. The action thus depends on the function f(x), which is unknown at this point. This is a strange dependency, as f(x) is not a variable with a numerical value but a variable with a function as its value. We will not discuss this at length here but merely request the, in mathematics very common, "willing suspension of disbelief." Third, we state that we wish f(x) to take on that function as its value which minimizes the action. The result, after some pencil work, is the Euler-Lagrange equation

∂L/∂f − (d/dx)(∂L/∂f′) = 0.

This equation must be solved for f(x). In general, this may be difficult. We will illustrate it with a simple example. We start with the arc length L(x, f(x), f′(x)) = √(1 + (f′(x))²) and observe that here

∂L/∂f = 0,

as f does not appear explicitly in L. Then,

(d/dx)(∂L/∂f′) = (d/dx) [f′(x) / √(1 + (f′(x))²)].

Since this expression must equal zero, the numerator arising from the differentiation must vanish, and thus

d²f(x)/dx² = 0.
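This derivation can be checked numerically with nothing but finite differences: the Euler-Lagrange residual for the arc-length Lagrangian vanishes along a straight line but not along a curved function. The step size h and the two test functions below are arbitrary illustrative choices:

```python
import math

def el_residual(f, x, h=1e-4):
    """Euler-Lagrange residual -d/dx [ f'(x) / sqrt(1 + f'(x)^2) ] for the
    arc-length Lagrangian, evaluated with central finite differences."""
    def fprime(t):
        return (f(t + h) - f(t - h)) / (2 * h)
    def momentum(t):  # the term dL/df' = f' / sqrt(1 + f'^2)
        p = fprime(t)
        return p / math.sqrt(1 + p * p)
    return -(momentum(x + h) - momentum(x - h)) / (2 * h)

line = lambda x: 2 * x + 1   # a straight line: should satisfy the equation
parabola = lambda x: x * x   # a curved function: should not

print(el_residual(line, 0.5))      # approximately zero
print(el_residual(parabola, 0.5))  # clearly nonzero
```

Only the straight line drives the residual to zero, in agreement with the result of the calculus of variations.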


The general solution to this is f(x) = mx + b, i.e. the famous straight line. This is the calculus-of-variations way of proving that the shortest distance between two points (on a flat space) is a straight line. In industrial reality, we have a multidimensional space (x is a vector), and we want to minimize the path's total cost in terms of the objective function that we used in the optimization to find the optimal point. This will lead to an Euler-Lagrange equation that must be solved. As the objective function becomes the L used above, we cannot be more concrete than this here without a specific example. In general, this equation will not be solvable directly but must be solved numerically. That is how to determine the most economical path from A to B. The methods to numerically solve a non-linear second-order partial differential equation in several dimensions go beyond the scope of this book but may be obtained in several commercial software libraries for practical use.
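For intuition, the one-dimensional case can be solved numerically by discretizing the path into waypoints and repeatedly moving each interior waypoint to the minimum of its local cost; for the arc-length cost this relaxation recovers the straight line. The grid size and the ternary-search minimizer are illustrative choices, not a method prescribed by the book:

```python
import math

def relax_path(y, dx, segment_cost, sweeps=500):
    """Minimize the total segment cost of a discretized path by coordinate
    descent: each interior waypoint is moved to the 1-D minimum of the cost
    of its two adjacent segments (found by ternary search)."""
    y = list(y)
    for _ in range(sweeps):
        for i in range(1, len(y) - 1):
            def local(v):
                return segment_cost(dx, v - y[i - 1]) + segment_cost(dx, y[i + 1] - v)
            lo = min(y[i - 1], y[i + 1]) - 1.0
            hi = max(y[i - 1], y[i + 1]) + 1.0
            for _ in range(60):  # ternary search on the convex local cost
                a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
                if local(a) < local(b):
                    hi = b
                else:
                    lo = a
            y[i] = (lo + hi) / 2
    return y

# Arc-length cost of one segment: sqrt(dx^2 + dy^2).
euclid = lambda dx, dy: math.hypot(dx, dy)

n = 11
y0 = [0.0] + [0.9] * (n - 2) + [1.0]  # poor initial guess between fixed endpoints
y = relax_path(y0, dx=1.0 / (n - 1), segment_cost=euclid)
# The relaxed path is (numerically) the straight line y = x.
```

Swapping `euclid` for any other convex segment cost turns the same relaxation into a crude numerical solver for the corresponding discretized Euler-Lagrange problem.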

7.8 Case Study: Optimization of the Müller-Rochow Synthesis of Silanes

Silanes are chemical compounds based on silicon and hydrogen. Important for industrial use are the methyl chloride silanes. Industrially, these are principally produced using the Müller-Rochow synthesis (MRS), which is the reaction 2 CH3Cl + Si → (CH3)2SiCl2. There are various hypotheses regarding the catalytic mechanism of the reaction, but there is no generally accepted theory. Regardless of this, the reaction is widely used in large-scale industrial facilities to produce the dichloro compound (CH3)2SiCl2 and the trichloro compound CH3SiCl3, which we will refer to below as Di and Tri. Practically, the reaction is carried out using silicon that is available in powder form in particle sizes between 45 and 250 μm with a purity higher than 97%. The most common catalyst is copper, and the promoters are a combination of zinc, tin, phosphorus and various other elements. The reaction is carried out at about 300 °C and between 0.5 and 2 bar overpressure. In a fluidized bed reactor, the silicon powder encounters chloromethane gas from below. The product leaving the reactor contains the desired end product but also unused methyl chloride, which has to be separated in a condenser. The mixture of silanes is then separated by rectification, where the desired Di and Tri are split off from the other methyl chloride silanes, which are mostly waste. These desired end products can now be hydrolyzed into various silicones. The final products of this process find practical use as lubricants in cars, creams for cosmetics, flexible rubber piping, paint for various applications, insulating paste for buildings and in a variety of other applications. Unfortunately, the reaction also produces several unwanted by-products.
The selectivity of the process measures how much of the total end product is of each type; for example, a Di-selectivity of 80% indicates that 80% of the total end product is in the form of Di.


The market value of Di is the highest among the different end products, and so we want to maximize the Di-selectivity. However, the selectivity is influenced in part by the addition of catalyst. The relationship between increasing catalyst and increasing selectivity is a matter of folklore in this area. As part of this research project, no relationship whatsoever could be discovered within the studied range of 1% to 3% catalyst addition. As the catalysts represent a financial cost, the most economical selectivity is not, in fact, the maximum that could be chemically reached. We desire an economic maximum here. Because no generally accepted theory of the catalytic mechanism exists, there is considerable debate and experimentation in industrial settings on the correct use of the catalyst and promoters in order to get optimum performance. The question specifically is: in what circumstances should what amount of what element be added to the reaction? An important component of finding an answer to this question is what the desired outcome of adding these substances is. In an industrial setting, the commercial environment supplies us with some additional variables such as market prices and supply and demand variations. Finally, we establish that the desired outcome is a maximum of profitability. Whatever combination of catalysts, promoters and end products is required for this will be taken, and it is the purpose of the optimization to compute this at any time.

Consider a black box. This box has five principal features:

1. There are various slots into which you feed raw materials such as silicon, copper and so on.
2. There are some pipes where the various end products come out of the box.
3. The box has a few dials and buttons with which you can act upon the system. These will be called the controllable variables, c.
4. The box has various gauges that display some information about the inside of the box, such as various temperatures and pressures. These variables change in dependence upon the controllable ones but cannot be controlled directly; they will be called the semi-controllable variables, s.
5. The box also has gauges that display some information about the external world, such as market prices for end products or the outside air temperature. As these variables are determined by the external world, we have no influence over them at all. They will therefore be called the uncontrollable variables, u.

Inside this box, the Müller-Rochow synthesis is doing its job. Due to the lack of a theory about the synthesis, we cannot describe the process inside the box using a set of equations that we can write down from textbooks or first principles. Therefore, we adopt a different viewpoint. Any industrial facility records the values measured by all the gauges and dials in an archive system that is capable of describing the state of the box over a long history. As the underlying chemistry has not changed over time, we therefore have a large collection of "input signals" (controllable) into the unknown process alongside their corresponding "output signals" (semi-controllable) in dependence upon the boundary conditions or constraints (uncontrollable), which, mathematically, are


also a form of input signal. This experimental data should allow us to design a mathematical description of the process, which would take the form of several coupled partial differential equations. Formally speaking, these equations look like s = f(c; u). Mathematically speaking, the uncontrollable variables assume the role of parameters in this function (and hence follow the semicolon in the notation). Discovering this function is the principal purpose here and is very complex. One of the most intriguing features is that all three sets of variables are time-dependent and the process itself has a memory. Thus the output of the process now may depend on the last few minutes of one variable and the last few hours of another. These memories of the process must be correctly modeled in order for this function to represent the process well enough to use it as a basis for decision making. In order not to clutter the mathematical notation, we will skip the dependence upon time that should really be attached to every variable here. The modeling is done using the methods from section 6.5. In order to do optimization, we need to define a goal g to maximize, which is a function of the process variables and parameters, g = g(c, s; u). Using the recurrent neural network modeling approach, the goal becomes g = g(c, f(c; u); u), i.e. the goal is now a function of only the controllable variables and the uncontrollable parameters. Optimization theory can be applied to this in order to find the optimal point ĉ at which the goal function assumes a maximum, g_max = g(ĉ, f(ĉ; u); u). As the location of the optimal point is computed in dependence upon the goal function as described above, it becomes clear that the optimal point is, in fact, a function of the uncontrollable measurements, ĉ = ĉ(u). Now we have the optimal point at any moment in time.

We simply determine the uncontrollable measurements by observation and compute the optimal point that depends only upon these measurements. Thus we arrive at our final destination: the correct operational response r at any moment in time is the difference between the current operational point c and the optimal controllable point ĉ(u), i.e. r = ĉ(u) − c. This response r is what we report to the control room personnel and request them to implement. Ideally, the plant is already at the optimal point, in which case the response r is the null vector and nothing needs to be done. As a result of the plant personnel performing the response r, the optimal point will be attained and an increase of the goal function value will be observed; this increase is Δg = g_max − g(c, f(c; u); u), which we can easily compute and report as well. The relative (percentage) increase Δg_rel = Δg / g(c, f(c; u); u) has been found, in this example, to be approximately 6%; see below for details. Please note carefully that the response r = ĉ(u) − c is a time-dependent response even though we have skipped this dependency in the notation. Thus, we do not necessarily proceed from the current point c to the optimal point ĉ(u) in one step; see section 7.7 for a discussion of this point. Most often it is important to carefully negotiate the plant from the current to the optimal point, and this journey may take a macroscopic amount of time, sometimes several hours.

Figure 7.4 displays this problem graphically using real data taken from the current process. The two axes on the horizontal plane indicate two controllable variables and the vertical axis displays the goal function. We can easily see that the
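The recommendation step r = ĉ(u) − c can be sketched as follows. The toy goal function and the random-search maximizer are illustrative stand-ins: the book builds the model f with recurrent neural networks and does not prescribe a particular optimizer:

```python
import random

def recommend(goal, c_now, u, bounds, iters=20000, seed=0):
    """Approximate c_hat = argmax_c goal(c, u) by random search inside the
    box 'bounds', then return the operational response r = c_hat - c_now."""
    rng = random.Random(seed)
    best_c, best_g = list(c_now), goal(c_now, u)
    for _ in range(iters):
        cand = [rng.uniform(lo, hi) for lo, hi in bounds]
        g = goal(cand, u)
        if g > best_g:
            best_c, best_g = cand, g
    return [ch - cn for ch, cn in zip(best_c, c_now)], best_g

# Toy goal standing in for g(c, f(c; u); u): maximum at c = (u[0], u[1]).
def toy_goal(c, u):
    return -((c[0] - u[0]) ** 2 + (c[1] - u[1]) ** 2)

r, g_max = recommend(toy_goal, c_now=[0.0, 0.0], u=[0.3, -0.2],
                     bounds=[(-1, 1), (-1, 1)])
# r points approximately from the current point towards (0.3, -0.2).
```

Because the uncontrollables u enter only as parameters of the goal, recomputing this at each time step yields the time-dependent response described in the text.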


Fig. 7.4 The dependency of the goal on two controllable variables. The upper path displays the reaction of a human operator and the lower path the reaction of the optimization system. The paths differ and arrive at different destinations even though they started from the same initial state on the left of the image. The optimized path is better than the human-determined path by approximately 3% as measured by the goal function.

change in a controllable variable can produce a dramatic change in the goal. The two paths displayed represent the reactions to the current situation by a human operator (the upper path) and by the computer program (the lower path). Both begin on the left at the current operational point. Because of their differing operational philosophies, the paths deviate and eventually arrive at different final states. This is a practical example of the human operator making decisions that he believes are best but that are, in fact, not the best possible. For the specific application at hand, the molecules are produced in three separate reactors and then brought together for shipment. We are to optimize the global performance of the plant but are able to make changes for each reactor separately. In this case, the controllable variables c were the following: temperature of the reactor, amount of raw material to the jet mill, steam pressure to the jet mill, amount of methyl chloride (MeCl) to the reactor, pressure of the reactor, and others relating to the processes before the synthesis itself. The uncontrollable parameters u were X-ray fluorescence spectroscopy measurements of 17 different elements in the reactor. The semi-controllable variables s are the other variables measured in the system. In total, there were almost 1000 variables measured at different cadences.


The goal function is the financial gain of the reaction. We compute the input raw materials and the output end products. Each amount is multiplied by the currently relevant financial cost or revenue. The final goal is thus the value added to the product by the synthesis. We desire this to assume a maximum. This function is dominated by two effects: Di is the most valuable end product, and so we wish to maximize its selectivity; and the overall yield represents the profit margin, and so we wish to maximize it also. Possible conflicts between these criteria are resolved by their respective contributions to the overall financial goal. In the results, we will focus on these two factors. The results reported here were obtained in an experimental period lasting three months and encompassing three reactors. The experiment was broken into three equal periods. During the reference period, the optimization was not used at all. During the evaluation period, the optimization had only partial control, in that the human operator controlled the input of catalyst and promoters. During the usage period, the optimization was given full control. We may observe the results in figure 7.5. In each graph, the dotted line is the reference period, the dashed line is the evaluation period and the continuous line is the usage period. What is displayed is the probability distribution function of the observed values. This way of presenting the results allows an immediate statistical assessment instead of presenting a time series.
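The goal computation described above amounts to a price-weighted balance of material flows; a minimal sketch, with material names, quantities and prices invented purely for illustration:

```python
def added_value(inputs, outputs, prices):
    """Financial gain of the reaction: revenue from the end products minus
    the cost of the raw materials, each amount weighted by its current price."""
    revenue = sum(qty * prices[name] for name, qty in outputs.items())
    cost = sum(qty * prices[name] for name, qty in inputs.items())
    return revenue - cost

prices = {"silicon": 2.0, "mecl": 1.0, "di": 5.0, "tri": 3.0}  # invented figures
g = added_value(inputs={"silicon": 10.0, "mecl": 20.0},
                outputs={"di": 8.0, "tri": 2.0}, prices=prices)
# g = 8*5 + 2*3 - (10*2 + 20*1) = 6.0
```

Since the prices enter as uncontrollable parameters, the same goal automatically trades off Di-selectivity against overall yield as market conditions change.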

Period      Selectivity (%)   Yield (%)
Reference   79.8 ± 3.6        86.6 ± 4.2
Evaluation  79.9 ± 2.5        89.7 ± 4.3
Usage       82.7 ± 1.9        91.7 ± 3.2

Table 7.1 The results displayed numerically. For both selectivity and yield, we give the mean ± the standard deviation for all three periods.

It is apparent from the images alone that the selectivity and the yield increase with greater use of the optimization, and that the variance in both decreases as well. Numerically, the results are displayed in table 7.1. Decreasing the variance is desirable because it yields a more stable reaction over the long term and thus produces the output more uniformly over time. We may conclude that the selectivity can be increased by approximately 2.9% and the yield by approximately 5.1%, both in absolute terms. Together these two factors yield an increase in profitability of approximately 6% for the plant. We emphasize that this profitability increase was made possible through a change of operator behavior alone (as assisted by the computational optimization); no capital expenditures were necessary.


Fig. 7.5 The probability distribution functions for selectivity and yield of Di for periods in which the optimization was not used (dotted), used for controllable variables without the catalyst (dashed) and used fully without restrictions (solid).

7.9 Case Study: Increase of Oil Production Yield in Shallow-Water Offshore Oil Wells

Co-Authors: Prof. Chaodong Tan, China University of Petroleum; Bailiang Liu, PetroChina Dagang Oilfield Company; Jie Zhang, Yadan Petroleum Technology Co Ltd


Several shallow-water offshore oil wells are operated in the Dagang oilfield in China. We demonstrate that it is possible to create a mathematical model of the pumping operation using automated machine learning methods. The resulting differential equations represent the process well enough to enable two computations: (1) we may predict the status of the pumps up to four weeks in advance, allowing preventative maintenance to be performed and thus availabilities to be increased, and (2) we may compute in real time which set points should be changed so as to obtain the maximum yield of the oilfield as a whole, considering the numerous interdependencies and boundary conditions that exist. We conclude that a yield increase of approximately 5% is possible using these methods. The Dagang oilfield lies in the Huanghua depression and is located in the Dagang district of Tianjin. Its exploration covers twenty-five districts, cities and counties in Tianjin, Hebei and Shandong, including the Dagang exploration area and the Yoerdus basin in Xinjiang. The total exploration area of the Dagang oilfield is 34,629 km², including 18,629 km² in the Dagang exploration area. For the present study, we consider data for five oil wells of a shallow-water oil rig in Dagang operated by PetroChina. An offshore platform drills several wells into an oilfield and places a pump into each one. If the pressure of the oilfield is too low (as in this case), the platform must inject water into the well in order to push out the oil. Thus, the pump extracts a mixture of oil, water and gas, which is then separated on the platform. Foreign elements such as sand and rock fragments in this mixture cause abrasion and damage the equipment. When a pump fails, it must be repaired. Such a maintenance activity requires significantly less time if it can be planned, as the required spare parts and expert personnel can then be procured and made available before the actual failure. If we wait until the failure happens, the amount of time that the well is out of operation is significantly longer. Thus, we would like to know several weeks in advance when a pump is going to fail. Each pump can be influenced via two major control variables: the choke diameter and the frequency of the pump. These parameters are currently controlled manually by the operators. Thus, the maximum possible yield of the rig depends largely on the decisions of the operators, defined by the knowledge and experience of the operator as well as the level of difficulty of any particular pump state. However, the employment of continuous and uniform knowledge and experience for the pump operation is not realistically possible, as no one operator controls the plant over the long term but usually only over a shift. Observations show oscillations of parameters in a rough eight-hour pattern, which supports the argument that a fluctuation in the knowledge and experience of human operators may lead to a fluctuation in the decision making and thus a varying influence on the operation of the rig. While some operators may be better than others, it is often not fully practical and/or possible to extract and structure the experience and knowledge of the best operators in such a fashion as to teach it to the others. Pumps in an oilfield are not independent: demanding a great load from one will change the local pressure field and make less oil available for neighboring pumps. Obtaining the maximum yield, therefore, is not a simple matter but requires careful balancing of the entire field. In addition, certain external factors


also influence the pressure of the oilfield, e.g. the tide. This high degree of complexity of the pump control problem presents an overwhelming challenge to the human mind, and the consequence is that suboptimal decisions are made.

Fig. 7.6 The discharge pressure of a pump as measured (jagged curve) and calculated from the model (smooth curve). We observe that the model is able to correctly represent the pump as exempliﬁed by this one variable.

The model is accurate and stable enough to predict the future working of the pump up to four weeks in advance. It can thus reliably predict a failure of a pump within this time horizon due to some slow mechanism. We verify that the model accurately represents a pump's evolution in figure 7.6. The model was then inverted for optimization of yield. The computation was done for the entire available history of 2.5 years, and it was found that the optimal point deviated from the actually achieved points by approximately 5% in absolute terms. The main benefits of the current approach are that it: (1) processes all measured parameters from the rig in real time, (2) encompasses all interactions between these parameters and their time evolution, (3) provides a uniform and sustainable operational strategy 24 hours per day, and (4) achieves the optimal operational point and thus smooths out variations in human operations. Effectively, the model represents a virtual oil rig that acts identically to the real one. The virtual rig can thus act as a proxy on which we can dry-run a variety of strategies and then port these to the real rig only if they are good. That is the basic principle of the approach. The novelty here is that we have demonstrated, on a real rig, that it is possible to generate a representative and correct model based on machine learning of historical process data. This model is more accurate, more encompassing, more detailed, more robust and more applicable to the real rig than any human-engineered model could possibly be.

The increase of approximately 5% in yield is signiﬁcant as it will allow the operator to extract more oil in the same amount of time as before and thus represents an economic competitive advantage.

7.10 Case Study: Increase of Coal-Burning Efficiency in a CHP Power Plant

Co-Author: Jörg-A. Czernitzky, Vattenfall Europe Wärme AG

The entire process of a combined-heat-and-power (CHP) coal-fired power plant, from coal delivery to electricity and heat generation, can be modeled using machine learning methods that generate a single set of equations describing the entire plant. The plant has an efficiency that depends on how it is operated. While many smaller processes are automated using various technologies, the large-scale processes are often controlled by human operators. The Vattenfall power plant Reuter-West in Berlin, Germany, is largely automated in these respects.

The maximum possible efficiency of the plant depends in part on the decisions of the operators, which are determined by the knowledge and experience of the operator as well as the level of difficulty of any particular plant state. However, the application of continuous and uniform knowledge and experience to the plant operation is not realistically possible, as no one operator controls the plant over the long term but usually only over an eight-hour shift. Observations show oscillations of parameters in a rough eight-hour pattern, which indicates that a fluctuation in the knowledge and experience of human operators may lead to a fluctuation in decision making and thus a varying influence on the operation of the plant. While some operators may be better than others, it is often not fully practical or possible to extract and structure the experience and knowledge of the best operators in such a fashion as to teach it to the others.

Furthermore, the plant outputs several thousand measurements at high cadence. At such a frequency, an operator cannot possibly keep track of even the most important of these at all times. This intensity, combined with the high degree of complexity of the outputs, overwhelms the human mind, and the consequence is that suboptimal decisions are made.
Here, a novel method is suggested to achieve the best possible, i.e. optimal, efficiency at any moment in time, taking into account all outputs produced as well as their complex interconnections. This method yields a computed efficiency increase in the range of one percent. Moreover, this efficiency increase is available uniformly over time, effectively increasing the base output capability of the plant or reducing its CO2 emissions per unit of output.

Initially, the machine learning algorithm was provided with no data. Then the measured points were presented to the algorithm one by one, starting with the first measured point. Slowly, the model learned more and more about the system and the quality of its representation improved. Once even the last measured point was presented to the algorithm, it was found that the model correctly represents the system. See section 6.5 for details on the method.

In the particular plant considered here, Reuter-West in Berlin, a history of eight months over nearly 2000 measurement locations was selected, recorded at one value per location per minute, yielding approximately 0.7 billion individual data points. After modeling, the function deviated from the real measured output by less than 0.1%. This indicates that the machine learning method is actually capable of finding a good model and also that the recurrent neural network is a good way of representing the model.

The power plant is largely automated and so we considered, for test purposes, only the district heating portion of the plant to be under the influence of the optimization program. The controllable variables are then the flow rate, temperature and pressure of the district heating water at various stages during the production. The boundary conditions, or uncontrollable parameters, are provided by the coal quality; the temperature, pressure and humidity of the outside air; the amount of power demanded from the plant; the temperature demanded for the district heating water in the district; and the temperature of the cooling water at various points during the production.

The model was then inverted for optimization of plant efficiency. The computation was done for the entire available history and it was found that the optimal point deviated from the actually achieved points by 1.1% efficiency in absolute terms. This is a significant saving in coal purchases but mainly a reduction of the CO2 emissions, which saves valuable emission certificates. In the analysis, about 800 different operational conditions (in the eight-month history) were identified that the operators would have to react to. This is not practical for the human operator.
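The data volume quoted above can be checked with a quick back-of-the-envelope computation (assuming 30-day months, which the text does not state explicitly):

```python
# Sanity check of the data volume quoted above: eight months of history over
# nearly 2000 measurement locations, one value per location per minute
# (assuming 30-day months).
months, locations = 8, 2000
minutes = months * 30 * 24 * 60
points = minutes * locations
print(points)  # 691200000, i.e. roughly 0.7 billion
```

This agrees with the "approximately 0.7 billion individual data points" figure in the text.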
The model is capable of determining the current state of the plant, computing the optimal reaction to these conditions and communicating this optimal reaction to the operators. The operators then implement this suggestion and the plant efficiency is monitored. It is found that an efficiency increase of 1.1% can be achieved uniformly over the long term. The model can provide this help continuously. As the plant changes, these changes are reflected in the data and the model learns this information continuously. Thus, the model is always current and can always deliver the optimal state.

In daily operations, this means that the operators are given advice whenever the model computes that the optimal point differs from the current point. The operators then have the responsibility to implement the suggestion or to veto it. Specifically, an example situation may be that the outside air temperature changes during the day due to the sun rising. It could then be efficient to lower the pressure of the district heating water by 0.3 bar. The program would make this suggestion and, after the change is effected, the efficiency increase can be observed.

The main benefits to a power plant are that the model: (1) processes all measured parameters from the plant in real-time, (2) encompasses all interactions between these parameters and their time evolution, (3) provides a uniform and sustainable operational strategy 24 hours per day and (4) achieves the optimal operational point and thus smooths out variations in human operations.

For those parts of the power plant that are already automated, the model is valuable as well. Automation generally functions by humans programming a certain response curve into the controller. This curve is obtained by experience and is generally not optimal. The model can provide an optimal response curve. Based on this, the programming of the automation can be changed and the efficiency increases. The model is thus advantageous for both manual and automated parts.
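The idea of an optimal response curve can be sketched as follows: for each sampled value of a boundary variable, the plant model is optimized over the setpoint, and the resulting pairs form the curve to be programmed into the controller. The parabolic `efficiency` function, the temperature range and the pressure numbers below are all invented for illustration; in practice the learned plant model would be evaluated instead.

```python
# Hypothetical sketch: deriving an optimal "response curve" for an automated
# controller, i.e. the best setpoint as a function of one boundary variable
# (here: outside air temperature vs. a district-heating pressure setpoint in
# bar). The efficiency model is an invented stand-in for the learned model.
def efficiency(setpoint, air_temp):
    ideal = 6.0 - 0.03 * air_temp          # assumed drift of the best setpoint
    return 0.42 - 0.01 * (setpoint - ideal) ** 2

def optimal_response_curve(temps, lo=4.0, hi=8.0, n=401):
    """For each boundary value, pick the setpoint maximizing modeled efficiency."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return {t: max(grid, key=lambda s: efficiency(s, t)) for t in temps}

curve = optimal_response_curve([-10, 0, 10, 20, 30])
```

The resulting table of (boundary value, setpoint) pairs is exactly what a controller's response curve is: here, lower outside temperatures map to higher pressure setpoints, and the grid resolution determines how finely the curve is tabulated.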

7.11 Case Study: Reducing the Internal Power Demand of a Power Plant

Co-Author: Timo Zitt, RWE Power AG

A power plant uses up some of the electricity it produces for its own operations. In particular, the pumps in the cooling system and the fans in the cooling tower consume significant amounts of electricity. Reducing this internal demand increases the effective efficiency of the power plant. For the particular power plant in question here, we have six pumps (two pumps each with 1100, 200 and 55 kW of power demand) and eight fans with 200 kW each of power demand. The influence we have is the ability to switch any of the pumps and fans on and off as we please, with the restriction that the power plant as a whole must be able to perform its intended function. A further restriction is introduced by allowing a pump to be switched on or off only if it has not been switched in the prior 15 minutes, to prevent too frequent switching.

Five factors define the boundary conditions of the plant: air pressure, air temperature, the amount of available cooling water, and the power produced by each of two gas turbines. These factors are given at any moment in time and cannot be modified by the operator at all.

The definition of the boundary conditions is crucial for optimization. We recall the example of looking for the tallest mountain in a certain region. If the region is Europe, the answer is Mont Blanc and if the region is the world, then the answer is Mount Everest. In more detail, we have a set of points (the locations over the whole world) that consist of three values each: latitude, longitude and altitude. Out of these points, we first select those matching the boundary conditions (Europe or the whole world) and then perform the search for the point of highest altitude. In the power plant context, we must also define regions in which we will look for an optimum. We do this by providing each boundary condition dimension (the five above) with a range parameter.
Let us take the example of air temperature. We will give it the range parameter of 2 degrees Celsius. If we measure an air temperature of 25 °C, then we will interpret this to mean that we are allowed to look for an optimal point of the function among all those points that have an air temperature in the range
[23, 27] °C. As we have five dimensions of boundary conditions, we have to supply five such range parameters.

It is a priori unclear what value to give the range parameter. A typical choice is the standard deviation of the measurement over a long history. This gives the natural variation of this dimension over time. However, we may deliberately set it higher, because the boundary condition may not be quite so restrictive for the application. In the present case, we have chosen each range parameter to be one standard deviation over a long history. In addition, we have investigated a scenario where the range parameter of the air pressure is two standard deviations, because we regard this dimension as less important.

In order to make our model, we have access to myriad other variables from within the plant. Thus we determine when we require which pumps and fans to be on in order to reliably run the power plant. This culminates in a recurrent neural network model of the plant, which we find to represent the plant to an accuracy better than 1%. The model is then optimized using the simulated annealing approach to compute the minimal internal power demand at any one time. Operationally, this means that the optimization recommends turning off a pump or a fan from time to time and, aggregated over the long term, achieves a lower internal power demand for the plant.

The computation was made for the period of one year and it was found that the internal power usage can be reduced by between 6.8% and 9.2% absolute. The two values are due to the two different boundary condition setups. We therefore see that the loosening of restrictions has a significant effect on the potential of the optimization. The essential conclusion from this is that the parametrization of the problem is very important indeed, both for the quality and the sensibility of the optimization output.
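The optimization step can be sketched as simulated annealing over the on/off states of the devices named in the text: two pumps each of 1100, 200 and 55 kW, and eight fans of 200 kW. The cooling capacities, the required cooling level and the linear cooling schedule below are all invented for this illustration; in the real application the feasibility of a switching state is judged by the learned plant model, not a simple capacity sum.

```python
import math
import random

random.seed(1)  # reproducibility of the illustration

DEMANDS = [1100, 1100, 200, 200, 55, 55] + [200] * 8   # kW per device (from the text)
COOLING = [10, 10, 3, 3, 1, 1] + [2] * 8               # hypothetical capacity units

def internal_demand(state):
    return sum(d for d, on in zip(DEMANDS, state) if on)

def feasible(state, required=18):
    # the plant must still perform its function: enough cooling capacity on
    return sum(c for c, on in zip(COOLING, state) if on) >= required

def anneal(required=18, steps=20000, t0=500.0):
    state = [True] * len(DEMANDS)           # start with every device running
    best = state[:]
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-9  # linear cooling schedule
        cand = state[:]
        i = random.randrange(len(cand))
        cand[i] = not cand[i]               # toggle one device
        if not feasible(cand, required):
            continue
        delta = internal_demand(cand) - internal_demand(state)
        # accept improvements always, worsenings with Boltzmann probability
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            state = cand
            if internal_demand(state) < internal_demand(best):
                best = state[:]
    # final greedy sweep: switch off any device that is no longer needed
    changed = True
    while changed:
        changed = False
        for i in range(len(best)):
            cand = best[:]
            cand[i] = False
            if best[i] and feasible(cand, required):
                best, changed = cand, True
    return best, internal_demand(best)

best_state, kw = anneal()
```

The occasional acceptance of worsening moves at high temperature is what lets the search escape poor combinations (such as running only the two large pumps) that a purely greedy switch-off strategy could get stuck in.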
In conclusion, we observe that the internal demand can be reduced by a substantial margin (6.8% to 9.2%), which will increase the effective efficiency by about 0.05% to 0.06%, given that the internal demand is only about 0.7% of the base output capacity. This is achieved simply by turning off a few pumps and fans when they are unnecessary.

Chapter 8

The human aspect in sustainable change and innovation

Author: Andreas Ruff, Elkem Silicon Materials

8.1 Introduction

Imagine an everyday situation when driving to work, except this time your usual route is blocked due to construction work for a couple of weeks. When realizing this, would you not abruptly wake from the mindless automation that seems to float you to the office? And would you not quickly need to take back control over your turns in order to follow the detour? What distinguishes a rehearsed mental situation from a deliberate one? How can humans actively participate in a change process? Finally, what is needed to sustain that change?

Continue to imagine that, when finally arriving at your desk, your office PC lets you know that your password has expired. Most people - including myself - have a hard time coming up with a new set of letters, characters and numbers. Why is it so hard to think of something new, and why do I struggle to use the new password for weeks? I enter the old password and only when it fails does my brain begin to question my action. Only then will my cognition remind me that I had to change it and that I have a new one. It will take a while until I get used to the new password. But after that it will become the usual password. The same applies to the example of the roadblock: Using the detour for a few weeks, you will sink into the same automation, but with a changed route.

It is the power of habit - the things we do regularly are processed in our brains automatically. A habit is acting on an accepted status quo and we tend not to think about it or even question its necessity or validity. To strive for the new is part of our human nature. In order to alter any situation, it takes a lot more than just the intention to change. It requires the will and ability to take on the new status quo and live up to it.
The difficulty in remembering the password or the changed route to work lies in the perceived futility of the change and is enhanced by the user's perception that this change neither makes life easier nor adds value. In fact: The change is an obstacle.

P. Bangert (ed.), Optimization for Industrial Problems, DOI 10.1007/978-3-642-24974-7_8, © Springer-Verlag Berlin Heidelberg 2012

Breaking the habit and accepting the new situation is hard both for someone initializing and managing the change and for the user. Overpowering habits and, especially, changing business procedures dramatically depend on every individual's ability to understand and accept the need for change and to adopt the new practice.

In this text, I will discuss the various aspects concerning the human ability to change and its influence on sustainability. In the past, I kept asking myself how much the sustainability in project work relates to a personal work style and how much is fortunate coincidence. I will summarize some important aspects on how to reduce the coincidence part. Additionally, I will mention some general aspects backed by research and literature. I want to share my personal experience and hope to illustrate the positive experiences that I have had when dealing with people in project teams. I have always used the human aspect in my project work and found that change against inner conviction is grueling and virtually impossible.

But how can human aspects help, especially when one is introducing major changes, and what can businesses do in order to make change sustainable? What preconditions are needed to make employees accept change or even help create it? I try to answer what successful change management is and how to attain sustainability. I will focus on the human perception of change and on how the organizational setup and a manager's personality can support the sustainability of change. All ideas, suggestions and examples derive from my experience in the chemical industry, but I believe that they apply to any other business. My proposals and visions should be seen as recommendations. I would like to give you ideas and food for thought on the value of the human aspect in order to increase your personal satisfaction and efficiency when making use of it. The terms sustainability and change are not contradictory.
The status of any situation between two deliberate changes should ideally be sustainable.

8.1.1 Defining the terms: idea, innovation, and change

Thomas Edison, the doyen of innovation and creativity, once said that the process of invention takes 10 seconds of inspiration and 10 years of perspiration. In order to understand the process of creativity better, a short differentiation of the terms used is useful.

The first step in a creative process is the idea. It can be described as what is there before we think. It is a mental activity, the interaction of neurons and synapses resulting in an electric impulse comparable with a flash of light. This bioelectric interaction is able to create visions of objects or imagined solutions in our cognition. The brain is structured such that most ideas come up when the thinker focuses on something completely different or even sleeps. There are famous examples where scientists had worked intensively and with concentration to solve a problem, but it was during relaxation that they imagined the solution. Kekulé in 1865 worked on the structural form of benzene and dreamt about a snake chasing its tail. Researchers like McCarley (1982)

8.1 Introduction

203

attribute this to the absence of the catecholamines adrenaline and noradrenaline, resulting in reduced cortical arousal. The presence of catecholamines is related to the size of the neuronal networks and therefore depresses the individual's cognitive flexibility [74]. Emotional distance and a let-go attitude are very important when solving a problem. Ramón y Cajal (1999) mentions this in his book Advice for a Young Investigator. There he describes the "flower of truth, whose calyx usually opens after a long and profound sleep at dawn, in those placid hours of the morning that ... are especially favorable for discovery" [74].

Philosophers consider the ability to generate and understand ideas as the core aspect of individualism and of the human being as a whole. Ideas are not isolated; they grow and improve when shared and discussed with others. The first idea is not necessarily the final solution, but it is the starting point of a thought's long journey to realization.

The bridge between the idea and its realization is called creative innovation. The thought has to fall on fruitful soil or, as Heilman calls it, the prepared mind [74]. It is obvious that the precondition for creativity is an open mindset and the ability to think outside conventional barriers. The prepared mind is able to look at the same situation from different angles. The open mind is not limited in its imagination and it keeps asking "what if?" Kekulé and his problem of structuring benzene can illustrate this concept. His dream about the snake chasing its own tail only induced the idea of a ring structure. The innovation lies in the acceptance of the idea and the insight that this is a possible solution. Had Kekulé discarded the idea of a ring-shaped molecule because it just cannot be, he would probably never have solved the problem. The ability to "understand, develop and express in a systematic fashion" is the foundation of creative innovation [74].
Neither special skills nor a high IQ are required to be creative or innovative. Coming back to Edison's quote, the creative phase takes seconds, but it may take years to establish the novelty successfully in practice. The goal of innovation is positive change to the status quo or, simply, to make something better. Innovation leading to increased productivity is the fundamental source of increasing wealth in an economy [129]. Innovation can be described as an act of succeeding to establish something new. Success in that respect means to eventually make an idea come alive. The novelty can be a service, a product or an organization. Once it is launched, there is no guarantee for success in the market. This can be seen when Amabile et al. (1996) propose [6]:

All innovation begins with creative ideas. We define innovation as the successful implementation of creative ideas within an organization. In that view, creativity of individuals and teams is a starting point for innovation; the first is a necessary but not a sufficient condition for the second.

In order to be innovative one needs more than just a plain creative idea or insight. The insight must be put into action to make a genuine difference. For example, it could result in a new or altered business process within an organization or it could create or improve products, processes or services. Sometimes creative people have taken on existing ideas or concepts making use of individualization in combination with hard work and luck, making the replicated idea even more successful. Neither creativity nor an innovative mind can grant sustainable economic success.

In 1921, J. Walter Anderson together with Edgar Waldo Ingram founded White Castle in Wichita, Kansas. White Castle quickly became America's first fast-food hamburger chain and satisfied customers with a standardized look, menu and service. In 1931, they had the idea to produce frozen hamburgers and they were the first to use advertisements to sell their burgers [132]. White Castle's success inspired many imitators and so, in 1937, the brothers Richard and Maurice McDonald opened a drive-in in Arcadia, California. First selling hot dogs and orange juice, they quickly added hamburgers to the menu [130]. Today, White Castle sells more than 500 million burgers a year [131]. That sounds like a lot but, compared to the market leader McDonald's, it is less than 1%. Surely White Castle was the first and is still in operation, but it is far from the success of others. The success of an innovation depends on several factors such as market conditions, customer demand and expectations. But more than that, a good portion of luck is required: being at the right place at the right time. That is what Edison meant when intimating that successful innovation is most of all hard work and perspiration.

This chapter is not about business plans, nor will it advise the reader on how to plan a successful business. I want to give an insight into what preconditions are needed to implement change successfully and, most of all, how to make it sustainable. The word "change" describes a transactional move from stage A to stage B. It is not necessarily true that the new stage is better than the original, even if the intentions were good. A software update, for example, can put the user in a position of not knowing where to find buttons and features. The changed look leads to an uncomfortable feeling until the user gets to know the program better. This is similar to the above-mentioned examples of the new password or the roadblock on the way to work.
The software developer's intention, of course, is to persuade the user with new tools and an improved appearance. Change is perceived differently depending on the individual's situation. It is up to the user to experience the change as a chance or a challenge. Those of us who have had the chance to manage change, or to work in teams in order to implement change, have experienced that a substantial amount of money is spent to introduce advanced software, restructure organizations or improve processes or products. With all this cost and effort, how and why does the improvement slowly regress when the project team moves away? It is the human aspect that needs to be considered and remembered right from the starting point. It determines the sustainability of the change process. Enforced change is unlikely to be sustainable. As modern organizations need to be able to adapt quickly, the human aspect needs to play a central role in any change process.

8.1.2 Resistance to change

Most people are attracted by novelty. Especially when it comes to consumer electronics, thousands stroll through trade shows or wait in front of retail stores to get the latest products. To be equipped with up-to-date fashion and to be trendy defines our status in modern societies. The speed of change is enormous. Companies
are subject to various constraints in keeping up with the need to create new products and services at satisfactory prices. The challenges are to improve technology and market share, or to stay (or become) competitive. The globalized world demands mental flexibility and the passion to take on any new situation. Yes, we grow with our responsibilities and we have to try hard to live up to the task. But too many businesses experience such pressure and expect more of their employees than they can handle. The modern (especially western) business world generates more and more mental illness. The number one reason for sick leave is mental overload. The managerial challenge is to balance the need for change with the duty of care towards the employees.

Businesses need continuous improvement and enduring change in order to survive and to sustain their engagement. Employees are required to contribute for the benefit of competitiveness, job security and growth. Studies done by Waddell and Sohal in the United Kingdom and Australia show that resistance is the biggest hurdle in the implementation of modern production management methods [126]. Interestingly enough, this resistance comes equally from the management and the workers. They also found that most managers and business leaders perceive resistance negatively. Historically, resistance is seen as an expression of divergent opinions, and good change management is often associated with little or no resistance. That implies that well-managed change generates no resistance. This raises the question of what comes first: well-managed change or the proper handling of resistive forces.

Enforced change, with the main focus on the technical aspects of change, seems to be widely accepted by management. Action plans are worked through and reviewed, but the individual's needs are often overlooked. It does not matter what is going to be changed; the reaction mechanisms of humans are similar and have to be taken into consideration.
Resistance is a reaction to a transition from the known to the unknown [49]. Change comes naturally, but only if the initiative for the transition derives from the individual itself. If the initiative for alteration is pushed from the outside, then the change process needs to be well attuned. Every individual confronted with change undergoes the following phases: initial denial, resistance, gradual exploration and eventually commitment. This is a well-developed natural habit of defending ourselves. Especially when companies execute major business decisions (e.g. organizational rationalization by software implementation), resistance results from the individual's anticipated personal impact of the upcoming change. Humans immediately see themselves confronted with an uncertain future. Past experiences combined with what we have heard (from relatives who have been in a similar situation or from the media reporting about others) escalate to existential fear, and questions arise such as "will I lose my job?" and "do I have to sell my house?" Whether this anxiety is imaginary or real, the physiological response is the same: STRESS. The negative, irrational emotion represses any logical aspect affiliated with the intention, and we divert all energy to defending the status quo rather than to the task at hand. It is an ancient defense mechanism from deep inside that hinders us from considering change rationally, from adopting it or even from helping to shape it. Thus, resistance is often seen as an objection to change. But is that really the case? If resistance is negative, how can we turn it into something useful to enable change? First of all, resistance and
anxiety are important human factors for any undertaking. Data provided by Waddell and Sohal indicates that humans are not against change for the sake of being against it. In many observed cases, resistance occurs when those who resist simply lack the necessary understanding. Therefore, they have a negative expectation about the upcoming effect of the change. A major organizational restructuring or a local implementation of a software tool for process improvement will be seen (by some) as an assault. Preparatory measures should be used not only during the implementation but also as a structural tool. It is important to recognize and analyze resistance in order to make use of it. Managers and organizations should harness resistance and use it with participative techniques.

Participation in that respect means more than just giving regular status updates. Many companies inform their employees regularly about the status of an announced project. This is a one-way communication model and the problem is that, in many cases, questions of those involved are not answered, or only inadequately. Bidirectional information and communication is a critical tool to create well-conceived solutions and to avoid misunderstandings or misinterpretations. After the intention to go through a change process is announced by top management, the middle management needs to communicate detailed information as soon as possible. Regular personal meetings between decision makers and their direct reports enable a consistent flow of (the same) information from the top down. Teams and departments should sit with their direct management to discuss actions, explain project milestones and describe the intended results. The lower and middle management should provide platforms for addressing questions. This will keep up the communication even if the local leadership is not able to explain the exact plans and details. Most of us have experienced situations where we lived through unconscious fear.
It helps if one can express the worry and gets the sense that the fear is taken seriously. Obviously this cannot be done in a works meeting but has to be done in small groups or even in one-to-one conversations. Certainly this is a big effort and (if done right) it takes a lot of working time from the leadership and the work force, but it will pay off when measuring the effectiveness of the implementation.

The organization needs to be set up to be ready for change. Thus, the organization should focus less on technical details and instead prepare psychologically and emotionally. It is mandatory for the management to define the goal and to set the expectations, but it should also leave room for individualism and pluralism. The initial idea is not always the best. Leaders must learn to focus on the result rather than on every individual detail. The organization has to be constituted so that resistance can be articulated and dealt with. In return, every individual needs to be confident that the managers are honestly willing to listen and communicate. Therefore the entire organization needs to be set up to ensure communication and a flow of ideas and suggestions, no matter from which level of the hierarchy they originate. This requires managers at every level of the entity to remember their vested duty to lead and manage people. The entity's organizational structure determines its ability to adapt to innovation.

8.2 Interface Management

8.2.1 The Deliberate Organization

Organizations depend on their ability to adapt to change and to make it sustainable. The following describes a desirable but fictional state. To claim that all of the following is doable appears to be rather unrealistic. But I do claim that most companies have issues with regard to sustainable change management and that this is due to neglecting the human aspect. As rarely as resistance is seen positively in real companies, it is still a desired and necessary ingredient of sustainable change. Furthermore, the proper definition and scrutiny of functional interfaces is the key to this process.

Whenever peers have to work together, the efficiency of this interaction depends on how well each individual knows what to do and how to do it. If there is an overlap of authority or an improper definition of roles and responsibilities, the involved employees spend a great deal of time and energy organizing themselves. Some organizations even believe that employees will solve this conflict in the best interest of the business. The opposite is true. It is in the nature of humans to try to get the interesting or highly appreciated duties and to leave those that require a lot of work or are less esteemed to others. Instead of a cooperative atmosphere, a power struggle will eventually leave few winners and many defeated. Those who cannot or do not want to keep up with this fight might eventually leave the company. In addition, it is not necessarily true that the ones who emerge victorious are the better leaders or managers.

It is obvious that roles and responsibilities have to be defined. Reality proves that in many organizations grey zones exist. An organizational interface requires several aspects to be defined, executed and controlled. First of all, it has to be defined which roles have a share in the interface and who controls and monitors it.
Consider the following example from my experience when purchasing raw materials. In order to buy the right amount and to set up realistic delivery plans, sourcing needs to cooperate with manufacturing, quality control and perhaps even research and development (R&D) as well as logistics. Depending on the company and its setup, the finance department can be involved to contribute to payment terms, agree to letters of credit and manage cash flow. This is an obvious exercise and it sounds simple when taking the concept of value chains into consideration. If it is crucial to deliver high-quality products in reasonable time at competitive cost, all stages in the value chain have to receive the right material at the right time. The smooth execution of customer orders requires everybody involved to focus on the same target: total customer satisfaction. Contrary to this concept, most companies have introduced individual Key Performance Indicators (KPIs) for every department separately. Who could blame the manufacturing manager for demanding only top-class raw material or the most reliable supplier if his task is to produce “just in time”, reduce the reject rate and keep inventory levels low? The manufacturing manager will simply not care for the cost of goods sourced


8 The human aspect in sustainable change and innovation

because it plays absolutely no role in his list of performance indicators. Assuming the sourcing manager is held responsible for price reductions and availability of material, are these targets not conflicting? This fictional company will quickly find itself in the position that everybody is pulling the same rope, but in different directions. Who would win the battle, especially if personal bonus payments depend on the level of goal achievement? Furthermore, a proper interface definition also looks at the acting persons (defined as names or job functions) and defines who exactly has to take decisions. The principle of four-eye decisions is expanded to six or more eyes. In the context of the procurement example, the manufacturing, R&D and quality leaders would all have to agree to the proposal made by sourcing. They could use evaluation forms and would discuss all pros and cons to finally sign off on the ultimate decision. In this way, they share the risk of failure (instead of a single person taking the decision and being held responsible) and also improve communication. Throughout the process the team shares the need for transparent decision-making and traceable accountability. This little example shows how important it is to manage any interface and to define all incoming parameters as well as the outcome specification. The better everybody understands the interface's definition and the individual's share in it, the more efficient it is. The critical definitions have to come from the management, and it is the responsibility of every executive leader to maintain the defined balance in each interface by frequently reviewing its overall performance. The benefit of well-defined, established and managed interfaces is that the need for change is easily detectable and its effect can be simply measured. Any diffuse organization with uncertain responsibilities inevitably leads to mismanaged situations.
The task of executing change is either hard to assign or it is given to someone who has to fight it through against the resistance of his colleagues. They will claim that the executing individual has neither the authority nor the power to implement any action. The battle for competence and power will make the implementation extremely challenging. The desired effect will finally fizzle out. The well-defined interface will remain well defined because the required change is assigned to the person in charge. Together with the necessary resources and the helping hands of all involved, change can be implemented quickly and effectively. A famous example is the work organization of the Toyota Car Company's assembly teams. All work tasks are clearly regulated, described and monitored. Expectations are defined and the work outcome is permanently controlled. The individual is held responsible for his work and its quality. If workers experience a problem, they pull a cord and co-workers from the preceding and subsequent work steps come together to discuss the issue, identify the root cause and where it appeared, and find and implement a solution. Later, they monitor whether the implemented change is sufficient and track whether the problem is solved for good. If they cannot find or agree on a solution, a member of the next hierarchic level has to be informed according to defined levels of escalation and will support the immediate problem-solving process or take other decisions.
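The escalation logic just described can be reduced to a simple loop: try to solve the issue at the lowest level and move up the hierarchy until some level has sufficient authority. The following is a minimal sketch under that assumption; the function name and the numeric authority values are illustrative inventions, not part of any real Toyota procedure.

```python
def resolve_issue(issue_severity, authority_levels):
    """Walk up the hierarchy (level 0 = the assembly team) and return
    the first level whose authority suffices to solve the issue.
    Returns None if even the top level cannot decide."""
    for level, authority in enumerate(authority_levels):
        if authority >= issue_severity:
            return level
    return None

# A minor defect is handled by the team itself (level 0); a severe
# one escalates two levels up before someone can act on it.
print(resolve_issue(1, [2, 5, 9]))  # 0
print(resolve_issue(7, [2, 5, 9]))  # 2
```

The essential property is that escalation continues until the problem is owned by someone with the authority to act, mirroring the defined levels of escalation described above.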


8.2.2 The Healthy Organization

Let us assume that an enterprise defines, manages and controls its interfaces, all roles and all responsibilities. The resistance throughout the organization is still unmanaged and uncontrolled. Not knowing the reasons for change initiates resistance. In some cases this is aligned with the unwillingness of the leaders to exchange opinions. Maurer looks at resistance as a force that keeps us from attaching ourselves to every crazy idea that comes along. If an organization cultivates resistance and sensitizes itself to human nature, it will produce better ideas and a faster turnaround on change. A necessary prerequisite is a certain type of manager, one who is able to change his attitude towards resistant employees and develop a healthy organization. This organization empowers managers to resist ideas that originate in higher hierarchic levels. Every employee openly shares ideas and participates in a culture of discussion. A healthy organization is willing and empowered to improve initial ideas and shift opinions. Being healthy implies having critical thinkers identified, accepted and involved at every level of the hierarchy. The appreciation of feedback and the willingness to shift opinion based on reasons need to be practiced top down and should be lived as a business philosophy. Project teams must be put together based on both their criticality and their professional experience. The more diversified a group is, the more facets are reflected and the more options are considered. It is critical for any environment to have strong characters with different opinions, whether it is a team, an enterprise organization or a group of peers. It is to the benefit of the leadership (at every level) to cultivate at least a few dissidents rather than to surround oneself only with those who just say what one wants to hear.
If resistance is used as a productive tool, it enables a positive environment of trust and honesty with an open dialog between all levels of the hierarchy. In such an environment alternatives can be considered carefully and thorough discussions can evaluate every option. It is the duty of an organization's leadership to carefully select every member in the structure. Factors like specialist knowledge, leadership skills and personality play a major role when selecting the appropriate candidates. A healthy leadership has the guts to selectively pick those who are uneasy, unconventional and dare to speak up. It is in every manager's hands to choose the appropriate team members – the healthy way is definitely stony and requires more effort from the beginning. I have chosen this way many times now and I have always been rewarded with excellent feedback, a proactive flow of information and an extraordinary team spirit. All this contributes to a lot more than just the sake of implementing and sustaining change. It is of the highest importance to be aware that a manager is not responsible for his direct reports only. The managerial duties also apply to the levels below; thus it is in every manager's interest to have close communication beyond the direct reporting line. This ensures the translation of business visions from one level to the other and their proper explanation. At the same time it guarantees that everybody is and stays focused on that same target. In well-functioning organizations the vision is clear and broken down so that the important parts get executed where needed. Many companies suffer from over-communication. Visions, missions and updates are sent weekly if not daily, and those who actually execute on these missions are not capable (due to IT access or
language barriers) of understanding the message. Again, a good example of doing it right is Toyota. They have their targets visually broken down so that every employee can easily see, for example, how many cars need to leave the factory per day to make the monthly promise, and they get their personal goals aligned. Especially micro-managing organizations need to reconsider their position. As Antoine de Saint-Exupéry states: “If you want to build a ship, don't herd people together to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.” It cannot be overstated how important the information flow within an organization is. Sharing the vision means providing the right parts of it and transforming the vision into executable and measurable tasks. A proper interpretation of the vision has to be communicated through the hierarchy with a clear and understandable obligation. From the top down, managers have to sit down with their staff, explain the part of the vision they own and then break this part up into workable portions. These parts of the vision get assigned to the manager's direct reports. The latter then do the same with their teams. In this way, the organization is truly focused on the common target, and individual interpretation and/or cherry picking is barred. The proper break-up and exact goal setting should be controlled over (at least) two hierarchic levels. A manager needs to set the goals for his direct reports, and the manager's superior verifies that the set targets truly match the overall vision. The vision itself is ideally defined long term. When breaking the vision down into manageable actions, take the time horizon into consideration. There are decisions made today with an immediate impact and there are actions paying off tomorrow. The leadership role of the top management is by definition oriented to the long-term future.
Visions of growth, EBITDA margin and profitability surely need to be transformed into actions with a time scale of months or years. This is the leading part of the organization. However, a worker on the assembly line needs to get his tasks in a more compressed time horizon. The vision of job safety, salary increases and promotion prospects can be achieved by every day's performance. Executable tasks need to be laid out in hours or shifts. This is the acting part of the hierarchy, and its targets need to be monitored in the short term. Led by the management, the vision gets executed by the shop floor personnel. The responsibility is spread equally over the hierarchy. One could not succeed without the other – leaders need actors and vice versa. The organization needs to be perceived like a Swiss watch: every cogwheel is equally important, no matter what size. The real difference is in the relative ratio of leading and acting. This ratio varies, and naturally it is at 100% lead at the top of the hierarchic pyramid and at 100% act at the bottom. Draw an imaginary line through the hierarchy at the level where act and lead take the same share. Anyone below that level is likely to be more on the workers' side. This fictional border is extremely important. It is this hierarchic level that is the most important interface. It should be seen as the front line in communication and needs to be supported in an extraordinary way. Those employees are the ones that will receive their part of the vision in a more lead-oriented form and will need to pass it on with a strong focus on short-term execution (act). They need to transform the business vision into something understandable, no matter what
language, cultural context or expectation. The front line in the organization is where the sorrows of team members get shared. At the same time, those working at this front line stand in for the top management's decisions on a day-to-day basis. Those at the front line need to be embedded properly in the decision process, or at least be well prepared and receive all information and training up front. Imagine a shift leader on night shift being responsible for a handful of workers with different backgrounds and histories. Today, a certain change in the business strategy gets announced via e-mail by the management. This shift leader is the single point of contact that every shift member will turn to. If the shift leader's supervisor did not prepare him with advance information or supply him with answers to the most likely questions, how can one believe that the shift will continue to focus on the job? Everyone on the shift will spend a great deal of time with recurring questions like “does this affect me and my family?” This time is costly and must be avoided. I am not claiming that the less sophisticated are not capable of comprehending financial data or business numbers – the opposite is true! But remember that a rational view of the uncertain new situation might be dismissed if the change affects oneself. It would be better to hand out individualized communication packages with background information and a few answers to likely questions. This information, sent out shortly before or even together with the announcement to all relevant functions, is necessary, especially in times of major business change. If done right, the abstract decision is explained and can be discussed. Concerns may be addressed and the focus will likely remain on the work rather than on something else. Depending on the impact of the change, the teams coming together have to be well prepared. Key personnel need frequent training in communication and in how to handle such an extraordinary situation.
Most change does not come as a surprise overnight, and communication can be prepared. The official announcement is only the tip of the iceberg. Preparing the key information multipliers with the required information is mandatory for sustainable change. Town hall meetings with all employees are good for maintaining communication regarding the change process, but only a few will have the courage to express personal fears or raise questions in public. It is best to break up the enlarged meeting into work teams and further explain the situation in smaller groups. This chapter is not about leadership or professional management, and there are many other articles, publications and books available. Nevertheless, one aspect seems important to me when discussing communication and organizational responsibility. It is a question about leadership itself: how many employees can possibly be led effectively by one person? In times when efficiency and productivity are the main drivers of business decisions, many (especially larger) companies tend to merge departments and divisions. In order to make the organization financially more efficient, groups and independent teams are put together under just one leader. The number of direct and indirect reports keeps growing continuously. To me, the maximum number of direct reports should not exceed five to seven. Following the logic of effective vision communication, any manager is responsible for at least the second hierarchic level below himself. For example, a manager with five direct reports would need to work intensively with all 30 (25 + 5) individuals. If one takes the leadership role seriously, this consumes a great amount of the daily
working time. On the other hand, the fewer hierarchic levels exist, the more day-to-day business will inevitably end up on the manager's desk. At a certain point the leadership time gets crowded out by the excessive time that functional involvement requires. To me, that stage has been reached in the majority of companies and it is an organizational disaster. It is not my intention to judge but to alert. Any organization decides how much a manager is a people leader or a functional worker. But my deepest belief is that anything above 40% functional work will come at the cost of effective leadership and good communication. Time is the resource of today and especially of the future. It will not take long until (especially) highly educated professionals ask for more balance between job hours and private time, and thus for a better work-life balance. There is, and will further be, a trend towards flexible and effective working with new work structures such as home office, division of labor, etc. Time already is, and will certainly become, a more satisfying form of payment than pure monetary salary. Modern organizations will have to react to the fact that more managers (male and female) value family time over career. Any organization's challenge for the upcoming years will be to make use of the existing resources as effectively as possible. The Lean principle with its concept of labor efficiency falls far short of being able to model this. People management and modern leadership will have to focus much more on the individual's involvement in e.g. teamwork, projects or cross-functional problem solving. Thus the manager shall have more time to identify and promote talents and, at the same time, encourage those who are less motivated. Dealing with the latter is extremely time consuming but necessary, if taken seriously. Assume a manager takes the time and spends it on a one-to-one conversation with a critical individual.
Then the individual's supervisor needs to be coached and instructed as well. All this requires at least three people's work time. The alternative is to do nothing, but this will consequently demotivate the entire group. Assuming that there is a fixed proportion of less motivated people in every organization, enlarged teams are not necessarily easier to manage, especially as hierarchic levels might have been eliminated. Following the idea of sustainable change, the new manager will have to focus on all individuals, ultimately identifying their talents and level of motivation, working through communication and structural issues and finally building a new team spirit. How can this person possibly find the time to work on the newly assigned job tasks?! This conflict gets even worse if those who have the most professional qualification are promoted to manage the merged department, and not the ones with excellent leadership skills. The number of professional and leadership tasks suddenly overloads the new manager and there is a chance that the individual will get sick or leave the company. The worst case is if the manager decides to set priorities towards less people management. Then the frustration in the team increases, with similar consequences for the employees' health and the staff turnover rate, but multiplied by the number of reports. Leadership and people management cannot be learned as easily as professional skills. The higher the rank in the hierarchy, the less important professional skills are and the more important leadership and emotional intelligence become.
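The span-of-control arithmetic from the example above (five direct reports imply 25 + 5 = 30 individuals over two levels) can be sketched in a few lines; the assumption of a uniform span at every level is of course an illustrative simplification:

```python
def people_led(direct_reports, levels=2):
    """Total number of individuals a manager is responsible for,
    assuming every manager below has the same number of direct
    reports, down to the given number of hierarchic levels."""
    return sum(direct_reports ** i for i in range(1, levels + 1))

# Five direct reports, each with five reports of their own:
print(people_led(5))  # 30, i.e. 5 + 25
# At the suggested upper bound of seven direct reports:
print(people_led(7))  # 56
```

The geometric growth makes clear why even a modest span of control quickly consumes the leadership time discussed above.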


Healthy organizations consist of smaller, highly specialized teams supervised by those who continuously prove to be motivators and enablers. The base of the healthy organization is the self-motivated worker who shares the management's vision and sees a clear perspective for himself. The worker's talents need to be discovered and his resources should be managed properly. It is the managerial task of any supervisor or leader to bring out the best in their teams. The hierarchic pyramid needs to be turned upside down, putting the worker and his individualized work environment in focus. Those directly involved in the value chain need to be supported most. The front line described above is what needs the highest attention, as this is where the managerial vision is practically put into action. Here is the interface that makes businesses successful. The entire remaining organization is supportive only. Empowering the organization means developing manageable structures with clearly defined interfaces, responsibilities and authorities. A healthy organization is equivalent to hierarchic emancipation.

8.3 Innovation Management

Long-term business existence and consistent success are issues of the highest importance. Both rely on continuous innovation. As shown above, innovation is not being the first but being better than the others. Google started with an innovative algorithm to search the Internet, but they did not invent the search function itself. They came into business by advancing existing technology. Only later in their history did they introduce revolutionary technology like Google Earth, the virtual street view or a free online counterpart to Microsoft's office software package. Furthermore, they were the second (after Apple) to develop their own smartphone. Innovation is Google's backbone, and each employee may spend 20% of his time on non-job-related issues. Google's success is based on diversified employees and the trust that freedom of thought will result in creative ideas. As much as I support this concept, it is hardly applicable to those in traditional businesses with non-virtual products and rather high manpower costs. Those who operate in the real world need more basic tools to generate and implement innovation successfully. The recently introduced industrial operational excellence programs use Lean, Six Sigma and other problem-solving techniques. Lean relates to the Toyota Production System, and Six Sigma is a statistical program that originated at Motorola in the 1980s. Both systems are common answers to the same problem: structural operational improvement. In many organizations the two are implemented together. Lean focuses strongly on the executing workforce and its ability to prevent failure immediately. In Japan, detecting defects is not viewed negatively or, to phrase it differently: the Japanese control the process so as to detect issues as soon as possible. This, in combination with a culture of deference to authority, worked well for Toyota.
In Japan, no one would change work procedures or dare to question the importance of details. Only if there is a problem is the belt stopped. In the event of a shutdown, the team comes together and discusses the issue. Together they find a solution
and implement corrective actions immediately. If the issue is beyond the team's authority or cannot be fixed within a certain, well-defined time frame, the problem gets escalated to the next hierarchic level. That level is equipped with more authority and might have the power to either empower the team or call for help. Continuing the escalation until the problem is solved guarantees that eventually something will be done about it. The belt stands still until the solution is implemented. It is a common aim to restart the belt as quickly as possible but also to strive for perfection, even if that ties up resources. This is the opposite of most Western cultures, where individuals tend to hide or cover up their mistakes. We tend to believe that errors will either not matter or that someone else will fix them later anyway. The culture is fundamentally different, especially in the individual's identification with the product. While most Western employees work for money, the Japanese work culture is close to a second family. I am neither glorifying the one nor criticizing the other – it is important to understand that a program is based on a culture and cannot easily be copied to an organization with a different culture. The individualized Western civilization requires corrections to the program. Imposing Lean, as practiced in Japan, on a central European work environment will fail. There are multiple examples of companies being successful with Lean and Six Sigma. Managing innovation is doable with the various programs and systems. In many companies, the traditional organizational structure has been adjusted to match the newly introduced improvement programs. Departments with specialists and expert know-how are introduced and, in many cases, the talented employees have been moved there from elsewhere in the hierarchy. The introduced program creates localized innovation while the entire surrounding organization merely reacts to the input.
Hot spots of innovation and creativity are not enough to inspire an entire business. The innovation must come from within the organization itself and must derive from the deepest wish, present in every human, to create and innovate. Everyone should have the same opportunity to put innovations and improvements into practice. Exclusive programs will not succeed in sustaining the change they implement. Everybody is innovative. Inventions are as old as mankind and it is not surprising that people continue to invent even today. Thus, making use of ideas and developing creative solutions has become more vital for economic organizations than ever. The preconditions for innovation are adequately discussed, but the question remains how to bring the best ideas forward and how organizational structures can support this. Operational Excellence continues its triumph in almost every enterprise. The larger businesses allow themselves the luxury of allocating resources to new departments or structures to implement and execute the systematic improvement. The smaller companies introduce those excellence systems within their existing structures. Lean, Six Sigma, production systems and innovation programs are introduced to ensure the implementation of innovation management and systematic improvement. Black Belts are put almost everywhere with the clear objective to discover and eliminate waste. Statistical tools are installed, training sessions are held, and the company policy gets adjusted to the new philosophy. Consequently, there is a lot of change in the organization: new faces are employed with sophisticated new titles (many of them in English) to execute on those fancily named programs. Imagine being a worker at
shop floor level, facing the structural shift and discovering the situation. You would be challenged with a double impact: (1) improvement projects will unforeseeably change your work environment, and (2) you will have to accommodate yourself to a diversified organization with additional reporting and responsibility structures. The individual's position becomes more diffuse and established organizational structures change. This particular change is driven top down and is littered with terms that most workers do not comprehend. The acting participants are different from those who have to live with the alteration. How can one believe this will work without friction, or at least without some resistance? The situation gets even worse when facing the fact that the average retention time of most improvement leaders (e.g. Black Belts) is a little over two years. The duration of stay is influenced by the project's size, its importance and the company's attitude towards its own program. Would it not be silly to promote the project manager who has just familiarized himself with the team, figured out shift structures or even come close to the problem's root cause? Would you not agree that it could easily take up to two years to discover all this? I confess to being rather critical of these programs and their improvement leaders. For over 100 years, the industry in central Europe did not know these highly specialized and centralized functions. To me this is a consequence of the extreme reduction of manpower. Those working in an area for a long time and being highly experienced did not need intensive statistics or other tools to discover fundamental issues. I am not claiming that analytical (especially statistical) know-how can be replaced by experience, but I strongly suggest combining the two.
The young, inexperienced professional with his special knowledge and his technical ability is in a much better position when introduced (not only to peers and colleagues) by an older employee who is well respected and familiar with the situation. I predict faster and more sustainable results if the generations work together and combine their individual strengths. Team success strongly depends on the characters in the game. Successfully implementing improvement starts with preparing the organization for the improvement program itself. Thus, organizational development depends on the ability to think critically and on respect for age and experience. Neither can the program replace the experience, nor can we achieve profound effects without modern analytical tools. Sustainable process change and significant improvement cannot be prescribed. Many things need to fall into place to sustain change. If statistical figures, graphs and key metrics are presented to the management, the viewers need to fully comprehend what is shown to them and they need to grasp what consequences might come along with the improvement. The biggest issue with improvement projects (especially in complex processes) is that an improvement here can have a major impact elsewhere. Amendments should be real and not a statistical fake or an imaginary effect on colorful slides. As soon as the number of improvements or the speed of their implementation enables career opportunities, the chance of improper project management with short-term effects increases. The individual's promotion should be linked to the quantity of improvements, but mainly to the projects' quality. Responsible, caring program managers will always ask for the voice of the customer when evaluating a project's success. The customer
in my eyes is, by definition, the one who needs to operate and maintain the improvement later on. This is unlikely to be a manager or supervisor, but mainly and certainly the person dealing with the change afterwards. It only takes moments to judge the quality of the implementation, the documentation and the user's involvement during planning, execution and training. In a healthy organization, poorly managed change is identified immediately. Consequently, it gets attacked and stopped. Criticality and resistance are important factors, and they prevent us from making stupid mistakes or repeating the same mistake. Healthy, in that respect, also implies that the management takes the resistance seriously and values it. At the same time, communication is open and everyone strives for necessary change. A healthy organization will be able to adopt any trend quickly but, at the same time, will wisely adapt it to meet its needs.

8.4 Handling the Human Aspect

As outlined earlier, there are multiple aspects where the human interface impacts sustainable change. Many of the company-wide programs described contain change management within the improvement project phases. Unfortunately, none of them really looks into the human sensitivities within the change. University curricula do not prepare students for these upcoming challenges, nor are they included in management or leadership seminars. Any new leader faces this fundamental issue when being promoted into a leadership position. Together with the new professional task comes the responsibility to lead a team. The challenge of leading is accompanied by the struggle to compete with colleagues and other departments. The issue with this overload lies in poorly defined roles and responsibilities and the improper preparation of the individual himself. The ignorance of proper leadership is carried from one level to the other, and sooner or later the entire organization lacks human interface management. Some leaders do learn from experience. Throughout our professional lives we experience managers with positive as well as negative leadership attitudes. It is important to remember that nobody is perfect. It is of the highest importance to take on some of the positive abilities presented by managers and peers. The principle of successful lead-management is “treat others as you want to be treated yourself.” Good and pure people management is not easy, and a lack of time for leadership induces most interpersonal issues in companies. Effective tools to manage the human aspect in sustainable change are desirable. Their effectiveness strongly depends on the overall company's attitude towards this aspect and every individual's intentions. Each of the following topics is an important factor by itself.
But when they are combined into a leadership vision that is actually lived, they increase engagement, motivation, identification and, finally, productivity and revenue. Focus on the human aspect is not only a matter of sustainable change but also a chance for sustainable business success. The following suggestions are neither ranked
nor are they a guarantee of success. They are meant to add some practicality to the paragraphs above.

8.4.1 Communication

Team meetings should be held regularly (at least bi-monthly) with a clear, pre-communicated agenda and an open slot for critique and suggestions. The team should be selected by function, such as shift leaders or regional marketing managers. On recurring occasions, an extended team, including e.g. the deputy shift leaders or the most experienced marketing manager, should get an opportunity to be heard. They also need their share in hot-topic discussions and must have a chance to express their opinions. Especially in times of change, listening frequently to those affected can reduce tension and resistance. Remember that those who resist are not necessarily against the change. Change pushed through by tension and force is likely to increase the resistance. Team meetings are excellent tools to develop a common understanding and to communicate the status quo. Sharing information with the team enables a broad discussion and allows everybody to participate. The major disadvantage of multi-person meetings is that those who dominate the discussion are usually those who would have expressed their opinion anyway. You need to reach those who are quiet and meet them in personal conversations. One-to-one meetings should be held frequently with the leaders, but also with those known to be the unofficial leaders. Communicate and involve them during decision-making, with special focus on those who struggle to accept the change. Special attention is needed for those who are extremely reticent. They are unlikely to be brave enough to speak up among others, but you need to find a way to get to their viewpoint, too. An open atmosphere in a non-business environment (the cafeteria or a colleague's office) will help to establish this channel of communication. The individual needs to be, and feel, safe to disclose his opinion or express deep concerns.
Follow-up meetings are important to double-check that the person is still okay and on track, but also to clarify any misunderstanding or misinterpretation early. One-to-one conversations are extremely time-consuming but indispensable. I have had very positive experiences when demanding critique from my direct reports. That way, you force them to think about what needs to be changed. Or, to phrase it slightly differently, the question to ask is: "What do I have to change in order to make you more successful?" One-to-one meetings can be planned as official meetings but could also be set up as a consultation hour on fixed dates. Getting to the crucial piece of information is difficult when dealing with people. You never know whether you are being told the truth or whether your counterpart is just playing politics. A good feel for people is the most essential characteristic of any leader, but if you really need a good average opinion, you should use anonymous communication tools. The critique box can be a letterbox or a web-based anonymous message drop box. Most companies, especially international ones, have compliance hotlines and ombudspersons. Here, people can address their concerns and questions and receive
help. This rather complex system might even deter people from using it. Users could get the impression that their concern is just not important enough to bother such a professional instrument. And if they do raise a concern, will it be treated seriously and confidentially? Certainly people do not want to feel, or be, disadvantaged for using the hotlines. An anonymous web-based interface is a simple tool through which every employee can submit inquiries. Quick and frequent feedback matters most, provided the company takes it seriously and deals with the requests. Handling the inquiries and dealing with whatever comes in is the challenge the leadership has to take on. A team consisting of the general management and representatives of the workers' union could process the incoming information. Depending on the type of feedback, the project manager and key users could provide facts to ensure a satisfying answer. In any case, both the question and the answer should be published close to where the input came from. Place them, for example, on the intranet and hang them on dedicated information boards for everybody to read; others might have had similar questions. The attitude and honesty of the answers is extremely important. The answer must be reasonable and comprehensible to the questioner. Write as clearly as possible and try to really answer the question. The tool is extremely powerful if used properly. With reliable communication, and if the sender is treated seriously and respectfully, confidence in the (project) management will certainly increase. Reliable communication, or "do not say it if you do not mean it," is an obvious suggestion. It sounds easy, but it is not. We all know that political statements are sometimes hard to believe. Take the criticality of national finances as an example. The deficit in most countries is so dramatic that it constrains the room for any political manoeuvre.
A statement in which someone promises tax reductions may feel implausible. Nevertheless, certain politicians manage to get attention although their main statement seems illusory. Whoever communicates wants to transmit a message. The key is the combination of the speech's content and the speaker's body language; a plausible combination of both can determine the receiver's emotion. Some people think of communication as a pure exchange of information. A very good counter-example is President Barack Obama. Even before he was elected, many people put great hope in him, mainly because he was perceived as a new type of politician. He received the Nobel Peace Prize after just a few months in office because he managed to convey the hope for a better world. It is important to communicate with a positive attitude and to transmit an optimistic message, especially in official announcements. You want to spread your optimism among the audience and want them to look positively into the future as well. Obviously the right words are needed but, more than that, you need to believe in them yourself! Many are skeptical about what they hear or what is promised to them because they have been disappointed many times before: promises never came true, and many projects that were sold too well got stuck halfway. Honesty and reliability are the foundation for the trust and cooperation needed to create sustainable change.

8.4.2 KPIs for team engagement

Once a project is started and the team is selected, you will find that there is an inner project circle driving the initiative. This so-called "core team" needs to be supported by experts who were not chosen to join it; the latter are called the outer circle. Both groups need to be engaged in the change activity, and every single player must stay focused, committed and satisfied. Frequent check-ups on the common understanding of the project's basis (target, approach and timing) enable you to manage the project effectively and to overcome identified hurdles quickly. You need to make your team's emotional involvement measurable and visible. It is a leading indicator for the project's, and your personal, progress and success. Communication, whether in team meetings or one-to-one, is vital for project management but hard to quantify. Questionnaires help to get a quick overview of where the team thinks it stands. Is your opinion on progress aligned with the team's view, or is your perception impaired? Modern online forms let you design questionnaires within a few minutes. Even the evaluation is simple: just select the recipients, click a few times and you are done. Unfortunately, there are a few more things to consider. To make the evaluation quick and easy, I suggest avoiding free-text fields unless you want participants to comment on a specific topic. Use predefined statements instead and ask the user for his level of agreement. The grading for your statements could be 1-5, where 1 represents the least agreement and 5 the highest. Alternatively, you could go for 0-100% or choose any other scale. A few examples of precise statements are:

1. The project is on track and we will finish by the end of the month.
2. Communication is reliable.
3. Project management takes on suggestions.

If you run this assessment frequently, you can detect disagreement early and take action if needed.
Focus on binary, precise and clear questions. Look at the examples mentioned above: asking whether communication is "reliable and frequent" puts the user into the dilemma of answering two questions at the same time. What if it is frequent but unreliable?! Also remember whom you ask, and prepare separate questionnaires for individual groups if necessary. Example: in the preparation for a major enterprise resource planning system (e.g. SAP or Oracle) implementation, a group of people is asked to rate how much the system is perceived to simplify their daily work. Assume the result is rather confusing, as about 50% view the system as a useful tool and the others do not. Furthermore, imagine that the more technically oriented group sees less use than the rest of the group. Would you not simply split the total into separate user groups to get a more differentiated answer? Tracking the number of supporters within the technical group could also be an excellent KPI for the commitment of the originally more skeptical user group. In addition, you have easily identified which group you need to focus on in order to increase the total engagement and commitment.
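The group split described above can be sketched in a few lines. The responses, group names and the "supporter" threshold below are invented for illustration; only the method itself, per-group agreement rates on a 1-5 scale, follows the text:

```python
# Hypothetical questionnaire results: (respondent group, agreement score 1-5)
# for the statement "the system simplifies my daily work".
responses = [
    ("technical", 2), ("technical", 1), ("technical", 3), ("technical", 2),
    ("marketing", 5), ("marketing", 4), ("marketing", 4), ("marketing", 3),
]

def supporter_rate(responses, threshold=4):
    """Share of respondents per group whose agreement is >= threshold."""
    totals, supporters = {}, {}
    for group, score in responses:
        totals[group] = totals.get(group, 0) + 1
        if score >= threshold:
            supporters[group] = supporters.get(group, 0) + 1
    return {g: supporters.get(g, 0) / totals[g] for g in totals}

print(supporter_rate(responses))  # -> {'technical': 0.0, 'marketing': 0.75}
```

Pooled, the sample above would show a confusing 3 out of 8 supporters; split by group, it reveals that the technical group is the one to focus on, and its supporter rate can be tracked over time as the commitment KPI mentioned in the text.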

The happiness check is a very simple way of monitoring the emotional baseline. Take 30 minutes, stand in the hall, walk over the site and look into offices, and just count the number of happy faces. The "smiles rate" is no solid scientific indicator, nor should you overvalue it. But, especially if you start somewhere new or launch a project in an unfamiliar area, it gives you a feel for the current morale. Frustration and resignation are sub-optimal conditions for launching a change project. If you find them, focus on identifying the cause of the dissatisfaction first and then try to use it in your favor. Develop an emotional lever to get people out of their lethargy. Convince them to help you change their situation. Encourage them to be part of the creative force rather than to continue complaining. The happiness check can also be used to gauge the attitude towards you and your project. Again, do not overvalue it, and consider the environmental circumstances: I would expect more smiling, happy faces during spring and summer than in autumn or winter. Visualization and non-verbal communication enable you to communicate with teams even without being present. Especially when interacting with the outer project circle, visualization is the tool of choice. Put as much as possible into graphs and pictures. Replace text blocks with bullet points and use short but precise wording. All communication has to be well-founded and reliable. Avoid speculation or guessing and, if you must speculate, separate your guesses and mark them clearly. Used as a standard communication tool, visualization reduces subjective perception. Print it as large as possible (e.g. poster size) and hang it in a highly frequented place such as the entrance hall, cafeteria, waiting area or the smokers' room. Start to pick random people and present your project to them using the poster. Ask what they think and whether they agree with your statements.
Even if the volunteers are neither impacted by the project nor involved in it, they might raise excellent questions or point out misleading information. You thus get the chance to inform people and spread the information you want spread. In addition, you may get detailed feedback free of charge. Put a feedback e-mail address or a telephone number on the poster so that you can be contacted with questions or suggestions later on. Keep the poster as a record, especially for the project documentation. To make your poster easily comprehensible, use dedicated areas for certain topics. Place a project summary preferably at the page's top or bottom and provide a status indication for the various sub-projects. Color codes or status bars can be used to indicate progress. Traffic-light colors are widely recognized: green equals "OK," yellow means "behind" and red stands for "critical." A status bar is more detailed and contains additional time-related information. You could use weeks or months, milestones or the number of items worked off. Simply calculate the percentage and illustrate it in a bar, similar to what most people know from downloading or installing computer software. Alternatively, or in addition, you could place a little marker over the status bar to visualize where you are supposed to be according to plan. Add a text box next to your status bar to point out the reasons for a delay and to articulate required actions and the help needed to speed the project up. Especially if your project is behind schedule, you need to think about proper communication to get the support
you need. The earlier you mark items as "behind" or "critical," the higher the chance to counteract. A first-year review is done to analyze the status of the project after one year. Many projects, even if planned and executed well, lack after-project support. Unfortunately, project resources are often cut back, or the responsible leaders are assigned to new tasks, before the final implementation. As a result, the users are suddenly in charge of finalizing the project and handling all the problems. What if those who are suddenly in charge were the least involved in the project? It should be part of managerial culture to ensure that expectations are met! Everybody buying a car will test-drive it before signing the contract. In professional life, somehow, executive attention ends with the announcement of the project's end. As soon as the first few results are reported, some managers believe in its continuation and, in many cases, focus on the next projects. Sustainable project management requires regular meetings to follow up on the status. This includes involving the users to ensure that the expectations really are met. Expectations, in that respect, can vary with the function. Unfortunately, the preferred project expectation is often merely whether the spending is within budget; functionality or performance as promised is second in line. To me, the primary expectation in any change is to sustain the desired new state. Changes of operational behavior, and the related improvements, depend on repetition. Be aware that learning can erode over time. There is no such thing as a quick hit: behavior practiced for years will not change in a month. Sustainability equals change for, and especially over, a long time.
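The status bar with a plan marker described in the poster discussion above can be prototyped in a few lines. This is a minimal sketch; the function name, bar width and milestone counts are invented for illustration, while the idea (percentage bar plus a marker showing where the plan says you should be) is the one from the text:

```python
def status_bar(done, total, planned, width=20):
    """Render a text status bar: '#' marks completed items,
    '|' marks where the plan says we should be by now."""
    filled = round(width * done / total)
    plan_pos = min(width - 1, round(width * planned / total))
    bar = ["#" if i < filled else "-" for i in range(width)]
    bar[plan_pos] = "|"  # plan marker overwrites that cell
    pct = 100 * done / total
    return f"[{''.join(bar)}] {pct:.0f}% done"

# 9 of 20 milestones done, plan says 12 by now -> visibly "behind"
print(status_bar(9, 20, 12))  # -> [#########---|-------] 45% done
```

The gap between the last '#' and the '|' marker makes a delay visible at a glance, which supports the advice to flag items as "behind" or "critical" as early as possible.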

8.4.3 Project Preparation and Set Up

Stakeholders and team members play a critical role in any project. Team members are everybody working on the change process. By definition, a project team consists of multiple professions and departments. In many cases, the team consists of employees who do project work alongside their everyday jobs. Most companies have been through rationalization programs, and it may be that each position is staffed by only one employee. There are no spare capacities that can be pulled in full time to work on projects. Only if a major investment is to be launched might experts be made available, and even then a team member might be part of other teams as well. This person, with his individual character and profession, might play a more or less central role in each team. Thus, an individual might be critical for the overall success of multiple projects. In contrast to a team member, the stakeholders enable the change. Stakeholders are not part of the project team but supervise the activities. Project managers need to communicate with key stakeholders frequently. The stakeholders demand updates and supply help to overcome hurdles during the project execution. Take the following example: imagine the launch of a global marketing project. Delegates from the regional marketing organizations gather as the project team. The global product manager acts as the project leader, driving the initiative and being supported by the team. Stakeholders could be e.g. the global marketing manager and/or the global sales manager. It looks like the set-up is simple: roles and responsibilities are clearly defined, expectations seem to be set and frequent communication is ensured. This project should be straightforward, successful and sustainable. Reality proves that most such projects (especially the ones driven globally) are a debacle. The simple but devastating cause is that too many projects need to be handled by only a few team members. In addition, the individual is often left alone when it comes to balancing project time and everyday duties. And even if the individual were dedicated to full-time project work, those who ultimately execute it are probably not. Back to the example of the global product launch described above: imagine that the regional marketing managers were dedicated to that project only. What are their duties? They need to work with the regional marketing groups to prepare the promotion. They need to make sure that production is aware and ready for local manufacturing. Quality standards need to be discussed and agreed upon globally; in order to be effective and unique, they commonly need to be accepted locally. There are various local interfaces to work with in order to coordinate and control the information flow to and from all involved departments. If just one interface is not managed properly, or fails, the entire project is at risk. This is not an issue if one assumes all parties concerned are pulling the same rope. But what if conflicting targets are real? Do all the important projects that get launched almost daily really take into consideration that project members work in more than one team? What if a potential team member simply cannot be freed from daily duties? Would you agree that everybody is pulling the same rope, but in different directions?
Just like the business vision, projects need to be assigned top-down, and resources need to be planned and allocated reasonably. Most project managers tend to forget that there is more than what happens within their project. They forget that any action needs to be executed by someone in the outer project team. The executing individual is probably neither allocated nor planned for. Regrettably, it is the executing forces who set the pace of the implementation and, finally, the likelihood of sustainability. I claim that the number of projects within businesses is far too high to be managed properly. Or, to phrase it differently: fewer projects with proper resource allocation (at all levels) will increase the chance of successful and sustainable change. Team selection is an important act for any project. You should identify two or three experienced supporters, even if the task seems easy and the implementation is perceived as a no-brainer. Assigning a team does not necessarily mean holding meetings and sitting together for hours regularly. Any idea must grow, and the condition will only remain changed if someone feels responsible and takes care after the project is done. The earlier in the project the end users are involved, and the closer you keep contact with the key personnel, the more sustainable it will be. So why not build a small team, spread the responsibility and share the credit? Another important aspect is the potentially reduced resistance. The lone fighter, even with the smartest idea and a brilliant brain, is likely to fail. Selecting the right (trusted and respected) team members helps to knock down prejudices and gets you around roadblocks. And remember that it is not always easy to get the users' honest opinion. Most people would rather express their concerns to a colleague than to a project
manager. In order to be supported, you need to be perceived as helpful, and allies are needed to get to that stage quickly. The sooner you establish an honest working routine with the (informal) area leaders, the faster and more successful (and sustainable) your project will be. Focus on frequent communication and you might get all relevant information without even asking for it. Remember to deal with resistance openly and try to find the root causes rather than fighting it with force. Keep an eye on your project team's attitude towards the change. Once your team no longer believes in success, consider re-adjusting the project rather than the team! Delegation needs to be learned. Delegation means putting trust and confidence in your associates. The delegate represents the department and must have full support. Make sure you choose the right occasions to send delegates. Official meetings with the general management or the works council are probably not delegable. Imagine the effect on the attendees, and on the delegate, if he could not answer questions or even contradicted himself. Politically, you need to stand up for your project yourself. For any other issue, a delegate can be assigned. Some project managers keep complaining about being overworked and stressed. As much as I sympathize with that, I also question whether they delegate well. As stated above, choosing the appropriate team is the baseline for success and a good work-life balance. Select the team members who are capable and assign tasks to them. Make sure each delegate understands the goal and is not overwhelmed by it. Delegated tasks still need to be supported: offer your help frequently and request updates. Monitoring and controlling the project remains the project manager's responsibility. Delegation done wisely can reduce the manager's involvement in sub-topics and preserves his ability to see the big picture.
Proper delegation will get things off your desk, and the receiver might even see it as an opportunity. Most people I work with are grateful to participate in project work. They like to get away from the daily routine and to gain different insights. They are motivated by having a say in business decisions. Trust and esteem placed in the delegates will energize them and enable them to deliver their highest performance. Be aware that delegates speak in their manager's name and that they represent the entire project. Their decisions should be binding and must not be questioned or revised unless absolutely unavoidable.

8.4.4 Risk Management

Risk and opportunity evaluation is a necessary act in managing businesses. Some modern companies do not evaluate well and go for any new idea, especially if it comes from the upper levels of the hierarchy. A natural diversification between locations, cultures and regional requirements gets abandoned in favor of organizational simplicity. Dealing globally while being successful in local markets requires diversity. No doubt, markets are different, but demand is local. When the US car market still demanded high-horsepower vehicles, the trend in Europe and Asia was already moving towards lower emissions and city-friendly mobility. Products, and especially brand marketing, are focused on localized needs. Most companies fail to make use of the
different cultural strengths. Just think about Toyota's production concept: it works fine in Japan but needs corrections in order to work in Europe or North America. Business unit structures covering continents and countries need to manage this diversity properly in order to benefit from it. Opportunities can be straightforward in one country but carry high risks in another. Local regulations and laws differ between regions. Implementing solutions blindly, just because they worked elsewhere, is likely to create problems. Thus, look at the potential solution first. Try to understand what the possible consequences are when implementing the solution. Do you really have a similar issue that needs to be solved, and what might the result be of implementing a solution for a non-existing problem? Solutions are powerful, and ultimately sustainable, only if adapted reasonably. Global guidelines and programs should be made variable: they need to be flexible enough to comply with global policy once they have been adjusted to each region. Risks and opportunities are judged best locally. Nevertheless, there is a need to control the entire process and to set up a supra-regional organization. Risk management structures need to be considered carefully and should be maintained in any organization. If a business decides to introduce operational excellence or similar initiatives, the various options for setting up the organization need to be evaluated. Depending on these structures, the working attitude as well as the approach to managing risks and opportunities will be fundamentally different. One option is to centralize all global experts in one department in order to send them out as in-house improvement consultants. This concept's obvious advantage lies in a unified approach and standardized tools. Solution transparency is high, as all information about local projects is tracked and documented centrally.
Best practices and benchmarks are collected, evaluated and looped back into the organization. There is only a small risk that every-day duties overwhelm these resources: a separate improvement organization ensures that improvement leaders focus 100% on the task. It can even be an advantage if the improvement expert is unfamiliar with the area; too much expertise in a particular field carries the risk of routine-blindness. As with consultancy from an external company, the project's duration is critical, and thus the total success depends on the involvement and commitment of the executing teams (the outer project circle). An alternative set-up is to empower and reinforce existing structures and resources. Here, the local managers and their teams are trained to drive the improvement themselves. The advantage of this set-up is a direct (reporting) line to the process experts. A locally based improvement manager might instantly know what to change and whom to contact in order to get things done. By giving such people the right tools you might enable the change directly. Local people have a strong network, which lets them get to the cause quicker than non-locals. Furthermore, the non-local might be perceived and treated as an outsider even if employed by the same company. Getting things done often depends more on interpersonal relations than on organizational power. A well-known program leader, who ideally operates within a strong emotional network, might be able to reduce resistance towards change dramatically. That is exactly where the biggest risks reside: as much as the personal relationship enables quick execution, it equally prevents necessary cuts and/or drastic decisions. I had the pleasure of working with a highly knowledgeable process manager who had over 40 years of experience in the area. He told me not to ask for his opinion about changes to the process: he had known the existing process for too long to be able to imagine how it could be done differently. He suggested splitting duties: I would tell him what I wanted changed, and he would ensure execution, if doable. The combination of specific process experience and program know-how worked well in that case. Incorporating the program approach globally while executing it locally is the biggest challenge when designing corporate improvement programs. Who leads? Who decides? And finally, who executes? It is a question of core competences, roles and responsibilities and, finally, power. Once the disputes of "global capability versus local competence" and "program know-how versus process experience" are resolved, the improvement work may begin. The steering committee controls the overall project progress and direction. It approves milestones as they are worked off and keeps track of required next actions. Some project managers perceive steering committee meetings as an offense to their competence and professionalism. I consider steering committees helpful and important when it comes to sharing responsibility. Once the committee approves a milestone, it acknowledges your effort, and you have a regular platform to raise questions and concerns. Working with steering committees is easier if you have prepared an excellent meeting agenda based on the project's schedule and planning. First, you need to agree collectively on the level of detail in the project plan. It depends very much on the initiative whether a complete schedule for each individual action is required or whether it is sufficient to assign completion dates to milestones.
Second, set up a meeting and discuss your project plan with the committee as early as possible, and also agree on the KPIs used to monitor the project's progress. Once you start executing the plan without the committee's approval, it is hard to redo the fundamental planning. This applies to major investments in civil engineering or construction as much as to smaller projects. Spend sufficient time on the planning and ensure frequent and timely communication with the steering committee, especially in the project's initial stage. Third, the steering group should also discuss whether alternative plans and fallback positions are required in the event of obstacles to the master plan. You and the committee should agree on, and include in your project plan, a written procedure with proper definitions of who needs to be informed in the event of complications or failure. In addition, you should plan upfront which additional resources might be pulled in, or can be requested, in such a case. The definition of a crisis, as well as the planning for its management, is best done in advance. Some project managers are not aware that project delay or failure can easily cause major business issues. Mismanaged projects have the potential to harm the company beyond the pure fact of misspent investment money. Apart from a massive impact on the business's image (e.g. the Deepwater Horizon incident at BP in 2010), there can be legal liabilities. Imagine being responsible for a major capacity expansion project. A delay might have substantial impact on customers' material availability
and therefore carries high risks of contractual penalties. Would you not agree that it might make sense to share this responsibility? In order to do so, every level of escalation, as well as the communication paths, needs to be defined and approved by the steering committee beforehand.

8.4.5 Roles and responsibilities

Sustainable responsibility is one of the key factors in sustainable change. In order to sustain change, it is critical to keep up the force that initiated the alteration process in the first place. There is a difference between doing the right things and doing things right; this is a fundamental challenge. The residence time of the new state is determined not only by organizational structures but also by the managerial attitude towards project work in general. Assume that a reengineering project in the chemical industry is launched. A team of highly professional project managers and technical experts is put together, and all of them are dedicated to this task only. A budget is available, steering group meetings were held, and the targets and timelines are approved. Everything looks in order, and the project seems to be driven towards sustainability. The dilemma lies in different personal expectations that may contradict sustainability. What if the project leader was promised a bonus if the budget comes in under-spent by some percentage? Would you agree that his personal ambition and expectation are set already? What if the leading project manager knew upfront that he is to be promoted to manage the department after the project? Would he do it right, or do the right things?! Doing it right is not necessarily sustainable. Is it right if the project is within schedule and budget and minimum requirements are just met? Doing it right could also involve doing the right things, and eventually even overspending the budget, if that ensured advanced technology leading to a more reliable process. I am not proposing that one is better than the other. As a project manager you need to understand what the expectations are, and as a business you need to know what you want and how to achieve it. You should deal openly with the expectations and communicate them accordingly.
Albert Einstein once said that "we cannot solve problems by using the same kind of thinking we used when we created them." Change, if well thought through, should ideally solve certain problems or improve the status quo. Well thought through in this respect means not accepting that issues may arise elsewhere: relocating problems is not an option in sustainable change management. Sustainable responsibility also implies that the responsibility remains with the leader even after the project's closure. Coming back to the example of the reengineering project: would you not agree that the project manager acts responsibly if he supplies spare parts for the newly installed equipment? The area's maintenance budget should not have to fix project issues once the project budget and team are gone. Sustainable change needs holistic responsibility. Bedouins move from one place to another and, when they are gone, leave little to no trace on the environment. I was able to observe the same nomadic behavior in global engineering teams: they moved on to another project, but left little improvement and caused a great stir.

Inventor-driven change is the most natural form of implementing modifications. It lies in human nature to create, invent and improve. Historically, humans have always sustained a new status quo when they saw a clear advantage compared to the old state. The ideal contender to implement a change is the one who had the idea. Practically speaking, you should enable the inventor to take the lead in implementing his own idea. Who could explain the underlying problem and convince colleagues or co-workers better than the solution's originator? The inventor is motivated to ultimately solve the issue. Anyone else might give up after a few attempts; the innovator who believes in his idea will continue to try. In a professional setting, the amount of work spent and the total number of attempts certainly need to be restricted and controlled carefully.

My unfortunate but practical observation is that too many ideas never get implemented. Innovators do not share their ideas as they fear loss of intellectual property; many of them believe that their idea needs to be ready for implementation before disclosing it. Many companies have suggestion tools where employees enter their ideas and get rewarded if their suggestion is put into practice. It is a pity that once the raw idea is submitted, it is evaluated and implemented (if at all) by someone else. The originator has extremely little influence on the process, and his colleagues and peers often do not even know that the idea exists. Imagine a brain pool of ideas where everyone could browse, get inspired and add input and thought to others' suggestions. It would be like an open community in which the order of ideas and innovations is tracked carefully. Everyone adding reasonable input gains a share of the final idea or suggestion. The originator is nothing without those adding practicality, and vice versa.
You could take your standard suggestion form and add a couple of text boxes to the back. Document the basic idea on the front: describe the suggestion in as much detail as possible and name the originator, date and time. Publish it within the working community, and if someone wants to add to the idea, they use one of the boxes on the back. Thus the idea gets considered carefully and refined. It is important not to jump on the idea immediately; let it sit and age like a good red wine.

Imagine the inventors could be asked to lead their own idea into realization. What if you could equip them with resources and help them bring their idea to life? Only a minority of proposers do it purely for the money. I claim that an idea realized by its inventor(s) comes, per se, with longer sustainability: the implementation is done with brain and heart, and the inventor(s) will do the right things right. Even more important is that the idea's originators are known. Such a project will be perceived as change coming from inside the organization, whereas improvement introduced by outsiders is often viewed as imposed. Suddenly the change gets a name: indeed, name the equipment or improvement after the inventor and put a label next to it (if doable) in memory of the event. The emotional relation to an improvement made by someone known is different. Although the outcome of the improvement is obviously not any different, the handling and the sustainability are. It is the same emotional differentiation as between driving a rental car and borrowing one from a friend: in both cases the car is not yours, but you might treat them quite differently.

Self-discipline is a necessary precondition for any behavioral change. Discipline can be ordered and controlled, but once it derives from within a person it is a lot more powerful. Let self-discipline grow by naming the innovator, by teaching the principles underlying the intention and by presenting the expected benefits. You should never underestimate the users. If they lack the discipline, do not try to force change on them, as the change will fail and a lot of resources will be wasted. Discipline needs managerial leadership and increases in an atmosphere of trust and honesty. Work on the people first and try to create a team spirit; without the discipline, do not implement change.

I was once assigned to a process improvement project in a chemical synthesis area. Rather than starting to implement the necessary technical solutions, I tried to harmonize the way the shifts ran the process. I established frequent meetings to discuss the various operational philosophies. A common understanding of what the process is, what it needs and what actions should be taken set the baseline. This discussion took months until we came to a workable agenda. The common understanding resulted in a shared commitment towards standardized behavior. This is not necessarily self-discipline yet, but the resulting peer pressure enforces discipline. A shared commitment towards harmonized operations reduces the variability induced by human interaction. Once the impact is positive and obvious, discipline will follow naturally, but slowly. The human aspect is great and powerful no matter whether the improvement is technical or organizational, and it has two sides: support will help you to manage the situation easily, but destructive forces destroy even the best idea or manager.
It is extremely hard to push your ideas through against negative confrontation. In that case, you need subversive tactics: try to find the root cause of the rejection and identify those willing to support you. Start with the positive ones, and proceed in tiny steps to prove your concept right. If you fail, do not give up. Always go back to the entire team to discuss the results, adjust the trial and jointly agree to start over again. Make sure to work on the team's self-discipline by involving them in the decision process, and keep selling the advantage that is in it for them.

8.4.6 Career development and sustainable change

Stagnation and sustainability are two totally different things: stagnation is sustainable, but not vice versa. Stagnation maintains the status quo. A good indicator of a stagnating organization is when people state "it is OK the way it is" because "it has always been that way". Stagnating organizations are inflexible and unimaginative. As shown above, flexibility of mind and a creative vision are drivers of innovation. Organizations need to understand that the ability to accommodate new situations is a personal strength. Human resource departments should identify those candidates who are open-minded and who see the opportunities rather than the problems.

Creative employees need to be developed and deserve the chance to witness the organization's flexibility and opportunities. Employers need to make use of every individual's strength, independent of educational level and degree subject. Talent development in that respect requires accepting and managing risk. Imagine a successful employee working in an almost perfect position. Would it not be unfair not to promote this employee just because the organization has no successor for that position? Or would it not be sad if that same employee missed out on applying for a job just because there might be no option to go back? I am not in favor of massive rotation within organizations, but I deplore that too many creative heads are stranded within inflexible structures. There are simply too many underdeveloped talents. This major problem gets worse as the population ages; soon it will become even harder to recruit qualified staff. The problem is self-imposed, as resources are cut back and some organizations simply lack a minimum number of people. Human resources departments are degraded to counting heads rather than developing talents and increasing resource efficiency. Sustainable improvement is a tool to release some of those talents, who will then initiate improvement somewhere else. Here lies a fundamental opportunity for businesses: consider all the in-house experts already working in your structure. It is just a matter of will and a bit of clever investment in the right resources. Managing successful and sustainable change is not necessarily restricted to employees with university degrees.

Proper resource planning takes all of the above into consideration. It is critical to look ahead and prepare the organization for the future. If an experienced employee retires, the knowledge is gone forever. Specialist know-how cannot be transferred to a successor in weeks; it might take years.
Identifying the ideal candidate and training this person on the job are necessary requirements for sustainability. Success relates not only to know-how or experience; it often relates to the employee's personal and professional network. Building up such a network is a stony and slow process. Once established, it helps to generate ideas and reduces the risk of repeating mistakes over and over again. The successor needs to be trained properly. That includes the time needed to interconnect and to build up a network within the organization. The training also needs to incorporate professional preparation for potential leadership roles. Leadership is gained by experience, not by taking classes. Leadership skills gained through experience are even more important if the leader is not the manager and still needs to get things done. Cogency is more powerful than persuasiveness.

In order to plan and track a project's resources properly, I use mind maps to capture, sort and rank ideas. I prepare individualized action lists and assign them to people in a timely manner so as to be able to control their work. According to the slogan "how to eat an elephant – bite by bite," I define sub-projects and assign resources to them. I define the minimum requirements and assign a sub-project leader. Despite the need for planning, you should not overdo it; try to balance complexity and clarity. This can be done when one master plan shows an overview and multiple sub-plans and action lists are created for each sub-project, month or person. The level of detail differs with individual preferences and the project volume. Make sure to discuss the plan and its timetable frequently with the team, to control the status and to ensure the team's awareness of, and commitment to, the common goal. Even if you use computerized project planning tools, print the plan for the discussion and post it somewhere publicly so that everybody can see it. The more open and transparent the project is, the easier it is for the project team to stay focused and for the outer project circle to comprehend and actively participate.

"Best to the top" needs to be more than a phrase. It enables career opportunities for those who are willing and capable. It is unacceptable not to invest in the human asset. The careers of talented candidates need to be supported and coached by professionals. A transparent program with support at every level of the organization ensures a constant flow of talent. This implies that the lower hierarchic levels need more attention, as their number is higher and their (average) age is lower; thus this group contains more potential to be explored and developed. The senior leader who made it up from the lowest level in the organization is emotionally bound to the company. He brings expertise, experience and a working network that outsiders probably do not have. External options should be used only if the outsider really is better than the internal solution – true to the motto: best to the top.

Award and recognition systems are supporting tools for career development programs. Not everybody can follow a career path, but everybody wants to be recognized and rewarded for performance. The award and recognition system is not necessarily monetary in nature. It has to be set up so that managers and leaders can access it immediately: employees need instant performance feedback, and not only when things went wrong. Such a system could start with the allowance to buy ice cream during hot summer days, or let teams order pizza for themselves. Celebrating success empowers the entire organization to be proud of its performance. The system promotes the performer but may also, for example, provide loans to selected talents when they build a house.
This monetary system focuses on employee retention, and the double positive effect for the employer is obvious:

1. The employee is emotionally bound to the company and benefits from a low-interest loan. It is unlikely that this employee will leave the company during the payback period. Once the employee owns property, his mobility decreases, resulting in strong regional ties. This becomes especially important if the area is rather unattractive to work in (e.g. little industry or reduced availability of jobs).

2. While the employer uses the tool to motivate identified individuals, he might even make some money from the interest, depending on the company size. This money could be used to finance training opportunities.

Certainly, there is a risk of privileging individuals, and there will be discussions about the system's fairness. Thus, I stress the need to limit the value of the immediate recognition and rather to have frequent team events (sports, dinner, cultural events, etc.) and to recognize individuals with an award but without money. Those awards could be presented in a humorous way and should not be limited to work performance only. The employee of the month could be accompanied by a "we are glad you are back" award for someone who has been sick for a while. It is less about the award and more about the recognition and the team's expression of genuinely being glad that the individual is back. This also creates an emotional bond, which is so important for employee retention. Another approach could take the family situation into consideration: why not give "time" as an award? One could be rewarded with half a working day off and an entry voucher to a local swimming pool in order to take the kids for a swim. The award lies in additional quality time with the kids. The joy and fun will be remembered and might continue to have a positive effect on the employee's motivation.

8.4.7 Sustainability in Training and Learning

A culture of failure seems like a revolutionary concept. It implies that more learning experience derives from failure than from success. If failure is analyzed properly and the right conclusions are drawn, then this statement is true. In most cases, we do not question why something is working; only when it fails do we start to figure out how things work. After Thomas Edison invented the light bulb, he stated that he had found one way to make it work. In addition, he had learned over 500 ways not to make it work. His innovations are based on learning from failures. Conclusions drawn from results are driving forces in innovation and are not limited to technical processes. For example, realizing that the information flow within an organization is poor opens opportunities to restructure and gain potential business advantages. Improvement is possible only when the issue is detected and consequently analyzed and reported. A culture is needed in which nonconformity, failure and defects are seen as a chance. Employees need to feel safe to discover, report and admit mistakes. It is a managerial responsibility to protect the information source. The organization should reward employees for detecting failure, and the general culture should be to deal openly with mistakes. Focus on solving the problem, as Toyota does: bundle all available forces to improve the process.

Train to change and practice sustainability. Everything needs to be learned and practiced, even the systematics of change. In a professional environment, improvement is often linked to rationalization and economization and is therefore mentally connected to job cuts and perceived injustice. The training sessions should clarify the mission of the improvement process. Most companies provide tool training and teach how to manage change when introducing improvement programs.
The training should also teach how to think outside the box and how to generate creative ideas. There are several tools available for experiencing the creative thought process. A lesson could start with a small example demonstrating that some things are not impossible even if they seem to be. Take the following example: draw a box containing 9 circles (3 × 3). Ask your colleagues to connect all of these circles with just three straight lines. The task seems impossible. The key is to remove imaginary boundaries, see Figure 8.1:

1. There is no rule that drawing beyond the box is prohibited.
2. There is no rule that the circles must be connected at their centers.


Fig. 8.1 The solution of the puzzle in the text.
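As an aside, the geometry of this solution can be verified numerically. The sketch below (illustrative Python, not part of the original text; the coordinates and circle radius are my own assumptions) places nine circles of radius 0.15 on a unit grid and checks that a connected zigzag of three slightly slanted segments, extending beyond the box, touches every circle off-center:

```python
import math

def dist_point_segment(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)

# Nine circles of radius 0.15 centered on a unit 3x3 grid.
radius = 0.15
centers = [(i, j) for i in range(3) for j in range(3)]

# A connected zigzag of three segments, each with slope 1/8,
# meeting well outside the box; each segment clips all three
# circles of one row slightly off-center.
vertices = [(-1.0, 2.25), (5.0, 1.5), (-3.0, 0.5), (4.0, -0.375)]
segments = list(zip(vertices, vertices[1:]))

covered = all(
    any(dist_point_segment(c, a, b) <= radius for a, b in segments)
    for c in centers
)
print("all nine circles covered:", covered)  # -> True
```

With unit spacing and slope 1/8, each outer circle's center lies only about 0.124 from its line, so even fairly small circles are clipped; exactly horizontal lines through the centers would also work for disconnected lines, which is why the puzzle implicitly demands one continuous path.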

This little exercise could be a starting point. You could continue by showing example projects from within your company and explaining how change is managed and how the results affected the work environment. Most companies do not intend to cut jobs, but if it has to be done, make it clear upfront. If you want the change to be sustainable, the rules must be clear and fair. Cutting jobs in one area does not necessarily mean that people get laid off across the whole enterprise. Will the affected people fill open positions elsewhere in the organization, or is there even a chance for them to continue working as improvement leaders instead? Rationalization may free up resources desperately needed somewhere else in the company (please also refer to the section on career development).

Diversification is a source of different paths to solutions and of surprising learning experiences. As much as the overall business goals need to be defined unambiguously, the way to achieve those goals has to remain flexible. Too many boundaries will limit the creativity of the executing forces. Every problem has more than one solution. Different departments, locations or business units have different backgrounds and individual needs. The teams have to have the freedom to discover their own ways and to develop their individual solutions. This will assure a more focused approach and increase commitment, as the result is perceived as "our" solution. In addition, it broadens the experience baseline and fills the toolbox. Adapt the standard tools and adopt individualized ones. Every situation is unique, and consequently individual tools need to be developed to design the matching solution. This may take longer and probably requires more resources, but it is a fundamental precondition for sustainable success.

8.4.8 The Economic Factor in Sustainable Innovation

Employer attractiveness is an increasingly vital factor if companies want to maintain their business. The struggle for the best talents and various employee-binding strategies were described earlier. Business success in this respect depends on, but at the same time generates, attractiveness to those considered "the best". Porsche is still one of the most desired employers in Europe. It is the brand that sells but also attracts young, ambitious talents. Forming an employer brand is therefore of prime importance. The chance to be part of a highly innovative and well-respected brand is a major employee binder in itself. Choosing from the best is a luxury that only very few, huge industrial enterprises had in the past. In the last few years, companies like eBay, Google or Facebook have become extremely attractive to work for. It is the opportunity to bring ideas to life that attracts especially young people. Sustainable innovation and development are critical factors in developing existing employees as well as in attracting highly talented new ones.

Another remarkable factor can be discovered when comparing the success of Internet-based companies to that of traditional industries. By traditional industry I refer to, e.g., the automobile or consumer electronics industries. Even the banking and insurance sector can be called traditional, although it has become very much virtualized in recent years. The major difference between the two is the product application. Innovation, quick releases of new updates and an innovative website are the drivers for web-based companies. Quality in innovation is perceived as the ability to anticipate or even create tomorrow's demand. Who would have guessed 15 years ago that teenagers would possibly spend more time sitting at home alone chatting with their friends than communicating face to face? The innovation cycle in web-based companies is shorter, and innovation can be something that we, the users, do not even notice: a different programming language can reduce the total storage on the server or improve the download speed. The product's character derives from the various options the user can choose from. Google's browser, for example, can be personalized and is able to connect all services provided by the company.
Beyond the pure browser options, Google created an interface between the user and the web. Success will soon no longer be measured by the number of clicks or the number of members; there might be more value in the total data volume transferred and stored. The information collected by a website provider can be very powerful, as users unveil their privacy by uploading their lives. Thus, Internet-based companies needed new structures to align the business areas they operate in, unlike a car manufacturer, who will remain in the car manufacturing industry. The needs of existing product businesses are totally different. Products for use, no matter if real (a car) or virtual (a bank account), need warranties, contracts and quality inspections. The traditional business model is founded on making money with the product produced. Traditional industries focus mainly internally, on products or processes; the attention is inward rather than outward as in web-operating companies. Innovation directed outside the company can also be used as a marketing tool. Take eBay as an example: when it introduced its PayPal service, this financial solution had very little in common with the original trading platform. By launching it as a separate and individual tool, eBay received a lot of attention. The creative part is to imagine what users might want besides the service they came for originally. The focus on core competences is a reason why traditional industry is rather slow with such products and services. The recently established financial services owned by car manufacturers are just the first examples of aligning new products with services in the traditional industry. This is learning from the web-based companies' success.

Sustainability as a marketing and sales factor should therefore be seen less ecologically than strategically. Once you persist in offering an advanced service, continue to be the leader in technology or are recognized as the low-price source, you will bind your customers. Over the last 20 – 25 years, the German grocery store Aldi has been perceived as the leader in price. They sustain their lowest-cost image, and most consumers shop with the positive certainty of getting the best price. Do they really? Aldi (North & South) operates in more than 27 countries worldwide, and the founding brothers belong to the wealthiest people in Europe. The success is obvious and comes from a highly innovative and flexible organization. The Aldi concept is to sustain the high quality standards of no-name brands at low cost. At the same time, the reduced variety of articles improves turnover and leads to positive cash flow. Continuous and sustainable change, e.g. adding clothes to the product line, enables Aldi to maintain and widen their successful business model.

8.5 Summary

Successful businesses enable their employees to be successful. The workforce will become an irreplaceable asset as modern high-tech jobs relate less to manpower than to experience, know-how and emotional commitment. Those companies that retain their brilliant brains and experienced workers have a higher chance of future invention and faster market implementation. It is in the natural interest of any employer to retain those who have, or can gain, know-how. The innovative driving force will no longer be the privilege of those in R&D or engineering departments but needs to be spread equally into every single step of the value chain; especially in high-labor-cost countries, everyone is needed as a source of creativity.

As most processes are complex and have hidden issues, mathematical modeling is not always the primary choice as a starting point for change. The change needs to happen within the organization first. The traditional hierarchic constructions are already compressed, and many levels have been removed. This was done in the sincere belief that it strengthens the business. But getting more done with less hierarchy requires a different working attitude and altered organizational communication and commitment. Furthermore, it requires a shared vision comprehensible to everybody. The executing force needs to share the vision but also needs to understand what part of the vision is theirs. Communication skills, project and people management, and the ability to listen and to reflect on oneself critically are key necessities for successful and sustainable change. These conditions hold for project managers, but even more for the entire organization. Sustainability is obtained if the final users are not only part of the change right from the beginning, but are the drivers of change and innovation and interact equally and actively.
Modern hierarchic constructions need to turn the hierarchic pyramid upside down and focus on the satisfaction and interests of every individual. Talent development programs focused on retaining expertise, as well as a strong employer brand to attract talents, are key prerequisites for the future. Demographic change forces organizations to reconsider their attitude towards the human aspect. Successful businesses prove that non-job-related tasks can support the process of creativity and innovation and lead to the above-mentioned emotional commitment. Social values, whether in flexible working time, parental leave or educational leave, eventually create a different organizational environment. This will turn into the formation and development of a most creative atmosphere, which is the key to innovation and satisfaction. In order to make innovation last, successful employees in an emancipated hierarchy are necessary. The latter does not come about by chance but emerges when top managers accept resistance as a form of employee interaction throughout all hierarchic levels. We need to make use of the diversified strengths of every individual, but at the same time accept human weaknesses. Most modern companies simply cannot afford to refrain from using their human assets intensively. Focus on, and increased investment in, the human aspect will not only speed up innovation, market implementation and sustainability; it is the most valuable asset of all. Make use of this asset, but treat it with respect and do not forget that "there is a difference between listening and waiting to speak!" [61]

References


1. E.H.L. Aarts, P.J.M. Korst, and P.J.M. van Laarhoven. Pattern Recognition: Theory and Applications, chapter Simulated Annealing: A Pedestrian Review of the Theory and Some Applications. Springer Verlag, 1987.
2. E.H.L. Aarts, P.J.M. Korst, and P.J.M. van Laarhoven. Quantitative analysis of the statistical cooling algorithm. Philips J. Res., 1987.
3. E.H.L. Aarts and P.J.M. van Laarhoven. A new polynomial time cooling schedule. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, November 1985, pages 206 – 208, 1985.
4. E.H.L. Aarts and P.J.M. van Laarhoven. Statistical cooling: A general approach to combinatorial optimization problems. Philips J. of Research, 40:193 – 226, 1985.
5. Forman S. Acton. Numerical Methods That Work. The Mathematical Association of America, 1990.
6. Teresa M. Amabile, Regina Conti, Heather Coon, Jeffrey Lazenby, and Michael Herron. Assessing the work environment for creativity. The Academy of Management Journal, 39(5):1154 – 1184, 1996.
7. Patrick D. Bangert. How smooth is space? Panopticon, 1:31 – 33, 1997.
8. Patrick D. Bangert. Algorithmic Problems in the Braid Groups. PhD thesis, University College London Mathematics Department, 2002.
9. Patrick D. Bangert. Mathematik – was ist das? Bild der Wissenschaft, page 10, 2004.
10. Patrick D. Bangert. Raid braid: Fast conjugacy disassembly in braid and other groups. In Quoc Nam Tran, editor, Proceedings of the 10th International Conference on Applications of Computer Algebra, ACA, pages 3 – 14, 2004.
11. Patrick D. Bangert. Downhill simplex methods for optimizing simulated annealing are effective. In Algoritmy 2005, number 17 in Conference on Scientific Computing, pages 341 – 347, 2005.
12. Patrick D. Bangert. In search of mathematical identity. MSOR Connections, 5(4):1 – 3, 2005.
13. Patrick D. Bangert. Optimizing simulated annealing. In Proceedings of SCI 2005 – The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, 10.–13.07.2005, Orlando, FL, USA, volume 3, pages 198 – 202, 2005.
14. Patrick D. Bangert. Optimizing simulated annealing. In Nagib Callaos and William Lesso, editors, 9th World Multi-Conference on Systemics, Cybernetics and Informatics, volume 3, pages 198 – 202, 2005.
15. Patrick D. Bangert. What is mathematics? Aust. Math. Soc. Gazette, 32(3):179 – 186, 2005.
16. Patrick D. Bangert. Jenseits des Verstandes, chapter Einführung in die buddhistische Meditation, pages 165 – 172. S. Hirzel Verlag, 2007.
17. Patrick D. Bangert. Jenseits des Verstandes, chapter Inwieweit kann man mit Logik spirituell sein? Die Sicht eines Mathematikers und Buddhisten, pages 147 – 152. S. Hirzel Verlag, 2007.
18. Patrick D. Bangert. Kreativität und Innovation, chapter Kreativität in der deutschen Wirtschaft, pages 79 – 86. S. Hirzel Verlag, 2008.
19. Patrick D. Bangert. Mathematical identity (in Greek). Journal of the Greek Mathematical Society, 5:22 – 31, 2008.
20. Patrick D. Bangert. Lectures on Topological Fluid Mechanics, chapter Braids and Knots, pages 1 – 74. Number 1973 in LNM. Springer Verlag, 2009.
21. Patrick D. Bangert. Neuroästhetik, chapter Fraktale Kunst: Eine Einführung, pages 89 – 95. E. A. Seemann, 2009.
22. Patrick D. Bangert. Ausbeuteoptimierung einer Silikonproduktion. In Arbeitskreis Prozessanalytik, number 6 in Tagung, page 14. DECHEMA, 2010.
23. Patrick D. Bangert. Ausbeuteoptimierung in der Silikonproduktion. Analytic Journal, www.analyticjournal.de/ fachreports/ fluessig analytik/ algorithmica technol silikon.html, 2010.


24. Patrick D. Bangert. Increasing energy efficiency using autonomous mathematical modeling. In Victor Risonarta, editor, Energy Efficiency in Industry: Technology Cooperation and Economic Benefit of Reduction of GHG Emissions in Indonesia, pages 80–86. Shaker Verlag, 2010.
25. Patrick D. Bangert. Two-day advance prediction of a blade tear on a steam turbine of a coal power plant. In M. Link, editor, Schwingungsanalyse & Identifikation, VDI-Berichte No. 2093, pages 175–182, 2010.
26. Patrick D. Bangert. Two-day advance prediction of a blade tear on a steam turbine of a coal power plant. In Instandhaltung 2010, pages 35–44. VDI/VDEh, 2010.
27. Patrick D. Bangert. Prediction of damages on wind power plants. In Schwingungen von Windenergieanlagen 2011, number 2123 in VDI-Berichte, pages 135–144. VDI, 2011.
28. Patrick D. Bangert. Prediction of damages using measurement data. In Bernd Bertsche, editor, Technische Zuverlässigkeit, number 2146 in VDI-Berichte, pages 305–316. VDI, 2011.
29. Patrick D. Bangert. Two-day advance prediction of blade tear on the steam turbine at a coal-fired plant. In 54th ISA POWID Symposium, volume 54. ISA, 2011.
30. Patrick D. Bangert and Markus Ahorner. Modellierung eines Pumpenanlaufs zur Lebensdaueroptimierung mit der völlig neuen N-Körper-Methode. In Produktivitätssteigerung durch Anlagenoptimierung, number 29 in VDI/VDEh Forum Instandhaltung, pages 29–36. VDI/VDEh, 2008.
31. Patrick D. Bangert, M.A. Berger, and R. Prandi. In search of minimal random braid configurations. J. Phys. A, 35:43–59, 2002.
32. Patrick D. Bangert, Mitchel A. Berger, and Rosela Prandi. In search of minimal random braid configurations. J. Phys. A, 35:43–59, 2002.
33. Patrick D. Bangert, Martin D. Cooper, and S.K. Lamoreaux. Enhancement of superthermal ultracold neutron production by trapping cold neutrons. Nuc. Instr. Meth. in Phys. Res. A, 410:264–272, 1998.
34. Patrick D. Bangert, Martin D. Cooper, and S.K. Lamoreaux. Uniformity of the magnetic field produced by a cosine magnet with a superconducting shield. LANL EDM Expt. Tech. Rep., 1, 1999.
35. Patrick D. Bangert and Jörg-Andreas Czernitzky. Increase of overall combined-heat-and-power (CHP) efficiency via mathematical modeling. In VGB Fachtagung Dampferzeuger, Industrie- und Heizkraftwerke, 2010.
36. Patrick D. Bangert and Jörg-Andreas Czernitzky. Efficiency increase of 1% in coal-fired power plants with mathematical optimization. In 54th ISA POWID Symposium, volume 54. ISA, 2011.
37. Patrick D. Bangert and Jörg-Andreas Czernitzky. Increase of overall combined-heat-and-power efficiency through mathematical modeling. VGB PowerTech, 91(3):55–57, 2011.
38. Patrick D. Bangert, Chaodong Tan, Zhang Jie, and Bailiang Liu. Mathematical model using machine learning boosts output offshore China. World Oil, 231(11):37–40, 2010.
39. Patrick D. Bangert, Chaodong Tan, Bailiang Liu, and Zhang Jie. Maschinelles Lernen erhöht Ertrag. China Contact, 15(6):52–54, 2011.
40. D.M. Bates and D.G. Watts. Nonlinear Regression Analysis and Its Applications. Wiley, 1988.
41. M.A. Berger. Minimum crossing numbers for three-braids. J. Phys. A, 27:6205–6213, 1994.
42. Lutz Beyering. Individual Marketing. Verlag Moderne Industrie, 1987.
43. Marco A. D. Bezerra, Leizer Schnitman, M. de A. Baretto Filho, and J.A.M. Felippe de Souza. Pattern recognition for downhole dynamometer card in oil rod pump system using artificial neural networks. In Proceedings of the 11th International Conference on Enterprise Information Systems (ICEIS 2009), Milan, Italy, pages 351–355, 2009.
44. C.M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006.
45. Dan Bonachea, Eugene Ingerman, Joshua Levy, and Scott McPeak. An improved adaptive multi-start approach to finding near-optimal solutions to the Euclidean TSP. In Genetic and Evolutionary Computation Conference (GECCO-2000), 2000.


46. E. Bonomi and J.-L. Lutton. The asymptotic behaviour of quadratic sum assignment problems: A statistical mechanics approach. Euro. J. Oper. Res., 1984.
47. E. Bonomi and J.-L. Lutton. The n-city travelling salesman problem: Statistical mechanics and the Metropolis algorithm. SIAM Rev., 26:551–568, 1984.
48. M. Boulle. Khiops: A statistical discretization method of continuous attributes. Machine Learning, 55:53–69, 2004.
49. Wayne H. Bovey and Andrew Hede. Resistance to organisational change: the role of defence mechanisms. Journal of Managerial Psychology, 16(7):534–548, 2001.
50. Michael Brusco and Stephanie Stahl. Branch-and-Bound Applications in Combinatorial Data Analysis. Springer Verlag, 2005.
51. R.E. Burkard and F. Rendl. A thermodynamically motivated simulation procedure for combinatorial optimization problems. Euro. J. Oper. Res., 17:169–174, 1984.
52. V. Cerny. Thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm. J. Opt. Theory Appl., 45:41–51, 1985.
53. William G. Cochran. Sampling Techniques. Wiley, 1977.
54. N.E. Collins, R.W. Eglese, and B.L. Golden. Simulated annealing: an annotated bibliography. Am. J. Math. Manag. Sci., 8:209–307, 1988.
55. Peter Dayan and L. F. Abbott. Theoretical Neuroscience. The MIT Press, 2001.
56. John E. Dowling. Neurons and Networks: An Introduction to Neuroscience. The Belknap Press of Harvard University Press, 1992.
57. D.S. Johnson, C.R. Aragon, L.A. McGeoch, and C. Schevon. Optimization by simulated annealing: An experimental evaluation. In List of Abstracts, Workshop on Statistical Physics in Engineering and Biology, Yorktown Heights, April 1984 (revised version), 1986.
58. F. Catthoor, H. DeMan, and J. Vanderwalle. SAILPLANE: A simulated-annealing-based CAD tool for the analysis of limit-cycle behaviour. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, Oct. 1985, pages 244–247, 1985.
59. F. Romeo, A.L. Sangiovanni-Vincentelli, and C. Sechen. Research on simulated annealing at Berkeley. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, Nov. 1984, pages 652–657, 1984.
60. U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the 13th Int. Joint Conf. on Artificial Intelligence, pages 1022–1027, 1993.
61. Tom Foremsko. Twitter from cocoon.ifs.tuwien.ac.at, 2009.
62. David A. Freedman. Statistical Models: Theory and Practice. Cambridge University Press, 2005.
63. S. George. An Improved Simulated Annealing Algorithm for Solving Spatially Explicit Forest Management Problems. PhD thesis, Pennsylvania State University, 2003.
64. Walton E. Gilbert. An oil-well pump dynagraph. Production Practice, Shell Oil Company, pages 94–115, 1936.
65. Fred Glover and Manuel Laguna. Tabu Search. Kluwer Academic Publishers, 1996.
66. B.L. Golden and C.C. Skiscim. Using simulated annealing to solve routing and location problems. Nav. Log. Res. Quart., 33:261–279, 1986.
67. N. Golyandina, V. Nekrutkin, and A. Zhigljavsky. Analysis of Time Series Structure: SSA and Related Techniques. Chapman and Hall/CRC, 2001.
68. J.W. Greene and K.J. Supowit. Simulated annealing without rejected moves. IEEE Trans. Comp. Aided Design, CAD-5:221–228, 1986.
69. Martin T. Hagan, Howard B. Demuth, and Mark Beale. Neural Network Design. PWS Pub. Co., 1996.
70. B. Hajek. A tutorial survey of theory and application of simulated annealing. In Proc. 24th Conf. Decision and Control, Ft. Lauderdale, Dec. 1985, pages 755–760, 1985.
71. J.D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
72. T. Hastie and P. Simard. Models and metrics for handwritten character recognition. Statistical Science, 13(1):54–65, 1998.
73. Randy L. Haupt and Sue Ellen Haupt. Practical Genetic Algorithms. Wiley-Interscience, 2004.


74. Kenneth M. Heilman, Stephen E. Nadeau, and David O. Beversdorf. Creative innovation: Possible brain mechanisms. Neurocase, 9(5):369–379, 2003.
75. Klaus Hinkelmann and Oscar Kempthorne. Design and Analysis of Experiments, Vols. I and II. Wiley, 2008.
76. Douglas R. Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Penguin Books, 1979.
77. Torbjörn Idhammar. Condition Monitoring Standards (4 vols). Idcon Inc., 2001–2009.
78. Alexander I. Khinchin and George Gamow. Mathematical Foundations of Statistical Mechanics. Dover Publications, 1949.
79. S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
80. J. Klos and S. Kobe. Nonextensive Statistical Mechanics and Its Applications, chapter Generalized Simulated Annealing Algorithms Using Tsallis Statistics, pages 253–258. LNP 560, Springer Verlag, 2001.
81. D.E. Knuth. Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming. Addison-Wesley, Reading, MA, USA, 1981.
82. J. Lam and J.-M. Delosme. Logic minimization using simulated annealing. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, Nov. 1986, pages 348–351, 1986.
83. Rostislav V. Lapshin. Analytical model for the approximation of hysteresis loop and its application to the scanning tunneling microscope. Rev. Sci. Instrum., 66(9):4718–4730, 1995.
84. H.W. Leong and C.L. Liu. Permutation channel routing. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, Oct. 1985, pages 579–584, 1985.
85. H.W. Leong, D.F. Wong, and C.L. Liu. A simulated annealing channel router. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, Nov. 1985, pages 226–229, 1985.
86. S. Lin. Computer solutions of the travelling salesman problem. Bell Sys. Tech. J., 44:2245–2269, 1965.
87. H. R. Lindman. Analysis of Variance in Complex Experimental Designs. W. H. Freeman & Co., Hillsdale, 1974.
88. David G. Luenberger. Linear and Nonlinear Programming. Springer Verlag, 2003.
89. M. Lundy and A. Mees. Convergence of an annealing algorithm. Math. Prog., 34:111–124, 1986.
90. J.-L. Lutton and E. Bonomi. Simulated annealing algorithm for the minimum weighted perfect Euclidean matching problem. R.A.I.R.O. Recherche opérationnelle, 20:177–197, 1986.
91. P.S. Mann. Introductory Statistics. Wiley, 1995.
92. S. Martin, M. Anderson, I. Salman, V. Lazar, and Patrick David Bangert. Processes contributing to the evolution of two filament channels to global scales. In K. Sankarasubramanian, M. Penn, and A. Pevtsov, editors, Large Scale Structures and their Role in Solar Activity, ASP Conference Proceedings Series. Astronomical Society of the Pacific, 2005.
93. W. Maass. Efficient agnostic PAC-learning with simple hypotheses. In Proc. of the 7th ACM Conf. on Computational Learning Theory, pages 67–75, 1994.
94. M.D. Huang, F. Romeo, and A.L. Sangiovanni-Vincentelli. An efficient general cooling schedule for simulated annealing. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara, Nov. 1986, pages 381–384, 1986.
95. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092, 1953.
96. D.C. Montgomery. Design and Analysis of Experiments. Wiley, 2000.
97. C.A. Morgenstern and H.D. Shapiro. Chromatic number approximation using simulated annealing. Technical Report CS86-1, Dept. Comp. Sci., Univ. New Mexico, 1986.
98. Leonard K. Nash. Elements of Chemical Thermodynamics. Dover Publications, 2005.
99. Taiichi Ohno. Toyota Production System: Beyond Large-Scale Production. Productivity Press, 1988.
100. Esin Onbasoglu and Linet Özdamar. Parallel simulated annealing algorithms in global optimization. Journal of Global Optimization, 19(1), 2001.
101. R.H.J.M. Otten and L.P.P.P. van Ginneken. Floorplan design using simulated annealing. In Proc. IEEE Int. Conf. on Comp. Aided Design, Santa Clara, Nov. 1984, pages 96–98, 1984.


102. P. Salamon, P. Sibani, and R. Frost. Facts, Conjectures, and Improvements for Simulated Annealing. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2002.
103. Athanasios Papoulis and S. Unnikrishna Pillai. Probability, Random Variables and Stochastic Processes. McGraw Hill, 2002.
104. Oliver Penrose. Foundations of Statistical Mechanics: A Deductive Treatment. Dover Pub. Inc., Mineola, NY, USA, 2005.
105. George Polya. How to Solve It. Princeton University Press, 1957.
106. ProQuest. http://www.umi.com/proquest/.
107. D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
108. F. Romeo and A.L. Sangiovanni-Vincentelli. Probabilistic hill climbing algorithms: Properties and applications. In Proc. 1985 Chapel Hill Conf. VLSI, May 1985, pages 393–417, 1985.
109. B. Rosner. On the detection of many outliers. Technometrics, 17:221–227, 1975.
110. B. Rosner. Percentage points for a generalized ESD many-outlier procedure. Technometrics, 25:165–172, 1983.
111. Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall International, 1995.
112. S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Technical report, IBM Research Report RC 9355, 1982.
113. S. Nahar, S. Sahni, and E. Shragowitz. Simulated annealing and combinatorial optimization. In Proc. 23rd Des. Automation Conf., Las Vegas, June 1986, pages 293–299, 1986.
114. Ken Schwaber. Agile Project Management with Scrum. Microsoft Press, 2004.
115. C. Sechen and A.L. Sangiovanni-Vincentelli. The TimberWolf placement and routing package. IEEE J. Solid State Circuits, SC-20:510–522, 1985.
116. Amartya K. Sen. Collective Choice and Social Welfare. London, 1970.
117. Mike Sharples, David Hogg, Chris Hutchinson, Steve Torrance, and David Young. Computers and Thought: A Practical Introduction to Artificial Intelligence. The MIT Press, 1989.
118. J. Shore and S. Warden. The Art of Agile Development. O'Reilly Media, Inc., 2008.
119. C.C. Skiscim and B.L. Golden. Optimization by simulated annealing: A preliminary computational study for the TSP. In NIHE Summer School on Comb. Opt., Dublin, 1983.
120. R.F. Stengel. Optimal Control and Estimation. Dover Publications, 1994.
121. Chaodong Tan, Patrick D. Bangert, Zhang Jie, and Bailiang Liu. Yield optimization in Dagang offshore oilfield (in Chinese). China Petroleum and Chemical Industry, 237(11):46–47, 2010.
122. Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
123. P.J.M. van Laarhoven. Theoretical and Computational Aspects of Simulated Annealing. Centrum voor Wiskunde en Informatica, 1988.
124. P.J.M. van Laarhoven and E.H.L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel, Dordrecht, 1987.
125. R. von Mises. Probability, Statistics and Truth. George Allen & Unwin, London, UK, 1957.
126. Dianne Waddell and Amrik S. Sohal. Resistance: a constructive tool for change management. Management Decision, 38(8):543–548, 1998.
127. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C, 2nd edition. Cambridge University Press, 1992.
128. S.R. White. Concepts of scale in simulated annealing. In Proc. IEEE Int. Conf. Comp. Design, Port Chester, Nov. 1984, pages 646–651, 1984.
129. Wikipedia. Innovation.
130. www.buildabetterburger.com/burgers/timeline.
131. www.foodreference.com.
132. www.whatscookingamerica.net/History/HamburgerHistory.htm.

Index

accuracy, 3, 4, 10
action integral, 188
adaptive controller, 132
analysis of variance, 81
ANOVA, 81, 83
archive system, 39
artifact, 88
autocorrelation, 85
average
  ensemble, 15
  moving, 47
  time, 15
Bayesian, 71
benchmark, 53, 54
  self, 54
binarization, 92
Boltzmann's constant, 17, 21, 178, 179
boundary condition, 1, 15, 190
brain, 183
budgeting, 58
calculus of variations, 187
catalyst, 149
catalytic reactor, 148
cause-effect link, 82
center, 86
central limit theorem, 100, 105, 106
certainty, 4
change, 202
  resistance, 204
chemical plant, 53, 58, 148, 155, 189
classification, 130
clustering, 42, 46, 86, 88, 110
  k-means, 87, 109, 117
  center, 87, 88
  Lloyd's algorithm, 87
coal power plant, 96
combinatorial problem, 181
combined heat and power (CHP), 96, 197
communication theory, 111
configuration density, 173
conjugate gradient, 165
constraint, 1, 3
continuous problem, 181
control theory, 110
correlation, 44, 84
  spurious, 73
critical value, 73
curve fitting, 113
cybernetics, 110
data, 43
Debye's law, 24
decision tree, 46
design matrix, 81
design of experiment, 34, 81
determinism, 19
discretization, 38
distribution
  Gaussian, 106
  normal, 106
  posterior, 108
  prior, 107
  sampling, 107
domain knowledge, 124
dynamic control system (DCS), 39, 111
dynamometer card, 158
early-warning, 63
energy, 14, 16, 18, 20
ensemble, 15
enterprise resource planning (ERP), 54
entropy, 18, 20, 20, 21, 89, 90, 98, 99, 169, 170

P. Bangert (ed.), Optimization for Industrial Problems, DOI 10.1007/978-3-642-24974-7, © Springer-Verlag Berlin Heidelberg 2012


  Boltzmann, 21
  informational, 89
equilibrium, 16, 20, 21, 23, 168, 169, 171, 175
  change at, 186
ergodic, 16, 26
  source, 90
ergodic set, 26, 89
ergodicity, 25
  breaking, 27, 90
  theorem, 26
error
  type I, 72
  type II, 72
estimate, 68
estimation, 67, 68
estimator, 68
Euler-Lagrange equation, 188
extrapolation, 51, 128
extreme studentized deviate, 77
fault localization, 143
feature, 44
feed-back, 71, 71
filter, 47
fitness, 166
fitting problem, 78
fluid catalytic converter, 149
Fourier series, 52, 114
Fourier transform, 91, 98, 99
freezing in, 22
freezing point, 174
frequency, 71
function, 126
  analytic, 116
  goal, 112
  merit, 112
functional form, 126
genetic algorithm, 165, 166
  crossover, 167
  mutation, 167
goal, 35
granular catalytic reactor, 149
ground state, 22
heat bath, 17
Helmholtz free energy, 18
heteroscedasticity, 82
historian, 39
homoscedasticity, 82
hypothesis
  alternative, 73, 74, 75
  null, 73, 74, 75, 83

idea, 202
iid, 35
independent and identically distributed, 35
information, 43, 89
information theory, 111
injection molding, 135
innovation, 202, 213
instrumentation, 37, 38
insurance, 62
interpolation, 42, 51, 128
knowledge, 43
knowledge gain, 125
lag, 85
Lagrangian, 188
learning
  supervised, 94
  un-supervised, 95, 110
least-squares fitting, 78, 79, 113
  general linear, 80
linear correlation coefficient, 84
logistics, 118
Müller-Rochow Synthesis (MRS), 189
macrostate, 13, 14, 15, 19, 21
maintenance, 53, 55, 58
management, 29
  agile, 32
  change, 201
  critique box, 217
  delegation, 223
  diversification, 232
  employer attractiveness, 232
  first year review, 221
  innovation, 213
  interface, 207
  inventor-driven, 227
  kaizen, 33
  kanban, 32
  KPI's, 219
  marketing, 234
  one-to-one meetings, 217
  questionnaires, 219
  recognition, 230
  resource planning, 229
  responsibility, 226
  risk, 223
  sales, 234
  scrum, 32
  self-discipline, 228
  stakeholder, 221
  steering committee, 225
  structures, 224

  sustainability, 226, 228, 231, 232
  talent development, 229
  team member, 221
  team selection, 222
  team-meetings, 217
  training, 231
marketing
  individual, 118
Markov chain, 19, 26, 105, 117, 118, 168, 175, 177
  p-order, 106
  homogeneous, 106
  property, 105
Maxwell-Boltzmann distribution, 17, 178, 179
mean, 75, 76
measurement, 67, 70
  error, 68, 69
  spike, 69
melting point, 174
methods
  branch-and-bound, 6
  complete enumeration, 5
  enumeration, 165
  exact, 5, 5
  genetic algorithm, 6, 7
  heuristic, 5, 6
  Monte Carlo, 16
  restarting, 27
  scientific, 34
  simulated annealing, 6, 7
    advantages, 7
  tabu search, 6
metric, 86
microstate, 13, 14, 14, 19, 21
model
  agile, 30
  generalization, 128
  kanban, 30
  mathematical, 121
  scrum, 30
  waterfall, 30
modeling, 79, 121, 121, 123, 125
move, 6
multi-objective optimization, 7
neural network, 121, 123, 126, 129
  echo-state network, 133
  Hopfield network, 132
  multi-layer perceptron, 131
    activation function, 131
    bias vector, 131
    layer, 131
    topology, 131
    weight matrix, 131

  primary pattern, 132
Newton's method, 165
noise, 38, 40, 47, 48
  de-noising, 47
noisy channel, 107, 110, 111, 112
non-locality, 71, 71
nonlinearity, 115
nuclear power plant, 152
observation, 70
Occam's razor, 52
occupation number, 17
offshore production, 195
oil production, 195
optimal path, 186
optimization, 1, 79, 112
optimum
  global, 1, 165
  local, 1
  Pareto, 7, 8
outlier, 40, 69, 77, 86, 88
outsourcing, 58
over-fitting, 84, 123
partition function, 17, 18
Pearson's r, 84
perceptron, 92
petrochemical plant, 148
phase transition, 22, 170, 171
  second order, 22
Poisson stochastic error source, 101
polynomial, 52, 80
population, 67, 67
pre-processing, 92
price arbitrage, 118
prioritization, 35
probability, 3, 4, 19, 71, 72
  acceptance, 172
  distribution, 39, 72, 73, 75, 76
    Cauchy, 107
    generalized normal, 107
    homogeneous, 82
    logistic, 107
    normal, 72, 82
    posterior, 117
    prior, 108, 117
    sampling, 110
    Student t, 107
    uniform, 72, 73
  transition, 106
problem, 11
  instance, 11
  solution, 15
process

  non-reversible, 20
  reversible, 20
process stability, 70
production, 63, 135
profitability, 193
pump
  choke-diameter, 195
  frequency, 195
qualitative, 81
quantitative, 81
random, 17, 73
regression, 79, 113
  linear, 113
  non-linear, 117
representation, 50
representative, 62, 68
reservoir computing, 133
restarting, 166
retailer, 117
sample, 67, 68
sampling, 50, 68
  random, 50
  stratified, 50
scrap, 63
  identification, 135
selectivity, 190, 193
self-organizing map (SOM), 92, 109
sensor, 37, 38
  drift, 69
  tolerance, 68
significance, 72, 72, 73
  level, 72, 74
silane, 189
simulated annealing, 87, 165, 167, 168, 183
  cooling schedule, 169, 172, 179
  equilibrium, 168, 175
  final temperature, 168, 174
  initial temperature, 168, 173
  selection criterion, 168, 178
  temperature reduction, 168, 177
  perturbation, 181
singular spectrum analysis, 48, 91, 98, 99
singular value decomposition, 81
six-sigma, 72
soft sensor, 42, 44
specific heat, 23, 25, 170, 172
  constant pressure, 25
  constant volume, 25
spline, 53
state
  persistent, 26
  pseudo-persistent, 27
  transient, 26

stationary, 16
statistical inference
  Bayesian, 107, 112, 117, 118
statistical mechanics, 14
statistics, 67, 71, 72
  F-test, 76, 83
  χ²-test, 76, 77, 79
  t-test, 75
  confidence, 74
  critical value, 74
  descriptive, 117
  independence, 82
  Kolmogorov-Smirnov test, 77
  population, 74
  Rosner test, 77
  test, 72, 73, 73, 74, 75
  variance, 83
storage, 118
straight line, 80
system, 15
systematic error, 70
Taylor's theorem, 115
temperature, 17, 23, 167, 169
  absolute zero, 20
testing, 32
thermodynamics, 14
  laws, 19, 21
  postulates, 18
time-series, 38
timescale, 10
traveling salesman problem, 11
turbine, 96, 98, 103, 140, 152
  blade tear, 141
uncertainty, 68–70
valve failures, 155
variable
  categorical, 82
  controllable, 190
  input, 129
  nominal, 130
  numeric, 130
  output, 129
  semi-controllable, 190
  uncontrollable, 190
variance, 75, 76
vibration, 104
vibration crisis, 152
virtually certain, 74
wind power plant, 143
yield, 193
