# statistical model

189 results

pages: 227 words: 62,177

Numbers Rule Your World: The Hidden Influence of Probability and Statistics on Everything You Do by Kaiser Fung

Figure C-1: Drawing a Line Between Natural and Doping Highs

Because the anti-doping laboratories face bad publicity for false positives (while false negatives are invisible unless the dopers confess), they calibrate the tests to minimize false accusations, which allows some athletes to get away with doping.

The Virtue of Being Wrong

The subject matter of statistics is variability, and statistical models are tools that examine why things vary. A disease outbreak model links causes to effects to tell us why some people fall ill while others do not; a credit-scoring model identifies correlated traits to describe which borrowers are likely to default on their loans and which will not. These two examples represent two valid modes of statistical modeling. George Box is justly celebrated for his remark “All models are false but some are useful.” The mark of great statisticians is their confidence in the face of fallibility. They recognize that no one can have a monopoly on the truth, which is unknowable as long as there is uncertainty in the world.

Highway engineers in Minnesota tell us why their favorite tactic to reduce congestion is a technology that forces commuters to wait more, while Disney engineers make the case that the most effective tool to reduce wait times does not actually reduce average wait times. Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. In Chapter 2, we compare and contrast these two modes of statistical modeling by trailing disease detectives on the hunt for tainted spinach (causal models) and by prying open the black box that produces credit scores (correlational models). Surprisingly, these practitioners freely admit that their models are “wrong” in the sense that they do not perfectly describe the world around us; we explore how they justify what they do. Third, statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups.

They play a high-stakes game, ever wary of the tyranny of the unknown, ever worried about the consequence of miscalculation. Their special talent is the educated guess, with emphasis on the adjective. The leaders of the pack are practical-minded people who rely on detailed observation, directed research, and data analysis. Their Achilles heel is the big I, when they let intuition lead them astray. This chapter celebrates two groups of statistical modelers who have made lasting, positive impacts on our lives. First, we meet the epidemiologists whose investigations explain the causes of disease. Later, we meet credit modelers who mark our fiscal reputation for banks, insurers, landlords, employers, and so on. By observing these scientists in action, we will learn how they have advanced the technical frontier and to what extent we can trust their handiwork.

In November 2006, the U.S.

pages: 209 words: 13,138

Empirical Market Microstructure: The Institutions, Economics and Econometrics of Securities Trading by Joel Hasbrouck

If we know that the structural model is the particular one described in section 9.2, we simply set $v_t$ so that $q_t = +1$, set $u_t = 0$ and forecast using equation (9.7). We do not usually know the structural model, however. Typically we’re working from estimates of a statistical model (a VAR or VMA). This complicates specification of $\varepsilon_0$. From the perspective of the VAR or VMA model of the trade and price data, the innovation vector and its variance are:

$$
\varepsilon_t = \begin{bmatrix} \varepsilon_{p,t} \\ \varepsilon_{q,t} \end{bmatrix}
\quad\text{and}\quad
\operatorname{Var}(\varepsilon_t) = \begin{bmatrix} \sigma_p^2 & \sigma_{p,q} \\ \sigma_{p,q} & \sigma_q^2 \end{bmatrix}.
\tag{9.15}
$$

The innovations in the statistical model are simply associated with the observed variables, and have no necessary structural interpretation. We can still set $\varepsilon_{q,t}$ according to our contemplated trade ($\varepsilon_{q,t} = +1$), but how should we set $\varepsilon_{p,t}$? The answer to this specific problem depends on the immediate (time $t$) relation between the trade and price-change innovations.

The role they play and how they should be regulated are ongoing concerns of practical interest.

12 Limit Order Markets

The worldwide proliferation of limit order markets (LOMs) clearly establishes a need for economic and statistical models of these mechanisms. This chapter discusses some approaches, but it should be admitted at the outset that no comprehensive and realistic models (either statistical or economic) exist. One might start with the view that a limit order, being a bid or offer, is simply a dealer quote by another name. The implication is that a limit order is exposed to asymmetric information risk and also must recover noninformational costs of trade. This view supports the application of the economic and statistical models described earlier to LOM, hybrid, and other nondealer markets. This perspective features a sharp division between liquidity suppliers and demanders.

Stock exchanges--Mathematical models. I. Title. HG4521.H353 2007 332.64--dc22 2006003935

To Lisa, who inspires these pages and much more.

Preface

This book is a study of the trading mechanisms in financial markets: the institutions, the economic principles underlying the institutions, and statistical models for analyzing the data they generate. The book is aimed at graduate and advanced undergraduate students in financial economics and practitioners who design or use order management systems. Most of the book presupposes only a basic familiarity with economics and statistics. I began writing this book because I perceived a need for treatment of empirical market microstructure that was unified, authoritative, and comprehensive.

pages: 327 words: 103,336

Everything Is Obvious: *Once You Know the Answer by Duncan J. Watts

Nevertheless, as a speculative exercise, we tested a range of plausible assumptions, each corresponding to a different hypothetical “influencer-based” marketing campaign, and measured their return on investment using the same statistical model as before. What we found was surprising even to us: Even though the Kim Kardashians of the world were indeed more influential than average, they were so much more expensive that they did not provide the best value for the money. Rather, it was what we called ordinary influencers, meaning individuals who exhibit average or even less-than-average influence, who often proved to be the most cost-effective means to disseminate information. CIRCULAR REASONING AGAIN Before you rush out to short stock in Kim Kardashian, I should emphasize that we didn’t actually run the experiment that we imagined. Even though we were studying data from the real world, not a computer simulation, our statistical models still made a lot of assumptions. Assuming, for example, that our hypothetical marketer could persuade a few thousand ordinary influencers to tweet about their product, it is not at all obvious that their followers would respond as favorably as they do to normal tweets.

Next, we compared the performance of these two polls with the Vegas sports betting market (one of the oldest and most popular betting markets in the world) as well as with another prediction market, TradeSports. And finally, we compared the prediction of both the markets and the polls against two simple statistical models. The first model relied only on the historical probability that home teams win (which they do 58 percent of the time), while the second model also factored in the recent win-loss records of the two teams in question. In this way, we set up a six-way comparison between different prediction methods: two statistical models, two markets, and two polls. Given how different these methods were, what we found was surprising: All of them performed about the same. To be fair, the two prediction markets performed a little better than the other methods, which is consistent with the theoretical argument above.
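The two baseline models described in this excerpt are simple enough to sketch in a few lines. The following is only an illustration, not the authors' code: the 58 percent home-win rate comes from the excerpt, while the function names, the linear adjustment, and the `weight` value are assumptions made for the example. The Brier score is one common way to compare probabilistic predictions; the excerpt does not say which measure the study used.

```python
HOME_WIN_RATE = 0.58  # historical home-team win rate quoted in the excerpt


def home_only_model() -> float:
    """Model 1: predict the same home-win probability for every game."""
    return HOME_WIN_RATE


def record_adjusted_model(home_recent: float, away_recent: float,
                          weight: float = 0.25) -> float:
    """Model 2 (hypothetical form): shift the base rate by the gap in the
    two teams' recent win fractions. `weight` is an illustrative assumption."""
    p = HOME_WIN_RATE + weight * (home_recent - away_recent)
    return min(0.99, max(0.01, p))  # keep the result a valid probability


def brier_score(predictions, outcomes) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes
    (1 = home team won). Lower is better."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(outcomes)
```

Scoring each method's predictions over the same set of games with `brier_score` would give the kind of side-by-side comparison the excerpt describes.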

Indeed, an entire field of research called sabermetrics has developed specifically for the purpose of analyzing baseball statistics, even spawning its own journal, the Baseball Research Journal. One might think, therefore, that prediction markets, with their far greater capacity to factor in different sorts of information, would outperform simplistic statistical models by a much wider margin for baseball than they do for football. But that turns out not to be true either. We compared the predictions of the Las Vegas sports betting markets over nearly twenty thousand Major League baseball games played from 1999 to 2006 with a simple statistical model based again on home-team advantage and the recent win-loss records of the two teams. This time, the difference between the two was even smaller—in fact, the performance of the market and the model were indistinguishable. In spite of all the statistics and analysis, in other words, and in spite of the absence of meaningful salary caps in baseball and the resulting concentration of superstar players on teams like the New York Yankees and Boston Red Sox, the outcomes of baseball games are even closer to random events than football games.

pages: 257 words: 13,443

Statistical Arbitrage: Algorithmic Trading Insights and Techniques by Andrew Pole

Once again, some of the variation magically disappears when each day is scaled according to that day’s overall volume in the stock. Orders, up to a threshold labeled “visibility threshold,” have less impact on large-volume days. Fitting a mathematical curve or statistical model to the order size–market impact data yields a tool for answering the question: How much will I have to pay to buy 10,000 shares of XYZ? Note that buy and sell responses may be different and may be dependent on whether the stock is moving up or down that day. Breaking down the raw (60-day) data set and analyzing up days and down days separately will illuminate that issue. More formally, one could define an encompassing statistical model including an indicator variable for up or down day and test the significance of the estimated coefficient. Given the dubious degree to which one could reasonably determine independence and other conditions necessary for the validity of such statistical tests (without a considerable amount of work) one will be better off building prediction models for the combined data and for the up/down days separately and comparing predictions.

Approaches for selecting a universe of instruments for modeling and trading are described. Consideration of change is introduced from this first toe dipping into analysis, because temporal dynamics underpin the entirety of the project. Without the dynamic there is no arbitrage. In Chapter 3 we increase the depth and breadth of the analysis, expanding the modeling scope from simple observational rules for pairs to formal statistical models for more general portfolios. Several popular models for time series are described but detailed focus is on weighted moving averages at one extreme of complexity and factor analysis at another, these extremes serving to carry the message as clearly as we can make it. Pair spreads are referred to throughout the text serving, as already noted, as the simplest practical illustrator of the notions discussed.

Therefore, it is not necessary to be overly concerned about which set of events to use in the correlation analysis as a screen for good risk-controlled candidate pairs. Events in trading volume series provide information sometimes not identified (by turning point analysis) in price series. Volume patterns do not directly affect price spreads but volume spurts are a useful warning that a stock may be subject to unusual trading activity and that price development may therefore not be as characterized in statistical models that have been estimated on average recent historical price series. In historical analysis, flags of unusual activity are extremely important in the evaluation of, for example, simulation results.

[Figure 2.8: Adjusted close price trace (General Motors) with 20 percent turning points identified. Price in dollars plotted against dates from 19970102 to 19980312.]

Table 2.1: Event return summary for Chrysler–GM

| Criterion | # Events | Return Correlation |
|---|---|---|
| daily | 332 | 0.53 |
| 30% move | 22 | 0.75 |
| 25% move | 26 | 0.73 |
| 20% move | 33 | 0.77 |

pages: 829 words: 186,976

The Signal and the Noise: Why So Many Predictions Fail-But Some Don't by Nate Silver

Moreover, even the aggregate economic forecasts have been quite poor in any real-world sense, so there is plenty of room for progress. Most economists rely on their judgment to some degree when they make a forecast, rather than just take the output of a statistical model as is. Given how noisy the data is, this is probably helpful. A study by Stephen K. McNees, the former vice president of the Federal Reserve Bank of Boston, found that judgmental adjustments to statistical forecasting methods resulted in forecasts that were about 15 percent more accurate. The idea that a statistical model would be able to “solve” the problem of economic forecasting was somewhat in vogue during the 1970s and 1980s when computers came into wider use. But as was the case in other fields, like earthquake forecasting during that time period, improved technology did not cover for the lack of theoretical understanding about the economy; it only gave economists faster and more elaborate ways to mistake noise for a signal.

McNees, “The Role of Judgment in Macroeconomic Forecasting Accuracy,” International Journal of Forecasting, 6, no. 3, pp. 287–99, October 1990. http://www.sciencedirect.com/science/article/pii/016920709090056H. 63. About the only economist I am aware of who relies solely on statistical models without applying any adjustments to them is Ray C. Fair of Yale. I looked at the accuracy of the forecasts from Fair’s model, which have been published regularly since 1984. They aren’t bad in some cases: the GDP and inflation forecasts from Fair’s model have been roughly as good as those of the typical judgmental forecaster. However, the model’s unemployment forecasts have always been very poor, and its performance has been deteriorating recently as it considerably underestimated the magnitude of the recent recession while overstating the prospects for recovery. One problem with statistical models is that they tend to perform well until one of their assumptions is violated and they encounter a new situation, in which case they may produce very inaccurate forecasts.

This explanation becomes less credible, however, when the forecaster does not have a history of successful predictions and when the magnitude of his error is larger. In these cases, it is much more likely that the fault lies with the forecaster’s model of the world and not with the world itself. In the instance of CDOs, the ratings agencies had no track record at all: these were new and highly novel securities, and the default rates claimed by S&P were not derived from historical data but instead were assumptions based on a faulty statistical model. Meanwhile, the magnitude of their error was enormous: AAA-rated CDOs were two hundred times more likely to default in practice than they were in theory. The ratings agencies’ shot at redemption would be to admit that the models had been flawed and the mistake had been theirs. But at the congressional hearing, they shirked responsibility and claimed to have been unlucky. They blamed an external contingency: the housing bubble.

Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei

Statistics Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics. A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions. Statistical models are widely used to model data and data classes. For example, in data mining tasks like data characterization and classification, statistical models of target classes can be built. In other words, such statistical models can be the outcome of a data mining task. Alternatively, data mining tasks can be built on top of statistical models. For example, we can use statistics to model noise and missing data values. Then, when mining patterns in a large data set, the data mining process can use the model to help identify and handle noisy or missing values in the data.

Thus, the Gaussian distribution $g_D$ can be used to model the normal data, that is, most of the data points in the data set. For each object $y$ in region $R$, we can estimate $g_D(y)$, the probability that this point fits the Gaussian distribution. Because $g_D(y)$ is very low, $y$ is unlikely to have been generated by the Gaussian model, and thus is an outlier. The effectiveness of statistical methods highly depends on whether the assumptions made for the statistical model hold true for the given data. There are many kinds of statistical models. For example, the statistical models used in the methods may be parametric or nonparametric. Statistical methods for outlier detection are discussed in detail in Section 12.3.

Proximity-Based Methods

Proximity-based methods assume that an object is an outlier if the nearest neighbors of the object are far away in feature space, that is, the proximity of the object to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set.
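The Gaussian approach to outlier detection described in this excerpt can be sketched with the standard library alone. This is a hedged illustration, not the book's algorithm: the function name `gaussian_outliers`, the `density_ratio` cutoff, and the sample readings are assumptions chosen for the example. It fits a normal distribution to the data and flags any point whose density is small relative to the density at the mean.

```python
from statistics import NormalDist


def gaussian_outliers(values, density_ratio=0.05):
    """Fit a Gaussian to `values` and return the points whose density is
    below `density_ratio` times the peak density (an illustrative cutoff)."""
    model = NormalDist.from_samples(values)  # sample mean and sample std dev
    peak = model.pdf(model.mean)             # highest density the model assigns
    return [v for v in values if model.pdf(v) / peak < density_ratio]


# Made-up sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]
outliers = gaussian_outliers(readings)
```

Note that the suspect point is included when the Gaussian is fitted, which inflates the estimated spread; with several extreme values this can mask outliers, which is one reason the excerpt stresses that the method is only as good as its distributional assumptions.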

Statistics research develops tools for prediction and forecasting using data and statistical models. Statistical methods can be used to summarize or describe a collection of data. Basic statistical descriptions of data are introduced in Chapter 2. Statistics is useful for mining various patterns from data as well as for understanding the underlying mechanisms generating and affecting the patterns. Inferential statistics (or predictive statistics) models data in a way that accounts for randomness and uncertainty in the observations and is used to draw inferences about the process or population under investigation. Statistical methods can also be used to verify data mining results. For example, after a classification or prediction model is mined, the model should be verified by statistical hypothesis testing.

pages: 204 words: 58,565

Keeping Up With the Quants: Your Guide to Understanding and Using Analytics by Thomas H. Davenport, Jinho Kim

Data analysis

Key Software Vendors for Different Analysis Types (listed alphabetically)

REPORTING SOFTWARE
- BOARD International
- IBM Cognos
- Information Builders WebFOCUS
- Oracle Business Intelligence (including Hyperion)
- Microsoft Excel/SQL Server/SharePoint
- MicroStrategy
- Panorama
- SAP BusinessObjects

INTERACTIVE VISUAL ANALYTICS
- QlikTech QlikView
- Tableau
- TIBCO Spotfire

QUANTITATIVE OR STATISTICAL MODELING
- IBM SPSS
- R (an open-source software package)
- SAS

While all of the listed reporting software vendors also have capabilities for graphical display, some vendors focus specifically on interactive visual analytics, or the use of visual representations of data and reporting. Such tools are often used simply to graph data and for data discovery: understanding the distribution of the data, identifying outliers (data points with unexpected values) and visual relationships between variables. So we’ve listed these as a separate category. We’ve also listed key vendors of software for the other category of analysis, which we’ll call quantitative or statistical modeling. In that category, you’re trying to use statistics to understand the relationships between variables and to make inferences from your sample to a larger population.

However, there are circumstances in which these “black box” approaches to analysis can greatly leverage the time and productivity of human analysts. In big-data environments, where the data just keeps coming in large volumes, it may not always be possible for humans to create hypotheses before sifting through the data. In the context of placing digital ads on publishers’ sites, for example, decisions need to be made in thousandths of a second by automated decision systems, and the firms doing this work must generate several thousand statistical models per week. Clearly this type of analysis can’t involve a lot of human hypothesizing and reflection on results, and machine learning is absolutely necessary. But for the most part, we’d advise sticking to hypothesis-driven analysis and the steps and sequence in this book. The Modeling (Variable Selection) Step A model is a purposefully simplified representation of the phenomenon or problem.

The software vendors for this type of data tend to be different from the reporting software vendors, though the two categories are blending a bit over time. Microsoft Excel, for example, perhaps the most widely used analytical software tool in the world (though most people think of it as a spreadsheet tool), can do some statistical analysis (and visual analytics) as well as reporting, but it’s not the most robust statistical software if you have a lot of data or a complex statistical model to build, so it’s not listed in that category. Excel’s usage for analytics in the corporate environment is frequently augmented by other Microsoft products, including SQL Server (primarily a database tool with some analytical functionality) and SharePoint (primarily a collaboration tool, with some analytical functionality). Types of Models There are a variety of model types that analysts and their organizations use to think analytically and make data-based decisions.

pages: 400 words: 94,847

Reinventing Discovery: The New Era of Networked Science by Michael Nielsen

Might it be that the statistical models contain more truth than our conventional theories of language, with their notions of verb, noun, and adjective, subjects and objects, and so on? Or perhaps the models contain a different kind of truth, in part complementary, and in part overlapping, with conventional theories of language? Maybe we could develop a better theory of language by combining the best insights from the conventional approach and the approach based on statistical modeling into a single, unified explanation? Unfortunately, we don’t yet know how to make such unified theories. But it’s stimulating to speculate that nouns and verbs, subjects and objects, and all the other paraphernalia of language are really emergent properties whose existence can be deduced from statistical models of language.

The program would also examine the corpus to figure out how words moved around in the sentence, observing, for example, that “hola” and “hello” tend to be in the same parts of the sentence, while other words get moved around more. Repeating this for every pair of words in the Spanish and English languages, their program gradually built up a statistical model of translation—an immensely complex model, but nonetheless one that can be stored on a modern computer. I won’t describe the models they used in complete detail here, but the hola-hello example gives you the flavor. Once they had analyzed the corpus and built up their statistical model, they used that model to translate new texts. To translate a Spanish sentence, the idea was to find the English sentence that, according to the model, had the highest probability. That high-probability sentence would be output as the translation. Frankly, when I first heard about statistical machine translation I thought it didn’t sound very promising.

But whereas Darwin’s theory of evolution can be summed up in a few sentences, and Einstein’s general theory of relativity can be expressed in a single equation, these theories of translation are expressed in models with billions of parameters. You might object that such a statistical model doesn’t seem much like a conventional scientific explanation, and you’d be right: it’s not an explanation in the conventional sense. But perhaps it should be considered instead as a new kind of explanation. Ordinarily, we judge explanations in part by their ability to predict new phenomena. In the case of translation, that means accurately translating never-before-seen sentences. And so far, at least, the statistical translation models do a better job of that than any conventional theory of language. It’s telling that a model that doesn’t even understand the noun-verb distinction can outperform our best linguistic models. At the least we should take seriously the idea that these statistical models express truths not found in more conventional explanations of language translation.

pages: 354 words: 26,550

High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems by Irene Aldridge

Operational risk: the risk of financial losses embedded in daily trading operations
5. Legal risk: the risk of litigation expenses

All current risk measurement approaches fall into four categories:

- Statistical models
- Scalar models
- Scenario analysis
- Causal modeling

Statistical models generate predictions about worst-case future conditions based on past information. The Value-at-Risk (VaR) methodology is the most common statistical risk measurement tool, discussed in detail in the sections that focus on market and liquidity risk estimation. Statistical models are the preferred methodology of risk estimation whenever statistical modeling is feasible. Scalar models establish the maximum foreseeable loss levels as percentages of business parameters, such as revenues, operating costs, and the like. The parameters can be computed as averages of several days, weeks, months, or even years of a particular business variable, depending on the time frame most suitable for each parameter.
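Value-at-Risk, named in this excerpt as the most common statistical risk measure, can be estimated directly from past returns. The sketch below uses the historical-simulation approach, which is one of several ways to compute VaR and is not necessarily the method the book develops; the function name `historical_var`, the quantile convention, and the sample returns are assumptions for illustration.

```python
import math


def historical_var(returns, confidence=0.95):
    """Historical-simulation VaR: the loss that past returns did not exceed
    in roughly `confidence` of the observed periods. Returned as a positive
    number when it represents a loss."""
    if not returns:
        raise ValueError("need at least one return observation")
    losses = sorted(-r for r in returns)  # losses in ascending order
    # Smallest loss level covering at least `confidence` of the sample;
    # round() guards against floating-point noise before taking the ceiling.
    index = max(0, math.ceil(round(confidence * len(losses), 9)) - 1)
    return losses[index]


# Made-up daily returns, for illustration only.
daily_returns = [-0.05, -0.03, -0.01, 0.0, 0.01, 0.02, 0.02, 0.03, 0.04, 0.05]
var_90 = historical_var(daily_returns, confidence=0.90)
```

Because the estimate is built only from past information, it inherits the limitation the excerpt implies: conditions that never appeared in the sample cannot show up in the result.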

Yet, readers relying on software packages with preconfigured statistical procedures may find the level of detail presented here to be sufficient for quality analysis of trading opportunities. The depth of the statistical content should be also sufficient for readers to understand the models presented throughout the remainder of this book. Readers interested in a more thorough treatment of statistical models may refer to Tsay (2002); Campbell, Lo, and MacKinlay (1997); and Gouriéroux and Jasiak (2001). This chapter begins with a review of the fundamental statistical estimators, moves on to linear dependency identification methods and volatility modeling techniques, and concludes with standard nonlinear approaches for identifying and modeling trading opportunities.

STATISTICAL PROPERTIES OF RETURNS

According to Dacorogna et al. (2001, p. 121), “high-frequency data opened up a whole new field of exploration and brought to light some behaviors that could not be observed at lower frequencies.”

CHAPTER 12 Event Arbitrage

With news reported instantly and trades placed on a tick-by-tick basis, high-frequency strategies are now ideally positioned to profit from the impact of announcements on markets. These high-frequency strategies, which trade on the market movements surrounding news announcements, are collectively referred to as event arbitrage. This chapter investigates the mechanics of event arbitrage in the following order:

- Overview of the development process
- Generating a price forecast through statistical modeling of
  - Directional forecasts
  - Point forecasts
- Applying event arbitrage to corporate announcements, industry news, and macroeconomic news
- Documented effects of events on foreign exchange, equities, fixed income, futures, emerging economies, commodities, and REIT markets

DEVELOPING EVENT ARBITRAGE TRADING STRATEGIES

Event arbitrage refers to the group of trading strategies that place trades on the basis of the markets’ reaction to events.

pages: 320 words: 33,385

Market Risk Analysis, Quantitative Methods in Finance by Carol Alexander

Chapter 3, Probability and Statistics, covers the probabilistic and statistical models that we use to analyse the evolution of financial asset prices or interest rates. Starting from the basic concepts of a random variable, a probability distribution, quantiles and population and sample moments, we then provide a catalogue of probability distributions. We describe the theoretical properties of each distribution and give examples of practical applications to finance. Stable distributions and kernel estimates are also covered, because they have broad applications to financial risk management. The sections on statistical inference and maximum likelihood lay the foundations for Chapter 4. Finally, we focus on the continuous time and discrete time statistical models for the evolution of financial asset prices and returns, which are further developed in Volume III.

The multivariate t distribution has very useful applications which will be described in Volumes II and IV. Its most important market risk modelling applications are to:

- multivariate GARCH modelling,
- generating copulas, and
- simulating asset prices.

I.3.5 INTRODUCTION TO STATISTICAL INFERENCE

A statistical model will predict well only if it is properly specified and its parameter estimates are robust, unbiased and efficient. Unbiased means that the expected value of the estimator is equal to the true model parameter and efficient means that the variance of the estimator is low, i.e. different samples give similar estimates. When we set up a statistical model the implicit assumption is that this is the ‘true’ model for the population. We estimate the model’s parameters from a sample and then use these estimates to infer the values of the ‘true’ population parameters. With what degree of confidence can we say that the ‘true’ parameter takes some value such as 0?
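The definitions of unbiased and efficient given in this excerpt can be made concrete with a small simulation. The sketch below is an illustration under assumptions that are not in the book: it draws repeated samples from a normal distribution with a known mean and compares two estimators of that mean, the sample mean and the sample median. The function name, sample size, number of trials, and seed are all arbitrary choices.

```python
import random
from statistics import fmean, median, pvariance


def simulate_estimators(true_mean=0.0, sigma=1.0, sample_size=25,
                        trials=2000, seed=0):
    """Return {estimator: (estimated bias, variance across samples)} for the
    sample mean and sample median of normally distributed data."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mean_estimates, median_estimates = [], []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(sample_size)]
        mean_estimates.append(fmean(sample))
        median_estimates.append(median(sample))
    return {
        # Bias: average estimate minus the true parameter (near 0 if unbiased).
        # Variance: how much the estimate moves from sample to sample.
        "mean": (fmean(mean_estimates) - true_mean, pvariance(mean_estimates)),
        "median": (fmean(median_estimates) - true_mean, pvariance(median_estimates)),
    }


results = simulate_estimators()
for name, (bias, variance) in results.items():
    print(f"{name:>6}: bias = {bias:+.4f}, variance = {variance:.4f}")
```

For normal data both estimators should show bias close to zero, while the sample mean should show the smaller variance across samples, which is what the excerpt means by a more efficient estimator.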

Using this add-in, we have been able to compute eigenvectors and eigenvalues and perform many other matrix operations that would not be possible otherwise in Excel, except by purchasing software. This matrix.xla add-in is included on the CD-ROM, but readers may also like to download any later versions, currently available free from: http://digilander.libero.it/foxes (e-mail: leovlp@libero.it). I.3 Probability and Statistics I.3.1 INTRODUCTION This chapter describes the probabilistic and statistical models that we use to analyse the evolution of financial asset prices or interest rates. Prices or returns on financial assets, interest rates or their changes, and the value or P&L of a portfolio are some examples of the random variables used in finance. A random variable is a variable whose value could be observed today and in the past, but whose future values are unknown. We may have some idea about the future values, but we do not know exactly which value will be realized in the future.

pages: 265 words: 74,000

The Numerati by Stephen Baker

He started publishing papers nearly as soon as he arrived. And when he got his master's, he decided to look for a job "at places where they hire Ph.D.'s." He landed at Accenture, and now, at an age at which many of his classmates are just finishing their doctorate, he runs the analytics division from his perch in Chicago. Ghani leads me out of his office and toward the shopping cart. For statistical modeling, he explains, grocery shopping is one of the first retail industries to conquer. This is because we buy food constantly. For many of us, the supermarket functions as a chilly, Muzak-blaring annex to our pantries. (I would bet that millions of suburban Americans spend more time in supermarkets than in their formal living room.) Our grocery shopping is so prodigious that just by studying one year of our receipts, researchers can detect all sorts of patterns—far more than they can learn from a year of records detailing our other, more sporadic purchases.

It's terrifying." He thinks that over the next generation, many of us will surround ourselves with the kinds of networked gadgets he and his team are building and testing. These machines will busy themselves with far more than measuring people's pulse and counting the pills they take, which is what today's state-of-the-art monitors can do. Dishman sees sensors eventually recording and building statistical models of almost every aspect of our behavior. They'll track our pathways in the house, the rhythm of our gait. They'll diagram our thrashing in bed and chart our nightly trips to the bathroom—perhaps keeping tabs on how much time we spend in there. Some of these gadgets will even measure the pause before we recognize a familiar voice on the phone. A surveillance society gone haywire? Personal privacy in tatters?

Let's say they see lots of activity in the morning and at bedtime. Together those two periods might represent 90 percent of toothbrush movement. From that, they can calculate a 90 percent probability that toothbrush movement involves teeth cleaning. (They could factor in time variables, but there's more than enough complexity ahead, as we'll see.) Next they move to the broom and the teakettle, and they ask the same questions. The goal is to build a statistical model for each of us that will infer from a series of observations what we're most likely to be doing. The toothbrush was easy. For the most part, it sticks to only one job. But consider the kettle. What are the chances that it's being used for tea? Maybe a person uses it to make instant soup (which is more nutritious than tea but dangerously salty for people like my mother). How can the Intel team come up with a probability?

Everydata: The Misinformation Hidden in the Little Data You Consume Every Day by John H. Johnson

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

You collect all the data on every wheat price in the history of humankind, and all the different factors that determine the price of wheat (temperature, feed prices, transportation costs, etc.). First, you need to develop a statistical model to determine what factors have affected the price of wheat in the past and how these various factors relate to one another mathematically. Then, based on that model, you predict the price of wheat for next year.14 The problem is that no matter how big your sample is (even if it’s the full population), and how accurate your statistical model is, there are still unknowns that can cause your forecast to be off: What if a railroad strike doubles the transportation costs? What if Congress passes new legislation capping the price of wheat? What if there’s a genetic mutation that makes wheat grow twice as fast, essentially doubling the world’s supply?

As Hovenkamp said, “the plaintiff’s expert had ignored a clear ‘outlier’ in the data.”33 If that outlier data had been excluded, as it arguably should have been, then the results would have shown a clear increase in market share for Conwood. Instead, the conclusion, driven by an extreme observation, showed a decrease. If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers. Are You Better Than Average? The average American: sleeps more than 8.7 hours per day;34 weighs approximately 181 pounds (195.5 pounds for men and 166.2 pounds for women);35 drinks 20.8 gallons of beer per year;36 drives 13,476 miles per year (hopefully not after drinking all that beer);37 showers six times a week, but only shampoos four times a week;38 and has been at his or her current job 4.6 years.39 So, are you better than average?

(On its website, Visa even suggests that you tell your financial institution if you’ll be traveling, which can “help ensure that your card isn’t flagged for unusual activity.”18) This is a perfect example of a false positive: the credit card company predicted that the charges on your card were potentially fraudulent, but it was wrong. Events like this, which may not be accounted for in the statistical model, are potential sources of prediction error. Just as sampling error tells us about the uncertainty in our sample, prediction error is a way to measure uncertainty in the future, essentially by comparing the predicted results to the actual outcomes, once they occur.19 Prediction error is often measured using a prediction interval, which is the range in which we expect to see the next data point.

pages: 294 words: 82,438

Simple Rules: How to Thrive in a Complex World by Donald Sull, Kathleen M. Eisenhardt

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

One study looked at how police can identify where serial criminals live. A simple rule—take the midpoint of the two most distant crime scenes—got police closer to the criminal than more sophisticated decision-making approaches. Another study compared a state-of-the-art statistical model and a simple rule to determine which did a better job of predicting whether past customers would purchase again. According to the simple rule, a customer was inactive if they had not purchased in x months (the number of months varies by industry). The simple rule did as well as the statistical model in predicting repeat purchases of online music, and beat it in the apparel and airline industries. Other research finds that simple rules match or beat more complicated models in assessing the likelihood that a house will be burgled and in forecasting which patients with chest pain are actually suffering from a heart attack.
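The repeat-purchase rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`months_between`, `is_inactive`), the customer data, and the nine-month threshold are all hypothetical, since the source says only that the cutoff is "x months" and varies by industry.

```python
from datetime import date


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not yet complete
    return max(months, 0)


def is_inactive(last_purchase: date, today: date, hiatus_months: int) -> bool:
    """Simple rule: a customer is inactive after `hiatus_months` without a purchase."""
    return months_between(last_purchase, today) >= hiatus_months


# Hypothetical customers and an arbitrary 9-month threshold.
last_purchases = {"alice": date(2023, 1, 15), "bob": date(2023, 11, 2)}
today = date(2024, 1, 1)
status = {
    name: is_inactive(when, today, hiatus_months=9)
    for name, when in last_purchases.items()
}
print(status)
```

The rule uses a single input, the time since the last purchase, which is what makes it easy to compare against a fitted model on the same customers.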

., “Validation of the Emergency Severity Index (ESI) in Self-Referred Patients in a European Emergency Department,” Emergency Medicine Journal 24, no. 3 (2007): 170–74. Statisticians have found: Professor Scott Armstrong of the Wharton School reviewed thirty-three studies comparing simple and complex statistical models used to forecast business and economic outcomes. He found no difference in forecasting accuracy in twenty-one of the studies. Sophisticated models did better in five studies, while simple models outperformed complex ones in seven cases. See J. Scott Armstrong, “Forecasting by Extrapolation: Conclusions from 25 Years of Research,” Interfaces 14 (1984): 52–66. Spyros Makridakis has hosted a series of competitions for statistical models over two decades, and consistently found that complex models fail to outperform simpler approaches. The history of the competitions is summarized in Spyros Makridakis and Michèle Hibon, “The M3-Competition: Results, Conclusions, and Implications,” International Journal of Forecasting 16, no. 4 (2000): 451–76. When it comes to modeling: In statistical terms, a model that closely approximates the underlying function that generates observed data is said to have low bias.

In fact, the 1/N rule ignores everything except for the number of investment alternatives under consideration. It is hard to imagine a simpler investment rule. And yet it works. One recent study of alternative investment approaches pitted the Markowitz model and three extensions of his approach against the 1/N rule, testing them on seven samples of data from the real world. This research ran a total of twenty-eight horseraces between the four state-of-the-art statistical models and the 1/N rule. With ten years of historical data to estimate risk, returns, and correlations, the 1/N rule outperformed the Markowitz equation and its extensions 79 percent of the time. The 1/N rule earned a positive return in every test, while the more complicated models lost money for investors more than half the time. Other studies have run similar tests and come to the same conclusions.
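To make the 1/N rule concrete, here is a minimal Python sketch. The asset names and the one-period returns are made up for illustration, and `equal_weights` and `portfolio_return` are hypothetical helper names, not anything from the study described above.

```python
def equal_weights(assets):
    """1/N rule: give every asset the same weight, ignoring all other data."""
    n = len(assets)
    if n == 0:
        raise ValueError("need at least one asset")
    return {asset: 1.0 / n for asset in assets}


def portfolio_return(weights, asset_returns):
    """Weighted sum of the individual asset returns for one period."""
    return sum(weights[a] * asset_returns[a] for a in weights)


assets = ["stocks", "bonds", "real_estate", "cash"]
weights = equal_weights(assets)

# Made-up one-period returns, used only to show the calculation.
period_returns = {"stocks": 0.08, "bonds": 0.02, "real_estate": 0.05, "cash": 0.01}
result = portfolio_return(weights, period_returns)
print(weights)
print(round(result, 4))
```

Nothing is estimated here: the only input to the weights is the number of alternatives, which is the point the passage makes about the rule.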

pages: 348 words: 39,850

Data Scientists at Work by Sebastian Gutierrez

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

In physics, a lot of the new results in astrophysics and high-energy physics actually rely very heavily on large data and complex statistical models. Things like the discovery of dark energy, for example, co-discovered by Saul Perlmutter, Nobel Prize winner, who is my counterpart of the Moore-Sloan Data Science Initiative at UC Berkeley, was made using massive statistical analysis. Also, a thing like the discovery of the Higgs boson was the result of massive statistical data analysis and results. Part of the system for this work was actually designed by my NYU colleague, Kyle Cranmer, who designed the integration for all the statistical models. Data Science is also on its way to revolutionize social science. There is actually a big push from social scientists who would love to put their hands on Facebook’s data.

You just can’t really move slowly when you’ve got a whole company full of super-motivated people excited about what they’re doing. It’s just not in your DNA. Of course, as competitors enter the market, there’s also a legitimate business need of moving fast if we really want to keep our awesome business thriving. Gutierrez: How would you describe your work to a data scientist? Smallwood: I would say we’re a team that does all kinds of statistical modeling. We really focus and output three things as a team. We work on predictive models using all of the techniques that people in this field would be familiar with—regression techniques, clustering techniques, matrix factorization, support vector machines, et cetera, both supervised and unsupervised techniques. A second thing is algorithms, which I would say are obviously closely related to models, except that they’re embedded in some sort of ongoing process, like our product.

That’s really been my favorite part of working on a multidisciplinary team. Gutierrez: In addition to pair programming, do you do pair data science? Shellman: We don’t formally pair on statistics or data science work. For these subjects we have standing discussions around the whiteboards that surround our open-plan office. For instance, yesterday we finished the day with a discussion of how a statistical model could be applied, what data would be needed, the limitations of the model, and the latency expected when using the model in a real-time application. So while we weren’t pair programming, we were discussing behavior and expected results as a group. The great thing about our workspace is that these discussions happen in the open, so everybody can hear, choose to participate, and join in if they have something to contribute.

pages: 443 words: 51,804

Handbook of Modeling High-Frequency Data in Finance by Frederi G. Viens, Maria C. Mariani, Ionut Florescu

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Florescu, Ionuţ, 1973– III. Title. HG106.V54 2011 332.01 5193–dc23 2011038022 Printed in the United States of America

Contents

Preface
Contributors

Part One: Analysis of Empirical Data

1 Estimation of NIG and VG Models for High Frequency Financial Data
José E. Figueroa-López, Steven R. Lancette, Kiseop Lee, and Yanhui Mi
1.1 Introduction
1.2 The Statistical Models
1.3 Parametric Estimation Methods
1.4 Finite-Sample Performance via Simulations
1.5 Empirical Results
1.6 Conclusion
References

2 A Study of Persistence of Price Movement using High Frequency Financial Data
Dragos Bozdog, Ionuţ Florescu, Khaldoun Khashanah, and Jim Wang
2.1 Introduction
2.2 Methodology
2.3 Results
2.4 Rare Events Distribution
2.5 Conclusions
References

3 Using Boosting for Financial Analysis and Trading
Germán Creamer
3.1 Introduction
3.2 Methods
3.3 Performance Evaluation
3.4 Earnings Prediction and Algorithmic Trading
3.5 Final Comments and Conclusions
References

4 Impact of Correlation Fluctuations on Securitized structures
Eric Hillebrand, Ambar N.

In Section 1.5, we present our empirical results using high frequency transaction data from the US equity market. The data was obtained from the NYSE TAQ database of 2005 trades via Wharton’s WRDS system. For the sake of clarity and space, we only present the results for Intel and defer a full analysis of other stocks for a future publication. We finish with a section of conclusions and further recommendations. 1.2 The Statistical Models 1.2.1 GENERALITIES OF EXPONENTIAL LÉVY MODELS Before introducing the specific models we consider in this chapter, let us briefly motivate the application of Lévy processes in financial modeling. We refer the reader to the monographs of Cont & Tankov (2004) and Sato (1999) or the recent review papers Figueroa-López (2011) and Tankov (2011) for further information. Exponential (or Geometric) Lévy models are arguably the most natural generalization of the geometric Brownian motion intrinsic in the Black–Scholes option pricing model.

Exponential (or Geometric) Lévy models are arguably the most natural generalization of the geometric Brownian motion intrinsic in the Black–Scholes option pricing model. A geometric Brownian motion (also called Black–Scholes model) postulates the following conditions about the price process $(S_t)_{t \ge 0}$ of a risky asset: (1) The (log) return on the asset over a time period $[t, t+h]$ of length $h$, that is, $R_{t,t+h} := \log \frac{S_{t+h}}{S_t}$, is Gaussian with mean $\mu h$ and variance $\sigma^2 h$ (independent of $t$); (2) Log returns on disjoint time periods are mutually independent; (3) The price path $t \to S_t$ is continuous; that is, $P(S_u \to S_t, \text{ as } u \to t, \ \forall\, t) = 1$. The previous assumptions can equivalently be stated in terms of the so-called log return process $(X_t)_t$, denoted henceforth as $X_t := \log \frac{S_t}{S_0}$. Indeed, assumption (1) is equivalent to asking that the increment $X_{t+h} - X_t$ of the process $X$ over $[t, t+h]$ is Gaussian with mean $\mu h$ and variance $\sigma^2 h$.
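Assumptions (1) and (2) can be checked numerically with a short simulation. The sketch below uses only the Python standard library; the parameter values, the seed, and the helper names `simulate_log_returns` and `price_path` are illustrative assumptions. Note that `mu` here is the mean of the log return per unit time, following the excerpt's parameterization.

```python
import math
import random
import statistics


def simulate_log_returns(mu, sigma, h, n, seed=0):
    """Draw n independent log returns over periods of length h.

    Each draw is Gaussian with mean mu*h and variance sigma**2 * h,
    independently of the others (assumptions (1) and (2)).
    """
    rng = random.Random(seed)
    return [rng.gauss(mu * h, sigma * math.sqrt(h)) for _ in range(n)]


def price_path(s0, log_returns):
    """Rebuild prices from S_t = S_0 * exp(X_t), X_t being the cumulative log return."""
    prices, x = [s0], 0.0
    for r in log_returns:
        x += r
        prices.append(s0 * math.exp(x))
    return prices


mu, sigma, h = 0.05, 0.2, 1 / 252  # illustrative values with a daily step
returns = simulate_log_returns(mu, sigma, h, n=100_000)

# Sample moments should be close to mu*h and sigma^2*h respectively.
print(statistics.fmean(returns), mu * h)
print(statistics.pvariance(returns), sigma ** 2 * h)

prices = price_path(100.0, returns[:252])  # one simulated "year" of prices
```

A discrete simulation like this cannot exhibit assumption (3), path continuity, but it does show why prices built as exponentials of Gaussian increments stay positive.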

pages: 545 words: 137,789

How Markets Fail: The Logic of Economic Calamities by John Cassidy

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

“Today, retail lending has become more routinized as banks have become increasingly adept at predicting default risk by applying statistical models to data, such as credit scores,” Bernanke went on. “Other tools include proprietary internal debt-rating models and third-party programs that use market data to analyze the risk of exposures to corporate borrowers that issue stock.” While challenges remained, Bernanke concluded, “banking organizations of all sizes have made substantial strides over the past two decades in their ability to measure and manage risks.” Nobody could quibble with Bernanke’s point that Wall Street was becoming more quantitative: the research and risk departments of big financial firms were teeming with physicists, applied mathematicians, and statisticians. But the proper role of statistical models is as a useful adjunct to an overall strategy of controlling risk, not as a substitute for one.

However, it also raises the possibility that the causal relationships that determine market movements aren’t fixed, but vary over time. Maybe because of shifts in psychology or government policy, there are periods when markets will settle into a rut, and other periods when they will be apt to gyrate in alarming fashion. This picture seems to jibe with reality, but it raises some tricky issues for quantitative finance. If the underlying reality of the markets is constantly changing, statistical models based on past data will be of limited use, at best, in determining what is likely to happen in the future. And firms and investors that rely on these models to manage risk may well be exposing themselves to danger. The economics profession didn’t exactly embrace Mandelbrot’s criticisms. As the 1970s proceeded, the use of quantitative techniques became increasingly common on Wall Street. The coin-tossing view of finance made its way into the textbooks and, with the help of Burton Malkiel, onto the bestsellers list.

After listening to Vincent Reinhart, the head of the Fed’s Division of Monetary Affairs, suggest several ways the Fed could try to revive the economy if interest rate changes could no longer be used, he dismissed the discussion as “premature” and described the possibility of a prolonged deflation as “a very small probability event.” The discussion turned to the immediate issue of whether to keep the funds rate at 1.25 percent. Since the committee’s previous meeting, Congress had approved the Bush administration’s third set of tax cuts since 2001, which was expected to give spending a boost. The Fed’s own statistical model of the economy was predicting a vigorous upturn later in 2003, suggesting that further rate cuts would be unnecessary and that some policy tightening might even be needed. “But that forecast has a very low probability, as far as I’m concerned,” Greenspan said curtly. “It points to an outcome that would be delightful if it were to materialize, but it is not a prospect on which we should focus our policy at this point.”

pages: 263 words: 75,455

Quantitative Value: A Practitioner's Guide to Automating Intelligent Investment and Eliminating Behavioral Errors by Wesley R. Gray, Tobias E. Carlisle

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

We need some means to protect us from our cognitive biases, and the quantitative method is that means. It serves both to protect us from our own behavioral errors and to exploit the behavioral errors of others. The model does not need to be complex to achieve this end. In fact, the weight of evidence indicates that even simple statistical models outperform the best experts. It speaks to the diabolical nature of our faulty cognitive apparatus that those simple statistical models continue to outperform the best experts even when those same experts are given access to the models' output. This is as true for a value investor as it is for any other expert in any other field of endeavor. This book is aimed at value investors. It's a humbling and maddening experience to compare active investment results with an analogous passive strategy.

In his book, Expert Political Judgment,36 Philip Tetlock discusses his extensive study of people who make prediction their business: the experts. Tetlock's conclusion is that experts suffer from the same behavioral biases as laymen. Tetlock's study fits within a much larger body of research that has consistently found that experts are as unreliable as the rest of us. A large number of studies have examined the records of experts against simple statistical models and, in almost all cases, concluded that experts either underperform the models or can do no better. It's a compelling argument against human intuition and for the statistical approach, whether it's practiced by experts or nonexperts.37 Even Experts Make Behavioral Errors In many disciplines, simple quantitative models outperform the intuition of the best experts. The simple quantitative models continue to outperform the judgments of the best experts, even when those experts are given the benefit of the outputs from the simple quantitative model.

The model predicted O'Connor's vote correctly 70 percent of the time, while the experts' success rate was only 61 percent.41 How can it be that simple models perform better than experienced clinical psychologists or renowned legal experts with access to detailed information about the cases? Are these results just flukes? No. In fact, the MMPI and Supreme Court decision examples are not even rare. There are an overwhelming number of studies and meta-analyses (studies of studies) that corroborate this phenomenon. In his book, Montier provides a diverse range of studies comparing statistical models and experts, ranging from the detection of brain damage, the interview process to admit students to university, the likelihood of a criminal to reoffend, the selection of “good” and “bad” vintages of Bordeaux wine, and the buying decisions of purchasing managers. Value Investors Have Cognitive Biases, Too Graham recognized early on that successful investing required emotional discipline.

Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Another difference is a widespread preference for visual analytics on big data. For reasons not entirely understood (by anyone, I think), the results of big data analyses are often expressed in visual formats. Now, visual analytics have a lot of strengths: They are relatively easy for non-quantitative executives to interpret, and they get attention. The downside is that they are not generally well suited for expressing complex multivariate relationships and statistical models. Put in other terms, most visual displays of data are for descriptive analytics, rather than predictive or prescriptive ones. They can, however, show a lot of data at once, as figure 4-1 illustrates. It’s a display of the tweets and retweets on Twitter involving particular New York Times articles.5 I find—as with many other complex big data visualizations—this one difficult to decipher. I sometimes think that many big data visualizations are created simply because they can be, rather than to provide clarity on an issue.

5 Technology for Big Data Written with Jill Dyché A major component of what makes the management and analysis of big data possible is new technology.* In effect, big data is not just a large volume of unstructured data, but also the technologies that make processing and analyzing it possible. Specific big data technologies analyze textual, video, and audio content. When big data is fast moving, technologies like machine learning allow for the rapid creation of statistical models that fit, optimize, and predict the data. This chapter is devoted to all of these big data technologies and the difference they make. The technologies addressed in the chapter are outlined in table 5-1. *I am indebted in this section to Jill Dyché, vice president of SAS Best Practices, who collaborated with me on this work and developed many of the frameworks in this section. Much of the content is taken from our report, Big Data in Big Companies (International Institute for Analytics, April 2013).

Hive performs similar functions but is more batch oriented, and it can transform data into the relational format suitable for Structured Query Language (SQL; used to access and manipulate data in databases) queries. This makes it useful for analysts who are familiar with that query language. Business View The business view layer of the stack makes big data ready for further analysis. Depending on the big data application, additional processing via MapReduce or custom code might be used to construct an intermediate data structure, such as a statistical model, a flat file, a relational table, or a data cube. The resulting structure may be intended for additional analysis or to be queried by a traditional SQL-based query tool. Many vendors are moving to so-called “SQL on Hadoop” approaches, simply because SQL has been used in business for a couple of decades, and many people (and higher-level languages) know how to create SQL queries. This business view ensures that big data is more consumable by the tools and the knowledge workers that already exist in an organization.

pages: 460 words: 122,556

The End of Wall Street by Roger Lowenstein

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

See AIG bailouts Ben Bernanke and board of Warren Buffett and CDOs and collateral calls on compensation at corporate structure of credit default swaps and credit rating agencies and Jamie Dimon and diversity of holdings employees, number of Financial Products subsidiary Timothy Geithner and Goldman Sachs and insurance (credit default swap) premiums of JPMorgan Chase and lack of reserve for losses leadership changes Lehman Brothers and losses Moody’s and Morgan Stanley and New York Federal Reserve Bank and Hank Paulson and rescue of. See AIG bailouts revenue of shareholders statistical modeling of stock price of struggles of risk of systemic effects of failure of Texas and AIG bailouts amount of Ben Bernanke and board’s role in credit rating agencies and Federal Reserve and Timothy Geithner and Goldman Sachs and JPMorgan Chase and Lehman Brothers’ bankruptcy and New York state and Hank Paulson and reasons for harm to shareholders in Akers, John Alexander, Richard Allison, Herbert Ambac American Home Mortgages Andrukonis, David appraisers, real estate Archstone-Smith Trust Associates First Capital Atteberry, Thomas auto industry Bagehot, Walter bailouts.

See credit crisis volatility of credit crisis borrowers, lack of effects of fear of lending mortgages and reasons for spread of as unforeseen credit cycle credit default swaps AIG and Goldman Sachs and Morgan Stanley and credit rating agencies. See also specific agencies AIG and capital level determination by guessing by inadequacy of models of Lehman Brothers and Monte Carlo method of mortgage-backed securities and statistical modeling used by Credit Suisse Cribiore, Alberto Cummings, Christine Curl, Gregory Dallavecchia, Enrico Dannhauser, Stephen Darling, Alistair Dean Witter debt of financial firms U.S. reliance on of U.S. families defaults/delinquencies deflation deleveraging. See also specific firms del Missier, Jerry Democrats deposit insurance deregulation of banking system and derivatives of financial markets derivatives.

See home foreclosure(s) foreign investors France Frank, Barney Freddie Mac and Fannie Mae accounting problems of affordable housing and Alternative-A loans bailout of Ben Bernanke and capital raised by competitive threats to Congress and Countrywide Financial and Democrats and Federal Reserve and foreign investment in Alan Greenspan and as guarantor history of lack of regulation of leadership changes leverage losses mortgage bubble and as mortgage traders Hank Paulson and politics and predatory lending and reasons for failures of relocation to private sector Robert Rodriguez and shareholders solving financial crisis through statistical models of stock price of Treasury Department and free market Freidheim, Scott Friedman, Milton Fuld, Richard compensation of failure to pull back from mortgage-backed securities identification with Lehman Brothers Lehman Brothers’ bankruptcy and Lehman Brothers’ last days and long tenure of Hank Paulson and personality and character of Gamble, James (Jamie) GDP Geithner, Timothy AIG and bank debt guarantees and Bear Stearns bailout and career of China and Citigroup and financial crisis, response to Lehman Brothers and money markets and Morgan Stanley and in Obama administration Hank Paulson and TARP and Gelband, Michael General Electric General Motors Germany Glass-Steagall Act Glauber, Robert Golden West Savings and Loan Goldman Sachs AIG and as bank holding company Warren Buffett investment in capital raised by capital sought by compensation at credit default swaps and hedge funds and insurance (credit default swap) premiums of job losses at leverage of Merrill Lynch and Stanley O’Neal’s obsession with Hank Paulson and pull back from mortgage-backed securities short selling against stock price of Wachovia and Gorton, Gary government, U.S.

pages: 461 words: 128,421

The Myth of the Rational Market: A History of Risk, Reward, and Delusion on Wall Street by Justin Fox

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

First, modeling financial risk is hard. Statistical models can never fully capture all things that can go wrong (or right). It was as physicist and random walk pioneer M. F. M. Osborne told his students at UC–Berkeley back in 1972: For everyday market events the bell curve works well. When it doesn’t, one needs to look outside the statistical models and make informed judgments about what’s driving the market and what the risks are. The derivatives business and other financial sectors on the rise in the 1980s and 1990s were dominated by young quants. These people knew how to work statistical models, but they lacked the market experience needed to make informed judgments. Meanwhile, those with the experience, wisdom, and authority to make informed judgments—the bosses—didn’t understand the statistical models. It’s possible that, as more quants rise into positions of high authority (1986 Columbia finance Ph.D.

Traditional ratios of loan-to-value and monthly payments to income gave way to credit scoring and purportedly precise gradations of default risk that turned out to be worse than useless. In the 1970s, Amos Tversky and Daniel Kahneman had argued that real-world decision makers didn’t follow the statistical models of John von Neumann and Oskar Morgenstern, but used simple heuristics—rules of thumb—instead. Now the mortgage lending industry was learning that heuristics worked much better than statistical models descended from the work of von Neumann and Morgenstern. Simple trumped complex. In 2005, Robert Shiller came out with a second edition of Irrational Exuberance that featured a new twenty-page chapter on “The Real Estate Market in Historical Perspective.” It offered no formulas for determining whether prices were right, but it did feature an index of U.S. home prices back to 1890.

Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals by David Aronson

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

It was a review of prior studies, known as a meta-analysis, which examined 20 studies that had compared the subjective diagnoses of psychologists and psychiatrists with those produced by linear statistical models. The studies covered the prediction of academic success, the likelihood of criminal recidivism, and predicting the outcomes of electrical shock therapy. In each case, the experts rendered a judgment by evaluating a multitude of variables in a subjective manner. “In all studies, the statistical model provided more accurate predictions or the two methods tied.”34 A subsequent study by Sawyer35 was a meta-analysis of 45 studies. “Again, there was not a single study in which clinical global judgment was superior to the statistical prediction (termed ‘mechanical combination’ by Sawyer).”36 Sawyer’s investigation is noteworthy because he considered studies in which the human expert was allowed access to information that was not considered by the statistical model, and yet the model was still superior.

The prediction problems spanned nine different fields: (1) academic performance of graduate students, (2) life-expectancy of cancer patients, (3) changes in stock prices, (4) mental illness using personality tests, (5) grades and attitudes in a psychology course, (6) business failures using financial ratios, (7) students’ ratings of teaching effectiveness, (8) performance of life insurance sales personnel, and (9) IQ scores using Rorschach Tests. Note that the average correlation of the statistical model was 0.64 versus the expert average of 0.33. In terms of information content, which is measured by the correlation coefficient squared or r-squared, the model’s predictions were on average 3.76 times as informative as the experts’. Numerous additional studies comparing expert judgment to statistical models (rules) have confirmed these findings, forcing the conclusion that people do poorly when attempting to combine a multitude of variables to make predictions or judgments. In 1968, Goldberg39 showed that a linear prediction model utilizing personality test scores as inputs could discriminate neurotic from psychotic patients better than experienced clinical diagnosticians.
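The "3.76 times as informative" figure follows directly from squaring the two correlations quoted above and taking their ratio. A small Python check, where `information_ratio` is a hypothetical helper name introduced only for this illustration:

```python
def information_ratio(r_model, r_expert):
    """Ratio of squared correlations (r-squared of the model over that of the experts)."""
    return (r_model ** 2) / (r_expert ** 2)


# Average correlations quoted in the passage: model 0.64, experts 0.33.
ratio = information_ratio(0.64, 0.33)
print(round(ratio, 2))  # 0.4096 / 0.1089, which rounds to 3.76
```

Because the comparison is made on r-squared rather than r, a model whose correlation is roughly twice the experts' comes out nearly four times as informative.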

The task was to predict the propensity for violence among newly admitted male psychiatric patients based on 19 inputs. The average accuracy of the experts, as measured by the correlation coefficient between their prediction of violence and the actual manifestation of violence, was a poor 0.12. The single best expert had a score of 0.36. The predictions of a linear statistical model, using the same set of 19 inputs, achieved a correlation of 0.82. In this instance the model’s predictions were nearly 50 times more informative than the experts’. Meehl continued to expand his research of comparing experts and statistical models and in 1986 concluded that “There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing 90 investigations [currently greater than 150]40 predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion.”41 The evidence continues to accumulate, yet few experts pay heed.

pages: 197 words: 35,256

NumPy Cookbook by Ivan Idris

diff: Calculates differences of numbers within a NumPy array. If not specified, first-order differences are computed.
log: Calculates the natural log of elements in a NumPy array.
sum: Sums the elements of a NumPy array.
dot: Does matrix multiplication for 2D arrays. Calculates the inner product for 1D arrays.

Installing scikits-statsmodels

The scikits-statsmodels package focuses on statistical modeling. It can be integrated with NumPy and Pandas (more about Pandas later in this chapter).

How to do it...

Source and binaries can be downloaded from http://statsmodels.sourceforge.net/install.html. If you are installing from source, you need to run the following command:

    python setup.py install

If you are using setuptools, the command is:

    easy_install statsmodels

Performing a normality test with scikits-statsmodels

The scikits-statsmodels package has lots of statistical tests.

Perform an ordinary least squares calculation by creating an OLS object and calling its fit method, as follows:

    x, y = data.exog, data.endog
    fit = statsmodels.api.OLS(y, x).fit()
    print("Fit params", fit.params)

This should print the result of the fitting procedure, as follows:

    Fit params
    COPPERPRICE         14.222028
    INCOMEINDEX       1693.166242
    ALUMPRICE          -60.638117
    INVENTORYINDEX    2515.374903
    TIME               183.193035

Summarize. The results of the OLS fit can be summarized by the summary method as follows:

    print(fit.summary())

This will give us the following output for the regression results: The code to load the copper data set is as follows:

    import statsmodels.api

    # See https://github.com/statsmodels/statsmodels/tree/master/statsmodels/datasets
    data = statsmodels.api.datasets.copper.load_pandas()
    x, y = data.exog, data.endog
    fit = statsmodels.api.OLS(y, x).fit()
    print("Fit params", fit.params)
    print()
    print("Summary")
    print()
    print(fit.summary())

How it works...

The data in the Dataset class of statsmodels follows a special format. Among others, this class has the endog and exog attributes. Statsmodels has a load function, which loads data as NumPy arrays. Instead, we used the load_pandas method, which loads data as Pandas objects. We did an OLS fit, basically giving us a statistical model for copper price and consumption.

Resampling time series data

In this tutorial, we will learn how to resample time series with Pandas.

How to do it...

We will download the daily price time series data for AAPL, and resample it to monthly data by computing the mean. We will accomplish this by creating a Pandas DataFrame, and calling its resample method. Creating a date-time index. Before we can create a Pandas DataFrame, we need to create a DatetimeIndex object to pass to the DataFrame constructor.
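To make concrete what an ordinary least squares fit computes, here is a dependency-free sketch that solves the normal equations (X'X)b = X'y directly. This is not how statsmodels implements OLS internally, and the toy data and helper names (`solve_linear_system`, `ols`) are hypothetical; it only illustrates the calculation on data where the true coefficients are known.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix copy, so the caller's inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(exog, endog):
    """Return coefficients minimizing the sum of squared residuals."""
    k = len(exog[0])
    xtx = [[sum(row[i] * row[j] for row in exog) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(exog, endog)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
# The first column of ones plays the role of the intercept.
exog = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 2.0, 2.0],
        [1.0, 3.0, 1.0], [1.0, 4.0, 3.0]]
endog = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in exog]

params = ols(exog, endog)
print([round(p, 6) for p in params])
```

Note that the excerpt's call passes `data.exog` as-is, so no intercept column is added there; in this sketch the intercept is included explicitly as a column of ones.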

pages: 416 words: 39,022

Asset and Risk Management: Risk Oriented Finance by Louis Esch, Robert Kieffer, Thierry Lopez

Table 6.3 Student distribution quantiles

ν        γ2     z0.95   z0.975  z0.99
5        6.00   2.601   3.319   4.344
10       1.00   2.026   2.491   3.090
15       0.55   1.883   2.289   2.795
20       0.38   1.818   2.199   2.665
25       0.29   1.781   2.148   2.591
30       0.23   1.757   2.114   2.543
40       0.17   1.728   2.074   2.486
60       0.11   1.700   2.034   2.431
120      0.05   1.672   1.997   2.378
normal   0      1.645   1.960   2.326

8 Blattberg R. and Gonedes N., A comparison of stable and student distributions as statistical models for stock prices, Journal of Business, Vol. 47, 1974, pp. 244–80.
9 Pearson E. S. and Hartley H. O., Biometrika Tables for Statisticians, Biometrika Trust, 1976, p. 146.

This clearly shows that when the normal law is used in place of the Student laws, the VaR parameter is underestimated unless the number of degrees of freedom is high. Example With the same data as above, that is, E(pt) = 100 and σ(pt) = 80, and for 15 degrees of freedom, we find the following evaluations of VaR, instead of 31.6, 64.3 and 86.1 respectively.
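The γ2 column of Table 6.3 appears to be the excess kurtosis of the Student distribution, which for ν > 4 degrees of freedom is 6/(ν − 4). That reading is our assumption, since the excerpt does not define γ2, but a short check reproduces the tabulated values to two decimals:

```python
def student_excess_kurtosis(nu: float) -> float:
    """Excess kurtosis of a Student distribution; finite only for nu > 4."""
    if nu <= 4:
        raise ValueError("excess kurtosis is not finite for nu <= 4")
    return 6.0 / (nu - 4.0)


# (degrees of freedom, gamma2 value as printed in Table 6.3)
tabulated = [(5, 6.00), (10, 1.00), (15, 0.55), (20, 0.38), (25, 0.29),
             (30, 0.23), (40, 0.17), (60, 0.11), (120, 0.05)]

for nu, printed in tabulated:
    computed = student_excess_kurtosis(nu)
    print(f"nu={nu:>3}  computed={computed:.3f}  table={printed:.2f}")
```

As ν grows the excess kurtosis tends to 0, the value on the "normal" row, which matches the excerpt's point that the normal approximation is only safe when the number of degrees of freedom is high.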

Using pt presents the twofold advantage of:
• making the magnitudes of the various factors likely to be involved in evaluating an asset or portfolio relative;
• supplying a variable that has been shown to be capable of possessing certain distributional properties (normality or quasi-normality for returns on equities, for example).

1 Estimating quantiles is often a complex problem, especially for arguments close to 0 or 1. Interested readers should read Gilchrist W. G., Statistical Modelling with Quantile Functions, Chapman & Hall/CRC, 2000.
2 If the risk factor X is a share price, we are looking at the return on that share (see Section 3.1.1).

[Figure 7.1 Estimating VaR. Boxes: valuation models, historical data, estimation technique, VaR.]

Note: In most calculation methods, a different expression is taken into consideration: $\Delta^*(t) = \ln \frac{X(t)}{X(t-1)}$. As we saw in Section 3.1.1, this is in fact very similar to $\Delta(t)$ and has the advantage that it can take on any real value [3] and that the logarithmic return for several consecutive periods is the sum of the logarithmic return for each of those periods.
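The additivity property mentioned in the note is easy to verify numerically. The sketch below uses a hypothetical price series and compares logarithmic returns, ln[X(t)/X(t − 1)], with simple returns, which do not add up across periods:

```python
import math

# Hypothetical price path X(0), ..., X(3).
prices = [100.0, 105.0, 98.0, 110.0]

# One-period logarithmic and simple returns.
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
simple_returns = [(b - a) / a for a, b in zip(prices, prices[1:])]

# Returns over the whole horizon.
total_log_return = math.log(prices[-1] / prices[0])
total_simple_return = (prices[-1] - prices[0]) / prices[0]

print(sum(log_returns), total_log_return)        # equal up to rounding
print(sum(simple_returns), total_simple_return)  # not equal
```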

If the model is nonstationary (nonstationary variance and/or mean), it can be converted into a stationary model by using the integration of order r after the logarithmic transformation: if y is the transformed variable, apply the technique to $\Delta(\Delta(\ldots\Delta(y_t)))$ ($\Delta$ applied $r$ times) instead of $y_t$, where $\Delta(y_t) = y_t - y_{t-1}$. We therefore use an ARIMA(p, r, q) procedure [16]. If this procedure fails because of nonconstant volatility in the error term, it will be necessary to use the ARCH-GARCH or EGARCH models (Appendix 7).

B. The equation on the replicated positions

This equation may be estimated by a statistical model (such as SAS/OR procedure PROC NPL), using multiple regression with the constraints $\sum_{i=3\ \text{months}}^{15\ \text{years}} \alpha_i = 1$ and $\alpha_i \ge 0$. It is also possible to estimate the replicated positions (b) with the single constraint (by using the SAS/STAT procedure) $\sum_{i=3\ \text{months}}^{15\ \text{years}} \alpha_i = 1$. In both cases, the duration of the demand product is a weighted average of the durations. In the second case, it is possible to obtain negative αi values.
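The integration of order r described above amounts to applying the first-difference operator r times. A minimal sketch, with a hypothetical helper name and a toy quadratic trend:

```python
def difference(series, order=1):
    """Apply the first difference y_t - y_(t-1) `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# A quadratic trend has a nonstationary mean; two differences remove it.
y = [t ** 2 for t in range(6)]   # [0, 1, 4, 9, 16, 25]

print(difference(y, order=1))    # [1, 3, 5, 7, 9]  (still trending)
print(difference(y, order=2))    # [2, 2, 2, 2]     (constant)
```

Each pass shortens the series by one observation, which is why an ARIMA(p, r, q) fit works with r fewer data points than the original series.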

pages: 252 words: 72,473

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O'Neil

The proxies the journalists chose for educational excellence make sense, after all. Their spectacular failure comes, instead, from what they chose not to count: tuition and fees. Student financing was left out of the model. This brings us to the crucial question we’ll confront time and again. What is the objective of the modeler? In this case, put yourself in the place of the editors at U.S. News in 1988. When they were building their first statistical model, how would they know when it worked? Well, it would start out with a lot more credibility if it reflected the established hierarchy. If Harvard, Stanford, Princeton, and Yale came out on top, it would seem to validate their model, replicating the informal models that they and their customers carried in their own heads. To build such a model, they simply had to look at those top universities and count what made them so special.

In a sense, it learns. Compared to the human brain, machine learning isn’t especially efficient. A child places her finger on the stove, feels pain, and masters for the rest of her life the correlation between the hot metal and her throbbing hand. And she also picks up the word for it: burn. A machine learning program, by contrast, will often require millions or billions of data points to create its statistical models of cause and effect. But for the first time in history, those petabytes of data are now readily available, along with powerful computers to process them. And for many jobs, machine learning proves to be more flexible and nuanced than the traditional programs governed by rules. Language scientists, for example, spent decades, from the 1960s to the early years of this century, trying to teach computers how to read.

Imagine if a highly motivated and responsible person with modest immigrant beginnings is trying to start a business and needs to rely on such a system for early investment. Who would take a chance on such a person? Probably not a model trained on such demographic and behavioral data. I should note that in the statistical universe proxies inhabit, they often work. More times than not, birds of a feather do fly together. Rich people buy cruises and BMWs. All too often, poor people need a payday loan. And since these statistical models appear to work much of the time, efficiency rises and profits surge. Investors double down on scientific systems that can place thousands of people into what appear to be the correct buckets. It’s the triumph of Big Data. And what about the person who is misunderstood and placed in the wrong bucket? That happens. And there’s no feedback to set the system straight. A statistics-crunching engine has no way to learn that it dispatched a valuable potential customer to call center hell.

Analysis of Financial Time Series by Ruey S. Tsay

Stable Distribution

The stable distributions are a natural generalization of normal in that they are stable under addition, which meets the need of continuously compounded returns $r_t$. Furthermore, stable distributions are capable of capturing excess kurtosis shown by historical stock returns. However, non-normal stable distributions do not have a finite variance, which is in conflict with most finance theories. In addition, statistical modeling using non-normal stable distributions is difficult. An example of non-normal stable distributions is the Cauchy distribution, which is symmetric with respect to its median, but has infinite variance.

Scale Mixture of Normal Distributions

Recent studies of stock returns tend to use scale mixture or finite mixture of normal distributions. Under the assumption of scale mixture of normal distributions, the log return $r_t$ is normally distributed with mean $\mu$ and variance $\sigma^2$ [i.e., $r_t \sim N(\mu, \sigma^2)$].

Furthermore, the lag-$\ell$ autocovariance of $r_t$ is

$$\gamma_\ell = \mathrm{Cov}(r_t, r_{t-\ell}) = E\left[\left(\sum_{i=0}^{\infty} \psi_i a_{t-i}\right)\left(\sum_{j=0}^{\infty} \psi_j a_{t-\ell-j}\right)\right] = E\left(\sum_{i,j=0}^{\infty} \psi_i \psi_j a_{t-i} a_{t-\ell-j}\right) = \sum_{j=0}^{\infty} \psi_{j+\ell} \psi_j E(a_{t-\ell-j}^2) = \sigma_a^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+\ell}.$$

Consequently, the ψ-weights are related to the autocorrelations of $r_t$ as follows:

$$\rho_\ell = \frac{\gamma_\ell}{\gamma_0} = \frac{\sum_{i=0}^{\infty} \psi_i \psi_{i+\ell}}{1 + \sum_{i=1}^{\infty} \psi_i^2}, \quad \ell \ge 0, \qquad (2.5)$$

where $\psi_0 = 1$. Linear time series models are econometric and statistical models used to describe the pattern of the ψ-weights of $r_t$.

2.4 SIMPLE AUTOREGRESSIVE MODELS

The fact that the monthly return $r_t$ of CRSP value-weighted index has a statistically significant lag-1 autocorrelation indicates that the lagged return $r_{t-1}$ might be useful in predicting $r_t$. A simple model that makes use of such predictive power is

$$r_t = \phi_0 + \phi_1 r_{t-1} + a_t, \qquad (2.6)$$

where $\{a_t\}$ is assumed to be a white noise series with mean zero and variance $\sigma_a^2$.
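Equation (2.5) can be evaluated numerically by truncating the infinite sums. The sketch below does so for the AR(1) model (2.6), using the standard result (not stated in this excerpt) that a stationary AR(1) has ψ-weights ψ_i = φ1^i; the value φ1 = 0.5 and the truncation length are arbitrary choices for illustration.

```python
def acf_from_psi(psi, lag):
    """Autocorrelation at `lag` from truncated psi-weights, with psi[0] == 1."""
    numerator = sum(psi[i] * psi[i + lag] for i in range(len(psi) - lag))
    denominator = sum(w * w for w in psi)   # equals 1 + sum of psi_i^2, i >= 1
    return numerator / denominator


phi1 = 0.5                                  # hypothetical AR(1) coefficient
psi = [phi1 ** i for i in range(200)]       # truncated psi-weights

for lag in range(4):
    print(lag, round(acf_from_psi(psi, lag), 6))
```

With these weights the computed autocorrelations decay geometrically, ρ_ℓ ≈ φ1^ℓ, which is the familiar AR(1) pattern.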

If $a_t$ has a symmetric distribution around zero, then conditional on $p_{t-1}$, $p_t$ has a 50–50 chance to go up or down, implying that $p_t$ would go up or down at random. If we treat the random-walk model as a special AR(1) model, then the coefficient of $p_{t-1}$ is unity, which does not satisfy the weak stationarity condition of an AR(1) model. A random-walk series is, therefore, not weakly stationary, and we call it a unit-root nonstationary time series. The random-walk model has been widely considered as a statistical model for the movement of logged stock prices. Under such a model, the stock price is not predictable or mean reverting. To see this, the 1-step ahead forecast of model (2.32) at the forecast origin $h$ is

$$\hat{p}_h(1) = E(p_{h+1} \mid p_h, p_{h-1}, \ldots) = p_h,$$

which is the log price of the stock at the forecast origin. Such a forecast has no practical value. The 2-step ahead forecast is

$$\hat{p}_h(2) = E(p_{h+2} \mid p_h, p_{h-1}, \ldots) = E(p_{h+1} + a_{h+2} \mid p_h, p_{h-1}, \ldots) = E(p_{h+1} \mid p_h, p_{h-1}, \ldots) = \hat{p}_h(1) = p_h,$$

which again is the log price at the forecast origin.
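The forecast result can be checked by simulation: averaging many simulated 2-step-ahead values of a random walk should recover the log price at the forecast origin. This is a sketch under assumed values (starting log price, shock volatility, Gaussian shocks, seed), none of which come from the book.

```python
import random


def simulate_ahead(p_h, steps, sigma, rng):
    """Simulate a random-walk log price `steps` periods beyond the origin."""
    p = p_h
    for _ in range(steps):
        p += rng.gauss(0.0, sigma)   # add a zero-mean shock each period
    return p


rng = random.Random(42)              # fixed seed for reproducibility
p_h = 4.6                            # hypothetical log price at the origin
sigma = 0.05                         # hypothetical shock standard deviation

draws = [simulate_ahead(p_h, steps=2, sigma=sigma, rng=rng)
         for _ in range(20000)]
mc_forecast = sum(draws) / len(draws)

print(f"origin: {p_h}, Monte Carlo 2-step forecast: {mc_forecast:.4f}")
```

The Monte Carlo average stays close to the origin value regardless of the horizon, which is the sense in which the forecast "has no practical value".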

pages: 481 words: 125,946

What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence by John Brockman

A literature pioneered by psychologists such as the late Robyn Dawes finds that virtually any routine decision-making task—detecting fraud, assessing the severity of a tumor, hiring employees—is done better by a simple statistical model than by a leading expert in the field. Let me offer just two illustrative examples, one from human-resource management and the other from the world of sports. First, let’s consider the embarrassing ubiquity of job interviews as an important, often the most important, determinant of who gets hired. At the University of Chicago Booth School of Business, where I teach, recruiters devote endless hours to interviewing students on campus for potential jobs—a process that selects the few who will be invited to visit the employer, where they will undergo another extensive set of interviews. Yet research shows that interviews are nearly useless in predicting whether a job prospect will perform well on the job. Compared to a statistical model based on objective measures such as grades in courses relevant to the job in question, interviews primarily add noise and introduce the potential for prejudice.

AI systems can be thought of as trying to approximate rational behavior using limited resources. There’s an algorithm for computing the optimal action for achieving a desired outcome, but it’s computationally expensive. Experiments have found that simple learning algorithms with lots of training data often outperform complex hand-crafted models. Today’s systems primarily provide value by learning better statistical models and performing statistical inference for classification and decision making. The next generation will be able to create and improve their own software and are likely to self-improve rapidly. In addition to improving productivity, AI and robotics are drivers for numerous military and economic arms races. Autonomous systems can be faster, smarter, and less predictable than their competitors.

Compared to a statistical model based on objective measures such as grades in courses relevant to the job in question, interviews primarily add noise and introduce the potential for prejudice. (Statistical models don’t favor any particular alma mater or ethnic background and cannot detect good looks.) These facts have been known for more than four decades, but hiring practices have barely budged. The reason is simple: Each of us just knows that if we are the one conducting an interview, we will learn a lot about the candidate. It might well be that other people are not good at this task, but I am! This illusion, in direct contradiction to empirical research, means that we continue to choose employees the same way we always did. We size them up, eye to eye. One domain where some progress has been made in adopting a more scientific approach to job-candidate selection is sports, as documented by the Michael Lewis book and movie Moneyball.

pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos

In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical: all humans are mortal, but only 4 percent are Americans. Skills are often in the form of procedures: if the road curves left, turn the wheel left; if a deer jumps in front of you, slam on the brakes. (Unfortunately, as of this writing Google’s self-driving cars still confuse windblown plastic bags with deer.) Often, the procedures are quite simple, and it’s the knowledge at their core that’s complex. If you can tell which e-mails are spam, you know which ones to delete. If you can tell how good a board position in chess is, you know which move to make (the one that leads to the best position). Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more.

They called this scheme the EM algorithm, where the E stands for expectation (inferring the expected probabilities) and the M for maximization (estimating the maximum-likelihood parameters). They also showed that many previous algorithms were special cases of EM. For example, to learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. This makes it one of the most popular algorithms in all of machine learning. You might have noticed a certain resemblance between k-means and EM, in that they both alternate between assigning entities to clusters and updating the clusters’ descriptions. This is not an accident: k-means itself is a special case of EM, which you get when all the attributes have “narrow” normal distributions, that is, normal distributions with very small variance.
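The alternation described above, assigning entities to clusters and then updating the clusters' descriptions, can be sketched with a one-dimensional k-means. This is an illustrative toy with hypothetical data and starting centers, not code from the book; full EM would replace the hard assignment with probabilities.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate hard assignment and mean update on 1-D data."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step (the hard-decision analogue of EM's E step).
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            clusters[nearest].append(x)
        # Update step (the analogue of EM's M step): recompute each mean,
        # keeping the old center if a cluster received no points.
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]      # two obvious groups
centers = kmeans_1d(points, centers=[0.0, 6.0])
print(centers)                               # close to [1.0, 5.0]
```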

See S curves Significance tests, 87 Silver, Nate, 17, 238 Similarity, 178, 179 Similarity measures, 192, 197–200, 207 Simon, Herbert, 41, 225–226, 302 Simultaneous localization and mapping (SLAM), 166 Singularity, 28, 186, 286–289, 311 The Singularity Is Near (Kurzweil), 286 Siri, 37, 155, 161–162, 165, 172, 255 SKICAT (sky image cataloging and analysis tool), 15, 299 Skills, learners and, 8, 217–227 Skynet, 282–286 Sloan Digital Sky Survey, 15 Smith, Adam, 58 Snow, John, 183 Soar, chunking in, 226 Social networks, information propagation in, 231 The Society of Mind (Minsky), 35 Space complexity, 5 Spam filters, 23–24, 151–152, 168–169, 171 Sparse autoencoder, 117 Speech recognition, 155, 170–172, 276, 306 Speed, learning algorithms and, 139–142 Spin glasses, brain and, 102–103 Spinoza, Baruch, 58 Squared error, 241, 243 Stacked autoencoder, 117 Stacking, 238, 255, 309 States, value of, 219–221 Statistical algorithms, 8 Statistical learning, 37, 228, 297, 300, 307 Statistical modeling, 8. See also Machine learning Statistical relational learning, 227–233, 254, 309 Statistical significance tests, 76–77 Statistics, Master Algorithm and, 31–32 Stock market predictions, neural networks and, 112, 302 Stream mining, 258 String theory, 46–47 Structure mapping, 199–200, 254, 307 Succession, rule of, 145–146 The Sun Also Rises (Hemingway), 106 Supervised learning, 209, 214, 220, 222, 226 Support vector machines (SVMs), 53, 179, 190–196, 240, 242, 244, 245, 254, 307 Support vectors, 191–193, 196, 243–244 Surfaces and Essences (Hofstadter & Sander), 200 Survival of the fittest programs, 131–134 Sutton, Rich, 221, 223 SVMs.

pages: 442 words: 39,064

Why Stock Markets Crash: Critical Events in Complex Financial Systems by Didier Sornette

Of special interest will be the study of the premonitory processes before financial crashes or “bubble” corrections in the stock market. For this purpose, I shall describe a new set of computational methods that are capable of searching and comparing patterns, simultaneously and iteratively, at multiple scales in hierarchical systems. I shall use these patterns to improve the understanding of the dynamical state before and after a financial crash and to enhance the statistical modeling of social hierarchical systems with the goal of developing reliable forecasting skills for these large-scale financial crashes.

IS PREDICTION POSSIBLE? A WORKING HYPOTHESIS

With the low of 3227 on April 17, 2000, identified as the end of the “crash,” the Nasdaq Composite index lost in five weeks over 37% of its all-time high of 5133 reached on March 10, 2000. This crash has not been followed by a recovery, as occurred from the October 1987 crash.

Following the null hypothesis that the exponential description is correct and extrapolating this description to, for example, the three largest crashes on the U.S. market in this century (1914, 1929, and 1987), as indicated in Figure 3.4, yields a recurrence time of about fifty centuries for each single crash. In reality, the three crashes occurred in less than one century. This result is a first indication that the exponential model may not apply for the large crashes. As an additional test, 10,000 so-called synthetic data sets, each covering a time span close to a century, hence adding up to about 1 million years, were generated using a standard statistical model used by the financial industry [46]. We use the model version GARCH(1,1) estimated from the true index with a Student distribution with four degrees of freedom. This model includes both nonstationarity of volatilities (the amplitude of price variations) and the (fat tail) nature of the distribution of the price returns seen in Figure 2.7. Our analysis [209] shows that, in approximately 1 million years of heavy tail “GARCH-trading,” with a reset every century, never did three crashes similar to the three largest observed in the true DJIA occur in a single “GARCH-century.”
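For readers unfamiliar with the surrogate-data model, the following sketch simulates a GARCH(1,1) return series with Student-distributed shocks (four degrees of freedom, as in the text). The parameter values `omega`, `alpha`, `beta`, the seed, and the function names are hypothetical placeholders; the book estimates its parameters from the true index and does not list them in this excerpt.

```python
import math
import random


def student_unit_variance(df, rng):
    """Student-t draw rescaled to unit variance (requires df > 2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)   # chi-square with df degrees of freedom
    t = z / math.sqrt(chi2 / df)
    return t * math.sqrt((df - 2.0) / df)


def simulate_garch11(n, omega, alpha, beta, df, rng):
    """Simulate n returns with var_t = omega + alpha*a_(t-1)^2 + beta*var_(t-1)."""
    var = omega / (1.0 - alpha - beta)       # start at the unconditional variance
    returns, variances = [], []
    for _ in range(n):
        shock = math.sqrt(var) * student_unit_variance(df, rng)
        returns.append(shock)
        variances.append(var)
        var = omega + alpha * shock ** 2 + beta * var
    return returns, variances


rng = random.Random(7)
# Hypothetical daily parameters; alpha + beta < 1 keeps the variance finite.
returns, variances = simulate_garch11(
    n=2500, omega=1e-6, alpha=0.05, beta=0.90, df=4, rng=rng)

print(f"largest one-day drop in this synthetic path: {min(returns):.4f}")
```

Repeating such a simulation many times and counting extreme drawdowns per synthetic century is the kind of experiment the passage describes.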

More recently, Feigenbaum has examined the first differences for the logarithm of the S&P 500 from 1980 to 1987 and finds that he cannot reject the log-periodic component at the 95% confidence level [127]: in plain words, this means that the probability that the log-periodic component results from chance is about or less than one in twenty. To further test the solidity of the advanced log-periodic hypothesis, Johansen, Ledoit, and I [209] tested the null hypothesis that a standard statistical model of financial markets, called the GARCH(1,1) model with Student-distributed noise, could “explain” the presence of log-periodicity. In the 1,000 surrogate data sets of length 400 weeks generated using this GARCH(1,1) model with Student-distributed noise and analyzed as for the real crashes, only two 400-week windows qualified. This result corresponds to a confidence level of 99.8% for rejecting the hypothesis that GARCH(1,1) with Student-distributed noise can generate meaningful log-periodicity.

pages: 518 words: 147,036

The Fissured Workplace by David Weil

The impact of shedding janitorial jobs in otherwise higher-wage companies is borne out in several studies of contracting out among janitorial workers. Using a statistical model to predict the factors that increase the likelihood of contracting out specific types of jobs, Abraham and Taylor demonstrate that the higher the typical wage for the workforce at an establishment, the more likely that establishment will contract out its janitorial work. They also show that establishments that do any contracting out of janitorial workers tend to shift out the function entirely.36 Wages and benefits for workers employed directly versus contracted out can be compared given the significant number of people in both groups. Using statistical models that control for both observed characteristics of the workers and the places in which they work, several studies directly compare the wages and benefits for these occupations.

For example, franchisees might be more common in areas where there is greater competition among fast-food restaurants. That competition (and franchising only indirectly) might lead them to have higher incentives to not comply. Alternatively, company-owned outlets might be in locations with stronger consumer markets, higher-skilled workers, or lower crime rates, all of which might also be associated with compliance. To adequately account for these problems, statistical models that consider all of the potentially relevant factors, including franchise status, are generated to predict compliance levels. By doing so, the effect of franchising can be examined, holding other factors constant. This allows measurement of the impact on compliance of an outlet being run by a franchisee with otherwise identical features, as opposed to a company-owned outlet. Figure 6.1 provides estimates of the impact of franchise ownership on three different measures of compliance for the top twenty branded fast-food companies in the United States.22 The figure presents the percentage difference in compliance between franchised outlets relative to otherwise comparable company-owned outlets of the same brand.23 FIGURE 6.1.

Mining entered into contract agreements at mine sites that Ember had never worked. This narrative is based on Federal Mine Safety and Health Review Commission, Secretary of Labor MSHA v. Ember Contracting Corporation, Office of Administrative Law Judges, November 4, 2011. I am grateful to Greg Wagner for flagging this case and to Andrew Razov for additional research on it. 26. These estimates are based on quarterly mining data from 2000–2010. Using statistical modeling techniques, two different measures of traumatic injuries and a direct measure of fatality rates are associated with contracting status of the mine operator as well as other explanatory factors, including mining method, physical attributes of the mine, union status, size of operations, year, and location. The contracting measure includes all forms of contracting. See Buessing and Weil (2013). 27.

pages: 336 words: 113,519

The Undoing Project: A Friendship That Changed Our Minds by Michael Lewis

He helped hire new management, then helped to figure out how to price tickets, and, finally, inevitably, was asked to work on the problem of whom to select in the NBA draft. “How will that nineteen-year-old perform in the NBA?” was like “Where will the price of oil be in ten years?” A perfect answer didn’t exist, but statistics could get you to some answer that was at least a bit better than simply guessing. Morey already had a crude statistical model to evaluate amateur players. He’d built it on his own, just for fun. In 2003 the Celtics had encouraged him to use it to pick a player at the tail end of the draft—the 56th pick, when the players seldom amount to anything. And thus Brandon Hunter, an obscure power forward out of Ohio University, became the first player picked by an equation.* Two years later Morey got a call from a headhunter who said that the Houston Rockets were looking for a new general manager.

He had a diffidence about him—an understanding of how hard it is to know anything for sure. The closest he came to certainty was in his approach to making decisions. He never simply went with his first thought. He suggested a new definition of the nerd: a person who knows his own mind well enough to mistrust it. One of the first things Morey did after he arrived in Houston—and, to him, the most important—was to install his statistical model for predicting the future performance of basketball players. The model was also a tool for the acquisition of basketball knowledge. “Knowledge is literally prediction,” said Morey. “Knowledge is anything that increases your ability to predict the outcome. Literally everything you do you’re trying to predict the right thing. Most people just do it subconsciously.” A model allowed you to explore the attributes in an amateur basketball player that led to professional success, and determine how much weight should be given to each.

Without data, there’s nothing to analyze. The Indian was DeAndre Jordan all over again; he was, like most of the problems you faced in life, a puzzle, with pieces missing. The Houston Rockets would pass on him—and be shocked when the Dallas Mavericks took him in the second round of the NBA draft. Then again, you never knew.†† And that was the problem: You never knew. In Morey’s ten years of using his statistical model with the Houston Rockets, the players he’d drafted, after accounting for the draft slot in which they’d been taken, had performed better than the players drafted by three-quarters of the other NBA teams. His approach had been sufficiently effective that other NBA teams were adopting it. He could even pinpoint the moment when he felt, for the first time, imitated. It was during the 2012 draft, when the players were picked in almost the exact same order the Rockets ranked them.

pages: 447 words: 104,258

Mathematics of the Financial Markets: Financial Instruments and Derivatives Modelling, Valuation and Risk Issues by Alain Ruttiens

FOCARDI, Frank J. FABOZZI, The Mathematics of Financial Modeling and Investment Management, John Wiley & Sons, Inc., Hoboken, 2004, 800 p. Lawrence GALITZ, Financial Times Handbook of Financial Engineering, FT Press, 3rd ed. Scheduled on November 2011, 480 p. Philippe JORION, Financial Risk Manager Handbook, John Wiley & Sons, Inc., Hoboken, 5th ed., 2009, 752 p. Tze Leung LAI, Haipeng XING, Statistical Models and Methods for Financial Markets, Springer, 2008, 374 p. David RUPPERT, Statistics and Finance, An Introduction, Springer, 2004, 482 p. Dan STEFANICA, A Primer for the Mathematics of Financial Engineering, FE Press, 2011, 352 p. Robert STEINER, Mastering Financial Calculations, FT Prentice Hall, 1997, 400 p. John L. TEALL, Financial Market Analytics, Quorum Books, 1999, 328 p. Presents the maths needed to understand quantitative finance, with examples and applications focusing on financial markets. 1.

More generally, Jarrow has developed some general but very useful considerations about model risk in an article devoted to risk management models, but valid for any kind of (financial) mathematical model [17]. In his article, Jarrow distinguishes between statistical and theoretical models: the former model the evolution of a market price or return based on historical data, as a GARCH model does. What some fund or portfolio managers develop as “quantitative models” also belongs to the statistical category. Theoretical models, on the other hand, aim to establish some causality based on financial or economic reasoning, for example the Black–Scholes formula. Both types of model imply some assumptions: Jarrow distinguishes between robust and non-robust assumptions, depending on the size of the impact when the assumption is slightly modified. The article then develops pertinent considerations about testing, calibrating and using a model.

Philippe JORION, Financial Risk Manager Handbook, John Wiley & Sons, Inc., Hoboken, 6th ed., 2010, 800 p. E. JURCZENKO, B. MAILLET (eds), Multi-Moment Asset Allocation and Pricing Models, John Wiley & Sons, Ltd, Chichester, 2006, 233 p. Ioannis KARATZAS, Steven E. SHREVE, Methods of Mathematical Finance, Springer, 2010, 430 p. Donna KLINE, Fundamentals of the Futures Market, McGraw-Hill, 2000, 256 p. Tze Leung LAI, Haipeng XING, Statistical Models and Methods for Financial Markets, Springer, 2008, 374 p. Raymond M. LEUTHOLD, Joan C. JUNKUS, Jean E. CORDIER, The Theory and Practice of Futures Markets, Stipes Publishing, 1999, 410 p. Bob LITTERMAN, Modern Investment Management – An Equilibrium Approach, John Wiley & Sons, Inc., Hoboken, 2003, 624 p. T. LYNCH, J. APPLEBY, Large Fluctuation of Stochastic Differential Equations: Regime Switching and Applications to Simulation and Finance, LAP LAMBERT Academic Publishing, 2010, 240 p.

Quantitative Trading: How to Build Your Own Algorithmic Trading Business by Ernie Chan

I will illustrate this somewhat convoluted procedure at the end of Example 3.6.

Data-Snooping Bias

In Chapter 2, I mentioned data-snooping bias: the danger that backtest performance is inflated relative to the future performance of the strategy because we have overoptimized the parameters of the model based on transient noise in the historical data. Data-snooping bias is pervasive in the business of predictive statistical models of historical data, but is especially serious in finance because of the limited amount of independent data we have. High-frequency data, while in abundant supply, is useful only for high-frequency models. And while we have stock market data stretching back to the early parts of the twentieth century, only data within the past 10 years are really suitable for building predictive models. Furthermore, as discussed in Chapter 2, regime shifts may render even data that are just a few years old obsolete for backtesting purposes.
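A small simulation makes the bias tangible. In this sketch (our illustration, not an example from the book) every "strategy" is pure noise with zero true edge; picking the best one on in-sample data still yields an impressive backtest, while the same strategy's out-of-sample result is just another noise draw. The counts, volatility and seed are arbitrary assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
n_strategies, n_days = 200, 250          # hypothetical search size and sample length

# Daily returns of strategies with NO real edge: zero mean, 1% volatility.
in_sample = [[rng.gauss(0.0, 0.01) for _ in range(n_days)]
             for _ in range(n_strategies)]
out_of_sample = [[rng.gauss(0.0, 0.01) for _ in range(n_days)]
                 for _ in range(n_strategies)]

# "Optimize" by choosing the strategy with the best backtest.
best = max(range(n_strategies), key=lambda k: mean(in_sample[k]))
best_in = mean(in_sample[best])
best_out = mean(out_of_sample[best])

print(f"best backtest mean daily return: {best_in:.5f}")
print(f"same strategy out of sample:     {best_out:.5f}")
```

The more variants are tried against the same history, the larger the selected backtest figure becomes, even though nothing about the strategies has improved.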

Chan & Associates (www.epchan.com), a consulting firm focusing on trading strategy and software development for money managers. He also co-manages EXP Quantitative Investments, LLC and publishes the Quantitative Trading blog (epchan.blogspot.com), which is syndicated to multiple financial news services including www.tradingmarkets.com and Yahoo! Finance. He has been quoted by the New York Times and CIO magazine on quantitative hedge funds, and has appeared on CNBC’s Closing Bell. Ernie is an expert in developing statistical models and advanced computer algorithms to discover patterns and trends from large quantities of data. He was a researcher in computer science at IBM’s T. J. Watson Research Center, in data mining at Morgan Stanley, and in statistical arbitrage trading at Credit Suisse. He has also been a senior quantitative strategist and trader at various hedge funds, with sizes ranging from millions to billions of dollars.

pages: 49 words: 12,968

Industrial Internet by Jon Bruner

“Imagine trying to operate a highway system if all you have are monthly traffic readings for a few spots on the road. But that’s what operating our power system was like.” The utility’s customers benefit, too — an example of the industrial internet creating value for every entity to which it’s connected. Fort Collins utility customers can see data on their electric usage through a Web portal that uses a statistical model to estimate how much electricity they’re using on heating, cooling, lighting and appliances. The site then draws building data from county records to recommend changes to insulation and other improvements that might save energy. Water meters measure usage every hour — frequent enough that officials will soon be able to dispatch inspection crews to houses whose vacationing owners might not know about a burst pipe.

pages: 1,088 words: 228,743

Expected Returns: An Investor's Guide to Harvesting Market Rewards by Antti Ilmanen

This is an in-sample measure and can be misleading if the correlations are not stable over time. Note, though, that most academic studies rely on such in-sample relations; econometricians simply assume that any observed statistical relation between predictors and subsequent market returns was already known to rational investors in real time. Practitioners who find this assumption unrealistic try to avoid in-sample bias by selecting and/or estimating statistical models repeatedly using only data that were available at each point in time, so as to assess predictability in a quasi-out-of-sample sense, but never completely succeeding in doing so. Table 8.6. Correlations with future excess returns of the S&P 500, 1962–2009 Sources: Haver Analytics, Robert Shiller’s website, Amit Goyal’s website, own calculations. Valuations. Various valuation ratios have predictive correlations between 10% and 20% for the next quarter [5].
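The quasi-out-of-sample idea mentioned in this excerpt (re-estimating the model at each date using only data available at that date) can be sketched with an expanding window. This is an illustrative sketch under assumptions, not the author's method: the function names, the choice of a one-predictor least-squares regression, and the toy numbers are all hypothetical.

```python
def ols(xs, ys):
    """Closed-form simple regression; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def expanding_window_forecasts(predictor, next_return, min_obs=5):
    """Forecast each period using only observations from earlier periods.

    predictor[i]   : signal observed at time i
    next_return[i] : return realized over the period following time i,
                     so next_return[:t] is fully known when predictor[t]
                     is observed.
    """
    forecasts = []
    for t in range(min_obs, len(predictor)):
        intercept, slope = ols(predictor[:t], next_return[:t])
        forecasts.append(intercept + slope * predictor[t])
    return forecasts


# Toy, made-up numbers purely to show the call pattern.
signal = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.2, 0.7]
realized = [0.01, 0.03, 0.00, 0.04, 0.02, 0.05, 0.01, 0.06]
print(expanding_window_forecasts(signal, realized, min_obs=4))
```

Comparing these forecasts with the realized returns gives a predictability measure that does not use future data for estimation. As the excerpt cautions, the choice of predictor and model may still reflect hindsight.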

They treat default (or rating change) as a random event whose probability can be estimated from observed market prices in the context of an analytical model (or directly from historical default data). Useful indicators, besides equity volatility and leverage, include past equity returns, certain financial ratios, and proxies for the liquidity premium. This modeling approach is sort of a compromise between statistical models and theoretically purer structural models. Reduced-form models can naturally match market spreads better than structural models, but unconstrained indicator selection can make them overfitted to in-sample data. Box 10.1. (wonkish) Risk-neutral and actual default probabilities Under certain assumptions (continuous trading, a single-factor diffusion process), positions in risky assets can be perfectly hedged and thus should earn riskless return.

However, there is some evidence of rising correlations across all quant strategies, presumably due to common positions among leveraged traders. 12.7 NOTES [1] Like many others, I prefer to use economic intuition as one guard against data mining, but the virtues of such intuition can be overstated as our intuition is inevitably influenced by past experiences. Purely data-driven statistical approaches are even worse, but at least then statistical models can help assess the magnitude of data-mining bias. [2] Here are some additional points on VMG: —No trading costs or financing costs related to shorting are subtracted from VMG returns. This is typical for academic studies because such costs are trade specific and/or investor specific and, moreover, such data are not available over long histories. —VMG is constructed in a deliberately conservative (“underfitted”) manner.

pages: 58 words: 18,747

The Rent Is Too Damn High: What to Do About It, and Why It Matters More Than You Think by Matthew Yglesias

That said, though automobiles are unquestionably a useful technology, they’re not teleportation devices and they haven’t abolished distance. Location still matters, and some land is more valuable than other land. Since land and structures are normally sold in a bundle, it’s difficult in many cases to get precise numbers on land prices as such. But researchers at the Federal Reserve Bank of New York used a statistical model based on prices paid for vacant lots and for structures that were torn down to be replaced by brand-new buildings and found that the price of land in the metro area is closely linked to its distance from the Empire State Building.

[Chart 1: Land Prices and Distance of Property from Empire State Building. Vertical axis: natural logarithm of land price per square foot. Horizontal axis: distance from Empire State Building (kilometers).]

In general, the expensive land should be much more densely built upon than the cheap land.

pages: 183 words: 17,571

Broken Markets: A User's Guide to the Post-Finance Economy by Kevin Mellyn

Regulators were becoming increasingly comfortable with the “market-centric” model too, because the securities churned out had to be properly vetted and rated by the credit agencies under SEC (Securities and Exchange Commission) rules. Moreover, distributing risk to large numbers of sophisticated institutions seemed safer than leaving it concentrated on the books of individual banks. Besides, even the Basel-process experts had become convinced that bank risk management had reached a new level of effectiveness through the use of sophisticated statistical models, and the Basel II rules that superseded Basel I especially allowed the largest and most sophisticated banks to use approved models to set their capital requirements. The fly in the ointment of market-centric finance was that it allowed an almost infinite expansion of credit in the economy, but creditworthy risks are by definition finite. At some point, every household with a steady income has seven credit cards, a mortgage, and a home equity line.

Americans make this tradeoff with limited fair-credit-reporting protections, while many other societies do not. It is critical to understand that a credit score is only a measure of whether a consumer can service a certain amount of credit—that is, make timely interest and principal payments. It is not concerned with the ability to pay off debts over time. What it really measures is the probability that an individual will default. This is a statistical model–based determination, and as such is hostage to historical experience of the behavior of tens of millions of individuals. The factors that over time have proved most predictive include not only behavior—late or missed payments on any bill, not just a loan, signals potential default—but also circumstances. Home ownership of long duration is a plus. So is long-term employment at the same firm.

pages: 238 words: 77,730

Final Jeopardy: Man vs. Machine and the Quest to Know Everything by Stephen Baker

The Google team had fed millions of translated documents, many of them from the United Nations, into their computers and supplemented them with a multitude of natural-language text culled from the Web. This training set dwarfed their competitors’. Without knowing what the words meant, their computers had learned to associate certain strings of words in Arabic and Chinese with their English equivalents. Since they had so very many examples to learn from, these statistical models caught nuances that had long confounded machines. Using statistics, Google’s computers won hands down. “Just like that, they bypassed thirty years of work on machine translation,” said Ed Lazowska, the chairman of the computer science department at the University of Washington. The statisticians trounced the experts. But the statistically trained machines they built, whether they were translating from Chinese or analyzing the ads that a Web surfer clicked, didn’t know anything.

“We knew all of its algorithms,” he said, and the team had precise statistics on every aspect of its behavior. The human players were more complicated. Tesauro had to pull together statistics on the thousands of humans who had played Jeopardy: how often they buzzed in, their precision in different levels of clues, their betting patterns for Daily Doubles and Final Jeopardy. From these, the IBM team pieced together statistical models of two humans. Then they put them into action against the model of Watson. The games had none of the life or drama of Jeopardy—no suspense, no jokes, no jingle while the digital players came up with their Final Jeopardy responses. They were only simulations of the scoring dynamics of Jeopardy. Yet they were valuable. After millions of games, Tesauro was able to calculate the value of each clue at each state of the game.

pages: 304 words: 80,965

What They Do With Your Money: How the Financial System Fails Us, and How to Fix It by Stephen Davis, Jon Lukomnik, David Pitt-Watson

Even if they change your life profoundly, such days are not likely to resemble the ones before and after. That is why the day you get married is so memorable. In fact, the elements of that day are not likely to be present in the sample of any of the previous 3,652 days.28 So how could the computer possibly calculate the likelihood of their recurring tomorrow, or next week? Similarly, in the financial world, if you feed a statistical model data that have come from a period where there has been no banking crisis, the model will predict that it is very unlikely you will have a banking crisis. When statisticians worked out that a financial crisis of the sort we witnessed in 2008 would occur once in billions of years, their judgment was based on years of data when there had not been such a crisis.29 It compounds the problem that people tend to simplify the outcome of risk models.

Just as the laws of gravity don’t explain magnetism or subatomic forces, so the disciplines of economics that held sway in our financial institutions paid little attention to the social, cultural, legal, political, institutional, moral, psychological, and technological forces that shape our economy’s behavior. The compass that bankers and regulators were using worked well according to its own logic, but it was pointing in the wrong direction, and they steered the ship onto the rocks. History does not record whether the Queen was satisfied with the academics’ response. She might, however, have noted that this economic-statistical model had been found wanting before—in 1998, when the collapse of the hedge fund Long-Term Capital Management nearly took the financial system down with it. Ironically, its directors included the two people who had shared the Nobel Prize in Economics the previous year.20 The Queen might also have noted the glittering lineup of senior economists who, over the last century, have warned against excessive confidence in predictions made using models.

pages: 88 words: 25,047

The Mathematics of Love: Patterns, Proofs, and the Search for the Ultimate Equation by Hannah Fry

Statistical Science, 1989. Todd, Peter M. ‘Searching for the Next Best Mate.’ Simulating Social Phenomena, edited by Rosaria Conte, Rainer Hegselmann, Pietro Terna, 419–36. Berlin: Springer Berlin Heidelberg, 1997. CHAPTER 8: HOW TO OPTIMIZE YOUR WEDDING Bellows, Meghan L. and J. D. Luc Peterson. ‘Finding an Optimal Seating Chart.’ Annals of Improbable Research, 2012. Alexander, R. A Statistically Modelled Wedding. (2014): http://www.bbc.co.uk/news/magazine-25980076. CHAPTER 9: HOW TO LIVE HAPPILY EVER AFTER Gottman, John M., James D. Murray, Catherine C. Swanson, Rebecca Tyson and Kristin R. Swanson. The Mathematics of Marriage: Dynamic Nonlinear Models. Cambridge, MA.: Basic Books, 2005. AUTHOR THANKS This book isn’t exactly War and Peace, but it has still required help and support from a number of wonderful people.

pages: 752 words: 131,533

Python for Data Analysis by Wes McKinney

While readers may have many different end goals for their work, the tasks required generally fall into a number of different broad groups:

Interacting with the outside world: Reading and writing with a variety of file formats and databases.
Preparation: Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Transformation: Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.
Modeling and computation: Connecting your data to statistical models, machine learning algorithms, or other computational tools.
Presentation: Creating interactive or static graphical visualizations or textual summaries.

In this chapter I will show you a few data sets and some things we can do with them. These examples are just intended to pique your interest and thus will only be explained at a high level. Don’t worry if you have no experience with any of these tools; they will be discussed in great detail throughout the rest of the book.

To create a Panel, you can use a dict of DataFrame objects or a three-dimensional ndarray:

    import pandas.io.data as web

    pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
                          for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))

Each item (the analogue of columns in a DataFrame) in the Panel is a DataFrame:

    In [297]: pdata
    Out[297]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 861 (major) x 6 (minor)
    Items: AAPL to MSFT
    Major axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
    Minor axis: Open to Adj Close

    In [298]: pdata = pdata.swapaxes('items', 'minor')

    In [299]: pdata['Adj Close']
    Out[299]:
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 861 entries, 2009-01-02 00:00:00 to 2012-06-01 00:00:00
    Data columns:
    AAPL    861 non-null values
    DELL    861 non-null values
    GOOG    861 non-null values
    MSFT    861 non-null values
    dtypes: float64(4)

ix-based label indexing generalizes to three dimensions, so we can select all data at a particular date or a range of dates like so:

    In [300]: pdata.ix[:, '6/1/2012', :]
    Out[300]:
            Open    High     Low   Close    Volume  Adj Close
    AAPL  569.16  572.65  560.52  560.99  18606700     560.99
    DELL   12.15   12.30   12.05   12.07  19396700      12.07
    GOOG  571.79  572.65  568.35  570.98   3057900     570.98
    MSFT   28.76   28.96   28.44   28.45  56634300      28.45

    In [301]: pdata.ix['Adj Close', '5/22/2012':, :]
    Out[301]:
                  AAPL   DELL    GOOG   MSFT
    Date
    2012-05-22  556.97  15.08  600.80  29.76
    2012-05-23  570.56  12.49  609.46  29.11
    2012-05-24  565.32  12.45  603.66  29.07
    2012-05-25  562.29  12.46  591.53  29.06
    2012-05-29  572.27  12.66  594.34  29.56
    2012-05-30  579.17  12.56  588.23  29.34
    2012-05-31  577.73  12.33  580.86  29.19
    2012-06-01  560.99  12.07  570.98  28.45

An alternate way to represent panel data, especially for fitting statistical models, is in “stacked” DataFrame form:

    In [302]: stacked = pdata.ix[:, '5/30/2012':, :].to_frame()

    In [303]: stacked
    Out[303]:
                        Open    High     Low   Close    Volume  Adj Close
    major      minor
    2012-05-30 AAPL   569.20  579.99  566.56  579.17  18908200     579.17
               DELL    12.59   12.70   12.46   12.56  19787800      12.56
               GOOG   588.16  591.90  583.53  588.23   1906700     588.23
               MSFT    29.35   29.48   29.12   29.34  41585500      29.34
    2012-05-31 AAPL   580.74  581.50  571.46  577.73  17559800     577.73
               DELL    12.53   12.54   12.33   12.33  19955500      12.33
               GOOG   588.72  590.00  579.00  580.86   2968300     580.86
               MSFT    29.30   29.42   28.94   29.19  39134000      29.19
    2012-06-01 AAPL   569.16  572.65  560.52  560.99  18606700     560.99
               DELL    12.15   12.30   12.05   12.07  19396700      12.07
               GOOG   571.79  572.65  568.35  570.98   3057900     570.98
               MSFT    28.76   28.96   28.44   28.45  56634300      28.45

DataFrame has a related to_panel method, the inverse of to_frame:

    In [304]: stacked.to_panel()
    Out[304]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 6 (items) x 3 (major) x 4 (minor)
    Items: Open to Adj Close
    Major axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
    Minor axis: AAPL to MSFT

Chapter 6.
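Later pandas releases removed both `Panel` and `.ix`, so the session in this excerpt will not run on a current install. The "stacked" form it describes can still be built directly as a DataFrame with a two-level index. The following is a minimal sketch under that assumption; it reuses only the Adj Close figures printed in the excerpt, and the level names `major` and `minor` are kept purely to mirror the excerpt's output.

```python
import pandas as pd

dates = pd.to_datetime(['2012-05-30', '2012-05-31', '2012-06-01'])

# Adj Close values copied from the excerpt's printed output.
adj_close = {
    'AAPL': [579.17, 577.73, 560.99],
    'DELL': [12.56, 12.33, 12.07],
    'GOOG': [588.23, 580.86, 570.98],
    'MSFT': [29.34, 29.19, 28.45],
}

# One small DataFrame per ticker, indexed by date.
frames = {stk: pd.DataFrame({'Adj Close': vals}, index=dates)
          for stk, vals in adj_close.items()}

# Concatenating a dict of frames gives a (ticker, date) MultiIndex;
# swapping the levels yields the (major=date, minor=ticker) stacked layout.
stacked = pd.concat(frames, names=['minor', 'major']).swaplevel().sort_index()
print(stacked)

# Pivot the ticker level back into columns for a date-by-ticker table.
wide = stacked['Adj Close'].unstack('minor')
print(wide.loc[pd.Timestamp('2012-06-01')])
```

The stacked layout is the one most model-fitting code expects (one row per observation), while the `unstack` call recovers the wide table that the `pdata['Adj Close']` selection produced in the excerpt.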

There are much more efficient sampling-without-replacement algorithms, but this is an easy strategy that uses readily available tools:

    In [183]: df.take(np.random.permutation(len(df))[:3])
    Out[183]:
        0   1   2   3
    1   4   5   6   7
    3  12  13  14  15
    4  16  17  18  19

To generate a sample with replacement, the fastest way is to use np.random.randint to draw random integers:

    In [184]: bag = np.array([5, 7, -1, 6, 4])

    In [185]: sampler = np.random.randint(0, len(bag), size=10)

    In [186]: sampler
    Out[186]: array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1])

    In [187]: draws = bag.take(sampler)

    In [188]: draws
    Out[188]: array([ 4,  4, -1, -1, -1,  5,  6,  5,  4,  7])

Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame containing k columns containing all 1’s and 0’s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let’s return to an earlier example DataFrame:

    In [189]: df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
       .....:                 'data1': range(6)})

    In [190]: pd.get_dummies(df['key'])
    Out[190]:
       a  b  c
    0  0  1  0
    1  0  1  0
    2  1  0  0
    3  0  0  1
    4  1  0  0
    5  0  1  0

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing just this:

    In [191]: dummies = pd.get_dummies(df['key'], prefix='key')

    In [192]: df_with_dummy = df[['data1']].join(dummies)

    In [193]: df_with_dummy
    Out[193]:
       data1  key_a  key_b  key_c
    0      0      0      1      0
    1      1      0      1      0
    2      2      1      0      0
    3      3      0      0      1
    4      4      1      0      0
    5      5      0      1      0

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated.
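The excerpt remarks that devising an indicator encoder yourself is not difficult. As a rough illustration of what that might look like, here is a plain-Python sketch. The helper name `make_dummies` is hypothetical and is not part of pandas; it returns column names and rows rather than a DataFrame.

```python
def make_dummies(values, prefix=None):
    """Return (column_names, rows) for a 0/1 indicator encoding.

    Hypothetical helper for illustration: one column per distinct value,
    in sorted order, and one row per input value.
    """
    levels = sorted(set(values))
    names = [f"{prefix}_{lv}" if prefix else str(lv) for lv in levels]
    rows = [[1 if v == lv else 0 for lv in levels] for v in values]
    return names, rows


names, rows = make_dummies(['b', 'b', 'a', 'c', 'a', 'b'], prefix='key')
print(names)
for row in rows:
    print(row)
```

Run on the excerpt's `key` column, this produces the same three indicator columns (`key_a`, `key_b`, `key_c`) and the same 0/1 pattern as the `get_dummies` output shown there.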

pages: 467 words: 116,094

I Think You'll Find It's a Bit More Complicated Than That by Ben Goldacre

Obviously, there are no out gay people in the eighteen-to-twenty-four group who came out at an age later than twenty-four; so the average age at which people in the eighteen-to-twenty-four group came out cannot possibly be greater than the average age of that group, and certainly it will be lower than, say, thirty-seven, the average age at which people in their sixties came out. For the same reason, it’s very likely indeed that the average age of coming out will increase as the average age of each age group rises. In fact, if we assume (in formal terms we could call this a ‘statistical model’) that at any time, all the people who are out have always come out at a uniform rate between the age of ten and their current age, you would get almost exactly the same figures (you’d get fifteen, twenty-three and thirty-five, instead of seventeen, twenty-one and thirty-seven). This is almost certainly why ‘the average coming-out age has fallen by over twenty years’: in fact you could say that Stonewall’s survey has found that on average, as people get older, they get older.
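The arithmetic behind the uniform-rate model in this excerpt is short enough to write out. If someone now aged a came out at a uniformly random age between ten and a, the expected coming-out age is the midpoint, (10 + a) / 2. In the sketch below, the function name and the three representative ages (20, 36 and 60) are assumptions chosen for illustration; the excerpt does not state which ages the author used.

```python
def expected_coming_out_age(current_age, earliest_age=10):
    """Mean of a uniform distribution on [earliest_age, current_age]."""
    return (earliest_age + current_age) / 2


# Representative ages are illustrative assumptions, not survey figures.
for age in (20, 36, 60):
    print(age, expected_coming_out_age(age))
```

Under these assumed ages the model gives 15, 23 and 35, the figures the excerpt quotes for the model. More generally, the expected value rises by half a year for every extra year of current age, which is the excerpt's point that older respondents must, on average, report older coming-out ages.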

For example, a recent study identified two broad subpopulations of cyclist: ‘one speed-happy group that cycle fast and have lots of cycle equipment including helmets, and one traditional kind of cyclist without much equipment, cycling slowly’. The study concluded that compulsory cycle-helmet legislation may selectively reduce cycling in the second group. There are even more complex second-round effects if each individual cyclist’s safety is improved by increased cyclist density through ‘safety in numbers’, a phenomenon known as Smeed’s law. Statistical models for the overall impact of helmet habits are therefore inevitably complex and based on speculative assumptions. This complexity seems at odds with the current official BMA policy, which confidently calls for compulsory helmet legislation. Standing over all this methodological complexity is a layer of politics, culture and psychology. Supporters of helmets often tell vivid stories about someone they knew, or heard of, who was apparently saved from severe head injury by a helmet.

A&E departments: randomised trials in 208; waiting times 73–5 abdominal aortic aneurysms (AAA) 18, 114 abortion; GPs and xviii, 89–91; Science and Technology Committee report on ‘scientific developments relating to the Abortion Act, 1967’ 196–201 academia, bad xviii–xix, 127–46; animal experiments, failures in research 136–8; brain-imaging studies report more positive findings than their numbers can support 131–4; journals, failures of academic 138–46; Medical Hypotheses: Aids denialism in 138–41; Medical Hypotheses: ‘Down Subjects and Oriental Population Share Several Specific Attitudes and Characteristics’ article 139, 141–3; Medical Hypotheses: masturbation as a treatment for nasal congestion articles 139, 143–6; misuse of statistics 129–31; retractions, academic literature and 134–6 academic journals: access to papers published in 32–4, 143; cherry-picking and 5–8; ‘citation classics’ and 9–10, 102–3, 173; commercial ghost writers and 25–6; data published in newspapers rather than 17–20; doctors and technical academic journals 214; ‘impact factor’ 143; number of 14, 17; peer review and 138–46 see also peer review; poor quality (‘crap’) 138–46; refusal to publish in 3–5; retractions and 134–6; statistical model errors in 129–31; studies of errors in papers published in 9–10, 129–31; summaries of important new research from 214–15; teaching and 214–15; youngest people to publish papers in 11–12 academic papers xvi; access to 32–4; cherry-picking from xvii, 5–8, 12, 174, 176–7, 192, 193, 252, 336, 349, 355; ‘citation classics’ 9–10, 102–3, 173; commercial ‘ghost writers’ and 25–6; investigative journalism work and 18; journalists linking work to 342, 344, 346; number of 14; peer review and see peer review; post-publication 4–5; press releases and xxi, 6, 29–31, 65, 66, 107–9, 119, 120, 121–2, 338–9, 340–2, 358–60; public relations and 358–60; publication bias 132–3, 136, 314, 315; references to other academic papers within allowing study of how ideas spread 26; 
refusal to publish in 3–5, 29–31; retractions and 134–6; studies of errors in 9–10, 129–31; titles of 297 Acousticom 366 acupuncture 39, 388 ADE 651 273–5 ADHD 40–2 Advertising Standards Authority (ASA) 252 Afghanistan 231; crop captures in xx, 221–4 Ahn, Professor Anna 341 Aids; antiretroviral drugs and 140, 185, 281, 284, 285; Big Pharma and 186; birth control, abortion and US Christian aid groups 185; Catholic Church fight against condom use and 183–4; cures for 12, 182–3, 185–6, 366; denialism 138–41, 182–3, 185–6, 263, 273, 281–6; drug users and 182, 183, 233–4; House of Numbers film 281–3; Medical Hypotheses, Aids denial in 138–41; needle-exchange programmes and 182, 183; number of deaths from 20, 186, 309; power of ideas and 182–7; Roger Coghill and ‘the Aids test’ 366; Spectator, Aids denialism at the xxi, 283–6; US Presidential Emergency Plan for Aids Relief 185 Aidstruth.org 139 al-Jabiri, Major General Jehad 274–5 alcohol: intravenous use of 233; lung cancer and 108–9; rape and consumption of 329, 330 ALLHAT trial 119 Alzheimer’s, smoking and 20–1 American Academy of Child and Adolescent Psychiatry 325 American Association on Mental Retardation 325 American Journal of Clinical Nutrition 344 American Medical Association 262 American Psychological Association 325 American Speech-Language-Hearing Association 325 anecdotes, illustrating data with 8, 118–22, 189, 248–9, 293 animal experiments 136–8 Annals of Internal Medicine 358 Annals of Thoracic Surgery 134 anti-depressants 18; recession linked to rise in prescriptions for xviii, 104–7; SSRI 18, 105 antiretroviral medications 140, 185, 281, 284, 285 aortic aneurysm repair, mortality rates in hospital after/during 18–20, 114 APGaylard 252 Appleby, John 19, 173 artificial intelligence xxii, 394–5 Asch, Solomon 15, 16 Asphalia 365 Associated Press 316 Astel, Professor Karl 22 ATSC 273 autism: educational interventions in 325; internet use and 3; MMR and 145, 347–55, 356–8 Autism Research Centre, Cambridge 
348, 354 Bad Science (Goldacre) xvi, 104, 110n, 257, 346 Bad Science column see Guardian Ballas, Dr Dimitris 58 Barasi, Leo 96 Barden, Paul 101–4 Barnardo’s 394 Baron-Cohen, Professor Simon 349–51, 353–4 Batarim 305–6 BBC xxi; ‘bioresonance’ story and 277–8; Britain’s happiest places story and 56, 57; causes of avoidable death, overall coverage of 20; Down’s syndrome births increase story and 61–2; ‘EDF Survey Shows Support for Hinkley Power Station’ story and 95–6; psychological nature of libido problems story and 37; radiation from wi-fi networks story and 289–91, 293; recession and anti-depressant link, reports 105; Reform: The Value of Mathematics’ story and 196; ‘Threefold variation’ in UK bowel cancer rates’ story and 101–4; Wightman and 393, 394; ‘“Worrying’’ Jobless Rise Needs Urgent Action – Labour’ story and 59 Beating Bowel Cancer 101, 104 Becker muscular dystrophy 121 Bem Sex Role Inventory (BSRI) 45 Benedict XVI, Pope 183, 184 Benford’s law 54–6 bicycle helmets, the law and 110–13 big data xvii, xviii, 71–86; access to government data 75–7; care.data and risk of sharing medical records 77–86; magical way that patterns emerge from data 73–5 Big Pharma xvii, 324, 401 bin Laden, Osama 357 biologising xvii, 35–46; biological causes for psychological or behavioural conditions 40–2; brain imaging, reality of phenomena and 37–9; girls’ love of pink, evolution and 42–6 Biologist 6 BioSTAR 248 birth rate, UK 49–50 Bishop, Professor Dorothy 3, 6 bladder cancer 24–5, 342 Blair, Tony 357 Blakemore, Colin 138 blame, mistakes in medicine and 267–70 blind auditions, orchestras and xxi, 309–11 blinding, randomised testing and xviii, 12, 118, 124, 126, 133, 137–8, 292–3, 345 blood tests 117, 119–20, 282 blood-pressure drugs 119–20 Blundell, Professor John 337 BMA 112 Booth, Patricia 265 Boston Globe 39 bowel cancer 101–4 Boynton, Dr Petra 252 Brain Committee 230–1 Brain Gym 10–12 Brainiac: faking of science on xxii, 371–5 brain-imaging studies, positive findings in 
131–4 breast cancer: abortion and 200–1; diet and 338–40; red wine and 267, 269; screening 113, 114, 115 breast enhancement cream xx, 254–7 Breuning, Stephen 135–6 The British Association for Applied Nutrition and Nutritional Therapy (BANT) 268–9 British Association of Nutritional Therapists 270 British Chiropractic Association (BCA) 250–4 British Dental Association 24 British Household Panel Survey 57 British Journal of Cancer: ‘What if Cancer Survival in Britain were the Same as in Europe: How Many Deaths are Avoidable?’

pages: 719 words: 104,316

R Cookbook by Paul Teetor

Solution

The factor function encodes your vector of discrete values into a factor:

    > f <- factor(v)    # v is a vector of strings or integers

If your vector contains only a subset of possible values and not the entire universe, then include a second argument that gives the possible levels of the factor:

    > f <- factor(v, levels)

Discussion

In R, each possible value of a categorical variable is called a level. A vector of levels is called a factor. Factors fit very cleanly into the vector orientation of R, and they are used in powerful ways for processing data and building statistical models. Most of the time, converting your categorical data into a factor is a simple matter of calling the factor function, which identifies the distinct levels of the categorical data and packs them into a factor:

    > f <- factor(c("Win","Win","Lose","Tie","Win","Lose"))
    > f
    [1] Win  Win  Lose Tie  Win  Lose
    Levels: Lose Tie Win

Notice that when we printed the factor, f, R did not put quotes around the values.

So think twice before you diddle with those globals: do you really want all lines in all graphics to be (say) magenta, dotted, and three times wider? Probably not, so use local parameters rather than global parameters whenever possible. See Also The help page for par lists the global graphics parameters; the chapter of R in a Nutshell on graphics includes the list with useful annotations. R Graphics contains extensive explanations of graphics parameters. Chapter 11. Linear Regression and ANOVA Introduction In statistics, modeling is where we get down to business. Models quantify the relationships between our variables. Models let us make predictions. A simple linear regression is the most basic model. It’s just two variables and is modeled as a linear relationship with an error term: y_i = β_0 + β_1 x_i + ε_i We are given the data for x and y. Our mission is to fit the model, which will give us the best estimates for β_0 and β_1 (Recipe 11.1).
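For reference, if "best estimates" is read as ordinary least squares (an assumption; the excerpt does not name the criterion), the fitted coefficients have a closed form. With sample means \bar{x} and \bar{y} over n observations:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

These are the values of β_0 and β_1 that minimize the sum of squared errors, \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2, for the model above.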

pages: 398 words: 86,855

Bad Data Handbook by Q. Ethan McCallum

In a previous life, he invented the refrigerator. Spencer Burns is a data scientist/engineer living in San Francisco. He has spent the past 15 years extracting information from messy data in fields ranging from intelligence to quantitative finance to social media. Richard Cotton is a data scientist with a background in chemical health and safety, and has worked extensively on tools to give non-technical users access to statistical models. He is the author of the R packages “assertive” for checking the state of your variables and “sig” to make sure your functions have a sensible API. He runs The Damned Liars statistics consultancy. Philipp K. Janert was born and raised in Germany. He obtained a Ph.D. in Theoretical Physics from the University of Washington in 1997 and has been working in the tech industry since, including four years at Amazon.com, where he initiated and led several projects to improve Amazon’s order fulfillment process.

As the first and second examples show, a scientist can spot faulty experimental setups, because of his or her ability to test the data for internal consistency and for agreement with known theories, and thereby prevent wrong conclusions and faulty analyses. What possibly could be more important to a scientist? And if that means taking a trip to the factory, I’ll be glad to go. Chapter 8. Blood, Sweat, and Urine Richard Cotton A Very Nerdy Body Swap Comedy I spent six years working in the statistical modeling team at the UK’s Health and Safety Laboratory.[23] A large part of my job was working with the laboratory’s chemists, looking at occupational exposure to various nasty substances to see if an industry was adhering to safe limits. The laboratory gets sent tens of thousands of blood and urine samples each year (and sometimes more exotic fluids like sweat or saliva), and has its own team of occupational hygienists who visit companies and collect yet more samples.

pages: 350 words: 103,270

The mattress had done its job—it had given international regulators the confidence to sign off as commercial banks built up their trading businesses. Betting—and Beating—the Spread Now return to the trading floor, to the people regulators and bank senior management need to police. Although they are taught to overcome risk aversion, traders continue to look for a mattress everywhere, in the form of “free lunches.” But do they use statistical modeling to identify a mattress, and make money? If you talk to traders, the answer tends to be no. Listen to the warning of a senior Morgan Stanley equities trader who I interviewed in 2009: “You can compare to theoretical or historic value. But these forms of trading are probably a bit dangerous.” While regulators and senior bankers may have embraced VAR, traders themselves have always been skeptical.

According to the Morgan Stanley trader, “You study the perception of the market: I buy this because the next tick will be on the upside, or I sell because the next tick will be on the downside. This is probably based on the observations of your peers and so on. If you look purely at the anticipation of the price, that’s a way to make money in trading.” One reason traders don’t tend to make outright bets on the basis of statistical modeling is that capital rules such as VAR discourage it. The capital required to be set aside by VAR scales up with the size of the positions and the degree of worst-case scenario projected by the statistics. For volatile markets like equities, that restriction takes a big bite out of potential profit since trading firms must borrow to invest.5 On the other hand, short-term, opportunistic trading (which might be less profitable) slips under the VAR radar because the positions never stay on the books for very long.

pages: 502 words: 107,510

Natural Language Annotation for Machine Learning by James Pustejovsky, Amber Stubbs

This is a corpus of tagged and parsed sentences of naturally occurring English (4.5 million words). The British National Corpus (BNC) is compiled and released as the largest corpus of English to date (100 million words). The Text Encoding Initiative (TEI) is established to develop and maintain a standard for the representation of texts in digital form. 2000s: As the World Wide Web grows, more data is available for statistical models for Machine Translation and other applications. The American National Corpus (ANC) project releases a 22-million-word subcorpus, and the Corpus of Contemporary American English (COCA) is released (400 million words). Google releases its Google N-gram Corpus of 1 trillion word tokens from public web pages. The corpus holds up to five n-grams for each word token, along with their frequencies. 2010s: International standards organizations, such as ISO, begin to recognize and co-develop text encoding formats that are being used for corpus annotation efforts.

.), this algorithm computes a probability distribution over the possible labels associated with them, and then computes the best label sequence. We can identify two basic methods for sequence classification: Feature-based classification A sequence is transformed into a feature vector. The vector is then classified according to conventional classifier methods. Model-based classification An inherent model of the probability distribution of the sequence is built. HMMs and other statistical models are examples of this method. Included in feature-based methods are n-gram models of sequences, where an n-gram is selected as a feature. Given a set of such n-grams, we can represent a sequence as a binary vector of the occurrence of the n-grams, or as a vector containing frequency counts of the n-grams. With this sort of encoding, we can apply conventional methods to model sequences (Manning and Schütze 1999).
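The n-gram encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function names `ngrams` and `ngram_vector`, the whitespace tokenization, and the example vocabulary are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in the token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(tokens, vocabulary, n, binary=False):
    """Encode a sequence against a fixed, ordered list of n-gram features.

    With binary=True each entry records presence (1) or absence (0);
    otherwise each entry is the frequency count of that n-gram.
    """
    counts = Counter(ngrams(tokens, n))
    if binary:
        return [1 if counts[gram] > 0 else 0 for gram in vocabulary]
    return [counts[gram] for gram in vocabulary]


# Hypothetical example: bigram features over a toy sentence.
tokens = "the cat saw the cat".split()
vocabulary = [("the", "cat"), ("cat", "saw"), ("saw", "the"), ("cat", "sat")]
frequency_features = ngram_vector(tokens, vocabulary, 2)
binary_features = ngram_vector(tokens, vocabulary, 2, binary=True)
```

Either vector can then be passed to any conventional classifier, which is the point of the feature-based approach.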

pages: 317 words: 106,130

The New Science of Asset Allocation: Risk Management in a Multi-Asset World by Thomas Schneeweis, Garry B. Crowder, Hossein Kazemi

In practice, we must come up with estimates of the expected returns, standard deviations, and correlations. There are libraries of statistical books dedicated to the simple task of coming up with estimates of the parameters used in MPT. Here is the point: It is not simple. For example, (1) for what period is one estimating the parameters (week, month, year)? and (2) how constant are the estimates (e.g., do they change and, if they do, do we have statistical models that permit us to systematically reflect those changes?)? There are many more issues in parameter estimation, but probably the biggest is that when two assets exist with the same true expected return, standard deviation, and correlation but when the risk parameter is often estimated with error (e.g., standard deviation is larger or smaller than its true standard deviation), the procedure for determining the efficient frontier always picks the asset with the downward bias risk estimate (e.g., the lower estimated standard deviation) and the upward bias return estimate.

The expected return on a comparably risky non-actively managed investment strategy is often either derived from academic theory or statistically derived from historical pricing relationships. The primary issue, of course, remains how to create a comparably risky investable non-actively managed asset. Even when one believes in the use of ex ante equilibrium (e.g., CAPM) or arbitrage (e.g., APT) models of expected return, problems in empirically estimating the required parameters usually result in alpha being determined using statistical models based on the underlying theoretical model. As generally measured in a statistical sense, the term alpha is often derived from a linear regression in which the equation that relates an observed variable y (asset return) to some other factor x (market index) is written as: y = α + βx + ε The first term, α (alpha) represents the intercept; β (beta) represents the slope; and ε (epsilon) represents a random error term.
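The regression y = α + βx + ε can be estimated by ordinary least squares. The sketch below uses only the standard library; the function name `estimate_alpha_beta` and the return series are hypothetical, and real alpha estimation usually adds refinements (excess returns over a risk-free rate, standard errors) that are left out here.

```python
def estimate_alpha_beta(market_returns, asset_returns):
    """Ordinary least squares fit of asset = alpha + beta * market + error.

    Returns (alpha, beta): the intercept and slope of the fitted line.
    """
    n = len(market_returns)
    mean_x = sum(market_returns) / n
    mean_y = sum(asset_returns) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(market_returns, asset_returns))
    sxx = sum((x - mean_x) ** 2 for x in market_returns)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Made-up monthly returns, for illustration only.
market = [0.01, -0.02, 0.03, 0.00, 0.05]
asset = [0.017, -0.028, 0.047, 0.002, 0.077]
alpha, beta = estimate_alpha_beta(market, asset)
```

The error term ε is whatever remains after subtracting alpha + beta * x from each observed y.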

pages: 311 words: 99,699

That triggered panic among some investors, and many rushed to sell CDSs and CDOs, causing their prices to drop, an eventuality not predicted by the models. JPMorgan Chase, Deutsche Bank, and many other banks and funds suffered substantial losses. For a few weeks after the turmoil, the banking community engaged in soul-searching. At J.P. Morgan the traders stuck bananas on their desks as a jibe at the so-called F9 model monkeys, the mathematical wizards who had created such havoc. (The “monkeys” who wrote the statistical models tended to use the “F9” key on the computer when they performed their calculations, giving rise to the tag.) J.P. Morgan, Deutsche, and others conducted internal reviews that led them to introduce slight changes in their statistical systems. GLG Ltd., one large hedge fund, told its investors that it would use a wider set of data to analyze CDOs in the future. Within a couple of months, though, the markets rebounded, and the furor died down.

Compared to Greenspan, Geithner was not just younger, but he also commanded far less clout and respect. As the decade wore on, though, he became privately uneasy about some of the trends in the credit world. From 2005 onwards, he started to call on bankers to prepare for so-called “fat tails,” a statistical term for extremely negative events that occur more often than the normal bell curve statistical models the banks’ risk assessment relied on so much implied. He commented in the spring of 2006: “A number of fundamental changes in the US financial system over the past twenty-five years appear to have rendered it able to withstand the stress of a broader array of shocks than was the case in the past. [But] confidence in the overall resilience of the financial system needs to be tempered by the realization that there is much we still do not know about the likely sources and consequences of future stress to the system…[and]…The proliferation of new forms of derivatives and structured financial products has changed the nature of leverage in the financial system.

pages: 317 words: 100,414

Superforecasting: The Art and Science of Prediction by Philip Tetlock, Dan Gardner

Amos had an impish sense of humor. He also appreciated the absurdity of an academic committee on a mission to save the world. So I am 98% sure he was joking. And 99% sure his joke captures a basic truth about human judgment. Probability for the Stone Age Human beings have coped with uncertainty for as long as we have been recognizably human. And for almost all that time we didn’t have access to statistical models of uncertainty because they didn’t exist. It was remarkably late in history—arguably as late as the 1713 publication of Jakob Bernoulli’s Ars Conjectandi—before the best minds started to think seriously about probability. Before that, people had no choice but to rely on the tip-of-your-nose perspective. You see a shadow moving in the long grass. Should you worry about lions? You try to think of an example of a lion attacking from the long grass.

Appendix Ten Commandments for Aspiring Superforecasters The guidelines sketched here distill key themes in this book and in training systems that have been experimentally demonstrated to boost accuracy in real-world forecasting contests. For more details, visit www.goodjudgment.com. (1) Triage. Focus on questions where your hard work is likely to pay off. Don’t waste time either on easy “clocklike” questions (where simple rules of thumb can get you close to the right answer) or on impenetrable “cloud-like” questions (where even fancy statistical models can’t beat the dart-throwing chimp). Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most. For instance, “Who will win the presidential election, twelve years out, in 2028?” is impossible to forecast now. Don’t even try. Could you have predicted in 1940 the winner of the election, twelve years out, in 1952? If you think you could have known it would be a then-unknown colonel in the United States Army, Dwight Eisenhower, you may be afflicted by one of the worst cases of hindsight bias ever documented by psychologists.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage by Zdravko Markov, Daniel T. Larose

WHY THE BOOK IS NEEDED The book provides the reader with: • The models and techniques to uncover hidden nuggets of information in Web-based data • Insight into how web mining algorithms really work • The experience of actually performing web mining on real-world data sets “WHITE-BOX” APPROACH: UNDERSTANDING THE UNDERLYING ALGORITHMIC AND MODEL STRUCTURES The best way to avoid costly errors stemming from a blind black-box approach to data mining is to apply, instead, a white-box methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software. The book applies this white-box approach by: • Walking the reader through various algorithms • Providing examples of the operation of web mining algorithms on actual large data sets • Testing the reader’s level of understanding of the concepts and algorithms • Providing an opportunity for the reader to do some real web mining on large Web-based data sets Algorithm Walk-Throughs The book walks the reader through the operations and nuances of various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really going on inside an algorithm.

By inspecting the normal density curves, determine which attribute is more relevant for the classification task. CHAPTER 4 EVALUATING CLUSTERING APPROACHES TO EVALUATING CLUSTERING SIMILARITY-BASED CRITERION FUNCTIONS PROBABILISTIC CRITERION FUNCTIONS MDL-BASED MODEL AND FEATURE EVALUATION CLASSES-TO-CLUSTERS EVALUATION PRECISION, RECALL, AND F-MEASURE ENTROPY APPROACHES TO EVALUATING CLUSTERING Clustering algorithms group documents by similarity or create statistical models based solely on the document representation, which in turn reflects document content. Then the criterion functions evaluate these models objectively (i.e., using only the document content). In contrast, when we label documents by topic we use additional knowledge, which is generally not explicitly available in document content and representation. Labeled documents are used primarily in supervised learning (classification) to create a mapping between the document representation and the external notion (concept, category, class) provided by the teacher through labeling.

pages: 347 words: 97,721

Only Humans Need Apply: Winners and Losers in the Age of Smart Machines by Thomas H. Davenport, Julia Kirby

Where It All Began Today, someone using the term “smart machine” could be talking about any number of technologies. The term “artificial intelligence” alone, for example, has been used to describe such technologies as expert systems (collections of rules facilitating decisions in a specified domain, such as financial planning or knowing when a batch of soup is cooked), neural networks (a more mathematical approach to creating a model that fits a data set), machine learning (semiautomated statistical modeling to achieve the best fitting-model to data), natural language processing or NLP (in which computers make sense of human language in textual form), and so forth. Wikipedia lists at least ten branches of AI, and we have seen other sources that mention many more. To make sense of this army of machines and the direction in which it is marching, it helps to remember where it all started: with numerical analytics supporting and supported by human decision-makers.

He hired additional credit risk modelers, and encouraged them to build a variety of quantitative models to identify any problems with the bank’s loan portfolios and credit processes. This work required a broad range of sophisticated models including “neural network” models; some were vendor supplied; some were custom-built. Cathcart, who was an English major at Dartmouth College but also learned the BASIC computer language there from its creator, John Kemeny, knew his way around computer systems and statistical models. Most important, he knew when to trust them and when not to. The models and analyses began to exhibit significant problems. No matter how automated and sophisticated the models were, Cathcart realized that they were becoming less valid over time with changes in the economy and banking climate. Many of the mortgage models, for example, were based on five years of historical data. But as the economy became worse by the day in 2007, those five-year models became dramatically overoptimistic.

pages: 502 words: 107,657

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel

FICO: Todd Steffes, “Predictive Analytics: Saving Lives and Lowering Medical Bills,” Analytics Magazine, Analytics Informs, January/February 2012. www.analytics-magazine.org/januaryfebruary-2012/505-predictive-analytics-saving-lives-and-lowering-medical-bills. GlaxoSmithKline (UK): Vladimir Anisimov, GlaxoSmithKline, “Predictive Analytic Patient Recruitment and Drug Supply Modelling in Clinical Trials,” Predictive Analytics World London Conference, November 30, 2011, London, UK. www.predictiveanalyticsworld.com/london/2011/agenda.php#day1–16. Vladimir V. Anisimov, “Statistical Modelling of Clinical Trials (Recruitment and Randomization),” Communications in Statistics—Theory and Methods 40, issue 19–20 (2011): 3684–3699. www.tandfonline.com/toc/lsta20/40/19–20. MultiCare Health System (four hospitals in Washington): Karen Minich-Pourshadi for HealthLeaders Media, “Hospital Data Mining Hits Paydirt,” HealthLeaders Media Online, November 29, 2010. www.healthleadersmedia.com/page-1/FIN-259479/Hospital-Data-Mining-Hits-Paydirt.

Johnson, Serena Lee, Frank Doherty, and Arthur Kressner (Consolidated Edison Company of New York), “Predicting Electricity Distribution Feeder Failures Using Machine Learning Susceptibility Analysis,” March 31, 2006. www.phillong.info/publications/GBAetal06_susc.pdf. This work has been partly supported by a research contract from Consolidated Edison. BNSF Railway: C. Tyler Dick, Christopher P. L. Barkan, Edward R. Chapman, and Mark P. Stehly, “Multivariate Statistical Model for Predicting Occurrence and Location of Broken Rails,” Transportation Research Board of the National Academies, January 26, 2007. http://trb.metapress.com/content/v2j6022171r41478/. See also: http://ict.uiuc.edu/railroad/cee/pdf/Dick_et_al_2003.pdf. TTX: Thanks to Mahesh Kumar at Tiger Analytics for this case study, “Predicting Wheel Failure Rate for Railcars.” Fortune 500 global technology company: Thanks to Dean Abbott, Abbott Analytics (http://abbottanalytics.com/index.php) for information about this case study.

pages: 345 words: 86,394

Frequently Asked Questions in Quantitative Finance by Paul Wilmott

Here is a list and description of the most important. • A static arbitrage is an arbitrage that does not require rebalancing of positions • A dynamic arbitrage is an arbitrage that requires trading instruments in the future, generally contingent on market states • A statistical arbitrage is not an arbitrage but simply a likely profit in excess of the risk-free return (perhaps even suitably adjusted for risk taken) as predicted by past statistics • Model-independent arbitrage is an arbitrage which does not depend on any mathematical model of financial instruments to work. For example, an exploitable violation of put-call parity or a violation of the relationship between spot and forward prices, or between bonds and swaps • Model-dependent arbitrage does require a model. For example, options mispriced because of an incorrect volatility estimate.
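As an illustration of the model-independent case, a put-call parity check needs only quoted prices. The sketch below assumes European options on a non-dividend-paying asset with continuous compounding, so that call - put = spot - strike * exp(-rate * maturity). The function names and the `tolerance` parameter (a stand-in for transaction costs) are hypothetical.

```python
import math


def parity_gap(call, put, spot, strike, rate, maturity):
    """Return (call - put) minus (spot - discounted strike).

    Under put-call parity for European options on a non-dividend-paying
    asset, this gap is zero. A positive gap means the call is rich
    relative to the put; a negative gap means the reverse.
    """
    discounted_strike = strike * math.exp(-rate * maturity)
    return (call - put) - (spot - discounted_strike)


def violates_parity(call, put, spot, strike, rate, maturity, tolerance=0.01):
    """Flag a gap larger than the tolerance (a proxy for trading costs)."""
    return abs(parity_gap(call, put, spot, strike, rate, maturity)) > tolerance


# Made-up quotes: with a zero interest rate and spot equal to strike,
# parity requires the call and the put to cost the same.
gap = parity_gap(call=12.0, put=10.0, spot=100.0, strike=100.0,
                 rate=0.0, maturity=1.0)
```

No volatility estimate or pricing model enters the check, which is what makes it model-independent. Whether a flagged gap is actually exploitable depends on costs and execution, which the sketch ignores.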

One hat’s numbers have mean of zero and standard deviation 0.1. This is hat A. Another hat’s numbers have mean of zero and standard deviation 1. This is hat B. The final hat’s numbers have mean of zero and standard deviation 10. This is hat C. You don’t know which hat is which. You pick a number out of one hat, it is −2.6. Which hat do you think it came from? MLE can help you answer this question. Long Answer A large part of statistical modelling concerns finding model parameters. One popular way of doing this is Maximum Likelihood Estimation. The method is easily explained by a very simple example. You are attending a maths conference. You arrive by train at the city hosting the event. You take a taxi from the train station to the conference venue. The taxi number is 20,922. How many taxis are there in the city? This is a parameter estimation problem.
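The hat question can be worked through by comparing likelihoods directly. The sketch below assumes the numbers in each hat are normally distributed, which the excerpt does not state; the names `normal_pdf` and `most_likely_hat` are illustrative.

```python
import math

# Standard deviation of each hat's numbers; every hat has mean zero.
HATS = {"A": 0.1, "B": 1.0, "C": 10.0}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def most_likely_hat(x):
    """Return the hat whose density at x is largest, plus all likelihoods."""
    likelihoods = {name: normal_pdf(x, 0.0, sd) for name, sd in HATS.items()}
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods


best, likelihoods = most_likely_hat(-2.6)
```

Under the normality assumption, hat A is effectively ruled out because −2.6 lies 26 standard deviations from its mean. Hat C has a higher density at −2.6 than hat B, so C is the maximum likelihood choice.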

pages: 309 words: 86,909

The Spirit Level: Why Greater Equality Makes Societies Stronger by Richard Wilkinson; Kate Pickett

One factor is the strength of the relationship, which is shown by the steepness of the lines in Figures 4.1 and 4.2. People in Sweden are much more likely to trust each other than people in Portugal. Any alternative explanation would need to be just as strong, and in our own statistical models we find that neither poverty nor average standards of living can explain our findings. We also see a consistent association among both the United States and the developed countries. Earlier we described how Uslaner and Rothstein used a statistical model to show the ordering of inequality and trust: inequality affects trust, not the other way round. The relationships between inequality and women’s status and between inequality and foreign aid also add coherence and plausibility to our belief that inequality increases the social distance between different groups of people, making us less willing to see them as ‘us’ rather than ‘them’.

pages: 346 words: 92,984

The Lucky Years: How to Thrive in the Brave New World of Health by David B. Agus

It didn’t take long for there to be a backlash against the implied message. Tomasetti and Vogelstein were accused of focusing on rare cancers while leaving out several common cancers that indeed are largely preventable. The International Agency for Research on Cancer, the cancer arm of the World Health Organization, published a press release stating it “strongly disagrees” with the report. To arrive at their conclusion, Tomasetti and Vogelstein used a statistical model they developed based on known rates of cell division in thirty-one types of tissue. Stem cells were their main focal point. As a reminder, these are the small, specialized “mothership” cells in each organ or tissue that divide to replace cells that die or wear out. Only in recent years have researchers been able to conduct these kinds of studies due to advances in the understanding of stem-cell biology.

., “Intensive Lifestyle Changes May Affect the Progression of Prostate Cancer,” Journal of Urology, 174, no. 3 (September 2005): 1065–69; discussion 1069–70. 11. A. R. Kristal et al., “Baseline Selenium Status and Effects of Selenium and Vitamin E Supplementation on Prostate Cancer Risk,” Journal of the National Cancer Institute 106, no. 3 (March 2014): djt456, doi:10.1093/jnci/djt456, Epub February 22, 2014. 12. Johns Hopkins Medicine, “Bad Luck of Random Mutations Plays Predominant Role in Cancer, Study Shows—Statistical Modeling Links Cancer Risk with Number of Stem Cell Divisions,” news release, January 1, 2015, www.hopkinsmedicine.org/news/media/releases/bad_luck_of_random_mutations_plays_predominant_role_in_cancer_study_shows. 13. C. Tomasetti and B. Vogelstein, “Cancer Etiology. Variation in Cancer Risk Among Tissues Can Be Explained by the Number of Stem Cell Divisions,” Science 347, no. 6217 (January 2, 2015): 78–81, doi:10.1126/science.1260825. 14.

pages: 360 words: 85,321

The Perfect Bet: How Science and Math Are Taking the Luck Out of Gambling by Adam Kucharski

The probability each horse will win is a balance between the chance of the horse winning in the model and the chance of victory according to the current odds. The scales can tip one way or the other: whichever produces the combined prediction that lines up best with actual results. Strike the right balance, and good predictions can become profitable ones. WHEN WOODS AND BENTER arrived in Hong Kong, they did not meet with immediate success. While Benter spent the first year putting together the statistical model, Woods tried to make money exploiting the long-shot-favorite bias. They had come to Asia with a bankroll of $150,000; within two years, they’d lost it all. It didn’t help that investors weren’t interested in their strategy. “People had so little faith in the system that they would not have invested for 100 percent of the profits,” Woods later said. By 1986, things were looking better. After writing hundreds of thousands of lines of computer code, Benter’s model was ready to go.
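The balance between model probabilities and odds-implied probabilities, described at the start of this excerpt, could be expressed as a weighted blend on the log scale. The excerpt does not give a formula, so the log-linear form below is an assumption. The weights `model_weight` and `odds_weight` are hypothetical placeholders for values that would be tuned against past race results.

```python
import math


def combine_probabilities(model_probs, odds_probs, model_weight, odds_weight):
    """Blend two sets of win probabilities for the same field of horses.

    Each horse gets the score exp(model_weight * ln(model_prob)
    + odds_weight * ln(odds_prob)); scores are then normalized so the
    combined probabilities sum to one.
    """
    scores = [
        math.exp(model_weight * math.log(m) + odds_weight * math.log(o))
        for m, o in zip(model_probs, odds_probs)
    ]
    total = sum(scores)
    return [s / total for s in scores]


# Made-up three-horse race.
model_probs = [0.5, 0.3, 0.2]
odds_probs = [0.4, 0.4, 0.2]
combined = combine_probabilities(model_probs, odds_probs, 0.6, 0.4)
```

Setting `odds_weight` to zero recovers the model's own probabilities, and setting `model_weight` to zero recovers the market's. "Tipping the scales" corresponds to moving between those two extremes.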

All sorts of factors could influence a horse’s performance in a race, from past experience to track conditions. Some of these provide clear hints about the future, while others just muddy the predictions. To pin down which factors are useful, syndicates need to collect reliable, repeated observations about races. Hong Kong was the closest Bill Benter could find to a laboratory setup, with the same horses racing on a regular basis on the same tracks in similar conditions. Using his statistical model, Benter identified factors that could lead to successful race predictions. He found that some came out as more important than others. In Benter’s early analysis, for example, the model said the number of races a horse had previously run was a crucial factor when making predictions. In fact, it was more important than almost any other factor. Maybe the finding isn’t all that surprising. We might expect horses that have run more races to be used to the terrain and less intimidated by their opponents.

pages: 103 words: 32,131

Program Or Be Programmed: Ten Commands for a Digital Age by Douglas Rushkoff

In fact, the game only became a mass phenomenon as free agenting and Major League players’ strikes soured fans on the sport. As baseball became a business, the fans took back baseball as a game—even if it had to happen on their computers. The effects didn’t stay in the computer. Leveraging the tremendous power of digital abstraction back to the real world, Billy Beane, general manager of the Oakland Athletics, applied this same sort of statistical modeling to players for another purpose: to assemble a roster for his own Major League team. Beane didn’t have the same salary budget as his counterparts in New York or Los Angeles, and he needed to find another way to assemble a winning combination. So he abstracted and modeled available players in order to build a better team that went from the bottom to the top of its division, and undermined the way that money had come to control the game.

pages: 123 words: 32,382

Grouped: How Small Groups of Friends Are the Key to Influence on the Social Web by Paul Adams

Research by Forrester found that cancer patients trust their local care physician more than world renowned cancer treatment centers, and in most cases, the patient had known their local care physician for years.16 We overrate the advice of experts Psychologist Philip Tetlock conducted numerous studies to test the accuracy of advice from experts in the fields of journalism and politics. He quantified over 82,000 predictions and found that the journalism experts tended to perform slightly worse than picking answers at random. Political experts didn’t fare much better. They slightly outperformed random chance, but did not perform as well as a basic statistical model. In fact, they actually performed slightly better at predicting things outside their area of expertise, and 80 percent of their predictions were wrong. Studies in finance also show that only 20 percent of investment bankers outperform the stock market.17 We overestimate what we know Sometimes we consider ourselves as experts, even though we don’t know as much as we think we know. Research by Russo and Schoemaker asked managers in the advertising industry questions about their domain.

pages: 456 words: 185,658

More Guns, Less Crime: Understanding Crime and Gun-Control Laws by John R. Lott

As to the concern that other changes in law enforcement may have been occurring at the same time, the estimates account for changes in other gun-control laws and changes in law enforcement as measured by arrest and conviction rates as well as by prison terms. No previous study of crime has attempted to control for as many different factors that might explain changes in the crime rate. 3 Did I assume that there was an immediate and constant effect from these laws and that the effect should be the same everywhere? The “statistical models assumed: (1) an immediate and constant effect of shall-issue laws, and (2) similar effects across different states and counties.” (Webster, “Claims,” p. 2; see also Dan Black and Daniel Nagin, “Do ‘Right-to-Carry’ Laws Deter Violent Crime?” Journal of Legal Studies 27 [January 1998], p. 213.) One of the central arguments both in the original paper and in this book is that the size of the deterrent effect is related to the number of permits issued, and it takes many years before states reach their long-run level of permits.

A major reason for the larger effect on crime in the more urban counties was that in rural areas, permit requests already were being approved; hence it was in urban areas that the number of permitted concealed handguns increased the most. A week later, in response to a column that I published in the Omaha World-Herald,20 Mr. Webster modified this claim somewhat: Lott claims that his analysis did not assume an immediate and constant effect, but that is contrary to his published article, in which the vast majority of the statistical models assume such an effect. (Daniel W. Webster, “Concealed-Gun Research Flawed,” Omaha World-Herald, March 12, 1997; emphasis added.) When one does research, it is most appropriate to take the simplest specifications first and then gradually make things more complicated. The simplest way of doing this is to examine the mean crime rates before and after the change in a law.

While he includes a chapter that contains replies to his critics, unfortunately he doesn’t directly respond to the key Black and Nagin finding that formal statistical tests reject his methods. The closest he gets to addressing this point is to acknowledge “the more serious possibility is that some other factor may have caused both the reduction in crime rates and the passage of the law to occur at the same time,” but then goes on to say that he has “presented over a thousand [statistical model] specifications” that reveal “an extremely consistent pattern” that right-to-carry laws reduce crime. Another view would be that a thousand versions of a demonstrably invalid analytical approach produce boxes full of invalid results. (Jens Ludwig, “Guns and Numbers,” Washington Monthly, June 1998, p. 51)76 We applied a number of specification tests suggested by James J. Heckman and V. Joseph Hotz.

pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell

Substituting various values into the precision and recall formulas is straightforward and a worthwhile exercise if this is your first time encountering these terms. For example, what would the precision, recall, and F1 score have been if your algorithm had identified “Mr. Green”, “Colonel”, “Mustard”, and “candlestick”? As somewhat of an aside, you might find it interesting to know that many of the most compelling technology stacks used by commercial businesses in the NLP space use advanced statistical models to process natural language according to supervised learning algorithms. A supervised learning algorithm is essentially an approach in which you provide training samples of the form [(input1, output1), (input2, output2), ..., (inputN, outputN)] to a model such that the model is able to predict the tuples with reasonable accuracy. The tricky part is ensuring that the trained model generalizes well to inputs that have not yet been encountered.
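The exercise above can be checked with a short function. The excerpt does not list the correct entities, so the gold set below is an assumption made only for illustration; `precision_recall_f1` is an illustrative name.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 for two sets of extracted entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# ASSUMED gold-standard entities (not given in the excerpt).
gold = ["Mr. Green", "Colonel Mustard", "study", "candlestick"]
# Entities named in the question: "Colonel Mustard" was split in two.
predicted = ["Mr. Green", "Colonel", "Mustard", "candlestick"]
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

With this assumed gold set, two of the four predictions are correct and two of the four true entities are found, so precision, recall, and F1 all equal 0.5. Splitting one entity into two fragments costs both a missed entity and two false positives.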

SocialGraph Node Mapper, Brief analysis of breadth-first techniques sorting, Sensible Sorting, Sorting Documents by Value documents by value, Sorting Documents by Value documents in CouchDB, Sensible Sorting split method, using to tokenize text, Data Hacking with NLTK, Before You Go Off and Try to Build a Search Engine… spreadsheets, visualizing Facebook network data, Visualizing with spreadsheets (the old-fashioned way) statistical models processing natural language, Quality of Analytics stemming verbs, Querying Buzz Data with TF-IDF stopwords, Data Hacking with NLTK, Analysis of Luhn’s Summarization Algorithm downloading NLTK stopword data, Data Hacking with NLTK filtering out before document summarization, Analysis of Luhn’s Summarization Algorithm streaming API (Twitter), Analyzing Tweets (One Entity at a Time) Strong Links API, The Infochimps “Strong Links” API, Interactive 3D Graph Visualization student’s t-score, How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions subject-verb-object triples, Entity-Centric Analysis: A Deeper Understanding of the Data, Man Cannot Live on Facts Alone summarizing documents, Summarizing Documents, Analysis of Luhn’s Summarization Algorithm, Summarizing Documents, Analysis of Luhn’s Summarization Algorithm analysis of Luhn’s algorithm, Analysis of Luhn’s Summarization Algorithm Tim O’Reilly Radar blog post (example), Summarizing Documents summingReducer function, Frequency by date/time range, What entities are in Tim’s tweets?

pages: 302 words: 82,233

Beautiful security by Andy Oram, John Viega

Ashenfelter is a statistician at Princeton who loves wine but is perplexed by the pomp and circumstance around valuing and rating wine in much the same way I am perplexed by the pomp and circumstance surrounding risk management today. In the 1980s, wine critics dominated the market with predictions based on their own reputations, palate, and frankly very little more. Ashenfelter, in contrast, studied the Bordeaux region of France and developed a statistical model about the quality of wine. His model was based on the average rainfall in the winter before the growing season (the rain that makes the grapes plump) and the average sunshine during the growing season (the rays that make the grapes ripe), resulting in a simple formula: quality = 12.145 + (0.00117 * winter rainfall) + (0.0614 * average growing season temperature) - (0.00386 * harvest rainfall) Of course he was chastised and lampooned by the stuffy wine critics who dominated the industry, but after several years of producing valuable results, his methods are now widely accepted as providing important valuation criteria for wine.
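The wine formula translates directly into a function. The sketch assumes the harvest-rainfall term is subtracted, as in Ashenfelter's published equation. The units (millimetres of rain, degrees Celsius) and the sample inputs are also assumptions, since the excerpt does not state them.

```python
def wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    """Evaluate the linear wine-quality formula quoted in the text.

    Assumed units: rainfall in millimetres, temperature in degrees Celsius.
    Harvest rainfall enters with a negative coefficient (assumption stated
    in the lead-in): rain at harvest lowers predicted quality.
    """
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * growing_season_temp
            - 0.00386 * harvest_rainfall)


# Made-up weather values, for illustration only.
score = wine_quality(winter_rainfall=600, growing_season_temp=17,
                     harvest_rainfall=100)
```

Because the model is linear, each coefficient reads as the change in predicted quality per unit change in that weather variable, holding the others fixed.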

I hope that when I look back on this text and my blog in years to come, I’ll cringe at their resemblance to the cocktail-mixing house robots from movies of the 1970s. I believe the right elements are really coming together where technology can create better technology. Advances in technology have been used to both arm and disarm the planet, to empower and oppress populations, and to attack and defend the global community and all it will have become. The areas I’ve pulled together in this chapter—from business process management, number crunching and statistical modeling, visualization, and long-tail technology—provide fertile ground for security management systems in the future that archive today’s best efforts in the annals of history. At least I hope so, for I hate mediocrity with a passion and I think security management systems today are mediocre at best! Acknowledgments This chapter is dedicated to my mother, Margaret Curphey, who passed away after an epileptic fit in 2004 at her house in the south of France.

pages: 404 words: 43,442

The Art of R Programming by Norman Matloff

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

The latter again stems from vectorization, a benefit discussed in detail in Chapter 14. This approach is used in the loop beginning at line 53. (Arguably, in this case, the increase in speed comes at the expense of readability of the code.) 9.1.7 Extended Example: A Procedure for Polynomial Regression As another example, consider a statistical regression setting with one predictor variable. Since any statistical model is merely an approximation, in principle, you can get better and better models by fitting polynomials of higher and higher degrees. However, at some point, this becomes overfitting, so that the prediction of new, future data actually deteriorates for degrees higher than some value. The class "polyreg" aims to deal with this issue. It fits polynomials of various degrees but assesses fits via cross-validation to reduce the risk of overfitting.
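The idea of fitting several polynomial degrees and comparing them by cross-validation can be sketched in a few lines. This is not the book's "polyreg" class (which is written in R); it is a minimal Python illustration that assumes leave-one-out cross-validation and a plain normal-equations solver, and every name and data value in it is hypothetical.

```python
import random


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (Gauss-Jordan)."""
    n = degree + 1
    # Augmented matrix [X^T X | X^T y] for the powers 0..degree.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]  # coefficients, lowest power first


def predict(coefs, x):
    return sum(c * x ** k for k, c in enumerate(coefs))


def loo_cv_error(xs, ys, degree):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(xs)):
        coefs = fit_poly(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], degree)
        total += (ys[i] - predict(coefs, xs[i])) ** 2
    return total / len(xs)


# Made-up data: a quadratic trend plus noise.
random.seed(0)
xs = [i / 10 for i in range(-20, 21)]
ys = [1 + 2 * x - 0.5 * x ** 2 + random.gauss(0, 0.3) for x in xs]

errors = {d: loo_cv_error(xs, ys, d) for d in range(1, 7)}
best = min(errors, key=errors.get)
print(best, errors)
```

The in-sample error can only fall as the degree rises, whereas the cross-validated error stops improving once the extra terms start fitting noise; that is the signal the excerpt describes.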

We’ll create a function called extractpums() to read in a PUMS file and create a data frame from its Person records. The user specifies the filename and lists fields to extract and names to assign to those fields. We also want to retain the household serial number. This is good to have because data for persons in the same household may be correlated and we may want to add that aspect to our statistical model. Also, the household data may provide important covariates. (In the latter case, we would want to retain the covariate data as well.) Before looking at the function code, let’s see what the function does. In this data set, gender is in column 23 and age in columns 25 and 26. In the example, our filename is pumsa. The following call creates a data frame consisting of those two variables. pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26))) Note that we are stating here the names we want the columns to have in the resulting data frame.
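The core of such a call is slicing named fields out of fixed-width records by column position. The sketch below shows only that step in Python; it is not the book's R function, it does not filter Person records or keep the household serial number, and the function name and sample record are hypothetical.

```python
def extract_fields(lines, fields):
    """Return one dict per line, slicing 1-based inclusive column ranges.

    `fields` maps an output name to a (start, end) pair, mirroring the
    Gender=c(23,23), Age=c(25,26) arguments in the quoted call.
    """
    rows = []
    for line in lines:
        rows.append({name: line[start - 1:end]
                     for name, (start, end) in fields.items()})
    return rows


# Made-up 26-character record: column 23 holds "2", columns 25-26 hold "37".
sample = "P" + "0" * 21 + "2" + "0" + "37"
print(extract_fields([sample], {"Gender": (23, 23), "Age": (25, 26)}))
```

Values come back as strings; converting them to numbers or category labels is left to the caller.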

pages: 566 words: 155,428

After the Music Stopped: The Financial Crisis, the Response, and the Work Ahead by Alan S. Blinder

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

As we will see later, these tests were phenomenally successful.* And there was more. To date, there have been precious few studies of the broader effects of this grab bag of financial-market policies. The only one I know of that even attempts to estimate the macroeconomic impacts of the entire potpourri was published in July 2010 by Mark Zandi and me. Our methodology was pretty simple—and very standard. Take a statistical model of the U.S. economy—we used the Moody’s Analytics model—and simulate it both with and without the policies. The differences between the two simulations are then estimates of the effects of the policies. These estimates, of course, are only as good as the model, but ours were huge. By 2011, we estimated, real GDP was about 6 percent higher, the unemployment rate was nearly 3 percentage points lower, and 4.8 million more Americans were employed because of the financial-market policies (as compared with sticking with laissez-faire).

The standard analysis of conventional monetary policy—what we teach in textbooks and what central bankers are raised on—is predicated, roughly speaking, on constant risk spreads. When the Federal Reserve lowers riskless interest rates, like those on federal funds and T-bills, riskier interest rates, like those on corporate lending and auto loans, are supposed to follow suit.* The history on which we economists base our statistical models looks like that. Figure 9.1 shows the behavior of the interest rates on 10-year Treasuries (the lower line) and Moody’s Baa corporate bonds (the upper line) over the period from January 1980 through June 2007, just before the crisis got started. The spread between these two rates is the vertical distance between the two lines, and the fact that they look roughly parallel means that the spread did not change much over those twenty-seven years.

pages: 480 words: 138,041

The Book of Woe: The DSM and the Unmaking of Psychiatry by Gary Greenberg

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

If the DSM is not the map of an actual world against whose contours any changes can be validated, then opening up old arguments, or inviting new ones, might only sow dissension and reap chaos—and annoy Frances in the bargain. If he was going to revise the DSM, Frances told Pincus, then his goal would be stabilizing the system rather than trying to perfect it—or, as he put it to me, “loving the pet, even if it is a mutt.” Frances thought there was a way to protect the system from both instability and pontificating: meta-analysis, a statistical method that, thanks to advances in computer technology and statistical modeling, had recently allowed statisticians to compile results from large numbers of studies by combining disparate data into common terms. The result was a statistical synthesis by which many different research projects could be treated as one large study. “We needed something that would leave it up to the tables rather than the people,” he told me, and meta-analysis was perfect for the job. “The idea was you would have to present evidence in tabular form that would be so convincing it would jump up and grab people by the throats.”

There’s a lot of information they”—I think she meant the APA, not the National Transportation Safety Board—“can look at, but it’s not a matter of analyzing the data to find out exactly what’s wrong.” Kraemer seemed to be saying that the point wasn’t to sift through the wreckage and try to prevent another catastrophe but, evidently, to crash the plane and then announce that the destruction could have been a lot worse. To be honest, however, I wasn’t sure. She was not making all that much sense, or maybe I just didn’t grasp the complexities of statistical modeling. And besides, I was distracted by a memory of something Steve Hyman once wrote. Fixing the DSM, finding another paradigm, getting away from its reifications—this, he said, was like “repairing a plane while it is flying.” It was a suggestive analogy, I thought at the time, one that recognized the near impossibility of the task even as it indicated its high stakes—and the necessity of keeping the mechanics from swearing and banging too loudly, lest the passengers start asking for a quick landing and a voucher on another airline.

pages: 504 words: 139,137

Efficiently Inefficient: How Smart Money Invests and Market Prices Are Determined by Lasse Heje Pedersen

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

However, volatility is not an appropriate measure of risk for strategies with an extreme crash risk. For instance, volatility does not capture well the risk of selling out-the-money options, a strategy with small positive returns on most days but infrequent large crashes. To compute the volatility of a large portfolio, hedge funds need to account for correlations across assets, which can be accomplished by simulating the overall portfolio or by using a statistical model such as a factor model. Another measure of risk is value-at-risk (VaR), which attempts to capture tail risk (non-normality). The VaR measures the maximum loss with a certain confidence, as seen in figure 4.1 below. For example, the VaR is the most that you can lose with a 95% or 99% confidence. For instance, a hedge fund has a one-day 95% VaR of \$10 million if Pr(loss ≤ \$10 million) = 95%. A simple way to estimate VaR is to line up past returns, sort them by magnitude, and find a return that has 5% worse days and 95% better days.
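The "line up past returns and sort them" recipe is easy to write down. The sketch below is a minimal historical-simulation estimate under stated assumptions: returns are daily fractions of portfolio value, the cutoff is the return with about 5% of days worse than it, and the function name and sample data are hypothetical.

```python
import math


def historical_var(returns, confidence=0.95):
    """Estimate VaR as the loss exceeded on about (1 - confidence) of past days."""
    ordered = sorted(returns)  # worst return first
    # Number of days allowed to be worse than the cutoff (rounded to dodge float noise).
    n_worse = math.floor(round(len(ordered) * (1 - confidence), 9))
    return -ordered[n_worse]  # report the loss as a positive number


# Made-up history: 100 daily returns from -5.0% to +4.9%.
past_returns = [i / 1000 for i in range(-50, 50)]
print(historical_var(past_returns, 0.95))
print(historical_var(past_returns, 0.99))
```

Multiplying the result by the portfolio value turns the fractional loss into a currency amount. The estimate only reflects crashes that appear in the sample, which is the weakness the passage notes for strategies with rare large losses.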

Intermediaries are always worried that the flows will continue against them. That part is invisible to them. The market demand might evolve as a wave builds up. The intermediary makes money when the wave subsides. Then the flows and equilibrium pricing are in the same direction. LHP: Or you might even short at a nickel cheap? MS: You might. Trend following is based on understanding macro developments and what governments are doing. Or they are based on statistical models of price movements. A positive up price tends to result in a positive up price. Here, however, it is not possible to determine whether the trend will continue. LHP: Why do spreads tend to widen during some periods of stress? MS: Well, capital becomes more scarce, both physical capital and human capital, in the sense that there isn’t enough time for intermediaries to understand what is happening in chaotic times.

pages: 444 words: 138,781

Evicted: Poverty and Profit in the American City by Matthew Desmond

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

With Jonathan Mijs, I combined all eviction court records between January 17 and February 26, 2011 (the Milwaukee Eviction Court Study period) with information about aspects of tenants’ neighborhoods, procured after geocoding the addresses that appeared in the eviction records. Working with the Harvard Center for Geographic Analysis, I also calculated the distance (in drive miles and time) between tenants’ addresses and the courthouse. Then I constructed a statistical model that attempted to explain the likelihood of a tenant appearing in court based on aspects of that tenant’s case and her or his neighborhood. The model generated only null findings. How much a tenant owed a landlord, her commute time to the courthouse, her gender—none of these factors were significantly related to appearing in court. I also investigated whether several aspects of a tenant’s neighborhood—e.g., its eviction, poverty, and crime rates—mattered when it came to explaining defaults.

In those where children made up at least 40 percent of the population, 1 household in every 12 was. All else equal, a 1 percent increase in the percentage of children in a neighborhood is predicted to increase a neighborhood’s evictions by almost 7 percent. These estimates are based on court-ordered eviction records that took place in Milwaukee County between January 1, 2010, and December 31, 2010. The statistical model evaluating the association between a neighborhood’s percentage of children and its number of evictions is a zero-inflated Poisson regression, which is described in detail in Matthew Desmond et al., “Evicting Children,” Social Forces 92 (2013): 303–27. 3. That misery could stick around. At least two years after their eviction, mothers like Arleen still experienced significantly higher rates of depression than their peers.

pages: 624 words: 127,987

The Personal MBA: A World-Class Business Education in a Single Volume by Josh Kaufman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

The primary question is not whether attending a university is a positive experience: it’s whether or not the experience is worth the cost. 2. MBA programs teach many worthless, outdated, even outright damaging concepts and practices—assuming your goal is to actually build a successful business and increase your net worth. Many of my MBA-holding readers and clients come to me after spending tens (sometimes hundreds) of thousands of dollars learning the ins and outs of complex financial formulas and statistical models, only to realize that their MBA program didn’t teach them how to start or improve a real, operating business. That’s a problem—graduating from business school does not guarantee having a useful working knowledge of business when you’re done, which is what you actually need to be successful. 3. MBA programs won’t guarantee you a high-paying job, let alone make you a skilled manager or leader with a shot at the executive suite.

Over time, managers and executives began using statistics and analysis to forecast the future, relying on databases and spreadsheets in much the same way ancient seers relied on tea leaves and goat entrails. The world itself is no less unpredictable or uncertain: as in the olden days, the signs only “prove” the biases and desires of the soothsayer. The complexity of financial transactions and the statistical models those transactions relied upon continued to grow until few practitioners fully understood how they worked or respected their limits. As Wired revealed in a February 2009 article, “Recipe for Disaster: The Formula That Killed Wall Street,” the inherent limitations of deified financial formulas such as the Black-Scholes option pricing model, the Gaussian copula function, and the capital asset pricing model (CAPM) played a major role in the tech bubble of 2000 and the housing market and derivatives shenanigans behind the 2008 recession.

The Blockchain Alternative: Rethinking Macroeconomic Policy and Economic Theory by Kariappa Bheemaiah

As we have seen in Chapter 3, it is monetary and fiscal policy that play a determining role in guiding the state of markets and the prosperity of a nation. Thus, owing to their fundamental role in monetary policy decision making, it is important to understand the history, abilities and limitations of these models. Currently, most central banks, such as the Federal Reserve and the ECB,13 use two kinds of models to study and build forecasts about the economy (Axtell and Farmer, 2015). The first, statistical models, fit current aggregate data of variables such as GDP, interest rates, and unemployment to empirical data in order to predict/suggest what the near future holds. The second type of models (which are more widely used), are known as “Dynamic Stochastic General Equilibrium” (DSGE) models. These models are constructed on the basis that the economy would be at rest (i.e.: static equilibrium) if it wasn’t being randomly perturbed by events from outside the economy.

See Efficient Market Hypothesis (EMH) Equation based modelling (EBM), 196 Equilibrium business-cycle models, 221 Equilibrium economic models contract theory contact incompleteness, 171 efficiency wages, 172 explicit contracts, 172 implicit contracts, 172 intellectual framework, 171 labor market flexibility, 171 menu cost, 173 risk sharing, 171 DSGE models Federal Reserve system, 173 implicit contracts, 172 macroeconomic models of business cycle, 168 NK models, 170 non-optimizing households, 168 principles, 175 RBC models, 169 RET, 174–175 ‘rigidity’ of wage and price change, 171 SIGE, 170 steady state equilibrium, economy, 176 structure, 176 Taylor rule, 168 FRB/US model, 173, 175 Keynesian macroeconomic theory, 169 RBC models, 169–170 Romer’s analysis tests, 178 statistical models, 168 Estonian government, 80 European Migration Network (EMN), 88 Exogenous and endogenous function, 137 Explicit contracts, 172 F Feedback loop, 191 Fiat currency CBDC, 129 commercial banks, 129 debt-based money, 124 digital cash, 129 digital monetary framework, 125 framework, 124 ideas and methods, 130 non-bank private sector, 124 sovereign digital currency, 125–128 transition, 124 Financialization, 25 de facto, 26 definition of, 27 eastern economic association, 27 enemy of my enemy is my friend, 65 FT slogans, 26 Palley, Thomas I., 28 relative industry shares, 27 risk innovation CDOs, CLOs and CDSs, 29 non-financial firms, 29 originate, repackage and sell model, 29 originate-to-distribute model, 29 originate-to-hold model, 29 principal component, 29 production and exchange, 29 sharding, 44 Blockchain, 54 FinTech transformation, 45, 48 global Fintech financing activity, 46 private sector, 44 skeleton keys, 60 AI-led high frequency trading, 63 amalgamation, 61 Blockchain, 63–64 fragmentation process, 60 information asymmetries, 62 Kabbage, 62 KYC/AML procedures, 62 KYC process, 61 machine learning, 62 P2P lending sector, 62 payments and remittances sector, 60 physical barriers, 64 rehypothecation, 63 robo-advisors, 62 SWIFT and ACH, 61 transferwise, 61 solution pathways digital identity and KYC, 67 private and public utilization, 67 scalability, 81 TBTF (see (Too Big to Fail (TBTF))) television advertisement, 25 Financialization.

pages: 428 words: 121,717

Warnings by Richard A. Clarke

The deeper they dig, the harder it gets to climb out and see what is happening outside, and the more tempting it becomes to keep on doing what they know how to do . . . uncovering new reasons why their initial inclination, usually too optimistic or pessimistic, was right.” Still, maddeningly, even the foxes, considered as a group, were only ever able to approximate the accuracy of simple statistical models that extrapolated trends. They did perform somewhat better than undergraduates subjected to the same exercises, and they outperformed the proverbial “chimp with a dart board,” but they didn’t come close to the predictive accuracy of formal statistical models. Later books have looked at Tetlock’s foundational results in some additional detail. Dan Gardner’s 2012 Future Babble draws on recent research in psychology, neuroscience, and behavioral economics to detail the biases and other cognitive processes that skew our judgment when we try to make predictions about the future.

pages: 199 words: 47,154

Gnuplot Cookbook by Lee Phillips

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

These new features include the use of Unicode characters, transparency, new graph positioning commands, plotting objects, internationalization, circle plots, interactive HTML5 canvas plotting, iteration in scripts, lua/tikz/LaTeX integration, cairo and SVG terminal drivers, and volatile data. What this book covers Chapter 1, Plotting Curves, Boxes, Points, and more, covers the basic usage of Gnuplot: how to make all kinds of 2D plots for statistics, modeling, finance, science, and more. Chapter 2, Annotating with Labels and Legends, explains how to add labels, arrows, and mathematical text to our plots. Chapter 3, Applying Colors and Styles, covers the basics of colors and styles in gnuplot, plus transparency, and plotting with points and objects. Chapter 4, Controlling Your Tics, will show you how to get your tic marks and labels just right, along with gnuplot's new internationalization features.

pages: 186 words: 49,251

The Automatic Customer: Creating a Subscription Business in Any Industry by John Warrillow

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

But your true return is much greater because you have had \$1,200 of your customer’s money—interest free—to invest in your business. You have taken on a risk in guaranteeing your customer’s roof replacement and need to be paid for placing that bet. The repair job could have cost you \$3,000, and then you would have taken an underwriting loss of \$1,800 (\$1,200−\$3,000). Calculating your risk is the primary challenge of running a peace-of-mind model company. Big insurance companies employ an army of actuaries who use statistical models to predict the likelihood of a claim being made. You don’t need to be quite so scientific. Instead, start by looking back at the last 20 roofs you’ve installed with a guarantee and figure out how many service calls you needed to make. That will give you a pretty good idea of the possible risk of offering a peace-of-mind subscription. Assuming you’re not an actuary and you didn’t get your doctorate in math from MIT, it’s probably a wise idea to go slow in leveraging the peace-of-mind subscription model.

pages: 133 words: 42,254

Big Data Analytics: Turning Big Data Into Big Money by Frank J. Ohlhorst

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Much like the data themselves, the team should not be static in nature and should be able to evolve and adapt to the needs of the business. CHALLENGES REMAIN Locating the right talent to analyze data is the biggest hurdle in building a team. Such talent is in high demand, and the need for data analysts and data scientists continues to grow at an almost exponential rate. Finding this talent means that organizations will have to focus on data science and hire statistical modelers and text data–mining professionals as well as people who specialize in sentiment analysis. Success with Big Data analytics requires solid data models, statistical predictive models, and test analytic models, since these will be the core applications needed to do Big Data. Locating the appropriate talent takes more than just a typical IT job placement; the skills required for a good return on investment are not simple and are not solely technology oriented.

pages: 219 words: 63,495

50 Future Ideas You Really Need to Know by Richard Watson

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Link all this to new imaging technologies, remote monitoring, medical smartcards, e-records and even gamification. One day, we may, for example, develop a tiny chip that can hold the full medical history of a person including any medical conditions, allergies, prescriptions and contact information (this is already planned in America). Digital vacuums Digital vacuuming refers to the practice of scooping up vast amounts of data then using mathematical and statistical models to determine content and possible linkages. The data itself can be anything from phone calls in historical or real time (the US company AT&T, for example, holds the records of 1.9 trillion telephone calls) to financial transactions, emails and Internet site visits. Commercial applications could include future health risks to counterterrorism. The card could feature a picture ID and hours of video content, such as X-rays or moving medical imagery.

pages: 222 words: 53,317

Overcomplicated: Technology at the Limits of Comprehension by Samuel Arbesman

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

What techniques are used by experts: Nick Bostrom, Superintelligence: Paths, Dangers, Strategies (Oxford, UK: Oxford University Press, 2014), 15. say, 99.9 percent of the time: I made these numbers up for effect, but if any linguist wants to chat, please reach out! “based on millions of specific features”: Alon Halevy et al., “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems 24, no. 2 (2009): 8–12. In some ways, these statistical models are actually simpler than those that start from seemingly more elegant rules, because the latter end up being complicated by exceptions. sophisticated machine learning techniques: See Douglas Heaven, “Higher State of Mind,” New Scientist 219 (August 10, 2013), 32–35, available online (under the title “Not Like Us: Artificial Minds We Can’t Understand”): http://complex.elte.hu/~csabai/simulationLab/AI_08_August_2013_New_Scientist.pdf.

Syntactic Structures by Chomsky, Noam

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

We shall see, in fact, in § 7, that there are deep structural reasons for distinguishing (3) and (4) from (5) and (6); but before we are able to find an explanation for such facts as these we shall have to carry the theory of syntactic structure a good deal beyond its familiar limits. 2.4 Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English." It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not. Presented with these sentences, a speaker of English will read (1) with a normal sentence intonation, but he will read (2) with a falling intonation on each word; in fact, with just the intonation pattern given to any sequence of unrelated words.

pages: 279 words: 75,527

Collider by Paul Halpern

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Although this could represent an escaping graviton, more likely possibilities would need to be ruled out, such as the commonplace production of neutrinos. Unfortunately, even a hermetic detector such as ATLAS can’t account for the streams of lost neutrinos that pass unhindered through almost everything in nature—except by estimating the missing momentum and assuming it is all being transferred to neutrinos. Some physicists hope that statistical models of neutrino production would eventually prove sharp enough to indicate significant differences between the expected and actual pictures. Such discrepancies could prove that gravitons fled from collisions and ducked into regions beyond. Another potential means of establishing the existence of extra dimensions would be to look for the hypothetical phenomena called Kaluza-Klein excitations (named for Klein and an earlier unification pioneer, German mathematician Theodor Kaluza).

pages: 204 words: 67,922

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

And how much should Amex have paid for this privilege? Should they have gotten a discount since the first word of their brand is also the first word of American Airlines and thereby reinforces—albeit in a subtle way—the host company’s image? In order to know the value of the deal, they would have had to know how much the marketing campaign increases their business. Impossible. No focus group or statistical model will tell Amex how much worse or better their bottom line would have been in the absence of this marketing campaign. Ditto for the impact of billboards, product placement, and special promotions like airline mileage plans. There are simply too many other forces that come into play to be able to isolate the impact of a specific effort. Ditto for most of the symbolic economy. It is ironic that in this age of markets and seemingly limitless information, we can’t get the very answers we need to make rational business decisions.

pages: 306 words: 78,893

After the New Economy: The Binge . . . And the Hangover That Won't Go Away by Doug Henwood

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

It's also hard to reconcile with the fact that the distribution of educational attainment has long been growing less, not more, unequal. Even classic statements of this skills argument, like that of Juhn, Murphy, and Pierce (1993), find that the standard proxies for skill like years of education and years of work experience (proxies being needed because skill is nearly impossible to define or measure) only explain part of the increase in polarization—less than half, in fact. Most of the increase remains unexplained by statistical models, a remainder that is typically attributed to "unobserved" attributes. That is, since conventional economists believe as a matter of faith that market rates of pay are fair compensation for a worker's productive contribution, any inexplicable anomalies in pay must be the result of things a boss can see that elude the academic's model. Those of us who are not constrained by a faith in the correlation of pay and productivity, or who don't accept conventional definitions of what constitutes productive labor, will want to look elsewhere.

pages: 225 words: 11,355

Financial Market Meltdown: Everything You Need to Know to Understand and Survive the Global Credit Crisis by Kevin Mellyn

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

Financial innovation was all about getting more credit into the hands of consumers, making more income using less capital, and turning what had been concentrated risks off the books of banks into securities that could be traded between and owned by professional investors who could be expected to look after themselves. Like much of the “progress” of the last century, it was a matter of replacing common sense and tradition with science. The models produced using advanced statistics and computers were designed by brilliant minds from the best universities. At the Basle Committee, which set global standards for bank regulation to be followed by all major central banks, the use of statistical models to measure risk and reliance on the rating agencies were baked into the proposed rules for capital adequacy. The whole thing blew up not because of something obvious like greed. It failed because of the hubris, the fatal pride, of men and women who sincerely thought that they could build computer models that were capable of predicting risk and pricing it correctly. They were wrong. 4 HOW WE GOT HERE Henry Ford famously said that history is bunk.

pages: 274 words: 75,846

The Filter Bubble: What the Internet Is Hiding From You by Eli Pariser

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

The best way to avoid overfitting, as Popper suggests, is to try to prove the model wrong and to build algorithms that give the benefit of the doubt. If Netflix shows me a romantic comedy and I like it, it’ll show me another one and begin to think of me as a romantic-comedy lover. But if it wants to get a good picture of who I really am, it should be constantly testing the hypothesis by showing me Blade Runner in an attempt to prove it wrong. Otherwise, I end up caught in a local maximum populated by Hugh Grant and Julia Roberts. The statistical models that make up the filter bubble write off the outliers. But in human life it’s the outliers who make things interesting and give us inspiration. And it’s the outliers who are the first signs of change. One of the best critiques of algorithmic prediction comes, remarkably, from the late-nineteenth-century Russian novelist Fyodor Dostoyevsky, whose Notes from Underground was a passionate critique of the utopian scientific rationalism of the day.

pages: 322 words: 77,341

I.O.U.: Why Everyone Owes Everyone and No One Can Pay by John Lanchester

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

The 1998 default was a 7-sigma event. That means it should statistically have happened only once every 3 billion years. And it wasn’t the only one. The last decades have seen numerous 5-, 6-, and 7-sigma events. Those are supposed to happen, respectively, one day in every 13,932 years, one day in every 4,039,906 years, and one day in every 3,105,395,365 years. Yet no one concluded from this that the statistical models in use were wrong. The mathematical models simply didn’t work in a crisis. They worked when they worked, which was most of the time; but the whole point of them was to assess risk, and some risks by definition happen at the edges of known likelihoods. The strange thing is that this is strongly hinted at in the VAR model, as propounded by its more philosophically minded defenders such as Philippe Jorion: it marks the boundaries of the known world, up to the VAR break, and then writes “Here be Dragons.”
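Waiting times of this size follow from the tail of a normal distribution. The sketch below is an illustration under assumptions the excerpt does not state: daily returns are normally distributed, only the downside tail counts, and a year has about 250 trading days. With those choices the results land close to the quoted figures, though not exactly on them.

```python
import math

TRADING_DAYS_PER_YEAR = 250  # assumption; the excerpt does not give the day count


def tail_probability(k):
    """One-sided probability that a normal variable falls more than k sigma below its mean."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def years_between_events(k):
    """Expected wait, in years, for one day with a k-sigma loss."""
    return 1 / tail_probability(k) / TRADING_DAYS_PER_YEAR


for k in (5, 6, 7):
    print(f"{k}-sigma: about one day in every {years_between_events(k):,.0f} years")
```

Seeing several such days within a few decades is therefore strong evidence against the normality assumption itself, which is the passage's point.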

Exploring Everyday Things with R and Ruby by Sau Sheong Chang

Amazon: amazon.comamazon.co.ukamazon.deamazon.fr

The default method for a smooth geom in ggplot2 is the LOESS algorithm, which is suitable for a small number of data points. LOESS is not suitable for a large number of data points, however, because it scales on an O(n²) basis in memory, so instead we use the mgcv library and its gam method. We also send in the formula y~s(x), where s is the smoother function for GAM. GAM stands for generalized additive model, which is a statistical model used to describe how items of data relate to each other. In our case, we use GAM as an algorithm in the smoother to provide us with a reasonably good estimation of how a large number of data points can be visualized. In Figure 8-5, you can see that the population of roids fluctuates over time between two extremes caused by the oversupply and exhaustion of food, respectively. Figure 8-5.

pages: 373 words: 80,248

Empire of Illusion: The End of Literacy and the Triumph of Spectacle by Chris Hedges

He told the senators that the collapse of the global financial system is “likely to produce a wave of economic crises in emerging market nations over the next year.” He added that “much of Latin America, former Soviet Union states, and sub-Saharan Africa lack sufficient cash reserves, access to international aid or credit, or other coping mechanism.” “When those growth rates go down, my gut tells me that there are going to be problems coming out of that, and we’re looking for that,” he said. He referred to “statistical modeling” showing that “economic crises increase the risk of regime-threatening instability if they persist over a one- to two-year period.” Blair articulated the newest narrative of fear. As the economic unraveling accelerates, we will be told it is not the bearded Islamic extremists who threaten us most, although those in power will drag them out of the Halloween closet whenever they need to give us an exotic shock, but instead the domestic riffraff, environmentalists, anarchists, unions, right-wing militias, and enraged members of our dispossessed working class.

pages: 291 words: 81,703

Average Is Over: Powering America Beyond the Age of the Great Stagnation by Tyler Cowen

I accessed the Wikipedia entry on string theory on December 26, 2012. Perhaps it will become clearer! On the age dynamics for achievement for non-economists, see Benjamin F. Jones and Bruce A. Weinberg, “Age Dynamics in Scientific Creativity,” published online before print, PNAS, November 7, 2011, doi: 10.1073/pnas.1102895108. On data crunching pushing out theory, see the famous essay by Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, 2001, 16(3): 199–231, including the comments on the piece as well. See also the recent piece by Betsey Stevenson and Justin Wolfers, “Business is Booming in Empirical Economics,” Bloomberg.com, August 6, 2012. And as mentioned earlier, see Daniel S. Hamermesh, “Six Decades of Top Economics Publishing: Who and How?” National Bureau of Economic Research, Working Paper 18635, December 2012.

pages: 589 words: 69,193

Mastering Pandas by Femi Anthony

The normalizing constant doesn't always need to be calculated, especially in many popular algorithms such as MCMC, which we will examine later in this chapter. P(H|D) is the probability that the hypothesis is true, given the data that we observe. This is called the posterior. P(D|H) is the probability of obtaining the data, considering our hypothesis. This is called the likelihood. Thus, Bayesian statistics amounts to applying Bayes rule to solve problems in inferential statistics with H representing our hypothesis and D the data. A Bayesian statistical model is cast in terms of parameters, and the uncertainty in these parameters is represented by probability distributions. This is different from the Frequentist approach where the values are regarded as deterministic. An alternative representation is as follows: where, is our unknown data and is our observed data. In Bayesian statistics, we make assumptions about the prior data and use the likelihood to update to the posterior probability using the Bayes rule.
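As a small illustration of the prior-to-posterior update described above, the sketch below applies Bayes' rule over a discrete set of hypotheses. The coin example, the hypothesis names, and the `posterior` helper are all hypothetical and are not taken from the book.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a discrete set of hypotheses.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(D | H)
    returns:    dict mapping hypothesis -> P(H | D)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(D), the normalizing constant
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical question: is a coin fair, or biased towards heads (P(heads) = 0.8)?
prior = {"fair": 0.5, "biased": 0.5}

# Observed data D: 8 heads in 10 tosses. The binomial coefficient is the same
# for both hypotheses, so it cancels in the normalization and is omitted.
likelihood = {
    "fair": 0.5 ** 8 * 0.5 ** 2,
    "biased": 0.8 ** 8 * 0.2 ** 2,
}

post = posterior(prior, likelihood)
print(post)
```

Here the normalizing constant is cheap to compute because there are only two hypotheses. Over continuous parameters it becomes an integral, which is why sampling methods that avoid computing it are attractive.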

pages: 242 words: 68,019

Why Information Grows: The Evolution of Order, From Atoms to Economies by Cesar Hidalgo

GDP considers the production of goods and services within a country. GNP considers the goods and services produced by the citizens of a country, whether or not those goods are produced within the boundaries of the country. 5. Simon Kuznets, “Modern Economic Growth: Findings and Reflections,” American Economic Review 63, no. 3 (1973): 247–258. 6. Technically, total factor productivity is the residual or error term of the statistical model. Also, economists often refer to total factor productivity as technology, although this is a semantic deformation that is orthogonal to the definition of technology used by anyone who has ever developed a technology. In the language of economics, technology is the ability to do more—of anything—with the same cost. For inventors of technology, technology is the ability to do something completely new, which often involves the development of a new capacity.

pages: 280 words: 79,029

Smart Money: How High-Stakes Financial Innovation Is Reshaping Our World—For the Better by Andrew Palmer

Public data from a couple of longitudinal studies showing the long-term relationship between education and income in the United States enabled him to build what he describes as “a simple multivariate regression model”—you know the sort, we’ve all built one—and work out the relationships between things such as test scores, degrees, and first jobs on later income. That model has since grown into something whizzier. An applicant’s education, SAT scores, work experience, and other details are pumped into a proprietary statistical model, which looks at people with comparable backgrounds and generates a prediction of that person’s personal income. Upstart now uses these data to underwrite loans to younger people—who often find it hard to raise money because of their limited credit histories. But the model was initially used to determine how much money an applicant could raise for each percentage point of future income they gave away.

pages: 239 words: 70,206

Data-Ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else by Steve Lohr

Cleveland, then a researcher at Bell Labs, wrote a paper he called an “action plan” for essentially redefining statistics as an engineering task. “The altered field,” he wrote, “will be called ‘data science.’” In his paper, Cleveland, who is now a professor of statistics and computer science at Purdue University, described the contours of this new field. Data science, he said, would touch all disciplines of study and require the development of new statistical models, new computing tools, and educational programs in schools and corporations. Cleveland’s vision of a new field is now rapidly gaining momentum. The federal government, universities, and foundations are funding data science initiatives. Nearly all of these efforts are multidisciplinary melting pots that seek to bring together teams of computer scientists, statisticians, and mathematicians with experts who bring piles of data and unanswered questions from biology, astronomy, business and finance, public health, and elsewhere.

pages: 283 words: 81,163

How Capitalism Saved America: The Untold History of Our Country, From the Pilgrims to the Present by Thomas J. Dilorenzo

Wages rose by a phenomenal 13.7 percent during the first three quarters of 1937 alone.46 The union/nonunion wage differential increased from 5 percent in 1933 to 23 percent by 1940.47 On top of this, the Social Security payroll and unemployment insurance taxes contributed to a rapid rise in government-mandated fringe benefits, from 2.4 percent of payrolls in 1936 to 5.1 percent just two years later. Economists Richard Vedder and Lowell Gallaway have determined the costs of all this misguided legislation, showing how most of the abnormal unemployment of the 1930s would have been avoided had it not been for the New Deal. Using a statistical model, Vedder and Gallaway concluded that by 1940 the unemployment rate was more than 8 percentage points higher than it would have been without the legislation-induced growth in unionism and government-mandated fringe-benefit costs imposed on employers.48 Their conclusion: “The Great Depression was very significantly prolonged in both its duration and its magnitude by the impact of New Deal programs.”49 In addition to fascistic labor policies and government-mandated wage and fringe-benefit increases that destroyed millions of jobs, the Second New Deal was responsible for economy-destroying tax increases and massive government spending on myriad government make-work programs.

pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python by Joel Grus

(You attempt to explain to her that search engine algorithms are clever enough that this won’t actually work, but she refuses to listen.) Of course, she doesn’t want to write thousands of web pages, nor does she want to pay a horde of “content strategists” to do so. Instead she asks you whether you can somehow programmatically generate these web pages. To do this, we’ll need some way of modeling language. One approach is to start with a corpus of documents and learn a statistical model of language. In our case, we’ll start with Mike Loukides’s essay “What is data science?” As in Chapter 9, we’ll use requests and BeautifulSoup to retrieve the data. There are a couple of issues worth calling attention to. The first is that the apostrophes in the text are actually the Unicode character u"\u2019". We’ll create a helper function to replace them with normal apostrophes: def fix_unicode(text): return text.replace(u"\u2019", "'") The second issue is that once we get the text of the web page, we’ll want to split it into a sequence of words and periods (so that we can tell where sentences end).
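One simple way to "learn a statistical model of language" from a sequence of words and periods is a bigram model: record which words follow each word, then sample from those successors. The sketch below is an illustration under that assumption and not necessarily the approach the book goes on to take. The toy corpus and the function names are hypothetical.

```python
import random
from collections import defaultdict


def train_bigrams(words):
    """Map each word to the list of words observed immediately after it."""
    transitions = defaultdict(list)
    for prev, current in zip(words, words[1:]):
        transitions[prev].append(current)
    return transitions


def generate_sentence(transitions, rng):
    """Start after a period and sample successors until the next period."""
    current = "."
    sentence = []
    while True:
        current = rng.choice(transitions[current])
        sentence.append(current)
        if current == ".":
            return " ".join(sentence)


# Toy corpus (hypothetical), already split into words and periods.
corpus = ". data science is fun . science is hard . data is everywhere .".split()
model = train_bigrams(corpus)
print(generate_sentence(model, random.Random(0)))
```

Keeping periods as tokens is what lets the generator know where a sentence may start and stop. Because successors are stored with repetition, frequent continuations are sampled more often.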

pages: 277 words: 80,703

Revolution at Point Zero: Housework, Reproduction, and Feminist Struggle by Silvia Federici

At least since the Zapatistas, on December 31, 1993, took over the zócalo of San Cristóbal to protest legislation dissolving the ejidal lands of Mexico, the concept of the “commons” has gained popularity among the radical Left, internationally and in the United States, appearing as a ground of convergence among anarchists, Marxists/socialists, ecologists, and ecofeminists.1 There are important reasons why this apparently archaic idea has come to the center of political discussion in contemporary social movements. Two in particular stand out. On the one side, there has been the demise of the statist model of revolution that for decades has sapped the efforts of radical movements to build an alternative to capitalism. On the other, the neoliberal attempt to subordinate every form of life and knowledge to the logic of the market has heightened our awareness of the danger of living in a world in which we no longer have access to seas, trees, animals, and our fellow beings except through the cash-nexus.

Raw Data Is an Oxymoron by Lisa Gitelman

Data storage of this scale, potentially measured in petabytes, would necessarily require sophisticated algorithmic querying in order to detect informational patterns. For David Gelernter, this type of data management would require “topsight,” a topdown perspective achieved through software modeling and the creation of microcosmic “mirror worlds,” in which raw data filters in from the bottom and the whole comes into focus through statistical modeling and rule and pattern extraction.36 The promise of topsight, in Gelernter’s terms, is a progression from annales to annalistes, from data collection that would satisfy a “neo-Victorian curatorial” drive to data analysis that calculates prediction scenarios and manages risk.37 What would be the locus of suspicion and paranoid fantasy (Poster calls it “database anxiety”) if not such an intricate and operationally efficient system, the aggregating capacity of which easily ups the ante on Thomas Pynchon’s paranoid realization that “everything is connected”?

The Armchair Economist: Economics and Everyday Life by Steven E. Landsburg

(Exactly why he thought this has never been determined, but he was quite sure of himself.) The commissioner became obsessed with the need to discourage punting and called in his assistants for advice on how to cope with the problem. One of those assistants, a fresh M.B.A., breathlessly announced that he had taken courses from an economist who was a great expert on all aspects of the game and who had developed detailed statistical models to predict how teams behave. He proposed retaining the economist to study what makes teams punt. The commissioner summoned the economist, who went home with a large retainer check and a mandate to discover the causes of punting. Many hours later (he billed by the hour) the answer was at hand. Volumes of computer printouts left no doubt: Punting nearly always takes place on the fourth down.

Deep Work: Rules for Focused Success in a Distracted World by Cal Newport

But the real importance of this story is the experiment itself, and in particular, its complexity. It turns out to be really difficult to answer a simple question such as: What’s the impact of our current e-mail habits on the bottom line? Cochran had to conduct a company-wide survey and gather statistics from the IT infrastructure. He also had to pull together salary data and information on typing and reading speed, and run the whole thing through a statistical model to spit out his final result. And even then, the outcome is fungible, as it’s not able to separate out, for example, how much value was produced by this frequent, expensive e-mail use to offset some of its cost. This example generalizes to most behaviors that potentially impede or improve deep work. Even though we abstractly accept that distraction has costs and depth has value, these impacts, as Tom Cochran discovered, are difficult to measure.

pages: 305 words: 69,216

A Failure of Capitalism: The Crisis of '08 and the Descent Into Depression by Richard A. Posner

Marketers to Americans (as distinct from Japanese) have had greater success appealing to the first set of motives than to the second. Quantitative models of risk—another fulfillment of Weber's prophecy that more and more activities would be brought under the rule of rationality— are also being blamed for the financial crisis. Suppose a trader is contemplating the purchase of a stock using largely borrowed money, so that if the stock falls even a little way the loss will be great. He might consult a statistical model that predicted, on the basis of the ups and downs of the stock in the preceding two years, the probability distribution of the stock's behavior over the next few days or weeks. The criticism is that the model would have based the prediction on market behavior during a period of rising stock values; the modeler should have gone back to the 1980s or earlier to get a fuller picture of the riskiness of the stock.

pages: 251 words: 76,128

Borrow: The American Way of Debt by Louis Hyman

In the fall of 2006, the impossible happened. Housing prices began to fall. As credit-rating agencies began to reassess the safety of the AAA mortgage-backed securities, insurance companies had to pony up greater quantities of collateral to guarantee the insurance policies on the bonds. The global credit market rested on a simple assumption: housing prices would always go up. Foreclosures would be randomly distributed, as the statistical models assumed. Yet as those models, and the companies that had created them, began to fail, a shudder ran through the corpus of global capitalism. The insurance giant AIG, which had hoped for so much profit in 1998, watched as its entire business—both traditional and new—went down, supported only by the U.S. government. The arcane operations of the credit markets spilled out into the larger economy, bringing about the greatest economic downturn since the Great Depression.

pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything by C. Gordon Bell, Jim Gemmell

Adding summarization to visualization for geolocated photos: Ahern, Shane, Mor Naaman, Rahul Nair, Jeannie Yang. “World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-Referenced Collections.” In Proceedings, Seventh ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 07), June 2007. The Stuff I’ve Seen project did some experiments that showed how displaying milestones alongside a timeline may help orient the user. Horvitz et al. used statistical models to infer the probability that users will consider events to be memory landmarks. Ringel, M., E. Cutrell, S. T. Dumais, and E. Horvitz. 2003. “Milestones in Time: The Value of Landmarks in Retrieving Information from Personal Stores.” Proceedings of IFIP Interact 2003. Horvitz, Eric, Susan Dumais, and Paul Koch. “Learning Predictive Models of Memory Landmarks.” CogSci 2004: 26th Annual Meeting of the Cognitive Science Society, Chicago, August 2004.

pages: 238 words: 75,994

A Burglar's Guide to the City by Geoff Manaugh

* The fundamental premise of the capture-house program is that police can successfully predict what sorts of buildings and internal spaces will attract not just any criminal but a specific burglar, the unique individual each particular capture house was built to target. This is because burglars unwittingly betray personal, as well as shared, patterns in their crimes; they often hit the same sorts of apartments and businesses over and over. But the urge to mathematize this, and to devise complex statistical models for when and where a burglar will strike next, can lead to all sorts of analytical absurdities. A great example of this comes from an article published in the criminology journal Crime, Law and Social Change back in 2011. Researchers from the Physics Engineering Department at Tsinghua University reported some eyebrow-raisingly specific data about the meteorological circumstances during which burglaries were most likely to occur in urban China.

pages: 804 words: 212,335

Revelation Space by Alastair Reynolds

But if Sajaki's equipment was not the best, chances were good that he had excellent algorithms to distil memory traces. Over centuries, statistical models had studied patterns of memory storage in ten billion human minds, correlating structure against experience. Certain impressions tended to be reflected in similar neural structures — internal qualia — which were the functional blocks out of which more complex memories were assembled. Those qualia were never the same from mind to mind, except in very rare cases, but neither were they encoded in radically different ways, since nature would never deviate far from the minimum-energy route to a particular solution. The statistical models could identify those qualia patterns very efficiently, and then map the connections between them out of which memories were forged.

pages: 504 words: 89,238

Natural language processing with Python by Steven Bird, Ewan Klein, Edward Loper

Structure of the published TIMIT Corpus: The CD-ROM contains doc, train, and test directories at the top level; the train and test directories both have eight sub-directories, one per dialect region; each of these contains further subdirectories, one per speaker; the contents of the directory for female speaker aks0 are listed, showing 10 wav files accompanied by a text transcription, a word-aligned transcription, and a phonetic transcription. there is a split between training and testing sets, which gives away its intended use for developing and evaluating statistical models. Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus. Therefore, many of the computational methods described in this book are applicable. Moreover, notice that all of the data types included in the TIMIT Corpus fall into the two basic categories of lexicon and text, which we will discuss later.

For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and use these methods to identify meaningful patterns in their sensory language experience (empiricism). We have seen many examples of this methodology throughout this book. Statistical methods inform symbolic models anytime corpus statistics guide the selection of productions in a context-free grammar, i.e., “grammar engineering.” Symbolic methods inform statistical models anytime a corpus that was created using rule-based methods is used as a source of features for training a statistical language model, i.e., “grammatical inference.” The circle is closed. NLTK Roadmap The Natural Language Toolkit is a work in progress, and is being continually expanded as people contribute code. Some areas of NLP and linguistics are not (yet) well supported in NLTK, and contributions in these areas are especially welcome.

pages: 666 words: 181,495

In the Plex: How Google Thinks, Works, and Shapes Our Lives by Steven Levy

Och’s official role was as a scientist in Google’s research group, but it is indicative of Google’s view of research that no step was required to move beyond study into actual product implementation. Because Och and his colleagues knew they would have access to an unprecedented amount of data, they worked from the ground up to create a new translation system. “One of the things we did was to build very, very, very large language models, much larger than anyone has ever built in the history of mankind.” Then they began to train the system. To measure progress, they used a statistical model that, given a series of words, would predict the word that came next. Each time they doubled the amount of training data, they got a .5 percent boost in the metrics that measured success in the results. “So we just doubled it a bunch of times.” In order to get a reasonable translation, Och would say, you might feed something like a billion words to the model. But Google didn’t stop at a billion.

To keep making consistently accurate predictions on click-through rates and conversions, Google needed to know everything. “We are trying to understand the mechanisms behind the metrics,” says Qing Wu, a decision support analyst at Google. His specialty was forecasting. He could predict patterns of queries from season to season, in different parts of the day, and the climate. “We have the temperature data, we have the weather data, and we have the queries data so we can do correlation and statistical modeling.” To make sure that his predictions were on track, Qing Wu and his colleagues made use of dozens of onscreen dashboards with information flowing through them, a Bloomberg of the Googlesphere. “With a dashboard you can monitor the queries, the amount of money you make, how many advertisers we have, how many keywords they’re bidding on, what the ROI is for each advertiser.” It’s like the census data, he would say, only Google does much better analyzing its information than the government does with the census results.

pages: 741 words: 179,454

Extreme Money: Masters of the Universe and the Cult of Risk by Satyajit Das

Mortgages against second and third homes, vacation homes and nonowner-occupied investment homes to be rented out (buy-to-let) or sold later (condo flippers) were allowed. HE (home equity) and HELOC (home equity line of credit), borrowing against the equity in existing homes, became prevalent. Empowered by high-tech models, lenders loaned to less creditworthy borrowers, believing they could price any risk. Ben Bernanke shared his predecessor Alan Greenspan’s faith: “banks have become increasingly adept at predicting default risk by applying statistical models to data, such as credit scores.” Bernanke concluded that banks “have made substantial strides...in their ability to measure and manage risks.”13 Innovative affordability products included jumbo and super jumbo loans that did not conform to guidelines because of their size. More risky than prime but less risky than subprime, Alt A (Alternative A) mortgages were for borrowers who did not meet normal criteria.

In 2007, Moody’s upgraded three major Icelandic banks to the highest AAA rating, citing new methodology that took into account the likelihood of government support. Although Moody’s reversed the upgrades, all three banks collapsed in 2008. Unimpeded by insufficient disclosure, lack of information transparency, fraud, and improper accounting, traders anticipated these defaults, marking down bond prices well before rating downgrades. Rating-structured securities required statistical models, mapping complex securities to historical patterns of default on normal bonds. With mortgage markets changing rapidly, this was like “using weather in Antarctica to forecast conditions in Hawaii.”17 Antarctica from 100 years ago! The agencies did not look at the underlying mortgages or loans in detail, relying instead on information from others. Moody’s Yuri Yoshizawa stated: “We’re structure experts.

Debtor Nation: The History of America in Red Ink (Politics and Society in Modern America) by Louis Hyman

Applications became more consistent and less subject to the whims of a particular loan officer. In computer models, feminist credit advocates believed they had found the solution to discriminatory lending, ushering in the contemporary calculated credit regimes under which we live today. Yet removing such basic demographics from any model was not as straightforward as the authors of the ECOA had hoped because of how all statistical models function, but which legislators seem to not have fully understood. The “objective” credit statistics that legislators had pined for during the early investigations of the Consumer Credit Protection Act could now exist, but with new difficulties that stemmed from using regressions and not human judgment to decide on loans. In human-judged credit lending, a loan officer who knew the race and gender of an applicant would be more discriminatory, whereas in a computer credit model, knowing the applicant’s race and gender allowed the credit decision to be less discriminatory.

The higher the level of education and income, the lower the effective interest rate paid, since such users tended more frequently to be non-revolvers.96 The researchers found that young, large, low-income families who could not save for major purchases, paid finance charges, while their opposite, older, smaller, high-income families who could save for major purchases, did not pay finance charges. Effectively the young and poor cardholders subsidized the convenience of the old and rich.97 And white.98 The new statistical models revealed that the second best predictor of revolving debt, after a respondent’s own “self-evaluation of his or her ability to save,” was race.99 But what these models revealed was that the very group—African Americans—that the politicians wanted to increase credit access to, tended to revolve their credit more than otherwise similar white borrowers. Though federal laws prevented businesses from using race in their lending decisions, academics were free to examine race as a credit model would and found that, even after adjusting for income and other demographics, race was still the second strongest predictive factor.

pages: 757 words: 193,541

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 by Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan

By reducing lead time, capacity planning can be more agile. Standard capacity planning is sufficient for small sites, sites that grow slowly, and sites with simple needs. It is insufficient for large, rapidly growing sites. They require more advanced techniques. Advanced capacity planning is based on core drivers, capacity limits of individual resources, and sophisticated data analysis such as correlation, regression analysis, and statistical models for forecasting. Regression analysis finds correlations between core drivers and resources. Forecasting uses past data to predict future needs. With sufficiently large sites, capacity planning is a full-time job, often done by project managers with technical backgrounds. Some organizations employ full-time statisticians to build complex models and dashboards that provide the information required by a project manager.
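The regression-then-forecast idea above can be sketched with an ordinary least squares line relating one core driver to one resource. The driver (active users), the resource (disk space), and every number below are hypothetical illustrations. Real capacity data is noisier and often needs more than a straight line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical history: core driver (active users, thousands) vs. resource (disk, TB).
users = [100, 150, 200, 250, 300]
disk_tb = [12, 17, 22, 27, 32]

intercept, slope = fit_line(users, disk_tb)

# Assumed future value of the core driver, e.g. taken from a business forecast.
forecast_users = 400
needed_tb = intercept + slope * forecast_users
print(f"{slope:.2f} TB per thousand users; about {needed_tb:.0f} TB at {forecast_users}k users")
```

The slope is the "resources needed per unit of growth" figure. The forecast simply extrapolates it, so it holds only while the relationship between driver and resource stays roughly linear.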

Capacity planning involves the technical work of understanding how many resources are needed per unit of growth, plus non-technical aspects such as budgeting, forecasting, and supply chain management. These topics are covered in Chapter 18.

Sample Assessment Questions

• How much capacity do you have now?
• How much capacity do you expect to need three months from now? Twelve months from now?
• Which statistical models do you use for determining future needs?
• How do you load-test?
• How much time does capacity planning take? What could be done to make it easier?
• Are metrics collected automatically?
• Are metrics available always or does their need initiate a process that collects them?
• Is capacity planning the job of no one, everyone, a specific person, or a team of capacity planners?
• If there is a corporate standard practice for this OR, what is it and how does this service comply with the practice?

pages: 257 words: 94,168

Oil Panic and the Global Crisis: Predictions and Myths by Steven M. Gorelick

At a depth of over 5 miles, this find contains anywhere between 3 and 15 billion barrels and could comprise 11 percent of US production by 2013.107 In 2009, Chevron reported another deep-water discovery just 44 miles away that may yield 0.5 billion barrels and could be profitably produced at an oil price of \$50 per barrel.108 The second insight from discovery trends is that an underlying premise of many statistical models of oil discovery is probably incorrect. This premise is that larger oil fields are found first, followed by the discovery of smaller fields. Large fields in geologically related proximity to one another are typically discovered first simply because they are the most easily detected targets. However, this is not always the case, as pointed out by Ron Charpentier of the USGS, who notes that new technology can rejuvenate the discovery process.

pages: 364 words: 101,286

The Misbehavior of Markets by Benoit Mandelbrot

. • Abstract: Intermittency and periodicity, and the problem of long cycles. Econometrica 34, 1966 (Supplement): 152-153. Mandelbrot, Benoit B. 1970. Long-run interdependence in price records and other economic time series. Econometrica 38: 122-123. Mandelbrot, Benoit B. 1972. Possible refinement of the lognormal hypothesis concerning the distribution of energy dissipation in intermittent turbulence. Statistical Models and Turbulence. M. Rosenblatt and C. Van Atta, eds. Lecture Notes in Physics 12. New York: Springer, 333-351. • Reprint: Chapter N14 of Mandelbrot 1999a. Mandelbrot, Benoit B. 1974a. Intermittent turbulence in self-similar cascades; divergence of high moments and dimension of the carrier. Journal of Fluid Mechanics 62: 331-358. • Reprint: Chapter N15 of Mandelbrot 1999a. Mandelbrot, Benoit B. 1974b.

pages: 227 words: 32,306

Using Open Source Platforms for Business Intelligence: Avoid Pitfalls and Maximize Roi by Lyndsay Wise


All of these situations mean that different people within businesses have different worldviews and apply separate calculations to their work, resulting in data that is considered “manipulated” to some extent.

Mitigating risk

Another reason organizations look at BI is to help mitigate risk. In the past, much risk management within BI remained within the realm of finance, insurance, and banking, but most organizations need to assess potential risk and help mitigate its effects on the organization. Within BI, this goes beyond information visibility and means using predictive modeling and other advanced statistical models to ensure that customers with accounts past due are not allowed to submit new orders unless it is known beforehand, or that insurance claims aren’t being submitted fraudulently. The National Health Care Anti-Fraud Association (NHCAA) estimates that in 2010, 3% of all health care spending or \$68 billion is lost to health care fraud in the United States.2 This makes fraud detection in health care extremely important, especially when you consider that if you are paying for insurance in the United States, part of your insurance premiums are probably being paid to cover the instances of fraud that occur, making this relevant beyond health care insurance providers.

pages: 302 words: 86,614

The Alpha Masters: Unlocking the Genius of the World's Top Hedge Funds by Maneet Ahuja, Myron Scholes, Mohamed El-Erian


Wong says that the one thing most people don’t understand about systematic trading is the trade-off between profit potential in the long term and the potential for short-term fluctuation and losses. “We are all about the long run,” he says. “It’s why I say, over and over, the trend is your friend.” “If you’re a macro trader and you basically have 20 positions, you better make sure that no more than two or three are wrong. But we base our positions on statistical models, and we take hundreds of positions. At any given time, a lot of them are going to be wrong, and we have to accept that. But in the long run, we’ll be more right than wrong.” Evidently—since 1990, AHL’s total returns have exceeded 1,000 percent. Still, AHL is hardly invulnerable. The financial crisis brought on a sharp reversal, and the firm remains vulnerable to the Fed-induced drop in market volatility.

pages: 335 words: 94,657

The Bogleheads' Guide to Investing by Taylor Larimore, Michael Leboeuf, Mel Lindauer


Mensa is an exclusive society whose membership is restricted to persons scoring in the top 2 percent on IQ tests. During a 15-year period when the S&P 500 had average annual returns of 15.3 percent, the Mensa Investment Club's performance averaged returns of only 2.5 percent. 3. In 1994, a hedge fund called Long Term Capital Management (LTCM) was created with the help of two Nobel Prize-winning economists. They believed they had a statistical model that could eliminate risk from investing. The fund was extremely leveraged. They controlled positions totaling \$1.25 trillion, an amount equal to the annual budget of the U.S. government. After some spectacular early successes, a financial panic swept across Asia. In 1998, LTCM hemorrhaged and faced bankruptcy. To prevent a world economic collapse, the New York Federal Reserve orchestrated a buyout by 14 banks that put up a total of \$3.6 billion to buy out the fund.

pages: 377 words: 97,144

Singularity Rising: Surviving and Thriving in a Smarter, Richer, and More Dangerous World by James D. Miller


Nobel Prize-winning economist James Heckman has written that “an entire literature has found” that cognitive abilities “significantly affect wages.”147 Of course, “cognitive abilities” aren’t necessarily the same thing as g or IQ. Recall that the theory behind g, and therefore IQ’s importance, is that a single variable can represent intelligence. To check whether a single measure of cognitive ability has predictive value, Heckman developed a statistical model testing whether one number essentially representing g and another representing noncognitive ability can explain most of the variations in wages.148 Heckman’s model shows that it could. Heckman, however, carefully points out that noncognitive traits such as “stick-to-it-iveness” are at least as important as cognitive traits in determining wages—meaning that a lazy worker with a high IQ won’t succeed at Microsoft or Goldman Sachs.

pages: 364 words: 99,613

Servant Economy: Where America's Elite Is Sending the Middle Class by Jeff Faux


Martin Wolf, “Why Obama’s Plan Is Still Inadequate and Incomplete,” Financial Times, January 13, 2009. 9. “Larry Summers and Michael Steele,” This Week with Christiane Amanpour, ABC News, February 8, 2009. 10. CNN Politics, Election Center, November 24, 2010, http://www.cnn.com/ELECTION/2010/results/polls.main. 11. Andrew Gelman, “Unsurprisingly, More People Are Worried about the Economy and Jobs Than about Deficit,” Statistical Modeling, Causal Inference, and Social Science, June 19, 2010, http://www.stat.columbia.edu/~cook/movabletype/archives/2010/06/unsurprisingly.html; Ryan Grim, “Mayberry Machiavellis: Obama Political Team Handcuffing Recovery,” Huffington Post, July 6, 2010, http://www.huffingtonpost.com/2010/07/06/mayberry-machiavellis-oba_n_636770.html. 12. Grim, “Mayberry Machiavellis.” 13. Ryan Lizza, “The Obama Memos,” New Yorker, January 30, 2012. 14.

pages: 323 words: 89,795

Food and Fuel: Solutions for the Future by Andrew Heintzman, Evan Solomon, Eric Schlosser


Relying on member countries to provide their own catch reports, the FAO has few safeguards to ensure that its statistics are accurate. Specifically, there were some indications that China’s catch reports were too high. For example, some of China’s major fish populations were declared overexploited decades ago. In 2001, Watson and Pauly published an eye-opening study in the journal Nature about the true status of our world’s fisheries. These researchers used a statistical model to compare China’s officially reported catches to those that would be expected, given oceanographic conditions and other factors. They determined that China’s actual catches were likely closer to one half their reported levels. The implications of China’s over-reporting are dramatic: instead of global catches increasing by 0.33 million tonnes per year since 1988, as reported by the FAO, catches have actually declined by 0.36 million tonnes per year.

pages: 411 words: 108,119

The Irrational Economist: Making Decisions in a Dangerous World by Erwann Michel-Kerjan, Paul Slovic


First, we tend to overreact when virgin risks occur. The particular danger, now both available and salient, is likely to be overestimated in the future. Second, and by contrast, we tend to raise our probability estimate insufficiently when an experienced risk occurs. Follow-up research should document these tendencies with many more examples, and in laboratory settings. If improved predictions are our goal, it should also provide rigorous statistical models of effective updating of virgin and experienced risks. Future inquiry should consider resembled risks as well. Evidence from both terrorist incidents and financial markets suggests that we have difficulty extrapolating from risks that, though varied, bear strong similarities. Behavioral biases such as these are difficult to counteract, but awareness of them is the first step. Requiring careful analysis of all available data could help decision makers to make better risk assessments.

pages: 313 words: 101,403

My Life as a Quant: Reflections on Physics and Finance by Emanuel Derman


The most complex, which used interest-rate simulation models of the BDT type I had helped develop at Goldman, was Salomon's option-adjusted spread model that reported the spread over Treasury bonds the pool would generate, on average, over all future interest-rate scenarios. We ran daily reports on the desk's inventory using both these models. Different clients preferred different metrics, depending on their sophistication and on the accounting rules and regulations to which they were subject. We also did some longer-term, client-focused research, developing improved statistical models for homeowner prepayments or programs for valuing the more exotic ARM-based structures that were growing in popularity. The traders on the desk used the option-adjusted spread model to decide how much to bid for newly available ARM pools. The calculation was arduous. Each pool consisted of a variety of mortgages with a range of coupons and a spectrum of servicing fees, and the option-adjusted spread was calculated by averaging over thousands of future scenarios, each one involving a month-by-month simulation of interest rates over hundreds of months.

pages: 342 words: 94,762

Wait: The Art and Science of Delay by Frank Partnoy


Cohen, “Separate Neural Systems Value Immediacy and Delayed Monetary Rewards,” Science 306(2004): 503–507. It is worth noting that when economists attempt to describe human behavior using high-level math, it often doesn’t go particularly well. Because the math is complex, people are prone to rely on it without question. And the equations often are vulnerable to unrealistic assumptions. Most recently, the financial crisis was caused in part by overreliance on statistical models that didn’t take into account the chances of declines in housing prices. But that was just the most recent iteration: the collapse of Enron, the implosion of the hedge fund Long-Term Capital Management, the billions of dollars lost by rogue traders Kweku Adoboli, Jerome Kerviel, Nick Leeson, and others—all of these fiascos have, at their heart, a mistaken reliance on complex math. Nassim N.

pages: 339 words: 88,732

The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson, Andrew McAfee


These days those initial investigations will take place over the Internet and consist of typing into a search engine phrases like “Phoenix real estate agent,” “Phoenix neighborhoods,” and “Phoenix two-bedroom house prices.” To test this hypothesis, Erik asked Google if he could access data about its search terms. He was told that he didn’t have to ask; the company made these data freely available over the Web. Erik and his doctoral student Lynn Wu, neither of whom was versed in the economics of housing, built a simple statistical model to look at the data utilizing the user-generated content of search terms made available by Google. Their model linked changes in search-term volume to later housing sales and price changes, predicting that if search terms like the ones above were on the increase today, then housing sales and prices in Phoenix would rise three months from now. They found their simple model worked. In fact, it predicted sales 23.6 percent more accurately than predictions published by the experts at the National Association of Realtors.

pages: 294 words: 81,292

Our Final Invention: Artificial Intelligence and the End of the Human Era by James Barrat


Through several well-funded projects, IBM pursues AGI, and DARPA seems to be backing every AGI project I look into. So, again, why not Google? When I asked Jason Freidenfelds, from Google PR, he wrote: … it’s much too early for us to speculate about topics this far down the road. We’re generally more focused on practical machine learning technologies like machine vision, speech recognition, and machine translation, which essentially is about building statistical models to match patterns—nothing close to the “thinking machine” vision of AGI. But I think Page’s quotation sheds more light on Google’s attitudes than Freidenfelds’s. And it helps explain Google’s evolution from the visionary, insurrectionist company of the 1990s, with the much touted slogan DON’T BE EVIL, to today’s opaque, Orwellian, personal-data-aggregating behemoth. The company’s privacy policy shares your personal information among Google services, including Gmail, Google+, YouTube, and others.

pages: 370 words: 94,968

The Most Human Human: What Talking With Computers Teaches Us About What It Means to Be Alive by Brian Christian


The Turing test would seem to corroborate that. UCSD’s computational linguist Roger Levy: “Programs have gotten relatively good at what is actually said. We can devise complex new expressions, if we intend new meanings, and we can understand those new meanings. This strikes me as a great way to break the Turing test [programs] and a great way to distinguish yourself as a human. I think that in my experience with statistical models of language, it’s the unboundedness of human language that’s really distinctive.”4 Dave Ackley offers very similar confederate advice: “I would make up words, because I would expect programs to be operating out of a dictionary.” My mind on deponents and attorneys, I think of drug culture, how dealers and buyers develop their own micro-patois, and how if any of these idiosyncratic reference systems started to become too standardized—if they use the well-known “snow” for cocaine, for instance—their text-message records and email records become much more legally vulnerable (i.e., have less room for deniability) than if the dealers and buyers are, like poets, ceaselessly inventing.

pages: 353 words: 88,376

The Investopedia Guide to Wall Speak: The Terms You Need to Know to Talk Like Cramer, Think Like Soros, and Buy Like Buffett, edited by Jack Guinan


Some examples of defined-contribution plans are 401(k) plans, money-purchase pension plans, and profit-sharing plans. Related Terms: • Defined-Benefit Plan • Defined-Contribution Plan • Individual Retirement Account—IRA • Roth IRA • Tax Deferred

Quantitative Analysis

What Does Quantitative Analysis Mean? A business or financial analysis technique that is used to understand market behavior by employing complex mathematical and statistical modeling, measurement, and research. By assigning a numerical value to variables, quantitative analysts try to replicate reality in mathematical terms. Quantitative analysis helps measure performance evaluation or valuation of a financial instrument. It also can be used to predict real-world events such as changes in a share’s price. Investopedia explains Quantitative Analysis In broad terms, quantitative analysis is a way of measuring things.

pages: 338 words: 106,936

The Physics of Wall Street: A Brief History of Predicting the Unpredictable by James Owen Weatherall


“Complex Critical Exponents From Renormalization Group Theory of Earthquakes: Implications for Earthquake Predictions.” Journal de Physique I 5 (5): 607–19. Sornette, Didier, and Christian Vanneste. 1992. “Dynamics and Memory Effects in Rupture of Thermal Fuse.” Physical Review Letters 68: 612–15. — — — . 1994. “Dendrites and Fronts in a Model of Dynamical Rupture with Damage.” Physical Review E 50 (6, December): 4327–45. Sornette, D., C. Vanneste, and L. Knopoff. 1992. “Statistical Model of Earthquake Foreshocks.” Physical Review A 45: 8351–57. Sourd, Véronique, Le. 2008. “Hedge Fund Performance in 2007.” EDHEC Risk and Asset Management Research Centre. Spence, Joseph. 1820. Observations, Anecdotes, and Characters, of Books and Men. London: John Murray. Stewart, James B. 1992. Den of Thieves. New York: Simon & Schuster. Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty Before 1900.

pages: 312 words: 89,728

The End of My Addiction by Olivier Ameisen


., Hansen, H. J., Sunde, N. et al. (2002) Evidence of tolerance to baclofen in treatment of severe spasticity with intrathecal baclofen. Clinical Neurology and Neurosurgery 104, 142–145. Pelc, I., Ansoms, C., Lehert, P. et al. (2002) The European NEAT program: an integrated approach using acamprosate and psychosocial support for the prevention of relapse in alcohol-dependent patients with a statistical modeling of therapy success prediction. Alcoholism: Clinical and Experimental Research 26, 1529–1538. Roberts, D. C. and Andrews, M. M. (1997) Baclofen suppression of cocaine self-administration: demonstration using a discrete trials procedure. Psychopharmacology (Berlin) 131, 271–277. Shoaib, M., Swanner, L. S., Beyer, C. E. et al. (1998) The GABAB agonist baclofen modifies cocaine self-administration in rats.

pages: 561 words: 87,892

Losing Control: The Emerging Threats to Western Prosperity by Stephen D. King


WE’RE NOT ON OUR OWN In my twenty-five years as a professional economist, initially as a civil servant in Whitehall but, for the most part, as an employee of a major international bank, I’ve spent a good deal of time looking into the future. As the emerging nations first appeared on the economic radar screen, I began to realize I could talk about the future only by delving much further into the past. I wasn’t interested merely in the history incorporated into statistical models of the economy, a history which typically includes just a handful of years and therefore ignores almost all the interesting economic developments that have taken place over the last millennium. Instead, the history that mattered to me had to capture the long sweep of economic and political progress and all too frequent reversal. In recent years, as the emerging nations have taken their seats at the international table of powers and superpowers, economic and political history has become increasingly important.

pages: 322 words: 84,752

Pax Technica: How the Internet of Things May Set Us Free or Lock Us Up by Philip N. Howard


This makes it tough to learn from the causes and consequences of technology diffusion throughout history. Important events and recognizable causal connections can’t be replicated or falsified. We can’t repeat the Arab Spring in some kind of experiment. We can’t test its negation—an Arab Spring that never happened, or an Arab Spring minus one key factor that resulted in a different outcome. We don’t have enough large datasets about Arab Spring–like events to run statistical models. That doesn’t mean we shouldn’t try to learn from the real events that happened. In fact, for many in the social sciences, tracing how real events unfolded is the best way to understand political change. The richest explanations of the fall of the Berlin Wall, for example, as sociologist Steve Pfaff crafts them, come from such process tracing.2 We do, however, know enough to make some educated guesses about what will happen next.

pages: 322 words: 88,197

Wonderland: How Play Made the Modern World by Steven Johnson


It was the first time anyone had begun talking, mathematically at least, about what we now call life expectancy. Probability theory served as a kind of conceptual fossil fuel for the modern world. It gave rise to the modern insurance industry, which for the first time could calculate with some predictive power the claims it could expect when insuring individuals or industries. Capital markets—for good and for bad—rely extensively on elaborate statistical models that predict future risk. “The pundits and pollsters who today tell us who is likely to win the next election make direct use of mathematical techniques developed by Pascal and Fermat,” the mathematician Keith Devlin writes. “In modern medicine, future-predictive statistical methods are used all the time to compare the benefits of various drugs and treatments with their risks.” The astonishing safety record of modern aviation is in part indebted to the dice games Pascal and Fermat analyzed; today’s aircraft are statistical assemblages, with each part’s failure rate modeled to multiple decimal places.

pages: 364 words: 102,926

What the F: What Swearing Reveals About Our Language, Our Brains, and Ourselves by Benjamin K. Bergen


So if you believe that exposure to violence in media could be a confounding factor—it correlates with exposure to profanity and could explain some amount of aggression—then you measure not only how much profanity but also how much violence children are exposed to. The two will probably correlate, but the key point is that you can measure exactly how much media violence correlates with child aggressiveness, and you can pull that apart in a statistical model from the amount that profanity exposure correlates with child aggressiveness. The authors of the Pediatrics study tried to do this. But to know that profanity exposure per se and not any of these other possible confounding factors is responsible for increased reports of aggressiveness, you’d need to do the same thing not just for exposure to media violence, as the authors did, but for every other possible confounding factor, which they did not.

pages: 297 words: 91,141

Market Sense and Nonsense by Jack D. Schwager


If, as occurred in 2008, they need to liquidate at the same time because of a flight-to-safety psychology in the market, the huge imbalance between supply and demand can result in managers being forced to liquidate positions at deeply discounted prices. Statistical arbitrage. The premise underlying statistical arbitrage is that short-term imbalances in buy and sell orders cause temporary price distortions, which provide short-term trading opportunities. Statistical arbitrage is a mean-reversion strategy that seeks to sell excessive strength and buy excessive weakness based on statistical models that define when short-term price moves in individual equities are considered out of line relative to price moves in related equities. The origin of the strategy was a subset of statistical arbitrage called pairs trading. In pairs trading, the price ratios of closely related stocks are tracked (e.g., Ford and General Motors), and when the mathematical model indicates that one stock has gained too much versus the other (either by rising more or by declining less), it is sold and hedged by the purchase of the related equity in the pair.
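The pairs-trading rule described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular fund: the function name `pairs_signal`, the 20-observation window, and the z-score threshold of 2 are all assumptions chosen for the example, and the rule ignores transaction costs and position sizing.

```python
from statistics import mean, stdev


def pairs_signal(prices_a, prices_b, window=20, entry_z=2.0):
    """Compare the latest A/B price ratio with its recent history.

    Returns 'short_a_long_b' when A looks too strong relative to B,
    'long_a_short_b' when A looks too weak, and 'hold' otherwise.
    """
    if len(prices_a) != len(prices_b):
        raise ValueError("price series must have the same length")
    if window < 2 or len(prices_a) < window + 1:
        raise ValueError("need window >= 2 and at least window + 1 observations")

    ratios = [a / b for a, b in zip(prices_a, prices_b)]
    history = ratios[-(window + 1):-1]  # the window before the latest point
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return "hold"  # no variation in the history, so no basis for a signal

    z = (ratios[-1] - mu) / sigma  # how far the latest ratio is out of line
    if z > entry_z:
        return "short_a_long_b"
    if z < -entry_z:
        return "long_a_short_b"
    return "hold"
```

The z-score is one simple way to define "out of line"; it stands in for whatever statistical model a real strategy would use, and mean reversion of the ratio is assumed rather than tested.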

pages: 345 words: 92,849

Equal Is Unfair: America's Misguided Fight Against Income Inequality by Don Watkins, Yaron Brook


You will no doubt encounter many claims that you can’t easily evaluate: academic studies that say inequality undermines mobility or economic progress, claims about “the bulk of the gains” going to “the rich” rather than the middle class, stories about injustices supposedly committed by “the 1 percent” against “the 99 percent.” In these cases the question to ask is: “Assuming this is a problem, what is your solution?” Inevitably, the inequality critics’ answer will be that some form of force must be used to tear down the top by depriving them of the earned, and to prop up the bottom by giving them the unearned. But nothing can justify an injustice, nor can any statistical model erase the fact that all of the values human life requires are a product of the human mind, and that the human mind cannot function without freedom. Don’t concede that the inequality alarmists value equality. The egalitarians pose as defenders of equality. But there is no such thing as being for equality across the board: different types of equality conflict. Namely, economic equality (including equality of opportunity) is incompatible with political equality.

The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences by Rob Kitchin


The difference between the humanities and social sciences in this respect is because the statistics used in the digital humanities are largely descriptive – identifying patterns and plotting them as counts, graphs, and maps. In contrast, the computational social sciences employ the scientific method, complementing descriptive statistics with inferential statistics that seek to identify causality. In other words, they are underpinned by an epistemology wherein the aim is to produce sophisticated statistical models that explain, simulate and predict human life. This is much more difficult to reconcile with post-positivist approaches. The defence then rests on the utility and value of the method and models, not on providing complementary analysis of a more expansive set of data. There are alternatives to this position, such as that adopted within critical GIS (Geographic Information Science) and radical statistics, and those who utilise mixed-method approaches, that either employ models and inferential statistics while being mindful of their shortcomings, or more commonly only utilise descriptive statistics that are complemented with small data studies.

pages: 291 words: 90,200

Networks of Outrage and Hope: Social Movements in the Internet Age by Manuel Castells


He wrote: “Countries where civil society and journalism made active use of the new information technologies subsequently experience a radical democratic transition or significant solidification of their democratic institutions” (2011: 200). Particularly significant, before the Arab Spring, was the transformation of social involvement in Egypt and Bahrain with the help of ICT diffusion. In a stream of research conducted in 2011 and 2012 after the Arab uprisings, Howard and Hussain, using a series of quantitative and qualitative indicators, probed a multi-causal, statistical model of the processes and outcomes of the Arab uprisings by using fuzzy logic (Hussain and Howard 2012). They found that the extensive use of digital networks by a predominantly young population of demonstrators had a significant effect on the intensity and power of these movements, starting with a very active debate on social and political demands in the social media before the demonstrations’ onset.

pages: 403 words: 111,119

Doughnut Economics: Seven Ways to Think Like a 21st-Century Economist by Kate Raworth

Studying trend data for GDP side by side with data on local air and water pollution in around 40 countries, they found that pollution first rose then fell as GDP increased, tracing out the shape of an inverted-U when plotted on the page. Given its uncanny resemblance to that famous inequality curve of Chapter 5, this new one was soon known as the Environmental Kuznets Curve. The Environmental Kuznets Curve, which suggests that growth will eventually fix the environmental problems that it creates. Having discovered another apparent economic law of motion, the economists could not resist the urge to use statistical modelling in order to identify the level of income at which the curve magically turned. For lead contamination in rivers, they found, pollution peaked and started to fall when national income reached \$1,887 per person (measured in 1985 US dollars, the standard metric of the day). What about sulphur dioxide in the air? That appeared to fall when income hit \$4,053 per person. As for black smoke? Wait until GDP exceeds \$6,151 per capita and it will begin to clear.
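To see how a fitted curve yields a turning-point income, consider a worked sketch. It assumes, purely for illustration, that pollution $P$ is modelled as a quadratic in income per person $y$; the excerpt does not state the functional form the economists used, and the symbols $\beta_0, \beta_1, \beta_2$ are introduced here.

$$
P(y) = \beta_0 + \beta_1 y + \beta_2 y^2, \qquad \beta_1 > 0,\ \beta_2 < 0
$$

$$
\frac{dP}{dy} = \beta_1 + 2\beta_2 y = 0 \;\Longrightarrow\; y^{*} = -\frac{\beta_1}{2\beta_2}
$$

Under this assumption the inverted-U peaks at $y^{*}$, so estimating $\beta_1$ and $\beta_2$ from the data gives an estimate of the income level at which pollution starts to fall. The estimate is only as reliable as the assumed shape of the curve.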

pages: 337 words: 86,320

Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz

Crossley, “Validity of Responses to Survey Questions,” Public Opinion Quarterly 14, 1 (1950). 106 survey asked University of Maryland graduates: Frauke Kreuter, Stanley Presser, and Roger Tourangeau, “Social Desirability Bias in CATI, IVR, and Web Surveys,” Public Opinion Quarterly 72(5), 2008. 107 failure of the polls: For an article arguing that lying might be a problem in trying to predict support for Trump, see Thomas B. Edsall, “How Many People Support Trump but Don’t Want to Admit It?” New York Times, May 15, 2016, SR2. But for an argument that this was not a large factor, see Andrew Gelman, “Explanations for That Shocking 2% Shift,” Statistical Modeling, Causal Inference, and Social Science, November 9, 2016, http://andrewgelman.com/2016/11/09/explanations-shocking-2-shift/. 107 says Tourangeau: I interviewed Roger Tourangeau by phone on May 5, 2015. 107 so many people say they are above average: This is discussed in Adam Grant, Originals: How Non-Conformists Move the World (New York: Viking, 2016). The original source is David Dunning, Chip Heath, and Jerry M.

pages: 623 words: 448,848

Food Allergy: Adverse Reactions to Foods and Food Additives by Dean D. Metcalfe


Of course, the group of 29 subjects must be representative of the entire allergic population. Furthermore, this approach allows for the possibility that almost 10% of patients allergic to that food will react to ingestion of that dose and this possibility may be considered as too high. Modeling of collective data from several studies is probably the preferred approach to determine the population-based threshold, although the best statistical model to use remains to be determined [8]. typical servings of these foods. Thus, it is tempting to speculate that those individuals with very low individual threshold doses would be less likely to outgrow their food allergy or would require a longer time period for that to occur. In at least one study [25], individuals with histories of severe food allergies had significantly lower individual threshold doses.

As it stands, most food-allergic patients do not know their individual threshold dose because few allergy clinics make this assessment. The knowledge of individual threshold doses would allow physicians to offer more complete advice to food-allergic patients in terms of their comparative vulnerability to hidden residues of allergenic foods. The clinical determination of large numbers of individual threshold doses would allow estimates of population-based thresholds using appropriate statistical modeling approaches. The food industry and regulatory agencies could also make effective use of information on population-based threshold doses to establish improved labeling regulations and practices and allergen control programs. References 1 Gern JE, Yang E, Evrard HM, et al. Allergic reactions to milk-contaminated “non-dairy” products. N Engl J Med 1991;324:976–9. 2 Yman IM. Detection of inadequate labeling and contamination as causes of allergic reactions to foods.

Allergy 2005;60:865–70. 74 Bindslev-Jensen C. Standardization of double-blind, placebo-controlled food challenges. Allergy 2001;56:75–7. 75 Caffarelli C, Petroccione T. False-negative food challenges in children with suspected food allergy. Lancet 2001;358:1871–2. 76 Sampson HA. Use of food-challenge tests in children. Lancet 2001; 358:1832–3. 77 Briggs D, Aspinall L, Dickens A, Bindslev-Jensen C. Statistical model for assessing the proportion of subjects with subjective sensitisations in adverse reactions to foods. Allergy 2001; 56:83–5. 78 Chinchilli VM, Fisher L, Craig TJ. Statistical issues in clinical trials that involve the double-blind, placebo-controlled food challenge. J Allergy Clin Immunol 2005;115:592–7. 21 CHAPTER 21 IgE Tests: In Vitro Diagnosis Kirsten Beyer KEY CONCEPTS • The presence of food allergen-specific IgE determines the sensitization to a specific food.

pages: 945 words: 292,893

Seveneves by Neal Stephenson

Would it stay together as a compact swarm or spread out? Or would it split up into two or more distinct swarms that would try different things? Arguments could be made for all of the above scenarios and many more, depending on what actually happened in the Hard Rain. Since the Earth had never before been bombarded by a vast barrage of lunar fragments, there was no way to predict what it was going to be like. Statistical models had been occupying much of Doob’s time because they had a big influence on which scenarios might be most worth preparing for. To take a simplistic example, if the moon could be relied on to disassemble itself into pea-sized rocks, then the best strategy was to remain in place and not worry too much about maneuvering. It was hard to detect a pea-sized bolide until it was pretty close, by which time it was probably too late to take evasive action.

They could clearly make out Cleft’s radar signature, as well as those of many other big rocks that traveled in its vicinity. A clutter of faint noise and clouds on the optical telescope gave them data about the density of objects too small and numerous to resolve. All of it fed into the plan. Doob looked tired, and nodded off frequently, and hadn’t eaten a square meal since the last perigee, but he pulled himself together when he was needed and fed any new information into a statistical model, prepared long in advance, that would enable them to maximize their chances by ditching Amalthea and doing the big final burn at just the right times. But as he kept warning Ivy and Zeke, the time was coming soon when they would become so embroiled in the particulars of which rock was coming from which direction that it wouldn’t be a statistical exercise anymore. It would be a video game, and its objective would be to build up speed while merging into a stream of large and small rocks that would be overtaking them with the speed of artillery shells.

pages: 473 words: 154,182

In June, 690 miles southwest of Sitka, for the first time since the spill, dramatic complications occur. As it collides with the continental shelf and then with the freshwater gushing out of the rainforests of the coastal mountains, and then with the coast, the North Pacific Drift loses its coherence, crazies, sends out fractal meanders and eddies and tendrils that tease the four voyagers apart. We don’t know for certain what happens next, but statistical models suggest that at least one of the four voyagers I’m imagining—the frog, let’s pretend—will turn south, carried by an eddy or a meander into the California Current, which will likely deliver it, after many months, into the North Pacific Subtropical Gyre. You may now forget about the frog. We already know its story—how, as it disintegrates, it will contribute a few tablespoons of plastic to the Garbage Patch, or to Hawaii’s Plastic Beach, or to the dinner of an albatross, or to a sample collected in the codpiece of Charlie Moore’s manta trawl.

pages: 755 words: 121,290

Statistics hacks by Bruce Frey

He is proudest of two accomplishments: his marriage to his sweet wife, and his purchase of a low-grade copy of Showcase #4, a comic book wherein the "Silver Age Flash first appears," whatever that means. Contributors The following people contributed their hacks, writing, and inspiration to this book: Joseph Adler is the author of Baseball Hacks (O'Reilly), and a researcher in the Advanced Product Development Group at VeriSign, focusing on problems in user authentication, managed security services, and RFID security. Joe has years of experience analyzing data, building statistical models, and formulating business strategies as an employee and consultant for companies including DoubleClick, American Express, and Dun & Bradstreet. He is a graduate of the Massachusetts Institute of Technology with an Sc.B. and an M.Eng. in computer science and computer engineering. Joe is an unapologetic Yankees fan, but he appreciates any good baseball game. Joe lives in Silicon Valley with his wife, two cats, and a DirecTV satellite dish.

pages: 484 words: 136,735

Capitalism 4.0: The Birth of a New Economy in the Aftermath of Crisis by Anatole Kaletsky

Mandelbrot’s research program undermined most of the mathematical assumptions of modern portfolio theory, which is the basis for the conventional risk models used by regulators, credit-rating agencies, and unsophisticated financial institutions. Mandelbrot’s analysis, presented to nonspecialist readers in his 2004 book (Mis)behavior of Markets, shows with mathematical certainty that these standard statistical models based on neoclassical definitions of efficient markets and rational expectations among investors cannot be true. Had these models been valid, events such as the 1987 stock market crash and the bankruptcy of the 1998 hedge fund crisis would not have occurred even once in the fifteen billion years since the creation of the universe.[9] In fact, four such extreme events occurred in just two weeks after the Lehman bankruptcy.

pages: 561 words: 120,899

Cochran WG, Mosteller F, Tukey JW. (1954) Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. American Statistical Association. Converse, Jean M. (1987) Survey Research in the United States: Roots and Emergence 1890–1960. University of California Press. Fienberg SE, Hoaglin DC, eds. (2006) Selected Papers of Frederick Mosteller. Springer. Fienberg SE et al., eds. (1990) A Statistical Model: Frederick Mosteller’s Contributions to Statistics, Science and Public Policy. Springer-Verlag. Hedley-Whyte J. (2007) Frederick Mosteller (1916–2006): Mentoring, A Memoir. International Journal of Technology Assessment in Health Care (23) 152–54. Ingelfinger, Joseph, et al. (1987) Biostatistics in Clinical Medicine. Macmillan. Jones, James H. (1997) Alfred C. Kinsey: A Public/Private Life.

pages: 402 words: 110,972

Nerds on Wall Street: Math, Machines and Wired Markets by David J. Leinweber

Shaw went on to found D.E. Shaw & Company, one of the largest and most consistently successful quantitative hedge funds. Fischer Black’s Quantitative Strategies Group at Goldman Sachs were algo pioneers. They were perhaps the first to use computers for actual trading, as well as for identifying trades. The early alpha seekers were the first combatants in the algo wars. Pairs trading, popular at the time, relied on statistical models. Finding stronger short-term correlations than the next guy had big rewards. Escalation beyond pairs to groups of related securities was inevitable. Parallel developments in futures markets opened the door to electronic index arbitrage trading. Automated market making was a valuable early algorithm. In quiet, normal markets buying low and selling high across the spread was easy money.

Beginning R: The Statistical Programming Language by Mark Gardener

What steps will you need to carry out to conduct an ANOVA? 5. The bats data yielded a significant interaction term in the two-way ANOVA. Look at this further. Make a graphic of the data and then follow up with a post-hoc analysis. Draw a graph of the interaction.

What You Learned in This Chapter

| Topic | Key Points |
| --- | --- |
| Formula syntax: response ~ predictor | The formula syntax enables you to specify complex statistical models. Usually the response variables go on the left and predictor variables go on the right. The syntax can also be used in more simple situations and for graphics. |
| Stacking samples: stack() | In more complex analyses, the data need to be in a layout where each column is a separate item; that is, a column for the response variable and a column for each predictor variable. The stack() command can rearrange data into this layout. |
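The stacking step described above can be illustrated outside R as well. The following is a minimal Python sketch of the same rearrangement, not the book's code: it turns several separate samples into one row per observation, with a response column and a predictor column. The column names `values` and `ind` mirror what R's stack() produces; the sample names and numbers are made up for illustration.

```python
def stack(samples):
    """Rearrange {sample_name: [observations]} into long-format rows.

    Each output row holds one observation ("values") together with the
    name of the sample it came from ("ind").
    """
    rows = []
    for name, observations in samples.items():
        for value in observations:
            rows.append({"values": value, "ind": name})
    return rows


# Hypothetical data: three samples of unequal length.
wide = {"grass": [3, 4, 5], "heath": [6, 7], "arable": [2]}
long_rows = stack(wide)

for row in long_rows:
    print(row["values"], row["ind"])
```

In this layout the response (`values`) and the predictor (`ind`) each occupy their own column, which is the shape a formula such as response ~ predictor expects.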

pages: 349 words: 134,041