Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It is closely related to data mining and big data, sharing their aim to "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems".
Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
In 2012, when Harvard Business Review called it "The Sexiest Job of the 21st Century", the term "data science" became a buzzword. It is now often used interchangeably with earlier concepts like business analytics, business intelligence, predictive modeling, and statistics. Even the suggestion that data science is sexy paraphrased Hans Rosling, who was featured in a 2011 BBC documentary with the quote, "Statistics is now the sexiest subject around." Nate Silver referred to data science as a sexed-up term for statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to be more attractive, which can cause the term to become "dilute[d] beyond usefulness." While many university programs now offer a data science degree, there is no consensus on a definition or on suitable curriculum contents. Moreover, many data-science and big-data projects fail to deliver useful results, often because of poor management and utilization of resources.
The term "data science" has appeared in various contexts over several decades but did not become an established term until recently. In an early usage, Peter Naur used it as a substitute for computer science in 1960. Naur later introduced the term "datalogy". In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods used in a wide range of applications.
In 1996, members of the International Federation of Classification Societies (IFCS) met in Kobe for their biennial conference. Here, for the first time, the term data science was included in the title of the conference ("Data Science, classification, and related methods"), after the term was introduced in a roundtable discussion by Chikio Hayashi.
In November 1997, C.F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?" for his appointment to the H. C. Carver Professorship at the University of Michigan. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer-science usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists. Later, he presented the same lecture as the first of his 1998 P.C. Mahalanobis Memorial Lectures. These lectures honor Prasanta Chandra Mahalanobis, an Indian scientist and statistician and founder of the Indian Statistical Institute.
In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data" in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," which was published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique. In his report, Cleveland established six technical areas which he believed to encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.
In April 2002, the International Council for Science (ICSU): Committee on Data for Science and Technology (CODATA) started the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues. Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science, which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research. In 2005, the National Science Board published "Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century", defining data scientists as "the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection" whose primary activity is to "conduct creative inquiry and analysis."
Around 2007, Turing Award winner Jim Gray envisioned "data-driven science" as a "fourth paradigm" of science that uses the computational analysis of large data as its primary scientific method, with the goal "to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other."
In the 2012 Harvard Business Review article "Data Scientist: The Sexiest Job of the 21st Century", DJ Patil claims to have coined the term "data scientist" in 2008 with Jeff Hammerbacher to define their jobs at LinkedIn and Facebook, respectively. He asserts that a data scientist is "a new breed", and that a "shortage of data scientists is becoming a serious constraint in some sectors", but the role he describes is much more business-oriented.
In 2013, the IEEE Task Force on Data Science and Advanced Analytics was launched. In the same year, the first "European Conference on Data Analysis (ECDA)" was organised in Luxembourg, establishing the European Association for Data Science (EuADS). The first international conference, the IEEE International Conference on Data Science and Advanced Analytics, was launched in 2014. Also in 2014, General Assembly launched a student-paid bootcamp and The Data Incubator launched a competitive free data science fellowship. In 2014, the American Statistical Association section on Statistical Learning and Data Mining renamed its journal to "Statistical Analysis and Data Mining: The ASA Data Science Journal" and in 2016 changed its section name to "Statistical Learning and Data Science". In 2015, the International Journal on Data Science and Analytics was launched by Springer to publish original work on data science and big data analytics. In September 2015, the Gesellschaft für Klassifikation (GfKl) added "Data Science Society" to the name of the Society at the third ECDA conference at the University of Essex, Colchester, UK.
"Data science" has recently become a popular term among business executives. However, many critical academics and journalists see no distinction between data science and statistics, whereas others consider it largely a popular term for "data mining" and "big data". Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced "business analytics" in contexts such as graduate degree programs. In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, "I think data-scientist is a sexed up term for a statistician... Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn't berate the term statistician." Similarly, in the business sector, multiple researchers and analysts state that data scientists alone are far from sufficient to give companies a real competitive advantage, and consider data scientists to be only one of the four greater job families that companies require to leverage big data effectively, namely: data analysts, data scientists, big data developers and big data engineers.
On the other hand, responses to criticism are just as numerous. In a 2014 Wall Street Journal article, Irving Wladawsky-Berger compares the enthusiasm for data science with the dawn of computer science. He argues that data science, like any other interdisciplinary field, employs methodologies and practices from across academia and industry, but will then morph them into a new discipline. He draws attention to the sharp criticisms that computer science, now a well-respected academic discipline, once had to face. Likewise, NYU Stern's Vasant Dhar, like many other academic proponents of data science, argued more specifically in December 2013 that data science is different from the existing practice of data analysis across all disciplines, which focuses only on explaining data sets. Data science seeks actionable and consistent patterns for predictive uses. This practical engineering goal takes data science beyond traditional analytics. Now the data in those disciplines and applied fields that lacked solid theories, like health science and social science, could be sought and utilized to generate powerful predictive models.
In an effort similar to Dhar's, Stanford professor David Donoho, in September 2015, took the proposition further by rejecting three simplistic and misleading definitions of data science in response to criticisms. First, for Donoho, data science does not equate to big data, in that the size of the data set is not a criterion to distinguish data science from statistics. Second, data science is not defined by the computing skills of sorting big data sets, in that these skills are already generally used for analyses across all disciplines. Third, data science is a heavily applied field where academic programs do not yet sufficiently prepare data scientists for their jobs, in that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program. As a statistician, Donoho, following many in his field, champions the broadening of learning scope in the form of data science, like John Chambers, who urges statisticians to adopt an inclusive concept of learning from data, or like William Cleveland, who urges prioritizing the extraction of applicable predictive tools from data over explanatory theories. Together, these statisticians envision an increasingly inclusive applied field that grows out of traditional statistics and beyond.
For the future of data science, Donoho projects an ever-growing environment for open science where data sets used for academic publications are accessible to all researchers. The US National Institutes of Health has already announced plans to enhance reproducibility and transparency of research data. Other big journals are likewise following suit. In this way, the future of data science will not only exceed the boundary of statistical theories in scale and methodology, but will also revolutionize current academia and research paradigms. As Donoho concludes, "the scope and impact of data science will continue to expand enormously in coming decades as scientific data and data about science itself become ubiquitously available."
Alan Turing Institute
The Alan Turing Institute is the United Kingdom's national institute for data science and artificial intelligence, founded in 2015. It is named after Alan Turing, the British mathematician and computing pioneer.
Anaconda (Python distribution)
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system conda.
The Anaconda distribution is used by over 13 million users and includes more than 1400 popular data-science packages suitable for Windows, Linux, and macOS.
Annual Reviews (publisher)
Annual Reviews, located in Palo Alto, California, is a nonprofit publisher dedicated to synthesizing and integrating knowledge for the progress of science and the benefit of society. It has a collection of 46 review series in specific disciplines in science and social science. Each review series contains 12 to 40 authoritative comprehensive review articles, covering the major journal articles on a specific topic during the preceding few years. The major topics in each subject are covered every few years, and special topics appear as appropriate.
The reviews are widely used in teaching and research, and serve the purposes both of current awareness and of introduction to a new subject. Because it is customary in scientific literature to cite in detail only the sources published since the most recent review, these periodicals are among the highest-ranking journals by impact factor for their subjects, as shown in the publisher's table. (This does not imply that they are necessarily the most important journals in the subject; review series always rank highly because of the relatively few articles published each year and the many articles that cite them.) The reviews are written in a compact narrative style, with a minimum of descriptive text for each article covered. Many authors provide lists of summary points and future issues. The length of each review and the number of articles covered vary widely depending on both the topic and the preferences of the author. The articles are written by invitation to the authors, who are accepted authorities on the material covered.
Cloudera
Cloudera, Inc. is a US-based software company that provides a software platform for data engineering, data warehousing, machine learning and analytics that runs in the cloud or on premises.
Cloudera started as a hybrid open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), that targeted enterprise-class deployments of that technology. Cloudera states that more than 50% of its engineering output is donated upstream to the various Apache-licensed open source projects (Apache Spark, Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Apache Hadoop platform. Cloudera is also a sponsor of the Apache Software Foundation.
Coursera
Coursera is an online learning platform founded by Stanford professors Andrew Ng and Daphne Koller that offers courses, specializations, and degrees.
Coursera works with universities and other organizations to offer online courses, specializations, and degrees in a variety of subjects, such as engineering, humanities, medicine, biology, social sciences, mathematics, business, computer science, digital marketing, data science, and others.
As of June 2018, Coursera had more than 33 million registered users and more than 2,400 courses.
Data Science Institute
The Data Science Institute is a research institute at Imperial College London founded in May 2014. The institute is one of five Global Institutes at Imperial College London, alongside the Institute of Global Health Innovation, Energy Futures Lab, Institute for Security Science and Technology, and the Grantham Institute - Climate Change and Environment. The Data Science Institute has partnerships with international industry and academia, with formal investments from Chinese multinational telecoms company Huawei, multinational consultancy KPMG, and Zhejiang University, China. The goal of the institute is to enhance multidisciplinary data science research across the whole of Imperial College by coordinating and promoting data-driven research and education activities. These activities cover all areas across the College including engineering, medicine, natural sciences, and business.
The institute houses a custom-built, large-scale immersive data visualization facility called the KPMG Data Observatory. With a resolution of 132 megapixels, it is thought to be the largest such system in Europe.
Data visualization
Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data. To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables.
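As a rough illustration of the distinction between tables and charts described above, the following minimal Python sketch renders the same small data set both ways using only the standard library. The monthly figures and the function names `render_table` and `render_bar_chart` are hypothetical, invented purely for this example.

```python
# Hypothetical monthly figures, invented purely for illustration.
SALES = {"Jan": 12, "Feb": 18, "Mar": 9, "Apr": 24}


def render_table(data):
    """Return a plain-text table, suited to looking up one exact value."""
    width = max(len(label) for label in data)
    return "\n".join(
        f"{label:<{width}}  {value:>4}" for label, value in data.items()
    )


def render_bar_chart(data, max_width=24):
    """Return a text bar chart; bar length encodes each value's magnitude."""
    largest = max(data.values())  # assumes at least one positive value
    width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        # Scale so that the largest value spans max_width characters.
        bar = "#" * round(value / largest * max_width)
        lines.append(f"{label:<{width}}  {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(SALES))
    print()
    print(render_bar_chart(SALES))
```

The table makes it easy to read off that February's value is exactly 18, while the bars make the relative pattern (a dip in March, a peak in April) visible at a glance.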
Data visualization is both an art and a science. It is viewed as a branch of descriptive statistics by some, but also as a grounded theory development tool by others. Increased amounts of data created by Internet activity and an expanding number of sensors in the environment are referred to as "big data" or Internet of things. Processing, analyzing and communicating this data present ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists help address this challenge.
ENSAE ParisTech
ENSAE ParisTech (officially École nationale de la statistique et de l'administration économique) is one of the most prestigious French Grandes écoles of engineering and a member of ParisTech (Paris Institute of Technology). ENSAE ParisTech is known as the branch school of École Polytechnique for statistics, data science and machine learning.
It is one of France's top schools of economics, statistics, data science and machine learning and is directly attached to France's Institut national de la statistique et des études économiques (INSEE) and the French Ministry of Economy and Finance.
Students receive training in both economics and statistics, and can specialize in macroeconomics, microeconomics, statistics or finance.
The ENSAE can also prepare its students for the French actuarial qualification (Institut des Actuaires).
European Physical Journal
The European Physical Journal (or EPJ) is a joint publication of EDP Sciences, Springer Science+Business Media, and the Società Italiana di Fisica. It arose in 1998 as a merger and continuation of Acta Physica Hungarica, Anales de Física, Czechoslovak Journal of Physics, Il Nuovo Cimento, Journal de Physique, Portugaliae Physica and Zeitschrift für Physik. The journal is published in various sections, covering all areas of physics.
Gordon and Betty Moore Foundation
The Gordon and Betty Moore Foundation is an American foundation established by Intel co-founder Gordon E. Moore and his wife Betty I. Moore in September 2000 to support scientific discovery, environmental conservation, patient care improvements and preservation of the character of the Bay Area.
As outlined in the Statement of Founder's Intent, the foundation's aim is to tackle large, important issues at a scale where it can achieve significant and measurable impacts.
Mathematical sciences
The mathematical sciences are a group of areas of study that includes, in addition to mathematics, those academic disciplines that are primarily mathematical in nature but may not be universally considered subfields of mathematics proper.
Statistics, for example, is mathematical in its methods but grew out of scientific observations which merged with inverse probability and grew through applications in the social sciences, some areas of physics and biometrics to become its own separate, though closely allied field. Computer science, computational science, data science, population genetics, operations research, control theory, cryptology, econometrics, theoretical physics, fluid mechanics, chemical reaction network theory and actuarial science are other fields that may be considered part of mathematical sciences.
Some institutions offer degrees in mathematical sciences (e.g. the United States Military Academy, Stanford University, and University of Khartoum) or applied mathematical sciences (e.g. the University of Rhode Island).
Microsoft Certified Professional
The Microsoft Certified Professional or MCP Program is the certification program from Microsoft that enables IT professionals and developers to validate their technical expertise through rigorous, industry-proven, and industry-recognized exams. The certification exams cover a broad range of technologies throughout the Microsoft ecosystem. An individual who passes a certification exam and earns a Microsoft certification is recognized as a Microsoft Certified Professional (MCP). By passing multiple exams, they have the opportunity to earn larger, more distinguished certifications, such as the MCSE and MCSD certifications.
In 2016, Microsoft expanded with the launch of its Microsoft Professional Program, a fully online certification program in partnership with edX which includes various tracks in data science, front end web development, cloud computing, and DevOps. The program expanded to a total of eight tracks in 2018 with the addition of its artificial intelligence and software development certifications in April 2018. Many of these programs focus on equipping learners with up-to-date skills in various Microsoft tools, including Excel, Power BI, Visual Studio, and Azure.
The Microsoft Certified Professional (MCP) certification is no longer available. All certifications associated with the Microsoft Certification Program are listed on the Microsoft Technical Certifications page.
Oxford Internet Institute
The Oxford Internet Institute (OII) is a multi-disciplinary department of social and computer science dedicated to the study of information, communication, and technology, and is part of the Social Sciences Division of the University of Oxford, England. It is housed over three sites on St Giles in Oxford, including a primary site at 1 St Giles, owned by Balliol College. The department undertakes research and teaching devoted to understanding life online, with the aim of shaping Internet research, policy, and practice.
Founded in 2001, the OII has tracked the Internet's development and use, aiming to shed light on individual, collective and institutional behaviour online. The department brings together academics from a wide range of disciplines including political science, sociology, geography, economics, philosophy, physics and psychology.
Professor William H. Dutton served as Director of the OII from 2001 to 2011. Professor Helen Margetts occupied the Directorship between 2011 and 2018. The current director is Professor Philip N. Howard.
Philip Bourne
Philip Eric Bourne (born 1953) is a United States-based researcher in health informatics, non-fiction writer, and entrepreneur. He is currently Stephenson Chair of Data Science, Director of the Data Science Institute, and Professor of Biomedical Engineering. He was the first Associate Director for Data Science at the National Institutes of Health, where his projects included managing the Big Data to Knowledge initiative, and was formerly Associate Vice Chancellor at UCSD. He has contributed to textbooks and is a strong supporter of open-access literature and software. His diverse interests have spanned structural biology, medical informatics, information technology, structural bioinformatics, scholarly communication and pharmaceutical sciences. His papers are highly cited, and he has an h-index above 50.
RStudio
RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio was founded by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio. RStudio is available in two editions: RStudio Desktop, where the program is run locally as a regular desktop application; and RStudio Server, which allows accessing RStudio using a web browser while it is running on a remote Linux server. Prepackaged distributions of RStudio Desktop are available for Windows, macOS, and Linux.
RapidMiner
RapidMiner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and supports all steps of the machine learning process including data preparation, results visualization, model validation and optimization. RapidMiner is developed on an open core model. The RapidMiner Studio Free Edition, which is limited to 1 logical processor and 10,000 data rows, is available under the AGPL license. Commercial pricing starts at $2,500 and is available from the developer.
Society for Industrial and Applied Mathematics
The Society for Industrial and Applied Mathematics (SIAM) is an academic association dedicated to the use of mathematics in industry. SIAM is the world's largest professional association devoted to applied mathematics, and roughly two-thirds of its membership resides within the United States. Founded in 1951, the organization began holding annual national meetings in 1954, and now hosts conferences, publishes books and scholarly journals, and engages in lobbying on issues of interest to its membership. The focus for the society is applied, computational, and industrial mathematics, and the society often promotes its acronym as "Science and Industry Advance with Mathematics". Members include engineers, scientists, and mathematicians, both those employed in academia and those working in industry. The society supports educational institutions promoting applied mathematics.
SIAM is one of the four member organizations of the Joint Policy Board for Mathematics.
World Programming System
The World Programming System, also known as WPS Analytics or WPS, is a software product developed by a company called World Programming.
WPS Analytics supports users of mixed ability in accessing and processing data and in performing data science tasks. It has interactive visual programming tools using data workflows, and it has coding tools supporting the use of the SAS language mixed with Python, R and SQL.