
What is Data?


Information defined as data [1]


"We are not immobile, at the centre of the universe (Copernicus), we are not unnaturally detached and diverse from the rest of the animal world (Darwin) and we are not Cartesian subjects entirely transparent to ourselves (Freud). We are currently coming to terms with the possibility that we might not be disconnected and standalone material entities, but rather informational organisms, not unlike other biological agents and engineered artefacts, with which we share a global environment ultimately made of information, the infosphere (Turing). This is the fourth revolution."

Luciano Floridi [2]


Data, meaning, and truth are all tied together in the various concepts and explanations of the term 'information.' A special form of information, information that has meaning and is truthful, is called semantic information. It is defined in terms of data by the so-called General Definition of Information (GDI):

σ is an instance of information, understood as semantic content, if and only if: [1]

  • σ consists of one or more data;

  • the data in σ are well-formed;

  • the well-formed data in σ are meaningful.

Data is defined by difference (diaphora), the Diaphoric Definition of Data (DDD):

A datum is a putative fact regarding some difference or lack of uniformity within some context. [1]


Data is relational, but the GDI doesn't care what type of relation is involved. There are different types of data. There is secondary data, such as the unpainted parts of a canvas or the spaces between the notes of a piece of music; metadata, data about data; and operational data, data about the operation of the data system. A very important type is derived data, data created from other data. [1] [3] [4]


In 1948, Claude Shannon [5] specified what is now called information theory, or more precisely, a Mathematical Theory of Communication (MTC). Data is an abstract concept, a relationship between things. For this relationship to be communicated it has to be transformed into a code, and the code has to pass between two physical entities through a physical substrate, a channel. MTC spells out the constraints and limitations of data communication and quantifies them in a value called entropy. MTC is neutral as to meaning and truth: garbage in, garbage out; the channel doesn't care about semantic content. [1]
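To make entropy a little more concrete, here is a minimal sketch in Python that computes the entropy of a string from its empirical symbol frequencies. The function name and the toy strings are my own; Shannon's theory is, of course, far more general.

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Average information per symbol, in bits, using the message's
    empirical symbol frequencies: H = -sum(p * log2(p))."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive string carries fewer bits per symbol than a varied one,
# regardless of what (if anything) the symbols mean.
print(shannon_entropy("abababab"))  # 1.0 bit per symbol
print(shannon_entropy("aaaaaaab"))  # ~0.54 bits per symbol
print(shannon_entropy("abcdefgh"))  # 3.0 bits per symbol
```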


According to the GDI, to be semantic, data have to be both well-formed and meaningful. If data are well-formed, some process can make them meaningful. This process is cognition or, more generally, computation. Like Shannon, Turing [6] abstracted a physical process and laid out both its constraints and its limitations. The Turing machine turns ordered data in the form of a coded string into another coded string, meaningful information. It does this by way of a third coded string, the program, which is an algorithm, a set of logical steps, written out as code.
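Here is a toy version of that picture in Python: the tape is one coded string, the transition table is the program, and the output is another coded string. It is only a sketch under simplifying assumptions (a single tape that grows only to the right), and all the names are my own.

```python
def run_turing_machine(tape, rules, state="start", blank="_"):
    """A one-tape machine: `rules` maps (state, symbol) to
    (symbol_to_write, move, next_state); it halts on state 'halt'.
    This sketch only grows the tape to the right."""
    cells = list(tape)
    head = 0
    while state != "halt":
        symbol = cells[head] if head < len(cells) else blank
        write, move, state = rules[(state, symbol)]
        if head < len(cells):
            cells[head] = write
        else:
            cells.append(write)
        head += 1 if move == "R" else -1
    return "".join(cells).rstrip(blank)

# The "program": walk right, flipping 0 <-> 1, halt at the first blank.
flip_bits = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run_turing_machine("010011", flip_bits))  # -> 101100
```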


In the 1960s, three independent researchers, Solomonoff (1964) [7][8], Kolmogorov (1965) [9], and Chaitin (1966) [10], developed what is now called Algorithmic Information Theory (AIT). Central to it is the idea of Kolmogorov complexity: the information content of a coded string is the length of the shortest program that can reproduce it, so the more a string can be compressed, the less information it contains. [11][12] An example of this is something anyone using computers sees every day: JPEG compression. Modern computers use a color space of 2^24, about 16.7 million, colors. Human beings cannot distinguish all of these colors, especially when two very similar colors sit next to each other. JPEG compression discards the fine color detail that the eye barely notices. This can greatly reduce the storage size of an image while doing little to change its content; in most cases the difference is unnoticeable. This type of compression is called lossy compression, because there is no way to go from the compressed image back to the original. The amount of compression is controlled by a quality parameter, and the best value differs from image to image. In reality, most people use a default that works for 99% of all images but does not represent maximum compression.

Cat picture with JPEG compression; compression decreases from maximum at left to minimum at right [13]
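Kolmogorov complexity itself is not computable, but any general-purpose lossless compressor gives an upper bound on it, which is the usual practical stand-in. A minimal sketch using Python's built-in zlib module (the function name and the toy inputs are mine):

```python
import os
import zlib

def compressed_size(data: bytes, level: int = 9) -> int:
    """A crude upper bound on a string's information content:
    its size in bytes after DEFLATE compression."""
    return len(zlib.compress(data, level))

structured = b"ab" * 5000        # highly regular: a short program prints it
random_ish = os.urandom(10000)   # no pattern for the compressor to exploit

print(compressed_size(structured))  # a few dozen bytes
print(compressed_size(random_ish))  # ~10,000 bytes: effectively incompressible
```

Unlike JPEG, DEFLATE is lossless, but the intuition is the same: it is the regularity in the data that compresses.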


AIT encompasses many of the ideas that prevail today. One is that the data from a process can be used to predict the future of that process. This is a holy grail in the study of human behavior, with a weird mix of researchers, large corporations, and authoritarian governments gathered around the concept of 'big data.' Prediction models suggest purely computational research and computational proofs, but this has not yet happened, and I have always found commercial prediction models kind of lame. AIT has shown connections with Shannon information and thermodynamic entropy, has produced several different new definitions of randomness, and could be considered the basis for a theory of learning. [14][15][16] One important idea from AIT is the Minimum Description Length (MDL) principle, a mathematical version of Occam's Razor: the simplest explanation that accounts for the data is the best one. [17] It allows a data set to be partitioned into a meaningful, structured part and a random part.
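To show what partitioning data into a meaningful part and a random part can look like, here is a crude two-part MDL sketch in Python: the total description length is the bits needed to state a polynomial model plus the bits needed to encode the residuals the model cannot explain, and the degree with the shortest total wins. The 32-bits-per-coefficient cost and the Gaussian residual code are simplifying assumptions of mine, not anything prescribed by the MDL literature.

```python
import numpy as np

def mdl_score(x, y, degree, bits_per_param=32):
    """Crude two-part code length: bits to state the polynomial's
    coefficients plus bits to encode the residuals it cannot explain
    (Gaussian code, up to a constant shared by all models)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    mse = np.mean(residuals ** 2) + 1e-12          # guard against log(0)
    model_bits = (degree + 1) * bits_per_param
    data_bits = 0.5 * len(x) * np.log2(2 * np.pi * np.e * mse)
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 2 * x**2 - x + rng.normal(scale=1.0, size=x.size)  # quadratic signal + noise

scores = {d: mdl_score(x, y, d) for d in range(7)}
print(min(scores, key=scores.get))  # degree 2 usually wins
```

The fitted polynomial is the structured, 'meaningful' part; the leftover noise is the part that will not compress any further.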


There are two other aspects of information that I'd like to touch on. One is the weird 900-pound gorilla of quantum information. It is beyond the scope of this essay and unique in its strangeness; it also challenges our ideas of what we consider reality. The other is what Floridi in his outline calls 'unintentional misinformation.' This seems to be most of what humans deal with, what we call ignorance. I would add a third category that doesn't fit cleanly into intentional or unintentional misinformation: the human imagination, fiction, fantasy, speculation, and scenario. How does this fit into the process from data to meaningful information? It is information, and it seems to me essential to the formation of knowledge, yet it doesn't sit comfortably under unintentional misinformation.


For now, I want to leave semantic information and look at what well-ordered data entails. This idea takes up the most practical and important aspect of what today we call 'data science.' In 1970, when John Tukey first started writing about exploratory data analysis [18], computer processors were becoming cheap enough to start a hobbyist movement of home-built computers based out of Popular Electronics magazine. In March of 1975 the first meeting of the Homebrew Computer Club took place in Menlo Park, California, and in 1977 the first three popular personal computers were released: the TRS-80, the Commodore PET 2001, and the Apple II. [19] In 1970, what were called hierarchical or network databases already existed. Relational databases were first proposed in 1970 by Edgar F. Codd. [20] Another type, the navigational database, was described in 1973 by Charles Bachman. [21] Relational databases became the most flexible way to handle data, particularly through a technique called the transaction, the basis of banking and e-commerce (sketched in code after this paragraph). Relational databases still dominate, but during the early 2000s it became clear that they couldn't handle certain forms of data: object-oriented databases were developed, and researchers went back to old ideas to build specialized databases that in 2009 were collectively named NoSQL (non-relational) databases. [22] Spreadsheets, a way of displaying data in tabular form, have been around since the start of the 20th century, and several spreadsheet programs and spreadsheet languages for doing calculations on tabular data had been developed for mainframe computers by 1970. When the Apple II was released in 1977, Steve Wozniak included a simple spreadsheet that he wrote, one of the first open-source programs. In 1979, VisiCalc, developed by Dan Bricklin and Bob Frankston, was released for the Apple II. VisiCalc became popular with realtors and was a major factor in Apple's early success. [23]
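The transaction mentioned above is easy to demonstrate with Python's built-in sqlite3 module: either every step of a transfer happens or none of it does. This is a toy sketch, with invented tables and balances, not how a bank actually implements it.

```python
import sqlite3

# A toy "bank": a transfer either fully happens or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            if conn.execute("SELECT balance FROM accounts WHERE name = ?",
                            (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the rollback already happened; balances are untouched

transfer(conn, "alice", "bob", 250)  # fails: rolled back
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]

transfer(conn, "alice", "bob", 40)   # succeeds: committed atomically
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 60), ('bob', 40)]
```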


"The view I have held, so far back as I can remember, is that we need both exploratory and confirmatory data analysis. Truly confirmatory calls for data collected (or mined) after the question(s) have been clearly formulated. The process of formulating the question and planning the data acquisition will, often, have to depend on exploratory analyses of previous data. This is not the only instance of cooperation between exploratory and confirmatory, but it is probably the most pervasive one." [24]


The following is a set of quotes from the 1977 rewrite of Tukey's original 1970 work (I took out the page numbers for clarity): [25]

  • "Exploratory data analysis is detective work,"

  • "We do not guarantee to introduce you to the 'best' tools particularly since we are not sure there can be unique bests."

  • "'Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone."

  • "Checking is inevitable ... Our need is for enough checks but not too many."

  • "Summaries can be very useful, but they are not the details."

  • "(We almost always want to look at numbers. We do not always have graph paper.) There is no excuse for failing to plot and look (if you have ruled paper)."

  • "There is often no substitute for the detective's microscope -- or for the enlarging graphs."

  • "We now regard reexpression as a tool, something to let us do a better job of grasping data."

  • "Most batches of data fail to tell us exactly how they should be analyzed." (This does not mean that we shouldn't try.)

  • "There cannot be too much emphasis on our need to see behavior."

  • "Whatever the data, we can try to gain by straightening or by flattening. When we succeed in one or both, we almost always see more clearly what is going on."

  • " 1. Graphs are friendly........ 3. Graphs force us to note the unexpected; nothing could be more important ....... 5. There is no more reason to expect one graph to 'tell all' than to expect one number to do the same."

  • "Even when we see a very good fit -- something we know has to be a very good summary of the data --we dare not believe that we have found a natural law."

  • "In dealing with distributions of the real world, we are very lucky if (a) we know APPROXIMATELY how values are distributed, (b) this approximation is itself not too far from being simple."

Let me try to summarize what I believe, from hindsight, to be the main features of Tukey's exploratory data analysis.

  • Procedure over Theory

There is really no way to collect any data without some sort of conceptual model in mind, but the exploratory part of data analysis should proceed as if there were no model.

  • Data Transformation

Normalization, log transformation, ratios, etc. Data is transformable in and of itself. (A few of these moves, along with smoothing and binning, are sketched in code after this list.)

  • Smoothing

A form of data transformation, often called coarse-graining. It has its own theory, renormalization theory, which tries to predict how coarse-graining changes a model.

  • Binning and Rank

Another form of data transformation. Important enough for a separate article.

  • Visualization

What Tukey is most famous for. Visualization tools include:

  • Scatter Plots

  • Histograms

  • Stem-and-Leaf Plots

  • Box-and-Whisker Plots

  • etc.

  • Table Transformation

Tables are both visualization tools and structures on which calculations can be performed. This may mean different tables that are related by transformation.


I would say that the above features create a definition of well-ordered data.

  • Iterative Process

This includes the whole analysis, not just the exploratory part. An analysis has a history and is thus repeatable. Data can also be revisited with new tools.
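Here is the sketch promised above: a few of Tukey's moves (re-expression, smoothing with running medians, binning into a quick text histogram) applied to made-up data, using only Python's standard library. Everything in it, from the data to the function names, is invented for illustration.

```python
import math
import random
import statistics

random.seed(1)

# Re-expression: a log transform often straightens or flattens skewed data.
skewed = [random.lognormvariate(0, 1) for _ in range(200)]
logged = [math.log10(v) for v in skewed]

# Smoothing (coarse-graining): Tukey favoured running medians over means.
def running_median(xs, window=5):
    half = window // 2
    return [statistics.median(xs[max(0, i - half): i + half + 1])
            for i in range(len(xs))]

noisy_series = [math.sin(i / 10) + random.gauss(0, 0.3) for i in range(100)]
smoothed = running_median(noisy_series)
print([round(v, 2) for v in smoothed[:8]])

# Binning and a quick plot: "no excuse for failing to plot and look."
def text_histogram(xs, bins=10):
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in xs:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    for i, c in enumerate(counts):
        print(f"{lo + i * width:6.2f} | {'*' * c}")

text_histogram(logged)
```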


An early and still popular statistical software package is SPSS, first released in 1968 and owned by IBM since 2009. Versions for personal computers followed in the 1980s, and the latest release, SPSS 29, came out in September of 2022. [26] The S programming language was designed specifically for statistics, drawing on the ideas of John Tukey. It was developed at Bell Labs beginning in 1976 by John Chambers together with Rick Becker and Allan Wilks. The last version of S was released in 1998. Although S was proprietary, its source code was made available in 1984. [27] In 1993 the first version of the R statistical language was released. Developed by Ross Ihaka and Robert Gentleman at the University of Auckland as a language for teaching statistics, it was first distributed as binaries to a mailing list and then placed under the GNU General Public License for free software in 1997. Also in 1997, CRAN, a repository for packages written in R (routines that extend the language), was created. The first official version of R was released in early 2000. [28]


R was always considered a language with a steep learning curve, and I have also heard it described as 'gnarly.' This is true; the language always seemed to get in the way of what one wanted to do with it. Starting around 2007, a statistician from New Zealand, Hadley Wickham, began creating packages for R. One of these was ggplot2, a package that handles graphics and plotting in R using a well-defined, language-like grammar for the functions and attributes needed to describe a graph. The package also supports extensions, a sort of package within a package; at the time of this writing there are 128 officially registered extensions. [29]


In 2014, Wickham, by then at RStudio, the company behind the RStudio IDE (Integrated Development Environment) for R, published the paper 'Tidy Data.' [31] It spelled out the beginning of a new direction not just for computational statistics but for computational science in general. Tidy data is clean data; tidy data is well-ordered data. Dirty data can be dirty for a reason, such as optimal storage, or for no reason at all. Tidy data can be defined with one definition and three rules:

Data is structurally a table.

  1. Columns are variables

  2. Rows are observations

  3. Values take up one cell each

Wickham has created a set of functions that turn dirty data into tidy data. He has also created a series of functions that take tidy data, transform it, and output tidy data. This allows for a workflow, as the functions can easily be chained together. These packages, called the 'tidyverse,' [30] have become a separate dialect of the R language, and the idea of tidy data has spread to other modern languages such as Python (a small sketch of the idea follows below).
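Since the idea has spread to Python, here is a small pandas sketch of the same move Wickham's R functions make: melting a 'messy' one-column-per-treatment table into a tidy one where every variable is a column and every observation is a row. The toy table is my own.

```python
import pandas as pd

# A "messy" layout: one row per person, one column per treatment result.
messy = pd.DataFrame({
    "person":      ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [None, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Tidy layout: each variable is a column, each observation is a row,
# and each value sits in its own cell.
tidy = messy.melt(id_vars="person", var_name="treatment", value_name="result")
print(tidy)

# Tidy output feeds straight into the next transformation:
print(tidy.groupby("treatment")["result"].mean())
```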


RStudio also incorporates the ideas of Open Science in the form of notebooks. Notebooks combine text with reproducible code, allowing for reproducible science. The reproducibility of a result is an important part of science but is difficult to accomplish. Also difficult is the metastudy, a study that combines the results of current and past research.

There are several reasons for this, centered around access. Approximately 2 million scientific papers were published last year, and at present some 90% of scientific results sit behind a journal's paywall. Journals took off after World War II but have created many problems: institutions have to buy back the results of their own research, and the citizens paying for the research with their taxes are barred from the papers unless they have some sort of relationship with an institution. [32] I have a loose affiliation with the University of Arizona, which gets me behind some paywalls, but there are still journals, especially foreign ones, that the U of A does not pay for. Computation, and ideas from the history of computation in this country such as open-source software, have been very important, but the issues are also social. In 1942, the sociologist Robert K. Merton wrote about the ethical norms of scientific discovery. [33] One important norm was the collective nature of scientific research. In 1985, Daryl E. Chubin coined the term 'Open Science' and tied it to its importance in a democratic society. [34] Results have been mixed and depend on the nature of the research, generational change, and how far computational methods have penetrated a field. Often there is a gap between what researchers want and what can actually be accomplished; Open Science can be difficult for institutions and people strapped for time and money. [35]

UNESCO's View of Open Science [36]


Governments started funding science seriously during World War II, when most science projects were funded through the military. This began to change with more civilian funding, both for individual projects and for massive singly-funded projects, called Big Science. The success of the Human Genome Project in the US and of CERN in Europe has held out hope not only for Big Science but also for Open Science. For instance, making the human genome open-source was controversial at the time, yet that open-source project now generates $45 billion annually from an original investment of $3 billion. In 2013, twin projects, one in Europe and one in the US, began with the intent of kick-starting the neuroscience of the human brain. The BRAIN Initiative [37] was funded with $5 billion from the NIH, and the Human Brain Project (HBP) [38] in Europe was funded with 1 billion euros over ten years. BRAIN has used around $700 million of its funding so far, while the HBP's funding ended in September.


Simulation is an important computational goal of any science. To people concerned with the treatment and use of experimental animals, simulation holds out the possibility of research without animal experimentation. The neuroscientist Henry Markram was, around 2005, the first to simulate the interactions of two or more neurons; at the time, only single-neuron simulation was possible. In 2005 he started the Blue Brain Project [39] at the École Polytechnique Fédérale de Lausanne in Switzerland. Blue Brain is attempting to simulate larger and larger sections of the rat and mouse brain. In 2009, Markram began promoting, via TED Talks, the idea of a ten-year mega-project to simulate the human brain. In 2013, the HBP was started with Markram as the project director. The HBP was controversial from the start. The European Commission met in secret to pick a project, and only one neuroscientist was involved; the European neuroscience community felt cut out. Many were skeptical of a ten-year timeline for simulating the human brain, many thought that simulation was still a poor substitute for current animal research, and many disliked Markram's management style and the European Commission's failure of oversight. In 2014 this came to a head when roughly 800 neuroscientists signed an open letter questioning the project's goals. Soon after, Markram was out as manager, the HBP was restructured, and the goal of a human brain simulation was dropped. Markram and Blue Brain still received funding separately from the HBP. [40][41][42][43] I first ran into Blue Brain four years ago, and until this article I had always thought that it and the HBP were one and the same. As for the BRAIN Initiative in America, it has generated less controversy. This is because it was managed by the National Institutes of Health, which has experience with projects such as this; NIH spent a year talking to scientists about exactly what was needed and what the goals should be. In October, BRAIN released the first-ever cell atlas of the human and primate brain. [44] All three projects have published a large set of computational tools, data structures, and methods of dealing with data. After ten years of Big Science, the consensus is that any comprehensive understanding of the human brain is still fifty years away.


The links I discovered on the Blue Brain Project website four years ago unfortunately no longer exist. What I found there was a way of looking at data that was new to me at the time, called data provenance. I have already mentioned metastudies and metadata; data provenance is a formalization of the process of a metastudy in terms of its metadata. The problem is this:

  • There is a vast amount of past information on a subject, sometimes going back hundreds of years or more. This doesn't just include scientific papers but books, articles, correspondence, lab notebooks, field notes, collected specimens (not always in museums, or lost inside a museum), results from failed experiments, and so on. These documents are also written in multiple human languages. This is all data. Getting it into digital form is a major problem because it can't really be automated. Optical Character Recognition (OCR) is pretty much stalled on the problem of reading human handwriting, and many older scanned documents are unreadable by machine because they were scanned as images and still have to be turned into text. Then there are issues of data format, accuracy, access, and security.

  • Once data has been converted into a machine-readable form, the narrative structured data (research papers, etc.) can be processed using natural language techniques to extract the concepts, categories, and experimental methods: the ontology of the narrative.

  • Since the meaning of concepts, experimental methods, and the structure of the data can vary within labs, between labs, and also change over generations of researchers, there must be some way to form a consensus, strip out what does not conform to this consensus, and then transform various data sets into a consensus set. It is only then that a metastudy can be performed. In the case of Blue Brain, the data would feed into the various brain simulations.

The process of data. [45]


Data provenance is a form of metadata that records the history of the data, the transformations applied to it, and ways to find each transformation, including the original data. This can become quite complex, and there is still some discussion about what the term entails. The idea is that to trust data one must know its history and, if necessary, be able to track and verify its transformations. [46][47][48] This is hard to accomplish, and the first time I saw software doing it was in a project manager built by Blue Brain called Blue Brain Nexus. [49] In 2019 I presented a talk at an Arizona GIS conference and nobody there had heard of the term 'provenance.' The GIS software we use, ESRI's, has embraced the concept of a notebook and of metadata, but data provenance, not really.
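To make the idea concrete, here is a minimal sketch of a provenance record in Python: each transformation is logged with content hashes of its inputs and output, so any later copy of the data can be checked against its history. The structure and field names are my own invention, not the schema Blue Brain Nexus or any other provenance standard actually uses.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(data: bytes) -> str:
    """Content hash, so any later copy of the data can be verified."""
    return hashlib.sha256(data).hexdigest()

def record_step(log, activity, inputs, output, agent):
    """Append one transformation to a provenance log:
    what was done, to what, by whom, when, and what came out."""
    log.append({
        "activity":  activity,
        "inputs":    [fingerprint(d) for d in inputs],
        "output":    fingerprint(output),
        "agent":     agent,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

raw = b"12,15,13,902,14\n"
cleaned = b"12,15,13,14\n"   # outlier removed

provenance = []
record_step(provenance, "remove_outliers", [raw], cleaned, "analyst@lab")
print(json.dumps(provenance, indent=2))
```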


First image of Sagittarius A* [50]


The image above is the first image of the black hole at the center of our Milky Way. An array of eight radio telescopes around the world was used to create one giant telescope to capture it. Petabytes of data (a petabyte is 2^50 bytes) were collected at each telescope and stored in racks of hard drives; after the observations, the drives were packed in cases and physically transported to a central processing location. [51] The result is an image of something no other human being had seen before. The Kolmogorov complexity must be immense! In addition, the image above has been JPEG compressed. Data to knowledge: a culmination of over one hundred years of theory and research.

Despite the hype, data is important. A sign of its importance has been the rise of an actual data science. Data, however, is just the start of an iterative process that creates what we define as knowledge.


 

  1. Floridi, Luciano. “The Philosophy of Information as a Conceptual Framework.” Knowledge, Technology & Policy 23, no. 1 (June 1, 2010): 253–81. https://doi.org/10.1007/s12130-010-9112-x.

  2. Zalta, Edward N., Uri Nodelman, Colin Allen, and R. Lanier Anderson, eds. “Information.” In Stanford Encyclopedia of Philosophy, 2018.

  3. Mussgnug, Alexander M. “A Philosophy of Data.” arXiv, May 20, 2020. https://doi.org/10.48550/arXiv.2004.09990.

  4. Mills, Stuart. “The Philosophical Theory of Data.” Data Impact blog. Accessed October 4, 2023. https://blog.ukdataservice.ac.uk/philosophical-theory-of-data/.

  5. Shannon, C E. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (1948).

  6. Turing, A. M. “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society s2-42, no. 1 (1937): 230–65. https://doi.org/10.1112/plms/s2-42.1.230.

  7. Solomonoff, R. J. “A Formal Theory of Inductive Inference. Part I.” Information and Control 7, no. 1 (March 1, 1964): 1–22. https://doi.org/10.1016/S0019-9958(64)90223-2.

  8. Solomonoff, R. J. “A Formal Theory of Inductive Inference. Part II.” Information and Control 7, no. 2 (June 1964): 224–54.

  9. Kolmogorov, Andrei N. “Three Approaches to the Quantitative Definition Of Information.” Problems of Information Transmission 1, no. 1 (1965): 1–7.

  10. Chaitin, Gregory J. “On the Length of Programs for Computing Finite Binary Sequences.” Journal of the ACM 13, no. 4 (October 1, 1966): 547–69. https://doi.org/10.1145/321356.321363.

  11. Wikipedia. “Kolmogorov Complexity.” In Wikipedia, August 22, 2023. https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&oldid=1171617849.

  12. Resch, Nicolas. “Kolmogorov Complexity,” February, 2020.

  13. AzaToth (derivative work; original photograph by Michael Gäbler). Gradual JPEG Artifacts Example, with Decreasing Quality from Right to Left. October 3, 2011. https://commons.wikimedia.org/wiki/File:Felis_silvestris_silvestris_small_gradual_decrease_of_quality.png.

  14. Grunwald, Peter D., and Paul M. B. Vitanyi. “Algorithmic Information Theory.” arXiv, September 17, 2008. https://doi.org/10.48550/arXiv.0809.2754.

  15. Hutter, Marcus. “Algorithmic Information Theory.” Scholarpedia 2, no. 3 (March 6, 2007): 2519. https://doi.org/10.4249/scholarpedia.2519.

  16. Wikipedia. “Algorithmic Information Theory.” In Wikipedia, August 18, 2023. https://en.wikipedia.org/w/index.php?title=Algorithmic_information_theory&oldid=1171067903.

  17. Grunwald, Peter. “A Tutorial Introduction to the Minimum Description Length Principle.” arXiv, June 4, 2004. http://arxiv.org/abs/math/0406077.

  18. Tukey, John W. Exploratory Data Analysis: Limited Preliminary Ed. Addison-Wesley Publishing, 1970.

  19. Wikipedia. “History of Personal Computers.” In Wikipedia, October 13, 2023. https://en.wikipedia.org/w/index.php?title=History_of_personal_computers&oldid=1179879975.

  20. Wikipedia. “Database.” In Wikipedia, August 24, 2023. https://en.wikipedia.org/w/index.php?title=Database&oldid=1171956467.

  21. Wikipedia. “Navigational Database.” In Wikipedia, August 5, 2023. https://en.wikipedia.org/w/index.php?title=Navigational_database&oldid=1168860321.

  22. Wikipedia. “NoSQL.” In Wikipedia, October 14, 2023. https://en.wikipedia.org/w/index.php?title=NoSQL&oldid=1180099009.

  23. Wikipedia. “Spreadsheet.” In Wikipedia, October 4, 2023. https://en.wikipedia.org/w/index.php?title=Spreadsheet&oldid=1178632214.

  24. Tukey, John W. “Exploratory Data Analysis: Past, Present and Future.” Technical report, Princeton University, 1993. https://apps.dtic.mil/sti/citations/ADA266775.

  25. Tukey, John W. Exploratory Data Analysis. Addison-Wesley Publishing, 1977.

  26. Wikipedia. “SPSS.” In Wikipedia, August 19, 2023. https://en.wikipedia.org/w/index.php?title=SPSS&oldid=1171239082.

  27. Wikipedia. “S (Programming Language).” In Wikipedia, April 17, 2023. https://en.wikipedia.org/w/index.php?title=S_(programming_language)&oldid=1150304933.

  28. Wikipedia. “R (Programming Language).” In Wikipedia, September 30, 2023. https://en.wikipedia.org/w/index.php?title=R_(programming_language)&oldid=1177966512.

  29. Wickham, Hadley. “Ggplot2 Extensions - Gallery.” Accessed October 15, 2023. https://exts.ggplot2.tidyverse.org/gallery/.

  30. Wickham, Hadley. “Tidyverse.” Accessed September 10, 2023. https://www.tidyverse.org/.

  31. Wickham, Hadley. “Tidy Data.” Journal of Statistical Software 59, no. 10 (2014). https://www.researchgate.net/publication/215990669_Tidy_data.

  32. Research Culture Is Broken; Open Science Can Fix It, 2019. https://www.youtube.com/watch?v=c-bemNZ-IqA.

  33. Merton, Robert K. “The Normative Structure of Science,” 1942. https://www.panarchy.org/merton/science.html.

  34. Chubin, Daryl E. “Open Science and Closed Science: Tradeoffs in a Democracy.” Science, Technology, & Human Values 10, no. 2 (April 1, 1985): 73–80. https://doi.org/10.1177/016224398501000211.

  35. The Foundations of Biomedical Data Science, 2017. https://www.youtube.com/watch?v=DBGvZ0ni5Tk.

  36. RobbieIanMorrison. Redrawn slide from a presentation by Ana Persic, Division of Science Policy and Capacity-Building (SC/PCB), UNESCO (France), to the Open Science Conference 2021, ZBW – Leibniz Information Centre for Economics, Germany. February 20, 2021. https://commons.wikimedia.org/wiki/File:Osc2021-unesco-open-science-no-gray.png.

  37. The BRAIN Initiative Alliance. “The BRAIN Initiative Alliance.” Accessed October 21, 2023. https://www.braininitiative.org/.

  38. humanbrainproject.eu. “Human Brain Project.” Accessed October 26, 2023. https://www.humanbrainproject.eu/en/.

  39. EPFL. “Blue Brain Portal.” Blue Brain Portal. Accessed October 17, 2023. https://portal.bluebrain.epfl.ch/.

  40. Abbott, Alison. “Documentary Follows Implosion of Billion-Euro Brain Project.” Nature 588, no. 7837 (December 7, 2020): 215–16. https://doi.org/10.1038/d41586-020-03462-3.

  41. Siva, Nayanah. “What Happened to the Human Brain Project?” The Lancet 402, no. 10411 (October 21, 2023): 1408–9. https://doi.org/10.1016/S0140-6736(23)02346-2.

  42. Theil, Stefan. “Why the Human Brain Project Went Wrong - and How to Fix It.” Scientific American. Accessed October 20, 2023. https://doi.org/10.1038/scientificamerican1015-36.

  43. Naujokaitytė, Goda. “Rethink for Human Brain Project as It Enters the Final Phase.” Science|Business, 2020. https://sciencebusiness.net/news/rethink-human-brain-project-it-enters-final-phase.

  44. NIH. “NIMH » Scientists Unveil Detailed Cell Maps of the Human Brain and the Nonhuman Primate Brain,” October 12, 2023. https://www.nimh.nih.gov/news/science-news/2023/scientists-unveil-detailed-cell-maps-of-the-human-brain-and-the-nonhuman-primate-brain.

  45. Paulson, John A. “Capturing, Storing and Analysing Data Provenance.” Provenance@Harvard. Accessed October 29, 2023. https://projects.iq.harvard.edu/provenance-at-harvard/home.

  46. Doan, AnHai, Alon Halevy, and Zachary Ives. “Data Provenance.” In Principles of Data Integration, 359–71. Elsevier, 2012. https://doi.org/10.1016/B978-0-12-416044-6.00014-4.

  47. Ridge, Enda. “Guerrilla Analytics Principles.” In Guerrilla Analytics, 2016.

  48. Viglas, Stratis D. “Data Provenance and Trust” 12, no. 0 (July 30, 2013): GRDI58. https://doi.org/10.2481/dsj.GRDI-010.

  49. The Blue Brain Project. “Blue Brain Nexus.” GitHub repository. Accessed October 21, 2023. https://github.com/BlueBrain/nexus.

  50. WikiCommons. “Saggitarius A* Black Hole.” In Wikipedia, February 25, 2023. https://en.wikipedia.org/w/index.php?title=File:EHT_Saggitarius_A_black_hole.tif&oldid=1141542954.

  51. Siliezar, Juan. “Second Black Hole Image Unveiled, First from Our Galaxy.” Harvard Gazette (blog), May 12, 2022. https://news.harvard.edu/gazette/story/2022/05/second-black-hole-image-unveiled-first-from-our-galaxy/.
