Intro to Data - 06-03 - Storage

Jan 1, 2020 22:56 · 353 words · 2 minute read forth analytical databases – computers

The second step in the data lifecycle is data storage. Once we’ve collected and recorded an observation as data, we need to store it so that it can be retrieved for future analysis. As data are being recorded by sensors, these data are first recorded temporarily in a type of memory called volatile storage. Volatile storage means that the data are lost when the device loses power. As a result, we need to transfer our data somewhere more permanent, so that they will be available anytime we need them.

00:28 - In data science, we store our data in one of several persistent- storage mediums. Persistent storage, means that the device retains the data after the power to the device has been shut off. For example, a computer’s hard drive retains its data even if you turn the power off and then turn it back on again. By storing our data in a persistent-storage medium, we can retrieve our data as needed. Unless we overwrite the data, or the device permanently fails, our data should always remain available.

00:56 - Data can be stored in computers in several ways: First, we have file-based formats – which store data in files on the file system of a computer. For example, comma-separate values (or CSV) files and Excel spreadsheets. Next, we have web-based formats – which store data in formats best suited for data transfer over the internet. For example, eXtensible markup language (or XML) and javascript- object notation (or JSON) Third, we have transactional databases – which store data in a form best suited for transaction processing. For example, normalized relational databases and No-SQL databases.

01:31 - Forth, we have analytical databases – which store data in a format best suited for analytical processing. For example, data warehouses, data marts, and data cubes. And finally, we have Big Data platforms – which can store massive data sets by distributing both data and processing across many computers For example, Spark, Hive, and Hadoop. There are many options to choose from, so it’s important to know which option is right for your specific data-storage scenario. .