# Tag: data

# Miller

# Problems of Post Hoc Analysis

# Forecasting in R: Probability Bins for Time-Series Data

## R Script

# Gardener’s Vision

# Which US States Best Protect Privacy Online? – Comparitech

Every day. Without hope, without despair.

“Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.”

Available in Debian derivatives via:

`$ sudo apt-get install miller`

It also has a nice 10-minute introduction.
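As a quick sketch of what "awk, sed, cut, join, and sort for name-indexed data" looks like in practice (the file `rates.csv` and its field names below are illustrative, not from the Miller docs):

```
# A small CSV with a header row (illustrative data):
cat > rates.csv <<'EOF'
date,value
2019-08-06,1.73
2019-08-05,1.75
2019-08-02,1.86
EOF

# cut-like: keep only the value column, addressed by name
mlr --csv cut -f value rates.csv

# sort-like: order rows chronologically by the date field
mlr --csv sort -f date rates.csv

# awk-like: summary statistics on a named field
mlr --icsv --opprint stats1 -a mean,min,max -f value rates.csv
```

Unlike cut(1) or sort(1), fields are addressed by name rather than by column position, which is the point of "name-indexed."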

“Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives…. A study from the late-1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations. In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most (Peto, 2011). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified (p.97).”

—Mayo, quoting the National Academies of Sciences "Consensus Study" *Reproducibility and Replicability in Science* (2019), in "National Academies of Science: Please Correct Your Definitions of P-values." Statsblogs, September 30, 2019.

The time-series.R script below takes a set of historical time-series data and walks a window the length of the forecast period across it, generating probabilistic outcomes from the historical changes in the data set.

The input is a CSV file with two columns (Date, Value), with dates in reverse chronological order and in ISO-8601 format, like so:

```
2019-08-06,1.73
2019-08-05,1.75
2019-08-02,1.86
```

The output is as follows:

```
0.466: Bin 1 - <1.7
0.328: Bin 2 - 1.7 to <=1.9
0.144: Bin 3 - 1.9+ to <2.1
0.045: Bin 4 - 2.1 to <=2.3
0.017: Bin 5 - 2.3+
```

**Note**: Patterns in data sets will skew results. A 20-year upward trend will make higher outcomes look more probable. A volatile 5-year period will produce more conservative predictions and may not capture a recent trend or a recent change of direction.

```
# time-series.R
# Original: December 4, 2018
# Last revised: December 4, 2018
#################################################
# Description: This script takes any sequence of
# historical time-series data and makes a probability
# forecast across five bins by a particular date.
# Assumes a csv file with two columns (Date, Value)
# with dates in reverse chronological order and in
# ISO-8601 format. Like so:
#
# 2019-08-06,1.73
# 2019-08-05,1.75
# 2019-08-02,1.86
#
# Clear memory:
rm(list = ls())
gc()
#################################################
# Function
time_series <- function(time_path = "./path/file.csv",
                        closing_date = "2020-01-01", trading_days = 5,
                        bin1 = 1.7, bin2 = 1.9,
                        bin3 = 2.1, bin4 = 2.3) {
  # This script uses only base R, so no libraries need
  # to be loaded. (If you add a library X, install it
  # at the R prompt with: install.packages('X'))
  #################################################
  # Determine how many trading days remain until the
  # closing date.
  todays_date <- Sys.Date()
  closing_date <- as.Date(closing_date)
  remaining_weeks <- as.numeric(difftime(closing_date, todays_date, units = "weeks"))
  remaining_weeks <- round(remaining_weeks, digits = 0)
  non_trading_days <- (7 - trading_days) * remaining_weeks
  day_difference <- as.numeric(difftime(closing_date, todays_date))
  remaining_days <- day_difference - non_trading_days
  #################################################
  # Import & Parse
  # Point to the time-series data file and import it.
  time_import <- read.csv(time_path, header = FALSE, stringsAsFactors = FALSE)
  colnames(time_import) <- c("date", "value")
  # Set data types
  time_import$date <- as.Date(time_import$date)
  time_import$value <- as.numeric(time_import$value)
  # Most recent value, assuming descending (newest-first) data
  current_value <- time_import[1, 2]
  # Shorten the series so every row still has a
  # counterpart remaining_days rows later.
  time_rows <- length(time_import$value) - remaining_days
  # Vector of historical changes over one forecast window
  time_calc <- numeric(time_rows)
  # For each row, subtract the value from the row
  # remaining_days away.
  for (i in 1:time_rows) {
    time_calc[i] <- time_import$value[i] - time_import$value[i + remaining_days]
  }
  # Adjust the bin edges against the current value so
  # they are on the same scale as time_calc.
  adj_bin1 <- bin1 - current_value
  adj_bin2 <- bin2 - current_value
  adj_bin3 <- bin3 - current_value
  adj_bin4 <- bin4 - current_value
  # Share of historical windows that land in each bin
  prob1 <- round(sum(time_calc < adj_bin1) / length(time_calc), digits = 3)
  prob2 <- round(sum(time_calc >= adj_bin1 & time_calc <= adj_bin2) / length(time_calc), digits = 3)
  prob3 <- round(sum(time_calc > adj_bin2 & time_calc < adj_bin3) / length(time_calc), digits = 3)
  prob4 <- round(sum(time_calc >= adj_bin3 & time_calc <= adj_bin4) / length(time_calc), digits = 3)
  prob5 <- round(sum(time_calc > adj_bin4) / length(time_calc), digits = 3)
  ###############################################
  # Print results
  cat(paste0(prob1, ": Bin 1 - ", "<", bin1, "\n",
             prob2, ": Bin 2 - ", bin1, " to <=", bin2, "\n",
             prob3, ": Bin 3 - ", bin2, "+ to <", bin3, "\n",
             prob4, ": Bin 4 - ", bin3, " to <=", bin4, "\n",
             prob5, ": Bin 5 - ", bin4, "+", "\n"))
}
# Example usage (illustrative path; bins match the output above):
# time_series("./data/rates.csv", closing_date = "2020-01-01",
#             trading_days = 5, bin1 = 1.7, bin2 = 1.9, bin3 = 2.1, bin4 = 2.3)
```

“Without communication, connection, and empathy, it becomes easy for actors to take on the “gardener’s vision”: to treat those they are acting upon as less human or not human at all and to see the process of interacting with them as one of grooming, of control, of organization. This organization, far from being a laudable form of efficiency, is inseparable from dehumanization.”

—Os Keyes, “The Gardener’s Vision of Data.” *Real Life*, May 6, 2019.

“Laws governing online privacy in the US vary widely from state to state. To find out how each US state ranks from least to most private, we evaluated each and every one of them based on 20 key criteria.”

—Paul Bischoff, “Which U.S. States Best Protect Online Privacy.” comparitech.com, July 20, 2018.