From Data to Insights with Open Science Tools: Exploration of Theatrical Plays with DraCor and R

# Exploration of Theatrical Plays with DraCor and R

## From Data to Insights with Open Science Tools
[https://hlageek.github.io/reports/ufal2022](https://hlageek.github.io/reports/ufal2022)
<br><br><br><br><br><br><br><br><br><br><br>
Radim Hladík

Institute of Philosophy of the Czech Academy of Sciences

hladik@flu.cas.cz

<br>

---

# The perfect audience

.pull-left[
Who can benefit the most from this lecture?

Me!

More specifically, myself a few years ago:

-   effectively zero experience with coding  
-   no idea where/how to start coding  
-   frankly, a bit scared of the whole coding business  
-   background in the humanities

-   motivated to learn

]

.pull-right[
What kind of benefits can you expect?

-   **NOT** learning to code
-   **NOT** understanding computer science concepts

<hr>

-   getting a feeling for what coding looks like
-   removal of the initial barriers on your own path to computational analyses
-   recognition of advantages of coding over point-and-click software
-   appreciation of (clean and documented) data
-   awareness of the need to "munge" the data

]

---

# R and RStudio

## R
- programming language with a focus on statistics
- widely used by data analysts in academia and private sector, obscure otherwise
- free and open source
- to teach your computer to understand the R language, you must first install it on your operating system
- [https://www.r-project.org/](https://www.r-project.org/)
  
## RStudio
- IDE - integrated development environment designed for the R language
- free and open source
- not a requirement, just makes the interaction with R easier
- [RStudio](https://www.rstudio.com/products/rstudio/download/)
- cloud version without a need to install anything [https://rstudio.cloud/](https://rstudio.cloud/)

---

# DraCor - dataset of theatrical plays
.panelset[

.panel[.panel-name[DraCor website]

<https://dracor.org>

]
.panel[.panel-name[DraCor R library]

[https://github.com/dracor-org/rdracor](https://github.com/dracor-org/rdracor)

[https://rdrr.io/github/Pozdniakov/rdracor/man/](https://rdrr.io/github/Pozdniakov/rdracor/man/)

`rdracor` is a library dedicated to interacting with DraCor data from the R programming language
]
]

---

# Setup 
.panelset[
.panel[.panel-name[Command]

```r
# install.packages("remotes")
# remotes::install_github("Pozdniakov/rdracor")
library(rdracor)
# install.packages("tidyverse")
library(tidyverse)
```

]
.panel[.panel-name[Explanation]

```r
# install.packages("remotes")
# remotes::install_github("Pozdniakov/rdracor")
library(rdracor)
# install.packages("tidyverse")
library(tidyverse)
```

We loaded **libraries** that provide library-specific **functions** that can be used in our project and that are not available in what is called base `R`.

Before loading libraries, you must install the packages that contain them (installation is a one-time thing; loading needs to be done once per coding session).

-   notice the hash `#` symbol - `R` uses it to signify comments, i.e. any line of text that should be ignored when you run the script
-   use comments to remind yourself of what you intended to express by the code
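
The difference between installing once and loading every session can be sketched in base `R`. This is an illustration rather than part of the original setup; the object names `needed`, `is_installed` and `not_installed` are made up for this example.

```r
# Packages this lecture relies on (rdracor itself is installed from GitHub via remotes)
needed <- c("remotes", "tidyverse")

# requireNamespace() returns TRUE when a package is installed and FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need a one-time install.packages() call
not_installed <- needed[!is_installed]
print(not_installed)
```

If `not_installed` comes back empty, everything is already installed and only the `library()` calls are needed.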

]

.panel[.panel-name[Advantages of using libraries]

The base `R` language is made of general purpose functions and objects. **Libraries** build upon the base `R` functions to provide an extra set of functions with some advantages:

-   they simplify things (you can use one library function instead of writing many lines of code that achieve the same goal)
-   they are likely to be well-tested (unlike your code)

]
]

---

# First call to DraCor API

.panelset[

.panel[.panel-name[Command]

```r
dracor_metadata <- get_dracor_meta() 
```

```r
class(dracor_metadata) 
```

```
## [1] "dracor_meta" "data.frame"
```

```r
colnames(dracor_metadata)
```

```
##  [1] "description"     "uri"             "title"           "name"           
##  [5] "repository"      "licence"         "licenceUrl"      "plays"          
##  [9] "characters"      "male"            "female"          "text"           
## [13] "sp"              "stage"           "updated"         "wordcount.text" 
## [17] "wordcount.sp"    "wordcount.stage"
```

]
.panel[.panel-name[Explanation]

```r
dracor_metadata <- get_dracor_meta()
class(dracor_metadata)
colnames(dracor_metadata)
```

-   we create an object called `dracor_metadata` to which we assign the result of the function `get_dracor_meta()`
-   we check what kind of object the result is (what is its class)
-   the result is a table-like object of class `data.frame` and we can inspect the names of the table columns by calling the function `colnames()`
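
The same inspection steps work on any `data.frame`. The sketch below uses a small stand-in table (the name `toy_metadata` is hypothetical; the values are copied from the first three corpora in the metadata) so that it runs without a call to the DraCor API.

```r
# Toy stand-in for the metadata table
toy_metadata <- data.frame(
  name  = c("fre", "ger", "rus"),
  plays = c(1560, 570, 212)
)

class(toy_metadata)     # what kind of object is it?
colnames(toy_metadata)  # which columns does it have?
nrow(toy_metadata)      # how many rows (corpora) does it have?
str(toy_metadata)       # compact overview of the columns and their types
```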
]

.panel[.panel-name[Look inside the metadata]

.font50[

| title | name | licence | plays | characters | male | female | text | sp | stage | wordcount.text | wordcount.sp | wordcount.stage |
|:--|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| French Drama Corpus | fre | CC BY-NC-SA 4.0 | 1560 | 15526 | 6860 | 3720 | 1560 | 445403 | 38306 | 15671583 | 14444033 | 307262 |
| German Drama Corpus | ger | CC0 | 570 | 13811 | 9802 | 2796 | 570 | 397532 | 187883 | 10036323 | 9573219 | 1182986 |
| Russian Drama Corpus | rus | CC0 | 212 | 3707 | 2608 | 871 | 212 | 119332 | 49440 | 2316995 | 2191657 | 215112 |
| Calderón Drama Corpus | cal | NA | 201 | 3354 | 2163 | 1113 | 201 | 119286 | 23714 | 2792425 | 2693673 | 113425 |
| Italian Drama Corpus | ita | NA | 139 | 1527 | 989 | 413 | 139 | 66607 | 13522 | 1895476 | 1763669 | 62311 |
| Swedish Drama Corpus | swe | NA | 68 | 769 | 382 | 327 | 73 | 35420 | 17209 | 737001 | 690633 | 96212 |
| Greek Drama Corpus | greek | CC BY-SA 3.0 US | 39 | 437 | 275 | 110 | 39 | 15693 | 11 | 321594 | 320963 | 19 |
| German Shakespeare Drama Corpus | gersh | CC0 | 38 | 0 | 0 | 0 | 38 | 31767 | 6883 | 873038 | 855064 | 32064 |
| Shakespeare Drama Corpus | shake | CC BY-NC 3.0 | 37 | 1433 | 797 | 116 | 37 | 31066 | 10450 | 908286 | 876744 | 41230 |
| Roman Drama Corpus | rom | CC BY-SA 3.0 US | 36 | 405 | 278 | 104 | 36 | 18485 | 204 | 307157 | 298299 | 529 |
| Alsatian Drama Corpus | als | NA | 25 | 308 | 195 | 82 | 25 | 13022 | 6571 | 275642 | 262931 | 35102 |
| Spanish Drama Corpus | span | NA | 25 | 580 | 331 | 226 | 25 | 23600 | 8393 | 444620 | 388883 | 84337 |
| Bashkir Drama Corpus | bash | NA | 3 | 56 | 37 | 16 | 3 | 1405 | 575 | 23464 | 21184 | 3275 |
| Tatar Drama Corpus | tat | NA | 3 | 30 | 20 | 9 | 3 | 701 | 433 | 13037 | 12223 | 1788 |

The columns `description`, `uri`, `repository`, `licenceUrl` and `updated` are not shown.

]
]
]

---

# Interact with the data

.panelset[

.panel[.panel-name[Command]

```r
dracor_metadata %>% 
  pull(title)
```

```
##  [1] "French Drama Corpus"             "German Drama Corpus"            
##  [3] "Russian Drama Corpus"            "Calderón Drama Corpus"          
##  [5] "Italian Drama Corpus"            "Swedish Drama Corpus"           
##  [7] "Greek Drama Corpus"              "German Shakespeare Drama Corpus"
##  [9] "Shakespeare Drama Corpus"        "Roman Drama Corpus"             
## [11] "Alsatian Drama Corpus"           "Spanish Drama Corpus"           
## [13] "Bashkir Drama Corpus"            "Tatar Drama Corpus"
```
]

.panel[.panel-name[Explanation]

```r
dracor_metadata %>% 
  pull(title)
```

-   the `pull()` function takes an object of class `data.frame` (a tabular data structure) as its 1st argument and a column name as its 2nd argument; then it "pulls" the specified column out of the `data.frame`

-   we "pipe" the object `dracor_metadata` into `pull()` function with the `%>%` pipe operator

-   the advantage of piping is that the code reads as a sequence of transformation steps

-   `dracor_metadata %>% pull(title)` is equivalent to `pull(dracor_metadata, title)`:
.font80[

```r
pull(dracor_metadata, title)
```

]
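
To check the equivalence without calling the API, here is a small sketch on a stand-in table (the name `toy_metadata` is hypothetical). It also shows the base `R` way of extracting a column with `$`, which is not used in this lecture.

```r
library(dplyr)

# Toy stand-in for the metadata table
toy_metadata <- data.frame(
  name  = c("fre", "ger", "rus"),
  title = c("French Drama Corpus", "German Drama Corpus", "Russian Drama Corpus")
)

piped  <- toy_metadata %>% pull(title)  # pipe style
nested <- pull(toy_metadata, title)     # plain function call
base_r <- toy_metadata$title            # base R, no extra library needed

identical(piped, nested)  # both calls return the same vector
identical(piped, base_r)  # and so does the base R version
```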
]
]

---
# Explore the data

.panelset[

.panel[.panel-name[Command] 
.pull-left[

```r
dracor_metadata %>% 
  ggplot(aes(title, plays)) +
  geom_col() 
```
]
.pull-right[
![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-10-1.png)
]
]

.panel[.panel-name[Explanation]

```r
dracor_metadata %>% 
  ggplot(aes(x = title, y = plays)) +
  geom_col() 
```

-   we will use the `ggplot2` data visualization library that is included in the `tidyverse` collection

-   the main function in the `ggplot2` library is called `ggplot()` and takes as its 1st argument a data object of class `data.frame` (here we supply the argument with the help of the pipe) and the 2nd argument is a function `aes()` which maps specified variables (column names) onto visualization parameters - most commonly the axes of the 2D charts

> Take the table called `dracor_metadata` and turn it into a plot. Map the data values from the column `title` onto X-axis and the data values from the column `plays` onto Y-axis. Add (`+`) a visualization layer based on column geometry (constructed by `geom_col()` function).
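
The idea of adding layers with `+` can be made visible by storing the plot in an object at each step. This is only a sketch on a stand-in table; the names `toy_metadata`, `base_plot` and `bar_plot` are hypothetical.

```r
library(ggplot2)

# Toy stand-in for the metadata table
toy_metadata <- data.frame(
  title = c("French Drama Corpus", "German Drama Corpus", "Russian Drama Corpus"),
  plays = c(1560, 570, 212)
)

# Step 1: data and mapping only; printing this object would draw empty axes
base_plot <- ggplot(toy_metadata, aes(x = title, y = plays))

# Step 2: add one layer that draws the columns
bar_plot <- base_plot + geom_col()

length(base_plot$layers)  # no layers yet
length(bar_plot$layers)   # one layer after adding geom_col()
```

Typing `bar_plot` in the console draws the chart.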

]
]

---
# Make the axes readable

.panelset[

.panel[.panel-name[Command] 
.pull-left[

```r
dracor_metadata %>% 
  ggplot(aes(x = title, y = plays)) +
  geom_col() +
  `coord_flip()`
```
]

.pull-right[
![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-13-1.png)
]
]
.panel[.panel-name[Explanation]

```r
dracor_metadata %>% 
  ggplot(aes(x = title, y = plays)) +
  geom_col() +
  coord_flip()
```

-   the `coord_flip()` function adds another component to the visualization: it swaps the two axes of the chart built so far

> Add another component to the visualization that flips the coordinates, so that X becomes Y and Y becomes X (this is simply for better readability of the text displayed on the axes).

]
]

---

# Which country has the most theatrical plays?

.panelset[

.panel[.panel-name[Consider]

The data contains the most plays by French authors. But does that mean that French authors wrote the most? Or could this be just an artifact of the data and of how it was created?
-   Perhaps the team preparing the French corpus was the largest and could process the most plays.
-   And many plays from ancient Greece have most likely not survived until today.
-   So asking which nation has the most theatrical plays is not a good question to ask of this dataset.
-   However, we can rephrase the question to make it more appropriate: which corpus contains, on average, the longest plays in terms of text?

]

.panel[.panel-name[Command]

.pull-left[

```r
dracor_metadata %>% 
  `mutate(normalized_text = wordcount.text/plays)` %>% 
  ggplot(aes(x = title, y = `normalized_text`)) +
  geom_col() +
  coord_flip()
```
]
.pull-right[

![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-16-1.png)
]
]

.panel[.panel-name[Explanation]

```r
dracor_metadata %>% 
  mutate(normalized_text = wordcount.text/plays) %>% 
  ggplot(aes(x = title, y = normalized_text)) +
  geom_col() +
  coord_flip()
```

-   create a new variable by mutating the existing ones using the `mutate()` function
-   obtain `normalized_text` by dividing the variable `wordcount.text`, which contains the total word count of a corpus, by the variable `plays`, which contains the number of plays
-   project the new variable `normalized_text` onto the Y-axis
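
The division can be worked through on three corpora. The sketch below uses a stand-in table (the names `toy_metadata` and `toy_normalized` are hypothetical; the numbers are copied from the metadata).

```r
library(dplyr)

# Toy stand-in for the metadata table
toy_metadata <- data.frame(
  name           = c("fre", "ger", "rus"),
  plays          = c(1560, 570, 212),
  wordcount.text = c(15671583, 10036323, 2316995)
)

toy_normalized <- toy_metadata %>%
  mutate(normalized_text = wordcount.text / plays)

# Average number of words per play in each corpus, rounded
round(toy_normalized$normalized_text)
# 10046 17608 10929
```

The French corpus has the most plays, but the German plays are longer on average.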

]
]

---

# Let's rank the data properly

.panelset[

.panel[.panel-name[Command]

.pull-left[

```r
dracor_metadata %>% 
  mutate(normalized_text = wordcount.text/plays) %>% 
  `mutate(title = fct_reorder(title, normalized_text)) %>%`
  ggplot(aes(x = title, y = normalized_text)) +
  geom_col() +
  coord_flip()
```
]
.pull-right[
![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-19-1.png)
]
]
.panel[.panel-name[Explanation]

```r
dracor_metadata %>% 
  mutate(normalized_text = wordcount.text/plays) %>% 
  mutate(title = fct_reorder(title, normalized_text)) %>% 
  ggplot(aes(x = title, y = normalized_text)) +
  geom_col() +
  coord_flip()
```

-   we again mutate the `dracor_metadata` object
-   this time, we convert the variable `title` into a factor (before, it was just a text string)
-   factors, or categories, are textual labels that can be attached to observations
  - for example, if you interview a sample of 100 people and record their gender as a binary category - `man` or `woman` - this information will not be treated as 100 text strings, but as 100 instances of 2 categories
  - categories can sometimes be ordered, for example, you can classify the highest attained education level as `primary`, `secondary`, `tertiary`
  - the ordering of categories is what we use here to order the national corpora by their normalized word count
-   the function `fct_reorder()` does just that: it converts strings into factors and puts them into the specified order
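
A minimal sketch of what `fct_reorder()` changes, using three titles and their rounded words-per-play values computed from the metadata (the object names are hypothetical):

```r
library(forcats)

titles <- c("French Drama Corpus", "German Drama Corpus", "Russian Drama Corpus")
normalized_text <- c(10046, 17608, 10929)  # rounded words per play

# Plain text strings have no order of their own; charts sort them alphabetically
sort(titles)

# fct_reorder() creates a factor whose levels follow normalized_text (ascending)
ordered_titles <- fct_reorder(titles, normalized_text)
levels(ordered_titles)
# "French Drama Corpus" "Russian Drama Corpus" "German Drama Corpus"
```

The chart draws its bars in the order of these levels, which is why the ranking appears in the plot.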

]
]

---

# Remove individuals

.panelset[

.panel[.panel-name[Command]

.pull-left[

```r
authors <- c("shake", "gersh", "cal")

dracor_metadata %>% 
  `filter(!name %in% authors) %>%` 
  mutate(normalized_text = wordcount.text/plays) %>% 
  mutate(title = fct_reorder(title, normalized_text)) %>%
  ggplot(aes(x = title, y = normalized_text)) +
  geom_col() +
  coord_flip()
```
]
.pull-right[
![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-22-1.png)
]
]
.panel[.panel-name[Explanation]

```r
authors <- c("shake", "gersh", "cal")

dracor_metadata %>% 
  filter(!name %in% authors) %>% 
  mutate(normalized_text = wordcount.text/plays) %>% 
  mutate(title = fct_reorder(title, normalized_text)) %>%
  ggplot(aes(x = title, y = normalized_text)) +
  geom_col() +
  coord_flip()
```

-   we combine three text strings that match the names of the corpora into one vector
-   then we filter the dataset using the `filter()` function
-   the `%in%` operator checks, for each value on its left, whether that value occurs in the vector on its right
-   the `!` negates the result, so matches become non-matches

> Combine the text strings into one object called `authors`. Take the `dracor_metadata` object and keep only those rows where the value of `name` does not occur in `authors`.
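
What `%in%` and `!` return can be seen on plain vectors in base `R`. In this sketch the vector `name` holds only a few corpus names for illustration.

```r
name    <- c("fre", "ger", "shake", "cal")  # a few corpus names
authors <- c("shake", "gersh", "cal")       # corpora devoted to a single author

name %in% authors         # FALSE FALSE  TRUE  TRUE
!name %in% authors        #  TRUE  TRUE FALSE FALSE
name[!name %in% authors]  # "fre" "ger"
```

`filter()` keeps exactly those rows for which the expression is `TRUE`.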

]

.panel[.panel-name[Consider]

Although the result of our code is a visualization, we have not been changing the visualization! **When you need to visualize data in a desired form, you must change the data.**

]
]

---

# Gender perspective

.panelset[

.panel[.panel-name[Command]

.pull-left[

```r
authors <- c("shake", "gersh", "cal")

dracor_metadata %>% 
  filter(!name %in% authors) %>% 
  `mutate(percentage_female = female/(male+female)) %>% `
  mutate(title = fct_reorder(title, percentage_female)) %>%
  ggplot(aes(x = title, y = percentage_female)) +
  geom_col() +
  coord_flip()
```
]
.pull-right[
![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-25-1.png)
]
]

.panel[.panel-name[Explanation]

```r
authors <- c("shake", "gersh", "cal")

dracor_metadata %>% 
  filter(!name %in% authors) %>% 
  mutate(percentage_female = female/(male+female)) %>% 
  mutate(title = fct_reorder(title, percentage_female)) %>%
  ggplot(aes(x = title, y = percentage_female)) +
  geom_col() +
  coord_flip()
```

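-   calculate the share of female characters from the existing variables `male` and `female` (the result is a proportion between 0 and 1; the denominator is `male + female`, so characters counted in neither column are left out)
-   make sure to plot the new variable `percentage_female`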
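
The calculation can be checked by hand on three corpora. The numbers below are copied from the metadata. The German Shakespeare corpus (`gersh`) records zero male and zero female characters, so dividing gives `NaN`; in the pipeline above that corpus has already been removed by `filter()`.

```r
male   <- c(fre = 6860, ger = 9802, gersh = 0)
female <- c(fre = 3720, ger = 2796, gersh = 0)

percentage_female <- female / (male + female)
round(percentage_female, 2)
#   fre   ger gersh
#  0.35  0.22   NaN
```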

]
]

---

# Add some color

.panelset[

.panel[.panel-name[Command]

.pull-left[

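```r
authors <- c("shake", "gersh", "cal")

dracor_metadata %>% 
  filter(!name %in% authors) %>% 
  mutate(percentage_female = female/(male+female)) %>%
  mutate(title = fct_reorder(title, percentage_female)) %>% 
  select(title, male,female) %>% 
  pivot_longer(cols = c(male, female), 
               names_to = "gender", 
               values_to = "no_of_characters") %>% 
  ggplot(aes(x = title, 
             y = no_of_characters,
             fill = gender)) +
  geom_col(position = "fill") +
  coord_flip()
```
]
]
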
.panel[.panel-name[Explanation]

.pull-left[

```r
authors <- c("shake", "gersh", "cal")

dracor_metadata %>% 
  filter(!name %in% authors) %>% 
  mutate(percentage_female = female/(male+female)) %>%
  mutate(title = fct_reorder(title, percentage_female)) %>% 
  select(title, male,female) %>% 
  pivot_longer(cols = c(male, female), 
               names_to = "gender", 
               values_to = "no_of_characters") %>% 
  ggplot(aes(x = title, 
             y = no_of_characters,
             fill = gender)) +
  geom_col(position = "fill") +
  coord_flip()
```
]
.pull-right[
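-   we still use the variable `percentage_female` to sort the levels of the factor `title`, but we no longer visualize it
-   for this visualization, we need only three columns from the dataset: `title`, `male` and `female`; we separate them from the rest using the `select()` function
-   we are no longer interested in just the share of female characters; now we want to see the distribution of both genders
-   to change the visualization, we again reshape the data, this time by pivoting the table with the `pivot_longer()` function: the names of the columns `male` and `female` move into a new column `gender`, and their values move into a new column `no_of_characters`
-   `geom_col(position = "fill")` stacks both genders in one bar per corpus and scales every bar to the same total, so the chart shows proportions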
]
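
What pivoting does to the shape of the table is easiest to see on two rows. This is a sketch on a stand-in table; the names `toy_wide` and `toy_long` are hypothetical and the numbers are copied from the metadata.

```r
library(tidyr)

# Wide form: one row per corpus, one column per gender
toy_wide <- data.frame(
  title  = c("French Drama Corpus", "German Drama Corpus"),
  male   = c(6860, 9802),
  female = c(3720, 2796)
)

# Long form: one row per corpus and gender
toy_long <- pivot_longer(toy_wide,
                         cols = c(male, female),
                         names_to = "gender",
                         values_to = "no_of_characters")

dim(toy_wide)    # 2 rows, 3 columns
dim(toy_long)    # 4 rows, 3 columns
toy_long$gender  # "male" "female" "male" "female"
```

Only in the long form is there a single column `gender` that can be mapped onto `fill`.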

]
]

---

# Final remarks

- the code can get complex quickly, but each step is clearly defined
- coding is not easy
  -  when preparing these few lines of code, I frequently consulted documentation or googled "how to..." questions
- but at every point, you only need to figure out the next step
-   there are many resources out there!