<style type="text/css"> ul.panel-tabs { font-size: 80%; margin-top: -20px } </style> # Exploration of Theatrical Plays with DraCor and R ## From Data to Insights with Open Science Tools [https://hlageek.github.io/reports/ufal2022](https://hlageek.github.io/reports/ufal2022) <br><br><br><br><br><br><br><br><br><br><br> Radim HladÃk Institute of Philosophy of the Czech Academy of Sciences hladik@flu.cas.cz <br> --- # The perfect audience .pull-left[ Who can benefit the most from this lecture? Me! More specifically, myself a few years ago: - effectively zero experience with coding - no idea where/how to start coding - frankly, a bit scared of the whole coding business - background in the humanities - motivated to learn ] .pull-right[ What kind of benefits can you expect? - **NOT** learning to code - **NOT** understanding computer science concepts <hr> - getting a feeling for what coding looks like - removal of the initial barriers on your own path to computational analyses - recognition of advantages of coding over point-and-click software - appreciation of (clean and documented) data - awareness of the need to "munge" the data ] --- # R ## R - programming language with focus on statics - widely used by data analysts in academia and private sector, obscure otherwise - free and open source - to teach your computer to understand R language, you must install it on your operating system first (although if you work with Python instead, that comes pre-installed on Macs) - [https://www.r-project.org/](https://www.r-project.org/) ## RStudio - IDE - integrated development environment designed for the R language - free and open source - not a requirement, just makes the interaction with R easier - [RStudio](https://www.rstudio.com/products/rstudio/download/) - cloud version without a need to install anything [https://rstudio.cloud/](https://rstudio.cloud/) --- # DraCor - dataset of theatrical plays .panelset[ .panel[.panel-name[DraCor website] <https://dracor.org> <iframe src="https://dracor.org" width="100%" height="400px"> </iframe> ] .panel[.panel-name[DraCor R library] <img src="https://github.com/Pozdniakov/rdracor/raw/main/man/figures/logo.png"> [https://github.com/dracor-org/rdracor](https://github.com/dracor-org/rdracor) [https://rdrr.io/github/Pozdniakov/rdracor/man/](https://rdrr.io/github/Pozdniakov/rdracor/man/) `rdracor` is a library dedicated to interaction with DraCore data from R programming language ] ] --- # Setup .panelset[ .panel[.panel-name[Command] ```r # install.packages("remotes") # remotes::install_github("Pozdniakov/rdracor") library(rdracor) # install.packages("tidyverse") library(tidyverse) ``` ] .panel[.panel-name[Explanation] ```r # install.packages("remotes") # remotes::install_github("Pozdniakov/rdracor") library(rdracor) # install.packages("tidyverse") library(tidyverse) ``` We loaded **libraries** that provide library-specific **functions** that can be used in our project and that are not available in what is called the base `R`. Before loading libraries, you must install the packages that contain them (installation is a one-time thing, loading needs to be done once per each coding session). - notice the hash `#` symbol - `R` uses it to signify comments, i.e. any line of text that should be ignored when you run the script - use comments to remind yourself of what you intended to express by the code ] .panel[.panel-name[Advantages of using libraries] The base `R` language is made of general purpose functions and objects. **Libraries** build upon the base `R` functions to provide an extra set of functions with some advantages: - they simplify things (you can use one library function instead of writing many lines of code that achieve the same goal) - they are likely to be well-tested (unlike your code) ] ] --- # First call to DraCor API .panelset[ .panel[.panel-name[Command] ```r dracor_metadata <- get_dracor_meta() ``` ```r class(dracor_metadata) ``` ``` ## [1] "dracor_meta" "data.frame" ``` ```r colnames(dracor_metadata) ``` ``` ## [1] "description" "uri" "title" "name" ## [5] "repository" "licence" "licenceUrl" "plays" ## [9] "characters" "male" "female" "text" ## [13] "sp" "stage" "updated" "wordcount.text" ## [17] "wordcount.sp" "wordcount.stage" ``` ] .panel[.panel-name[Explanation] ```r dracor_metadata <- get_dracor_meta() class(dracor_metadata) colnames(dracor_metadata) ``` - we create an object called `dracor_metadata` to which we assign the result of the function `get_dracor_meta()` - we check what kind of object the result is (what is its class) - the result is a table-like object of class `data.frame` and we can inspect the names of the table columns by calling the function `colnames()` ] .panel[.panel-name[Look inside the metadata] .font50[
] ] ] --- # Interact with the data .panelset[ .panel[.panel-name[Command] ```r dracor_metadata %>% pull(title) ``` ``` ## [1] "French Drama Corpus" "German Drama Corpus" ## [3] "Russian Drama Corpus" "Calderón Drama Corpus" ## [5] "Italian Drama Corpus" "Swedish Drama Corpus" ## [7] "Greek Drama Corpus" "German Shakespeare Drama Corpus" ## [9] "Shakespeare Drama Corpus" "Roman Drama Corpus" ## [11] "Alsatian Drama Corpus" "Spanish Drama Corpus" ## [13] "Bashkir Drama Corpus" "Tatar Drama Corpus" ``` ] .panel[.panel-name[Explanation] ```r dracor_metadata %>% pull(title) ``` - `pull()` function takes as it 1st argument an object of class `data.frame` (tabular data structure) and a column name as its 2nd argument - then it "pulls" the specified column out from the `data.frame` - we "pipe" the object `dracor_metadata` into `pull()` function with the `%>%` pipe operator - the advantage of piping is that the code reads as a sequence of transformation steps - `dracor_metadata %>% pull(title)` is equivalent to `pull(dracor_metadata, title)`: .font80[ ```r pull(dracor_metadata, title) ``` ``` ## [1] "French Drama Corpus" "German Drama Corpus" ## [3] "Russian Drama Corpus" "Calderón Drama Corpus" ## [5] "Italian Drama Corpus" "Swedish Drama Corpus" ## [7] "Greek Drama Corpus" "German Shakespeare Drama Corpus" ## [9] "Shakespeare Drama Corpus" "Roman Drama Corpus" ## [11] "Alsatian Drama Corpus" "Spanish Drama Corpus" ## [13] "Bashkir Drama Corpus" "Tatar Drama Corpus" ``` ] ] ] --- # Explore the data .panelset[ .panel[.panel-name[Command] .pull-left[ ```r dracor_metadata %>% ggplot(aes(title, plays)) + geom_col() ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-10-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] ```r dracor_metadata %>% ggplot(aes(x = title, y = plays)) + geom_col() ``` - we will use the `ggplot2` data visualization library that is included in the `tidyverse` collection - the main function in the `ggplot2` library is called `ggplot()` and takes as its 1st argument a data object of class `data.frame` (here we supply the argument with the help of the pipe) and the 2nd argument is a function `aes()` which maps specified variables (column names) onto visualization parameters - most commonly the axes of the 2D charts > Take the table called `dracor_metadata` and turn it into a plot. Map the data values from the column `title` onto X-axis and the data values from the column `plays` onto Y-axis. Add (`+`) a visualization layer based on column geometry (constructed by `geom_col()` function). ] ] --- # Make the axes readable .panelset[ .panel[.panel-name[Command] .pull-left[ ```r dracor_metadata %>% ggplot(aes(x = title, y = plays)) + geom_col() + `coord_flip()` ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-13-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] ```r dracor_metadata %>% ggplot(aes(x = title, y = plays)) + geom_col() + coord_flip() ``` - `coord_flip()` function adds another layer to the visualization > Add another visualization layer that flips the coordinates, so that X becomes Y and Y becomes X (this is simply for better readability of text displayed on the axes). ] ] --- # Which country has the most theatrical plays? .panelset[ .panel[.panel-name[Consider] The data contains the most plays by French authors. But does it mean that they write the most? Could this be just an artifact of the data and how it was created? - Perhaps the team preparing the French corpus was the largest and could process most plays. - And many plays from the ancient Greece most likely have not even survived until today. - So asking which nation has the most theatrical plays is not a good question to ask of this dataset. - However, we can rephrase the question to make it more appropriate: on average, which plays are the longest in terms of their text content? ] .panel[.panel-name[Command] .pull-left[ ```r dracor_metadata %>% `mutate(normalized_text = wordcount.text/plays)` %>% ggplot(aes(x = title, y = `normalized_text`)) + geom_col() + coord_flip() ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-16-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] ```r dracor_metadata %>% mutate(normalized_text = wordcount.text/plays) %>% ggplot(aes(x = title, y = normalized_text)) + geom_col() + coord_flip() ``` - create a new variable by mutating the existing ones using the `mutate()` function - obtain `normalized_text` through dividing the variable ` wordcount.text` that contains word count by the variable `plays` that contains the number of plays - project the new variable on `normalized_text` Y-axis ] ] --- # Let's rank the data properly .panelset[ .panel[.panel-name[Command] .pull-left[ ```r dracor_metadata %>% mutate(normalized_text = wordcount.text/plays) %>% `mutate(title = fct_reorder(title, normalized_text)) %>%` ggplot(aes(x = title, y = normalized_text)) + geom_col() + coord_flip() ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] ```r dracor_metadata %>% mutate(normalized_text = wordcount.text/plays) %>% `mutate(title = fct_reorder(title, normalized_text))` %>% ggplot(aes(x = title, y = normalized_text)) + geom_col() + coord_flip() ``` - we again mutate the `dracor_metadata` object - this time, we convert the variable `title` into a factor (before, it was just a text string) - factors, or categories, are textual labels that can be attached to observations - for example, if you interview a sample of 100 people and record their gender as a binary category - `man` or `woman` - this information will not be treated as 100 text strings, but as 100 instances of 2 categories - categories can sometimes be ordered, for example, you can classify the highest attained education level as `primary`, `secondary`, `tertiary` - the ordering of categories is what we use here to order the national copora by their normalized word count - the function `fct_reorder()` does just that - it converts strings into factors puts them into specified order ] ] --- # Remove individuals .panelset[ .panel[.panel-name[Command] .pull-left[ ```r authors <- c("shake", "gersh", "cal") dracor_metadata %>% `filter(!name %in% authors) %>%` mutate(normalized_text = wordcount.text/plays) %>% mutate(title = fct_reorder(title, normalized_text)) %>% ggplot(aes(x = title, y = normalized_text)) + geom_col() + coord_flip() ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-22-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] ```r authors <- c("shake", "gersh", "cal") dracor_metadata %>% filter(!name %in% authors) %>% mutate(normalized_text = wordcount.text/plays) %>% mutate(title = fct_reorder(title, normalized_text)) %>% ggplot(aes(x = title, y = normalized_text)) + geom_col() + coord_flip() ``` - we combine three text strings that match the names of the copora into one vector - than we filter the dataset using the `filter()` function - the `%in%` operator finds an intersection between two objects - the `!` negates the expression > Combine the text strings into one object called `authors`. Take the `dracor_metadata` object and keep only those lines, where the variable `name` does not overlap with the object `authors`. ] .panel[.panel-name[Consider] Although the result of our code is a visualization, we have not been changing the visualization! **When you need to visualize data in a desired form, you must change the data.** ] ] --- # Gender perspective .panelset[ .panel[.panel-name[Command] .pull-left[ ```r authors <- c("shake", "gersh", "cal") dracor_metadata %>% filter(!name %in% authors) %>% `mutate(percentage_female = female/(male+female)) %>% ` mutate(title = fct_reorder(title, percentage_female)) %>% ggplot(aes(x = title, y = percentage_female)) + geom_col() + coord_flip() ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-25-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] ```r authors <- c("shake", "gersh", "cal") dracor_metadata %>% filter(!name %in% authors) %>% mutate(percentage_female = female/(male+female)) %>% mutate(title = fct_reorder(title, percentage_female)) %>% ggplot(aes(x = title, y = percentage_female)) + geom_col() + coord_flip() ``` - calculate the percentage of female characters from existing variables `male` and `female` - make sure to plot the new variable `percentage_female` ] ] --- # Add some color .panelset[ .panel[.panel-name[Command] .pull-left[ ```r authors <- c("shake", "gersh", "cal") dracor_metadata %>% filter(!name %in% authors) %>% mutate(percentage_female = female/(male+female)) %>% mutate(title = fct_reorder(title, percentage_female)) %>% select(title, male,female) %>% pivot_longer(cols = c(male, female), names_to = "gender", values_to = "no_of_characters") %>% ggplot(aes(x = title, y = no_of_characters, fill = gender)) + geom_col(position = "fill") + coord_flip() ``` ] .pull-right[ ![](data:image/png;base64,#dracor_tutorial_files/figure-html/unnamed-chunk-28-1.png)<!-- --> ] ] .panel[.panel-name[Explanation] .pull-left[ ```r authors <- c("shake", "gersh", "cal") dracor_metadata %>% filter(!name %in% authors) %>% mutate(percentage_female = female/(male+female)) %>% mutate(title = fct_reorder(title, percentage_female)) %>% select(title, male,female) %>% pivot_longer(cols = c(male, female), names_to = "gender", values_to = "no_of_characters") %>% ggplot(aes(x = title, y = no_of_characters, fill = gender)) + geom_col(position = "fill") + coord_flip() ``` ] .pull-right[ - we still use the variable `percentage_female` to transform `title` into ordered factor, but we no longer visualize it - for this visualization, we need only three columns from the dataset `title`, `male`, `female`, which we can separate from the rest using the `select()` function - we are no longer interested in just the percentage of female characters, now we want to see the distribution of both genders - to change the visualization, we again reshape the data by pivoting the table with `pivot_longer()` function ] ] ] --- # Final remarks - the code can get complex quickly, but each step is clearly defined - coding is not easy - when preparing these few lines of code, I frequently consulted documentation or googled "how to..." questions - but at every point, you only need to figure out the next step - there are many resources out there!