capesR is an R package designed to facilitate access to and manipulation of data from the Catalog of Theses and Dissertations maintained by the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES). This catalog contains information about theses and dissertations defended at higher education institutions (HEIs) in Brazil.
The original CAPES data is available at dadosabertos.capes.gov.br.
The data used in this package is available in the Open Science Framework (OSF) repository.
You can install this package directly from GitHub with:
# Install the remotes package if not already installed
install.packages("remotes")
# Install capesR from GitHub
remotes::install_github("hugoavmedeiros/capesR")
The download_capes_data function allows you to download CAPES data files hosted on OSF. You can specify the desired years, and the corresponding files will be saved locally.
Download data using the temporary directory (default):
library(capesR)
library(dplyr)
# Download data for the years 1987 and 1990
capes_files <- download_capes_data(c(1987, 1990))
# View the list of downloaded files
capes_files %>% glimpse()
In this case, the files are saved to the temporary directory and will not persist after the R session ends.
It is recommended to define a persistent directory to store the downloaded data instead of using the default temporary directory (tempdir()). This allows you to reuse the data later.
# Define the directory where the data will be stored
data_directory <- "/capes_data"
# Download data for 1987 and 1990 using a persistent directory
capes_files <- download_capes_data(
  c(1987, 1990),
  destination = data_directory
)
In this case, data will only be downloaded once. Future calls will identify which files already exist and return their paths.
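The download-once behavior described above can be pictured with a minimal base R sketch. This is not the package's implementation: fetch_if_missing, fake_fetch, and the capes_<year>.parquet file naming are hypothetical and exist only for illustration. The sketch checks which target files already exist and fetches only the missing ones.

```r
# Hypothetical helper, NOT part of capesR. The file naming scheme below
# is an assumption made only for this illustration.
fetch_if_missing <- function(years, destination, fetch_fn) {
  dir.create(destination, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(destination, paste0("capes_", years, ".parquet"))
  # Flag the files that are not yet on disk
  missing <- !file.exists(paths)
  # Fetch only the missing files
  for (i in which(missing)) {
    fetch_fn(years[i], paths[i])
  }
  data.frame(
    year = years,
    file = paths,
    downloaded_now = missing,
    stringsAsFactors = FALSE
  )
}

# Stand-in for a real download: writes a small placeholder file
fake_fetch <- function(year, path) writeLines(as.character(year), path)

# A fresh, unique directory so the first call always starts empty
cache_dir <- tempfile("capes_cache_")

first  <- fetch_if_missing(c(1987, 1990), cache_dir, fake_fetch)
second <- fetch_if_missing(c(1987, 1990), cache_dir, fake_fetch)
```

Here the first call fetches both files, while the second call finds them on disk and only returns their paths.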
Use the read_capes_data function to combine the downloaded files from a list generated by download_capes_data or manually created.
# Combine all selected data without filters
combined_data <- read_capes_data(capes_files)
# View the combined data
combined_data %>% glimpse()
Filters are applied before reading the data, improving performance.
# Create a filter object
exact_filter <- list(
  ano_base = c(2021, 2022),
  uf = c("PE", "CE")
)
# Combine filtered data
filtered_data <- read_capes_data(capes_files, exact_filter)
# View the filtered data
filtered_data %>% glimpse()
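One way to read the filter list above is: keep a row only if, for every named column, its value is one of the listed values. The base R sketch below illustrates that reading on a small made-up data frame. It assumes this is how exact filters combine (any value within a column, all columns together). apply_exact_filter is a hypothetical helper, not a capesR function, and it filters data already in memory rather than before reading.

```r
# Hypothetical helper, NOT part of capesR: applies a named list of
# allowed values to a data frame that is already in memory.
apply_exact_filter <- function(df, filter) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filter)) {
    # A row survives only if its value is among the allowed ones
    keep <- keep & df[[col]] %in% filter[[col]]
  }
  df[keep, , drop = FALSE]
}

# Made-up rows for illustration only
toy <- data.frame(
  ano_base = c(2020, 2021, 2022, 2022),
  uf = c("PE", "PE", "SP", "CE"),
  stringsAsFactors = FALSE
)

toy_filtered <- apply_exact_filter(
  toy,
  list(ano_base = c(2021, 2022), uf = c("PE", "CE"))
)
```

In this toy case only the 2021/PE and 2022/CE rows remain: the 2020 row fails the year condition and the SP row fails the state condition.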
Exact filters are applied before reading for performance, and the text filter is optimized for quick searches.
# Create a filter object
text_filter <- list(
  ano_base = c(2018, 2019, 2020, 2021, 2022),
  uf = c("PE", "CE"),
  titulo = "Educação"
)
# Combine filtered data
text_filtered_data <- read_capes_data(capes_files, text_filter)
# View the filtered data
text_filtered_data %>% glimpse()
To search for text in already combined data, use the search_capes_text function, specifying the term and the text field (e.g., title, abstract, author, or advisor).
results <- search_capes_text(
  data = combined_data,
  term = "Educação",
  field = "titulo"
)
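For intuition, a term search over one text column can be sketched in base R as a substring match. This is only an illustration on made-up titles. search_text_sketch is a hypothetical helper, and the exact matching rules of search_capes_text (case or accent handling, for instance) are not described here, so the sketch simply assumes a literal, case-sensitive match.

```r
# Hypothetical helper, NOT part of capesR: keeps rows whose text field
# contains the term as a literal, case-sensitive substring.
search_text_sketch <- function(data, term, field) {
  hits <- grepl(term, data[[field]], fixed = TRUE)
  data[hits, , drop = FALSE]
}

# Made-up titles for illustration only
toy_titles <- data.frame(
  titulo = c(
    "Educação e tecnologia",
    "Saúde pública no Nordeste",
    "Políticas de Educação básica"
  ),
  stringsAsFactors = FALSE
)

toy_results <- search_text_sketch(
  data = toy_titles,
  term = "Educação",
  field = "titulo"
)
```

With these toy rows, the first and third titles match and the second is dropped.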
The package also provides a set of synthetic data, capes_synthetic_df, which contains aggregated information from the CAPES Catalog of Theses and Dissertations. These synthetic data simplify quick analyses and prototyping without requiring full data downloads and processing.
The synthetic data includes the following columns:
The synthetic data is available directly in the package and can be loaded with:
data(capes_synthetic_df)
# View the first rows of the data
head(capes_synthetic_df)
You can use the synthetic data for quick exploratory analyses or charts:
# Load the data
data(capes_synthetic_df)
# Example: Count by year and type of work
library(dplyr)
capes_synthetic_df %>%
  group_by(base_year, type) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))