Friday, September 26, 2025

R Learning - Part 1

What do the statements below do in R?

# Set columns in right format

Countries$Country.Name = as.factor(Countries$Country.Name)

Countries$Country.Code = as.factor(Countries$Country.Code)


The as.factor() function is used to convert a column from one data type (likely a character string) into a factor.


Countries$Country.Name = as.factor(...): This line takes the values in the Country.Name column and converts them into an R factor. In R, factors are used to store categorical data (data that falls into distinct groups), such as country names. After the conversion, R treats each unique country name as a distinct level. The levels are unordered, because as.factor() does not create an ordered factor.


Countries$Country.Code = as.factor(...): Similarly, this line converts the Country.Code column into a factor. This is appropriate because country codes (e.g., "USA", "CAN", "IND") are also categorical identifiers.


Why is this done?

Converting to a factor is useful before generating summaries or fitting statistical models. It tells R that these columns hold distinct groups rather than continuous numeric values or free text, so summary() reports counts per level and modelling functions treat the column as categorical.
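The effect of the conversion can be checked on a small example. This is only a sketch: the data frame below is a made-up stand-in for Countries (the real one comes from the CSV file), and the duplicated "Brazil" row is there just to make the counts visible.

```r
# Hypothetical stand-in for the Countries data frame (values are made up)
Countries <- data.frame(
  Country.Name = c("Aruba", "Brazil", "Canada", "Brazil"),
  Country.Code = c("ABW", "BRA", "CAN", "BRA"),
  stringsAsFactors = FALSE
)

class(Countries$Country.Name)       # "character" before conversion

# Set columns in right format
Countries$Country.Name = as.factor(Countries$Country.Name)
Countries$Country.Code = as.factor(Countries$Country.Code)

class(Countries$Country.Name)       # "factor" after conversion
levels(Countries$Country.Name)      # the unique values: "Aruba" "Brazil" "Canada"
is.ordered(Countries$Country.Name)  # FALSE: as.factor() gives an unordered factor
summary(Countries$Country.Name)     # counts per level (Brazil appears twice)
```

On a character column, summary() only reports the length and type, so the per-level counts are one visible benefit of the factor conversion.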



What does the code below do in R? Is First.Countries.Pop a variable being assigned? What does the . in between represent? A space?



# Focus on first rows


First.Countries.Pop = head(Countries$Total.Population.2017,10)


First.Countries.Pop


 


# Create logical vector


Big.First.Countries.Flag = First.Countries.Pop>10000000 ## Condition: More than 10 million


Big.First.Countries.Flag



# Focus on first rows

First.Countries.Pop = head(Countries$Total.Population.2017, 10)

First.Countries.Pop


head(): This is a function that returns the first part of an object. In this case, head() is getting the first 10 values from the Countries$Total.Population.2017 column.


First.Countries.Pop is a variable created by assignment. The = sign is an assignment operator in R (the <- operator does the same job and is the more common convention). It takes the result of the head() function (a vector of the first 10 population values) and stores it in a new variable named First.Countries.Pop. The second line simply prints the contents of this new variable.


The dot (.) in the variable name: In this context, the dot does not stand for a space and has no special function in R's syntax. It is simply part of the variable name itself, since R names cannot contain spaces. Programmers often use dots or underscores to separate words in variable names for readability.
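A minimal sketch of these three points (head(), assignment, and dots in names) follows. The population values are made up and stand in for Countries$Total.Population.2017; the names Same.Result and First_Countries_Pop are hypothetical and exist only for this illustration.

```r
# Hypothetical population values standing in for Countries$Total.Population.2017
Total.Population.2017 <- c(105264, 35530081, 29784193, 2873457, 76965, 9400145,
                           44271041, 2930450, 55641, 102012, 24598933, 8809212)

# head(x, 10) keeps the first 10 of the 12 elements; = and <- both assign
First.Countries.Pop = head(Total.Population.2017, 10)
Same.Result <- head(Total.Population.2017, 10)

length(First.Countries.Pop)                  # 10
identical(First.Countries.Pop, Same.Result)  # TRUE

# The dot is an ordinary character in the name, just like the underscore here
First_Countries_Pop = First.Countries.Pop
identical(First_Countries_Pop, First.Countries.Pop)  # TRUE
```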



# Create logical vector

Big.First.Countries.Flag = First.Countries.Pop > 10000000

Big.First.Countries.Flag


What it does: The > operator compares each element of First.Countries.Pop with 10000000, and the result is a new vector made up of TRUE and FALSE values. For every country with a population over 10 million, the vector will contain a TRUE. For those with a population less than or equal to 10 million, it will contain a FALSE. A missing population (NA) yields NA. This type of vector is known as a logical vector.


Big.First.Countries.Flag: This is another variable created by assignment. The Flag suffix is a common naming convention in programming to indicate that the variable acts as a boolean indicator for a specific condition.


In summary, the code first isolates the population data for the first 10 countries and then creates a logical vector of TRUE/FALSE values to easily identify which of those countries meet a specific population threshold.
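To see what such a flag vector is good for, here is a small sketch with five made-up population values (one of them missing) in place of the real data. It shows the comparison itself and two common follow-ups, counting and filtering.

```r
# Hypothetical populations for five countries (made-up values, one missing)
First.Countries.Pop <- c(105264, 35530081, 29784193, NA, 9400145)

# Create logical vector: the comparison runs element by element
Big.First.Countries.Flag = First.Countries.Pop > 10000000
Big.First.Countries.Flag
# FALSE  TRUE  TRUE    NA FALSE

class(Big.First.Countries.Flag)                       # "logical"
sum(Big.First.Countries.Flag, na.rm = TRUE)           # 2, because TRUE counts as 1
First.Countries.Pop[which(Big.First.Countries.Flag)]  # 35530081 29784193
```

which() is used in the last line because it drops the NA position; indexing with the raw logical vector would return an extra NA element.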



In the statements below, where does read_csv come from? What is make.names? What is colnames?


## Import dataset

library(readr)

Countries = read_csv("Countries Population.csv")

 


## Get info about functions

?colnames

?make.names

 


## Give data frame's columns proper names

colnames(Countries) = make.names(colnames(Countries))

colnames(Countries)



1. Where does read_csv come from?

The function read_csv() is provided by the readr package.


library(readr): This line is necessary because it loads the readr package into your current R session. Once loaded, the functions the package exports (including read_csv(), read_tsv(), etc.) become available for use. The package must already be installed, for example with install.packages("readr").


What it does: read_csv("Countries Population.csv") reads a Comma Separated Values (CSV) file and returns a tibble, a variant of the data frame. It's often preferred over R's base function read.csv() because it is usually faster on large files and never converts character columns into factors on its own (read.csv() did so by default before R 4.0.0). It also leaves column names as written in the file, which gives the user more control over names and data types.


Countries = ...: This assigns the resulting data frame (the contents of the CSV file) to the variable named Countries.
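The import step can be tried without the original file. The sketch below writes a tiny made-up CSV to a temporary file (the header names and values are assumptions, not the real dataset), reads it with base R, and then reads it with readr only if that package happens to be installed.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country Name,Country Code,Total Population 2017",
             "Aruba,ABW,105264",
             "Brazil,BRA,209288278"), csv_path)

# Base R reader: always available; check.names = FALSE keeps the headers untouched
Base.Countries <- read.csv(csv_path, check.names = FALSE)
colnames(Base.Countries)   # headers still contain spaces

# readr reader: this part only runs if the readr package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  Countries <- readr::read_csv(csv_path)
  print(class(Countries))      # a tibble, which is also a data.frame
  print(colnames(Countries))   # headers kept as written, spaces included
}
```

Calling readr::read_csv() with the package prefix is equivalent to library(readr) followed by read_csv(); the prefix is used here only so the check for the package can come first.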



2. What is make.names()?

The make.names() function is a base R function, meaning it is always available in your session without needing to load a package. It is used to automatically clean up character strings so they are valid and safe to use as variable or column names in R.


Why it is needed: R has rules for what constitutes a syntactically valid name. A name cannot start with a number, contain spaces, or include most punctuation (like hyphens, parentheses, etc.); only letters, digits, dots, and underscores are allowed.


How it works:


It converts spaces and other illegal characters into dots (.).


If a name starts with a number, it will prepend an X.


It does not make names unique by default. When called with unique = TRUE, it appends a numeric suffix to duplicates (e.g., Column, Column.1, Column.2).
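The rules above can be checked directly on a few strings. The names in raw_names are invented for illustration and are not the real column headers; note that the duplicate is only renamed when unique = TRUE is passed.

```r
# Hypothetical column headers, including an invalid start and a duplicate
raw_names <- c("Country Name", "Total Population (2017)", "2017 GDP", "Country Name")

# Default behaviour: invalid characters become dots, a leading digit gets an X
make.names(raw_names)
# "Country.Name" "Total.Population..2017." "X2017.GDP" "Country.Name"

# unique = TRUE additionally renames the duplicate
make.names(raw_names, unique = TRUE)
# "Country.Name" "Total.Population..2017." "X2017.GDP" "Country.Name.1"
```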


In your code:


colnames(Countries): This function first returns the current names of all columns in the Countries data frame (e.g., it might return a vector like c("Country Name", "Total Population (2017)")).


make.names(colnames(Countries)): make.names() takes that vector of names and converts them into safe names (e.g., "Country Name" becomes "Country.Name", and "Total Population (2017)" might become "Total.Population..2017.").


colnames(Countries) = ...: Finally, the assignment operator (=) replaces the original column names with the new, standardized, dot-separated names.



3. What is colnames()?

The colnames() function is another base R function used to interact with the column names of a data frame or matrix.


When used alone (as a getter): colnames(Countries) returns a character vector listing all the names of the columns in the Countries data frame.


When used with assignment (as a setter): colnames(Countries) = new_names is used to replace the existing column names with a new vector of names (new_names), as demonstrated when it is combined with make.names().


The last line of your code, colnames(Countries), simply prints the final, cleaned column names to the console to confirm the make.names() operation was successful.
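The getter and setter forms can be seen side by side in a short sketch. The data frame and its placeholder names a and b are hypothetical, chosen only so the renaming is easy to follow.

```r
# Hypothetical data frame with placeholder column names
Countries <- data.frame(a = c("Aruba", "Brazil"), b = c(105264, 209288278))

colnames(Countries)   # getter: "a" "b"

# setter: replace the names with headers as they might appear in a CSV file
colnames(Countries) = c("Country Name", "Total Population 2017")

# Give data frame's columns proper names
colnames(Countries) = make.names(colnames(Countries))
colnames(Countries)   # "Country.Name" "Total.Population.2017"

# With valid names, the columns can be reached with $ and no backticks
Countries$Total.Population.2017
```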

