Introduction

In bioinformatics, most tasks can be done in one of two programming languages: R or Python.

Python is a high-level programming language known for its "easy to read and learn" format. It can handle tasks ranging from automation to data science and machine learning.

R, on the other hand, is a programming language made for statistics and data processing. Over the past few years, the bioinformatics community has adopted R as its number one language for releasing new packages, partly because of Bioconductor (a collection of mature libraries for next-generation sequencing analysis) and the ggplot2 library for advanced plotting.

In my own experience with bioinformatics, I do most of my data wrangling and manipulation in Python, and my endpoint analysis and plotting in R. R can do almost anything that Python can in terms of statistics, and even more. My only problem with R is its sometimes unintuitive syntax (<-, %>%, variable$attribute), but there is a way around this: the rpy2 library in Python.

In this post, I want to show an example of using this library for bioinformatics.

Setting up the environment

1. Download the Anaconda package
2. Download Python 3 and R
3. Install rpy2
4. Open up Jupyter Notebook

Downloading the Data

For this demonstration, we are going to use data from the 1000 Genomes Project.
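Because a notebook like this tends to be re-run many times, it may be worth wrapping the download in a small helper that skips the fetch when the file is already on disk. The sketch below uses only the standard library. The name `fetch_once` is hypothetical, chosen for illustration, and is not part of the original notebook.

```python
from pathlib import Path
import urllib.request


def fetch_once(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless a non-empty `dest` already exists.

    `fetch_once` is a hypothetical helper name, not part of the original post.
    """
    path = Path(dest)
    if path.exists() and path.stat().st_size > 0:
        # A non-empty file is already on disk, so skip the network call.
        return path
    # urlretrieve copies the resource at `url` into the local file `path`.
    urllib.request.urlretrieve(url, str(path))
    return path
```

It would be called with the URL and local filename of the sample-info CSV. For a one-off run, either of the download commands in this section works just as well.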
To download the data for the project, use the wget command below:

Linux:

    # Get the data
    !wget https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv

If you are on Windows or Mac, the wget command won't work, so use the urllib library instead:

Windows/Mac:

    import urllib.request

    url = 'https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv'
    # Same file name that wget produces and that read_csv opens below
    filename = '1000-genomes_other_sample_info_sample_info.csv'
    urllib.request.urlretrieve(url, filename)

Converting Python DataFrame to R DataFrame

Let's start by importing the libraries needed to open the data and convert the Python DataFrame to an R DataFrame:

    # Import libraries
    import pandas as pd
    import rpy2.robjects as ro
    from rpy2.robjects.packages import importr
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

Now let's open the file that we have downloaded using the pandas read_csv function:

    df = pd.read_csv('1000-genomes_other_sample_info_sample_info.csv')
    df.head()

Using the .head() method, we can view the first few rows of the data. Let's double check that this is a pandas DataFrame before we continue:

    print(type(df))
    # Expected output: <class 'pandas.core.frame.DataFrame'>

The conversion from a pandas DataFrame to an R DataFrame has trouble encoding float and integer values, so let's start by converting every value in the DataFrame to a string.
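As a small-scale illustration of that string conversion, here is a sketch on a toy DataFrame. The column names and values are invented for the example, and `astype(str)` is used as a stand-in for the element-wise `applymap(str)` call, on the assumption that the two behave the same for data without missing values.

```python
import pandas as pd

# Toy stand-in for the sample-info table; columns and values are invented.
toy = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S3"],
        "Passed_QC": [1, 0, 1],           # integer column
        "Coverage": [30.5, 12.0, 8.25],   # float column
    }
)

# Convert every column to strings in one call.
toy_str = toy.astype(str)

# Integers and floats now travel as text.
print(toy_str.iloc[0].tolist())  # ['S1', '1', '30.5']

# Every cell is a Python str after the conversion.
print(all(isinstance(v, str) for v in toy_str.to_numpy().ravel()))  # True
```

The numbers survive only as text ('1', '30.5'), which is why the numeric columns have to be converted back on the R side later. The same conversion applied to the real DataFrame follows.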
    # Convert every value in df to a string
    # (recent pandas versions deprecate applymap; df.astype(str) is an alternative)
    df = df.applymap(str)
    # Double check that the values have changed
    print(type(df['In_Final_Phase_Variant_Calling'][0]))
    # Expected output: <class 'str'>

Next, let's convert the pandas DataFrame to an R DataFrame using localconverter from the rpy2 library:

    # Conversion
    with localconverter(ro.default_converter + pandas2ri.converter):
        r_df = ro.conversion.py2rpy(df)
    # Check if the conversion worked
    print(type(r_df))
    # Expected output: <class 'rpy2.robjects.vectors.DataFrame'>

Converting Datatypes using rpy2

Before we move forward, let's get a better understanding of our data:

    print(f'This dataframe has {r_df.ncol} columns and {r_df.nrow} rows\n')
    print(r_df.colnames)

Here we are just printing the size of the DataFrame and the names of its columns. The output should look like this:

    This dataframe has 62 columns and 3500 rows

     [1] "Sample"                              "Family_ID"
     [3] "Population"                          "Population_Description"
     [5] "Gender"                              "Relationship"
     [7] "Unexpected_Parent_Child"             "Non_Paternity"
     [9] "Siblings"                            "Grandparents"
    [11] "Avuncular"                           "Half_Siblings"
    [13] "Unknown_Second_Order"                "Third_Order"
    [15] "In_Low_Coverage_Pilot"               "LC_Pilot_Platforms"
    [17] "LC_Pilot_Centers"                    "In_High_Coverage_Pilot"
    [19] "HC_Pilot_Platforms"                  "HC_Pilot_Centers"
    [21] "In_Exon_Targetted_Pilot"             "ET_Pilot_Platforms"
    [23] "ET_Pilot_Centers"                    "Has_Sequence_in_Phase1"
    [25] "Phase1_LC_Platform"                  "Phase1_LC_Centers"
    [27] "Phase1_E_Platform"                   "Phase1_E_Centers"
    [29] "In_Phase1_Integrated_Variant_Set"    "Has_Phase1_chrY_SNPS"
    [31] "Has_phase1_chrY_Deletions"           "Has_phase1_chrMT_SNPs"
    [33] "Main_project_LC_Centers"             "Main_project_LC_platform"
    [35] "Total_LC_Sequence"                   "LC_Non_Duplicated_Aligned_Coverage"
    [37] "Main_Project_E_Centers"              "Main_Project_E_Platform"
    [39] "Total_Exome_Sequence"                "X_Targets_Covered_to_20x_or_greater"
    [41] "VerifyBam_E_Omni_Free"               "VerifyBam_E_Affy_Free"
    [43] "VerifyBam_E_Omni_Chip"               "VerifyBam_E_Affy_Chip"
    [45] "VerifyBam_LC_Omni_Free"              "VerifyBam_LC_Affy_Free"
    [47] "VerifyBam_LC_Omni_Chip"              "VerifyBam_LC_Affy_Chip"
    [49] "LC_Indel_Ratio"                      "E_Indel_Ratio"
    [51] "LC_Passed_QC"                        "E_Passed_QC"
    [53] "In_Final_Phase_Variant_Calling"      "Has_Omni_Genotypes"
    [55] "Has_Axiom_Genotypes"                 "Has_Affy_6_0_Genotypes"
    [57] "Has_Exome_LOF_Genotypes"             "EBV_Coverage"
    [59] "DNA_Source_from_Coriell"             "Has_Sequence_from_Blood_in_Index"
    [61] "Super_Population"                    "Super_Population_Description"

Now we need to perform some data cleanup. For example, some columns should be interpreted as integers or floats, but they are read as strings. Let's start by defining the as_numeric and match functions:

    # Define the as_numeric function
    as_numeric = ro.r('as.numeric')
    # Define the match function
    match = ro.r.match

Functions:

- as_numeric: converts strings to numeric (floating-point) values
- match: works like Python's list.index method, but returns a 1-based position

Since there are a lot of columns that need to be converted into numbers, let's store those column names in a list and iterate through it:

    # Columns
    numCol = [
        'Has_Sequence_in_Phase1', 'In_Phase1_Integrated_Variant_Set',
        'Has_Phase1_chrY_SNPS', 'Has_phase1_chrY_Deletions',
        'Has_phase1_chrMT_SNPs', 'LC_Passed_QC', 'E_Passed_QC',
        'In_Final_Phase_Variant_Calling', 'Has_Omni_Genotypes',
        'Has_Axiom_Genotypes', 'Has_Affy_6_0_Genotypes',
        'Has_Exome_LOF_Genotypes', 'Has_Sequence_from_Blood_in_Index',
        'Total_LC_Sequence', 'LC_Non_Duplicated_Aligned_Coverage',
        'Total_Exome_Sequence', 'X_Targets_Covered_to_20x_or_greater',
        'VerifyBam_E_Omni_Free', 'VerifyBam_E_Affy_Free',
        'VerifyBam_E_Omni_Chip', 'VerifyBam_E_Affy_Chip',
        'VerifyBam_LC_Omni_Free', 'VerifyBam_LC_Affy_Free',
        'VerifyBam_LC_Omni_Chip', 'VerifyBam_LC_Affy_Chip',
        'LC_Indel_Ratio', 'E_Indel_Ratio', 'EBV_Coverage',
    ]

    # Loop
    for col in numCol:
        my_col = match(col, r_df.colnames)[0]  # returned as a vector
        # match gives a 1-based R position; rpy2 indexes from 0, hence my_col - 1
        print('Type of read count before as.numeric: %s' % r_df[my_col - 1].rclass[0])
        r_df[my_col - 1] = as_numeric(r_df[my_col - 1])
        print('Type of read count after as.numeric: %s' % r_df[my_col - 1].rclass[0])

The results should look like this (make sure all columns are converted to numbers):

    Type of read count before as.numeric: character
    Type of read count after as.numeric: numeric
    Type of read count before as.numeric: character
    Type of read count after as.numeric: numeric
    ...
    Type of read count before as.numeric: character
    Type of read count after as.numeric: numeric

Using ggplot2 to plot Data

Now let's use ggplot2 from the rpy2 library to plot our data. Let's start by making a bar graph of the number of people per population that participated in the 1000 Genomes Project:

    # Import ggplot2
    import rpy2.robjects.lib.ggplot2 as ggplot2
    from rpy2.robjects.functions import SignatureTranslatedFunction

    # Set theme
    ggplot2.theme = SignatureTranslatedFunction(
        ggplot2.theme, init_prm_translate={'axis_text_x': 'axis.text.x'})

    # Plot
    bar = (ggplot2.ggplot(r_df)
           + ggplot2.geom_bar()
           + ggplot2.aes_string(x='Population')
           + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1)))

    # Save to img
    ro.r.png('out.png', type='cairo-png')
    bar.plot()
    dev_off = ro.r('dev.off')
    dev_off()

This may look a little intimidating, but it's mainly just boilerplate code.

Code for plotting:

    bar = ggplot2.ggplot(r_df) + ggplot2.geom_bar() + ggplot2.aes_string(x='Population') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))

Format:

    variableName = ggplot2.ggplot(DATAFRAME) + ggplot2.GRAPH_TYPE() + ggplot2.aes_string(x='X-AXIS') + ggplot2.theme(ADJUST THE THEME HOWEVER NEEDED)

In the end, the graph is stored in a PNG file (out.png), but
you can also view the file in your Jupyter Notebook:

    from IPython.display import Image
    Image(filename='out.png')

[Graph: bar chart of sample counts per population]

Similarly, let's create a scatterplot to compare Total_LC_Sequence and LC_Non_Duplicated_Aligned_Coverage in relation to population and gender:

    # Plot
    pp = (ggplot2.ggplot(r_df)
          + ggplot2.aes_string(x='Total_LC_Sequence',
                               y='LC_Non_Duplicated_Aligned_Coverage',
                               col='Population', shape='Gender')
          + ggplot2.geom_point())

    # Save img
    ro.r.png('scatter.png', type='cairo-png')
    pp.plot()
    dev_off = ro.r('dev.off')
    dev_off()

    # View img
    Image(filename='scatter.png')

[Graph: scatterplot of Total_LC_Sequence against LC_Non_Duplicated_Aligned_Coverage, colored by population and shaped by gender]

We can also draw a box plot using the same data:

    # Plot
    bp = (ggplot2.ggplot(r_df)
          + ggplot2.aes_string(x='Total_LC_Sequence',
                               y='LC_Non_Duplicated_Aligned_Coverage',
                               fill='Population')
          + ggplot2.geom_boxplot())

    # Save img
    ro.r.png('box.png', type='cairo-png')
    bp.plot()
    dev_off = ro.r('dev.off')
    dev_off()

    # View img
    Image(filename='box.png')

[Graph: box plot of the same two columns, filled by population]

Completed Jupyter Notebook: Bioinformatics/rpy2.ipynb at main · VarunSendilraj/Bioinformatics (github.com)

Also published at https://varunsendilraj.medium.com/interfacing-r-using-python-for-bioinformatics-9387c17344bd