Machine learning and gene regulation

  • High-throughput experiments such as next-generation sequencing are often used to answer simple biological questions, for example “which genes are more highly expressed in breast cancer than in normal tissue?”. Given the huge amount of information generated by each experiment, this is like having privileged access to an oracle and asking “what time is it?”. Machine learning is an excellent tool for discovering hidden information in large amounts of data. It not only allows life scientists to get better answers but also to generate novel hypotheses. Our lab looks for opportunities in medical and fundamental biology data where information theory and machine learning can make a substantial impact. A few examples of our discoveries include using Shannon’s entropy to discover transcriptional disorder in cancer (PLoS CB, 2008), simulating a biologist’s behavior to identify a method to detect microRNA targets (Nature Methods, 2009) and using novel bioinformatics strategies to discover the impact of introns on gene expression (Cell, 2013; Genome Biology, 2017; Nature Communications, 2017).
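As a rough illustration of the entropy idea mentioned above, the sketch below computes Shannon's entropy for a vector of expression values. It assumes the values are non-negative and are normalized into a probability distribution before the entropy is taken; the function name `shannon_entropy` and the toy profiles are hypothetical, and this is not presented as the method of the cited 2008 paper.

```python
import math


def shannon_entropy(values):
    """Return the Shannon entropy (in bits) of non-negative values.

    The values are first normalized so that they sum to 1 (an assumption
    made for this sketch); zero entries contribute nothing to the sum.
    """
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("values must contain at least one positive entry")
    entropy = 0.0
    for v in values:
        if v > 0:
            p = v / total
            entropy -= p * math.log2(p)
    return entropy


# Toy expression profiles (made-up numbers, for illustration only).
uniform_profile = [10, 10, 10, 10]  # expression spread evenly
peaked_profile = [97, 1, 1, 1]      # expression dominated by one entry

uniform_entropy = shannon_entropy(uniform_profile)
peaked_entropy = shannon_entropy(peaked_profile)
```

An evenly spread profile yields a higher entropy than a profile dominated by a single entry, which is the sense in which entropy can serve as a measure of disorder.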


    Artificial Intelligence in Biology and Health, October 2018: symposium page with videos of all the speakers


    William Ritchie
    Aubin Thomas
    Sylvain Barriere
    Audrey Winter
    Bandana Kumari
    Lucile Broseus
    Claudio Lorenzi


    iMOKA: k-mer based software to analyze large collections of sequencing data

    Claudio Lorenzi, Sylvain Barriere, Jean-Philippe Villemin, Laureline Dejardin Bretones, Alban Mancheron, William Ritchie

    2020 - Genome Biology


    TALC: Transcription Aware Long Read Correction

    Lucile Broseus, Aubin Thomas, Andrew J Oldfield, Dany Severac, Emeric Dubois, William Ritchie

    2020 - Bioinformatics


    Exploring the Roles of CREBRF and TRIM2 in the Regulation of Angiogenesis by High-Density Lipoproteins.

    Wong NKP, Cheung H, Solly EL, Vanags LZ, Ritchie W, Nicholls SJ, Ng MKC, Bursill CA, Tan JTM

    2018 - Int J Mol Sci, 19(7)

    PMID: 29958463

    An NF90/NF110-mediated feedback amplification loop regulates dicer expression and controls ovarian carcinoma progression.

    Barbier J, Chen X, Sanchez G, Cai M, Helsmoortel M, Higuchi T, Giraud P, Contreras X, Yuan G, Feng Z, Nait-Saidi R, Deas O, Bluy L, Judde JG, Rouquier S, Ritchie W, Sakamoto S, Xie D, Kiernan R

    2018 - Cell Res, 28(5):556-571

    PMID: 29563539


  • Ultra-long sequencing to detect cancer-associated intron retention

    Intron retention (IR) occurs when an intron is included in a mature mRNA. Previously regarded as a by-product of faulty splicing, transcripts with retained introns are often rapidly degraded by a surveillance mechanism called nonsense-mediated decay (NMD). We discovered that numerous cell types make use of this mechanism by increasing the number of transcripts with retained introns that are targeted for degradation, as seen in granulopoiesis (Cell, 2013), pluripotent stem cells (Nature, 2014) and erythrocyte differentiation (Blood, 2016). IR was recently found to have a major role in modulating tumour suppressor genes in hundreds of different cancers (Nature Genetics, 2015). However, because IR could not previously be identified reliably, numerous studies have overlooked potential biomarkers and therapeutic targets linked to this novel type of gene regulation. In this project we will combine new long-read RNA sequencing with classical Illumina sequencing to define IR with unprecedented accuracy. This will enable us to define the IR features that contribute to normal development and disease.

    Programming genetic networks to extract hidden information in sequencing data

    Advances in next-generation sequencing (NGS) methods have revealed that transcription is more pervasive, more diverse and more cryptic than expected. Despite this heterogeneity and despite our incomplete understanding of transcript architecture, bioinformatics analyses of these data are frequently initiated through a common, biased procedure: the reads are mapped to a reference genome or transcriptome. This step does not account for major changes in the genome or transcriptome, as can occur in multiple cancers, nor does it account for the small sequence variations that are common between individuals. As a result, only a portion of the transcriptional information measured by NGS is used to discover meaningful signatures between different biological samples.
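One way to picture an analysis that skips the mapping step is to compare samples directly on the short subsequences (k-mers) found in their reads, in the spirit of the k-mer based software listed in the publications above. The sketch below is only a toy illustration under that assumption: the function names `count_kmers` and `sample_specific_kmers`, the choice of k, and the example reads are all hypothetical, and it does not reproduce any published tool.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in a collection of reads, without any reference.

    K-mers containing the ambiguous base 'N' are skipped.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def sample_specific_kmers(counts_a, counts_b):
    """Return the k-mers observed in sample A but never in sample B."""
    return {kmer for kmer in counts_a if kmer not in counts_b}


# Made-up reads: the second sample carries one base change (T -> A).
normal_reads = ["ACGTACGT"]
tumour_reads = ["ACGAACGT"]

normal_counts = count_kmers(normal_reads, k=4)
tumour_counts = count_kmers(tumour_reads, k=4)
tumour_only = sample_specific_kmers(tumour_counts, normal_counts)
```

Because no reference is involved, a sequence variation shows up as a set of k-mers present in one sample and absent from the other, rather than as reads that fail to map or map poorly.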