Country Lyrics Created with Recurrent Neural Networks





Motivation

I already found out that all country music sounds similar, and I was even able to generate some country music lyrics using a naive-Bayes approach. That approach is simple and probabilistic: it goes through the lyrics word by word and chooses each next word from the words that followed the current word in a database of scraped lyrics, weighted by how often they did so. For example, the word "hotel" was followed by "room" 3 times, and by "wall", "bed", "sign", and "with" once each. So, when the generator comes across the word "hotel", it makes a weighted random choice among those following words to pick the next word of the generated lyric. An example of its output is:


 teen havin fun there are times there’s at school football
 team and there was was custommade to do nothing before
 anywhere that it feels its the things that lullaby
 of the chance to sing stuff we ain’t funny more try

 couldve lied but when i was thinking bout as long
 as i was in her name in love me that
 looks like a different then well ive been grand daddy
 ain’t muddy and shes a little moments like hes alright now



We can see that it sort of sounds like a country song, but grammatically it doesn't quite make sense. Also, I have to manually specify the number of words in a line and the number of lines in a verse/chorus.
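For reference, the weighted next-word choice described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the lyrics shown: the function names `build_followers` and `next_word` are made up here, and only the "hotel" counts come from the text.

```python
import random
from collections import Counter, defaultdict


def build_followers(words):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def next_word(followers, word, rng=random):
    """Pick a following word at random, weighted by how often it was seen."""
    counts = followers[word]
    candidates = list(counts)
    weights = [counts[candidate] for candidate in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Follower counts for "hotel" as quoted in the text.
followers = defaultdict(Counter)
followers["hotel"].update({"room": 3, "wall": 1, "bed": 1, "sign": 1, "with": 1})

# "room" has weight 3 out of 7, so it is picked most often.
print(next_word(followers, "hotel", rng=random.Random(0)))
```

Because each choice looks only at the current word, nothing in this scheme tracks sentence structure or line length, which is consistent with the problems noted above.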

To tackle these problems, recurrent neural networks (RNNs) are a good choice. Like the artificial neural networks used for classification, they are able to generate their own features, but they differ in that they feed previous time steps into the current one. This additional feature of RNNs may enable them to pick up on the grammatical syntax of country music lyrics, as well as features pertaining to the number of lines and the chorus/verse structure.

The complete code can be found on GitHub here.


Method

The training data was scraped from the website country-lyrics.com using Python and the Beautiful Soup package. The website has 5369 song lyrics in total, which comes to about 5.5 MB of text data. This corresponds to about 6 million characters, which should be sufficient to train the model.


Python Webscraping Script

See here

# -*- coding: utf-8 -*-
"""
webscraper for country-lyrics.com
Created on Wed Jul 20 09:43:35 2016
@author: mmoocarme
"""

from bs4 import BeautifulSoup
from bs4.element import Comment
import re
import pickle
import requests

# = Helper Functions =========================================================
def visible(element):
    '''
    function to get main body of web page: returns False for text nodes
    inside non-content tags and for HTML comments, True otherwise
    '''
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif isinstance(element, Comment):
        # skip HTML comments
        return False
    return True

def strClean(strList):
    '''
    string cleaning function, gets only the lyrics of the text, and removes
    text enclosed in parentheses and square brackets
    '''
    # number of leading/trailing text nodes that belong to the page header/footer
    headerLength, footerLength = 220, 313
    strList = list(strList) # allow iterators such as the output of filter()
    subList = strList[headerLength:-footerLength] # country-lyrics.com
    cleanList = []
    for line in subList:
        noParen = re.sub(r'\([^)]*\)', '', line)
        noSquare = re.sub(r'\[[^\]]*\]', '', noParen)
        cleanList.append(noSquare)
    cleanSong = ' '.join(cleanList)
    return cleanSong

# ================================================================

# = Web scraper ==================================================

baseurl = requests.get('http://www.country-lyrics.com/')
soup = BeautifulSoup(baseurl.content, 'html.parser')

# = Get all links by artist first letter
# i.e. 'http://www.country-lyrics.com/artists/a'
artistAlphaLinks = []
for link in soup.find_all('a', href=True):
    if link.has_attr('href') and '.com/artists' in link['href']:
        #print link#['href']
        artistAlphaLinks.append(link['href'])
artistAlphaLinks = list(set(artistAlphaLinks)) # remove duplicates

# = Get all links of artists
artistLinks = []
for alphaLink in artistAlphaLinks:
    newurl = BeautifulSoup(requests.get(alphaLink).content, 'html.parser')
    for link in newurl.find_all('a', href=True):
        if link.has_attr('href') and '.com/lyrics' in link['href']:
            #print link['href']
            artistLinks.append(link['href'])
artistLinks = list(set(artistLinks)) # remove duplicates
print(len(artistLinks))


# = Get all links of songs
songLinks = []
for artistLink in artistLinks:
    newArtisturl = BeautifulSoup(requests.get(artistLink).content, 'html.parser')
    for link in newArtisturl.find_all('a', href=True):
        if link.has_attr('href') and '.com/lyrics/' in link['href']:
            #print link['href']
            songLinks.append(link['href'])
songLinks = list(set(songLinks)) # remove duplicates
print(len(songLinks))
# 5369 songs

songLyrics = []
for songLink in songLinks:
    songSoup = BeautifulSoup(requests.get(songLink).content, 'html.parser')
    songData = songSoup.find_all(string=True)
    # list() so that strClean can slice the result under Python 3
    songResult = list(filter(visible, songData))
    cleanLyrics = strClean(songResult)
    songLyrics.append(cleanLyrics)
#    print(cleanLyrics)## Extend -> one long list, append -> list length of songs

# dump in pickle file
with open("songLyrics_country-lytics-dot-com.p", "wb") as f:
    pickle.dump(songLyrics, f)

# output to txt file, one song per line
with open('songLyrics_country-lytics-dot-com.txt', 'w', encoding='utf8') as f:
    for ele in songLyrics:
        f.write(ele + '\n')



The model comes from Andrej Karpathy’s great char-rnn library for Lua/Torch. Recurrent neural networks feed the output of one time step back in as input to the next time step, which can be viewed as follows:
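In code, the recurrence might look like the minimal sketch below. It is a plain ("vanilla") RNN step written in pure Python for illustration only, not the char-rnn implementation; the two-character vocabulary, the weight values, and the function names `rnn_step` and `encode` are all assumptions made up for this example.

```python
import math


def one_hot(index, size):
    """Encode a character index as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w * xj for w, xj in zip(W_xh[i], x))
        total += sum(w * hk for w, hk in zip(W_hh[i], h_prev))
        h.append(math.tanh(total))
    return h


def encode(text, vocab, W_xh, W_hh, b_h):
    """Run the RNN over a string and return the final hidden state."""
    h = [0.0] * len(b_h)  # the hidden state starts at zero
    for ch in text:
        x = one_hot(vocab.index(ch), len(vocab))
        h = rnn_step(x, h, W_xh, W_hh, b_h)  # previous state feeds the next step
    return h


# Hypothetical toy weights: 2-character vocabulary, 2 hidden units.
VOCAB = "ab"
W_XH = [[0.5, -0.5], [0.1, 0.3]]
W_HH = [[0.2, 0.0], [0.0, 0.2]]
B_H = [0.0, 0.0]

# Both strings end in "b", but the hidden states differ because the
# history before the last character differs.
print(encode("ab", VOCAB, W_XH, W_HH, B_H))
print(encode("bb", VOCAB, W_XH, W_HH, B_H))
```

The point of the sketch is the `h = rnn_step(x, h, ...)` line: the state computed for one character is passed along to the next, which is what lets the network condition on earlier text.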


Ideally, we want the number of parameters in the model to be roughly the same as the number of characters in the training dataset, which is around 6 million. To get a first idea of how the model behaves, I trained one with around 500,000 parameters. I expect this model to underfit, but I can run it on my laptop in about half a day, so I'll take it.
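To get a feel for how model size relates to parameter count, the sketch below estimates the number of parameters in a stacked LSTM character model. It assumes a standard LSTM with four gates and one bias vector per gate, plus a linear output layer; the exact count reported by char-rnn may differ slightly depending on how biases are handled. The vocabulary size of 65 and the layer sizes are placeholder values, not the settings used for this post.

```python
def lstm_param_count(vocab_size, rnn_size, num_layers):
    """Approximate parameter count of a stacked LSTM character model."""
    total = 0
    input_size = vocab_size  # one-hot characters feed the first layer
    for _ in range(num_layers):
        # four gates, each with input weights, recurrent weights, and a bias
        total += 4 * (rnn_size * (input_size + rnn_size) + rnn_size)
        input_size = rnn_size  # deeper layers read the layer below
    # output layer maps the top hidden state back to character scores
    total += rnn_size * vocab_size + vocab_size
    return total


# Hypothetical settings: 65 distinct characters, 2 layers, varying width.
for size in (128, 256, 512):
    print(size, lstm_param_count(vocab_size=65, rnn_size=size, num_layers=2))
```

Since the recurrent weights grow with the square of the hidden size, doubling the layer width roughly quadruples the parameter count, so reaching millions of parameters takes only a few doublings.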

Every 1000 iterations the model outputs a checkpoint that can be used to see how the model is progressing. We can look at the output after 5000 iterations:

Hey I'm canyed and the better wey the line
You don't do that we're and sang
How it wint that they wonder a one
I tell you and in the lool like the beas
I don't know you don't know a beaked
And wourd you say

And I’m no love and doing me
Gord a a coming be wound
How not you deally tong the hought and don't morning
When it ain't hold and a great over a pearing we pet amed
The pilling chinder
To let it always me

We like call for the gonna my best now
I plight the big in of the goon
I can like the peonbing now
They was all it wantin'
I would think you hear it moon and you could the day
On the will
Now you want to go this chould,
Put in hove nearing to make, some hell
It's so compor toull crack
Said you want you that there
Hould me the sweet not good
Sighter up and more me of a ream at the done


After 5000 iterations the model seems to have the line structure down pretty well. However, since the model also has to learn spelling and grammatical syntax, there are a lot of misspelled words, and the text often doesn't make sense.

After 40,000 iterations the model seems to have converged, and we can look at an example of some of the lyrics it generates:

Tied right now
I got life now he never thought I got by the all
Going up like a house four boy
Nothing his thing out of hands
No one with the danger in the world
I love my black fire as I know
But the short knees just around me
Fun the heart couldnes fall to back
I see a rest of my wild missing far
When I was missing to wait
And if I think
It's a real tame
I say I belong is every long night
Maybe lovin' you
I'm scared caught here bowe
the capnt in a never fishing from dark
I drark good justed like beer
Oat on a revivin', four freezuss leavin go
I found the whiskey up wheels

You are in your name
I guess it could do come for the music of love
I looked like my mind turned my days hour
Love on my san

Honey love me one time
Whiskey, everybody gone forgotta find the harsless.
And the fall's were over my life is a star,
But when you'll start the world in the window
Say I know this is not to know that you have to go
But your days, your raintast miles
I don't like trouble I'm boued to tell you a puttred eyes he used to hold me to a
While anymore are going to wear it
What it's a right, but I will burn around your mind
And the childrens daddy played by the end.

Singing when it feels on over, I can surver
But you’ve been better than I found you
Before me looking for
Wondering I'd go alone

The leovet good wides.
I feel it 


Now the lyrics are starting to make sense and there are far fewer spelling mistakes. The remaining spelling and grammatical mistakes are likely a result of underfitting due to the insufficient number of parameters.

Now time to train with appropriate parameters!


